We’ve discussed linear regression quite a bit by now. We turn now to logistic regression, which has a lot in common with linear regression. In general, regression studies the relationship between two variables. If the relationship is strong enough, then we try to predict the possible values of one variable using the other as a predictor.
In linear regression, both variables are continuous, numerical variables.
In logistic regression, the output or response variable is discrete, often simply categorical. Typically, the prediction is made in terms of a probability; if the relationship is strong enough, that probability might even be coerced into a classifier.
Some data for context
Let’s place this all in the context of a data-based problem, namely the Big South basketball data from last time, now updated with Massey Ratings:
| | team1_name | score1 | rating1 | team2_name | score2 | rating2 | rating_diff | label |
|---|---|---|---|---|---|---|---|---|
| 0 | Presbyterian | 69 | -3.289193 | Radford | 82 | 1.991879 | -5.281072 | 0 |
| 1 | High_Point | 84 | 10.878501 | Winthrop | 62 | 1.284302 | 9.594199 | 1 |
| 2 | Presbyterian | 61 | -3.289193 | Gardner_Webb | 63 | -2.947140 | -0.342053 | 0 |
| 3 | Longwood | 77 | 0.591144 | Radford | 74 | 1.991879 | -1.400736 | 1 |
| 4 | UNC_Asheville | 93 | 3.839321 | SC_Upstate | 92 | -9.859515 | 13.698835 | 1 |
Note that rating information is now included; our objective is to use this data to predict the outcomes of future games.
Objective
Again, our objective is to use these ratings to predict future outcomes. For example, UNCA will play Radford this coming Thursday night. Looking at the table, we see that

- UNCA has a rating of 3.839321 and
- Radford has a rating of 1.991879.

That translates to a predicted margin of victory of about 1.85 points. Of course, there’s randomness involved, so the real question is
What’s the probability that UNCA defeats Radford this coming Thursday?
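For the record, a Massey-style rating difference is itself the predicted margin, so the number above is just a subtraction of the two ratings read off the table:

```python
3.839321 - 1.991879   # UNCA's rating minus Radford's: predicted margin of about 1.85
```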
The variables
Let’s take another look at those variables:
| | team1_name | score1 | rating1 | team2_name | score2 | rating2 | rating_diff | label |
|---|---|---|---|---|---|---|---|---|
| 48 | UNC_Asheville | 92 | 3.839321 | SC_Upstate | 85 | -9.859515 | 13.698835 | 1 |
| 60 | UNC_Asheville | 76 | 3.839321 | Longwood | 85 | 0.591144 | 3.248177 | 0 |
I guess most of those make sense. A couple of them are of primary importance.
The rating_diff is simply the first team’s rating minus the second.
The label is an indicator telling us whether the first team won (1) or lost (0).
Thus, the question is: can we use the rating_diff to predict the label? Presumably, the larger the rating_diff, the more likely the label is to be 1.
Visualizing the data
Let’s draw a picture of the data with rating_diff on the horizontal axis and the label on the vertical.
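Here’s a minimal sketch of how such a picture might be produced, assuming the same CSV that’s loaded in the code section below:

```python
import pandas as pd
import matplotlib.pyplot as plt

# load the game data and plot label against rating_diff
games = pd.read_csv('https://marksmath.org/data/big_south_games_with_ratings.csv')
plt.scatter(games.rating_diff, games.label, alpha=0.5)
plt.xlabel('rating_diff')
plt.ylabel('label')
plt.show()
```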
Indeed, the points cluster toward the upper right and lower left: wins (label 1) mostly come with positive rating differences and losses (label 0) with negative ones. The data is also symmetric, since each game appears twice with the order of the teams reversed.
General data
More generally, we’ve got a list of \(N\) data points - often, presented as a data frame with \(N\) rows. In this case, one column would be the predictor and one column would be the target.
In our data,

- there are 102 rows (two for each game),
- the predictor is the rating_diff, and
- the target is the label, which is a common generic name.
The logistic curve
In logistic regression we fit a specific curve to this type of data.
This curve is called a logistic curve or sigmoid. Its formula has the form \[
p(x) = \frac{1}{1+e^{-(ax+b)}}.
\]
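To get a feel for the shape, here’s a quick sketch that plots the curve for illustrative parameter values (the choices \(a=1\) and \(b=0\) below are just for the picture, not fitted values):

```python
import numpy as np
import matplotlib.pyplot as plt

a, b = 1.0, 0.0                      # illustrative values only
x = np.linspace(-6, 6, 200)
p = 1 / (1 + np.exp(-(a * x + b)))   # the logistic curve
plt.plot(x, p)
plt.title('A logistic (sigmoid) curve')
plt.show()
```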
Properties of the sigmoid
The sigmoid has some nice properties suggesting that it behaves like a cumulative probability distribution function:

- \(0 < p(x) < 1\) for every \(x\),
- \(p(x) \to 0\) as \(x \to -\infty\) and \(p(x) \to 1\) as \(x \to \infty\), and
- \(p\) is increasing (assuming \(a > 0\)).
In the context of our problem, if \(d\) denotes the rating difference between team 1 and team 2, then the probability that team 1 beats team 2 could be computed as \(p(d)\). I suppose the properties above should make some sense.
Fitting \(p(x)\)
Our data is certainly not linear, the sigmoid is not a linear combination of basis functions, and it’s not clear that least squares is a suitable measure of the error. On the other hand, \(p(x)\) is supposed to represent a probability, so it makes sense to account for that in our fit.
The tool that we use for doing so is called maximum likelihood. The likelihood itself is the following function of the data and the parameters: \[
L = \prod_{k:y_k=1}p(x_k)\prod_{k:y_k=0}(1-p(x_k)).
\] Now, if we choose our parameters \(a\) and \(b\) to maximize \(L\), then we effectively maximize the probability, under the model, of the data that we actually observed.
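Note that \(L\) is a product of 102 numbers strictly between 0 and 1, so it is typically an extremely small positive number; that is one practical reason to work with its logarithm instead. A small sketch of the computation, assuming the data has been loaded into the DataFrame games as in the code section below:

```python
import numpy as np

x = games.rating_diff.to_numpy()   # the x_k
y = games.label.to_numpy()         # the y_k

def likelihood(a, b):
    p = 1 / (1 + np.exp(-(a * x + b)))
    # take p(x_k) when y_k = 1 and 1 - p(x_k) when y_k = 0, then multiply
    return np.prod(np.where(y == 1, p, 1 - p))

likelihood(0.2, 0.0)   # an extremely small positive number
```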
Translating to log-likelihood
If we take the natural logarithm of both sides, we get \[\begin{aligned}
\log(L) &= \sum_{k:y_k=1}\log(p(x_k)) + \sum_{k:y_k=0}\log(1-p(x_k)) \\
&= \sum_{k=1}^N (y_k \log(p(x_k)) + (1-y_k) \log(1-p(x_k))).
\end{aligned}\] Note that we are summing over our data points \((x_k,y_k)\). Each \(y_k\) is either \(0\) or \(1\), so in each term of the combined sum exactly one of the two pieces survives; that is why the two expressions are equal. Each \(p(x_k)\) has the form \[
p(x_k) = \frac{1}{1+e^{-(ax_k+b)}}.
\]
A simplifying assumption
If you take another look at our picture of the logistic curve, you’ll notice that it inherits a symmetry from the data: since each game appears twice with the teams in the opposite order, the fitted curve should satisfy \(p(-x) = 1 - p(x)\). That symmetry implies that \(b=0\), so our sigmoid function is just a bit simpler: \[
p(x_k) = \frac{1}{1+e^{-ax_k}}.
\] While this looks like a minor change, it has important implications. In particular, our maximum likelihood is now a function of just one parameter \(a\). Thus, we can use single variable optimization techniques.
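To see that symmetry explicitly, note that with \(b=0\), \[
1 - p(x) = 1 - \frac{1}{1+e^{-ax}} = \frac{e^{-ax}}{1+e^{-ax}} = \frac{1}{1+e^{ax}} = p(-x),
\] which matches the fact that swapping the two teams flips both the sign of the rating difference and the label.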
Code
OK, let’s look at some code to implement all this. We’ll start with a look at the data:
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import minimize

games = pd.read_csv('https://marksmath.org/data/big_south_games_with_ratings.csv')
games.head(3)
```
| | team1_name | score1 | rating1 | team2_name | score2 | rating2 | rating_diff | label |
|---|---|---|---|---|---|---|---|---|
| 0 | Winthrop | 95 | 1.284302 | SC_Upstate | 76 | -9.859515 | 11.143817 | 1 |
| 1 | Presbyterian | 68 | -3.289193 | Longwood | 60 | 0.591144 | -3.880336 | 1 |
| 2 | High_Point | 76 | 10.878501 | Radford | 58 | 1.991879 | 8.886621 | 1 |
The objective function
We now define \(p_a(x)\) and the log-loss function in code as follows:
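Here’s a minimal sketch of those definitions, written to be consistent with how they’re used below (log_loss accepts either a single value of \(a\) or an array of candidate values):

```python
def p(x, a):
    # one-parameter sigmoid (b = 0)
    return 1 / (1 + np.exp(-a * x))

x_data = games.rating_diff.to_numpy()
y_data = games.label.to_numpy()

def log_loss(a):
    # negative log-likelihood of the data as a function of a
    a = np.atleast_1d(a)
    probs = p(x_data, a[:, None])                 # shape (len(a), number of games)
    ll = y_data * np.log(probs) + (1 - y_data) * np.log(1 - probs)
    return -ll.sum(axis=1).squeeze()
```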
We pick up the minus sign since we’re going to use a minimizer.
The log-loss graph
Again, the negative of the log-likelihood is called the log-loss. Fitting our model is equivalent to minimizing the log-loss. Let’s investigate the graph of our log_loss:
```python
A = np.linspace(0.001, 1, 200)
y = log_loss(A)
log_loss_graph = plt.plot(A, y)
```
Optimization
It sure looks like there’s a minimum around \(a=0.2\). Here’s all we need to do to find that more precisely:
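Here’s a minimal sketch using the minimize function imported above; the fitted value lands at about 0.2046, matching the statsmodels coefficient reported below:

```python
result = minimize(log_loss, 0.1)   # 0.1 is just an initial guess for a
a_fit = result.x[0]
a_fit                              # approximately 0.2046
```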
Well, we won by \(7\). The rating_diff is only \(1.847\), which translates to a win probability of about \(60\%\):
```python
p(1.847441, 0.2046)
```
0.59338736234573
Other libraries
It’s worth pointing out that NumPy and SciPy are fairly low-level libraries in the Python ecosystem. There are other libraries, built more specifically for statistics and/or machine learning, that allow us to do these things and more with far less code.
statsmodels
Statsmodels is Python’s main statistics library, and the logistic model fits well within that context. Here’s how to fit a logistic function to model win probabilities based on rating differences in statsmodels:
```python
import statsmodels.api as sm

model = sm.Logit(games.label, games.rating_diff).fit()
```
Optimization terminated successfully.
Current function value: 0.501884
Iterations 6
The result
The output is an object with a number of properties. Notably, the summary method allows us to interpret the result:
```python
model.summary()
```
Logit Regression Results

| | | | |
|---|---|---|---|
| Dep. Variable: | label | No. Observations: | 102 |
| Model: | Logit | Df Residuals: | 101 |
| Method: | MLE | Df Model: | 0 |
| Date: | Mon, 17 Feb 2025 | Pseudo R-squ.: | 0.2759 |
| Time: | 12:25:26 | Log-Likelihood: | -51.192 |
| converged: | True | LL-Null: | -70.701 |
| Covariance Type: | nonrobust | LLR p-value: | nan |

| | coef | std err | z | P>\|z\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| rating_diff | 0.2046 | 0.042 | 4.849 | 0.000 | 0.122 | 0.287 |
Note the rating_diff coefficient of 0.2046 in the second table of the summary.
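To tie this back to the original question, the fitted results object can also compute probabilities directly via its predict method; a quick sketch for the UNCA-Radford rating difference (the exact call shape may vary a bit with the statsmodels version):

```python
model.predict([1.847441])   # about 0.59, matching the earlier computation
```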
Scikit Learn
Scikit Learn is a full-fledged machine learning library with a lot of great tools; we’ll use it in the next lab for sure. Here’s how to run the logistic regression for our working example with Scikit Learn:
```python
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
X = games[['rating_diff']].to_numpy()
Y = games.label.to_numpy()
logreg.fit(X, Y)
```
LogisticRegression()
The coefficient
Scikit Learn doesn’t provide the same type of summary that statsmodels does. Here’s how to view the coefficient, though:
```python
logreg.coef_
```
array([[0.2042248]])
Note that the result is just a touch off from our previous result of \(0.2046\). The reason is that Scikit Learn imposes a penalty by default to avoid overfitting. We’ll learn about that next week, when we talk about practical regression.
If you want to try to recover our simpler result, you can do so like this:
```python
logreg = LogisticRegression(penalty=None)
```
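Note that this creates a fresh, unfitted model, so it needs to be fit again before use. Putting it together with predict_proba, which returns the probabilities of the 0 and 1 classes for each input row, we can answer the original question one more time:

```python
logreg = LogisticRegression(penalty=None)
logreg.fit(X, Y)
logreg.predict_proba([[1.847441]])   # second column is P(label = 1), about 0.59
```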
More???
There’s a lot more to talk about in the context of regression! This includes:
- multiple regression (dealing with more inputs and outputs),
- feature selection (figuring out which inputs are most important),
- regularization (dealing with overfitting),
- encoding categorical variables, and
- classification (via rounding probabilities, for example).
We’ll do some of this when we discuss “practical regression” and have a lab next week. A lot of this stuff is applicable to other types of algorithms as well.