Logistic regression

Mon, Feb 17, 2025

Logistic regression

We’ve discussed linear regression quite a bit by now. We turn now to logistic regression, which has a lot in common with linear regression. In general, regression studies the relationship between two variables. If the relationship is strong enough, then we try to predict the possible values of one variable using the other as a predictor.

  • In linear regression, the variables are continuous, numerical variables.
  • In logistic regression, the output or response variable is discrete or even categorical. Typically, the prediction is made in terms of probability; if the relationship is strong enough, that might even be coerced into a classifier.

Some data for context

Let’s place this all in the context of a data-based problem, namely, the Big South Basketball data from last time, but updated with Massey Ratings:

team1_name score1 rating1 team2_name score2 rating2 rating_diff label
0 Presbyterian 69 1.991879 Radford 82 -3.289193 -5.281072 0
1 High_Point 84 10.878501 Winthrop 62 1.284302 9.594199 1
2 Presbyterian 61 -2.947140 Gardner_Webb 63 -3.289193 -0.342053 0
3 Longwood 77 0.591144 Radford 74 1.991879 -1.400736 1
4 UNC_Asheville 93 3.839321 SC_Upstate 92 -9.859515 13.698835 1

Note that rating information is included; our objective is to use this data to predict future possible outcomes.

Objective

Again, our objective is to use these ratings to predict future outcomes. For example, UNCA will play Radford this coming Thursday night. Looking at the table, we see that

  • UNCA has a rating of 3.839321 and
  • Radford has a rating of 1.284302

That translates to a predicted margin of victory of 2.55502 points. Of course, there’s randomness involved, so the real question is

What’s the probability that UNCA defeats Radford this coming Thursday?

The variables

Let’s take another look at those variables:

team1_name score1 rating1 team2_name score2 rating2 rating_diff label
48 UNC_Asheville 92 3.839321 SC_Upstate 85 -9.859515 13.698835 1
60 UNC_Asheville 76 0.591144 Longwood 85 3.839321 3.248177 0

I guess most of those make sense. A couple of them are of primary importance.

  • The rating_diff is simply the first team’s rating minus the second.
  • The label is an indicator telling us whether the first team won (1) or lost (0).

Thus, the question is: can we use the rating_diff to predict the label? Presumably, the larger the rating_diff, the more likely the label is to be 1.

Visualizing the data

Let’s draw a picture of the data with rating_diff on the horizontal axis and the label on the vertical.

Indeed, the data clusters somewhat toward the upper right and lower left: positive rating differences tend to go with wins and negative ones with losses. The data is also symmetric, since each game appears twice with the order of the teams swapped.
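
Here’s a minimal sketch of how such a picture might be produced in matplotlib, assuming the games data frame has already been read in (as we do in the Code section below):

import matplotlib.pyplot as plt

# Each point is one game: rating_diff on the horizontal axis and
# the 0/1 label on the vertical; alpha reveals overlapping points.
plt.plot(games.rating_diff, games.label, 'ok', alpha=0.4)
plt.xlabel('rating_diff')
plt.ylabel('label')
plt.show()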

General data

More generally, we’ve got a list of \(N\) data points, often presented as a data frame with \(N\) rows. In this case, one column would be the predictor and one column would be the target.

In our data,

  • there are 102 rows (two for each game),
  • the predictor is the rating_diff, and
  • the target is the label, which is a common generic name.

The logistic curve

In logistic regression we fit a specific curve to this type of data.

This curve is called a logistic curve or sigmoid. Its formula has the form \[ p(x) = \frac{1}{1+e^{-(ax+b)}}. \]
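
Here’s a quick sketch that plots one such curve; the particular values \(a=1\) and \(b=0\) are chosen purely for illustration, since the actual parameters come from the fit below:

import numpy as np
import matplotlib.pyplot as plt

# Illustrative parameter values only; the fit determines a and b from data.
a, b = 1, 0
x = np.linspace(-8, 8, 200)
plt.plot(x, 1/(1 + np.exp(-(a*x + b))))
plt.show()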

Properties of the sigmoid

The sigmoid has some nice properties implying that it’s a good cumulative probability distribution function:

  • \(0<p(x)<1\) for all real \(x\),
  • \(p(x)\) is strictly increasing (assuming \(a>0\); see the derivative computation after this list),
  • \(\displaystyle \lim_{x\to -\infty}p(x) = 0\) and \(\displaystyle \lim_{x\to\infty} p(x) = 1\).
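
The second property follows from a short computation: differentiating \(p(x) = \left(1+e^{-(ax+b)}\right)^{-1}\) gives \[ p'(x) = \frac{a\,e^{-(ax+b)}}{\left(1+e^{-(ax+b)}\right)^{2}} = a\,p(x)\bigl(1-p(x)\bigr), \] which is positive for every \(x\) as long as \(a>0\).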

In the context of our problem, if \(d\) denotes the rating difference between team 1 and team 2, then the probability that team 1 beats team 2 could be computed as \(p(d)\). I suppose the properties above should make some sense.

Fitting \(p(x)\)

Our data is certainly not linear, the sigmoid is not a linear combination of basis functions, and it’s not clear that least squares is a suitable metric for the error. On the other hand, \(p(x)\) is supposed to represent a probability, so it might make sense to account for this in our fit.

The quantity that we use to do so is called the likelihood, which is the following function of the data: \[ L = \prod_{k:y_k=1}p(x_k)\prod_{k:y_k=0}(1-p(x_k)). \] Now, if we choose our parameters \(a\) and \(b\) to maximize \(L\) (the principle of maximum likelihood), then we effectively maximize the probability that the data we’ve observed could actually happen.
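
To get a feel for \(L\), here’s a quick sketch that computes the raw likelihood for hypothetical parameter values \(a=0.2\) and \(b=0\), assuming the games data frame has been read in as in the Code section below:

import numpy as np

# Hypothetical parameter values, just for illustration;
# the real a comes from the optimization below.
a, b = 0.2, 0
prob = 1/(1 + np.exp(-(a*games.rating_diff + b)))
# Multiply p(x_k) over the wins and 1 - p(x_k) over the losses.
L = np.prod(np.where(games.label == 1, prob, 1 - prob))

As we’ll see below, the maximized log-likelihood is about \(-51.19\), so \(L\) itself is on the order of \(e^{-51}\); that tiny size is one practical reason to prefer working with the logarithm.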

Translating to log-likelihood

If we take the natural logarithm of both sides, we get \[\begin{aligned} \log(L) &= \sum_{k:y_k=1}\log(p(x_k)) + \sum_{k:y_k=0}\log(1-p(x_k)) \\ &= \sum_{k=1}^N (y_k \log(p(x_k)) + (1-y_k) \log(1-p(x_k))). \end{aligned}\] Note that we are summing over our data points \((x_k,y_k)\). Since each \(y_k\) is either \(0\) or \(1\), exactly one of the two terms survives for each \(k\), which is why the two expressions are equal. Each \(p(x_k)\) has the form \[ p(x_k) = \frac{1}{1+e^{-(ax_k+b)}}. \]

A simplifying assumption

If you take another look at our picture of the logistic curve, you’ll notice that it inherits a symmetry from the data: a rating difference of zero should correspond to a win probability of exactly \(1/2\). This forces \(b=0\), so that our sigmoid function is just a bit simpler: \[ p(x_k) = \frac{1}{1+e^{-ax_k}}. \] While this looks like a minor change, it has important implications. In particular, our likelihood is now a function of just one parameter \(a\). Thus, we can use single variable optimization techniques.
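
In fact, with \(b=0\) the curve has exactly the symmetry the data suggests, since \[ 1 - p(x) = 1 - \frac{1}{1+e^{-ax}} = \frac{e^{-ax}}{1+e^{-ax}} = \frac{1}{1+e^{ax}} = p(-x). \] That is, swapping the two teams flips the predicted probability, just as it flips the label in our data.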

Code

OK, let’s look at some code to implement all this. We’ll start with a look at the data:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.optimize import minimize

games = pd.read_csv('https://marksmath.org/data/big_south_games_with_ratings.csv')
games.head(3)
team1_name score1 rating1 team2_name score2 rating2 rating_diff label
0 Winthrop 95 1.284302 SC_Upstate 76 -9.859515 11.143817 1
1 Presbyterian 68 -3.289193 Longwood 60 0.591144 -3.880336 1
2 High_Point 76 10.878501 Radford 58 1.991879 8.886621 1
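
As a quick check, the shape of the data frame should confirm the 102 rows (two per game) and the eight columns listed above, assuming the same CSV:

games.shape  # expect (102, 8)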

The objective function

We now define \(p_a(x)\) and the log-loss function (the negative of the log-likelihood) in code as follows:

def p(x,a): 
    return 1/(1+np.exp(-a*x))
def log_loss(a):
    return -np.sum(games.apply(lambda r: (
        r.label*np.log(p(r.rating_diff, a)) + 
        (1-r.label)*np.log(1-p(r.rating_diff, a))
    ), axis=1))

Note that the two terms inside the sum correspond exactly to

\[ -\log(L) = -\sum_{k=1}^N (y_k \log(p(x_k)) + (1-y_k) \log(1-p(x_k))). \]

We pick up the minus sign since we’re going to use a minimizer.

The log-loss graph

Again, the negative of the log-likelihood is called the log-loss. Fitting our model is equivalent to minimizing the log-loss. Let’s investigate the graph of our log_loss:

A = np.linspace(0.001,1,200)
y = log_loss(A)
log_loss_graph = plt.plot(A,y)

Optimization

It sure looks like there’s a minimum around \(a=0.2\). Here’s all we need to do to find that more precisely:

minimize(log_loss, [0.2])
  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: 51.192148762248976
        x: [ 2.046e-01]
      nit: 2
      jac: [ 1.431e-06]
 hess_inv: [[ 1.751e-03]]
     nfev: 10
     njev: 5

This suggests that \(a\approx0.2046\) should minimize our log-loss.
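
Rather than reading the value off of the printout, we’d typically store the result and extract the optimal parameter programmatically, something like:

result = minimize(log_loss, [0.2])
a_opt = result.x[0]  # approximately 0.2046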

The graph

Here’s a pretty simple way to graph the data and optimal \(p(x)\):

X = np.linspace(-22,22,200)
Y = p(X,0.2046)
plt.plot(games.rating_diff, 
  games.label, 'ok', alpha=0.4)
pic = plt.plot(X,Y)

Application

This coming Thursday, for example, UNCA will play Radford. Let’s take a look at our last game against them:

games[
  (games.team1_name == "UNC_Asheville") & 
  (games.team2_name == "Radford")
]
team1_name score1 rating1 team2_name score2 rating2 rating_diff label
32 UNC_Asheville 72 3.839321 Radford 65 1.991879 1.847441 1

Well, we won by \(7\). The rating_diff is only \(1.847\), which translates to a win probability of about \(60\%\):

p(1.847441, 0.2046)
0.59338736234573
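
As a sanity check on the symmetry we built in, evaluating at the negated rating difference should give Radford the complementary probability, roughly \(1 - 0.5934 \approx 0.4066\):

p(-1.847441, 0.2046)  # expect about 0.4066, the complement of the value above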

Other libraries

It’s worth pointing out that NumPy/SciPy are fairly low level libraries in the Python ecosystem. There are other libraries that are built more specifically for statistics and/or machine learning that allow us to do these things and more with far less code.

statsmodels

Statsmodels is Python’s main statistics library, and the logistic model fits well within that context. Here’s how to fit a logistic function to model win probabilities based on rating differences in statsmodels. Note that we pass only the rating_diff column, with no constant term added, which matches our assumption that \(b=0\):

import statsmodels.api as sm
model = sm.Logit(games.label, games.rating_diff).fit()
Optimization terminated successfully.
         Current function value: 0.501884
         Iterations 6

The result

The output is an object with a number of methods and properties. Notably, the summary method allows us to interpret the result:

model.summary()
                           Logit Regression Results
==============================================================================
Dep. Variable:                  label   No. Observations:                  102
Model:                          Logit   Df Residuals:                      101
Method:                           MLE   Df Model:                            0
Date:                Mon, 17 Feb 2025   Pseudo R-squ.:                  0.2759
Time:                        12:25:26   Log-Likelihood:                -51.192
converged:                       True   LL-Null:                       -70.701
Covariance Type:            nonrobust   LLR p-value:                       nan
===============================================================================
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
rating_diff     0.2046      0.042      4.849      0.000       0.122       0.287
===============================================================================

Note the 0.2046 coef of rating_diff in the lower left. Note also that the reported Log-Likelihood of -51.192 matches, up to sign, the minimum log-loss value we found with SciPy above.
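
The fitted result can also produce predictions directly. For example, evaluating at our UNCA versus Radford rating difference should reproduce, approximately, the probability we computed by hand:

model.predict([1.847441])  # win probability at rating_diff = 1.847441; expect roughly 0.59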

Scikit Learn

Scikit Learn is a full-fledged machine learning library with a lot of great tools; we’ll use it in the next lab for sure. Here’s how to run the logistic regression for our working example with Scikit Learn:

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
X = games[['rating_diff']].to_numpy()
Y = games.label.to_numpy()
logreg.fit(X,Y)
LogisticRegression()

The coefficient

Scikit Learn doesn’t provide the same type of summary that statsmodels does. Here’s how to view the coefficient, though:

logreg.coef_
array([[0.2042248]])

Note that the result is just a touch off from our previous result of \(0.2046\). The reason is that Scikit Learn imposes a penalty by default to avoid overfitting. We’ll learn about that next week, when we talk about practical regression.

If you want to try to recover our simpler result, you can do so like this:

logreg = LogisticRegression(penalty=None)
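
The new estimator still needs to be refit before it can be used. Here’s a quick sketch, continuing with the X and Y arrays from above; the second column of predict_proba is the probability that the first team wins, which should land close to the 60% we computed earlier:

logreg.fit(X, Y)
logreg.coef_                        # should now essentially agree with our earlier 0.2046
logreg.predict_proba([[1.847441]])  # columns are [P(label=0), P(label=1)]; expect roughly 0.41 and 0.59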

More???

There’s a lot more to talk about in the context of regression! This includes:

  • Multiple regression (dealing with more inputs and outputs)
  • feature selection (figuring out which inputs are most important)
  • regularization (dealing with overfitting)
  • encoding categorical variables
  • classification (via rounding probabilities, for example)

We’ll do some of this when we discuss “practical regression” and have a lab next week. A lot of this stuff is applicable to other types of algorithms as well.

I guess we’ve got an exam to get through first.