Last time, we learned about linear regression, which provides a model to predict future values. Today, we’ll discuss logistic regression, which provides a model to predict future probabilities.
NCAA Basketball data
Our first example is going to involve NCAA Basketball tournaments. For each game in a tournament, the idea is to predict the winner in a probabilistic sense. That is, we want to say something like “The probability that UNC defeats Duke is 0.7”.
You might remember the next slide, which shows my predictions for the 2022 tournament, from our first day of class.
Massey ratings
We’re going to base these predictions, in part, on the so-called “Massey” ratings of the teams.
I’ve got a CSV file on my web space that lists every NCAA tournament team for every year from 2010 to 2023 together with that team’s Massey rating at the end of that season. Here are the top 8 teams from that data by Massey rating:
|      | season | team_name | massey_rating |
|------|--------|-----------|---------------|
| 361  | 2015   | Kentucky  | 29.831088     |
| 618  | 2019   | Duke      | 28.759867     |
| 624  | 2019   | Gonzaga   | 28.646682     |
| 15   | 2010   | Kansas    | 27.326564     |
| 696  | 2021   | Gonzaga   | 27.291249     |
| 762  | 2022   | Gonzaga   | 27.197263     |
| 671  | 2019   | Virginia  | 27.038487     |
| 7    | 2010   | Duke      | 26.674557     |
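Here’s a sketch of how a table like this might be produced with pandas; the URL is a placeholder, since the text doesn’t spell out the file’s exact location:

```python
import pandas as pd

# Placeholder URL; the exact location of the ratings CSV isn't given above.
ratings = pd.read_csv('https://www.marksmath.org/data/PLACEHOLDER.csv')

# The top 8 tournament teams by Massey rating across all seasons.
ratings.sort_values('massey_rating', ascending=False).head(8)
```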
The Massey rating is constructed using a linear model to predict the score difference if two teams were to play in the near future. In 2019, for example, the Massey ratings predicted that Duke would defeat Virginia by about 1.7 points.
UNCA in the tournament
UNCA has been in the tournament four times since 2010. Their negative Massey rating in 2023 indicated that they would be expected to lose to an average team by about a point.
|     | season | team_name     | massey_rating |
|-----|--------|---------------|---------------|
| 120 | 2011   | UNC Asheville | 0.018425      |
| 190 | 2012   | UNC Asheville | 3.278971      |
| 458 | 2016   | UNC Asheville | 2.004859      |
| 872 | 2023   | UNC Asheville | -1.107387     |
Paired tourney games
I’ve also got a CSV file listing all NCAA tournament games from 2010 to 2023. The table below shows the last six rows; for each row, we see:

- the season,
- each team’s name and Massey rating,
- the massey_diff, which is the first team’s Massey rating minus the second’s,
- the seed_diff, which is the first team’s seed minus the second’s (seeds run 1-16), and
- a Boolean label indicating whether team 1 won the game or not.
|      | season | team1_name   | team1_massey_rating | team2_name   | team2_massey_rating | massey_diff | seed_diff | label |
|------|--------|--------------|---------------------|--------------|---------------------|-------------|-----------|-------|
| 1733 | 2023   | San Diego St | 16.311344           | Connecticut  | 21.225009           | -4.913665   | 1         | 0     |
| 1732 | 2023   | Connecticut  | 21.225009           | San Diego St | 16.311344           | 4.913665    | -1        | 1     |
| 1731 | 2023   | FL Atlantic  | 13.820686           | San Diego St | 16.311344           | -2.490658   | 4         | 0     |
| 1730 | 2023   | San Diego St | 16.311344           | FL Atlantic  | 13.820686           | 2.490658    | -4        | 1     |
| 1729 | 2023   | Miami FL     | 12.780767           | Connecticut  | 21.225009           | -8.444242   | 1         | 0     |
| 1728 | 2023   | Connecticut  | 21.225009           | Miami FL     | 12.780767           | 8.444242    | -1        | 1     |
A little more
Here are a few more observations on the data:
Each game appears twice: once with one team listed first and once with the other listed first. Note that the labels switch; one row represents team 1 as the winner and the other represents team 1 as the loser.
If you’re a basketball fan, you might notice that these six rows represent the outcome of the 2023 Final Four, when UConn defeated San Diego State for the NCAA championship.
Visualization
Let’s plot this data with the massey_diff on the horizontal axis and the label on the vertical:
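A minimal sketch that produces this kind of plot, assuming matplotlib is available (the CSV is the same one we load formally in the code section below):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the paired tournament games and plot the label against massey_diff.
paired_games = pd.read_csv('https://www.marksmath.org/data/paired_tourney_games.csv')
plt.scatter(paired_games['massey_diff'], paired_games['label'], alpha=0.2)
plt.xlabel('massey_diff')
plt.ylabel('label')
plt.show()
```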
Note that the symmetry arises from the two ways of looking at the games: one copy of each game labeled zero along the bottom and one labeled one along the top.
Fit
Now, we’re going to “fit” that data with a certain type of curve:
Note that the curve looks just like a cumulative distribution function. Thus, if Team 1 has Massey rating \(R_1\), Team 2 has Massey rating \(R_2\), and the curve is the graph of the function \(y=f(x)\), then we ought to be able to compute the probability that Team 1 defeats Team 2 as \[f(R_1-R_2).\] That curve is called a logistic curve.
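One way to produce such a fit, sketched here with SciPy’s curve_fit and continuing from the plotting code above. This is a least-squares fit rather than the maximum-likelihood fit that logistic regression (below) performs, but the curves come out similar:

```python
import numpy as np
from scipy.optimize import curve_fit

# The logistic curve in the form introduced in the next section.
def logistic(x, a, b):
    return 1/(1 + np.exp(-(a*x + b)))

# Least-squares fit of the curve to the (massey_diff, label) points.
(a, b), _ = curve_fit(logistic, paired_games['massey_diff'], paired_games['label'])

# Overlay the fitted curve on the scatter plot.
xs = np.linspace(-30, 30, 200)
plt.scatter(paired_games['massey_diff'], paired_games['label'], alpha=0.2)
plt.plot(xs, logistic(xs, a, b), color='red')
plt.show()
```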
Algebraic form
The algebraic form of the logistic curve is \[
\hat{p} = \frac{1}{1+e^{-(ax+b)}}.
\] While you don’t need to worry too much about this specific algebraic form, there are a few things worth knowing. In particular, the coefficients \(a\) and \(b\) turn up in regression analyses, and it is important to know how to interpret them.
It turns out that we can solve for \(ax+b\) in that formula to get \[
\log_e\left(\frac{\hat{p}}{1-\hat{p}}\right) = ax+b.
\]
- We use the symbol \(\hat{p}\) because it represents an estimated probability.
- The ratio \(\hat{p}/(1-\hat{p})\) is often called the odds.
- The expression \(\log\left(\hat{p}/(1-\hat{p})\right)\) is called the logit or log-odds; by the formula above, it equals \(ax+b\).
- These terms all show up in regression analyses.
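As a quick sanity check that the logistic function and the log-odds really are inverses of one another, here’s a small sketch:

```python
import numpy as np

def logistic(t):
    """The logistic function applied to the log-odds t = a*x + b."""
    return 1/(1 + np.exp(-t))

def logit(p):
    """The log-odds of a probability p."""
    return np.log(p/(1 - p))

logit(logistic(1.7))  # returns 1.7, since logit undoes logistic
```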
Running a logistic regression
Let’s take a look at some computer code to run logistic regression. We start by grabbing the paired game data and displaying the last six rows:
```python
import pandas as pd

paired_games = pd.read_csv('https://www.marksmath.org/data/paired_tourney_games.csv')
paired_games.sort_index(ascending=False).head(6)
```
|      | season | team1_name   | team1_massey_rating | team2_name   | team2_massey_rating | massey_diff | seed_diff | label |
|------|--------|--------------|---------------------|--------------|---------------------|-------------|-----------|-------|
| 1733 | 2023   | San Diego St | 16.311344           | Connecticut  | 21.225009           | -4.913665   | 1         | 0     |
| 1732 | 2023   | Connecticut  | 21.225009           | San Diego St | 16.311344           | 4.913665    | -1        | 1     |
| 1731 | 2023   | FL Atlantic  | 13.820686           | San Diego St | 16.311344           | -2.490658   | 4         | 0     |
| 1730 | 2023   | San Diego St | 16.311344           | FL Atlantic  | 13.820686           | 2.490658    | -4        | 1     |
| 1729 | 2023   | Miami FL     | 12.780767           | Connecticut  | 21.225009           | -8.444242   | 1         | 0     |
| 1728 | 2023   | Connecticut  | 21.225009           | Miami FL     | 12.780767           | 8.444242    | -1        | 1     |
Train and fit
We now set up and fit a logistic regression model of the paired_games data using Python’s statsmodels library. That process looks like so:
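A minimal sketch of that setup, assuming the paired_games frame from above; the variable names follow the discussion below, though the original code may have differed slightly:

```python
import statsmodels.api as sm

# Use the paired-games data for training; hence the name `train`.
train = paired_games

# Predictor, with an added constant column for the intercept, and response.
X = sm.add_constant(train['massey_diff'])
y = train['label']

# Set up and fit the logistic regression model.
model = sm.Logit(y, X).fit()
```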
```
Optimization terminated successfully.
         Current function value: 0.572040
         Iterations 6
```
This process is often called training the model, which is why we use the variable name train.
- The \(x\) variables are typically associated with the predictors, and
- the \(y\) variable is typically associated with the response, the variable we want to predict.

The hope is that, if our model fits the known data well, then it might also work with data we encounter in the future.
Examining the summary
Of course, the big question is: how do we apply the result of the regression to make a prediction? Well, the model we built has a summary method that we can use to display some information:
```python
model.summary().tables[1]
```
|             | coef      | std err | z        | P>\|z\| | [0.025 | 0.975] |
|-------------|-----------|---------|----------|---------|--------|--------|
| const       | 1.104e-18 | 0.054   | 2.03e-17 | 1.000   | -0.106 | 0.106  |
| massey_diff | 0.1079    | 0.006   | 16.712   | 0.000   | 0.095  | 0.121  |
There’s a fair amount going on here in terms of inference. The middle three columns allow you to run hypothesis tests to determine if there’s really a relationship between the regression formula and the data; the last two determine 95% confidence intervals for the coefficients.
The most important items for making predictions are the coefficients in the first column labeled coef.
Using the summary for prediction
Let’s focus now on the most important part for prediction:
In this output, massey_diff=0.1079 and const=1.104e-18 refer to the coefficients of the massey_diff variable and the constant term, which is effectively zero. Thus, we have the following formula for the log-odds:
\[\begin{aligned}
O = \log_e\left(\frac{\hat{p}}{1-\hat{p}}\right) &= 0.1079\times\mathtt{massey\_diff}+1.104\times10^{-18} \\
&= 0.1079\times\mathtt{massey\_diff}
\end{aligned}\]
From there, we can get the probabilistic prediction:
\[
\hat{p} = \frac{1}{1+e^{-O}}.
\]
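To see the formula in action, here’s a small sketch of a prediction function built from the fitted coefficient; the helper name win_probability is ours, not from the original code:

```python
import numpy as np

def win_probability(massey_diff, coef=0.1079):
    """Probability that team 1 wins; the constant is effectively zero."""
    O = coef * massey_diff  # the log-odds
    return 1/(1 + np.exp(-O))

# UConn vs San Diego St in the 2023 final had massey_diff = 4.913665.
win_probability(4.913665)  # roughly 0.63
```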
Example
Last spring, I used something like this data through 2023 to help me with predictions for the 2024 tournament, when UConn defeated Purdue for their second straight championship. The semi-finals of that tournament featured, among other matchups, 1 seed Purdue against 11 seed NC State. The massey_diff for that game was \(13.2031\), so the log-odds is \[
O = 0.1079\times13.2031 \approx 1.4246.
\]
We still compute the probability via \[
\hat{p} = \frac{1}{1+e^{-O}} = \frac{1}{1+e^{-1.4246}} \approx 0.806.
\]
Multiple predictors in the basketball data
The basketball data, for example, contains not just a massey_diff variable but also the so-called seed_diff, which is the first team’s seed minus the second’s (seeds run 1-16 within each region). We can look back at the tournament slide to see what this means.
Of course, there tends to be a (negative) correlation between seed and performance, so we might expect that using both variables could improve the model.
In the output of a logistic regression with multiple predictors, each coefficient appears as its own row in the coefficient table.
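A sketch of how such a regression might be run, reusing the setup from before (the names X2 and model2 are ours):

```python
import statsmodels.api as sm

# Two predictors this time: massey_diff and seed_diff.
X2 = sm.add_constant(paired_games[['massey_diff', 'seed_diff']])
model2 = sm.Logit(paired_games['label'], X2).fit()
model2.summary().tables[1]
```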
Example
Let’s suppose that a logistic regression analysis taking massey_diff and seed_diff into account yields the following:
|             | coef       | std err | z         | P>\|z\| | [0.025 | 0.975] |
|-------------|------------|---------|-----------|---------|--------|--------|
| const       | -1.677e-16 | 0.054   | -3.09e-15 | 1.000   | -0.106 | 0.106  |
| massey_diff | 0.1106     | 0.013   | 7.506     | 0.000   | 0.074  | 0.127  |
| seed_diff   | -0.0942    | 0.018   | -0.614    | 0.539   | -0.047 | 0.025  |
This indicates that the coefficient of massey_diff should be \(0.1106\) and that the coefficient of seed_diff should be \(-0.0942\).
It also indicates that the constant is effectively zero.
Application
Focusing again on the coefficients
- \(\mathtt{massey\_diff} = 0.1106\) and
- \(\mathtt{seed\_diff} = -0.0942\),
let’s again return to the 1 seed Purdue vs 11 seed NC State example.
The massey_diff is still \(13.2031\) and the seed_diff is \(1 - 11 = -10\). That yields the following value for the log-odds:
\[
O = 0.1106\times13.2031 - 0.0942\times(-10) = 2.40226.
\]
We then compute the probability: \[
\hat{p} = \frac{1}{1+e^{-2.40226}} \approx 0.916999.
\]
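And a quick numerical check of that computation:

```python
import numpy as np

# Log-odds for 1 seed Purdue vs 11 seed NC State, using both coefficients.
O = 0.1106*13.2031 - 0.0942*(-10)
1/(1 + np.exp(-O))  # approximately 0.917
```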