Last time, we learned about linear regression, which provides a model to predict future values. Today, we’ll discuss logistic regression, which provides a model to predict future probabilities.
NCAA Basketball data
Our first example is going to involve NCAA Basketball tournaments. For each game in a tournament, the idea is to predict the winner in a probabilistic sense. That is, we want to say something like “The probability that UNC defeats Duke is 0.7”.
You might remember the next slide, which shows my predictions for the 2022 tournament, from our first day of class.
Massey ratings
We’re going to base these predictions, in part, on the so-called “Massey” ratings of the teams.
I’ve got a CSV file on my web space that lists every NCAA tournament team for every year from 2010 to 2023 together with that team’s Massey rating at the end of that season. Here are the top 8 teams from that data by Massey rating:
|      | season | team_name | massey_rating |
|------|--------|-----------|---------------|
| 361  | 2015   | Kentucky  | 29.831088     |
| 618  | 2019   | Duke      | 28.759867     |
| 624  | 2019   | Gonzaga   | 28.646682     |
| 15   | 2010   | Kansas    | 27.326564     |
| 696  | 2021   | Gonzaga   | 27.291249     |
| 762  | 2022   | Gonzaga   | 27.197263     |
| 671  | 2019   | Virginia  | 27.038487     |
| 7    | 2010   | Duke      | 26.674557     |
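Here’s a sketch of how a table like this might be produced with pandas; the URL is a placeholder, since the text doesn’t spell out the file’s exact location:

```python
import pandas as pd

# Placeholder URL; the exact location of the ratings CSV isn't given above.
ratings = pd.read_csv('https://www.marksmath.org/data/PLACEHOLDER.csv')

# The top 8 tournament teams by Massey rating across all seasons.
ratings.sort_values('massey_rating', ascending=False).head(8)
```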
The Massey rating is constructed using a linear model to predict the score difference if two teams were to play in the near future. In 2019, for example, the Massey ratings predicted that Duke would defeat Virginia by about 1.7 points.
UNCA in the tournament
UNCA has been in the tournament four times since 2010. Their negative Massey rating in 2023 indicated that they would be expected to lose to an average team by about a point.
|     | season | team_name     | massey_rating |
|-----|--------|---------------|---------------|
| 120 | 2011   | UNC Asheville | 0.018425      |
| 190 | 2012   | UNC Asheville | 3.278971      |
| 458 | 2016   | UNC Asheville | 2.004859      |
| 872 | 2023   | UNC Asheville | -1.107387     |
Paired tourney games
I’ve also got a CSV file listing all NCAA tournament games from 2010 to 2023. The table below shows the last six rows; for each row, we see:

- the season,
- each team’s name and Massey rating,
- the massey_diff, which is the first team’s Massey rating minus the second’s,
- the seed_diff, which is the first team’s seed minus the second’s (seeds run 1-16), and
- a Boolean label indicating whether team 1 won the game or not.
|      | season | team1_name   | team1_massey_rating | team2_name   | team2_massey_rating | massey_diff | seed_diff | label |
|------|--------|--------------|---------------------|--------------|---------------------|-------------|-----------|-------|
| 1733 | 2023   | San Diego St | 16.311344           | Connecticut  | 21.225009           | -4.913665   | 1         | 0     |
| 1732 | 2023   | Connecticut  | 21.225009           | San Diego St | 16.311344           | 4.913665    | -1        | 1     |
| 1731 | 2023   | FL Atlantic  | 13.820686           | San Diego St | 16.311344           | -2.490658   | 4         | 0     |
| 1730 | 2023   | San Diego St | 16.311344           | FL Atlantic  | 13.820686           | 2.490658    | -4        | 1     |
| 1729 | 2023   | Miami FL     | 12.780767           | Connecticut  | 21.225009           | -8.444242   | 1         | 0     |
| 1728 | 2023   | Connecticut  | 21.225009           | Miami FL     | 12.780767           | 8.444242    | -1        | 1     |
A little more
Here are a few more observations on the data:
Each game appears twice: once with one team listed first and once with the other listed first. Note that the labels switch; one row represents team 1 as the winner and the other represents team 1 as the loser.
If you’re a basketball fan, you might notice that these six rows represent the outcome of the 2023 Final Four, when UConn defeated San Diego State for the NCAA championship.
Visualization
Let’s plot this data with the massey_diff on the horizontal axis and the label on the vertical:
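A minimal sketch that produces this kind of plot, assuming matplotlib is available (the CSV is the same one we load formally in the code section below):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the paired tournament games and plot the label against massey_diff.
paired_games = pd.read_csv('https://www.marksmath.org/data/paired_tourney_games.csv')
plt.scatter(paired_games['massey_diff'], paired_games['label'], alpha=0.2)
plt.xlabel('massey_diff')
plt.ylabel('label')
plt.show()
```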
Note that the symmetry arises from the two ways of looking at the games: one copy of each game labeled zero along the bottom and one labeled one along the top.
Fit
Now, we’re going to “fit” that data with a certain type of curve:
Note that the curve looks just like a cumulative distribution function. Thus, if Team 1 has Massey rating \(R_1\), Team 2 has Massey rating \(R_2\), and the curve is the graph of the function \(y=f(x)\), then we ought to be able to compute the probability that Team 1 defeats Team 2 as \[f(R_1-R_2).\] That curve is called a logistic curve.
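One way to produce such a fit, sketched here with SciPy’s curve_fit and continuing from the plotting code above. This is a least-squares fit rather than the maximum-likelihood fit that logistic regression (below) performs, but the curves come out similar:

```python
import numpy as np
from scipy.optimize import curve_fit

# The logistic curve in the form introduced in the next section.
def logistic(x, a, b):
    return 1/(1 + np.exp(-(a*x + b)))

# Least-squares fit of the curve to the (massey_diff, label) points.
(a, b), _ = curve_fit(logistic, paired_games['massey_diff'], paired_games['label'])

# Overlay the fitted curve on the scatter plot.
xs = np.linspace(-30, 30, 200)
plt.scatter(paired_games['massey_diff'], paired_games['label'], alpha=0.2)
plt.plot(xs, logistic(xs, a, b), color='red')
plt.show()
```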
Algebraic form
The algebraic form of the logistic curve is \[
\hat{p} = \frac{1}{1+e^{-(ax+b)}}.
\] While you don’t need to worry too much about this specific algebraic form, there are a few things worth knowing. In particular, the coefficients \(a\) and \(b\) turn up in regression analyses, and it is important to know how to interpret them.
It turns out that we can solve for \(ax+b\) in that formula to get \[
\log_e\left(\frac{\hat{p}}{1-\hat{p}}\right) = ax+b.
\]
- We use the symbol \(\hat{p}\) because it represents an estimated probability.
- The ratio \(\hat{p}/(1-\hat{p})\) is often called the odds.
- The expression \(\log\left(\hat{p}/(1-\hat{p})\right)\) is called the logit or log-odds; by the formula above, it equals \(ax+b\).
- These terms all show up in regression analyses.
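As a quick sanity check that the logistic function and the log-odds really are inverses of one another, here’s a small sketch:

```python
import numpy as np

def logistic(t):
    """The logistic function applied to the log-odds t = a*x + b."""
    return 1/(1 + np.exp(-t))

def logit(p):
    """The log-odds of a probability p."""
    return np.log(p/(1 - p))

logit(logistic(1.7))  # returns 1.7, since logit undoes logistic
```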
Running a logistic regression
Let’s take a look at some computer code to run logistic regression. We start by grabbing the paired game data and displaying the last six rows:
```python
import pandas as pd

paired_games = pd.read_csv('https://www.marksmath.org/data/paired_tourney_games.csv')
paired_games.sort_index(ascending=False).head(6)
```
|      | season | team1_name   | team1_massey_rating | team2_name   | team2_massey_rating | massey_diff | seed_diff | label |
|------|--------|--------------|---------------------|--------------|---------------------|-------------|-----------|-------|
| 1733 | 2023   | San Diego St | 16.311344           | Connecticut  | 21.225009           | -4.913665   | 1         | 0     |
| 1732 | 2023   | Connecticut  | 21.225009           | San Diego St | 16.311344           | 4.913665    | -1        | 1     |
| 1731 | 2023   | FL Atlantic  | 13.820686           | San Diego St | 16.311344           | -2.490658   | 4         | 0     |
| 1730 | 2023   | San Diego St | 16.311344           | FL Atlantic  | 13.820686           | 2.490658    | -4        | 1     |
| 1729 | 2023   | Miami FL     | 12.780767           | Connecticut  | 21.225009           | -8.444242   | 1         | 0     |
| 1728 | 2023   | Connecticut  | 21.225009           | Miami FL     | 12.780767           | 8.444242    | -1        | 1     |
Train and fit
We now set up and fit a logistic regression model of the paired_games data using Python’s statsmodels library. That process looks like so:
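A minimal sketch of that setup, assuming the paired_games frame from above; the variable names follow the discussion below, though the original code may have differed slightly:

```python
import statsmodels.api as sm

# Use the paired-games data for training; hence the name `train`.
train = paired_games

# Predictor, with an added constant column for the intercept, and response.
X = sm.add_constant(train['massey_diff'])
y = train['label']

# Set up and fit the logistic regression model.
model = sm.Logit(y, X).fit()
```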
```
Optimization terminated successfully.
         Current function value: 0.572040
         Iterations 6
```
This process is often called training the model, which is why we use the variable name train.
- The \(x\) variables are typically associated with the predictors, and
- the \(y\) variable is typically associated with the response, the variable we want to predict.

The hope is that, if our model fits the known data well, then it might also work with data we encounter in the future.
Examining the summary
Of course, the big question is: how do we apply the result of the regression to make a prediction? Well, the model we built has a summary method that we can use to display some information:
```python
model.summary().tables[1]
```
|             | coef      | std err | z        | P>\|z\| | [0.025 | 0.975] |
|-------------|-----------|---------|----------|---------|--------|--------|
| const       | 1.104e-18 | 0.054   | 2.03e-17 | 1.000   | -0.106 | 0.106  |
| massey_diff | 0.1079    | 0.006   | 16.712   | 0.000   | 0.095  | 0.121  |
There’s a fair amount going on here in terms of inference. The middle three columns allow you to run hypothesis tests to determine if there’s really a relationship between the regression formula and the data; the last two determine 95% confidence intervals for the coefficients.
The most important items for making predictions are the coefficients in the first column labeled coef.
Using the summary for prediction
Let’s focus now on the most important part for prediction:
In this output, massey_diff=0.1079 and const=1.104e-18 refer to the coefficients of the massey_diff variable and the constant term, which is effectively zero. Thus, we have the following formula for the log-odds:
\[\begin{aligned}
O = \log_e\left(\frac{\hat{p}}{1-\hat{p}}\right) &= 0.1079\times\mathtt{massey\_diff}+1.104\times10^{-18} \\
&= 0.1079\times\mathtt{massey\_diff}
\end{aligned}\]
From there, we can get the probabilistic prediction:
\[
\hat{p} = \frac{1}{1+e^{-O}}.
\]
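To see the formula in action, here’s a small sketch of a prediction function built from the fitted coefficient; the helper name win_probability is ours, not from the original code:

```python
import numpy as np

def win_probability(massey_diff, coef=0.1079):
    """Probability that team 1 wins; the constant is effectively zero."""
    O = coef * massey_diff  # the log-odds
    return 1/(1 + np.exp(-O))

# UConn vs San Diego St in the 2023 final had massey_diff = 4.913665.
win_probability(4.913665)  # roughly 0.63
```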
Example
Last spring, I used something like this data through 2023 to help me with predictions for the 2024 tournament, when UConn defeated Purdue for their second straight championship. The semi-finals of that tournament featured, among other matchups, 1 seed Purdue against 11 seed NC State. The massey_diff for that game was \(13.2031\), so the log-odds is \[
O = 0.1079\times13.2031 \approx 1.4246.
\]
We still compute the probability via \[
\hat{p} = \frac{1}{1+e^{-O}} = \frac{1}{1+e^{-1.4246}} \approx 0.806.
\]
Multiple predictors in the basketball data
The basketball data, for example, contains not just a massey_diff variable but also the so-called seed_diff, which is the first team’s seed minus the second’s (seeds run 1-16 within each region). We can look back at the tournament slide to see what this means.
Of course, there tends to be a (negative) correlation between seed and performance, so we might expect that using both variables could improve the model.
In the output of a logistic regression with multiple predictors, each coefficient appears as its own row in the coefficient table.
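A sketch of how such a regression might be run, reusing the setup from before (the names X2 and model2 are ours):

```python
import statsmodels.api as sm

# Two predictors this time: massey_diff and seed_diff.
X2 = sm.add_constant(paired_games[['massey_diff', 'seed_diff']])
model2 = sm.Logit(paired_games['label'], X2).fit()
model2.summary().tables[1]
```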
Example
Let’s suppose that a logistic regression analysis taking massey_diff and seed_diff into account yields the following:
|             | coef       | std err | z         | P>\|z\| | [0.025 | 0.975] |
|-------------|------------|---------|-----------|---------|--------|--------|
| const       | -1.677e-16 | 0.054   | -3.09e-15 | 1.000   | -0.106 | 0.106  |
| massey_diff | 0.1106     | 0.013   | 7.506     | 0.000   | 0.074  | 0.127  |
| seed_diff   | -0.0942    | 0.018   | -0.614    | 0.539   | -0.047 | 0.025  |
This indicates that the coefficient of massey_diff should be \(0.1106\) and that the coefficient of seed_diff should be \(-0.0942\).
It also indicates that the constant is effectively zero.
Application
Focusing again on the coefficients
- \(\mathtt{massey\_diff} = 0.1106\) and
- \(\mathtt{seed\_diff} = -0.0942\),
let’s again return to the 1 seed Purdue vs 11 seed NC State example.
The massey_diff is still \(13.2031\) and the seed_diff is \(1 - 11 = -10\). That yields the following value for the log-odds:
\[
O = 0.1106\times13.2031 - 0.0942\times(-10) = 2.40226.
\]
We then compute the probability: \[
\hat{p} = \frac{1}{1+e^{-2.40226}} \approx 0.916999.
\]
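And a quick numerical check of that computation:

```python
import numpy as np

# Log-odds for 1 seed Purdue vs 11 seed NC State, using both coefficients.
O = 0.1106*13.2031 - 0.0942*(-10)
1/(1 + np.exp(-O))  # approximately 0.917
```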