Logistic Models

Mon, Dec 02, 2024

Logistic Regression

Last time, we learned about linear regression, which provides a model to predict future values. Today, we’ll discuss logistic regression, which provides a model to predict future probabilities.

NCAA Basketball data

Our first example is going to involve NCAA Basketball tournaments. For each game in a tournament, the idea is to predict the winner in a probabilistic sense. That is, we want to say something like “The probability that UNC defeats Duke is 0.7”.

You might remember the next slide, which shows my predictions for the 2022 tournament, from our first day of class.

Massey ratings

We’re going to base these predictions, in part, on the so-called “Massey” ratings of the teams.

I’ve got a CSV file on my web space that lists every NCAA tournament team for every year from 2010 to 2023 together with that team’s Massey rating at the end of that season. Here are the top 8 teams from that data by Massey rating:

season team_name massey_rating
361 2015 Kentucky 29.831088
618 2019 Duke 28.759867
624 2019 Gonzaga 28.646682
15 2010 Kansas 27.326564
696 2021 Gonzaga 27.291249
762 2022 Gonzaga 27.197263
671 2019 Virginia 27.038487
7 2010 Duke 26.674557

The Massey rating is constructed using a linear model to predict the score difference if two teams were to play in the near future. In 2019, for example, the Massey ratings predicted that Duke would defeat Virginia by about 1.7 points.

UNCA in the tournament

UNCA has been in the tournament four times since 2010. Their negative Massey rating in 2023 indicated that they would be expected to lose to an average team by about a point.

season team_name massey_rating
120 2011 UNC Asheville 0.018425
190 2012 UNC Asheville 3.278971
458 2016 UNC Asheville 2.004859
872 2023 UNC Asheville -1.107387

Paired tourney games

I’ve also got a CSV file listing all NCAA tournament games from 2010 to 2023. The table below shows the last six rows and, for each row, we see

  • the season,
  • each team name and Massey rating,
  • the massey_diff, which is team 1's Massey rating minus team 2's,
  • the seed_diff, which is the difference between the teams' seeds (1-16), and
  • a Boolean label indicating whether team 1 won the game.
season team1_name team1_massey_rating team2_name team2_massey_rating massey_diff seed_diff label
1733 2023 San Diego St 16.311344 Connecticut 21.225009 -4.913665 1 0
1732 2023 Connecticut 21.225009 San Diego St 16.311344 4.913665 -1 1
1731 2023 FL Atlantic 13.820686 San Diego St 16.311344 -2.490658 4 0
1730 2023 San Diego St 16.311344 FL Atlantic 13.820686 2.490658 -4 1
1729 2023 Miami FL 12.780767 Connecticut 21.225009 -8.444242 1 0
1728 2023 Connecticut 21.225009 Miami FL 12.780767 8.444242 -1 1

A little more

Here are a few more observations on the data:

  • Each game appears twice - once with one team listed first and once with the other listed first. Note that the labels switch accordingly: one row represents team 1 as the winner and the other represents team 1 as the loser.
  • If you’re a basketball fan, you might notice that these six rows represent the outcome of the final four from 2023 when UConn defeated San Diego State for the NCAA championship.

Visualization

Let’s plot this data with the massey_diff on the horizontal axis and the label on the vertical:

Note that the symmetry arises from the two ways of looking at the games - one labeled zero on the bottom and one labeled one on the top.
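
If you'd like to reproduce a plot along these lines yourself, here's a minimal matplotlib sketch; it reads the same paired game file that we load with pandas later in these notes.

import pandas as pd
import matplotlib.pyplot as plt

# Read the paired tournament games.
games = pd.read_csv('https://www.marksmath.org/data/paired_tourney_games.csv')

# Plot the 0/1 labels against the Massey rating differences.
plt.scatter(games['massey_diff'], games['label'], alpha=0.2)
plt.xlabel('massey_diff')
plt.ylabel('label')
plt.show()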

Fit

Now, we’re going to “fit” that data with a certain type of curve:

Note that the curve looks just like a cumulative distribution function. Thus, if Team 1 has Massey rating \(R_1\), Team 2 has Massey rating \(R_2\), and the curve is the graph of the function \(y=f(x)\), then we ought to be able to compute the probability that Team 1 defeats Team 2 as \[f(R_1-R_2).\] That curve is called a logistic curve.

Algebraic form

The algebraic form of the logistic curve is \[ \hat{p} = \frac{1}{1+e^{-(ax+b)}}. \] While you don’t need to worry too much about this specific algebraic form, there are a few things worth knowing. In particular, the coefficients \(a\) and \(b\) turn up in regression analyses and it is important to know how to interpret them.

It turns out that we can solve for \(ax+b\) in that formula to get \[ \log_e\left(\frac{\hat{p}}{1-\hat{p}}\right) = ax+b. \]

  • We use the symbol \(\hat{p}\) because it represents a probability.
  • The ratio \(\hat{p}/(1-\hat{p})\) is often called the odds.
  • The expression \(\log\left(\hat{p}/(1-\hat{p})\right)\) is called the logit or log-odds; by the formula above, it equals \(ax+b\).

These terms all show up in regression analyses.
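
To make the relationship concrete, here's a quick Python check (not part of the original notes) that the logistic function and the log-odds are inverses of one another:

import numpy as np

def logistic(t):
    # the logistic curve evaluated at the linear term t = a*x + b
    return 1 / (1 + np.exp(-t))

def logit(p):
    # the log-odds of a probability p
    return np.log(p / (1 - p))

p_hat = logistic(1.7)
print(p_hat)         # about 0.845
print(logit(p_hat))  # recovers 1.7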

Running a logistic regression

Let’s take a look at some computer code to run logistic regression. We start by grabbing the paired game data and displaying the last six rows:

import pandas as pd
# Grab the paired game data and display the last six rows (most recent games first).
paired_games = pd.read_csv('https://www.marksmath.org/data/paired_tourney_games.csv')
paired_games.sort_index(ascending=False).head(6)
season team1_name team1_massey_rating team2_name team2_massey_rating massey_diff seed_diff label
1733 2023 San Diego St 16.311344 Connecticut 21.225009 -4.913665 1 0
1732 2023 Connecticut 21.225009 San Diego St 16.311344 4.913665 -1 1
1731 2023 FL Atlantic 13.820686 San Diego St 16.311344 -2.490658 4 0
1730 2023 San Diego St 16.311344 FL Atlantic 13.820686 2.490658 -4 1
1729 2023 Miami FL 12.780767 Connecticut 21.225009 -8.444242 1 0
1728 2023 Connecticut 21.225009 Miami FL 12.780767 8.444242 -1 1

Train and fit

We now set up and fit a logistic regression model of the paired_games data using Python’s statsmodels library. That process looks like so:

import statsmodels.api as sm
train = paired_games                    # use all of the paired games for training
Xtrain = train[['massey_diff']]         # predictor: difference in Massey ratings
Xtrain = sm.add_constant(Xtrain)        # add an intercept (constant) column
ytrain = train[['label']]               # response: did team 1 win?
model = sm.Logit(ytrain, Xtrain).fit()  # fit the logistic regression
Optimization terminated successfully.
         Current function value: 0.572040
         Iterations 6
  • This process is often called training the model, which is why we use the variable name train.
  • The \(x\) variables are typically associated with the predictors, and
  • the \(y\) variable is typically associated with the response - the variable we want to predict.
  • The hope is that, if our model fits the known data well, then it might also work with data we encounter in the future.

Examining the summary

Of course, the big question is: how do we apply the result of the regression to make a prediction? Well, the model we built has a summary method that we can use to display some relevant information:

model.summary().tables[1]
coef std err z P>|z| [0.025 0.975]
const 1.104e-18 0.054 2.03e-17 1.000 -0.106 0.106
massey_diff 0.1079 0.006 16.712 0.000 0.095 0.121

There’s a fair amount going on here in terms of inference. The middle three columns allow you to run hypothesis tests to determine if there’s really a relationship between the regression formula and the data; the last two determine 95% confidence intervals for the coefficients.

The most important items for making predictions are the coefficients in the first column labeled coef.
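
If you'd rather grab those coefficients programmatically than read them off the table, the fitted model exposes them through its params attribute. A quick sketch:

# The fitted coefficients as a pandas Series indexed by variable name.
print(model.params)               # const is essentially 0; massey_diff is about 0.1079
a = model.params['massey_diff']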

Using the summary for prediction

Let’s focus now on the part of that output that is most important for prediction.

In this output, massey_diff=0.1079 and const=1.104e-18 refer to the coefficients of the massey_diff variable and the constant term, which is effectively zero. Thus, we have the following formula for the log-odds:

\[\begin{aligned} O = \log_e\left(\frac{\hat{p}}{1-\hat{p}}\right) &= 0.1079\times\mathtt{massey\_diff}+1.104\times10^{-18} \\ &= 0.1079\times\mathtt{massey\_diff} \end{aligned}\]

From there, we can get the probabilistic prediction:

\[ \hat{p} = \frac{1}{1+e^{-O}}. \]
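
Here's one way you might carry out this computation in code. The first lines build the log-odds directly from the coefficients above, using a hypothetical matchup with a Massey rating difference of 10 points; the last lines show that the fitted model's predict method gives essentially the same answer.

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Log-odds and probability computed directly from the fitted coefficients.
O = 0.1079 * 10.0             # a hypothetical massey_diff of 10 points
print(1 / (1 + np.exp(-O)))   # about 0.746

# Equivalently, ask the fitted model (from above) for the prediction.
new_game = sm.add_constant(pd.DataFrame({'massey_diff': [10.0]}), has_constant='add')
print(model.predict(new_game))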

Example

Last spring, I used something like this data through 2023 to help me with predictions for the 2024 tournament, when UConn defeated Purdue for their second straight championship. The semi-finals of that tournament featured

  • Purdue with a Massey rating of 25.138448 vs
  • NC State with a Massey rating of 11.935341 for
  • a difference of 13.2031.

Thus, our log-odds \(O\) satisfy \[ O = 0.1079\times13.2031 = 1.4246 \]

Thus, the predicted probability that Purdue defeats NC State would be \[ \frac{1}{1+e^{-1.4246}} \approx 0.806058. \]
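
As a quick sanity check, here is the same computation in Python:

import numpy as np
O = 0.1079 * 13.2031         # log-odds for Purdue vs NC State
print(O)                     # about 1.4246
print(1 / (1 + np.exp(-O)))  # about 0.806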

Multiple regression

Sometimes, you can improve your probability computations by using more predictor variables. In this case, our log-odds looks like

\[ O = \alpha_0 + \alpha_1 X_1 + \alpha_2 X_2 +\cdots + \alpha_n X_n. \]

We still compute the probability via \[ \hat{p} = \frac{1}{1+e^{-O}}. \]
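
In code, this is just a dot product followed by the logistic transform. A generic sketch (the names coeffs and predictors are purely illustrative):

import numpy as np

def predicted_probability(coeffs, predictors):
    # coeffs = [alpha_0, alpha_1, ..., alpha_n]; predictors = [X_1, ..., X_n]
    O = coeffs[0] + np.dot(coeffs[1:], predictors)  # the log-odds
    return 1 / (1 + np.exp(-O))

print(predicted_probability([0.0, 0.1079], [13.2031]))  # the single-predictor example again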

Multiple predictors in the basketball data

The basketball data, for example, contains not just a massey_diff variable but also the so-called seed_diff, which is the difference between the two teams' seeds (1-16) within their regions. We can look back at the tournament slide to see what this means.

Of course, there tends to be a (negative) correlation between seed and performance, so we might expect that using both of these variables could improve our predictions.

In the software output for a logistic regression with multiple predictors, each coefficient appears as its own row in the coefficient table.
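
Here's a sketch of how such a two-predictor fit might be set up with statsmodels, reusing the paired_games data from before; the coefficient table it produces should look something like the one in the example below, though the numbers aren't guaranteed to match exactly.

import statsmodels.api as sm

# Two predictors this time: the Massey rating difference and the seed difference.
X2 = sm.add_constant(paired_games[['massey_diff', 'seed_diff']])
y2 = paired_games[['label']]
model2 = sm.Logit(y2, X2).fit()
model2.summary().tables[1]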

Example

Let’s suppose that a logistic regression analysis taking massey_diff and seed_diff into account yields the following:

coef std err z P>|z| [0.025 0.975]
const -1.677e-16 0.054 -3.09e-15 1.000 -0.106 0.106
massey_diff 0.1106 0.013 7.506 0.000 0.074 0.127
seed_diff -0.0942 0.018 -0.614 0.539 -0.047 0.025

This indicates that the coefficient of massey_diff should be \(0.1106\) and that the coefficient of seed_diff should be \(-0.0942\).

It also indicates that the constant is effectively zero.

Application

Focusing again on the coefficients

  • \(\texttt{massey_diff} = 0.1106\) and
  • \(\texttt{seed_diff} = -0.0942\),

let’s again return to the 1 seed Purdue vs 11 seed NC State example.

The massey_diff is still \(13.2031\). That yields the following value for the log-odds:

\[ O = 0.1106\times13.2031 - 0.0942\times(-10) = 2.40226. \]

We then get a predicted probability of \[ 1/(1+e^{-2.40226}) \approx 0.916999. \]
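
And again, the same computation as a quick Python check:

import numpy as np
O = 0.1106 * 13.2031 - 0.0942 * (-10)  # log-odds using both predictors
print(O)                               # about 2.4023
print(1 / (1 + np.exp(-O)))            # about 0.917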

A look at the HW

Let’s take a quick look at the MyOpenMath HW and the associated Colab Notebook.