Programmatic linear regression with Kaggle data

In this notebook we’re going to introduce the important idea of linear regression with an eye towards application in sports analytics. The scatter plot below, for example, relates winning percentage (on the vertical axis) to points per game (on the horizontal axis) in college basketball. Each dot corresponds to some team from the men’s 2023/24 college basketball season. You can hover over the dots to find out who’s who and what’s what.

# Relating winning percentage to points per game in college basketball
# This code needs to be run *after* the rest of the code in the notebook
px.scatter(winloss, x = 'points_per_game', y = 'win_pct', 
    hover_data = ['TeamName'], width=900, height=600, trendline='ols')

It certainly looks like there’s a relationship between the two - generally, more points yields more wins. The diagonal line is called the regression line; it’s the line that fits the data best in the least-squares sense, which we’ll make precise below.

Imports

We begin by importing a few libraries that we’ll need.

# Data manipulation
import pandas as pd
import numpy as np

# A solver for linear algebra:
from scipy.linalg import solve

# Linear regression for Machine Learning
from sklearn.linear_model import LinearRegression

# Graphics
import plotly.express as px
import plotly.graph_objects as go

Kaggle data

I’ve participated for several years now in Kaggle’s data-based March Machine Learning competition. Kaggle provides a slew of data as part of that competition - a couple hundred megabytes’ worth spread over 32 files. Note that the data is free but subject to competition rules; you’ve got to sign up for the competition to get the full data set.

I’ve got a reasonably simple file that combines data from several of the Kaggle files into one. You can read it into your notebook like so:

games = pd.read_csv('https://www.marksmath.org/data/CombinedCondensedNCAAData2024.csv')
games
WTeamName WIDX WScore LTeamName LIDX LScore
0 Abilene Chr 0 64 Oklahoma St 221 59
1 Akron 2 81 S Dakota St 249 75
2 Alabama 3 105 Morehead St 183 73
3 Arizona 9 122 Morgan St 184 59
4 Ark Little Rock 11 71 Texas St 302 66
... ... ... ... ... ... ...
5602 Auburn 16 86 Florida 89 67
5603 Duquesne 76 57 VCU 332 51
5604 Illinois 120 93 Wisconsin 355 87
5605 UAB 309 85 Temple 296 69
5606 Yale 360 62 Brown 30 61

5607 rows × 6 columns

Note that size and simplicity were both considerations in the setup of this file. As such, I’ve read in data for only the Men’s 2024 Season, even though Kaggle has data for both men and women going back as far as 1985.

I guess most of the variable names are pretty self-explanatory; WIDX and LIDX are unique numeric identifiers for the winning and losing teams, respectively.

Now, rather than data by game, we’d like data by team indicating points per game and winning percentage. Here’s one way to set that up:

# Count each team's wins
wins = games.groupby('WTeamName').size().reset_index(name='wins')

# Count each team's losses
losses = games.groupby('LTeamName').size().reset_index(name='losses')

# Total the points each team scored in its wins and in its losses
WPoints = games.groupby('WTeamName')[['WScore']].sum().reset_index()
LPoints = games.groupby('LTeamName')[['LScore']].sum().reset_index()

# Put those all together, keeping only teams with at least one win and one loss
winloss = pd.concat([WPoints.set_index('WTeamName'), LPoints.set_index('LTeamName'), 
    wins.set_index('WTeamName'), losses.set_index('LTeamName')], axis=1, join='inner').reset_index()
winloss['points'] = winloss.WScore + winloss.LScore
winloss['games'] = winloss.wins + winloss.losses
winloss['points_per_game'] = winloss.points / winloss.games
winloss['win_pct'] = winloss.wins / winloss.games
winloss['TeamName'] = winloss['index']
winloss = winloss[['TeamName', 'points_per_game', 'win_pct']]

# Display
winloss
TeamName points_per_game win_pct
0 Abilene Chr 70.967742 0.451613
1 Air Force 66.161290 0.290323
2 Akron 72.343750 0.687500
3 Alabama 90.750000 0.656250
4 Alabama A&M 68.696970 0.333333
... ... ... ...
357 Wright St 86.166667 0.533333
358 Wyoming 71.566667 0.433333
359 Xavier 75.848485 0.484848
360 Yale 73.551724 0.689655
361 Youngstown St 78.428571 0.642857

362 rows × 3 columns

Stat 225 students don’t need to worry about all this. It’s worth pointing out, though, that the key step is illustrated right here:

games.groupby('WTeamName').size()
WTeamName
Abilene Chr      14
Air Force         9
Akron            22
Alabama          21
Alabama A&M      11
                 ..
Wright St        16
Wyoming          13
Xavier           16
Yale             20
Youngstown St    18
Length: 362, dtype: int64

Thus, the Pandas DataFrame object has a groupby method that allows us to group and aggregate data. In this case, we see how many games each team won as indicated by WTeamName. We can do the same for the losing teams and also the scores; we then combine all that into one data frame.
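If groupby is new to you, here’s a minimal, self-contained sketch on an invented DataFrame; the team names and scores below are made up purely for illustration:

# A made-up game log: each row is one game, keyed by its winner
toy = pd.DataFrame({
    'WTeamName': ['Ducks', 'Ducks', 'Geese', 'Swans', 'Geese', 'Ducks'],
    'WScore': [70, 65, 80, 72, 77, 68]
})

# size() counts the rows in each group: each team's number of wins
print(toy.groupby('WTeamName').size())

# sum() aggregates within each group: total points scored in wins
print(toy.groupby('WTeamName')[['WScore']].sum())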

We can now generate the scatter plot like so:

px.scatter(winloss, x = 'points_per_game', y = 'win_pct', 
    hover_data = ['TeamName'], width=900, height=600, trendline='ols')
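Incidentally, since trendline='ols' fits the line by ordinary least squares under the hood, Plotly can report the fitted coefficients directly. Here’s a quick sketch using px.get_trendline_results, which returns the underlying statsmodels fit for each trendline; we just need to assign the figure to a variable first:

# Build the same figure, keeping a reference to it
fig = px.scatter(winloss, x = 'points_per_game', y = 'win_pct', 
    hover_data = ['TeamName'], width=900, height=600, trendline='ols')

# One row per trendline; px_fit_results holds the statsmodels results,
# whose params are [intercept, slope]
results = px.get_trendline_results(fig)
print(results.px_fit_results.iloc[0].params)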

Linear regression

Recall that the diagonal line is called the regression line. Linear regression is a fundamental tool in machine learning that you’ll talk about more with Kedai later this semester. For now, it’s worth knowing that the regression line is the best fit for the data in the sense that it minimizes the sum of the squared errors. More precisely, if the data points are \(\{(x_i,y_i)\}_{i=1}^m\) and the line has the form \(L(x) = ax+b\), then the regression line is formed by choosing \(a\) and \(b\) to minimize

\[F(a,b) = \sum_{i=1}^m (a x_i + b - y_i)^2.\]
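
To make this concrete, here’s a quick sketch of \(F\) in code, using the winloss frame we built above. The particular coefficients plugged in below are the ones we’ll compute shortly; nudging either of them away from its fitted value increases the error:

# A direct implementation of the objective F(a, b)
def F(a, b):
    xs = winloss.points_per_game
    ys = winloss.win_pct
    return np.sum((a * xs + b - ys) ** 2)

print(F(0.0230, -1.1724))  # near the minimum
print(F(0.0300, -1.1724))  # noticeably larger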

There are a few ways to approach this minimization problem using Python. Here are two such ways:

  1. A semi-theoretical approach using linear algebra and
  2. A higher-level approach using SciKit Learn’s LinearRegression class.

Finding the regression line with linear algebra

For those of you who know multivariable calculus, the logical thing to do might be to find the partial derivatives of \(F\) and solve \[ \frac{\partial F}{\partial a} = 0 \: \text{ and } \: \frac{\partial F}{\partial b} = 0. \]
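
Written out, those partial derivatives are

\[ \frac{\partial F}{\partial a} = \sum_{i=1}^m 2 x_i (a x_i + b - y_i) \: \text{ and } \: \frac{\partial F}{\partial b} = \sum_{i=1}^m 2 (a x_i + b - y_i). \]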

Since \(F\) is quadratic in both variables \(a\) and \(b\), these derivatives are linear in \(a\) and \(b\), so the system we need to solve is linear. Perhaps it’s not surprising, then, that we can find the minimum using a little linear algebra. To do so, note that another way to write \(ax_i+b \approx y_i\) for each \(i\) is as

\[ \begin{pmatrix} x_1 & 1 \\ x_2 & 1 \\ x_3 & 1 \\ \vdots & \vdots \\ x_m & 1 \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} \approx \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_m \end{pmatrix} \]

Now, let’s write

\[ X = \begin{pmatrix} x_1 & 1 \\ x_2 & 1 \\ x_3 & 1 \\ \vdots & \vdots \\ x_m & 1 \end{pmatrix} \: \: \text{ and } \: \: Y = \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_m \end{pmatrix}. \]

Then, as it turns out, the linear system we need to solve can be written \[ X^TX \begin{pmatrix} a \\ b \end{pmatrix} = X^T Y. \]
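
To see where that system comes from, write \(\beta = \begin{pmatrix} a \\ b \end{pmatrix}\), so that \(F(a,b) = \lVert X\beta - Y \rVert^2\). Setting the gradient with respect to \(\beta\) to zero gives

\[ \nabla F = 2 X^T (X\beta - Y) = 0, \]

which rearranges to exactly the system above.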

These equations are called the normal equations. We can implement and solve the normal equations for this problem as follows:

# Form the X matrix whose rows are (x_i,1)
X = np.array([[x,1] for x in winloss.points_per_game])

# Form the matrix M = X^T X
M = X.transpose().dot(X)

# Form the target vector Y as a column vector
Y = np.array(winloss.win_pct).reshape(-1, 1)

# Form the vector b = X^T Y
b = X.transpose().dot(Y)

# Solve the system
solve(M,b)
array([[ 0.02300548],
       [-1.17239978]])

This suggests that the formula for the regression line is \(y = 0.023x - 1.1724\), which you can check by hovering over the line in the figure.
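
Incidentally, NumPy can solve the same least-squares problem directly from \(X\) and \(Y\), without forming the normal equations ourselves. Here’s a quick sanity check using the arrays we just built; it should reproduce the same coefficients:

# lstsq minimizes ||X @ beta - Y||^2 directly and should reproduce
# the solution of the normal equations above
beta, residuals, rank, sv = np.linalg.lstsq(X, Y, rcond=None)
print(beta)  # expect approximately [[0.023], [-1.172]]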

SciKit Learn’s LinearRegression

A higher level approach is to use a package devoted to linear regression. In our imports at the top of the notebook, you might notice this line:

from sklearn.linear_model import LinearRegression

We can use it like so:

# sklearn expects a 2D array of features, so reshape points_per_game
x = winloss.points_per_game.values.reshape(-1, 1)
y = winloss.win_pct
regression = LinearRegression().fit(x, y)

At this point regression is a Python object. We can list its public properties and methods as follows:

[s for s in dir(regression) if s[0] != '_']
['coef_',
 'copy_X',
 'fit',
 'fit_intercept',
 'get_params',
 'intercept_',
 'n_features_in_',
 'n_jobs',
 'positive',
 'predict',
 'rank_',
 'score',
 'set_params',
 'singular_']

Note that predict is a method that allows you to compute predicted win percentages given points per game.

regression.predict([[80]])
array([0.66803865])

You could also compute this using \(ax+b\), where a=coef_[0] and b=intercept_:

[regression.coef_[0], regression.intercept_]
[0.023005480376054713, -1.1723997846082073]
0.023005480376054713*80 - 1.1723997846082073
0.6680386454761698

Finally, note that these values of a and b agree with our previous computations.
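
One more item from the list of properties above that’s worth knowing about: score computes the coefficient of determination \(R^2\), which measures how much of the variation in winning percentage the fitted line accounts for. A quick sketch, using the same x and y we fit with:

# R^2 for the fit: the fraction of the variance in win_pct that the
# model explains (closer to 1 means a tighter fit)
regression.score(x, y)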