In this notebook we’re going to introduce the important idea of linear regression with an eye towards application in sports analytics. The scatter plot below, for example, relates winning percentage (on the vertical axis) to points per game (on the horizontal axis) in college basketball. Each dot corresponds to some team from the men’s 2023/24 college basketball season. You can hover over the dots to find out who’s who and what’s what.
# Relating winning percentage to points per game in college basketball
# This code needs to be run *after* the rest of the code in the notebook
px.scatter(winloss, x='points_per_game', y='win_pct',
           hover_data=['TeamName'], width=900, height=600, trendline='ols')
It certainly looks like there’s a relationship between the two - generally, more points per game yields more wins. The diagonal line is called the regression line; it represents the line of least squared error.
Imports
We begin by importing a few libraries that we’ll need.
# Data manipulation
import pandas as pd
import numpy as np

# A solver for linear algebra
from scipy.linalg import solve

# Linear regression for machine learning
from sklearn.linear_model import LinearRegression

# Graphics
import plotly.express as px
import plotly.graph_objects as go
Kaggle data
I’ve participated for several years now in Kaggle’s March Machine Learning Mania competition. Kaggle provides a slew of data as part of that competition - a couple hundred megabytes’ worth spread over 32 files. Note that the data is free but subject to competition rules; you’ve got to sign up for the competition to get the full data set.
I’ve got a reasonably simple file that combines data from several Kaggle files into one simplified file. You can read it into your notebook like so:
games = pd.read_csv('https://www.marksmath.org/data/CombinedCondensedNCAAData2024.csv')
games
      WTeamName        WIDX  WScore  LTeamName    LIDX  LScore
0     Abilene Chr         0      64  Oklahoma St   221      59
1     Akron               2      81  S Dakota St   249      75
2     Alabama             3     105  Morehead St   183      73
3     Arizona             9     122  Morgan St     184      59
4     Ark Little Rock    11      71  Texas St      302      66
...   ...               ...     ...  ...           ...     ...
5602  Auburn             16      86  Florida        89      67
5603  Duquesne           76      57  VCU           332      51
5604  Illinois          120      93  Wisconsin     355      87
5605  UAB               309      85  Temple        296      69
5606  Yale              360      62  Brown          30      61

5607 rows × 6 columns
Note that size and simplicity were both considerations in the setup of this file; accordingly, I’ve read in data for only the men’s 2024 season, even though Kaggle has data for both men and women going back as far as 1985.
Most of the variable names are pretty self-explanatory; WIDX and LIDX are unique numeric identifiers for the winning and losing teams, respectively.
Now, rather than data by game, we’d like data by team indicating points per game and winning percentage. Here’s one way to set that up:
# Group by wins
wins = pd.DataFrame(games.groupby('WTeamName').size()).reset_index()
wins['wins'] = wins[0]
wins = wins[['WTeamName', 'wins']]

# Group by losses
losses = pd.DataFrame(games.groupby('LTeamName').size()).reset_index()
losses['losses'] = losses[0]
losses = losses[['LTeamName', 'losses']]

# Group by winning points and losing points
WPoints = games.groupby('WTeamName').sum()[['WScore']].reset_index()
LPoints = games.groupby('LTeamName').sum()[['LScore']].reset_index()

# Put those all together
winloss = pd.concat(
    [
        WPoints.set_index('WTeamName'),
        LPoints.set_index('LTeamName'),
        wins.set_index('WTeamName'),
        losses.set_index('LTeamName')
    ],
    axis=1, join='inner'
).reset_index()
winloss['points'] = winloss.apply(lambda r: r.WScore + r.LScore, axis=1)
winloss['games'] = winloss.apply(lambda r: r.wins + r.losses, axis=1)
winloss['points_per_game'] = winloss.apply(lambda r: r.points / r.games, axis=1)
winloss['win_pct'] = winloss.apply(lambda r: r.wins / r.games, axis=1)
winloss['TeamName'] = winloss['index']
winloss = winloss[['TeamName', 'points_per_game', 'win_pct']]

# Display
winloss
     TeamName       points_per_game   win_pct
0    Abilene Chr          70.967742  0.451613
1    Air Force            66.161290  0.290323
2    Akron                72.343750  0.687500
3    Alabama              90.750000  0.656250
4    Alabama A&M          68.696970  0.333333
...  ...                        ...       ...
357  Wright St            86.166667  0.533333
358  Wyoming              71.566667  0.433333
359  Xavier               75.848485  0.484848
360  Yale                 73.551724  0.689655
361  Youngstown St        78.428571  0.642857

362 rows × 3 columns
Stat 225 students don’t need to worry about all this. It’s worth pointing out, though, that the key step is illustrated right here:
games.groupby('WTeamName').size()
WTeamName
Abilene Chr 14
Air Force 9
Akron 22
Alabama 21
Alabama A&M 11
..
Wright St 16
Wyoming 13
Xavier 16
Yale 20
Youngstown St 18
Length: 362, dtype: int64
As this shows, the pandas DataFrame object has a groupby method that allows us to group and aggregate data. In this case, we see how many games each team won, as indicated by WTeamName. We can do the same for the losing teams and also for the scores; we then combine all that into one data frame.
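For instance, the loss counts and the winners’ point totals come from the same pattern (a minimal sketch; the full cell above uses slightly different but equivalent calls):

# Number of games each team lost
games.groupby('LTeamName').size()

# Total points scored by each team in the games it won
games.groupby('WTeamName')['WScore'].sum()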
We can now generate the scatter plot like so:
px.scatter(winloss, x ='points_per_game', y ='win_pct', hover_data = ['TeamName'], width=900, height=600, trendline='ols')
Linear regression
Recall that the diagonal line is called the regression line. Linear regression is a fundamental tool in machine learning that you’ll talk about more with Kedai later this semester. For now, it’s worth knowing that the regression line is the best fit for the data in the sense that it minimizes the least squared error. More precisely, if the data points are \(\{(x_i,y_i)\}_{i=1}^m\) and the line has the form \(L(x) = ax+b\), then the regression line is formed by choosing \(a\) and \(b\) to minimize
\[F(a,b) = \sum_{i=1}^m (a x_i + b - y_i)^2.\]
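In code, \(F\) is just a couple of lines. Here’s a direct NumPy translation (an illustrative helper, not part of the original notebook):

# Sum of squared errors for the candidate line y = a*x + b
def F(a, b, x, y):
    x, y = np.asarray(x), np.asarray(y)
    return np.sum((a * x + b - y) ** 2)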
There are a few ways to approach this minimization problem using Python. Here are two such ways:
A semi-theoretical approach using linear algebra, and
Using scikit-learn’s LinearRegression class.
Finding the regression line with linear algebra
For those of you who know multivariable calculus, the logical thing to do might be to find the partial derivatives of \(F\) and solve \[
\frac{\partial F}{\partial a} = 0 \: \text{ and } \: \frac{\partial F}{\partial b} = 0.
\]
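Carrying out the differentiation and canceling a common factor of two, these equations read \[
\sum_{i=1}^m x_i (a x_i + b - y_i) = 0 \: \text{ and } \: \sum_{i=1}^m (a x_i + b - y_i) = 0.
\]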
Since \(F\) is quadratic in both variables \(a\) and \(b\), its partial derivatives are first order in \(a\) and \(b\); thus, the system we need to solve is linear. Perhaps it’s not surprising, then, that we can find the minimum using a little linear algebra. To do so, note that another way to write \(ax_i+b \approx y_i\) for each \(i\) is as \[
\begin{pmatrix}
x_1 & 1 \\
x_2 & 1 \\
\vdots & \vdots \\
x_m & 1
\end{pmatrix}
\begin{pmatrix}
a \\ b
\end{pmatrix} \approx
\begin{pmatrix}
y_1 \\ y_2 \\ \vdots \\ y_m
\end{pmatrix},
\] which we abbreviate as \(X \begin{pmatrix} a \\ b \end{pmatrix} \approx Y\).
Then, as it turns out, the linear system we need to solve can be written \[
X^TX \begin{pmatrix}
a \\ b
\end{pmatrix} = X^T Y.
\]
These equations are called the normal equations. We can implement and solve the normal equations for this problem as follows:
# Form the X matrix whose rows are (x_i, 1)
X = np.array([[x, 1] for x in winloss.points_per_game])

# Form the matrix M = X^T X
M = X.transpose().dot(X)

# Form the target vector Y
Y = [[y] for y in winloss.win_pct]

# Form the vector b = X^T Y
b = X.transpose().dot(Y)

# Solve the system
solve(M, b)
array([[ 0.02300548],
       [-1.17239978]])
This suggests that the formula for the regression line is y = 0.023x - 1.1724, which you can check by hovering over the line in the figure.
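As a quick cross-check (not in the original notebook), NumPy’s polyfit fits the same least-squares line and should report the same slope and intercept:

# Least-squares fit of a degree-1 polynomial; returns [slope, intercept]
np.polyfit(winloss.points_per_game, winloss.win_pct, 1)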
Scikit-learn’s LinearRegression
A higher-level approach is to use a package devoted to linear regression. In our imports at the top of the notebook, you might notice this line:
from sklearn.linear_model import LinearRegression
We can use it like so:
# Each input must be a list of feature vectors, so wrap each x-value in a list
x = list(winloss.points_per_game.apply(lambda p: [p]))
y = list(winloss.win_pct)

# Fit the model
regression = LinearRegression().fit(x, y)
At this point, regression is a Python object. We can list its public properties and methods as follows:
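For instance, Python’s built-in dir returns an object’s attribute names; filtering out the underscore-prefixed ones leaves the public interface (a minimal sketch):

# Public attributes and methods of the fitted model
[name for name in dir(regression) if not name.startswith('_')]

Among these you’ll find coef_ and intercept_, which hold the fitted slope and intercept.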