In this notebook we’re going to introduce the important idea of linear regression with an eye towards application in sports analytics. The scatter plot below, for example, relates winning percentage (on the vertical axis) to points per game (on the horizontal axis) in college basketball. Each dot corresponds to some team from the men’s 2023/24 college basketball season. You can hover over the dots to find out who’s who and what’s what.
# Relating winning percentage to points per game in college basketball
# This code needs to be run *after* the rest of the code in the notebook
px.scatter(winloss, x='points_per_game', y='win_pct',
           hover_data=['TeamName'], width=900, height=600, trendline='ols')
It certainly looks like there’s a relationship between the two - generally, more points per game yields more wins. The diagonal line is called the regression line; it represents the line of least squared error.
Imports
We begin by importing a few libraries that we’ll need.
# Data manipulation
import pandas as pd
import numpy as np

# A solver for linear algebra
from scipy.linalg import solve

# Linear regression for machine learning
from sklearn.linear_model import LinearRegression

# Graphics
import plotly.express as px
import plotly.graph_objects as go
Kaggle data
I’ve participated for several years now in Kaggle’s March Machine Learning Mania competition. Kaggle provides a slew of data as part of that competition - a couple hundred megabytes’ worth spread over 32 files. Note that the data is free but subject to competition rules; you’ve got to sign up for the competition to get the full data set.
I’ve got a reasonably simple file that combines data from several Kaggle files into one simplified file. You can read it into your notebook like so:
games = pd.read_csv('https://www.marksmath.org/data/CombinedCondensedNCAAData2024.csv')
games
      WTeamName        WIDX  WScore  LTeamName    LIDX  LScore
0     Abilene Chr         0      64  Oklahoma St   221      59
1     Akron               2      81  S Dakota St   249      75
2     Alabama             3     105  Morehead St   183      73
3     Arizona             9     122  Morgan St     184      59
4     Ark Little Rock    11      71  Texas St      302      66
...   ...               ...     ...  ...           ...     ...
5602  Auburn             16      86  Florida        89      67
5603  Duquesne           76      57  VCU           332      51
5604  Illinois          120      93  Wisconsin     355      87
5605  UAB               309      85  Temple        296      69
5606  Yale              360      62  Brown          30      61

5607 rows × 6 columns
Note that size and simplicity were both considerations in the setup of this file; accordingly, I’ve read in data for only the men’s 2024 season, even though Kaggle has data for both men and women going back as far as 1985.
Most of the variable names are pretty self-explanatory; WIDX and LIDX are unique numeric identifiers for the winning and losing teams, respectively.
Now, rather than data by game, we’d like data by team indicating points per game and winning percentage. Here’s one way to set that up:
# Group by wins
wins = pd.DataFrame(games.groupby('WTeamName').size()).reset_index()
wins['wins'] = wins[0]
wins = wins[['WTeamName', 'wins']]

# Group by losses
losses = pd.DataFrame(games.groupby('LTeamName').size()).reset_index()
losses['losses'] = losses[0]
losses = losses[['LTeamName', 'losses']]

# Group by winning points and losing points
WPoints = games.groupby('WTeamName').sum()[['WScore']].reset_index()
LPoints = games.groupby('LTeamName').sum()[['LScore']].reset_index()

# Put those all together
winloss = pd.concat(
    [
        WPoints.set_index('WTeamName'),
        LPoints.set_index('LTeamName'),
        wins.set_index('WTeamName'),
        losses.set_index('LTeamName')
    ],
    axis=1, join='inner'
).reset_index()
winloss['points'] = winloss.apply(lambda r: r.WScore + r.LScore, axis=1)
winloss['games'] = winloss.apply(lambda r: r.wins + r.losses, axis=1)
winloss['points_per_game'] = winloss.apply(lambda r: r.points / r.games, axis=1)
winloss['win_pct'] = winloss.apply(lambda r: r.wins / r.games, axis=1)
winloss['TeamName'] = winloss['index']
winloss = winloss[['TeamName', 'points_per_game', 'win_pct']]

# Display
winloss
     TeamName       points_per_game   win_pct
0    Abilene Chr          70.967742  0.451613
1    Air Force            66.161290  0.290323
2    Akron                72.343750  0.687500
3    Alabama              90.750000  0.656250
4    Alabama A&M          68.696970  0.333333
...  ...                        ...       ...
357  Wright St            86.166667  0.533333
358  Wyoming              71.566667  0.433333
359  Xavier               75.848485  0.484848
360  Yale                 73.551724  0.689655
361  Youngstown St        78.428571  0.642857

362 rows × 3 columns
Stat 225 students don’t need to worry about all this. It’s worth pointing out, though, that the key step is illustrated right here:
games.groupby('WTeamName').size()
WTeamName
Abilene Chr 14
Air Force 9
Akron 22
Alabama 21
Alabama A&M 11
..
Wright St 16
Wyoming 13
Xavier 16
Yale 20
Youngstown St 18
Length: 362, dtype: int64
As this shows, the pandas DataFrame object has a groupby method that allows us to group and aggregate data. In this case, we see how many games each team won, as indicated by WTeamName. We can do the same for the losing teams and also for the scores; we then combine all that into one data frame.
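For instance, the loss counts and the winners’ point totals come from the same pattern (a minimal sketch; the full cell above uses slightly different but equivalent calls):

# Number of games each team lost
games.groupby('LTeamName').size()

# Total points scored by each team in the games it won
games.groupby('WTeamName')['WScore'].sum()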
We can now generate the scatter plot like so:
px.scatter(winloss, x ='points_per_game', y ='win_pct', hover_data = ['TeamName'], width=900, height=600, trendline='ols')
Linear regression
Recall that the diagonal line is called the regression line. Linear regression is a fundamental tool in machine learning that you’ll talk about more with Kedai later this semester. For now, it’s worth knowing that the regression line is the best fit for the data in the sense that it minimizes the least squared error. More precisely, if the data points are \(\{(x_i,y_i)\}_{i=1}^m\) and the line has the form \(L(x) = ax+b\), then the regression line is formed by choosing \(a\) and \(b\) to minimize
\[F(a,b) = \sum_{i=1}^m (a x_i + b - y_i)^2.\]
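In code, \(F\) is just a couple of lines. Here’s a direct NumPy translation (an illustrative helper, not part of the original notebook):

# Sum of squared errors for the candidate line y = a*x + b
def F(a, b, x, y):
    x, y = np.asarray(x), np.asarray(y)
    return np.sum((a * x + b - y) ** 2)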
There are a few ways to approach this minimization problem using Python. Here are two such ways:
A semi-theoretical approach using linear algebra, and
Using scikit-learn’s LinearRegression class.
Finding the regression line with linear algebra
For those of you who know multivariable calculus, the logical thing to do might be to find the partial derivatives of \(F\) and solve \[
\frac{\partial F}{\partial a} = 0 \: \text{ and } \: \frac{\partial F}{\partial b} = 0.
\]
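Carrying out the differentiation and canceling a common factor of two, these equations read \[
\sum_{i=1}^m x_i (a x_i + b - y_i) = 0 \: \text{ and } \: \sum_{i=1}^m (a x_i + b - y_i) = 0.
\]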
Since \(F\) is quadratic in both variables \(a\) and \(b\), its partial derivatives are first order in \(a\) and \(b\); thus, the system we need to solve is linear. Perhaps it’s not surprising, then, that we can find the minimum using a little linear algebra. To do so, note that another way to write \(ax_i+b \approx y_i\) for each \(i\) is as \[
\begin{pmatrix}
x_1 & 1 \\
x_2 & 1 \\
\vdots & \vdots \\
x_m & 1
\end{pmatrix}
\begin{pmatrix}
a \\ b
\end{pmatrix} \approx
\begin{pmatrix}
y_1 \\ y_2 \\ \vdots \\ y_m
\end{pmatrix},
\] which we abbreviate as \(X \begin{pmatrix} a \\ b \end{pmatrix} \approx Y\).
Then, as it turns out, the linear system we need to solve can be written \[
X^TX \begin{pmatrix}
a \\ b
\end{pmatrix} = X^T Y.
\]
These equations are called the normal equations. We can implement and solve the normal equations for this problem as follows:
# Form the X matrix whose rows are (x_i, 1)
X = np.array([[x, 1] for x in winloss.points_per_game])

# Form the matrix M = X^T X
M = X.transpose().dot(X)

# Form the target vector Y
Y = [[y] for y in winloss.win_pct]

# Form the vector b = X^T Y
b = X.transpose().dot(Y)

# Solve the system
solve(M, b)
array([[ 0.02300548],
       [-1.17239978]])
This suggests that the formula for the regression line is y = 0.023x - 1.1724, which you can check by hovering over the line in the figure.
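As a quick cross-check (not in the original notebook), NumPy’s polyfit fits the same least-squares line and should report the same slope and intercept:

# Least-squares fit of a degree-1 polynomial; returns [slope, intercept]
np.polyfit(winloss.points_per_game, winloss.win_pct, 1)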
Scikit-learn’s LinearRegression
A higher-level approach is to use a package devoted to linear regression. In our imports at the top of the notebook, you might notice this line:
from sklearn.linear_model import LinearRegression
We can use it like so:
# Each input must be a list of feature vectors, so wrap each x-value in a list
x = list(winloss.points_per_game.apply(lambda p: [p]))
y = list(winloss.win_pct)

# Fit the model
regression = LinearRegression().fit(x, y)
At this point, regression is a Python object. We can list its public properties and methods as follows:
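For instance, Python’s built-in dir returns an object’s attribute names; filtering out the underscore-prefixed ones leaves the public interface (a minimal sketch):

# Public attributes and methods of the fitted model
[name for name in dir(regression) if not name.startswith('_')]

Among these you’ll find coef_ and intercept_, which hold the fitted slope and intercept.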