Kaggle’s competition

In the few weeks leading up to Spring break, we discussed a number of topics, and I also briefly mentioned Kaggle’s March Madness Competition. Today, we’re going to take a quick look at the competition and some code that generates a submission.

The competition

Let’s take a quick look at the competition page on Kaggle.

Creating a submission

Obviously, we need to load some libraries, though we won’t actually use all of them here:

import numpy as np
import pandas as pd
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import LeaveOneGroupOut

Let’s read and examine some data:

# 'M' selects the men's data; presumably 'W' would select the women's
G = 'M'
train = pd.read_csv(f'https://marksmath.org/data/{G}prior_tourney_games2025.csv')
infer = pd.read_csv(f'https://marksmath.org/data/{G}potential_tourney_games2025.csv')
train
Season WTeamID LTeamID WinPctDiff PPGDiff AvgPointDiffDiff FGPctDiff ThreePctDiff FTPctDiff ORPGDiff DRPGDiff APGDiff TOPGDiff SPGDiff BPGDiff PFGDiff SeedDiff Outcome team_ratingDiff conf_ratingDiff
0 2010 1115 1457 -0.035417 1.839583 -0.914583 0.026144 0.039044 0.024015 -0.290720 0.287879 1.748106 3.059659 -0.422348 -0.298295 2.296402 0 1 -0.000950 -0.008247
1 2010 1124 1358 0.024194 -3.183180 2.589862 0.016923 0.000871 0.024458 1.322989 3.203448 -3.491954 1.306897 -0.248276 4.102299 -0.981609 11 1 0.005936 0.024009
2 2010 1139 1431 0.062500 -5.687500 -1.531250 -0.023173 -0.005059 0.075933 0.904203 0.084066 0.020528 0.880743 -0.751711 -0.954057 1.739003 7 1 0.000213 -0.005638
3 2010 1140 1196 0.212121 11.000000 10.757576 0.038012 0.102192 0.083549 0.214286 7.741379 5.217980 2.280788 2.620690 1.448276 6.238916 3 1 0.002368 -0.004671
4 2010 1242 1250 0.253676 6.639706 12.876838 0.032261 0.008564 -0.014871 8.267857 13.111607 9.325893 6.843750 5.424107 5.750000 8.799107 15 1 0.011535 0.024281
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1863 2024 1181 1301 0.138889 3.482639 8.739583 0.032591 0.031070 -0.011645 1.064516 2.064516 3.290323 1.354839 -0.806452 0.354839 0.322581 7 0 0.003471 0.000000
1864 2024 1397 1345 -0.128788 -3.925189 -1.648674 -0.043992 -0.066100 0.028050 -1.774194 -4.354839 -3.387097 -1.903226 0.290323 0.354839 -0.161290 -1 0 -0.000953 -0.000433
1865 2024 1104 1163 -0.255515 9.279412 -7.371324 -0.018671 -0.001546 0.041092 -1.390086 -5.880388 -5.750000 -0.477371 -0.181034 -2.058190 -2.876078 -3 0 -0.001502 -0.000763
1866 2024 1301 1345 -0.267677 -7.032828 -9.575758 -0.039121 -0.061963 0.012308 -4.064516 -7.322581 -7.935484 -3.741935 0.096774 -0.967742 -1.612903 -10 0 -0.005007 -0.000788
1867 2024 1345 1163 -0.032977 1.923351 -3.816399 -0.007664 0.041141 -0.021258 0.727823 -0.253024 -1.201613 1.055444 -0.475806 -1.921371 -2.628024 0 0 0.000203 -0.000330

1868 rows × 20 columns

Note that this data is all derived from Kaggle’s raw data; that derivation in and of itself represents quite a bit of work.
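To give a flavor of that derivation: each column above is a difference of per-team season aggregates. Here’s a minimal sketch of how one such feature might be computed from Kaggle’s regular-season results file (the file and column names follow Kaggle’s conventions, but this is illustrative rather than the exact pipeline that produced the files above):

import pandas as pd

reg = pd.read_csv('MRegularSeasonDetailedResults.csv')  # raw Kaggle game results

# Stack winner and loser rows so each (Season, TeamID) appearance is one row.
w = reg[['Season', 'WTeamID', 'WScore']].rename(columns={'WTeamID': 'TeamID', 'WScore': 'Score'})
l = reg[['Season', 'LTeamID', 'LScore']].rename(columns={'LTeamID': 'TeamID', 'LScore': 'Score'})
games = pd.concat([w, l])

# Points per game for each team in each season.
ppg = games.groupby(['Season', 'TeamID'])['Score'].mean()

# A "Diff" feature for a matchup is team 1's average minus team 2's.
def ppg_diff(season, team1, team2):
    return ppg[(season, team1)] - ppg[(season, team2)]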

The data frames train and infer are very similar. Let’s check out the relationships:

print("Things in train but not infer: ", set(train.columns).difference(set(infer)))
print("Things in infer but not train: ", set(infer.columns).difference(set(train)))
print("Things in common: ", set(train.columns).intersection(set(infer)))
Things in train but not infer:  {'WTeamID', 'Outcome', 'LTeamID'}
Things in infer but not train:  {'TeamID1', 'TeamID2'}
Things in common:  {'conf_ratingDiff', 'SeedDiff', 'ORPGDiff', 'FGPctDiff', 'team_ratingDiff', 'SPGDiff', 'BPGDiff', 'APGDiff', 'ThreePctDiff', 'DRPGDiff', 'PPGDiff', 'Season', 'AvgPointDiffDiff', 'PFGDiff', 'FTPctDiff', 'WinPctDiff', 'TOPGDiff'}

OK, let’s build and train a model:

features = ['team_ratingDiff', 'SeedDiff', 'WinPctDiff']
X = train[features].values
y = train["Outcome"].values

# Standardize the features so the regularization treats them comparably.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

model = LogisticRegression(C=0.5)
model.fit(X_scaled, y)
model
LogisticRegression(C=0.5)
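Incidentally, we imported brier_score_loss and log_loss up top but haven’t used them yet. Here’s a quick in-sample check of the fit (just a sketch; in-sample numbers are optimistic, and a proper evaluation would score held-out seasons):

p_train = model.predict_proba(X_scaled)[:, 1]
print("Brier score:", brier_score_loss(y, p_train))
print("Log loss:", log_loss(y, p_train))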

And now, the prediction data frame. Note that the inference features must pass through the same scaler that we fit on the training data:

pred = model.predict_proba(scaler.transform(infer[features].values))
ID = infer.apply(lambda r:
   f'2025_{int(r.TeamID1)}_{int(r.TeamID2)}', 
   axis = 1
)
pd.DataFrame({'ID': ID, 'pred': [p[1] for p in pred]})
ID pred
0 2025_1103_1104 0.389727
1 2025_1103_1106 0.543172
2 2025_1103_1110 0.539127
3 2025_1103_1112 0.414363
4 2025_1103_1116 0.476587
... ... ...
2273 2025_1459_1463 0.469689
2274 2025_1459_1471 0.453974
2275 2025_1462_1463 0.517301
2276 2025_1462_1471 0.501500
2277 2025_1463_1471 0.484197

2278 rows × 2 columns

If we export that to CSV, we get a valid (though certainly not winning) Kaggle submission!
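Concretely, that export might look like so, assuming we store the data frame above in a variable (the name submission is my choice here):

submission = pd.DataFrame({'ID': ID, 'pred': pred[:, 1]})
submission.to_csv('submission.csv', index=False)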

Comments

There’s clearly a lot of work leading up to this point:

  • Data processing, and
  • Determination of hyperparameters via cross-validation (see the sketch after this list), such as
    • parameters in the eigenvalue computation behind the team ratings, and
    • the regularization constant \(C\), which limits the size of the model’s coefficient vector to fight overfitting; in scikit-learn, smaller values of \(C\) mean stronger regularization.
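For example, here’s a sketch of how a choice like \(C = 0.5\) might be checked with the LogisticRegressionCV and LeaveOneGroupOut tools we imported at the top. Grouping by Season holds out one tournament year per fold, which respects the time structure of the data; the grid of candidate \(C\) values is just an illustration:

logo = LeaveOneGroupOut()
cv_model = LogisticRegressionCV(
    Cs=np.logspace(-3, 3, 13),  # grid of candidate C values
    cv=logo.split(X_scaled, y, groups=train['Season']),
    scoring='neg_log_loss',     # optimize probabilistic accuracy
)
cv_model.fit(X_scaled, y)
cv_model.C_  # the selected regularization constant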