import numpy as np
import pandas as pd
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import LeaveOneGroupOut
Kaggle’s competition
Here are some things we’ve discussed in the few weeks leading up to Spring break:
- Rating sports teams via linear algebra, and
- Logistic regression to translate those kinds of ratings into probabilities.
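As a quick reminder of how logistic regression turns a rating difference into a probability, here is a minimal sketch. The coefficient `beta` is an arbitrary illustration, not a fitted value:

```python
import numpy as np

def win_probability(rating_diff, beta=10.0):
    # Logistic link: squash a rating difference into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-beta * rating_diff))

print(win_probability(0.0))   # evenly matched teams -> 0.5
print(win_probability(0.1))   # a modest edge -> probability above 0.5
```

A fitted model learns `beta` (and an intercept) from historical game outcomes.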
I’ve also briefly mentioned Kaggle’s March Madness Competition. Today, we’re going to take a quick look at the competition and some code that generates a submission.
The competition
Let’s take a quick look at
- The competition webpage and the provided data, as well as
- This bracket visualizer
Creating a submission
Obviously, we need to load some libraries, though we won’t actually use quite all of them here; the imports appear at the top of this page.
Let’s read and examine some data:
G = 'M'
train = pd.read_csv(f'https://marksmath.org/data/{G}prior_tourney_games2025.csv')
infer = pd.read_csv(f'https://marksmath.org/data/{G}potential_tourney_games2025.csv')
train
| Season | WTeamID | LTeamID | WinPctDiff | PPGDiff | AvgPointDiffDiff | FGPctDiff | ThreePctDiff | FTPctDiff | ORPGDiff | DRPGDiff | APGDiff | TOPGDiff | SPGDiff | BPGDiff | PFGDiff | SeedDiff | Outcome | team_ratingDiff | conf_ratingDiff |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2010 | 1115 | 1457 | -0.035417 | 1.839583 | -0.914583 | 0.026144 | 0.039044 | 0.024015 | -0.290720 | 0.287879 | 1.748106 | 3.059659 | -0.422348 | -0.298295 | 2.296402 | 0 | 1 | -0.000950 | -0.008247 |
1 | 2010 | 1124 | 1358 | 0.024194 | -3.183180 | 2.589862 | 0.016923 | 0.000871 | 0.024458 | 1.322989 | 3.203448 | -3.491954 | 1.306897 | -0.248276 | 4.102299 | -0.981609 | 11 | 1 | 0.005936 | 0.024009 |
2 | 2010 | 1139 | 1431 | 0.062500 | -5.687500 | -1.531250 | -0.023173 | -0.005059 | 0.075933 | 0.904203 | 0.084066 | 0.020528 | 0.880743 | -0.751711 | -0.954057 | 1.739003 | 7 | 1 | 0.000213 | -0.005638 |
3 | 2010 | 1140 | 1196 | 0.212121 | 11.000000 | 10.757576 | 0.038012 | 0.102192 | 0.083549 | 0.214286 | 7.741379 | 5.217980 | 2.280788 | 2.620690 | 1.448276 | 6.238916 | 3 | 1 | 0.002368 | -0.004671 |
4 | 2010 | 1242 | 1250 | 0.253676 | 6.639706 | 12.876838 | 0.032261 | 0.008564 | -0.014871 | 8.267857 | 13.111607 | 9.325893 | 6.843750 | 5.424107 | 5.750000 | 8.799107 | 15 | 1 | 0.011535 | 0.024281 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1863 | 2024 | 1181 | 1301 | 0.138889 | 3.482639 | 8.739583 | 0.032591 | 0.031070 | -0.011645 | 1.064516 | 2.064516 | 3.290323 | 1.354839 | -0.806452 | 0.354839 | 0.322581 | 7 | 0 | 0.003471 | 0.000000 |
1864 | 2024 | 1397 | 1345 | -0.128788 | -3.925189 | -1.648674 | -0.043992 | -0.066100 | 0.028050 | -1.774194 | -4.354839 | -3.387097 | -1.903226 | 0.290323 | 0.354839 | -0.161290 | -1 | 0 | -0.000953 | -0.000433 |
1865 | 2024 | 1104 | 1163 | -0.255515 | 9.279412 | -7.371324 | -0.018671 | -0.001546 | 0.041092 | -1.390086 | -5.880388 | -5.750000 | -0.477371 | -0.181034 | -2.058190 | -2.876078 | -3 | 0 | -0.001502 | -0.000763 |
1866 | 2024 | 1301 | 1345 | -0.267677 | -7.032828 | -9.575758 | -0.039121 | -0.061963 | 0.012308 | -4.064516 | -7.322581 | -7.935484 | -3.741935 | 0.096774 | -0.967742 | -1.612903 | -10 | 0 | -0.005007 | -0.000788 |
1867 | 2024 | 1345 | 1163 | -0.032977 | 1.923351 | -3.816399 | -0.007664 | 0.041141 | -0.021258 | 0.727823 | -0.253024 | -1.201613 | 1.055444 | -0.475806 | -1.921371 | -2.628024 | 0 | 0 | 0.000203 | -0.000330 |
1868 rows × 20 columns
Note that this data is all derived from Kaggle’s raw data; producing it in and of itself represents quite a bit of work.
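Each of the `*Diff` columns is presumably built along these lines: compute a per-team season statistic from the raw game results, then subtract the second team's value from the first's. A hypothetical sketch, with team IDs borrowed from the table above but the statistic values invented:

```python
import pandas as pd

# Hypothetical per-team season averages (values made up for illustration)
stats = pd.DataFrame(
    {'TeamID': [1115, 1457], 'PPG': [75.2, 73.4]}
).set_index('TeamID')

# A "Diff" feature records the first team's stat minus the second's
ppg_diff = stats.loc[1115, 'PPG'] - stats.loc[1457, 'PPG']
print(round(ppg_diff, 1))
```

The actual pipeline repeats this for every statistic and every matchup, which is where the work comes in.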
The data frames train and infer are very similar. Let’s check out the relationships:
print("Things in train but not infer: ", set(train.columns).difference(set(infer)))
print("Things in infer but not train: ", set(infer.columns).difference(set(train)))
print("Things in common: ", set(train.columns).intersection(set(infer)))
Things in train but not infer: {'WTeamID', 'Outcome', 'LTeamID'}
Things in infer but not train: {'TeamID1', 'TeamID2'}
Things in common: {'conf_ratingDiff', 'SeedDiff', 'ORPGDiff', 'FGPctDiff', 'team_ratingDiff', 'SPGDiff', 'BPGDiff', 'APGDiff', 'ThreePctDiff', 'DRPGDiff', 'PPGDiff', 'Season', 'AvgPointDiffDiff', 'PFGDiff', 'FTPctDiff', 'WinPctDiff', 'TOPGDiff'}
OK, let’s build and train a model:
features = ['team_ratingDiff', 'SeedDiff', 'WinPctDiff']
X = train[features].values
y = train["Outcome"].values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
model = LogisticRegression(C=0.5)
model.fit(X_scaled, y)
model
LogisticRegression(C=0.5)
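The imports at the top include brier_score_loss and log_loss, which we don't use above; Kaggle scores this competition's submissions with the Brier score, so they're natural sanity checks on a fitted model. A sketch of how one might compute them, using synthetic data as a stand-in for the three training features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three feature columns and the Outcome column
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X @ np.array([2.0, 1.0, 0.5]) + rng.normal(size=500) > 0).astype(int)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
model = LogisticRegression(C=0.5).fit(X_scaled, y)

# Predicted probability of outcome 1, scored against the truth
p = model.predict_proba(X_scaled)[:, 1]
print("Brier score:", brier_score_loss(y, p))
print("Log loss:  ", log_loss(y, p))
```

In-sample scores like these are optimistic, of course; a season-grouped cross-validation (say, via the LeaveOneGroupOut import above, grouping by Season) would give a more honest estimate.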
And now, the prediction data frame:
pred = model.predict_proba(infer[features].values)
ID = infer.apply(lambda r:
    f'2025_{int(r.TeamID1)}_{int(r.TeamID2)}',
    axis = 1
)
pd.DataFrame({'ID': ID, 'pred': [p[1] for p in pred]})
| ID | pred |
---|---|---|
0 | 2025_1103_1104 | 0.389727 |
1 | 2025_1103_1106 | 0.543172 |
2 | 2025_1103_1110 | 0.539127 |
3 | 2025_1103_1112 | 0.414363 |
4 | 2025_1103_1116 | 0.476587 |
... | ... | ... |
2273 | 2025_1459_1463 | 0.469689 |
2274 | 2025_1459_1471 | 0.453974 |
2275 | 2025_1462_1463 | 0.517301 |
2276 | 2025_1462_1471 | 0.501500 |
2277 | 2025_1463_1471 | 0.484197 |
2278 rows × 2 columns
If we export that to CSV, we get a valid (though certainly not winning) Kaggle submission!
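That final export might look like the following; the variable name submission and the output file name are assumptions, and two rows stand in for the full prediction frame:

```python
import pandas as pd

# Stand-in for the full 2278-row prediction frame built above
submission = pd.DataFrame({
    'ID': ['2025_1103_1104', '2025_1103_1106'],
    'pred': [0.389727, 0.543172],
})

# Kaggle expects exactly the ID and pred columns, so drop the index
submission.to_csv('submission.csv', index=False)
```

The resulting file can be uploaded directly on the competition's submission page.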
Comments
There’s clearly a lot leading up to this point.