Kaggle’s competition

In the few weeks leading up to Spring break, we discussed a number of topics, and I also briefly mentioned Kaggle’s March Madness Competition. Today, we’re going to take a quick look at the competition and some code that generates a submission.

The competition

Let’s take a quick look at the competition page on Kaggle.

Creating a submission

Obviously, we need to load some libraries, though we won’t actually use all of them here:

import numpy as np
import pandas as pd
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import LeaveOneGroupOut

Let’s read and examine some data:

# 'M' selects the men's data; presumably 'W' would select the women's
G = 'M'
train = pd.read_csv(f'https://marksmath.org/data/{G}prior_tourney_games2025.csv')
infer = pd.read_csv(f'https://marksmath.org/data/{G}potential_tourney_games2025.csv')
train
Season WTeamID LTeamID WinPctDiff PPGDiff AvgPointDiffDiff FGPctDiff ThreePctDiff FTPctDiff ORPGDiff DRPGDiff APGDiff TOPGDiff SPGDiff BPGDiff PFGDiff SeedDiff Outcome team_ratingDiff conf_ratingDiff
0 2010 1115 1457 -0.035417 1.839583 -0.914583 0.026144 0.039044 0.024015 -0.290720 0.287879 1.748106 3.059659 -0.422348 -0.298295 2.296402 0 1 -0.000950 -0.008247
1 2010 1124 1358 0.024194 -3.183180 2.589862 0.016923 0.000871 0.024458 1.322989 3.203448 -3.491954 1.306897 -0.248276 4.102299 -0.981609 11 1 0.005936 0.024009
2 2010 1139 1431 0.062500 -5.687500 -1.531250 -0.023173 -0.005059 0.075933 0.904203 0.084066 0.020528 0.880743 -0.751711 -0.954057 1.739003 7 1 0.000213 -0.005638
3 2010 1140 1196 0.212121 11.000000 10.757576 0.038012 0.102192 0.083549 0.214286 7.741379 5.217980 2.280788 2.620690 1.448276 6.238916 3 1 0.002368 -0.004671
4 2010 1242 1250 0.253676 6.639706 12.876838 0.032261 0.008564 -0.014871 8.267857 13.111607 9.325893 6.843750 5.424107 5.750000 8.799107 15 1 0.011535 0.024281
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1863 2024 1181 1301 0.138889 3.482639 8.739583 0.032591 0.031070 -0.011645 1.064516 2.064516 3.290323 1.354839 -0.806452 0.354839 0.322581 7 0 0.003471 0.000000
1864 2024 1397 1345 -0.128788 -3.925189 -1.648674 -0.043992 -0.066100 0.028050 -1.774194 -4.354839 -3.387097 -1.903226 0.290323 0.354839 -0.161290 -1 0 -0.000953 -0.000433
1865 2024 1104 1163 -0.255515 9.279412 -7.371324 -0.018671 -0.001546 0.041092 -1.390086 -5.880388 -5.750000 -0.477371 -0.181034 -2.058190 -2.876078 -3 0 -0.001502 -0.000763
1866 2024 1301 1345 -0.267677 -7.032828 -9.575758 -0.039121 -0.061963 0.012308 -4.064516 -7.322581 -7.935484 -3.741935 0.096774 -0.967742 -1.612903 -10 0 -0.005007 -0.000788
1867 2024 1345 1163 -0.032977 1.923351 -3.816399 -0.007664 0.041141 -0.021258 0.727823 -0.253024 -1.201613 1.055444 -0.475806 -1.921371 -2.628024 0 0 0.000203 -0.000330

1868 rows × 20 columns

Note that this data is all derived from Kaggle’s raw data; that derivation in and of itself represents quite a bit of work.
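To give a flavor of that derivation: each column above is a difference of per-team season aggregates. Here’s a minimal sketch of how one such feature might be computed from Kaggle’s regular-season results file (the file and column names follow Kaggle’s conventions, but this is illustrative rather than the exact pipeline that produced the files above):

import pandas as pd

reg = pd.read_csv('MRegularSeasonDetailedResults.csv')  # raw Kaggle game results

# Stack winner and loser rows so each (Season, TeamID) appearance is one row.
w = reg[['Season', 'WTeamID', 'WScore']].rename(columns={'WTeamID': 'TeamID', 'WScore': 'Score'})
l = reg[['Season', 'LTeamID', 'LScore']].rename(columns={'LTeamID': 'TeamID', 'LScore': 'Score'})
games = pd.concat([w, l])

# Points per game for each team in each season.
ppg = games.groupby(['Season', 'TeamID'])['Score'].mean()

# A "Diff" feature for a matchup is team 1's average minus team 2's.
def ppg_diff(season, team1, team2):
    return ppg[(season, team1)] - ppg[(season, team2)]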

The data frames train and infer are very similar. Let’s check out the relationships:

print("Things in train but not infer: ", set(train.columns).difference(set(infer)))
print("Things in infer but not train: ", set(infer.columns).difference(set(train)))
print("Things in common: ", set(train.columns).intersection(set(infer)))
Things in train but not infer:  {'WTeamID', 'Outcome', 'LTeamID'}
Things in infer but not train:  {'TeamID1', 'TeamID2'}
Things in common:  {'conf_ratingDiff', 'SeedDiff', 'ORPGDiff', 'FGPctDiff', 'team_ratingDiff', 'SPGDiff', 'BPGDiff', 'APGDiff', 'ThreePctDiff', 'DRPGDiff', 'PPGDiff', 'Season', 'AvgPointDiffDiff', 'PFGDiff', 'FTPctDiff', 'WinPctDiff', 'TOPGDiff'}

OK, let’s build and train a model:

features = ['team_ratingDiff', 'SeedDiff', 'WinPctDiff']
X = train[features].values
y = train["Outcome"].values

# Standardize the features so the regularization treats them comparably.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

model = LogisticRegression(C=0.5)
model.fit(X_scaled, y)
model
LogisticRegression(C=0.5)
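Incidentally, we imported brier_score_loss and log_loss up top but haven’t used them yet. Here’s a quick in-sample check of the fit (just a sketch; in-sample numbers are optimistic, and a proper evaluation would score held-out seasons):

p_train = model.predict_proba(X_scaled)[:, 1]
print("Brier score:", brier_score_loss(y, p_train))
print("Log loss:", log_loss(y, p_train))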

And now, the prediction data frame. Note that the inference features must pass through the same scaler that we fit on the training data:

pred = model.predict_proba(scaler.transform(infer[features].values))
ID = infer.apply(lambda r:
   f'2025_{int(r.TeamID1)}_{int(r.TeamID2)}', 
   axis = 1
)
pd.DataFrame({'ID': ID, 'pred': [p[1] for p in pred]})
ID pred
0 2025_1103_1104 0.389727
1 2025_1103_1106 0.543172
2 2025_1103_1110 0.539127
3 2025_1103_1112 0.414363
4 2025_1103_1116 0.476587
... ... ...
2273 2025_1459_1463 0.469689
2274 2025_1459_1471 0.453974
2275 2025_1462_1463 0.517301
2276 2025_1462_1471 0.501500
2277 2025_1463_1471 0.484197

2278 rows × 2 columns

If we export that to CSV, we get a valid (though certainly not winning) Kaggle submission!
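Concretely, that export might look like so, assuming we store the data frame above in a variable (the name submission is my choice here):

submission = pd.DataFrame({'ID': ID, 'pred': pred[:, 1]})
submission.to_csv('submission.csv', index=False)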

Comments

There’s clearly a lot of work leading up to this point:

  • Data processing, and
  • Determination of hyperparameters via cross-validation (see the sketch after this list), such as
    • parameters in the eigenvalue computation behind the team ratings, and
    • the regularization constant \(C\), which limits the size of the model’s coefficient vector to fight overfitting; in scikit-learn, smaller values of \(C\) mean stronger regularization.
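For example, here’s a sketch of how a choice like \(C = 0.5\) might be checked with the LogisticRegressionCV and LeaveOneGroupOut tools we imported at the top. Grouping by Season holds out one tournament year per fold, which respects the time structure of the data; the grid of candidate \(C\) values is just an illustration:

logo = LeaveOneGroupOut()
cv_model = LogisticRegressionCV(
    Cs=np.logspace(-3, 3, 13),  # grid of candidate C values
    cv=logo.split(X_scaled, y, groups=train['Season']),
    scoring='neg_log_loss',     # optimize probabilistic accuracy
)
cv_model.fit(X_scaled, y)
cv_model.C_  # the selected regularization constant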