Kaggle is a company owned by Google that runs Machine Learning contests. One contest that I’ve competed in most years over the last dozen years is their March Madness Competition, which requires you to make probabilistic predictions for the NCAA basketball tournaments held each March and April.
The bracket below lets competitors visualize their predictions; there are also several example submission files so that you can see how it works without creating your own.
You can also project the tournament from the predicted probabilities to estimate your best and worst possible outcomes, starting either from the beginning of the tournament or from its current state.
Description
The idea behind the competition is to make a probabilistic prediction for each game. If you select the “Men’s eigenbracket 2022” from the menu, for example, and hover over the Duke-Carolina semifinal game, you’ll see that I gave North Carolina only a 30% chance of winning. UNC did win, though, which is why the game is shaded red.
The average log-loss and Brier score give measures of how close the predictions were; smaller is better. If you look at the 2022 scoreboard, you'll find that the eigenbracket finished 64th out of 930 entries.
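Both metrics are simple to compute by hand. Here is a minimal sketch; the function names are my own, not Kaggle's scoring code:

```python
import numpy as np

def log_loss(y_true, p):
    """Average log-loss: -mean(y*log(p) + (1-y)*log(1-p)); smaller is better."""
    y, p = np.asarray(y_true, float), np.asarray(p, float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def brier(y_true, p):
    """Brier score: mean squared difference between prediction and outcome."""
    y, p = np.asarray(y_true, float), np.asarray(p, float)
    return float(np.mean((p - y) ** 2))
```

Note that log-loss punishes confident wrong predictions much more harshly than the Brier score does: predicting 99% for a team that loses costs you \(-\log(0.01) \approx 4.6\) on that game.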
Logistic Massey ratings
The example files are built on predictions using logistic regression applied to Massey ratings. Here is a description of how to compute Massey ratings; this is pretty easy to do with basic linear algebra.
Using the data provided by Kaggle, we can compute the team-by-team Massey ratings for any given season. We can then attach those ratings to the game-by-game tournament data for that season and attempt to relate the probability that one team beats another to the difference between the teams’ Massey ratings.
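The Massey computation itself is a small linear-algebra exercise: accumulate a system of equations from the game margins, then, because the raw system is singular (its rows sum to zero), replace one equation with the constraint that the ratings sum to zero. A minimal sketch, assuming games arrive as (winner, loser, margin) triples rather than Kaggle's actual file layout:

```python
import numpy as np

def massey_ratings(games):
    """Compute Massey ratings from (winner, loser, point_margin) triples.

    Builds the Massey matrix M (diagonal: games played; off-diagonal:
    minus the number of meetings) and the point-differential vector p,
    then solves M r = p after fixing sum(r) = 0.
    """
    teams = sorted({t for g in games for t in g[:2]})
    idx = {t: i for i, t in enumerate(teams)}
    n = len(teams)
    M = np.zeros((n, n))
    p = np.zeros(n)
    for winner, loser, margin in games:
        i, j = idx[winner], idx[loser]
        M[i, i] += 1
        M[j, j] += 1
        M[i, j] -= 1
        M[j, i] -= 1
        p[i] += margin
        p[j] -= margin
    # M is singular (each row sums to zero), so replace the last
    # equation with the constraint that the ratings sum to zero.
    M[-1, :] = 1.0
    p[-1] = 0.0
    r = np.linalg.solve(M, p)
    return dict(zip(teams, r))
```

On a toy round-robin, `massey_ratings([("A", "B", 10), ("B", "C", 5), ("A", "C", 12)])` gives ratings that sum to zero with A rated highest and C lowest, as you'd expect.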
For example, there were 1449 tournament games played from 2003 to 2025. Each dot in the figure below corresponds to one of those games. The horizontal axis represents the first team’s Massey rating minus the second team’s. The vertical axis represents the outcome of the game, either zero (if team 1 lost) or one (if team 1 won). Each game appears twice, once from the perspective of each team, which explains the clear symmetry in the figure. You can hover over the dots to get information on the corresponding games.
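The doubling just described is easy to sketch. Assuming we already have arrays of rating differences and 0/1 outcomes (my own names, not Kaggle's columns), each game \((d, y)\) also contributes its mirror image \((-d, 1-y)\):

```python
import numpy as np

def symmetric_view(rating_diffs, outcomes):
    """Return each game twice, once from each team's perspective:
    the pair (d, y) and its mirror (-d, 1 - y)."""
    d = np.asarray(rating_diffs, float)
    y = np.asarray(outcomes, int)
    return np.concatenate([d, -d]), np.concatenate([y, 1 - y])
```

Besides matching the symmetry of the problem, doubling the data this way forces the fitted curve to pass through \((0, 1/2)\): evenly matched teams get a 50% prediction.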
The idea behind logistic regression for symmetric data like this is to fit a curve of the form \[
f(x) = \frac{1}{1+e^{-ax}}
\] to the data. We then use \(f\) to compute the probability that a given outcome \(y\) is 1 in terms of the input \(x\). That is: \[P(Y = 1 \mid X = x) = f(x).\] The value of \(a\) for the curve in the figure was determined by scikit-learn’s LogisticRegression to be \(a=0.10729123\). Thus, for example, if Team 1 has a Massey rating that’s 10 more than Team 2’s, we would compute \[
P(\text{Team 1 wins}) = \frac{1}{1+e^{-0.10729123\times10}} \approx 0.74515036.
\] Thus, we expect that Team 1 has a nearly 75% chance of winning.
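With the fitted slope in hand, the prediction step is a one-liner; the constant below is the value reported above:

```python
import math

# Slope a fitted by scikit-learn's LogisticRegression, as reported in the text.
A = 0.10729123

def win_probability(rating_diff, a=A):
    """P(Team 1 wins), given Team 1's Massey rating minus Team 2's."""
    return 1.0 / (1.0 + math.exp(-a * rating_diff))
```

For example, `win_probability(10)` returns roughly 0.745, matching the computation above, and the logistic form guarantees consistency: `win_probability(d) + win_probability(-d)` is always 1.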
This technique is never going to be in the money for the Kaggle competition. It does provide a solid baseline, though.