The basic question
Since it’s tournament time, we’ve got 63 (or 67 or 134) opportunities coming up to ask the question:
What’s the probability that this team beats that team?
Kaggle
Personally, I’d love to be able to create submissions to the Kaggle NCAA contest. Doing so, though, is a multi-faceted project requiring
- The collection, curation, and formatting of data,
- The analysis of that data to assess team strengths and weaknesses,
- The translation of those strengths, weaknesses, and potentially other factors to gametime probabilities, and
- Extension of those individual probabilities to the whole tournament, possibly via simulation.
Simulation too
Note that simulation itself is an interesting and important topic. CBS Sports just posted their upset predictions, which claim to be built on 10,000 “simulations” of the tournament. Here’s my Stat 185-level explanation of simulation.
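To make the idea of "simulating a tournament" concrete, here is a minimal sketch of a four-team single-elimination bracket played out 10,000 times. The team names, the win probabilities in `P`, and the helper functions are all made up for illustration; nothing here is meant to reflect how CBS Sports or any Kaggle entry actually works.

```python
import random
from collections import Counter

# Hypothetical win probabilities: P[(a, b)] is the chance that a beats b.
# These numbers are invented for illustration only.
P = {
    ("A", "B"): 0.70,
    ("A", "C"): 0.60,
    ("A", "D"): 0.80,
    ("B", "C"): 0.45,
    ("B", "D"): 0.65,
    ("C", "D"): 0.75,
}

def win_prob(a, b):
    """Probability that team a beats team b, looked up in either order."""
    return P[(a, b)] if (a, b) in P else 1.0 - P[(b, a)]

def play(a, b, rng):
    """Simulate one game and return the winner."""
    return a if rng.random() < win_prob(a, b) else b

def simulate_bracket(teams, rng):
    """Play a single-elimination bracket; adjacent teams meet each round."""
    while len(teams) > 1:
        teams = [play(teams[i], teams[i + 1], rng)
                 for i in range(0, len(teams), 2)]
    return teams[0]

rng = random.Random(185)  # fixed seed so the run is repeatable
n_sims = 10_000
bracket = ["A", "D", "B", "C"]  # A plays D, B plays C, winners meet
champions = Counter(simulate_bracket(bracket, rng) for _ in range(n_sims))

# Fraction of simulated tournaments won by each team
for team, count in champions.most_common():
    print(team, count / n_sims)
```

The fraction of simulated tournaments that a team wins serves as an estimate of its championship probability. For a bracket this small the exact answer can be worked out by hand; simulation earns its keep when the bracket has 64 or more teams.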
On this webpage, though, let’s focus on one particular aspect of that process that requires some knowledge of probability theory and the normal distribution. We specifically consider the following question:
Suppose we have a list of teams and we have a numerical rating associated with each team intended to indicate team strength. How can we use those ratings to answer our fundamental question: What’s the probability that this team beats that team?
Assessing win probabilities
Given two teams with ratings \(R_1\) and \(R_2\), we might expect the probability that team 1 beats team 2 to depend upon the difference \(R_1-R_2\). In order to appropriately assess probabilities associated with this quantity, we should examine its distribution. Let’s take a look at a histogram of the symmetric pairwise differences of the eigen-ratings of all 362 Division 1 teams:
Using it
Hey - that looks normal! In fact, the bell-shaped curve in the figure is exactly the normal curve with mean \(\mu=0\) and standard deviation \(\sigma=2.66758\), in agreement with the data. Note that the mean has to be zero because of the way that the data is formed: the differences are symmetric, so every difference \(R_i-R_j\) appears alongside its negative \(R_j-R_i\).
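The symmetric pairwise differences can be sketched in a few lines of Python. The ratings below are hypothetical stand-ins, not the actual eigen-ratings, and the use of the population standard deviation is an assumption; the page does not say which version produced \(\sigma=2.66758\).

```python
import itertools
import statistics

# Hypothetical ratings, for illustration only (not the actual eigen-ratings).
ratings = [10.2, 7.6, 5.1, 3.3, 8.4, 6.0]

# "Symmetric" means every ordered pair is used, so both R_i - R_j and
# R_j - R_i appear in the list.
diffs = [a - b for a, b in itertools.permutations(ratings, 2)]

mu = statistics.fmean(diffs)      # zero, up to floating-point rounding
sigma = statistics.pstdev(diffs)  # population standard deviation (an assumption)

print(f"number of differences: {len(diffs)}")
print(f"mean: {mu:.6f}")
print(f"standard deviation: {sigma:.6f}")
```

With \(n\) teams there are \(n(n-1)\) ordered differences, and they cancel in pairs, which is why the mean comes out to zero no matter what the ratings are.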
Now, suppose we want to use the eigen-ratings to compute the probability that UNC beats NC State. To do so, let \(R_1 = 10.199467\), let \(R_2 = 7.550299\), and suppose that \(X\) is normally distributed with mean \(\mu=0\) and standard deviation \(\sigma=2.66758\). I guess we could express the probability that we want as \[
P(X < R_1-R_2).
\]
Finishing up the computation
Computing the \(Z\)-score for \(R_1-R_2\), we get \[
Z = \frac{10.199467 - 7.550299}{2.66758} = \frac{2.649168}{2.66758} \approx 0.9931.
\] Looking this up in a standard normal table or using a normal calculator, we get \(0.839669\) or about \(84\%\).
Of course, that’s not what happened this past weekend, but that’s why we love sports!
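For readers who would rather not use a normal table, the computation above can be sketched with Python's standard library. The function name `win_probability` is just an illustrative choice; the ratings and \(\sigma\) are the values quoted above.

```python
from statistics import NormalDist

SIGMA = 2.66758  # standard deviation of the rating differences, from the text

def win_probability(r1, r2, sigma=SIGMA):
    """Estimate P(team 1 beats team 2) as P(X < r1 - r2), X ~ N(0, sigma^2)."""
    return NormalDist(mu=0.0, sigma=sigma).cdf(r1 - r2)

unc, nc_state = 10.199467, 7.550299
p = win_probability(unc, nc_state)
print(f"P(UNC beats NC State) = {p:.4f}")
```

This should reproduce the roughly \(84\%\) figure found above. Swapping the two arguments gives the complementary probability, since the normal curve is symmetric about zero.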