Since it's tournament time, we've got 63 (or 67 or 134) opportunities coming up to ask the question:

What's the probability that this team beats that team?

Ultimately, we'd love to be able to create submissions to the Kaggle NCAA contest. Of course, doing so is a multi-faceted project requiring

- The collection, curation, and formatting of data,
- The analysis of that data to assess team strengths and weaknesses,
- The translation of those strengths, weaknesses, and potentially other factors to gametime probabilities.

In this notebook, we focus on one particular aspect of that process that requires some knowledge of probability theory and the normal distribution. We specifically consider the following question:

Suppose we have a list of teams and a numerical rating associated with each team, intended to indicate team strength. How can we use those ratings to answer our fundamental question: What's the probability that this team beats that team?

To illustrate what we mean by ratings, consider the following list of ACC teams:

Team | Record | Rating |
---|---|---|
Duke | 16-4 | 0.80 |
UNC | 15-5 | 0.75 |
Notre Dame | 15-5 | 0.75 |
Miami FL | 14-6 | 0.70 |
Wake Forest | 13-7 | 0.65 |
Virginia | 12-8 | 0.60 |
Virginia Tech | 11-9 | 0.55 |
Florida State | 10-10 | 0.50 |
Syracuse | 9-11 | 0.45 |
Clemson | 8-12 | 0.40 |
Louisville | 6-14 | 0.30 |
Boston College | 6-14 | 0.30 |
Pittsburgh | 6-14 | 0.30 |
Georgia Tech | 5-15 | 0.25 |
NC State | 4-16 | 0.20 |

For each team, we see three items:

- the team name,
- the team's ACC win/loss record,
- and a numerical rating.

In this particular example, the rating is simply the team's winning percentage. While quite simple, this can work reasonably well when the teams in the list all play one another multiple times.

Again, the question is: Given a pair of teams, how might we assess the probability that one team beats the other based on this winning percentage? For example, Duke has a winning percentage of 0.8 and NC State has a winning percentage of 0.2. Based on that, what should be our assessment of the probability that Duke would beat NC State if they were to play again?

The win/loss rating above is just meant to be a simple illustration. While it can work fairly well in small, isolated examples, it's not likely to work well in larger, more complicated examples. If we take a close look at this year's tournament, we might notice that Winthrop is 23-10, while UNC is 18-10. Thus, Winthrop has a higher winning percentage and, therefore, a higher rating based on winning percentage alone. It's easy to find these kinds of examples since most games during the season are *within* conferences, rather than *between* conferences. Thus, Winthrop obtained most of its 23 wins by defeating Big South teams, rather than ACC teams.
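As a small sketch of the arithmetic behind this comparison, the snippet below computes winning percentages from the records quoted above. The function name `winning_percentage` is just an illustrative choice, not something from the original analysis.

```python
def winning_percentage(wins: int, losses: int) -> float:
    """Fraction of games won; this is the naive 'rating' discussed above."""
    return wins / (wins + losses)


# Records quoted in the text.
duke_acc = winning_percentage(16, 4)    # Duke's ACC record
winthrop = winning_percentage(23, 10)   # Winthrop's overall record
unc = winning_percentage(18, 10)        # UNC's overall record

print(f"Duke (ACC): {duke_acc:.3f}")
print(f"Winthrop:   {winthrop:.3f}")
print(f"UNC:        {unc:.3f}")

# Winthrop comes out ahead of UNC on this measure, even though the two
# teams faced very different schedules.
print("Winthrop rated above UNC:", winthrop > unc)
```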

Before working on win probabilities, let's build on someone else's work to find team ratings. Specifically, let's use FiveThirtyEight's NCAA Forecast.

**Note**: We are *not* going to simply copy probabilities from the interactive bracket; rather we're going to use the *Power Rating* column we can find by switching to the table. The power rating is also contained in this downloadable CSV file. Here's the relevant part of the whole table sorted by Power Rating:

team_name | team_rating |
---|---|
Gonzaga | 96.47 |
Kansas | 91.72 |
Kentucky | 91.23 |
Arizona | 91.39 |
Auburn | 89.60 |
Villanova | 90.22 |
Purdue | 89.44 |
Iowa | 88.55 |
Tennessee | 88.55 |
UCLA | 89.84 |
Houston | 88.15 |
Duke | 89.34 |
Texas Tech | 88.70 |
Baylor | 87.92 |
Illinois | 87.07 |
Louisiana State | 85.72 |
Arkansas | 86.78 |
Wisconsin | 84.64 |
Texas | 86.31 |
Connecticut | 86.45 |
Alabama | 85.15 |
Michigan | 84.73 |
Virginia Tech | 84.68 |
North Carolina | 83.99 |
Ohio State | 84.17 |
Memphis | 85.47 |
Saint Mary's (CA) | 84.32 |
Loyola (IL) | 83.70 |
Providence | 82.72 |
Indiana | 83.01 |
Michigan State | 83.49 |
San Diego State | 83.49 |
Southern California | 83.43 |
Texas Christian | 81.92 |
Seton Hall | 82.84 |
Marquette | 81.92 |
Boise State | 82.49 |
Creighton | 81.49 |
Murray State | 81.37 |
Davidson | 81.92 |
Alabama-Birmingham | 81.15 |
San Francisco | 83.00 |
Miami (FL) | 81.16 |
Iowa State | 80.59 |
Colorado State | 81.65 |
Notre Dame | 81.64 |
Rutgers | 81.12 |
Richmond | 79.71 |
South Dakota State | 79.68 |
Vermont | 80.32 |
Chattanooga | 78.68 |
Wyoming | 78.49 |
New Mexico State | 77.67 |
Colgate | 76.39 |
Akron | 76.67 |
Saint Peter's | 74.19 |
Yale | 74.44 |
Montana State | 74.31 |
Jacksonville State | 73.16 |
Wright State | 73.50 |
Longwood | 73.41 |
Delaware | 73.58 |
Georgia State | 73.48 |
Norfolk State | 71.42 |
Texas Southern | 70.37 |
Cal State Fullerton | 71.79 |
Bryant | 71.56 |
Texas A&M-Corpus Christi | 67.32 |

*Note*: If you are curious about where these types of ratings might come from in the first place, you can read FiveThirtyEight's well-documented methodology. You can also read about my variation on the Page Rank algorithm that I use to build my brackets.

Again, though, we are focused on turning the ratings into win probabilities at the moment.

Here is, perhaps, the simplest way to translate ratings into win probabilities: Suppose that Team 1 has rating $R_1$ and that Team 2 has rating $R_2$. Let $P_{12}$ denote the probability that Team 1 defeats Team 2 and let $P_{21}$ denote the probability that Team 2 defeats Team 1. Then we might suppose that $$ P_{12} = \frac{R_1}{R_1 + R_2} \text{ and } P_{21} = \frac{R_2}{R_1 + R_2}. $$ This looks good in that it at least obeys the laws of probability. That is,

- $0 \leq P_{ij} \leq 1$
- $R_1 < R_2 \implies P_{12} < \frac{1}{2} < P_{21}$
- $P_{12} + P_{21} = 1$.

That all looks pretty good! If we examine some particular cases, though, we'll see that it doesn't separate strong teams from weak ones nearly enough. Let's take a look, for example, at the probability that the highest rated team (Gonzaga) defeats the lowest rated team (Texas A&M-Corpus Christi):

team_name | team_rating |
---|---|
Gonzaga | 96.47 |
Texas A&M CC | 67.32 |

Using those ratings, we have $$ \frac{96.47}{96.47 + 67.32} \approx 0.58899. $$ Well, it certainly seems like Gonzaga has a much better chance of winning that game than that!
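The ratio formula is easy to try out in code. Here is a minimal sketch, assuming the two ratings from the table above; the function name `ratio_win_probability` is made up for illustration.

```python
def ratio_win_probability(r1: float, r2: float) -> float:
    """Probability that Team 1 beats Team 2 under the simple ratio model."""
    return r1 / (r1 + r2)


gonzaga = 96.47   # highest rating in the table
tamu_cc = 67.32   # lowest rating in the table

p12 = ratio_win_probability(gonzaga, tamu_cc)
p21 = ratio_win_probability(tamu_cc, gonzaga)

print(f"P(Gonzaga wins)      = {p12:.4f}")  # roughly 0.589
print(f"P(Texas A&M CC wins) = {p21:.4f}")
print(f"Sum                  = {p12 + p21:.4f}")
```

Because every rating in the table lies between roughly 67 and 97, this model can never produce a probability much beyond 0.59, which is why it is too timid for lopsided matchups.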

Generally, we might expect $P_{12}$ to depend upon the difference $R_1-R_2$. In order to appropriately assess probabilities associated with this quantity, we should examine its distribution. Let's take a look at a histogram of the pairwise differences of the ratings:

Hey - this looks *normally distributed*! So, to compute $P_{12}$, let's first compute $R_1 - R_2$, and then assess
$$
P(X < R_1-R_2),
$$
where $X$ is normally distributed with mean and standard deviation determined by the pairwise difference data.

Not surprisingly, the mean of the pairwise differences is zero; it's really been constructed that way. The standard deviation (for this particular data set) is about $8.5549$.
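For readers who want to reproduce this kind of computation, here is a rough sketch using only the Python standard library. It makes two assumptions that the text does not confirm: it uses a small illustrative subset of the ratings (so the standard deviation it prints will differ from the figure quoted above), and it takes all ordered pairs of distinct teams with the population standard deviation.

```python
from itertools import permutations
from statistics import fmean, pstdev

# Illustrative subset of the power ratings from the table above.
# The figure quoted in the text is based on the full list of teams.
ratings = [96.47, 91.72, 89.34, 83.99, 73.48, 67.32]

# All ordered pairwise differences R_i - R_j for distinct teams i and j.
# Each pair appears twice with opposite signs, so the mean is zero.
pairwise = [r1 - r2 for r1, r2 in permutations(ratings, 2)]

mean_diff = fmean(pairwise)
sd_diff = pstdev(pairwise)

print(f"number of differences: {len(pairwise)}")
print(f"mean of differences:   {mean_diff:.6f}")
print(f"std of differences:    {sd_diff:.4f}")
```

Swapping in the full column of ratings from the CSV file would be the natural next step.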

Now, let's reconsider the probability that Gonzaga beats Texas A&M-Corpus Christi. Recall that the ratings are

- Gonzaga: $R_1 = 96.47$ and
- Texas A&M CC: $R_2 = 67.32$.

We can now compute a $Z$-score: $$ Z = \frac{96.47 - 67.32}{8.5549} \approx 3.4074. $$ If we plug that into our normal probability calculator, we find that we get a probability of over $0.999$ - which certainly seems much more believable!
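In place of a normal probability calculator, the same computation can be sketched with `statistics.NormalDist` from the Python standard library. The function name `win_probability` is hypothetical, and the standard deviation is simply the value quoted above rather than something recomputed here.

```python
from statistics import NormalDist

SIGMA = 8.5549  # standard deviation of the pairwise differences, from the text


def win_probability(r1: float, r2: float, sigma: float = SIGMA) -> float:
    """P(X < r1 - r2), where X is normal with mean 0 and the given sigma."""
    return NormalDist(mu=0.0, sigma=sigma).cdf(r1 - r2)


gonzaga, tamu_cc = 96.47, 67.32

z = (gonzaga - tamu_cc) / SIGMA
p = win_probability(gonzaga, tamu_cc)

print(f"Z-score:         {z:.4f}")
print(f"P(Gonzaga wins): {p:.5f}")

# Two evenly rated teams come out as a coin flip.
print(f"Equal ratings:   {win_probability(80.0, 80.0):.2f}")
```

Since the normal distribution is symmetric about zero, `win_probability(r1, r2) + win_probability(r2, r1)` equals one, so this model still obeys the basic properties listed for the ratio model.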