Numerical ranking of sports teams

I've been using numerical analysis to rank college sports teams for a few years now - for fun if not yet profit. This webpage contains a brief description some of the ideas involved.

Of course, the best time of year to do this is March when the NCAA tournament comes around. Using some extensions of these techniques, I placed 91st out of 598 entries in Kaggle's March Machine Learning Mania 2016 competition. If you want to try this yourself, they've got well organized data to get started.

Of course, you've got to realize that a good algorithm isn't necessarily going to do well in a pool - maybe above average, but the random nature of the beast means that someone's kid might win the thing based on team colors.

On the other hand, these techniques do generate results that are reasonably well aligned with polls - polls that (for better or worse) account for history, tradition, and human biases - as well as results. The techniques here are pure algebra. As Wes Colley put it in his paper detailing his technique - "that two such processes produce even remotely consistent results is, frankly, remarkable."

A basic example

Let's start with a very small and simple example. Suppose that three competitors or teams (numbered zero, one, and two) compete against each other twice each. How might we rank them based on the results that we see? For concreteness, suppose the results are the following:

  • Team zero beat team one twice and team two once,
  • Team one beat team two once,
  • Team two beat team zero and team one once each.

One thing that's nice about this example is that it seems reasonably clear that team zero should be ranked the highest, having beaten both competitors a total of 3 times out of 4 possible tries. Team one, on the other hand won only 1 game out of it's four tries, while team two seems to be in the middle, having split it's game with both competitors. Certainly, the teams listed from worst to first should be: $1$, $2$, $0$. Any reasonable numerical technique to evaluate these teams ought to generate that ranking.

Mathematical forumulations

One way to represent the data we see is with a directed graph, like follows: A simple directed graph

In this figure, the nodes are labeled by the teams. We place an edge from team $i$ to team $j$ and labeling that edge with the number of times that team $i$ beat team $j$. We supress zero edges.

There an obvious matrix associated with a directed graph called the adjacency matrix. The rows and column will be indexed by the teams. In row $i$ and column $j$, we place the label on the edge from team $i$ to team $j$. The adjacency matrix for this digraph is

$$\left(\begin{matrix} 0 & 2 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{matrix}\right)$$

Eigen-formulation

It turns out there's a lovely way to rank the teams usefully by applying a little linear algebra to this matrix. The basic idea, together with some generalizations and improvements, is described in The Perron-Frobenius Theorem and the Ranking of Football Teams by James Keener, though the idea dates back until at least the 1930s. The following outline follows Keener's most basic version.

Generally, let's suppose we have $N$ participants in a collection of contests - football, basketball, tiddly winks, what have you. We also suppose there is a vector $\vec{r}$ of rankings with positive entries $r_j$. Conceptually, $r_j$ represents the actual strength of the $j^{\text{th}}$ competitor. We wish to assign a positive score $s_i$ to each competitor. If competitor $i$ beats competitor $j$, we expect that to contribute positively to the score of competitor $i$. Furthermore, the stronger competitor $j$ was, the more we expect the contribution to be. Symbolically, we might write:

$$s_i = \frac{1}{n_i}\sum_{j=1}^N a_{ij}r_j.$$

Thus, $s_i$ is a linear combination of the strengths of it's opponents. The normalization factor $n_i$ is the number of games that team $i$ played; we include it because we don't want a team's ranking to be higher simply because it played more games.

A key issue, of course, is how should the matrix of coefficients $A=(a_{ij})$ be chosen? Certainly, if team $i$ defeated team $j$ every time they played (there might be more than one contest between the two), then we expect $a_{ij}>0$. Beyond that, there's a lot of flexibility and the precise choice is one for experimentation and (hopefully) optimization. In the simple approach that follows, we'll take $a_{ij}$ to simply be the number of times that team $i$ defeated team $j$.

Finally, it seems reasonable to hope that our score $s_i$ of the $i^{\text{th}}$ team should be related to the actual strength $r_i$ of that team. Let's assume that they are, in fact, proportional: $\vec{s} = \lambda \vec{r}$. Put another way, we have $$A\vec{r} = \lambda\vec{r}.$$ That is, the actual ranking of strengths vector $\vec{r}$ is an eigen-vector of $A$!

Computing the eigen-ranking with Python

Recall that the adjacency matrix associated with our simple example, written in code, is:

In [1]:
M = [
    [0,2,1],
    [0,0,1],
    [1,1,0]
]

According to the theory, this should have a unique positive eigenvalue whose magnitude is larger than the magnitude of any other eigenvalue. There should be an associated eigenvector with all positive entries. Of course, if $\vec{v}$ is an eigenvector, then so is $-\vec{v}$ (or any constant multiple). The theory tells us that we might as well just take the absolute value.

Here's the eigensystem for our simple example:

In [2]:
from scipy.linalg import eig
eig(M)
Out[2]:
(array([-0.88464618+0.58974281j, -0.88464618-0.58974281j,  1.76929235+0.j        ]),
 array([[ 0.63083491+0.j        ,  0.63083491-0.j        , -0.72356278+0.j        ],
        [-0.25319498+0.46743038j, -0.25319498-0.46743038j, -0.33963778+0.j        ],
        [-0.05167574-0.56283042j, -0.05167574+0.56283042j, -0.60091853+0.j        ]]))

The result is a pair: a list of the eigenvalues and a matrix whose columns are the eigenvectors. Note that the last displayed eigenvalue, about $1.769$, has zero imaginary part and is clearly larger in absolute value than the other two, which are complex conjugates. The corresponding eigenvector has components all with the same sign. The absolute value of that eigenvector should be reasonable strengths associated with the teams, approximately: $$0.7235, \; 0.3396, \; 0.6009.$$ As expected, the zeroth team is the strongest, while the middle team is the weakest.

Colley's technique

In 2001, Wes Colley (who was then an astrophysicist at MIT) published a paper outlining his ideas for sports rankings. From then until 2013, it was incorporated as part of the BCS formula for ranking college football teams. It was the only computer poll used in the BCS whose details were fully published. In fact, you can still download his paper.

The basic idea is to improve and extend the concept of winning percentage. That is we start with the ranking $r_i$ of team $i$ given by $$r_i = \frac{w_i}{t_i},$$ where $w_i$ is the number of games that team $i$ won and $t_i$ the total number of games that team $i$ played. Colley argues that this can be improved by simply using $$r_i = \frac{w_i+1}{t_i+2}.$$ As he states, this was not Colley's idea originally but is generally credited to Laplace as his rule of succession. Note, for example, that each team starts with a ranking of $1/2$ at the beginning of the season. If team $i$ loses to team $j$ in the first game, then their respective rankings will be $1/3$ and $2/3$. This seems better than $0$ and $1$; do we really think that team $j$ is inifinitely better than team $j$? Generally still, the rankings hover around $1/2$.

Now, let's try to introduce strength of schedule. To this end, we're going to re-write $w_i$ in a slightly funny way involing $t_i$ and $l_i$ (the number of games that team $i$ lost): $$\begin{align} w_i &= \frac{w_i-l_i}{2} + \frac{w_i+l_i}{2} \\ &= \frac{w_i-l_i}{2} + \frac{1}{2}t_i \\ &= \frac{w_i-l_i}{2} + \sum_{j=1}^{t_i}\frac{1}{2}. \end{align}$$ The point of this strange rewrite is to get to that $1/2$ at the end, which is of course, the average ranking of the teams. Colley argues that, rather than the average ranking of the teams - each of these should be the actual ranking of the corresponding team. That is, beating a good team should contribute more to your score than beating a bad team. Substituting that into Colley's first improvement of winning percentage, we get $$r_i = \frac{\left((w_i-l_i)/2 + \sum_{j=1}^{t_i}r_i\right)+1}{t_i+2}.$$ This can be nicely formulated as a matrix equation: $$C\vec{r} = \vec{b},$$ where $\vec{r}$ is the ranking vector, $\vec{b}$ is the vector with $b_i = 1+(w_i-l_i)/2$, and $$ C_{ij} = \begin{cases} 2+t_i & i=j \\ -n_{ij} & i\neq j. \end{cases} $$ We simply solve for $\vec{r}$.

Computing Colley's ranking with Python

Remember our simple example: A simple directed graph

Colley's matrix $C$ and the vector $b$ associated with this set of outcomes is

In [3]:
C=[
    [6,-2,-2],
    [-2,6,-2],
    [-2,-2,6]
]
b = [1+(3-1)/2,1+(1-3)/2,1+(2-2)/2]
b
Out[3]:
[2.0, 0.0, 1.0]

Here's the solution:

In [4]:
from scipy.linalg import solve
solve(C,b)
Out[4]:
array([ 0.625,  0.375,  0.5  ])

Again, we see that team $0$ is the strongest, team $1$ is the weakest, while team $2$ is in the middle.

A larger data set

I've got a JSON file that contains all the results of the 2016/17 Men's NCAA basketball season - up to, but not including, the tournament. You can download a zipped copy. Let's import it and view UNCA's results:

In [5]:
import json
with open('ncaa_data_2017.json', 'r') as infile:
    ncaa_info_dict = json.load(infile)
ncaa_info_dict['unc-asheville']
Out[5]:
[{'opponent': 'vcu', 'result': 'L', 'score': [65, 80]},
 {'opponent': 'georgia', 'result': 'L', 'score': [46, 60]},
 {'opponent': 'usc-upstate', 'result': 'W', 'score': [73, 57]},
 {'opponent': 'siena', 'result': 'W', 'score': [80, 92]},
 {'opponent': 'furman', 'result': 'W', 'score': [64, 58]},
 {'opponent': 'kansas', 'result': 'L', 'score': [57, 95]},
 {'opponent': 'brevard', 'result': 'W', 'score': [50, 68]},
 {'opponent': 'elon', 'result': 'W', 'score': [78, 67]},
 {'opponent': 'warren-wilson', 'result': 'W', 'score': [52, 117]},
 {'opponent': 'western-caro', 'result': 'W', 'score': [57, 59]},
 {'opponent': 'unc-greensboro', 'result': 'L', 'score': [75, 73]},
 {'opponent': 'ohio-st', 'result': 'L', 'score': [77, 79]},
 {'opponent': 'liberty', 'result': 'W', 'score': [70, 57]},
 {'opponent': 'high-point', 'result': 'W', 'score': [58, 88]},
 {'opponent': 'longwood', 'result': 'W', 'score': [68, 89]},
 {'opponent': 'charleston-so', 'result': 'W', 'score': [76, 67]},
 {'opponent': 'winthrop', 'result': 'L', 'score': [73, 76]},
 {'opponent': 'campbell', 'result': 'W', 'score': [56, 72]},
 {'opponent': 'presbyterian', 'result': 'W', 'score': [73, 47]},
 {'opponent': 'radford', 'result': 'W', 'score': [69, 80]},
 {'opponent': 'charleston-so', 'result': 'W', 'score': [73, 91]},
 {'opponent': 'high-point', 'result': 'W', 'score': [74, 71]},
 {'opponent': 'longwood', 'result': 'W', 'score': [91, 69]},
 {'opponent': 'presbyterian', 'result': 'W', 'score': [48, 89]},
 {'opponent': 'campbell', 'result': 'W', 'score': [66, 53]},
 {'opponent': 'gardner-webb', 'result': 'L', 'score': [76, 81]},
 {'opponent': 'liberty', 'result': 'W', 'score': [45, 63]},
 {'opponent': 'campbell', 'result': 'L', 'score': [81, 79]}]

The Big South

Before we deal with the whole shooting match, let's focus on a larger, but still somewhat reasonable data set - The Big South. We can trim our data to focus on just the big south like so:

In [6]:
big_south = [
 'campbell',
 'charleston-so',
 'gardner-webb',
 'high-point',
 'liberty',
 'longwood',
 'presbyterian',
 'radford',
 'unc-asheville',
 'winthrop'
]
def trimmed(team_key):
    return (team_key, 
        [entry for entry in ncaa_info_dict[team_key] if entry['opponent'] in big_south])
just_big_south_dict = dict([trimmed(team) for team in big_south])

Here's a directed graph illustrating the Big South season: A directed graph representing Big South basketball results

Eigen-rankings

Here is the same information represented as a sparse matrix:

In [7]:
from scipy.sparse import dok_matrix
n = len(big_south)
M = dok_matrix((n,n))
for i in range(n):
    team = big_south[i]
    games = just_big_south_dict[team]
    games = [game for game in games if game['opponent'] in big_south]
    number_of_games = len(games)
    for game in games:
        opponent = game['opponent']
        win_or_loss = game['result']
        if win_or_loss == "W":
            j = big_south.index(opponent)
            M[i,j] = M[i,j] + 1 # /number_of_games
                                # Ignore number of games for ease of visualization
print(M.toarray().astype(int))
[[0 1 1 0 0 1 3 3 1 0]
 [1 0 0 1 0 3 1 1 0 0]
 [1 2 0 1 0 2 2 2 1 0]
 [2 0 1 0 0 1 2 1 0 0]
 [2 2 1 2 0 2 2 1 0 0]
 [1 0 0 1 0 0 1 0 0 0]
 [0 0 0 0 0 1 0 0 0 0]
 [0 1 0 0 1 2 2 0 0 1]
 [2 2 0 2 2 2 2 1 0 0]
 [3 3 2 1 2 2 2 1 1 0]]

And here are the rankings:

In [8]:
import numpy as np
from scipy.sparse.linalg import eigs
val,vec = eigs(M, which='LM', k=1)
vec = np.ndarray.flatten(abs(vec))
sorted_indices = vec.argsort()
ranked = [(big_south[i], vec[i]) for i in sorted_indices]
ranked.reverse()
ranked
Out[8]:
[('winthrop', 0.62941620708203516),
 ('unc-asheville', 0.4114976449952073),
 ('liberty', 0.34797293887544561),
 ('gardner-webb', 0.31849762519005814),
 ('campbell', 0.28688894805888754),
 ('radford', 0.22626872047275146),
 ('high-point', 0.20751489041619828),
 ('charleston-so', 0.16695179522102499),
 ('longwood', 0.085633967755795279),
 ('presbyterian', 0.01441223425634861)]

Colley rankings

Now, we'll do the Colley rankings:

In [9]:
def colley(i,j):
    teami = big_south[i]
    games = just_big_south_dict[teami]
    if i==j:
        return len(games)+2
    else:
        games = just_big_south_dict[teami]
        teamj = big_south[j]
        gamesij = [game['opponent'] for game in games if game['opponent']==teamj]
        return -len(gamesij)

n = len(big_south)
M2 = np.zeros((n,n))
for i in range(n):
    for j in range(n):
        M2[i,j] = colley(i,j)
def colley_b(i):
    games = just_big_south_dict[big_south[i]]
    n_wins = len([game for game in games if game['result']=='W'])
    n_losses = len([game for game in games if game['result']=='L'])
    return 1+0.5*(n_wins-n_losses)
b = np.array([colley_b(i) for i in range(len(big_south))])
In [10]:
from scipy.linalg import solve
vec = solve(M2,b)

vec = np.ndarray.flatten(abs(vec))
sorted_indices = vec.argsort()
ranked = [(big_south[i], vec[i]) for i in sorted_indices]
ranked.reverse()
ranked
Out[10]:
[('winthrop', 0.8502368258102091),
 ('unc-asheville', 0.72840291689368097),
 ('liberty', 0.66044964896633385),
 ('gardner-webb', 0.63055737170699999),
 ('campbell', 0.46671250975351558),
 ('high-point', 0.46325013124311759),
 ('charleston-so', 0.43015070620786722),
 ('radford', 0.41785997436270134),
 ('longwood', 0.21435437853077735),
 ('presbyterian', 0.13802553652480268)]

The whole NCAA

Eigen ranking

In [11]:
teams = list(ncaa_info_dict.keys())
n = len(teams)
n
Out[11]:
637
In [12]:
from scipy.sparse import dok_matrix
M = dok_matrix((n,n))
for i in range(n):
    team = teams[i]
    games = ncaa_info_dict[team]
    # games = [game for game in games if game['opponent'] in div1_basketball_schools]
    number_of_games = len(games)
    for game in games:
        opponent = game['opponent']
        win_or_loss = game['result']
        if win_or_loss == "W":
            j = teams.index(opponent)
            M[i,j] = M[i,j] + 1/number_of_games
M
Out[12]:
<637x637 sparse matrix of type '<class 'numpy.float64'>'
	with 4596 stored elements in Dictionary Of Keys format>
In [17]:
from scipy.sparse.linalg import eigs
import numpy as np

value, vector = eigs(M, which = 'LM', k=1)
vector = np.ndarray.flatten(vector.real)

order = list(vector.argsort())
#order.reverse()
ranked = [teams[i] for i in order]
ranked[:13]
Out[17]:
['villanova',
 'duke',
 'north-carolina',
 'butler',
 'florida-st',
 'kansas',
 'virginia',
 'purdue',
 'louisville',
 'notre-dame',
 'kentucky',
 'michigan',
 'maryland']

Colley ranking

In [14]:
def colley(i,j):
    teami = teams[i]
    games = ncaa_info_dict[teami]
    if i==j:
        return len(games)+2
    else:
        games = ncaa_info_dict[teami]
        teamj = teams[j]
        gamesij = [game['opponent'] for game in games if game['opponent']==teamj]
        return -len(gamesij)

n = len(teams)
M2 = np.zeros((n,n))
for i in range(n):
    for j in range(n):
        M2[i,j] = colley(i,j)
def colley_b(i):
    games = ncaa_info_dict[teams[i]]
    n_wins = len([game for game in games if game['result']=='W'])
    n_losses = len([game for game in games if game['result']=='L'])
    return 1+0.5*(n_wins-n_losses)
b = np.array([colley_b(i) for i in range(n)])
In [15]:
from scipy.linalg import solve
vec = solve(M2,b)

vec = np.ndarray.flatten(abs(vec))
sorted_indices = vec.argsort()
ranked = [(teams[i], vec[i]) for i in sorted_indices]
ranked.reverse()
ranked[:10]
Out[15]:
[('kansas', 1.2021737921801154),
 ('villanova', 1.1793905427825808),
 ('kentucky', 1.1340518192943112),
 ('arizona', 1.1294921352358125),
 ('gonzaga', 1.1208336099372467),
 ('duke', 1.0881738023464604),
 ('ucla', 1.0860721132523832),
 ('oregon', 1.0856920679418689),
 ('north-carolina', 1.0830342751165987),
 ('baylor', 1.0761206269939774)]