Eigen-ratings
Intro
This web page outlines a technique, based on linear algebra, for rating teams that play games in a common league using a matrix that describes the game results. In that sense, it's somewhat like the Massey ratings that we discussed earlier.
Like Massey ratings, the eigenrating technique sets up a matrix based on game results between the teams. This matrix, though, represents paired comparisons between the teams, and the technique uses the dominant eigenvector of that matrix to obtain ratings. There's a fair amount of flexibility in the setup of that matrix, allowing for the possibility of optimization.
I learned of the technique from James Keener's 1993 paper in SIAM Review, "The Perron-Frobenius theorem and the ranking of football teams," and the presentation here is strongly influenced by that paper. I also consulted Google's PageRank and Beyond by Langville and Meyer, as the technique is closely related to Google's original PageRank algorithm.
Imports
There's a fair amount of code on this page and, in fact, you'll need to copy some of it, paste it into your own Colab notebook, and use it to complete the lab over this material. Let's go ahead and import the libraries we'll be using:

import numpy as np
import pandas as pd
import graphviz as gv
from scipy.linalg import eig
from scipy.sparse import dok_matrix
from scipy.sparse.linalg import eigs
from sklearn.linear_model import LogisticRegression
Description of the algorithm
Let’s start with a simple example to motivate the technique and then describe the general algorithm as an eigenvector problem.
A simple working example
Let’s suppose that three competitors or teams (numbered zero, one, and two) compete against each other twice each in some type of contest - football, basketball, tiddly winks, what have you. How might we rate them based on the results that we see? For concreteness, suppose the results are the following:
- Team zero beat team one twice and team two once,
- Team one beat team two once,
- Team two beat team zero and team one once each.
We might represent this diagrammatically as follows:
This configuration is called a directed graph or digraph. We construct it by placing an edge from team \(i\) to team \(j\) and labeling that edge with the number of times that team \(i\) beat team \(j\). We suppress zero edges. There is an obvious matrix associated with a digraph called the adjacency matrix. The rows and columns will be indexed by the teams. In row \(i\) and column \(j\), we place the label on the edge from team \(i\) to team \(j\). The adjacency matrix for this digraph is
\[\left(\begin{matrix} 0 & 2 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{matrix}\right)\]
It seems reasonably clear that team zero should be ranked the highest, having beaten both competitors a total of 3 times out of 4 possible tries. Team one, on the other hand, won only 1 game out of its four tries, while team two seems to be in the middle, having split its games with both competitors. Certainly, the teams listed from worst to first should be: \(1\), \(2\), \(0\).
Eigen-formulation
It turns out there's a lovely way to recover this exact ranking using the eigensystem of the adjacency matrix associated with the directed graph. In general, suppose we have \(N\) participants in our collection of contests. We also suppose there is a vector \(\vec{r}\) of ratings with positive entries \(r_j\). Conceptually, \(r_j\) represents the actual strength of the \(j^{\text{th}}\) competitor. We wish to assign a positive score \(s_i\) to each competitor. If competitor \(i\) beats competitor \(j\), we expect that to contribute positively to the score of competitor \(i\). Furthermore, the stronger competitor \(j\) was, the larger we expect that contribution to be. Symbolically, we might write:
\[s_i = \frac{1}{n_i}\sum_{j=1}^N a_{ij}r_j.\]
Thus, \(s_i\) is a linear combination of the strengths of its opponents. The normalization factor \(n_i\) is the number of games that team \(i\) played; we include it because we don't want a team's rating to be higher simply because it played more games.
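To make the formula concrete, here is a small sketch (not part of the original code) that applies it to the simple example above; the ratings vector below is just a hypothetical guess, and each team played four games.

# A sketch of the scoring formula s_i = (1/n_i) * sum_j a_ij * r_j,
# applied to the simple three-team example. The vector r is a
# hypothetical guess at the strengths; n counts games played.
import numpy as np

A = np.array([
    [0, 2, 1],
    [0, 0, 1],
    [1, 1, 0]
])
r = np.array([1.0, 1.0, 1.0])  # hypothetical strengths
n = np.array([4, 4, 4])        # each team played four games
s = (A @ r) / n
print(s)                       # roughly [0.75, 0.25, 0.5]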
A key issue, of course, is how should the matrix of coefficients \(A=(a_{ij})\) be chosen? Certainly, if team \(i\) defeated team \(j\) every time they played (there might be more than one contest between the two), then we expect \(a_{ij}>0\). Beyond that, there’s a lot of flexibility and the precise choice is one for experimentation and (hopefully) optimization. In the simple approach that follows, we’ll take \(a_{ij}\) to simply be the number of times that team \(i\) defeated team \(j\).
Finally, it seems reasonable to hope that our score \(s_i\) of the \(i^{\text{th}}\) team should be related to the actual strength \(r_i\) of that team. Let's assume that they are, in fact, proportional: \(\vec{s} = \lambda \vec{r}\). Put another way (absorbing the normalization factors \(1/n_i\) into the matrix \(A\)), we have \[A\vec{r} = \lambda\vec{r}.\] That is, the actual strengths vector \(\vec{r}\) is an eigenvector of \(A\)!
The Perron-Frobenius theorem
At this point, we turn to some mathematical theory to guide us in our choice of eigenvector.
Theorem: If the square matrix \(A\) has nonnegative entries, then it has an eigenvector \(\vec{r}\) with nonnegative entries corresponding to a positive eigenvalue \(\lambda\). If \(A\) is irreducible, then \(\vec{r}\) has strictly positive entries and is unique up to a positive multiple; the corresponding eigenvalue is simple and is the eigenvalue of largest absolute value.
This all seems good because we certainly want a vector of positive ratings and the theorem tells us which one to choose. This eigenvalue/eigenvector pair is sometimes called dominant.
To some readers, the notion of an irreducible matrix is quite possibly new. There are a number of equivalent characterizations including:
- For every pair of indices \(i\) and \(j\), there is some power \(k\) with \((A^k)_{ij} > 0\),
- There is no permutation of the rows and columns of \(A\) transforming the matrix into a block matrix of the form \[\left(\begin{matrix} A_{11} & A_{12} \\ 0 & A_{22} \end{matrix}\right),\] where \(A_{11}\) and \(A_{22}\) are square.
- The associated digraph is strongly connected, i.e. there is a path from any vertex to any other.
I find the last characterization the easiest to work with, and it's easy to believe that it's likely to be satisfied if the teams play one another enough. Even if the matrix is not irreducible, the eigen-rating approach can still work; it's often sufficient to restrict attention to an appropriate strongly connected component of the digraph.
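Since strong connectivity is the easiest characterization to check in practice, here's a short sketch (not from the original page) that tests it for the simple example using SciPy's connected_components; the connection='strong' option is what turns this into a check of irreducibility.

# Check irreducibility by testing whether the digraph is strongly connected.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

A = csr_matrix(np.array([
    [0, 2, 1],
    [0, 0, 1],
    [1, 1, 0]
]))
n_components, labels = connected_components(A, directed=True, connection='strong')
print(n_components == 1)  # True, so the matrix is irreducible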
The simple example revisited in code
Recall that the adjacency matrix associated with our simple example, written in code, is:
M = [
    [0,2,1],
    [0,0,1],
    [1,1,0]
]
According to the theory, this should have a unique positive eigenvalue whose magnitude is larger than the magnitude of any other eigenvalue. There should be an associated eigenvector with all positive entries. Of course, if \(\vec{v}\) is an eigenvector, then so is \(-\vec{v}\) (or any constant multiple). The theory tells us that we might as well just take the absolute value.
Here’s the eigensystem for our simple example:
values, vectors = eig(M)
print(values)
print(vectors)
[-0.88464618+0.58974281j -0.88464618-0.58974281j 1.76929235+0.j ]
[[-0.63083491+0.j -0.63083491-0.j -0.72356278+0.j ]
[ 0.25319498-0.46743038j 0.25319498+0.46743038j -0.33963778+0.j ]
[ 0.05167574+0.56283042j 0.05167574-0.56283042j -0.60091853+0.j ]]
The result is a pair: a list of the eigenvalues and a matrix whose columns are the eigenvectors. Note that the last displayed eigenvalue, about \(1.769\), has zero imaginary part and is clearly larger in absolute value than the other two, which are complex conjugates. The corresponding eigenvector has components all with the same sign. The absolute value of that eigenvector should be reasonable strengths associated with the teams, approximately: \[0.7235, \; 0.3396, \; 0.6009.\] As expected, the zeroth team is the strongest, while the middle team is the weakest.
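Rather than reading the dominant pair off by eye, one can pick it out programmatically; here's a small sketch (not part of the original code) that reuses the values and vectors computed above and selects the largest-magnitude eigenvalue with np.argmax.

# A sketch of selecting the dominant eigenvalue/eigenvector programmatically.
k = np.argmax(np.abs(values))           # index of the largest-magnitude eigenvalue
dominant_value = values[k].real         # about 1.769
dominant_vector = np.abs(vectors[:, k].real)
print(dominant_value)
print(dominant_vector)                  # about [0.7236, 0.3396, 0.6009]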
Working with real data
Let’s take a look now at some examples involving real data.
The Big South
I’ve got the results of every Big South regular season game played this year; we can grab that data like so:
games = pd.read_csv('https://marksmath.org/data/BigSouth2025RegularSeason.csv')
games
 | WTeamName | WScore | LTeamName | LScore |
---|---|---|---|---|
0 | Winthrop | 95 | SC_Upstate | 76 |
1 | Presbyterian | 68 | Longwood | 60 |
2 | High_Point | 76 | Radford | 58 |
3 | Charleston_So | 72 | Gardner_Webb | 63 |
4 | Radford | 87 | Winthrop | 67 |
... | ... | ... | ... | ... |
67 | Winthrop | 85 | Longwood | 59 |
68 | Winthrop | 103 | UNC_Asheville | 90 |
69 | Longwood | 83 | SC_Upstate | 66 |
70 | Presbyterian | 68 | Gardner_Webb | 57 |
71 | Radford | 76 | Charleston_So | 60 |
72 rows × 4 columns
Here are the teams. This code is slightly more complicated than necessary, since it accounts for the possibility that some teams didn’t win any games.
wteams = games.WTeamName
lteams = games.LTeamName
teams = pd.concat([wteams, lteams], names=['TeamName']).drop_duplicates().sort_values()
teams = list(teams.to_numpy().flat)
teams = pd.DataFrame({'team': teams})
teams
 | team |
---|---|
0 | Charleston_So |
1 | Gardner_Webb |
2 | High_Point |
3 | Longwood |
4 | Presbyterian |
5 | Radford |
6 | SC_Upstate |
7 | UNC_Asheville |
8 | Winthrop |
Now, let’s iterate through the games and construct the simple win matrix described above:
# Construct a dict to look up idx from team name
teams['idx'] = teams.index
team_to_idx_dict = teams.set_index('team')['idx'].to_dict()
teams = teams.drop('idx', axis=1)

# Set up the matrix
n = len(teams)
M = dok_matrix((n,n))
for game in games.iterrows():
    i = team_to_idx_dict[game[1].WTeamName]
    j = team_to_idx_dict[game[1].LTeamName]
    M[i,j] = M[i,j] + 1

# Display it
M = M.toarray()
M
array([[0., 2., 0., 1., 1., 1., 1., 0., 0.],
[0., 0., 0., 1., 1., 0., 2., 0., 1.],
[2., 2., 0., 1., 2., 2., 2., 1., 2.],
[1., 1., 1., 0., 0., 1., 2., 1., 0.],
[1., 1., 0., 2., 0., 1., 1., 1., 0.],
[1., 2., 0., 1., 1., 0., 2., 1., 1.],
[1., 0., 0., 0., 1., 0., 0., 0., 0.],
[2., 2., 1., 1., 1., 1., 2., 0., 1.],
[2., 1., 0., 2., 2., 1., 2., 1., 0.]])
For a league of moderate size, it might make sense to visualize this data with a directed graph:
dot = gv.Digraph()
dot.graph_attr.update({'size': '8,5!'})
for i in range(len(M)):
    wteam = teams.iloc[i].team
    for j in range(len(M)):
        if M[i,j] > 0:
            lteam = teams.iloc[j].team
            dot.edge(wteam, lteam, label=str(int(M[i,j])))
dot
Once we have the matrix, it’s pretty easy to compute the dominant eigenvector and display the teams ranked by rating:
value, vector = eigs(M, k=1, which='LM')
vector = abs(np.ndarray.flatten(vector.real))
teams['rating'] = vector
teams.sort_values('rating', ascending=False)
 | team | rating |
---|---|---|
2 | High_Point | 0.561979 |
7 | UNC_Asheville | 0.427111 |
8 | Winthrop | 0.403688 |
5 | Radford | 0.324317 |
3 | Longwood | 0.285367 |
4 | Presbyterian | 0.275199 |
0 | Charleston_So | 0.202649 |
1 | Gardner_Webb | 0.172820 |
6 | SC_Upstate | 0.074218 |
That looks like a reasonable ranking, at least!
Making predictions
OK, let's try to use those ratings to make actual predictions for the upcoming Big South Tournament. For example, UNCA plays Charleston Southern tonight! We can look up their ratings in our list of teams:
teams.loc[[7,0]]
 | team | rating |
---|---|---|
7 | UNC_Asheville | 0.427111 |
0 | Charleston_So | 0.202649 |
Our question is, how can we translate those ratings to a probabilistic prediction?
One answer, of course, is to get some data and apply logistic regression!
Data
Using Kaggle data, I've assembled fourteen years' worth of Big South games. You can grab it and examine it like so:
df = pd.read_csv('https://marksmath.org/data/big_south_games_with_ratings.csv')
The data is not quite ready for logistic regression. We should

- label each game with an `Outcome` of 1, indicating that the first team won the game,
- flip the data and label the resulting games with an `Outcome` of 0,
- concatenate those two data frames to obtain an expanded, symmetric data frame, and compute a `RatingDiff` for each game.
"Outcome"] = 1
df[= df.copy()
df_flipped "Outcome"] = 0
df_flipped["WTeamRating"], df_flipped["LTeamRating"] = df["LTeamRating"], df["WTeamRating"]
df_flipped[= pd.concat([df, df_flipped], ignore_index=True)
df_expanded
"RatingDiff"] = df_expanded["WTeamRating"] - df_expanded["LTeamRating"]
df_expanded[ df_expanded
 | Season | WTeamID | WTeamName | WScore | LTeamID | LTeamName | LScore | tourney_game | WTeamRating | LTeamRating | Outcome | RatingDiff |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2011 | 1219 | High Point | 66 | 1205 | Gardner Webb | 64 | False | 0.219615 | 0.190593 | 1 | 0.029022 |
1 | 2011 | 1251 | Liberty | 70 | 1342 | Presbyterian | 61 | False | 0.410214 | 0.218673 | 1 | 0.191541 |
2 | 2011 | 1421 | UNC Asheville | 70 | 1347 | Radford | 50 | False | 0.353151 | 0.070308 | 1 | 0.282843 |
3 | 2011 | 1457 | Winthrop | 88 | 1440 | VMI | 82 | False | 0.301116 | 0.303004 | 1 | -0.001888 |
4 | 2011 | 1157 | Coastal Car | 73 | 1149 | Charleston So | 71 | False | 0.558662 | 0.274054 | 1 | 0.284608 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3395 | 2024 | 1255 | Longwood | 80 | 1219 | High Point | 79 | True | 0.525855 | 0.274761 | 0 | 0.251094 |
3396 | 2024 | 1255 | Longwood | 80 | 1219 | High Point | 79 | True | 0.525855 | 0.274761 | 0 | 0.251094 |
3397 | 2024 | 1421 | UNC Asheville | 83 | 1205 | Gardner Webb | 72 | True | 0.439125 | 0.442608 | 0 | -0.003483 |
3398 | 2024 | 1421 | UNC Asheville | 83 | 1205 | Gardner Webb | 72 | True | 0.439125 | 0.442608 | 0 | -0.003483 |
3399 | 2024 | 1255 | Longwood | 85 | 1421 | UNC Asheville | 59 | True | 0.442608 | 0.274761 | 0 | 0.167847 |
3400 rows × 12 columns
Modeling the data
Ultimately, we want to use the regular season games to fit the logistic regression relating the `Outcome` to the `RatingDiff`. We'll then use the results of that model to make predictions about the upcoming tournament.
Here’s how we do that:
train_data = df_expanded[df_expanded["tourney_game"] == False]
test_data = df_expanded[df_expanded["tourney_game"] == True]
X_train = train_data[["RatingDiff"]].values
y_train = train_data["Outcome"].values
X_test = test_data[["RatingDiff"]].values
y_test = test_data["Outcome"].values

# Fit the model!
model = LogisticRegression()
model.fit(X_train, y_train)
LogisticRegression()
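Note that `X_test` and `y_test` were set up above but aren't used; here's a quick sketch (not part of the original code) of how one might check the fitted model against the tournament games using standard sklearn metrics.

# Evaluate the fitted model on the held-out tournament games.
from sklearn.metrics import accuracy_score, log_loss

p_test = model.predict_proba(X_test)[:, 1]   # predicted win probabilities
print(accuracy_score(y_test, p_test > 0.5))  # fraction of games called correctly
print(log_loss(y_test, p_test))              # probabilistic score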
Use the model
Finally, we can use the model to make a probabilistic prediction for tonight’s game:
model.predict_proba([[teams.loc[7].rating - teams.loc[0].rating]])
array([[0.15906252, 0.84093748]])
Well, that’s pretty confident!
Comments
You might recall that the prediction we produced last time was much less confident - only 0.52. The only substantial difference between this code and the code that I whipped up just before class is in the definition of the model. Last time, the model was defined by:
model = LogisticRegressionCV(
    cv=5,
    penalty="l2",
    scoring="accuracy",
    max_iter=1000
)
This approach uses five-fold cross-validation to minimize the loss function plus a constant times the sum of the squares of the model coefficients. That is, it should minimize \[
J(\mathbf{w}) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{p}_i + (1 - y_i) \log (1 - \hat{p}_i) \right] + \alpha \sum_{j=1}^{p} w_j^2.
\] Unfortunately, specifying `scoring="accuracy"` changed the metric, which is why I was so unhappy with the prediction.
Another issue is the use of cross-validation. The nature of the data indicates that we can't just blindly use cross-validation. We should fit on the regular season data and then choose the regularization constant to minimize our score on the tournament data. That needs to be set up manually and uses a bit more code than I want to get into here.
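For completeness, here's a rough sketch of what that manual selection might look like; it is not the code used on this page, and it assumes we score with the log-loss on the tournament games and search a small, hypothetical grid of values for the regularization constant C.

# A rough sketch of manual selection of the regularization constant:
# fit on regular season games, score on tournament games.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

best_C, best_loss = None, np.inf
for C in [0.01, 0.1, 1.0, 10.0, 100.0]:          # hypothetical grid of C values
    m = LogisticRegression(C=C).fit(X_train, y_train)
    loss = log_loss(y_test, m.predict_proba(X_test)[:, 1])
    if loss < best_loss:
        best_C, best_loss = C, loss
print(best_C, best_loss)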
Thus, this time, we defined it simply by:
model = LogisticRegression()
That gives us the 84% prediction which, honestly, is probably too high now. But, hey, I like it!
Comments
There’s a lot going on here and plenty of things worth pointing out.
The matrix \(M\)
The first thing to mention is the matrix \(M\) itself. To this point, we’ve used only the simplest possible formulation where \([M]_{ij}\) is simply the number of times that team \(i\) defeated team \(j\). We didn’t even need to normalize the rows by the number of games played, since each team played the same number of games.
It seems easy to believe that we could pick the matrix entries in a way that improves the ratings. Perhaps we could take the score into account, for example? Perhaps we could weight more recent games more heavily or take home court advantage into account? There are a ton of possibilities that, ultimately, should be chosen by optimization.
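As one illustration (a sketch in the spirit of Keener's paper, not the method actually used above), we might credit each team with a fraction of every game based on points scored rather than a simple win count; the constants here are just one of many possible choices.

# A hypothetical score-based alternative to the simple win matrix.
# Reuses games, team_to_idx_dict, and n from the code above.
from scipy.sparse import dok_matrix

A = dok_matrix((n, n))
for game in games.iterrows():
    i = team_to_idx_dict[game[1].WTeamName]
    j = team_to_idx_dict[game[1].LTeamName]
    w, l = game[1].WScore, game[1].LScore
    A[i, j] += (w + 1) / (w + l + 2)   # winner's share of the game
    A[j, i] += (l + 1) / (w + l + 2)   # loser still gets partial credit
A = A.toarray()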
The computer commands
The commands that construct the matrix and compute the eigenvalue are of interest as well. I'm referring specifically to the following lines of code:

M = dok_matrix((n,n))

value, vector = eigs(M, k=1, which='LM')
- The matrix `M` is stored as a sparse matrix, meaning that a high proportion of its entries are zero. For a relatively small problem like this, the sparseness is not really such a big deal. As the size of the matrix grows, though, utilizing the sparse structure of the matrix becomes more and more important. Truly large sparse matrices can be stored in much less memory than dense matrices. In addition, algorithms can be specialized for sparse matrices to avoid unnecessary computation and speed things up considerably.
- The eigenvalue and eigenvector are computed using the `eigs` command, which is specifically designed for sparse matrices. The arguments `k=1` and `which='LM'` indicate that we need only one eigenvector and that we specifically want the one whose eigenvalue has the Largest Magnitude - the so-called dominant eigenvector.
Power iteration
A common algorithm to compute just the dominant eigenvector is called power iteration.
Power iteration should work with any matrix \(A\) having a dominant eigenvector corresponding to a simple, positive eigenvalue. To apply it, simply pick a vector \(\mathbf{x}_0\) and iterate the function \[ \mathbf{x} \to \frac{A\mathbf{x}}{\|\mathbf{x}\|}. \] As long as the initial vector \(\mathbf{x}_0\) has a nonzero component in the direction of the dominant eigenvector, the process will converge to the dominant eigenvector and its magnitude will be the dominant eigenvalue. The convergence is exponential; at each step, the error shrinks roughly by a factor of the magnitude of the second-largest eigenvalue divided by the dominant eigenvalue.
This works because multiplication by the matrix \(A\) stretches space out the most in the direction of the dominant eigenvector.
Power iteration is super easy to implement and we can check that it works by comparing to the team ratings generated by SciPy:
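Here's a minimal sketch of power iteration, assuming the dense Big South matrix M and the teams data frame from above; the resulting column should match the eigs-based ratings up to small numerical differences.

# A minimal sketch of power iteration: iterate x -> A x / ||x||.
x = np.ones(len(M))                      # any positive starting vector works
for _ in range(100):                     # a fixed number of iterations for simplicity
    x = M @ x / np.linalg.norm(x)
print(np.linalg.norm(x))                 # approximates the dominant eigenvalue
teams['power_rating'] = x / np.linalg.norm(x)
teams.sort_values('power_rating', ascending=False)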