Correlating basic stats to Ws/Ls in college football

The scatter plot below illustrates how a given statistic in college football (set initially to total points scored throughout the season) correlates to winning percentage. You can use the drop-down menus on the right to choose the year or a different statistic. Note that the statistics in the menu are sorted by strength of correlation for the selected year.

You can hover your mouse over the points in the scatter plot to try to find your team.

[Interactive scatter plot: Win/Loss percentage vs. the selected stat, showing the computed correlation, with Year and Stat drop-down menus]

Motivation

A fun article recently appeared in The All-American - an awesome new site dedicated to the coverage of college football. The article is behind a paywall so, if you look at it, you might get only the first few paragraphs. The main idea, though, is as follows:
Amongst the myriad stats that college football coaches track to gauge their team's strength, four stand out as highly correlated to wins and losses:

Moreover (according to the article), these stats are more highly correlated to winning percentage than are other oft-cited stats, such as total offense, pass defense or total defense. The article specifically promotes Number of Rushing Attempts over Amount of Rushing Yards.

While the article makes some interesting points, the data it cites is really quite sparse. As a data dabbler and college football fan myself, I couldn't resist the urge to look into the data with more depth. The exploration tool above is the result.

Methodology

The data all comes from College Football Stats. More specifically, data for each team is stored on a page that looks like this one for Ohio State's 2014 season. Each row contains several statistics rolled into one, so the app allows you to compare any of 107 different statistics to Win/Loss percentage. I scraped the data and massaged it into CSV files using this Python code. Once the data was nicely formatted, it was pretty easy to do as the article suggests and compute the correlation between any given stat and the Win/Loss percentage.

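It helps to understand correlation, which is a numerical measure of the strength of the linear relationship between two variables. Correlation is always a number between -1 and +1, where values near +1 indicate a strong increasing linear relationship, values near -1 indicate a strong decreasing linear relationship, and values near 0 indicate little or no linear relationship.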

Correlation can be visualized intuitively by looking at a scatter plot, together with a regression line. The closer the correlation is to -1 or +1, the more tightly clustered we expect the data to be about the regression line.
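To make the idea concrete, here is a minimal sketch of how a correlation between one stat and winning percentage might be computed. It uses the standard Pearson formula; the function name pearson_r and the numbers are made up for illustration and are not taken from the scraped data or from the code linked above.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 long")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values for five teams (illustration only, not real data)
points = [310, 420, 275, 505, 390]
win_pct = [0.42, 0.67, 0.33, 0.85, 0.58]

r = pearson_r(points, win_pct)
print(f"Correlation = {r:.3f}")
```

Points that fall exactly on a rising line give a value of +1, and points that fall exactly on a falling line give -1, which matches the description above.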

Results

The four stated stats

The app makes it pretty easy to compare the importance of stats with one another, with a particular emphasis on the four stats mentioned by the article. The statistics in the menu are sorted by strength of correlation with Win/Loss percentage for the selected year. Also, the stats cited by the article are all highlighted in the menu

--- like this ---
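The menu ordering can be sketched roughly as follows. This assumes that "strength" means the absolute value of the correlation, so that a strongly negative stat still ranks high; that is my reading rather than something stated above. The function names, the placeholder stat "Some Other Stat" and all of the numbers are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rank_stats(stats, win_pct):
    """Return (stat name, correlation) pairs, strongest correlation first.

    Assumption: strength is measured by the absolute value of r.
    """
    scored = [(name, pearson_r(values, win_pct)) for name, values in stats.items()]
    return sorted(scored, key=lambda pair: abs(pair[1]), reverse=True)


# Hypothetical values for four teams (illustration only, not real data)
win_pct = [0.25, 0.50, 0.75, 1.00]
stats = {
    "Some Other Stat": [10, 30, 20, 25],
    "OPP Scoring Points/Game": [35, 30, 20, 18],
    "TEAM Total Points": [200, 300, 400, 500],
}

ranking = rank_stats(stats, win_pct)
for rank, (name, r) in enumerate(ranking, start=1):
    print(f"{rank}. {name}: {r:+.2f}")
```

Sorting by absolute value keeps defensive stats, where a lower number is better, from sinking to the bottom of the list just because their correlation is negative.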

Here are the yearly rankings (relative to all the stats that the app measures) of all four stats that the article mentions together with their correlations to win/loss percentage.

All four stats do indeed have a reasonable level of correlation with Win/Loss percentage. Of them, Passing yards per attempt is the highest ranked in four of the years and Turnover margin is the highest ranked in the other three.

Rush attempts, though, seems to be the least important of the four stats: in a couple of years it ranks outside the top 50 with a correlation below 0.5. Worse, it is ranked lower than Rush Yards every year - contrary to the assertion of the article.

Which stats are most correlated with W/L percentage?

There is one thing that the stats make clear - if you want to win more games, then score more points. The TEAM Total Points stat appears at the top of the list every year, and its correlation is always between 0.77 and 0.85. I suppose it's pretty obvious that scoring should be highly correlated to winning percentage, since a W or an L is determined directly by the score of the game.

Beyond that, you do need to be careful not to attribute too much significance to a stat just because it appears high on the list. The TEAM PAT Kicking Made stat, for example, appears very high on the list every year. But it correlates with winning percentage simply because it necessarily correlates with scoring: you don't get a point-after attempt unless you've scored a touchdown.

Similarly, the OPP Punting Yards stat appears quite high on the list every year. Does that mean that we hope our opponents are really good punters? No - it means that we want them to punt a lot since, if they're punting, they're not scoring. In reality, the next highest stat on the list that's not in an obvious direct relationship with TEAM Total Points is OPP Scoring Points/Game. I was surprised, though, that this defensive stat seems generally less correlated with win/loss percentage than TEAM Total Points - consistently less by about 0.1.

My List

So, here's my list, based on the order in which the items appear in the menu from year to year, together with an attempt to eliminate minor stats that simply have a direct relationship with "clearly" more important stats (whatever that means). I made it a bit longer, in case one is duly unimpressed with the scoring-based stats.