One simple characterization of statistics might be
the study of how to collect, analyze, and draw conclusions from data.
Today, we're going to explore one fun thing you can do with statistics - make sports predictions!
We'll take a look back at the 2022 NCAA Basketball Tournament, in particular. In the process, we'll meet several important concepts from Section 1.2 of our text, which is titled Data Basics. The main purpose, though, is to see something kinda cool that you can do with statistics.
Well, sports form just one family of examples. There are many, many other applications. So don't worry, you'll be sick of politics by the end of the semester, too!
If you're a college basketball fan at all, you might recall that North Carolina, led by first-year head coach Hubert Davis, defeated Duke in the national semi-finals in Mike Krzyzewski's last game. North Carolina then went on to lose to 1 seed Kansas by only three points - not bad for an 8 seed!
The next slide shows the full, 64 team bracket from that year.
The shading of each game indicates how it went according to my own probabilistic assessment that I made before the tournament started. I made these assessments as part of my participation in Kaggle's annual March Madness competition.
I ended up placing 64th out of 930 participants that year, as you can check on their leaderboard.
Not bad either!
Again, statistics might be characterized as
the study of how to collect, analyze, and draw conclusions from data.
Conclusions in statistics are often stated in probabilistic terms. Thus, we might say something like
The probability that North Carolina will defeat Marquette in the first round is approximately 0.6298.
or maybe
The likelihood that North Carolina will defeat Marquette in the first round is nearly 63%.
The Kaggle-provided data that I used to make the predictions looks something like this:
The data on the previous slide is in the form of a data table or data frame.
Each row in a data table is called an observation and corresponds to a case in our study. In this particular example, each case corresponds to a game played during the regular season.
Each column corresponds to a variable or characteristic associated with the cases.
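To make the row and column vocabulary concrete, here's a minimal sketch in Python. The column names, team names, and scores are all made up for illustration; they are not the actual Kaggle data.

```python
# Hypothetical game data: each dict is one row (an observation, i.e. one game).
# Column names and values are invented for illustration only.
games = [
    {"season": 2022, "w_team": "Team A", "w_score": 78, "l_team": "Team B", "l_score": 70, "w_loc": "Home"},
    {"season": 2022, "w_team": "Team C", "w_score": 65, "l_team": "Team A", "l_score": 60, "w_loc": "Away"},
    {"season": 2022, "w_team": "Team B", "w_score": 81, "l_team": "Team C", "l_score": 74, "w_loc": "Neutral"},
]

# The keys shared by every row play the role of the columns (the variables).
variables = list(games[0].keys())
n_observations = len(games)

print(f"{n_observations} observations, {len(variables)} variables")
for name in variables:
    # Reading down one column gives every case's value for that variable.
    column = [game[name] for game in games]
    print(f"{name}: {column}")
```

Reading across a row tells you everything recorded about one game; reading down a column tells you how one characteristic varies from game to game.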
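There are two main types of data, numerical and categorical, and both types can be further classified into two sub-types: numerical data is either discrete or continuous, while categorical data is either nominal or ordinal.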
Here's another example, that focuses on the teams, rather than the games. This team data has been aggregated from the game data.
Perhaps we can identify each variable type in the table?
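For anyone curious how team data might be aggregated from game data, here's a rough sketch. The games, the column names, and the helper `aggregate_by_team` are all hypothetical; this is one plausible way to do it, not necessarily how the table on the slide was built.

```python
from collections import defaultdict

# Hypothetical game results (names and scores invented for illustration).
games = [
    {"w_team": "Team A", "w_score": 78, "l_team": "Team B", "l_score": 70},
    {"w_team": "Team C", "w_score": 65, "l_team": "Team A", "l_score": 60},
    {"w_team": "Team B", "w_score": 81, "l_team": "Team C", "l_score": 74},
    {"w_team": "Team A", "w_score": 90, "l_team": "Team C", "l_score": 72},
]

def aggregate_by_team(games):
    """Collapse one-row-per-game data into one-row-per-team data."""
    totals = defaultdict(lambda: {"games": 0, "wins": 0, "diff": 0})
    for g in games:
        margin = g["w_score"] - g["l_score"]
        # Each game contributes to two teams: the winner (+margin) and the loser (-margin).
        for team, won, signed in ((g["w_team"], 1, margin), (g["l_team"], 0, -margin)):
            totals[team]["games"] += 1
            totals[team]["wins"] += won
            totals[team]["diff"] += signed
    return {
        team: {
            "games": t["games"],
            "win_pct": t["wins"] / t["games"],
            "avg_score_diff": t["diff"] / t["games"],
        }
        for team, t in totals.items()
    }

teams = aggregate_by_team(games)
for name, row in sorted(teams.items()):
    print(name, row)
```

Notice that the cases have changed: in the game data each row is a game, while in the aggregated data each row is a team.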
Sometimes, pictures can help us understand a bit about what's what and suggest ideas for further exploration. Here are a couple of examples.
A histogram gives us a sense of where a single numerical variable is centered and how spread out it is. The histogram below illustrates the locations of the average score differences for each team.
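Under the hood, a histogram just counts how many values land in each bin. Here's a small sketch of that counting step, drawn as a text histogram. The data is randomly generated as a stand-in for average score differences, and the bin width of 4 and the helper `bin_counts` are arbitrary choices for illustration.

```python
import random
from collections import Counter

random.seed(1)
# Stand-in data: 300 made-up "average score differences", centered near 0.
avg_diffs = [random.gauss(0, 6) for _ in range(300)]

def bin_counts(values, width):
    """Count how many values fall in each bin [left, left + width)."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

counts = bin_counts(avg_diffs, width=4)
for left, n in counts.items():
    # One '#' per three teams keeps the bars short enough to read.
    print(f"{left:6.0f} to {left + 4:4.0f} | {'#' * (n // 3)}")
```

The tallest bars show where the variable is centered, and the range of bins with any bars at all gives a sense of the spread.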
If we suspect that two numerical variables might be related, we could plot them together on a scatter plot. In the figure below, for example, each dot corresponds to a team; the $x$-coordinate corresponds to the team's average score difference and the $y$-coordinate to its winning percentage.
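One way to put a number on how strongly two numerical variables move together is the correlation coefficient, which ranges from $-1$ to $1$. The sketch below computes it by hand for a handful of invented (average score difference, winning percentage) pairs; the values and the helper `pearson_r` are illustrative assumptions, not the real team data.

```python
from math import sqrt

# Hypothetical (avg_score_diff, win_pct) pairs, one per team; values invented.
teams = [(-8.2, 0.25), (-3.1, 0.40), (0.4, 0.52), (4.9, 0.66), (11.3, 0.85)]

def pearson_r(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)

r = pearson_r(teams)
print(f"r = {r:.3f}")
```

A value close to $1$ matches what we'd see in a scatter plot whose dots climb from lower left to upper right: teams that outscore their opponents by more tend to win more often.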
Again, we've got two main questions when it comes to the NCAA tournament:
We can approach that second question using simulation.
The idea behind simulation is to simply run the tournament according to the probabilities for each game and tabulate the results. For example, my algorithm assigned a probability of $0.6298$ or approximately a $63\%$ chance that North Carolina would defeat Marquette in the first round.
If you can run one simulation, you can run 1000 simulations - you just have to wait a little bit. You can then tabulate the results of all those runs to get a sense of who's most likely to win the tournament.
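Here's a minimal sketch of that idea on a tiny, hypothetical four-team bracket. Only the $0.6298$ figure is borrowed from the text; the team labels, the other probabilities in `WIN_PROB`, and all of the function names are invented for illustration, and this is not the actual algorithm behind the slide.

```python
import random
from collections import Counter

# WIN_PROB[(a, b)] is the assumed probability that a beats b.
# 0.6298 is borrowed from the text; every other number is a placeholder.
WIN_PROB = {
    ("A", "B"): 0.6298,
    ("C", "D"): 0.55,
    ("A", "C"): 0.50,
    ("A", "D"): 0.58,
    ("B", "C"): 0.40,
    ("B", "D"): 0.47,
}

def prob_beats(a, b):
    """Look up P(a beats b), using the complement if only (b, a) is stored."""
    if (a, b) in WIN_PROB:
        return WIN_PROB[(a, b)]
    return 1 - WIN_PROB[(b, a)]

def play_game(a, b, rng):
    """Simulate one game: a wins with probability prob_beats(a, b)."""
    return a if rng.random() < prob_beats(a, b) else b

def simulate_tournament(bracket, rng):
    """Play adjacent pairs round by round until one team is left."""
    teams = list(bracket)
    while len(teams) > 1:
        teams = [play_game(teams[i], teams[i + 1], rng)
                 for i in range(0, len(teams), 2)]
    return teams[0]

def tabulate(bracket, n_runs, seed=0):
    """Run the tournament n_runs times and count each team's championships."""
    rng = random.Random(seed)
    return Counter(simulate_tournament(bracket, rng) for _ in range(n_runs))

results = tabulate(["A", "B", "C", "D"], n_runs=1000)
for team, wins in results.most_common():
    print(f"{team}: won {wins / 1000:.1%} of simulated tournaments")
```

Each run is just a sequence of weighted coin flips, one per game. The fraction of runs a team wins is an estimate of its chance of winning the whole thing, and the estimate settles down as the number of runs grows.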
You can hit the "Simulate 1000" button below to see this in action.
You might find all this to be pretty silly; you might just hate sports, for example.
It's important to understand, though, that these ideas have other applications. The presidential election, for example, can be analyzed in a very similar fashion. That is,