Stat 185

Intro to NCAA data basics

What is statistics?

One simple characterization of statistics might be

the study of how to collect, analyze, and draw conclusions from data.

A fun(?) example

Today, we're going to explore one fun thing you can do with statistics - make sports predictions!

We'll take a look back at the 2022 NCAA Basketball Tournament, in particular. In the process, we'll meet several important concepts from section 1.2 of our text, whose title is Data Basics. The main purpose, though, is to see something kinda cool that you can do with statistics.

What if I hate sports?

Well, sports form just one family of examples. There are many, many other applications. So don't worry, you'll be sick of politics by the end of the semester, too!

A look back at the Men's 2022 NCAA Tournament

If you're a college basketball fan at all, you might recall that North Carolina, under first-year head coach Hubert Davis, defeated Duke in the national semi-finals in Mike Krzyzewski's last game. North Carolina then went on to lose to No. 1 seed Kansas by only three points - not bad for an 8 seed!

The next slide shows the full, 64-team bracket from that year.

Interpretation

The shading of each game indicates how the result compared with my own probabilistic assessment, which I made before the tournament started. I made these assessments as part of my participation in Kaggle's annual March Madness competition.

I ended up placing 64th out of 930 participants that year, as you can check on their leaderboard.
Not bad either!

Back to Stats

Again, statistics might be characterized as

the study of how to collect, analyze, and draw conclusions from data.

An example conclusion

Conclusions in statistics are often stated in probabilistic terms. Thus, we might say something like

The probability that North Carolina will defeat Marquette in the first round is approximately 0.6298.

or maybe

The likelihood that North Carolina will defeat Marquette in the first round is nearly 63%.

Data

The Kaggle-provided data that I used to make the predictions looks something like this:

Data tables

The data on the previous slide is in the form of a data table or data frame.

Each row in a data table is called an observation and corresponds to a case in our study. In this particular example, each case corresponds to a game played during the regular season.

Each column corresponds to a variable or characteristic associated with the cases.
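
As a minimal sketch of this row-and-column idea, here is a tiny data table written in Python as a list of rows. The team names, column names, and scores are all made up for illustration; they are not the actual Kaggle data.

```python
# A tiny, made-up data table. Each dict is one observation (one game),
# and each key is a variable. All names and values are hypothetical.
games = [
    {"season": 2022, "winner": "Team A", "winner_score": 75,
     "loser": "Team B", "loser_score": 68},
    {"season": 2022, "winner": "Team C", "winner_score": 81,
     "loser": "Team A", "loser_score": 77},
    {"season": 2022, "winner": "Team B", "winner_score": 64,
     "loser": "Team C", "loser_score": 60},
]

num_observations = len(games)      # number of rows (cases)
variables = list(games[0].keys())  # column names (variables)

print(num_observations, "observations")
print("variables:", variables)
```

Reading across one dict gives everything recorded about a single game; reading the same key down all the dicts gives one variable.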

Types of data

There are two main types of data and both types can be further classified into two sub-types.

  • Numerical data, which can be
    • Discrete or
    • Continuous
  • Categorical data, which can be
    • Nominal or
    • Ordinal.
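
To make the four sub-types concrete, here is a small sketch that tags a few variables one might plausibly record for a basketball team. The variable names are hypothetical and chosen only to illustrate the classification.

```python
# Hypothetical team variables, each tagged with (main type, sub-type).
variable_types = {
    "points_scored": ("numerical", "discrete"),     # counts: 0, 1, 2, ...
    "field_goal_pct": ("numerical", "continuous"),  # any value in a range
    "conference": ("categorical", "nominal"),       # labels with no natural order
    "round_reached": ("categorical", "ordinal"),    # labels with a natural order
}

for name, (main_type, sub_type) in variable_types.items():
    print(f"{name}: {main_type}, {sub_type}")
```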

Aggregated team data

Here's another example that focuses on the teams, rather than the games. This team data has been aggregated from the game data.
Can we identify the type of each variable in the table?
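
Here is one way the aggregation from games to teams might look. This is only a sketch on made-up games; the choice of winning percentage and average score difference as the team-level variables is an assumption based on the plots discussed later, not a description of the actual table.

```python
from collections import defaultdict

# Made-up game results: (winner, winner_score, loser, loser_score).
games = [
    ("Team A", 75, "Team B", 68),
    ("Team C", 81, "Team A", 77),
    ("Team B", 64, "Team C", 60),
    ("Team A", 70, "Team C", 62),
]

# Running totals for each team, built up one game at a time.
totals = defaultdict(lambda: {"games": 0, "wins": 0, "score_diff": 0})
for winner, w_score, loser, l_score in games:
    margin = w_score - l_score
    totals[winner]["games"] += 1
    totals[winner]["wins"] += 1
    totals[winner]["score_diff"] += margin
    totals[loser]["games"] += 1
    totals[loser]["score_diff"] -= margin

# One row per team: the aggregated team table.
team_table = {
    team: {
        "win_pct": t["wins"] / t["games"],
        "avg_score_diff": t["score_diff"] / t["games"],
    }
    for team, t in totals.items()
}

for team, row in sorted(team_table.items()):
    print(team, round(row["win_pct"], 3), round(row["avg_score_diff"], 2))
```

Notice that the cases have changed: in the game table each row is a game, while in the aggregated table each row is a team.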

Visualization

Sometimes, pictures can help us understand a bit about what's what and suggest ideas for further exploration. Here are a couple of examples.

Histograms

A histogram gives us a sense of where a single numerical variable is centered and how spread out it is. The histogram below illustrates the locations of the average score differences for each team.

A histogram
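
Behind the picture, a histogram is just a count of how many values fall into each of several equal-width bins. The sketch below does that counting on made-up average score differences and prints a rough text version; the values and the bin width of 5 are arbitrary choices for illustration.

```python
import math
from collections import Counter

# Made-up average score differences, one per team.
avg_score_diffs = [-8.5, -4.2, -3.9, -1.0, 0.4, 1.3, 2.2, 2.8, 3.5, 6.1, 7.4, 12.0]

bin_width = 5
# Map each value to the left edge of its bin, e.g. 2.2 -> 0 and -3.9 -> -5.
bin_counts = Counter(math.floor(x / bin_width) * bin_width for x in avg_score_diffs)

# One row per bin; the bar length is the number of teams in that bin.
for left in sorted(bin_counts):
    print(f"[{left:>4}, {left + bin_width:>3})  {'#' * bin_counts[left]}")
```

The tallest bar shows roughly where the data is centered, and the range of bins with any bars at all shows how spread out it is.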

Scatter plots

If we suspect that two numerical variables might be related, we could plot them together on a scatter plot. In the figure below, for example, each dot corresponds to a team; the $x$-coordinate corresponds to the team's average score difference and the $y$-coordinate to their winning percentage.

Simulation

Again, we've got two main questions when it comes to the NCAA tournament:

  • Given data on a couple of NCAA teams, how might we assess the probability that one team defeats the other?
  • If we can do that for each game, how can we assess the likelihood that a specific team wins the tournament?

We can approach that second question using simulation.

The main idea

The idea behind simulation is to simply run the tournament according to the probabilities for each game and tabulate the results. For example, my algorithm assigned a probability of $0.6298$ or approximately a $63\%$ chance that North Carolina would defeat Marquette in the first round.
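
Simulating a single game with a given probability might look like the sketch below. The probability $0.6298$ comes from the text; the function name `simulate_game` and the random seed are hypothetical choices for illustration, not the code actually used for the predictions.

```python
import random

p_unc_wins = 0.6298  # probability that North Carolina defeats Marquette

def simulate_game(team_a, team_b, p_a_wins, rng):
    """Return the winner of one simulated game.

    rng.random() is uniform on [0, 1), so it falls below p_a_wins
    with probability p_a_wins.
    """
    return team_a if rng.random() < p_a_wins else team_b

rng = random.Random(185)  # seeded so the run is reproducible; the seed is arbitrary
winner = simulate_game("North Carolina", "Marquette", p_unc_wins, rng)
print(winner)
```

Any single run just names one winner. The probability only becomes visible when the game is simulated many times.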

Lots of simulations

If you can run one simulation, you can run 1000 simulations - you just have to wait a little bit. You can then tabulate the results of all those runs to get a sense of who's most likely to win the tournament.
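
Here is a sketch of that tabulation on a toy 4-team bracket rather than the full 64-team field. The teams and the pairwise win probabilities are made up, and the helper names are hypothetical; the point is only the pattern of running the tournament many times and counting the champions.

```python
import random
from collections import Counter

# Made-up probability that the first team in each pair beats the second.
win_prob = {
    ("A", "B"): 0.63, ("A", "C"): 0.55, ("A", "D"): 0.70,
    ("B", "C"): 0.45, ("B", "D"): 0.60, ("C", "D"): 0.65,
}

def p_beats(a, b):
    """Probability that a beats b, looked up in either order."""
    return win_prob[(a, b)] if (a, b) in win_prob else 1 - win_prob[(b, a)]

def play(a, b, rng):
    """Simulate one game and return the winner."""
    return a if rng.random() < p_beats(a, b) else b

def simulate_tournament(bracket, rng):
    """Pair off adjacent teams round by round until one champion remains."""
    teams = list(bracket)
    while len(teams) > 1:
        teams = [play(teams[i], teams[i + 1], rng) for i in range(0, len(teams), 2)]
    return teams[0]

rng = random.Random(2022)  # arbitrary seed, for reproducibility
n_runs = 1000
champions = Counter(simulate_tournament(["A", "B", "C", "D"], rng) for _ in range(n_runs))

for team, titles in champions.most_common():
    print(team, titles / n_runs)
```

The fraction of runs that each team wins is an estimate of its probability of winning the whole tournament.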

You can hit the "Simulate 1000" button below to see this in action.

Other applications

You might find all this to be pretty silly; you might just hate sports, for example.

It's important to understand, though, that these ideas have other applications. The presidential election, for example, can be analyzed in a very similar fashion. That is,

  • we first assess the probability of victory for each candidate in each state and
  • we then use simulation to determine the likely overall victor.
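
The two steps above can be sketched in the same way as the tournament. Everything in this example is made up: the states, their electoral votes, and the win probabilities. It also assumes each state is decided independently of the others, which is a simplification for illustration only.

```python
import random

# Made-up states: (electoral votes, probability that candidate X wins the state).
states = {
    "State 1": (10, 0.80),
    "State 2": (15, 0.45),
    "State 3": (6, 0.55),
    "State 4": (20, 0.50),
    "State 5": (8, 0.30),
}
total_votes = sum(votes for votes, _ in states.values())

def simulate_election(rng):
    """Return candidate X's electoral votes in one simulated election."""
    return sum(votes for votes, p in states.values() if rng.random() < p)

rng = random.Random(185)  # arbitrary seed, for reproducibility
n_runs = 1000
# Candidate X wins a run by taking more than half of the electoral votes.
x_wins = sum(simulate_election(rng) > total_votes / 2 for _ in range(n_runs))

print(f"Candidate X wins in {x_wins / n_runs:.1%} of simulated elections")
```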