Statistics vs data science

Check out Wikipedia’s article on Data Science. Quoting from that article, I guess that

So… just what is this hot, new field? Well, according to the same Wikipedia article,

Why so hot?

Statistics (or data science, if you want) is hot for good reason. Data is becoming easier and easier to come by and its impact more and more pervasive. Think

  • Tech (tracking your browsing history)
  • Finance (analyzing the markets)
  • Sports (Moneyball)
  • Medicine (The genome)
  • Politics

That’s quite a variety of fields! Maybe that’s why I recently stumbled on this article (written by a physician) asserting that statistics may be the most important class that you’ll ever take.

What if I’m just an ordinary person?

What if you’re not interested in being a techie or a doctor or anything like that? What if you just want to be an ordinary person?

You’re extraordinary!

First off, if you complete your goal of obtaining a college degree, you’re not exactly an “ordinary” person. By my estimates, barely 30% of US adults have a bachelor’s degree or higher. Of course, only some fraction of those folks have taken a statistics class so, in a sense, you’re already approaching the data elite!

Do you read the newspaper?

You really need some level of quantitative literacy in general, and statistical literacy in particular, to be an informed citizen these days. Politics provides all kinds of examples. Here are just a couple:

A look at some actual data

One major objective of this class is to learn to deal with real-world data. Here are two examples: one pretty basic and another more involved.

Percentage of folks with college degrees

Just a couple of paragraphs ago, I estimated that barely 30% of US adults had a bachelor’s degree or higher. With what level of confidence can we assert that kind of estimate? Certainly, we can’t check the educational status of absolutely every adult in the US!

These types of estimates are typically based on a survey. We select a subset (called a sample) of the whole population and perform the computation on the subset. We then extrapolate to the whole population based on the sample.
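To make the idea of a sample concrete, here is a minimal simulation sketch in R. The population is invented: it assumes, purely for illustration, one million adults of whom exactly 31% hold a bachelor’s degree or higher. The names `population`, `sample_prop`, and `props` are hypothetical.

```r
# Hypothetical population: 1,000,000 adults, 31% of whom hold a
# bachelor's degree or higher (an assumed figure, for illustration only).
set.seed(1)
population = rep(c(TRUE, FALSE), times = c(310000, 690000))

# Draw a simple random sample of 1000 adults and compute the
# proportion of degree holders in that sample.
s = sample(population, size = 1000)
sample_prop = mean(s)
sample_prop

# Repeat the survey many times to see how much the estimate
# varies from one sample to the next.
props = replicate(2000, mean(sample(population, size = 1000)))
range(props)
```

Each sample gives a slightly different estimate, but the estimates cluster around the population value. Quantifying that spread is what lets us extrapolate from a sample to the whole population.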

Data on educational attainment can be obtained from the Census Bureau. This particular data set is based on the Current Population Survey, a monthly survey of about 60,000 households. If we take a look at that first file on that Census Bureau page, we see something that looks a bit like so:

Age    Total None 1st-4th 5th-6th 7th-8th  9th 10th  11th HighSchool Some_college AssociateOC AssociateAC Bachelor Master Prof Doctoral
18+   246325  770    1555    3214    3648 3484 4267 10245      71170        46445       10081       13990    49368  20797 3196     4096
18-24  29404   53      64      80     182  231  604  3427       8658        10990         672        1088     3105    205   25       20
25-29  22745   18      71     114     172  224  351   761       6076         4462         918        1463     6028   1675  197      214
30-34  21505   51      78     208     238  298  277   620       5418         3680         893        1376     5356   2197  357      458
35-39  20773   53     102     325     280  385  306   623       5152         3209         914        1284     4984   2438  283      434

There are more rows, but the first row is the key row for the question at hand. From there, we can simply compute the percentage by adding the amounts in the “Bachelor”, “Master”, “Prof”, and “Doctoral” columns and dividing by the total number. Evaluating that computation on my computer, I get

(49368+20797+3196+4096)/246325
## [1] 0.3144504

Just over 31%.
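The same computation can be written so that it refers to column names rather than bare numbers, which makes it easier to check against the table. This is only a sketch: the vector `adults` and the name `degree_cols` are made up for illustration, and the counts are copied from the first row of the table.

```r
# Counts from the first row (ages 18+) of the Census table above.
adults = c(Total = 246325, Bachelor = 49368, Master = 20797,
           Prof = 3196, Doctoral = 4096)

# Columns that count as "a bachelor's degree or higher".
degree_cols = c("Bachelor", "Master", "Prof", "Doctoral")

# Share of adults holding at least a bachelor's degree.
pct = sum(adults[degree_cols]) / adults["Total"]
unname(pct)
## [1] 0.3144504
```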

There’s a lot more to ask about this question! The main questions though are

  1. What can we infer about the whole population from this one computation and
  2. With what level of confidence can we make that inference?
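One common first answer to both questions is a confidence interval for a proportion. The sketch below runs ahead of the formal treatment and rests on loud assumptions: it pretends the estimate came from a simple random sample of 60000 adults (a number borrowed from the approximate household count mentioned above). The real survey design is more complex, so the resulting numbers are illustrative only.

```r
# Assumed inputs (illustrative, not the survey's actual design):
p_hat = 0.3144504   # proportion computed from the table above
n = 60000           # hypothetical simple random sample size

# Standard error of a sample proportion.
se = sqrt(p_hat * (1 - p_hat) / n)

# Approximate 95% confidence interval (normal approximation).
z = qnorm(0.975)
ci = c(lower = p_hat - z * se, upper = p_hat + z * se)
ci
```

Under these assumptions, the interval is a range of plausible values for the population proportion, and the 95% is the level of confidence attached to the method that produced it.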

Relating hitting to position in baseball

On my webspace, I have a data file containing batting statistics for all 1199 players in 2010. I’ll often store this type of data on my webspace because it’s easy to read it directly over the web into your computer using the open-source statistical software package R. Let’s take a look:

library(knitr)
df = read.csv('https://www.marksmath.org/data/mlbBat10.tsv', sep="\t")
kable(head(df))
name      team position   G  AB   R   H X2B X3B HR RBI  TB BB  SO SB CS   OBP   SLG   AVG
I Suzuki  SEA  OF       162 680  74 214  30   3  6  43 268 45  86 42  9 0.359 0.394 0.315
D Jeter   NYY  SS       157 663 111 179  30   3 10  67 245 63 106 18  5 0.340 0.370 0.270
M Young   TEX  3B       157 656  99 186  36   3 21  91 291 50 115  4  2 0.330 0.444 0.284
J Pierre  CWS  OF       160 651  96 179  18   3  1  47 206 45  47 68 18 0.341 0.316 0.275
R Weeks   MIL  2B       160 651 112 175  32   4 29  83 302 76 184 11  4 0.366 0.464 0.269
M Scutaro BOS  SS       150 632  92 174  38   0 11  56 245 53  71  5  4 0.333 0.388 0.275

Anyone who follows baseball can tell you offensive output varies by defensive position; pitchers are so bad at hitting that their position could be considered an outlier. Let’s look at some actual data to try to back this up.

The data has actually already been through quite a bit of formatting. Still, it’s so huge that it’s quite a challenge to get a grip on it. Statistics provides tools (both qualitative and quantitative) to analyze large datasets like this.

A question

First, let’s state a specific question we wish to address: How is on base percentage related to position?

Qualitative analysis

One way to visually investigate this question is with a side-by-side box and whisker plot.

In this image, the positions are listed on the horizontal axis and the on base percentage is on the vertical axis. The data has actually been trimmed down quite a bit to include only those non-pitchers who played at least 75 games; there were 327 such players that year. If you understand how to read a box and whisker plot, you can see quite clearly the differences in on base percentage between the different positions.
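For readers who want to try this kind of trimming and plotting themselves, here is a minimal sketch. It is not necessarily how the plot above was produced. It uses a tiny made-up data frame `toy` in place of the real file, and it assumes that pitchers are coded as "P" in the position column.

```r
# Toy stand-in for the batting data (made-up rows, not real players).
toy = data.frame(
  position = c("P", "OF", "OF", "SS", "SS", "1B", "1B", "OF"),
  G        = c(30, 150, 60, 140, 120, 155, 90, 100),
  OBP      = c(0.150, 0.360, 0.310, 0.330, 0.340, 0.380, 0.350, 0.320)
)

# Keep only non-pitchers who played at least 75 games.
# Assumes pitchers are coded as "P".
trimmed = subset(toy, position != "P" & G >= 75)
trimmed$position = factor(trimmed$position)

# Side-by-side box plots of on base percentage, one box per position.
boxplot(OBP ~ position, data = trimmed,
        xlab = "Position", ylab = "On base percentage")
```

Applied to the full data frame instead of `toy`, the same two steps would give one box per position, provided the position codes match the assumption above.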

Quantitative analysis

While we can certainly see the differences between the positions in the box and whisker plots, it’s nice to have definitive numbers to point to that support our analysis. For comparing a numerical variable across several groups, there is a well-established tool called ANOVA, short for Analysis of Variance. Here’s how to run ANOVA for this example:

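# df is assumed to be already trimmed to the 327 non-pitchers with at least 75 games
df$position = factor(df$position,
  labels = c('1B', '2B', '3B', 'C', 'DH', 'OF', 'SS')
)
# Model OBP as a function of position and print the ANOVA table
anova(lm(df$OBP ~ df$position))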
## Analysis of Variance Table
## 
## Response: df$OBP
##              Df  Sum Sq   Mean Sq F value    Pr(>F)    
## df$position   6 0.04129 0.0068822  5.7125 1.165e-05 ***
## Residuals   320 0.38552 0.0012048                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The “Pr(>F)” entry is an example of a \(p\)-value. As we will learn, these kinds of computations allow us to make inferences. Typically, a small \(p\)-value is evidence of a deviation from the status quo. In this case, it indicates that the average on base percentages for the positions are not all the same.
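To see where the numbers in the table come from, the F value and the \(p\)-value can be recomputed from the other entries. This is a sketch using values copied from the table; the variable names are made up for illustration, and because the table entries are rounded the results should agree with the table only approximately.

```r
# Values copied from the ANOVA table above.
ms_position = 0.0068822   # Mean Sq for position
ms_residual = 0.0012048   # Mean Sq for residuals
df_position = 6           # Df for position
df_residual = 320         # Df for residuals

# The F value is the ratio of the two mean squares.
f_value = ms_position / ms_residual

# The p-value is the upper tail area of the F distribution
# with the corresponding degrees of freedom.
p_value = pf(f_value, df_position, df_residual, lower.tail = FALSE)
c(F = f_value, p = p_value)
```

A large F value means the variation between positions is large compared with the variation within positions, which is what drives the \(p\)-value down.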

Types of data

Here is some of the vocabulary surrounding types of data. Note that this stuff is covered in much more detail in section 1.2 of your text.

As we see, data is often naturally represented in a table, also called a data matrix or data frame. The rows in the table are often called cases. In the baseball example, each case is a player. The columns are often called variables. There are two main types of variables, numerical and categorical, and both types can be further classified into two sub-types: numerical variables are either discrete or continuous, and categorical variables are either nominal or ordinal.

If you take a look at our baseball data above, you can see discrete numerical data (like number of at bats) and continuous numerical data (like batting average). We also see several examples of nominal categorical data, like position. A player’s jersey number would be nominal as well: it looks numeric, but there’s really no informative computation that can be done with it. Ordinal categorical data has categories that come with a natural order, like the education levels in the Census table above.
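These distinctions show up directly in how R stores data. The sketch below builds a small data frame with one variable of each type. The `AB`, `AVG`, and `position` values are copied from the first three rows of the baseball table, while the `education` column is invented purely to illustrate an ordered category.

```r
# Small data frame with one variable of each type.
players = data.frame(
  AB        = c(680L, 663L, 656L),            # discrete numerical
  AVG       = c(0.315, 0.270, 0.284),         # continuous numerical
  position  = factor(c("OF", "SS", "3B")),    # nominal categorical
  # Invented column: an ordinal categorical variable.
  education = factor(c("HighSchool", "Bachelor", "Master"),
                     levels = c("HighSchool", "Bachelor", "Master"),
                     ordered = TRUE)
)
str(players)

# Ordered factors support comparisons; unordered factors do not.
players$education < "Master"
```

The last line compares each case against a level of the ordered factor, which only makes sense because the levels have a natural order.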