Stat 185 Intro

Statistics vs data science

Check out Wikipedia’s article on Data Science. Quoting from that article, I guess that

Data science is “The Sexiest Job of the 21st Century” (Wikipedia cites the Harvard Business Review)
There could be a global shortage of 1.5 million data scientists (Wikipedia cites McKinsey & Company)

So… just what is this hot, new field? Well, according to the same Wikipedia article,

Data science is a “concept to unify statistics, data analysis and their related methods” (Wikipedia cites a prominent text)
Data science is a “sexed-up term for statistics” (Wikipedia cites Nate Silver)

Why so hot?

Statistics (or data science, if you want) is hot for good reason. Data is becoming easier and easier to come by and its impact more and more pervasive. Think

Tech (tracking your browsing history)
Finance (analyzing the markets)
Sports (Moneyball)
Medicine (The genome)
Politics

A look at some actual data

Anyone who follows baseball can tell you offensive output varies by defensive position; pitchers are so bad that their position could be considered an outlier. Let’s look at some actual data to try to back this up.

Reading and looking at raw data

On my webspace, I have a data file containing batting statistics for all 1199 players in 2010. Let’s take a look:

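library(knitr)  # provides kable for table formatting
# Read the tab-separated batting data into a data frame
df = read.csv('https://www.marksmath.org/data/mlbBat10.tsv', sep="\t")
kable(head(df))  # show the first six rows as a formatted table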

name	team	position	G	AB	R	H	X2B	X3B	HR	RBI	TB	BB	SO	SB	CS	OBP	SLG	AVG
I Suzuki	SEA	OF	162	680	74	214	30	3	6	43	268	45	86	42	9	0.359	0.394	0.315
D Jeter	NYY	SS	157	663	111	179	30	3	10	67	245	63	106	18	5	0.340	0.370	0.270
M Young	TEX	3B	157	656	99	186	36	3	21	91	291	50	115	4	2	0.330	0.444	0.284
J Pierre	CWS	OF	160	651	96	179	18	3	1	47	206	45	47	68	18	0.341	0.316	0.275
R Weeks	MIL	2B	160	651	112	175	32	4	29	83	302	76	184	11	4	0.366	0.464	0.269
M Scutaro	BOS	SS	150	632	92	174	38	0	11	56	245	53	71	5	4	0.333	0.388	0.275

In the code above, the read.csv command reads the tab-separated data file from the web. That data is then stored in a variable called df (for data frame). The resulting data frame has 1199 rows and 19 columns. The head command grabs just the first six rows and the kable command formats them as a nicely rendered table.
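If you want to check counts like these yourself, dim reports the number of rows and columns and head takes an optional number of rows. Here is a small sketch that rebuilds a data frame from just the six rows printed above, pasted inline so that no download is needed. The names sample_rows and sample_df are made up for this illustration.

```r
# The header and the six rows shown above, with "\t" marking each tab.
sample_rows <- c(
  "name\tteam\tposition\tG\tAB\tR\tH\tX2B\tX3B\tHR\tRBI\tTB\tBB\tSO\tSB\tCS\tOBP\tSLG\tAVG",
  "I Suzuki\tSEA\tOF\t162\t680\t74\t214\t30\t3\t6\t43\t268\t45\t86\t42\t9\t0.359\t0.394\t0.315",
  "D Jeter\tNYY\tSS\t157\t663\t111\t179\t30\t3\t10\t67\t245\t63\t106\t18\t5\t0.340\t0.370\t0.270",
  "M Young\tTEX\t3B\t157\t656\t99\t186\t36\t3\t21\t91\t291\t50\t115\t4\t2\t0.330\t0.444\t0.284",
  "J Pierre\tCWS\tOF\t160\t651\t96\t179\t18\t3\t1\t47\t206\t45\t47\t68\t18\t0.341\t0.316\t0.275",
  "R Weeks\tMIL\t2B\t160\t651\t112\t175\t32\t4\t29\t83\t302\t76\t184\t11\t4\t0.366\t0.464\t0.269",
  "M Scutaro\tBOS\tSS\t150\t632\t92\t174\t38\t0\t11\t56\t245\t53\t71\t5\t4\t0.333\t0.388\t0.275"
)

# read.csv can parse text held in memory as well as a file or URL.
sample_df <- read.csv(text = sample_rows, sep = "\t")

dim(sample_df)       # 6 rows and 19 columns for this small sample
head(sample_df, 3)   # just the first three rows
```

Running dim on the full df should report the 1199 rows and 19 columns mentioned above.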

The data has actually already been through quite a bit of formatting. Still, it’s so large that it’s quite a challenge to get a grip on it. Statistics provides tools (both qualitative and quantitative) to analyze large datasets like this.

A question

First, let’s state a specific question we wish to address: How is on base percentage related to position?

Qualitative analysis

One way to visually investigate this question is with a side-by-side box and whisker plot.

In this image, the positions are listed on the horizontal axis and the on base percentage is on the vertical axis. The data has actually been trimmed down quite a bit to include only those non-pitchers who played at least 75 games; there were 327 such players that year. If you understand how to read a box and whisker plot, you can see quite clearly the differences in on base percentage between the different positions.
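The code behind the plot isn’t shown here, but a rough sketch of how one might trim the data and draw such a plot in base R looks like this. The helper names trim_batters and obp_boxplot are made up for illustration, and the sketch assumes that pitchers are coded as 'P' in the position column, which the text above doesn’t confirm.

```r
# Hypothetical helper: keep non-pitchers who played at least min_games games.
# ASSUMPTION: pitchers are coded as 'P' in the position column.
trim_batters <- function(df, min_games = 75) {
  keep <- which(df$position != 'P' & df$G >= min_games)
  kept <- df[keep, ]
  kept$position <- factor(kept$position)  # keep only positions still present
  kept
}

# Hypothetical helper: one box per position, OBP on the vertical axis.
obp_boxplot <- function(df) {
  boxplot(OBP ~ position, data = df,
          xlab = "Position", ylab = "On base percentage")
}
```

With df loaded as before, obp_boxplot(trim_batters(df)) should draw a side-by-side plot along these lines. If the assumption about the pitcher code holds, nrow(trim_batters(df)) should come out to the 327 players mentioned above.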

Quantitative analysis

While we can certainly see the differences between the positions in the box and whisker plots, it’s nice to have definitive numbers to point to that support our analysis. For comparing a numeric variable across several groups, there is a well-established tool called ANOVA, for Analysis of Variance. Here’s how to run ANOVA for this example:

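# df is assumed here to be already trimmed to the 327 non-pitchers
# with at least 75 games described above; that step is not shown.
df$position = factor(df$position,
  labels = c('1B', '2B', '3B', 'C', 'DH', 'OF', 'SS')
)
# Fit OBP by position and print the ANOVA table
anova(lm(df$OBP ~ df$position))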

## Analysis of Variance Table
## 
## Response: df$OBP
##              Df  Sum Sq   Mean Sq F value    Pr(>F)    
## df$position   6 0.04129 0.0068822  5.7125 1.165e-05 ***
## Residuals   320 0.38552 0.0012048                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The “Pr(>F)” entry is an example of a \(p\)-value. As we will learn, these kinds of computations allow us to make inferences. Typically, a small \(p\)-value indicates a deviation from the status quo. In this case, it suggests that the average on base percentage is not the same across all the positions.
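To see where those numbers come from, here is a small sketch that recomputes the F value and the \(p\)-value from the rounded sums of squares printed in the table. The variable names are made up for illustration. Because the inputs are rounded, the results should agree with the table only to a few digits. Note that the degrees of freedom fit the setup: 7 positions give 6, and 327 players minus 7 positions give 320.

```r
# Rounded values copied from the ANOVA table above.
ss_position  <- 0.04129   # Sum Sq for position
ss_residuals <- 0.38552   # Sum Sq for residuals
df_position  <- 6         # 7 positions minus 1
df_residuals <- 320       # 327 players minus 7 positions

# Each mean square is a sum of squares divided by its degrees of freedom.
ms_position  <- ss_position / df_position
ms_residuals <- ss_residuals / df_residuals

# The F value is the ratio of the two mean squares.
f_value <- ms_position / ms_residuals

# The p-value is the upper tail of the F distribution beyond f_value.
p_value <- pf(f_value, df_position, df_residuals, lower.tail = FALSE)

f_value
p_value
```

The pf function is R’s cumulative distribution function for the F distribution; setting lower.tail = FALSE asks for the probability of seeing an F value at least this large if the positions really were all the same.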