Statistics vs data science

Check out Wikipedia's article on Data Science. Quoting from that article, I guess that

So... just what is this hot, new field? Well, according to the same Wikipedia article,

  • Data science is a "concept to unify statistics, data analysis and their related methods" (Wikipedia cites a prominent text)
  • Data science is a "sexed-up term for statistics" (Wikipedia cites Nate Silver)

Why so hot?

Statistics (or data science, if you want) is hot for good reason. Data is becoming easier and easier to come by and its impact more and more pervasive. Think of:

  • Tech (tracking your browsing history)
  • Finance (analyzing the markets)
  • Sports (Moneyball)
  • Medicine (the genome)
  • Politics

A look at some actual data

Anyone who follows baseball can tell you offensive output varies by defensive position; pitchers are so bad that their position could be considered an outlier. Let's look at some actual data to try to back this up.

Reading and looking at raw data

On my webspace, I have a data file containing batting statistics for all 1199 players in 2010. Let's take a look:

In [1]:
import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/mlbBat10.tsv', sep='\t')
df.head()
Out[1]:
       name team position    G   AB    R    H  2B  3B  HR  RBI   TB  BB   SO  SB  CS    OBP    SLG    AVG
0  I Suzuki  SEA       OF  162  680   74  214  30   3   6   43  268  45   86  42   9  0.359  0.394  0.315
1   D Jeter  NYY       SS  157  663  111  179  30   3  10   67  245  63  106  18   5  0.340  0.370  0.270
2   M Young  TEX       3B  157  656   99  186  36   3  21   91  291  50  115   4   2  0.330  0.444  0.284
3  J Pierre  CWS       OF  160  651   96  179  18   3   1   47  206  45   47  68  18  0.341  0.316  0.275
4   R Weeks  MIL       2B  160  651  112  175  32   4  29   83  302  76  184  11   4  0.366  0.464  0.269

In the code above, we first import a library called Pandas that contains some great tools for the manipulation of data. In particular, it provides a DataFrame object that stores large data sets efficiently, as well as tools to read data files into data frames; in this case, right off of the web. The resulting DataFrame has 1199 rows and 19 columns. The head method grabs just the first few rows (five by default).
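To see what the shape of a DataFrame and the head method report without downloading anything, here is a minimal sketch that rebuilds a few columns of the five rows shown above from an in-memory, tab-separated string. The name `sample` and the choice of seven columns are mine, for illustration only; the real file has all 19 columns and 1199 rows.

```python
import io
import pandas as pd

# Five rows and seven columns copied from the output above (an illustrative subset).
rows = [
    "name\tteam\tposition\tG\tAB\tH\tOBP",
    "I Suzuki\tSEA\tOF\t162\t680\t214\t0.359",
    "D Jeter\tNYY\tSS\t157\t663\t179\t0.340",
    "M Young\tTEX\t3B\t157\t656\t186\t0.330",
    "J Pierre\tCWS\tOF\t160\t651\t179\t0.341",
    "R Weeks\tMIL\t2B\t160\t651\t175\t0.366",
]

# read_csv accepts any file-like object, so a StringIO stands in for the URL.
sample = pd.read_csv(io.StringIO("\n".join(rows)), sep="\t")

print(sample.shape)    # (number of rows, number of columns) -> (5, 7)
print(sample.head(2))  # only the first two rows
```

Running the same two calls on `df` itself is a quick way to confirm the row and column counts quoted above.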

The data has actually already been through quite a bit of formatting. Still, it's so large that it's quite a challenge to get a grip on it. Statistics provides tools (both qualitative and quantitative) to analyze large datasets like this.

A question

First, let's state a specific question we wish to address: How is on base percentage related to position?

Qualitative analysis

One way to visually investigate this question is with a side-by-side box plot.

In [2]:
%matplotlib inline
import seaborn as sns

# Keep only non-pitchers who played more than 75 games
df2 = df[(df['position'] != 'P') & (df['G'] > 75)]
pic = sns.catplot(
    kind='box', x='position', y='OBP', data=df2,
    aspect=1.5
)

In this image, the positions are listed on the horizontal axis and the on base percentage is on the vertical axis. The data has actually been trimmed down quite a bit to include only those non-pitchers who played more than 75 games; there were 327 such players that year. If you understand how to read a box and whisker plot, you can see quite clearly the differences in on base percentage between the different positions.
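For readers new to box and whisker plots: the box runs from the first quartile to the third quartile, the line inside the box marks the median, and by default seaborn extends the whiskers to the farthest points lying within 1.5 times the interquartile range of the box, drawing anything beyond that as an individual point. As a rough sketch, the standard library can compute those numbers for the five OBP values printed in the table earlier. Five values are far too few for real analysis, and the variable names are my own.

```python
import statistics

# OBP for the five players shown in the table earlier.
obp = [0.359, 0.340, 0.330, 0.341, 0.366]

# The three cut points that split the sorted data into four parts.
q1, median, q3 = statistics.quantiles(obp, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the height of the box

# Values beyond these fences would be drawn as individual outlier points.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in obp if x < lower_fence or x > upper_fence]

print("box from", q1, "to", q3, "with median", median)
print("outliers:", outliers)
```

The plot applies the same idea once per position, which is what makes the groups easy to compare side by side.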

Quantitative analysis

While we can certainly see the differences between the positions in the box and whisker plots, it's nice to have definitive numbers to point to that support our analysis. For comparing the means of a numerical variable across several groups, there is a well established tool called ANOVA, for Analysis of Variance. Here's how to run ANOVA for this example:

In [3]:
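import statsmodels.api as sm
from statsmodels.formula.api import ols

# Model OBP as a function of the categorical variable position
mod = ols('OBP ~ position', data=df2).fit()

# Build the type 2 ANOVA table from the fitted model
aov_table = sm.stats.anova_lm(mod, typ=2)
aov_table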
Out[3]:
            sum_sq     df         F    PR(>F)
position  0.041293    6.0  5.712489  0.000012
Residual  0.385524  320.0       NaN       NaN

The "PR(>F)" entry is an example of a $p$-value. As we will learn, these kinds of computations allow us to make inferences. Typically, a small $p$-value indicates a deviation from the status quo. In this case, it indicates that the average on base percentages of the positions are not all the same.
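The F value in the table can be recovered by hand from the other two columns: divide each sum of squares by its degrees of freedom to get a mean square, then take the ratio of the two mean squares. Here is a short check using only the numbers printed above (the variable names are mine):

```python
# Values copied from the ANOVA table above.
ss_position, df_position = 0.041293, 6
ss_residual, df_residual = 0.385524, 320

# Mean square = sum of squares / degrees of freedom.
ms_position = ss_position / df_position  # variation between positions
ms_residual = ss_residual / df_residual  # variation within positions

# The F statistic compares between-group to within-group variation.
f_stat = ms_position / ms_residual
print(round(f_stat, 2))  # about 5.71, matching the table up to rounding
```

The $p$-value is then the area to the right of this F value under an F distribution with 6 and 320 degrees of freedom. A large F means the positions differ more than the player-to-player variation within a position would suggest, which is why the $p$-value is so small.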

Types of data

There are two main types of data and both types can be further classified into two sub-types.

  • Numerical data, which can be
    • Discrete or
    • Continuous
  • Categorical data, which can be
    • Nominal or
    • Ordinal

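If you take a look at our baseball data above, you can see numerical data of both types (home runs are discrete, on base percentage is continuous) as well as nominal categorical data such as team and position. Ordinal categorical data has categories that come in a natural order, even though arithmetic on them makes no sense. A player's jersey number is a useful cautionary example: it looks numeric but there's really no reasonable computation that can be done with it, so it is best treated as categorical, and as nominal rather than ordinal, since the order of the numbers carries no meaning.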
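One way to make this classification concrete is to look at how each value in a row is stored. The sketch below uses a few fields from the first row of the table; the helper `kind` is hypothetical, and the rule it applies is only a first guess, since the jersey number example shows that a value stored as an integer can still be categorical.

```python
# A few fields from the first row of the batting table.
row = {"name": "I Suzuki", "team": "SEA", "position": "OF", "HR": 6, "OBP": 0.359}

def kind(value):
    """First-guess data type based only on how the value is stored."""
    if isinstance(value, int):
        return "numerical (discrete)"
    if isinstance(value, float):
        return "numerical (continuous)"
    return "categorical"

kinds = {field: kind(value) for field, value in row.items()}
for field, label in kinds.items():
    print(field, "->", label)
```

A guess like this still needs a human check: deciding whether a categorical column is nominal or ordinal, or whether an integer column is really a label, depends on what the values mean rather than how they are stored.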