In the Analysis of Variance (ANOVA), we compare some statistic across several groups. In the simplest case, outlined in section 5.5 of our text, we compare means.
Here’s a little data to get the idea across:
library(knitr)
df = read.csv('https://www.marksmath.org/classes/Summer2017Stat185/data/mlbBat10.tsv', sep="\t")
bball.df = subset(df, G>75 & position != 'P')
kable(head(bball.df))
name | team | position | G | AB | R | H | X2B | X3B | HR | RBI | TB | BB | SO | SB | CS | OBP | SLG | AVG |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
I Suzuki | SEA | OF | 162 | 680 | 74 | 214 | 30 | 3 | 6 | 43 | 268 | 45 | 86 | 42 | 9 | 0.359 | 0.394 | 0.315 |
D Jeter | NYY | SS | 157 | 663 | 111 | 179 | 30 | 3 | 10 | 67 | 245 | 63 | 106 | 18 | 5 | 0.340 | 0.370 | 0.270 |
M Young | TEX | 3B | 157 | 656 | 99 | 186 | 36 | 3 | 21 | 91 | 291 | 50 | 115 | 4 | 2 | 0.330 | 0.444 | 0.284 |
J Pierre | CWS | OF | 160 | 651 | 96 | 179 | 18 | 3 | 1 | 47 | 206 | 45 | 47 | 68 | 18 | 0.341 | 0.316 | 0.275 |
R Weeks | MIL | 2B | 160 | 651 | 112 | 175 | 32 | 4 | 29 | 83 | 302 | 76 | 184 | 11 | 4 | 0.366 | 0.464 | 0.269 |
M Scutaro | BOS | SS | 150 | 632 | 92 | 174 | 38 | 0 | 11 | 56 | 245 | 53 | 71 | 5 | 4 | 0.333 | 0.388 | 0.275 |
This data represents the 327 non-pitcher players who played more than 75 MLB games in 2010. There’s quite a bit of data on each player - number of games played, at bats, runs, hits, doubles, triples, etc.
We are interested in the relationship between a player’s defensive position and their offensive output. Obviously, the position
field will be important for us. A player’s offensive output can be measured any number of ways; we will use their “On Base Percentage” (OBP).
It’s not too surprising that there’s some difference between positions; we might expect a huge difference between pitchers and designated hitters, for example. I was surprised by the difference between first and third base players:
b1 = subset(bball.df, position == '1B')$OBP
b3 = subset(bball.df, position == '3B')$OBP
c(mean(b1),mean(b3))
## [1] 0.3556571 0.3169091
I guess it might not be totally clear that the difference between an OBP of 0.356 and one of 0.317 is meaningful. Let’s apply a \(t\)-test.
t.test(b1,b3)
##
## Welch Two Sample t-test
##
## data: b1 and b3
## t = 4.1251, df = 74.242, p-value = 9.568e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.02003268 0.05746342
## sample estimates:
## mean of x mean of y
## 0.3556571 0.3169091
The tiny \(p\)-value clearly indicates a difference. A \(t\)-test, however, compares only two groups at a time; the major point of this section is that ANOVA extends this kind of comparison to several groups at once.
A simple way to visually compare groups is with a side-by-side box plot. There are a number of ways to generate this in R, but one of the simplest is with the ggplot2 library.
library(ggplot2)
ggplot(bball.df, aes(x = position, y = OBP)) +
geom_boxplot(fill = "grey80", colour = "blue") +
scale_x_discrete() + xlab("Position") +
ylab("On base percentage")
It certainly looks like there’s variation in the means, but we’d like to measure it.
Before doing so, it’s worth mentioning that the variability within each group looks roughly the same; roughly equal group variances is an important assumption to check if we want ANOVA to work well. This is why, for example, we require a minimum number of games and exclude pitchers. Look at the picture without those restrictions:
library(ggplot2)
ggplot(df, aes(x = position, y = OBP)) +
geom_boxplot(fill = "grey80", colour = "blue") +
scale_x_discrete() + xlab("Position") +
ylab("On base percentage")
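The box plots let us eyeball the spread, but we can also check the roughly-equal-variability assumption numerically by computing each group’s standard deviation. Here’s a minimal, self-contained sketch using a tiny made-up data frame in place of bball.df; for the real data, the same tapply call applies to bball.df$OBP and bball.df$position.

```r
# Synthetic stand-in for bball.df: two "positions" with a known spread.
toy = data.frame(
  position = c("1B", "1B", "1B", "1B", "3B", "3B", "3B", "3B"),
  OBP = c(0.30, 0.32, 0.34, 0.36,
          0.28, 0.30, 0.32, 0.34)
)
# Standard deviation of the response within each group:
group.sds = tapply(toy$OBP, toy$position, sd)
group.sds
# A common rule of thumb: ANOVA is fairly robust as long as the largest
# group standard deviation is no more than about twice the smallest.
max(group.sds) / min(group.sds)
```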
Let’s clarify some terms.
We use an ANOVA means test to compare the means of \(k\) groups of observations. ANOVA uses a test statistic \(F\), which represents a standardized ratio of variability in the sample means relative to the variability within the groups.
Associated with each of the \(k\) groups is a mean, say \(\mu_1, \mu_2, \ldots, \mu_k\). Using these means, we may state our null and alternative hypotheses \(H_0\) and \(H_A\) as follows:
\[ \begin{array}{ll} H_0: & \mu_1 = \mu_2 = \cdots = \mu_k \\ H_A: & \text{at least two } \mu_i\text{s are different} \end{array} \] We are ultimately interested in a \(p\)-value for the test statistic. We’ll use this \(p\)-value just as we’ve used other \(p\)-values: to reject or fail to reject the null.
ANOVA is built on top of the idea of a linear model (the lm in the code below). Thus, we first build the model and then pass that to the anova command.
bball.df$position = factor(bball.df$position,
labels = c('1B', '2B', '3B', 'C', 'DH', 'OF', 'SS')
)
bball.modl = lm(OBP ~ position, data = bball.df)
anova(bball.modl)
## Analysis of Variance Table
##
## Response: OBP
## Df Sum Sq Mean Sq F value Pr(>F)
## position 6 0.04129 0.0068822 5.7125 1.165e-05 ***
## Residuals 320 0.38552 0.0012048
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The big question is: how do we interpret this? For us, it really boils down to the \(p\)-value. As usual, the smaller the \(p\)-value, the stronger the evidence against the null hypothesis. In this particular case, \(p=0.00001165\), so we reject the null hypothesis in favor of the alternative that there is genuine variation in OBP between player positions.
Like the \(z\)-test and \(t\)-test, the \(F\)-test is associated with a particular distribution, called the \(F\) distribution. This has two parameters, namely \(df_{1}=k-1\) and \(df_{2}=n-k\), where \(k\) is the number of groups and \(n\) is the total number of observations. The upper tail of the \(F\) distribution is used to represent the \(p\)-value.
Here’s a plot of the \(F\) distribution associated with the previous problem.
f <- function(x) df(x,6,320)
plot(f,0,4)
And here’s the computation of the \(p\)-value:
1-pf(5.7125, 6,320)
## [1] 1.165063e-05
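As a sanity check on this upper-tail relationship, we can simulate data where \(H_0\) actually holds and confirm that the Pr(>F) reported by anova is exactly the upper tail of the appropriate \(F\) distribution. This sketch uses synthetic data, not our baseball set: all three groups are drawn with the same true mean.

```r
# Simulated sanity check: three groups of ten observations, all drawn
# with the same true mean, so the null hypothesis holds by construction.
set.seed(1)
k = 3; n.per = 10
sim = data.frame(
  group = factor(rep(1:k, each = n.per)),
  y = rnorm(k * n.per, mean = 5, sd = 1)
)
tab = anova(lm(y ~ group, data = sim))
Fval = tab[["F value"]][1]
pval = tab[["Pr(>F)"]][1]
# The reported Pr(>F) is exactly the upper tail of F(k-1, n-k):
c(pval, 1 - pf(Fval, k - 1, k * n.per - k))
```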
As already mentioned, the \(F\)-statistic represents a standardized ratio of variability in the sample means relative to the variability within the groups. Formulaically, \[F = \frac{MSG}{MSE}.\] \(MSG\) represents a measure of the between-group variability and can be computed \[MSG =\frac{1}{k-1}\sum_{i=1}^{k} n_{i}\left(\bar{x}_{i} - \bar{x}\right)^2.\]
\(MSE\) measures the variability within each of the groups and can be computed \[MSE = \frac{1}{n-k}\left((n_1-1)s_1^2 + (n_2-1)s_2^2 + \cdots + (n_k-1)s_k^2\right),\] where \(s_i\) is the standard deviation of group \(i\).
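We can check these formulas by hand on a small synthetic data set: compute MSG and MSE directly from the group sizes, means, and standard deviations, and confirm that \(F = MSG/MSE\) matches what anova reports.

```r
# Compute MSG and MSE by hand on made-up data and compare with anova().
set.seed(2)
ni = c(8, 10, 12)                         # group sizes n_1, n_2, n_3
groups = factor(rep(c("A", "B", "C"), times = ni))
y = rnorm(sum(ni), mean = rep(c(0, 0.5, 1), times = ni))
k = nlevels(groups); n = length(y)
xbar = mean(y)                            # grand mean
xbari = tapply(y, groups, mean)           # group means
si = tapply(y, groups, sd)                # group standard deviations
MSG = sum(ni * (xbari - xbar)^2) / (k - 1)
MSE = sum((ni - 1) * si^2) / (n - k)
F.by.hand = MSG / MSE
F.from.anova = anova(lm(y ~ groups))[["F value"]][1]
c(F.by.hand, F.from.anova)
```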
Let’s return to the question of whether we change as runners over the years. We’ll use our 2015 Peachtree Road Race data. We can download that data, grab a range of ages, say from 25 to 35 inclusive, and check to see if there is a noticeable difference in the times across those ages.
First, let’s get the data and set up the smaller data frame of interest.
df = read.csv('https://www.marksmath.org/data/peach_tree2015.csv')
ages = 25:35
df2 = subset(df, Age %in% ages)
Then, we’ll set up and run an ANOVA:
df2$Age = factor(df2$Age, labels = ages)
modl = lm(Net.Time ~ Age, data = df2)
anova(modl)
## Analysis of Variance Table
##
## Response: Net.Time
## Df Sum Sq Mean Sq F value Pr(>F)
## Age 10 8808 880.83 2.0892 0.02194 *
## Residuals 12821 5405583 421.62
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since \(p = 0.02194 < 0.05\), it looks like we can reject the null hypothesis that our running times stay the same over these ages, if we’re looking for a 95% level of confidence.