The basics

In the Analysis of Variance (ANOVA), we compare some statistic across several groups. In the simplest case, outlined in section 5.5 of our text, we compare means.

Here’s a little data to get the idea across:

library(knitr)
df = read.csv('https://www.marksmath.org/classes/Summer2017Stat185/data/mlbBat10.tsv', sep="\t")
bball.df = subset(df, G>75 & position != 'P')
kable(head(bball.df))
name       team position   G  AB   R   H X2B X3B HR RBI  TB BB  SO SB CS   OBP   SLG   AVG
I Suzuki   SEA  OF       162 680  74 214  30   3  6  43 268 45  86 42  9 0.359 0.394 0.315
D Jeter    NYY  SS       157 663 111 179  30   3 10  67 245 63 106 18  5 0.340 0.370 0.270
M Young    TEX  3B       157 656  99 186  36   3 21  91 291 50 115  4  2 0.330 0.444 0.284
J Pierre   CWS  OF       160 651  96 179  18   3  1  47 206 45  47 68 18 0.341 0.316 0.275
R Weeks    MIL  2B       160 651 112 175  32   4 29  83 302 76 184 11  4 0.366 0.464 0.269
M Scutaro  BOS  SS       150 632  92 174  38   0 11  56 245 53  71  5  4 0.333 0.388 0.275

This data represents the 327 non-pitcher players who played more than 75 MLB games in 2010. There’s quite a bit of data on each player - number of games played, at bats, runs, hits, doubles, triples, etc.

We are interested in the relationship between a player’s defensive position and their offensive output. Obviously, the position field will be important for us. A player’s offensive output can be measured any number of ways; we will use their “On Base Percentage” (OBP).

It’s not too surprising that there’s some difference between positions; we might expect a huge difference between pitchers and designated hitters, for example. I was surprised by the difference between first and third base players:

b1 = subset(bball.df, position == '1B')$OBP
b3 = subset(bball.df, position == '3B')$OBP
c(mean(b1),mean(b3))
## [1] 0.3556571 0.3169091

I guess it might not be totally clear that there’s a big difference between a .356 on-base percentage and a .317 one. Let’s apply a \(t\)-test.

t.test(b1,b3)
## 
##  Welch Two Sample t-test
## 
## data:  b1 and b3
## t = 4.1251, df = 74.242, p-value = 9.568e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.02003268 0.05746342
## sample estimates:
## mean of x mean of y 
## 0.3556571 0.3169091

The \(p\)-value clearly indicates a difference.
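As a sanity check, the Welch statistic that t.test reports can be reproduced directly from its formula. Here’s a minimal sketch using simulated stand-ins for the two OBP vectors (the sample sizes, means, and standard deviation below are made up for illustration):

```r
# Recompute the Welch t statistic by hand and compare to t.test().
# The data here are simulated stand-ins, not the actual OBP vectors.
set.seed(1)
x <- rnorm(35, mean = 0.356, sd = 0.03)  # hypothetical "first base" group
y <- rnorm(44, mean = 0.317, sd = 0.03)  # hypothetical "third base" group
t_by_hand <- (mean(x) - mean(y)) /
  sqrt(var(x) / length(x) + var(y) / length(y))
t_from_R <- unname(t.test(x, y)$statistic)
all.equal(t_by_hand, t_from_R)  # the two computations agree
```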

A major point, however, is that a \(t\)-test compares only two groups at a time. ANOVA extends this idea, letting us test for differences across all of the positions at once.

Side-by-side box plots

A simple way to visually compare groups is with a side-by-side box plot. There are a number of ways to generate this in R, but the simplest is with the ggplot2 library.

library(ggplot2)
ggplot(bball.df, aes(x = position, y = OBP)) +
   geom_boxplot(fill = "grey80", colour = "blue") +
   scale_x_discrete() + xlab("Position") +
   ylab("On base percentage")

It certainly looks like there is some variation in the means, but we’d like to measure it.

Before doing so, it’s worth mentioning that the variability within the groups looks roughly the same; roughly equal variance is an important assumption to check if we want ANOVA to work. This is why, for example, we require a minimum number of games and exclude pitchers. Look at the picture without those restrictions:

library(ggplot2)
ggplot(df, aes(x = position, y = OBP)) +
   geom_boxplot(fill = "grey80", colour = "blue") +
   scale_x_discrete() + xlab("Position") +
   ylab("On base percentage")

The statement of the hypothesis test

Let’s clarify some terms.

We use an ANOVA means test to compare the means of \(k\) groups of observations. ANOVA uses a test statistic \(F\), which represents a standardized ratio of variability in the sample means relative to the variability within the groups.

Associated with each of the \(k\) groups is a mean, say \(\mu_1, \mu_2, \ldots, \mu_k\). Using these means, we may state our null and alternative hypotheses \(H_0\) and \(H_A\) as follows:

\[ \begin{array}{ll} H_0: & \mu_1 = \mu_2 = \cdots = \mu_k \\ H_A: & \text{at least two } \mu_i\text{s are different} \end{array} \] We are ultimately interested in a \(p\)-value for the test statistic. We’ll use this \(p\)-value just as we’ve used other \(p\)-values, to decide whether to reject the null hypothesis.
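To get a feel for the test statistic before running it on real data, here’s a small simulation (entirely made-up data, assuming normal groups) showing how anova behaves when the null hypothesis is true:

```r
# When H0 is true (all group means equal), F tends to be near 1
# and the p-value is typically large. Simulated data for illustration.
set.seed(2)
k <- 4; n_per <- 25
sim <- data.frame(
  group = factor(rep(1:k, each = n_per)),
  value = rnorm(k * n_per, mean = 0.33, sd = 0.03)  # identical means
)
tab <- anova(lm(value ~ group, data = sim))
tab$`F value`[1]  # ratio of between-group to within-group variability
tab$`Pr(>F)`[1]   # under H0, this is uniformly distributed on (0, 1)
```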

Using anova

ANOVA is built on top of the idea of a linear model (the lm in the code below). Thus, we first build the model and then pass that to the anova command.

bball.df$position = factor(bball.df$position,
  labels = c('1B', '2B', '3B', 'C', 'DH', 'OF', 'SS')
)
bball.modl = lm(OBP ~ position, data = bball.df)
anova(bball.modl)
## Analysis of Variance Table
## 
## Response: OBP
##            Df  Sum Sq   Mean Sq F value    Pr(>F)    
## position    6 0.04129 0.0068822  5.7125 1.165e-05 ***
## Residuals 320 0.38552 0.0012048                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The big question is: how do we interpret this? For us, it really boils down to the \(p\)-value. As usual, the smaller the \(p\)-value, the stronger the evidence against the null hypothesis. In this particular case, \(p=0.00001165\), so we reject the null hypothesis in favor of the alternative that there is genuine variation in OBP between player positions.

Some formulae lurking in the background

The \(F\) distribution

Like the \(z\)-test and the \(t\)-test, the \(F\)-test is associated with a particular distribution, called the \(F\) distribution. It has two parameters, namely \(df_{1}=k-1\) and \(df_{2}=n-k\), where \(k\) is the number of groups and \(n\) is the total number of observations. The \(p\)-value is the area in the upper tail of the \(F\) distribution beyond the observed \(F\) statistic.
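For the baseball example, \(k = 7\) positions and \(n = 327\) players, so the degrees of freedom work out as follows:

```r
# Degrees of freedom for the baseball ANOVA
k <- 7    # number of positions
n <- 327  # number of players
df1 <- k - 1  # numerator degrees of freedom: 6
df2 <- n - k  # denominator degrees of freedom: 320
c(df1, df2)
```

These are exactly the 6 and 320 that appear in the Df column of the ANOVA table above.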

Here’s a plot of the \(F\) distribution associated with the previous problem.

f <- function(x) df(x,6,320)
plot(f,0,4)

And here’s the computation of the \(p\)-value:

1-pf(5.7125, 6,320)
## [1] 1.165063e-05

The \(F\)-statistic

As already mentioned, the \(F\)-statistic represents a standardized ratio of variability in the sample means relative to the variability within the groups. Formulaically, \[F = \frac{MSG}{MSE}.\] \(MSG\) represents a measure of the between-group variability and can be computed \[MSG =\frac{1}{k-1}\sum_{i=1}^{k} n_{i}\left(\bar{x}_{i} - \bar{x}\right)^2.\]

\(MSE\) measures the variability within each of the groups and can be computed \[MSE = \frac{1}{n-k}\left((n_1-1)s_1^2 + (n_2-1)s_2^2 + \cdots + (n_k-1)s_k^2\right),\] where \(s_i\) is the standard deviation of group \(i\).
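These formulas match what anova computes. Here’s a sketch checking \(MSG/MSE\) by hand against R’s output, using small simulated groups (made-up data, not the baseball set):

```r
# Compute MSG and MSE from their formulas and compare to anova().
# Simulated data for illustration only.
set.seed(42)
ni_true <- c(10, 12, 8)
g <- factor(rep(c("a", "b", "c"), times = ni_true))
x <- rnorm(sum(ni_true), mean = rep(c(0, 0.5, 1), times = ni_true))
k <- nlevels(g); n <- length(x)
xbar  <- mean(x)             # grand mean
ni    <- tapply(x, g, length)
xbari <- tapply(x, g, mean)  # group means
si    <- tapply(x, g, sd)    # group standard deviations
MSG <- sum(ni * (xbari - xbar)^2) / (k - 1)
MSE <- sum((ni - 1) * si^2) / (n - k)
F_by_hand <- MSG / MSE
F_anova <- anova(lm(x ~ g))$`F value`[1]
all.equal(F_by_hand, F_anova)  # the hand computation matches anova()
```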

Another example

Let’s return to the question of whether we change as runners over the years. We’ll use our 2015 Peachtree Road Race data. We can download that data, grab a range of ages, say 25 to 35 inclusive, and check whether there’s a noticeable difference in finishing times across those ages.

First, let’s get the data and set up the smaller data frame of interest.

df = read.csv('https://www.marksmath.org/data/peach_tree2015.csv')
# A local copy works too, e.g.:
# df = read.csv('/Users/mcmcclur/Documents/html/classes/Summer2017Stat185/data/peach_tree2015.csv')
ages = 25:35
df2 = subset(df, Age %in% ages)

Then, we’ll set up and run an ANOVA:

df2$Age = factor(df2$Age, labels = ages)
modl = lm(Net.Time ~ Age, data = df2)
anova(modl)
## Analysis of Variance Table
## 
## Response: Net.Time
##              Df  Sum Sq Mean Sq F value  Pr(>F)  
## Age          10    8808  880.83  2.0892 0.02194 *
## Residuals 12821 5405583  421.62                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Looks like we can reject the null hypothesis that our running speed stays the same across these eleven ages, if we’re looking for a 95% level of confidence.