In the Analysis of Variance (ANOVA), we compare some statistic across several groups. In the simplest case, outlined in section 5.5 of our text, we compare means.
Here’s a little data to get the idea across:
library(knitr)
df = read.csv('https://www.marksmath.org/classes/Summer2017Stat185/data/mlbBat10.tsv', sep="\t")
bball.df = subset(df, G>75 & position != 'P')
kable(head(bball.df))
name | team | position | G | AB | R | H | X2B | X3B | HR | RBI | TB | BB | SO | SB | CS | OBP | SLG | AVG |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
I Suzuki | SEA | OF | 162 | 680 | 74 | 214 | 30 | 3 | 6 | 43 | 268 | 45 | 86 | 42 | 9 | 0.359 | 0.394 | 0.315 |
D Jeter | NYY | SS | 157 | 663 | 111 | 179 | 30 | 3 | 10 | 67 | 245 | 63 | 106 | 18 | 5 | 0.340 | 0.370 | 0.270 |
M Young | TEX | 3B | 157 | 656 | 99 | 186 | 36 | 3 | 21 | 91 | 291 | 50 | 115 | 4 | 2 | 0.330 | 0.444 | 0.284 |
J Pierre | CWS | OF | 160 | 651 | 96 | 179 | 18 | 3 | 1 | 47 | 206 | 45 | 47 | 68 | 18 | 0.341 | 0.316 | 0.275 |
R Weeks | MIL | 2B | 160 | 651 | 112 | 175 | 32 | 4 | 29 | 83 | 302 | 76 | 184 | 11 | 4 | 0.366 | 0.464 | 0.269 |
M Scutaro | BOS | SS | 150 | 632 | 92 | 174 | 38 | 0 | 11 | 56 | 245 | 53 | 71 | 5 | 4 | 0.333 | 0.388 | 0.275 |
This data represents the 327 non-pitcher players who played more than 75 MLB games in 2010. There’s quite a bit of data on each player - number of games played, at bats, runs, hits, doubles, triples, etc.
We are interested in the relationship between a player’s defensive position and their offensive output. Obviously, the position
field will be important for us. A player’s offensive output can be measured any number of ways; we will use their “On Base Percentage” (OBP).
It’s not too surprising that there’s some difference between positions; we might expect a huge difference between pitchers and designated hitters, for example. I was surprised by the difference between first and third base players:
b1 = subset(bball.df, position == '1B')$OBP
b3 = subset(bball.df, position == '3B')$OBP
c(mean(b1),mean(b3))
## [1] 0.3556571 0.3169091
I guess it might not be totally clear that the difference between an OBP of 0.356 and one of 0.317 is meaningful. Let’s apply a \(t\)-test.
t.test(b1,b3)
##
## Welch Two Sample t-test
##
## data: b1 and b3
## t = 4.1251, df = 74.242, p-value = 9.568e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.02003268 0.05746342
## sample estimates:
## mean of x mean of y
## 0.3556571 0.3169091
The tiny \(p\)-value clearly indicates a difference. A \(t\)-test, however, compares only two groups at a time; the major point of this section is that ANOVA extends this kind of comparison to several groups at once.
A simple way to visually compare groups is with a side-by-side box plot. There are a number of ways to generate this in R, but one of the simplest is with the ggplot2 library.
library(ggplot2)
ggplot(bball.df, aes(x = position, y = OBP)) +
geom_boxplot(fill = "grey80", colour = "blue") +
scale_x_discrete() + xlab("Position") +
ylab("On base percentage")
It certainly looks like there’s variation in the means, but we’d like to measure it.
Before doing so, it’s worth mentioning that the variability within each group looks roughly the same; roughly equal group variances is an important assumption to check if we want ANOVA to work well. This is why, for example, we require a minimum number of games and exclude pitchers. Look at the picture without those restrictions:
library(ggplot2)
ggplot(df, aes(x = position, y = OBP)) +
geom_boxplot(fill = "grey80", colour = "blue") +
scale_x_discrete() + xlab("Position") +
ylab("On base percentage")
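The box plots let us eyeball the spread, but we can also check the roughly-equal-variability assumption numerically by computing each group’s standard deviation. Here’s a minimal, self-contained sketch using a tiny made-up data frame in place of bball.df; for the real data, the same tapply call applies to bball.df$OBP and bball.df$position.

```r
# Synthetic stand-in for bball.df: two "positions" with a known spread.
toy = data.frame(
  position = c("1B", "1B", "1B", "1B", "3B", "3B", "3B", "3B"),
  OBP = c(0.30, 0.32, 0.34, 0.36,
          0.28, 0.30, 0.32, 0.34)
)
# Standard deviation of the response within each group:
group.sds = tapply(toy$OBP, toy$position, sd)
group.sds
# A common rule of thumb: ANOVA is fairly robust as long as the largest
# group standard deviation is no more than about twice the smallest.
max(group.sds) / min(group.sds)
```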
Let’s clarify some terms.
We use an ANOVA means test to compare the means of \(k\) groups of observations. ANOVA uses a test statistic \(F\), which represents a standardized ratio of variability in the sample means relative to the variability within the groups.
Associated with each of the \(k\) groups is a mean, say \(\mu_1, \mu_2, \ldots, \mu_k\). Using these means, we may state our null and alternative hypotheses \(H_0\) and \(H_A\) as follows:
\[ \begin{array}{ll} H_0: & \mu_1 = \mu_2 = \cdots = \mu_k \\ H_A: & \text{at least two } \mu_i\text{s are different} \end{array} \] We are ultimately interested in a \(p\)-value for the test statistic. We’ll use this \(p\)-value just as we’ve used other \(p\)-values: to reject or fail to reject the null.
ANOVA is built on top of the idea of a linear model (the lm in the code below). Thus, we first build the model and then pass that to the anova command.
bball.df$position = factor(bball.df$position,
labels = c('1B', '2B', '3B', 'C', 'DH', 'OF', 'SS')
)
bball.modl = lm(OBP ~ position, data = bball.df)
anova(bball.modl)
## Analysis of Variance Table
##
## Response: OBP
## Df Sum Sq Mean Sq F value Pr(>F)
## position 6 0.04129 0.0068822 5.7125 1.165e-05 ***
## Residuals 320 0.38552 0.0012048
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The big question is: how do we interpret this? For us, it really boils down to the \(p\)-value. As usual, the smaller the \(p\)-value, the stronger the evidence against the null hypothesis. In this particular case, \(p=0.00001165\), so we reject the null hypothesis in favor of the alternative that there is genuine variation in OBP between player positions.
Like the \(z\)-test and \(t\)-test, the \(F\)-test is associated with a particular distribution, called the \(F\) distribution. This has two parameters, namely \(df_{1}=k-1\) and \(df_{2}=n-k\), where \(k\) is the number of groups and \(n\) is the total number of observations. The upper tail of the \(F\) distribution is used to represent the \(p\)-value.
Here’s a plot of the \(F\) distribution associated with the previous problem.
f <- function(x) df(x,6,320)
plot(f,0,4)
And here’s the computation of the \(p\)-value:
1-pf(5.7125, 6,320)
## [1] 1.165063e-05
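As a sanity check on this upper-tail relationship, we can simulate data where \(H_0\) actually holds and confirm that the Pr(>F) reported by anova is exactly the upper tail of the appropriate \(F\) distribution. This sketch uses synthetic data, not our baseball set: all three groups are drawn with the same true mean.

```r
# Simulated sanity check: three groups of ten observations, all drawn
# with the same true mean, so the null hypothesis holds by construction.
set.seed(1)
k = 3; n.per = 10
sim = data.frame(
  group = factor(rep(1:k, each = n.per)),
  y = rnorm(k * n.per, mean = 5, sd = 1)
)
tab = anova(lm(y ~ group, data = sim))
Fval = tab[["F value"]][1]
pval = tab[["Pr(>F)"]][1]
# The reported Pr(>F) is exactly the upper tail of F(k-1, n-k):
c(pval, 1 - pf(Fval, k - 1, k * n.per - k))
```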
As already mentioned, the \(F\)-statistic represents a standardized ratio of variability in the sample means relative to the variability within the groups. Formulaically, \[F = \frac{MSG}{MSE}.\] \(MSG\) represents a measure of the between-group variability and can be computed \[MSG =\frac{1}{k-1}\sum_{i=1}^{k} n_{i}\left(\bar{x}_{i} - \bar{x}\right)^2.\]
\(MSE\) measures the variability within each of the groups and can be computed \[MSE = \frac{1}{n-k}\left((n_1-1)s_1^2 + (n_2-1)s_2^2 + \cdots + (n_k-1)s_k^2\right),\] where \(s_i\) is the standard deviation of group \(i\).
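We can check these formulas by hand on a small synthetic data set: compute MSG and MSE directly from the group sizes, means, and standard deviations, and confirm that \(F = MSG/MSE\) matches what anova reports.

```r
# Compute MSG and MSE by hand on made-up data and compare with anova().
set.seed(2)
ni = c(8, 10, 12)                         # group sizes n_1, n_2, n_3
groups = factor(rep(c("A", "B", "C"), times = ni))
y = rnorm(sum(ni), mean = rep(c(0, 0.5, 1), times = ni))
k = nlevels(groups); n = length(y)
xbar = mean(y)                            # grand mean
xbari = tapply(y, groups, mean)           # group means
si = tapply(y, groups, sd)                # group standard deviations
MSG = sum(ni * (xbari - xbar)^2) / (k - 1)
MSE = sum((ni - 1) * si^2) / (n - k)
F.by.hand = MSG / MSE
F.from.anova = anova(lm(y ~ groups))[["F value"]][1]
c(F.by.hand, F.from.anova)
```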
Let’s return to the question of whether we change as runners over the years. We’ll use our 2015 Peachtree Road Race data. We can download that data, grab a range of ages, say from 25 to 35 inclusive, and check to see if there is a noticeable difference in the times across those ages.
First, let’s get the data and set up the smaller data frame of interest.
df = read.csv('https://www.marksmath.org/data/peach_tree2015.csv')
ages = 25:35
df2 = subset(df, Age %in% ages)
Then, we’ll set up and run an ANOVA:
df2$Age = factor(df2$Age, labels = ages)
modl = lm(Net.Time ~ Age, data = df2)
anova(modl)
## Analysis of Variance Table
##
## Response: Net.Time
## Df Sum Sq Mean Sq F value Pr(>F)
## Age 10 8808 880.83 2.0892 0.02194 *
## Residuals 12821 5405583 421.62
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since \(p = 0.02194 < 0.05\), it looks like we can reject the null hypothesis that our running times stay the same over these ages, if we’re looking for a 95% level of confidence.