Very commonly, we wish to compare two data sets. We did this before when comparing runners from various age groups in the Peachtree Road Race. Recall that I’ve got a data set listing the results of all 54796 amateur runners from the 2015 Peachtree. We can read in that data and compare the average times of men in two age groups as follows:
#df = read.csv('https://www.marksmath.org/classes/Summer2017Stat185/data/peach_tree2015.csv')
df = read.csv('/Users/mcmcclur/Documents/html/classes/Summer2017Stat185/data/peach_tree2015.csv')
men = subset(df, Gender == 'M')
young = subset(men, 35<=Age & Age<40)
old = subset(men, 40<=Age & Age<45)
set.seed(1) # For reproducibility
young_times <- sample(young$Net.Time, 100)
old_times <- sample(old$Net.Time, 100)
mu_young = mean(young_times)
mu_old = mean(old_times)
c(mu_young,mu_old)
## [1] 73.41334 77.00703
Back when we first jumped into hypothesis testing, we computed a \(p\)-value for the hypothesis that the average old time was greater than \(73.4\), treating the average young time as fixed. It would be better to compare the two means directly, and t.test provides a way to do this.
t.test(young_times, old_times, alternative = "less")
##
## Welch Two Sample t-test
##
## data: young_times and old_times
## t = -1.2586, df = 193, p-value = 0.1049
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf 1.125634
## sample estimates:
## mean of x mean of y
## 73.41334 77.00703
I was a little surprised that the \(p\)-value wasn't smaller. On the other hand, we do see the expected result once we widen the gap between the age groups a bit.
men = subset(df, Gender == 'M')
young = subset(men, 35<=Age & Age<40)
old = subset(men, 45<=Age & Age<50)
set.seed(1) # For reproducibility
young_times <- sample(young$Net.Time, 100)
old_times <- sample(old$Net.Time, 100)
t.test(young_times, old_times, alternative = "less")
##
## Welch Two Sample t-test
##
## data: young_times and old_times
## t = -1.7147, df = 191.39, p-value = 0.04401
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -0.1792582
## sample estimates:
## mean of x mean of y
## 73.41334 78.38337
The ideas behind the computations that produce the above \(t\)-test are quite similar to the hypothesis tests we've done before. The major new ingredients are formulae to combine the measures of the two original data sets into one. To do so, suppose that we have two samples with means \(\bar{x}_1\) and \(\bar{x}_2\), standard deviations \(\sigma_1\) and \(\sigma_2\), and sizes \(n_1\) and \(n_2\). Then, we analyze the difference of the two means using a \(t\)-test with standard error \[SE = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\] and (taking the conservative choice) \(\min(n_1, n_2) - 1\) degrees of freedom.
In this setup, the expression \[\frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}\] is often called the test statistic.
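To make the formula concrete, here's a small sketch (on made-up data, not the race results) that computes the test statistic by hand and checks it against the \(t\) statistic that t.test reports:

```r
# Made-up samples standing in for the two groups of race times.
set.seed(1)
x <- rnorm(30, mean = 73, sd = 10)
y <- rnorm(30, mean = 78, sd = 10)

# Test statistic from the formula above.
se <- sqrt(sd(x)^2 / length(x) + sd(y)^2 / length(y))
t_by_hand <- (mean(x) - mean(y)) / se

# t.test computes this same statistic internally.
t_from_test <- unname(t.test(x, y)$statistic)
c(t_by_hand, t_from_test)
```

The two printed values agree, since the only place t.test differs from our hand computation is in its choice of degrees of freedom, which affects the \(p\)-value but not the statistic itself.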
Sam thinks that there is a difference in quality of life between rural and urban living. He collects information from obituaries in newspapers from urban and rural towns in Idaho to see if there is a difference in life expectancy. A sample of 4 people from rural towns gives an average life expectancy of \(\bar{x}_r = 72\) years with a standard deviation of \(\sigma_r=6.99\) years. A sample of 6 people from larger towns gives \(\bar{x}_u=81.9\) years and \(\sigma_u=5.64\) years. Does this provide evidence that people living in rural Idaho communities have a shorter life expectancy than those in more urban communities at a \(95\%\) level of confidence?
Solution: There are \(\min(4,6)-1 = 3\) degrees of freedom and the test statistic is
\[\frac{\bar{x}_r - \bar{x}_u}{\sqrt{\frac{\sigma_r^2}{n_r} + \frac{\sigma_u^2}{n_u}}} = \frac{72-81.9}{\sqrt{\frac{6.99^2}{4}+\frac{5.64^2}{6}}} = -2.36543.\]
Thus, we compute the \(p\)-value via
pt(-2.36543, 3)
## [1] 0.0494561
Since the \(p\)-value is just below \(0.05\), we reject the null hypothesis.
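The whole computation can be carried out from the summary statistics alone; here's a sketch using the numbers from the problem statement and the conservative choice of \(\min(n_r, n_u) - 1\) degrees of freedom:

```r
# Two-sample t-test from summary statistics, with the
# conservative df = min(n_r, n_u) - 1 = 3.
xbar_r <- 72;   s_r <- 6.99; n_r <- 4
xbar_u <- 81.9; s_u <- 5.64; n_u <- 6

se <- sqrt(s_r^2 / n_r + s_u^2 / n_u)
t_stat <- (xbar_r - xbar_u) / se
p_value <- pt(t_stat, df = min(n_r, n_u) - 1)
c(t_stat, p_value)
```

This reproduces the test statistic of \(-2.36543\) and the \(p\)-value of about \(0.0495\) found above.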
Alternatively, we could use a table. If we look at the \(t\)-table in appendix B2 on page 430 of our text, we see something that looks like so:
| one tail  | 0.100 | 0.050 | 0.025 | 0.010 | 0.005 |
|-----------|-------|-------|-------|-------|-------|
| two tails | 0.200 | 0.100 | 0.050 | 0.020 | 0.010 |
| df = 1    | 3.08  | 6.31  | 12.71 | 31.82 | 63.66 |
| df = 2    | 1.89  | 2.92  | 4.30  | 6.96  | 9.92  |
| df = 3    | 1.64  | 2.35  | 3.18  | 4.54  | 5.84  |
| df = 4    | 1.53  | 2.13  | 2.78  | 3.75  | 4.60  |
The entries in this table are critical \(t^*\) values. The columns indicate several common significance levels, labeled both for one-sided and for two-sided tests. The rows correspond to degrees of freedom. Thus, you can figure out where a given \(t\)-score lies relative to your critical value.
Look in the row where \(df=3\). As we move from left to right along this row, the corresponding \(p\)-values decrease. We are interested in the column where the one-sided level is \(0.05\). The corresponding \(t^*\) value in our row and column is 2.35. Since the absolute value of our \(t\)-score, 2.365, is slightly larger than this, the \(p\)-value must be smaller than \(0.05\); thus we (again) reject the null hypothesis.
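The table lookup can also be done in R with qt, which inverts pt. Here's a quick sketch recovering the critical value for a one-sided test at the \(0.05\) level with 3 degrees of freedom:

```r
# Critical t* with 3 degrees of freedom: the quantile leaving
# 5% in the upper tail. This matches the 2.35 entry in the table.
qt(0.95, df = 3)
## [1] 2.353363
```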
Sometimes, data is naturally paired. As an example, consider the prices of the same items from two different sources. More concretely, here's some data on the price of textbooks at UCLA's bookstore and at Amazon:
#df = read.csv('https://www.marksmath.org/classes/Summer2017Stat185/data/textbooks.csv')
df = read.csv('/Users/mcmcclur/Documents/html/classes/Summer2017Stat185/data/textbooks.csv')
head(df)
## X deptAbbr course ibsn uclaNew amazNew more diff
## 1 1 Am Ind C170 978-0803272620 27.67 27.95 Y -0.28
## 2 2 Anthro 9 978-0030119194 40.59 31.14 Y 9.45
## 3 3 Anthro 135T 978-0300080643 31.68 32.00 Y -0.32
## 4 4 Anthro 191HB 978-0226206813 16.00 11.52 Y 4.48
## 5 5 Art His M102K 978-0892365999 18.95 14.21 Y 4.74
## 6 6 Art His 118E 978-0394723693 14.95 10.17 Y 4.78
The natural and simple way to explore this type of data is to work with the pairwise differences. That's easy with this data set, since the differences are already included as a column. Here's a histogram of the differences:
hist(df$diff)
Note that most of the weight lies to the right of zero, indicating that most of the texts are less expensive at Amazon. We can quantify this with a \(t\)-test.
t.test(df$diff, alternative = "greater")
##
## One Sample t-test
##
## data: df$diff
## t = 7.6488, df = 72, p-value = 3.464e-11
## alternative hypothesis: true mean is greater than 0
## 95 percent confidence interval:
## 9.981505 Inf
## sample estimates:
## mean of x
## 12.76164
That's strong! Note that the average difference is \(\$12.76\) and, to a \(95\%\) level of confidence, the true mean difference is at least about \(\$10\); that is, textbooks average at least \(\$10\) cheaper on Amazon.
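For paired data like this, t.test can also be given the two columns directly with its paired argument, which is equivalent to the one-sample test on the differences. Here's a sketch using just the first five price pairs shown in the table above:

```r
# The first five price pairs from the textbook data above.
ucla <- c(27.67, 40.59, 31.68, 16.00, 18.95)
amaz <- c(27.95, 31.14, 32.00, 11.52, 14.21)

# A paired test on the two columns...
paired <- t.test(ucla, amaz, paired = TRUE, alternative = "greater")

# ...is the same as a one-sample test on their differences.
one_sample <- t.test(ucla - amaz, alternative = "greater")

c(paired$p.value, one_sample$p.value)
```

The two calls produce identical statistics and \(p\)-values, so the choice between them is purely a matter of convenience.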