Very commonly, we wish to compare two data sets. We did this before when comparing runners from various age groups in the Peachtree Road Race. Recall that I’ve got a data set listing the results of all 54796 amateur runners from the 2015 Peachtree. We can read in that data and compare the average times of men in two age groups as follows:

#df = read.csv('https://www.marksmath.org/classes/Summer2017Stat185/data/peach_tree2015.csv')
df = read.csv('/Users/mcmcclur/Documents/html/classes/Summer2017Stat185/data/peach_tree2015.csv')
men = subset(df, Gender == 'M')
young = subset(men, 35<=Age & Age<40)
old = subset(men, 40<=Age & Age<45)

set.seed(1) # For reproducibility
young_times <- sample(young$Net.Time, 100)
old_times <- sample(old$Net.Time, 100)

mu_young = mean(young_times)
mu_old = mean(old_times)
c(mu_young,mu_old)
## [1] 73.41334 77.00703

Comparing two means

Back when we first jumped into hypothesis testing, we computed a \(p\)-value for the hypothesis that the average old time was greater than \(73.4\), the fixed average young time. It would be better to compare the two means directly, and t.test provides a way to do this.

t.test(young_times, old_times, alternative = "less")
## 
##  Welch Two Sample t-test
## 
## data:  young_times and old_times
## t = -1.2586, df = 193, p-value = 0.1049
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##      -Inf 1.125634
## sample estimates:
## mean of x mean of y 
##  73.41334  77.00703

I was a little surprised that the \(p\)-value wasn't smaller. On the other hand, if we widen the gap between the age groups a bit, we do see the expected result.

men = subset(df, Gender == 'M')
young = subset(men, 35<=Age & Age<40)
old = subset(men, 45<=Age & Age<50)

set.seed(1) # For reproducibility
young_times <- sample(young$Net.Time, 100)
old_times <- sample(old$Net.Time, 100)
t.test(young_times, old_times, alternative = "less")
## 
##  Welch Two Sample t-test
## 
## data:  young_times and old_times
## t = -1.7147, df = 191.39, p-value = 0.04401
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##        -Inf -0.1792582
## sample estimates:
## mean of x mean of y 
##  73.41334  78.38337

Behind the computation

The ideas behind the computations that produce the above \(t\)-test are quite similar to the hypothesis tests we've done before. The major new ingredients are formulae to combine the summary statistics of the two original data sets into one test. To do so, suppose that

  • \(D_1\) and \(D_2\) are two sets of sample observations,
  • with \(n_1\) and \(n_2\) elements respectively,
  • the mean of \(D_1\) is \(\bar{x}_1\) and the mean of \(D_2\) is \(\bar{x}_2\), and
  • the standard deviation of \(D_1\) is \(\sigma_1\) and the standard deviation of \(D_2\) is \(\sigma_2\).

Then, we analyze the difference of the two means using a \(t\)-test with

  • Mean \(\bar{x}_1 - \bar{x}_2\),
  • Standard error \[\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}},\]
  • and, as a simple and conservative choice, we use the minimum of \(n_1-1\) and \(n_2-1\) as the degrees of freedom. (R's t.test actually uses the more accurate Welch-Satterthwaite approximation, which is why the output above reports fractional degrees of freedom like 191.39.)

In this setup, the expression \[\frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}\] is often called the test statistic.
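To check that these formulae really are what t.test computes, here's a quick sketch with simulated data (the sample values below are made up just for this check):

```r
# Compute the Welch test statistic by hand and compare with t.test.
set.seed(2)  # for reproducibility
d1 <- rnorm(100, mean = 73, sd = 10)
d2 <- rnorm(100, mean = 77, sd = 10)

se <- sqrt(var(d1)/length(d1) + var(d2)/length(d2))  # standard error
t_stat <- (mean(d1) - mean(d2))/se                   # test statistic
c(t_stat, t.test(d1, d2)$statistic)                  # the two agree
```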

Example

Sam thinks that there is a difference in quality of life between rural and urban living. He collects information from obituaries in newspapers from urban and rural towns in Idaho to see if there is a difference in life expectancy. A sample of 4 people from rural towns gives a life expectancy of \(\bar{x}_r=72\) years with a standard deviation of \(\sigma_r=6.99\) years. A sample of 6 people from larger towns gives \(\bar{x}_u=81.9\) years and \(\sigma_u=5.64\) years. Does this provide evidence that people living in rural Idaho communities have a shorter life expectancy than those in more urban communities to a \(95\%\) level of confidence?

Solution: Using the conservative rule, the degrees of freedom are \(\min(n_r-1, n_u-1) = \min(3,5) = 3\), and the test statistic is

\[\frac{\bar{x}_r - \bar{x}_u}{\sqrt{\frac{\sigma_r^2}{n_r} + \frac{\sigma_u^2}{n_u}}} = \frac{72-81.9}{\sqrt{\frac{6.99^2}{4}+\frac{5.64^2}{6}}} = -2.36543.\]

Thus, we compute the \(p\)-value via

pt(-2.36543, 3)
## [1] 0.0494561

Since the \(p\)-value is just under \(0.05\), I guess we should reject the null hypothesis.
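The whole computation can also be carried out directly in R, using only the numbers given in the problem:

```r
xr <- 72;   sr <- 6.99; nr <- 4   # rural sample
xu <- 81.9; su <- 5.64; nu <- 6   # urban sample

t_stat <- (xr - xu)/sqrt(sr^2/nr + su^2/nu)  # the test statistic
t_stat              # about -2.365
pt(t_stat, df = 3)  # one-sided p-value, just under 0.05
```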

Alternatively, we could use a table. If we look at the \(t\)-table in appendix B2 on page 430 of our text, we see something that looks like so:

one tail    0.100   0.050   0.025   0.010   0.005
two tails   0.200   0.100   0.050   0.020   0.010
df   1       3.08    6.31   12.71   31.82   63.66
     2       1.89    2.92    4.30    6.96    9.92
     3       1.64    2.35    3.18    4.54    5.84
     4       1.53    2.13    2.78    3.75    4.60

The entries in this table are critical \(t^*\) values. The columns correspond to several common significance levels, each labeled with both its one-tail and two-tail value. The rows correspond to degrees of freedom. Thus, you can figure out where a given \(t\)-score lies relative to your critical value.

Look in the row where \(df=3\). As we move from left to right along this row, the critical values increase, so the corresponding \(p\)-values must decrease. We are interested in the column where the one-tail level is equal to 0.05. The corresponding \(t^*\) value in our row and column is 2.35. Since the absolute value of our \(t\)-score, 2.37, is slightly larger than this, the \(p\)-value must be smaller than \(0.05\); thus we (again) reject the null hypothesis.
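If you don't have a table handy, R's qt function produces the same critical value directly (no new data is involved here):

```r
# Critical t* for a one-tailed test at the 0.05 level with 3 degrees
# of freedom; matches the 2.35 entry in the table.
qt(0.95, df = 3)
```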

Paired data

Sometimes, data is naturally paired. As an example, consider the prices of the same items from two different sources. More concretely, here's some data on the price of textbooks at UCLA's bookstore and on Amazon:

#df = read.csv('https://www.marksmath.org/classes/Summer2017Stat185/data/textbooks.csv')
df = read.csv('/Users/mcmcclur/Documents/html/classes/Summer2017Stat185/data/textbooks.csv')
head(df)
##   X deptAbbr course           ibsn uclaNew amazNew more  diff
## 1 1   Am Ind   C170 978-0803272620   27.67   27.95    Y -0.28
## 2 2   Anthro      9 978-0030119194   40.59   31.14    Y  9.45
## 3 3   Anthro   135T 978-0300080643   31.68   32.00    Y -0.32
## 4 4   Anthro  191HB 978-0226206813   16.00   11.52    Y  4.48
## 5 5  Art His  M102K 978-0892365999   18.95   14.21    Y  4.74
## 6 6  Art His   118E 978-0394723693   14.95   10.17    Y  4.78

The natural and simple way to explore this type of data is to work with the pairwise differences. In this dataset, that's easy since the differences are already included as a column. Here's a histogram of the differences:

hist(df$diff)

Note that most of the weight lies to the right of zero, indicating that most of the texts are less expensive at Amazon. We can quantify this with a \(t\)-test.

t.test(df$diff, alternative = "greater")
## 
##  One Sample t-test
## 
## data:  df$diff
## t = 7.6488, df = 72, p-value = 3.464e-11
## alternative hypothesis: true mean is greater than 0
## 95 percent confidence interval:
##  9.981505      Inf
## sample estimates:
## mean of x 
##  12.76164

That's strong! Note that the average difference is \(\$12.76\) and, from the confidence interval, we are \(95\%\) confident that books are, on average, at least about \(\$10\) cheaper on Amazon.
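As a final note, t.test with paired = TRUE on the two price columns is exactly the same as the one-sample test on their differences. Here's a quick sketch using just the first five price pairs from the head of the data above:

```r
ucla <- c(27.67, 40.59, 31.68, 16.00, 18.95)  # uclaNew, first five rows
amaz <- c(27.95, 31.14, 32.00, 11.52, 14.21)  # amazNew, first five rows

p1 <- t.test(ucla, amaz, paired = TRUE, alternative = "greater")$p.value
p2 <- t.test(ucla - amaz, alternative = "greater")$p.value
c(p1, p2)  # the two p-values are identical
```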