Very commonly, we wish to compare two data sets. To do so, we’ll apply the same hypothesis test that we’ve used before but with a mean and standard deviation specially chosen to match the problem at hand. Depending upon the degrees of freedom, we might choose to compute the \(p\)-value using a \(t\)-distribution or a normal distribution.

A concrete example

As an example, we can compare the speeds of two age groups from the Peachtree Road Race from the data set that we used before. Suppose, for example, that we wish to explore the question: Are men between the ages of 35 and 40 generally faster than men between the ages of 45 and 50?

To answer this question, we’ll read in our entire data set and grab a random sample of size 100 from each group of interest. While we’re at it, let’s compute the mean of each group.

df = read.csv('https://www.marksmath.org/data/peach_tree2015.csv')
men = subset(df, Gender == 'M')  # note the double equals for comparison
young = subset(men, 35 <= Age & Age < 40)
old = subset(men, 45 <= Age & Age < 50)

set.seed(1) # For reproducibility
young_times = sample(young$Net.Time, 100)
old_times = sample(old$Net.Time, 100)

mu_young = mean(young_times)
mu_old = mean(old_times)
c(mu_young,mu_old)
## [1] 73.41334 78.38337

As expected, the average of the younger times is less than the average of the older times. What can we infer, though, about the general population from this small sample?

Comparing two means with the t-test

The fabulous t.test command provides a way to compare two means. If we have two groups with means \(\mu_1\) and \(\mu_2\), the logical hypothesis statements are something like

  • \(H_0: \mu_1 = \mu_2\)
  • \(H_A: \mu_1 \neq \mu_2.\)

In our case, if \(\mu_1\) represents the average of the younger times and \(\mu_2\) represents the average of the older times, we might be more interested in one-sided hypotheses like

  • \(H_0: \mu_1 = \mu_2\)
  • \(H_A: \mu_1 < \mu_2.\)

We can run this like so:

t.test(young_times, old_times, alternative = "less")
## 
##  Welch Two Sample t-test
## 
## data:  young_times and old_times
## t = -1.7147, df = 191.39, p-value = 0.04401
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##        -Inf -0.1792582
## sample estimates:
## mean of x mean of y 
##  73.41334  78.38337

The \(p\)-value of about 0.044 is less than 0.05, so it looks like the difference is genuine at the 95% level of confidence.
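If we need the \(p\)-value itself for further computation, the object returned by t.test is a list whose components we can extract:

t.test(young_times, old_times, alternative = "less")$p.value  # about 0.044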

Behind the computation

The ideas behind the computations that produce the above \(t\)-test are quite similar to the hypothesis tests we’ve done before. The major new ingredients are formulae that combine the summary statistics of the two samples into one. To do so, suppose that

  • \(D_1\) and \(D_2\) are two sets of sample observations,
  • with \(n_1\) and \(n_2\) elements respectively,
  • the mean of \(D_1\) is \(\bar{x}_1\) and the mean of \(D_2\) is \(\bar{x}_2\), and
  • the standard deviation of \(D_1\) is \(\sigma_1\) and the standard deviation of \(D_2\) is \(\sigma_2\).

Then, we analyze the difference of the two means using a \(t\)-test with

  • Mean \(\bar{x}_1 - \bar{x}_2\),
  • Standard error \[\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}},\]
  • and, for work by hand, the minimum of \(n_1-1\) and \(n_2-1\) as the degrees of freedom. (The t.test command computes the more refined Welch approximation instead, which is why it reported df = 191.39 above.)

In this setup, the expression \[\frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}\] is often called the test statistic.
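We can check these formulae against the t.test output by computing everything directly from our two samples; a minimal sketch:

n1 = length(young_times)
n2 = length(old_times)
se = sqrt(var(young_times)/n1 + var(old_times)/n2)  # standard error from the formula above
t_stat = (mean(young_times) - mean(old_times))/se   # the test statistic
t_stat                        # matches the t = -1.7147 reported by t.test
pt(t_stat, min(n1, n2) - 1)   # one-sided p-value with the conservative 99 degrees of freedom

The test statistic agrees exactly with the t.test output; the \(p\)-value is a touch larger than the reported 0.04401, since our conservative 99 degrees of freedom differ from the Welch value of 191.39.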

We should point out that these same techniques work for large sample sizes as well; when the samples are so large that the degrees of freedom exceed 30, we can simply use the normal distribution rather than the \(t\)-distribution, since the normal distribution is a limit of \(t\)-distributions. Thus, the t.test command works regardless of sample size.
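For example, the normal approximation to the one-sided \(p\)-value in our running example is already reasonably close to the \(t\)-based value:

pt(-1.7147, 99)   # p-value from the t-distribution
pnorm(-1.7147)    # normal approximation; close, though not identical, at this sample size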

Example

Sam thinks that there is a difference in quality of life between rural and urban living. He collects information from obituaries in newspapers from urban and rural towns in Idaho to see if there is a difference in life expectancy. A sample of 4 people from rural towns gives a mean life expectancy of \(\bar{x}_r = 72\) years with a standard deviation of \(\sigma_r=6.99\) years. A sample of 6 people from larger towns gives \(\bar{x}_u=81.9\) years and \(\sigma_u=5.64\) years. Does this provide evidence of a difference in life expectancy between rural and urban Idaho communities to a \(95\%\) level of confidence?

Solution: Our hypothesis test looks like

  • \(H_0: \mu_u = \mu_r\)
  • \(H_A: \mu_u \neq \mu_r.\)

Following our convention for the degrees of freedom, we use the minimum of \(n_r-1=3\) and \(n_u-1=5\), namely three, and the test statistic is

\[\frac{\bar{x}_r - \bar{x}_u}{\sqrt{\frac{\sigma_r^2}{n_r} + \frac{\sigma_u^2}{n_u}}} = \frac{72-81.9}{\sqrt{\frac{6.99^2}{4}+\frac{5.64^2}{6}}} \approx -2.36543.\]
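A quick check of that arithmetic in R:

xr = 72; sr = 6.99; nr = 4    # rural sample statistics
xu = 81.9; su = 5.64; nu = 6  # urban sample statistics
(xr - xu)/sqrt(sr^2/nr + su^2/nu)  # about -2.36543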

We have a two-sided alternative hypothesis, thus we compute the \(p\)-value via

2*pt(-2.36543, 3)
## [1] 0.09891221

Since the \(p\)-value is larger than \(0.05\), we cannot reject the null hypothesis.

Alternatively, we could use a table. If we look at our \(t\)-table, we see something like the following:

           one tail   0.100   0.050   0.025   0.010   0.005
           two tails  0.200   0.100   0.050   0.020   0.010
df      1              3.08    6.31   12.71   31.82   63.66
        2              1.89    2.92    4.30    6.96    9.92
        3              1.64    2.35    3.18    4.54    5.84
        4              1.53    2.13    2.78    3.75    4.60

The entries in this table are critical \(t^*\) values. The columns correspond to several common confidence levels, each labeled with both its one-tailed and two-tailed probability. The rows correspond to degrees of freedom. Thus, you can figure out where a given \(t\)-score lies relative to your critical value.

Look in the row where \(df=3\). As we move from left to right along this row, the corresponding \(p\)-values decrease. We are interested in the column where the two-tailed probability is 0.05. The corresponding \(t^*\) value in our row and column is 3.18. Since the absolute value of our \(t\)-score, about 2.37, is less than that, the \(p\)-value must be larger than \(0.05\); thus we (again) fail to reject the null hypothesis.
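Of course, critical values like these can be computed directly in R with qt, so the table is mostly a convenience:

qt(0.975, 3)  # critical t* for a two-tailed test at the 0.05 level with df = 3; about 3.18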

Comparing two sample proportions

The ideas behind the comparison of two sample proportions are very similar to the ideas behind the comparison of two sample means. We’ve just got to figure out the correct formulation and parameters to use in our \(t\)-test.

A problem

Let’s illustrate the ideas in the context of a problem. It’s widely believed that Trump’s support among men is stronger than his support among women. Let’s use some data to test this.


According to a recent Reuters poll, Trump’s approval rating stands at 35%, but there appears to be a difference between the views of men and the views of women. Among the 980 men surveyed, 40.4% approve of Trump; among the 930 women surveyed, only 29.9% approve.

Does this data support our conjecture that Trump’s support among men is higher than that among women to a 95% level of confidence?

Solution: Let’s first clearly state our hypotheses. Suppose that \(p_m\) represents the proportion of men who support Trump and \(p_w\) represents the proportion of women who support Trump. Our hypothesis test can be written

  • \(H_0: p_m = p_w\) or, put another way, \(p_m-p_w = 0\)
  • \(H_A: p_m > p_w\) or, put another way, \(p_m-p_w > 0\).

The point behind the reformulation to compare with zero is that it gives us just one number that we can apply a standard \(t\)-test to. Now, we have measured proportions of \(\hat{p}_m = 0.404\) and \(\hat{p}_w = 0.299\). Thus, we want to run our test with \[\hat{p} = \hat{p}_m - \hat{p}_w = 0.404 - 0.299 = 0.105.\] We want just one standard error as well, which we get by adding the variances in the two samples. That is, \[SE = \sqrt{\frac{\hat{p}_m(1-\hat{p}_m)}{n_m} + \frac{\hat{p}_w(1-\hat{p}_w)}{n_w}} = \sqrt{\frac{0.404 \times 0.596}{980} + \frac{0.299 \times 0.701}{930}} \approx 0.02170422.\]

Of course, I computed this with R:

pm = 0.404
pw = 0.299
se = sqrt( pm*(1-pm)/980 + pw*(1-pw)/930 )
se
## [1] 0.02170422

We can now compute our test statistic \[T = \frac{\hat{p}_m - \hat{p}_w}{SE}\] via

t = 0.105/se
t
## [1] 4.837769

And finally, compute the \(p\)-value

1-pt(t,929)
## [1] 7.682946e-07

Since this is quite small, we can reject the null hypothesis and conclude with confidence that Trump’s support among men really is higher than his support among women.
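Incidentally, R has a built-in prop.test command for comparing two proportions. It’s based on a pooled standard error and a chi-squared statistic, so its \(p\)-value won’t match ours exactly, but the conclusion is the same. Assuming the approval counts implied by the rounded percentages, the call would look like:

approve_m = round(0.404*980)  # about 396 men
approve_w = round(0.299*930)  # about 278 women
prop.test(c(approve_m, approve_w), c(980, 930), alternative = "greater")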

Paired data

Sometimes, two data sets are naturally paired. As an example, consider the prices of the same items from two different sources. More concretely, I’ve got some data on the prices of textbooks at UNCA’s bookstore vs Amazon stored in a CSV file on my website. Let’s load it into R and take a look:

df = read.csv('https://www.marksmath.org/data/book_prices.csv')
head(df)
##                         book_title bkstr_price amazon_price
## 1 89 Color-Coded Flashcards-12061#        5.99         5.99
## 2  Abanico, Cuaderno de Ejercicios       34.75        18.85
## 3        Abina & the Important Men       21.75        18.95
## 4              Abnormal Psychology      125.00        69.00
## 5   Accounting Information Systems      331.25       233.95
## 6                   Actor Prepares       31.95        28.19

Now, let’s explore the question: are Amazon prices genuinely cheaper than the UNCA bookstore prices (as is commonly believed)? To do so, let’s first clearly state our null and alternative hypotheses. Let \(\mu_U\) denote the average price of the books at the UNCA bookstore and let \(\mu_A\) denote the average price of the books at Amazon. Then our hypotheses are

  • \(H_0: \mu_A = \mu_U\)
  • \(H_A: \mu_A < \mu_U.\)

Since the data are paired, we can rephrase this by taking the pairwise differences to get a single data set. If for each row we take the Amazon price minus the UNCA price, we get a single data set whose mean we denote \(\mu\); we can then rephrase our hypotheses as

  • \(H_0: \mu = 0\)
  • \(H_A: \mu < 0.\)

Now, let’s grab a random sample of size 50, compute the differences and run a \(t\)-test on the result:

set.seed(1); dfs = df[sample(nrow(df), 50), ]
t.test(dfs$amazon_price - dfs$bkstr_price, alternative = "less")
## 
##  One Sample t-test
## 
## data:  dfs$amazon_price - dfs$bkstr_price
## t = -4.6867, df = 49, p-value = 1.122e-05
## alternative hypothesis: true mean is less than 0
## 95 percent confidence interval:
##       -Inf -11.66783
## sample estimates:
## mean of x 
##  -18.1664

Note that the \(p\)-value is quite small; we have better than a 99% level of confidence that the mean difference in the book prices is negative.
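Incidentally, t.test can account for the pairing itself; the following paired test is equivalent to the one-sample test on the differences:

t.test(dfs$amazon_price, dfs$bkstr_price, paired = TRUE, alternative = "less")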

Of course, there’s a lower-level way to obtain the same result by hand:

d = dfs$amazon_price - dfs$bkstr_price  # pairwise differences
m = mean(d)                             # mean difference
t = m/(sd(d)/sqrt(50))                  # the test statistic
p = pt(t,49)                            # one-sided p-value with 50-1 = 49 degrees of freedom
m
## [1] -18.1664
t
## [1] -4.68671
p
## [1] 1.1223e-05

Finally, we can often glean some information from a histogram:
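We can generate one from the differences d computed above; the binning here is just one reasonable choice:

hist(d, breaks = 12,
     main = "Amazon price minus bookstore price",
     xlab = "price difference (dollars)")
abline(v = 0, lty = 2)  # reference line at zero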

Note that most of the weight lies to the left of zero, indicating that most of the texts are less expensive at Amazon.