After the midterm, we’ll spend most of our time generalizing our techniques for inference to more questions. Let’s go ahead and take a look at the first, most basic such situation: the comparison of the means of two paired data sets.

The general question

Sometimes we have two data sets that we wish to compare. Examples might include:

Paired data vs unpaired data

Sometimes two data sets might be naturally paired. Suppose we are studying how runners slow after age 30 and we want to use data from the huge Peachtree Road race to explore the question. Here are two approaches:

  • Compare the average time of the 31-35 age group to the 36-40 age group.
  • Find the runners over 30 who competed in 2010 and 2015 and compare their 2010 time to their 2015 time.

The key difference is that the data in the second approach are naturally paired. We can explore the question by taking the pairwise difference of those sets, computing the mean, and comparing that to zero. That is, for each runner, we subtract their 2010 time from their 2015 time. If the mean of the resulting differences is genuinely greater than zero, then the 2015 times are longer and the runners are indeed slowing down.
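Here is a minimal sketch of that pairwise computation. The finishing times are made-up values for five hypothetical runners, not Peachtree data, and the names `time_2010` and `time_2015` are just for illustration.

```r
# Hypothetical finishing times in minutes for five runners (made-up values).
time_2010 <- c(52.1, 47.8, 60.3, 55.0, 49.6)
time_2015 <- c(53.4, 49.0, 60.1, 57.2, 51.3)

# Pairwise difference: 2015 time minus 2010 time, one value per runner.
d <- time_2015 - time_2010

# A positive mean says the 2015 times are larger on average, that is, slower.
mean_d <- mean(d)
print(d)
print(mean_d)
```

Whether a positive sample mean is "genuinely" positive is exactly the question a hypothesis test on `d` is meant to answer.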

An example

As a concrete example, consider the prices of the same items from two different sources. More specifically, I’ve got some data on the price of textbooks via UNCA’s bookstore vs Amazon stored in a CSV file on my website. Let’s load it into R and take a look:

df = read.csv('https://www.marksmath.org/data/book_prices.csv')
head(df)
##                         book_title bkstr_price amazon_price
## 1 89 Color-Coded Flashcards-12061#        5.99         5.99
## 2  Abanico, Cuaderno de Ejercicios       34.75        18.85
## 3        Abina & the Important Men       21.75        18.95
## 4              Abnormal Psychology      125.00        69.00
## 5   Accounting Information Systems      331.25       233.95
## 6                   Actor Prepares       31.95        28.19

Now, let’s explore the question: are Amazon prices genuinely cheaper than the UNCA bookstore prices (as is commonly believed)? To do so, let’s first clearly state our null and alternative hypotheses. Let \(\mu_U\) denote the average price of the books at the UNCA bookstore and let \(\mu_A\) denote the average price of the books at Amazon. Then our hypotheses are

\[H_0: \mu_A = \mu_U \quad \text{vs.} \quad H_A: \mu_A < \mu_U.\]

Since the data are paired, we can rephrase this by taking the pairwise difference of the data sets to get a single data set. If for each row, we take the Amazon price minus the UNCA price to get a single data set with mean \(\mu\), then we could rephrase our hypotheses as

\[H_0: \mu = 0 \quad \text{vs.} \quad H_A: \mu < 0.\]
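The rephrasing works because the mean of the pairwise differences equals the difference of the two means. As a quick check, here is a sketch using only the six rows shown in the `head(df)` output above; those rows are not a random sample, so this illustrates the arithmetic only.

```r
# First six rows of the data, copied from the head(df) output above.
bkstr_price  <- c(5.99, 34.75, 21.75, 125.00, 331.25, 31.95)
amazon_price <- c(5.99, 18.85, 18.95, 69.00, 233.95, 28.19)

# One difference per book: Amazon price minus bookstore price.
d <- amazon_price - bkstr_price

# The mean of the differences equals the difference of the means.
mean_of_diffs <- mean(d)
diff_of_means <- mean(amazon_price) - mean(bkstr_price)
print(c(mean_of_diffs, diff_of_means))
print(isTRUE(all.equal(mean_of_diffs, diff_of_means)))
```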

Now, let’s grab a random sample of size 50, compute the differences and run a \(t\)-test on the result:

set.seed(1); dfs = df[sample(nrow(df), 50), ]
t.test(dfs$amazon_price - dfs$bkstr_price, alternative = "less")
## 
##  One Sample t-test
## 
## data:  dfs$amazon_price - dfs$bkstr_price
## t = -4.6867, df = 49, p-value = 1.122e-05
## alternative hypothesis: true mean is less than 0
## 95 percent confidence interval:
##       -Inf -11.66783
## sample estimates:
## mean of x 
##  -18.1664

Note that the \(p\)-value is quite small, far below 0.01, so we reject the null hypothesis at the 1% significance level. The data provide strong evidence that the mean difference in the book prices is negative.
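As an aside, R’s `t.test` also accepts the two columns separately with `paired = TRUE`, which should give the same result as the one-sample test on the differences. The sketch below checks this on the six rows shown in the `head(df)` output rather than on the random sample of 50, so its numbers will not match the output above.

```r
# First six rows of the data, copied from the head(df) output above.
bkstr_price  <- c(5.99, 34.75, 21.75, 125.00, 331.25, 31.95)
amazon_price <- c(5.99, 18.85, 18.95, 69.00, 233.95, 28.19)

# One-sample test on the differences (the approach used in this section).
one_sample <- t.test(amazon_price - bkstr_price, alternative = "less")

# The same test, written as a paired two-sample test.
paired <- t.test(amazon_price, bkstr_price, paired = TRUE,
                 alternative = "less")

# The two t statistics and the two p-values should agree.
print(c(one_sample$statistic, paired$statistic))
print(c(one_sample$p.value, paired$p.value))
```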

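Of course, we can obtain the same result by computing it directly: the \(t\)-statistic is the sample mean of the differences divided by its standard error, and the \(p\)-value is the area to the left of that statistic under the \(t\)-distribution with 49 degrees of freedom: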

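d = dfs$amazon_price - dfs$bkstr_price   # pairwise differences
m = mean(d)                              # sample mean of the differences
t_stat = m/(sd(d)/sqrt(length(d)))       # mean divided by its standard error
p = pt(t_stat, length(d) - 1)            # left-tail area, df = n - 1 = 49
c(m, t_stat, p)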
## [1] -1.81664e+01 -4.68671e+00  1.12230e-05
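The upper end of the one-sided confidence interval reported by `t.test` can be recovered the same way. This sketch starts from the rounded mean and \(t\)-statistic printed above rather than from the data, so it agrees with the `t.test` output only up to rounding; the names `se` and `upper` are just for illustration.

```r
# Values copied from the output above (rounded).
m      <- -18.1664   # sample mean of the differences
t_stat <- -4.68671   # t statistic
n      <- 50         # sample size

# Since t = m / se, the standard error is m / t.
se <- m / t_stat

# Left-tail p-value from the t distribution with n - 1 degrees of freedom.
p <- pt(t_stat, df = n - 1)

# Upper end of the one-sided 95% interval: mean plus critical value times se.
upper <- m + qt(0.95, df = n - 1) * se
print(c(se, p, upper))
```

The value of `upper` should land close to the -11.66783 shown in the `t.test` output.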

Finally, we can often glean some information from a histogram:

Note that most of the weight lies to the left of zero, indicating that most of the texts are less expensive at Amazon.
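A histogram like this can be drawn with base R’s `hist`. The sketch below uses only the six rows shown in the `head(df)` output as stand-in data, so the picture it draws is far coarser than one based on the full sample; the title and axis label are my own choices.

```r
# First six rows of the data, copied from the head(df) output above.
bkstr_price  <- c(5.99, 34.75, 21.75, 125.00, 331.25, 31.95)
amazon_price <- c(5.99, 18.85, 18.95, 69.00, 233.95, 28.19)
d <- amazon_price - bkstr_price

# Histogram of the differences, with a dashed reference line at zero.
hist(d, main = "Amazon price minus bookstore price",
     xlab = "Price difference (dollars)")
abline(v = 0, lty = 2)

# Fraction of these books that are strictly cheaper at Amazon.
frac_cheaper <- mean(d < 0)
print(frac_cheaper)
```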