Comparing data sets

Last week, we discussed the comparison between two related multinomial proportions. This week, we'll step back toward the textbook and discuss the comparison between two means or two proportions.

In order, this material is covered in sections 7.2, 7.3, and 6.2 of our text.

Textbook prices

I'm going to present some code to manipulate some data that all college students should be somewhat interested in - textbook prices. Specifically, I've got a CSV file that contains prices from the UNCA bookstore and Amazon for more than 500 textbooks that were in use at UNCA during the Fall 2018 semester.

Let's go ahead and load some libraries that we'll use throughout:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

The data

Here's the data:

import pandas as pd
df = pd.read_csv('')
(518, 4)
title course bookstore_new_price amazon_max_price
0 MyAccountingLab Access Card -- for Financial A... ACCT 215 122.25 99.99
1 Financial Accounting (Loose Pgs)(w/MyAccountin... ACCT 215 263.50 214.95
2 Managerial Accounting (Loose Pgs)(w/MyAcctLab ... ACCT 216 263.25 234.33
3 Intermediate Accounting (WileyPlus Standalone ... ACCT 301 171.25 88.99
4 Intermediate Accounting (Loose Pgs)(w/ Wiley P... ACCT 301 297.00 183.31

Comparing prices between the sellers (paired data)

"Common knowledge" suggests that Amazon prices might generally be lower than our bookstore's prices. If we want to compare the prices between the two vendors like this, we can simply compute the pairwise difference.

differences = df.bookstore_new_price - df.amazon_max_price
m = differences.mean()

I guess that computation means that a textbook is, on average, $\$18.30$ more expensive at the UNCA bookstore than on Amazon.

A picture of the differences

Of course, the complete picture is more complicated than just the mean of the differences. Here's a histogram of those differences, with the mean marked by a vertical line:
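A plot like this can be produced with matplotlib. Here's a minimal sketch; to keep it self-contained, it uses only the five rows of prices shown in the table above rather than the full data set, so the picture it draws is much cruder than the real one. The variable names and styling choices are just for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Prices from the five rows displayed above (not the full data set)
bookstore = np.array([122.25, 263.50, 263.25, 171.25, 297.00])
amazon = np.array([99.99, 214.95, 234.33, 88.99, 183.31])

# Pairwise differences: bookstore price minus Amazon price
differences = bookstore - amazon
m = differences.mean()

fig, ax = plt.subplots()
ax.hist(differences, bins=5, edgecolor="black")
# Mark the mean of the differences with a vertical line
ax.axvline(m, color="red", linestyle="--", label=f"mean = {m:.2f}")
ax.set_xlabel("bookstore price - Amazon price (dollars)")
ax.set_ylabel("number of books")
ax.legend()
plt.show()
```

With the full data frame, the same code should work after replacing the two arrays with df.bookstore_new_price and df.amazon_max_price.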

A hypothesis test

Now, let's explore the question: are Amazon prices genuinely cheaper than the UNCA bookstore prices to a statistically significant level? To do so, let's first clearly state our null and alternative hypotheses. Let $\mu_U$ denote the average price of the books at the UNCA bookstore and let $\mu_A$ denote the average price of the books at Amazon.

$$ H_0: \mu_A = \mu_U \\ H_A: \mu_A < \mu_U $$

Rephrasing using differences

Since the data are paired, we can rephrase this by taking the pairwise difference of the data sets to get a single data set. If, for each row, we take the UNCA price minus the Amazon price (just as we did when computing differences in the code above) to get a single data set with mean $\mu = \mu_U - \mu_A$, then we can rephrase our hypotheses as

$$ H_0: \mu = 0 \\ H_A: \mu > 0 $$


Note that we've already computed an estimate $\overline{x}$ of $\mu$; it's the mean of the differences, denoted by m in our code. To examine the hypothesis, we compute a standard error and test statistic from our set of differences.

se = np.std(differences)/np.sqrt(len(differences))
T = (m-0)/se

Given this huge test statistic, the $p$-value will be ridiculously small.
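To see just how small, we can turn a test statistic into a one-sided $p$-value. The sketch below is only an illustration: the function name is made up, the data are randomly generated stand-ins for the real differences (bookstore minus Amazon, so a large positive statistic is evidence that Amazon is cheaper), and it uses a normal approximation, which is reasonable for a sample of several hundred. It also uses the sample standard deviation (ddof=1), which differs very slightly from the default behavior of np.std.

```python
import math
import numpy as np

def one_sided_p_value(differences, mu0=0.0):
    """Return (T, p) for H_0: mu = mu0 against H_A: mu > mu0.

    Uses a normal approximation, so it assumes a fairly large sample.
    """
    d = np.asarray(differences, dtype=float)
    se = d.std(ddof=1) / math.sqrt(len(d))    # standard error of the mean
    T = (d.mean() - mu0) / se                 # test statistic
    p = 0.5 * math.erfc(T / math.sqrt(2))     # upper tail area P(Z > T)
    return T, p

# Made-up differences, for illustration only (NOT the textbook data)
rng = np.random.default_rng(0)
fake_differences = rng.normal(loc=18, scale=40, size=518)

T, p = one_sided_p_value(fake_differences)
print(T, p)
```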

Comparing prices between disciplines (unpaired data)

What do you suppose is more expensive - the average textbook in the sciences or the average textbook in the humanities? This is a perfectly reasonable question but the data is not paired in a natural way, as in our last example. In this case, we'll compute the two means separately, take the difference, and run a hypothesis test using a combined standard error.

Grabbing the data

sci_prefixes = ['ATMS', 'BIOL', 'CHEM', 'CSCI', 'MATH', 'PHYS', 'STAT']
sci_df = df[[c[:4] in sci_prefixes for c in df.course]]
sci_prices = sci_df.bookstore_new_price
m_sci = np.mean(sci_prices)

hum_prefixes = ['CLAS', 'DRAM', 'HUM ', 'LIT ', 'PHIL', 'RELS']
hum_df = df[[c[:4] in hum_prefixes for c in df.course]]
hum_prices = hum_df.bookstore_new_price
m_hum = np.mean(hum_prices)

[m_sci, len(sci_prices), m_hum, len(hum_prices)]
[154.0701234567901, 81, 36.0403875968992, 129]

Well, the average price of a science textbook does indeed look a lot higher than the average price of a humanities textbook!

The general scenario

Let's place this example in the context of a general scenario. In particular, suppose that

  • $D_1$ and $D_2$ are two sets of sample observations,
  • with $n_1$ and $n_2$ elements respectively.
  • The mean of $D_1$ is $\bar{x}_1$ and the mean of $D_2$ is $\bar{x}_2$
  • The standard deviation of $D_1$ is $\sigma_1$ and the standard deviation of $D_2$ is $\sigma_2$

The general strategy

We'll analyze the difference of the two means using a hypothesis test with

  • Mean $\overline{x}_1 - \overline{x}_2$,
  • Standard error $$\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}},$$
  • and we use the minimum of $n_1-1$ and $n_2-1$ as the degrees of freedom, if the smaller is less than 30

The test statistic

The form of the test statistic for this scenario is $$\frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$

For our problem, this boils down to:

se_sci = np.std(sci_prices)/np.sqrt(len(sci_prices))
se_hum = np.std(hum_prices)/np.sqrt(len(hum_prices))
(m_sci - m_hum)/np.sqrt(se_sci**2 + se_hum**2)

Again, a very large test statistic so that we'd reject the null.
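The general strategy can be wrapped up in a small function that works for any two samples. This is just a sketch: the function name and the tiny data sets are made up for illustration, and it uses sample variances (ddof=1) rather than the default of np.std.

```python
import numpy as np

def two_sample_test_statistic(d1, d2):
    """Return (T, se, dof) for comparing the means of two unpaired samples."""
    d1 = np.asarray(d1, dtype=float)
    d2 = np.asarray(d2, dtype=float)
    n1, n2 = len(d1), len(d2)
    # Combined standard error: add the two variances of the sample means
    se = np.sqrt(d1.var(ddof=1) / n1 + d2.var(ddof=1) / n2)
    T = (d1.mean() - d2.mean()) / se
    # Conservative degrees of freedom for a t-distribution
    dof = min(n1 - 1, n2 - 1)
    return T, se, dof

# Tiny made-up samples, for illustration only
T, se, dof = two_sample_test_statistic([1, 2, 3, 4, 5], [2, 4, 6])
print(T, se, dof)
```

With the textbook data, the call would be two_sample_test_statistic(sci_prices, hum_prices).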

Comparing two sample proportions

The ideas behind the comparison of two sample proportions are very similar to the ideas behind the comparison of two sample means. Let's examine the question in the context of a specific problem:

It's widely believed that Trump's support among men is stronger than his support among women. We'll explore this question with some data.

The data

According to a recent Marist poll, Trump's most recent approval rating stands at 41%, but there appears to be a difference between the views of men and the views of women. Among the 712 men surveyed, 48% approve of Trump. Among the 684 women surveyed, only 36% approve of Trump.

Does this data support our conjecture that Trump's support among men is higher than that among women to a 95% level of confidence?

The hypothesis test

Let's first clearly state our hypotheses. Let's suppose that $p_m$ represents the proportion of men who support Trump and $p_w$ represents the proportion of women who support Trump. Our hypothesis test can be written

  • $H_0: p_m = p_w$ or, put another way, $p_m-p_w = 0$
  • $H_A: p_m > p_w$ or, put another way, $p_m-p_w > 0$.

The point behind the reformulation to compare with zero is that it gives us just one number that we can apply a standard hypothesis test to.

Applying the data

Now, we have measured proportions of $\hat{p}_m = 0.48$ and $\hat{p}_w = 0.36$. Thus, we want to run our test with

$$\hat{p} = \hat{p}_m - \hat{p}_w = 0.48 - 0.36 = 0.12.$$

We want just one standard error as well, which we get by adding the variances in the two samples. That is,

$$SE = \sqrt{\frac{\hat{p}_m(1-\hat{p}_m)}{n_m} + \frac{\hat{p}_w(1-\hat{p}_w)}{n_w}} = \sqrt{\frac{0.48 \times 0.52}{712} + \frac{0.36 \times 0.64}{684}} \approx 0.026218.$$

Of course, I computed this with Python:

pm = 0.48
pw = 0.36
se = np.sqrt(pm*(1-pm)/712 + pw*(1-pw)/684)

The test statistic

We can now compute our test statistic: $$T = \frac{\hat{p}_m - \hat{p}_w}{SE}.$$ via

t = (pm-pw)/se

With this very large test statistic, we can reject the null hypothesis and conclude with confidence that Trump's support among men is higher than his support among women.
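To back that up with a number, we can convert the test statistic to a one-sided $p$-value and compare it with 0.05. Here's a sketch that puts the whole computation in one place; the function name is made up for illustration, the sample sizes are the ones used in the standard error formula above, and the $p$-value comes from a normal approximation.

```python
import math

def two_proportion_test(p1, n1, p2, n2):
    """Return (se, T, p_value) for H_0: p1 = p2 against H_A: p1 > p2."""
    # One standard error, obtained by adding the variances of the two samples
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    T = (p1 - p2) / se
    # Upper tail area of the standard normal distribution, P(Z > T)
    p_value = 0.5 * math.erfc(T / math.sqrt(2))
    return se, T, p_value

se, T, p_value = two_proportion_test(0.48, 712, 0.36, 684)
print(se, T, p_value)
print(p_value < 0.05)  # True means we reject the null at the 95% level
```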

Small sample sizes

As we move forward this semester, we will need to ask whether our sample meets the conditions to apply some specific distribution - most often, the normal distribution. In particular, when the sample size is small, we'll typically use a $t$-distribution.

Let's look at such an example.

Ohio State vs Nebraska

Ohio State and Nebraska are scheduled to play each other in football at noon this coming Saturday. They've played each other 8 times in football over the years. Here are the results:

Season OSU NEB Difference
2019 48 7 41
2018 36 31 5
2017 56 14 42
2016 62 3 59
2012 63 38 25
2011 27 34 -7
1956 34 7 27
1955 28 20 8

Looks good for the Buckeyes!

Historical assessment

Let's use this data to get a sense of how these teams compare historically. Specifically, we'll write down a 95% confidence interval for the difference in score.

Note: This is admittedly a whimsical example and certainly worthless for most anything worthwhile - such as prediction. You do see this sort of thing used to compare teams historically, though.

A $t$-distribution for paired data

Since the data is naturally paired, all we really need to do is compute the confidence interval for the Difference column in the table.

When computing the margin of error, though, we'll need to use a $t$-distribution with 7 degrees of freedom to deal with the small sample of size 8.

Using our calculator page for the T-distribution, we find that the multiplier for a 95% level of confidence is

$$t^* \approx 2.364624.$$

The computation

Here's the computation of the confidence interval in Python. By this point, I hope you can read the formula we are using right out of the code.

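import numpy as np
data = [41,5,42,59,25,-7,27,8]   # the Difference column
m = np.mean(data)                # mean of the differences
s = np.std(data)                 # standard deviation (np.std divides by n by default)
se = s/np.sqrt(len(data))        # standard error
t = 2.364624                     # t* multiplier: 95% confidence, 7 degrees of freedom
[m - t*se, m + t*se]
[7.719427143201763, 42.28057285679824]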

I guess one interpretation of this interval is that it would be a major disappointment for the Buckeyes to win by less than a touchdown.
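One detail matters more with a sample this small: np.std divides by $n$ by default, while the usual sample standard deviation divides by $n-1$. The sketch below (the function name is made up for illustration) computes the interval both ways, using the $t^*$ value found above. With ddof=0 it should reproduce the interval shown above; with ddof=1 the interval comes out somewhat wider.

```python
import numpy as np

def paired_t_interval(differences, t_star, ddof=1):
    """Confidence interval for the mean of a set of paired differences."""
    d = np.asarray(differences, dtype=float)
    m = d.mean()
    # ddof=1 gives the sample standard deviation; ddof=0 matches np.std's default
    se = d.std(ddof=ddof) / np.sqrt(len(d))
    return m - t_star * se, m + t_star * se

data = [41, 5, 42, 59, 25, -7, 27, 8]   # the Difference column
t_star = 2.364624                        # 95% confidence, 7 degrees of freedom

lo0, hi0 = paired_t_interval(data, t_star, ddof=0)
lo1, hi1 = paired_t_interval(data, t_star, ddof=1)
print([lo0, hi0])
print([lo1, hi1])
```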