So far, we've dealt mostly with one numeric list at a time. Often, though, we want to compare two variables - which is exactly what we'll do today!
In order, this material is covered in sections 7.2, 7.3, and 6.2 of our text.
I'm going to present some code to manipulate some data that all college students should be at least somewhat interested in - textbook prices. Specifically, I've got a CSV file that contains prices from the UNCA bookstore and Amazon for more than 500 textbooks that were in use at UNCA during the Fall 2018 semester.
Let's go ahead and load some libraries that we'll use throughout:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Here's the data:
df = pd.read_csv('https://www.marksmath.org/data/BookPricesFall2018.csv')
print(df.shape)
df.head()
(518, 4)
| | title | course | bookstore_new_price | amazon_max_price |
---|---|---|---|---|
0 | MyAccountingLab Access Card -- for Financial A... | ACCT 215 | 122.25 | 99.99 |
1 | Financial Accounting (Loose Pgs)(w/MyAccountin... | ACCT 215 | 263.50 | 214.95 |
2 | Managerial Accounting (Loose Pgs)(w/MyAcctLab ... | ACCT 216 | 263.25 | 234.33 |
3 | Intermediate Accounting (WileyPlus Standalone ... | ACCT 301 | 171.25 | 88.99 |
4 | Intermediate Accounting (Loose Pgs)(w/ Wiley P... | ACCT 301 | 297.00 | 183.31 |
"Common knowledge" suggest that Amazon prices might generally be lower than our bookstore's prices. If we want to compare the prices between the two vendors like this, we can simply compute the pairwise difference.
differences = df.bookstore_new_price - df.amazon_max_price
m = differences.mean()
m
18.29876447876449
I guess that computation means that, on average, a textbook is about $\$18.30$ more expensive at the UNCA bookstore than at Amazon.
Of course, the complete picture is more complicated than just the mean of the differences. Here's a histogram of those differences, with the mean marked by a vertical line:
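Here's a minimal matplotlib sketch that generates such a histogram, reusing the differences and m computed above (the bin count and colors are arbitrary choices):

plt.hist(differences, bins=30, edgecolor='white')  # distribution of bookstore minus Amazon prices
plt.axvline(m, color='red')  # vertical line at the mean difference
plt.xlabel('bookstore price minus Amazon price (dollars)')
plt.ylabel('number of books')
plt.show()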
Now, let's explore the question: are Amazon prices genuinely cheaper than the UNCA bookstore prices to a statistically significant level? To do so, let's first clearly state our null and alternative hypotheses. Let $\mu_U$ denote the average price of the books at the UNCA bookstore and let $\mu_A$ denote the average price of the books at Amazon.
$$ H_0: \mu_A = \mu_U \\ H_A: \mu_A < \mu_U $$

Since the data are paired, we can rephrase this by taking the pairwise differences of the data sets to get a single data set. If, for each row, we take the UNCA price minus the Amazon price (exactly the differences we computed above) to get a single data set with mean $\mu$, then we can rephrase our hypotheses as

$$ H_0: \mu = 0 \\ H_A: \mu > 0 $$

Note that we've already computed an estimate of $\mu$ to be $\overline{x} = 18.3$, denoted by `m` in our code. To examine the hypothesis, we compute a standard error and test statistic from our set of differences.
se = np.std(differences)/np.sqrt(len(differences))  # np.std defaults to ddof=0; with n = 518 the difference from ddof=1 is negligible
se
1.669252423192934
T = (m-0)/se
T
10.962251259616412
Given this huge test statistic, the $p$-value will be ridiculously small.
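For the record, we can pin down that $p$-value with SciPy's normal distribution, assuming (as this large a sample justifies) a normal approximation. Note that SciPy isn't among the libraries we loaded at the top, so we import it here:

from scipy.stats import norm
p = norm.sf(T)  # one-sided p-value for H_A: mu > 0 (survival function = upper tail)
p

The result is far below any conventional significance threshold.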
What do you suppose is more expensive - the average textbook in the sciences or the average textbook in the humanities? This is a perfectly reasonable question, but unlike in our last example, the data is not paired in a natural way. In this case, we'll compute the two means separately, take the difference, and run a hypothesis test using a combined standard error.
# prices of science textbooks, selected by the four-character course prefix
sci_prefixes = ['ATMS', 'BIOL', 'CHEM', 'CSCI', 'MATH', 'PHYS', 'STAT']
sci_df = df[[c[:4] in sci_prefixes for c in df.course]]
sci_prices = sci_df.bookstore_new_price
m_sci = np.mean(sci_prices)
# prices of humanities textbooks (note the trailing spaces in 'HUM ' and 'LIT ')
hum_prefixes = ['CLAS', 'DRAM', 'HUM ', 'LIT ', 'PHIL', 'RELS']
hum_df = df[[c[:4] in hum_prefixes for c in df.course]]
hum_prices = hum_df.bookstore_new_price
m_hum = np.mean(hum_prices)
[m_sci, len(sci_prices), m_hum, len(hum_prices)]
[154.0701234567901, 81, 36.0403875968992, 129]
Well, the average price of a science textbook does indeed look a lot higher than the average price of a humanities textbook!
Let's place this example in the context of a general scenario. In particular, suppose that we have two independent samples with sample means $\bar{x}_1$ and $\bar{x}_2$, standard deviations $\sigma_1$ and $\sigma_2$, and sample sizes $n_1$ and $n_2$.

We'll analyze the difference of the two means using a hypothesis test with

$$ H_0: \mu_1 = \mu_2 \\ H_A: \mu_1 \neq \mu_2 $$

(or a one-sided alternative, as appropriate).
The form of the test statistic for this scenario is $$\frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$
For our problem, this boils down to:
se_sci = np.std(sci_prices)/np.sqrt(len(sci_prices))
se_hum = np.std(hum_prices)/np.sqrt(len(hum_prices))
(m_sci - m_hum)/np.sqrt(se_sci**2 + se_hum**2)
10.877145243602286
Again, a very large test statistic, so we'd reject the null hypothesis.
The ideas behind the comparison of two sample proportions are very similar to the ideas behind the comparison of two sample means. Let's examine the question in the context of a specific problem:
It's widely believed that Trump's support among men is stronger than his support among women. We'll explore this question with some data.
According to a recent Marist poll, Trump's most recent approval rating stands at 41%, but there appears to be a difference between the views of men and the views of women. Among the 712 men surveyed, 48% approve of Trump. Among the 684 women surveyed, only 36% approve of Trump.
Does this data support our conjecture that Trump's support among men is higher than that among women to a 95% level of confidence?
Let's first clearly state our hypotheses. Let's suppose that $p_m$ represents the proportion of men who support Trump and $p_w$ represents the proportion of women who support Trump. Our hypothesis test can be written

$$ H_0: p_m = p_w \\ H_A: p_m > p_w $$

or, reformulated in terms of the difference,

$$ H_0: p_m - p_w = 0 \\ H_A: p_m - p_w > 0. $$
The point behind the reformulation to compare with zero is that it gives us just one number that we can apply a standard hypothesis test to.
Now, we have measured proportions of $\hat{p}_m = 0.48$ and $\hat{p}_w = 0.36$. Thus, we want to run our test with
$$\hat{p} = \hat{p}_m - \hat{p}_w = 0.48 - 0.36 = 0.12.$$

We want just one standard error as well, which we get by adding the variances of the two sample proportions. That is,

$$SE = \sqrt{\frac{\hat{p}_m(1-\hat{p}_m)}{n_m} + \frac{\hat{p}_w(1-\hat{p}_w)}{n_w}} = \sqrt{\frac{0.48 \times 0.52}{712} + \frac{0.36 \times 0.64}{684}} \approx 0.026218.$$

Of course, I computed this with Python:
pm = 0.48
pw = 0.36
se = np.sqrt(pm*(1-pm)/712 + pw*(1-pw)/684)
se
0.026218388642629563
We can now compute our test statistic

$$T = \frac{\hat{p}_m - \hat{p}_w}{SE}$$

via
t = (pm-pw)/se
t
4.576940316037845
With this very large test statistic, we can reject the null hypothesis and conclude with confidence that Trump's support among men is indeed higher than his support among women.
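To quantify that rejection, here's one way to check the $p$-value against the 5% threshold, again using SciPy's normal approximation (an import beyond the libraries loaded above):

from scipy.stats import norm
p = norm.sf(t)  # one-sided p-value for H_A: p_m - p_w > 0
p < 0.05  # True, so we reject H_0 at the 95% level of confidence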
As we move forward this semester, we will need to ask whether our sample meets the conditions to apply some specific distribution - most often, the normal distribution. In particular, when the sample size is small, we'll typically use a $t$-distribution.
Let's look at such an example.
Ohio State and Nebraska played each other in football this past Fall. Prior to that, they had played each other 8 times over the years. Here are the results:
Season | OSU | NEB | Difference |
---|---|---|---|
2019 | 48 | 7 | 41 |
2018 | 36 | 31 | 5 |
2017 | 56 | 14 | 42 |
2016 | 62 | 3 | 59 |
2012 | 63 | 38 | 25 |
2011 | 27 | 34 | -7 |
1956 | 34 | 7 | 27 |
1955 | 28 | 20 | 8 |
Looks good for the Buckeyes!
Let's use this data to get a sense of how these teams compare historically. Specifically, we'll write down a 95% confidence interval for the difference in score.
Note: This is admittedly a whimsical example and certainly worthless for anything serious - such as prediction. You do see this sort of thing used to compare teams historically, though.
Since the data is naturally paired, all we really need to do is compute the confidence interval for the `Difference` column in the table.
When computing the confidence interval, though, we'll need a multiplier from a $t$-distribution with 7 degrees of freedom to deal with the small sample of size 8.
Using our calculator page for the $t$-distribution, we find that the multiplier for a 95% level of confidence is

$$t^* \approx 2.364624.$$

Here's the computation of the confidence interval in Python. By this point, I hope you can read the formula we are using right out of the code.
data = [41,5,42,59,25,-7,27,8]
m = np.mean(data)
s = np.std(data)  # np.std defaults to ddof=0 (population sd); ddof=1 would give a slightly wider interval
se = s/np.sqrt(len(data))
t = 2.364624
[m-t*se,m+t*se]
[7.719427143201763, 42.28057285679824]
I guess one interpretation of this interval is that it would have been a major disappointment for the Buckeyes to win by less than a touchdown.
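Incidentally, SciPy can produce the multiplier $t^*$ directly, as an alternative to the calculator page (again, an import beyond the libraries loaded above):

from scipy.stats import t as t_dist
t_star = t_dist.ppf(0.975, df=7)  # 95% confidence leaves 2.5% in each tail; 7 degrees of freedom
t_star

This agrees with the $t^* \approx 2.364624$ used above.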