So far, we've dealt mostly with one numeric list at a time. Often, though, we want to compare two variables - which is exactly what we'll do today!
In order, this material is covered in sections 7.2, 7.3, and 6.2 of our text.
I'm going to present some code to manipulate some data that all college students should be at least somewhat interested in - textbook prices. Specifically, I've got a CSV file that contains prices from the UNCA bookstore and Amazon for more than 500 textbooks that were in use at UNCA during the Fall 2018 semester.
Let's go ahead and load some libraries that we'll use throughout:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Here's the data:
df = pd.read_csv('https://www.marksmath.org/data/BookPricesFall2018.csv')
print(df.shape)
df.head()
(518, 4)
| | title | course | bookstore_new_price | amazon_max_price |
---|---|---|---|---|
0 | MyAccountingLab Access Card -- for Financial A... | ACCT 215 | 122.25 | 99.99 |
1 | Financial Accounting (Loose Pgs)(w/MyAccountin... | ACCT 215 | 263.50 | 214.95 |
2 | Managerial Accounting (Loose Pgs)(w/MyAcctLab ... | ACCT 216 | 263.25 | 234.33 |
3 | Intermediate Accounting (WileyPlus Standalone ... | ACCT 301 | 171.25 | 88.99 |
4 | Intermediate Accounting (Loose Pgs)(w/ Wiley P... | ACCT 301 | 297.00 | 183.31 |
"Common knowledge" suggest that Amazon prices might generally be lower than our bookstore's prices. If we want to compare the prices between the two vendors like this, we can simply compute the pairwise difference.
differences = df.bookstore_new_price - df.amazon_max_price
m = differences.mean()
m
18.29876447876449
I guess that computation means that, on average, a textbook is about $\$18.30$ more expensive at the UNCA bookstore than at Amazon.
Of course, the complete picture is more complicated than just the mean of the differences. Here's a histogram of those differences, with the mean marked by a vertical line:
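Here's a minimal matplotlib sketch that generates such a histogram, reusing the differences and m computed above (the bin count and colors are arbitrary choices):

plt.hist(differences, bins=30, edgecolor='white')  # distribution of bookstore minus Amazon prices
plt.axvline(m, color='red')  # vertical line at the mean difference
plt.xlabel('bookstore price minus Amazon price (dollars)')
plt.ylabel('number of books')
plt.show()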
Now, let's explore the question: are Amazon prices genuinely cheaper than the UNCA bookstore prices to a statistically significant level? To do so, let's first clearly state our null and alternative hypotheses. Let $\mu_U$ denote the average price of the books at the UNCA bookstore and let $\mu_A$ denote the average price of the books at Amazon.
$$ H_0: \mu_A = \mu_U \\ H_A: \mu_A < \mu_U $$

Since the data are paired, we can rephrase this by taking the pairwise differences of the data sets to get a single data set. If, for each row, we take the UNCA price minus the Amazon price (exactly the differences we computed above) to get a single data set with mean $\mu$, then we can rephrase our hypotheses as

$$ H_0: \mu = 0 \\ H_A: \mu > 0 $$

Note that we've already computed an estimate of $\mu$ to be $\overline{x} = 18.3$, denoted by `m` in our code. To examine the hypothesis, we compute a standard error and test statistic from our set of differences.
se = np.std(differences)/np.sqrt(len(differences))  # np.std defaults to ddof=0; with n = 518 the difference from ddof=1 is negligible
se
1.669252423192934
T = (m-0)/se
T
10.962251259616412
Given this huge test statistic, the $p$-value will be ridiculously small.
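For the record, we can pin down that $p$-value with SciPy's normal distribution, assuming (as this large a sample justifies) a normal approximation. Note that SciPy isn't among the libraries we loaded at the top, so we import it here:

from scipy.stats import norm
p = norm.sf(T)  # one-sided p-value for H_A: mu > 0 (survival function = upper tail)
p

The result is far below any conventional significance threshold.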
What do you suppose is more expensive - the average textbook in the sciences or the average textbook in the humanities? This is a perfectly reasonable question, but unlike in our last example, the data is not paired in a natural way. In this case, we'll compute the two means separately, take the difference, and run a hypothesis test using a combined standard error.
# prices of science textbooks, selected by the four-character course prefix
sci_prefixes = ['ATMS', 'BIOL', 'CHEM', 'CSCI', 'MATH', 'PHYS', 'STAT']
sci_df = df[[c[:4] in sci_prefixes for c in df.course]]
sci_prices = sci_df.bookstore_new_price
m_sci = np.mean(sci_prices)
# prices of humanities textbooks (note the trailing spaces in 'HUM ' and 'LIT ')
hum_prefixes = ['CLAS', 'DRAM', 'HUM ', 'LIT ', 'PHIL', 'RELS']
hum_df = df[[c[:4] in hum_prefixes for c in df.course]]
hum_prices = hum_df.bookstore_new_price
m_hum = np.mean(hum_prices)
[m_sci, len(sci_prices), m_hum, len(hum_prices)]
[154.0701234567901, 81, 36.0403875968992, 129]
Well, the average price of a science textbook does indeed look a lot higher than the average price of a humanities textbook!
Let's place this example in the context of a general scenario. In particular, suppose that we have two independent samples with sample means $\bar{x}_1$ and $\bar{x}_2$, standard deviations $\sigma_1$ and $\sigma_2$, and sample sizes $n_1$ and $n_2$.

We'll analyze the difference of the two means using a hypothesis test with

$$ H_0: \mu_1 = \mu_2 \\ H_A: \mu_1 \neq \mu_2 $$

(or a one-sided alternative, as appropriate).
The form of the test statistic for this scenario is $$\frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$
For our problem, this boils down to:
se_sci = np.std(sci_prices)/np.sqrt(len(sci_prices))
se_hum = np.std(hum_prices)/np.sqrt(len(hum_prices))
(m_sci - m_hum)/np.sqrt(se_sci**2 + se_hum**2)
10.877145243602286
Again, a very large test statistic, so we'd reject the null hypothesis.
The ideas behind the comparison of two sample proportions are very similar to the ideas behind the comparison of two sample means. Let's examine the question in the context of a specific problem:
It's widely believed that Trump's support among men is stronger than his support among women. We'll explore this question with some data.
According to a recent Marist poll, Trump's most recent approval rating stands at 41%, but there appears to be a difference between the views of men and the views of women. Among the 712 men surveyed, 48% approve of Trump. Among the 684 women surveyed, only 36% approve of Trump.
Does this data support our conjecture that Trump's support among men is higher than that among women to a 95% level of confidence?
Let's first clearly state our hypotheses. Let's suppose that $p_m$ represents the proportion of men who support Trump and $p_w$ represents the proportion of women who support Trump. Our hypothesis test can be written

$$ H_0: p_m = p_w \\ H_A: p_m > p_w $$

or, reformulated in terms of the difference,

$$ H_0: p_m - p_w = 0 \\ H_A: p_m - p_w > 0. $$
The point behind the reformulation to compare with zero is that it gives us just one number that we can apply a standard hypothesis test to.
Now, we have measured proportions of $\hat{p}_m = 0.48$ and $\hat{p}_w = 0.36$. Thus, we want to run our test with
$$\hat{p} = \hat{p}_m - \hat{p}_w = 0.48 - 0.36 = 0.12.$$

We want just one standard error as well, which we get by adding the variances of the two sample proportions. That is,

$$SE = \sqrt{\frac{\hat{p}_m(1-\hat{p}_m)}{n_m} + \frac{\hat{p}_w(1-\hat{p}_w)}{n_w}} = \sqrt{\frac{0.48 \times 0.52}{712} + \frac{0.36 \times 0.64}{684}} \approx 0.026218.$$

Of course, I computed this with Python:
pm = 0.48
pw = 0.36
se = np.sqrt(pm*(1-pm)/712 + pw*(1-pw)/684)
se
0.026218388642629563
We can now compute our test statistic

$$T = \frac{\hat{p}_m - \hat{p}_w}{SE}$$

via
t = (pm-pw)/se
t
4.576940316037845
With this very large test statistic, we can reject the null hypothesis and conclude with confidence that Trump's support among men is indeed higher than his support among women.
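To quantify that rejection, here's one way to check the $p$-value against the 5% threshold, again using SciPy's normal approximation (an import beyond the libraries loaded above):

from scipy.stats import norm
p = norm.sf(t)  # one-sided p-value for H_A: p_m - p_w > 0
p < 0.05  # True, so we reject H_0 at the 95% level of confidence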
As we move forward this semester, we will need to ask whether our sample meets the conditions to apply some specific distribution - most often, the normal distribution. In particular, when the sample size is small, we'll typically use a $t$-distribution.
Let's look at such an example.
Ohio State and Nebraska played each other in football this past Fall. Prior to that, they had played each other 8 times over the years. Here are the results:
Season | OSU | NEB | Difference |
---|---|---|---|
2019 | 48 | 7 | 41 |
2018 | 36 | 31 | 5 |
2017 | 56 | 14 | 42 |
2016 | 62 | 3 | 59 |
2012 | 63 | 38 | 25 |
2011 | 27 | 34 | -7 |
1956 | 34 | 7 | 27 |
1955 | 28 | 20 | 8 |
Looks good for the Buckeyes!
Let's use this data to get a sense of how these teams compare historically. Specifically, we'll write down a 95% confidence interval for the difference in score.
Note: This is admittedly a whimsical example and certainly worthless for anything serious - such as prediction. You do see this sort of thing used to compare teams historically, though.
Since the data is naturally paired, all we really need to do is compute the confidence interval for the `Difference` column in the table.
When computing the confidence interval, though, we'll need a multiplier from a $t$-distribution with 7 degrees of freedom to deal with the small sample of size 8.
Using our calculator page for the $t$-distribution, we find that the multiplier for a 95% level of confidence is

$$t^* \approx 2.364624.$$

Here's the computation of the confidence interval in Python. By this point, I hope you can read the formula we are using right out of the code.
data = [41,5,42,59,25,-7,27,8]
m = np.mean(data)
s = np.std(data)  # np.std defaults to ddof=0 (population sd); ddof=1 would give a slightly wider interval
se = s/np.sqrt(len(data))
t = 2.364624
[m-t*se,m+t*se]
[7.719427143201763, 42.28057285679824]
I guess one interpretation of this interval is that it would have been a major disappointment for the Buckeyes to win by less than a touchdown.
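Incidentally, SciPy can produce the multiplier $t^*$ directly, as an alternative to the calculator page (again, an import beyond the libraries loaded above):

from scipy.stats import t as t_dist
t_star = t_dist.ppf(0.975, df=7)  # 95% confidence leaves 2.5% in each tail; 7 degrees of freedom
t_star

This agrees with the $t^* \approx 2.364624$ used above.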