Wed, Oct 30, 2024
So far, we’ve dealt mostly with one numeric list at a time. Often, though, we want to compare two variables - which is exactly what we’ll do today!
In order, this material is covered in sections 7.2, 7.3, and 6.2 of our text.
I’m going to present some code to manipulate some data that all college students should be at least somewhat interested in - textbook prices. Specifically, I’ve got a CSV file that contains prices from the UNCA bookstore and Amazon for more than 500 textbooks that were in use at UNCA during the Fall 2018 semester.
Let’s go ahead and load some libraries that we’ll use throughout:
Here’s the data:
import numpy as np
import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/BookPricesFall2018.csv')
print(df.shape)
df.head(3)
(518, 4)
|  | title | course | bookstore_new_price | amazon_max_price |
|---|---|---|---|---|
| 0 | MyAccountingLab Access Card -- for Financial A... | ACCT 215 | 122.25 | 99.99 |
| 1 | Financial Accounting (Loose Pgs)(w/MyAccountin... | ACCT 215 | 263.50 | 214.95 |
| 2 | Managerial Accounting (Loose Pgs)(w/MyAcctLab ... | ACCT 216 | 263.25 | 234.33 |
“Common knowledge” suggests that Amazon prices might generally be lower than our bookstore’s prices. Since each row pairs the two vendors’ prices for the same book, we can compare the vendors by computing the mean of the pairwise differences:

m = np.mean(df.bookstore_new_price - df.amazon_max_price)
m

18.298764478764475
I guess that computation means that, on average, a textbook is about \(\$18.30\) more expensive at the UNCA bookstore than at Amazon.
Now, let’s explore the question: are Amazon prices genuinely cheaper than the UNCA bookstore prices to a statistically significant degree? To do so, let’s first clearly state our null and alternative hypotheses. Let \(\mu_U\) denote the average price of the books at the UNCA bookstore and let \(\mu_A\) denote the average price of the books at Amazon.
\[ H_0: \mu_A = \mu_U \\ H_A: \mu_A < \mu_U \]
Since the data are paired, we can rephrase this by taking the pairwise difference of the data sets to get a single data set. If, for each row, we take the Amazon price minus the UNCA price, we get a single data set with mean \(\mu\), and we can rephrase our hypotheses as
\[ H_0: \mu = 0 \\ H_A: \mu < 0 \]
Note that we’ve already computed an estimate of \(\mu\) to be \(\overline{x}\approx 18.30\) in magnitude, denoted by m in our code. (Since m is the bookstore price minus the Amazon price, the estimate of \(\mu\) as defined above is actually \(-18.30\); we’ll work with magnitudes and keep the direction in mind.) To examine the hypothesis, we compute a standard error and test statistic from our set of differences:

diffs = df.bookstore_new_price - df.amazon_max_price
se = np.std(diffs) / np.sqrt(len(diffs))
T = m / se
{'se': se, 'T': T}

{'se': 1.669252423192934, 'T': 10.962251259616403}
Given this huge test statistic, the \(p\)-value will be ridiculously small.
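Just how small? Here’s a quick back-of-the-envelope check using a standard normal approximation for the test statistic above (with 517 degrees of freedom, the \(t\) distribution is essentially normal):

```python
import math

T = 10.962251259616403  # test statistic from the cell above

# One-sided p-value under a standard normal approximation:
# P(Z > T) = erfc(T/sqrt(2))/2.
p = 0.5 * math.erfc(T / math.sqrt(2))
print(p)  # astronomically small -- far below any reasonable cutoff
```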
What do you suppose is more expensive - the average textbook in the sciences or the average textbook in the humanities? This is a perfectly reasonable question, but the data are not paired in a natural way, as they were in our last example. In this case, we’ll compute the two means separately, take their difference, and run a hypothesis test using a combined standard error.
sci_prefixes = ['ATMS', 'BIOL', 'CHEM', 'CSCI', 'MATH', 'PHYS', 'STAT']
sci_df = df[[c[:4] in sci_prefixes for c in df.course]]
sci_prices = sci_df.bookstore_new_price
m_sci = np.mean(sci_prices)
hum_prefixes = ['CLAS', 'DRAM', 'HUM ', 'LIT ', 'PHIL', 'RELS']
hum_df = df[[c[:4] in hum_prefixes for c in df.course]]
hum_prices = hum_df.bookstore_new_price
m_hum = np.mean(hum_prices)
{"sci_mean": m_sci, "sci_cnt": len(sci_prices), "hum_mean": m_hum, "hum_cnt": len(hum_prices)}
{'sci_mean': 154.07012345679013,
'sci_cnt': 81,
'hum_mean': 36.040387596899215,
'hum_cnt': 129}
Well, the average price of a science textbook does indeed look a lot higher than the average price of a humanities textbook!
Let’s place this example in the context of a general scenario. In particular, suppose that we have two independent samples of sizes \(n_1\) and \(n_2\), with sample means \(\bar{x}_1\) and \(\bar{x}_2\), drawn from populations with means \(\mu_1\) and \(\mu_2\) and standard deviations \(\sigma_1\) and \(\sigma_2\).

We’ll analyze the difference of the two means using a hypothesis test with

\[ H_0: \mu_1 = \mu_2 \\ H_A: \mu_1 > \mu_2, \]

where group 1 is the one we suspect has the larger mean.
The form of the test statistic for this scenario is \[\frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}.\] In practice, we estimate the population standard deviations \(\sigma_1\) and \(\sigma_2\) with the sample standard deviations.
For our problem, this boils down to:
se_sci = np.std(sci_prices)/np.sqrt(len(sci_prices))
se_hum = np.std(hum_prices)/np.sqrt(len(hum_prices))
(m_sci - m_hum)/np.sqrt(se_sci**2 + se_hum**2)
10.877145243602286
Again, that’s a very large test statistic, so we’d reject the null.
The ideas behind the comparison of two sample proportions are very similar to those behind the comparison of two sample means. Let’s examine the question in the context of a specific problem:
It’s widely believed that Donald Trump’s support among men is stronger than his support among women. We’ll explore this question with some data.
Back in October of 2020, a Marist poll indicated that, at that time, Trump’s approval rating stood at 41%, but there appeared to be a difference between the views of men and the views of women. Among the 712 men surveyed, 48% approved of Trump; among the 684 women surveyed, only 36% approved.

Do these data support our conjecture that Trump’s support among men is higher than his support among women to a 95% level of confidence?
Let’s first clearly state our hypotheses. Let’s suppose that \(p_m\) represents the proportion of men who support Trump and \(p_w\) represents the proportion of women who support Trump. Our hypothesis test can be written

\[ H_0: p_m - p_w = 0 \\ H_A: p_m - p_w > 0 \]
The point behind the reformulation to compare with zero is that it gives us just one number that we can apply a standard hypothesis test to.
Now, we have measured proportions of \(\hat{p}_m = 0.48\) and \(\hat{p}_w = 0.36\). Thus, we want to run our test with
\[\hat{p} = \hat{p}_m - \hat{p}_w = 0.48 - 0.36 = 0.12.\]
We want just one standard error as well, which we get by adding the variances in the two samples. That is,
\[SE = \sqrt{\frac{\hat{p}_m(1-\hat{p}_m)}{n_m} + \frac{\hat{p}_w(1-\hat{p}_w)}{n_w}} = \sqrt{\frac{0.48 \times 0.52}{712} + \frac{0.36 \times 0.64}{684}} \approx 0.026218.\]
Of course, I computed this with Python:

np.sqrt(0.48*0.52/712 + 0.36*0.64/684)
We can now compute our test statistic: \[T = \frac{\hat{p}_m - \hat{p}_w}{SE}.\]
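Plugging in the poll numbers from above, here’s a self-contained computation of the test statistic and its one-sided \(p\)-value (using a normal approximation):

```python
import math

# Poll numbers from above
p_m, n_m = 0.48, 712   # men
p_w, n_w = 0.36, 684   # women

# Combined standard error for the difference of the proportions
se = math.sqrt(p_m*(1 - p_m)/n_m + p_w*(1 - p_w)/n_w)

# Test statistic
T = (p_m - p_w) / se

# One-sided p-value via a normal approximation: P(Z > T) = erfc(T/sqrt(2))/2
p_value = 0.5 * math.erfc(T / math.sqrt(2))

print(T, p_value)  # T is around 4.58; p_value is far below 0.05
```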
With this very large test statistic, we can reject the null hypothesis and conclude with confidence that Trump’s support among men really is higher than his support among women.