Comparing Data Sets

Wed, Oct 30, 2024

Comparing data sets

So far, we’ve dealt mostly with one numeric list at a time. Often, though, we want to compare two variables - which is exactly what we’ll do today!

In order, this material is covered in sections 7.2, 7.3, and 6.2 of our text.

Textbook prices

I’m going to present some code to manipulate some data that all college students should be somewhat interested in: textbook prices. Specifically, I’ve got a CSV file that contains prices from the UNCA bookstore and Amazon for more than 500 textbooks that were in use at UNCA during the Fall 2018 semester.

Let’s go ahead and load some libraries that we’ll use throughout:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

The data

Here’s the data:

import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/BookPricesFall2018.csv')
print(df.shape)
df.head(3)
(518, 4)
   title                                              course    bookstore_new_price  amazon_max_price
0  MyAccountingLab Access Card -- for Financial A...  ACCT 215  122.25               99.99
1  Financial Accounting (Loose Pgs)(w/MyAccountin...  ACCT 215  263.50               214.95
2  Managerial Accounting (Loose Pgs)(w/MyAcctLab ...  ACCT 216  263.25               234.33

Paired data

Comparing prices between the sellers

“Common knowledge” suggests that Amazon prices might generally be lower than our bookstore’s prices. To compare the prices between the two vendors, we can simply compute the pairwise differences.

differences = df.bookstore_new_price - df.amazon_max_price
m = differences.mean()
m
18.298764478764475

That computation means that, on average, a textbook costs \(\$18.30\) more at the UNCA bookstore than at Amazon.

A hypothesis test

Now, let’s explore the question: are Amazon prices genuinely cheaper than the UNCA bookstore prices to a statistically significant level? To do so, let’s first clearly state our null and alternative hypotheses. Let \(\mu_U\) denote the average price of the books at the UNCA bookstore and let \(\mu_A\) denote the average price of the books at Amazon.

\[ H_0: \mu_A = \mu_U \\ H_A: \mu_A < \mu_U \]

Rephrasing using differences

Since the data are paired, we can rephrase this by taking the pairwise difference of the data sets to get a single data set. If, for each row, we take the UNCA price minus the Amazon price (exactly the differences computed in our code) to get a single data set with mean \(\mu\), then we can rephrase our hypotheses as

\[ H_0: \mu = 0 \\ H_A: \mu > 0 \]

Computation

Note that we’ve already computed the mean of the pairwise differences in our sample to be \(\overline{x} \approx 18.30\), denoted by m in our code. To examine the hypothesis, we compute a standard error and test statistic from our set of differences.

se = np.std(differences)/np.sqrt(len(differences))
T = (m-0)/se
{'se': se, 'T': T}
{'se': 1.669252423192934, 'T': 10.962251259616403}
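
Plugging the rounded values of m and se from the output above into the formula for the test statistic gives

\[ T = \frac{\overline{x} - 0}{SE} \approx \frac{18.30}{1.669} \approx 10.96. \]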

Given this huge test statistic, the \(p\)-value will be ridiculously small.
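
To see just how small, here is a minimal sketch that turns the test statistic into a one-sided \(p\)-value. It assumes that a normal approximation is adequate with 518 differences, and the helper name upper_tail_p is made up for illustration. Since the differences in our code are bookstore price minus Amazon price, a large positive test statistic is the evidence that Amazon is cheaper, so the upper tail is the relevant one.

```python
import math

def upper_tail_p(z):
    """Upper-tail area under the standard normal curve to the right of z."""
    # erfc stays accurate far out in the tail, where 1 - cdf would round to 0.
    return 0.5 * math.erfc(z / math.sqrt(2))

T = 10.962251259616403  # test statistic copied from the output above
p_value = upper_tail_p(T)
print(p_value)
```

The printed value is many orders of magnitude below any commonly used significance level.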

Unpaired data

Comparing prices between disciplines

What do you suppose is more expensive - the average textbook in the sciences or the average textbook in the humanities? This is a perfectly reasonable question but the data is not paired in a natural way, as in our last example. In this case, we’ll compute the two means separately, take the difference, and run a hypothesis test using a combined standard error.

Grabbing the data

sci_prefixes = ['ATMS', 'BIOL', 'CHEM', 'CSCI', 'MATH', 'PHYS', 'STAT']
sci_df = df[[c[:4] in sci_prefixes for c in df.course]]
sci_prices = sci_df.bookstore_new_price
m_sci = np.mean(sci_prices)

hum_prefixes = ['CLAS', 'DRAM', 'HUM ', 'LIT ', 'PHIL', 'RELS']
hum_df = df[[c[:4] in hum_prefixes for c in df.course]]
hum_prices = hum_df.bookstore_new_price
m_hum = np.mean(hum_prices)

{"sci mean": m_sci, "sci_cnt": len(sci_prices), "hum_mean": m_hum, "hum_cnt": len(hum_prices)}
{'sci mean': 154.07012345679013,
 'sci_cnt': 81,
 'hum_mean': 36.040387596899215,
 'hum_cnt': 129}

Well, the average price of a science textbook indeed looks a lot higher than the average price of a humanities textbook!

The general scenario

Let’s place this example in the context of a general scenario. In particular, suppose that

  • \(D_1\) and \(D_2\) are two sets of sample observations,
  • with \(n_1\) and \(n_2\) elements respectively.
  • The mean of \(D_1\) is \(\bar{x}_1\) and the mean of \(D_2\) is \(\bar{x}_2\)
  • The standard deviation of \(D_1\) is \(\sigma_1\) and the standard deviation of \(D_2\) is \(\sigma_2\)

The general strategy

We’ll analyze the difference of the two means using a hypothesis test with

  • Mean \(\overline{x}_1 - \overline{x}_2\),
  • Standard error \[\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}},\]
  • and, when the smaller of these is less than 30, we use a \(t\)-distribution with the minimum of \(n_1-1\) and \(n_2-1\) as the degrees of freedom (otherwise a normal distribution suffices).

The test statistic

The form of the test statistic for this scenario is \[\frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}\]

For our problem, this boils down to:

se_sci = np.std(sci_prices)/np.sqrt(len(sci_prices))
se_hum = np.std(hum_prices)/np.sqrt(len(hum_prices))
(m_sci - m_hum)/np.sqrt(se_sci**2 + se_hum**2)
10.877145243602286

Again, this is a very large test statistic, so we’d reject the null hypothesis that the two average prices are equal.
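
The same computation can be packaged so that it works from summary statistics alone. Here’s a minimal sketch of the general strategy; the function name two_sample_stat and the numbers in the usage example are made up for illustration and do not come from the textbook data.

```python
import math

def two_sample_stat(mean1, sd1, n1, mean2, sd2, n2):
    """Difference of means, combined standard error, test statistic,
    and conservative degrees of freedom for two unpaired samples."""
    diff = mean1 - mean2
    # Combined standard error: add the two squared standard errors.
    se = math.sqrt(sd1**2 / n1 + sd2**2 / n2)
    # Conservative degrees of freedom: the smaller of n1 - 1 and n2 - 1.
    df = min(n1 - 1, n2 - 1)
    return diff, se, diff / se, df

# Hypothetical summary statistics, chosen only to make the arithmetic easy.
diff, se, T, df = two_sample_stat(52.0, 10.0, 25, 47.0, 12.0, 36)
print({'diff': diff, 'se': se, 'T': T, 'df': df})
```

In this made-up example the smaller sample has only 25 observations, so the test statistic would be compared against a \(t\)-distribution with 24 degrees of freedom.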

Comparing two sample proportions

The ideas behind the comparison of two sample proportions are very similar to the ideas behind the comparison of two sample means. Let’s examine the question in the context of a specific problem:

It’s widely believed that Donald Trump’s support among men is stronger than his support among women. We’ll explore this question with some data.

The data

Back in October of 2020, a Marist poll indicated that, at that time, Trump’s approval rating stood at 41% but there appeared to be a difference between the views of men and the views of women. Among the 712 men surveyed, 48% approved of Trump. Among the 685 women surveyed, only 36% approved of Trump.

Does this data support our conjecture that Trump’s support among men is higher than that among women to a 95% level of confidence?

The hypothesis test

Let’s first clearly state our hypotheses. Let’s suppose that \(p_m\) represents the proportion of men who support Trump and \(p_w\) represents the proportion of women who support Trump. Our hypothesis test can be written

  • \(H_0: p_m = p_w\) or, put another way, \(p_m-p_w = 0\)
  • \(H_A: p_m > p_w\) or, put another way, \(p_m-p_w > 0\).

The point behind the reformulation to compare with zero is that it gives us just one number that we can apply a standard hypothesis test to.

Applying the data

Now, we have measured proportions of \(\hat{p}_m = 0.48\) and \(\hat{p}_w = 0.36\). Thus, we want to run our test with

\[\hat{p} = \hat{p}_m - \hat{p}_w = 0.48 - 0.36 = 0.12.\]

We want just one standard error as well, which we get by adding the variances in the two samples. That is,

\[SE = \sqrt{\frac{\hat{p}_m(1-\hat{p}_m)}{n_m} + \frac{\hat{p}_w(1-\hat{p}_w)}{n_w}} = \sqrt{\frac{0.48 \times 0.52}{712} + \frac{0.36 \times 0.64}{684}} \approx 0.026218.\]

Of course, I computed this with Python:

pm = 0.48
pw = 0.36
se = np.sqrt(pm*(1-pm)/712 + pw*(1-pw)/684)
se
0.026218388642629563

The test statistic

We can now compute our test statistic: \[T = \frac{\hat{p}_m - \hat{p}_w}{SE}.\]

t = (pm-pw)/se
t
4.576940316037845

With this very large test statistic, well above the cutoff of roughly 1.645 for a one-sided test at the 95% level, we can reject the null hypothesis and conclude that Trump’s support among men is higher than his support among women.
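
To put a number on that conclusion, here’s a sketch that wraps the whole two-proportion computation in one function and reports a one-sided \(p\)-value. It assumes a normal approximation and the same unpooled standard error used above; the function name two_proportion_test is made up for illustration.

```python
import math

def two_proportion_test(p1, n1, p2, n2):
    """Test statistic and one-sided p-value for H_A: p1 > p2."""
    # Standard error: add the variances of the two sample proportions.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / se
    # Upper-tail area under the standard normal curve.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Same proportions and sample sizes as in the computation above.
z, p_value = two_proportion_test(0.48, 712, 0.36, 684)
alpha = 0.05  # significance level matching 95% confidence
print({'z': z, 'p_value': p_value, 'reject_null': p_value < alpha})
```

The test statistic agrees with the value of t computed above, and the \(p\)-value falls far below 0.05.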