Comparing data sets

As we move through the latter part of the semester, we'll often be interested in checking for a relationship between variables. Perhaps the simplest such situation is when we compare the sample means of two data sets. Here are a couple of examples along those lines.

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Textbook prices

I've got a CSV file that contains prices from the UNCA bookstore and Amazon for more than 500 textbooks in use at UNCA this semester. Let's take a look:

In [6]:
df = pd.read_csv('https://www.marksmath.org/data/BookPricesFall2018.csv')
print(df.shape)
df.head()
(518, 4)
Out[6]:
title course bookstore_new_price amazon_max_price
0 MyAccountingLab Access Card -- for Financial A... ACCT 215 122.25 99.99
1 Financial Accounting (Loose Pgs)(w/MyAccountin... ACCT 215 263.50 214.95
2 Managerial Accounting (Loose Pgs)(w/MyAcctLab ... ACCT 216 263.25 234.33
3 Intermediate Accounting (WileyPlus Standalone ... ACCT 301 171.25 88.99
4 Intermediate Accounting (Loose Pgs)(w/ Wiley P... ACCT 301 297.00 183.31

Comparing prices between the sellers

If we want to compare the prices between the two vendors we can simply compute the pairwise difference.

In [8]:
differences = []
for i,row in df.iterrows():
    amazon_max_price = float(row['amazon_max_price'])
    bookstore_new_price = float(row['bookstore_new_price'])
    differences.append(bookstore_new_price - amazon_max_price)
m = np.mean(differences)
m
Out[8]:
18.298764478764475

That computation means that, on average, a textbook costs $\$18.30$ more at the UNCA bookstore than at Amazon. Of course, the complete picture is more complicated than that.

In [9]:
plt.hist(differences, bins = range(-50,230,20), edgecolor='black')
plt.plot([m,m],[0,290], 'k--');
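The loop that builds `differences` can also be written as a single column subtraction in pandas. Here is a minimal sketch on a tiny made-up data frame; the name `toy` and its three rows are hypothetical stand-ins, not values from the CSV file.

```python
import pandas as pd

# Hypothetical stand-in for the real CSV: three made-up rows with the
# same two price columns used in this section.
toy = pd.DataFrame({
    'bookstore_new_price': [100.00, 50.00, 80.00],
    'amazon_max_price':    [90.00, 55.00, 60.00],
})

# Subtract the columns row by row: bookstore price minus Amazon price.
toy_differences = toy['bookstore_new_price'] - toy['amazon_max_price']

# Mean of the paired differences.
toy_m = toy_differences.mean()
print(toy_differences.tolist(), toy_m)
```

Assuming both price columns in `df` are already numeric, the same subtraction applied to `df` should reproduce the `differences` list computed above.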

Now, let's explore the question: are Amazon prices genuinely cheaper than the UNCA bookstore prices (as is commonly believed)? To do so, let's first clearly state our null and alternative hypotheses. Let $\mu_U$ denote the average price of the books at the UNCA bookstore and let $\mu_A$ denote the average price of the books at Amazon.

  • $H_0: \mu_A = \mu_U$
  • $H_A: \mu_A < \mu_U$.

Since the data are paired, we can rephrase this by taking the pairwise difference of the data sets to get a single data set. If, for each row, we take the UNCA price minus the Amazon price (exactly as the code above does) to get a single data set with mean $\mu$, then we can rephrase our hypotheses as

  • $H_0: \mu = 0$
  • $H_A: \mu > 0$.

Note that we've already computed an estimate of $\mu$ to be $\overline{x} \approx 18.30$, denoted by m in our code. To examine the hypothesis, we compute a standard error and test statistic from our set of differences.

In [10]:
se = np.std(differences)/np.sqrt(len(differences))
se
Out[10]:
1.6692524231929347
In [11]:
T = (m-0)/se
T
Out[11]:
10.9622512596164

Given this huge test statistic, the $p$-value will be ridiculously small.
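To put a number on "ridiculously small", here is a rough sketch that turns the test statistic into a one-sided $p$-value. It assumes that a normal approximation is acceptable for a sample of 518 differences and it uses the upper tail, since the differences were computed as bookstore price minus Amazon price. The helper name `normal_upper_tail` is my own, not something defined elsewhere in these notes.

```python
import math

def normal_upper_tail(z):
    """P(Z > z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

T = 10.9622512596164  # test statistic copied from Out[11]
p_value = normal_upper_tail(T)
print(p_value)
```

Any common significance level, such as 0.05 or 0.01, is many orders of magnitude larger than this $p$-value.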

Comparing prices between disciplines

What do you suppose is more expensive - the average textbook in the sciences or the average textbook in the humanities? Let's grab a subset of our data to find out.

In [12]:
sci_prefixes = ['ATMS', 'BIOL', 'CHEM', 'CSCI', 'MATH', 'PHYS', 'STAT']
sci_df = df[[c[:4] in sci_prefixes for c in df.course]]
sci_prices = sci_df.bookstore_new_price
m_sci = np.mean(sci_prices)

hum_prefixes = ['CLAS', 'DRAM', 'HUM ', 'LIT ', 'PHIL', 'RELS']
hum_df = df[[c[:4] in hum_prefixes for c in df.course]]
hum_prices = hum_df.bookstore_new_price
m_hum = np.mean(hum_prices)

[m_sci, len(sci_prices), m_hum, len(hum_prices)]
Out[12]:
[154.0701234567901, 81, 36.0403875968992, 129]

Well, the average price of a science textbook certainly looks a lot higher than the average price of a humanities textbook, but how do we compare these? In particular, what is the correct choice of a standard error?

To clarify, suppose that

  • $D_1$ and $D_2$ are two sets of sample observations,
  • with $n_1$ and $n_2$ elements respectively.
  • The mean of $D_1$ is $\bar{x}_1$ and the mean of $D_2$ is $\bar{x}_2$
  • The standard deviation of $D_1$ is $\sigma_1$ and the standard deviation of $D_2$ is $\sigma_2$

Then, we analyze the difference of the two means using a hypothesis test with

  • Mean $\overline{x}_1 - \overline{x}_2$,
  • Standard error $$\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}},$$
  • and, when the smaller sample has fewer than 30 or so observations, a $t$-distribution with the minimum of $n_1-1$ and $n_2-1$ as the degrees of freedom.

In this setup, the expression $$\frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$ is often called the test statistic.

For our problem, this boils down to:

In [13]:
se_sci = np.std(sci_prices)/np.sqrt(len(sci_prices))
se_hum = np.std(hum_prices)/np.sqrt(len(hum_prices))
(m_sci - m_hum)/np.sqrt(se_sci**2 + se_hum**2)
Out[13]:
10.877145243602286

Again, a very large test statistic!
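The two-sample recipe can be wrapped in a small reusable function. The following is a sketch using only the standard library; the function name `two_sample_statistic` and the two short price lists are made up for illustration and are not taken from the CSV file.

```python
import math
import statistics

def two_sample_statistic(d1, d2):
    """Return (difference of means, standard error, test statistic)."""
    n1, n2 = len(d1), len(d2)
    diff = statistics.mean(d1) - statistics.mean(d2)
    # statistics.variance uses the n-1 (sample) denominator.
    se = math.sqrt(statistics.variance(d1) / n1 + statistics.variance(d2) / n2)
    return diff, se, diff / se

# Made-up prices, for illustration only.
group_a = [120.0, 150.0, 180.0, 150.0]
group_b = [30.0, 40.0, 50.0, 40.0]
diff, se, T = two_sample_statistic(group_a, group_b)
print(diff, se, T)
```

One design note: `statistics.variance` divides by $n-1$, while `np.std` divides by $n$ unless you pass `ddof=1`. Applied to `sci_prices` and `hum_prices`, this function would therefore give a test statistic slightly different from the one computed above.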

Comparing two sample proportions

The ideas behind the comparison of two sample proportions are very similar to the ideas behind the comparison of two sample means. We've just got to figure out the correct formulation and parameters to use in our hypothesis test.

A problem

Let's illustrate the ideas in the context of a problem. It's widely believed that Trump's support among men is stronger than his support among women. Let's use some data to test this.

According to a recent Reuters poll, Trump's most recent approval rating stands at 40%, but there appears to be a difference between the views of men and the views of women. Among the 1009 men surveyed, 44% approve of Trump. Among the 1266 women surveyed, only 36% approve of Trump.

Does this data support our conjecture that Trump's support among men is higher than that among women to a 95% level of confidence?

Solution: Let's first clearly state our hypotheses. Let's suppose that $p_m$ represents the proportion of men who support Trump and $p_w$ represents the proportion of women who support Trump. Our hypothesis test can be written

  • $H_0: p_m = p_w$ or, put another way, $p_m-p_w = 0$
  • $H_A: p_m > p_w$ or, put another way, $p_m-p_w > 0$.

The point behind the reformulation to compare with zero is that it gives us just one number that we can apply a standard hypothesis test to. Now, we have measured proportions of $\hat{p}_m = 0.44$ and $\hat{p}_w = 0.36$. Thus, we want to run our test with $$\hat{p} = \hat{p}_m - \hat{p}_w = 0.44 - 0.36 = 0.08.$$ We want just one standard error as well, which we get by adding the variances in the two samples. That is, $$SE = \sqrt{\frac{\hat{p}_m(1-\hat{p}_m)}{n_m} + \frac{\hat{p}_w(1-\hat{p}_w)}{n_w}} = \sqrt{\frac{0.44 \times 0.56}{1009} + \frac{0.36 \times 0.64}{1266}} \approx 0.02064.$$

Of course, I computed this with Python:

In [14]:
pm = 0.44
pw = 0.36
se = np.sqrt(pm*(1-pm)/1009 + pw*(1-pw)/1266)
se
Out[14]:
0.02064443512677508

We can now compute our test statistic $$T = \frac{\hat{p}_m - \hat{p}_w}{SE}$$ via

In [15]:
t = 0.08/se
t
Out[15]:
3.8751363022882095

With this very large test statistic, well above the one-sided 95% cutoff of about 1.645 for a normal distribution, we can reject the null hypothesis and conclude, at the 95% level of confidence, that Trump's support among men is higher than his support among women.
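To back up that conclusion with a $p$-value, here is a sketch that bundles the whole computation into one function. It assumes a normal approximation for the test statistic and keeps the unpooled standard error used above; some textbooks use a pooled proportion under $H_0$ instead. The function name `two_proportion_test` is hypothetical.

```python
import math

def two_proportion_test(p1, n1, p2, n2):
    """Return (test statistic, one-sided p-value) for H_A: p1 > p2."""
    # Unpooled standard error: add the two variances, then take the root.
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = (p1 - p2) / se
    # Upper-tail probability P(Z > z) for a standard normal Z.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Poll numbers from the problem: men first, then women.
z, p_value = two_proportion_test(0.44, 1009, 0.36, 1266)
print(z, p_value)
print(p_value < 0.05)  # compare with the 5% significance level
```

The $p$-value comes out far below 0.05, which agrees with rejecting the null hypothesis at the 95% level of confidence.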