Comparing data sets

As we move through the latter part of the semester, we'll often be interested in checking for a relationship between variables. Perhaps the simplest such situation is when we compare the sample means of two data sets. Here are a couple of examples along those lines.

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Textbook prices

I've got a CSV file that contains prices from the UNCA bookstore and Amazon for more than 500 textbooks in use at UNCA this semester. Let's take a look:

In [2]:
df = pd.read_csv('https://www.marksmath.org/data/book_prices_Fall2018.csv')
print(df.shape)
df.head()
(518, 6)
Out[2]:
| | title | course | bookstore_new_price | amazon_max_price | bookstore_link | amazon_link |
|---|---|---|---|---|---|---|
| 0 | MyAccountingLab Access Card -- for Financial A... | ACCT 215 | 122.25 | 99.99 | https://www.bkstr.com/webapp/wcs/stores/servle... | https://www.amazon.com/gp/search/ref=sr_adv_b/... |
| 1 | Financial Accounting (Loose Pgs)(w/MyAccountin... | ACCT 215 | 263.50 | 214.95 | https://www.bkstr.com/webapp/wcs/stores/servle... | https://www.amazon.com/gp/search/ref=sr_adv_b/... |
| 2 | Managerial Accounting (Loose Pgs)(w/MyAcctLab ... | ACCT 216 | 263.25 | 234.33 | https://www.bkstr.com/webapp/wcs/stores/servle... | https://www.amazon.com/gp/search/ref=sr_adv_b/... |
| 3 | Intermediate Accounting (WileyPlus Standalone ... | ACCT 301 | 171.25 | 88.99 | https://www.bkstr.com/webapp/wcs/stores/servle... | https://www.amazon.com/gp/search/ref=sr_adv_b/... |
| 4 | Intermediate Accounting (Loose Pgs)(w/ Wiley P... | ACCT 301 | 297.00 | 183.31 | https://www.bkstr.com/webapp/wcs/stores/servle... | https://www.amazon.com/gp/search/ref=sr_adv_b/... |

We could grab a sample and format the links to be clickable.

In [3]:
sample = df.sample(5, random_state=1016)
def clickable(link):
    return '<a target="_blank" href="{}">link</a>'.format(link)
sample.style.format({'bookstore_link': clickable, 'amazon_link': clickable})
Out[3]:
| | title | course | bookstore_new_price | amazon_max_price | bookstore_link | amazon_link |
|---|---|---|---|---|---|---|
| 511 | Probability & Statistics for Engineering & the Sciences | STAT 225 | 204 | 178.74 | link | link |
| 69 | Campbell Essential Biology | BIOL 125 | 199.75 | 126.7 | link | link |
| 408 | History of Graphic Design (w/Bind-in Access Code) | NM 344 | 95 | 64.18 | link | link |
| 389 | Natural Connections | MLAS 560 | 54.5 | 31.99 | link | link |
| 193 | Why Won't You Just Tell Us the Answer? | EDUC 437 | 32.25 | 26.79 | link | link |

Comparing prices between the sellers

If we want to compare the prices between the two vendors we can simply compute the pairwise difference.

In [4]:
differences = []
for i,row in df.iterrows():
    amazon_max_price = float(row['amazon_max_price'])
    bookstore_new_price = float(row['bookstore_new_price'])
    differences.append(bookstore_new_price - amazon_max_price)
m = np.mean(differences)
m
Out[4]:
18.298764478764475

That computation says that, on average, a textbook costs about $\$18.30$ more at the UNCA bookstore than on Amazon. Of course, the complete picture is more complicated than that.
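
As an aside, the row-by-row loop can also be written as a single column subtraction. The following is a minimal sketch, not part of the original notebook: `toy` is a hypothetical five-row stand-in for `df`, with prices copied from the `df.head()` output above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: the five rows shown by df.head() above
toy = pd.DataFrame({
    'bookstore_new_price': [122.25, 263.50, 263.25, 171.25, 297.00],
    'amazon_max_price': [99.99, 214.95, 234.33, 88.99, 183.31],
})

# Row-by-row loop, as in the cell above
loop_differences = []
for i, row in toy.iterrows():
    bookstore = float(row['bookstore_new_price'])
    amazon = float(row['amazon_max_price'])
    loop_differences.append(bookstore - amazon)

# Vectorized version: subtract the two columns directly
vec_differences = toy['bookstore_new_price'] - toy['amazon_max_price']

# Both approaches should produce the same paired differences
print(np.allclose(loop_differences, vec_differences))
print(vec_differences.mean())
```

On the full data set the same subtraction would be applied to `df` instead of `toy`; the result is a pandas Series rather than a list, which `np.mean` and `plt.hist` both accept.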

In [5]:
plt.hist(differences, bins = range(-50,230,20), edgecolor='black')
plt.plot([m,m],[0,290], 'k--');

Now, let's explore the question: are Amazon prices genuinely cheaper than the UNCA bookstore prices (as is commonly believed)? To do so, let's first clearly state our null and alternative hypotheses. Let $\mu_U$ denote the average price of the books at the UNCA bookstore and let $\mu_A$ denote the average price of the books at Amazon.

  • $H_0: \mu_A = \mu_U$
  • $H_A: \mu_A < \mu_U$.

Since the data are paired, we can rephrase this by taking the pairwise difference of the data sets to get a single data set. If, for each row, we take the UNCA price minus the Amazon price (as in the code above), we get a single data set with mean $\mu$, and we can rephrase our hypotheses as

  • $H_0: \mu = 0$
  • $H_A: \mu > 0$.

Note that we've already computed an estimate of $\mu$ to be $\overline{x}=18.30$, denoted by m in our code. To examine the hypothesis, we compute a standard error and test statistic from our set of differences.

In [6]:
se = np.std(differences)/np.sqrt(len(differences))
se
Out[6]:
1.6692524231929347
In [7]:
T = (m-0)/se
T
Out[7]:
10.9622512596164

Given this huge test statistic, the $p$-value will be ridiculously small.
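
To put a rough number on "ridiculously small", here is a minimal sketch that turns the reported test statistic into a one-sided $p$-value. It assumes a standard normal approximation, which seems reasonable with 518 paired differences but is my assumption rather than something computed in the notebook. The values of `T` and `n` are copied from the outputs above.

```python
import math

# Values reported in the cells above
T = 10.9622512596164   # test statistic from Out[7]
n = 518                # number of paired differences (rows in df)

# One-sided p-value under a standard normal approximation:
# P(Z > T) = 0.5 * erfc(T / sqrt(2))
p_value = 0.5 * math.erfc(T / math.sqrt(2))
print(p_value)

# np.std divides by n by default (ddof=0); the sample standard
# deviation (ddof=1) is larger by a factor of sqrt(n / (n - 1))
ddof_factor = math.sqrt(n / (n - 1))
print(ddof_factor)
```

The second printed value shows how little the choice between `ddof=0` and `ddof=1` matters here: with $n=518$ the standard error changes by only about 0.1%, so the conclusion is unaffected.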

Exploring a relationship between the variables

There oughtta be a relationship between the prices. We often explore this type of question qualitatively with a scatter plot.

In [8]:
plt.plot(differences, 'k.')

ax = plt.gca()
ax.set_ylim([-40,40])
Out[8]:
(-40, 40)

We'll explore this type of picture quantitatively when we talk about covariance, correlation and regression.
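
For comparison, a scatter plot that puts one price on each axis might look like the following. This is a minimal sketch, not the notebook's own figure: the five price pairs are hypothetical stand-in data copied from the `df.head()` output, and the dashed reference line is my addition.

```python
import matplotlib.pyplot as plt

# Hypothetical stand-in data: the five price pairs shown by df.head()
bookstore = [122.25, 263.50, 263.25, 171.25, 297.00]
amazon = [99.99, 214.95, 234.33, 88.99, 183.31]

fig, ax = plt.subplots()
ax.plot(bookstore, amazon, 'k.')       # one point per book
top = max(bookstore + amazon)
ax.plot([0, top], [0, top], 'k--')     # reference line where the prices are equal
ax.set_xlabel('bookstore_new_price')
ax.set_ylabel('amazon_max_price')
```

Points below the dashed line are books that cost less on Amazon than at the bookstore. With the full data, the two lists would be replaced by `df.bookstore_new_price` and `df.amazon_max_price`.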

Comparing prices between disciplines

What do you suppose is more expensive - the average textbook in the sciences or the average textbook in the humanities? Let's grab a subset of our data to find out.

In [9]:
sci_prefixes = ['ATMS', 'BIOL', 'CHEM', 'CSCI', 'MATH', 'PHYS', 'STAT']
sci_df = df[[c[:4] in sci_prefixes for c in df.course]]
sci_prices = sci_df.bookstore_new_price
m_sci = np.mean(sci_prices)

hum_prefixes = ['CLAS', 'DRAM', 'HUM ', 'LIT ', 'PHIL', 'RELS']
hum_df = df[[c[:4] in hum_prefixes for c in df.course]]
hum_prices = hum_df.bookstore_new_price
m_hum = np.mean(hum_prices)

[m_sci, len(sci_prices), m_hum, len(hum_prices)]
Out[9]:
[154.0701234567901, 81, 36.0403875968992, 129]

Well, the average price of a science textbook certainly looks a lot higher than the average price of a humanities textbook, but how do we compare the two? In particular, what is the correct choice of a standard error?

To clarify, suppose that

  • $D_1$ and $D_2$ are two sets of sample observations,
  • with $n_1$ and $n_2$ elements respectively.
  • The mean of $D_1$ is $\bar{x}_1$ and the mean of $D_2$ is $\bar{x}_2$
  • The standard deviation of $D_1$ is $\sigma_1$ and the standard deviation of $D_2$ is $\sigma_2$

Then, we analyze the difference of the two means using a hypothesis test with

  • Mean $\overline{x}_1 - \overline{x}_2$,
  • Standard error $$\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}},$$
  • and, if the smaller of $n_1-1$ and $n_2-1$ is less than 30, we use a $t$-distribution with that minimum as its degrees of freedom.

In this setup, the expression $$\frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$ is often called the test statistic.
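
The recipe above can be sketched as a small helper function. The name `two_sample_statistic` and the two made-up samples are hypothetical, chosen only for illustration. The sketch uses `np.std` with its default `ddof=0`, to match the other cells in this notebook.

```python
import numpy as np

def two_sample_statistic(d1, d2):
    """Return the test statistic and min(n1 - 1, n2 - 1) for two samples."""
    d1 = np.asarray(d1, dtype=float)
    d2 = np.asarray(d2, dtype=float)
    n1, n2 = len(d1), len(d2)
    # Standard error: sqrt(sigma_1^2 / n_1 + sigma_2^2 / n_2)
    # np.std uses ddof=0 by default, as elsewhere in this notebook
    se = np.sqrt(np.std(d1)**2 / n1 + np.std(d2)**2 / n2)
    statistic = (np.mean(d1) - np.mean(d2)) / se
    dof = min(n1 - 1, n2 - 1)
    return statistic, dof

# Made-up samples, purely for illustration
statistic, dof = two_sample_statistic([2, 4, 6], [1, 2, 3, 4])
print(statistic, dof)
```

Returning the degrees of freedom alongside the statistic keeps both pieces of the recipe together, which is convenient when one of the samples is small.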

For our problem, this boils down to:

In [10]:
se_sci = np.std(sci_prices)/np.sqrt(len(sci_prices))
se_hum = np.std(hum_prices)/np.sqrt(len(hum_prices))
(m_sci - m_hum)/np.sqrt(se_sci**2 + se_hum**2)
Out[10]:
10.877145243602286

Again, a very large test statistic!