Comparing data sets¶

As we move through the latter part of the semester, we'll often be interested in checking for a relationship between variables. Perhaps the simplest such situation is when we compare the sample means of two data sets. Here are a couple of examples along those lines.

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Textbook prices¶

I've got a CSV file that contains prices from the UNCA bookstore and Amazon for more than 500 textbooks in use at UNCA this semester. Let's take a look:

df = pd.read_csv('https://www.marksmath.org/data/book_prices_Fall2018.csv')
print(df.shape)
df.head()

(518, 6)

We could grab a sample and format the links to be clickable.

sample = df.sample(5, random_state=1016)
def clickable(link):
    return '<a target="_blank" href="{}">link</a>'.format(link)
sample.style.format({'bookstore_link': clickable, 'amazon_link': clickable})

Comparing prices between the sellers¶

If we want to compare the prices between the two vendors we can simply compute the pairwise difference.

differences = []
for i,row in df.iterrows():
    amazon_max_price = float(row['amazon_max_price'])
    bookstore_new_price = float(row['bookstore_new_price'])
    differences.append(bookstore_new_price - amazon_max_price)
m = np.mean(differences)
m

18.298764478764475

That computation suggests that, on average, a textbook costs about $\$18.30$ more at the UNCA bookstore than on Amazon. Of course, the complete picture is more complicated than that.

plt.hist(differences, bins = range(-50,230,20), edgecolor='black')
plt.plot([m,m],[0,290], 'k--');
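As an aside, the row-by-row loop used above to build the differences can usually be replaced by a single column subtraction. Here is a minimal sketch of that idea; the small frame `head` is an assumption made for illustration, holding only the five price pairs shown in the `df.head()` output, and the names `head_differences` and `head_mean` are not part of the original notebook.

```python
import pandas as pd

# First five rows of the two price columns, copied from the df.head() output.
head = pd.DataFrame({
    'bookstore_new_price': [122.25, 263.50, 263.25, 171.25, 297.00],
    'amazon_max_price': [99.99, 214.95, 234.33, 88.99, 183.31],
})

# Column arithmetic is elementwise, so one subtraction yields every
# bookstore-minus-Amazon difference at once.
head_differences = head['bookstore_new_price'] - head['amazon_max_price']
head_mean = head_differences.mean()
```

Applied to the full `df`, the same expression should reproduce the list built by the loop, provided both columns are already numeric.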

Now, let's explore the question: are Amazon prices genuinely cheaper than the UNCA bookstore prices (as is commonly believed)? To do so, let's first clearly state our null and alternative hypotheses. Let $\mu_U$ denote the average price of the books at the UNCA bookstore and let $\mu_A$ denote the average price of the books at Amazon.

$H_0: \mu_A = \mu_U$
$H_A: \mu_A < \mu_U$.

Since the data are paired, we can rephrase this by taking the pairwise difference of the data sets to get a single data set. If, for each row, we take the UNCA price minus the Amazon price (as the code above does) to get a single data set with mean $\mu$, then we can rephrase our hypotheses as

$H_0: \mu = 0$
$H_A: \mu > 0$.

Note that we've already computed an estimate of $\mu$ to be $\overline{x}\approx 18.3$, denoted by m in our code. To examine the hypothesis, we compute a standard error and test statistic from our set of differences.

se = np.std(differences)/np.sqrt(len(differences))
se

1.6692524231929347
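One detail worth knowing about the cell above: `np.std` divides by $n$ by default (`ddof=0`), while the usual sample standard deviation divides by $n-1$ (`ddof=1`). The sketch below compares the two on a small illustrative array; it assumes only the five bookstore-minus-Amazon differences from the rows shown by `df.head()`, and the variable names are made up for this example.

```python
import numpy as np

# Bookstore-minus-Amazon differences for the five rows shown by df.head().
bookstore = np.array([122.25, 263.50, 263.25, 171.25, 297.00])
amazon = np.array([99.99, 214.95, 234.33, 88.99, 183.31])
head_differences = bookstore - amazon
n = len(head_differences)

# ddof=0 (NumPy's default) divides by n; ddof=1 divides by n - 1.
se_population = np.std(head_differences) / np.sqrt(n)
se_sample = np.std(head_differences, ddof=1) / np.sqrt(n)

# The two versions differ by a factor of sqrt(n / (n - 1)).
ratio = se_sample / se_population
```

With the full set of 518 differences that factor is $\sqrt{518/517}$, roughly 1.001, so the choice barely moves the standard error here.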

T = (m-0)/se
T

10.9622512596164

Given this huge test statistic, the $p$-value will be ridiculously small.
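To put a rough number on "small", here is a sketch that approximates the $p$-value with the upper tail of the standard normal distribution. Two assumptions are baked in: with 518 differences, the $t$ distribution with 517 degrees of freedom is taken to be close enough to normal, and the upper tail is the relevant one because the differences in the code are bookstore minus Amazon, so evidence that Amazon is cheaper shows up as a large positive statistic.

```python
import math

T = 10.9622512596164   # test statistic computed above

# Upper-tail area of the standard normal distribution beyond T,
# used here as an approximation to the t distribution with 517
# degrees of freedom.
p_value = 0.5 * math.erfc(T / math.sqrt(2))
```

The result is many orders of magnitude below any conventional significance level, so the null hypothesis is rejected comfortably.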

Exploring a relationship between the variables¶

There oughtta be a relationship between the prices. We often explore this type of question qualitatively with a scatter plot.

plt.plot(differences, 'k.')

ax = plt.gca()
ax.set_ylim([-40,40])

(-40, 40)

We'll explore this type of picture quantitatively when we talk about covariance, correlation and regression.
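One way to look at the relationship between the two prices directly is to plot each book's Amazon price against its bookstore price. The sketch below is illustrative only: it assumes just the five price pairs shown by `df.head()`, and the dashed reference line from 0 to 300 is a choice made for this example.

```python
import matplotlib.pyplot as plt

# Prices for the five rows shown by df.head().
bookstore = [122.25, 263.50, 263.25, 171.25, 297.00]
amazon = [99.99, 214.95, 234.33, 88.99, 183.31]

fig, ax = plt.subplots()
ax.plot(bookstore, amazon, 'k.')       # one point per book
ax.plot([0, 300], [0, 300], 'k--')     # reference line: equal prices
ax.set_xlabel('bookstore_new_price')
ax.set_ylabel('amazon_max_price')
```

Points below the dashed line correspond to books that are cheaper on Amazon. Passing the full columns of `df` instead of the two short lists should give the picture for all the books.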

Comparing prices between disciplines¶

What do you suppose is more expensive - the average textbook in the sciences or the average textbook in the humanities? Let's grab a subset of our data to find out.

sci_prefixes = ['ATMS', 'BIOL', 'CHEM', 'CSCI', 'MATH', 'PHYS', 'STAT']
sci_df = df[[c[:4] in sci_prefixes for c in df.course]]
sci_prices = sci_df.bookstore_new_price
m_sci = np.mean(sci_prices)

hum_prefixes = ['CLAS', 'DRAM', 'HUM ', 'LIT ', 'PHIL', 'RELS']
hum_df = df[[c[:4] in hum_prefixes for c in df.course]]
hum_prices = hum_df.bookstore_new_price
m_hum = np.mean(hum_prices)

[m_sci, len(sci_prices), m_hum, len(hum_prices)]

[154.0701234567901, 81, 36.0403875968992, 129]

Well, the average price of a science textbook indeed looks a lot higher than the average price of a humanities textbook, but how do we compare these? In particular, what is the correct choice of a standard error?

To clarify, suppose that

$D_1$ and $D_2$ are two sets of sample observations,
with $n_1$ and $n_2$ elements respectively.
The mean of $D_1$ is $\bar{x}_1$ and the mean of $D_2$ is $\bar{x}_2$
The standard deviation of $D_1$ is $\sigma_1$ and the standard deviation of $D_2$ is $\sigma_2$

Then, we analyze the difference of the two means using a hypothesis test with

Mean $\overline{x}_1 - \overline{x}_2$,
Standard error $$\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}},$$
and, if the smaller of the two is less than 30, we use the minimum of $n_1-1$ and $n_2-1$ as the degrees of freedom.

In this setup, the expression $$\frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$ is often called the test statistic.

For our problem, this boils down to:

se_sci = np.std(sci_prices)/np.sqrt(len(sci_prices))
se_hum = np.std(hum_prices)/np.sqrt(len(hum_prices))
(m_sci - m_hum)/np.sqrt(se_sci**2 + se_hum**2)

10.877145243602286

Again, a very large test statistic!
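The recipe above can be wrapped in a small helper so it is easy to reuse on other pairs of samples. This is a sketch under stated assumptions, not the notebook's own code: the function name `two_sample_t`, its `ddof` parameter, and the tiny input lists are all hypothetical and exist only to exercise the formula.

```python
import numpy as np

def two_sample_t(x1, x2, ddof=1):
    """Return (test statistic, standard error, conservative degrees of freedom)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    # Standard error of the difference of the two sample means.
    se = np.sqrt(x1.var(ddof=ddof) / x1.size + x2.var(ddof=ddof) / x2.size)
    t = (x1.mean() - x2.mean()) / se
    # Conservative degrees of freedom: min(n1 - 1, n2 - 1).
    dof = min(x1.size, x2.size) - 1
    return t, se, dof

# Made-up numbers, purely to exercise the function.
t, se, dof = two_sample_t([10, 12, 14], [1, 2, 3, 4, 5])
```

Calling `two_sample_t(sci_prices, hum_prices, ddof=0)` should mirror the computation above, which relies on `np.std` with its default `ddof=0`; leaving `ddof=1` uses the sample standard deviations instead.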

Output of df.head():

	title	course	bookstore_new_price	amazon_max_price	bookstore_link	amazon_link
0	MyAccountingLab Access Card -- for Financial A...	ACCT 215	122.25	99.99	https://www.bkstr.com/webapp/wcs/stores/servle...	https://www.amazon.com/gp/search/ref=sr_adv_b/...
1	Financial Accounting (Loose Pgs)(w/MyAccountin...	ACCT 215	263.50	214.95	https://www.bkstr.com/webapp/wcs/stores/servle...	https://www.amazon.com/gp/search/ref=sr_adv_b/...
2	Managerial Accounting (Loose Pgs)(w/MyAcctLab ...	ACCT 216	263.25	234.33	https://www.bkstr.com/webapp/wcs/stores/servle...	https://www.amazon.com/gp/search/ref=sr_adv_b/...
3	Intermediate Accounting (WileyPlus Standalone ...	ACCT 301	171.25	88.99	https://www.bkstr.com/webapp/wcs/stores/servle...	https://www.amazon.com/gp/search/ref=sr_adv_b/...
4	Intermediate Accounting (Loose Pgs)(w/ Wiley P...	ACCT 301	297.00	183.31	https://www.bkstr.com/webapp/wcs/stores/servle...	https://www.amazon.com/gp/search/ref=sr_adv_b/...

Output of the styled sample (links rendered as clickable):

	title	course	bookstore_new_price	amazon_max_price	bookstore_link	amazon_link
511	Probability & Statistics for Engineering & the Sciences	STAT 225	204	178.74	link	link
69	Campbell Essential Biology	BIOL 125	199.75	126.7	link	link
408	History of Graphic Design (w/Bind-in Access Code)	NM 344	95	64.18	link	link
389	Natural Connections	MLAS 560	54.5	31.99	link	link
193	Why Won't You Just Tell Us the Answer?	EDUC 437	32.25	26.79	link	link