As we move through the latter part of the semester, we'll often be interested in checking for a relationship between variables. Perhaps the simplest such situation is when we compare the sample means of two data sets. Here are a couple of examples along those lines.
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
I've got a CSV file that contains prices from the UNCA bookstore and Amazon for more than 500 textbooks in use at UNCA this semester. Let's take a look:
df = pd.read_csv('https://www.marksmath.org/data/book_prices_Fall2018.csv')
print(df.shape)
df.head()
We could grab a sample and format the links to be clickable.
sample = df.sample(5, random_state=1016)
def clickable(link):
    # Wrap a URL in an HTML anchor tag that opens in a new tab.
    return '<a target="_blank" href="{}">link</a>'.format(link)
sample.style.format({'bookstore_link': clickable, 'amazon_link': clickable})
If we want to compare the prices between the two vendors we can simply compute the pairwise difference.
differences = []
for i,row in df.iterrows():
    amazon_max_price = float(row['amazon_max_price'])
    bookstore_new_price = float(row['bookstore_new_price'])
    # Positive values mean the bookstore charges more than Amazon.
    differences.append(bookstore_new_price - amazon_max_price)
m = np.mean(differences)
m
I guess that computation means that, on average, a textbook costs $\$18.30$ more at the UNCA bookstore than on Amazon. Of course, the complete picture is more complicated than that.
plt.hist(differences, bins = range(-50,230,20), edgecolor='black')
plt.plot([m,m],[0,290], 'k--');
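The pairwise differences computed in the loop above can also be obtained with column arithmetic. Here is a minimal sketch on a small made-up table; the `toy` data frame and its values are hypothetical, and only the column names are taken from the CSV file. The call to `astype(float)` assumes the columns hold values that convert cleanly to numbers.

```python
import pandas as pd

# Hypothetical stand-in for the real data, using the same column names.
toy = pd.DataFrame({
    'bookstore_new_price': [100.0, 55.5, 20.0, 80.0],
    'amazon_max_price': [90.0, 60.0, 15.0, 50.0],
})

# Subtracting two columns works row by row, with no explicit loop.
toy_differences = (
    toy['bookstore_new_price'].astype(float)
    - toy['amazon_max_price'].astype(float)
)
toy_mean = toy_differences.mean()
print(toy_differences.tolist(), toy_mean)
```

A negative entry in the result corresponds to a book that is cheaper at the bookstore than on Amazon.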
Now, let's explore the question: are Amazon prices genuinely cheaper than the UNCA bookstore prices (as is commonly believed)? To do so, let's first clearly state our null and alternative hypotheses. Let $\mu_U$ denote the average price of the books at the UNCA bookstore and let $\mu_A$ denote the average price of the books at Amazon. Then our hypotheses are
$$H_0: \mu_U = \mu_A \quad \text{versus} \quad H_A: \mu_U > \mu_A.$$
Since the data are paired, we can rephrase this by taking the pairwise difference of the data sets to get a single data set. If, for each row, we take the UNCA price minus the Amazon price (as in the code above), we get a single data set with mean $\mu$, and we can rephrase our hypotheses as
$$H_0: \mu = 0 \quad \text{versus} \quad H_A: \mu > 0.$$
Note that we've already computed an estimate of $\mu$ to be $\overline{x}=18.30$, denoted by `m` in our code. To examine the hypothesis, we compute a standard error and test statistic from our set of differences.
se = np.std(differences)/np.sqrt(len(differences))
se
T = (m-0)/se
T
Given this huge test statistic, the $p$-value will be ridiculously small.
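To see just how small, we can turn a test statistic into a one-sided $p$-value using the normal model. The following is only a sketch under stated assumptions: the helper `upper_tail_p_value` and the simulated `toy_differences` are hypothetical stand-ins, not the textbook data, and the normal approximation is assumed to be reasonable because the sample is large.

```python
import math
import numpy as np

def upper_tail_p_value(z):
    """Area under the standard normal curve to the right of z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Hypothetical stand-in for the price differences (NOT the textbook data).
rng = np.random.default_rng(0)
toy_differences = rng.normal(loc=18, scale=30, size=500)

toy_m = np.mean(toy_differences)
# ddof=1 gives the sample standard deviation; np.std defaults to ddof=0,
# and with several hundred observations the two barely differ.
toy_se = np.std(toy_differences, ddof=1) / np.sqrt(len(toy_differences))
toy_T = (toy_m - 0) / toy_se
toy_p = upper_tail_p_value(toy_T)
print(toy_T, toy_p)
```

Using `math.erfc` rather than one minus a cumulative probability keeps tiny tail areas from rounding to zero.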
There oughtta be a relationship between the prices. We often explore this type of question qualitatively with a scatter plot.
plt.plot(differences, 'k.')
ax = plt.gca()
ax.set_ylim([-40,40])
We'll explore this type of picture quantitatively when we talk about covariance, correlation and regression.
What do you suppose is more expensive - the average textbook in the sciences or the average textbook in the humanities? Let's grab a subset of our data to find out.
sci_prefixes = ['ATMS', 'BIOL', 'CHEM', 'CSCI', 'MATH', 'PHYS', 'STAT']
sci_df = df[[c[:4] in sci_prefixes for c in df.course]]
sci_prices = sci_df.bookstore_new_price
m_sci = np.mean(sci_prices)
hum_prefixes = ['CLAS', 'DRAM', 'HUM ', 'LIT ', 'PHIL', 'RELS']
hum_df = df[[c[:4] in hum_prefixes for c in df.course]]
hum_prices = hum_df.bookstore_new_price
m_hum = np.mean(hum_prices)
[m_sci, len(sci_prices), m_hum, len(hum_prices)]
Well, the average price of a science textbook certainly looks a lot higher than the average price of a humanities textbook, but how do we compare them? In particular, what is the correct choice of a standard error?
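To clarify, suppose that the first sample has size $n_1$, mean $\bar{x}_1$, and standard deviation $\sigma_1$, and that the second sample has size $n_2$, mean $\bar{x}_2$, and standard deviation $\sigma_2$.
Then, we analyze the difference of the two means using a hypothesis test with point estimate $\bar{x}_1 - \bar{x}_2$ and standard error $$\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}.$$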
In this setup, the expression $$\frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$ is often called the test statistic.
For our problem, this boils down to:
se_sci = np.std(sci_prices)/np.sqrt(len(sci_prices))
se_hum = np.std(hum_prices)/np.sqrt(len(hum_prices))
(m_sci - m_hum)/np.sqrt(se_sci**2 + se_hum**2)
Again, a very large test statistic!
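If we expect to repeat this kind of comparison, the computation above can be wrapped in a small helper. This is a sketch only: the function name `two_sample_statistic` and the toy price lists are hypothetical, and the helper uses the sample variance (`ddof=1`) where the cells above rely on NumPy's default.

```python
import numpy as np

def two_sample_statistic(x1, x2):
    """Difference of sample means divided by its standard error."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    # Squared standard error of each sample mean.
    se1_sq = np.var(x1, ddof=1) / len(x1)
    se2_sq = np.var(x2, ddof=1) / len(x2)
    return (x1.mean() - x2.mean()) / np.sqrt(se1_sq + se2_sq)

# Hypothetical prices, chosen so the arithmetic is easy to check by hand.
toy_sci = [10.0, 12.0, 14.0]            # mean 12, sample variance 4
toy_hum = [1.0, 2.0, 3.0, 4.0, 5.0]     # mean 3, sample variance 2.5
toy_T = two_sample_statistic(toy_sci, toy_hum)
print(toy_T)
```

Swapping the two arguments flips the sign of the result, so it pays to be explicit about which group comes first.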