So far, we've looked at a little bit of data and talked about some techniques for collecting it. Today, we're going to focus on numerical data and introduce some formulae and computer code to compute quantitative measures of the data.
This is mostly sections 1.3 and 1.4 out of our textbook.
First let's take a look at a big data set.
The Centers for Disease Control and Prevention publishes lots of data obtained through a number of studies. We're going to play with one particular data set obtained from a study called the Behavioral Risk Factor Surveillance System. This is an ongoing process in which over 400,000 US adults are interviewed every year. The resulting data file has over 2000 variables, ranging from simple descriptors like age and weight, through basic behaviors like activity level and whether the subject smokes, to what kind of medical care the subject receives. I've got a random sample of this data on my website covering just 8 variables for 20,000 individuals. Let's start by loading that data set:
import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')
print(len(df.height))
df.head()
Most of the variables (i.e., the column names) are self-explanatory. My favorite is smoke100, which is a Boolean flag indicating whether or not the individual has smoked at least 100 cigarettes over their lifetime.
Now that we've now got plenty of data (20,000 rows), let's look at a proper histogram of heights!
%matplotlib inline
heights = df['height']
heights.hist(bins = 20, grid=False, edgecolor='black');
The mean is a measure of location - it tells us where the data is centered.
m = heights.mean()
m
import matplotlib.pyplot as plt
heights.hist(bins = 20, grid=False, edgecolor='black');
plt.plot([m,m],[0,5100], 'y--')
The standard deviation tells us how widely spread the data is:
heights.std()
To see the effect of the standard deviation, it helps to compare a couple of distributions that have the same center but different spreads.
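As a quick illustration, here are two simulated samples (not the CDC data) with the same mean but different standard deviations; the parameters below are chosen arbitrarily for the sketch:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)  # fixed seed so the picture is reproducible

# Two samples centered at the same value, one tightly clustered, one spread out
narrow = rng.normal(loc=67, scale=2, size=10_000)
wide = rng.normal(loc=67, scale=6, size=10_000)

plt.hist(wide, bins=40, alpha=0.6, edgecolor='black', label='std = 6')
plt.hist(narrow, bins=40, alpha=0.6, edgecolor='black', label='std = 2')
plt.legend();
```

Both histograms peak at the same place, but the second is much wider; that width is exactly what the standard deviation measures.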
These height histograms display a classic bell shape; the data are approximately normally distributed. It's worth mentioning that other shapes can arise as well.
Let's take a look at the quantitative definitions of the computational concepts that we've been throwing around above.
Suppose we have a list of numerical data; we'll denote it by $$x_1, x_2, x_3, \ldots, x_n.$$ For example, our list might be $$2, 8, 2, 4, 7.$$ The mean of the list is $$\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}.$$ For our concrete example, this is $$\frac{2+8+2+4+7}{5} = \frac{23}{5} = 4.6.$$ The median is the middle value when the list is sorted. For our example, the sorted list is $$2, 2, 4, 7, 8$$ so the median is $4$. If the sorted list has an even number of observations, then the median is the average of the two middle terms. For example, the median of $$1,1,3,4,8,8,8,10$$ is the average of $4$ and $8$, which is $6$.
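We can verify these hand computations with Python's standard statistics module:

```python
import statistics

data = [2, 8, 2, 4, 7]
print(statistics.mean(data))    # (2+8+2+4+7)/5 = 4.6
print(statistics.median(data))  # middle of the sorted list: 4

# With an even number of observations, the median averages the two middle terms
even = [1, 1, 3, 4, 8, 8, 8, 10]
print(statistics.median(even))  # (4+8)/2 = 6.0
```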
Suppose our data is $$1,2,4,5,5,6,7,9,10.$$ Under the most common convention, the $25^{\text{th}}$ percentile is $4$, the $75^{\text{th}}$ percentile is $7$, and the inter-quartile range is their difference, $3$. There are differing conventions on exactly how to interpolate between data points, but the differences diminish as the sample size grows.
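NumPy's percentile function, with its default linear interpolation, reproduces these values for this particular data set:

```python
import numpy as np

data = [1, 2, 4, 5, 5, 6, 7, 9, 10]

# Default interpolation is linear; other conventions can differ on small samples
q25, q75 = np.percentile(data, [25, 75])
print(q25, q75)       # 4.0 7.0
print(q75 - q25)      # inter-quartile range: 3.0
```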
For a sample $x_1, \ldots, x_n$ with mean $\bar{x}$, the sample variance is $$s^2 = \frac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n-1},$$ and the standard deviation $s$ is its square root. Suppose our sample is $$1,2,3,4.$$ Then, the mean is $2.5$ and the variance is $$s^2=\frac{(-3/2)^2 + (-1/2)^2 + (1/2)^2 + (3/2)^2}{3} = \frac{5}{3}.$$ The standard deviation is $$s = \sqrt{5/3} \approx 1.290994.$$
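NumPy computes the same quantities; note the ddof=1 argument, which selects the $n-1$ denominator used for the sample variance (the default, ddof=0, divides by $n$ instead):

```python
import numpy as np

sample = np.array([1, 2, 3, 4])

# ddof=1 divides the sum of squared deviations by n - 1
print(sample.var(ddof=1))  # 5/3, about 1.6667
print(sample.std(ddof=1))  # sqrt(5/3), about 1.2910
```

This matches the hand computation above; pandas methods like heights.std() use ddof=1 by default, so they agree with this convention.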