Last time we took a look at data with a focus on categorical data. Today, we’re going to focus on numerical data and introduce some formulae and computer code to compute quantitative measures of the data. First let’s take a look at a big data set.
The Centers for Disease Control and Prevention (CDC) publishes lots of data obtained through a number of studies. We’re going to play with one particular data set obtained from a study called the Behavioral Risk Factor Surveillance System. This is an ongoing process where over 400,000 US adults are interviewed every year. The resulting data file has over 2000 variables ranging from simple descriptors like age and weight, through basic behaviors like activity level and whether the subject smokes, to what kind of medical care the subject receives. I’ve got a random sample of this data on my website for just 9 variables for 20,000 individuals.
Aside: This might be a good time to make sure we talk about
At any rate, let’s use R to take a look at the dataset:
library(knitr)
df = read.csv('https://www.marksmath.org/data/cdc.csv')
kable(head(df))
X | genhlth | exerany | hlthplan | smoke100 | height | weight | wtdesire | age | gender |
---|---|---|---|---|---|---|---|---|---|
1 | good | 0 | 1 | 0 | 70 | 175 | 175 | 77 | m |
2 | good | 0 | 1 | 1 | 64 | 125 | 115 | 33 | f |
3 | good | 1 | 1 | 1 | 60 | 105 | 105 | 49 | f |
4 | good | 1 | 1 | 0 | 66 | 132 | 124 | 42 | f |
5 | very good | 0 | 1 | 0 | 61 | 150 | 130 | 55 | f |
6 | very good | 1 | 1 | 0 | 64 | 114 | 114 | 55 | f |
Most of the variables (i.e., the column names) are self-explanatory. My favorite is smoke100, which is a boolean flag indicating whether or not the individual has smoked 100 cigarettes or more throughout their life.
Now that we’ve got plenty of data, let’s revisit the geometric tools we talked about last time, this time including some quantitative information as well.
Let’s grab the heights of just the men in the sample, count them, and generate a box plot:
heights = subset(df, gender == 'm')$height
length(heights)
## [1] 9569
boxplot(heights, horizontal = TRUE)
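Note that a box plot is closely related to a nifty summary of the data called the five point summary: the minimum, first quartile, median, third quartile, and maximum. We can get that in R as follows; note that R’s summary command also reports the mean: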
summary(heights)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 49.00 68.00 70.00 70.25 72.00 93.00
Note that the box plot is a qualitative visualization while the five point summary is a numerical description; they’re closely related ways of understanding the spread of the data.
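To see the connection concretely, here’s a minimal sketch on a small toy list (made up for illustration, not the CDC heights). It assumes base R’s fivenum and boxplot.stats commands; the latter returns the numbers that boxplot uses to draw the whiskers, the box, and the median line. For this list no points are far enough out to count as outliers, so the two agree.

```r
# A small toy list, chosen only for illustration
x = c(1, 2, 4, 5, 5, 6, 7, 9, 10)

# The five point summary: min, lower hinge, median, upper hinge, max
five = fivenum(x)
five

# The five numbers a box plot is drawn from:
# lower whisker, bottom of box, median line, top of box, upper whisker
box_numbers = boxplot.stats(x)$stats
box_numbers
```

Both commands return the same five numbers here. With more extreme data, the whiskers stop short of the minimum or maximum and the leftover points are drawn individually.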
The median and the quartiles are special examples of quantiles, also called percentiles. We can compute these with the quantile command. Here are the \(25^{\text{th}}\) and \(90^{\text{th}}\) percentiles of our height data.
quantile(heights,c(0.25,0.9))
## 25% 90%
## 68 74
Here’s a histogram of the heights:
hist(heights, breaks = 20, col='gray')
While the box plot is related to the median, quartiles and quantiles, the histogram is related to the mean and standard deviation. We’ll define those carefully in a bit, but here’s how to compute them in R.
First, the mean:
mean(heights)
## [1] 70.25165
Take a look at where this value lies on the \(x\)-axis of the histogram. It should look like it’s at about the balancing point.
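One way to make the "balancing point" idea precise: the distances of the data from the mean cancel out, so the deviations sum to zero. Here’s a quick sketch on a short toy list (chosen for illustration; the same check should work on the heights).

```r
# A short toy list
x = c(2, 8, 2, 4, 7)

# Signed distance of each value from the mean
deviations = x - mean(x)
deviations

# The positive and negative deviations balance,
# so the total is zero (up to floating point rounding)
total = sum(deviations)
total
```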
Here’s a computation of the standard deviation:
sd(heights)
## [1] 3.009219
This is a single number that gives a measure as to how spread out the data is. It’s hard to get a grip on unless you look at multiple examples. Here are a couple of histograms that compare a standard deviation of 2 with a standard deviation of 1/2.
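A comparison like this can be sketched with simulated data. The code below is only an illustration: it assumes bell-shaped samples generated with rnorm, with a sample size, seed, and mean of 0 that are arbitrary choices of mine, and it has nothing to do with the CDC data.

```r
set.seed(1)  # arbitrary seed so the simulation is repeatable

# Two simulated samples with the same mean but different spreads
wide   = rnorm(10000, mean = 0, sd = 2)
narrow = rnorm(10000, mean = 0, sd = 0.5)

# The sample standard deviations should land near 2 and 1/2
sd(wide)
sd(narrow)

# Draw both histograms on the same horizontal scale for a fair comparison
par(mfrow = c(2, 1))
hist(wide,   breaks = 40, xlim = c(-8, 8), col = 'gray', main = 'sd = 2')
hist(narrow, breaks = 40, xlim = c(-8, 8), col = 'gray', main = 'sd = 1/2')
```

On a shared scale, the sample with the larger standard deviation is visibly more spread out, while the other is tightly bunched around its mean.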
As we mentioned before, our height histogram displays a classic bell shape; it’s approximately normally distributed. It’s worth mentioning that there are other types of shapes that can arise.
Let’s take a look at the quantitative definitions of the computational concepts that we’ve been throwing around above.
Suppose we have a list of numerical data; we’ll denote it by \[x_1, x_2, x_3, \ldots, x_n.\] For example, our list might be \[2, 8, 2, 4, 7.\] The mean of the list is \[\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}.\] For our concrete example, this is \[\frac{2+8+2+4+7}{5} = \frac{23}{5} = 4.6.\] The median is the middle value when the list is sorted. For our example, the sorted list is \[2, 2, 4, 7, 8\] so the median is \(4\). If the sorted list has an even number of observations, then the median is the average of the two middle terms. For example, the median of \[1,1,3,4,8,8,8,10\] is the average of \(4\) and \(8\) which is \(6\).
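Here’s a short sketch that carries out these two definitions by hand in R and compares the results with the built-in mean and median commands. The variable names are just ones I picked for illustration.

```r
# The odd-length example
x = c(2, 8, 2, 4, 7)

# Mean: add everything up and divide by the number of terms
mean_by_hand = sum(x) / length(x)
mean_by_hand    # 4.6, the same as mean(x)

# Median: sort, then take the middle (third of five) term
median_odd = sort(x)[3]
median_odd      # 4, the same as median(x)

# The even-length example: average the two middle (fourth and fifth) terms
y = c(1, 1, 3, 4, 8, 8, 8, 10)
median_even = mean(sort(y)[c(4, 5)])
median_even     # 6, the same as median(y)
```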
Suppose our data is \[1,2,4,5,5,6,7,9,10.\] The \(25^{\text{th}}\) percentile is 4, the \(75^{\text{th}}\) percentile is 7 and the inter-quartile range is 3.
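We can check these values with the quantile command from before, together with R’s IQR command. One caveat: there are several conventions for computing percentiles, and this sketch relies on R’s default rule. For this list the answers land exactly on data values, but for other lists R may interpolate between neighboring values.

```r
x = c(1, 2, 4, 5, 5, 6, 7, 9, 10)

# 25th and 75th percentiles using R's default rule
q = quantile(x, c(0.25, 0.75))
q               # 4 and 7

# Inter-quartile range: 75th percentile minus 25th percentile
iqr_by_hand = unname(q[2] - q[1])
iqr_by_hand     # 3, the same as IQR(x)
```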
Suppose our sample is \[1,2,3,4.\] Then, the mean is \(2.5\) and the variance, which divides the sum of the squared deviations from the mean by \(n-1=3\), is \[s^2=\frac{(-3/2)^2 + (-1/2)^2 + (1/2)^2 + (3/2)^2}{3} = \frac{5}{3}.\] The standard deviation is \[s = \sqrt{5/3} \approx 1.290994.\]
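The same computation can be sketched step by step in R and compared against the built-in var and sd commands, which also divide by \(n-1\). The intermediate variable names are just for illustration.

```r
x = c(1, 2, 3, 4)
n = length(x)

# Deviations from the mean: -1.5, -0.5, 0.5, 1.5
deviations = x - mean(x)

# Sample variance: sum of squared deviations divided by n - 1
s2 = sum(deviations^2) / (n - 1)
s2    # 5/3, the same as var(x)

# Standard deviation: square root of the variance
s = sqrt(s2)
s     # about 1.290994, the same as sd(x)
```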