Last week, we learned some data basics with a focus on data tables and the types of data (numeric and categorical) they contain. We also discussed generating data with studies and experiments. Today, we'll take a closer look at numeric data - looking not just at the pictures but also digging a bit deeper into the quantitative parameters that describe the data.
This is all based on section 1.2 of our text.
Let's start with a specific, real-world data set from the Centers for Disease Control and Prevention (CDC), which publishes loads of data: the Behavioral Risk Factor Surveillance System.
This is an ongoing survey in which over 400,000 US adults are interviewed every year. The resulting data file has over 2000 variables, ranging from simple descriptors like age and weight, through basic behaviors like activity level and whether the subject smokes, to what kind of medical care the subject receives.
I've got a subset of this data on my website listing just 8 variables for a random sample of 20000 individuals: https://www.marksmath.org/data/cdc.csv
Here's the CDC sample rendered as a data table:
Most of the variables (i.e., the column names) are self-explanatory. My favorite is smoke100, which is a boolean flag indicating whether or not the individual has smoked 100 cigarettes or more throughout their life. You should be able to classify the rest as numerical or categorical.
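To make the numerical versus categorical distinction concrete, here is a minimal sketch that classifies the columns of a tiny table. The sample rows and the column name `genhlth` are made up for illustration, and the rule "a column of only 0s and 1s is a flag" is an assumption of this sketch, not a property of the real CDC file.

```python
import csv
import io

# Hypothetical miniature stand-in for cdc.csv. The values are invented and the
# column names (other than smoke100) are assumptions made for illustration.
SAMPLE_CSV = """height,weight,age,smoke100,genhlth
70,175,77,0,good
64,125,33,1,good
60,105,49,1,fair
66,132,42,0,excellent
"""

def classify_columns(csv_text):
    """Label each column 'numerical' or 'categorical' by inspecting its values."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    kinds = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        try:
            numbers = [float(v) for v in values]
        except ValueError:
            kinds[name] = "categorical"  # non-numeric text such as "good"
            continue
        # A column holding only 0s and 1s is treated as a boolean flag,
        # which is categorical even though it is stored as numbers.
        is_flag = set(numbers) <= {0.0, 1.0}
        kinds[name] = "categorical" if is_flag else "numerical"
    return kinds

print(classify_columns(SAMPLE_CSV))
```

Note that smoke100 comes out categorical: it is stored as 0 or 1, but averaging it like a height would not describe a typical person.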
A box plot is a picture of the data tied to the so-called five-point summary, which we'll go over in more detail in a bit.
A histogram is a picture of the data tied to the mean and standard deviation.
You can use the slider below to see how the graph changes when you change either the mean or the standard deviation. It's particularly hard to see the effect of standard deviation in a single image.
It's worth mentioning that there are other types of distributions that can arise.
Here's an example of a bimodal histogram.
And here's a skewed histogram. Specifically, it's skewed left: the longer tail stretches out to the left, which pulls the mean below the median.
Sometimes, we need to visualize the relationship between two variables. One great way to do that is with a scatter plot. For example, here's the relationship between height and weight in the CDC data.
At this point we've met several parameters that describe numerical data, including the mean, the median, percentiles, and the standard deviation.
Let's take a look at how these quantities are actually defined.
Before we go through these, it's worth pointing out that the mean and standard deviation are the most important to understand thoroughly.
It's worth understanding percentiles from a conceptual standpoint, but we will rarely compute them directly. We will compute mean and standard deviation.
The mean is a measure of where the data is centered. It is computed by simply averaging the numbers.
For example, our data might be: $$2,8,2,4,7.$$ The mean of the data is then: $$\frac{2+8+2+4+7}{5} = \frac{23}{5} = 4.6.$$
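The same computation can be sketched in a couple of lines of Python. The variable names are just illustrative; `statistics.mean` from the standard library is shown as a cross-check.

```python
import statistics

data = [2, 8, 2, 4, 7]

# The mean is the sum of the values divided by how many there are.
mean = sum(data) / len(data)

print(mean)                   # 4.6
print(statistics.mean(data))  # same value from the standard library
```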
Like the mean, the median is a measure of where the data is centered.
Roughly speaking, it represents the middle value. The way it is computed depends on how many numbers are in your sorted list.
If the number of terms in your data is odd, then the median is simply the middle entry.
For example, if the data is $$1,3,4,8,9,$$ then the median is $4$.
If the number of terms in your data is even, then the median is simply the average of the middle two entries.
For example, if the data is $$1,3,8,9,$$ then the median is $(3+8)/2 = 5.5$.
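The odd and even cases above can be sketched as a short function. This is a minimal illustration, not the only way to do it. The function sorts first, since the rule only makes sense for ordered data.

```python
def median(data):
    """Middle entry for an odd count, average of the middle two for an even count."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd: the single middle entry
    return (ordered[mid - 1] + ordered[mid]) / 2  # even: average the middle two

print(median([1, 3, 4, 8, 9]))  # 4
print(median([1, 3, 8, 9]))     # 5.5
```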
Suppose our data is $$4, 5, 9, 7, 6, 10, 2, 1, 5.$$ To find percentiles, it helps to sort the data: $$1,2,4,5,5,6,7,9,10.$$
There are differing conventions for how to interpolate when the desired percentile falls between two entries of the list, but these differences diminish as the sample size grows.
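As a rough illustration of how conventions can differ, the sketch below computes a five-point summary (minimum, first quartile, median, third quartile, maximum) of the data above using the two interpolation methods offered by Python's `statistics.quantiles`. The helper name `five_point_summary` is made up for this example, and other software may use yet other conventions.

```python
import statistics

data = [4, 5, 9, 7, 6, 10, 2, 1, 5]

def five_point_summary(values, method="exclusive"):
    """Return (min, Q1, median, Q3, max) using the chosen quartile convention."""
    # n=4 asks for the three cut points that split the data into quarters,
    # i.e. the 25th, 50th, and 75th percentiles.
    q1, q2, q3 = statistics.quantiles(values, n=4, method=method)
    return (min(values), q1, q2, q3, max(values))

print(five_point_summary(data, "exclusive"))  # (1, 3.0, 5.0, 8.0, 10)
print(five_point_summary(data, "inclusive"))  # (1, 4.0, 5.0, 7.0, 10)
```

With only nine values, the two conventions disagree on the quartiles but agree on the median.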
Suppose our sample is $$1,2,3,4.$$ Then, the mean is $2.5$ and the variance is $$s^2=\frac{(-3/2)^2 + (-1/2)^2 + (1/2)^2 + (3/2)^2}{3} = \frac{5}{3}.$$ The standard deviation is $$s = \sqrt{5/3} \approx 1.290994.$$
More often than not, we will be computing sample variance and the corresponding standard deviation.
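The sample computation for $1,2,3,4$ can be checked with a short sketch. The variable names are illustrative. The key step is dividing the sum of squared deviations by $n-1$ rather than $n$.

```python
import math
import statistics

sample = [1, 2, 3, 4]
n = len(sample)
mean = sum(sample) / n

# Squared distance of each value from the mean: 9/4, 1/4, 1/4, 9/4.
squared_devs = [(x - mean) ** 2 for x in sample]

# Sample variance divides by n - 1, not n.
sample_variance = sum(squared_devs) / (n - 1)
sample_sd = math.sqrt(sample_variance)

print(round(sample_variance, 6))  # 1.666667
print(round(sample_sd, 6))        # 1.290994

# The standard library uses the same n - 1 convention for these two functions.
print(math.isclose(sample_variance, statistics.variance(sample)))  # True
print(math.isclose(sample_sd, statistics.stdev(sample)))           # True
```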