Last time we took a look at data with a focus on categorical data. Today, we’re going to focus on numerical data and introduce some formulae and computer code to compute quantitative measures of the data.

Our class heights

Let’s start with some simple numerical data that we know - our list of class heights. I’m going to drop the first entry, since that’s my 8-year-old daughter who is just going to throw things off.

heights = read.csv('https://www.marksmath.org/data/class_data_Fall2017.csv')$Height
heights = heights[2:length(heights)]
heights
##  [1] 5.333333 5.500000 5.250000 5.750000 6.083333 5.666667 5.916667
##  [8] 5.750000 5.666667 5.500000 5.416667 6.000000 5.583333 5.583333
## [15] 5.500000 5.833333 6.000000 5.416667 5.916667 5.416667 5.583333
## [22] 5.333333 5.500000 5.166667 5.083333 5.833333 5.583333 5.666667
## [29] 5.583333 5.333333 5.250000 6.333333 5.250000 5.250000 5.333333
## [36] 5.333333 5.250000 5.500000 5.416667 5.666667 5.333333 5.500000
## [43] 5.583333 6.166667 5.333333 6.166667 5.166667 5.416667 5.333333
## [50] 5.583333 5.750000 6.083333 5.666667 5.666667 5.250000 5.166667
## [57] 5.750000 5.416667 5.250000 5.083333

Note that we’ve now got 60 entries. Let’s revisit the geometric tools we talked about last time but we’ll include some quantitative information as well.

Box plots

Here’s a box plot of our heights:

boxplot(heights, horizontal = TRUE)

Note that a box plot is closely related to a nifty summary of the data called the five point summary. We can get that in R as follows:

summary(heights)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.083   5.333   5.500   5.550   5.688   6.333

Do you see a subtle distinction between the two?

Note that the median and the quartiles are special examples of quantiles - also called percentiles. We can compute these with the quantile command. Here are the \(25^{\text{th}}\) and \(90^{\text{th}}\) percentiles of our height data.

quantile(heights,c(0.25,0.9))
##      25%      90% 
## 5.333333 6.000000

Histograms

Here’s a histogram of our heights:

hist(heights, col='gray')

While the box plot is related to the median, quartiles and quantiles, the histogram is related to mean and standard deviation. We’ll define those carefully in a bit but here’s how to compute them in R.

First, the mean:

mean(heights)
## [1] 5.55

Take a look at where this value lies on the \(x\)-axis of the histogram. It should look like it’s at about the balancing point.

Here’s a computation of the standard deviation:

sd(heights)
## [1] 0.2919994

This is a single number that gives a measure of how spread out the data is. It’s hard to get a grip on, unless you look at multiple examples. Here are a couple of histograms that compare a standard deviation of 2 with a standard deviation of 1/2.
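A comparison like this can be generated with simulated data. The following is a minimal sketch, assuming normally distributed samples centered at 0; the sample size, the seed, and the names `wide` and `narrow` are illustrative choices and are not taken from the class data.

```r
# Simulate two samples with the same mean but different spreads.
set.seed(1)                                  # arbitrary seed, for reproducibility
wide   <- rnorm(10000, mean = 0, sd = 2)     # standard deviation 2
narrow <- rnorm(10000, mean = 0, sd = 0.5)   # standard deviation 1/2

# Stack the two histograms on a shared x-axis so the spreads are comparable.
old_par <- par(mfrow = c(2, 1))
hist(wide,   xlim = c(-8, 8), breaks = 40, col = 'gray', main = 'sd = 2')
hist(narrow, xlim = c(-8, 8), breaks = 40, col = 'gray', main = 'sd = 1/2')
par(old_par)                                 # restore the previous plot layout

# The sample standard deviations should land close to 2 and 0.5.
sd(wide)
sd(narrow)
```

With the x-axis held fixed, the sample with the larger standard deviation produces a visibly wider and flatter histogram.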

As we mentioned before, our height histogram displays a classic bell shape; it’s approximately normally distributed. It’s worth mentioning that there are other types of shapes that can arise.

Statistical measures of numerical data

Let’s take a look at the quantitative definitions of the computational concepts that we’ve been throwing around above.

The mean and median

Suppose we have a list of numerical data; we’ll denote it by \[x_1, x_2, x_3, \ldots, x_n.\] For example, our list might be \[2, 8, 2, 4, 7.\] The mean of the list is \[\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}.\] For our concrete example, this is \[\frac{2+8+2+4+7}{5} = \frac{23}{5} = 4.6.\] The median is the middle value when the list is sorted. For our example, the sorted list is \[2, 2, 4, 7, 8\] so the median is \(4\). If the sorted list has an even number of observations, then the median is the average of the two middle terms. For example, the median of \[1,1,3,4,8,8,8,10\] is the average of \(4\) and \(8\) which is \(6\).
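These computations are easy to check in R. Here is a small sketch using the two example lists above; the variable names `x` and `y` are just for illustration.

```r
# The five-element example list.
x <- c(2, 8, 2, 4, 7)

sum(x) / length(x)   # mean straight from the definition: 23/5 = 4.6
mean(x)              # built-in mean, same value
sort(x)              # 2 2 4 7 8
median(x)            # middle value of the sorted list: 4

# A list with an even number of observations.
y <- c(1, 1, 3, 4, 8, 8, 8, 10)
median(y)            # average of the two middle values 4 and 8: 6
```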

Percentiles (also called quantiles)

  • The median is a special case of a percentile - 50% of the population lies below the median and 50% lies above.
  • Similarly, 25% of the population lies below the first quartile and 75% lies above.
  • Also, 75% of the population lies below the third quartile and 25% lies above.
  • The second quartile is just another name for the median.
  • The inter-quartile range is the difference between the third and first quartile.
  • One reasonable definition of an outlier is a data point that lies more than 3 inter-quartile ranges from the median.

Example

Suppose our data is \[1,2,4,5,5,6,7,9,10.\] The \(25^{\text{th}}\) percentile is 4, the \(75^{\text{th}}\) percentile is 7 and the inter-quartile range is 3.
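The same numbers can be reproduced with R's `quantile` and `IQR` functions. This is a sketch that relies on R's default quantile method; other conventions (selectable through the `type` argument of `quantile`) can give slightly different values on small data sets. The outlier check at the end simply applies the definition from the list above, and the names `x`, `q`, and `iqr` are illustrative.

```r
x <- c(1, 2, 4, 5, 5, 6, 7, 9, 10)

# First and third quartiles, using R's default quantile method.
q <- quantile(x, c(0.25, 0.75))
q                      # 25% is 4, 75% is 7

# Inter-quartile range: third quartile minus first quartile.
iqr <- IQR(x)
iqr                    # 3

# Points lying more than 3 inter-quartile ranges from the median.
x[abs(x - median(x)) > 3 * iqr]   # empty: no outliers by this definition
```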

Variance and standard deviation

  • Percentiles form a measure of the spread of a population or sample related to the median of that population or sample.
  • The standard deviation forms a measure of the spread of a population or sample related to the mean of the population or sample.

Definitions

  • Roughly, the standard deviation measures how far the individuals deviate from the mean on average.
  • The variance is defined to be the square of the standard deviation. Thus, if the standard deviation is \(s\), then the variance is \(s^2\).
  • If we have a sample of \(n\) observations \[x_1,x_2,x_3, \ldots, x_n,\] then the variance is defined by \[s^2 = \frac{(x_1 - \bar{x})^2 + (x_2-\bar{x})^2 +\cdots+(x_n-\bar{x})^2}{n-1}.\]
  • If \(s^2\) is the variance, then \(s\) is the standard deviation.

Sample variance vs population variance

  • You might see the definition \[s^2 = \frac{(x_1 - \bar{x})^2 + (x_2-\bar{x})^2 +\cdots+(x_n-\bar{x})^2}{n}.\]
  • The only difference is the \(n\) in the denominator, rather than \(n-1\).
  • We won’t really be able to state the difference until we talk about populations and samples. In that context, our definition (with the \(n-1\) in the denominator) is called the sample variance. More often than not, we will be computing sample variance and the corresponding standard deviation.
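To see how much the choice of denominator matters, here is a short sketch that computes both versions for the list \(2, 8, 2, 4, 7\) used earlier. R's built-in `var` uses the \(n-1\) denominator; the names `ss`, `sample_var`, and `pop_var` are hypothetical helpers introduced only for this illustration.

```r
x <- c(2, 8, 2, 4, 7)
n <- length(x)

# Sum of squared deviations from the mean.
ss <- sum((x - mean(x))^2)    # 31.2

sample_var <- ss / (n - 1)    # n - 1 in the denominator: 7.8
pop_var    <- ss / n          # n in the denominator: 6.24

sample_var
pop_var
var(x)                        # agrees with sample_var, not pop_var
```

The gap between the two shrinks as \(n\) grows, since the denominators \(n\) and \(n-1\) become relatively closer.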

Example

Suppose our sample is \[1,2,3,4.\] Then, the mean is \(2.5\) and the variance is \[s^2=\frac{(-3/2)^2 + (-1/2)^2 + (1/2)^2 + (3/2)^2}{3} = \frac{5}{3}.\] The standard deviation is \[s = \sqrt{5/3} \approx 1.290994.\]
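This example can be verified step by step in R. The sketch below follows the definition directly and then compares against the built-in `var` and `sd`; the intermediate names `devs`, `s2`, and `s` are illustrative.

```r
x <- c(1, 2, 3, 4)

# Deviations of each observation from the mean 2.5.
devs <- x - mean(x)
devs                                  # -1.5 -0.5 0.5 1.5

# Sample variance: sum of squared deviations over n - 1.
s2 <- sum(devs^2) / (length(x) - 1)   # 5/3
s  <- sqrt(s2)                        # square root of 5/3

# The built-in functions should give the same values.
var(x)
sd(x)
```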