Most statistical data is modeled by some distribution. By far the most common such distribution that arises in practice is the normal distribution.

Normal curves

The standard normal

The standard normal distribution is a specific bell-shaped curve; its graph is shown at the top of the page. It has its maximum where it crosses the \(y\)-axis; its mean is zero. It decreases as we move away from the mean and tapers off toward the \(x\)-axis. It’s scaled so that the total area under its graph is one. The area under the curve is distributed in a specific way; its standard deviation is one.

More normals

The standard normal is one of a specific family of bell-shaped curves. We can change the mean by shifting the graph to the left or to the right. We can change the standard deviation by compressing or dilating along the \(x\)-axis while simultaneously doing the opposite along the \(y\)-axis, so that the total area stays equal to one. Several such curves are shown in the figure below.

Note that the mean and standard deviation are denoted \(m\) and \(s\) in the figure above. Often, we denote these with the Greek letters \(\mu\) and \(\sigma\).

A formula

It’s worth mentioning that there is a specific formula that generates the normal curves, namely \[f_{\mu,\sigma}(x) = \frac{1}{\sqrt{2\pi}\sigma} e^{-(x-\mu)^2/(2\sigma^2)}.\] You can explore the graphs of these functions on Desmos.
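As a quick check of the formula, here is a minimal R sketch that codes it directly and compares it with R's built-in dnorm. The function name normal_curve and the particular values of the mean and standard deviation are choices made for this illustration only.

```r
# Normal density written directly from the formula above.
# normal_curve is a name chosen for this sketch, not a built-in function.
normal_curve <- function(x, mu = 0, sigma = 1) {
  1 / (sqrt(2 * pi) * sigma) * exp(-(x - mu)^2 / (2 * sigma^2))
}

x <- seq(-4, 4, by = 0.5)

# The hand-written formula should agree with R's built-in dnorm
agrees <- all.equal(normal_curve(x, mu = 1, sigma = 2), dnorm(x, mean = 1, sd = 2))
agrees

# The total area under the curve should be (numerically) one
total_area <- integrate(normal_curve, -Inf, Inf, mu = 1, sigma = 2)$value
total_area
```

Changing mu and sigma in the last call should leave the total area essentially unchanged, which matches the shifting and rescaling described earlier.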

Modeling histograms with normal curves

Often, if we scale a histogram so that the total area of its rectangles is one, then the normal curve with the same mean and standard deviation will match the histogram fairly closely. An example is shown below.
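Here is a rough sketch of that idea in R. It uses simulated data rather than a real sample, and the names sim, m and s, the seed, and the number of bins are all arbitrary choices for this illustration.

```r
set.seed(1)                            # arbitrary seed, only for reproducibility
sim <- rnorm(1000, mean = 70, sd = 3)  # simulated data, not a real sample

m <- mean(sim)
s <- sd(sim)

# freq = FALSE rescales the bars so that their total area is one
h <- hist(sim, breaks = 30, freq = FALSE, main = "Histogram with normal curve")

# Overlay the normal curve with the same mean and standard deviation as the data
curve(dnorm(x, mean = m, sd = s), add = TRUE, lwd = 2)

# Total area of the rectangles
sum(h$density * diff(h$breaks))
```

With real data the match is usually less tidy than with simulated normal values, so this is best read as the ideal case.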

We’ve got a dynamic illustration on our webpage.

Standard deviation as a ruler

The mean and standard deviation can be used to standardize data. A principal tool for doing so is the \(z\)-score.

The heptathlon

I’ve got a CSV on my website that contains the results of the 2008 Olympic Heptathlon. Here are the first few rows:

library(knitr)  # provides kable() for table display
df = read.csv('https://marksmath.org/data/heptathlon.csv')
kable(head(df,3))
| Name | Country | Score | javelin | eight_hundred | shot_put | two_hundred | long_jump | hundred_hurdles | high_jump |
|---|---|---|---|---|---|---|---|---|---|
| Nataliia Dobrynska | Ukraine | 6733 | 48.60 | 137.72 | 17.29 | 24.39 | 6.63 | 13.44 | 1.80 |
| Hyleas Fountain | United States | 6619 | 41.93 | 135.45 | 13.36 | 23.21 | 6.38 | 12.78 | 1.89 |
| Tatyana Chernova | Russia | 6591 | 48.37 | 126.50 | 12.88 | 23.95 | 6.47 | 13.65 | 1.83 |

Note that Nataliia Dobrynska won the long jump with a jump about half a meter longer than the average; Hyleas Fountain won the 200m with a time about a second and a half faster than the mean. Which of these two results deserves more points?

One way to compare these results involving different units is to use the standard deviation as a ruler. To do so, we standardize. Given a data point \(X\) chosen from data that is normally distributed with mean \(\mu\) and standard deviation \(\sigma\), we compute \[Z = \frac{X-\mu}{\sigma}.\] With this choice of \(Z\), note that \[X = \mu + Z\sigma.\] Thus, \(Z\) measures exactly how many standard deviations \(X\) is from the mean. The resulting value is called the \(z\)-score of \(X\).
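The two formulas above undo one another, which a small R sketch can confirm. The numbers in x are hypothetical values made up for this illustration, and the sample mean and standard deviation stand in for \(\mu\) and \(\sigma\).

```r
# Hypothetical values chosen only for illustration
x <- c(6.1, 6.3, 6.4, 6.6, 6.9)
mu <- mean(x)
sigma <- sd(x)

z <- (x - mu) / sigma      # standardize: Z = (X - mu) / sigma
x_back <- mu + z * sigma   # undo it:     X = mu + Z * sigma

all.equal(x, x_back)       # the round trip recovers the original values
c(mean(z), sd(z))          # standardized values have mean 0 and sd 1, up to rounding
```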

Here’s how we apply this to the 200m vs long jump question:

# Z-score for the 200m
two_hundred_x = min(df$two_hundred)
(two_hundred_x-mean(df$two_hundred))/sd(df$two_hundred)
## [1] -2.03574
# Z-score for the long jump
long_jump_x = max(df$long_jump)
(long_jump_x-mean(df$long_jump))/sd(df$long_jump)
## [1] 2.251191

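In the 200m a lower time is better, so a negative \(z\)-score is the good direction there. Comparing magnitudes, the winning long jump lies about 2.25 standard deviations from the mean, while the winning 200m time lies about 2.04 standard deviations from the mean. It looks like the long jump is the more impressive result.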

SAT Example

The SAT is designed to have a mean score of 500 with a standard deviation of 100. Let’s suppose you score a 650.

  1. What is your \(z\)-score?
  2. What is your percentile?

Solution: For part 1, the \(z\)-score is simply \[Z = \frac{650-500}{100} = 1.5.\] We can answer part 2 by computing an area under the standard normal curve as shown below:

This type of area is typically computed using either software or a table. We will learn how to do this in R soon but, for now, let’s take a look at this table.
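For readers who want to check the table value in software right away, here is a minimal sketch using R's built-in pnorm, which returns the area under a normal curve to the left of a given value. The variable names are chosen just for this illustration.

```r
z <- (650 - 500) / 100

# Area under the standard normal curve to the left of z = 1.5
area_from_z <- pnorm(z)

# The same area, computed without standardizing first
area_direct <- pnorm(650, mean = 500, sd = 100)

c(area_from_z, area_direct)
```

Both values come out to about 0.933, so a score of 650 sits at roughly the 93rd percentile.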

The 68-95-99.7 rule

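There is a statistical rule of thumb called the 68-95-99.7 rule that states that about 68% of values lie within one standard deviation of the mean, about 95% lie within two standard deviations, and about 99.7% lie within three standard deviations.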

For this to work, the property being measured should be normally distributed. If so, then the rule of thumb follows from the fact that this property holds for the standard normal:
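The three percentages can be recovered numerically from the standard normal. The sketch below uses R's built-in pnorm; the names k and within_k are chosen for this illustration.

```r
k <- 1:3

# Area under the standard normal curve between -k and k
within_k <- pnorm(k) - pnorm(-k)

round(100 * within_k, 1)
## [1] 68.3 95.4 99.7
```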

Example

Let’s suppose that the mean life expectancy of a cat is 14 years with a standard deviation of 2.5 years. Assuming that the cats’ life spans are normally distributed, is it reasonable to expect a cat to live to 22 years old?

Solution: Well, the \(z\)-score for a 22 year old kitty cat would be \[Z = \frac{22-14}{2.5} = 3.2.\] As we know, only about \(0.3\%\) of values lie more than 3 standard deviations from the mean, and only half of those (about \(0.15\%\)) lie above it. A 22 year old cat would be quite rare indeed.
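Under the same normality assumption, the rule of thumb can be sharpened with a direct tail computation. This is only a sketch using R's built-in pnorm with lower.tail = FALSE; the name p_beyond is made up for the illustration.

```r
# Proportion of cats expected to live past 22 years, assuming a normal
# distribution with mean 14 and standard deviation 2.5
p_beyond <- pnorm(22, mean = 14, sd = 2.5, lower.tail = FALSE)
p_beyond
```

The result is a little under 0.0007, or fewer than one cat in a thousand, which is consistent with the rule of thumb.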

Assessing normality

In the previous problem, we assumed that the lifespan of cats was normally distributed. Is this a valid assumption? I have no idea, but given a data set there are ways to check.

A histogram

Human heights are classically known to be normally distributed. Let’s take a look at the heights of the more than 9500 men in our CDC data set.

Looks kinda normal.

The normal probability plot

There’s a more sensitive tool, though: the normal probability plot, a special case of what is more generally called a quantile-quantile plot.

Here’s the basic idea: Suppose we have some sample data consisting of \(n\) numerical values. We’ll make a plot with \(n\) points, one point for each of the \(n\) values. The vertical component for a given data point is just the value of the data point itself. The horizontal component is the \(z\)-score that sits at the same percentile in the standard normal distribution as the data point does within the sample. If the data is normally distributed, the points should fall close to a straight line.

For example, \(z=1\) is greater than about 84% of all values produced by the standard normal distribution. You can look that up in a normal table or use the following R command:

pnorm(1)
## [1] 0.8413447

Furthermore, the \(84^{\text{th}}\) percentile for the men’s CDC height data is 73 inches, as the following computation reveals.

quantile(men$height,0.84)
## 84% 
##  73

Thus, the point \((1,73)\) should be on the normal probability plot for this data. Here’s the normal probability plot of that height data, together with the point \((1,73)\) shown in yellow.
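The construction described above can be sketched by hand in base R. The data here is simulated and only stands in for the CDC heights, and the names heights, theoretical and observed are choices made for this illustration. The percentiles come from ppoints, which is the same choice the built-in qqnorm makes.

```r
set.seed(1)                               # arbitrary seed, only for reproducibility
heights <- rnorm(200, mean = 69, sd = 3)  # simulated values, not the CDC data

n <- length(heights)
p <- ppoints(n)            # n percentiles strictly between 0 and 1
theoretical <- qnorm(p)    # z-score at each percentile (horizontal coordinate)
observed <- sort(heights)  # data values in increasing order (vertical coordinate)

plot(theoretical, observed,
     xlab = "Theoretical z-score", ylab = "Sample value")

# The built-in functions produce the same points, plus a reference line
qq <- qqnorm(heights, plot.it = FALSE)
qqline(heights)
```

In practice, calling qqnorm and qqline directly is the usual route; the hand-built version is here only to show where each coordinate comes from.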

By contrast, income is classically not normally distributed. Here’s a normal probability plot for a random sample of incomes taken from the US census.