The normal distribution is a continuous distribution that turns out to be an extremely good approximation to many real world problems.
Recall that a continuous random variable is a random process that produces real numbers - as opposed to just integers. A continuous distribution is a rule that allows us to assign a probability that a continuous random variable lies in some given interval.
One way to generate such a rule is to draw a curve over the \(x\)-axis such that the total area under that curve is 1. We then say that the probability that a random variable lies in an interval is the area under the curve and over the interval.
In the image at the top, for example, the curve is the graph of a function called the standard normal distribution. If \(X\) is a variable with this distribution, then \(P(-1\leq X \leq 1)\) is exactly the shaded area.
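We can compute that shaded area numerically in R. This is a quick sketch using R's built-in `dnorm` (the standard normal density) together with `integrate`:

```r
# The shaded area P(-1 <= X <= 1) is the area under the standard
# normal density over the interval [-1, 1].
area <- integrate(dnorm, lower = -1, upper = 1)$value
area  # approximately 0.6827
```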
There is a specific formula for the standard normal distribution. It’s not particularly important for our purposes but, in case you’re curious, the standard normal curve is the graph of the function \[f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}.\] You can check this by graphing this function using a tool like Desmos or in R like so:
f <- function(x) exp(-x^2/2)/sqrt(2*pi)
plot(f,-4,4)
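Two quick sanity checks on this formula: the total area under the curve should be 1, and the formula should agree with R's built-in density function `dnorm`:

```r
# The standard normal density, written out explicitly.
f <- function(x) exp(-x^2/2)/sqrt(2*pi)

# Total area under the curve is 1 (up to numerical error).
integrate(f, -Inf, Inf)$value

# The formula agrees with R's built-in dnorm.
all.equal(f(0.5), dnorm(0.5))
```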
It’s easy to believe that the mean of the standard normal is zero - half of the area lies to the left of zero and half to the right. The standard deviation of the standard normal is one. These quantities are what make this particular normal curve the standard normal.
There is a somewhat more general formula that yields normal curves that are shifted and/or concentrated or more spread out. Again, it’s not super important for us but here it is:
\[f(x) = \frac{1}{s\sqrt{2\pi}} e^{-(x-m)^2/(2s^2)}.\] The important point is that \(m\) indicates where the area is centered, while \(s\) indicates how spread out the area is. The figure below shows the graphs of these functions for several choices of \(m\) and \(s\). You can also interact with these curves dynamically using this Desmos demonstration.
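In R, this family of curves is available through `dnorm(x, m, s)`, where `m` is the mean and `s` is the standard deviation. A quick sketch of the effect of shifting and spreading:

```r
# Plot several normal densities to see the roles of m and s.
xs <- seq(-6, 6, length.out = 300)
plot(xs, dnorm(xs, mean = 0, sd = 1), type = "l",
     xlab = "x", ylab = "f(x)")
lines(xs, dnorm(xs, mean = 2, sd = 1), col = "red")   # shifted right
lines(xs, dnorm(xs, mean = 0, sd = 2), col = "blue")  # more spread out
```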
There are two ways to compute normal probabilities:
One common way to compute normal probabilities is from a table of standard normal probabilities. A rather sparse example is shown below; there’s a much denser table in the back of your textbook.
\[ \begin{array}{c|c} k & P(0<X<k) \\ \hline 0.1 & 0.0398278 \\ 0.2 & 0.0792597 \\ 0.3 & 0.117911 \\ 0.4 & 0.155422 \\ 0.5 & 0.191462 \\ 0.6 & 0.225747 \\ 0.7 & 0.258036 \\ 0.8 & 0.288145 \\ 0.9 & 0.31594 \\ 1 & 0.341345 \\ 1.1 & 0.364334 \\ 1.2 & 0.38493 \\ 1.3 & 0.4032 \\ 1.4 & 0.419243 \\ 1.5 & 0.433193 \\ 1.6 & 0.445201 \\ 1.7 & 0.455435 \\ 1.8 & 0.46407 \\ 1.9 & 0.471283 \\ 2 & 0.47725 \\ \end{array} \]
Geometrically, this table tells us the area under the standard normal curve and over the interval \([0,k]\). For example, the area in the figure below is \(0.455435\).
Probabilistically, if \(X\) is a random variable whose distribution is the standard normal, then \[P(0<X<1.7) = 0.455435.\]
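The table above can be reproduced in R. Since half of the total area lies to the left of zero, \(P(0<X<k)\) is just `pnorm(k)` minus one half:

```r
# P(0 < X < k) for a standard normal is pnorm(k) - 1/2,
# because half the area lies to the left of zero.
k <- seq(0.1, 2, by = 0.1)
data.frame(k = k, prob = pnorm(k) - 0.5)
```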
Note that the normal distribution has a lot of symmetry. Thus, for example, \[P(-1<X<0) = P(0<X<1) = 0.341345,\] where that last result can be read from the table. Also note that \[P(-1<X<1) = P(-1<X<0) + P(0<X<1) = 0.341345 + 0.341345 = 0.68269.\] This captures the intuitive statement that \(68\%\) of the population lies within 1 standard deviation of the mean.
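We can check this symmetry computation directly in R using `pnorm`:

```r
# By symmetry, P(-1 < X < 0) equals P(0 < X < 1).
left  <- pnorm(0) - pnorm(-1)
right <- pnorm(1) - pnorm(0)
c(left, right, left + right)  # both about 0.341345; sum about 0.68269
```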
Suppose that \(X\) is a normally distributed random variable with mean \(\mu\) and standard deviation \(\sigma\). We can translate \(X\) to a standard normal \(Z\) by \[Z=\frac{X-\mu}{\sigma}.\] This allows us to use the standard normal table to compute probabilities associated with any normal random variable.
Example: Suppose that \(X\) is a normally distributed random variable with mean 80 and standard deviation 10. Compute \(P(75<X<90)\).
Solution: We take the inequality \(75<X<90\), subtract through by the mean 80, and then divide through by the standard deviation 10 to get \[ 75<X<90 \] \[ 75-80<X-80<90-80 \] \[ \frac{-5}{10} < \frac{X-80}{10} < \frac{10}{10} \] \[ -\frac{1}{2} < Z < 1 \]
Since \(Z\) is a standard normal, we can read the result off of the table. Doing so, we see that
\[P\left(-\frac{1}{2} < Z < 1\right) = 0.191462 + 0.341345 = 0.532807.\]
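We can verify this answer in R by carrying out the same standardization and then using `pnorm`:

```r
# Standardize 75 < X < 90 using mu = 80 and sigma = 10,
# then compute P(-0.5 < Z < 1) for the standard normal Z.
z_lo <- (75 - 80)/10   # -0.5
z_hi <- (90 - 80)/10   #  1
pnorm(z_hi) - pnorm(z_lo)  # about 0.532807
```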
Of course, R has built in commands to compute normal probabilities. In particular, if \(X\) is a normally distributed random variable with mean \(m\) and standard deviation \(s\), then \[\texttt{pnorm(k,m,s)} = P(X<k).\] For example, if \(X\) is a normally distributed random variable with mean 80 and standard deviation 10 as at the end of the last section, we can compute \(P(75<X<90)\) via the command:
pnorm(90,80,10) - pnorm(75,80,10)
## [1] 0.5328072
There is a statistical rule of thumb called the 68-95-99.7 rule that states that, for normally distributed data, about \(68\%\) of values lie within one standard deviation of the mean, about \(95\%\) within two standard deviations, and about \(99.7\%\) within three.
For this to work, the property being measured should be normally distributed. If so, then the rule of thumb follows from the fact that the corresponding property holds for the standard normal: \[P(-1<Z<1) \approx 0.683, \quad P(-2<Z<2) \approx 0.954, \quad P(-3<Z<3) \approx 0.997.\]
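These standard normal probabilities are easy to check with `pnorm`:

```r
# Area under the standard normal within 1, 2, and 3
# standard deviations of the mean.
sapply(1:3, function(k) pnorm(k) - pnorm(-k))
# roughly 0.683, 0.954, 0.997
```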
Suppose we have some data that we suspect is normally distributed; we’d like an objective test, though. One simple geometric tool for this purpose is the normal probability plot - also called a quantile-quantile plot.
Here’s the basic idea: Suppose we have some sample data consisting of \(n\) numerical values. We’ll make a scatter plot with \(n\) points - one point for each of the \(n\) values. First, we compute the mean and standard deviation of the sample data. Assuming the data is normally distributed, we can compute what the quantiles should be and plot those theoretical quantiles against the actual quantiles of the data. If the data really is normally distributed, the points should fall close to a straight line.
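The idea can be sketched by hand in R. Here we use a hypothetical simulated sample (the data and seed are just for illustration); `ppoints` and `qnorm` give the quantiles a normal distribution would predict:

```r
# A normal probability plot by hand: sort the data (the sample
# quantiles) and plot against the standard normal quantiles.
set.seed(1)                              # hypothetical simulated data
x <- rnorm(100, mean = 80, sd = 10)
theoretical <- qnorm(ppoints(length(x))) # what the quantiles "should" be
plot(theoretical, sort(x))               # roughly a straight line
```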
As an example, let’s examine the heights of NBA players.
nba = read.delim('https://www.marksmath.org/classes/Summer2017Stat185/data/nbaHeights.txt')
qqnorm(nba$h.in)
qqline(nba$h.in)
By contrast, income is classically not normally distributed:
cen = read.delim('https://www.marksmath.org/classes/Summer2017Stat185/data/census.txt')
qqnorm(cen$totalFamilyIncome)
qqline(cen$totalFamilyIncome)
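If you don’t have the census file handy, a simulated right-skewed sample (a lognormal distribution here, a common rough stand-in for income data) shows the same characteristic bend away from the reference line:

```r
# Right-skewed data bends away from the qqline at the upper end.
set.seed(2)                                      # hypothetical simulated incomes
income <- rlnorm(500, meanlog = 10, sdlog = 1)
qqnorm(income)
qqline(income)
```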