The central limit theorem
As stated earlier, the importance of the normal distribution lies in the fact that it arises from the averaging process. The theoretical foundation of this statement is the Central Limit Theorem, or CLT.
In its purest form, the CLT deals with a sequence \[(X_i)_{i=1}^{\infty} = (X_1,X_2,X_3,\ldots)\] of independent, identically distributed random variables; we often refer to such a sequence as i.i.d.
Averaging
Given a sequence of random variables, we can form the sequence of averages: \[\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i.\]
The first big theorem in the theory of random variables is the law of large numbers, which states that, if \((X_i)_{i=1}^{\infty}\) is an i.i.d. sequence of random variables, each with mean \(\mu\), then \(\bar{X}_n\) converges almost surely to \(\mu\).
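We can watch this convergence happen with a minimal Python sketch; the choice of an exponential distribution with mean 1 is our own, and any distribution with a finite mean would do:

```python
import numpy as np

rng = np.random.default_rng(1)

# An i.i.d. sequence drawn from an exponential distribution with mean mu = 1
# (our own choice of distribution for illustration).
mu = 1.0
x = rng.exponential(scale=mu, size=100_000)

# Running averages X-bar_n = (X_1 + ... + X_n) / n.
running_avg = np.cumsum(x) / np.arange(1, len(x) + 1)

for n in [10, 100, 1_000, 10_000, 100_000]:
    print(f"n = {n:>6}:  X-bar_n = {running_avg[n - 1]:.4f}")
# The printed averages settle down near mu = 1 as n grows.
```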
The central limit theorem is related but gives us more information about how the convergence occurs.
Statement from Devore’s text
Let \(X_1,X_2,X_3,\ldots,X_n\) be a random sample from a distribution with mean \(\mu\) and variance \(\sigma^2\). Then if \(n\) is sufficiently large, \(\bar{X}\) has approximately a normal distribution with mean \(\mu\) and variance \(\sigma^2/n\). The larger the value of \(n\), the better the approximation.
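We can check the claims about the mean \(\mu\) and variance \(\sigma^2/n\) numerically. Here is a sketch, again using an exponential distribution (our own choice, with \(\mu = \sigma = 1\)) and a sample size and trial count we picked for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, trials = 1.0, 1.0, 50, 20_000

# Draw many random samples of size n and average each one.
xbars = rng.exponential(scale=mu, size=(trials, n)).mean(axis=1)

print("mean of X-bar:", xbars.mean())   # close to mu = 1
print("sd of X-bar:  ", xbars.std())    # close to sigma / sqrt(n)
print("sigma/sqrt(n):", sigma / np.sqrt(n))  # about 0.1414
```

Even though the exponential distribution is quite skewed, the sample means already cluster around \(\mu\) with standard deviation close to \(\sigma/\sqrt{n}\).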
Stated like a true statistician!!
In practice, we'll often think of each \(X_i\) as representing an individual in a sample chosen from a whole population.
A computer experiment
We can illustrate the central limit theorem with a little computer experiment. Suppose we have a large data set of 20000 values. We repeatedly grab a small sample of size 1, 4, 16, 32, or 64 from that data set, compute the average of each sample, and draw a histogram of the resulting averages for each sample size. As it turns out, the spread of the histograms shrinks like \(1/\sqrt{n}\): each time the sample size quadruples (from 1 to 4, 4 to 16, or 16 to 64), the spread is cut roughly in half, just as the \(\sigma^2/n\) in the theorem predicts.
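Here is a sketch of the experiment in Python. The data set is simulated with a skewed distribution as a stand-in for the one described above, and the number of repetitions is our own choice:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)

# Stand-in for the large data set of 20,000 values (deliberately skewed).
data = rng.exponential(scale=2.0, size=20_000)

fig, axes = plt.subplots(1, 5, figsize=(15, 3), sharex=True)
for ax, n in zip(axes, [1, 4, 16, 32, 64]):
    # Repeatedly grab a sample of size n from the data set and record its average.
    means = np.array([rng.choice(data, size=n).mean() for _ in range(2_000)])
    ax.hist(means, bins=30)
    ax.set_title(f"n = {n}, sd = {means.std():.2f}")

plt.tight_layout()
plt.show()
```

The printed standard deviations in the panel titles make the \(1/\sqrt{n}\) scaling visible: the histogram for \(n = 64\) is about half as spread out as the one for \(n = 16\), and the histograms look increasingly bell-shaped even though the underlying data set is skewed.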