Mon, Sep 16, 2024
To this point, we’ve covered a lot of apparently disparate things.
Today, we’re going to start to tie them together. In particular, we’re going to learn how the central limit theorem connects sums of random variables to the normal distribution.
This is mostly the beginnings of section 5.1 in our text.
Here’s a problem that comes right off of our review sheet:
I’ve got an unfair coin that comes up heads \(90\%\) of the time. Suppose I flip the coin and write down a \(1\) if it comes up heads or a \(0\) if it comes up tails. Let’s denote that numerical value by the random variable \(X\).
What are the expectation and variance of this random variable?
Recall that expectation is \[E(X) = p = 0.9\] and that the variance is \[\sigma^2(X) = p(1-p) = 0.9\times0.1 = 0.09.\]
The problem continues to ask: suppose I flip the coin 1000 times and count the number of heads that I get. We’ll call that numerical value \(S\). What are the expectation and variance of this new random variable \(S\)?
It’s just a matter of multiplying by the number of coin flips to get \[E(S) = 1000\times0.9 = 900\] and \[\sigma^2(S) = 1000\times0.09 = 90.\]
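We can double-check that multiplication with a quick simulation sketch; the seed and number of runs here are arbitrary choices of mine:

```python
import numpy as np

# Simulate 100,000 runs of 1000 biased coin flips and compare the
# sample mean and variance of the head count to E(S) = 900, var(S) = 90.
rng = np.random.default_rng(0)
counts = rng.binomial(n=1000, p=0.9, size=100_000)
print(counts.mean(), counts.var())
```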
Finally, the problem asks us to estimate \(P(S < 888)\).
And this is where it gets a little funkier.
In principle, we can solve this problem with the binomial distribution:
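The computation might look like the following with scipy’s binom (a sketch; since \(P(S < 888) = P(S \leq 887)\), we evaluate the CDF at 887):

```python
from scipy.stats import binom

# P(S < 888) = P(S <= 887) for a binomial with n = 1000, p = 0.9
print(binom.cdf(887, 1000, 0.9))
```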
0.09540800627807665
Effectively, this computes \[\sum_{k=0}^{887} \binom{1000}{k} (0.9)^{k}(0.1)^{1000-k}.\]
There are issues with that approach, though.
Here’s an alternative that yields a good estimate:
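Presumably, this uses the normal approximation with mean \(900\) and variance \(90\); here’s a sketch using a continuity correction at \(887.5\) (my choice of correction, which accounts for \(S\) being integer-valued):

```python
import numpy as np
from scipy.stats import norm

# Normal approximation to P(S < 888): S is approximately N(900, 90),
# and the half-integer cutoff is a continuity correction.
print(norm.cdf(887.5, 900, np.sqrt(90)))
```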
Effectively, this computes the shaded area in this picture:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom, norm

# Dots: the binomial pmf for 1000 flips with p = 0.9.
# Curve: the approximating normal density with mean 900 and variance 90.
xs = np.arange(860, 940)
ys = [norm.pdf(x - 0.5, 900, np.sqrt(90)) for x in xs]
ypts = [binom.pmf(k, 1000, 0.9) for k in xs]
plt.plot(xs, ypts, '.')
plt.plot(xs, ys, '-')

# Shade the area corresponding to P(S < 888).
xs2 = np.arange(860, 888)
ax = plt.gca()
ax.fill_between(xs2, 0, norm.pdf(xs2, 900, np.sqrt(90)))
ax.set_aspect(600)
plt.show()
Suppose I roll a fair six-sided die one million times, add up the numbers, and call the result \(S\). What’s \[P(S < 3,501,000)?\]
First, we need to know the mean and variance for 1 roll:
\[E(X) = \frac{1+2+3+4+5+6}{6} = \frac{7}{2} = 3.5\] and \[\sigma^2(X) = \frac{\left(1-\frac{7}{2}\right)^2 + \left(2-\frac{7}{2}\right)^2 + \left(3-\frac{7}{2}\right)^2 + \left(4-\frac{7}{2}\right)^2 + \left(5-\frac{7}{2}\right)^2 + \left(6-\frac{7}{2}\right)^2}{6} = \frac{35}{12}.\]
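These two values are easy to verify numerically:

```python
import numpy as np

faces = np.arange(1, 7)
mu = faces.mean()                  # 7/2 = 3.5
var = ((faces - mu) ** 2).mean()  # 35/12
print(mu, var)
```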
To get the mean and variance for 1,000,000 rolls we simply multiply by 1,000,000.
\[E(S) = 3,500,000\] and \[\sigma^2(S) = \frac{35,000,000}{12}.\]
We use norm.cdf to compute the probability:
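A sketch of that computation, with the mean and variance we just found:

```python
import numpy as np
from scipy.stats import norm

# S is approximately normal with mean 3,500,000 and variance 35,000,000/12
print(norm.cdf(3_501_000, 3_500_000, np.sqrt(35_000_000 / 12)))
```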
Why does the normal distribution give such good estimates here? Because of the Central Limit Theorem, of course!
The central limit theorem is the theoretical explanation of why the normal distribution appears as the limit of binomials above and, therefore, so often in practice. Suppose that \(X\) is a random variable which we evaluate independently a bunch of times to produce a sequence of numbers: \[X_1, X_2, \ldots, X_n.\] We then compute the sum of those values to produce a new value \(S\) defined by \[S = X_1 + X_2 + \cdots + X_n.\] The central limit theorem asserts that, for large \(n\), the random variable \(S\) is approximately normally distributed. Furthermore, if \(X\) has mean \(\mu\) and standard deviation \(\sigma\), then the mean and standard deviation of \(S\) are \(n\times\mu\) and \(\sqrt{n} \times \sigma\).
Since an average is just a sum divided by \(n\), we can do the same thing with averages.
That is, define \(\bar{X}\) by \[\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}.\] The central limit theorem also asserts that, for large \(n\), the random variable \(\bar{X}\) is approximately normally distributed. Furthermore, if \(X\) has mean \(\mu\) and standard deviation \(\sigma\), then the mean and standard deviation of \(\bar X\) are \(\mu\) and \(\sigma/\sqrt{n}\).
Note that all of this is true regardless of the distribution of \(X\)!
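For instance, here’s a simulation sketch where \(X\) is exponentially distributed, which is quite skewed; the sums still behave just as the theorem predicts (the scale, seed, and sample counts are my arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
# Each row is the sum of n draws from an exponential with mu = sigma = 2
sums = rng.exponential(scale=2.0, size=(50_000, n)).sum(axis=1)

# Theory: mean = n*mu = 2000, standard deviation = sqrt(n)*sigma ~ 63.25
print(sums.mean(), sums.std())
```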
The process of computing a statistic based on a random sample can be thought of as a random variable in the following sense: suppose we draw a sample of the population and compute some statistic. If we repeat that process several times, we’ll surely get different results.
Since sampling produces a random variable, that random variable has some distribution; we call that distribution the sampling distribution.
Suppose we’d like to estimate the average height of individuals in a population. We could do so by selecting a random sample of 100 folks and finding their average height. Probably, this is pretty close to the actual average height for the whole population. If we do this again, though, we’ll surely get a different value.
Thus, the process of sampling is itself a random variable.
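We can watch the sampling distribution of the average height emerge in a simulation; the population parameters below are hypothetical, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical population of a million heights (inches): mean 67, sd 4
population = rng.normal(67, 4, size=1_000_000)

# Repeatedly sample 100 folks and record the average height of each sample
means = [rng.choice(population, size=100).mean() for _ in range(2000)]

# The sample means cluster near 67 with spread about 4/sqrt(100) = 0.4
print(np.mean(means), np.std(means))
```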
As we move forward, we’ll think of sampling as a random process that we try to model with the normal distribution!