Mon, Sep 16, 2024

To this point, we’ve covered a lot of apparently disparate things:

- Data, including language, tables, charts, and measures,
- Probability theory - seemingly, for gamblers, and
- Random variables and their distributions, with a focus on the *binomial* and the **Normal**.

Today, we’re going to start to tie things together. In particular, we’re going to learn

- Why the normal distribution arises so frequently,
- More specifically, how it arises in the *sampling* process, and
- How that allows us to model data.

This is mostly the beginnings of section 5.1 in our text.

Here’s a problem that comes right off of our review sheet:

I’ve got an unfair coin that comes up heads \(90\%\) of the time. Suppose I flip the coin and write down a \(1\) if it comes up heads or a \(0\) if it comes up tails. Let’s denote that numerical value by the random variable \(X\).

What are the expectation and variance of this random variable?

Recall that expectation is \[E(X) = p = 0.9\] and that the variance is \[\sigma^2(X) = p(1-p) = 0.9\times0.1 = 0.09.\]
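As a quick check, these values agree with what `scipy.stats` reports (a sketch, assuming scipy is available):

```python
from scipy.stats import bernoulli

# Mean and variance of a single biased coin flip with p = 0.9
mean, var = bernoulli.stats(0.9)
print(mean, var)  # expectation 0.9, variance 0.09
```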

The problem continues to ask: suppose I flip the coin 1000 times and count the number of heads that I get. We’ll call that numerical value \(S\). What are the expectation and variance of this new random variable \(S\)?

It’s just a matter of multiplying by the number of coin flips to get

- \(E(S) = 1000\times0.9 = 900\) and
- \(\sigma^2(S) = 1000\times0.09 = 90\).
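These values also match scipy's built-in binomial summary (again a sketch, assuming scipy is available):

```python
from scipy.stats import binom

# Mean and variance of the number of heads in 1000 flips with p = 0.9
mean, var = binom.stats(1000, 0.9)
print(mean, var)  # 900 and 90
```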

Finally, the problem asks us to estimate \(P(S < 888)\).

And this is where it gets a little funkier.

In principle, we can solve this problem with the binomial distribution:

`0.09540800627807665`
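The value above can be reproduced with scipy's binomial CDF (a sketch, assuming scipy is available); note that \(P(S < 888) = P(S \leq 887)\):

```python
from scipy.stats import binom

# P(S < 888) = P(S <= 887) for S ~ Binomial(1000, 0.9)
p = binom.cdf(887, 1000, 0.9)
print(p)  # approximately 0.0954
```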

Effectively, this computes \[\sum_{k=0}^{887} \binom{1000}{k} (0.9)^{k}(0.1)^{1000-k}.\]

There are issues, though:

- It’s complicated,
- It’s narrow, and
- It doesn’t work well for a very large number of flips.

Here’s an alternative that yields a good estimate:
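The estimate presumably looks something like this (a sketch, assuming scipy; the half-unit continuity correction matches the shift in the plotting code below, and the mean 900 and variance 90 come from the computation above):

```python
import numpy as np
from scipy.stats import norm

# Approximate P(S < 888) = P(S <= 887) with a normal of mean 900
# and standard deviation sqrt(90), using a continuity correction
p = norm.cdf(887.5, 900, np.sqrt(90))
print(p)  # roughly 0.094
```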

Effectively, this computes the shaded area in this picture:

```
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom, norm

# Binomial pmf values (dots) against the approximating normal pdf (curve),
# shifted by 0.5 for the continuity correction
xs = np.arange(860, 940)
ys = [norm.pdf(x - 0.5, 900, np.sqrt(90)) for x in xs]
ypts = [binom.pmf(k, 1000, 0.9) for k in xs]
plt.plot(xs, ypts, '.')
plt.plot(xs, ys, '-')

# Shade the region corresponding to P(S < 888)
xs2 = np.arange(860, 888)
ax = plt.gca()
ax.fill_between(xs2, 0, norm.pdf(xs2, 900, np.sqrt(90)))
ax.set_aspect(600)
```

This normal approximation has several advantages:

- It’s much more broadly applicable,
- It works just as well for a very large number of trials, and
- While still complicated, it’s just one thing to learn for a lot of examples.

Suppose I roll a fair six-sided die one million times, add up the numbers, and call the result \(S\). What’s \[P(S < 3501000)?\]

First, we need to know the mean and variance for 1 roll:

\[E(X) = \frac{1+2+3+4+5+6}{6} = \frac{7}{2} = 3.5\] and \[\sigma^2(X) = \frac{\left(1-\frac{7}{2}\right)^2 + \left(2-\frac{7}{2}\right)^2 + \left(3-\frac{7}{2}\right)^2 + \left(4-\frac{7}{2}\right)^2 + \left(5-\frac{7}{2}\right)^2 + \left(6-\frac{7}{2}\right)^2}{6} = \frac{35}{12}.\]

To get the mean and variance for 1,000,000 rolls we simply multiply by 1,000,000.

\[E(S) = 3,500,000\] and \[\sigma^2(S) = \frac{35,000,000}{12}.\]
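A quick numerical check of these values (a sketch using numpy):

```python
import numpy as np

faces = np.arange(1, 7)   # the six faces of the die
mu = faces.mean()         # 7/2 = 3.5
var = faces.var()         # population variance, 35/12
print(mu, var, 1_000_000 * mu, 1_000_000 * var)
```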

We use `norm.cdf` to compute the probability:
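The computation presumably looks something like this (a sketch, assuming scipy; at this scale the continuity correction is negligible):

```python
import numpy as np
from scipy.stats import norm

# P(S < 3501000) for S with mean 3500000 and variance 35000000/12
p = norm.cdf(3_501_000, 3_500_000, np.sqrt(35_000_000 / 12))
print(p)  # roughly 0.72
</antml>```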

Why does the normal distribution do such a good job here? Because of the Central Limit Theorem, of course!

The *central limit theorem* is the theoretical explanation of why the normal distribution appears as the limit of binomials above and, therefore, so often in practice. Suppose that \(X\) is a random variable which we evaluate a bunch of times to produce a sequence of numbers: \[X_1, X_2, \ldots, X_n.\] We then compute the sum of those values to produce a new value \(S\) defined by \[S = X_1 + X_2 + \cdots + X_n.\] The central limit theorem asserts that, for large \(n\), the random variable \(S\) is approximately normally distributed. Furthermore, if \(X\) has mean \(\mu\) and standard deviation \(\sigma\), then the mean and standard deviation of \(S\) are \(n\times\mu\) and \(\sqrt{n} \times \sigma\).

Since an average is just a sum divided by \(n\), we can do the same thing with averages.

That is, define \(\bar{X}\) by \[\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}.\] The central limit theorem also asserts that, for large \(n\), the random variable \(\bar{X}\) is approximately normally distributed. Furthermore, if \(X\) has mean \(\mu\) and standard deviation \(\sigma\), then the mean and standard deviation of \(\bar X\) are \(\mu\) and \(\sigma/\sqrt{n}\).

Note that all of this is true regardless of the distribution of \(X\)!
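A simulation can illustrate this (a sketch using numpy, with an exponential distribution chosen precisely because it is far from normal):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 20_000
# The exponential with scale 1 has mean mu = 1 and sd sigma = 1

# Each row is one evaluation of X_1, ..., X_n; sum and average the rows
samples = rng.exponential(scale=1.0, size=(trials, n))
S = samples.sum(axis=1)
Xbar = samples.mean(axis=1)

print(S.mean(), S.std())        # close to n*mu = 100 and sqrt(n)*sigma = 10
print(Xbar.mean(), Xbar.std())  # close to mu = 1 and sigma/sqrt(n) = 0.1
```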

The process of computing a statistic based on a random sample can be thought of as a random variable in the following sense: Suppose we draw a sample of the population and compute some statistic. If we repeat that process several times, we’ll surely get different results.

Since sampling produces a random variable, that random variable has some distribution; we call that distribution the *sampling distribution*.

Suppose we’d like to estimate the average height of individuals in a population. We could do so by selecting a random sample of 100 folks and finding *their* average height. *Probably*, this is pretty close to the actual average height for the whole population. If we do this again, though, we’ll surely get a different value.
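This repeated-sampling idea is easy to simulate (a sketch; the population and its parameters are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# A made-up population of one million heights (cm)
population = rng.normal(170, 10, size=1_000_000)

# Draw a sample of 100 and compute its average -- then repeat
for _ in range(5):
    sample = rng.choice(population, size=100, replace=False)
    print(sample.mean())  # a different value each time, all near 170
```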

Thus, *the process of sampling is itself a random variable*.

As we move forward, we’ll think of sampling as a random process that we try to model with the normal distribution!