Recently, we've discussed discrete random variables and their distributions. Today, we'll take a look at the process of sampling as a random variable itself. From this perspective, there's a distribution associated with sampling and an understanding of this distribution allows us to estimate how close statistics computed from the sample are to the corresponding parameters for the whole population. Thus, we're laying the foundation for inference - the process of drawing quantitative conclusions from our data.

This is very much like sections 5.1 and 5.2 of our text, though with an emphasis on means of numerical data rather than proportions in categorical data.

The process of computing a statistic based on a random sample can be thought of as a random variable in the following sense: suppose we draw a sample of the population and compute some statistic. If we repeat that process several times, we'll surely get different results each time.

Since sampling produces a random variable, that random variable has some distribution; we call that distribution the *sampling distribution*.

Suppose we'd like to estimate the average height of individuals in a population. We could do so by selecting a random sample of 100 folks and finding *their* average height. *Probably*, this is pretty close to the actual average height for the whole population. If we do this again, though, we'll surely get a different value.

Thus, *the process of sampling is itself a random variable*.

Let's illustrate this with a computer computation. Recall that we've got a data set with the heights (and more) of 20000 individuals. Let's select several random samples of size around 100 from that data set and compute the average height for each.
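Since we don't have the course data set at hand here, the experiment can be sketched with a synthetic population of 20000 heights standing in for it (the population parameters below are made up for illustration):

```python
import random
import statistics

random.seed(1)

# Stand-in for the course data set: 20000 synthetic heights (in inches).
population = [random.gauss(67.7, 4.2) for _ in range(20000)]

# Draw several random samples of size 100 and compute each sample's mean.
sample_means = [
    statistics.mean(random.sample(population, 100))
    for _ in range(5)
]
print(sample_means)  # five slightly different estimates of the population mean
```

Each run of the inner computation produces a slightly different sample mean, which is exactly the sense in which sampling is a random variable.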

We can grab our own individualized data like so:

Once we've got the data, we can get the mean and standard deviation from our Data Explorer.

It looks like Audrey's data has a mean of $\mu=67.736$ and a standard deviation of $\sigma = 4.176$.

What does the statistic (the computation from the sample) tell us about the parameter (the value for the whole corresponding population)? More precisely:

- How close can we expect the statistic to be to the parameter?
- How confident can we be in any assertions we make about the parameter?

Since sampling is a random variable with a distribution, that distribution has a standard deviation. We call that standard deviation the *standard error*. Generally, the standard error depends on two things:

- The standard deviation $\sigma$ of the underlying population and
- The sample size $n$.

For a sample mean, the standard error is $$\sigma/\sqrt{n},$$ which decreases with sample size.

We can understand where the $\sigma/\sqrt{n}$ comes from using the additivity of means and variances. Suppose that $X$ represents the random height of one individual. If we grab $n$ individuals, we might represent their heights as $$X_1,X_2,\ldots,X_n$$ and their average height as $$\bar{x} = \frac{X_1 + X_2 + \cdots + X_n}{n}.$$ Since the variances of independent random variables add, the variance of the numerator is $n\sigma^2$, so its standard deviation is $\sqrt{n}\,\sigma$. Thus, the standard deviation of $\bar{x}$ is $$\frac{\sqrt{n}\,\sigma}{n} = \frac{\sigma}{\sqrt{n}}.$$
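We can check this derivation numerically by simulating many sample means and comparing their measured spread with $\sigma/\sqrt{n}$ (the particular values of $\sigma$ and $n$ below are arbitrary):

```python
import math
import random
import statistics

random.seed(2)

sigma, n = 4.0, 25                    # illustrative population sd and sample size
predicted_se = sigma / math.sqrt(n)   # sigma/sqrt(n) = 0.8

# Simulate many sample means and measure their spread directly.
means = [
    statistics.mean(random.gauss(0, sigma) for _ in range(n))
    for _ in range(20000)
]
observed_se = statistics.stdev(means)

print(predicted_se, observed_se)  # the two should agree closely
```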

We can illustrate standard error by running a little computer experiment. Suppose we have a large dataset of 20000 values. We grab a small sample of size 1, 4, 16, 32, or 64 from that dataset and compute the average of the sample. Repeating this for many samples of each size and plotting histograms of the resulting means, we find that the spread shrinks as the sample size grows; each time the sample size quadruples, the spread is roughly cut in half, just as the formula $\sigma/\sqrt{n}$ predicts.
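That experiment is easy to reproduce; here's a sketch, again with a synthetic population standing in for the data set:

```python
import random
import statistics

random.seed(3)

# Synthetic stand-in for a large dataset of 20000 values.
population = [random.gauss(67.7, 4.2) for _ in range(20000)]

# For each sample size, measure the spread of many sample means.
spreads = {}
for n in (1, 4, 16, 64):
    means = [statistics.mean(random.sample(population, n)) for _ in range(4000)]
    spreads[n] = statistics.stdev(means)

# Each quadrupling of n should roughly halve the spread.
for n in (1, 4, 16):
    print(n, spreads[n] / spreads[4 * n])  # ratios near 2
```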

The computation of a sample mean is not exact; it has variability. Thus, rather than asserting that population mean is some specific value based on the sample mean, we often claim that the population mean probably lies in some interval and we do so with some level of confidence.

The confidence interval for a sample mean $\bar{x}$ has the form
$$[\bar{x}-ME, \bar{x}+ME],$$
where
$$ME = z^* \times SE$$
is called the *margin of error*.

Let's pick this apart a bit. Recall that the standard error $SE$ is just another name for the standard deviation of the sample mean. The number $z^*$ is then a multiplier that indicates how many standard deviations away from the mean we'll allow our interval to go. A common choice for $z^*$ is 2, which implies a $95\%$ level of confidence in our interval, since about $95\%$ of a normal distribution lies within 2 standard deviations of its mean.

Returning to our CDC data set of 20000 individuals that includes their heights, let's use our random sample of around 100 folks to write down a $95\%$ confidence interval for the average height of the population.

Begin by getting your sample and computing the mean and standard deviation.

Recall that Audrey's data has a mean of $\mu=67.736$ and a standard deviation of $\sigma = 4.176$.

Now, our confidence interval will have the form $$[\bar{x} - ME, \bar{x} + ME].$$ We just need to know what $ME=z^*\times SE$ is. We first compute $SE=\sigma/\sqrt{n}$ where we take $\sigma$ to be the standard deviation of the sample.

For Audrey's sample, this works out to be $$ SE = 4.176/\sqrt{110} \approx 0.39816598. $$

Now, for a $95\%$ level of confidence, we take $z^*=2$. Thus our margin of error is $$ ME = 2\times0.39816598 \approx 0.79633196. $$ Thus, our confidence interval is $$ [\bar{x} - ME, \bar{x} + ME] \approx [67.736 - 0.79633196, 67.736 + 0.79633196] \approx [66.93967, 68.532332]. $$

Thus, we have a $95\%$ level of confidence that the actual mean lies in that interval.
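The arithmetic above is easy to check in code; here's a minimal sketch using the numbers from Audrey's sample:

```python
import math

# Audrey's sample statistics from the text.
xbar, s, n = 67.736, 4.176, 110

se = s / math.sqrt(n)   # standard error
me = 2 * se             # margin of error with z* = 2
ci = (xbar - me, xbar + me)

print(ci)  # roughly (66.94, 68.53)
```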

Suppose we draw a random sample of 36 beagles and find their average weight to be 22 pounds with a standard deviation of 8 pounds. Use this information to write down a $95\%$ confidence interval for the average weight of beagles.

**Solution**: Our answer should look like $$[\bar{x} - ME, \bar{x} + ME] = [22 - ME, 22 + ME].$$

Now

$$SE = \sigma/\sqrt{n} = 8/6 \approx 1.33$$

and our rules of thumb tell us to take $z^* = 2$ for a $95\%$ level of confidence. Thus, our confidence interval is

$$[22 - 2\times1.33, 22+2\times1.33] = [19.34,24.66].$$

As we'll see in our table though, a better answer might use $z^*=1.96$.
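Here's a quick check of both versions of the computation, the $z^*=2$ rule of thumb and the table value $z^*=1.96$:

```python
import math

xbar, s, n = 22, 8, 36   # beagle sample: mean, sd, size
se = s / math.sqrt(n)    # 8/6, about 1.33

intervals = {
    z: (xbar - z * se, xbar + z * se)
    for z in (2, 1.96)   # rule-of-thumb multiplier vs. table value
}
print(intervals)
```

Note that the $z^*=1.96$ interval is slightly narrower than the rule-of-thumb interval.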

What if we're not looking for a $95\%$ level of confidence? Rather, we might need quite specifically a $98\%$ confidence interval or a $99.99\%$ confidence interval. We simply find the $z^*$ value such that the area under the standard normal shown below is our desired confidence level.

If, on the other hand, we needed a $98\%$ level of confidence, we'd need to look at this row:

Thus, it looks like $z^* = 2.33$ should do.

Or better yet, we can use this online calculator for $z^*$-multipliers. Just like our probability calculator, this is on our normal calculator page, but near the bottom.
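If you'd rather compute the multiplier yourself, Python's standard library can do it; the `z_star` helper below is just an illustrative name of our own for the computation:

```python
from statistics import NormalDist

def z_star(confidence):
    """z* multiplier: the point leaving (1 - confidence)/2 in each tail."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

for c in (0.95, 0.98, 0.9999):
    print(c, round(z_star(c), 4))  # 1.96, 2.3263, and about 3.89, respectively
```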

In the beagle example, our 98% confidence interval would be

$$[22 - 2.33\times1.33, 22+2.33\times1.33] = [18.9,25.1].$$

Note that the interval needs to be bigger to be more confident.

Finally, we should mention that in order for this technique to work, the sample means must be normally distributed. This does *not* require that the underlying data be normally distributed. For example, weight is not typically normally distributed, *but* the sample means of weight are. Here are the conditions that we need to check to make sure these ideas are applicable:

- A simple random sample,
- Independence of the observations,
- A sample that is large enough (typically, at least 30), and
- A sample that is less than 10% of the population (so that the observations are effectively independent).