Recently, we've discussed discrete random variables and their distributions. Today, we'll take a look at the process of sampling as a random variable itself. From this perspective, there's a distribution associated with sampling and an understanding of this distribution allows us to estimate how close statistics computed from the sample are to the corresponding parameters for the whole population. Thus, we're laying the foundation for inference - the process of drawing quantitative conclusions from our data.
This is essentially sections 4.1 and 4.2 in our text.
The process of computing a statistic based on a random sample can be thought of as a random variable in the following sense: Suppose we draw a sample from the population and compute some statistic. If we repeat that process several times, we'll surely get different results.
Since sampling produces a random variable, that random variable has some distribution; we call that distribution the sampling distribution.
Suppose we'd like to estimate the average height of individuals in a population. We could do so by selecting a random sample of 100 folks and finding their average height. Probably, this is pretty close to the actual average height for the whole population. If we do this again, though, we'll surely get a different value.
Thus, the process of sampling is itself a random variable.
Let's illustrate this with Python. Recall that we've got a data set with the heights (and more) of 20000 individuals. Let's select 10 random samples of size 100 from that data set and compute the average height for each. Here's how:
import pandas as pd

# Load the CDC data set and compute 10 sample means,
# each from a random sample of 100 heights.
df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')
[df.height.sample(100, random_state=i).mean() for i in range(10)]
Looks like the actual average height of the 20000 folks in our CDC data set might be 66 or 67 point something. In fact, we can compute this exactly, since 20000 is not so huge for our computer, but in the general situation (like dealing with the whole US population), this can't be done. The major questions in statistical inference are: how close can we expect a statistic computed from a sample to be to the corresponding parameter for the whole population, and how confident can we be in that assessment?
Since sampling is a random variable with a distribution, that distribution has a standard deviation. We call that standard deviation the standard error. Generally, the standard error depends on two things: the variability of the underlying population and the size of the sample.

For a sample mean, the standard error is $$\sigma/\sqrt{n},$$ where $\sigma$ is the population standard deviation and $n$ is the sample size; thus, it decreases as the sample size grows.
We can understand where the $\sigma/\sqrt{n}$ comes from using the additivity of means and variances. Suppose that $X$ represents the random height of one individual. If we grab $n$ individuals, we might represent their heights as $$X_1,X_2,\ldots,X_n$$ and their average height as $$\bar{x} = \frac{X_1 + X_2 + \cdots + X_n}{n}.$$ Since the $X_i$ are independent, their variances add, so the variance of the numerator is $n\sigma^2$ and its standard deviation is $\sqrt{n}\,\sigma$; thus the standard deviation of $\bar{x}$ is $$\frac{\sqrt{n}\,\sigma}{n} = \frac{\sigma}{\sqrt{n}}.$$
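We can check this numerically. Here's a minimal sketch (assuming the CDC data frame `df` loaded above): draw many samples of size 100, compute the standard deviation of the resulting sample means, and compare it to $\sigma/\sqrt{n}$.

import pandas as pd
from numpy import sqrt

n = 100
# Compute 1000 sample means, each from a random sample of n heights
means = pd.Series([df.height.sample(n, random_state=i).mean() for i in range(1000)])
print(means.std())                # empirical standard deviation of the sample means
print(df.height.std() / sqrt(n))  # sigma/sqrt(n), with sigma estimated from all 20000 heights

The two printed values should be quite close to one another.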
We can illustrate this by running a little experiment on our CDC dataset with 20000 heights. In the first histogram below, we grab a single random sample of size 200 and display the histogram. In the next image, we see a histogram of 200 means of samples of size 4 and, in the one after that, we see a histogram of 200 means of samples of size 16. Note that the spread of each histogram seems to be about half that of the previous one.
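Here's a minimal sketch of that experiment (again assuming the data frame `df` from above, and that matplotlib is available):

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(12, 3), sharex=True)

# One random sample of 200 individual heights
axes[0].hist(df.height.sample(200, random_state=1), bins=20)
axes[0].set_title('200 individual heights')

# 200 means of samples of size 4, then 200 means of samples of size 16
for ax, n in zip(axes[1:], [4, 16]):
    means = [df.height.sample(n, random_state=i).mean() for i in range(200)]
    ax.hist(means, bins=20)
    ax.set_title(f'200 means of samples of size {n}')

plt.show()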
The computation of a sample mean is not exact; it has variability. Thus, rather than asserting that the population mean is some specific value based on the sample mean, we often claim that the population mean probably lies in some interval, and we do so with some level of confidence.
The confidence interval for a sample mean $\bar{x}$ has the form $$[\bar{x}-ME, \bar{x}+ME],$$ where $$ME = z^* \times SE$$ is called the margin of error.
Let's pick this apart a bit. Recall that the standard error $SE$ is just another name for the standard deviation of the sample mean. The number $z^*$ is the multiplier that indicates how many standard deviations away from the mean we'll allow our interval to go. A common choice for $z^*$ is 2, which implies a $95\%$ level of confidence in our interval, since roughly $95\%$ of a normal distribution lies within 2 standard deviations of its mean.
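We can verify that last fact directly by computing the area under the standard normal between $-2$ and $2$. Here's one way, using scipy (which isn't otherwise used in these notes, so consider this an optional check):

from scipy.stats import norm

# Area under the standard normal density between -2 and 2
norm.cdf(2) - norm.cdf(-2)  # roughly 0.954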
Returning to our CDC data set of 20000 individuals that includes their heights, let's draw a random sample of 100 of them and write down a $95\%$ confidence interval for the average height of the population. We begin by getting the data, drawing the sample, and computing the mean of the sample.
# Draw a random sample of 100 heights and compute its mean
heights = df.height.sample(100, random_state=1)
xbar = heights.mean()
xbar
Now, our confidence interval will have the form $$[\bar{x} - ME, \bar{x} + ME].$$ We just need to know what $ME=z^*\times SE$ is. We first compute $SE=\sigma/\sqrt{n}$ where we take $\sigma$ to be the standard deviation of the sample, which we can compute directly:
from numpy import sqrt

# Estimate the standard error using the sample standard deviation
s = heights.std()
se = s/sqrt(100)
se
Now, for a $95\%$ level of confidence, we take $z^*=2$. Thus our margin of error is
me = 2*se
me
And here's our confidence interval:
[xbar - me, xbar + me]
Thus, we have a $95\%$ level of confidence that the actual mean lies in that interval. In this particular example, since the population size of 20000 is not too large and already resides in computer memory, we can compute the actual population mean:
df.height.mean()
We should emphasize, though, that we cannot typically do this!
Suppose we draw a random sample of 36 labradoodles and find their average weight to be 45 pounds with a standard deviation of 12 pounds. Use this information to write down a $95\%$ confidence interval for the average weight of labradoodles.
Solution: Our answer should look like $$[\bar{x} - ME, \bar{x} + ME] = [45-z^*\times SE,\;45+z^*\times SE].$$ Now $SE = \sigma/\sqrt{n} \approx 12/\sqrt{36} = 12/6 = 2$ and we take $z^* = 2$ for a $95\%$ level of confidence. Thus, our confidence interval is $$[45 - 2\times2, 45+2\times2] = [41,49].$$
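The same computation in Python, following the pattern we used for the heights (the numbers here are just the ones stated in the problem):

from numpy import sqrt

xbar = 45  # sample mean weight in pounds
s = 12     # sample standard deviation in pounds
n = 36     # sample size

se = s/sqrt(n)
me = 2*se
[xbar - me, xbar + me]  # [41.0, 49.0]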
Note that weight is not typically normally distributed, but the sample means of weight are approximately so. Here are the conditions that we need to check to make sure these ideas are applicable:

- Independence: the observations come from a random sample and are independent of one another.
- Sample size: the sample is reasonably large (a common rule of thumb is $n \geq 30$), or the underlying population is itself nearly normal.
What if we're not looking for a $95\%$ level of confidence? Rather, we might specifically need a $98\%$ confidence interval or a $99.99\%$ confidence interval. We simply find the $z^*$ value such that the area under the standard normal shown below is our desired confidence level.
Effectively, we're computing quantiles here. For example, the fact that we use $z^*\approx2$ for a $95\%$ confidence interval comes from this little portion of our table:
Z | ... | 0.05 | 0.06 | 0.07 | ... |
---|---|---|---|---|---|
-2.0 | ... | ... | ... | ... | ... |
-1.9 | ... | 0.0256 | 0.0250 | 0.0244 | ... |
-1.8 | ... | ... | ... | ... | ... |
If, on the other hand, we needed a $98\%$ level of confidence, we'd need to look at this row:
Z | 0.00 | 0.01 | 0.02 | 0.03 | 0.04 | ... |
---|---|---|---|---|---|---|
-2.4 | ... | ... | ... | ... | ... | ... |
-2.3 | 0.0107 | 0.0104 | 0.0102 | 0.0099 | 0.0096 | ... |
-2.2 | ... | ... | ... | ... | ... | ... |
Thus, it looks like $z^* = 2.33$ should do; the table entry $-2.33$ marks the lower cutoff, and we use its absolute value as the multiplier.
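These table lookups are really just quantile computations, so we could also ask scipy for the cutoffs directly (again, scipy isn't used elsewhere in these notes, so this is just an optional check):

from scipy.stats import norm

print(norm.ppf(0.025))  # roughly -1.96, the source of z* = 2 for 95% confidence
print(norm.ppf(0.01))   # roughly -2.33, matching the table lookup for 98% confidence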