Last time, we met the normal distribution, which is kinda cool in its own right. However,

The fundamental objective of statistics is to draw conclusions from data.

So, today, we’re going to take our first steps in that direction by determining confidence intervals to estimate values of population parameters. First, though, we might need to make sure that we’re on the same page when it comes to the language of data and statistics.

Data and statistics

We’ve already talked about what data looks like, by referencing a couple of data sets, like our CDC data set:


import pandas as pd
df = pd.read_csv('https://marksmath.org/data/cdc.csv')
df.tail()

       genhlth    exerany  hlthplan  smoke100  height  weight  wtdesire  age  gender
19995  good             1         1         0      66     215       140   23       f
19996  excellent        0         1         0      73     200       185   35       m
19997  poor             0         1         0      65     216       150   57       f
19998  good             1         1         0      67     165       165   81       f
19999  good             1         1         1      69     170       165   83       m

This table represents a sample of 20000 individuals or cases chosen from the population of all US adults. Associated with each case is a list of observations indicated by variable values, which can be numerical or categorical.

Terminology

Here’s just a bit of the lingo surrounding statistics and inference.

Population refers to the complete set of entities under consideration,

Sample refers to some subset of the population,

Preferably a simple random sample.

Parameter refers to some summary characteristic of the population, and

Statistic refers to some summary characteristic computed from a sample.

The objective of statistics

A main question in statistics is: once you’ve computed a statistic from a sample, what might that tell you about the corresponding parameter for the whole population?

Suppose, for example, that the average height of the 20000 individuals in our CDC data set is 67.1829 inches. We might take that to be a reasonable approximation to the average height of all US adults. How accurate an approximation might we expect that to be?

Sampling as a random variable

The process of computing a statistic based on a random sample can be thought of as a random variable in the following sense: suppose we draw a sample of the population and compute some statistic. If we repeat that process several times, we’ll surely get different results.

Since sampling produces a random variable, that random variable has some distribution; we call that distribution the sampling distribution.

Example

Suppose we’d like to estimate the average height of individuals in a population. We could do so by selecting a random sample of 100 folks and finding their average height. Probably, this is pretty close to the actual average height for the whole population. If we do this again, though, we’ll surely get a different value.

Thus, the process of sampling is itself a random variable.

Standard error

Since sampling is a random variable with a distribution, that distribution has a standard deviation. We call that standard deviation the standard error. Generally, the standard error depends on two things:

The standard deviation \(\sigma\) of the underlying population and

The sample size \(n\).

Standard error for a sample mean

For a sample mean, the standard error is \[\sigma/\sqrt{n},\] which decreases with sample size.

Ultimately, this is just a restatement of the Central Limit Theorem, as Devore states:

Let \(X_1,X_2,X_3,\ldots,X_n\) be a random sample from a distribution with mean \(\mu\) and variance \(\sigma^2\). Then if \(n\) is sufficiently large, \(\bar{X}\) has approximately a normal distribution with mean \(\mu\) and variance \(\sigma^2/n\). The larger the value of \(n\), the better the approximation.

An intuitive explanation

We can understand where the \(\sigma/\sqrt{n}\) comes from using the additivity of means and variances for i.i.d. random variables. Suppose that \(X\) represents the random height of one individual. If we grab \(n\) individuals, we might represent their heights as \[X_1,X_2,\ldots,X_n\] and their average height as \[\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}.\]

Intuitive explanation (cont)

Now, when we say that variance is additive (at least, for i.i.d. random variables), we mean that \[\sigma^2(X_1 + X_2 + \cdots + X_n) = n \, \sigma(X_i)^2.\] Dividing a random variable by \(n\) divides its variance by \(n^2\), so \[\sigma^2(\bar{X}) = \frac{n\,\sigma^2}{n^2} = \frac{\sigma^2}{n}, \quad \text{which gives} \quad \sigma(\bar{X}) = \frac{\sigma}{\sqrt{n}}.\]

We can illustrate standard error by running a little computer experiment. Suppose we have a large data set of 20000 values. We grab a small sample of size 1, 4, 16, 32, or 64 from that data set and compute the average of the sample. Repeating this many times for each sample size, we find that the spread of the resulting histograms shrinks like \(1/\sqrt{n}\); each time the sample size quadruples, the spread is roughly cut in half.
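A small simulation along these lines might look as follows. It uses a synthetic population of 20000 values (hypothetical data, not the CDC set) and measures the spread of the sample means directly:

```python
import numpy as np

rng = np.random.default_rng(0)

# A stand-in "population": 20000 values (hypothetical, not the CDC data)
population = rng.normal(loc=67, scale=4, size=20000)
sigma = population.std()

# For each sample size, draw 2000 samples, average each, and measure the spread
spreads = {}
for n in [1, 4, 16, 32, 64]:
    means = [rng.choice(population, size=n).mean() for _ in range(2000)]
    spreads[n] = np.std(means)
    print(n, spreads[n], sigma / np.sqrt(n))  # observed spread vs sigma/sqrt(n)
```

The observed spreads track \(\sigma/\sqrt{n}\) closely, and the spread for \(n=16\) is about half that for \(n=4\).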

Confidence intervals

The computation of a sample mean is not exact; it has variability. Thus, rather than asserting that the population mean is some specific value based on the sample mean, we often claim that the population mean probably lies in some interval, and we do so with some level of confidence.

Formula

The confidence interval for a sample mean \(\bar{x}\) has the form \[[\bar{x}-ME, \bar{x}+ME],\] where \[ME = z^* \times SE\] is called the margin of error.

Let’s pick this apart a bit. Recall that the standard error \(SE\) is just another name for the standard deviation of the sample mean. The number \(z^*\) is then a multiplier that indicates how many standard deviations away from the mean we’ll allow our interval to go. A common choice for \(z^*\) is 2, which implies a \(95\%\) level of confidence in our interval, since about \(95\%\) of a normal distribution lies within 2 standard deviations of the mean.
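We can check that multiplier with scipy: the area under the standard normal within 2 standard deviations of the mean is just over \(95\%\).

```python
from scipy.stats import norm

# Area under the standard normal within 2 standard deviations of the mean
print(norm.cdf(2) - norm.cdf(-2))  # about 0.9545
```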

A computer example

Suppose we’d like to use a small sample to estimate the average height of the 20000 people in our CDC data set. We could draw a sample (of size 100, say), compute the mean, standard deviation, and standard error of the sample, and use all that to compute our confidence interval. The code to do so might look like this:

import pandas as pd
df = pd.read_csv('https://marksmath.org/data/cdc.csv')
m = df.height.mean()
s = df.height.std()
sample = df.sample(100)
sm = sample.height.mean()
ss = sample.height.std()
se = ss/10  # standard error = ss/sqrt(100)
{"population_mean": m, "sample_mean": sm, "standard_error": se,
 "confidence_interval": [sm - 2*se, sm + 2*se],
 "in_there": sm - 2*se < m and m < sm + 2*se}

Beagles

Suppose we draw a random sample of 36 beagles and find their average weight to be 22 pounds with a standard deviation of 8 pounds. Use this information to write down a \(95\%\) confidence interval for the average weight of beagles.

Solution: The standard error is \(SE = 8/\sqrt{36} = 4/3\), so taking \(z^* = 2\) gives \[22 \pm 2\times\frac{4}{3} \approx [19.33, 24.67].\]

As we’ll see though, a better answer might use \(z^*=1.96\).
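The beagle computation is easy to check numerically; this sketch uses the rough multiplier \(z^* = 2\):

```python
import numpy as np

n, xbar, s = 36, 22, 8        # sample size, sample mean, sample std dev
se = s / np.sqrt(n)           # standard error: 8/6
lo, hi = xbar - 2*se, xbar + 2*se
print([lo, hi])               # roughly [19.33, 24.67]
```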

Varying the confidence level

What if we’re not looking for a \(95\%\) level of confidence? Rather, we might need quite specifically a \(98\%\) confidence interval or a \(99.99\%\) confidence interval. We simply find the \(z^*\) value such that the area under the standard normal shown below is our desired confidence level.

Varying confidence level (cont)

Effectively, we’re computing quantiles here. For example, the fact that we use \(z^*\approx2\) (or, even better, \(z^*\approx1.96\)) for a \(95\%\) confidence interval comes from a little portion of our normal table.

Or maybe, better yet, we could use our normal calculator.

Beagles again

In the beagle example, our 98% confidence interval would be \[22 \pm 2.326\times\frac{8}{\sqrt{36}} \approx [18.90, 25.10].\]

Note that the interval needs to be bigger to be more confident.

Dealing with proportions

Suppose we take a random sample of 100 North Carolinians and check whether they are left handed or right handed. If 13 of them are left handed, we would say that the proportion of them who are left handed is \(13\%\). That \(13\%\) is a sample proportion \(\hat{p}\) that estimates the population proportion \(p\).

Note that a proportion is a numerical quantity, even though the data is categorical. Thus, we can compute confidence intervals in a very similar way. Just as with sample means, the sampling process leads to a random variable and, if certain assumptions are met, then we can expect that random variable to be normally distributed.

Standard deviation for a proportion

One notable computational difference between finding confidence intervals for proportions as compared to those for means is how we find the underlying standard deviation. For numerical data, we simply estimate the population standard deviation with the standard deviation of the sample.

For a sample proportion, if we identify success (being left handed, for example) with a \(1\) and failure as a \(0\), then the resulting standard deviation is

\[\sigma = \sqrt{p(1-p)}.\]

This is simply the standard deviation associated with one Bernoulli trial.

It follows that the standard deviation associated with the average of \(n\) trials (that is, the standard error) is \[SE = \sqrt{\frac{p(1-p)}{n}}.\]

In the NC left/right handed example we have \[SE = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.13\times0.87}{100}} \approx 0.0336303.\]
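We can verify that arithmetic with a couple lines of Python:

```python
import numpy as np

p_hat, n = 0.13, 100
se = np.sqrt(p_hat * (1 - p_hat) / n)
print(se)  # about 0.0336
```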

Example

Suppose we draw a random sample of 132 people and find that 16 of them have blue eyes. Use this data to write down a \(95\%\) confidence interval for the proportion of people with blue eyes.

Solution: We have \(\hat{p}=16/132 \approx 0.1212\) and \[SE = \sqrt{\frac{0.1212\times0.8788}{132}} \approx 0.0284,\] so our interval is \[0.1212 \pm 2\times0.0284 \approx [0.0644, 0.1780].\]
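In Python, this computation might look like so (again using \(z^* = 2\)):

```python
import numpy as np

n = 132
p_hat = 16 / n                          # about 0.1212
se = np.sqrt(p_hat * (1 - p_hat) / n)   # standard error for the proportion
interval = [p_hat - 2*se, p_hat + 2*se]
print(interval)                         # roughly [0.064, 0.178]
```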

Margin of error

If you read the details of political surveys, you’re likely to come across the term “margin of error” at some point. I did a Google news search for “margin of error” a while back (2/24/22), for example, and the first political story was this one from The Hill, which discusses political polls in Texas. In that story, I found:

Texas Gov. Greg Abbott (R) tops his primary challengers and leads Democrat Beto O’Rourke in the Lone Star State’s 2022 gubernatorial race, according to a new poll released Thursday.

Abbott currently leads O’Rourke in the poll by seven points, with 52 percent of those surveyed supporting Abbott and 45 percent supporting O’Rourke. Three percent are unsure of their vote.

How happy should Governor Abbott be with 52 percent of the survey?

Margin of error (cont)

In the same survey we find:

The Democratic primary sample surveyed 388 likely voters with a margin of error of plus or minus 4.9 percentage points.

The poll surveyed 522 likely voters in the Republican primary with a margin of error of 4.2 percentage points.

The general election sample surveyed 1,000 likely voters with a margin of error of plus or minus 3 percentage points.

With 52 percent support but a margin of error of \(\pm 3\%\), I guess that Governor Abbott looks strong but doesn’t have the race locked up.

As students of statistics, we’d like to know more precisely where phrases like “margin of error of plus or minus 3 percentage points” come from and what they mean.

Definition

When we write a confidence interval as \[s \pm z^* \times SE,\] then \(z^* \times SE\) is the margin of error. Geometrically, it’s the distance that the interval extends in either direction from the measured statistic \(s\).

So, where’s the \(\pm 3\%\) or \(\pm 4.2\%\) or whatever come from?

Computation

Suppose we’re writing down a confidence interval for a proportion, in this case the proportion of voters supporting a candidate. If the actual proportion is \(p\) and our sample size is \(n\), then the standard error is

\[\sqrt{\frac{p(1-p)}{n}}.\]

In the Texas poll comparing Abbott and O’Rourke, \(n = 1000\). Furthermore, the biggest that \(p(1-p)\) can be is \(1/4\), which happens when \(p=1/2\); you can see this by taking a look at the graph of the parabola \(p(1-p)\).

Finishing the computation

In the Texas poll, the sample size was 1000. Thus, the standard error is at most \[\sqrt{\frac{1/4}{1000}} \approx 0.0158.\]

Now, for a \(95\%\) confidence interval, we could take \(z^* = 1.96\) so that our margin of error is at most

\[ME \leq 1.96\times0.0158 \approx 0.03099.\]

There’s your 3 percentage points.
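A quick sketch of that worst-case computation:

```python
import numpy as np

n = 1000
se_max = np.sqrt(0.25 / n)   # worst-case standard error, using p(1-p) <= 1/4
me = 1.96 * se_max           # margin of error for 95% confidence
print(me)                    # about 0.031
```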

Choosing sample size

Recall that the margin of error generally depends on three things:

Confidence level,

underlying standard deviation, and

the sample size.

Sometimes, we require a specific confidence level and margin of error and, for a sample proportion, the underlying standard deviation \(\sqrt{p(1-p)}\) is never larger than \(1/2\). Thus, we can always obtain the desired confidence level and margin of error by choosing a large enough sample size.

The inequality

In order to choose the sample size, we simply set up the inequality

\[z^*\sqrt{\frac{p(1-p)}{n}} < ME,\]

where \(z^*\) corresponds to the desired confidence level and \(ME\) is the desired margin of error. Since \(\sqrt{p(1-p)}\leq1/2\), this simplifies to

\[z^*\frac{1/2}{\sqrt{n}} < ME \: \text{ or } \: n>\frac{{z^*}^2}{4ME^2}.\]
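This inequality translates directly into code. Here’s a small sketch that uses scipy’s norm.ppf for the \(z^*\) multiplier and assumes the worst-case \(p=1/2\):

```python
import math
from scipy.stats import norm

def required_n(conf, me):
    """Smallest n guaranteeing margin of error at most `me` at confidence
    level `conf`, assuming the worst-case proportion p = 1/2."""
    z = norm.ppf((1 + conf) / 2)       # two-sided z* multiplier
    return math.ceil(z**2 / (4 * me**2))

print(required_n(0.95, 0.02))  # 2401
```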

Example

Suppose we wish to determine the percentage of voters who support our candidate. We’d like a \(95\%\) level of confidence to \(\pm2\%\) points. What sample size should guarantee this?

Simple solution: For a 95% level of confidence, we might take \(z^*=2\) together with the given margin of error \(ME=0.02\) to get \[n > \frac{{z^*}^2}{4ME^2} = \frac{2^2}{4\times0.02^2} = 2500.\]

Thus, a pollster would probably be happy with \(n=2500\) folks in the poll.

More precision

We can get a more precise (possibly smaller) sample size by using more precise estimates to \(z^*\). In fact, for the homework, you need three digits of precision for your \(z^*\) multiplier. For 90, 95, and 99%, these are

Conf:     90%     95%     99%
\(z^*\):  1.645   1.960   2.576

In the previous problem, we would have:

\[n>\frac{{z^*}^2}{4ME^2} = \frac{1.960^2}{4\times0.02^2} \approx 2401.\] Thus, the HW would like to see 2401.

Sample HW problems

Let’s take a look at how this material might be phrased in your online HW.

Note that you will need some computational tool that works well with the normal distribution; you’ll need to be able to compute \(z^*\) multipliers for desired confidence levels, in particular. The \(z^*\) multiplier tool at the bottom of our normal calculator page should work fine for this.
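If you prefer code to the calculator, the \(z^*\) multiplier for a two-sided confidence level is just a normal quantile:

```python
from scipy.stats import norm

def z_star(conf):
    # two-sided multiplier: central area `conf` under the standard normal
    return norm.ppf((1 + conf) / 2)

print([round(z_star(c), 3) for c in (0.90, 0.95, 0.99)])  # [1.645, 1.96, 2.576]
```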

Next time, we’ll have a computer lab and, in anticipation of that, the answers below show how to perform the computations with Python. Those answers assume the following imports have been run:

from scipy.stats import norm
import numpy as np

Chips in a cookie

I randomly select a sample of 102 Chips Ahoy cookies and find that the number of chocolate chips per cookie in the sample has a mean of 24.8 and a standard deviation of 3.7. Write down a 92% confidence interval for the mean number of chocolate chips per cookie.

Solution

n = 102
m = 24.8
s = 3.7
zStar = norm.ppf(0.96)  # 92% confidence: norm.ppf(1 - 0.08/2)
se = s/np.sqrt(n)
me = zStar*se
[m-me, m+me]