Confidence Intervals for Proportions

Fri, Sep 20, 2024

Today’s objectives

Last time, we learned about confidence intervals for means.

Today we’ll do something very similar with proportions.

Recap on Means

We collect data on \(n\) individuals and compute the sample mean \(\bar{x}\) of a numeric variable from that data set. The corresponding confidence interval has the form

\[[\bar{x} - ME, \bar{x} + ME],\]

where \(ME\) stands for the margin of error.

Margin of error

The margin of error has the form

\[ME = z^* \times SE,\]

where \(z^*\), the z-star multiplier, is chosen from the standard normal table to yield the desired degree of confidence and

\[SE = \sigma/\sqrt{n}\]

denotes the standard error. Here \(\sigma\) is the standard deviation of the underlying population and \(n\) is the sample size.
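As a rough illustration of how the multiplier and the margin of error fit together, here is a minimal sketch using only Python's standard library. The helper names `z_star` and `margin_of_error` and the numbers in the usage lines are made up for illustration; `statistics.NormalDist` stands in for the standard normal table.

```python
from math import sqrt
from statistics import NormalDist

def z_star(confidence):
    """Two-sided z* multiplier for a confidence level such as 0.95."""
    # Leave (1 - confidence)/2 of the area in each tail of the standard normal.
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def margin_of_error(sigma, n, confidence=0.95):
    """ME = z* x SE, where SE = sigma / sqrt(n)."""
    return z_star(confidence) * sigma / sqrt(n)

# Hypothetical numbers: standard deviation 4 and sample size 100.
print(round(z_star(0.95), 2))             # 1.96
print(round(margin_of_error(4, 100), 3))  # 0.784
```

Note that the multiplier for \(95\%\) confidence is about 1.96, which is often rounded to 2 in hand computations.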

A computer based example

Suppose we’d like to use a small sample to estimate the average height of the 20000 people in our CDC data set. We could draw a sample (of size 100, say), compute its mean, standard deviation, and standard error, and use those to compute our confidence interval. The code to do so might look like so:

import pandas as pd
from math import sqrt

cdc_data = pd.read_csv('https://marksmath.org/data/cdc.csv')

# Population mean, used only to check the interval afterwards
m = cdc_data.height.mean()

# Draw a random sample of size 100 and summarize it
n = 100
sample = cdc_data.sample(n)
sm = sample.height.mean()
ss = sample.height.std()
se = ss / sqrt(n)  # standard error of the sample mean

# The margin of error is 2*se, using z* = 2 for roughly 95% confidence
{"population_mean": m, "sample_mean": sm, "standard_error": se, 
 "confidence_interval": [sm - 2*se, sm + 2*se], "in_there": sm-2*se < m and m < sm+2*se}
{'population_mean': 67.1829,
 'sample_mean': 66.75,
 'standard_error': 0.4023604595309278,
 'confidence_interval': [65.94527908093815, 67.55472091906185],
 'in_there': True}
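If we repeat this experiment many times, about \(95\%\) of the resulting intervals should contain the population mean. The following is a minimal sketch of that check. It assumes a synthetic, normally distributed population (mean 67, standard deviation 4) in place of the CDC data so that it runs without a download, and the function name `interval_contains_mean` is made up for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev

random.seed(1)  # fixed seed so the run is repeatable

# Synthetic stand-in for a population of 20000 heights (assumed values,
# not the CDC data).
population = [random.gauss(67, 4) for _ in range(20000)]
mu = mean(population)

def interval_contains_mean(n=100):
    """Draw one sample of size n; report whether its interval captures mu."""
    sample = random.sample(population, n)
    sm = mean(sample)
    se = stdev(sample) / sqrt(n)
    return sm - 2 * se < mu < sm + 2 * se

trials = 1000
coverage = sum(interval_contains_mean() for _ in range(trials)) / trials
print(coverage)  # fraction of intervals that contain mu; expect roughly 0.95
```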

Dealing with proportions

Suppose we take a random sample of 100 North Carolinians and check whether they are left handed or right handed. If 13 of them are left handed, we would say that the proportion of them who are left handed is \(13\%\). That \(13\%\) is a sample proportion \(\hat{p}\) that estimates the population proportion \(p\).

Note that a proportion is a numerical quantity, even though the data is categorical. Thus, we can compute confidence intervals in a very similar way. Just as with sample means, the sampling process leads to a random variable and, if certain assumptions are met, then we can expect that random variable to be normally distributed.

Standard deviation for a proportion

One notable computational difference between finding confidence intervals for proportions and finding them for means is how we find the underlying standard deviation. For numerical data, we simply estimate the population standard deviation with the standard deviation of the sample.

For a sample proportion, if we identify success (being left handed, for example) with a \(1\) and failure as a \(0\), then the resulting standard deviation is

\[\sigma = \sqrt{p(1-p)}.\]

This is simply the standard deviation associated with one Bernoulli trial.
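To see where this comes from, here is a short derivation using the usual variance formula. The symbol \(X\) for the outcome of a single trial is introduced here only for illustration. Since \(X\) takes only the values \(0\) and \(1\), we have \(X^2 = X\), so

\[E[X] = 1\cdot p + 0\cdot(1-p) = p, \qquad E[X^2] = E[X] = p,\]

\[\sigma^2 = E[X^2] - (E[X])^2 = p - p^2 = p(1-p).\]

Taking the square root gives the formula above.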

It follows that the standard deviation of the number of successes in \(n\) independent trials is

\[\sigma = \sqrt{n p(1-p)}.\]

Standard error

Dividing the number of successes by \(n\) gives the sample proportion, so the standard error is

\[SE = \frac{\sqrt{n p(1-p)}}{n} = \sqrt{\frac{p(1-p)}{n}}.\]

In the NC left/right handed example, we don't know \(p\), so we estimate it with \(\hat{p} = 0.13\) and get \[SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\frac{0.13\times0.87}{100}} \approx 0.0336303.\]
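The same computation can be sketched in Python. The function name `proportion_se` is made up for illustration, and the second call uses a hypothetical sample size of 400 to show how the standard error shrinks as \(n\) grows.

```python
from math import sqrt

def proportion_se(p_hat, n):
    """Standard error of a sample proportion: sqrt(p_hat * (1 - p_hat) / n)."""
    return sqrt(p_hat * (1 - p_hat) / n)

# 13 left-handed people in a sample of 100
print(proportion_se(13 / 100, 100))  # approximately 0.03363

# Same proportion with four times the sample size (hypothetical):
# the standard error is cut in half.
print(proportion_se(0.13, 400))
```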

Example

Suppose we draw a random sample of 132 people and find that 16 of them have blue eyes. Use this data to write down a \(95\%\) confidence interval for the proportion of people with blue eyes.

Solution: We have \(\hat{p}=16/132 \approx 0.1212\) and

\[SE(\hat{p}) = \sqrt{(16/132)\times(116/132)/132} \approx 0.02840718.\]

Thus, using \(z^* \approx 2\) for \(95\%\) confidence, our confidence interval is

\[0.1212 \pm 2\times0.0284 = [0.0644, 0.178].\]
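As a check on the hand computation, here is a minimal sketch that wraps the steps into one function. The name `proportion_ci` is made up for illustration, and the multiplier comes from the standard library's `statistics.NormalDist` rather than a table, so it uses \(z^* \approx 1.96\) instead of the rounded value 2.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a proportion."""
    p_hat = successes / n
    se = sqrt(p_hat * (1 - p_hat) / n)
    # Two-sided multiplier, about 1.96 when confidence = 0.95
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return p_hat - z * se, p_hat + z * se

# 16 people with blue eyes in a sample of 132
low, high = proportion_ci(16, 132)
print(round(low, 4), round(high, 4))
```

This gives roughly \([0.0655, 0.1769]\), slightly narrower than the interval computed by hand because 1.96 is a bit smaller than 2.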

A computer based example

Suppose we’d like to estimate the proportion of people who exercise some using the CDC data set.

First, we group the data into people who do exercise some (coded 1) and those who don’t (coded 0):

cdc_data.groupby('exerany').size()
exerany
0     5086
1    14914
dtype: int64

So, the proportion of people who do exercise some is \[\frac{14914}{14914+5086} \approx 0.7457.\]
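Because the variable is coded with 0s and 1s, the proportion is just the mean of that column, and its standard deviation matches \(\sqrt{p(1-p)}\). Here is a minimal sketch of that connection. It rebuilds the 0/1 values from the counts shown above rather than loading the CDC file, and uses only the standard library.

```python
from math import sqrt
from statistics import mean, pstdev

# Rebuild the 0/1 column from the counts: 14914 ones and 5086 zeros.
exerany = [1] * 14914 + [0] * 5086

p = mean(exerany)        # proportion of 1s
sigma = pstdev(exerany)  # standard deviation of the 0/1 values

print(p)                          # 0.7457
print(sigma, sqrt(p * (1 - p)))   # the two values agree
```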

Computer example (cont)

With that information, we can compute the confidence interval for the proportion as follows:

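from numpy import sqrt
n = 20000             # number of people in the data set
p = 14914/n           # proportion who exercise some
se = sqrt(p*(1-p)/n)  # standard error of the proportion
zStar = 1.96          # z* multiplier for 95% confidence
me = zStar*se         # margin of error
[p-me,p+me]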
[0.7396647352634039, 0.7517352647365961]