Inference for Categorical Data

Over the last couple of days, we've learned the first two big tools for statistical inference, hypothesis testing and confidence intervals, both for numerical data. We'll now turn our attention to categorical data.

Sample proportions

Suppose we take a random sample of 100 North Carolinians and check whether they are left handed or right handed. If 13 of them are left handed, we would say that the proportion of them who are left handed is $13\%$. That $13\%$ is a sample proportion $\hat{p}$ that estimates the population proportion $p$.

Note that a proportion is a numerical quantity, even though the underlying data is categorical. Thus, we can perform inference in a very similar way. One notable computational difference is how we find the underlying standard deviation. For numerical data, we simply estimate the population standard deviation with the sample standard deviation. For a sample proportion, if we identify success (being left handed, for example) with a $1$ and failure with a $0$, then (as we know from our discussion of the binomial distribution) the resulting standard deviation is $$\sigma = \sqrt{p(1-p)}.$$ It follows that the standard error is $$SE = \frac{\sigma}{\sqrt{n}} = \sqrt{\frac{p(1-p)}{n}}.$$ In practice, $p$ is unknown: for a confidence interval we plug in $\hat{p}$, and for a hypothesis test we use the value of $p$ specified by the null hypothesis.
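As a quick sanity check on the standard error formula, here is a minimal simulation sketch. The helper name `proportion_se`, the seed, and the choice to treat $p = 0.13$ as if it were the true population proportion are all assumptions made purely for illustration.

```python
import numpy as np

def proportion_se(p, n):
    """Standard error of a sample proportion: sqrt(p(1-p)/n)."""
    return np.sqrt(p * (1 - p) / n)

# Assumed values for illustration: pretend the true proportion is 0.13
p, n = 0.13, 100

# Simulate many samples of size n and record each sample proportion
rng = np.random.default_rng(0)
p_hats = rng.binomial(n, p, size=100_000) / n

print("formula:   ", proportion_se(p, n))
print("simulation:", p_hats.std())
```

The two printed numbers should agree closely, which is what we expect if the spread of $\hat{p}$ across repeated samples is described by the formula.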

Example 1 - a confidence interval

Suppose we draw a random sample of 132 people and find that 16 of them have blue eyes. Use this data to write down a 95\% confidence interval for the proportion of people with blue eyes.

Solution: We have $\hat{p}=16/132 \approx 0.1212$ and $$SE(\hat{p}) = \sqrt{(16/132)\times(116/132)/132} \approx 0.02840718.$$ Thus, taking $z^* = 2$ for a 95\% confidence level, our confidence interval is $$0.1212 \pm 2\times0.0284 \approx [0.0644, 0.178].$$
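The same computation can be sketched in Python. The function name `proportion_ci` is hypothetical, and the default multiplier of $2$ simply mirrors the rounding used in these notes; the second call uses the more precise multiplier from `norm.ppf(0.975)` for comparison.

```python
import numpy as np
from scipy.stats import norm

def proportion_ci(successes, n, z_star=2.0):
    """Confidence interval p_hat +/- z_star * SE for a sample proportion."""
    p_hat = successes / n
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z_star * se, p_hat + z_star * se

# Blue eyes example: 16 successes out of 132, with z* rounded to 2
lower, upper = proportion_ci(16, 132)
print(lower, upper)

# Same interval with the more precise 95% multiplier (about 1.96)
z_exact = norm.ppf(0.975)
print(proportion_ci(16, 132, z_star=z_exact))
```

The interval from the exact multiplier is slightly narrower, since $1.96 < 2$.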

Example 2 - a hypothesis test

According to Wikipedia, around 10\% of the population is left handed. A random sample of 211 people found that 29 were left handed. Does this data support Wikipedia's estimate? We can phrase this question in a few different ways:

  • Does this data support the null hypothesis that 10% of the population is left handed?
  • Does this data support the alternative hypothesis that more than 10% of the population is left handed?
  • Does this data support the alternative hypothesis that the proportion of the population that is left handed is not 10%?

Note the distinction between the two versions of the alternative hypothesis. The first is called a one-sided hypothesis and the second is called a two-sided hypothesis.

Thus, there are basically two problems here. In both, we must compare the null hypothesis to one of the two alternative hypotheses. Written symbolically, our null and alternative hypotheses are

\begin{align} H_0 : p=0.1 \\ H_A : p > 0.1 \end{align}

or \begin{align} H_0 : p=0.1 \\ H_A : p \neq 0.1 \end{align}

The first hypothesis test is one-sided; the second is two-sided.

The fundamental definition of a p-value is still the same: the probability of obtaining the observed data, or something more extreme, under the assumption of the null hypothesis. In this problem, our null mean and standard error are $0.1$ and $$\sqrt{0.1\times0.9/211} \approx 0.02065285.$$ Our observed data is $\hat{p} = 29/211 \approx 0.1374$, which is larger than $0.1$.

For the first, one-sided test, the p-value is

In [9]:
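from scipy.stats import norm
import numpy as np
# One-sided p-value: the upper-tail area beyond the observed proportion 29/211
# under the null distribution, normal with mean 0.1 and sd sqrt(0.1*0.9/211)
1-norm.cdf(29/211, loc=0.1, scale=np.sqrt(0.1*0.9/211))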

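This p-value is about $0.035$. As this is smaller than $0.05$, we reject the null hypothesis at the 5\% significance level. For the second, two-sided test, the p-value is twice this, or about $0.07$, which is larger than $0.05$; thus we don't reject the null hypothesis.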
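To see both tests side by side, here is a small sketch that standardizes the observed proportion into a $z$-score first. The variable names are illustrative, and `norm.sf` (the survival function, $1 - \text{cdf}$) is swapped in for the upper-tail area; this is one reasonable way to organize the computation, not the only one.

```python
import numpy as np
from scipy.stats import norm

p0, n, successes = 0.1, 211, 29      # null proportion and observed data
p_hat = successes / n

# Standard error computed under the null hypothesis p = p0
se0 = np.sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / se0

p_one_sided = norm.sf(z)             # area in the upper tail only
p_two_sided = 2 * norm.sf(abs(z))    # area in both tails

print("z-score:          ", z)
print("one-sided p-value:", p_one_sided)
print("two-sided p-value:", p_two_sided)
```

The same data leads to different conclusions at the 5\% level depending on which alternative we chose, which is why the alternative should be fixed before looking at the data.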

More on margin of error

If you read the details of political surveys, you're likely to come across the term "margin of error" at some point. FiveThirtyEight, for example, maintains a running Trump approval rating page. The page also points to poll details for a slew of polls. Check out the first one, namely the Gallup poll. There, we read "Daily results are based on telephone interviews with approximately 1,400 national adults; Margin of error is $\pm 3$ percentage points". What's that mean?


When we write a confidence interval as $$s \pm z^* \times SE,$$ then $z^* \times SE$ is the margin of error. Geometrically, it's the distance that the interval extends in either direction from the measured statistic $s$.


So, where's the $\pm 3$ come from?

Suppose we're writing down a confidence interval for a proportion; in this case, the proportion who approve. If the actual proportion is $p$ and our sample size is $n$, then the standard error is $$\sqrt{\frac{p(1-p)}{n}}.$$ In our case, we take $n\approx 1500$. Furthermore, the biggest that $p(1-p)$ can be is $1/4$. You can see this by taking a look at the graph of $p(1-p)$, a downward-opening parabola that peaks at $p = 1/2$.

Thus, our standard error is at most $$SE \leq \sqrt{\frac{1/4}{1500}} \approx 0.01290994.$$ Now, for a $95\%$ confidence interval, we take $z^* = 2$ so that our margin of error is at most $$ME \leq 2 \times 0.01290994 \approx 0.026,$$ or about 2.6 percentage points, which is rounded up to 3 percentage points.
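The worst-case bound is easy to explore numerically. In the sketch below, `max_margin_of_error` is a hypothetical helper, $z^* = 2$ follows the convention in these notes, and the sample sizes in the loop are arbitrary values chosen for illustration.

```python
import numpy as np

def max_margin_of_error(n, z_star=2.0):
    """Worst-case margin of error for a proportion, using p(1-p) <= 1/4."""
    return z_star * np.sqrt(0.25 / n)

# Check numerically that p(1-p) peaks at p = 1/2 with value 1/4
p_grid = np.linspace(0, 1, 1001)
variance = p_grid * (1 - p_grid)
print("maximizing p:", p_grid[variance.argmax()], "max value:", variance.max())

# Worst-case margin of error for a few illustrative sample sizes
for n in (500, 1000, 1500, 2000):
    print(n, round(max_margin_of_error(n), 4))
```

Because the margin of error shrinks like $1/\sqrt{n}$, each additional respondent buys less precision than the one before.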

This is a common thing to shoot for in political polls, which is why you often see sample sizes close to 1500.

A hypothetical political race

In a political contest, where there are two candidates and victory requires a simple majority, a candidate likes to be more than 3 percentage points above 50\%. Can you see why?

An actual political race

In early November of 2016, FiveThirtyEight reported that Donald Trump was only 3.3 percentage points behind Hillary Clinton, almost within the margin of error. In fact, Clinton ended up winning the popular vote by 2.1 percentage points.

A poll on the first amendment

A recent poll by the Brookings Institution asked the following question of 1500 college students: "Is hate speech constitutionally protected?"

Here are the results:

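All values are percentages. The Dem, Rep, and Ind columns break the results down by political affiliation, Public and Private by type of college, and Female and Male by gender.

| Response | All | Dem | Rep | Ind | Public | Private | Female | Male |
|---|---|---|---|---|---|---|---|---|
| Yes | 39 | 39 | 44 | 40 | 38 | 43 | 31 | 51 |
| No | 44 | 41 | 39 | 44 | 44 | 44 | 49 | 38 |
| Don’t know | 16 | 15 | 17 | 17 | 17 | 13 | 21 | 11 |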

Use this to write down a confidence interval for the percentage of students who believe that hate speech is not constitutionally protected.