Inference for Categorical Data¶

Over the last couple of days, we've learned the first two big tools for statistical inference, hypothesis testing and confidence intervals, both in the context of numerical data. We'll now turn our attention to categorical data.

Sample proportions¶

Suppose we take a random sample of 100 North Carolinians and check whether they are left handed or right handed. If 13 of them are left handed, we would say that the proportion of them who are left handed is $13\%$. That $13\%$ is a sample proportion $\hat{p}$ that estimates the population proportion $p$.

Note that a proportion is a numerical quantity, even though the underlying data is categorical. Thus, we can perform inference in a very similar way. One notable computational difference is how we find the underlying standard deviation. For numerical data, we simply estimate the population standard deviation with the sample standard deviation. For a sample proportion, if we identify success (being left handed, for example) with a $1$ and failure with a $0$, then (as we know from our discussion of the binomial distribution) the resulting standard deviation is $$\sigma = \sqrt{p(1-p)}.$$ It follows that the standard error is $$SE = \frac{\sigma}{\sqrt{n}} = \sqrt{\frac{p(1-p)}{n}}.$$ In practice $p$ is unknown, so when we build a confidence interval we substitute the sample proportion $\hat{p}$ for $p$.
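As a quick numerical check of the standard error formula, here is a minimal sketch applied to the left-handedness sample from above (13 out of 100). The function name `proportion_standard_error` is made up for illustration, and the sketch assumes that we plug the sample proportion in for the unknown $p$.

```python
import math

def proportion_standard_error(p, n):
    """Standard error sqrt(p(1-p)/n) of a sample proportion (hypothetical helper)."""
    return math.sqrt(p * (1 - p) / n)

# Left-handedness sample: 13 of 100 people are left handed.
p_hat = 13 / 100

# Assumption: the unknown population proportion p is replaced by p_hat.
se = proportion_standard_error(p_hat, 100)
print(round(se, 4))
```

Because $n$ sits under the square root, quadrupling the sample size only halves the standard error.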

Example 1 - a confidence interval¶

Suppose we draw a random sample of 132 people and find that 16 of them have blue eyes. Use this data to write down a 95\% confidence interval for the proportion of people with blue eyes.

Solution: We have $\hat{p}=16/132 \approx 0.1212$ and $$SE(\hat{p}) = \sqrt{(16/132)\times(116/132)/132} \approx 0.02840718.$$ Thus, our confidence interval is $$0.1212 \pm 2\times0.0284 = [0.0644, 0.178].$$
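The arithmetic in this solution can be reproduced with a few lines of Python. This is only a sketch of the computation; the variable names are our own, and it uses the same rounded multiplier $z^* = 2$ as the solution ($1.96$ would be slightly more precise).

```python
import math

successes, n = 16, 132   # blue-eyed people and sample size from Example 1
z_star = 2               # rounded multiplier used in the solution above

p_hat = successes / n
se = math.sqrt(p_hat * (1 - p_hat) / n)

# The interval extends z_star standard errors on either side of p_hat.
lower = p_hat - z_star * se
upper = p_hat + z_star * se

print(f"p_hat = {p_hat:.4f}, SE = {se:.4f}")
print(f"95% CI: [{lower:.4f}, {upper:.4f}]")
```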

Example 2 - a hypothesis test¶

According to Wikipedia, around 10\% of the population is left handed. A random sample of 211 people found that 29 were left handed. Does this data support Wikipedia's estimate?

Does this data support the null hypothesis that 10% of the population is left handed?
Does this data support the alternative hypothesis that more than 10% of the population is left handed?
Does this data support the alternative hypothesis that the proportion of the population that is left handed is not 10%?

Note the distinction between the two versions of the alternative hypothesis. The first is called a one-sided hypothesis and the second is called a two-sided hypothesis.

Thus, there are basically two problems here. In both, we must compare the null hypothesis to one of the two alternative hypotheses. Written symbolically, our null and alternative hypotheses are

\begin{align} H_0 : p=0.1 \\ H_A : p > 0.1 \end{align}

or \begin{align} H_0 : p=0.1 \\ H_A : p \neq 0.1 \end{align}

The first hypothesis test is one-sided; the second is two-sided.

The fundamental definition of a p-value is still the same: the probability of obtaining the observed data or something more extreme, under the assumption of the null hypothesis. In this problem, the null mean and standard error are $0.1$ and $$\sqrt{0.1\times0.9/211} \approx 0.02065285.$$ Our observed data is $\hat{p} = 29/211 \approx 0.1374$, which is larger than $0.1$.

For the first, one-sided test, the p-value is

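from scipy.stats import norm
import numpy as np

# One-sided p-value: probability of a sample proportion of at least 29/211
# under the null distribution (mean 0.1, standard error sqrt(0.1*0.9/211)).
1 - norm.cdf(29/211, loc=0.1, scale=np.sqrt(0.1*0.9/211))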

0.034926598260588304

As this is smaller than $0.05$, we reject the null hypothesis. For the second, two-sided test, the p-value is twice this, or about $0.0699$; since that is larger than $0.05$, we don't reject the null hypothesis.
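To see both p-values side by side, here is a short sketch that standardizes the observed proportion and uses `norm.sf`, the survival function in `scipy.stats`, which is equivalent to `1 - norm.cdf`. The variable names are our own, and the $0.05$ cutoff is the significance level used above.

```python
import numpy as np
from scipy.stats import norm

p0, n, successes = 0.1, 211, 29    # null proportion and sample from Example 2
p_hat = successes / n
se = np.sqrt(p0 * (1 - p0) / n)    # standard error under the null hypothesis
z = (p_hat - p0) / se              # standardized test statistic

one_sided = norm.sf(z)             # P(Z >= z), for H_A: p > 0.1
two_sided = 2 * norm.sf(abs(z))    # both tails, for H_A: p != 0.1

print(f"z = {z:.4f}")
print(f"one-sided p-value = {one_sided:.4f}, reject at 0.05: {one_sided < 0.05}")
print(f"two-sided p-value = {two_sided:.4f}, reject at 0.05: {two_sided < 0.05}")
```

The same data leads to different conclusions for the two alternatives, which is why the alternative hypothesis should be chosen before looking at the sample.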

More on margin of error¶

If you read the details of political surveys, you're likely to come across the term "margin of error" at some point. FiveThirtyEight, for example, maintains a running Trump approval rating page. The page also points to poll details for a slew of polls. Check out the first one, namely the Gallup poll. There, we read "Daily results are based on telephone interviews with approximately 1,400 national adults; Margin of error is $\pm 3$ percentage points". What does that mean?

Definition¶

When we write a confidence interval as $$s \pm z^* \times SE,$$ then $z^* \times SE$ is the margin of error. Geometrically, it's the distance that the interval extends in either direction from the measured statistic $s$.

Computation¶

So, where does the $\pm 3$ come from?

Suppose we're writing down a confidence interval for a proportion; in this case, the proportion of people who approve. If the actual proportion is $p$ and our sample size is $n$, then the standard error is $$\sqrt{\frac{p(1-p)}{n}}.$$ In our case, we take $n\approx 1500$. Furthermore, the biggest that $p(1-p)$ can be is $1/4$. You can see this from the graph of $p(1-p)$, which is a downward-opening parabola that peaks at $p=1/2$.

Thus, our standard error satisfies $$SE \leq \sqrt{\frac{1/4}{1500}} \approx 0.01290994.$$ Now, for a $95\%$ confidence interval, we take $z^* = 2$ so that our margin of error satisfies $$ME \leq 2\times 0.01290994 \approx 0.026,$$ which is rounded up to 3 percentage points.
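The worst-case bound is easy to explore numerically. The sketch below checks on a grid that $p(1-p)$ never exceeds $1/4$ and then computes the largest possible margin of error for a few sample sizes. The helper name `max_margin_of_error` and the sample sizes other than 1500 are our own choices for illustration.

```python
import math

def max_margin_of_error(n, z_star=2):
    """Worst-case margin of error for a proportion, using p(1-p) <= 1/4."""
    return z_star * math.sqrt(0.25 / n)

# Numerical check of the bound p(1-p) <= 1/4 on a grid of proportions.
grid = [k / 1000 for k in range(1001)]
largest = max(p * (1 - p) for p in grid)
print("largest p(1-p) on the grid:", largest)

# Worst-case margin of error for several (illustrative) sample sizes.
for n in (500, 1000, 1500):
    print(n, round(max_margin_of_error(n), 4))
```

Since the bound uses the largest possible value of $p(1-p)$, the actual margin of error for a proportion far from $1/2$ is smaller.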

This is a common thing to shoot for in political polls, which is why you often see sample sizes close to 1500.

A hypothetical political race¶

In a political contest, where there are two candidates and victory requires a simple majority, a candidate likes to be more than 3 percentage points above 50\%. Can you see why?

An actual political race¶

In early November of last year, FiveThirtyEight reported that Donald Trump was only 3.3 percentage points behind Hillary Clinton - almost within the margin of error. In fact, Clinton ended up winning the popular vote by 2.1 percentage points.

A poll on the first amendment¶

A recent poll by the Brookings Institution asked the following question of 1500 college students: "Is hate speech constitutionally protected?"

Here are the results:

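| Response | All | Affiliation: Dem | Affiliation: Rep | Affiliation: Ind | College: Public | College: Private | Gender: Female | Gender: Male |
|---|---|---|---|---|---|---|---|---|
| Yes | 39 | 39 | 44 | 40 | 38 | 43 | 31 | 51 |
| No | 44 | 41 | 39 | 44 | 44 | 44 | 49 | 38 |
| Don’t know | 16 | 15 | 17 | 17 | 17 | 13 | 21 | 11 |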

Use this to write down a confidence interval for the percentage of students who believe that hate speech is not constitutionally protected.