Means, Proportions, Hypotheses, and Confidence Intervals

Confidence intervals

  • A confidence interval for a mean of numerical data has the form

    $$\left[\overline{x}-t^*\times\frac{\sigma}{\sqrt{n}},\, \overline{x}+t^*\times\frac{\sigma}{\sqrt{n}}\right],$$

    where $\overline{x}$ and $\sigma$ are measured from the sample and $n$ is the sample size.

  • A confidence interval for a proportion of categorical data has the form

    $$\left[\hat{p}-t^*\times\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\, \hat{p}+t^*\times\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right]$$

    where $\hat{p}$ is measured from the sample and $n$ is the sample size. Note that, for a proportion, we have

    $$\sigma = \hat{p}(1-\hat{p}),$$

    Thus, this formula is practically the same as above.

In both cases, you can safely replace the $t^*$ multiplier with a $z^*$ multiplier from the normal distribution, when $n$ is large.

Hypothesis testing

  • A hypothesis statement for a mean looks like

    • $H_0: \mu=\mu_0$
    • $H_A: \mu\neq\mu_0$ or $H_A: \mu<\mu_0$ or $H_A: \mu>\mu_0$

    Given data with measured mean $\overline{x}$ and standard deviation $\sigma$, the test statistic is

    $$T = \frac{\overline{x}-\mu_0}{\sigma/\sqrt{n}}.$$

  • A hypothesis statement for a proportion looks like

    • $H_0: p=p_0$
    • $H_A: \mu\neq p_0$ or $H_A: p<p_0$ or $H_A: p>p_0$

    Given data with measured proportion $\hat{p}$, the test statistic is

    $$T = \frac{\hat{p}-p_0}{\sqrt{p_0(1-p_0)/n}}.$$

    Note that the choice of mean $p_0$ and standard deviation $\sqrt{p_0(1-p_0)/n}$ for the distribution that we'll use are driven by the assumption that the null hypothesis is correct.

In both cases, we compute the $p$-value by comparing the test statistic against the standard normal or $t$-distribution with the appropriate degrees of freedom. Your computation should be guided by an image that looks something like so:

You might be intereted in one side or the other or both depending on your hypothesis statement.

An example hypothesis test for a proportion

Let's return to our data on body temperature from last time:

In [1]:
import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/normtemp.csv')
df.head()
Out[1]:
body_temperature gender heart_rate
0 96.3 1 70
1 96.7 1 71
2 96.9 1 74
3 97.0 1 80
4 97.1 1 73

We now ask the question: Is the proportion of people who's body temperature is less than $98.6^{\circ}$ less than $1/2$? Let's ue a 99% confidence level.

Written as a hypothesis test:

  • $H_0: p=1/2$
  • $H_A: p<1/2$

with $\alpha=0.01$.

In [2]:
p = df[df.body_temperature<98.6].shape[0]/df.shape[0]
p
Out[2]:
0.6230769230769231

Here's the test statistic:

In [3]:
import numpy as np
T = (p-0.5)/np.sqrt(0.5*(1-0.5)/130)
T
Out[3]:
2.806585661782494

Finally, here's the $p$-value:

In [4]:
from scipy.stats import norm
1-norm.cdf(2.8)
Out[4]:
0.0025551303304279793

Since this is less than 0.01, we reject the null hypothesis.