Intro to Hypothesis Testing

Often in statistics, we want to answer a simple yes or no question. Hypothesis testing refers to the statistical process of formulating and exploring such a question.

This is mostly section 5.3 of our text.

The general idea

Simple yes or no questions that we might address in statistics include:

  • Is my candidate going to win an election?
  • Does this drug alleviate symptoms?
  • Does a new teaching technique improve student learning outcomes?

The idea behind hypothesis testing is to

  1. Clearly state the question we are trying to answer in terms of two competing hypotheses and
  2. Assess the likelihood of the two hypotheses in light of data that's been collected.

Statement of the hypotheses

The two competing statements in a hypothesis test are typically called the null hypothesis and the alternative hypothesis.

  • The null hypothesis $H_0$ is a statement representing a skeptical perspective or the status quo.
  • The alternative hypothesis $H_A$ represents an alternative claim under consideration.

The $p$-value

In the context of hypothesis testing, the $p$-value represents the probability of generating data at least as favorable to the alternative hypothesis as the data actually observed, under the assumption of the null hypothesis. A small $p$-value is evidence against the null hypothesis.

This may seem a little confusing; perhaps some examples will help.

The first example - historically

The first known use of what we now call a $p$-value is typically credited to John Arbuthnot. In 1710, he was interested in the following question:

Are males and females born at equal ratios?

To address this question, he examined birth records in London for each of the 82 years from 1629 to 1710. In every one of those years, the number of males born in London exceeded the number of females. Under the assumption that males and females are born at equal ratios, we'd expect more females than males about half the time. So it would be quite unlikely for there to be more males every single year.

The first p-value

To be more precise on what quite unlikely means, Arbuthnot argued as follows: The probability that more males are born in any particular year is $1/2$. Thus, the probability that more males were born each one of the 82 years from 1629 to 1710 would be

$$\frac{1}{2} \times \frac{1}{2} \times \cdots \times \frac{1}{2} = \left(\frac{1}{2}\right)^{82} \approx 2.06795 \times 10^{-25}.$$

Given the ridiculously small probability that the observed data (82 years of more males) could have arisen under the assumption of equal ratios, it seems reasonable to conclude that the assumption was incorrect in the first place.
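Arbuthnot's product is easy to check numerically. Here is a minimal sketch in plain Python; the variable names are illustrative and simply encode the assumption that each year behaves like an independent fair coin flip.

```python
# Probability that males outnumber females in all 82 years,
# assuming each year is an independent fair coin flip (the null hypothesis).
years = 82
p_more_males = 0.5  # assumed probability under equal birth ratios

p_all_years = p_more_males ** years
print(f"{p_all_years:.5e}")  # prints 2.06795e-25
```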

The modern formulation

Recall the first two steps in modern hypothesis testing: stating the hypotheses (null and alternative) and computing the $p$-value. For Arbuthnot's problem, the question concerns the proportion $r$ of births that are male. Stated in terms of $r$, the hypotheses are:

  • $H_0$: $r=0.5$
  • $H_A$: $r \neq 0.5$.

The confidence level

Prior to computing the $p$-value, the researcher typically specifies a significance level $\alpha$. Common choices are $\alpha=0.05$, corresponding to a $95\%$ level of confidence, or $\alpha = 0.01$, corresponding to a $99\%$ level of confidence, or something even smaller. If the computed $p$-value is less than $\alpha$, then we say that we reject the null hypothesis.

If we specify $\alpha=0.01$ for a $99\%$ level of confidence in Arbuthnot's question, the computed $p$-value is $(1/2)^{82}$, which is much smaller than $0.01$. Thus we reject the null.

The conclusion

Ultimately the conclusion of a hypothesis test is always either:

  • We reject the null hypothesis or
  • We fail to reject the null hypothesis.

Note this strange double negative language: we never actually say that we accept the null hypothesis. The null hypothesis is already the status quo; it doesn't need acceptance.

We always reject or fail to reject the null hypothesis.
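The decision rule can be sketched as a tiny Python function. The name `conclude` and the second example $p$-value of $0.2$ are hypothetical, chosen only to illustrate the two possible outcomes.

```python
def conclude(p_value: float, alpha: float) -> str:
    """Return the conclusion of a hypothesis test at significance level alpha."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"

# Arbuthnot's p-value compared against alpha = 0.01
arbuthnot = conclude(0.5 ** 82, 0.01)
# A made-up p-value of 0.2, purely for illustration
made_up = conclude(0.2, 0.01)

print(arbuthnot)
print(made_up)
```

Note that the function never returns anything like "accept the null hypothesis", in keeping with the language above.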

A normal example

According to Wikipedia, around 10% of the population is left handed. A random sample of 211 people found that 29 were left handed. Does this data support Wikipedia's estimate?

Clearly stating the hypothesis

Before starting, it's critically important to clearly state the question as a hypothesis test with a confidence level. Let's suppose we want a 95% level of confidence. Next, our hypotheses can be stated symbolically as

$$ \begin{align} H_0 : p=0.1 \\ H_A : p \neq 0.1 \end{align} $$

The $Z$-score

The fundamental definition of a $p$-value is still the same: the probability of obtaining data at least as extreme as what was observed, under the assumption of the null hypothesis.

In this problem, we assume the null hypothesis to get a mean of $0.1$ and a standard deviation of

$$\sqrt{0.1\times0.9/211} \approx 0.02065285.$$

Our observed data is $\hat{p} = 29/211 \approx 0.137$, which has a $Z$-score of

$$ \frac{0.137 - 0.1}{0.02065} \approx 1.79. $$

The $p$-value

If we look up $1.79$ in our standard table, we see that

$$P(Z>1.79) \approx 0.0367.$$

Since the proportion could differ from $0.1$ in either direction, we double this to get a $p$-value of $0.0734$.

Since our $p$-value is greater than $0.05$, we fail to reject the null hypothesis.
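The same computation can be sketched in Python using only the standard library, with `statistics.NormalDist` standing in for the table lookup. The variable names are illustrative. Because no intermediate values are rounded, the $Z$-score comes out near $1.81$ rather than $1.79$ and the $p$-value is roughly $0.07$; it remains above $0.05$.

```python
from math import sqrt
from statistics import NormalDist

p0 = 0.1          # proportion claimed under the null hypothesis
n = 211           # sample size
successes = 29    # left-handed people in the sample

p_hat = successes / n
se = sqrt(p0 * (1 - p0) / n)   # standard deviation of p_hat under the null
z = (p_hat - p0) / se          # Z-score of the observed proportion

# Two-sided p-value: area in both tails beyond |z|
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"p_hat = {p_hat:.4f}, SE = {se:.5f}, Z = {z:.3f}, p-value = {p_value:.4f}")
```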

A picture of the $p$-value

Geometrically, the $p$-value is the area under the standard normal curve in the two tails beyond $Z = \pm 1.79$.

One vs two sided

Note the distinction between two versions of the alternative hypothesis: a one-sided hypothesis claims that the parameter lies on one particular side of the null value, while a two-sided hypothesis claims only that it differs from the null value.

Thus, there are basically two problems here. In each, we compare the null hypothesis to one of the two alternative hypotheses. Written symbolically, our null and alternative hypotheses are

$$ \begin{align} H_0 : p=0.1 \\ H_A : p > 0.1 \end{align} $$

or

$$ \begin{align} H_0 : p=0.1 \\ H_A : p \neq 0.1 \end{align} $$

The first hypothesis test is one-sided; the second is two-sided.
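To see how the choice of alternative changes the $p$-value, here is a minimal sketch using the standard library's `statistics.NormalDist`. The helper `z_p_value` and its `alternative` argument are hypothetical names introduced for illustration.

```python
from statistics import NormalDist

def z_p_value(z: float, alternative: str = "two-sided") -> float:
    """p-value for a Z-score; alternative is 'greater', 'less', or 'two-sided'."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if alternative == "greater":     # H_A: parameter > null value
        return 1 - std_normal.cdf(z)
    if alternative == "less":        # H_A: parameter < null value
        return std_normal.cdf(z)
    if alternative == "two-sided":   # H_A: parameter != null value
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError(f"unknown alternative: {alternative!r}")

z = 1.79  # the table value used in the left-handedness example
one_sided = z_p_value(z, "greater")
two_sided = z_p_value(z, "two-sided")
print(f"one-sided: {one_sided:.4f}, two-sided: {two_sided:.4f}")
```

With this $Z$-score, the one-sided $p$-value falls below $0.05$ while the two-sided one does not, so the choice of alternative can change the conclusion. That is one reason the hypotheses should be stated before looking at the data.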

Hypothesis testing for means

Similar ideas apply to sample means. Here's an example:

Conventional wisdom states that normal human body temperature is $98.6^{\circ}$ but, according to a 1992 research paper entitled "A Critical Appraisal of 98.6 Degrees F" that appeared in The Journal of the American Medical Association, it's actually lower. Let's take a look at the data.

The data

Here's how to grab the data off of my website:

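import pandas as pd
# Load the body temperature data and compute summary statistics
df = pd.read_csv('https://www.marksmath.org/data/normtemp.csv')
temps = df.body_temperature
m = temps.mean()  # sample mean
s = temps.std()   # sample standard deviation (pandas divides by n-1 by default)
n = len(temps)    # sample size
[m,s,n]
[98.24923076923076, 0.7331831580389456, 130]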

So it looks like there were 130 folks in the study with an average temperature of $98.249^{\circ}$ and a standard deviation of $0.733^{\circ}$.

The analysis

The average temperature from this sample is indeed lower than $98.6$, but let's use a one-sided hypothesis test to examine whether this is genuine evidence against the conventional wisdom. To be clear, our hypothesis test looks like so:

  • $H_0$: $\mu=98.6$
  • $H_A$: $\mu < 98.6$.

Let's use a confidence level of $99\%$.

Analysis (cont)

Using the rounded summary statistics, the standard error is $$SE = \frac{\sigma}{\sqrt{n}} \approx \frac{0.733}{\sqrt{130}} \approx 0.0642884.$$

Thus, our $Z$-score is $$Z = \frac{\bar{x}-\mu}{SE} \approx \frac{98.249 - 98.6}{0.0642884} \approx -5.45977.$$

This is literally off our table, so our one-sided $p$-value must surely be much less than $0.01$; thus, we reject the null hypothesis.
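Since the table runs out, the tail area can be computed directly. The following is a minimal sketch using the summary statistics reported above and the standard library's `statistics.NormalDist`; the variable names are illustrative. The inputs are not rounded here, so the $Z$-score comes out near $-5.45$.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics reported for the sample
x_bar = 98.24923076923076   # sample mean
s = 0.7331831580389456      # sample standard deviation
n = 130                     # sample size
mu0 = 98.6                  # mean under the null hypothesis
alpha = 0.01                # significance level for 99% confidence

se = s / sqrt(n)                # standard error of the sample mean
z = (x_bar - mu0) / se          # Z-score of the observed mean
p_value = NormalDist().cdf(z)   # lower tail, since H_A is mu < 98.6

print(f"SE = {se:.5f}, Z = {z:.3f}, p-value = {p_value:.2e}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```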