

Students from the 2011 YRBSS (Youth Risk Behavior Surveillance System) lifted weights (or performed other strength training exercises) 3.09 days per week on average. We want to determine if the YRBSS sample data set provides strong evidence that YRBSS students selected in 2013 are lifting more or less than the 2011 YRBSS students, versus the other possibility that there has been no change.

We simplify these three options into two competing hypotheses:

  • \(H_0\): The average days per week that YRBSS students lifted weights was the same for 2011 and 2013.
  • \(H_A\): The average days per week that YRBSS students lifted weights was different in 2013 than in 2011.

Confidence intervals

  • Denote the average days per week that YRBSS students lifted weights in 2011 by \(\mu_{11}\); this is also known as the null value.
  • Denote the average days per week that YRBSS students lifted weights in 2013 by \(\mu_{13}\).
  • To reject the null hypothesis, we would need to find a confidence interval for \(\mu_{13}\) that does not contain \(\mu_{11}\).

Pushing the example further

  • Suppose \(\mu_{11}\) is known to be 3.09.
  • To reject \(H_0\), we need a confidence interval for \(\mu_{13}\) that doesn’t contain 3.09.
  • Suppose our sample of 100 students from the 2013 YRBSS survey has an average of \(\bar{x} = 2.78\) days with a standard deviation of \(s = 2.56\) days.
  • General confidence interval: \[\bar{x} \pm z^{*} SE_{\bar{x}}\]
  • If we’d like a 95% confidence interval, we take \(z^{*} = 1.96\).
  • The standard error is \[SE_{\bar{x}} = \frac{s_{13}}{\sqrt{n}}= \frac{2.56}{\sqrt{100}}=0.256.\]
  • The confidence interval is \(2.78 \pm 1.96 \times 0.256\), or about \((2.28,3.28)\).
  • Since 3.09 lies inside this interval, we do not reject the null hypothesis.
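
The interval arithmetic above can be reproduced in R. This is a minimal sketch that uses only the summary statistics quoted in the example; the variable names are illustrative, and it uses `qnorm(0.975)` in place of the rounded value 1.96.

```r
# Summary statistics quoted in the example (2013 YRBSS sample)
xbar <- 2.78   # sample mean, days per week
s    <- 2.56   # sample standard deviation
n    <- 100    # sample size
mu11 <- 3.09   # null value: the 2011 average

se     <- s / sqrt(n)            # standard error of the sample mean
z_star <- qnorm(0.975)           # critical value for 95% confidence, about 1.96
ci     <- xbar + c(-1, 1) * z_star * se

ci                               # roughly (2.28, 3.28)
mu11 >= ci[1] && mu11 <= ci[2]   # TRUE: the null value lies inside the interval
```

Because the null value falls inside the interval, the sample gives no grounds to reject \(H_0\).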

Types of errors

  • Type 1: rejecting the null hypothesis when it is actually true
  • Type 2: failing to reject the null hypothesis when it is actually false
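
One way to get a feel for Type 1 errors is to simulate them. The sketch below is an illustration, not part of the YRBSS analysis: it assumes a normal population whose true mean equals the null value (the mean 3.09 and standard deviation 2.56 are borrowed from the example above purely for concreteness), so every rejection is a Type 1 error. With 95% confidence intervals, roughly 5% of samples should reject.

```r
set.seed(1)     # for reproducibility
mu0  <- 3.09    # null value; here it is also the true mean, so H_0 is true
sdev <- 2.56    # assumed population standard deviation
n    <- 100     # sample size
reps <- 2000    # number of simulated samples

# TRUE when the 95% confidence interval for one simulated sample misses mu0
rejects_null <- function() {
  x  <- rnorm(n, mean = mu0, sd = sdev)
  se <- sd(x) / sqrt(n)
  abs(mean(x) - mu0) > qnorm(0.975) * se
}

type1_rate <- mean(replicate(reps, rejects_null()))
type1_rate      # should be close to 0.05
```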

Significance levels

Example - Coin flipping

If we flip a coin 10 times then, assuming the coin is fair, the probability of 10 straight heads is \(1/2^{10} = 1/1024 \approx 0.00098\). If you actually saw 10 straight heads, you’d have cause to doubt the fairness of the coin.

In the context of statistical studies, we often use a normal distribution to compute p-values. If, in the case of the coin, we suspect the coin is weighted toward heads, we’d write:

  • \(H_0\): \(\mu=5\) (the expected number of heads in 10 flips)
  • \(H_A\): \(\mu>5\)

Using a normal distribution with mean \(\mu = 5\) and standard deviation \(\sqrt{10}/2\), the estimated probability of observing 10 or more heads is

1 - pnorm(10,5,sqrt(10)/2)
## [1] 0.0007827011
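
Because the number of heads in 10 flips is a binomial count, the normal figure can be checked against the exact probability. The sketch below uses base R's `pbinom`; the variable names are illustrative.

```r
n <- 10    # number of flips
p <- 0.5   # probability of heads under H_0 (fair coin)

# Exact probability of 10 heads: P(X > 9) for X ~ Binomial(10, 0.5)
exact <- pbinom(9, n, p, lower.tail = FALSE)

# Normal approximation using the binomial mean and standard deviation
normal_approx <- 1 - pnorm(10, mean = n * p, sd = sqrt(n * p * (1 - p)))

c(exact = exact, normal_approx = normal_approx)
```

The two values are of the same order of magnitude. With only 10 flips the sample is well below the usual size guideline, so the normal figure should be read as a rough estimate.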

Conditions to check for normality

  • Random sample
  • Need less than 10% of population for independence
  • Large enough sample size (typically, at least 30)

Another example

The internet will happily tell you that we all slow down with age. Let’s test that using some data from the 2015 Peachtree road race. I’ve got a CSV file that contains the times for all 54796 non-professional runners. Let’s read it in and take a look:

df <- read.csv('')
dim(df)
## [1] 54796    11
X     Div.Place  Name              Bib  Age  Place  Gender.Place  Clock.Time  Net.Time  Hometown                Gender
6451  1          SCOTT OVERALL     72   32   1      1             29.500      29.500    SUTTON, UNITED KINGDOM  M
6452  2          BEN PAYNE         74   33   2      2             29.517      29.517    COLORADO SPRINGS, CO    M
4092  1          GRIFFITH GRAVES   79   25   3      3             29.633      29.633    BLOWING ROCK, NC        M
4093  2          SCOTT MACPHERSON  87   28   4      4             29.800      29.783    COLUMBIA, MO            M
6453  3          ELKANAH KIBET     77   32   5      5             29.883      29.883    FAYETTEVILLE, NC        M
4094  3          MATT LLANO        71   26   6      6             30.200      30.200    FLAGSTAFF, AZ           M

Let’s grab a “young” subset of men between the ages of 35 and 40 and an “old” subset of men between the ages of 40 and 45.

men <- subset(df, Gender == 'M') # == tests equality; a single = would not filter any rows
young <- subset(men, 35<=Age & Age<40)
old <- subset(men, 40<=Age & Age<45)

We’ll then select a random sample of size 100 from each age group and compute the sample means.

set.seed(1) # For reproducibility
young_times <- sample(young$Net.Time, 100)
old_times <- sample(old$Net.Time, 100)

mu_young <- mean(young_times)
mu_old <- mean(old_times)
c(mu_young, mu_old)
## [1] 73.41334 77.00703

Perhaps we’re not surprised to see that the sample means satisfy mu_old > mu_young, but is the result statistically significant, or is it likely just due to chance?

Put another way, let \(\mu\) be the population mean of the net times for the older group. Our null and alternative hypotheses may be written symbolically:

  • \(H_0\): \(\mu=73.41334\)
  • \(H_A\): \(\mu>73.41334\)

Do we have sufficient evidence to reject \(H_0\)?

We use the \(p\)-value to explore this question. That is, we compute the probability that we could get the observed sample mean mu_old or higher under the assumption that the sample mean is normally distributed with mean mu_young. Since we are investigating the distribution of the sample mean, we use the standard error as the standard deviation.

Before going to all this trouble, we should mention that (1) we have genuine random samples of (2) large enough size. While the data are a bit skewed, it’s not so bad with a sample of size 100.

hist(old_times, 6)

Now, here’s the critical computation:

se <- sd(old_times)/sqrt(100) # standard error: sample sd over sqrt(n), with n = 100
se
## [1] 2.175528
1 - pnorm(mu_old, mu_young, se)
## [1] 0.04928052

Thus, at the 5% significance level, we (barely) reject the null hypothesis.
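
The same p-value can be phrased in terms of a standardized test statistic. The sketch below plugs in the sample means and standard error printed above, so it runs without the race data; the variable names ending in `_obs` are introduced here for illustration, and 0.05 is the conventional cutoff, assumed here.

```r
# Values copied from the output above
mu_young_obs <- 73.41334   # sample mean, younger group (used as the null value)
mu_old_obs   <- 77.00703   # sample mean, older group
se_obs       <- 2.175528   # standard error of the older group's sample mean

z <- (mu_old_obs - mu_young_obs) / se_obs   # standardized test statistic
p_value <- pnorm(z, lower.tail = FALSE)     # one-sided p-value, P(Z > z)

c(z = z, p_value = p_value)
p_value < 0.05                              # TRUE, but only just
```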

Sample proportions

Much data (often, categorical data) is more easily stated in terms of proportions, rather than in raw quantities. For example, we might be interested in the proportion of people who respond to a medical treatment, or the proportion of gun owners in a city, or the proportion of folks who are left handed.

To run hypothesis tests on proportions, we need to understand the mean and standard deviation associated with proportions. Recall first the mean and standard deviation associated with the binomial distribution.

Suppose the probability of a single trial being a success is \(p\). Then, the probability of observing exactly \(k\) successes in \(n\) independent trials is given by

\[{n\choose k}p^k(1-p)^{n-k} = \frac{n!}{k!(n-k)!}p^k(1-p)^{n-k}.\]
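
As a quick check, the formula can be evaluated directly and compared with R's built-in `dbinom`. The values of `n`, `k`, and `p` below are arbitrary choices for illustration.

```r
n <- 10    # number of trials
k <- 3     # number of successes
p <- 0.5   # probability of success on one trial

by_formula <- choose(n, k) * p^k * (1 - p)^(n - k)
built_in   <- dbinom(k, size = n, prob = p)

c(by_formula = by_formula, built_in = built_in)   # both equal 120/1024
```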

Additionally, the mean, variance, and standard deviation of the number of observed successes are \[\begin{align} \mu &= np &\sigma^2 &= np(1-p) &\sigma &= \sqrt{np(1-p)}. \end{align}\]

Now, if \(X\) is a random variable that tells us the raw quantity, then \[\hat{p} = \frac{X}{n}\] is a random variable that tells us the proportion. To get the mean and standard deviation of \(\hat{p}\), we just divide the mean and standard deviation of \(X\) by \(n\) (the variance is divided by \(n^2\)). Thus, \[\begin{align} \mu &= p &\sigma^2 &= p(1-p)/n &\sigma &= \sqrt{p(1-p)/n}. \end{align}\]
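
These formulas can be checked by simulation. The sketch below draws many binomial counts with `rbinom`, converts them to proportions, and compares the simulated mean and standard deviation with \(p\) and \(\sqrt{p(1-p)/n}\). The values of `n`, `p`, and `reps` are illustrative.

```r
set.seed(1)      # for reproducibility
n    <- 211      # sample size
p    <- 0.1      # true proportion
reps <- 10000    # number of simulated samples

p_hat <- rbinom(reps, size = n, prob = p) / n   # simulated sample proportions

c(simulated_mean = mean(p_hat), theory_mean = p)
c(simulated_sd = sd(p_hat), theory_sd = sqrt(p * (1 - p) / n))
```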


Suppose it is claimed that about 10% of people are left handed. A random sample of 211 people found that 29 were left handed. We can ask several questions about this sample:

  • Does this data support the null hypothesis that 10% of the population is left handed?
  • Does this data support the alternative hypothesis that more than 10% of the population is left handed?
  • Does this data support the alternative hypothesis that the proportion of the population that is left handed is not 10%?

There are basically two problems here. In both, we must compare the null hypothesis to one of the two alternative hypotheses. Written symbolically in terms of the population proportion \(p\), our null and alternative hypotheses are \[\begin{align} H_0 &: p=0.1 \\ H_A &: p > 0.1 \end{align}\] or \[\begin{align} H_0 &: p=0.1 \\ H_A &: p \neq 0.1 \end{align}\]

The first hypothesis test is one-sided; the second is two-sided.

The fundamental definition of a p-value is still the same: the probability of obtaining the observed data or something more extreme, under the assumption of the null hypothesis. In this problem, our null mean and standard deviation are \(0.1\) and \[\sqrt{0.1\times0.9/211} = 0.02065285.\] Our observed data is \(\hat{p} = 29/211 \approx 0.137\), which is larger than \(0.1\).

For the first, one-sided test, the p-value is

1 - pnorm(29/211, 0.1, sqrt(0.1*0.9/211))
## [1] 0.0349266

As this is smaller than 0.05, we reject the null hypothesis at the 5% significance level. For the second, two-sided test, the p-value is twice this, about 0.0699, which is larger than 0.05; thus we don’t reject the null hypothesis.
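
Both p-values can be computed side by side. This is a sketch of the normal-approximation calculation used above; the variable names are illustrative, and the 0.05 cutoff is the conventional significance level, assumed here.

```r
x  <- 29     # left-handed people observed
n  <- 211    # sample size
p0 <- 0.1    # null proportion

p_hat <- x / n
se    <- sqrt(p0 * (1 - p0) / n)   # standard deviation of p_hat under H_0
z     <- (p_hat - p0) / se

p_one_sided <- pnorm(z, lower.tail = FALSE)            # H_A: p > 0.1
p_two_sided <- 2 * pnorm(abs(z), lower.tail = FALSE)   # H_A: p != 0.1

c(one_sided = p_one_sided, two_sided = p_two_sided)
c(p_one_sided < 0.05, p_two_sided < 0.05)   # TRUE FALSE
```

The same sample leads to different decisions under the two alternatives, which is why the choice between a one-sided and a two-sided test should be made before looking at the data.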