Before break, we learned about discrete random variables and their distributions. Today, we’ll take a look at the process of sampling as a random variable itself. From this perspective, there’s a distribution associated with sampling and an understanding of this distribution allows us to estimate how close statistics computed from the sample are to the corresponding parameters for the whole population. Thus, we’re laying the foundation for inference - the process of drawing quantitative conclusions from our data.

This is essentially chapter 15 and a little bit of 16 in our text.

Sampling as a random variable

The process of computing a statistic based on a random sample can be thought of as a random variable in the following sense:

Example

  • Suppose we draw a random sample of 100 people from a large population and find out how many of them smoke.
  • In North Carolina, it’ll be about \(22\%\) of them - but not exactly.
  • Suppose we do it again. We’ll almost certainly get a different number.
  • Now suppose we do that a bunch of times. We’ll get a bunch of different numbers. The histogram of those numbers represents the sampling distribution associated with the process.

The amazing thing is that the sampling distribution is approximately normal.

A computer experiment

As you recall, we’ve been playing some with R. A recent Stack Overflow blog post indicated how quickly R has been growing. So, let’s illustrate these ideas with some R code. Here’s a function that returns one of two classes (‘s’ or ‘n’), together with an illustrative bar plot.

set.seed(1)
# Classify one individual as a smoker ('s') with probability 0.22,
# or a non-smoker ('n') otherwise. The argument is ignored; it's
# there so the function can be driven by sapply.
pick_one = function(x) {
  u = runif(1)
  if(u < 0.22) {
    return('s')
  } else {
    return('n')
  }
}
barplot(table(sapply(1:12, pick_one)))

Now, suppose we run the process a bunch of times and, each time, we compute the proportion of ‘s’ results. Here’s a histogram of the results:

set.seed(1)
n = 100
# Draw a sample of size n and compute the proportion of smokers in it.
# Again, the argument is ignored so that sapply can drive the repetition.
run_trials = function(x) {
  trials = sapply(1:n, pick_one)
  return(length(trials[trials=='s'])/n)
}
hist(sapply(1:1000, run_trials), 10, xlab='', main='', col='gray')

The crazy thing is, this is approximately normal - regardless of the underlying proportion.

Which normal?

If \(\hat{p}\) can be modeled with a normal distribution, then what are the expectation and standard deviation of that normal? Well, they can be computed from the corresponding quantities for the binomial distribution. We already know these to be the following:

\[E(\hat{p}) = p\] and \[\sigma(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}.\] In the previous example, the expectation is \(0.22\) and the standard deviation is \[\sqrt{\frac{0.22\times0.78}{100}} \approx 0.04142463.\]
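These values are easy to check directly in R; here’s a quick sketch reproducing the numbers above:

```r
# Expectation and standard deviation of p-hat for p = 0.22, n = 100
p = 0.22
n = 100
se = sqrt(p * (1 - p) / n)
se  # approximately 0.04142463
```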

If we scale the rectangles from that previous example so that their total area is one and plot those rectangles together with the normal distribution with mean \(0.22\) and standard deviation \(0.0414\), we get the following picture:
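That picture can be reproduced with code along these lines - a sketch that simulates the sample proportions directly with rbinom rather than reusing pick_one:

```r
set.seed(1)
n = 100
p = 0.22
# 1000 sample proportions, each from a sample of size n
p_hats = rbinom(1000, n, p) / n
# scale the histogram so its total area is one, then overlay the normal curve
hist(p_hats, 10, freq = FALSE, xlab = '', main = '', col = 'gray')
curve(dnorm(x, mean = p, sd = sqrt(p * (1 - p) / n)), add = TRUE, lwd = 2)
```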

Note that the standard deviation gets smaller as the sample size gets bigger. For example, here is the same picture with a sample size of 1000, rather than 100.

Conditions to check for normality

The normal approximation for \(\hat{p}\) is reasonable when

  • the observations are independent - for example, they arise from a simple random sample - and
  • the sample is large enough that we expect at least 10 successes and 10 failures, i.e. \(np \geq 10\) and \(n(1-p) \geq 10\).
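For the smoking example, the standard success-failure rule of thumb (expect at least 10 observations in each class) is easy to check directly:

```r
# success-failure condition: n*p >= 10 and n*(1-p) >= 10
n = 100
p = 0.22
c(n * p, n * (1 - p))                 # 22 and 78 expected counts
(n * p >= 10) && (n * (1 - p) >= 10)  # TRUE
```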

Sampling distributions of other statistics

Recall that the term statistic refers to some summary characteristic computed from a sample. It’s an approximation to the corresponding parameter for the whole population.

For example, when we say that 22% of adult North Carolinians smoke, we’re making an assertion about a parameter called a proportion. In reality, the value of 22% has only been inferred from a statistic computed from a sample.

To this point, we’ve been discussing the computation of a proportion but the same basic idea can be applied to just about any statistic that you might compute. Examples include:

  • the proportion of a sample lying in some class,
  • the mean of a sample, and
  • the total over a sample.

Example - Sampling the mean

I’ve got a CSV file that contains the times for all 54796 non-professional runners from the 2015 Peachtree road race. Let’s read it in and take a look:

library(knitr)
df = read.csv('https://www.marksmath.org/data/peach_tree2015.csv')
kable(head(df))
|    X | Div.Place | Name             | Bib | Age | Place | Gender.Place | Clock.Time | Net.Time | Hometown               | Gender |
|-----:|----------:|:-----------------|----:|----:|------:|-------------:|-----------:|---------:|:-----------------------|:-------|
| 6451 |         1 | SCOTT OVERALL    |  72 |  32 |     1 |            1 |     29.500 |   29.500 | SUTTON, UNITED KINGDOM | M      |
| 6452 |         2 | BEN PAYNE        |  74 |  33 |     2 |            2 |     29.517 |   29.517 | COLORADO SPRINGS, CO   | M      |
| 4092 |         1 | GRIFFITH GRAVES  |  79 |  25 |     3 |            3 |     29.633 |   29.633 | BLOWING ROCK, NC       | M      |
| 4093 |         2 | SCOTT MACPHERSON |  87 |  28 |     4 |            4 |     29.800 |   29.783 | COLUMBIA, MO           | M      |
| 6453 |         3 | ELKANAH KIBET    |  77 |  32 |     5 |            5 |     29.883 |   29.883 | FAYETTEVILLE, NC       | M      |
| 4094 |         3 | MATT LLANO       |  71 |  26 |     6 |            6 |     30.200 |   30.200 | FLAGSTAFF, AZ          | M      |

Now, let’s compare the mean time of all runners to a random sample of 100 of them.

set.seed(1)
c(mean(df$Net.Time), mean(sample(df$Net.Time, 100)))
## [1] 76.08483 76.34933
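To see the sampling distribution of the mean in action, we can repeat that sampling step many times and examine the spread of the results. Here’s a sketch with simulated stand-in data, so it runs without the download; df$Net.Time could be used in place of times if the CSV is loaded:

```r
set.seed(1)
# hypothetical stand-in for the race times: similar size and mean
times = rnorm(54796, mean = 76, sd = 15)
# 1000 samples of 100 runners; record each sample mean
sample_means = replicate(1000, mean(sample(times, 100)))
hist(sample_means, 20, xlab = '', main = '', col = 'gray')
```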

Three key types of sampling distributions

Above, we see three key types of statistical parameter:

  • proportions,
  • means, and
  • totals.

In all three types, we approximate the parameter with a statistical measurement via a normal distribution. The key difference is how we find the correct mean and standard deviation for that normal.

Proportions

For a proportion, we are dealing with categorical data that breaks into two classes - say \(S\) with probability \(p\) and \(F\) with probability \(1-p\). Note that \(p\) is generally unknown. The objective is to find a good estimate for \(p\). We draw a sample of size \(n\) and break the sample into the two classes. Our estimate \(\hat{p}\) for \(p\) is then \[\hat{p} = \#(S)/n.\] The associated standard deviation is \[SD(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}.\]

Means

For a mean, we are dealing with numerical data. We draw a sample of size \(n\) and compute the mean \(\bar{x}\) and standard deviation \(\sigma\) of that sample. The standard deviation of the sample mean is then \[SD(\bar{x}) = \frac{\sigma}{\sqrt{n}}.\]
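As a quick sanity check of this formula, here’s a sketch with simulated, deliberately skewed data; the empirical spread of many sample means should match \(\sigma/\sqrt{n}\):

```r
set.seed(2)
x = rexp(10000, rate = 1/10)   # skewed data with standard deviation near 10
n = 100
se_formula = sd(x) / sqrt(n)
se_empirical = sd(replicate(2000, mean(sample(x, n))))
c(se_formula, se_empirical)    # the two should agree closely
```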

Totals

A total is very similar to a mean. We draw a sample of numerical data of size \(n\) and compute the total \(T\) and standard deviation \(\sigma\) of that sample. The standard deviation of the sample total is then \[SD(T) = \sqrt{n}\ \sigma.\]
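A similar check works for the total, using numbers like those in the example below (again, just a sketch with simulated data):

```r
set.seed(3)
x = rnorm(10000, mean = 12.5, sd = 5.25)
n = 100
# 2000 sample totals, each over a sample of size n
totals = replicate(2000, sum(sample(x, n)))
c(sd(totals), sqrt(n) * sd(x))   # the two should agree closely
```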

Example

Historical data for a non-profit indicates that cold-callers average $12.50 per call with a standard deviation of $5.25.

  1. If a cold-caller makes 100 calls, what is the probability that they rake in more than $1300?
  2. How much might they make on a really good day - say, top 5%?

Answer: We’ll model this situation with a normal distribution with mean \[100\times12.50 = 1250\] and standard deviation \[\sqrt{100}\times5.25 = 52.5.\]

To answer the first part with R, we use the command

1-pnorm(1300, 1250, 52.5)
## [1] 0.1704519

To answer the second part we use the inverse command qnorm:

qnorm(0.95,1250,52.5)
## [1] 1336.355