Before break, we learned about discrete random variables and their distributions. Today, we’ll take a look at the process of sampling as a random variable itself. From this perspective, there’s a distribution associated with sampling and an understanding of this distribution allows us to estimate how close statistics computed from the sample are to the corresponding parameters for the whole population. Thus, we’re laying the foundation for inference - the process of drawing quantitative conclusions from our data.
This is essentially chapter 15 and a little bit of 16 in our text.
The process of computing a statistic from a random sample can itself be thought of as a random variable in the following sense: the sample is chosen at random, so any number computed from that sample is random as well. The distribution of this random variable is called the sampling distribution of the statistic.
The amazing thing is that the sampling distribution is approximately normal.
As you recall, we’ve been playing around with R a bit. A recent Stack Overflow blog post indicated how quickly R has been growing. So, let’s illustrate these ideas with some R code. Here’s a function that returns one of two classes (‘s’ or ‘n’), together with an illustrative bar plot.
set.seed(1)

# Return 's' with probability 0.22 and 'n' otherwise.
# The argument x is ignored; it's there so we can use sapply below.
pick_one = function(x) {
  u = runif(1)
  if (u < 0.22) {
    return('s')
  } else {
    return('n')
  }
}

# Draw 12 classes and display the counts as a bar plot.
barplot(table(sapply(1:12, pick_one)))
Now, suppose we run the process a bunch of times and, each time, we compute the proportion of ’s’s. Here’s a histogram of the results:
set.seed(1)
n = 100

# Run pick_one n times and return the proportion of 's' results.
run_trials = function(x) {
  trials = sapply(1:n, pick_one)
  return(length(trials[trials == 's']) / n)
}

# Repeat the experiment 1000 times and histogram the resulting proportions.
hist(sapply(1:1000, run_trials), 10, xlab = '', main = '', col = 'gray')
The crazy thing is, this is approximately normal - regardless of the underlying proportion.
If \(\hat{p}\) can be modeled with a normal distribution, then what are the expectation and standard deviation of that normal? Well, they can be computed from the corresponding quantities for the binomial distribution. We already know these to be the following:
\[E(\hat{p}) = p\] and \[\sigma(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}.\] In the previous example, the expectation is \(0.22\) and the standard deviation is \[\sqrt{\frac{0.22\times0.78}{100}} \approx 0.04142463.\]
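We can check that arithmetic directly in R:

# standard deviation of p-hat for p = 0.22 and n = 100
sqrt(0.22 * 0.78 / 100)

## [1] 0.04142463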
If we scale the rectangles from that previous example so that their total area is one and plot those rectangles together with the normal distribution with mean \(0.22\) and standard deviation \(0.0414\), we get the following picture:
Note that the standard deviation gets smaller as the sample size gets bigger. For example, here is the same picture with 1000 trials, rather than 100.
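Here’s a sketch of how such an overlay might be produced; it reuses the pick_one and run_trials functions from above, and changing n from 100 to 1000 yields the second, narrower picture.

p = 0.22
n = 100

# Simulate 1000 sample proportions, plot a density-scaled histogram,
# and overlay the normal curve with mean p and standard deviation sqrt(p(1-p)/n).
p_hats = sapply(1:1000, run_trials)
hist(p_hats, 10, freq = FALSE, xlab = '', main = '', col = 'gray')
curve(dnorm(x, mean = p, sd = sqrt(p * (1 - p) / n)), add = TRUE, lwd = 2)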
Recall that the term statistic refers to some summary characteristic computed from a sample. It’s an approximation to the corresponding parameter for the whole population.
For example, when we say that 22% of adult North Carolinians smoke, we’re making an assertion about a parameter called a proportion. In reality, the value of 22% has only been inferred from a statistic computed from a sample.
To this point, we’ve been discussing the computation of a proportion, but the same basic idea can be applied to just about any statistic that you might compute. Examples include the mean and the total of a numerical variable, both of which we explore below.
I’ve got a CSV file that contains the times for all 54796 non-professional runners from the 2015 Peachtree road race. Let’s read it in and take a look:
library(knitr)
df = read.csv('https://www.marksmath.org/data/peach_tree2015.csv')
kable(head(df))
| X | Div.Place | Name | Bib | Age | Place | Gender.Place | Clock.Time | Net.Time | Hometown | Gender |
|---|---|---|---|---|---|---|---|---|---|---|
| 6451 | 1 | SCOTT OVERALL | 72 | 32 | 1 | 1 | 29.500 | 29.500 | SUTTON, UNITED KINGDOM | M |
| 6452 | 2 | BEN PAYNE | 74 | 33 | 2 | 2 | 29.517 | 29.517 | COLORADO SPRINGS, CO | M |
| 4092 | 1 | GRIFFITH GRAVES | 79 | 25 | 3 | 3 | 29.633 | 29.633 | BLOWING ROCK, NC | M |
| 4093 | 2 | SCOTT MACPHERSON | 87 | 28 | 4 | 4 | 29.800 | 29.783 | COLUMBIA, MO | M |
| 6453 | 3 | ELKANAH KIBET | 77 | 32 | 5 | 5 | 29.883 | 29.883 | FAYETTEVILLE, NC | M |
| 4094 | 3 | MATT LLANO | 71 | 26 | 6 | 6 | 30.200 | 30.200 | FLAGSTAFF, AZ | M |
Now, let’s compare the mean time of all runners to a random sample of 100 of them.
set.seed(1)
c(mean(df$Net.Time), mean(sample(df$Net.Time, 100)))
## [1] 76.08483 76.34933
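The two numbers are close, and that’s no accident. If we repeat the sampling many times, the sample means themselves pile up in a roughly normal shape around the population mean; here’s a quick sketch:

# Draw 1000 samples of size 100 and histogram the resulting sample means.
sample_means = sapply(1:1000, function(i) mean(sample(df$Net.Time, 100)))
hist(sample_means, 20, xlab = '', main = '', col = 'gray')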
This brings us to three key types of statistical parameter: a proportion, a mean, and a total.
In all three cases, we approximate the parameter by a statistic computed from a sample and model that statistic with a normal distribution. The key difference is how we find the correct mean and standard deviation for that normal.
For a proportion, we are dealing with categorical data that breaks into two classes - say \(S\) with probability \(p\) and \(F\) with probability \(1-p\). Note that \(p\) is generally unknown. The objective is to find a good estimate for \(p\). We draw a sample of size \(n\) and break the sample into the two classes. Our estimate \(\hat{p}\) for \(p\) is then \[\hat{p} = \#(S)/n.\] The associated standard deviation is \[SD(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}.\]
For a mean, we are dealing with numerical data. We draw a sample of size \(n\) and compute the mean \(\bar{x}\) and standard deviation \(\sigma\) of that sample. The standard deviation of the sample mean is then \[SD(\bar{x}) = \frac{\sigma}{\sqrt{n}}.\]
A total is very similar to a mean. We draw a sample of numerical data of size \(n\) and compute the total \(T\) and standard deviation \(\sigma\) of that sample. The standard deviation of the sample total is then \[SD(T) = \sqrt{n}\ \sigma.\]
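To make these three formulas concrete, here’s a sketch using the Peachtree data loaded above. The use of the Gender column for the proportion, a sample of size 100, and the plug-in of \(\hat{p}\) for the unknown \(p\) are just illustrative choices.

n = 100
samp = df[sample(nrow(df), n), ]

# Proportion: estimated proportion of female runners and its standard deviation,
# using p-hat in place of the unknown p.
p_hat = mean(samp$Gender == 'F')
sd_p = sqrt(p_hat * (1 - p_hat) / n)

# Mean: sample mean of the net times and the standard deviation of that mean.
x_bar = mean(samp$Net.Time)
sd_xbar = sd(samp$Net.Time) / sqrt(n)

# Total: sample total of the net times and the standard deviation of that total.
tot = sum(samp$Net.Time)
sd_tot = sqrt(n) * sd(samp$Net.Time)

c(p_hat, sd_p, x_bar, sd_xbar, tot, sd_tot)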
Historical data for a non-profit indicates that cold-callers average $12.50 per call with a standard deviation of $5.25. Suppose a caller makes 100 calls. What is the probability that the total raised exceeds $1300? And how large would the total need to be to lie in the top 5%?
Answer: We’ll model the total with a normal distribution with mean \[100\times12.50 = 1250\] and standard deviation \[\sqrt{100}\times5.25 = 52.5.\]
To answer the first part with R, we use the command
1-pnorm(1300, 1250, 52.5)
## [1] 0.1704519
To answer the second part, we use the inverse command qnorm:
qnorm(0.95,1250,52.5)
## [1] 1336.355