Before break, we learned about discrete random variables and their distributions. Today, we’ll take a look at the process of sampling as a random variable itself. From this perspective, there’s a distribution associated with sampling and an understanding of this distribution allows us to estimate how close statistics computed from the sample are to the corresponding parameters for the whole population. Thus, we’re laying the foundation for inference - the process of drawing quantitative conclusions from our data.
This is essentially chapter 15 and a little bit of 16 in our text.
The process of computing a statistic from a random sample can itself be thought of as a random variable in the following sense: the sample is chosen at random, so any number computed from that sample is random as well. The distribution of this random variable is called the sampling distribution of the statistic.
The amazing thing is that the sampling distribution is approximately normal.
As you recall, we’ve been playing around with R a bit. A recent Stack Overflow blog post indicated how quickly R has been growing. So, let’s illustrate these ideas with some R code. Here’s a function that returns one of two classes (‘s’ or ‘n’), together with an illustrative bar plot.
set.seed(1)

# Return 's' with probability 0.22 and 'n' otherwise.
# The argument x is ignored; it's there so we can use sapply below.
pick_one = function(x) {
  u = runif(1)
  if (u < 0.22) {
    return('s')
  } else {
    return('n')
  }
}

# Draw 12 classes and display the counts as a bar plot.
barplot(table(sapply(1:12, pick_one)))
Now, suppose we run the process a bunch of times and, each time, we compute the proportion of ’s’s. Here’s a histogram of the results:
set.seed(1)
n = 100

# Run pick_one n times and return the proportion of 's' results.
run_trials = function(x) {
  trials = sapply(1:n, pick_one)
  return(length(trials[trials == 's']) / n)
}

# Repeat the experiment 1000 times and histogram the resulting proportions.
hist(sapply(1:1000, run_trials), 10, xlab = '', main = '', col = 'gray')
The crazy thing is, this is approximately normal - regardless of the underlying proportion.
If \(\hat{p}\) can be modeled with a normal distribution, then what are the expectation and standard deviation of that normal? Well, they can be computed from the corresponding quantities for the binomial distribution. We already know these to be the following:
\[E(\hat{p}) = p\] and \[\sigma(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}.\] In the previous example, the expectation is \(0.22\) and the standard deviation is \[\sqrt{\frac{0.22\times0.78}{100}} \approx 0.04142463.\]
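We can check that arithmetic directly in R:

# standard deviation of p-hat for p = 0.22 and n = 100
sqrt(0.22 * 0.78 / 100)

## [1] 0.04142463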
If we scale the rectangles from that previous example so that their total area is one and plot those rectangles together with the normal distribution with mean \(0.22\) and standard deviation \(0.0414\), we get the following picture:
Note that the standard deviation gets smaller as the sample size gets bigger. For example, here is the same picture with 1000 trials, rather than 100.
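Here’s a sketch of how such an overlay might be produced; it reuses the pick_one and run_trials functions from above, and changing n from 100 to 1000 yields the second, narrower picture.

p = 0.22
n = 100

# Simulate 1000 sample proportions, plot a density-scaled histogram,
# and overlay the normal curve with mean p and standard deviation sqrt(p(1-p)/n).
p_hats = sapply(1:1000, run_trials)
hist(p_hats, 10, freq = FALSE, xlab = '', main = '', col = 'gray')
curve(dnorm(x, mean = p, sd = sqrt(p * (1 - p) / n)), add = TRUE, lwd = 2)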
Recall that the term statistic refers to some summary characteristic computed from a sample. It’s an approximation to the corresponding parameter for the whole population.
For example, when we say that 22% of adult North Carolinians smoke, we’re making an assertion about a parameter called a proportion. In reality, the value of 22% has only been inferred from a statistic computed from a sample.
To this point, we’ve been discussing the computation of a proportion, but the same basic idea can be applied to just about any statistic that you might compute. Examples include the mean and the total of a numerical variable, both of which we explore below.
I’ve got a CSV file that contains the times for all 54796 non-professional runners from the 2015 Peachtree road race. Let’s read it in and take a look:
library(knitr)
df = read.csv('https://www.marksmath.org/data/peach_tree2015.csv')
kable(head(df))
| X | Div.Place | Name | Bib | Age | Place | Gender.Place | Clock.Time | Net.Time | Hometown | Gender |
|---|---|---|---|---|---|---|---|---|---|---|
| 6451 | 1 | SCOTT OVERALL | 72 | 32 | 1 | 1 | 29.500 | 29.500 | SUTTON, UNITED KINGDOM | M |
| 6452 | 2 | BEN PAYNE | 74 | 33 | 2 | 2 | 29.517 | 29.517 | COLORADO SPRINGS, CO | M |
| 4092 | 1 | GRIFFITH GRAVES | 79 | 25 | 3 | 3 | 29.633 | 29.633 | BLOWING ROCK, NC | M |
| 4093 | 2 | SCOTT MACPHERSON | 87 | 28 | 4 | 4 | 29.800 | 29.783 | COLUMBIA, MO | M |
| 6453 | 3 | ELKANAH KIBET | 77 | 32 | 5 | 5 | 29.883 | 29.883 | FAYETTEVILLE, NC | M |
| 4094 | 3 | MATT LLANO | 71 | 26 | 6 | 6 | 30.200 | 30.200 | FLAGSTAFF, AZ | M |
Now, let’s compare the mean time of all runners to a random sample of 100 of them.
set.seed(1)
c(mean(df$Net.Time), mean(sample(df$Net.Time, 100)))
## [1] 76.08483 76.34933
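The two numbers are close, and that’s no accident. If we repeat the sampling many times, the sample means themselves pile up in a roughly normal shape around the population mean; here’s a quick sketch:

# Draw 1000 samples of size 100 and histogram the resulting sample means.
sample_means = sapply(1:1000, function(i) mean(sample(df$Net.Time, 100)))
hist(sample_means, 20, xlab = '', main = '', col = 'gray')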
This brings us to three key types of statistical parameter: a proportion, a mean, and a total.
In all three cases, we approximate the parameter by a statistic computed from a sample and model that statistic with a normal distribution. The key difference is how we find the correct mean and standard deviation for that normal.
For a proportion, we are dealing with categorical data that breaks into two classes - say \(S\) with probability \(p\) and \(F\) with probability \(1-p\). Note that \(p\) is generally unknown. The objective is to find a good estimate for \(p\). We draw a sample of size \(n\) and break the sample into the two classes. Our estimate \(\hat{p}\) for \(p\) is then \[\hat{p} = \#(S)/n.\] The associated standard deviation is \[SD(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}.\]
For a mean, we are dealing with numerical data. We draw a sample of size \(n\) and compute the mean \(\bar{x}\) and standard deviation \(\sigma\) of that sample. The standard deviation of the sample mean is then \[SD(\bar{x}) = \frac{\sigma}{\sqrt{n}}.\]
A total is very similar to a mean. We draw a sample of numerical data of size \(n\) and compute the total \(T\) and standard deviation \(\sigma\) of that sample. The standard deviation of the sample total is then \[SD(T) = \sqrt{n}\ \sigma.\]
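To make these three formulas concrete, here’s a sketch using the Peachtree data loaded above. The use of the Gender column for the proportion, a sample of size 100, and the plug-in of \(\hat{p}\) for the unknown \(p\) are just illustrative choices.

n = 100
samp = df[sample(nrow(df), n), ]

# Proportion: estimated proportion of female runners and its standard deviation,
# using p-hat in place of the unknown p.
p_hat = mean(samp$Gender == 'F')
sd_p = sqrt(p_hat * (1 - p_hat) / n)

# Mean: sample mean of the net times and the standard deviation of that mean.
x_bar = mean(samp$Net.Time)
sd_xbar = sd(samp$Net.Time) / sqrt(n)

# Total: sample total of the net times and the standard deviation of that total.
tot = sum(samp$Net.Time)
sd_tot = sqrt(n) * sd(samp$Net.Time)

c(p_hat, sd_p, x_bar, sd_xbar, tot, sd_tot)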
Historical data for a non-profit indicates that cold-callers average $12.50 per call with a standard deviation of $5.25. Suppose a caller makes 100 calls. What is the probability that the total raised exceeds $1300? And how large would the total need to be to lie in the top 5%?
Answer: We’ll model the total with a normal distribution with mean \[100\times12.50 = 1250\] and standard deviation \[\sqrt{100}\times5.25 = 52.5.\]
To answer the first part with R, we use the command
1-pnorm(1300, 1250, 52.5)
## [1] 0.1704519
To answer the second part, we use the inverse command qnorm:
qnorm(0.95,1250,52.5)
## [1] 1336.355