Sample Surveys

We started playing with data in the first week of the semester. To this point, the data we’ve played with has all come from some CSV file or been scraped from the web. In part III of the text, we learn how data is gathered at a more fundamental level. In particular, we’ll explore the following topics.

Populations and parameters vs samples and statistics

To understand the terms population, parameter, sample, and statistic, it helps to keep in mind that we are often working with some specific study. In that context,

The term population refers to the complete set of entities under consideration,
the term parameter refers to some summary characteristic of the population,
the term sample refers to some subset of the population, and
the term statistic refers to some summary characteristic computed from a sample.

Example

Suppose we’d like to know the average height of women enrolled at UNCA. According to the UNCA Factbook, there were 2147 women enrolled in the Fall of 2017. I’m not even sure how many are enrolled this year; it might be hard to round up all of them.

So, here’s a more practical approach: We have 42 women enrolled in this class who filled out our discourse survey. The average height of those women is 5’5’’. We might hope that this is a good estimate of the average height of all women at UNCA.

This is a fundamental idea in statistics: we wish to study some parameter for a large population, but doing so is too unwieldy or downright impossible. Thus, we choose a manageable sample from the population and estimate the parameter with the corresponding statistic computed from the sample. In the example above,

The population is the set of all women enrolled at UNCA,
the parameter is the average height of women in the population,
the sample is the set of all women enrolled in this class, and
the statistic is the average height of women in the sample, which happens to be 5’5’’.
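To make the four terms concrete, here is a minimal Python sketch. The heights are simulated (a normal distribution with a mean of 65 inches and a standard deviation of 2.5 inches is purely an assumption for illustration, not real UNCA data), and the 42-person sample is drawn at random, which our class is not.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: 2147 simulated heights in inches (NOT real UNCA data).
population = [random.gauss(65, 2.5) for _ in range(2147)]

# Parameter: a summary of the whole population (usually unknown in practice).
parameter = statistics.mean(population)

# Sample: 42 members of the population, here chosen at random.
sample = random.sample(population, 42)

# Statistic: the same summary, computed from the sample only.
statistic = statistics.mean(sample)

print(f"parameter (population mean): {parameter:.2f} in")
print(f"statistic (sample mean):     {statistic:.2f} in")
```

In a real study we would only ever see the last number; the point of the sketch is that the statistic lands near the parameter without equaling it.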

A number of key questions arise:

How well can we expect the statistic to approximate the parameter?
What properties should the sample have in order to maximize the accuracy of our approximation?

The first question will consume much of the second half of our semester under the general topic of inference.

Section 10.1 of the book sums up the answer to the second question using three big ideas:

Examine part of the whole
Randomize
It’s the sample size.

This approach is in contrast to the idea of just grabbing the whole population - often called a census.
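One way to see why sample size matters is to draw many random samples of a few different sizes and watch how much the sample mean bounces around. The sketch below does that on a simulated population; the population values, the helper name spread_of_sample_means, and the choice of 500 repetitions are all assumptions made for illustration.

```python
import random
import statistics

random.seed(2)

# Simulated population of heights in inches (illustrative only, not real data).
population = [random.gauss(65, 2.5) for _ in range(2147)]

def spread_of_sample_means(n, trials=500):
    """Standard deviation of the sample mean over repeated random samples of size n."""
    means = [statistics.mean(random.sample(population, n)) for _ in range(trials)]
    return statistics.stdev(means)

# Larger samples should give sample means that vary less from draw to draw.
for n in (10, 40, 160):
    print(n, round(spread_of_sample_means(n), 3))
```

The printed spread should shrink as n grows, which is the sense in which a larger random sample gives a steadier estimate.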

Potential issues

Volunteers
Convenience
Bad sampling frame
Bias (response and non-response)

Sampling strategies

Here are a few strategies to implement the big ideas

Simple random samples

The idea is so simple - choose \(n\) people from the population independently and with equal probability. It’s a bit hard to achieve in practice, though.

Example: Randomly generated phone numbers
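When a complete list of the population (a sampling frame) is available, a simple random sample is easy to draw in code. The sketch below uses Python's random.sample on a made-up frame of ID numbers; the frame itself is hypothetical. Note that random.sample draws without replacement, so nobody is chosen twice.

```python
import random

random.seed(3)  # fixed seed so the sketch is reproducible

# Hypothetical sampling frame: ID numbers standing in for a list of students.
frame = list(range(1, 2148))  # IDs 1 through 2147

# Draw 42 IDs without replacement; every ID is equally likely to be chosen.
srs = random.sample(frame, 42)

print(sorted(srs)[:5])  # peek at a few of the chosen IDs
```

The hard part in practice is not the call to random.sample but getting a frame that really lists the whole population.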

Other strategies

Stratified samples
e.g., college students by class
Clustered samples
e.g., door-to-door survey by blocks
Systematic sampling
e.g., every tenth person you see
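The three strategies above can be sketched side by side in Python. The roster, the class-year labels, and the block size of 20 are all invented for illustration; this is one plausible way to implement each idea, not a prescribed method.

```python
import random

random.seed(4)  # fixed seed so the sketch is reproducible

# Hypothetical roster: (student_id, class_year) pairs, invented for illustration.
years = ["first-year", "sophomore", "junior", "senior"]
roster = [(i, years[i % 4]) for i in range(400)]

# Stratified: take a simple random sample of 5 within each class year.
stratified = []
for year in years:
    stratum = [s for s in roster if s[1] == year]
    stratified.extend(random.sample(stratum, 5))

# Clustered: split the roster into blocks of 20, then survey 2 whole blocks.
blocks = [roster[i:i + 20] for i in range(0, len(roster), 20)]
clustered = [s for block in random.sample(blocks, 2) for s in block]

# Systematic: pick a random starting point, then take every tenth entry.
start = random.randrange(10)
systematic = roster[start::10]

print(len(stratified), len(clustered), len(systematic))
```

Stratifying guarantees every class year is represented, clustering trades some of that balance for convenience, and systematic sampling relies on the roster order being unrelated to what we are measuring.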