Getting data with Studies or Experiments

Last time we learned how to read data tables into the computer and do a little analysis. Today, we'll learn how data is generated at a more fundamental level.

This is all based on sections 1.3 and 1.4 of our text, although in a somewhat different order.

Observational studies vs controlled experiments

There are two main techniques for systematically collecting data:

  • Observational studies and
  • Controlled experiments

Observational studies

An observational study is one where the data collection does not interfere with how the data arises. Examples include

  • Surveys,
  • Review of records,
  • Simple measurements

Controlled experiments

In an experiment, researchers actively work with the samples, applying treatments and observing effects.

  • Often (if not always), the intention is to explore a relationship, or even a causal effect, between two variables.
  • For that reason, it's often important to distinguish between an explanatory variable and a response variable.

Quick examples from the news

Two examples we'll refer to repeatedly: a survey of 8,412 coffee drinkers in the UK examining blood vessel stiffness, and a study of 672 women under the age of 59 tracking survival from a specific cancer over 42 months.

Anecdotal evidence

It's worth mentioning that anecdotal evidence does not generate data that is reliable enough to inform decision making.

  • Example: My sister took 8 years to graduate from OU, so OU must be really hard.

Some sampling terminology

You can't clearly articulate a research question without first clearly identifying the population that you're working with. Here are some related terms:

  • Population refers to the complete set of entities under consideration,
  • Sample refers to some subset of the population,
  • Parameter refers to some summary characteristic of the population, and
  • Statistic refers to some summary characteristic computed from a sample.
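
To make these four terms concrete, here's a minimal Python sketch. The population values are simulated purely for illustration; nothing here comes from real data.

```python
import random

random.seed(1)

# Hypothetical population: heights (in inches) of 2,077 women.
# These values are simulated purely for illustration.
population = [random.gauss(64.5, 2.5) for _ in range(2077)]

# Parameter: a summary characteristic of the *whole* population.
parameter = sum(population) / len(population)

# Sample: a subset of the population, chosen at random.
sample = random.sample(population, 23)

# Statistic: the same summary, computed from the sample alone.
statistic = sum(sample) / len(sample)

print(f"parameter (population mean): {parameter:.2f} inches")
print(f"statistic (sample mean):     {statistic:.2f} inches")
```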

Coffee example

In the coffee example:

  • The population might be all the coffee drinkers in the UK.
  • The sample consists of the 8,412 people who were surveyed.
  • A parameter of interest would be the proportion of all individuals in the population with stiff blood vessels.
  • A statistic would be the computed proportion of individuals from the sample with stiff blood vessels.

Cancer example

In the cancer example:

  • The population would be all women under the age of 59 with this specific cancer.
  • The sample would be the 672 women under the age of 59 who participated.
  • A parameter would be the survival rate among the population after 42 months.
  • A statistic would be the survival rate among the sample after 42 months.

Causation and correlation

Often we are interested in the relationship between two variables - specifically, is there a correlation or even a causal relationship between them? In the context of a study, we should clearly identify:

  • Independent or explanatory variables
  • Dependent or response variables
  • Confounding or lurking variables

Generally, an explanatory variable is one that a researcher suspects might affect the response variable. Correlation, however, does not always imply causation.

Example: We suspect that folks who use more sunscreen have a higher incidence of skin cancer. What are the explanatory and response variables - and are there any confounding variables? (A simulation illustrating one possibility follows.)
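
One plausible confounding variable here is time spent in the sun. The sketch below uses entirely made-up probabilities: sun exposure drives both sunscreen use and cancer risk, so the two end up correlated even though sunscreen causes nothing in this toy model.

```python
import random

random.seed(2)

# Toy model: sun exposure (the confounder) drives *both* sunscreen use
# and skin cancer risk; sunscreen itself has no effect on cancer here.
with_sunscreen = [0, 0]      # [people using sunscreen, cancers among them]
without_sunscreen = [0, 0]   # [people not using it, cancers among them]

for _ in range(100_000):
    sun = random.uniform(0, 10)               # hours of sun exposure
    sunscreen = random.random() < sun / 10    # more sun -> more sunscreen use
    cancer = random.random() < 0.02 * sun     # more sun -> more cancer

    group = with_sunscreen if sunscreen else without_sunscreen
    group[0] += 1
    group[1] += cancer

print("cancer rate among sunscreen users:",
      round(with_sunscreen[1] / with_sunscreen[0], 3))
print("cancer rate among non-users:      ",
      round(without_sunscreen[1] / without_sunscreen[0], 3))
```

Sunscreen users show a noticeably higher cancer rate even though, by construction, sunscreen plays no causal role - the correlation is induced entirely by the confounder.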

More on observational studies

Again, the basic idea is that the data collection does not interfere with how the data arises.

Example

Suppose we'd like to know the average height of women enrolled at UNCA. According to the UNCA Factbook, there were 2,077 women enrolled in the Fall of 2019. I'm not even sure how many were enrolled this past year; it might be hard to round up all of them to measure their height.

Another approach

We have 23 women enrolled in this statistics class this semester who filled out my online survey. The average height of those women was 5'4.2''. We might hope that this could be a good estimate of the average height of all women at UNCA.

This is a fundamental idea in statistics: we wish to study some parameter for a large population, but doing so is too unwieldy or downright impossible. Thus, we choose a manageable sample from the population and estimate the parameter with the corresponding statistic computed from the sample.

Terminology applied

In the example above,

  • The population is the set of all women enrolled at UNCA,
  • the parameter is the average height of women in the population,
  • the sample is the set of all women enrolled in this class, and
  • the statistic is the average height of women in the sample, which happens to be 5'4.2''.

Key questions

  • How well can we expect the statistic to approximate the parameter?
  • What properties should the sample have in order to maximize the accuracy of our approximation?

The first question will consume much of the last two-thirds of our semester under the general topic of inference. As we'll learn, we need a random sample of sufficient size.
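
A quick simulation already gives a preview of why sample size matters. The population below is simulated, not real; the point is that sample means computed from larger samples cluster more tightly around the parameter.

```python
import random
import statistics

random.seed(3)

# Hypothetical population of 2,077 heights (simulated for illustration).
population = [random.gauss(64.5, 2.5) for _ in range(2077)]
print("parameter:", round(statistics.mean(population), 2))

for n in (5, 23, 100, 500):
    # Draw 1,000 random samples of size n and record each sample mean.
    means = [statistics.mean(random.sample(population, n))
             for _ in range(1000)]
    # Less spread among the sample means -> a more reliable statistic.
    print(f"n = {n:3d}: spread of sample means = {statistics.stdev(means):.3f}")
```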

This sampling approach stands in contrast to the idea of just grabbing the whole population - often called a census.

Potential issues

  • Volunteers
  • Convenience
  • Bad sampling frame
  • Bias (response and non-response)

Retrospective studies

A retrospective study is one where we review records from the past.

The study of women's heights at UNCA is not a retrospective study.

If we reviewed grades at UNCA over the last 10 years to try to determine a relationship between GPA and class level, then we would be running a retrospective study.

More on designed experiments

Again, in an experiment, researchers actively work with the samples, applying treatments and observing effects.

The experimental approach to the cancer example might go like so: select 672 women under 59, randomly assign each of them to one of two groups - one that takes the drug and one that does not - and examine the groups after 42 months.
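
Here's a minimal Python sketch of the random assignment step; the subject IDs are placeholders.

```python
import random

random.seed(4)

subjects = list(range(1, 673))   # placeholder IDs for the 672 participants
random.shuffle(subjects)         # put the subjects in random order

# First half receives the drug; second half serves as the control group.
treatment_group = subjects[:336]
control_group = subjects[336:]

print(len(treatment_group), "in treatment;", len(control_group), "in control")
```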

Some key principles of experiments

  • Control: We split the group of patients into two groups:
    • A treatment group that receives the experimental drug
    • A control group that doesn't receive the drug; they might receive a placebo.
  • Randomization: The groups should be chosen randomly to prevent bias and to even out confounding factors.
  • Replication: The results should be reproducible.
  • Blocking: We might break the control and treatment groups into smaller groups or blocks.
    • Reduces variability in the groups
    • Allows us to identify confounding factors
    • Example: We might block by gender, age, or degree of risk.

If we find differences between the groups, we can examine whether or not they are statistically significant.
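
As a preview of inference, here's one common way to run such a check: a chi-squared test from scipy on a 2x2 table of outcomes. The counts below are invented for illustration, not taken from the actual study.

```python
from scipy.stats import chi2_contingency

# Invented counts: rows are treatment/control;
# columns are survived / did not survive after 42 months.
observed = [[300, 36],
            [270, 66]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-squared = {chi2:.2f}, p-value = {p_value:.4f}")
# A tiny p-value suggests a difference this large is unlikely
# to arise from random group assignment alone.
```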

Sampling strategies

Good random sampling is surprisingly hard to achieve in practice. Here are a few strategies for implementing sampling; a sketch contrasting a few of them appears after the list.

Simple random samples: This is the ideal and is conceptually simple - choose $n$ people from the population independently and with equal probability. It's quite hard to achieve in practice, though.

Other strategies:

  • Convenience samples
    • Ex: Survey the folks on your dorm floor
  • Stratified samples:
    • Ex: Survey college students and stratify by class
  • Clustered samples
    • Ex: Door-to-door survey picking one block out of each neighborhood
  • Systematic sampling
    • Ex: Survey every tenth person you see
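
To contrast a few of these strategies concretely, here's a small Python sketch on a hypothetical roster; the names and class levels are placeholders.

```python
import random

random.seed(5)

# Hypothetical roster: 400 students, each tagged with a class level.
levels = ["first-year", "sophomore", "junior", "senior"]
roster = [(f"student{i}", random.choice(levels)) for i in range(400)]

# Simple random sample: every student is equally likely to be chosen.
srs = random.sample(roster, 20)

# Stratified sample: draw 5 students from each class level separately.
stratified = []
for level in levels:
    in_level = [s for s in roster if s[1] == level]
    stratified.extend(random.sample(in_level, 5))

# Systematic sample: take every tenth student on the roster.
systematic = roster[::10]

print(len(srs), len(stratified), len(systematic))
```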