Getting data

with Studies or Experiments

Last time, we learned some data basics with a focus on data tables, which neatly organize data into rows of observations and columns of variables. Today, we'll take a brief look at how data is generated at a more fundamental level.

This is all based on sections 1.3 and 1.4 of our text, although in a somewhat different order.

Observational studies vs controlled experiments

There are two main techniques to systematically collect data:

  • Observational studies and
  • Controlled experiments

Observational studies

An observational study is one where the data collection does not interfere with how the data arises. Examples include

  • Surveys,
  • Review of records,
  • Simple measurements

Controlled experiments

In an experiment, researchers actively work with the samples applying treatments and observing effects.

  • Often (if not always), the intention is to explore a relationship or even a causal effect between two variables.
  • For that reason, it's often important to distinguish between an explanatory variable and a response variable.

An observational example

Suppose we are interested in the following question:

Does caffeine consumed through coffee affect GPA among UNCA students?

The observational approach

To perform an observational study, we might select 24 UNCA students and interview them. We might ask a number of questions but two questions key to the study would be:

  • How much caffeinated coffee do you drink?
  • What is your GPA?

The resulting data

If we interview 24 of the 3363 students at UNCA, we might generate a table that looks something like so:



We might visualize the data with a scatter plot.


The numbers and picture certainly appear to support the idea that caffeinated coffee consumption improves GPA. Of course, we ultimately hope to be able to quantify that type of analysis more precisely.

Some terminology

You can't clearly articulate a research question without first clearly identifying the population that you're working with, together with some related terms:

  • Population refers to the complete set of entities under consideration,
  • Sample refers to some subset of the population,
  • Parameter refers to some summary characteristic of the population, and
  • Statistic refers to some summary characteristic computed from a sample.

Sample/population for the coffee example

In the coffee example (where we interviewed 24 of the 3363 students at UNCA),

  • The sample is the 24 students we interviewed.
  • The population is the set of all 3363 UNCA students.
    (The extent to which the conclusion might extend to all college students anywhere might be a point of discussion and further research.)

Statistic/parameter for the coffee example

Suppose we find that, of the 24 students in our sample, the students drink 3.4 cups of coffee a day on average.

  • That average number 3.4 computed from the sample is a statistic.
  • If we could somehow determine the average number of cups of coffee consumed by all 3363 UNCA students, that number would be a parameter.

Typically, it's not feasible to compute a parameter from an entire population. Thus, we estimate population parameters from sample statistics. A key question is:

How close can we expect a computed statistic to be to the corresponding population parameter?
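To get a feel for that question, here is a minimal sketch that compares a statistic to a parameter. The population is simulated, so the numbers are synthetic and are not real UNCA data; the mean of 3.0 and standard deviation of 1.5 cups are assumptions chosen only for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: simulated daily cups of coffee for 3363 students.
# These values are synthetic, not real survey data.
population = [max(0.0, random.gauss(3.0, 1.5)) for _ in range(3363)]

# Parameter: the mean over the whole population (rarely available in practice).
parameter = statistics.mean(population)

# Statistic: the mean of a simple random sample of 24 students.
sample = random.sample(population, 24)
statistic = statistics.mean(sample)

print(f"population mean (parameter): {parameter:.2f}")
print(f"sample mean (statistic):     {statistic:.2f}")
print(f"difference:                  {statistic - parameter:+.2f}")
```

Changing the seed gives a different sample and a slightly different statistic, while the parameter stays tied to the population itself.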

Some important issues

  • Sample selection
  • Confounding variables
  • Causation vs correlation

Sample selection

A key objective in statistics is to draw conclusions about the population at large from a sample of manageable size. To do so, the sample should be representative of the population at large.

If we are interested, for example, in the average height of UNCA students then we would certainly obtain an overestimate if we sampled the heights of the men's basketball team.


A key idea in sample selection is randomization. If we pick fairly large samples truly at random, then we should get individuals of all types, so that the sample will be representative.

The gold standard in sample selection is the simple random sample, which can be defined as follows:

To choose a simple random sample of size $n$ from a large population, each subset of size $n$ from that population must have the same probability of being chosen as any other subset of size $n$.
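In code, one way to draw a simple random sample is Python's `random.sample`, which samples without replacement. This is only a sketch: the ID numbers 1 through 3363 are a hypothetical stand-in for a real student roster, and the seed is there purely so the result can be repeated.

```python
import random

# Hypothetical roster: ID numbers 1..3363 stand in for the student body.
student_ids = list(range(1, 3364))

rng = random.Random(2024)  # seeded generator, purely for repeatability

# random.sample draws without replacement, so no student is chosen twice.
srs = rng.sample(student_ids, k=24)

print(sorted(srs))
```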


If we're particularly interested in how two groups compare, we might stratify first - i.e., we might choose simple random samples within subgroups.

In the coffee example, we might choose a simple random sample of size 10 from the set of all women at UNCA and also choose a simple random sample of size 14 from the set of men at UNCA. The collection of $$10 \text{ women } + 14 \text { men } = 24 \text{ students}$$ would then be our sample.
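A stratified sample along these lines might be sketched as follows. The split of 1900 women and 1463 men is made up for illustration (it only needs to add up to 3363), and the labels `W0`, `M0`, and so on are hypothetical identifiers.

```python
import random

rng = random.Random(7)  # seeded only so the sketch is repeatable

# Hypothetical strata: the 1900/1463 split is an assumption, not real data.
women = [f"W{i}" for i in range(1900)]
men = [f"M{i}" for i in range(1463)]

# Take a simple random sample within each stratum, then combine them.
sample = rng.sample(women, 10) + rng.sample(men, 14)

print(len(sample), sample[:3])
```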

Confounding variables

A confounding variable is a variable (often unknown) that influences two or more other variables and causes a spurious association.

In the coffee GPA example, a confounding variable might be the amount of time the student studies. If they study a lot, they might find themselves drinking more coffee and also increasing their GPA.

The term lurking variable is a synonym for confounding variable.
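A small simulation can show how a confounder produces an association on its own. In the toy model below, study time drives both coffee consumption and GPA, and coffee never appears in the GPA formula. Every coefficient is an assumption made up for illustration, and `pearson_r` is a helper written for this sketch.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(0)
n = 1000

# Synthetic model (all numbers are made up):
# study hours influence BOTH coffee and GPA; coffee has no effect on GPA.
study = [rng.uniform(0, 40) for _ in range(n)]                      # hours per week
coffee = [max(0.0, 0.1 * s + rng.gauss(0, 0.5)) for s in study]     # cups per day
gpa = [min(4.0, 2.0 + 0.04 * s + rng.gauss(0, 0.3)) for s in study]

r = pearson_r(coffee, gpa)
print(f"correlation between coffee and GPA: {r:.2f}")
```

The correlation comes out strongly positive even though, by construction, coffee has no effect on GPA in this model.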

Causation vs correlation

In statistics, we are often interested in the effect of one variable on another. In the context of the coffee study - does the consumption of caffeinated coffee improve GPA?

Due to the potential presence of unknown confounding variables, an observational study can generally tell us only whether there is a relationship between two variables, and not whether there is a causal relationship.

This is sometimes phrased as

correlation does not imply causation.

To establish a causal relationship, you've really got to set up an experiment.

An experimental example

In an experiment (more specifically, a controlled experiment), researchers actively work with the samples applying treatments and controls and observing effects.

Let's outline an experimental approach to our main question:

Does caffeine consumed through coffee affect GPA among college students?

The experimental approach

To explore this question via a controlled experiment, we would start with a large, simple random sample of participating college students. We would randomly assign each student to one of two groups:

  • The treatment group, who would be required to drink a specified amount of caffeinated coffee every day and
  • the control group, who drink decaffeinated coffee instead.

After some pre-specified amount of time, we would check and compare the grades earned by the two groups.
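The random assignment step can be sketched by shuffling the participants and splitting the shuffled list in half. The participant IDs are hypothetical, and the group size of 250 is just an example size.

```python
import random

rng = random.Random(42)  # seeded only so the sketch is repeatable

# Hypothetical participant IDs for an example sample of 250 students.
participants = list(range(1, 251))

shuffled = participants[:]   # copy so the original roster is untouched
rng.shuffle(shuffled)

half = len(shuffled) // 2
treatment = shuffled[:half]  # assigned caffeinated coffee
control = shuffled[half:]    # assigned decaffeinated coffee

print(len(treatment), len(control))
```

Because the split depends only on the shuffle, nothing about a student (study habits, sleep, major) can influence which group they land in.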


If there were 250 participants in our sample, perhaps our results might look something like so:

Treatment Control

This again looks like it might be compelling evidence that consumption of caffeinated coffee boosts GPA. We would ultimately like to be able to quantify that statement.

Some key principles of experiments

  • Control: We split the participants into two groups:
    • A treatment group that receives the treatment
    • A control group that receives a placebo.
  • Randomization: The groups should be chosen randomly to prevent bias and to even out confounding factors.
  • Replication: The results should be reproducible

If we find differences between the groups we can examine whether they are statistically significant or not.

Efficacy of masks in preventing COVID

Let's examine a couple of recent, real-world examples of studies applying these ideas to fight the transmission of COVID. We'll start with an observational study on the efficacy of masks described in BMJ Global Health.

The study

The researchers identified 121 families in Beijing who had at least one COVID infection in the household. They then interviewed each family to determine

  • If there was any further transmission within the household after the initial infection and
  • whether or not masks were worn consistently in the household.


The researchers' data can be summarized in the following contingency table:

                          Masks   No masks   Total
Further transmission          4         36      40
No further transmission      27         54      81


It certainly looks like there's a negative relationship between mask usage and transmission. We'll develop a statistical test later this semester to quantify the strength of that relationship.
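As a first rough look, before any formal test, we might compare the proportion of households with further transmission in each column of the table. The counts below are taken from the table; the dictionary layout and the function name `transmission_rate` are just choices made for this sketch.

```python
# Counts from the contingency table above.
masks = {"transmission": 4, "no_transmission": 27}
no_masks = {"transmission": 36, "no_transmission": 54}

def transmission_rate(group):
    """Fraction of households in the group with further transmission."""
    total = group["transmission"] + group["no_transmission"]
    return group["transmission"] / total

rate_masks = transmission_rate(masks)        # 4 / 31
rate_no_masks = transmission_rate(no_masks)  # 36 / 90

print(f"masks:    {rate_masks:.1%}")
print(f"no masks: {rate_no_masks:.1%}")
```

That works out to about 13% of mask-wearing households versus 40% of the others, which describes the association but says nothing yet about whether it could be due to chance.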

Of course, we can't conclude a causal relationship, since this is an observational study; we would need a controlled experiment if we'd like to establish causality.

A controlled experiment for vaccines

The medical and drug industries form a rich source of controlled experiments, since companies are actually required to establish safety and efficacy before the products are brought to market.

This paper on medRxiv describes one such study for Pfizer's mRNA vaccine.

The study

Researchers identified 46,077 participants for a controlled experiment. The participants were then randomly assigned as follows:

  • 23,040 in the treatment group and
  • 23,037 in the control group.


After 6 months,

  • 131 people in the treatment group had been infected at some point during the study and
  • 1034 people in the control group had been infected at some point during the study.


This again looks like pretty compelling evidence and could even prove that the vaccine is effective at preventing COVID. We'll certainly need to develop quantitative tools to address that question, though.
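As a starting point, we can at least compare the raw infection proportions in the two groups. This is a naive sketch using only the counts quoted above; it is not the efficacy calculation used in the paper itself, and the variable names are our own.

```python
# Counts quoted in the study summary above.
treatment_n, treatment_infected = 23_040, 131
control_n, control_infected = 23_037, 1034

# Proportion of each group infected at some point during the study.
risk_treatment = treatment_infected / treatment_n
risk_control = control_infected / control_n

# Relative risk: infection proportion in treatment relative to control.
relative_risk = risk_treatment / risk_control

print(f"treatment: {risk_treatment:.2%}")
print(f"control:   {risk_control:.2%}")
print(f"relative risk: {relative_risk:.3f}")
```

The proportions are roughly 0.6% versus 4.5%. Whether a gap of that size could plausibly arise by chance is exactly the kind of question the quantitative tools are meant to answer.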