Last time, we learned some data basics with a focus on data tables, which neatly organize data into rows of observations and columns of variables. Today, we'll take a brief look at how data is generated at a more fundamental level.
This is all based on sections 1.3 and 1.4 of our text, although in a somewhat different order.
There are two main techniques to systematically collect data:
An observational study is one where the data collection does not interfere with how the data arises. Examples include
In an experiment, researchers actively work with the samples, applying treatments and observing effects.
Suppose we are interested in the following question:
Does caffeine consumed through coffee affect GPA among UNCA students?
To perform an observational study, we might select 24 UNCA students and interview them. We might ask a number of questions, but two questions key to the study would be:
If we interview 24 of the 3363 students at UNCA, we might generate a table that looks something like so:
id | Cups/day | GPA |
---|---|---|
1 | 5 | 3.23 |
2 | 1 | 2.80 |
3 | 0 | 0.00 |
4 | 10 | 4.00 |
$\vdots$ | $\vdots$ | $\vdots$ |
24 | 9 | 3.00 |
We might visualize the data with a scatter plot.
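A minimal sketch of such a plot in Python follows. The lists below are hypothetical responses for the 24 interviewed students (only the few values shown in the table above are taken from it); this is an illustration, not the actual survey data.

```python
import matplotlib.pyplot as plt

# Hypothetical responses from the 24 interviewed students.
cups_per_day = [5, 1, 0, 10, 2, 3, 0, 4, 6, 1, 2, 7,
                3, 0, 5, 8, 2, 1, 4, 6, 3, 2, 9, 9]
gpa = [3.23, 2.80, 0.00, 4.00, 2.50, 3.10, 2.20, 3.40, 3.60, 2.70, 2.90, 3.80,
       3.00, 1.90, 3.50, 3.90, 2.60, 2.40, 3.30, 3.70, 3.05, 2.85, 3.95, 3.00]

# One point per student: coffee consumption on the x-axis, GPA on the y-axis.
plt.scatter(cups_per_day, gpa)
plt.xlabel("Cups of coffee per day")
plt.ylabel("GPA")
plt.title("Coffee consumption vs GPA (hypothetical sample of 24)")
plt.show()
```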
The numbers and picture certainly appear to support the idea that caffeinated coffee consumption improves GPA. Of course, we ultimately hope to be able to quantify that type of analysis more precisely.
You can't clearly articulate a research question without first identifying the population that you're working with, together with some related terms:
In the coffee example (where we interviewed 24 of the 3363 students at UNCA),
Suppose we find that the 24 students in our sample drink 3.4 cups of coffee a day on average.
Typically, it's not feasible to compute a parameter from an entire population. Thus, we estimate population parameters from sample statistics. A key question is:
How close can we expect a computed statistic to be to the corresponding population parameter?
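One way to build intuition for this question is to simulate it. The sketch below uses a made-up population of 3363 coffee-consumption values (purely for illustration), draws many samples of size 24, and records how far each sample mean lands from the population mean.

```python
import random

random.seed(1)

# Hypothetical population: a coffee-consumption value for each of the
# 3363 UNCA students.  These numbers are made up for illustration.
population = [random.choice([0, 0, 1, 1, 2, 2, 3, 4, 5, 6, 8, 10])
              for _ in range(3363)]
parameter = sum(population) / len(population)   # the population mean (a parameter)

errors = []
for _ in range(1000):
    sample = random.sample(population, 24)      # a simple random sample of size 24
    statistic = sum(sample) / len(sample)       # the sample mean (a statistic)
    errors.append(abs(statistic - parameter))

print("Population mean:", round(parameter, 2))
print("Typical distance of a sample mean from it:", round(sum(errors) / len(errors), 2))
```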
A key objective in statistics is to draw conclusions about the population at large from a sample of manageable size. To do so, the sample should be representative of the population at large.
If we are interested, for example, in the average height of UNCA students then we would certainly obtain an overestimate if we sampled the heights of the men's basketball team.
A key idea in sample selection is randomization. If we pick fairly large samples truly at random, then we should get individuals of all types, so the sample will be representative.
The gold standard in sample selection is the simple random sample, which can be defined as follows:
To choose a simple random sample of size $n$ from a large population, each subset of size $n$ from that population must have the same probability of being chosen as any other subset of size $n$.
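Here is a minimal sketch of drawing a simple random sample in Python, assuming we have a roster of student IDs (the roster here is just the numbers 1 through 3363).

```python
import random

roster = list(range(1, 3364))   # hypothetical roster of the 3363 UNCA students

# random.sample draws without replacement, and every subset of size 24
# is equally likely to be chosen -- exactly the definition above.
srs = random.sample(roster, 24)
print(sorted(srs))
```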
If we're particularly interested in how two groups compare, we might stratify first - i.e., we might choose simple random samples within subgroups.
In the coffee example, we might choose a simple random sample of size 10 from the set of all women at UNCA and also choose a simple random sample of size 14 from the set of all men at UNCA. The combined collection of $$10 \text{ women} + 14 \text{ men} = 24 \text{ students}$$ would then be our sample.
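A sketch of that stratified version, assuming separate (hypothetical) rosters for women and men:

```python
import random

women = [f"W{i}" for i in range(1, 2001)]   # hypothetical roster of women at UNCA
men = [f"M{i}" for i in range(1, 1364)]     # hypothetical roster of men at UNCA

# Draw a simple random sample within each stratum, then combine.
sample = random.sample(women, 10) + random.sample(men, 14)
print(len(sample))   # 24
```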
A confounding variable is a variable (often unknown) that influences two or more other variables and causes a spurious association.
In the coffee/GPA example, a confounding variable might be the amount of time a student studies. Students who study a lot might find themselves drinking more coffee and also earning a higher GPA.
The term lurking variable is a synonym for confounding variable.
In statistics, we are often interested in the effect of one variable on another. In the context of the coffee study - does the consumption of caffeinated coffee improve GPA?
Due to the potential presence of unknown confounding variables, an observational study can establish only whether there is a relationship between two variables - and not whether there is a causal relationship.
This is sometimes phrased as
correlation does not imply causation.
To establish a causal relationship, you've really got to set up an experiment.
In an experiment (more specifically, a controlled experiment), researchers actively work with the samples, applying treatments and controls and observing effects.
Let's outline an experimental approach to our main question:
Does caffeine consumed through coffee affect GPA among college students?
To explore this question via a controlled experiment, we would start with a large, simple random sample of participating college students. We would randomly assign each student to one of two groups:
After some pre-specified amount of time, we would check and compare the grades earned by the two groups.
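A minimal sketch of the random assignment step in Python (the participant IDs and group sizes are hypothetical; in practice the assignment would be done once, before the study begins):

```python
import random

participants = [f"S{i}" for i in range(1, 251)]   # hypothetical participant IDs

# Flip a fair coin for each participant; the two groups will be close to,
# but usually not exactly, the same size.
treatment, control = [], []
for p in participants:
    (treatment if random.random() < 0.5 else control).append(p)

print(len(treatment), "in treatment,", len(control), "in control")
```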
If there were 250 participants in our sample, our results might look something like so:
|     | Treatment | Control |
|-----|-----------|---------|
| N   | 119       | 131     |
| GPA | 3.7       | 2.9     |
This again looks like it might be compelling evidence that consumption of caffeinated coffee boosts GPA. We would ultimately like to be able to quantify that statement.
If we find differences between the groups we can examine whether they are statistically significant or not.
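One hedged preview of how such a comparison might be quantified is a randomization (permutation) test: shuffle the group labels many times and see how often a difference as large as the observed one appears by chance. The GPA lists below are made up for illustration; this is a sketch of the idea, not the formal test we will develop.

```python
import random

# Hypothetical individual GPAs for the two groups.
treatment_gpas = [3.9, 3.7, 3.8, 3.5, 3.6, 3.7, 3.9, 3.6]
control_gpas = [3.0, 2.8, 2.9, 3.1, 2.7, 2.9, 3.0, 2.8]

def mean(xs):
    return sum(xs) / len(xs)

observed = mean(treatment_gpas) - mean(control_gpas)

combined = treatment_gpas + control_gpas
trials = 10000
count = 0
for _ in range(trials):
    random.shuffle(combined)   # reassign the group labels at random
    fake_diff = (mean(combined[:len(treatment_gpas)])
                 - mean(combined[len(treatment_gpas):]))
    if fake_diff >= observed:
        count += 1

print("Observed difference:", round(observed, 2))
print("Fraction of shuffles at least this large:", count / trials)
```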
Let's examine a couple of recent, real-world examples of studies applying these ideas to fight the transmission of COVID. We'll start with an observational study on the efficacy of masks described in BMJ Global Health.
The researchers identified 121 families in Beijing who had at least one COVID infection in the household. They then interviewed each family to determine
The researchers' data can be summarized in the following contingency table:
|                         | Masks | No masks | Total |
|-------------------------|-------|----------|-------|
| Further transmission    | 4     | 36       | 40    |
| No further transmission | 27    | 54       | 81    |
| Total                   | 31    | 90       | 121   |
It certainly looks like there's a negative relationship between mask usage and transmission. We'll develop a statistical test later this semester to quantify the strength of that relationship.
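Even before that test, we can compare the transmission rate within each group directly from the table; a quick sketch:

```python
# Counts taken from the contingency table above.
transmission_with_masks = 4 / 31       # roughly 12.9% of mask-wearing households
transmission_without_masks = 36 / 90   # 40% of households without masks

print(round(transmission_with_masks, 3), round(transmission_without_masks, 3))
```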
Of course, we can't conclude a causal relationship, since this is an observational study; we would need a controlled experiment if we'd like to establish causality.
The medical and drug industries form a rich source of controlled experiments, since companies are actually required to establish safety and efficacy before their products are brought to market.
This paper on medRxiv describes one such study for Pfizer's mRNA vaccine.
Researchers identified 46,077 participants for a controlled experiment. The participants were then randomly assigned as follows:
After 6 months,
This again looks like pretty compelling evidence that the vaccine is effective at preventing COVID. We'll certainly need to develop quantitative tools to address that question, though.
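For reference, one standard summary reported in trials like this (not computed in these notes, and only a preview of the kind of quantity we'll learn to estimate) is the vaccine efficacy, the relative reduction in the infection rate for the vaccinated group: $$\text{VE} = 1 - \frac{\text{infection rate among the vaccinated}}{\text{infection rate among the placebo group}}.$$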