In our first week, we got a bit of a grip on data. To this point, the data we’ve played with has all come from some CSV file or been scraped from the web. Now we’ll learn a bit about how data is gathered at a more fundamental level. In particular, we’ll explore the difference between observational studies and controlled experiments.

Observational studies vs Controlled experiments

Case study - Do stents prevent strokes?

Let’s work in the context of a specific research question - do stents help prevent stroke in patients who’ve had heart surgery? Consider the following three approaches:

  • Anecdotal evidence
    • We might draw conclusions based on the experiences of friends and family.
  • Retrospective studies
    • We might go through the records of several hospitals and compare outcomes of patients who received stents vs those who did not.
  • Controlled experiments
    • Ideally, we could set up a long term study examining the effect of stents on two groups of patients - the treatment group and the control group.

Some terminology

You can’t clearly articulate a research question without first clearly identifying the population that you’re working with, together with some related terms:

  • Population refers to the complete set of entities under consideration,
  • Parameter refers to some summary characteristic of the population,
  • Sample refers to some subset of the population, and
  • Statistic refers to some summary characteristic computed from a sample.

In the stent/stroke example, the population might be all patients who’ve had heart surgery while the sample would be all patients in a specific study.

Often we are interested in the relationship between two variables - specifically, whether there is a correlation or even a causal relationship between them. In the context of a study, we should clearly identify:

  • Independent or explanatory variables
  • Dependent or response variables
  • Confounding or lurking variables

Generally, an explanatory variable is one that a researcher suspects might affect the response variable. Correlation, however, does not always imply causation.

Example: We suspect that folks who use more sunscreen have a higher incidence of skin cancer. What are the explanatory and response variables - as well as any confounding variables?

Observational studies

Again, the basic idea is that the data collection does not interfere with how the data arises.

Suppose we’d like to know the average height of women enrolled at UNCA. According to the UNCA Factbook, there were 2147 women enrolled in the Fall of 2017. I’m not even sure how many were enrolled this past year; it might be hard to round up all of them.

So, here’s a more practical approach: I had 42 women enrolled in my statistics classes that semester who filled out my online survey. The average height of those women was 5’5’’. We might hope that this is a good estimate of the average height of all women at UNCA.

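This is a fundamental idea in statistics: we wish to study some parameter for a large population, but doing so is too unwieldy or downright impossible. Thus, we choose a manageable sample from the population and estimate the parameter with the corresponding statistic computed from the sample. In the example above,

  • the population is all women enrolled at UNCA,
  • the parameter is the average height of those women,
  • the sample is the 42 women in my classes who filled out the survey, and
  • the statistic is their average height of 5’5’’.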
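The four terms can be made concrete with a small simulation. This is only a sketch: the heights below are invented (drawn from a normal distribution with an assumed mean of 65 inches and standard deviation of 2.5 inches), not real UNCA data, and the sample is drawn at random, which the class survey was not.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Hypothetical population: 2147 heights in inches. The mean (65) and
# standard deviation (2.5) are assumptions made up for illustration.
population = [random.gauss(65, 2.5) for _ in range(2147)]

# Parameter: a summary of the whole population (usually unknown in practice).
parameter = statistics.mean(population)

# Sample: 42 members of the population, chosen at random without replacement.
sample = random.sample(population, 42)

# Statistic: the same summary, computed from the sample only.
statistic = statistics.mean(sample)

print(f"parameter (population mean): {parameter:.2f} in")
print(f"statistic (sample mean):     {statistic:.2f} in")
```

The two printed numbers should be close but not identical; how close we can expect them to be is exactly the kind of question inference addresses.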

A number of key questions arise:

The first question will consume much of the last two-thirds of our semester under the general topic of inference. As we’ll learn, we need a random sample of sufficient size.

This approach of sampling is in contrast to the idea of just grabbing the whole population - often called a census.

Potential issues:

Sampling strategies

Here are a few strategies to implement the big ideas:

Simple random samples: The idea is so simple - choose \(n\) people from the population independently and with equal probability. It’s a bit hard to achieve in practice, though.
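When a complete roster of the population is available, a simple random sample is easy to draw in code. The following is a minimal sketch using Python's standard `random` module; the roster of student IDs and the helper name `simple_random_sample` are hypothetical. Note that it draws without replacement, so nobody is chosen twice.

```python
import random

def simple_random_sample(roster, n, seed=None):
    """Return n distinct members of roster; every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(roster, n)

# Hypothetical roster: one made-up ID per enrolled student.
roster = [f"student_{i:04d}" for i in range(1, 2148)]

chosen = simple_random_sample(roster, 42, seed=2017)
print(chosen[:5])
```

The hard part in practice is not the random draw but obtaining the roster and then getting every chosen person to respond.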

Other strategies:

  • Convenience
    • Not a particularly good idea.
  • Stratified samples:
    • e.g., college students by class
  • Clustered samples
    • e.g., door-to-door survey by blocks
  • Systematic sampling
    • e.g., every tenth person you see
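The last three strategies can be sketched in a few lines each. Everything about the population below is an assumption for illustration: 400 made-up students, each tagged with a class year (the strata) and a dorm (the clusters).

```python
import random
from collections import defaultdict

rng = random.Random(0)

# Hypothetical population: 400 students, each with a class year and a dorm.
years = ["first-year", "sophomore", "junior", "senior"]
students = [
    {"id": i, "year": years[i % 4], "dorm": f"dorm_{i % 10}"}
    for i in range(400)
]

def group_by(items, key):
    """Collect items into lists keyed by item[key]."""
    groups = defaultdict(list)
    for item in items:
        groups[item[key]].append(item)
    return groups

def stratified_sample(items, key, per_stratum):
    """Take a simple random sample of fixed size from every stratum."""
    picked = []
    for members in group_by(items, key).values():
        picked.extend(rng.sample(members, per_stratum))
    return picked

def cluster_sample(items, key, n_clusters):
    """Pick whole clusters at random and keep everyone in them."""
    groups = group_by(items, key)
    chosen = rng.sample(sorted(groups), n_clusters)
    return [item for name in chosen for item in groups[name]]

def systematic_sample(items, step):
    """Start at a random offset, then take every step-th item."""
    start = rng.randrange(step)
    return items[start::step]

by_year = stratified_sample(students, "year", per_stratum=10)
by_dorm = cluster_sample(students, "dorm", n_clusters=2)
every_tenth = systematic_sample(students, step=10)
print(len(by_year), len(by_dorm), len(every_tenth))  # 40 80 40
```

The stratified sample is guaranteed to contain every class year, while the cluster sample contains only two dorms but everyone in them.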

Designed experiments

The experimental approach to the Music / GPA question might go like so: Select 100 third graders. Randomly assign each of them to one of two groups - one that takes music lessons and one that doesn’t. Follow the groups over the course of several years and compare their grades.

Some key principles of experiments

  • Control: We split the group of patients into two groups:
    • A treatment group that receives the experimental drug
    • A control group that doesn’t receive the drug; they might receive a placebo.
  • Randomization: The groups should be chosen randomly to prevent bias and to even out confounding factors.
  • Replication: The results should be reproducible.
  • Blocking: We might break the control and treatment groups into smaller groups or blocks.
    • Reduces variability in the groups
    • Allows us to identify confounding factors
    • Example: We might block by gender, age, or degree of risk.
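Randomization and blocking are both easy to express in code. The sketch below is one possible way to do it, not a prescribed procedure: the patient list, the risk labels, and the function names are all hypothetical. Plain randomization shuffles everyone and splits the list in half; blocked randomization does the same thing separately within each block, so both groups end up with the same mix of high-risk and low-risk patients.

```python
import random

def randomize(subjects, seed=None):
    """Shuffle the subjects and split them into two halves."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def randomize_within_blocks(subjects, block_of, seed=None):
    """Randomize separately inside each block, then pool the results."""
    rng = random.Random(seed)
    blocks = {}
    for s in subjects:
        blocks.setdefault(block_of(s), []).append(s)
    treatment, control = [], []
    for members in blocks.values():
        rng.shuffle(members)
        half = len(members) // 2
        treatment.extend(members[:half])
        control.extend(members[half:])
    return treatment, control

# Hypothetical patients as (id, risk level): 20 high risk, 40 low risk.
patients = [(i, "high" if i % 3 == 0 else "low") for i in range(60)]

treatment, control = randomize_within_blocks(patients, lambda p: p[1], seed=1)
print(sum(1 for p in treatment if p[1] == "high"),
      sum(1 for p in control if p[1] == "high"))  # 10 10
```

With plain `randomize`, the number of high-risk patients in each group would vary from run to run; blocking removes that source of variability.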

If we find differences between the groups we can examine whether they are statistically significant or not.

Example

The stent/stroke example is presented as an introductory case study in section 1.1 of our textbook. The data is available as well. In that study, 451 patients were broken into treatment and control groups with the following results after 1 year:

            no event   stroke
control          199       28
treatment        179       45

It appears that the stents did not help: about 20% of the patients in the treatment group (45 of 224) had a stroke, while only about 12% of the control group (28 of 227) did. Near the end of the semester we’ll develop some tests to quantify the statistical significance of this type of result.
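The stroke rate in each group is the number of strokes divided by the total number of patients in that group. A short sketch of the arithmetic, using the counts from the table (the dictionary layout and the function name `stroke_rate` are just one convenient choice):

```python
# Counts after 1 year, copied from the table above.
counts = {
    "control":   {"no event": 199, "stroke": 28},
    "treatment": {"no event": 179, "stroke": 45},
}

def stroke_rate(group):
    """Proportion of patients in the group who had a stroke."""
    row = counts[group]
    return row["stroke"] / (row["no event"] + row["stroke"])

for group in counts:
    print(f"{group:<10} {stroke_rate(group):.1%}")
```

This gives 28/227, or about 12.3%, for the control group and 45/224, or about 20.1%, for the treatment group. Note that the denominator is the whole group, not just the patients with no event.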

Visualizing relationships

For categorical data

We’ve already used Mosaic plots to visualize the relationship between categorical variables. Here’s the Mosaic plot for the Stent/Stroke experiment.

We will talk later about quantifying the relationship using a \(\chi\)-square test.

For numerical data

When we have two numerical variables that we suspect are related, we can investigate with a scatter plot. For example, it makes sense that height and weight would be related. The following scatter plot shows the heights and weights of a random sample of 100 individuals taken from our CDC data set.

We will talk later about quantifying the relationship using linear regression.