Last time, we talked about some data basics. We learned about
Today we'll talk about how we get data at a fundamental level using observational studies or controlled experiments. In either case, we need a strategy to find a sample on which to perform the study.
This is all based on sections 1.3 and 1.4 of our text, although in a somewhat different order.
Here was a nice story on CNN that clearly describes an observational study: Up to 25 cups of coffee a day still safe for heart health, study says.
CNN also had a more serious study yesterday: Drug extends life of younger women with advanced breast cancer, study says. I actually had to read this article from ASCO that was cited by the CNN article to verify that this was a controlled experiment.
Note that both of these techniques should be contrasted with anecdotal evidence, such as "My sister took 8 years to graduate from OU, so OU must be really hard."
You can't clearly articulate a research question without first clearly identifying the population that you're working with, together with some related terms:
In the coffee example:
In the cancer example:
Often we are interested in the relationship between two variables - specifically, is there a correlation or even a causal relationship between two variables. In the context of a study, we should clearly identify:
Generally, an explanatory variable is one that a researcher suspects might affect the response variable. Correlation, however, does not always imply causation.
Example: We suspect that folks who use more sunscreen have a higher incidence of skin cancer. What are the explanatory and response variables - as well as any confounding variables?
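One way to think about the sunscreen example is with a small simulation. The sketch below assumes a hypothetical confounder, weekly sun exposure, that makes both sunscreen use and skin cancer more likely. All of the numbers are made up for illustration, and sunscreen is given no effect at all on cancer risk in this toy model.

```python
import random

def simulate_sunscreen_study(n=10_000, seed=0):
    """Simulate people whose skin-cancer risk depends only on sun exposure."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        # Hypothetical confounder: hours of sun exposure per week (made up).
        sun_hours = rng.uniform(0, 20)
        # More time in the sun makes sunscreen use more likely ...
        uses_sunscreen = rng.random() < sun_hours / 20
        # ... and raises cancer risk. Sunscreen itself has NO effect here.
        has_cancer = rng.random() < 0.01 + 0.01 * sun_hours
        records.append((uses_sunscreen, has_cancer))
    return records

def cancer_rate(records, uses_sunscreen):
    """Fraction with cancer among people in the given sunscreen group."""
    group = [cancer for used, cancer in records if used == uses_sunscreen]
    return sum(group) / len(group)

records = simulate_sunscreen_study()
rate_users = cancer_rate(records, True)
rate_non_users = cancer_rate(records, False)
print(f"cancer rate, sunscreen users: {rate_users:.3f}")
print(f"cancer rate, non-users:       {rate_non_users:.3f}")
```

Even though sunscreen does nothing in this model, the users show a higher cancer rate, simply because they spend more time in the sun. That is the sense in which correlation does not imply causation.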
Again, the basic idea is that the data collection does not interfere with how the data arises.
Suppose we'd like to know the average height of women enrolled at UNCA. According to the UNCA Factbook, there were 2147 women enrolled in the Fall of 2017. I'm not even sure how many were enrolled this past year; it might be hard to round up all of them.
So, here's a more practical approach: I had 42 women enrolled in my statistics classes that semester who filled out my online survey. The average height of those women was 5'5''. We might hope that could be a good estimate of the average height of all women at UNCA.
This is a fundamental idea in statistics: we wish to study some parameter for a large population, but doing so is too unwieldy or downright impossible. Thus, we choose a manageable sample from the population and estimate the parameter with the corresponding statistic computed from the sample. In the example above,
A number of key questions arise:
The first question will consume much of the last two-thirds of our semester under the general topic of inference. As we'll learn, we need a random sample of sufficient size.
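To get a feel for why sample size matters, here is a minimal simulation sketch. It assumes a made-up population of 2147 heights (normally distributed around 65 inches, which is not real UNCA data) and looks at how much the sample mean varies from one random sample to the next for a few sample sizes.

```python
import random
import statistics

rng = random.Random(2017)

# Made-up population: 2147 heights in inches (NOT real UNCA data).
population = [rng.gauss(65, 2.5) for _ in range(2147)]
parameter = statistics.mean(population)  # the population mean, usually unknown

def sample_mean(n):
    """Statistic: mean height of a random sample of size n (no repeats)."""
    return statistics.mean(rng.sample(population, n))

# For each sample size, draw 1000 samples and measure how much
# the resulting sample means vary from sample to sample.
spread_by_n = {}
for n in (5, 42, 500):
    estimates = [sample_mean(n) for _ in range(1000)]
    spread_by_n[n] = statistics.stdev(estimates)
    print(f"n={n:3d}: sample means vary by about {spread_by_n[n]:.2f} inches")

print(f"population mean (the parameter): {parameter:.2f} inches")
```

In this toy setting, larger random samples give estimates that cluster more tightly around the parameter. Making that statement precise is what inference is about.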
This approach of sampling is in contrast to the idea of just grabbing the whole population - often called a census.
Potential issues:
The experimental approach to the cancer example might go like so: Select 672 women under 59. Randomly assign them into one of two groups - one that takes the drug and one that does not. Examine the groups after 42 months.
If we find differences between the groups, we can examine whether they are statistically significant or not.
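The random assignment step described above might be sketched as follows. This assumes the subjects are identified by the numbers 1 through 672 and that the two groups are the same size. Both are assumptions for illustration, not details taken from the actual study.

```python
import random

def randomly_assign(subject_ids, seed=None):
    """Shuffle the subjects and split them into two equal-sized groups."""
    rng = random.Random(seed)
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)  # every ordering is equally likely
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}

# Hypothetical subject IDs 1..672; the seed just makes the run repeatable.
groups = randomly_assign(range(1, 673), seed=42)
print(len(groups["treatment"]), len(groups["control"]))  # prints: 336 336
```

Because chance alone decides who gets the drug, any other characteristic of the women should be roughly balanced between the two groups.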
Good random sampling is surprisingly hard to achieve in practice. Here are a few strategies to implement sampling.
Simple random samples: This is the ideal and is so simple - choose $n$ people from the population independently and with equal probability. It's quite hard to achieve in practice, though.
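If we did have a complete roster of the population, drawing a simple random sample would be easy on a computer. Here is a minimal sketch, assuming a hypothetical roster of 2147 made-up student IDs. Python's `random.sample` draws without replacement, so nobody is chosen twice and every set of $n$ people is equally likely.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Choose n distinct members of the population, all sets equally likely."""
    rng = random.Random(seed)
    return rng.sample(list(population), n)

# Hypothetical roster of made-up IDs; a real study would need the real list.
roster = [f"student_{i:04d}" for i in range(1, 2148)]
sample = simple_random_sample(roster, 42, seed=1)
print(sample[:5])
```

The hard part in practice is not this computation. It is getting the full roster and then getting everyone selected to actually respond.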
Other strategies: