Statistics is the study of how to collect, analyze, and draw conclusions from data.
We’re going to focus for a couple of days on the basics of collecting and visualizing data.
Note that a lot of this material (particularly that part surrounding the language of observational studies and controlled experiments) comes from sections 1.3 and 1.4 of our text.
Getting data
When we talk about getting data, we probably mean one of a couple of things:
Finding and loading existing data onto our computer
Generating new data
In this class we will work almost exclusively with existing data - often loaded directly from my web space. That will allow us to focus on the more analytical aspects of statistics.
Loading data onto your computer
When we do our first computer lab (on Friday!) loading data from my web space will be easy! Here’s how to load and display the first 4 rows of the NCAA team data that we saw last time, for example:
```python
import pandas as pd

ncaa_data = pd.read_csv("https://marksmath.org/data/NCAATeamData2022.csv")
ncaa_data.head(4)
```
| | team_id | team_name | conf | games | avg_team_score | avg_oppenent_score | avg_score_diff | winning_pct |
|---|---|---|---|---|---|---|---|---|
| 0 | 1101 | ABILENE CHR | WAC | 29 | 73.172414 | 68.413793 | 4.758621 | 65.517241 |
| 1 | 1102 | AIR FORCE | MWC | 29 | 59.034483 | 66.034483 | -7.000000 | 37.931034 |
| 2 | 1103 | AKRON | MAC | 31 | 69.290323 | 64.161290 | 5.129032 | 70.967742 |
| 3 | 1104 | ALABAMA | SEC | 32 | 79.968750 | 76.406250 | 3.562500 | 59.375000 |
Generating data
There are two main techniques to systematically generate data:
Observational studies and
Controlled experiments
While we want to focus on the more analytical aspects of statistics, it’s still important to know a bit about how data is generated to help us assess its reliability.
Observational studies
An observational study is one where the data collection does not interfere with how the data arises. Examples include
Surveys,
Review of records,
Simple measurements
Controlled experiments
In an experiment, researchers actively work with the samples, applying treatments and observing effects.
Often (if not always), the intention is to explore a relationship or even a causal effect between two variables.
For that reason, it’s often important to distinguish between an explanatory variable and a response variable.
You can’t clearly articulate a research question without first clearly identifying the population that you’re working with, together with some related terms:
Population refers to the complete set of entities under consideration,
Sample refers to some subset of the population,
Parameter refers to some summary characteristic of the population, and
Statistic refers to some summary characteristic computed from a sample.
A major question in statistics is this: what does a sample statistic tell us about the corresponding population parameter?
Ideally, the sample will be a reasonably large, simple random sample.
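These four terms can be illustrated with a tiny simulation. The population below is entirely made up (6,300 "successes" out of 10,000); the point is just that a sample statistic lands reasonably close to the population parameter:

```python
import random

random.seed(0)

# A made-up population of 10,000 people, 6,300 of whom (say) are
# coffee drinkers.  The parameter is the true proportion: 0.63.
population = [1] * 6300 + [0] * 3700
parameter = sum(population) / len(population)

# A simple random sample of 500 people.  The statistic is the sample
# proportion, which should land close to the parameter.
sample = random.sample(population, 500)
statistic = sum(sample) / len(sample)
```

Re-running this with different seeds gives different statistics, all clustered near 0.63 - exactly the phenomenon that inference will let us quantify.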
Coffee example
In the coffee example:
The population might be all the coffee drinkers in the UK.
The sample consists of the 8,412 people who were surveyed.
A parameter of interest would be the proportion of all individuals in the population with stiff blood vessels.
A statistic would be the computed proportion of individuals from the sample with stiff blood vessels.
Cancer example
In the cancer example:
The population would be all women under the age of 59 with this specific cancer.
The sample would be the 672 women under the age of 59 who participated.
A parameter would be the survival rate among the population after 42 months.
A statistic would be the survival rate among the sample after 42 months.
Our class data
We’ve actually already done one observational study - namely, our class survey! Let’s take a look at that class data to
push our visualizations a bit further and
to apply some of our new terminology.
A slice of our data
Here’s a small sample of our data:
| age | eye_color | gender | height | hometown | major | psych_factor |
|---|---|---|---|---|---|---|
| 19 | Brown | Female | 5.333333 | Asheville 28806 | Soc | 2 |
| 19 | Brown | Female | 5.333333 | Covington GA 30016 | Soc | 3 |
| 19 | Brown | Female | 5.000000 | Andrews, NC 28901 | New Media | 3 |
| 31 | Brown | Male | 5.583333 | Weaverville, 28787 | Phil | 5 |
| 21 | Blue | Male | 6.250000 | Juneau, Alaska 99801 | Mass Comm | 4 |
| 26 | Blue | Female | 5.166667 | Waynesville, NC, 28786 | Env Sci | 5 |
| 18 | Hazel | Female | 5.750000 | Naples, FL 34109 | Und | 4 |
| 18 | Hazel | Male | 5.833333 | Sarasota, Florida 34240 | Bus | 4 |
Note that you don’t need to see a lot of the data to get a sense of what it looks like.
Data tables
This is another example of a data table or data frame.
Each row or case corresponds to one of the students in the class.
Each column or variable corresponds to one of the student’s responses.
We can see several variable types in the table:
The age and psych_factor variables look numeric and discrete,
The height variable is clearly numeric and continuous,
The other four variables are all categorical and nominal.
Numeric class data
Let’s take a look at some visualizations for the numeric data in our class. Here’s a histogram of the heights of people in our class, for example:
A computation
Generally, height is normally distributed, though it’s a bit skewed in this class. Then again, the total number of folks in the sample is only in the 40s.
We can compute the average heights of folks in the class to be about 5.6 feet or just over 5 foot 7 inches.
I wonder what that might tell us about the average height of UNCA students?
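The conversion from decimal feet to feet and inches is worth seeing once. A quick sketch, using the approximate class average quoted above:

```python
# Convert a mean height in decimal feet to feet and inches.
# The 5.6 here is the approximate class average quoted above.
mean_height = 5.6

feet = int(mean_height)                 # whole feet
inches = (mean_height - feet) * 12.0    # remainder, in inches
```

This gives 5 feet and about 7.2 inches - hence "just over 5 foot 7."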
A histogram for age
Here’s a histogram for the ages of folks in the class (including me). I’m not at all surprised to see that this is not normally distributed. Note that I’ve marked both the mean (in yellow) and the median (in red).
Box plots for numerical data
Alternatively, we might look at box plots for these same variables. These illustrate the so-called five-number summary of
min, 1st quartile, median, 3rd quartile, and max.
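The five-number summary is easy to compute with NumPy. The ages below are made up for illustration:

```python
import numpy as np

# A small, made-up set of ages to illustrate the five-number summary.
ages = np.array([18, 18, 19, 19, 19, 21, 26, 31])

summary = {
    "min": ages.min(),
    "Q1": np.percentile(ages, 25),
    "median": np.median(ages),
    "Q3": np.percentile(ages, 75),
    "max": ages.max(),
}
```

The box in a box plot spans Q1 to Q3, with a line at the median; the whiskers reach toward the min and max.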
Comparing numerical variables
Sometimes we’ll want to find out if there’s a relationship between two variables. In the following side-by-side boxplot, we examine the relationship between the numerical variable height and the categorical variable gender.
Categorical class data
Here’s a look at some pictures focusing on categorical class data.
Bar plots
Like a histogram, a bar chart represents value counts with vertical bars. Here’s a bar chart for the different eye colors in the class:
The same bar chart
A key difference between a bar chart and a histogram is that a bar chart represents categorical data. So, for example, the order of the bars doesn’t really matter. Here’s a bar chart for the same data, with the bars sorted in ascending order rather than alphabetically:
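In pandas, the two orderings correspond to two different sorts of the value counts. The eye colors below are a made-up stand-in for the class survey column:

```python
import pandas as pd

# Made-up eye colors standing in for the class survey column.
eyes = pd.Series(["Brown", "Brown", "Brown", "Blue", "Blue",
                  "Hazel", "Hazel", "Green"])

counts = eyes.value_counts()         # sorted by count, descending
by_size = counts.sort_values()       # ascending by count instead
alphabetical = counts.sort_index()   # alphabetical by category
```

Either ordering is a legitimate bar chart; that freedom is exactly what distinguishes categorical from numerical axes.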
Another bar plot
And here’s a bar plot for the majors in the class. Note that I had to standardize the responses a bit.
Bad pie
Lots of people seem to love pie charts. Often, though, there’s really no way to tell absolute magnitude and they can be confusing for data sets of moderate size. Here’s a pie chart for the majors in the class:
Good pie
I only favor pie charts when comparing two values of a categorical variable. In our data - section, gender or handedness are logical choices to illustrate with a pie chart. You get a good sense of relative magnitude or proportion.
Here’s a pie chart illustrating the number of students with just one major vs those with more than one major:
A real pie chart
And here’s my personal all-time favorite pie chart:
Geographic data
Geographic data often contains longitudes and latitudes, which can be used to plot data on a map. Here are our class hometowns, for example:
More on observational studies
Again, the basic idea is that the data collection does not interfere with how the data arises.
Example
Suppose we’d like to know the average height of women enrolled at UNCA. According to the UNCA Factbook, there were 1742 women enrolled in the Fall of 2023. I’m not even sure how many were enrolled this past year; it might be hard to round up all of them to measure their height.
Another approach
We have 22 students who identified themselves as female on our class survey. The average height of those women is about 5’4’’. We might hope that could be a good estimate of the average height of all women at UNCA.
This is a fundamental idea in statistics: we wish to study some parameter for a large population, but doing so is too unwieldy or downright impossible. Thus, we choose a manageable sample from the population and estimate the parameter with the corresponding statistic computed from the sample.
Terminology applied
In the example above,
The population is the set of all women enrolled at UNCA,
the parameter is the average height of women in the population,
the sample is the set of all women enrolled in this class, and
the statistic is the average height of women in the sample, which happens to be 5’4’’.
Key questions
How well can we expect the statistic to approximate the parameter?
What properties should the sample have in order to maximize the accuracy of our approximation?
The first question will consume much of the last two-thirds of our semester under the general topic of inference. As we’ll learn, we need a random sample of sufficient size.
This approach of sampling is in contrast to the idea of just grabbing the whole population - often called a census.
Potential issues
Volunteers
Convenience
Bad sampling frame
Bias (response and non-response)
Retrospective studies
A retrospective study is one where we review records from the past.
The study of women’s heights at UNCA is not a retrospective study.
If we reviewed grades at UNCA over the last 10 years to try to determine a relationship between GPA and class level, then we would be running a retrospective study.
More on designed experiments
In an experiment, researchers actively work with the samples, applying treatments and observing effects.
The experimental approach to the cancer example might go like so: Select 672 women under 59. Randomly assign them into one of two groups - one who takes the drug and one that does not. Examine the groups after 42 months.
Some key principles of experiments
Control: We split the group of patients into two groups:
A treatment group that receives the experimental drug
A control group that doesn’t receive the drug; they might receive a placebo.
Randomization: The groups should be chosen randomly to prevent bias and to even out confounding factors.
Replication: The results should be reproducible
Blocking: We might break the control and treatment groups into smaller groups or blocks.
Reduces variability in the groups
Allows us to identify confounding factors
Example: We might block by gender, age, or degree of risk.
If we find differences between the groups we can examine whether they are statistically significant or not.
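Random assignment itself is simple to implement. Here's a sketch using hypothetical patient IDs (672 of them, to match the cancer example): shuffle the list, then split it in half.

```python
import random

random.seed(1)

# Hypothetical patient IDs; 672 matches the cancer example above.
patients = list(range(672))

# Shuffle, then split in half: first half treatment, second half control.
random.shuffle(patients)
treatment = set(patients[:336])
control = set(patients[336:])
```

Shuffling first is what makes the assignment random; every patient is equally likely to land in either group, which is exactly the randomization principle above.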
Sampling strategies
Good random sampling is surprisingly hard to achieve in practice. Here are a few strategies to implement sampling.
Simple random samples: This is the ideal and is simple to state - choose \(n\) people from the population independently and with equal probability. It’s quite hard to achieve in practice, though.
Other strategies:
Convenience
Ex: Survey the folks on your dorm floor
Stratified samples:
Ex: Survey college students and stratify by class
Clustered samples
Ex: Door-to-door survey picking one block out of each neighborhood
Systematic sampling
Ex: Survey every tenth person you see
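A couple of these strategies are easy to sketch in code. The population below is a hypothetical list of 1,000 people, identified by number:

```python
import random

random.seed(2)
population = list(range(1000))   # stand-in for a population of 1,000 people

# Simple random sample: every subset of size 50 is equally likely.
srs = random.sample(population, 50)

# Systematic sample: a random starting point, then every 20th person.
start = random.randrange(20)
systematic = population[start::20]
```

Note that the systematic sample also has 50 people, but it can never contain two people who are fewer than 20 apart - so not every subset of size 50 is possible, which is why it's not a simple random sample.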
Causation and correlation
Often we are interested in the relationship between two variables - specifically, is there a correlation or even a causal relationship between them? In the context of a study, we should clearly identify:
Independent or explanatory variables
Dependent or response variables
Confounding or lurking variables
Generally, an explanatory variable is one that a researcher suspects might affect the response variable. Correlation, however, does not always imply causation.
Example: We suspect that folks who use more sunscreen have a higher incidence of skin cancer. What are the explanatory and response variables - as well as any confounding variables?