Today’s presentation is short, since we’ll spend some time practicing for the quiz. We’re going to take a quick look at categorical data, though.
This is mostly section 2.2 of our textbook.
CDC Data
That CDC data set will be your favorite soon, if it’s not already:
import pandas as pd
cdc_data = pd.read_csv('https://www.marksmath.org/data/cdc.csv')
cdc_data.head()
     genhlth  exerany  hlthplan  smoke100  height  weight  wtdesire  age gender
0       good        0         1         0      70     175       175   77      m
1       good        0         1         1      64     125       115   33      f
2       good        1         1         1      60     105       105   49      f
3       good        1         1         0      66     132       124   42      f
4  very good        0         1         0      61     150       130   55      f
Recall that it’s got five categorical variables: genhlth, exerany, hlthplan, smoke100, and gender.
The exerany variable
You already know about the mean of the exerany variable:
cdc_data.exerany.mean()
0.7457
Alternatively, we might call this the proportion of people who exercise some.
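The mean works as a proportion here because exerany is coded as 0 or 1, so adding up the values just counts the 1s. Here is a minimal sketch of that idea; the `exercised` series is a small made-up example, not a sample from the CDC data:

```python
import pandas as pd

# Hypothetical 0/1 responses, made up for illustration (not taken from the CDC data)
exercised = pd.Series([1, 0, 1, 1, 0, 1, 1, 1])

# Proportion computed directly: number of 1s divided by the total count
proportion = (exercised == 1).sum() / len(exercised)

# The mean of a 0/1 variable gives the same number
mean_value = exercised.mean()

print(proportion, mean_value)
```

Both numbers should agree, which is why `cdc_data.exerany.mean()` can be read as a proportion.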
Other proportions
The idea of proportion applies broadly to categorical data and is somewhat analogous to the mean for numerical data. It’s defined simply as \[
\frac{\text{# occurrences}}{\text{total}}.
\]
Example
For example, there are 20000 folks in the CDC data set and it looks like 4657 of them are in excellent health:
very good    6972
good         5675
excellent    4657
fair         2019
poor          677
Name: genhlth, dtype: int64
Example (cont)
Thus, the proportion of folks in excellent health is
\[
\frac{4657}{20000} = 0.23285.
\]
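The same arithmetic can be done for every category at once. The sketch below rebuilds the genhlth counts by hand from the tally shown earlier, so it runs without downloading anything; the variable names `counts`, `total`, and `proportions` are just illustrative choices:

```python
import pandas as pd

# Counts copied from the genhlth tally shown earlier
counts = pd.Series(
    {"very good": 6972, "good": 5675, "excellent": 4657, "fair": 2019, "poor": 677},
    name="genhlth",
)

total = counts.sum()          # total number of respondents
proportions = counts / total  # proportion for each category

print(proportions)
```

With the full data frame loaded, `cdc_data.genhlth.value_counts(normalize=True)` should produce the same proportions directly.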
Relating variables
If we want to see if two categorical variables are related, we might use a contingency table. Here’s a contingency table relating exerany and genhlth, for example: