Today’s presentation is short, since we’ll spend some time practicing for the quiz. We’re going to take a quick look at categorical data, though.

This is mostly section 2.2 of our textbook.

CDC Data

That CDC data set will be your favorite soon, if it’s not already:

import pandas as pd
cdc_data = pd.read_csv('https://www.marksmath.org/data/cdc.csv')
cdc_data.head()

     genhlth  exerany  hlthplan  smoke100  height  weight  wtdesire  age gender
0       good        0         1         0      70     175       175   77      m
1       good        0         1         1      64     125       115   33      f
2       good        1         1         1      60     105       105   49      f
3       good        1         1         0      66     132       124   42      f
4  very good        0         1         0      61     150       130   55      f

Recall that it’s got five categorical variables.

The exerany variable

You already know about the mean of the exerany variable:

cdc_data.exerany.mean()

0.7457

Alternatively, we might call this the proportion of people who exercise some.
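
To see why the mean of a 0/1 variable is the same thing as a proportion, here is a minimal sketch on a small made-up sample. The eight values below are hypothetical and are not taken from the CDC data; they just stand in for an exerany-style column.

```python
import pandas as pd

# Hypothetical toy data (not from the CDC set): 1 = exercises, 0 = does not.
exerany = pd.Series([1, 0, 1, 1, 0, 1, 1, 1])

# The mean adds up the ones and divides by the number of rows ...
mean_value = exerany.mean()

# ... which is exactly (# of ones) / (total), i.e. a proportion.
proportion = (exerany == 1).sum() / len(exerany)

print(mean_value, proportion)
```

Both numbers agree because adding up a column of zeros and ones just counts the ones.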

Other proportions

The idea of proportion applies broadly to categorical data and is somewhat analogous to the mean for numerical data. It’s defined simply as \[
\frac{\text{\# occurrences}}{\text{total}}.
\]

Example

For example, there are 20000 folks in the CDC data set, and it looks like 4657 of them are in excellent health:

cdc_data.genhlth.value_counts()

very good 6972
good 5675
excellent 4657
fair 2019
poor 677
Name: genhlth, dtype: int64

Example (cont)

Thus the proportion of folks in excellent health is

\[
\frac{4657}{20000} = 0.23285.
\]
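
The same arithmetic can be sketched for every category at once. The block below rebuilds the genhlth counts by hand from the tally shown earlier, so it runs without downloading anything; the names counts and proportions are just illustrative.

```python
import pandas as pd

# Counts copied by hand from the genhlth tally shown earlier.
counts = pd.Series(
    {"very good": 6972, "good": 5675, "excellent": 4657, "fair": 2019, "poor": 677}
)

# Proportion = (# occurrences) / total, computed for each category.
proportions = counts / counts.sum()

print(proportions["excellent"])
```

With the full data frame loaded, `cdc_data.genhlth.value_counts(normalize=True)` should give the same proportions directly.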

Relating variables

If we want to see if two categorical variables are related, we might use a contingency table. Here’s a contingency table relating exerany and genhlth, for example: