Categorical data

Wed, Aug 28, 2024

More on categorical data

Today’s presentation is short, since we’ll spend some time practicing for the quiz. We’re going to take a quick look at categorical data, though.

This is mostly section 2.2 of our textbook.

CDC Data

That CDC data set will be your favorite soon, if it’s not already:

import pandas as pd
cdc_data = pd.read_csv('https://www.marksmath.org/data/cdc.csv')
cdc_data.head()

	genhlth	exerany	hlthplan	smoke100	height	weight	wtdesire	age	gender
0	good	0	1	0	70	175	175	77	m
1	good	0	1	1	64	125	115	33	f
2	good	1	1	1	60	105	105	49	f
3	good	1	1	0	66	132	124	42	f
4	very good	0	1	0	61	150	130	55	f

Recall that it’s got five categorical variables.

The `exerany` variable

You already know about the mean of the exerany variable:

cdc_data.exerany.mean()

0.7457

Alternatively, we might call this the proportion of people who exercise some.
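To see why the mean of a 0/1 variable is a proportion, here is a minimal sketch. The values in `toy_exerany` are made up for illustration and are not the CDC data. Adding up the ones counts the "yes" answers, and dividing by the number of responses gives the proportion.

```python
import pandas as pd

# Hypothetical 0/1 responses, for illustration only (not the CDC data)
toy_exerany = pd.Series([1, 0, 1, 1, 0, 1, 1, 1])

# The mean of a 0/1 variable ...
mean_value = toy_exerany.mean()

# ... equals (number of ones) / (total number of responses)
proportion = (toy_exerany == 1).sum() / len(toy_exerany)

print(mean_value, proportion)
```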

Other proportions

The idea of proportion applies broadly to categorical data and is somewhat analogous to the mean for numerical data. It’s defined simply as \[ \frac{\text{# occurrences}}{\text{total}}. \]

Example

For example, there are 20000 folks in the CDC data set, and it looks like 4657 of them are in excellent health:

genhlth_counts = cdc_data.genhlth.value_counts()
genhlth_counts

very good    6972
good         5675
excellent    4657
fair         2019
poor          677
Name: genhlth, dtype: int64

Example (cont)

Thus, the proportion of folks in excellent health is

\[ \frac{4657}{20000} = 0.23285. \]
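The same arithmetic can be done for every health category at once. The sketch below rebuilds the counts by hand from the `value_counts()` output shown earlier, so that it runs without downloading the data. The name `genhlth_props` is just an illustrative choice.

```python
import pandas as pd

# Counts copied from the value_counts() output shown earlier
genhlth_counts = pd.Series(
    {"very good": 6972, "good": 5675, "excellent": 4657, "fair": 2019, "poor": 677},
    name="genhlth",
)

# Divide each count by the total to get one proportion per category
genhlth_props = genhlth_counts / genhlth_counts.sum()
print(genhlth_props)
```

With the full data frame loaded, `cdc_data.genhlth.value_counts(normalize=True)` should give the same proportions in one step.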

Relating variables

If we want to see if two categorical variables are related, we might use a contingency table. Here’s a contingency table relating exerany and genhlth, for example:

pd.crosstab(cdc_data.exerany, cdc_data.genhlth,
           normalize=False, margins=True)

genhlth	excellent	fair	good	poor	very good	All
exerany
0	762	857	1731	384	1352	5086
1	3895	1162	3944	293	5620	14914
All	4657	2019	5675	677	6972	20000
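Raw counts are hard to compare because far more people report exercising than not. One common way to read a table like this is to convert each row to proportions. The sketch below does that with the counts copied from the table above, leaving out the `All` margins. The names `counts` and `row_props` are illustrative.

```python
import pandas as pd

# Counts copied from the contingency table above (margins left out)
counts = pd.DataFrame(
    {
        "excellent": [762, 3895],
        "fair": [857, 1162],
        "good": [1731, 3944],
        "poor": [384, 293],
        "very good": [1352, 5620],
    },
    index=pd.Index([0, 1], name="exerany"),
)

# Divide each row by its row total, so that each row sums to 1
row_props = counts.div(counts.sum(axis=1), axis=0)
print(row_props.round(3))
```

With the full data frame, `pd.crosstab(cdc_data.exerany, cdc_data.genhlth, normalize='index')` should produce the same row proportions. Roughly 15% of those who don't exercise report excellent health, versus roughly 26% of those who do, which suggests that the two variables are related.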

Mosaic plots

A visualization of a contingency table is sometimes called a mosaic plot: