Categorical data

Wed, Aug 28, 2024

More on categorical data

Today’s presentation is short, since we’ll spend some time practicing for the quiz. We’re going to take a quick look at categorical data, though.

This is mostly section 2.2 of our textbook.

CDC Data

That CDC data set will be your favorite soon, if it’s not already:

import pandas as pd
cdc_data = pd.read_csv('https://www.marksmath.org/data/cdc.csv')
cdc_data.head()
     genhlth  exerany  hlthplan  smoke100  height  weight  wtdesire  age  gender
0       good        0         1         0      70     175       175   77       m
1       good        0         1         1      64     125       115   33       f
2       good        1         1         1      60     105       105   49       f
3       good        1         1         0      66     132       124   42       f
4  very good        0         1         0      61     150       130   55       f

Recall that it’s got five categorical variables: genhlth, exerany, hlthplan, smoke100, and gender.

The exerany variable

You already know about the mean of the exerany variable:

cdc_data.exerany.mean()
0.7457

Alternatively, we might call this the proportion of people who exercise some.
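
Here's a minimal sketch of why the mean of a 0/1 variable is a proportion. The little series below is made-up toy data (not the CDC data), just to make the arithmetic easy to check by hand.

```python
import pandas as pd

# Hypothetical toy indicator column: 1 = exercised, 0 = did not.
exerany = pd.Series([1, 0, 1, 1, 0, 1, 1, 1])

# The mean adds up the 0s and 1s and divides by the number of rows.
mean_value = exerany.mean()

# The proportion counts the 1s and divides by the number of rows.
proportion = (exerany == 1).sum() / len(exerany)

print(mean_value, proportion)
```

Both numbers come out the same, since adding up 0s and 1s is just counting the 1s.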

Other proportions

The idea of proportion applies broadly to categorical data and is somewhat analogous to the mean for numerical data. It’s defined simply as \[ \frac{\text{number of occurrences}}{\text{total}}. \]

Example

For example, there are 20000 folks in the CDC Data set and it looks like 4657 of them are in excellent health:

genhlth_counts = cdc_data.genhlth.value_counts()
genhlth_counts
very good    6972
good         5675
excellent    4657
fair         2019
poor          677
Name: genhlth, dtype: int64

Example (cont)

Thus the proportion of folks in excellent health is

\[ \frac{4657}{20000} = 0.23285. \]
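
We can get every proportion at once by dividing the counts by their total. As a sketch, the series below is typed in by hand from the counts shown above rather than read from the CSV, so it runs without a download.

```python
import pandas as pd

# Counts copied by hand from the value_counts() output above.
genhlth_counts = pd.Series(
    {"very good": 6972, "good": 5675, "excellent": 4657, "fair": 2019, "poor": 677},
    name="genhlth",
)

# Total number of respondents across all categories.
total = genhlth_counts.sum()

# Proportion for each category: occurrences divided by the total.
genhlth_props = genhlth_counts / total
print(genhlth_props)
```

With the full data frame loaded, cdc_data.genhlth.value_counts(normalize=True) should give the same proportions in one step.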

Relating variables

If we want to see if two categorical variables are related, we might use a contingency table. Here’s a contingency table relating exerany and genhlth, for example:

pd.crosstab(cdc_data.exerany, cdc_data.genhlth,
           normalize=False, margins=True)
genhlth  excellent  fair  good  poor  very good    All
exerany
0              762   857  1731   384       1352   5086
1             3895  1162  3944   293       5620  14914
All           4657  2019  5675   677       6972  20000
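
Raw counts are hard to compare when the rows have very different totals, so one option is to turn each row into proportions. The sketch below types the cell counts in by hand from the table above (margins left out) and divides each row by its own total.

```python
import pandas as pd

# Cell counts copied by hand from the contingency table above (no margins).
table = pd.DataFrame(
    {
        "excellent": [762, 3895],
        "fair": [857, 1162],
        "good": [1731, 3944],
        "poor": [384, 293],
        "very good": [1352, 5620],
    },
    index=pd.Index([0, 1], name="exerany"),
)

# Row totals: how many people did not (0) and did (1) exercise.
row_totals = table.sum(axis=1)

# Divide each row by its own total, so every row sums to 1.
row_props = table.div(row_totals, axis=0)
print(row_props.round(3))
```

About 15% of the folks who don't exercise report excellent health, compared to about 26% of those who do, which suggests the two variables are related. With the full data frame, passing normalize='index' to pd.crosstab should produce the same row proportions directly.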

Mosaic plots

A visualization of a contingency table is sometimes called a mosaic plot: