Categorical data¶

After learning the basics of data and examining numerical data a bit more closely, today we'll jump into a closer look at categorical data. This is mostly section 2.2 of our textbook.

Re-examining our class data¶

Let's start by taking another look at the categorical data from our class survey. Here are a few rows of that data:

| | Response ID | How old are you? | How tall are you? [Feet] | How tall are you? [Inches] | Are you left or right handed? | What is your gender (optional): | Choose your eye color | What is your major? | Where is your hometown? |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 31 | 6 | 2 | Right | Male | Hazel | Undeclared | 36.60024;-121.89468 |
| 1 | 3 | 18 | 5 | 11 | Right | Male | Hazel | Health and Wellness | 36.02411;-78.4774 |
| 2 | 4 | 18 | 5 | 8 | Right | NaN | Hazel | economics | 35.7721;-78.63861 |

The (abbreviated) categorical variables are handedness, gender, eyecolor, and major.

You might recall that I don't really care for pie charts. In a pie chart of our eye color data, it's hard to tell for sure which slice is larger, green or hazel. Nonetheless, it's worth understanding how to read them, since we might see them in homework.

Good pie¶

One situation when a pie chart is a reasonable idea is when you're comparing two values for one categorical variable:

Gender¶

Thus, I suppose that gender might be one example where a pie chart might make sense.

Good power bar¶

Bar plots make it much easier to distinguish relative sizes. We can see much more easily that the green class is larger than the hazel. We can also indicate absolute quantities with a single axis.

Proportions¶

Recall that the mean is the fundamental measure of location for numerical data. The corresponding notion for categorical data is the proportion.

The proportion is simply the ratio of the occurrence of some value of a categorical variable to the total number of occurrences.

For example, we have 39 folks in the class who completed our survey, 6 of whom have green eyes. The proportion of green-eyed folks is, thus, $$\frac{6}{39} \approx 0.1538$$ or a little over $15\%$.
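Here's a minimal sketch of that computation in plain Python. The function name `proportion` and the stand-in list `eye_colors` are made up for illustration; only the counts 6 and 39 come from the survey.

```python
def proportion(values, target):
    """Fraction of the entries in `values` that are equal to `target`."""
    return sum(v == target for v in values) / len(values)

# Hypothetical stand-in for the eye color column:
# 6 green-eyed respondents out of 39 total.
eye_colors = ["Green"] * 6 + ["Not green"] * 33

p = proportion(eye_colors, "Green")
print(round(p, 4))
```

The point is just that a proportion is a count divided by a total, which is the same recipe as a mean of 0/1 indicators.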

Tables¶

Of course, the data in a bar chart could also be displayed as a simple table.

Brown 18
Blue 9
Green 6
Hazel 5
Other 1

Contingency tables¶

We can display two categorical variables in a so-called contingency table:

| Eye color | Female | Male |
|---|---|---|
| Blue | 3 | 6 |
| Brown | 6 | 12 |
| Green | 3 | 3 |
| Hazel | 1 | 3 |
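If you'd like to build a table like this yourself, `pd.crosstab` is one way to do it. The sketch below rebuilds one row per respondent from the cell counts in the table above; the data frame `survey` and its column names `eye_color` and `gender` are my own stand-ins, not the actual survey columns.

```python
import pandas as pd

# Cell counts copied from the contingency table: (eye color, gender) -> count
counts = {
    ("Blue", "Female"): 3, ("Blue", "Male"): 6,
    ("Brown", "Female"): 6, ("Brown", "Male"): 12,
    ("Green", "Female"): 3, ("Green", "Male"): 3,
    ("Hazel", "Female"): 1, ("Hazel", "Male"): 3,
}

# Rebuild one row per respondent (hypothetical column names)
rows = [(eye, gender) for (eye, gender), n in counts.items() for _ in range(n)]
survey = pd.DataFrame(rows, columns=["eye_color", "gender"])

# Cross-tabulate: eye color down the rows, gender across the columns
table = pd.crosstab(survey["eye_color"], survey["gender"])
print(table)
```

Note that the cells add up to 37 rather than 39, since the table only counts respondents who have a value for both variables.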

Side by side and stacked bar charts¶

Bar charts can be used to compare two categorical variables by stacking bars or displaying them side by side.
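As a rough sketch, both versions can be drawn with the `plot.bar` method of a pandas data frame; passing `stacked=True` stacks the bars instead of placing them side by side. The data frame `table` below is typed in by hand from our eye color and gender counts, and the variable names are just for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Eye color by gender counts from the class survey, entered by hand
table = pd.DataFrame(
    {"Female": [3, 6, 3, 1], "Male": [6, 12, 3, 3]},
    index=["Blue", "Brown", "Green", "Hazel"],
)

# Two panels: grouped bars on the left, stacked bars on the right
fig, (ax_side, ax_stacked) = plt.subplots(1, 2, figsize=(12, 5))
table.plot.bar(ax=ax_side, rot=0, title="Side by side")
table.plot.bar(ax=ax_stacked, rot=0, stacked=True, title="Stacked")
```

Side by side bars make it easier to compare the two genders within an eye color, while stacked bars make the total for each eye color easier to read.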

Mosaic plots¶

Another graphical tool to compare two categorical variables is a mosaic plot.

Normalized contingency tables¶

The relationship between a mosaic plot and a contingency table is, perhaps, clearer when the entries in the table are normalized to indicate proportions.

| Eye color | Female | Male | All |
|---|---|---|---|
| Blue | 0.081081 | 0.162162 | 0.243243 |
| Brown | 0.162162 | 0.324324 | 0.486486 |
| Green | 0.081081 | 0.081081 | 0.162162 |
| Hazel | 0.027027 | 0.081081 | 0.108108 |
| All | 0.351351 | 0.648649 | 1.000000 |
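One way to get a normalized table like this is to pass `normalize="all"` and `margins=True` to `pd.crosstab`, which divides every cell by the grand total and adds the `All` row and column. As before, this is only a sketch: the data frame `survey` and its column names are stand-ins rebuilt from the counts, not the real survey data.

```python
import pandas as pd

# Cell counts from the contingency table: (eye color, gender) -> count
counts = {
    ("Blue", "Female"): 3, ("Blue", "Male"): 6,
    ("Brown", "Female"): 6, ("Brown", "Male"): 12,
    ("Green", "Female"): 3, ("Green", "Male"): 3,
    ("Hazel", "Female"): 1, ("Hazel", "Male"): 3,
}

# One row per respondent (hypothetical column names)
rows = [(eye, gender) for (eye, gender), n in counts.items() for _ in range(n)]
survey = pd.DataFrame(rows, columns=["eye_color", "gender"])

# Divide every cell by the grand total and include the "All" margins
normalized = pd.crosstab(
    survey["eye_color"], survey["gender"], normalize="all", margins=True
)
print(normalized.round(6))
```

Each entry is a count divided by 37, so the blue-eyed female entry is $3/37 \approx 0.081081$.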

More data (with code)¶

As with numerical data, we are interested in being able to manipulate large sets of categorical data on the computer. So, let's take another look at our CDC data and explore how to do three simple things with it:

• Generate a table
• Compute a proportion
• Make a bar plot

The CDC data¶

Just in case you've forgotten, here's how to grab the CDC data:

import pandas as pd

genhlth exerany hlthplan smoke100 height weight wtdesire age gender
0 good 0 1 0 70 175 175 77 m
1 good 0 1 1 64 125 115 33 f
2 good 1 1 1 60 105 105 49 f
3 good 1 1 0 66 132 124 42 f
4 very good 0 1 0 61 150 130 55 f

A table¶

We can generate a table for the genhlth variable by selecting that column from the data frame and calling its value_counts method:

value_counts = df['genhlth'].value_counts()
value_counts

very good    6972
good         5675
excellent    4657
fair         2019
poor          677
Name: genhlth, dtype: int64

• If you want a groovier looking version in the notebook, try value_counts.to_frame().
• If you want markdown to copy and paste into the forum, try value_counts.to_markdown().

A proportion¶

Once we have a table, it's a simple matter to compute a proportion. For example, the proportion of folks who are in excellent health is $4657/20000 \approx 23.3\%$.

Sometimes we might want to automate this type of computation by doing something like so:

value_counts['excellent']/len(df)

0.23285
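If you want all the proportions at once, you can divide the whole table by its total. The sketch below types the counts in by hand as a pandas Series so that it runs on its own; with the actual data frame loaded, I believe `df['genhlth'].value_counts(normalize=True)` gives the same result directly.

```python
import pandas as pd

# Counts copied from the genhlth table, entered by hand
counts = pd.Series(
    {"very good": 6972, "good": 5675, "excellent": 4657, "fair": 2019, "poor": 677},
    name="genhlth",
)

# Divide each count by the total number of responses
proportions = counts / counts.sum()
print(proportions)
```

The entry for excellent should agree with the value computed above, and the proportions should add up to 1.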

A bar plot¶

Again, once we have value_counts, we can use its plot.bar method:

value_counts.plot.bar(figsize=(12,7), rot = 0);


An assignment¶

Note that there is a forum assignment that asks you to create a bar plot and compute a proportion for your randomly generated data.