Categorical data

After learning the basics of data and examining numerical data a bit more closely, today we'll jump into a closer look at categorical data. This is mostly section 2.2 of our textbook.

Re-examining our class data

Let's start by taking another look at the categorical data from our class survey. Here are the last few rows of that data:

| How old are you? | How tall are you? [Feet] | How tall are you? [Inches] | Are you left or right handed? | What is your gender (optional): | Choose your eye color | What is your major? |
|---|---|---|---|---|---|---|
| 20.0 | 5.0 | 5.0 | Right | Male | Blue | New Media |
| 21.0 | 5.0 | 9.0 | Right | Male | Brown | Computer Science |
| 18.0 | 5.0 | 6.0 | Right | Female | Blue | Environmental studies |

The (abbreviated) categorical variables are handedness, gender, eye color, and major.

Bad Pie

You might recall that I don't really care for pie charts. It's hard to tell for sure which is larger between green and hazel here. Nonetheless, it's worth understanding how to read them, since we might see them in homework.

Gender

You might also recall that one situation where a pie chart is a reasonable idea is when you're comparing two values of one categorical variable. Thus, I suppose that gender is one example where a pie chart might make sense.

Good power bar

Bar plots make it much easier to distinguish relative sizes. We can see much more easily that the green class is larger than the hazel. We can also indicate absolute quantities with a single axis.

Proportions

Recall that the mean is the fundamental measure of location for numerical data. The corresponding notion for categorical data is the proportion.

The proportion is simply the ratio of the occurrence of some value of a categorical variable to the total number of occurrences.

For example, we have 38 folks in the class who completed our survey, 4 of whom have green eyes. The proportion of green-eyed folks is, thus, $$\frac{4}{38} \approx 0.10526$$ or a little over $10\%$.
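The arithmetic above is easy to check in Python. This is only a sketch: the variable names are made up for illustration, and the counts are the ones quoted in the text. It also shows how a proportion ties back to the mean, since a proportion is the mean of a 0/1 indicator.

```python
# Counts quoted in the text: 38 survey responses, 4 with green eyes.
total_responses = 38
green_eyed = 4

# A proportion is a count divided by the total number of observations.
proportion_green = green_eyed / total_responses
print(round(proportion_green, 5))  # 0.10526

# Equivalently, it is the mean of a 0/1 indicator (1 = green eyes).
indicator = [1] * green_eyed + [0] * (total_responses - green_eyed)
mean_of_indicator = sum(indicator) / len(indicator)
print(mean_of_indicator == proportion_green)
```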

Tables

Of course, the data in a bar chart could also be displayed as a simple table.

Choose your eye color
Brown 16
Blue 14
Green 4
Hazel 2

Contingency tables

We can display two categorical variables in a so-called contingency table:

| Eye color (rows) by gender (columns) | Female | Male |
|---|---|---|
| Blue | 7 | 6 |
| Brown | 7 | 8 |
| Green | 4 | 0 |
| Hazel | 2 | 0 |
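Here is one way such a table might be built with pandas. The class survey file isn't shown here, so this sketch rebuilds one row per student from the counts in the table above; the short column names `eye_color` and `gender` are made up for the example.

```python
import pandas as pd

# Cell counts copied from the contingency table above.
cell_counts = {
    ("Blue", "Female"): 7, ("Blue", "Male"): 6,
    ("Brown", "Female"): 7, ("Brown", "Male"): 8,
    ("Green", "Female"): 4, ("Green", "Male"): 0,
    ("Hazel", "Female"): 2, ("Hazel", "Male"): 0,
}

# Expand the counts into one row per student (hypothetical column names).
rows = [
    {"eye_color": eye, "gender": gender}
    for (eye, gender), n in cell_counts.items()
    for _ in range(n)
]
survey = pd.DataFrame(rows)

# crosstab counts how often each (eye color, gender) pair occurs.
table = pd.crosstab(survey["eye_color"], survey["gender"])
print(table)
```

Pairs that never occur, like green-eyed males in this data, show up as 0 in the result.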

Side by side and stacked bar charts

Bar charts can be used to compare two categorical variables by stacking bars or displaying them side by side.
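As a rough sketch, both styles can be drawn straight from a table of counts using the pandas `plot.bar` method. The counts below are typed in from the class contingency table, and the `matplotlib.use("Agg")` line is only there so the script also runs outside a notebook; in Colab you can drop it.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd

# Counts from the class contingency table (rows: eye color, columns: gender).
table = pd.DataFrame(
    {"Female": [7, 7, 4, 2], "Male": [6, 8, 0, 0]},
    index=["Blue", "Brown", "Green", "Hazel"],
)

# Side by side: one bar per gender within each eye color.
ax_side = table.plot.bar(rot=0, title="Side by side")

# Stacked: the same bars piled on top of one another.
ax_stacked = table.plot.bar(stacked=True, rot=0, title="Stacked")
```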

Mosaic plots

Another graphical tool to compare two categorical variables is a mosaic plot.

Normalized contingency tables

The relationship between a mosaic plot and a contingency table is, perhaps, clearer when the entries in the table are normalized to indicate proportions.

| Eye color (rows) by gender (columns) | Female | Male | All |
|---|---|---|---|
| Blue | 0.205882 | 0.176471 | 0.382353 |
| Brown | 0.205882 | 0.235294 | 0.441176 |
| Green | 0.117647 | 0.000000 | 0.117647 |
| Hazel | 0.058824 | 0.000000 | 0.058824 |
| All | 0.588235 | 0.411765 | 1.000000 |
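The normalization is just division by the grand total, followed by row and column sums. Here is a small sketch that reproduces the numbers above from the raw counts; the variable names are made up for illustration. Starting from raw survey rows, `pd.crosstab` with `normalize='all'` and `margins=True` should give the same kind of table in one step.

```python
import pandas as pd

# Raw counts (rows: eye color, columns: gender).
counts = pd.DataFrame(
    {"Female": [7, 7, 4, 2], "Male": [6, 8, 0, 0]},
    index=["Blue", "Brown", "Green", "Hazel"],
)

# Divide every cell by the grand total (34 students here).
proportions = counts / counts.values.sum()

# Add the "All" margins: row totals first, then column totals.
proportions["All"] = proportions.sum(axis=1)
proportions.loc["All"] = proportions.sum(axis=0)

print(proportions.round(6))
```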

More data (with code)

As with numerical data, we are interested in being able to manipulate large sets of categorical data on the computer. So, let's take another look at our CDC data and explore how to do three simple things with it:

  • Generate a table
  • Compute a proportion
  • Make a bar plot

The CDC data

Just in case you've forgotten, here's how to grab the CDC data:

import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')
df.head()
genhlth exerany hlthplan smoke100 height weight wtdesire age gender
0 good 0 1 0 70 175 175 77 m
1 good 0 1 1 64 125 115 33 f
2 good 1 1 1 60 105 105 49 f
3 good 1 1 0 66 132 124 42 f
4 very good 0 1 0 61 150 130 55 f

A table

We can generate a table for the genhlth variable by selecting that column from the data frame and calling its value_counts method:

value_counts = df['genhlth'].value_counts()
value_counts
very good    6972
good         5675
excellent    4657
fair         2019
poor          677
Name: genhlth, dtype: int64
  • If you want a groovier looking version in the notebook, try value_counts.to_frame().
  • If you want markdown to copy and paste into the forum, try value_counts.to_markdown() (this relies on the tabulate package being installed).

A proportion

Once we have a table, it's a simple matter to compute a proportion. For example, the proportion of folks who are in excellent health is $4657/20000 \approx 23.3\%$.

Sometimes we might want to automate this type of computation by doing something like so:

value_counts['excellent']/len(df)
0.23285
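If you want the proportion for every category at once, `value_counts` accepts `normalize=True`. To keep this sketch self-contained (no download needed), it builds a stand-in for `df['genhlth']` from the counts printed above; with the real data you would call the method on the column directly.

```python
import pandas as pd

# Stand-in for df['genhlth']: one label per respondent, using the
# counts shown in the table above (20000 rows in total).
counts = {"very good": 6972, "good": 5675, "excellent": 4657,
          "fair": 2019, "poor": 677}
genhlth = pd.Series([label for label, n in counts.items() for _ in range(n)])

# normalize=True returns proportions instead of raw counts.
proportions = genhlth.value_counts(normalize=True)
print(proportions)
```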

A bar plot

Again, once we have value_counts, we can use its plot.bar method:

value_counts.plot.bar(figsize=(12,7), rot = 0);
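By default the bars come out in order of frequency. Since the health ratings have a natural order, you might prefer to plot them from best to worst. One possible approach, sketched here with the counts typed in by hand, is to `reindex` the counts before plotting; the name `health_order` is made up for this example, and the `matplotlib.use("Agg")` line is only needed outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd

# Counts as produced by df['genhlth'].value_counts() above.
value_counts = pd.Series(
    {"very good": 6972, "good": 5675, "excellent": 4657,
     "fair": 2019, "poor": 677}
)

# Hypothetical ordering from best to worst health rating.
health_order = ["excellent", "very good", "good", "fair", "poor"]
ordered_counts = value_counts.reindex(health_order)

# Same plotting call as before, now with the bars in rating order.
ax = ordered_counts.plot.bar(figsize=(12, 7), rot=0)
```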

An assignment

Note that there is a forum assignment that asks you to create a bar plot and compute a proportion for your randomly generated data.

You should be able to emulate the code in this presentation using Colab. Just be sure that you're logged into your UNCA email account in your web browser before following the link.