Categorical data¶

After learning the basics of data and examining numerical data a bit more closely, today we'll jump into a closer look at categorical data. This is mostly section 2.2 of our textbook.

Re-examining our class data¶

Let's start by taking another look at the categorical data from our class survey. Here are a few rows of that data:

	Response ID	How old are you?	How tall are you? [Feet]	How tall are you? [Inches]	Are you left or right handed?	What is your gender (optional):	Choose your eye color	What is your major?	Where is your hometown?
0	2	31	6	2	Right	Male	Hazel	Undeclared	36.60024;-121.89468
1	3	18	5	11	Right	Male	Hazel	Health and Wellness	36.02411;-78.4774
2	4	18	5	8	Right	NaN	Hazel	economics	35.7721;-78.63861

The (abbreviated) categorical variables are handedness, gender, eyecolor, and major.

Bad Pie¶

You might recall that I don't really care for pie charts. It's hard to tell for sure which is larger here, green or hazel. Nonetheless, it's worth understanding how to read them, since we might see them in homework.

Good pie¶

One situation when a pie chart is a reasonable idea is when you're comparing two values for one categorical variable:

A reasonable pie chart with actual pie

Gender¶

Thus, I suppose that gender might be one example where a pie chart might make sense.

Good power bar¶

Bar plots make it much easier to distinguish relative sizes. We can see much more easily that the green class is larger than the hazel. We can also indicate absolute quantities with a single axis.

Proportions¶

Recall that the mean is the fundamental measure of location for numerical data. The corresponding notion for categorical data is the proportion.

The proportion is simply the ratio of the number of occurrences of some value of a categorical variable to the total number of observations.

For example, we have 39 folks in the class who completed our survey, 6 of whom have green eyes. The proportion of green-eyed folks is, thus, $$\frac{6}{39} \approx 0.1538$$ or a little over $15\%$.

Tables¶

Of course, the data behind a bar chart could also be displayed as a simple table.

	Choose your eye color
Brown	18
Blue	9
Green	6
Hazel	5
Other	1
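The proportion computation described earlier can be applied to every row of this table at once. Here's a minimal sketch, assuming we type the counts from the table into a pandas Series by hand; the name eye_counts is just a label chosen for this illustration.

```python
import pandas as pd

# Counts copied by hand from the eye color table above.
eye_counts = pd.Series(
    {"Brown": 18, "Blue": 9, "Green": 6, "Hazel": 5, "Other": 1},
    name="Choose your eye color",
)

# Total number of survey responses.
total = eye_counts.sum()

# Proportion for each eye color: count divided by total.
proportions = eye_counts / total

print(total)
print(proportions.round(4))
```

The Green entry should agree with the $6/39 \approx 0.1538$ we computed by hand.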

Contingency tables¶

We can display two categorical variables in a so-called contingency table:

What is your gender (optional):	Female	Male
Choose your eye color
Blue	3	6
Brown	6	12
Green	3	3
Hazel	1	3
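One way to build a table like this in pandas is with pd.crosstab. The sketch below is only an illustration: since the raw survey file isn't included here, it rebuilds one row per respondent from the counts in the table above, and the column names eye_color and gender are shortened stand-ins for the real survey questions.

```python
import pandas as pd

# Counts copied from the contingency table above: (eye color, gender) -> count.
cell_counts = {
    ("Blue", "Female"): 3, ("Blue", "Male"): 6,
    ("Brown", "Female"): 6, ("Brown", "Male"): 12,
    ("Green", "Female"): 3, ("Green", "Male"): 3,
    ("Hazel", "Female"): 1, ("Hazel", "Male"): 3,
}

# Rebuild a stand-in data set with one row per respondent.
rows = [
    (eye, gender)
    for (eye, gender), n in cell_counts.items()
    for _ in range(n)
]
survey = pd.DataFrame(rows, columns=["eye_color", "gender"])

# Cross-tabulate: rows are eye colors, columns are genders.
table = pd.crosstab(survey["eye_color"], survey["gender"])
print(table)
```

With the real survey data frame, the same call would be applied to the two corresponding columns directly.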

Side by side and stacked bar charts¶

Bar charts can be used to compare two categorical variables by stacking bars or displaying them side by side.
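Here's a rough sketch of how both styles might be drawn with pandas, assuming matplotlib is installed (pandas uses it for plotting). The counts are typed in from the class contingency table; the variable names are just for illustration.

```python
import pandas as pd

# Contingency table of eye color by gender, typed in from the class data.
table = pd.DataFrame(
    {"Female": [3, 6, 3, 1], "Male": [6, 12, 3, 3]},
    index=["Blue", "Brown", "Green", "Hazel"],
)

# Side by side: one group of bars per eye color, one bar per gender.
ax_side = table.plot.bar(rot=0, title="Side by side")

# Stacked: one bar per eye color, split into a segment per gender.
ax_stacked = table.plot.bar(stacked=True, rot=0, title="Stacked")
```

In the stacked version, the total height of each bar is the count for that eye color, while the side by side version makes it easier to compare the two genders within each eye color.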

Mosaic plots¶

Another graphical tool to compare two categorical variables is a mosaic plot.

Normalized contingency tables¶

The relationship between a mosaic plot and a contingency table is, perhaps, more clear when the entries in the table are normalized to indicate proportions.

What is your gender (optional):	Female	Male	All
Choose your eye color
Blue	0.081081	0.162162	0.243243
Brown	0.162162	0.324324	0.486486
Green	0.081081	0.081081	0.162162
Hazel	0.027027	0.081081	0.108108
All	0.351351	0.648649	1.000000
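A normalized table like this can be produced with the normalize and margins options of pd.crosstab. As before, this is a sketch under assumptions: the rows are rebuilt from the class counts, and eye_color and gender are illustrative column names rather than the real survey headers.

```python
import pandas as pd

# Counts from the class contingency table: (eye color, gender) -> count.
cell_counts = {
    ("Blue", "Female"): 3, ("Blue", "Male"): 6,
    ("Brown", "Female"): 6, ("Brown", "Male"): 12,
    ("Green", "Female"): 3, ("Green", "Male"): 3,
    ("Hazel", "Female"): 1, ("Hazel", "Male"): 3,
}
rows = [
    (eye, gender)
    for (eye, gender), n in cell_counts.items()
    for _ in range(n)
]
survey = pd.DataFrame(rows, columns=["eye_color", "gender"])

# normalize="all" divides every cell by the grand total (37 here);
# margins=True adds the "All" row and column of totals.
normalized = pd.crosstab(
    survey["eye_color"], survey["gender"], normalize="all", margins=True
)
print(normalized.round(6))
```

Note that the grand total here is 37 rather than 39, since only responses with both an eye color and a gender listed in the table are counted.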

More data (with code)¶

As with numerical data, we are interested in being able to manipulate large sets of categorical data on the computer. So, let's take another look at our CDC data and explore how to do three simple things with it:

Generate a table
Compute a proportion
Make a bar plot

The CDC data¶

Just in case you've forgotten, here's how to grab the CDC data:

import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')
df.head()

	genhlth	exerany	hlthplan	smoke100	height	weight	wtdesire	age	gender
0	good	0	1	0	70	175	175	77	m
1	good	0	1	1	64	125	115	33	f
2	good	1	1	1	60	105	105	49	f
3	good	1	1	0	66	132	124	42	f
4	very good	0	1	0	61	150	130	55	f

A table¶

We can generate a table for the genhlth variable using the value_counts method of that column of the data frame:

value_counts = df['genhlth'].value_counts()
value_counts

very good    6972
good         5675
excellent    4657
fair         2019
poor          677
Name: genhlth, dtype: int64

If you want a groovier looking version in the notebook, try value_counts.to_frame().
If you want markdown to copy and paste into the forum, try value_counts.to_markdown().

A proportion¶

Once we have a table, it's a simple matter to compute a proportion. For example, the proportion of folks who are in excellent health is $4657/20000 \approx 23.3\%$.

Sometimes we might want to automate this type of computation by doing something like so:

value_counts['excellent']/len(df)

0.23285
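If we want all the proportions at once, value_counts accepts a normalize=True argument. So that the sketch below runs without downloading anything, it rebuilds a stand-in genhlth column from the counts printed above; with the real data, the same call would be made on df['genhlth'].

```python
import pandas as pd

# Counts copied from the value_counts output above.
counts = {
    "very good": 6972,
    "good": 5675,
    "excellent": 4657,
    "fair": 2019,
    "poor": 677,
}

# Stand-in for df['genhlth']: each label repeated as often as it was counted.
genhlth = pd.Series(
    [label for label, n in counts.items() for _ in range(n)],
    name="genhlth",
)

# normalize=True returns proportions instead of raw counts.
proportions = genhlth.value_counts(normalize=True)
print(proportions)
```

The entry for excellent should match the 0.23285 computed above.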

A bar plot¶

Again, once we have value_counts, we can use its plot.bar method:

value_counts.plot.bar(figsize=(12, 7), rot=0);
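By default, value_counts sorts from most to least frequent. Since the health categories have a natural order, we might prefer to plot them from excellent down to poor. Here's one possible sketch, assuming matplotlib is available; the counts are typed in from the table above so the example stands on its own, and health_order is a name made up for this illustration.

```python
import pandas as pd

# Counts copied from the value_counts output above.
value_counts = pd.Series(
    {"very good": 6972, "good": 5675, "excellent": 4657, "fair": 2019, "poor": 677},
    name="genhlth",
)

# The natural ordering of the categories, best to worst.
health_order = ["excellent", "very good", "good", "fair", "poor"]

# reindex rearranges the rows to follow that ordering.
ordered_counts = value_counts.reindex(health_order)

# Same plot call as before, now with the bars in the natural order.
ax = ordered_counts.plot.bar(figsize=(12, 7), rot=0)
ax.set_ylabel("Number of respondents")
```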

An assignment¶

Note that there is a forum assignment that asks you to create a bar plot and compute a proportion for your randomly generated data.

You should be able to emulate the code in this presentation using Colab. Just be sure that you're logged into your UNCA email account in your web browser before following the link.