After learning the basics of data and examining numerical data a bit more closely, today we'll jump into a closer look at categorical data. This is mostly section 2.2 of our textbook.
Let's start by taking another look at the categorical data from our class survey. Here are the last few rows of that data:
How old are you? | How tall are you? [Feet] | How tall are you? [Inches] | Are you left or right handed? | What is your gender (optional): | Choose your eye color | What is your major? |
---|---|---|---|---|---|---|
20.0 | 5.0 | 5.0 | Right | Male | Blue | New Media |
21.0 | 5.0 | 9.0 | Right | Male | Brown | Computer Science |
18.0 | 5.0 | 6.0 | Right | Female | Blue | Environmental studies |
The (abbreviated) categorical variables are handedness, gender, eye color, and major.
You might recall that I don't really care for pie charts. It's hard to tell for sure which is larger between green and hazel here. Nonetheless, it's worth understanding how to read them, since we might see them in homework.
You might also recall that one situation where a pie chart is a reasonable idea is when you're comparing two values of one categorical variable. Thus, I suppose that gender is one example where a pie chart might make sense.
Bar plots make it much easier to distinguish relative sizes. We can see much more easily that the green class is larger than the hazel. We can also indicate absolute quantities with a single axis.
Recall that the mean is the fundamental measure of location for numerical data. The corresponding notion for categorical data is the proportion.
The proportion is simply the ratio of the occurrence of some value of a categorical variable to the total number of occurrences.
For example, we have 38 folks in the class who completed our survey, 4 of whom have green eyes. The proportion of green-eyed folks is, thus, $$\frac{4}{38} \approx 0.10526$$ or a little over $10\%$.
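The arithmetic above can be sketched in a couple of lines of Python. The variable names here are just illustrative choices; the two counts are the ones quoted from our survey.

```python
# Counts quoted above from the class survey
green_eyed = 4
respondents = 38

# A proportion is the count for one value divided by the total count
proportion = green_eyed / respondents

print(round(proportion, 5))   # 0.10526
print(f"{proportion:.1%}")    # 10.5%
```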
Of course, the information in a bar chart could also be displayed as a simple table.
Choose your eye color | Count |
---|---|
Brown | 16 |
Blue | 14 |
Green | 4 |
Hazel | 2 |
We can display two categorical variables in a so-called contingency table:
Eye color / Gender | Female | Male |
---|---|---|
Blue | 7 | 6 |
Brown | 7 | 8 |
Green | 4 | 0 |
Hazel | 2 | 0 |
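Here's a minimal sketch of how a table like this might be built with pandas. Since the raw survey responses aren't shown here, the code first expands the counts from the table above into one row per respondent; the column names `eye_color` and `gender` are shortened names I've assumed for illustration. The actual tabulation is done by `pd.crosstab`.

```python
import pandas as pd

# Cell counts copied from the contingency table above
cell_counts = {
    ("Blue", "Female"): 7, ("Blue", "Male"): 6,
    ("Brown", "Female"): 7, ("Brown", "Male"): 8,
    ("Green", "Female"): 4, ("Green", "Male"): 0,
    ("Hazel", "Female"): 2, ("Hazel", "Male"): 0,
}

# Expand the counts into one row per respondent (assumed column names)
rows = [
    {"eye_color": eye, "gender": gender}
    for (eye, gender), n in cell_counts.items()
    for _ in range(n)
]
survey = pd.DataFrame(rows)

# Cross-tabulate the two categorical variables
table = pd.crosstab(survey["eye_color"], survey["gender"])
print(table)
```

Combinations that never occur, like green-eyed males in our class, show up as zeros.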
Bar charts can be used to compare two categorical variables by stacking bars or displaying them side by side.
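As a rough sketch of both layouts, assuming matplotlib is installed alongside pandas, we can type the eye color by gender counts from our survey into a small data frame and plot it twice. The name `counts` and the two-panel layout are my own choices for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Eye color by gender counts from the class survey
counts = pd.DataFrame(
    {"Female": [7, 7, 4, 2], "Male": [6, 8, 0, 0]},
    index=["Blue", "Brown", "Green", "Hazel"],
)

fig, (ax_stacked, ax_side) = plt.subplots(1, 2, figsize=(12, 5))

# Stacked: one bar per eye color, split by gender
counts.plot.bar(stacked=True, rot=0, ax=ax_stacked, title="Stacked")

# Side by side: one bar per eye color and gender combination
counts.plot.bar(rot=0, ax=ax_side, title="Side by side")
```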
Another graphical tool to compare two categorical variables is a mosaic plot.
The relationship between a mosaic plot and a contingency table is, perhaps, clearer when the entries in the table are normalized to indicate proportions.
Eye color / Gender | Female | Male | All |
---|---|---|---|
Blue | 0.205882 | 0.176471 | 0.382353 |
Brown | 0.205882 | 0.235294 | 0.441176 |
Green | 0.117647 | 0.000000 | 0.117647 |
Hazel | 0.058824 | 0.000000 | 0.058824 |
All | 0.588235 | 0.411765 | 1.000000 |
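The normalization can be sketched as follows: divide every count by the grand total, then add the row and column sums as the `All` margins. The data frame name `counts` is an assumption for illustration; the numbers are the survey counts for eye color by gender.

```python
import pandas as pd

# Eye color by gender counts from the class survey
counts = pd.DataFrame(
    {"Female": [7, 7, 4, 2], "Male": [6, 8, 0, 0]},
    index=["Blue", "Brown", "Green", "Hazel"],
)

# Divide each cell by the grand total (34 responses)
proportions = counts / counts.to_numpy().sum()

# Add the margins: row sums first, then column sums
proportions["All"] = proportions.sum(axis=1)
proportions.loc["All"] = proportions.sum(axis=0)

print(proportions.round(6))
```

If you're starting from raw responses rather than counts, `pd.crosstab` accepts `normalize="all"` and `margins=True`, which should produce the same kind of table in one step.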
As with numerical data, we are interested in being able to manipulate large sets of categorical data on the computer. So, let's take another look at our CDC data and explore how to do three simple things with it: generate a table of counts, compute a proportion, and draw a bar plot.
Just in case you've forgotten, here's how to grab the CDC data:
import pandas as pd
df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')
df.head()
| | genhlth | exerany | hlthplan | smoke100 | height | weight | wtdesire | age | gender |
---|---|---|---|---|---|---|---|---|---|
0 | good | 0 | 1 | 0 | 70 | 175 | 175 | 77 | m |
1 | good | 0 | 1 | 1 | 64 | 125 | 115 | 33 | f |
2 | good | 1 | 1 | 1 | 60 | 105 | 105 | 49 | f |
3 | good | 1 | 1 | 0 | 66 | 132 | 124 | 42 | f |
4 | very good | 0 | 1 | 0 | 61 | 150 | 130 | 55 | f |
We can generate a table for the `genhlth` variable using the `value_counts` method of the corresponding data frame column:
value_counts = df['genhlth'].value_counts()
value_counts
very good    6972
good         5675
excellent    4657
fair         2019
poor          677
Name: genhlth, dtype: int64
If you'd like the table in another form, `value_counts.to_frame()` converts it to a data frame and `value_counts.to_markdown()` returns it as Markdown text.
Once we have a table, it's a simple matter to compute a proportion. For example, the proportion of folks who are in `excellent` health is $4657/20000 \approx 23.3\%$.
Sometimes we might want to automate this type of computation by doing something like so:
value_counts['excellent']/len(df)
0.23285
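To get the proportion for every category at once, one option is to divide the whole table by its total. Here's a minimal sketch that types in the `genhlth` counts by hand so that it runs without downloading anything; the names `counts` and `proportions` are my own.

```python
import pandas as pd

# genhlth counts from the CDC data
counts = pd.Series(
    {"very good": 6972, "good": 5675, "excellent": 4657, "fair": 2019, "poor": 677},
    name="genhlth",
)

# Divide every count by the total number of respondents (20000)
proportions = counts / counts.sum()

print(proportions)
```

With the full data frame loaded, `df['genhlth'].value_counts(normalize=True)` should give the same proportions directly.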
Again, once we have the `value_counts` table, we can use its `plot.bar` method:
value_counts.plot.bar(figsize=(12,7), rot = 0);
Note that there is a forum assignment that asks you to create a bar plot and compute a proportion for your randomly generated data.
You should be able to emulate the code in this presentation using Colab. Just be sure that you're logged into your UNCA email account in your web browser before following the link.