Examining Categorical Data

Having covered the basics of data and looked more closely at numerical data, today we'll turn to categorical data. This is mostly section 1.7 of our text, which is the last section that we'll cover in chapter 1.

A little data

Let's start small today by taking a look at our class data:

Gender	Age	Height	Eye Color	Major
f	18	5.333333	green	market
f	18	5.083333	brown	Bio
m	31	6.250000	brown	Acct
m	20	6.583333	blue	Comm
m	31	6.166667	blue	Mgmt
m	20	5.750000	hazel	Geo & Eco
m	49	5.916667	hazel	Mechatronics
f	28	5.666667	green	Anthro & Health
f	22	5.583333	brown	Psychology

Recall that this comes from our class forum question; I also updated our "Scraping our class data" demo to clean this up.

Frequency tables and bar plots

A simple way to get a handle on one categorical variable is through a frequency table. Here's the frequency table for Eye Color in this data frame:

Green	Brown	Blue	Hazel
2	3	2	2
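If you'd like to reproduce that table yourself, here is a minimal sketch in Python using pandas. The eye colors are typed in by hand from the class data above, and the names `eye_color` and `freq` are just illustrative choices.

```python
import pandas as pd

# Eye colors copied by hand from the class data table.
eye_color = pd.Series(
    ['green', 'brown', 'brown', 'blue', 'blue',
     'hazel', 'hazel', 'green', 'brown'],
    name='Eye Color',
)

# Count how many times each category occurs.
freq = eye_color.value_counts()
print(freq)
```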

Sometimes, it's easier to visualize this with a picture called a bar plot:

Note that bar plots look a lot like histograms but it's important to keep them distinct. A bar plot represents counts of categorical data while a histogram represents counts in some range of numerical data.
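To make that distinction concrete, here is a small sketch that computes the numbers behind each kind of picture for our class data. The 10-year age bins are an arbitrary choice made for illustration; a histogram could use any set of ranges.

```python
import pandas as pd

# Values copied by hand from the class data table.
eye_color = pd.Series(['green', 'brown', 'brown', 'blue', 'blue',
                       'hazel', 'hazel', 'green', 'brown'])
age = pd.Series([18, 18, 31, 20, 31, 20, 49, 28, 22])

# Input to a bar plot: one count per category.
category_counts = eye_color.value_counts()

# Input to a histogram: one count per numerical range.
# Each bin includes its left edge and excludes its right edge.
age_bins = pd.cut(age, bins=[10, 20, 30, 40, 50], right=False)
bin_counts = age_bins.value_counts().sort_index()

print(category_counts)
print(bin_counts)
```

The categories have no built-in order or width, while the bins are adjacent intervals on a number line; that is why histogram bars touch and bar plot bars need not.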

Contingency tables

If we'd like to explore any possible relationship between two categorical variables, we can use a contingency table. Here's the contingency table for gender and eye color:

G\EC	blue	brown	green	hazel
f	0	2	2	0
m	2	1	0	2

Often, it's useful to include row and column sums as margins:

G\EC	blue	brown	green	hazel	All
f	0	2	2	0	4
m	2	1	0	2	5
All	2	3	2	2	9
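Here is a sketch of how a table like this might be built with pandas. The data frame `class_df` is a hand-typed copy of two columns of our class data, and its name is an illustrative choice.

```python
import pandas as pd

# Gender and eye color copied by hand from the class data table.
class_df = pd.DataFrame({
    'Gender': ['f', 'f', 'm', 'm', 'm', 'm', 'm', 'f', 'f'],
    'Eye Color': ['green', 'brown', 'brown', 'blue', 'blue',
                  'hazel', 'hazel', 'green', 'brown'],
})

# margins=True appends the row and column sums, labeled 'All'.
table = pd.crosstab(class_df['Gender'], class_df['Eye Color'], margins=True)
print(table)
```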

We could even use proportions, as we'll see in just a bit.

Stacked bar plots

We can expand the bar plot idea to account for and visualize two categorical variables. This is called a stacked bar plot:

Mosaic plots

Another tool for visualizing a pair of categorical variables, one that is more tightly tied to contingency tables, is called a mosaic plot:
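The tie to the contingency table can be seen in the arithmetic. In the usual construction, each tile's area is proportional to the corresponding cell count. The sketch below computes the tile dimensions for our gender by eye color table, ignoring the small gaps drawn between tiles. Which variable is split along which axis is an assumption here; plotting tools let you choose.

```python
import pandas as pd

# Counts copied from the gender by eye color contingency table.
counts = pd.DataFrame(
    [[0, 2, 2, 0],
     [2, 1, 0, 2]],
    index=['f', 'm'],
    columns=['blue', 'brown', 'green', 'hazel'],
)
n = counts.values.sum()

# Width of each gender's strip: that gender's share of all students.
widths = counts.sum(axis=1) / n

# Height of each tile within a strip: the eye color share within that gender.
heights = counts.div(counts.sum(axis=1), axis=0)

# Area = width * height, which works out to the cell's share of all students.
areas = heights.mul(widths, axis=0)
print(areas.round(3))
```

A cell with a count of zero, such as blue-eyed women in our class, gets a tile of zero area.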

A lotta data

Now let's examine the same stuff for a lot of data. We'll do so by applying Python to our CDC dataset.

Here are our basic imports:

%matplotlib inline
import pandas as pd

We'll import a couple more special tools when we need them.

Getting the data

Recall that we can import our CDC data right off of the web:

df = pd.read_csv('https://www.marksmath.org/data/cdc.csv')
df.head()

A frequency table and bar plot

Here's how to generate a frequency table for the genhlth variable:

value_counts = df['genhlth'].value_counts()
value_counts

very good    6972
good         5675
excellent    4657
fair         2019
poor          677
Name: genhlth, dtype: int64
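Notice that `value_counts` sorts by count, not by the natural order of the health levels. Here is a sketch of one way to reorder the table and convert it to proportions. To keep it self-contained, the counts are typed in from the output above rather than read from the CDC file.

```python
import pandas as pd

# Counts copied from the value_counts output.
value_counts = pd.Series(
    {'very good': 6972, 'good': 5675, 'excellent': 4657,
     'fair': 2019, 'poor': 677},
    name='genhlth',
)

# Put the categories in their natural order rather than sorting by count.
order = ['excellent', 'very good', 'good', 'fair', 'poor']
ordered_counts = value_counts.reindex(order)

# Convert counts to proportions of all respondents.
proportions = ordered_counts / ordered_counts.sum()
print(proportions.round(4))
```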

We can go straight from the value_counts to the corresponding bar plot:

value_counts.plot(kind='bar', edgecolor='black', rot=0);

A contingency table

Pandas has a crosstab function designed specifically to generate a contingency table.

cont = pd.crosstab(df.genhlth, df.smoke100)
cont
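To see what `crosstab` is doing, it may help to build the same kind of table by hand: group by both variables, count the rows in each group, and pivot the second variable into columns. The sketch below checks this on our small class data, typed in by hand, so that it runs without downloading anything.

```python
import pandas as pd

# Gender and eye color copied by hand from the class data table.
class_df = pd.DataFrame({
    'Gender': ['f', 'f', 'm', 'm', 'm', 'm', 'm', 'f', 'f'],
    'Eye Color': ['green', 'brown', 'brown', 'blue', 'blue',
                  'hazel', 'hazel', 'green', 'brown'],
})

# The one-step version.
by_crosstab = pd.crosstab(class_df['Gender'], class_df['Eye Color'])

# The by-hand version: count rows per (Gender, Eye Color) pair, then
# spread the eye colors across the columns, filling missing pairs with 0.
by_groupby = (
    class_df.groupby(['Gender', 'Eye Color'])
            .size()
            .unstack(fill_value=0)
)
print(by_groupby)
```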

You might want to reorder the rows, place the row and column sums in the margins, and/or indicate proportions, rather than counts:

cont = pd.crosstab(df.genhlth, df.smoke100, normalize=True, margins=True)
cont = cont.reindex(['excellent', 'very good', 'good', 'fair', 'poor', 'All'])
cont
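With `normalize=True`, every cell is divided by the grand total. To compare health levels with one another, it can be more informative to divide each row by its own total, so that each row sums to 1. The sketch below does this by hand from the genhlth by smoke100 counts produced by the first `crosstab` call; `pd.crosstab` also accepts `normalize='index'`, which should produce the same row proportions directly from the data frame.

```python
import pandas as pd

# Counts copied from the genhlth by smoke100 crosstab output,
# with the rows already in their natural order.
counts = pd.DataFrame(
    {0: [2879, 3758, 2782, 911, 229],
     1: [1778, 3214, 2893, 1108, 448]},
    index=['excellent', 'very good', 'good', 'fair', 'poor'],
)

# Divide each row by its own total.
row_props = counts.div(counts.sum(axis=1), axis=0)
print(row_props.round(3))
```

Read this way, the share of respondents with smoke100 equal to 1 rises steadily from the excellent row to the poor row.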

It's very easy to generate a stacked bar chart directly from a contingency table.

cont = pd.crosstab(df.genhlth, df.smoke100)
cont = cont.reindex(['excellent', 'very good', 'good', 'fair', 'poor'])
cont.plot(kind='bar', stacked=True, rot=0, edgecolor='black');

There's also a function in the statsmodels library that makes it very easy to generate a mosaic plot:

from statsmodels.graphics.mosaicplot import mosaic
mosaic(df, ['genhlth', 'smoke100']);

It's actually quite tricky to reorder and style that result, though.

from seaborn import color_palette as pl

# Rebuild the data frame with its rows grouped in the desired genhlth order.
# pd.concat stacks the per-level pieces in the order given.
levels = ['excellent', 'very good', 'good', 'fair', 'poor']
df2 = pd.concat([df[df.genhlth == level] for level in levels])

# Map each (genhlth, smoke100) tile to an index into seaborn's default palette.
# Keys are tuples of strings, e.g. ('excellent', '0').
palette_index = {
    ('excellent', '0'): 2, ('excellent', '1'): 8,
    ('very good', '0'): 0, ('very good', '1'): 9,
    ('good', '0'): 4, ('good', '1'): 6,
    ('fair', '0'): 1, ('fair', '1'): 3,
    ('poor', '1'): 7,
}

def color(key):
    # Tiles without an entry, such as ('poor', '0'), are drawn in gray.
    if key in palette_index:
        return {'color': pl()[palette_index[key]]}
    return {'color': 'gray'}

mosaic(df2, ['genhlth', 'smoke100'], gap=(0.02,0.02), 
       properties=color, labelizer=lambda key: "");
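The reordering step above works by stacking the rows for each health level in the desired order. Another way to get the same row order is to sort on an ordered categorical version of the column. The sketch below compares the two on a tiny made-up data frame (`toy` is hypothetical, not CDC data). Whether `mosaic` treats a categorical column exactly like a column of strings is not checked here, so the sort uses a temporary categorical copy and returns the original rows.

```python
import pandas as pd

# A tiny stand-in for the CDC data frame (hypothetical rows, not real data).
toy = pd.DataFrame({
    'genhlth': ['good', 'poor', 'excellent', 'fair',
                'very good', 'good', 'excellent'],
    'smoke100': [0, 1, 0, 1, 1, 1, 0],
})
order = ['excellent', 'very good', 'good', 'fair', 'poor']

# Stack the rows for each level in the desired order.
stacked = pd.concat([toy[toy.genhlth == level] for level in order])

# Alternative: sort on an ordered categorical copy of the column.
# mergesort is stable, so rows within a level keep their original order.
key = toy['genhlth'].astype(pd.CategoricalDtype(categories=order, ordered=True))
sorted_toy = toy.loc[key.sort_values(kind='mergesort').index]

print(sorted_toy)
```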

Output of df.head():

	genhlth	exerany	hlthplan	smoke100	height	weight	wtdesire	age	gender
0	good	0	1	0	70	175	175	77	m
1	good	0	1	1	64	125	115	33	f
2	good	1	1	1	60	105	105	49	f
3	good	1	1	0	66	132	124	42	f
4	very good	0	1	0	61	150	130	55	f

Output of the first pd.crosstab call (counts):

genhlth\smoke100	0	1
excellent	2879	1778
fair	911	1108
good	2782	2893
poor	229	448
very good	3758	3214

Output of the pd.crosstab call with normalize=True and margins=True, after reindexing (proportions):

genhlth\smoke100	0	1	All
excellent	0.14395	0.08890	0.23285
very good	0.18790	0.16070	0.34860
good	0.13910	0.14465	0.28375
fair	0.04555	0.05540	0.10095
poor	0.01145	0.02240	0.03385
All	0.52795	0.47205	1.00000