Let’s take a look at some actual data and talk carefully about how we might view and describe it. This material corresponds closely to chapter 2 and a little of chapter 3 of our textbook.
Let’s begin with our classroom data. I’ve got that data recorded in a CSV file right here. If you’d like to know how I scraped the data off of the webpage, you can read about the process.
Written out in a table, the first few rows look like this:
Gender | Age | Height |
---|---|---|
f | 8 | 4.166667 |
f | 18 | 5.333333 |
f | 18 | 5.500000 |
f | 19 | 5.250000 |
m | 18 | 5.750000 |
m | 18 | 6.083333 |
We have a data frame with three columns, corresponding to three variables: Gender, Age, and Height.
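To make the three variables concrete, here is a minimal sketch in Python (my choice for illustration; the tables in these notes look like R output). It assumes the CSV file has a header row named `Gender,Age,Height`, and it uses only the six rows shown above as stand-in text rather than the real file.

```python
import csv
import io

# The six rows shown above, written as CSV text.
# This is a stand-in for the real file; the header names are an assumption.
CSV_TEXT = """Gender,Age,Height
f,8,4.166667
f,18,5.333333
f,18,5.500000
f,19,5.250000
m,18,5.750000
m,18,6.083333
"""

def load_rows(text):
    """Parse CSV text into a list of dicts with typed values."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append({
            "Gender": record["Gender"],         # categorical
            "Age": int(record["Age"]),          # numerical (whole years)
            "Height": float(record["Height"]),  # numerical
        })
    return rows

rows = load_rows(CSV_TEXT)
for row in rows:
    print(row)
```

Converting Age and Height to numbers while leaving Gender as text mirrors the categorical versus numerical split discussed next.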
The major different types of data, categorical and numerical, require different types of visualization.
We’ve got just one categorical variable: Gender. Bar plots and pie charts are fundamental tools for visualizing this type of simple categorical data.
According to the 56 data entries currently posted, there are 17 men and 39 women enrolled in the class. (That includes my daughter’s fake entry.) A common way to visualize relative amounts of categorical data when there are only two categories is with a pie chart. Here is the pie chart comparing the number of men and women in our class.
I emphasize that relative proportion of two categories like this is about the only situation where I would use a pie chart. Pie charts are generally poor when it comes to working with more categories. We’ll take a closer look at this after we get data with more categories.
Another (often better) way to visualize categories is with a bar plot:
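Underneath, a bar plot is nothing more than a count per category drawn as a bar. Here is a rough Python sketch that rebuilds the counts quoted above (17 men, 39 women) and prints a text version of the bars. The helper name `text_bar_plot` is made up for this illustration.

```python
from collections import Counter

# Counts quoted in the text: 17 men and 39 women among 56 entries.
genders = ["m"] * 17 + ["f"] * 39

counts = Counter(genders)
total = sum(counts.values())

def text_bar_plot(counts, total):
    """Return one text line per category: a bar of '#', the count, and the share."""
    lines = []
    for category, count in sorted(counts.items()):
        share = count / total
        lines.append(f"{category} | {'#' * count} {count} ({share:.1%})")
    return lines

bar_lines = text_bar_plot(counts, total)
for line in bar_lines:
    print(line)
```

The same counts divided by the total give the slices of the pie chart, so the two pictures carry identical information; the bars are just easier to compare by eye.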
We have two numerical variables: Age and Height. Histograms and box plots (or box and whisker plots) are two standard ways to visualize a list of numeric data.
Let’s take a look at a histogram of the heights (after removing my 8-year-old daughter’s entry). The idea is to count how many observations lie in each of a collection of equally spaced intervals. A picture gets the idea across.
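The counting step behind a histogram can be sketched in a few lines of Python. This uses only the five heights visible in the table above (not the full class), and the bin start, bin width, and function name are choices I made for the illustration.

```python
def histogram_counts(values, low, width, n_bins):
    """Count values in n_bins intervals [low + k*width, low + (k+1)*width)."""
    counts = [0] * n_bins
    for value in values:
        k = int((value - low) // width)  # index of the interval holding value
        if 0 <= k < n_bins:              # ignore values outside the range
            counts[k] += 1
    return counts

# Heights from the rows shown earlier, with the 8-year-old's entry removed.
heights = [5.333333, 5.5, 5.25, 5.75, 6.083333]

low, width, n_bins = 5.0, 0.25, 5  # illustrative bin choices
bin_counts = histogram_counts(heights, low, width, n_bins)

for k, count in enumerate(bin_counts):
    left = low + k * width
    print(f"[{left:.2f}, {left + width:.2f}) {'#' * count}")
```

Changing the bin width changes the picture, sometimes a lot, so it is worth trying a few widths before reading too much into the shape.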
The basic shape that we see here (heavy in the middle and lighter on the ends) is common. Sometimes this is called a bell shape, and it suggests a normal distribution. While normally distributed data is common, not all data is normally distributed. I suppose there’s no reason to think that age in a college class should be normally distributed.
We briefly met box plots in our first day demo. Here’s a look at a box plot for our class heights.
Any dots that we see are outliers. The thick line in the middle is the median, and the two edges of the box on either side of it mark the first and third quartiles. The dashes at the ends of the lines extending out of the box are the max and min, excluding the outliers. We’ll talk a little more carefully about all of those in a bit.
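Here is a hedged Python sketch of the numbers a box plot is built from. It assumes the common convention that a point is an outlier when it lies more than 1.5 times the interquartile range beyond a quartile; plotting software may compute quartiles slightly differently, so treat the exact values as illustrative. The data are the six heights from the table at the top, including my daughter’s entry.

```python
import statistics

def box_plot_summary(values):
    """Median, quartiles, whisker ends, and outliers under the 1.5 * IQR rule."""
    data = sorted(values)
    # "inclusive" treats the data as the whole population of interest.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [x for x in data if low_fence <= x <= high_fence]
    outliers = [x for x in data if x < low_fence or x > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),    # smallest value that is not an outlier
        "whisker_high": max(inside),   # largest value that is not an outlier
        "outliers": outliers,
    }

heights = [4.166667, 5.333333, 5.5, 5.25, 5.75, 6.083333]
summary = box_plot_summary(heights)
for name, value in summary.items():
    print(name, value)
```

With these six values the only point flagged as an outlier is the 4.166667 entry, which is why that entry gets dropped before drawing the histogram.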
In chapter 2, the textbook talks a lot about the Titanic. There is a famous data set for this disaster which I have stored on my webspace. The first couple of rows look something like so:
Class | Age | Gender | Survived |
---|---|---|---|
1 | a | m | y |
1 | a | m | y |
There are 2201 rows like that. The fields are Class, Age, Gender, and Survived.
Thus, what we have here are four categorical variables. Class is an example of an ordinal variable; the others are nominal, though it still helps to see them in context.
Looking at the complete data table is pretty much useless. We often look at various summary tables. I suppose the most logical is to count how many lived and how many did not.
##
## n y
## 1490 711
We can convert this to a relative frequency table by dividing through by the total number of people.
##
## n y
## 0.676965 0.323035
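The arithmetic behind that relative frequency table is just each count divided by the total. A minimal Python sketch, using the two counts from the summary table above (the variable names are my own):

```python
# Counts from the summary table: "n" = did not survive, "y" = survived.
survived_counts = {"n": 1490, "y": 711}

total = sum(survived_counts.values())

# Divide each count by the total to get a proportion.
relative = {label: count / total for label, count in survived_counts.items()}

for label, share in relative.items():
    print(label, round(share, 6))
```

The proportions always add up to 1, which is a quick sanity check on any relative frequency table.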
More interesting is a contingency table that compares two of the variables. Here the rows are Class and the columns are Survived.
##
## n y Sum
## 0 673 212 885
## 1 122 203 325
## 2 167 118 285
## 3 528 178 706
## Sum 1490 711 2201
There are a number of ways to translate a contingency table into a relative frequency table.
Here’s a relative frequency table of the first type, in which each count is divided by its row total so that every row sums to 1.
##
## n y Sum
## 0 0.7604520 0.2395480 1.0000000
## 1 0.3753846 0.6246154 1.0000000
## 2 0.5859649 0.4140351 1.0000000
## 3 0.7478754 0.2521246 1.0000000
We see quite clearly the dependence of survival on class.
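For anyone who wants to check the arithmetic, here is a rough Python sketch that starts from the contingency table above and normalizes it three usual ways: by row total, by column total, and by grand total. The function names are invented for this example, and the order here is not meant to match the numbering of types used in these notes.

```python
# Contingency table from above: rows are Class, columns are Survived.
table = {
    0: {"n": 673, "y": 212},
    1: {"n": 122, "y": 203},
    2: {"n": 167, "y": 118},
    3: {"n": 528, "y": 178},
}

def row_proportions(table):
    """Divide each cell by its row total, so each row sums to 1."""
    return {
        row: {col: count / sum(cells.values()) for col, count in cells.items()}
        for row, cells in table.items()
    }

def column_proportions(table):
    """Divide each cell by its column total, so each column sums to 1."""
    col_totals = {}
    for cells in table.values():
        for col, count in cells.items():
            col_totals[col] = col_totals.get(col, 0) + count
    return {
        row: {col: count / col_totals[col] for col, count in cells.items()}
        for row, cells in table.items()
    }

def overall_proportions(table):
    """Divide each cell by the grand total, so all cells together sum to 1."""
    grand_total = sum(sum(cells.values()) for cells in table.values())
    return {
        row: {col: count / grand_total for col, count in cells.items()}
        for row, cells in table.items()
    }

by_row = row_proportions(table)
by_column = column_proportions(table)
overall = overall_proportions(table)

for row, cells in by_row.items():
    print(row, {col: round(share, 7) for col, share in cells.items()})
```

The row version answers "given the class, what fraction survived?", which is the comparison that makes the dependence on class stand out.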
I’m a big fan of mosaic plots, which are geometric realizations of the third type of relative frequency table.
I’m not a big fan of the pie chart. Here’s a pie chart for class:
And please don’t ever make a 3D pie chart.
Here’s the only way that a pie chart is OK: