Looking at data

Let’s take a look at some actual data and talk carefully about how we might view and describe it. This material corresponds closely to chapter 2 and a little of chapter 3 of our textbook.

Our class data

Let’s begin with our classroom data. I’ve got that data recorded in a CSV file right here. If you’d like to know how I scraped the data off of the webpage, you can read about the process.

Written out in a table, the first few rows look like so:

Gender	Age	Height
f	8	4.166667
f	18	5.333333
f	18	5.500000
f	19	5.250000
m	18	5.750000
m	18	6.083333

We have a data frame with three columns - corresponding to three variables:

Gender: A nominal, categorical variable
Age: A discrete, numerical variable, and
Height: A continuous, numerical variable.
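
As a rough illustration of how the three variable types differ in practice, here is a minimal Python sketch that parses the six rows shown above. The name `SAMPLE` and the choice of Python are assumptions made for this example; the real data lives in the CSV file mentioned earlier.

```python
import csv
import io
from statistics import mean

# The six rows shown in the table above, tab-separated.
SAMPLE = (
    "Gender\tAge\tHeight\n"
    "f\t8\t4.166667\n"
    "f\t18\t5.333333\n"
    "f\t18\t5.500000\n"
    "f\t19\t5.250000\n"
    "m\t18\t5.750000\n"
    "m\t18\t6.083333\n"
)

rows = []
for rec in csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"):
    rows.append({
        "Gender": rec["Gender"],         # categorical: kept as a label
        "Age": int(rec["Age"]),          # discrete: whole numbers only
        "Height": float(rec["Height"]),  # continuous: any real value
    })

# Averaging makes sense for a numerical variable such as Height...
mean_height = mean(r["Height"] for r in rows)
print(f"mean height of the rows shown: {mean_height:.3f}")

# ...while a categorical variable such as Gender can only be counted.
labels = sorted({r["Gender"] for r in rows})
print("distinct Gender labels:", labels)
```

The point of the sketch is only that numerical columns support arithmetic while a categorical column supports counting, which is why the two kinds of data get different plots.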

The major different types of data, categorical and numerical, require different types of visualization.

A look at our one categorical variable

We’ve got just one categorical variable: Gender. Bar plots and pie charts are fundamental tools for visualizing this type of simple categorical data.

A pie chart

According to the 56 data entries currently posted, there are 17 men and 39 women enrolled in the class. (That includes my daughter’s fake entry.) A common way to visualize relative amounts of categorical data when there are only two categories is with a pie chart. Here is the pie chart comparing the number of men and women in our class.

I emphasize that relative proportion of two categories like this is about the only situation where I would use a pie chart. Pie charts are generally poor when it comes to working with more categories. We’ll take a closer look at this after we get data with more categories.

A bar plot

Another (often better) way to visualize categories is with a bar plot:
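
The tallies behind both the pie chart and the bar plot can be sketched in Python. This example rebuilds the Gender column from the counts quoted above (17 men, 39 women) rather than reading the real file, and the text-based bars are only a stand-in for a proper plot.

```python
from collections import Counter

# Rebuild the Gender column from the counts quoted in the text
# (17 men, 39 women); the real column would come from the CSV file.
genders = ["m"] * 17 + ["f"] * 39

counts = Counter(genders)
total = sum(counts.values())
proportions = {g: n / total for g, n in counts.items()}

# A crude text bar plot: one '#' per student, with the count and share.
for g in sorted(counts):
    print(f"{g} {'#' * counts[g]} {counts[g]} ({proportions[g]:.1%})")
```

A pie chart draws the values in `proportions` as angles, while a bar plot draws the values in `counts` as lengths.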

A look at our numerical variables

We have two numerical variables: Age and Height. Histograms and box plots (or box and whisker plots) are two standard ways to visualize a list of numeric data.

A histogram

Let’s take a look at a histogram of the heights (after removing my 8-year-old daughter). The idea is to count how many observations lie in each of a collection of equally spaced intervals. A picture gets the idea across.
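
The counting step can also be sketched in a few lines of Python. The heights here are just the rows shown in the table earlier (with the 8-year-old left out), and the starting point of 5.0 and bin width of 0.25 are assumptions chosen for illustration, not the settings behind the actual histogram.

```python
import math
from collections import Counter

# Heights from the rows shown earlier, with the 8-year-old removed.
heights = [5.333333, 5.500000, 5.250000, 5.750000, 6.083333]

BIN_START = 5.0   # assumed left edge of the first interval
BIN_WIDTH = 0.25  # assumed width of every interval

def bin_index(x):
    """Return which equally spaced interval [start + i*w, start + (i+1)*w) holds x."""
    return math.floor((x - BIN_START) / BIN_WIDTH)

counts = Counter(bin_index(h) for h in heights)

# Print one row per interval; the row of '#' marks plays the role of the bar.
for i in range(min(counts), max(counts) + 1):
    lo = BIN_START + i * BIN_WIDTH
    print(f"[{lo:.2f}, {lo + BIN_WIDTH:.2f}) {'#' * counts[i]}")
```

Changing `BIN_WIDTH` changes the look of the histogram, sometimes a lot, so it is worth trying a few values on real data.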

The basic shape that we see here (heavy in the middle and lighter on the ends) is common. Sometimes this is called a bell shape, and it suggests that the data are approximately normally distributed. While normally distributed data is common, not all data is normally distributed. I suppose there’s no reason to think that age in a college class should be normally distributed.

A box plot

We briefly met box plots in our first day demo. Here’s a look at a box plot for our class heights.

Any dots that we see are outliers. The thick line in the middle is the median, and the edges of the box on either side of it mark the first and third quartiles. The dashes at the ends of the lines extending out of the box are the max and min, excluding the outliers. We’ll talk a little more carefully about all of those in a bit.
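
Here is a hedged Python sketch of the numbers a box plot is built from, using only the six heights shown in the table earlier (8-year-old included). Two things in it are assumptions rather than anything stated in these notes: the "inclusive" quartile method, and the common convention that a point is an outlier when it lies more than 1.5 box lengths beyond the box. Plotting software may use slightly different conventions.

```python
from statistics import quantiles

# The six heights shown earlier, including the 8-year-old.
heights = [4.166667, 5.333333, 5.500000, 5.250000, 5.750000, 6.083333]

# Quartiles: the middle value is the median, the outer two are the box edges.
q1, median, q3 = quantiles(heights, n=4, method="inclusive")
iqr = q3 - q1  # length of the box

# Assumed convention: outliers lie more than 1.5 * IQR beyond the box.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr

outliers = [h for h in heights if h < low_fence or h > high_fence]
inside = [h for h in heights if low_fence <= h <= high_fence]

# The whiskers end at the min and max of the non-outlying data.
whisker_low, whisker_high = min(inside), max(inside)

print("box:", q1, median, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```

With these six values the 8-year-old's height is the only point flagged, which matches the intuition that she does not belong with the college students.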

Another dataset - The Titanic

In chapter 2, the textbook talks a lot about the Titanic. There is a famous data set for this disaster which I have stored on my webspace. The first couple of rows look something like so:

Class	Age	Gender	Survived
1	a	m	y
1	a	m	y

There are 2201 rows like that. The fields are:

Class:
- 0 (for Crew)
- 1 (for First)
- 2 (for Second)
- 3 (for Third)
Gender:
- m
- f
Age:
- a (for Adult)
- c (for Child)
Survived:
- y
- n

Thus, what we have here are four categorical variables. Class is an example of an ordinal variable; the others are nominal.

Looking at the complete data table is pretty much useless. We often look at various summary tables. I suppose the most logical is to count how many lived and how many did not.

## 
##    n    y 
## 1490  711

We can convert this to a relative frequency table by dividing through by the total number of people.

## 
##        n        y 
## 0.676965 0.323035
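
The division is simple enough to check by hand, or with a short Python sketch like this one, which starts from the two counts in the summary table above.

```python
# Counts from the summary table above: n = did not survive, y = survived.
counts = {"n": 1490, "y": 711}

total = sum(counts.values())  # total number of people

# Relative frequency = count divided by the total.
relative = {k: v / total for k, v in counts.items()}

for k in counts:
    print(f"{k}: {counts[k]} / {total} = {relative[k]:.6f}")
```

The relative frequencies always add up to 1, which is a handy sanity check.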

More interesting is a contingency table that compares two of the variables.

##      
##          n    y  Sum
##   0    673  212  885
##   1    122  203  325
##   2    167  118  285
##   3    528  178  706
##   Sum 1490  711 2201

There are a number of ways to translate a contingency table into a relative frequency table.

Divide the rows by their sums
Divide the columns by their sums
Divide everything by the total sum

Here’s a relative frequency table of the first type.

##    
##             n         y       Sum
##   0 0.7604520 0.2395480 1.0000000
##   1 0.3753846 0.6246154 1.0000000
##   2 0.5859649 0.4140351 1.0000000
##   3 0.7478754 0.2521246 1.0000000

We see quite clearly the dependence of survival on class.
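
All three ways of normalizing the contingency table can be sketched in Python. The counts are copied from the table above; the variable names (`by_row`, `by_col`, `by_total`) are made up for this example.

```python
# Contingency table from above: class -> counts of n (died) and y (survived).
table = {
    0: {"n": 673, "y": 212},
    1: {"n": 122, "y": 203},
    2: {"n": 167, "y": 118},
    3: {"n": 528, "y": 178},
}
outcomes = ("n", "y")

row_sums = {c: sum(cells.values()) for c, cells in table.items()}
col_sums = {k: sum(table[c][k] for c in table) for k in outcomes}
grand_total = sum(row_sums.values())

# Type 1: divide each row by its sum (survival rate within each class).
by_row = {c: {k: table[c][k] / row_sums[c] for k in outcomes} for c in table}

# Type 2: divide each column by its sum (class makeup of each outcome).
by_col = {c: {k: table[c][k] / col_sums[k] for k in outcomes} for c in table}

# Type 3: divide everything by the total (share of all people in each cell).
by_total = {c: {k: table[c][k] / grand_total for k in outcomes} for c in table}

for c in table:
    print(c, " ".join(f"{by_row[c][k]:.7f}" for k in outcomes))
```

The three tables answer different questions. For example, `by_row[1]["y"]` is the fraction of first class passengers who survived, while `by_col[1]["y"]` is the fraction of survivors who were in first class.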

I’m a big fan of Mosaic plots, which are geometric realizations of the third type of relative frequency table.
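
To see why, here is a small Python sketch of the geometry of one common mosaic layout, assumed here for illustration: one column per class, with width equal to that class's share of all people, split vertically by the survival rate within the class. The area of each tile then equals that cell's share of the total.

```python
# Contingency table of class (0 = Crew, 1-3 = passenger classes) by survival.
table = {
    0: {"n": 673, "y": 212},
    1: {"n": 122, "y": 203},
    2: {"n": 167, "y": 118},
    3: {"n": 528, "y": 178},
}
grand_total = sum(sum(cells.values()) for cells in table.values())

tiles = {}
for c, cells in table.items():
    class_total = sum(cells.values())
    width = class_total / grand_total       # column width: share of all people
    for k, count in cells.items():
        height = count / class_total        # tile height: share within the class
        tiles[(c, k)] = (width, height, width * height)

for (c, k), (w, h, area) in tiles.items():
    print(f"class {c}, {k}: width {w:.3f}, height {h:.3f}, area {area:.3f}")
```

Because width times height collapses to count divided by the total, the whole unit square is filled exactly once.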

I’m not a big fan of the pie chart. Here’s a pie chart for class:

And please don’t ever make a 3D pie chart.

Here’s the only way that a pie chart is OK:
