Looking at data

Let’s take a look at some actual data and talk carefully about how we might inspect it, visualize it, and describe it. This material corresponds closely to chapter 2 and a little of chapter 3 of our textbook.

Our class data

Let’s begin with our classroom data. I’ve got that data recorded in a CSV file right here. If you’d like to know how I scraped the data off of the webpage, you can read about the process.

Written out in a table, the first few rows look like so:

Gender	Age	Height	Eye.Color	School
f	8	4.166667	hazel	Isaac Dickson
m	23	5.833333	brown	UNCA
f	23	5.333333	blue	UNCA
f	29	5.583333	hazel	UNCA
f	48	5.333333	brown	UNCA
m	18	6.083333	green	Duke

This is the inspection step; we’re just taking a quick look at the data to see what type of data it is. We have a data frame with five columns - corresponding to five variables of three distinct types:

Gender: A nominal, categorical variable,
Age: A discrete, numerical variable,
Height: A continuous, numerical variable,
Eye Color: A nominal, categorical variable,
School: A nominal, categorical variable.
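
Here’s a minimal sketch of that inspection step in Python, using only the six rows shown above (not the full class file). The `guess_type` helper is a made-up name and a crude heuristic of my own for this illustration: it can separate whole numbers, decimals, and text, but it cannot tell a nominal variable from an ordinal one. That part still takes a human.

```python
import csv
import io

# Only the six rows shown in the table above, tab-separated.
SAMPLE = """Gender\tAge\tHeight\tEye.Color\tSchool
f\t8\t4.166667\thazel\tIsaac Dickson
m\t23\t5.833333\tbrown\tUNCA
f\t23\t5.333333\tblue\tUNCA
f\t29\t5.583333\thazel\tUNCA
f\t48\t5.333333\tbrown\tUNCA
m\t18\t6.083333\tgreen\tDuke
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))


def guess_type(values):
    """Hypothetical helper: whole numbers -> discrete, decimals -> continuous, text -> categorical."""
    try:
        numbers = [float(v) for v in values]
    except ValueError:
        return "categorical"
    if all(n.is_integer() for n in numbers):
        return "discrete numerical"
    return "continuous numerical"


# One guessed type per column, keyed by the column name from the header row.
column_types = {name: guess_type([row[name] for row in rows]) for name in rows[0]}
for name, kind in column_types.items():
    print(f"{name}: {kind}")
```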

Our next step might be visualization, and the different types of data (categorical and numerical) require different types of visualization.

A look at our first two categorical variables

We’ve got two interesting categorical variables: Gender and Eye Color. A bar chart is a fundamental tool for visualizing this type of simple categorical data. Pie charts are often used as well, though that’s not generally a good idea. Let’s start with those to see why.

A pie chart

According to the data entries currently posted, there are 7 men, 5 women, and one gender neutral person enrolled in the class. (That includes my daughter’s fake entry.) A common way to visualize relative amounts of categorical data is with a pie chart. Here is the pie chart for this data.
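
For reference, here’s a small sketch of what a pie chart actually encodes: each category’s share of the total, drawn as an angle out of 360 degrees. The counts are the ones quoted above; the variable names are just my own choices for the illustration.

```python
# Counts quoted in the text above.
gender_counts = {"men": 7, "women": 5, "gender neutral": 1}
class_size = sum(gender_counts.values())

# Each slice's angle is its share of the class times a full circle.
slice_angles = {label: 360 * n / class_size for label, n in gender_counts.items()}

for label, angle in slice_angles.items():
    share = gender_counts[label] / class_size
    print(f"{label}: {share:.1%} of the class, {angle:.1f} degrees")
```

Reading the chart means comparing those angles by eye, which is the part that gets harder as the number of categories grows.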

In this particular example, it’s not too hard to see the relative sizes of the categories. As we get more categories, this becomes quite a bit harder. Here’s the pie chart for eye color:

The problem is exacerbated with the curiously popular 3D pie chart.

Of course, the problem is much worse still with more categories. Here’s a pie chart for 20 randomly generated quantities:

The only time that a pie chart is at all reasonable is when we are investigating the relative proportion of two categories. Here’s a good example:

source

A bar plot

Another (often better) way to visualize categories is with a bar plot:

Note how easy it is to see relative magnitudes - even with that list of 20 randomly generated quantities:
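
As a rough, text-only illustration of the same idea, here’s a sketch that tallies the eye colors from the six rows shown earlier (not the full class) and prints one bar per category. A real bar plot would come from a plotting library; this just shows that the bar length is nothing more than the count.

```python
from collections import Counter

# Eye colors from the six rows in the table near the top of these notes.
eye_colors = ["hazel", "brown", "blue", "hazel", "brown", "green"]
eye_counts = Counter(eye_colors)

# One text bar per category, longest first; bar length equals the count.
bars = [f"{color:<6} {'#' * n} {n}" for color, n in eye_counts.most_common()]
print("\n".join(bars))
```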

A look at our numerical variables

We have two numerical variables: Age and Height. Histograms and box plots (or box and whisker plots) are two standard ways to visualize a list of numeric data.

A histogram

Let’s take a look at a histogram of the heights (after removing my 8-year-old daughter). The idea is to count how many observations lie in each of a collection of equally spaced intervals. A picture gets the idea across.
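
The counting behind a histogram can be sketched in a few lines. This uses only the five adult heights from the rows shown earlier, and the bin start (5.25 feet) and bin width (0.25 feet, or 3 inches) are assumptions I picked for the illustration, not the bins used in the picture.

```python
# Heights (in feet) from the rows shown earlier, with the 8-year-old removed.
adult_heights = [5.833333, 5.333333, 5.583333, 5.333333, 6.083333]

# Assumed bins: four equally spaced intervals starting at 5.25 feet.
bin_start = 5.25
bin_width = 0.25
n_bins = 4

bin_counts = [0] * n_bins
for h in adult_heights:
    # Which interval does this height fall in?
    index = int((h - bin_start) // bin_width)
    bin_counts[index] += 1

for i, count in enumerate(bin_counts):
    low = bin_start + i * bin_width
    print(f"[{low:.2f}, {low + bin_width:.2f}) {'#' * count}")
```

Changing the bin width can change the apparent shape quite a bit, especially with this little data.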

This is somewhat atypical. More often, we see a shape that is heavy in the middle and lighter on the ends. Sometimes this is called a bell shape, and it suggests a normal distribution. I think we’re missing it here simply because of our small class. Here’s the height histogram for the 60+ statistics students I had last Fall:

While normally distributed data is common, not all data is normally distributed. I suppose there’s no reason to think that age in a college class should be normally distributed.

A box plot

We briefly met box plots in our first day demo. Here’s a look at a box plot for our class heights.

Any dots that we see are outliers. The thick line in the middle is the median, and the two lines on either side of it (the edges of the box) represent the first and third quartiles. The dashes at the end of the lines extending out of the box are the max and min, excluding the outliers. We’ll talk a little more carefully about all of those in a bit.
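
Here’s a sketch of the numbers behind a box plot, using the six heights from the rows shown earlier. Two things here are assumptions on my part: the common convention that an outlier is any point more than 1.5 times the interquartile range beyond the box, and the "inclusive" quartile method from Python’s `statistics` module. Quartile conventions differ between tools, so the picture above may use slightly different values.

```python
import statistics

# Heights (in feet) from the six rows shown earlier, including the 8-year-old.
heights = [4.166667, 5.833333, 5.333333, 5.583333, 5.333333, 6.083333]

# Quartiles; the "inclusive" method is one convention among several.
q1, median, q3 = statistics.quantiles(heights, n=4, method="inclusive")
iqr = q3 - q1

# Assumed outlier rule: more than 1.5 * IQR beyond the box.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [h for h in heights if h < low_fence or h > high_fence]

# The whiskers reach to the min and max of the points that are not outliers.
inside = [h for h in heights if low_fence <= h <= high_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(f"median: {median:.4f}, box: {q1:.4f} to {q3:.4f}")
print(f"whiskers: {whisker_low:.4f} to {whisker_high:.4f}")
print(f"outliers: {outliers}")
```

With these six values, the only point flagged as an outlier is the 8-year-old’s height.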

Another dataset - The Titanic

In chapter 2, the textbook talks a lot about the Titanic. There is a famous data set for this disaster which I have stored on my webspace. The first couple of rows look something like so:

Class	Age	Gender	Survived
1	a	m	y
1	a	m	y

There are 2201 rows like that. The fields are:

Class:
- 0 (for Crew)
- 1 (for First)
- 2 (for Second)
- 3 (for Third)
Gender:
- m
- f
Age:
- a (for Adult)
- c (for Child)
Survived:
- y
- n

Thus, what we have here are four categorical variables. Class is an example of an ordinal variable; the others are nominal, though it still helps to see them in context.

Looking at the complete data table is pretty much useless. Instead, we often look at various summary tables. I suppose the most logical place to start is to count how many lived and how many did not.

## 
##    n    y 
## 1490  711

We can convert this to a relative frequency table by dividing through by the total number of people.

## 
##        n        y 
## 0.676965 0.323035
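
The division is simple enough to check by hand, but here’s a short sketch of it anyway, starting from the two counts in the frequency table above. The variable names are my own.

```python
# Counts from the frequency table above.
survived_counts = {"n": 1490, "y": 711}
n_people = sum(survived_counts.values())

# Relative frequency: each count divided by the total number of people.
relative_freq = {outcome: count / n_people for outcome, count in survived_counts.items()}

for outcome, share in relative_freq.items():
    print(f"{outcome}: {share:.6f}")
```

Rounded to six places, this reproduces the 0.676965 and 0.323035 shown above.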

More interesting is a contingency table that compares two of the variables.

##      
##          n    y  Sum
##   0    673  212  885
##   1    122  203  325
##   2    167  118  285
##   3    528  178  706
##   Sum 1490  711 2201
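
To show how a table like this is tallied from individual rows, here’s a sketch that works backwards: it rebuilds one (Class, Survived) record per person from the cell counts above, since the full file isn’t reproduced in these notes, and then counts the pairs and the margins. With the real data you would read the records from the file instead.

```python
from collections import Counter

# Cell counts copied from the contingency table above.
cell_counts = {
    ("0", "n"): 673, ("0", "y"): 212,
    ("1", "n"): 122, ("1", "y"): 203,
    ("2", "n"): 167, ("2", "y"): 118,
    ("3", "n"): 528, ("3", "y"): 178,
}

# Rebuild one (Class, Survived) record per person from those counts.
records = [pair for pair, n in cell_counts.items() for _ in range(n)]

# Tally the pairs, then the row margins (by class) and column margins (by outcome).
table = Counter(records)
row_sums = Counter(cls for cls, _ in records)
col_sums = Counter(outcome for _, outcome in records)

print(f"{'':>5}{'n':>6}{'y':>6}{'Sum':>6}")
for cls in sorted(row_sums):
    print(f"{cls:>5}{table[cls, 'n']:>6}{table[cls, 'y']:>6}{row_sums[cls]:>6}")
print(f"{'Sum':>5}{col_sums['n']:>6}{col_sums['y']:>6}{len(records):>6}")
```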

There are a number of ways to translate a contingency table into a relative frequency table.

Divide the rows by their sums
Divide the columns by their sums
Divide everything by the total sum

Here’s a relative frequency table of the first type.

##    
##             n         y       Sum
##   0 0.7604520 0.2395480 1.0000000
##   1 0.3753846 0.6246154 1.0000000
##   2 0.5859649 0.4140351 1.0000000
##   3 0.7478754 0.2521246 1.0000000

We see quite clearly the dependence of survival on class.
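
The three ways of normalizing listed above can be sketched side by side. The counts come from the contingency table; the names `by_row`, `by_col`, and `by_total` are just labels I chose for the three versions.

```python
# Counts from the contingency table above: class -> outcome -> count.
counts = {
    "0": {"n": 673, "y": 212},
    "1": {"n": 122, "y": 203},
    "2": {"n": 167, "y": 118},
    "3": {"n": 528, "y": 178},
}

grand_total = sum(sum(row.values()) for row in counts.values())
column_sums = {o: sum(row[o] for row in counts.values()) for o in ("n", "y")}

# 1. Divide the rows by their sums: survival rates within each class.
by_row = {c: {o: v / sum(row.values()) for o, v in row.items()} for c, row in counts.items()}

# 2. Divide the columns by their sums: class make-up of each outcome.
by_col = {c: {o: v / column_sums[o] for o, v in row.items()} for c, row in counts.items()}

# 3. Divide everything by the total sum: share of all people in each cell.
by_total = {c: {o: v / grand_total for o, v in row.items()} for c, row in counts.items()}

for c, row in by_row.items():
    print(c, f"{row['n']:.7f}", f"{row['y']:.7f}")
```

Each version answers a different question, so it’s worth saying which one you are showing.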

I’m a big fan of Mosaic plots, which are geometric realizations of the third type of relative frequency table.
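
To make that geometric claim concrete, here’s a sketch of the rectangle sizes in a mosaic plot of Class against Survived. I’m assuming one common layout: each class gets a strip whose width is its share of all people, and that strip is split by height according to the survival rates within the class. Actual plotting tools may orient things differently.

```python
# Counts from the contingency table above: class -> outcome -> count.
titanic_counts = {
    "0": {"n": 673, "y": 212},
    "1": {"n": 122, "y": 203},
    "2": {"n": 167, "y": 118},
    "3": {"n": 528, "y": 178},
}
n_total = sum(sum(row.values()) for row in titanic_counts.values())

# For each cell, store (width, height, area) of its tile in a unit square.
tiles = {}
for cls, row in titanic_counts.items():
    class_total = sum(row.values())
    width = class_total / n_total      # the class's share of all people
    for outcome, n in row.items():
        height = n / class_total       # the outcome's share within the class
        tiles[cls, outcome] = (width, height, width * height)

for (cls, outcome), (width, height, area) in tiles.items():
    print(f"class {cls}, {outcome}: width {width:.3f}, height {height:.3f}, area {area:.3f}")
```

The area of each tile works out to the cell count divided by the total, which is exactly the third type of relative frequency table, while the heights within each strip match the first type.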