Let’s take a look at some actual data and talk carefully about how we might inspect it, visualize it, and describe it. This material corresponds closely to chapter 2 and a little of chapter 3 of our textbook.
Let’s begin with our classroom data. I’ve got that data recorded in a CSV file right here. If you’d like to know how I scraped the data off of the webpage, you can read about the process.
Written out in a table, the first few rows look like this:
Gender | Age | Height | Eye.Color | School |
---|---|---|---|---|
f | 8 | 4.166667 | hazel | Isaac Dickson |
m | 23 | 5.833333 | brown | UNCA |
f | 23 | 5.333333 | blue | UNCA |
f | 29 | 5.583333 | hazel | UNCA |
f | 48 | 5.333333 | brown | UNCA |
m | 18 | 6.083333 | green | Duke |
This is the inspection step; we’re just taking a quick look at the data to see what type of data it is. We have a data frame with five columns - corresponding to five variables of three distinct types:
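One rough way to make this inspection step concrete is to read the rows and ask what each column's values parse as. The sketch below is only an illustration in Python using the standard library: the CSV text is typed in from the six rows shown above rather than read from the actual file, and `guess_type` is a hypothetical helper, not something from the course materials.

```python
import csv
import io

# The six rows shown in the table above, written out as CSV text.
SAMPLE_CSV = """Gender,Age,Height,Eye.Color,School
f,8,4.166667,hazel,Isaac Dickson
m,23,5.833333,brown,UNCA
f,23,5.333333,blue,UNCA
f,29,5.583333,hazel,UNCA
f,48,5.333333,brown,UNCA
m,18,6.083333,green,Duke
"""

def guess_type(values):
    """Hypothetical helper: classify a column by what all of its values parse as."""
    for label, parse in (("integer", int), ("decimal", float)):
        try:
            for v in values:
                parse(v)
            return label
        except ValueError:
            continue
    return "text"

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
column_types = {name: guess_type([row[name] for row in rows]) for name in rows[0]}

for name, kind in column_types.items():
    print(f"{name}: {kind}")
```

Columns that come back as text (such as Gender and Eye.Color) are the ones we would treat as categorical; the integer and decimal columns are numerical.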
Our next step might be visualization, and the different types of data (categorical and numerical) require different types of visualization.
We’ve got two interesting categorical variables: Gender and Eye Color. A bar chart is a fundamental tool for visualizing this type of simple categorical data. Pie charts are often used as well, though that’s not generally a good idea. Let’s start with those to see why.
According to the data entries currently posted, there are 7 men, 5 women, and one gender neutral person enrolled in the class. (That includes my daughter’s fake entry.) A common way to visualize relative amounts of categorical data is with a pie chart. Here is the pie chart for this data.
In this particular example, it’s not too hard to see the relative sizes of the categories. As we get more categories, this becomes quite a bit harder. Here’s the pie chart for eye color:
The problem is exacerbated with the curiously popular 3D pie chart.
Of course, the problem is much worse still with more categories. Here’s a pie chart for 20 randomly generated quantities:
The only time that a pie chart is at all reasonable is when we are investigating the relative proportion of two categories. Here’s a good example:
Another (often better) way to visualize categories is with a bar plot:
Note how easy it is to see relative magnitudes - even with that list of 20 randomly generated quantities:
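To see why bars are easy to compare, here is a minimal text-only sketch of a bar plot for the gender counts quoted earlier (7 men, 5 women, one gender neutral person). It is written in Python purely for illustration; `text_bar_chart` is a hypothetical helper and not how the figures in these notes were produced.

```python
# Gender counts reported in the notes above: 7 men, 5 women, 1 gender neutral.
gender_counts = {"men": 7, "women": 5, "gender neutral": 1}

def text_bar_chart(counts):
    """Hypothetical helper: one line per category, with bar length equal to the count."""
    width = max(len(label) for label in counts)
    lines = []
    # Sort from largest to smallest so relative magnitudes are easy to read.
    for label, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        lines.append(f"{label.rjust(width)} | {'#' * count} {count}")
    return lines

chart = text_bar_chart(gender_counts)
print("\n".join(chart))
```

Because every bar starts from the same baseline, comparing categories only requires comparing lengths, which is much easier than comparing the angles or areas of pie slices.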
We have two numerical variables: Age and Height. Histograms and box plots (or box and whisker plots) are two standard ways to visualize a list of numeric data.
Let’s take a look at a histogram of the heights (after removing my 8-year-old daughter). The idea is to count how many observations lie in a collection of equally spaced intervals. A picture gets the idea across.
This is somewhat atypical. More often, we see a shape that is heavy in the middle and lighter on the ends. Sometimes this is called a bell shape, and it suggests a normal distribution. I think we’re missing it here simply because of our small class. Here’s the height histogram for the 60+ statistics students I had last Fall:
While normally distributed data is common, not all data is normally distributed. I suppose there’s no reason to think that age in a college class should be normally distributed.
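The counting behind a histogram can be sketched in a few lines. The example below is an illustration only: it uses just the five heights visible in the table at the top of these notes (without the 8-year-old), and the bin start and width are assumptions chosen for the example, not the bins used in the actual figures.

```python
import math

# Heights (in feet) from the rows shown earlier, leaving out the 8-year-old.
heights = [5.833333, 5.333333, 5.583333, 5.333333, 6.083333]

def histogram_counts(values, start, width, bins):
    """Count the values in each interval [start + i*width, start + (i+1)*width)."""
    counts = [0] * bins
    for v in values:
        i = math.floor((v - start) / width)
        if 0 <= i < bins:  # values outside the covered range are ignored
            counts[i] += 1
    return counts

# Assumed bins: three half-foot intervals starting at 5 feet.
counts = histogram_counts(heights, start=5.0, width=0.5, bins=3)
for i, c in enumerate(counts):
    lo = 5.0 + 0.5 * i
    print(f"[{lo:.1f}, {lo + 0.5:.1f}): {'#' * c}")
```

Changing the bin width changes the picture, which is one reason a small sample may not look bell shaped.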
We briefly met box plots in our first day demo. Here’s a look at a box plot for our class heights.
Any dots that we see are outliers. The thick line in the middle is the median and the two lines on either side of that represent the quartiles. The dashes at the end of the lines extending out of the box are the max and min, excluding the outliers. We’ll talk a little more carefully about all of those in a bit.
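As a hedged sketch of what goes into such a plot, the Python below computes the quartiles, the whisker ends, and the outliers for the six heights shown in the table at the top of these notes. Two things here are assumptions rather than facts from these notes: outliers are flagged with the common 1.5 × IQR rule, and the quartiles use the `inclusive` method of `statistics.quantiles`. Different software uses slightly different quartile conventions, so the numbers may not match the plot exactly.

```python
import statistics

# Heights (in feet) from the six rows shown earlier, including the 8-year-old.
heights = [4.166667, 5.833333, 5.333333, 5.583333, 5.333333, 6.083333]

def box_plot_summary(values):
    """Quartiles, whisker ends, and outliers under the assumed 1.5 * IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),    # smallest value that is not an outlier
        "whisker_high": max(inside),   # largest value that is not an outlier
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }

summary = box_plot_summary(heights)
for name, value in summary.items():
    print(name, value)
```

With these six values, the only point flagged as an outlier is the 8-year-old's height.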
In chapter 2, the textbook talks a lot about the Titanic. There is a famous data set for this disaster which I have stored on my webspace. The first couple of rows look something like so:
Class | Age | Gender | Survived |
---|---|---|---|
1 | a | m | y |
1 | a | m | y |
There are 2201 rows like that. The fields are:
Thus, what we have here are four categorical variables. Class is an example of an ordinal variable; the others are nominal, though it still helps to see them in context.
Looking at the complete data table is pretty much useless. We often look at various summary tables. I suppose the most logical is to count how many lived and how many did not.
n | y |
---|---|
1490 | 711 |
We can convert this to a relative frequency table by dividing through by the total number of people.
n | y |
---|---|
0.676965 | 0.323035 |
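The division is simple enough to check by hand, but here is a small Python sketch of it anyway. The counts are copied from the frequency table above; reading `n` as "did not survive" and `y` as "survived" is my interpretation of the labels.

```python
# Counts from the frequency table above (n = did not survive, y = survived).
survived_counts = {"n": 1490, "y": 711}

total = sum(survived_counts.values())

# Relative frequency = count divided by the total number of people.
relative = {key: count / total for key, count in survived_counts.items()}

for key, share in relative.items():
    print(key, round(share, 6))
```

The relative frequencies always add up to 1, which is a handy sanity check.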
More interesting is a contingency table that compares two of the variables.
Class | n | y | Sum |
---|---|---|---|
0 | 673 | 212 | 885 |
1 | 122 | 203 | 325 |
2 | 167 | 118 | 285 |
3 | 528 | 178 | 706 |
Sum | 1490 | 711 | 2201 |
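The `Sum` row and column are just the margins of the table. As a quick illustrative check in Python, with the cell counts copied from the table above and the rows taken to be the Class variable, the margins can be rebuilt from the eight inner cells:

```python
# Cell counts from the contingency table above: class -> {survived: count}.
cells = {
    0: {"n": 673, "y": 212},
    1: {"n": 122, "y": 203},
    2: {"n": 167, "y": 118},
    3: {"n": 528, "y": 178},
}

# Row margins: how many people are in each class.
row_sums = {cls: sum(row.values()) for cls, row in cells.items()}

# Column margins: how many did not survive (n) and how many did (y).
col_sums = {key: sum(row[key] for row in cells.values()) for key in ("n", "y")}

grand_total = sum(row_sums.values())

print(row_sums)
print(col_sums)
print(grand_total)
```

The column margins reproduce the simple frequency table from before, so the contingency table contains strictly more information.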
There are a number of ways to translate a contingency table into a relative frequency table.
Here’s a relative frequency table of the first type, in which each count is divided by its row total so that every row sums to 1.
Class | n | y | Sum |
---|---|---|---|
0 | 0.7604520 | 0.2395480 | 1.0000000 |
1 | 0.3753846 | 0.6246154 | 1.0000000 |
2 | 0.5859649 | 0.4140351 | 1.0000000 |
3 | 0.7478754 | 0.2521246 | 1.0000000 |
We see quite clearly the dependence of survival on class.
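The row-wise proportions in that table can be reproduced with a short sketch. This is an illustration in Python, with the cell counts copied from the contingency table above; each count is divided by the total for its own class.

```python
# Cell counts from the contingency table above: class -> {survived: count}.
cells = {
    0: {"n": 673, "y": 212},
    1: {"n": 122, "y": 203},
    2: {"n": 167, "y": 118},
    3: {"n": 528, "y": 178},
}

# Divide each count by its row total, so each class's shares add up to 1.
row_shares = {
    cls: {key: count / sum(row.values()) for key, count in row.items()}
    for cls, row in cells.items()
}

for cls, shares in row_shares.items():
    print(cls, round(shares["n"], 7), round(shares["y"], 7))
```

Reading down the `y` shares makes the comparison direct: the survival proportion in class 1 is more than double that in class 3.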
I’m a big fan of Mosaic plots, which are geometric realizations of the third type of relative frequency table.