Retrospective and prospective outlook

Yesterday, we covered section 1.6, on the analysis and visualization of numerical data. Today, we’ll cover section 1.7, which does the same for categorical data.

Another look at our email data

email_data <- read.delim('https://www.marksmath.org/classes/Summer2017Stat185/data/email.txt')
names(email_data)
##  [1] "spam"         "to_multiple"  "from"         "cc"          
##  [5] "sent_email"   "time"         "image"        "attach"      
##  [9] "dollar"       "winner"       "inherit"      "viagra"      
## [13] "password"     "num_char"     "line_breaks"  "format"      
## [17] "re_subj"      "exclaim_subj" "urgent_subj"  "exclaim_mess"
## [21] "number"

Most of these are numeric, but spam and number are categorical.

It’s easy to generate a numerical summary of a categorical variable by counting how often each value occurs:

table(email_data$number)
## 
##   big  none small 
##   545   549  2827
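
Counts are often easier to read as proportions. The following is a minimal sketch, assuming we simply re-enter the three counts printed above rather than reloading the file; the names number_counts and number_props are introduced here for illustration only.

```r
# Counts copied from the table(email_data$number) output above
number_counts <- c(big = 545, none = 549, small = 2827)

# Divide each count by the total to get relative frequencies
number_props <- prop.table(number_counts)
round(number_props, 3)
```

Roughly 14% of the emails fall in the big category, 14% in none, and 72% in small. With the data loaded, prop.table(table(email_data$number)) should give the same result.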

Barplots

For examining a single variable at a time:

par(mfrow=c(1,2))
barplot(table(email_data$number))
title('number frequencies')
barplot(table(email_data$spam))
title('spam frequencies')

Relating two variables

spam_number_tab <- table( 
  ifelse(email_data$spam == 0, "not spam", "spam"),
  email_data$number
)
addmargins(spam_number_tab)
##           
##             big none small  Sum
##   not spam  495  400  2659 3554
##   spam       50  149   168  367
##   Sum       545  549  2827 3921
barplot(spam_number_tab, col=c('gray', '#dd2400'))
title('A stacked bar plot')

mosaicplot(spam_number_tab, main="Mosaic plot of spam vs number")
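
Because the three number categories have very different totals, raw counts make it hard to compare how common spam is within each one. Here is a hedged sketch of one way to do that with column proportions. It assumes we rebuild the table from the counts printed above; the names spam_number_counts and col_props are illustrative and not part of the original notes.

```r
# Counts copied from the addmargins(spam_number_tab) output above
# (matrix() fills column by column: big, then none, then small)
spam_number_counts <- matrix(
  c(495, 50, 400, 149, 2659, 168),
  nrow = 2,
  dimnames = list(c("not spam", "spam"), c("big", "none", "small"))
)

# margin = 2 divides each column by its own total
col_props <- prop.table(spam_number_counts, margin = 2)
round(col_props, 3)

# Every bar now has height 1, so the colored shares are directly comparable
barplot(col_props, col = c("gray", "#dd2400"))
title("A standardized stacked bar plot")
```

The spam share comes out to about 9% for big, 27% for none, and 6% for small, a difference that is hard to see in the unstandardized plot.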

CDC Data

source("http://www.openintro.org/stat/data/cdc.R")
health_smoke_tab <- table(
  cdc$genhlth,
  ifelse(cdc$smoke100 == 0, "False", "True"))
addmargins(health_smoke_tab)
##            
##             False  True   Sum
##   excellent  2879  1778  4657
##   very good  3758  3214  6972
##   good       2782  2893  5675
##   fair        911  1108  2019
##   poor        229   448   677
##   Sum       10559  9441 20000
mosaicplot(health_smoke_tab, 
  xlab="Condition of general health",
  ylab="Smoked at least 100 cigs",
  main="Mosaic plot relating smoking and health")
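
The mosaic plot encodes, for each health category, the share of respondents who have smoked at least 100 cigarettes. As a rough sketch, those shares can also be computed directly as row proportions. The table is re-entered from the counts printed above, and the names health_smoke_counts and row_props are assumptions made for this example.

```r
# Counts copied from the addmargins(health_smoke_tab) output above
# (first column is "False", second column is "True")
health_smoke_counts <- matrix(
  c(2879, 3758, 2782, 911, 229,
    1778, 3214, 2893, 1108, 448),
  ncol = 2,
  dimnames = list(
    c("excellent", "very good", "good", "fair", "poor"),
    c("False", "True")
  )
)

# margin = 1 divides each row by its own total
row_props <- prop.table(health_smoke_counts, margin = 1)
round(row_props, 3)
```

The "True" share rises steadily from about 38% among respondents in excellent health to about 66% among those in poor health. That describes an association in this sample; it does not by itself establish a cause.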

Creating categories from numerical data

county_data <- read.delim('https://www.marksmath.org/classes/Summer2017Stat185/data/county.txt')
pop_gain <- subset(county_data, pop2000 <= pop2010)
pop_loss <- subset(county_data, pop2000 > pop2010)

par(mfrow=c(1,2))
boxplot(pop_gain$poverty, range=0)
title('Gain')
boxplot(pop_loss$poverty, range=0)
title('Loss')
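
An alternative to splitting the data frame is to store the new category as a column and let boxplot group by it, which puts both boxes on a shared vertical axis. The sketch below uses a small made-up data frame so that it runs without downloading anything. The values in toy_counties are hypothetical and are not taken from county.txt, and the column name pop_change is introduced here for illustration.

```r
# Hypothetical toy values for illustration only; not taken from county.txt
toy_counties <- data.frame(
  pop2000 = c(1000, 2000, 1500, 3000, 2500, 1200),
  pop2010 = c(1100, 1900, 1500, 3300, 2400, 1000),
  poverty = c(10.2, 18.5, 12.1, 9.4, 16.0, 21.3)
)

# Label each county by comparing the two numerical columns
toy_counties$pop_change <- factor(
  ifelse(toy_counties$pop2000 <= toy_counties$pop2010, "Gain", "Loss")
)

# Count how many counties landed in each new category
table(toy_counties$pop_change)

# The formula interface draws one box per category on a shared axis
boxplot(poverty ~ pop_change, data = toy_counties, range = 0)
title("Poverty by population change")
```

The same pattern should carry over to county_data, since it relies only on the pop2000, pop2010, and poverty columns used above.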