Yesterday, we covered section 1.6, on the analysis and visualization of numerical data. Today, we’ll cover section 1.7, which does the same for categorical data.
email_data = read.delim('https://www.marksmath.org/classes/Summer2017Stat185/data/email.txt')
names(email_data)
## [1] "spam" "to_multiple" "from" "cc"
## [5] "sent_email" "time" "image" "attach"
## [9] "dollar" "winner" "inherit" "viagra"
## [13] "password" "num_char" "line_breaks" "format"
## [17] "re_subj" "exclaim_subj" "urgent_subj" "exclaim_mess"
## [21] "number"
Most of these are numeric, but spam and number are categorical; number is an ordinal categorical variable.
It’s easy to generate a numerical summary:
table(email_data$number)
##
##   big  none small
##   545   549  2827
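Counts are often easier to compare as proportions, and because number is ordinal it reads better in its natural order than in the alphabetical order that table uses by default. The following is a minimal sketch that re-enters the three counts printed above by hand rather than reloading the data; the names number_counts and number_props are introduced here purely for illustration.

```r
# Counts copied from the table(email_data$number) output above
number_counts <- c(big = 545, none = 549, small = 2827)

# Reorder from alphabetical to the natural order none < small < big
number_counts <- number_counts[c("none", "small", "big")]

# Convert counts to proportions of all emails
number_props <- number_counts / sum(number_counts)
round(number_props, 3)
```

With the full data frame loaded, prop.table(table(email_data$number)) should give the same proportions, and factor(email_data$number, levels = c("none", "small", "big")) is one way to fix the ordering before tabulating.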
For examining a single variable at a time, a bar plot is a natural choice:
par(mfrow=c(1,2))
barplot(table(email_data$number))
title('number frequencies')
barplot(table(email_data$spam))
title('spam frequencies')
# Cross-tabulate spam status (rows) against number (columns)
spam_number_tab <- table(
  ifelse(email_data$spam == 0, "not spam", "spam"),
  email_data$number
)
addmargins(spam_number_tab)
##
##             big none small  Sum
##   not spam  495  400  2659 3554
##   spam       50  149   168  367
##   Sum       545  549  2827 3921
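Because the three number groups differ so much in size, the raw counts make it hard to see how spam rates compare across them. One way to look at this is with column proportions. The sketch below rebuilds the table from the counts printed above (the names spam_number_counts and col_props are introduced here for illustration) and divides each column by its total.

```r
# Rebuild the contingency table from the counts printed above
spam_number_counts <- matrix(
  c(495, 400, 2659,
     50, 149,  168),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("not spam", "spam"), c("big", "none", "small"))
)

# margin = 2 divides each column by its total, giving the share
# of spam and non-spam emails within each level of number
col_props <- prop.table(spam_number_counts, margin = 2)
round(col_props, 3)
```

With the data loaded, prop.table(spam_number_tab, margin = 2) should give the same result directly; margin = 1 would instead give the distribution of number within each spam class.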
barplot(spam_number_tab, col=c('gray', '#dd2400'))
title('A stacked bar plot')
mosaicplot(spam_number_tab, main = "Mosaic plot of spam vs number")
source("http://www.openintro.org/stat/data/cdc.R")
health_smoke_tab <- table(
cdc$genhlth,
ifelse(cdc$smoke100 == 0, "False", "True"))
addmargins(health_smoke_tab)
##
##             False  True   Sum
##   excellent  2879  1778  4657
##   very good  3758  3214  6972
##   good       2782  2893  5675
##   fair        911  1108  2019
##   poor        229   448   677
##   Sum       10559  9441 20000
mosaicplot(health_smoke_tab,
  xlab = "Condition of general health",
  ylab = "Smoked at least 100 cigs",
  main = "Mosaic plot relating smoking and health")
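It can help to see the numbers behind the tiles of a mosaic plot. In the plot of health against smoking, each health level gets a column whose width is proportional to its share of all respondents, and each column is then split according to the smoking proportions within that health level. The following is a rough sketch of that arithmetic, using the counts printed above; health_smoke_counts, widths, and heights are names introduced here for illustration.

```r
# Rebuild the health-by-smoking table from the counts printed above
health_smoke_counts <- matrix(
  c(2879, 1778,
    3758, 3214,
    2782, 2893,
     911, 1108,
     229,  448),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("excellent", "very good", "good", "fair", "poor"),
    c("False", "True")
  )
)

# Column widths: share of all respondents in each health level
widths <- rowSums(health_smoke_counts) / sum(health_smoke_counts)

# Tile heights: smoking split within each health level (rows sum to 1)
heights <- prop.table(health_smoke_counts, margin = 1)

round(widths, 3)
round(heights, 3)
```

Reading down the True column of heights shows the proportion of respondents who have smoked at least 100 cigarettes rising as reported health gets worse, which is the pattern the mosaic plot displays visually.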
county_data = read.delim('https://www.marksmath.org/classes/Summer2017Stat185/data/county.txt')
pop_gain <- subset(county_data, pop2000 <= pop2010)
pop_loss <- subset(county_data, pop2000 > pop2010)
par(mfrow=c(1,2))
boxplot(pop_gain$poverty, range=0)
title('Gain')
boxplot(pop_loss$poverty, range=0)
title('Loss')
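The two panels above each get their own vertical axis, which can make the boxes harder to compare. An alternative is to label each county as a gain or a loss and let the formula interface of boxplot draw both boxes on one shared axis. The sketch below uses a small made-up data frame, toy_counties, as a stand-in for county_data; its values are illustrative only and are not taken from the real data set.

```r
# Made-up stand-in for county_data (values are illustrative only)
toy_counties <- data.frame(
  pop2000 = c(100, 200, 300, 400, 500, 600),
  pop2010 = c(110, 190, 330, 380, 520, 590),
  poverty = c(10.2, 18.5, 9.1, 21.0, 12.3, 16.4)
)

# Label each county by whether its population grew or shrank
toy_counties$change <- ifelse(
  toy_counties$pop2000 <= toy_counties$pop2010, "Gain", "Loss"
)

# One call draws both boxes side by side on a shared vertical axis
boxplot(poverty ~ change, data = toy_counties, range = 0,
        ylab = "poverty")
```

The same two steps should carry over to county_data itself, assuming it has the pop2000, pop2010, and poverty columns used in the code above.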