Playing with data on the computer

Email data

Here's a data set that contains 21 pieces of information on each of nearly 4000 emails:

In [1]:
email_data = read.delim('https://www.marksmath.org/classes/Summer2017Stat185/data/email.txt')
dim(email_data)
  1. 3921
  2. 21

If I enter email_data, my computer will attempt to display all 3921 rows and 21 columns. I can get a sense of the data by just looking at the first few rows, though.

In [2]:
head(email_data)
spam  to_multiple  from  cc  sent_email  time                 image  attach  dollar  winner  ⋯  viagra  password  num_char  line_breaks  format  re_subj  exclaim_subj  urgent_subj  exclaim_mess  number
0     0            1     0   0           2011-12-31 22:16:41  0      0       0       no      ⋯  0       0         11.370    202          1       0        0             0            0             big
0     0            1     0   0           2011-12-31 23:03:59  0      0       0       no      ⋯  0       0         10.504    202          1       0        0             0            1             small
0     0            1     0   0           2012-01-01 08:00:32  0      0       4       no      ⋯  0       0         7.773     192          1       0        0             0            6             small
0     0            1     0   0           2012-01-01 01:09:49  0      0       0       no      ⋯  0       0         13.256    255          1       0        0             0            48            small
0     0            1     0   0           2012-01-01 02:00:01  0      0       0       no      ⋯  0       2         1.231     29           0       0        0             0            1             none
0     0            1     0   0           2012-01-01 02:04:46  0      0       0       no      ⋯  0       2         1.091     25           0       0        0             0            1             none

Can you guess what's going on here? Here's a list of all fields (or names) associated with the data.

In [3]:
names(email_data)
  1. 'spam'
  2. 'to_multiple'
  3. 'from'
  4. 'cc'
  5. 'sent_email'
  6. 'time'
  7. 'image'
  8. 'attach'
  9. 'dollar'
  10. 'winner'
  11. 'inherit'
  12. 'viagra'
  13. 'password'
  14. 'num_char'
  15. 'line_breaks'
  16. 'format'
  17. 're_subj'
  18. 'exclaim_subj'
  19. 'urgent_subj'
  20. 'exclaim_mess'
  21. 'number'
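
Once the field names are known, individual columns can be pulled out by name. The sketch below is only an illustration: email_head is a hypothetical stand-in built from four of the columns in the six rows that head(email_data) displayed, so that it runs without downloading anything. The same expressions should work on email_data itself.

```r
# email_head is a hypothetical name used only for this sketch; its values
# are copied from the six rows shown by head(email_data).
email_head <- data.frame(
  spam        = c(0, 0, 0, 0, 0, 0),
  num_char    = c(11.370, 10.504, 7.773, 13.256, 1.231, 1.091),
  line_breaks = c(202, 202, 192, 255, 29, 25),
  number      = c("big", "small", "small", "small", "none", "none")
)

# Pull out a single column by name with $
email_head$num_char

# Select several columns by name
email_head[, c("num_char", "line_breaks")]

# Count how often each value of a categorical field occurs
number_counts <- table(email_head$number)
number_counts
```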

Here's a histogram of the length of the emails, as recorded in num_char:

In [4]:
hist(email_data$num_char, breaks = 100)
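
The second argument to hist is breaks. When it is a single number, R treats it as a suggested number of bins and then picks "pretty" break points, so the actual bin count can differ. A minimal sketch of how to check what R chose, using only the six num_char values shown earlier (num_char_head is a hypothetical name for this example):

```r
# The six num_char values displayed by head(email_data)
num_char_head <- c(11.370, 10.504, 7.773, 13.256, 1.231, 1.091)

# plot = FALSE returns the histogram object without drawing it
h <- hist(num_char_head, breaks = 5, plot = FALSE)

# The break points R actually used, and the count in each bin
h$breaks
h$counts
```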

And here's a scatter plot showing the (not so surprising) relationship between the number of characters and the number of line breaks:

In [5]:
plot(email_data$num_char, email_data$line_breaks)
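
One way to put a number on that relationship is the correlation coefficient, computed with cor. The sketch below uses just the six rows displayed by head(email_data), so the value it gives is not the correlation for the full data set; with the data loaded, one would presumably pass email_data$num_char and email_data$line_breaks instead.

```r
# Values copied from the six rows shown by head(email_data);
# the vector names are hypothetical and used only here.
num_char_head    <- c(11.370, 10.504, 7.773, 13.256, 1.231, 1.091)
line_breaks_head <- c(202, 202, 192, 255, 29, 25)

# Pearson correlation: values near 1 indicate a strong
# positive linear relationship
r <- cor(num_char_head, line_breaks_head)
r
```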

US Population data

In [6]:
us_population_data = read.delim('https://www.marksmath.org/classes/Summer2017Stat185/data/us_population.txt')
us_population_data
year pop
1790 3929214
1800 5236631
1810 7239881
1820 9638453
1830 12866020
1840 17069453
1850 23191876
1860 31443321
1870 38558371
1880 49371340
1890 62979766
1900 76212168
1910 92228531
1920 106021568
1930 123202660
1940 132165129
1950 151325798
1960 179323175
1970 203211926
1980 226545805
1990 248709873

It might make sense to visualize this with a time series plot.

In [7]:
plot(us_population_data$year, us_population_data$pop, type='b')
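
As a complement to the plot, here is a rough sketch of the percentage growth from one census to the next. To keep it self-contained, it retypes the table above into a data frame rather than downloading it; us_pop and growth are hypothetical names introduced for this example.

```r
# The population table from above, entered by hand
us_pop <- data.frame(
  year = seq(1790, 1990, by = 10),
  pop  = c(3929214, 5236631, 7239881, 9638453, 12866020, 17069453,
           23191876, 31443321, 38558371, 49371340, 62979766, 76212168,
           92228531, 106021568, 123202660, 132165129, 151325798,
           179323175, 203211926, 226545805, 248709873)
)

# diff() gives the change between consecutive censuses; dividing by the
# earlier population turns that into a percentage growth per decade
growth <- data.frame(
  year       = us_pop$year[-1],
  growth_pct = 100 * diff(us_pop$pop) / head(us_pop$pop, -1)
)
growth

# The census year that closed the slowest-growing decade
slowest_year <- growth$year[which.min(growth$growth_pct)]
slowest_year
```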