Playing with data on the computer

Email data

Here's a data set that records 21 pieces of information about each of nearly 4000 emails:

email_data = read.delim('https://www.marksmath.org/classes/Summer2017Stat185/data/email.txt')
dim(email_data)

If I enter email_data, my computer will attempt to display all 3921 rows and 21 columns. I can get a sense of the data by just looking at the first few rows, though.

head(email_data)

Can you guess what's going on here? Here's a list of all the fields (the column names) in the data.

names(email_data)

Here's a histogram of the length of the emails:

hist(email_data$num_char, 100)
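The second argument to hist is breaks. A single number there is only a suggested bin count, because R rounds the breakpoints to "pretty" values. Here is a minimal sketch of that behavior. It uses simulated numbers as a stand-in for email_data$num_char so that it runs without the download; the names num_char_sim and h are made up for illustration.

```r
# Simulated stand-in for email_data$num_char (NOT the real data):
# arbitrary right-skewed positive values, one per email.
set.seed(1)
num_char_sim <- rexp(3921, rate = 1 / 10)

# Name the second argument explicitly. plot = FALSE returns the
# bins without drawing anything.
h <- hist(num_char_sim, breaks = 100, plot = FALSE)

length(h$counts)  # number of bins actually used; need not be exactly 100
sum(h$counts)     # every observation falls into exactly one bin
```

With the real data, hist(email_data$num_char, breaks = 100) draws the same picture as the call above, just with the argument named.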

And here's a plot showing the (not so surprising) relationship between the number of characters and the number of line breaks.

plot(email_data$num_char, email_data$line_breaks)
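One way to put a number on that relationship is the correlation coefficient. The sketch below uses only the six rows that head(email_data) displays, typed in by hand, so it says little about the full data set. The names first_rows and r are made up for illustration.

```r
# The num_char and line_breaks values from the six rows shown by
# head(email_data), entered by hand (a tiny sample, not the full data).
first_rows <- data.frame(
  num_char    = c(11.370, 10.504, 7.773, 13.256, 1.231, 1.091),
  line_breaks = c(202, 202, 192, 255, 29, 25)
)

# Pearson correlation between the two columns.
r <- cor(first_rows$num_char, first_rows$line_breaks)
r
```

On the full data, the same idea would be cor(email_data$num_char, email_data$line_breaks).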

US Population data

us_population_data = read.delim('https://www.marksmath.org/classes/Summer2017Stat185/data/us_population.txt')
us_population_data

It might make sense to visualize this with a time series plot.

plot(us_population_data$year, us_population_data$pop, type='b')
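The time series plot shows the overall trend. To see how fast the population grew from one census to the next, one option is the percent change per decade. This sketch rebuilds the table by hand from the values that us_population_data prints, so it runs without the download; the names us_pop, pct_growth, and growth are made up for illustration.

```r
# Census values copied from the printed us_population_data table.
us_pop <- data.frame(
  year = seq(1790, 1990, by = 10),
  pop  = c(3929214, 5236631, 7239881, 9638453, 12866020, 17069453,
           23191876, 31443321, 38558371, 49371340, 62979766, 76212168,
           92228531, 106021568, 123202660, 132165129, 151325798,
           179323175, 203211926, 226545805, 248709873)
)

# Percent change from each census to the next:
# diff() gives the increase, head(x, -1) gives the starting value.
pct_growth <- 100 * diff(us_pop$pop) / head(us_pop$pop, -1)

# One row per decade, labeled by the year the decade starts.
growth <- data.frame(
  decade_start = head(us_pop$year, -1),
  pct_growth   = round(pct_growth, 1)
)
growth
```

Calling plot(growth$decade_start, growth$pct_growth, type='b') would show these rates in the same style as the plot above.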

Output of head(email_data) (⋯ marks columns omitted from the display):

from	time	dollar	winner	⋯	password	num_char	line_breaks	format	exclaim_mess	number
1	2011-12-31 22:16:41	0	no	⋯	0	11.370	202	1	0	big
1	2011-12-31 23:03:59	0	no	⋯	0	10.504	202	1	1	small
1	2012-01-01 08:00:32	4	no	⋯	0	7.773	192	1	6	small
1	2012-01-01 01:09:49	0	no	⋯	0	13.256	255	1	48	small
1	2012-01-01 02:00:01	0	no	⋯	2	1.231	29	0	1	none
1	2012-01-01 02:04:46	0	no	⋯	2	1.091	25	0	1	none

Output of us_population_data:

year	pop
1790	3929214
1800	5236631
1810	7239881
1820	9638453
1830	12866020
1840	17069453
1850	23191876
1860	31443321
1870	38558371
1880	49371340
1890	62979766
1900	76212168
1910	92228531
1920	106021568
1930	123202660
1940	132165129
1950	151325798
1960	179323175
1970	203211926
1980	226545805
1990	248709873