One way to get data these days is via web scraping. That is, you write a computer program that automatically traverses a specific set of web pages that you know contain some type of data that you want. Your program needs to download these pages, parse them, and output a file with the data in some palatable format. This is quite common for sports data because so many news sites present scores and other statistics in a tabular format.
Let’s use this technique to gather the data from our personal data exercise. Recall that you entered the data in a somewhat specific format. For my daughter Audrey’s entry, it looked like so:
| Gender | Age | Height | Eye Color | School |
| ------ | --- | -------- | --------- | ------------- |
| f | 8 | 4' 2'' | hazel | Isaac Dickson |
Ultimately, though, this is not what our web scraper will actually see because the forum software reformats it to look like so:
<table>
<thead>
<tr>
<th>Gender</th> <th>Age</th> <th>Height</th> <th>Eye Color</th> <th>School</th>
</tr>
</thead>
<tbody>
<tr>
<td>f</td> <td>8</td> <td>4’2’’</td> <td>hazel</td> <td>Isaac Dickson</td>
</tr>
</tbody>
</table>
This kind of code is called HTML and is exactly what your web browser needs to see to know how to format your input into a table. It also just so happens that there is an R function that can parse this kind of info directly to a Data Frame.
R has plenty of functions to read in formatted data. We’ll often use read.csv to read CSV files right off of my webspace. Functions to deal with other types of files are contained in libraries, so let’s load the libraries that we’ll need. Note that these libraries are not all part of the standard R installation, so if you want to try this yourself, you might need to use the install.packages command. It’s not at all hard and you can read more about it here.
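If you're not sure which of these libraries you already have, something like the following checks for them first. This is only a sketch: the names `needed`, `is_installed`, and `missing_pkgs` are my own, and the install line is left commented out so that running the check changes nothing on your machine.

```r
# The four libraries used in these notes
needed = c("httr", "xml2", "rvest", "knitr")

# requireNamespace() returns FALSE (rather than stopping) when a package is absent
is_installed = vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs = needed[!is_installed]

if (length(missing_pkgs) > 0) {
  # Uncomment the next line to actually install whatever is missing
  # install.packages(missing_pkgs)
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}
```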
library(httr) # To read data over the web
library(xml2) # Required by the next library
library(rvest) # To parse HTML
library(knitr) # To format tables nicely
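To see the HTML parsing in isolation, here's a small sketch that feeds the table from above to rvest as a plain string, with no web request involved. The variable names `html` and `audrey` are just for illustration, and I've typed the height with straight quotes.

```r
library(rvest)  # also makes read_html() available

# The table as the forum renders it, held in an ordinary string
html = "<table>
<thead>
<tr><th>Gender</th><th>Age</th><th>Height</th><th>Eye Color</th><th>School</th></tr>
</thead>
<tbody>
<tr><td>f</td><td>8</td><td>4' 2''</td><td>hazel</td><td>Isaac Dickson</td></tr>
</tbody>
</table>"

# Parse the string, pick out the first <table>, and convert it to a Data Frame
audrey = html_table(html_node(read_html(html), "table"))
audrey
```

The header cells become the column names and each body row becomes a row of data, which is the same step we'll apply to every post on the page.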
Using this new functionality, we can directly read and display part of my webpage:
input = content(GET('https://mathstat.hwdiscuss.com/t/some-personal-data/43'), 'text', encoding='UTF-8')
cat(substring(input, 266, 770))
## introduction to using Math&amp;Stat HW Discuss, and
## It gathers a little data for us to play with.
##
##
## Reply to this post with the following inform&hellip;">
## <meta name="author" content="">
## <meta name="generator" content="Discourse 2.1.0.beta1 - https://github.com/discourse/discourse version 313ff264f2524bac4383a004bf575b461405b9c7">
## <link rel="icon" type="image/png" href="/uploads/default/original/1X/4471172caf6ec1b30c39e79b4c494d213a6c1360.png">
## <link rel="apple-touch-icon" type="image/png
In fact, you could do this with any webpage. We’ll do something like this with our personal data question page.
The process of scraping our page of interest takes us a bit beyond the scope of this class. Nonetheless, here is the code. We’re going to define an empty Data Frame with columns named Gender, Age, and Height. Then, we’re going to read in all the posts on that page, step through them to extract the data stored in the tables that you entered, and use that to build up our classroom Data Frame.
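One detail worth pulling out of the listing that follows is the paging: the posts come back in chunks, so the code has to work out how many pages to request. Here's that arithmetic on its own as a rough sketch. The helper `topic_page_urls` and its `posts_per_page` argument are hypothetical names of mine; the value 20 simply mirrors the constant in the loop, and the post count of 45 is made up.

```r
# Hypothetical helper: build the JSON URL for every page of a topic.
# posts_per_page = 20 is an assumption taken from the loop, not a documented value.
topic_page_urls = function(topic_url, highest_post_number, posts_per_page = 20) {
  # Round up so that a partly filled last page still gets requested
  n_pages = ceiling(highest_post_number / posts_per_page)
  paste0(topic_url, ".json?page=", seq_len(n_pages))
}

# A made-up topic with 45 posts needs three pages
urls = topic_page_urls("https://mathstat.hwdiscuss.com/t/some-personal-data/43", 45)
urls
```

Rounding up matters here: with 45 posts, plain division gives 2.25, and dropping the fraction would miss the last five posts.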
topic_url = 'https://mathstat.hwdiscuss.com/t/some-personal-data/43'
class_df = data.frame(Gender=character(), Age = integer(), Height = character())
# The topic's JSON summary tells us how many posts there are
json_in = content(GET(paste(topic_url, '.json', sep="")), as="parsed")
page = 0
# Posts are requested 20 per page
while(page < ceiling(json_in$highest_post_number/20)) {
  page = page+1
  page_url = paste(topic_url, '.json?page=', toString(page), sep="")
  page_json = content(GET(page_url), as="parsed")
  posts = page_json$post_stream$posts
  # On page 1, skip the first post: it is the original prompt, not a reply
  if(page == 1 ) {
    start = 2
  } else {
    start = 1
  }
  for(i in start:length(posts)) {
    # Each post's rendered HTML is in $cooked; pull out its table
    df = html_table(html_node(read_html(posts[[i]]$cooked), 'table'))
    class_df = rbind(class_df, df)
  }
}
kable(head(class_df))
Gender | Age | Height | Eye Color | School |
---|---|---|---|---|
f | 8 | 4’ 2’’ | hazel | Isaac Dickson |
m | 23 | 5’ 10’’ | Brown | UNCA |
f | 23 | 5’ 4’’ | blue | UNCA |
f | 29 | 5’ 7’’ | Hazel | UNCA |
f | 48 | 5’4’’ | brown | UNCA |
m | 18 | 6’1’’ | Green | Duke |
OK, pretty cool, but let’s convert those heights to actual numbers.
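ftin_to_feet = function(ftin_str) {
  # Split on the curly quote mark: first piece is feet, second is inches
  spl = strsplit(ftin_str, split="’")[[1]]
  ft = as.numeric(spl[1])
  inch = as.numeric(spl[2])
  if(is.numeric(ft) && !is.na(ft) && is.numeric(inch) && !is.na(inch)) {
    return(ft + inch/12)
  } else {
    # Anything that can't be parsed comes back as 0
    return(0)
  }
}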
class_df$Height = as.numeric(lapply(class_df$Height, ftin_to_feet))
kable(head(class_df))
Gender | Age | Height | Eye Color | School |
---|---|---|---|---|
f | 8 | 4.166667 | hazel | Isaac Dickson |
m | 23 | 5.833333 | Brown | UNCA |
f | 23 | 5.333333 | blue | UNCA |
f | 29 | 5.583333 | Hazel | UNCA |
f | 48 | 5.333333 | brown | UNCA |
m | 18 | 6.083333 | Green | Duke |
Awesome!
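One thing to keep in mind is that the function above returns 0 for any entry it can't parse, and a height of 0 would quietly drag down an average. Here's a possible variant, offered as a sketch rather than as what these notes actually use: the name `ftin_to_feet_na` is hypothetical, it works on a whole vector at once, it accepts straight or curly quote marks, and it returns NA for anything that doesn't contain exactly two numbers.

```r
# Hypothetical alternative to ftin_to_feet: vectorized, NA instead of 0 on failure
ftin_to_feet_na = function(ftin) {
  # Pull every number (digits with an optional decimal part) out of each string
  parts = regmatches(ftin, gregexpr("[0-9]+(\\.[0-9]+)?", ftin))
  vapply(parts, function(p) {
    if (length(p) != 2) return(NA_real_)  # not of the form feet + inches
    as.numeric(p[1]) + as.numeric(p[2]) / 12
  }, numeric(1))
}

# Made-up inputs: curly quotes, straight quotes, no space, and an unparseable entry
heights = c("4\u2019 2\u2019\u2019", "5' 10''", "6\u20191\u2019\u2019", "tall")
converted = ftin_to_feet_na(heights)
converted
```

Because the pattern looks only at the numbers, it doesn't care which kind of quote mark the forum software left in the text. Functions like mean then need na.rm = TRUE to skip the missing values.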
Finally, the CSV file that we actually read to easily work with this data is created like so:
write.csv(class_df, 'class_data_Summer2018.csv', row.names = FALSE)
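If you'd like to convince yourself that a file written this way reads back cleanly, here's a small round-trip sketch. The data frame `toy_df` is made up to stand in for class_df, and it writes to a temporary file so nothing in your working directory is touched.

```r
# A tiny made-up data frame standing in for class_df
toy_df = data.frame(Gender = c("f", "m"), Age = c(8L, 23L),
                    Height = c(4.166667, 5.833333))

# Write to a temporary file instead of the real CSV
tmp = tempfile(fileext = ".csv")
write.csv(toy_df, tmp, row.names = FALSE)

# Read it back and look at the structure
toy_back = read.csv(tmp)
str(toy_back)
```

Without row.names = FALSE, write.csv would also write the row names as an extra first column, which then shows up as an unwanted column when the file is read back.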
It’s worth mentioning that our class outline is generated using R. For example, a histogram for height can be generated as follows:
hist(subset(class_df, Age>10)$Height, col = 'gray', main='',
     xlab = 'Height')