Web scraping

One way to get data these days is via web scraping. That is, you write a computer program that automatically traverses a specific set of web pages that you know contain some type of data that you want. Your program needs to download these pages, parse them, and output a file with the data in some palatable format. This is quite common for sports data because so many news sites present scores and other statistics in a tabular format.

The format of our data

Let’s use this technique to gather the data from our personal data exercise. Recall that you entered the data in a somewhat specific format. For my daughter Audrey’s entry, it looked like so:

| Gender | Age |  Height  |
| --- | ---- | -------- |
| f |  8  | 4' 2'' |

You can read a bit more about typing tables in this post. Ultimately, though, this is not what our web scraper will actually see, because the forum software reformats it to look like so:

<table>
  <thead>
    <tr>
      <th>Gender</th> <th>Age</th> <th>Height</th>
    </tr>
  </thead>
  <tbody>
    <tr>
     <td>f</td> <td>8</td> <td>4’2’’</td>
    </tr>
  </tbody>
</table>

This kind of code is called HTML and is exactly what your web browser needs to see to know how to format your input into a table. It also just so happens that there is an R function that can parse this kind of information directly into a Data Frame.
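To see this in action, here's a small sketch that parses the snippet above using rvest's html_table function (the libraries we need are loaded properly in the next section; here they're loaded inline so the example stands on its own):

```r
library(rvest)  # brings in xml2's read_html as well

# The same HTML table the forum software generates
html = '<table>
  <thead>
    <tr><th>Gender</th> <th>Age</th> <th>Height</th></tr>
  </thead>
  <tbody>
    <tr><td>f</td> <td>8</td> <td>4’2’’</td></tr>
  </tbody>
</table>'

# Parse the HTML, grab the table node, and convert it to a Data Frame
df = html_table(html_node(read_html(html), 'table'))
df
```

The column headers come straight from the `<th>` cells and the rows from the `<td>` cells.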

R functions for reading over the web

R has plenty of functions to read in formatted data. We’ll often use read.csv to read CSV files right off of my webspace. Functions to deal with other types of files are contained in libraries, so let’s load the libraries that we’ll need. Note that these libraries are not all part of the standard R installation so, if you want to try this yourself you might need to use the install.packages command. It’s not at all hard and you can read more about it here.
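If any of these packages are missing from your machine, a one-time setup along the following lines will grab them from CRAN (this is just a convenience sketch; a plain `install.packages("rvest")` for each missing package works just as well):

```r
# One-time setup: install any of the packages we need that are missing.
# (install.packages downloads from CRAN, so this needs an internet connection.)
needed = c("httr", "xml2", "rvest", "knitr")
missing = needed[!sapply(needed, requireNamespace, quietly = TRUE)]
if (length(missing) > 0) install.packages(missing)
```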

library(httr)      # To read data over the web
library(xml2)      # Required by the next library
library(rvest)     # To parse HTML
library(knitr)     # To format tables nicely

Using this new functionality, we can directly read and display part of my webpage:

input = content(GET('https://www.marksmath.org'), 'text', encoding='UTF-8')
cat(substring(input, 266, 770))
##      <section class='main-content'>
##          <h1>Mark McClure</h1>
##          <p>
##          Professor of <a href="http://www.unca.edu/math/">Mathematics</a><br />
##          <a href="http://www.unca.edu/">UNC - Asheville</a><br />
##          325 Robinson<br />
##          mcmcclur-AT-unca-DOT-edu<br />
##          </p>
## 
##          <h2>Fall 2017</h2>
##              <ul>
##                  <li><a href="classes/Fall2017Stat185">Stat 185</a>: MWF 8:00-9:10 and 11:00-12:10</li>
##                  <li><a href="classes/Fall2017ChaosAndFractals">Chaos and fractals</a>: MW 12:30-1:45</li>
##              </ul>
##      </section>

In fact, you could do this with any webpage. We’ll do something like this with our personal data question page.

Scraping our personal data page

The process of scraping our page of interest takes us a bit beyond the scope of this class. Nonetheless, here is the code. We're going to define an empty Data Frame with columns named Gender, Age, and Height. Then, we're going to read in all the posts on that page, step through them to extract the data stored in the tables that you entered, and use that to build up our classroom Data Frame.

topic_url = 'https://statdiscourse.marksmath.org/t/some-personal-data/15'
class_df = data.frame(Gender=character(), Age = integer(), Height = character())

# Grab the topic's JSON to learn how many posts (and, thus, pages) there are
json_in = content(GET(paste(topic_url, '.json', sep="")), as="parsed")

# Discourse serves 20 posts per page, so step through the pages one at a time
page = 0
while(page < ceiling(json_in$highest_post_number/20)) {
  page = page+1
  page_url = paste(topic_url, '.json?page=', toString(page), sep="")
  page_json = content(GET(page_url), as="parsed")
  posts = page_json$post_stream$posts

  # Skip the first post on page 1, since that's the question itself
  if(page == 1) {
    start = 2
  } else {
    start = 1
  }

  # Each post's HTML lives in its "cooked" field; parse out its table
  # and append the resulting rows to our Data Frame
  for(i in start:length(posts)) {
    df = html_table(html_node(read_html(posts[[i]]$cooked), 'table'))
    class_df = rbind(class_df, df)
  }
}
kable(head(class_df))
| Gender | Age | Height |
| --- | --- | --- |
| f | 8 | 4’ 2’’ |
| f | 18 | 5’ 4’’ |
| f | 18 | 5’ 6’’ |
| f | 19 | 5’ 3’’ |
| m | 18 | 5’ 9’’ |
| m | 18 | 6’ 1’’ |

OK, pretty cool! But let's convert those heights to actual numbers.

# Convert a height like "5’ 4’’" to a number of feet
ftin_to_feet = function(ftin_str) {
  # Split on the curly quote that the forum uses, e.g. "5’ 4’’" -> "5", " 4"
  spl = strsplit(ftin_str, split="’")[[1]]
  ft = as.numeric(spl[1])
  inch = as.numeric(spl[2])
  # as.numeric yields NA for anything it can't parse, so guard against that
  if(!is.na(ft) & !is.na(inch)) {
    return(ft + inch/12)
  }
  else {
    return(0)
  }
}
class_df$Height = sapply(class_df$Height, ftin_to_feet)
kable(head(class_df))
| Gender | Age | Height |
| --- | --- | --- |
| f | 8 | 4.166667 |
| f | 18 | 5.333333 |
| f | 18 | 5.500000 |
| f | 19 | 5.250000 |
| m | 18 | 5.750000 |
| m | 18 | 6.083333 |

Awesome!

Finally, the CSV file that we'll actually read whenever we want to work with this data is created like so:

write.csv(class_df, 'class_data_Fall2017.csv', row.names = F)
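As a quick check that this round trip works, here's a sketch with a small throwaway data frame (hypothetical values, not our class data) written to a temporary file and read back with read.csv:

```r
# Round-trip sanity check: write a small data frame to a temporary CSV
# and read it back; read.csv recovers the same columns.
tmp = tempfile(fileext = ".csv")
df_out = data.frame(Gender = c("f", "m"), Age = c(18L, 19L), Height = c(5.5, 6.0))
write.csv(df_out, tmp, row.names = FALSE)
df_in = read.csv(tmp)
head(df_in)
```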

It’s worth mentioning that our class outline is generated using R. For example, a histogram for height can be generated as follows:

hist(subset(class_df, Age > 10)$Height, col = 'gray', main = '', xlab = 'Height')