The following chunk of code scrapes the data that you entered in this Discourse question.

# Uncomment this first line, if necessary:
# install.packages('rvest')
library(rvest)
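# Download and parse the Discourse thread
webpage <- read_html("https://statdiscourse.marksmath.org/t/some-personal-data/17")
# Select every paragraph (<p>) element on the page; each reply is one of them
nodes <- html_nodes(webpage, "p")
# Extract the text of each paragraph as a character vector
html_text(nodes)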
##  [1] "(5 pts)"                                                                                                                                     
##  [2] "This little problem serves two purposes:"                                                                                                    
##  [3] "Please reply to this post with the following information:"                                                                                   
##  [4] "Sex: M/FAge: nHeight: ft' in''School: name"                                                                                                  
##  [5] "Note, I've got a little program that will automate the process of gathering the data so please copy the template below and edit accordingly."
##  [6] "Sex: FAge: 22Height: 5 ft' 0 in''School: UNCA"                                                                                               
##  [7] "Sex: MAge: 25Height: 5' 10\"School: University of North Carolina Asheville"                                                                  
##  [8] "Sex: MAge: 40Height: 6'5\"School: UNCA/NCSU"                                                                                                 
##  [9] "Sex: FAge: 22Height: 5 ft' 4 in''School: UNCA"                                                                                               
## [10] "Sex: FAge: 29Height: 5' 1''School: UNCA"                                                                                                     
## [11] "Sex: FAge: 22Height: 5 ft' 10 in''School: UNCA"                                                                                              
## [12] "Sex: MAge: 21Height: 5' 7''School: UNCA"                                                                                                     
## [13] "Sex: FAge: 26Height: 5 ft' 6 in''School: UNCA"                                                                                               
## [14] "Sex: MAge: 32Height: 5ft' 9in''School: UNCA"                                                                                                 
## [15] "Sex: MAge: 20Height: 5 ft' 7 in''School: UNCA"                                                                                               
## [16] "Sex: FAge: 22Height: 5 ft' 1 in''School: UNCA"                                                                                               
## [17] "Sex: MaleAge: 22Height: 6'0School: UNCA"                                                                                                     
## [18] "Sex: MAge: 19Height: 6 ft' 2 in''School: UNC Chapel Hill"                                                                                    
## [19] "Sex: FAge: 7Height: 4ft' 1in''School: Isaac Dickson"                                                                                         
## [20] "Sex: FAge: 19Height: 5' 5\"School: UGA"                                                                                                      
## [21] "Sex: FAge: 24Height: 5ft' 1in''School: UNCA"                                                                                                 
## [22] "Sex: MAge: 20Height:5 ft' 7in''School: UNCA"                                                                                                 
## [23] "Sex: FAge: 19Height: 5 ft' 7 in''School: Augustana College-Illinois"                                                                         
## [24] "Sex: MAge: 21Height: 5ft' 11in''School: Furman University"                                                                                   
## [25] "Powered by Discourse, best viewed with JavaScript enabled"
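
Since the replies follow a loose template, a regular expression can pull the fields out automatically. What follows is only a minimal sketch and not the approach these notes rely on: the `pattern` and the names `raw`, `fields`, and `parsed` are assumptions made for illustration, and `raw` holds just a handful of the strings copied from the output above.

```r
# A few of the scraped strings, copied from the output above.
# The template line and the page footer are included on purpose,
# to show that they are filtered out.
raw <- c(
  "Sex: M/FAge: nHeight: ft' in''School: name",
  "Sex: FAge: 22Height: 5 ft' 0 in''School: UNCA",
  "Sex: MAge: 25Height: 5' 10\"School: University of North Carolina Asheville",
  "Sex: MaleAge: 22Height: 6'0School: UNCA",
  "Sex: MAge: 20Height:5 ft' 7in''School: UNCA",
  "Powered by Discourse, best viewed with JavaScript enabled"
)

# Assumed pattern: first letter of the sex, the age, feet, inches, and school.
pattern <- paste0(
  "^Sex:\\s*([MF])[a-z]*",
  "Age:\\s*(\\d+)",
  "Height:\\s*(\\d+)\\D+?(\\d+)\\D*",
  "School:\\s*(.*)$"
)

# regexec() returns the full match plus the five captured groups;
# strings that do not match yield a zero-length result and are dropped.
matches <- regmatches(raw, regexec(pattern, raw, perl = TRUE))
matches <- matches[lengths(matches) == 6]
fields <- do.call(rbind, matches)

parsed <- data.frame(
  sex = fields[, 2],
  age = as.numeric(fields[, 3]),
  height = as.numeric(fields[, 4]) + as.numeric(fields[, 5]) / 12,  # feet
  school = fields[, 6],
  stringsAsFactors = FALSE
)
parsed
```

Even with this sketch, the school names would still need cleaning by hand, since respondents spell the same school in different ways.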

While we could certainly automate more of this processing, it’s easy enough to use this output to create the following data frame:

class_data <- data.frame(
  sex = c('F','M','M','F','F','F','M','F','M','M','F','M','M','F','F','F','M','F','M'),
  age = c(22,25,40,22,29,22,21,26,32,20,22,22,19,7,19,24,20,19,21),
  height = c(
    5, 5+10/12, 6+5/12, 5+4/12, 5+1/12, 5+10/12, 5+7/12, 5+6/12, 5+9/12,
    5+7/12, 5+1/12, 6, 6+2/12, 4+1/12, 5+5/12, 5+1/12, 5+7/12, 5+7/12,5+11/12
  ),
  school = c(
    "UNCA","UNCA", "UNCA/NCSU", "UNCA", "UNCA", "UNCA", "UNCA", 
    "UNCA","UNCA", "UNCA", "UNCA", "UNCA", "UNC Chapel Hill", 
    "Isaac Dickson", "UGA","UNCA", "UNCA", "Augustana College-Illinois", "Furman"
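  ),
  # In R >= 4.0, strings are no longer converted to factors by default;
  # this keeps the factor-style counts shown in the summary below.
  stringsAsFactors = TRUE
)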
summary(class_data)
##  sex         age            height                             school  
##  F:10   Min.   : 7.00   Min.   :4.083   Augustana College-Illinois: 1  
##  M: 9   1st Qu.:20.00   1st Qu.:5.208   Furman                    : 1  
##         Median :22.00   Median :5.583   Isaac Dickson             : 1  
##         Mean   :22.74   Mean   :5.518   UGA                       : 1  
##         3rd Qu.:24.50   3rd Qu.:5.833   UNC Chapel Hill           : 1  
##         Max.   :40.00   Max.   :6.417   UNCA                      :13  
##                                         UNCA/NCSU                 : 1

Note: This summary table is very much like the one we see in Table 1.2 for the stent study in our textbook.


Let’s examine the heights of students in the class.

heights <- class_data$height
mean(heights)
## [1] 5.517544
hist(heights)

Here’s a bar chart comparing the average heights of men and women.

men <- class_data[class_data$sex == "M",]
women <- class_data[class_data$sex == "F",]
m_heights <- men$height
w_heights <- women$height
barplot(c(mean(m_heights), mean(w_heights)),
  names.arg = c("M", "W")
)

Note: Bar charts and histograms look somewhat similar, but there is an important difference. A bar chart compares a quantity across categories, while a histogram shows how a single numerical variable is distributed.
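
To make the categorical side of that distinction concrete, here is a minimal sketch that computes the mean height for each sex in one step with `tapply` and then plots the result. The `sex` and `height` vectors are copied from the data frame above so that the block runs on its own; the name `group_means` is just an illustrative choice.

```r
# Values copied from class_data above
sex <- c('F','M','M','F','F','F','M','F','M','M','F','M','M','F','F','F','M','F','M')
height <- c(
  5, 5+10/12, 6+5/12, 5+4/12, 5+1/12, 5+10/12, 5+7/12, 5+6/12, 5+9/12,
  5+7/12, 5+1/12, 6, 6+2/12, 4+1/12, 5+5/12, 5+1/12, 5+7/12, 5+7/12, 5+11/12
)

# One mean per category; the result is named by category ("F", "M")
group_means <- tapply(height, sex, mean)
group_means

# One bar per category, in the same order as group_means
barplot(group_means, names.arg = names(group_means), ylab = "Mean height (ft)")
```

This should give the same bar heights as the `men`/`women` approach above, with the bars ordered alphabetically by category rather than men first.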


Finally, here’s a pie chart showing the ratio of men to women in the class.

total <- length(class_data$age)
n_men <- length(men$age)
n_women <- length(women$age)
pie(c(n_men/total, n_women/total), labels = c("m","w"))

Note: Pie charts are generally considered to be fairly poor tools for data visualization. They should only be used when comparing two proportions.
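
As a possible alternative to the pie chart, the same proportions can be computed with `table` and `prop.table` and shown as bars. This is only a sketch; the `sex` vector is copied from the data frame above, and the names `counts` and `proportions` are illustrative.

```r
# Values copied from class_data above
sex <- c('F','M','M','F','F','F','M','F','M','M','F','M','M','F','F','F','M','F','M')

counts <- table(sex)                # number of students in each category
proportions <- prop.table(counts)   # counts divided by the total
proportions

# Bars on a common 0-1 scale are often easier to compare than pie slices
barplot(proportions, ylim = c(0, 1), ylab = "Proportion of class")
```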




The data you actually work with in this class will typically be handed to you in a CSV file, which is easy to read into R with one command. This is great because data scraping can get quite involved. For example, I scrape a season’s worth of basketball scores to get the data for my March Madness predictions.
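
Reading a CSV file typically comes down to a single call to `read.csv`. The sketch below first writes a tiny file to a temporary location so that it can run on its own; the file, its three rows (rounded from the first few class responses), and the name `class_csv` are all made up for illustration and are not the files used in class.

```r
# Create a small CSV file in a temporary location (illustrative data only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "sex,age,height",
  "F,22,5.00",
  "M,25,5.83",
  "M,40,6.42"
), csv_path)

# The one command: read the file into a data frame
class_csv <- read.csv(csv_path)

str(class_csv)          # column names and types
mean(class_csv$height)  # columns are used just like those of class_data
```

In practice you would replace `csv_path` with the path or URL of the file you are given.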