So far, we’ve covered sections 1.1-1.4 of the text mainly discussing the language of data including:
Today, we’ll finish up our overview of the language of data with a discussion of experiments (section 1.5) and we’ll forge ahead with more details on the examination of numerical data (section 1.6).
Suppose we want to explore the efficacy of a drug in preventing heart attacks. We might randomly select 432 patients on whom to perform an experiment.
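In an experiment of this kind, the selected patients are typically assigned at random to a treatment group or a control group. Here is a minimal sketch in R, assuming an even split of the 432 patients; the group labels, the even split, and the seed are illustrative assumptions, not details of any actual study.

```r
# Hypothetical random assignment of 432 patients to two equal groups.
set.seed(3)                     # arbitrary seed, only for reproducibility
n_patients <- 432
# Build 216 labels of each kind, then shuffle them.
groups <- sample(rep(c("treatment", "control"), each = n_patients / 2))
table(groups)                   # 216 patients in each group
```

Shuffling a fixed set of labels guarantees equal group sizes, whereas drawing each label independently would not.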
Suppose we have a list of numerical data; we’ll denote it by \[x_1, x_2, x_3, \ldots, x_n.\] For example, our list might be \[2, 8, 2, 4, 7.\] The mean of the list is \[\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}.\] For our concrete example, this is \[\frac{2+8+2+4+7}{5} = \frac{23}{5} = 4.6.\] The median is the middle value when the list is sorted. For our example, the sorted list is \[2, 2, 4, 7, 8\] so the median is \(4\). If the sorted list has an even number of observations, then the median is the average of the two middle terms. For example, the median of \[1,1,3,4,8,8,8,10\] is the average of \(4\) and \(8\), which is \(6\).
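These hand computations can be checked in R with the built-in `mean` and `median` functions. The variable names `x` and `y` are just labels for the two example lists.

```r
# The five-element example: odd length, so the median is the middle value.
x <- c(2, 8, 2, 4, 7)
mean(x)             # 4.6
median(x)           # 4

# The eight-element example: even length, so the median averages
# the two middle values of the sorted list.
y <- c(1, 1, 3, 4, 8, 8, 8, 10)
sort(y)[c(4, 5)]    # the two middle values: 4 and 8
median(y)           # 6
```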
Suppose our data is \[1,2,4,5,5,6,7,9,10.\] The \(25^{\text{th}}\) percentile is 4, the \(75^{\text{th}}\) percentile is 7 and the inter-quartile range is 3.
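The same values come out of R's `quantile` and `IQR` functions. One caveat: there are several conventions for computing percentiles, and R exposes them through the `type` argument of `quantile`. The sketch below uses the default, which agrees with the numbers above for this data set; other conventions can give slightly different answers on other data.

```r
x <- c(1, 2, 4, 5, 5, 6, 7, 9, 10)
quantile(x, c(0.25, 0.75))   # 25th percentile is 4, 75th percentile is 7
IQR(x)                       # 7 - 4 = 3
```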
Suppose our sample is \[1,2,3,4.\] Then, the mean is \(2.5\) and the variance is \[s^2=\frac{(-3/2)^2 + (-1/2)^2 + (1/2)^2 + (3/2)^2}{3} = \frac{5}{3}.\] The standard deviation is \[s = \sqrt{5/3} \approx 1.290994.\]
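To see where these numbers come from, the sample variance can be built up step by step in R and compared with the built-in `var` and `sd` functions. The names `deviations` and `s2` are chosen here for illustration.

```r
x <- c(1, 2, 3, 4)
deviations <- x - mean(x)                   # -1.5 -0.5  0.5  1.5
# Sample variance: sum of squared deviations divided by n - 1.
s2 <- sum(deviations^2) / (length(x) - 1)
s2                                          # 5/3, about 1.666667
var(x)                                      # built-in, same value
sqrt(s2)                                    # about 1.290994
sd(x)                                       # built-in, same value
```

Note the divisor is \(n-1 = 3\), not \(n = 4\); `var` and `sd` use the same convention.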
All these statistical measures can easily be computed with R. Let’s generate a list of numbers chosen randomly between 1 and 20.
set.seed(2)
mylist = sample(1:20, 10, replace=TRUE)
mylist
## [1] 4 15 12 4 19 19 3 17 10 11
mean(mylist)
## [1] 11.4
sd(mylist)
## [1] 6.168018
quantile(mylist)
## 0% 25% 50% 75% 100%
## 3.0 5.5 11.5 16.5 19.0
Let’s generate some normally distributed, random data:
set.seed(1)
mylist = rnorm(100)
quantile(mylist)
## 0% 25% 50% 75% 100%
## -2.2146999 -0.4942425 0.1139092 0.6915454 2.4016178
Now, we’ll generate a histogram:
hist(mylist)
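A histogram is just a picture of bin counts, and those counts can be inspected directly. The sketch below regenerates the same data so that it stands alone, then asks `hist` to compute the bins without drawing anything; the name `h` is an arbitrary choice.

```r
set.seed(1)
mylist <- rnorm(100)
h <- hist(mylist, plot = FALSE)   # compute the bins without drawing
h$breaks                          # bin edges chosen by R
h$counts                          # number of observations in each bin
sum(h$counts)                     # 100, every observation lands in a bin
```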
Also, dot and box/whisker plots:
par(mfrow=c(2,1))
stripchart(mylist, frame=FALSE, yaxt='n', ylab='',xlab='')
boxplot(mylist, horizontal=TRUE, frame=FALSE, yaxt='n', ylab='',xlab='')
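The numbers behind a box plot can be recovered with `boxplot.stats`, which returns the five values that define the box and whiskers along with any points drawn individually. This sketch regenerates the data so that it stands alone; the name `b` is arbitrary.

```r
set.seed(1)
mylist <- rnorm(100)
b <- boxplot.stats(mylist)   # default whisker rule, coef = 1.5
b$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
b$out     # any points beyond the whiskers, plotted individually
```

The hinges are close to, but not always identical to, the 25th and 75th percentiles reported by `quantile`.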
Finally, let’s do it with some real data:
par(mfrow=c(3,1))
stripchart(state.area, frame=FALSE, yaxt='n', ylab='',xlab='')
boxplot(state.area, horizontal=TRUE, frame=FALSE, yaxt='n', ylab='',xlab='')
boxplot(state.area, range=0, horizontal=TRUE, frame=FALSE, yaxt='n', ylab='',xlab='')
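The two box plots differ only in the `range` argument, which controls how far the whiskers may extend from the box. With the default of 1.5, points more than 1.5 times the box length beyond the box are drawn individually; with `range=0`, the whiskers reach the extremes of the data. A rough way to see which states are affected is to use `boxplot.stats`, whose `coef` argument plays the same role; the variable names below are illustrative.

```r
# state.area and state.name are built-in data sets listed in the same order.
default_rule <- boxplot.stats(state.area)             # coef = 1.5
full_whiskers <- boxplot.stats(state.area, coef = 0)  # whiskers to the extremes

# States drawn as individual points under the default rule.
state.name[state.area %in% default_rule$out]

# With coef = 0 there are no such points, and the whisker ends
# coincide with the smallest and largest areas.
length(full_whiskers$out)
full_whiskers$stats[c(1, 5)]
range(state.area)
```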