Recap

So far, we’ve covered sections 1.1-1.4 of the text mainly discussing the language of data including:

Today, we’ll finish up our overview of the language of data with a discussion of experiments (section 1.5) and we’ll forge ahead with more details on the examination of numerical data (section 1.6).

Experiments

Fundamental principles

  • Controls
  • Randomization
  • Replication
  • Blocking

Example

Suppose we want to explore the efficacy of a drug in preventing heart attacks. We might randomly select 432 patients on which to perform an experiment.

  • Control: We split the group of patients into two groups:
    • A treatment group that receives the experimental drug
    • A control group that doesn’t receive the drug; they might receive a placebo.
  • Randomization: The groups should be chosen randomly to prevent bias and to even out confounding factors.
  • Replication: The results should be reproducible. A larger sample, or a repeat of the whole study, gives a more trustworthy estimate of the effect.
  • Blocking: We might break the control and treatment groups into smaller groups or blocks.
    • Reduces variability in the groups
    • Allows us to identify confounding factors
    • Example: We might block by age or degree of risk.
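The random assignment described above can be sketched in R. This is only a minimal illustration: it assumes the 432 patients are simply numbered, that the two groups are the same size, and that we block on a made-up two-level `risk` label. None of these details come from an actual study.

```r
set.seed(3)  # arbitrary seed so the assignment is repeatable

n <- 432
patients <- data.frame(id = 1:n)

# Hypothetical blocking variable: label each patient low or high risk.
patients$risk <- sample(c("low", "high"), n, replace = TRUE)

# Simple randomization: shuffle an equal number of each group label.
patients$group <- sample(rep(c("treatment", "control"), each = n / 2))

# Blocked randomization: shuffle the labels separately within each risk
# block, so both groups get a near-equal share of each risk level.
patients$blocked_group <- NA_character_
for (b in unique(patients$risk)) {
  idx <- which(patients$risk == b)
  labels <- rep(c("treatment", "control"), length.out = length(idx))
  patients$blocked_group[idx] <- sample(labels)
}

table(patients$group)
table(patients$risk, patients$blocked_group)
```

The second table should show the treatment and control counts differing by at most one within each risk level, which is the point of blocking.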

Statistical measures of numerical data

The mean and median

Suppose we have a list of numerical data; we’ll denote it by \[x_1, x_2, x_3, \ldots, x_n.\] For example, our list might be \[2, 8, 2, 4, 7.\] The mean of the list is \[\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}.\] For our concrete example, this is \[\frac{2+8+2+4+7}{5} = \frac{23}{5} = 4.6.\]

The median is the middle value when the list is sorted. For our example, the sorted list is \[2, 2, 4, 7, 8\] so the median is \(4\). If the sorted list has an even number of observations, then the median is the average of the two middle terms. For example, the median of \[1,1,3,4,8,8,8,10\] is the average of \(4\) and \(8\), which is \(6\).
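As a quick check, the two example lists can be run through R's built-in `mean` and `median` functions. The variable names `x` and `y` are just labels chosen for this sketch.

```r
x <- c(2, 8, 2, 4, 7)

# The mean, computed straight from the formula and with the built-in.
sum(x) / length(x)  # 4.6
mean(x)             # 4.6

# The median sorts the list and picks the middle value.
sort(x)    # 2 2 4 7 8
median(x)  # 4

# With an even number of observations, median() averages the two
# middle values of the sorted list (here 4 and 8).
y <- c(1, 1, 3, 4, 8, 8, 8, 10)
median(y)  # 6
```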

Percentiles (also called quantiles)

  • The median is a special case of a percentile - 50% of the population lies below the median and 50% lies above.
  • Similarly, 25% of the population lies below the first quartile and 75% lies above.
  • Also, 75% of the population lies below the third quartile and 25% lies above.
  • The second quartile is just another name for the median.
  • The inter-quartile range is the difference between the third and first quartiles.

Example

Suppose our data is \[1,2,4,5,5,6,7,9,10.\] The \(25^{\text{th}}\) percentile is 4, the \(75^{\text{th}}\) percentile is 7 and the inter-quartile range is 3.
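These values can be checked in R with `quantile` and `IQR`. One caveat: there are several conventions for computing percentiles of a small sample, and R's default (its "type 7" rule) interpolates between data points, so other software or hand methods may give slightly different answers on other data. For this list the default reproduces the values above.

```r
x <- c(1, 2, 4, 5, 5, 6, 7, 9, 10)

# First and third quartiles (25th and 75th percentiles).
q <- quantile(x, probs = c(0.25, 0.75))
q

# The inter-quartile range is the third quartile minus the first.
q[["75%"]] - q[["25%"]]  # 3
IQR(x)                   # 3
```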

Variance and standard deviation

  • Percentiles form a measure of the spread of a population or sample related to the median of that population or sample.
  • The standard deviation forms a measure of the spread of a population or sample related to the mean of the population or sample.

Definitions

  • Roughly, the standard deviation measures how far the individuals deviate from the mean on average.
  • The variance is defined to be the square of the standard deviation. Thus, if the standard deviation is \(s\), then the variance is \(s^2\).
  • If we have a collection of observations on a population of \(n\) individuals \[x_1,x_2,x_3, \ldots, x_n,\] the variance is defined by \[s^2 = \frac{(x_1 - \bar{x})^2 + (x_2-\bar{x})^2 +\cdots+(x_n-\bar{x})^2}{n}.\]
  • If \(s^2\) is the variance, then \(s\) is the standard deviation.
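Here is a small sketch of the divide-by-\(n\) formula applied to the list \(2, 8, 2, 4, 7\) from earlier, treating that list as the whole population. The computation is written out by hand because R's built-in `var` and `sd` use a different denominator. The names `pop_var` and `pop_sd` are just labels for this sketch.

```r
x <- c(2, 8, 2, 4, 7)
n <- length(x)
xbar <- mean(x)  # 4.6

# Squared deviations from the mean, averaged over all n individuals.
pop_var <- sum((x - xbar)^2) / n
pop_var  # 6.24

# The standard deviation is the square root of the variance.
pop_sd <- sqrt(pop_var)
pop_sd
```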

Sample variance vs population variance

  • Variance for a sample of \(n\) observations is defined by \[s^2 = \frac{(x_1 - \bar{x})^2 + (x_2-\bar{x})^2 +\cdots+(x_n-\bar{x})^2}{n-1}.\]
  • The only difference is the \(n-1\) in the denominator.
  • This improves the way that sample variance approximates population variance which is, ultimately, the objective.
  • More often than not, we will be computing sample variance and the corresponding sample standard deviation.
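A small simulation can make the claim about \(n-1\) plausible. This is a sketch under made-up assumptions: the population is just the integers 1 to 20, and we repeatedly draw samples of size 5 with replacement. The seed, population, sample size, and number of repetitions are all arbitrary choices.

```r
set.seed(4)  # arbitrary seed

population <- 1:20  # hypothetical population
pop_var <- mean((population - mean(population))^2)  # true variance, 33.25

n <- 5         # size of each sample
reps <- 20000  # number of samples to draw

# Each column of `samples` is one random sample of size n.
samples <- replicate(reps, sample(population, n, replace = TRUE))

# Variance of each sample with n in the denominator ...
var_n <- apply(samples, 2, function(s) mean((s - mean(s))^2))
# ... and with n - 1 in the denominator (what var() computes).
var_n1 <- apply(samples, 2, var)

# Compare the long-run averages of the two estimates to the truth.
c(population = pop_var,
  divide_by_n = mean(var_n),
  divide_by_n_minus_1 = mean(var_n1))
```

On average, the divide-by-\(n\) version comes out noticeably too small, while the \(n-1\) version lands close to the population variance.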

Example

Suppose our sample is \[1,2,3,4.\] Then, the mean is \(2.5\) and the variance is \[s^2=\frac{(-3/2)^2 + (-1/2)^2 + (1/2)^2 + (3/2)^2}{3} = \frac{5}{3}.\] The standard deviation is \[s = \sqrt{5/3} \approx 1.290994.\]
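The same computation can be checked in R, whose built-in `var` and `sd` functions use the \(n-1\) denominator. The by-hand line is included only to show that it agrees with the built-in.

```r
x <- c(1, 2, 3, 4)

mean(x)  # 2.5

# Sample variance by hand: squared deviations divided by n - 1.
sum((x - mean(x))^2) / (length(x) - 1)  # 5/3

# The built-ins give the same variance and its square root.
var(x)
sd(x)  # about 1.290994
```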

R commands

All these statistical measures can easily be computed with R. Let’s generate a list of numbers chosen randomly between 1 and 20.

set.seed(2)
mylist = sample(1:20, 10, replace=TRUE)
mylist
##  [1]  4 15 12  4 19 19  3 17 10 11
mean(mylist)
## [1] 11.4
sd(mylist)
## [1] 6.168018
quantile(mylist)
##   0%  25%  50%  75% 100% 
##  3.0  5.5 11.5 16.5 19.0

Dot and Box plots

Let’s generate some normally distributed, random data:

set.seed(1)
mylist = rnorm(100)
quantile(mylist)
##         0%        25%        50%        75%       100% 
## -2.2146999 -0.4942425  0.1139092  0.6915454  2.4016178

Now, we’ll generate a histogram:

hist(mylist)

Also, dot and box/whisker plots:

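# Stack two plots vertically in one figure.
par(mfrow=c(2,1))
# Dot plot on top, horizontal box and whisker plot below.
stripchart(mylist, frame=FALSE, yaxt='n', ylab='',xlab='')
boxplot(mylist, horizontal=TRUE, frame=FALSE, yaxt='n', ylab='',xlab='')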

Finally, let’s do it with some real data:

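# Stack three plots vertically; state.area is a built-in data set.
par(mfrow=c(3,1))
stripchart(state.area, frame=FALSE, yaxt='n', ylab='',xlab='')
# Default box plot: points far from the box are drawn individually.
boxplot(state.area, horizontal=TRUE, frame=FALSE, yaxt='n', ylab='',xlab='')
# range=0 extends the whiskers all the way to the minimum and maximum.
boxplot(state.area, range=0, horizontal=TRUE, frame=FALSE, yaxt='n', ylab='',xlab='')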