6/8/17

Recap

So far, we’ve covered sections 1.1-1.4 of the text, mainly discussing the language of data, including:

Data matrices
(Basic) summaries and visualization
Variables
- Numerical vs Categorical
- Explanatory vs Response vs Confounding
Observational data collection
- Populations
- Samples
- (Simple) random
- Convenience
- Stratified

Today, we’ll finish up our overview of the language of data with a discussion of experiments (section 1.5) and we’ll forge ahead with more details on the examination of numerical data (section 1.6).

Experiments

Fundamental principles

Controls
Randomization
Replication
Blocking

Example

Suppose we want to explore the efficacy of a drug in preventing heart attacks. We might randomly select 432 patients on which to perform an experiment.

Control: We split the group of patients into two groups:
- A treatment group that receives the experimental drug
- A control group that doesn’t receive the drug; they might receive a placebo.
Randomization: The groups should be chosen randomly to prevent bias and to even out confounding factors.
Replication: The results should be reproducible, which in practice means using enough patients and, ideally, repeating the study.
Blocking: We might break the control and treatment groups into smaller groups or blocks.
- Reduces variability in the groups
- Allows us to identify confounding factors
- Example: We might block by age or degree of risk.
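The randomization and blocking steps above can be sketched in R. This is only an illustration under stated assumptions: the patient IDs, the even 216/216 split, the seed, and the two-level risk variable are all made up for the example and are not part of the study described above.

```r
# Hypothetical patient IDs for the 432 patients (illustrative only).
set.seed(1)
patients <- 1:432

# Randomization: shuffle the IDs, then assign the first half to the
# treatment group and the second half to the control group.
shuffled  <- sample(patients)
treatment <- shuffled[1:216]
control   <- shuffled[217:432]

# Blocking: assume (hypothetically) that each patient has a known risk level.
risk <- rep(c("low", "high"), each = 216)

# Randomize separately within each block so that both groups
# end up with the same mix of low-risk and high-risk patients.
group <- character(432)
for (level in unique(risk)) {
  idx <- which(risk == level)
  labels <- rep(c("treatment", "control"), length.out = length(idx))
  group[idx] <- sample(labels)
}

# Cross-tabulate block against group to check the balance.
counts <- table(risk, group)
print(counts)
```

Because the labels are shuffled within each block, every cell of the table holds the same number of patients, no matter which seed is used.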

Statistical measures of numerical data

The mean and median

Suppose we have a list of numerical data; we’ll denote it by \[x_1, x_2, x_3, \ldots, x_n.\] For example, our list might be \[2, 8, 2, 4, 7.\] The mean of the list is \[\bar{x} = \frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}.\] For our concrete example, this is \[\frac{2+8+2+4+7}{5} = \frac{23}{5} = 4.6.\]

The median is the middle value when the list is sorted. For our example, the sorted list is \[2, 2, 4, 7, 8\] so the median is \(4\). If the sorted list has an even number of observations, then the median is the average of the two middle terms. For example, the median of \[1,1,3,4,8,8,8,10\] is the average of \(4\) and \(8\), which is \(6\).
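As a quick check of the arithmetic above, here is a small R sketch using the two example lists. The variable names `x`, `y`, and `by_hand` are just illustrative choices.

```r
# The five-element example list.
x <- c(2, 8, 2, 4, 7)
mean(x)    # 23/5 = 4.6
median(x)  # middle value of the sorted list 2, 2, 4, 7, 8

# The eight-element example list (an even number of observations).
y <- c(1, 1, 3, 4, 8, 8, 8, 10)
median(y)  # average of the two middle values, 4 and 8

# The same even-length median, computed by hand.
sorted  <- sort(y)
n       <- length(sorted)
by_hand <- (sorted[n / 2] + sorted[n / 2 + 1]) / 2
by_hand
```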

Percentiles (also called quantiles)

The median is a special case of a percentile - 50% of the population lies below the median and 50% lies above.
Similarly, 25% of the population lies below the first quartile and 75% lies above.
Also, 75% of the population lies below the third quartile and 25% lies above.
The second quartile is just another name for the median.
The inter-quartile range is the difference between the third and first quartile.

Example

Suppose our data is \[1,2,4,5,5,6,7,9,10.\] The \(25^{\text{th}}\) percentile is 4, the \(75^{\text{th}}\) percentile is 7 and the inter-quartile range is 3.
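The numbers in this example can be reproduced with R's `quantile` and `IQR` functions. Note that there are several conventions for computing percentiles of a small list; this sketch relies on R's default method, so other software or textbooks may give slightly different values on other data.

```r
x <- c(1, 2, 4, 5, 5, 6, 7, 9, 10)

# First and third quartiles using R's default quantile method.
q <- quantile(x, c(0.25, 0.75))
q

# The inter-quartile range is the third quartile minus the first.
iqr <- IQR(x)
iqr
```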

Variance and standard deviation

Percentiles form a measure of the spread of a population or sample related to the median of that population or sample.
The standard deviation forms a measure of the spread of a population or sample related to the mean of the population or sample.

Definitions

Roughly, the standard deviation measures how far the individuals deviate from the mean on average.
The variance is defined to be the square of the standard deviation. Thus, if the standard deviation is \(s\), then the variance is \(s^2\).
If we have a collection of observations on a population of \(n\) individuals \[x_1,x_2,x_3, \ldots, x_n,\] the variance is defined by \[s^2 = \frac{(x_1 - \bar{x})^2 + (x_2-\bar{x})^2 +\cdots+(x_n-\bar{x})^2}{n}.\]
If \(s^2\) is the variance, then \(s\) is the standard deviation.
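To make the definition concrete, here is a minimal R sketch that applies the formula with \(n\) in the denominator to the list \(2, 8, 2, 4, 7\) from the mean and median discussion. The formula is written out by hand because R's built-in `var` and `sd` use a different denominator; the names `pop_var` and `pop_sd` are illustrative.

```r
x    <- c(2, 8, 2, 4, 7)
n    <- length(x)
xbar <- mean(x)            # 4.6

# Deviation of each observation from the mean.
deviations <- x - xbar

# Variance with n in the denominator, then its square root.
pop_var <- sum(deviations^2) / n
pop_sd  <- sqrt(pop_var)

pop_var  # 31.2 / 5 = 6.24
pop_sd   # approximately 2.498
```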

Sample variance vs population variance

Variance for a sample of \(n\) observations is defined by \[s^2 = \frac{(x_1 - \bar{x})^2 + (x_2-\bar{x})^2 +\cdots+(x_n-\bar{x})^2}{n-1}.\]
The only difference is the \(n-1\) in the denominator.
This improves the way that sample variance approximates population variance which is, ultimately, the objective.
More often than not, we will be computing sample variance and the corresponding standard deviation.

Example

Suppose our sample is \[1,2,3,4.\] Then, the mean is \(2.5\) and the variance is \[s^2=\frac{(-3/2)^2 + (-1/2)^2 + (1/2)^2 + (3/2)^2}{3} = \frac{5}{3}.\] The standard deviation is \[s = \sqrt{5/3} \approx 1.290994.\]
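R's built-in `var` and `sd` functions use the \(n-1\) denominator, so they should agree with this example. The sketch below also rescales the result to show what the \(n\) denominator would give for the same four numbers; the names `s2`, `s`, and `pop_version` are illustrative.

```r
x <- c(1, 2, 3, 4)

# Sample variance and standard deviation (denominator n - 1).
s2 <- var(x)
s  <- sd(x)
s2  # 5/3
s   # approximately 1.290994

# For comparison: the same sum of squares divided by n instead of n - 1.
n <- length(x)
pop_version <- s2 * (n - 1) / n
pop_version  # 5/4 = 1.25
```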

R commands

All these statistical measures can easily be computed with R. Let’s generate a list of numbers chosen randomly between 1 and 20.

set.seed(2)
mylist = sample(1:20, 10, replace=TRUE)
mylist

##  [1]  4 15 12  4 19 19  3 17 10 11

mean(mylist)

## [1] 11.4

sd(mylist)

## [1] 6.168018

quantile(mylist)

##   0%  25%  50%  75% 100% 
##  3.0  5.5 11.5 16.5 19.0

Dot and Box plots

Let’s generate some normally distributed, random data:

set.seed(1)
mylist = rnorm(100)
quantile(mylist)

##         0%        25%        50%        75%       100% 
## -2.2146999 -0.4942425  0.1139092  0.6915454  2.4016178

Now, we’ll generate a histogram:

hist(mylist)

Also, dot and box/whisker plots:

par(mfrow=c(2,1))
stripchart(mylist, frame=FALSE, yaxt='n', ylab='',xlab='')
boxplot(mylist, horizontal=TRUE, frame=FALSE, yaxt='n', ylab='',xlab='')

Finally, let’s do it with some real data:

par(mfrow=c(3,1))
stripchart(state.area, frame=FALSE, yaxt='n', ylab='',xlab='')
boxplot(state.area, horizontal=TRUE, frame=FALSE, yaxt='n', ylab='',xlab='')
boxplot(state.area, range=0, horizontal=TRUE, frame=FALSE, yaxt='n', ylab='',xlab='')
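The last two box plots differ only in the `range` argument. With the default, the whiskers stop at the most extreme data point within 1.5 inter-quartile ranges of the box, and anything beyond is drawn as an individual point; with `range=0`, the whiskers extend to the data extremes. The sketch below uses `boxplot.stats`, whose `coef` argument plays the same role, to show the numbers behind the two plots without drawing anything.

```r
# Statistics behind the default box plot (coef = 1.5).
default_stats <- boxplot.stats(state.area)

# Statistics when the whiskers extend to the data extremes (coef = 0).
full_stats <- boxplot.stats(state.area, coef = 0)

# Areas that the default plot draws as individual points.
default_stats$out

# Whisker ends, box edges, and median when nothing is flagged as an outlier.
full_stats$stats
```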