At this point, we’ve learned quite a few statistical tests. Here’s a recap:
All of the tests have a few things in common.
Perhaps the most obvious difference centers on the type of data being considered: numerical vs categorical.
There are other differences too, though, and understanding these helps you know which test to apply in a given situation.
The \(z\)-tests are the most basic and first hypothesis tests we meet.
The \(z\)-test for means deals with numerical data. In the simplest case, we have one data sample - just a list of numbers.
The question is - does that data support the hypothesis that the mean of the population from which it was drawn is some particular number? If our data have sample mean \(\bar{x}\) and we suspect the population mean is \(\mu_0\), then our two-sided hypothesis can be written
\[\begin{align} H_0 : \mu = \mu_0 \\ H_A : \mu \neq \mu_0 \end{align}\]
A one sided hypothesis can be written with a greater or less, rather than a not equal.
The \(z\)-score for our mean is \[Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}.\] We then compare this against the standard normal distribution to compute the \(p\)-value.
There are examples in our notes from 6/19.
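As a quick illustration, the computation can be carried out directly in Python with scipy; the numbers here are made up for the sake of the example:

```python
# One-sample z-test for a mean - a minimal sketch with hypothetical numbers.
from math import sqrt
from scipy.stats import norm

xbar, mu0 = 52.1, 50.0   # hypothetical sample mean and hypothesized population mean
sigma, n = 8.0, 100      # hypothetical (known) population sd and sample size

z = (xbar - mu0) / (sigma / sqrt(n))   # the z-score from the formula above
p_two_sided = 2 * norm.sf(abs(z))      # two-sided p-value from the standard normal
print(z, p_two_sided)
```

For a one-sided alternative, we would use `norm.sf(z)` (or `norm.cdf(z)`) without the factor of 2.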
This is very much like the \(z\)-test for means, but we are dealing with proportions of categorical data. We often think of this in terms of a random variable \(X\) that is binomially distributed; thus, we need the parameters of the binomial distribution after dividing through by \(n\):
\[\begin{align} \mu &= p &\sigma^2 &= p(1-p)/n &\sigma &= \sqrt{p(1-p)/n} \end{align}\]Our hypothesis can be written
\[\begin{align} H_0 : p=p_0 \\ H_A : p \neq p_0 \end{align}\]There’s an example at the end of our notes from 6/19.
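Here is a minimal sketch of the computation in Python, with made-up counts (112 successes out of 200, testing against \(p_0 = 0.5\)):

```python
# One-sample z-test for a proportion - hypothetical numbers for illustration.
from math import sqrt
from scipy.stats import norm

p_hat, p0, n = 112 / 200, 0.5, 200     # hypothetical sample proportion and null value

se = sqrt(p0 * (1 - p0) / n)           # standard error computed under H0
z = (p_hat - p0) / se
p_value = 2 * norm.sf(abs(z))          # two-sided p-value
print(z, p_value)
```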
The \(t\)-tests are very much like the \(z\)-tests. The primary difference is that the \(t\)-test is applicable to smaller data sets.
In the simplest case, we again have one data sample, which is just a list of numbers.
Again, this data set can be relatively small - fewer than 30 observations, say.
We use a different distribution - the \(t\)-distribution. There is a whole family of these, though - indexed by the degrees of freedom parameter, which is just the sample size minus one.
In addition, it’s more important that the population be normally distributed.
There are examples illustrating the \(t\)-test for one sample mean in our notes from 6/22.
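In Python, scipy handles the degrees of freedom and the \(t\)-distribution for us; the small data set below is made up for illustration:

```python
# One-sample t-test on a small, hypothetical data set.
from scipy import stats

data = [4.8, 5.1, 5.6, 4.9, 5.3, 5.0, 5.4]   # n = 7, so 6 degrees of freedom

# Test H0: mu = 5.0 against the two-sided alternative.
t_stat, p_value = stats.ttest_1samp(data, popmean=5.0)
print(t_stat, p_value)
```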
We use this when we have two data sets that are paired in a natural way; that is, each data point in one set corresponds to a particular data point in the other set.
Such paired data can be reduced to a single data set by simply subtracting pair-wise.
Our hypothesis test looks like
\[ \begin{array}{ll} H_0: & \mu_1 = \mu_2 \\ H_A: & \mu_1 \neq \mu_2 \\ \end{array} \]
There are some examples of this in our notes from 6/27.
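A sketch in Python, with hypothetical before/after measurements, showing that the paired test is the same as a one-sample test on the pair-wise differences:

```python
# Paired t-test - hypothetical before/after data.
from scipy import stats

before = [12.1, 11.8, 13.0, 12.4, 11.9, 12.7]
after  = [11.5, 11.6, 12.2, 12.0, 11.4, 12.1]

# The built-in paired test...
t_rel, p_rel = stats.ttest_rel(before, after)

# ...agrees with a one-sample t-test on the differences against mu = 0.
diffs = [b - a for b, a in zip(before, after)]
t_one, p_one = stats.ttest_1samp(diffs, popmean=0.0)
print(t_rel, p_rel)
```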
We use this when we have two data sets that are independent of one another.
If the sets have sizes \(n_1\) and \(n_2\), we analyze the difference of the two means using a \(t\)-test; the appropriate degrees of freedom depend on \(n_1\) and \(n_2\).
Our hypothesis test again looks like
\[ \begin{array}{ll} H_0: & \mu_1 = \mu_2 \\ H_A: & \mu_1 \neq \mu_2 \end{array} \]
There are some examples of this in our notes from 6/27.
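In Python this is `ttest_ind`; note that the two groups need not have the same size. The data below are made up for illustration:

```python
# Two-sample t-test for independent groups - hypothetical data.
from scipy import stats

group1 = [23.1, 22.5, 24.0, 23.6, 22.9]
group2 = [21.8, 22.2, 21.5, 22.7, 21.9, 22.0]

# By default scipy pools the variances (equal_var=True).
t_stat, p_value = stats.ttest_ind(group1, group2)
print(t_stat, p_value)
```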
ANOVA, or Analysis of Variance, is used to compare some statistic (just the mean for us, though proportion is natural too) across several groups.
Our hypothesis test looks like
\[ \begin{array}{ll} H_0: & \mu_1 = \mu_2 = \cdots = \mu_k \\ H_A: & \text{at least two } \mu_i\text{s are different} \end{array} \]
The mathematics lurking in the background is based on a new distribution called the \(F\)-distribution.
This is more complicated than the distributions we’ve seen to this point, so we rely on software to work with it.
There are some examples in our notes of 7/5.
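Since we lean on software here anyway, a one-line call in Python does the whole computation; the three groups below are hypothetical:

```python
# One-way ANOVA comparing means across three hypothetical groups.
from scipy import stats

a = [5.1, 4.9, 5.4, 5.0]
b = [5.6, 5.8, 5.5, 5.9]
c = [4.7, 4.6, 4.9, 4.8]

# f_oneway returns the F statistic and the p-value from the F-distribution.
f_stat, p_value = stats.f_oneway(a, b, c)
print(f_stat, p_value)
```

A small \(p\)-value indicates that at least two of the group means differ; ANOVA by itself does not say which ones.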
The chi-square test is a method for assessing a model when the data are binned.
In this situation, we have two lists of counts: the observed counts \(O_1, O_2, \ldots, O_k\) and the expected counts \(E_1, E_2, \ldots, E_k\) predicted by the model.
Our hypothesis test looks like
\[ \begin{array}{ll} H_0: & \text{the data follow the model} \\ H_A: & \text{the data do not follow the model} \end{array} \]
We then compute the \(\chi^2\) statistic \[\chi^2 = \frac{(O_1 - E_1)^2}{E_1} + \frac{(O_2 - E_2)^2}{E_2} + \cdots + \frac{(O_k - E_k)^2}{E_k}\] and use the \(\chi^2\) distribution with \(k-1\) degrees of freedom.
There are some examples in our notes of [7/7](07.07.17.html).
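As a quick sketch in Python, suppose we roll a die 150 times and want to test whether it is fair; the observed counts here are made up:

```python
# Chi-square goodness-of-fit test - hypothetical die-roll counts.
from scipy import stats

observed = [22, 18, 20, 30, 25, 35]   # hypothetical counts for faces 1-6, n = 150
expected = [25] * 6                   # a fair die predicts 150/6 = 25 per face

# chisquare computes the statistic above and compares it to the
# chi-square distribution with k - 1 = 5 degrees of freedom.
chi2, p_value = stats.chisquare(observed, f_exp=expected)
print(chi2, p_value)
```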
Linear regression is a topic that spans much more than just hypothesis testing. There is an important hypothesis test that arises from linear regression, though.
In linear regression, we have two data samples \(x_1,\ldots,x_k\) and \(y_1,\ldots,y_k\). The question is - are they related?
The hypothesis statement looks like:
\[ \begin{array}{ll} H_0: & \text{there is no linear relationship between } x \text{ and } y \\ H_A: & \text{there is a linear relationship} \end{array} \]
This can be stated in terms of the slope \(\beta\) of the regression line:
\[ \begin{array}{ll} H_0: & \beta = 0 \\ H_A: & \beta \neq 0 \end{array} \]
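In Python, `linregress` fits the line and reports the \(p\)-value for the slope test in one call; the data below are made up so that \(y\) is roughly \(2x\):

```python
# Linear regression with a test of H0: slope = 0 - hypothetical data.
from scipy import stats

x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]   # roughly y = 2x, by construction

# linregress returns the fitted slope and intercept along with the
# two-sided p-value for testing whether the true slope is zero.
result = stats.linregress(x, y)
print(result.slope, result.pvalue)
```

Here the small \(p\)-value reflects the strong linear relationship we built into the data.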