The chi-square test is a method for assessing a model when the data are binned. It is often used to test whether a sample is representative of a given population.
Here’s an important question taken from our text: Is a given pool of potential jurors in a county racially representative of that county?
Here are some specific data representing 275 jurors in a small county. Jurors identified their racial group, as shown in the table below. We would like to determine whether these jurors are racially representative of the population.
| Race | White | Black | Hispanic | Other | Total |
|---|---|---|---|---|---|
| Representation in juries | 205 | 26 | 25 | 19 | 275 |
| Percentages for registered voters | 0.72 | 0.07 | 0.12 | 0.09 | 1.00 |
| Expected count | 198 | 19.25 | 33 | 24.75 | 275 |
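The expected counts in the last row come from scaling the registered-voter percentages by the total pool size of 275; in R:

275 * c(0.72, 0.07, 0.12, 0.09)
## [1] 198.00 19.25 33.00 24.75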
chisq.test
R's `chisq.test` is built for exactly this situation, and it's pretty easy to use:
chisq.test(c(205, 26, 25, 19), p=c(0.72,0.07,0.12,0.09))
##
## Chi-squared test for given probabilities
##
## data: c(205, 26, 25, 19)
## X-squared = 5.8896, df = 3, p-value = 0.1171
There’s a lot going on in the background here but, ultimately, we are interested in that \(p\)-value. If we are testing at the 5% significance level (i.e., looking for 95% confidence), then we are unable to reject the null hypothesis here, since \(0.1171 > 0.05\), in spite of the deviation from the expected counts that we see in the data.
The \(p\)-value is computed using the \(\chi^2\) statistic, which we find as follows:
Suppose we wish to evaluate whether there is convincing evidence that a set of observed counts \(O_1\), \(O_2\), …, \(O_k\) in \(k\) categories is unusually different from what might be expected under a null hypothesis. Call the expected counts that are based on the null hypothesis \(E_1\), \(E_2\), …, \(E_k\). If each expected count is at least 5 and the null hypothesis is true, then the test statistic below follows a chi-square distribution with \(k-1\) degrees of freedom: \[ \chi^2 = \frac{(O_1 - E_1)^2}{E_1} + \frac{(O_2 - E_2)^2}{E_2} + \cdots + \frac{(O_k - E_k)^2}{E_k} \] We then evaluate the area under the right tail of the \(\chi^2\)-distribution with \(k-1\) degrees of freedom. In the example above, the \(\chi^2\)-statistic is
ch_sq = (205-198)^2/198 + (26-19.25)^2/19.25 + (25-33)^2/33 + (19-24.75)^2/24.75
ch_sq
## [1] 5.88961
and the probability can be computed via
1-pchisq(ch_sq,3)
## [1] 0.1171062
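For comparison, here is the same computation in vectorized form (the names `obs` and `expd` are our own choices):

obs = c(205, 26, 25, 19)
expd = 275 * c(0.72, 0.07, 0.12, 0.09)
sum((obs - expd)^2 / expd)
## [1] 5.88961
pchisq(sum((obs - expd)^2 / expd), df = 3, lower.tail = FALSE)
## [1] 0.1171062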
Geometrically, this represents the area under the curve below and to the right of 5.89:
# density of the chi-square distribution with 3 degrees of freedom
f = function(x) dchisq(x, 3)
plot(f, 0, 8)
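If we want to see the tail area itself, one option is to shade it with `polygon`; this is just a sketch, and the plotting grid is an arbitrary choice:

x = seq(0, 8, length.out = 200)
plot(x, dchisq(x, 3), type = "l", xlab = "x", ylab = "density")
xs = seq(ch_sq, 8, length.out = 100)
polygon(c(ch_sq, xs, 8), c(0, dchisq(xs, 3), 0), col = "gray")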
Note that we can use a table of critical values, like the one in the back of our text, to assess the null hypothesis.
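R can produce the same critical value that such a table provides. For a test at the 5% level with 3 degrees of freedom:

qchisq(0.95, 3)
## [1] 7.814728

Since our statistic 5.89 is less than 7.81, it doesn't reach the rejection region, which agrees with the \(p\)-value we computed.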
The previous example was one-way. That is, the percentages were known and we just wanted to know if the jury pool matched those percentages.
Sometimes, we have two categorical variables and we want to know if they are independent or not. Here’s an example from the R-Tutorial:
library(MASS)
tbl = table(survey$Smoke, survey$Exer)
tbl
##
## Freq None Some
## Heavy 7 1 3
## Never 87 18 84
## Occas 12 3 4
## Regul 9 1 7
The rows indicate how much the participant smokes and the columns indicate how much they exercise. Our null hypothesis is that these variables are independent; the alternative is that they are not. Let’s check:
chisq.test(tbl)
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 5.4885, df = 6, p-value = 0.4828
The warning (not quite an error) is due to the small counts in some cells of the table, but either way we certainly cannot reject the null hypothesis.
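In fact, `chisq.test` returns the expected counts in its result, so we can inspect exactly which cells are small:

chisq.test(tbl)$expected

The entire Heavy row is problematic; its None cell, for example, expects only \(11 \cdot 23 / 236 \approx 1.07\) observations, well below the usual threshold of 5.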
You might find (on WebWork, for example) that you need to construct this sort of table by hand. You can do this like so:
m = matrix(c(7,1,3, 87,18,84, 12,3,4, 9,1,7), 4, 3, byrow = TRUE)
tbl = as.table(m)
tbl
## A B C
## A 7 1 3
## B 87 18 84
## C 12 3 4
## D 9 1 7
Note that the row and column labels are different but that’s not important for the computation.
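If we do want meaningful labels, we can assign dimnames to the matrix before converting; the labels below simply mirror the survey table:

dimnames(m) = list(c("Heavy", "Never", "Occas", "Regul"),
                   c("Freq", "None", "Some"))
tbl = as.table(m)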