Recall that we have recently been discussing relationships between variables. For example, linear regression examines the relationship between two numerical variables. Similarly, the $\chi^2$-test examines the relationship between categorical variables.
As we'll see, there are two somewhat different types of $\chi^2$-tests. Specifically, there's

- the $\chi^2$-test for goodness of fit, and
- the $\chi^2$-test for independence.

## scipy.stats

The two types of $\chi^2$-tests each have their own function in the `scipy.stats` module. There's:

- `chisquare` for goodness of fit, and
- `chi2_contingency` for independence.

Let's go ahead and import those now:

```python
from scipy.stats import chisquare, chi2_contingency
```
Here's a potentially important question: Is a given pool of potential jurors in a county racially representative of that county?
Here's some specific data representing 275 jurors in a small county. Jurors identified their racial group, as shown in the table below. We would like to determine if these jurors are racially representative of the population.
| Race | White | Black | Hispanic | Other | Total |
|---|---|---|---|---|---|
| Representation in juries | 205 | 26 | 25 | 19 | 275 |
| Percentages for registered voters | 0.72 | 0.07 | 0.12 | 0.09 | 1.00 |
| Expected count | 198 | 19.25 | 33 | 24.75 | 275 |
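The expected counts in the last row come from scaling the registered-voter percentages by the total of 275 jurors. Here's a minimal check in Python (the variable names are just for illustration):

```python
# Expected counts under the null hypothesis:
# voter proportion times the total number of jurors
total = 275
proportions = [0.72, 0.07, 0.12, 0.09]
expected = [p * total for p in proportions]
expected  # [198.0, 19.25, 33.0, 24.75]
```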
## chisquare

The `chisquare` function is built for exactly this situation, and it's pretty easy to use:
```python
chisquare([205, 26, 25, 19], f_exp=[198.0, 19.25, 33.0, 24.75])
```
There's a lot going on in the background here but, ultimately, we are interested in that $p$-value, which is about $0.117$. If we are looking for a 95% confidence level, then we are unable to reject the null hypothesis, in spite of the deviation from the expected counts that we see in the data.
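If we want the $p$-value by itself, `chisquare` returns the statistic and the $p$-value as a pair, so we can simply unpack the result:

```python
# chisquare returns (statistic, p-value); unpack to grab the p-value alone
stat, p = chisquare([205, 26, 25, 19], f_exp=[198.0, 19.25, 33.0, 24.75])
p
```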
The $p$-value is computed using the $\chi^2$ statistic, which we find as follows:
Suppose we want to evaluate whether there is convincing evidence that a set of observed counts $O_1$, $O_2$, ..., $O_k$ in $k$ categories is unusually different from what might be expected under a null hypothesis. Call the *expected counts* that are based on the null hypothesis $E_1$, $E_2$, ..., $E_k$. If each expected count is at least 5 and the null hypothesis is true, then the test statistic below follows a $\chi^2$-distribution with $k-1$ degrees of freedom:

$$ \chi^2 = \frac{(O_1 - E_1)^2}{E_1} + \frac{(O_2 - E_2)^2}{E_2} + \cdots + \frac{(O_k - E_k)^2}{E_k} $$

We then evaluate the area under the right tail of the $\chi^2$-distribution with $k-1$ degrees of freedom. In the example above, the $\chi^2$-statistic is
```python
# Compute the chi-square statistic term by term
ch_sq = ((205 - 198)**2 / 198 + (26 - 19.25)**2 / 19.25
         + (25 - 33)**2 / 33 + (19 - 24.75)**2 / 24.75)
ch_sq
```
Geometrically, the $p$-value represents the area under the curve below and to the right of the statistic, approximately $5.89$:
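Here's a minimal sketch of that picture, using matplotlib and the `chi2` distribution from `scipy.stats` (assuming matplotlib is available; the styling is just one way to draw it):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2

df = 3            # k - 1 degrees of freedom
stat = 5.89       # the chi-square statistic (ch_sq) computed above

# Draw the chi-square density
x = np.linspace(0, 15, 300)
plt.plot(x, chi2.pdf(x, df))

# Shade the right tail beyond the observed statistic
tail = np.linspace(stat, 15, 200)
plt.fill_between(tail, chi2.pdf(tail, df), alpha=0.4)
plt.show()

# The shaded area is the p-value
chi2.sf(stat, df)
```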
Note that we can also use a table like the one in the back of our text to assess the null hypothesis: with three degrees of freedom, the critical value at the $0.05$ level is about $7.81$, and our statistic of roughly $5.89$ falls short of that.
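We can recover that same critical value from `scipy.stats` rather than a printed table:

```python
from scipy.stats import chi2

# Critical value for significance level 0.05 with 3 degrees of freedom;
# the statistic must exceed this (about 7.81) to reject the null hypothesis
chi2.ppf(0.95, df=3)
```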
Sometimes, we have two categorical variables and we want to know whether or not they are independent. This calls for the $\chi^2$-test for independence, which is computed just like the closely related $\chi^2$-test for homogeneity. One example from the R-Tutorial examines whether exercise and smoking are independent of one another using the following data:
| Smokes/Exercises | Frequently | Some | None |
|---|---|---|---|
| Never | 435 | 420 | 90 |
| Occasionally | 60 | 20 | 15 |
| Regularly | 45 | 35 | 5 |
| Heavily | 35 | 15 | 5 |
The rows indicate how much the participant smokes and the columns indicate how much they exercise. Our null hypothesis is that these variables are independent; our alternative hypothesis is that they are not.
We can enter a small table like this into Python and get the $p$-value with `chi2_contingency` as follows:
```python
A = [
    [435, 420, 90],
    [60, 20, 15],
    [45, 35, 5],
    [35, 15, 5],
]

# chi2_contingency returns (statistic, p-value, dof, expected); grab the p-value
chi2_contingency(A)[1]
```
The $p$-value is far below $0.05$, so we reject the null hypothesis of independence; smoking and exercise habits appear to be related.
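For the curious, `chi2_contingency` also reports the degrees of freedom and the expected counts, which it builds from the table's margins (each expected count is the row total times the column total, divided by the grand total). A quick look, unpacking all four return values:

```python
# Unpack the full result: statistic, p-value, degrees of freedom,
# and the table of expected counts under independence
stat, p, dof, expected = chi2_contingency(A)
stat, p, dof

# Each expected count is (row total * column total) / grand total
expected
```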