Chi-Square Test Implementation in Python: A Comprehensive Guide

The chi-square (χ²) test is a powerful statistical tool used to determine if there's a significant association between two categorical variables. It assesses whether observed frequencies differ significantly from expected frequencies, under the assumption of independence. This guide will walk you through implementing the chi-square test in Python, covering various aspects from basic implementation to handling different scenarios and interpreting results.

Python offers several libraries to perform the chi-square test, primarily `scipy.stats`. This library provides functions that efficiently compute the test statistic, p-value, and degrees of freedom. Let's begin with a basic example of a chi-square test of independence.

Basic Chi-Square Test of Independence

Imagine we're studying the relationship between gender and preference for a particular type of coffee (e.g., black coffee vs. latte). We collect the following data:

| | Black Coffee | Latte | Total |
|---|---|---|---|
| Male | 30 | 20 | 50 |
| Female | 25 | 25 | 50 |
| Total | 55 | 45 | 100 |

To perform the chi-square test, we first need to calculate the expected frequencies under the assumption of independence. The expected frequency for each cell is calculated as:

Expected Frequency = (Row Total * Column Total) / Grand Total

For example, the expected frequency for males preferring black coffee is (50 * 55) / 100 = 27.5.
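The formula above can be applied to every cell at once. The following is a minimal sketch using NumPy; the variable names are illustrative and not part of any library API.

```python
import numpy as np

# Observed counts: rows are Male/Female, columns are Black Coffee/Latte
observed = np.array([[30, 20], [25, 25]])

row_totals = observed.sum(axis=1)   # [50, 50]
col_totals = observed.sum(axis=0)   # [55, 45]
grand_total = observed.sum()        # 100

# The outer product gives (row total * column total) for every cell
expected = np.outer(row_totals, col_totals) / grand_total
print(expected)
```

The top-left entry is the 27.5 computed by hand above; the remaining cells follow the same rule.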

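Here's how we can perform this test using `scipy.stats`:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts: rows are Male/Female, columns are Black Coffee/Latte
observed = np.array([[30, 20], [25, 25]])

# Note: for 2x2 tables, chi2_contingency applies Yates' continuity
# correction by default (correction=True)
chi2, p, dof, expected = chi2_contingency(observed)

print("Chi-square statistic:", chi2)
print("P-value:", p)
print("Degrees of freedom:", dof)
print("Expected frequencies:", expected)
```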

This code prints the chi-square statistic, the p-value, the degrees of freedom, and the expected frequencies. The p-value is the probability of observing a test statistic at least as extreme as the one computed, assuming there is no association between gender and coffee preference. A small p-value (typically less than 0.05) suggests a significant association.
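To make the link between the statistic and the p-value concrete, the sketch below computes both by hand for the coffee table and compares them with the library result. It assumes the uncorrected Pearson statistic, so `correction=False` is passed to `chi2_contingency`; by default that function applies Yates' continuity correction to 2x2 tables and would return a slightly different value.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

observed = np.array([[30, 20], [25, 25]])
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Pearson statistic: sum over cells of (observed - expected)^2 / expected
stat = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom for an r x c table: (r - 1) * (c - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Upper-tail probability of the chi-square distribution
p_value = chi2.sf(stat, dof)

# Library result without the continuity correction, for comparison
ref_stat, ref_p, ref_dof, _ = chi2_contingency(observed, correction=False)

print("Manual: ", stat, p_value, dof)
print("SciPy:  ", ref_stat, ref_p, ref_dof)
```

The two lines should agree, which shows that the p-value is simply the upper tail of the chi-square distribution beyond the observed statistic.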

Chi-Square Test of Goodness of Fit

The chi-square test can also be used to assess whether a sample distribution matches a hypothesized distribution (Goodness of Fit). For example, we might want to test if a die is fair by rolling it many times and comparing the observed frequencies of each outcome to the expected frequencies (1/6 of the total rolls for each face).

```python
import numpy as np
from scipy.stats import chisquare

# Observed frequencies for each face of a die (90 rolls in total)
observed = np.array([15, 18, 12, 17, 14, 14])

# Expected frequencies for a fair die: total rolls / 6 per face.
# The expected counts must sum to the same total as the observed counts.
expected = np.full(6, observed.sum() / 6)

chi2, p = chisquare(observed, f_exp=expected)
print("Chi-square statistic:", chi2)
print("P-value:", p)
```

This code performs a chi-square goodness-of-fit test. Again, a small p-value indicates a significant deviation from the expected distribution.
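As a cross-check, the goodness-of-fit statistic can also be computed by hand. The sketch below assumes the same six die counts (90 rolls, so 15 expected per face under fairness) and relies on the fact that `chisquare` treats all categories as equally likely when `f_exp` is omitted.

```python
import numpy as np
from scipy.stats import chi2, chisquare

observed = np.array([15, 18, 12, 17, 14, 14])
expected = np.full(observed.size, observed.sum() / observed.size)  # 15 per face

# Sum of (observed - expected)^2 / expected over the six faces
stat = ((observed - expected) ** 2 / expected).sum()

# Goodness of fit with k categories has k - 1 degrees of freedom
dof = observed.size - 1
p_value = chi2.sf(stat, dof)

# With f_exp omitted, chisquare assumes a uniform expected distribution
ref_stat, ref_p = chisquare(observed)

print("Manual:", stat, p_value)
print("SciPy: ", ref_stat, ref_p)
```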

Handling Small Expected Frequencies

The chi-square test assumes that the expected frequencies are sufficiently large. A common rule of thumb is that all expected frequencies should be at least 5. If this condition is not met, the chi-square approximation to the sampling distribution may not be accurate. In such cases, consider using Fisher's exact test, which is implemented in `scipy.stats.fisher_exact`.
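One way to apply this rule of thumb is sketched below. The helper `choose_test` and the small 2x2 table are hypothetical and only for illustration; the threshold of 5 is the rule of thumb from the text, not a hard requirement, and the fallback is limited here to 2x2 tables.

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

def choose_test(table, min_expected=5):
    """Return (test name, p-value), preferring chi-square when the
    expected counts are large enough (hypothetical helper)."""
    table = np.asarray(table)
    _, p_chi2, _, expected = chi2_contingency(table)
    if (expected >= min_expected).all():
        return "chi-square", p_chi2
    if table.shape == (2, 2):
        # Fisher's exact test returns the odds ratio and the p-value
        _, p_exact = fisher_exact(table)
        return "fisher", p_exact
    raise ValueError("Expected counts are small and the table is not 2x2")

# Illustrative table with small counts: every expected frequency is 2
small_table = [[3, 1], [1, 3]]
print(choose_test(small_table))

# The coffee table has expected counts of 22.5 and 27.5
print(choose_test([[30, 20], [25, 25]]))
```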

Interpreting Results

The interpretation of the chi-square test results relies heavily on the p-value. If the p-value is less than a predetermined significance level (alpha, often 0.05), we reject the null hypothesis (no association or no difference from the expected distribution). If the p-value is greater than alpha, we fail to reject the null hypothesis. It's crucial to remember that failing to reject the null hypothesis doesn't necessarily mean the null hypothesis is true; it simply means there's not enough evidence to reject it.
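The decision rule can be written down in a few lines. This is a minimal sketch; `interpret_p_value` is a hypothetical helper, and the wording of the returned strings is a choice made here to reflect that a large p-value does not prove the null hypothesis.

```python
def interpret_p_value(p_value, alpha=0.05):
    """Apply the usual decision rule at significance level alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    if p_value < alpha:
        return "reject the null hypothesis"
    # Not rejecting is not the same as accepting the null hypothesis
    return "fail to reject the null hypothesis"

print(interpret_p_value(0.03))              # below the default alpha of 0.05
print(interpret_p_value(0.03, alpha=0.01))  # same p-value, stricter alpha
```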

Conclusion

The chi-square test is a versatile tool for analyzing categorical data. Python's `scipy.stats` library provides readily available functions for performing both the test of independence and the goodness-of-fit test. Understanding the assumptions and limitations of the test, particularly regarding expected frequencies, is critical for accurate interpretation of results. Remember to always consider the context of your data and choose the appropriate statistical test accordingly.

2025-06-16
