The Chi-Square Independence Test is a complex tool (Excel recommended!) used to determine whether two attributes of the same object are related or independent from each other, with a high level of certainty (>95%). The Chi-Square value is calculated from the (obs)erved frequencies and the (exp)ected frequencies of the values in the table: it is the sum, over all the cells of the table, of (obs - exp)² / exp.
Verify the following checklist before using the Chi-Square Independence Test:
- The sample must include at least 20 individuals (the larger the sample the better)
- The data must be raw counts (the number of individuals in each category), NOT percentages or relative frequencies
- Both observations MUST be independent from each other: one should not be the direct cause of the other (e.g. do not use this test to compare HDI and Life Expectancy, since the HDI is a composite index which is partly based on Life Expectancy!)
- You are only trying to compare TWO attributes of the same object/person at a time (e.g. income AND gender, or temperature AND elevation)
- Each attribute (gender, nationality, preference, etc) is divided into at least TWO possibilities, preferably more whenever possible (e.g. nationality: US, French, British, Japanese, etc). A table with only 2 lines and 2 columns might not lend itself to a significant Chi-Square independence test
The Chi-Square independence test determines whether two attributes of one object are related or independent from each other:
- Verify that your sample meets the 5 criteria above and formulate the null hypothesis
- Arrange the sample data in a contingency table (=table of frequencies) (see tutorial below)
- Calculate the Chi-Square value (the "Chi-Square statistic")
- Use a table to find the Chi-Square critical value (depending on the degrees of freedom of your table and the required confidence level)
- If your Chi-Square value ("Chi-Square statistic") > critical Chi-Square value => the two attributes are related, based on the sample considered
Watch these two tutorials to learn how to run and interpret the Chi-Square Independence Test without Excel, or check out the examples below
Example 1: Is the nationality of foreign tourists visiting SF related to which tourist attraction they liked most?
Step 1: Test criteria and null hypothesis
Step 2: Table of "observed frequencies"
Arrange the numbers you have collected (e.g. interviews of tourists in the street) in a table combining attribute #1 (columns) and attribute #2 (rows) as follows:
Step 3: Table of "expected frequencies"
Using the totals of each line and column, calculate what frequencies would be expected if the two attributes were completely independent from each other (note: if the expected frequencies are smaller than 5, it usually means that the sample is too small for the Chi-Square independence test to be meaningful). To do that, just multiply the total of a line by the total of a column, and divide by the total of the whole table.
For example, 44 of all 167 tourists liked the De Young most while 54 of all 167 tourists are French. Therefore, if we did not have the details about who preferred what, we could expect the number of tourists who are French AND who preferred the De Young Museum to be: 44 x 54 / 167 = 14.2 tourists
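The calculation above can be sketched in a few lines of Python. This is only an illustration: the function name expected_frequency is made up for this example, and the numbers are the totals quoted in the text (44, 54 and 167).

```python
def expected_frequency(line_total, column_total, grand_total):
    """Expected count for one cell if the two attributes were independent."""
    return line_total * column_total / grand_total


# Totals quoted in the text: 44 tourists preferred the De Young,
# 54 tourists are French, and 167 tourists were interviewed in total.
french_de_young = expected_frequency(44, 54, 167)
print(round(french_de_young, 1))  # 14.2

# Rule of thumb from the text: expected frequencies below 5 suggest
# that the sample is too small for the test to be meaningful.
print(french_de_young >= 5)  # True
```

Repeating this calculation for every combination of line and column gives the full table of expected frequencies.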
Step 4: Run the Chi-Square test with Excel:
If you are not using Excel, go to step 5 to learn how to use the Chi-Square formula (see above). With Excel, simply use the formula =CHITEST and select the array of observed frequencies (blue) and the array of expected frequencies (yellow) (do NOT select the row and column with the totals). Using the tables above, the result is:
CHITEST = 0.000003 = 0.0003%
This value is the "p-value" of the test: if the Null Hypothesis were true and the two attributes (nationality and preferred tourist attraction) were INDEPENDENT from each other, there would be only a 0.0003% chance of finding differences this large between the observed and expected frequencies. Since this is far below the 5% level of significance (95% confidence level), we reject the Null Hypothesis and conclude that the two attributes are almost certainly related: based on this sample, certain nationalities tend to prefer specific SF tourist attractions. Further analysis may try to explain why French tourists tend to prefer the De Young while Germans tend to prefer the Mission Dolores: cultural reasons? Social reasons? Tour operators?
Step 5: Calculate the Chi-Square value "manually":
Create a third table: use the observed (O) and expected (E) values found above to calculate the value in each cell of the new table, using the Chi-Test formula (O-E)²/E, as follows:
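For readers who prefer code to spreadsheets, here is a minimal Python sketch of the manual calculation: it builds the expected frequencies from the totals, applies (O-E)²/E to every cell and adds the cells up. The function name chi_square_statistic is an assumption of this sketch, and the counts in the usage example are invented for illustration only; they are NOT the tourist data of this example.

```python
def chi_square_statistic(observed):
    """Sum of (O - E)^2 / E over every cell of a table of observed counts.

    `observed` is a list of lines (rows), WITHOUT the totals.
    """
    line_totals = [sum(line) for line in observed]
    column_totals = [sum(column) for column in zip(*observed)]
    grand_total = sum(line_totals)

    chi_square = 0.0
    for i, line in enumerate(observed):
        for j, o in enumerate(line):
            # Expected frequency: line total x column total / grand total
            e = line_totals[i] * column_totals[j] / grand_total
            chi_square += (o - e) ** 2 / e
    return chi_square


# Hypothetical counts, chosen only to keep the arithmetic easy to follow:
# every expected frequency is 60 x 40 / 120 = 20.
made_up_table = [
    [10, 20, 30],
    [30, 20, 10],
]
print(chi_square_statistic(made_up_table))  # 20.0
```

In the made-up table, four cells each contribute (10)²/20 = 5 and two cells contribute 0, which gives a Chi-Square value of 20.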
Step 6: degrees of freedom, confidence level and critical Chi-Square values:
The number of "degrees of freedom" in the sample combines the number of nationalities (3 choices/columns) and attractions (4 choices/rows) as follows:
df = (# of nationalities - 1) x (# of attractions - 1) = (3-1)x(4-1) = 6
This sample therefore has 6 "degrees of freedom".
The "confidence level" is the minimum probability (p) above which you can safely say that the two attributes examined are related (usually 95%, 99% or 99.9%). Conversely, the "level of significance" is the maximum level of error (e) you want to allow for the test:
e = 100% - p = 1 - p
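The two small formulas above translate directly into Python. The function names below are made up for this sketch.

```python
def degrees_of_freedom(n_columns, n_rows):
    """df = (number of columns - 1) x (number of rows - 1)."""
    return (n_columns - 1) * (n_rows - 1)


def significance_level(confidence_level):
    """e = 1 - p, with the confidence level p written as a fraction."""
    return 1 - confidence_level


# 3 nationalities (columns) and 4 attractions (rows), as in this example
print(degrees_of_freedom(3, 4))  # 6

# A 95% confidence level leaves a 5% level of significance
print(round(significance_level(0.95), 2))  # 0.05
```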
The critical values of Chi-Square are the benchmarks against which to compare the Chi-Square value you have found above. The critical values are set, and depend only upon the degrees of freedom of the sample and the minimum confidence level you have decided to require. Choose ONE of the following methods to calculate the critical values of Chi-Square:
Step 7: Interpret your results
You must compare the Chi-Square value you calculated (step 5) with the critical value of Chi-Square (step 6):
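The comparison can be sketched in Python as follows. The critical values are copied from a standard Chi-Square distribution table (rounded to three decimals) for the two cases used on this page, 4 and 6 degrees of freedom. The names CRITICAL_VALUES and attributes_are_related are assumptions of this sketch, and the Chi-Square value of 15.0 in the usage lines is hypothetical, not the result of this example.

```python
# Critical values of Chi-Square from a standard distribution table,
# indexed by (degrees of freedom, confidence level).
CRITICAL_VALUES = {
    (4, 0.95): 9.488, (4, 0.99): 13.277, (4, 0.999): 18.467,
    (6, 0.95): 12.592, (6, 0.99): 16.812, (6, 0.999): 22.458,
}


def attributes_are_related(chi_square, df, confidence_level=0.95):
    """True if the Chi-Square value exceeds the critical value,
    i.e. the null hypothesis of independence is rejected."""
    critical = CRITICAL_VALUES[(df, confidence_level)]
    return chi_square > critical


# Hypothetical Chi-Square value of 15.0 with 6 degrees of freedom:
print(attributes_are_related(15.0, 6, 0.95))  # True  (15.0 > 12.592)
print(attributes_are_related(15.0, 6, 0.99))  # False (15.0 < 16.812)
```

The same hypothetical value passes the test at the 95% confidence level but not at 99%, which is why the confidence level should be chosen before running the test.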
Example 2: Is the nationality of foreign tourists related to which neighborhood they decide to stay in while in SF?
Step 1: Test criteria and null hypothesis
Step 2: Table of "observed frequencies"
Arrange the numbers you have collected (e.g. interviews of tourists in the street) in a table combining attribute #1 (columns) and attribute #2 (rows) as follows:
Step 3: Table of "expected frequencies"
(see step 3 of example 1)
Step 4: Run the Chi-Square test with Excel:
If you are not using Excel, go to step 5 to calculate the Chi-Square value using the formula above. With Excel, simply use the formula =CHITEST and select the table of observed frequencies and the table of expected frequencies (do NOT select the row and column with the totals). Using the tables above, the result is:
CHITEST = 0.36 = 36%
This value is the "p-value" of the test: if the Null Hypothesis were true and the two attributes (nationality and neighborhood of hotel) were INDEPENDENT from each other, there would still be a 36% chance of finding differences this large between the observed and expected frequencies. This is well above the 5% level of significance (95% confidence level): we have to conclude that the Null Hypothesis can NOT be rejected, and that this sample gives no evidence that the two attributes are related. Based on this sample, the nationality of tourists does not seem to be correlated with where they choose to stay while in San Francisco. Since most tourists appear to choose to stay in the Tenderloin over the Mission regardless of their nationality, further analysis may try to explain this phenomenon: Cost? Accessibility? Hotel chains? Number of hotels?
Step 5: Calculate the Chi-Square value "manually":
Create a third table: use the observed (O) and expected (E) values found above to calculate the value in each cell of the new table, using the Chi-Test formula (O-E)²/E, as follows:
Step 6: degrees of freedom, confidence level and critical Chi-Square values:
(See example 1 for details)
df = (# of nationalities - 1) x (# of neighborhoods - 1) = (3-1)x(3-1) = 4
This sample therefore has 4 degrees of freedom.
The critical values of Chi-Square are as follows (see online table):
Step 7: Interpret your results
Let's compare the Chi-Square value of the sample (step 5) with the critical value of Chi-Square (step 6) (see step 7 of example 1 for more information):
Chi-Square = 4.35 < 9.5 (critical value for the lowest confidence level, 95%).
The confidence level is NOT met and the test is inconclusive: we cannot reject the Null Hypothesis, which means that this sample gives no evidence that the two attributes (nationality and preferred neighborhood to stay) are related. Based on this sample, the nationality of tourists does not seem correlated with where they choose to stay in San Francisco. Since most tourists appear to choose to stay in the Tenderloin over the Marina regardless of their nationality, further analysis may try to explain this phenomenon: more hotels? Accessibility? Advertisement?
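As a cross-check between the two methods, the Chi-Square value found manually (4.35 with 4 degrees of freedom) can be converted into the probability that CHITEST returns in step 4. The Python sketch below uses a shortcut that is valid ONLY when the number of degrees of freedom is even, which happens to be the case in both examples (6 and 4); the function name is made up for this illustration.

```python
import math


def chi_square_p_value_even_df(chi_square, df):
    """Probability of a Chi-Square value at least this large under the
    null hypothesis. Closed form, valid only for an EVEN number of
    degrees of freedom: exp(-x/2) * sum((x/2)^k / k!) for k < df/2."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this shortcut only works for a positive, even df")
    half = chi_square / 2
    terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * terms


# Chi-Square value and degrees of freedom from example 2
p_value = chi_square_p_value_even_df(4.35, 4)
print(round(p_value, 2))  # 0.36

# Not significant at the 95% confidence level (5% level of significance)
print(p_value < 0.05)  # False
```

The result agrees with the 36% given by CHITEST in step 4, so the Excel method and the manual method lead to the same conclusion.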