Chi2 analysis is used for categorical (count) data to
compare expected frequencies to observed frequencies. It can be used to
1. Independence (if any) between variables
2. Whether various subgroups are homogeneous
--use difference between observed rather than O-E
3. Whether there is a significant difference
in proportions in the subclasses among subgroups.
Chi2 is an approximation of a continuous distribution-- it's
no good for "small" frequencies, e.g. if expected < 1, or more than
20% of values <5
Our most typical use will be to determine if two dimensions of classification
are independent. For example, is smoking behavior independent of
gender for highschool students represented by the YRBSS sample?
The null hypothesis is that gender and smoking are independent, thus
we would expect to see the same ratio of smokers to nonsmokers in both
The prediction, then is that the frequency of female smokers would
be the grand total of individuals in the study times the joint probability
of being both female and a smoker.
The probability of being a female is the overall number of females
divided by the total number of people in the sample. The probability
of being a smoker is the total number of smokers divided by the number
of people in the study. If smoking and gender are independent, the expected
number of female smokers is simply the product of P(female) X P(smoker).
analysis requires calculation of the expected numbers in each category,
followed by a comparison with what was actually observed. We do this by
calculating an observed - expected component for each cell in the cross-categorization.
The Chi2 value is the total of all these values for all cells
in the contingency table.
The shape of the Chi2 distribution depends on the degrees
of freedom, which is calculated as number of rows-1 times Number of columns
Since there is some "random" variation, the Chi2 would give
an expected distribution of values if the null hypothesis is true. from
the shape of the curve and the size of the observed value, we can calculate
a probability of obtaining a value that far out if the null hypothesis
If this Pvalue is sufficiently small, we conclude that our assumed
independence is probably not true, and we would reject that assumption
and conclude with some confidence 1-P that the categories are not independent.
EXCEL does have a function for calculating the Chi2 statistic,
but it uses a table of observed values and a table of expected values for
input. Usually the table of expected values must be calculated, so
to be most efficient, we will design a table with flexible formulas so
it can readily be expanded or contracted to any dimensions.
As a first example, here is the crosstabulation of gender (Q2) and smoking
(Q35) from the YRBS 2001. The formulas are designed so that they can be
entered once, then copied down or across to give the right result in all
cells. Formulas entered are shown in the shaded cells. The first
table shows actual, or observed, counts. The table of percents is designed
to calculate as percent of column totals. In some cases, it would be better
to calculate as percent of column totals. The table of expected values
is calculated assuming that gender and smoking are independent (Ho:) In
that case, the ratio of smokers to nonsmokers should be the same regardless
of gender. The probability of being both female and a smoker would be the
product of P(female) X P(smoke).
The Chi2 Statistic is calculated by comparing observed
to expected for each combination in the table. The Chi2 statistic
is the sum over all cells in the table.
Chi2 Critical is obtained through the ChiInv Function for
alpha and degrees of freedom. d.f. = Rows-1 X Columns-1. In this case we
put d.f. on the table and referred to that cell (C42).
The Pvalue for a given Chi2 observed is obtained with the
Chidist function. The Pvalue can also be obtained through EXCEL's
Chitest function, which requires a table of observed values and expected
An important crosscheck is that EXCEL's Chitest give the same result
as the Chidist function on the Chi2 value calculated by a different
route. If these aren't exactly the same, something is wrong.
Here are the numeric results of the flexible formulas, showing that
males are more likely to smoke than females and the results are highly
significant, so probably reflect real differences in the populations represented
by this survey. If there is really no difference (Ho is true) then the
probability of obtaining this result is about 2 in a thousand. We
can therefore conclude with about 99.8% confidence that male highschool
students are more likely to be smokers than are female highschool students.
|In some cases, you will want to calculate percents of row totals rather
than column totals. In that case the flexible formula would be B19/$E19
Once this spreadsheet is assembled, it has been designed to be flexible
and apply to other crosstabulations. A table with different dimensions
can be accommodated by inserting or deleting rows or columns.
make sure to insert or delete on the inside of the table rather than
the periphery so that range references are not disrupted. after inserting
rows or columns to fit the new dimension simply copy the formulas down
or across to fill the new space. Change the degrees of freedom to fit the
new dimensions (R-1)X(C-1), and all the formulas should adjust to give
the correct results.
|be sure to change degrees of freedom to fit the new dimensions of the
|| To calculate percents of row totals rather than column totals,
the flexible formula would be B19/$E19
Chi Square analysis-- Does smoking Marijuana make you
Contingency table on Marijuana per month and debilitating
Before you can conveniently work with the pivottable data, you have to
copy it into another table as values.
There are some tricks to this in EXCEL XP. Right click/copy doesn't
work. use the menu edit/copy. You might have to right click select enable
select before selecting the table. If you select and copy-paste the entire
table, you generate a second active pivottable instead of workable values.
Get around this by paste special/values, or by copying only the actual
data from the table when you select it and copy.
Once the data are copied to a separate table in EXCEL, you will want
to relabel rows and columns to be more descriptive, and delete the row
and column that correspond to "no response". Those records are only a few
percent of the total and for our purposes, we can tolerate a little inaccuracy
to spare the complexity of corrections.
Analysis of the MJHappy crosstab results in EXCEL:
Use real count data from the yrbs2001, from crosstabulating questions 47
(amount of Marijuana in the last 30 days) In Rows--and 23 (sadness
to the point of neglecting normal activities) In columns.
Delete nonrespondents (less than 5%) to simplify the table. Replace
the last digit in each of the first four rows in the Yes column with the
last 4 digits of your student number.
Calculate totals for rows and columns of the observed values. Set up
tables next to the observed table and next to the calculated expected frequency
table in which you calculate the percents of counts related to total
count for each Row. Note this will require slightly different formulas
than those shown in the figure, above. But, in this case the interesting
result is percent of pot users who are depressed. I think this will
clarify what we are doing in the test and help in the practical interpretation.
Calculate Chi² , Critical Chi²
and Pvalue for the observed result. If you use a sheet designed with flexible
formulas for copying as we did in class this will be about a 5-minute job.
Use 5% alpha to test independence of amount of Marijuana smoking and sadness.
Compare the EXCEL Chitest result to the Chidist result for the calculated
observed Chi² value (if these aren't the
same then there is something wrong. Fix it.).
What is the null hypothesis?
Compare Chi² obs to Chi2crit
and Pvalue to alpha (5%) .What do you conclude about the relationship?
How confident are you in your conclusions?
From this, can you conclude Marijuana makes people happy? Can you conclude
that Marijuana makes people sad? What can you conclude from this?
last modified April 22, 2004
or questions to rbanis (at) jinx.umsl.edu