 ## Videotutorials on this exercise.

Chi2 analysis is used for categorical (count) data to compare expected frequencies to observed frequencies. It can be used to test:
1. Independence (if any) between variables
2. Whether various subgroups are homogeneous --use difference between observed rather than O-E
3. Whether there is a significant difference in proportions in the subclasses among subgroups.

## Limitations are:

Chi2 is an approximation of a continuous distribution-- it's no good for "small" frequencies, e.g. if expected < 1, or more than 20% of values <5
Our most typical use will be to determine if two dimensions of classification are independent.  For example, is smoking behavior independent of gender for highschool students represented by the YRBSS sample?
The null hypothesis is that gender and smoking are independent, thus we would expect to see the same ratio of smokers to nonsmokers in both gender samples.
The prediction, then is that the frequency of female smokers would be the grand total of individuals in the study times the joint probability of being both female and a smoker.
The probability of being a female is the overall number of females divided by the total number of people in the sample.  The probability of being a smoker is the total number of smokers divided by the number of people in the study. If smoking and gender are independent, the expected number of female smokers is simply the product of  P(female) X P(smoker). Chi2 analysis requires calculation of the expected numbers in each category, followed by a comparison with what was actually observed. We do this by calculating an observed - expected component for each cell in the cross-categorization. The Chi2 value is the total of all these values for all cells in the contingency table. The shape of the Chi2 distribution depends on the degrees of freedom, which is calculated as number of rows-1 times Number of columns -1.
d.f.= (R-1)*(C-1)
Since there is some "random" variation, the Chi2 would give an expected distribution of values if the null hypothesis is true. from the shape of the curve and the size of the observed value, we can calculate a probability of obtaining a value that far out if the null hypothesis is correct.
If this Pvalue is sufficiently small, we conclude that our assumed independence is probably not true, and we would reject that assumption and conclude with some confidence 1-P that the categories are not independent.
EXCEL does have a function for calculating the Chi2 statistic, but it uses a table of observed values and a table of expected values for input.  Usually the table of expected values must be calculated, so to be most efficient, we will design a table with flexible formulas so it can readily be expanded or contracted to any dimensions.

As a first example, here is the crosstabulation of gender (Q2) and smoking (Q35) from the YRBS 2001. The formulas are designed so that they can be entered once, then copied down or across to give the right result in all cells. Formulas entered are shown in the shaded cells.  The first table shows actual, or observed, counts. The table of percents is designed to calculate as percent of column totals. In some cases, it would be better to calculate as percent of column totals. The table of expected values is calculated assuming that gender and smoking are independent (Ho:) In that case, the ratio of smokers to nonsmokers should be the same regardless of gender. The probability of being both female and a smoker would be the product of P(female) X P(smoke).
The Chi2 Statistic is calculated  by comparing observed to expected for each combination in the table. The Chi2 statistic is the sum over all cells in the table.
Chi2 Critical is obtained through the ChiInv Function for alpha and degrees of freedom. d.f. = Rows-1 X Columns-1. In this case we put d.f.  on the table and referred to that cell (C42).
The Pvalue for a given Chi2 observed is obtained with the Chidist function. The Pvalue can also be obtained through EXCEL's  Chitest function, which requires a table of observed values and expected values.
An important crosscheck is that EXCEL's Chitest give the same result as the Chidist function on the Chi2 value calculated by a different route. If these aren't exactly the same, something is wrong. Here are the numeric results of the flexible formulas, showing that males are more likely to smoke than females and the results are highly significant, so probably reflect real differences in the populations represented by this survey. If there is really no difference (Ho is true) then the probability of  obtaining this result is about 2 in a thousand. We can therefore conclude with about 99.8% confidence that male highschool students are more likely to be smokers than are female highschool students.
 In some cases, you will want to calculate percents of row totals rather than column totals.  In that case the flexible formula would be B19/\$E19 Once this spreadsheet is assembled, it has been designed to be flexible and apply to other crosstabulations. A table with different dimensions can be accommodated by inserting or deleting rows or columns.
make sure to insert or delete on the inside of the table rather than the periphery so that range references are not disrupted. after inserting rows or columns to fit the new dimension simply copy the formulas down or across to fill the new space. Change the degrees of freedom to fit the new dimensions (R-1)X(C-1), and all the formulas should adjust to give the correct results. be sure to change degrees of freedom to fit the new dimensions of the data. To calculate percents of row totals rather than column totals, the flexible formula would be B19/\$E19

# Contingency table on Marijuana per month and debilitating depression.

Before you can conveniently work with the pivottable data, you have to copy it into another table as values.
There are some tricks to this in EXCEL XP. Right click/copy doesn't work. use the menu edit/copy. You might have to right click select enable select before selecting the table. If you select and copy-paste the entire table, you generate a second active pivottable instead of workable values. Get around this by paste special/values, or by copying only the actual data from the table when you select it and copy.
Once the data are copied to a separate table in EXCEL, you will want to relabel rows and columns to be more descriptive, and delete the row and column that correspond to "no response". Those records are only a few percent of the total and for our purposes, we can tolerate a little inaccuracy to spare the complexity of corrections.

## Analysis of the MJHappy crosstab results in EXCEL:

Use real count data from the yrbs2001, from crosstabulating questions 47 (amount of Marijuana in the last 30 days) In Rows--and 23 (sadness to the point of neglecting normal activities) In columns.  Delete nonrespondents (less than 5%)  to simplify the table. Replace the last digit in each of the first four rows in the Yes column with the last 4 digits of your student number.

Calculate totals for rows and columns of the observed values. Set up tables next to the observed table and next to the calculated expected frequency table in which you calculate the percents of counts related to total count for each Row.  Note this will require slightly different formulas than those shown in the figure, above. But, in this case the interesting result is percent of pot users who are depressed. I think this will clarify what we are doing in the test and help in the practical interpretation. Calculate Chi² , Critical Chi²  and Pvalue for the observed result. If you use a sheet designed with flexible formulas for copying as we did in class this will be about a 5-minute job. Use 5% alpha to test independence of amount of Marijuana smoking and sadness. Compare the EXCEL Chitest result to the Chidist result for the calculated observed Chi²  value (if these aren't the same then there is something wrong. Fix it.).

1. What is the null hypothesis?
2. Compare Chi²  obs to Chi2crit and Pvalue to alpha (5%) .What do you conclude about the relationship? How confident are you in your conclusions?
3. From this, can you conclude Marijuana makes people happy? Can you conclude that Marijuana makes people sad? What can you conclude from this?

last modified April 22, 2004  comments or questions to rbanis (at) jinx.umsl.edu