Example: paired-data method for correcting for other variables

Sex and Salaries: Are they significantly different?

There is a common belief that women are paid less than men for equivalent work.
When salaries are studied across a broad range of different jobs, such a difference may not be apparent: pay varies widely from job to job, so the overall means of the two distributions are not vastly different. If the other factors affecting pay are controlled and adjusted for (the Bill Cosby school of detecting causality), a sex difference may be detected.

Since we don't know in advance which direction a difference might take, we will use two-tailed t-tests.
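
For reference, a two-tailed test counts departures from H0 in either direction. As a sketch in standard notation (the symbols t_obs, for the observed t statistic, and ν, for its degrees of freedom, are introduced here and are not part of the assignment), the two-tailed P-value is:

```latex
p = P\left(|T_\nu| \ge |t_{\mathrm{obs}}|\right) = 2\,P\left(T_\nu \ge |t_{\mathrm{obs}}|\right)
```

H0 is rejected at alpha = 0.05 when p < 0.05, whichever sign the observed difference has.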

  • The data are available as a comma-delimited ASCII file, sexsal.txt. Copy and paste it into EXCEL, then immediately do a SaveAs to your A: disk, specifying file type EXCEL, before you start on the project. It's not enough to add the .xls extension to the file name.
  • Use the following tools in EXCEL (you don't have to do them in exactly this order; if you are not sure about some of them, do the easy ones first):
    1. Add a third column containing the differences between the pairs.
    2. Derive tables of descriptive statistics for each of the two samples, including 95% confidence intervals. Comment on the means and confidence intervals. Is there strong statistical evidence from this analysis that these two samples came from different populations? Why or why not?
    3. Conduct an F test (alpha = 0.05) on the variances of the two distributions and decide whether to do a t-test assuming equal or unequal variances. Comment on the results. What does the P-value mean?
    4. Conduct the appropriate unpaired t-test to compare the two distributions (H0: difference = 0, alpha = 0.05). Comment on the results. Is there a significant difference? What is the probability of getting this result if H0 is true? Do you conclude with 95% confidence from this analysis that males and females have different average salaries?
    5. Conduct a t-test for the difference of the means of the two distributions as paired variables, with alpha = 0.05. Comment on the results. Are they "significant at the 5% level"? What does that mean? Are they "significant at the 1% level"?
    6. Derive descriptive statistics and a 95% confidence interval for the column of differences. Calculate a t value for the mean of this distribution compared with an H0 that the mean = 0. How does this compare to the t-test on paired variables?
    7. Why didn't we see the difference when we did the analysis without pairing the observations? What did the pairing do for us mathematically so that we could identify a difference with higher confidence? Construct a histogram of the pooled observations (all 200 treated as a group); make sure the X axis is labeled with the right values.
    8. Why is it okay to use a t test rather than a Z test even though there is a large number of observations?
    9. Highlight both original columns again, click on the chart icon, and this time select an XY scatter chart. Click Next a few times, making sure that the chart appears as an object in the same spreadsheet, then move it to a convenient location. Right-click on the middle of the data and click Add Trendline. This will bring up a dialogue where you can select a linear trendline under Type and then, on the Options tab, check the boxes for "display equation on chart" and "set intercept = 0". Consider the result. What does this mean in comparing male to female salaries when values are paired this way? If you wanted to predict a female's salary knowing only the salary of her pair partner, how would you calculate your best guess?
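
For readers who want to check the EXCEL results outside the spreadsheet, the calculations in steps 1 to 6 and 9 can be sketched in Python using only the standard library. This is a minimal illustration, not the assignment's method: the data below are a synthetic stand-in for sexsal.txt (100 hypothetical pairs whose base pay varies widely while the within-pair gap is small), every function and variable name is invented here, T_CRIT is an approximate table value rather than an exact quantile, and fitting female salary on male salary through the origin is an assumption about which column ends up on the X axis.

```python
import math
import statistics

# Synthetic stand-in for the 100 pairs (NOT the values in sexsal.txt).
# Base pay differs a lot from pair to pair; the gap within a pair is small.
n = 100
base = [20000 + 400 * i for i in range(n)]
male = [b + 500 * ((i % 5) - 2) for i, b in enumerate(base)]
female = [b - 1500 + 300 * (((3 * i) % 7) - 3) for i, b in enumerate(base)]

# Approximate two-tailed 5% critical t value for roughly 100 to 200
# degrees of freedom (an assumed table value, rounded).
T_CRIT = 1.98


def unpaired_t(x, y):
    """Unpaired t statistic that does not assume equal variances."""
    se = math.sqrt(statistics.variance(x) / len(x)
                   + statistics.variance(y) / len(y))
    return (statistics.mean(x) - statistics.mean(y)) / se


def one_sample_t(d):
    """t statistic for H0: the mean of d is 0."""
    se = statistics.stdev(d) / math.sqrt(len(d))
    return statistics.mean(d) / se


# Step 1: column of within-pair differences.
diffs = [m - f for m, f in zip(male, female)]

# Step 3: ratio of sample variances, the F statistic.
f_ratio = statistics.variance(male) / statistics.variance(female)

# Step 4: unpaired comparison of the two columns.
t_unpaired = unpaired_t(male, female)

# Steps 5 and 6: the paired t-test is the one-sample t-test on the differences.
t_paired = one_sample_t(diffs)
half_width = T_CRIT * statistics.stdev(diffs) / math.sqrt(n)
ci_diff = (statistics.mean(diffs) - half_width,
           statistics.mean(diffs) + half_width)

# Step 9: least-squares line through the origin, female = slope * male.
slope = sum(m * f for m, f in zip(male, female)) / sum(m * m for m in male)

print("F ratio:", round(f_ratio, 3))
print("unpaired t:", round(t_unpaired, 2))
print("paired t:", round(t_paired, 2))
print("approx. 95% CI for mean difference:", ci_diff)
print("slope through origin:", round(slope, 3))
```

To try this on the real data, replace the male and female lists with the two columns read from sexsal.txt, after checking which column is which. Exact P-values need the t distribution, which the standard library does not provide; EXCEL's analysis tools or a statistics package would supply them.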