Download the 2003 YRBSS data. (click on the link to data and documentation to get to the list of files-- I've also provided the documentation as a booklet in class) Import the columns you need into EXCEL. Use the import wizard for fixed length fields. Refer to the YRBSS documentation pdf or the booklet for data positions. Importing into EXCEL is very similar to importing into ACCESS. There are video tutorials on the fixed-length import of YRBSS2003 data into EXCEL. The procedures are the same but for this exercise, we don't need to import all the data. Rrefer to the screen captures for parsing the 2003 data. the procedure can be simplified quite a bit, and smaller files obtained, by not importing data fields you aren't working with immediately. We can always go back to the source at CDC if you subsequently want to work with something you left out.
You will need the data in the columns shown for:
The allocation on your K drive is about 20 MB. Since the files we are working with are relatively big, the best place to save them would probably be a USB flash memory device. You can buy one of these for up to 512 MB for less than $30. Floppy disks are the worst option, as they are not only low capacity, but they are very prone to failure. To make the spreadsheet smaller, you can either delete unused columns early or not import them in the first place. As you add to the complexity of the sheet, it becomes bigger. You can minimize this by keeping the layout of figures organized and compact. The Multiple regression file is only about 2 MB, so you can save it to your UMSL webserver virtual drive space (access from home via FTP) or else use winZIP to compress the file to save it to a floppy. Winzip can also be used to compress the whole dataset down to about 2 MB, which is practical to save on the K drive. See the video on winzip and the page of instructions where you can download Winzip and FTP.
In all of the following procedures, it is easiest to select ranges by selecting the first cell, then holding down the shift key to anchor the selection as you move to other cells. long columns can be selected by holding shift as you use end down to go to the lowest cell. this can be used for columns with empty cells by selecting adjacent columns first then holding shift as you use the arrow keys to extend the selection across. The Weight column has decimal points in the cells that are missing values, so there are no empty cells in that column.
Remove records that have no values (no response) cells. You can most conveniently do this by sorting based on each column, then deleting records in blocks that are missing data. Do this first on the Weight column. be sure to cover all columns in the sort ranges so records don't get scrambled.
Insert a column at "A" to put the random variable. Use the formula =rand()*400 and copy it down the whole length of the records. Convert height to inches; weight to pounds; the age response to the corresponding ages; the sex response to 0 for female, 1 for male. The conversions can be accomplished by setting up new columns with the appropriate formulas, use the small-cross clickety-click trick to copy the formulas to all the rows, then use copy paste special values to convert the formulas to values before deleting the original data columns. To use all the columns together in the multiple regression, all the columns of "X" variables have to be adjacent to each other. EXCEL won't accept a split X range for this procedure.
Insert a few rows at the top to put your name, section time and student number. You are required to print this identification with the spreadsheet results to get credit.
Generate descriptive statistics of weights.
Produce a histogram of the weight distribution This should have about 10 pound increments from 60-410 lb. This will will give about 35 bins (bars in the graph) and will show the shape of the curve well.
Produce a scatter chart of weight vs. height, with a linear trend line, r-squared and the regression equation.
Conduct a multiple regression of weight as the dependent (Y) variable on the 4 independent variables: random number, age, sex and height. Referring to the correlation and multiple R-squared, does there appear to be a strong relationship between weight and the other variables? Referring to the results of the multiple regression analysis (focus on the Pvalues), discuss statistical significance of each of the relationships. What is the equation for this model? What is the practical significance of each?
Print the analysis and discussions to one page. Don't print all the rows of data-- just the top --by selecting a range and printing the selection (print/selection). include the top of the data columns, but don't print the whole sheet as it would be huge and take a lot of paper (about 400 pp.) and printer time. turn in the one page printout plus make sure to a file in the mygateway dropbox in case I need to look at it to resolve any questions.
Return to BA3300 syllabus