For this assignment, you will be using the Framingham Heart Study data. The Framingham Heart Study is a long-term prospective study of the etiology of cardiovascular disease among a population of subjects in the community of Framingham, Massachusetts. It was a landmark study in epidemiology in that it was the first prospective study of cardiovascular disease and identified the concept of risk factors and their joint effects. We will be using this original data.
For last week’s assignment, Assignment 8, you constructed an analysis file for the Framingham Heart Study data and you were asked to save it to your computer. You will be using that same analysis file for Assignment 9. If you did not save the analysis file, please go back to the Assignment 8 instructions and re-create the analysis file.
You may use either EXCEL or R for the following preparatory activities.
1) If you look over the data in the analysis file, you will notice that a number of variables have Yes / No text entries. This is problematic for Correlation Analysis because the algorithms need numbers for their computations. These Yes / No variables are typically indicators of some type of health quality or situation, and they need to be recoded into numbers. There is no one way to do this, but the “Yes” entries need to be coded as 1 and the “No” entries as 0. These are referred to as present / absent variables. One suggestion is to sort the data in EXCEL by each of these present / absent variables and then copy / paste the desired values over the text data. This must be done one variable at a time. It is a little bit of work, but it is not that onerous. Once this is done for all the variables, you would then import a “clean” dataset into R. Or, you could import the dataset “as is” into R and then use code to create the new present / absent variables. Analysts typically use the ifelse() function to do this. For example, if you focus on the “DEATH” variable, you could have code that looks like:
d_death <- ifelse(mydata$death=='Yes', 1, ifelse(mydata$death=='No', 0, 99))
table(d_death)
I throw in the extra check and the 99 value to flag records that do not contain valid data. Notice that I’ve created a new object here called d_death. You could have a series of statements, like this one for the death variable, one for each of the present / absent variables.
You will have to bring the d_death, and any other indicator variables you create, into your mydata data.frame. You can do this using the cbind.data.frame() function. For example:
mydata1<-cbind.data.frame(mydata, d_death)
With these skills in mind, recode all Yes / No variables into present / absent indicator variables. Update and save your revised analysis file.
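The one-variable pattern above can be extended to several columns at once. What follows is only a minimal sketch on a small made-up data frame: the column names angina and stroke, the helper recode_yes_no, and the vector yes_no_vars are hypothetical and should be replaced with whatever Yes / No variables your analysis file actually contains.

```r
# Toy stand-in for the analysis file; replace with your own mydata.
# Column names other than 'death' are assumptions for illustration.
mydata <- data.frame(
  death  = c("Yes", "No", "No", "Yes"),
  angina = c("No", "No", "Yes", NA),
  stroke = c("No", "Yes", "No", "No"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: Yes -> 1, No -> 0, any other text -> 99.
# Missing values (NA) stay NA, as they do with a plain ifelse() call.
recode_yes_no <- function(x) {
  ifelse(x == "Yes", 1, ifelse(x == "No", 0, 99))
}

# List the Yes / No variables to recode (hypothetical names)
yes_no_vars <- c("death", "angina", "stroke")

# Build the d_ indicator columns and attach them to the data frame
indicators <- as.data.frame(lapply(mydata[yes_no_vars], recode_yes_no))
names(indicators) <- paste0("d_", yes_no_vars)
mydata1 <- cbind.data.frame(mydata, indicators)

# Tabulate each new variable to spot unexpected 99 codes
lapply(mydata1[names(indicators)], table)
```

Any record coded 99 should be inspected and set to NA (or otherwise resolved) before correlations are computed, because 99 would otherwise be treated as a real numeric value.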
2) Similarly, the variable “SEX1” is an indicator variable of gender. This is not really a present / absent variable, but it does only have 2 levels. This variable can also be recoded into a 0 / 1 indicator variable. Please do this recoding, but be sure you remember which gender was designated as 1 and which as 0. You will need to know this for interpretation of results. Again, update and save your revised analysis file.
3) At this point, you should have a revised analysis file with no text data in the dataset at all. Once you have achieved this, you may go on to the tasks of Assignment 9.
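Steps 2 and 3 above might be sketched as follows. This is an illustration on made-up rows, not a required method: the labels "Male" and "Female" are an assumption about how SEX1 is stored (check with table() first), the choice of 1 = Female is arbitrary, and the names d_female and all_numeric and the output file name are hypothetical.

```r
# Toy stand-in; the labels "Male"/"Female" are an assumption about how
# SEX1 is stored in your file. Check with table(mydata$SEX1) first.
mydata <- data.frame(
  SEX1    = c("Male", "Female", "Female", "Male"),
  age1    = c(39, 46, 48, 61),
  d_death = c(0, 0, 1, 1),
  stringsAsFactors = FALSE
)

# Arbitrary choice for this sketch: 1 = Female, 0 = Male; 99 flags anything else
mydata$d_female <- ifelse(mydata$SEX1 == "Female", 1,
                          ifelse(mydata$SEX1 == "Male", 0, 99))

# Cross-tabulate old and new values to confirm the coding
table(mydata$SEX1, mydata$d_female)

# Drop the original text column, then confirm only numeric columns remain
mydata$SEX1 <- NULL
all_numeric <- all(sapply(mydata, is.numeric))
all_numeric

# Save the revised analysis file (file name is hypothetical)
# write.csv(mydata, "framingham_analysis_v2.csv", row.names = FALSE)
```

Naming the indicator after the level coded as 1 (here d_female) is one way to keep track of which gender is 1 when you interpret results later.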
ASSIGNMENT TASKS
PART A – MECHANICS (30 POINTS)
For the analysis in Part A, the variable “totchol1” should be considered the response variable Y. Any other variable involved in the analysis should be considered an explanatory variable X. Complete the following:
1) Construct a scatterplot of age1 by totchol1. Describe what you see in this graph. Be sure to label the axes and give the graph a title.
2) Correlate age1 and totchol1. Report and interpret the correlation coefficient.
3) Conduct a hypothesis test on the correlation between age1 and totchol1. Be sure to clearly specify the null and alternative hypotheses, make a decision based on the p-value of the test statistic, and interpret the result.
4) Construct a scatterplot of death by totchol1. Describe what you see in this graph. Be sure to label the axes and give the graph a title.
5) Correlate death and totchol1. This is called a point-biserial correlation and it is a legitimate correlation. Report and interpret the correlation coefficient.
6) Conduct a hypothesis test on the correlation between death and totchol1. Be sure to clearly specify the null and alternative hypotheses, make a decision based on the p-value of the test statistic, and interpret the result.
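The mechanics of tasks 1 through 6 can be sketched in base R as shown below. The data here are simulated purely so the code runs and are not Framingham values; the indicator name d_death follows the preparatory example, and the object names r_age, r_death, test_age, and test_death are hypothetical. Placing totchol1 on the vertical axis is a choice based on its role as the response variable Y. The point-biserial correlation is simply the Pearson correlation computed with a 0 / 1 variable, so cor() and cor.test() handle both cases.

```r
# Simulated stand-in data (NOT Framingham values); use your own analysis file.
set.seed(1)
n <- 200
mydata <- data.frame(age1 = round(runif(n, 32, 70)))
mydata$totchol1 <- 180 + 1.2 * mydata$age1 + rnorm(n, sd = 40)
mydata$d_death  <- rbinom(n, 1, 0.3)

# Tasks 1 and 4: scatterplots with the response (totchol1) on the y-axis
plot(mydata$age1, mydata$totchol1,
     xlab = "age1", ylab = "totchol1",
     main = "Total cholesterol by age")
plot(mydata$d_death, mydata$totchol1,
     xlab = "Death indicator (0 = No, 1 = Yes)", ylab = "totchol1",
     main = "Total cholesterol by death indicator")

# Tasks 2 and 5: Pearson correlation; use = "complete.obs" drops missing pairs
r_age   <- cor(mydata$age1, mydata$totchol1, use = "complete.obs")
r_death <- cor(mydata$d_death, mydata$totchol1, use = "complete.obs")

# Tasks 3 and 6: test H0: rho = 0 against H1: rho != 0 (two-sided by default)
test_age   <- cor.test(mydata$age1, mydata$totchol1)
test_death <- cor.test(mydata$d_death, mydata$totchol1)

# Estimated correlation and p-value for each test
test_age$estimate;   test_age$p.value
test_death$estimate; test_death$p.value
```

Printing test_age or test_death in full also shows the t statistic, degrees of freedom, and a confidence interval for the correlation, which can support the interpretation.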
PART B – OPEN ENDED ANALYSIS (70 POINTS)
In professional practice, correlation is not merely used to measure the strength of linear association between two variables. Rather, correlation can be used to gain knowledge about the information contained within sets of variables. This is valuable as a precursor analysis to Regression so that you know which variables are related to one another. This understanding can greatly facilitate your future predictive modeling analysis. When you have an observational dataset like the Framingham Heart Study data, you are typically looking for risk factors: explanatory variables that are related to specific response variables of interest, but that are also interrelated amongst themselves. Keep this in mind as these analyses progress.
7) Create a new data frame, call it mydata7, that contains only the continuous variables from the analysis file. Do not include age1 in this data frame. These variables should include: totchol1, sysbp1, diabp1, cigpday1, bmi1, heartrte1, glucose1.
a) Obtain the correlation matrix for the mydata7 data frame. Which variable is correlated highest with totchol1?
b) From the correlation matrix, can you tell if any of these variables appear to be measuring something in common?
c) Use the corrplot(mydata7) function from the corrplot library to plot the correlation hit matrix (note: there is an example of this in the Module 9 classroom if you don’t know how to use this package or function). Does this help you see common patterns in this data? Feel free to reorder the variables in the data frame list to make the connections between variables easier to see. How do you interpret the results from the “hit matrix”?
8) Create a new data frame from the analysis file that includes totchol1 and all of the present / absent indicator variables. Call this data frame mydata8. Do not include gender in this data frame.
a) Obtain the correlation matrix for the mydata8 data frame. Which present/absent variable is correlated highest with totchol1?
b) From the correlation matrix, can you tell if any of these variables appear to be measuring something in common?
c) Use the corrplot(mydata8) function from the corrplot library to plot the correlation hit matrix (note: there is an example of this in the Module 9 classroom if you don’t know how to use this package or function). Does this help you see common patterns in this data? Feel free to reorder the variables in the data frame list to make the connections between variables easier to see. How do you interpret the results from the “hit matrix”?
9) Reflect on your experiences here. What seems to be the story about risk factors for total cholesterol (totchol1)? Do you have recommendations for future analysis?
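For task 7, the data frame, correlation matrix, and plot might be sketched as follows; the same pattern would apply to mydata8 with the indicator variables. The data are simulated only so the code runs and are not Framingham values, and the names vars7, cormat7, and r_tot are hypothetical. As far as I understand the corrplot package, corrplot() is normally given a correlation matrix rather than the raw data frame, so this sketch passes the output of cor(); defer to the Module 9 example if it differs.

```r
# Simulated stand-in (NOT Framingham values), built only so the code runs
set.seed(2)
n <- 300
vars7 <- c("totchol1", "sysbp1", "diabp1", "cigpday1",
           "bmi1", "heartrte1", "glucose1")
mydata <- as.data.frame(matrix(rnorm(n * length(vars7)), nrow = n))
names(mydata) <- vars7
# Induce one related pair so the matrix has something to show
mydata$diabp1 <- 0.8 * mydata$sysbp1 + 0.6 * mydata$diabp1
mydata$age1 <- rnorm(n)

# Task 7: keep the continuous variables, leaving out age1
mydata7 <- mydata[, vars7]

# 7a: correlation matrix; pairwise deletion handles missing values
cormat7 <- cor(mydata7, use = "pairwise.complete.obs")
round(cormat7, 2)

# Variable with the largest absolute correlation with totchol1
r_tot <- cormat7["totchol1", colnames(cormat7) != "totchol1"]
names(r_tot)[which.max(abs(r_tot))]

# 7c: plot the correlation matrix if the corrplot package is installed;
# order = "hclust" groups variables with similar correlation patterns
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cormat7, method = "color", order = "hclust")
}
```

Reordering the variables, either by hand or with an ordering option such as the one used here, tends to place related variables next to each other, which makes blocks of shared information easier to spot.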