Week 8: Correlation and Regression Analysis in R
Doing correlation and regression analysis by hand can be tricky, especially as the data/models become more
complex. In this document, you will find all the commands necessary to check assumptions and perform
correlation and regression tests in R.
I) Checking Assumptions
Recall that there are three assumptions to check before performing a correlation or regression test:
1) The data comes from a randomly selected sample.
2) The data is approximately linear.
3) The residual values have an approximately Normal distribution.
To check the second assumption, a scatterplot is needed. Remember that in order to be linear, the data simply
needs to show a constant upward or downward trend as you move from left to right on the graph (see Figure 3, all but r = 0). The data does not need to form a perfect line to be linear. To construct a scatterplot, use the following command format:
following command format:
plot(dataset$independentVariable, dataset$dependentVariable)
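For example, assuming your data is stored in a data frame named bodyData (a hypothetical name) with columns height and weight, the command might look like this:
# Scatterplot of weight (dependent) against height (independent)
plot(bodyData$height, bodyData$weight, xlab = "Height", ylab = "Weight")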
To check the third assumption, you will first need to perform a linear regression and then perform a Shapiro-Wilk test on the residual values of that regression. In the first command, we will use the variable "model" to store the linear regression model. This formatting is similar to the model used for ANOVA tests.
model <- lm(dataset$dependentVariable ~ dataset$independentVariable)
shapiro.test(resid(model))
As with all other Shapiro-Wilk tests we have performed, if the resulting p-value is > 0.05, the residual values have an approximately normal distribution.
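Putting the two commands together, a minimal sketch of this check, again assuming a hypothetical data frame bodyData with columns height and weight, would be:
# Fit the regression, then test its residuals for normality
model <- lm(bodyData$weight ~ bodyData$height)
shapiro.test(resid(model))
# If the reported p-value is > 0.05, the residuals are approximately normal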
II) Correlation Tests
Pearson’s Product Moment
Calculating a correlation in R is similar to running a t-test. The Pearson product-moment correlation is the default correlation calculated by the function cor.test, which will also perform a hypothesis test for the correlation. The general format is presented below:
cor.test(dataset$independentVariable, dataset$dependentVariable)
This will present an output that looks like this:
Pearson’s product-moment correlation
data: height and weight
t = -0.048776, df = 10, p-value = 0.9621
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.5841539 0.5634663
sample estimates:
cor
-0.01542263
The output is similar to that of previous tests. The p-value is clearly labeled, and the sample correlation value, r, can be found at the bottom of the output under "cor".
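If you want to pull these numbers out of the output directly (for example, to round them for a report), cor.test returns an object whose pieces can be accessed by name. A short sketch, again using the hypothetical bodyData data frame:
result <- cor.test(bodyData$height, bodyData$weight)
result$estimate    # the sample correlation, r
result$p.value     # the p-value for the hypothesis test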
III) Regression Tests
Building the Model
Similar to performing an ANOVA, a regression test involves building a model to run the analysis. If we want to
predict the weight based on height, the independent variable would be height and the dependent variable would
be weight. To perform a regression in R, we use the linear model function, lm, which involves two separate commands. First, we create the model and store it in a variable; in this example, I chose the name "model". The second command will display the results of the regression test.
model <- lm(dataset$dependentVariable ~ dataset$independentVariable)
summary(model)
This will produce the regression summary, which will look similar to the output of an ANOVA test.
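As a side note, the same model can also be fit using the data argument of lm, which lets you write the formula with just the column names. This sketch assumes the hypothetical bodyData data frame again:
# Equivalent model specification using the data argument
model <- lm(weight ~ height, data = bodyData)
summary(model)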
Reading Regression Output
After running the summary command, you will receive a wealth of information. The first portion provides information about the residuals, but this will not be used. The "Coefficients" section will provide you with the values needed to write the regression equation, and the last section will provide the p-value and R². Below is an example summary from the regression commands above:
Call:
lm(formula = weight ~ height)
Residuals:
Min 1Q Median 3Q Max
-86.615 -49.715 2.233 38.029 103.641
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 213.5151 320.2997 0.667 0.520
height -0.2561 5.2497 -0.049 0.962
Residual standard error: 62.32 on 10 degrees of freedom
Multiple R-squared: 0.0002379, Adjusted R-squared: -0.09974
F-statistic: 0.002379 on 1 and 10 DF, p-value: 0.9621
To write the regression equation, look in the "Estimate" column of the Coefficients section. The top number is the y-intercept, and the bottom number is the slope. Remember that the slope is always connected to x in a linear equation, so it makes sense that the independent variable (height) is listed right next to the slope value. After substituting the y-intercept for b₀ and the slope for b₁, the regression equation would be:
ŷ = 213.515 − 0.256(x)
When writing the regression equation, it is recommended to round numbers to the nearest hundredth or thousandth place (two or three decimal places) when appropriate. Another possible way to write the regression equation is to replace x and y with the independent and dependent variables to provide context:
weight = 213.515 − 0.256(height)
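If you would rather have R report these two numbers directly instead of reading them off the summary table, the coef function returns the intercept and slope from the fitted model. A short sketch using the model created earlier:
round(coef(model), 3)
# (Intercept)      height
#     213.515      -0.256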
You can also find the R² value in the regression output, on the line labeled "Multiple R-squared". We typically express R² as a percentage, so our value would be 0.02379%.
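The same value can be pulled from the summary object and converted to a percentage; a short sketch using the model from above:
r2 <- summary(model)$r.squared    # Multiple R-squared as a proportion
r2 * 100                          # as a percentage, about 0.024% here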
Finally, performing a linear regression in R provides us with a p-value that can be used to test the alternative
hypothesis (the true slope is not equal to zero). This allows us to determine if the slope of the population
regression line between the independent and dependent variables is significantly different from zero. The p-value for that test is displayed on the last line of the output. In this example, we would fail to reject the null hypothesis and have insufficient evidence to show that the true slope is not equal to zero (the p-value is > 0.05), implying there is not a significant linear relationship between height and weight.
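If you want R to report that p-value by itself, it is stored in the coefficient table of the summary object (for simple linear regression, the slope's p-value matches the overall p-value on the last line of the output). A short sketch using the model from above:
coef(summary(model))["height", "Pr(>|t|)"]    # about 0.962 here
# Since this is greater than 0.05, we fail to reject the null hypothesis of zero slope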