The following code creates a sample data frame that we will use in our examples.
# Import the height and weight data
htwt = read.csv("http://facweb1.redlands.edu/fac/jim_bentley/downloads/math111/htwt.csv")
# The Group variable is actually categorical and not numeric
htwt$Group = factor(htwt$Group,levels=c(1,2),labels=c("Male","Female"))
head(htwt)
## Height Weight Group
## 1 64 159 Male
## 2 63 155 Female
## 3 67 157 Female
## 4 60 125 Male
## 5 52 103 Female
## 6 58 122 Female
It is fairly clear from graphing Weight as a function of Height that when modeling a person’s weight we should correct for height.
A linear model makes the correction fairly easy.
# p_load() comes from the pacman package; run library(pacman) first if needed
p_load(lattice)
xyplot(Weight~Height, data=htwt)
To fit the model Weight = a + b Height, we can use the "simple" equations given in class, or we can use the function lm().
r = cor(htwt$Weight,htwt$Height)
b = r * sqrt(var(htwt$Weight))/sqrt(var(htwt$Height))
b
## [1] 5.041217
a = mean(htwt$Weight) - b * mean(htwt$Height)
a
## [1] -173.4596
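As an aside, since sd(x) is the same as sqrt(var(x)), the slope formula can be written a little more compactly. A small sketch, assuming the htwt data frame read in above:

```r
# Equivalent slope computation using sd() in place of sqrt(var());
# this gives exactly the same value as b above
b.alt = cor(htwt$Weight, htwt$Height) * sd(htwt$Weight) / sd(htwt$Height)
b.alt
```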
htwt.lmw = lm(Weight ~ 1 + Height, data=htwt)
summary(htwt.lmw)
##
## Call:
## lm(formula = Weight ~ 1 + Height, data = htwt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.6729 -5.4001 -0.6165 3.9014 14.3163
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -173.4596 12.2080 -14.21 3.18e-11 ***
## Height 5.0412 0.1949 25.87 1.09e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.171 on 18 degrees of freedom
## Multiple R-squared: 0.9738, Adjusted R-squared: 0.9723
## F-statistic: 669.1 on 1 and 18 DF, p-value: 1.09e-15
print(c(a, b))
## [1] -173.459595 5.041217
coef(htwt.lmw)
## (Intercept) Height
## -173.459595 5.041217
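The standard errors reported in the summary can also be turned into confidence intervals for the coefficients using confint(). A sketch, assuming the htwt.lmw fit from above:

```r
# 95% confidence intervals for the intercept and slope
confint(htwt.lmw, level = 0.95)
```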
The estimated regression line is seen to be Weight = -173.4595952 + 5.0412173 Height.
The regression line can be plotted using standard R functions.
# Using base R graphics
plot(htwt$Height, htwt$Weight, xlab="Height", ylab="Weight")
abline(htwt.lmw)
# Using lattice graphics
p_load(lattice)
xyplot(Weight ~ Height, data=htwt, type=c("p","r"))
# Using ggplot2
p_load(ggplot2)
ggplot(data = htwt, aes(x = Height, y = Weight)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "lightblue", formula = y ~ 1 + x)
We need to find the predicted and residual values for our data. Remember that residual = observed - predicted.
# Using definitions
observed = htwt$Weight
predicted = a + b * htwt$Height
htwt.lmw.resid = observed - predicted
htwt.lmw.resid
## [1] 9.8216871 10.8629044 -7.3019648 -4.0134436 14.3162949 3.0689910
## [7] -7.8485744 -6.6837051 3.2034274 -10.6729207 0.4507313 1.1514256
## [13] 6.5331659 -3.2195302 -4.9722263 -4.9722263 -6.9310090 3.3575122
## [19] 5.5331659 -1.6837051
# Using internal functions
res = as.numeric(residuals(htwt.lmw))
res
## [1] 9.8216871 10.8629044 -7.3019648 -4.0134436 14.3162949 3.0689910
## [7] -7.8485744 -6.6837051 3.2034274 -10.6729207 0.4507313 1.1514256
## [13] 6.5331659 -3.2195302 -4.9722263 -4.9722263 -6.9310090 3.3575122
## [19] 5.5331659 -1.6837051
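The fitted model can also predict weight for heights that do not appear in the data. A sketch using predict(), assuming the htwt.lmw fit from above; the three new heights are made up for illustration:

```r
# Predicted weights for three illustrative heights
new.heights = data.frame(Height = c(60, 66, 72))
predict(htwt.lmw, newdata = new.heights)
# Add prediction intervals for individual new observations
predict(htwt.lmw, newdata = new.heights, interval = "prediction")
```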
When fitting a regression model, it is always a good idea to check the residuals to make certain that underlying model assumptions are satisfied. A first step is to plot the residuals against the explanatory variables.
plot(htwt$Height, res, ylab="Residual")
abline(h=0)
Since no obvious patterns exist in the residuals, we can be fairly sure that we have not missed some non-linear pattern in the data. Since the vertical spread of the residuals seems to be the same across the x values, we can assume that the variability is constant.
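The same check can be made against the fitted values instead of Height; with a single explanatory variable the two plots show the same pattern. A sketch, assuming htwt.lmw and res from above:

```r
# Residuals against fitted values; the dashed line marks zero
plot(fitted(htwt.lmw), res, xlab = "Fitted Weight", ylab = "Residual")
abline(h = 0, lty = 2)
```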
If the data were collected by random sampling and we wish to test whether the slope differs from zero, the residuals must be (approximately) normally distributed.
hist(res, xlab="Residual")
qqnorm(res)
qqline(res, lty=2)
Neither the histogram nor the normal quantile plot suggest any reason to question the normality of the residuals. Statistical testing appears to be valid in this situation.
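For a more formal check, the Shapiro-Wilk test of normality is built into R. A sketch, assuming the residuals res from above; a large p-value is consistent with normally distributed residuals:

```r
# Shapiro-Wilk test of the null hypothesis that the residuals are normal
shapiro.test(res)
```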
R generates a number of diagnostic plots automatically when plot() is called on a fitted lm object.
plot(htwt.lmw)
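By default these diagnostic plots are drawn one at a time; arranging them in a 2-by-2 grid is a common convenience. A sketch, assuming the htwt.lmw fit from above:

```r
# Show all four diagnostic plots on one page
par(mfrow = c(2, 2))
plot(htwt.lmw)
par(mfrow = c(1, 1))  # restore the default one-plot layout
```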