This document shows how to use R/Studio to create output that is similar to that shown in Lock^5 (3^\text{rd} Ed.).
Section 9.1
In this section we look at inference for slope and correlation. To do so we need to import some data. We start by looking at the cricket chirp data from the book’s data repository.
Code
### Read the CSV file crickets =read.csv("http://facweb1.redlands.edu/fac/jim_bentley/Data/Lock5Ed3/Lock5Data3eCSV/CricketChirps.csv")### See what is in the "crickets" dataframe to see if it worked crickets
The above seven observations look to be the same as those in the original CSV file.
Suppose that we are interested in predicting Temperature from the number of Chirps that the crickets generate. We can plot the data to see if the relationship appears to be linear.
Code
### Load the lattice package to make plotting easylibrary(lattice)### Now plot the points. Note the use of ~ instead of = in the formulaxyplot(Temperature ~ Chirps, data=crickets)
Although there are not many points, the relationship appears to be fairly linear. R makes getting the regression equation easy.
Code
### Get the regression equation by fitting the linear model (lm)lm(Temperature ~ Chirps, data=crickets)
Call:
lm(formula = Temperature ~ Chirps, data = crickets)
Coefficients:
(Intercept) Chirps
37.6786 0.2307
Code
### Save the linear model for later use crickets.lm =lm(Temperature ~ Chirps, data=crickets)### Check to see if it was saved crickets.lm
Call:
lm(formula = Temperature ~ Chirps, data = crickets)
Coefficients:
(Intercept) Chirps
37.6786 0.2307
Code
### Plot the points and the regression line using p and r for points and regressionxyplot(Temperature ~ Chirps, data=crickets, type=c("p","r"))
The model is Temperature = 37.677 + 0.231(Chirps). What is a 95% confidence interval for \beta_1? Is the slope significantly different from zero? That is, does Chirps add anything to the guess of Temperature? We can let R compute the standard error and t-value that are needed to test H_0: \beta_1 = 0 versus H_A: \beta_1 \ne 0.
Code
### Check that the linear model fitted above is still around crickets.lm
Call:
lm(formula = Temperature ~ Chirps, data = crickets)
Coefficients:
(Intercept) Chirps
37.6786 0.2307
Code
### Now get a summary that contains the coefficients and standard errorssummary(crickets.lm)
Call:
lm(formula = Temperature ~ Chirps, data = crickets)
Residuals:
1 2 3 4 5 6 7
-1.8625 -0.5532 2.0628 1.4495 -0.2785 -1.1598 0.3416
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.67858 1.97817 19.05 7.35e-06 ***
Chirps 0.23067 0.01423 16.21 1.63e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.528 on 5 degrees of freedom
Multiple R-squared: 0.9813, Adjusted R-squared: 0.9776
F-statistic: 262.9 on 1 and 5 DF, p-value: 1.626e-05
Note that the line that starts with Chirps contains an estimate (0.23067) and a standard error (0.01423). These values can be used to create a confidence interval for \beta_1. We note that the degrees of freedom for the t-distribution are df = 7-2 = 5. We now compute a 95% confidence interval using b_1 \pm t^*_{n-2} \cdot SE
Code
### Compute t-star for 5%/2 = 2.5% to the right with df=5qt(0.025, 5, lower.tail=FALSE)
[1] 2.570582
Code
### Compute the endpoints using the information found above0.23067-2.57058*0.01423
[1] 0.1940906
Code
0.23067+2.57058*0.01423
[1] 0.2672494
So, it appears that we can be 95% sure that the slope of the population model, \beta_1, is in the interval (0.194, 0.267). That is, we are 95% sure that the slope of the model to predict temperature from cricket chirp rate is between 0.194 and 0.267 degrees F per chirp.
We now test to see if \beta_1 is equal to zero.
Code
### Compute t and the degrees of freedom (0.23067-0)/0.01423
[1] 16.21012
Code
7-2
[1] 5
Code
### Get the p-value for a two-sided alternative using the right tail of the t distribution for t = 16.2101pt(16.2101, 5 , lower.tail=FALSE)*2
[1] 1.628652e-05
From the above we see that the t-value is t = (0.23067 - 0)/0.01423 = 16.2101 which has df = 7-2 = 5 degrees of freedom. This compares well with R’s value of 16.21.
Using either StatKey or R we can find the p-value by finding the area above 16.21 when there are 5 degrees of freedom. Above we see that R computes the p-value as 0.0000163. This is the same p-value presented in the table under Pr(>|t|) and indicates that We have very strong evidence that slope is different from zero, and that temperature is related to cricket chirp rates.
R also computes the correlation between two variables.
Code
### Compute the correlation between temperature and chirpscor(crickets$Temperature, crickets$Chirps)
[1] 0.9906249
Code
### Cheat and find the correlation between all variables in the dataframecor(crickets)
Temperature Chirps
Temperature 1.0000000 0.9906249
Chirps 0.9906249 1.0000000
To test H_0: \rho = 0 versus H_A: \rho \ne 0 we compute the t statistic t = \frac{r - 0}{\sqrt{\frac{1 - r^2}{n-2}}} = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} which has an approximate t-distribution with n-2 degrees of freedom. For the cricket data this is
Code
### Compute the sample correlation, rcor(crickets$Temperature, crickets$Chirps)
[1] 0.9906249
Code
### Compute t (0.9906249*sqrt(7-2))/sqrt(1-0.99062^2)
[1] 16.21058
This value is the same (within rounding) as the t-statistic computed using the slope. The p-value will be the same as well. R computes these values for us more easily using the cor.test function.
Code
### Use cor.test to get the t-stat and p-valuecor.test(crickets$Temperature, crickets$Chirps)
Pearson's product-moment correlation
data: crickets$Temperature and crickets$Chirps
t = 16.215, df = 5, p-value = 1.626e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.9352953 0.9986740
sample estimates:
cor
0.9906249
It is easy to get the coefficient of determination, R^2 = r^2, using R.
Code
### Find the correlation, r r =cor(crickets$Temperature, crickets$Chirps) r
[1] 0.9906249
Code
### Now square it to get R-squared r^2
[1] 0.9813377
For the cricket data, Chirps explains about 98.1% of the variability in Temperature via the regression line. Stated another way, the proportion of variability in temperature that is explained by the model based on chirp rate is about 0.981. Note that this value corresponds to the “Multiple R-squared” value from the summary table above.
We should check the residuals to make sure that the assumptions that are made when fitting a regression model have been satisfied. Plots of the residuals help with this.
Code
### Replot the data with the regression linexyplot(Temperature ~ Chirps, data=crickets, type=c("p","r"))
Code
### Plot the residuals against the explanatory variableplot(crickets$Chirps, resid(crickets.lm))abline(h=0, lty=2)
Code
### Plot the normal probability plotqqnorm(resid(crickets.lm))qqline(resid(crickets.lm))
With only seven observations it is difficult to say that the assumptions have not been satisfied. There do not appear to be any serious outliers or influential observations. We also do not see any obvious departure from normality.