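We need a data set with several variables. The Lahman package includes a Teams table, from which we can extract one season of team data. (The p_load function used below comes from the pacman package, which is assumed to be loaded already.)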
p_load(Lahman)
data(Teams)
MLB = subset(Teams, yearID == 1961 & lgID %in% c("AL", "NL"))
head(MLB)
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin
## 1360 1961 AL BAL BAL <NA> 3 163 82 95 67 <NA> <NA> N
## 1361 1961 AL BOS BOS <NA> 6 163 82 76 86 <NA> <NA> N
## 1362 1961 AL CHA CHW <NA> 4 163 81 86 76 <NA> <NA> N
## 1363 1961 NL CHN CHC <NA> 7 156 78 64 90 <NA> <NA> N
## 1364 1961 NL CIN CIN <NA> 1 154 77 93 61 <NA> <NA> Y
## 1365 1961 AL CLE CLE <NA> 5 161 81 78 83 <NA> <NA> N
## WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA CG SHO
## 1360 N 691 5481 1393 227 36 149 581 902 39 30 NA NA 588 526 3.22 54 21
## 1361 N 729 5508 1401 251 37 112 647 847 56 36 NA NA 792 687 4.29 35 6
## 1362 N 765 5556 1475 216 46 138 550 612 100 40 NA NA 726 653 4.06 39 3
## 1363 N 689 5344 1364 238 51 176 539 1027 35 25 NA NA 800 689 4.48 34 6
## 1364 N 710 5243 1414 247 35 158 423 761 70 33 NA NA 653 575 3.78 46 12
## 1365 N 737 5609 1493 257 39 150 492 720 34 11 NA NA 752 665 4.15 35 12
## SV IPouts HA HRA BBA SOA E DP FP name
## 1360 33 4413 1226 109 617 926 126 173 0.980 Baltimore Orioles
## 1361 30 4326 1472 167 679 831 143 140 0.977 Boston Red Sox
## 1362 33 4344 1491 158 498 814 128 138 0.980 Chicago White Sox
## 1363 25 4155 1492 165 465 755 183 175 0.970 Chicago Cubs
## 1364 40 4110 1300 147 500 829 134 124 0.977 Cincinnati Reds
## 1365 23 4329 1426 178 599 801 139 142 0.977 Cleveland Indians
## park attendance BPF PPF teamIDBR teamIDlahman45 teamIDretro
## 1360 Memorial Stadium 951089 96 96 BAL BAL BAL
## 1361 Fenway Park II 850589 102 103 BOS BOS BOS
## 1362 Comiskey Park 1146019 99 97 CHW CHA CHA
## 1363 Wrigley Field 673057 101 104 CHC CHN CHN
## 1364 Crosley Field 1117603 102 101 CIN CIN CIN
## 1365 Cleveland Stadium 725547 97 98 CLE CLE CLE
names(MLB)
## [1] "yearID" "lgID" "teamID" "franchID"
## [5] "divID" "Rank" "G" "Ghome"
## [9] "W" "L" "DivWin" "WCWin"
## [13] "LgWin" "WSWin" "R" "AB"
## [17] "H" "X2B" "X3B" "HR"
## [21] "BB" "SO" "SB" "CS"
## [25] "HBP" "SF" "RA" "ER"
## [29] "ERA" "CG" "SHO" "SV"
## [33] "IPouts" "HA" "HRA" "BBA"
## [37] "SOA" "E" "DP" "FP"
## [41] "name" "park" "attendance" "BPF"
## [45] "PPF" "teamIDBR" "teamIDlahman45" "teamIDretro"
MLB$lgID = factor(MLB$lgID)
Pairs plots make it possible to look at the relationships among several variables at the same time. Generally we look at relationships between quantitative variables. However, when one of the variables is qualitative (here lgID), the corresponding panels are parallel dotplots, which are sometimes informative.
pairs(MLB[,c(2,9,10,15,16,17,18,19,20,21)])
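Column positions such as 2, 9, and 10 are easy to miscount, so selecting columns by name can be easier to read. The following is only a minimal sketch: it uses R's built-in mtcars data as a stand-in so that it runs without Lahman, and the vars vector is an illustrative choice. For MLB, the same pattern would use the names lgID, W, L, R, AB, H, X2B, X3B, HR, and BB, which are the columns at the positions used above.

```r
# Stand-in data: mtcars ships with R (an assumption made only for illustration).
vars <- c("mpg", "hp", "wt")   # columns chosen by name, not by position

# Same plot type as above: every pairwise scatterplot in one grid.
pairs(mtcars[, vars])
```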
Two variables are said to be positively associated if, when plotted, one tends to increase as the other increases (or, equivalently, one tends to decrease as the other decreases). If one variable tends to decrease as the other increases, we say the variables are negatively associated.
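One way to make the direction of an association concrete is to multiply each observation's deviation from the mean in one variable by its deviation in the other and add up the products. The sketch below uses small made-up vectors (x, y_up, and y_down are hypothetical, not baseball data): the sum is positive when the variables rise together and negative when one falls as the other rises.

```r
# Made-up illustration data (hypothetical values).
x      <- 1:6
y_up   <- c(2, 3, 5, 4, 7, 8)   # tends to rise with x
y_down <- rev(y_up)             # tends to fall as x rises

# Sum of products of deviations from the mean.
dev_sum <- function(a, b) sum((a - mean(a)) * (b - mean(b)))

dev_sum(x, y_up)     # positive: positive association
dev_sum(x, y_down)   # negative: negative association
```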
A measure of the strength and direction of the linear association between two variables is the correlation. Correlation is essentially the average product of the z-scores of the two variables, where "average" here means dividing the sum of the products by n - 1 rather than n. In R it is easy to compute the correlation between two (or more) variables. To see how the calculation works, we first compute the correlation between hits, H, and runs, R, “by hand.”
zh = (MLB$H - mean(MLB$H))/sqrt(var(MLB$H))
zh
## [1] 0.006959129 0.132223454 1.290918454 -0.447124047 0.335777981
## [6] 1.572763184 1.384866698 -0.791600939 -0.963839385 -0.541072290
## [11] -0.619362493 -0.431466006 1.071705887 -1.997270062 0.868151360
## [16] -0.212253439 0.680254873 -1.339632358
zr = (MLB$R - mean(MLB$R))/sqrt(var(MLB$R))
zr
## [1] -0.4512746 0.1611695 0.7413798 -0.4835086 -0.1450526 0.2901051
## [7] 1.9662681 -0.5802103 0.4029238 0.2578712 -0.1934034 -0.1128187
## [13] 1.7406308 -2.1757885 -0.4029238 0.8703154 -0.2578712 -1.6278121
cor.rh = sum(zh*zr)/(length(zh)-1)
cor.rh
## [1] 0.7119815
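The same calculation can be written more compactly with scale(), which centers each column and divides by its standard deviation (sd(x) is the same quantity as sqrt(var(x))). This is a sketch on R's built-in mtcars data, used here as a stand-in so it runs without Lahman; the choice of mpg and wt is arbitrary.

```r
# Stand-in data: two numeric columns from mtcars (illustrative choice).
z <- scale(mtcars[, c("mpg", "wt")])   # matrix of z-scores, one column per variable

# "Almost average" product of z-scores: divide by n - 1, not n.
r_by_hand <- sum(z[, "mpg"] * z[, "wt"]) / (nrow(z) - 1)

# Should agree with the built-in function up to floating-point error.
all.equal(r_by_hand, cor(mtcars$mpg, mtcars$wt))
```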
Alternatively, we can use the built-in function cor.
# For hits and runs this is
cor(MLB$H,MLB$R)
## [1] 0.7119815
# For all of the variables that were plotted (except lgID) we use
cor(MLB[,c(9,10,15,16,17,18,19,20,21)])
## W L R AB H X2B
## W 1.00000000 -0.969570439 0.80863191 0.320836049 0.7045472 0.016427476
## L -0.96957044 1.000000000 -0.74046762 -0.105680982 -0.6528911 0.007618194
## R 0.80863191 -0.740467616 1.00000000 0.497382881 0.7119815 0.031894294
## AB 0.32083605 -0.105680982 0.49738288 1.000000000 0.5885888 0.283990796
## H 0.70454721 -0.652891137 0.71198146 0.588588835 1.0000000 0.437289268
## X2B 0.01642748 0.007618194 0.03189429 0.283990796 0.4372893 1.000000000
## X3B -0.19259119 0.152371525 -0.19491536 -0.007360244 0.2019594 -0.017286952
## HR 0.60840098 -0.602646208 0.66475810 0.176830482 0.2525095 -0.240654442
## BB 0.09293399 0.062188409 0.35991868 0.439793320 -0.1156499 -0.165058955
## X3B HR BB
## W -0.192591188 0.6084010 0.09293399
## L 0.152371525 -0.6026462 0.06218841
## R -0.194915357 0.6647581 0.35991868
## AB -0.007360244 0.1768305 0.43979332
## H 0.201959439 0.2525095 -0.11564993
## X2B -0.017286952 -0.2406544 -0.16505896
## X3B 1.000000000 -0.4308045 -0.30224067
## HR -0.430804506 1.0000000 0.18589154
## BB -0.302240666 0.1858915 1.00000000
The values computed by hand and using the built-in function are the same. Since 0.7120 is positive, we see that as hits go up runs tend to go up. The magnitude of 0.7120 indicates a moderately strong linear association (while sociology majors might be happy, physics majors would not be impressed).
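A single correlation can also be read out of a correlation matrix by row and column name, and rounding the matrix makes it easier to scan. A minimal sketch, again on the built-in mtcars data as a stand-in (the object name cm and the chosen columns are illustrative):

```r
# Correlation matrix for three stand-in columns.
cm <- cor(mtcars[, c("mpg", "hp", "wt")])

round(cm, 2)      # two decimals is usually enough for reading
cm["mpg", "wt"]   # one entry, selected by row and column name
```

Because the matrix is symmetric, cm["wt", "mpg"] gives the same value.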
Wins and losses are strongly negatively associated (r = -0.9696). This is not surprising: ties aside, every game that is not a win is a loss. At bats are positively associated with runs. This makes sense, but it should be noted that the correlation of 0.4974 is fairly weak.
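To see why the wins and losses correlation is close to, but not exactly, -1, consider the following sketch. The wins and losses_actual values are copied from the six rows printed by head(MLB) above; losses_fixed is a hypothetical column that assumes every team played exactly 162 decided games. Under that assumption losses are an exact linear function of wins, so the correlation is -1. In the real data the number of decided games differs from team to team, which loosens the relationship slightly.

```r
# First six teams from head(MLB) above.
wins          <- c(95, 76, 86, 64, 93, 78)
losses_actual <- c(67, 86, 76, 90, 61, 83)

# Hypothetical: every team plays exactly 162 decided games.
losses_fixed <- 162 - wins

cor(wins, losses_fixed)    # exact linear relationship, so -1 (up to rounding error)
cor(wins, losses_actual)   # still strongly negative, but not exactly -1
```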