This document contains R code to reproduce the plots, tables, and statistics presented in Jim Albert’s Teaching Statistics Using Baseball, 2nd Ed. Some of the data are taken from sources other than the author’s web site or his tsub package for R.
The following code usually goes in the setup chunk of the .Rmd file. It is included here so that you can read it even if you don’t have access to the .Rmd source for this document.
### Install or load the pacman library
if (!require(pacman)){
install.packages("pacman", dependencies = TRUE)
require(pacman)
}
p_load(dplyr) ### Used by Albert's tsub package
### Install or load the tsub package that contains a few of Albert's functions and all of the book's datasets
if (!require(tsub)){
p_load(devtools) ### Install or load the devtools package
install_github("bayesball/tsub") ### Install tsub from github
p_load(tsub) ### Load tsub
}
p_load(Lahman) ### Contains the Lahman data
p_load(lattice) ### Load the lattice graphics package
p_load(ggplot2) ### Load Hadley's ggplot graphics package
p_load(lubridate) ### Used to modify dates. Things like getting the year from mm-dd-yyyy, etc.
p_load(aplpack) ### Pretty stemplots
Sean Lahman maintains a site at https://www.seanlahman.com/baseball-archive/statistics that provides his data in various formats and describes the tables that make up the database. Documentation for the data can be found at https://www.seanlahman.com/files/database/readme2021.txt. The data are also available in the Lahman package for R.
The first step is to install and load the Lahman package.
# p_load is in the pacman package which is installed above
p_load(Lahman)
The Lahman package contains a number of tables and a few functions that help subset the data. For now we will work with Henderson’s batting data.
### Get the Batting data.frame ###
data(Batting)
### Figure out what it looks like ###
head(Batting)
## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB SO
## 1 abercda01 1871 1 TRO NA 1 4 0 0 0 0 0 0 0 0 0 0
## 2 addybo01 1871 1 RC1 NA 25 118 30 32 6 0 0 13 8 1 4 0
## 3 allisar01 1871 1 CL1 NA 29 137 28 40 4 5 0 19 3 1 2 5
## 4 allisdo01 1871 1 WS3 NA 27 133 28 44 10 2 2 27 1 1 0 2
## 5 ansonca01 1871 1 RC1 NA 25 120 29 39 11 3 0 16 6 2 2 1
## 6 armstbo01 1871 1 FW1 NA 12 49 9 11 2 1 0 5 0 1 0 1
## IBB HBP SH SF GIDP
## 1 NA NA NA NA 0
## 2 NA NA NA NA 0
## 3 NA NA NA NA 1
## 4 NA NA NA NA 0
## 5 NA NA NA NA 0
## 6 NA NA NA NA 0
### Some of the Lahman functions require dplyr ###
p_load(dplyr)
Albert’s text works with Rickey Henderson’s 25 seasons. It would be nice if we could pull this information out of the Lahman data to confirm the values and compute a few statistics.
As a first step, we need to find Henderson’s player information.
data(Master)
## Warning in data(Master): data set 'Master' not found
Recent versions of the Lahman package renamed the Master table to People, hence the warning. We can ignore it here because playerInfo() does the lookup for us.
playerInfo(nameLast = "Henderson")
## playerID nameFirst nameLast
## 7931 hendebe01 Bernie Henderson
## 7932 hendebi01 Bill Henderson
## 7933 hendebi99 Bill Henderson
## 7934 hendeda01 Dave Henderson
## 7935 hendeed01 Ed Henderson
## 7936 hendeha01 Hardie Henderson
## 7937 hendeji01 Jim Henderson
## 7938 hendejo01 Joe Henderson
## 7939 hendeke01 Ken Henderson
## 7940 henderi01 Rickey Henderson
## 7941 hendero01 Rod Henderson
## 7942 hendest01 Steve Henderson
playerInfo(nameFirst = "Rickey")
## playerID nameFirst nameLast
## 3298 clarkri01 Rickey Clark
## 3846 cradlri01 Rickey Cradle
## 7940 henderi01 Rickey Henderson
## 9487 keetori01 Rickey Keeton
playerInfo("henderi01")
## playerID nameFirst nameLast
## 7940 henderi01 Rickey Henderson
Knowing that Rickey Henderson is “henderi01” means that we can subset his information.
batting = battingStats()
Henderson = batting[batting$playerID=="henderi01",] ### Index batting (not Batting) -- battingStats() keeps the same row order
head(Henderson)
## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS
## 56777 henderi01 1979 1 OAK AL 89 351 49 96 13 3 1 26 33 11
## 57733 henderi01 1980 1 OAK AL 158 591 111 179 22 4 9 53 100 26
## 58676 henderi01 1981 1 OAK AL 108 423 89 135 18 7 6 35 56 22
## 59632 henderi01 1982 1 OAK AL 149 536 119 143 24 4 10 51 130 42
## 60637 henderi01 1983 1 OAK AL 145 513 105 150 25 7 9 48 108 19
## 61632 henderi01 1984 1 OAK AL 142 502 113 147 27 4 16 58 66 18
## BB SO IBB HBP SH SF GIDP BA PA TB SlugPct OBP OPS BABIP
## 56777 34 39 0 2 8 3 4 0.274 398 118 0.336 0.338 0.674 0.303
## 57733 117 54 7 5 6 3 6 0.303 722 236 0.399 0.420 0.819 0.320
## 58676 64 68 4 2 0 4 7 0.319 493 185 0.437 0.408 0.845 0.365
## 59632 116 94 1 2 0 2 5 0.267 656 205 0.382 0.398 0.780 0.306
## 60637 103 80 8 4 1 1 11 0.292 622 216 0.421 0.414 0.835 0.332
## 61632 86 81 1 5 1 3 7 0.293 597 230 0.458 0.399 0.857 0.321
dim(Henderson)
## [1] 29 29
table(Henderson$yearID)
##
## 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994
## 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1
## 1995 1996 1997 1998 1999 2000 2001 2002 2003
## 1 1 2 1 1 2 1 1 1
These values do not correspond to the yearly values in Table 1.1 of Albert’s book because the years 1989, 1993, 1997, and 2000 contain separate rows (stints) for two teams. Albert’s table combines the stints within each year.
At this point, we will just import Albert’s Table 1.1 data. Later we will see how to use R to compute yearly information. The first method is to import Albert’s comma separated value (CSV) file. These files can either be local (I have mine stored in a folder CSV within the same folder as this RMD file) or we can find the file on the web. Both methods are shown below.
### Read the local file
Henderson.local <- read.csv("CSV/case_1_1.csv")
head(Henderson.local, 13)
## Year Age Tm G AB R H X2B X3B HR RBI SB CS BB SO BA OBP SLG
## 1 1979 20 OAK 89 351 49 96 13 3 1 26 33 11 34 39 0.274 0.338 0.336
## 2 1980 21 OAK 158 591 111 179 22 4 9 53 100 26 117 54 0.303 0.420 0.399
## 3 1981 22 OAK 108 423 89 135 18 7 6 35 56 22 64 68 0.319 0.408 0.437
## 4 1982 23 OAK 149 536 119 143 24 4 10 51 130 42 116 94 0.267 0.398 0.382
## 5 1983 24 OAK 145 513 105 150 25 7 9 48 108 19 103 80 0.292 0.414 0.421
## 6 1984 25 OAK 142 502 113 147 27 4 16 58 66 18 86 81 0.293 0.399 0.458
## 7 1985 26 NYY 143 547 146 172 28 5 24 72 80 10 99 65 0.314 0.419 0.516
## 8 1986 27 NYY 153 608 130 160 31 5 28 74 87 18 89 81 0.263 0.358 0.469
## 9 1987 28 NYY 95 358 78 104 17 3 17 37 41 8 80 52 0.291 0.423 0.497
## 10 1988 29 NYY 140 554 118 169 30 2 6 50 93 13 82 54 0.305 0.394 0.399
## 11 1989 30 TOT 150 541 113 148 26 3 12 57 77 14 126 68 0.274 0.411 0.399
## 12 1990 31 OAK 136 489 119 159 33 3 28 61 65 10 97 60 0.325 0.439 0.577
## 13 1991 32 OAK 134 470 105 126 17 1 18 57 58 18 98 73 0.268 0.400 0.423
## OPS
## 1 0.675
## 2 0.820
## 3 0.845
## 4 0.780
## 5 0.835
## 6 0.857
## 7 0.934
## 8 0.827
## 9 0.920
## 10 0.793
## 11 0.810
## 12 1.016
## 13 0.823
tail(Henderson.local, 13)
## Year Age Tm G AB R H X2B X3B HR RBI SB CS BB SO BA OBP SLG
## 13 1991 32 OAK 134 470 105 126 17 1 18 57 58 18 98 73 0.268 0.400 0.423
## 14 1992 33 OAK 117 396 77 112 18 3 15 46 48 11 95 56 0.283 0.426 0.457
## 15 1993 34 TOT 134 481 114 139 22 2 21 59 53 8 120 65 0.289 0.432 0.474
## 16 1994 35 OAK 87 296 66 77 13 0 6 20 22 7 72 45 0.260 0.411 0.365
## 17 1995 36 OAK 112 407 67 122 31 1 9 54 32 10 72 66 0.300 0.407 0.447
## 18 1996 37 SDP 148 465 110 112 17 2 9 29 37 15 125 90 0.241 0.410 0.344
## 19 1997 38 TOT 120 403 84 100 14 0 8 34 45 8 97 85 0.248 0.400 0.342
## 20 1998 39 OAK 152 542 101 128 16 1 14 57 66 13 118 114 0.236 0.376 0.347
## 21 1999 40 NYM 121 438 89 138 30 0 12 42 37 14 82 82 0.315 0.423 0.466
## 22 2000 41 TOT 123 420 75 98 14 2 4 32 36 11 88 75 0.233 0.368 0.305
## 23 2001 42 SDP 123 379 70 86 17 3 8 42 25 7 81 84 0.227 0.366 0.351
## 24 2002 43 BOS 72 179 40 40 6 1 5 16 8 2 38 47 0.223 0.369 0.352
## 25 2003 44 LAD 30 72 7 15 1 0 2 5 3 0 11 16 0.208 0.321 0.306
## OPS
## 13 0.823
## 14 0.883
## 15 0.906
## 16 0.776
## 17 0.855
## 18 0.754
## 19 0.742
## 20 0.723
## 21 0.889
## 22 0.673
## 23 0.717
## 24 0.721
## 25 0.627
dim(Henderson.local)
## [1] 25 19
### Read from the web
Henderson.web <- read.csv("http://facweb1.redlands.edu/fac/jim_bentley/Data/FYS28/TSUB2/CSV/case_1_1.csv")
head(Henderson.web)
## Year Age Tm G AB R H X2B X3B HR RBI SB CS BB SO BA OBP SLG
## 1 1979 20 OAK 89 351 49 96 13 3 1 26 33 11 34 39 0.274 0.338 0.336
## 2 1980 21 OAK 158 591 111 179 22 4 9 53 100 26 117 54 0.303 0.420 0.399
## 3 1981 22 OAK 108 423 89 135 18 7 6 35 56 22 64 68 0.319 0.408 0.437
## 4 1982 23 OAK 149 536 119 143 24 4 10 51 130 42 116 94 0.267 0.398 0.382
## 5 1983 24 OAK 145 513 105 150 25 7 9 48 108 19 103 80 0.292 0.414 0.421
## 6 1984 25 OAK 142 502 113 147 27 4 16 58 66 18 86 81 0.293 0.399 0.458
## OPS
## 1 0.675
## 2 0.820
## 3 0.845
## 4 0.780
## 5 0.835
## 6 0.857
dim(Henderson.web)
## [1] 25 19
Albert has made it easy to load all the data that is used in TSUB2. By loading his tsub package, all of the case and exercise data are made available.
p_load(devtools) ### Install or load the devtools package
install_github("bayesball/tsub") ### Install tsub from github
## Skipping install of 'tsub' from a github remote, the SHA1 (563eded6) has not changed since last install.
## Use `force = TRUE` to force installation
p_load(tsub) ### Load tsub
Now that the tsub package is installed, we can call the different datasets.
Henderson.tsub <- case_1_1
head(Henderson.tsub)
## Year Age Tm G AB R H X2B X3B HR RBI SB CS BB SO BA OBP SLG
## 1 1979 20 OAK 89 351 49 96 13 3 1 26 33 11 34 39 0.274 0.338 0.336
## 2 1980 21 OAK 158 591 111 179 22 4 9 53 100 26 117 54 0.303 0.420 0.399
## 3 1981 22 OAK 108 423 89 135 18 7 6 35 56 22 64 68 0.319 0.408 0.437
## 4 1982 23 OAK 149 536 119 143 24 4 10 51 130 42 116 94 0.267 0.398 0.382
## 5 1983 24 OAK 145 513 105 150 25 7 9 48 108 19 103 80 0.292 0.414 0.421
## 6 1984 25 OAK 142 502 113 147 27 4 16 58 66 18 86 81 0.293 0.399 0.458
## OPS
## 1 0.675
## 2 0.820
## 3 0.845
## 4 0.780
## 5 0.835
## 6 0.857
These values compare well with those presented in Table 1.1 of the text and the data as imported above.
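We can also spot-check the agreement programmatically (a quick sketch comparing a single column from the two imports):
all.equal(Henderson.web$OPS, Henderson.tsub$OPS) ### TRUE if the web and tsub values agree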
TSUB2 provides a dotplot of Henderson’s OBP during his first 23 years. We can create this plot in a few ways.
### First use base graphics
stripchart(case_1_1$OBP)
### Use the lattice library. cex is the character size multiplier
### Using pch=1 chooses open circles which better show overlapped data.
p_load(lattice)
dotplot(~OBP, data=case_1_1, cex=1.25, pch=1)
stripplot(~OBP, data=case_1_1, cex=1.25, pch=1, jitter.data=TRUE, factor=1.5)
### Albert likes the minitab version of the plotrix dotplot
p_load(plotrix)
dotplot.mtb(case_1_1$OBP)
These plots are similar to Figure 1.1 in the text.
The text compares Henderson’s home run production (HR) against his doubles (X2B) using a scatterplot. There are multiple methods for generating scatterplots in R. We will look at three.
### Using base R we get ###
plot(case_1_1$X2B, case_1_1$HR, xlab="Doubles", ylab="Home Runs")
### Using lattice graphics
p_load(lattice)
xyplot(HR~X2B, data=case_1_1, xlab="Doubles", ylab="Home Runs", main="Henderson")
xyplot(HR~X2B, data=case_1_1, xlab="Doubles", ylab="Home Runs", xlim=c(0:35), ylim=c(0:35), aspect=1)
### Use ggplot ###
p_load(ggplot2)
p = ggplot(case_1_1, aes(X2B, HR)) ### Define data.frame and variables
p + geom_point() ### Request a point plot
p + geom_point() + theme_bw() + xlab("Doubles") + ylab("Home Runs")
Henderson’s hitting statistics for 1990 (and his other seasons) can be computed by combining his stints within each year.
### Grab Henderson's batting data and get totals by year using dplyr and piping to make it easy
p_load(dplyr)
Henderson.total <- Batting %>% ### Get all of the batting data
filter(playerID == "henderi01") %>% ### Subset for Henderson
group_by(yearID) %>% ### Work on year subsets from here on out
summarize(
teams = n(),
G = sum(G),
AB = sum(AB),
R = sum(R),
H = sum(H),
X2B = sum(X2B),
X3B = sum(X3B),
HR = sum(HR),
RBI = sum(RBI),
SB = sum(SB),
CS = sum(CS),
BB = sum(BB),
SO = sum(SO),
IBB = sum(IBB),
HBP = sum(HBP),
SH = sum(SH),
SF = sum(SF),
GIDP = sum(GIDP)
)
names(Henderson.total)
## [1] "yearID" "teams" "G" "AB" "R" "H" "X2B" "X3B"
## [9] "HR" "RBI" "SB" "CS" "BB" "SO" "IBB" "HBP"
## [17] "SH" "SF" "GIDP"
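The column-by-column sums above can be written more compactly with dplyr's across() helper (a sketch; it assumes dplyr 1.0 or later and should give the same totals):
Henderson.across <- Batting %>%
  filter(playerID == "henderi01") %>%
  group_by(yearID) %>%
  summarize(teams = n(),
            across(c(G, AB, R, H, X2B, X3B, HR, RBI, SB, CS,
                     BB, SO, IBB, HBP, SH, SF, GIDP), sum))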
### Now add computed performance statistics for each year... e.g. BA, OBP, etc.
Henderson.total <- battingStats(Henderson.total)
### Check the data against Albert's Table 1.1
Henderson.total %>%
filter(teams == 2)
## yearID teams G AB R H X2B X3B HR RBI SB CS BB SO IBB HBP SH SF GIDP
## 1 1989 2 150 541 113 148 26 3 12 57 77 14 126 68 5 3 0 4 8
## 2 1993 2 134 481 114 139 22 2 21 59 53 8 120 65 7 4 1 4 9
## 3 1997 2 120 403 84 100 14 0 8 34 45 8 97 85 2 6 1 2 10
## 4 2000 2 123 420 75 98 14 2 4 32 36 11 88 75 1 4 3 4 11
## BA PA TB SlugPct OBP OPS BABIP
## 1 0.274 674 216 0.399 0.411 0.810 0.292
## 2 0.289 610 228 0.474 0.432 0.906 0.296
## 3 0.248 509 138 0.342 0.400 0.742 0.295
## 4 0.233 519 128 0.305 0.368 0.673 0.272
The values for the four years when Henderson played for two teams match those in Table 1.1.
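As a further check, we can compare the computed totals with Albert's Table 1.1 import column by column (a sketch; both data frames are sorted by year as constructed above):
all.equal(Henderson.total$SB, Henderson.local$SB) ### TRUE if the stolen base totals agree
all.equal(Henderson.total$HR, Henderson.local$HR) ### TRUE if the home run totals agree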
Albert uses only the 1990 season when creating Henderson’s spinner. We can do the same.
spinner <- Henderson.total %>% ### Grab the total batting data from above
filter(yearID == 1990) %>% ### Keep only the 1990 season
mutate( ### Create a few variables
X1B = H - (X2B + X3B + HR), ### Singles are hits minus doubles, trips, and homers
Out = AB - (H + BB), ### Outs are at bats that aren't hits or walks
Walk = BB ### Walks are base on balls
)
### Check the counts
spinner
## yearID teams G AB R H X2B X3B HR RBI SB CS BB SO IBB HBP SH SF GIDP
## 1 1990 1 136 489 119 159 33 3 28 61 65 10 97 60 2 4 2 2 13
## BA PA TB SlugPct OBP OPS BABIP X1B Out Walk
## 1 0.325 594 282 0.577 0.439 1.016 0.325 95 233 97
### Probabilities are counts divided by at bats... sort of. We ignore HBP, interference, etc.
spinner.prob = spinner[c("X1B","X2B","X3B","HR","Walk","Out")] ### Keep only those outcomes we care about
spinner.prob = spinner.prob/sum(spinner.prob)
spinner.prob
## X1B X2B X3B HR Walk Out
## 1 0.194274 0.06748466 0.006134969 0.05725971 0.198364 0.4764826
sum(spinner.prob)
## [1] 1
### Turn the data frame into a vector so that we can plot it
dim(spinner.prob)
## [1] 1 6
spinner.prob = as.vector(t(spinner.prob))
dim(spinner.prob)
## NULL
length(spinner.prob)
## [1] 6
spinner.prob
## [1] 0.194274029 0.067484663 0.006134969 0.057259714 0.198364008 0.476482618
### Create Figure 1.3
pie(spinner.prob,labels=c("Single", "Double","Triple","Home Run","Walk","Out"))
Now, Albert has created a nice spinner, but he hasn’t used it. Let’s see what we get for Henderson if we let him bat (randomly) 179 times, the number of at bats that he had in 2002.
outcomes = c("X1B","X2B","X3B","HR","Walk","Out") ### Create a list of outcomes with probs in spinner.prob
spin2002 = sample(outcomes, 179, replace=TRUE, prob=spinner.prob) ### Sample from outcomes using spinner.probs
spin2002 = factor(spin2002,levels=c("X1B","X2B","X3B","HR","Walk","Out")) ### Make the data categorical
table(spin2002) ### Compute the number of each type of outcome
## spin2002
## X1B X2B X3B HR Walk Out
## 33 11 0 18 36 81
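Because sample() draws at random, the counts above will change from run to run. For reproducible spins, we could set a seed before sampling (a quick sketch; the seed value is arbitrary):
set.seed(1990) ### Fix the random number stream so the simulated season is reproducible
spin2002 = sample(outcomes, 179, replace=TRUE, prob=spinner.prob)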
### List the stats for 2002 and 2003
Henderson.total %>%
filter(yearID >= 2002) %>%
mutate(
Walk = BB + IBB + HBP,
X1B = H - (X2B + X3B + HR),
Out = AB - (H + Walk)
) %>%
select(yearID, AB, H, X1B, X2B, X3B, HR, Walk, Out)
## yearID AB H X1B X2B X3B HR Walk Out
## 1 2002 179 40 28 6 1 5 42 97
## 2 2003 72 15 12 1 0 2 12 45
We have already computed a few measures of a batter’s performance using functions defined by others. In the previous section, we computed a few of these by hand. We can confirm that the functions and raw code provide the same results.
### Slugging percentage
Hend7983 = Henderson.total %>%
filter(yearID <= 1983) %>% ### Grab Henderson's 1979 to 1983 years
mutate( ### Create a few batting stats
x1b = H - (X2B + X3B + HR),
tb = 1*x1b + 2*X2B + 3*X3B + 4*HR,
slg = round(tb/AB,3)
)
temp <- Henderson.total %>%
filter(yearID <= 1983) %>%
select(yearID, SlugPct)
cbind(head(Hend7983[,c(1,6:9,27:29)]), temp$SlugPct)
## yearID H X2B X3B HR x1b tb slg temp$SlugPct
## 1 1979 96 13 3 1 79 118 0.336 0.336
## 2 1980 179 22 4 9 144 236 0.399 0.399
## 3 1981 135 18 7 6 104 185 0.437 0.437
## 4 1982 143 24 4 10 105 205 0.382 0.382
## 5 1983 150 25 7 9 109 216 0.421 0.421
Similar computations can be used to compute pitching statistics. We look at Orel Hershiser’s AL years.
### Start a search for Hershiser's player ID
playerInfo("hers")
## playerID nameFirst nameLast
## 5420 etherse01 Seth Etherton
## 8110 hershea01 Earl Hersh
## 8111 hershfr01 Frank Hershey
## 8112 hershmi01 Mike Hershberger
## 8113 hershor01 Orel Hershiser
## 8114 hershwi01 Willard Hershberger
### Figure out what's in the pitching data.frame
names(Pitching)
## [1] "playerID" "yearID" "stint" "teamID" "lgID" "W"
## [7] "L" "G" "GS" "CG" "SHO" "SV"
## [13] "IPouts" "H" "ER" "HR" "BB" "SO"
## [19] "BAOpp" "ERA" "IBB" "WP" "HBP" "BK"
## [25] "BFP" "GF" "R" "SH" "SF" "GIDP"
### Subset out Hershiser's AL years
HershAL = Pitching %>% ### Grab all the pitching data
filter(playerID == "hershor01" & lgID == "AL") ### subset hor Orel and AL
head(HershAL)
## playerID yearID stint teamID lgID W L G GS CG SHO SV IPouts H ER HR BB
## 1 hershor01 1995 1 CLE AL 16 6 26 26 1 1 0 502 151 72 21 51
## 2 hershor01 1996 1 CLE AL 15 9 33 33 1 0 0 618 238 97 21 58
## 3 hershor01 1997 1 CLE AL 14 6 32 32 1 0 0 586 199 97 26 69
## SO BAOpp ERA IBB WP HBP BK BFP GF R SH SF GIDP
## 1 111 0.244 3.87 1 3 5 0 683 0 76 3 4 26
## 2 125 0.287 4.24 4 11 12 1 908 0 115 5 4 24
## 3 107 0.272 4.47 2 11 11 0 826 0 105 6 8 31
### Compute his effective "complete" innings pitched based upon total outs
HershAL$ip = HershAL$IPouts/3
### Convert innings to games
HershAL$gp = HershAL$ip/9
### As indicated in TSUB, ERA is earned runs divided by equivalent nine-inning games pitched
HershAL$era = round(HershAL$ER/HershAL$gp, 4)
### Now compare
HershAL[,c("yearID","teamID","IPouts","ER","ip","gp","ERA","era")]
## yearID teamID IPouts ER ip gp ERA era
## 1 1995 CLE 502 72 167.3333 18.59259 3.87 3.8725
## 2 1996 CLE 618 97 206.0000 22.88889 4.24 4.2379
## 3 1997 CLE 586 97 195.3333 21.70370 4.47 4.4693
Not so amazingly, the internal value, ERA, is equal to our computed value, era.
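We can check the equality programmatically as well. ERA is stored rounded to two decimal places, so we round our hand computation to match (a quick sketch):
all.equal(round(HershAL$era, 2), HershAL$ERA) ### TRUE if the rounded values agree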
R simplifies the generation of the graphs and statistics that are presented in the discussion of Case 2.1. The data from the 2014 MLB season are available from Lahman.
p_load(Lahman, dplyr)
MLB2014 = Teams %>% ### Grab the Teams dataframe
filter(yearID==2014 & lgID %in% c("AL","NL")) %>% ### Subset 2014 and AL or NL
arrange(name) ### Sort by team name
### Note: %in% checks if a value is in a list
head(MLB2014)
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin LgWin
## 1 2014 NL ARI ARI W 5 162 81 64 98 N N N
## 2 2014 NL ATL ATL E 2 162 81 79 83 N N N
## 3 2014 AL BAL BAL E 1 162 81 96 66 Y N N
## 4 2014 AL BOS BOS E 5 162 81 71 91 N N N
## 5 2014 NL CHN CHC C 5 162 81 73 89 N N N
## 6 2014 AL CHA CHW C 4 162 81 73 89 N N N
## WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA CG SHO SV
## 1 N 615 5552 1379 259 47 118 398 1165 86 33 43 36 742 683 4.26 2 4 35
## 2 N 573 5468 1316 240 22 123 472 1369 95 33 43 27 597 547 3.38 5 13 54
## 3 N 705 5596 1434 264 16 211 401 1285 44 20 62 36 593 557 3.43 3 13 53
## 4 N 634 5551 1355 282 20 123 535 1337 63 25 68 52 715 653 4.01 3 7 36
## 5 N 614 5508 1315 270 31 157 442 1477 65 40 54 41 707 636 3.91 1 11 37
## 6 N 660 5543 1400 279 32 155 417 1362 85 36 60 38 758 687 4.29 3 6 36
## IPouts HA HRA BBA SOA E DP FP name
## 1 4333 1467 154 469 1278 101 147 0.983 Arizona Diamondbacks
## 2 4365 1369 121 472 1301 85 143 0.986 Atlanta Braves
## 3 4384 1342 151 472 1174 87 156 0.986 Baltimore Orioles
## 4 4397 1458 154 482 1213 92 155 0.985 Boston Red Sox
## 5 4390 1398 115 504 1311 103 137 0.983 Chicago Cubs
## 6 4323 1468 140 557 1152 107 170 0.982 Chicago White Sox
## park attendance BPF PPF teamIDBR teamIDlahman45
## 1 Chase Field 2073730 102 102 ARI ARI
## 2 Turner Field 2354305 99 99 ATL ATL
## 3 Oriole Park at Camden Yards 2464473 100 100 BAL BAL
## 4 Fenway Park II 2956089 102 101 BOS BOS
## 5 Wrigley Field 2652113 103 104 CHC CHN
## 6 U.S. Cellular Field 1650821 100 101 CHW CHA
## teamIDretro
## 1 ARI
## 2 ATL
## 3 BAL
## 4 BOS
## 5 CHN
## 6 CHA
names(MLB2014)
## [1] "yearID" "lgID" "teamID" "franchID"
## [5] "divID" "Rank" "G" "Ghome"
## [9] "W" "L" "DivWin" "WCWin"
## [13] "LgWin" "WSWin" "R" "AB"
## [17] "H" "X2B" "X3B" "HR"
## [21] "BB" "SO" "SB" "CS"
## [25] "HBP" "SF" "RA" "ER"
## [29] "ERA" "CG" "SHO" "SV"
## [33] "IPouts" "HA" "HRA" "BBA"
## [37] "SOA" "E" "DP" "FP"
## [41] "name" "park" "attendance" "BPF"
## [45] "PPF" "teamIDBR" "teamIDlahman45" "teamIDretro"
While we could compute statistics like team batting average from hits and at bats, we can also import the data from the text. In this case we pull it from the class website.
### Read the CSV file from the web
case.2.1 = read.csv("http://facweb1.redlands.edu/fac/jim_bentley/Data/FYS28/TSUB2/CSV/case_2_1.csv")
names(case.2.1) ### Get the dataframe column names
## [1] "Team" "G" "AB" "PA" "H" "X1B" "X2B" "X3B" "HR" "R"
## [11] "RBI" "BB" "IBB" "SO" "HBP" "SF" "SH" "GDP" "SB" "CS"
## [21] "AVG"
case.2.1 ### Look at the dataframe
## Team G AB PA H X1B X2B X3B HR R RBI BB IBB SO HBP
## 1 Tigers 1961 5630 6202 1557 1051 325 26 155 757 731 443 51 1144 44
## 2 Rockies 2120 5612 6164 1551 1017 307 41 186 755 721 397 39 1281 48
## 3 Dodgers 2163 5560 6231 1476 1002 302 38 134 718 686 519 30 1246 61
## 4 Royals 1788 5545 6058 1456 1046 286 29 95 651 604 380 22 985 53
## 5 Pirates 2220 5536 6224 1436 975 275 30 156 682 659 520 46 1244 78
## 6 Angels 1929 5652 6284 1464 974 304 31 155 773 729 492 42 1266 60
## 7 Blue Jays 2060 5549 6167 1435 952 282 24 177 723 690 502 27 1151 41
## 8 Rangers 1801 5460 6026 1400 1001 260 28 111 637 597 417 37 1162 61
## 9 Orioles 1901 5596 6130 1434 943 264 16 211 705 681 401 29 1285 62
## 10 Giants 2140 5523 6087 1407 976 257 42 132 665 636 427 37 1245 43
## 11 Twins 1785 5567 6233 1412 941 316 27 128 715 675 544 29 1329 53
## 12 Nationals 1913 5542 6216 1403 959 265 27 152 686 635 517 29 1304 56
## 13 Indians 1860 5575 6222 1411 962 284 23 142 669 644 504 24 1189 42
## 14 Cardinals 2069 5426 6086 1371 970 275 21 105 619 585 471 28 1133 86
## 15 Marlins 1963 5538 6185 1399 987 254 36 122 645 614 501 49 1419 35
## 16 White Sox 1817 5543 6077 1400 934 279 32 155 660 625 417 33 1362 60
## 17 Brewers 2008 5462 6065 1366 891 297 28 150 650 617 423 32 1197 73
## 18 Diamondbacks 1958 5552 6089 1379 955 259 47 118 615 573 398 31 1165 43
## 19 Rays 1915 5516 6205 1361 957 263 24 117 612 586 527 31 1124 66
## 20 Yankees 1786 5497 6082 1349 929 247 26 147 633 591 452 16 1133 56
## 21 Athletics 2003 5545 6245 1354 922 253 33 146 729 686 586 34 1104 49
## 22 Red Sox 1733 5551 6226 1355 930 282 20 123 634 601 535 36 1337 68
## 23 Mariners 1928 5450 5977 1328 913 247 32 136 634 600 396 33 1232 60
## 24 Phillies 1987 5603 6198 1356 953 251 27 125 619 584 443 42 1306 55
## 25 Astros 1865 5447 6055 1317 895 240 19 163 629 596 495 27 1442 55
## 26 Braves 1928 5468 6064 1316 931 240 22 123 573 545 472 31 1369 43
## 27 Cubs 2009 5508 6102 1315 857 270 31 157 614 590 442 29 1477 54
## 28 Mets 2135 5472 6145 1306 887 275 19 125 629 602 516 42 1264 54
## 29 Reds 1926 5395 5978 1282 877 254 20 131 595 562 415 22 1252 52
## 30 Padres 2148 5294 5905 1199 836 224 30 109 535 500 468 27 1294 41
## SF SH GDP SB CS AVG
## 1 61 24 137 106 41 0.277
## 2 48 59 121 85 48 0.276
## 3 43 47 120 138 50 0.265
## 4 47 33 131 153 36 0.263
## 5 35 54 127 104 47 0.259
## 6 54 26 112 81 39 0.259
## 7 40 35 128 78 21 0.259
## 8 45 41 148 105 59 0.256
## 9 36 35 112 44 20 0.256
## 10 49 45 113 56 27 0.255
## 11 44 25 97 99 36 0.254
## 12 41 60 115 101 23 0.253
## 13 49 51 126 104 27 0.253
## 14 39 64 140 57 32 0.253
## 15 39 71 143 58 21 0.253
## 16 38 19 127 85 36 0.253
## 17 37 70 137 102 43 0.250
## 18 36 56 115 86 33 0.248
## 19 53 43 135 63 27 0.247
## 20 47 29 111 112 26 0.245
## 21 43 19 118 83 20 0.244
## 22 52 20 138 63 25 0.244
## 23 34 35 113 96 42 0.244
## 24 37 59 95 109 26 0.242
## 25 36 22 122 122 37 0.242
## 26 27 53 121 95 33 0.241
## 27 41 57 94 65 40 0.239
## 28 44 59 112 101 34 0.239
## 29 37 76 88 122 52 0.238
## 30 45 56 118 91 34 0.226
case.2.1[order(case.2.1$AVG),c("Team","AVG")] ### Rows are sorted by AVG, columns are only Team and AVG
## Team AVG
## 30 Padres 0.226
## 29 Reds 0.238
## 27 Cubs 0.239
## 28 Mets 0.239
## 26 Braves 0.241
## 24 Phillies 0.242
## 25 Astros 0.242
## 21 Athletics 0.244
## 22 Red Sox 0.244
## 23 Mariners 0.244
## 20 Yankees 0.245
## 19 Rays 0.247
## 18 Diamondbacks 0.248
## 17 Brewers 0.250
## 12 Nationals 0.253
## 13 Indians 0.253
## 14 Cardinals 0.253
## 15 Marlins 0.253
## 16 White Sox 0.253
## 11 Twins 0.254
## 10 Giants 0.255
## 8 Rangers 0.256
## 9 Orioles 0.256
## 5 Pirates 0.259
## 6 Angels 0.259
## 7 Blue Jays 0.259
## 4 Royals 0.263
## 3 Dodgers 0.265
## 2 Rockies 0.276
## 1 Tigers 0.277
### Or, use the MLB2014 data taken from the Lahman Teams data
case_2.1 = MLB2014 ### Make a copy of MLB2014
case_2.1$avg = case_2.1$H/case_2.1$AB ### Create a new variable/column avg using H and AB
case_2.1[order(case_2.1$avg),c("name","avg")] ### Rows are sorted by avg, columns are only name and avg
## name avg
## 23 San Diego Padres 0.2264828
## 7 Cincinnati Reds 0.2376274
## 18 New York Mets 0.2386696
## 5 Chicago Cubs 0.2387436
## 2 Atlanta Braves 0.2406730
## 11 Houston Astros 0.2417845
## 21 Philadelphia Phillies 0.2420132
## 25 Seattle Mariners 0.2436697
## 4 Boston Red Sox 0.2441002
## 20 Oakland Athletics 0.2441839
## 19 New York Yankees 0.2454066
## 27 Tampa Bay Rays 0.2467368
## 1 Arizona Diamondbacks 0.2483790
## 16 Milwaukee Brewers 0.2500915
## 6 Chicago White Sox 0.2525708
## 15 Miami Marlins 0.2526183
## 26 St. Louis Cardinals 0.2526723
## 8 Cleveland Indians 0.2530942
## 30 Washington Nationals 0.2531577
## 17 Minnesota Twins 0.2536375
## 24 San Francisco Giants 0.2547529
## 3 Baltimore Orioles 0.2562545
## 28 Texas Rangers 0.2564103
## 29 Toronto Blue Jays 0.2586052
## 13 Los Angeles Angels of Anaheim 0.2590234
## 22 Pittsburgh Pirates 0.2593931
## 12 Kansas City Royals 0.2625789
## 14 Los Angeles Dodgers 0.2654676
## 9 Colorado Rockies 0.2763721
## 10 Detroit Tigers 0.2765542
### Or, we can use dplyr to do the same thing
case_2.1 = MLB2014 %>% ### Copy the MLB2014 data
mutate(avg = H/AB) %>% ### Create a new variable avg using H and AB
arrange(avg) ### Sort by avg
case_2.1 %>% select(name, avg) ### Display only name and avg
## name avg
## 1 San Diego Padres 0.2264828
## 2 Cincinnati Reds 0.2376274
## 3 New York Mets 0.2386696
## 4 Chicago Cubs 0.2387436
## 5 Atlanta Braves 0.2406730
## 6 Houston Astros 0.2417845
## 7 Philadelphia Phillies 0.2420132
## 8 Seattle Mariners 0.2436697
## 9 Boston Red Sox 0.2441002
## 10 Oakland Athletics 0.2441839
## 11 New York Yankees 0.2454066
## 12 Tampa Bay Rays 0.2467368
## 13 Arizona Diamondbacks 0.2483790
## 14 Milwaukee Brewers 0.2500915
## 15 Chicago White Sox 0.2525708
## 16 Miami Marlins 0.2526183
## 17 St. Louis Cardinals 0.2526723
## 18 Cleveland Indians 0.2530942
## 19 Washington Nationals 0.2531577
## 20 Minnesota Twins 0.2536375
## 21 San Francisco Giants 0.2547529
## 22 Baltimore Orioles 0.2562545
## 23 Texas Rangers 0.2564103
## 24 Toronto Blue Jays 0.2586052
## 25 Los Angeles Angels of Anaheim 0.2590234
## 26 Pittsburgh Pirates 0.2593931
## 27 Kansas City Royals 0.2625789
## 28 Los Angeles Dodgers 0.2654676
## 29 Colorado Rockies 0.2763721
## 30 Detroit Tigers 0.2765542
R makes it easy to create a stemplot, and a few options let us modify the plot so that it is more interpretable.
### The default stem has too few stems
stem(case.2.1$HR)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 8 | 5
## 10 | 59178
## 12 | 2335581246
## 14 | 2670255567
## 16 | 37
## 18 | 6
## 20 | 1
### We can split the stems in half to match the text
stem(case.2.1$HR, 2)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 9 | 5
## 10 | 59
## 11 | 178
## 12 | 233558
## 13 | 1246
## 14 | 267
## 15 | 0255567
## 16 | 3
## 17 | 7
## 18 | 6
## 19 |
## 20 |
## 21 | 1
### Further division of stems is overkill
stem(case.2.1$HR, 5)
##
## The decimal point is at the |
##
## 94 | 0
## 96 |
## 98 |
## 100 |
## 102 |
## 104 | 0
## 106 |
## 108 | 0
## 110 | 0
## 112 |
## 114 |
## 116 | 0
## 118 | 0
## 120 |
## 122 | 000
## 124 | 00
## 126 |
## 128 | 0
## 130 | 0
## 132 | 0
## 134 | 0
## 136 | 0
## 138 |
## 140 |
## 142 | 0
## 144 |
## 146 | 00
## 148 |
## 150 | 0
## 152 | 0
## 154 | 000
## 156 | 00
## 158 |
## 160 |
## 162 | 0
## 164 |
## 166 |
## 168 |
## 170 |
## 172 |
## 174 |
## 176 | 0
## 178 |
## 180 |
## 182 |
## 184 |
## 186 | 0
## 188 |
## 190 |
## 192 |
## 194 |
## 196 |
## 198 |
## 200 |
## 202 |
## 204 |
## 206 |
## 208 |
## 210 | 0
For a back-to-back stemplot we use the aplpack package.
p_load(aplpack)
### Create vectors that contain HR for either NL or AL using base R
nlhr2014 = MLB2014$HR[MLB2014$lgID=="NL"] ### Use the HR variable and subset only NL
alhr2014 = MLB2014$HR[MLB2014$lgID=="AL"] ### Use the HR variable and subset only AL
stem.leaf.backback(nlhr2014, alhr2014)
## ____________________________
## 1 | 2: represents 12, leaf unit: 1
## nlhr2014 alhr2014
## ____________________________
## | 9 |5 1
## 2 95| 10 |
## 3 8| 11 |17 3
## 7 5532| 12 |38 5
## (3) 421| 13 |6 6
## | 14 |267 (3)
## 5 7620| 15 |555 6
## | 16 |3 3
## | 17 |7 2
## 1 6| 18 |
## | 19 |
## ____________________________
## HI: 211
## n: 15 15
## ____________________________
### Or, use dplyr to do the same thing
nlhr2014 = MLB2014 %>% filter(lgID == "NL") %>% select(HR)
alhr2014 = MLB2014 %>% filter(lgID == "AL") %>% select(HR)
stem.leaf.backback(as.vector(t(alhr2014)), as.vector(t(nlhr2014))) ### Force the dataframes to col vectors
## ____________________________
## 1 | 2: represents 12, leaf unit: 1
## as.vector(t(alhr2014))
## as.vector(t(nlhr2014))
## ____________________________
## 1 5| 9 |
## | 10 |59 2
## 3 71| 11 |8 3
## 5 83| 12 |2355 7
## 6 6| 13 |124 (3)
## (3) 762| 14 |
## 6 555| 15 |0267 5
## 3 3| 16 |
## 2 7| 17 |
## | 18 |6 1
## | 19 |
## ____________________________
## HI: 211
## n: 15 15
## ____________________________
Team on-base percentage (OBP) is easily graphed, and computing summary statistics for it is just as simple.
### Add the OBP to the dataframe using dplyr
case.2.1 = case.2.1 %>% mutate(OBP = round((H + BB + HBP)/(AB + BB + HBP + SF), 3))
### Stem plot using base R
stem(case.2.1$OBP)
##
## The decimal point is 2 digit(s) to the left of the |
##
## 29 | 2
## 29 | 6
## 30 | 0022
## 30 | 5789
## 31 | 011144
## 31 | 6777
## 32 | 001234
## 32 | 7
## 33 | 013
### Using aplpack
stem.leaf(case.2.1$OBP)
## 1 | 2: represents 0.012
## leaf unit: 0.001
## n: 30
## 1 29* | 2
## 2 29. | 6
## 6 30* | 0022
## 10 30. | 5789
## (6) 31* | 011144
## 14 31. | 6777
## 10 32* | 001234
## 4 32. | 7
## 3 33* | 013
### While we can read the sorted data from the stem plot, sort the data to check
sort(case.2.1$OBP)
## [1] 0.292 0.296 0.300 0.300 0.302 0.302 0.305 0.307 0.308 0.309 0.310 0.311
## [13] 0.311 0.311 0.314 0.314 0.316 0.317 0.317 0.317 0.320 0.320 0.321 0.322
## [25] 0.323 0.324 0.327 0.330 0.331 0.333
### How many observations
length(case.2.1$OBP)
## [1] 30
### What are the OBPs for the 15th and 16th obs?
case.2.1$OBP[15:16] ### Oops, not sorted
## [1] 0.317 0.310
sort(case.2.1$OBP)[15:16] ### That's better
## [1] 0.314 0.314
### Find the median
median(case.2.1$OBP)
## [1] 0.314
### What are the OBPs for the 8th and 23rd sorted obs?
sort(case.2.1$OBP)[c(8,23)]
## [1] 0.307 0.321
### Find the quartiles
quantile(case.2.1$OBP)[c(2,4)]
## 25% 75%
## 0.30725 0.32075
### Or...
summary(case.2.1$OBP)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2920 0.3073 0.3140 0.3137 0.3207 0.3330
The plots and statistics for Derek Jeter that are presented in Case 2.2 are variations on those already presented above. For this case we download the data from Albert’s GitHub repository.
### Get the data. Be careful about getting the "raw" CSV file from GitHub
case.2.2 <- read.csv("https://raw.githubusercontent.com/bayesball/Teaching-Statistics-Using-Baseball/master/case_2_2.csv")
### See what's in it
head(case.2.2)
## Year Age G PA AB R H X2B X3B HR RBI BB SO BA OBP SLG OPS
## 1 1995 21 15 51 48 5 12 4 1 0 7 3 11 0.250 0.294 0.375 0.669
## 2 1996 22 157 654 582 104 183 25 6 10 78 48 102 0.314 0.370 0.430 0.800
## 3 1997 23 159 748 654 116 190 31 7 10 70 74 125 0.291 0.370 0.405 0.775
## 4 1998 24 149 694 626 127 203 25 8 19 84 57 119 0.324 0.384 0.481 0.864
## 5 1999 25 158 739 627 134 219 37 9 24 102 91 116 0.349 0.438 0.552 0.989
## 6 2000 26 148 679 593 119 201 31 4 15 73 68 99 0.339 0.416 0.481 0.896
### Albert's Table 2.2 contains fewer variables
table.2.2 = case.2.2 %>%
select(Year, AB:HR, BB:OPS) %>% ### Use ":" to get all variables between those indicated
rename(AVG = BA) ### Rename batting average from BA to AVG
head(table.2.2)
## Year AB R H X2B X3B HR BB SO AVG OBP SLG OPS
## 1 1995 48 5 12 4 1 0 3 11 0.250 0.294 0.375 0.669
## 2 1996 582 104 183 25 6 10 48 102 0.314 0.370 0.430 0.800
## 3 1997 654 116 190 31 7 10 74 125 0.291 0.370 0.405 0.775
## 4 1998 626 127 203 25 8 19 57 119 0.324 0.384 0.481 0.864
## 5 1999 627 134 219 37 9 24 91 116 0.349 0.438 0.552 0.989
## 6 2000 593 119 201 31 4 15 68 99 0.339 0.416 0.481 0.896
Albert removed Jeter’s rookie year (1995) and his injury year (2013).
case.2.2$Year ### Look at Jeter's years
## [1] 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
## [16] 2010 2011 2012 2013 2014
albert.2.2 = case.2.2[-c(1,19), ] ### Remove rows 1 and 19 (negative index), but keep all columns (blank after comma)
albert.2.2$Year ### 1995 and 2013 are not listed
## [1] 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
## [16] 2011 2012 2014
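The same subset can be produced with dplyr by filtering on the year values rather than the row positions (a sketch; albert.2.2.alt is our name for the result):
albert.2.2.alt = case.2.2 %>% filter(!(Year %in% c(1995, 2013))) ### Drop the 1995 and 2013 seasons by value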
case.2.2 = albert.2.2 ### Update case.2.2
### Create Figure 2.5
dotplot(~HR, data=case.2.2, cex=1.25, pch=1, xlab="Home Runs")
dotplot.mtb(case.2.2$HR)
### Create Figure 2.6
plot(case.2.2$Year, case.2.2$HR, xlab="Year", ylab="Home Runs")
### Create Figure 2.7
plot(case.2.2$Year, case.2.2$HR, xlab="Year", ylab="Home Runs")
abline(a=883.68, b=-0.4337) ### Doesn't show in RStudio. Knit the doc.
# Lattice approach
p_load(lattice)
xyplot(HR~Year, data=case.2.2, pch=1,
panel=function(x, y, ...){
panel.xyplot(x, y, ...)
panel.abline(a=883.68, b=-0.4337, col="darkgray", lty=2)
}
)
# ggplot2 approach
p_load(ggplot2)
ggplot(case.2.2, aes(Year, HR)) + geom_point() + theme_bw() + xlab("Year") + ylab("Home Runs") + stat_smooth(method="lm", se=FALSE, color="maroon")
## `geom_smooth()` using formula 'y ~ x'
### Create Figure 2.8
stem(case.2.2$OPS, 2)
##
## The decimal point is 1 digit(s) to the left of the |
##
## 6 | 2
## 6 |
## 7 | 14
## 7 | 7899
## 8 | 02444
## 8 | 667
## 9 | 00
## 9 | 9
### Oops, Albert's software truncated to two decimals rather than rounded
stem(trunc(100*case.2.2$OPS)/100, 2)
##
## The decimal point is 1 digit(s) to the left of the |
##
## 6 | 1
## 6 |
## 7 | 14
## 7 | 7799
## 8 | 02344
## 8 | 5679
## 9 | 0
## 9 | 8
### Create Figure 2.9
# ggplot2 approach
ggplot(case.2.2, aes(Year, OPS)) + geom_point() + theme_bw() + xlab("Year") +
geom_smooth(method = "loess", span = 2/3, se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
This case provides an opportunity to create a new statistic and to use some new graphical features. We will use Randy Johnson’s data from the tsub package.
### Get the data
case.2.3 = case_2_3
head(case.2.3)
## yearID W L G IP H R SO BB ERA
## 1 1988 3 0 4 26.00 23 8 25 7 2.42
## 2 1989 7 13 29 160.67 147 100 130 96 4.82
## 3 1990 14 11 33 219.67 174 103 194 120 3.65
## 4 1991 13 10 33 201.33 151 96 228 152 3.98
## 5 1992 12 14 31 210.33 154 104 241 144 3.77
## 6 1993 19 8 35 255.33 185 97 308 99 3.24
### Compute the magic strikeout rate, SOR
case.2.3$sor = (case.2.3$SO/case.2.3$IP) * 9
case.2.3[, c("yearID","sor")]
## yearID sor
## 1 1988 8.653846
## 2 1989 7.282007
## 3 1990 7.948286
## 4 1991 10.192222
## 5 1992 10.312366
## 6 1993 10.856539
## 7 1994 10.674419
## 8 1995 12.345449
## 9 1996 12.473504
## 10 1997 12.295775
## 11 1998 12.118856
## 12 1999 12.058748
## 13 2000 12.558813
## 14 2001 13.409701
## 15 2002 11.561538
## 16 2003 9.868421
## 17 2004 10.624008
## 18 2005 8.414942
## 19 2006 7.551220
## 20 2007 11.434621
## 21 2008 8.461957
## 22 2009 8.062500
### Create Figure 2.10
stem(case.2.3$sor,5)
##
## The decimal point is at the |
##
## 7 | 3
## 7 | 69
## 8 | 14
## 8 | 57
## 9 |
## 9 | 9
## 10 | 23
## 10 | 679
## 11 | 4
## 11 | 6
## 12 | 1133
## 12 | 56
## 13 | 4
### Again, the figures differ by rounding versus truncating
### Generate Figure 2.11
plot(case.2.3$yearID, case.2.3$sor, xlab="Year", ylab="Strikeout Rate")
abline(h=9)
# ggplot2 approach
p_load(ggplot2)
ggplot(case.2.3, aes(yearID, sor)) + geom_point() + xlab("Year") + ylab("Strikeout Rate") +
geom_hline(aes(yintercept=9)) + geom_smooth(method="loess", span = 2/3)
## `geom_smooth()` using formula 'y ~ x'
Of course, the data for Figures 2.12 and 2.13 are not in case_2_3.csv. So, we need to get a little creative.
### We'll use dplyr to pull Johnson's ID from Lahman's People table
p_load(Lahman, dplyr)
johnson.id = People %>% ### Grab the People dataframe
filter(nameFirst=="Randy" & nameLast=="Johnson") # %>% ### Get Randy Johnson
### There are three Randy Johnsons
head(johnson.id)
## playerID birthYear birthMonth birthDay birthCountry birthState birthCity
## 1 johnsra03 1958 8 15 USA FL Miami
## 2 johnsra04 1956 6 10 USA CA Escondido
## 3 johnsra05 1963 9 10 USA CA Walnut Creek
## deathYear deathMonth deathDay deathCountry deathState deathCity nameFirst
## 1 NA NA NA <NA> <NA> <NA> Randy
## 2 NA NA NA <NA> <NA> <NA> Randy
## 3 NA NA NA <NA> <NA> <NA> Randy
## nameLast nameGiven weight height bats throws debut finalGame
## 1 Johnson Randall Stuart 195 74 L L 1980-07-05 1982-10-03
## 2 Johnson Randall Glenn 190 73 R R 1982-04-27 1984-09-30
## 3 Johnson Randall David 225 82 R L 1988-09-15 2009-10-04
## retroID bbrefID deathDate birthDate
## 1 johnr101 johnsra03 <NA> 1958-08-15
## 2 johnr001 johnsra04 <NA> 1956-06-10
## 3 johnr005 johnsra05 <NA> 1963-09-10
### We want the one whose first game was in 1988
johnson.id = johnson.id %>%
filter(year(debut) == 1988) %>% ### Make sure debut year was 1988
select(retroID) ### We need only his ID
head(johnson.id)
## retroID
## 1 johnr005
johnson.id = as.character(johnson.id) ### Make it a value not a dataframe
### Starting pitchers and runs can be found in RetroSheet
### RetroSheet can sometimes be hard to install. If p_load fails,
### uncomment the two lines that follow and run once.
## p_load(devtools)
## devtools::install_github("rmscriven/retrosheet")
p_load(retrosheet)
game1995 = getRetrosheet("game", 1995)
### Use SQL to subset the SEA games and order them by date
p_load(sqldf)
sqlsea1995 = sqldf("SELECT * FROM game1995 WHERE (HmTm='SEA' OR VisTm='SEA') ORDER BY Date")
### Or, use dplyr to get SEA games and order by Date
sea1995 = game1995 %>%
filter(HmTm=="SEA" | VisTm=="SEA") %>% ### Get home and away games for SEA
arrange(Date)
### Check to see if the two dataframes are the same
all_equal(sea1995, sqlsea1995) ### Good news, they are the same
## [1] TRUE
### Now figure out which ones Johnson started in and which were home
sea1995$RJstart = sea1995$VisStPchID==johnson.id | sea1995$HmStPchID==johnson.id ### RJ started TRUE/FALSE
sea1995$AtHome = sea1995$HmTm=="SEA" ### Home team SEA TRUE/FALSE
sea1995[sea1995$AtHome, "runs"] = sea1995[sea1995$AtHome, "HmRuns"] ### Rows: AtHome==TRUE; runs=HmRuns
sea1995[!sea1995$AtHome,"runs"] = sea1995[!sea1995$AtHome, "VisRuns"] ### Rows: AtHome==FALSE; runs=VisRuns
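The two indexed assignments above can be collapsed into a single vectorized ifelse() (a sketch; runs2 is a hypothetical column that should match runs):
sea1995$runs2 = ifelse(sea1995$AtHome, sea1995$HmRuns, sea1995$VisRuns) ### Home games use HmRuns, away games VisRuns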
### Subset the data for Johnson starts
runs.rj.start = sea1995$runs[sea1995$RJstart]
### Text "table" of runs when RJ started.
runs.rj.start
## [1] 3 15 3 6 4 5 8 11 2 3 1 9 3 5 4 5 3 4 8 2 2 6 6 7 4
## [26] 7 8 7 6 9
### Create Figure 2.12
stem(runs.rj.start, 2)
##
## The decimal point is at the |
##
## 1 | 0
## 2 | 000
## 3 | 00000
## 4 | 0000
## 5 | 000
## 6 | 0000
## 7 | 000
## 8 | 000
## 9 | 00
## 10 |
## 11 | 0
## 12 |
## 13 |
## 14 |
## 15 | 0
### While the shape is right, the leaves and stems aren't pretty. Try aplpack.
p_load(aplpack)
stem.leaf(runs.rj.start)
## 1 | 2: represents 1.2
## leaf unit: 0.1
## n: 30
## 1 1 | 0
## 4 2 | 000
## 9 3 | 00000
## 13 4 | 0000
## (3) 5 | 000
## 14 6 | 0000
## 10 7 | 000
## 7 8 | 000
## 4 9 | 00
## 10 |
## 2 11 | 0
## HI: 15
### Create Figure 2.13
runs.rj.nostart = sea1995$runs[!sea1995$RJstart]
stem.leaf.backback(runs.rj.start,runs.rj.nostart)
## _____________________________________________________
## 1 | 2: represents 1.2, leaf unit: 0.1
## runs.rj.start runs.rj.nostart
## _____________________________________________________
## | 0 |00 2
## 1 0| 1 |000000000000 14
## 4 000| 2 |00000000000000 28
## 9 00000| 3 |00000000000 39
## 13 0000| 4 |000000000000 51
## (3) 000| 5 |000000000000 (12)
## 14 0000| 6 |0000000000000000 52
## 10 000| 7 |00000 36
## 7 000| 8 |00000 31
## 4 00| 9 |0000000 26
## | 10 |000000000 19
## 2 0| 11 |000000 10
## | 12 |
## | 13 |
## | 14 |0 4
## 1 0| 15 |000 3
## _____________________________________________________
## n: 30 115
## _____________________________________________________
### Means and medians
mean(runs.rj.start)
## [1] 5.533333
median(runs.rj.start)
## [1] 5
mean(runs.rj.nostart)
## [1] 5.478261
median(runs.rj.nostart)
## [1] 5
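A more compact comparison (a sketch) summarizes runs by whether Johnson started, using base R's tapply():
tapply(sea1995$runs, sea1995$RJstart, summary) ### FALSE = RJ did not start, TRUE = RJ started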
TSUB2 now looks at attendance. As seen above, this information is included in Lahman’s data. However, we will use Albert’s data for convenience.
### Make a local copy of the tsub dataframe
case.2.4 = case_2_4
head(case.2.4)
## HomeTeam Mean
## 1 ANA 38221
## 2 ARI 25602
## 3 ATL 29065
## 4 BAL 30426
## 5 BOS 36495
## 6 CHA 20381
### The variable names are messed up. We fix them.
names(case.2.4) ### Ask for the column names
## [1] "HomeTeam" "Mean"
names(case.2.4) <- c("Team","Attendance") ### Set the column names with the new names
names(case.2.4) ### Ask for the column names
## [1] "Team" "Attendance"
head(case.2.4) ### Look at the dataframe
## Team Attendance
## 1 ANA 38221
## 2 ARI 25602
## 3 ATL 29065
## 4 BAL 30426
## 5 BOS 36495
## 6 CHA 20381
### Plot Figure 2.14 using base R
hist(case.2.4$Attendance, xlab="Average Home Attendance", main="")
### and the lattice package
histogram(~Attendance, data=case.2.4, xlab="Average Home Attendance", breaks=15000+5000*(0:7))
histogram(~Attendance/1000, data=case.2.4, xlab="Average Home Attendance (thousands)", type="count")
### and using ggplot
p_load(ggplot2)
ggplot(case.2.4,aes(Attendance)) + geom_histogram() + xlab("Average Home Attendance") + stat_bin(breaks=15000+5000*(0:7)) + theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(case.2.4,aes(Attendance)) + geom_histogram() + xlab("Average Home Attendance") + stat_bin(binwidth=5000) + theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
### Create Figure 2.15
p_load(aplpack)
stem.leaf(case.2.4$Attendance)
## 1 | 2: represents 12000
## leaf unit: 1000
## n: 30
## 2 1. | 77
## 7 2* | 01144
## (8) 2. | 55677999
## (8) 3* | 00012334
## 7 3. | 668
## 4 4* | 113
## 1 4. | 6
### Create Figure 2.16
### We can use the lattice package to create the Bill Cleveland style dotplot of average attendance
p_load(lattice)
dotplot(reorder(Team, Attendance)~Attendance, data=case.2.4) ### We order the teams by their mean attendance
Finally, we look at the use of the sacrifice bunt during the 2013 season. The data list SF, which appears to be sacrifice flies, not bunts; the values are nonetheless the same as those in Table 2.6. The tsub package data do not include League.
### Get the case_2_5 data
case.2.5 = read.csv("http://facweb1.redlands.edu/fac/jim_bentley/Data/FYS28/TSUB2/CSV/case_2_5.csv")
head(case.2.5)
## BAT_TEAM SF League
## 1 ANA 37 AL
## 2 ARI 50 NL
## 3 ATL 58 NL
## 4 BAL 27 AL
## 5 BOS 24 AL
## 6 CHA 19 AL
names(case.2.5)[1] <- "Team" ### Replace BAT_TEAM with Team. Is SH sac bunts?
### Figure 2.17
p_load(lattice)
dotplot(~SF, data=case.2.5, cex=1.25, pch=1)
dotplot.mtb(case.2.5$SF)
plt = case.2.5 %>%
ggplot(aes(x = SF)) +
geom_dotplot(binwidth = 1)
plt + theme(axis.text.y=element_blank(),
axis.ticks.y=element_blank()
) + labs(y="")
### Compute simple statistics
summary(case.2.5$SF)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.00 29.50 45.50 46.10 57.75 85.00
### Since we are using the lattice package, Figure 2.18 is an easy fix
dotplot(League ~ SF, data=case.2.5, cex=1.25, pch=1)
plt = case.2.5 %>%
ggplot(aes(x = League, y = SF)) +
geom_dotplot(binaxis = "y", stackdir = "center", binwidth = 1)
plt
### Compute associated statistics
by(case.2.5$SF,case.2.5$League,summary)
## case.2.5$League: AL
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.00 24.50 29.00 29.79 35.00 45.00
## ------------------------------------------------------------
## case.2.5$League: NL
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 43.00 52.75 57.50 60.38 66.50 85.00
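The same by-league summaries can be computed with dplyr (a sketch mirroring the by() output above):
case.2.5 %>%
  group_by(League) %>%
  summarize(Min = min(SF), Q1 = quantile(SF, 0.25), Median = median(SF),
            Mean = mean(SF), Q3 = quantile(SF, 0.75), Max = max(SF))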
Chapter 3 moves from describing a single characteristic to comparing two. R’s developers have made this transition easy.
The plots and statistics for comparing Albert Pujols and Manny Ramirez that are presented in Case 3.1 are, for the most part, variations on those already presented above. For this case we download the data from my web site.
### Get the data
case.3.1 = read.csv("http://facweb1.redlands.edu/fac/jim_bentley/Data/FYS28/TSUB2/CSV/case_3_1.csv")
### Table 3.1 is just the data.frame
case.3.1 %>% select(Player, Age, SLG, OBP, OPS)
## Player Age SLG OBP OPS
## 1 Rameriz 22 0.5206897 0.3571429 0.878
## 2 Rameriz 23 0.5578512 0.4024605 0.960
## 3 Rameriz 24 0.5818182 0.3987635 0.981
## 4 Rameriz 25 0.5383244 0.4147465 0.953
## 5 Rameriz 26 0.5989492 0.3770739 0.976
## 6 Rameriz 27 0.6628352 0.4421875 1.105
## 7 Rameriz 28 0.6970387 0.4567669 1.154
## 8 Rameriz 29 0.6086957 0.4048387 1.014
## 9 Rameriz 30 0.6467890 0.4498069 1.097
## 10 Rameriz 31 0.5869947 0.4270987 1.014
## 11 Rameriz 32 0.6126761 0.3966817 1.009
## 12 Rameriz 33 0.5938628 0.3876923 0.982
## 13 Rameriz 34 0.6191537 0.4390681 1.058
## 14 Rameriz 35 0.4927536 0.3884007 0.881
## 15 Rameriz 36 0.6014493 0.4296636 1.031
## 16 Rameriz 37 0.5312500 0.4176334 0.949
## 17 Rameriz 38 0.4603774 0.4093750 0.870
## 18 Pujols 21 0.6101695 0.4029630 1.013
## 19 Pujols 22 0.5610169 0.3940741 0.955
## 20 Pujols 23 0.6666667 0.4394161 1.106
## 21 Pujols 24 0.6570946 0.4147399 1.072
## 22 Pujols 25 0.6091371 0.4300000 1.039
## 23 Pujols 26 0.6710280 0.4305994 1.102
## 24 Pujols 27 0.5681416 0.4285714 0.997
## 25 Pujols 28 0.6526718 0.4617785 1.114
## 26 Pujols 29 0.6584507 0.4428571 1.101
## 27 Pujols 30 0.5962521 0.4142857 1.011
## 28 Pujols 31 0.5405872 0.3655914 0.906
## 29 Pujols 32 0.5156507 0.3432836 0.859
## 30 Pujols 33 0.4373402 0.3295711 0.767
## 31 Pujols 34 0.4660348 0.3237410 0.790
## 32 Pujols 35 0.4800664 0.3071104 0.787
names(case.3.1)
## [1] "Player" "yearID" "AB" "BB" "HBP" "H" "X2B" "X3B"
## [9] "HR" "SF" "Age" "SLG" "OBP" "OPS"
### Albert next creates Figure 3.1, a back-to-back stemplot
p_load(aplpack)
(pujols = case.3.1 %>% filter(Player=="Pujols"))
## Player yearID AB BB HBP H X2B X3B HR SF Age SLG OBP OPS
## 1 Pujols 2001 590 69 9 194 47 4 37 7 21 0.6101695 0.4029630 1.013
## 2 Pujols 2002 590 72 9 185 40 2 34 4 22 0.5610169 0.3940741 0.955
## 3 Pujols 2003 591 79 10 212 51 1 43 5 23 0.6666667 0.4394161 1.106
## 4 Pujols 2004 592 84 7 196 51 2 46 9 24 0.6570946 0.4147399 1.072
## 5 Pujols 2005 591 97 9 195 38 2 41 3 25 0.6091371 0.4300000 1.039
## 6 Pujols 2006 535 92 4 177 33 1 49 3 26 0.6710280 0.4305994 1.102
## 7 Pujols 2007 565 99 7 185 38 1 32 8 27 0.5681416 0.4285714 0.997
## 8 Pujols 2008 524 104 5 187 44 0 37 8 28 0.6526718 0.4617785 1.114
## 9 Pujols 2009 568 115 9 186 45 1 47 8 29 0.6584507 0.4428571 1.101
## 10 Pujols 2010 587 103 4 183 39 1 42 6 30 0.5962521 0.4142857 1.011
## 11 Pujols 2011 579 61 4 173 29 0 37 7 31 0.5405872 0.3655914 0.906
## 12 Pujols 2012 607 52 5 173 50 0 30 6 32 0.5156507 0.3432836 0.859
## 13 Pujols 2013 391 40 5 101 19 0 17 7 33 0.4373402 0.3295711 0.767
## 14 Pujols 2014 633 48 5 172 37 1 28 9 34 0.4660348 0.3237410 0.790
## 15 Pujols 2015 602 50 6 147 22 0 40 3 35 0.4800664 0.3071104 0.787
(ramirez = case.3.1 %>% filter(Player=="Ramirez"))
## [1] Player yearID AB BB HBP H X2B X3B HR SF
## [11] Age SLG OBP OPS
## <0 rows> (or 0-length row.names)
Wait. There aren’t any Ramirez observations? What’s going on?
table(case.3.1$Player)
##
## Pujols Rameriz
## 15 17
It looks like Albert typoed. Let’s fix it and move on.
### Select only those rows with Player == "Rameriz" and replace the Player column value with "Ramirez"
case.3.1[case.3.1$Player == "Rameriz", "Player" ] <- "Ramirez"
(ramirez = case.3.1 %>% filter(Player=="Ramirez"))
## Player yearID AB BB HBP H X2B X3B HR SF Age SLG OBP OPS
## 1 Ramirez 1994 290 42 0 78 22 0 17 4 22 0.5206897 0.3571429 0.878
## 2 Ramirez 1995 484 75 5 149 26 1 31 5 23 0.5578512 0.4024605 0.960
## 3 Ramirez 1996 550 85 3 170 45 3 33 9 24 0.5818182 0.3987635 0.981
## 4 Ramirez 1997 561 79 7 184 40 0 26 4 25 0.5383244 0.4147465 0.953
## 5 Ramirez 1998 571 76 6 168 35 2 45 10 26 0.5989492 0.3770739 0.976
## 6 Ramirez 1999 522 96 13 174 34 3 44 9 27 0.6628352 0.4421875 1.105
## 7 Ramirez 2000 439 86 3 154 34 2 38 4 28 0.6970387 0.4567669 1.154
## 8 Ramirez 2001 529 81 8 162 33 2 41 2 29 0.6086957 0.4048387 1.014
## 9 Ramirez 2002 436 73 8 152 31 0 33 1 30 0.6467890 0.4498069 1.097
## 10 Ramirez 2003 569 97 8 185 36 1 37 5 31 0.5869947 0.4270987 1.014
## 11 Ramirez 2004 568 82 6 175 44 0 43 7 32 0.6126761 0.3966817 1.009
## 12 Ramirez 2005 554 80 10 162 30 1 45 6 33 0.5938628 0.3876923 0.982
## 13 Ramirez 2006 449 100 1 144 27 1 35 8 34 0.6191537 0.4390681 1.058
## 14 Ramirez 2007 483 71 7 143 33 1 20 8 35 0.4927536 0.3884007 0.881
## 15 Ramirez 2008 552 87 11 183 36 1 37 4 36 0.6014493 0.4296636 1.031
## 16 Ramirez 2009 352 71 7 102 24 2 19 1 37 0.5312500 0.4176334 0.949
## 17 Ramirez 2010 265 46 6 79 16 0 9 3 38 0.4603774 0.4093750 0.870
stem.leaf.backback(pujols$OPS, ramirez$OPS)
## _______________________________
## 1 | 2: represents 0.12, leaf unit: 0.01
## pujols$OPS ramirez$OPS
## _______________________________
## | 7* |
## 3 986| 7. |
## | 8* |
## 4 5| 8. |778 3
## 5 0| 9* |4 4
## 7 95| 9. |56788 (5)
## (3) 311| 10* |0113 8
## 5 7| 10. |59 4
## 4 1000| 11* |0 2
## | 11. |5 1
## | 12* |
## _______________________________
## n: 15 17
## _______________________________
That’s better, but where did Pujols’ 78 leaf (an OPS near 0.78) come from? How many observations should he have? The online data contain an extra season for Pujols, his age-35 season.
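We can confirm the extra season directly from the data loaded above (a quick check):
pujols %>% filter(Age == 35) %>% select(Player, yearID, Age, OPS) ### The 2015 season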
### Five number summaries can be computed in a number of ways
### Median
(n = nrow(pujols))
## [1] 15
(n+1)/2
## [1] 8
m = sort(pujols$OPS)[(n+1)/2]
m
## [1] 1.011
median(pujols$OPS)
## [1] 1.011
### Quartiles
sort(pujols$OPS[pujols$OPS < m])
## [1] 0.767 0.787 0.790 0.859 0.906 0.955 0.997
ql = median(pujols$OPS[pujols$OPS < m])
ql
## [1] 0.859
qu = median(pujols$OPS[pujols$OPS > m])
qu
## [1] 1.101
quantile(pujols$OPS)[c(2,4)]
## 25% 75%
## 0.8825 1.0865
sort(pujols$OPS)
## [1] 0.767 0.787 0.790 0.859 0.906 0.955 0.997 1.011 1.013 1.039 1.072 1.101
## [13] 1.102 1.106 1.114
### Min and max are easy
min(pujols$OPS)
## [1] 0.767
max(pujols$OPS)
## [1] 1.114
range(pujols$OPS)
## [1] 0.767 1.114
### Together, the statistics from above are the five-number-summary
quantile(pujols$OPS)
## 0% 25% 50% 75% 100%
## 0.7670 0.8825 1.0110 1.0865 1.1140
summary(pujols$OPS) # Ignore the mean for now.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.7670 0.8825 1.0110 0.9746 1.0865 1.1140
quantile(ramirez$OPS)
## 0% 25% 50% 75% 100%
## 0.870 0.953 0.982 1.031 1.154
summary(ramirez$OPS)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.8700 0.9530 0.9820 0.9948 1.0310 1.1540
### Albert computes a "step" that is then used to compute the u/l fences
### Note that the difference between steps found using Albert's method and
### the internal function is minimal. We'll use Albert's value.
(qu = quantile(ramirez$OPS)[4])
## 75%
## 1.031
(ql = quantile(ramirez$OPS)[2])
## 25%
## 0.953
step = 1.5*(qu-ql)
step
## 75%
## 0.117
1.5 * c(-1,1) %*% quantile(ramirez$OPS)[c(2,4)] ### A little different due to different quantiles
## [,1]
## [1,] 0.117
fu = qu + step
fu
## 75%
## 1.148
fl = ql - step
fl
## 25%
## 0.836
### So values greater than fu and less than fl are possible outliers
ramirez$OPS[ramirez$OPS > fu | ramirez$OPS < fl]
## [1] 1.154
sort(ramirez$OPS)
## [1] 0.870 0.878 0.881 0.949 0.953 0.960 0.976 0.981 0.982 1.009 1.014 1.014
## [13] 1.031 1.058 1.097 1.105 1.154
### Now we generate boxplots
boxplot(OPS~Player, data=case.3.1, ylab="OPS")
p_load(lattice)
bwplot(Player~OPS, data=case.3.1)
p_load(ggplot2)
ggplot(case.3.1,aes(Player, OPS)) + geom_boxplot()
ggplot(case.3.1,aes(Player, OPS)) + geom_boxplot() + coord_flip()
### Compare medians
median(pujols$OPS)-median(ramirez$OPS)
## [1] 0.029
### Create Figure 3.4
plot(case.3.1$Age, case.3.1$OPS, pch=as.numeric(factor(case.3.1$Player)))
xyplot(OPS~Age, data=case.3.1, group=Player,
type=c("p","smooth"), span=0.75, pch=c(1,3), lty=c(1,2),
key=list(text=list(c("Pujols", "Ramirez")),
points=list(pch=c(1,3), col=c(13,14)),
columns=2)
)
ggplot(case.3.1, aes(x=Age, y=OPS, color=Player)) +
geom_point() + geom_smooth(span=0.75, se=FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The plots and statistics for comparing the pitchers Robin Roberts and Whitey Ford that are presented in Case 3.2 are, for the most part, variations on those already presented above. For this case we again download the data from my web site.
### Get the data
case.3.2 = read.csv("http://facweb1.redlands.edu/fac/jim_bentley/Data/FYS28/TSUB2/CSV/case_3_2.csv")
### Albert next creates Figure 3.5, a back-to-back stemplot
p_load(aplpack)
roberts = subset(case.3.2, Player=="Roberts")
roberts
## Year Player Age Tm W L G IP H ER HR BB SO ERA X.lgERA
## 1 1948 Roberts 21 PHI 7 9 20 146.7 148 52 10 61 84 3.19 3.96
## 2 1949 Roberts 22 PHI 15 15 43 226.7 229 93 15 75 95 3.69 3.96
## 3 1950 Roberts 23 PHI 20 11 40 304.3 282 102 29 77 146 3.02 4.06
## 4 1951 Roberts 24 PHI 21 15 44 315.0 284 106 20 64 127 3.03 3.84
## 5 1952 Roberts 25 PHI 28 7 39 330.0 292 95 22 45 148 2.59 3.66
## 6 1953 Roberts 26 PHI 23 16 44 346.7 324 106 30 61 198 2.75 4.20
## 7 1954 Roberts 27 PHI 23 15 45 336.7 289 111 35 56 185 2.97 4.03
## 8 1955 Roberts 28 PHI 23 14 41 305.0 292 111 41 53 160 3.28 3.96
## 9 1956 Roberts 29 PHI 19 18 43 297.3 328 147 46 40 157 4.45 3.73
## 10 1957 Roberts 30 PHI 10 22 39 249.7 246 113 40 43 128 4.07 3.80
## 11 1958 Roberts 31 PHI 17 14 35 269.7 270 97 30 51 130 3.24 3.95
## 12 1959 Roberts 32 PHI 15 17 35 257.3 267 122 34 35 137 4.27 4.11
## 13 1960 Roberts 33 PHI 12 16 35 237.3 256 106 31 34 122 4.02 3.88
## 14 1961 Roberts 34 PHI 1 10 26 117.0 154 76 19 23 54 5.85 4.07
## 15 1962 Roberts 35 BAL 10 9 27 191.3 176 59 17 41 102 2.78 3.77
## 16 1963 Roberts 36 BAL 14 13 35 251.3 230 93 35 40 124 3.33 3.52
## 17 1964 Roberts 37 BAL 13 7 31 204.0 203 66 18 52 109 2.91 3.59
## 18 1965 Roberts 38 BAL/HOU 10 9 30 190.7 171 59 18 30 97 2.78 3.42
## 19 1966 Roberts 39 CHC/HOU 5 8 24 112.0 141 60 15 21 54 4.82 3.54
## X.ERA.
## 1 124
## 2 107
## 3 135
## 4 127
## 5 141
## 6 152
## 7 136
## 8 121
## 9 84
## 10 93
## 11 122
## 12 96
## 13 96
## 14 70
## 15 136
## 16 106
## 17 123
## 18 123
## 19 73
ford = subset(case.3.2, Player=="Ford")
ford
## Year Player Age Tm W L G IP H ER HR BB SO ERA X.lgERA X.ERA.
## 20 1950 Ford 21 NYY 9 1 20 112.0 87 35 7 52 59 2.81 4.30 153
## 21 1953 Ford 24 NYY 18 6 32 207.0 187 69 13 110 110 3.00 3.68 123
## 22 1954 Ford 25 NYY 16 8 34 210.7 170 66 10 101 125 2.82 3.42 121
## 23 1955 Ford 26 NYY 18 7 39 253.7 188 74 20 113 137 2.63 3.76 143
## 24 1956 Ford 27 NYY 19 6 31 225.7 187 62 13 84 141 2.47 3.87 156
## 25 1957 Ford 28 NYY 11 5 24 129.3 114 37 10 53 84 2.57 3.60 140
## 26 1958 Ford 29 NYY 14 7 30 219.3 174 49 14 62 145 2.01 3.54 176
## 27 1959 Ford 30 NYY 16 10 35 204.0 194 69 13 89 114 3.04 3.63 119
## 28 1960 Ford 31 NYY 12 9 33 192.7 168 66 15 65 85 3.08 3.60 117
## 29 1961 Ford 32 NYY 25 4 39 283.0 242 101 23 92 209 3.21 3.70 115
## 30 1962 Ford 33 NYY 17 8 38 257.7 243 83 22 69 160 2.90 3.73 129
## 31 1963 Ford 34 NYY 24 7 38 269.3 240 82 26 56 189 2.74 3.52 128
## 32 1964 Ford 35 NYY 17 6 39 244.7 212 58 10 57 172 2.13 3.62 170
## 33 1965 Ford 36 NYY 16 13 37 244.3 241 88 22 50 162 3.24 3.39 105
## 34 1966 Ford 37 NYY 2 5 22 73.0 79 20 8 24 43 2.47 3.33 135
## 35 1967 Ford 38 NYY 2 4 7 44.0 40 8 2 9 21 1.64 3.13 191
stem.leaf.backback(ford$ERA, roberts$ERA, m=2, Min=1.5, Max=6)
## ________________________________
## 1 | 2: represents 1.2, leaf unit: 0.1
## ford$ERA roberts$ERA
## ________________________________
## | 1* |
## 1 6| 1. |
## 5 4410| 2* |
## (6) 988765| 2. |577799 6
## 5 22000| 3* |001223 (6)
## | 3. |6 7
## | 4* |0024 6
## | 4. |8 2
## | 5* |
## | 5. |8 1
## | 6* |
## ________________________________
## n: 16 19
## ________________________________
The back-to-back stemplot seems to be the same as Figure 3.5.
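Before moving on, a grouped numerical summary makes a quick companion to the stemplot (a minimal sketch using dplyr, which is already loaded; the statistics chosen here are mine, not Albert's):
case.3.2 %>%
group_by(Player) %>% ### One row of summaries per pitcher
summarize(n = n(), mean.ERA = mean(ERA), median.ERA = median(ERA), sd.ERA = sd(ERA))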
In the first edition of TSUB, Albert forgot to include league average information. case_3_2.csv contains this information for ERA, which is what we need. However, it might be useful to be able to get this, or similar, information in the future. Code for generating more complete league average data is presented below.
### Make Figure 3.6
### To get league average pitcher information...
### The data can be found at
### http://www.baseball-reference.com/leagues/NL/pitch.shtml#teams_standard_pitching::none
### and http://www.baseball-reference.com/leagues/AL/pitch.shtml#teams_standard_pitching::none
### Click on CSV and then EXPORT to save them to your machine.
### The files are also available at the URLs given below.
nlpitch = read.csv("http://facweb1.redlands.edu/fac/jim_bentley/Data/FYS04/nlpitchingaverages.csv")
alpitch = read.csv("http://facweb1.redlands.edu/fac/jim_bentley/Data/FYS04/alpitchingaverages.csv")
nlpitch$lgID = "NL" ### Identify the data as NL
alpitch$lgID = "AL" ### Identify the data as AL
pitch = rbind(alpitch, nlpitch) ### Stack the data into a single dataframe
pitch = pitch %>% filter(Year < 2022) ### Get rid of the incomplete 2022 season
plot(alpitch$Year,alpitch$ERA,type="l", lty=2, xlab="Year", ylab="Mean ERA")
lines(nlpitch$Year,nlpitch$ERA, lty=1)
### Note that Albert only used years where both the AL and NL had data -- 1901 on.
### Albert didn't have data beyond about 2010 but we have through 2021.
abline(v=2010,lty=3)
abline(v=1901, lty=3)
### Add a legend.
legend(1978,3.4,legend=c("AL","NL"),lty=c(2,1))
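The stacked pitch data frame built above also makes a ggplot2 version easy, with the legend handled automatically (a sketch; the aesthetic choices are mine):
p_load(ggplot2)
ggplot(pitch, aes(x=Year, y=ERA, linetype=lgID)) +
geom_line() +
geom_vline(xintercept=c(1901, 2010), linetype=3) + ### Same reference years as above
labs(x="Year", y="Mean ERA", linetype="League")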
We can recreate the plot using actual data scraping rather than downloading.
### Install and load a few support packages.
p_load(rvest) # Scrapes
p_load(stringr) # String tools
p_load(tidyr) # Data frame manipulation
### Get the NL pitching page
url = 'http://www.baseball-reference.com/leagues/NL/pitch.shtml#teams_standard_pitching::none'
webpage = read_html(url)
### Use the rvest functions html_nodes and html_table to extract what we want
### Grab the tables on the page
nl_table = html_nodes(webpage, 'table')
nl_table
## {xml_nodeset (2)}
## [1] <table class="sortable stats_table" id="teams_standard_pitching" data-col ...
## [2] <table class="sortable stats_table" id="teams_standard_pitching_totals" d ...
### Two tables are read. Assign the tables to separate data frames
nl1 = html_table(nl_table)[[1]]
nl2 = html_table(nl_table)[[2]]
head(nl1)
## # A tibble: 6 × 32
## Year Tms `#P` PAge `R/G` ERA G GF CG SHO tSho SV IP
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2022 15 377 29.1 4.47 4.14 1292 0.99 0.01 0.00 0.06 0.26 8.86
## 2 2021 15 497 28.9 4.48 4.20 2428 0.99 0.01 0.01 0.07 0.25 8.76
## 3 2020 15 383 28.4 4.66 4.47 898 0.98 0.02 0.01 0.06 0.21 8.56
## 4 2019 15 426 28.3 4.71 4.38 2430 0.99 0.01 0.00 0.06 0.25 8.94
## 5 2018 15 431 28.2 4.34 4.02 2432 0.99 0.01 0.00 0.07 0.26 8.98
## 6 2017 15 381 28.4 4.63 4.34 2430 0.99 0.01 0.01 0.06 0.25 8.90
## # … with 19 more variables: H <chr>, R <chr>, ER <chr>, HR <chr>, BB <chr>,
## # IBB <chr>, SO <chr>, HBP <chr>, BK <chr>, WP <chr>, BF <chr>, WHIP <chr>,
## # BAbip <chr>, H9 <chr>, HR9 <chr>, BB9 <chr>, SO9 <chr>, `SO/W` <chr>,
## # E <chr>
head(nl2)
## # A tibble: 6 × 32
## Year Tms `#P` PAge `R/G` ERA G GF CG SHO tSho SV IP
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2022 15 377 29.1 4.47 4.14 1292 1283 9 3 83 331 11451…
## 2 2021 15 497 28.9 4.48 4.20 2428 2399 29 18 166 596 21265…
## 3 2020 15 383 28.4 4.66 4.47 898 880 18 7 57 193 7689.1
## 4 2019 15 426 28.3 4.71 4.38 2430 2412 18 11 143 607 21732…
## 5 2018 15 431 28.2 4.34 4.02 2432 2415 17 8 169 621 21828…
## 6 2017 15 381 28.4 4.63 4.34 2430 2403 27 13 135 605 21626…
## # … with 19 more variables: H <chr>, R <chr>, ER <chr>, HR <chr>, BB <chr>,
## # IBB <chr>, SO <chr>, HBP <chr>, BK <chr>, WP <chr>, BF <chr>, WHIP <chr>,
## # BAbip <chr>, H9 <chr>, HR9 <chr>, BB9 <chr>, SO9 <chr>, `SO/W` <chr>,
## # E <chr>
### Since the first table is averages and the second is totals, get rid of the second
nl = nl1
rm(nl1, nl2)
head(nl)
## # A tibble: 6 × 32
## Year Tms `#P` PAge `R/G` ERA G GF CG SHO tSho SV IP
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2022 15 377 29.1 4.47 4.14 1292 0.99 0.01 0.00 0.06 0.26 8.86
## 2 2021 15 497 28.9 4.48 4.20 2428 0.99 0.01 0.01 0.07 0.25 8.76
## 3 2020 15 383 28.4 4.66 4.47 898 0.98 0.02 0.01 0.06 0.21 8.56
## 4 2019 15 426 28.3 4.71 4.38 2430 0.99 0.01 0.00 0.06 0.25 8.94
## 5 2018 15 431 28.2 4.34 4.02 2432 0.99 0.01 0.00 0.07 0.26 8.98
## 6 2017 15 381 28.4 4.63 4.34 2430 0.99 0.01 0.01 0.06 0.25 8.90
## # … with 19 more variables: H <chr>, R <chr>, ER <chr>, HR <chr>, BB <chr>,
## # IBB <chr>, SO <chr>, HBP <chr>, BK <chr>, WP <chr>, BF <chr>, WHIP <chr>,
## # BAbip <chr>, H9 <chr>, HR9 <chr>, BB9 <chr>, SO9 <chr>, `SO/W` <chr>,
## # E <chr>
### Repeat the above for the AL
### Get the AL pitching page
url = 'http://www.baseball-reference.com/leagues/AL/pitch.shtml#teams_standard_pitching::none'
webpage = read_html(url)
### Use the rvest functions html_nodes and html_table to extract what we want
### Grab the tables on the page
al_table = html_nodes(webpage, 'table')
al_table
## {xml_nodeset (2)}
## [1] <table class="sortable stats_table" id="teams_standard_pitching" data-col ...
## [2] <table class="sortable stats_table" id="teams_standard_pitching_totals" d ...
### Assign the first table to the data frame
al = html_table(al_table)[[1]]
head(al)
## # A tibble: 6 × 32
## Year Tms `#P` PAge `R/G` ERA G GF CG SHO tSho SV IP
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2022 15 374 28.6 4.19 3.86 1284 0.99 0.01 0.00 0.08 0.25 8.86
## 2 2021 15 471 28.5 4.58 4.32 2430 0.99 0.01 0.00 0.05 0.24 8.79
## 3 2020 15 375 28.2 4.63 4.42 898 0.99 0.01 0.01 0.05 0.26 8.66
## 4 2019 15 456 28.4 4.95 4.60 2428 0.99 0.01 0.01 0.05 0.24 8.93
## 5 2018 15 420 28.6 4.56 4.27 2430 0.99 0.01 0.00 0.06 0.26 8.91
## 6 2017 15 417 28.6 4.67 4.37 2430 0.99 0.01 0.01 0.05 0.24 8.90
## # … with 19 more variables: H <chr>, R <chr>, ER <chr>, HR <chr>, BB <chr>,
## # IBB <chr>, SO <chr>, HBP <chr>, BK <chr>, WP <chr>, BF <chr>, WHIP <chr>,
## # BAbip <chr>, H9 <chr>, HR9 <chr>, BB9 <chr>, SO9 <chr>, `SO/W` <chr>,
## # E <chr>
### Did we get the right years for the leagues?
range(al$Year)
## [1] "1901" "Year"
table(al$Year)
##
## 1901 1902 1903 1904 1905 1906 1907 1908 1909 1910 1911 1912 1913 1914 1915 1916
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1917 1918 1919 1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1933 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 Year
## 1 1 1 1 1 1 1 1 1 1 4
### Oops, al seems to have multiple headers in the data and "Year" appears within Year. We fix that
### by filtering out (!= is not equal) the bad obs and then converting Year to a numeric.
### The scraped columns all came in as character, so we convert ERA as well for plotting below.
al = al %>%
filter(Year != "Year") %>%
mutate(Year = as.numeric(Year), ERA = as.numeric(ERA)) %>%
filter(Year < 2022) ### As before, we drop 2022 because it's not complete
range(al$Year)
## [1] 1901 2021
nl = nl %>%
filter(Year != "Year") %>%
mutate(Year = as.numeric(Year), ERA = as.numeric(ERA)) %>%
filter(Year < 2022) ### As before, we drop 2022 because it's not complete
range(nl$Year)
## [1] 1876 2021
# That's better. Now compare ERAs. Plot the NL solid and the AL dashed to match the legend.
plot(nl$Year,nl$ERA,type="l", lty=1, xlab="Year", ylab="Mean ERA")
lines(al$Year,al$ERA, lty=2)
abline(v=2010,lty=3)
abline(v=1901, lty=3)
legend(1978,3.4,legend=c("AL","NL"),lty=c(2,1))
A bit of work, but it’s nice to know that we can get up-to-date data.
We move on to Figure 3.7.
###
### Create Figure 3.7
### We can use the data already imported for Figure 3.5
head(roberts)
## Year Player Age Tm W L G IP H ER HR BB SO ERA X.lgERA X.ERA.
## 1 1948 Roberts 21 PHI 7 9 20 146.7 148 52 10 61 84 3.19 3.96 124
## 2 1949 Roberts 22 PHI 15 15 43 226.7 229 93 15 75 95 3.69 3.96 107
## 3 1950 Roberts 23 PHI 20 11 40 304.3 282 102 29 77 146 3.02 4.06 135
## 4 1951 Roberts 24 PHI 21 15 44 315.0 284 106 20 64 127 3.03 3.84 127
## 5 1952 Roberts 25 PHI 28 7 39 330.0 292 95 22 45 148 2.59 3.66 141
## 6 1953 Roberts 26 PHI 23 16 44 346.7 324 106 30 61 198 2.75 4.20 152
### Note that x.ERA. (*ERA) is equal to the formula in the book (within rounding)
cbind(roberts$X.lgERA/roberts$ERA*100, roberts$X.ERA.)[1:5,]
## [,1] [,2]
## [1,] 124.1379 124
## [2,] 107.3171 107
## [3,] 134.4371 135
## [4,] 126.7327 127
## [5,] 141.3127 141
stem.leaf.backback(ford$X.ERA.,roberts$X.ERA., Max=190)
## ________________________________
## 1 | 2: represents 12, leaf unit: 1
## ford$X.ERA. roberts$X.ERA.
## ________________________________
## | 7 |03 2
## | 8 |4 3
## | 9 |366 6
## 1 5| 10 |67 8
## 4 975| 11 |
## (4) 9831| 12 |123347 (6)
## (1) 5| 13 |566 5
## 7 30| 14 |1 2
## 5 63| 15 |2 1
## | 16 |
## 3 60| 17 |
## | 18 |
## 1 1| 19 |
## ________________________________
## n: 16 19
## ________________________________
This back-to-back stemplot is similar (within truncation/rounding) to Figure 3.7 in TSUB2.
Albert now looks at home runs and the rate at which they occur. For this case we download the data from Albert’s GitHub site.
### Get the data. Be careful about getting the "raw" CSV file from GitHub
case.3.3 <- read.csv("https://raw.githubusercontent.com/bayesball/Teaching-Statistics-Using-Baseball/master/case_3_3.csv")
### See what's in it
head(case.3.3)
## YEAR HR G RATE
## 1 1927 54 156 0.35
## 2 1927 84 153 0.55
## 3 1927 109 155 0.70
## 4 1927 74 153 0.48
## 5 1927 29 153 0.19
## 6 1927 39 154 0.25
### Get the number of teams in each year to compare to Table 3.4
case.3.3 %>%
group_by(YEAR) %>%
summarize(nTeams = n())
## # A tibble: 4 × 2
## YEAR nTeams
## <int> <int>
## 1 1927 16
## 2 1961 18
## 3 1998 30
## 4 2001 30
### Or
table(case.3.3$YEAR)
##
## 1927 1961 1998 2001
## 16 18 30 30
It looks like the data are okay.
R doesn’t generate parallel stemplots, so for Figure 3.8 we create multiple back-to-back stemplots.
homers.1927 = case.3.3[case.3.3$YEAR==1927,] ### Subset those rows/obs that are YEAR == 1927
homers.1961 = case.3.3[case.3.3$YEAR==1961,] ### Blank after the comma means keep all columns
homers.1998 = case.3.3[case.3.3$YEAR==1998,]
homers.2001 = case.3.3[case.3.3$YEAR==2001,]
p_load(aplpack)
stem.leaf.backback(homers.1927$RATE,homers.1961$RATE, Min=.1, Max=1.6)
## ______________________________
## 1 | 2: represents 0.12, leaf unit: 0.01
## homers.1927$RATE
## homers.1961$RATE
## ______________________________
## 4 9887| 1 |
## 7 544| 2 |
## (5) 76553| 3 |
## 4 8| 4 |
## 3 5| 5 |6 1
## | 6 |669 4
## 2 0| 7 |4 5
## | 8 |35 7
## | 9 |13 (2)
## 1 2| 10 |234 (3)
## | 11 |0378 6
## | 12 |1 2
## | 13 |
## | 14 |7 1
## | 15 |
## | 16 |
## ______________________________
## n: 16 18
## ______________________________
stem.leaf.backback(homers.1927$RATE,homers.1998$RATE, Min=.1, Max=1.6)
## __________________________________
## 1 | 2: represents 0.12, leaf unit: 0.01
## homers.1927$RATE
## homers.1998$RATE
## __________________________________
## 4 9887| 1 |
## 7 544| 2 |
## (5) 76553| 3 |
## 4 8| 4 |
## 3 5| 5 |
## | 6 |68 2
## 2 0| 7 |017 5
## | 8 |245 8
## | 9 |1124889 (7)
## 1 2| 10 |223 (3)
## | 11 |3 12
## | 12 |12478 11
## | 13 |02367 6
## | 14 |5 1
## | 15 |
## | 16 |
## __________________________________
## n: 16 30
## __________________________________
stem.leaf.backback(homers.1927$RATE,homers.2001$RATE, Min=.1, Max=1.6)
## _________________________________________
## 1 | 2: represents 0.12, leaf unit: 0.01
## homers.1927$RATE homers.2001$RATE
## _________________________________________
## 4 9887| 1 |
## 7 544| 2 |
## (5) 76553| 3 |
## 4 8| 4 |
## 3 5| 5 |
## | 6 |
## 2 0| 7 |5 1
## | 8 |146 4
## | 9 |14899 9
## 1 2| 10 |112479 (6)
## | 11 |
## | 12 |0033367889 (10)
## | 13 |112 5
## | 14 |5 2
## | 15 |2 1
## | 16 |
## _________________________________________
## n: 16 30
## _________________________________________
stem.leaf.backback(homers.1961$RATE,homers.1998$RATE, Min=.1, Max=1.6)
## __________________________________
## 1 | 2: represents 0.12, leaf unit: 0.01
## homers.1961$RATE
## homers.1998$RATE
## __________________________________
## | 1 |
## | 2 |
## | 3 |
## | 4 |
## 1 6| 5 |
## 4 966| 6 |68 2
## 5 4| 7 |017 5
## 7 53| 8 |245 8
## (2) 31| 9 |1124889 (7)
## (3) 432| 10 |223 (3)
## 6 8730| 11 |3 12
## 2 1| 12 |12478 11
## | 13 |02367 6
## 1 7| 14 |5 1
## | 15 |
## | 16 |
## __________________________________
## n: 18 30
## __________________________________
stem.leaf.backback(homers.1961$RATE,homers.2001$RATE, Min=.1, Max=1.6)
## _________________________________________
## 1 | 2: represents 0.12, leaf unit: 0.01
## homers.1961$RATE homers.2001$RATE
## _________________________________________
## | 1 |
## | 2 |
## | 3 |
## | 4 |
## 1 6| 5 |
## 4 966| 6 |
## 5 4| 7 |5 1
## 7 53| 8 |146 4
## (2) 31| 9 |14899 9
## (3) 432| 10 |112479 (6)
## 6 8730| 11 |
## 2 1| 12 |0033367889 (10)
## | 13 |112 5
## 1 7| 14 |5 2
## | 15 |2 1
## | 16 |
## _________________________________________
## n: 18 30
## _________________________________________
stem.leaf.backback(homers.1998$RATE,homers.2001$RATE, Min=.1, Max=1.6)
## _________________________________________
## 1 | 2: represents 0.12, leaf unit: 0.01
## homers.1998$RATE homers.2001$RATE
## _________________________________________
## | 1 |
## | 2 |
## | 3 |
## | 4 |
## | 5 |
## 2 86| 6 |
## 5 710| 7 |5 1
## 8 542| 8 |146 4
## (7) 9884211| 9 |14899 9
## (3) 322| 10 |112479 (6)
## 12 3| 11 |
## 11 87421| 12 |0033367889 (10)
## 6 76320| 13 |112 5
## 1 5| 14 |5 2
## | 15 |2 1
## | 16 |
## _________________________________________
## n: 30 30
## _________________________________________
A quick dotplot might help here.
### Figure 3.8 can be generated using the lattice package
p_load(lattice)
dotplot(RATE ~ factor(YEAR), data=case.3.3, cex=1.25, pch=1, jitter.x=TRUE)
dotplot(factor(YEAR) ~ RATE, data=case.3.3, cex=1.25, pch=1, jitter.y=TRUE)
### or ggplot2
plt = case.3.3 %>%
ggplot(aes(x = factor(YEAR), y = RATE)) +
geom_dotplot(binaxis = "y", stackdir = "center", binwidth=0.05, dotsize=0.75)
plt
plt + coord_flip()
Tough to interpret? Yep. Maybe a boxplot would be better.
### Create 5-number summaries
by(case.3.3[,"RATE"],case.3.3[,"YEAR"], summary)
## case.3.3[, "YEAR"]: 1927
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1700 0.2275 0.3400 0.3725 0.3975 1.0200
## ------------------------------------------------------------
## case.3.3[, "YEAR"]: 1961
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5600 0.7625 0.9750 0.9544 1.1225 1.4700
## ------------------------------------------------------------
## case.3.3[, "YEAR"]: 1998
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.660 0.865 1.005 1.040 1.262 1.450
## ------------------------------------------------------------
## case.3.3[, "YEAR"]: 2001
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.750 0.990 1.145 1.123 1.278 1.520
### and figure out how many teams there were in each year
cnts = table(case.3.3$YEAR)
cnts
##
## 1927 1961 1998 2001
## 16 18 30 30
### or
case.3.3 %>%
group_by(YEAR) %>%
summarise(n = n(),
min = fivenum(RATE)[1],
Q1 = fivenum(RATE)[2],
median = fivenum(RATE)[3],
Q3 = fivenum(RATE)[4],
max = fivenum(RATE)[5]
)
## # A tibble: 4 × 7
## YEAR n min Q1 median Q3 max
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1927 16 0.17 0.215 0.34 0.425 1.02
## 2 1961 18 0.56 0.74 0.975 1.13 1.47
## 3 1998 30 0.66 0.85 1.00 1.27 1.45
## 4 2001 30 0.75 0.99 1.14 1.28 1.52
### The Figure 3.9 boxplots look like
boxplot(RATE~YEAR, data=case.3.3, ylab="Home Run Rate (HR/G)", varwidth=TRUE)
p_load(lattice)
bwplot(YEAR~RATE, data=case.3.3, ylab="Home Run Rate (HR/G)", varwidth=TRUE)
bwplot(RATE~YEAR, data=case.3.3, ylab="Home Run Rate (HR/G)", varwidth=TRUE, horizontal = FALSE)
p_load(ggplot2)
ggplot(case.3.3, aes(factor(YEAR), RATE)) + geom_boxplot(varwidth=TRUE) + xlab("Year") + ylab("Home Run Rate (HR/G)")
ggplot(case.3.3,aes(factor(YEAR), RATE)) + geom_boxplot(varwidth=TRUE) + coord_flip() + xlab("Year") + ylab("Home Run Rate (HR/G)")
The text now discusses the idea of the normal distribution. We will supplement the discussion with a couple of “modern” techniques. For this case we download the data from the author’s site. Note that many of the values disagree with those in the text. Please let me know if you figure out why.
### Get the data. Be careful about getting the "raw" CSV file from GitHub
case.3.4 <- read.csv("https://raw.githubusercontent.com/bayesball/Teaching-Statistics-Using-Baseball/master/case_3_4.csv")
head(case.3.4)
## playerID SLG
## 1 abreujo02 0.5809353
## 2 ackledu01 0.3984064
## 3 adamsma01 0.4573055
## 4 altuvjo01 0.4530303
## 5 amarial01 0.3144208
## 6 andruel01 0.3327948
case.3.4$SLG
## [1] 0.5809353 0.3984064 0.4573055 0.4530303 0.3144208 0.3327948 0.3604888
## [8] 0.5000000 0.3786078 0.5244123 0.3475336 0.4918033 0.4019851 0.4401349
## [15] 0.3624535 0.3603604 0.5057283 0.4528302 0.3488372 0.3732252 0.3788707
## [22] 0.4450085 0.3869801 0.4577465 0.5237316 0.4118896 0.4503043 0.2896040
## [29] 0.4537815 0.3747899 0.4911243 0.3939962 0.3655914 0.4375000 0.4500000
## [36] 0.4267782 0.3736264 0.3003953 0.3145336 0.3890020 0.3628510 0.5252855
## [43] 0.4044444 0.4570858 0.4013015 0.3857442 0.4300169 0.5665138 0.3297872
## [50] 0.4555921 0.4163880 0.4805447 0.4149184 0.4012346 0.4191304 0.5471698
## [57] 0.3765112 0.4064665 0.3403361 0.3955774 0.3986175 0.4589615 0.4612850
## [64] 0.3831169 0.4216216 0.4340909 0.4159483 0.5418719 0.4721649 0.4773519
## [71] 0.4822335 0.4316163 0.3776683 0.3882979 0.3325000 0.3552398 0.3724008
## [78] 0.4903846 0.3723404 0.3563433 0.3839442 0.3672655 0.4407666 0.3808463
## [85] 0.3976143 0.3796134 0.4462659 0.3588517 0.3371212 0.3467337 0.3777240
## [92] 0.3778706 0.3132530 0.3608247 0.4689441 0.4112903 0.3827751 0.5064695
## [99] 0.3970827 0.4195906 0.3300000 0.3822115 0.4554656 0.3481781 0.3800000
## [106] 0.4038462 0.3545817 0.4649573 0.3862928 0.4525253 0.5532880 0.3639775
## [113] 0.5650624 0.3714286 0.4060606 0.5419708 0.3571429 0.3873518 0.3861386
## [120] 0.3701431 0.4960159 0.4748858 0.4380000 0.3610503 0.3846154 0.4026846
## [127] 0.4192771 0.3950104 0.5173745 0.4548673 0.3686200 0.3756806 0.4446154
## [134] 0.4428571 0.4031142 0.3722944 0.4230769 0.4899452 0.4123134 0.4802867
## [141] 0.4660348 0.4083601 0.4271255 0.4476615 0.4730832 0.3610649 0.3983607
## [148] 0.3983740 0.5267176 0.3940520 0.4149660 0.4269871 0.4716049 0.3538462
## [155] 0.4542373 0.3255361 0.3314815 0.4401806 0.3688699 0.4163934 0.5547310
## [162] 0.3827434 0.3977273 0.5614618 0.3333333 0.4911661 0.4074703 0.4351464
## [169] 0.3251232 0.4053537 0.4667969 0.4550562 0.3880597 0.3738318 0.4020619
## [176] 0.3947368 0.4041096
We can create Figure 3.10. Again, rounding creates some minor differences.
### Create Figure 3.10
stem(case.3.4$SLG)
##
## The decimal point is 2 digit(s) to the left of the |
##
## 28 | 0
## 30 | 0345
## 32 | 560013337
## 34 | 07889455679
## 36 | 0011123467990122234456788899
## 38 | 00123334566677889445567888889
## 40 | 112233444566781225566699
## 42 | 023777024588
## 44 | 00135568003334455567789
## 46 | 1567922357
## 48 | 012001126
## 50 | 0667
## 52 | 4457
## 54 | 22735
## 56 | 157
## 58 | 1
### or
p_load(aplpack)
stem.leaf(case.3.4$SLG)
## 1 | 2: represents 0.12
## leaf unit: 0.01
## n: 177
## 1 2. | 8
## 5 3* | 0111
## 14 t | 222333333
## 25 f | 44444555555
## 54 s | 66666666666677777777777777777
## 82 3. | 8888888888888888999999999999
## (25) 4* | 0000000000000011111111111
## 70 t | 22222333333
## 59 f | 44444444555555555555555
## 36 s | 6666677777
## 26 4. | 888899999
## 17 5* | 0001
## 13 t | 2222
## 9 f | 44455
## 4 s | 666
## 1 5. | 8
The values that Albert presents can be computed fairly easily.
### Compute the statistics that Albert presents
n = length(case.3.4$SLG)
n # The number of observations
## [1] 177
xbar = mean(case.3.4$SLG)
xbar # The sample mean
## [1] 0.4158045
median(case.3.4$SLG)
## [1] 0.4038462
mean(case.3.4$SLG, trim=0.05) # A trimmed mean -- drops 5% from each tail (10% total)
## [1] 0.4137492
variance = var(case.3.4$SLG)
sd = sqrt(variance)
sd # Standard deviation
## [1] 0.05955793
sqrt(variance/n) # Standard error of the mean
## [1] 0.004476648
summary(case.3.4$SLG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2896 0.3738 0.4038 0.4158 0.4542 0.5809
### Computing the mean plus/minus k standard deviations is "easy". Use rounded values to match Albert.
xbar = .416
sd = 0.060
xbar + c(-1,1)%*%t(1:3)*sd # Compute values for k=1,2,3 at the same time
## [,1] [,2] [,3]
## [1,] 0.356 0.296 0.236
## [2,] 0.476 0.536 0.596
# Compare with actual values
length(case.3.4$SLG[.356 < case.3.4$SLG & case.3.4$SLG < .476])
## [1] 128
length(case.3.4$SLG[.296 < case.3.4$SLG & case.3.4$SLG < .536])
## [1] 167
length(case.3.4$SLG[.236 < case.3.4$SLG & case.3.4$SLG < .596])
## [1] 177
### Convert the counts computed above to percentages. (The original hard-coded 126 and 178,
### which don't match the counts and n found above.) Compare with the 68-95-99.7 rule.
c(128,167,177)/n * 100
## [1]  72.31638  94.35028 100.00000
Graphical methods are sometimes helpful. A graphical method for checking normality is the normal probability or normal quantile plot. If the data are normally distributed, what we observe and what we expect to see should be the same. Hence, the plotted values should fall on the diagonal line.
histogram(~SLG, data=case.3.4)
qqnorm(case.3.4$SLG)
qqline(case.3.4$SLG)
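If you prefer ggplot2 for the quantile plot, stat_qq and stat_qq_line do the same job (a sketch assuming ggplot2 3.0 or later, where stat_qq_line is available):
p_load(ggplot2)
ggplot(case.3.4, aes(sample=SLG)) +
stat_qq() + ### Plot sample quantiles against normal quantiles
stat_qq_line() + ### Add the reference line
labs(x="Theoretical Quantiles", y="Sample Quantiles")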
There are also some “tests” for normality.
p_load(nortest)
ad.test(case.3.4$SLG)
##
## Anderson-Darling normality test
##
## data: case.3.4$SLG
## A = 1.6986, p-value = 0.0002268
cvm.test(case.3.4$SLG)
##
## Cramer-von Mises normality test
##
## data: case.3.4$SLG
## W = 0.29369, p-value = 0.0003855
lillie.test(case.3.4$SLG)
##
## Lilliefors (Kolmogorov-Smirnov) normality test
##
## data: case.3.4$SLG
## D = 0.093368, p-value = 0.0007091
pearson.test(case.3.4$SLG)
##
## Pearson chi-square normality test
##
## data: case.3.4$SLG
## P = 24.492, p-value = 0.0269
sf.test(case.3.4$SLG)
##
## Shapiro-Francia normality test
##
## data: case.3.4$SLG
## W = 0.97109, p-value = 0.001516
shapiro.test(case.3.4$SLG)
##
## Shapiro-Wilk normality test
##
## data: case.3.4$SLG
## W = 0.96939, p-value = 0.0006182
Since all of the p-values are “small,” it appears unlikely that the data are truly normal. How far from normal they are is debatable.
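One more visual check is to overlay the fitted normal density on a density-scale histogram; if the data were normal, the curve and the bars should roughly agree (a base-graphics sketch):
hist(case.3.4$SLG, freq=FALSE, xlab="SLG", main="") ### freq=FALSE puts the histogram on the density scale
curve(dnorm(x, mean=mean(case.3.4$SLG), sd=sd(case.3.4$SLG)), add=TRUE, lty=2) ### Fitted normal curve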
The last case in this chapter discusses batting averages. For this case we download the data from the author’s site.
### Get the data. Be careful about getting the "raw" CSV file from GitHub
case.3.5 <- read.csv("https://raw.githubusercontent.com/bayesball/Teaching-Statistics-Using-Baseball/master/case_3_5.csv")
head(case.3.5)
## Lastname Firstname Year AB H X2B X3B HR
## 1 Appling Luke 1941 592 186 26 8 1
## 2 Benjamin Stan 1941 480 113 20 7 3
## 3 Berardino Johnny 1941 469 127 30 4 5
## 4 Bloodworth Jimmy 1941 506 124 24 3 7
## 5 Boudreau Lou 1941 579 149 45 8 10
## 6 Bragan Bobby 1941 557 140 19 3 4
dim(case.3.5)
## [1] 98 8
table(case.3.5$Year)
##
## 1941
## 98
It appears that the data on GitHub aren’t quite right. We can pull the data from Albert’s tsub package instead.
case.3.5 <- case_3_5
table(case.3.5$Year)
##
## 1941 1977 1980 1994
## 98 168 148 63
head(case.3.5)
## Lastname Firstname Year AB H X2B X3B HR
## 1 Appling Luke 1941 592 186 26 8 1
## 2 Benjamin Stan 1941 480 113 20 7 3
## 3 Berardino Johnny 1941 469 127 30 4 5
## 4 Bloodworth Jimmy 1941 506 124 24 3 7
## 5 Boudreau Lou 1941 579 149 45 8 10
## 6 Bragan Bobby 1941 557 140 19 3 4
Or, we can go to the source… Mr. Lahman.
### Unfortunately, the online dataset is for 1941 and not 1977. We go to
### Lahman to get the 1977 data.
p_load(Lahman)
data(Batting)
head(Batting)
## playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB SO
## 1 abercda01 1871 1 TRO NA 1 4 0 0 0 0 0 0 0 0 0 0
## 2 addybo01 1871 1 RC1 NA 25 118 30 32 6 0 0 13 8 1 4 0
## 3 allisar01 1871 1 CL1 NA 29 137 28 40 4 5 0 19 3 1 2 5
## 4 allisdo01 1871 1 WS3 NA 27 133 28 44 10 2 2 27 1 1 0 2
## 5 ansonca01 1871 1 RC1 NA 25 120 29 39 11 3 0 16 6 2 2 1
## 6 armstbo01 1871 1 FW1 NA 12 49 9 11 2 1 0 5 0 1 0 1
## IBB HBP SH SF GIDP
## 1 NA NA NA NA 0
## 2 NA NA NA NA 0
## 3 NA NA NA NA 1
## 4 NA NA NA NA 0
## 5 NA NA NA NA 0
## 6 NA NA NA NA 0
batting = battingStats()
dim(batting)
## [1] 110495 29
batting = batting[batting$yearID %in% c(1941, 1977, 1980, 1994),]
dim(batting)
## [1] 3546 29
batting = batting[batting$lgID %in% c("AL","NL"),]
dim(batting)
## [1] 3546 29
batting = batting[batting$AB >= 400, ]
dim(batting)
## [1] 477 29
table(batting$yearID)
##
## 1941 1977 1980 1994
## 98 168 148 63
### Or, the tidyverse approach
p_load(tidyverse)
batting = battingStats() %>% ### Get the supplemented batting data
filter(yearID %in% c(1941, 1977, 1980, 1994)) %>% ### Keep the right years
filter(lgID %in% c("AL", "NL")) %>% ### Keep only the major leagues
filter(AB >= 400) ### Keep people who actually batted
table(batting$yearID)
##
## 1941 1977 1980 1994
## 98 168 148 63
### Or, the shorter tidyverse approach
p_load(tidyverse)
batting = battingStats() %>% ### Get the supplemented batting data
filter(yearID %in% c(1941, 1977, 1980, 1994) ### Keep the right years
& lgID %in% c("AL", "NL") ### Keep only the major leagues
& AB >= 400 ### Keep people who actually batted
)
table(batting$yearID)
##
## 1941 1977 1980 1994
## 98 168 148 63
It’s nice to know that the tsub package data are the same as Lahman’s.
So, onward with the great batters. Albert looks at 1977, and there appear to have been 168 MLB batters with at least 400 at-bats that year. We can analyze them as Albert did to see if our conclusions are the same.
### Get the 1977 data
batting.1977 <- batting %>% filter(yearID == 1977)
### Figure 3.11
stem(batting.1977$BA)
##
## The decimal point is 2 digit(s) to the left of the |
##
## 20 | 06
## 21 | 6
## 22 | 9
## 23 | 0133559
## 24 | 001135566778
## 25 | 00111222344567799
## 26 | 001111223334444556677899
## 27 | 001122233445555557788999
## 28 | 00000122223333444446777788999
## 29 | 000111122336778899
## 30 | 00022445778899
## 31 | 011225788
## 32 | 002568
## 33 | 668
## 34 |
## 35 |
## 36 |
## 37 |
## 38 | 8
p_load(aplpack)
stem.leaf(batting.1977$BA)
## 1 | 2: represents 0.012
## leaf unit: 0.001
## n: 168
## LO: 0.2 0.206
## 3 21 | 6
## 4 22 | 9
## 11 23 | 0133559
## 23 24 | 001135566778
## 40 25 | 00111222344567799
## 64 26 | 001111223334444556677899
## (24) 27 | 001122233445555557788999
## 80 28 | 00000122223333444446777788999
## 51 29 | 000111122336778899
## 33 30 | 00022445778899
## 19 31 | 011225788
## 10 32 | 002568
## 4 33 | 668
## HI: 0.388
Note that Albert’s text says 168, but his stemplot shows 173. This is strange. We’ll continue computing statistics and see if they differ from the text.
summary(batting.1977$BA)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2000 0.2607 0.2780 0.2774 0.2923 0.3880
xbar = mean(batting.1977$BA)
xbar
## [1] 0.2774226
sd = sqrt(var(batting.1977$BA))
sd
## [1] 0.02717155
batting.1977$BA.z = (batting.1977$BA - xbar)/sd
stem(batting.1977$BA.z)
##
## The decimal point is at the |
##
## -2 | 86
## -2 | 3
## -1 | 8776666
## -1 | 444333222211100000
## -0 | 99999988887766666666555555555
## -0 | 44443333322222221111111100
## 0 | 0011111111122222222222223444444444
## 0 | 55555555566777888888899
## 1 | 000111122222334
## 1 | 555666889
## 2 | 222
## 2 |
## 3 |
## 3 |
## 4 | 1
summary(batting.1977$BA.z)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.84940 -0.61361 0.02125 0.00000 0.54570 4.06960
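By construction, the z-scores should have mean 0 and standard deviation 1, which gives us a quick sanity check (rounding suppresses floating-point noise in the mean):
round(c(mean=mean(batting.1977$BA.z), sd=sd(batting.1977$BA.z)), 10)
## mean   sd 
##    0    1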
Table 3.6 includes the 1900 season, so we can throw it in too. We compute the means and standard deviations.
p_load(tidyverse)
table.3.6 = battingStats() %>% ### Get the supplemented batting data
filter(yearID %in% c(1900, 1941, 1977, 1980, 1994) ### Keep the right years
& lgID %in% c("AL", "NL") ### Keep only the major leagues
& AB >= 400 ### Keep people who actually batted
) %>%
group_by(yearID) %>% ### Group by year
summarize(BA.mean = mean(BA), ### Compute mean BA by year
BA.sd = sqrt(var(BA)), ### Compute the sd of BA by year
n = n() ### Confirm the number of batters
)
table.3.6
## # A tibble: 5 × 4
## yearID BA.mean BA.sd n
## <int> <dbl> <dbl> <int>
## 1 1900 0.296 0.0374 45
## 2 1941 0.281 0.0328 98
## 3 1977 0.277 0.0272 168
## 4 1980 0.279 0.0275 148
## 5 1994 0.293 0.0320 63
Chapter 4 moves from looking at a single variable in isolation to looking at how variables behave together. Changing one may cause the other to change, or they may just change together. We discuss both causality and association.
In this section, Albert looks at team offensive stats. We will use the data from Albert’s site.
### Get the data. Be careful about getting the "raw" CSV file from GitHub
case.4.1 <- read.csv("https://raw.githubusercontent.com/bayesball/Teaching-Statistics-Using-Baseball/master/case_4_1.csv")
head(case.4.1)
## Tm NBat BatAge R.G G PA AB R H X2B X3B HR RBI SB CS BB SO
## 1 ARI 52 27.6 3.80 162 6089 5552 615 1379 259 47 118 573 86 33 398 1165
## 2 ATL 39 26.8 3.54 162 6064 5468 573 1316 240 22 123 545 95 33 472 1369
## 3 BAL 44 28.3 4.35 162 6130 5596 705 1434 264 16 211 681 44 20 401 1285
## 4 BOS 55 29.2 3.91 162 6226 5551 634 1355 282 20 123 601 63 25 535 1337
## 5 CHC 48 26.8 3.79 162 6102 5508 614 1315 270 31 157 590 65 40 442 1477
## 6 CHW 44 27.7 4.07 162 6077 5543 660 1400 279 32 155 625 85 36 417 1362
## BA OBP SLG OPS OPS. TB GDP HBP SH SF IBB LOB
## 1 0.248 0.302 0.376 0.678 88 2086 115 43 56 36 31 1092
## 2 0.241 0.305 0.360 0.665 87 1969 121 43 53 27 31 1128
## 3 0.256 0.311 0.422 0.734 106 2363 112 62 35 36 29 1088
## 4 0.244 0.316 0.369 0.684 92 2046 138 68 20 52 36 1168
## 5 0.239 0.300 0.385 0.684 88 2118 94 54 57 41 29 1069
## 6 0.253 0.310 0.398 0.708 100 2208 127 60 19 38 33 1071
dim(case.4.1)
## [1] 30 29
The data appear to be similar to Table 4.1.
Albert now looks at the relationships that exist between these variables using scatter plots.
### Figure 4.1 is just a scatter plot
plot(case.4.1$SLG, case.4.1$OBP, xlab="SLG", ylab="OBP")
plot(OBP~SLG, data=case.4.1, xlab="SLG", ylab="OBP")
p_load(lattice)
xyplot(OBP~SLG, data=case.4.1, xlab="SLG", ylab="OBP")
p_load(ggplot2)
ggplot(case.4.1, aes(SLG, OBP)) + geom_point() + xlab("SLG") + ylab("OBP") + theme_bw()
### Figure 4.2 uses three letter team designators
plot(case.4.1$X2B, case.4.1$X3B, ylab="Triples", xlab="Doubles", type="n") ### plot nothing to leave room for the text
text(case.4.1$X2B, case.4.1$X3B, case.4.1$Tm, cex=0.75) ### add the text
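ggplot2’s geom_text gives the same labeled scatter plot in a single layer (a sketch; the text size is a matter of taste):
p_load(ggplot2)
ggplot(case.4.1, aes(x=X2B, y=X3B, label=Tm)) +
geom_text(size=2.5) + ### Plot the team abbreviations instead of points
labs(x="Doubles", y="Triples")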
### Analysis of Table 4.2 data
### The data are somewhat hidden
search() ### Look to see which paths are available
## [1] ".GlobalEnv" "package:forcats" "package:purrr"
## [4] "package:readr" "package:tibble" "package:tidyverse"
## [7] "package:nortest" "package:tidyr" "package:stringr"
## [10] "package:rvest" "package:sqldf" "package:RSQLite"
## [13] "package:gsubfn" "package:proto" "package:retrosheet"
## [16] "package:devtools" "package:usethis" "package:plotrix"
## [19] "package:aplpack" "package:lubridate" "package:ggplot2"
## [22] "package:lattice" "package:Lahman" "package:tsub"
## [25] "package:dplyr" "package:pacman" "package:stats"
## [28] "package:graphics" "package:grDevices" "package:utils"
## [31] "package:datasets" "package:methods" "Autoloads"
## [34] "package:base"
ls(pos=23) ### Position 23 on my machine is the Lahman package, whose contents are listed below. Albert's tsub data (like case_4_1b) live one slot over, at position 24.
## [1] "AllstarFull" "Appearances" "AwardsManagers"
## [4] "AwardsPlayers" "AwardsShareManagers" "AwardsSharePlayers"
## [7] "Batting" "battingLabels" "BattingPost"
## [10] "battingStats" "CollegePlaying" "Fielding"
## [13] "fieldingLabels" "FieldingOF" "FieldingOFsplit"
## [16] "FieldingPost" "HallOfFame" "HomeGames"
## [19] "Label" "LahmanData" "Managers"
## [22] "ManagersHalf" "Parks" "People"
## [25] "Pitching" "pitchingLabels" "PitchingPost"
## [28] "playerInfo" "Salaries" "Schools"
## [31] "SeriesPost" "teamInfo" "Teams"
## [34] "TeamsFranchises" "TeamsHalf"
head(case_4_1b) ### The case_4_1b object contains the proper data.
## teamID yearID HomeRun.Rate Triples.Rate
## 1 WS1 1916 0.0023 0.0117
## 2 CHN 1983 0.0254 0.0076
## 3 ATL 1982 0.0265 0.0040
## 4 CLE 1983 0.0157 0.0057
## 5 SDN 2003 0.0231 0.0058
## 6 SDN 1984 0.0198 0.0076
table.4.2 = tsub::case_4_1b ### Make a local copy of the case_4_1b data
names(table.4.2)
## [1] "teamID" "yearID" "HomeRun.Rate" "Triples.Rate"
# It's time to make Figure 4.3
p_load(lattice)
xyplot(Triples.Rate~HomeRun.Rate, data=table.4.2, xlab="Home Runs per Game", ylab="Triples per Game")
### Create Figure 4.4
# To create the factor variable that contains the year groups we use "cut"
table.4.2$yearGRP = cut(table.4.2$yearID,
breaks=c(1901,1931,1961,1991,2014),
include.lowest=TRUE,
labels=c("1901-1930","1931-1960","1961-1990","1991-2014")
)
p_load(ggplot2)
ggplot(table.4.2, aes(x=HomeRun.Rate, y=Triples.Rate, shape=yearGRP, col=yearGRP)) +
geom_point() + xlab("Home Runs per Game") + ylab("Triples per Game")
Albert now turns to looking at offensive statistics that are associated with scoring runs. We will use the data from Albert’s tsub package.
case.4.2 = case_4_2 ### Make a local copy
head(case.4.2)
## Tm NBat BatAge R.G G PA AB R H X2B X3B HR RBI SB CS BB SO
## 1 ARI 52 27.6 3.80 162 6089 5552 615 1379 259 47 118 573 86 33 398 1165
## 2 ATL 39 26.8 3.54 162 6064 5468 573 1316 240 22 123 545 95 33 472 1369
## 3 BAL 44 28.3 4.35 162 6130 5596 705 1434 264 16 211 681 44 20 401 1285
## 4 BOS 55 29.2 3.91 162 6226 5551 634 1355 282 20 123 601 63 25 535 1337
## 5 CHC 48 26.8 3.79 162 6102 5508 614 1315 270 31 157 590 65 40 442 1477
## 6 CHW 44 27.7 4.07 162 6077 5543 660 1400 279 32 155 625 85 36 417 1362
## BA OBP SLG OPS OPS. TB GDP HBP SH SF IBB LOB
## 1 0.248 0.302 0.376 0.678 88 2086 115 43 56 36 31 1092
## 2 0.241 0.305 0.360 0.665 87 1969 121 43 53 27 31 1128
## 3 0.256 0.311 0.422 0.734 106 2363 112 62 35 36 29 1088
## 4 0.244 0.316 0.369 0.684 92 2046 138 68 20 52 36 1168
## 5 0.239 0.300 0.385 0.684 88 2118 94 54 57 41 29 1069
## 6 0.253 0.310 0.398 0.708 100 2208 127 60 19 38 33 1071
names(case.4.2)
## [1] "Tm" "NBat" "BatAge" "R.G" "G" "PA" "AB" "R"
## [9] "H" "X2B" "X3B" "HR" "RBI" "SB" "CS" "BB"
## [17] "SO" "BA" "OBP" "SLG" "OPS" "OPS." "TB" "GDP"
## [25] "HBP" "SH" "SF" "IBB" "LOB"
### Make the pairs plot
pairs(case.4.2[,c("R","OBP","SLG","X2B","X3B","HR")])
### The correlations presented in Table 4.3 are easily computed
case.4.2$TB = case.4.2$AB * case.4.2$SLG
table.4.3 = round(cor(case.4.2[,c("R", "OBP", "SLG", "X2B", "X3B", "HR")]),3) ### The values are rounded to 3 decimal places
table.4.3
## R OBP SLG X2B X3B HR
## R 1.000 0.797 0.857 0.737 0.202 0.578
## OBP 0.797 1.000 0.668 0.724 0.104 0.260
## SLG 0.857 0.668 1.000 0.676 0.209 0.788
## X2B 0.737 0.724 0.676 1.000 0.090 0.238
## X3B 0.202 0.104 0.209 0.090 1.000 -0.094
## HR 0.578 0.260 0.788 0.238 -0.094 1.000
### Table 4.4 can be constructed by rearranging the rows of the first column of Table 4.3
as.matrix(table.4.3[c(3,2,4,6,5), 1], ncol=1)
## [,1]
## SLG 0.857
## OBP 0.797
## X2B 0.737
## HR 0.578
## X3B 0.202
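Rather than hard-coding the row order, we can let R do the sorting by pulling the correlations with R (runs) out of the matrix and sorting them (a sketch):
sort(table.4.3[-1, "R"], decreasing=TRUE) ### Drop R's correlation with itself, then sort
##   SLG   OBP   X2B    HR   X3B 
## 0.857 0.797 0.737 0.578 0.202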
We now look at the most valuable hitting statistics. In this section we get to play with linear regression and prediction. We will use the data from Albert’s package.
### We get the Table 4.4 data from Albert's tsub package.
p_load(tsub)
case.4.3 = case_4_3 ### make a local copy of the dataframe
names(case.4.3)
## [1] "teamID" "R" "G" "R.G" "AVG" "OBP" "SLG" "OPS"
## [9] "RC" "TA" "BRA"
head(case.4.3) ### Table 4.5 similar data
## teamID R G R.G AVG OBP SLG OPS RC TA BRA
## 1 BAL 705 162 4.35 0.256 0.311 0.422 0.733 723 0.664 0.131
## 2 BOS 634 162 3.91 0.244 0.316 0.369 0.685 635 0.615 0.117
## 3 CHA 660 162 4.07 0.253 0.310 0.398 0.708 673 0.634 0.123
## 4 CLE 669 162 4.13 0.253 0.317 0.389 0.706 683 0.641 0.123
## 5 DET 757 162 4.67 0.277 0.331 0.426 0.757 790 0.698 0.141
## 6 HOU 629 162 3.88 0.242 0.309 0.383 0.692 636 0.624 0.118
Figure 4.6 is a plot that is then made into 4.7 by adding a regression line and the residuals.
# Figure 4.6 looks like
plot(R.G~AVG, data=case.4.3, ylab="Runs per Game")
# We compute a linear regression line
summary(lm(R.G~AVG, data=case.4.3))
##
## Call:
## lm(formula = R.G ~ AVG, data = case.4.3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.36344 -0.10671 -0.07335 0.13420 0.50569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.003 1.905 -0.526 0.6075
## AVG 20.480 7.524 2.722 0.0174 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.264 on 13 degrees of freedom
## Multiple R-squared: 0.363, Adjusted R-squared: 0.314
## F-statistic: 7.409 on 1 and 13 DF, p-value: 0.01745
coef(lm(R.G~AVG, data=case.4.3))
## (Intercept) AVG
## -1.002905 20.480394
# Albert starts looking at goodness of fit. R helps us with this.
rpg.lm.avg = lm(R.G~AVG, data=case.4.3) ### Fit and save the regression equation (and some other stuff as well)
case.4.3$predrpg = predict(rpg.lm.avg) ### Get the predicted values
case.4.3$residrpg = resid(rpg.lm.avg) ### Get the residuals or errors
# Figure 4.7
plot(R.G~AVG, data=case.4.3) ### Plot the points
abline(rpg.lm.avg) ### Add the regression line computed above
segments(case.4.3$AVG, case.4.3$R.G, case.4.3$AVG, case.4.3$predrpg, col="red") ### Add the vertical errors or residuals
text(case.4.3$AVG[c(8,7)]+0.0025, case.4.3$R.G[c(8,7)] , case.4.3$teamID[c(8,7)], cex=0.75) ### Label the interesting points
### Other methods for getting points and the regression line
# lattice
p_load(lattice)
xyplot(R.G ~ AVG, data=case.4.3, type=c("p","r")) ### Get p=points and r=regression
# ggplot2
p_load(ggplot2)
ggplot(data=case.4.3, aes(x=AVG, y=R.G)) + ### Define the data, set the aesthetics, "+" continues the plot on next line
geom_point() + ### Plot the points, "+" continues the plot on the next line
geom_smooth(method ="lm", se=FALSE, color="maroon") ### Add the regression line without confidence band
## `geom_smooth()` using formula 'y ~ x'
case.4.3 %>% ### Pipe the data to ggplot using %>%
ggplot(aes(x=AVG, y=R.G)) + ### The rest is the same as above except that the data have been piped above
geom_point() +
geom_smooth(method="lm", se=FALSE, color="gray")
## `geom_smooth()` using formula 'y ~ x'
Mean squared error (MSE) is a measure of the quality of the fit of the regression line. It is the average of the squared errors or residuals. The root mean squared error (RMSE) is the almost-average error; almost, because the square root of a sum is not equal to the sum of the square roots.
# Compute the Mean Squared Error (MSE)
case.4.3$sqresidrpg = case.4.3$residrpg^2 ### Square the residuals found above
n = nrow(case.4.3) ### Count the number of observations
mse = sum(case.4.3$sqresidrpg)/n ### Sum the squared residuals and divide by n. This really should be divide by n-2.
mse
## [1] 0.06039262
rmse = sqrt(mse)
rmse
## [1] 0.2457491
sort(case.4.3$residrpg)
## [1] -0.3634385832 -0.3100758232 -0.2757522747 -0.1086346404 -0.1047914861
## [6] -0.0843110919 -0.0843110919 -0.0733503033 -0.0486346404 -0.0001641031
## [11] 0.1099241768 0.1584829939 0.2108849653 0.4684829939 0.5056889081
# Albert suggests that the residuals are normally distributed. R helps us
# check this.
plot(rpg.lm.avg)
# Or on one page...
par(mfrow=c(2,2))
plot(rpg.lm.avg)
par(mfrow=c(1,1))
# Next we try OBP
rpg.lm.obp = lm(R.G~OBP, data=case.4.3)
coef(rpg.lm.obp)
## (Intercept) OBP
## -5.155568 29.574132
Tables 4.6 and 4.7 give the values used in computing the RMSE.
### Table 4.6 uses AVG
cbind(case.4.3$teamID, round(case.4.3[,c("AVG","R.G")],3), pred=round(predict(rpg.lm.avg),3),
res=round(resid(rpg.lm.avg),3), sqrres=round(resid(rpg.lm.avg)^2,3))
## case.4.3$teamID AVG R.G pred res sqrres
## 1 BAL 0.256 4.35 4.240 0.110 0.012
## 2 BOS 0.244 3.91 3.994 -0.084 0.007
## 3 CHA 0.253 4.07 4.179 -0.109 0.012
## 4 CLE 0.253 4.13 4.179 -0.049 0.002
## 5 DET 0.277 4.67 4.670 0.000 0.000
## 6 HOU 0.242 3.88 3.953 -0.073 0.005
## 7 KCA 0.263 4.02 4.383 -0.363 0.132
## 8 LAA 0.259 4.77 4.302 0.468 0.219
## 9 MIN 0.254 4.41 4.199 0.211 0.044
## 10 NYA 0.245 3.91 4.015 -0.105 0.011
## 11 OAK 0.244 4.50 3.994 0.506 0.256
## 12 SEA 0.244 3.91 3.994 -0.084 0.007
## 13 TBA 0.247 3.78 4.056 -0.276 0.076
## 14 TEX 0.256 3.93 4.240 -0.310 0.096
## 15 TOR 0.259 4.46 4.302 0.158 0.025
mse = sum(resid(rpg.lm.avg)^2)/(nrow(case.4.3)-2) ### We divide by n-2 because we used 2 pieces of info to get int and slope
rmse = sqrt(mse)
c(mse, rmse)
## [1] 0.06968379 0.26397687
anova(rpg.lm.avg)
## Analysis of Variance Table
##
## Response: R.G
## Df Sum Sq Mean Sq F value Pr(>F)
## AVG 1 0.51631 0.51631 7.4093 0.01745 *
## Residuals 13 0.90589 0.06968
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
### Table 4.7 uses OBP
cbind(case.4.3$teamID, round(case.4.3[,c("OBP","R.G")],3), pred=round(predict(rpg.lm.obp),3),
res=round(resid(rpg.lm.obp),3), sqrres=round(resid(rpg.lm.obp)^2,3))
## case.4.3$teamID OBP R.G pred res sqrres
## 1 BAL 0.311 4.35 4.042 0.308 0.095
## 2 BOS 0.316 3.91 4.190 -0.280 0.078
## 3 CHA 0.310 4.07 4.012 0.058 0.003
## 4 CLE 0.317 4.13 4.219 -0.089 0.008
## 5 DET 0.331 4.67 4.633 0.037 0.001
## 6 HOU 0.309 3.88 3.983 -0.103 0.011
## 7 KCA 0.314 4.02 4.131 -0.111 0.012
## 8 LAA 0.322 4.77 4.367 0.403 0.162
## 9 MIN 0.324 4.41 4.426 -0.016 0.000
## 10 NYA 0.307 3.91 3.924 -0.014 0.000
## 11 OAK 0.320 4.50 4.308 0.192 0.037
## 12 SEA 0.300 3.91 3.717 0.193 0.037
## 13 TBA 0.317 3.78 4.219 -0.439 0.193
## 14 TEX 0.314 3.93 4.131 -0.201 0.040
## 15 TOR 0.323 4.46 4.397 0.063 0.004
mse = sum(resid(rpg.lm.obp)^2)/(nrow(case.4.3)-2) ### We divide by n-2 because we used 2 pieces of info to get int and slope
rmse = sqrt(mse)
c(mse, rmse)
## [1] 0.05252667 0.22918697
anova(rpg.lm.obp)
## Analysis of Variance Table
##
## Response: R.G
## Df Sum Sq Mean Sq F value Pr(>F)
## OBP 1 0.73935 0.73935 14.076 0.002419 **
## Residuals 13 0.68285 0.05253
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As noted in the comments, the proper formula for mean squared error divides by \(n-2\) and not \(n\). This is a correction to the amount of information that we have left over after computing the intercept and slope. We call these leftover pieces of information degrees of freedom. The correct formula is thus \[ \text{MSE} = \frac{\sum_{i=1}^n \left(\text{OBSERVED}_i - \text{PREDICTED}_i \right)^2}{n-2} = \frac{\sum_{i=1}^n \left(y_i - \widehat{y_i} \right)^2}{n-2} \]
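As a check on the hand computations, stats::sigma() extracts exactly this \(n-2\) version of the RMSE (the residual standard error) from a fitted model (a sketch assuming R 3.3 or later, where sigma() was introduced):
c(AVG=sigma(rpg.lm.avg), OBP=sigma(rpg.lm.obp)) ### Matches the rmse values computed above
##       AVG       OBP 
## 0.2639769 0.2291870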
We now switch to looking at multiple predictor variables by using multiple linear regression. Albert fits a model using the six batting events (singles, doubles, triples, home runs, walks, and hit by pitch) to predict runs per game.
### The initial model that Albert fits uses the 2014 data from Case 4.1. This is the same as case_4_4
MLB2014 = case_4_4
### We called this MLB2014
head(MLB2014)
## Tm NBat BatAge R.G G PA AB R H X2B X3B HR RBI SB CS BB SO
## 1 ARI 52 27.6 3.80 162 6089 5552 615 1379 259 47 118 573 86 33 398 1165
## 2 ATL 39 26.8 3.54 162 6064 5468 573 1316 240 22 123 545 95 33 472 1369
## 3 BAL 44 28.3 4.35 162 6130 5596 705 1434 264 16 211 681 44 20 401 1285
## 4 BOS 55 29.2 3.91 162 6226 5551 634 1355 282 20 123 601 63 25 535 1337
## 5 CHC 48 26.8 3.79 162 6102 5508 614 1315 270 31 157 590 65 40 442 1477
## 6 CHW 44 27.7 4.07 162 6077 5543 660 1400 279 32 155 625 85 36 417 1362
## BA OBP SLG OPS OPS. TB GDP HBP SH SF IBB LOB
## 1 0.248 0.302 0.376 0.678 88 2086 115 43 56 36 31 1092
## 2 0.241 0.305 0.360 0.665 87 1969 121 43 53 27 31 1128
## 3 0.256 0.311 0.422 0.734 106 2363 112 62 35 36 29 1088
## 4 0.244 0.316 0.369 0.684 92 2046 138 68 20 52 36 1168
## 5 0.239 0.300 0.385 0.684 88 2118 94 54 57 41 29 1069
## 6 0.253 0.310 0.398 0.708 100 2208 127 60 19 38 33 1071
names(MLB2014)
## [1] "Tm" "NBat" "BatAge" "R.G" "G" "PA" "AB" "R"
## [9] "H" "X2B" "X3B" "HR" "RBI" "SB" "CS" "BB"
## [17] "SO" "BA" "OBP" "SLG" "OPS" "OPS." "TB" "GDP"
## [25] "HBP" "SH" "SF" "IBB" "LOB"
### We can now fit the optimized average
MLB2014$X1B = MLB2014$H - (MLB2014$X2B + MLB2014$X3B + MLB2014$HR)
rpg.lm.optavg = lm(R.G ~ X1B + X2B + X3B + HR + BB + HBP, data=MLB2014)
coef(rpg.lm.optavg)
## (Intercept) X1B X2B X3B HR
## -1.8352666443 0.0025173114 0.0056001510 0.0076324824 0.0072430677
## BB HBP
## 0.0017538445 -0.0009130335
It is likely that the differences that we are seeing here are the result of Albert having rounded his data. The RMSE is
mse = sum(resid(rpg.lm.optavg)^2)/(nrow(MLB2014)-7)
rmse = sqrt(mse)
c(mse, rmse)
## [1] 0.01730754 0.13155812
anova(rpg.lm.optavg)
## Analysis of Variance Table
##
## Response: R.G
## Df Sum Sq Mean Sq F value Pr(>F)
## X1B 1 1.15388 1.15388 66.6693 3.016e-08 ***
## X2B 1 0.90625 0.90625 52.3614 2.295e-07 ***
## X3B 1 0.02087 0.02087 1.2060 0.283496
## HR 1 0.76495 0.76495 44.1975 8.790e-07 ***
## BB 1 0.23774 0.23774 13.7363 0.001163 **
## HBP 1 0.00278 0.00278 0.1607 0.692214
## Residuals 23 0.39807 0.01731
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Albert fits other single variable models to fill Table 4.10.
n = nrow(MLB2014)
MLB2014$avg = MLB2014$H/MLB2014$AB
rpg.lm.avg = lm(R.G ~ avg, data=MLB2014)
rmse.avg = sqrt(sum(resid(rpg.lm.avg)^2)/(n-2))
MLB2014$tb = 1*MLB2014$X1B + 2*MLB2014$X2B + 3*MLB2014$X3B + 4*MLB2014$HR
MLB2014$slg = MLB2014$tb/MLB2014$AB
rpg.lm.slg = lm(R.G ~ slg, data=MLB2014)
rmse.slg = sqrt(sum(resid(rpg.lm.slg)^2)/(n-2))
MLB2014$obp = (MLB2014$H + MLB2014$BB + MLB2014$HBP)/(MLB2014$AB + MLB2014$BB + MLB2014$HBP + MLB2014$SF)
rpg.lm.obp = lm(R.G ~ obp, data=MLB2014)
rmse.obp = sqrt(sum(resid(rpg.lm.obp)^2)/(n-2))
MLB2014$rc = (MLB2014$H + MLB2014$BB)*MLB2014$tb/(MLB2014$AB + MLB2014$BB)
rpg.lm.rc = lm(R.G ~ rc, data=MLB2014)
rmse.rc = sqrt(sum(resid(rpg.lm.rc)^2)/(n-2))
MLB2014$ops = MLB2014$obp + MLB2014$slg
rpg.lm.ops = lm(R.G ~ ops, data=MLB2014)
rmse.ops = sqrt(sum(resid(rpg.lm.ops)^2)/(n-2))
rmse.rpg = round(c(rmse.avg, rmse.obp, rmse.slg, rmse.ops, rmse.rc),3)
data.frame(Statistic=c("AVG","OBP","SLG","OPS","RC"), rmse.rpg, rmse.n = round(rmse.rpg*sqrt((n-2)/n),3))
## Statistic rmse.rpg rmse.n
## 1 AVG 0.212 0.205
## 2 OBP 0.212 0.205
## 3 SLG 0.182 0.176
## 4 OPS 0.150 0.145
## 5 RC 0.133 0.128
Table 4.11 is easy to create using information from above. Albert takes the coefficients, which he views as weights of the events, and standardizes them by dividing by the “weight” of a single.
t(round(rbind(coef(rpg.lm.optavg)[-1],coef(rpg.lm.optavg)[-1]/coef(rpg.lm.optavg)[2]),5))
## [,1] [,2]
## X1B 0.00252 1.00000
## X2B 0.00560 2.22466
## X3B 0.00763 3.03200
## HR 0.00724 2.87730
## BB 0.00175 0.69671
## HBP -0.00091 -0.36270
We now switch from models that are linear in their coefficients to one that is not: Albert’s Pythagorean relationship between a team’s wins and losses and its runs scored and allowed.
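The model here is the Pythagorean expectation, \(W/L = (R/RA)^k\), which is non-linear in \(k\). Taking logarithms linearizes it, \[ \log\left(\frac{W}{L}\right) = k \, \log\left(\frac{R}{RA}\right), \] which is why the code below regresses \(\log(W/L)\) on \(\log(R/RA)\) with the intercept suppressed.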
### We get the Table 4.12 data from Albert's tsub package.
case.4.5 = case_4_5
head(case.4.5)
## teamID W L R RA
## 1 ARI 64 98 615 742
## 2 ATL 79 83 573 597
## 3 BAL 96 66 705 593
## 4 BOS 71 91 634 715
## 5 CHA 73 89 660 758
## 6 CHN 73 89 614 707
### Alternatively, we could pull the table from Lahman's data.
require(pacman)
p_load(Lahman, sqldf)
data(Teams)
tbl.4.12 = sqldf("SELECT teamId, W, L, R, RA FROM Teams WHERE (yearID=2014 AND (lgID='AL' OR lgID='NL')) ORDER BY teamID")
head(tbl.4.12)
## teamID W L R RA
## 1 ARI 64 98 615 742
## 2 ATL 79 83 573 597
## 3 BAL 96 66 705 593
## 4 BOS 71 91 634 715
## 5 CHA 73 89 660 758
## 6 CHN 73 89 614 707
It’s nice to see that the values are the same. We continue by creating Figure 4.8.
### Create Figure 4.8 using Lahman data
tbl.4.12$lnwl = log(tbl.4.12$W/tbl.4.12$L)
tbl.4.12$lnrra = log(tbl.4.12$R/tbl.4.12$RA)
plot(lnwl~lnrra, data=tbl.4.12, xlab="log(R/RA)", ylab="log(W/L)")
### Fit the linearized model through the point (0,0)---i.e. without intercept
tbl.4.12.lm = lm(lnwl ~ lnrra - 1, data=tbl.4.12)
summary(tbl.4.12.lm)
##
## Call:
## lm(formula = lnwl ~ lnrra - 1, data = tbl.4.12)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.26466 -0.05703 0.01246 0.05430 0.17586
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## lnrra 1.8057 0.1509 11.97 9.65e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09813 on 29 degrees of freedom
## Multiple R-squared: 0.8316, Adjusted R-squared: 0.8258
## F-statistic: 143.2 on 1 and 29 DF, p-value: 9.655e-13
### Try again using data from Albert's tsub
### Create Figure 4.8
case.4.5$lnwl = log(case.4.5$W/case.4.5$L)
case.4.5$lnrra = log(case.4.5$R/case.4.5$RA)
plot(lnwl~lnrra, data=case.4.5, xlab="log(R/RA)", ylab="log(W/L)")
# Fit the linearized model through the point (0,0)---i.e. without intercept
case.4.5.lm = lm(lnwl ~ lnrra - 1, data=case.4.5)
summary(case.4.5.lm)
##
## Call:
## lm(formula = lnwl ~ lnrra - 1, data = case.4.5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.26466 -0.05703 0.01246 0.05430 0.17586
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## lnrra 1.8057 0.1509 11.97 9.65e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09813 on 29 degrees of freedom
## Multiple R-squared: 0.8316, Adjusted R-squared: 0.8258
## F-statistic: 143.2 on 1 and 29 DF, p-value: 9.655e-13
coef(tbl.4.12.lm)
## lnrra
## 1.805664
As found in the text, the observed coefficient (1.81) is approximately 2.
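We can ask whether 2 is a statistically plausible value for \(k\) with a confidence interval for the coefficient (a sketch using confint; with the estimate 1.81 and standard error 0.15 from the summary above, the interval works out to roughly (1.5, 2.1), so \(k=2\) is well within range):
confint(tbl.4.12.lm) ### 95% confidence interval for the Pythagorean exponent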
# Figure 4.9 can be plotted as
par(mfrow=c(2,1))
plot(lnwl ~ lnrra, data=tbl.4.12, xlab="log(R/RA)", ylab="log(W/L)")
abline(tbl.4.12.lm)
plot(resid(tbl.4.12.lm)~tbl.4.12$lnrra, xlab="log(R/RA)", ylab="Residual")
abline(h=0, lty=2)
par(mfrow=c(1,1))
# Note that R has the default diagnostic plots
par(mfrow=c(2,2))
plot(tbl.4.12.lm)
par(mfrow=c(1,1))
Table 4.13 is generated by using the fitted coefficient to predict wins and then computing the residuals by hand.
tbl.4.12$predw = tbl.4.12$L*(tbl.4.12$R/tbl.4.12$RA)^coef(tbl.4.12.lm) ###coef is 1.805
tbl.4.12$residw = tbl.4.12$W - tbl.4.12$predw
tbl.4.13 = cbind(tbl.4.12[,c("teamID","W","L")], Predicted=round(tbl.4.12$predw,1), Residual=round(tbl.4.12$residw,3))
print(tbl.4.13)
## teamID W L Predicted Residual
## 1 ARI 64 98 69.8 -5.825
## 2 ATL 79 83 77.1 1.927
## 3 BAL 96 66 90.2 5.799
## 4 BOS 71 91 73.2 -2.241
## 5 CHA 73 89 69.3 3.686
## 6 CHN 73 89 69.0 4.009
## 7 CIN 76 86 81.7 -5.735
## 8 CLE 85 77 80.4 4.560
## 9 COL 66 96 83.1 -17.066
## 10 DET 90 72 81.9 8.127
## 11 HOU 70 92 71.5 -1.543
## 12 KCA 89 73 78.8 10.197
## 13 LAA 98 64 92.6 5.404
## 14 LAN 94 68 89.4 4.589
## 15 MIA 77 85 78.5 -1.511
## 16 MIL 82 80 78.5 3.532
## 17 MIN 70 92 79.2 -9.173
## 18 NYA 84 78 71.5 12.451
## 19 NYN 79 83 85.7 -6.687
## 20 OAK 88 74 114.7 -26.663
## 21 PHI 73 89 73.7 -0.732
## 22 PIT 88 74 85.1 2.851
## 23 SDN 77 85 74.2 2.843
## 24 SEA 87 75 95.7 -8.683
## 25 SFN 88 74 85.5 2.532
## 26 SLN 90 72 75.5 14.514
## 27 TBA 77 85 81.8 -4.834
## 28 TEX 67 95 67.0 0.015
## 29 TOR 83 79 86.9 -3.860
## 30 WAS 96 66 96.8 -0.766
These values do not match Albert’s. We can try using the Pythagorean approach of \(k=2\).
tbl.4.12$predw = tbl.4.12$L*(tbl.4.12$R/tbl.4.12$RA)^2 ### coef is 2
tbl.4.12$residw = tbl.4.12$W - tbl.4.12$predw
cbind(tbl.4.12[,c("teamID","W","L")], Predicted=round(tbl.4.12$predw,1), Pythag=round(tbl.4.12$residw,3))
## teamID W L Predicted Pythag
## 1 ARI 64 98 67.3 -3.324
## 2 ATL 79 83 76.5 2.539
## 3 BAL 96 66 93.3 2.715
## 4 BOS 71 91 71.5 -0.550
## 5 CHA 73 89 67.5 5.526
## 6 CHN 73 89 67.1 5.874
## 7 CIN 76 86 81.3 -5.289
## 8 CLE 85 77 80.8 4.180
## 9 COL 66 96 81.8 -15.782
## 10 DET 90 72 83.0 6.987
## 11 HOU 70 92 69.6 0.367
## 12 KCA 89 73 79.5 9.546
## 13 LAA 98 64 96.4 1.649
## 14 LAN 94 68 92.1 1.915
## 15 MIA 77 85 77.8 -0.843
## 16 MIL 82 80 78.3 3.696
## 17 MIN 70 92 77.9 -7.904
## 18 NYA 84 78 70.9 13.113
## 19 NYN 79 83 86.0 -6.981
## 20 OAK 88 74 120.2 -32.197
## 21 PHI 73 89 72.3 0.747
## 22 PIT 88 74 86.4 1.555
## 23 SDN 77 85 73.1 3.924
## 24 SEA 87 75 98.2 -11.225
## 25 SFN 88 74 86.8 1.196
## 26 SLN 90 72 75.9 14.128
## 27 TBA 77 85 81.5 -4.501
## 28 TEX 67 95 64.5 2.488
## 29 TOR 83 79 87.8 -4.752
## 30 WAS 96 66 100.8 -4.834
Still not right. Let’s try figuring out what Albert used for \(k\).
### Find k using the predicted win values from the text
log(95.8/64)/log(773/630) # Angels
## [1] 1.971945
log(92.0/68)/log(718/617) # Dodgers
## [1] 1.99393
# Both should have been 1.81 which is the "right" answer on average.
### Try the same for our predicted values.
log(96.4/64)/log(773/630) # Angels
## [1] 2.002467
log(92.1/68)/log(718/617) # Dodgers
## [1] 2.001096
# Both appear to be 2 and not 1.81. Maybe he used the Pythagorean approach.
It looks like Albert used \(k=2\) instead of \(k=1.81\) in his estimates and residuals.
Based on our results, Albert’s contention that about 80% of the predictions land within 4 wins doesn’t hold up. It’s more like 13/30, or 43%.
p_load(aplpack)
stem.leaf(abs(tbl.4.13$Residual))
## 1 | 2: represents 12
## leaf unit: 1
## n: 30
## 6 0* | 000111
## 13 t | 2222333
## (8) f | 44445555
## 9 s | 6
## 8 0. | 889
## 5 1* | 0
## 4 t | 2
## 3 f | 4
## 2 s | 7
## HI: 26.663
tbl.4.13 %>%
mutate(abs.resid = abs(Residual)) %>%
filter(abs.resid <= 4) %>%
summarize(n = n())
## n
## 1 13
table(abs(tbl.4.13$Residual) <= 4)/30*100
##
## FALSE TRUE
## 56.66667 43.33333
quantile(abs(tbl.4.13$Residual))
## 0% 25% 50% 75% 100%
## 0.01500 2.60975 4.57450 7.76700 26.66300
Albert now brings up the concept of regression to the mean — those things that were big will get smaller and those that were small will get bigger. He focuses on slugging (SLG), but could have used any baseball stat.
### We get the Table 4.14 data from Albert's tsub package.
case.4.6 = case_4_6 ### Make a local copy
head(case.4.6)
## Name SLG.2013 SLG.2014 Improvement
## 1 Domonic_Brown 0.494 0.349 -0.145
## 2 Melky_Cabrera 0.360 0.458 0.098
## 3 Miguel_Cabrera 0.636 0.524 -0.112
## 4 Alberto_Callaspo 0.369 0.290 -0.079
## 5 Zack_Cozart 0.381 0.300 -0.081
## 6 Chris_Davis 0.634 0.404 -0.230
names(case.4.6)
## [1] "Name" "SLG.2013" "SLG.2014" "Improvement"
Figure 4.11 and the regression line are computed using the methods from above.
plot(SLG.2014 ~ SLG.2013, data=case.4.6, xlab="SLG 2013", ylab="SLG 2014")
case.4.6.lm1 = lm(SLG.2014 ~ SLG.2013, case.4.6)
abline(case.4.6.lm1)
coef(case.4.6.lm1)
## (Intercept) SLG.2013
## 0.2421790 0.3580688
summary(case.4.6.lm1)
##
## Call:
## lm(formula = SLG.2014 ~ SLG.2013, data = case.4.6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.08431 -0.04457 0.01647 0.03507 0.08692
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.24218 0.06577 3.682 0.0017 **
## SLG.2013 0.35807 0.14480 2.473 0.0236 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05251 on 18 degrees of freedom
## Multiple R-squared: 0.2536, Adjusted R-squared: 0.2121
## F-statistic: 6.115 on 1 and 18 DF, p-value: 0.0236
# Compute Davis's predicted 2014 value
c(1,.634)%*%coef(case.4.6.lm1)
## [,1]
## [1,] 0.4691947
mean(case.4.6$SLG.2013) ### Albert says the mean is 0.469; we get 0.4469
## [1] 0.4469
c(1,.634)%*%coef(case.4.6.lm1)-mean(case.4.6$SLG.2014) ### Albert got this one right at .067
## [,1]
## [1,] 0.06699468
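The matrix product above is a hand-rolled version of what predict() does with a one-row data frame (a sketch):
predict(case.4.6.lm1, newdata=data.frame(SLG.2013=0.634))
##         1 
## 0.4691947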
Now look at improvement, plotted in Figure 4.12, and the corresponding regression model.
plot(Improvement ~ SLG.2013, data=case.4.6, xlab="SLG 2013", ylab="Improvement in SLG")
case.4.6.lm2 = lm(Improvement ~ SLG.2013, case.4.6)
abline(case.4.6.lm2)
cor(case.4.6[,-1]) ### Remove the Name column
## SLG.2013 SLG.2014 Improvement
## SLG.2013 1.0000000 0.5035738 -0.7224750
## SLG.2014 0.5035738 1.0000000 0.2335145
## Improvement -0.7224750 0.2335145 1.0000000
coef(case.4.6.lm2)
## (Intercept) SLG.2013
## 0.2421790 -0.6419312
summary(case.4.6.lm2)
##
## Call:
## lm(formula = Improvement ~ SLG.2013, data = case.4.6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.08431 -0.04457 0.01647 0.03507 0.08692
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.24218 0.06577 3.682 0.001704 **
## SLG.2013 -0.64193 0.14480 -4.433 0.000321 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05251 on 18 degrees of freedom
## Multiple R-squared: 0.522, Adjusted R-squared: 0.4954
## F-statistic: 19.65 on 1 and 18 DF, p-value: 0.0003209
### Albert doesn't look at residuals, but you ought to.
par(mfrow=c(2,2))
plot(case.4.6.lm1)
plot(case.4.6.lm2)
par(mfrow=c(1,1))
The above code is intended to show one or two ways of computing some of the statistics that might be used in looking at sports related data. In a few cases, our results differed from Albert’s. While this is unfortunate, the process of identifying possible reasons for these differences is very informative. In the real world it is important to verify results. Having experience in debugging analyses is a good thing. While I am pretty sure that Albert did not intend to include the errors that we found, we should be glad that we had the opportunity to find them.