Read Baseball Data

Authors

Oliver d’Pug

Chinley d’Pug

Read Data

The data for the multiple linear regression model are stored in two formats: a CSV file and an R data file. We read the CSV file with read.csv() and the R data file with load().

read.csv()

Use read.csv() to read the CSV file into a data frame. Then use head() to list the first six observations and names() to list the variables.

   team_data = read.csv("Data/teamData_2001_to_2020.csv")

   head(team_data)
  yearID lgID teamID franchID divID Rank   G Ghome  W   L DivWin WCWin LgWin
1   2001   AL    TBA      TBD     E    5 162    81 62 100      N     N     N
2   2001   AL    BAL      BAL     E    4 162    80 63  98      N     N     N
3   2001   AL    KCA      KCR     C    5 162    81 65  97      N     N     N
4   2001   AL    DET      DET     C    4 162    81 66  96      N     N     N
5   2001   AL    TEX      TEX     W    4 162    82 73  89      N     N     N
6   2001   AL    ANA      ANA     W    3 162    81 75  87      N     N     N
  WSWin   R   AB    H X2B X3B  HR  BB   SO  SB CS HBP SF  RA  ER  ERA CG SHO SV
1     N 672 5524 1426 311  21 121 456 1116 115 52  54 25 887 781 4.94  1   6 30
2     N 687 5472 1359 262  24 136 514  989 133 53  77 49 829 744 4.67 10   6 31
3     N 729 5643 1503 277  37 152 406  898 100 42  44 47 858 779 4.87  5   1 30
4     N 724 5537 1439 291  60 139 466  972 133 61  51 49 876 795 5.01 16   2 34
5     N 890 5685 1566 326  23 246 548 1093  97 32  75 55 968 913 5.71  4   3 37
6     N 691 5551 1447 275  26 158 494 1001 116 52  77 53 730 671 4.20  6   1 43
  IPouts   HA HRA BBA  SOA   E  DP    FP                 name
1   4271 1513 207 569 1030 139 144 0.977 Tampa Bay Devil Rays
2   4297 1504 194 528  938 125 137 0.979    Baltimore Orioles
3   4320 1537 209 576  911 117 204 0.981   Kansas City Royals
4   4288 1624 180 553  859 131 164 0.979       Detroit Tigers
5   4315 1670 222 596  951 114 167 0.981        Texas Rangers
6   4313 1452 168 525  947 103 142 0.983       Anaheim Angels
                         park attendance BPF PPF teamIDBR teamIDlahman45
1             Tropicana Field    1298365  98 100      TBD            TBA
2 Oriole Park at Camden Yards    3094841  95  96      BAL            BAL
3            Kauffman Stadium    1536371 107 108      KCR            KCA
4               Comerica Park    1921305  93  95      DET            DET
5   The Ballpark at Arlington    2831021 104 105      TEX            TEX
6  Edison International Field    2000919 101 101      ANA            ANA
  teamIDretro  X1B   TB      RPG       AVG       SLG       OBP       RC
1         TBA  973 2142 4.148148 0.2581463 0.3877625 0.3195247 163.5749
2         BAL  937 2077 4.240741 0.2483553 0.3795687 0.3190445 178.5728
3         KCA 1037 2310 4.500000 0.2663477 0.4093567 0.3180782 155.2923
4         DET  949 2267 4.469136 0.2598880 0.4094275 0.3204981 176.2221
5         TEX  971 2676 5.493827 0.2754617 0.4707124 0.3440201 235.5229
6         ANA  988 2248 4.265432 0.2606738 0.4049721 0.3268016 183.9469
        OPS
1 0.7072872
2 0.6986132
3 0.7274349
4 0.7299256
5 0.8147325
6 0.7317737
   names(team_data)
 [1] "yearID"         "lgID"           "teamID"         "franchID"      
 [5] "divID"          "Rank"           "G"              "Ghome"         
 [9] "W"              "L"              "DivWin"         "WCWin"         
[13] "LgWin"          "WSWin"          "R"              "AB"            
[17] "H"              "X2B"            "X3B"            "HR"            
[21] "BB"             "SO"             "SB"             "CS"            
[25] "HBP"            "SF"             "RA"             "ER"            
[29] "ERA"            "CG"             "SHO"            "SV"            
[33] "IPouts"         "HA"             "HRA"            "BBA"           
[37] "SOA"            "E"              "DP"             "FP"            
[41] "name"           "park"           "attendance"     "BPF"           
[45] "PPF"            "teamIDBR"       "teamIDlahman45" "teamIDretro"   
[49] "X1B"            "TB"             "RPG"            "AVG"           
[53] "SLG"            "OBP"            "RC"             "OPS"           
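
As a quick illustration of the read.csv() step, the sketch below round-trips a small data frame through a temporary CSV file and inspects the result. The object names mini, csv_path, and mini_csv are hypothetical and are not part of the original analysis; the values are copied from a few columns of the six rows listed above.

```r
# Hypothetical miniature of the team data: a few columns of the six rows shown above
mini <- data.frame(
  yearID = rep(2001L, 6),
  lgID   = rep("AL", 6),
  teamID = c("TBA", "BAL", "KCA", "DET", "TEX", "ANA"),
  R      = c(672L, 687L, 729L, 724L, 890L, 691L),
  H      = c(1426L, 1359L, 1503L, 1439L, 1566L, 1447L),
  BB     = c(456L, 514L, 406L, 466L, 548L, 494L),
  stringsAsFactors = FALSE
)

# Write to a temporary file so the example does not depend on the Data/ folder
csv_path <- tempfile(fileext = ".csv")
write.csv(mini, csv_path, row.names = FALSE)

# Read the file back; stringsAsFactors = FALSE keeps text columns as character
mini_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

dim(mini_csv)   # number of rows and columns
str(mini_csv)   # type of each column
```

On the full file, dim(team_data) and str(team_data) give the same kind of summary and complement head() and names().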

load()

Use load() to restore the teamData data frame stored in the R data file. Unlike read.csv(), load() recreates the object under the name it was saved with, so no assignment is needed. Again, we list the first six observations and the variables.

   load("Data/teamData_2001_to_2020.RData")

   head(teamData)
  yearID lgID teamID franchID divID Rank   G Ghome  W   L DivWin WCWin LgWin
1   2001   AL    TBA      TBD     E    5 162    81 62 100      N     N     N
2   2001   AL    BAL      BAL     E    4 162    80 63  98      N     N     N
3   2001   AL    KCA      KCR     C    5 162    81 65  97      N     N     N
4   2001   AL    DET      DET     C    4 162    81 66  96      N     N     N
5   2001   AL    TEX      TEX     W    4 162    82 73  89      N     N     N
6   2001   AL    ANA      ANA     W    3 162    81 75  87      N     N     N
  WSWin   R   AB    H X2B X3B  HR  BB   SO  SB CS HBP SF  RA  ER  ERA CG SHO SV
1     N 672 5524 1426 311  21 121 456 1116 115 52  54 25 887 781 4.94  1   6 30
2     N 687 5472 1359 262  24 136 514  989 133 53  77 49 829 744 4.67 10   6 31
3     N 729 5643 1503 277  37 152 406  898 100 42  44 47 858 779 4.87  5   1 30
4     N 724 5537 1439 291  60 139 466  972 133 61  51 49 876 795 5.01 16   2 34
5     N 890 5685 1566 326  23 246 548 1093  97 32  75 55 968 913 5.71  4   3 37
6     N 691 5551 1447 275  26 158 494 1001 116 52  77 53 730 671 4.20  6   1 43
  IPouts   HA HRA BBA  SOA   E  DP    FP                 name
1   4271 1513 207 569 1030 139 144 0.977 Tampa Bay Devil Rays
2   4297 1504 194 528  938 125 137 0.979    Baltimore Orioles
3   4320 1537 209 576  911 117 204 0.981   Kansas City Royals
4   4288 1624 180 553  859 131 164 0.979       Detroit Tigers
5   4315 1670 222 596  951 114 167 0.981        Texas Rangers
6   4313 1452 168 525  947 103 142 0.983       Anaheim Angels
                         park attendance BPF PPF teamIDBR teamIDlahman45
1             Tropicana Field    1298365  98 100      TBD            TBA
2 Oriole Park at Camden Yards    3094841  95  96      BAL            BAL
3            Kauffman Stadium    1536371 107 108      KCR            KCA
4               Comerica Park    1921305  93  95      DET            DET
5   The Ballpark at Arlington    2831021 104 105      TEX            TEX
6  Edison International Field    2000919 101 101      ANA            ANA
  teamIDretro  X1B   TB      RPG       AVG       SLG       OBP       RC
1         TBA  973 2142 4.148148 0.2581463 0.3877625 0.3195247 163.5749
2         BAL  937 2077 4.240741 0.2483553 0.3795687 0.3190445 178.5728
3         KCA 1037 2310 4.500000 0.2663477 0.4093567 0.3180782 155.2923
4         DET  949 2267 4.469136 0.2598880 0.4094275 0.3204981 176.2221
5         TEX  971 2676 5.493827 0.2754617 0.4707124 0.3440201 235.5229
6         ANA  988 2248 4.265432 0.2606738 0.4049721 0.3268016 183.9469
        OPS
1 0.7072872
2 0.6986132
3 0.7274349
4 0.7299256
5 0.8147325
6 0.7317737
   names(teamData)
 [1] "yearID"         "lgID"           "teamID"         "franchID"      
 [5] "divID"          "Rank"           "G"              "Ghome"         
 [9] "W"              "L"              "DivWin"         "WCWin"         
[13] "LgWin"          "WSWin"          "R"              "AB"            
[17] "H"              "X2B"            "X3B"            "HR"            
[21] "BB"             "SO"             "SB"             "CS"            
[25] "HBP"            "SF"             "RA"             "ER"            
[29] "ERA"            "CG"             "SHO"            "SV"            
[33] "IPouts"         "HA"             "HRA"            "BBA"           
[37] "SOA"            "E"              "DP"             "FP"            
[41] "name"           "park"           "attendance"     "BPF"           
[45] "PPF"            "teamIDBR"       "teamIDlahman45" "teamIDretro"   
[49] "X1B"            "TB"             "RPG"            "AVG"           
[53] "SLG"            "OBP"            "RC"             "OPS"           
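
The following sketch illustrates how load() behaves, using a small stand-in object rather than the real file. The names teamData_demo, rdata_path, original, restored_names, and e are hypothetical; only save(), load(), and new.env() from base R are used.

```r
# Hypothetical stand-in for the stored data frame (values from the listing above)
teamData_demo <- data.frame(
  teamID = c("TBA", "BAL", "KCA"),
  R      = c(672L, 687L, 729L),
  stringsAsFactors = FALSE
)

# Save it to a temporary .RData file, keep a copy, and remove the original
rdata_path <- tempfile(fileext = ".RData")
save(teamData_demo, file = rdata_path)
original <- teamData_demo
rm(teamData_demo)

# load() recreates the object under its saved name and
# returns (invisibly) the names of everything it restored
restored_names <- load(rdata_path)
restored_names

# Loading into a fresh environment avoids overwriting workspace objects
e <- new.env()
load(rdata_path, envir = e)
identical(e$teamData_demo, original)
```

With the actual files, a call such as all.equal(team_data, teamData) would be one way to check that the CSV and R data versions agree, although column types (for example, factor versus character) may differ between the two.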

Check the Data

We draw a quick box-and-whisker plot of runs by league to see whether the data look reasonable. We also fit a simple MLR model and compute its MSE.

  library(lattice)  # provides bwplot()
  bwplot(lgID ~ R, data=teamData)

  teamData.lm.r.h.bb = lm(R ~ H + BB, data=teamData)
  summary(teamData.lm.r.h.bb)

Call:
lm(formula = R ~ H + BB, data = teamData)

Residuals:
     Min       1Q   Median       3Q      Max 
-138.433  -32.288    1.632   31.545  159.549 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -33.21556   12.33483  -2.693  0.00728 ** 
H             0.37142    0.01248  29.773  < 2e-16 ***
BB            0.46071    0.02921  15.774  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 47.11 on 597 degrees of freedom
Multiple R-squared:  0.8634,    Adjusted R-squared:  0.863 
F-statistic:  1887 on 2 and 597 DF,  p-value: < 2.2e-16
  anova(teamData.lm.r.h.bb)
Analysis of Variance Table

Response: R
           Df  Sum Sq Mean Sq F value    Pr(>F)    
H           1 7825448 7825448 3525.51 < 2.2e-16 ***
BB          1  552298  552298  248.82 < 2.2e-16 ***
Residuals 597 1325141    2220                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  # The last "Mean Sq" entry of the ANOVA table is the residual mean square (MSE)
  MSE = rev(anova(teamData.lm.r.h.bb)$"Mean Sq")[1]
  RMSE = sqrt(MSE)
  RMSE
[1] 47.11333

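So, for the model fitting R as a function of H and BB (runs as a function of hits and walks), the mean squared error is MSE = 2219.666. This is the residual mean square from the Residuals row of the ANOVA table, shown there rounded to 2220. The root mean squared error is RMSE = 47.1133, which matches the residual standard error of 47.11 reported by summary().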
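
The MSE used here divides the residual sum of squares by the residual degrees of freedom (597 for the full data), not by the number of observations. As a minimal sketch of that relationship, the code below fits the same formula to a hypothetical six-row miniature (R, H, and BB copied from the head() listing) and computes the MSE three equivalent ways. The names mini, fit, mse_anova, mse_direct, and rmse_sigma are illustrative only.

```r
# Hypothetical miniature: R, H, BB for the six teams in the head() listing
mini <- data.frame(
  R  = c(672, 687, 729, 724, 890, 691),
  H  = c(1426, 1359, 1503, 1439, 1566, 1447),
  BB = c(456, 514, 406, 466, 548, 494)
)
fit <- lm(R ~ H + BB, data = mini)

# Route used in the text: last entry of the "Mean Sq" column of the ANOVA table
mse_anova <- rev(anova(fit)$"Mean Sq")[1]

# Direct route: residual sum of squares over residual degrees of freedom
mse_direct <- sum(residuals(fit)^2) / df.residual(fit)

# sigma() returns the residual standard error, the square root of that MSE
rmse_sigma <- sigma(fit)

all.equal(mse_anova, mse_direct)
all.equal(sqrt(mse_anova), rmse_sigma)
```

Applied to teamData.lm.r.h.bb, sigma() would return the RMSE directly, which avoids indexing into the ANOVA table.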