This code is taken from Bill Petti’s GitHub site: https://billpetti.github.io/2020-01-07-acquire-minor-league-pitch-by-pitch-data-rstats-baseballr/
Aggregated statistics for minor league players have been available for some time through sites like FanGraphs, Baseball-Reference, and MiLB.com. However, pitch-level data similar to what is available for MLB is not easy to find.
To try to fill that gap, I’ve added and updated functions in baseballr that allow a user to query data through MLB’s stats api at the minor league level. This isn’t a perfect solution, and grabbing data at the player level is not as easy as it is for major leaguers at Baseball Savant, but it’s a start.
Before grabbing pitch-by-pitch (pbp) data, we need some information at the game level. To obtain this information, you can use the get_game_pks_mlb function. Simply provide a date and a numeric vector of the levels for which you want game information returned.
You can view a lookup table with ?get_game_pks_mlb to get the appropriate IDs, but I’ll post them here as well (this is not comprehensive):
id      level
1       MLB
11      Triple-A
12      Double-A
13      Class A Advanced
14      Class A
15      Class A Short Season
5442    Rookie Advanced
16      Rookie
17      Winter League
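If you would rather refer to levels by name, the IDs listed above can be kept in a named vector. This is only a minimal sketch: `milb_level_ids` and `levels_to_ids` are hypothetical helpers, not part of baseballr, and they cover only the levels listed in this post.

```r
# Hypothetical lookup built from the level IDs listed in this post
# (not comprehensive, and not part of baseballr).
milb_level_ids <- c(
  "MLB" = 1, "Triple-A" = 11, "Double-A" = 12,
  "Class A Advanced" = 13, "Class A" = 14, "Class A Short Season" = 15,
  "Rookie Advanced" = 5442, "Rookie" = 16, "Winter League" = 17
)

# Translate level names into the numeric vector passed as `level_ids`.
levels_to_ids <- function(level_names) {
  unknown <- setdiff(level_names, names(milb_level_ids))
  if (length(unknown) > 0) {
    stop("Unknown level(s): ", paste(unknown, collapse = ", "))
  }
  unname(milb_level_ids[level_names])
}

upper_minors <- levels_to_ids(c("Triple-A", "Double-A"))
```

The resulting `upper_minors` vector could then be supplied wherever a call expects `level_ids`.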
You can provide more than one level at a time. Say you want all games played on 2019-05-01 across Triple-A and Double-A:
games <- get_game_pks_mlb(date = '2019-05-01',
level_ids = c(11, 12))
games %>%
select(game_pk, gameDate, teams.away.team.name, teams.home.team.name) %>%
slice(1:10)
## ── MLB Game Pks data from MLB.com ─────────────────────────── baseballr 1.2.0 ──
## ℹ Data updated: 2022-09-03 22:01:48 PDT
## # A tibble: 10 × 4
## game_pk gameDate teams.away.team.name teams.home.team.name
## <int> <chr> <chr> <chr>
## 1 571587 2019-05-01T14:30:00Z Erie SeaWolves Altoona Curve
## 2 572288 2019-05-01T14:30:00Z New Hampshire Fisher Cats Trenton Thunder
## 3 575655 2019-05-01T14:35:00Z Louisville Bats Toledo Mud Hens
## 4 571938 2019-05-01T14:35:00Z Portland Sea Dogs Hartford Yard Goats
## 5 575163 2019-05-01T14:35:00Z Norfolk Tides Durham Bulls
## 6 575589 2019-05-01T15:05:00Z Gwinnett Stripers Charlotte Knights
## 7 571728 2019-05-01T15:05:00Z Richmond Flying Squirrels Bowie Baysox
## 8 572149 2019-05-01T15:35:00Z Harrisburg Senators Reading Fightin Phils
## 9 579921 2019-05-01T16:05:00Z Round Rock Express Oklahoma City Dodgers
## 10 579919 2019-05-01T16:10:00Z Round Rock Express Oklahoma City Dodgers
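The gameDate column comes back as a character string with a trailing Z, which I am assuming marks UTC. A small base R sketch for turning those strings into date-times and displaying them in a local zone (US Eastern is an arbitrary choice here) might look like this:

```r
# Sample gameDate strings copied from the output above.
game_dates <- c("2019-05-01T14:30:00Z", "2019-05-01T16:05:00Z")

# Parse the strings as UTC instants (assumption: the trailing Z means UTC).
start_utc <- as.POSIXct(game_dates, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# Show the same instants as clock times in US Eastern time.
start_eastern <- format(start_utc, tz = "America/New_York", format = "%H:%M")
```

This relies on the time zone database available to your R installation, so results depend on `America/New_York` being a recognized zone name.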
You can also use the get_game_info_mlb function to grab additional info on each game, such as weather and (in some cases) attendance:
map_df(.x = games$game_pk[1:10],
~get_game_info_mlb(.x)) %>%
select(game_date, venue_name, temperature, other_weather, wind)
## ── MLB Game Info data from MLB.com ────────────────────────── baseballr 1.2.0 ──
## ℹ Data updated: 2022-09-03 22:01:49 PDT
## # A tibble: 10 × 5
## game_date venue_name temperature other_weather wind
## <chr> <chr> <chr> <chr> <chr>
## 1 2019-05-01 Peoples Natural Gas Field 58 Overcast 7 mph, In …
## 2 2019-05-01 ARM & HAMMER Park 55 Overcast 5 mph, L T…
## 3 2019-05-01 Fifth Third Field 59 Cloudy 7 mph, R T…
## 4 2019-05-01 Dunkin' Donuts Park 50 Cloudy 1 mph, Calm
## 5 2019-05-01 Durham Bulls Athletic Park 70 Partly Cloudy 7 mph, In …
## 6 2019-05-01 BB&T Ballpark 74 Cloudy 11 mph, Va…
## 7 2019-05-01 Prince George's Stadium 59 Cloudy 8 mph, Calm
## 8 2019-05-01 FirstEnergy Stadium 59 Cloudy 1 mph, Calm
## 9 2019-05-01 Chickasaw Bricktown Ballpark 63 Cloudy 4 mph, R T…
## 10 2019-05-01 Chickasaw Bricktown Ballpark 72 Cloudy 13 mph, R …
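Note that temperature and wind are returned as character columns. If you want numbers to work with, something like the following base R sketch could split them apart. The first two wind strings are taken from the output above; the third is a hypothetical value added for illustration, and the new column names are my own.

```r
# Toy rows: the first two wind strings appear in the output above,
# the third ("12 mph, Out To CF") is a hypothetical example.
weather <- data.frame(
  temperature = c("50", "59", "72"),
  wind = c("1 mph, Calm", "8 mph, Calm", "12 mph, Out To CF"),
  stringsAsFactors = FALSE
)

# Convert the temperature string to a number.
weather$temperature_num <- as.numeric(weather$temperature)

# Pull the leading digits out of the wind string as the speed in mph.
weather$wind_mph <- as.numeric(sub("^([0-9]+) mph.*$", "\\1", weather$wind))

# Keep whatever follows "<n> mph," as a free-text direction.
weather$wind_direction <- sub("^[0-9]+ mph,[ ]*", "", weather$wind)
```

Strings that do not follow the "<n> mph, <direction>" pattern would come through unparsed (and produce NA speeds with a warning), so check the results before relying on them.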
Once you have the game_pk IDs, grabbing the pbp data is very simple. All you need to do is pass the game_pk of interest to the get_pbp_mlb function.
Let’s say you are interested in the Gwinnett Stripers versus the Charlotte Knights:
payload <- get_pbp_mlb(575589)
The function will return a data frame with 131 columns. Data availability will vary depending on the park and the league level, as most sensor data is not available in minor league parks via this API. Also note that the column names have mostly been left as-is, and there are likely columns that duplicate each other in terms of the information they provide. I plan to clean up the output down the road, but for now I am leaving the majority as-is.
Some of the columns of interest at the minor league level are:
pitchNumber and atBatIndex: the pitch number within a given plate appearance and the plate appearance within a given game.
pitchData.coordinates.x and pitchData.coordinates.y: the x,z coordinates of the pitch as it crosses the plate. As far as I can tell, these are the pixel coordinates for a location that a stringer manually plots, and they likely need to be transformed and rotated to get a view of the pitch as it crosses the plate. I am working on figuring out an easy transformation to get them on the same scale as the MLB coordinates, but they appear to differ by park. I do believe you can multiply both by -1, and that will at least allow you to orient the coordinates correctly (i.e. catcher’s view).
details.call.code, details.call.description, result.event, result.eventType, and result.description: these are similar to what we find with Statcast data: codes and detailed descriptions for what happened on a pitch or at the end of a plate appearance.
The count. columns: variables that tell you how many balls, strikes, and outs there were before and after the pitch.
batter.id and pitcher.id: identifiers for the batter and pitcher (the Guerrero example later in this post filters on matchup.batter.id).
matchup.batSide.code and matchup.pitchHand.code: handedness of the batter and pitcher.
A series of columns that tell you the league and level of both the home and away teams, including their parent organizations.
batted.ball.result, hitData.coordinates.coordX, hitData.coordinates.coordY, hitData.trajectory: various information about the batted ball. Of most interest will be the coordinate columns.
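To illustrate the multiply-by--1 idea for the pitch coordinates, here is a minimal sketch on made-up numbers. The coordinate values are hypothetical, `flip_pitch_coords` is my own helper rather than a baseballr function, and this only reorients the points; it does not rescale them to match MLB coordinates.

```r
# Hypothetical pixel coordinates standing in for real pbp output.
pitches <- data.frame(
  pitchData.coordinates.x = c(100.5, 120.0, NA),
  pitchData.coordinates.y = c(160.2, 180.7, 150.0)
)

# Flip both axes by multiplying by -1; missing values stay missing.
flip_pitch_coords <- function(df) {
  df$plot_x <- -1 * df$pitchData.coordinates.x
  df$plot_z <- -1 * df$pitchData.coordinates.y
  df
}

flipped <- flip_pitch_coords(pitches)
```

The flipped values are written to new columns (`plot_x`, `plot_z`) so the raw coordinates remain available if a better transformation turns up later.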
We can easily plot batted balls with this data:
bb_palette <- c('Single' = "#006BA4",
'Double' = "#A2CEEC",
'Triple'= "#FFBC79",
'Home Run'= "#C85200",
'Out/Other' = "#595959")
ggspraychart(payload,
x_value = 'hitData.coordinates.coordX',
y_value = '-hitData.coordinates.coordY',
fill_value = 'batted.ball.result',
fill_palette = bb_palette,
point_size = 3) +
labs(title = 'Batted Balls: Gwinnett Stripers versus the Charlotte Knights',
subtitle = '2019-05-01')
## Warning: Removed 269 rows containing missing values (geom_point).
As I mentioned, getting pbp data for a single player or team is problematic given that the api call is game-based. (This is as far as I can tell without formal documentation.)
For players, you will likely need to collect data in bulk and house it in your own database to make querying far easier. However, here’s an example of how you might get all pbp data for all teams in a single MLB team’s system for a given week.
First, grab all game_pks for the first week of June 2019 for all levels from Triple-A to Class A:
x <- map_df(.x = seq.Date(as.Date('2019-06-01'),
as.Date('2019-06-07'),
'day'),
~get_game_pks_mlb(date = .x,
level_ids = c(11,12,13,14,15))
)
Next, map over those game_pks and run the get_pbp_mlb function for each (Note: you are making hundreds of api calls, so this will take about 5 minutes):
safe_milb <- safely(get_pbp_mlb)
# filter the game files for only those games that were completed and pull the game_pk as a numeric vector
# df <- map(.x = x %>%
# filter(status.codedGameState == "F") %>%
# pull(game_pk),
# ~safe_milb(game_pk = .x)) %>%
# map('result') %>%
# bind_rows()
# write_csv(df, "milb_2019_06_01_to_07.csv") ### Save a copy to keep from having to create the dataframe in the future.
# saveRDS(df, "milb_2019_06_01_to_07.RDS")
### Cheat and read the CSV file
# df <- read_csv("milb_2019_06_01_to_07.csv")
### Cheat and read the RDS file
df <- readRDS("milb_2019_06_01_to_07.RDS")
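One thing to keep in mind with the safely approach: pulling out only 'result' silently drops any game that errored. Below is a base R sketch of the same pattern that also records which game_pks failed. `fake_get_pbp` is a hypothetical stand-in for get_pbp_mlb so the example runs without network access, and unlike purrr's safely it stores the error message rather than the condition object.

```r
# Hypothetical stand-in for get_pbp_mlb(): errors for one game_pk so the
# failure path can be shown offline.
fake_get_pbp <- function(game_pk) {
  if (game_pk == 2) stop("no data for game_pk ", game_pk)
  data.frame(game_pk = game_pk, pitchNumber = 1:2)
}

# Base R analogue of a "safe" wrapper: always returns list(result, error).
safe_fetch <- function(game_pk) {
  tryCatch(
    list(result = fake_get_pbp(game_pk), error = NULL),
    error = function(e) list(result = NULL, error = conditionMessage(e))
  )
}

game_pks <- c(1, 2, 3)
fetched <- lapply(game_pks, safe_fetch)

# Record which games failed instead of silently dropping them.
is_failed <- vapply(fetched, function(f) is.null(f$result), logical(1))
failed <- game_pks[is_failed]

# Stack the successful results into one data frame.
pbp <- do.call(rbind, lapply(fetched[!is_failed], `[[`, "result"))
```

Keeping the `failed` vector around makes it easy to retry just those games later rather than rerunning the whole week.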
Now that you have the data you can filter for any team and their related minor league teams. Here’s what the Rays organization looks like (note: there is a data table in the package that houses all teams and their ids – teams_lu_table):
ggspraychart(df %>%
filter(home_parentOrg_id == 139 | away_parentOrg_id == 139),
x_value = 'hitData.coordinates.coordX',
y_value = '-hitData.coordinates.coordY',
fill_value = 'batted.ball.result',
fill_palette = bb_palette,
point_size = 3) +
facet_wrap(~home_level_name) +
labs(title = 'Batted Balls: Tampa Bay Rays Minor League Affiliates',
subtitle = '2019-06-01 through 2019-06-07')
## Warning: Removed 7600 rows containing missing values (geom_point).
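Note that the filter above keeps every row from any game involving a Rays affiliate, so batted balls by opposing hitters are included. If you only want the organization's own hitters, one option is to also check which half of the inning it is. This sketch assumes the pbp output has an about.halfInning column with values "top" and "bottom"; the toy data and the `org_batting_rows` helper are hypothetical, and 111 is just an arbitrary second organization id.

```r
# Toy pbp rows; about.halfInning is assumed to be "top" or "bottom".
pbp_toy <- data.frame(
  about.halfInning = c("top", "bottom", "top", "bottom"),
  home_parentOrg_id = c(139, 139, 111, 111),
  away_parentOrg_id = c(111, 111, 139, 139),
  batted.ball.result = c("Single", "Out/Other", "Double", "Home Run"),
  stringsAsFactors = FALSE
)

# Keep only rows where the given organization's affiliate is batting:
# the away team bats in the top half, the home team in the bottom half.
org_batting_rows <- function(df, org_id) {
  away_batting <- df$about.halfInning == "top" & df$away_parentOrg_id == org_id
  home_batting <- df$about.halfInning == "bottom" & df$home_parentOrg_id == org_id
  df[which(away_batting | home_batting), , drop = FALSE]
}

rays_batting <- org_batting_rows(pbp_toy, 139)
```

The filtered data frame could then be passed to ggspraychart in place of the broader filter.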
In terms of a single player, the simplest way would be to grab all the game_pks on days when a player’s team(s) played and then query the pbp for those game_pks.
For example, let’s make spray charts for all of Vladimir Guerrero Jr.’s batted balls from 2018.
First, grab the dates of his games:
vlad <- baseballr::milb_batter_game_logs_fg(19611, year = 2018)
Next, grab the game_pk’s of those games (note I am being lazy here and grabbing all games across Triple- and Double-A):
vlad_dates <- vlad %>%
pull(Date)
vlad_gk <- map_df(.x = vlad_dates,
~get_game_pks_mlb(date = .x,
level_ids = c(11,12))
)
# Note: filtering on the home team name keeps only home games for these two affiliates
vlad_gk_TOR <- vlad_gk %>%
filter(teams.home.team.name == "Buffalo Bisons" | teams.home.team.name == "New Hampshire Fisher Cats")
Then, loop over the games, grab the pbp data, and filter for Guerrero as the batter:
vlad_data <- map(.x = vlad_gk_TOR %>%
filter(status.codedGameState == "F") %>%
pull(game_pk),
~safe_milb(game_pk = .x)) %>%
map('result') %>%
bind_rows()
vlad_pbp <- vlad_data %>%
filter(matchup.batter.id == 665489)
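Before plotting, it can help to sanity-check the filtered data with a quick count of batted ball results by level. The sketch below uses a made-up data frame with the same two column names; the values are hypothetical, and I am assuming rows without a batted ball carry NA in batted.ball.result.

```r
# Hypothetical rows using the same column names as the pbp output.
vlad_toy <- data.frame(
  home_level_name = c("Double-A", "Double-A", "Triple-A", "Triple-A", "Triple-A"),
  batted.ball.result = c("Single", "Out/Other", "Home Run", NA, "Out/Other"),
  stringsAsFactors = FALSE
)

# Drop rows without a batted ball (assumed to be NA).
batted <- vlad_toy[!is.na(vlad_toy$batted.ball.result), ]

# Cross-tabulate level by batted ball result.
result_counts <- table(batted$home_level_name, batted$batted.ball.result)
```

Keep in mind that home_level_name is the level of the home team, which is how the plots in this post are split as well.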
We can plot the data by level:
ggspraychart(vlad_pbp,
x_value = 'hitData.coordinates.coordX',
y_value = '-hitData.coordinates.coordY',
fill_value = 'batted.ball.result',
fill_palette = bb_palette,
point_size = 3) +
labs(title = 'Vladimir Guerrero Jr: Batted Balls 2018') +
facet_wrap(~home_level_name)
## Warning: Removed 6 rows containing missing values (geom_point).
Or by pitcher handedness and level:
ggspraychart(vlad_pbp %>%
mutate(matchup.pitchHand.description = paste0(matchup.pitchHand.description, 'handed')),
x_value = 'hitData.coordinates.coordX',
y_value = '-hitData.coordinates.coordY',
fill_value = 'batted.ball.result',
fill_palette = bb_palette,
point_size = 3) +
labs(title = 'Vladimir Guerrero Jr: Batted Balls 2018') +
facet_wrap(~matchup.pitchHand.description+home_level_name, ncol = 4)
## Warning: Removed 6 rows containing missing values (geom_point).
It is not the most efficient process, but from what I can tell that’s the best way to do it today.
That’s all for now. Let me know if you have any issues. Oh, and before I forget, you can also grab MLB pbp data using the same functions.
Comments and pull requests welcome!