In this example, the baseballr package is used to acquire Statcast data for Corbin Burnes for the 2021 season. This data is then used to generate plots that showcase his arsenal and his cutter’s velocity by inning. The same steps are repeated for Clayton Kershaw’s 2022 season, with the data read from a PostgreSQL database instead. The original document can be found at https://billpetti.github.io/baseballr/articles/using_statcast_pitch_data.html.

Load Packages

  library(pacman)
  p_load(ggplot2)
  #if (!requireNamespace('devtools', quietly = TRUE)){
  #  install.packages('devtools')
  #}
  #devtools::install_github(repo = "BillPetti/baseballr", force = TRUE)
  p_load(baseballr)
  p_load(tidyverse)
  p_load(dplyr)
  p_load(readr)
  p_load(DBI)
  p_load(RPostgres)

Find Corbin Burnes’ and Clayton Kershaw’s MLBAM IDs

  burnes_id <- baseballr::playerid_lookup(last_name = "Burnes", first_name = "Corbin") %>% 
                 dplyr::pull(mlbam_id)

  kershaw_id <- baseballr::playerid_lookup(last_name = "Kershaw", first_name = "Clayton") %>%
                  dplyr::pull(mlbam_id)

Use Burnes’ ID To Load Statcast Data

  burnes_data <- baseballr::statcast_search_pitchers(start_date = "2021-03-01",
                                                     end_date = "2021-12-01",
                                                     pitcherid = burnes_id)

This block will download all of Corbin Burnes’ pitches from March 1st through December 1st, 2021.

Use Kershaw’s ID to get data from PostgreSQL

  ### To connect to your own database you would use something like
  db_name <- 'r_baseball'  # name of your database; 'postgres' is typical. pgAdmin can be
                    # used to create a database like 'baseball' to hold tables
  db_host <- 'localhost' # e.g. 'ec2-54-83-201-96.compute-1.amazonaws.com'
  db_port <- '5432'  # or any other port specified by the DBA. 5432 is typical.
  db_user <- 'superuser' # or rstudioapi::askForPassword("Database User ['postgres' is typical]:")
                    # 'postgres' is created at installation, but other users
                    # (such as this demo user) can be used
  db_password <- 'notSecure' # or rstudioapi::askForPassword("Database Password:")
                    # password for db_user; 'notSecure' is for personal demo use only
  
  ### connect to your database
  statcast_db <- dbConnect(RPostgres::Postgres(),
                           dbname = db_name,
                           user = db_user,
                           password = db_password,
                           host = db_host,
                           port = db_port
                           )

  ### Grab all of Kershaw's 2022 data
  kershaw_data <-  tbl(statcast_db, 'statcast') %>%
                     # !! inlines the local kershaw_id value into the translated SQL
                     filter((game_year == 2022) & (pitcher == !!kershaw_id)) %>%
                     collect()

  dim(kershaw_data)
## [1] 2023   98
  head(kershaw_data, 3)
## # A tibble: 3 × 98
##   pitch_…¹ game_…² relea…³ relea…⁴ relea…⁵ playe…⁶ batter pitcher events descr…⁷
##   <chr>    <chr>     <dbl>   <dbl>   <dbl> <chr>    <int>   <int> <chr>  <chr>  
## 1 ""       2022-0…      NA      NA      NA Zimmer… 605548  477132 strik… swingi…
## 2 ""       2022-0…      NA      NA      NA Belt, … 474832  477132 strik… swingi…
## 3 ""       2022-0…      NA      NA      NA Willia… 663897  477132 field… hit_in…
## # … with 88 more variables: spin_dir <chr>, spin_rate_deprecated <chr>,
## #   break_angle_deprecated <chr>, break_length_deprecated <chr>, zone <dbl>,
## #   des <chr>, game_type <chr>, stand <chr>, p_throws <chr>, home_team <chr>,
## #   away_team <chr>, type <chr>, hit_location <int>, bb_type <chr>,
## #   balls <int>, strikes <int>, game_year <int>, pfx_x <dbl>, pfx_z <dbl>,
## #   plate_x <dbl>, plate_z <dbl>, on_3b <dbl>, on_2b <dbl>, on_1b <dbl>,
## #   outs_when_up <int>, inning <int>, inning_topbot <chr>, hc_x <dbl>, …
  ### disconnect database 
  dbDisconnect(statcast_db)
  
  ### Clean up a little
  gc()
##           used (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells 1412878 75.5    4291027 229.2  3470663 185.4
## Vcells 3860644 29.5   25352486 193.5 39612962 302.3
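
The connection settings above hardcode the user and password, which is fine for a local demo but easy to leak once the script is shared. As a minimal sketch, the same settings could be read from environment variables. The function name `read_db_config()` and the `STATCAST_DB_*` variable names are made up for this illustration and are not part of baseballr, DBI, or RPostgres.

```r
# Hypothetical environment variable names; define them in ~/.Renviron or the shell.
read_db_config <- function() {
  password <- Sys.getenv("STATCAST_DB_PASSWORD", unset = NA_character_)
  if (is.na(password)) {
    stop("Set STATCAST_DB_PASSWORD before connecting.")
  }
  list(
    dbname   = Sys.getenv("STATCAST_DB_NAME", unset = "r_baseball"),
    host     = Sys.getenv("STATCAST_DB_HOST", unset = "localhost"),
    port     = Sys.getenv("STATCAST_DB_PORT", unset = "5432"),
    user     = Sys.getenv("STATCAST_DB_USER", unset = "postgres"),
    password = password
  )
}

# Usage sketch (needs a reachable PostgreSQL server, so it is left commented out):
# db_config <- read_db_config()
# statcast_db <- do.call(DBI::dbConnect, c(list(RPostgres::Postgres()), db_config))
```

Only the password is required here; the other settings fall back to the demo values used earlier when their variable is unset.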

Clean Data

If this is your first time looking at Statcast data, I recommend looking at their documentation for the dataset returned. It will walk you through the data represented by each column and give you a better idea of the data points collected for each pitch.

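Some of the more common data points used for pitching analysis, all of which appear later in this walkthrough, are pitch_name, pitch_type, release_speed, pfx_x, pfx_z and inning.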

Since we are going to be making a scatterplot of each pitcher’s pitch movement, we need to make sure we have the data in the proper format to match a traditional movement plot. The pfx_x and pfx_z columns are both in feet, so let’s create two new columns that convert them to inches. pfx_x is also recorded from the catcher’s point of view, so let’s reverse its sign to show it from the pitcher’s.

# The glimpse function is something I use regularly
# to quickly preview the data I'm working with.
# Try it out if you haven't used it before!
# 
# 
# burnes_data %>% dplyr::glimpse()

burnes_cleaned_data <- burnes_data %>% 
  # Only keep rows with pitch movement readings
  # and during the regular season
  dplyr::filter(!is.na(pfx_x), !is.na(pfx_z),
                game_type == "R") %>% 
  dplyr::mutate(pfx_x_in_pv = -12*pfx_x,
                pfx_z_in = 12*pfx_z)

kershaw_cleaned_data <- kershaw_data %>% 
  # Only keep rows with pitch movement readings
  # and during the regular season
  dplyr::filter(!is.na(pfx_x), !is.na(pfx_z),
                game_type == "R") %>% 
  dplyr::mutate(pfx_x_in_pv = -12*pfx_x,
                pfx_z_in = 12*pfx_z)
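
Both cleaning steps above apply the same filter and the same feet-to-inches conversion, so they could be folded into one function. The sketch below is one way to do that; `clean_movement()` is a hypothetical helper name, and the toy rows are made-up values rather than real Statcast readings.

```r
library(dplyr)

# Hypothetical helper: keep regular-season pitches that have movement readings,
# then add movement in inches with the horizontal sign flipped to the pitcher's view.
clean_movement <- function(pitch_data) {
  pitch_data %>%
    filter(!is.na(pfx_x), !is.na(pfx_z), game_type == "R") %>%
    mutate(pfx_x_in_pv = -12 * pfx_x,
           pfx_z_in = 12 * pfx_z)
}

# Toy rows (made-up values): one valid pitch, one with a missing reading,
# and one from spring training (game_type "S").
toy_pitches <- data.frame(
  pfx_x = c(0.5, NA, -1.0),
  pfx_z = c(1.25, 0.8, 0.5),
  game_type = c("R", "R", "S")
)

toy_cleaned <- clean_movement(toy_pitches)
```

Only the first toy row survives the filter: 0.5 ft of catcher's-view horizontal movement becomes -6 inches from the pitcher's view, and 1.25 ft of vertical movement becomes 15 inches. With the helper in place, `burnes_cleaned_data` would simply be `clean_movement(burnes_data)`, and likewise for Kershaw.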

Create A Movement Plot

Now that we’ve created our new columns, let’s use them to plot how Corbin Burnes’ pitches move.

# Make a named vector to scale pitch colors with
pitch_colors <- c("4-Seam Fastball" = "red",
                  "2-Seam Fastball" = "blue",
                  "Sinker" = "cyan",
                  "Cutter" = "violet",
                  "Fastball" = "black",
                  "Curveball" = "green",
                  "Knuckle Curve" = "pink",
                  "Slider" = "orange",
                  "Changeup" = "gray50",
                  "Split-Finger" = "beige",
                  "Knuckleball" = "gold")
# Find unique pitch types to not have unnecessary pitches in legend
burnes_pitch_types <- unique(burnes_cleaned_data$pitch_name)
burnes_cleaned_data %>% 
  ggplot2::ggplot(ggplot2::aes(x = pfx_x_in_pv, y = pfx_z_in, color = pitch_name)) +
  ggplot2::geom_vline(xintercept = 0) +
  ggplot2::geom_hline(yintercept = 0) +
  # Make the points slightly transparent
  ggplot2::geom_point(size = 1.5, alpha = 0.25) +
  # Scale the pitch colors to match what we defined above
  # and limit it to only the pitches Burnes throws
  ggplot2::scale_color_manual(values = pitch_colors,
                              limits = burnes_pitch_types) +
  # Scale axes and add " to end of labels to denote inches
  ggplot2::scale_x_continuous(limits = c(-25,25),
                              breaks = seq(-20,20, 5),
                              labels = scales::number_format(suffix = "\"")) +
  ggplot2::scale_y_continuous(limits = c(-25,25),
                              breaks = seq(-20,20, 5),
                              labels = scales::number_format(suffix = "\"")) +
  ggplot2::coord_equal() +
  ggplot2::labs(title = "Corbin Burnes Pitch Movement",
                subtitle = "2021 MLB Season | Pitcher's POV",
                caption = "Data: Baseball Savant via baseballr", 
                x = "Horizontal Break",
                y = "Induced Vertical Break",
                color = "Pitch Name")

# Find unique pitch types to not have unnecessary pitches in legend
kershaw_pitch_types <- unique(kershaw_cleaned_data$pitch_name)
kershaw_cleaned_data %>% 
  ggplot2::ggplot(ggplot2::aes(x = pfx_x_in_pv, y = pfx_z_in, color = pitch_name)) +
  ggplot2::geom_vline(xintercept = 0) +
  ggplot2::geom_hline(yintercept = 0) +
  # Make the points slightly transparent
  ggplot2::geom_point(size = 1.5, alpha = 0.25) +
  # Scale the pitch colors to match what we defined above
  # and limit it to only the pitches Kershaw throws
  ggplot2::scale_color_manual(values = pitch_colors,
                              limits = kershaw_pitch_types) +
  # Scale axes and add " to end of labels to denote inches
  ggplot2::scale_x_continuous(limits = c(-25,25),
                              breaks = seq(-20,20, 5),
                              labels = scales::number_format(suffix = "\"")) +
  ggplot2::scale_y_continuous(limits = c(-25,25),
                              breaks = seq(-20,20, 5),
                              labels = scales::number_format(suffix = "\"")) +
  ggplot2::coord_equal() +
  ggplot2::labs(title = "Clayton Kershaw Pitch Movement",
                subtitle = "2022 MLB Season | Pitcher's POV",
                caption = "Data: Baseball Savant via baseballr", 
                x = "Horizontal Break",
                y = "Induced Vertical Break",
                color = "Pitch Name")
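
The two movement plots differ only in the data, the title, and the subtitle, so one option is to wrap the plot in a function. This is a sketch rather than part of the original article: `plot_movement()` is a hypothetical helper, it uses `scales::label_number()` in place of `scales::number_format()`, and the toy data is made up.

```r
library(ggplot2)

# Hypothetical helper wrapping the movement plot so it can be reused per pitcher.
# `pitch_data` needs pfx_x_in_pv, pfx_z_in and pitch_name columns;
# `colors` is a named vector of colors keyed by pitch name.
plot_movement <- function(pitch_data, colors, title, subtitle) {
  inch_labels <- scales::label_number(suffix = "\"")
  ggplot(pitch_data, aes(x = pfx_x_in_pv, y = pfx_z_in, color = pitch_name)) +
    geom_vline(xintercept = 0) +
    geom_hline(yintercept = 0) +
    geom_point(size = 1.5, alpha = 0.25) +
    # Limit the legend to the pitches that appear in the data
    scale_color_manual(values = colors,
                       limits = unique(pitch_data$pitch_name)) +
    scale_x_continuous(limits = c(-25, 25), breaks = seq(-20, 20, 5),
                       labels = inch_labels) +
    scale_y_continuous(limits = c(-25, 25), breaks = seq(-20, 20, 5),
                       labels = inch_labels) +
    coord_equal() +
    labs(title = title,
         subtitle = subtitle,
         caption = "Data: Baseball Savant via baseballr",
         x = "Horizontal Break",
         y = "Induced Vertical Break",
         color = "Pitch Name")
}

# Toy data (made-up values) to show the call
toy_movement <- data.frame(
  pfx_x_in_pv = c(-6, 4),
  pfx_z_in = c(15, -3),
  pitch_name = c("Cutter", "Curveball")
)
toy_plot <- plot_movement(toy_movement,
                          c("Cutter" = "violet", "Curveball" = "green"),
                          "Toy Pitch Movement",
                          "Made-up data | Pitcher's POV")
```

With the real data, the calls would look like `plot_movement(burnes_cleaned_data, pitch_colors, "Corbin Burnes Pitch Movement", "2021 MLB Season | Pitcher's POV")`.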

I like to use pitch_name to color my scatterplots as it gives the full pitch name in the legend, but pitch_type would also work if you prefer the shorter abbreviations (e.g. 4-Seam Fastball = FF). If you were to use pitch_type instead, be sure to make a new vector for the colors, since the existing one is keyed by the full pitch names.
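
As a sketch of what that second color vector might look like, the block below keys the same colors by abbreviation. The abbreviations reflect my understanding of Statcast's pitch_type codes and are an assumption here; check them against `unique(burnes_data$pitch_type)` before relying on them. The name `pitch_type_colors` is made up for this example.

```r
# Colors keyed by pitch_type abbreviation instead of pitch_name.
# The abbreviation-to-pitch mapping is assumed, not taken from the original article.
pitch_type_colors <- c("FF" = "red",     # 4-Seam Fastball
                       "FT" = "blue",    # 2-Seam Fastball
                       "SI" = "cyan",    # Sinker
                       "FC" = "violet",  # Cutter
                       "FA" = "black",   # Fastball
                       "CU" = "green",   # Curveball
                       "KC" = "pink",    # Knuckle Curve
                       "SL" = "orange",  # Slider
                       "CH" = "gray50",  # Changeup
                       "FS" = "beige",   # Split-Finger
                       "KN" = "gold")    # Knuckleball
```

To use it, map `color = pitch_type` in `aes()` and pass `values = pitch_type_colors` to `scale_color_manual()`, with `limits` taken from `unique()` of the pitch_type column.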

Velocity By Inning

Now let’s take a look at Corbin Burnes’ pitch velocity by inning for his trademark cutter.

burnes_velocity_by_inning <- burnes_cleaned_data %>% 
  dplyr::filter(pitch_name == "Cutter") %>% 
  dplyr::group_by(inning, pitch_name) %>% 
  dplyr::summarize(average_velo = mean(release_speed, na.rm = TRUE))
## `summarise()` has grouped output by 'inning'. You can override using the
## `.groups` argument.
burnes_velocity_by_inning %>% 
  ggplot2::ggplot(ggplot2::aes(x = inning, y = average_velo, color = pitch_name)) +
  ggplot2::geom_line(linewidth = 1.5, alpha = 0.5, show.legend = FALSE) +
  ggplot2::geom_point(size = 3, show.legend = FALSE) +
  ggplot2::scale_color_manual(values = pitch_colors) +
  ggplot2::scale_x_continuous(breaks = 1:9) +
  ggplot2::scale_y_continuous(limits = c(90, 100)) +
  ggplot2::labs(title = "Corbin Burnes Cutter Velocity By Inning",
                subtitle = "2021 MLB Season",
                caption = "Data: Baseball Savant via baseballr",
                x = "Inning",
                y = "Average Velocity")

kershaw_velocity_by_inning <- kershaw_cleaned_data %>% 
  dplyr::filter(pitch_name == "4-Seam Fastball") %>% 
  dplyr::group_by(inning, pitch_name) %>% 
  dplyr::summarize(average_velo = mean(release_speed, na.rm = TRUE))
## `summarise()` has grouped output by 'inning'. You can override using the
## `.groups` argument.
kershaw_velocity_by_inning %>% 
  ggplot2::ggplot(ggplot2::aes(x = inning, y = average_velo, color = pitch_name)) +
  ggplot2::geom_line(linewidth = 1.5, alpha = 0.5, show.legend = FALSE) +
  ggplot2::geom_point(size = 3, show.legend = FALSE) +
  ggplot2::scale_color_manual(values = pitch_colors) +
  ggplot2::scale_x_continuous(breaks = 1:9) +
  ggplot2::scale_y_continuous(limits = c(90, 92)) +
  ggplot2::labs(title = "Clayton Kershaw 4-Seam Fastball Velocity By Inning",
                subtitle = "2022 MLB Season",
                caption = "Data: Baseball Savant via baseballr",
                x = "Inning",
                y = "Average Velocity")
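
Both velocity summaries follow the same filter, group, and average pattern, so they could share one function. The sketch below uses a hypothetical helper, `velocity_by_inning()`, and adds `.groups = "drop"` so that `summarize()` returns an ungrouped result without the grouping message. The toy rows are made-up values.

```r
library(dplyr)

# Hypothetical helper: average release speed per inning for one pitch.
velocity_by_inning <- function(pitch_data, pitch) {
  pitch_data %>%
    filter(pitch_name == pitch) %>%
    group_by(inning, pitch_name) %>%
    summarize(average_velo = mean(release_speed, na.rm = TRUE),
              .groups = "drop")
}

# Toy rows (made-up values): three cutters over two innings and one slider
toy_pitches <- data.frame(
  inning = c(1, 1, 2, 2),
  pitch_name = c("Cutter", "Cutter", "Cutter", "Slider"),
  release_speed = c(95, 97, 94, 88)
)

toy_velocity <- velocity_by_inning(toy_pitches, "Cutter")
```

On the toy data this gives two rows, averaging 96 in the first inning and 94 in the second. With the real data, the calls would be `velocity_by_inning(burnes_cleaned_data, "Cutter")` and `velocity_by_inning(kershaw_cleaned_data, "4-Seam Fastball")`.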

Conclusion

These were just two examples of things you can do with pitch data acquired using baseballr. Statcast data goes back to 2015 and contains a multitude of data points for each pitch and batted-ball event, so there are endless things to research!