4 minute read

I do not update this site as much as I would like. Interacting through twitter and the #R4DS slack channel has been my main contributions as of late to being more active in the #rstats community. I make plans to get to things like posting, then end up with other work, a neat idea, or just decide spend time with my family.

Worthy endeavors all of them.

Anyways, Thomas Mock (@thomasmock), has started up #tidytuesday out of the mentorship pilot program in #R4DS. I’ve been helping with little inputs here and there with #TidyTuesday and the future #tidyweek projects. It’s really great to see these take off, and I hope they keep going (congratulations Thomas and everyone stay tuned for #tidyweek).

I didn’t do a write up for the inaugural #TidyTuesday last week. To sum up my submission: I like nice clean looking graphs, and I’m a sucker for a good slopegraph. Now I know I can work on my submission right in this post, but I made the decision to spin the #tidytuesday submissions off into their own repo. This way I don’t have to tie them to a post (keeping with my inconsistent blogging schedule)!

This weeks was interesting. Love football. Love fivethirtyeight. Love that graphic. While I realize the level of granularity was appropriate for the story, I felt that it would be neat to see how ALL of the positions are faring. This wasn’t included in the #tidytuesday dataset, so I went hunting.

Found the website that provided the data (sportrac), and checked out the tables that were the source of the data. I saw that they had everything available through tables on the website. I quickly checked out the terms of service to see if they had any sort of scraping policy. I couldn’t find anything, so I went forward with it, making sure to space out my scraping requests to not flood the server.

It also involved a lot of url construction, at least in a very logical fashion. To do these sort of tasks, I really like using tibble to create a tibble of parameters, then create a scraping function and use purrr to iterate over the rows. Like so..


# Create the data scaffold 
years <- 2011:2018
positions <- c("quarterback", "running-back", "fullback", "guard", "center", "left-tackle", "right-tackle", "tight-end","wide-receiver","defensive-end","cornerback","defensive-tackle", "inside-linebacker", "outside-linebacker", "free-safety", "strong-safety", "kicker","punter","long-snapper")

scaffold <- tibble(year = years,
                   position = list(positions)) %>% 

# Function to scrape the top avg cap salary by player ----
pull_position_data <- function(year, position) {
  url <- glue("http://www.spotrac.com/nfl/positional/{year}/{position}")
  read_html(url) %>% 
    html_nodes("#main > div.teams > table:nth-child(6)") %>% 
    html_table() %>%
    flatten_df() %>% 
    set_names(c("rank","player","cap_dollars", "cap_percent"))

tbl_data <- scaffold %>% 
    mutate(data = map2(year, position, ~pull_position_data(.x, .y)))

That will get you the full data for all positions from 2011:2018. Some important things to note. First, always check to see if you are allowed to scrape the website. Two, if they have a guideline for it, follow it. Three, if they don’t have a guideline, be nice. Being nice is easy, and reflected in the use of Sys.sleep(5) to not overwhelm the server with requests.

Another good thing to do is save the data to an RDS file so you don’t have to do it again :)

After getting all of that data and playing around with various types, I decided on my submission, a heatmap. From the scraped data it was a quick group_by, top_n and summarize to get the data for the top 16 highest paid players in their positions and get the average.

I did tweak the plot quite a bit, and included a simple function to make year labels similar to the style of fivethirtyeight. I’m cheating here, because I hardcoded that we’re looking at data from the 21st century. A really good function would do this automatically!

# Formatter for 538 year labels 

labels_538 <- function(labels) {
  labels_out <- sprintf("20%s", str_sub(labels, 3, 4))
  labels_out <- c(labels_out[1], glue("'{str_sub(labels_out[-1], 3, 4)}"))

Looking at the results, you can see that the bottom end of the NFL pay band are: Fullbacks, Long Snappers, Kickers and Punters. Just doesn’t get better for these guys year over year. The data that you see really emphasizes the focus of the NFL, highlighting passing attacks that emphasize scoring and excitement. The highest cap prices are elated to the passing game on offense, and either rushing or stopping the pass on defense.

See you next week!

comments powered by Disqus