Amanda Dobbyn
---

# **Warning**

--

### This presentation contains less rap than you might have expected.

--

<br>

.center[I won't blame you if you want to make a quick getaway.]

--

<p align="center">
<img src="" height="300px">
</p>

<!-- .center[] -->

---
class: inverse

## Quick About Me

<br>

.left-column[

**Day job**: ultimate frisbee player

**For fun**: Data Scientist at [Earlybird Software](, former co-organizer of [R-Ladies Chicago](

<!-- .pull-left[] -->

**GitHub**: [@aedobbyn](

**Website**:

**Twitter**: [@dobbleobble](
]

.right-column[]

---

## The Plan

<br>

1) I'll give an intro to what `drake` is and how it works.

--

<br>

2) We'll switch to a [live coding Rmd]( which hopefully won't totally break

--

<br>

In that part, we'll use the Twitter and Google Maps geocoding APIs to run a `drake` pipeline.

--

<br>
<br>
<br>
<br>
<br>

All code and slides on [GitHub](

---

## `drake`'s Main Idea

--

[`drake`]( is a workflow manager for your R code.

--

In a complex analysis pipeline, it makes changing your code easier.

--

<br>

`drake` loves changes.

--

<p align="left">
<img src="" alt="ilovechanges" height="300px">
</p>

---

## `drake`'s Main Idea

--

When something changes that makes the most recent results **out-of-date**, `drake` rebuilds *only* things that need to be rebuilt, so that

--

*what gets done stays done*.

<p align="left" style="padding-right: 20%;">
<img src="./img/drake_pitch.svg" height="300px">
</p>

--

Created and maintained by [Will]( [Landau]( and friends.

---

## What's the deal with the name?

--

**`d`**`ataframes in` **`R`** `for` `M`**`ake`**

<br>

--

.pull-left[
[GNU Make]( is a tool that uses a file called a Makefile to specify **dependencies** in a pipeline.

<br>

`drake` implements that idea in a way that's more native to how we work in R.
]

Example of a Makefile:

.pull-right[]

---
class: inverse

## Better Workflows

<br>

Does your analysis directory look like this?

--

.pull-left[

`01_import.R`

`02_clean.R`

`03_deep_clean.R`

`04_join.R`

`05_analyze.R`

`06_analyze_more.R`

`07_report.Rmd`
]

--

.pull-right[

<br>

#### What's bad about this?

<br>

**It doesn't scale well**

<br>

Which you know if you've tried to add another intermediate step or reorganize your subdirectories.
]

---

#### Your pipeline depends on

--

- You keeping file names up-to-date and sourcing things in the right order

--

- You knowing when the input data changes

--

- You knowing which objects and functions are used by which other objects and functions

<!-- - Explicitly saving intermediate data representations -->

--

<br>

#### If something breaks

--

- Can you be sure about where it broke?

--

- Do you know which intermediate data stores are up to date?

--

- Do you need to re-run the entire pipeline again?

--

.pull-right[
<p align="right">
<img src="./img/tired_drake.jpeg">
</p>
]

---

## Nice features of `drake`

.pull-left[
1) Tidy **dataframe** shows how pieces in your pipeline fit together
]

--

<br>

.pull-right[
2) **Dependency graph** of all inputs and outputs
]

<br>

--

.pull-left[
3) Great for iteration and **reproducibility**, especially if used with git
]

<br>

--

.pull-right[
4) Automated parallel and distributed computing
]

<br>

--

.pull-left[
5) It's all in R, so no writing config files!
]

<!-- .pull-right[] -->

<!-- .pull-right[] -->

---

## `drake` : `knitr`

(Analogy stolen from Will's [interview on the R podcast](

<p align="center">
<img src="./img/tiny_hats.jpg">
</p>

--

1) `knitr` can **cache** chunks if they've already been run, and nothing in them has changed.

--

2) A chunk successfully knitting **depends** on the previous chunk knitting and on any chunk that you specify a [`dependson`]( for.

--

3) Report lives in a single file, making that part reproducible and **compact**.

???

With knitr, you expect to be able to rerun someone's report from a single file.

---
class: inverse

## `drake` : `knitr`

But, `knitr` is a reporting tool, not a pipelining tool.

--

<br>

- You can't summon values from the `cache` in an interactive session

<br>

--

- If you're outsourcing preprocessing steps to R scripts outside the `Rmd`, you haven't solved any of the usual dependency issues from before

<br>

--

- `Rmd`s quickly get big and unwieldy for serious pipelines

---

## Is this also kinda like memoising?

--

Yes!

--

But better.

--

Memoising **caches the return value of a function for a given set of arguments**

--

<br>

- If the function is called again with the *same* set of arguments, the value is pulled from the cache instead of recomputed
- Saves time & resources

--

<br>

- In R, this is implemented nicely in the [`memoise` package](

---

## On Memoising

The downside:

**Memoising only applies to one function.**

--

<br>

What if a function upstream of the memoised function changes? We could get the wrong answer.

--

<p align="center">
<img src="./img/sad_drake.jpeg" height="330px">
</p>

---

## On Memoising

```r
add <- function(a, b) {
  a + b
}

add_and_square <- function(a, b) {
  add(a, b) ^ 2
}
```

--

<br>

```r
add(2, 3)
## [1] 5
add_and_square(2, 3)
## [1] 25
```

---

## On Memoising

If we've memoised `add_and_square`,

```r
add_and_square <- memoise::memoise(add_and_square)
```

--

we return `add_and_square(2, 3)` from the cache. (Yay, fast!)

```r
add_and_square(2, 3)
## [1] 25
```

<br>

--

**But** if we now redefine `add` so that it *subtracts* `b` from `a`...

```r
add <- function(a, b) {
* a - b
}
```

--

What will happen when we call `add_and_square(2, 3)`?

---

## On Memoising

--

We should get

```r
add(2, 3) ^ 2
## [1] 1
```

--

But instead we return the old answer, which is now wrong:

```r
add_and_square(2, 3)
## [1] 25
```

--

<br>

Luckily, `drake` knows *all* the dependency relationships between functions as they relate to your targets.

So, `drake` would know that the definition of `add` has changed, meaning that `add_and_square(2, 3)` needs to be recomputed.
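
---

## On Memoising

To make the stale-cache problem concrete, here is a minimal base R sketch of a cache that also tracks upstream functions. It is an illustration of the idea only, not how `drake` or `memoise` work internally: `cache`, `fingerprint` and `cached_call` are hypothetical names, and the sketch assumes that a change in a function's deparsed source is a good enough signal that a result is out of date.

```r
# Toy dependency-aware cache in base R. Illustrative only: cache, fingerprint
# and cached_call are hypothetical names, not part of drake or memoise.
cache <- new.env()

# Fingerprint a function by its deparsed source code.
fingerprint <- function(f) paste(deparse(f), collapse = "\n")

# Call fun(...), reusing a stored value only when the arguments and the
# source of fun and of every function in deps are all unchanged.
cached_call <- function(fun, ..., deps = list()) {
  key <- paste(
    c(fingerprint(fun), vapply(deps, fingerprint, character(1)), deparse(list(...))),
    collapse = "|"
  )
  if (!exists(key, envir = cache, inherits = FALSE)) {
    assign(key, fun(...), envir = cache)
  }
  get(key, envir = cache, inherits = FALSE)
}

add <- function(a, b) a + b
add_and_square <- function(a, b) add(a, b) ^ 2

before <- cached_call(add_and_square, 2, 3, deps = list(add)) # computed and cached
add <- function(a, b) a - b                                    # upstream change
after <- cached_call(add_and_square, 2, 3, deps = list(add))  # new key, so recomputed
c(before, after)
## [1] 25  1
```

Here the dependencies have to be listed by hand in `deps`. The point of the slides above is that `drake` works these relationships out from your code instead.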

---
class: inverse

## A Few Pieces of `drake` Vocab

<br>

> **Targets** are the objects that drake generates;

<br>

--

> **Commands** are the pieces of R code that produce them.

<br>

--

> **Plans** wrap up the relationship between targets and commands into a workflow representation: a dataframe.

<br>

???

one column for targets, and one column for their corresponding commands.

---

## More on Plans

Plans are like that top-level script that runs your entire pipeline.

<br>

```r
source("01_import.R")
source("02_clean.R")
...
source("06_analyze_more.R")

final <- do_more_things(object_in_env)

write_out_my_results(final)
```

<br>

*But*, a plan **knows about the dependencies** in your code.

---

## How to `drake`

--

<br>

1) Store functions and any packages you need to load in a file `funs.R`

--

2) Store a `drake` **plan** in another file

```r
plan <- drake_plan(
  cleaned_data = clean_my(raw_data),
  results = analyze_my(cleaned_data),
  report = report_out_my(results)
)
```

--

3) **Run** the plan

```r
make(plan)
```

---

## What `drake` does

--

```r
plan <- drake_plan(
  cleaned_data = clean_my(raw_data),
  results = analyze_my(cleaned_data),
  report = report_out_my(results)
)
```

--

`drake_plan` stores your plan as targets and commands in a dataframe.

--

```r
plan
## # A tibble: 3 x 2
##   target       command
##   <chr>        <chr>
## 1 cleaned_data clean_my(raw_data)
## 2 results      analyze_my(cleaned_data)
## 3 report       report_out_my(results)
```

---

## What `drake` does

```r
plan
## # A tibble: 3 x 2
##   target       command
##   <chr>        <chr>
## 1 cleaned_data clean_my(raw_data)
## 2 results      analyze_my(cleaned_data)
## 3 report       report_out_my(results)
```

--

```r
make(plan)
```

--

**First run** of `make(plan)`:

`drake` runs the plan from scratch

--

<br>

**Thereafter**:

`drake` will only rebuild targets that are out of date, and everything downstream of them

---

## What makes a target become out of date?

1) A trigger is activated (more on these later)

--

2) Something used to generate that target *or one of its upstream targets* has changed

--

```r
plan <- drake_plan(
  cleaned_data = clean_my(raw_data),
* results = analyze_my(cleaned_data),
  report = report_out_my(results)
)
```

`drake` knows that `results` depends on the object `cleaned_data` and the function `analyze_my()`

because those are both part of the command used to generate `results`.

<br>

--

**So, if `cleaned_data` changes or `analyze_my` changes, `results` is out of date.**

---

## Where is all this info stored?

<br>

#### **targets**

--

In a hidden `.drake` directory, or cache, in your project's root. [More on storage.](

--

<p align="left">
<img src="./img/drake_cache.jpg" height="180px">
<figcaption style="margin-left: 20%;">Spot the cache among the hidden dirs.</figcaption>
</p>

--

<br>

`loadd()` loads targets from the cache into your R session.

--

`clean()` cleans the cache. (You can recover a cache if you clean it by accident.)

<br>

---

## Where is all this info stored?

<br>

#### **dependencies**

--

`drake` **hashes** a target's dependencies to know when one of those dependencies changes

--

<p align="left">
<img src="./img/drake_cache_hashes_small.jpg" height="150px">
<figcaption style="margin-left: 20%;">Inside the data subdir of the .drake cache</figcaption>
</p>

--

and creates a `config` list that stores a dependency graph (`igraph` object) of the plan along with a bunch of other things.

--

You can access all of this with `drake_config()`.

???

You can check that the cache is there with `ls -a`.

You have [control]( over the hashing algorithm used, location of the cache, etc.

---
class: inverse

## It's all about Functions

`drake` is all built around *functions* rather than scripts.

<br>

--

- A plan works by using functions to create targets

--

<br>

- This allows `drake` to infer **dependencies** between
  - objects and functions
  - functions and other functions

--

<br>

- Running `drake_plan` creates a dataframe relating each target to the command used to generate it

---

## All about Functions

```r
bad_plan <- drake_plan(
  first_target = source("import.R"),
  second_target = source("clean.R")
)
```

--

Sourcing files breaks the dependency structure that makes `drake` useful.

--

<br>

```r
source("all_my_funs.R")

good_plan <- drake_plan(
  first_target = do_stuff(my_data),
  second_target = do_more_stuff(first_target)
)
```

Now `drake` knows `first_target` needs to be built before work on `second_target` can begin.

---

## `drake` things we won't get into

- [Generate ~ big plans ~]( .center[] for analyses that require lots of different permutations of a certain analysis. (`drake` version 7.0.0 has a [new syntax]( that makes it easier to create them.)

- Support for [debugging and testing ]( plans

- Compatibility with [high performance computing]( backends

---

## Moar Resources

<br>

- [`drake` user manual](

<br>

- [debugging drake](

<br>

- [Kirill Müller's cheat sheet](

<br>

- [Sina Rüeger]( and [Christine Stawitz]('s `drake` presentations

<br>

- [Drake's Spotify station](

---
class: blue-light

<!-- background-image: url(" -->

## Our Plan

Remember the [crazy blue light]( from late December?

--

<p align="left" style="padding-right: 20%;">
<img src="./img/blue_light.jpg" height="350px">
</p>

---

## Our Plan

.pull-right[]

<br>
<br>

The Twitter account that let us know that this wasn't in fact aliens is [NYCFireWire](

<br>

Normally they just tweet out fires and their locations in a more or less predictable pattern:

<br>
<br>

--

`<borough> ** <some numbers> ** <address> <description of fire>`

--

<br>

We can use their tweets to get some info on where and when fires happen in NYC.

???

I'll illustrate a way you might want to use `drake` with something that's close to home for us.

What if we were constructing an analysis of these tweets and wanted to make sure our pipeline worked end-to-end, but didn't want to re-run parts of it unless we needed to?

---

## The Pipeline

1. Pull in tweets, either the first big batch or any new ones that show up

--

2. Extract addresses from the tweets (regex time)

--

3. Send addresses to the Google Maps API to grab their latitudes and longitudes

--

4. Profit

--

<br>

All functions are defined in [`didnt_start_it.R`](, which we'll source in now.

```r
source(here::here("R", "didnt_start_it.R"))
```

--

<br>

**Caveats**

This analysis relies on the [rtweet]( and [ggmap]( packages.

To be able to run it in full you'll need a [Twitter API access token]( and [Google Maps Geocoding API key](

---

## Grabbing Tweets

[`get_tweets`](

--

*Main idea*:

* **Builds up a file** of the most recent set of tweets from a given account

--

*Details*:

- If neither a file nor a `tbl` is supplied as an argument, grabs an initial *seed* batch of tweets
- If either is supplied, checks for new tweets and grabs any it finds
- Spits out the latest to the same file

<br>

```r
get_tweets(n_tweets_seed = 3)
## # A tibble: 3 x 5
##   text                  user_id status_id   created_at          screen_name
##   <chr>                 <chr>   <chr>       <dttm>              <chr>
## 1 Queens **99-75-5985*… 560024… 1090764990… 2019-01-30 19:13:33 NYCFireWire
## 2 Big fire in Bergen C… 560024… 1090762878… 2019-01-30 19:05:10 NYCFireWire
## 3 Blocking the hydrant… 560024… 1090739760… 2019-01-30 17:33:18 NYCFireWire
```

???

- `get_seed_tweets` grabs a batch of tweets *or* reads in seed tweets from a file if the file exists
- `get_more_tweets` checks if there are new tweets and, if so, pulls in the right number of them
- `get_tweets` runs `get_seed_tweets` if given a null `tbl` argument, otherwise runs `get_more_tweets`

---

## Grabbing Seed Tweets

A closer look at just the text of the tweets:

```r
get_tweets(n_tweets_seed = 5) %>%
  select(text) %>%
  kable()
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> text </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Queens **99-75-5985** 220-19 145th Avenue Near 221st Street, Battalion 54 reports all hands going to work for a basement fire in A Private Dwelling </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Big fire in Bergen County, NJ. Old timber buildings, they just said the fire is making it's way to the Propane Filling station. </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Blocking the hydrant is dangerous! </td>
  </tr>
  <tr>
   <td style="text-align:left;"> From NYC OEM: Snow Squall Warning in effect for NYC on 1/30 until 4 PM. Heavy and blowing snow with wind gusts up to 50 mph causing whiteout conditions, zero visibility and life-threatening travel conditions.... </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Queens *99-75-9298* 60-15 Calloway St. Fire top floor 7 story 200x70 multiple dwelling. </td>
  </tr>
</tbody>
</table>

---

## Reupping Tweets

To show how `get_tweets` can start with a `tbl` of tweets and look for new ones, we'll grab 10 `seed_tweets` that are all **older** than an old tweet ID.

--

<br>

```r
old_tweet_id <- "1084948074588487680" # From Jan 14

seed_tweets <-
  get_tweets(
    n_tweets_seed = 10,
    max_id = old_tweet_id
  )

nrow(seed_tweets)
## [1] 10
```

---

## Reupping Tweets

Using `seed_tweets` as an input to the same `get_tweets` function, we check for new tweets, and, if there are any, pull them in.

--

```r
full_tweets <- get_tweets(seed_tweets, n_tweets_reup = 5)
## Searching for new tweets.
## 5 new tweet(s) pulled.
```

--

<br>

```r
nrow(seed_tweets)
## [1] 10
nrow(full_tweets)
## [1] 15
```

---

## Getting Addresses

With `pull_addresses` we parse the text of the tweet to pull out borough and street and string them together into an address.

```r
get_tweets(max_id = old_tweet_id) %>%
* pull_addresses() %>%
  select(text, street, borough, address) %>%
  kable()
```

<table>
 <thead>
  <tr>
   <th style="text-align:left;"> text </th>
   <th style="text-align:left;"> street </th>
   <th style="text-align:left;"> borough </th>
   <th style="text-align:left;"> address </th>
  </tr>
 </thead>
<tbody>
  <tr>
   <td style="text-align:left;"> Bronx *66-75-3049* 1449 Commonwealth ave. Attic fire private dwelling. </td>
   <td style="text-align:left;"> 1449 Commonwealth ave </td>
   <td style="text-align:left;"> The Bronx </td>
   <td style="text-align:left;"> 1449 Commonwealth ave, The Bronx </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Manhattan *66-75-0755* 330 E 39 St. Fire in the duct work 3rd floor. 10-77(HiRise Residential). E-16/TL-7 1st due </td>
   <td style="text-align:left;"> 330 E 39 St </td>
   <td style="text-align:left;"> Manhattan </td>
   <td style="text-align:left;"> 330 E 39 St, Manhattan </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Manhattan 10-77* 66-75-2017* 70 Little West St x 2nd Pl. BC01 has a fire on the 7th floor in the laundry area. </td>
   <td style="text-align:left;"> 70 Little West St x 2nd Pl </td>
   <td style="text-align:left;"> Manhattan </td>
   <td style="text-align:left;"> 70 Little West St x 2nd Pl, Manhattan </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Bronx *66-75-2251* 2922 3rd Avenue at Westchester Avenue, Battalion 14 transmitting a 10-75 for a fire on the 4th floor of a 6 story commercial building.
Squad 41 First Due </td>
   <td style="text-align:left;"> 2922 3rd Avenue at Westchester Avenue </td>
   <td style="text-align:left;"> The Bronx </td>
   <td style="text-align:left;"> 2922 3rd Avenue at Westchester Avenue, The Bronx </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Brooklyn **77-75-0270** 330 Bushwick Avenue Near McKibbin Street, Fire on the 4th Floor </td>
   <td style="text-align:left;"> 330 Bushwick Avenue Near McKibbin Street </td>
   <td style="text-align:left;"> Brooklyn </td>
   <td style="text-align:left;"> 330 Bushwick Avenue Near McKibbin Street, Brooklyn </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Brooklyn *77-75-0855* 899 Hancock St. Fire top floor 3 story </td>
   <td style="text-align:left;"> 899 Hancock St </td>
   <td style="text-align:left;"> Brooklyn </td>
   <td style="text-align:left;"> 899 Hancock St, Brooklyn </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Brooklyn **77-75-0855** 899 Hancock Street Near Howard Avenue, All hands going to work for fire I’m the top floor </td>
   <td style="text-align:left;"> 899 Hancock Street Near Howard Avenue </td>
   <td style="text-align:left;"> Brooklyn </td>
   <td style="text-align:left;"> 899 Hancock Street Near Howard Avenue, Brooklyn </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Bronx *66-75-3937* 3840 Orloff Av. Fire 4th floor. </td>
   <td style="text-align:left;"> 3840 Orloff Av </td>
   <td style="text-align:left;"> The Bronx </td>
   <td style="text-align:left;"> 3840 Orloff Av, The Bronx </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Staten Island *MVA/PIN* Box 1744- 490 Harold St off Forest Hill Rd. Hurst tool in operation.
</td>
   <td style="text-align:left;"> Box 1744- 490 Harold St off Forest Hill Rd </td>
   <td style="text-align:left;"> Staten Island </td>
   <td style="text-align:left;"> Box 1744- 490 Harold St off Forest Hill Rd, Staten Island </td>
  </tr>
  <tr>
   <td style="text-align:left;"> Queens 99-75-6810 111-15 227 St BC-54 using all hands for a fire in a pvt dwelling </td>
   <td style="text-align:left;"> NA </td>
   <td style="text-align:left;"> Queens </td>
   <td style="text-align:left;"> Queens </td>
  </tr>
</tbody>
</table>

---

## Getting Lat and Long

Last step of the main pipeline!

--

**Geocoding** = getting latitude and longitude from an address.

The [`ggmap`]( package exposes this feature of the [Google Maps]( API.

--

```r
get_tweets(n_tweets_seed = 5, max_id = old_tweet_id) %>%
  pull_addresses() %>%
* get_lat_long()
## Source :
## Source :
## Source :
## Source :
## Source :
## # A tibble: 5 x 5
##   address              lat  long created_at          text
##   <chr>              <dbl> <dbl> <dttm>              <chr>
## 1 1449 Commonwealth…  40.8 -73.9 2019-01-14 15:54:18 Bronx *66-75-3049* 14…
## 2 330 E 39 St, Manh…  40.7 -74.0 2019-01-14 13:41:11 Manhattan *66-75-0755…
## 3 70 Little West St…  40.7 -74.0 2019-01-14 12:40:19 Manhattan 10-77* 66-7…
## 4 2922 3rd Avenue a…  40.8 -73.9 2019-01-14 07:06:21 Bronx *66-75-2251* 29…
## 5 330 Bushwick Aven…  40.7 -73.9 2019-01-13 21:27:15 Brooklyn **77-75-0270…
```

---

## Downstream Analysis

Later in the pipeline we'll `count_fires`, summing up the total number of fires per `lat`-`long` combo

<br>

```r
count_fires <- function(tbl) {
  tbl %>%
    drop_na() %>%
    count(lat, long)
}
```

<br>

and plot them on a map (thanks again, `ggmap`)

---

## Downstream Analysis

```r
get_map("new york city") %>%
  ggmap()
## Source :
## Source :
```

<!-- -->

---

Using 3000 tweets:

```r
plot_fire_sums(fire_sums, output_path = NULL)
```

<!-- -->

<!-- <p> -->
<!-- <img src="./img/fire_sums_plot.png"> -->
<!-- </p> -->

---

## Quick Benchmark

So where does `drake` really come in handy here?

--

The trips to and from Twitter and Google take a while.

--

<br>

How long does the pipeline take to run on a single tweet?

--

```r
(our_bench <-
  bench::mark({
    get_tweets(n_tweets_seed = 1) %>% # Hi Twitter
      pull_addresses() %>%
      get_lat_long() # Hi Google
  }) %>%
  as_tibble() %>%
  pull(median))
## [1] 292ms
```

---

## Quick Benchmark

Roughly how many **minutes** would the pipeline take for 3k tweets?

(No batch speedups since we're going `rowwise` on each tweet.)

<br>

```r
(n_mins <-
  (as.numeric(our_bench) # Returns this in seconds
  * 3000) # 3k tweets
  / 60 # 60 seconds in a minute
)
## [1] 14.61218
```

---

## Quick Benchmark

All our downstream analyses depend on this pipeline.

If we tweak some code but `drake` determines we don't need to rerun the pipeline, we will save about **15 minutes** of our lives.

<br>

<br>

And we can rest assured that we have the most up-to-date data.

---

## Our `drake` Plan

We'll set up the `drake` plan and run it for real.
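
As a rough sketch of what such a plan might look like (this is not the plan from the live-coding document), the pipeline steps above can be chained as targets. So that it runs without Twitter or Google credentials, the four functions below are hypothetical stand-ins with the same names as the real ones in `didnt_start_it.R`; to hit the real APIs you would skip these definitions and source that file instead.

```r
library(drake)

# Hypothetical stand-ins for the real functions in didnt_start_it.R,
# returning small fake data so that no API keys are needed.
get_tweets <- function(n_tweets_seed = 3) {
  data.frame(
    text = paste("Queens *99-75-0000*", seq_len(n_tweets_seed), "Main St. Fire."),
    stringsAsFactors = FALSE
  )
}
pull_addresses <- function(tbl) {
  tbl$address <- paste0(seq_len(nrow(tbl)), " Main St, Queens")
  tbl
}
get_lat_long <- function(tbl) {
  tbl$lat <- 40.7 # Fake coordinates, identical for every row
  tbl$long <- -73.8
  tbl
}
count_fires <- function(tbl) {
  # Base R equivalent of counting rows per lat-long combo
  aggregate(n ~ lat + long, data = transform(tbl, n = 1), FUN = sum)
}

# Each target's command mentions the target upstream of it,
# which is how drake learns the order and the dependencies.
plan <- drake_plan(
  seed_tweets = get_tweets(n_tweets_seed = 3),
  addresses = pull_addresses(seed_tweets),
  lat_long = get_lat_long(addresses),
  fire_sums = count_fires(lat_long)
)

make(plan)       # Builds all four targets on the first run
readd(fire_sums) # Reads the finished target back from the .drake cache
```

Calling `make(plan)` a second time without changing anything should find every target up to date and skip the rebuild, which is where the 15 minutes would be saved with the real functions.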