I’ve used a lot of packages in 2019 and many have brought great joy to my R experience. Thank you to everyone who has created, maintained or contributed to a package this year.
Some particular packages of note for me have been:
↔︎️ {arsenal} by Ethan Heinzen, Jason Sinnwell, Elizabeth Atkinson, Tina Gunderson and Gregory Dougherty
Click the package name to jump to that section.
Packages of note
{usethis}
The format and content of R packages is objectively odd. What files are necessary? What structure should it have? The {usethis} package from RStudio’s Hadley Wickham and Jenny Bryan makes package development far easier for newcomers and experienced useRs alike.
In fact, you can make a minimal package in two lines:
create_package() to create the necessary package structure
use_r() to create an R script for your functions in the right place
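Here’s a minimal sketch of those two steps; the package name and path are placeholders:

```r
library(usethis)

# Create the necessary package structure at a path of your choosing
create_package("~/Documents/demopackage")

# Then, from within the new package project, create R/functions.R
# ready to hold your package's functions
use_r("functions")
```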
But there are many more functions to help you set up your package. To name a few more that I use regularly (see the sketch after this list):
use_vignette() and use_readme_md() for more documentation
use_testthat() and use_test() for setting up tests
use_package() to add packages to the Imports section of the DESCRIPTION file
use_data() and use_data_raw() to add data sets to the package and the code used to create them
use_*_license() to add a license
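A hedged sketch of how a few of these might be called during development, with illustrative argument values:

```r
# Illustrative calls only; the names and arguments are placeholders
use_mit_license("Your Name")  # add LICENSE files and update DESCRIPTION
use_testthat()                # set up the tests/testthat/ infrastructure
use_test("functions")         # create tests/testthat/test-functions.R
use_package("purrr")          # add {purrr} to Imports in DESCRIPTION
use_readme_md()               # create a README.md to introduce the package
```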
There are also other flavours of function like git_*() and pr_*() to work with version control and proj_*() for working with RStudio Projects.
I focused this year on making different types of package. {usethis} made it much easier to develop:
{blogsnip} to insert blog-related code snippets via an RStudio addin (there’s even a use_addin() function to create the all-important inst/rstudio/addins.dcf file)
{drake}
Your analysis has 12 input data files. They pass through 15 functions. There are some computationally intensive, long-running processes. Plots and tables are produced and R Markdown files are rendered. How do you keep on top of this? Is it enough to have a set of numbered script files (01_read.R, etc) or a single script file that sources the rest? What if something changes? Do you have to re-run everything from scratch?
You need a workflow manager. Save yourself some hassle and use Will Landau’s {drake} package, backed by rOpenSci’s peer review process. {drake} ‘remembers’ all the dependencies between files and only re-runs what needs to be re-run if any errors are found or changes are made. It also provides visualisations of your workflow and allows for high-performance computing.
In short, you:
Supply the steps of your analysis as functions to drake_plan(), which generates a data frame of commands (functions) to operate over a set of targets (objects)
Run make() on your plan to run the steps and generate the outputs
If required, make changes anywhere in your workflow and re-make() the plan – {drake} will only re-run things that are dependent on what you changed
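A minimal sketch of what that looks like, assuming hypothetical read_data(), summarise_data() and plot_summary() functions and a made-up file path:

```r
library(drake)

# Declare the plan: targets (left) built by commands (right)
plan <- drake_plan(
  raw     = read_data(file_in("data/raw.csv")),  # file_in() tracks the file
  summary = summarise_data(raw),                 # depends on the 'raw' target
  plot    = plot_summary(summary)                # depends on 'summary'
)

make(plan)  # run everything the first time

# Edit summarise_data() and then:
make(plan)  # only 'summary' and 'plot' are rebuilt; 'raw' is left alone
```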
Below is an extreme example from a happy customer. Each point on the graph is an object or function; black ones are out of date and will be updated when make() is next run.
It’s hard to do {drake} justice in just a few paragraphs, but luckily it’s one of the best-documented packages out there, so take a look at its documentation.
{purrr}
You can choose what gets returned from your iterations by selecting the appropriate map_*() variant: map() for a list, map_df() for a data frame, map_chr() for a character vector and so on. Here’s a trivial example that counts the number of Street Fighter characters from selected continents. First, create a list:
```r
# Create the example list
street_fighter <- list(
  china = "Chun Li",
  japan = c("Ryu", "E Honda"),
  usa   = c("Ken", "Guile", "Balrog"),
  `???` = "M Bison"
)

street_fighter  # take a look at the list
```
Now to map the length() function to each element of the list and return a named integer vector.
```r
library(purrr)  # load the package

# Get the length of each list element
purrr::map_int(
  street_fighter,  # list
  length           # function
)
```
```
china japan usa ??? 
    1     2   3   1 
```
But what if you want to iterate over two or more elements at once? You can use map2() or pmap(). And what if you only care about the side effects, like printing or writing files? Use walk() and pwalk().
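A quick, made-up sketch of map2() and walk(); the vectors are placeholders:

```r
library(purrr)

fighters  <- c("Ryu", "Chun Li", "Guile")
countries <- c("Japan", "China", "USA")

# map2(): iterate over two vectors in parallel, returning a character vector
map2_chr(fighters, countries, ~ paste0(.x, " (", .y, ")"))

# walk(): call a function for its side effect (here, printing);
# the input is returned invisibly so it can sit inside a pipeline
walk(fighters, print)
```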
{purrr} is also great for working with data frames with columns that contain lists (listcols), like the starwars data from the {dplyr} package. Let’s use the length() function again, but in the context of a listcol, to get the characters in the most films.
```r
# Load packages
suppressPackageStartupMessages(library(dplyr))
library(purrr)

# map() a listcol within a mutate() call
starwars %>%
  mutate(films_count = map_int(films, length)) %>%
  select(name, films, films_count) %>%
  arrange(desc(films_count)) %>%
  head()
```
Why not just write a loop or use the *apply functions? Jenny Bryan has a good {purrr} tutorial that explains why you might consider either choice. Basically, do what you feel; I like the syntax consistency and the ability to predict what function I need based on its name.
Check out the {purrr} cheatsheet for some prompts and excellent visual guidance.
Honourable mentions
{blogdown}
This blog, and I’m sure many others, wouldn’t exist without {blogdown} by Yihui Xie. {blogdown} lets you write and render R Markdown files into blog posts via static site generators like Hugo. This is brilliant if you’re trying to get R output into a blog post with minimal fuss. The {blogdown} book by Yihui, Amber Thomas and Alison Presmanes Hill is particularly helpful.
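As a hedged sketch, getting a site off the ground looks something like this; the theme is just an example:

```r
library(blogdown)

new_site(theme = "yihui/hugo-lithium")   # scaffold a Hugo site in the project
new_post("My first post", ext = ".Rmd")  # create a post you can write R in
serve_site()                             # preview locally with live reloading
```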
{polite}
Web scraping is ethically dubious if you fail to respect the terms of the sites you’re visiting. Dmytro Perepolkin has made it easy to be a good citizen of the internet with the {polite} package, which has just hit version 1.0.0 and is on CRAN (congratulations!). First you introduce yourself to the site with a bow() and collect any information about limits and no-go pages from the robots.txt file, then you can modify search paths with a nod() and collect information from them with a scrape(). Very responsible.
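A sketch of that workflow with a placeholder URL and path:

```r
library(polite)

# Introduce yourself to the site and read its robots.txt
session <- bow("https://www.example.com", user_agent = "me@example.com")
session  # prints the crawl delay and whether scraping is permitted

# Politely agree a new path within the same session, then scrape it
blog_session <- nod(session, path = "blog")
result <- scrape(blog_session)
```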
{arsenal}
I’ve been using the handy {arsenal} package to compare data frames as part of a quality assurance process. First, you supply two data frames to comparedf() to create a ‘compare’ object. Run diffs() on that object to create a new data frame where each row is a mismatch, given a tolerance, with columns for the location and values that are causing problems. We managed to quality assure nearly a million values with this method in next to no time. Check out their vignette on how to do this.
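A small sketch of that process with made-up data frames:

```r
library(arsenal)

# Two hypothetical versions of the same data
df_old <- data.frame(id = 1:3, value = c(1.00, 2.00, 3.00))
df_new <- data.frame(id = 1:3, value = c(1.00, 2.05, 3.00))

# Create the 'compare' object (tolerances can be set via comparedf.control())
comparison <- comparedf(df_old, df_new, by = "id")

diffs(comparison)  # one row per mismatching value, with its location
```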