rostrum.blog - R has obscenely long function names

tl;dr

Use ls() on a package name in the form "package:base" to see all the objects it contains. I’ve done this to find the longest (and shortest) function names in base R and the {tidyverse} suite.

Naming things

I try to keep to a few rules when creating function names, like:

use a verb to make clear the intended action, like get_badge() from {badgr}
start functions with a prefix to make autocomplete easier, like the dh_*() functions from {dehex}
try to be descriptive but succinct, like r2cron() from {dialga}

It can be tricky to be succinct. Consider the base R function suppressPackageStartupMessages()¹: it’s a whopping 30 characters, but all the words are important. Something shortened, like suppPkgStartMsg(), wouldn’t be so clear.

It made me wonder: what’s the longest function name in R?²

But! It seems tricky and time consuming to find the longest function name from all R packages. CRAN alone has over 18,000 at time of writing.

A much easier (lazier) approach is to focus on some package subsets. I’ll look at base R and the {tidyverse}.

The long and the short of it

Base R

Certain R packages are built-in and attached by default on startup.

base_names <- sessionInfo()$basePkgs
base_names

[1] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"  
[7] "base"

How can we fetch all the functions from these packages? We can use ls() to list all their objects, supplying the package name in the format "package:base", for example. Note that I said ‘objects’, not ‘functions’, since it will also return names that refer to things like datasets.

For fun, we can use this as an excuse to demo ‘lambda’ syntax and the dog’s balls approach to function-writing, both introduced in R v4.1.³

base_pkgs <- paste0("package:", base_names)

base_fns <- lapply(base_pkgs, ls) |>
  setNames(base_names) |> 
  lapply(\(object) as.data.frame(object)) |> 
  (\(x) do.call(rbind, x))()  # the balls ()()

base_fns$package <- gsub("\\.\\d{,4}$", "", row.names(base_fns))
row.names(base_fns) <- NULL
base_fns$nchar <- nchar(base_fns$object)

base_fns <- base_fns[order(-base_fns$nchar), ]

Of the 2465 objects across these packages, a quick histogram shows that the most frequent character length is under 10, with a tail stretching out to over 30.

hist(
  base_fns$nchar,
  main = "Character length of base-object names",
  xlab = "Number of characters",
  las = 1
)

Histogram of character lengths for base object names. It's fairly normal around a bin of 6 to 8 characters, which has a peak frequency of over 400, plus there's a tail stretching out to over 30 characters.

Here’s the top 10 by character length.

base_fns_top <- base_fns[order(-base_fns$nchar), ]
rownames(base_fns_top) <- seq(length = nrow(base_fns_top))
head(base_fns_top, 10)

                                  object package nchar
1  aspell_write_personal_dictionary_file   utils    37
2     getDLLRegisteredRoutines.character    base    34
3       getDLLRegisteredRoutines.DLLInfo    base    32
4        reconcilePropertiesAndPrototype methods    31
5         suppressPackageStartupMessages    base    30
6          as.data.frame.numeric_version    base    29
7           as.character.numeric_version    base    28
8            print.DLLRegisteredRoutines    base    27
9             as.data.frame.model.matrix    base    26
10            conditionMessage.condition    base    26

So there are four objects with names longer than suppressPackageStartupMessages(), though they are rarely used as far as I can tell. The longest is aspell_write_personal_dictionary_file(), which has 37(!) characters. It’s part of the spellcheck functions in {utils}.

It’s interesting to me that it follows some of those rules for function naming that I mentioned earlier. It has a verb, is descriptive and uses a prefix for easier autocomplete; ‘aspell’ refers to the GNU open-source Aspell spellchecker on which it’s based.

I’m intrigued that the function uses snake_case rather than camelCase or dot.case, which seem more prevalent in base functions. You could argue then that the underscores have ‘inflated’ the length by four characters. Similarly, the prefix adds another six characters. So maybe the function could be simplified to writePersonalDictionaryFile(), which is merely 27 characters.

What about shortest functions? There are many one-character functions in base R.

sort(base_fns[base_fns$nchar == 1, ][["object"]])

 [1] "-" ":" "!" "?" "(" "[" "{" "@" "*" "/" "&" "^" "+" "<" "=" ">" "|" "~" "$"
[20] "c" "C" "D" "F" "I" "q" "t" "T"

Some of these will be familiar, like c() to concatenate and t() to transpose. You might wonder why operators and brackets are in here. Remember: everything in R is a function, so `[`(mtcars, "hp") is the same as mtcars["hp"]. I have to admit that stats::C() and stats::D() were new to me.

Tidyverse

How about object names from the {tidyverse}?

To start, we need to attach the packages. Running library(tidyverse) only loads the core packages of the tidyverse, so we need another approach to attach them all.

One method is to get the a vector of the package names with the tidyverse_packages() function and pass it to p_load() from {pacman}, which prevents the need for a separate library() call for each one.⁴

First, here’s the tidyverse packages.

Note

I updated this post in July 2023. The {lubridate} package is now installed as part of the tidyverse and many new functions have appeared across the multitude of packages in the metapackage.

# install.packages("tidyverse")  # if not installed
suppressPackageStartupMessages(  # in action!
  library(tidyverse)
)
tidy_names <- tidyverse_packages()
tidy_names

 [1] "broom"         "conflicted"    "cli"           "dbplyr"       
 [5] "dplyr"         "dtplyr"        "forcats"       "ggplot2"      
 [9] "googledrive"   "googlesheets4" "haven"         "hms"          
[13] "httr"          "jsonlite"      "lubridate"     "magrittr"     
[17] "modelr"        "pillar"        "purrr"         "ragg"         
[21] "readr"         "readxl"        "reprex"        "rlang"        
[25] "rstudioapi"    "rvest"         "stringr"       "tibble"       
[29] "tidyr"         "xml2"          "tidyverse"

And now to load them all.

# install.packages("pacman")  # if not installed
library(pacman)
p_load(char = tidy_names)

Once again we can ls() over packages in the form "package:dplyr". Now the {tidyverse} is loaded, we might as well use it to run the same pipeline as we did for the base packages.

tidy_pkgs <- paste0("package:", tidy_names)

tidy_fns <- map(tidy_pkgs, ls) |>
  set_names(tidy_names) |> 
  enframe(name = "package", value = "object") |>
  unnest(object) |> 
  mutate(nchar = nchar(object))

So we’re looking at even more packages this time, since the whole tidyverse contains 3071 of them.

The histogram is not too dissimilar to the one for base packages, though the tail is shorter, it’s arguably more normal-looking and the peak is perhaps slightly closer to 10. The latter could be because of more liberal use of snake_case.

hist(
  tidy_fns$nchar,
  main = "Character length of {tidyverse} object names",
  xlab = "Number of characters",
  las = 1
)

Histogram of character lengths for tidyverse object names. It's fairly normal around a bin of 8 to 10 characters, which has a peak frequency of over 600, plus there's a tail stretching out to over 30 characters.

Here’s the top 10 by character length.

slice_max(tidy_fns, nchar, n = 10)

# A tibble: 11 × 3
   package       object                            nchar
   <chr>         <chr>                             <int>
 1 rlang         ffi_standalone_check_number_1.0.7    33
 2 googlesheets4 vec_ptype2.googlesheets4_formula     32
 3 googlesheets4 vec_cast.googlesheets4_formula       30
 4 cli           cli_progress_builtin_handlers        29
 5 rstudioapi    getRStudioPackageDependencies        29
 6 rstudioapi    registerCommandStreamCallback        29
 7 rlang         ffi_standalone_is_bool_1.0.7         28
 8 dbplyr        supports_star_without_alias          27
 9 rstudioapi    launcherPlacementConstraint          27
10 cli           ansi_has_hyperlink_support           26
11 ggplot2       scale_linewidth_continuous           26

Well there you are: ffi_standalone_check_number_1.0.7() from {rlang} wins the prize with 33 characters. What does it do? The full documentation is literally ‘Internal API for standalone-types-check’. Okey doke.

Intriguingly, the next two are both from {googlesheets4}. The help pages say they’re ‘internal {vctrs} methods’. The names of these are long because of the construction: the first part tells us the method name, e.g. vec_ptype2, and the second part tells us that they apply to the googlesheets4_formula S3 class.

So maybe these don’t really count because they would be executed as as vec_ptype2() and vec_cast()? And they’re inflated because they contain the package name, {googlesheets4}, which is quite a long one (13 characters). That would leave cli::cli_progress_builtin_handlers() and rstudioapi::getRStudioPackageDependencies() as the next longest (29 characters). The latter uses camelCase—which is typical of the {rstudioapi} package—so isn’t bulked out by underscores.

On the other end of the spectrum, there’s only one function with one character: dplyr::n(). I think it makes sense to avoid single-character functions in non-base packages, because they aren’t terribly descriptive. n() can at least be understood to mean ‘number’.

Instead, here’s all the two-letter functions from the {tidyverse}. Note that many of these are from {lubridate} and are shorthand expressions that make sense in context, like hm() for hour-minute. You can also see some of {rlang}’s operators creep in here, like bang-bang (!!) and the walrus (:=).⁵

dplyr::filter(tidy_fns, nchar == 2)

# A tibble: 16 × 3
   package   object nchar
   <chr>     <chr>  <int>
 1 cli       no         2
 2 dplyr     do         2
 3 dplyr     id         2
 4 lubridate am         2
 5 lubridate hm         2
 6 lubridate ms         2
 7 lubridate my         2
 8 lubridate pm         2
 9 lubridate tz         2
10 lubridate ym         2
11 lubridate yq         2
12 magrittr  or         2
13 rlang     :=         2
14 rlang     !!         2
15 rlang     ll         2
16 rlang     UQ         2

Many of these are due to {lubridate} using single letters to represent time periods, like hm is ‘hour minute’. You can also see some symbols from {rlang}, like the good ol’ :=, or ‘walrus’ operator.

Both the {dplyr} functions here are no longer intended for use. I’m sad especially for dplyr::do(): the help page says it ‘never really felt like it belong[ed] with the rest of dplyr’. Sad.

In memoriam: do().

Environment

Session info

Last rendered: 2023-07-17 18:19:29 BST

R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.2.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/London
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] xml2_1.3.5          rvest_1.0.3         rstudioapi_0.15.0  
 [4] rlang_1.1.1         reprex_2.0.2        readxl_1.4.2       
 [7] ragg_1.2.5          pillar_1.9.0        modelr_0.1.11      
[10] magrittr_2.0.3      jsonlite_1.8.7      httr_1.4.6         
[13] hms_1.1.3           haven_2.5.2         googlesheets4_1.1.1
[16] googledrive_2.1.1   dtplyr_1.3.1        dbplyr_2.3.2       
[19] cli_3.6.1           conflicted_1.2.0    broom_1.0.5        
[22] pacman_0.5.1        lubridate_1.9.2     forcats_1.0.0      
[25] stringr_1.5.0       dplyr_1.1.2         purrr_1.0.1        
[28] readr_2.1.4         tidyr_1.3.0         tibble_3.2.1       
[31] ggplot2_3.4.2       tidyverse_2.0.0    

loaded via a namespace (and not attached):
 [1] gtable_0.3.3      xfun_0.39         htmlwidgets_1.6.2 gargle_1.5.1     
 [5] tzdb_0.4.0        vctrs_0.6.3       tools_4.3.1       generics_0.1.3   
 [9] fansi_1.0.4       pkgconfig_2.0.3   data.table_1.14.8 lifecycle_1.0.3  
[13] compiler_4.3.1    textshaping_0.3.6 munsell_0.5.0     fontawesome_0.5.1
[17] htmltools_0.5.5   yaml_2.3.7        cachem_1.0.8      tidyselect_1.2.0 
[21] digest_0.6.31     stringi_1.7.12    fastmap_1.1.1     grid_4.3.1       
[25] colorspace_2.1-0  utf8_1.2.3        withr_2.5.0       scales_1.2.1     
[29] backports_1.4.1   timechange_0.2.0  rmarkdown_2.23    cellranger_1.1.0 
[33] memoise_2.0.1     evaluate_0.21     knitr_1.43.1      glue_1.6.2       
[37] DBI_1.1.3         R6_2.5.1          systemfonts_1.0.4 fs_1.6.2

Footnotes

I wrote recently a whole post about package startup messages.↩︎
Luke was curious too, so that’s at least two of us. (Luke also noticed that a link to my {linkrot} package was itself rotten, lol.)↩︎
My understanding is that a future version of R will allow an underscore as the left-hand-side placeholder, in a similar manner to how the {tidyverse} allows a dot. That will do away with the need for ()(). Also ignore my badly-written base code; I’m trying to re-learn.↩︎
In fact, p_load() will attempt installation if the package can’t be found in your library. Arguably, this is poor behaviour; you should always ask the user before installing something on their machine.↩︎
Bang-Bang and the Walrus, touring Spring 2022.↩︎

Reuse

CC BY-NC-SA 4.0