origin

Unlike some other programming languages, R has no widely established and undisputed style guide comparable to, e.g., PEP 8 for Python. As a data scientist, I helped to establish a company-wide R style guide. While it mainly relies on the tidyverse style guide, we generally decided to be more explicit in our coding practice. In particular, we always refer to functions from non-native R packages with the double colon operator ::. While it is relatively easy to establish such a convention in new projects, it is challenging to adapt ongoing projects and legacy code. origin allows for much faster conversion of both legacy code and code that is currently being written.

Purpose of origin

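The main purpose is to add pkg:: to R function calls. For example, a call such as mutate(df, y = 2 * x) becomes dplyr::mutate(df, y = 2 * x), provided dplyr is among the packages that origin checks.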
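As a minimal before/after sketch of this kind of change, the snippet below uses median() from the stats package, chosen only because it ships with every R installation. The variable names are illustrative and not taken from origin itself.

```r
x <- c(3, 1, 4, 1, 5)

# Before: the call relies on the stats package being attached
before <- median(x)

# After: the call names its package explicitly via the :: operator
after <- stats::median(x)

# Both forms call the same function, so the results are identical
identical(before, after)
```

The namespaced form behaves the same but no longer depends on which packages happen to be attached, or in which order.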

Usage of origin

In general, you can originize either some selected text (more on that later in Addins), a whole script, or all scripts in a specific folder, e.g. your project folder. There is a specifically designed function for each purpose, yet they all share the same options. Therefore, only originize_file() is presented in detail, shown here with its default options.

Code Usage

originize_file(file = "testscript.R",
               pkgs = .packages(), 
               overwrite = TRUE,
               ask_before_applying_changes = TRUE,
               ignore_comments = TRUE,
               check_conflicts = TRUE,
               add_base_packages = FALSE,
               check_base_conflicts = TRUE, 
               check_local_conflicts = TRUE,
               excluded_functions = list(dplyr = c("%>%", "across"),
                                         data.table = c(":=", "%like%"),
                                         # exclude from all packages:
                                         c("first", "last")), 
               verbose = TRUE, 
               use_markers = TRUE)

Common Arguments

Addins

Besides using regular R functions to originize files, there are also useful addins delivered with origin. These addins are designed to be used on-the-fly while coding. You can either originize selected text, the currently opened file, or all scripts in the currently opened project. However, to have as much control as when using functions, each function argument corresponds to an option that can be set and used inside the addins, e.g.

options(origin.pkgs = c("dplyr", "data.table"),
        origin.overwrite = TRUE)

Most function arguments of origin first check whether an option has been declared and, if so, use its value as their default. This gives the same outcome regardless of whether you use an addin or call a function directly.
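The option-as-default pattern described above can be sketched in base R with getOption(). The function originize_sketch() is a hypothetical stand-in, not part of origin. Only the option names origin.pkgs and origin.overwrite come from the example above, and the fallback values are assumptions.

```r
# Hypothetical stand-in for an origin function. Each argument looks up an
# option first and falls back to an assumed default if the option is unset.
originize_sketch <- function(pkgs = getOption("origin.pkgs", .packages()),
                             overwrite = getOption("origin.overwrite", TRUE)) {
  list(pkgs = pkgs, overwrite = overwrite)
}

# Set the options once, e.g. in a project-level .Rprofile
old <- options(origin.pkgs = c("dplyr", "data.table"),
               origin.overwrite = FALSE)

from_options <- originize_sketch()                # picks up both options
explicit     <- originize_sketch(pkgs = "stats")  # explicit argument wins

# Restore the previous option values
options(old)
```

Because default arguments are evaluated lazily, the options are read at call time, so changing an option affects all subsequent calls without redefining anything.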

Safety Measures

Since origin changes files on disk, it is very important that the user has full control over what happens. User input is therefore required before critical steps.

Logging

Most importantly, the user must be aware of what the originized file(s) would look like. For this, all changes and potentially missed changes are presented, either in the Markers tab (recommended) or in the console. The log distinguishes three kinds of entries:

  • insertion: pkg:: is inserted prior to a function
  • missing: an object that has the same name as a function but is not unambiguously used as a function. In R it is usually no problem to have variables that are named like functions (data or df are popular examples). While it is always clear when a function is called directly, functions can also be passed as arguments to other functions, most famously in functional programming with the *apply family or purrr. origin highlights such cases in the logging output.
  • infix: functions like %>% are exported by packages but cannot sensibly be written in the pkg::fun() form. Such functions are highlighted by default to point out to the user that they stem from a package. When using dplyr-style code, consider excluding the pipe operator via excluded_functions.
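The difference between an insertion and a missing entry can be illustrated with R's own parser. The sketch below uses utils::getParseData() to separate names used in call position from names that merely appear as symbols. It illustrates the distinction only and is not necessarily how origin implements it. The code string is made up for the example.

```r
# Two uses of `median`: once called directly, once passed as an argument
code <- "y <- median(x)\nz <- sapply(lst, median)"

# keep.source = TRUE is needed so that parse data is available
parsed <- parse(text = code, keep.source = TRUE)
pd <- utils::getParseData(parsed)

# Names in call position: safe candidates for inserting pkg::
direct_calls <- pd$text[pd$token == "SYMBOL_FUNCTION_CALL"]

# Plain symbols: may be variables or functions passed as arguments
plain_symbols <- pd$text[pd$token == "SYMBOL"]

# `median` shows up in both groups; the second use is ambiguous
ambiguous <- intersect(direct_calls, plain_symbols)
```

Here the direct call median(x) is unambiguous, while the median passed to sapply() looks to the parser like any other variable, which is why such cases are only reported rather than changed.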

Same Function Name in Multiple Packages

Due to the variety of R packages, function names need not be unique across all packages out there. By default, R masks previously attached functions with those attached afterwards. origin mimics this rule by giving higher priority to the packages that are listed first. If there is a conflict regarding a used function, the affected functions are listed along with the packages from which they stem.

Used functions in mutliple Packages!

filter: dplyr, stats
first: data.table, dplyr

Order in which relevant packges are evaluated: data.table >> dplyr >> stats

Do you want to proceed? 1: YES 2: NO
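The priority rule behind this message can be sketched with a small hand-written export table. The listed exports are a tiny illustrative subset rather than the real export lists of these packages, and resolve_origin() is a hypothetical helper, not an origin function.

```r
# Illustrative subset of exports, listed in priority order
# (data.table >> dplyr >> stats)
exports <- list(
  data.table = c("first", "last", ":="),
  dplyr      = c("filter", "first", "last"),
  stats      = c("filter", "median")
)

# Hypothetical helper: the first package in the list that exports `fun` wins
resolve_origin <- function(fun, exports) {
  for (pkg in names(exports)) {
    if (fun %in% exports[[pkg]]) return(pkg)
  }
  NA_character_
}

# Functions used in the (imaginary) script being checked
used <- c("filter", "first", "median")

# Names exported by more than one package
all_funs   <- unlist(exports, use.names = FALSE)
duplicated_funs <- unique(all_funs[duplicated(all_funs)])

# Only conflicts among functions that are actually used are reported
reported <- intersect(used, duplicated_funs)
origins  <- vapply(used, resolve_origin, character(1), exports = exports)
```

With this ordering, first resolves to data.table and filter to dplyr, which matches the evaluation order shown in the message above.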

Custom Functions Mask Exported Functions

Just as packages mask each other's functions, locally defined custom functions mask exported ones. If you defined your own last function in your project, origin should not add dplyr:: to it. Therefore, your project is searched for function definitions, and local functions have higher priority than those exported by packages. Note that, depending on the project size, this process can take quite some time. In this case, set the argument/option path_to_local_functions to a subdirectory, or set check_local_conflicts to FALSE to skip this feature.

Locally defined and used functions mask exported functions from packages

last: dplyr

Local functions have higher priority. In case you want to use an exported version of a function listed above set pkg::fun manually

Got it? 1: YES 2: NO 3: Show files
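Finding locally defined functions by parsing can be sketched in base R as follows. find_local_functions() is a hypothetical helper that only looks at top-level assignments of the form name <- function(...). The actual search performed by origin may differ.

```r
# Hypothetical helper: return the names of functions defined at top level
find_local_functions <- function(code) {
  exprs <- parse(text = code, keep.source = FALSE)
  is_fun_def <- vapply(exprs, function(e) {
    is.call(e) &&
      length(e) == 3 &&
      # assignment via `<-` or `=`
      (identical(e[[1]], as.name("<-")) || identical(e[[1]], as.name("="))) &&
      # simple name on the left-hand side
      is.name(e[[2]]) &&
      # a function definition on the right-hand side
      is.call(e[[3]]) &&
      identical(e[[3]][[1]], as.name("function"))
  }, logical(1))
  vapply(exprs[is_fun_def], function(e) as.character(e[[2]]), character(1))
}

script <- "last <- function(x) x[length(x)]\ny <- 1\nhelper = function() NULL"
local_funs <- find_local_functions(script)

# Local definitions take priority, so they are dropped from the candidates
dplyr_candidates <- c("filter", "first", "last")
to_namespace <- setdiff(dplyr_candidates, local_funs)
```

In this example the locally defined last is removed from the list of functions that would receive a dplyr:: prefix.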

Many Files Selected

When originizing a complete folder or project, many R scripts might be checked. Because the user may be unaware that the selected folder contains many files, which results in a long run time, a warning is triggered and user input is required.

You are about to originize 99 files.

Proceed? 1: YES 2: NO 3: Show files

Final Check

Before the proposed changes are finally applied, one last user input is required.

Happy with the result? 😀

1: YES 2: NO

Discussion

Whether or not to add pkg:: to each (imported) function is a controversial issue in the R community. While the tidyverse style guide does not mention explicit namespacing, R Packages and the Google R style guide are in favor of it.

Pros

Cons

Check Package Usage since origin 1.0.0

As a new feature, origin exports the function check_pkg_usage. Suppose you take over a project, or have built up a huge barrage of library calls over time: which of those are actually still needed? Just run all those library(...) calls and then call check_pkg_usage().

Interpreting the Output of check_pkg_usage

== Package Usage Report ================================================
-- Used Packages: 2 ----------------------------------------------------
v data.table
v testthat  

-- Unused Packages: 1 --------------------------------------------------
i dplyr

-- Possible Namespace Conflicts:  1 -----------------------------------
x last      data.table >> dplyr

-- Specifically (`pkg::fun()`) further used Packages: 2 ----------------
i purrr

-- Functions with unknown origin: 1 ------------------------------------
x map

The output shows:

  • we had attached 3 packages: {data.table}, {testthat}, and {dplyr}
  • functions from {data.table} and {testthat} are used
  • {dplyr} functions are not used
  • there is a namespace conflict for the function last between {data.table} and {dplyr}
  • additionally, we use purrr:: in some places
  • we use the map() function, which is not exported from {data.table}, {testthat}, or {dplyr}. Note that map is exported from {purrr}, which is used elsewhere, but here our code would fail since {purrr} is not attached and map cannot be found.

The Markers tab shows all unknown functions as well as all packages that are not attached but are used explicitly via pkg::.

Interpreting the Result of check_pkg_usage

Taking a closer look at result, the object returned by check_pkg_usage():

as.data.frame(result)
#>       pkg         fun n_calls namespaced conflict conflict_pkgs
#> 1    base        %in%      53      FALSE    FALSE            NA
#> 2    base   .packages       8      FALSE    FALSE            NA
#> 3    base      Filter       3      FALSE    FALSE            NA
#> 4    base         Map       1      FALSE    FALSE            NA
#> 5    base      Reduce       5      FALSE    FALSE            NA
#> ...

It first shows a lot of base functions. Even though they are not explicitly attached by the user, the base R packages are always attached. The print output does not show them, but they are available if you want to take a deep dive into the functions used in the project.

#>             pkg              fun n_calls namespaced conflict conflict_pkgs
#> 110  data.table           %like%      10      FALSE    FALSE            NA
#> 111  data.table               :=       1      FALSE    FALSE            NA
#> 112  data.table               CJ       1      FALSE    FALSE            NA
#> 113  data.table    as.data.table       1      FALSE    FALSE            NA
#> 114  data.table    as.data.table       3       TRUE    FALSE            NA
#> 115  data.table             last       2       TRUE    FALSE            NA
#> 116  data.table             last       1      FALSE     TRUE         dplyr

Going further, there are a bunch of {data.table} functions that have been used. Some are listed twice because they were sometimes called via data.table:: and sometimes not. Furthermore, last is marked with conflict = TRUE. This is because {dplyr} exports a last function as well. However, since {data.table} has higher priority than {dplyr} in this project, {origin} considers it a {data.table} function. Note that if a function is namespaced via ::, no conflict is reported.

Finally, at the end of the output:

#>         pkg  fun n_calls namespaced conflict conflict_pkgs
#> 219    <NA>  map       1      FALSE       NA            NA
#> 220  dplyr  <NA>       0         NA       NA            NA

Here we see the map function, which could not be assigned to any of the given packages, and the {dplyr} package, which has not been used.
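Once the result is a data frame, the interesting rows can be picked out with ordinary subsetting. The sketch below rebuilds a few of the rows shown above by hand, so that it runs without origin. The column names follow the printed output, while the column types are an assumption.

```r
# Hand-built excerpt of the data frame shown above (types are assumed)
usage <- data.frame(
  pkg           = c("data.table", "data.table", NA, "dplyr"),
  fun           = c("last", "last", "map", NA),
  n_calls       = c(2L, 1L, 1L, 0L),
  namespaced    = c(TRUE, FALSE, FALSE, NA),
  conflict      = c(FALSE, TRUE, NA, NA),
  conflict_pkgs = c(NA, "dplyr", NA, NA),
  stringsAsFactors = FALSE
)

# Functions that could not be assigned to any package
unknown_funs <- usage$fun[is.na(usage$pkg)]

# Packages without a single used function
unused_pkgs <- usage$pkg[is.na(usage$fun)]

# Calls whose origin depends on the package priority;
# which() drops the NA entries of the conflict column
conflicts <- usage[which(usage$conflict), c("fun", "pkg", "conflict_pkgs")]
```

The same filters should carry over to the full data frame as long as it keeps the column names printed above.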

Final Remarks

Locally defined functions are also detected via parsing. They also have higher priority than functions exported by packages.