Chapter 14 Reproducible Environments
by Anna Nguyen and Jade Benjamin-Chung
14.1 Package Version Control with renv
14.1.1 Introduction
Replicable code should produce the same results, regardless of when or where it’s run. However, our analyses often leverage open-source R packages that are developed by other teams. These packages continue to be developed after research projects are completed, which may include changes to analysis functions that could impact how code runs for both other team members and external replicators.
For example, suppose we had used a function that took in one argument, such that our code contained example_function(arg_a = "a"). A few months after we publish our code, the package developers update the function to take in another mandatory argument arg_b. If someone runs our code, but has the most recent version of the package, they’ll receive an error message that the argument arg_b is missing and will not be able to fully reproduce our results.
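The failure described above can be sketched in a few lines of base R. The two definitions below are hypothetical stand-ins for the older and newer releases of the package; the names example_function, arg_a, and arg_b come from the scenario above, and the function bodies are made up purely for illustration.

```r
# Hypothetical "old" release: the function takes a single argument.
example_function_v1 <- function(arg_a) {
  paste("result for", arg_a)
}

# Hypothetical "new" release: arg_b is now mandatory (it has no default).
example_function_v2 <- function(arg_a, arg_b) {
  paste("result for", arg_a, "and", arg_b)
}

# Our published call works against the old release...
old_result <- example_function_v1(arg_a = "a")

# ...but the same call fails against the new release because arg_b is missing.
# tryCatch() captures the error message instead of stopping the script.
new_result <- tryCatch(
  example_function_v2(arg_a = "a"),
  error = function(e) conditionMessage(e)
)

print(old_result)
print(new_result)
```

The call itself did not change between the two runs; only the installed package version did, which is the situation that tracking package versions is meant to prevent.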
To ensure that the right functions are used in replication efforts, it is important for us to keep track of package versions used in each project.
renv can be used to promote reproducible environments within R projects. renv creates an individual package library for each project, instead of having all projects, which may use different versions of the same package, share one package library. However, for projects that use many packages, this process can be memory intensive and increase the time needed for new users to start running code.
In this lab manual chapter, we provide a quick tutorial for integrating renv into research workflows. For more detailed instructions, please refer to the renv package vignette.
14.1.2 Implementing renv in projects
Ideally, renv should be initiated at the start of a project and updated continuously as new packages are introduced in the codebase. However, this process can be initiated at any point in a project.
To add renv to your workflow, follow these steps:
- Install the renv package by running install.packages("renv")
- Create an RProject file and ensure that your working directory is set to the correct folder
- In the R console, run renv::init() to initialize renv in your R Project
- This will create the following files: renv.lock, .Rprofile, renv/settings.json and renv/activate.R. Commit and push these files to GitHub so that they’re accessible to other users.
- As you write code, update the project’s R library by running renv::snapshot() in the R console
- Add renv::restore() to the head of your config file, to make sure that all users that run your code are on the same package versions.
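The setup steps above can be sketched as two small helpers. This is a minimal sketch rather than a required lab convention: the function names setup_renv and update_lockfile are hypothetical, and the calls are wrapped in functions so that sourcing the file does not change anything until you call them from the project root.

```r
# One-time setup: install renv if needed, then initialize it for the project.
setup_renv <- function() {
  if (!requireNamespace("renv", quietly = TRUE)) {
    install.packages("renv")
  }
  # Creates the project library, renv.lock, and the .Rprofile hook.
  renv::init()
}

# Ongoing maintenance: run after adding or updating packages in your code.
update_lockfile <- function() {
  # Report differences between the library, the lock file, and the code.
  renv::status()
  # Record the package versions currently in use in renv.lock.
  renv::snapshot()
}
```

After calling update_lockfile(), commit the changed renv.lock so that collaborators pick up the new versions the next time they restore.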
14.1.3 Using projects with renv
If you’re starting to work on an ongoing project that already has renv set up, follow these steps to ensure that you’re using the same package versions as the rest of the project team.
- Install the renv package by running install.packages("renv")
- Pull the most updated version of the project from GitHub
- Open the project’s RProject file
- Run renv::restore(). In our lab’s projects, this is often already found at the top of the config file, so you can just run scripts as is.
- This will pull up a list of the project’s packages that need to be updated for you to be consistent with the project. The console will ask if you want to proceed with updating these packages - type "Y" to continue.
- Wait for the correct versions of each package to install/update. This may take some time, depending on how many packages the project uses.
- Your R environment should now be using the same package versions as specified in the renv.lock file. You should now be able to replicate the code.
- If you make edits to the code and introduce new/updated packages, see the section above for instructions on how to make updates.
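As a rough illustration of the restore step, the head of a config file might contain something like the following. The helper name sync_project_library is hypothetical, and the sketch assumes the script is run from the project root where renv.lock lives.

```r
# Bring the local project library in line with renv.lock.
sync_project_library <- function(prompt = interactive()) {
  if (!requireNamespace("renv", quietly = TRUE)) {
    stop("renv is not installed; run install.packages(\"renv\") first.")
  }
  # Installs the package versions recorded in renv.lock. When prompt is TRUE,
  # renv lists the changes and asks for confirmation before proceeding.
  renv::restore(prompt = prompt)
}
```

Calling sync_project_library() in an interactive session shows the confirmation prompt described in the steps above; passing prompt = FALSE skips it, which can be convenient for scripts that run unattended.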
14.2 Documenting the R version
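renv records the R version in renv.lock, but it does not install or manage R itself, so a collaborator can restore the right package versions and still run them on a different version of R. Since R updates can change behavior (e.g., the change of the default stringsAsFactors argument to FALSE in R 4.0.0, or the change to the default sampling method used by sample() in R 3.6.0), it is important to document the R version used in each project.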
Add the following to your project’s README: This project was developed using R version 4.3.1 (2023-06-16). Package versions are tracked with renv (see renv.lock).
You can find your current R version by running R.version.string in the console. Update this line in the README whenever you upgrade R for a project.
renv also records the R version automatically in the R section at the top of renv.lock each time you run renv::snapshot(). For published code, make sure the committed lock file is up to date so that reviewers or replicators can check whether their R version matches before attempting to reproduce results.
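A version check can also be automated. The sketch below uses only base R; the helper name check_r_version is hypothetical, and the expected version is simply the one from the README example above, so replace it with the version your project documents.

```r
# R version the project was developed with (from the README example).
expected_r_version <- "4.3.1"

# Compare the running R version with the documented one and warn on mismatch.
check_r_version <- function(expected, current = getRversion()) {
  matches <- current == package_version(expected)
  if (!matches) {
    warning(
      "This project expects R ", expected,
      " but the current session is R ", as.character(current), ".",
      call. = FALSE
    )
  }
  invisible(matches)
}

# Print the string to paste into the README.
print(R.version.string)
```

Calling check_r_version(expected_r_version) at the top of a config file returns TRUE silently when the versions match and issues a warning otherwise. A warning rather than an error is a design choice here, since a small version difference often still works; use stop() instead if your project needs an exact match.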
14.3 Ensuring consistency between local and cluster environments
If you develop code locally and run it on Sherlock, differences between your local R environment and the cluster environment can cause code to behave differently or fail. The most common sources of discrepancy are the R version, package versions, and system libraries.
Use the same R version on both. Check which R versions are available on Sherlock by running module avail R on the command line. Load a specific version with module load R/4.3.1 (replacing with your version). If your local R version is not available on Sherlock, either update your local installation to match an available Sherlock version, or vice versa. Document the version you are using in your README.
Use renv on Sherlock. The same renv.lock file that tracks packages locally should be used on Sherlock. After cloning your repo on Sherlock, run renv::restore() to install the correct package versions. Note that some packages with compiled code (e.g., data.table, Stan-based packages) may need to be recompiled on Sherlock due to differences in the operating system and system libraries. If renv::restore() fails for specific packages, try installing them manually with renv::install("package_name") and then running renv::snapshot() to update the lock file.
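The restore-then-fall-back approach described above might look like the following. This is a sketch under stated assumptions: restore_with_fallback is a hypothetical helper, and the packages passed as fallback_pkgs are whichever ones failed for you.

```r
# Try renv::restore(); if it errors, install the listed packages manually
# and update the lock file, as described in the text.
restore_with_fallback <- function(fallback_pkgs = character()) {
  restored <- tryCatch(
    {
      renv::restore(prompt = FALSE)
      TRUE
    },
    error = function(e) {
      message("renv::restore() failed: ", conditionMessage(e))
      FALSE
    }
  )
  if (!restored && length(fallback_pkgs) > 0) {
    # Install the problem packages directly into the project library.
    renv::install(fallback_pkgs)
    # Record whatever versions were installed in renv.lock.
    renv::snapshot(prompt = FALSE)
  }
  invisible(restored)
}
```

For example, restore_with_fallback("data.table") retries data.table manually if the full restore fails. Because the snapshot step rewrites renv.lock, review the resulting diff before committing it so that a cluster-specific workaround does not silently change versions for local users.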
Test on Sherlock before running full analyses. Before submitting large batch jobs, run a small test script interactively on a development node (sdev) to verify that packages load correctly and that a simple version of your analysis produces expected output. This catches environment mismatches before you spend allocation hours on jobs that will fail.
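A small test script for this purpose could look like the sketch below. The package list and the simulated regression are placeholders, not part of any lab project: substitute the packages your analysis loads and a cut-down version of your own model.

```r
# Packages the project needs; base packages are used here as placeholders.
required_pkgs <- c("stats", "utils")

# 1. Confirm that every required package can be loaded.
pkg_ok <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
if (!all(pkg_ok)) {
  stop("Packages failed to load: ",
       paste(required_pkgs[!pkg_ok], collapse = ", "))
}

# 2. Run a tiny version of the analysis on simulated data.
set.seed(123)
n <- 50
x <- rnorm(n)
y <- 2 * x + rnorm(n)
fit <- lm(y ~ x)
slope <- unname(coef(fit)["x"])

# 3. Log the environment and the result for comparison with a local run.
message("R version: ", R.version.string)
message("Estimated slope: ", round(slope, 3))
```

Running the same script locally and on the cluster and comparing the logged values gives a quick indication of whether the two environments agree before any large job is submitted.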