Chapter 6 Coding practices
by Kunal Mishra, Jade Benjamin-Chung, Stephanie Djajadi, and Iris Tong
6.1 Organizing scripts
Just as your data “flows” through your project, data should flow naturally through a script. Very generally, you want to:
- describe the work completed in the script in a comment header
- source your configuration file (`0-config.R`)
- load all your data
- do all your analysis/computation
- save your data
Each of these sections should be “chunked together” using comments. See this file for a good example of how to cleanly organize a script in a way that follows this “flow” and functionally separates pieces of code that are doing different things.
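Here is a minimal sketch of a script that follows this flow; the project name, file paths, and function calls are hypothetical, and `path_to_flu_data` and `path_to_results` are assumed to be defined in `0-config.R`:

```r
################################################################################
# @Organization - Example Organization
# @Project - Example Project
# @Description - Estimate flu-season means by site (hypothetical example)
################################################################################

# Source configuration file ----------------------------------------------
source("0-config.R")

# Load data ---------------------------------------------------------------
flu_data <- readr::read_csv(file = path_to_flu_data)  # path defined in 0-config.R

# Analysis / computation --------------------------------------------------
mean_Y <- calc_fluseas_mean(
  data  = flu_data,
  yname = "maari_yn"
)

# Save results -------------------------------------------------------------
saveRDS(mean_Y, file = path_to_results)  # path defined in 0-config.R
```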
6.2 Documenting your code
6.2.1 File headers
Every file in a project should have a header that allows it to be interpreted on its own. It should include the name of the project and a short description of what this file (among the many in your project) does specifically. You may optionally wish to include the inputs and outputs of the script as well, though the next section makes this significantly less necessary.
################################################################################
# @Organization - Example Organization
# @Project - Example Project
# @Description - This file is responsible for [...]
################################################################################
6.2.2 Sections and subsections
RStudio (v1.4 or more recent) supports the use of sections and subsections. You can easily navigate through longer scripts using the outline pane in RStudio.
# Section -------
## Subsection -------
### Sub-subsection -------
6.2.3 Code folding
Consider using RStudio’s code folding feature to collapse and expand different sections of your code. Any comment line with at least four trailing dashes (-), equal signs (=), or pound signs (#) automatically creates a code section that can be folded.
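For example, in a hypothetical script, each of the following comment lines starts a section that can be folded in RStudio:

```r
# Load data ----------------------------------------------------------------

# Clean data ================================================================

# Fit models ################################################################
```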
6.2.5 Function documentation
Every function you write must include a header to document its purpose, inputs, and outputs. These headers are essential for reproducible workflows because R is dynamically typed: you can pass a string into an argument that is meant to be a `data.table`, or a list into an argument meant for a `tibble`. It is the responsibility of a function’s author to document what each argument is meant to do and its basic type. Here is an example of documenting a function (inspired by JavaDocs and R’s Plumber API docs):
##############################################
##############################################
# Documentation: calc_fluseas_mean
# Usage: calc_fluseas_mean(data, yname)
# Description: Make a dataframe with rows for flu season and site
# and the number of patients with an outcome, the total patients,
# and the percent of patients with the outcome
# Args/Options:
# data: a data frame with variables flu_season, site, studyID, and yname
# yname: a string for the outcome name
# silent: a boolean specifying whether the function should suppress output to the console (DEFAULT: TRUE)
# Returns: the data frame described above
# Output: prints the data frame described above if silent is not TRUE
calc_fluseas_mean = function(data, yname, silent = TRUE) {
### function code here
}
The header tells you what the function does, its various inputs, and how you might go about using the function to do what you want. Also notice that all optional arguments (i.e. ones with pre-specified defaults) follow arguments that require user input.
Note: As someone trying to call a function, you can access a function’s documentation (and internal code) by CMD-Left-Clicking the function’s name in RStudio.

Note: Depending on how important your function is, the complexity of your function code, and the complexity of the different types of data in your project, you can also add “type-checking” to your function with the `assertthat::assert_that()` function. You can, for example, `assert_that(is.data.frame(statistical_input))`, which will ensure that collaborators or reviewers of your project attempting to use your function call it in the way that it is intended, with (at the minimum) the correct type of arguments. You can extend this to ensure that certain assumptions regarding the inputs are fulfilled as well (e.g. that `time_column`, `location_column`, `value_column`, and `population_column` all exist within the `statistical_input` tibble).
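As a brief illustration, here is a minimal sketch of such type-checking; the function name, arguments, and column check are hypothetical:

```r
library(assertthat)

calc_weekly_mean <- function(statistical_input, value_column) {
  # Check the basic types of the arguments before doing any work
  assert_that(is.data.frame(statistical_input))
  assert_that(is.character(value_column))

  # Check that an assumed column actually exists in the input
  assert_that(
    value_column %in% names(statistical_input),
    msg = paste0("Column '", value_column, "' not found in statistical_input")
  )

  mean(statistical_input[[value_column]], na.rm = TRUE)
}
```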
6.3 Object naming
Generally we recommend using nouns for objects and verbs for functions. This is because functions are performing actions, while objects are not.
Try to make your variable names both more expressive and more explicit. Being a bit more verbose is useful and easy in the age of autocompletion! For example, instead of naming a variable `vaxcov_1718`, try naming it `vaccination_coverage_2017_18`. Similarly, `flu_res` could be named `absentee_flu_residuals`, making your code more readable and explicit.
- For more help, check out Be Expressive: How to Give Your Variables Better Names
We recommend you use snake_case.
Base R allows `.` in variable and function names (such as `read.csv()`), but this goes against best practices for variable naming in many other coding languages. For consistency’s sake, `snake_case` has been adopted across languages, and modern packages and functions typically use it (e.g. `readr::read_csv()`). As a very general rule of thumb, if a package you’re using doesn’t use `snake_case`, there may be an updated version or a more modern package that does, bringing with it the performance improvements and bug fixes inherent in more mature and modern software.

Note: you may also see `camelCase` throughout the R code you come across. This is okay but not ideal; try to stay consistent across all your code with `snake_case`.

Note: there’s nothing inherently wrong with using `.` in variable names, just that it goes against the style best practices that are cropping up in data science, so it’s worth getting rid of this habit now.
6.4 Function calls
In a function call, use “named arguments” and put each argument on a separate line to make your code more readable.
Here’s an example of what not to do when calling the function `calc_fluseas_mean` (defined above):
mean_Y = calc_fluseas_mean(flu_data, "maari_yn", FALSE)
And here it is again using the best practices we’ve outlined:
mean_Y = calc_fluseas_mean(
data = flu_data,
yname = "maari_yn",
silent = FALSE
)
6.5 The here package
The `here` package is a great R package that helps multiple collaborators deal with the mess that is working directories within an R project structure. Let’s say we have an R project at the path `/home/oski/Some-R-Project`. My collaborator might clone the repository and work with it at some other path, such as `/home/bear/R-Code/Some-R-Project`. Dealing with working directories and paths explicitly can be a very large pain, and as you might imagine, setting up a config with paths requires those paths to flexibly work for all contributors to a project. This is where the `here` package comes in, and this is a great vignette describing it.
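A minimal sketch of how `here()` is typically used; the sub-directories and file name below are hypothetical:

```r
library(here)

# here() builds the path from the project root, so the same line works for
# every collaborator no matter where they cloned the repository
flu_data <- readr::read_csv(here("data", "untouched", "flu_data.csv"))
```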
6.6 Reading/Saving Data
6.6.1 .RDS vs .RData Files
One of the most common ways to load and save data in Base R is with the `load()` and `save()` functions, which serialize multiple objects in a single `.RData` file. The biggest problems with this practice are the inability to control the names of the objects being loaded, the confusion this creates when reading older code, and the inability to load individual elements of a saved file. For these reasons, we recommend using the RDS format to save R objects.
- Note: if you have many related R objects that you would have otherwise saved together using the `save()` function, the functional equivalent with RDS is to create a (named) list containing each of these objects and save it, as in the sketch below.
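A minimal sketch of the named-list approach, using hypothetical object and file names:

```r
# Instead of save(model_fit, model_residuals, file = "results.RData"),
# bundle the related objects into a named list and save one .RDS file
results <- list(
  model_fit       = model_fit,
  model_residuals = model_residuals
)
saveRDS(results, file = "results.RDS")

# When reading the file back in, you control the name of the loaded object
flu_results <- readRDS("results.RDS")
flu_results$model_fit
```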
6.6.2 CSVs
Once again, the `readr` package, part of the Tidyverse, is great, with a much faster `read_csv()` than Base R’s `read.csv()`. For massive CSVs (> 5 GB), you’ll find `data.table::fread()` to be the fastest CSV reader in any data science language. For writing CSVs, `readr::write_csv()` and `data.table::fwrite()` outclass Base R’s `write.csv()` by a significant margin as well.
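A short sketch of the recommended readers and writers; the file paths are hypothetical:

```r
library(readr)
library(data.table)

# Most CSVs: readr
flu_data <- read_csv("data/flu_data.csv")
write_csv(flu_data, "data/flu_data_clean.csv")

# Massive CSVs (> 5 GB): data.table
absentee_all <- fread("data/absentee_all.csv")
fwrite(absentee_all, "data/absentee_all_clean.csv")
```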
6.7 Integrating Box and Dropbox
Box and Dropbox are cloud-based file sharing systems that are useful when dealing with large files. When our scripts generate large output files, the files can slow down the workflow if they are pushed to GitHub. This makes collaboration difficult when not everyone has a copy of the file, unless we decide to duplicate files and share them manually. The files might also take up a lot of local storage. Box and Dropbox help us avoid these issues by automatically storing the files, reading data, and writing data back to the cloud.
Box and Dropbox are separate platforms, but we can use either one to store and share files. To use them, we can install the packages that have been created to integrate Box and Dropbox into R. The set-up instructions are detailed below.
Make sure to authenticate before reading and writing from either Box or Dropbox. The authentication commands should go in the configuration file; authentication only needs to be done once. It will prompt you to enter your login credentials for Box or Dropbox and will allow your application to access your shared folders.
6.7.1 Box
Follow the instructions in this section to use the `boxr` package. Note that there are a few setup steps that need to be done on the Box website before you can use the `boxr` package, explained here in the section “Creating an Interactive App.” This step generates the authentication keys that you will supply to `boxr`.
Once that is done, add the authentication keys to your code in the configuration file with `box_auth(client_id = "<your_client_id>", client_secret = "<your_client_secret_id>")`. It is also important to set the default working directory so that the code references the correct folder in Box: `box_setwd(<folder_id>)`. The folder ID is the sequence of digits at the end of the folder’s URL.
Further details can be found here.
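Putting those pieces together, a minimal configuration-file sketch might look like this; the client ID, client secret, and folder ID below are placeholders:

```r
library(boxr)

# Authenticate once with the keys from your Box interactive app (placeholders)
box_auth(
  client_id     = "<your_client_id>",
  client_secret = "<your_client_secret_id>"
)

# Set the default Box folder; the folder ID is the digits at the end of its URL
box_setwd(dir_id = 123456789)
```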
6.7.2 Dropbox
Follow the instructions at this link to use the `rdrop2` package. Similar to the `boxr` package, you must authenticate before reading and writing from Dropbox, which can be done by adding `drop_auth()` to the configuration file.
Saving the authentication token is not required, although it may be useful if you plan on using Dropbox frequently. To do so, save the token with the following commands. Tokens are valid until they are manually revoked.
# first time only
# save the output of drop_auth to an RDS file
token <- drop_auth()
# this token only has to be generated once, it is valid until revoked
saveRDS(token, "/path/to/tokenfile.RDS")
# all future usages
# to use a stored token, provide the rdstoken argument
drop_auth(rdstoken = "/path/to/tokenfile.RDS")
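Once authenticated, a minimal sketch of reading from and writing to Dropbox; the folder and file names below are hypothetical:

```r
library(rdrop2)

# Read a shared CSV directly from Dropbox into R
flu_data <- drop_read_csv("example-project/data/flu_data.csv")

# Upload a local results file back to a Dropbox folder
drop_upload("results/model_output.csv", path = "example-project/results")
```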
6.8 Tidyverse
Throughout this document there have been references to the Tidyverse, but this section explicitly shows you how to transform your Base R tendencies into Tidyverse (or data.table, the Tidyverse’s performance-optimized competitor). For most of our work that does not utilize very large datasets, we recommend that you code in Tidyverse rather than Base R. Tidyverse is quickly becoming the gold standard in R data analysis, and modern data science packages and code should use Tidyverse style and packages unless there’s a significant reason not to (e.g. big data pipelines that would benefit from data.table’s performance optimizations).
Hadley Wickham, the lead author of the Tidyverse, has published a great textbook, R for Data Science, which leans heavily on many Tidyverse packages and may be worth checking out.
The following list is not exhaustive, but is a compact overview to begin to translate Base R into something better:
| Base R | Better Style, Performance, and Utility |
|---|---|
| `read.csv()` | `readr::read_csv()` or `data.table::fread()` |
| `write.csv()` | `readr::write_csv()` or `data.table::fwrite()` |
| `readRDS()` | `readr::read_rds()` |
| `saveRDS()` | `readr::write_rds()` |
| `data.frame()` | `tibble::tibble()` or `data.table::data.table()` |
| `rbind()` | `dplyr::bind_rows()` |
| `cbind()` | `dplyr::bind_cols()` |
| `df$some_column` | `df %>% dplyr::pull(some_column)` |
| `df$some_column = ...` | `df %>% dplyr::mutate(some_column = ...)` |
| `df[get_rows_condition,]` | `df %>% dplyr::filter(get_rows_condition)` |
| `df[,c(col1, col2)]` | `df %>% dplyr::select(col1, col2)` |
| `merge(df1, df2, by = ..., all.x = ..., all.y = ...)` | `df1 %>% dplyr::left_join(df2, by = ...)` or `dplyr::full_join` or `dplyr::inner_join` or `dplyr::right_join` |
| `str()` | `dplyr::glimpse()` |
| `grep(pattern, x)` | `stringr::str_which(string, pattern)` |
| `gsub(pattern, replacement, x)` | `stringr::str_replace(string, pattern, replacement)` |
| `ifelse(test_expression, yes, no)` | `if_else(condition, true, false)` |
| Nested: `ifelse(test_expression1, yes1, ifelse(test_expression2, yes2, ifelse(test_expression3, yes3, no)))` | `case_when(test_expression1 ~ yes1, test_expression2 ~ yes2, test_expression3 ~ yes3, TRUE ~ no)` |
| `proc.time()` | `tictoc::tic()` and `tictoc::toc()` |
| `stopifnot()` | `assertthat::assert_that()` or `assertthat::see_if()` or `assertthat::validate_that()` |
For a more extensive set of syntactical translations to Tidyverse, you can check out this document.
Working with the Tidyverse within functions can be somewhat of a pain due to non-standard evaluation (NSE) semantics (see the short sketch after this list). If you’re an avid function writer, we’d recommend checking out the following resources:
- Tidy Eval in 5 Minutes (video)
- Tidy Evaluation (e-book)
- Data Frame Columns as Arguments to Dplyr Functions (blog)
- Standard Evaluation for *_join (stackoverflow)
- Programming with dplyr (package vignette)
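As a brief illustration of the issue, one common modern pattern is the embrace operator (`{{ }}`), which lets a function forward bare column names to dplyr verbs; the function and column names below are hypothetical:

```r
library(dplyr)

# Summarize an arbitrary column by an arbitrary grouping column
summarize_by <- function(data, group_col, value_col) {
  data %>%
    group_by({{ group_col }}) %>%
    summarize(mean_value = mean({{ value_col }}, na.rm = TRUE), .groups = "drop")
}

# Usage with a hypothetical flu_data data frame:
# summarize_by(flu_data, group_col = site, value_col = maari_yn)
```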
6.9 Coding with R and Python
If you’re using both R and Python, you may wish to check out the Feather package for exchanging data between the two languages extremely quickly.
6.10 Repeating analyses with different variations
In many cases, we will need to apply our modeling to different combinations of interest (outcomes, exposures, etc.). We can certainly use a `for` loop to repeat the execution of a wrapper function, but `for` loops can use a lot of memory and take a long time to run.
Fortunately, R has some functions that implement looping in a compact form to help you repeat your analyses with different variations (subgroups, outcomes, covariate sets, etc.) with better performance.
6.10.1 lapply() and sapply()
`lapply()` is a function in base R that applies a function to each element of a list and returns a list. It’s typically faster than a `for` loop. Here is a simple generic example:
result <- lapply(X = mylist, FUN = func)
There is another very similar function called `sapply()`. It also takes a list as its input, but if the output of `func` is of the same length for each element in the input list, then `sapply()` will simplify the output to the simplest data structure possible, which will usually be a vector.
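A toy sketch of the difference; `mylist` and `func` below are hypothetical:

```r
mylist <- list(site_a = 1:3, site_b = 4:6)
func <- function(x) sum(x^2)

result_list   <- lapply(X = mylist, FUN = func)  # returns a list
result_vector <- sapply(X = mylist, FUN = func)  # simplifies to a named vector
```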
6.10.2 mapply() and pmap()
Sometimes, we’d like to employ a wrapper function that takes arguments from multiple different lists/vectors. In that case, we can consider using `mapply()` from base R or `pmap()` from the `purrr` package.
Please see the simple example below, where the two input lists are of the same length and we are doing a pairwise calculation:
library(purrr)

mylist1 <- list(0:3)
mylist2 <- list(6:9)
mylists <- list(mylist1, mylist2)

square_sum <- function(x, y) {
  x^2 + y^2
}

# Use mapply(): pass the lists as separate arguments
result1 <- mapply(FUN = square_sum, mylist1, mylist2)

# Use pmap(): pass a list of lists
result2 <- pmap(.l = mylists, .f = square_sum)

# Both results contain 36, 50, 68, 90
# (result1 as a matrix/array, result2 as a list)
There are two major differences between `mapply()` and `pmap()`. The first is that `mapply()` takes separate lists as its input arguments, while `pmap()` takes a list of lists. Secondly, the output of `mapply()` will be in the form of a matrix or an array, but `pmap()` produces a list directly.
However, when the input lists are of different lengths and/or the wrapper function doesn’t take arguments in pairs, `mapply()` and `pmap()` may not give the results you want.
Both `mapply()` and `pmap()` will recycle shorter input lists to match the length of the longest input list. Assume that now `mylist2 = list(6:12)`. Then `pmap(mylists, square_sum)` will generate `[36 50 68 90 100 122 148]`, where elements 0, 1, and 2 are recycled and paired with 10, 11, and 12. It will also produce the warning “longer object length is not a multiple of shorter object length.”
Thus, unless the recycling pattern described above is a desirable feature for a certain experimental design, when the input lists are of different lengths the best practice is probably to use `lapply()` and then combine the results.
Here is an example where we’d like to find `square_sum` for every element combination of `mylist1` and `mylist2`.
mylist1 <- list(0:3)
mylist2 <- list(6:12)
square_sum <- function(x, y) {
x^2 + y^2
}
results <- list()
for (i in seq_along(mylist1[[1]])) {
result <- lapply(X = mylist2, FUN = function(y) square_sum(mylist1[[1]][i], y))
results[[i]] <- result
}
This example does not pair 0 with 6, 1 with 7, and so on. Instead, every element in `mylist1` is paired with every element in `mylist2`. Thus, the “unlisted” results from the example will have \(4 \times 7 = 28\) elements.
We can use the `purrr::flatten()` or `unlist()` functions to decrease the depth of our results. If the results are data frames, then we will need to use `dplyr::bind_rows()` to combine them.
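For example, if each call returns a data frame, a sketch of combining the results might look like this; the input list and summary below are hypothetical:

```r
library(dplyr)

# Hypothetical: one data frame of results per site, produced by lapply()
results_list <- lapply(c("site_a", "site_b", "site_c"), function(site) {
  data.frame(site = site, mean_outcome = mean(rnorm(10)))
})

# Collapse the list of data frames into a single data frame
results_df <- bind_rows(results_list)
```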
6.10.3 Parallel processing with the parallel and future packages
One big drawback of `lapply()` is its long computation time, especially when the list is long. Fortunately, most computers nowadays have multiple cores, which makes parallel processing possible and computation much faster.
Assume you have a list called `mylist` of length 1000, and `lapply(X = mylist, FUN = func)` applies the function to each of the 1000 elements one by one in \(T\) seconds. If we could execute `func` on \(n\) processors simultaneously, then ideally we would shrink the computation time to \(T/n\) seconds.
In practice, using functions from the `parallel` and `future` packages, we can split `mylist` into smaller chunks and apply the function to the elements of each chunk in parallel on different cores, significantly reducing the run time.
6.10.3.1 parLapply()
Below is a generic example of `parLapply()`:
library(parallel)

# Set how many processors will be used to process the list and make the cluster
n_cores <- 4
cl <- makeCluster(n_cores)

# Use parLapply() to apply func to each element in mylist
result <- parLapply(cl = cl, X = mylist, fun = func)

# Stop the parallel processing
stopCluster(cl)
Let’s still assume `mylist` is of length 1000. The `parLapply()` call above splits `mylist` into 4 sub-lists, each of length 250, and applies the function to the elements of each sub-list in parallel: roughly speaking, the function is first applied to elements 1, 251, 501, and 751, then to elements 2, 252, 502, and 752, and so on. As such, the computation time will be greatly reduced.
You can use `parallel::detectCores()` to check how many cores your machine has and to help decide what to use for `n_cores`. It is a good idea to leave at least one core free for the operating system.
We will always start `parLapply()` with `makeCluster()`. `stopCluster()` is not strictly necessary but follows best practice: if the cluster is not stopped, the worker processes keep running in the background and consume computation capacity that other software on your machine could use. Keep in mind that stopping the cluster is similar to quitting R, meaning that the next time you do parallel processing with `parLapply()`, you will need to re-load any packages the workers need.
6.10.3.2 future_lapply()
Below is a generic example of `future_lapply()` from the `future.apply` package:
library(future)
library(future.apply)

# First, plan how the future_lapply() call will be resolved
future::plan(
  multisession, workers = future::availableCores() - 1
)

# Use future_lapply() to apply func to each element in mylist
result <- future_lapply(X = mylist, FUN = func)
Here, `future::availableCores()` checks how many cores your machine has. Similar to `parLapply()` shown above, `future_lapply()` parallelizes the computation of `lapply()` by executing the function `func` simultaneously on different sub-lists of `mylist`.
6.11 Reviewing Code
Before publishing new changes, it is important to ensure that the code has been tested and well documented. GitHub makes it possible to document all of these changes in a pull request. Pull requests can be used to describe changes in a branch that are ready to be merged with the base branch (more information in the GitHub section). GitHub allows users to create a pull request template in a repository to standardize and customize the information in a pull request. When you add a pull request template to your repository, everyone will automatically see the template’s contents in the pull request body.
6.11.1 Creating a Pull Request Template
Follow the instructions below to add a pull request template to a repository. More details can be found at this GitHub link.
- On GitHub, navigate to the main page of the repository.
- Above the file list, click `Create new file`.
- Name the file `pull_request_template.md`. GitHub will not recognize this as the template if it is named anything else. The file must be on the `master` branch.
  - To store the file in a hidden directory instead of the main directory, name the file `.github/pull_request_template.md`.
- In the body of the new file, add your pull request template. This could include:
  - A summary of the changes proposed in the pull request
  - How the change has been tested
  - @mentions of the person or team responsible for reviewing proposed changes
Here is an example pull request template.
# Description
## Summary of change
Please include a summary of the change, including any new functions added and example usage.
## Link to Spec
Please include a link to the Trello card or Google document with details of the task.
## Who should review the pull request?
@ ...
6.2.4 Comments in the body of your script
Commenting your code is an important part of reproducibility and helps document your code for the future. When things change or break, you’ll be thankful for comments. There’s no need to comment excessively or unnecessarily, but a comment describing what a large or complex chunk of code does is always helpful. See this file for an example of how to comment your code and notice that comments are always in the form of:
# This is a comment -- first letter is capitalized and spaced away from the pound sign