Chapter 9 Working with Big Data

by Kunal Mishra and Jade Benjamin-Chung

9.1 The data.table package

It may also be the case that you’re working with very large datasets; generally, I would define these as 10+ million rows. As outlined in this document, the three main players in the data analysis space are Base R, the Tidyverse (more specifically, dplyr), and data.table. For the majority of tasks, Base R is inferior to both dplyr and data.table, with syntax that is concise but less clear, and slower performance. dplyr is architected for small and medium data; while it’s very fast for everyday usage, it trades off maximum performance for ease of use and readable syntax compared to data.table. An overview of the dplyr vs. data.table debate can be found in this Stack Overflow post, and all three answers are worth a read.
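
To make the contrast concrete, here is a minimal sketch of the same group-wise mean computed with each package. The dataset and its year and income columns are made up for illustration, not taken from any of our studies.

    library(data.table)
    library(dplyr)

    # Hypothetical example data: 1 million rows
    df <- data.frame(
      year   = sample(2000:2010, 1e6, replace = TRUE),
      income = rnorm(1e6, mean = 50000, sd = 10000)
    )
    dt <- as.data.table(df)

    # dplyr: readable, verb-based pipeline
    df %>%
      group_by(year) %>%
      summarise(mean_income = mean(income))

    # data.table: terser i/j/by syntax, typically faster at scale
    dt[, .(mean_income = mean(income)), by = year]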

You can also achieve a performance boost by running dplyr commands on data.tables, which I find to be the best of both worlds, given that a data.table is a special type of data.frame and fairly easy to convert to with the as.data.table() function. The speedup comes from dplyr’s data.table backend, the dtplyr package, which translates dplyr pipelines into data.table operations; in the future this coupling should become even more natural.
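
If you go this route, dtplyr’s lazy_dt() wrapper is the usual entry point. Below is a hedged sketch reusing the hypothetical df from the previous example.

    library(data.table)
    library(dtplyr)
    library(dplyr)

    dt <- as.data.table(df)   # convert once, up front
    lazy <- lazy_dt(dt)       # dplyr verbs on this object are
                              # translated into data.table code

    result <- lazy %>%
      filter(year >= 2005) %>%
      group_by(year) %>%
      summarise(mean_income = mean(income)) %>%
      as_tibble()             # computation happens at this step

    # show_query() prints the generated data.table code, which is a
    # nice way to learn the translation
    show_query(lazy %>% filter(year >= 2005))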

If you want to test whether a certain coding approach increases speed, consider the tictoc package. Run tic() before a code chunk and toc() after it to measure the elapsed time it takes to run the chunk. For example, you might use this to decide whether you really need to switch a code chunk from dplyr to data.table, as in the sketch below.
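
A minimal example, timing the hypothetical data.table aggregation from above:

    library(tictoc)

    tic("group-wise mean with data.table")
    dt[, .(mean_income = mean(income)), by = year]
    toc()  # prints the elapsed time since the matching tic()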

9.2 Using downsampled data

In our studies with very large datasets, we save “downsampled” data that usually includes a 1% random sample stratified by any important variables, such as year or household ID; one way to create such a sample is sketched below. This allows us to write and test our code efficiently without having to load large, slow datasets that can cause RStudio to freeze. Be very careful to check which dataset you are working with and to label results output accordingly.
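
Here is a hedged sketch of creating a 1% downsample stratified by year, using dplyr’s slice_sample(). The name full_data, its year column, and the output path are hypothetical placeholders for a real study dataset.

    library(dplyr)

    set.seed(452)  # make the random sample reproducible

    # `full_data` and `year` are hypothetical placeholders
    downsample <- full_data %>%
      group_by(year) %>%
      slice_sample(prop = 0.01) %>%   # 1% within each stratum
      ungroup()

    saveRDS(downsample, "data/downsampled-1pct.RDS")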

9.3 Optimal RStudio set up

Using the following settings will help ensure a smooth experience when working with big data. In RStudio, go to the “Tools” menu, then select “Global Options.” Under “General”:

Workspace

  • Uncheck “Restore .RData into workspace at startup”
  • Set “Save workspace to .RData on exit” to “Never”

History

  • Uncheck “Always save history”

Unfortunately, RStudio often becomes slow or freezes after hours of working with big datasets. Sometimes it is much more efficient to run code from the Terminal or Git Bash (e.g., executing scripts with Rscript) and to make updates in git from the command line.