Chapter 9 Working with Big Data
by Kunal Mishra and Jade Benjamin-Chung
9.1 The data.table package
You may also find yourself working with very large datasets, which I would generally define as 10 million rows or more. As outlined in this document, the three main players in the data analysis space are Base R, the Tidyverse (more specifically, dplyr), and data.table. For a majority of things, Base R is inferior to both dplyr and data.table, with concise but less clear syntax and less speed. dplyr is architected for medium and smaller data, and while it is very fast for everyday usage, it trades off maximum performance for ease of use and syntax compared to data.table. An overview of the dplyr vs. data.table debate can be found in this stackoverflow post, and all 3 answers are worth a read.
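To make the syntax comparison concrete, here is a minimal sketch of one grouped mean written three ways. The data frame and its column names (year, value) are invented for illustration, and the example is far too small to say anything about speed.

```r
library(dplyr)
library(data.table)

# Hypothetical data: 8 rows across two made-up years.
df <- data.frame(
  year  = rep(c(2001, 2002), each = 4),
  value = c(1, 2, 3, 4, 10, 20, 30, 40)
)

# Base R: formula interface to aggregate()
base_out <- aggregate(value ~ year, data = df, FUN = mean)

# dplyr: a pipeline of verbs
dplyr_out <- df %>%
  group_by(year) %>%
  summarise(value = mean(value))

# data.table: dt[i, j, by]
dt_out <- as.data.table(df)[, .(value = mean(value)), by = year]
```

All three objects hold the same two group means; only the syntax (and, on large data, the run time) differs.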
You can also achieve a performance boost by running dplyr commands on data.tables, which I find to be the best of both worlds, given that a data.table is a special type of data.frame and is fairly easy to convert with the as.data.table() function. The speedup is due to dplyr’s use of the data.table backend (available through the dtplyr package), and in the future this coupling should become even more natural.
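As a rough sketch of that workflow, the snippet below converts a data.frame with as.data.table() and then runs ordinary dplyr verbs on the result. The household data (hh, income) are made up, and the example only shows that the pieces fit together; it does not demonstrate a speedup.

```r
library(dplyr)
library(data.table)

# Hypothetical household data.
df <- data.frame(
  id     = 1:6,
  hh     = c("a", "a", "b", "b", "c", "c"),
  income = c(10, 20, 30, 40, 50, 60)
)

# Convert once; the result is both a data.table and a data.frame.
dt <- as.data.table(df)
inherits(dt, "data.frame")  # TRUE

# dplyr verbs accept the data.table because it is also a data.frame.
hh_income <- dt %>%
  filter(income > 10) %>%
  group_by(hh) %>%
  summarise(total = sum(income))
```

Timing your own pipeline on your own data is the only reliable way to tell whether the conversion pays off.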
If you want to test whether a certain coding approach increases speed, consider the tictoc package. Run tic() before a code chunk and toc() after it to measure the elapsed time the chunk takes to run. For example, you might use this to decide whether you really need to switch a code chunk from dplyr to data.table.
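A minimal sketch of this pattern, assuming the tictoc package is installed. The two chunks being compared (a loop and a vectorized sum) are stand-ins chosen for illustration; substitute the code you actually want to time.

```r
library(tictoc)

# Hypothetical workload: one million random numbers.
x <- runif(1e6)

# Approach 1: accumulate the total in an explicit loop.
tic("loop")
total_loop <- 0
for (v in x) total_loop <- total_loop + v
toc()

# Approach 2: vectorized sum.
tic("vectorized")
total_vec <- sum(x)
toc()
```

Each toc() call prints the label passed to tic() along with the elapsed seconds, so the two approaches can be compared side by side. Timings vary from run to run, so repeat the measurement a few times before deciding.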
9.2 Using downsampled data
In our studies with very large datasets, we save “downsampled” data that usually includes a 1% random sample stratified by any important variables, such as year or household ID. This allows us to efficiently write and test our code without having to load large, slow datasets that can cause RStudio to freeze. Always be sure which dataset you are working with, and label results output accordingly.
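One way such a sample might be drawn is sketched below with data.table. The dataset, the column names, and the output file name are all hypothetical, and stratifying by year alone is an assumption made to keep the example short.

```r
library(data.table)

set.seed(2024)  # make the random sample reproducible

# Hypothetical full dataset: 100,000 rows over 4 made-up years.
full <- data.table(
  id    = 1:100000,
  year  = rep(2015:2018, each = 25000),
  value = rnorm(100000)
)

# Draw 1% of the rows within each year (the stratifying variable).
down <- full[, .SD[sample(.N, round(.N * 0.01))], by = year]

# Check the number of sampled rows per stratum.
down[, .N, by = year]

# Hypothetical file name; label the file clearly so it is never
# mistaken for the full data.
# saveRDS(down, "analysis_data_downsampled_1pct.rds")
```

Sampling within each stratum keeps every year represented in the same proportion as in the full data, which a simple random sample of all rows would only do approximately.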
9.3 Optimal RStudio set up
Using the following settings will help ensure a smooth experience when working with big data. In RStudio, go to the “Tools” menu, then select “Global Options.” Under “General”:
- Uncheck Restore .RData into workspace at startup
- Set Save workspace to .RData on exit to Never
- Uncheck Always save history
Unfortunately, RStudio often gets slow or freezes after hours of working with big datasets. Sometimes it is much more efficient to just use Terminal or Git Bash to run code and make updates in git.