Last active
April 15, 2020 18:55
-
-
Save potterzot/4872b169c56fe1d3ee9889262ac05905 to your computer and use it in GitHub Desktop.
Introduction to data.table
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
title: An Introduction to data.table using geographic data | |
author: Nicholas A Potter | |
output: html_document | |
--- | |
```{r, echo = FALSE} | |
``` | |
This is a short introduction to using data.table based on my work with climate data. | |
## Initial setup | |
Let's begin by generating some data to work with. We'll use the following packages: | |
```{r, message=FALSE, warning=FALSE} | |
# If you don't have the packages below, install with: | |
#install.packages(c("data.table", "readr", "dplyr", "microbenchmark")) | |
library(data.table) | |
library(dplyr) # tidyverse data manipulation | |
library(readr) # tidyverse reading in data | |
library(microbenchmark) # for timing code to see which is faster | |
``` | |
We will work with a grid of latitude and longitude points, and we'll generate some fake temperature and precip data: | |
```{r} | |
# grid resolution | |
res <- 0.0416666666666667 | |
# Grid extents | |
lats <- seq( 24.358333333333333, 52.909999999999997, by = res) | |
lons <- seq(-124.933333300000000, -66.016666633333330, by = res) | |
# Define a data.table | |
dt <- data.table(expand.grid(lat = lats, lon = lons)) | |
# default printing of data.tables is screen friendly: | |
dt | |
# Assume: | |
# GDD is a function of latitude (Areas closer to the poles are less warm) | |
# Precip is random | |
# Equivalent to dplyr::mutate(gdd = ..., prec = ...) | |
dt[, `:=`( | |
gdd = 3000 - 70*(lat-24) + rnorm(n = nrow(dt), sd = 100), | |
prec = pmax(0,rnorm(n = nrow(dt), mean = 12, sd = 3)))] | |
dt | |
# For comparison, a data.frame | |
df <- as.data.frame(dt) | |
``` | |
__A good reference comparing data.table and dplyr:__ | |
https://atrebas.github.io/post/2019-03-03-datatable-dplyr/ | |
__If you're wedded to dplyr grammar and don't want to learn data.table, try `dtplyr`:__ | |
https://dtplyr.tidyverse.org/ | |
### Why data.table? | |
When is data.table perhaps better than the tidyverse? I like and use both, so I don't hold with the idea that there is one that is better. Instead, there are specific reasons to use each. I use data.table when: | |
#### 1. I need complex transformations on large data sets | |
As the number of rows increases, data.table becomes increasingly faster that a data.frame or tibble. This can turn a several day process into a day or less, which is huge when you inevitably end up rerunning things. By converting a script to use `data.table`, `sp`, and refactoring functions, I decreased the processing time for one year of climate data from __four days__ to __five hours__. I had 90 years of data to process for a minimum of six different models... | |
For example, let's time a summary of gdd at a more coarse lat/lon grid: | |
```{r} | |
microbenchmark( | |
dt[, .(gdd = mean(gdd)), by = .(lat = round(lat, 2), lon = round(lon,2))], | |
times = 10, unit = "ms") | |
``` | |
```{r} | |
microbenchmark( | |
df %>% | |
#mutate(lat = round(lat, 2), lon = round(lon, 2)) %>% | |
group_by(lat = round(lat, 2), lon = round(lon, 2)) %>% | |
summarize(gdd = mean(gdd)), | |
times = 10, unit = "ms") | |
``` | |
#### 2. tidyverse grammar gets too long / complex | |
People will also say this is a negative, because other people can't read your code as easily. I'm not sure I agree. In my experience as a researcher, we don't collaboratively write code often. Your most common reader is going to be yourself in 3-6 months when you have to revisit something. So documenting and writing clear code is important, but data.table is clear, it's just a different language than the tidyverse. It is wonderfully succinct at times. For example, in dplyr you might write: | |
```{r eval = FALSE} | |
# dplyr approach | |
df %>% | |
mutate(lat = round(lat), lon = round(lon)) %>% | |
group_by(lat, lon) %>% | |
summarize(gdd = mean(gdd)) | |
# data.table | |
dt[, .(gdd = mean(gdd)), by = .(lat = round(lat), lon = round(lon))] | |
``` | |
That doesn't seem like much for one transformation, but if the number of transformations is high either because you have multiple data files or multiple variables that all need a different transformation, the difference in code length is substantial. It's much easier to look at 20 lines of data.table transformations than it is to look at 50 lines of dplyr transformations to accomplish the same thing. | |
### Using data.table | |
The grammar of data.table is `DT[i,j, other]`, where `i` is a logical selector on rows, `j` is operations on columns, and `other` is additional arguments for grouping, which columns to perform operations on, etc... | |
#### I Operations (select rows) | |
```{r} | |
# Select using the "i" argument | |
dt[lat < 25] # equivalent to df %>% dplyr::filter(lat < 25) | |
dt[lat < 25 & lon > -67] | |
``` | |
#### J Operations (operate on columns) | |
```{r} | |
# Perform an operation on specific columns | |
dt[, .( | |
lat = mean(lat), | |
lon = mean(lon), | |
gdd = mean(gdd))] | |
# Alternatively, for all columns or just specific ones: | |
dt[, lapply(.SD, mean, na.rm = TRUE)] # equivalent to dplyr::transmute_all() | |
dt[, lapply(.SD, mean, na.rm = TRUE), .SDcols = c("gdd", "prec")] | |
# A more complicated function | |
# center GDD by removing the mean | |
dt[, .(lat, lon, d_gdd = gdd - mean(gdd))] | |
# Perform operations on the same data set | |
dt[, gd_gdd := gdd - mean(gdd)] # equivalent to dplyr::mutate() | |
# For multiple variables at once: | |
dt[, `:=`(gd_gdd = gdd - mean(gdd), | |
gd_prec = prec - mean(prec))] | |
dt | |
# Or equivalently | |
dt[, c("gd_gdd", "gd_prec") := .(gdd - mean(gdd), prec - mean(prec))] | |
# Group transformations | |
dt[, `:=`(lat0 = round(lat), lon0 = round(lon))] | |
dt[, `:=`(gd_gdd = gdd - mean(gdd), | |
gd_prec = prec - mean(prec)), | |
by = .(lat0, lon0)] | |
dt | |
# Removing variables | |
dt[, `:=`(gd_gdd = NULL, gd_prec = NULL)] # dplyr::select(-gd_gdd, -gd_prec) | |
``` | |
#### Other Operations (keys, indices, merges, and renaming) | |
```{r} | |
# Create some 2-digit lat/lon groups | |
dt[, `:=`(lat2 = round(lat,2), lon2 = round(lon,2))] | |
# No keys is okay, but operations are slower | |
key(dt) # No key set | |
microbenchmark( | |
dt[, .(gdd = mean(gdd)), by = .(lat2, lon2)], | |
times = 10, unit = "ms") | |
# Set keys that you are grouping by is faster | |
setkey(dt, lat2, lon2) | |
setkeyv(dt, c("lat2", "lon2")) #Equivalent - useful for quoted vars in functions | |
key(dt) # Now with lat2, lon2 | |
microbenchmark( | |
dt[, .(gdd = mean(gdd)), by = .(lat2, lon2)], | |
times = 10, unit = "ms") | |
``` | |
#### In Functions (or for loops) | |
```{r} | |
#' Center variables of a data.table. | |
#' @param d a data.table. | |
#' @param vars a list of quoted column names to center. | |
#' @param byvar a list of quoted column names by which to center. | |
#' @param na.rm exclude NA? | |
#' @return a data.table with centered variables. | |
center_dt <- function(d, vars, byvars, na.rm = TRUE) { | |
setkeyv(d, byvars) | |
d[, lapply(.SD, function(x) { x - mean(x, na.rm = na.rm) }), | |
by = byvars, .SDcols = vars] | |
} | |
dt[, `:=`(lat0 = round(lat), lon0 = round(lon))] | |
center_dt(dt, vars = c("gdd", "prec"), byvars = c("lat0", "lon0")) | |
# Alternatively, specify just the transformation as a function | |
center <- function(x) { x - mean(x) } | |
dt[, lapply(.SD, function(x){ center(x) }), | |
by = c("lat0", "lon0"), .SDcols = c("gdd", "prec")] | |
``` | |
#### Reading and writing data | |
`data.table` provides `fread` and `fwrite` to quickly read and write data. | |
`fread()` also takes a URL if you want to directly read data from http. | |
```{r message=FALSE, warning=FALSE} | |
microbenchmark(fwrite(dt, file = "dt.csv"), times = 1, unit = "ms") | |
microbenchmark(write_csv(dt, path = "dt.csv"), times = 1, unit = "ms") | |
microbenchmark(saveRDS(dt, file = "dt.rds"), times = 1, unit = "ms") | |
microbenchmark(fread("dt.csv"), times = 1, unit = "ms") | |
microbenchmark(read_csv("dt.csv"), times = 1, unit = "ms") | |
microbenchmark(readRDS("dt.rds"), times = 1, unit = "ms") | |
``` | |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment