---
title: An Introduction to data.table using geographic data
author: Nicholas A Potter
output: html_document
---
This is a short introduction to using data.table based on my work with climate data.
## Initial setup
Let's begin by generating some data to work with. We'll use the following packages:
```{r, message=FALSE, warning=FALSE}
# If you don't have the packages below, install with:
# install.packages(c("data.table", "readr", "dplyr", "microbenchmark"))
library(data.table)
library(dplyr) # tidyverse data manipulation
library(readr) # tidyverse reading in data
library(microbenchmark) # for timing code to see which is faster
```
We will work with a grid of latitude and longitude points, and we'll generate some fake growing degree day (GDD) and precipitation data:
```{r}
# grid resolution
res <- 0.0416666666666667
# Grid extents
lats <- seq( 24.358333333333333, 52.909999999999997, by = res)
lons <- seq(-124.933333300000000, -66.016666633333330, by = res)
# Define a data.table
dt <- data.table(expand.grid(lat = lats, lon = lons))
# default printing of data.tables is screen friendly:
dt
# Assume:
# - GDD (growing degree days) is a function of latitude (areas closer to the poles are cooler)
# - Precipitation is random
# Equivalent to dplyr::mutate(gdd = ..., prec = ...)
dt[, `:=`(
  gdd = 3000 - 70*(lat - 24) + rnorm(n = nrow(dt), sd = 100),
  prec = pmax(0, rnorm(n = nrow(dt), mean = 12, sd = 3)))]
dt
# For comparison, a data.frame
df <- as.data.frame(dt)
```
__A good reference comparing data.table and dplyr:__
https://atrebas.github.io/post/2019-03-03-datatable-dplyr/
__If you're wedded to dplyr grammar and don't want to learn data.table, try `dtplyr`:__
https://dtplyr.tidyverse.org/
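As a quick sketch of how that looks (assuming `dtplyr` is installed; it isn't loaded elsewhere in this document), you wrap the table in `lazy_dt()`, write ordinary dplyr verbs, and nothing is computed until you ask for a tibble:
```{r eval = FALSE}
library(dtplyr)

# dplyr verbs on a lazy_dt() are translated into data.table calls
lazy_dt(dt) %>%
  mutate(lat0 = round(lat), lon0 = round(lon)) %>%
  group_by(lat0, lon0) %>%
  summarize(gdd = mean(gdd)) %>%
  as_tibble() # translation and computation happen here
```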
### Why data.table?
When is data.table better than the tidyverse? I like and use both, so I don't hold that one is strictly better than the other. Instead, there are specific reasons to use each. I use data.table when:
#### 1. I need complex transformations on large data sets
As the number of rows increases, data.table becomes increasingly faster than a data.frame or tibble. This can turn a several-day process into a day or less, which is huge when you inevitably end up rerunning things. By converting a script to use `data.table` and `sp` and by refactoring functions, I decreased the processing time for one year of climate data from __four days__ to __five hours__. I had 90 years of data to process for a minimum of six different models...
For example, let's time a summary of gdd at a coarser lat/lon grid:
```{r}
microbenchmark(
  dt[, .(gdd = mean(gdd)), by = .(lat = round(lat, 2), lon = round(lon, 2))],
  times = 10, unit = "ms")
```
```{r}
microbenchmark(
  df %>%
    group_by(lat = round(lat, 2), lon = round(lon, 2)) %>%
    summarize(gdd = mean(gdd)),
  times = 10, unit = "ms")
```
#### 2. tidyverse grammar gets too long / complex
People will say this is a negative because others can't read your code as easily. I'm not sure I agree. In my experience as a researcher, we don't often write code collaboratively; your most common reader is yourself in 3-6 months, when you have to revisit something. So documenting and writing clear code is important, but data.table is clear, it's just a different language than the tidyverse, and it is wonderfully succinct at times. For example, in dplyr you might write:
```{r eval = FALSE}
# dplyr approach
df %>%
  mutate(lat = round(lat), lon = round(lon)) %>%
  group_by(lat, lon) %>%
  summarize(gdd = mean(gdd))
# data.table
dt[, .(gdd = mean(gdd)), by = .(lat = round(lat), lon = round(lon))]
```
That doesn't seem like much for one transformation, but if the number of transformations is high, either because you have multiple data files or multiple variables that each need a different transformation, the difference in code length is substantial. It's much easier to read 20 lines of data.table transformations than 50 lines of dplyr transformations that accomplish the same thing.
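One reason the data.table versions stay compact is chaining: appending another `[]` operates on the result of the previous one, so a filter, an aggregation, and a sort can sit on a single line. A sketch using the data generated above:
```{r eval = FALSE}
# Filter, then aggregate by 1-degree latitude, then order the result
dt[prec > 10][, .(gdd = mean(gdd)), by = .(lat0 = round(lat))][order(lat0)]
```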
### Using data.table
The grammar of data.table is `DT[i, j, other]`, where `i` selects rows (a filter or join expression), `j` operates on columns, and `other` holds additional arguments such as `by` for grouping or `.SDcols` for choosing which columns `j` operates on.
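Here is a sketch with all three parts in a single call, using the data from above:
```{r eval = FALSE}
# i: keep rows with lat < 30; j: mean gdd; other: group by 1-degree longitude
dt[lat < 30, .(gdd = mean(gdd)), by = .(lon0 = round(lon))]
```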
#### I Operations (select rows)
```{r}
# Select using the "i" argument
dt[lat < 25] # equivalent to df %>% dplyr::filter(lat < 25)
dt[lat < 25 & lon > -67]
```
#### J Operations (operate on columns)
```{r}
# Perform an operation on specific columns
dt[, .(
  lat = mean(lat),
  lon = mean(lon),
  gdd = mean(gdd))]
# Alternatively, for all columns or just specific ones:
dt[, lapply(.SD, mean, na.rm = TRUE)] # equivalent to dplyr::summarize_all()
dt[, lapply(.SD, mean, na.rm = TRUE), .SDcols = c("gdd", "prec")]
# A more complicated function
# center GDD by removing the mean
dt[, .(lat, lon, d_gdd = gdd - mean(gdd))]
# Add columns by reference, modifying dt in place
dt[, gd_gdd := gdd - mean(gdd)] # equivalent to dplyr::mutate()
# For multiple variables at once:
dt[, `:=`(gd_gdd = gdd - mean(gdd),
          gd_prec = prec - mean(prec))]
dt
# Or equivalently
dt[, c("gd_gdd", "gd_prec") := .(gdd - mean(gdd), prec - mean(prec))]
# Group transformations
dt[, `:=`(lat0 = round(lat), lon0 = round(lon))]
dt[, `:=`(gd_gdd = gdd - mean(gdd),
          gd_prec = prec - mean(prec)),
   by = .(lat0, lon0)]
dt
# Removing variables
dt[, `:=`(gd_gdd = NULL, gd_prec = NULL)] # dplyr::select(-gd_gdd, -gd_prec)
```
#### Other Operations (keys, indices, merges, and renaming)
```{r}
# Create some 2-digit lat/lon groups
dt[, `:=`(lat2 = round(lat,2), lon2 = round(lon,2))]
# No key set is okay, but grouped operations are slower
key(dt) # No key set
microbenchmark(
  dt[, .(gdd = mean(gdd)), by = .(lat2, lon2)],
  times = 10, unit = "ms")
# Setting keys on the columns you group by is faster
setkey(dt, lat2, lon2)
setkeyv(dt, c("lat2", "lon2")) # Equivalent; useful when key columns are given as strings, e.g. inside functions
key(dt) # Now with lat2, lon2
microbenchmark(
  dt[, .(gdd = mean(gdd)), by = .(lat2, lon2)],
  times = 10, unit = "ms")
```
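The heading also promises indices, merges, and renaming, so here is a brief sketch of each. Note that `dt2` is a hypothetical lookup table invented for illustration:
```{r eval = FALSE}
# A secondary index speeds up ad hoc filtering and joining without changing the key
setindex(dt, lon2)
indices(dt)

# A made-up table of elevations by 2-digit latitude
lat2u <- unique(round(lats, 2))
dt2 <- data.table(lat2 = lat2u, elev = runif(length(lat2u), 0, 3000))

# Join dt2 onto dt by the shared column
dt[dt2, on = "lat2"]
merge(dt, dt2, by = "lat2", all.x = TRUE) # similar result, dplyr::left_join() style

# Rename columns in place, without copying
setnames(dt, "lat2", "lat_2digit")
setnames(dt, "lat_2digit", "lat2") # and back
```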
#### In Functions (or for loops)
```{r}
#' Center variables of a data.table.
#' @param d a data.table.
#' @param vars a list of quoted column names to center.
#' @param byvars a list of quoted column names by which to center.
#' @param na.rm logical; remove NA values when computing means?
#' @return a data.table with centered variables.
center_dt <- function(d, vars, byvars, na.rm = TRUE) {
  setkeyv(d, byvars)
  d[, lapply(.SD, function(x) { x - mean(x, na.rm = na.rm) }),
    by = byvars, .SDcols = vars]
}
dt[, `:=`(lat0 = round(lat), lon0 = round(lon))]
center_dt(dt, vars = c("gdd", "prec"), byvars = c("lat0", "lon0"))
# Alternatively, specify just the transformation as a function
center <- function(x) { x - mean(x) }
dt[, lapply(.SD, center),
   by = c("lat0", "lon0"), .SDcols = c("gdd", "prec")]
```
#### Reading and writing data
`data.table` provides `fread()` and `fwrite()` to read and write delimited data quickly.
`fread()` also accepts a URL, so you can read data directly over HTTP.
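For example (the URL here is a placeholder, not a real data set):
```{r eval = FALSE}
# Hypothetical URL, for illustration only
remote <- fread("https://example.com/some_data.csv")
```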
```{r message=FALSE, warning=FALSE}
microbenchmark(fwrite(dt, file = "dt.csv"), times = 1, unit = "ms")
microbenchmark(write_csv(dt, path = "dt.csv"), times = 1, unit = "ms")
microbenchmark(saveRDS(dt, file = "dt.rds"), times = 1, unit = "ms")
microbenchmark(fread("dt.csv"), times = 1, unit = "ms")
microbenchmark(read_csv("dt.csv"), times = 1, unit = "ms")
microbenchmark(readRDS("dt.rds"), times = 1, unit = "ms")
```
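If you run the chunk above, you may want to remove the files it writes:
```{r eval = FALSE}
# Clean up files created by the timing examples
file.remove("dt.csv", "dt.rds")
```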