potterzot · April 15, 2020 18:55
diff --git a/r_datatable.Rmd b/r_datatable.Rmd
 ---
 title: An Introduction to data.table using geographic data
 author: Nicholas A Potter
 output: html_document
 ---

 ```{r, echo = FALSE}

 ```

 This is a short introduction to using data.table based on my work with climate data.

 ## Initial setup

 Let's begin by generating some data to work with. We'll use the following packages:

 ```{r, message=FALSE, warning=FALSE}

 # If you don't have the packages below, install with:
 #install.packages(c("data.table", "readr", "dplyr", "microbenchmark"))

 library(data.table) 
 library(dplyr) # tidyverse data manipulation
 library(readr) # tidyverse reading in data
 library(microbenchmark) # for timing code to see which is faster
 ```

 We will work with a grid of latitude and longitude points, and we'll generate some fake temperature and precip data:

 ```{r}
 # grid resolution
 res <- 0.0416666666666667

 # Grid extents
 lats <- seq(  24.358333333333333,  52.909999999999997, by = res)
 lons <- seq(-124.933333300000000, -66.016666633333330, by = res)

 # Define a data.table
 dt <- data.table(expand.grid(lat = lats, lon = lons))

 # default printing of data.tables is screen friendly:
 dt

 # Assume:
 # GDD is a function of latitude (Areas closer to the poles are less warm)
 # Precip is random

 # Equivalent to dplyr::mutate(gdd = ..., prec = ...)
 dt[, `:=`(
  gdd = 3000 - 70*(lat-24) + rnorm(n = nrow(dt), sd = 100),
  prec = pmax(0,rnorm(n = nrow(dt), mean = 12, sd = 3)))]
 dt

 # For comparison, a data.frame
 df <- as.data.frame(dt)
 ```

 __A good reference comparing data.table and dplyr:__ 

 https://atrebas.github.io/post/2019-03-03-datatable-dplyr/

 __If you're wedded to dplyr grammar and don't want to learn data.table, try `dtplyr`:__

 https://dtplyr.tidyverse.org/


 ### Why data.table?

 When is data.table perhaps better than the tidyverse? I like and use both, so I don't hold with the idea that there is one that is better. Instead, there are specific reasons to use each. I use data.table when:

 #### 1. I need complex transformations on large data sets

 As the number of rows increases, data.table becomes increasingly faster that a data.frame or tibble. This can turn a several day process into a day or less, which is huge when you inevitably end up rerunning things. By converting a script to use `data.table`, `sp`, and refactoring functions, I decreased the processing time for one year of climate data from __four days__ to __five hours__. I had 90 years of data to process for a minimum of six different models...

 For example, let's time a summary of gdd at a more coarse lat/lon grid:

 ```{r}
 microbenchmark(
  dt[, .(gdd = mean(gdd)), by = .(lat = round(lat, 2), lon = round(lon,2))],
  times = 10, unit = "ms")
 ```

 ```{r}
 microbenchmark(
  df %>%
    #mutate(lat = round(lat, 2), lon = round(lon, 2)) %>%
    group_by(lat = round(lat, 2), lon = round(lon, 2)) %>%
    summarize(gdd = mean(gdd)),
  times = 10, unit = "ms")
 ```


 #### 2. tidyverse grammar gets too long / complex

 People will also say this is a negative, because other people can't read your code as easily. I'm not sure I agree. In my experience as a researcher, we don't collaboratively write code often. Your most common reader is going to be yourself in 3-6 months when you have to revisit something. So documenting and writing clear code is important, but data.table is clear, it's just a different language than the tidyverse. It is wonderfully succinct at times. For example, in dplyr you might write:

 ```{r eval = FALSE}
 # dplyr approach
 df %>%
  mutate(lat = round(lat), lon = round(lon)) %>%
  group_by(lat, lon) %>%
  summarize(gdd = mean(gdd))

 # data.table
 dt[, .(gdd = mean(gdd)), by = .(lat = round(lat), lon = round(lon))]
 ```

 That doesn't seem like much for one transformation, but if the number of transformations is high either because you have multiple data files or multiple variables that all need a different transformation, the difference in code length is substantial. It's much easier to look at 20 lines of data.table transformations than it is to look at 50 lines of dplyr transformations to accomplish the same thing.

 ### Using data.table

 The grammar of data.table is `DT[i,j, other]`, where `i` is a logical selector on rows, `j` is operations on columns, and `other` is additional arguments for grouping, which columns to perform operations on, etc...

 #### I Operations (select rows)

 ```{r}
 # Select using the "i" argument
 dt[lat < 25] # equivalent to df %>% dplyr::filter(lat < 25)
 dt[lat < 25 & lon > -67]
 ```

 #### J Operations (operate on columns)

 ```{r}
 # Perform an operation on specific columns
 dt[, .(
  lat = mean(lat),
  lon = mean(lon),
  gdd = mean(gdd))]

 # Alternatively, for all columns or just specific ones:
 dt[, lapply(.SD, mean, na.rm = TRUE)] # equivalent to dplyr::transmute_all()
 dt[, lapply(.SD, mean, na.rm = TRUE), .SDcols = c("gdd", "prec")]

 # A more complicated function
 # center GDD by removing the mean
 dt[, .(lat, lon, d_gdd = gdd - mean(gdd))]

 # Perform operations on the same data set
 dt[, gd_gdd := gdd - mean(gdd)] # equivalent to dplyr::mutate()

 # For multiple variables at once:
 dt[, `:=`(gd_gdd = gdd - mean(gdd),
          gd_prec = prec - mean(prec))]
 dt

 # Or equivalently
 dt[, c("gd_gdd", "gd_prec") := .(gdd - mean(gdd), prec - mean(prec))]

 # Group transformations
 dt[, `:=`(lat0 = round(lat), lon0 = round(lon))]
 dt[, `:=`(gd_gdd = gdd - mean(gdd),
          gd_prec = prec - mean(prec)),
   by = .(lat0, lon0)]
 dt

 # Removing variables
 dt[, `:=`(gd_gdd = NULL, gd_prec = NULL)] # dplyr::select(-gd_gdd, -gd_prec)
 ```

 #### Other Operations (keys, indices, merges, and renaming)
 ```{r}

 # Create some 2-digit lat/lon groups
 dt[, `:=`(lat2 = round(lat,2), lon2 = round(lon,2))]

 # No keys is okay, but operations are slower
 key(dt) # No key set
 microbenchmark(
  dt[, .(gdd = mean(gdd)), by = .(lat2, lon2)],
  times = 10, unit = "ms")

 # Set keys that you are grouping by is faster
 setkey(dt, lat2, lon2)
 setkeyv(dt, c("lat2", "lon2")) #Equivalent - useful for quoted vars in functions
 key(dt) # Now with lat2, lon2
 microbenchmark(
  dt[, .(gdd = mean(gdd)), by = .(lat2, lon2)],
  times = 10, unit = "ms")


 ```

 #### In Functions (or for loops)
 ```{r}

 #' Center variables of a data.table.
 #' @param d a data.table.
 #' @param vars a list of quoted column names to center.
 #' @param byvar a list of quoted column names by which to center.
 #' @param na.rm exclude NA?
 #' @return a data.table with centered variables.
 center_dt <- function(d, vars, byvars, na.rm = TRUE) {
  setkeyv(d, byvars)
  d[, lapply(.SD, function(x) { x - mean(x, na.rm = na.rm) }),
    by = byvars, .SDcols = vars]
 }
 dt[, `:=`(lat0 = round(lat), lon0 = round(lon))]
 center_dt(dt, vars = c("gdd", "prec"), byvars = c("lat0", "lon0"))

 # Alternatively, specify just the transformation as a function
 center <- function(x) { x - mean(x) }
 dt[, lapply(.SD, function(x){ center(x) }), 
   by = c("lat0", "lon0"), .SDcols = c("gdd", "prec")]
 ```

 #### Reading and writing data

 `data.table` provides `fread` and `fwrite` to quickly read and write data.

 `fread()` also takes a URL if you want to directly read data from http.

 ```{r message=FALSE, warning=FALSE}

 microbenchmark(fwrite(dt, file = "dt.csv"), times = 1, unit = "ms")
 microbenchmark(write_csv(dt, path = "dt.csv"), times = 1, unit = "ms")
 microbenchmark(saveRDS(dt, file = "dt.rds"), times = 1, unit = "ms")


 microbenchmark(fread("dt.csv"), times = 1, unit = "ms")
 microbenchmark(read_csv("dt.csv"), times = 1, unit = "ms")
 microbenchmark(readRDS("dt.rds"), times = 1, unit = "ms")
 ```
	---
	title: An Introduction to data.table using geographic data
	author: Nicholas A Potter
	output: html_document
	---

	```{r, echo = FALSE}

	```

	This is a short introduction to using data.table based on my work with climate data.

	## Initial setup

	Let's begin by generating some data to work with. We'll use the following packages:

	```{r, message=FALSE, warning=FALSE}

	# If you don't have the packages below, install with:
	#install.packages(c("data.table", "readr", "dplyr", "microbenchmark"))

	library(data.table)
	library(dplyr) # tidyverse data manipulation
	library(readr) # tidyverse reading in data
	library(microbenchmark) # for timing code to see which is faster
	```

	We will work with a grid of latitude and longitude points, and we'll generate some fake temperature and precip data:

	```{r}
	# grid resolution
	res <- 0.0416666666666667

	# Grid extents
	lats <- seq( 24.358333333333333, 52.909999999999997, by = res)
	lons <- seq(-124.933333300000000, -66.016666633333330, by = res)

	# Define a data.table
	dt <- data.table(expand.grid(lat = lats, lon = lons))

	# default printing of data.tables is screen friendly:
	dt

	# Assume:
	# GDD is a function of latitude (Areas closer to the poles are less warm)
	# Precip is random

	# Equivalent to dplyr::mutate(gdd = ..., prec = ...)
	dt[, `:=`(
	gdd = 3000 - 70*(lat-24) + rnorm(n = nrow(dt), sd = 100),
	prec = pmax(0,rnorm(n = nrow(dt), mean = 12, sd = 3)))]
	dt

	# For comparison, a data.frame
	df <- as.data.frame(dt)
	```

	__A good reference comparing data.table and dplyr:__

	https://atrebas.github.io/post/2019-03-03-datatable-dplyr/

	__If you're wedded to dplyr grammar and don't want to learn data.table, try `dtplyr`:__

	https://dtplyr.tidyverse.org/


	### Why data.table?

	When is data.table perhaps better than the tidyverse? I like and use both, so I don't hold with the idea that there is one that is better. Instead, there are specific reasons to use each. I use data.table when:

	#### 1. I need complex transformations on large data sets

	As the number of rows increases, data.table becomes increasingly faster that a data.frame or tibble. This can turn a several day process into a day or less, which is huge when you inevitably end up rerunning things. By converting a script to use `data.table`, `sp`, and refactoring functions, I decreased the processing time for one year of climate data from __four days__ to __five hours__. I had 90 years of data to process for a minimum of six different models...

	For example, let's time a summary of gdd at a more coarse lat/lon grid:

	```{r}
	microbenchmark(
	dt[, .(gdd = mean(gdd)), by = .(lat = round(lat, 2), lon = round(lon,2))],
	times = 10, unit = "ms")
	```

	```{r}
	microbenchmark(
	df %>%
	#mutate(lat = round(lat, 2), lon = round(lon, 2)) %>%
	group_by(lat = round(lat, 2), lon = round(lon, 2)) %>%
	summarize(gdd = mean(gdd)),
	times = 10, unit = "ms")
	```


	#### 2. tidyverse grammar gets too long / complex

	People will also say this is a negative, because other people can't read your code as easily. I'm not sure I agree. In my experience as a researcher, we don't collaboratively write code often. Your most common reader is going to be yourself in 3-6 months when you have to revisit something. So documenting and writing clear code is important, but data.table is clear, it's just a different language than the tidyverse. It is wonderfully succinct at times. For example, in dplyr you might write:

	```{r eval = FALSE}
	# dplyr approach
	df %>%
	mutate(lat = round(lat), lon = round(lon)) %>%
	group_by(lat, lon) %>%
	summarize(gdd = mean(gdd))

	# data.table
	dt[, .(gdd = mean(gdd)), by = .(lat = round(lat), lon = round(lon))]
	```

	That doesn't seem like much for one transformation, but if the number of transformations is high either because you have multiple data files or multiple variables that all need a different transformation, the difference in code length is substantial. It's much easier to look at 20 lines of data.table transformations than it is to look at 50 lines of dplyr transformations to accomplish the same thing.

	### Using data.table

	The grammar of data.table is `DT[i,j, other]`, where `i` is a logical selector on rows, `j` is operations on columns, and `other` is additional arguments for grouping, which columns to perform operations on, etc...

	#### I Operations (select rows)

	```{r}
	# Select using the "i" argument
	dt[lat < 25] # equivalent to df %>% dplyr::filter(lat < 25)
	dt[lat < 25 & lon > -67]
	```

	#### J Operations (operate on columns)

	```{r}
	# Perform an operation on specific columns
	dt[, .(
	lat = mean(lat),
	lon = mean(lon),
	gdd = mean(gdd))]

	# Alternatively, for all columns or just specific ones:
	dt[, lapply(.SD, mean, na.rm = TRUE)] # equivalent to dplyr::transmute_all()
	dt[, lapply(.SD, mean, na.rm = TRUE), .SDcols = c("gdd", "prec")]

	# A more complicated function
	# center GDD by removing the mean
	dt[, .(lat, lon, d_gdd = gdd - mean(gdd))]

	# Perform operations on the same data set
	dt[, gd_gdd := gdd - mean(gdd)] # equivalent to dplyr::mutate()

	# For multiple variables at once:
	dt[, `:=`(gd_gdd = gdd - mean(gdd),
	gd_prec = prec - mean(prec))]
	dt

	# Or equivalently
	dt[, c("gd_gdd", "gd_prec") := .(gdd - mean(gdd), prec - mean(prec))]

	# Group transformations
	dt[, `:=`(lat0 = round(lat), lon0 = round(lon))]
	dt[, `:=`(gd_gdd = gdd - mean(gdd),
	gd_prec = prec - mean(prec)),
	by = .(lat0, lon0)]
	dt

	# Removing variables
	dt[, `:=`(gd_gdd = NULL, gd_prec = NULL)] # dplyr::select(-gd_gdd, -gd_prec)
	```

	#### Other Operations (keys, indices, merges, and renaming)
	```{r}

	# Create some 2-digit lat/lon groups
	dt[, `:=`(lat2 = round(lat,2), lon2 = round(lon,2))]

	# No keys is okay, but operations are slower
	key(dt) # No key set
	microbenchmark(
	dt[, .(gdd = mean(gdd)), by = .(lat2, lon2)],
	times = 10, unit = "ms")

	# Set keys that you are grouping by is faster
	setkey(dt, lat2, lon2)
	setkeyv(dt, c("lat2", "lon2")) #Equivalent - useful for quoted vars in functions
	key(dt) # Now with lat2, lon2
	microbenchmark(
	dt[, .(gdd = mean(gdd)), by = .(lat2, lon2)],
	times = 10, unit = "ms")


	```

	#### In Functions (or for loops)
	```{r}

	#' Center variables of a data.table.
	#' @param d a data.table.
	#' @param vars a list of quoted column names to center.
	#' @param byvar a list of quoted column names by which to center.
	#' @param na.rm exclude NA?
	#' @return a data.table with centered variables.
	center_dt <- function(d, vars, byvars, na.rm = TRUE) {
	setkeyv(d, byvars)
	d[, lapply(.SD, function(x) { x - mean(x, na.rm = na.rm) }),
	by = byvars, .SDcols = vars]
	}
	dt[, `:=`(lat0 = round(lat), lon0 = round(lon))]
	center_dt(dt, vars = c("gdd", "prec"), byvars = c("lat0", "lon0"))

	# Alternatively, specify just the transformation as a function
	center <- function(x) { x - mean(x) }
	dt[, lapply(.SD, function(x){ center(x) }),
	by = c("lat0", "lon0"), .SDcols = c("gdd", "prec")]
	```

	#### Reading and writing data

	`data.table` provides `fread` and `fwrite` to quickly read and write data.

	`fread()` also takes a URL if you want to directly read data from http.

	```{r message=FALSE, warning=FALSE}

	microbenchmark(fwrite(dt, file = "dt.csv"), times = 1, unit = "ms")
	microbenchmark(write_csv(dt, path = "dt.csv"), times = 1, unit = "ms")
	microbenchmark(saveRDS(dt, file = "dt.rds"), times = 1, unit = "ms")


	microbenchmark(fread("dt.csv"), times = 1, unit = "ms")
	microbenchmark(read_csv("dt.csv"), times = 1, unit = "ms")
	microbenchmark(readRDS("dt.rds"), times = 1, unit = "ms")
	```