name: 1
class: center middle main-title section-title-4

# Workflows for combining raster and vector data

.class-info[

**Session 15**

.light[HES597: Introduction to Spatial Data in R<br>
Boise State University Human-Environment Systems<br>
Fall 2021]

]

---
# Outline for today

- Revisiting raster operations
- Generating zonal statistics
- Generalizing extractions
- Scaling to larger problems

---
name: ops
class: center middle main-title section-title-4

# Revisiting raster operations

---
# Revisiting raster operations

<img src="08-slides_files/figure-html/unnamed-chunk-1-1.png" width="504" style="display: block; margin: auto;" />

---
# Raster Math

.pull-left[
* Simple calculations

```r
development.difficulty <- exp(mos) + elev.res/4
plot(log(development.difficulty))
```

<img src="08-slides_files/figure-html/rstmth-1.png" width="504" style="display: block; margin: auto;" />
]

.pull-right[
* Custom functions

```r
normalize <- function(rster) {
  (rster - min(rster, na.rm=TRUE)) /
    (max(rster, na.rm=TRUE) - min(rster, na.rm=TRUE))
}

elev.nrm <- app(elev.res, normalize)
plot(elev.nrm)
```

<img src="08-slides_files/figure-html/rstfun-1.png" width="504" style="display: block; margin: auto;" />
]

---
# Raster math with polygons

* Need to convert the field to a raster first
* `terra::rasterize` and `fasterize::fasterize` do this for `SpatVector` and `sf` objects

```r
med.inc.rstr <- terra::rasterize(x = as(id.income, "SpatVector"), y = mos, field="estimate")
```

---
# Raster math with polygons

.pull-left[
<img src="08-slides_files/figure-html/vplot-1.png" width="504" style="display: block; margin: auto;" />
]

.pull-right[
<img src="08-slides_files/figure-html/rplot-1.png" width="504" style="display: block; margin: auto;" />
]

---
# Raster math with polygons

```r
income.ratio <- med.inc.rstr/mos
plot(income.ratio, main="Ratio of median income to housing cost")
```

<img src="08-slides_files/figure-html/vmath-1.png" width="504" style="display: block; margin: auto;" />

---
name: zonal
class:
center middle main-title section-title-4

# Zonal statistics

---
# Zonal statistics

* Summarize values from one raster within the zones of a categorical raster
* Need the zones to be rasterized

```r
cty.rster <- terra::rasterize(counties.p, y=mos, field="GEOID")
zones <- terra::zonal(mos, cty.rster, "mean", na.rm=TRUE)
head(zones)
```

```
##   GEOID    idval
## 1 16001 9.849115
## 2 16003 8.290065
## 3 16005 9.360389
## 4 16007 9.474390
## 5 16009 8.151090
## 6 16011 9.211108
```

---
name: extract
class: center middle main-title section-title-4

# Extractions

---
# Extractions

* Zonal stats are fast, but limited
* `terra::extract` is more generalizable
* Converts the vector to a raster in the background

```r
extraction <- terra::extract(mos, counties.p, mean, na.rm=TRUE)
head(extraction)
```

```
##   ID    idval
## 1  1 9.083365
## 2  2 9.194087
## 3  3 8.346153
## 4  4 9.536322
## 5  5 9.870428
## 6  6 8.643443
```

---
name: tiles
class: center middle main-title section-title-4

# Scaling to larger extents

---
## Revisiting the `apply` family

.pull-left[
- A flexible group of functions that replace `for` or `while` loops
- Translates loops into C++ code, often providing a speed-up
- Which member of the 'family' you need depends on the input data and the output desired
- Can be tricky to get the desired behavior (*algorithmic efficiency* vs.
*programmer efficiency*)
]

.pull-right[
- `apply` requires three arguments: an `array`, the `margin`, and the `function` you want to execute
- `arrays` are R data objects that contain 1, 2, or more dimensions
- `margins` identify which parts of the array to apply the `function` over (1 = rows, 2 = columns, 1:2 = all cells in a matrix)
]

---
## A reminder

.pull-left[

```r
#create data
test.matrix <- matrix(rnorm(50), 10, 5)
dim(test.matrix)
```

```
## [1] 10 5
```

```r
# generate column means
col.mean <- apply(X=test.matrix, MARGIN = 2, mean)
str(col.mean)
```

```
##  num [1:5] 0.119 -0.444 -0.282 -0.514 0.1
```
]

.pull-right[

```r
#generate row sums
row.sum <- apply(X=test.matrix, MARGIN = 1, sum)
str(row.sum)
```

```
##  num [1:10] -0.963 -3.781 -2.891 -3.784 -0.961 ...
```

```r
#exponentiate each cell of the data
exp.all <- apply(X=test.matrix, MARGIN = 1:2, exp)
str(exp.all)
```

```
##  num [1:10, 1:5] 1.961 0.902 1.288 1.296 3.426 ...
```
]

---
## The `map` function from `purrr` and the `tidyverse`

- Acts like `apply` but can be more flexible and interpretable
- Has a similar 'family' of functions designed for different situations
- Here we are downloading files from Google Drive

```r
folder_url <- "https://drive.google.com/open?id=1xSsKYpB2o9SEIo6pZDFWYqVthb162_Zo"
folder <- drive_get(as_id(folder_url))
tif_files <- drive_ls(folder)
lapply(tif_files$id, function(x) drive_download(as_id(x), overwrite = TRUE))
map(tif_files$id, ~drive_download(as_id(.x), overwrite = TRUE))
```

- the key is that we are still sending the first argument (`tif_files$id`) to a function!
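As a toy illustration (separate from the download workflow, using made-up object names), the same calling pattern in base `lapply` and `purrr::map`:

```r
library(purrr)

# both send each element of 1:3 to the same anonymous function;
# ~ .x^2 is purrr's formula shorthand for function(x) x^2
squares_lapply <- lapply(1:3, function(x) x^2)
squares_map <- map(1:3, ~ .x^2)
identical(squares_lapply, squares_map) # TRUE -- both return lists
```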
---
## Parallelism and spatial computing

- `lapply` and `map` work serially
- Parallelization is useful when pieces of a problem are independent
- "Embarrassingly parallel" refers to problems that can be broken down into small chunks and run simultaneously using your computer's different processors
- `mclapply` and `future_map` allow this in a general way; `raster` and `terra` allow some implicit parallelism

---
## What is efficiency?

- Algorithmic efficiency
- Programmer efficiency
- Goal: _Fast enough_ computationally, _fast as possible_ for the programmer

---
## Benchmarking vs. Profiling

- **Benchmarking**: tests the performance of specific operations repeatedly
- **Profiling**: evaluates many lines of code to find "bottlenecks"

---
## Using `mclapply` to extract data

- Relies on "forking"
- Can be slower than standard processing if the dataset being copied to the child process is large

```r
counties.ext <- mclapply(seq_along(counties.p), function(x){
  cty.ext = counties.p[x]
  terra::extract(mos, cty.ext)
}, mc.cores = 4)

counties.sf <- as(counties.p, "Spatial") %>% as(., "sf")

counties.df <- map(seq_along(counties.p), function(x){
  data.frame(GEOID = st_drop_geometry(counties.sf[x, "GEOID"]),
             mn = mean(counties.ext[[x]][[2]], na.rm=TRUE),
             sd = sd(counties.ext[[x]][[2]], na.rm = TRUE),
             min = min(counties.ext[[x]][[2]], na.rm = TRUE),
             max = max(counties.ext[[x]][[2]], na.rm = TRUE))
}) %>%
  do.call(rbind, .)
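# An added toy check (not part of the extraction workflow above): mclapply
# returns results in input order, and with mc.cores = 1 it falls back to
# serial evaluation -- handy on Windows, where forking is unavailable
toy.chk <- mclapply(1:4, function(i) i^2, mc.cores = 1)
stopifnot(unlist(toy.chk) == c(1, 4, 9, 16))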
```

---
## Evaluating `mclapply` to extract data

.pull-left[

```r
s.mclapply <- system.time(mclapply(seq_along(counties.p), function(x){
  cty.ext = counties.p[x]
  terra::extract(mos, cty.ext)
}, mc.cores = 4))
s.mclapply
```

```
##    user  system elapsed
##   2.923   0.272   1.332
```
]

.pull-right[

```r
s.lapply <- system.time(lapply(seq_along(counties.p), function(x){
  cty.ext = counties.p[x]
  terra::extract(mos, cty.ext)
}))
s.lapply
```

```
##    user  system elapsed
##   3.231   0.128   3.375
```
]

---
## Translating to the `map_` family

- Relies on the `furrr` package

```r
library(furrr)
mos.rast <- raster::raster(mos)
future::plan(multisession)
counties.ext <- future_map(seq_along(counties.p), function(x){
  library(sf) #each operation in a new session
  cty.ext = counties.sf[x,] %>% as(., "Spatial")
  raster::extract(mos.rast, cty.ext)
})
```

---
## Timing `future_`

.pull-left[

```r
s.future <- system.time(future_map(seq_along(counties.p), function(x){
  library(sf) #each operation in a new session
  cty.ext = counties.sf[x,] %>% as(., "Spatial")
  raster::extract(mos.rast, cty.ext)
}))
s.future
```

```
##    user  system elapsed
##   0.609   0.132  26.947
```
]

.pull-right[

```r
s.map <- system.time(map(seq_along(counties.p), function(x){
  library(sf) #each operation in a new session
  cty.ext = counties.sf[x,] %>% as(., "Spatial")
  raster::extract(mos.rast, cty.ext)
}))
s.map
```

```
##    user  system elapsed
##  52.649   4.966  58.787
```
]

---
## Benchmarking with `microbenchmark`

- `system.time` provides a single sample
- `microbenchmark` generates multiple (default is 100) samples
- More robust to the vagaries of background computer noise

---
## Profiling

- Designed to locate bottlenecks in large chunks of code
- Pinpoints which parts of your script or function take the longest to run
- **Important:** _working code first, profile code second_

---
## Profile example

```r
rasterize.prof <- profvis(expr = {
  #load packages
  library(terra)
  library(tidycensus)
  library(tidyverse)
  #download data and convert to Spatial
  id.income <- tidycensus::get_acs(geography = "tract",
                                   variables = c(medincome = "B19013_001"),
                                   state = "ID",
                                   year = 2018,
                                   key = Sys.getenv("CENSUS_API_KEY"), #use your own Census API key
                                   geometry = TRUE) %>%
    st_transform(., crs(counties.p)) %>%
    as(., "SpatVector")
  #process into raster
  income.tct.rst <- terra::rasterize(x=id.income, y=elev.res, field = "estimate")
}, interval = 0.01, prof_output = 'fstr-prof')
```

---
## Profile Example

.pull-left[
<figure>
<img src="img/08/prof1.png" alt="profvis flame graph" title="Profile results" width="100%">
<figcaption>Profile results</figcaption>
</figure>
]

.pull-right[
<figure>
<img src="img/08/prof2.png" alt="profvis data view" title="Profile results" width="100%">
<figcaption>Profile results</figcaption>
</figure>
]

---
## Fixing bottlenecks

1. Look for existing solutions (`fasterize`)
2. Do less work
3. Parallelize (next week)
4. Avoid copies

_Remember programmer efficiency!_

---
## Additional directions

- Socket clusters (`makeCluster`, `parLapply`) are useful when data needs to be passed back and forth among jobs (see [R Programming for Data Science](https://bookdown.org/rdpeng/rprogdatascience/parallel-computation.html))
- Lots of variations on `map_` (see the [purrr documentation](https://purrr.tidyverse.org/reference/map.html)) and `apply`
- The `foreach` package adds additional functionality (see [this blog](https://www.r-bloggers.com/how-to-go-parallel-in-r-basics-tips/) for more info on parallel processing)
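---
## A socket-cluster sketch

A minimal sketch of the `makeCluster`/`parLapply` pattern mentioned above, with a toy function standing in for a raster extraction (`parallel` ships with base R; object names are illustrative):

```r
library(parallel)

cl <- makeCluster(2) # two socket-connected worker processes
# unlike forking, workers start empty: data and packages must be sent
# over explicitly (clusterExport / clusterEvalQ) before they can be used
res <- parLapply(cl, 1:4, function(i) i^2)
stopCluster(cl) # always release the workers
unlist(res)
```

```
## [1]  1  4  9 16
```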