name: 1
class: center middle main-title section-title-4

# Workflows for combining raster and vector data

.class-info[

**Session 15**

.light[HES597: Introduction to Spatial Data in R<br>
Boise State University Human-Environment Systems<br>
Fall 2021]

]

---
# Outline for today

- Revisiting raster operations
- Generating zonal statistics
- Generalizing extractions
- Scaling to larger problems

---
name: ops
class: center middle main-title section-title-4

# Revisiting raster operations

---
# Revisiting raster operations

<img src="08-slides_files/figure-html/unnamed-chunk-1-1.png" width="504" style="display: block; margin: auto;" />

---
# Raster Math

.pull-left[
* Simple calculations

```r
development.difficulty <- exp(mos) + elev.res/4
plot(log(development.difficulty))
```

<img src="08-slides_files/figure-html/rstmth-1.png" width="504" style="display: block; margin: auto;" />
]

.pull-right[
* Custom functions

```r
normalize <- function(rster) {
  (rster - min(rster, na.rm=TRUE)) /
    (max(rster, na.rm=TRUE) - min(rster, na.rm=TRUE))
}

elev.nrm <- app(elev.res, normalize)
plot(elev.nrm)
```

<img src="08-slides_files/figure-html/rstfun-1.png" width="504" style="display: block; margin: auto;" />
]

---
# Raster math with polygons

* Need to convert the field to a raster first
* `terra::rasterize` and `fasterize::fasterize` do this for `SpatVector` and `sf` objects

```r
med.inc.rstr <- terra::rasterize(x = as(id.income, "SpatVector"), y = mos, field="estimate")
```

---
# Raster math with polygons

.pull-left[
<img src="08-slides_files/figure-html/vplot-1.png" width="504" style="display: block; margin: auto;" />
]

.pull-right[
<img src="08-slides_files/figure-html/rplot-1.png" width="504" style="display: block; margin: auto;" />
]

---
# Raster math with polygons

```r
income.ratio <- med.inc.rstr/mos
plot(income.ratio, main="Ratio of median income to housing cost")
```

<img src="08-slides_files/figure-html/vmath-1.png" width="504" style="display: block; margin: auto;" />

---
name: zonal
class:
center middle main-title section-title-4

# Zonal statistics

---
# Zonal statistics

* Summarize values from one raster within the zones of a categorical raster
* Need the zones to be rasterized

```r
cty.rster <- terra::rasterize(counties.p, y=mos, field="GEOID")
zones <- terra::zonal(mos, cty.rster, "mean", na.rm=TRUE)
head(zones)
```

```
##   GEOID    idval
## 1 16001 9.849115
## 2 16003 8.290065
## 3 16005 9.360389
## 4 16007 9.474390
## 5 16009 8.151090
## 6 16011 9.211108
```

---
name: extract
class: center middle main-title section-title-4

# Extractions

---
# Extractions

* Zonal stats are fast, but limited
* `terra::extract` is more generalizable
* Converts the vector to a raster in the background

```r
extraction <- terra::extract(mos, counties.p, mean, na.rm=TRUE)
head(extraction)
```

```
##   ID    idval
## 1  1 9.083365
## 2  2 9.194087
## 3  3 8.346153
## 4  4 9.536322
## 5  5 9.870428
## 6  6 8.643443
```

---
name: tiles
class: center middle main-title section-title-4

# Scaling to larger extents

---
## Revisiting the `apply` family

.pull-left[
- A flexible group of functions that replace `for` or `while` loops
- Translates loops into C++ code, often providing a speed-up
- Which member of the 'family' you need depends on the input data and the output desired
- Can be tricky to get the desired behavior (*algorithmic efficiency* vs.
*programmer efficiency*)
]

.pull-right[
- `apply` requires three arguments: an `array`, the `margin`, and the `function` you want to execute
- `arrays` are R data objects that contain 1, 2, or more dimensions
- `margins` identify which parts of the array to apply the `function` over (1 = rows, 2 = columns, 1:2 = all cells in a matrix)
]

---
## A reminder

.pull-left[

```r
#create data
test.matrix <- matrix(rnorm(50), 10, 5)
dim(test.matrix)
```

```
## [1] 10 5
```

```r
# generate column means
col.mean <- apply(X=test.matrix, MARGIN = 2, mean)
str(col.mean)
```

```
##  num [1:5] 0.119 -0.444 -0.282 -0.514 0.1
```
]

.pull-right[

```r
#generate row sums
row.sum <- apply(X=test.matrix, MARGIN = 1, sum)
str(row.sum)
```

```
##  num [1:10] -0.963 -3.781 -2.891 -3.784 -0.961 ...
```

```r
#exponentiate each cell of the data
exp.all <- apply(X=test.matrix, MARGIN = 1:2, exp)
str(exp.all)
```

```
##  num [1:10, 1:5] 1.961 0.902 1.288 1.296 3.426 ...
```
]

---
## The `map` function from `purrr` and the `tidyverse`

- Acts like `apply` but can be more flexible and interpretable
- Has a similar 'family' of functions designed for different situations
- Here we are downloading files from Google Drive

```r
folder_url <- "https://drive.google.com/open?id=1xSsKYpB2o9SEIo6pZDFWYqVthb162_Zo"
folder <- drive_get(as_id(folder_url))
tif_files <- drive_ls(folder)
lapply(tif_files$id, function(x) drive_download(as_id(x), overwrite = TRUE))
map(tif_files$id, ~drive_download(as_id(.x), overwrite = TRUE))
```

- the key is that we are still sending the first argument (`tif_files$id`) to a function!
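As a toy illustration (separate from the download workflow, using made-up object names), the same calling pattern in base `lapply` and `purrr::map`:

```r
library(purrr)

# both send each element of 1:3 to the same anonymous function;
# ~ .x^2 is purrr's formula shorthand for function(x) x^2
squares_lapply <- lapply(1:3, function(x) x^2)
squares_map <- map(1:3, ~ .x^2)
identical(squares_lapply, squares_map) # TRUE -- both return lists
```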
---
## Parallelism and spatial computing

- `lapply` and `map` work serially
- Parallelization is useful when pieces of a problem are independent
- "Embarrassingly parallel" refers to problems that can be broken down into small chunks and run simultaneously using your computer's different processors
- `mclapply` and `future_map` allow this in a general way; `raster` and `terra` allow some implicit parallelism

---
## What is efficiency?

- Algorithmic efficiency
- Programmer efficiency
- Goal: _Fast enough_ computationally, _fast as possible_ for the programmer

---
## Benchmarking vs. Profiling

- **Benchmarking**: tests the performance of specific operations repeatedly
- **Profiling**: evaluates many lines of code to find "bottlenecks"

---
## Using `mclapply` to extract data

- Relies on "forking"
- Can be slower than standard processing if the dataset being copied to the child process is large

```r
counties.ext <- mclapply(seq_along(counties.p), function(x){
  cty.ext = counties.p[x]
  terra::extract(mos, cty.ext)
}, mc.cores = 4)

counties.sf <- as(counties.p, "Spatial") %>% as(., "sf")

counties.df <- map(seq_along(counties.p), function(x){
  data.frame(GEOID = st_drop_geometry(counties.sf[x, "GEOID"]),
             mn = mean(counties.ext[[x]][[2]], na.rm=TRUE),
             sd = sd(counties.ext[[x]][[2]], na.rm = TRUE),
             min = min(counties.ext[[x]][[2]], na.rm = TRUE),
             max = max(counties.ext[[x]][[2]], na.rm = TRUE))
}) %>%
  do.call(rbind, .)
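# An added toy check (not part of the extraction workflow above): mclapply
# returns results in input order, and with mc.cores = 1 it falls back to
# serial evaluation -- handy on Windows, where forking is unavailable
toy.chk <- mclapply(1:4, function(i) i^2, mc.cores = 1)
stopifnot(unlist(toy.chk) == c(1, 4, 9, 16))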
```

---
## Evaluating `mclapply` to extract data

.pull-left[

```r
s.mclapply <- system.time(mclapply(seq_along(counties.p), function(x){
  cty.ext = counties.p[x]
  terra::extract(mos, cty.ext)
}, mc.cores = 4))
s.mclapply
```

```
##    user  system elapsed
##   2.923   0.272   1.332
```
]

.pull-right[

```r
s.lapply <- system.time(lapply(seq_along(counties.p), function(x){
  cty.ext = counties.p[x]
  terra::extract(mos, cty.ext)
}))
s.lapply
```

```
##    user  system elapsed
##   3.231   0.128   3.375
```
]

---
## Translating to the `map_` family

- Relies on the `furrr` package

```r
library(furrr)
mos.rast <- raster::raster(mos)
future::plan(multisession)
counties.ext <- future_map(seq_along(counties.p), function(x){
  library(sf) #each operation in a new session
  cty.ext = counties.sf[x,] %>% as(., "Spatial")
  raster::extract(mos.rast, cty.ext)
})
```

---
## Timing `future_`

.pull-left[

```r
s.future <- system.time(future_map(seq_along(counties.p), function(x){
  library(sf) #each operation in a new session
  cty.ext = counties.sf[x,] %>% as(., "Spatial")
  raster::extract(mos.rast, cty.ext)
}))
s.future
```

```
##    user  system elapsed
##   0.609   0.132  26.947
```
]

.pull-right[

```r
s.map <- system.time(map(seq_along(counties.p), function(x){
  library(sf) #each operation in a new session
  cty.ext = counties.sf[x,] %>% as(., "Spatial")
  raster::extract(mos.rast, cty.ext)
}))
s.map
```

```
##    user  system elapsed
##  52.649   4.966  58.787
```
]

---
## Benchmarking with `microbenchmark`

- `system.time` provides a single sample
- `microbenchmark` generates multiple (default is 100) samples
- More robust to the vagaries of background computer noise

---
## Profiling

- Designed to locate bottlenecks in large chunks of code
- Pinpoints which parts of your script or function take the longest to run
- **Important:** _working code first, profile code second_

---
## Profile example

```r
rasterize.prof <- profvis(expr = {
  #load packages
  library(terra)
  library(tidycensus)
  library(tidyverse)
  #download data and convert to Spatial
  id.income <- tidycensus::get_acs(geography = "tract",
                                   variables = c(medincome = "B19013_001"),
                                   state = "ID",
                                   year = 2018,
                                   key = Sys.getenv("CENSUS_API_KEY"), #use your own Census API key
                                   geometry = TRUE) %>%
    st_transform(., crs(counties.p)) %>%
    as(., "SpatVector")
  #process into raster
  income.tct.rst <- terra::rasterize(x=id.income, y=elev.res, field = "estimate")
}, interval = 0.01, prof_output = 'fstr-prof')
```

---
## Profile Example

.pull-left[
<figure>
<img src="img/08/prof1.png" alt="profvis flame graph" title="Profile results" width="100%">
<figcaption>Profile results</figcaption>
</figure>
]

.pull-right[
<figure>
<img src="img/08/prof2.png" alt="profvis data view" title="Profile results" width="100%">
<figcaption>Profile results</figcaption>
</figure>
]

---
## Fixing bottlenecks

1. Look for existing solutions (`fasterize`)
2. Do less work
3. Parallelize (next week)
4. Avoid copies

_Remember programmer efficiency!_

---
## Additional directions

- Socket clusters (`makeCluster`, `parLapply`) are useful when data needs to be passed back and forth among jobs (see [R Programming for Data Science](https://bookdown.org/rdpeng/rprogdatascience/parallel-computation.html))
- Lots of variations on `map_` (see the [purrr documentation](https://purrr.tidyverse.org/reference/map.html)) and `apply`
- The `foreach` package adds additional functionality (see [this blog](https://www.r-bloggers.com/how-to-go-parallel-in-r-basics-tips/) for more info on parallel processing)
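---
## A socket-cluster sketch

A minimal sketch of the `makeCluster`/`parLapply` pattern mentioned above, with a toy function standing in for a raster extraction (`parallel` ships with base R; object names are illustrative):

```r
library(parallel)

cl <- makeCluster(2) # two socket-connected worker processes
# unlike forking, workers start empty: data and packages must be sent
# over explicitly (clusterExport / clusterEvalQ) before they can be used
res <- parLapply(cl, 1:4, function(i) i^2)
stopCluster(cl) # always release the workers
unlist(res)
```

```
## [1]  1  4  9 16
```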