---
title: "Benchmark"
date: "`r Sys.Date()`"
code-annotations: hover
urlcolor: blue
vignette: >
  %\VignetteIndexEntry{Benchmark}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
execute:
  eval: false
knitr:
  opts_chunk:
    collapse: true
    comment: "#>"
    out.width: "100%"
---

```{r}
#| include: false

# Limit threads to avoid a CRAN NOTE
Sys.setenv(OMP_THREAD_LIMIT = 2)
```

```{r}
#| echo: false

## Internal packages
library(patchwork)

## Functions used in the vignette
ggplot_benchmark <- function(
  data, 
  log = FALSE,
  show.legend = TRUE,
  ...
  ) {

    ## Generate base plot
    ggplot(
      data = data, 
      aes(
        x     = factor(format(n, big.mark = ".")),
        y     = if (log) median else as.numeric(median),
        color = pkg,
        group = pkg
    )) +
      geom_line(
        linewidth = 0.7,
        linetype  = "dashed",
        alpha     = 0.6,
        show.legend = show.legend
      ) +
      geom_point(
        aes(size = as.numeric(mem_alloc)),
        alpha = 0.9,
        show.legend = show.legend
      ) +
      scale_size_binned(
        name   = "Memory allocated",
        labels = scales::label_bytes(),
        n.breaks = 4,
        range = c(1, 8)
      ) +
      scale_color_brewer(palette = "Set1") +
      labs(
        color    = "Package",
        x        = "Data size (rows)",
        y        = "Computation time",
        ...
      ) +
      theme_minimal(base_size = 13) +
      theme(
        # legend.position  = "top",
        panel.grid.minor = element_blank(),
        plot.title       = element_text(face = "bold")
      )
}

ggplot_assemble <- function(plot1, plot2, fun_name) {

  plot1 +
    plot2 +
    plot_annotation(
      title    = paste("Benchmark Comparison for", fun_name),
      subtitle = "Execution time vs. data size",
      theme    = theme(
        plot.title = element_text(face = "bold", hjust = .5),
        plot.subtitle = element_text(hjust = .5),
        text = element_text(size = rel(3.5))
      )
    )

}
```

This vignette benchmarks **{duckspatial}** against **{sf}** across several spatial operations, comparing computation time and memory usage as dataset size grows. We plan to extend it with additional operation types in future releases. Versions of packages used in this vignette:

* `duckspatial`: v1.0.0

* `duckdb`: v1.5.1

* `sf`: v1.1.0

## TL;DR

{duckspatial} is substantially faster and allocates far less memory than {sf} in almost all cases, with the advantage becoming more pronounced on larger datasets. The one exception is pairwise distance calculation, where {sf} retains a memory advantage for Euclidean distances.

## Prepare data

We are going to test the speed of a a bunch of functions using simulated data. We use the `make_points()` function to generate `n` random points within the globe, and 10,000 random rectangles.

```{r}
#| message: false
#| warning: false
#| code-fold: true
#| code-summary: "Set-up"

# Load necessary packages
library(duckspatial)
library(bench)
library(dplyr)
library(sf)
library(ggplot2)
options(scipen = 999)

# Function to generate random points
make_points <- function(n_points) {
    points_df <- data.frame(
      id = 1:n_points,
      x = runif(n_points, min = -180, max = 180),
      y = runif(n_points, min = -90, max = 90),
      value = rnorm(n_points, mean = 100, sd = 15),
      category = sample(c("A", "B", "C", "D"), n_points, replace = TRUE)
  ) |>
    sf::st_as_sf(coords = c("x", "y"), crs = 4326)
}

# Generate datasets of different sizes
withr::with_seed(27, {
  points_sf_100k <- make_points(1e5)
  points_sf_1mi  <- make_points(1e6)
  points_sf_3mi  <- make_points(3e6)
})

# Generate polygons
# Create large polygon dataset (e.g., administrative regions, zones, etc.)
n_polygons <- 10000
polygons_list <- vector("list", n_polygons)

for(i in 1:n_polygons) {
  # Random center point with buffer from edges
  center_x <- runif(1, min = -170, max = 170)
  center_y <- runif(1, min = -80, max = 80)
  
  # Create simple rectangular polygons to avoid geometry issues
  width <- runif(1, min = 0.5, max = 3)
  height <- runif(1, min = 0.5, max = 3)
  
  # Create rectangle coordinates (must be closed: first point = last point)
  x_coords <- c(
    center_x - width/2,
    center_x + width/2,
    center_x + width/2,
    center_x - width/2,
    center_x - width/2  # Close the polygon
  )
  
  y_coords <- c(
    center_y - height/2,
    center_y - height/2,
    center_y + height/2,
    center_y + height/2,
    center_y - height/2  # Close the polygon
  )
  
  # Create polygon matrix
  coords <- cbind(x_coords, y_coords)
  
  # Create polygon (wrapped in list as required by st_polygon)
  polygons_list[[i]] <- st_polygon(list(coords))
}

polygons_sf <- st_sf(
  poly_id    = 1:n_polygons,
  region     = sample(c("North", "South", "East", "West"), n_polygons, replace = TRUE),
  population = sample(1000:1000000, n_polygons, replace = TRUE),
  geometry   = st_sfc(polygons_list, crs = 4326)
)
```

# Spatial join

The spatial join aims to bring the attributes of a dataset to another dataset based in a spatial predicate. One example would be to have an `x` dataset with points that represent observations of wolves. In an `y` dataset, we could have polygons with attributes describing the geopraphical location (e.g. the name of the country, the name of the region..). So, by using a `ST_Join(x, y, "intersects")`, we would assign the attributes of `y` to `x` according to where the observation of the wolf fall.

This operation can be intensive. In the current version of {duckspatial} we see an improvement in speed and memory usage in bigger datasets, as shown in @fig-st-join.

-   Using 1 million points: {duckspatial} was about 50 times faster than {sf}, and allocated 18 times less memory.

-   Using 3 million points: {duckspatial} was about 35 times faster than {sf}, and allocated 16 times less memory.

```{r}
#| message: false
#| code-fold: true
#| code-summary: "Benchmark code - ddbs_join"

# Helper to run the benchmark
run_join_benchmark <- function(points_sf) {
  temp <- bench::mark(
    iterations  = 3,
    check       = FALSE,
    duckspatial = ddbs_join(points_sf, polygons_sf, join = "within"),
    sf          = st_join(points_sf, polygons_sf, join = st_within)
  )
  temp$n   <- nrow(points_sf)
  temp$pkg <- c("duckspatial", "sf")
  temp
}

# Run the benchmark
df_bench_join <- lapply(
  X   = list(points_sf_100k, points_sf_1mi, points_sf_3mi),
  FUN = run_join_benchmark
) |>
  dplyr::bind_rows()
```

```{r}
#| echo: false
#| warning: false

# Id to store the figures, saving the older ones
id_output <- "v1.5.1"

# Generate the plots
gg_join <- ggplot_benchmark(
  data     = df_bench_join,
  log      = FALSE,
  show.legend = T,
  subtitle = "A) Normal scale"
)

gg_join_log <- ggplot_benchmark(
  data     = df_bench_join,
  log      = TRUE,
  show.legend = F,
  subtitle = "B) Log scale"
)

# Assemble them in a single plot
ggplot_assemble(
  plot1 = gg_join,
  plot2 = gg_join_log,
  fun_name = "ddbs_join()"
)

# Export it
ggsave(
  filename = paste0("man/figures/bench/bench-st-join-", id_output, ".png"),
  height   = 15,
  width    = 30,
  units    = "cm"
)
```

![Benchmark that compares spatial join for duckspatial and sf](../man/figures/bench/bench-st-join-v1.5.1.png){#fig-st-join fig-align="center"}

# Spatial filter

The spatial filter aims to filter rows of `x` based in a spatial relationship with `y`. For example, let's imagine that we have an `x` dataset with observations of wolves all around the world, and we want to filter only those that are in a specific country, for instance in Spain. If the dataset has this attritbute, we can just do it with a simple `dplyr::filter()`, but, if the wolves datase doesn't include a column with the country, we can use a spatial filter. For that, we need a second dataset `y` with the boundaries of Spain, and by using `ST_Filter(x, y, "intersects")`, we would filter only those wolves observations that intersects with the Spain's polygon.

In the current version of {duckspatial} we see an improvement in speed and memory usage in bigger datasets, as shown in @fig-st-filter.

-   Using 1 million points: {duckspatial} was about 42 times faster than {sf}, and allocated 5 times less memory.

-   Using 3 million points: {duckspatial} was about 12 times faster than {sf}, and allocated 5 times less memory.

```{r}
#| message: false
#| code-fold: true
#| code-summary: "Benchmark code - ddbs_filter"

# Helper to run the benchmark
run_filter_benchmark <- function(points_sf) {
  temp <- bench::mark(
    iterations  = 3,
    check       = FALSE,
    duckspatial = ddbs_filter(points_sf, polygons_sf),
    sf          = st_filter(points_sf, polygons_sf)
  )
  temp$n   <- nrow(points_sf)
  temp$pkg <- c("duckspatial", "sf")
  temp
}

# Run the benchmark
df_bench_filter <- lapply(
  X   = list(points_sf_100k, points_sf_1mi, points_sf_3mi),
  FUN = run_filter_benchmark
) |>
  dplyr::bind_rows()
```

```{r}
#| echo: false
#| warning: false

# Generate the plots
gg_filter <- ggplot_benchmark(
  data     = df_bench_filter,
  log      = FALSE,
  show.legend = T,
  subtitle = "A) Normal scale"
)

gg_filter_log <- ggplot_benchmark(
  data     = df_bench_filter,
  log      = TRUE,
  show.legend = F,
  subtitle = "B) Log scale"
)

# Assemble them in a single plot
ggplot_assemble(
  plot1 = gg_filter,
  plot2 = gg_filter_log,
  fun_name = "ddbs_filter()"
)

# Export it
ggsave(
  filename = paste0("man/figures/bench/bench-st-filter-", id_output, ".png"),
  height   = 15,
  width    = 30,
  units    = "cm"
)
```

![Benchmark that compares spatial filter for duckspatial and sf](../man/figures/bench/bench-st-filter-v1.5.1.png){#fig-st-filter fig-align="center"}

# Spatial distances

The `ST_Distance(x, y)` calculates the distance between each observation in `x` against each observation in `y`. The default {duckspatial} mode will return a lazy table with three columns: (id_x) the id of the row in `x`; (id_y) the id of the row in `y`; (distance) the actual distance between those pair of observations. In the case of `mode = 'sf'`, the result will be a sparse matrix. Note that {duckspatial} will use by default the best distance for the input CRS and geometry type.

When calculating the distance between 10,000 pairs of points, {duckspatial} was about 1.5 times faster, but used much less memory (> 400 times).

```{r}
#| message: false
#| code-fold: true
#| code-summary: "Benchmark code - ddbs_distance"

# Helper to run the benchmark
run_distance_benchmark <- function(n) {

  points_sf <- withr::with_seed(27, make_points(n))

  temp <- bench::mark(
    iterations  = 1,
    check       = FALSE,
    duckspatial = ddbs_distance(points_sf, points_sf),
    sf          = st_distance(points_sf, points_sf)
  )
  temp$n   <- n
  temp$pkg <- c("duckspatial", "sf")
  temp
}

df_bench_distance <- lapply(
  X   = c(1000, 5000, 10000),
  FUN = run_distance_benchmark
) |>
  dplyr::bind_rows()
```


```{r}
#| echo: false
#| warning: false

# Generate the plots
gg_distance <- ggplot_benchmark(
  data     = df_bench_distance,
  log      = FALSE,
  show.legend = T,
  subtitle = "A) Normal scale"
)

gg_distance_log <- ggplot_benchmark(
  data     = df_bench_distance,
  log      = TRUE,
  show.legend = F,
  subtitle = "B) Log scale"
)

# Assemble them in a single plot
ggplot_assemble(
  plot1 = gg_distance,
  plot2 = gg_distance_log,
  fun_name = "ddbs_distance()"
)

# Export it
ggsave(
  filename = paste0("man/figures/bench/bench-st-distance-", id_output, ".png"),
  height   = 15,
  width    = 30,
  units    = "cm"
)
```

![Benchmark that compares spatial distance for duckspatial and sf](../man/figures/bench/bench-st-distance-v1.5.1.png){#fig-st-distance fig-align="center"}


# Dissolving geometries

Dissolving geometries consist in merging/aggregating geometries that share a common attribute into a single geometry per group.

In the current version of {duckspatial} we see an improvement in speed and memory usage in bigger datasets, as shown in @fig-st-dissolve.

-   Using 1 million points: {duckspatial} was about 14 times faster than {sf}, and allocated 10 times less memory.

-   Using 3 million points: {duckspatial} was about 12 times faster than {sf}, and allocated 10 times less memory.

```{r}
#| message: false
#| code-fold: true
#| code-summary: "Benchmark code - ddbs_union_agg"

# Helper to run the benchmark
run_union_benchmark <- function(points_sf) {
  temp <- bench::mark(
    iterations  = 3,
    check       = FALSE,
    duckspatial = ddbs_union_agg(points_sf, by = "category"),
    sf          = points_sf |> 
      group_by(category) |> 
      summarise(geometry = st_union(geometry))
  )
  temp$n   <- nrow(points_sf)
  temp$pkg <- c("duckspatial", "sf")
  temp
}

# Run the benchmark
df_bench_union <- lapply(
  X   = list(points_sf_100k, points_sf_1mi, points_sf_3mi),
  FUN = run_union_benchmark
) |>
  dplyr::bind_rows()
```

```{r}
#| echo: false
#| warning: false

# Generate the plots
gg_union_agg <- ggplot_benchmark(
  data     = df_bench_union,
  log      = FALSE,
  show.legend = T,
  subtitle = "A) Normal scale"
)

gg_union_agg_log <- ggplot_benchmark(
  data     = df_bench_union,
  log      = TRUE,
  show.legend = F,
  subtitle = "B) Log scale"
)

# Assemble them in a single plot
ggplot_assemble(
  plot1 = gg_union_agg,
  plot2 = gg_union_agg_log,
  fun_name = "ddbs_union_agg()"
)

# Export it
ggsave(
  filename = paste0("man/figures/bench/bench-st-dissolve-", id_output, ".png"),
  height   = 15,
  width    = 30,
  units    = "cm"
)
```

![Benchmark that compares dissolving geometries by group for duckspatial and sf](../man/figures/bench/bench-st-dissolve-v1.5.1.png){#fig-st-dissolve fig-align="center"}


# Geometry predicates

Geometry predicates are boolean functions that test spatial relationships between two geometries (they return true or false). They're fundamental to spatial queries, letting you filter or join data based on how geometries relate to one another in space.
Common examples include: ST_Contains, ST_Within, ST_Touches, ST_Overlaps, ST_Crosses, ST_Disjoint, and ST_Equals.

In the following example, we compare `ST_Intersects(geometry A, geometry B)`, which returns true if the two geometries share any point in common (their interiors or boundaries overlap in any way).

-   Using 1 million points: {duckspatial} was about 28 times faster than {sf}, and allocated 3 times less memory.

-   Using 3 million points: {duckspatial} was about 23 times faster than {sf}, and allocated 3 times less memory.

```{r}
#| message: false
#| code-fold: true
#| code-summary: "Benchmark code - ddbs_intersects"

# Helper to run the benchmark
run_predicate_benchmark <- function(points_sf) {
  temp <- bench::mark(
    iterations  = 1,
    check       = FALSE,
    duckspatial = ddbs_intersects(points_sf, polygons_sf),
    sf          = st_intersects(points_sf, polygons_sf)
  )
  temp$n   <- nrow(points_sf)
  temp$pkg <- c("duckspatial", "sf")
  temp
}

# Run the benchmark
df_bench_predicate <- lapply(
  X   = list(points_sf_100k, points_sf_1mi, points_sf_3mi),
  FUN = run_predicate_benchmark
) |>
  dplyr::bind_rows()
```

```{r}
#| echo: false
#| warning: false

# Generate the plots
gg_predicate <- ggplot_benchmark(
  data     = df_bench_predicate,
  log      = FALSE,
  show.legend = T,
  subtitle = "A) Normal scale"
)

gg_predicate_log <- ggplot_benchmark(
  data     = df_bench_predicate,
  log      = TRUE,
  show.legend = F,
  subtitle = "B) Log scale"
)

# Assemble them in a single plot
ggplot_assemble(
  plot1 = gg_predicate,
  plot2 = gg_predicate_log,
  fun_name = "ddbs_intersects()"
)

# Export it
ggsave(
  filename = paste0("man/figures/bench/bench-st-intersects-", id_output, ".png"),
  height   = 15,
  width    = 30,
  units    = "cm"
)
```

![Benchmark that compares the geometry predicate ST_Intersects(x, y) for duckspatial and sf](../man/figures/bench/bench-st-intersects-v1.5.1.png){#fig-st-intersects fig-align="center"}