---
title: "Getting started"
date: "`r Sys.Date()`"
vignette: >
  %\VignetteIndexEntry{Getting started}
  %\VignetteEngine{quarto::html}
  %\VignetteEncoding{UTF-8}
execute:
  eval: !expr identical(tolower(Sys.getenv("NOT_CRAN")), "true")
knitr:
  opts_chunk:
    collapse: true
    comment: "#>"
    out.width: "100%"
---

```{r}
#| include: false
# Limit threads to avoid a CRAN NOTE
Sys.setenv(OMP_THREAD_LIMIT = 2)
```

# Getting started

**{duckspatial}** provides fast, memory-efficient functions for analysing and manipulating large spatial vector datasets in R. It bridges [DuckDB's spatial extension](https://duckdb.org/docs/stable/core_extensions/spatial/functions) with R's spatial ecosystem — in particular **{sf}** — so you can leverage DuckDB's analytical power without leaving your familiar R workflow.

Let's start by loading the packages we need:

```{r}
#| label: setup
#| message: false
#| warning: false
library(duckdb)
library(duckspatial)
library(dplyr)
library(sf)
```

# Installation

Install the stable release from CRAN:

``` r
# install.packages("pak")
pak::pak("duckspatial")
```

Install the latest GitHub version (more features, fewer accumulated bugs):

``` r
pak::pak("Cidree/duckspatial")
```

Install the development version (may be unstable):

``` r
pak::pak("Cidree/duckspatial@dev")
```

# Reading data

{duckspatial} is built around the `duckspatial_df` S3 class: a lazy-table-like object that holds a geometry column alongside its geospatial properties, but keeps the data **outside R's memory** until you explicitly ask for it.

If you have a local file (a GeoPackage, a Shapefile, a GeoJSON, etc.) you can open it lazily with `ddbs_open_dataset()`:

```{r}
countries_ddbs <- ddbs_open_dataset(
  system.file(
    "spatial/countries.geojson",
    package = "duckspatial"
  )
)

print(countries_ddbs)
```

> **Note:** The first call to `ddbs_open_dataset()` may take a few seconds. Internally, {duckspatial} creates a default DuckDB connection, then installs and loads the Spatial extension into it.
> Subsequent calls reuse the same connection and are much faster. We cover this connection in more detail in the [Working in a database] section below.

Printing a `duckspatial_df` object displays the most important metadata at a glance:

| Field | Description |
|-------------------------|------------------------------------|
| **CRS** | Coordinate reference system (AUTHORITY:CODE) |
| **Geometry column** | Name of the column holding geometries |
| **Geometry type** | Type(s) present (e.g. `POLYGON`, `MULTIPOLYGON`, `POINT`) |
| **Bounding box** | Four coordinates bounding all geometries |
| **Source** | Name of the temporary view inside DuckDB |
| **Database** | DuckDB database path and version |
| **Data** | First rows of the dataset |

The table is not in R's memory; it lives inside the DuckDB connection. Every {duckspatial} operation you apply runs there, using the DuckDB engine.

Alternatively, if you already have an `sf` object in memory, you can convert it to a `duckspatial_df` with `as_duckspatial_df()`:

```{r}
## read with sf as usual
countries_sf <- read_sf(
  system.file(
    "spatial/countries.geojson",
    package = "duckspatial"
  )
)

## push into DuckDB
countries_ddbs <- as_duckspatial_df(countries_sf)

class(countries_ddbs)
```

# Processing data

Let's run a typical spatial workflow: dissolving all country polygons into a single `MULTIPOLYGON` with internal boundaries removed, using `ddbs_union()` (the {duckspatial} equivalent of `sf::st_union()`).

`ddbs_union()` requires all geometries to be valid. We can check this first with `ddbs_is_valid()` (equivalent to `sf::st_is_valid()`), which appends a logical `is_valid` column to the lazy table so the subsequent `filter()` also runs inside DuckDB:

```{r}
countries_ddbs |>
  ddbs_is_valid() |>
  filter(!is_valid)
```

Antarctica has invalid geometries (likely self-intersections).
We can repair them with `ddbs_make_valid()` before computing the union, and because `duckspatial_df` objects are lazy, we can chain both steps in a single pipe:

```{r}
world_ddbs <- countries_ddbs |>
  ddbs_make_valid() |>
  ddbs_union()

print(world_ddbs)
```

The result is still a lazy `duckspatial_df`. To visualise it we need to pull the data into R as an `sf` object. Any of the following three calls does that:

``` r
# Option A
world_sf <- ddbs_collect(world_ddbs)

# Option B
world_sf <- collect(world_ddbs)

# Option C
world_sf <- st_as_sf(world_ddbs)
```

```{r}
world_sf <- world_ddbs |>
  ddbs_collect()

print(world_sf)
```

```{r}
plot(world_sf)
```

# Working in a database

So far we have been using the default, temporary DuckDB connection that {duckspatial} manages for us. For some use cases you may want to manage the connection yourself, most commonly when you need a **persistent database** that survives the R session.

There are two connection modes:

- **Non-persistent (in-memory):** data exists only for the duration of the R session, or until the connection is closed. As of v1.0.0, this mode is kept for backward compatibility, but working with `duckspatial_df` objects directly achieves the same goals with less boilerplate.
- **Persistent:** data is written to a `.duckdb` or `.db` file on disk and survives after the session ends.

## Creating a connection

{duckspatial} provides a convenience wrapper, `ddbs_create_conn()`, that creates a DuckDB connection, installs the Spatial extension, and loads it, all in one call:

```{r}
conn <- ddbs_create_conn()
```

You can also limit the resources DuckDB is allowed to use:

```{r}
conn <- ddbs_create_conn(
  threads         = 2,
  memory_limit_gb = 8
)
```

Under the hood, `ddbs_create_conn()` is equivalent to:

``` r
conn <- dbConnect(duckdb())
ddbs_install(conn)
ddbs_load(conn)
```

## Non-persistent database

Once you have a connection, write spatial data into it with `ddbs_write_table()`.
It accepts both `sf` and `duckspatial_df` objects:

```{r}
ddbs_write_table(conn, countries_sf, name = "countries")
```

Verify the table is there:

```{r}
ddbs_list_tables(conn)
```

From here the workflow mirrors the `duckspatial_df` workflow. Functions that accept a `duckspatial_df` also accept a table name + connection pair:

```{r}
ddbs_is_valid("countries", conn = conn) |>
  filter(!is_valid)
```

You can write intermediate results as named tables in the database by passing a `name` argument:

```{r}
ddbs_make_valid("countries", conn = conn, name = "countries_valid")
ddbs_union("countries_valid", conn = conn, name = "world")
```

`ddbs_read_table()` materialises a table directly as `sf` (not lazily), so the result can be passed straight to `plot()`:

```{r}
ddbs_read_table(conn, "world") |>
  plot()
```

When you are done, close the connection. Because this is an in-memory database, all tables written to it will be discarded:

```{r}
ddbs_stop_conn(conn)
```

## Persistent database

The workflow is identical to the non-persistent case. The only difference is the connection string: pass a file path to `ddbs_create_conn()`:

```{r}
#| eval: false
conn <- ddbs_create_conn("my_database.duckdb")
```

A practical pattern is to do all processing with the `duckspatial_df` workflow (which is lazily evaluated inside the default connection), and only write the final results to the persistent database:

```{r}
#| eval: false
## open persistent connection
conn <- ddbs_create_conn("my_database.duckdb")

## do all processing with duckspatial_df objects
world_ddbs <- ddbs_open_dataset(
  system.file("spatial/countries.geojson", package = "duckspatial")
) |>
  ddbs_make_valid() |>
  ddbs_union()

## write only the final result to the persistent database
ddbs_write_table(conn, world_ddbs, name = "world")

## close — "my_database.duckdb" will persist on disk
ddbs_stop_conn(conn)
```
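Because the database file persists on disk, anything written to it can be reopened later. A minimal sketch of picking the work back up in a fresh R session, using only the functions shown above (not evaluated here, and assuming `my_database.duckdb` was created as in the previous chunk):

```{r}
#| eval: false
## later, in a new R session: reopen the same file
conn <- ddbs_create_conn("my_database.duckdb")

## the "world" table written earlier is still there
ddbs_list_tables(conn)

## materialise it as an sf object and plot it
world_sf <- ddbs_read_table(conn, "world")
plot(world_sf)

ddbs_stop_conn(conn)
```

This is the payoff of the pattern above: the expensive processing runs once, and subsequent sessions only pay the cost of reading the final table.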