--- title: "Remote databases" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Remote databases} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = FALSE ) ``` DuckDB is the default backend in TestGenerator, and it is usually the fastest place to start. Remote database testing is useful when you also want to check that your study code works on another SQL engine. This is especially helpful for code that will run in PostgreSQL, SQL Server, or Databricks/Spark in production. The idea is the same as with local testing: you create a small, controlled test population, load it into a CDM, run your study code, and assert the expected result. The difference is that `patientsCDM()` uploads the test CDM to a remote database and returns a CDM reference connected to that backend. ## Use `patientsCDM()` with a remote database First create a Unit Test Definition JSON file with `readPatients()` or `readPatients.csv()`. Then call `patientsCDM()` with the backend you want to test. ```{r postgresql-example} library(TestGenerator) cdm <- patientsCDM( pathJson = "tests/testthat/testCases", testName = "my_test_population", cdmVersion = "5.4", dbms = "postgresql" ) ``` The supported remote values for `dbms` are: | Backend | `dbms` value | | --- | --- | | PostgreSQL | `"postgresql"` | | SQL Server | `"sqlserver"` | | Databricks/Spark | `"databricks"` | For remote databases, `patientsCDM()` creates the CDM locally first, trims the vocabulary to what the test population needs, uploads the result to the remote database, and returns the remote CDM reference. By default, TestGenerator creates a temporary test schema for the upload. You can also provide `writeSchema` if you want to choose the schema name yourself. ```{r write-schema} cdm <- patientsCDM( pathJson = "tests/testthat/testCases", testName = "my_test_population", cdmVersion = "5.4", dbms = "sqlserver", writeSchema = "testgenerator_my_case" ) ``` When the test has finished, clean up the remote schema and close the connection: ```{r cleanup-remote} cleanupTestCdm(cdm) ``` This is important for shared remote databases. The test schemas are small, but cleaning them up keeps the database tidy and avoids name conflicts in later runs. ## Environment variables for direct use When you call `patientsCDM()` directly against a remote database, TestGenerator reads the connection details from environment variables. | Backend | Required environment variables | | --- | --- | | PostgreSQL | `POSTGRESQL_SERVER`, `POSTGRESQL_DBNAME`, `POSTGRESQL_PORT`, `POSTGRESQL_USER`, `POSTGRESQL_PASSWORD` | | SQL Server | `SQLSERVER_SERVER`, `SQLSERVER_DBNAME`, `SQLSERVER_PORT`, `SQLSERVER_USER`, `SQLSERVER_PASSWORD` | | Databricks/Spark | `DATABRICKS_HOST`, `DATABRICKS_TOKEN`, `DATABRICKS_HTTPPATH` | For Databricks, TestGenerator also reads `DATABRICKS_USER` and `DATABRICKS_WORKSPACE` when they are set. If they are not set, it uses `token` as the user and `hive_metastore` as the workspace/catalog. For SQL Server, TestGenerator reads `SQL_SERVER_DRIVER` when it is set. If it is not set, it uses `ODBC Driver 18 for SQL Server`. A typical local `.Renviron` setup could look like this: ```{bash envvars, eval=FALSE} POSTGRESQL_SERVER=localhost POSTGRESQL_DBNAME=cdm POSTGRESQL_PORT=5432 POSTGRESQL_USER=postgres POSTGRESQL_PASSWORD=your-password SQLSERVER_SERVER=localhost SQLSERVER_DBNAME=cdm SQLSERVER_PORT=1433 SQLSERVER_USER=sa SQLSERVER_PASSWORD=your-password DATABRICKS_HOST=https://your-workspace.cloud.databricks.com DATABRICKS_TOKEN=your-token DATABRICKS_HTTPPATH=/sql/1.0/warehouses/your-warehouse ``` Use values that match your own database or Databricks workspace. The database user needs permission to create schemas, create tables, insert data, read data, and drop the test schema during cleanup. ## A minimal backend test pattern A backend-specific test usually follows this shape: ```{r test-pattern} testthat::test_that("study logic works on PostgreSQL", { cdm <- NULL on.exit({ if (!is.null(cdm)) { TestGenerator::cleanupTestCdm(cdm) } }, add = TRUE) cdm <- TestGenerator::patientsCDM( pathJson = "tests/testthat/testCases", testName = "my_test_population", cdmVersion = "5.4", dbms = "postgresql" ) result <- myPackage::runMyStudy(cdm) testthat::expect_equal(result$n_subjects, 3) }) ``` The exact expectations should be specific to your micro population. Good tests usually check counts, dates, cohort entry and exit, exclusions, or any other result that should be predictable from the small input dataset.