---
title: "Usage of the sdcHierarchies-Package"
author: "Bernhard Meindl"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 5
    number_sections: false
vignette: >
  %\VignetteIndexEntry{Usage of the sdcHierarchies-Package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo=FALSE, include=FALSE}
library(rmarkdown)
library(sdcHierarchies)
library(data.table)
```

# Introduction

The `sdcHierarchies` package allows users to create, modify and export nested hierarchies that are used, for example, to define tables in statistical disclosure control (SDC) software such as [**sdcTable**](https://cran.r-project.org/package=sdcTable).

# Usage

Before it can be used, the package needs to be loaded:

```{r, eval = FALSE}
library(sdcHierarchies)
```

## Create and modify a hierarchy from scratch

`hier_create()` creates a hierarchy. Argument `root` specifies the name of the root node. Optionally, nodes can be added to the top level by listing their names in argument `nodes`.
The structure of the resulting tree can be printed with `hier_display()`:

```{r}
h <- hier_create(root = "Total", nodes = LETTERS[1:5])
hier_display(h)
```

Once such an object is created, it can be modified by the following functions:

- `hier_add()`: adds nodes to the hierarchy
- `hier_delete()`: deletes nodes from the tree
- `hier_rename()`: renames nodes

These functions can be applied as shown below:

```{r}
# add nodes below the node specified in argument `root`
h <- hier_add(h, root = "A", nodes = c("a1", "a2"))
h <- hier_add(h, root = "B", nodes = c("b1", "b2"))
h <- hier_add(h, root = "b1", nodes = c("b1_a", "b1_b"))

# delete one or more nodes from the hierarchy
h <- hier_delete(h, nodes = c("a1", "b2"))
h <- hier_delete(h, nodes = c("a2"))

# rename nodes (old name = new name)
h <- hier_rename(h, nodes = c("C" = "X", "D" = "Y"))
hier_display(h)
```

Note that the underlying [**data.tree**](https://cran.r-project.org/package=data.tree) package allows objects to be modified by reference, so no explicit assignment is required.

## Information about nodes

Function `hier_info()` returns metadata for the nodes listed in the `nodes` argument. If this argument is omitted, the function returns information for all nodes in the hierarchy.

```{r}
# query information about specific nodes
info <- hier_info(h, nodes = c("b1", "E"))
```

`info` is a named list in which each element refers to one queried node. The results for node `b1` can be extracted as shown below:

```{r}
info$b1
```

## Convert to other formats

Function `hier_convert()` takes a hierarchy and converts the network-based structure to a different format, while `hier_export()` performs the conversion and writes the result to a file on disk.
The following formats are currently supported:

- `df`: a "@;label"-based format used in [**sdcTable**](https://cran.r-project.org/package=sdcTable)
- `dt`: the same as `df`, but the result is returned as a `data.table`
- `argus`: also a "@;label"-based format used to create `hrc`-files for [$\tau$-argus](https://github.com/sdcTools/tauargus/)
- `json`: a JSON-encoded string
- `code`: the R code required to rebuild the current hierarchy
- `sdc`: a `list` which is a suitable input for [**sdcTable**](https://cran.r-project.org/package=sdcTable)

```{r}
# conversion to a "@;label"-based format
res_df <- hier_convert(h, as = "df")
print(res_df)
```

The code required to create this hierarchy can be obtained using:

```{r}
code <- hier_convert(h, as = "code")
cat(code, sep = "\n")
```

Using `hier_export()`, one can write the results to a file. This is useful, for example, for creating `hrc`-files that can be used as input for [$\tau$-argus](https://github.com/sdcTools/tauargus/), which can be achieved as follows:

```{r, eval=FALSE}
hier_export(h, as = "argus", path = file.path(tempdir(), "hierarchy.hrc"))
```

## Create a hierarchy from different sources

`hier_import()` returns a network-based hierarchy given either a `data.frame` (in `@;label`-format), `json`, `code` or a $\tau$-argus compatible `hrc`-file. For example, if we want to create a hierarchy based on `res_df`:

```{r}
n_df <- hier_import(inp = res_df, from = "df")
hier_display(n_df)
```

Using `hier_import(inp = "hierarchy.hrc", from = "argus")`, one could create an sdc hierarchy object directly from an `hrc`-file.

## Create/Compute hierarchies from a string

Often, the nested hierarchy information is encoded in a string. Function `hier_compute()` transforms such strings into hierarchy objects. Two cases can be distinguished: in the first, all input codes have the same length, while in the second, the lengths of the codes differ.
Let's assume we have a geographic code given in `geo_m` where digits 1-2 refer to the first level, digit 3 to the second and digits 4-5 to the third level of the hierarchy.

```{r}
geo_m <- c(
  "01051", "01053", "01054", "01055", "01056", "01057", "01058", "01059",
  "01060", "01061", "01062", "02000", "03151", "03152", "03153", "03154",
  "03155", "03156", "03157", "03158", "03251", "03252", "03254", "03255",
  "03256", "03257", "03351", "03352", "03353", "03354", "03355", "03356",
  "03357", "03358", "03359", "03360", "03361", "03451", "03452", "03453",
  "03454", "03455", "03456", "10155")
```

The `method` argument of `hier_compute()` provides two ways to define how the levels are encoded in such character vectors (e.g., geographic or sector codes):

* **`endpos`**: Requires a numeric vector in `dim_spec` defining the **end position** (index) of each level within the string.
* **`len`**: Requires a numeric vector in `dim_spec` defining the **number of characters** (width) allocated to each level.

If the overall total is not explicitly encoded in the input strings, the `root` argument can be used to provide a name for the top-level node. Additionally, the `as` parameter specifies the output format. For example, setting `as = "df"` returns the result as a `data.frame` in the `@;label` format.
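The relationship between the two specifications can be made concrete with a few lines of base R. The sketch below does not call `sdcHierarchies` at all; the names `codes_demo`, `len_spec`, `endpos_spec` and `start_spec` are made up for illustration, and it only shows which characters each level covers, not how `hier_compute()` works internally.

```{r}
# Illustrative base-R sketch; all object names here are hypothetical
codes_demo <- c("01051", "02000", "10155")

len_spec    <- c(2, 1, 2)                     # width of each level ("len")
endpos_spec <- cumsum(len_spec)               # end position of each level ("endpos")
start_spec  <- endpos_spec - len_spec + 1     # first character of each level

# Cut every code into its level-specific segments
segments <- t(sapply(codes_demo, function(code) {
  substring(code, start_spec, endpos_spec)
}))
colnames(segments) <- paste0("level_", seq_along(len_spec))
segments
```

Each row holds the segments of one code, so a `len` specification and its cumulative sum used as an `endpos` specification address the same character ranges.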
As shown below, these two methods are interchangeable and yield identical hierarchies:

```{r}
# Using end positions (level 1 ends at index 2, level 2 at 3, level 3 at 5)
v1 <- hier_compute(
  inp = geo_m,
  dim_spec = c(2, 3, 5),
  root = "Tot",
  method = "endpos",
  as = "df"
)

# Using lengths (level 1 is 2 chars, level 2 is 1 char, level 3 is 2 chars)
v2 <- hier_compute(
  inp = geo_m,
  dim_spec = c(2, 1, 2),
  root = "Tot",
  method = "len",
  as = "df"
)

identical(v1, v2)
hier_display(v1)
```

If the total *is* already contained within the string (for example, in the first 3 positions), the hierarchy can be computed by including that segment in `dim_spec` and omitting the `root` argument:

```{r}
geo_m_with_tot <- paste0("Tot", geo_m)
head(geo_m_with_tot)

v3 <- hier_compute(
  inp = geo_m_with_tot,
  dim_spec = c(3, 2, 1, 2),
  method = "len"
)
hier_display(v3)
```

The result is identical to `v1` and `v2`.

`hier_compute()` also handles input strings of varying lengths:

```{r}
## Example with unequal string lengths; overall total provided via 'root'
yae_h <- c(
  "1.1.1.", "1.1.2.",
  "1.2.1.", "1.2.2.", "1.2.3.", "1.2.4.", "1.2.5.",
  "1.3.1.", "1.3.2.", "1.3.3.", "1.3.4.", "1.3.5.",
  "1.4.1.", "1.4.2.", "1.4.3.", "1.4.4.", "1.4.5.",
  "1.5.", "1.6.", "1.7.", "1.8.", "1.9.", "2.", "3.")

v1 <- hier_compute(
  inp = yae_h,
  dim_spec = c(2, 2, 2),
  root = "Tot",
  method = "len"
)
hier_display(v1)
```

### Creating hierarchies from a list

Alternatively, you can create a hierarchy by setting `method = "list"`. In this mode, the input should be a named list where each element's name is interpreted as a **parent node**, and the element's content represents its **child nodes**.
```{r}
yae_ll <- list()
yae_ll[["Total"]] <- c("1.", "2.", "3.")
yae_ll[["1."]] <- paste0("1.", 1:9, ".")
yae_ll[["1.1."]] <- paste0("1.1.", 1:2, ".")
yae_ll[["1.2."]] <- paste0("1.2.", 1:5, ".")
yae_ll[["1.3."]] <- paste0("1.3.", 1:5, ".")
yae_ll[["1.4."]] <- paste0("1.4.", 1:6, ".")

d <- hier_compute(inp = yae_ll, root = "Total", method = "list")
hier_display(d)
```

## Grids and Indexing

The `hier_grid()` function computes all possible combinations of codes from multiple hierarchies. This is a crucial step in building complete tables for Statistical Disclosure Control (SDC).

### Handling Bogus Codes

A "bogus" chain occurs when a parent node has only a single child. In such cases, the parent and the child represent the same set of underlying units, which can cause redundancies in SDC software.

In the example below, both `h1` and `h2` contain bogus structures:

* In **`h1`**, the node `A` has only one child `a1`, which in turn has only one child `aa1`.
* In **`h2`**, the nodes `b` and `d` each have only a single child (`b1` and `d1` respectively).

```{r}
h1 <- hier_create(root = "Total", nodes = LETTERS[1:3])
h1 <- hier_add(h1, root = "A", nodes = "a1")
h1 <- hier_add(h1, root = "a1", nodes = "aa1")
hier_display(h1)

h2 <- hier_create(root = "Total", nodes = letters[1:5])
h2 <- hier_add(h2, root = "b", nodes = "b1")
h2 <- hier_add(h2, root = "d", nodes = "d1")
hier_display(h2)
```

When calling `hier_grid()`, setting `add_dups = FALSE` automatically prunes these redundant parent nodes (like `A`, `a1`, `b`, and `d`). They are replaced by their most granular descendants (e.g., `aa1`, `b1`, and `d1`), ensuring the resulting grid aligns with the granularity of the raw microdata.

```{r}
# cell_id is a unique string created by concatenating default codes
r <- hier_grid(h1, h2, add_dups = FALSE, add_levs = TRUE)
print(r)
```

### High-Performance Indexing

For large datasets, mapping microdata strings to grid cells using character matching is computationally expensive.
By setting `add_contributing_cells = TRUE`, `sdcHierarchies` generates an optimized integer-based indexing system:

1. **`leaf_id`**: A unique integer assigned to every combination of **base-level codes** (the most granular codes in the hierarchies).
2. **`contributing_leaf_ids`**: A list-column containing the integers of all base-level codes that contribute to a specific cell (e.g., all codes falling under a "Total" or "Sub-total").

```{r}
# Create an SDC-optimized grid
r_sdc <- hier_grid(h1, h2, add_dups = FALSE, add_contributing_cells = TRUE)

# Generate microdata using base-level codes for region and sector
# Note: 'aa1', 'b1', and 'd1' are the granular leaf nodes
microdata <- data.table(
  region = c("aa1", "B", "C", "aa1", "B"),
  sector = c("a", "b1", "c", "d1", "e"),
  val = c(10, 20, 30, 40, 50)
)

# Map microdata to base-level IDs using a named list
microdata[, leaf_id := hier_create_ids(
  data = microdata,
  dims = list("region" = h1, "sector" = h2)
)]
print(microdata)

# Fast aggregation: Summing 'Total_Total' using integer lookups
total_ids <- r_sdc[v1 == "Total" & v2 == "Total", contributing_leaf_ids[[1]]]
print(total_ids)
sum(microdata[leaf_id %in% total_ids, val])
```

### Differentiating Totals and Primary Cells

The `leaf_id` column serves as a built-in classifier to distinguish between different cell types in the grid:

- **Primary Cells**: If `leaf_id` contains an integer, the row represents a unique combination of base-level codes. These are the "internal" cells where microdata is directly mapped.
- **(Sub)-Totals**: If `leaf_id` is `NA`, the row represents an aggregate cell.
This allows for extremely fast filtering during SDC tasks, such as isolating primary cells for sensitivity testing:

```{r}
# Isolate primary cells for primary suppression
primary_cells <- r_sdc[!is.na(leaf_id)]

# Isolate aggregate cells for marginal consistency checks
sub_totals <- r_sdc[is.na(leaf_id)]
```

## Interactively Create or Modify Hierarchies

The `sdcHierarchies` package includes a **Shiny-based interactive application** accessible via `hier_app()`. This interface is designed for users who prefer a visual approach to building or refining complex structures.

The app accepts either a raw character vector (to be converted using `hier_compute()` logic) or an existing hierarchy object. For example, to modify the hierarchy created in the previous section:

```{r, eval=FALSE}
# Start the app and store the modified result upon closing
d_modified <- hier_app(d)
```

### Key Features of the Interactive App

* **Visual Construction**: If a character vector is passed, the app provides a guided interface to specify `dim_spec` and `method` arguments.
* **Drag-and-Drop Editing**: Once the tree is loaded, you can dynamically restructure the hierarchy by dragging nodes to new parent locations.
* **Node Management**: Easily add, remove, or rename nodes through the sidebar controls.
* **Live Code Generation**: The `R` code required to reproduce the current state of the hierarchy is updated in real-time and can be copied or saved.
* **Export Options**: Supports exporting the final hierarchy directly back to the R session, saving it as a JSON file, or generating a $\tau$-argus compatible `hrc` file.
* **Undo Functionality**: A built-in history allows you to revert recent changes during the editing process.

Because `hier_app()` returns the modified hierarchy object upon closing, it is recommended to assign the function call to an object (as shown above) to capture your interactive changes for further use in your SDC pipeline.
## Summary

The `sdcHierarchies` package provides a framework for creating, modifying, converting and indexing hierarchies. If you have any suggestions or improvements, please file an issue at [**our issue tracker**](https://github.com/bernhard-da/sdcHierarchies/issues) or contribute by filing a [**pull request**](https://github.com/bernhard-da/sdcHierarchies/pulls).