---
title: "6. tidyped Class Structure and Extension Notes"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{6. tidyped Class Structure and Extension Notes}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

This document describes the structural contract of the `tidyped` class in visPedigree 1.8.0. It is intended for maintenance and extension work.

## 1. Class identity

`tidyped` is an S3 class layered on top of `data.table`.

Expected class vector:

```r
c("tidyped", "data.table", "data.frame")
```

The class is created through `new_tidyped()` (internal constructor) and checked with `is_tidyped()`.

## 2. Core design goals

`tidyped` is designed to be:

1. **safe for C++**: integer pedigree indices (`IndNum`, `SireNum`, `DamNum`) are always aligned with row order, so C++ routines can index directly without translation;
2. **fast for large pedigrees**: the fast path skips redundant validation when the input is already a `tidyped`;
3. **compatible with `data.table`**: in-place modification via `:=` and `set()` preserves class and metadata without copying;
4. **explicit about structural degradation**: row subsets that break pedigree completeness are downgraded to plain `data.table` with a warning.

## 3. The head invariant: IndNum == row index

The single most important structural rule in visPedigree:

> **`IndNum[i]` must equal `i` for every row.**

This means `SireNum` and `DamNum` are direct row pointers: the sire of individual `i` lives at row `SireNum[i]`, and `0L` encodes a missing parent.

Every C++ function in visPedigree (inbreeding coefficients, relationship matrices, BFS tracing, topological sorting) relies on this invariant. If it breaks, C++ will read wrong parents.

This invariant is enforced at three levels:

- **`tidyped()`**: builds indices from scratch during construction.
- **`[.tidyped`**: rebuilds indices in place after valid row subsets.
- **`ensure_tidyped()` / `ensure_complete_tidyped()`**: detect and repair stale indices when the class was accidentally dropped.

## 4. Column contract

### 4.1 Minimal structural columns

These four columns define a valid `tidyped`:

| Column | Type      | Description                          |
|--------|-----------|--------------------------------------|
| `Ind`  | character | Unique individual ID                 |
| `Sire` | character | Sire ID, `NA` for unknown            |
| `Dam`  | character | Dam ID, `NA` for unknown             |
| `Sex`  | character | `"male"`, `"female"`, or `"unknown"` |

Checked by `validate_tidyped()`.

### 4.2 Integer pedigree columns

| Column    | Type    | Description                         |
|-----------|---------|-------------------------------------|
| `IndNum`  | integer | Row index (== row number, see §3)   |
| `SireNum` | integer | Row index of sire, `0L` for missing |
| `DamNum`  | integer | Row index of dam, `0L` for missing  |

These exist whenever `tidyped()` is called with `addnum = TRUE` (default). They are the interface between R and C++.

### 4.3 Other common columns

| Column       | Description                                   |
|--------------|-----------------------------------------------|
| `Gen`        | Generation number                             |
| `Family`     | Family group code                             |
| `FamilySize` | Number of offspring in the family             |
| `Cand`       | `TRUE` for candidate individuals              |
| `f`          | Inbreeding coefficient (added by `inbreed()`) |

### 4.4 Column naming convention

All data columns use **PascalCase** (`Ind`, `SireNum`, `MeanF`, `ECG`), matching the core column style.

## 5. Metadata layer

Pedigree-level metadata is stored in a single attribute:

```r
attr(x, "ped_meta")
```

Built by `build_ped_meta()`, accessed by `pedmeta()`.
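The single-attribute pattern can be sketched in base R as follows. This is an illustration only, not the package's implementation: `make_meta_sketch()` and `get_meta_sketch()` are hypothetical stand-ins for `build_ped_meta()` and `pedmeta()`, and a plain `data.frame` is used instead of a `tidyped`. The three field names are the ones this section documents.

```r
# Sketch only: base R stand-in for the single-attribute metadata pattern.
# make_meta_sketch() and get_meta_sketch() are hypothetical names, not
# visPedigree functions.
make_meta_sketch <- function(selfing = FALSE,
                             bisexual_parents = character(0),
                             genmethod = "top") {
  list(
    selfing = selfing,
    bisexual_parents = bisexual_parents,
    genmethod = genmethod
  )
}

# Read the metadata list back from its single attribute.
get_meta_sketch <- function(x) attr(x, "ped_meta", exact = TRUE)

ped <- data.frame(
  Ind  = c("A", "B", "C"),
  Sire = c(NA, NA, "A"),
  Dam  = c(NA, NA, "B"),
  stringsAsFactors = FALSE
)

# All pedigree-level state lives in one named list under one attribute.
attr(ped, "ped_meta") <- make_meta_sketch(genmethod = "bottom")

meta <- get_meta_sketch(ped)
```

Keeping everything in one list means a method that must carry metadata across a copy only has to transfer one attribute.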

| Field              | Type      | Description                                |
|--------------------|-----------|--------------------------------------------|
| `selfing`          | logical   | Whether self-fertilization mode was used   |
| `bisexual_parents` | character | IDs appearing as both sire and dam         |
| `genmethod`        | character | `"top"` or `"bottom"` generation numbering |

No other pedigree-level attributes should be added outside `ped_meta`.

## 6. Structural invariants

The following invariants must hold for a valid `tidyped`:

1. **IndNum == row index** (see §3).
2. **Ind is unique**: no duplicate individual IDs.
3. **Completeness**: every non-`NA` Sire and Dam appears in `Ind`.
4. **Acyclicity**: no individual is its own ancestor.
5. **SireNum / DamNum consistency**: `0L` for missing parents, valid row indices otherwise.
6. **ped_meta is the sole metadata container**: no scattered attributes.

Invariants 1–5 are established by `tidyped()` and guarded by `[.tidyped`. Invariant 6 is a development convention.

## 7. Constructor pipeline

`tidyped()` currently has two distinct tracing paths:

- **Raw-input path** (`data.frame` / `data.table`): uses igraph for loop detection, candidate tracing, and topological sorting before integer indices are finalized.
- **Fast path** (`tidyped` + `cand`): skips graph rebuilding and uses C++ for candidate tracing and topological sorting on existing integer pedigree indices.

### 7.1 Full path: `tidyped(raw_input)`

When the input is a raw `data.frame` or `data.table`:

1. `validate_and_prepare_ped()`: normalize IDs, detect duplicates and bisexual parents, inject missing founders.
2. Loop detection: igraph builds a directed graph and checks `is_dag()`; `which_loop()` and `shortest_paths()` are used only on the error path to report informative loop diagnostics.
3. Candidate tracing: if `cand` is supplied, igraph neighborhood search is used on the raw-input path.
4. Topological sort: igraph `topo_sort()` on the raw-input path.
5. Generation assignment: C++ (`cpp_assign_generations_top` / `cpp_assign_generations_bottom`) using the pedigree implied by the sorted rows.
6. Sex inference: resolve unknowns from parental roles.
7. Build integer indices: `IndNum`, `SireNum`, `DamNum`.
8. `new_tidyped()` + attach `ped_meta`.

### 7.2 Fast path: `tidyped(tp, cand = ids)`

When the input is already a `tidyped` **and** `cand` is supplied:

- **Skipped**: ID validation, loop detection, sex inference, founder injection.
- **Executed**: C++ BFS tracing → C++ topo sort → C++ generation assignment → rebuild indices → `new_tidyped()` + `ped_meta`.

The fast path is the preferred workflow for repeated local tracing from a previously validated master pedigree:

```r
tp_master <- tidyped(raw_ped)
tp_local <- tidyped(tp_master, cand = ids, trace = "up", tracegen = 3)
```

### 7.3 `new_tidyped()`: internal constructor

`new_tidyped()` attaches the `"tidyped"` class via `setattr()` (no copy) and clears data.table's invisible flag via `x[]`. It does **not** attach `ped_meta`; that is the caller's responsibility.

It should only be called when the caller has already ensured structural validity.

## 8. Three-tier guard system

Analysis functions must guard their inputs. visPedigree provides three guard levels, chosen based on what each function needs.

### 8.1 `validate_tidyped()`: visualization guard

- Attempts silent class recovery via `ensure_tidyped()`.
- Checks only that `Ind`, `Sire`, `Dam`, `Sex` exist.
- **Does not require** pedigree completeness.
- Used by: `visped()`, `plot.tidyped()`, `summary.tidyped()`.

### 8.2 `ensure_tidyped()`: structure-light guard

- If already `tidyped`: returns as-is.
- If class was dropped but the eight core columns (`Ind`, `Sire`, `Dam`, `Sex`, `Gen`, `IndNum`, `SireNum`, `DamNum`) are present: rebuilds `IndNum` if stale, restores class, emits a message.
- **Does not check** pedigree completeness.
- Used by: `pedsubpop()`, `splitped()`, `pedne(method = "demographic")`, `pedstats(ecg = FALSE, genint = FALSE)`, `pedfclass()` (when `f` column already exists).

### 8.3 `ensure_complete_tidyped()`: complete-pedigree guard

- Everything `ensure_tidyped()` does, **plus**:
- Calls `require_complete_pedigree()`, which verifies that every non-`NA` Sire/Dam is present in `Ind` and stops with an error if not.
- Required by any function that recurses through pedigree structure in C++.
- Used by: `inbreed()`, `pedecg()`, `pedgenint()`, `pedrel()`, `pedne(method = "inbreeding" | "coancestry")`, `pedcontrib()`, `pedancestry()`, `pedfclass()` (when `f` must be computed), `pedpartial()`, `pediv()`, `pedmat()`, `pedhalflife()`.

### 8.4 Choosing the right guard

| Guard                       | Recovers class? | Requires completeness? | When to use                   |
|-----------------------------|:---------------:|:----------------------:|-------------------------------|
| `validate_tidyped()`        | yes             | no                     | Visualization                 |
| `ensure_tidyped()`          | yes             | no                     | Summaries on existing columns |
| `ensure_complete_tidyped()` | yes             | **yes**                | Pedigree recursion in C++     |

Some functions are **conditionally guarded**: they use `ensure_tidyped()` by default but escalate to `ensure_complete_tidyped()` when a parameter triggers pedigree recursion (for example `pedstats(ecg = TRUE)`, `pedne(method = "coancestry")`).

## 9. Safe subsetting contract

`[.tidyped` is the key protection layer.

### 9.1 `:=` operations

Modify-by-reference is passed through safely. Class and metadata are preserved via `setattr()`. No copy occurs.

### 9.2 Column-only selections

If the selection removes core pedigree columns, the result is returned as a plain `data.table` without warning.

### 9.3 Row subsets

After row subsetting, `[.tidyped` checks pedigree completeness:

- **Complete subset** (all referenced parents still present): `IndNum`, `SireNum`, `DamNum` are rebuilt in place, class and `ped_meta` are preserved.
- **Incomplete subset** (parent records missing): result is downgraded to plain `data.table` with a warning guiding the user to `tidyped(tp, cand = ids, trace = "up")`.

This downgrade is deliberate. It prevents stale integer indices from reaching C++ routines.

## 10. Computational boundaries: C++ vs igraph

visPedigree delegates heavy pedigree recursion to C++ and uses igraph where a graph object is still the simplest representation.

### 10.1 C++: core computation path

| Task                          | C++ function                                                  |
|-------------------------------|---------------------------------------------------------------|
| Ancestry / descendant tracing | `cpp_trace_ancestors`, `cpp_trace_descendants`                |
| Topological sorting           | `cpp_topo_order`                                              |
| Generation assignment         | `cpp_assign_generations_top`, `cpp_assign_generations_bottom` |
| Inbreeding coefficients       | `cpp_calculate_inbreeding` (Meuwissen & Luo)                  |
| Relationship matrices         | `cpp_addmat`, `cpp_dommat`, `cpp_aamat`, `cpp_ainv`           |

All C++ functions consume `SireNum` / `DamNum` integer vectors and assume the head invariant (§3).

### 10.2 igraph: graph-specific tasks

| Task                   | Where                      | igraph functions                                               |
|------------------------|----------------------------|----------------------------------------------------------------|
| Pedigree visualization | `visped()` pipeline        | `graph_from_data_frame`, `layout_with_sugiyama`, `plot.igraph` |
| Connected components   | `splitped()`               | `graph_from_edgelist`, `components`                            |
| Loop detection         | `tidyped()` raw-input path | `graph_from_edgelist`, `is_dag`                                |
| Loop diagnosis         | `tidyped()` error path     | `which_loop`, `shortest_paths`, `neighbors`, `components`      |
| Candidate tracing      | `tidyped()` raw-input path | `neighborhood`                                                 |
| Topological sorting    | `tidyped()` raw-input path | `topo_sort`                                                    |

igraph is not used in the core numerical pedigree analysis routines such as `inbreed()`, `pedmat()`, `pedecg()`, or `pedrel()`, but it is still part of the preprocessing and visualization stack.

## 11. Extension rules

When extending the class, follow these rules.

### 11.1 Do not add new pedigree-level attributes

Prefer adding fields to `ped_meta` instead of scattering new standalone attributes.

### 11.2 Keep computed state derivable

If a column can be rebuilt from pedigree structure, prefer derivation over storing opaque cached state.

### 11.3 Preserve `data.table` semantics

Use `:=`, `set()`, and `setattr()` carefully. Avoid patterns that trigger full copies unless unavoidable.

### 11.4 Respect downgrade semantics

Any future method that subsets rows must preserve the current rule: valid complete subset -> `tidyped`; incomplete subset -> plain `data.table`.

### 11.5 Document C++ assumptions

Any feature using `IndNum`, `SireNum`, or `DamNum` should document whether it requires:

- topologically ordered rows,
- dense consecutive indices,
- `0L` encoding for missing parents.

## 12. User-facing inspection helpers

| Function                  | Returns                           |
|---------------------------|-----------------------------------|
| `is_tidyped(x)`           | `TRUE` if class is present        |
| `is_complete_pedigree(x)` | `TRUE` if all Sire/Dam are in Ind |
| `pedmeta(x)`              | The `ped_meta` named list         |
| `has_inbreeding(x)`       | `TRUE` if `f` column exists       |
| `has_candidates(x)`       | `TRUE` if `Cand` column exists    |

Future extensions should prefer helper functions over direct attribute access.

## 13. Maintenance checklist

Before merging a structural change to `tidyped`, check:

1. Does class identity remain `c("tidyped", "data.table", "data.frame")`?
2. Is the head invariant `IndNum == row index` preserved after every code path?
3. Are `ped_meta` fields preserved correctly?
4. Does `[.tidyped` still handle `:=` without copy issues?
5. Do incomplete row subsets still downgrade with warning?
6. Are integer pedigree columns rebuilt whenever a subset remains valid?
7. Does `tidyped(tp_master, cand = ...)` match the full path result?
8. After `setorder()` or `merge()`, are indices rebuilt before reaching C++?
9. Do package tests and vignettes build cleanly?

## 14. Recommended workflow

For large pedigrees, the intended usage pattern is:

```r
# build one validated master pedigree
tp_master <- tidyped(raw_ped)

# reuse it for repeated local tracing (fast path)
tp_local <- tidyped(tp_master, cand = ids, trace = "up", tracegen = 3)

# modify analysis columns in place
tp_master[, phenotype := pheno]

# split only when disconnected components matter
parts <- splitped(tp_master)
```

This keeps workflows explicit, fast, and safe.
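
As a closing illustration of why checklist item 8 matters, the index contract from §3 can be sketched in base R. This is a minimal sketch under stated assumptions, not the package's code: `rebuild_indices_sketch()` and `head_invariant_holds()` are hypothetical helpers, and a plain `data.frame` stands in for a `tidyped`. Note that `match(..., nomatch = 0L)` also returns `0L` for a parent ID that is absent from `Ind`, so a sketch like this does not replace the separate completeness check.

```r
# Sketch only: base R illustration of the index contract. The function names
# are hypothetical and this is not the package's implementation.
rebuild_indices_sketch <- function(ped) {
  ped$IndNum <- seq_len(nrow(ped))
  # match() gives the row of each parent; nomatch = 0L encodes a missing parent
  ped$SireNum <- match(ped$Sire, ped$Ind, nomatch = 0L)
  ped$DamNum  <- match(ped$Dam,  ped$Ind, nomatch = 0L)
  ped
}

# TRUE when IndNum equals the row index and parent pointers are in range.
head_invariant_holds <- function(ped) {
  n <- nrow(ped)
  identical(ped$IndNum, seq_len(n)) &&
    all(ped$SireNum >= 0L & ped$SireNum <= n) &&
    all(ped$DamNum  >= 0L & ped$DamNum  <= n)
}

ped <- data.frame(
  Ind  = c("A", "B", "C", "D"),
  Sire = c(NA, NA, "A", "A"),
  Dam  = c(NA, NA, "B", "C"),
  stringsAsFactors = FALSE
)
ped <- rebuild_indices_sketch(ped)

# Reordering rows leaves the stored integers stale.
shuffled <- ped[c(2, 1, 3, 4), ]
stale <- head_invariant_holds(shuffled)

# Rebuilding from the character IDs restores the invariant.
repaired <- rebuild_indices_sketch(shuffled)
fixed <- head_invariant_holds(repaired)
```

Here `stale` is `FALSE` and `fixed` is `TRUE`: the character columns remain the source of truth, and the integer columns are derived state that must be regenerated after any reordering.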