---
title: "Introduction to robscale"
output: rmarkdown::html_vignette
bibliography: ../inst/references.bib
vignette: >
  %\VignetteIndexEntry{Introduction to robscale}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## 1. The Robustness Problem

Outliers compromise the reliability of classical estimators even in moderate samples. A single recording error can inflate the standard deviation arbitrarily, yet the standard deviation remains the default scale estimator in most statistical software. To see why, consider a small sample with one gross error:

```{r robustness-demo}
library(robscale)

x_clean        <- c(2.1, 2.3, 2.0, 2.4, 2.2, 2.1, 2.3, 1.9)
x_contaminated <- c(x_clean, 200)  # one recording error

sd(x_clean)
sd(x_contaminated)   # breaks down

gmd(x_clean)
gmd(x_contaminated)  # barely affected

qn(x_clean)
qn(x_contaminated)   # barely affected
```

The **breakdown point** is the fraction of contaminated observations an estimator tolerates before producing an arbitrarily large result. The standard deviation has a 0% breakdown point: a single outlier can drive it arbitrarily high. The GMD achieves 29.3%; `qn` and `sn` achieve the maximum 50%.

The **asymptotic relative efficiency** (ARE) measures statistical efficiency relative to the sample standard deviation under the Gaussian model [@bickel1976]. High efficiency means fewer observations are needed to match the standard deviation's precision. The tradeoff is inherent: a higher breakdown point generally costs efficiency. `robscale` exposes all major points on this robustness--efficiency frontier.

## 2. Getting Started

### Installation

```r
install.packages("robscale")

# Development version:
# install.packages("remotes")
# remotes::install_github("davdittrich/robscale")
```

### Quick demo

```{r quick-demo}
library(robscale)

set.seed(42)
x <- c(rnorm(20), 50)  # 20 clean observations + one outlier

scale_robust(x)             # auto-selects best strategy
scale_robust(x, ci = TRUE)  # with confidence interval

data.frame(
  estimator = c("sd", "sd_c4", "gmd", "qn", "mad_scaled"),
  value     = round(c(sd(x), sd_c4(x), gmd(x), qn(x), mad_scaled(x)), 4)
)
```

## 3. The Estimator Spectrum

`robscale` provides 11 exported functions spanning the full robustness--efficiency spectrum:

| Function | ARE | Breakdown | Description |
| :--- | ---: | ---: | :--- |
| `sd_c4(x)` | 100% | 0% | Bias-corrected standard deviation |
| `gmd(x)` | 98% | 29.3% | Gini mean difference |
| `adm(x)` | 88.3% | $1/n$ | Average deviation from median |
| `qn(x)` | 82.3% | 50% | Rousseeuw--Croux $Q_n$ |
| `sn(x)` | 58.2% | 50% | Rousseeuw--Croux $S_n$ |
| `iqr_scaled(x)` | 37% | 25% | Scaled interquartile range |
| `mad_scaled(x)` | 36.7% | 50% | Scaled median absolute deviation |
| `robScale(x)` | ~55% | 50% | M-estimator for scale |
| `robLoc(x)` | 98.4% | 50% | M-estimator for location |
| `scale_robust(x)` | ensemble | --- | Variance-weighted dispatcher |
| `get_consistency_constant(n)` | --- | --- | Finite-sample bias correction factors |

### Decision guide

- **Maximum efficiency, no outliers expected:** `sd_c4`
- **High efficiency with moderate outlier protection:** `gmd`
- **Maximum breakdown with good efficiency:** `qn`
- **Fast computation on large data:** `mad_scaled` or `iqr_scaled`
- **Uncertain which estimator to use:** `scale_robust()` (adaptive)
- **M-estimate of location:** `robLoc`
- **M-estimate of scale:** `robScale`

## 4. The `scale_robust()` Dispatcher

`scale_robust()` adapts its strategy to the sample size automatically.
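The size-based dispatch and the variance-weighted combination can be sketched in a few lines of base R. This is an illustration only, not the package's implementation: `my_scale_robust` is a hypothetical name, three base-R estimators stand in for `robscale`'s seven, the large-$n$ branch hands off to `mad()` where the package uses `gmd()`, and the $n < 20$ threshold is the one documented for `scale_robust()`.

```r
# Illustrative sketch of scale_robust()'s dispatch logic (hypothetical code).
my_scale_robust <- function(x, n_boot = 200) {
  estimators <- list(
    sd  = sd,
    mad = mad,                        # already normal-consistent
    iqr = function(x) IQR(x) / 1.349  # scaled IQR
  )
  if (length(x) >= 20) {
    return(mad(x))  # large n: one high-efficiency estimator (gmd() in robscale)
  }
  # Small n: bootstrap each estimator, then combine point estimates with
  # inverse-variance weights (low bootstrap variance => high weight).
  boot <- replicate(n_boot, {
    xb <- sample(x, replace = TRUE)
    vapply(estimators, function(f) f(xb), numeric(1))
  })                                  # 3 x n_boot matrix of replicates
  est <- vapply(estimators, function(f) f(x), numeric(1))
  w   <- 1 / apply(boot, 1, var)
  sum(w * est) / sum(w)
}

set.seed(1)
my_scale_robust(rnorm(12))
```

The weighting lets whichever stand-in estimator is most stable on this particular sample dominate the combination.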
### Ensemble mode ($n < 20$)

For small samples, all 7 scale estimators run on bootstrap replicates. Inverse-variance weights---where the variance is the bootstrap variance of each estimator across replicates---combine them into a single estimate:

$$
\hat\sigma = \frac{\sum_{j=1}^{7} \hat\sigma_j(x) \,/\, \operatorname{Var}_{\mathrm{boot}}(\hat\sigma_j)}{\sum_{j=1}^{7} 1 \,/\, \operatorname{Var}_{\mathrm{boot}}(\hat\sigma_j)}
$$

```{r ensemble}
set.seed(1)
x_small <- rnorm(12)

scale_robust(x_small)             # ensemble (n = 12 < 20)
scale_robust(x_small, ci = TRUE)  # BCa bootstrap CI
```

### Automatic switch to GMD ($n \geq 20$)

For larger samples, `scale_robust()` uses `gmd()` directly: 98% ARE, 29.3% breakdown, $O(n \log n)$.

```{r auto-switch}
x_large <- rnorm(500)
scale_robust(x_large)  # equivalent to gmd(x_large)
```

### Explicit method selection

```{r explicit-method}
scale_robust(x_small, method = "qn")        # force Qn
scale_robust(x_small, method = "robScale")  # force M-scale
scale_robust(x_small, auto_switch = FALSE)  # keep ensemble at any n
```

## 5. Individual Estimators

### `sd_c4(x)`: Bias-corrected standard deviation

Applies the exact $c_4(n)$ finite-sample correction to remove the downward bias of the sample standard deviation under normality:

```{r sd-c4}
x <- c(1.2, 0.8, 1.5, 0.9, 1.1)

sd_c4(x)  # unbiased under normality
sd(x)     # biased downward for small n

sd_c4(x, ci = TRUE)
```

### `gmd(x)`: Gini mean difference

Average absolute pairwise difference [@gini1912], computed via the order-statistics form in $O(n \log n)$:

```{r gmd}
gmd(c(1, 2, 3, 5, 7, 8))
gmd(c(1, 2, 3, 5, 7, 8), ci = TRUE)
```

### `adm(x)`: Average deviation from the median

The mean deviation from the median [@nair1936; @nair1947], scaled for consistency at the normal distribution.
```{r adm}
adm(c(1, 2, 3, 5, 7, 8))
```

### `qn(x)`: Rousseeuw--Croux $Q_n$ [@rousseeuw1993]

```{r qn}
qn(c(1, 2, 3, 5, 7, 8))
qn(c(1, 2, 3, 5, 7, 8), ci = TRUE)
```

### `sn(x)`: Rousseeuw--Croux $S_n$

```{r sn}
sn(c(1, 2, 3, 5, 7, 8))
sn(c(1, 2, 3, 5, 7, 8), ci = TRUE)
```

### `iqr_scaled(x)` and `mad_scaled(x)`

```{r iqr-mad}
iqr_scaled(c(1, 2, 3, 5, 7, 8))
mad_scaled(c(1, 2, 3, 5, 7, 8))
```

### `robLoc(x)`: M-estimator for location

M-estimator for location via Newton--Raphson iteration on the logistic psi function [@rousseeuw2002, Eq. 21]. Starting value: $T^{(0)} = \text{median}(x)$; auxiliary scale: $S = \text{MAD}(x)$.

**Fallback:** When scale is unknown and $n < 4$, or when scale is known and $n < 3$, returns `median(x)` without iteration.

```{r robloc}
robLoc(c(1, 2, 3, 5, 7, 8))
robLoc(c(1, 2, 3), scale = 1.5)  # known scale enables n = 3
```

Newton--Raphson algorithm for `robLoc()`:

```
flowchart TD
    A[Set minobs: 3 if scale known, 4 if unknown] --> B[med = median x]
    B --> C{n < minobs?}
    C -- Yes --> D([Return med])
    C -- No --> E{Scale known?}
    E -- Yes --> F[s = scale]
    E -- No --> G[s = MAD x]
    F --> H{s = 0?}
    G --> H
    H -- Yes --> I([Return med])
    H -- No --> J[t = med]
    J --> K[psi_i = tanh( (x_i - t) / (2s) )]
    K --> L[sum_psi = Sum psi_i, sum_dpsi = Sum (1 - psi_i^2)]
    L --> M[v = 2s * sum_psi / sum_dpsi, t += v]
    M --> N{"|v| <= tol * max(|t|, 1)?"}
    N -- No --> K
    N -- Yes --> O([Return t])
```

### `robScale(x)`: M-estimator for scale

M-estimator for scale via Newton--Raphson iteration on the M-scale estimating equation [@rousseeuw2002]. Starting value: $S^{(0)} = \text{MAD}(x)$.
```{r robscale}
robScale(c(1, 2, 3, 5, 7, 8))
robScale(c(5, 5, 5, 5, 6), fallback = "na")  # revss compatibility
robScale(c(1, 2, 3, 5, 7, 8), ci = TRUE)
```

Newton--Raphson iteration for `robScale()`:

```
flowchart TD
    A{Location known?}
    A -- Yes --> B1[w = x - loc]
    B1 --> B2[s = 1.4826 * median( abs(w) )]
    B2 --> B3[t = 0, minobs = 3]
    A -- No --> C1[med = median x]
    C1 --> C2[s = MAD x, t = med]
    C2 --> C3[minobs = 4]
    B3 --> D{n < minobs?}
    C3 --> D
    D -- Yes --> E{s <= implbound?}
    E -- Yes --> F{fallback?}
    F -- adm --> F1([Return ADM x])
    F -- na --> F2([Return NA])
    E -- No --> G([Return s])
    D -- No --> H{s <= implbound?}
    H -- Yes --> F
    H -- No --> J["u_i = (x_i - t) / (2cs), compute tanh(u_i)"]
    J --> K["numer = Σtanh²/n − 0.5, denom = (2/n)·Σ(u·tanh·sech²)"]
    K --> LG{"|denom| <= 1e-14 * s?"}
    LG -- Yes --> LF["s *= sqrt(2 * mean_tanh²)"]
    LF --> J
    LG -- No --> MM["Δs = s · numer / denom"]
    MM --> MG{"s + Δs <= 0?"}
    MG -- Yes --> MH["s /= 2"]
    MH --> J
    MG -- No --> MS["s ← s + Δs"]
    MS --> N{"|Δs|/s ≤ tol?"}
    N -- No --> J
    N -- Yes --> O([Return s])
```

## 6. Confidence Intervals

All scale estimators support `ci = TRUE`:

```{r ci}
# Analytical CI (chi-squared for sd_c4, ARE-based for others)
sd_c4(c(1, 2, 3, 5, 7, 8), ci = TRUE)
gmd(c(1, 2, 3, 5, 7, 8), ci = TRUE)

# BCa bootstrap CI for the scale_robust() ensemble
set.seed(42)
x_small <- rnorm(10)
scale_robust(x_small, ci = TRUE, n_boot = 500)
```

The CI return value is a named list with `estimate`, `lower`, `upper`, and `level`. For `scale_robust()` in ensemble mode, the object class is `robscale_ensemble_ci` and includes per-estimator weights.

## 7. Performance Architecture

### SIMD vectorization

The logistic psi function satisfies $\psi_{\log}(x) = \tanh(x/2)$. `robscale` evaluates `tanh` in bulk using the fastest available platform backend. The `configure` script selects one of two vectorized `tanh` libraries at compile time; at runtime, CPUID selects the appropriate instruction set:

1. **Apple Accelerate** (`vvtanh`): array-wide SIMD on macOS.
2. **glibc libmvec** (Linux x86\_64, glibc $\geq$ 2.35) or **SLEEF** (fallback when libmvec is absent). Only one is linked per binary. The AVX2 4-wide kernel (`_ZGVdN4v_tanh` or `Sleef_tanhd4_u10avx2`) processes 4 doubles per iteration.
3. **OpenMP SIMD** (`#pragma omp simd`): compiler-guided portable fallback.
4. **Scalar** `std::tanh`: universal fallback.

For the bulk tanh dispatcher, vectorization requires $n \geq 8$. However, `robLoc`'s fused AVX2 NR kernel activates at $n \geq 4$ (one full 4-wide vector).
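A scalar reference makes clear what the fused `robLoc` kernel has to compute per element: one `tanh`, then two running sums, $\sum \psi_i$ and $\sum (1 - \psi_i^2)$, which together give the Newton--Raphson step. The sketch below transcribes that loop into base R; `my_rob_loc` is a hypothetical name, not the package's C++ kernel, and the fallback logic is simplified to the unknown-scale case.

```r
# Scalar base-R reference for the robLoc Newton-Raphson iteration:
# psi(u) = tanh(u) applied to u_i = (x_i - t) / (2s).
my_rob_loc <- function(x, tol = 1e-9, max_iter = 50) {
  t <- median(x)
  s <- mad(x)                            # auxiliary scale S = MAD(x)
  if (length(x) < 4 || s == 0) return(t) # fallback: plain median
  for (k in seq_len(max_iter)) {
    p    <- tanh((x - t) / (2 * s))      # psi values, one tanh each
    step <- 2 * s * sum(p) / sum(1 - p^2)
    t    <- t + step
    if (abs(step) <= tol * max(abs(t), 1)) break
  }
  t
}

my_rob_loc(c(1, 2, 3, 5, 7, 8))
```

Because `p` is reused in both sums, the derivative term costs only a squaring and a subtraction on top of the numerator, which is exactly the property the fused one-pass kernel exploits.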
The Gini mean difference uses a separate AVX2 FMA kernel (`_mm256_fmadd_pd`) for its weighted sum $\sum (2i - (n-1))\, x_{(i)}$, shared across the standalone `gmd()`, the ensemble bootstrap, and the BCa jackknife. This kernel activates for $n \geq 8$ on AVX2 hardware.

### $O(n)$ median selection

`robscale` uses a tiered strategy rather than `sort()`:

- **$n \in \{8, 16, 32\}$ on AVX2 hardware:** SIMD selection networks (bitonic merge, 23--58% faster than scalar selection networks).
- **$n \leq 36$:** scalar selection networks for median computation.
- **$n \leq 56$:** branchless sorting networks for full-sort paths (GMD, $Q_n$ exact, $S_n$), compiled to conditional-move instructions at `-O2`.
- **$n \geq 37$ (selection) / $n \geq 57$ (sort):** Floyd--Rivest algorithm [@floyd1975] or pdqselect, chosen per estimator.

Comparator counts for the branchless sorting networks:

| $n$ | Comparators |
| ---: | ---: |
| 3 | 3 |
| 4 | 5 |
| 8 | 19 |
| 16 | 60 |
| 32 | 185 |
| 56 | 438 |

The crossover between Floyd--Rivest and pdqselect is derived at runtime from the per-core L2 cache size; each estimator uses its own threshold based on working-set pressure:

| Estimator | Strategy | Threshold |
| :--- | :--- | :--- |
| `iqr_scaled` | Always pdqselect | --- |
| `mad_scaled` | Floyd--Rivest vs. pdqselect | $\max(2048,\, L_2 / 40)$ |
| `robScale` | Floyd--Rivest vs. pdqselect | $\max(2048,\, L_2 / 16)$ |
| $S_n$ | Floyd--Rivest vs. pdqselect | $\max(2048,\, L_2 / 16)$ |
| $Q_n$ (final window) | Inline check | $\max(2048,\, L_2 / 32)$ |

### Stack-allocated memory arenas

A 128-double micro-buffer (1 KB) covers the smallest samples ($n \leq 128$ for MAD, $n \leq 64$ for `robScale` and `robLoc`). For $n \leq 2{,}048$, stack-allocated arrays avoid heap allocation: `robScale` uses one 2,048-double array (16 KB); `robLoc` uses one 4,096-double array (32 KB, split into equal-sized buffer and deviation halves).
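Base R exposes the same selection-instead-of-sorting idea through `sort(x, partial = k)`, which only guarantees that the requested order statistics land in their sorted positions. The sketch below computes a median that way; `median_partial` is a hypothetical name, and R's partial sort is of course not the package's Floyd--Rivest/pdqselect path, just the same $O(n)$ principle.

```r
# Median via partial selection rather than a full sort.
median_partial <- function(x) {
  n <- length(x)
  if (n %% 2 == 1) {
    k <- (n + 1) / 2
    sort(x, partial = k)[k]                       # single order statistic
  } else {
    k <- n / 2
    mean(sort(x, partial = c(k, k + 1))[c(k, k + 1)])  # average of the two
  }
}

median_partial(c(9, 1, 7, 3, 5))  # same value as median(c(9, 1, 7, 3, 5))
```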
**Fused single-buffer algorithm.** After median selection the working buffer holds a permutation of the input. Because MAD is order-invariant, absolute deviations can be computed in place, eliminating the second scratch array and halving memory from $2n$ to $n$ doubles per estimator.

### Iteration convergence

**`robLoc`: Newton--Raphson.** Location estimation uses Newton--Raphson (quadratic convergence, ~3 iterations) rather than the scoring fixed-point method (~6--8 iterations):

$$
T^{(k+1)} = T^{(k)} + \frac{2\,S\sum \psi\!\left(\frac{x_i - T^{(k)}}{2S}\right)}
{\sum \left[1 - \psi^2\!\left(\frac{x_i - T^{(k)}}{2S}\right)\right]}
$$

Since $\tanh$ values are already computed for the numerator, the denominator costs only squaring and subtraction. Typical iteration counts:

| $n$ | Scoring | Newton--Raphson |
| ---: | ---: | ---: |
| 4 | 7 | 2--4 |
| 8 | 7 | 2--4 |
| 20 | 6 | 2--4 |
| 100 | 5 | 2--4 |

**Fused AVX2 kernel for `robLoc()`.** `rob_loc_nr_step_avx2` collapses three passes into one. Two AVX2 accumulators advance in lockstep:

$$
\texttt{acc\_psi} \mathrel{+}= p_i, \qquad \texttt{acc\_dpsi} \mathrel{+}= (1 - p_i^2)
$$

where $p_i = \tanh(u_i)$ is evaluated once. The derivative accumulator uses a fused negative multiply-add (`fnmadd`) to accumulate $1 - p_i^2$ directly, avoiding the catastrophic cancellation that would arise from subtracting two large sums ($n - \sum p_i^2$) when all $|u_i| \gg 1$.

**`robScale`: Newton--Raphson.** Scale estimation also uses Newton--Raphson, applied to the M-scale estimating equation $n^{-1}\sum \rho(u_i) = 1/2$. The fused single-pass kernel (`nr_scale_step_avx2` / `nr_scale_step_scalar`) computes $\sum \tanh^2(u_i)$ and $\sum u_i \tanh(u_i)\,\mathrm{sech}^2(u_i)$ in one read over the data, yielding the NR update $\Delta s = s \cdot (\bar\rho - \tfrac{1}{2}) / \bar{D}$, where $\bar{D}$ is the scaled derivative sum. The same quadratic convergence as `robLoc` applies; typical iteration counts are 3--4.
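The scale update can likewise be transcribed into base R. This is a sketch, not the package's kernel: `my_rob_scale` is a hypothetical name, the guard and fallback logic is simplified, and the tuning constant is the 50%-breakdown value quoted for the logistic rho function.

```r
# Base-R sketch of the robScale Newton-Raphson iteration on the
# M-scale estimating equation mean(tanh^2(u)) = 1/2,
# with u_i = (x_i - t) / (2 * c * s).
my_rob_scale <- function(x, tol = 1e-9, max_iter = 50) {
  cc <- 0.37394112142347236        # 50%-breakdown tuning constant c
  t  <- median(x)
  s  <- mad(x)                     # starting value S^(0)
  if (length(x) < 4 || s == 0) return(s)
  for (k in seq_len(max_iter)) {
    u     <- (x - t) / (2 * cc * s)
    th    <- tanh(u)
    numer <- mean(th^2) - 0.5
    denom <- 2 * mean(u * th * (1 - th^2))  # sech^2(u) = 1 - tanh^2(u)
    if (abs(denom) <= 1e-14 * s) {          # degenerate-denominator guard
      s <- s * sqrt(2 * mean(th^2))         # multiplicative fallback
      next
    }
    step <- s * numer / denom
    if (s + step <= 0) { s <- s / 2; next } # keep the scale positive
    s <- s + step
    if (abs(step) / s <= tol) break
  }
  s
}

my_rob_scale(c(1, 2, 3, 5, 7, 8))
```

At convergence, `mean(tanh(u)^2)` equals one half by construction, which is a convenient way to sanity-check the iteration.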
A degenerate-denominator guard applies two protections. When the denominator is degenerate ($|\bar{D}| \leq 10^{-14} \times s$), a multiplicative fallback $s \leftarrow s \times \sqrt{2 \times \overline{\tanh^2}}$ replaces the Newton step, avoiding the catastrophic cancellation of subtracting two large numbers. A separate guard halves $s$ when the proposed update would make $s$ non-positive ($s + \Delta s \leq 0$).

### TBB parallelism and ensemble design

$Q_n$ and $S_n$ partition inner loops across TBB threads for $n$ above a runtime L2-derived threshold. The ensemble bootstrap parallelizes across TBB threads when $n \times n_{\text{boot}} \geq 10{,}000$; with default settings ($n < 20$, $n_{\text{boot}} = 200$) the bootstrap runs serially.

The ensemble bootstrap stores results in estimator-major layout ($7 \times n_{\text{boot}}$): each estimator's replicates occupy a contiguous run of memory, so the mean/variance reduction pass reads at stride-1 rather than stride-7. Within each replicate, the bootstrap resample is sorted once and reused by all seven estimators; pre-allocated workspace buffers eliminate per-replicate heap allocation for $S_n$ and $Q_n$.

Bootstrap resampling uses a deterministic XorShift32 PRNG [@marsaglia2003] seeded per replicate for reproducibility.

### Numerical stability and compile-time optimizations

- `sd_c4` uses Welford's one-pass algorithm [@welford1962] for numerically stable variance.
- The reciprocal constant $1/c$ is `constexpr`, replacing an inner-loop division with a multiplication.
- Loop-invariant quantities (`inv_s`, `half_inv_s`, `inv_n`) are hoisted before the iteration loop.
- Sorting-network entry points are instantiated in a single translation unit, reducing cold build time from ~300 s to ~52 s.

## 8. Mathematical Foundations

### Logistic psi function

[@rousseeuw2002, Eq. 23]:

$$
\psi_{\log}(x) = \frac{e^x - 1}{e^x + 1} = \tanh(x/2)
$$

Bounded in $(-1, 1)$, $C^\infty$, strictly monotone.
Boundedness provides robustness; smoothness avoids the corner artifacts of Huber's psi at small $n$.

### Decoupled estimation

Location and scale are estimated separately, each with a fixed auxiliary estimate, breaking the positive-feedback loop of Huber's Proposal 2. `robLoc` fixes scale at $\text{MAD}(x)$; `robScale` fixes location at $\text{median}(x)$.

### Rho function for scale

[@rousseeuw2002, Eq. 26]:

$$
\rho_{\log}(x) = \psi_{\log}^2(x / c)
$$

where $c = 0.37394112142347236$ is the constant that yields a 50% breakdown point.

### $Q_n$ and $S_n$ statistics

$$
Q_n = c_n \cdot 2.2191 \cdot \{|x_i - x_j|;\; i < j\}_{(k)}
$$

where $k = \binom{h}{2}$, $h = \lfloor n/2 \rfloor + 1$, and $c_n$ is the finite-sample correction factor.

$$
S_n = c_n' \cdot 1.1926 \cdot \operatorname{lomed}_i \{\operatorname{himed}_j\, |x_i - x_j|\}
$$

where $\operatorname{lomed}$ and $\operatorname{himed}$ denote the low and high medians, respectively.

### $c_4(n)$ correction factor

$$
c_4(n) = \sqrt{\frac{2}{n-1}} \cdot \frac{\Gamma(n/2)}{\Gamma((n-1)/2)}
$$

### Gini mean difference

$$
\text{GMD}(x) = C_{\text{GMD}} \cdot \frac{2}{n(n-1)}\sum_{i=1}^{n} (2i - n - 1)\, x_{(i)}
$$

Algebraically equivalent to $\frac{1}{\binom{n}{2}}\sum_{i<j} |x_i - x_j|$, the mean absolute pairwise difference, but computable in $O(n \log n)$ via a single sort.
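The equivalence between the $O(n^2)$ pairwise definition and the order-statistics form is easy to check numerically in base R. A sketch: the function names are hypothetical, and the consistency factor $C_{\text{GMD}}$ is omitted so both sides are the raw mean absolute pairwise difference.

```r
# Raw Gini mean difference two ways: O(n^2) pairwise definition
# versus the O(n log n) order-statistics form.
gmd_pairwise <- function(x) {
  mean(dist(x))                      # average |x_i - x_j| over all i < j
}
gmd_orderstat <- function(x) {
  n <- length(x)
  i <- seq_len(n)
  2 / (n * (n - 1)) * sum((2 * i - n - 1) * sort(x))
}

x <- c(1, 2, 3, 5, 7, 8)
gmd_pairwise(x)
gmd_orderstat(x)   # agrees with gmd_pairwise(x) up to rounding
```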