---
title: "Introduction to robscale"
output: rmarkdown::html_vignette
bibliography: ../inst/references.bib
vignette: >
  %\VignetteIndexEntry{Introduction to robscale}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
## 1. The Robustness Problem
Outliers compromise the reliability of classical estimators even in moderately
sized samples. A single recording error can inflate the standard deviation
arbitrarily, yet it remains the default scale estimator in most statistical
software.
To see why, consider a small sample with one gross error:
```{r robustness-demo}
library(robscale)
x_clean <- c(2.1, 2.3, 2.0, 2.4, 2.2, 2.1, 2.3, 1.9)
x_contaminated <- c(x_clean, 200) # one recording error
sd(x_clean)
sd(x_contaminated) # explodes
gmd(x_clean)
gmd(x_contaminated) # unaffected
qn(x_clean)
qn(x_contaminated) # unaffected
```
The **breakdown point** is the fraction of contaminated observations an
estimator tolerates before producing an arbitrarily large result. The standard
deviation has a 0% breakdown point; a single outlier drives it to infinity.
The GMD achieves 29.3%; `qn` and `sn` achieve the maximum 50%.
The **asymptotic relative efficiency** (ARE) measures statistical efficiency
relative to the sample standard deviation under the Gaussian model
[@bickel1976]. High efficiency means fewer observations are needed to match
the standard deviation's precision. The tradeoff is inherent: higher breakdown
point generally costs efficiency.
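The tradeoff can be made concrete with a small simulation (illustrative only;
the estimate is noisy): comparing the sampling variance of `sd` and the
scaled MAD on clean Gaussian samples recovers the MAD's roughly 37%
efficiency.

```r
# Monte Carlo: the ratio of sampling variances approximates the MAD's
# asymptotic relative efficiency versus sd under clean Gaussian data.
set.seed(123)
reps <- replicate(2000, {
  z <- rnorm(100)
  c(sd = sd(z), mad = mad(z))
})
eff <- var(reps["sd", ]) / var(reps["mad", ])
eff   # roughly 0.37: the MAD needs ~2.7x the data for equal precision
```

In other words, the MAD buys its 50% breakdown point at a steep efficiency
cost, which is exactly the frontier the table in Section 3 maps out.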
`robscale` exposes all major points on this robustness--efficiency frontier.
## 2. Getting Started
### Installation
```r
install.packages("robscale")
# Development version:
# install.packages("remotes")
# remotes::install_github("davdittrich/robscale")
```
### Quick demo
```{r quick-demo}
library(robscale)
set.seed(42)
x <- c(rnorm(20), 50) # 20 clean observations + one outlier
scale_robust(x) # auto-selects best strategy
scale_robust(x, ci = TRUE) # with confidence interval
data.frame(
estimator = c("sd", "sd_c4", "gmd", "qn", "mad_scaled"),
value = round(c(sd(x), sd_c4(x), gmd(x), qn(x), mad_scaled(x)), 4)
)
```
## 3. The Estimator Spectrum
`robscale` provides 11 exported functions spanning the full
robustness--efficiency spectrum:
| Function | ARE | Breakdown | Description |
| :--- | ---: | ---: | :--- |
| `sd_c4(x)` | 100% | 0% | Bias-corrected standard deviation |
| `gmd(x)` | 98% | 29.3% | Gini mean difference |
| `adm(x)` | 88.3% | $1/n$ | Average deviation from median |
| `qn(x)` | 82.3% | 50% | Rousseeuw--Croux $Q_n$ |
| `sn(x)` | 58.2% | 50% | Rousseeuw--Croux $S_n$ |
| `iqr_scaled(x)` | 37% | 25% | Scaled interquartile range |
| `mad_scaled(x)` | 36.7% | 50% | Scaled median absolute deviation |
| `robScale(x)` | ~55% | 50% | M-estimator for scale |
| `robLoc(x)` | 98.4% | 50% | M-estimator for location |
| `scale_robust(x)` | ensemble | --- | Variance-weighted dispatcher |
| `get_consistency_constant(n)` | --- | --- | Finite-sample bias correction factors |
### Decision guide
- **Maximum efficiency, no outliers expected:** `sd_c4`
- **High efficiency with moderate outlier protection:** `gmd`
- **Maximum breakdown with good efficiency:** `qn`
- **Fast computation on large data:** `mad_scaled` or `iqr_scaled`
- **Uncertain which estimator to use:** `scale_robust()` (adaptive)
- **M-estimate of location:** `robLoc`
- **M-estimate of scale:** `robScale`
## 4. The `scale_robust()` Dispatcher
`scale_robust()` adapts its strategy to sample size automatically.
### Ensemble mode ($n < 20$)
For small samples, all 7 scale estimators run on bootstrap replicates.
Inverse-variance weights---where the variance is the bootstrap variance of
each estimator across replicates---combine them into a single estimate:
$$
\hat\sigma = \frac{\sum_{j=1}^{7} \hat\sigma_j(x) \,/\, \operatorname{Var}_{\mathrm{boot}}(\hat\sigma_j)}{\sum_{j=1}^{7} 1 \,/\, \operatorname{Var}_{\mathrm{boot}}(\hat\sigma_j)}
$$
```{r ensemble}
set.seed(1)
x_small <- rnorm(12)
scale_robust(x_small) # ensemble (n = 12 < 20)
scale_robust(x_small, ci = TRUE) # BCa bootstrap CI
```
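The weighting rule can be sketched in a few lines of base R. Everything below
is an illustrative stand-in: `sd`, `mad`, and a scaled IQR replace the seven
internal estimators, and a plain nonparametric bootstrap replaces the
package's machinery.

```r
# Inverse-variance ensemble sketch with three stand-in estimators.
set.seed(1)
x <- rnorm(12)
estimators <- list(
  sd  = sd,
  mad = mad,                        # scaled for normal consistency by default
  iqr = function(v) IQR(v) / 1.349  # IQR scaled for normal consistency
)
n_boot <- 200
boot <- replicate(n_boot, {
  xb <- sample(x, replace = TRUE)   # one resample, shared by all estimators
  vapply(estimators, function(f) f(xb), numeric(1))
})                                  # 3 x n_boot matrix of replicates
w   <- 1 / apply(boot, 1, var)      # inverse bootstrap variances
est <- vapply(estimators, function(f) f(x), numeric(1))
sigma_hat <- sum(w * est) / sum(w)  # variance-weighted combination
sigma_hat
```

With all weights positive, the combined estimate always lies between the
smallest and largest component estimate.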
### Automatic switch to GMD ($n \geq 20$)
For larger samples, `scale_robust()` uses `gmd()` directly: 98% ARE,
29.3% breakdown, $O(n \log n)$.
```{r auto-switch}
x_large <- rnorm(500)
scale_robust(x_large) # equivalent to gmd(x_large)
```
### Explicit method selection
```{r explicit-method}
scale_robust(x_small, method = "qn") # force Qn
scale_robust(x_small, method = "robScale") # force M-scale
scale_robust(x_small, auto_switch = FALSE) # keep ensemble at any n
```
## 5. Individual Estimators
### `sd_c4(x)`: Bias-corrected standard deviation
Applies the exact $c_4(n)$ finite-sample correction to remove the downward
bias of the sample standard deviation under normality:
```{r sd-c4}
x <- c(1.2, 0.8, 1.5, 0.9, 1.1)
sd_c4(x) # unbiased under normality
sd(x) # biased downward for small n
sd_c4(x, ci = TRUE)
```
### `gmd(x)`: Gini mean difference
Average absolute pairwise difference [@gini1912], computed via the
order-statistics form in $O(n \log n)$:
```{r gmd}
gmd(c(1, 2, 3, 5, 7, 8))
gmd(c(1, 2, 3, 5, 7, 8), ci = TRUE)
```
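The order-statistics form can be checked against the naive pairwise
definition in base R. This sketch omits the consistency constant
$C_{\text{GMD}}$ and any finite-sample correction that `gmd()` applies, so it
only demonstrates the identity between the two raw forms; the closing line
applies the normal-consistency factor $\sqrt{\pi}/2$ (since
$E|X - Y| = 2\sigma/\sqrt{\pi}$ for Gaussian $X, Y$).

```r
# Gini mean difference two ways: O(n log n) order-statistics form vs.
# the O(n^2) mean of all pairwise absolute differences.
gmd_sorted <- function(x) {
  n <- length(x)
  i <- seq_len(n)
  sum((2 * i - n - 1) * sort(x)) * 2 / (n * (n - 1))
}
gmd_pairs <- function(x) {
  d <- abs(outer(x, x, "-"))
  mean(d[lower.tri(d)])            # all |x_i - x_j|, i < j
}
x <- c(1, 2, 3, 5, 7, 8)
all.equal(gmd_sorted(x), gmd_pairs(x))   # identical up to rounding
# sqrt(pi)/2 makes the raw GMD consistent for sigma under normality:
sqrt(pi) / 2 * gmd_sorted(x)
```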
### `adm(x)`: Average deviation from the median
The mean deviation from the median [@nair1936; @nair1947], scaled for
consistency at the normal distribution.
```{r adm}
adm(c(1, 2, 3, 5, 7, 8))
```
### `qn(x)`: Rousseeuw--Croux $Q_n$ [@rousseeuw1993]
```{r qn}
qn(c(1, 2, 3, 5, 7, 8))
qn(c(1, 2, 3, 5, 7, 8), ci = TRUE)
```
### `sn(x)`: Rousseeuw--Croux $S_n$
```{r sn}
sn(c(1, 2, 3, 5, 7, 8))
sn(c(1, 2, 3, 5, 7, 8), ci = TRUE)
```
### `iqr_scaled(x)` and `mad_scaled(x)`
```{r iqr-mad}
iqr_scaled(c(1, 2, 3, 5, 7, 8))
mad_scaled(c(1, 2, 3, 5, 7, 8))
```
### `robLoc(x)`: M-estimator for location
M-estimator for location via Newton--Raphson iteration on the logistic psi
function [@rousseeuw2002, Eq. 21]. Starting value:
$T^{(0)} = \text{median}(x)$; auxiliary scale: $S = \text{MAD}(x)$.
**Fallback:** When scale is unknown and $n < 4$, or when scale is known and
$n < 3$, returns `median(x)` without iteration.
```{r robloc}
robLoc(c(1, 2, 3, 5, 7, 8))
robLoc(c(1, 2, 3), scale = 1.5) # known scale enables n = 3
```
Newton--Raphson algorithm for `robLoc()`:
```
flowchart TD
A[Set minobs: 3 if scale known, 4 if unknown] --> B[med = median x]
B --> C{n < minobs?}
C -- Yes --> D([Return med])
C -- No --> E{Scale known?}
E -- Yes --> F[s = scale]
E -- No --> G[s = MAD x]
F --> H{s = 0?}
G --> H
H -- Yes --> I([Return med])
H -- No --> J[t = med]
J --> K[psi_i = tanh( (x_i - t) / 2s )]
K --> L[sum_psi = Sum psi_i, sum_dpsi = Sum (1 - psi_i^2)]
L --> M["v = 2s * sum_psi / sum_dpsi; t += v"]
M --> N{"|v| <= tol * max(|t|, 1)?"}
N -- No --> K
N -- Yes --> O([Return t])
```
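The flowchart translates almost line for line into base R. The sketch below
is an illustrative reimplementation, not the package's C++ kernel: the
tolerance and iteration cap are assumptions, and `median()`/`mad()` stand in
for the internal selection routines.

```r
# Newton--Raphson location M-estimate with logistic psi, psi(u) = tanh(u/2).
rob_loc_nr <- function(x, scale = NULL, tol = 1e-8, max_iter = 50) {
  minobs <- if (is.null(scale)) 4 else 3
  med <- median(x)
  if (length(x) < minobs) return(med)          # small-n fallback
  s <- if (is.null(scale)) mad(x) else scale   # fixed auxiliary scale
  if (s == 0) return(med)
  t <- med
  for (k in seq_len(max_iter)) {
    psi <- tanh((x - t) / (2 * s))             # psi((x - t)/s), psi = tanh(./2)
    v <- 2 * s * sum(psi) / sum(1 - psi^2)     # Newton step
    t <- t + v
    if (abs(v) <= tol * max(abs(t), 1)) break  # relative convergence check
  }
  t
}
rob_loc_nr(c(1, 2, 3, 5, 7, 8))
rob_loc_nr(c(1, 2, 3), scale = 1.5)            # known scale enables n = 3
```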
### `robScale(x)`: M-estimator for scale
M-estimator for scale via Newton--Raphson iteration on the M-scale
estimating equation [@rousseeuw2002]. Starting value: $S^{(0)} = \text{MAD}(x)$.
```{r robscale}
robScale(c(1, 2, 3, 5, 7, 8))
robScale(c(5, 5, 5, 5, 6), fallback = "na") # revss compatibility
robScale(c(1, 2, 3, 5, 7, 8), ci = TRUE)
```
Newton--Raphson iteration for `robScale()`:
```
flowchart TD
A{Location known?}
A -- Yes --> B1[w = x - loc]
B1 --> B2[s = 1.4826 * median( abs(w) )]
B2 --> B3[t = 0, minobs = 3]
A -- No --> C1[med = median x]
C1 --> C2[s = MAD x, t = med]
C2 --> C3[minobs = 4]
B3 --> D{n < minobs?}
C3 --> D
D -- Yes --> E{s <= implbound?}
E -- Yes --> F{fallback?}
F -- adm --> F1([Return ADM x])
F -- na --> F2([Return NA])
E -- No --> G([Return s])
D -- No --> H{s <= implbound?}
H -- Yes --> F
H -- No --> J["u_i = (x_i - t) / (2cs), compute tanh(u_i)"]
J --> K["numer = Σtanh²/n − 0.5<br/>denom = (2/n)·Σ(u·tanh·sech²)"]
K --> LG{"|denom| <= 1e-14 * s?"}
LG -- Yes --> LF["s *= sqrt(2 * mean_tanh²)"]
LF --> J
LG -- No --> MM["Δs = s · numer / denom"]
MM --> MG{"s + Δs <= 0?"}
MG -- Yes --> MH["s /= 2"]
MH --> J
MG -- No --> MS["s ← s + Δs"]
MS --> N{"|Δs|/s ≤ tol?"}
N -- No --> J
N -- Yes --> O([Return s])
```
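As with `robLoc`, the flowchart can be transcribed into base R for study.
This is an illustrative sketch only: the tolerance, iteration cap, and the
degenerate-sample cutoff are assumptions, and the packaged C++ version
additionally supports a known location and the `fallback` argument.

```r
# Newton--Raphson M-scale with rho(u) = tanh^2(u / (2c)), solving
# mean(rho((x - t)/s)) = 1/2.
rob_scale_nr <- function(x, tol = 1e-8, max_iter = 50) {
  cc <- 0.37394112142347236                  # 50% breakdown tuning constant
  t <- median(x)
  s <- mad(x)                                # starting value S^(0) = MAD
  if (s <= .Machine$double.eps) return(NA_real_)  # degenerate sample
  n <- length(x)
  for (k in seq_len(max_iter)) {
    u     <- (x - t) / (2 * cc * s)
    th    <- tanh(u)
    numer <- mean(th^2) - 0.5
    denom <- (2 / n) * sum(u * th * (1 - th^2))
    if (abs(denom) <= 1e-14 * s) {           # degenerate-denominator guard:
      s <- s * sqrt(2 * mean(th^2))          # multiplicative fallback step
      next
    }
    ds <- s * numer / denom
    if (s + ds <= 0) { s <- s / 2; next }    # positivity guard: halve instead
    s <- s + ds
    if (abs(ds) / s <= tol) break
  }
  s
}
rob_scale_nr(c(1, 2, 3, 5, 7, 8))
```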
## 6. Confidence Intervals
All scale estimators support `ci = TRUE`:
```{r ci}
# Analytical CI (chi-squared for sd_c4, ARE-based for others)
sd_c4(c(1, 2, 3, 5, 7, 8), ci = TRUE)
gmd(c(1, 2, 3, 5, 7, 8), ci = TRUE)
# BCa bootstrap CI for scale_robust() ensemble
set.seed(42)
x_small <- rnorm(10)
scale_robust(x_small, ci = TRUE, n_boot = 500)
```
The CI return type is a named list with `estimate`, `lower`, `upper`, and
`level`. For `scale_robust()` in ensemble mode, the object class is
`robscale_ensemble_ci` and includes per-estimator weights.
## 7. Performance Architecture
### SIMD vectorization
The logistic psi function satisfies $\psi_{\log}(x) = \tanh(x/2)$. `robscale`
evaluates `tanh` in bulk using the fastest available platform backend.
The `configure` script selects one of two vectorized `tanh` libraries at
compile time; at runtime, CPUID selects the appropriate instruction set:
1. **Apple Accelerate** (`vvtanh`): array-wide SIMD on macOS.
2. **glibc libmvec** (Linux x86\_64, glibc $\geq$ 2.35) or **SLEEF**
   (fallback when libmvec is absent). Only one is linked per binary.
   The AVX2 4-wide kernel (`_ZGVdN4v_tanh` or `Sleef_tanhd4_u10avx2`)
   processes 4 doubles per iteration.
3. **OpenMP SIMD** (`#pragma omp simd`): compiler-guided portable fallback.
4. **Scalar** `std::tanh`: universal fallback.
For the bulk tanh dispatcher, vectorization requires $n \geq 8$. However,
`robLoc`'s fused AVX2 NR kernel activates at $n \geq 4$ (one full 4-wide vector).
The Gini mean difference uses a separate AVX2 FMA kernel
(`_mm256_fmadd_pd`) for its weighted sum $\sum (2i - (n-1))\, x_{(i)}$,
shared across the standalone `gmd()`, the ensemble bootstrap, and the BCa
jackknife. This kernel activates for $n \geq 8$ on AVX2 hardware.
### $O(n)$ median selection
`robscale` uses a tiered strategy rather than `sort()`:
- **$n \in \{8, 16, 32\}$ on AVX2 hardware**: SIMD selection networks
  (bitonic merge, 23--58% faster than scalar selection networks).
- **$n \leq 36$**: scalar selection networks for median computation.
- **$n \leq 56$**: branchless sorting networks for full-sort paths (GMD,
  $Q_n$ exact, $S_n$), compiled to conditional-move instructions at `-O2`:

  | $n$ | Comparators |
  | ---: | ---: |
  | 3 | 3 |
  | 4 | 5 |
  | 8 | 19 |
  | 16 | 60 |
  | 32 | 185 |
  | 56 | 438 |

- **$n \geq 37$ (selection) / $n \geq 57$ (sort)**: Floyd--Rivest
  [@floyd1975] or pdqselect, chosen per estimator via a runtime
  L2-cache-derived crossover threshold.
The crossover between Floyd--Rivest and pdqselect is derived at runtime from
the per-core L2 cache size; each estimator uses its own threshold based on
working-set pressure:
| Estimator | Strategy | Threshold |
| :--- | :--- | :--- |
| `iqr_scaled` | Always pdqselect | --- |
| `mad_scaled` | Floyd--Rivest vs. pdqselect | $\max(2048,\, L_2 / 40)$ |
| `robScale` | Floyd--Rivest vs. pdqselect | $\max(2048,\, L_2 / 16)$ |
| $S_n$ | Floyd--Rivest vs. pdqselect | $\max(2048,\, L_2 / 16)$ |
| $Q_n$ (final window) | Inline check | $\max(2048,\, L_2 / 32)$ |
### Stack-allocated memory arenas
A 128-double micro-buffer (1 KB) covers the smallest samples ($n \leq 128$
for MAD, $n \leq 64$ for `robScale` and `robLoc`). For $n \leq 2{,}048$,
stack-allocated arrays avoid heap allocation: `robScale` uses one 2,048-double
array (16 KB); `robLoc` uses one 4,096-double array (32 KB, split into
equal-sized buffer and deviation halves).
**Fused single-buffer algorithm.** After median selection the working buffer
holds a permutation of the input. Because MAD is order-invariant, absolute
deviations can be computed in-place, eliminating the second scratch array and
halving memory from $2n$ to $n$ doubles per estimator.
### Iteration convergence
**`robLoc`: Newton--Raphson.** Location estimation uses Newton--Raphson
(quadratic convergence, ~3 iterations) rather than the scoring fixed-point
method (~6--8 iterations):
$$
T^{(k+1)} = T^{(k)} + \frac{2\,S\sum \psi\!\left(\frac{x_i - T^{(k)}}{S}\right)}
{\sum \left[1 - \psi^2\!\left(\frac{x_i - T^{(k)}}{S}\right)\right]}
$$
Since $\tanh$ values are already computed for the numerator, the denominator
costs only squaring and subtraction. Typical iteration counts:
| $n$ | Scoring | Newton--Raphson |
| ---: | ---: | ---: |
| 4 | 7 | 2--4 |
| 8 | 7 | 2--4 |
| 20 | 6 | 2--4 |
| 100 | 5 | 2--4 |
**Fused AVX2 kernel for `robLoc()`.** `rob_loc_nr_step_avx2` collapses three
passes into one. Two AVX2 accumulators advance in lockstep:
$$
\texttt{acc\_psi} \mathrel{+}= p_i, \qquad \texttt{acc\_dpsi} \mathrel{+}= (1 - p_i^2)
$$
where $p_i = \tanh(u_i)$ is evaluated once. The derivative accumulator uses a
fused negative multiply-add (fnmadd) to accumulate $1 - p_i^2$ directly,
avoiding the catastrophic cancellation that would arise from subtracting two
nearly equal large sums ($n - \sum p_i^2$) when all $|u_i| \gg 1$.
**`robScale`: Newton--Raphson.** Scale estimation also uses Newton--Raphson,
applied to the M-scale estimating equation $n^{-1}\sum \rho(u_i) = 1/2$.
The fused single-pass kernel (`nr_scale_step_avx2` / `nr_scale_step_scalar`)
computes $\sum \tanh^2(u_i)$ and $\sum u_i \tanh(u_i)\,\text{sech}^2(u_i)$ in
one read over the data, yielding the NR update
$\Delta s = s \cdot (\bar\rho - \tfrac{1}{2}) / \bar{D}$
where $\bar{D}$ is the scaled derivative sum. The same quadratic convergence as
`robLoc` applies; typical iteration counts are 3--4. Two guards protect the
iteration. When the denominator is degenerate
($|\bar{D}| \leq 10^{-14} \times s$), a multiplicative fallback
$s \leftarrow s \times \sqrt{2 \times \overline{\tanh^2}}$ replaces the Newton
step, avoiding the catastrophic cancellation of subtracting two nearly equal
quantities. When the proposed update would make $s$ non-positive
($s + \Delta s \leq 0$), $s$ is halved instead.
### TBB parallelism and ensemble design
$Q_n$ and $S_n$ partition inner loops across TBB threads for $n$ above a
runtime L2-derived threshold. The ensemble bootstrap parallelizes across
TBB threads when $n \times n_{\text{boot}} \geq 10{,}000$; with default
settings ($n < 20$, $n_{\text{boot}} = 200$) the bootstrap runs serially.
The ensemble bootstrap stores results in estimator-major layout
($7 \times n_{\text{boot}}$): each estimator's replicates occupy a contiguous
memory column, so the mean/variance reduction pass reads at stride-1 rather
than stride-7. Within each replicate, the bootstrap resample is sorted once
and reused by all seven estimators; pre-allocated workspace buffers eliminate
per-replicate heap allocation for $S_n$ and $Q_n$. Bootstrap resampling uses
a deterministic XorShift32 PRNG [@marsaglia2003] seeded per replicate for
reproducibility.
### Numerical stability and compile-time optimizations
- `sd_c4` uses Welford's one-pass algorithm [@welford1962] for numerically
stable variance.
- The reciprocal constant $1/c$ is `constexpr`, replacing an inner-loop
division with multiplication.
- Loop-invariant quantities (`inv_s`, `half_inv_s`, `inv_n`) are hoisted
before the iteration loop.
- Sorting-network entry points are instantiated in a single translation unit,
reducing cold build time from ~300 s to ~52 s.
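Welford's recurrence is worth seeing in miniature: it keeps a running mean
and a running sum of squared deviations, updating both per observation. The
base-R transcription below is illustrative only; the package's version lives
in C++.

```r
# Welford's one-pass variance: numerically stable even when the mean is
# large relative to the spread, where a naive sum-of-squares formulation
# cancels catastrophically.
welford_sd <- function(x) {
  mean_k <- 0
  m2 <- 0
  for (k in seq_along(x)) {
    delta  <- x[k] - mean_k
    mean_k <- mean_k + delta / k
    m2     <- m2 + delta * (x[k] - mean_k)   # uses the updated mean
  }
  sqrt(m2 / (length(x) - 1))
}
x <- 1e9 + c(4, 7, 13, 16)                   # large offset, small spread
welford_sd(x)
sd(x)                                        # R's sd() is also stable here
```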
## 8. Mathematical Foundations
### Logistic psi function
[@rousseeuw2002, Eq. 23]:
$$
\psi_{\log}(x) = \frac{e^x - 1}{e^x + 1} = \tanh(x/2)
$$
Bounded in $(-1, 1)$, $C^\infty$, strictly monotone. Boundedness provides
robustness; smoothness avoids the corner artifacts of Huber's psi at small $n$.
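The identity and the bounds are easy to check numerically:

```r
# psi_log(x) = (e^x - 1)/(e^x + 1) coincides with tanh(x/2) and is
# bounded in (-1, 1).
x <- seq(-5, 5, by = 0.25)
psi <- (exp(x) - 1) / (exp(x) + 1)
all.equal(psi, tanh(x / 2))
range(psi)          # strictly inside (-1, 1)
```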
### Decoupled estimation
Location and scale are estimated separately with a fixed auxiliary estimate,
breaking the positive-feedback loop of Huber's Proposal 2. `robLoc` fixes
scale at $\text{MAD}(x)$; `robScale` fixes location at $\text{median}(x)$.
### Rho function for scale
[@rousseeuw2002, Eq. 26]:
$$
\rho_{\log}(x) = \psi_{\log}^2(x / c)
$$
where $c = 0.37394112142347236$ is the constant that yields a 50% breakdown
point.
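The role of $c$ can be checked numerically: it makes the expected $\rho$ of a
standard normal variate equal to $1/2$, which (with $\rho$ bounded by 1) is
the condition that yields 50% breakdown for the resulting M-scale. A quick
quadrature check, illustrative only:

```r
# E[rho(Z)] = E[tanh^2(Z / (2c))] for Z ~ N(0, 1) should be 1/2.
cc <- 0.37394112142347236
val <- integrate(function(z) tanh(z / (2 * cc))^2 * dnorm(z),
                 -Inf, Inf)$value
val   # approximately 0.5
```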
### $Q_n$ and $S_n$ statistics
$$Q_n = c_n \cdot 2.2191 \cdot \{|x_i - x_j|;\; i < j\}_{(k)}$$
where $k = \binom{h}{2}$, $h = \lfloor n/2 \rfloor + 1$, and $c_n$ is the
finite-sample correction factor.
$$S_n = c_n' \cdot 1.1926 \cdot \operatorname{lomed}_i \{\operatorname{himed}_j |x_i - x_j|\}$$
where $\operatorname{lomed}$ and $\operatorname{himed}$ denote the low and high
medians respectively.
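Both definitions can be evaluated naively in base R for small $n$. The
finite-sample corrections $c_n$ and $c_n'$ are omitted here, so the results
differ slightly from `qn()` and `sn()`; the production code also avoids
materializing all $\binom{n}{2}$ pairs.

```r
# Naive O(n^2) evaluation of the Qn and Sn definitions
# (finite-sample corrections omitted).
qn_naive <- function(x) {
  n <- length(x)
  h <- floor(n / 2) + 1
  k <- choose(h, 2)
  d <- abs(outer(x, x, "-"))
  pairs <- d[lower.tri(d)]                  # all |x_i - x_j|, i < j
  2.2191 * sort(pairs)[k]                   # k-th order statistic
}
sn_naive <- function(x) {
  n <- length(x)
  himed <- function(v) sort(v)[length(v) %/% 2 + 1]    # high median
  lomed <- function(v) sort(v)[(length(v) + 1) %/% 2]  # low median
  inner <- vapply(seq_len(n),
                  function(i) himed(abs(x[i] - x[-i])), numeric(1))
  1.1926 * lomed(inner)
}
x <- c(1, 2, 3, 5, 7, 8)
qn_naive(x)
sn_naive(x)
```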
### $c_4(n)$ correction factor
$$
c_4(n) = \sqrt{\frac{2}{n{-}1}} \cdot \frac{\Gamma(n/2)}{\Gamma((n{-}1)/2)}
$$
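The factor is cheap to compute; working through `lgamma` keeps it stable for
large $n$, where the two gamma values individually overflow. The helper
`c4()` below is local to this example; the package exposes its constants
through `get_consistency_constant()`.

```r
# c4(n) via log-gamma to avoid overflow of Gamma(n/2) for large n.
c4 <- function(n) {
  sqrt(2 / (n - 1)) * exp(lgamma(n / 2) - lgamma((n - 1) / 2))
}
c4(2)                # sqrt(2/pi), about 0.798
c4(5)                # about 0.940
c4(1000)             # approaches 1 as n grows
# Dividing sd(x) by c4(n) removes the small-sample downward bias:
x <- c(1.2, 0.8, 1.5, 0.9, 1.1)
sd(x) / c4(length(x))
```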
### Gini mean difference
$$
\text{GMD}(x) = C_{\text{GMD}} \cdot \frac{2}{n(n{-}1)}\sum_{i=1}^{n} (2i - n - 1)\, x_{(i)}
$$
Algebraically, the order-statistics sum equals the mean absolute pairwise
difference $\frac{1}{\binom{n}{2}}\sum_{i<j} |x_i - x_j|$, but requires only
a single $O(n \log n)$ sort rather than an $O(n^2)$ pass over all pairs.