---
title: "Introduction to robscale"
output: rmarkdown::html_vignette
bibliography: ../inst/references.bib
vignette: >
  %\VignetteIndexEntry{Introduction to robscale}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

## 1. The Robustness Problem

Outliers compromise the reliability of classical estimators even in moderate samples. A single recording error can inflate the standard deviation arbitrarily, yet the standard deviation remains the default scale estimator in most statistical software. To see why, consider a small sample with one gross error:

```{r robustness-demo}
library(robscale)

x_clean        <- c(2.1, 2.3, 2.0, 2.4, 2.2, 2.1, 2.3, 1.9)
x_contaminated <- c(x_clean, 200)  # one recording error

sd(x_clean)
sd(x_contaminated)   # breaks down

gmd(x_clean)
gmd(x_contaminated)  # barely affected

qn(x_clean)
qn(x_contaminated)   # barely affected
```

The **breakdown point** is the fraction of contaminated observations an estimator tolerates before producing an arbitrarily large result. The standard deviation has a 0% breakdown point: a single outlier can drive it arbitrarily high. The GMD achieves 29.3%; `qn` and `sn` achieve the maximum 50%.

The **asymptotic relative efficiency** (ARE) measures statistical efficiency relative to the sample standard deviation under the Gaussian model [@bickel1976]. High efficiency means fewer observations are needed to match the standard deviation's precision. The tradeoff is inherent: a higher breakdown point generally costs efficiency. `robscale` exposes all major points on this robustness--efficiency frontier.

## 2. Getting Started

### Installation

```r
install.packages("robscale")

# Development version:
# install.packages("remotes")
# remotes::install_github("davdittrich/robscale")
```

### Quick demo

```{r quick-demo}
library(robscale)

set.seed(42)
x <- c(rnorm(20), 50)  # 20 clean observations + one outlier

scale_robust(x)             # auto-selects best strategy
scale_robust(x, ci = TRUE)  # with confidence interval

data.frame(
  estimator = c("sd", "sd_c4", "gmd", "qn", "mad_scaled"),
  value     = round(c(sd(x), sd_c4(x), gmd(x), qn(x), mad_scaled(x)), 4)
)
```

## 3. The Estimator Spectrum

`robscale` provides 11 exported functions spanning the full robustness--efficiency spectrum:

| Function | ARE | Breakdown | Description |
| :--- | ---: | ---: | :--- |
| `sd_c4(x)` | 100% | 0% | Bias-corrected standard deviation |
| `gmd(x)` | 98% | 29.3% | Gini mean difference |
| `adm(x)` | 88.3% | $1/n$ | Average deviation from median |
| `qn(x)` | 82.3% | 50% | Rousseeuw--Croux $Q_n$ |
| `sn(x)` | 58.2% | 50% | Rousseeuw--Croux $S_n$ |
| `iqr_scaled(x)` | 37% | 25% | Scaled interquartile range |
| `mad_scaled(x)` | 36.7% | 50% | Scaled median absolute deviation |
| `robScale(x)` | ~55% | 50% | M-estimator for scale |
| `robLoc(x)` | 98.4% | 50% | M-estimator for location |
| `scale_robust(x)` | ensemble | --- | Variance-weighted dispatcher |
| `get_consistency_constant(n)` | --- | --- | Finite-sample bias correction factors |

### Decision guide

- **Maximum efficiency, no outliers expected:** `sd_c4`
- **High efficiency with moderate outlier protection:** `gmd`
- **Maximum breakdown with good efficiency:** `qn`
- **Fast computation on large data:** `mad_scaled` or `iqr_scaled`
- **Uncertain which estimator to use:** `scale_robust()` (adaptive)
- **M-estimate of location:** `robLoc`
- **M-estimate of scale:** `robScale`

## 4. The `scale_robust()` Dispatcher

`scale_robust()` adapts its strategy to the sample size automatically.
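The size-based dispatch and the variance-weighted combination can be sketched in a few lines of base R. This is an illustration only, not the package's implementation: `my_scale_robust` is a hypothetical name, three base-R estimators stand in for `robscale`'s seven, the large-$n$ branch hands off to `mad()` where the package uses `gmd()`, and the $n < 20$ threshold is the one documented for `scale_robust()`.

```r
# Illustrative sketch of scale_robust()'s dispatch logic (hypothetical code).
my_scale_robust <- function(x, n_boot = 200) {
  estimators <- list(
    sd  = sd,
    mad = mad,                        # already normal-consistent
    iqr = function(x) IQR(x) / 1.349  # scaled IQR
  )
  if (length(x) >= 20) {
    return(mad(x))  # large n: one high-efficiency estimator (gmd() in robscale)
  }
  # Small n: bootstrap each estimator, then combine point estimates with
  # inverse-variance weights (low bootstrap variance => high weight).
  boot <- replicate(n_boot, {
    xb <- sample(x, replace = TRUE)
    vapply(estimators, function(f) f(xb), numeric(1))
  })                                  # 3 x n_boot matrix of replicates
  est <- vapply(estimators, function(f) f(x), numeric(1))
  w   <- 1 / apply(boot, 1, var)
  sum(w * est) / sum(w)
}

set.seed(1)
my_scale_robust(rnorm(12))
```

The weighting lets whichever stand-in estimator is most stable on this particular sample dominate the combination.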
### Ensemble mode ($n < 20$)

For small samples, all 7 scale estimators run on bootstrap replicates. Inverse-variance weights---where the variance is the bootstrap variance of each estimator across replicates---combine them into a single estimate:

$$
\hat\sigma = \frac{\sum_{j=1}^{7} \hat\sigma_j(x) \,/\, \operatorname{Var}_{\mathrm{boot}}(\hat\sigma_j)}{\sum_{j=1}^{7} 1 \,/\, \operatorname{Var}_{\mathrm{boot}}(\hat\sigma_j)}
$$

```{r ensemble}
set.seed(1)
x_small <- rnorm(12)

scale_robust(x_small)             # ensemble (n = 12 < 20)
scale_robust(x_small, ci = TRUE)  # BCa bootstrap CI
```

### Automatic switch to GMD ($n \geq 20$)

For larger samples, `scale_robust()` uses `gmd()` directly: 98% ARE, 29.3% breakdown, $O(n \log n)$.

```{r auto-switch}
x_large <- rnorm(500)
scale_robust(x_large)  # equivalent to gmd(x_large)
```

### Explicit method selection

```{r explicit-method}
scale_robust(x_small, method = "qn")        # force Qn
scale_robust(x_small, method = "robScale")  # force M-scale
scale_robust(x_small, auto_switch = FALSE)  # keep ensemble at any n
```

## 5. Individual Estimators

### `sd_c4(x)`: Bias-corrected standard deviation

Applies the exact $c_4(n)$ finite-sample correction to remove the downward bias of the sample standard deviation under normality:

```{r sd-c4}
x <- c(1.2, 0.8, 1.5, 0.9, 1.1)

sd_c4(x)  # unbiased under normality
sd(x)     # biased downward for small n

sd_c4(x, ci = TRUE)
```

### `gmd(x)`: Gini mean difference

Average absolute pairwise difference [@gini1912], computed via the order-statistics form in $O(n \log n)$:

```{r gmd}
gmd(c(1, 2, 3, 5, 7, 8))
gmd(c(1, 2, 3, 5, 7, 8), ci = TRUE)
```

### `adm(x)`: Average deviation from the median

The mean deviation from the median [@nair1936; @nair1947], scaled for consistency at the normal distribution.
```{r adm}
adm(c(1, 2, 3, 5, 7, 8))
```

### `qn(x)`: Rousseeuw--Croux $Q_n$ [@rousseeuw1993]

```{r qn}
qn(c(1, 2, 3, 5, 7, 8))
qn(c(1, 2, 3, 5, 7, 8), ci = TRUE)
```

### `sn(x)`: Rousseeuw--Croux $S_n$

```{r sn}
sn(c(1, 2, 3, 5, 7, 8))
sn(c(1, 2, 3, 5, 7, 8), ci = TRUE)
```

### `iqr_scaled(x)` and `mad_scaled(x)`

```{r iqr-mad}
iqr_scaled(c(1, 2, 3, 5, 7, 8))
mad_scaled(c(1, 2, 3, 5, 7, 8))
```

### `robLoc(x)`: M-estimator for location

M-estimator for location via Newton--Raphson iteration on the logistic psi function [@rousseeuw2002, Eq. 21]. Starting value: $T^{(0)} = \text{median}(x)$; auxiliary scale: $S = \text{MAD}(x)$.

**Fallback:** When scale is unknown and $n < 4$, or when scale is known and $n < 3$, returns `median(x)` without iteration.

```{r robloc}
robLoc(c(1, 2, 3, 5, 7, 8))
robLoc(c(1, 2, 3), scale = 1.5)  # known scale enables n = 3
```

Newton--Raphson algorithm for `robLoc()`:

```
flowchart TD
    A[Set minobs: 3 if scale known, 4 if unknown] --> B[med = median x]
    B --> C{n < minobs?}
    C -- Yes --> D([Return med])
    C -- No --> E{Scale known?}
    E -- Yes --> F[s = scale]
    E -- No --> G[s = MAD x]
    F --> H{s = 0?}
    G --> H
    H -- Yes --> I([Return med])
    H -- No --> J[t = med]
    J --> K[psi_i = tanh( (x_i - t) / (2s) )]
    K --> L[sum_psi = Sum psi_i, sum_dpsi = Sum (1 - psi_i^2)]
    L --> M[v = 2s * sum_psi / sum_dpsi, t += v]
    M --> N{"|v| <= tol * max(|t|, 1)?"}
    N -- No --> K
    N -- Yes --> O([Return t])
```

### `robScale(x)`: M-estimator for scale

M-estimator for scale via Newton--Raphson iteration on the M-scale estimating equation [@rousseeuw2002]. Starting value: $S^{(0)} = \text{MAD}(x)$.
```{r robscale}
robScale(c(1, 2, 3, 5, 7, 8))
robScale(c(5, 5, 5, 5, 6), fallback = "na")  # revss compatibility
robScale(c(1, 2, 3, 5, 7, 8), ci = TRUE)
```

Newton--Raphson iteration for `robScale()`:

```
flowchart TD
    A{Location known?}
    A -- Yes --> B1[w = x - loc]
    B1 --> B2[s = 1.4826 * median( abs(w) )]
    B2 --> B3[t = 0, minobs = 3]
    A -- No --> C1[med = median x]
    C1 --> C2[s = MAD x, t = med]
    C2 --> C3[minobs = 4]
    B3 --> D{n < minobs?}
    C3 --> D
    D -- Yes --> E{s <= implbound?}
    E -- Yes --> F{fallback?}
    F -- adm --> F1([Return ADM x])
    F -- na --> F2([Return NA])
    E -- No --> G([Return s])
    D -- No --> H{s <= implbound?}
    H -- Yes --> F
    H -- No --> J["u_i = (x_i - t) / (2cs), compute tanh(u_i)"]
    J --> K["numer = Σtanh²/n − 0.5, denom = (2/n)·Σ(u·tanh·sech²)"]
    K --> LG{"|denom| <= 1e-14 * s?"}
    LG -- Yes --> LF["s *= sqrt(2 * mean_tanh²)"]
    LF --> J
    LG -- No --> MM["Δs = s · numer / denom"]
    MM --> MG{"s + Δs <= 0?"}
    MG -- Yes --> MH["s /= 2"]
    MH --> J
    MG -- No --> MS["s ← s + Δs"]
    MS --> N{"|Δs|/s ≤ tol?"}
    N -- No --> J
    N -- Yes --> O([Return s])
```

## 6. Confidence Intervals

All scale estimators support `ci = TRUE`:

```{r ci}
# Analytical CI (chi-squared for sd_c4, ARE-based for others)
sd_c4(c(1, 2, 3, 5, 7, 8), ci = TRUE)
gmd(c(1, 2, 3, 5, 7, 8), ci = TRUE)

# BCa bootstrap CI for the scale_robust() ensemble
set.seed(42)
x_small <- rnorm(10)
scale_robust(x_small, ci = TRUE, n_boot = 500)
```

The CI return value is a named list with `estimate`, `lower`, `upper`, and `level`. For `scale_robust()` in ensemble mode, the object class is `robscale_ensemble_ci` and includes per-estimator weights.

## 7. Performance Architecture

### SIMD vectorization

The logistic psi function satisfies $\psi_{\log}(x) = \tanh(x/2)$. `robscale` evaluates `tanh` in bulk using the fastest available platform backend. The `configure` script selects one of two vectorized `tanh` libraries at compile time; at runtime, CPUID selects the appropriate instruction set:

1. **Apple Accelerate** (`vvtanh`): array-wide SIMD on macOS.
2. **glibc libmvec** (Linux x86\_64, glibc $\geq$ 2.35) or **SLEEF** (fallback when libmvec is absent). Only one is linked per binary. The AVX2 4-wide kernel (`_ZGVdN4v_tanh` or `Sleef_tanhd4_u10avx2`) processes 4 doubles per iteration.
3. **OpenMP SIMD** (`#pragma omp simd`): compiler-guided portable fallback.
4. **Scalar** `std::tanh`: universal fallback.

For the bulk tanh dispatcher, vectorization requires $n \geq 8$. However, `robLoc`'s fused AVX2 NR kernel activates at $n \geq 4$ (one full 4-wide vector).
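A scalar reference makes clear what the fused `robLoc` kernel has to compute per element: one `tanh`, then two running sums, $\sum \psi_i$ and $\sum (1 - \psi_i^2)$, which together give the Newton--Raphson step. The sketch below transcribes that loop into base R; `my_rob_loc` is a hypothetical name, not the package's C++ kernel, and the fallback logic is simplified to the unknown-scale case.

```r
# Scalar base-R reference for the robLoc Newton-Raphson iteration:
# psi(u) = tanh(u) applied to u_i = (x_i - t) / (2s).
my_rob_loc <- function(x, tol = 1e-9, max_iter = 50) {
  t <- median(x)
  s <- mad(x)                            # auxiliary scale S = MAD(x)
  if (length(x) < 4 || s == 0) return(t) # fallback: plain median
  for (k in seq_len(max_iter)) {
    p    <- tanh((x - t) / (2 * s))      # psi values, one tanh each
    step <- 2 * s * sum(p) / sum(1 - p^2)
    t    <- t + step
    if (abs(step) <= tol * max(abs(t), 1)) break
  }
  t
}

my_rob_loc(c(1, 2, 3, 5, 7, 8))
```

Because `p` is reused in both sums, the derivative term costs only a squaring and a subtraction on top of the numerator, which is exactly the property the fused one-pass kernel exploits.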
The Gini mean difference uses a separate AVX2 FMA kernel (`_mm256_fmadd_pd`) for its weighted sum $\sum (2i - (n-1))\, x_{(i)}$, shared across the standalone `gmd()`, the ensemble bootstrap, and the BCa jackknife. This kernel activates for $n \geq 8$ on AVX2 hardware.

### $O(n)$ median selection

`robscale` uses a tiered strategy rather than `sort()`:

- **$n \in \{8, 16, 32\}$ on AVX2 hardware:** SIMD selection networks (bitonic merge, 23--58% faster than scalar selection networks).
- **$n \leq 36$:** scalar selection networks for median computation.
- **$n \leq 56$:** branchless sorting networks for full-sort paths (GMD, $Q_n$ exact, $S_n$), compiled to conditional-move instructions at `-O2`.
- **$n \geq 37$ (selection) / $n \geq 57$ (sort):** Floyd--Rivest algorithm [@floyd1975] or pdqselect, chosen per estimator.

Comparator counts for the branchless sorting networks:

| $n$ | Comparators |
| ---: | ---: |
| 3 | 3 |
| 4 | 5 |
| 8 | 19 |
| 16 | 60 |
| 32 | 185 |
| 56 | 438 |

The crossover between Floyd--Rivest and pdqselect is derived at runtime from the per-core L2 cache size; each estimator uses its own threshold based on working-set pressure:

| Estimator | Strategy | Threshold |
| :--- | :--- | :--- |
| `iqr_scaled` | Always pdqselect | --- |
| `mad_scaled` | Floyd--Rivest vs. pdqselect | $\max(2048,\, L_2 / 40)$ |
| `robScale` | Floyd--Rivest vs. pdqselect | $\max(2048,\, L_2 / 16)$ |
| $S_n$ | Floyd--Rivest vs. pdqselect | $\max(2048,\, L_2 / 16)$ |
| $Q_n$ (final window) | Inline check | $\max(2048,\, L_2 / 32)$ |

### Stack-allocated memory arenas

A 128-double micro-buffer (1 KB) covers the smallest samples ($n \leq 128$ for MAD, $n \leq 64$ for `robScale` and `robLoc`). For $n \leq 2{,}048$, stack-allocated arrays avoid heap allocation: `robScale` uses one 2,048-double array (16 KB); `robLoc` uses one 4,096-double array (32 KB, split into equal-sized buffer and deviation halves).
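Base R exposes the same selection-instead-of-sorting idea through `sort(x, partial = k)`, which only guarantees that the requested order statistics land in their sorted positions. The sketch below computes a median that way; `median_partial` is a hypothetical name, and R's partial sort is of course not the package's Floyd--Rivest/pdqselect path, just the same $O(n)$ principle.

```r
# Median via partial selection rather than a full sort.
median_partial <- function(x) {
  n <- length(x)
  if (n %% 2 == 1) {
    k <- (n + 1) / 2
    sort(x, partial = k)[k]                       # single order statistic
  } else {
    k <- n / 2
    mean(sort(x, partial = c(k, k + 1))[c(k, k + 1)])  # average of the two
  }
}

median_partial(c(9, 1, 7, 3, 5))  # same value as median(c(9, 1, 7, 3, 5))
```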
**Fused single-buffer algorithm.** After median selection the working buffer holds a permutation of the input. Because MAD is order-invariant, absolute deviations can be computed in place, eliminating the second scratch array and halving memory from $2n$ to $n$ doubles per estimator.

### Iteration convergence

**`robLoc`: Newton--Raphson.** Location estimation uses Newton--Raphson (quadratic convergence, ~3 iterations) rather than the scoring fixed-point method (~6--8 iterations):

$$
T^{(k+1)} = T^{(k)} + \frac{2\,S\sum \psi\!\left(\frac{x_i - T^{(k)}}{2S}\right)}
{\sum \left[1 - \psi^2\!\left(\frac{x_i - T^{(k)}}{2S}\right)\right]}
$$

Since $\tanh$ values are already computed for the numerator, the denominator costs only squaring and subtraction. Typical iteration counts:

| $n$ | Scoring | Newton--Raphson |
| ---: | ---: | ---: |
| 4 | 7 | 2--4 |
| 8 | 7 | 2--4 |
| 20 | 6 | 2--4 |
| 100 | 5 | 2--4 |

**Fused AVX2 kernel for `robLoc()`.** `rob_loc_nr_step_avx2` collapses three passes into one. Two AVX2 accumulators advance in lockstep:

$$
\texttt{acc\_psi} \mathrel{+}= p_i, \qquad \texttt{acc\_dpsi} \mathrel{+}= (1 - p_i^2)
$$

where $p_i = \tanh(u_i)$ is evaluated once. The derivative accumulator uses a fused negative multiply-add (`fnmadd`) to accumulate $1 - p_i^2$ directly, avoiding the catastrophic cancellation that would arise from subtracting two large sums ($n - \sum p_i^2$) when all $|u_i| \gg 1$.

**`robScale`: Newton--Raphson.** Scale estimation also uses Newton--Raphson, applied to the M-scale estimating equation $n^{-1}\sum \rho(u_i) = 1/2$. The fused single-pass kernel (`nr_scale_step_avx2` / `nr_scale_step_scalar`) computes $\sum \tanh^2(u_i)$ and $\sum u_i \tanh(u_i)\,\mathrm{sech}^2(u_i)$ in one read over the data, yielding the NR update $\Delta s = s \cdot (\bar\rho - \tfrac{1}{2}) / \bar{D}$, where $\bar{D}$ is the scaled derivative sum. The same quadratic convergence as `robLoc` applies; typical iteration counts are 3--4.
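The scale update can likewise be transcribed into base R. This is a sketch, not the package's kernel: `my_rob_scale` is a hypothetical name, the guard and fallback logic is simplified, and the tuning constant is the 50%-breakdown value quoted for the logistic rho function.

```r
# Base-R sketch of the robScale Newton-Raphson iteration on the
# M-scale estimating equation mean(tanh^2(u)) = 1/2,
# with u_i = (x_i - t) / (2 * c * s).
my_rob_scale <- function(x, tol = 1e-9, max_iter = 50) {
  cc <- 0.37394112142347236        # 50%-breakdown tuning constant c
  t  <- median(x)
  s  <- mad(x)                     # starting value S^(0)
  if (length(x) < 4 || s == 0) return(s)
  for (k in seq_len(max_iter)) {
    u     <- (x - t) / (2 * cc * s)
    th    <- tanh(u)
    numer <- mean(th^2) - 0.5
    denom <- 2 * mean(u * th * (1 - th^2))  # sech^2(u) = 1 - tanh^2(u)
    if (abs(denom) <= 1e-14 * s) {          # degenerate-denominator guard
      s <- s * sqrt(2 * mean(th^2))         # multiplicative fallback
      next
    }
    step <- s * numer / denom
    if (s + step <= 0) { s <- s / 2; next } # keep the scale positive
    s <- s + step
    if (abs(step) / s <= tol) break
  }
  s
}

my_rob_scale(c(1, 2, 3, 5, 7, 8))
```

At convergence, `mean(tanh(u)^2)` equals one half by construction, which is a convenient way to sanity-check the iteration.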
A degenerate-denominator guard applies two protections. When the denominator is degenerate ($|\bar{D}| \leq 10^{-14} \times s$), a multiplicative fallback $s \leftarrow s \times \sqrt{2 \times \overline{\tanh^2}}$ replaces the Newton step, avoiding the catastrophic cancellation of subtracting two large numbers. A separate guard halves $s$ when the proposed update would make $s$ non-positive ($s + \Delta s \leq 0$).

### TBB parallelism and ensemble design

$Q_n$ and $S_n$ partition inner loops across TBB threads for $n$ above a runtime L2-derived threshold. The ensemble bootstrap parallelizes across TBB threads when $n \times n_{\text{boot}} \geq 10{,}000$; with default settings ($n < 20$, $n_{\text{boot}} = 200$) the bootstrap runs serially.

The ensemble bootstrap stores results in estimator-major layout ($7 \times n_{\text{boot}}$): each estimator's replicates occupy a contiguous run of memory, so the mean/variance reduction pass reads at stride-1 rather than stride-7. Within each replicate, the bootstrap resample is sorted once and reused by all seven estimators; pre-allocated workspace buffers eliminate per-replicate heap allocation for $S_n$ and $Q_n$.

Bootstrap resampling uses a deterministic XorShift32 PRNG [@marsaglia2003] seeded per replicate for reproducibility.

### Numerical stability and compile-time optimizations

- `sd_c4` uses Welford's one-pass algorithm [@welford1962] for numerically stable variance.
- The reciprocal constant $1/c$ is `constexpr`, replacing an inner-loop division with a multiplication.
- Loop-invariant quantities (`inv_s`, `half_inv_s`, `inv_n`) are hoisted before the iteration loop.
- Sorting-network entry points are instantiated in a single translation unit, reducing cold build time from ~300 s to ~52 s.

## 8. Mathematical Foundations

### Logistic psi function

[@rousseeuw2002, Eq. 23]:

$$
\psi_{\log}(x) = \frac{e^x - 1}{e^x + 1} = \tanh(x/2)
$$

Bounded in $(-1, 1)$, $C^\infty$, strictly monotone.
Boundedness provides robustness; smoothness avoids the corner artifacts of Huber's psi at small $n$.

### Decoupled estimation

Location and scale are estimated separately, each with a fixed auxiliary estimate, breaking the positive-feedback loop of Huber's Proposal 2. `robLoc` fixes scale at $\text{MAD}(x)$; `robScale` fixes location at $\text{median}(x)$.

### Rho function for scale

[@rousseeuw2002, Eq. 26]:

$$
\rho_{\log}(x) = \psi_{\log}^2(x / c)
$$

where $c = 0.37394112142347236$ is the constant that yields a 50% breakdown point.

### $Q_n$ and $S_n$ statistics

$$
Q_n = c_n \cdot 2.2191 \cdot \{|x_i - x_j|;\; i < j\}_{(k)}
$$

where $k = \binom{h}{2}$, $h = \lfloor n/2 \rfloor + 1$, and $c_n$ is the finite-sample correction factor.

$$
S_n = c_n' \cdot 1.1926 \cdot \operatorname{lomed}_i \{\operatorname{himed}_j\, |x_i - x_j|\}
$$

where $\operatorname{lomed}$ and $\operatorname{himed}$ denote the low and high medians, respectively.

### $c_4(n)$ correction factor

$$
c_4(n) = \sqrt{\frac{2}{n-1}} \cdot \frac{\Gamma(n/2)}{\Gamma((n-1)/2)}
$$

### Gini mean difference

$$
\text{GMD}(x) = C_{\text{GMD}} \cdot \frac{2}{n(n-1)}\sum_{i=1}^{n} (2i - n - 1)\, x_{(i)}
$$

Algebraically equivalent to $\frac{1}{\binom{n}{2}}\sum_{i<j} |x_i - x_j|$, the mean absolute pairwise difference, but computable in $O(n \log n)$ via a single sort.
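The equivalence between the $O(n^2)$ pairwise definition and the order-statistics form is easy to check numerically in base R. A sketch: the function names are hypothetical, and the consistency factor $C_{\text{GMD}}$ is omitted so both sides are the raw mean absolute pairwise difference.

```r
# Raw Gini mean difference two ways: O(n^2) pairwise definition
# versus the O(n log n) order-statistics form.
gmd_pairwise <- function(x) {
  mean(dist(x))                      # average |x_i - x_j| over all i < j
}
gmd_orderstat <- function(x) {
  n <- length(x)
  i <- seq_len(n)
  2 / (n * (n - 1)) * sum((2 * i - n - 1) * sort(x))
}

x <- c(1, 2, 3, 5, 7, 8)
gmd_pairwise(x)
gmd_orderstat(x)   # agrees with gmd_pairwise(x) up to rounding
```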