---
title: "miceFast - Introduction and Advanced Usage"
author: "Maciej Nasinski"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{miceFast - Introduction}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```

# Overview

**miceFast** provides fast imputation methods built on C++/RcppArmadillo (Eddelbuettel & Sanderson, 2014). The main interfaces are:

- `fill_NA` and `fill_NA_N` work with **dplyr** and **data.table** pipelines.
- `new(miceFast)` provides an OOP interface for maximum control.
- `pool()` combines results from multiply imputed datasets using Rubin's rules (Rubin, 1987) with the Barnard-Rubin (1999) degrees-of-freedom adjustment.

For missing-data theory (MCAR/MAR/MNAR) and full MI workflows, see the companion vignette: [Missing Data Mechanisms and Multiple Imputation](missing-data-and-imputation.html). For a comprehensive treatment, see van Buuren (2018).

```{r load-packages}
library(miceFast)
library(dplyr)
library(data.table)
library(ggplot2)
set.seed(123456)
```

# Available Imputation Models

**miceFast** supports several models via the `model` argument in `fill_NA()` and `fill_NA_N()`:

| Model | Type | Stochastic? | Use case |
|-------|------|-------------|----------|
| `lm_pred` | Linear regression | No | Deterministic point prediction. With only an `Intercept` column, equivalent to mean imputation. |
| `lm_bayes` | Bayesian linear regression | Yes | Draws coefficients from their posterior. Recommended for MI of continuous variables. |
| `lm_noise` | Linear regression + noise | Yes (improper) | Adds residual noise but omits parameter uncertainty. Useful for sensitivity analysis, not recommended as the primary MI model. |
| `lda` | Linear Discriminant Analysis | Approximate | For **categorical** variables. With a random `ridge` parameter it becomes stochastic. Suitable for MI but not strictly proper (uses ad-hoc perturbation, not a posterior draw). |
| `pmm` | Predictive Mean Matching | Yes (proper) | Imputes from observed values via nearest-neighbour matching on Bayesian-posterior predictions. Works for both **continuous** and **categorical** variables. Available through `fill_NA_N()` / OOP `impute()` / `impute_N()`. |

For MI you need a stochastic model. **`lm_bayes`** is strictly **proper** (Rubin, 1987): it draws both coefficients and residual variance from their posterior. **`pmm`** is also **proper**: it draws coefficients from the posterior for predictions on missing rows, then matches to the nearest observed values (Type II PMM; van Buuren, 2018). It works for both continuous and categorical variables and naturally preserves the observed data distribution.

For categorical variables, **`lda`** with `ridge = runif(1, ...)` is an **approximate** approach. The random ridge creates between-imputation variability, but it is not a true posterior draw. `lm_noise` is **improper**: it adds residual noise but omits parameter uncertainty. Both `lda + ridge` and `lm_noise` work well in practice and are useful for sensitivity analysis. See the [MI vignette](missing-data-and-imputation.html).

# Example Data

The package ships with `air_miss`, an extended version of `airquality` with additional columns including a character variable, weights, groups, and a categorized Ozone variable.

```{r data}
data(air_miss)
str(air_miss)
```

## Examining Missingness

```{r upset, eval=requireNamespace("UpSetR", quietly=TRUE)}
upset_NA(air_miss, 6)
```

## Checking for Collinearity

Before imputing, check Variance Inflation Factors. Values above ~10 suggest problematic collinearity that can destabilize imputation models. Consider dropping or combining redundant predictors before imputing.
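For reference, the VIF of predictor $x_j$ is obtained by regressing $x_j$ on the remaining predictors:

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2},$$

where $R_j^2$ is the coefficient of determination of that auxiliary regression, so the usual cutoff of 10 corresponds to $R_j^2 = 0.9$.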
```{r vif}
VIF(
  air_miss,
  posit_y = "Ozone",
  posit_x = c("Solar.R", "Wind", "Temp", "x_character", "Day", "weights", "groups")
)
```

---

# Single Imputation with `fill_NA()`

`fill_NA()` imputes missing values in a single column using a specified model and predictors. It accepts column names (strings) or position indices and works inside `dplyr::mutate()` or `data.table` `:=` expressions.

## Basic usage (dplyr)

```{r fill-na-basic}
data(air_miss)

result <- air_miss %>%
  # Continuous variable: Bayesian linear model (stochastic)
  mutate(Ozone_imp = fill_NA(
    x = .,
    model = "lm_bayes",
    posit_y = "Ozone",
    posit_x = c("Solar.R", "Wind", "Temp")
  )) %>%
  # Categorical variable: LDA
  mutate(x_char_imp = fill_NA(
    x = .,
    model = "lda",
    posit_y = "x_character",
    posit_x = c("Wind", "Temp")
  ))

head(result[, c("Ozone", "Ozone_imp", "x_character", "x_char_imp")])
```

## With weights and grouped imputation

Grouping fits a separate model per group, which is useful when the relationship between variables varies across subpopulations. Weights allow for heteroscedasticity or survey designs.

```{r fill-na-grouped}
data(air_miss)

result_grouped <- air_miss %>%
  group_by(groups) %>%
  do(mutate(.,
    Solar_R_imp = fill_NA(
      x = .,
      model = "lm_pred",
      posit_y = "Solar.R",
      posit_x = c("Wind", "Temp", "Intercept"),
      w = .[["weights"]]
    )
  )) %>%
  ungroup()

head(result_grouped[, c("Solar.R", "Solar_R_imp", "groups")])
```

## Log-transformation

For right-skewed variables (like Ozone), use `logreg = TRUE` to impute on the log scale. The model fits on $\log(y)$ and back-transforms the predictions:

```{r fill-na-logreg}
data(air_miss)

result_log <- air_miss %>%
  mutate(Ozone_imp = fill_NA(
    x = .,
    model = "lm_bayes",
    posit_y = "Ozone",
    posit_x = c("Solar.R", "Wind", "Temp", "Intercept"),
    logreg = TRUE
  ))

# Compare distributions: log imputation avoids negative values
summary(result_log[c("Ozone", "Ozone_imp")])
```

## Using column position indices

You can refer to columns by position instead of name.
Check `names(air_miss)` to find the right positions:

```{r fill-na-position}
data(air_miss)

result_pos <- air_miss %>%
  mutate(Ozone_imp = fill_NA(
    x = .,
    model = "lm_bayes",
    posit_y = 1,
    posit_x = c(4, 6),
    logreg = TRUE
  ))

head(result_pos[, c("Ozone", "Ozone_imp")])
```

## Basic usage (data.table)

```{r fill-na-dt}
data(air_miss)
setDT(air_miss)

air_miss[, Ozone_imp := fill_NA(
  x = .SD,
  model = "lm_bayes",
  posit_y = "Ozone",
  posit_x = c("Solar.R", "Wind", "Temp")
)]

air_miss[, x_char_imp := fill_NA(
  x = .SD,
  model = "lda",
  posit_y = "x_character",
  posit_x = c("Wind", "Temp")
)]

# With grouping
air_miss[, Solar_R_imp := fill_NA(
  x = .SD,
  model = "lm_pred",
  posit_y = "Solar.R",
  posit_x = c("Wind", "Temp", "Intercept"),
  w = .SD[["weights"]]
), by = .(groups)]
```

---

# Multiple Imputation with `fill_NA_N()`

For `lm_bayes` and `lm_noise`, `fill_NA_N()` generates *k* stochastic draws per missing observation and returns their **average**. For `pmm`, *k* is the number of nearest observed neighbours from which one value is randomly selected (no averaging). In both cases the result is a single filled-in dataset. Between-imputation variance is lost, so Rubin's rules cannot be applied.

For MI with continuous variables, use `fill_NA()` + `pool()` with `lm_bayes`. For MI with **PMM** (proper, works for both continuous and categorical variables), use the OOP interface: call `impute("pmm", ...)` in a loop (see the [OOP section](#the-micefast-oop-module) and the [MI vignette](missing-data-and-imputation.html)).

## dplyr

```{r fill-na-n-dplyr}
data(air_miss)

result_n <- air_miss %>%
  # PMM with 20 draws. Imputes from observed values.
  mutate(Ozone_pmm = fill_NA_N(
    x = .,
    model = "pmm",
    posit_y = "Ozone",
    posit_x = c("Solar.R", "Wind", "Temp"),
    k = 20
  )) %>%
  # lm_noise with 30 draws and weights
  mutate(Ozone_noise = fill_NA_N(
    x = .,
    model = "lm_noise",
    posit_y = "Ozone",
    posit_x = c("Solar.R", "Wind", "Temp"),
    w = .[["weights"]],
    logreg = TRUE,
    k = 30
  ))
```

## data.table with grouping

```{r fill-na-n-dt}
data(air_miss)
setDT(air_miss)

air_miss[, Ozone_pmm := fill_NA_N(
  x = .SD,
  model = "pmm",
  posit_y = "Ozone",
  posit_x = c("Wind", "Temp", "Intercept"),
  w = .SD[["weights"]],
  logreg = TRUE,
  k = 20
), by = .(groups)]
```

---

# Comparing Imputations

Use `compare_imp()` to visually compare the distribution of observed vs. imputed values:

```{r compare-imp}
data(air_miss)

air_miss_imp <- air_miss %>%
  mutate(
    Ozone_bayes = fill_NA(x = ., model = "lm_bayes", posit_y = "Ozone",
                          posit_x = c("Solar.R", "Wind", "Temp")),
    Ozone_pmm = fill_NA_N(x = ., model = "pmm", posit_y = "Ozone",
                          posit_x = c("Solar.R", "Wind", "Temp"), k = 20)
  )

compare_imp(air_miss_imp, origin = "Ozone", target = c("Ozone_bayes", "Ozone_pmm"))
```

---

# Multiple Imputation with `pool()`

For proper statistical inference on incomplete data, use the MI workflow:

1. Generate *m* completed datasets (each with a different stochastic imputation).
2. Fit your analysis model on each completed dataset.
3. Pool results using **Rubin's rules** via `pool()`.

`pool()` works with any model that has `coef()` and `vcov()` methods.

```{r mi-workflow}
data(air_miss)

# Select the 4 core variables: Ozone and Solar.R have NAs; Wind and Temp are complete.
df <- air_miss[, c("Ozone", "Solar.R", "Wind", "Temp")]

# Step 1: Generate m = 10 completed datasets.
# Impute Solar.R first (predictors fully observed), then Ozone
# (can now use the freshly imputed Solar.R). This sequential order
# ensures all NAs are filled in a single pass.
set.seed(1234)
completed <- lapply(1:10, function(i) {
  df %>%
    mutate(Solar.R = fill_NA(., "lm_bayes", "Solar.R", c("Wind", "Temp"))) %>%
    mutate(Ozone = fill_NA(., "lm_bayes", "Ozone", c("Solar.R", "Wind", "Temp")))
})

# Step 2: Fit the analysis model on each completed dataset
fits <- lapply(completed, function(d) lm(Ozone ~ Solar.R + Wind + Temp, data = d))

# Step 3: Pool using Rubin's rules
pool_res <- pool(fits)
pool_res
summary(pool_res)
```

---

# Full Imputation: Filling All Variables and MI with Rubin's Rules

In practice you often need to impute every variable that has missing values, then run MI with proper pooling. The key is **ordering**: impute variables whose predictors are complete first, then variables that depend on the freshly imputed ones. This sequential chain resolves joint missingness in a single pass without iterative FCS.

Below is a complete workflow using `air_miss`.

```{r full-impute-all}
data(air_miss)

# Define a function that fills all variables with NAs in one pass.
# Order matters: Solar.R first (Wind, Temp are complete), then Ozone
# (uses the freshly imputed Solar.R), then categorical variables.
impute_all <- function(data) {
  data %>%
    # Continuous: Solar.R (predictors fully observed)
    mutate(Solar.R = fill_NA(
      x = .,
      model = "lm_bayes",
      posit_y = "Solar.R",
      posit_x = c("Wind", "Temp")
    )) %>%
    # Continuous: Ozone (Solar.R now complete)
    mutate(Ozone = fill_NA(
      x = .,
      model = "lm_bayes",
      posit_y = "Ozone",
      posit_x = c("Solar.R", "Wind", "Temp")
    )) %>%
    # Categorical: x_character (LDA with random ridge for stochasticity)
    mutate(x_character = fill_NA(
      x = .,
      model = "lda",
      posit_y = "x_character",
      posit_x = c("Wind", "Temp"),
      ridge = runif(1, 0.5, 50)
    )) %>%
    # Categorical: Ozone_chac (use tryCatch for safety with small groups)
    group_by(groups) %>%
    do(mutate(.,
      Ozone_chac = tryCatch(
        fill_NA(
          x = .,
          model = "lda",
          posit_y = "Ozone_chac",
          posit_x = c("Temp", "Wind")
        ),
        error = function(e) .[["Ozone_chac"]]
      )
    )) %>%
    ungroup()
}

# MI: generate m = 10 completed datasets
set.seed(2024)
m <- 10
completed <- lapply(1:m, function(i) impute_all(air_miss))

# Fit the analysis model on each completed dataset
fits <- lapply(completed, function(d) lm(Ozone ~ Solar.R + Wind + Temp, data = d))

# Pool using Rubin's rules
pool_result <- pool(fits)
pool_result
summary(pool_result)
```

The same workflow using data.table:

```{r full-impute-all-dt}
data(air_miss)
setDT(air_miss)

impute_all_dt <- function(dt) {
  d <- copy(dt)
  # Order: Solar.R first (predictors complete), then Ozone, then categoricals
  d[, Solar.R := fill_NA(
    x = .SD,
    model = "lm_bayes",
    posit_y = "Solar.R",
    posit_x = c("Wind", "Temp")
  )]
  d[, Ozone := fill_NA(
    x = .SD,
    model = "lm_bayes",
    posit_y = "Ozone",
    posit_x = c("Solar.R", "Wind", "Temp")
  )]
  d[, x_character := fill_NA(
    x = .SD,
    model = "lda",
    posit_y = "x_character",
    posit_x = c("Wind", "Temp"),
    ridge = runif(1, 0.5, 50)
  )]
  d[, Ozone_chac := tryCatch(
    fill_NA(
      x = .SD,
      model = "lda",
      posit_y = "Ozone_chac",
      posit_x = c("Temp", "Wind")
    ),
    error = function(e) .SD[["Ozone_chac"]]
  ), by = .(groups)]
  d
}

set.seed(2024)
completed_dt <- lapply(1:10, function(i)
  impute_all_dt(air_miss))

fits_dt <- lapply(completed_dt, function(d) lm(Ozone ~ Solar.R + Wind + Temp, data = d))
pool(fits_dt)
```

---

# The `miceFast` OOP Module

For maximum performance and fine-grained control, use the C++ object directly. This interface operates on numeric matrices and uses column position indices.

## Methods

| Method | Description |
|--------|-------------|
| `set_data(x)` | Set the data matrix (by reference). |
| `set_g(g)` | Set a grouping variable (positive numeric, no NAs). |
| `set_w(w)` | Set a weighting variable (positive numeric, no NAs). |
| `impute(model, y, x)` | Single imputation. All models including `pmm` (k = 1). |
| `impute_N(model, y, x, k)` | `lm_bayes`/`lm_noise`: averaged *k* draws. `pmm`: samples from *k* nearest observed values. |
| `update_var(y, imps)` | Permanently update a column with imputations. |
| `vifs(y, x)` | Variance Inflation Factors. |
| `get_data()` / `get_g()` / `get_w()` / `get_index()` | Retrieve data or metadata. |
| `sort_byg()` / `is_sorted_byg()` | Sort by grouping variable. |
| `which_updated()` | Check which columns have been updated. |

Note that `update_var()` permanently alters the data matrix by reference. When a grouping variable is set, data is automatically sorted on first imputation; use `get_index()` to recover the original row order.
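The `vifs()` method is the OOP counterpart of the `VIF()` convenience function used earlier. A minimal sketch (not evaluated here; it assumes the same matrix layout as the examples below, with an intercept and index column appended):

```r
# Sketch: collinearity diagnostics through the OOP module
data <- cbind(as.matrix(airquality[, 1:4]), intercept = 1, index = 1:nrow(airquality))
model <- new(miceFast)
model$set_data(data)
# VIFs for the predictors of Ozone (col 1): Solar.R, Wind, Temp (cols 2-4)
model$vifs(1, c(2, 3, 4))
```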
## Simple example

```{r oop-simple}
data <- cbind(
  as.matrix(airquality[, 1:4]),
  intercept = 1,
  index = 1:nrow(airquality)
)

model <- new(miceFast)
model$set_data(data)

# Single imputation with linear model (col 1 = Ozone)
model$update_var(1, model$impute("lm_pred", 1, 5)$imputations)

# Averaged multiple imputation for Solar.R (col 2, Bayesian, k=10 draws)
model$update_var(2, model$impute_N("lm_bayes", 2, c(1, 3, 4, 5), k = 10)$imputations)

model$which_updated()
head(model$get_data(), 4)
```

## With weights and groups

```{r oop-groups}
data <- cbind(
  as.matrix(airquality[, -5]),
  intercept = 1,
  index = 1:nrow(airquality)
)
weights <- rgamma(nrow(data), 3, 3)
groups <- as.numeric(airquality[, 5])

model <- new(miceFast)
model$set_data(data)
model$set_w(weights)
model$set_g(groups)

model$update_var(1, model$impute("lm_pred", 1, 6)$imputations)
model$update_var(2, model$impute_N("lm_bayes", 2, c(1, 3, 4, 5, 6), k = 10)$imputations)

# Recover original row order
head(cbind(model$get_data(), model$get_g(), model$get_w())[order(model$get_index()), ], 4)
```

## With unsorted groups

When groups are not pre-sorted, the data is automatically sorted on first imputation:

```{r oop-unsorted}
data <- cbind(
  as.matrix(airquality[, -5]),
  intercept = 1,
  index = 1:nrow(airquality)
)
weights <- rgamma(nrow(data), 3, 3)
groups <- as.numeric(sample(1:8, nrow(data), replace = TRUE))

model <- new(miceFast)
model$set_data(data)
model$set_w(weights)
model$set_g(groups)

model$update_var(1, model$impute("lm_pred", 1, 6)$imputations)
model$update_var(2, model$impute_N("lm_bayes", 2, c(1, 3, 4, 5, 6), 10)$imputations)

head(cbind(model$get_data(), model$get_g(), model$get_w())[order(model$get_index()), ], 4)
```

## Full MI workflow with OOP

The OOP interface can also drive the full MI loop. Each iteration creates a fresh model, imputes all variables with missing values, and returns the completed matrix in the original row order.
This example uses `lm_bayes`; for PMM (which also works for categorical variables), simply replace `"lm_bayes"` with `"pmm"`. PMM is proper and draws from the Bayesian posterior before matching to observed values.

```{r oop-mi}
data(air_miss)

# Prepare a numeric matrix with an intercept and row index
mat <- cbind(
  as.matrix(air_miss[, c("Ozone", "Solar.R", "Wind", "Temp")]),
  intercept = 1,
  index = seq_len(nrow(air_miss))
)
groups <- as.numeric(air_miss$groups)

impute_oop <- function(mat, groups) {
  m <- new(miceFast)
  m$set_data(mat + 0)  # copy. set_data uses the matrix by reference.
  m$set_g(groups)
  # col 1 = Ozone, col 2 = Solar.R, col 3 = Wind, col 4 = Temp, col 5 = intercept
  m$update_var(1, m$impute("lm_bayes", 1, c(2, 3, 4))$imputations)
  m$update_var(2, m$impute("lm_bayes", 2, c(3, 4, 5))$imputations)
  completed <- m$get_data()[order(m$get_index()), ]
  as.data.frame(completed)
}

set.seed(2025)
completed <- lapply(1:10, function(i) impute_oop(mat, groups))
fits <- lapply(completed, function(d) lm(V1 ~ V3 + V4, data = d))
pool(fits)
summary(pool(fits))
```

## Iterative FCS (Chained Equations) with miceFast

The mice package uses **Fully Conditional Specification** (FCS): it imputes each variable given all others and cycles through multiple iterations until convergence. miceFast can do exactly the same thing. The key is that `update_var()` modifies the data matrix in place, so each subsequent `impute()` call sees the freshly imputed values.
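The same restore-and-re-impute cycle can be sketched with the convenience functions on a plain data frame. This is an illustrative sketch (not evaluated; `fcs_df` is a hypothetical helper name): it records which cells were originally missing, initialises them with `naive_fill_NA()`, and on each pass sets them back to `NA` so `fill_NA()` re-imputes them given the current values of the other variables:

```r
fcs_df <- function(df, n_iter = 5) {
  df <- as.data.frame(df)         # plain data frame for simple `$<-` assignment
  na_ozone <- is.na(df$Ozone)     # remember the original missingness pattern
  na_solar <- is.na(df$Solar.R)
  d <- naive_fill_NA(df)          # simple initial fill for all NAs
  for (iter in seq_len(n_iter)) {
    d$Ozone[na_ozone] <- NA       # restore NAs, re-impute given current Solar.R
    d$Ozone <- fill_NA(d, "lm_bayes", "Ozone", c("Solar.R", "Wind", "Temp"))
    d$Solar.R[na_solar] <- NA     # restore NAs, re-impute given current Ozone
    d$Solar.R <- fill_NA(d, "lm_bayes", "Solar.R", c("Ozone", "Wind", "Temp"))
  }
  d
}
```

Each call yields one completed dataset; wrapping it in `lapply()` produces the *m* datasets needed for pooling.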
### data.table (convenience functions)

The same logic works with `fill_NA` and `:=`, which also updates columns in place:

```{r fcs-dt}
data(air_miss)
setDT(air_miss)

fcs_dt <- function(dt, n_iter = 5) {
  d <- copy(dt)
  na_ozone <- is.na(d$Ozone)
  na_solar <- is.na(d[["Solar.R"]])
  d <- naive_fill_NA(d)  # initialise
  for (iter in seq_len(n_iter)) {
    set(d, which(na_ozone), "Ozone", NA_real_)  # restore NAs
    d[, Ozone := fill_NA(.SD, "lm_bayes", "Ozone", c("Solar.R", "Wind", "Temp"))]
    set(d, which(na_solar), "Solar.R", NA_real_)
    d[, Solar.R := fill_NA_N(.SD, "pmm", "Solar.R", c("Ozone", "Wind", "Temp", "Intercept"))]
  }
  d
}

set.seed(2025)
completed_dt <- lapply(1:10, function(i) fcs_dt(air_miss))
fits_dt <- lapply(completed_dt, function(d) lm(Ozone ~ Wind + Temp, data = d))
pool(fits_dt)
```

With a monotone missing-data pattern a single pass (`n_iter = 1`) is sufficient. For complex interlocking patterns, 5--10 iterations are typical. The OOP interface avoids R-level data copies and is fastest for large datasets.

---

# Generating Correlated Data with `corrData`

The `corrData` module generates correlated data for simulations. This is useful for creating test datasets with known properties.

```{r corrdata-example}
# 3 continuous variables, 100 observations
means <- c(10, 20, 30)
cor_matrix <- matrix(c(
  1.0, 0.7, 0.3,
  0.7, 1.0, 0.5,
  0.3, 0.5, 1.0
), nrow = 3)

cd <- new(corrData, 100, means, cor_matrix)
sim_data <- cd$fill("contin")
round(cor(sim_data), 2)
```

```{r corrdata-discrete}
# With 2 categorical variables: first argument is nr_cat
cd2 <- new(corrData, 2, 200, means, cor_matrix)
sim_discrete <- cd2$fill("discrete")
head(sim_discrete)
```

---

# Tips

- **Creating matrices from data frames:** R matrices hold only one type. Convert factors with `model.matrix()`:

  ```r
  mtcars$cyl <- factor(mtcars$cyl)
  model.matrix(~ ., data = mtcars, na.action = "na.pass")
  ```

- **Collinearity:** Always check `VIF()` before imputing.
  High VIF (>10) indicates collinearity that can destabilize imputation models.

- **Error handling with groups:** Small groups may not have enough observations. Wrap `fill_NA()` calls in `tryCatch()`:

  ```r
  tryCatch(
    fill_NA(x = ., model = "lda", posit_y = "y", posit_x = c("x1", "x2")),
    error = function(e) .[["y"]]
  )
  ```

- **BLAS optimization:** Install an optimized BLAS library for significant speedups:
  - **Linux:** `sudo apt-get install libopenblas-dev`
  - **macOS:** See the [R macOS FAQ](https://cran.r-project.org/bin/macosx/RMacOSX-FAQ.html#Which-BLAS-is-used-and-how-can-it-be-changed_003f)

---

# References

- Rubin, D.B. (1987). *Multiple Imputation for Nonresponse in Surveys*. John Wiley & Sons.
- Barnard, J. and Rubin, D.B. (1999). Small-sample degrees of freedom with multiple imputation. *Biometrika*, 86(4), 948-955.
- van Buuren, S. (2018). *Flexible Imputation of Missing Data* (2nd ed.). Chapman & Hall/CRC.
- Eddelbuettel, D. and Sanderson, C. (2014). RcppArmadillo: Accelerating R with high-performance C++ linear algebra. *Computational Statistics and Data Analysis*, 71, 1054-1063.