---
title: "miceFast - Introduction and Advanced Usage"
author: "Maciej Nasinski"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: true
    toc_depth: 3
vignette: >
  %\VignetteIndexEntry{miceFast - Introduction}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```

# Overview

**miceFast** provides fast imputation methods built on C++/RcppArmadillo (Eddelbuettel & Sanderson, 2014). The main interfaces are:

- `fill_NA` and `fill_NA_N` work with **dplyr** and **data.table** pipelines.
- `new(miceFast)` provides an OOP interface for maximum control.
- `pool()` combines results from multiply imputed datasets using Rubin's rules (Rubin, 1987) with the Barnard-Rubin (1999) degrees-of-freedom adjustment.

For missing-data theory (MCAR/MAR/MNAR) and full MI workflows, see the companion vignette: [Missing Data Mechanisms and Multiple Imputation](missing-data-and-imputation.html). For a comprehensive treatment, see van Buuren (2018).

```{r load-packages}
library(miceFast)
library(dplyr)
library(data.table)
library(ggplot2)
set.seed(123456)
```

# Available Imputation Models

**miceFast** supports several models via the `model` argument in `fill_NA()` and `fill_NA_N()`:

| Model | Type | Stochastic? | Use case |
|-------|------|-------------|----------|
| `lm_pred` | Linear regression | No | Deterministic point prediction. With only an `Intercept` column, equivalent to mean imputation. |
| `lm_bayes` | Bayesian linear regression | Yes | Draws coefficients from their posterior. Recommended for MI of continuous variables. |
| `lm_noise` | Linear regression + noise | Yes (improper) | Adds residual noise but omits parameter uncertainty. Useful for sensitivity analysis, not recommended as the primary MI model. |
| `lda` | Linear Discriminant Analysis | Approximate | For **categorical** variables. With a random `ridge` parameter it becomes stochastic. Suitable for MI but not strictly proper (uses ad-hoc perturbation, not a posterior draw). |
| `pmm` | Predictive Mean Matching | Yes (proper) | Imputes from observed values via nearest-neighbour matching on Bayesian-posterior predictions. Works for both **continuous** and **categorical** variables. Available through `fill_NA_N()` / OOP `impute()` / `impute_N()`. |

For MI you need a stochastic model. **`lm_bayes`** is strictly **proper** (Rubin, 1987): it draws both coefficients and residual variance from their posterior. **`pmm`** is also **proper**: it draws coefficients from the posterior for predictions on missing rows, then matches to the nearest observed values (Type II PMM; van Buuren, 2018). It works for both continuous and categorical variables and naturally preserves the observed data distribution.

For categorical variables, **`lda`** with `ridge = runif(1, ...)` is an **approximate** approach. The random ridge creates between-imputation variability, but it is not a true posterior draw. `lm_noise` is **improper**: it adds residual noise but omits parameter uncertainty. Both `lda + ridge` and `lm_noise` work well in practice and are useful for sensitivity analysis. See the [MI vignette](missing-data-and-imputation.html).

# Example Data

The package ships with `air_miss`, an extended version of `airquality` with additional columns including a character variable, weights, groups, and a categorized Ozone variable.

```{r data}
data(air_miss)
str(air_miss)
```

## Examining Missingness

```{r upset, eval=requireNamespace("UpSetR", quietly=TRUE)}
upset_NA(air_miss, 6)
```

## Checking for Collinearity

Before imputing, check Variance Inflation Factors. Values above ~10 suggest problematic collinearity that can destabilize imputation models. Consider dropping or combining redundant predictors before imputing.
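For reference, the VIF of predictor $x_j$ is obtained by regressing $x_j$ on the remaining predictors:

$$\mathrm{VIF}_j = \frac{1}{1 - R_j^2},$$

where $R_j^2$ is the coefficient of determination of that auxiliary regression, so the usual cutoff of 10 corresponds to $R_j^2 = 0.9$.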
```{r vif}
VIF(
  air_miss,
  posit_y = "Ozone",
  posit_x = c("Solar.R", "Wind", "Temp", "x_character", "Day", "weights", "groups")
)
```

---

# Single Imputation with `fill_NA()`

`fill_NA()` imputes missing values in a single column using a specified model and predictors. It accepts column names (strings) or position indices and works inside `dplyr::mutate()` or `data.table` `:=` expressions.

## Basic usage (dplyr)

```{r fill-na-basic}
data(air_miss)

result <- air_miss %>%
  # Continuous variable: Bayesian linear model (stochastic)
  mutate(Ozone_imp = fill_NA(
    x = .,
    model = "lm_bayes",
    posit_y = "Ozone",
    posit_x = c("Solar.R", "Wind", "Temp")
  )) %>%
  # Categorical variable: LDA
  mutate(x_char_imp = fill_NA(
    x = .,
    model = "lda",
    posit_y = "x_character",
    posit_x = c("Wind", "Temp")
  ))

head(result[, c("Ozone", "Ozone_imp", "x_character", "x_char_imp")])
```

## With weights and grouped imputation

Grouping fits a separate model per group, which is useful when the relationship between variables varies across subpopulations. Weights allow for heteroscedasticity or survey designs.

```{r fill-na-grouped}
data(air_miss)

result_grouped <- air_miss %>%
  group_by(groups) %>%
  do(mutate(.,
    Solar_R_imp = fill_NA(
      x = .,
      model = "lm_pred",
      posit_y = "Solar.R",
      posit_x = c("Wind", "Temp", "Intercept"),
      w = .[["weights"]]
    )
  )) %>%
  ungroup()

head(result_grouped[, c("Solar.R", "Solar_R_imp", "groups")])
```

## Log-transformation

For right-skewed variables (like Ozone), use `logreg = TRUE` to impute on the log scale. The model fits on $\log(y)$ and back-transforms the predictions:

```{r fill-na-logreg}
data(air_miss)

result_log <- air_miss %>%
  mutate(Ozone_imp = fill_NA(
    x = .,
    model = "lm_bayes",
    posit_y = "Ozone",
    posit_x = c("Solar.R", "Wind", "Temp", "Intercept"),
    logreg = TRUE
  ))

# Compare distributions: log imputation avoids negative values
summary(result_log[c("Ozone", "Ozone_imp")])
```

## Using column position indices

You can refer to columns by position instead of name.
Check `names(air_miss)` to find the right positions:

```{r fill-na-position}
data(air_miss)

result_pos <- air_miss %>%
  mutate(Ozone_imp = fill_NA(
    x = .,
    model = "lm_bayes",
    posit_y = 1,
    posit_x = c(4, 6),
    logreg = TRUE
  ))

head(result_pos[, c("Ozone", "Ozone_imp")])
```

## Basic usage (data.table)

```{r fill-na-dt}
data(air_miss)
setDT(air_miss)

air_miss[, Ozone_imp := fill_NA(
  x = .SD,
  model = "lm_bayes",
  posit_y = "Ozone",
  posit_x = c("Solar.R", "Wind", "Temp")
)]

air_miss[, x_char_imp := fill_NA(
  x = .SD,
  model = "lda",
  posit_y = "x_character",
  posit_x = c("Wind", "Temp")
)]

# With grouping
air_miss[, Solar_R_imp := fill_NA(
  x = .SD,
  model = "lm_pred",
  posit_y = "Solar.R",
  posit_x = c("Wind", "Temp", "Intercept"),
  w = .SD[["weights"]]
), by = .(groups)]
```

---

# Multiple Imputation with `fill_NA_N()`

For `lm_bayes` and `lm_noise`, `fill_NA_N()` generates *k* stochastic draws per missing observation and returns their **average**. For `pmm`, *k* is the number of nearest observed neighbours from which one value is randomly selected (no averaging). In both cases the result is a single filled-in dataset. Between-imputation variance is lost, so Rubin's rules cannot be applied.

For MI with continuous variables, use `fill_NA()` + `pool()` with `lm_bayes`. For MI with **PMM** (proper, works for both continuous and categorical variables), use the OOP interface: call `impute("pmm", ...)` in a loop (see the [OOP section](#the-micefast-oop-module) and the [MI vignette](missing-data-and-imputation.html)).

## dplyr

```{r fill-na-n-dplyr}
data(air_miss)

result_n <- air_miss %>%
  # PMM with 20 draws. Imputes from observed values.
  mutate(Ozone_pmm = fill_NA_N(
    x = .,
    model = "pmm",
    posit_y = "Ozone",
    posit_x = c("Solar.R", "Wind", "Temp"),
    k = 20
  )) %>%
  # lm_noise with 30 draws and weights
  mutate(Ozone_noise = fill_NA_N(
    x = .,
    model = "lm_noise",
    posit_y = "Ozone",
    posit_x = c("Solar.R", "Wind", "Temp"),
    w = .[["weights"]],
    logreg = TRUE,
    k = 30
  ))
```

## data.table with grouping

```{r fill-na-n-dt}
data(air_miss)
setDT(air_miss)

air_miss[, Ozone_pmm := fill_NA_N(
  x = .SD,
  model = "pmm",
  posit_y = "Ozone",
  posit_x = c("Wind", "Temp", "Intercept"),
  w = .SD[["weights"]],
  logreg = TRUE,
  k = 20
), by = .(groups)]
```

---

# Comparing Imputations

Use `compare_imp()` to visually compare the distribution of observed vs. imputed values:

```{r compare-imp}
data(air_miss)

air_miss_imp <- air_miss %>%
  mutate(
    Ozone_bayes = fill_NA(x = ., model = "lm_bayes", posit_y = "Ozone",
                          posit_x = c("Solar.R", "Wind", "Temp")),
    Ozone_pmm = fill_NA_N(x = ., model = "pmm", posit_y = "Ozone",
                          posit_x = c("Solar.R", "Wind", "Temp"), k = 20)
  )

compare_imp(air_miss_imp, origin = "Ozone", target = c("Ozone_bayes", "Ozone_pmm"))
```

---

# Multiple Imputation with `pool()`

For proper statistical inference on incomplete data, use the MI workflow:

1. Generate *m* completed datasets (each with a different stochastic imputation).
2. Fit your analysis model on each completed dataset.
3. Pool results using **Rubin's rules** via `pool()`.

`pool()` works with any model that has `coef()` and `vcov()` methods.

```{r mi-workflow}
data(air_miss)

# Select the 4 core variables: Ozone and Solar.R have NAs; Wind and Temp are complete.
df <- air_miss[, c("Ozone", "Solar.R", "Wind", "Temp")]

# Step 1: Generate m = 10 completed datasets.
# Impute Solar.R first (predictors fully observed), then Ozone
# (can now use the freshly imputed Solar.R). This sequential order
# ensures all NAs are filled in a single pass.
set.seed(1234)
completed <- lapply(1:10, function(i) {
  df %>%
    mutate(Solar.R = fill_NA(., "lm_bayes", "Solar.R", c("Wind", "Temp"))) %>%
    mutate(Ozone = fill_NA(., "lm_bayes", "Ozone", c("Solar.R", "Wind", "Temp")))
})

# Step 2: Fit the analysis model on each completed dataset
fits <- lapply(completed, function(d) lm(Ozone ~ Solar.R + Wind + Temp, data = d))

# Step 3: Pool using Rubin's rules
pool_res <- pool(fits)
pool_res
summary(pool_res)
```

---

# Full Imputation: Filling All Variables and MI with Rubin's Rules

In practice you often need to impute every variable that has missing values, then run MI with proper pooling. The key is **ordering**: impute variables whose predictors are complete first, then variables that depend on the freshly imputed ones. This sequential chain resolves joint missingness in a single pass without iterative FCS.

Below is a complete workflow using `air_miss`.

```{r full-impute-all}
data(air_miss)

# Define a function that fills all variables with NAs in one pass.
# Order matters: Solar.R first (Wind, Temp are complete), then Ozone
# (uses the freshly imputed Solar.R), then categorical variables.
impute_all <- function(data) {
  data %>%
    # Continuous: Solar.R (predictors fully observed)
    mutate(Solar.R = fill_NA(
      x = .,
      model = "lm_bayes",
      posit_y = "Solar.R",
      posit_x = c("Wind", "Temp")
    )) %>%
    # Continuous: Ozone (Solar.R now complete)
    mutate(Ozone = fill_NA(
      x = .,
      model = "lm_bayes",
      posit_y = "Ozone",
      posit_x = c("Solar.R", "Wind", "Temp")
    )) %>%
    # Categorical: x_character (LDA with random ridge for stochasticity)
    mutate(x_character = fill_NA(
      x = .,
      model = "lda",
      posit_y = "x_character",
      posit_x = c("Wind", "Temp"),
      ridge = runif(1, 0.5, 50)
    )) %>%
    # Categorical: Ozone_chac (use tryCatch for safety with small groups)
    group_by(groups) %>%
    do(mutate(.,
      Ozone_chac = tryCatch(
        fill_NA(
          x = .,
          model = "lda",
          posit_y = "Ozone_chac",
          posit_x = c("Temp", "Wind")
        ),
        error = function(e) .[["Ozone_chac"]]
      )
    )) %>%
    ungroup()
}

# MI: generate m = 10 completed datasets
set.seed(2024)
m <- 10
completed <- lapply(1:m, function(i) impute_all(air_miss))

# Fit the analysis model on each completed dataset
fits <- lapply(completed, function(d) lm(Ozone ~ Solar.R + Wind + Temp, data = d))

# Pool using Rubin's rules
pool_result <- pool(fits)
pool_result
summary(pool_result)
```

The same workflow using data.table:

```{r full-impute-all-dt}
data(air_miss)
setDT(air_miss)

impute_all_dt <- function(dt) {
  d <- copy(dt)
  # Order: Solar.R first (predictors complete), then Ozone, then categoricals
  d[, Solar.R := fill_NA(
    x = .SD,
    model = "lm_bayes",
    posit_y = "Solar.R",
    posit_x = c("Wind", "Temp")
  )]
  d[, Ozone := fill_NA(
    x = .SD,
    model = "lm_bayes",
    posit_y = "Ozone",
    posit_x = c("Solar.R", "Wind", "Temp")
  )]
  d[, x_character := fill_NA(
    x = .SD,
    model = "lda",
    posit_y = "x_character",
    posit_x = c("Wind", "Temp"),
    ridge = runif(1, 0.5, 50)
  )]
  d[, Ozone_chac := tryCatch(
    fill_NA(
      x = .SD,
      model = "lda",
      posit_y = "Ozone_chac",
      posit_x = c("Temp", "Wind")
    ),
    error = function(e) .SD[["Ozone_chac"]]
  ), by = .(groups)]
  d
}

set.seed(2024)
completed_dt <- lapply(1:10, function(i)
  impute_all_dt(air_miss))

fits_dt <- lapply(completed_dt, function(d) lm(Ozone ~ Solar.R + Wind + Temp, data = d))
pool(fits_dt)
```

---

# The `miceFast` OOP Module

For maximum performance and fine-grained control, use the C++ object directly. This interface operates on numeric matrices and uses column position indices.

## Methods

| Method | Description |
|--------|-------------|
| `set_data(x)` | Set the data matrix (by reference). |
| `set_g(g)` | Set a grouping variable (positive numeric, no NAs). |
| `set_w(w)` | Set a weighting variable (positive numeric, no NAs). |
| `impute(model, y, x)` | Single imputation. All models including `pmm` (k = 1). |
| `impute_N(model, y, x, k)` | `lm_bayes`/`lm_noise`: averaged *k* draws. `pmm`: samples from *k* nearest observed values. |
| `update_var(y, imps)` | Permanently update a column with imputations. |
| `vifs(y, x)` | Variance Inflation Factors. |
| `get_data()` / `get_g()` / `get_w()` / `get_index()` | Retrieve data or metadata. |
| `sort_byg()` / `is_sorted_byg()` | Sort by grouping variable. |
| `which_updated()` | Check which columns have been updated. |

Note that `update_var()` permanently alters the data matrix by reference. When a grouping variable is set, data is automatically sorted on first imputation; use `get_index()` to recover the original row order.
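The `vifs()` method is the OOP counterpart of the `VIF()` convenience function used earlier. A minimal sketch (not evaluated here; it assumes the same matrix layout as the examples below, with an intercept and index column appended):

```r
# Sketch: collinearity diagnostics through the OOP module
data <- cbind(as.matrix(airquality[, 1:4]), intercept = 1, index = 1:nrow(airquality))
model <- new(miceFast)
model$set_data(data)
# VIFs for the predictors of Ozone (col 1): Solar.R, Wind, Temp (cols 2-4)
model$vifs(1, c(2, 3, 4))
```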
## Simple example

```{r oop-simple}
data <- cbind(
  as.matrix(airquality[, 1:4]),
  intercept = 1,
  index = 1:nrow(airquality)
)

model <- new(miceFast)
model$set_data(data)

# Single imputation with linear model (col 1 = Ozone)
model$update_var(1, model$impute("lm_pred", 1, 5)$imputations)

# Averaged multiple imputation for Solar.R (col 2, Bayesian, k=10 draws)
model$update_var(2, model$impute_N("lm_bayes", 2, c(1, 3, 4, 5), k = 10)$imputations)

model$which_updated()
head(model$get_data(), 4)
```

## With weights and groups

```{r oop-groups}
data <- cbind(
  as.matrix(airquality[, -5]),
  intercept = 1,
  index = 1:nrow(airquality)
)
weights <- rgamma(nrow(data), 3, 3)
groups <- as.numeric(airquality[, 5])

model <- new(miceFast)
model$set_data(data)
model$set_w(weights)
model$set_g(groups)

model$update_var(1, model$impute("lm_pred", 1, 6)$imputations)
model$update_var(2, model$impute_N("lm_bayes", 2, c(1, 3, 4, 5, 6), k = 10)$imputations)

# Recover original row order
head(cbind(model$get_data(), model$get_g(), model$get_w())[order(model$get_index()), ], 4)
```

## With unsorted groups

When groups are not pre-sorted, the data is automatically sorted on first imputation:

```{r oop-unsorted}
data <- cbind(
  as.matrix(airquality[, -5]),
  intercept = 1,
  index = 1:nrow(airquality)
)
weights <- rgamma(nrow(data), 3, 3)
groups <- as.numeric(sample(1:8, nrow(data), replace = TRUE))

model <- new(miceFast)
model$set_data(data)
model$set_w(weights)
model$set_g(groups)

model$update_var(1, model$impute("lm_pred", 1, 6)$imputations)
model$update_var(2, model$impute_N("lm_bayes", 2, c(1, 3, 4, 5, 6), 10)$imputations)

head(cbind(model$get_data(), model$get_g(), model$get_w())[order(model$get_index()), ], 4)
```

## Full MI workflow with OOP

The OOP interface can also drive the full MI loop. Each iteration creates a fresh model, imputes all variables with missing values, and returns the completed matrix in the original row order.
This example uses `lm_bayes`; for PMM (which also works for categorical variables), simply replace `"lm_bayes"` with `"pmm"`. PMM is proper and draws from the Bayesian posterior before matching to observed values.

```{r oop-mi}
data(air_miss)

# Prepare a numeric matrix with an intercept and row index
mat <- cbind(
  as.matrix(air_miss[, c("Ozone", "Solar.R", "Wind", "Temp")]),
  intercept = 1,
  index = seq_len(nrow(air_miss))
)
groups <- as.numeric(air_miss$groups)

impute_oop <- function(mat, groups) {
  m <- new(miceFast)
  m$set_data(mat + 0)  # copy. set_data uses the matrix by reference.
  m$set_g(groups)
  # col 1 = Ozone, col 2 = Solar.R, col 3 = Wind, col 4 = Temp, col 5 = intercept
  m$update_var(1, m$impute("lm_bayes", 1, c(2, 3, 4))$imputations)
  m$update_var(2, m$impute("lm_bayes", 2, c(3, 4, 5))$imputations)
  completed <- m$get_data()[order(m$get_index()), ]
  as.data.frame(completed)
}

set.seed(2025)
completed <- lapply(1:10, function(i) impute_oop(mat, groups))
fits <- lapply(completed, function(d) lm(V1 ~ V3 + V4, data = d))
pool(fits)
summary(pool(fits))
```

## Iterative FCS (Chained Equations) with miceFast

The mice package uses **Fully Conditional Specification** (FCS): it imputes each variable given all others and cycles through multiple iterations until convergence. miceFast can do exactly the same thing. The key is that `update_var()` modifies the data matrix in place, so each subsequent `impute()` call sees the freshly imputed values.
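The same restore-and-re-impute cycle can be sketched with the convenience functions on a plain data frame. This is an illustrative sketch (not evaluated; `fcs_df` is a hypothetical helper name): it records which cells were originally missing, initialises them with `naive_fill_NA()`, and on each pass sets them back to `NA` so `fill_NA()` re-imputes them given the current values of the other variables:

```r
fcs_df <- function(df, n_iter = 5) {
  df <- as.data.frame(df)         # plain data frame for simple `$<-` assignment
  na_ozone <- is.na(df$Ozone)     # remember the original missingness pattern
  na_solar <- is.na(df$Solar.R)
  d <- naive_fill_NA(df)          # simple initial fill for all NAs
  for (iter in seq_len(n_iter)) {
    d$Ozone[na_ozone] <- NA       # restore NAs, re-impute given current Solar.R
    d$Ozone <- fill_NA(d, "lm_bayes", "Ozone", c("Solar.R", "Wind", "Temp"))
    d$Solar.R[na_solar] <- NA     # restore NAs, re-impute given current Ozone
    d$Solar.R <- fill_NA(d, "lm_bayes", "Solar.R", c("Ozone", "Wind", "Temp"))
  }
  d
}
```

Each call yields one completed dataset; wrapping it in `lapply()` produces the *m* datasets needed for pooling.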
### data.table (convenience functions)

The same logic works with `fill_NA` and `:=`, which also updates columns in place:

```{r fcs-dt}
data(air_miss)
setDT(air_miss)

fcs_dt <- function(dt, n_iter = 5) {
  d <- copy(dt)
  na_ozone <- is.na(d$Ozone)
  na_solar <- is.na(d[["Solar.R"]])
  d <- naive_fill_NA(d)  # initialise
  for (iter in seq_len(n_iter)) {
    set(d, which(na_ozone), "Ozone", NA_real_)  # restore NAs
    d[, Ozone := fill_NA(.SD, "lm_bayes", "Ozone", c("Solar.R", "Wind", "Temp"))]
    set(d, which(na_solar), "Solar.R", NA_real_)
    d[, Solar.R := fill_NA_N(.SD, "pmm", "Solar.R", c("Ozone", "Wind", "Temp", "Intercept"))]
  }
  d
}

set.seed(2025)
completed_dt <- lapply(1:10, function(i) fcs_dt(air_miss))
fits_dt <- lapply(completed_dt, function(d) lm(Ozone ~ Wind + Temp, data = d))
pool(fits_dt)
```

With a monotone missing-data pattern a single pass (`n_iter = 1`) is sufficient. For complex interlocking patterns, 5--10 iterations are typical. The OOP interface avoids R-level data copies and is fastest for large datasets.

---

# Generating Correlated Data with `corrData`

The `corrData` module generates correlated data for simulations. This is useful for creating test datasets with known properties.

```{r corrdata-example}
# 3 continuous variables, 100 observations
means <- c(10, 20, 30)
cor_matrix <- matrix(c(
  1.0, 0.7, 0.3,
  0.7, 1.0, 0.5,
  0.3, 0.5, 1.0
), nrow = 3)

cd <- new(corrData, 100, means, cor_matrix)
sim_data <- cd$fill("contin")
round(cor(sim_data), 2)
```

```{r corrdata-discrete}
# With 2 categorical variables: first argument is nr_cat
cd2 <- new(corrData, 2, 200, means, cor_matrix)
sim_discrete <- cd2$fill("discrete")
head(sim_discrete)
```

---

# Tips

- **Creating matrices from data frames:** R matrices hold only one type. Convert factors with `model.matrix()`:

  ```r
  mtcars$cyl <- factor(mtcars$cyl)
  model.matrix(~ ., data = mtcars, na.action = "na.pass")
  ```

- **Collinearity:** Always check `VIF()` before imputing.
  High VIF (>10) indicates collinearity that can destabilize imputation models.

- **Error handling with groups:** Small groups may not have enough observations. Wrap `fill_NA()` calls in `tryCatch()`:

  ```r
  tryCatch(
    fill_NA(x = ., model = "lda", posit_y = "y", posit_x = c("x1", "x2")),
    error = function(e) .[["y"]]
  )
  ```

- **BLAS optimization:** Install an optimized BLAS library for significant speedups:
  - **Linux:** `sudo apt-get install libopenblas-dev`
  - **macOS:** See the [R macOS FAQ](https://cran.r-project.org/bin/macosx/RMacOSX-FAQ.html#Which-BLAS-is-used-and-how-can-it-be-changed_003f)

---

# References

- Rubin, D.B. (1987). *Multiple Imputation for Nonresponse in Surveys*. John Wiley & Sons.
- Barnard, J. and Rubin, D.B. (1999). Small-sample degrees of freedom with multiple imputation. *Biometrika*, 86(4), 948-955.
- van Buuren, S. (2018). *Flexible Imputation of Missing Data* (2nd ed.). Chapman & Hall/CRC.
- Eddelbuettel, D. and Sanderson, C. (2014). RcppArmadillo: Accelerating R with high-performance C++ linear algebra. *Computational Statistics and Data Analysis*, 71, 1054-1063.