--- title: "Complete Workflow with hbsaems 1.0.0" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Complete Workflow with hbsaems 1.0.0} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include = FALSE} # Vignette evaluation strategy: # # By default we run lightweight diagnostic chunks (data inspection, # convergence summaries, prediction tables) so that the vignette # carries real output. The model-fitting chunks themselves use # `eval = FALSE` for two reasons: # # 1. Stan compile + warmup on a single chain takes ~20-60 seconds # on a fast machine and well over a minute on CRAN's slower # build agents. With 4 model fits in this workflow the budget # is too tight to evaluate them all. # 2. The vignette is meant as a reference / cookbook, not a # validation run -- the displayed code is what users should # paste into their own R session. # # Mini-fits used for the diagnostic chunks below are cached at the # start of the vignette so subsequent chunks see a real hbmfit # object without re-running Stan repeatedly. knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = FALSE # default: display-only ) # Whether to actually run the mini-fit and diagnostic chunks. # Set to FALSE in environments without a working Stan toolchain (e.g. # CI containers that ship rstan but not the BH/Boost headers needed for # Stan compile, or on CRAN's slow build agents where a few extra # seconds matter). The probe tries to compile a trivial Stan program # in-memory; if that fails we silently fall back to display-only mode. RUN_DIAGNOSTICS <- (function() { if (!requireNamespace("brms", quietly = TRUE)) return(FALSE) if (!requireNamespace("rstan", quietly = TRUE)) return(FALSE) # Toolchain probe: try compiling the simplest Stan program possible. # If Boost / a C++ compiler is missing we'll get a clear error here # rather than later in a more confusing place. ok <- tryCatch({ rstan::stan_model( model_code = "parameters { real x; } model { x ~ normal(0, 1); }", verbose = FALSE ) TRUE }, error = function(e) FALSE, warning = function(w) FALSE) isTRUE(ok) })() ``` ```{r load-libs, eval = TRUE, message = FALSE, warning = FALSE} library(hbsaems) ``` A typical Bayesian SAE workflow with `hbsaems` v1.0.0 follows seven steps, each backed by a single primary function. This vignette walks through the canonical pipeline. ## See also For deeper coverage of topics beyond this introductory workflow, see the **articles on the package website**: The articles cover: - Distribution-specific walkthroughs (Beta logit-normal, binomial logit-normal, lognormal-lognormal) - Spatial models (CAR / SAR / BYM2) - Handling missing data (deletion, multiple imputation, joint Bayesian imputation) - The interactive Shiny dashboard - Advanced features (custom families, benchmarking, smooth terms) - AST-based formula manipulation (internals) - Migration guide for users coming from earlier versions ## 0. Inspect the data ```{r data-inspect, eval = TRUE} data("data_fhnorm") str(data_fhnorm, max.level = 1) head(data_fhnorm[, c("regency", "province", "y", "D", "x1", "x2", "x3")], 4) ``` `data_fhnorm` is a simulated Fay-Herriot dataset: 100 regencies nested within 5 provinces, with a known sampling variance `D` per area and three covariates. ## 1. Prior predictive check ```r # This is the production call. Replace `chains`, `iter` with your # usual values; the lighter settings below are a quick demonstration. model_prior <- hbm( formula = brms::bf(y ~ x1 + x2 + x3), data = data_fhnorm, re = ~ (1 | regency), # area-level random effect (Fay-Herriot) sampling_variance = "D", # KNOWN sampling variance D_i sample_prior = "only", prior = c( brms::prior(normal(0, 1), class = "b"), brms::prior(normal(10, 5), class = "Intercept"), brms::prior(normal(0, 2), class = "sd") ), chains = 2, iter = 1000 ) pc <- prior_check(model_prior, data = data_fhnorm, response_var = "y") plot(pc) ``` The `sampling_variance = "D"` argument is the **defining feature** of a Fay-Herriot model: the sampling variance \eqn{D_i} is treated as **known** from the design (it is not estimated), so brms pins the observation-level \eqn{\sigma_i} to \eqn{\sqrt{D_i}}. Omitting this argument makes the model unidentified because the residual \eqn{\sigma} and the area random-effect \eqn{\sigma_u} would compete to explain the same variance, typically producing divergent transitions and very wide credible intervals. ## 2. Fit the model ```r model <- hbm( formula = brms::bf(y ~ x1 + x2 + x3), data = data_fhnorm, re = ~ (1 | regency), sampling_variance = "D", control = list(adapt_delta = 0.99), chains = 4, iter = 4000, warmup = 2000, cores = 4 ) summary(model) ``` Two settings deserve attention: * **`sampling_variance = "D"`** -- the column `D` in `data_fhnorm` holds the known sampling variance \eqn{D_i} for each area; `hbsaems` translates this into an offset on the observation-level standard deviation so brms / Stan never tries to estimate it. * **`adapt_delta = 0.99`** -- Fay-Herriot likelihoods can produce mild funnel geometry between \eqn{\sigma_u} and the area random effects \eqn{u_i} when many areas have small \eqn{D_i}. Raising `adapt_delta` from the brms default 0.95 to 0.99 buys a more conservative leapfrog step and eliminates the occasional divergent transition without re-parameterising the model. ### A small fitted model to demonstrate the rest of the workflow For the remainder of the vignette we use a tiny model (very few iterations) so that the diagnostic and prediction chunks below have a real object to operate on. In your own analysis, **use the full production settings shown above**, not the toy settings here. ```{r mini-fit, eval = RUN_DIAGNOSTICS, message = FALSE, warning = FALSE, cache = TRUE} # Mini fit -- iter = 200, chains = 1 -- for vignette demonstration only. # Do NOT use these settings for inference: the chains have not # converged at this length and the posterior will be biased. fit_demo <- suppressWarnings( hbm( formula = brms::bf(y ~ x1 + x2 + x3), data = data_fhnorm, re = ~ (1 | regency), sampling_variance = "D", chains = 1, iter = 200, warmup = 100, refresh = 0, seed = 1 ) ) ``` ## 3. Convergence diagnostics ```{r diagnostics, eval = RUN_DIAGNOSTICS} # Operate on the mini fit_demo above (NOT a substitute for production # diagnostics on full chains). diag <- convergence_check(fit_demo) is_converged(fit_demo) summary(diag) hbm_warnings(fit_demo) ``` The full set of diagnostics on your production fit would also include trace plots, R-hat / ESS tables, and pp-checks. These all follow the same calling convention: ```r plot(diag, type = "trace") plot(diag, type = "rhat") plot(diag, type = "ess") ``` ## 4. Goodness-of-fit ```r gof <- model_compare(model) # single-model: LOO, WAIC, pp_check summary(gof) plot(gof, type = "pp_check") ``` ## 5. Compare alternatives ```r model2 <- hbm(brms::bf(y ~ x1 + x2), data = data_fhnorm, re = ~ (1 | regency), sampling_variance = "D", control = list(adapt_delta = 0.99), chains = 4, iter = 4000) model3 <- hbm(brms::bf(y ~ x1), data = data_fhnorm, re = ~ (1 | regency), sampling_variance = "D", control = list(adapt_delta = 0.99), chains = 4, iter = 4000) model_compare(model, model2) model_compare_all(full = model, medium = model2, simple = model3) ``` ## 6. Small-area estimates The simplest call uses the data the model was fit on -- no `newdata` argument needed: ```{r predictions, eval = RUN_DIAGNOSTICS} # In-sample prediction (default): operates on the data stored # inside the fitted object. est <- sae_predict(fit_demo) summary(est) head(est$result_table, 5) ``` To predict at fresh data points, pass `newdata`. When you do, any internal offset columns the original fit needed (e.g.\ `.hbsaems_sigma_fixed = sqrt(D)` for the Fay-Herriot sugar) are repopulated automatically when the new data frame has the same number of rows as the training data -- the standard "predict at the same areas with updated covariates" use case. ```r # Out-of-sample prediction at new areas: est_new <- sae_predict(fit_demo, newdata = data_new) ``` ### Bayesian model averaging For model averaging, fit several candidate models, then either let `model_average()` weight them by stacking weights (LOO-based) or supply your own weights: ```r # Stacking weights (default, derived from LOO) avg_stacked <- model_average(model, model2, model3) # User-specified weights -- must sum to 1 avg_manual <- model_average(model, model2, weights = c(0.7, 0.3)) # The returned object is an `hbsae_results`, identical in shape to # what sae_predict() returns, so all post-processing helpers # (sae_transform, sae_scale, sae_filter, sae_aggregate) work on it. plot(avg_stacked, type = "predictions") ``` ### Visualisations ```r # Visualisations (require a graphics device): plot(est, type = "predictions") plot(est, type = "uncertainty") ``` ## 7. Post-processing predictions All `sae_*` post-processing helpers operate on the `hbsae_results` object returned by either `sae_predict()` or `model_average()`: ```r log_est <- sae_transform(est, log) sc_est <- sae_scale(est) hi_est <- sae_filter(est, est$pred > stats::median(est$pred)) agg_est <- sae_aggregate(sae_predict(model), sae_predict(model2), method = "mean") ``` ## Session info ```{r session-info, eval = TRUE} sessionInfo() ```