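DICErClust provides an R implementation of Deep Significance Clustering (DICE), a self-supervised framework that discovers clinically meaningful patient subgroups from electronic health record (EHR) data. Unlike conventional unsupervised clustering, DICE simultaneously optimises four objectives: autoencoder reconstruction, cluster assignment, outcome prediction, and a likelihood-ratio-test (LRT) significance penalty on the outcome differences between clusters.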
The result is a partition into risk-stratified subgroups that are both data-driven and statistically validated.
Reference: Huang Y, Du C, Zhu F, et al. (2021). Self-supervised deep clustering of patient subgroups for heart failure with preserved ejection fraction. J Am Med Inform Assoc, 28, 2394–2403. doi:10.1093/jamia/ocab203
```r
## From a local source tarball:
install.packages(
  "/path/to/DICErClust_0.1.1.tar.gz",
  repos = NULL, type = "source"
)
```

DICErClust depends on the torch package for R. If you have not installed torch before, run `torch::install_torch()` once after installing the package.
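A guarded version of that one-time setup might look like the sketch below. `ensure_torch()` is a hypothetical helper name, not part of DICErClust; it relies only on `torch::torch_is_installed()` and `torch::install_torch()` from the torch package.

```r
# Hypothetical helper (not part of DICErClust): install the torch R package
# and its backend libraries only when they are missing.
ensure_torch <- function() {
  if (!requireNamespace("torch", quietly = TRUE)) {
    install.packages("torch")
  }
  if (!torch::torch_is_installed()) {
    torch::install_torch()  # fetches the backend binaries; needed once
  }
  invisible(torch::torch_is_installed())
}

# ensure_torch()  # uncomment to run; may trigger a large download
```

The call is left commented out so that sourcing the snippet has no side effects.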
DICEr() reads training and test data from RDS files.
Each file must be a length-3 list:
| Position | Name | Type | Description |
|---|---|---|---|
| `[[1]]` | `data_x` | numeric matrix n × p | Continuous features; LSTM encoder input |
| `[[2]]` | `data_v` | numeric matrix n × q | Binary demographic covariates; outcome-head auxiliary input |
| `[[3]]` | `data_y` | integer vector length n | Binary outcome (0/1) |
Important: `data_v` must use R `numeric` (float64) storage, not `integer`. The `torch` backend infers tensor dtype from the R storage mode; integer columns produce int64 tensors that are incompatible with the float32 model weights.
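The format rules above can be checked before training. The sketch below uses only base R; `check_dice_input()` is a hypothetical helper, not a DICErClust function, and the checks simply restate the table and the storage-mode note. It also shows one way to convert an integer covariate matrix to double storage without losing its dimensions.

```r
# Hypothetical validator (not part of DICErClust) for the length-3 input list.
check_dice_input <- function(d) {
  stopifnot(is.list(d), length(d) == 3L)
  x <- d[[1]]; v <- d[[2]]; y <- d[[3]]
  stopifnot(
    is.matrix(x), is.numeric(x),          # data_x: numeric matrix n x p
    is.matrix(v),
    storage.mode(v) == "double",          # data_v: must not be integer storage
    nrow(x) == nrow(v),
    nrow(x) == length(y),                 # one outcome per row
    all(v %in% c(0, 1)),                  # binary covariates
    all(y %in% c(0, 1))                   # binary outcome
  )
  invisible(TRUE)
}

# Convert integer-typed covariates to double, keeping the matrix shape
v_int <- matrix(c(0L, 1L, 1L, 0L), nrow = 2)
v_num <- v_int
storage.mode(v_num) <- "double"

d_ok <- list(matrix(runif(4), 2, 2), v_num, c(0L, 1L))
check_dice_input(d_ok)
```

Passing the integer matrix `v_int` instead of `v_num` makes the check stop with an error, which is the situation the note above warns about.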
```r
## Build a minimal synthetic dataset ----------------------------------------
set.seed(42)
n_train <- 120L; n_test <- 40L; p <- 6L; q <- 3L

make_rds <- function(n, path) {
  saveRDS(
    list(
      matrix(runif(n * p), n, p),                       # data_x: continuous
      matrix(as.numeric(rbinom(n * q, 1, 0.5)), n, q),  # data_v: binary float
      rbinom(n, 1, 0.3)                                 # data_y: outcome
    ),
    path
  )
}

data_dir <- file.path(tempdir(), "dice_intro")
dir.create(data_dir, showWarnings = FALSE)
make_rds(n_train, file.path(data_dir, "train.rds"))
make_rds(n_test,  file.path(data_dir, "test.rds"))

## Verify format
d <- readRDS(file.path(data_dir, "train.rds"))
cat("data_x:", nrow(d[[1]]), "×", ncol(d[[1]]), " storage:", storage.mode(d[[1]]), "\n")
cat("data_v:", nrow(d[[2]]), "×", ncol(d[[2]]), " storage:", storage.mode(d[[2]]), "\n")
cat("data_y: length", length(d[[3]]), " table:", paste(table(d[[3]]), collapse = "/"), "\n")
```

```r
library(DICErClust)

args <- list(
  seed              = 42L,
  input_path        = data_dir,
  filename_train    = "train.rds",
  filename_test     = "test.rds",
  n_input_fea       = p,      # columns in data_x
  n_hidden_fea      = 3L,     # LSTM latent dimension
  lstm_layer        = 1L,
  lstm_dropout      = 0.0,
  K_clusters        = 2L,     # number of clusters
  n_dummy_demov_fea = q,      # columns in data_v
  cuda              = FALSE,  # set TRUE to use GPU
  lr                = 1e-4,
  init_AE_epoch     = 5L,     # Stage 1 warm-up epochs
  iter              = 20L,    # Stage 2 iterations
  epoch_in_iter     = 2L,
  lambda_AE         = 1.0,
  lambda_classifier = 1.0,
  lambda_outcome    = 1.0,
  lambda_p_value    = 1.0
)

old_wd <- setwd(tempdir())
DICEr(args)  # writes output to hn_3_K_2/part2_AE_nhidden_3/
setwd(old_wd)
```

```r
part2_dir <- file.path(tempdir(), "hn_3_K_2", "part2_AE_nhidden_3")
res_train <- readRDS(file.path(part2_dir, "data_train_iter.rds"))
res_test  <- readRDS(file.path(part2_dir, "data_test_iter.rds"))

## Cluster assignments
## Training set: use res_train$C (k-means labels, re-ordered by outcome rate)
## Test set: use res_test$pred_C (nearest-centroid assignments)
table(res_test$pred_C)
```

| Argument | Default | Effect |
|---|---|---|
| `n_hidden_fea` | none | LSTM latent dimension; controls representation capacity |
| `K_clusters` | none | Number of clusters |
| `init_AE_epoch` | 5 | Stage 1 warm-up length |
| `iter` | 20 | Maximum Stage 2 iterations |
| `epoch_in_iter` | 1 | Gradient-update epochs per iteration |
| `lr` | 1e-4 | Adam learning rate |
| `lambda_AE` | 1.0 | Weight on reconstruction loss |
| `lambda_classifier` | 1.0 | Weight on cluster-assignment loss |
| `lambda_outcome` | 1.0 | Weight on outcome BCE loss |
| `lambda_p_value` | 1.0 | Weight on LRT significance penalty |
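Because `DICEr()` takes a plain R list, tuning experiments can start from one list of defaults and override only what changes. The sketch below uses base R's `modifyList()`; `dice_defaults` and `make_dice_args()` are hypothetical names, and the values are copied from the table above. Whether `DICEr()` fills in missing entries itself is not stated here, so the remaining entries (paths, seed, feature counts, LSTM settings) would still need to be added before calling it.

```r
# Defaults copied from the table above; arguments without a listed default
# (n_hidden_fea, K_clusters) are left for the caller to supply.
dice_defaults <- list(
  init_AE_epoch = 5L, iter = 20L, epoch_in_iter = 1L, lr = 1e-4,
  lambda_AE = 1.0, lambda_classifier = 1.0,
  lambda_outcome = 1.0, lambda_p_value = 1.0
)

# Hypothetical helper: overlay user-specified values on the defaults
make_dice_args <- function(...) modifyList(dice_defaults, list(...))

# Example: weight the significance penalty more heavily and train for longer
args_tuned <- make_dice_args(
  n_hidden_fea = 3L, K_clusters = 2L,
  lambda_p_value = 2.0, iter = 50L
)
str(args_tuned[c("lambda_p_value", "iter", "lr")])
```

Keeping the defaults in one place makes it easier to see which settings differ between runs.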
All four lambda weights default to 1.0, so each loss term enters the combined objective with the same weight. The LRT significance threshold is fixed at 3.841 (the χ²₁ critical value at α = 0.05) and is not user-tunable.
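The threshold value can be reproduced in base R, and a toy likelihood-ratio test shows how a statistic is compared against it. This is only an illustration: the choice of models (an intercept-only logistic regression against one with a cluster indicator) is an assumption made for the example, not a description of how DICErClust computes its penalty.

```r
# Critical value of a chi-squared distribution with 1 df at alpha = 0.05
crit <- qchisq(0.95, df = 1)
round(crit, 3)

# Toy data: two clusters with different outcome probabilities
set.seed(1)
cluster <- rep(0:1, each = 100)
y <- rbinom(200, 1, ifelse(cluster == 1, 0.6, 0.2))

# Assumed null and alternative models for the illustration
fit0 <- glm(y ~ 1,       family = binomial())
fit1 <- glm(y ~ cluster, family = binomial())

# LRT statistic: twice the gain in log-likelihood from adding the cluster term
lrt_stat <- as.numeric(2 * (logLik(fit1) - logLik(fit0)))
p_value  <- pchisq(lrt_stat, df = 1, lower.tail = FALSE)

lrt_stat > crit  # TRUE means the outcome difference is significant at 0.05
```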
After a successful run, DICEr creates:
```
<working_dir>/
└── hn_<n_hidden>_K_<K>/
    ├── part1_AE_nhidden_<n>/      # Stage 1 autoencoder outputs
    │   └── part1_loss_AE.png
    └── part2_AE_nhidden_<n>/      # Stage 2 best checkpoint
        ├── data_train_iter.rds    # training set with C assignments
        └── data_test_iter.rds     # test set with pred_C assignments
```
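The Stage 2 directory name can be rebuilt from the arguments instead of being typed by hand. The helper below is hypothetical (not part of DICErClust) and assumes the naming pattern shown in the tree, with `<n_hidden>` and `<n>` both equal to `n_hidden_fea` and `<K>` equal to `K_clusters`, as in the `hn_3_K_2/part2_AE_nhidden_3` example.

```r
# Hypothetical helper (not part of DICErClust): rebuild the Stage 2 output
# directory from the two arguments that appear in its name.
dice_output_dir <- function(n_hidden_fea, K_clusters, base = getwd()) {
  file.path(
    base,
    sprintf("hn_%d_K_%d", n_hidden_fea, K_clusters),
    sprintf("part2_AE_nhidden_%d", n_hidden_fea)
  )
}

out_dir <- dice_output_dir(3L, 2L, base = "run")
out_dir
# [1] "run/hn_3_K_2/part2_AE_nhidden_3"
```

Set `base` to the working directory that was active when `DICEr()` ran.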
`data_train_iter.rds` and `data_test_iter.rds` are the data lists enriched with cluster assignment fields:

- `$C`: k-means cluster labels (training set; cluster 0 = highest mortality)
- `$pred_C`: nearest-centroid labels for the test set (use this, not `$C`, for test-set evaluation)

For a complete end-to-end analysis on the UCI Heart Failure Clinical Records dataset, including preprocessing, training, cluster evaluation (AUC = 0.823, χ² = 32.99, p < 0.001), and publication-quality figures, see:

```r
vignette("heart-failure-example", package = "DICErClust")
```
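For a quick check before working through the vignette, the per-cluster outcome rate can be tabulated in base R. `cluster_outcome_rates()` is a hypothetical helper, and the toy vectors stand in for `res_test$pred_C` and the test-set outcome. Where the outcome is stored inside the result list is not documented here, so the sketch takes it as an explicit argument.

```r
# Hypothetical helper: cluster size and outcome rate per cluster label
cluster_outcome_rates <- function(labels, y) {
  stopifnot(length(labels) == length(y))
  data.frame(
    cluster = sort(unique(labels)),
    n       = as.vector(table(labels)),
    rate    = as.vector(tapply(y, labels, mean))
  )
}

# Toy vectors standing in for res_test$pred_C and the test-set outcome
pred_C <- c(0, 0, 0, 1, 1, 1, 1)
y_test <- c(1, 1, 0, 0, 0, 1, 0)

rates <- cluster_outcome_rates(pred_C, y_test)
rates
```

In this toy input, cluster 0 has the higher outcome rate (2 of 3 against 1 of 4), which matches the ordering convention described for `$C`.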