| Title: | A Graph-Based Cross-Fitting Engine in R |
|---|---|
| Description: | Provides a general cross-fitting engine for semiparametric estimation (e.g., double/debiased machine learning). Supports user-defined target functionals and directed acyclic graphs of nuisance learners with per-node training fold widths, target-specific evaluation windows, and fold-allocation modes ("overlap", "disjoint", "independence"). Returns either numeric estimates (mode = "estimate") or cross-fitted prediction functions (mode = "predict"), with configurable aggregation over panels and repetitions, reuse-aware caching, and failure isolation, making it well-suited for simulation studies and large benchmarks. |
| Authors: | Etienne Peyrot [aut, cre] (ORCID: <https://orcid.org/0009-0006-8520-6201>) |
| Maintainer: | Etienne Peyrot <[email protected]> |
| License: | GPL-3 |
| Version: | 0.1.3 |
| Built: | 2026-06-02 09:36:16 UTC |
| Source: | https://github.com/etiennepeyrot/crossfit-r |
Provides a general cross-fitting engine for double / debiased machine learning and other meta-learners. The core functions implement flexible graphs of nuisance models with per-node training fold widths, target-specific evaluation windows, and several fold allocation schemes ("independence", "overlap", "disjoint"). The engine supports both numeric estimators (mode = "estimate") and cross-fitted prediction functions (mode = "predict"), with configurable aggregation over panels and repetitions.
Maintainer: Etienne Peyrot [email protected] (ORCID)
Useful links:
Report bugs at https://github.com/EtiennePeyrot/crossfit-R/issues
Helper to create a method specification for crossfit / crossfit_multi.
A method bundles together a target functional target(), a named list of
nuisance specifications, the cross-fitting geometry (folds, repeats,
eval_fold, mode, fold_allocation), and the panel / repetition aggregation
functions.
create_method(
  target,
  list_nuisance = NULL,
  folds,
  repeats,
  mode = c("estimate", "predict"),
  eval_fold = if (mode == "estimate") 1L else 0L,
  fold_allocation = c("independence", "overlap", "disjoint"),
  aggregate_panels = NULL,
  aggregate_repeats = NULL
)
target |
A function representing the target functional. It must
accept nuisance predictions as arguments (named after nuisances) and
optionally a |
list_nuisance |
Optional named list of nuisance specifications
created by |
folds |
Positive integer giving the number of folds |
repeats |
Positive integer giving the number of repetitions. |
mode |
Cross-fitting mode. Either |
eval_fold |
Integer giving the width (in folds) of the
evaluation window for the target. Must be |
fold_allocation |
Fold allocation strategy; one of
|
aggregate_panels |
Aggregation function for panel-level
results, typically one of |
aggregate_repeats |
Aggregation function for repetition-level
results, typically one of |
The returned list is validated by validate_method() to ensure
structural soundness, but the validated object is not stored: you are
free to modify the returned method before passing it to
crossfit or crossfit_multi.
By default, eval_fold is chosen to be 1L when
mode = "estimate" and 0L when mode = "predict".
If you override eval_fold, it must be a positive integer when
mode = "estimate" and zero when mode = "predict".
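The default and the constraint on eval_fold can be mirrored in a few lines of base R. The helper below, check_eval_fold(), is hypothetical and not part of the package; it only restates the documented rule and makes no claim about what validate_method() actually checks.

```r
# Hypothetical helper (not a package function): mirrors the documented rule
# that eval_fold is a positive integer in "estimate" mode and 0 in "predict" mode.
check_eval_fold <- function(mode = c("estimate", "predict"), eval_fold = NULL) {
  mode <- match.arg(mode)

  # Documented default: 1L for "estimate", 0L for "predict"
  if (is.null(eval_fold)) {
    eval_fold <- if (mode == "estimate") 1L else 0L
  }

  msg <- "eval_fold must be a positive integer for 'estimate' and 0 for 'predict'"

  # Must be a single, non-missing whole number
  is_whole <- is.numeric(eval_fold) && length(eval_fold) == 1L &&
    !is.na(eval_fold) && eval_fold == round(eval_fold)
  if (!is_whole) stop(msg)

  # Mode-specific range check
  in_range <- if (mode == "estimate") eval_fold >= 1 else eval_fold == 0
  if (!in_range) stop(msg)

  as.integer(eval_fold)
}

check_eval_fold("estimate")      # default for "estimate"
check_eval_fold("predict")       # default for "predict"
check_eval_fold("estimate", 2L)  # a wider evaluation window
```

Calling check_eval_fold("predict", 1L) would raise an error, since a non-zero evaluation window contradicts the rule for mode = "predict".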
A method specification list suitable for use in
crossfit or crossfit_multi.
set.seed(1)
n <- 50
x <- rnorm(n)
y <- x + rnorm(n)

# Nuisance: regression for E[Y | X]
nuis_y <- create_nuisance(
  fit = function(data, ...) lm(y ~ x, data = data),
  predict = function(model, data, ...) predict(model, newdata = data)
)

# Target: mean squared error of the nuisance predictor
target_mse <- function(data, nuis_y, ...) {
  mean((data$y - nuis_y)^2)
}

m <- create_method(
  target = target_mse,
  list_nuisance = list(nuis_y = nuis_y),
  folds = 2,
  repeats = 1,
  eval_fold = 1L,
  mode = "estimate",
  fold_allocation = "independence",
  aggregate_panels = mean_estimate,
  aggregate_repeats = mean_estimate
)

str(m)
Helper to create a nuisance specification with basic structural
checks. A nuisance is defined by a fit function, a
predict function, and optional dependency mappings.
create_nuisance(
  fit,
  predict,
  train_fold = 1L,
  fit_deps = NULL,
  pred_deps = NULL
)
fit |
A function |
predict |
A function |
train_fold |
Positive integer giving the width (in folds) of the
training window used for this nuisance. Defaults to |
fit_deps |
Optional named character vector mapping
|
pred_deps |
Optional named character vector mapping
|
A list representing a nuisance specification, suitable for
inclusion in the list_nuisance argument of
create_method.
# Simple linear regression nuisance: E[Y | X]
set.seed(1)
n <- 50
x <- rnorm(n)
y <- x + rnorm(n)

nuis <- create_nuisance(
  fit = function(data, ...) lm(y ~ x, data = data),
  predict = function(model, data, ...) predict(model, newdata = data)
)

str(nuis)
Convenience wrapper around crossfit_multi for the
common case of a single method. It enforces that method is a
single method specification and forwards the aggregation functions
stored inside method_1.
crossfit(
  data,
  method,
  fold_split = function(data, K) sample(rep_len(1:K, nrow(data))),
  seed = NULL,
  max_fail = Inf,
  verbose = FALSE
)
data |
Data frame or matrix with the observations. |
method |
A single method specification (list) created by
|
fold_split |
A function producing a K-fold split of the data
(see |
seed |
Integer base random seed. |
max_fail |
Non-negative integer or |
verbose |
Logical; if |
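The default fold_split in the usage above can be tried on its own in base R. The sketch below assumes that the engine only needs one integer fold label per row, as the default suggests; round_robin_split() is a hypothetical alternative written for illustration and is not a package function.

```r
# The default splitter from the usage above: a random permutation of the
# fold labels 1..K, with fold sizes that differ by at most one.
default_split <- function(data, K) sample(rep_len(1:K, nrow(data)))

# Hypothetical alternative (an assumption, not a package function):
# deterministic round-robin labels, e.g. for rows that are already shuffled.
round_robin_split <- function(data, K) rep_len(1:K, nrow(data))

set.seed(1)
data <- data.frame(x = rnorm(10), y = rnorm(10))

folds <- default_split(data, K = 3)
length(folds)  # one label per row
table(folds)   # fold sizes 4, 3, 3

round_robin_split(data, K = 3)
```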
The same structure as crossfit_multi, but with
a single method named "method". The final estimate is in
$estimates$method.
set.seed(1)
n <- 100
x <- rnorm(n)
y <- x + rnorm(n)
data <- data.frame(x = x, y = y)

# Nuisance: E[Y | X]
nuis_y <- create_nuisance(
  fit = function(data, ...) lm(y ~ x, data = data),
  predict = function(model, data, ...) predict(model, newdata = data)
)

# Target: mean squared error of the nuisance predictor
target_mse <- function(data, nuis_y, ...) {
  mean((data$y - nuis_y)^2)
}

method <- create_method(
  target = target_mse,
  list_nuisance = list(nuis_y = nuis_y),
  folds = 2,
  repeats = 2,
  eval_fold = 1L,
  mode = "estimate",
  fold_allocation = "independence",
  aggregate_panels = mean_estimate,
  aggregate_repeats = mean_estimate
)

cf <- crossfit(data, method)
cf$estimates
Runs cross-fitting for one or more methods defined via create_method and
create_nuisance. This is the main engine: it validates and normalizes
method specifications, builds the global instance graph and fold geometry,
repeatedly draws K-fold splits and evaluates all active methods, and
aggregates results across panels and repetitions.
crossfit_multi(
  data,
  methods,
  fold_split = function(data, K) sample(rep_len(1:K, nrow(data))),
  seed = NULL,
  aggregate_panels = identity,
  aggregate_repeats = identity,
  max_fail = Inf,
  verbose = FALSE
)
data |
Data frame or matrix of size |
methods |
A (named) list of method specifications, typically
created with |
fold_split |
A function of the form |
seed |
Integer base random seed used for the K-fold splits; each
repetition uses |
aggregate_panels |
Function used as the default aggregator
over panels (folds) for each method. It is applied to the list of
per-panel values. Methods can override this via their own
|
aggregate_repeats |
Function used as the default
aggregator over repetitions for each method. It is applied to the
list of per-repetition aggregated values. Methods can override this
via their own |
max_fail |
Non-negative integer or |
verbose |
Logical; if |
Each method can operate in either mode = "estimate" (target
returns numeric values) or mode = "predict" (target returns a
prediction function). Cross-fitting ensures that nuisance models are
always trained on folds disjoint from the folds on which their
predictions are used in the target.
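The disjointness guarantee can be illustrated by doing plain K-fold cross-fitting by hand in base R. This is only a sketch of the principle, not the engine's implementation: it uses a single nuisance and ignores training fold widths, evaluation windows, and fold allocation modes.

```r
set.seed(1)
n <- 100
x <- rnorm(n)
y <- x + rnorm(n)
data <- data.frame(x = x, y = y)

K <- 2
fold <- sample(rep_len(1:K, nrow(data)))  # same form as the default fold_split

# One "panel" per evaluation fold: train on the other fold(s), evaluate on fold k
panel_mse <- vapply(seq_len(K), function(k) {
  train <- data[fold != k, , drop = FALSE]
  test  <- data[fold == k, , drop = FALSE]
  model <- lm(y ~ x, data = train)         # nuisance fit on training folds only
  pred  <- predict(model, newdata = test)  # out-of-fold predictions
  mean((test$y - pred)^2)                  # target evaluated on held-out rows
}, numeric(1))

panel_mse        # one value per panel
mean(panel_mse)  # aggregate over panels
```

No row is ever used both to fit the model and to evaluate the target within the same panel, which is the property the paragraph above describes.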
A list with components:
estimates: Named list of final estimates per method (after aggregating over panels and repetitions).
per_method: For each method, a list with values (per-repetition aggregated results) and errors (error traces).
repeats_done: Number of repetitions successfully completed for each method.
K: Number of folds used in the plan.
K_required: Per-method minimal required K based on their dependency structure.
methods: The validated and normalized method specifications.
plan: The cross-fitting plan produced by build_instances().
set.seed(1)
n <- 100
x <- rnorm(n)
y <- x + rnorm(n)
data <- data.frame(x = x, y = y)

# Shared nuisance: E[Y | X]
nuis_y <- create_nuisance(
  fit = function(data, ...) lm(y ~ x, data = data),
  predict = function(model, data, ...) predict(model, newdata = data)
)

# Method 1: MSE of nuisance predictor
target_mse <- function(data, nuis_y, ...) {
  mean((data$y - nuis_y)^2)
}

# Method 2: mean fitted value
target_mean <- function(data, nuis_y, ...) {
  mean(nuis_y)
}

m1 <- create_method(
  target = target_mse,
  list_nuisance = list(nuis_y = nuis_y),
  folds = 2,
  repeats = 2,
  eval_fold = 1L,
  mode = "estimate",
  fold_allocation = "independence"
)

m2 <- create_method(
  target = target_mean,
  list_nuisance = list(nuis_y = nuis_y),
  folds = 2,
  repeats = 2,
  eval_fold = 1L,
  mode = "estimate",
  fold_allocation = "overlap"
)

cf_multi <- crossfit_multi(
  data = data,
  methods = list(mse = m1, mean = m2),
  aggregate_panels = mean_estimate,
  aggregate_repeats = mean_estimate
)

cf_multi$estimates
These helpers implement simple aggregation schemes for panel-level
and repetition-level estimates in crossfit and
crossfit_multi.
mean_estimate(xs)

median_estimate(xs)
xs |
A list of numeric values or numeric vectors. Elements are
unlisted and concatenated prior to aggregation, so |
In mode = "estimate", each repetition typically produces a list
of numeric values (one per evaluation panel). The functions
mean_estimate() and median_estimate() aggregate such
lists into a single numeric value.
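The two aggregation levels can be traced with base R stand-ins. mean_sketch() and median_sketch() below are hypothetical and follow only the documented behaviour (unlist, then aggregate); the package functions may differ in details such as input checks. The per-panel values are made up for illustration.

```r
# Stand-ins that follow the documented behaviour: unlist, then aggregate.
mean_sketch   <- function(xs) mean(unlist(xs))
median_sketch <- function(xs) median(unlist(xs))

# Hypothetical per-panel values for two repetitions with two panels each
rep_1 <- list(1.0, 3.0)
rep_2 <- list(2.0, 6.0)

# Step 1: aggregate over panels within each repetition
per_repeat <- lapply(list(rep_1, rep_2), mean_sketch)  # 2 and 4

# Step 2: aggregate over repetitions
mean_sketch(per_repeat)    # 3
median_sketch(per_repeat)  # 3
```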
A single numeric value (the mean or median of all entries in
xs).
xs <- list(c(1, 2, 3), 4, c(5, 6))
mean_estimate(xs)

xs <- list(c(1, 100), 10, 20)
median_estimate(xs)
These helpers aggregate several cross-fitted predictors into a single
ensemble predictor. They are designed for methods run with
mode = "predict" in crossfit and
crossfit_multi.
mean_predictor(fs)

median_predictor(fs)
fs |
A list of prediction functions. Each function must accept
at least a |
A function of the form function(newdata, ...), which
returns a numeric vector of predictions. If fs is empty, the
returned function always returns numeric(0).
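One way such an ensemble predictor could be built is sketched below in base R. ensemble_sketch() is hypothetical and not a package function, and it assumes that predictions are combined pointwise (row by row), which the text above does not state explicitly. Only the empty-list behaviour is taken from the documentation.

```r
# Hypothetical stand-in for an ensemble builder (not a package function).
# Assumption: predictions are combined pointwise, row by row, with `agg`.
ensemble_sketch <- function(fs, agg = mean) {
  function(newdata, ...) {
    if (length(fs) == 0L) return(numeric(0))  # documented empty-list behaviour
    # One column of predictions per function in fs
    preds <- do.call(cbind, lapply(fs, function(f) as.numeric(f(newdata, ...))))
    apply(preds, 1, agg)
  }
}

f1 <- function(newdata, ...) newdata$x
f2 <- function(newdata, ...) 2 * newdata$x
f3 <- function(newdata, ...) 10 * newdata$x

newdata <- data.frame(x = 1:3)
ens_mean   <- ensemble_sketch(list(f1, f2, f3), mean)
ens_median <- ensemble_sketch(list(f1, f2, f3), median)

ens_mean(newdata)                  # (1 + 2 + 10) / 3 times x
ens_median(newdata)                # 2 times x, i.e. 2 4 6
ensemble_sketch(list())(newdata)   # numeric(0)
```

With three predictors the mean and median ensembles differ, which shows why a median can be preferable when one cross-fitted predictor is an outlier.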
# Two simple prediction functions of x
f1 <- function(newdata, ...) newdata$x
f2 <- function(newdata, ...) 2 * newdata$x

ens_mean <- mean_predictor(list(f1, f2))
newdata <- data.frame(x = 1:5)
ens_mean(newdata)

# Two simple prediction functions of x
f1 <- function(newdata, ...) newdata$x
f2 <- function(newdata, ...) 2 * newdata$x

ens_median <- median_predictor(list(f1, f2))
newdata <- data.frame(x = 1:5)
ens_median(newdata)