Methods

CalMoE predicts disease-specific survival for seven cancer types using patient age, pathologic tumor stage, histopathology imaging, and gene expression data. This page documents how it works, what it can do, and, equally, what it can't.

What is CalMoE?

CalMoE is a calibrated survival prediction model. Unlike most AI survival tools, which produce rankings without validating whether their probability estimates are reliable, CalMoE passes formal calibration testing: when it says 70%, we have evidence that approximately 70% of similar patients actually survive.

How it works

[Figure: research model architecture (used to train the deployed calculator). Inputs (age + stage clinical features; histopathology WSI, UNI2-h features; gene expression RNA, 50 hallmarks) → Mixture-of-Experts gated fusion → Platt + CiPOT calibration → calibrated survival curve S(t) over 0 to 10 years.]

CalMoE has two tiers. The research model is a Mixture-of-Experts fusion: each modality (age and stage, histopathology, RNA-seq) is encoded by a specialist expert, and a gating network learns how much weight to give each expert for each patient. It produces a full disease-specific survival curve S(t), not a single risk score.

The web calculator runs a smaller, clinical-only model distilled from that research model. To keep absolute probabilities trustworthy, the distilled student is wrapped in a Cox proportional hazards decomposition: a non-parametric Kaplan-Meier baseline per (cancer, stage) stratum supplies the population-level calibration, and the student supplies a per-patient hazard ratio, exp(β·z), that scales the baseline hazard, so that S(t | x) = KM(t)^exp(β·z).
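The gated fusion step of the research model can be illustrated with a minimal sketch. It assumes each expert emits a fixed-length embedding and that the gate produces one logit per expert; the function names, the softmax gate, and the weighted-sum combination are illustrative assumptions, not the actual CalMoE implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gated_fusion(expert_outputs, gate_logits):
    """Fuse per-expert embeddings with per-patient gate weights.

    expert_outputs: one embedding (list of floats) per expert, all the same length.
    gate_logits:    one logit per expert (hypothetical gating-network output).
    Returns the fused embedding and the gate weights that produced it.
    """
    weights = softmax(gate_logits)
    dim = len(expert_outputs[0])
    fused = [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(dim)
    ]
    return fused, weights
```

With equal logits the gate reduces to a plain average of the experts; a large logit for one expert makes the fused embedding follow that expert almost exclusively.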

Calculator inference flow

What actually runs when you press Calculate. The full multimodal research model above is the teacher; the calculator runs the smaller distilled student wrapped in a Cox decomposition that anchors absolute probabilities to empirical TCGA Kaplan-Meier curves.

[Figure: calculator inference flow. Inputs (cancer, age, AJCC stage) → distilled clinical-only model (~600k params, age + stage only) → risk score r(x), combined with the empirical TCGA Kaplan-Meier curve per (cancer, stage) or age band as S(t | x) = KM(t)^exp(β·z) → calibrated S(t) with reliable horizon. Three inference modes: Cox-stratified (large strata), empirical KM only (small strata), age-band KM (no stage data).]
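The formula S(t | x) = KM(t)^exp(β·z) can be made concrete with a short sketch. It assumes the Kaplan-Meier baseline is already evaluated on a time grid and that z is a scalar per-patient score from the student; how the calculator derives z and β is not described here, so both are treated as plain inputs.

```python
import math

def cox_scaled_survival(km_baseline, beta, z):
    """Apply a per-patient hazard ratio to a Kaplan-Meier baseline.

    km_baseline: baseline survival probabilities on a time grid (values in [0, 1]).
    beta, z:     coefficient and per-patient score (assumed scalar inputs).
    Raising S0(t) to the power HR is the same as multiplying the baseline
    cumulative hazard by HR, because S(t) = exp(-H(t)).
    """
    hazard_ratio = math.exp(beta * z)
    return [s ** hazard_ratio for s in km_baseline]
```

A score of z = 0 returns the stratum curve unchanged. A hazard ratio of 2 turns a baseline survival of 0.80 into 0.64, and a ratio below 1 lifts the curve above the baseline.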

Inference modes

Depending on training-data support, the calculator selects one of three inference modes per request:

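  • Cox-stratified - the (cancer, stage) stratum has ≥25 patients and the student exhibits meaningful within-stratum variation. The Kaplan-Meier baseline survival curve is raised to the power of the per-patient hazard ratio, S(t | x) = KM(t)^exp(β·z), which is equivalent to multiplying the baseline cumulative hazard by that ratio.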
  • Empirical Kaplan-Meier - the stratum exists but is too small to support a reliable proportional hazards adjustment, or the student does not meaningfully separate patients within it. The prediction reports the empirical KM curve of the stratum directly, with Greenwood's 95% confidence band.
  • Age-band Kaplan-Meier - for cancer types where TCGA does not record AJCC stage (STAD, HNSC), or where the typed-in stage has too few training patients (e.g. BLCA stage I, n=1). The baseline is stratified by age band instead.
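The selection among the three modes can be sketched as a small decision function. The 25-patient threshold comes from the list above; the cut-off below which a stage stratum is dropped in favor of age bands is not stated on this page, so `min_km_patients` is a hypothetical parameter the caller must supply, and `student_separates` stands in for whatever test of within-stratum variation the calculator applies.

```python
def select_inference_mode(stage_recorded, stratum_size, student_separates,
                          *, min_km_patients, min_cox_patients=25):
    """Pick an inference mode for one request.

    stage_recorded:    False for cohorts without AJCC stage (e.g. STAD, HNSC).
    stratum_size:      training patients in the (cancer, stage) stratum.
    student_separates: whether the student varies meaningfully within the stratum.
    min_km_patients:   hypothetical cut-off, not documented on this page.
    """
    if not stage_recorded or stratum_size < min_km_patients:
        return "age_band_km"       # no usable stage stratum
    if stratum_size >= min_cox_patients and student_separates:
        return "cox_stratified"    # KM baseline scaled by the hazard ratio
    return "empirical_km"          # report the stratum KM curve directly
```

Making `min_km_patients` a required keyword argument keeps the sketch from implying a value that the page does not document.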

Reliable follow-up horizon

For each stratum we identify the last time-grid point where at least 5 patients remained under observation. Survival estimates past that horizon rest on too few patients at risk to be statistically defensible. Beyond the horizon, the calculator draws the curve dashed, marks the affected time-point rows with an asterisk, and reports any horizon-bounded median as a lower bound ("> X years") rather than a point estimate.
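A minimal sketch of the horizon rule follows. It assumes a patient counts as "under observation" at grid time t when their follow-up time (to event or censoring) is at least t; the function name and that at-risk convention are assumptions, not taken from the calculator's code.

```python
def reliable_horizon(followup_years, time_grid, min_at_risk=5):
    """Return the last grid time with at least `min_at_risk` patients at risk.

    followup_years: follow-up time per patient, to event or censoring.
    time_grid:      evaluation times for the survival curve.
    Returns None if even the first grid point has too few patients.
    """
    horizon = None
    for t in sorted(time_grid):
        at_risk = sum(1 for f in followup_years if f >= t)
        if at_risk < min_at_risk:
            break  # the at-risk count only shrinks as t grows
        horizon = t
    return horizon
```

Grid points later than the returned value are the ones the calculator would flag as beyond the reliable horizon.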

Plain-language definition of calibration: if our model tells 100 patients they each have a 70% chance of 5-year survival, approximately 70 of them will actually survive 5 years. Most AI survival models are never tested for this property.
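The plain-language definition translates into a simple binned check, sketched below. It is a simplification that assumes every patient's survival status at the chosen horizon is known (no censoring before the horizon), which is not how the formal tests on this page handle censored data; the function name and equal-width bins are illustrative choices.

```python
def calibration_table(predicted, survived, n_bins=5):
    """Compare mean predicted survival with the observed survival fraction.

    predicted: predicted survival probabilities at a fixed horizon, in [0, 1].
    survived:  1 if the patient survived to the horizon, else 0
               (assumes status is known for everyone, i.e. no censoring).
    Returns one (count, mean_predicted, observed_fraction) row per non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(predicted, survived):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, y))
    rows = []
    for group in bins:
        if not group:
            continue
        mean_pred = sum(p for p, _ in group) / len(group)
        observed = sum(y for _, y in group) / len(group)
        rows.append((len(group), mean_pred, observed))
    return rows
```

For a well-calibrated model, the second and third entries of each row are close to each other in every bin.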

Performance

Held-out concordance index (C-index) across 5-fold site-stratified cross-validation on TCGA cohorts, disease-specific survival endpoint. The CalMoE column shows the multimodal research model (a heterogeneous ensemble of 5 FiLM-modulated teachers and the production HySurv teacher, rank-averaged) used to train the deployed calculator. MMP, SurvPath and MCAT numbers are taken from the published reports. Higher is better.

Cohort                     CalMoE   MMP (2024)    SurvPath (2024)  MCAT (2021)
BRCA                       0.834    0.753         0.709            0.648
KIRC                       0.903    0.748         0.738            0.670
COADREAD                   0.852    0.636         0.539            0.578
LUAD                       0.755    0.643         0.612            0.615
BLCA                       0.742    0.628         0.619            0.619
STAD                       0.679    0.580         0.556            0.528
HNSC                       0.674    -             0.600            0.531
Mean (6 shared)            0.794    0.665         0.629            0.610
Calibration (BH-adjusted)  27 / 30  not reported  not reported     not reported
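For readers unfamiliar with the metric in the table, the sketch below computes a concordance index by direct pair counting. It assumes Harrell's definition, with risk ties counted as half-concordant and pairs tied in time skipped; the page does not state which C-index variant was used, so treat this as an illustration of the idea rather than the evaluation code.

```python
def concordance_index(times, events, risks):
    """Harrell-style C-index (assumed variant) by O(n^2) pair counting.

    times:  follow-up time per patient.
    events: 1 if the disease-specific event was observed, 0 if censored.
    risks:  predicted risk scores, higher meaning worse prognosis.
    A pair is comparable when the patient with the shorter time had an event.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. The C-index measures ordering only, which is why the calibration row is reported separately.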

Training data

  • The Cancer Genome Atlas (TCGA) cohorts, seven cancer types (BRCA, BLCA, LUAD, KIRC, STAD, COADREAD, HNSC)
  • Disease-Specific Survival (DSS) endpoint
  • 5-fold site-stratified cross-validation (no patient or acquisition site appears in both train and test)
  • UNI2-h feature extractor for whole-slide histopathology
  • 50 hallmark gene pathway signatures for genomics
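The site constraint on the cross-validation folds can be sketched as a grouped split: whole sites, never individual patients, are assigned to folds. The greedy size balancing and the function name below are assumptions for illustration; the page does not describe how folds were balanced, and this sketch does not balance event rates.

```python
from collections import defaultdict

def site_grouped_folds(patient_sites, n_folds=5):
    """Assign patients to folds so that no site spans two folds.

    patient_sites: dict mapping patient id -> acquisition site id (strings).
    Returns a dict mapping patient id -> fold index in [0, n_folds).
    """
    by_site = defaultdict(list)
    for patient, site in patient_sites.items():
        by_site[site].append(patient)

    fold_sizes = [0] * n_folds
    assignment = {}
    # Largest sites first, each placed in the currently smallest fold.
    for site in sorted(by_site, key=lambda s: (-len(by_site[s]), s)):
        fold = fold_sizes.index(min(fold_sizes))
        for patient in by_site[site]:
            assignment[patient] = fold
        fold_sizes[fold] += len(by_site[site])
    return assignment
```

Because each patient belongs to exactly one site, keeping sites intact also guarantees that no patient appears in both train and test.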

Limitations

  • Trained on TCGA data (US academic centers, primarily White / European patients). May not generalize to other populations.
  • The web calculator runs a distilled clinical-only model on patient age and AJCC stage. Absolute survival probabilities are anchored to non-parametric Kaplan-Meier curves estimated from the matching (cancer, stage) stratum of the TCGA training cohort, so they reflect the observed disease-specific survival of training patients in that stratum rather than free-running model output. For STAD and HNSC, where TCGA does not record AJCC stage, the baseline is stratified by age band instead. Whole-slide histopathology and RNA-seq are used only by the upstream research model and are not processed at inference time.
  • Predictions are population-level estimates, not individual guarantees.
  • The model does not account for treatment effects - no treatment-specific predictions.
  • Not validated prospectively.

Calibration methodology

The methods below describe how the upstream multimodal research model was tested for calibration. The web calculator inherits population-level calibration by construction - its absolute survival probabilities are the empirical TCGA Kaplan-Meier curves of the matching stratum, optionally scaled by a per-patient hazard ratio.

  • 1-calibration (Hosmer-Lemeshow) evaluated at the median event time per cohort
  • Benjamini-Hochberg FDR correction across 30 fold-by-cohort tests
  • Platt scaling fit on training folds and applied to held-out validation (research model only)
  • CiPOT conformal post-hoc adjustment for prediction-interval coverage (research model only)
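The multiple-testing step in the list above can be sketched as follows. The function computes Benjamini-Hochberg adjusted p-values for a list of per-test p-values, such as one Hosmer-Lemeshow p-value per fold and cohort. The helper for counting passing tests assumes a significance level of 0.05, which this page does not state, and assumes a test "passes" when calibration is not rejected.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

def count_calibrated(p_values, alpha=0.05):
    """Count tests where calibration is not rejected after adjustment.

    alpha=0.05 is an assumed threshold, not taken from this page.
    """
    return sum(1 for p in benjamini_hochberg(p_values) if p >= alpha)
```

In a calibration test the null hypothesis is that the model is calibrated, so a large adjusted p-value is the favorable outcome.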