Forecasting Cognitive Decline & Incident Dementia.
Calibrated 2-year incident-dementia forecasting on the Health and Retirement Study, measured against the Langa-Weir silver-label proxy — a research classification, not a clinical diagnosis. Public data, the discipline shown openly. Calibration is the headline metric. The simplicity is the result.
By Michael Key · ORCID
The cohort, the harmonized variables, and the cognitive outcome here belong to others. The Health and Retirement Study (HRS) is sponsored by the National Institute on Aging (NIA grant U01AG009740) and conducted by the University of Michigan. The harmonized longitudinal variables come from the RAND HRS Longitudinal file. The cognitive-status outcome — the classification of respondents into Normal / Cognitive Impairment No Dementia (CIND) / Dementia strata — is the Langa-Weir classification (a probabilistic research proxy for dementia status, not a clinical diagnosis), developed by Kenneth Langa and David Weir. This study’s contribution is not new data or a new measurement. It is rigorous, calibrated, honest modeling on that foundation: a model that says how sure it is, and is right about that. All data are public; any researcher can run the same analysis on the same files. This is not a clinical diagnostic, not individual advice, and not a statement about any single person.
Stated once here. Reminded only where decision-controlling.
1. Silver label. Every metric is measured against the Langa-Weir classification — a probabilistic algorithmic proxy for dementia, 18–62% sensitive versus the ADAMS clinical gold standard (the face-to-face neuropsychological assessment) and demographically biased: it over-diagnoses Black and Hispanic respondents (Crimmins 2011; Gianattasio 2019). With no ADAMS ground truth in this dataset, it is impossible to decompose measured performance into label-bias-versus-biology. Read every “calibrated” and every AUROC reported here as to the silver label.
2. Self-respondent selection. The primary analysis keeps only people still self-respondents at the next interview — and decline drives the switch to a proxy. The self-respondent share at baseline falls from 0.962 (Normal) to 0.575 (Demented); among self-respondents re-interviewed, the self→proxy transition rate by next wave climbs from 0.013 (Normal) to 0.178 (Demented). The self→proxy filter alone dropped 2,233 person-periods carrying 1,282 imputed-dementia events. Every metric below is optimistic for the full at-risk population, most so among the highest-risk. An imputed-inclusive run is reported as a parallel anchor throughout; it bounds the direction and rough size of the selection bias, though it carries its own caveat (the imputation model uses adjacent-wave cognition scores, so it may carry entry-wave future information).
3. Pooled AUROC is separation-inflated. The pooled AUROC is lifted by the model separating Normal from CIND — a baseline-status feature, not foresight into who will convert. The honest discrimination figure is within-stratum; every pooled AUROC reported here is labeled as such and the within-stratum numbers are always the ones to trust.
A cognition-only model, calibrated to the silver label, whose simplicity is the result.
The model and the holdout
A discrete-time pooled-logistic on age, cognitive score, and lagged baseline status, evaluated on an out-of-time holdout (W15 2020 → W16 2022): 5,022 person-periods, 195 incident events, base rate 0.039.
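To make the model class concrete, here is a minimal sketch of a discrete-time pooled logistic fitted to person-period rows (one row per person per interval at risk). Everything in it is an assumption for illustration: the data are synthetic, the variable names are hypothetical, and the small Newton-Raphson fitter stands in for whatever library the study actually used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic person-period rows (hypothetical; not HRS data).
n = 8000
age = rng.uniform(65, 90, n)                 # age at the start of the interval
cog = rng.normal(15, 4, n).clip(0, 27)       # 27-point cognitive score
cind = (cog < 12).astype(float)              # lagged baseline status: 1 = CIND

# Simulated 2-year hazard: rises with age and CIND, falls with cognition.
true_logit = -4.0 + 0.06 * (age - 75) - 0.25 * (cog - 15) + 0.8 * cind
event = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

def fit_logistic(X, y, n_iter=25):
    """Fit an unpenalized logistic regression by Newton-Raphson (IRLS)."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X1 @ beta))
        w = p * (1 - p)
        beta += np.linalg.solve(X1.T @ (X1 * w[:, None]), X1.T @ (y - p))
    return beta

# Centered features keep the intercept interpretable and the fit stable.
X = np.column_stack([age - 75, cog - 15, cind])
beta = fit_logistic(X, event)

# Predicted 2-year hazard for every person-period row.
hazard = 1 / (1 + np.exp(-(beta[0] + X @ beta[1:])))
```

"Pooled" here means a single set of coefficients is shared across all intervals; each row contributes one Bernoulli outcome for its own interval.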
Calibration first
The cognition-only model is well-calibrated to the silver label — calibration slope 0.89 on the temporal holdout. ECE is 0.0097, though at a base rate of 0.039 ECE is mechanically small; the slope is the lead metric. Person-grouped cross-validation confirms this is not a one-wave fluke: slope 1.00, ECE 0.0011, AUROC 0.873.
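The two calibration metrics named above can be sketched as follows. This is an illustrative implementation on synthetic predictions, not the study's code: the calibration slope is taken as the coefficient from a logistic regression of the outcome on the logit of the predicted probability, and ECE uses ten equal-width bins (the binning scheme is an assumption). A slope below 1 indicates predictions that are too extreme (over-confident).

```python
import numpy as np

def calibration_slope(y, p, n_iter=50):
    """Slope from a logistic regression of the outcome on logit(p)."""
    p = np.clip(p, 1e-6, 1 - 1e-6)
    lp = np.log(p / (1 - p))
    # Centering the predictor leaves the slope unchanged and stabilizes Newton steps.
    X = np.column_stack([np.ones(len(lp)), lp - lp.mean()])
    beta = np.zeros(2)
    for _ in range(n_iter):
        mu = 1 / (1 + np.exp(-X @ beta))
        w = mu * (1 - mu)
        beta += np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (y - mu))
    return beta[1]

def expected_calibration_error(y, p, n_bins=10):
    """Equal-width-bin ECE: bin-weighted |observed rate - mean prediction|."""
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in np.unique(bins):
        m = bins == b
        ece += m.mean() * abs(y[m].mean() - p[m].mean())
    return ece

# Synthetic low-base-rate outcome with known true probabilities.
rng = np.random.default_rng(1)
logit = rng.normal(-3.5, 1.0, 20000)
p_true = 1 / (1 + np.exp(-logit))
y = rng.binomial(1, p_true)

# A deliberately over-confident model: its logits are 1.3x too spread out.
p_over = 1 / (1 + np.exp(-(1.3 * (logit + 3.5) - 3.5)))

slope_true = calibration_slope(y, p_true)   # should sit near 1
slope_over = calibration_slope(y, p_over)   # should sit near 1 / 1.3
ece_true = expected_calibration_error(y, p_true)
ece_over = expected_calibration_error(y, p_over)
```

With a rare outcome almost all rows fall in the lowest bin, which is why ECE stays small in absolute terms even for a miscalibrated model and why the slope is the more informative lead metric.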
Discrimination: pooled is optimistic, within-stratum is honest
Pooled AUROC on the temporal holdout: 0.850 [0.821, 0.880]. This sits at the optimistic top of what real externally-validated dementia-risk models reach — external-validation pooled c-statistics run in the low-to-mid 0.70s for the best-validated scores, reaching into the low 0.80s for the strongest EHR models (Stephan et al. 2026, BMC Medicine, doi:10.1186/s12916-026-04652-y) — and it is separation-inflated (caveat 3). The honest within-stratum discrimination is 0.733 (Normal) / 0.675 (CIND) — consistent with that real-world range: genuinely hard, not under a benchmark. Note where the skill is weakest: discrimination is worst in the higher-risk CIND stratum (0.675), the stratum nearest the threshold and most directly relevant to a family or clinician.
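The separation-inflation effect can be reproduced on synthetic data. In the sketch below (hypothetical numbers, not the study's), two strata differ in base rate and the model's score combines the stratum offset with a modest within-stratum signal. The pooled AUROC gets credit for ranking the high-base-rate stratum above the low one, while the within-stratum AUROCs reflect only the within-stratum signal.

```python
import numpy as np

def auroc(y, score):
    """AUROC via the Mann-Whitney rank-sum statistic (assumes no tied scores)."""
    ranks = np.empty(len(score))
    ranks[np.argsort(score)] = np.arange(1, len(score) + 1)
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(2)
n = 20000
cind = rng.random(n) < 0.2                          # 20% in the higher-risk stratum
signal = rng.normal(0, 1, n)                        # within-stratum risk signal
logit = np.where(cind, -2.0, -4.0) + 0.7 * signal   # stratum offset plus signal
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))
score = logit                                       # the model's score equals the true logit here

pooled = auroc(y, score)
within_normal = auroc(y[~cind], score[~cind])
within_cind = auroc(y[cind], score[cind])
```

Even with a perfectly specified score, `pooled` exceeds both within-stratum values, because part of the pooled ranking is just the baseline-status split.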
The simplicity is the finding — the ADM thesis
Three escalations were tested against the cognition-only rung, each with a paired Δ-AUROC on the same holdout rows and a TOST equivalence test at a pre-registered Δ = 0.02 decision bar:
- Penalization (elastic-net). The fuller feature set adds a paired Δ-AUROC of +0.0196 [0.011, 0.028] — real (excludes zero), but sits just under the 0.02 bar and is not certifiable as equivalent (TOST inconclusive at 195 events).
- Gradient boosting (HistGBM). Paired Δ-AUROC +0.018 [0.005, 0.032] — again real, again below the bar. GBM is statistically indistinguishable from the tuned linear full models.
- Recalibration. The raw model is already well-calibrated (slope 0.89). Platt recalibration is a wash; isotonic shaves pooled ECE marginally but worsens the slope and delivers no decision-relevant improvement over an already-calibrated model.
The ADM conclusion — stated honestly, not as a tie: across all comparisons, complexity adds a small but statistically real gain (every paired 95% CI excluding zero) that consistently fails to clear the 0.02 decision bar, with no consistent or decision-relevant calibration improvement. The sample of 195 events is underpowered to certify formal equivalence (TOST inconclusive). The right-fidelity reading is not “they tie” but “the extra fidelity is real and too small to matter for the decision” — so the simple model is the right one.
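The decision logic described above (a paired Δ-AUROC on the same rows, a person-cluster bootstrap interval, and an equivalence check at a 0.02 bar) can be sketched as below. This is an assumption-laden illustration on synthetic data: the TOST is approximated by checking whether a 90% percentile-bootstrap interval lies inside ±0.02, which is the confidence-interval-inclusion form of a two one-sided test at α = 0.05, and it may differ from the exact procedure the study used.

```python
import numpy as np

def auroc(y, score):
    """AUROC via the Mann-Whitney rank-sum statistic.

    Tied scores get arbitrary adjacent ranks. That is harmless here because the
    only ties are bootstrap duplicates of one row, which share the same label.
    """
    ranks = np.empty(len(score))
    ranks[np.argsort(score)] = np.arange(1, len(score) + 1)
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def cluster_bootstrap_delta(y, s_simple, s_complex, person, n_boot=500, seed=0):
    """Resample whole persons; return draws of AUROC(complex) - AUROC(simple)."""
    rng = np.random.default_rng(seed)
    order = np.argsort(person, kind="stable")
    _, starts = np.unique(person[order], return_index=True)
    groups = np.split(order, starts[1:])            # row indices per person
    draws = np.empty(n_boot)
    for b in range(n_boot):
        picked = rng.integers(0, len(groups), len(groups))
        idx = np.concatenate([groups[g] for g in picked])
        # Both models are scored on the same resampled rows (paired comparison).
        draws[b] = auroc(y[idx], s_complex[idx]) - auroc(y[idx], s_simple[idx])
    return draws

# Synthetic holdout: 1 to 3 rows per person (hypothetical, not HRS data).
rng = np.random.default_rng(4)
n_person = 3000
person = np.repeat(np.arange(n_person), rng.integers(1, 4, n_person))
n = len(person)
a = rng.normal(size=n)      # signal seen by both models
b = rng.normal(size=n)      # extra signal seen only by the complex model
y = rng.binomial(1, 1 / (1 + np.exp(-(-3.0 + 1.0 * a + 0.4 * b))))
s_simple, s_complex = a, a + 0.4 * b

delta = auroc(y, s_complex) - auroc(y, s_simple)
draws = cluster_bootstrap_delta(y, s_simple, s_complex, person)
ci95 = np.percentile(draws, [2.5, 97.5])
ci90 = np.percentile(draws, [5, 95])

BAR = 0.02                                          # decision bar from the text
is_real = ci95[0] > 0 or ci95[1] < 0                # 95% CI excludes zero
clears_bar = delta >= BAR                           # point estimate reaches the bar
equivalent = -BAR < ci90[0] and ci90[1] < BAR       # equivalence certified only if the whole 90% CI is inside the bar
```

The three flags are deliberately separate: a gain can be real (`is_real`), fall short of the bar (`not clears_bar`), and still not be certifiable as equivalent (`not equivalent`) when the interval is wide, which is the pattern the text reports at 195 events.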
- Self-respondent selection (caveat 2 applies here). The self→proxy filter drops 2,233 person-periods carrying 1,282 imputed-dementia events; the highest-risk converters exit the observed primary earliest. Every metric above is optimistic for the full population.
- Riley power caveat. The primary 195-event holdout sits just under the Riley ≥200/≥200 calibration-sample rule (riley_ok=False). The ≥50 and imputed-inclusive variants (248 / 329 events) corroborate every direction.
- Pooled AUROC is separation-inflated (caveat 3). The 0.850 pooled figure reflects Normal-vs-CIND separation. The 0.733 / 0.675 within-stratum numbers are the ones to trust.
- All metrics are to the silver label (caveat 1). No ADAMS ground truth is available to decompose label-bias from biology.
Death is a competing risk, and it moves the absolute number.
A naïve estimator that censors death as if it were random overstates cumulative incidence — this is a known identity from competing-risks theory, quantified here for this cohort. The overstatement scales with the competing death rate, and death competes hardest for exactly the people most at risk of dementia.
Two-year mortality in the CIND stratum is 0.136 versus 0.052 in Normal — a more than two-fold difference. The practical consequence:
- Overall: naïve incidence 0.0319 → competing-risk 0.0297, a relative overstatement of ~7%.
- CIND stratum: 0.1343 → 0.1160, a relative overstatement of ~14% — because death competes hardest where dementia risk is highest.
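The arithmetic behind the two bullets can be checked directly from the reported incidences. The short sketch below computes the shift against both possible denominators; the quoted ~7% and ~14% match the shift expressed as a fraction of the naïve figure, which appears to be the convention used here.

```python
def relative_overstatement(naive, competing):
    """Return the shift as a fraction of (naive figure, competing-risk figure)."""
    shift = naive - competing
    return shift / naive, shift / competing

# Reported 2-year incidences from the text.
overall_vs_naive, overall_vs_cr = relative_overstatement(0.0319, 0.0297)
cind_vs_naive, cind_vs_cr = relative_overstatement(0.1343, 0.1160)

print(f"overall: {overall_vs_naive:.3f} of naive, {overall_vs_cr:.3f} of competing-risk")
print(f"CIND:    {cind_vs_naive:.3f} of naive, {cind_vs_cr:.3f} of competing-risk")
```

Measured against the competing-risk figure instead, the CIND overstatement would be closer to 16%, so the denominator is worth stating whenever these percentages are quoted.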
The discrete-time multinomial that models death as an explicit third outcome is itself well-calibrated to the silver label: pooled AUROC 0.856 (separation-inflated; within-stratum 0.768 Normal / 0.693 CIND).
The load-bearing, planning-relevant takeaway is the relative correction, not the absolute level. The absolute level carries the self-respondent-selection bias of caveat 2 (a larger, downward effect that largely cancels in the relative shift but not in the level). A death-censored incidence figure is biased upward — most so in the stratum where both risk and urgency are highest.
This is a known competing-risks identity, quantified for this cohort — not a new discovery. Its value here is that the overstatement is real, non-trivial (~14% in CIND), and calibrated to the silver label. The “shift ≈ death-rate identity” flag in the analysis serves as a sanity check, not independent evidence. The absolute incidence level is optimistic (caveat 2); the relative correction is not.
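One way to see why the shift tracks the death rate is a single-interval simplification, sketched below. The assumption (mine, not confirmed by the source) is that the naïve estimator drops decedents from the denominator while the competing-risk estimator keeps them, so that naïve = events / (n − deaths) and competing-risk = events / n. Under that assumption, 1 − competing/naïve equals the death rate exactly.

```python
def implied_death_rate(naive, competing):
    """Death share implied by the two incidence figures in a single interval."""
    return 1 - competing / naive

# Hypothetical counts to verify the identity itself.
n, deaths, events = 1000, 100, 90
naive = events / (n - deaths)       # decedents dropped from the denominator
competing = events / n              # decedents kept as non-events
identity_gap = abs(implied_death_rate(naive, competing) - deaths / n)

# Applied to the reported CIND figures, then compared with the reported
# CIND two-year mortality of 0.136.
cind_implied = implied_death_rate(0.1343, 0.1160)
cind_reported_mortality = 0.136
```

Here `cind_implied` comes out at roughly 0.136, in line with the reported CIND mortality. That agreement is a consistency check of the kind the text describes, and carries no evidential weight beyond it.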
Aggregate-fair, subgroup-unfair, and calibrated to a biased label.
The demographics-blind cognition-only model passes an overall fairness check: out-of-fold, aggregate slope 0.996, ECE 0.0010, AUROC 0.873 [0.867, 0.880] (pooled, separation-inflated). Stratified, that aggregate conceals disparity:
| Subgroup | Silver-label base rate | AUROC (pooled) | Cal. slope |
|---|---|---|---|
| White, non-Hispanic | 0.0228 | 0.882 [0.874, 0.890] | 1.021 |
| Black, non-Hispanic | 0.0692 | 0.820 [0.805, 0.835] | 0.857 |
| Hispanic | 0.0622 | 0.830 [0.808, 0.848] | 0.917 |
| College or more † | 0.0096 | 0.903 [0.880, 0.922] | 1.072 |
| Less than high school | 0.0852 | 0.804 [0.793, 0.816] | 0.862 |
† Riley-underpowered (179 events < 200); read this interval as wide. All AUROCs are pooled-across-strata (caveat 3). Lead with slope and AUROC, not raw ECE — ECE’s cross-group ratio is base-rate-inflated.
The model is differentially over-confident for Black, Hispanic, and less-educated respondents (calibration slopes 0.857 Black / 0.917 Hispanic / 0.862 <HS vs 1.021 White / 1.072 College+) and less discriminating for those same groups, with non-overlapping White-vs-Black and College-vs-<HS AUROC intervals. Because the model uses no demographic features, this unfairness is emergent rather than the product of a biased input.
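The non-overlap claim can be verified mechanically from the intervals in the table. The helper below is a plain closed-interval overlap test; note that non-overlapping 95% intervals are a conservative heuristic for a difference, not a formal paired test.

```python
# Pooled AUROC 95% intervals copied from the subgroup table.
auroc_ci = {
    "White, non-Hispanic": (0.874, 0.890),
    "Black, non-Hispanic": (0.805, 0.835),
    "Hispanic": (0.808, 0.848),
    "College or more": (0.880, 0.922),
    "Less than high school": (0.793, 0.816),
}

def overlaps(a, b):
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

white_vs_black = overlaps(auroc_ci["White, non-Hispanic"], auroc_ci["Black, non-Hispanic"])
white_vs_hispanic = overlaps(auroc_ci["White, non-Hispanic"], auroc_ci["Hispanic"])
college_vs_lths = overlaps(auroc_ci["College or more"], auroc_ci["Less than high school"])
```

All three comparisons return `False`: the White-vs-Black and College-vs-<HS pairs named in the text, and the White-vs-Hispanic pair as well.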
Is the gap discrimination or case-mix? Both. The pooled subgroup AUROCs are themselves separation-inflated: every group falls sharply within-stratum, and mediocre discrimination is universal in the CIND stratum across all groups. The cross-group gap is partly case-mix (higher-base-rate groups carry more CIND, where discrimination is uniformly mediocre) and partly real (within the cognitively-normal stratum, White and College+ still out-discriminate Black, Hispanic, and <HS). These within-stratum cells are thin: indicative, not definitive.
Two failure modes stack. The model is over-confident in groups the label over-diagnoses (caveat 1). We cannot decompose how much of the slope gap is label-bias vs. biology without ADAMS. The race and education base-rate gaps (∼3× by race, ∼9× by education) are confounded by sampling design and label bias, not established true-incidence differences. That is the honest account.
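The approximate base-rate multiples quoted above follow from the silver-label base rates in the subgroup table, as this small check shows.

```python
# Silver-label base rates copied from the subgroup table.
base_rate = {
    "White, non-Hispanic": 0.0228,
    "Black, non-Hispanic": 0.0692,
    "Hispanic": 0.0622,
    "College or more": 0.0096,
    "Less than high school": 0.0852,
}

black_to_white = base_rate["Black, non-Hispanic"] / base_rate["White, non-Hispanic"]
hispanic_to_white = base_rate["Hispanic"] / base_rate["White, non-Hispanic"]
lths_to_college = base_rate["Less than high school"] / base_rate["College or more"]

print(f"Black / White:       {black_to_white:.2f}x")
print(f"Hispanic / White:    {hispanic_to_white:.2f}x")
print(f"<HS / College+:      {lths_to_college:.2f}x")
```

These are ratios of label rates only; as the paragraph says, they are confounded by sampling design and label bias and should not be read as true-incidence ratios.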
The continuous score is forecastable — partly real skill, partly de-noising.
Beyond binary dementia crossing: can the next-wave 27-point cognitive score be forecast? A Ridge regression on cognition-only features lowers MAE from 3.031 (persistence baseline) to 2.690 (cognition-only); the full feature set reaches 2.611; GBM-regressor 2.616 ≈ Ridge. All model confidence intervals sit below persistence.
The honest deflation. The gain is largely de-noising / mean-reversion of a noisy, autocorrelated self-report — not new prognostic skill against clinical cognition. It lives almost entirely in the Normal stratum (model MAE 2.690 vs a higher persistence baseline, non-overlapping CIs). Cognition-only ties persistence in the at-risk CIND group at age ≥65 (2.829 vs 2.831) — only the full feature set extracts a small CIND gain (2.687). GBM ≈ Ridge; full ≈ cognition-only-plus-a-little — the ADM “simple is the right fidelity” conclusion, reproduced independently on a regression target.
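The de-noising mechanism is easy to reproduce on synthetic data. In this sketch (all parameters hypothetical), each person has a stable latent ability and every wave adds fresh measurement noise. Persistence carries the current wave's noise forward, while a ridge fit shrinks the current score toward the mean and so beats persistence without any real prognostic insight.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20000
latent = rng.normal(15, 3, n)                        # stable underlying ability
score_t = latent + rng.normal(0, 2.5, n)             # noisy observed score at wave t
score_next = latent - 0.3 + rng.normal(0, 2.5, n)    # small decline plus fresh noise

train = np.arange(n) < 16000
test = ~train

# Persistence baseline: predict next score = current score.
mae_persist = np.abs(score_next[test] - score_t[test]).mean()

# One-feature ridge in closed form (intercept unpenalized, penalty lam on the slope).
lam = 1.0
x_mean, y_mean = score_t[train].mean(), score_next[train].mean()
xc = score_t[train] - x_mean
slope = (xc @ (score_next[train] - y_mean)) / (xc @ xc + lam)
pred = y_mean + slope * (score_t[test] - x_mean)
mae_ridge = np.abs(score_next[test] - pred).mean()
```

The fitted `slope` lands well below 1, which is the mean-reversion signature: the model has learned how much of today's score is noise, not who will decline.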
Optimistic by construction (caveat 2): measured only on respondents who stayed self-respondents at t+1, selecting against the steepest decliners.
Stated plainly, not softened.
Out-of-cohort generalisation is not claimed. Harmonised ELSA/CHARLS cross-cohort files are not on disk; out-of-cohort transfer would need them acquired and Langa-Weir-mapped — buildable, not done, stated as a limitation rather than softened. Separately and more fundamentally, no ADAMS clinical ground truth is on disk — label bias and biology cannot be decomposed. Two different limits: a transfer-generalisation gap and a label-validity gap.
The t+2 (~4-year) horizon is deferred, not reported. It needs its own 4-year competing-risk and informative-attrition treatment; the substrate is horizon-parameterized so it remains a flagged future excursion.
Complexity and recalibration are cleanly ruled out — this is the ADM thesis. Penalization, gradient boosting, and recalibration each add a real but sub-0.02-AUROC gain with no decision-relevant calibration improvement over the simple rung. That is a genuine negative kept in the record because it is the point.
What this contributes, honestly stated.
- A cognition-only discrete-time logistic, calibrated to the Langa-Weir silver label, with trustworthy 2-year dementia-risk probabilities — calibration slope 0.89, confirmed not a one-wave fluke by person-grouped cross-validation (slope 1.00).
- The simplicity is the result. Penalization, gradient boosting, and recalibration each buy a small, statistically real, sub-decision-bar AUROC gain and no decision-relevant calibration improvement. The right-fidelity model is the simple one.
- Death-as-competing-risk matters for absolute incidence. Censoring death as random overstates 2-year risk by ~7% overall and ~14% in the high-risk CIND stratum. The relative correction is real; the absolute level remains optimistic by selection.
- Emergent subgroup unfairness, calibrated to a biased label. The demographics-blind model is aggregate-fair but differentially over-confident for Black, Hispanic, and less-educated respondents — while calibrated to a label that itself over-diagnoses those same groups. Two failure modes stack; the gap is part case-mix, part real.
- Every claim is to the silver label, every metric is optimistic for the highest-risk by self-respondent selection, every number reproduces byte-identically, and the negatives are kept in the record — the calibrated-honesty discipline this Validation tier exists to demonstrate, ahead of the gated ALS/AD cohorts (PRO-ACT, ADNI) where the flagship clinical contribution will live.
This is the mission’s first Validation/Benchmark study — public-data, openly shown, ahead of the access-gated cohorts. The destination hasn’t changed: calibrated individual prognosis for ALS and Alzheimer’s on the datasets built for that purpose. This is the on-ramp to it, built on data anyone can check.
Every number is checkable.
Data. All datasets are public and locked by sha256 checksum: the RAND HRS Longitudinal file 1992–2022 v1 (NIA / University of Michigan) and the Langa-Weir Classification of Cognitive Function 1995–2022 (Kenneth Langa and David Weir). No other data enters any result.
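A checksum lock of this kind can be sketched in a few lines. The file name and lock layout below are hypothetical stand-ins; the study's actual lock format is not described in the text.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through sha256 so large data files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def changed_inputs(lock):
    """Return the locked paths whose current checksum no longer matches."""
    return [path for path, expected in lock.items() if sha256_of(path) != expected]

# Demo on a throwaway file (hypothetical stand-in for a locked data file).
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "example_input.dat"
    data_file.write_bytes(b"wave,score\n15,21\n")
    lock = {str(data_file): sha256_of(data_file)}

    clean = changed_inputs(lock)             # nothing has changed yet
    data_file.write_bytes(b"wave,score\n15,22\n")
    tampered = changed_inputs(lock)          # the one-character edit is detected
```

Refusing to run when `changed_inputs` is non-empty is what makes "no other data enters any result" a checkable statement rather than a promise.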
Reproduction. Each of the eight investigations regenerates its results from scratch on every run. A reproduction harness runs each investigation twice and asserts byte-identical output — if any number here drifts from what the scripts produce, it is immediately visible. The full code, research protocol, and per-investigation provenance (sha256-locked inputs, script checksums, wall-clock run times) are available on request.
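The run-twice check described above might look like the following minimal sketch. The two "investigations" are hypothetical toy functions that return serialized results as bytes; the real harness and its output format are not shown in the text.

```python
import hashlib
import json
import os
import random

def reproduces(investigation):
    """Run a zero-argument callable twice and compare the bytes it returns."""
    first, second = investigation(), investigation()
    return hashlib.sha256(first).digest() == hashlib.sha256(second).digest()

def seeded_investigation():
    """Deterministic: fixed seed and stable key order in the serialized output."""
    rng = random.Random(0)
    result = {"metric": rng.random(), "rows": 100}
    return json.dumps(result, sort_keys=True).encode("utf-8")

def unseeded_investigation():
    """Non-deterministic: draws fresh OS entropy on every run."""
    result = {"metric": os.urandom(8).hex(), "rows": 100}
    return json.dumps(result, sort_keys=True).encode("utf-8")

stable = reproduces(seeded_investigation)
drifting = reproduces(unseeded_investigation)
```

Comparing bytes rather than rounded numbers is the strict choice: it also catches unstable key ordering, float formatting differences, and unseeded resampling.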
Method. Discrete-time pooled-logistic hazard model; elastic-net penalization; HistGBM; discrete-time multinomial for competing risks; person-cluster bootstrap confidence intervals; grouped person-CV. Full method and per-investigation caveats are in each investigation’s analysis.md / scenario.md. Analog-band source: Stephan BCM et al., “Discriminative performance of externally validated dementia risk prediction models: a systematic review and meta-analysis,” BMC Medicine 2026;24(1), doi:10.1186/s12916-026-04652-y.
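For readers unfamiliar with grouped person-CV, the idea is that all rows from one person land in the same fold, so no individual contributes to both training and evaluation. A minimal NumPy sketch, with hypothetical names and synthetic person identifiers, follows.

```python
import numpy as np

def person_grouped_folds(person, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs that never split a person across folds."""
    rng = np.random.default_rng(seed)
    ids = np.unique(person)
    # Shuffle persons, then deal them into folds round-robin.
    fold_of_id = dict(zip(ids, rng.permutation(len(ids)) % n_folds))
    fold = np.array([fold_of_id[p] for p in person])
    for k in range(n_folds):
        yield np.flatnonzero(fold != k), np.flatnonzero(fold == k)

# Synthetic person-period index: 200 people with 1 to 4 rows each.
rng = np.random.default_rng(5)
person = np.repeat(np.arange(200), rng.integers(1, 5, 200))

folds = list(person_grouped_folds(person))
leaks = [len(set(person[tr]) & set(person[te])) for tr, te in folds]
covered = np.sort(np.concatenate([te for _, te in folds]))
```

Row-level splitting would let the same person's adjacent waves appear on both sides of the split, which inflates apparent performance on autocorrelated longitudinal data.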