The HRS validation: what I built, found, and where it points — Build Log

The molecular track closed pointing at a clear next destination: calibrated clinical prognosis for the diseases this mission is about, on the cohorts built for it — PRO-ACT and ADNI for longitudinal ALS and Alzheimer’s, and the Speech Accessibility Project for bulbar-onset speech. Those datasets are access-gated, and the applications take time.

Rather than wait, I did the rigorous thing on data already in hand: the Health and Retirement Study — a large, public, longitudinal cohort with a usable cognitive outcome. Eight investigations. Every number script-generated and checksum-locked. All eight reproduce byte-identically.

That choice is deliberate, not a detour. Doing clinical-style prognosis end-to-end on public data keeps the research moving while access is pending, exercises the ADM process on ground the molecular track never touched — a human longitudinal cohort, competing risks, subgroup fairness, a silver-standard label — and demonstrates the calibrated-honesty discipline that earns access to the gated cohorts in the first place. The destination hasn’t changed; this is the on-ramp to it, built on data anyone can check.

What I was building toward

The question a family or clinician actually asks: two years from now, is this 72-year-old likely to have crossed into dementia — and how much should I trust that number? Calibration is the headline metric, not accuracy. A model that says 80% and is right 50% of the time is not a tool; it is a false alarm generator. A model that says 12% and means it — that is useful, even if 12% isn’t what anyone wants to hear.

The outcome is the Langa-Weir classification, an algorithmic proxy rather than a clinical diagnosis. It is 18–62% sensitive against the ADAMS face-to-face assessment and biased against Black and Hispanic respondents (Crimmins 2011; Gianattasio 2019). Every metric here is measured against that silver label. That caveat is load-bearing; I state it once here and remind you only where it is decision-controlling. And to be unambiguous up front: this is a population-level research model, not a clinical tool and not individual medical advice.

The build

The data and the ground rules

RAND HRS Longitudinal 1992–2022 and the Langa-Weir classification file, both public, both sha256-locked before any analysis ran. A single shared data loader derives the outcome cut-points from the documented codebook constants — no duplicated literals, no hand-typed thresholds. Agreement between my reproduction of the label and the file was perfect — every status reproduced exactly.
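A minimal sketch of what sha256-locking an input file can look like, using only the Python standard library. The function names `sha256_of` and `verify_locked` are hypothetical and are not taken from the study's code; the idea is simply that analysis refuses to start if the bytes on disk differ from the recorded digest.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large data files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_locked(path: Path, expected_hex: str) -> None:
    """Raise if the file on disk does not match the locked checksum."""
    actual = sha256_of(path)
    if actual != expected_hex:
        raise RuntimeError(f"checksum mismatch for {path}: {actual}")
```

Calling `verify_locked` at the top of the shared data loader means every downstream number is tied to one exact set of input bytes.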

One decision matters for everything downstream: the primary analysis keeps only people who are still self-respondents at the next interview. Cognitive decline drives the switch to a proxy respondent, so converters, the highest-risk people, exit the primary sample earliest. The self→proxy filter dropped 2,233 person-periods carrying 1,282 imputed-dementia events. Every metric below is therefore optimistic relative to the full population. An imputed-inclusive analysis runs alongside the primary one throughout as a mandatory parallel anchor.

What incidence looks like in this cohort

Incident dementia is strongly age-graded. Among self-respondents, dementia-free at baseline, the ~2-year rate runs 3.0% at ages 65–74, 7.1% at 75–84, and 16.4% at 85+. Within those age bands, baseline cognitive state dominates — the order-of-magnitude gap between a person starting from CIND versus cognitively normal motivates reporting everything stratified by baseline status throughout.

The honest baseline model (INV 02)

A discrete-time pooled-logistic on age + cognitive score + lagged baseline status, evaluated on a temporal holdout (W15 2020 → W16 2022): 5,022 person-periods, 195 incident events, base rate 0.039.
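To make the person-period framing concrete, here is a minimal sketch of how one respondent's interview history could be expanded into the rows a pooled-logistic model is fit on. The `Interview` record, its field names, and the status labels are assumptions made for illustration; the study's own loader and variable names are not shown in this log.

```python
from typing import NamedTuple


class Interview(NamedTuple):
    wave: int
    age: int
    cog_score: int
    status: str            # hypothetical labels: "normal", "cind", "dementia"
    self_respondent: bool


def person_periods(history: list[Interview]) -> list[dict]:
    """Turn one person's wave-ordered interviews into at-risk person-period rows."""
    rows = []
    for current, following in zip(history, history[1:]):
        if current.status == "dementia":
            break  # already crossed: no longer at risk of an incident event
        if following.wave != current.wave + 1:
            continue  # skipped wave: no adjacent outcome to predict
        if not (current.self_respondent and following.self_respondent):
            continue  # primary-analysis filter: self-respondents at both ends
        rows.append({
            "age": current.age,
            "cog_score": current.cog_score,
            "baseline_status": current.status,
            "event": int(following.status == "dementia"),
        })
    return rows
```

Each row is one dementia-free person at one wave, with the label taken from the next wave. Note how a period whose follow-up interview is by proxy simply disappears from the output, which is the selection effect described earlier.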

Calibration first: the cognition-only model is well-calibrated to the silver label — slope 0.89, ECE 0.0097. At a base rate of 0.039, ECE is mechanically small; the slope is the lead metric. Person-grouped cross-validation confirms this is not a one-wave artifact: slope 1.00, ECE 0.0011, AUROC 0.873.
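For readers who want the two calibration metrics pinned down, the sketch below gives one common definition of each: ECE over equal-width probability bins, and the calibration slope as the coefficient from a logistic regression of the outcome on the logit of the predicted risk. The binning scheme, bin count, and Newton-Raphson fit are assumptions; the log does not say which variants the study used.

```python
import math


def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean gap between predicted and observed risk, equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        observed = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(mean_pred - observed)
    return ece


def calibration_slope(probs, labels, iterations=25):
    """Slope b from fitting y ~ sigmoid(a + b * logit(p)) by Newton-Raphson.

    Probabilities must lie strictly between 0 and 1. A slope below 1 means
    the predictions are too extreme (over-confident).
    """
    x = [math.log(p / (1.0 - p)) for p in probs]
    a, b = 0.0, 1.0
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for xi, yi in zip(x, labels):
            mu = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = mu * (1.0 - mu)
            g_a += yi - mu
            g_b += (yi - mu) * xi
            h_aa += w
            h_ab += w * xi
            h_bb += w * xi * xi
        det = h_aa * h_bb - h_ab * h_ab
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return b
```

Because ECE averages absolute gaps in probability units, it cannot be large when almost every prediction sits near a base rate of a few percent, which is why the slope carries more information here.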

Discrimination second — pooled versus within-stratum. Pooled AUROC on the temporal holdout: 0.850 [0.821, 0.880]. That sits at the optimistic top of what externally validated dementia-risk models reach in the literature (pooled c-statistics in the low-to-mid 0.70s for the best-validated scores; Stephan et al. 2026, BMC Medicine, doi:10.1186/s12916-026-04652-y). But it is separation-inflated — lifted by the model distinguishing Normal from CIND, which is a baseline feature, not foresight. The honest discrimination is within-stratum: 0.733 (Normal) / 0.675 (CIND) — consistent with the real-world range. Discrimination is weakest in the higher-risk CIND stratum, the one nearest the threshold.
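The separation-inflation effect is easy to reproduce on toy numbers. In the sketch below, the scores carry no information within either stratum (AUROC 0.5 in each), yet the pooled AUROC is well above 0.5 purely because the higher-risk stratum gets higher scores. The data are invented for illustration and the `auroc` helper is a plain pairwise implementation, not the study's code.

```python
def auroc(scores, labels):
    """Probability that a random event outranks a random non-event (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented toy data: low scores in the Normal stratum, high scores in CIND.
normal_scores = [0.02, 0.04, 0.02, 0.02, 0.02, 0.04, 0.04, 0.04]
normal_labels = [1, 1, 0, 0, 0, 0, 0, 0]
cind_scores = [0.20, 0.30, 0.20, 0.30]
cind_labels = [1, 1, 0, 0]

within_normal = auroc(normal_scores, normal_labels)
within_cind = auroc(cind_scores, cind_labels)
pooled = auroc(normal_scores + cind_scores, normal_labels + cind_labels)
```

Here `within_normal` and `within_cind` are both 0.5 while `pooled` is 0.625: all of the pooled discrimination comes from knowing which stratum a person starts in.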

Three escalations, none clearing the bar

The ADM question: does adding complexity improve the model enough to matter for the decision? I set a pre-registered decision bar of Δ = 0.02 AUROC and tested three escalations, each with a paired Δ-AUROC on the same holdout rows.

Elastic-net penalization (INV 03). The fuller feature set adds a paired Δ-AUROC of +0.0196 [0.011, 0.028] — real (excludes zero), just under the bar, TOST inconclusive at 195 events. Person-grouped CV: slope 1.00, ECE 0.0011.
Gradient boosting (INV 04). Paired Δ-AUROC +0.018 [0.005, 0.032] — again real, again below the bar. GBM is statistically indistinguishable from the tuned linear models within CI.
Recalibration (INV 05). The raw model is already well-calibrated. Platt is a wash. Isotonic shaves pooled ECE marginally but worsens the slope — and a pooled recalibrator cannot fix a mild within-Normal over-confidence that is stratum-specific.
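The paired Δ-AUROC comparisons above can be sketched as a percentile bootstrap in which both models are scored on the same resampled rows, followed by a check against the decision bar. This is a sketch under assumptions: the resampling scheme, the number of replicates, and the `verdict` rule (interval excludes zero; point estimate compared with the bar) are illustrative choices, and the TOST equivalence test is not shown.

```python
import random


def auroc(scores, labels):
    """Pairwise AUROC with ties counted as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_delta_auroc(base, rich, labels, n_boot=500, seed=0):
    """Point estimate and 95% percentile interval for AUROC(rich) - AUROC(base)."""
    rng = random.Random(seed)
    n = len(labels)
    point = auroc(rich, labels) - auroc(base, labels)
    deltas = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if sum(y) in (0, n):
            continue  # resample lacks events or non-events: AUROC undefined
        deltas.append(
            auroc([rich[i] for i in idx], y) - auroc([base[i] for i in idx], y)
        )
    deltas.sort()
    return point, deltas[int(0.025 * n_boot)], deltas[int(0.975 * n_boot) - 1]


def verdict(point, lower, upper, bar=0.02):
    """Return (gain is statistically real, gain clears the decision bar)."""
    real = lower > 0.0 or upper < 0.0
    return real, point >= bar
```

Applied to the reported elastic-net result, `verdict(0.0196, 0.011, 0.028)` returns `(True, False)`: a real gain that does not clear the bar.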

The ADM conclusion — stated honestly, not as a tie

Across all four paired comparisons, complexity adds a small but statistically real gain (every paired 95% CI excluding zero) that consistently fails to clear the 0.02 decision bar, with no consistent or decision-relevant calibration improvement. At 195 events we are underpowered to certify formal equivalence. The honest reading is not “they tie” but “the extra fidelity is real and too small to matter for the decision” — so the right-fidelity model is the simple one. This is the ADM thesis demonstrated end-to-end on a clinical-style target.

Death is a competing risk (INV 06)

A naïve estimator that censors death as if it were random overstates cumulative incidence. That direction is a known result from competing-risks theory; what matters here is its size. In this cohort, censoring death overstates 2-year incidence by ~7% overall and ~14% in the CIND stratum (both relative), because death competes hardest for exactly the people most at risk of dementia. A discrete-time multinomial model that treats death as an explicit third outcome corrects this and is itself well-calibrated to the silver label. The relative correction is the load-bearing takeaway; the absolute level remains optimistic because of self-respondent selection.
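The direction of the bias can be seen in a small discrete-time example. The sketch below compares the naïve estimate (one minus Kaplan-Meier, with deaths treated as censored) against an Aalen-Johansen style cumulative incidence that weights each interval's dementia hazard by survival free of both events. The interval counts are invented, and this is the textbook estimator pair rather than the study's multinomial model.

```python
def naive_incidence(intervals):
    """1 - Kaplan-Meier, treating deaths as censored at the end of each interval."""
    surviving = 1.0
    for at_risk, dementia, _deaths in intervals:
        surviving *= 1.0 - dementia / at_risk
    return 1.0 - surviving


def cumulative_incidence(intervals):
    """Dementia hazard weighted by the chance of being alive and dementia-free."""
    event_free = 1.0
    incidence = 0.0
    for at_risk, dementia, deaths in intervals:
        incidence += event_free * dementia / at_risk
        event_free *= 1.0 - (dementia + deaths) / at_risk
    return incidence


# Invented counts per interval: (at risk, incident dementia, deaths).
toy = [(1000, 40, 100), (860, 40, 100)]
naive = naive_incidence(toy)
proper = cumulative_incidence(toy)
relative_overstatement = naive / proper - 1.0
```

With these toy counts the proper cumulative incidence is 0.080 and the naïve figure is about 0.085, a relative overstatement of roughly 6%. The gap widens as the death rate among at-risk people rises.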

The model is aggregate-fair, subgroup-unfair (INV 07)

A demographics-blind cognition-only model passes an overall fairness check: aggregate calibration slope 0.996, AUROC 0.873. Stratifying by subgroup exposes the disparity that the aggregate conceals. The model is differentially over-confident for Black, Hispanic, and less-educated respondents: calibration slopes are 0.857 for Black and 0.862 for <HS respondents versus 1.021 for White and 1.072 for College+, and the AUROC intervals do not overlap (Black 0.820 vs White 0.882).

Because the model uses no demographic features, this unfairness is emergent, not a biased input. And it is calibrated to a label that itself over-diagnoses Black and Hispanic respondents — two failure modes stack. The cross-group gap is partly case-mix (higher-base-rate groups carry more CIND, where discrimination is mediocre for everyone) and partly real. That is as honest as the data allow.

The continuous score: forecastable, but mostly de-noising (INV 08)

Beyond binary dementia crossing: a Ridge regression on cognition-only features lowers MAE from 3.031 (persistence baseline) to 2.690, with non-overlapping confidence intervals. But the gain lives almost entirely in the Normal stratum — cognition-only ties persistence in the at-risk CIND group at age ≥65. GBM ≈ Ridge; full feature set ≈ cognition-only-plus-a-little. The ADM “simple is the right fidelity” conclusion, reproduced independently on a regression target.
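For reference, the persistence baseline predicts that the next-wave score equals the current one, and MAE is the mean absolute gap between prediction and outcome. The sketch below adds a simple shrink-toward-the-mean forecast as a stand-in for a de-noising model; it is not the Ridge regression used in the study, and the scores and the shrinkage weight are invented.

```python
def mean_absolute_error(predicted, observed):
    """Average absolute gap between forecasts and outcomes."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


def persistence_forecast(current_scores):
    """Predict that each next-wave score equals the current score."""
    return list(current_scores)


def shrinkage_forecast(current_scores, weight=0.5):
    """Stand-in de-noiser: pull each score part-way toward the sample mean."""
    mean = sum(current_scores) / len(current_scores)
    return [mean + weight * (s - mean) for s in current_scores]


# Invented toy scores: noisy current measurements and next-wave outcomes.
current = [24, 12, 18]
following = [21, 14, 18]
mae_persistence = mean_absolute_error(persistence_forecast(current), following)
mae_shrinkage = mean_absolute_error(shrinkage_forecast(current), following)
```

On these toy numbers, persistence has an MAE of about 1.67 and the shrinkage forecast about 0.33. A gain of this kind comes from averaging out measurement noise, not from anticipating decline.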

What I will not claim — and this matters

These numbers are to the Langa-Weir silver label, not clinical ground truth. The label is 18–62% sensitive versus the ADAMS assessment, and it over-diagnoses Black and Hispanic respondents. I cannot separate label bias from biology. Every “calibrated” and every AUROC above means calibrated to this proxy.
Every metric is optimistic for the highest-risk people. Self-respondents who convert to proxy exit the primary sample earliest. The 195-event holdout sits just under the Riley ≥200/≥200 rule (at least 200 events and at least 200 non-events in a validation sample). The ≥50 and imputed-inclusive variants corroborate every direction but cannot replace the primary.
This is not a clinical tool, not individual advice, not a statement about any single person. It is a population-level research model, built on public data, graded against a research proxy. The flag in every methodology note is there for a reason.
Out-of-cohort generalisation is not claimed. Harmonised ELSA/CHARLS cross-cohort files are not on disk. The label-validity wall (no ADAMS ground truth) is a separate, more fundamental limit.

Where it points

Eight investigations, all byte-reproducible, every number script-generated from checksum-locked data — the same discipline the gated cohorts require. The HRS study demonstrates that discipline on a real human longitudinal cohort before the access-gated data arrives. That is what a Validation/Benchmark study is for.

The next step is not more HRS. It is the thing I started this for: calibrated, individual-level prognosis for ALS, starting with speech. Bulbar-onset ALS takes voice first. Acoustic biomarkers from recorded speech may surface disease progression earlier and more precisely than functional scales alone. The Speech Accessibility Project and PRO-ACT applications are in progress. When they clear, the study opens.

I haven’t solved anything on the speech side yet. I haven’t started. What I have is the same discipline I applied here — calibrated uncertainty, honest intervals, clear statements of what the model knows and doesn’t — pointed at the harder problem. That is the work I came here to build. The HRS study was the on-ramp. The speech work is the thing.

The full technical account of the HRS validation is on the study page. More in the build log when the speech-prognosis track opens.

Michael
