The Question
Mortality is the unambiguous endpoint. No self-report, no rating scale — just alive or dead, confirmed by the National Death Index. With the 2017–2018 NHANES cycle linked to public-use mortality files (follow-up through December 2019), we get a modest cohort with objectively verified outcomes. The question: does machine learning beat a well-calibrated domain score when the sample is small and the follow-up is short?
ADM Prediction (Made Before Running Models)
Predicted winner: ML, but domain should be strong. Mortality is dominated by age (Gompertz law — risk rises exponentially after 40), so a published risk score with a proper age term will have decent discrimination. ML should find interactions — how CRP compounds with diabetes, how hypertension accelerates decline in obesity — that a linear score misses.
Prediction flipped. Domain AUC 0.807 [0.770–0.843] beats ML AUC 0.768 [0.716–0.805] by 0.039, but the 95% CIs overlap, so the difference is not statistically significant. The hybrid model (0.786) also trails the domain score. At 24-month follow-up with 111 events, the Charlson + Gompertz baseline captures the short-term mortality signal better than GradientBoosting can with this little data.
Results
[Figures: ROC curves; feature importance (top 8); confidence intervals; net reclassification]
Why the Domain Score Wins Here
Short follow-up favors age-dominant models. With only 24 months of observation, deaths are concentrated in the oldest and sickest — the population where the Gompertz age component of the Charlson score does most of the work. When the signal is this compressed, the published weights already have it: age, heart disease, cancer, stroke, COPD. GradientBoosting has less to learn.
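For reference, the age component here is just the Gompertz hazard, exponential in age. A one-line statement (the doubling-time figure is the commonly cited human estimate, not a value computed from this cohort):

```latex
% Gompertz hazard: mortality risk grows exponentially with age x
\[
  h(x) = \alpha\, e^{\beta x}, \qquad
  t_{\text{double}} = \frac{\ln 2}{\beta} \approx 8\ \text{years for}\ \beta \approx 0.085/\text{yr}
\]
```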
Small N, few events. 3,578 adults with 111 deaths is near the lower bound for flexible ML. NRI is the one place ML looks better: it flips 558 cases the domain score gets wrong, while the domain score flips only 224 of ML's errors (see the sketch below). The two models rank people differently even when their overall AUCs are similar. But flipping cases isn't the same as improving accuracy: a correctly calibrated score that ranks people well in aggregate is more useful than a flexible model that reshuffles individuals without improving the overall curve.
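As a concrete illustration of what those flip counts measure, here is a minimal sketch of per-person reclassification counting; the threshold and variable names are illustrative, not the analysis code (the actual NRI was computed per CV fold):

```python
import numpy as np

def reclassification_counts(y, p_domain, p_ml, threshold=0.031):
    """Count cases each model classifies correctly where the other errs.

    `threshold` is illustrative (set at the observed 3.1% event rate);
    flips are counted at a single cut for simplicity.
    """
    y = np.asarray(y, dtype=bool)
    domain_right = (np.asarray(p_domain) >= threshold) == y
    ml_right = (np.asarray(p_ml) >= threshold) == y
    ml_fixes = int(np.sum(ml_right & ~domain_right))      # ML right, domain wrong
    domain_fixes = int(np.sum(domain_right & ~ml_right))  # domain right, ML wrong
    return ml_fixes, domain_fixes
```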
Hybrid doesn't rescue ML. Stacking GradientBoosting on top of the domain score (hybrid AUC 0.786 [0.738–0.823]) lands between the two; it doesn't clear the domain baseline. When the domain score is this strong, there is little orthogonal signal left for ML to add.
The ADM Insight
This is the investigation where domain knowledge wins. At 24-month follow-up on 3,578 NHANES adults with 111 deaths, a Charlson + Gompertz risk score (AUC 0.807) outperforms GradientBoosting (0.768) by 0.039. The CIs overlap, so this is a draw-in-favor-of-simpler rather than a definitive ML loss, but that's the point. When the underlying biology is well-characterized, published weights are available, and the event count is modest, the simpler model is the right fidelity. ML needs either longer follow-up (more events) or larger N before its interaction-finding pays off here.
Data & Methods
Data source: CDC NHANES 2017–2018 (demographics, examination, laboratory, questionnaire files) linked to the NCHS Linked Mortality Files (2019 public-use release). Mortality-eligible respondents only (ELIGSTAT=1). Follow-up through December 2019; mean 24.2 months.
Cohort: NHANES adults aged 40+ with complete baseline lab and exam data. Final analytic sample: 3,578 participants, 111 deaths (3.1%). Cause-of-death codes available via UCOD_LEADING.
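A minimal sketch of the linkage and cohort filter, assuming pre-downloaded files (DEMO_J.XPT is the 2017–2018 demographics file; SEQN, RIDAGEYR, ELIGSTAT, and MORTSTAT are NHANES/NCHS codebook variables; the mortality CSV path is hypothetical):

```python
import pandas as pd

# SEQN is the NHANES respondent ID and the join key to the mortality file.
nhanes = pd.read_sas("DEMO_J.XPT")                # 2017-2018 demographics
mort = pd.read_csv("mortality_2017_2018.csv")     # hypothetical pre-parsed NCHS file

cohort = nhanes.merge(mort, on="SEQN", how="inner")
cohort = cohort[cohort["ELIGSTAT"] == 1]          # mortality-eligible respondents only
cohort = cohort[cohort["RIDAGEYR"] >= 40]         # adults aged 40+
cohort["died"] = cohort["MORTSTAT"] == 1          # deceased by December 2019
```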
Domain baseline: Charlson-style log-relative-risk score with a Gompertz age component, plus hypertension and CRP thresholds (12 variables total). Weights from published cohort studies — not fit to this data.
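The score's shape, in a short sketch; every weight below is an illustrative placeholder, not the published coefficients the analysis used:

```python
def domain_log_risk(age, conditions, sbp, crp):
    """Charlson-style log-relative-risk with a Gompertz age term.

    `conditions` maps condition names to 0/1 flags. All numbers here are
    placeholders; the analysis used weights from published cohort studies,
    not values fit to NHANES.
    """
    GOMPERTZ_BETA = 0.085                  # assumed per-year log-hazard slope
    CONDITION_WEIGHTS = {                  # placeholder log-relative-risks
        "heart_disease": 0.7, "cancer": 0.9, "stroke": 0.6,
        "copd": 0.5, "diabetes": 0.4,
    }
    score = GOMPERTZ_BETA * (age - 40)     # Gompertz age component
    score += sum(CONDITION_WEIGHTS.get(k, 0.0) * v for k, v in conditions.items())
    score += 0.3 * (sbp >= 140)            # hypertension threshold (placeholder weight)
    score += 0.3 * (crp > 3.0)             # CRP threshold in mg/L (placeholder weight)
    return score
```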
ML model: GradientBoostingClassifier on the full biomarker panel — age, BMI, waist circumference, averaged systolic/diastolic BP, total cholesterol, HDL, LDL, triglycerides, HbA1c, CRP, hemoglobin, serum creatinine, self-rated health, plus chronic condition and behavior indicators.
Hybrid model: Stacked — GradientBoosting on the full feature set plus the domain risk score as an additional input.
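Both fits reduce to a few lines of scikit-learn; the data below are synthetic stand-ins so the sketch runs on its own:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(3578, 20))           # stand-in for the biomarker panel
domain_score = X[:, 0] * 0.8              # stand-in for the Charlson + Gompertz score
y = rng.binomial(1, 0.031, size=3578)     # ~3.1% event rate, synthetic labels

ml = GradientBoostingClassifier(random_state=0).fit(X, y)

# Hybrid: same learner and features, plus the domain score as one extra column.
X_hybrid = np.column_stack([X, domain_score])
hybrid = GradientBoostingClassifier(random_state=0).fit(X_hybrid, y)
```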
Evaluation: 5-fold stratified cross-validation. Bootstrap 95% CIs from 1,000 resamples. Permutation tests for AUC differences. Net reclassification computed at each fold.
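A sketch of the evaluation loop under the same synthetic stand-ins (out-of-fold probabilities from stratified 5-fold CV, then a 1,000-resample bootstrap CI on the AUC); the permutation test and per-fold NRI are omitted for brevity:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(3578, 20))           # synthetic stand-in features
y = rng.binomial(1, 0.031, size=3578)     # synthetic ~3.1% event rate

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
oof = cross_val_predict(GradientBoostingClassifier(random_state=0),
                        X, y, cv=cv, method="predict_proba")[:, 1]

aucs = []
for _ in range(1000):                     # bootstrap 95% CI for the OOF AUC
    idx = rng.integers(0, len(y), len(y))
    if y[idx].min() == y[idx].max():      # resample must contain both classes
        continue
    aucs.append(roc_auc_score(y[idx], oof[idx]))
print(np.percentile(aucs, [2.5, 97.5]))
```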
Limitations
Short follow-up (24 months). The public-use mortality release covers deaths through December 2019. Most participants have less than 2.5 years of observation. With longer follow-up (e.g., the restricted-access files extending to 2022), ML would have more events to learn from and the ranking could change.
Few events (111 deaths). At a 3.1% event rate, flexible models have limited data to estimate interaction effects without overfitting. Larger event counts would favor GradientBoosting.
Modest sample size (3,578). Only the 2017–2018 cycle is currently linked in the public-use release. Pooling earlier cycles (available for SDOH / 10-year mortality in Q15) would boost both N and event count.
CIs overlap. The 0.039 AUC gap between domain and ML has overlapping 95% bootstrap CIs. Domain appears to win, but the difference is not statistically significant. The claim: simpler isn't worse here. Not the claim: simpler is definitively better.
Cross-sectional baseline. Single NHANES interview; no trajectory features, no repeated measures. A model with longitudinal inputs (HRS-style) might close some of the gap.