The Question
Diabetes risk prediction underpins clinical screening decisions. ADA-aligned published risk scores combine known factors (age, BMI, hypertension, smoking, activity, family history) using log-relative-risks from landmark studies. Machine learning trained on the NHANES biomarker panel (lipids, CRP, waist circumference, blood pressure) can also learn nonlinear interactions between these factors. Which approach better identifies diabetes in 39,839 NHANES adults? Outcome: ADA criteria (HbA1c ≥ 6.5% OR fasting glucose ≥ 126 mg/dL OR physician diagnosis); HbA1c and glucose are excluded from predictor sets because they define the outcome.
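The outcome rule above is a three-way OR, which can be sketched as a small labelling function. The function and argument names are hypothetical, and treating a missing measurement as "criterion not met" is an assumption of this sketch rather than something the analysis states.

```python
from typing import Optional

HBA1C_THRESHOLD_PCT = 6.5          # HbA1c >= 6.5 %
FASTING_GLUCOSE_THRESHOLD = 126.0  # fasting glucose >= 126 mg/dL


def has_diabetes(hba1c_pct: Optional[float],
                 fasting_glucose_mgdl: Optional[float],
                 physician_diagnosis: Optional[bool]) -> bool:
    """Return True if any one of the three outcome criteria is met.

    None marks a missing value and counts as "criterion not met"
    (an assumption of this sketch, not a rule from the text).
    """
    by_hba1c = hba1c_pct is not None and hba1c_pct >= HBA1C_THRESHOLD_PCT
    by_glucose = (fasting_glucose_mgdl is not None
                  and fasting_glucose_mgdl >= FASTING_GLUCOSE_THRESHOLD)
    by_diagnosis = physician_diagnosis is True
    return by_hba1c or by_glucose or by_diagnosis


# HbA1c and glucose feed only the label; they never enter the predictors.
print(has_diabetes(6.7, 110.0, False))   # labelled by HbA1c alone
print(has_diabetes(5.4, 98.0, False))    # no criterion met
```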
ADM Prediction (Made Before Running Models)
Predicted winner: machine learning. Published log-relative-risks capture the direction of each factor (obesity increases risk, activity decreases it) but few of the nonlinear interactions between them. Diabetes is a threshold disease: borderline triglycerides, central adiposity, and rising age together compound risk beyond what a largely additive score captures. With richer biomarkers (lipids, CRP, waist circumference) and ~40K samples, gradient boosting has enough data to learn these synergies.
Expected margin: 5–10 AUC points. Actual: +6.7 AUC points (0.851 vs 0.784). Prediction confirmed: the 95% CIs do not overlap.
Results
[Figures not reproduced: ROC Curves; Feature Importance (Top 8); Confidence Intervals; Net Reclassification]
A hybrid model that feeds the ADA-aligned risk score as an additional feature alongside all biomarkers tests whether published epidemiology adds information the ML model misses on its own.
Hybrid AUC: 0.851 [0.846–0.856], statistically indistinguishable from ML alone (+0.000 vs ML). GradientBoosting on the raw biomarkers already appears to capture what the published score encodes. The domain score is useful as an interpretable baseline, not as an additional signal.
A model that works well on average might fail for specific populations. Does the ML advantage hold across age groups, sexes, and BMI categories — or does it only work for some subgroups?
The ADM Insight
Published diabetes risk factors — BMI thresholds, hypertension, family history — capture the population average. But a 55-year-old with borderline triglycerides and rising waist circumference has a different trajectory than someone with identical BMI and normal lipids. The ML model finds the biomarker interactions that guideline thresholds flatten into averages. The biggest lift is in the hardest subgroup: Age 60+, where the domain score falls to AUC 0.68 and ML holds at 0.76 (+8.8 points). This is a question where ML provides genuine predictive value — but only with a biomarker-rich panel.
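A subgroup comparison of this kind can be sketched by computing both AUCs within each group and taking the difference. Everything here is illustrative: the function names are hypothetical, only the "60+" band comes from the text (the other cut points are assumptions), and the rows are toy values, not NHANES results.

```python
from collections import defaultdict


def auc(labels, scores):
    """Pairwise AUC: share of (case, non-case) pairs ranked correctly; ties count 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined when a group has only one class
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def age_band(age):
    # Only "60+" is named in the text; the other bands are assumed.
    if age >= 60:
        return "60+"
    if age >= 40:
        return "40-59"
    return "20-39"


def auc_gap_by_group(rows, group_fn):
    """rows: (age, label, ml_score, domain_score) -> {group: (ml_auc, domain_auc, gap)}."""
    grouped = defaultdict(list)
    for age, label, ml_score, dom_score in rows:
        grouped[group_fn(age)].append((label, ml_score, dom_score))
    out = {}
    for group, members in grouped.items():
        labels = [m[0] for m in members]
        ml = auc(labels, [m[1] for m in members])
        dom = auc(labels, [m[2] for m in members])
        gap = None if ml is None or dom is None else ml - dom
        out[group] = (ml, dom, gap)
    return out


# Toy rows with made-up scores.
toy_rows = [
    (65, 1, 0.9, 0.6), (70, 0, 0.2, 0.7), (62, 1, 0.8, 0.8), (68, 0, 0.3, 0.5),
    (45, 1, 0.7, 0.7), (50, 0, 0.1, 0.2),
]
for group, (ml, dom, gap) in sorted(auc_gap_by_group(toy_rows, age_band).items()):
    print(group, ml, dom, gap)
```

Small subgroups give noisy AUCs, so per-group confidence intervals would be needed before reading much into any single gap.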
Data: NHANES 2005–2018, seven two-year cycles pooled (MEC sample weights divided by K=7 for prevalence estimates; model evaluation is unweighted). 39,839 adults; 6,433 diabetic (16.15% prevalence). Outcome: ADA diagnostic criteria — HbA1c ≥ 6.5% OR fasting glucose ≥ 126 mg/dL OR self-reported physician diagnosis. HbA1c and fasting glucose are excluded from predictor sets because they define the outcome.
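The pooling rule described above (two-year MEC weight divided by K=7) can be sketched as follows. The function names and the toy weights are hypothetical; only K=7 and the participant counts come from the text.

```python
N_CYCLES = 7  # K: two-year cycles pooled, 2005-2006 through 2017-2018


def pooled_weight(two_year_mec_weight, n_cycles=N_CYCLES):
    """Rescale a two-year MEC weight for an analysis that pools n_cycles cycles."""
    return two_year_mec_weight / n_cycles


def weighted_prevalence(records):
    """Weighted share of records with the outcome.

    records: iterable of (two_year_mec_weight, has_outcome) pairs.
    """
    total = 0.0
    cases = 0.0
    for weight, has_outcome in records:
        w = pooled_weight(weight)
        total += w
        if has_outcome:
            cases += w
    if total == 0:
        raise ValueError("weights sum to zero")
    return cases / total


# Toy records with made-up weights, not NHANES data.
toy = [(14000.0, True), (21000.0, False), (35000.0, False), (7000.0, True)]
print(f"weighted prevalence:   {weighted_prevalence(toy):.3f}")
print(f"unweighted prevalence: {sum(o for _, o in toy) / len(toy):.3f}")

# Crude ratio of the counts quoted in the text.
print(f"6,433 / 39,839 = {100 * 6433 / 39839:.2f}%")
```

Because K is a constant, dividing by it cancels in a ratio such as prevalence; the rescaling matters when the weights are summed into population totals.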
The ADA-aligned domain score uses log-relative-risks from published literature across 9 variables: age, BMI, hypertension, smoking, inactivity, depression, family history of diabetes, waist circumference, and a BMI × age interaction.
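A score built from log-relative-risks amounts to a weighted sum of exposures, plus the one interaction term. The sketch below shows that shape only: every relative risk is a placeholder invented for illustration, and the variable coding (reference points, step sizes, field names) is assumed rather than taken from the analysis.

```python
import math

# PLACEHOLDER relative risks, invented for illustration only.
# They are NOT the published values used in the analysis.
PLACEHOLDER_RR = {
    "age_per_decade_over_40": 1.5,
    "bmi_per_5_units_over_25": 1.6,
    "waist_per_10cm_over_90": 1.4,
    "hypertension": 1.5,
    "current_smoker": 1.3,
    "inactive": 1.3,
    "depression": 1.2,
    "family_history": 2.0,
    "bmi_x_age": 1.1,
}
LOG_RR = {name: math.log(rr) for name, rr in PLACEHOLDER_RR.items()}


def domain_score(person):
    """Sum of log-relative-risks times exposure level (assumed coding)."""
    age_decades = max(person["age"] - 40, 0) / 10
    bmi_steps = max(person["bmi"] - 25, 0) / 5
    waist_steps = max(person["waist_cm"] - 90, 0) / 10
    exposure = {
        "age_per_decade_over_40": age_decades,
        "bmi_per_5_units_over_25": bmi_steps,
        "waist_per_10cm_over_90": waist_steps,
        "hypertension": float(person["hypertension"]),
        "current_smoker": float(person["current_smoker"]),
        "inactive": float(person["inactive"]),
        "depression": float(person["depression"]),
        "family_history": float(person["family_history"]),
        "bmi_x_age": bmi_steps * age_decades,  # the single interaction term
    }
    return sum(LOG_RR[name] * level for name, level in exposure.items())


low_risk = dict(age=35, bmi=23, waist_cm=80, hypertension=False,
                current_smoker=False, inactive=False, depression=False,
                family_history=False)
high_risk = dict(age=60, bmi=35, waist_cm=110, hypertension=True,
                 current_smoker=False, inactive=True, depression=False,
                 family_history=True)
print(domain_score(low_risk), domain_score(high_risk))
```

AUC depends only on how the score ranks people, so the sum does not need to be converted to a calibrated probability for this comparison.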
The GradientBoosting model trains on the full NHANES biomarker panel: lipids (total cholesterol, HDL, triglycerides), CRP, waist circumference, blood pressure (SBP/DBP), BMI, age, family history, and activity/smoking variables.
Both models are evaluated on identical held-out folds via 5-fold stratified cross-validation. Bootstrap 95% CIs are computed from 1,000 resamples. The non-overlapping CIs (ML [0.846–0.856] vs Domain [0.778–0.790]) indicate the gap is unlikely to be due to chance.
Cross-sectional, not prospective: NHANES is cross-sectional, so the “outcome” is current diabetes status at the time of exam, not future onset. A truly prospective version (predicting who becomes diabetic over the next N years) would need a longitudinal cohort (HRS, UK Biobank, ARIC).
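The evaluation pieces named above (stratified folds, AUC, percentile bootstrap, CI overlap) can be sketched with the standard library alone. This is a minimal illustration under assumptions: the function names are hypothetical, the percentile method is one common bootstrap choice and not necessarily the one used in the analysis, and the toy scores are synthetic.

```python
import random


def auc(labels, scores):
    """Rank-based AUC; tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs both classes")
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def stratified_folds(labels, k=5, seed=0):
    """Give each index a fold id so every fold keeps a similar class mix."""
    rng = random.Random(seed)
    fold_of = [0] * len(labels)
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            fold_of[i] = pos % k
    return fold_of


def bootstrap_ci(labels, scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUC over resampled individuals."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_resamples:
        sample = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in sample]
        if 0 < sum(ys) < n:  # redraw resamples that lost a class
            stats.append(auc(ys, [scores[i] for i in sample]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def intervals_overlap(a, b):
    return a[0] <= b[1] and b[0] <= a[1]


# Synthetic scores: cases drawn from a slightly higher distribution.
rng = random.Random(42)
toy_labels = [1] * 40 + [0] * 160
toy_scores = [rng.gauss(1.0 if y else 0.0, 1.0) for y in toy_labels]
print(auc(toy_labels, toy_scores), bootstrap_ci(toy_labels, toy_scores))

# Intervals quoted in the text: ML vs Domain.
print(intervals_overlap((0.846, 0.856), (0.778, 0.790)))
```

In a cross-validated setup, the scores passed to the bootstrap would be the out-of-fold predictions, so each person is scored by a model that never saw them.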
Domain baseline is an approximation: The ADA-aligned score uses published log-relative-risks from landmark studies. A clinical tool like the ADA Diabetes Risk Test or FINDRISC, scored as published, could sit slightly above or below our implementation.
Hybrid adds nothing: Feeding the domain score as an extra feature to GradientBoosting yields AUC 0.851 — identical to ML alone. The published score does not encode information beyond what the raw biomarkers already contain.
Biomarker-rich panel: The ML advantage likely narrows on datasets without lipids, CRP, and waist circumference. Replication on HRS (self-reported diabetes, fewer labs) would test this.
Survey-weighted evaluation: Model evaluation is unweighted; prevalence estimates use MEC weights divided by K=7 for pooling. A fully weighted AUC estimator would likely shift the reported AUCs slightly, but this was not measured.
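One way a survey-weighted AUC could be computed is to weight every (case, non-case) pair by the product of the two sample weights. The sketch below shows that estimator under assumptions: the function name is hypothetical, the toy weights are made up, and it ignores the survey's strata and clustering, which would matter for confidence intervals.

```python
def weighted_auc(labels, scores, weights):
    """Weighted share of (case, non-case) pairs ranked correctly; ties count 1/2."""
    pos = [(s, w) for y, s, w in zip(labels, scores, weights) if y == 1]
    neg = [(s, w) for y, s, w in zip(labels, scores, weights) if y == 0]
    pos_total = sum(w for _, w in pos)
    neg_total = sum(w for _, w in neg)
    if pos_total <= 0 or neg_total <= 0:
        raise ValueError("both classes need positive total weight")
    concordant = 0.0
    for s_pos, w_pos in pos:
        for s_neg, w_neg in neg:
            if s_pos > s_neg:
                concordant += w_pos * w_neg
            elif s_pos == s_neg:
                concordant += 0.5 * w_pos * w_neg
    return concordant / (pos_total * neg_total)


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(weighted_auc(toy_labels, toy_scores, [1, 1, 1, 1]))  # equal weights
print(weighted_auc(toy_labels, toy_scores, [1, 3, 1, 1]))  # up-weight one non-case
```

The double loop is quadratic: with 6,433 cases and 33,406 non-cases it would visit roughly 215 million pairs, so a sort-based implementation would be preferable at full scale. Rescaling all weights by a constant such as 1/7 leaves the result unchanged.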