Uncertainty Aware Approach for Multiple Imputation Using Conventional and Machine Learning Models: A Real-World Data Study

Wabina, Romen Samuel; Looareesuwan, Panu; Sonsilphong, Suphachoke; Teza, Htun; Ponthongmak, Wanchana; McKay, Gareth J.; Attia, John; Pattanateepapon, Anuchate; Panitchoke, Anupol; Thakkinstian, Ammarin

doi:10.1186/s40537-025-01136-3

Romen Samuel Wabina¹, Panu Looareesuwan¹, Suphachoke Sonsilphong², Htun Teza¹, Wanchana Ponthongmak¹, Gareth J. McKay³, John Attia⁴,⁵, Anuchate Pattanateepapon¹, Anupol Panitchoke², Ammarin Thakkinstian¹

¹ Department of Clinical Epidemiology and Biostatistics, Faculty of Medicine Ramathibodi Hospital, Mahidol University, Bangkok, Thailand
² Faculty of Medicine, Khon Kaen University, Khon Kaen, Thailand
³ Centre for Public Health, School of Medicine, Dentistry, and Biomedical Sciences, Queen's University Belfast, Northern Ireland, UK
⁴ School of Medicine and Public Health, University of Newcastle, Australia
⁵ Hunter Medical Research Institute, Newcastle, Australia

Journal of Big Data, 17 April 2025

Abstract

Missing data pose a significant challenge in clinical real-world studies, often arising from unplanned data collection, misplacement, patient loss to follow-up, and other factors. Multiple imputation by chained equations (MICE) is widely used, but its sequential nature introduces uncertainty that can affect the performance of prediction models. We proposed and evaluated three uncertainty-aware functions (uncertainty sampling [US], probability of improvement [PI], and expected improvement [EI]) integrated with linear regression, decision tree, random forest, and XGBoost, using three large datasets with high missing rates: a chronic kidney disease cohort (CKD, n = 31,043) and hypertension cohorts from Ramathibodi Hospital (HT-RAMA, n = 140,047) and Khon Kaen University Hospital (HT-KKU, n = 108,942). In the CKD cohort, uncertainty-aware models significantly improved performance over standard MICE, except for XGBoost; LinearReg-EI performed best (RMSE 0.12, MAE 0.36). LinearReg-US performed best in both HT-RAMA (RMSE 0.24, MAE 8.15) and HT-KKU (RMSE 0.98, MAE 12.00). Unlike standard MICE, uncertainty-aware models produced imputed distributions that closely resembled the original data. Our findings suggest that incorporating uncertainty functions can improve MICE, particularly for linear regression, random forest, and decision tree models.
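
The abstract names the three uncertainty-aware functions without defining them. As a rough illustration only, the sketch below gives their common textbook forms from the Bayesian optimization literature, assuming a Gaussian predictive distribution with mean `mean` and standard deviation `std` for each candidate, and a "larger is better" score. The function names, the `best` reference value, and the `xi` exploration margin are assumptions made for this example; the paper's own formulation and the way it is wired into MICE may differ.

```python
import math


def normal_pdf(z: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def uncertainty_sampling(mean: float, std: float) -> float:
    """US: score a candidate purely by its predictive uncertainty."""
    return std


def probability_of_improvement(mean: float, std: float, best: float, xi: float = 0.0) -> float:
    """PI: probability that the candidate exceeds `best` by at least `xi`."""
    gain = mean - best - xi
    if std <= 0.0:
        # No uncertainty: improvement is either certain or impossible.
        return 1.0 if gain > 0.0 else 0.0
    return normal_cdf(gain / std)


def expected_improvement(mean: float, std: float, best: float, xi: float = 0.0) -> float:
    """EI: expected amount by which the candidate exceeds `best` (floored at zero)."""
    gain = mean - best - xi
    if std <= 0.0:
        return max(gain, 0.0)
    z = gain / std
    return gain * normal_cdf(z) + std * normal_pdf(z)


# Hypothetical candidates as (predictive mean, predictive std) pairs.
candidates = [(1.0, 0.1), (0.8, 0.9), (1.2, 0.05)]
best_so_far = 1.0

# Index of the candidate each function would select.
pick_us = max(range(len(candidates)), key=lambda i: uncertainty_sampling(*candidates[i]))
pick_pi = max(range(len(candidates)), key=lambda i: probability_of_improvement(*candidates[i], best_so_far))
pick_ei = max(range(len(candidates)), key=lambda i: expected_improvement(*candidates[i], best_so_far))
```

The three functions can disagree on the same candidates: US favors the most uncertain one regardless of its mean, PI favors a small but near-certain gain, and EI trades the size of the gain against its probability.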
