A Machine Learning Framework for Dual-Population Sepsis Prediction Using Real and Synthetic Clinical Datasets

Authors

  • Smitha N, Department of Computer Science and Engineering, University of Visvesvaraya College of Engineering (UVCE), K. R. Circle, Bengaluru – 560001, Karnataka, India
  • Ganesha G, Department of Computer Science and Engineering, University of Visvesvaraya College of Engineering (UVCE), K. R. Circle, Bengaluru – 560001, Karnataka, India
  • Tanuja R, Department of Computer Science and Engineering, University of Visvesvaraya College of Engineering (UVCE), K. R. Circle, Bengaluru – 560001, Karnataka, India
  • Manjula S H, Department of Computer Science and Engineering, University of Visvesvaraya College of Engineering (UVCE), K. R. Circle, Bengaluru – 560001, Karnataka, India

DOI:

https://doi.org/10.70917/ijcisim-2026-2659

Keywords:

Sepsis prediction, Machine learning, Synthetic healthcare data, Adult sepsis, Neonatal sepsis, Deep Cross Network, Random Forest, XGBoost, Privacy-preserving healthcare, Clinical decision support

Abstract

Sepsis is a life-threatening medical condition that requires timely diagnosis to reduce morbidity and mortality. However, the development of accurate machine learning models for sepsis prediction is often constrained by limited access to clinical data due to privacy regulations. This study presents a machine learning framework for comparative sepsis prediction using both real and synthetic Adult and Neonatal clinical datasets. The framework integrates data preprocessing, synthetic data generation, exploratory data analysis, feature importance analysis, supervised machine learning, and Deep Cross Network (DCN) learning to evaluate how well privacy-preserving synthetic datasets support clinical prediction. Ten supervised machine learning algorithms (Logistic Regression, K-Nearest Neighbors, Decision Tree, Random Forest, Extra Trees, Histogram Gradient Boosting, AdaBoost, XGBoost, Multi-Layer Perceptron, and a Voting Classifier) were evaluated using Accuracy, Precision, F1-score, Log Loss, and Area Under the Receiver Operating Characteristic Curve (AUC-ROC). Experimental results show that ensemble learning methods generally outperform conventional classifiers across both datasets. For the Adult dataset, the Decision Tree achieved the highest classification accuracy of 98.45%, while the Voting Classifier obtained the highest AUC-ROC of 0.9795. For the Neonatal dataset, XGBoost and Histogram Gradient Boosting achieved 100% Accuracy, Precision, F1-score, and AUC-ROC on the real dataset. Although predictive performance decreased moderately on the synthetic datasets, ensemble models such as the Voting Classifier, Random Forest, and XGBoost maintained strong discriminative capability, indicating that the synthetic data preserve the statistical relationships required for reliable machine learning model development.
Furthermore, the DCN learning curves showed stable convergence and effective feature representation for both the Adult and Neonatal datasets. The findings indicate that privacy-preserving synthetic datasets provide a reliable alternative for early-stage machine learning research, model benchmarking, and clinical decision-support system development while ensuring patient confidentiality.
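
As an illustration of the five evaluation metrics named in the abstract, the following is a minimal standard-library sketch of how Accuracy, Precision, F1-score, Log Loss, and AUC-ROC can be computed for a binary sepsis / no-sepsis task. It is not the authors' implementation: the function name `evaluate_binary`, the 0.5 decision threshold, the clipping constant `eps`, and the toy labels and probabilities are all assumptions made for this example.

```python
import math


def evaluate_binary(y_true, y_prob, threshold=0.5, eps=1e-15):
    """Compute five binary-classification metrics from labels and probabilities.

    y_true: list of 0/1 labels (1 = sepsis in this hypothetical setup).
    y_prob: list of predicted probabilities for the positive class.
    """
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("y_true and y_prob must be non-empty and equal in length")

    # Hard predictions at the (assumed) decision threshold.
    y_pred = [1 if p >= threshold else 0 for p in y_prob]

    # Confusion-matrix counts.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Log loss: mean negative log-likelihood, with probabilities clipped
    # away from 0 and 1 so the logarithm stays finite.
    clipped = [min(max(p, eps), 1 - eps) for p in y_prob]
    log_loss = -sum(
        t * math.log(p) + (1 - t) * math.log(1 - p)
        for t, p in zip(y_true, clipped)
    ) / len(y_true)

    # AUC-ROC as the fraction of (positive, negative) pairs in which the
    # positive case receives the higher score; ties count as half.
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one example of each class")
    wins = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in pos for b in neg
    )
    auc_roc = wins / (len(pos) * len(neg))

    return {
        "accuracy": accuracy,
        "precision": precision,
        "f1": f1,
        "log_loss": log_loss,
        "auc_roc": auc_roc,
    }


if __name__ == "__main__":
    # Toy labels and probabilities, invented purely for illustration.
    labels = [0, 0, 1, 1, 0, 1]
    probs = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
    for name, value in evaluate_binary(labels, probs).items():
        print(f"{name}: {value:.4f}")
```

Accuracy, Precision, and F1-score depend on the chosen threshold, whereas Log Loss and AUC-ROC are computed from the predicted probabilities directly. This is one way a model can lead on accuracy while another leads on AUC-ROC, as reported for the Adult dataset.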

Published

2026-07-04

How to Cite

Smitha N, Ganesha G, Tanuja R, & Manjula S H. (2026). A Machine Learning Framework for Dual-Population Sepsis Prediction Using Real and Synthetic Clinical Datasets. International Journal of Computer Information Systems and Industrial Management Applications, 18(5s), 13–27. https://doi.org/10.70917/ijcisim-2026-2659

Section

Original Articles