Optimizing Feature Selection for Multi-Task Prediction of Non-Communicable Diseases on Imbalanced Healthcare Data
DOI: https://doi.org/10.65205/jcct.2026.e3069
Keywords: Multi-Task Learning, Class Imbalance, SMOTE, XGBoost, LightGBM
Abstract
This research investigates and compares feature selection techniques, covering both Filter and Wrapper methods, on multi-task healthcare data with high class imbalance. The primary objective is to develop a predictive model for non-communicable diseases (NCDs) by combining the most efficient feature selection strategy with machine learning algorithms, evaluated using statistical metrics appropriate for imbalanced datasets. The methodology consists of six stages: 1) data collection from Kaggle, comprising 253,680 records and 19 features; 2) data preprocessing and multi-target definition for three diseases; 3) feature selection using Filter methods (Information Gain, Gain Ratio, and Chi-Square) and Wrapper methods (Forward Selection and Backward Elimination); 4) class imbalance mitigation via SMOTE; 5) classification model development using XGBoost, Random Forest, and LightGBM; and 6) model evaluation and hyperparameter optimization. The experimental results show that the best-performing configuration pairs Gain Ratio feature selection with XGBoost: it improves the performance of the XGBoost model while reducing the feature set from 19 to 7 key features (BMI, Age, Income, Physical Health, General Health, Education, and Mental Health), achieving a peak accuracy of 0.8768 and a maximum AUC of 0.9300. These findings indicate that the proposed approach is well suited to accurate and efficient multi-task prediction of non-communicable diseases.
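As an illustration of the Gain Ratio filter named in the abstract, the sketch below scores discrete features by information gain divided by split information and keeps the top k. It is a minimal, standard-library example and not the authors' implementation: the function names (`entropy`, `gain_ratio`, `top_k_features`) and the toy columns are hypothetical, and continuous features such as BMI are assumed to have been binned beforehand.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature, target):
    """Gain ratio of a discrete feature with respect to a discrete target."""
    n = len(feature)
    # Group the target labels by the value the feature takes.
    groups = {}
    for x, y in zip(feature, target):
        groups.setdefault(x, []).append(y)
    # Expected entropy of the target after splitting on the feature.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(target) - conditional
    # Split information is the entropy of the feature's own distribution.
    split_info = entropy(feature)
    return info_gain / split_info if split_info > 0 else 0.0


def top_k_features(columns, target, k):
    """Return the names of the k features with the highest gain ratio."""
    scores = {name: gain_ratio(col, target) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data (not taken from the Kaggle dataset).
columns = {
    "high_bmi": [1, 1, 1, 1, 0, 0, 0, 0],  # perfectly separates the target
    "smoker": [1, 0, 1, 0, 1, 0, 1, 0],    # carries no information here
}
target = [1, 1, 1, 1, 0, 0, 0, 0]

selected = top_k_features(columns, target, k=1)
```

In a pipeline like the one described, the selected columns would then be passed to the resampling and classification stages; for a multi-target setting, one plausible choice is to compute the scores separately for each disease label.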
License
Copyright (c) 2026 Journal of Computer and Creative Technology

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.