Optimizing Feature Selection for Multi-Task Prediction of Non-Communicable Diseases on Imbalanced Healthcare Data

Authors

  • Yanisa Wongsombut, Science in Information and Communication Technology, School of Science and Technology, Sukhothai Thammathirat Open University, Nonthaburi 11120, Thailand
  • Nithizethe Mhuadthongon, Science in Information and Communication Technology, School of Science and Technology, Sukhothai Thammathirat Open University, Nonthaburi 11120, Thailand, https://orcid.org/0009-0000-3031-9985

DOI:

https://doi.org/10.65205/jcct.2026.e3069

Keywords:

Multi-Task Learning, Class Imbalance, SMOTE, XGBoost, LightGBM

Abstract

This research investigates and compares the performance of feature selection techniques, covering both Filter and Wrapper methods, on multi-task healthcare data with high class imbalance. The primary objective is to develop a predictive model for non-communicable diseases (NCDs) by combining the most efficient feature selection strategies with machine learning algorithms, evaluated with statistical metrics appropriate for imbalanced datasets. The methodology consists of six stages: 1) data collection from Kaggle, comprising 253,680 records and 19 features; 2) data preprocessing and multi-target definition for three diseases; 3) feature selection using Filter methods (Information Gain, Gain Ratio, and Chi-Square) and Wrapper methods (Forward Selection and Backward Elimination); 4) class imbalance mitigation via SMOTE; 5) classification model development using XGBoost, Random Forest, and LightGBM; and 6) model evaluation and hyperparameter optimization. The experimental results show that Gain Ratio feature selection yields the best-performing model: it significantly improves the performance of XGBoost and reduces the feature set from 19 to 7 key features (BMI, Age, Income, Physical Health, General Health, Education, and Mental Health) while achieving a peak accuracy of 0.8768 and a maximum AUC of 0.9300. These findings indicate that the proposed approach is well suited to accurate and efficient multi-task prediction of non-communicable diseases.
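The Gain Ratio filter step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`entropy`, `gain_ratio`, `rank_features`), the toy columns, and the label are all hypothetical, and it assumes the features are already discrete. Continuous variables such as BMI would need to be binned first. Gain Ratio is computed here in its standard form, as information gain divided by the entropy of the feature itself.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy, in bits, of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(feature, target):
    """Information gain of target given feature, divided by the feature's entropy."""
    n = len(feature)
    # Group the target values by the feature value they co-occur with.
    groups = {}
    for x, y in zip(feature, target):
        groups.setdefault(x, []).append(y)
    # Weighted entropy of the target inside each feature-value group.
    conditional = sum(len(ys) / n * entropy(ys) for ys in groups.values())
    info_gain = entropy(target) - conditional
    split_info = entropy(feature)
    # A feature with a single value carries no split information; score it 0.
    return info_gain / split_info if split_info > 0 else 0.0


def rank_features(columns, target, k):
    """Return the names of the k features with the highest gain ratio."""
    scores = {name: gain_ratio(col, target) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: three binary features and one binary disease label.
toy = {
    "high_bmi": [1, 1, 1, 0, 0, 0, 1, 0],
    "smoker": [1, 0, 1, 0, 1, 0, 1, 0],
    "constant": [0, 0, 0, 0, 0, 0, 0, 0],
}
label = [1, 1, 1, 0, 0, 0, 1, 0]

top = rank_features(toy, label, k=2)
print(top)
```

In a multi-task setting such as the one described, one plausible use is to score each feature against each of the three disease targets separately and then combine the rankings. How the study itself combines them is not stated in the abstract.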

Published

16-04-2026

How to Cite

Wongsombut, Y., & Mhuadthongon, N. (2026). Optimizing Feature Selection for Multi-Task Prediction of Non-Communicable Diseases on Imbalanced Healthcare Data. Journal of Computer and Creative Technology, 4(1), e3069. https://doi.org/10.65205/jcct.2026.e3069