การวิเคราะห์และเปรียบเทียบผลกระทบของการลดขนาดคุณลักษณะเชิงสถิติสำหรับการจำแนกด้วยการเรียนรู้ของเครื่อง

Yotsapat Ruangpaisarn; Sakulchai Saramat; Auttanit Wongjak; Sukanya Chansamorn

doi:10.65205/jcct.2026.e3002

Authors

Yotsapat Ruangpaisarn Program in Computer Science, Rajamangala University of Technology Tawan-ok, Chonburi 20110, Thailand https://orcid.org/0009-0006-9467-9780
Sakulchai Saramat Program in Computer Science, Rajamangala University of Technology Tawan-ok, Chonburi 20110, Thailand
Auttanit Wongjak Program in Information and Communication Technology, Rajamangala University of Technology Tawan-ok, Chonburi 20110, Thailand
Sukanya Chansamorn Program in Information and Communication Technology, Rajamangala University of Technology Tawan-ok, Chonburi 20110, Thailand https://orcid.org/0000-0002-3354-2004

DOI:

https://doi.org/10.65205/jcct.2026.e3002

Keywords:

Disease Classification, Feature Selection, Hyperparameter, Machine Learning

Abstract

This study aims to analyze the impact of statistical feature reduction on classification performance and to identify the optimal hyperparameter configurations for predicting medical conditions. Three machine learning algorithms—Support Vector Machine (SVM), Decision Tree (DT) and K-Nearest Neighbors (KNN) were compared using the Healthcare Risk Factors Dataset. Two filter-based feature selection methods, the Kruskal–Wallis test (KW) and Mutual Information (MI), were employed to generate a full feature set and feature subsets at the 25th, 50th, and 75th percentiles. Model performance was evaluated using 5 repeats of 5-fold Stratified K-Fold Cross-Validation, integrated with systematic hyperparameter tuning. The experimental results demonstrated that SVM achieved the highest performance using the full feature set (C = 300), yielding an accuracy of 0.9214 ± 0.0005, precision of 0.9212 ± 0.0005, recall of 0.9214 ± 0.0005, and an F1-Score of 0.9210 ± 0.0005. This accuracy significantly outperformed DT (0.8752 ± 0.0028) and KNN (0.7916 ± 0.0015). Furthermore, reducing the feature set to the 75th percentile using the KW method maintained a high accuracy of 0.9165 ± 0.0006, representing a marginal decline of only 0.49%. These findings indicate that SVM is highly suitable for disease diagnosis within this dataset. Moreover, statistical feature reduction effectively minimizes computational complexity without a significant loss in accuracy.

Downloads

Download data is not yet available.

References

Ahmed, A. (2025). Healthcare Risk Factors Dataset [Dataset]. Kaggle. https://www.kaggle.com/datasets/abdallaahmed77/healthcare-risk-factors-dataset

Bashir, S., Khattak, I. U., Khan, A., Khan, F. H., Gani, A., & Shiraz, M. (2022). A Novel Feature Selection Method for Classification of Medical Data Using Filters, Wrappers, and Embedded Approaches. Complexity, 2022(1), 8190814. https://doi.org/10.1155/2022/8190814 DOI: https://doi.org/10.1155/2022/8190814

Clarke, R., Ressom, H. W., Wang, A., Xuan, J., Liu, M. C., Gehan, E. A., & Wang, Y. (2008). The Properties of High-Dimensional Data Spaces: Implications for Exploring Gene and Protein Expression Data. Nature Reviews Cancer, 8(1), 37-49. https://doi.org/10.1038/nrc2294 DOI: https://doi.org/10.1038/nrc2294

Guyon, I., & Elisseeff, A. (2003). An Introduction to Variable and Feature Selection. Journal of Machine Learning Research, 3, 1157-1182.

Jothi, N., Syed-Mohamed, S. M., & Rajagopal, H. (2022, July 20-22). Hybrid Feature Selection using Shapley Value and ReliefF for Medical Datasets. 2022 International Conference on Inventive Computation Technologies, 351-355. Institute of Electrical and Electronics Engineers. https://doi.org/10.1109/icict54344.2022.9850833 DOI: https://doi.org/10.1109/ICICT54344.2022.9850833

Kavakiotis, I., Tsave, O., Salifoglou, A., Maglaveras, N., Vlahavas, I., & Chouvarda, I. (2017). Machine Learning and Data Mining Methods in Diabetes Research. Computational and Structural Biotechnology Journal, 15, 104-116. https://doi.org/10.1016/j.csbj.2016.12.005 DOI: https://doi.org/10.1016/j.csbj.2016.12.005

Kruskal, W. H., & Wallis, W. A. (1952). Use of Ranks in One-Criterion Variance Analysis. Journal of the American Statistical Association, 47(260), 583-621. https://doi.org/10.1080/01621459.1952.10483441 DOI: https://doi.org/10.1080/01621459.1952.10483441

Majed, R. J., Al-Heddi, R. M., & Zeki, A. M. (2023, October 25-26). Parkinson’s Disease Detection Using Data Mining Models: A Comparative Study. 2023 4th International Conference on Data Analytics for Business and Industry, 290-294. Institute of Electrical and Electronics Engineers. https://doi.org/10.1109/icdabi60145.2023.10629302 DOI: https://doi.org/10.1109/ICDABI60145.2023.10629302

Narwane, S. V., & Sawarkar, S. D. (2021, August 27-29). Dimensionality Reduction of Unbalanced Datasets: Principal Component Analysis. 2021 Asian Conference on Innovation in Technology, 1-6. Institute of Electrical and Electronics Engineers. https://doi.org/10.1109/asiancon51346.2021.9544971 DOI: https://doi.org/10.1109/ASIANCON51346.2021.9544971

Prusty, S., Patnaik, S., & Dash, S. K. (2022). SKCV: Stratified K-Fold Cross-Validation on ML Classifiers for Predicting Cervical Cancer. Frontiers in Nanotechnology, 4, 972421. https://doi.org/10.3389/fnano.2022.972421 DOI: https://doi.org/10.3389/fnano.2022.972421

Remeseiro, B., & Bolon-Canedo, V. (2019). A Review of Feature Selection Methods in Medical Applications. Computers in Biology and Medicine, 112, 103375. https://doi.org/10.1016/j.compbiomed.2019.103375 DOI: https://doi.org/10.1016/j.compbiomed.2019.103375

Saha, P., Patikar, S., & Neogy, S. (2020, October 2-4). A Correlation-Sequential Forward Selection Based Feature Selection Method for Healthcare Data Analysis. 2020 IEEE International Conference on Computing, Power and Communication Technologies, 69-72. Institute of Electrical and Electronics Engineers. https://doi.org/10.1109/gucon48875.2020.9231205 DOI: https://doi.org/10.1109/GUCON48875.2020.9231205

Salehin, S., Islam, A. J., Iqbal, A. M., Barua, L., Islam, K., & Uddin, A. (2024, December 7-8). A Comparative Analysis of Feature Selection Methods on the Accuracy of Heart Disease Prediction Models. 2024 International Conference on Recent Progresses in Science, Engineering and Technology, 1-6. Institute of Electrical and Electronics Engineers. https://doi.org/10.1109/icrpset64863.2024.10955882 DOI: https://doi.org/10.1109/ICRPSET64863.2024.10955882

Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal, 27(3), 379-423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x DOI: https://doi.org/10.1002/j.1538-7305.1948.tb01338.x

Siburian, R. H., Km Nasution, M., & Tarigan, J. T. (2025, January 21). Optimization Of KNN, SVM, And SVM Kernel in Water Potability Prediction with Hyperparameter Approach. 2025 International Conference on Computer Sciences, Engineering, and Technology Innovation, 576-581. Institute of Electrical and Electronics Engineers. https://doi.org/10.1109/icocseti63724.2025.11019055 DOI: https://doi.org/10.1109/ICoCSETI63724.2025.11019055

Uddin, S., Khan, A., Hossain, M. E., & Moni, M. A. (2019). Comparing Different Supervised Machine Learning Algorithms for Disease Prediction. BMC Medical Informatics and Decision Making, 19(1), 281. https://doi.org/10.1186/s12911-019-1004-8 DOI: https://doi.org/10.1186/s12911-019-1004-8