Comparative Study of Machine Learning Pre-processing Techniques for Diabetes Mellitus Prediction
Keywords:
Diabetes Mellitus, Machine Learning, Data Pre-processing, Feature Selection, Prediction Accuracy, Healthcare Analytics
Abstract
Diabetes Mellitus is a chronic and life-threatening disease that requires timely and accurate diagnosis to prevent complications. Machine Learning (ML) techniques have become critical in healthcare for early disease prediction, but their performance depends heavily on the quality of the input data. Data pre-processing, including data cleaning, feature selection and missing-value handling, can significantly improve predictive accuracy. This paper provides a comparative analysis of pre-processing methods applied to diabetes prediction datasets, including the Pima Indian Diabetes dataset. The techniques reviewed include Average Weighted Objective Distance (AWOD), feature engineering pipelines, wrapper-based feature selection, Support Vector Machines, Random Forests, fuzzy SVM, ANFIS-based imputation, clustering and hybrid approaches. Results from previous studies show that optimized feature selection and tailored pre-processing pipelines can raise prediction accuracy above 98% in some cases. The paper concludes that no single method is universally applicable, but that pre-processing plays a crucial role in improving the robustness and accuracy of diabetes prediction models. Future work should focus on adaptive hybrid models that combine multiple techniques to handle heterogeneous healthcare datasets.
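As a rough illustration of the missing-value handling step mentioned in the abstract, the sketch below fills missing entries with the median of the observed values in each column. It is not taken from any of the reviewed studies: the function name `impute_median`, the use of `None` as the missing-value marker, and the toy rows (glucose, BMI) are all assumptions made for this example.

```python
from statistics import median


def impute_median(rows, missing=None):
    """Return a copy of rows with missing entries replaced by the column median.

    Assumption for this sketch: every column has at least one observed value;
    statistics.median raises StatisticsError on an empty column.
    """
    n_cols = len(rows[0])
    medians = []
    for j in range(n_cols):
        # Collect only the observed (non-missing) values of column j.
        observed = [r[j] for r in rows if r[j] is not missing]
        medians.append(median(observed))
    # Rebuild each row, substituting the column median where a value is missing.
    return [
        [medians[j] if r[j] is missing else r[j] for j in range(n_cols)]
        for r in rows
    ]


# Hypothetical toy data: columns are (glucose, BMI); None marks a missing value.
rows = [
    [148.0, 33.6],
    [None, 26.6],
    [183.0, None],
    [89.0, 28.1],
]
cleaned = impute_median(rows)
```

Median imputation is only one simple option. The abstract also mentions model-based imputation, such as ANFIS, which may be preferable when missingness is related to other features.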
References
• Gupta, P., & Sharma, K. (2021). Prediction of Type 2 Diabetes using Average Weighted Objective Distance (AWOD). Journal of Healthcare Informatics Research.
• Verma, S., & Choudhary, R. (2021). Crow Search Algorithm and Feature Engineering for Diabetes Classification. International Journal of Computational Intelligence Systems.
• Patel, D., & Singh, M. (2020). Wrapper-based Feature Selection with Multilayer Perceptron for Early Diabetes Prediction. Expert Systems with Applications.
• Rao, A., & Jain, P. (2020). Feature Selection and Classification Models using WEKA Environment. Procedia Computer Science.
• Khan, S., & Iqbal, N. (2019). Performance Analysis of SVM and Random Forest for Diabetes Prediction. International Journal of Data Mining and Bioinformatics.
License
Copyright (c) 2019 International Journal of Engineering, Science and Humanities

This work is licensed under a Creative Commons Attribution 4.0 International License.