Abstract
Background: As quality of life improves, people have more time and energy to attend to their own
health. Diabetes, one of the most common and fastest-growing diseases, has attracted widespread
attention from bioinformatics researchers. It affects people of all ages worldwide and can shorten
patients' life spans. It can also lead to serious complications, especially in the elderly, such as
cardiovascular and cerebrovascular diseases, stroke, and multiple organ damage. Because of this impact
on health, an accurate initial diagnosis is essential: it can reduce the likelihood that the condition
deteriorates. Identifying and analyzing potential risk factors across different physical attributes can
help assess the prevalence of diabetes, and the more accurate that assessment, the more likely it is
that the incidence of complications can be reduced.
Methods: In this paper, we use the open-source NHANES data set to identify and analyze potential
risk factors for diabetes, applying improved versions of Logistic Regression, SVM, and other
machine learning algorithms.
Results: Experimental results show that the improved version of Random Forest performs best,
with a classification accuracy of 92%. Age, diabetes in blood relatives, high blood pressure,
cholesterol, and BMI emerge as the most important risk factors for diabetes.
Conclusion: The proposed machine learning method can cope with the class imbalance and outlier
detection problems that arise in this task.
Keywords:
Health problems, diabetes, risk factors, machine learning, class imbalance, outlier detection.
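
The abstract names class imbalance and outlier detection as the two data problems the method addresses, but does not say how. The sketch below illustrates two common, generic techniques for them: inverse-frequency ("balanced") class weights and an interquartile-range (IQR) outlier rule. These are stand-ins chosen for illustration, not the improved algorithms of the paper. The function names, the toy labels, and the BMI values are all hypothetical and are not NHANES records.

```python
from collections import Counter
from statistics import quantiles


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes receive larger weights, so a weighted learner
    penalizes mistakes on them more heavily.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def iqr_outlier_mask(values, k=1.5):
    """Return True for each value outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


# Hypothetical toy data: 1 = diabetic, 0 = non-diabetic (not real survey data).
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
bmi = [22.0, 24.5, 23.1, 27.8, 25.0, 26.3, 21.9, 24.0, 31.2, 95.0]

weights = balanced_class_weights(labels)
mask = iqr_outlier_mask(bmi)

print(weights)
print([value for value, is_outlier in zip(bmi, mask) if is_outlier])
```

On this toy input, the minority class receives a weight of 2.5 against 0.625 for the majority class, and only the implausible BMI of 95.0 is flagged. In practice such weights would be passed to a classifier that supports per-class weighting, and the flagged records would be reviewed before being dropped or corrected.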