Diabetes Induced Factors Prediction Based on Various Improved Machine
Learning Methods

Jun Wu; Lulu Qu; Guoping Yang; Nan Han

Abstract

Background: As quality of life improves, people have more time and energy to devote to their own health. Diabetes, one of the most common and fastest-growing diseases, has attracted widespread attention from experts in bioinformatics. It affects people of all ages worldwide and can shorten patients' life spans. It can also lead to serious complications, especially in the elderly, such as cardiovascular and cerebrovascular diseases, stroke, and multiple organ damage. Because diabetes has such a significant impact on human health, an accurate initial diagnosis is essential: it reduces the likelihood that the disease will worsen. Identifying and analyzing potential risk factors among patients' physical attributes can support this diagnosis, and the more accurately diabetes is identified, the more likely it is that the incidence of complications can be reduced.

Methods: We use the open-source NHANES data set and apply improved versions of Logistic Regression, SVM, and other machine learning algorithms to identify potential risk factors for diabetes.

Results: Experiments show that the improved Random Forest performs best, with a classification accuracy of 92%. Age, blood-related diabetes, high blood pressure, cholesterol, and BMI emerge as the most important risk factors for diabetes.
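
As a rough illustration of the kind of comparison described above, the following sketch trains Logistic Regression, an RBF-kernel SVM, and a Random Forest on synthetic, class-imbalanced data and ranks the Random Forest's feature importances. It is not the authors' improved algorithms, which the abstract does not specify. It uses stock scikit-learn estimators, and the data, the placeholder feature names `f0` to `f5`, the `compare_models` helper, and all hyperparameters are assumptions made for illustration. On real NHANES data the columns would be attributes such as age or BMI.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder feature names; the data below are synthetic, not NHANES.
FEATURES = ["f0", "f1", "f2", "f3", "f4", "f5"]


def compare_models(seed=0):
    """Fit three stock classifiers and rank Random Forest feature importances."""
    # Synthetic binary problem with roughly 15% positives (assumed ratio).
    X, y = make_classification(
        n_samples=1000,
        n_features=len(FEATURES),
        n_informative=4,
        n_redundant=1,
        weights=[0.85, 0.15],
        random_state=seed,
    )
    # Stratify so both splits keep the same class ratio.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    # class_weight="balanced" reweights classes inversely to their frequency.
    models = {
        "logistic_regression": make_pipeline(
            StandardScaler(),
            LogisticRegression(class_weight="balanced", max_iter=1000),
        ),
        "svm_rbf": make_pipeline(
            StandardScaler(), SVC(kernel="rbf", class_weight="balanced")
        ),
        "random_forest": RandomForestClassifier(
            n_estimators=200, class_weight="balanced", random_state=seed
        ),
    }
    accuracies = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        accuracies[name] = accuracy_score(y_test, model.predict(X_test))

    # Impurity-based importances from the fitted forest, highest first.
    forest = models["random_forest"]
    ranking = sorted(
        zip(FEATURES, forest.feature_importances_),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return accuracies, ranking


if __name__ == "__main__":
    accuracies, ranking = compare_models()
    for name, acc in accuracies.items():
        print(f"{name}: accuracy = {acc:.3f}")
    for feature, weight in ranking:
        print(f"{feature}: importance = {weight:.3f}")
```

The accuracies printed by this sketch come from synthetic data and say nothing about the 92% figure reported above. On imbalanced data, accuracy is usually read alongside per-class metrics such as recall.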

Conclusion: The proposed machine learning methods can cope with class imbalance and outlier detection problems.
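
The abstract does not say how class imbalance and outliers are handled, so the sketch below shows two common baseline techniques as an assumption: an interquartile-range (IQR) filter for outliers and random oversampling of the minority class. The helper names (`iqr_bounds`, `drop_outliers`, `oversample_minority`), the column names `bmi` and `diabetes`, and the multiplier `k=1.5` are hypothetical choices for illustration.

```python
import random
import statistics


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences at k * IQR below Q1 and above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_outliers(rows, column, k=1.5):
    """Keep only rows whose value in `column` lies within the IQR fences."""
    low, high = iqr_bounds([row[column] for row in rows], k)
    return [row for row in rows if low <= row[column] <= high]


def oversample_minority(rows, label, seed=0):
    """Resample smaller classes with replacement until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw extra copies only for classes smaller than the target size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


if __name__ == "__main__":
    # Made-up records: one implausible BMI value and a 6:3 class ratio.
    bmis = [22, 24, 25, 27, 29, 31, 23, 26, 28, 95]
    labels = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]
    rows = [{"bmi": b, "diabetes": d} for b, d in zip(bmis, labels)]

    cleaned = drop_outliers(rows, "bmi")
    balanced = oversample_minority(cleaned, "diabetes")
    print(len(rows), len(cleaned), len(balanced))
```

In practice, oversampling is applied only to the training split so that duplicated records do not leak into the evaluation data.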

Keywords: Health problems, diabetes, risk factors, machine learning, class imbalance, outlier detection.

Cite As

Current Bioinformatics
