Heart Disease Risk Identification Using Machine Learning Techniques for a Highly Imbalanced Dataset: A Comparative Study
Abstract
Heart disease is one of the most prevalent diseases in the world today. An estimated 32%
of all deaths worldwide are caused by heart disease. One major reason is that it is extremely
difficult, even for medical practitioners, to predict heart disease events such as heart attacks; it is a complex task that
requires a great amount of knowledge and experience. The number of deaths caused by heart disease has increased
substantially in the recent past. Machine learning has become one of the most popular areas in computer science
where many complex problems have been addressed successfully, especially in the field of medicine. In this study we
trained multiple supervised classifiers, namely Naïve Bayes, LightGBM, Decision Trees, Random Forest, XGBoost,
K-Nearest Neighbours, and AdaBoost, compared their accuracies, and identified which models perform better
for heart disease prediction. We used the Behavioral Risk Factor Surveillance System (BRFSS) 2015 Heart Disease
Health Indicators Dataset, which is highly imbalanced. To address the class imbalance problem, we
used methods such as the Synthetic Minority Over-sampling Technique (SMOTE), Adaptive Synthetic Sampling (ADASYN),
Random Over-Sampling, Random Under-Sampling, Tomek Links, SMOTETomek, SMOTEENN, and Cluster Centroids.
According to the results obtained, we conclude that hybrid sampling methods such as SMOTEENN and SMOTETomek
performed better than the other sampling methods.
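
As an illustration of the two simplest resampling strategies named above, random over-sampling and random under-sampling, the following is a minimal sketch using only the Python standard library. It is not the implementation used in this study: the function names `random_over_sample` and `random_under_sample` and the ten-row toy dataset are hypothetical and chosen only for demonstration.

```python
import random
from collections import Counter


def random_over_sample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until every
    class has as many rows as the largest class. (Hypothetical helper.)"""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        # Draw with replacement to fill the gap to the majority size.
        for _ in range(target - n):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def random_under_sample(X, y, seed=0):
    """Keep a random subset of each class so that every class has as
    many rows as the smallest class. (Hypothetical helper.)"""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        rows = [x for x, lab in zip(X, y) if lab == label]
        # Draw without replacement so no row is repeated.
        for x in rng.sample(rows, target):
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


# Toy imbalanced data (an assumption for illustration): 8 negatives, 2 positives.
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_over, y_over = random_over_sample(X, y)
X_under, y_under = random_under_sample(X, y)

print(Counter(y_over))   # Counter({0: 8, 1: 8})
print(Counter(y_under))  # Counter({0: 2, 1: 2})
```

Over-sampling keeps all original rows but repeats minority examples, whereas under-sampling discards majority examples. Methods such as SMOTE and ADASYN instead synthesise new minority examples, and hybrid methods such as SMOTETomek and SMOTEENN combine over-sampling with a cleaning step.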