Health Care

The Potential Of Machine Learning In Advancing The Prediction Of Coronary Cardiovascular Disease

Abstract

Coronary cardiovascular disease (CVD) remains a leading cause of death worldwide, necessitating accurate and early prediction methods to mitigate risks and improve patient outcomes. This study utilizes the Rates_and_Trends_in_Coronary_Heart_Disease dataset, which consists of two classes: Normal and Predicting CVD, to develop predictive models for coronary heart disease. Several machine learning models were employed, including the Random Forest Classifier, Decision Tree, XGBoost Classifier, and K-Nearest Neighbors (KNN), along with a proposed Artificial Neural Network (ANN) model. The models were evaluated using performance metrics such as precision, recall, F1 score, and accuracy to ensure a thorough assessment of their effectiveness in predicting CVD. The Random Forest and XGBoost classifiers both achieved an accuracy of 91%, while the Decision Tree and KNN models each achieved 90% accuracy. Notably, the proposed ANN model significantly outperformed the others, achieving an impressive 99% accuracy. These findings underscore the potential of machine learning, particularly deep learning, in advancing the prediction of coronary cardiovascular disease, paving the way for improved diagnostic and preventive strategies.

Introduction


The heart is essential to life because it efficiently pumps oxygen-rich blood and helps regulate key hormones to keep blood pressure at ideal levels. Any disruption in its operation can result in the development of heart diseases, which are collectively referred to as Cardiovascular Diseases (CVD) (Robinson, 2021). CVD encompasses a variety of conditions affecting the heart and blood vessels, including cerebrovascular accidents, congenital defects, pulmonary blood clots, cardiac arrhythmias, peripheral arterial disease, Coronary Artery Disease (CAD), rheumatic heart conditions, Coronary Heart Disease (CHD), and heart-muscle-affecting cardiomyopathies (Saheera & Krishnamurthy, 2020). Coronary heart disease (CHD) is the subtype that comprises a substantial 64% of all cases. Although it mostly affects men, women are not immune to its effects. Among CVDs, CAD is especially worrisome because of its correlation with worldwide death rates (Al-Khlaiwi et al., 2023). The World Health Organization (WHO) states that CVDs have severe repercussions, with startling data showing that these illnesses are thought to cause 17.9 million deaths globally each year (Prabhakaran et al., 2022). These figures demonstrate the importance of scientific investigations and medical breakthroughs aimed at preventing and decreasing the effects of cardiovascular illnesses globally (Vaduganathan et al., 2022).

Millions of lives are lost to CVDs every year, which is a major cause for concern in the global healthcare community (Flores-Alonso et al., 2022). It is critical to give the early identification and management of CVDs top priority in order to lower mortality rates (Sarrafzadegan & Mohammmadifard, 2019). Although auscultation is a straightforward and accurate technique for identifying CVDs, even highly skilled doctors may find rapid identification difficult (Yan et al., 2019). Physicians can make better decisions with the aid of artificial intelligence-driven automated cardiac screening systems based on phonocardiography (PCG) classification (Sethi et al., 2022).

The World Health Organization (WHO) (Organization, 2020) states that heart diseases are the primary cause of death worldwide. CVDs are an important concern for the global healthcare community, claiming millions of lives each year (Shokouhmand et al., 2021). The Internet of Medical Things (IoMT) is a technology that links medical devices and collects and processes data in real time, enhancing healthcare workflows. It combines the power of IoT with patient data while ensuring data security within the IoMT-based framework (Alshehri & Muhammad, 2020). In today’s healthcare sector, the IoMT is a rapidly developing field in which a variety of contemporary medical equipment, software programs, and healthcare professionals come together on a single platform to provide high-quality services (Jadhav, 2018). Globally, cardiovascular disease is the primary reason for rising death rates (Shaffer & Ginsberg, 2017).

The heart’s arteries are affected by the common and potentially dangerous condition known as coronary artery disease (CAD) (Shao et al., 2020). In this condition, the coronary arteries, which supply the heart muscle with oxygen-rich blood, become narrowed or obstructed. The main cause of CAD is atherosclerosis, a disorder in which fatty deposits, cholesterol, calcium, and other materials build up inside the artery walls and form plaques (Shao et al., 2020). Common risk factors include smoking, high blood pressure, high cholesterol, diabetes, obesity, a sedentary lifestyle, a family history of heart disease, and ageing (Ciumărnean et al., 2021). CAD can cause angina (chest pain) and, if a plaque ruptures or a blood clot blocks an artery, it can lead to a heart attack, as shown in Figure 1.1. Diagnosis involves medical history, physical examination, ECG, stress tests, imaging, and blood tests (Habuza et al., 2021). Reducing symptoms, averting complications, and enhancing general cardiovascular health are the goals of CAD management (Cacciatore et al., 2023). Lifestyle changes are imperative, including regular exercise, a diet low in cholesterol and fatty foods, weight management, quitting smoking, and reducing stress. Prescription drugs are frequently used to treat symptoms such as angina, inhibit blood clots, reduce cholesterol levels, and control blood pressure (Flora & Nayak, 2019). To restore blood flow to the heart, invasive treatments such as coronary artery bypass grafting (CABG) or angioplasty with stent placement may be required in specific circumstances.

Figure 1.1: Coronary Artery Disease

Machine learning has the potential to transform the healthcare sector. Its remarkable progress can be attributed to data-processing capabilities that far exceed human capabilities (Quazi, 2022). As a result, the healthcare industry has seen the creation of a number of AI applications that take advantage of the speed and accuracy of machine learning, opening the door to ground-breaking answers to a variety of healthcare problems (Holmes et al., 2004). Numerous machine-learning techniques have been used to identify cardiovascular illnesses, yet predictive models still need to be improved, and research gaps in the current detection methods need to be filled (Quazi, 2022). One such area is the problem of imbalanced datasets, which can result in biased predictions. Researchers have explored a variety of approaches, such as neural networks and other machine learning methods, to improve prediction accuracy, including examining the efficacy of hybrid models that combine different techniques. The variations in datasets, models, and results highlight the intricacy of the predictive task, even though these studies offer insightful information. Despite these developments, further research is still needed to improve the accuracy of cardiovascular disease prediction using current models. The wide range of machine learning applications in this field highlights how crucial it is to carry out more research to improve the predictive models’ generalizability, accuracy, and dependability in order to improve patient care and medical treatments.

Research Motivation

Machine learning techniques are being used to predict CVD in response to the pressing need to enhance early diagnosis and intervention strategies to lower the high rates of morbidity and death that are associated with this condition. Conventional diagnostic techniques can be expensive, time-consuming, and occasionally unreliable. They also frequently require invasive procedures. Through the analysis of massive volumes of medical data, machine learning provides a non-invasive, effective, and possibly more accurate alternative by identifying patterns and risk factors related to CVD. Healthcare professionals can use machine learning algorithms to forecast a patient’s risk of CVD based on various clinical parameters, including demographic data, medical history, and results of diagnostic tests. Ultimately, this method improves patient outcomes and maximizes healthcare resources by improving diagnosis precision and facilitating individualized treatment plans and preventive measures.

Research problem

The research problem in machine learning-based CVD prediction centres on the difficulty of creating precise, dependable, and broadly applicable models that can successfully identify individuals at risk. The complexity of feature selection and engineering to ensure relevant variables are used without introducing bias, the need for high-quality, comprehensive datasets that include diverse demographic and clinical features, and the requirement for sophisticated algorithms capable of handling the complex relationships between risk factors are all involved in this. Developing clinical trust and ensuring that healthcare providers can comprehend and act upon the predictions depends heavily on the interpretability and transparency of the model. Another difficulty is incorporating these machine learning models into the current healthcare systems in a way that makes sense for users and facilitates clinical workflows. Developing strong, useful tools that improve early diagnosis, direct preventive measures, and eventually lower the burden of CVD requires addressing these issues.

Research scope

The field of study on the application of machine learning techniques to the prediction of CVD includes an array of important domains to create a comprehensive and efficient predictive framework. To make sure the models are inclusive and widely applicable, this involves gathering and integrating diverse datasets with clinical, demographic, lifestyle, and genetic data from different populations. To determine the most precise and effective methods, the scope also entails investigating and contrasting various machine learning algorithms, including logistic regression, decision trees, and support vector machines. Crucial elements include feature engineering and selection, which concentrate on locating the most significant predictors and reducing noise. To guarantee robustness and reliability, the scope also includes model validation and evaluation using metrics like accuracy, precision, recall, and AUC-ROC. The model’s interpretability, which guarantees that healthcare practitioners can readily comprehend and apply the predictions, is another essential component.
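
To make the evaluation step concrete, the following is a minimal, illustrative sketch (not the thesis pipeline): it trains two of the algorithms named above on synthetic tabular data and reports accuracy, precision, recall, and AUC-ROC. The variable names and the synthetic data are assumptions made purely for illustration.

```python
# Illustrative sketch: training two baseline models on a synthetic tabular CVD-style
# dataset and reporting the evaluation metrics named above.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                      # placeholder clinical/demographic features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier(max_depth=4)):
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    proba = model.predict_proba(X_te)[:, 1]
    print(type(model).__name__,
          "acc=%.3f prec=%.3f rec=%.3f auc=%.3f" % (
              accuracy_score(y_te, pred), precision_score(y_te, pred),
              recall_score(y_te, pred), roc_auc_score(y_te, proba)))
```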

Research objectives

The primary objectives of this thesis are as follows:

To propose a comprehensive framework that integrates diverse data sources, including clinical records, and demographic information to create a robust dataset for model training and validation.

To implement various machine learning algorithms in order to identify and compare the most effective techniques for accurately predicting CVD risk.

To develop advanced feature selection and engineering methods that pinpoint the most relevant predictors, reduce dimensionality, and minimize data noise, thereby enhancing the accuracy and efficiency of the predictive models.

Research questions

This research seeks to answer the following questions:

How can diverse data sources, including clinical records and demographic information, be effectively integrated to create a robust and comprehensive dataset for model training and validation in predicting CVD risk?

What are the computational and practical considerations when implementing various machine learning models for real-world clinical applications?

What strategies can be employed to balance the complexity and interpretability of predictive models while maintaining high accuracy in CVD risk prediction?

The proposed contribution of the dissertation

This work aims to build a robust and comprehensive dataset necessary for precise model training and validation by creating a novel framework that integrates various data sources, such as demographic data and clinical records. The goal of the dissertation is to determine the best methods for predicting CVD risk by applying and thoroughly analyzing a variety of machine learning algorithms, including logistic regression, decision trees, and support vector machines. It will also include advanced feature engineering and selection techniques to reduce data dimensionality, minimize noise, and highlight the most important predictors, all of which will improve model efficiency and accuracy. To facilitate practical adoption in clinical settings, the research will focus on developing interpretable models that offer precise, practical insights into CVD risk factors. The dissertation will evaluate the influence of these predictive models on patient outcomes, preventive measures, and early diagnosis by incorporating them into clinical workflows. This will show how machine learning can revolutionize CVD risk prediction and management.

Dissertation organization

Chapter 1 introduces the research problem, objectives, and significance of predicting cardiovascular disease using machine learning. Chapter 2 reviews the existing research on CVD prediction and the application of machine learning in healthcare; it identifies gaps in the current literature, justifying the need for this study and highlighting opportunities for innovation. Chapter 3 describes in detail the comprehensive framework proposed for integrating diverse data sources, including clinical records and demographic information, along with the selection of machine learning algorithms, the feature selection and engineering methods developed, and the experimental design and evaluation metrics used to assess model performance. Chapter 4 presents the performance evaluation of the machine learning models, reporting the experiments and results together with a discussion of limitations. Chapter 5 includes the conclusion and future work.

Chapter Summary

This chapter summarizes the impact of coronary and cardiovascular disease (CVD) on the world’s health and emphasizes the urgent need for precise and early prediction techniques. It shows the possibility that machine learning methods could transform CVD risk prediction by providing accurate, quick, and non-invasive diagnostic instruments. The chapter describes the goals of the research, which include improving feature selection techniques, implementing different machine learning algorithms, and creating a comprehensive framework that integrates a variety of data sources.

Background And Literature Reviewed

Background

Cardiovascular disease (CVD) is a major global health concern that is associated with multiple risk factors, such as obesity, smoking, high cholesterol, a lack of exercise, and hypertension (Flora & Nayak, 2019). Heart arrhythmias, congestive heart failure, and congenital heart disease are just a few of the conditions that fall under the general heading of CVD (Lockhart & Sun, 2021). The complicated and frequently problematic nature of traditional methods for predicting and diagnosing CVD has had an adverse effect on people’s general well-being (Levine et al., 2021). Since this illness continues to be the primary cause of death in both developed and developing nations, appropriate preventive and diagnostic measures are required. Due to a lack of resources, physicians in developing nations have difficulty correctly diagnosing and treating CVD. Early detection and risk assessment for CVD have been made possible by the introduction of computer technology and machine learning as clinical decision-making aids. Because medical data are so complex, medical data mining technologies must be able to extract meaningful information from the vast amounts of data in the healthcare industry. Our CVD prediction technology has the potential to save millions of lives by facilitating faster treatment for more people.

A significant change occurred with the introduction of electronic health records (EHRs) in the late 20th and early 21st centuries (Arvisais-Anhalt et al., 2022). These records offered a plethora of patient data that could be used for increasingly complex analysis. The handling of these massive datasets was made easier by concurrent improvements in processing power and data storage, which opened the door for the use of machine learning in healthcare (Awotunde et al., 2021). Simple algorithms like logistic regression and decision trees were the focus of early machine-learning applications in CVD prediction. These methods were more accurate than traditional statistical approaches, but they were still constrained by the complexity of the disease.

The development of machine learning in recent years has given the medical industry a revolutionary opportunity (Ahmed et al., 2020). Large, complex datasets can be analyzed using machine learning techniques to find patterns and insights that may be missed by traditional statistical methods (Meshref, 2019). Researchers can create predictive models that evaluate the risk of CVD based on a variety of variables, such as genetic information, lifestyle choices, clinical records, and demographic data, by utilizing machine learning (Allan et al., 2022). These models can provide more accurate and customized risk assessments, which can help with early diagnosis and focused preventive actions.

The creation of models that can accurately predict the risk of CVD from medical imaging data, such as echocardiograms and coronary angiography, represents significant advancements in this evolution (Gahungu et al., 2020). Furthermore, research has shown the potential of incorporating genomic data into prediction models to provide insights into the genetic predispositions that influence the risk of CVD. The availability of large-scale public health datasets and the development of sophisticated algorithms that can extract meaningful patterns from noisy, high-dimensional data have been key drivers of these advancements.

Traditional diagnostic methods for heart disease, such as ECGs and echocardiograms, while effective, often require specialized equipment and clinical settings, posing challenges for continuous monitoring and early detection (Ulloa-Cerna et al., 2022). Deep learning, a branch of machine learning, has shown substantial potential in medical diagnostics due to its ability to automatically extract features and patterns from raw data. CNNs excel at identifying spatial features in data, making them particularly suitable for analyzing complex patterns in heart sound spectrograms (Shuvo et al., 2021). LSTM networks, a type of recurrent neural network (RNN), are proficient at learning temporal dependencies, making them ideal for time-series data such as heart sounds. By leveraging these deep learning models, an IoMT-based approach can enable continuous, real-time analysis of heart sounds, providing timely and accurate diagnoses. Early detection of heart diseases becomes more feasible, leading to better treatment outcomes. Additionally, this approach extends diagnostic capabilities to remote and underserved areas, improving healthcare accessibility. By minimizing reliance on subjective human interpretation, the method standardizes diagnostic procedures and improves accuracy. The combination of IoMT and deep learning thus holds significant potential in transforming heart disease diagnosis and management, ultimately contributing to better healthcare outcomes and patient quality of life (Adewole et al., 2021).
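
As a rough illustration of the CNN-plus-LSTM idea described above (and not the architecture of any system cited here), the sketch below combines a small convolutional front end over a heart-sound spectrogram with an LSTM over the resulting time steps. The input shape and layer sizes are assumptions chosen only for the example.

```python
# Hypothetical CRNN sketch: a CNN extracts spatial features from a spectrogram,
# then an LSTM models the temporal sequence of those features.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_crnn(time_steps=128, mel_bins=64, n_classes=2):
    inp = layers.Input(shape=(time_steps, mel_bins, 1))     # spectrogram treated as an image
    x = layers.Conv2D(16, 3, padding="same", activation="relu")(inp)
    x = layers.MaxPooling2D(pool_size=(1, 2))(x)            # pool only along the frequency axis
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=(1, 2))(x)
    x = layers.Reshape((time_steps, -1))(x)                 # one feature vector per time step
    x = layers.LSTM(64)(x)                                  # temporal dependencies
    out = layers.Dense(n_classes, activation="softmax")(x)
    return models.Model(inp, out)

model = build_crnn()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
```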

Literature review

This study (Pachiyannan et al., 2024) presents a healthcare technique, the Machine Learning-based Congenital Heart Disease Prediction Method (ML-CHDPM), designed to recognize and categorize CHD in pregnant women. The algorithm, trained on a large dataset, recognizes intricate patterns and correlations, leading to accurate forecasts and classifications. ML-CHDPM’s evaluation encompasses Receiver Operating Characteristic Curve (ROC) area, sensitivity, specificity, and accuracy, showcasing its superior performance across critical metrics: recall 96.25%, accuracy 94.28%, specificity 91.74%, with low False Positive Rate (FPR) 8.26% and False Negative Rate (FNR) 3.75%.

This article (Mohanty et al., 2024) focuses on the design, construction, and structural analysis of a passive optical Fiber Bragg Grating (FBG) sensor for obtaining real-time Heart Rate Variability (HRV) parameters, such as heart rate, the median variation of normal-to-normal (NN) intervals, the root mean square of successive differences, and the Percentage of Successive Normal-to-Normal (PSNN) intervals differing by more than 50 ms. Furthermore, an Internet of Things (IoT)-based architectural design and sophisticated signal processing methods are described. Five people, three male and two female, participated in an experimental investigation carried out in a laboratory. The study showed good performance, with an error rate of less than 10% when compared to a standard Heart Rate (HR) monitor. By detecting arrhythmia, coronary heart disease, aortic illnesses, and strokes, this intelligent system can significantly improve healthcare. Together, advanced technology, the IoT architecture, and FBG sensors have enormous potential to improve cardiac surveillance and patient outcomes.
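
For clarity, the HRV quantities named above can be computed from a series of normal-to-normal (NN) intervals as in the short sketch below; the interval values are made up for illustration, and the FBG acquisition itself is not shown.

```python
# Minimal sketch of common HRV quantities computed from NN intervals in milliseconds.
import numpy as np

nn_ms = np.array([812, 798, 805, 790, 820, 815, 801, 795, 830, 810], dtype=float)

heart_rate_bpm = 60000.0 / nn_ms.mean()        # mean heart rate from the mean NN interval
median_nn = np.median(nn_ms)                   # median of NN intervals
diffs = np.diff(nn_ms)                         # successive NN differences
rmssd = np.sqrt(np.mean(diffs ** 2))           # root mean square of successive differences
pnn50 = 100.0 * np.mean(np.abs(diffs) > 50)    # % of successive differences exceeding 50 ms

print(f"HR={heart_rate_bpm:.1f} bpm, median NN={median_nn:.0f} ms, "
      f"RMSSD={rmssd:.1f} ms, pNN50={pnn50:.1f}%")
```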

This study (Aljohani et al., 2023) introduces deep convolutional neural networks for categorizing common valve diseases and typical valve sounds into binary and multiclass categories. Three feature extraction methods, including Mel-Frequency Cepstral Coefficients (MFCC) and the Discrete Wavelet Transform (DWT), were explored. Both models achieved high precision, with F1 scores exceeding 98.2% and specificities surpassing 98.5%, indicating minimal misclassification of normal instances. These findings affirm the proposed model as a highly accurate tool for assisted diagnosis.

This research (Khan Mamun & Elfouly, 2023) presents a hybrid 1D-CNN that uses feature selection techniques and a sizable dataset amassed from online survey data. When contrasted with modern machine learning methods and Artificial Neural Networks (ANN), the 1D-CNN demonstrated superior accuracy. The accuracy for the CHD and non-coronary heart disease (no-CHD) validation data was 76.9% and 80.1%, respectively. The model was compared with Support Vector Machines (SVM), Random Forests (RF), AdaBoost, and ANN. In terms of accuracy, FNR, and FPR, the 1D-CNN performed better overall. Analyses of four other heart diseases using similar methodologies demonstrated that the hybrid 1D-CNN achieved higher accuracy.
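
The following is a hypothetical, simplified sketch of a 1D-CNN over selected tabular features, in the spirit of the hybrid model summarized above but not the authors’ architecture; the number of input features is an assumption.

```python
# Illustrative 1D-CNN: selected tabular features are treated as a 1-D sequence
# and passed through convolutional layers before a dense classifier head.
import tensorflow as tf
from tensorflow.keras import layers, models

n_features = 20                                   # assumed number of selected features
model = models.Sequential([
    layers.Input(shape=(n_features, 1)),          # each feature as one "time step"
    layers.Conv1D(32, kernel_size=3, activation="relu"),
    layers.Conv1D(64, kernel_size=3, activation="relu"),
    layers.GlobalAveragePooling1D(),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="sigmoid"),        # CHD vs. no-CHD
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```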

CardioXNet is a portable end-to-end Convolutional Recurrent Neural Network (CRNN) architecture that uses raw PCG signals to automatically detect five classes of cardiac auscultation (Chen et al., 2022). Results show that the proposed architecture outperforms previous state-of-the-art methods, achieving up to 99.60% accuracy, 99.56% precision, 99.52% recall, and 99.68% F1 score. It works particularly well for point-of-care CVD screening using memory-constrained mobile devices in low-resource settings.

The strategy put forward in this research (Liu et al., 2021) uses entropy and cross-entropy features together with a fusion of multi-channel heart sound recordings and multi-domain features. The data collection involved 36 participants, comprising 21 individuals with CAD and 15 without CAD. Each participant underwent simultaneous recording of five-channel heart sound signals for 5 minutes. Following segmentation and quality assessment, 553 samples remained in the CAD group, while 438 samples were retained in the non-CAD group. The optimal feature set was fed to an SVM for classification. According to the findings, multi-domain fusion improved classification accuracy from 78.75% to 86.70%, and after adding entropy and cross-entropy features it improved further to 90.92%. Entropy and cross-entropy features are therefore essential for multi-domain fusion of heart sound recordings and play a vital role in the identification of CAD.

The goal of this research (O’driscoll et al., 2022) was to determine how a machine learning platform could be created to support physician assessment and simplify stress echocardiogram analysis. A computerized image-analysis workflow was created to extract novel geometric and kinematic features from stress echocardiograms acquired during a large prospective, multicenter, multivendor study carried out in the United Kingdom. The extracted features were used to build an ensemble neural network classifier to recognize patients with significant coronary artery disease on noninvasive cardiac imaging. The model was then examined in a separate American study. A controlled split-read study examined how the availability of the AI classification would affect the clinical assessment of stress echocardiograms. Cross-fold validation using 31 distinct geometric and kinematic features produced a satisfactory classification rate for identifying individuals with significant disease in the initial data collection, with a specificity of 92.7% and a sensitivity of 84.4%. This accuracy was preserved in the separate validation dataset. Using the AI classification tool, physicians achieved an area under the receiver-operating characteristic curve of 0.93 while also improving inter-reader confidence, agreement, and specificity for recognizing disease by 10%.

The objective of this study (Schuuring et al., 2021), produced by the joint writing group established by the American Society of Echocardiography and the European Association of Cardiovascular Imaging, was to provide updated recommendations to the previously published standards for cardiac chamber quantification, in light of the rapid technological advances of the recent decade and the changes these advances have brought to echocardiographic practice. Drawing on considerably larger numbers of normal subjects gathered from various databases, the paper gives revised normal values for all four cardiac chambers, incorporating three-dimensional echocardiography and myocardial strain where applicable. Additionally, the paper aims to fix a few small inconsistencies in earlier recommendations. Information on arterial blood pressure, heart rate, hypertension assessment, cardiovascular disease medication, diabetes diagnosis, fasting glucose, creatinine concentrations, total cholesterol, low-density lipoprotein cholesterol, and triglycerides was gathered whenever practical. The Mosteller method was used to determine body surface area (BSA), and body mass index was computed by dividing the weight in kilograms by the square of the height in metres.

This paper (Xiao et al., 2020) introduces an innovative heart sound classification technique leveraging deep learning technologies for predicting cardiovascular diseases. The method consists of three main components: pre-processing, classification of 1-D waveform heart sound segments using a deep convolutional neural network (CNN) with an attention mechanism, and majority voting for the final prediction of heart sound recordings. To enhance information flow within the CNN, a block-stacked architecture with clique blocks is employed, featuring a bidirectional connection structure within each clique block. By integrating stacked cliques and transition blocks, the proposed CNN achieves both spatial and channel attention, resulting in notable classification performance. A novel separable convolution with an inverted bottleneck is utilized to efficiently decouple spatial and channel-wise feature relevancy. Experiments conducted on the PhysioNet/CinC 2016 dataset demonstrate that the proposed method achieves superior classification results and excels in parameter efficiency compared to state-of-the-art methods.
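
The recording-level majority vote described above can be sketched as follows; the per-segment predictions are dummy values used only to illustrate the voting step, not outputs of the cited model.

```python
# Sketch of a segment-level majority vote: each segment of a recording receives a
# class prediction, and the recording label is the most frequent segment label.
from collections import Counter

def recording_label(segment_predictions):
    """Return the majority class among per-segment predictions."""
    counts = Counter(segment_predictions)
    return counts.most_common(1)[0][0]

segment_preds = ["abnormal", "normal", "abnormal", "abnormal", "normal"]  # dummy CNN outputs
print(recording_label(segment_preds))  # -> "abnormal"
```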

This research (Pellikka, 2022) addressed the substantial challenge of stress echocardiography interpretation. It featured dobutamine and exercise studies, carried out with or without ultrasound image-enhancing agents and using a variety of ultrasound systems. For evaluation, endocardial visibility of at least 14 of 16 segments and an average of 4 images spanning end-diastole and end-systole were required in basal 4-chamber, 2-chamber, and parasternal short-axis midventricular views at rest and at stress. None of the individuals had undergone previous cardiac procedures, and all reached a target heart rate, rate-pressure product, or other endpoint. The model was then independently assessed on 154 stress echocardiograms from an earlier investigation. The AUROC was 0.927 using the same classification threshold, with a sensitivity of 84% and a specificity of 92.7%. When 38 individuals with established coronary artery disease (CAD) or abnormal resting wall motion were excluded in a subgroup analysis, sensitivity and specificity remained at 90.5% and 88.4%, respectively.

In this study (Yang et al., 2022), the authors developed a deep learning (DL) system that recognizes valvular heart diseases (VHDs) in echocardiographic videos. While advances in DL have been applied to interpreting echocardiograms, the use of these techniques to analyze colour Doppler recordings for diagnosing VHDs had not previously been documented. The researchers created a three-stage DL pipeline that categorizes echocardiographic views, recognizes the presence of VHDs, and measures key metrics associated with VHD severity in order to automatically screen echocardiographic videos for mitral stenosis (MS), mitral regurgitation (MR), aortic stenosis (AS), and aortic regurgitation (AR). Retrospective data from five medical centres were used to train (n = 1,335), validate (n = 311), and test (n = 434) the method. The real-world test set consisted of 1,374 consecutive echocardiograms that were retrospectively acquired. With areas under the curve for MS, MR, AS, and AR in the prospective test set of 0.99 (95% CI: 0.97-0.99), 0.88 (95% CI: 0.86-0.90), 0.97 (95% CI: 0.95-0.99), and 0.90 (95% CI: 0.88-0.92), respectively, disease diagnosis accuracy was good. The limits of agreement (LOA) between the DL method and physician estimates of valve lesion severity ranged from 0.60 to 0.77 cm2 vs. 0.44 to 0.44 cm2 for mitral valve area; from 0.27 to 0.25 vs. 0.23 to 0.08 for the MR jet area/left atrial area ratio; and from 0.86 to 0.52 m/s vs. 0.48 to 0.54 m/s.

The present paper explores the application of advanced machine learning approaches to echocardiography (echo), an exciting and actively investigated diagnostic modality. In this study (Wahlang et al., 2021), echo images are classified according to two distinct tasks. First, using 2D echo images, 3D Doppler images, and videographic images, classification into normal (absence of anomalies) or abnormal (presence of anomalies) is performed. Second, using videographic echo images, the distinct types of valvular regurgitation—namely, mitral regurgitation (MR), aortic regurgitation (AR), tricuspid regurgitation (TR), and a mixture of the three—are classified. Long Short-Term Memory (LSTM), which relies on recurrent neural networks (RNN), and the Variational AutoEncoder (VAE), which is based on the AutoEncoder, are the two deep-learning approaches used for these tasks. The use of videographic images sets this study apart from earlier SVM (Support Vector Machine)-based research, and it is among the first deep-learning applications in this field. In the classification of normal versus abnormal cases, deep-learning methods were found to outperform SVM approaches. Overall, VAE outperforms LSTM for static 2D and 3D Doppler images, whereas LSTM outperforms VAE for videographic data.

Aortic stenosis (AS) is frequently caused by congenital valve abnormalities, degenerative calcified valves, and rheumatic disease (Wahlang et al., 2020). For severe AS, aortic valve replacement is critical. Valve dimensions and areas in 2D ultrasound images have traditionally been identified manually before valve replacement surgery to assess the degree of stenosis and provide sufficient data for sizing prosthetic valves. However, because the position of the aortic valve varies dynamically in vivo, a static 2D visual evaluation is not only subjective but also provides measurements from only one or two frames of the entire cardiac cycle. For rapid and automatic tracking of the aortic valve and the adjoining outflow tract using online 3D-TEE, a few investigations have developed computerized tracking techniques based on structural and optical-flow information. This offers up-to-date, precise support for pre-procedural aortic valve replacement planning, assisting in improving the precision of valve assessment and boosting practitioner confidence.

This article (Fatima et al., 2020) describes the authors’ initial experience using Auto Valve Analysis, a novel artificial intelligence (AI)-based semi-automated tricuspid valve analysis program from Siemens Healthcare in Mountain View, California. By reducing the effort needed to examine valve structures, customized AI-based programs with live visualization and automated verification speed up medical decision-making and ensure strong consistency with little user involvement. With the implementation of tricuspid valve (TV) research, this approach can help close gaps in the variables used to predict TV function in clinical and scientific settings. Additionally, these capabilities can enhance diagnostic and predictive classification for surgical and medical interventions when combined with interventional planning.

A growing number of clinicians in a range of medical settings employ point-of-care echocardiography to quickly diagnose significant cardiac disease at the bedside (Kirkpatrick et al., 2020). Echocardiographers and ultrasound technicians may need to assist in training professionals in cardiovascular ultrasonography whose backgrounds lie in fields unrelated to heart disease. Such training may face difficulties or opportunities depending on the learners’ backgrounds, requirements, goals, and available time. Materials are needed, in addition to properly directed and organized use of resources, in order to participate in cardiac echocardiography training. In particular, educational initiatives benefit most from unrestricted institutional/departmental support, extensive academic expertise, committed academic time and money, computer technology assistance, and clinic- or hospital-wide collaboration.

The present research tested the hypothesis that, compared with cardiologists, sonographers, and resident readers, a deep convolutional neural network (DCNN) could better detect regional wall motion abnormalities (RWMAs) and distinguish between groups of coronary injury territories from conventional 2-dimensional echocardiographic images (Kusunose et al., 2020). A total of 300 patients with a diagnosis of myocardial infarction were included. Three separate sets of 100 individuals each from this cohort had infarctions in the right coronary artery (RCA), left circumflex (LCX) branch, and left anterior descending (LAD) artery territories. From a records database, 100 age-matched control individuals with normal wall motion were chosen. Echocardiographic images from short-axis views at the end-diastolic, mid-systolic, and end-systolic phases were included in each case. Diagnostic accuracies were calculated from the test set after the DCNN underwent 100 steps of retraining. The same model was trained independently in ten different iterations, and ensemble estimates were generated from those iterations. The area under the receiver-operating characteristic curve (AUC) achieved by the deep learning algorithm for detecting the presence of WMAs was comparable to that of the cardiologist and sonographer readers (0.99 vs. 0.98, respectively; p = 0.15) and significantly higher than the AUC of the resident readers (0.99 vs. 0.90, respectively; p = 0.002). The deep learning algorithm’s AUC for detecting WMA territories was greater than that of the resident readers (0.97 vs. 0.83, respectively; p = 0.003) but equivalent to that of the cardiologist and sonographer readers (0.97 vs. 0.95, respectively; p = 0.61). The deep learning algorithm’s AUC in the validation group at a separate site (n = 40) was 0.90.

This study (Davis et al., 2020) notes that echocardiography is only one of the areas of medical care where AI has found a home. Various fields of cardiac ultrasound, imaging, testing, and diagnostics are currently affected by AI. The prospect that AI will enhance sonographers’ work and lessen the variation that exists in echocardiographic interpretation persists despite reservations among ultrasound technicians and echocardiographers. It is crucial to continue using analytical techniques and to recognize that the proposed union between computers and humans will succeed only when properly built AI and knowledgeable individuals are combined. As multidimensional echocardiography becomes more common, specially designed devices using supervised algorithms may be able to discern when structures are visible before capturing them directly, yielding a more informative sample.

The purpose of this research (Genovese et al., 2019) was to evaluate the precision and repeatability of a novel, completely automated, machine learning (ML)-based technology for three-dimensional assessment of RV size and function. On the same day, a transthoracic 3DE examination was performed on 56 unselected individuals who had been referred for clinically indicated cardiac magnetic resonance (CMR) imaging and who had a wide range of RV dimensions, function, and image quality. The ML-based method was used to assess the end-systolic and end-diastolic RV volumes (ESV, EDV) and the ejection fraction (EF), which were then compared to CMR reference values using Bland-Altman and linear regression analyses. RV quantification by echocardiography was feasible in all cases. The automated method had an analysis time of 15 ± 1 seconds, was 100% reproducible, and required no correction in 32% of cases. After automatic post-processing, endocardial contour editing was required in the remaining 68% of patients, increasing analysis time to 114 ± 71 seconds. With these small corrections, the measures of RV volumes and EF were accurate when compared with the CMR reference (biases: EDV, −25.6 ± 21.1 mL; ESV, −7.4 ± 16 mL; EF, −3.3% ± 5.2%) and demonstrated outstanding consistency, as evidenced by coefficients of variation of 7% and intraclass correlations of 0.95 for all measurements.

As this study (Kusunose et al., 2019) notes, echocardiography is crucial to the identification and treatment of heart disease, and a precise and trustworthy echocardiographic examination is necessary for medical judgment. Even as novel methods (3-dimensional echocardiography, speckle-tracking, semi-automated analysis, etc.) are being developed, operators’ expertise still plays a significant role in the final interpretation. Unresolved diagnostic errors are a significant issue. Furthermore, when readings are repeated, the same observer may reach a different conclusion, because cardiac specialists can disagree with one another on how to interpret images. The high daily workload in clinical practice can contribute to this inaccuracy, so all cardiac specialists need an accurate perception in this area. Though the necessary enormous databases and “black box” approach raise a number of questions, AI can deliver acceptable outcomes in this area. Cardiologists will eventually need to modify their standard operating procedures to incorporate AI in the current phase of cardiology.

A deep learning neural network is a subset of ANNs, which are themselves a subset of artificial intelligence (Madani et al., 2018). The field of artificial intelligence has applications across many fields of research, technology, and even everyday life. This article reviews the function and present-day uses of neural-network-based studies for cardiology assessment as well as their drawbacks and difficulties. The numerous cardiac illnesses, their functional consequences, and even their appearance are all assessed via echo. Decision-making is time-consuming, costly, and requires specialist expertise because the imaging must be analyzed and interpreted. The use of computerized systems for cardiovascular imaging has significantly changed medical practice by finding anomalies in cardiac muscle motion and function that aid in determining heart disease. Deep learning is employed to analyze images and is currently being used to solve diagnostic problems. It is also highly helpful for doctors seeking to improve their care of patients. In contrast to statistics-based methods, an extensive set of images is needed to develop an algorithm for a particular problem. Machine learning is used to find and establish complex patterns and their relationships in images when applied to large databases. Using the extensive dataset, the machine learns and detects the required structures in the image. Despite the lack of widespread acceptance of computerized devices in medical research, this approach benefits academics as well as physicians.

The purpose of this work (Nath et al., 2016) was to investigate the viability and dependability of large-scale, focused NLP retrieval of numerous data items from cardiac reports. The machine-learning extraction of information about cardiovascular anatomy and function from differently structured echocardiographic records was made possible by the development of the NLP tool EchoInfer. Accessible echocardiogram reports from 2004 to 2013 from three independent ongoing medical research projects were subjected to EchoInfer analysis. 15,116 echocardiogram records from 1,684 individuals were evaluated by EchoInfer, and 59 quantitative and 21 qualitative data items were collected from each report. With regard to all 80 data items in 50 reports, EchoInfer attained a precision of 94.06%, a recall of 92.21%, and an F1-score of 93.12%. The 15,116 reports for this investigation included 10,590 dot echocardiographic reports, 861 stress echocardiographic reports, 3,456 transesophageal echocardiographic reports, and 1,050 transthoracic echocardiographic reports. EchoInfer assessed 9,444 reports from patients with various indications and no history of valvular surgery, 3,725 reports from individuals with a history of aortic valve replacement (AVR), 828 reports from individuals with a history of mitral valve (MV) replacement, 441 reports from individuals with a history of mitral valve repair, and 677 reports from individuals with a history of combined AVR and MV replacement or repair. Overall, EchoInfer attained an F1-score of 93.12%, a precision (positive predictive value) of 94.06%, and a recall (sensitivity) of 92.21%.

The development of three-dimensional (3-D) real-time echocardiography over the past few decades has made it both possible and important for clinicians to automatically create patient-specific geometric models (Bersvendsen et al., 2017). Although the right ventricle (RV) is increasingly recognized as having a role in heart failure, a large number of the echocardiographic segmentation methods described in the scientific literature concentrate on the left ventricle’s (LV) endocardial border. The authors outline a technique for coupled segmentation of the LV and RV endocardial and epicardial boundaries in 3-D ultrasound images. They propose extending an efficient state-estimation segmentation framework with a representation of coupled surfaces in order to address the segmentation problem, and they also propose adding a cardiac incompressibility constraint to the system in order to regularize the segmentation. In images from 16 patients, the technique was evaluated against manual measurements and segmentations. For the LV endocardial, RV endocardial, and LV epicardial surfaces, mean absolute distances between the proposed and reference segmentations were 2.8 ± 0.4 mm, 3.2 ± 0.7 mm, and 3.1 ± 0.5 mm, respectively. The approach was computationally efficient, taking only 2.1 ± 0.4 s.

In this work (Balaji et al., 2015), a fully automated classification of echocardiographic heart views is proposed. The methodology relies on machine learning to distinguish between the characteristic features of different views. The parasternal short-axis (PSAX), parasternal long-axis (PLAX), apical two-chamber (A2C), and apical four-chamber (A4C) views are the four traditional perspectives addressed in this framework. Because of noise, analyzing echocardiographic images is challenging: the images contain salt-and-pepper noise, which complicates the classification procedure. Initially, median filtering is used to remove this distortion from the source echocardiographic image. The labels and triangular markers visible at the borders of the echocardiographic images are further artefacts. Experiments involving two hundred echocardiographic images demonstrate that the proposed approach, with a precision of 87.5%, can be used to classify heart views efficiently.

This work (Balaji et al., 2014) proposed an effective classification of ventricular echocardiographic images. A cardiac cycle is made up of systolic and diastolic phases: diastole is the phase of relaxation and filling, whereas systole is the phase of contraction. Only the relevant frames from the supplied video sequence were extracted and used to establish the echocardiographic images. First, distortion was removed from the echocardiographic image while brightness was enhanced. Mathematical morphology was then applied to highlight the cardiac cavity prior to segmentation, and the data were decomposed using connected-component labelling (CCL). Three common heart views were categorized: the parasternal short-axis (PSAX), apical two-chamber (A2C), and apical four-chamber (A4C) views. Tests on over 200 echo images achieved an accuracy rate of 94.56%.

This study (Volpato et al., 2019) shows that, compared with well-established standard methods, automated 3DE analysis of left ventricular (LV) mass using a novel ML algorithm yields repeatable and accurate measurements. Twenty-three individuals who underwent 3DE (Philips EPIQ) and CMR scanning on the same day were prospectively evaluated. Wide-angle 3D single-beat datasets of the left ventricle were collected. The recently released automated program (Philips HeartModel), along with traditional volumetric measurements (TomTec), was used to measure LV mass. CMR analysis was carried out by manually identifying the LV endocardial and epicardial borders. Repeated measurements were used to determine the repeatability of the ML technique and to quantify it using intra-class correlation (ICC) and coefficients of variation (CoV). In 20 patients (87%), automated LV mass evaluation proved feasible. The findings were comparable to those obtained from CMR (Bland-Altman bias 5 g, limits of agreement 37 g), as well as to findings obtained from a traditional 3DE study (bias 7 g, limits of agreement 27 g). While manual modifications were made in the majority of patients, analysis time was significantly shorter (1.02 ± 0.24 minutes vs. 2.20 ± 0.13 minutes for CMR and 2.36 ± 0.09 minutes for TomTec). Repeated measurements revealed excellent reproducibility: an ICC of 0.99 and a CoV of 4 ± 5%.

The purpose of this work was to more accurately forecast survival following echocardiography using machine learning (Samad et al., 2019). A total of 171,510 randomly chosen individuals from an extensive regional medical system, who underwent 331,317 echocardiograms, were evaluated for mortality. Using three distinct feature sets, the researchers compared the predictive ability of nonlinear machine learning models with that of conventional logistic regression models. Sex, age, height, weight, heart rate, arterial pressure, low-density lipoprotein, high-density lipoprotein, cigarette smoking, and 90 cardiovascular-relevant International Classification of Diseases, Tenth Revision (ICD-10) codes are among the clinical factors; other factors include physician-reported EF and 57 additional echocardiographic measurements. Missing data were imputed using multiple imputation by chained equations (MICE). The researchers used the mean area under the curve (AUC) over 10 cross-validation folds to compare the predictions with one another and with basic clinical scoring systems. Just ten variables, six of which were derived from echocardiography, were required to reach 96% of the maximum prediction accuracy. Compared with LVEF, tricuspid regurgitation velocity was a better predictor of mortality. In a set of trials that included complete data for the top ten variables, imputation with chained equations produced only slightly lower prediction accuracy (an AUC difference of 0.003) than the original data.
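
As an illustration of the chained-equations imputation and cross-validated AUC evaluation mentioned above (not the cited authors’ code), the sketch below uses scikit-learn’s IterativeImputer, a MICE-style imputer, inside a pipeline on synthetic data.

```python
# Hedged sketch: MICE-style imputation followed by logistic regression,
# scored by mean AUC over 10 cross-validation folds on synthetic data.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables the import below)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 10))                              # 10 clinical/echo-style features
y = (X[:, 0] - X[:, 3] + rng.normal(scale=0.5, size=400) > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan                       # knock out ~10% of values

pipe = make_pipeline(IterativeImputer(max_iter=10, random_state=0),
                     LogisticRegression(max_iter=1000))
auc = cross_val_score(pipe, X, y, cv=10, scoring="roc_auc")
print(f"mean AUC over 10 folds: {auc.mean():.3f}")
```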

The paper (Baumgartner et al., 2017) concentrates especially on improving the assessment of aortic valve stenosis, which includes low-gradient aortic stenosis with preserved ejection fraction, a new categorization of aortic stenosis by gradient, flow, and ejection fraction, and a combined, stepwise approach to assessing the severity of the valve narrowing in clinical settings. It is crucial to employ identical techniques for measuring both the aortic valve area (AVA) and changes in velocity/gradient in order to prevent apparent shifts that merely reflect methodology. As an illustration, comparing velocities acquired from the right parasternal window with earlier measurements taken from another acoustic window can result in an apparent rise in peak velocity of 0.3 m/s that may prompt surgery to be performed. When flow has dropped concurrently, velocity and gradient may stay stable or even decline as aortic stenosis progresses.

Table 2.1: State-of-the-art table for predicting coronary cardiovascular disease

References Methodology Dataset Evaluation Measures Limitations
(Pachiyannan et al., 2024) Machine Learning-based Congenital Heart Disease Prediction Method (ML-CHDPM) A large dataset of pregnant women ROC curve area, sensitivity, specificity, accuracy; recall: 96.25%, accuracy: 94.28% Potential bias in the dataset, limited to pregnant women, high computation power required
(Mohanty et al., 2024) Design and analysis of passive optical FBG sensor for HRV parameters; IoT-based architectural design Experimental investigation with 5 people Error rate < 10% compared to standard HR monitor Small sample sizes and experimental settings might not reflect real-world variability.
(Aljohani et al., 2023) Deep convolutional neural networks for valve diseases classification; MFCC, DWT feature extraction Dataset for valve sounds (not specified) Precision with F1 scores > 98.2%, specificities > 98.5% Dataset details not provided, performance might vary with different datasets
(Khan Mamun & Elfouly, 2023) Hybrid 1D-CNN for CHD detection using feature selection techniques Large dataset from online surveys Accuracy: 76.9% for CHD, 80.1% for no-CHD; compared with SVM, RF, AdaBoost, ANN Limited to survey data, performance might vary with clinical data, with relatively moderate accuracy.
(O’driscoll et al., 2022) Neural network decoder for strain echocardiograms; cross-fold validation with geometrical and kinematic factors Multicenter, multivendor strain echocardiograms Sensitivity (84.4%), Specificity (92.7%), AUROC (0.93) Limited to specific strain echocardiogram data; applicability to other datasets not explored
(Pellikka, 2022) Dobutamine stress echocardiograms with ultrasound enhancements; AUROC, sensitivity, and specificity calculations Stress echocardiograms AUROC (0.927), Sensitivity (90.5%), Specificity (88.4%) Limited to stress echocardiogram context; impact on routine clinical settings not assessed
(Yang et al., 2022) DL system for VHD recognition in echocardiograms; three-stage DL structure for disease detection and metric measurement Retrospective data from five medical centers Disease diagnosis precision for MS, MR, AS, AR Performance metrics specific to the DL approach; dataset bias and variability among centers could affect results
(Abbas et al., 2022) Attention-based Convolutional Vision Transformer (CVT-Trans) using CWTS Dataset for PCG signals (not specified) Accuracy: 100%, sensitivity: 99.00%, specificity: 99.5%, F1-score: 98% Dataset details not provided, high computational requirements
(Li et al., 2021) Lightweight neural network for heart sound categorization using time-frequency properties Heart sound data (not specified) Accuracy: 95.00%, memory size: 1.36 MB Dataset details not provided, limited to time-frequency properties, need for optimization for different devices.
(Schuuring et al., 2021) Revised guidelines for cardiac chamber measurement; inclusion of diverse normal populations; data on cardiovascular parameters Various databases Updated normal values, Methodological consistency Small inconsistencies not fully resolved; variability in data sources might affect generalizability
(Kusunose et al., 2020) DCNN for detecting regional wall motion abnormalities in cardiac ultrasound; comparison with expert readers Various coronary injury groups AUC comparison with cardiologists and sonographers Generalizability to other imaging modalities and clinical contexts not discussed
(Davis et al., 2020) AI impact on various cardiac ultrasound fields; potential for reducing variability in echocardiograms Multidimensional echocardiogram datasets Future AI applications in ultrasound, Reduction of variation Speculative impact; actual integration and acceptance in clinical practice not evaluated

Research gap

Although there are still certain research gaps, machine learning has made great strides in the prediction of coronary cardiovascular disease. The interpretability of sophisticated models, the integration and standardization of various data sources, and the models’ applicability to various clinical contexts or demographic groups are a few of these. Studies currently in existence frequently concentrate on particular populations, which raises questions about bias and fairness. Another difficulty is integrating these models into standard clinical workflows. Additional research is required to address the ethical issues and data privacy concerns related to using large amounts of patient data for machine learning.

Methodology

Introduction

This chapter uses machine learning models to analyze the task of predicting cardiovascular disease. It describes a dataset that contains different classes. The chapter covers both traditional and advanced machine learning models, including the Random Forest Classifier, Decision Tree, XGBoost Classifier, and K-Nearest Neighbors (KNN). It also proposes an Artificial Neural Network (ANN) model to capture the nonlinear relationships among the features used to detect cardiovascular disease.

Dataset

A comprehensive collection of data essential for using machine learning techniques to predict coronary cardiovascular disease (CVD) is the “Rates_and_Trends_in_Coronary_Heart_Disease” dataset. This dataset includes several dimensions that are necessary for a thorough analysis, including location, year, geography, and classes. It includes several classes, such as cardiovascular diseases, stroke, and coronary heart disease that offer a comprehensive understanding of cardiovascular health. Researchers can examine patterns and trends across a range of diseases thanks to the insights provided by each class, which covers various facets and kinds of cardiovascular conditions. The geographic and location characteristics in the dataset aid in capturing the regional variations in CVD prevalence and trends, offering important context regarding the ways in which local and environmental factors influence the risk of heart disease. Through temporal analysis made possible by the year attribute, trends and shifts in CVD rates over time can be identified and linked to modifications in population behaviours, healthcare, and policy, as shown in Figure 3.1. A comprehensive approach to cardiovascular health is made possible by this dataset’s inclusion of a variety of topics, including stroke and coronary disease. This makes it easier to develop machine learning models that can distinguish between related conditions and specifically predict coronary heart disease. The dataset facilitates the application of diverse machine learning techniques, including supervised and unsupervised learning models, by integrating a broad range of features and labels. This allows for the comprehensive assessment and prediction of risks.

Figure 3.1: Dataset sample

Preprocessing

When utilizing machine learning techniques to predict CVD, preprocessing is an essential step because it directly affects the predictive models’ performance and quality. Comprehensive preprocessing is necessary to guarantee that the data is clear, consistent, and prepared for analysis given the complexity and heterogeneity of the data involved, which range from clinical records and demographic data to imaging data and lifestyle factors. Preprocessing usually entails several important procedures, such as feature selection, dimensionality reduction, data transformation, data cleaning, data normalization, and handling of missing values.

Data cleansing

Eliminating duplicate entries, fixing inconsistencies, and handling outliers that might interfere with the model’s learning process are all part of the data-cleaning process. In medical datasets, where gaps in data may arise from incomplete patient records or data entry errors, handling missing values is especially crucial. To address missing values without significantly increasing bias, strategies like mean/mode imputation, forward filling, or sophisticated techniques like KNN imputation are employed.

Handling missing values: Identify and manage missing data. Common techniques include imputation (e.g., filling missing values with mean, median, or mode) or removing rows/columns with excessive missing values.

Removing Duplicates: Ensure there are no duplicate entries in the dataset to maintain data integrity.
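As a concrete illustration (not the exact pipeline used in this study), the short Python sketch below shows how duplicates and missing values might be handled with pandas and scikit-learn; the CSV file name and the 40% missingness threshold are assumptions made purely for the example.

import pandas as pd

df = pd.read_csv("Rates_and_Trends_in_Coronary_Heart_Disease.csv")  # assumed file name
df = df.drop_duplicates()                              # remove duplicate records
df = df.dropna(axis=1, thresh=int(0.6 * len(df)))      # drop columns that are mostly missing (illustrative threshold)
numeric_cols = df.select_dtypes(include="number").columns
categorical_cols = df.select_dtypes(exclude="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())   # median imputation for numeric features
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode().iloc[0])                    # mode imputation for categorical features
# Alternatively, KNN imputation can be applied to the numeric features:
# from sklearn.impute import KNNImputer
# df[numeric_cols] = KNNImputer(n_neighbors=5).fit_transform(df[numeric_cols])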

Encoding categorical variables

Convert categorical variables into a numerical format using techniques such as one-hot encoding or label encoding. Depending on the models being used and the characteristics of the categorical variables, different encoding techniques are applied, such as Label Encoding, One-Hot Encoding, and Target Encoding. This stage makes sure that all the data is in a numerical format that can be used by machine learning algorithms.
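The fragment below sketches both encodings on a toy frame; the column names are hypothetical stand-ins for the dataset's categorical fields and target class.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "location": ["North", "South", "North", "East"],             # hypothetical nominal feature
    "smoking_status": ["never", "current", "former", "never"],
    "cvd_class": ["Normal", "Predicting CVD", "Normal", "Predicting CVD"],
})
df = pd.get_dummies(df, columns=["location", "smoking_status"], drop_first=True)  # one-hot encoding
df["cvd_class"] = LabelEncoder().fit_transform(df["cvd_class"])                   # label-encode the binary target
print(df.head())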

Normalization/Standardization

This method rescales the numerical features to fall within a fixed range such as [0, 1] or [-1, 1]. Normalizing variables such as age, blood pressure, and cholesterol, for instance, guarantees that they are on a comparable scale and keeps any one feature from unduly influencing the model. Normalization is especially helpful when the data do not follow a Gaussian distribution.
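Both rescaling options can be expressed in a few lines with scikit-learn; the feature values below are made up purely for illustration.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[63.0, 145.0, 233.0],     # hypothetical rows: age, systolic BP, cholesterol
              [41.0, 120.0, 180.0],
              [58.0, 160.0, 286.0]])
X_minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)   # normalization to [0, 1]
X_standard = StandardScaler().fit_transform(X)                   # standardization to zero mean, unit variance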

Feature engineering

When utilizing machine learning techniques to predict coronary CVD, feature engineering plays a crucial role in the preprocessing stage. To increase the predictive capacity of machine learning models, new features are added or current ones are changed. Because CVD is a complex disease with many risk factors, including lifestyle choices, genetic markers, clinical measurements, demographic information, and imaging data, effective feature engineering is essential for identifying the underlying patterns and relationships that influence disease risk. Finding important predictors that have a strong correlation with CVD outcomes is the first step in the feature engineering process when it comes to CVD prediction. This could involve simple transformations like combining characteristics like blood pressure, cholesterol, and glucose levels to create composite risk scores, or it could involve calculating the body mass index (BMI) from weight and height. To capture more complex risk profiles, more sophisticated approaches might include combining genetic markers with family history or developing interaction terms between characteristics, like age and smoking status.
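A minimal sketch of such derived features is shown below; the column names and the weights in the composite score are illustrative assumptions, not the features engineered in this study.

import pandas as pd

df = pd.DataFrame({
    "weight_kg": [85.0, 62.0, 95.0], "height_m": [1.75, 1.62, 1.80],
    "systolic_bp": [150, 118, 165], "total_cholesterol": [240, 185, 260],
    "fasting_glucose": [110, 92, 130], "age": [61, 45, 70], "smoker": [1, 0, 1],
})
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2                 # BMI from weight and height
df["risk_score"] = (df["systolic_bp"] / 120                       # simple composite risk score (illustrative weights)
                    + df["total_cholesterol"] / 200
                    + df["fasting_glucose"] / 100)
df["age_x_smoker"] = df["age"] * df["smoker"]                     # interaction term: age x smoking status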

Data splitting

Using an 80-20 data splitting strategy, machine learning techniques are applied to the prediction of CVD. Eighty per cent of the data forms the training set, which the models use to learn patterns and correlations among the different risk factors, while the remaining 20% is held out for testing. The large variety of data points in the 80% training split improves predictive accuracy, as shown in Figure 3.2. After the model has been fully trained and optimized, its performance on unseen data is evaluated on the 20% testing set. Evaluation is based on key performance metrics such as accuracy, precision, recall, F1-score, and the ROC curve.


Figure 3.2: Data splitting
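A typical way to realise the split described above, assuming a preprocessed feature matrix X and label vector y, is the stratified hold-out sketched below (synthetic data are used so the snippet runs on its own).

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 8))          # synthetic stand-in for the preprocessed features
y = rng.integers(0, 2, size=1000)       # synthetic binary CVD labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)   # 80% training, 20% testing
print(X_train.shape, X_test.shape)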

Machine learning models

This section explains the process for predicting coronary CVD using a variety of machine-learning models. The machine learning models include the Random Forest Classifier, Decision Tree, XGBoost Classifier, and KNN, along with the proposed ANN model.

Random forest classifier

Since the Random Forest classifier can handle complex, non-linear relationships and high-dimensional datasets well, it is a powerful ensemble learning technique that is frequently used in the prediction of coronary CVD. In order to arrive at a final, more reliable result, it first builds a “forest” of several decision trees during the training phase. A random subset of features and a random subset of data points are used to build each decision tree in the random forest (with replacement, known as bootstrapping). This randomness lowers the likelihood of overfitting, a common issue in machine learning, by ensuring that the model is not unduly sensitive to particular features or data points. The Random Forest classifier can handle a wide range of predictor variables, including age, gender, blood pressure, cholesterol, smoking status, diabetes, family history, and other clinical measurements, in the context of CVD prediction. Different patterns and relationships within the data will be discovered by each decision tree in the forest. One tree may concentrate on the effects of age and smoking status, while another may discover how certain combinations of high blood pressure and cholesterol raise the risk of coronary heart disease. Random Forest accurately predicts CVD risk by capturing a wide range of patterns and interactions among the features, which are often critical for multi-tree training.

From the original training dataset D with n instances, generate B bootstrap samples D_b (b = 1, 2, ..., B) by sampling n instances with replacement. Each tree is grown on one bootstrap sample, and at every node only a random subset of m features is considered; the best split among these m features is chosen based on a splitting criterion, such as Gini impurity or entropy.

Feature importance scores, which measure each feature’s contribution to the model’s predictions, are provided by Random Forest.
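A hedged scikit-learn sketch of such a forest, trained on synthetic data rather than the study's dataset and with illustrative hyperparameters, is given below; the feature-importance vector mentioned above is available after fitting.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
rf = RandomForestClassifier(
    n_estimators=200,      # number of bootstrap-sampled trees B
    max_features="sqrt",   # size m of the random feature subset tried at each split
    criterion="gini",      # splitting criterion (Gini impurity)
    random_state=0)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))
print("Feature importances:", rf.feature_importances_)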

Decision tree

Using subsets of a dataset based on input features, the Decision Tree machine learning algorithm predicts coronary CVD. This forms a structure resembling a tree, where each node denotes a choice made in response to a particular feature, and each branch denotes a result. The final classification or prediction result is represented by the end nodes, also referred to as leaves.

The decision tree determines which critical characteristics such as age, blood pressure, cholesterol, or smoking status best differentiate the target classes in CVD prediction. The metrics used to determine this choice are information gain and Gini impurity, which gauge the split’s purity. For instance, the tree may divide the dataset according to other characteristics like blood pressure, diabetes status, BMI, and family history of CVD if a person’s cholesterol level exceeds a predetermined threshold.

To choose the best feature for data splitting, the decision tree applies a splitting criterion. Information Gain (IG), which is predicated on the information theory concept of entropy, is one popular criterion. The degree of uncertainty or impurity in a dataset is measured by entropy.
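For reference, the usual definitions of entropy and information gain (a sketch in standard notation, since the numbered equations are not reproduced here) are:

$H(D) = -\sum_{c=1}^{C} p_c \log_2 p_c$

$IG(D, A) = H(D) - \sum_{v \in \mathrm{values}(A)} \frac{|D_v|}{|D|} H(D_v)$

where $p_c$ is the proportion of instances in D belonging to class c, and $D_v$ is the subset of D for which attribute A takes the value v; the tree greedily chooses the split with the largest information gain (or, equivalently, the largest reduction in Gini impurity).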

XG Boost classifier

A scalable and highly effective machine-learning method for predicting coronary CVD is the XGBoost classifier. It is a more complex variation of gradient boosting that gradually assembles a collection of weak learners, usually decision trees. The main idea of XGBoost is to minimize errors through an iterative process of prediction optimization, where each new tree corrects the mistakes made by the previous trees. Because of this, XGBoost is especially good at identifying minute patterns and interactions between different CVD risk factors. When predicting CVD, XGBoost first initializes a basic model that forecasts the mean result for the training set. After that, it continuously adds new decision trees to the ensemble with the goal of estimating residual errors of the total predictions made by all of the earlier trees. This aids in lowering errors and improving the model’s forecasts. The L1 (Lasso) and L2 (Ridge) regularization methods in XGBoost assist in preventing overfitting in intricate medical datasets with high-dimensional features. This guarantees the model’s strong generalization to new data and its high predictive accuracy in practical scenarios. To effectively compute optimal splits and handle missing values, XGBoost also employs a weighted quantile sketch algorithm. This feature makes XGBoost a useful tool for early CVD risk assessment in clinical settings.

Determine the loss function's first and second derivatives (gradients and Hessians) with respect to the predictions. For instance, as in equation (3.4), the gradient and Hessian for logistic regression in binary classification are calculated as follows:

Gradient ($g_i$): for the binary cross-entropy (logistic) loss, $g_i = p_i - y_i$, where $p_i$ is the predicted probability for instance $i$ and $y_i$ is its true label.

The loss function’s first derivative with respect to the predicted value is represented by the gradient. It indicates the direction in which the model’s prediction needs to be adjusted in order to minimize the loss and measures the rate at which the loss function changes in relation to the prediction.

Hessian ($h_i$): for the same loss, $h_i = p_i (1 - p_i)$.

The loss function's second derivative with respect to the predicted value is represented by the Hessian. It measures the curvature of the loss with respect to the prediction, indicating how rapidly the gradient itself changes as the prediction is adjusted; XGBoost uses both quantities to compute optimal splits and leaf weights.
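A minimal XGBoost sketch with the binary logistic objective and L1/L2 regularization is shown below; the hyperparameter values are illustrative assumptions rather than the settings used in this study, and synthetic data stand in for the CVD features.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=2000, n_features=10, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = XGBClassifier(
    objective="binary:logistic",   # uses the gradients and Hessians defined above
    n_estimators=300, learning_rate=0.1, max_depth=4,
    reg_alpha=0.1,                 # L1 (Lasso) regularization
    reg_lambda=1.0,                # L2 (Ridge) regularization
    eval_metric="logloss")
model.fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))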

K-Nearest Neighbor classifier

To predict coronary CVD, machine learning techniques such as the K-Nearest Neighbors (KNN) algorithm are straightforward and efficient. It is a lazy learning algorithm that uses the training dataset directly to inform predictions rather than going through a separate training phase. KNN uses a user-defined parameter k to determine how similar a new data point is to the k closest data points in the training dataset. Using distance metrics like Euclidean, Manhattan, or Minkowski distances, KNN determines the distance between each new patient’s data point and every other data point in the training set for CVD prediction. Age, cholesterol, blood pressure, heart rate, smoking status, diabetes, family history, and other clinical measurements are among the characteristics that go into determining this distance. After determining the k neighbours that are closest to the new data point, KNN uses a majority voting system to predict the likelihood that the new patient will have CVD. Because it is based on the results of comparable cases, KNN’s majority voting system lends itself to ease of interpretation.

The process involves identifying the k data points (neighbours) that are closest to a given data point, usually by using distance metrics such as the Euclidean distance. In n-dimensional space, the Euclidean distance between two points $x$ and $y$ is defined as follows in equation (3.5): $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$.

The k-NN algorithm is then fed with the extracted feature vector $f$. Based on the distances between the feature vectors, the k-NN classifier finds the k nearest neighbours, as shown in equation (3.6). The target data point $f_t$ is classified according to the majority vote of the risk labels of its k nearest neighbours.
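The corresponding scikit-learn sketch, again on synthetic stand-in data and with an assumed k of 5, is:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=8, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_train)                 # distance metrics are scale-sensitive
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")   # majority vote over the 5 nearest neighbours
knn.fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))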

Proposed Artificial Neural Network (ANN)

An innovative technique for forecasting coronary CVD is the Artificial Neural Network (ANN). ANNs, which are modelled after the human brain, are made up of linked layers of neurons that process input data and produce predictions. An input layer, several hidden layers, and an output layer make up an ANN. A collection of cardiovascular risk-related features, including age, blood pressure, heart rate, cholesterol, BMI, diabetes, smoking status, family history of heart disease, and other clinical or demographic factors, are sent to the input layer. The hidden layers are made up of several neurons that apply an activation function to introduce non-linearity and perform weighted computations on the inputs. During training, an optimization algorithm is used to iteratively adjust the weights of these connections, which are initially set randomly. Backpropagation is a technique used by the ANN during training to adjust the parameters and update the weights across several epochs, as shown in Figure 3.3. For binary classification tasks, the output layer typically employs a sigmoid activation function to generate the final prediction. A probability score indicating the patient’s chance of developing CVD is the output.

The model uses Keras to predict coronary CVD and is a Sequential ANN. Based on the input features, it is intended for binary classification, where the output is either the presence or absence of CVD. The model is composed of several layers: activation functions, dropout layers for regularization, dense layers, and an output layer set up for binary classification. The binary cross-entropy loss function, accuracy as the performance metric, and a learning rate of 0.001 are used to compile the model using the RMSprop optimizer. The number of features in the training data is matched by the input layer, which is followed by a dense layer with 1,056 neurons and ReLU activation, as described in Table 3.1. While the third layer polishes the learned features, the second layer analyzes the transformed features to discover intricate patterns. One neuron with a sigmoid activation function makes up the output layer. This neuron is best suited for binary classification tasks because it produces a probability value between 0 and 1.

Table 3.1: Parameters detail of proposed model ANN

Layer Type Output Shape Number of Parameters Activation Function
Dense (None, 1056) 48,608 ReLU
Dense (None, 512) 540,224 ReLU
Dense (None, 256) 131,328 ReLU
Dropout (None, 256) 0
Dense (Output) (None, 1) 257 Sigmoid


Figure 3.3: Proposed model
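A Keras sketch consistent with the architecture in Table 3.1 and the compilation settings described above is given below; the dropout rate, number of epochs, and batch size are not stated in the text and are therefore assumptions.

from tensorflow import keras
from tensorflow.keras import layers

def build_ann(n_features):
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dense(1056, activation="relu"),
        layers.Dense(512, activation="relu"),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),                     # dropout rate assumed
        layers.Dense(1, activation="sigmoid"),   # output: probability of CVD
    ])
    model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=0.001),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Example usage (assumed training settings):
# model = build_ann(X_train.shape[1])
# model.fit(X_train, y_train, epochs=50, batch_size=256, validation_split=0.1)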

Evaluation measures

Evaluation measures in the prediction of coronary CVD are essential for determining how accurate, dependable, and effective the predictive models are. Accuracy is one of the key metrics; it gives an overall idea of how frequently the model predicts outcomes correctly, but it may not be sufficient when there is a class imbalance, as shown in equation (3.7). Precision and recall provide more specific insights: precision is the percentage of true positive predictions out of all positive predictions, reflecting how trustworthy the model's CVD predictions are, while recall quantifies the model's capacity to capture all real positive instances and shows how well it detects actual CVD cases.

True positives (TP) are instances that are positive in the test set and are correctly labelled as positive by the classifier. True negatives (TN) are instances that are negative in the test set and are correctly labelled as negative by the classifier. False positives (FP) are instances that are negative in the test set but are incorrectly labelled as positive by the classifier. False negatives (FN) are instances that are positive in the test set but are incorrectly labelled as negative by the classifier. As shown in equation (3.10), the F1 score is the harmonic mean of precision and recall, combining both into a single measure.
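In standard notation, the corresponding formulas (a sketch of the usual definitions, since the numbered equations are not reproduced here) are:

$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$

$\mathrm{Precision} = \frac{TP}{TP + FP}$

$\mathrm{Recall} = \frac{TP}{TP + FN}$

$F1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$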

Chapter Summary

The methodology chapter offers a thorough examination of the machine learning models that are employed to investigate coronary CVD. It describes the preprocessing steps and the dataset, which contains different columns. The chapter then covers the models that were employed, including the proposed ANN as well as the Random Forest Classifier, Decision Tree, XGBoost Classifier, and KNN. The strengths and suitability of each model for predicting cardiovascular disease are discussed. The evaluation metrics such as accuracy, precision, recall, F1-score, and AUC that are used to gauge how well these models perform are also covered in this chapter.

Results And Discussion

Introduction

In the results chapter, the effectiveness of machine learning models in predicting cardiovascular disease is examined. These models include the Random Forest Classifier, Decision Tree, XGBoost Classifier, and KNN, together with a proposed ANN model. Critical evaluation metrics such as AUC, F1-score, accuracy, precision, and recall are used to evaluate how well each model predicts coronary CVD from clinical data. Initial insights into data separability are provided by the Decision Tree models, while complex feature interactions are handled by the ensemble-based Random Forest and XGBoost classifiers. The proposed ANN model may outperform conventional models in capturing complex patterns and non-linear relationships.

Experimental setup

TensorFlow and Python are used to implement the model described in the present research. The complete experimental setup was built in Python using Anaconda 2.6.0, and Keras 2.6.0 libraries are used to build, compile, and test the model, with TensorFlow 2.6.0 as the backend. The experiments were run with Python 3.9.18 on a machine with a 2.20 GHz Intel(R) Core(TM) i9-13950HX CPU, 64 GB of RAM, and an 8 GB dedicated NVIDIA GeForce RTX 4060 GPU.

Machine learning models

The machine learning models, which showed respectable accuracy and interpretability, were the Random Forest Classifier, Decision Tree, XGBoost Classifier, KNN, and the proposed ANN. These models offered a strong basis for coronary CVD prediction and were successful in detecting coronary CVD patterns, although the traditional models sometimes had trouble processing complex, high-dimensional data.

Random forest classifier

The Random Forest classifier for predicting coronary CVD using machine learning techniques performed well overall, with an accuracy of 91%. For class ‘0’ (no CVD), the model achieves a precision of 0.92, indicating that 92% of the instances predicted as having no CVD are correct. The recall for class ‘0’ is 0.91, which means that the model correctly identified 91% of the actual no CVD cases. The f1-score for this class is 0.91, indicating a balanced trade-off between precision and recall. Similarly, for class 1 (CVD), the model has a precision of 0.91, indicating that 91% of the predicted CVD cases are true positive. The recall for class ‘1’ is 0.92, indicating that the model correctly detects 92% of actual CVD cases. The f1-score for class 1 is 0.91, indicating that the model is highly reliable and balanced when predicting both positive and negative classes. With support values of 43,528 for class 0 and 43,026 for class 1, the results show that the Random Forest model was trained and tested on a large amount of data, ensuring the model’s robustness and generalizability. Overall, these metrics indicate that the Random Forest classifier is extremely effective at predicting CVD, giving clinicians a reliable tool for assessing patient risk and making informed decisions. Table 4.1 describes the evaluation measure of the random forest classifier.

Table 4.1: Performance evaluation of random forest classifier

Class Precision Recall F1 score Support
0 0.92 0.91 0.91 43528
1 0.91 0.92 0.91 43026
Accuracy 0.91 86554

The confusion matrix of the Random Forest classifier illustrates how well it predicts coronary CVD. With minimal false positives (FP) and false negatives (FN), it predicts true positives (TP) and true negatives (TN) with accuracy. The model’s low values of FP and FN minimize misclassifications, making it a useful tool for early intervention and preventive healthcare strategies, while its high values of TP and TN demonstrate its ability to distinguish between patients with and without CVD.

Figure 4.1: Confusion matrix of random forest classifier

Decision tree

With an accuracy of 90%, the Decision Tree classifier’s overall performance in predicting coronary CVD through machine learning techniques is good. The precision for class ‘0’ (no CVD) is 0.90, meaning that 90% of the cases that were predicted to have no CVD are correct. With a recall for class ‘0’ of 0.91, the model correctly identified 91% of real cases of no CVD. The class’s f1-score, which is 0.90, indicates a balanced trade-off between recall and precision and demonstrates the model’s ability to accurately predict both true positives and true negatives. Comparably, the model achieves a precision of 0.90 for class 1 (CVD), indicating that 90% of predicted CVD cases are accurate. Ninety per cent of the real CVD cases were correctly identified, as indicated by the recall for class 1, which is also 0.90. Class 1 has an f1-score of 0.90, indicating consistent performance in predicting positive cases of CVD. The reliability of the results is increased by the support values, which show that the model’s predictions were based on a sizable number of samples (43,528 for class ‘0’ and 43,026 for class ‘1’), as described in Table 4.2. The Decision Tree classifier is a useful tool for predicting CVD risk and supporting clinicians in their decision-making processes because it performs robustly and in balance across both classes overall.

Table 4.2: Performance evaluation of decision tree

Class Precision Recall F1 score Support
0 0.90 0.91 0.90 43528
1 0.90 0.90 0.90 43026
Accuracy 0.90 86554

The confusion matrix shows the performance of the Decision Tree classifier in predicting coronary CVD. True positives indicate correctly predicted CVD cases, and true negatives indicate correctly identified absence of CVD. FP occur when the model predicts CVD incorrectly, and FN occur when it fails to detect it. The model's high TP and TN values show how well it works for early diagnosis and treatment, as shown in Figure 4.2.

Figure 4.2: Confusion matrix of decision tree

XG boost classifier

With an accuracy of 91%, the XGBoost classifier performs well overall in predicting coronary CVD. The classifier achieves a precision of 0.92 for class 0 (no CVD), meaning that 92% of the instances predicted as having no CVD are accurate. Class ‘0’ has a recall of 0.90, which indicates that 90% of real cases of no CVD are correctly identified. Class ‘0’ has a f1-score of 0.91, which indicates a performance that strikes a balance between recall and precision. The model’s precision for class 1 (CVD) is 0.90, meaning that 90% of the predicted cases of CVD are correct. Class 1 has a recall of 0.93, meaning that 93% of real CVD cases are identified correctly. Class 1 has an f1-score of 0.91, which indicates that the model consistently performs well in identifying CVD cases. Support values for classes ‘0’ and ‘1’ are 43,528 and 43,026 respectively, indicating that the classifier is more reliable because it is based on a large number of samples. In general, XGBoost shows a high degree of efficacy in predicting the existence and absence of CVD, which makes it a useful instrument for precise risk assessment in clinical settings. Table 4.3 describes the evaluation measure of the XG boost classifier.

Table 4.3: Performance evaluation of XG boost classifier

Class Precision Recall F1-Score Support
0 0.92 0.90 0.91 43,528
1 0.90 0.93 0.91 43,026
Accuracy 0.91 86,554

The confusion matrix of the XGBoost classifier shows that it can discriminate between patients who have and do not have coronary CVD. It displays TP when the model predicts CVD correctly, TN when it detects the absence, and FP when the model predicts CVD incorrectly for cases that are not CVD, as shown in Figure 4.3. The model’s low FP and FN values show that it can minimize errors, while its high TP and TN values show that it is good at identifying both positive and negative instances.

Figure 4.3: Confusion matrix of XG boost classifier

K-Nearest Neighbor

With an overall accuracy of 90%, the KNN classifier performs well in predicting coronary CVD. The classifier achieves a precision of 0.90 for class 0 (no CVD), meaning that 90% of the instances predicted as having no CVD are accurate. The model correctly detects 90% of real cases of no CVD, according to the recall for class '0', which is 0.90. Class 0 has an f1-score of 0.90, which indicates a performance that strikes a balance between recall and precision. The KNN model's precision for class 1 (CVD) is 0.89, meaning that 89% of the predicted cases of CVD are correct. Class 1 has a recall of 0.90, meaning that 90% of real CVD cases are correctly identified by the model. Class 1 has an f1-score of 0.90 as well, demonstrating the model's consistency in identifying CVD cases. Class 0 and class 1 have support values of 43,528 and 43,026 respectively, indicating that the dataset is fairly balanced for both classes, as described in Table 4.4. The KNN classifier is a good choice for clinical risk assessment applications because it performs consistently and fairly in predicting the risk of CVD, with good precision and recall for both CVD and non-CVD predictions.

Table 4.4: Performance evaluation of KNN

Class Precision Recall F1-Score Support
0 0.90 0.90 0.90 43,528
1 0.89 0.90 0.90 43,026
Accuracy 0.90 86,554

The confusion matrix of the K-Nearest Neighbors (KNN) classifier illustrates how well it predicts coronary CVD. Along with FP and FN, it displays TP and TN for patients with and without CVD. For both the 0 and 1 classes, the model has a high percentage of accurate predictions (TP and TN) and a low percentage of incorrect predictions (FP and FN), as shown in Figure 4.4. Even so, there are still some misclassifications, indicating that the model still needs to be improved.

Figure 4.4: Confusion matrix of KNN

Proposed Artificial Neural Network (ANN)

With an overall accuracy of 99%, the suggested Artificial Neural Network (ANN) model for coronary CVD prediction shows exceptional performance. The ANN model achieves a precision of 0.99 for class '0', which denotes no CVD, meaning that 99% of the predictions for no CVD are accurate. With a recall for class 0 of 0.98, the model correctly detects 98% of real cases of no CVD. Class '0' has an f1-score of 0.99, indicating a strong trade-off between recall and precision that leads to high reliability for negative predictions. Additionally, the model achieves a precision of 0.99 for class 1 (representing CVD), meaning that 99% of the predicted CVD cases are accurate. Recall for class '1' is 0.99, meaning that the model correctly detects 99% of the real CVD cases. Class '1' has an f1-score of 0.99, which indicates a very high model effectiveness in minimizing false positives and identifying true positives. The well-balanced nature of the dataset is indicated by the support values for classes 0 and 1, which are 43,528 and 43,026, respectively. Overall, the suggested ANN model's results validate its remarkable efficacy and dependability in precisely predicting CVD, making it a useful instrument for early intervention and clinical decision-making. Table 4.5 describes the evaluation measure of the proposed model.

Table 4.5: Performance evaluation of the proposed model

Class Precision Recall F1-Score Support
0 0.99 0.98 0.99 43,528
1 0.99 0.99 0.99 43,026
Accuracy 0.99 86,554

With 98% of cases correctly identified as true negatives and 99% correctly identified as true positives in class 0 (no CVD), the ANN model accurately predicts coronary CVD. The model’s ability to reliably distinguish between patients with and without CVD is demonstrated by the balance between false positives and false negatives, as shown in Figure 4.5. The model is appropriate for clinical decision-making and early CVD risk prediction due to its precision, recall, and overall accuracy.

Figure 4.8: Confusion matrix of the proposed model

Discussion

The discussion of machine learning techniques for coronary CVD prediction highlights the relative performance of several models: Random Forest, Decision Tree, XGBoost, KNN, and the proposed ANN. With 91% accuracy each, the Random Forest and XGBoost classifiers demonstrated their robustness in managing intricate data structures and producing dependable forecasts. The Decision Tree classifier also performed well at 90% accuracy, albeit slightly lower, partly because of its propensity for overfitting. The KNN model likewise demonstrated its simplicity and efficacy in predicting CVD with an accuracy of 90%. However, the proposed ANN model, which attained 99% accuracy, was the best performer. This high accuracy shows how well the proposed model can identify intricate patterns and relationships in the dataset, which makes it a very useful tool for estimating the risk of CVD. The strong performance of the proposed ANN model indicates that it may considerably enhance coronary CVD early detection and intervention strategies, providing a more dependable and effective method than conventional techniques.

Model Accuracy
Random Forest 91%
Decision tree 90%
XG Boost 91%
KNN 90%
Proposed model 99%

The proposed ANN model and alternative methods have a significant difference in accuracy, which can be visually highlighted by using graphic representations like bar charts or line graphs. This facilitates comprehension of the relative benefits and emphasizes the model’s potential for practical applications in predicting coronary CVD and enhancing clinical decision-making, as shown in Figure 4.9.

Figure 4.9: Comparison models accuracy

Limitations

Machine learning techniques have the potential to predict coronary cardiovascular disease, but they face several limitations. Model performance can be impacted by the diversity and quality of the datasets, as many do not contain complete clinical, lifestyle, and demographic data. Unbalanced data can produce skewed outcomes. Feature engineering can take a lot of time and requires domain knowledge. Clinical adoption of deep learning models is challenging due to their black-box nature, which makes them difficult to interpret. Model robustness can be impacted by noise and variable data sources, which can reduce the models’ dependability in practical applications. Challenges also arise from data privacy and ethical concerns.

Chapter Summary

The results chapter examines several machine learning models to predict coronary CVD. With a 99% accuracy rate, the ANN model that was suggested was the most successful. The artificial neural network model outperformed other models in identifying cases of normal and predicting coronary CVD. The chapter highlights the significance of each model in comprehending the onset of coronary CVD and makes recommendations for further study to increase prediction accuracy.

Conclusion And Future Work

Conclusion

CVDs are a significant global health concern, causing millions of deaths annually. CVDs include various conditions that affect the heart and blood vessels, including cerebrovascular accidents, congenital defects, pulmonary blood clots, cardiac arrhythmia, peripheral arterial problems, coronary artery disease (CAD), rheumatic heart conditions, and heart muscle-affecting cardiomyopathies. CAD is particularly concerning due to its correlation with worldwide death rates. The IoMT is a technology that links medical devices and collects and processes data in real time, enhancing healthcare workflow. It combines IoT power with patient details, ensuring data security in the IoMT-based framework.

The IoMT is a rapidly developing field where various medical equipment, software programmers, and healthcare professionals come together on a single platform to provide high-quality services. CAD is a common and potentially dangerous condition affecting the heart’s arteries.

Common risk factors include smoking, high blood pressure, cholesterol, diabetes, obesity, a sedentary lifestyle, a family history of heart disease, and ageing. Diagnosis involves medical history, physical examination, ECG, stress tests, imaging, and blood tests. The “Rates_and_Trends_in_Coronary_Heart_Disease” dataset is essential for using machine learning techniques to predict CVD. This dataset includes several dimensions, such as location, year, geography, and classes, providing a comprehensive understanding of cardiovascular health. Preprocessing is crucial for ensuring the accuracy and consistency of predictive models. Data cleansing involves eliminating duplicate entries, fixing inconsistencies, and handling outliers. Common techniques include imputation or removing rows/columns with excessive missing values. Categorical variables can be converted into a numerical format using techniques like one-hot encoding or label encoding. Normalization/Standardization ensures that the numerical characteristics are on a comparable scale, especially when there is a non-Gaussian distribution of the data. The Random Forest classifier is a powerful ensemble learning technique used in predicting coronary heart disease (CVD) risk. It builds a “forest” of several decision trees during the training phase, using a random subset of features and data points to build each decision tree.

The Decision Tree machine learning algorithm predicts CVD by forming a structure resembling a tree, with each node denoting a choice made in response to a particular feature and each branch denoting a result. The decision tree determines which critical characteristic best differentiates the target classes in CVD prediction using metrics such as information gain and Gini impurity. XGBoost classifier is a scalable and highly effective machine-learning method for predicting CVD. It minimizes errors through an iterative process of prediction optimization, identifying minute patterns and interactions between different CVD risk factors. XGBoost first initializes a basic model that forecasts the mean result for the training set and continuously adds new decision trees to the ensemble to estimate residual errors.

KNN is a lazy learning algorithm that uses the training dataset directly to inform predictions. It uses distance metrics like Euclidean, Manhattan, or Minkowski distances to determine the distance between each new patient’s data point and every other data point in the training set for CVD prediction. KNN’s majority voting system allows for ease of interpretation. An innovative technique for forecasting coronary CVD is the Artificial Neural Network (ANN), which is modelled after the human brain and consists of linked layers of neurons that process input data and produce predictions. Evaluation measures are crucial in assessing the accuracy and dependability of predictive models for coronary cardiovascular disease (CVD). Accuracy is a key metric, providing insight into how frequently the model predicts outcomes correctly. Precision and recall are more specific metrics, indicating the model’s capacity to capture all real positive instances and its detection ability. The Random Forest classifier for predicting CVD using machine learning techniques performed well overall, with an accuracy of 91%. The model achieved a precision of 0.92 for class ‘0’ (no CVD) and 0.91 for class 1 (CVD), indicating a balanced trade-off between precision and recall. The Decision Tree classifier also performed well, with an accuracy of 90% for class ‘0’ and 90% for class 1 (CVD). The XGBoost classifier also performed well, with an accuracy of 91% for class ‘0’ and a precision of 0.92 for class ‘0’. The KNN classifier also performed well, with an overall accuracy of 90% for both classes. The suggested Artificial Neural Network (ANN) model demonstrated exceptional performance, with an overall accuracy of 99% for class ‘0’, 99% of predictions for no CVD, and a recall of 0.98 for class ‘0’. The model’s effectiveness in minimizing false positives and identifying true positives was high, with support values of 43,528 and 43,026 respectively. In conclusion, the Random Forest, Decision Tree, XGBoost, KNN, and ANN classifiers have shown promising results in predicting CVD risk and providing reliable tools for clinicians. The ANN model’s results validate its remarkable efficacy and dependability in accurately predicting CVD, making it a useful instrument for early intervention and clinical decision-making.

Future work

Future work on utilizing transfer learning and domain adaptation strategies, creating explainable AI models, and integrating multi-modal data sources should be the main goals of machine learning research on cardiovascular disease prediction. Predictive performance can be improved by using sophisticated feature engineering and selection techniques like natural language processing. IoMT devices and real-time monitoring systems can help with early diagnosis and individualized treatment plans. Strong frameworks are required for model training and data sharing while protecting privacy, and ethical considerations are vital. Transitioning these tools from research to routine clinical practice can be facilitated by conducting prospective clinical trials to validate predictive models in various patient populations and healthcare settings.

References

Abbas, Q., Hussain, A., & Baig, A. R. (2022). Automatic detection and classification of cardiovascular disorders using phonocardiogram and convolutional vision transformers. Diagnostics, 12(12), 3109.

Adewole, K. S., Akintola, A. G., Jimoh, R. G., Mabayoje, M. A., Jimoh, M. K., Usman-Hamza, F. E., . . . Ameen, A. O. (2021). Cloud-based IoMT framework for cardiovascular disease prediction and diagnosis in personalized E-health care. In Intelligent IoT systems in personalized health care (pp. 105-145). Elsevier.

Ahmed, Z., Mohamed, K., Zeeshan, S., & Dong, X. (2020). Artificial intelligence with multi-functional machine learning platform development for better healthcare and precision medicine. Database, 2020, baaa010.

Al-Khlaiwi, T., Alshammari, H., Habib, S. S., Alobaid, R., Alrumaih, L., Almojel, A., . . . Alkhodair, M. (2023). High prevalence of lack of knowledge and unhealthy lifestyle practices regarding premature coronary artery disease and its risk factors among the Saudi population. BMC Public Health, 23(1), 908.

Aljohani, R. I., Hosni Mahmoud, H. A., Hafez, A., & Bayoumi, M. (2023). A Novel Deep Learning CNN for Heart Valve Disease Classification Using Valve Sound Detection. Electronics, 12(4), 846.

Allan, S., Olaiya, R., & Burhan, R. (2022). Reviewing the use and quality of machine learning in developing clinical prediction models for cardiovascular disease. Postgraduate Medical Journal, 98(1161), 551-558.

Alshehri, F., & Muhammad, G. (2020). A comprehensive survey of the Internet of Things (IoT) and AI-based smart healthcare. IEEE Access, 9, 3660-3678.

Arvisais-Anhalt, S., Lau, M., Lehmann, C. U., Holmgren, A. J., Medford, R. J., Ramirez, C. M., & Chen, C. N. (2022). The 21st Century Cures Act and multiuser electronic health record access: potential pitfalls of information release. Journal of medical Internet research, 24(2), e34085.

Awotunde, J. B., Folorunso, S. O., Bhoi, A. K., Adebayo, P. O., & Ijaz, M. F. (2021). Disease diagnosis system for IoT-based wearable body sensors with machine learning algorithm. Hybrid artificial intelligence and IoT in healthcare, 201-222.

Balaji, G., Subashini, T., & Chidambaram, N. (2015). Automatic classification of cardiac views in echocardiogram using histogram and statistical features. Procedia Computer Science, 46, 1569-1576.

Balaji, G., Subashini, T., & Suresh, A. (2014). An efficient view classification of echocardiogram using morphological operations. Journal of Theoretical and Applied Information Technology, 67(3), 732-735.

Baumgartner, H., Hung, J., Bermejo, J., Chambers, J. B., Edvardsen, T., Goldstein, S., . . . Otto, C. M. (2017). Recommendations on the echocardiographic assessment of aortic valve stenosis: a focused update from the European Association of Cardiovascular Imaging and the American Society of Echocardiography. European Heart Journal-Cardiovascular Imaging, 18(3), 254-275.

Bersvendsen, J., Orderud, F., Lie, Ø., Massey, R. J., Fosså, K., Estépar, R. S. J., . . . Samset, E. (2017). Semiautomated biventricular segmentation in three-dimensional echocardiography by coupled deformable surfaces. Journal of Medical Imaging, 4(2), 024005-024005.

Cacciatore, S., Spadafora, L., Bernardi, M., Galli, M., Betti, M., Perone, F., . . . Landi, F. (2023). Management of coronary artery disease in older adults: recent advances and gaps in evidence. Journal of Clinical Medicine, 12(16), 5233.

Chen, D., Xuan, W., Gu, Y., Liu, F., Chen, J., Xia, S., . . . Luo, J. (2022). Automatic classification of normal–abnormal heart sounds using convolution neural network and long-short term memory. Electronics, 11(8), 1246.

Ciumărnean, L., Milaciu, M. V., Negrean, V., Orășan, O. H., Vesa, S. C., Sălăgean, O., . . . Vlaicu, S. I. (2021). Cardiovascular risk factors and physical activity for the prevention of cardiovascular diseases in the elderly. International journal of environmental research and public health, 19(1), 207.

Davis, A., Billick, K., Horton, K., Jankowski, M., Knoll, P., Marshall, J. E., . . . Adams, D. B. (2020). Artificial intelligence and echocardiography: a primer for cardiac sonographers. Journal of the American Society of Echocardiography, 33(9), 1061-1066.

Fatima, H., Mahmood, F., Sehgal, S., Belani, K., Sharkey, A., Chaudhary, O., . . . Khabbaz, K. R. (2020). Artificial intelligence for dynamic echocardiographic tricuspid valve analysis: a new tool in echocardiography. Journal of Cardiothoracic and Vascular Anesthesia, 34(10), 2703-2706.

Flora, G. D., & Nayak, M. K. (2019). A brief review of cardiovascular diseases, associated risk factors and current treatment regimes. Current pharmaceutical design, 25(38), 4063-4084.

Flores-Alonso, S. I., Tovar-Corona, B., & Luna-García, R. (2022). Deep learning algorithm for heart valve diseases assisted diagnosis. Applied Sciences, 12(8), 3780.

Gahungu, N., Trueick, R., Bhat, S., Sengupta, P. P., & Dwivedi, G. (2020). Current challenges and recent updates in artificial intelligence and echocardiography. Current Cardiovascular Imaging Reports, 13(2), 5.

Genovese, D., Rashedi, N., Weinert, L., Narang, A., Addetia, K., Patel, A. R., . . . Lang, R. M. (2019). Machine learning–based three-dimensional echocardiographic quantification of right ventricular size and function: validation against cardiac magnetic resonance. Journal of the American Society of Echocardiography, 32(8), 969-977.

Habuza, T., Navaz, A. N., Hashim, F., Alnajjar, F., Zaki, N., Serhani, M. A., & Statsenko, Y. (2021). AI applications in robotics, diagnostic image analysis and precision medicine: Current limitations, future trends, guidelines on CAD systems for medicine. Informatics in Medicine Unlocked, 24, 100596.

Holmes, J., Sacchi, L., & Bellazzi, R. (2004). Artificial intelligence in medicine. Ann R Coll Surg Engl, 86, 334-338.

Jadhav, U. M. (2018). Cardio-metabolic disease in India—the upcoming tsunami. Annals of translational medicine, 6(15).

Johnson, K. W., Torres Soto, J., Glicksberg, B. S., Shameer, K., Miotto, R., Ali, M., . . . Dudley, J. T. (2018). Artificial intelligence in cardiology. Journal of the American College of Cardiology, 71(23), 2668-2679.

Khan Mamun, M. M. R., & Elfouly, T. (2023). Detection of Cardiovascular Disease from Clinical Parameters Using a One-Dimensional Convolutional Neural Network. Bioengineering, 10(7), 796.

Kirkpatrick, J. N., Grimm, R., Johri, A. M., Kimura, B. J., Kort, S., Labovitz, A. J., . . . Thorson, K. (2020). Recommendations for echocardiography laboratories participating in cardiac point of care cardiac ultrasound (POCUS) and critical care echocardiography training: report from the American Society of Echocardiography. Journal of the American Society of Echocardiography, 33(4), 409-422. e404.

Krittanawong, C., Zhang, H., Wang, Z., Aydar, M., & Kitai, T. (2017). Artificial intelligence in precision cardiovascular medicine. Journal of the American College of Cardiology, 69(21), 2657-2664.

Kusunose, K., Abe, T., Haga, A., Fukuda, D., Yamada, H., Harada, M., & Sata, M. (2020). A deep learning approach for assessment of regional wall motion abnormality from echocardiographic images. Cardiovascular Imaging, 13(2_Part_1), 374-381.

Kusunose, K., Haga, A., Abe, T., & Sata, M. (2019). Utilization of artificial intelligence in echocardiography. Circulation Journal, 83(8), 1623-1629.

Levine, G. N., Cohen, B. E., Commodore-Mensah, Y., Fleury, J., Huffman, J. C., Khalid, U., . . . Spatz, E. S. (2021). Psychological health, well-being, and the mind-heart-body connection: a scientific statement from the American Heart Association. Circulation, 143(10), e763-e783.

Li, S., Li, F., Tang, S., & Luo, F. (2021). Heart sounds classification based on feature fusion using lightweight neural networks. IEEE Transactions on instrumentation and measurement, 70, 1-9.

Liu, T., Li, P., Liu, Y., Zhang, H., Li, Y., Jiao, Y., . . . Ren, M. (2021). Detection of coronary artery disease using multi-domain feature fusion of multi-channel heart sound signals. Entropy, 23(6), 642.

Lockhart, P. B., & Sun, Y. P. (2021). Diseases of the cardiovascular system. Burket’s Oral Medicine, 505-552.

Madani, A., Arnaout, R., Mofrad, M., & Arnaout, R. (2018). Fast and accurate view classification of echocardiograms using deep learning. NPJ digital medicine, 1(1), 6.

Meshref, H. (2019). Cardiovascular disease diagnosis: A machine learning interpretation approach. International Journal of Advanced Computer Science and Applications, 10(12).

Mohanty, M., Rath, P. S., & Mohapatra, A. G. (2024). IoMT-based Heart Rate Variability Analysis with Passive FBG Sensors for Improved Health Monitoring. International Journal of Computing and Digital Systems, 15(1), 1135-1147.

Nath, C., Albaghdadi, M. S., & Jonnalagadda, S. R. (2016). A natural language processing tool for large-scale data extraction from echocardiography reports. PLoS One, 11(4), e0153749.

O’driscoll, J. M., Hawkes, W., Beqiri, A., Mumith, A., Parker, A., Upton, R., . . . Sabharwal, N. (2022). Left ventricular assessment with artificial intelligence increases the diagnostic accuracy of stress echocardiography. European Heart Journal Open, 2(5), oeac059.

Organization, W. H. (2020). WHO reveals leading causes of death and disability worldwide: 2000-2019. World Health Organization (WHO), 1.

Pachiyannan, P., Alsulami, M., Alsadie, D., Saudagar, A. K. J., AlKhathami, M., & Poonia, R. C. (2024). A Novel Machine Learning-Based Prediction Method for Early Detection and Diagnosis of Congenital Heart Disease Using ECG Signal Processing. Technologies, 12(1), 4.

Pellikka, P. A. (2022). Artificially intelligent interpretation of stress echocardiography: the future is now. In (Vol. 15, pp. 728-730): American College of Cardiology Foundation Washington DC.

Prabhakaran, D., Anand, S., & Reddy, K. S. (2022). Public Health Approach to Cardiovascular Disease Prevention & Management. CRC Press.

Quazi, S. (2022). Artificial intelligence and machine learning in precision and genomic medicine. Medical Oncology, 39(8), 120.

Robinson, S. (2021). Cardiovascular disease. In Priorities for Health Promotion and Public Health (pp. 355-393). Routledge.

Saheera, S., & Krishnamurthy, P. (2020). Cardiovascular changes associated with hypertensive heart disease and aging. Cell transplantation, 29, 0963689720920830.

Samad, M. D., Ulloa, A., Wehner, G. J., Jing, L., Hartzel, D., Good, C. W., . . . Fornwalt, B. K. (2019). Predicting survival from large echocardiography and electronic health record datasets: optimization with machine learning. JACC: Cardiovascular Imaging, 12(4), 681-689.

Sarrafzadegan, N., & Mohammmadifard, N. (2019). Cardiovascular disease in Iran in the last 40 years: prevalence, mortality, morbidity, challenges and strategies for cardiovascular prevention. Archives of Iranian medicine, 22(4), 204-210.

Schuuring, M. J., Išgum, I., Cosyns, B., Chamuleau, S. A., & Bouma, B. J. (2021). Routine echocardiography and artificial intelligence solutions. Frontiers in Cardiovascular Medicine, 8, 648877.

Sethi, Y., Patel, N., Kaka, N., Desai, A., Kaiwan, O., Sheth, M., . . . Khandaker, M. U. (2022). Artificial intelligence in pediatric cardiology: a scoping review. Journal of Clinical Medicine, 11(23), 7072.

Shaffer, F., & Ginsberg, J. P. (2017). An overview of heart rate variability metrics and norms. Frontiers in public health, 5, 290215.

Shao, C., Wang, J., Tian, J., & Tang, Y.-d. (2020). Coronary artery disease: from mechanism to clinical practice. Coronary Artery Disease: Therapeutics and Drug Discovery, 1-36.

Shokouhmand, A., Yang, C., Aranoff, N. D., Driggin, E., Green, P., & Tavassolian, N. (2021). Mean pressure gradient prediction based on chest angular movements and heart rate variability parameters. 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC),

Shuvo, S. B., Ali, S. N., Swapnil, S. I., Al-Rakhami, M. S., & Gumaei, A. (2021). CardioXNet: A novel lightweight deep learning framework for cardiovascular disease classification using heart sound recordings. IEEE Access, 9, 36955-36967.

Ulloa-Cerna, A. E., Jing, L., Pfeifer, J. M., Raghunath, S., Ruhl, J. A., Rocha, D. B., . . . Steinhubl, S. R. (2022). rECHOmmend: an ECG-based machine learning approach for identifying patients at increased risk of undiagnosed structural heart disease detectable by echocardiography. Circulation, 146(1), 36-47.

Vaduganathan, M., Mensah, G. A., Turco, J. V., Fuster, V., & Roth, G. A. (2022). The global burden of cardiovascular diseases and risk: a compass for future health. In (Vol. 80, pp. 2361-2371): American College of Cardiology Foundation Washington DC.

Volpato, V., Mor‐Avi, V., Narang, A., Prater, D., Goncalves, A., Tamborini, G., . . . Lang, R. M. (2019). Automated, machine learning‐based, 3D echocardiographic quantification of left ventricular mass. Echocardiography, 36(2), 312-319.

Wahlang, I., Maji, A. K., Saha, G., Chakrabarti, P., Jasinski, M., Leonowicz, Z., & Jasinska, E. (2021). Deep Learning methods for classification of certain abnormalities in Echocardiography. Electronics, 10(4), 495.

Wahlang, I., Saha, G., & Maji, A. K. (2020). A study on abnormalities detection techniques from echocardiogram. Advances in Electrical and Computer Technologies: Select Proceedings of ICAECT 2019,

Xiao, B., Xu, Y., Bi, X., Zhang, J., & Ma, X. (2020). Heart sounds classification using a novel 1-D convolutional neural network with extremely low parameter consumption. Neurocomputing, 392, 153-159.

Yan, Y., Zhang, J.-W., Zang, G.-Y., & Pu, J. (2019). The primary use of artificial intelligence in cardiovascular diseases: what kind of potential role does artificial intelligence play in future medicine? Journal of geriatric cardiology: JGC, 16(8), 585.

Yang, F., Chen, X., Lin, X., Chen, X., Wang, W., Liu, B., . . . Huang, D. (2022). Automated analysis of Doppler echocardiographic videos as a screening tool for valvular heart diseases. Cardiovascular Imaging, 15(4), 551-563.
