Optimization of the Machine Learning Approach Using Optuna in Heart Disease Prediction

Heart disease prediction is a critical area in healthcare, as early identification and accurate assessment of cardiovascular risks can lead to improved patient outcomes. This study explores the application of machine learning techniques for predicting heart disease. Various data attributes, including medical history, clinical measurements, and lifestyle factors, are utilized to develop predictive models. A comprehensive analysis of different machine learning algorithms is conducted to determine their efficacy in classification tasks. The dataset used for experimentation is sourced from a diverse patient population, enhancing the generalizability of the findings. Through rigorous evaluation and validation, the study aims to identify the most suitable machine learning approach for effectively predicting heart disease. The results highlight the potential of machine learning as a valuable tool in assisting healthcare professionals in making informed decisions and providing personalized care to individuals at risk of heart disease.


Introduction
Cardiovascular diseases (CVDs), including coronary heart disease (heart attack), stroke, and heart failure, are a major burden of disease globally [1]. According to the World Health Organization (WHO), CVD, including heart disease (HD), is responsible for 31% of total deaths worldwide [2]. HD occurs when the heart is unable to supply enough blood throughout the body. It can be brought on by high blood pressure, diabetes, coronary heart disease, and other heart problems or disorders [3].
The human body consists of several tissues. These tissues need oxygen and nutrients to work properly. The heart is the main organ that supplies blood to all parts of the body through the circulatory system, delivering nutrients and oxygen to the tissues. If a problem prevents the heart from functioning properly, the circulatory system becomes blocked, which leads to heart failure [1], [4]. There are many forms of heart disease; however, cardiovascular disease (CVD) is the most lethal [2]. CVD is among the diseases that cause the most deaths worldwide [3]. More than 31% of global deaths occur due to heart failure. By 2030, it is predicted that there will be more than 22 million deaths due to heart problems [5]. The American Heart Association reports that more than 121.5 million adults suffer from heart disease [6]. Several factors contribute to heart disease: lack of exercise, smoking, alcohol consumption, a poor lifestyle, and eating junk food are among the main causes [7]. Doctors and medical professionals use angiography to diagnose heart disease, but the method has several drawbacks: it requires human assistance and therefore takes considerable time to produce results; because humans operate it, there is a substantial probability of erroneous results; and, most importantly, the procedure is very expensive, so not everyone can afford it. Therefore, it is necessary to identify cardiovascular disease early so that patients can take the necessary precautions to prevent a severe heart attack. In this study, disease identification was carried out using a machine learning approach with several methods, namely Random Forest (RF), Logistic Regression (LR), and k-Nearest Neighbor (KNN) models, optimized using Optuna.

Research Method
The overall research methodology is described in Figure 1: it starts with Exploratory Data Analysis (EDA), followed by dataset preprocessing, upsampling to balance the target variable, modeling using Random Forest (RF), Logistic Regression (LR), and k-Nearest Neighbor (KNN), and finally evaluation to determine the best model. The dataset consists of the attributes listed in Table 1 and 606 observations. The main task on this dataset is to predict whether or not a patient has heart disease based on the given attributes. The goal of our experiments is to diagnose and extract insights from this dataset that can help in understanding heart disease.

Exploratory Data Analysis
We performed Exploratory Data Analysis (EDA) on the dataset to examine the data dimensions, data distribution, feature significance via the chi-squared test and t-test, correlation via the Pearson test, and multicollinearity via the variance inflation factor (VIF), as summarized in Table 1. Based on the EDA, we preprocess the data by (1) discarding duplicate observations (304 observations), (2) discarding outlier observations (13 observations) detected with the z-score method, and (3) applying feature transformation. We apply feature transformation with the min-max data normalization method. Normalization is needed because some ML models search for patterns in datasets by comparing attributes and data ranges, and differences in scale cause problems when training such models. Simply put, scale differences between features result in a weak ML model. To ensure each feature has the same scale, we apply min-max data normalization. Min-max normalization is a linear data transformation in which the minimum value of the data becomes 0 and the maximum value becomes 1. The method is applied to each feature individually. Each feature is normalized using Equation (1).
x′ = (x − min(A)) / (max(A) − min(A))  (1)
where x′ is the new value of each entry, x is the original attribute value, and max(A) and min(A) are the maximum and minimum values of attribute A.
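Equation (1) can be sketched directly in a few lines of Python. This is a minimal illustration of the formula only; in practice the authors may have used a library scaler rather than a hand-written function.

```python
import numpy as np

def min_max_normalize(x):
    """Scale a 1-D feature array to the [0, 1] range, per Equation (1):
    x' = (x - min(A)) / (max(A) - min(A))."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    if hi == lo:  # constant feature: map every entry to 0 to avoid division by zero
        return np.zeros_like(x)
    return (x - lo) / (hi - lo)

# Toy 'age' feature: the minimum (29) maps to 0.0 and the maximum (77) to 1.0
ages = np.array([29, 45, 54, 63, 77])
print(min_max_normalize(ages))
```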
Apply categorical encoding. We apply categorical encoding to numeric features that have five or fewer unique values; we consider these features categorical. Figure 4 exemplifies the transformation of numerical features into categorical features using categorical encoding.
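A sketch of this rule using pandas is below. The threshold of five unique values comes from the text; the choice of one-hot encoding and the toy column names (`cp`, `age`) are illustrative assumptions, since the paper does not specify the exact encoding scheme.

```python
import pandas as pd

def encode_low_cardinality(df, max_unique=5):
    """Treat numeric columns with <= max_unique distinct values as categorical
    and one-hot encode them; all other columns pass through unchanged.
    (One-hot encoding is an assumption; the paper only says 'categorical encoding'.)"""
    cat_cols = [c for c in df.columns if df[c].nunique() <= max_unique]
    return pd.get_dummies(df, columns=cat_cols)

# Toy frame: 'cp' (chest pain type) has 4 unique values -> encoded;
# 'age' has 6 unique values -> left as a numeric column.
df = pd.DataFrame({"cp": [0, 1, 2, 3, 1, 2],
                   "age": [29, 45, 54, 63, 77, 51]})
encoded = encode_low_cardinality(df)
print(encoded.columns.tolist())  # 'cp' expanded into cp_0 .. cp_3, 'age' kept
```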

Train-Test Split
By applying stratified random sampling, we divided the dataset into two parts: training data (75%) and test data (25%). The stratified split preserves the class proportions of the dependent variable (55% for class 1 and 45% for class 0). The test data is excluded from model training to avoid overfitting and to increase fairness at the model evaluation stage, so that the evaluation results genuinely reflect the model's quality.
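A stratified 75/25 split of this kind can be sketched with scikit-learn. The synthetic data below merely stands in for the heart-disease table, which is not reproduced here; the split parameters mirror those described in the text.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 302 remaining observations after preprocessing,
# with roughly the class balance described in the paper (assumption for illustration).
X, y = make_classification(n_samples=302, weights=[0.45], random_state=42)

# stratify=y keeps the class proportions equal in the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(len(X_tr), len(X_te))  # 226 training and 76 test observations
```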

Upsampling
We apply the upsampling/oversampling method SMOTE to balance the distribution of the dependent variable. SMOTE is an oversampling method that generates new synthetic samples via interpolation to balance the class counts of the dependent variable. SMOTE creates synthetic samples based on the proximity of samples within the minority class. A synthetic sample is generated by taking the difference between the feature vector to be augmented and one of its nearest neighbors, multiplying this difference by a random number between 0 and 1, and adding the result to the original feature vector. This approach forces the decision region of the minority class to become more general. The newly generated synthetic minority sample, x_new, lies between the observations x_i and x_k.
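The interpolation step described above can be written out directly. This is a sketch of the SMOTE sampling formula x_new = x_i + λ(x_k − x_i) only, not of the full algorithm (which also performs nearest-neighbor search over the minority class); in practice a library such as imbalanced-learn would typically be used.

```python
import numpy as np

def smote_interpolate(x_i, x_k, rng=None):
    """One SMOTE interpolation step: x_new = x_i + lam * (x_k - x_i),
    with lam drawn uniformly from [0, 1], so the synthetic sample lies on
    the segment between a minority observation and a nearby minority neighbor."""
    rng = rng or np.random.default_rng(0)
    lam = rng.uniform(0.0, 1.0)
    return x_i + lam * (x_k - x_i)

x_i = np.array([1.0, 2.0])   # minority sample to augment
x_k = np.array([3.0, 6.0])   # one of its nearest minority neighbors
x_new = smote_interpolate(x_i, x_k)
print(x_new)  # lies somewhere on the segment between x_i and x_k
```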

Tools
In building the prediction model, we used two tools, namely Python with Python version 3.10.12 and RapidMiner with version 10.1.003. Both of them processed the same data.

Modeling
We used the Random Forest (RF), Logistic Regression (LR), and k-Nearest Neighbors (KNN) models, optimized using Optuna in Python and Grid Search Optimization in RapidMiner (Figure 5), to predict heart disease. In both Python and RapidMiner, each model was trained using the 10-fold cross-validation method.
Training with the 10-fold cross-validation method randomly divides the training data into 10 parts while maintaining the same proportion (stratified) for each class in each training and testing part.
Using Optuna for the search is an efficient and beneficial approach, given its search speed and the resulting improvement in model accuracy. Optuna is responsible for finding the best combination among the available hyperparameters; this step is called hyperparameter optimization [8].
RF is a type of classification algorithm that consists of multiple Decision Trees (DT), analogous to how a forest has many trees.
Deep DTs can cause a problem known as overfitting during training, so that small differences between test samples produce significant changes in classification outcomes. The various DTs that make up the RF are trained on different parts of the training dataset [9]. To classify a new sample, its input values are passed through each DT in the forest; each DT uses a specific subset of the input values and returns its result as a classification output. The forest then selects the output with the most "votes" (for categorical output) or the average of all trees' outputs (for numerical output). Since the RF considers the results of multiple DTs, the variation caused by any single DT on similar datasets is reduced [10].
The LR model describes and estimates the relationship between one binary dependent variable, also known as the outcome variable, and one or more independent variables, also known as covariates or explanatory variables. The LR model has a strong interpretation. It is used to analyze retrospective data, including case-control studies, as well as to create prediction algorithms. LR is commonly used to solve two-class classification problems [11].
KNN is a generalization of the nearest neighbor rule. Its inductive bias assigns a test sample the class label shared by its k nearest training samples. The nearest neighbor rule itself is a simple class assignment: the test sample receives the class of the single nearest sample. If the training set and the distance metric remain unchanged, the decision of the nearest neighbor rule is uniquely determined for each test instance [12], as illustrated in Figure 2.

Evaluation
In evaluating the models built, we use the accuracy, precision, sensitivity, specificity, F-measure, g-mean, MCC, and AUC metrics. In a binary classification task, there are two possible outcome classes: True (1) and False (0). The correct and incorrect class predictions are depicted in the confusion matrix in Table 2. In evaluating prediction models for classification, accuracy is the most commonly used metric. However, for classification tasks with imbalanced classes, the accuracy metric can be misleading due to prediction bias toward the majority class, so other metrics are needed to evaluate the reliability of the model. In classification, accuracy is defined as the ratio of the total number of correct predictions to the total number of instances, as described in Equation (2):
Accuracy = (TP + TN) / (TP + TN + FP + FN)  (2)
c. Sensitivity
Sensitivity, also called recall, hit rate, or true positive rate (TPR), describes the performance of a classification model in predicting the positive class. A high sensitivity value reflects that the classification model is reliable in predicting the positive class. Sensitivity is described in Equation (4):
Sensitivity = TP / (TP + FN)  (4)
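The confusion-matrix metrics named above follow directly from the four cell counts. The sketch below computes several of them from TP, TN, FP, and FN; the toy counts are arbitrary and purely for illustration.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # Equation (2)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)                 # Equation (4): recall / TPR
    specificity = tn / (tn + fp)                 # true negative rate
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    g_mean = (sensitivity * specificity) ** 0.5
    return {"accuracy": accuracy, "precision": precision,
            "sensitivity": sensitivity, "specificity": specificity,
            "f_measure": f_measure, "g_mean": g_mean}

# Arbitrary illustrative counts: 40 TP, 30 TN, 5 FP, 3 FN
m = confusion_metrics(40, 30, 5, 3)
print({k: round(v, 3) for k, v in m.items()})
```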

Result and Discussion
Model training in Python shows that the LR model is the best of the three, outperforming the RF and KNN models, while in RapidMiner the RF model is the best, outperforming the LR and KNN models. Overall, this research produces its best model with RF trained in Python and optimized using Optuna, based on the accuracy (93.15%) and F-measure (93.83%) metrics; however, this model is weaker at predicting true negatives, as the specificity of the RapidMiner RF model (95%) outperforms it (Table 6). In addition, the significance test shows that the fbs (p-value 0.761) and chol (p-value 0.158) features are not significant in predicting HD. Feature selection using the significance-test method increases the performance of the optimized KNN model, but conversely decreases the performance of the optimized RF model (Table 3 and Table 4).
The feature-importance measurements for the best model (Optuna-RF) show that cp with a value of 0 is the best predictor of HD, followed by ca, thal with a value of 2, oldpeak, thalach, and age (Figure 3). The EDA found that there were 304 duplicate observations; no null values in the observations; an imbalanced class distribution of 138 observations in class 0 and 164 in class 1; 44 outlier observations when detected with the IQR method; and 13 outlier observations when detected with the z-score method (Figure 4).

Conclusion
From the research results, it can be concluded that with Optuna, a machine learning model can be optimized more efficiently and produce more accurate predictions in identifying heart disease. In this study, the authors successfully improved the accuracy of heart disease prediction using the optimization techniques provided by Optuna. This research has important implications for healthcare, where early detection of heart disease can support more effective diagnosis and treatment.