Hepatitis Prediction Using K-NN, Naive Bayes, Support Vector Machine, Multilayer Perceptron, Random Forest, Gradient Boosting, and K-Means

Hepatitis is a serious disease that causes deaths throughout the world; it produces inflammation of the human liver. If this life-threatening disease is detected early, many lives can be saved. In this research paper, we predict hepatitis disease using data mining techniques and propose a feasible approach to improve the performance of our prediction models. We address the problem of missing values in the dataset by replacing them with the mean value. Nine algorithms were applied to the hepatitis disease dataset to calculate prediction accuracy. We measure accuracy, precision, recall, ROC, and best score, and compare the results before and after random search hyperparameter tuning. By using random search, we aim to find the optimal combination of hyperparameters to improve the performance of the machine learning models, which allows us to compare the performance of the classification models.


Introduction
Hepatitis is a disease defined as inflammation of the liver and is most often caused by viral infections, resulting in 1.5 million deaths worldwide each year [1]. Viral hepatitis has become a major threat to human health in recent decades, with a wide variety of hepatitis-associated viruses [2]. Medical diagnosis is an important and complex task that requires accurate identification. It plays an important role in diagnosing the disease at the right time and in its early, recoverable stages. The liver is a vital organ in the human body, and hepatitis is a serious disease that affects its function.
The main factor that causes liver inflammation is the presence of hepatitis viruses [3]. Classification algorithms can help medical professionals diagnose diseases; here, a classification algorithm is applied to predict hepatitis from patient data [4], [5]. Determining a hepatitis diagnosis is a challenging task for doctors because many factors need to be considered and analyzed [6]. The healthcare industry collects information from various clinical reports and diagnostic test results to identify dataset class labels by observing hidden patterns and correlated features in the dataset [7]. Both hidden and correlated patterns help distinguish between those who have hepatitis and those who do not.
Predicting the survival of hepatitis patients is a challenging task in the early stages due to interdependent features. Therefore, models can be developed to predict the survival of hepatitis patients [8]. Data mining refers to the extraction, or "mining," of knowledge from large amounts of data. It has been widely used in bioinformatics to analyze biomedical data, and its algorithms can be applied efficiently to the prediction and classification of interrelated data. The use of data in the healthcare industry is very important for reliable early disease detection and for improving the quality of health services [9].

Research Method
This research aims to improve the accuracy of the predictions produced by data mining algorithms. The datasets used in prediction models must be as precise and accurate as possible, yet the collected dataset may contain irrelevant or missing values. To ensure that the data mining process produces the best results in terms of accuracy, the data must be managed effectively with the framework presented in Figure 1.
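The mean-replacement strategy for missing values can be sketched as follows; the toy matrix and its values are illustrative only, not the actual hepatitis attributes:

```python
import numpy as np

# Toy attribute matrix standing in for the hepatitis data;
# np.nan marks a missing value (all numbers here are made up).
X = np.array([
    [30.0, 1.0, np.nan],
    [50.0, np.nan, 3.2],
    [40.0, 0.0, 2.8],
])

# Replace every missing entry with the mean of its column,
# the strategy used in this study to handle missing values.
col_means = np.nanmean(X, axis=0)
rows, cols = np.where(np.isnan(X))
X[rows, cols] = col_means[cols]
```

After this step the matrix contains no missing entries, so every classifier in the comparison can be trained on the same complete data.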

Attribute Identification
The dataset, taken from the UCI Machine Learning Repository, consists of 155 samples and 20 features, with a class label indicating "yes" or "no" for survival. It contains six multivalued characteristics and 14 nominal attributes. The most common characteristics of the dataset are presented in Table 1.

Naive Bayes
Naïve Bayes is used for classification and is based on Bayes' theorem, and the classifier model is very easy to build. With the help of Bayes' theorem, we can determine the probability of an event occurring given the probability of another event that has already occurred [10]. The posterior probability is calculated using Equation (1):

P(C|X) = P(X|C) P(C) / P(X)  (1)

where X is an attribute, C is a class, and P(C|X) is the probability of C given X.
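A minimal numeric illustration of Bayes' theorem; the probability values below are made up purely for the example:

```python
# Bayes' theorem: posterior P(C|X) from prior, likelihood, and evidence.
p_c = 0.3          # P(C): prior probability of class C
p_x_given_c = 0.8  # P(X|C): likelihood of attribute X given class C
p_x = 0.5          # P(X): evidence, the total probability of X

# Posterior probability of class C given attribute X.
p_c_given_x = p_x_given_c * p_c / p_x
```

The classifier computes such a posterior for every class and assigns the sample to the class with the highest value.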

Random Forest
The Random Forest algorithm is a machine learning algorithm that is very popular for classification and regression; in this study we use it for classification. It works in three processes. In the first process, during the learning phase, a number of Decision Trees are generated. In the second process, each tree built in the previous step predicts a class label for the dataset. In the third and final process, the correct class label is assigned to each sample based on the majority vote of the trees [11]. In [8], different types of supervised machine learning algorithms were compared for predicting heart disease, and the info-gain feature selection technique was applied to improve the accuracy of the classification models; the best results were obtained with Logistic Regression, at 92.76% accuracy.
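The tree-building and majority-voting process can be sketched with scikit-learn's `RandomForestClassifier`; the two-cluster toy data below stands in for the real attributes:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy, well-separated data (illustrative only, not the hepatitis set).
X = np.array([[0.0], [0.2], [0.1], [5.0], [5.1], [4.9]])
y = np.array([0, 0, 0, 1, 1, 1])

# 50 trees are grown on bootstrap samples; each tree votes on a class,
# and the majority vote becomes the forest's prediction.
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
pred = clf.predict([[0.05], [5.05]])
```

Each test point is assigned the class most of the 50 trees agree on, which is the third process described above.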

K-Nearest Neighbors (KNN)
KNN performs classification in three stages. In step 1, it determines the value of K. In step 2, for each test sample, it calculates and sorts the distances to all training data. In step 3, a majority-voting approach is used to assign a class label to the test sample [12]. The Euclidean distance is calculated as presented in Equation (2):

d(x, y) = √( Σ_{i=1}^{n} (x_i − y_i)² )  (2)
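The three steps can be sketched directly; `knn_predict` is an illustrative helper, and the training points are toy values:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Step 2: Euclidean distance from the test sample x to every
    # training sample, as in Equation (2).
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # Step 3: majority vote among the k nearest neighbours (step 1
    # corresponds to choosing k).
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy training data: two points per class, well separated.
X_train = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 4.9]])
y_train = np.array([0, 0, 1, 1])
```

A test sample near one cluster receives that cluster's label, since most of its k nearest neighbours belong to it.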

Support Vector Machine (SVM)
SVM is considered a good classifier in terms of accuracy and generalization ability, but its limitation is its longer training time. To overcome this, various feature selection techniques have been developed that can be integrated with SVM to achieve better results on lower-dimensional data [13].
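One way such an integration might look is a scikit-learn pipeline that applies univariate feature selection before the SVM; the synthetic data, the choice of `SelectKBest` with `f_classif`, and `k=1` are all assumptions made for this sketch:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC

# Synthetic data: the first feature separates the classes, the second
# is pure noise (illustrative stand-in for high-dimensional input).
rng = np.random.default_rng(0)
informative = np.vstack([rng.normal(0, 0.1, (20, 1)),
                         rng.normal(5, 0.1, (20, 1))])
noise = rng.normal(0, 1, (40, 1))
X = np.hstack([informative, noise])
y = np.array([0] * 20 + [1] * 20)

# Keep only the single most informative feature before training the SVM,
# reducing the dimensionality (and hence the training cost).
model = make_pipeline(SelectKBest(f_classif, k=1), SVC()).fit(X, y)
```

The selector drops the noise column, so the SVM trains on a smaller, cleaner representation, which is the motivation described above.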

Result and Discussion
The nine classification algorithms were first run to determine the values that achieve the highest accuracy, and were then compared with the same algorithms tuned by random search, which randomly samples combinations of hyperparameter values from a predetermined search space. In the random search process, we trained and tested each model on several possible hyperparameter combinations; the comparison of the process before and after hyperparameter tuning is presented in Table 3. This means that a well-optimized model is able to achieve higher accuracy than the model before hyperparameter tuning was carried out. The results are presented in Figure 2, Figure 3, Figure 4, and Figure 5.
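The random search procedure can be sketched with scikit-learn's `RandomizedSearchCV`; the synthetic data merely mimics the shape of the hepatitis set (155 samples, 20 features), and the search space below is an assumed example, not the one used in the study:

```python
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Synthetic stand-in matching the dataset's shape: 155 samples, 20 features.
X, y = make_classification(n_samples=155, n_features=20, random_state=0)

# Random search samples hyperparameter combinations from a fixed space
# instead of exhaustively trying every one.
param_space = {"n_estimators": [25, 50, 100], "max_depth": [2, 4, None]}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_space, n_iter=5, cv=3, random_state=0,
).fit(X, y)
```

After fitting, `search.best_params_` holds the sampled combination with the highest cross-validated score, and `search.best_score_` is the "best score" reported in the comparison tables.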

Conclusion
In our research, we have seen the great importance of dealing with missing values in datasets, as well as of decisive feature selection, in improving the accuracy of classification models. To find the best classifier, we compared our classification models before and after random search (RS) hyperparameter tuning. To apply a machine learning model to a particular dataset, its hyperparameters must be set appropriately, and tuning is the process of finding the best hyperparameter values. The dataset we use is very small, and the randomly selected combinations are therefore limited in how well they represent the dataset. It is best to use a large dataset, because hyperparameter tuning is most effective when optimizing an ML model on ample data. In future research, we hope large datasets will be used, so that comparisons under hyperparameter tuning are more meaningful and feature selection is better.

Figure 1. Architecture of Methods Used

Table 3. Classification Using 20 Features with Missing Values Replaced by the Average, Divided by Data Split for Testing

Table 4. Classification Using 20 Features with Missing Values Replaced by the Average, with Random Search Hyperparameter Tuning

Table 4 presents the classification results using RS hyperparameter tuning to optimize the search for the best performance. From these results, random search hyperparameter tuning for the neural network succeeded in increasing the model accuracy to 83.87%, achieving a best score of 91.98%.