Linear discriminant analysis and support vector machines for classifying breast cancer

Received Feb 22, 2020; Revised Dec 17, 2020; Accepted Feb 26, 2021

Breast cancer is an abnormal, uncontrolled growth of cells in the breast that forms a tumor. The tumor can be benign or malignant. A benign tumor is not cancerous and is usually not dangerous to health, whereas a malignant tumor is cancerous and can be dangerous to health. A specialist doctor diagnoses the patient and gives treatment according to whether the tumor is benign or malignant. Machine learning offers a time-efficient way to identify cancer cells: the machine learns patterns from the information in the dataset. Support vector machines and linear discriminant analysis are common methods for cancer classification. In this study, linear discriminant analysis and support vector machines are compared in terms of accuracy, sensitivity, specificity, and F1-score to determine which method is better at classifying a breast cancer dataset. The results show that the support vector machine performs better than linear discriminant analysis, with an accuracy of 98.77%.


INTRODUCTION
Breast cancer is one of the most common types of cancer that cause death. About half a million people, mostly women, die from it every year around the world. According to the WHO, in 2018 it accounted for approximately 15% of all cancer deaths among women [1]. Cancer cells grow and spread in breast tissue, such as the ducts that carry milk to the nipple, the lobules that produce breast milk, and other supporting tissue in the breast [2]. Treatment differs from patient to patient according to the class of the tumor, which is determined by a specialist doctor. Machine learning techniques are often used to detect breast cancer; they provide time efficiency and a more accurate diagnosis to help doctors diagnose patients. Machine learning methods that have been used to classify breast cancer include the normed kernel function-based fuzzy possibilistic C-means algorithm [3], sparse learning based fuzzy C-means [4], a deep learning approach [5], a combination of K-means, the fuzzy C-means algorithm, and a kernel function [6], a convolutional neural network [7], a hybrid deep neural network [8], and SVM with the Hough transform [9]. In this research, a breast cancer dataset is classified using linear discriminant analysis and support vector machines. Both methods perform well in disease diagnosis and classification.

RESEARCH METHOD

Dataset
This research uses the Wisconsin Diagnostic Breast Cancer (WDBC) dataset from the UCI Machine Learning Repository [10]. The dataset has 32 attributes and 569 instances, with no missing values. Of these instances, 357 are benign and 212 are malignant. Benign and malignant are the two values of the diagnosis attribute. All features are numerical; they were computed from a digitized image of a fine needle aspirate (FNA) of a breast mass and recoded with four significant digits. The features describe characteristics of the cell nuclei present in the image.
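As a minimal sketch of the record layout described above, the snippet below parses one comma-separated record into an ID, a diagnosis label, and a feature vector. It assumes the 32 attributes are ordered as ID, diagnosis (`M` or `B`), then 30 numerical features. The function name `parse_wdbc_record` and the sample row are hypothetical, and the feature values are made up for illustration rather than taken from the dataset.

```python
def parse_wdbc_record(line):
    """Split one comma-separated record into (id, diagnosis, features).

    Assumed layout: ID, diagnosis ('M' = malignant, 'B' = benign),
    followed by 30 numerical features (32 attributes in total).
    """
    fields = line.strip().split(",")
    if len(fields) != 32:
        raise ValueError(f"expected 32 attributes, got {len(fields)}")
    sample_id, diagnosis = fields[0], fields[1]
    if diagnosis not in ("M", "B"):
        raise ValueError(f"unknown diagnosis label: {diagnosis!r}")
    features = [float(value) for value in fields[2:]]
    return sample_id, diagnosis, features


# Made-up record for illustration only (not a row from the real dataset).
example_line = "000001,B," + ",".join(["1.234"] * 30)
sample_id, diagnosis, features = parse_wdbc_record(example_line)
print(sample_id, diagnosis, len(features))
```

Rejecting records with an unexpected field count or label is a design choice that makes a mismatch between the assumed and actual layout fail early.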

Linear discriminant analysis (LDA)
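Linear discriminant analysis (LDA) is a discriminant analysis method that can be used for classification and dimension reduction [11][12][13]. Its main purpose is to predict the most suitable class among the class labels [14]. For two groups, LDA seeks the linear coefficients $\beta$ that maximize the score function

$$S(\beta) = \frac{\left(\beta^{T}\mu_{1} - \beta^{T}\mu_{2}\right)^{2}}{\beta^{T} C \beta}$$

The score function is maximized by the estimated linear coefficients, which are calculated by the following formula:

$$\beta = C^{-1}(\mu_{1} - \mu_{2})$$

where $\beta$ denotes the linear model coefficients, $C$ denotes the covariance matrix pooled over the two groups, and $\mu_{1}$, $\mu_{2}$ denote the mean vectors of the two groups.

To measure how well the two groups are discriminated, the Mahalanobis distance is used:

$$\Delta^{2} = \beta^{T}(\mu_{1} - \mu_{2})$$

where $\Delta$ represents the Mahalanobis distance between the two groups. For the final step, a new sample is assigned to the first group if the following condition is satisfied, and to the second group otherwise [15][16]:

$$\beta^{T}\left(x - \frac{\mu_{1} + \mu_{2}}{2}\right) > -\log\frac{p(c_{1})}{p(c_{2})}$$

where $x$ represents the data vector and $p(c_{1})$, $p(c_{2})$ represent the class probabilities.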
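The two-class LDA procedure can be sketched in plain Python as follows. This is an illustrative sketch rather than the implementation used in this study. It assumes the common formulation in which the coefficients are the inverse pooled covariance matrix times the difference of the class means, and it is restricted to two features so that the 2x2 matrix can be inverted in closed form. The data are made-up toy points, and the names `fit_lda` and `predict_lda` are hypothetical.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance_2d(rows, mu):
    """2x2 covariance matrix of two-feature samples (divides by n)."""
    n = len(rows)
    cov = [[0.0, 0.0], [0.0, 0.0]]
    for row in rows:
        d = [row[0] - mu[0], row[1] - mu[1]]
        for i in range(2):
            for j in range(2):
                cov[i][j] += d[i] * d[j] / n
    return cov


def fit_lda(class1, class2):
    """Estimate coefficients, class means, priors and Mahalanobis distance."""
    n1, n2 = len(class1), len(class2)
    mu1, mu2 = mean_vector(class1), mean_vector(class2)
    c1, c2 = covariance_2d(class1, mu1), covariance_2d(class2, mu2)
    # Pooled covariance: size-weighted average of the class covariances.
    c = [[(n1 * c1[i][j] + n2 * c2[i][j]) / (n1 + n2) for j in range(2)]
         for i in range(2)]
    det = c[0][0] * c[1][1] - c[0][1] * c[1][0]
    c_inv = [[c[1][1] / det, -c[0][1] / det],
             [-c[1][0] / det, c[0][0] / det]]
    diff = [mu1[0] - mu2[0], mu1[1] - mu2[1]]
    # beta = C^{-1} (mu1 - mu2)
    beta = [c_inv[0][0] * diff[0] + c_inv[0][1] * diff[1],
            c_inv[1][0] * diff[0] + c_inv[1][1] * diff[1]]
    # Squared Mahalanobis distance between the groups: beta^T (mu1 - mu2)
    delta_sq = beta[0] * diff[0] + beta[1] * diff[1]
    return {"beta": beta, "mu1": mu1, "mu2": mu2,
            "p1": n1 / (n1 + n2), "p2": n2 / (n1 + n2),
            "delta_sq": delta_sq}


def predict_lda(model, x):
    """Return 1 if x is assigned to the first group, otherwise 2."""
    midpoint = [(a + b) / 2 for a, b in zip(model["mu1"], model["mu2"])]
    score = sum(b * (xi - m) for b, xi, m in zip(model["beta"], x, midpoint))
    threshold = -math.log(model["p1"] / model["p2"])
    return 1 if score > threshold else 2


# Toy data, made up for illustration only.
group1 = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (2.0, 1.0)]
group2 = [(6.0, 5.0), (7.0, 8.0), (8.0, 6.0), (7.0, 7.0)]
model = fit_lda(group1, group2)
print(predict_lda(model, (2.0, 2.0)), predict_lda(model, (7.0, 6.0)))
```

With equal group sizes the threshold is zero, so the rule reduces to checking on which side of the midpoint between the two means the projected sample falls.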

Support vector machines (SVM)
The support vector machine (SVM) is a supervised machine learning technique for classification and regression problems, proposed by Vapnik et al. in 1992. SVM is a computational algorithm that learns to assign labels to objects from experience and examples. SVM has been applied to medical diagnosis [17][18][19], weather prediction, finance [20], stock market analysis [21][22], and image processing [23]. Its fundamental feature is that it separates binary labeled data with a line that lies at the maximum distance from the labeled data [24]. To separate the labeled data, SVM uses a hyperplane that divides the space into classes and maximizes the margin between the classes lying on either side. Consider a dataset $\{(x_i, y_i)\}_{i=1}^{N}$, where $x_i \in \mathbb{R}^{d}$ is a feature vector, $y_i \in \{-1, 1\}$ is the class label for binary classification, and $N$ is the number of samples [25]. Since the goal of SVM is to find the best hyperplane $w^{T}x + b = 0$, it solves:

$$\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^{2} \quad \text{subject to} \quad y_i\left(w^{T}x_i + b\right) \ge 1, \quad i = 1, \dots, N$$

The decision function can be expressed as:

$$f(x) = \operatorname{sign}\left(w^{T}x + b\right)$$
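To make the roles of the hyperplane, the margin, and the decision function concrete, the sketch below evaluates a linear SVM for a given weight vector and bias. It does not train an SVM. The hyperplane `w`, `b` is hand-picked for made-up toy points, and the function names are hypothetical. It assumes the standard hard-margin formulation, in which every sample must satisfy `y * (w . x + b) >= 1` and the margin width is `2 / ||w||`.

```python
import math


def decision_value(w, b, x):
    """Signed score w^T x + b of sample x relative to the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x):
    """Decision function: +1 on the positive side of the hyperplane, else -1."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def satisfies_margin(w, b, samples, tol=1e-9):
    """Check the hard-margin constraints y * (w^T x + b) >= 1 for all samples."""
    return all(y * decision_value(w, b, x) >= 1 - tol for x, y in samples)


def margin_width(w):
    """Distance between the two margin boundaries, 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Toy samples (features, label), made up for illustration only.
samples = [((1.0, 1.0), -1), ((0.0, 1.0), -1),
           ((3.0, 3.0), 1), ((4.0, 3.0), 1)]
w, b = (0.5, 0.5), -2.0  # hand-picked hyperplane, not a trained model

print(satisfies_margin(w, b, samples))         # True
print(round(margin_width(w), 3))               # 2.828
print([predict(w, b, x) for x, _ in samples])  # [-1, -1, 1, 1]
```

The samples at (1, 1) and (3, 3) lie exactly on the margin boundaries, which is the role that support vectors play in a trained model.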

RESULTS AND ANALYSIS
This research used RStudio to run both methods, support vector machines and linear discriminant analysis. Figure 1 shows the result of linear discriminant analysis: the red graph is the group of samples diagnosed as benign and the blue graph is the group diagnosed as malignant. It can be said that linear discriminant analysis successfully classifies the dataset. According to Tables 1-2, 355 benign samples were classified correctly and 2 benign samples were incorrectly identified as malignant. The number of malignant samples classified correctly is 194 in Table 1 and 207 in Table 2. Accuracy, sensitivity, specificity, and F1-score were evaluated using 80% of the data for training and 20% for testing. The results are given in the following table.
From Table 3, support vector machines (SVM) perform better than linear discriminant analysis (LDA) on the reported percentages. Accuracy measures the proportion of samples that the model classifies correctly. The accuracy of support vector machines is 98.77%, which is higher than the 96.49% of linear discriminant analysis. Sensitivity is the probability that patients with cancer are diagnosed by the model. The sensitivity is 99.44% for both linear discriminant analysis and support vector machines. Specificity is the probability that patients without cancer are not diagnosed with it. The specificity is 97.64% for support vector machines and 91.51% for linear discriminant analysis. The F1-score is the harmonic mean of precision and sensitivity and gives a more balanced measure of model performance. The F1-score is 99% for support vector machines and 97.26% for linear discriminant analysis.
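The reported percentages can be checked against the counts given in the text. The sketch below makes two assumptions that are not stated explicitly in the paper: that Table 1 corresponds to LDA and Table 2 to SVM, and that benign is treated as the positive class. Under these assumptions the 212 malignant instances give 18 misclassified malignant samples for LDA and 5 for SVM, and the computed values agree with the reported ones. The function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, specificity and F1-score from confusion counts."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)   # true positive rate (recall)
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "f1": f1}


TOTAL_MALIGNANT = 212  # from the dataset description

# (benign correct, benign misclassified, malignant correct) from the text;
# the mapping of counts to methods is an assumption.
counts = {"LDA": (355, 2, 194), "SVM": (355, 2, 207)}

results = {}
for name, (tp, fn, tn) in counts.items():
    fp = TOTAL_MALIGNANT - tn  # malignant samples classified as benign
    results[name] = classification_metrics(tp, fn, tn, fp)
    print(name, {k: round(100 * v, 2) for k, v in results[name].items()})
```

Computed this way, the F1-score for SVM is 99.02%, which matches the reported 99% after rounding.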

CONCLUSION
According to the results, both the support vector machine and linear discriminant analysis perform well in terms of accuracy, sensitivity, specificity, and F1-score. Comparing the two methods on these results, it can be concluded that the support vector machine is better than linear discriminant analysis. Support vector machines have been widely used by researchers, especially for breast cancer classification, because of their good performance. Support vector machines are suggested to help doctors predict and classify diseases or similar datasets.