Classification of mammograms based on features extraction techniques using support vector machine

ABSTRACT


INTRODUCTION
Breast cancer is the most widespread disease affecting women during their lifetime and the second leading cause of death for women globally [1]. According to the most recent statistical data, released in 2015 in the USA, breast cancer accounts for 29% of new cancer cases and 15% of cancer deaths [2]. Early detection is the best way to widen treatment options, and several screening methods are available for breast cancer detection, such as biopsy, MRI, ultrasound, and mammography [3]. Mammography is considered a common method for detecting abnormalities in the early stages. It is a low-dose X-ray application that visualizes the inner breast structure [4]. Mammograms usually contain artifacts and noise that make them difficult to interpret in the initial stages. Therefore, standardizing image quality and extracting the region of interest (ROI) is necessary to narrow the search for distortions [5].
Visual examination of mammograms by a radiologist is a very common way to detect breast cancer, but it sometimes leads to a less accurate diagnosis because of radiologist stress and low image quality. Studies indicate an error rate of 10% to 30% in the diagnosis of malignant masses by radiologists who rely on visual inspection. The error rate decreases significantly with a computer-aided detection (CADe) system, which supports the final decision and serves as a second opinion alongside the radiologist's diagnosis when classifying breast tumors [6], [7]. Normally, a CADe system for breast cancer detection has five phases, as can be seen in Figure 1 [8].
To achieve this goal, a median filter was applied to remove noise and local contrast, and then a binary image obtained with a global threshold was used to remove small artifacts. In the segmentation phase, the HBBRG algorithm is used to remove the pectoral muscles [9], and then the largest possible square area that can be obtained from the breast is found, which represents the ROI [10]. In the feature extraction phase, three techniques were used (first-order statistics, GLCM, and LBP) to acquire strong texture features, which were fed into the SVM classifier in the classification phase.
To conduct this research and evaluate all its phases, a dedicated database was built that relies mainly on MIAS, which was proposed by the U.K. national breast screening program. This database includes 322 digitized mammograms: 208 normal and 114 abnormal, the latter subdivided into 51 malignant and 63 benign. Each image is 1024x1024 pixels in "PGM" format [14]. The database is available on the website http://peipa.essex.ac.uk/info/mias.html [12]. In addition, in coordination with the Teaching Oncology Hospital / Medical City / Baghdad, a set of images was obtained and added to the database. These images were pre-processed to match the image specification of the MIAS database.
Theoretical consideration
Many techniques and algorithms have been included in the stages of CAD systems. In this part, we look at some of the techniques used in our proposed method to extract and select features.

First order (statistical) features
Statistical texture analysis is based on the statistical characteristics of the intensity histogram, with no consideration of spatial dependences. The image histogram gives a statistical summary of the image, and first-order statistical information can be produced from it [13]. First-order features are a group of useful features that can be extracted directly from the image histogram in the spatial domain, based on pixel values only, including the mean, standard deviation (SD), variance, kurtosis, and skewness [14]. These features are calculated using the following equations [15], [16]:
$$\text{Mean} = \bar{g} = \sum_{g} g \, P(g)$$
$$\text{Standard Deviation} = \sqrt{\sum_{g} (g - \bar{g})^{2} \, P(g)}$$
where $g$ is the gray level in the image (0 to 255), $P(g)$ is the probability of gray level $g$ in the histogram, and $\bar{g}$ is the mean gray level in the image.

Local binary patterns (LBP) features
This texture descriptor is particularly suitable for real-time quality-control applications because it is fast and easy to implement. It was extended to color images by Mäenpää and Pietikäinen and has been used in numerous applications to solve classification problems based on color images [17], [18]. LBP encodes the relationship between a central pixel and its adjacent pixels as a micro-pattern. The examined window is divided into cells (for example, 8x8 or 16x16 pixels per cell). Every pixel in a cell is compared with each of its eight neighbors (upper left, middle left, lower left, upper right, and so on), as shown in Figure 2, which gives an example of the original LBP operator. The neighbors are followed along a circle, either clockwise or counterclockwise. In the example, the string '10000111' is obtained for a 3x3 block whose central pixel is 5, and this binary string is converted to its decimal equivalent, 135. LBP histograms are then built from the decimal values of all micro-patterns.
Let $I$ be the image intensity and $r = (x, y)^{T}$ a position vector in $I$. The LBP $b(r) \in \mathbb{R}^{N_n}$ is defined as
$$b_i(r) = \begin{cases} 1, & I(r) < I(r + \Delta s_i) \\ 0, & \text{otherwise} \end{cases} \qquad (i = 1, \ldots, N_n)$$
where $N_n$ is the number of neighboring pixels and $\Delta s_i$ is the displacement vector from the center pixel position $r$ to the $i$-th neighboring pixel [19].
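The LBP definition above can be illustrated with a minimal sketch in plain Python. The clockwise neighbor order starting at the upper-left pixel, the choice of the first neighbor as the most significant bit, and the 3x3 `block` are assumptions made for illustration; the block is hypothetical and is not the one from Figure 2, although it was chosen to give the same string '10000111'.

```python
# Neighbor offsets (row, col), clockwise from the upper-left neighbor.
# Starting point and direction are assumptions; the text only says the
# neighbors are followed along a circle.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image, row, col):
    """Return the 8-bit LBP code of the interior pixel at (row, col)."""
    center = image[row][col]
    code = 0
    for d_row, d_col in OFFSETS:
        # b_i = 1 if I(r) < I(r + delta_s_i), else 0 (definition in the text).
        bit = 1 if center < image[row + d_row][col + d_col] else 0
        code = (code << 1) | bit  # first neighbor ends up as the top bit
    return code


def lbp_histogram(image):
    """Count LBP codes over all interior pixels (256 bins)."""
    hist = [0] * 256
    for row in range(1, len(image) - 1):
        for col in range(1, len(image[0]) - 1):
            hist[lbp_code(image, row, col)] += 1
    return hist


# Hypothetical 3x3 block with central pixel 5.
block = [
    [9, 1, 2],
    [8, 5, 3],
    [7, 6, 4],
]
code = lbp_code(block, 1, 1)
print(format(code, "08b"), code)  # 10000111 135
```

This sketch produces a 256-bin histogram; the 59 LBP features used later in the paper would need an additional pattern mapping that is not shown here.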
Gray level co-occurrence matrix (GLCM) features
GLCM is also a statistical method, but one that takes into account the spatial relationship between pixels. It has been widely employed in numerous applications based on texture analysis [20]. With the GLCM, the probability that a pixel with gray level i occurs in a particular spatial relationship to a pixel with gray level j can be calculated. The size of the GLCM is determined by the number of gray levels in the image. The GLCM is computed at four angles (0°, 45°, 90°, and 135°) and a given displacement distance, as shown in Figure 3 [21]. In this paper, four features are extracted from the GLCM: energy, contrast, correlation, and homogeneity. These features are calculated using equations (6) to (10) [22], where n is the dimension of the co-occurrence matrix, p is the GLCM probability, and μx, μy, σx, and σy are the mean and standard deviation values.
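As a hedged illustration of the four GLCM features named above, the sketch below uses the commonly cited textbook definitions. They are assumptions rather than the paper's exact equations; in particular, the homogeneity variant with `1 + |i - j|` in the denominator is one of several in use. The helper names and the 2x2 toy image are hypothetical.

```python
import math


def glcm(image, d_row, d_col, levels):
    """Normalized co-occurrence matrix p[i][j] for one displacement.

    Assumes at least one pixel pair exists for the given displacement.
    """
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + d_row, c + d_col
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    total = sum(map(sum, counts))
    return [[n / total for n in row] for row in counts]


def glcm_features(p):
    """Energy, contrast, correlation, homogeneity (textbook forms, assumed)."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    mu_x = sum(i * v for i, _, v in cells)
    mu_y = sum(j * v for _, j, v in cells)
    sigma_x = math.sqrt(sum((i - mu_x) ** 2 * v for i, _, v in cells))
    sigma_y = math.sqrt(sum((j - mu_y) ** 2 * v for _, j, v in cells))
    if sigma_x * sigma_y == 0:
        correlation = float("nan")  # undefined for a constant image
    else:
        cov = sum((i - mu_x) * (j - mu_y) * v for i, j, v in cells)
        correlation = cov / (sigma_x * sigma_y)
    return {
        "energy": sum(v * v for _, _, v in cells),
        "contrast": sum((i - j) ** 2 * v for i, j, v in cells),
        "correlation": correlation,
        "homogeneity": sum(v / (1 + abs(i - j)) for i, j, v in cells),
    }


# Displacements for distance 1 at 0, 45, 90, 135 degrees (a common convention).
ANGLE_OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}

toy = [[0, 1], [1, 0]]  # hypothetical two-level image
features = glcm_features(glcm(toy, *ANGLE_OFFSETS[0], levels=2))
print(features)
```

For the toy checkerboard at 0°, every horizontal pair has different gray levels, so the contrast is 1 and the correlation is -1.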
Related works
Many researchers have worked on the early diagnosis of breast cancer to help radiologists detect abnormal tissue in mammograms. The most significant studies in this area are reviewed below.
In 2017, Harefa et al. [23] applied pre-processing to improve image quality and then a segmentation stage, depending on the database, to obtain the ROI. Features were extracted using GLCM at 0°, 45°, 90°, and 135° with a 128x128 block size. For classification, the study compared the KNN and SVM classifiers to achieve a higher level of accuracy. The results show that SVM outperforms KNN in classifying breast cancer abnormalities, with 93.88% accuracy.
In 2018, Sheba et al. [24] used a median filter for noise filtering and global thresholding for removing small artifacts in the pre-processing stage. BB was used to remove the pectoral muscles, and adaptive fuzzy-logic-based bi-histogram equalization to enhance the quality of the mammograms for better perception. The ROI was automatically selected and segmented from the mammogram using morphological operations and global thresholding. Shape, GLCM, and texture features were obtained from the ROI, and the optimum features were then chosen using a Classification and Regression Tree (CART). Finally, classification was carried out with a feed-forward ANN using backpropagation. The proposed approach achieved 96% accuracy.
In 2019, Mostafa, Shaimaa, et al. [25] used fewer features than previous research that relied on many feature sets, and applied several techniques to reduce dimensionality. KNN and ANN classifiers were used to classify these few features. The proposed system used 50 cases from the BAHEYA Foundation for Early Detection and Treatment of Breast Cancer, provided by doctors and radiologists in the hospital. The images are Contrast-Enhanced Spectral Mammograms (CESMs), which are clearer and show more contrast than typical mammograms. The results indicate that the ANN achieved 92% accuracy.
In 2019, Salman and Ibrahim [26] proposed a system for detecting potential cancer tumors in mammograms. Detection is performed by automatically segmenting breast images with a hybrid density slicing technique combined with the adaptive k-means algorithm and extracting the cancerous areas. GLCM features were used together with the proposed gray level density matrices (GLDM) features to detect abnormal tissue using MLP classifiers. Experimental results showed a significant improvement in breast cancer diagnosis accuracy, exceeding 91.17%.
In 2019, Vijayarajeswari et al. [27] presented a CAD system with features obtained using the Hough transform, a 2-D transform used to isolate features of a specific shape in an image. The study discusses strategies for feature extraction and classification. The Hough transform was used to detect mammogram image features, which were then classified using SVMs. The results showed that the suggested approach successfully classified the abnormal mammogram classes with an accuracy of 94%.

RESEARCH METHOD
The main goal of the present research is to build a classifier model that helps physicians and diagnostic experts by providing a second diagnostic view for a more reliable diagnostic decision.

Preprocessing
Mammograms often contain unwanted elements such as noise, small artifacts, and pectoral muscles. These should be removed because they greatly affect the results of the following stages, such as feature extraction. A median filter is used to remove noise and local contrast; it is a spatial-domain filter that replaces the central pixel of a block with the median value of the block's pixels. Small artifacts are eliminated by converting the image to binary format with a suitable threshold and then sorting the resulting components by area to isolate the small regions, which include numbers and labels. The results of this stage are illustrated in Figure 5.
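The two preprocessing steps described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the 3x3 window, edge replication at the borders, 4-connectivity, the rule of keeping only the largest component, and the toy images are all assumed. A practical system would use an image-processing library.

```python
from collections import deque
from statistics import median


def median_filter(image, size=3):
    """Replace each pixel with the median of its size x size block."""
    rows, cols, half = len(image), len(image[0]), size // 2
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Border pixels reuse the nearest valid pixel (edge replication).
            block = [
                image[min(max(r + dr, 0), rows - 1)][min(max(c + dc, 0), cols - 1)]
                for dr in range(-half, half + 1)
                for dc in range(-half, half + 1)
            ]
            out[r][c] = median(block)
    return out


def remove_small_artifacts(image, threshold):
    """Threshold, label 4-connected components, keep only the largest one."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] > threshold and not seen[r][c]:
                seen[r][c] = True
                queue, pixels = deque([(r, c)]), []
                while queue:  # breadth-first flood fill of one component
                    y, x = queue.popleft()
                    pixels.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < rows and 0 <= nx < cols
                                and not seen[ny][nx] and image[ny][nx] > threshold):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                components.append(pixels)
    out = [[0] * cols for _ in range(rows)]
    if components:
        for y, x in max(components, key=len):  # largest area = breast region
            out[y][x] = image[y][x]
    return out


# Hypothetical toy data: one impulse-noise pixel, and one isolated "label" pixel.
noisy = [[5, 5, 5], [5, 255, 5], [5, 5, 5]]
labeled = [
    [0, 0, 0, 0, 9],
    [0, 5, 5, 0, 0],
    [0, 5, 5, 0, 0],
    [0, 5, 5, 0, 0],
    [0, 0, 0, 0, 0],
]
print(median_filter(noisy))
print(remove_small_artifacts(labeled, threshold=1))
```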

Segmentation
Many segmentation algorithms have been used on medical images. In this paper, this stage removes the pectoral muscles and cuts the largest possible square from the mammogram, which represents the ROI. First, the HBBRG algorithm is used for pectoral muscle removal by combining BB and RG. The BB algorithm relies on the fact that the pectoral muscle is almost triangular and appears in the upper left or upper right corner of the breast contour, depending on whether it is the right or the left breast. The region growing (RG) algorithm is then applied by selecting a seed point that is automatically placed within the pectoral muscle limits; it also requires locating the maximal intensity distance between the seed point and its neighboring pixels. Finally, the two methods are combined to obtain the HBBRG mask, as shown in Figure 6.
Figure 6. Results of the HBBRG algorithm: (a) preprocessed image, (b) breast-only crop, (c) breast mask with values of one, (d) BB mask, (e) RG mask, (f) merged BB and RG masks (HBBRG mask), (g) improved HBBRG mask, (h) integration with the zero matrix, (i) inverse mask, (j) output image
In the second phase, the breast image is segmented to obtain the largest possible square area that can be cut from it. This process uses a geometric method: the image is converted to binary to find the mask that contains only the breast, with values of one. A reverse search then starts from the penultimate pixel in the lower right corner, compares it with its three neighbors (right, bottom, and diagonal) to find the smallest value among them, and increases that value by one. This process continues over the whole mask, and a square is then drawn with coordinates running from the smallest pixel to the largest value.
Finally, to remove the black background, all rows and columns whose sum equals zero are excluded. Algorithm 1 describes the process and Figure 7 illustrates the results.
Step 6: Find the pixel that contains the maximum value in the formed array, max(p(r, c)), then apply the following equation to find a square with white values: Mask3(c : c + p(c, r), r : r + p(c, r)) = 1
Step 7: for i = r to r + p(c, r)
Step 8: for j = c to c + p(c, r)
Step 9: IM(i, j) = Mask3(i, j) × IM(i, j)
Step 10: end for
Step 11: end for
Step 12: Sum the pixel values of every row and column, then find the minimum and maximum indices of the rows and columns whose sum is greater than zero
Step 13: ROI = IM(min row : max row, min column : max column)
End
where r is the row, c is the column, and Temp is a temporary store.
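The reverse search described for the second segmentation phase matches the classic dynamic-programming solution for the largest square of ones in a binary mask. The sketch below is one way to write it under that reading; it is not a transcription of Algorithm 1. The zero-padded border, the tie-breaking rule, and the toy `mask` and `image` are assumptions.

```python
def largest_square(mask):
    """Return (row, col, side) of the largest all-ones square in a binary mask."""
    rows, cols = len(mask), len(mask[0])
    # p[r][c] = side of the largest square of ones whose top-left corner is
    # (r, c). One extra row and column of zeros avoids border checks.
    p = [[0] * (cols + 1) for _ in range(rows + 1)]
    best = (0, 0, 0)
    for r in range(rows - 1, -1, -1):      # scan from the lower right corner
        for c in range(cols - 1, -1, -1):
            if mask[r][c]:
                # Smallest of the right, bottom, and diagonal neighbors, plus one.
                p[r][c] = 1 + min(p[r][c + 1], p[r + 1][c], p[r + 1][c + 1])
                if p[r][c] >= best[2]:     # ties resolve to the top-left square
                    best = (r, c, p[r][c])
    return best


def crop_roi(image, mask):
    """Cut the largest square covered by the mask out of the image."""
    row, col, side = largest_square(mask)
    return [line[col:col + side] for line in image[row:row + side]]


# Hypothetical breast mask and an image whose pixel values encode position.
mask = [
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 1, 0],
]
image = [[10 * r + c for c in range(5)] for r in range(4)]
print(largest_square(mask))   # (0, 1, 3)
print(crop_roi(image, mask))  # [[1, 2, 3], [11, 12, 13], [21, 22, 23]]
```

Because the crop is taken directly from the square's coordinates, the separate step of dropping all-zero rows and columns is not needed in this sketch.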

Features extraction
In this phase, the ROI is assigned a set of features that represent the properties of the tissue. These features are real numbers through which normal tissue can be distinguished from abnormal tissue, and malignant tissue from benign tissue. In this paper, after the region of interest (ROI) is found for each image, a feature vector of 73 features is calculated using three techniques. First, ten first-order features (the mean and SD of the mean, of the SD, of the variance, of the skewness, and of the kurtosis). Second, fifty-nine LBP features, which occupy positions eleven to sixty-nine of the vector. Finally, four GLCM features: contrast, correlation, energy, and homogeneity. All of these features are calculated with the aim of creating powerful texture features. The steps for extracting them are described in Algorithm 2.
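The first-order part of the feature vector rests on histogram statistics. A minimal sketch of the five base statistics is given below. The skewness and kurtosis are written as the usual standardized third and fourth moments, which is an assumption, and the kurtosis is not reduced by 3. How the ten "mean and SD of ..." features are aggregated is not specified in the text, so that step is not shown. The function name and toy ROI are hypothetical.

```python
import math


def first_order_features(pixels, levels=256):
    """Mean, SD, variance, skewness, kurtosis from the gray-level histogram."""
    hist = [0] * levels
    for g in pixels:
        hist[g] += 1
    prob = [count / len(pixels) for count in hist]  # P(g)

    mean = sum(g * p for g, p in enumerate(prob))
    variance = sum((g - mean) ** 2 * p for g, p in enumerate(prob))
    sd = math.sqrt(variance)
    if sd == 0:
        skewness = kurtosis = 0.0  # convention chosen here for flat regions
    else:
        # Standardized third and fourth central moments (assumed definitions).
        skewness = sum((g - mean) ** 3 * p for g, p in enumerate(prob)) / sd ** 3
        kurtosis = sum((g - mean) ** 4 * p for g, p in enumerate(prob)) / sd ** 4
    return {
        "mean": mean,
        "sd": sd,
        "variance": variance,
        "skewness": skewness,
        "kurtosis": kurtosis,
    }


# Hypothetical 2x2 ROI, flattened to a list of gray levels before the call.
roi = [[0, 0], [2, 2]]
stats = first_order_features([g for row in roi for g in row])
print(stats)
```

For the toy ROI the histogram is symmetric around gray level 1, so the skewness is 0 while the mean, SD, variance, and kurtosis all equal 1.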

Classification
Once the features are extracted, the appropriate ones are chosen and fed into the SVM classifier. SVM is applied in two levels of binary classification: the first level classifies the image features as normal or abnormal, and if the first-level result is abnormal, the second level classifies the features as benign or malignant. SVM is a supervised ML classifier; here, a reduced feature vector from the feature selection step is provided as input to the SVM classifier. It produces support vectors that identify the boundary between the two classes. These support vectors determine the position of the hyperplane, which has been tested with a variety of kernel functions. An infinite number of separating lines may be drawn; the objective is to find the "optimal" one, that is, the one with minimal classification error on previously unseen tuples. SVM approaches this problem by searching for the maximal marginal hyperplane. The optimal separating hyperplane is shown in Figure 8.
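The two-level decision logic can be sketched independently of the SVM training itself. In the sketch below, each level is a stand-in callable with a linear decision rule of the form sign(w·x + b), which is the form a trained linear SVM takes. The weights, biases, and two-element feature vectors are hypothetical and only illustrate the cascade; in practice each level would be a trained SVM model.

```python
def linear_decision(weights, bias):
    """Build a stand-in binary classifier: True when w.x + b > 0."""
    def decide(features):
        return sum(w * x for w, x in zip(weights, features)) + bias > 0
    return decide


def classify_mammogram(features, is_abnormal, is_malignant):
    """Two-level cascade: normal/abnormal first, then benign/malignant."""
    if not is_abnormal(features):
        return "normal"  # the second level is never consulted
    return "malignant" if is_malignant(features) else "benign"


# Hypothetical decision rules standing in for the two trained SVM levels.
level_one = linear_decision([1.0, 0.0], -0.5)  # abnormal if feature 0 > 0.5
level_two = linear_decision([0.0, 1.0], -0.5)  # malignant if feature 1 > 0.5

for sample in ([0.2, 0.9], [0.8, 0.1], [0.8, 0.9]):
    print(sample, classify_mammogram(sample, level_one, level_two))
```

Keeping the two levels as separate callables mirrors the description above, where the second classifier only ever sees images that the first level labeled abnormal.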

RESULTS AND DISCUSSIONS
The efficiency of machine learning approaches is evaluated using performance indices. A false-positive (FP) result may put the patient in a fragile position; however, it can be ruled out with complementary exams. A false-negative (FN) result is more worrying, because the individual has the lesion but the algorithm does not detect it [28]. A confusion matrix of the predicted and actual classes comprises the true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Here, positive/negative indicates the decision made by the algorithm, and true/false indicates whether that decision agrees with the actual clinical state. With two classes there are only four possible outcomes, which form the elements of the 2x2 confusion matrix of a binary classifier; see Figure 9 [29], [30].
Figure 9. An illustrative example of the 2x2 confusion matrix [29]
Six statistical metrics based on the confusion matrix are used to evaluate the efficiency of the proposed system: accuracy (ACC), error rate (ERR), sensitivity (SN), false-positive rate (FPR), specificity (SP), and precision (P). In the present research, the proposed system was applied to all images in the MIAS database; 70% of the images were used for the training phase and 30% for the testing phase, with random instances of image features drawn from the dataset over 100 iterations. The results show that the first-level SVM achieved average, best, and worst accuracies of 89.171%, 95.454%, and 79.293%, respectively; see Table 1. The results also show that the second-level SVM achieved average, best, and worst accuracies of 90.493%, 97.60%, and 80.342%, respectively, as shown in Figure 10.
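The six metrics listed above follow directly from the four confusion-matrix counts. The sketch below uses their standard definitions; the label encoding (1 for the positive class) and the example labels are hypothetical and unrelated to the reported results.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, TN, FN for a binary classifier."""
    tp = fp = tn = fn = 0
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def evaluation_metrics(tp, fp, tn, fn):
    """ACC, ERR, SN, FPR, SP, P; assumes every denominator is non-zero."""
    total = tp + fp + tn + fn
    return {
        "ACC": (tp + tn) / total,   # accuracy
        "ERR": (fp + fn) / total,   # error rate = 1 - ACC
        "SN": tp / (tp + fn),       # sensitivity (recall)
        "FPR": fp / (fp + tn),      # false-positive rate = 1 - SP
        "SP": tn / (tn + fp),       # specificity
        "P": tp / (tp + fp),        # precision
    }


# Hypothetical labels: 1 = abnormal (positive), 0 = normal (negative).
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
counts = confusion_counts(actual, predicted)
metrics = evaluation_metrics(*counts)
print(counts)   # (3, 2, 4, 1)
print(metrics)
```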
The evaluation metrics for the first-level classification, which classifies images as normal or abnormal, are analyzed in Figure 10, where the best and worst values obtained for this classifier can be observed. Table 2 shows the confusion matrix results of the second-level SVM. The evaluation metrics for the second-level classification, which classifies only the abnormal images as benign or malignant, are analyzed in Figure 11, where the best and worst values obtained for this classifier can be observed.
Figure 11. Results and performance analysis of the SVM for the second level
To better evaluate the proposed system, it was compared with a set of previous works across all the main stages of the system, the algorithms and techniques used, the level for which the system was designed, and the results obtained. These comparisons are shown in Table 3.
Table 3. Comparison of the proposed system with the related works

CONCLUSION
Mammography is the most effective method used in the early detection of breast cancer. The main objective of this work is to develop a CAD system for diagnosing mammogram images that assists doctors and diagnostic experts by providing a second viewpoint, which gives more confidence to the diagnostic process. The results showed that the median filter is an ideal filter for removing the noise and local contrast found in mammograms. The proposed algorithm for removing small artifacts achieved 100% success for this purpose. Removal of the pectoral muscles is the biggest obstacle in the processing of mammograms because they closely resemble tumors and the rest of the breast tissue, particularly in fatty and glandular types, and their presence has a great impact on the results of the following stages. A geometric segmentation method was proposed to cut the largest possible square representing a sample of breast tissue; it achieved 100% success with normal images and more than 98% with abnormal images, where the tumor lies within the cropped area. The proposed texture features are based on three techniques (first-order, LBP, and GLCM); these features give the algorithm more robustness because of their resistance to many variations in image conditions, which leads to the best discrimination potential for classifying the type of image. Finally, several directions for future work remain: extending the diagnostic model to diagnostic and prediction models; testing new segmentation methods that better identify the lesion and isolate it from the rest of the breast tissue during this stage, particularly in fatty and glandular images; and applying the proposed breast diagnosis model to other breast imaging and examination methods, such as MRI and CT.