Genome feature optimization and coronary artery disease prediction using cuckoo search

ABSTRACT


INTRODUCTION
Cardiovascular diseases (CVD) are among the major causes of death, accounting for millions of deaths globally every year [1]. Acute myocardial infarction (MI) is the necrosis of myocardial tissue caused by reduced blood supply to the heart, and it results in millions of deaths [1]. Many scientific studies have focused on the diagnosis, prevention, and cure of MI, but optimal success has not yet been achieved in reducing the mortality resulting from MI. At present, the diagnosis of MI relies predominantly on clinical symptoms, such as difficulty in breathing, discomfort such as chest pain, abnormal electrocardiogram (ECG) results, and abnormal changes in the circulating levels of cardiac troponins (cTns) [2]. Despite many developments in the domain, current diagnostic systems still face limitations in attaining accurate analysis. For instance, contemporary high-sensitivity cTn (hs-cTn) assays have improved the detection of lower circulating Tn concentrations (increased sensitivity). However, a key constraint is the rise in false alarm rates: a greater number of non-diseased people are also flagged as prone, because cTn levels change due to other complications (this reflects reduced specificity) [3]. Another diagnostic approach uses cardiac miRNAs, which are considered sensitive biomarkers [4], but limitations such as low abundance, tissue-specific expression, and small size have impaired the reliability of the model. The role of biomarkers has become more significant with the invention of fast, improved, and automated detection systems [5]. Other studies aimed at defining biomarkers for diagnosis have considered C-reactive protein (CRP), brain natriuretic peptide (BNP), and other inflammatory markers; however, only marginal improvements in accuracy were attained [6][7][8].
Many of the earlier cardiac biomarkers were developed from domain knowledge of pathological and physiological aspects. Microarray platforms, in contrast, measure the expression of a large number of genes simultaneously, enabling gene expression profiling across varied pathways at once. This method can indicate a broad range of pathophysiological processes of CVD in a more economical and efficient manner [9]. Gene expression profiling goes beyond the biomarkers earlier reported to be associated with CVD and can identify further potential biomarkers. Gene expressions usually enable the discovery of insightful and more sensitive biomarkers that reflect CVD. The majority of studies in this area have reported significant results. A gene expression analysis aimed at discovering contemporary and sensitive biomarkers of CVD identified 482 genes associated with the composition of coronary atherosclerotic plaques, the majority of which had never before been linked to atherosclerosis [10]. Wide-scale gene expression profiling of atherosclerotic and non-atherosclerotic human coronary arteries identified 56 divergent genes, of which 49 had been associated with coronary artery disease (CAD) earlier [11]. In [12], the authors identified a set of classifying genes based on demographics, which strongly indicated obstructive CAD in non-diabetic patients. A divergent range of gene expressions was identified that differentiated ischemic from non-ischemic cardiomyopathy in end-stage patients [13,14]. In [15], the authors used microarray analysis and gene expression profiling to discover genes related to heart failure, based on the expression profiles of patients with heart failure complications. In [16], a study of normal controls and MI patients found genetic markers and deregulated pathways associated with disease recurrence in first-time MI patients.
The efficacy with which the blood transcriptome reflects transcriptional changes in the heart determines the accuracy of diagnosis. In [17], the authors report a genome-wide survey, using microarrays and expressed sequence tags, that compared the peripheral blood transcriptome to the transcriptomes of nine other human tissues, including the heart. More than 80% overlap was estimated at the tissue level, and 84% overlap with the heart, indicating that the peripheral blood transcriptome can be an economical and readily accessible proxy for gene expression in other tissues [17]. Though many studies have focused on differential expression in CVD outcomes, the authors of [18] used differential expression to classify patient outcomes. Such an approach can improve diagnosis by sub-classifying patients, and it provides discriminatory features for differentiating normal profiles from patients with MI, CAD, and unstable angina using gene expression in blood cells. Blood is an easily accessible tissue for transcriptome-based diagnosis, but the majority of such contributions incur computational overhead because a dense set of gene features is used in the learning process. This paper proposes a t-test dependent feature optimization model that effectively reduces the count of features used for analysis. The solution uses fewer features than many of the earlier models. Despite the limited feature set, the proposed solution attains accurate diagnosis with reduced false alarm rates.

RELATED WORKS
In [19], the study identified imbalance issues that arise in the use of microarray data because of noisy, high-volume, and irrelevant samples. To address these complexities, the researchers turned to swarm intelligence techniques. The study used ant colony optimization (ACO) sampling, developed from the ACO algorithm, to eliminate noisy and irrelevant features during feature selection. Support vector machine (SVM) classifiers were adopted because of their prominence in high-dimensional data classification even with small sample sets. Unstable classification performance in the cross-validation process is a major limitation. In [20], the authors adopted a hybrid model that selects optimal features using artificial bee colony (ABC) and classifies using SVM. ABC is used for clustering and selecting optimal features, which reduces the search space. Experimental studies show unstable accuracy under 10-fold classification. The model in [21] combines ACO and rough set theory (RST) to achieve an optimized feature count. The accuracy of feature selection is inversely proportional to the dimensionality of the feature set. In [22], the BAT algorithm is explored for reducing the dimensionality of the feature set and selecting optimal features.
In [23], a fuzzy model was developed whose rules depend on the relationships among the features, using a combination of the ACO and BAT techniques. The rules can also be used to select optimal features dynamically. A constraint of the model is that prior attributes must be selected first so that the dependent attributes can then be selected from the devised fuzzy rules. RST and BCO are combined in [24], where the emphasis is on clustering the features by phenotype or pattern to identify the optimal features. It uses locality sensitive discriminant analysis (LSDA) to reduce the dimensionality of the feature sets and then clusters the outcome using the fuzzy c-means (FCM) algorithm. FCM is used in combination with the ABC approach for feature similarity assessment while forming the clusters. Other contemporary feature optimization models include the binary bat algorithm with ABC [25], minimum redundancy and maximum relevance (M-RMR) [26], and particle swarm optimization with a decision tree [27]. M-RMR [26] is effective at removing noisy and irrelevant features in addition to reducing dimensionality.
To overcome the constraints observed in existing meta-heuristic, swarm intelligence-based feature selection models, three feature selection techniques, namely forward feature selection, forward feature inclusion, and backward feature elimination, are discussed in [28]. The experimental study indicates that forward feature selection is the best of the three strategies. However, in 10-fold classification with SVM, the maximum classification accuracy was limited to 89% and was not consistent across folds. The classifiers described above perform with varying efficacy, which is influenced by the pre-processing stages applied to the datasets. Pre-processing through feature selection can lead to better classifier performance. Reducing the features in a dataset is one of the critical issues facing a classifier. Many earlier feature selection or reduction techniques have shown that it can be a resourceful aid to classification. In addition, the accuracy and performance of classification depend on the quality of the feature selection technique adopted.

CORONARY ARTERY DISEASE PREDICTION FROM GENOME FEATURES USING CUCKOO SEARCH
This section proposes feature optimization for genome features and a heuristic scale, based on cuckoo search, for predicting CAD. First, the methods and materials used in the devised model are discussed. Next, the feature optimization method, an ANOVA standard termed bidirectional pooled variance estimation, is discussed. Finally, the search process and label prediction based on cuckoo search are discussed.

Methods and materials

The feature set
Of the total 25000 genes, 636 are related to CVD [29]; these are usually referred to as CAD genes. Evaluating the correlation among all 636 gene features entails high process complexity and causes a significant number of false alarms in prediction models. Hence, it is essential to keep the complexity linear and low while keeping false alarm rates minimal. Every record of the dataset used for the training and testing phases comprises the single nucleotide polymorphism (SNP) values of every gene, denoting the genetic variation of the genes. The initial length of every record is therefore 636 values, depicting the SNPs of all 636 genes listed as CAD genes. The initial dataset comprises records labelled either as prone to CVD or as salubrious, with no trace of CVD implications. The gene count has to be reduced from 636 to a considerably smaller number. An ANOVA standard termed bidirectional pooled variance estimation is adopted to reduce the dimensionality, optimize the gene count, and build the proposed scale. The details of bidirectional pooled variance estimation as used for feature optimization are explored in the following section.


Genome feature optimization and coronary artery disease prediction using cuckoo search… (E.Neelima) 109

Bidirectional pooled variance estimation
Each attribute of a record in the chosen dataset denotes one of the 636 CAD genes. Hence, every record comprises 636 SNP values, one for each gene. To reduce the number of genes considered as optimal features, the method examines, for every gene, the variation between the values observed in records labelled as prone and the values observed in records labelled as salubrious. A gene is an optimal feature if its values vary significantly between the chosen prone and salubrious records. To estimate the variance of the SNP values of a gene in the prone and salubrious records of the chosen training set, the method adopts the ANOVA standard bidirectional pooled variance estimation. The method was chosen based on the results reported in [30,31]. Bidirectional pooled variance estimation is applied to select optimal features from the records (both prone and salubrious) of the chosen training set. The differential value between two distinct vectors $v_1$ and $v_2$ is given by bidirectional pooled variance estimation as follows:

$$t = \frac{\langle v_1 \rangle - \langle v_2 \rangle}{\sqrt{s(v_1) + s(v_2)}}$$

In the equation above, $\langle v_1 \rangle$ and $\langle v_2 \rangle$ indicate the mean values of the vectors $v_1$ and $v_2$, which hold the SNP values of a gene in the records labelled as prone and salubrious, respectively, in the given training set.
The representations $s(v_1)$ and $s(v_2)$ signify the mean square distances of the vectors $v_1$ and $v_2$, respectively. The bidirectional pooled variance estimation is the ratio of the difference of the vector means to the square root of the sum of the mean square distances of the vectors. The p-value (degree of probability) [32] is then obtained from the t-table [33]. A p-value smaller than the probability threshold indicates that the vectors differ. Hence, the feature denoted by the respective vectors is an optimal feature.
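The score described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the mean square distance is read here as the mean of squared deviations from the vector mean, which is an assumption about the paper's definition.

```python
import math


def mean_square_distance(values):
    """Mean of squared deviations from the mean (assumed reading of s(v))."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def bpve_score(v1, v2):
    """Difference of the means over the square root of the summed mean square distances."""
    mu1 = sum(v1) / len(v1)
    mu2 = sum(v2) / len(v2)
    denom = math.sqrt(mean_square_distance(v1) + mean_square_distance(v2))
    if denom == 0:
        raise ValueError("both vectors are constant; the score is undefined")
    return (mu1 - mu2) / denom


# Toy SNP values of one gene (illustrative numbers, not data from the paper).
prone = [2.0, 4.0, 6.0]
salubrious = [1.0, 2.0, 3.0]
score = bpve_score(prone, salubrious)
print(round(score, 4))
```

The resulting score would then be looked up in a t-table to obtain the p-value, as the text describes.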

Cuckoo search
Nature-inspired meta-heuristic models are among the best algorithms for addressing optimization problems. The proposed work evaluates the fitness of a given gene vector against the CAD-prone and salubrious sets using the contemporary meta-heuristic model of cuckoo search (CS) [34]. CS is based on the obligate brood parasitism of cuckoo species, whose main characteristic is to lay their eggs in the nests of other bird species with relatively similar eggs. Three fundamental rules are derived from this nesting behaviour. First, a cuckoo egg denotes a solution to the problem and is dropped in a randomly chosen nest, with only one egg laid at a time. Second, the nests holding higher-quality eggs pass to the next generation. Third, the nest owner identifies a cuckoo egg with a probability in [0, 1]; when this occurs, the owner abandons the nest and builds another in a different location, while the total number of nests stays fixed. Not all of these rules are essential here, because the proposed model uses cuckoo search only to identify the fitness of the features of a given input gene record. Hence, the proposal is to build the nests in the traditional manner and to perform the search using a random approach. The traditional search drops only one egg in the chosen nest, whereas the proposed solution clones the egg, places one copy in every compatible nest, and estimates the fitness of every egg over the entire nest hierarchy.

The dataset
The dataset was generated from records of coronary artery susceptibility mode (NCBI GEO Dataset ID: GDS4527) and atherosclerotic CAD prone (NCBI GEO Dataset ID: GDS3690), gathered from the NCBI gene expression omnibus (NCBI GEO) [35], an authenticated gene expression dataset repository. The dataset GDS4527 comprises gene expressions of 20 subjects, of which 10 records are categorized as salubrious and the remaining 10 as prone to coronary artery disease. The dataset GDS3690 comprises 153 records, of which 66 are categorized as salubrious and the remaining 87 as prone to coronary artery disease. From the records of the two datasets, representing 173 subjects, the values observed for the CAD genes are collected as one record per subject. Statistics of the final datasets generated from this process are depicted in Table 1.

Optimizing genome features
As part of the partitioning process, the labelled records in the dataset are divided into two sets, containing the CAD-prone and the salubrious records, respectively. Each set takes the form of a matrix whose row count is the number of records and whose column count is the number of CAD genes, fixed at 636 [29]. Every row of the matrix is a vector of the SNP values of all CAD genes for an individual case, and every column holds the SNP values of a specific gene across the chosen cases. In the context of optimal feature selection, a gene is optimal if its vectors of SNP values differ between the prone and salubrious record sets. The bidirectional pooled variance estimation test is applied to the values of each gene in both labelled sets using the following process.
// Estimating the bidirectional pooled variance score by comparing the vector of the gene in the prone set with the vector of the same gene in the salubrious set
step 6: if the degree of probability identified for the score is less than the given probability threshold (usually 0.01, 0.05 or 0.1)
step 7: then the gene of the CAD genes set is deliberated as optimal and moved to the optimal gene set
step 8: End
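The column-wise selection loop described above might look as follows in Python. This is a hedged sketch under stated assumptions: all names are hypothetical, the score uses an assumed reading of the mean square distance, and instead of looking up a p-value the sketch compares the absolute score with a caller-supplied `critical_value` taken from a t-table, which is equivalent to a two-sided threshold test on the p-value.

```python
import math


def score(v1, v2):
    """Difference of means over sqrt of summed mean square distances (assumed form)."""
    mu1, mu2 = sum(v1) / len(v1), sum(v2) / len(v2)
    s1 = sum((x - mu1) ** 2 for x in v1) / len(v1)
    s2 = sum((x - mu2) ** 2 for x in v2) / len(v2)
    denom = math.sqrt(s1 + s2)
    if denom == 0:
        # Constant columns: identical means carry no signal, different means separate perfectly.
        return 0.0 if mu1 == mu2 else math.inf
    return (mu1 - mu2) / denom


def select_optimal_genes(prone, salubrious, critical_value):
    """Return the column indices whose absolute score exceeds the critical value.

    prone, salubrious: lists of records (rows); each column is one gene.
    critical_value: hypothetical t-table entry for the chosen probability threshold.
    """
    optimal = []
    for g in range(len(prone[0])):
        v1 = [row[g] for row in prone]       # SNP values of gene g in prone records
        v2 = [row[g] for row in salubrious]  # SNP values of gene g in salubrious records
        if abs(score(v1, v2)) > critical_value:
            optimal.append(g)
    return optimal


# Toy matrices with two genes (illustrative numbers, not data from the paper).
prone = [[5.0, 1.0], [6.0, 2.0], [7.0, 3.0]]
salubrious = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
selected = select_optimal_genes(prone, salubrious, critical_value=2.0)
print(selected)
```

In this toy input only the first gene differs between the two sets, so it is the only one retained.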

Cuckoo search for fitness assessment
This section explores fitness assessment through cuckoo search. The overall process includes nest formation and a hierarchical search that determines the fitness of the optimal features of a given record towards the CAD-prone and salubrious states. Nest formation, search, and label prediction are explored in the following sections.

Nest formation
To perform the cuckoo search, a hierarchy of nests is generated for each of the disease prone and salubrious sets. The optimal gene features represent the nests, such that each set of optimal gene features represents a unique nest in the hierarchy; such a set is referred to hereafter as a nest representative set. Each nest representative set contains more than one gene feature, and its gene features are highly correlated with regard to their SNP values in the records of the corresponding set. The prone and salubrious hierarchies are formed from these nest representative sets for the disease prone and salubrious sets, respectively. The eggs are the unique value sets: each egg holds the values of the gene features of a nest representative set as they occur in at least one record of the respective record set, and it is placed in the nest represented by that nest representative set.
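A minimal Python sketch of nest formation follows. It assumes the nest representative sets are already known and passed in as tuples of gene indices; how the highly correlated genes are grouped is not shown. The function and variable names are hypothetical.

```python
def build_hierarchy(records, nest_sets):
    """Return one nest (a set of unique eggs) per nest representative set.

    records: list of records, each a list of SNP values indexed by gene.
    nest_sets: list of tuples of gene indices (assumed to be precomputed).
    """
    hierarchy = []
    for genes in nest_sets:
        # An egg is the tuple of values a record holds for the genes of this nest;
        # the set keeps each distinct egg only once.
        eggs = {tuple(record[g] for g in genes) for record in records}
        hierarchy.append(eggs)
    return hierarchy


# Toy prone records over four optimal genes (illustrative values only).
prone_records = [
    [1, 0, 2, 1],
    [1, 0, 0, 1],
    [2, 1, 2, 1],
]
nest_sets = [(0, 1), (2, 3)]
prone_hierarchy = build_hierarchy(prone_records, nest_sets)
print(prone_hierarchy)
```

The same function would be called once with the prone records and once with the salubrious records to obtain the two hierarchies.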

Assessing fitness by nest search
The fitness of a given record is estimated from the number of compatible nests found in the respective hierarchies. For each nest, if any egg of the nest is identical to the values observed in the given record for the gene features of the corresponding nest representative set, the fitness of the record with respect to that hierarchy is incremented by 1. This yields the fitness of the record towards the disease prone and the salubrious state. Next, the fitness ratio of the record is measured for both hierarchies, as the fitness averaged over the number of nests in the corresponding hierarchy. Then the root mean square distance of the fitness values is measured for both hierarchies. These fitness ratios and root mean square distances are used to decide whether the given record is prone to coronary artery disease or not, as explored in the following section. The mathematical model to assess the fitness follows:
step 1: Let the nest representative sets (see sec 3.C) of the disease prone and salubrious hierarchies be given, such that each nest representative set contains a set of highly correlated features obtained from the optimal gene features discovered (see sec 3.B)
step 2: Let the given record contain the SNPs of all the optimal feature genes selected (see sec 3.B)
step 3: Let the eggs be the sets of values to place in nests, such that each egg contains the values observed in the record for the genes of the respective nest representative set.
// add 1 to the disease prone fitness of the given record related to the prone hierarchy if the egg is compatible to place in the corresponding nest of the prone hierarchy
// finding the root mean square distance of the salubrious fitness
step 10: using the similar process defined for the prone fitness rmsd calculation in step 6
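The fitness assessment for one hierarchy can be sketched as below. The fitness count and fitness ratio follow the description directly. The root mean square distance is an assumption: the surviving text does not give its formula, so the sketch takes it as the root mean square distance between the per-nest fitness values (1 for a compatible nest, 0 otherwise) and the fitness ratio. All names are hypothetical.

```python
import math


def assess_fitness(record, nest_sets, hierarchy):
    """Return (fitness, fitness ratio, rmsd) of a record for one hierarchy."""
    hits = []
    for genes, eggs in zip(nest_sets, hierarchy):
        egg = tuple(record[g] for g in genes)  # values of the record for this nest
        hits.append(1 if egg in eggs else 0)   # compatible nest adds 1 to the fitness
    fitness = sum(hits)
    ratio = fitness / len(hits)                # fitness averaged over the nest count
    # Assumed rmsd: spread of the per-nest fitness values around the fitness ratio.
    rmsd = math.sqrt(sum((h - ratio) ** 2 for h in hits) / len(hits))
    return fitness, ratio, rmsd


# Toy hierarchy with two nests (illustrative values only).
nest_sets = [(0, 1), (2, 3)]
prone_hierarchy = [{(1, 0), (2, 1)}, {(2, 1), (0, 1)}]
record = [1, 0, 2, 2]
fitness, ratio, rmsd = assess_fitness(record, nest_sets, prone_hierarchy)
print(fitness, ratio, rmsd)
```

Running the same function against the salubrious hierarchy gives the second pair of ratio and rmsd values used for labelling.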

Discovering the record state
The fitness ratios and root mean square distances (rmsd) obtained for the disease prone and salubrious hierarchies are used to label the given input record as prone to disease or salubrious. The label is defined using the conditional flow that follows:
step 1: if (prone fitness ratio ≅ salubrious fitness ratio) Begin
step 2: if (prone rmsd < salubrious rmsd) Begin
step 3: Label the record as disease prone
step 4: End // of step 2
step 5: Else if (prone rmsd > salubrious rmsd) Begin
step 6: Label the record as salubrious
step 7: End // of step 5
step 8: Else // of condition in step 5
step 9: Record state is ambiguous // since the fitness ratios and root mean square distances obtained for both hierarchies are the same
step 10: End // of step 1
step 11: Else Begin // of condition in step 1
step 12: if (prone fitness ratio > salubrious fitness ratio) Begin
step 13: Label the record as disease prone
step 14: End // of step 12
step 15: Else if (prone fitness ratio < salubrious fitness ratio) Begin
step 16: Label the record as salubrious
step 17: End // of step 15
step 18: Else Begin // of condition in step 15
step 19: Record state is ambiguous // since the fitness ratios and root mean square distances obtained for both hierarchies do not meet the prescribed conditions
step 20: End // of step 18
step 21: End // of step 11
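The conditional flow above can be written compactly in Python. This sketch assumes that the prone quantity is the left-hand operand in each comparison and that "approximately equal" means equality within a small absolute tolerance; the function name, the label strings, and the `tol` parameter are hypothetical.

```python
import math


def label_record(ratio_prone, ratio_salubrious, rmsd_prone, rmsd_salubrious, tol=1e-9):
    """Label a record from its fitness ratios and rmsd values for both hierarchies."""
    if math.isclose(ratio_prone, ratio_salubrious, rel_tol=0.0, abs_tol=tol):
        # Ratios tie: the hierarchy with the smaller rmsd decides (assumed operand order).
        if rmsd_prone < rmsd_salubrious:
            return "prone"
        if rmsd_prone > rmsd_salubrious:
            return "salubrious"
        return "ambiguous"
    # Ratios differ: the larger fitness ratio decides. The final ambiguous branch
    # of the flow cannot be reached for ordinary real-valued ratios.
    if ratio_prone > ratio_salubrious:
        return "prone"
    return "salubrious"


print(label_record(0.8, 0.5, 0.4, 0.5))
print(label_record(0.5, 0.5, 0.4, 0.4))
```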

Empirical analysis of the proposed model
The experimental study was conducted on the dataset explored in section 3.4. To explore the performance significance of the proposed model, which combines feature optimization by bidirectional pooled variance estimation with the cuckoo search classifier (BPVE&CS), its experimental results are compared with those of a contemporary model [28] that selects optimal features by forward feature selection and classifies using an SVM classifier (FFS&SVM). The statistics of the dataset used are depicted in Table 1, explored in section 3.4. Classification on this dataset was performed with both models in 4 folds. The performance of the proposed and the contemporary model is assessed using the classification assessment metrics [36] precision, sensitivity, specificity, and accuracy. The results obtained for both the proposed and the contemporary model follow. The prediction accuracy of both models observed in the experiments is depicted in Figure 1. The results in Figure 1 show that the classification accuracy of BPVE&CS is stable and substantially high, at greater than 93%, whereas that of FFS&SVM is inconsistent and less than 90%. Figures 2 and 3 depict the sensitivity and specificity of both models.
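For reference, the assessment metrics named above can be computed from the four confusion-matrix counts as in the following sketch. The function name is hypothetical, and the counts in the example are made-up illustrative numbers, not results from the paper. Fallout and miss rate are included as the complements of specificity and sensitivity.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from true/false positive and negative counts."""
    return {
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),              # true positive rate
        "specificity": tn / (tn + fp),              # true negative rate
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "fallout": fp / (fp + tn),                  # false positive rate
        "miss_rate": fn / (fn + tp),                # false negative rate
    }


# Hypothetical counts for one fold (illustrative only).
metrics = classification_metrics(tp=40, fp=5, tn=45, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```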

CONCLUSION
Gene expression arises from the combination of many genes among the thousands of genes defined until now. Among these thousands of genes, 636 are identified as cardiovascular related and are usually referred to as CAD genes. This gene count is still too high-dimensional for machine learning methods to learn cardiovascular-related information. Reducing the dimensionality of the CAD genes is therefore essential to improving the performance of machine learning applied to the CAD gene set. This manuscript presented a novel optimal feature selection technique that uses bidirectional pooled variance estimation (BPVE) for CAD prediction. The contemporary literature indicates that existing classifiers are unstable in classification accuracy and inconsistent in labelling individual records, so label prediction for an individual record suffers from a high false alarm rate. Considering this, a novel classifier is proposed in this article as a prediction scale. The classifier is built on the swarm intelligence technique called cuckoo search. The experimental study compared the results of the proposed model, BPVE&CS, with those of a contemporary model that selects features through forward feature selection and classifies using SVM (FFS&SVM). The study indicates that BPVE&CS reduces the dimensionality of the CAD genes better than the models found in recent literature: it identified 18 genes as optimal features, the best count compared to any of the contemporary models found in recent literature. Label prediction through the proposed classifier built on cuckoo search is consistent in classification accuracy and shows low fallout and missing rate with high sensitivity and specificity. The process completion time of the proposed model is also much lower than that of FFS&SVM and grows linearly. Future research can extend this work by exploring other ANOVA standards, such as the Wilcoxon signed-rank test and the entropy test, to reduce the dimensionality of the feature set.


ISSN: 2722-3221 Comput.Sci.Inf.Technol., Vol. 1, No. 3, November 2020: 106-115
Figures 2 and 3 depict the performance advantage of BPVE&CS over FFS&SVM in sensitivity and specificity, which refer to the significance of disease scope prediction and of salubrious state prediction, respectively. The proposed model clearly outperformed FFS&SVM in this regard. Figures 4 and 5 show the false negative rate (missing rate) and the false positive rate (fallout), which indicate the prediction failure rate for disease scope and salubrious state, respectively, observed for BPVE&CS and FFS&SVM. The depicted results show that the prediction failure rate for both disease scope and salubrious state is much lower for the proposed model than for FFS&SVM.

Table 2.
The notation used as row and column headers is: