# Introduction

The past few decades have seen rapid growth in biological information in the form of genomes, protein sequences, gene expression data, biological disorder data and many other medical diagnosis problems. Effective and efficient computational tools are needed to store, analyze and interpret these multifaceted data. The conventional techniques [1] of computational biology [2] apply mathematics, informatics, statistics and biochemistry to biological problems, usually at the molecular level. Major research efforts in the field include sequence alignment, gene finding, genome assembly, protein structure alignment, protein structure prediction, prediction of gene expression and protein-protein interactions, and the modeling of evolution. All these problems involve huge amounts of multifaceted data. For example, approximately 26 billion base pairs (bp) representing various genomes are available on the server of the National Center for Biotechnology Information (NCBI).

Computational biology [2] is concerned with the use of computation to understand biological phenomena and to acquire and exploit biological data, increasingly at large scale [9]. Methods from computational biology are increasingly used to augment or leverage traditional laboratory and observation-based biology. These methods have become critical because of recent changes in our ability and determination to acquire massive biological data sets, and because of the widespread, successful biological insights that have come from exploiting those data. This transformation from a data-poor to a data-rich field began with DNA sequence data but is now occurring in many other areas of biology. Bioinformatics involves the creation and advancement of algorithms using techniques from modern computer science, applied mathematics, statistics and biochemistry.
Hence, in other words, bioinformatics can be described as the application of computational methods to make biological discoveries [6].

Computational intelligence [3] has now become a well-established paradigm for solving complex problems that involve large-scale data with overlapping, inexact and ill-defined boundaries. Nowadays, researchers are developing new theories, grounded in sound biological understanding, for solving problems of molecular and computational biology [9]. Computational intelligence methods can perform a variety of tasks that are difficult or impossible with conventional mathematics, statistics and informatics [12]. To name a few, Tasoulis et al. [10] applied neural networks, evolutionary algorithms and clustering algorithms to the analysis of DNA microarray experimental data; Liang and Kelemen [11] proposed a time-lagged recurrent neural network with trajectory learning for identifying and classifying gene functional patterns from heterogeneous nonlinear time-series microarray experiments. In this paper we investigate machine learning through artificial neural networks, which proved more suitable for genomic and other biological data analysis.

The performance of gene prediction approaches [4] depends mostly on how effectively splice sites are detected. This paper proposes a system that uses an artificial neural network [6] to address the problem of splice site detection. The ANN treats it as a two-class problem and classifies a given sequence as either a donor or an acceptor site. It then predicts the splice form for a given sequence using the scores provided by the single-site detectors for every occurring AG and GT dimer. The challenge is to find a splice form that consistently combines all predictions.
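The search space of this problem can be sketched in a few lines: every GT dimer is a donor candidate, every AG dimer is an acceptor candidate, and a candidate intron pairs a donor with a later, non-overlapping acceptor. This is only an illustrative sketch; the function names and the example sequence are hypothetical, and the scoring of each candidate by the network's site detectors is deliberately left out.

```python
def candidate_sites(seq):
    """Return the start positions of GT (donor) and AG (acceptor) dimers."""
    seq = seq.upper()
    donors, acceptors = [], []
    for i in range(len(seq) - 1):
        dimer = seq[i:i + 2]
        if dimer == "GT":
            donors.append(i)
        elif dimer == "AG":
            acceptors.append(i)
    return donors, acceptors


def candidate_introns(donors, acceptors):
    """Pair each donor with every later acceptor that does not overlap it.

    Each pair (start, end) is an inclusive span that begins with GT
    and ends with AG.
    """
    return [(d, a + 1) for d in donors for a in acceptors if a >= d + 2]


# Hypothetical example sequence, not taken from any dataset in the paper.
example = "CAGGTAAGTTTCAGG"
donors, acceptors = candidate_sites(example)
introns = candidate_introns(donors, acceptors)
print("donor candidates:", donors)
print("acceptor candidates:", acceptors)
print("candidate introns:", introns)
```

In a full system, each candidate would carry a detector score, and the splice form would be chosen among combinations of these spans that do not contradict each other.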
The empirical analysis further revealed that the results are better when the data are analyzed in binary format than in other formats. The neural network is a standard three-layer feedforward network with 240 units at the input, 128 neurons in the hidden layer and two neurons in the output layer, corresponding to the donor and acceptor splice sites. The 240 input units arise because the orthogonal input scheme uses four inputs for each nucleotide in the window.

To provide useful insights into neural network applications in biological information analysis, the rest of the paper is structured as follows. Section 2 describes recent trends in biological information, which arrives in the form of genomes [4], protein sequences, gene expression, biological disorder data and medical diagnosis problems. Section 3 presents the artificial neural network technique involved in the classification and recognition process. Section 4 presents the empirical evaluation of different biological data analyses and the experimental outcomes. Finally, section 5 summarizes the paper with inferences and discussion.

# II. Recent Trends in Biological Information

Gene prediction [4] is an important task for much ongoing research in bioinformatics [5]. A gene is a set of instructions that governs the assembly and function of an organism: a region of DNA that controls a certain basic characteristic and ultimately leads to protein synthesis. Genomic data started becoming available in the 1990s. Since conventional mathematical models [8] proved unworkable for the analysis of biological information, bioinformaticians turned to computational intelligence [9] models for help in tasks such as gene finding and protein structure prediction.
Feature selection and class prediction, two learning tasks that are strictly paired in the search for molecular profiles from microarray data, have been performed with ANNs. ANN models have been shown to be a good choice, providing analysis and clues for biological information samples. Recently, proteomic data, considered potentially rich but arguably unexploited, have been used for genome annotation with ANNs, which show favorable performance compared to conventional mathematical models. The idea of using manifold learning for feature reduction combined with an ANN classifier has been successfully applied in biomedical diagnosis and protein identification.

Figure 1 shows how DNA expresses the gene product that it encodes. A certain region of the DNA is transcribed into RNA in the form of pre-mRNA. The introns of the pre-mRNA are then excised by splicing, leaving only the exons intact to form the mature mRNA. The ribosome then translates the mRNA into a polypeptide chain of amino acids that eventually becomes a protein [1].

In splice site prediction in E. coli gene DNA sequences, one needs to identify the boundaries between exons (the parts of the DNA sequence retained after splicing) and introns (the parts of the DNA sequence that are spliced out) in a given DNA gene sequence. This problem therefore contains three classes: intron-exon (IE) boundaries (acceptors), exon-intron (EI) boundaries (donors), and sites that are neither donors nor acceptors (Neither). DNA splice sites (Figure 2) are boundaries where splicing occurs, found between the regions of DNA that code for gene products (exons) and those that do not (introns) [2]. The presence of introns in eukaryotic organisms is believed to be involved in exon shuffling (or alternative splicing), which is responsible for the higher diversity of gene products found in eukaryotic organisms compared with prokaryotic organisms [3].
A typical example of exon shuffling is the generation of antibodies against foreign antigens that may invade the host system. The dinucleotide AG marks the splice site bordering the transition from intron to exon (intron/exon border) going from 5' to 3', while GT is associated with the transition from exon to intron (exon/intron border). The GT dinucleotide is usually referred to as the "donor" whereas the AG dinucleotide is known as the "acceptor" [4].

# III. Neural Network for Biological Information Analysis

Neural networks have several unique characteristics and advantages as tools for the molecular sequence analysis problem. A very important feature of these networks is their adaptive nature, where "learning by example" replaces conventional mathematical techniques that are time-consuming, computationally intensive and sensitive to noise. The low complexity, robust performance and quick convergence of artificial neural networks (ANNs) are vital to their wide applicability. This adaptive nature makes such computational models [10] very appealing in application domains where one has little or incomplete understanding of the problem to be solved but where training data are readily available. Owing to the large number of interconnections between their basic processing units, neural networks are error-tolerant and can deal with noisy data. Neural network [12] architecture encodes information in a distributed fashion. This inherent parallelism makes it easy to optimize the network to deal with a large volume of data and to analyze numerous input parameters. Flexible encoding schemes can be used to combine heterogeneous sequence features for network input. Finally, a multilayer network is capable of capturing and discovering high-order correlations and relationships in input data.

Artificial neural networks [13] are "neural" in the sense that they may have been inspired by neuroscience, not necessarily because they are faithful models of biological neural or cognitive phenomena.
The output of an artificial neuron is

$$Y = f\left(\sum_{i=1}^{n} w_i x_i + b\right),$$

where $X = \{x_i,\ i = 1, 2, \ldots, n\}$ represents the inputs to the neuron and $Y$ represents the output. Each input $x_i$ is multiplied by its weight $w_i$, a bias $b$ is associated with each neuron, and their sum goes through an activation function $f$. A neural network is characterized by (1) its pattern of connections between the neurons (its architecture), (2) its method of determining the weights on the connections (its training, or learning, algorithm), and (3) its activation function.

In summary, the applications of ANNs in biological information processing have to be analyzed individually. ANNs have been applied to biological data to deal with issues that cannot be addressed by traditional algorithms or by other classification techniques. By introducing artificial neural networks, algorithms developed for processing and analysis often become more intelligent than conventional techniques. Neural networks are undoubtedly powerful tools for classification, clustering and pattern recognition. Although the internal weight and bias values of the neurons can be analyzed, and a network can be written out as a formula, networks are sometimes too large to be explained in a way that a human can easily understand. Despite this, they are still widely used in situations where a black-box solution is acceptable and where empirical evidence of their accuracy is sufficient for testing and validation.

# IV. Design of Learning Machine

It has been widely observed that, in comparison to other machine learning approaches [3], neural networks have many positive characteristics for a prospective user. The variety of network architectures and learning paradigms available, coupled with a theoretically limitless number of combinations of layer counts, connection topologies, transfer functions and neuron counts, makes ANNs incredibly flexible processing tools.
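As a minimal sketch of the neuron model Y = f(Σ w_i x_i + b) and of how such neurons stack into a feedforward network, the following pure-Python example may help. The sigmoid activation, the random weight initialization and all function names are assumptions made for illustration; only the layer sizes (240 inputs, 128 hidden neurons, 2 outputs) are the ones quoted in the Introduction, and no training is performed here.

```python
import math
import random


def sigmoid(z):
    """Logistic activation; an assumed choice for the activation function f."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(x, w, b, f=sigmoid):
    """Single neuron: weighted sum of the inputs plus bias, passed through f."""
    return f(sum(wi * xi for wi, xi in zip(w, x)) + b)


def layer(x, weights, biases, f=sigmoid):
    """One fully connected layer: every neuron sees the same input vector."""
    return [neuron(x, w, b, f) for w, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (hypothetical initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward(x, params):
    """Feed the input through each layer in turn."""
    for weights, biases in params:
        x = layer(x, weights, biases)
    return x


rng = random.Random(0)
params = [init_layer(240, 128, rng), init_layer(128, 2, rng)]
x = [0.0] * 240  # placeholder input window of the right length
out = forward(x, params)
print(len(out), out)
```

The three items that characterize a network map directly onto this sketch: `params` holds the architecture and the weights, a learning algorithm would adjust `params`, and `sigmoid` is the activation function.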
ANNs can be applied to data with almost any number of inputs and outputs, and are well supported in different programming languages and software suites. Through manual modification of weights prior to training, and through custom limitations on their modification during training, existing expert knowledge can be incorporated into their design and construction. Additionally, learning machines based on neural networks are usually computationally inexpensive to use after they have been trained, making them ideal for real-time applications where immediate output is desirable.

The neural networks used in this study (Fig. 3) are multi-layer networks whose neurons use a summation aggregation function [13]. They are feedforward connected and have three layers: an input layer, one hidden layer and an output layer. In gene prediction problems, the network input is a segment of nucleotides from the nucleotide sequence. The output consists of one unit, giving a real-valued output between 0.1 and 0.9. Using a threshold, this number is interpreted as a category assignment for the nucleotide in the input window. The networks were trained with the standard error backpropagation learning algorithm on two different tasks: (i) detection of coding nucleotides (versus non-coding nucleotides), and (ii) prediction of splice sites (defined as the first and last intron nucleotide, respectively). Neural networks thus possess strong potential for output prediction, as can be seen from their widespread use in designing learning machines for modeling and prediction.

# V. Empirical Evaluation of Biological Data Analysis

To estimate the strength and effectiveness of biological information across a wide spectrum of problems, one needs to analyze them with a standard computational intelligence technique. I have used the artificial neural network to support the motivation and to establish the significance of the work. Benchmark problems are standard enough to represent tasks such as classification and pattern recognition.
These benchmark problems also span tasks from different fields of importance, among them biological engineering, medicine and bioinformatics. In this section, I thoroughly evaluate several kinds of problems to show the importance of neural networks in characterizing and analyzing medical and biological information. The present investigation uses 5 datasets. The datasets containing primate splice-junction gene sequences and promoter gene sequences were used in both normalized and binary forms. We observed better error convergence with the binary form than with the normalized form. The other datasets, Heart SPECTF, BUPA Liver Disorders and Protein Localization Sites [15], are available in numeric form only and were therefore normalized during preprocessing. In all experiments, I divided the whole dataset into two parts: a training set and a testing set. Performance is analyzed in terms of parameters which are briefly defined as

In this paper we have used two genomic datasets and three biological disorder datasets. All these datasets are benchmarks and are available online for research and academic purposes.

1. E. coli promoter gene sequences (DNA) [11] are used to predict membership in the class of sequences with biological promoter activity. The dataset has a non-numeric attribute domain: each attribute is one of 'a', 'g', 't' and 'c' (a = adenine, g = guanine, t = thymine, c = cytosine). This dataset was also used by Harley, C. and Reynolds, R. (1987) in "Analysis of E. Coli Promoter Sequences", Nucleic Acids Research.
2. Primate splice junctions are the points on a DNA sequence at which superfluous DNA is removed during the process of protein creation in higher organisms. The splice-junction gene problem is to identify the boundaries between exons and introns in a given DNA gene sequence.
3. The Heart SPECTF dataset [14] is based on cardiac single photon emission computed tomography (SPECT) images.
Each patient is classified as normal or abnormal. The database was used in automated cardiac SPECT diagnosis.
4. The BUPA liver disorders dataset contains 345 instances, which are records of 345 males with excessive alcohol consumption. The first 5 attributes are all blood tests thought to be sensitive to liver disorders that might arise from excessive alcohol consumption. The last 2 attributes are not blood tests: one is the number of drinks taken per day and the other is a selector field.
5. The protein localization sites dataset [15] can be obtained from "Expert System for Predicting Protein Localization Sites in Gram-Negative Bacteria", Kenta Nakai & Minoru Kanehisa, PROTEINS: Structure, Function, and Genetics 11:95-110, 1991.

b) Learning Machine with Benchmark Datasets

i. Splice site prediction in E. coli gene DNA sequences

The E. coli promoter gene sequences (DNA) dataset from the University of Wisconsin Biochemistry Department has 106 instances with 59 attributes. The first attribute is one of {+, -}, indicating the class ("+" = promoter); the remaining fields are the sequence, each filled by one of the base pairs {a, g, t, c}. There are no missing attribute values. The class attribute has 53 positive and 53 negative instances.

In the first step we take the DNA sequences in which splice sites are to be analyzed. In the second step we apply sparse encoding to these sequences for preprocessing: A as (1000), C as (0100), G as (0010) and T as (0001). This encoding, also called BIN4 encoding, represents each letter by four binary digits and avoids algebraic dependencies between nucleotides. The neural network is trained with the error backpropagation method; the result is presented in Table 1 and discussed in section 6.

ii. Primate splice-junction gene sequences

All examples in this dataset are taken from GenBank 64.1 (ftp site: genbank.bio.net). There are three categories for splice site recognition: "ei", "ie" and "n". The dataset contains 3190 instances across the three classes: class 'EI' (donor) has 767 instances, class 'IE' (acceptor) has 768 instances, and the remaining 1655 instances belong to class 'N' (neither). For this dataset the standard result was 85%; we achieved an accuracy of 81% using the binary form of the dataset and 79% using the normalized form. The result is presented in Table 1.

In the protein localization sites dataset there are 336 instances, each with 8 attributes; one of them is a name and the others are predictive.
The 8 classes in this dataset correspond to the protein localization site in the bacterium. On the protein localization dataset the standard result was an accuracy of 84% with an ad hoc structured probability model, whereas we obtained an accuracy of 81% with the artificial neural network.

# VI. Inferences and Discussion

Research in bioinformatics is driven by experimental data. Current biological databases are populated by vast amounts of experimental data. Machine learning has been widely applied to bioinformatics and has had considerable success in this research area. At present, with various learning algorithms available in the literature [16], researchers face difficulties in choosing the best method for their data. We performed an empirical study and observed that a single learning network is usable in the same manner for splice site prediction, gene prediction, liver disorders and localization sites. The performance of the learning technique is highly dependent on the nature of the training data and on the design of the dataset. We conclude that the best results can be achieved when the dataset is in normalized as well as binary form.

In Table 1, for the dataset containing promoter gene sequences (DNA), we achieved an accuracy of 85% using the binary form of the dataset and 83% using the normalized form. We infer that the results vary with the form of the dataset because in binary form the variables A, T, G and C are converted into orthogonal vectors. For the dataset containing primate splice-junction gene sequences (DNA), we achieved an accuracy of 81% using the binary form and 79% using the normalized form.
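The inference about orthogonal vectors can be illustrated with a short sketch. It builds the sparse (BIN4) encoding given in section V and compares it with a scalar encoding of the four nucleotides. The scalar values 0.25, 0.50, 0.75 and 1.00 are an assumption made purely for illustration, since the paper does not state how the normalized form maps nucleotides to numbers; the function names are likewise hypothetical.

```python
import math

# Sparse (BIN4) encoding as given in the paper: one orthogonal vector per base.
ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}

# ASSUMPTION: a hypothetical scalar ("normalized") mapping; the paper does not
# specify the actual values used.
SCALAR = {"A": (0.25,), "C": (0.50,), "G": (0.75,), "T": (1.00,)}


def encode(seq, table):
    """Concatenate the code of every nucleotide in the window."""
    return [value for base in seq.upper() for value in table[base]]


def euclidean(u, v):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise_distances(table):
    """Distance between the codes of every unordered pair of bases."""
    bases = sorted(table)
    return {
        p + q: euclidean(table[p], table[q])
        for i, p in enumerate(bases)
        for q in bases[i + 1:]
    }


one_hot_d = pairwise_distances(ONE_HOT)
scalar_d = pairwise_distances(SCALAR)
print("one-hot:", one_hot_d)
print("scalar: ", scalar_d)
print("input length for a 60-nucleotide window:", len(encode("ACGT" * 15, ONE_HOT)))
```

Under the orthogonal encoding every pair of nucleotides is equally far apart, so the network is not given an artificial ordering of the bases, whereas a scalar encoding makes some pairs look more similar than others. A 60-nucleotide window yields 4 × 60 = 240 inputs, which matches the input size quoted in the Introduction.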
Past usage of this dataset indicates that machine learning techniques (neural networks, nearest neighbor, the contributors' KBANN system) have performed as well as or better than classification based on canonical pattern matching. On the SPECTF heart dataset we achieved an accuracy of 84%. On the protein localization dataset we obtained an accuracy of 81% with the artificial neural network. On the BUPA liver disorders dataset we obtained a mediocre accuracy of 77%. We conclude that we achieved these accuracies with a straightforward and convenient technique that does not require complicated computing operations. There may be techniques that yield slightly higher accuracy for the corresponding dataset, but our technique is computationally efficient. The overall analysis can be seen at a glance in Table 1.

![Figure 1 : The importance of splicing, which ultimately leads to the making of protein.](image-2.png "Figure 1 :")

![Figure 2 : Schematic representation of the splice site.](image-3.png "Figure 2 :")

![Figure 3 : Real biological neuron to artificial neuron](image-4.png "Figure 3 :")

![have been trained, making them ideal for real-time applications where immediate output is desirable.](image-5.png "E")

![Figure 4 : Learning machine design with artificial neural network V. Empirical Evaluation of Biological Data Analysis](image-6.png "Figure 4 :")

![Sites in Gram-Negative Bacteria", Kenta Nakai & Minoru Kanehisa, PROTEINS: Structure, Function, and Genetics 11:95-110, 1991. b) Learning Machine with Benchmark Datasets i. Splice site prediction in E. coli gene DNA sequences In E. coli promoter gene sequences (DNA) dataset from University of Wisconsin Biochemistry Department, there are 106 instances with 59 attributes. In these 59 attribute One of {+/-}, indicating the class ("+" = promoter) and second 2-60 remaining 59 fields are the sequence, starting at position filled by one of {a, g, t, c} base pairs. There is no missing attribute Values. In Class attribute there are 53 positive instances and 53 negative instances.](image-7.png "Localization")

iii. Heart SPECTF dataset

This dataset describes the diagnosis of cardiac single photon emission computed tomography (SPECT) images. It can be obtained from the University of Colorado at Denver, Denver, CO 80217, U.S.A. (krys.cios@cudenver.edu). The database was used by Kurgan, L.A., Cios, K.J., Tadeusiewicz, R., Ogiela, M. & Goodenday, L.S., "Knowledge Discovery Approach to Automated Cardiac SPECT Diagnosis", Artificial Intelligence in Medicine, vol. 23:2, pp. 149-169, Oct 2001.

Table 1

| S. No. | Name of database | No. of instances | No. of attributes | Training accuracy | Testing accuracy |
|---|---|---|---|---|---|
| 1 | Promoter gene | 106 | 59 | 85% | 83% |
| 2 | Primate splice-junction gene sequence | 3190 | 1595 | 81% | 85% |
| 3 | SPECTF-heart data | 267 | 23 | 87% | 84% |
| 4 | BUPA-liver disorders | 345 | 7 | 77% | 80% |
| 5 | Protein Localization Sites | 336 | 8 | 81% | 84% |

© 2014 Global Journals Inc. (US), Global Journal of Computer Science and Technology

## VII. Acknowledgements

I offer my sincere thanks to everyone who was directly or indirectly involved in my work, and who reviewed and encouraged it throughout.

* A Krogh, M Brown, I S Mian, Sjölander. Elsevier, 1994.
* H Kitano. Computational systems biology. Nature, 2002.
* J S R Jang, Sun, Mizutani. Neuro-fuzzy and soft computing: a computational approach to learning and machine intelligence. Automatic Control, 1997.
* N N Pati, Ivanova, G Mikhailova, Ovchinnikova. GenePRIMP: A gene prediction improvement pipeline for prokaryotic genomes. Nature ?, 2010.
* Pierre Baldi, Søren Brunak. Bioinformatics: The Machine Learning Approach, Second Edition (Adaptive Computation and Machine Learning). 2002.
* Zoheir Ezziane. Applications of artificial intelligence in bioinformatics. 2006.
* S Mitra, Y Hayashi. Bioinformatics with soft computing. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 36, 2006.
* D K Tasoulis, V P Plagianakos, M Vrahatis. Computational Intelligence Algorithms and DNA Microarrays. Studies in Computational Intelligence (SCI) 94, 2008.
* D K Tasoulis, V P Plagianakos, M Vrahatis. Computational Intelligence Algorithms and DNA Microarrays. Studies in Computational Intelligence (SCI) 94, 2008.
* Arpad Kelemen, Ajith Abraham, Yuehui Chen. Computational Intelligence in Bioinformatics. Studies in Computational Intelligence. Berlin Heidelberg: Springer-Verlag, 2008.
* C Harley, R Reynolds. Analysis of E. Coli Promoter Sequences. Nucleic Acids Research 15, 1987.
* C H Wu, J W McLarty. Neural Networks and Genome Informatics. Methods in Computational Biology and Biochemistry, 1st ed., vol. 1. Elsevier, 2000.
* M G Reese, F H Eeckman. Novel Neural Network Prediction Systems for Human Promoters and Splice Sites. 1995.
* L A Kurgan, K J Cios, R Tadeusiewicz, M Ogiela, L S Goodenday. Knowledge Discovery Approach to Automated Cardiac SPECT Diagnosis. Artificial Intelligence in Medicine 23(2), Oct 2001.
* Kenta Nakai, Minoru Kanehisa. Expert System for Predicting Protein Localization Sites in Gram-Negative Bacteria. PROTEINS: Structure, Function, and Genetics 11:95-110, 1991.
* Guidelines Handbook. Global Journals Inc. (US), 2014.