# Introduction

DNA contains a large amount of information. For a DNA sequence to be transcribed into RNA, which copies the required information, a promoter is needed, so the promoter plays a vital role in DNA transcription. It is defined as "the sequence in the region upstream of the transcriptional start site (TSS)".

Identifying a new promoter in a DNA sequence can lead to the discovery of a new protein. Once the promoter region is identified, we can extract information about gene expression patterns, cell specificity and development, because promoters regulate gene expression. Genetic diseases associated with variations in promoters include asthma, beta thalassemia and Rubinstein-Taybi syndrome. A promoter sequence can also be used to control the rate at which a gene is expressed as protein, and promoters are used in genetically modified foods.


# II. Literature Review

Steven Salzberg [7] used a decision tree algorithm for locating protein coding regions. The algorithm is adaptable and can handle DNA sequences of length 54, 108 and 162. P. Maji et al. [8] developed a neural network tree classifier for predicting splice junctions and coding regions in genomic DNA. The decision tree, named NNTree (Neural Network Tree), is constructed by recursively dividing the training set, together with the corresponding labels, to generate a tree. Ying Xu et al. [9] developed GRAIL II, an improved hybrid AI system that can predict the number of exons in a human DNA sequence and also supports gene modeling. It combines edge signal detection (acceptor, donor and translation start sites) with coding feature analysis.

Eric E. Snyder et al. [10] applied dynamic programming and neural networks to predict protein coding regions in genomic DNA. Their program, GeneParser, first scores the DNA sequence using exon-intron specific measures such as local compositional complexity, codon usage, length distribution, 6-tuple frequency and periodic asymmetry. Edward C. Uberbacher et al. [11] proposed a method that combines a set of sensor algorithms with a neural network to predict protein coding regions in eukaryotes. The programs calculate the values of the seven sensors considered by the authors: frame bias matrix, Fickett (three-base periodicity), dinucleotide fractal dimension, coding six-tuple word preferences, coding six-tuple in-frame preferences, word commonality and repetitive six-tuple word preferences.

J. Pinho et al. [12] proposed a three-state model for protein coding region prediction that exploits the three-base periodicity property. M. Q. Zhang [13] used quadratic discriminant analysis, in a method named MZEF, to identify protein coding regions in human genomic DNA. David J. States et al. [14] proposed a computer program named BLASTC, which uses sequence similarity and codon utilization to predict protein coding regions.

In vertebrates, only about five percent of a gene is made up of exons. Genes mostly have seven to eight exons, with an average length of 145 bp, while introns have an average length of 3365 bp. Promoters comprise a small percentage of the entire genome, and their features differ from those of other functional regions such as exons, introns and 3'UTRs. These facts make protein coding region and promoter region prediction very difficult tasks.

Method [8] takes more time to construct a tree for sequences of length 162, and the height of the tree is a major concern when the algorithm is applied to longer DNA sequences. Method [9] suffers from low accuracy because of the high error rate at the classifier nodes. Methods [10], [11] and [12] depend heavily on statistical information. Following this survey, the goal for a new classifier is to achieve good accuracy and to handle DNA sequences longer than 162 with fewer nodes.

Jia Zeng et al. [15] proposed a hierarchical promoter prediction system named SCS, which uses signal, structure and context features. Xiaomeng Li et al. [16] proposed PCA-HPR (Principal Component Analysis-Human Promoter Recognition) to predict promoters and transcription start sites (TSS). Sridhar Hannenhalli et al. [17] tried to enhance the accuracy of promoter prediction by combining the CpG island feature with information from independent, biologically motivated signals, which together cover most of the knowledge used to predict promoters in the human genome. Shuanhu Wu et al. [18] proposed a method for improving human promoter region identification by selecting the most important features of the DNA sequence for each functional region. Uwe Ohler et al. [19] proposed a model that integrates physical properties of DNA into a probabilistic eukaryotic promoter prediction system. Goñi et al. [20] proposed ProStar, a system that uses structural parameters for promoter region identification; the authors used only descriptors derived from physical first principles.

Vladimir B. Bajic et al. [21] developed new software for identifying promoters in vertebrate DNA sequences. The program takes a DNA sequence as input and generates a list of predicted transcription start sites (TSS). Michael Q. Zhang [22] proposed CorePromoter, a program for predicting core promoters in human genes. Following this survey of promoter prediction, the main goal of the proposed classifier is to reduce the false prediction rate and to improve specificity and sensitivity.

Human promoter data sets are collected from the DBTSS database and consist of 30,966 sequences of length 251. We used 7,741 for constructing an IN-AIS-MACA tree and 7,741 for checking the accuracy of the tree. The remaining 15,483 promoter sequences are used for testing the proposed classifier.
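The build/check/test partition used for each data set follows a roughly 25/25/50 pattern. The sketch below is only an illustration of that arithmetic: the function name `split_counts` and the rounding rule (floor for the build and test portions, remainder for the check portion) are assumptions, not something the paper states. With this rule the exon (75,438) and intron (53,684) counts reported for the non-promoter data are reproduced exactly, while for the 30,966 promoters the check portion comes out as 7,742 rather than the stated 7,741 (the stated promoter counts sum to 30,965).

```python
from typing import Tuple


def split_counts(total: int) -> Tuple[int, int, int]:
    """Return (build, check, test) sizes for a roughly 25/25/50 split.

    Assumed rounding rule: floor for build and test, remainder for check.
    """
    build = total // 4             # sequences used to construct the tree
    test = total // 2              # sequences held out for final testing
    check = total - build - test   # remainder, used to check tree accuracy
    return build, check, test


# Data set sizes as reported in the text.
for name, total in [("promoters", 30_966), ("exons", 75_438), ("introns", 53_684)]:
    print(name, split_counts(total))
```

The remainder is assigned to the check portion only because that choice matches the reported exon split (18,859 / 18,860 / 37,719); other rounding rules are equally plausible.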


# III.


# Design of IN-AIS-MACA

Human non-promoter data sets are collected from the EID and UTRdb databases. We extracted 75,438 exons from the EID database: 18,859 are used for constructing an IN-AIS-MACA tree, 18,860 for checking the accuracy of the constructed tree, and the remaining 37,719 for testing the classifier. We extracted 53,684 introns from the EID database: 13,421 are used for constructing the tree, 13,421 for checking the accuracy of the constructed tree, and the remaining 26,842 for testing the classifier. We extracted 80,538 3'UTRs from UTRdb. In that 21 ...

No information regarding the reading frame is used in our study. We predict both regions in sequences where nothing is known in advance. Each window should belong to a single class (promoter/non-promoter, coding/non-coding).


# V. Learning of IN-AIS-MACA

Step 6: Store the basins (Be, Bi, Bu, Bp, Bpr).

Step 7: Repeat steps 1 to 6 until the input is exhausted or the count of individual attractor basins reaches 6.

Step 8: Stop.

Here Be represents the exon basins, Bi the intron basins, Bu the 3'UTR basins, Bp the promoter basins and Bpr the protein coding region basins.


# VI. Testing of IN-AIS-MACA

The accuracy of protein coding region prediction with IN-AIS-MACA depends on the accuracy of exon prediction. Because the promoter prediction module reports 96.5% accuracy, the accuracy of protein coding region prediction improves. The main aim of this algorithm is to process the DNA sequence based on its features and assign it to one of the basins.


# Algorithm:

Input: DNA sequence

Output: Class of the sequence

Step 1: Read the DNA sequence in multiples of three.

Step 2: Encode the sequence in multiples of three.

Step 3: Extract the features.

Step 4: Check whether the input belongs to the EXON class. If not, go to step 6. If it is an EXON, report the corresponding class and boundary.

Step 5: (a) Read the encoded DNA sequence starting from the upper bound to the end of the string.


# Data Sets and Methods


# Global Journal of Computer Science and Technology

Volume XIV Issue II Version I

Step 6: Check whether the sequence belongs to an intron, a 3'UTR or a promoter.

6a) Choose the best fitness rule to direct the sequence to the attractor basins Bi, Bu, Bp.

6c) Report the boundaries and the respective class.

Step 7: Stop.
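The control flow of the testing algorithm can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 2-bit nucleotide code in `CODE`, the helper names `codons`, `encode` and `classify`, and the idea of passing the exon check and the basin fitness rules in as plain callables are all assumptions made for the example. The paper does not specify its encoding or how the fitness rules are evaluated.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Assumption: a 2-bit code per nucleotide; the paper does not state its encoding.
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}

Boundary = Tuple[int, int]


def codons(seq: str) -> List[str]:
    """Step 1: read the sequence in multiples of three (trailing bases dropped)."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def encode(seq: str) -> str:
    """Step 2: encode the sequence codon by codon with the assumed 2-bit code."""
    return "".join(CODE[base] for codon in codons(seq.upper()) for base in codon)


def classify(
    seq: str,
    find_exon: Callable[[str], Optional[Boundary]],
    scorers: Dict[str, Callable[[str], Tuple[float, Boundary]]],
) -> Tuple[str, Boundary, str]:
    """Steps 4 to 6: try the exon class first, otherwise pick the fittest basin.

    `find_exon` and `scorers` are hypothetical stand-ins for the trained
    IN-AIS-MACA rules; each scorer returns (fitness, boundary).
    """
    exon = find_exon(seq)
    if exon is not None:
        _, upper = exon
        # Step 5(a): continue from the upper bound to the end of the string.
        return "exon", exon, seq[upper:]
    # Step 6a: direct the sequence to the basin with the best fitness.
    label = max(scorers, key=lambda name: scorers[name](seq)[0])
    # Step 6c: report the class and its boundary.
    return label, scorers[label](seq)[1], ""
```

For a 252-base input whose first exon is reported with upper bound 64, `classify` hands back the remaining 188 bases, which corresponds to building the tree only for the part of the sequence after the exon boundary.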


# VII. Output & Experimental Results of IN-AIS-MACA

Output 1, shown below, is for a DNA sequence of length 252 bp. The promoter prediction output indicates an initial exon at positions 30 to 64, so the protein coding interface processes positions 64 to 251. The subsequent internal and terminal exons are reported on both strands.
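To make the coordinates concrete, the sketch below slices the reported region out of the start of the Output 1 sequence and shows how a minus-strand read can be obtained. The helper names `region` and `reverse_complement` are hypothetical, and treating the reported end position as exclusive is an assumption; it is chosen here only because it reproduces the 34-base string printed next to the 30 to 64 prediction in Output 1.

```python
# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the sequence as read on the opposite strand."""
    return seq.translate(COMPLEMENT)[::-1]


def region(seq: str, start: int, end: int) -> str:
    """Bases from 1-based `start` up to, but excluding, 1-based `end` (assumed)."""
    return seq[start - 1:end - 1]


# First 79 bases of the Output 1 sequence.
PREFIX = (
    "GAATTCTTGTTGAGAAGGAATTGGGCTCAATGAAGTTCGG"
    "GGATATTCCAAGTGAATTATTCCAGTGAGTGTTATTCAG"
)

print(region(PREFIX, 30, 64))
print(reverse_complement(region(PREFIX, 30, 64)))
```

Exons reported with a "-" strand in Output 1 would be read from the reverse complement in the same way.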

Output 


# VIII. Comparison of the Performance of IN-AIS-MACA

IN-AIS-MACA builds on the strengths of the existing AIS-PRMACA design to predict both promoter regions (PR) and protein coding regions (PCR). The accuracy, Se, Sp and execution time of PR prediction with IN-AIS-MACA are the same as those of AIS-PRMACA reported in chapter 6, so here we report the accuracy, Se and Sp of PCR prediction with IN-AIS-MACA. The important challenge for IN-AIS-MACA is to reduce the total prediction time (TPT) for PCR and PR together, which is discussed in this section.

The performance of IN-AIS-MACA is measured with Se, Sp and accuracy, as shown in Table 1. We extended the decision tree (DT) and NNTree classifiers to accommodate DNA sequences of length 252 and compared our results with theirs. IN-AIS-MACA reports high sensitivity, specificity and accuracy of 0.934, 0.925 and 0.93 respectively. This improvement over AIS-MACA prediction for 252 bp DNA sequences is due to the classifier accuracy of AIS-PRMACA.


# Table 1 : IN-AIS-MACA Performance in PCR prediction

The more accurately AIS-PRMACA predicts the first exon, the more accurately IN-AIS-MACA predicts the PCR. AIS-PRMACA predicts exons with an accuracy of 94.5%, so PCR prediction with IN-AIS-MACA improves considerably, particularly for 252 bp DNA sequences. IN-AIS-MACA maintains a good balance between Se and Sp, with Se+Sp = 1.859. The decision tree performs poorly on 252 bp sequences because the tree built for predicting the PCR is tall; it reports an accuracy of 86.5%. NNTree performs better than the DT and reports 87.3% accuracy. The performance of both classifiers suffers when processing DNA sequences longer than 162.
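The measures in Table 1 can be related to a confusion matrix as sketched below. The paper does not spell out its formulas, so the standard definitions are assumed here: Se = TP/(TP+FN), Sp = TN/(TN+FP) and accuracy = (TP+TN)/(TP+TN+FP+FN). The function names are illustrative, and the `TABLE_1` dictionary simply restates the Se and Sp values reported for the three methods so that the Se+Sp column can be recomputed.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Se: fraction of true coding regions that are predicted as coding."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Sp: fraction of true non-coding regions predicted as non-coding."""
    return tn / (tn + fp)


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# (Se, Sp) as reported in Table 1.
TABLE_1 = {
    "IN-AIS-MACA": (0.934, 0.925),
    "Decision Tree": (0.851, 0.879),
    "Neural Network Tree": (0.876, 0.870),
}

for method, (se, sp) in TABLE_1.items():
    print(method, round(se + sp, 3))
```

The recomputed sums (1.859, 1.73 and 1.746) agree with the Se+Sp column of Table 1.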


# IX. Execution Time Comparisons with IN-AIS-MACA

The aim of IN-AIS-MACA is to predict both PCR and PR in human DNA sequences of length 252 bp. Since this is the first algorithm to handle predictions of both regions, we combined the better-performing existing algorithms in pairs and report the execution times of the individual predictions and of the total prediction. In the first combination we have used classifiers AIS-MACA


# IN-AIS-MACA Se, Sp, Accuracy Comparison


# Table 2 : IN-AIS-MACA total prediction time comparison

In the fourth combination we used the decision tree and AIS-PRMACA classifiers, which report a total prediction time of 1930 ms. In the fifth combination we used NNTree and AIS-PRMACA, which report a total prediction time of 1897 ms. In the sixth combination we used dicodon usage and AIS-PRMACA, which report a total prediction time of 1987 ms. The proposed classifier IN-AIS-MACA reports a total prediction time of 1031 ms, which is the best among all the classifiers reported in Table 2 and Figure 3. Identifying both PCR and PR with a minimum execution time leads to faster gene prediction.


# X. Parameters Manipulation for Higher Accuracies of IN-AIS-MACA

To achieve higher accuracies with IN-AIS-MACA when predicting protein coding regions and promoter regions, three important parameters have to be analyzed. The first parameter is the number of generations: higher accuracies should be reached with fewer generations. Figure 4 shows that the minimum number of generations required to achieve a higher accuracy for PCR prediction is 75.
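The total prediction times reported for the classifier combinations in Table 2 can be checked with a few lines of arithmetic. The sketch below assumes that the TPT of a combination is simply the PCR time plus the PR time, which is consistent with every combination row of Table 2; the dictionary and function names are illustrative. The integrated IN-AIS-MACA figure of 1031 ms is taken directly from the table and is not derived here.

```python
# (PCR time in ms, PR time in ms) for each combination, as listed in Table 2.
TABLE_2 = {
    "AIS-MACA & AIS-PRMACA": (796, 1031),
    "AIS-MACA & SCS": (796, 1121),
    "AIS-MACA & McPromoter": (796, 1025),
    "DT & AIS-PRMACA": (899, 1031),
    "NNTree & AIS-PRMACA": (866, 1031),
    "Dicodon Usage & AIS-PRMACA": (956, 1031),
}
IN_AIS_MACA_TPT_MS = 1031  # reported total for the integrated classifier


def total_prediction_time(pcr_ms: int, pr_ms: int) -> int:
    """Assumed TPT of a combination: the two predictions run one after the other."""
    return pcr_ms + pr_ms


totals = {name: total_prediction_time(*times) for name, times in TABLE_2.items()}
fastest = min(totals, key=totals.get)

for name, tpt in totals.items():
    print(f"{name}: {tpt} ms")
print("fastest combination:", fastest, totals[fastest], "ms")
print("ratio to IN-AIS-MACA:", round(totals[fastest] / IN_AIS_MACA_TPT_MS, 2))
```

Under this assumption the fastest two-classifier combination (AIS-MACA & McPromoter, 1821 ms) takes roughly 1.77 times as long as the integrated classifier.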


# Conclusion

We have successfully developed an integrated classifier that can predict both protein coding regions and promoter regions in human DNA sequences of length 252 bp. IN-AIS-MACA reports a sensitivity (Se) of 0.934, a specificity (Sp) of 0.925 and an accuracy of 93%, which makes it the best algorithm for predicting both PCR and PR. The important contribution of this classifier lies in predicting both regions with an execution time of 1031 ms, which will speed up gene prediction.
The partial design of IN-AIS-MACA is shown in Fig. 1. IN-AIS-MACA takes a DNA sequence as input and extracts its features. It first checks whether the given sequence belongs to an exon. If it does, the exact boundaries are displayed together with the non-promoter class, and these boundaries are used to trace the protein coding region starting from that boundary. Since the first exon boundary, say (P, Q), is already predicted, the algorithm reads the encoded DNA sequence starting from Q to the end of the string, say R. The IN-AIS-MACA tree is built only for a length R-Q for PCR prediction. If the input does not belong to an exon, it is checked whether it is an intron, a 3'UTR or a promoter, and the corresponding class and boundary are displayed.

![IN-AIS-MACA design](image-2.png "")

![Figure 1 :](image-3.png "Figure 1 :")


Output 1: DNA Sequence

GAATTCTTGTTGAGAAGGAATTGGGCTCAATGAAGTTCGGGGATATTCCAAGTGAATTATTCCAGTGAGTGTTATTCAG
CAATGGACGTGACTGTCGTTTGCCAGATCAGCAGAAGCCGAAAGGAATCCTTTCGGCTTCTGCTGATCTGGCAAAC4GACAGTCACGTCCATTGCTGAATAACACTCACTGGAATAATTCACTTGGAATATCCCCGAACTTCATTGAGCCCAATT
CCTTCTCAACAAGAATTC

Sequence Kiran_63jntuh, Length = 252 bp

Human Promoter Prediction

| Sequence | Start | End | Score | Sequence |
|---|---|---|---|---|
| Kiran_63jntuh | 30 | 64 | 0.61 | ATGAAGTTCGGGGATATTCCAAGTGAATTATTCC |

Non Promoter Sequence/Exon

| Sequence Name | Program | Type of Exon | Boundary (start) | Boundary (end) | Strand |
|---|---|---|---|---|---|
| Kiran_63jntuh | IN-AIS-MACA | First | 82 | 189 | + |
| Kiran_63jntuh | IN-AIS-MACA | First | 82 | 207 | + |
| Kiran_63jntuh | IN-AIS-MACA | First | 82 | 222 | + |
| Kiran_63jntuh | IN-AIS-MACA | Internal | 53 | 207 | + |
| Kiran_63jntuh | IN-AIS-MACA | First | 198 | 214 | + |
| Kiran_63jntuh | IN-AIS-MACA | First | 198 | 222 | + |
| Kiran_63jntuh | IN-AIS-MACA | First | 198 | 226 | + |
| Kiran_63jntuh | IN-AIS-MACA | First | 198 | 232 | + |
| Kiran_63jntuh | IN-AIS-MACA | First | 198 | 207 | + |
| Kiran_63jntuh | IN-AIS-MACA | First | 198 | 222 | + |
| Kiran_63jntuh | IN-AIS-MACA | Internal | 66 | 87 | + |
| Kiran_63jntuh | IN-AIS-MACA | Internal | 80 | 199 | + |
| Kiran_63jntuh | IN-AIS-MACA | Internal | 80 | 207 | + |
| Kiran_63jntuh | IN-AIS-MACA | Internal | 80 | 222 | + |
| Kiran_63jntuh | IN-AIS-MACA | Internal | 106 | 132 | + |
| Kiran_63jntuh | IN-AIS-MACA | Internal | 106 | 207 | + |
| Kiran_63jntuh | IN-AIS-MACA | Terminal | 106 | 136 | + |
| Kiran_63jntuh | IN-AIS-MACA | Terminal | 106 | 197 | + |
| Kiran_63jntuh | IN-AIS-MACA | Terminal | 111 | 136 | + |
| Kiran_63jntuh | IN-AIS-MACA | Terminal | 111 | 197 | + |
| Kiran_63jntuh | IN-AIS-MACA | Terminal | 167 | 193 | + |
| Kiran_63jntuh | IN-AIS-MACA | Terminal | 167 | 197 | + |
| Kiran_63jntuh | IN-AIS-MACA | Internal | 151 | 249 | - |
| Kiran_63jntuh | IN-AIS-MACA | Internal | 151 | 249 | - |
| Kiran_63jntuh | IN-AIS-MACA | Internal | 130 | 249 | - |
| Kiran_63jntuh | IN-AIS-MACA | Terminal | 194 | 249 | - |
| Kiran_63jntuh | IN-AIS-MACA | Terminal | 76 | 249 | - |
| Kiran_63jntuh | IN-AIS-MACA | Terminal | 72 | 249 | - |

Table 1 (IN-AIS-MACA performance in PCR prediction):

| Method | Se | Sp | Se+Sp | Accuracy |
|---|---|---|---|---|
| IN-AIS-MACA | 0.934 | 0.925 | 1.859 | 0.93 |
| Decision Tree | 0.851 | 0.879 | 1.73 | 0.865 |
| Neural Network Tree | 0.876 | 0.87 | 1.746 | 0.873 |

Figure: Se, Sp and accuracy of the standard methods (IN-AIS-MACA, Decision Tree, Neural Network Tree).
Table 2 (IN-AIS-MACA total prediction time comparison):

| Method | Execution time to predict PCR (ms) | Execution time to predict PR (ms) | Total Prediction Time (TPT) (ms) |
|---|---|---|---|
| IN-AIS-MACA | 1031 | 1031 | 1031 |
| AIS-MACA & AIS-PRMACA | 796 | 1031 | 1827 |
| AIS-MACA & SCS | 796 | 1121 | 1917 |
| AIS-MACA & McPromoter | 796 | 1025 | 1821 |
| DT & AIS-PRMACA | 899 | 1031 | 1930 |
| NNTree & AIS-PRMACA | 866 | 1031 | 1897 |
| Dicodon Usage & AIS-PRMACA | 956 | 1031 | 1987 |

Figure 3: Total prediction time (ms) for each method.
			© 2014 Global Journals Inc. (US)
			© 2014 Global Journals Inc. (US) pp. 241
		
		
# References

[1] Teresa K. Attwood. The Babel of bioinformatics. Science 290(5491), 2000.

[2] James W. Fickett, Chang-Shung Tung. Assessment of protein coding measures. Nucleic Acids Research 20(24), 1992.

[3] Riu Yamashita, Yutaka Suzuki, Hiroyuki Wakaguri, Katsuki Tsuritani, Kenta Nakai, Sumio Sugano. DBTSS: database of human transcription start sites, progress report 2006. Nucleic Acids Research 34(suppl 1), 2006.

[4] Serge Saxonov, Iraj Daizadeh, Alexei Fedorov, Walter Gilbert. EID: the Exon-Intron Database, an exhaustive database of protein-coding intron-containing genes. Nucleic Acids Research 28(1), 2000.

[5] Graziano Pesole, Sabino Liuni, Giorgio Grillo, Flavio Licciulli, Flavio Mignone, Carmela Gissi, Cecilia Saccone. UTRdb and UTRsite: specialized databases of sequences and functional elements of 5' and 3' untranslated regions of eukaryotic mRNAs. Nucleic Acids Research 30(1), 2002.

[6] Pokkuluri Kiran Sree, Inampudi Ramesh Babu. AIX-MACA-Y Multiple Attractor Cellular Automata Based Clonal Classifier for Promoter and Protein Coding Region Prediction. Journal of Bioinformatics and Intelligent Control 3(1), 2014.

[7] Steven Salzberg. Locating protein coding regions in human DNA using a decision tree algorithm. Journal of Computational Biology 2(3), 1995.

[8] Pradipta Maji, Sushmita Paul. Neural Network Tree for Identification of Splice Junction and Protein Coding Region in DNA. In: Scalable Pattern Recognition Algorithms. Springer International Publishing, 2014.

[9] Ying Xu, R. Mural, M. Shah, E. Uberbacher. Recognizing exons in genomic sequence using GRAIL II. Genetic Engineering 16, 253, 1993.

[10] Eric E. Snyder, Gary D. Stormo. Identification of protein coding regions in genomic DNA. Journal of Molecular Biology 248(1), 1995.

[11] Edward C. Uberbacher, Richard J. Mural. Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. Proceedings of the National Academy of Sciences 88(24), 1991.

[12] Armando J. Pinho, António J. R. Neves, Vera Afreixo, Carlos A. C. Bastos, Paulo Jorge S. G. Ferreira. A three-state model for DNA protein-coding regions. IEEE Transactions on Biomedical Engineering 53(11), 2006.

[13] M. Zhang. Identification of protein coding regions in the human genome by quadratic discriminant analysis. Proceedings of the National Academy of Sciences 94(2), 1997.

[14] Warren Gish, David J. States. Identification of protein coding regions by database similarity search. Nature Genetics 3(3), 1993.

[15] Jia Zeng, Xiao-Yu Zhao, Xiao-Qin Cao, Hong Yan. SCS: Signal, context, and structure features for genome-wide human promoter recognition. IEEE/ACM Transactions on Computational Biology and Bioinformatics 7(3), 2010.

[16] Xiaomeng Li, Jia Zeng, Hong Yan. PCA-HPR: A principle component analysis model for human promoter recognition. Bioinformation, 373, 2008.

[17] Sridhar Hannenhalli, Samuel Levy. Promoter prediction in the human genome. Bioinformatics 17(1), 2001.

[18] Shuanhu Wu, Xudong Xie, Alan Wee-Chung Liew, Hong Yan. Eukaryotic promoter prediction based on relative entropy and positional information. Physical Review E 75(4), 041908, 2007.

[19] Uwe Ohler, Heinrich Niemann, Guo-Chun Liao, Gerald M. Rubin. Joint modeling of DNA sequence and physical properties to improve eukaryotic promoter recognition. Bioinformatics 17(1), 2001.

[20] J. Ramon Goñi, Alberto Pérez, David Torrents, Modesto Orozco. Determining promoter location based on DNA structure first-principles calculations. Genome Biology 8(12), R263, 2007.

[21] Vladimir B. Bajic, Seng Hong Seah, Allen Chong, Guanglan Zhang, Judice L. Y. Koh, Vladimir Brusic. Dragon Promoter Finder: recognition of vertebrate RNA polymerase II promoters. Bioinformatics 18(1), 2002.

[22] Michael Q. Zhang. Identification of human gene core promoters in silico. Genome Research 8(3), 1998.