# Introduction

Data mining is the process of extracting knowledgeable information from huge amounts of data. It is an integration of multiple disciplines such as statistics, machine learning, neural networks and pattern recognition. Data mining extracts biomedical and health care knowledge for clinical decision making and generates scientific hypotheses from large medical data. Association rule mining and classification are two major techniques of data mining. Association rule mining is an unsupervised learning method for discovering interesting patterns and their associations in large databases, whereas classification is a supervised learning method used to find the class label of an unknown sample. Classification is defined as the task of learning a target function F that maps each attribute set A to one of the predefined class labels C [1]. The target function is also known as the classification model. A classification model is useful mainly for two purposes: 1) descriptive modeling and 2) predictive modeling.

An artificial neural network (ANN) is a simulation of the human brain and is being applied to an increasing number of real world problems. Using neural networks as a tool, we can mine knowledge from a data warehouse. ANN are trained to recognize, store and retrieve patterns and to solve combinatorial optimization problems. Their pattern recognition and function estimation abilities make ANN a prevalent utility in data mining. Their main advantage is that they can solve problems that are too complex for conventional technologies. Neural networks are well suited to problems like pattern recognition and forecasting. ANN are used to extract useful patterns from data and to infer rules from them. They are useful in providing information on associations, classifications and clustering.

Heart disease is a form of cardiovascular disease that affects men and women. Coronary heart disease is an epidemic in India and one of the major causes of disease burden and deaths. It is estimated that by the year 2012, India will bear 60% of the world's heart disease burden. Sixty years is the average age of heart patients in India, against 63-68 in developed countries. Within India, the state of Andhra Pradesh is at risk of more deaths due to CHD. Hence there is a need to combat heart disease. Diagnosis of heart disease in the early phase saves many lives.

Feature subset selection is a preprocessing step in machine learning used to reduce dimensionality and remove irrelevant data. It increases accuracy and thus improves result comprehensibility. PCA is the oldest multivariate statistical technique and is used by most scientific disciplines. The goal of PCA is to extract the important information from databases and express this information as principal components. The chi-square test evaluates the worth of an attribute by computing the value of the chi-squared statistic with respect to the class. In this paper we propose feature subset selection for classification of heart disease with a reduced number of attributes. Our approach improves classification and determines the attributes which contribute more towards the prediction of heart disease, which indirectly reduces the number of diagnostic tests that need to be taken by a patient.

In Section 2 we review basic concepts of neural networks, PCA, chi-square and heart disease. Section 3 deals with related work. Section 4 explains our proposed approach. Section 5 deals with results and discussion. We conclude with our final remarks in Section 6.

# II. Basic Concepts

In this section we discuss basic concepts of neural networks, PCA, chi-square and heart disease.
# a) Artificial Neural Networks

An ANN, also called a neural network, is a mathematical model based on biological neural networks. An artificial neural network is based on observation of the human brain [2]. The human brain is a very complicated web of neurons. Analogously, an artificial neural network is an interconnected set of three simple types of units, namely input, hidden and output units. The attributes that are passed as input to the network form the first layer. In medical diagnosis, a patient's risk factors are treated as inputs to the artificial neural network. Popular neural network algorithms are Hopfield networks, multilayer perceptrons, counterpropagation networks, radial basis function networks and self-organizing maps. The feed-forward neural network was the first and simplest type of artificial neural network and consists of three layers: an input layer, a hidden layer and an output layer. There are no cycles or loops in this network. A neural network has to be configured to produce the desired set of outputs. Basically there are three learning situations for a neural network: 1) supervised learning, 2) unsupervised learning and 3) reinforcement learning.

The perceptron is the basic unit of an artificial neural network used for classification where patterns are linearly separable. The basic model of the neuron used in the perceptron is the McCulloch-Pitts model. The perceptron takes an input vector and outputs 1 if the result is greater than a predefined threshold, or -1 otherwise. The proof of convergence of the algorithm is known as the perceptron convergence theorem. Figure 1 shows an ANN and Figure 2 shows the modeling of a Boolean function using a single-layer perceptron [1]. The output node is used to represent the model output; the nodes in a neural network architecture are commonly known as neurons. Each input node is connected to the output node via a weighted link, which is used to emulate the strength of the synaptic connection between neurons. The simple perceptron learning algorithm is shown below.

Step 1: Let D = {(xi, yi) | i = 1, 2, ..., n} be the set of training examples.
Step 2: Initialize the weight vector with random values, w(0).
Step 3: Repeat.
Step 4: For each training sample (xi, yi) in D.
Step 5: Compute the predicted output ŷi(k).
Step 6: For each weight wj do.
Step 7: Update the weight: wj(k+1) = wj(k) + (yi − ŷi(k)) xij.
Step 8: End for.
Step 9: End for.
Step 10: Until the stopping criterion is met.
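The learning rule in Steps 1-10 can be written as a short runnable sketch. The bipolar (+1/-1) output encoding follows the perceptron description above; the fixed unit learning rate, the bias absorbed into the weight vector and the use of the Boolean AND function as training data are illustrative assumptions rather than choices taken from the paper.

```python
# Minimal sketch of the perceptron learning rule described above (Steps 1-10).
import numpy as np

def train_perceptron(X, y, epochs=10):
    """Learn a weight vector with the update w <- w + (y - y_hat) * x."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])    # bias input absorbs the threshold
    w = np.random.uniform(-0.5, 0.5, X.shape[1])    # Step 2: random initial weights
    for _ in range(epochs):                         # Step 3: repeat
        for xi, yi in zip(X, y):                    # Step 4: for each training sample
            y_hat = 1 if np.dot(w, xi) > 0 else -1  # Step 5: predicted output
            w += (yi - y_hat) * xi                  # Step 7: weight update
    return w

# Boolean AND encoded with -1/+1 outputs (an assumed stand-in for Figure 2)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, -1, -1, 1])
print("learned weights:", train_perceptron(X, y))
```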
The main function of an artificial neural network is prediction. The effectiveness of artificial neural networks has been proven in medicine [3]. A noteworthy achievement of neural networks was their application to coronary heart disease [4]. There are numerous advantages of ANN; some of these include: 1) high accuracy, 2) independence from prior assumptions about the distribution of the data, 3) noise tolerance, 4) ease of maintenance and 5) the possibility of implementation in parallel hardware. Examples of areas where ANN are used include accounting, fraud detection and telecommunication. The performance of ANN can be improved by designing ANN with evolutionary algorithms and by developing neuro-fuzzy systems.

Medical diagnosis is an important yet complicated task that needs to be executed accurately and efficiently [5]. Feature subset selection, if applied in medical data mining, will lead to accurate results. The detection of disease from several factors is a multi-layered problem and sometimes leads to false assumptions frequently associated with erratic effects. Therefore it appears reasonable to apply feature subset selection in medical data mining to assist the diagnosis process. In this paper we apply feature subset selection with neural networks for heart disease prediction for the Andhra Pradesh population.

# b) Feature Subset Selection

Feature subset selection is a preprocessing step commonly used in machine learning. It is effective in reducing dimensionality and removing irrelevant data, and thus increases learning accuracy. It refers to the problem of identifying those features that are useful in predicting the class. Features can be discrete, continuous or nominal. Generally features are of three types: 1) relevant, 2) irrelevant and 3) redundant. Feature selection methods fall into filter, wrapper and embedded models. Filter models rely on analyzing the general characteristics of the data to evaluate features and do not involve any learning algorithm, whereas wrapper models use a predetermined learning algorithm and use its performance on the provided features in the evaluation step to identify relevant features. Embedded models incorporate feature selection as a part of the model training process. Data from medical sources are highly voluminous in nature, and many factors affect the success of data mining on medical data; if the data are irrelevant or redundant, knowledge discovery during the training phase is more difficult. Figure 3 shows the flow of feature subset selection.

# i. Principal Component Analysis

Principal component analysis (PCA) is a statistical technique used in many applications such as face recognition, pattern recognition, image compression and data mining. PCA is used to reduce the dimensionality of data consisting of a large number of attributes. PCA can be generalized as multiple factor analysis and as correspondence analysis to handle heterogeneous sets of variables and qualitative variables respectively. Mathematically, PCA depends on the SVD of rectangular matrices and the eigendecomposition of positive semi-definite matrices. The goals of principal component analysis are: 1) to extract the most important information from the database, 2) to compress the size of the data while keeping only the important information, 3) to simplify the description of the data set and to analyze the variables and the structure of the observations, 4) simplification, 5) modeling and 6) outlier detection.

The procedure for principal component analysis is shown below.
Step 1: Obtain the input matrix.
Step 2: Subtract the mean from the data set in all dimensions.
Step 3: Calculate the covariance matrix of this mean-subtracted data set.
Step 4: Calculate the eigenvalues and eigenvectors of the covariance matrix.
Step 5: Form a feature vector.
Step 6: Derive the new data set.

# ii. Chi-Squared Test

One of the first steps in data mining and knowledge discovery is the process of eliminating redundant and irrelevant variables. There are various reasons for taking this step. The obvious reason is that going from a few hundred variables to a few variables makes the results easier to interpret. The second reason is the curse of dimensionality: if the dimensionality is large, a large training set is required. This is also known as the peaking phenomenon; as the dimensionality increases up to a certain point accuracy increases, but beyond that point classification accuracy drops [6]. The simplest way of determining relevant variables is to use the chi-square ($\chi^2$) technique. The chi-square technique applies to categorical variables; continuous variables are first discretized. Assume that a target variable is selected; every parameter is then checked with the chi-square technique to detect the existence of a relationship between the parameter and the target. Karl Pearson proved that the statistic

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

approximately follows a chi-squared distribution, where $O_i$ is the observed frequency and $E_i$ the expected frequency. If the data are given as a series of n numbers, then the degrees of freedom are n − 1. In the case of a binomial distribution the degrees of freedom are n − 1, in the case of a Poisson distribution n − 2, and in the case of a normal distribution n − 3 [7].

The following example illustrates a chi-square hypothesis test. The total numbers of automobile accidents per week in a certain community are 12, 8, 20, 2, 14, 10, 15, 6, 9 and 4. We have to verify whether accident conditions were the same during this 10-week period using the chi-square test. The expected frequency of accidents each week is 100/10 = 10. Null hypothesis H0: the accident conditions were the same during the 10-week period. Table 1 shows the chi-square computation: chi-square = 26.6 with 10 − 1 = 9 degrees of freedom, while the tabulated chi-square value is 16.9. Since the calculated chi-square is greater than the tabulated chi-square, the null hypothesis is rejected, i.e. the accident conditions were not the same during the 10-week period.
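The worked accident example above can be reproduced in a few lines. The 5% significance level (whose tabulated value is the 16.9 quoted in the text) is an assumption; scipy is used only to look up the critical value.

```python
# Reproduces the accident example: 10 weekly accident counts, expected
# frequency 10 per week under the null hypothesis of identical conditions.
import numpy as np
from scipy.stats import chi2

observed = np.array([12, 8, 20, 2, 14, 10, 15, 6, 9, 4])
expected = np.full(10, observed.sum() / 10)              # 100 / 10 = 10 per week

chi_sq = np.sum((observed - expected) ** 2 / expected)   # 26.6
dof = len(observed) - 1                                  # 9
critical = chi2.ppf(0.95, dof)                           # ~16.9 at the 5% level

print(f"chi-square = {chi_sq:.1f}, critical value = {critical:.1f}")
if chi_sq > critical:
    print("Reject H0: accident conditions were not the same over the 10 weeks")
```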
# c) Heart Disease

Coronary heart disease occurs when the arteries of the heart that normally provide blood and oxygen to the heart are narrowed or even completely blocked. Cardiovascular diseases account for high mortality and morbidity all around the world. In India, mortality due to CHD was 1.6 million in the year 2000, and by the year 2015, 61 million cases will be due to CHD [8]. Studies to determine the precise cause of death in rural areas of Andhra Pradesh have revealed that CVD causes about 30% of deaths in rural areas [9].

# i. Risk factors for heart disease

Some of the risk factors for heart disease are:
1) Smoking: smokers risk a heart attack twice as much as non-smokers.
2) Cholesterol: a diet low in cholesterol and saturated and trans fat will help lower cholesterol levels and reduce the risk of heart disease.
3) Blood pressure: high blood pressure can lead to a heart attack.
4) Diabetes: diabetes, if not controlled, can lead to significant heart damage, including heart attack and death.
5) Sedentary lifestyle: simple leisure-time activities like gardening and walking can lower the risk of heart disease.
6) Eating habits: a heart-healthy diet, low in salt, saturated fat, trans fat, cholesterol and refined sugars, will lower the chances of getting heart disease.
7) Stress: poorly controlled stress and anger can lead to heart attacks and strokes.

This epidemic may be halted through the promotion of healthier lifestyles; physical activity and traditional food consumption would help to mitigate this burden.

# d) Genetic Search

Generally, for feature subset selection, the search spaces are large; there are 2^204 possible feature combinations for the cloud classification problem. Search strategies such as genetic search are used to find feature subsets with high accuracy. Genetic algorithms are used for optimization problems and are well known as robust search techniques. A GA searches globally and obtains globally competitive solutions. Genetic algorithms are biologically motivated optimization methods which evolve a population of individuals, where individuals that are more fit have a higher probability of surviving into the subsequent generation. A GA uses a set of evolutionary operators, namely selection of individuals, mutation and crossover. Many of these algorithms use a classifier as the fitness function. Figure 4 shows the working principle of a genetic algorithm; a small illustrative sketch of such a search is given at the end of Section III.

# III. Related Work

A number of research works have been carried out for the diagnosis of various diseases using data mining. Our approach is to apply feature subset selection and artificial neural networks for the prediction of heart disease. M.A. Jabbar et al. proposed a new algorithm combining associative classification and feature subset selection for heart disease prediction [5]; they applied symmetrical uncertainty of attributes and a genetic algorithm to remove redundant attributes. Enhanced prediction of heart disease using a genetic algorithm and feature subset selection was proposed by Anbarasi et al. [10]. Heart disease prediction using associative classification was proposed by M.A. Jabbar et al. [11]. Matrix-based association rule mining for heart disease prediction was proposed by M.A. Jabbar et al. [12]. Association rule mining and genetic algorithm based heart disease prediction was proposed in [13]. Cluster-based association rule mining for disease prediction was proposed in [14]. Sellappan Palaniappan et al. proposed an intelligent heart disease prediction system using Naïve Bayes, decision trees and neural networks in [15]. A graph-based approach for heart disease prediction for the Andhra Pradesh population was proposed by M.A. Jabbar et al. [16]; they combined the maximum clique concept in graphs with weighted association rule mining for disease prediction. Feature subset selection using FCBF in type II diabetes patients' data was proposed by Sarojini Balakrishnan et al. [17]. Heart disease prediction using associative classification and a genetic algorithm was proposed by M.A. Jabbar et al. [18]; in that paper they used a Z-statistic measure to filter the rules generated by the system.
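The genetic search of Section II(d), with the parameter values reported later in the experiments (crossover rate 0.6, mutation rate about 1%, a population of 20-30 and roulette wheel selection), can be sketched as a wrapper around a classifier whose accuracy serves as the fitness. The public data set and the decision tree used as a fast stand-in fitness classifier are illustrative assumptions; the paper itself couples the genetic search with an ANN.

```python
# Minimal genetic search over feature subsets: individuals are bit masks over
# the attributes, fitness is cross-validated classifier accuracy.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)       # assumed stand-in data set
n_features = X.shape[1]
pop_size, generations, p_cross, p_mut = 20, 15, 0.6, 0.01

def fitness(mask):
    if not mask.any():
        return 0.0
    clf = DecisionTreeClassifier(random_state=0)  # stand-in for the ANN fitness
    return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=3).mean()

population = rng.integers(0, 2, size=(pop_size, n_features))
for _ in range(generations):
    scores = np.array([fitness(ind) for ind in population])
    probs = scores / scores.sum()                      # roulette-wheel selection
    parents = population[rng.choice(pop_size, pop_size, p=probs)]
    children = parents.copy()
    for i in range(0, pop_size - 1, 2):                # single-point crossover
        if rng.random() < p_cross:
            point = rng.integers(1, n_features)
            children[i, point:], children[i + 1, point:] = (
                parents[i + 1, point:].copy(), parents[i, point:].copy())
    mutate = rng.random(children.shape) < p_mut        # bit-flip mutation
    population = np.where(mutate, 1 - children, children)

best = population[np.argmax([fitness(ind) for ind in population])]
print("selected feature indices:", np.flatnonzero(best))
```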
This paper proposes a new approach which combines feature subset selection with an artificial neural network to predict heart disease.

# IV. Proposed Method

In this paper we use PCA and chi-square as feature subset selection measures. These measures are used to rank the attributes and to prune irrelevant and redundant attributes. After applying feature subset selection, classification using ANN is applied on the data sets. PCA is a mathematical procedure that transforms a number of correlated attributes into a smaller number of uncorrelated variables called principal components. If $V$ is a set of $N$ column vectors of dimension $D$, the mean of the data set is

$$M_V = E\{V\} \tag{1}$$

The covariance matrix is

$$C_V = E\{(V - M_V)(V - M_V)^T\} \tag{2}$$

The components of $C_V$, denoted $C_{ij}$, represent the covariances between the random variable components $V_i$ and $V_j$; the component $C_{ii}$ is the variance of $V_i$. The eigenvectors $e_i$ and their corresponding eigenvalues $\lambda_i$ satisfy

$$C_V e_i = \lambda_i e_i, \quad i = 1, 2, \ldots, D \tag{3}$$

If $A$ is a matrix whose rows are the eigenvectors of the covariance matrix, the data are projected onto the principal components by

$$r = A(V - M_V) \tag{4}$$

and the original data vector $V$ can be reconstructed from $r$ as

$$V = A^T r + M_V \tag{5}$$

PROPOSED ALGORITHM
Step 1: Load the data set.
Step 2: Apply feature subset selection on the data set.
Step 3: Rank the attributes in descending order based on their value. A high value of PCA or $\chi^2$ indicates that the attribute is more related to the class. The lowest-ranked attributes are pruned and the subset with the highest values is selected.
Step 4: Apply a multilayer perceptron on the remaining features of the data set that maximizes the classification accuracy.
Step 5: Find the accuracy of the classifier. Accuracy measures the ability of the classifier to correctly classify unlabeled data. (A runnable sketch of these steps is given at the end of Section V.)

# V. Results and Discussion

We have evaluated the accuracy of our approach on various data sets taken from the Tunedit repository [19]. The attributes are ranked in descending order of their PCA and $\chi^2$ values; the higher the value, the more information the corresponding attribute carries about the class. We trained the classifier to classify the heart disease data set as either healthy or sick. The accuracy of a classifier is computed as

$$\text{Accuracy} = \frac{\text{No. of samples correctly classified in test data}}{\text{Total no. of samples in test data}}$$

Table 5 compares the accuracy of our method with various classification algorithms. The heart disease data set was collected from various corporate hospitals along with opinions from expert doctors. The attributes selected for the A.P. heart disease data set are shown in Table 7. Applying feature subset selection helps increase computational efficiency while improving accuracy. Figure 9 shows the parameters used for the multilayer perceptron; we set the learning rate to 0.3 and the training time to 500 epochs. For the genetic search, the crossover rate should be high, so we set it to 60%. The mutation rate should be very low; the best reported rates are about 0.5%-1%. A big population size usually does not improve the performance of a genetic algorithm, so a good population size is about 20-30. In our method we used the roulette wheel selection method. The comparison of the GA+ANN system with other classification systems is given in Table 12 and Figure 11. The results reveal that integrating GA with ANN performed well for many data sets, especially for the heart disease A.P. data set. We compared three feature selection methods in Table 13; GA works well for 6 data sets, and overall PCA with ANN performed better than the other classification methods.
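As a concrete end-to-end illustration of the proposed algorithm (Section IV, Steps 1-5), the sketch below ranks attributes with the chi-squared statistic, keeps the top-ranked subset, trains a multilayer perceptron on the reduced data and reports test accuracy. The public data set, the train/test split and the number of retained attributes are assumptions; the MLP settings loosely mirror the Weka parameters reported above (learning rate 0.3, momentum 0.2, 500 training epochs).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler

# Step 1: load the data set (a public stand-in for the heart disease data)
X, y = load_breast_cancer(return_X_y=True)
X = MinMaxScaler().fit_transform(X)          # chi2 requires non-negative inputs
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Steps 2-3: rank attributes by chi-squared value and keep the highest ranked
selector = SelectKBest(chi2, k=10).fit(X_tr, y_tr)
X_tr_sel, X_te_sel = selector.transform(X_tr), selector.transform(X_te)

# Step 4: multilayer perceptron on the reduced feature set
mlp = MLPClassifier(hidden_layer_sizes=(10,), solver='sgd',
                    learning_rate_init=0.3, momentum=0.2,
                    max_iter=500, random_state=1)
mlp.fit(X_tr_sel, y_tr)

# Step 5: accuracy = correctly classified test samples / total test samples
accuracy = (mlp.predict(X_te_sel) == y_te).mean()
print(f"classification accuracy: {accuracy:.3f}")
```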
# VI. Conclusion

In this paper we have proposed a new feature selection approach for heart disease classification using ANN and various feature selection methods for the Andhra Pradesh population. We applied different feature selection methods to rank the attributes which contribute more towards classification of heart disease, which indirectly reduces the number of diagnostic tests to be taken by a patient. Our experimental results indicate that ANN with feature subset selection provides, on average, better classification accuracy and dimensionality reduction. Our proposed method eliminates useless and distortive data. This research will contribute to a reliable and faster automatic heart disease diagnosis system, where easy diagnosis of heart disease will save lives. Coronary heart disease can be handled successfully if more research is encouraged in this area.

Figure 1: Example ANN
Figure 2: Modeling a Boolean function using a single-layer perceptron
Figure 3: Feature subset selection
Figure 4: Working principle of genetic algorithm
Figure 7: Comparison of classification accuracy
Consider the weather data set shown below. Attributes are ranked by applying principal component analysis and the chi-squared test.

| No. | Outlook | Temperature | Humidity | Windy | Play |
|-----|----------|-------------|----------|-------|------|
| 1 | sunny | hot | high | FALSE | no |
| 2 | sunny | hot | high | TRUE | no |
| 3 | overcast | hot | high | FALSE | yes |
| 4 | rainy | mild | high | FALSE | yes |
| 5 | rainy | cool | normal | FALSE | yes |
| 6 | rainy | cool | normal | TRUE | no |
| 7 | overcast | cool | normal | TRUE | yes |
| 8 | sunny | mild | high | FALSE | no |
| 9 | sunny | cool | normal | FALSE | yes |
| 10 | rainy | mild | normal | FALSE | yes |
| 11 | sunny | mild | normal | TRUE | yes |
| 12 | overcast | mild | high | TRUE | yes |
| 13 | overcast | hot | normal | FALSE | yes |
| 14 | rainy | mild | high | TRUE | no |

a) Correlation Matrix

1 -0.47 -0.56 0.19 -0.04 -0.14 -0.15 0.04
-0.47 1 -0.47 0.3 -0.23 -0.05 0 -0.09
-0.56 -0.47 1 -0.47 0.26 0.19 0.15 0.04
0.19 0.3 -0.47 1 -0.55 -0.4 -0.32 0.23
-0.04 -0.23 0.26 -0.55 1 -0.55 -0.29 -0.13
-0.14 -0.05 0.19 -0.4 -0.55 1 0.63 -0.09
-0.15 0 0.15 -0.32 -0.29 0.63 1 0
0.04 -0.09 0.04 0.23 -0.13 -0.09 0 1

Figure 6: Parameters of PCA

Attribute ranking by chi-squared value:

| Rank | Name of the attribute | Chi-square value |
|------|-----------------------|------------------|
| 1 | outlook | 3.547 |
| 2 | humidity | 2.8 |
| 3 | windy | 0.933 |
| 4 | temperature | 0.57 |

Parameters of the multilayer perceptron during training:
1) GM - False
2) Decay - False
3) Auto build - False
4) Debug - False
5) Hidden layer - a
6) Training time - 500
7) Learning rate - 0.3
8) Momentum - 0.2
9) Nominal to binary - True
10) Normalize - True
11) Reset - False
12) Seed - 0
13) Validation set - 0
14) Validation threshold - 20

Classification accuracy without and with feature subset selection:

| Data set | Without feature subset selection | With feature subset selection |
|----------|----------------------------------|-------------------------------|
| Weather | 100 | 100 |
| Pima | 98.69 | 98.82 |
| Hypothyroid | 95.94 | 97.08 |
| Breast cancer | 96.5 | 97.9 |
| Liver disorder | 95.07 | 85 |
| Primary tumor | 80.83 | 80 |
| Heart statlog | 97.4 | 98.14 |
| Lymph | 99.3 | 99.3 |
| Heart disease A.P | 100 | 100 |

Classification accuracy without and with feature subset selection, including the average over all data sets:

| Data set | Without feature subset selection | With feature subset selection |
|----------|----------------------------------|-------------------------------|
| Weather | 100 | 100 |
| Pima | 98.69 | 98.82 |
| Hypothyroid | 95.84 | 97.64 |
| Breast cancer | 95.84 | 97.64 |
| Liver disorder | 74.78 | 70 |
| Primary tumor | 80.82 | 83.18 |
| Heart statlog | 97.4 | 97.7 |
| Lymph | 99.3 | 100 |
| Heart disease A.P | 92.5 | 100 |
| Average | 92.7 | 93.8 |

Comparison of classification accuracy with other classifiers:

| Data set | J48 | Naïve Bayes | PART | Our method |
|----------|------|-------------|------|------------|
| Weather | 100 | 92.8 | 85.7 | 100 |
| Pima | 85.1 | 76.3 | 81.2 | 98.82 |
| Hypothyroid | 99.8 | 95.44 | 99.86 | 97.64 |
| Breast cancer | 75.87 | 75.17 | 80.06 | 97.64 |
| Liver disorder | 84.6 | 56.8 | 86.08 | 70 |
| Primary tumor | 61.35 | 56.04 | 61.35 | 83.18 |
| Heart statlog | 91.48 | 85.18 | 94.4 | 97.7 |
| Lymph | 93.23 | 87.16 | 95.27 | 100 |
| Heart disease A.P | 95 | 72.5 | 95 | 100 |
| Average | 87.3 | 77.4 | 86.5 | 93.8 |

Genetic algorithm parameters:

| Sl. no | Parameter | Value |
|--------|-----------|-------|
| 1 | Crossover rate | 0.6 |
| 2 | Mutation rate | 0.5%-1% |
| 3 | Population size | 20-30 |
| 4 | Selection | Basic roulette wheel selection |
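The chi-squared values in the attribute ranking above can be reproduced from the weather data with a short script. It builds a contingency table of each attribute against the class play and applies Pearson's chi-squared test without continuity correction; pandas and scipy are assumed to be available.

```python
# Reproduces the chi-squared attribute ranking on the weather data
# (outlook 3.547, humidity 2.8, windy 0.933, temperature 0.57).
import pandas as pd
from scipy.stats import chi2_contingency

data = pd.DataFrame({
    "outlook":     ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
                    "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"],
    "temperature": ["hot", "hot", "hot", "mild", "cool", "cool", "cool", "mild", "cool",
                    "mild", "mild", "mild", "hot", "mild"],
    "humidity":    ["high", "high", "high", "high", "normal", "normal", "normal", "high",
                    "normal", "normal", "normal", "high", "normal", "high"],
    "windy":       [False, True, False, False, False, True, True, False, False, False,
                    True, True, False, True],
    "play":        ["no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes",
                    "yes", "yes", "yes", "no"],
})

for attr in ["outlook", "temperature", "humidity", "windy"]:
    table = pd.crosstab(data[attr], data["play"])      # attribute vs. class counts
    stat, _, _, _ = chi2_contingency(table, correction=False)
    print(f"{attr}: chi-square = {stat:.3f}")
```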
# References Références Referencias

* Bela Shah, Prashant Mathur, "Surveillance of cardiovascular disease risk factors in India: The need and scope", Indian Journal of Medical Research, pp. 634-642, Nov 2010.
* Rajeev Gupta, "Recent trends in CHD epidemiology in India", Indian Heart Journal, 2008.
* M. Anbarasi et al., "Enhanced prediction of heart disease with feature subset selection using genetic algorithm", IJEST, 2(10), 2010.
* M.A. Jabbar et al., "Knowledge discovery using associative classification for heart disease prediction", AISC 182, Springer-Verlag, 2012.
* M.A. Jabbar et al., "Knowledge discovery from mining association rules for heart disease prediction", JATIT, 41(2), 2012.
* M.A. Jabbar, B.L. Deekshatulu, P. Chandra, "An evolutionary algorithm for heart disease prediction", CCIS, pp. 378-389, Springer-Verlag, 2012.
* M.A. Jabbar, B.L. Deekshatulu, P. Chandra, "Cluster based association rule mining for heart disease prediction", JATIT, 32(2), October 2011.
* Sellappan Palaniappan et al., "Intelligent heart disease prediction system using data mining techniques", IEEE, 2008.
* M.A. Jabbar, B.L. Deekshatulu, P. Chandra, "Graph based approach for heart disease prediction", LNEE, Springer-Verlag, 2012.
* Sarojini Balakrishnan et al., "Feature subset selection using FCBF in type II diabetes databases", ICIT, Thailand, March 2009.
* M.A. Jabbar, B.L. Deekshatulu, P. Chandra, "Heart disease prediction system using associative classification", ICECIT, Elsevier, 2012.
* Tunedit repository, www.tunedit.com
* Pang-Ning Tan et al., "Introduction to Data Mining", Pearson, 2009.
* Nong Ye (Ed.), "The Handbook of Data Mining", Lawrence Erlbaum Associates, 2003.
* Tsymbal et al., "Guest editorial: Introduction to the special section on mining biomedical data", IEEE Transactions on Information Technology in Biomedicine, 10, 2006.
* A. Liping, T. Lingyun, "A rough neural expert system for medical diagnosis", Services Systems and Services Management, 2005.
* M.A. Jabbar et al., "Prediction of risk score for heart disease using associative classification and hybrid feature subset selection", Proceedings of the 12th International Conference on Intelligent Systems Design and Applications (ISDA), Cochin, 2012.
* T.K. Iyengar et al., "Probability and Statistics", S. Chand Publishers, 2008.