# Use of Data Mining to Predict Human Diseases

Saumya Shandilya

Abstract-In this project, we intend to build an intelligent agent that asks users about their medical symptoms and predicts the most probable diseases or medical conditions that they might be suffering from. Based on the results, it can also direct the user to visit a pharmacy, consult a doctor, or seek emergency medical services. It is truly said that "Prevention Is Better Than Cure". Diseases such as cancers can show only very minor symptoms in their early stages, yet early detection could save a patient's life. It is better to seek preventive medical advice than to regret not doing so later. Artificial neural networks (ANNs) are currently an active research area in medicine, and they are expected to see extensive application in biomedical systems in the next few years. An application called the "Instant Physician" trained an auto-associative memory neural network to store a large number of medical records. After training, the network can be presented with input consisting of a set of symptoms; it then finds the full stored pattern that represents the "best" diagnosis and treatment. This product can be useful for various users, such as:

1. General Population/Patients
   a. It can act as a preliminary advice mechanism for patients before they consult a doctor.
   b. They can get suggestions as to whether they need to consult a doctor or whether a visit to the local pharmacy would suffice.
2. Medical Professionals
   a. To speed up the process of diagnosis and to reduce the human errors involved in finding the possible ailments.
3. Medical Undergraduates/Students
   a. To understand the common diseases and the symptoms related to them.
   b. To understand all the medical conditions that could be present in a patient who exhibits a given symptom.
4. Hospitals
   a. Based on the diagnosis, hospital websites can display the specialist doctors that patients can visit.
Keywords: artificial neural networks, associative memory neural network, data mining.

Author: Computer Science Department, Symbiosis Institute of Technology, Bachelor of Engineering in Computer Science. e-mail: saumyanda@gmail.com

# I. INTRODUCTION

Sometimes people ignore medical symptoms or conditions that they might be suffering from and do not feel like going to the doctor for every small medical problem. Hence, we felt that there is a need for a medical health advisor that guides people about the diseases or medical conditions that they might have. This medical health advisor is an intelligent, learning- and heuristics-based system that predicts diseases from the symptoms that users enter. Based on this prediction, the application also suggests whether they need to take medical advice from a doctor for their condition and, if so, what kind of medical specialist they need to visit. The application would also be useful for medical professionals and new medical students who need to know all the possible diseases that might be related to one particular symptom. In the Indian context in particular, where medical advice is not readily available, especially in rural areas, tie-ups with local health centers and the state government could extend this application's reach. Medical ignorance can be life-threatening, so it is important to stay informed in order to stay safe.

# II. LITERATURE SURVEY

The research phase is crucial for the success of any project. The capabilities and strengths of a project depend on how strong the research is. We devoted 40% of our time to research on various Natural Language Processing algorithms, Sentiment Analysis tools, and various APIs.

This hidden information is useful for making effective decisions. Computer-based information, along with advanced data mining techniques, is used to obtain appropriate results.
Neural networks are a widely used tool for predicting heart disease and other diseases in human beings. In this research paper, a Heart Disease Prediction System (HDPS) is developed using a neural network. The HDPS predicts the likelihood of a patient getting heart disease. For prediction, the system uses 13 medical parameters such as sex, blood pressure, and cholesterol. Two more parameters, obesity and smoking, are added for better accuracy. The results show that the neural network predicts heart disease accurately.

# III. RESEARCH ELABORATION

We took a custom approach to classification for this project: we developed our own classification algorithm for the dataset. We did so because we found that standard algorithms such as Random Forests or Bayesian networks did not fit this use case. We also intended to question the user dynamically, and determining the order of the questions was difficult with the standard algorithms.

To classify the diseases based on the symptoms, we first considered a rule-based algorithm, a classical approach in AI. The algorithm we initially considered was the Apriori algorithm, which generates the most frequent itemsets from a set of transactions and gives the support count of the items occurring together. In essence, the Apriori algorithm performs rule-based mining. Upon implementing it on the dataset, we could not get an accuracy of more than 70%. Hence, we discarded the approach.

Next, we considered a Longest Common Subsequence (LCS) approach to understand the patterns in the dataset and to generate the dynamic questions according to the most frequent longest subsequence. This approach was significantly better than the Apriori algorithm, giving an accuracy of 85%. Upon testing with unknown data, however, we found that this approach could not yield the required results.
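The paper does not say how LCS was applied to the dataset, so the following is only a minimal sketch of the standard dynamic-programming LCS, written in Python and applied to two ordered symptom lists. The function name and the symptom records are hypothetical and invented for illustration; they are not taken from the project's dataset.

```python
from typing import List, Sequence


def longest_common_subsequence(a: Sequence[str], b: Sequence[str]) -> List[str]:
    """Return one longest common subsequence of two symptom sequences."""
    rows, cols = len(a), len(b)
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover the subsequence.
    result: List[str] = []
    i, j = rows, cols
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    result.reverse()
    return result


# Hypothetical symptom records (assumption: symptoms are stored in a fixed order).
record_a = ["fever", "cough", "headache", "fatigue"]
record_b = ["fever", "headache", "nausea", "fatigue"]
shared = longest_common_subsequence(record_a, record_b)
print(shared)  # ['fever', 'headache', 'fatigue']
```

The shared subsequence is one way to read off a pattern common to two records; how such patterns were ranked by frequency in the project is not described in the text.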
We then performed a frequency analysis of the entire dataset to understand its sparsity and, subsequently, to generate the dynamic questions based on the clusters and outliers in the data. The frequency analysis was done using MultiValueMap, a class in the org.apache.commons.collections library. A MultiValueMap stores data such that one key can have multiple values mapped to it. In this map, the key is the frequency of a symptom and the value array stores all the symptoms that occur with that frequency. Hence, feeding the entire dataset into the MultiValueMap effectively clusters the symptoms by frequency. The key set of the MultiValueMap was sorted and used as the input to a Binary Search Tree (BST), which was built to capture the nature of the frequency distribution. Every node of the BST has the following structure:

1. Frequency of the node: an integer value
2. Symptom list associated with that frequency: an ArrayList

A mirroring operation is performed on the BST to exchange the left and right subtrees of each node. This ensures that the most frequent symptoms fall in the left subtree of the root node, which makes the traversal of the BST simple. We perform an in-order traversal of the entire BST to obtain the symptoms in decreasing order of frequency. At every step of the traversal, we get the symptoms associated with the current node, which the dynamic questioning interface then uses to ask the user questions intelligently. Hence, our classification algorithm builds a decision tree from the dataset and asks relevant questions based on the user's interactions with the system. The output of the algorithm is the set of all possible diseases associated with the symptoms selected by the user at runtime.
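The steps above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the project's Java implementation: a plain dictionary of lists stands in for MultiValueMap, the names Node, group_by_frequency, and candidate_diseases are hypothetical, the disease records are invented sample data, and matching diseases that contain every selected symptom is an assumed rule that the text does not specify.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional, Set


@dataclass
class Node:
    frequency: int                 # how often the symptoms below occur
    symptoms: List[str]            # all symptoms sharing that frequency
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def group_by_frequency(records: Dict[str, List[str]]) -> Dict[int, List[str]]:
    """Map each frequency to the symptoms that occur that many times."""
    counts = Counter(s for symptoms in records.values() for s in symptoms)
    groups: Dict[int, List[str]] = defaultdict(list)
    for symptom, freq in sorted(counts.items()):
        groups[freq].append(symptom)
    return dict(groups)


def insert(root: Optional[Node], frequency: int, symptoms: List[str]) -> Node:
    """Standard BST insertion keyed on frequency."""
    if root is None:
        return Node(frequency, symptoms)
    if frequency < root.frequency:
        root.left = insert(root.left, frequency, symptoms)
    else:
        root.right = insert(root.right, frequency, symptoms)
    return root


def mirror(root: Optional[Node]) -> None:
    """Swap the left and right subtrees of every node, in place."""
    if root is None:
        return
    root.left, root.right = root.right, root.left
    mirror(root.left)
    mirror(root.right)


def in_order(root: Optional[Node]) -> Iterator[Node]:
    """In-order traversal; on a mirrored BST this yields decreasing keys."""
    if root is not None:
        yield from in_order(root.left)
        yield root
        yield from in_order(root.right)


def candidate_diseases(records: Dict[str, List[str]], selected: Set[str]) -> List[str]:
    """Assumed rule: a disease matches if it lists every selected symptom."""
    return sorted(d for d, s in records.items() if selected <= set(s))


# Hypothetical disease-to-symptom records, for illustration only.
records = {
    "influenza": ["fever", "cough", "fatigue"],
    "common cold": ["cough", "sore throat"],
    "migraine": ["headache", "fatigue", "nausea"],
    "strep throat": ["fever", "sore throat", "headache"],
    "bronchitis": ["cough", "fatigue"],
}

groups = group_by_frequency(records)
root: Optional[Node] = None
for freq in sorted(groups):        # sorted key set, as described in the text
    root = insert(root, freq, groups[freq])
mirror(root)

question_order = [(n.frequency, n.symptoms) for n in in_order(root)]
print(question_order)
print(candidate_diseases(records, {"cough", "fatigue"}))
```

With this sample data, the traversal visits the frequency-3 symptoms first, then frequency 2, then frequency 1. Note that inserting keys in sorted order produces a one-sided tree; inserting them in a different order would be one possible variation if a bushier tree is wanted.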
This sample code uses the "Nokogiri" gem to fetch the structure of a given webpage, whose URL is passed as a parameter to the function body_systems_descriptions(url). The URL is parsed using the gem, and the required element of the HTML page is selected using XPath. The tree path of the required HTML node is passed to the XPath query, and the extracted data is processed to populate the dataset.

# V. CONCLUSION

After extensive study of diseases and their symptoms, we have developed a preliminary health assessment tool for the general public. We aimed to tell users about the diseases that they may be suffering from, depending on their symptoms. This application could be very useful for people who are uncertain about the diseases that they might have but do not have prompt access to medical services. At the same time, we do not intend to take the place of a general physician or OPD clinics; we just aim to guide the patient to the right type of medical assistance. While working on this project, we realized that many people in India do not know what condition they may have and are unaware of the diseases that they may be suffering from. Hence, we feel that this project will be a significant contribution in this area, for people who are hesitant or uninformed about their health and for those who do not have access to medical services.

![Fig. 1: Example of a decision tree](image-2.png)

![Fig. 3: Symptoms and their frequencies](image-3.png)

![Fig. 6: CLI Application](image-4.png)

![Fig. 11: Sample output page showing diseases detected](image-5.png)

c) A Data Mining Approach for Prediction of Heart Disease using Neural Networks

Abstract-Heart disease diagnosis is a complex task which requires much experience and knowledge. The traditional way of predicting heart disease is a doctor's examination or a number of medical tests such as ECG, Stress Test, and Heart MRI.
Nowadays, the health care industry contains a huge amount of health care data, which contains hidden information.

a) Method of Diagnosing Cerebral Infarction (US Patent No. 5590665 A)

Developed by Kazuyuki Kanai. Publication Date: Jan 7, 1997

Abstract-A novel method of diagnosing cerebral infarction using a neural network, wherein plural sets of data previously obtained from healthy and sick persons, each including an age, measured values of coagulo-fibrinolytic molecular markers (e.g., D-dimer, TAT and PAP), an index indicative of the state of cerebral infarction (e.g., 0 for healthy persons and 1 for sick persons) and the like, are repeatedly input into a neural network to let it learn the correlation of these characteristics and, thereafter, a set of data of a person to be diagnosed, including his age, measured values of the coagulo-fibrinolytic molecular markers and the like, are input in the neural network to obtain an index indicative of his state of cerebral infarction as a degree of dangerousness of cerebral infarction. This method is significantly higher in accuracy as compared with the prior art methods using the same data.

b) Artificial Neural Networks in Medical Diagnosis

Abstract-An extensive amount of information is currently available to clinical specialists, ranging from details of clinical symptoms to various types of biochemical data and outputs of imaging devices.

© 2017 Global Journals Inc. (US)

* Filippo Amato, Alberto López, Eladia María Peña-Méndez, Petr Vaňhara, Aleš Hampl, and Josef Havel. Artificial neural networks in medical diagnosis.
* Miss Chaitrali S. Dangare and Dr. Mrs. Sulabha S. Apte. A data mining approach for prediction of heart disease using neural networks. International Journal of Computer Science and Technology, 2, June 2011.
* Kazuyuki Kanai. Method of diagnosing cerebral infarction. US Patent No. US005590665A, Jan. 7, 1997.