Clinical Dengue Data Analysis and Prediction using Multiple Classifiers: An Ensemble Techniques

1. I. Introduction

Dengue fever (DF) is an arthropod-borne viral disease that has become common over the past three decades. According to the WHO, 51-101 million new dengue infections occur every year in more than a hundred endemic countries [1]. Dengue fever is a severe viral infection with potentially fatal consequences, and it was originally known as "water poison." Dengue is transmitted by the female Aedes aegypti mosquito, shown in Fig. 1.

Fig. 1: A Female Aedes Aegypti Mosquito

In the 1780s, the first clinically recognized dengue epidemics occurred at the same time in Africa, Asia, and North America. Benjamin Rush coined the term "break-bone fever" based on the features of arthralgia and myalgia. In India, a dengue epidemic was first reported in Chennai in 1780, and the first virologically proven outbreak of dengue fever appeared in Calcutta and on the East Coast of India in 1963-64. In the 1970s and 1980s, epidemic activity accelerated dramatically, resulting in the widespread distribution of viruses and mosquito vectors and the consequent transmission of dengue virus (DENV) across the world [2]. The first major dengue hemorrhagic fever (DHF) epidemic occurred in the Philippines during 1953-1954, followed by a rapid global spread of DF/DHF epidemics. The first major epidemics of DHF and dengue shock syndrome (DSS) in India occurred in 1996 in Delhi and Lucknow, and later extended throughout the country. Dengue outbreaks have since become more common in many parts of India. Between 2010 and 2014, the incidence of reported dengue cases was 34.81 per million population. Dengue fever became endemic in Orissa, Uttarakhand, Bihar, Assam, and Jharkhand in 2010 [3].

2. II. Background Study

Kassaye Yitbarek Yigzaw et al. [2] presented a benchmarking platform for the prediction of communicable diseases. Rathi et al. [4] studied dengue infection in Rajasthan; the study was based on 100 admitted children, who were classified according to their symptoms. Kalayanarooj S. [3] described the clinical manifestations of dengue and DHF. Aldallal, A.S. [5] explained how data mining techniques are used for the prediction of non-communicable diseases such as heart disease and diabetes. Agrawal et al. [7] demonstrated an ensemble approach using multiple classifiers, AdaBoost and a decision tree, for the prediction of diabetes. Ghosh et al. [10] used multiple classifiers to assess the performance of sentiment analysis. Gupta et al. [12] compared different ML approaches for heart disease prediction. Mesafint et al. [14] explained ML algorithms for the prediction of HIV/AIDS test results.

3. III. Proposed Methodology

The ensemble models are Extreme Gradient Boost (XGB), Random Forest (RF) by majority voting, and Stacking, which is based on a combination of heterogeneous classifiers such as NB, KNN, and SVM. Ensemble techniques are very helpful for dengue fever diagnosis and prediction [6]. The proposed framework is shown in Fig. 3. The main aim of the data acquisition and data pre-processing module is to obtain the dengue fever dataset and process it into a form suitable for further analysis. The dataset has features/attributes that ultimately classify each patient as sick or healthy. The dataset has thirty-eight features of different data types. The dataset is split into an 80% training set and a 20% testing set. The pre-processing includes feature selection and missing value imputation [8]. The proposed model combines different classifiers such as Naïve Bayes, K-Nearest Neighbor, and Support Vector Machine. For each classifier, the output is predicted.

Each base classifier in the ensemble framework is fitted on the training data so that it can be used for the prediction of dengue. The dataset features and target values are provided to each classifier, which can then predict whether the disease is present or not.
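The 80%/20% partition described in this section can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the helper name `train_test_split_indices`, the fixed seed, and the choice of a simple shuffled split (rather than a stratified one) are hypothetical, since the paper does not say how the split was drawn. The value 286 is the number of instances reported for the dataset.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) for a shuffled hold-out split.

    Hypothetical helper for illustration only; the seed and the
    non-stratified shuffle are assumptions, not the paper's procedure.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = round(n_samples * test_fraction)
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return train_idx, test_idx


# 286 instances, as reported for the dengue dataset
train_idx, test_idx = train_test_split_indices(286)
print(len(train_idx), len(test_idx))
```

Each base classifier would then be fitted on the rows selected by `train_idx` and evaluated on the rows selected by `test_idx`.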

4. i. Description of the Dengue Dataset

The patient data was collected from the Department of General Medicine, PESIMSR, Kuppam, Andhra Pradesh. The patients were diagnosed in the laboratory using the dengue duo card test shown in Fig. 4. The dataset consists of 18 attributes and one target value. The number of patients having each symptom is listed in Table I, and the corresponding bar chart, which explains the importance of each feature [9], is shown in Fig. 7. Among the 140 dengue-infected cases, all patients had fever, 106 had headache, 97 had myalgia, 94 had arthralgia, and 83 had low back pain, among other symptoms.

5. ii. XGBoost

Boosting is a broadly used and highly effective machine learning technique. XGBoost, an end-to-end tree boosting system, is widely used by data experts. Its most important factor is scalability, which it achieves together with good accuracy. The system is reported to be ten times faster than existing conventional methods. The scalability of XGBoost is due to several algorithmic optimizations. Parallel and distributed computing make learning faster [15].

In the standard stacking algorithm, the base (first-level) classifiers are fitted on the same training samples that are used to prepare the inputs for the meta (second-level) classifier, which may cause overfitting. The StackingCVClassifier instead uses cross-validation. The dataset is split into k folds, and in each of k successive rounds, k-1 folds are used to fit the first-level classifiers. In every round, the first-level classifiers are then applied to the remaining fold, which was not used for fitting. The resulting predictions of the base classifiers are stacked and provided as input to the second-level classifier.
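The cross-validated stacking procedure described above can be sketched in plain Python. This is an illustrative sketch only, not the StackingCVClassifier implementation: `k_fold_indices`, `NearestMeanClassifier`, and `out_of_fold_features` are hypothetical names, the toy one-feature learner merely stands in for the NB, KNN, and SVM base classifiers, and the data are made up. The point it shows is that every entry of the meta-classifier's input comes from a base learner that never saw that sample during fitting.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


class NearestMeanClassifier:
    """Toy base learner (hypothetical stand-in for NB, KNN, or SVM).

    Uses one feature and predicts the class whose training mean is closest.
    Assumes both classes (0 and 1) occur in the data passed to fit.
    """

    def __init__(self, feature):
        self.feature = feature
        self.means = {}

    def fit(self, X, y):
        for label in (0, 1):
            values = [row[self.feature] for row, t in zip(X, y) if t == label]
            self.means[label] = sum(values) / len(values)
        return self

    def predict(self, X):
        return [
            min((0, 1), key=lambda c: abs(row[self.feature] - self.means[c]))
            for row in X
        ]


def out_of_fold_features(X, y, make_base_learners, k=5):
    """Build the meta-classifier's inputs from out-of-fold predictions."""
    n_learners = len(make_base_learners())
    meta_X = [[None] * n_learners for _ in X]
    for held_out in k_fold_indices(len(X), k):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        X_tr = [X[i] for i in train]
        y_tr = [y[i] for i in train]
        for j, learner in enumerate(make_base_learners()):
            learner.fit(X_tr, y_tr)  # fit on the other k-1 folds
            preds = learner.predict([X[i] for i in held_out])
            for i, p in zip(held_out, preds):
                meta_X[i][j] = p  # prediction for a sample not used in fitting
    return meta_X


# Made-up two-feature data with alternating class labels
X = [[0.1, 5.0], [0.9, 1.0], [0.2, 4.5], [0.8, 1.5], [0.3, 4.0],
     [0.7, 2.0], [0.15, 4.8], [0.85, 1.2], [0.25, 4.2], [0.75, 1.8]]
y = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

meta_X = out_of_fold_features(
    X, y, lambda: [NearestMeanClassifier(0), NearestMeanClassifier(1)], k=5
)
print(meta_X[:2])
```

The rows of `meta_X`, paired with `y`, would then be used to fit the second-level classifier.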

7. IV. Performance Evaluation

The clinical dengue fever dataset was used to analyse the performance of the ensemble models and to compare them with the other models. The class labels dengue infected (DF) and dengue not infected (NDF) are replaced with class 1 and class 0, respectively, to maintain uniformity [16]. The dataset is split into training and testing sets, and 10-fold cross-validation is applied. The performance measures of each base classifier, as well as of the ensemble models, are calculated using a confusion matrix. The base classifiers NB, SVM, and KNN are trained first and then tested. The proposed research work analysed the performance of the ensemble methods XGB, RF, and Stacking. The metrics are accuracy, recall, precision, and f1-score. The confusion matrix illustrates the actual and predicted classifications [15,17]. Equations (1), (2), (3), and (4) are used to calculate the metrics [17]. The accuracy of each model is given in Table III and Fig. 11. The ensemble methods XGB, RF, and Stacking give accuracies of 98.57%, 99.12%, and 99.56% on the training dataset, and 97.80%, 94.82%, and 98.27% on the testing dataset. We observed better accuracy for the ensemble methods. On the testing and training datasets respectively, the AUC is 97.14% and 97.81% for XGB, 98.14% and 99.14% for random forest, and 98.14% and 98.68% for stacking. As shown in Table V, the AUC values of the ensemble methods lie between 0.97 and 0.99, indicating that the positive class is well distinguished from the negative class.
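The four metrics named above can be computed directly from confusion-matrix counts. The sketch below assumes the standard textbook definitions of accuracy, precision, recall, and f1-score; the paper's own equations (1) to (4) are not reproduced in this text, so this is an assumption about their form. The function names and the small label lists are hypothetical, and class 1 (DF) is treated as the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def classification_metrics(tp, fp, fn, tn):
    """Standard definitions; a zero denominator yields 0.0 by convention here."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels: 1 = dengue infected (DF), 0 = not infected (NDF)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

counts = confusion_counts(y_true, y_pred)
metrics = classification_metrics(*counts)
print(counts, metrics)
```

Per-class figures such as those reported for NDF and DF can be obtained by calling `confusion_counts` once with `positive=0` and once with `positive=1`.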

8. Table II: Confusion Matrix

10. Table V: AUC Comparison

11. V. Conclusion

The main objective of this research work is the prediction of dengue fever using ensemble techniques. We used bagging, boosting, and stacking methods for prediction, and the results are compared with the NB, KNN, and SVM models. The experimental results show that ensemble techniques are the best of the evaluated models for the prediction of dengue fever. The techniques were analysed using performance metrics. Extreme Gradient Boost, random forest with majority voting, and stacking using meta-classifiers give better accuracy on both the training and testing datasets than the other models. An extended analysis was done using the ROC curve and the precision-recall curve, which explain the performance of the models. The area under the curve lies between 0.97 and 0.99. The ensemble models are the better models for the prediction of dengue-infected patients.

Fig. 2: Pictorial Representation of Dengue Fever Symptoms

According to the World Health Organization, dengue fever is classified into four types: DENV1, DENV2, DENV3, and DENV4. The incubation period is 2 to 7 days [4]. The symptoms of dengue include high fever, joint and muscle pain, headache, vomiting, rashes, pain behind the eyes, and diarrhea. The dengue fever symptoms are shown in Fig. 2. Different ML algorithms are used for dengue fever classification, such as the NB classifier, K-Nearest Neighbour, Decision Tree, Support Vector Machine, and Neural Networks. The proposed model demonstrates the ensemble techniques called bagging, boosting, and stacking. The dengue binary classification is based on Extreme Gradient Boost (XGB), Random Forest by majority voting, and Stacking.
Fig. 3: An Ensemble Framework for the Prediction and Evaluation of Dengue Dataset
Fig. 4: Diagnosis-Dengue Duo Card Test

The dataset consists of 286 instances with 18 attributes and one target. The target has two levels: dengue patients and non-dengue patients. A numerical value is assigned to each level: 0 for non-dengue patients (NDF) and 1 for dengue patients (DF). A screenshot of the dataset is shown in Fig. 5.
Fig. 5: The Screenshot of the Dataset

Of the 286 cases, the target value comprises 140 dengue-infected cases and 146 non-dengue cases. The distribution is shown in Fig. 6.
Fig. 6: Distribution of a Target Value
Fig. 8: Random Forest Algorithm Procedure

iii. Stacking

Stacking is an ensemble technique that uses a meta-classifier to learn the best way to combine the predictions of two or more base ML algorithms. The base (level 0) classifiers consist of different ML algorithms, and therefore stacking ensembles are generally heterogeneous. The predictions of the base classifiers are used as new features to train the meta-classifier. An ensemble stacking procedure is illustrated in Fig. 9. The meta-classifier can be any classifier [13].
The confusion matrix and experimental scores of the NB, SVM, KNN, XGB, RF, and Stacking models on the training and testing datasets are shown in Fig. 10.
Fig. 13: Testing Dataset Precision, Recall and F1 Score Comparison of ML Models

The precision, recall, and f1-score for the training and testing datasets are listed in Table IV, and a comparison of the ensemble with the other methods is shown in Fig. 12 and Fig. 13, which show that the ensemble methods give better performance on unseen data. The Receiver Operating Characteristic (ROC) curve and the Precision-Recall curve are graphical representations of classifier performance, obtained by calculating and plotting the true positive rate (TPR) against the false positive rate (FPR), and precision against recall, for each classifier at various threshold values. The precision-recall curves for the training and testing datasets are shown in Fig. 14 and Fig. 15, and the corresponding ROC curves are shown in Fig. 16 and Fig. 17.
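The threshold sweep behind a ROC curve, and the area under it, can be sketched in plain Python as follows. This is a minimal illustration, not the procedure used to produce the paper's figures: the function names `roc_points` and `auc` and the small score list are hypothetical, and the area is taken with the trapezoidal rule, which is one common convention.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs obtained by lowering the decision threshold.

    Assumes binary labels (1 = positive) with at least one sample per class.
    """
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample tied at this score before emitting a point
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve via the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Made-up classifier scores: higher means "more likely dengue (class 1)"
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(y_true, scores)
area = auc(curve)
print(curve)
print(round(area, 4))
```

An area close to 1 means that positive cases are almost always scored above negative cases, which is how AUC values in the 0.97 to 0.99 range are read.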
Fig. 14: The Performance Comparison of the Training Dataset by Precision Recall Curve
Fig. 16: The Performance Comparison of the Training Dataset by ROC Curve
Table I: Clinical Features and No. of Patients

Clinical Feature    No. of Patients
Fever               140
Headache            106
Myalgia             97
Arthralgia          94
Low Backache        83
Retro Orb Pain      71
Rashes              65
Vomiting            57
Pain Abdomen        41
Bleeding            39
Cough               30
Diarrhea            25
Sore Throat         16
Breathlessness      6
Seizures            5

Fig. 7: Bar Chart Representation

b) Ensemble Methods

Ensemble means combining multiple models. This approach gives better performance than a single model, so a set of models is used for prediction rather than a single model [7]. The main challenge is to obtain base models that make different kinds of errors. If the ensemble techniques of bagging, boosting, and stacking are used for classification, high accuracies can be obtained. Bagging creates different subsets of training data from the sample training dataset, and the final output depends on majority voting, e.g., Random Forest. Boosting creates sequential models that combine weak learners into a strong learner, and the finally constructed model has the highest accuracy, e.g., XGBoost and AdaBoost.
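The two ingredients of bagging described under Ensemble Methods, drawing different training subsets and combining the models' outputs by majority voting, can be sketched in plain Python. This is an illustration under stated assumptions rather than the Random Forest algorithm itself: the function names and the toy values are hypothetical, subsets are drawn by sampling with replacement (the usual bootstrap convention), and ties in the vote are broken toward the smaller label.

```python
import random
from collections import Counter


def bootstrap_sample(X, y, rng):
    """Draw len(X) training rows with replacement (one bagging subset)."""
    picks = [rng.randrange(len(X)) for _ in X]
    return [X[i] for i in picks], [y[i] for i in picks]


def majority_vote(votes):
    """Return the most common label; ties go to the smaller label."""
    counts = Counter(votes)
    best = max(counts.values())
    return min(label for label, count in counts.items() if count == best)


def combine_predictions(per_model_predictions):
    """Combine the predictions of several models, sample by sample."""
    return [majority_vote(votes) for votes in zip(*per_model_predictions)]


# Made-up training rows; each model would be fitted on its own subset
X = [[0.1], [0.4], [0.6], [0.9]]
y = [0, 0, 1, 1]
X_boot, y_boot = bootstrap_sample(X, y, random.Random(0))

# Made-up predictions of three fitted models on four test samples
predictions = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]
final = combine_predictions(predictions)
print(final)
```

In a full bagging ensemble, one model is fitted per bootstrap subset, and `combine_predictions` is applied to their outputs on new patients.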
© 2022 Global Journals
Fig. 10: Confusion Matrix and Experimental Results of Training and Testing Dataset of the Ensemble and Other ML Models (panels: NB, KNN, SVM, RF, XGB, and Stacking, each for the training dataset and the testing dataset)
Table III: Accuracy (%) of the Classifiers

Classifiers    Training Dataset    Testing Dataset
NB             95.40               93.17
KNN            96.49               85.66
SVM            97.51               89.65
XGB            98.57               97.80
RF             99.12               94.82
Stacking       99.56               98.27

Global Journal of Computer Science and Technology, Volume XXII, Issue II, Version I
Table IV: Precision, Recall, and f1-score (%) for the Training and Testing Datasets

                             Training Dataset                Testing Dataset
Classifiers          Class   Precision  Recall  f1-score    Precision  Recall  f1-score
NB                   NDF     93         98      95          94         98      96
NB                   DF      98         92      95          97         93      95
KNN                  NDF     93         100     97          86         97      91
KNN                  DF      100        93      96          96         81      88
SVM                  NDF     96         98      97          91         98      94
SVM                  DF      99         100     100         98         89      93
RF                   NDF     98         100     99          97         98      97
RF                   DF      100        98      99          98         96      97
XGB                  NDF     99         97      98          97         98      97
XGB                  DF      97         99      98          98         96      97
Ensemble Stacking    NDF     100        99      100         97         99      98
Ensemble Stacking    DF      99         100     100         99         96      98
Table V: AUC Comparison

Classifier    Testing Dataset    Training Dataset
Auc_Nb        0.9629             0.9514
Auc_Knn       0.8333             0.9342
Auc_Svc       0.9444             0.9956
Auc_Xgb       0.9714             0.9781
Auc_Rf        0.9814             0.9914
Auc_Scv       0.9814             0.9868

Appendix A

Appendix A.1 Acknowledgment

Our sincere thanks to Dr. Veerapuram Manoj Reddy, Department of General Medicine, PES Medical Sciences and Research, Kuppam, Andhra Pradesh, for his support in the collection of the dengue data.

Appendix B

  1. Ghosh, Monalisa; Sanyal, Goutam. Performance Assessment of Multiple Classifiers Based on Ensemble Feature Selection Scheme for Sentiment Analysis. Applied Computational Intelligence and Soft Computing, 2018. doi:10.1155/2018/8909357.
  2. Agrawal; Bhargav, G.; Spandana, E. Diabetes Diagnosis Prediction Using Ensemble Approach. 2021. doi:10.1007/978-981-15-5546-6_66.
  3. Aldallal, A. S.; Al-Moosa, A. A. Using Data Mining Techniques to Predict Diabetes and Heart Diseases. 4th International Conference on Frontiers of Signal Processing (ICFSP), 2018.
  4. Tate, A.; Gavhane, U.; Pawar, J.; Rajpurohit, B.; Deshmukh, G. B. Prediction of Dengue, Diabetes and Swine Flu Using Random Forest Classification Algorithm. 2017.
  5. Miao, Julia; Miao, Kathleen. Cardiotocographic Diagnosis of Fetal Health based on Multiclass Morphologic Pattern Predictions using Deep Learning Classification. International Journal of Advanced Computer Science and Applications, 9, 2018. doi:10.14569/IJACSA.2018.090501.
  6. Chandra Reddy, N. S.; Shue Nee, S.; Zhi Min, L.; Ying, C. Classification and Feature Selection Approaches by Machine Learning Techniques: Heart Disease Prediction. International Journal of Innovative Computing, 9 (1), 2019. https://doi.org/10.11113/ijic.v9n1.210
  7. Mesafint, Daniel; Manjaiah, D. H. Grid search in hyperparameter optimization of machine learning models for prediction of HIV/AIDS test results. International Journal of Computers and Applications, 2021. doi:10.1080/1206212X.2021.1974663.
  8. Gupta, Gunjan; Adarsh, U.; Subba Reddy, N.; Ashwath Rao, B. Comparison of various machine learning approaches uses in heart ailments prediction. Journal of Physics: Conference Series, 2161, 012010, 2022. doi:10.1088/1742-6596/2161/1/012010.
  9. Heidari, Hanif; Hellstern, Gerhard. Early heart disease prediction using hybrid quantum classification. 2022. doi:10.48550/arXiv.2208.08882.
  10. Veena, H. M.; Suresh, D. S. Comparative Study of Classification algorithms used for the Prediction of Non-communicable diseases. Int. J. Emerg. Trends Eng. Res., 9 (7), 2021. doi:10.30534/ijeter/2021/14972021.
  11. Yigzaw, Kassaye; Bellika, Johan. A communicable disease prediction benchmarking platform. IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI), 2014, pp. 564-568. doi:10.1109/BHI.2014.6864427.
  12. Kameswara Rao, Nynalasetti Konadala; Saradhi Varma, G. P. A hybrid Algorithm for Dengue Disease Prediction with Multi Dimensional Data. International Journal of Advanced Research in Computer Science and Software Engineering, 2014.
  13. Shaukat, K.; Masood, N.; Mehreen, S.; Azmeen, U. Dengue Fever Prediction: A Data Mining Problem. J Data Mining Genomics Proteomics, 6, 181, 2015. doi:10.4172/2153-0602.1000181.
  14. Rathi, Manisha; Masand; Purohit, Alok. Study of Dengue Infection in Rural Rajasthan. Journal of Evolution of Medical and Dental Sciences, 4, pp. 6849-6859, 2015. doi:10.14260/jemds/2015/993.
  15. Bashir, Saba; Qamar, Usman; Khan, Farhan Hassan. IntelliHealth: A medical decision support application using a novel weighted multi-layer classifier ensemble framework. Journal of Biomedical Informatics, 59, pp. 185-200, 2016. https://doi.org/10.1016/j.jbi.2015.12.001
  16. Kalayanarooj, S. Clinical Manifestations and Management of Dengue/DHF/DSS. Trop Med Health, 39 (Suppl), 2011. doi:10.2149/tmh.2011-S10. PMID 22500140; PMC3317599.
  17. World Health Organization. Fact sheet, March 2014. Geneva. http://www.who.int/mediacentre/factsheets/fs117/en/ (16 Oct 2019).