# I. Introduction

People are prone to making mistakes during analyses, particularly when trying to determine relationships among multiple features, and this makes it difficult for them to find solutions to certain problems. Data mining is the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in datasets [1]. These tools include statistical models, mathematical algorithms, and machine learning methods [2]. Data mining therefore consists of more than collecting and managing data; it also includes analysis and prediction [1]. The classification technique is capable of processing a wider variety of data than regression and is growing in popularity [3].

# II. Dataset Used

In this research work, we use the IRIS plant dataset, one of the most popular databases for classification problems. It was obtained from the UCI Machine Learning Repository; it was created by R. A. Fisher and donated by Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov) in July 1988 [4]. The IRIS dataset contains three different classes of IRIS plants, distinguished by their patterns [5, 6]. Each class contains fifty objects. The attribute to be predicted is the category of IRIS plant. The attributes present in the IRIS dataset can be described as categorical, nominal, and continuous. The dataset is complete, i.e., there are no missing values in any attribute [6]. The 150 instances, equally divided among the three classes, hold the following four numeric attributes:

1. Sepal length in cm
2. Sepal width in cm
3. Petal length in cm
4. Petal width in cm

# III. Classifiers Used

In this paper, we compare two tree-based classifiers, Random Forest and J48, for IRIS variety prediction.

# a) Random Forest Classifier

Random Forest [7] is considered one of the best "off-the-shelf" classifiers for high-dimensional data. A random forest is a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error of a random forest classifier depends on the strength of the individual trees in the forest and the correlation between them. The dataset is divided into a training set used to learn each tree, while the remainder of the data is used to estimate the error and variable importance. To apply the model, class assignment is made according to the number of votes cast by the trees. Random Forest is similar to bagged decision trees, with a few key differences:

* For every split point, the search is not over all p variables but only over m randomly selected variables (where, e.g., m = [p/3]).
* No pruning is necessary; trees can be grown until each node contains only a few observations.

Random Forest gives good predictions, and almost no parameter tuning is necessary.

# b) J48 Classifier

The J48 classifier is an implementation of the C4.5 decision tree algorithm for classification [8], which creates a binary tree. It is among the most useful decision tree approaches for classification problems. The method constructs a tree to model the classification process; once the tree is built, it is applied to every tuple in the database and yields a classification for that tuple [9]. Missing values are ignored by J48 while building the decision tree, i.e., the known attribute values of the other records are used to predict the value for that item. The idea is to divide the data into ranges based on the attribute values for that element observed in the training sample [10].
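As a concrete illustration of the two classifiers, here is a minimal sketch on the IRIS data. It uses scikit-learn rather than WEKA (the tool used for the experiments below), so `RandomForestClassifier` and `DecisionTreeClassifier` merely stand in for WEKA's Random Forest and J48; scikit-learn's tree is CART-based and only approximates C4.5, and every parameter choice here is illustrative.

```python
# Minimal sketch (scikit-learn, not WEKA): a random forest and a single
# decision tree on the IRIS data. DecisionTreeClassifier is CART-based,
# so it only approximates J48/C4.5; parameters are illustrative.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# 150 instances, 4 numeric attributes, 3 classes of 50 each.
X, y = load_iris(return_X_y=True)

# Random forest: each split searches only a random subset of the p
# features rather than all of them, and trees grow deep with no pruning.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                            random_state=1)

# A single unpruned decision tree as the J48-like baseline.
j48_like = DecisionTreeClassifier(random_state=1)

for clf in (rf, j48_like):
    clf.fit(X, y)
    print(type(clf).__name__, "training accuracy:", clf.score(X, y))
```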
# IV. Performance Measures Used

Various measures are used to gauge the performance of the classifiers.

# a) Classification Accuracy (CA)

Classification accuracy is the percentage of correctly classified instances in the test dataset. It is calculated by dividing the number of correctly classified instances by the total number of instances and multiplying by 100.

# b) Mean Absolute Error (MAE)

Mean absolute error is the average of the absolute differences between predicted and actual values over all test cases. It is a good measure of overall performance.

# c) Root Mean Square Error (RMSE)

Root mean squared error is used to measure differences between predicted and actual values. It is obtained by taking the square root of the mean squared error.

# d) Confusion Matrix (CM)

A confusion matrix is a tool that shows, for a classification problem, how often the predictions agree with reality.

# V. Results and Discussion

In this work, to evaluate the performance of the different tree-based classifiers (Random Forest and J48), we used WEKA, a well-known open-source tool in the machine learning field. Performance is tested in two ways: first by splitting the dataset into training (70%) and testing (30%) sets, and then by using different cross-validation methods.

# a) Performance of Random Forest Classifier

Table 1 shows the global evaluation summary of the Random Forest classifier under both test modes: the 70/30 split and the different cross-validation methods. Fig. 1 and Fig. 2 display the performance of the Random Forest classifier in terms of classification accuracy and time taken to build the model, and Tables 2 to 6 give the confusion matrices for the different test modes. Applying these test modes with the Random Forest classifier, we obtained 95.55% accuracy in the split mode, spending 0.17 s building the model. Using the different cross-validation methods, we obtained around 94.99% accuracy on average, spending 0.06 s building the model.

# b) Performance of J48 Classifier

Table 7 gives the corresponding evaluation summary for the J48 classifier, Fig. 3 and Fig. 4 show its classification accuracy and model-building time, and Table 8 gives its confusion matrix for the split mode. Applying these test modes with the J48 classifier, we obtained 95.55% accuracy in the split mode, spending 0.05 s building the model. Using the different cross-validation methods, we obtained around 95.83% accuracy on average, spending 0.025 s building the model.
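The measures of Section IV can be computed by hand, as in the sketch below. Note one assumption: for MAE and RMSE we follow the convention WEKA uses for classifiers, comparing the predicted class-probability distribution against the one-hot actual class and averaging over all instances and classes; the function names are ours.

```python
# Hand-rolled versions of the Section IV measures. The MAE/RMSE
# convention below (|p - y| averaged over instances and classes, with
# y the one-hot actual class) is assumed to match what WEKA reports
# for classifiers; treat it as an assumption to verify.
import numpy as np

def classification_accuracy(y_true, y_pred):
    # Percent of correctly classified instances.
    return 100.0 * np.mean(np.asarray(y_true) == np.asarray(y_pred))

def mae_rmse(prob, y_true, n_classes):
    # prob: (n_instances, n_classes) predicted class probabilities.
    onehot = np.eye(n_classes)[np.asarray(y_true)]
    diff = prob - onehot
    return np.mean(np.abs(diff)), np.sqrt(np.mean(diff ** 2))

def confusion_matrix(y_true, y_pred, n_classes):
    # Rows are actual classes, columns are predicted classes.
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm
```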
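Likewise, the two test modes above (a 70/30 split and k-fold cross-validation with k = 5, 10, 15, 20) can be outlined in code. This is again a scikit-learn sketch rather than the actual WEKA runs, so it mirrors the methodology of Tables 1 and 7 without reproducing their exact figures; the random seed and stratified split are our choices.

```python
# Sketch of the two test modes from Section V for both classifiers.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix

X, y = load_iris(return_X_y=True)

for name, clf in [("Random Forest", RandomForestClassifier(random_state=1)),
                  ("J48-like tree", DecisionTreeClassifier(random_state=1))]:
    # Test mode 1: 70% training / 30% testing split (45 test instances).
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.30, random_state=1, stratify=y)
    clf.fit(X_tr, y_tr)
    print(f"{name} split accuracy: {clf.score(X_te, y_te):.4f}")
    print(confusion_matrix(y_te, clf.predict(X_te)))

    # Test mode 2: k-fold cross-validation for each k used in the paper.
    for k in (5, 10, 15, 20):
        acc = cross_val_score(clf, X, y, cv=k).mean()
        print(f"{name} {k}-fold CV accuracy: {acc:.4f}")
```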
# VI. Comparison of Random Forest and J48 Classifiers

Fig. 5 and Fig. 6 illustrate a comparison between Random Forest and J48 in terms of classification accuracy and time taken to build the model. Comparing the performance of the two classifiers under the training-set (70%) split and the various cross-validation methods, in terms of time taken to build the model, CA, MAE, and RMSE, we conclude that the J48 classifier outperforms Random Forest.

# VII. Conclusion

This research work compares the efficiency of the Random Forest and J48 classifiers for IRIS variety prediction. The tests were carried out using WEKA 3.9 on a machine with an i5-2430M 2.40 GHz processor and 4.00 GB of RAM. We also compared the performance of the two classifiers in terms of several measures of effectiveness. We observed that the J48 classifier performs better than the Random Forest classifier for IRIS variety prediction on several measures, including classification accuracy, mean absolute error, and time taken to build the model.

![Figure 2: Time Taken to Build the Model of Random Forest Classifier](image-2.png)
![Figure 3: Classification Accuracy of J48 Classifier](image-3.png)
![Figure 4: Time Taken to Build the Model of J48 Classifier](image-4.png)
![Figure 5: Classification Accuracy, Comparison between Random Forest and J48 Classifiers](image-6.png)
![Figure 6: Time Taken to Build the Model, Comparison between Random Forest and J48 Classifiers](image-7.png)

Table 1: Evaluation summary of the Random Forest classifier

| Test Mode | Correctly Classified Instances | Incorrectly Classified Instances | Accuracy | Mean Absolute Error | Root Mean Squared Error | Time to Build Model (s) |
| --- | --- | --- | --- | --- | --- | --- |
| Split (70%) | 43 | 2 | 95.55% | 0.0363 | 0.1532 | 0.17 |
| 5-Fold CV | 143 | 7 | 95.33% | 0.0370 | 0.1531 | 0.05 |
| 10-Fold CV | 142 | 8 | 94.66% | 0.0408 | 0.1624 | 0.03 |
| 15-Fold CV | 142 | 8 | 94.66% | 0.0385 | 0.1613 | 0.14 |
| 20-Fold CV | 143 | 7 | 95.33% | 0.0379 | 0.1558 | 0.03 |

Table 2: Confusion matrix of the Random Forest classifier, split (70%) mode

| Actual \ Predicted | Setosa | Versicolor | Virginica | Actual (Total) |
| --- | --- | --- | --- | --- |
| Setosa | 14 | 0 | 0 | 14 |
| Versicolor | 0 | 16 | 0 | 16 |
| Virginica | 0 | 2 | 13 | 15 |
| Predicted (Total) | 14 | 18 | 13 | 45 |

Table 3: Confusion matrix of the Random Forest classifier, 5-fold CV

| Actual \ Predicted | Setosa | Versicolor | Virginica | Actual (Total) |
| --- | --- | --- | --- | --- |
| Setosa | 50 | 0 | 0 | 50 |
| Versicolor | 0 | 47 | 3 | 50 |
| Virginica | 0 | 4 | 46 | 50 |
| Predicted (Total) | 50 | 51 | 49 | 150 |

Table 4: Confusion matrix of the Random Forest classifier, 10-fold CV

| Actual \ Predicted | Setosa | Versicolor | Virginica | Actual (Total) |
| --- | --- | --- | --- | --- |
| Setosa | 50 | 0 | 0 | 50 |
| Versicolor | 0 | 47 | 3 | 50 |
| Virginica | 0 | 4 | 46 | 50 |
| Predicted (Total) | 50 | 51 | 49 | 150 |

Table 5: Confusion matrix of the Random Forest classifier, 15-fold CV

| Actual \ Predicted | Setosa | Versicolor | Virginica | Actual (Total) |
| --- | --- | --- | --- | --- |
| Setosa | 50 | 0 | 0 | 50 |
| Versicolor | 0 | 47 | 3 | 50 |
| Virginica | 0 | 5 | 45 | 50 |
| Predicted (Total) | 50 | 52 | 48 | 150 |

Table 6: Confusion matrix of the Random Forest classifier, 20-fold CV

| Actual \ Predicted | Setosa | Versicolor | Virginica | Actual (Total) |
| --- | --- | --- | --- | --- |
| Setosa | 50 | 0 | 0 | 50 |
| Versicolor | 0 | 47 | 3 | 50 |
| Virginica | 0 | 4 | 46 | 50 |
| Predicted (Total) | 50 | 51 | 49 | 150 |

Table 7: Evaluation summary of the J48 classifier

| Test Mode | Correctly Classified Instances | Incorrectly Classified Instances | Accuracy | Mean Absolute Error | Root Mean Squared Error | Time to Build Model (s) |
| --- | --- | --- | --- | --- | --- | --- |
| Split (70%) | 43 | 2 | 95.55% | 0.0416 | 0.1682 | 0.05 |
| 5-Fold CV | 144 | 6 | 96% | 0.0350 | 0.1582 | 0.02 |
| 10-Fold CV | 144 | 6 | 96% | 0.0350 | 0.1586 | 0.02 |
| 15-Fold CV | 143 | 7 | 95.33% | 0.0395 | 0.1758 | 0.03 |
| 20-Fold CV | 144 | 6 | 96% | 0.0354 | 0.1586 | 0.03 |

Table 8: Confusion matrix of the J48 classifier, split (70%) mode

| Actual \ Predicted | Setosa | Versicolor | Virginica | Actual (Total) |
| --- | --- | --- | --- | --- |
| Setosa | 14 | 0 | 0 | 14 |
| Versicolor | 0 | 16 | 0 | 16 |
| Virginica | 0 | 2 | 13 | 15 |
| Predicted (Total) | 14 | 18 | 13 | 45 |

# References

1. Daniel T. Larose and Chantal D. Larose, An Introduction to Data Mining, 2014.
2. Mehmed Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, 2002.
3. Margaret H. Dunham and S. Sridhar, Data Mining: Introductory and Advanced Topics, Pearson Education, 1st Edition, 2006.