Data mining is the process of extracting useful information from various data repositories, wherein data may be present in different formats in heterogeneous environments [1] [2]. Various methods such as classification, association, clustering, regression, characterization and outlier analysis can be used to mine the necessary information. In this paper we focus on classification.
Classification is the process wherein a class label is assigned to unlabeled data vectors. Classification can be further categorized as supervised and unsupervised classification. In supervised classification the class labels, or categories into which the data sets need to be classified, are known in advance. In unsupervised classification the class label is not known in advance [3]. Unsupervised classification is also known as clustering. Supervised classification can be subdivided into non-parametric and parametric classification. A parametric classifier method depends on the probability distribution of each class, while non-parametric classifiers are used when the density function is not known [4].
One of the most prominent parametric supervised classification methods is the support vector machine (SVM).
In this paper SVMs are used to perform the said classification. Herein the data vectors are represented in a feature space. A hyperplane, which geometrically resembles a sloped line, is then constructed in the feature space; it divides the space comprising the data vectors into two regions such that the data items get classified under two different class labels corresponding to the two different regions [5]. It helps in solving both two-class and multi-class classification problems [6] [7]. The aim of the said hyperplane is to maximize its distance from the adjoining data points in the two regions. Moreover, SVMs do not have an additional overhead of feature extraction, since it is part of their own architecture. Recent research has shown that SVM classifiers provide better classification results on spatial data sets than other classification algorithms such as the Bayesian method, neural networks and k-nearest neighbors [8][9].
SVMs have been used to classify data in various domains such as land cover classification [10], species distribution [11], medical binary classification [9], fault diagnosis [12], character classification [5], speech recognition [13], radar signal processing [14] and habitat prediction. In this paper SVM is used to classify remote sensed data sets. Two formats of remote sensed data, viz. raster format and comma separated value (CSV) file format, have been used for performing the said classification using SVM.
The next section describes background knowledge about SVM classifiers. In Section 3 the materials and methods, viz. the data acquired and the proposed methodology, are discussed. Performance analysis is discussed in Section 4. Section 5 concludes this work, followed by an acknowledgement of the data source and the references.

The line mentioned herein is called a hyperplane and can be mathematically represented by equation (1) [21]:

mx + b = 0 (1)
The data points can be classified by equation (2) [22]:

f(x) = sgn(mx + b) (2)

where sgn() is known as the sign function, which is mathematically represented by equation (3):

sgn(x) = 1 if x > 0; 0 if x = 0; −1 if x < 0 (3)

There can be many hyperplanes which divide the data space into two regions, but the one that maximizes the distance to the bordering data points in the input data space is the solution to the two-class problem. The data points closest to this hyperplane are called support vectors. This concept is illustrated geometrically in Figure 2. The maximization problem, viz. maximizing the distance between the hyperplane and the adjoining support vectors, can be represented as a quadratic optimization problem as in equation (5) [22][23]:
h(m) = (1/2) mᵀm (5)
subject to yᵢ(mxᵢ + b) ≥ 1, ∀i

The solution to this problem is provided by a Lagrange multiplier αᵢ associated with every constraint in the main problem. The solution can be represented as:

m = Σᵢ αᵢ yᵢ xᵢ
b = yₖ − mᵀxₖ for any xₖ such that its Lagrange multiplier αₖ ≠ 0 (6)

The classifier can be denoted as [16]:
f(x) = Σᵢ αᵢ yᵢ xᵢᵀx + b (7)

In the case of non-linear SVMs, the input data space can be mapped onto a higher dimensional feature space, as illustrated in Figure 3. If every data point in the input data space is mapped onto the higher dimensional feature space via a mapping φ, the inner product in that space can be represented as [18]:
K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ) (8)

This is also called a kernel function; it is computed as an inner (dot) product in the feature space. Various kernel functions can be used to perform the said mapping, as given in equation (9) [23]:

Linear kernel: K(xᵢ, xⱼ) = xᵢᵀxⱼ
Polynomial kernel: K(xᵢ, xⱼ) = (1 + xᵢᵀxⱼ)ᵖ
Gaussian radial basis kernel: K(xᵢ, xⱼ) = exp(−‖xᵢ − xⱼ‖² / 2σ²)
Sigmoid kernel: K(xᵢ, xⱼ) = tanh(β₀ xᵢᵀxⱼ + β₁) (9)
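As a concrete illustration, the four kernels of equation (9) can be sketched in a few lines of Python (a minimal sketch; the parameter names p, sigma, beta0 and beta1 follow the symbols above):

```python
import math

def linear(xi, xj):
    # K(xi, xj) = xi^T xj
    return sum(a * b for a, b in zip(xi, xj))

def polynomial(xi, xj, p=2):
    # K(xi, xj) = (1 + xi^T xj)^p
    return (1 + linear(xi, xj)) ** p

def gaussian_rbf(xi, xj, sigma=1.0):
    # K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / (2 * sigma ** 2))

def sigmoid(xi, xj, beta0=1.0, beta1=0.0):
    # K(xi, xj) = tanh(beta0 * xi^T xj + beta1)
    return math.tanh(beta0 * linear(xi, xj) + beta1)
```

For identical vectors the Gaussian RBF kernel evaluates to its maximum value of 1, since the squared distance between them is zero.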
One of the major advantages of SVM is that feature selection is automatically taken care of by it, so one need not separately derive features.
In this paper the SVM classification methodology is applied to two different data set formats. The first format is a comma separated value (CSV) file, which has all relevant attributes necessary for the said classification separated by commas. The data sets used in this category are taken from the bird species occurrences of North-east India [24]. The second format of data sets for classification is the raster format [25]. A raster image is a collection of pixels represented in matrix form and can be stored in varying formats; the raster format used herein is TIFF. A map of the Andhra Pradesh state in India is used.
The data under consideration is first preprocessed [26]. In the case of the CSV data sets comprising information on the birds of North-east India, the attributes considered are id, family, genus, specific_epithet, latitude, longitude, verbatim_scientific_name, verbatim_family, verbatim_genus, verbatim_specific_epithet and locality. A variable called churn acts as the class label, which categorizes the data into two categories, viz. one having data sets of birds from the Darjeeling area and the other having data sets of birds belonging to other north-eastern parts of India. Before applying the classification, the data sets are cleaned to remove any missing values. In the case of the raster data set, a TIFF image is used. The image comprises a map of Andhra Pradesh, a state in India. Initially a region of interest (ROI) is captured and later the supervised SVM classification methodology is applied. The algorithm that explains the implementation of SVM is given below [27]:

Begin
Step 1: Loop over the n data items
Step 2: Divide the input data set into two sets of data corresponding to the two different categories
Step 3: If a data item is not assigned to either of the regions mentioned, then add it to the set of support vectors V
Step 4: End loop
End
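The decision step that the above pseudocode relies on, equations (2) and (3), can be sketched in Python as follows (a toy example with hand-picked weights m and bias b, not a full SVM trainer):

```python
def sgn(x):
    # Sign function of equation (3)
    if x > 0:
        return 1
    if x == 0:
        return 0
    return -1

def classify(x, m, b):
    # Decision function f(x) = sgn(m^T x + b) of equation (2)
    return sgn(sum(mi * xi for mi, xi in zip(m, x)) + b)

# Toy hyperplane x1 + x2 - 1 = 0: points above it map to +1,
# points below to -1, and points exactly on it to 0
m, b = [1.0, 1.0], -1.0
labels = [classify(x, m, b) for x in ([2.0, 2.0], [0.0, 0.0], [0.5, 0.5])]
```

A point lying exactly on the hyperplane yields 0, which in practice cannot occur for the support vectors, since the constraint yᵢ(mxᵢ + b) ≥ 1 keeps every training point at least a margin away from the hyperplane.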
A total of 695 data set records act as the test data set and are used to validate the classification results obtained for the CSV data sets; in the case of the TIFF raster data sets, one region of interest is extracted from the given input image. The proposed method has been implemented under the environment setting shown in Table 1. Classification accuracy can be measured using the parameters of a confusion (or error) matrix view, depending on whether an event is correctly classified or not, as shown in Table 2 [9]. The classified results for the CSV format data sets are demonstrated in Figure 4. The evaluation metrics are given by equations (10), (11), (12) and (13):

Accuracy = (TP + TN) / (TP + TN + FP + FN) (10)
Sensitivity = TP / (TP + FN) (11)
Specificity = TN / (TN + FP) (12)
Kappa statistics = Sensitivity + Specificity − 1 (13)

The efficiency of the proposed SVM classifier is evaluated using the said parameters. The confusion or error matrix view for the SVM classifier while classifying the CSV data sets is given in Table 3, and the corresponding view for the raster TIFF data sets is given in Table 4. Performance measures computed using equations (10), (11), (12) and (13) are specified in Table 5.
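To make the evaluation concrete, the following Python sketch recomputes the CSV results of Table 3 (TN = 571, FP = 1, FN = 7, TP = 116). Note that the kappa value of 95.97 in Table 5 matches Cohen's kappa, as reported by standard tools such as R's caret package, rather than the simplified Sensitivity + Specificity − 1 form of equation (13):

```python
def evaluate(tn, fp, fn, tp):
    n = tn + fp + fn + tp
    accuracy = (tp + tn) / n              # equation (10)
    sensitivity = tp / (tp + fn)          # equation (11)
    specificity = tn / (tn + fp)          # equation (12)
    # Cohen's kappa: observed agreement corrected for chance agreement,
    # with p_e computed from the row and column totals of the matrix
    p_o = accuracy
    p_e = ((tn + fp) * (tn + fn) + (fn + tp) * (fp + tp)) / (n * n)
    kappa = (p_o - p_e) / (1 - p_e)
    return accuracy, sensitivity, specificity, kappa

# Table 3 confusion matrix for the CSV data sets (695 test records)
acc, sens, spec, kappa = evaluate(tn=571, fp=1, fn=7, tp=116)
# acc * 100 rounds to 98.85 and kappa * 100 to 95.97, matching Table 5
```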
In this paper the SVM classification method is used to build a classification model for two data sets. The first data set is in CSV format and the second one is a raster TIFF image. The classification model is then validated against a test data set which is a subset of the input data set. The performance of SVM is calculated using kappa statistics and accuracy parameters, and it is established that for the given data sets SVM classifies the raster image data set with better accuracy than the CSV data set. The SVM classification methodology discussed herein can in future help in environment monitoring, land use, mineral resource identification, classification of remote sensed data into roads and land, etc.
Table 1: Environment Setting

Item | Capacity
CPU | Intel CPU G645 @ 2.9 GHz processor
Memory | 8 GB RAM
OS | Windows 7 64-bit
Tools | R, R Studio, Monteverdi tool
b) Result Analysis
Table 2: Confusion matrix view

Real group | Classification result: No Event | Classification result: Event
No Event | True Negative (TN) | False Positive (FP)
Event | False Negative (FN) | True Positive (TP)
Table 3: Confusion matrix view for the CSV data sets

Prediction | Reference: Other parts | Reference: Darjeeling
Other parts | 571 | 1
Darjeeling | 7 | 116
Table 4: Confusion matrix view for the raster TIFF data sets

Prediction | Reference: Land | Reference: Water
Land | 78 | 0
Water | 0 | 56
Table 5: Performance measures for the two data sets

Data set type | Accuracy (%) | Kappa statistics (%)
CSV data sets | 98.85 | 95.97
Raster TIFF data sets | 100 | 100
We express our sincere gratitude to the providers of the CSV data, which was accessed via the GBIF data portal.
Class-specific GMM based intermediate matching kernel for classification of varying length patterns of long duration speech using support vector machines. Speech Communication, vol. 57, Feb. 2014, pp. 126-143, ISSN 0167-6393. doi:10.1016/j.specom.2013.09.010.
Graphical Representation and Exploratory Visualization for Decision Trees in the KDD Process. Procedia - Social and Behavioral Sciences, vol. 73, 27 Feb. 2013, ISSN 1877-0428. doi:10.1016/j.sbspro.2013.02.033.
Learning Bayesian classifiers from positive and unlabeled examples. Pattern Recognition Letters, vol. 28, 1 Dec. 2007, pp. 2375-2384, ISSN 0167-8655. doi:10.1016/j.patrec.2007.08.003.
Using global maps to predict the risk of dengue in Europe. Acta Tropica, vol. 129, Jan. 2014, ISSN 0001-706X. doi:10.1016/j.actatropica.2013.08.008.
A survey of image classification methods and techniques for improving classification performance. International Journal of Remote Sensing, vol. 28, no. 5, 2007. doi:10.1080/01431160600746456.
Forest classification trees and forest support vector machines algorithms: Demonstration using microarray data. Computers in Biology and Medicine, vol. 40, no. 5, May 2010, pp. 519-524, ISSN 0010-4825. doi:10.1016/j.compbiomed.2010.03.006.
The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems, vol. 50, no. 3, Feb. 2011, pp. 559-569, ISSN 0167-9236. doi:10.1016/j.dss.2010.08.006.
Analysis of Parametric & Non Parametric Classifiers for Classification Technique using WEKA. I.J. Information Technology and Computer Science, vol. 7, July 2012 (published online). doi:10.5815/ijitcs.2012.07.06a.
Multiple support vector machines for land cover change detection: An application for mapping urban extensions. ISPRS Journal of Photogrammetry and Remote Sensing, vol. 61, Nov. 2006, pp. 125-133, ISSN 0924-2716. doi:10.1016/j.isprsjprs.2006.09.004.
Study on Recognition of Bird Species in Minjiang River Estuary Wetland. Procedia Environmental Sciences, vol. 10, 2011, ISSN 1878-0296. doi:10.1016/j.proenv.2011.09.386.
Eyas El-Qawasmeh. Performance of KNN and SVM classifiers on full word Arabic articles. Advanced Engineering Informatics, vol. 22, no. 1, Jan. 2008, pp. 106-111, ISSN 1474-0346. doi:10.1016/j.aei.2007.12.001.
Near-miss narratives from the fire service: A Bayesian analysis. Accident Analysis & Prevention, vol. 62, Jan. 2014, pp. 119-129, ISSN 0001-4575. doi:10.1016/j.aap.2013.09.012.
Support vector machine for multiclassification of mineral prospectivity areas. Computers & Geosciences, vol. 46, Sept. 2012, pp. 272-283, ISSN 0098-3004. doi:10.1016/j.cageo.2011.12.014.
Comparison of random forest, support vector machine and back propagation neural network for electronic tongue data classification: Application to the recognition of orange beverage and Chinese vinegar. Sensors and Actuators B: Chemical, vol. 177, Feb. 2013, pp. 970-980, ISSN 0925-4005. doi:10.1016/j.snb.2012.11.071.
Magnetic resonance brain images classification using linear kernel based Support Vector Machine. Nirma University International Conference on Engineering (NUiCONE), 6-8 Dec. 2012. doi:10.1109/NUICONE.2012.6493213.
Nicolás Bellinfante-Crocci. Predicting the potential habitat of oaks with data mining models and the R system. Environmental Modelling & Software, vol. 25, July 2010. doi:10.1016/j.envsoft.2010.01.004.
Comparison of support vector machine, neural network, and CART algorithms for the land-cover classification using limited training data points. ISPRS Journal of Photogrammetry and Remote Sensing, vol. 70, June 2012, pp. 78-87, ISSN 0924-2716. doi:10.1016/j.isprsjprs.2012.04.001.