# Introduction # R Tool R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and Mac OS [12] R is public domain software primarily used for statistical analysis and graphic techniques [10]. A core set of packages is included with the installation of R, with more than 7,801 additional packages (as of January 2016[update]) available at the Comprehensive R Archive Network (CRAN), Bio conductor, Omegahat, Git Hub, and other repositories. [14] R tool provides a wide class of statistical that includes classical statistical tests, linear and nonlinear modeling, classification, timeseries analysis, clustering and various graphical functions. [13] R uses collections of packages to perform different functions [11]. CRAN project views provide numerous packages to different users according to their taste. R package contain different functions for data mining approaches. This paper compares various clustering algorithms on datasets using R which will be useful for researchers working on medical data and biological data as well. For this IDE, R Studio is used refer the below Figure 1. 17 Year 2016 ( ) C luster analysis divides data into meaning full groups (clusters) which share common characteristics i.e. same cluster are similar to each other than those in other clusters. It is the study of automatically finding classes. A web page especially news articles which are flooded in the internet have to be grouped. The clustering of these different groups is a step forward towards the automation process, which requires many fields, including web search engines, web robots and data analysis. Any new web page goes through numerous phases including data acquisition, preprocessing, Feature extraction, classification and post processing into the database. Cluster analysis can be regarded as a form of the classification which creates a labeling of objects with class labels. However it derives these labels only from the data. Data mining functionalities are the Characterization and discrimination, mining frequent patterns, association, correlation, classification and prediction, cluster analysis, outlier analysis and evolution analysis [1]. Clustering is a vivid method. The solution is not exclusive and it firmly depends upon the analysts' choices. Clustering always provides groups or clusters, even if there is no predefined structure. While applying cluster analysis we are contemplating that the groups exist. But this speculation may be false. The outcome of clustering should never be generalized. [9]. # Clustering Algorithms a) K-Means The term "k-means" was first used by James Macqueen in 1967 [2], though the idea goes back to 1957 [4]. The K-means algorithm is the most commonly used and simplest method among all partitional clustering algorithms. As Harington and Wong1979 mentioned it is an iterative method which minimizes the sum of the squares for a given number of Clusters. Here's how the algorithm works [5]: 1. Select k points as initial centroids. 2. Repeat 3. From K clusters by assigning each point to its closest centroid. 4. Recompute the centroid of each cluster. 5. Until centroids do not change. K-Means reaches a state in which no points are shifting from one cluster to another e.g. repeating until only 1% of the points change clusters. For measuring the quality of the clustering we measure Sum of the squared error (SSE) or scatter. SSE=? ? ????????(?? ?? , x) 2 ????? ?? ?? ??=1 Where dist is standard Euclidean distance between two objects in Euclidean space. The centroid (mean) of the ith cluster that minimizes the SSE is defined as ? ? ????????(?? ?? , x) 2 ????? ?? ?? ??=1 The advantage of this method is highly scalable of the huge sum of data sets with O(n * k * r) where r is the number of rounds, where n represent number of data items, k represent numbers of clusters [14]. It has user defined constant K and Runtime is totally dependent on the initial pick of centroids. # b) K-Means Implementation using R For this analysis we have considered Online News Popularity datasets which consists of articles published by Mash able (www.mashable.com) [4]. Instances are 39797 and Number of Attributes is 61. As the results of the k means are undeterministic, we have followed the practice of running multiple rounds of k means so performed on various k values as k=3, k=5 and k=10. The best iteration is one who minimizes the average distance of each point to its assigned centroid. # Fig 2 : K means plot with k=5 The Figure 2 shows the results of clustering the Online News Popularity datasets with 39645 number of News articles URL's. The above results show that there is overlapping of clusters. So preprocessing is required to address this problem and omit the NA values also. Then the following is the code after data cleaning [15] Available components [1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size" [8] The results are showing the information about cluster means, clustering vector, sum of square by cluster and available components. The fpc package is used to draw the discriminant projection plot using Plotcluster function (Fig3). The result of plotting the class returned by function application is shown below. The Figure 4 shows the parallel coordinators plot to see the variables contributed in each cluster. # c) Fuzzy C-Means Fuzzy c means clustering (FCM), each data point has a fraction of membership to each cluster. This algorithm works iteratively until no further clustering is possible. The membership fraction that minimizes the expected distance to each centroid has to be calculated. The algorithm is very similar to K-Means, except that a matrix (row is each data point, column is each centroid, and each cell is the degree of membership) is used. 1. Initialize the membership matrix U 2. Repeat step (3), ( 4) until converge 3. Compute location of each centroid based on the weighted fraction of its member data point's location. # Update each cell as follows Notice that the parameter m is the degree of fuzziness. The output is the matrix with each data point assigned a degree of membership to each centroids. # e) Hierarchical Clustering Hierarchical Clustering a series of partitions takes place, which may run from a single cluster containing all objects to n clusters each containing a single object. This clustering builds a cluster hierarchy or, in other words, a tree of clusters, also known as a dendrogram [7]. Hierarchical algorithms can be further categorized into two kinds [3] i. Agglomerative clustering: This clustering starts with n clusters and iteratively merges the number of clusters which are most similar objects or clusters, respectively, until only one cluster is remaining (n ? 1).This requires the defining of closest proximity. ii. Divisive clustering: This clustering starts with one cluster and iteratively splits a cluster until singleton clusters of individual points remain, so that the heterogeneity is reduced as far as possible (1 ? n).This requires the decision of splitting at each step. The Hierarchical Clustering algorithm [6] below takes an n×n distance matrix d input and increasingly gives n different partitions of the data as the tree it outputs result. The largest partition has n single-element clusters, with every element forming its own cluster. The second-largest partition aggregates the two closest clusters from the largest partition, and thus has n ? 1 clusters. In general, the ith partition combines the two closest clusters from the (i ? 1)th partition and has (n ? i + 1) clusters. Because of the additional complexity of keeping data in a sorted list or heap, so the time required is O(m 2 log m) and Space required is O(m 2 ). In this approach, it compares all pairs of data points and merges the one with the closest distance. Algorithm 1: Compute the proximity matrix if necessary 2: repeat 3: Merge the closest two clusters. 4: Update the proximity matrix to reflect the proximity between the new cluster and initial cluster 5: Until only one cluster remains. The Proximity (C i, Cj) of clusters C i and C j, which are of the size m i and m j ,respectively is expressed as ?????????????????? (?? ?? ,?? ?? ) = ? ?????????????????? (??,??) ????? ?? ?? ??? ?? ?? ?? * ?? ?? The data set considered is micro RNA expressions. It is actually collected from Fresh paired tumor and control samples from the PAC (Periampullary Carcinoma) patients undergoing Whipple's pan creaticoduodenectomys Data Mining in Health Informatics [15] is an emerging discipline, concerned with developing methods for exploring the unique type of data that come from Health Care database > result$membership[1:3,] 1 2 3 management system. We have also considered the Iris dataset. The code for the implementation is given as follows [16] Figure 6 shows the results of clustering. # Fig. 6 : Hierarchical Clustering The Authors [14 ] have also performed hierarchical clustering on PAC tumors dataset which are distinct from counterpart normal pancreas, normal duodenum, and normal distal CBD and normal ampulla. Unsupervised hierarchical cluster analysis of miRNA Expression profiles were clustered of PAC tumors into pancreatobiliary (n ¼ 23) and intestinal subtypes (n ¼ 17), while normal pancreas (n ¼ 22), normal duodenum (n ¼ 6), normal distal CBD (n ¼ 6) normal ampulla (n ¼ 6) are clustered as different entities (Figure 7). V. ![Author ? : Department of CSE, Methodist college of Engg. & Tech., OU, Hyderabad. e-mail: lavanya.post@gmail.com Author ? : Department of CSE, S R Engineering College, JNT University, Warangal. Author ? : Department of CSE, School of IT, JNT University, Hyderabad.II.](image-2.png "") ![Fig 1: R tool Studio](image-3.png "C") ![Within cluster sum of squares by cluster:[1] 3437391 3417672 3385646 3069653 3279165 (between_SS / total_SS = 92.1 %)](image-4.png "") 34![Fig. 3 : Preprocessed K-means plot k=5](image-5.png "Fig. 3 :Fig 4 :") 5![Fig. 5 : Fuzzy C means clustering plot The observation of the above results and of the Online News Popularity datasets are FCM is mainly dependent on the initial clustering and the computation time is very high for the large data sets. The weight and accuracy are inversely proportional. It is Sensible to noise and membership degree for outliers or noisy points is very low.](image-6.png "Fig. 5 :") 7![Fig. 7 : The microRNA expression profiles of PAC](image-7.png "Fig. 7 :") ![](image-8.png "") © 2016 Global Journals Inc. (US) ## Acknowledgements The authors would like to thank Dr. K.S.S Uma Mahesh (M.D Biochemistry) and his lab for sharing their data of PAC tumor samples (microRNA). * JHan MKamber Data Mining: Concepts and Techniques San Francisco Morgan Kaufmann Publishers 2000 * Introduction to Data Mining Pang-NingTan MichaelSteinbach VipinKumar May 12, 2005 Addison Wesley US edition * References Références Referencias * Data Mining with R: learning by case studies Luis Torgo(kmeans) * A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News KFernandes PVinagre PCortez Proceedings of the 17th EPIA 2015 -Portuguese Conference on Artificial Intelligence the 17th EPIA 2015 -Portuguese Conference on Artificial IntelligenceCoimbra, Portugal September * JiaweiHan MichelineKamber Data Mining: Concepts and Techniques Morgan Kaufmann Publishers 2006 second Edition * Optimization and Simplification of Hierarchical Clustering DougFisher KDD HC * A Comparative Study of Various Clustering Algorithms in Data Mining ManishVerma MaulySrivastava NehaChack AtulKumar Diswar NidhiGupta International Journal of Engineering Research and Applications (IJERA) 2 May-Jun 2012 * AFrank AAsuncion Uci Machine Learning Repository Irvine, CA 2010 University of California, School of Information and Computer Science * Density connected clustering with local subspace preferences CB¨ohm KKailing H.-PKriegel PKr¨oger Proceedings of the 4th International Conference on Data Mining (ICDM) the 4th International Conference on Data Mining (ICDM) 2004 * Springer 11. R and Data Mining: Examples and Case Studies 1 Yanchang Zhao RafaelA IrizarryRobert Gentleman JVincent WolfgangCarey Sandrine Dudoit Huber * SatishKumar Analysis Clustering Techniques in Biological Data with R, International Journal of Computer Science and Information Technologies (IJCSIT) 6 2 2015 * MicroRNA profiling in periampullary carcinoma Kss Uma Mahes MMKuruva SMitnala TRupjyoti RGVenkat SBotlagunta Pancreatology 14 2014 * A Heuristic k-Means Algorithm with Better Accuracy and Efficiency for Clustering Health Informatics Data AbdulNazeer KASebastian MPKumar SD Journal of Medical Imaging and Health Informatics 1 2011 American Scientific Publisher