# I. Introduction

Document clustering [1], [2], [3], [4] techniques are relevant to a wide range of tasks, from a simple search with a few terms to large-scale information retrieval processes. The early document clustering techniques were developed mainly to enhance information retrieval systems [5]. They were designed to find documents matching a query, but could not create a query, generate a synopsis of the documents, or provide an interface to the search results. The growth of the internet, digital libraries, news sources and company-wide intranets has made huge volumes of text documents available. The rapid increase in the already large volume of web data, and the need to organize web documents into a moderate number of relevant clusters, have led to the development of a large number of web clustering engines and high-performing clustering algorithms.

The process of document clustering involves four stages: i) data collection, which comprises crawling to accumulate the documents, indexing the set of documents in a structured fashion, and filtering the data with techniques such as tokenization, stop-word removal, stemming and lemmatization; ii) preprocessing, where the data is represented in a suitable form (for example, as vectors) and measures are applied to determine similarity; iii) document clustering, where a clustering technique and an efficient clustering algorithm are chosen for clustering based on preset criteria; and iv) post-processing, which adapts the document clustering technique to the requirements of business and scientific applications.
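As an illustration of the filtering step in stage i), the following minimal Python sketch chains tokenization, stop-word removal and stemming. The stop-word list and the suffix-stripping rule in `crude_stem` are toy assumptions made for this example; they stand in for the much larger lists and proper stemmers (such as Porter's) that a real system would use.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are far larger).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Strip a few common English suffixes (a toy stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left untouched.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem the remaining tokens."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


print(preprocess("The clustering of documents is studied in information retrieval"))
# ['cluster', 'document', 'studi', 'information', 'retrieval']
```

The stem "studi" shows why such rules are only approximate: stems need not be dictionary words, only consistent across the variants of a term.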

The applications of document clustering are diverse and include i) creation of document taxonomies; ii) IR processes of search, access and collection [6], identification of similar documents, review and classification of results [7], automatic topic extraction [8] and content summarization; iii) recommendation systems; and iv) search optimization. For instance, these processes are used extensively in data classification, as in the Google Web Directory and social media data classification.

Although clustering techniques have been studied for many years, they still face many of the same challenges. The main challenges [9,10] of document clustering are i) the huge volume of data, ii) the high dimensionality of the feature space, iii) finding a clustering method that is feasible under constraints such as cluster quality and performance, and iv) representing the results in an effective browsing interface. A current challenge in text clustering is the need for dynamic clustering techniques that incrementally update clusters as new data is added [11,12]. For instance, social media platforms have to generate user-specific content [13] instantly, which requires real-time data clustering methodologies.
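The idea of incrementally updating clusters as new data arrives can be sketched as follows. This is a generic single-pass scheme written for illustration, not the method of [11] or [12]: the class name `IncrementalClusterer`, the use of Jaccard similarity over term sets, and the threshold value are all assumptions of this example.

```python
def jaccard(a, b):
    """Jaccard similarity between two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class IncrementalClusterer:
    """Single-pass clusterer: each document is placed once, on arrival."""

    def __init__(self, threshold):
        self.threshold = threshold  # minimum similarity needed to join a cluster
        self.clusters = []          # each item: {"terms": set, "docs": list}

    def add(self, doc_id, tokens):
        """Assign the document to the best cluster or open a new one."""
        terms = set(tokens)
        best_index, best_score = None, 0.0
        for index, cluster in enumerate(self.clusters):
            score = jaccard(terms, cluster["terms"])
            if score > best_score:
                best_index, best_score = index, score
        if best_index is not None and best_score >= self.threshold:
            cluster = self.clusters[best_index]
            cluster["docs"].append(doc_id)
            cluster["terms"] |= terms  # update the cluster profile in place
        else:
            self.clusters.append({"terms": terms, "docs": [doc_id]})
            best_index = len(self.clusters) - 1
        return best_index


clusterer = IncrementalClusterer(threshold=0.3)
stream = [
    ("d1", ["stock", "market", "price"]),
    ("d2", ["market", "price", "index"]),
    ("d3", ["football", "match", "goal"]),
    ("d4", ["goal", "match", "team"]),
]
print([clusterer.add(doc_id, tokens) for doc_id, tokens in stream])
# [0, 0, 1, 1]
```

Because existing assignments are never revisited, the cost per new document depends only on the number of clusters, which is what makes schemes of this kind attractive for real-time settings; the price is sensitivity to the arrival order of the documents.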

The remainder of this paper is organized as follows. Section II discusses the taxonomy of document clustering, Section III evaluates the contemporary literature on clustering techniques, and Section IV gives the conclusion of the paper.


# II. Taxonomy

Clustering can be expressed as a function that maps a document set D to a set of clusters. Under the specified constraints, the minimum or maximum of this function defines the difficulty of the clustering problem, and the algorithms applied over the similarity criteria determine the clustering quality.

In the preprocessing step, document similarity is determined with methods based on the following strategies: (i) phrase or pair-wise methodology, (ii) tree-form data representation, (iii) component-dependent data representation, (iv) semantic-relation-dependent document representation, and (v) concept and feature-vector-dependent representation.

The clustering methods are generally of two types: 1) word-pattern and phrase based, and 2) feature based.

Clustering algorithms are mostly of two types: 1) hierarchical methods and 2) partitioning (non-hierarchical) methods [14,15,16]. Hierarchical clustering algorithms represent data sets as a cluster tree and are of two types: 1-1) agglomerative [17] and 1-2) divisive hierarchical clustering methods. Partitional clustering algorithms [17] are of two types: 2-1) iterative and 2-2) single-pass methods. K-means and its variants are the popular partitioning methods. Hierarchical clustering algorithms are considered more effective than the other algorithms [18]; however, because of their inherent complexity they are not applicable to huge document sets.
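To make the agglomerative case concrete, the sketch below implements a naive single-link agglomerative clustering over points in the plane. The function name, the Euclidean distance and the stopping rule (a target number of clusters) are assumptions chosen for this illustration; document clustering would normally use a text similarity measure instead.

```python
import math


def single_link_agglomerative(points, num_clusters, distance=math.dist):
    """Merge the two closest clusters until num_clusters remain.

    Returns clusters as sorted lists of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > num_clusters:
        best = None  # (distance, index_a, index_b) of the closest cluster pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: cluster distance is that of the closest member pair.
                d = min(distance(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return [sorted(c) for c in clusters]


points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(single_link_agglomerative(points, 3))
# [[0, 1], [2, 3], [4]]
```

Every merge step rescans all pairs of clusters and all pairs of their members, so this naive version grows roughly cubically with the number of documents, which illustrates why hierarchical methods become impractical for huge document sets.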

Techniques for determining inter-cluster similarity in classification [19,20], e.g. single link, and for improving cluster quality where cluster sizes differ or fluctuate by a large factor [17], especially in the case of high-performing clustering algorithms, have been studied widely in recent years.

The widely used document clustering methods are spectral clustering, LSI-dependent cluster development and NMF-based clustering. The spectral clustering methods [21] include LPI, LSI, etc. Latent semantic indexing (LSI) [22], a feature extraction approach [23], tries to find the best low-dimensional approximation of the original document space and is a widely used linear document indexing method [24]. LSI is inapplicable to very large document collections [24]; similarly, when spectral clustering is used in a high-dimensional space, the dimensionality reduction is very costly, which limits its usability.

The word-pattern and phrase-based approaches are the traditional strategies, where clustering depends on document features such as words, phrases and sequences [25,26]. These methods are of four types: 1-1) clustering with frequent word patterns; 1-2) application of word clusters in document clusters; 1-3) co-clustering of words and documents, including co-clustering with graph partitioning and information-theoretic co-clustering; and 1-4) clustering based on frequent phrases. The vector space model (VSM) is used in almost all document clustering methods in use today [27]. It is a data model that represents a document as a feature vector of the terms it contains.
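A minimal sketch of the vector space model follows: each document becomes a sparse vector of raw term counts, and two documents are compared by the cosine of the angle between their vectors. The function names and the choice of raw counts as weights are assumptions of this example; weighted variants are common in practice.

```python
import math
from collections import Counter


def term_vector(tokens):
    """Represent a document as a sparse vector of raw term counts."""
    return Counter(tokens)


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_u * norm_v)


doc_a = term_vector("web document clustering".split())
doc_b = term_vector("document clustering survey".split())
doc_c = term_vector("protein folding".split())

print(round(cosine_similarity(doc_a, doc_b), 3))  # 0.667
print(cosine_similarity(doc_a, doc_c))            # 0.0
```

Storing only the non-zero entries matters because the vocabulary of a collection is large while any single document uses a small part of it.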

The feature-based clustering approaches are of two types: 2-1) feature extraction and 2-2) feature selection.

Feature extraction algorithms are of two types: i) linear and ii) nonlinear techniques. Examples of linear algorithms are unsupervised PCA, OCA, MMC, etc. Examples of nonlinear algorithms are LLE, Laplacian Eigenmaps and ISOMAP. The linear methods show better operational performance than the nonlinear approaches, but underperform when clustering the huge and complicated data of the internet. Feature extraction finds applications in IR based on human language learning ability, in comparing reviewed and submitted papers across various languages or networks, and in data filtering.

Feature selection algorithms are of two types: 2-2-1) feature ranking, which is metric based, and 2-2-2) subset selection from the possible features. They also fall into two categories: i) supervised and ii) unsupervised. The supervised feature selection algorithms, such as IG, CHI and MI, are the most researched and the most widely used. The most popular unsupervised methods are i) DF-based selection, selection dependent on term strength, and ranking dependent on entropy or term contribution; ii) LSI-based methods; and iii) NMF-based methods. Such unsupervised techniques, along with decision trees, statistics, NLP and ML, are being used in BI and analytics, in neural networks for developing AI or bio-neural networks, in rule-based AI systems for intelligent content development, database development and information retrieval, and in the automatic grouping of web documents with enterprise search engines or open-source software in web mining and text mining.
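DF-based selection, the simplest of the unsupervised methods listed above, can be sketched as a filter on document frequency. The function name and the two cut-off parameters (`min_df`, `max_df_ratio`) are assumptions made for this illustration.

```python
from collections import Counter


def select_by_document_frequency(docs, min_df, max_df_ratio):
    """Keep terms found in at least min_df documents and in at most
    max_df_ratio of all documents (no class labels are needed)."""
    document_frequency = Counter()
    for tokens in docs:
        document_frequency.update(set(tokens))  # count each term once per document
    upper_limit = max_df_ratio * len(docs)
    return sorted(term for term, df in document_frequency.items()
                  if min_df <= df <= upper_limit)


docs = [
    ["cluster", "text", "the"],
    ["cluster", "web", "the"],
    ["text", "mining", "the"],
    ["graph", "the"],
]
print(select_by_document_frequency(docs, min_df=2, max_df_ratio=0.75))
# ['cluster', 'text']
```

Rare terms ("web", "mining", "graph") are dropped because they cannot link documents, and the ubiquitous term "the" is dropped because it cannot separate them.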

The most commonly used feature selection strategies are i) wrapper, ii) filter and iii) embedded methods [28]; however, a study [29] has shown that supervised feature selection methods using the filter metric IG are the most efficient among the techniques compared.
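For reference, the information gain (IG) of a term with respect to the class labels is the class entropy minus the expected class entropy after splitting the collection on the presence of the term. The sketch below computes it for a toy labelled collection; the function names and the data are assumptions of this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """IG(term) = H(C) - [P(t) H(C | t) + P(not t) H(C | not t)]."""
    with_term = [label for doc, label in zip(docs, labels) if term in doc]
    without_term = [label for doc, label in zip(docs, labels) if term not in doc]
    total = len(labels)
    conditional = (len(with_term) / total) * entropy(with_term) \
        + (len(without_term) / total) * entropy(without_term)
    return entropy(labels) - conditional


docs = [{"goal", "match"}, {"goal", "team"}, {"stock", "market"}, {"stock", "price"}]
labels = ["sport", "sport", "finance", "finance"]

print(round(information_gain(docs, labels, "goal"), 3))  # 1.0
print(round(information_gain(docs, labels, "team"), 3))  # 0.311
```

A filter method of this kind ranks every term by its score and keeps the top-ranked ones, without running the clustering or classification algorithm itself.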


# III. Contemporary Affirmation of the Recent Literature

The bisecting k-means algorithm proposed by Steinbach, Karypis and Kumar [14] repeatedly splits a large cluster into smaller clusters until k clusters of high internal similarity are obtained, so that similar texts are collected together.
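The bisecting procedure can be sketched as below. This is an illustrative reading rather than the exact procedure of [14]: it assumes Euclidean distance on numeric tuples, always splits the largest cluster, and seeds each 2-means split deterministically with the first point and the point farthest from it.

```python
import math


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(p[d] for p in points) / len(points)
                 for d in range(len(points[0])))


def two_means(points, iterations=20):
    """Split points into two groups with plain 2-means."""
    c0 = points[0]
    c1 = max(points, key=lambda p: math.dist(p, c0))  # farthest point from c0
    left, right = list(points), []
    for _ in range(iterations):
        left = [p for p in points if math.dist(p, c0) <= math.dist(p, c1)]
        right = [p for p in points if math.dist(p, c0) > math.dist(p, c1)]
        if not right:  # all points coincide, no split is possible
            break
        c0, c1 = mean(left), mean(right)
    return left, right


def bisecting_kmeans(points, k):
    """Repeatedly bisect the largest cluster until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        left, right = two_means(clusters.pop(largest))
        if not right:
            clusters.append(left)  # cannot split further, stop early
            break
        clusters.extend([left, right])
    return clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (30, 0)]
for cluster in bisecting_kmeans(points, 3):
    print(cluster)
# [(30, 0)]
# [(0, 0), (0, 1), (1, 0)]
# [(10, 10), (10, 11)]
```

Each split only looks at the points of one cluster, so the cost of a split shrinks as the clusters get smaller.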

A technique called CCA [30], widely used in emerging technologies such as ML, applies correlation to measure similar features in documents. However, CCA has its own limitations in clustering.


A spectral clustering approach based on a graph partitioning strategy, called LPI [31], has been proposed; however, it fails in feature selection and retains the existing problems of distance-based document clustering.

An approach for document clustering called Frequent Term-based Clustering, or HFTC [32], is a topic of extensive research. However, it does not scale to huge document collections.

Hierarchical Document Clustering using Frequent Itemsets (FIHC), proposed by Fung, Wang and Ester, is discussed in [33]. Although FIHC performs better than HFTC, it underperforms in clustering efficiency when compared to existing approaches such as UPGMA and bisecting k-means.

The TDC algorithm, which clusters on the basis of closed frequent itemsets, was proposed by Yu, Searsmith, Li and Han [34]. The algorithm performs better than HFTC and FIHC; however, its reliance on closed itemsets limits its appeal.

Hierarchical Clustering using Closed Interesting Itemsets (HCCI), proposed by Malik and Kender [35], is regarded as the best-performing of these clustering methods. However, the technique may cause information loss.

An approach based on the PSSM histogram by Gad and Kamel [36] combines text semantics with incremental clustering; it measures the similarity of documents in order to adjust the order in which documents are inserted into a cluster, thereby improving cluster quality.

An improved incremental clustering technique proposed by Gavin and Yue [37] improves the incremental categorization of web data. In this method, a new document is assigned to a cluster on the basis of multiple pieces of cluster-specific information.

An approach for improving text clustering by Shehata, S., Fakhri, K. and Mohamed S., S. [38], based on a concept-based mining model, outperforms existing techniques such as HAC and k-NN.

A progressive clustering algorithm by Liu, Ouyang, Sheng and Xiong (2008) [39], based on Cluster Average Similarity Area, determines cluster coherence and progressively assigns new data items to the clusters.

A technique that enhances clustering through the partial disambiguation of words by their part-of-speech (PoS) tags [40] is recommended by its developers, who found that considering synonyms and hypernymy is ineffective when the sense of a word is disambiguated solely by PoS tags.

The CFWS technique proposed by Y. Li and S. M. Chung enhances document processing by considering word sequences in addition to individual words [41].

The nonlinear data representation technique of J. B. Tenenbaum, V. de Silva and J. C. Langford [42] preserves specific local properties of the data while optimizing a global criterion; however, it is associated with high complexity.

A study of approaches for reducing the complexity of feature extraction, based on a new class of techniques called approximation algorithms [43], [44], [45], found them to perform well.

A software system by Zamir and Etzioni [46] for automatically retrieving information from websites is designed for websites comprising vast amounts of data.

An approach that integrates clustering and feature selection for text clustering, based on the semantic relations of text documents with an ontology, was proposed by M. Thangamani and P. Thangaraj [47]. The approach reduces dimensionality and improves feature selection.

A clustering technique that assesses clustering quality using WordNet [48], noun phrases and semantic relationships [49] shows better performance with a hypernymy-based strategy than with other noun-phrase strategies.

A system for determining the ontology-related semantic relations of a term or word and the associated weight measure is given by K. Raja and C. Prakash Narayanan [9]. However, the technique suffers from dimensionality and other problems.

Andreas Hotho [50] describes the task of ontology-based automatic categorization of web documents and the scope of ontologies for improving current machine learning and IR approaches. The integration of ontologies for combining various information types from multiple resources is presented by Young-Woo Seo et al. [51].

The use of domain-specific ontologies to enhance the performance of text classification, where text learning and IR are used to generate ontologies with minimal user interaction, is described in [52,53].

Methods that use a Wikipedia ontology, primarily to improve document representation and cluster quality, were proposed by Gabrilovich and Markovitch [54], and a further extension provided a structure based on the Wikipedia guidelines and groups [55,56]. The Wikipedia ontology is highly relevant because it applies to a large cross-section of domains and is restructured on a regular basis.

A technique for feature selection in text clustering by J. Xu, B. Xu et al. [57] applies supervised feature selection to intermediate clustering outcomes and generates an efficient feature subset for classification. The suggested technique performs efficiently compared to a manual process.

A feature selection technique based on the ACO algorithm by M. Janaki Meena, K. R. Chandran and J. Mary Brinda [58] is a unique method. Comparative tests of the approach against the existing chi-square and CHIR techniques show that the proposed approach achieves better performance in FS.

An entropy-based FS approach, i.e. a filter solution [59], was tested with various data types; it reduces dimensionality and is efficient in finding the subset of major features.

A feature co-selection method called MFCC (multi-type feature co-selection), proposed by Shen Huang, Zheng Chen, Yong Yu and Wei-Ying Ma [60], shows enhanced clustering performance on web documents based on the outcomes of intermediate clustering.

A method by F. Wang, P. Li and A. C. König that remodels the data similarity matrix as a bi-stochastic matrix before executing clustering algorithms showed better clustering performance [61].
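A bi-stochastic matrix is a non-negative matrix whose rows and columns each sum to one. The sketch below produces one from a positive similarity matrix with the classical Sinkhorn-Knopp row and column scaling. This is a swapped-in, well-known technique used only to illustrate the target structure; it should not be read as the learning procedure of [61], and the function name and example matrix are assumptions.

```python
def sinkhorn_knopp(matrix, iterations=200):
    """Alternately normalize the rows and columns of a positive square matrix
    so that it approaches a bi-stochastic matrix."""
    n = len(matrix)
    m = [row[:] for row in matrix]  # work on a copy
    for _ in range(iterations):
        for i in range(n):  # make every row sum to one
            row_sum = sum(m[i])
            m[i] = [value / row_sum for value in m[i]]
        for j in range(n):  # make every column sum to one
            col_sum = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= col_sum
    return m


similarity = [
    [1.0, 0.5, 0.1],
    [0.5, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
balanced = sinkhorn_knopp(similarity)
print([round(sum(row), 6) for row in balanced])
print([round(sum(balanced[i][j] for i in range(3)), 6) for j in range(3)])
```

Both printed lists should be very close to ones. Normalizing in this way removes the effect of documents that are "similar to everything", so that no single row dominates the subsequent clustering step.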

Term-based document clustering techniques for dynamic environments are given in [11] by Wang, Tang and Liu; techniques based on synonyms and hypernymy by Bharathi and Vengatesan [62]; and techniques based on synonyms and hyponyms by Nadig, Ramanand and Bhattacharyya [12]. However, these approaches are not applicable to technically similar documents.

A phrase-based document clustering approach [63], the STC technique by O. Zamir, O. Etzioni, O. Madani and R. M. Karp, builds clusters from the suffixes that documents have in common. Although the method is efficient in cluster quality, it is associated with a high amount of term redundancy.

A study of the TF-IDF method of clustering [64], term-frequency-dependent algorithms [65] and a review of clustering algorithms [66] showed that the majority of clustering approaches are TF-IDF based, but are associated with several problems.
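For reference, one common variant of TF-IDF weights a term by its raw count in the document multiplied by the logarithm of the inverse document frequency. The sketch below uses that variant; other formulations (for example with smoothing or length normalization) exist, and the function name is an assumption of this example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Weight each term by tf * log(N / df), one common TF-IDF variant."""
    num_docs = len(docs)
    document_frequency = Counter()
    for tokens in docs:
        document_frequency.update(set(tokens))  # count each term once per document
    weighted = []
    for tokens in docs:
        term_frequency = Counter(tokens)
        weighted.append({
            term: count * math.log(num_docs / document_frequency[term])
            for term, count in term_frequency.items()
        })
    return weighted


docs = [
    ["web", "cluster", "cluster"],
    ["web", "search"],
    ["web", "cluster"],
]
weights = tf_idf(docs)
print(weights[0]["web"])                 # 0.0
print(round(weights[0]["cluster"], 3))   # 0.811
print(round(weights[1]["search"], 3))    # 1.099
```

The term "web" occurs in every document and therefore receives zero weight, while the rare term "search" receives the highest weight per occurrence.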

The NMF (Nonnegative Matrix Factorization) technique has been applied to text classification [67], where it improved clustering performance compared to existing approaches [68], and its relationship to earlier clustering techniques has been studied [69], [70], [71]. A review of established NMF techniques such as multiplicative updates [72] and projected gradients [73] shows that, although efficient, they run into memory problems for huge datasets that are streamed rather than disk based [74]. To overcome these problems, approaches such as random projections [61,75] and sketch/sampling algorithms [76] have been proposed. An NMF-based technique by Li and Zhu (2011) [77] for research-specific documents reduces high dimensionality, finds relevant topics for clustering and shows comparatively efficient classification performance. Other work includes a study of an online algorithm based on Nonnegative Matrix Factorization [78] and an NMF-based method by Sun Park, Dong Un An and Choi Im Cheon [79], which uses weighted features and cluster similarity and performs more efficiently than the remaining NMF-based strategies.
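The multiplicative update rules referred to above can be sketched in plain Python as follows. The code factorizes a non-negative matrix V into W and H by alternating the two standard updates for the squared-error objective. The helper names, the random initialization, the fixed iteration count and the small constant `eps` (added to avoid division by zero) are assumptions of this illustration.

```python
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def squared_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factorize V (rows x cols) into W (rows x rank) and H (rank x cols)."""
    rng = random.Random(seed)
    rows, cols = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(rows)]
    h = [[rng.random() + 0.1 for _ in range(cols)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        numer, denom = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * numer[a][j] / (denom[a][j] + eps) for j in range(cols)]
             for a in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        numer, denom = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * numer[i][a] / (denom[i][a] + eps) for a in range(rank)]
             for i in range(rows)]
    return w, h


# Toy term-document matrix with two obvious topics (rows: terms, columns: documents).
v = [[2, 2, 0, 0],
     [2, 2, 0, 0],
     [0, 0, 3, 3],
     [0, 0, 3, 3]]
w, h = nmf(v, rank=2)
print(squared_error(v, w, h))
```

In document clustering, each column of H is typically read as the membership of a document in the latent topics, and a document is assigned to the topic with the largest coefficient. Both factors are held in memory in full, which is the limitation that the streaming, random-projection and sampling approaches cited above try to address.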

# IV. Conclusion

In this paper we analyzed several techniques developed for clustering documents, together with their applications and their relevance to today's requirements. Developing perfect strategies that classify varied forms and types of documents with a near-optimal solution, or finding accurate ways of assessing the quality of a clustering, remains impossible and is growing in complexity. Nevertheless, the field today deals with demanding tasks such as granular taxonomy generation, sentiment analysis and document summarization, which generate reliable and relevant insights applicable to several fields. In conclusion, document clustering will continue to be widely studied and will find relevance in a number of newer areas.

Global Journal of Computer Science and Technology, Volume XV, Issue II, Version I, Year 2015. © 2015 Global Journals Inc. (US)
		
		
# References

* [1] S. Kotsiantis, P. Pintelas. Recent Advances in Clustering: A Brief Survey. WSEAS Trans. Information Science and Applications, 1(1), 2004.
* [2] W. Xu, Y. Gong. Document Clustering by Concept Factorization. Proc. Int'l Conf. Research and Development in Information Retrieval, July 2004.
* [3] S. Siersdorfer, S. Sizov. Restrictive Clustering and Metaclustering for Self-Organizing Document Collections. Proc. Int'l Conf. Research and Development in Information Retrieval, July 2004.
* [4] Brian S. Everitt, Sabine Landau, Morven Leese. Cluster Analysis, fourth edition. Oxford University Press, 2001.
* [5] Van Rijsbergen. London: Butterworths, second ed., 1989.
* [6] C. Carpineto, S. Osiński, G. Romano, D. Weiss. A survey of web clustering engines. ACM Comput. Surv., 41(3), 2009.
* [7] X. Liu, W. B. Croft. Cluster-based retrieval using language models. Proceedings of the 27th annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2004.
* [8] J. Silva, J. Mexia, A. Coelho, G. Lopes. Document clustering and cluster topic extraction in multilingual corpora. Proceedings of the 1st IEEE International Conference on Data Mining (ICDM), 2001.
* [9] K. Raja, C. Prakash Narayanan. Clustering Technique with Feature Selection for Text Documents. Proceedings of the Int. Conf. on Information Science and Applications (ICISA 2010), 6 February 2010.
* [10] Fabrizio Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1), March 2002.
* [11] X. Wang, J. Tang, H. Liu. Document clustering via matrix representation. 11th IEEE International Conference on Data Mining (ICDM 2011), 2011.
* [12] R. Nadig, J. Ramanand, P. Bhattacharyya. Automatic evaluation of WordNet synonyms and hypernymy. Proceedings of ICON-2008, 6th International Conference on Natural Language Processing, India, 2008.
* [13] A. Das, M. Datar, A. Garg, S. Rajaram. Google news personalization: Scalable online collaborative filtering. Proceedings of the 16th International Conference on World Wide Web (WWW), 2007.
* [14] M. Steinbach, G. Karypis, V. Kumar. A comparison of document clustering techniques. KDD Workshop on Text Mining, 2000.
* [15] P. Berkhin. Survey of clustering data mining techniques. 2004.
* [16] Xu Rui. Survey of Clustering Algorithms. IEEE Transactions on Neural Networks, 16(3), 2005.
* [17] B. C. M. Fung, K. Wang, M. Ester. Hierarchical Document Clustering Using Frequent Itemsets. 2003, 3.
* [18] I. S. Dhillon, D. S. Modha. Concept decompositions for large sparse text data using clustering. Machine Learning, 42, 2001.
* [19] M. Khalilian, N. Mustapha. Data Stream Clustering: Challenges and Issues. Proceedings of the International Multiconference of Engineers and Computer Scientists (IMECS 2010), Hong Kong, 2010.
* [20] R. Martinez-Morais, F. J. Alfaro-Cortes, J. L. Sanchez. Providing QoS with the Deficit Table Scheduler. IEEE Transactions on Parallel and Distributed Systems, 21(3), 2010.
* [21] A. Y. Ng, M. Jordan, Y. Weiss. On Spectral Clustering: Analysis and an Algorithm. Advances in Neural Information Processing Systems, 14, MIT Press, 2001.
* [22] S. T. Dumais. Latent Semantic Indexing (LSI) and TREC-2. Proc. Second Text Retrieval Conf. (TREC), 1993.
* [23] LSA @ CU Boulder. 2010.
* [24] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, R. A. Harshman. Indexing by Latent Semantic Analysis. J. Am. Soc. Information Science, 41(6), 1990.
* [25] C. Hung, D. Xiaotie. Efficient Phrase-Based Document Similarity for Clustering. IEEE Transactions on Knowledge and Data Engineering, 20, September 2008.
* [26] Yanjun Li, Soon M. Chung, John D. Holt. Text document clustering based on frequent word meaning sequences. Data & Knowledge Engineering, 64, 2008.
* [27] K. Aas, L. Eikvil. Text Categorisation: A Survey. Technical Report 941, Norwegian Computing Center, Oslo, Norway, 1999. iteseer.ist.psu.edu/ aas99text.html
* [28] Barak Chizi (Tel-Aviv University, Israel), Lior Rokach (Ben-Gurion University, Israel), Oded Maimon (Tel-Aviv University, Israel). A survey of feature selection techniques. John Wang (Montclair State University, USA), 2009. ISBN-13: 9781605660103. DOI: 10.4018/978-1-60566-010-3.ch289.
* [29] Y. Yang, J. O. Pedersen. A Comparative Study on Feature Selection in Text Categorization. Proc. 14th Int'l Conf. Machine Learning, 1997.
* [30] D. R. Hardoon, S. R. Szedmak, J. R. Shawe-Taylor. Canonical Correlation Analysis: An Overview with Application to Learning Methods. J. Neural Computation, 16(12), 2004.
* [31] D. Cai, X. He, J. Han. Document Clustering Using Locality Preserving Indexing. IEEE Trans. Knowledge and Data Eng., 17(12), Dec. 2005.
* [32] F. Beil, M. Ester, X. Xu. Frequent Term-based Text Clustering. Proc. of Intl. Conf. on Knowledge Discovery and Data Mining, 2002.
* [33] B. C. M. Fung, K. Wang, M. Ester. Hierarchical document clustering using frequent itemsets. Proceedings of SIAM International Conference on Data Mining, 2003.
* [34] H. Yu, D. Searsmith, X. Li, J. Han. Scalable Construction of Topic Directory with Nonparametric Closed Termset Mining. Proc. of Fourth IEEE Intl. Conf. on Data Mining, 2004.
* [35] H. H. Malik, J. R. Kender. High Quality, Efficient Hierarchical Document Clustering Using Closed Interesting Itemsets. Proc. of IEEE Intl. Conf. on Data Mining, 2006.
* [36] W. K. Gad, M. S. Kamel. Incremental clustering algorithm based on phrase-semantic similarity histogram. Proceedings of the Ninth International Conference on Machine Learning and Cybernetics, 2010, 11.
* [37] S. Gavin, X. Yue. Enhancing an incremental clustering algorithm for Web page collections. ACM International Joint Conferences on Web Intelligence and Intelligent Agent Technologies, 2009.
* [38] S. Shehata, K. Fakhri, S. Mohamed S. An efficient concept-based mining model for enhancing text clustering. IEEE Transactions on Knowledge and Data Engineering, 22(10), 2010.
* [39] Y. Liu, Y. Ouyang, H. Sheng, Z. Xiong. An Incremental Algorithm for Clustering Search Results. IEEE International Conference on Signal Image Technology and Internet Based Systems, 2008.
* [40] J. Sedding, D. Kazakov. WordNet-based text document clustering. 3rd Workshop on Robust Methods in Analysis of Natural Language Data, 2004.
* [41] Y. Li, S. M. Chung. Text Document Clustering Based on Frequent Word Sequences. Proceedings of the CIKM, Bremen, Germany, October 31-November 5, 2005.
* [42] J. B. Tenenbaum, V. de Silva, J. C. Langford. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science, 290, 2009.
* [43] J. Weng, Y. Zhang, W.-S. Hwang. Candid Covariance-Free Incremental Principal Component Analysis. IEEE Trans. Pattern Analysis and Machine Intelligence, 25(8), Aug. 2003.
* [44] K. Hiraoka, M. Hamahira. On Successive Learning Type Algorithm for Linear Discriminant Analysis. IEIC Technical Report, 99, 1999 (in Japanese).
* [45] J. Yan, B. S. Zhang, Z. Yan, W. Chen, Q. Fan, W. Y. Yang, Q. Ma, Cheng. IMMC: Incremental Maximum Marginal Criterion. Proc. 10th ACM SIGKDD, 2004.
* [46] O. Zamir, O. Etzioni. Web Document Clustering, A Feasibility Demonstration. Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval; K. Mugunthadevi, IJCSE.
* [47] M. Thangamani, P. Thangaraj. Integrated clustering and feature selection scheme for text documents. J. Comput. Sci., 6, 536. DOI: 10.3844/jcssp.2010.536.541.
* [48] G. Miller. WordNet: A lexical database for English. CACM, 38(11), 1995.
* [49] Zheng, Kang, Kim. Exploiting noun phrases and semantic relationships for text document clustering. Information Science, 179, 2009.
* [50] Andreas Hotho. Using Ontologies to Improve the Text Clustering and Classification Task. Knowledge and Data Engineering Group, University of Kassel, January 14, 2005.
* [51] Young-Woo Seo, Anupriya Ankolekar, Katia Sycara. Feature Selections for Extracting Semantically Rich Word for Ontology Learning. CMU-RI-TR-04-18, March 2004.
* [52] B. Berendt, A. Hotho, G. Stumme. Towards semantic web mining. Proceedings of International Semantic Web Conference (ISWC), 2002.
* [53] A. Hotho, S. Staab, A. Maedche. Ontology-based text clustering. Proceedings of the IJCAI-2001 Workshop Text Learning: Beyond Supervision, Seattle, USA, August 2001.
* [54] E. Gabrilovich, S. Markovitch. Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis. Proc. of the 20th Intl. Joint Conf. on Artificial Intelligence, 2007.
* [55] X. Hu, X. Zhang, C. Lu. Exploiting Wikipedia as External Knowledge for Document Clustering. Proc. of Knowledge Discovery and Data Mining, 2009.
* [56] J. Hu, L. Fang, Y. Cao. Enhancing Text Clustering by Leveraging Wikipedia Semantics. Proc. of 31st Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, 2008.
* [57] J. Xu, B. Xu, W. Zhang, Z. Cui, W. Zhang. A new feature selection method for text clustering. Wuhan University Journal of Natural Sciences, 12, 2007.
* [58] M. Janaki Meena, K. R. Chandran, J. Mary Brinda. Integrating swarm intelligence and statistical data for feature selection in text categorization. International Journal of Computer Applications, 1(11), 2010.
* [59] Manoranjan Dash, Kiseok Choi, Peter Scheuermann, Huan Liu. Feature Selection for Clustering - A Filter Solution. Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM'02), 0-7695-1754-4/02, © 2002 IEEE.
* [60] Shen Huang, Zheng Chen, Yong Yu, Wei-Ying Ma. Multitype features coselection for web document clustering. IEEE Transactions on Knowledge and Data Engineering, 18(4), April 2006. 1041-4347/06/$20.00, published by the IEEE Computer Society.
* [61] F. Wang, P. Li, A. C. König. Learning a bi-stochastic data similarity matrix. Proceedings of the 10th IEEE International Conference on Data Mining (ICDM), 2010.
* [62] G. Bharathi, D. Vengatesan. Improving information retrieval using document clusters and semantic synonym extraction. Journal of Theoretical and Applied Information Technology, 36(2), 2012.
* [63] O. Zamir, O. Etzioni. Grouper: A Dynamic Clustering Interface to Web Search Results. Computer Networks, 31, 1999.
* [64] G. Salton, C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), 1998.
* [65] N. Kumar, K. Srinathan. A New Approach for Clustering Variable Length Documents. Proceedings of the Advanced Computing Conference, IEEE, 2009.
* [66] Y. Prathima, K. P. Supreethi. A survey paper on concept based text clustering. International Journal of Research in IT & Management, 1(3), 2011.
* [67] D. D. Lee, H. S. Seung. Learning the parts of objects with nonnegative matrix factorization. Nature, 401, 1999.
* [68] F. Shahnaz, M. W. Berry, V. P. Pauca, R. J. Plemmons. Document clustering using nonnegative matrix factorization. Information Processing and Management, 42(2), 2006.
* [69] C. Ding, T. Li, M. I. Jordan. Convex and semi-nonnegative matrix factorizations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.
* [70] C. Ding, X. He, H. D. Simon. On the equivalence of nonnegative matrix factorization and spectral clustering. Proceedings of the 5th SIAM Int'l Conf. Data Mining (SDM), 2005.
* [71] E. Gaussier, C. Goutte. Relation between PLSA and NMF and implications. Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2005.
* [72] D. D. Lee, H. S. Seung. Algorithms for nonnegative matrix factorization. Advances in Neural Information Processing System (NIPS), 2000.
* [73] C. J. Lin. Projected gradient methods for nonnegative matrix factorization. Neural Computation, 19(10).
* [74] S. Zhong. Efficient streaming text clustering. Neural Networks, 18(5-6), 2005.
* [75] F. Wang, P. Li. Efficient non-negative matrix factorization with random projections. Proceedings of the 10th SIAM International Conference on Data Mining (SDM), 2010.
* [76] P. Li, K. W. Church, T. Hastie. One sketch for all: Theory and application of conditional random sampling. Advances in Neural Information Processing System (NIPS), 2008.
* [77] F. Li, Q. Zhu. Document clustering in research literature based on NMF and testor theory. Journal of Software, 6(1), 2011.
* [78] B. Cao, D. Shen, J. Sun, X. Wang, Q. Yang, Z. Chen. Detect and track latent factors with online nonnegative matrix factorization. Proc. International Joint Conference on Artificial Intelligence, 2007.
* [79] Sun Park, Dong Un An, Choi Im Cheon. Document Clustering Method Using Weighted Semantic Features and Cluster Similarity. Third IEEE International Conference on Digital Game and Intelligent Toy Enhanced Learning (DIGITEL), 2010.