# Introduction

The document model in information retrieval has three main components: the Text Preprocessor, the Topic Extractor, and Corpus Categorization. These components are integrated to support knowledge extraction in information systems. Even so, the growth of data and the difficulty of recognizing the knowledge it contains have considerably encouraged extensions of machine learning algorithms.


# a) Document Model

Text document modeling is viewed as latent topic modeling. Various prominent machine learning approaches are used to study the model. A document model is a mixture of topics [4], and topics are inferred from collections of correlated words. An unsupervised learning perspective is central to surfacing the topics. Through modeling, a range of mining tasks can be established across various subjects. The models try to observe the likely documents and tend to focus on topics. However, document models are sensitive to random word variation caused by linguistic factors such as synonymy, hyponymy, and polysemy.


# b) Text Pre-processor

The functionalities essential for machine learning on documents are document preprocessing and corpus representation. Stop-word removal, word stemming, and filtering to exclude certain words are performed within each document; this process is called document preprocessing. The resulting vocabulary is arranged in a word-document matrix, generally called the bag-of-words model. The document representation may be binary (0 for nonoccurrence and 1 for occurrence of each term in a document), term frequency ($t_{ij}$, the number of occurrences of the $i$th word in the $j$th document), or term frequency-inverse document frequency ($t'_{ij}$, the probable occurrence, or distribution, of the $i$th word in the $j$th document). The data obtained at this stage is high-dimensional, and many techniques [15] have been proposed for dimension reduction.
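
The three representations can be illustrated with a minimal sketch. It assumes the documents are already tokenized and preprocessed, and it assumes the common idf weighting $\log(N / df_i)$; the paper does not state which tf-idf variant it uses, and the function name `build_matrices` is hypothetical.

```python
import math
from collections import Counter


def build_matrices(docs):
    """Return (vocab, binary, tf, tfidf) for a list of tokenized documents.

    Rows are terms (i) and columns are documents (j), matching the
    word-document layout described in the text.
    """
    vocab = sorted({w for doc in docs for w in doc})
    counts = [Counter(doc) for doc in docs]
    n_docs = len(docs)

    # Term frequency: number of occurrences of term i in document j.
    tf = [[counts[j][w] for j in range(n_docs)] for w in vocab]
    # Binary: 1 if the term occurs in the document, 0 otherwise.
    binary = [[1 if c > 0 else 0 for c in row] for row in tf]

    # Assumed idf variant: log(N / df_i), where df_i is the number of
    # documents containing term i. Other variants exist.
    df = [sum(row) for row in binary]
    tfidf = [
        [c * math.log(n_docs / df[i]) for c in row]
        for i, row in enumerate(tf)
    ]
    return vocab, binary, tf, tfidf


docs = [
    ["topic", "model", "topic"],
    ["model", "corpus"],
    ["corpus", "corpus", "term"],
]
vocab, binary, tf, tfidf = build_matrices(docs)
```

Even for this toy corpus the matrix is mostly zeros, which hints at why dimension reduction becomes necessary for realistic vocabularies.
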


# c) Topic Extractor

A topic model is a probabilistic model in which a document is considered a mixture of topics, each represented by a probability distribution over words. The latent variables, or topics, are the components to be inferred. The main objective is to learn from the documents the distribution of the underlying topics in a given corpus. A topic model represents a text corpus through a co-occurrence matrix of words and documents. The probabilistic latent semantic analysis (PLSA) model [10] uses the probability of words given topics and the probability of topics in a document to build a topic model. The Latent Dirichlet Allocation (LDA) model [1] is another probabilistic approach, which ties the parameters of all documents together through a hierarchical generative model.
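
As a point of reference, the PLSA decomposition described above is usually written as follows. The symbols are introduced here for illustration and do not appear in the original text: $d$ is a document, $w$ a word, and $z$ a latent topic.

$$P(w \mid d) = \sum_{z} P(w \mid z)\, P(z \mid d)$$

LDA keeps the same mixture form but additionally places Dirichlet priors over the per-document topic proportions, which is one way to read the statement that it ties the parameters of all documents together.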


# d) Corpus Categorization

Text categorization is a classical application of text mining [19] and is used in email filters, social tagging, and automatic labeling of documents in business libraries. Text mining applications in research and business intelligence include latent semantic analysis techniques in bioinformatics, automatic investigation of jurisdictions, plagiarism detection in universities and publishing houses, cross-language information retrieval, spam filter learning, help desk inquiries, measuring customer preferences by analyzing qualitative interviews, automatic grading, fraud detection, and parsing social networks for ideas for new products [9].


# II.


# Literature Support

In fuzzy set theory, a degree of membership is assigned to each element, and the degree of non-membership is automatically equal to its complement. However, human interpretation often does not express the degree of non-membership as the complement to 1. Atanassov [1][2][3] therefore introduced the concept of the intuitionistic fuzzy set, which reflects the fact that the degree of non-membership is not always equal to 1 minus the degree of membership; there may be some degree of hesitation.

The intuitionistic fuzzy set applies a generalized constructive logic to fuzzy sets. It is defined on a set $X$ of objects, where each object $x$ is described by its degrees of membership $\mu_A(x)$ and non-membership $\nu_A(x)$ with respect to a certain property:

$$A = \{\langle x, \mu_A(x), \nu_A(x) \rangle \mid x \in X\} \tag{1}$$

$$0 \le \mu_A(x) + \nu_A(x) \le 1, \quad \forall x \in X \tag{2}$$

Therefore, the degree of non-determinacy of the object $x$ with respect to the intuitionistic fuzzy set $A$ is given by

$$\pi_A(x) = 1 - \big(\mu_A(x) + \nu_A(x)\big), \quad \forall x \in X \tag{3}$$

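
The definitions numbered (1) to (3) can be sketched in a few lines. This is an illustrative sketch of the standard Atanassov definitions, not the authors' implementation; the class name `IFSElement` and the field names `mu` and `nu` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IFSElement:
    """One element x of an intuitionistic fuzzy set, with its two degrees."""

    mu: float  # degree of membership
    nu: float  # degree of non-membership

    def __post_init__(self):
        # Restriction (2): both degrees lie in [0, 1] and sum to at most 1.
        if not (0.0 <= self.mu <= 1.0 and 0.0 <= self.nu <= 1.0):
            raise ValueError("mu and nu must lie in [0, 1]")
        if self.mu + self.nu > 1.0 + 1e-12:
            raise ValueError("mu + nu must not exceed 1")

    @property
    def hesitation(self) -> float:
        # Eq. (3): the share not accounted for by membership or non-membership.
        return 1.0 - (self.mu + self.nu)
```

An ordinary fuzzy element is the special case where `nu == 1 - mu`, so its hesitation is zero; any element with `mu + nu < 1` carries a positive hesitation degree.
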
The model is well suited to representing high-dimensional classification problems. A high-dimensional confusion matrix can plausibly be reduced to a low-dimensional concept matrix. Similarity measures [14] and distance measures [20][21] between two intuitionistic fuzzy sets can be applied in pattern recognition.

In this paper, a partition-based approach [16], inspired by hierarchical segmentation [8] and topic-based segmentation [6], is extended using an intuitionistic fuzzy set approach [23] for local centralization of conceptual words. Intuitionistic fuzzy set theory is applied to conceptual term/topic detection. Cosine similarity and correlation are used to define the membership degree and the non-membership degree, respectively. The results obtained with these measures were found to be better for the chosen datasets. In the literature, intuitionistic fuzzy representations of images for clustering [18][12], using a novel similarity metric, have been defined, but only minimal support has been extended to text classification. Therefore, local centralization of conceptual terms using intuitionistic logical clustering is applied in this work.


# III.


# Proposed Model -Intuitionistic Partition based Concept Granulation (IPCG)

Intuitionistic logic is a natural deduction system [13] that has introduction rules $\mu$ and elimination rules $\nu$ for the logical connectives and quantifiers. The intuitionistic fuzzy set $A$ is generated by

$$A = \{\langle w_{ij}, \mu_i(w_{ij}), \nu_i(w_{ij}) \rangle\}, \quad \text{where } 0 < w_{ij} < 1 \tag{4}$$

The similarity between words on a topic is calculated by the cosine measure. Each document vector is normalized by the weight and length of the terms in the $k$ partitions. The optimal term $w_{ij}$ [16] should then be picked from the non-sparse terms of the $k$ partitions. The intuitionistic angular, or cosine, similarity measure [22] between the $m$ terms in a partitioned set is given as follows:

$$C(A, B) = \frac{\sum_{i=1}^{m} \mu_A(x_i)\, \mu_B(x_i)}{\sqrt{\sum_{i=1}^{m} \mu_A(x_i)^2}\; \sqrt{\sum_{i=1}^{m} \mu_B(x_i)^2}} \tag{6}$$

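
A minimal sketch of the cosine measure numbered (6), assuming it takes the standard form over two equal-length vectors of membership degrees. The function name and the zero-vector convention are assumptions made for illustration.

```python
import math


def cosine_similarity(mu_a, mu_b):
    """Cosine similarity of two equal-length membership vectors."""
    if len(mu_a) != len(mu_b):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(mu_a, mu_b))
    norm_a = math.sqrt(sum(a * a for a in mu_a))
    norm_b = math.sqrt(sum(b * b for b in mu_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_a * norm_b)
```

Because the measure depends only on direction, two segments whose membership vectors differ by a constant factor score close to 1, while segments that share no terms score 0.
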
The intuitionistic correlation [7] is computed over the rows of fuzzy numbers drawn from the tf-idf samples (Partition Model). The crisp set is modified intuitionistically using the sample mean and variance of the membership function:

$$CR_I(A, B) = \frac{\sum_{i=1}^{n} \big(\mu_A(x_i) - \bar{\mu}_A\big)\big(\mu_B(x_i) - \bar{\mu}_B\big)}{\sqrt{\sum_{i=1}^{n} \big(\mu_A(x_i) - \bar{\mu}_A\big)^2}\; \sqrt{\sum_{i=1}^{n} \big(\mu_B(x_i) - \bar{\mu}_B\big)^2}} \tag{7}$$

where $\bar{\mu}_A$ and $\bar{\mu}_B$ denote the sample means of the membership degrees.

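
The correlation numbered (7) can be sketched as below, assuming it is the sample (Pearson-style) correlation of two membership vectors around their means, as the reference to sample mean and variance suggests. The function name and the handling of constant vectors are assumptions.

```python
import math


def membership_correlation(mu_a, mu_b):
    """Sample correlation between two equal-length membership vectors."""
    n = len(mu_a)
    if n != len(mu_b) or n < 2:
        raise ValueError("need two vectors of equal length, at least 2")
    mean_a = sum(mu_a) / n
    mean_b = sum(mu_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(mu_a, mu_b))
    var_a = sum((a - mean_a) ** 2 for a in mu_a)
    var_b = sum((b - mean_b) ** 2 for b in mu_b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0  # assumed convention: a constant vector has no correlation
    return cov / math.sqrt(var_a * var_b)
```

Unlike the cosine measure, this value ranges over [-1, 1] and is unaffected by adding a constant to all membership degrees.
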
The effectiveness of the intuitionistic classification of the corpus is studied and analyzed approximately using the following entropy [22], defined specifically for an intuitionistic fuzzy set $A$:

$$E(A) = \frac{1}{n} \sum_{i=1}^{n} \frac{\min\big(\mu_A(x_i), \nu_A(x_i)\big) + \pi_A(x_i)}{\max\big(\mu_A(x_i), \nu_A(x_i)\big) + \pi_A(x_i)} \tag{8}$$

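
The entropy numbered (8) can be sketched as follows, assuming the min/max ratio form with the hesitation degree taken as 1 minus the sum of membership and non-membership. The function name is hypothetical, and inputs are assumed to satisfy the restriction that each pair sums to at most 1.

```python
def ifs_entropy(mu, nu):
    """Mean min/max ratio entropy of an intuitionistic fuzzy set."""
    if len(mu) != len(nu) or not mu:
        raise ValueError("need two non-empty vectors of equal length")
    total = 0.0
    for m, v in zip(mu, nu):
        pi = 1.0 - (m + v)  # hesitation degree of this element
        # The denominator is positive whenever 0 <= m + v <= 1 holds.
        total += (min(m, v) + pi) / (max(m, v) + pi)
    return total / len(mu)
```

A crisp element (membership 1, non-membership 0) contributes 0, while an element with equal membership and non-membership contributes 1, so lower values indicate a more decisive classification.
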
# IV.


# Datasets


# a) Newspaper Article Collection

Newspaper articles on different topics were collected, and their categories were marked. The training and testing documents were chosen at random. The growth of social media made it essential to include a newspaper article collection in this work. News items are generally categorized by topic area ("politics," "business," etc.) and written in clear, correct, "objective," and somewhat schematized language [5]. This paves the way for extending the research towards social networking and marketing. The collection includes about 780 documents in 25 categories. New socially relevant topics ("mobile", "opinion", etc.) are also included for categorization.


# b) Reuters-21578 Data Set

The Reuters-21578 data set provides a classification task with challenging properties. There are multiple categories, the categories are overlapping and non-exhaustive, and there are relationships among the categories. There are interesting possibilities for the use of domain knowledge. Many possible feature sets can be extracted from the text, and most plausible feature/example matrices are large and sparse [11].


# c) Movie Review Dataset

The Movie Review Dataset (polarity dataset v0.9), with 900 positive and 900 negative reviews, is used. On movie-review data, standard machine learning techniques for document classification have been reported to definitively outperform human-produced baselines [17]. About 100 training documents are chosen at random from each class, which means about 500 cases are considered for training.

# V.


# Results and Analysis

Machine learning classification methods such as Bayesian, Naïve Bayes, J48, Support Vector Machines, and LMT are strong enough to support classification.

In concept granulation for document classification, feature selection is fine-tuned to achieve categories closely connected to human perception. Before the features are fed into the classifier, some form of selection must be applied. The proposed method selects the features according to intuitionistic logic. The tf-idf feature matrix has been …

The proposed Concept Granulation Using Intuitionistic Partition Based Classification Model is implemented in a Java-based system and analyzed for its significance. The intuitionistic correlation is applied to the specified datasets. The chosen dataset and the partitions play a very important role in the result of the model. The tfidf-IP is more favorable for the Reuters dataset than for the Newspaper and Movie datasets. This is represented in Figure 2. The perplexity is depicted in Figure 3 and Table 1. The analysis can be interpreted in the following ways:

* The intuitionistic approach favors the classified documents or corpus chosen.
* Partitioning plays an important role in the proposed model.
* Of the four partition settings, k=8 gives smooth, strong support for the proposed model.
* k=16, the highest partition setting, yields only a very moderate result and more confusion.
* k=4, the lowest partition setting, yields smooth but less significant support for all the datasets.
* k=8 yields partially smooth but significant support for the movie dataset, compared with the other partitions.

The results focus on the average training datasets and the micro F-measure (Table 2) to show that IPCG performs better with dimension reduction for corpus categorization. Each dataset chosen for analysis responds to the push and pull of the various stages of the proposed model.


# Conclusions

In this paper, we have proposed an intuitionistic partition-based concept granulation topic-term model for the nominal tf-idf vector space model, which is often used in information retrieval, topic analysis, and automatic classification. Applying cosine distance and correlation to the tf-idf matrix reduces the dimension and improves the efficiency of the bag of words/terms in topics. The matrix is, however, first treated using the intuitionistic partition so that the model fits decision-making problems. To account for this, an intuitionistic partition-based cosine similarity measure between topics/terms and a correlation between documents/topics are included. The proposed fuzzy model is tailored with a normal combinational approach to obtain the intuitionistic fuzzy crisp set. The model is observed to behave well and to be promising for categorized documents, and to give tolerable support for low-inference corpus collections such as movie reviews. This makes clear that social media documents should be specially treated before this model is applied. Aggregation of social media topic-terms appears to be needed, and is left as future work extending the proposed approach.
![x does not belong to the set A. The model is defined by the restriction](image-2.png "")
The document classification system needs conceptual terms ($\mu$) and non-deterministic terms or noise ($\nu$), together with logic and reasons, to quantify concept granules. Let $A$ be an $m \times n$ tf-idf matrix representing the corpus. Each value is associated with:

* $\mu_A(x)$: the set of terms representing the membership of the domain
* $\nu_A(x)$: the terms representing the non-membership of the domain

Algorithm: IPCG

For each document:

1. Convert the document to lowercase and remove numbers and special characters.
2. Remove stop-list words from the document.
3. Split the document into k partitions.
4. For each segment:
   * Find the frequency of words.
   * Prepare a matrix with each segment as a row and words as columns.
   * Include non-zero frequencies as members.
   * Calculate the cosine similarity distance between the segments.
   * Discard the segment with the least distance.
5. A single row, or vector, of the document has now been found.
6. Apply the intuitionistic correlation to include conceptual terms in the topic.
7. Classify the document and find the entropy.

The intuitionistic fuzzy set $A$ is generated by Eq. (4).
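
The preprocessing and partitioning steps of the IPCG algorithm can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' Java implementation: the stop list is hypothetical, segments are contiguous and of near-equal length, the "cosine similarity distance" is treated as a similarity score, the discarded segment is the one with the lowest mean score against the others, and the remaining segments are summed into the single document vector. The correlation, classification, and entropy steps are not covered.

```python
import math
import re
from collections import Counter

# Hypothetical stop list for illustration; the paper does not give one.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def preprocess(text):
    """Lowercase the text, keep alphabetic runs only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def split_into_partitions(tokens, k):
    """Split tokens into k contiguous segments (trailing ones may be empty)."""
    size = math.ceil(len(tokens) / k)
    return [tokens[i * size:(i + 1) * size] for i in range(k)]


def cosine(freq_a, freq_b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(freq_a[w] * freq_b[w] for w in freq_a.keys() & freq_b.keys())
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def document_vector(text, k):
    """Reduce one document to a single term-frequency vector."""
    if k < 2:
        raise ValueError("k must be at least 2")
    segments = [Counter(s) for s in split_into_partitions(preprocess(text), k)]
    # Score each segment by its mean similarity to the other segments.
    scores = []
    for i, seg in enumerate(segments):
        sims = [cosine(seg, other) for j, other in enumerate(segments) if j != i]
        scores.append(sum(sims) / len(sims))
    # Assumption: the segment to discard is the one with the lowest score.
    discard = scores.index(min(scores))
    merged = Counter()
    for i, seg in enumerate(segments):
        if i != discard:
            merged.update(seg)
    return merged
```

For example, `document_vector("Cat dog cat DOG cat dog stocks 42 bonds!", 4)` drops the off-topic last segment and keeps only the counts for `cat` and `dog`.
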
![be picked from the non-sparse terms of the k partitions, i.e.](image-4.png "")
![Figure 1: Partition Model {r_i <= n (i.e. r is random or varies from document to document) (where i = 1, 2, ..., m), k = no. of partitions or segments}](image-5.png "Figure 1 :")
![based feature model. The proposed approach is modeled as a probability distribution over the set of topics/words represented by the vocabulary. These distributions are sampled from multinomial distributions.](image-6.png "")
![Figure 2: Intuitionistic correlation vs. the number of training documents](image-7.png "Figure 2 :")

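Table 1

| Training with 300 Doc | Dimension Reduction | Perplexity | Correlation |
|---|---|---|---|
| Newspaper | 26% | 0.231 | 0.582 |
| Reuters | 22% | 0.311 | 0.520 |
| Movie | 16% | 0.483 | 0.480 |

Table 2

| Classifiers | tf-idf: Reuters | tf-idf: News Paper | tf-idf: Movie | IPCG: Reuters | IPCG: News Paper | IPCG: Movie |
|---|---|---|---|---|---|---|
| SVM | 0.482 | 0.422 | 0.321 | 0.844 | 0.841 | 0.799 |
| NB | 0.401 | 0.369 | 0.297 | 0.872 | 0.834 | 0.810 |
| J48 | 0.400 | 0.399 | 0.381 | 0.798 | 0.797 | 0.784 |
| Bayes' | 0.541 | 0.411 | 0.399 | 0.831 | 0.854 | 0.829 |
| LMT | 0.442 | 0.541 | 0.587 | 0.878 | 0.798 | 0.722 |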
			© 2014 Global Journals Inc. (US) Intuitionistic Partition based Conceptual Granulation Topic-Term Modeling
		
		
* D. M. Blei, A. Y. Ng, M. I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 2003.
* T. Hofmann. Probabilistic Latent Semantic Analysis. Proceedings of the 15th Annual Conference on Uncertainty in Artificial Intelligence (UAI-99), San Francisco, CA: Morgan Kaufmann, 1999.
* F. Sebastiani. Machine Learning in Automated Text Categorization. 34, 2002. doi:10.1145/505282.505283
* I. Feinerer, K. Hornik, D. Meyer. Journal of Statistical Software, 25, 2008.
* K. T. Atanassov. Intuitionistic Fuzzy Sets, Theory, and Applications. Series in Fuzziness and Soft Computing. Physica-Verlag, 1999.
* K. T. Atanassov. Intuitionistic Fuzzy Set. Fuzzy Sets and Systems, 1986.
* K. T. Atanassov, S. Stoeva. Intuitionistic Fuzzy Set. Polish Symposium on Interval and Fuzzy Mathematics, 1993.
* D. Li, C. Cheng. New Similarity Measures of Intuitionistic Fuzzy Sets and Application to Pattern Recognition. Pattern Recognition Letters, 23, 2002.
* E. Szmidt, J. Kacprzyk. Entropy for Intuitionistic Fuzzy Set. Fuzzy Sets and Systems, 118, 2001.
* E. Szmidt, J. Kacprzyk. Distance Between Intuitionistic Fuzzy Set. Fuzzy Sets and Systems, 114, 2000.
* D. Malathi, S. Valarmathy. Domain Classifier using Conceptual Granulation and Equal Partition Approach. Indian Journal of Engineering Science and Technology, 7, 2013.
* J. T. Chien, C. H. Chueh. Topic-Based Hierarchical Segmentation. IEEE Transactions on Audio, Speech and Language Processing, 20, 2012.
* T. Brants, F. Chen, I. Tsochantaridis. Topic-based Document Segmentation with Probabilistic Latent Semantic Analysis. Proceedings of the International Conference on Information and Knowledge Management, 2002.
* Z. Xu, J. Chen, J. Wu. Clustering Algorithm for Intuitionistic Fuzzy Sets. Information Sciences, 178, 2008.
* N. Pelekis, D. K. Iakovidis, E. K. Evangelos, I. Kopanakis. Fuzzy Clustering of Intuitionistic Fuzzy Data. International Journal of Business Intelligence and Data Mining, 3, 1.
* D. K. Iakovidis, N. Pelekis, E. K. Evangelos, I. Kopanakis. Intuitionistic Fuzzy Clustering with Applications in Computer Vision. Advanced Concepts for Intelligent Vision Systems, Lecture Notes in Computer Science, 5259, 2008.
* V. H. Le, C. H. Nguyen, F. Liu. Semantics and Aggregation of Linguistic Information, Based on Hedge Algebras. The 3rd International Conference on Knowledge, Information, and Creativity Support Systems, 2013.
* J. Ye. Multicriteria Decision-making Method Based on a Cosine Similarity Measure between Trapezoidal Fuzzy Numbers. International Journal of Engineering, Science and Technology, 3, 2011.
* D. A. Chiang, N. P. Lin. Correlation of Fuzzy Sets. Fuzzy Sets and Systems, 102, 1999.
* B. Berendt. Text Mining for News and Blogs Analysis. In C. Sammut & G. I. Webb (Eds.), Encyclopedia of Machine Learning. Springer, London, 2010.
* B. Pang, L. Lee, S. Vaithyanathan. Thumbs up? Sentiment Classification using Machine Learning Techniques. Proceedings of EMNLP, 2002.
* D. Malathi, S. Valarmathy. A Comprehensive Survey on Dimension Reduction Techniques for Concept Extraction from a Large Corpus. International Journal of Computing Information Systems, 3, 2011.