# I. Introduction

Nowadays a huge amount of information is posted on the web. To obtain useful information from the web, the available information has to be categorized. Text categorization is the task of automatically assigning a set of unlabeled text documents to their corresponding categories from a predefined category set [2]. These categories can be viewed as sets of documents, and a test document can be treated as a query to the system. The measures used to evaluate information retrieval systems are often applicable to measuring the effectiveness of text categorization systems [1]. Text categorization has many applications [2], such as information retrieval systems, search engines, text filtering, word sense disambiguation, language identification, POS tagging and machine translation.

Telugu is one of the old and traditional languages of India and belongs to the Dravidian language family, with its own script. It is the official language of the states of Telangana and Andhra Pradesh in south India. Amit et al. [6] report that Telugu has more than 50 million native speakers in India. It ranks between the 13th and 17th most widely spoken languages in the world. Telugu is a morphologically rich language with high word conflation [7].

Various approaches to text categorization have been applied to Indian languages. Most of these works have been reported for the Telugu language. M. Narayana Swamy et al. used KNN, NB and decision tree classifiers [4]. They experimented on Kannada, Tamil and Telugu corpora, whose statistics are illustrated by Zipf's law. An analysis of the N-gram model for text classification was presented in [5]. Govardhan and Kanaka Durga [3] proposed an ontology-based text categorization technique for Telugu digital items and a retrieval system. To the best of our knowledge, this is the first time the proposed language models have been applied to Telugu text categorization.

The paper is structured as follows: Section II describes the system overview, Section III presents testing and results, and Section IV draws the conclusion.


# II. System Overview

The system design of the proposed approach is shown in Figure 1. First, a text document is read from the corpus and each line is pre-processed by eliminating non-Telugu characters, numerals and special characters such as colons, semicolons and quotes. The pre-processed document is then tokenized to extract the raw words. Words in Telugu text are separated by spaces, so they are extracted from the document using the space as delimiter, and all raw words are placed in the Input File. The language-dependent and language-independent models take the raw words from the Input File as input, reading one word at a time, and identify the root word. The root word is found by applying one of the following models: vibhaktulu based stemming, suffix removal stemming, rule based suffix removal stemming, N-gramming, pseudo N-gramming and rule based pseudo N-gramming. Finally, text categorization is applied. The proposed language models fall into three categories, shown in Figure 2.
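The pre-processing and tokenization steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that "non-Telugu characters" means anything outside the Unicode Telugu block (U+0C00 to U+0C7F), that Telugu digits (U+0C66 to U+0C6F) count as numerals to be removed, and that removed characters act as word separators. The function names are hypothetical.

```python
import re

# Assumption: Telugu text is identified by the Unicode Telugu block U+0C00..U+0C7F.
# Assumption: Telugu digits (U+0C66..U+0C6F) are treated as numerals and removed.
TELUGU_DIGITS = re.compile(r"[\u0C66-\u0C6F]")
NON_TELUGU = re.compile(r"[^\u0C00-\u0C7F\s]")


def preprocess_line(line):
    """Replace digits and every non-Telugu, non-whitespace character with a space."""
    line = TELUGU_DIGITS.sub(" ", line)
    return NON_TELUGU.sub(" ", line)


def tokenize(text):
    """Return the raw words of a document, split on whitespace after pre-processing."""
    words = []
    for line in text.splitlines():
        words.extend(preprocess_line(line).split())
    return words


if __name__ == "__main__":
    sample = "తెలుగు భాష, 2016: test!"
    for word in tokenize(sample):
        print(word)
```

Because removed characters become spaces, zero-width joiners and hyphens inside a word would split it in this sketch; a real pipeline may need to treat those cases separately.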


# c) Suffix removal stemming

Suffix removal stemming is the process of finding the root word by removing from the word a suffix that matches an entry in the suffix list shown in Figure 3. Observation of the Telugu data set shows that the maximum suffix length is 2 (two) and the minimum is 1. The suffix removal stemming method performs better than the vibhaktulu based stemming algorithm; its accuracy is 58-59%.

# d) Rule based suffix removal stemming

Suffix removal stemming is the base method for the rule based suffix removal stemming algorithm. The stems produced by suffix removal often still contain inflections, which cannot be removed by simple suffix removal. We have therefore designed rule based suffix removal for some inflections that occur frequently in the Telugu language. The rules used to replace characters are presented in Table 1. With these rules the effectiveness of the proposed rule based suffix removal stemming algorithm increases: its accuracy is 69-70%.
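A rough sketch of suffix removal followed by rule based replacement is given below. The suffix list and the replacement rule are hypothetical placeholders written in romanized form for readability; they are not the entries of Figure 3 or Table 1. Longest-match-first ordering and the requirement that a non-empty stem remains are also assumptions of this sketch.

```python
# Hypothetical placeholder data, NOT the paper's suffix list or rules.
SUFFIXES = ["lu", "ki", "ni", "lo"]
REPLACEMENT_RULES = {"a": "am"}  # hypothetical rule: stem ending -> replacement


def strip_suffix(word, suffixes=SUFFIXES):
    """Remove the longest listed suffix that matches the end of the word."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        # Assumption: never strip the whole word; keep a non-empty stem.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: len(word) - len(suffix)]
    return word


def apply_rules(stem, rules=REPLACEMENT_RULES):
    """Replace a trailing inflected ending with its mapped form if a rule matches."""
    for ending in sorted(rules, key=len, reverse=True):
        if stem.endswith(ending):
            return stem[: len(stem) - len(ending)] + rules[ending]
    return stem


def rule_based_stem(word):
    """Suffix removal is the base step; the replacement rules repair the stem."""
    return apply_rules(strip_suffix(word))


if __name__ == "__main__":
    print(strip_suffix("pustakalu"))     # plain suffix removal
    print(rule_based_stem("pustakalu"))  # suffix removal plus replacement rule
```

Keeping the two steps as separate functions mirrors the text: the plain stemmer can be evaluated on its own, and the rule table can be extended without touching the suffix list.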


# Table 1: Rules for Replacement Syllables

# e) Pseudo N-gramming

Pseudo N-gramming is the process of finding the root word by stripping characters from the end of the word. The stripping length is chosen according to the word length; the maximum stripping length is 5 and the minimum is 2. An example of pseudo N-gramming is shown in Figure 4. It is a language-independent model.


# f) Rule Based Pseudo N-Gramming

Rule based pseudo N-gramming is a hybrid model that uses pseudo N-gramming as its base method for removing suffixes from words. For some words the result of pseudo N-gramming still contains inflections, which cannot be removed by simple pseudo N-gramming.

We have therefore designed rule based pseudo N-gramming, which contains a set of rules for replacing characters. These rules target words with inflections that occur frequently in the Telugu language. The rules, with sample examples, are shown in Table 3.


# Table 3: List of rules for Rule based pseudo N-gramming

# g) K-NN Classifier

The k-NN classifier is a similarity-based learning method that has been shown to be very effective in a variety of problem domains, including text categorization [9, 10]. Given a test document, the k-NN method finds the k nearest neighbors among the training documents and uses the categories of those k neighbors to weight the candidate categories. The similarity score of each neighbor document to the test document is used as the weight of the category of that neighbor document.
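The similarity-weighted voting described above can be illustrated with a small sketch. The text does not name the similarity function or the document representation, so cosine similarity over bag-of-words counts of root words is an assumption here, as are the function names and the toy data.

```python
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity between two term-count mappings (0.0 if either is empty)."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(test_tokens, training, k=3):
    """Classify a document given training data as a list of (tokens, category).

    Each of the k most similar training documents votes for its category,
    weighted by its similarity to the test document.
    """
    test_vec = Counter(test_tokens)
    scored = [(cosine(test_vec, Counter(tokens)), category)
              for tokens, category in training]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = defaultdict(float)
    for similarity, category in scored[:k]:
        votes[category] += similarity
    return max(votes, key=votes.get) if votes else None


if __name__ == "__main__":
    # Toy training set with English placeholder tokens standing in for root words.
    training = [
        (["cricket", "match", "score"], "sports"),
        (["match", "team", "goal"], "sports"),
        (["election", "vote", "party"], "politics"),
        (["party", "minister", "vote"], "politics"),
    ]
    print(knn_classify(["match", "score", "team"], training, k=3))
```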


# III. Testing and Results

The proposed models are evaluated on a Telugu corpus collected from online newspapers and Wikipedia. The work has been implemented on a sample of 1,500 documents in seven categories, which are presented in Table 4.


# Table 4: Categories of Telugu Documents

To evaluate the performance of the proposed system with KNN classification, we use the typical evaluation metrics from information retrieval: precision (P), recall (R) and the F1 measure, defined in Equations (1) to (3), where TP is True Positives, TN is True Negatives, FN is False Negatives and FP is False Positives [8]. The performance of the proposed language models with the KNN classifier is shown in Table 5.
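As a small worked illustration, the three metrics can be computed from the counts using the standard definitions P = TP / (TP + FP), R = TP / (TP + FN) and F1 = 2PR / (P + R). The function name is hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here rather than something stated in the text.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true positive, false positive
    and false negative counts.

    Assumption: a metric with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up counts for one category, for illustration only.
    p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
    print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
```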


# IV. Conclusion

In this paper, we proposed various language-dependent and language-independent models. Among these models, rule based pseudo N-gramming performs best, so it is well suited to Telugu text categorization. Beyond our research work on Telugu categorization, it is also suitable for other complex Indian languages such as Hindi, Malayalam and Kannada.
![Figure 1: Proposed Approach](image-2.png "Figure 1 :")
![Figure 2: Proposed language models](image-3.png "Figure 2 :")

b) Vibhaktulu based stemming: Vibhaktulu based stemming is a language-dependent model. It finds the root word by removing the last one or more syllables from the word when they match Telugu vibhaktulu. Over the complete set of input words, only 19 to 20% of words have last syllables that match Telugu vibhaktulu.
![Figure 3: Suffix list](image-4.png "Figure 3 :")
![Figure 4: Pseudo N-gramming](image-5.png "Figure 4 :")

A sequence of words from the Input File was used to identify valid roots with the pseudo N-gram algorithm. The results are presented in Table 2, which lists the words with their initial and final stripping lengths and the final valid root word.
$$P = \frac{TP}{TP + FP} \tag{1}$$

$$R = \frac{TP}{TP + FN} \tag{2}$$

$$F_1 = \frac{2 \cdot P \cdot R}{P + R} \tag{3}$$
![Figure 5(a): Recall Graph](image-7.png "Figure 5 :")
![](image-8.png "")
![](image-9.png "")
			© 2016 Global Journals Inc. (US)
		
		
* [1] T. K. Landauer, P. W. Foltz, D. Laham. An Introduction to Latent Semantic Analysis. Discourse Processes, 1998.
* [2] K. N. Murthy. Automatic Categorization of Telugu News Articles. Department of Computer and Information Sciences, University of Hyderabad, Hyderabad, 2003.
* [3] A. Kanaka Durga, A. Govardhan. Ontology Based Text Categorization Telugu Documents. International Journal of Scientific & Engineering Research, vol. 2, no. 9, September 2011.
* [4] M. Narayana Swamy. Indian Language Text Representation and Categorization Using Supervised Learning Algorithm.
* [5] B. Vishnu Vardhan. Analysis of N-gram model on Telugu Document classification. Thesis, 2008.