# Introduction

Text line segmentation is an essential preprocessing stage for recognition in many Optical Character Recognition (OCR) systems. It is a vital step because inaccurately segmented text lines cause errors in the recognition stage. Segmentation of handwritten documents remains one of the most challenging open problems. Several techniques for text line segmentation of Indian script documents are reported in the literature. These include projection profiles (white space analysis) [1], Voronoi and Docstrum [2], graph cut, and connected-component-based methods. These methods do not segment such documents accurately. Jawahar [3] proposed a graph cut method that requires a priori information about the script structure in order to cut. Rajasekharan proposed a projection-based method for Kannada script document segmentation [4]. As a conventional technique for text line segmentation, global horizontal projection analysis of black pixels has been used in [5,6,7,8]. Partial, or piecewise, horizontal projection analysis of black pixels, a modification of the global projection technique, is employed by many researchers to segment text pages of different languages [9,10,11]. In the piecewise horizontal projection technique, the text page image is decomposed into vertical strips. The positions of potential piecewise separating lines are obtained for each strip using a partial horizontal projection on that strip. The potential separating lines are then connected to form complete separating lines for all the text lines in the text page image.
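The piecewise horizontal projection described above can be illustrated with a minimal sketch. It assumes a binary image stored as a list of rows with 1 for ink and 0 for background; the function names and the choice of the middle row of each blank gap as the potential separator are illustrative assumptions, not taken from the cited works. The final step of connecting separators across strips is not shown.

```python
def horizontal_projection(image, x0, x1):
    """Count ink pixels (value 1) in each row, restricted to columns x0..x1-1."""
    return [sum(row[x0:x1]) for row in image]


def strip_separators(image, n_strips):
    """For each vertical strip, return the middle row of every blank run
    that follows an inked run and is closed by another inked run."""
    width = len(image[0])
    strip_w = max(1, width // n_strips)
    result = []
    for s in range(n_strips):
        x0 = s * strip_w
        # The last strip absorbs any leftover columns.
        x1 = width if s == n_strips - 1 else x0 + strip_w
        profile = horizontal_projection(image, x0, x1)
        seps, gap_start, seen_ink = [], None, False
        for y, count in enumerate(profile):
            if count == 0:
                if seen_ink and gap_start is None:
                    gap_start = y          # a blank run begins below some ink
            else:
                if gap_start is not None:
                    # Blank run spans gap_start..y-1; take its middle row.
                    seps.append((gap_start + y - 1) // 2)
                    gap_start = None
                seen_ink = True
        result.append(seps)
    return result


# Two text lines (rows 0-1 and rows 5-6) separated by three blank rows.
page = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 0],
]
separators = strip_separators(page, n_strips=2)
```

Each strip reports its own separator rows, so a skewed or wavy line yields different rows in different strips, which is what makes the piecewise variant more tolerant than a single global projection.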

In this paper we propose a robust method for segmenting documents into lines and words. Because the Telugu script is very complex, the method is based on a modified histogram. Foreground and background information is also used for accurate line segmentation. The method takes care of eliminating false lines and recovering text lost in overlapped text lines.

The rest of the paper is organized as follows. Section 2 discusses the properties of the Telugu script considered here. Section 3 describes the proposed approach. Section 4 presents the experimental results. Section 5 concludes the paper.



# Characteristics of Telugu Script

Telugu is the most widely spoken of the South Indian languages and has its own script. The Telugu character set contains 16 vowels, 36 consonants, vowel modifiers (maatras) and consonant modifiers (vaththus). These characters are combined to represent several frequently used syllables (estimated between 5000 and 10000) in the language [12,13,14]. We refer to these basic orthographic units as glyphs (single connected component representation). The characters vary in size (i.e., in width and height), whereas in Latin-based scripts most characters have the same size, with few exceptions. Segmentation of Telugu characters is therefore more difficult than for Latin-based scripts such as English. Figure 1 shows sample images of simple and compound Telugu characters.


# Proposed Approach

Here we propose a new technique that automatically identifies and segments the text line regions of handwritten documents. Figure 2 shows the basic steps of the proposed algorithm. The raw data is subjected to a number of preliminary processing steps to make it usable in the character analysis stages.

Pre-processing aims to produce data that can be segmented easily and accurately. Its main objectives are binarization, noise reduction, skeletonization/normalization, and skew correction.
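Binarization, the first of these objectives, is commonly done with a single global threshold. The following is a minimal pure-Python sketch of Otsu's threshold selection, which picks the grey level that maximises the between-class variance. The function names and the convention that dark pixels are ink are assumptions made for illustration; this is not the paper's MATLAB code.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level t that maximises between-class variance
    when pixels are split into {<= t} and {> t}."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the dark class so far
    sum0 = 0    # sum of grey levels in the dark class so far
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue            # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break               # bright class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image, threshold):
    """Map a grey image (list of rows) to 1 = ink (dark), 0 = background."""
    return [[1 if p <= threshold else 0 for p in row] for row in image]


grey = [[10, 200], [205, 12], [10, 210]]
t = otsu_threshold([p for row in grey for p in row])
binary = binarize(grey, t)
```

On this toy image the dark pixels (10, 10, 12) and the bright pixels (200, 205, 210) form two well-separated groups, so the selected threshold falls between them.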

We work with binary images; the original grey-level document images are converted to binary using Otsu's algorithm [15]. The noise-removed, skew-corrected output of the pre-processing phase is then given as input to the segmentation stage. Noise is removed with morphological operators; Figure 3 shows the steps in noise removal. Lines whose height is below a pre-determined threshold are removed. The value of this threshold is proportional to the average height of the text lines in the whole image.
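The height-based removal of false lines can be sketched as follows. The sketch assumes a binary image given as a list of rows (1 = ink), takes line regions to be maximal runs of rows containing ink, and measures height inclusively. The proportionality factor `ratio` is a hypothetical parameter, since the text does not state its value.

```python
def line_regions(image):
    """Return (y_min, y_max) for each maximal run of rows containing ink."""
    regions, start = [], None
    for y, row in enumerate(image):
        has_ink = any(row)
        if has_ink and start is None:
            start = y
        elif not has_ink and start is not None:
            regions.append((start, y - 1))
            start = None
    if start is not None:                      # region touching the bottom edge
        regions.append((start, len(image) - 1))
    return regions


def drop_false_lines(regions, ratio=0.5):
    """Keep regions whose height is at least ratio * average height.
    `ratio` is an assumed value, not one given in the paper."""
    if not regions:
        return []
    heights = [y_max - y_min + 1 for y_min, y_max in regions]
    average = sum(heights) / len(heights)
    return [r for r, h in zip(regions, heights) if h >= ratio * average]


# Two real lines of height 4 with a one-row speck of ink between them.
image = [[1, 1]] * 4 + [[0, 0]] + [[0, 1]] + [[0, 0]] + [[1, 0]] * 4 + [[0, 0]]
regions = line_regions(image)
kept = drop_false_lines(regions)
```

Because the threshold is relative to the average height, the same factor works across documents written at different sizes; the trade-off is that a page dominated by noise would pull the average down.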


# d) False Word Exclusion

As in Section 3.3, we compute the average height of the words in the x direction, and any word that does not satisfy the resulting threshold is treated as a false word.
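A possible reading of this step is sketched below: words are found as runs of inked columns in the vertical (x) projection of a segmented text line, and runs that are too narrow relative to the average extent are discarded. The gap size `min_gap`, the factor `ratio`, and the interpretation of "height in the x direction" as extent along x are all assumptions made for illustration.

```python
def word_spans(line_image, min_gap=2):
    """Split a text-line image (list of rows, 1 = ink) into (x_start, x_end)
    spans separated by at least min_gap consecutive blank columns."""
    width = len(line_image[0])
    profile = [sum(row[x] for row in line_image) for x in range(width)]
    spans, start, blanks = [], None, 0
    for x, count in enumerate(profile):
        if count > 0:
            if start is None:
                start = x
            blanks = 0
        elif start is not None:
            blanks += 1
            if blanks >= min_gap:
                spans.append((start, x - blanks))   # last inked column
                start, blanks = None, 0
    if start is not None:                           # span reaching the right edge
        spans.append((start, width - 1 - blanks))
    return spans


def drop_false_words(spans, ratio=0.3):
    """Keep spans whose extent is at least ratio * average extent.
    `ratio` is an assumed value, not one given in the paper."""
    if not spans:
        return []
    extents = [x_end - x_start + 1 for x_start, x_end in spans]
    average = sum(extents) / len(extents)
    return [s for s, e in zip(spans, extents) if e >= ratio * average]


line = [[1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1]]
spans = word_spans(line)
words = drop_false_words(spans)
```

The single blank column at x = 4 is shorter than `min_gap`, so it is treated as an intra-word gap, while the isolated column at x = 9 forms its own span and is then rejected as a false word.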



# Performance Evaluation

Performance is evaluated by counting the matches between the segmented entities and the entities in the ground truth [16]. A MatchScore table is created whose entries measure how far the pixels of each segment coincide with those of each ground-truth region. Let I be the set of all image points, G_j the set of all points inside the j-th ground-truth region, S_i the set of all points inside the i-th segmented region, and T(s) a function that counts the elements of set s. The match score is computed for each pair of ground-truth region j and segmented region i.
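The evaluation can be sketched in a few lines, under stated assumptions. The match score is taken here to be the intersection-over-union of the two point sets, which is the form commonly used with these definitions. Regions are assumed to lie inside the image already, so the intersection with I is omitted. The acceptance threshold default of 0.9 is a hypothetical value, and the detection rate and recognition accuracy are taken as o2o/G and o2o/S.

```python
def match_score(ground_truth, segment):
    """Intersection-over-union of two regions given as sets of (x, y) points."""
    union = ground_truth | segment
    return len(ground_truth & segment) / len(union) if union else 0.0


def evaluate(gt_regions, seg_regions, acceptance=0.9):
    """Return (o2o, DR, RA), assuming DR = o2o / G and RA = o2o / S.
    `acceptance` stands in for the evaluator's threshold; 0.9 is an assumed value."""
    o2o = 0
    used = set()                       # indices of segments already matched
    for g in gt_regions:
        for i, s in enumerate(seg_regions):
            if i not in used and match_score(g, s) >= acceptance:
                used.add(i)
                o2o += 1
                break
    dr = o2o / len(gt_regions) if gt_regions else 0.0
    ra = o2o / len(seg_regions) if seg_regions else 0.0
    return o2o, dr, ra


ground_truth = [{(0, 0), (1, 0)}, {(0, 5), (1, 5)}]
segments = [{(0, 0), (1, 0)}, {(0, 5)}, {(9, 9)}]
o2o, dr, ra = evaluate(ground_truth, segments)
```

With an acceptance threshold above 0.5, a ground-truth region can match at most one segment among disjoint segments, so the simple greedy pairing used here is sufficient; lower thresholds would need a proper assignment step.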


# Results and Discussion

The algorithm is implemented in MATLAB and was tested on several document images. Sample test results are shown in Figure 4. The experiments show that the proposed method is fast and reliable, even for handwritten documents, provided the lines do not overlap. For good-quality documents, line segmentation achieves a detection rate (DR) of 99% and a recognition accuracy (RA) of 98%. The limitation of the method is that touching characters cause segmentation errors.


Table columns: M, o2o, DR (%), RA (%), PM (%)


# Conclusion and Future Work

The proposed algorithm was tested on several document images. Although it gives robust results, it cannot accurately segment overlapped lines. A heuristic algorithm is needed to handle overlapping lines and words and to recover the lost text.
![Figure 1: Examples of simple and compound characters](image-2.png "Figure 1")
![Figure 2: Basic steps in the segmentation algorithm](image-3.png "Figure 2")
![Figure 3: Steps in noise removal](image-4.png "Figure 3")
![g a small peak in the histogram, shown in green; if this region is tall enough, the algorithm can confuse it with a text line segment. The average height of the lines found in a histogram is computed from Ymax, the maximum height of the text line region, Ymin, the beginning of the text region, and Nr, the total number of line regions.](image-5.png "Figure 4")
![Character Segmentation for Telugu Image Document using Multiple Histogram Projections](image-6.png "F")
![Figure 4: Intermediate stages: (a) input image, (b) pre-processing step, (c) Y histogram projection, (d) text line separation with horizontal histogram projections, (e) X histogram projections for segmented words, (f) X histogram projection for segmented characters, (g) false line](image-7.png "Figure 4")
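The match score of the i-th segmented region and the j-th ground-truth region is

$$\mathrm{MatchScore}(i,j) = \frac{T(G_j \cap S_i \cap I)}{T((G_j \cup S_i) \cap I)}$$

A one-to-one match is counted if the match score is equal to or above the evaluator's acceptance threshold $T_a$. If $G$ is the number of ground-truth elements, $S$ is the number of result elements, and o2o is the number of one-to-one matches, the detection rate (DR) and recognition accuracy (RA) are calculated as follows:

$$DR = \frac{o2o}{G}, \qquad RA = \frac{o2o}{S}$$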
© 2013 Global Journals Inc. (US)
		
		
* [1] C. Lakshmi and C. Patvardhan, "An optical character recognition system for printed Telugu text," Pattern Analysis & Applications, vol. 7, 2004.
* [2] Agarwal and David Doermann, "Voronoi++: A Dynamic Page Segmentation approach based on Voronoi and Docstrum features," 10th International Conference, ICDAR, 2009.
* [3] K. S. Kumar, A. M. Namboodiri, and C. V. Jawahar, "Learning Segmentation of Documents with Complex Scripts," Fifth Indian Conference on Computer Vision, Graphics and Image Processing, Madurai, India, LNCS 4338, 2006.
* [4] B. M. Sagar, G. Shoba, and P. Kumar, "Character Segmentation algorithms for Kannada optical character recognition," Proceedings of the 2008 International Conference on Wavelet Analysis and Pattern Recognition, 2008.
* [5] U. Pal and B. B. Chaudhuri, "Indian script character recognition: A Survey," Pattern Recognition, vol. 37, 2004.
* [6] B. B. Chaudhuri and U., "A complete printed Bangla OCR system," Pattern Recognition, vol. 31, 1998.
* [7] Vijay Kumar, K. Pankaj, and Senegar, "Segmentation of Printed Text in Devnagari Script and Gurmukhi Script," IJCA: International Journal of Computer Applications, vol. 3, 2010.
* [8] U. Pal and Sagarika Datta, "Segmentation of Bangla Unconstrained Handwritten Text," Proc. 7th Int. Conf. on Document Analysis and Recognition, 2003.
* [9] K. Wong, R. Casey, and F. Wahl, "Document Analysis System," IBM J. Res. Dev., vol. 26, no. 6, 1982.
* [10] L. Likforman-Sulem, A. Zahour, and B. Taconet, "Text line Segmentation of Historical Documents: a Survey," International Journal on Document Analysis and Recognition, vol. 9, no. 2, Springer, 2007.
* [11] U. Pal and P. P. Roy, "Multi-oriented and curved text lines extraction from Indian documents," IEEE Trans. on Systems, Man and Cybernetics-Part B, vol. 34, 2004.
* [12] U. Pal and B. B. Chaudhuri, "Indian script character recognition: a survey," Pattern Recognition, vol. 37, 2004.
* [13] B. Anuradha, Arun Agarwal, and C. Raghavendra Rao, "An Overview of OCR Research in Indian Scripts," vol. 2, 2008.
* [14] U. Pal and B. B. Chaudhuri, "Indian script character recognition: A Survey," Pattern Recognition, vol. 37, 2004.
* [15] N. Otsu, "A threshold selection method from gray-level histograms," IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, 1979.
* [16] I. Phillips and A. Chhabra, "Empirical Performance Evaluation of Graphics Recognition Systems," IEEE Trans. of Patt. Analysis and Machine Intell., vol. 21, no. 9, September 1999.