# Introduction

Text line segmentation is an essential preprocessing stage for recognition in many Optical Character Recognition (OCR) systems. It is a vital step because inaccurately segmented text lines cause errors in the recognition stage. Segmentation of handwritten documents remains one of the most challenging problems.

Several text line segmentation techniques for Indian script documents are reported in the literature. These include projection profiles (white space analysis) [1], Voronoi and docstrum [2], graph cut, and connected-component-based methods. However, these methods do not segment such documents accurately. Jawahar et al. [3] proposed a graph cut method that requires a priori information about the script structure in order to cut. Rajasekharan proposed a projection-based method for segmenting Kannada script documents [4]. As a conventional technique for text line segmentation, global horizontal projection analysis of black pixels has been used in [5,6,7,8]. Partial, or piecewise, horizontal projection analysis of black pixels, a modification of the global projection technique, has been employed by many researchers to segment text pages in different languages [9,10,11]. In the piecewise horizontal projection technique, the text page image is decomposed into vertical strips. The positions of potential piecewise separating lines are obtained for each strip by applying partial horizontal projection to that strip. The potential separating lines are then connected to obtain complete separating lines for all text lines in the page image.

In this paper we propose a robust method for segmenting documents into lines and words. Because the Telugu script is very complex, the proposed method is based on a modified histogram. Foreground and background information is also used for accurate line segmentation. The method takes care of eliminating false lines and recovering text lost in overlapped text lines.
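As a rough illustration of the piecewise horizontal projection technique described above, the following Python sketch splits a binary image into vertical strips, computes the horizontal projection of black pixels in each strip, and reports one candidate separating row per empty gap. The function names, the list-of-rows 0/1 image representation, and the choice of the gap midpoint as the separator are assumptions made for this example; they are not taken from the cited works or from the proposed method. The step that connects separators across strips is omitted.

```python
from typing import List, Tuple

# Binary image as rows of 0/1 values: 1 = ink (black pixel), 0 = background.
Image = List[List[int]]


def horizontal_projection(image: Image, x0: int, x1: int) -> List[int]:
    """Count ink pixels in every row, restricted to columns x0 .. x1-1."""
    return [sum(row[x0:x1]) for row in image]


def gap_midpoints(profile: List[int]) -> List[int]:
    """Return the middle row of each empty run lying between two inked rows."""
    mids: List[int] = []
    gap_start = None   # first row of the gap currently being scanned
    seen_ink = False   # skip the empty margin above the first inked row
    for y, count in enumerate(profile):
        if count == 0:
            if seen_ink and gap_start is None:
                gap_start = y
        else:
            if gap_start is not None:
                # The gap covers rows gap_start .. y-1; take its middle row.
                mids.append((gap_start + y - 1) // 2)
                gap_start = None
            seen_ink = True
    # A trailing empty margin is never closed, so it yields no separator.
    return mids


def piecewise_separators(
    image: Image, n_strips: int
) -> List[Tuple[int, int, List[int]]]:
    """For each vertical strip, return (x0, x1, candidate separator rows)."""
    width = len(image[0])
    result = []
    for k in range(n_strips):
        x0 = k * width // n_strips
        x1 = (k + 1) * width // n_strips
        profile = horizontal_projection(image, x0, x1)
        result.append((x0, x1, gap_midpoints(profile)))
    return result
```

Calling `piecewise_separators(image, 1)` reduces to the global horizontal projection; larger values of `n_strips` let the separator row differ from strip to strip, which is what makes the piecewise variant more tolerant of skewed or undulating lines.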
The rest of the paper is organized as follows. Section 2 describes the properties of the Telugu script considered here. Section 3 presents the proposed approach, Section 4 reports the experimental results, and Section 5 concludes the paper.

# II. Characteristics of Telugu Script

Telugu is the most popular script-based language spoken in South India. The Telugu character set contains 16 vowels, 36 consonants, vowel modifiers (maatras), and consonant modifiers (vaththus). These characters are combined to represent several thousand frequently used syllables (estimated at between 5000 and 10000) in the language [12,13,14]. We refer to these basic orthographic units as glyphs (each represented as a single connected component). The characters vary in size (both width and height), whereas in Latin-based scripts most characters have the same size, with few exceptions. Segmenting Telugu characters is therefore more difficult than segmenting Latin-based scripts such as English. Figure 1 shows sample images of simple and compound Telugu characters.

# Proposed Approach

Here we propose a new technique that automatically identifies and segments the text line regions of handwritten documents. Figure 2 shows the basic steps of the proposed algorithm.

The raw data is subjected to a number of preliminary processing steps to make it usable in the character analysis stages. Pre-processing aims to produce data that can be segmented easily and accurately. Its main objectives are binarization, noise reduction, skeletonization/normalization, and skew correction. We work with binary images; to convert the original grey-level document images into binary images, we apply Otsu's algorithm [15]. For noise removal we use morphological operators. Figure 3 shows the steps in noise removal. The noise-removed, skew-corrected output image of the pre-processing phase is then given as input to the segmentation stage.

The lines with height below a pre-determined threshold are removed.
The value of this threshold is proportional to the average height of the text lines in the whole image.

# d) False Word Exclusion

As in Section 3.3, we find the average height of the words in the x direction; any word that does not satisfy the determined threshold is treated as a false word.

# IV. Performance Evaluation

Performance is evaluated by counting the number of matches between the segmented entities and the entities in the ground truth [16]. A match score table is created from the pixels at which the segments and the ground truth coincide. Let $I$ be the set of all image points, $G_j$ the set of all points inside the $j$-th ground-truth region, $S_i$ the set of all points inside the $i$-th segmented region, and $T(s)$ a function that counts the elements of set $s$. The matching result of the $j$-th ground-truth region and the $i$-th segmented region is:

$$\text{MatchScore}(i,j) = \frac{T(G_j \cap S_i \cap I)}{T((G_j \cup S_i) \cap I)}$$

# Results and Discussion

The algorithm is implemented in MATLAB and was tested on several document images. Sample test results are shown in Figure 4. The experiments show that the proposed method is fast and reliable, even for handwritten documents, provided the lines do not overlap. For good-quality documents, line segmentation achieves a DR of 99% and an RA of 98%. The limitation of the method is that it produces segmentation errors for touching characters.

[Table columns: M, o2o, DR (%), RA (%), PM (%)]

# Conclusion and Future Work

In this experiment, the proposed algorithm was tested on several document images. Although the algorithm gives robust results, it could not accurately segment overlapped lines. A heuristic algorithm needs to be devised for overlapping lines and words in order to recover the lost text.
![Figure 1: Examples of simple and compound characters](image-2.png "Figure 1")

![Figure 2: Basic steps in the segmentation algorithm](image-3.png "Figure 2")

![Figure 3: Steps in noise removal](image-4.png "Figure 3")

... a small peak in the histogram, shown in green. If this region has enough height, the algorithm can confuse it with a text line segment. The equation below gives the average height of the lines found in a histogram:

$$\text{Average height} = \frac{1}{N_r} \sum_{k=1}^{N_r} \left( Y_{max,k} - Y_{min,k} \right)$$

where $Y_{max}$ is the maximum height of the text line region, $Y_{min}$ is the beginning of the text region, and $N_r$ is the total number of line regions.

![Character Segmentation for Telugu Image Document using Multiple Histogram Projections](image-6.png "F")

![Figure 4: Intermediate stages: (a) input image, (b) pre-processed image, (c) y histogram projection, (d) text line separation with horizontal histogram projections, (e) x histogram projections for segmented words, (f) x histogram projection for segmented characters, (g) false line](image-7.png "Figure 4")

© 2013 Global Journals Inc. (US)

A one-to-one match is counted if the match score is equal to or above the evaluator's acceptance threshold $T_a$. If $G$ is the number of ground-truth elements, $S$ is the number of result elements, and $o2o$ is the number of one-to-one matches, the detection rate (DR) and recognition accuracy (RA) are calculated as follows:

$$DR = \frac{o2o}{G}, \qquad RA = \frac{o2o}{S}$$
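The evaluation described above can be sketched in Python as follows. The sketch assumes that the match score of a ground-truth region and a segmented region is the number of pixels they share divided by the number of pixels in their union, both restricted to the image points $I$, and that DR and RA are the one-to-one match count divided by $G$ and $S$ respectively. The function names, the representation of regions as sets of `(row, column)` tuples, the greedy pairing, and the default threshold of 0.95 are illustrative assumptions; the paper reports a MATLAB implementation and does not state the value of $T_a$.

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int]  # (row, column) of a foreground point


def match_score(g: Set[Pixel], s: Set[Pixel], image_points: Set[Pixel]) -> float:
    """Shared pixels over combined pixels of two regions, restricted to the image."""
    shared = len(g & s & image_points)
    combined = len((g | s) & image_points)
    return shared / combined if combined else 0.0


def evaluate(
    ground_truth: List[Set[Pixel]],
    segments: List[Set[Pixel]],
    image_points: Set[Pixel],
    ta: float = 0.95,  # assumed acceptance threshold, not given in the paper
) -> Tuple[int, float, float]:
    """Return (o2o, DR, RA) for one page."""
    o2o = 0
    used = set()  # indices of segments already paired with a ground-truth region
    for g in ground_truth:
        for i, s in enumerate(segments):
            if i not in used and match_score(g, s, image_points) >= ta:
                used.add(i)
                o2o += 1
                break  # each ground-truth region is matched at most once
    dr = o2o / len(ground_truth) if ground_truth else 0.0  # DR = o2o / G
    ra = o2o / len(segments) if segments else 0.0          # RA = o2o / S
    return o2o, dr, ra
```

The greedy first-match pairing is a simplification. With a threshold above 0.5 it cannot pair one region twice when regions within each list do not overlap, because two disjoint segments cannot both cover more than half of the same ground-truth region.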
* [1] C. Lakshmi, C. Patvardhan, "An optical character recognition system for printed Telugu text," Pattern Analysis & Applications, vol. 7, 2004.
* [2] Agarwal, D. Doermann, "Voronoi++: A Dynamic Page Segmentation approach based on Voronoi and Docstrum features," 10th International Conference on Document Analysis and Recognition (ICDAR), 2009.
* [3] K. S. Kumar, A. M. Namboodiri, C. V. Jawahar, "Learning Segmentation of Documents with Complex Scripts," Fifth Indian Conference on Computer Vision, Graphics and Image Processing, LNCS 4338, Madurai, India, 2006.
* [4] B. M. Sagar, G. Shoba, P. Kumar, "Character Segmentation algorithms for Kannada optical character Recognition," Proceedings of the 2008 International Conference on Wavelet Analysis and Pattern Recognition, 2008.
* [5] U. Pal, B. B. Chaudhuri, "Indian script character recognition: A Survey," Pattern Recognition, vol. 37, 2004.
* [6] B. B. Chaudhuri, U. Pal, "A complete printed Bangla OCR system," Pattern Recognition, vol. 31, 1998.
* [7] V. Kumar, P. K. Senegar, "Segmentation of Printed Text in Devnagari Script and Gurmukhi Script," IJCA: International Journal of Computer Applications, vol. 3, 2010.
* [8] U. Pal, S. Datta, "Segmentation of Bangla Unconstrained Handwritten Text," Proc. 7th Int. Conf. on Document Analysis and Recognition, 2003.
* [9] K. Wong, R. Casey, F. Wahl, "Document Analysis System," IBM J. Res. Dev., vol. 26, no. 6, 1982.
* [10] L. Likforman-Sulem, A. Zahour, B. Taconet, "Text line Segmentation of Historical Documents: a Survey," International Journal on Document Analysis and Recognition, vol. 9, no. 2, Springer, 2007.
* [11] U. Pal, P. P. Roy, "Multi-oriented and curved text lines extraction from Indian documents," IEEE Trans. on Systems, Man and Cybernetics, Part B, vol. 34, 2004.
* [12] U. Pal, B. B. Chaudhuri, "Indian script character recognition: a survey," Pattern Recognition, vol. 37, 2004.
* [13] B. Anuradha, A. Agarwal, C. Raghavendra Rao, "An Overview of OCR Research in Indian Scripts," 2008, 2.
* [14] U. Pal, B. B. Chaudhuri, "Indian script character recognition: A Survey," Pattern Recognition, vol. 37, 2004.
* [15] N. Otsu, "A threshold selection method from gray-level histograms," IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, 1979.
* [16] I. Phillips, A. Chhabra, "Empirical Performance Evaluation of Graphics Recognition Systems," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 21, no. 9, September 1999.