Abstract- Statistical N-gram language modeling is used in many domains, such as spelling and syntactic verification, speech recognition, machine translation and character recognition. This paper describes a system for sentence structure verification based on N-gram modeling of Bangla. An experimental corpus containing one million word tokens was used to train the system. The corpus is a part of the BdNC01 corpus, created in the SIPL lab of Islamic University. The system was tested with 1000 correct and another 1000 incorrect sentences, drawn from sample text collected from different newspapers. The system identified the structural validity of the test sentences at a rate of 93%. This paper also describes the limitations of our system along with possible solutions.

# I. Introduction

The goal of statistical language modeling is to build a statistical language model that estimates the distribution of natural language as accurately as possible. A statistical language model (SLM) is a probability distribution $P(s)$ over strings $s$ that attempts to reflect how frequently a string $s$ occurs as a sentence. By expressing various language phenomena in terms of simple parameters in a statistical model, SLMs provide an easy way to deal with complex natural language on a computer. N-gram based modeling has therefore found wide acceptance among researchers working on structural processing of natural language.

An n-gram model is a type of probabilistic model for predicting the next item in a sequence. More concisely, an n-gram model predicts $x_i$ based on $x_{i-1}, x_{i-2}, x_{i-3}, \ldots, x_{i-n}$. In probability terms, this is nothing but $P(x_i \mid x_{i-1}, x_{i-2}, \ldots, x_{i-n})$. An n-gram of size 1 is referred to as a "unigram", size 2 is a "bigram", size 3 is a "trigram", and size 4 or more is simply called an "n-gram". For a sequence of words, for example "the dog smelled like a skunk", the trigrams would be: "# the dog", "the dog smelled", "dog smelled like", "smelled like a", "like a skunk" and "a skunk #". N-grams are typically constructed from statistics obtained from a large corpus of text, using the co-occurrences of words in the corpus to determine word sequence probabilities.

N-grams have the advantage of being able to cover a much larger language than would normally be derived directly from a corpus. Open vocabulary applications are easily supported with N-gram grammars [1]. Among the many application areas, an important one is assessing the probability of a given word sequence appearing in text of a language of interest, in pattern recognition systems, speech recognition, OCR, Intelligent Character Recognition (ICR), machine translation and similar applications [2]. By converting a sequence of items to a set of n-grams, the sequence can be embedded in a vector space, which allows it to be compared to other sequences in an efficient manner. The idea of n-gram based sentence structure verification has come from these opportunities provided by n-grams.

Sentence structure verification is the task of testing the syntactic correctness of a sentence. It is mostly used in word processors and compilers. For applications like compilers it is easier to implement, because the vocabulary of a programming language is finite; for a natural language it is challenging because the vocabulary is unbounded. Three methods are widely used for grammar checking in a language: syntax-based checking, statistics-based checking and rule-based checking. In syntax-based grammar checking [3], each sentence is completely parsed to check its grammatical correctness, and the text is considered incorrect if the parsing does not succeed. In the statistics-based approach [4], a corpus is used to train a model. Some sequences will be very common, while others will probably not occur at all; sequences that are uncommon in the training corpus can be considered incorrect in this approach. In the rule-based approach [5], a set of hand-crafted rules is matched against a text that has at least been POS tagged. This approach is very similar to the statistics-based approach, but the rules are developed manually.

However, one of the most widely used grammar checkers for English, the Microsoft Office Suite grammar checker, is also not above controversy [6]. It demonstrates that building a real-time grammar checker is not an easy task, so starting the implementation of structural verification of a sentence for a language like Bangla is a major feat. In our work, an effort has been made to develop a system that verifies Bangla sentence structure using a statistical, or more specifically n-gram based, method. This approach was chosen because it does not need language resources such as hand-crafted grammatical rules; it needs only a corpus to train the language model (LM). Given the scarcity of language resources for Bangla, the proposed approach may be the only reasonable one for the foreseeable future.
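The trigram example given earlier in this introduction can be reproduced with a short sketch. The function name `extract_ngrams` and the choice of exactly one `#` boundary marker on each side are assumptions made for illustration, chosen to match the listed trigrams; they are not taken from the software used in this work.

```python
def extract_ngrams(tokens, n, boundary="#"):
    """Return the n-grams of a token list, padded with one boundary marker per side."""
    padded = [boundary] + list(tokens) + [boundary]
    # Slide a window of width n over the padded token sequence.
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


sentence = "the dog smelled like a skunk".split()
trigrams = [" ".join(gram) for gram in extract_ngrams(sentence, 3)]
print(trigrams)
```

Calling the same function with `n=2` or `n=4` yields the bigrams or quad-grams of the sentence.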


# II. Techniques Adopted in the Proposed System

Similarly, the possible quad-grams (N-grams with N=4) are:

After a model was trained using the above concept, it was used to build a test system. To test whether a sentence is correct, the N-grams (with N = 2, 3, or 4) in the sentence were counted first. Using all the N-grams of the sentence, a score was generated for the sentence. If the score is greater than a predefined threshold, the sentence is considered syntactically correct; otherwise, it is considered syntactically incorrect.
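One way to write this decision rule down, assuming the score is the product of the N-gram probabilities of the sentence (as the test system section describes for bigrams), is the following. The symbols $w_i$ for the $i$-th word, $m$ for the sentence length and $\theta$ for the threshold are introduced here for illustration only; the value of the threshold is not stated in the text.

$$
\mathrm{score}(s) = \prod_{i=1}^{m} P(w_i \mid w_{i-N+1}, \ldots, w_{i-1}),
\qquad
s \text{ is accepted as correct if and only if } \mathrm{score}(s) > \theta .
$$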


# III. Training the N-gram Model

The first step in computing N-grams is counting unigrams. The unigram counts and the necessary software tools were already available in the laboratory, so the work started from the bigram counts. After the existing software tools were updated, bigrams, trigrams and quad-grams were identified, counted and stored in separate disk files. In all cases, the input to the software was the sample corpus contained in the file corpus.

In the statistical approach, the probability of a sentence can be measured using n-gram analysis. For example, under the bigram model the probability of the sentence "???? ?? ??? ?????" is the product of the conditional probabilities of its successive bigrams. To estimate the structural correctness of a sentence, we calculate its probability in this way. If the probability is above some threshold, we consider the sentence to be structurally correct. However, if any of the words of the sentence is not in the corpus, the probability of the whole sentence becomes zero because of the multiplication. To solve this problem, Witten-Bell smoothing [7] was used to calculate the probability of a sentence in our work.

The sample corpus used in this work is a part of another corpus under construction in the Speech and Image Processing Lab of Islamic University, Bangladesh. We developed the necessary programs to assemble sequences of N tokens into N-grams. Typically, N-grams are formed of contiguous tokens that occur one after another in the input corpus.
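The effect of smoothing on unseen bigrams can be illustrated with a minimal sketch. It assumes one common formulation of Witten-Bell discounting for bigrams, in which a history seen $c(h)$ times with $T(h)$ distinct followers assigns $c(h, w) / (c(h) + T(h))$ to a seen bigram and shares the remaining mass $T(h) / (c(h) + T(h))$ equally among the unseen word types. The text does not say which variant was used, so the function names, the `#` boundary marker, the extra vocabulary slot for unknown words and the uniform fallback for unseen histories are all assumptions.

```python
from collections import Counter, defaultdict


def train_bigram_counts(sentences):
    """Collect history counts, bigram counts and follower sets from tokenized sentences."""
    history_counts = Counter()
    bigram_counts = Counter()
    followers = defaultdict(set)
    vocab = set()
    for tokens in sentences:
        padded = ["#"] + list(tokens) + ["#"]
        vocab.update(padded)
        for prev, word in zip(padded, padded[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
            followers[prev].add(word)
    # Assumption: reserve one extra type so that unknown words receive some mass.
    return history_counts, bigram_counts, followers, len(vocab) + 1


def witten_bell_bigram(prev, word, model):
    """Witten-Bell smoothed estimate of P(word | prev); never returns zero."""
    history_counts, bigram_counts, followers, vocab_size = model
    c_h = history_counts[prev]
    if c_h == 0:
        # Assumption: an unseen history falls back to a uniform distribution.
        return 1.0 / vocab_size
    t_h = len(followers.get(prev, ()))
    c_hw = bigram_counts[(prev, word)]
    if c_hw > 0:
        return c_hw / (c_h + t_h)
    z_h = vocab_size - t_h  # number of word types never seen after prev
    return t_h / (z_h * (c_h + t_h))
```

With this formulation the probabilities over all word types following a given history sum to one, and every bigram receives a nonzero probability, which is what the product-based sentence score requires.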


# IV. The Test System

To test whether a sentence is correct, all the bigrams of the sentence were first identified and counted. Their probabilities were obtained from the respective models, and Witten-Bell smoothing was applied so that the resulting set of probabilities contained only nonzero values. A score for the sentence was generated by multiplying the probabilities of all its bigrams. If the score is greater than a predefined threshold, the sentence is considered syntactically correct; otherwise, it is considered incorrect. The functional block diagram of the system is shown in Figure 3. For the trigram and quad-gram models, the same algorithm was followed, with the bigrams replaced by trigrams or quad-grams respectively.
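The scoring step can be sketched as follows. This is an illustration under stated assumptions rather than the system's actual code: `bigram_prob` stands for any smoothed probability function that never returns zero, the `#` boundary padding is assumed, and the threshold value is hypothetical. The sketch sums log-probabilities instead of multiplying raw probabilities; this gives the same decision as comparing the product with the threshold and avoids numerical underflow on long sentences.

```python
import math


def sentence_log_score(tokens, bigram_prob):
    """Sum of log bigram probabilities, i.e. the log of the product score."""
    padded = ["#"] + list(tokens) + ["#"]
    return sum(math.log(bigram_prob(prev, word))
               for prev, word in zip(padded, padded[1:]))


def is_structurally_valid(tokens, bigram_prob, threshold):
    """Accept the sentence if its product score exceeds the (hypothetical) threshold."""
    return sentence_log_score(tokens, bigram_prob) > math.log(threshold)
```

For the trigram or quad-gram variants, the same structure applies with the pairwise window replaced by a window of three or four tokens.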


# V. Experimental Results and Discussion

In our experiment, 1000 sentences were collected from the web edition of a daily newspaper to form a test set. The test set was disjoint from the training corpus. All of these 1000 sentences were structurally correct. The results generated by the test system with these correct sentences as input are shown in Table 1. For a second experiment, all 1000 sentences were modified to make them structurally incorrect and presented again as input to the test system. The results of the second experiment are also shown in Table 1.
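The average reported in Table 1 can be checked from the per-model success counts. The short calculation below uses only the figures given in Table 1; the variable names are illustrative.

```python
# (test set, model) -> (number of successes, number of sentences), from Table 1
results = {
    ("correct", "bigram"): (900, 1000),
    ("correct", "trigram"): (905, 1000),
    ("correct", "quad-gram"): (907, 1000),
    ("incorrect", "bigram"): (950, 1000),
    ("incorrect", "trigram"): (961, 1000),
    ("incorrect", "quad-gram"): (963, 1000),
}

# Per-experiment success rate in percent, then the unweighted mean over all six.
rates = {key: 100.0 * ok / total for key, (ok, total) in results.items()}
average = sum(rates.values()) / len(rates)
print(f"average performance: {average:.1f}%")  # prints "average performance: 93.1%"
```

The complementary figure, 100% minus 93.1%, is the 6.9% average failure rate.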


# VI. Discussion

Word errors in Bangla belong to one of two distinct categories: non-word errors and real-word errors. A non-word is a string of characters separated by spaces that has no meaning. A real-word error is a valid word that is not the intended word in the sentence, which makes the sentence syntactically or semantically ill-formed. The developed system can identify both types of errors with a failure rate of 6.9% on average. The major cause of these failures is the limited volume of the training corpus; the larger the training corpus, the higher the success rate is expected to be.


# VII. Conclusion

We have developed a statistical sentence structure verifier for Bangla, which performs reasonably well as a rudimentary sentence verifier. The performance of the system can be improved by increasing the volume of training data, and a hybrid system combining statistical and rule-based approaches can be developed.
2014![Global Journals Inc. (US) Global Journal of Computer Science and Technology Volume XIV Issue I Version I 1 Year 2014](image-2.png "T © 2014")
![Verification of Bangla Sentence Structure using N-Gram](image-3.png "")
![Figure 1(a) : Samples of first step computation](image-4.png "")
![Intermediate results of the second step of computation](image-5.png "")

In the second step of computation, the outputs of the first step were used as inputs. A set of program modules was developed to compute bigram, trigram and quad-gram probabilities using the N-gram and (N-1)-gram counts. For example, bigram probabilities were calculated by using unigram and bigram counts. The intermediate results of the system, as the outputs of the second step, are shown in Figure 2.
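Assuming the second step uses the usual relative-frequency estimate before smoothing, the computation from N-gram and (N-1)-gram counts can be written as below. The notation $C(\cdot)$ for a corpus count and $w_i$ for the $i$-th word is introduced here for illustration.

$$
P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})},
\qquad
P(w_i \mid w_{i-N+1}, \ldots, w_{i-1}) = \frac{C(w_{i-N+1} \ldots w_{i-1}\, w_i)}{C(w_{i-N+1} \ldots w_{i-1})} .
$$

The first form is the bigram case, computed from unigram and bigram counts; the second is the general case for trigrams and quad-grams.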
![Figure 1(b): Samples of first step computation](image-6.png "")
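Table 1: Results with correct and incorrect sentences

| Test set | Model | No. of sentences | No. of successes | Performance |
| --- | --- | --- | --- | --- |
| Correct sentences | Bigram | 1000 | 900 | 90% |
| Correct sentences | Trigram | 1000 | 905 | 90.5% |
| Correct sentences | Quad-gram | 1000 | 907 | 90.7% |
| Incorrect sentences | Bigram | 1000 | 950 | 95% |
| Incorrect sentences | Trigram | 1000 | 961 | 96.1% |
| Incorrect sentences | Quad-gram | 1000 | 963 | 96.3% |
| Average | | | | 93.1% |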
		
		
* [1] Michael K. Brown, Andreas Kellner, Dave Raggett. Stochastic Language Models (N-Gram) Specification. W3C/Openwave. Access date: 8th Dec. 2010.
* [2] Wikipedia. n-gram. Access date: 17th Dec. 2010.
* [3] W.-K. Chen. Linear Networks and Systems (Book style). Natural Language Processing, the PLNLP approach. Belmont, CA: Wadsworth, 1993.
* [4] Eric Atwell, Stephen Elliott. Dealing with ill-formed English text. The Computational Analysis of English. Longman, 1987.
* [5] Daniel Naber. A Rule-Based Style and Grammar Checker. Diploma Thesis, Computer Science - Applied, University of Bielefeld, 2003.
* [6] Sandeep Krishnamurthy. A Demonstration of the Futility of Using Microsoft Word's Spelling and Grammar Check.
* [7] Daniel Jurafsky, James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, Englewood Cliffs, New Jersey 07632, September 28, 1999.