# Introduction

With the exponential growth of the internet and its free accessibility to users across the globe, security of data across the internet has become a prime concern. This security aspect can be divided into security of data and information within individual systems and security in transit between internet users across the network. Data and information can in turn be divided into text and non-text data, e.g. pictures, graphics, audio and video clips. Earlier, text data was only in the English language. The introduction of Unicode and the process of localization [2] encouraged language-based information exchange, resulting in text data being transmitted in all languages across the internet. This paper deals with the security of text data while it is transported across the network.

Cryptography is one of the methods by which the security goals of data transmission can be achieved, by means of encryption and decryption. The key used can be symmetric or asymmetric. Encryption transforms blocks of a fixed- or variable-size bit stream into a cipher stream, using either block cipher or stream cipher techniques. The main parameters in these schemes are the algorithm and the key. The larger the key size, the greater the security of the data and the slower the data rate. One more parameter has been considered, i.e. the complexity of a language [6], with a case study on Telugu. The greater the complexity of the language, the greater the security of text transmitted in that language, keeping other parameters such as the encryption algorithm and the key constant. A simple logical conclusion is that if the text of a script is complex, the same level of security can be achieved with a smaller key size. Subsequently, a comparative study was carried out over English and Telugu with Bengali as a case study [7], and it was observed that the percentage retrieval of data in Bengali is less than in Telugu and English.

In this paper, a comparative study is carried out on various other Indian languages, and a fourth security parameter is added, i.e. an intelligent method of text data encryption with security [8].


# II. Review

A lot of study has gone into making the job of cryptanalysts simpler. The languages of the world consist of characters displaying different properties and behavior [3,4], which help in the process of cryptanalysis. One method of determining language complexity is frequency analysis. In this process the frequency of each symbol in the encrypted message is determined. Cryptanalysts use this information to determine which cipher text symbol maps to which plaintext symbol. In transposition systems, the letter frequencies of a cryptogram are identical to those of the plaintext. In the simplest substitution systems, each plaintext letter has one cipher text equivalent. The cipher text letter frequencies are not identical to the plaintext frequencies, but the same numbers are present in the frequency count as a whole. A method for fast cryptanalysis of substitution ciphers has been proposed by Thomas Jakobsen [1], which uses knowledge of the digram distribution of the cipher text. The individual letters of any language occur with greatly varying frequencies [9]. This factor has been used to solve various simple ciphers. There are two general approaches to solving simple ciphers. One makes use of the frequency characteristics, and the other uses the orderly progression of the alphabet to generate all possible decipherments, from which the correct plaintext can be picked. Statistical analysis of the frequencies of multiple letters has been found to be more helpful than that of single letters when retrieving part of a plain text message. By combining the techniques of monogram frequencies, keyword rules and dictionary checking, the cryptanalytic technique of enhanced frequency analysis has been developed [5].
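The two frequency properties stated above can be checked with a short sketch. The shuffle-based transposition and the Caesar shift used here are illustrative stand-ins chosen for brevity, not the ciphers studied in this paper, and the function names are hypothetical.

```python
from collections import Counter
import random

def transpose(text, seed=0):
    """Toy transposition: reorder the characters with a seeded shuffle."""
    chars = list(text)
    random.Random(seed).shuffle(chars)
    return "".join(chars)

def substitute(text, shift=3):
    """Toy simple substitution: Caesar shift over lowercase letters."""
    return "".join(
        chr((ord(c) - ord("a") + shift) % 26 + ord("a")) if "a" <= c <= "z" else c
        for c in text
    )

plain = "dance little baby dance up high"
plain_counts = Counter(plain)
trans_counts = Counter(transpose(plain))
subst_counts = Counter(substitute(plain))

# Transposition: every letter keeps exactly its own frequency.
same_letter_frequencies = plain_counts == trans_counts
# Substitution: the letters change, but the same numbers appear in the count.
letters_differ = plain_counts != subst_counts
same_count_multiset = sorted(plain_counts.values()) == sorted(subst_counts.values())
```

Because the multiset of counts survives substitution, ranking cipher symbols by frequency is enough to line them up against a ranked reference alphabet.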

Plain text is encrypted using the proposed algorithm, resulting in cipher text. The frequencies of the different characters in the cipher text are extracted. A mapping between the characters of the plain text and the cipher text is then built from these frequencies. The characters in the cipher text are replaced with the mapped plain text characters, and the percentage of exact retrieval relative to the plain text is calculated, following K. W. Lee et al. [5].
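The procedure in the preceding paragraph can be sketched as follows. This is a minimal illustration under stated assumptions: the cipher is a random monoalphabetic substitution over a-z rather than the proposed algorithm, the attacker's reference ranking is taken from the plain text itself (a best case), and all function names are hypothetical.

```python
from collections import Counter
import random
import string

def make_key(seed=1):
    """Random monoalphabetic substitution key over a-z (illustrative only)."""
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(letters, shuffled))

def encrypt(text, key):
    """Substitute letters; leave spaces and other characters unchanged."""
    return "".join(key.get(c, c) for c in text)

def frequency_order(text):
    """Letters sorted by descending count; ties broken alphabetically."""
    counts = Counter(c for c in text if c in string.ascii_lowercase)
    return [c for c, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]

def reverse_frequency_map(cipher, reference_order):
    """Map the i-th most frequent cipher letter to the i-th reference letter."""
    return dict(zip(frequency_order(cipher), reference_order))

def percent_exact_retrieval(plain, guess):
    """Percentage of letter positions where the guess equals the plain text."""
    pairs = [(p, g) for p, g in zip(plain, guess) if p in string.ascii_lowercase]
    hits = sum(1 for p, g in pairs if p == g)
    return 100.0 * hits / len(pairs)

plain = "dance little baby dance up high never mind baby mother is by"
cipher = encrypt(plain, make_key())
# Assumption: reference ranking comes from the plain text itself (best case).
mapping = reverse_frequency_map(cipher, frequency_order(plain))
guess = "".join(mapping.get(c, c) for c in cipher)
retrieval = percent_exact_retrieval(plain, guess)
```

Letters with equal counts cannot be told apart by frequency alone, so `retrieval` generally stays below 100 for short texts even in this best case.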


# III. Conditional Probability

A vast amount of study has been carried out on the frequency of occurrence of characters in many languages. The characters of different languages have different frequency patterns. This information helps a cryptanalyst to retrieve data from a cipher text by reverse frequency mapping. The percentage of data retrieved increases with the corpus size of the sample text. As an example, consider the English language. First, a corpus frequency string is calculated from a very large corpus of English text. The corpus frequency string consists of all the different characters available in the corpus text. Next, the percentage of occurrence of each character of the corpus frequency string is calculated. To find the percentage retrieval of text after encryption for a new sample text, the new sample text is encrypted. The cryptanalyst, using reverse frequency mapping, tries to retrieve the maximum possible number of characters. The percentages of occurrence of those characters, already calculated from the corpus frequency string, are added to give the total percentage retrieval. E.g. suppose the retrieved characters are a, r, y and k.

ii. Sample text → dictionary file + coded file → dictionary file encrypted → reverse frequency mapping → decrypted → compared with sample text → calculate % retrieval using corpus frequency string.

The difference in % retrieval between (a) and (b) is the advantage gained by the proposed plan.

Here the text file is converted into a dictionary file, which lists the words in descending order of frequency of occurrence and assigns each word an extended ASCII value, referred to as the code for that word. The extended ASCII values selected are from 33 to 250, i.e. a total of 218 different words are covered. If the number of different words exceeds 218, the numbers are repeated by appending to the previous values, e.g. the 219th word will be 33 33, the next will be 33 34, and so on. Based on the coded values of the words in the dictionary, the text file is converted into a coded file which can be decoded only with the dictionary. The dictionary created for each text file is different, hence dynamic in nature. Each text file has a unique dictionary. The paper [8] discusses the concept of the dictionary and the coded file but leaves open how the dictionary is to be transmitted securely. It also does not consider any language other than English, i.e. it has not considered language complexity as a factor in the security of text data. The coded file created in this paper is different: it contains only the coded values, which is sufficient for the purpose.


| Sl. No | Plain Text | Characters retrieved (out of E T A I N S O R H C D L M U P F G B W Y V K X J Q Z) | % |
|---|---|---|---|
| 1 | 1000 | E P Q | 15.70 |
| 2 | 2000 | E T R H D V K Z | 37.50 |
| 3 | 4000 | E T A I N R H M U V K Z | 63.97 |
| 4 | 6000 | E T A I N S O R H M U W V K Z | 72.38 |
| 5 | 9000 | E I N R H M U W V K Z | 47.13 |
| 6 | 12000 | E I N R H M U V K Q Z | 45.99 |
| 7 | 16000 | E T A I N R H M U V K Q Z | 64.32 |
| 8 | 20000 | E T A I N R H U V K Q Z | 61.20 |
| 9 | 25000 | E T A I N R H M U V K Q Z | 64.60 |
| 10 | 30000 | E T A I N R H M U Y V K Q Z | 66.20 |
| 11 | 40000 | E T A I N R H M U P Y V K Q Z | 68.48 |
| 12 | 50000 | E T A I N R H M U P Y V K Q Z | 63.31 |
| 13 | 70000 | E T A I N R H L M U P F G B W Y V K Q Z | 78.27 |
| 14 | 90000 | E T A I N R H C D L M F G B W Y V K Q Z | 80.98 |
| 15 | 1E+05 | E T A I N S O R H C D L M U P F G B W Y V K X J Q Z | 100.00 |
If the percentage occurrences of the retrieved characters are 8.73, 6.63, 1.24 and 0.58, then the total % retrieval is 8.73 + 6.63 + 1.24 + 0.58 = 17.18, as per the chart shown in Fig. 1.
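The arithmetic above can be written as a small helper. This is only a sketch: the function name is hypothetical, and pairing each quoted percentage with a particular letter (a, r, y, k) is an assumption made for illustration.

```python
def total_percent_retrieval(retrieved_chars, corpus_percent):
    """Sum the corpus percentages of the distinct characters recovered."""
    return sum(corpus_percent[c] for c in set(retrieved_chars))

# Percentages quoted in the text; the letter-to-value pairing is assumed.
corpus_percent = {"a": 8.73, "r": 6.63, "y": 1.24, "k": 0.58}
total = total_percent_retrieval("aryk", corpus_percent)
```

Each recovered character contributes its corpus percentage once, regardless of how often it appears in the sample text.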

The dictionary is encrypted and transmitted. The coded file is transmitted without encryption. An attacker cannot decode the coded file without the information in the dictionary file. The actual percentage of data retrieved from the coded file (which represents the plain text data file) will therefore be much less than the percentage retrieved from the dictionary file. The percentage of data retrieved from the dictionary file in various languages for fixed corpus sizes has been computed using programs written in Python 2.7 and is displayed in Figs. 4 and 5 below.
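The dictionary and coded file construction described in this section can be sketched as below. This is a hedged illustration, not the authors' Python 2.7 programs: the function names are hypothetical, codes are kept as tuples because the text does not say how one-value and two-value codes are told apart in a flat stream, and ranks beyond the two-value range are left unhandled.

```python
from collections import Counter

FIRST_CODE, LAST_CODE = 33, 250        # code range given in the text
BASE = LAST_CODE - FIRST_CODE + 1      # 218 single-value codes

def code_for_rank(rank):
    """Code for the word at 0-based frequency rank.

    Ranks 0..217 get one value. Later ranks get two values, so the
    219th word is (33, 33), the 220th is (33, 34), and so on.
    """
    if rank < BASE:
        return (FIRST_CODE + rank,)
    hi, lo = divmod(rank - BASE, BASE)
    if hi >= BASE:
        raise ValueError("rank too large for this sketch")
    return (FIRST_CODE + hi, FIRST_CODE + lo)

def build_dictionary(text):
    """Map words, in descending order of frequency, to their codes."""
    counts = Counter(text.split())
    ranked = [word for word, _ in counts.most_common()]
    return {word: code_for_rank(i) for i, word in enumerate(ranked)}

def encode(text, dictionary):
    """Coded file: the sequence of codes for the words of the text."""
    return [dictionary[word] for word in text.split()]

def decode(coded, dictionary):
    """Recover the text; possible only with the dictionary."""
    reverse = {code: word for word, code in dictionary.items()}
    return " ".join(reverse[code] for code in coded)

sample = "round and round and round we go"
dictionary = build_dictionary(sample)
coded = encode(sample, dictionary)
```

Since the dictionary is rebuilt for every text file, the same word can receive a different code in each transmission, which is what makes the coded file meaningless without its dictionary.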

Dance, little baby, dance up high, Never mind baby, mother is by; Crow and caper, caper and crow, There little baby, there you go: Up to the ceiling, down to the ground, Backwards and forwards, round and round.

Then dance, little baby, and mother shall sing, With the merry gay coral, ding, ding, a-ding, ding.

# Findings

a) The percentage of data retrieved using conditional probability following the normal method, as explained in IV(a), differs vastly from that obtained after intelligently converting the same sample text into dictionary form and carrying out the same process as in IV(a) (Figs. 4 and 5), thereby making the proposed system explained in IV(b) strongly secure.

b) Carrying out the same procedure for similar corpus sizes, the percentage retrieval of data in the various Indian languages is far less than in English, indicating that text data transmitted in regional languages is more secure than text data in English.

c) Amongst the various Indian languages, Gujarati, Hindi and Punjabi display a very low percentage retrieval of data, making them more secure for data transmission than the other languages considered.

d) The three Indian languages Gujarati, Hindi and Punjabi, which prove to be the most secure amongst the languages considered in the case study, are stroke based, unlike the languages of the southern part of India, which are curvature based.

# VI. Conclusions

Data transmitted over the internet is more secure when transmitted in any of the Indian languages considered than in English, after converting the data into an intermediate form (the dictionary and the coded file).

With the dictionary, the percentage retrieval compared with the plain text file is far less than without the dictionary file.

When the data retrieved from the dictionary file is mapped to the coded file, the actual data retrieved is likely to be far less than what is projected in Figs. 4 and 5.

Of the languages considered in the case study, Gujarati, Punjabi and Hindi provide better security, and they happen to be stroke based rather than curvature based (as the south Indian languages are).

8. V. K. Govindan, B. S. Shajee Mohan: An Intelligent Text Data Encryption and Compression for High Speed and Secure Data Transmission over Internet.

9. Bao-Chyuan Guan, Ray-I Chang, Yung Chung Wei, Chia-Ling Hu, Yu-Lin Chiu: An Encryption Scheme for Large Chinese Texts. IEEE 37th Annual.
![Figure 1: English probability matching code points](image-2.png "Figure 1")
![An Intelligent Method of Secure Text Data Transmission through Internet and its Comparison using Complexity of Various Indian Languages in Relation to Data Security. English: probability matching code points](image-3.png "C")
![% retrieval by the normal method for the sample text is 84.75% and by the proposed method is 0.0%.](image-4.png "")


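% retrieval, dictionary method (corpus sizes 1000 to 10000)

| Language | 1000 | 5000 | 7500 | 10000 |
|---|---|---|---|---|
| English | 0 | 16.55 | 15.87 | 15.6 |
| Malayalam | 0 | 12.03 | 11.76 | 11.51 |
| Kannada | 11.91 | 11.43 | 11.05 | 11.1 |
| Telugu | 5.6 | 10.78 | 11.0 | 11.45 |
| Tamil | 0 | 6.48 | 6.67 | 6.27 |
| Bengali | 6.07 | 5.86 | 0.07 | 0.05 |
| Gujarati | 0.15 | 0.1 | 0.05 | 0.52 |
| Hindi | 0 | 0.02 | 0.01 | 0 |
| Punjabi | 0.04 | 0.024 | 0.045 | 0.078 |

% retrieval, normal method (corpus sizes 1000 to 10000)

| Language | 1000 | 5000 | 7500 | 10000 |
|---|---|---|---|---|
| English | 92 | 98.18 | 98.53 | 99.18 |
| Malayalam | 86.78 | 94.83 | 99.31 | 99.57 |
| Kannada | 99.49 | 96.68 | 97.74 | 98.14 |
| Telugu | 95.99 | 98.23 | 99.11 | 99.43 |
| Tamil | 94.69 | 99.05 | 99.27 | 99.53 |
| Bengali | 88.75 | 96.19 | 98.09 | 98.89 |
| Gujarati | 88.58 | 97.25 | 98.07 | 98.24 |
| Hindi | 87.58 | 97.19 | 98.02 | 98.64 |
| Punjabi | 90.89 | 97.68 | 98.55 | 98.57 |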
			© 2013 Global Journals Inc. (US)
		
		
1. T. Jakobsen: A Fast Method for Cryptanalysis of Substitution Ciphers. Cryptologia 19(3), 1995.
2. Adam Stone: Internationalizing the Internet. Internet Computing 3, 2003.
3. F. Bauer: Decrypted Secrets: Methods and Maxims of Cryptology. Springer, 2007.
4. A. J. Menezes: Handbook of Applied Cryptography. CRC Press, 2001.
5. K. W. Lee, C. E. Teh, Y. L. Ta: Decrypting English Text Using Enhanced Frequency Analysis. National Seminar on Science, Technology and Social Sciences, 2006.
6. M. S. V. S. Bhadri Raju, B. Vishnu Vardhan, G. A. Naidu, L. Pratap Reddy, Vinaya Babu: Effect of Language Complexity on Deciphering Substitution Ciphers: A Case Study on Telugu.
7. Devasish Pal, Dr. Ejjagiri Vinaya Babu: Complexity of Bengali Language and its Relation to Data Security. IJACIT, Volume 1, Issue 4, 2012.