# I. Introduction

Data compression is important for data transmission and data storage. It reduces the size of data in order to speed up transmission and reduce the space needed for storage. Data compression techniques fall into two general categories: lossy and lossless. Lossless techniques in turn fall into two main categories: statistical compression techniques and dictionary compression techniques [1], [2].

Text compression is a subfield of data compression. It focuses on compressing natural language texts as they occur in the real world, and it mainly uses the features of natural languages to improve the compression ratio and performance. Research papers on natural language text compression have been published during the past three decades. Their main concern was European languages such as English, French and German [3], [4], [5]. Other languages such as Japanese and Chinese have also been subjects of this type of research [6]. Few studies and published research papers have focused on compressing Arabic text.

Each type of compression technique has advantages and disadvantages. Dictionary-based techniques are fast, but they compress less effectively. Statistical techniques, on the other hand, compress more effectively but ignore the specificities of natural language texts. Arabic and other Semitic languages are complex and rich in morphological features: tens or hundreds of words can be derived from the same root. These morphological features can be exploited to improve the compression ratio of Arabic texts [7]. In 2008, the study in [8] showed that using multiple compression techniques is a superior alternative to the classic single-compressor approach. Hybrid approaches that combine several of these techniques to obtain a better compression ratio have therefore been proposed.
Studies on Arabic text compression have been limited even though Arabic is one of the major international languages. This work aims to develop new compression techniques that exploit the morphological and grammatical features of the Arabic language. It presents a hybrid paradigm intended to improve the compression ratio and performance, and to produce a new representation of text that may be more appropriate for other applications such as information retrieval.

# II. Features of Arabic Language

An Arabic word is a series of alphabet letters and diacritical marks. Thirty-six characters are used in Modern Standard Arabic (MSA): 28 basic letters and eight diacritical marks. The diacritical marks, called TASHKEEL, are optional and are generally added above or below Arabic letters. Table 1 shows the different vowelization states of an Arabic word: fully vowelized, partially vowelized and unvowelized.

In Arabic, a word may be derivative or non-derivative. A derivative word is generated from a basic Arabic root according to a predefined pattern or template, called a morphological balance. Figure 1 shows some words derived from the root كتب (k-t-b), which represents the concept 'writing'. The non-derivative words are mainly functional words and nouns borrowed from foreign languages.

Stop words are words that carry little semantic meaning; they are used to express grammatical relationships between the words in a sentence. This class includes pronouns, prepositions, conjunctions and interjections. The number of stop words is limited, but their frequency in natural texts is very high: they represent nearly 40% of the total number of words in a text [9]. Table 2 shows the frequency of these words in a real-world text of one million words taken from a collection of newspaper and magazine articles.
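The vowelization states described at the start of this section can be illustrated with a small sketch. It assumes that the eight diacritical marks are the standard Unicode code points U+064B to U+0652 (fathatan through sukun); the function names are hypothetical and are not taken from the paper.

```python
# Assumption: the eight tashkeel marks are the Unicode code points
# U+064B (fathatan) through U+0652 (sukun).
TASHKEEL = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_tashkeel(word: str) -> str:
    """Return the unvowelized form of a word by dropping diacritical marks."""
    return "".join(ch for ch in word if ch not in TASHKEEL)


def has_tashkeel(word: str) -> bool:
    """True if the word carries at least one diacritical mark."""
    return any(ch in TASHKEEL for ch in word)


# Fully vowelized "kataba": kaf, fatha, ta, fatha, ba, fatha.
vowelized = "\u0643\u064e\u062a\u064e\u0628\u064e"
unvowelized = strip_tashkeel(vowelized)
```

Telling a partially vowelized word from a fully vowelized one needs linguistic rules that this sketch does not attempt; it only separates words with marks from words without them.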
Morphological analysis is one of the most important techniques used in natural language processing. Its objective is to analyze words in order to decompose them into their original morphemes and identify their internal structure. An Arabic word may be decomposed into prefixes, suffixes, and a root or stem. For derivative words, a morphological analyzer may also generate the morphological pattern used to create the word, in addition to the components listed above. Morphological analysis is a key step in many natural language processing applications [10], [11], [12].

# III. Related Work

Three approaches to research on Arabic text compression can be found in the literature. The first approach considers general-purpose compression techniques and does not take into account the features of the Arabic language. Some of these techniques operate at the level of characters [13]. They use character frequencies to replace the most frequent characters with short codes. They are therefore called statistical compression methods, and they are developed from the Huffman compression technique and its variants. Other techniques look at strings in the text and replace strings or substrings that have already appeared with pointers to the earlier occurrence [14]; these are called dictionary-based techniques and are generally developed from the Lempel-Ziv technique (LZ). A third category consists of techniques that use the frequency of a character together with its neighbouring characters to decide how the character will be encoded. Examples of this last category are the Burrows-Wheeler Transform (BWT) and Prediction by Partial Matching (PPM). In 2005, Khafagy [15] presented a study analyzing the results of a variety of data compression techniques applied to both English and Arabic texts.
The best compression ratio was obtained by neural compression, followed by PPM and LZW variations and Huffman-based techniques. RLE gave the worst results.

The second approach to research on Arabic text compression uses the features of the Arabic language to develop new compression techniques. These techniques use either statistical features of the language, such as the most frequent N-grams, or its morphological and linguistic features to achieve a shorter representation of the text [16], [17]. The results of these techniques are in general very limited.

The third approach consists of hybrid techniques that use the features of the Arabic language in addition to general-purpose data compression techniques such as Huffman coding in order to achieve better results. Combining these techniques leads to better results, as shown in [18], [19].

# IV. Burrows-Wheeler Compression

Several studies have shown that compression based on BWT provides good results in comparison with general-purpose compressors [20]; it achieves good compression ratios combined with high speed [21].

# a) Burrows-Wheeler Algorithm

The BWT technique was invented by Michael Burrows and David Wheeler in 1994. It converts the original blocks of data into a format that is extremely well suited for compression, through a sequence of steps [1]. Figure 2 describes these steps. The first step performs the Burrows-Wheeler transform (BWT): it reads blocks of text of a predefined size from the input and processes each block so that the data becomes easier to code with a simple coder. The second step applies the Move-to-Front transformation (MTF), which turns the characters into a list of numbers. MTF does not compress the data; its aim is to turn locally repeated letters into runs of small numbers that the later steps can code compactly. The third step applies run-length encoding (RLE) to the output of the previous step.
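The three steps above can be sketched as follows. This is a minimal illustration and not the implementation used in the experiments: the transform is the naive version that sorts all rotations, the end-of-block sentinel `\x00` is an assumption, and the final entropy coder (arithmetic or adaptive Huffman coding) is left out.

```python
def bwt(block: str, eos: str = "\x00") -> str:
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    assert eos not in block, "the sentinel must not occur in the block"
    s = block + eos
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last: str, eos: str = "\x00") -> str:
    """Rebuild the block by repeatedly prepending the last column and sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    original = next(row for row in table if row.endswith(eos))
    return original[:-1]


def mtf_encode(text: str, alphabet: list[str]) -> list[int]:
    """Move-to-Front: emit each symbol's current index, then move it to the front."""
    symbols = list(alphabet)
    output = []
    for ch in text:
        index = symbols.index(ch)
        output.append(index)
        symbols.insert(0, symbols.pop(index))
    return output


def rle_encode(values: list[int]) -> list[tuple[int, int]]:
    """Run-length encoding as (run length, symbol) pairs."""
    runs: list[tuple[int, int]] = []
    for value in values:
        if runs and runs[-1][1] == value:
            runs[-1] = (runs[-1][0] + 1, value)
        else:
            runs.append((1, value))
    return runs


transformed = bwt("banana")
ranks = mtf_encode(transformed, sorted(set(transformed)))
runs = rle_encode(ranks)
```

A production implementation would normally build the transform from a suffix array rather than sorting full rotations, which is quadratic in memory for a block of this kind.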
RLE is one of the simplest compression techniques. It deals with runs of consecutive repeated symbols [21], each of which is encoded as a pair: the length of the run and the symbol itself. After these steps, a final compression technique is chosen and applied, usually arithmetic coding or adaptive Huffman coding. We use adaptive Huffman coding in our work.

# b) Burrows-Wheeler Algorithm and Arabic Language

The Arabic language is rich in morphology. Several surface forms may be generated from the same root according to a predefined templatic pattern, and the order of letters may change inside the derived words. For example, the Arabic words meaning "read," "reader" and "readable" share one root but differ in their internal letter sequence. This is unlike the English language, in which the origin of the word remains unchanged and derivation is limited to adding affixes at the beginning or the end of the word, for example, "read," "reads," "reader," "the reader" [22]. BWT is very sensitive to the structure of words, so derivative words in their surface form are poorly suited to compression by this technique. We therefore use a morphological analyzer as a pre-processing step before applying BWT to derivative words, following the root-pattern dictionary technique proposed in [23], [19]. The main idea of this technique is to replace each derived word with index values for its root and its standard pattern, as shown in Figure 3. BWT is then applied to these components to compress the text.

# V. Multilayer Model

Awajan [19] provided a multilayer model for the analysis of fully vowelized, non-vowelized and partially vowelized Arabic text. It classifies the words of a text into three categories: derived words, functional words and other words (i.e. non-derivative words and words that the system fails to classify into one of the categories).
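As a rough illustration of the root-pattern replacement and of the three-way classification just described, the sketch below uses tiny transliterated tables. Everything in it is hypothetical: the table contents, the digit notation for root slots in a pattern, and the brute-force lookup are assumptions made for the example, not the analyzer of [19].

```python
# Hypothetical toy tables in Latin transliteration.
STOP_WORDS = ["fii", "min"]                  # "in", "from"
ROOTS = ["ktb", "drs"]                       # writing, studying
PATTERNS = ["1a2a3a", "1aa2i3", "ma12uu3"]   # digits mark root-letter slots


def apply_pattern(root: str, pattern: str) -> str:
    """Fill the numbered slots of a pattern with the letters of a root."""
    return "".join(root[int(c) - 1] if c.isdigit() else c for c in pattern)


def encode_derived(word: str):
    """Return (root index, pattern index) if the word is derivable, else None."""
    for r, root in enumerate(ROOTS):
        for p, pattern in enumerate(PATTERNS):
            if apply_pattern(root, pattern) == word:
                return (r, p)
    return None


def classify(word: str):
    """Assign a word to the functional, derived or other category."""
    if word in STOP_WORDS:
        return ("functional", STOP_WORDS.index(word))
    code = encode_derived(word)
    if code is not None:
        return ("derived", code)
    return ("other", word)
```

With these tables, `classify("kaatib")` yields the derived category with the index pair for root `ktb` and pattern `1aa2i3`, so the word is stored as two small integers instead of its surface form.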
Awajan's approach first searches to determine whether a word is functional. It then uses two techniques to identify derived words: the first applies a pattern-based algorithm, and the second uses dictionaries of patterns and roots. The approach attaches all prefixes and suffixes to the dictionary of patterns to shorten the morphological analysis. Our aim in this work is to integrate more than one technique to compress Arabic texts by taking advantage of the morphological features of the Arabic language. The most important characteristic that distinguishes the multilayer model from other analyzers is that it deals with all categories of texts and all categories of Arabic words, including symbols and punctuation marks.

# VI. Hybrid Compression Technique

The proposed compression technique consists of two phases, as shown in Figure 4. In the first phase, the multilayer model analyzes the text. This model employs several procedures to partition the incoming text into three layers that represent three categories of Arabic words: functional, derivative and non-derivative words. The first layer stores the index of each stop word instead of the original word. The second layer stores the indexes of the roots and patterns instead of the derivative words. The third layer holds the words that the system failed to classify into either of the first two layers. A fourth layer, called the mask, is used during decoding to reconstruct the original text from the decoded output of the other layers. Suitable compression techniques were applied to the different layers in order to maximize compression.

Figure 4: The main steps of the hybrid compression approach

In the second phase, the encoding phase, the BWT technique is applied to each layer. In the mask layer, the number "zero" indicates that the current word is in the first layer.
A "one" in the mask means that the current word is in the second layer; a "two" means that it is in the third layer. To compress this layer, we represent each number as a binary code and then store the data one byte at a time. For both approaches, decompression is the exact reverse of compression: each layer is decoded independently with the appropriate decoder, and the original text is then reconstructed using the mask layer.

# VII. Experiments and Evaluation

The main idea of the multilayer model is to split a text into smaller, linguistically homogeneous layers representing the main categories of words. To evaluate the multilayer model with hybrid compression techniques, several experiments were conducted. The objective was to evaluate its performance and to compare different possible implementations, mainly using BWT and LZW. A set of Arabic texts in different categories (vowelized, partially vowelized, unvowelized) was collected from multiple Internet sources. The texts comprise stories, holy text from the Qur'an and articles from BBC Arabic news. The compression ratio, defined as the ratio of the size of the compressed text to the size of the original text, is used to evaluate the performance of the proposed technique; lower values are therefore better.

Three tables are used. The first stores stop words; it contains 127 of the most frequently occurring stop words, extracted from a corpus of BBC and CNN Arabic news [24]. The other two tables represent the roots and patterns. The roots table includes 4,095 of the most commonly used three-letter roots, from which 376,167 word types are derived [9]. The patterns table consists of the 13,600 most used patterns [25]. The latter table has two entries for each pattern: one represents the list of consonants (LC), and the other represents the list of diacritics (LD), as shown in Table 3.
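The role of the mask layer described in Section VI can be sketched as follows. The paper does not spell out the binary code, so the two-bits-per-entry packing below is an assumption, as are the function names; the layers hold plain transliterated strings here, whereas the paper stores indexes in the first two layers.

```python
def pack_mask(mask: list[int]) -> bytes:
    """Pack mask entries (0, 1 or 2) at two bits each, four entries per byte."""
    packed = bytearray()
    for start in range(0, len(mask), 4):
        byte = 0
        for offset, value in enumerate(mask[start:start + 4]):
            byte |= value << (6 - 2 * offset)
        packed.append(byte)
    return bytes(packed)


def unpack_mask(data: bytes, count: int) -> list[int]:
    """Recover the first `count` mask entries from the packed bytes."""
    values = [(byte >> shift) & 0b11 for byte in data for shift in (6, 4, 2, 0)]
    return values[:count]


def rebuild(mask: list[int], layers: list[list[str]]) -> list[str]:
    """Interleave the decoded layers back into the original word order."""
    cursors = [0] * len(layers)
    words = []
    for layer in mask:
        words.append(layers[layer][cursors[layer]])
        cursors[layer] += 1
    return words


# Hypothetical five-word text: layer 0 = functional, 1 = derived, 2 = other.
mask = [0, 1, 2, 1, 0]
layers = [["fii", "min"], ["kataba", "kaatib"], ["london"]]
packed = pack_mask(mask)
restored = rebuild(unpack_mask(packed, len(mask)), layers)
```

The decoder needs the number of mask entries, because the last byte may be padded with zero bits; in this sketch the count is passed in explicitly.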
Table 4 presents the compression ratio obtained for each of the three layers using the LZW and BWT compression techniques. BWT was the best technique for all layers. The compression ratio for the first layer was 50% with BWT and 83% with LZW. For the second layer it was 54% with BWT and 75% with LZW, and for the third layer it was 41% with BWT and 49% with LZW. Table 5 shows the results for the encoded data and the size of the compressed files using LZW and BWT. On average, the compression ratios are best when BWT is used with the multilayer model (0.25), although for unvowelized text the multilayer model with LZW does slightly better (0.23 versus 0.26). Overall, the proposed hybrid technique achieved better results than a single compressor applied directly to the text: the average ratio fell from 0.33 (BWT alone) and 0.31 (LZW alone) to 0.25.

# VIII. Conclusion

A hybrid technique for compressing Arabic texts has been developed. It integrates the multilayer model of Arabic texts with BWT, exploiting the morphological features of the Arabic language to improve the performance of BWT. This approach gives a better compression ratio than BWT or LZW applied alone (Table 5).

![Figure 1 : Some words derived from the same root كتب (k-t-b)](image-2.png "Figure 1 :")

![Figure 2 : Steps of the Burrows-Wheeler Compression Algorithm](image-3.png "Figure 2 :")

![Figure 3 : The morphological analyzers](image-4.png "Figure 3 :")

Table 1: Vowelization states of an Arabic word

| Vowelization State | Examples |
|---|---|
| Fully vowelized words | [Arabic examples lost] |
| Partially vowelized words | [Arabic examples lost] |
| Unvowelized words | [Arabic examples lost] |

Table 2: Frequency of stop words

| Partially vowelized stop word | Frequency | Unvowelized stop word | Frequency |
|---|---|---|---|
| [Arabic word lost] | 292,396 | [Arabic word lost] | 322,239 |
| [Arabic word lost] | 269,200 | [Arabic word lost] | 301,895 |
| [Arabic word lost] | 120,060 | [Arabic word lost] | 132,635 |
| [Arabic word lost] | 108,252 | [Arabic word lost] | 130,809 |
| [Arabic word lost] | 89,027 | [Arabic word lost] | 119,639 |
| [Arabic word lost] | 83,027 | [Arabic word lost] | 115,842 |

Table 3: Patterns with their lists of consonants and diacritical marks

| Pattern | List of Consonants (LC) | List of Diacritical Marks (LD) |
|---|---|---|
| [Arabic examples lost] | [Arabic examples lost] | [Arabic examples lost] |
Table 4: Compression ratio for each layer

| Algorithm | First Layer | Second Layer | Third Layer |
|---|---|---|---|
| LZW | 0.83 | 0.75 | 0.49 |
| BWT | 0.50 | 0.54 | 0.41 |

Table 5: Compression ratio by text category

| Text Category | BWT | LZW | Multilayer with LZW | Multilayer with BWT |
|---|---|---|---|---|
| Vowelized | 0.31 | 0.30 | 0.24 | 0.23 |
| Unvowelized | 0.35 | 0.32 | 0.23 | 0.26 |
| Partially Vowelized | 0.33 | 0.32 | 0.30 | 0.25 |
| Average | 0.33 | 0.31 | 0.26 | 0.25 |

Global Journal of Computer Science and Technology (C), Volume XV, Issue I, Version I, Year 2015. © 2015 Global Journals Inc. (US)

# References

[1] G. E. Blelloch, "Introduction to Data Compression," Computer Science Department, Carnegie Mellon University, 2010. Available 2013.

[2] R. Lourdusamy, S. Shanmugasundaram, "A Comparative Study Of Text Compression Algorithms," International Journal of Wisdom Based Computing, vol. 1, no. 3, 2011.

[3] D. Moronfolu Oluwade, "An enhanced LZW text compression algorithm," Afr. J. Comp. & ICT, vol. 2, no. 2, 2009.

[4] H. Altarawneh, M. Altarawneh, "Data Compression Techniques on Text Files: A Comparison Study," International Journal of Computer Applications, vol. 26, no. 5, 2011.

[5] R. Hasan, "Data Compression using Huffman based LZW Encoding Technique," International Journal of Scientific & Engineering Research, vol. 2, no. 1, 2011.

[6] J. Teahan, R. Mcnab, H. Witten, "A Compression-based Algorithm for Chinese Word Segmentation," Computer Journal of Computational Linguistics, vol. 26, no. 3, 2000.

[7] Soudi, V. Bosch, G. Neuman, "Arabic Computational Morphology," Springer, New York, 2007.

[8] V. ?tujbe, "Practical data compression," Master's thesis, Comenius University, Bratislava, 2008.

[9] M. S. Sawalha, "Open-source Resources and Standards for Arabic Word Structure Analysis: Fine Grained Morphological Analysis of Arabic Text Corpora," The University of Leeds, 2011.

[10] A. Al-Sughaiyer, I. A. Al-Kharashi, "Arabic Morphological Analysis Techniques: A Comprehensive Survey," Journal of the American Society for Information Science and Technology, vol. 55, no. 3, 2004.

[11] D. Jurafsky, J. H. Martin, "Speech and Language Processing," 2nd ed., Prentice Hall, 2008. 2013.

[12] G. D. Pauw, G.-M. D. Schryver, "Improving the Computational Morphological Analysis of a Swahili Corpus for Lexicographic Purposes," The 13th International Conference of the African Association for Lexicography, Republic of South Africa, July 2008.

[13] S. Ghwanmeh, R. Al-Shalabi, G. Kanaan, "Efficient data compression scheme using dynamic Huffman code applied on Arabic language," Journal of Computer Science, vol. 2, 2006.

[14] Z. M. Alasmer, B. M. Zahran, B. A. Ayyoub, M. A. Kanan, "A Comparison between English and Arabic Text Compression," Journal of Contemporary Engineering Sciences, vol. 6, no. 3, 2013.

[15] M. A. M. Khafagy, "Arabic Text Data Compression," PhD thesis, Zagazig University, 2005.

[16] E. Omer, K. Khatatneh, "Arabic Short Text Compression," Journal of Computer Science, vol. 6, no. 1, 2010.

[17] H. Akman, S. Bayindir, Z. Ozleme Akin, Sanjay Misra, "Lossless Text Compression Technique Using Syllable Based Morphology," The International Arab Journal of Information Technology, vol. 8, no. 1, 2011.

[18] M. Daoud, "Morphological Analysis and Diacritical Arabic Text Compression," The International Journal of ACM Jordan (ISSN 2078-7952), vol. 1, no. 1, 2011.

[19] Awajan, "Multilayer Model for Arabic Text Compression," The International Arab Journal of Information Technology, vol. 8, no. 2, 2011.

[20] R. Radescu, "Transform methods used in lossless compression of text files," Romanian Journal of Information Science and Technology, vol. 12, no. 1, 2009.

[21] Abel, "Improvements to the Burrows-Wheeler Compression Algorithm: After BWT Stages," 2003. Available (March 2013): www.juergenabel.info/Preprints/Preprint_After_BWT_Stages

[22] Y. Wiseman, I. Gefner, "Conjugation-based Compression for Hebrew Texts," Computer Journal of ACM Transactions on Asian Language Information Processing, vol. 6, no. 1, 2007.

[23] Awajan, "Arabic Text Preprocessing for the Natural Language Processing Applications," Arab Gulf Journal of Scientific Research, vol. 25, no. 4, 2007.

[24] M. Saad, "Arabic-Corpora," 2011. 2013.

[25] "Arabic Language Derivation and Morphological System," published by the Arab League Educational, Cultural and Scientific Organization, 2013.