# Introduction

Research on unconstrained scene text recognition has been gaining momentum for the past few years. Separating text from a scene has always been a cumbersome task because of the presence of other objects in the image, yet text provides information and guidance in an unfamiliar environment. It is therefore essential to investigate the nature of the text that appears in a scene image so that it can convey meaning to the reader. Unconstrained scripts such as Arabic pose a major challenge, because the complexities of the language itself must be handled in the presence of other image-degrading properties. The usual way to tackle Arabic scene text classification is to break part of an image into smaller units and investigate each one individually. Each Arabic character has four variations depending on its position in a word, i.e., a character can appear in isolation or at the first, middle, or last position of a word. To overcome these implicit challenges, numerous techniques have been proposed recently [1,5,7,8], presenting various feature extraction or classification approaches. Handling the different representations of the same character is an extremely difficult task, and manual segmentation also proves to be laborious. This complexity of unconstrained Arabic script prompts researchers to suggest implicit segmentation approaches, and we look for solutions that have produced good results on cursive scripts. The other important aspect of cursive scripts is context. In Arabic, the appearance of every character depends on the previous character, so learning the context of the current character is crucial.
Recent research has suggested solutions that tackle the variability of characters with context learning approaches, as proposed in [9,10,11]. The most prominent context learning approach used for unconstrained cursive text is the Long Short Term Memory (LSTM) network [4]. Keeping in view the complexity associated with cursive script, it is assumed that a scene image can be disintegrated into smaller parts whose feature values are considered individually and then assembled into one unit before the language model is applied. For a cursive script like Arabic, more detailed features of the given patterns are required so that the pattern can be scrutinized and learned. Therefore, there is a need for a classification model that learns patterns not only from right to left or left to right but also from top to bottom and bottom to top. To address this problem, this paper proposes an adapted Multidimensional Long Short Term Memory (MDLSTM) network [12]. Implicit segmentation approaches are more accurate and less error prone than explicitly defined ones. The parts of a given image are processed by a convolutional neural network (ConvNet) using the implicit segmentation approach. ConvNets are by nature instance learners, but the context of a given sample must also be learned, and here the history of learned patterns plays a role. Therefore, this paper proposes a deep learning MDLSTM network because of its strong ability to learn sequences based on context. Connectionist Temporal Classification (CTC) is used as a probabilistic model to map the learned sequences to the corresponding ground truth [13]. By using CTC, explicit segmentation and language modeling are avoided. The performance of the proposed MDLSTM network architecture is evaluated on Arabic scene text images from the EASTR-42k dataset, which covers various aspects of scene text images.
The dataset contains 14,000 segmented Arabic scene text images. Arabic-like languages share the same writing style, i.e., from right to left. Characters in Arabic-like scripts are categorized into two forms, i.e., joiner and non-joiner. A joiner character may join its predecessor or successor in a word, which means it can appear at the first, middle, or final position of a word, whereas a non-joiner character may appear in isolation or as the last character of a ligature. As mentioned earlier, a character can appear at any of four locations, i.e., an initial, middle, final, or isolated position. For Arabic scene text, dealing with the complexity of joined and non-joined characters is relatively difficult. In camera-captured text images, numerous other factors such as illumination, text angle, font size, appearance, and clarity of the text must also be handled to extract the text with high precision, and they pose a challenge for researchers recognizing Arabic scene text.

# II. Related Work

Various feature extraction and classification approaches have been proposed for the detection and recognition of Latin and cursive scripts like Chinese in natural scene images [6,14,15]. Text in natural scene images represents not only pattern information but also semantic information that carries meaning in real applications. This paper presents unconstrained character recognition in natural images with a focus on Arabic text. A review of recent research gives the impression that not enough work has been presented on Arabic text recognition in natural scene images, although some substantial work has been reported, which we summarize in this section. One recent work on Arabic scene text is presented in [16].
They proposed a Convolutional Recurrent Neural Network (CRNN) approach and evaluated its performance on their own gathered dataset and two publicly available video text datasets, i.e., ALIF [17] and ACTIV [18]. They gathered 500 Arabic word images from natural images and categorized their experiments into character, word, and line recognition. They reported very good accuracy on the screen-rendered video text datasets: 98.17% on character recognition, 79.67% on word recognition, and 67.08% on line recognition, while on their own gathered dataset they achieved 69.55% and 39.25% accuracy on character and word recognition, respectively.

Another paper, on deep learning based isolated Arabic scene character recognition, is presented in [7]. They proposed a deep convolutional neural network (ConvNet) architecture for the recognition of Arabic characters appearing in natural scene images. Feature extraction and classification were both performed by the ConvNet. As no benchmark dataset was available for Arabic text in natural images, they prepared their own dataset, which covers approximately every variation of each character. The experimental settings were empirically adjusted with 3 × 3 and 5 × 5 filter sizes, learning rates of 0.5 and 0.005, and stride values of 1 and 2. They identified 27 classes and saved each character image in five orientations at different angles. In this way they used 2450 images as the training set, while 250 character images were used to evaluate the performance of their algorithm. They reported a 0.15% error rate with their architecture.

An Arabic scene text dataset is proposed in [3]. They collected free Arabic text appearing in an unconstrained environment, capturing 364 images containing Arabic text. The images were segmented into 1280 cropped words, and the acquired Arabic text was also segmented into 374 characters. The major drawback of their dataset is the lack of applicability details.
Recent work on recognizing and connecting moving Arabic text in video is presented in [19]. They developed a dedicated OCR to recognize low-resolution news captions in video images and prepared a dataset from Aljazeera news programs. To connect text, they used a connection method based on insertion, voting, and substitution operations using the minimum likelihood edit distance between two successive news frames. Their method targets automatic language translation and also helps reduce OCR errors caused by truncated characters. Their dataset of 453 video frames was split into a training set and a test set. They reported 96.78% accuracy in terms of F-measure using bi-gram sequences.

# III. Learning Architecture

# a) Feature Extraction by ConvNets

Natural images represent all image details at the same level, meaning that the text in a natural image appears at the same energy level as the other objects in the image. As we are dealing specifically with text, we looked for a technique that focuses only on the text in a natural image. An input image of arbitrary size is normalized to a fixed size of 150 × 150 pixels, taking the aspect ratio into account. The image is then converted to gray scale. An 8 × 8 window is used to detect features of the image and build a feature map. This helps in considering each part of the image and focusing on its most relevant features. At each point where the feature detector stops, it takes the mean of the pixels involved and writes it to the feature map, starting at position (1, 1). For the next move, the feature detector moves one pixel to the right and performs the same process again until the end of the first row. After the first row, the feature detector window moves one step down to the second row and starts the same process.
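The scanning procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `mean_feature_map`, the use of NumPy, and the random test image are assumptions; only the 150 × 150 input size, the 8 × 8 window, and the one-pixel step come from the text.

```python
import numpy as np


def mean_feature_map(image, window=8):
    """Slide a window x window detector over a 2D gray-scale image with a
    step of one pixel and store the mean of the covered pixels."""
    rows, cols = image.shape
    out_rows = rows - window + 1
    out_cols = cols - window + 1
    fmap = np.empty((out_rows, out_cols), dtype=np.float64)
    for i in range(out_rows):          # move one step down per row
        for j in range(out_cols):      # move one pixel right per column
            fmap[i, j] = image[i:i + window, j:j + window].mean()
    return fmap


# Hypothetical input: a random 150 x 150 gray-scale image.
image = np.random.default_rng(0).random((150, 150))
fmap = mean_feature_map(image)
print(fmap.shape)  # (143, 143)
```

With a 150 × 150 input and an 8 × 8 window, the result has 150 − 8 + 1 = 143 entries along each axis.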
In this way the whole image is filtered through the feature detector window and the feature map is updated. The feature map contains a large number of features for a single image. Assume a small patch of size $x \times y$; the array of convolved pixels is then represented as

$$f_{convolve} = \mu\,(r - x + 1) \times (c - y + 1) \qquad (1)$$

The features $f$ are obtained by taking the mean $\mu$ of the contributing pixels $r, c$ that appear in the feature detector window. For a 150 × 150 image and an 8 × 8 window this gives (150 − 8 + 1) × (150 − 8 + 1) = 143 × 143 window positions. The feature values are written to the feature map at positions (1,1), (1,2), (1,3), ..., (143,143). Figure 1 further explains the idea that the feature map holds 143 feature points along each axis, computed by applying the mean pooling strategy. These extracted features are now ready to be passed to the classifier.

# b) Multidimensional LSTM Classifier for Arabic Scene Text

LSTM has been applied effectively to a number of problems where the data is correlated and the sequence is important to learn. The correlation of data may be represented along a single axis or along multiple axes. LSTM is a technique under the RNN approach where, unlike a plain RNN, the data can be modeled as a multidimensional vector in addition to a single axis. Arabic script recognition is a classic example of a sequence learning task where context is important. The representation of each character depends on the previous character, and so on. Unlike Latin, Arabic script is written in a joining style, which complicates the recognition process. ConvNets can be a good choice for learning the different segments of handwritten samples, but they require a lot of manual preprocessing. Moreover, they cannot produce good results when the problem is large and context learning is important.

The idea of multidimensional LSTM is to replace the single memory block of LSTM with a number of memory blocks according to the number of dimensions. The input is delegated to hidden layers, where the input data is processed by LSTM memory blocks in each dimension. In MDLSTM, the single self-connection of the LSTM cell is replaced by n self-connections, one per dimension, with n forget gates. The cell activation values are forwarded to the gates by peephole connections. The input gate in a memory block is connected to all previous cells in all dimensions, which helps in learning the sequential pattern. The forget gate is connected to cell c of all dimensions with different weights. This helps in determining how much of the previous computation in each dimension contributes to the current cell's computation. This type of setup is very important for Arabic script recognition, where each character has four variations according to its position in a word and character segmentation is extremely difficult. MDLSTM is considered an ideal architecture for learning sequential problems efficiently and effectively. Most of the recently reported work on Urdu and Arabic script recognition, as explained in [2], proposed MDLSTM for learning complex patterns and reported state-of-the-art results. Details of the MDLSTM network architecture can be found in Graves et al. [12].

# IV. Hierarchical Subsampling based Cursive Document and Scene Text Recognition

An adapted hierarchical MDLSTM architecture based on sub-sampling of hidden layers is proposed for Arabic scene text. Hierarchical subsampling is usually applied where the data volume is large and complex. The hierarchical subsampling based LSTM architecture includes an input layer, an output layer, and multiple self-connected hidden layers. The output of each level in the hierarchy is the input to the next level up, and so on. The input sequences are subsampled by a predetermined window width. The hierarchical subsampling of RNN based networks follows the same structure as defined for ConvNets. The potential of the subsampling approach was scrutinized by investigating performance with a 3-layer architecture incorporating hidden memory block sizes of 20, 40, 60, 80, 100, and 120.
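One way to picture subsampling by a predetermined window is to tile the 2D input into non-overlapping blocks and hand each block to the next level as a single vector. The sketch below assumes this tiling scheme, zero padding at the borders, and a hypothetical helper name `subsample_blocks`; the window sizes 6 × 6 and 2 × 9 are the ones used in the experiments, and reading them as height × width is also an assumption.

```python
import numpy as np


def subsample_blocks(x, win_h, win_w):
    """Tile a 2D array into non-overlapping win_h x win_w blocks and return
    an array of shape (H / win_h, W / win_w, win_h * win_w), where each
    position holds one flattened block. Borders are zero padded (assumption)."""
    h, w = x.shape
    pad_h = (-h) % win_h
    pad_w = (-w) % win_w
    x = np.pad(x, ((0, pad_h), (0, pad_w)))  # default mode pads with zeros
    big_h, big_w = x.shape
    blocks = x.reshape(big_h // win_h, win_h, big_w // win_w, win_w)
    blocks = blocks.transpose(0, 2, 1, 3)    # (block row, block col, a, b)
    return blocks.reshape(big_h // win_h, big_w // win_w, win_h * win_w)


# Hypothetical 143 x 143 input, e.g. a feature map.
features = np.random.default_rng(0).random((143, 143))
print(subsample_blocks(features, 6, 6).shape)  # (24, 24, 36)
print(subsample_blocks(features, 2, 9).shape)  # (72, 16, 18)
```

Each level of the hierarchy then sees a shorter sequence of richer vectors, which is what keeps the deeper recurrent layers tractable on large inputs.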
Table 1: Selected parameters during training the network

The network learning is based on empirically selected parameters. The prime objective is to find parameters that yield a comparatively low error rate. The parameter details, along with error rates and overall training time, are provided in Table 1. The Arabic word extracted from the scene text is first pre-processed to a standard size of 70 while keeping the aspect ratio. The feature map is prepared by convolving a filter window over the given image. The convolution process is similar to the one presented in section ?? for handwritten Urdu text, as depicted in Figure 3. Here, the gray scale values of the convolved pixels are passed to the classifier following a specific input size, as sketched in Figure 4. In each feature map, every neuron is mapped to a small 5 × 5 region of the input image. The connection from the input image to the hidden layer is established through a local receptive field, whose extent is called the filter size. Each neuron in a layer shares the same bias value. As a single feature map does not cover the intensive features, the process is repeated to obtain a variety of features for each given image. A feature map is defined by its shared weights and a bias value; mathematically, this relation is represented in equation 2,

$$\sigma\left(d + \sum_{e=0}^{4}\sum_{f=0}^{4} W_{e,f}\,A_{j+e,k+f}\right) \qquad (2)$$

where $\sigma$ is the sigmoid neural activation function and $d$ is the shared bias value. $W_{e,f}$ represents the filter or kernel weight, which depends on the filter size (5 × 5 here), whereas $A_{j+e,k+f}$ represents the input activation at point $(x, y) = (j+e, k+f)$. The features extracted by the ConvNet are converted into raw pixels and given to the MDLSTM architecture with the corresponding ground truth, as presented in Figure 4. The complex nature of Arabic script prompts us to propose a hierarchical subsampling MDLSTM architecture for learning. The proposed experiments are based on the subsampling architecture and are divided into two main categories.
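The shared-weight computation behind a single feature map (equation 2) can be illustrated as follows. The sketch assumes the standard form in which each output neuron applies a sigmoid to the shared bias d plus the weighted sum of a 5 × 5 input region; the function names, the random weights, and the 12 × 12 input are hypothetical and only serve the example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def shared_weight_feature_map(A, W, d):
    """Compute one feature map: every output neuron (j, k) uses the same
    kernel W and the same bias d on its own local region of the input A."""
    kh, kw = W.shape
    out = np.empty((A.shape[0] - kh + 1, A.shape[1] - kw + 1))
    for j in range(out.shape[0]):
        for k in range(out.shape[1]):
            region = A[j:j + kh, k:k + kw]        # local receptive field
            out[j, k] = sigmoid(d + np.sum(W * region))
    return out


rng = np.random.default_rng(0)
A = rng.random((12, 12))           # hypothetical gray-scale input patch
W = rng.normal(size=(5, 5))        # one shared 5 x 5 kernel
fmap = shared_weight_feature_map(A, W, d=0.1)
print(fmap.shape)  # (8, 8)
```

Using several kernels, each with its own bias, yields the variety of feature maps per image that the text describes.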
As a first evaluation, the experiments were performed with 3-layer and 5-layer architectures. Each layer incorporates 20, 40, 60, 80, 100, or 120 hidden LSTM memory blocks. The three-layer architecture is defined by the number of hidden memory units in each of its three layers. The input is subsampled with window sizes of 6 × 6 and 2 × 9. The deep learning architecture is designed by arranging the data in a layer-wise manner. The same process is applied to the five-layer architecture.

# Parameters

The second variation of experiments was performed with the same parameters as used in [12,20]. [12] proposed their solution for handwritten Arabic character recognition, while [20] presented the same idea for printed Urdu character recognition using similar parameters. The same parameters and network structure are deliberately used to compare the performance of handwritten, printed, and scene text Arabic script recognition, as shown in Figure 4. All activation functions in the sub-sampling layers are feed-forward tanh layers, whereas the hidden layers are fully connected in all dimensions. The MDLSTM network collapses all processing into a one-dimensional CTC layer with 40 classes, including a blank label, which predicts the output symbol.

Table 2: Selected parameters during training the network

The performance was evaluated on various settings of the proposed architecture, as summarized in Table 2. The performance comparison of this approach on handwritten, synthetic, and scene text is detailed in Table 3. Offline and online handwritten Arabic was experimented on by [12], who presented their work in the ICDAR 2009 handwriting competition. As presented in Table 3, they proposed a hierarchical architecture.
Later, [20] used the same architecture with slight changes to parameters such as the hidden memory blocks. Moreover, they performed their experiments with MDLSTM networks. Details of their implementation can be found in their manuscript [20]. The presented approach for scene text using hierarchical sub-sampling achieved benchmark accuracy for Arabic scene text recognition.

# a) Experimental Analysis

The experiments were conducted in multiple configurations with various settings. The experimental settings were outlined on the basis of architectural manipulation and parametric details. The details of the conducted experiments are as follows.

1. The number of hidden layers was varied to investigate the performance of the learning architecture.
2. The number of memory blocks at each layer was varied using the subsampled input.
3. The performance was explored with empirically selected learning rates.

As discussed earlier, the proposed network delegates the processing of the MDLSTM network's learning to hidden layer units. The proposed method was evaluated on architectures with 3 and 5 hidden layers. At first, with the three-layer architecture, each layer has 20 LSTM memory blocks. Then, keeping the same hidden layers, the number of LSTM memory blocks per layer is increased, and so on. Ultimately, with 3 and 5 hidden layers, the network was evaluated with 20, 60, 100, and 120 LSTM memory blocks each. Consequently, there are 8 experimental settings for each proposed architecture based on the number of hidden layers, as detailed in Table 4. The input and output units use tanh activation, whereas the gates use their own activation function. The CTC layer has 38 output nodes: 37 for the input characters and one extra blank node. The character set includes Arabic characters and numerals. All hidden layers in the proposed architecture are fully connected to each other. The architecture with 3 hidden layers was proposed initially, where each layer was first subsampled to 20 LSTM memory blocks.
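To illustrate how a CTC output layer with a blank label yields a character sequence without explicit segmentation, the following sketch applies best-path (greedy) decoding: take the most probable label per time step, merge consecutive repeats, then drop blanks. This is a common simplification and an assumption here, since the text does not state which decoding the authors used; the blank index 0, the 38-way output, and the function name are likewise hypothetical.

```python
import numpy as np


def ctc_greedy_decode(probs, blank=0):
    """Best-path CTC decoding.

    probs: array of shape (time_steps, num_labels) with per-step label scores.
    Returns the label sequence after merging repeats and removing blanks."""
    best_path = probs.argmax(axis=1)
    decoded = []
    previous = None
    for label in best_path:
        if label != previous and label != blank:
            decoded.append(int(label))
        previous = label
    return decoded


# Hypothetical network output over 8 time steps and 38 labels (index 0 = blank).
path = [0, 3, 3, 0, 3, 5, 5, 0]
probs = np.eye(38)[path]            # one-hot rows stand in for softmax outputs
print(ctc_greedy_decode(probs))     # [3, 3, 5]
```

The blank between the two runs of label 3 is what allows a genuinely repeated character to survive the merge step.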
The performance was later evaluated with 40, 60, 80, and 100 memory blocks. The units defined in the subsampled layers were also fully connected. The computations in the hidden units were propagated backward to the main hidden layers, and the calculation of each subsample layer was incorporated in the gradient descent of the next hidden layer, with a learning rate of $1 \times 10^{-4}$ and then $1 \times 10^{-3}$ and a momentum of 0.9, selected after observing the trend in another cursive text analysis using MDLSTM. Training in each experiment was stopped after no significant improvement in performance was observed for 30 epochs. The tables report the number of epochs consumed for each experiment with 3 and 5 hidden layers. The learning rate and the number of hidden sub-sampled layers on convolutional features affect the learning performance of the network. The output is presented in Figure 5. The recorded accuracy is 95.8%, calculated with the Levenshtein distance measure at character level, as indicated in Table 6.

# V. Conclusion and Discussion

The nature of Arabic script is extremely complex and cursive. To understand an Arabic word, the characters involved in forming the word must be investigated. The representation of characters is a considerable issue, because every character has four possible forms in a word. This positional constraint makes it difficult for any explicit segmentation technique to determine the characters correctly. Therefore, there is always a need for implicit segmentation techniques that counter such complications associated with Arabic script. As Arabic is a context-based script, context learning classifiers are suitable for learning it. The presented architecture for scene text analysis produced good results. The obtained results show that if precise and relevant features are provided to the learning network, it can produce realistic results even on intricate scripts.
The experimental evaluation has also been explained in detail, showing the learning trend and the recognition accuracy at word and character level for Arabic scene text.

![Figure 1: Convolutional feature extraction](image-2.png "Figure 1 :")

![Figure 2: The flowchart of the proposed idea](image-3.png "Figure 2 :")

![Approach for Unconstrained Arabic Scene Text Analysis by Implicit Segmentation based Deep Learning Classifier](image-4.png "")

![Figure 3: Arabic scene text feature extraction by convolutional pixelate method](image-5.png "Figure 3 :")

![W_{e,f} A_{j+e,k+f}](image-6.png "W")

Selected parameters during training the network:

| Parameters | Values | Training/Validation Error | No. of Epochs | Time/Epoch (minutes) |
|---|---|---|---|---|
| Subsample window | 6 × 6 | 0.86 / 0.83 | 317 | 40 |
| | 2 × 9 | 0.94 / 0.92 | 299 | 34 |
| Hidden memory units | 20, 60, 100, 120 | Best (0.97 / 0.95) | 461 | 29 |
| | | Worst (17.28 / 15.74) | 248 | 53 |
| Learning rate | $1 \times 10^{-4}$ | 0.80 / 0.82 | 319 | 48 |
| | $1 \times 10^{-5}$ | 0.96 / 0.98 | 406 | 51 |
| Momentum | 0.9 | - | - | - |
| Total network weights | 475723 | - | - | - |

Word and character recognition error per experiment (first results table):

| Subsample size | Experiments | Hidden units/layer | Learning rate | Word Recognition Error (%) | Character Recognition Error (%) |
|---|---|---|---|---|---|
| 6 × 6 | Exp-1 | 20 | $1 \times 10^{-4}$ | 0.49 | 0.40 |
| | Exp-2 | 60 | $1 \times 10^{-4}$ | 0.24 | 0.19 |
| | Exp-3 | 100 | $1 \times 10^{-4}$ | 0.17 | 0.13 |
| | Exp-4 | 120 | $1 \times 10^{-4}$ | 0.20 | 0.17 |
| 2 × 9 | Exp-1 | 20 | $1 \times 10^{-5}$ | 0.55 | 0.51 |
| | Exp-2 | 60 | $1 \times 10^{-5}$ | 0.33 | 0.23 |
| | Exp-3 | 100 | $1 \times 10^{-5}$ | 0.09 | 0.06 |
| | Exp-4 | 120 | $1 \times 10^{-5}$ | 0.24 | 0.16 |

Word and character recognition error per experiment (second results table):

| Subsample size | Experiments | Hidden units/layer | Learning rate | Word Recognition Error (%) | Character Recognition Error (%) |
|---|---|---|---|---|---|
| 6 × 6 | Exp-1 | 20 | $1 \times 10^{-4}$ | 0.62 | 0.54 |
| | Exp-2 | 60 | $1 \times 10^{-4}$ | 0.53 | 0.42 |
| | Exp-3 | 100 | $1 \times 10^{-4}$ | 0.11 | 0.10 |
| | Exp-4 | 120 | $1 \times 10^{-4}$ | 0.43 | 0.34 |
| 2 × 9 | Exp-1 | 20 | $1 \times 10^{-5}$ | 0.59 | 0.48 |
| | Exp-2 | 40 | $1 \times 10^{-5}$ | 0.31 | 0.24 |
| | Exp-3 | 100 | $1 \times 10^{-5}$ | 0.19 | 0.12 |
| | Exp-4 | 120 | $1 \times 10^{-5}$ | 0.22 | 0.14 |

Table 6: Error types on the test set

| Error type | Test set Error |
|---|---|
| Deletions | 43.75 |
| Substitutions | 41.91 |
| Insertions | 30.24 |

© 2019 Global Journals

* Saeeda Naz, Arif Iqbal Umar, Riaz Ahmad, Imran Siddiqi, Saad Bin Ahmed, Muhammad Imran Razzak, Faisal Shafait. Urdu Nastaliq recognition using convolutional-recursive deep learning. Neurocomputing, 2017.
* Saeeda Naz, Arif Iqbal Umar, Riaz Ahmad, Saad Bin Ahmed, Syed Hamad Shirazi, Imran Siddiqi, Muhammad Imran Razzak. Offline cursive Urdu-Nastaliq script recognition using multidimensional recurrent neural networks. Neurocomputing, 2016.
* Maroua Tounsi, Ikram Moalla, Adel M. Alimi, Frank Lebourgeois. Arabic characters recognition in natural scenes using sparse coding for feature representations. ICDAR.
* A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber. A Novel Connectionist System for Unconstrained Handwriting Recognition. IEEE Trans. Pattern Analysis and Machine Intelligence, 31, 2009.
* Saad Bin Ahmed, Saeeda Naz, Salahuddin Swati, Muhammad Imran Razzak. Handwritten Urdu Character Recognition using 1-Dimensional BLSTM Classifier. Neural Computing and Applications, 2017.
* Ryosuke Odate, Hideaki Goto. Highly-accurate fast candidate reduction method for Japanese/Chinese character recognition. ICIP, 2016.
* Saad Bin Ahmed, Saeeda Naz, Muhammad Imran Razzak, Rubiyah Yousaf. Deep Learning based Isolated Arabic Scene Character Recognition. 1st Workshop on Arabic Script Analysis and Recognition, 2017.
* Saad Bin Ahmed, Saeeda Naz, Muhammad Imran Razzak, Rubiyah Yusof. Evaluation of Handwritten Urdu Text by Integration of MNIST Dataset Learning Experience. Neuro Processing Letters (NEPL).
* Saad Bin Ahmed, Saeeda Naz, Muhammad Imran Razzak, Rubiyah Yusof, Thomas M. Breuel. Balinese Character Recognition Using Bidirectional LSTM Classifier. Advances in Machine Learning and Signal Processing, Springer International Publishing, 2016.
* Saeeda Naz, Saad Bin Ahmed, Riaz Ahmad, Muhammad Imran Razzak. Zoning Features and 2DLSTM for Urdu Text-line Recognition. Procedia Computer Science, 96, 2016.
* Saad Bin Ahmed, Saeeda Naz, Shiekh Faisal, Muhammad Imran Razzak, Muhammad Rashid, Thomas M. Breuel. Evaluation of cursive and non-cursive scripts using recurrent neural networks. Neural Computing and Applications, 27, 2016.
* Alex Graves, Santiago Fernández, Faustino J. Gomez, Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. Machine Learning, Proceedings of the Twenty-Third International Conference (ICML 2006), Pittsburgh, Pennsylvania, USA, June 25-29, 2006. ISBN 1-59593-383-2.
* Alex Graves. Springer Book, 385, 2012. DOI 10.1007/978-3-642-24797-2.
* Jonathan Fabrizio, Beatriz Marcotegui, Matthieu Cord. IEEE ICIP, 2009. ISBN 978-1-4244-5654-3.
* Z. Y. Zhang, L. W. Jin, K. Ding, X. Gao. Character-SIFT: A Novel Feature for Offline