Index terms: adaptive dictionary, dictionary learning, sparse decomposition, sparse dictionary, speech analysis, speech denoising.

# Introduction

Sparse signal representations allow the salient information within a signal to be conveyed with only a few elementary components, called atoms. For this reason, they have acquired great popularity over the years, and they have been successfully applied to a variety of problems, including the study of the human sensory system [1]-[3], blind source separation [4]-[6], and signal denoising [7]. Successful application of a sparse decomposition depends on the dictionary used, and whether it matches the signal features [8].

Two main methods have emerged to determine a dictionary within a sparse decomposition: dictionary selection and dictionary learning. Dictionary selection entails choosing a pre-existing dictionary, such as the Fourier basis, wavelet basis or modified discrete cosine basis, or constructing a redundant or overcomplete dictionary by forming a union of bases (for example the Fourier and wavelet bases) so that different properties of the signal can be represented [9]. Dictionary learning, on the other hand, aims at deducing the dictionary from the training data, so that the atoms directly capture the specific features of the signal or set of signals [7].

Dictionary learning methods are often based on an alternating optimization strategy, in which the dictionary is fixed and a sparse signal decomposition is found; then the dictionary elements are learned, while the signal representation is fixed. Early dictionary learning methods by Olshausen and Field [2] and Lewicki and Sejnowski [10] were based on a probabilistic model of the observed data. Lewicki and Sejnowski [10] clarify the relation between sparse coding methods and independent component analysis (ICA), while the connection between dictionary learning in sparse coding and the vector quantization problem was pointed out by Kreutz-Delgado et al. [11]. The authors also proposed finding sparse representations using variants of the focal underdetermined system solver (FOCUSS) [12], and then updating the dictionary based on these representations. Aharon, Elad, and Bruckstein [13] proposed the K-SVD algorithm. It involves a sparse coding stage, based on a pursuit method, followed by an update step, where the dictionary matrix is updated one column at a time, while allowing the expansion coefficients to change [13]. More recently, dictionary learning methods for exact sparse representation based on $\ell_1$-minimization [8], [14], and online learning algorithms [15], have been proposed.

Generally, the methods described above are computationally expensive algorithms that look for a sparse decomposition, for a variety of signal processing applications. In this paper, we are interested in targeting speech signals, and in deriving a dictionary learning algorithm that is computationally fast. The algorithm should be able to learn a dictionary from a short speech signal, so that it can potentially be used in real-time processing applications.

# a) Motivation

The aim of this work is to find a dictionary learning method that is fast and efficient. Rubinstein et al. have shown that this can be achieved by means of "double sparsity" [16]. Double sparsity refers to seeking a sparse decomposition and a dictionary $D$ such that the atoms in $D$ are themselves sparse over a fixed base dictionary $\Phi$, such as wavelets or the discrete cosine transform (DCT).
Also, in previous results in [17], it was found that dictionary atoms learned from speech signals with a sparse coding method based on ICA (SC-ICA) [18] are localized in time and frequency. This appears to suggest that for certain types of signals (e.g., speech and music) there might be a link between sparsity in the decomposition and sparsity in the dictionary. This is further supported by the success of transforms such as the wavelet transform, whose basis functions are localized and well-suited to the analysis of natural signals (audio, images, biomedical signals), often yielding a sparse representation.

Thus, in this paper we propose to learn sparse atoms as in [16], but rather than learning atoms that are sparse over a fixed base dictionary, we directly learn sparse atoms from a speech signal. In order to build a fast transform, the proposed algorithm seeks to learn an orthogonal dictionary from a set of local frames that are obtained by segmenting the speech signal. Over several iterations, the algorithm "grabs" the sparsest data frame, and uses a Gram-Schmidt-like step to orthogonalize the signal away from this frame. The advantage of this approach is its computational speed and simplicity, and because of the connection that we have observed between sparsity in the dictionary and in the representation, we expect that the signal representation obtained with the learned dictionary will also be sparse.

# b) Contributions

In this paper, we consider the formulation of our algorithm from the point of view of minimizing the sparsity index on atoms. We seek the sparsity of the dictionary atoms alone rather than of the decomposition, and to the authors' knowledge this perspective has not been considered elsewhere.¹ Further, we propose a stopping rule that automatically selects only a subset of the atoms. This has the potential of making the algorithm even faster, and of aiding denoising applications by using a subset of the atoms within the signal reconstruction.

# c) Organization of the Paper

The structure of the paper is as follows: the problem that we seek to address is outlined in Section II, and our sparse adaptive dictionary algorithm is introduced in Section III, along with the stopping rule. Experimental results are presented in Section IV, including the investigation of the sparsity of the atoms and speech representation, and speech denoising. Conclusions are drawn in Section VII.

# Problem Statement

Given a one-dimensional speech signal $x(t)$, we divide it into overlapping frames $x_k$, each of length $L$ samples, with an overlap of $T$ samples. Hence, the $k$th frame is given by

$$x_k(\tau) = x\big(\tau + (k-1)(L-T)\big), \quad \tau \in \{1, \dots, L\} \qquad (1)$$

where $k \in \{1, \dots, K\}$ is the frame index. Then we construct a new matrix $X$ whose $k$th column corresponds to the signal block $x_k$, and whose $(\tau, k)$th element is given by

$$[X]_{\tau k} = x_k(\tau) \qquad (2)$$

where $\tau \in \{1, \dots, L\}$ and $k \in \{1, \dots, K\}$.

The task is to learn a dictionary $D = [\psi_1, \dots, \psi_N]$ consisting of atoms $\psi_n$, providing a sparse representation for the signal blocks $x_k$. We seek a dictionary $D$ and a decomposition of $X$ such that [19]

$$X = D S \qquad (3)$$

where $S = [s_1, \dots, s_K]$ are the expansion coefficients, and

$$\|s_k\|_0 \ll L \qquad (4)$$

The $\ell_0$-norm counts the number of non-zero entries in the vector $s_k$, and therefore the expression in (4) defines the decomposition as "sparse" if the number of non-zero coefficients is small. In the remainder of this paper, we use the definition of sparsity given later in (5). The dictionary is learned from the newly constructed matrix $X$. In the case of our algorithm, we begin with a matrix containing $K$ columns, and we extract $N$ atoms from it according to the criterion discussed in the next section.
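To make the framing step concrete, the sketch below builds the block matrix $X$ described above with NumPy. It is a minimal illustration under our own assumptions: the function name and the default frame length and overlap are ours, not values prescribed in the text.

```python
import numpy as np

def frame_signal(x, L=512, T=256):
    """Segment a 1-D signal x into overlapping frames of length L with an
    overlap of T samples, stacked as the columns of the matrix X."""
    hop = L - T                          # samples to advance between frames
    n_frames = 1 + (len(x) - L) // hop
    X = np.empty((L, n_frames))
    for k in range(n_frames):
        X[:, k] = x[k * hop : k * hop + L]
    return X

# Example: a stand-in for a 1.25 s segment at 16 kHz (20000 samples).
x = np.random.randn(int(1.25 * 16000))
X = frame_signal(x)
print(X.shape)                           # (L, number of frames)
```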
To find a set of sparse dictionary atoms, we consider the sparsity index [20] for each column $x_k$ of $X$, defined as

$$\xi_k = \frac{\|x_k\|_1}{\|x_k\|_2} \qquad (5)$$

where $\|\cdot\|_1$ and $\|\cdot\|_2$ denote the $\ell_1$- and $\ell_2$-norm, respectively. The sparsity index measures the sparsity of a signal, and is such that the smaller $\xi_k$, the sparser the vector $x_k$. Our aim is to sequentially extract new atoms from $X$ to populate the dictionary matrix $D$, and we do this by finding, at each iteration, the residual column with minimum sparsity index

$$k^{\min} = \arg\min_k \xi_k \qquad (6)$$

# Greedy Adaptive Dictionary Algorithm (GAD)

We call our method the greedy adaptive dictionary (GAD) algorithm [21]. Practical implementation of the algorithm begins with the definition of a residual matrix $R^{(l)} = [r_1^{(l)}, \dots, r_K^{(l)}]$, where $r_k^{(l)}$ is the residual vector corresponding to the $k$th column of $X$. The residual matrix changes at each iteration $l$, and is initialized to $R^{(0)} = X$. The dictionary is then built by selecting the residual vector that has the lowest sparsity index, as indicated in Algorithm 1.

Algorithm 1: Greedy adaptive dictionary (GAD) algorithm
1. Initialize: $D = [\;]$ (empty matrix), $l = 1$, $R^{(0)} = X$
2. repeat
3. Find the residual column of $R^{(l-1)}$ with the lowest $\ell_1$-to-$\ell_2$-norm ratio: $k^{\min} = \arg\min_k \|r_k^{(l-1)}\|_1 / \|r_k^{(l-1)}\|_2$
4. Set the $l$th atom equal to the normalized $r_{k^{\min}}^{(l-1)}$: $\psi_l = r_{k^{\min}}^{(l-1)} / \|r_{k^{\min}}^{(l-1)}\|_2$
5. Add $\psi_l$ to the dictionary: $D \leftarrow [D, \psi_l]$
6. Compute the new residual $r_k^{(l)} = r_k^{(l-1)} - \psi_l \langle \psi_l, r_k^{(l-1)} \rangle$ for all columns $k$
7. until "termination" (see Section III-A)

Aside from the advantage of producing atoms that are directly relevant to the data, the GAD algorithm results in an orthogonal transform. To see this, consider rewriting the update equation in step 6 of Algorithm 1 as the projection of the current residual onto the atom space, in the style of Matching Pursuit [22], [23]:

$$r_k^{(l)} = r_k^{(l-1)} - \frac{\langle r_k^{(l-1)}, \psi_l \rangle}{\langle \psi_l, \psi_l \rangle}\, \psi_l \qquad (7)$$

It follows from step 4 in Algorithm 1 that the denominator on the right-hand side of (7) is equal to 1, and therefore the equation corresponds to the residual update in step 6. Orthogonal dictionaries have the advantage of being easily invertible, since if the matrix $D$ is orthogonal, then $D^{-1} = D^T$, and evaluation of the inverse simply requires the use of the matrix transpose.

# a) Termination Rules

We consider two possible termination rules:

1. The number of atoms to be extracted is predetermined, so that up to $N$ atoms are learned. Then, the termination rule is: repeat from step 2 until $l = N$, where $N \le \min(L, K)$.
2. The reconstruction error at the current iteration is defined, and the rule is: repeat from step 2 until

$$\|x - \hat{x}^{(l)}\|_2 < \epsilon \qquad (8)$$

where $\hat{x}^{(l)}$ is the approximation of the speech signal $x$, obtained at the $l$th iteration from the dictionary $D^{(l)}$ learned so far (as defined in step 5 of Algorithm 1) by reversing the framing process, and $\epsilon$ is a preset threshold.
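The following NumPy sketch shows one way the steps of Algorithm 1 could be implemented, using the fixed-number-of-atoms termination rule. It is an illustrative reading of the pseudocode above, not the authors' implementation; the function name, the eps safeguard, and the handling of exhausted residual columns are our own choices.

```python
import numpy as np

def gad(X, n_atoms, eps=1e-12):
    """Minimal sketch of the greedy adaptive dictionary (GAD) algorithm.

    X       : (L, K) matrix whose columns are overlapping signal frames.
    n_atoms : number of atoms to extract (termination rule 1).
    Returns D, an (L, n_atoms) matrix whose orthonormal columns are the atoms.
    """
    R = np.array(X, dtype=float, copy=True)    # residual matrix, initialised to X
    atoms = []
    for _ in range(n_atoms):
        # Step 3: pick the residual column with the smallest l1-to-l2 norm ratio.
        norms = np.linalg.norm(R, axis=0)
        sparsity = np.abs(R).sum(axis=0) / np.maximum(norms, eps)
        sparsity[norms <= eps] = np.inf        # exhausted columns are never re-selected
        k_min = int(np.argmin(sparsity))
        if not np.isfinite(sparsity[k_min]):
            break                              # no usable residual left
        # Step 4: normalise the selected column to obtain the new atom.
        psi = R[:, k_min] / norms[k_min]
        # Step 5: add the atom to the dictionary.
        atoms.append(psi)
        # Step 6: Gram-Schmidt-like update, orthogonalising every column against psi.
        R -= np.outer(psi, psi @ R)
    return np.column_stack(atoms)
```

Because the atoms are orthonormal, the forward transform is simply $S = D^T X$ and the reconstruction is $D S$, which is what makes the resulting dictionary fast to apply and to invert.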
# Experimental Results

We compared the GAD method to PCA [24] and K-SVD [13]. K-SVD was chosen because it learns data-determined dictionaries and looks for a sparse representation. PCA was chosen because it is a well-established technique, commonly used in speech coding, and therefore it sets the benchmark for the speech denoising application. We used the three algorithms to learn 512 dictionary atoms from a segment of speech lasting 1.25 s. A short data segment was used because this way the algorithm can be used within real-time speech processing applications. The data was taken from the female speech signal "supernova.wav" by "Corsica S," downloaded from The Freesound Project database [25], and downsampled to 16 kHz. We also used the male speech signal "Henry5.mp3" by "acclivity," downloaded from the same database, and downsampled to 16 kHz.

The K-SVD Matlab Toolbox [26] was used to implement the K-SVD algorithm. K-SVD requires the selection of several parameters. We set the number of iterations to 50, as recommended in [13], and the number of nonzero entries in the coefficient update stage to 10, which we found empirically to give a more accurate, although not as sparse, signal representation than the value used in [13]. The dictionary size was set to 512 and the memory usage to "high."

# a) Computational Complexity

In Table 1, we report the computational times of the algorithms when learning a dictionary from a speech segment of 1.25 s, averaged over 100 trials. Two versions of K-SVD were compared: the original version, which is fully based on Matlab M-code, and the second version, which combines M-code with optimized MEX functions written in C. The experiments were conducted on a Quad-Core Intel Xeon Mac at 2.66 GHz, using Matlab Version 7.6.0.324 (R2008a) under the Mac OS X Version 10.5.8 operating system.

Table 1: Comparison of the computational complexity of the PCA, K-SVD, and GAD algorithms. The table shows the average computational time for each algorithm, obtained over 100 trials.

Of the two K-SVD versions, the one fully based on M-code requires around 1 hour and 45 minutes to learn the dictionary. The GAD algorithm is currently implemented in non-optimized M-code only; therefore, we expect that optimizing the code for GAD will lead to even faster performance.

# b) Learned Atoms

We begin by visually inspecting some examples of the atoms learned with the three algorithms, and then considering the sparsity of the atoms and signal representation. Fig. 1(a) shows examples of the overlapping data blocks found in the columns of the matrix $X$, from which each dictionary is learned, while the remaining plots in the figure show examples of the atoms learned with PCA, K-SVD, and GAD. The sparsity index relating to each atom is also given.

The atoms extracted with PCA [Fig. 1(b)] are not localized. Comparing them with Fig. 1(a), they do not appear to be capturing any particular features of the speech signal. The K-SVD atoms [Fig. 1(c)] exhibit some structure that generally seems to correspond to that of the original data blocks. The atoms obtained with the GAD algorithm are illustrated in Fig. 1(d). Those atoms extracted earlier, shown on the first two lines, are quite similar to the original data, and are also the sparsest atoms, as indicated by the low sparsity index. Atoms extracted later, shown on the last two lines in the figure, capture mostly "noise"-like characteristics, or less meaningful features of the signal.

# c) Sparsity of Atoms and Representation

We have seen in Fig. 1 how the GAD algorithm yields atoms that are initially quite sparse and then become more "noise"-like. To investigate this further, 100 segments were taken from the original speech data, each lasting 1.25 s. PCA, K-SVD, and GAD were used to learn dictionaries from each segment. The sparsity index for each atom was then evaluated, and the average across the 100 trials was taken.

Fig. 2 shows the atom sparsity index for the framed speech data in the columns of $X$, and for the atoms learned with PCA, K-SVD, and GAD. Recall that a sparse atom is characterized by a low sparsity index. The plot shows that the atoms learned by GAD in the beginning are the sparsest, and after around 200 atoms have been extracted, the sparsity index is close to its maximum value. The behavior observed here is in agreement with what was observed in Fig. 1. It also shows that the atoms obtained with the other algorithms, and the original data blocks considered in the figure, are not as sparse as those extracted by GAD.

The results are also shown in the first column of Table 2, and they validate our expectations: GAD yields atoms that are sparser than the original signal blocks, and sparser than those of all the other algorithms. Moreover, when we used the termination rule in (8) (shown in Table 2 as GAD-TR), the average sparsity index for the GAD atoms decreased from 16.2 to 12.6. On average, GAD-TR was found to learn less than 110 atoms, which from Fig. 2 can be seen to correspond to those atoms that are sparsest. The algorithms perform in a similar way on the male speech.

Next, we seek to determine how sparse the representation obtained with the GAD method is. We do this by considering the transform coefficients obtained with all methods, for each block, and across the 100 speech segments taken from the speech signal, each lasting 1.25 s. The sparsity index of the transform coefficients is found each time. We then average across the 100 segments and across all blocks to obtain a single value, reported in Table 2. The table also includes values for the sparsity index of the original signal blocks in $X$. The lowest representation sparsity index value is obtained with K-SVD, thanks to the strong sparsity constraint imposed by the algorithm on the signal decomposition. This entails limiting the number of nonzero elements in the signal representation to a small number (we use 10). The signal transformed with the GAD algorithm is sparser than in the time domain, and sparser than the coefficients obtained with PCA when all atoms are used in the signal reconstruction, for both signals. Moreover, the representation becomes even sparser when GAD is used with the termination rule. Thus, as well as a dictionary whose atoms are sparse, GAD leads to a sparse decomposition. This confirms the concepts discussed in Section I-A.
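As an illustration of how sparsity figures of this kind could be produced, the sketch below computes the sparsity index of (5) for the raw frames, the learned atoms, and the transform coefficients. It reuses the hypothetical frame_signal and gad helpers from the earlier sketches; the segment length, overlap, and number of atoms are illustrative choices, not the exact experimental protocol.

```python
import numpy as np

def sparsity_index(V, eps=1e-12):
    """Sparsity index of each column of V: l1-norm over l2-norm (lower = sparser)."""
    return np.abs(V).sum(axis=0) / (np.linalg.norm(V, axis=0) + eps)

# Stand-in data: a 1.25 s segment at 16 kHz would give 20000 samples.
x = np.random.randn(20000)
X = frame_signal(x, L=512, T=448)   # heavily overlapping frames (illustrative parameters)
D = gad(X, n_atoms=200)             # learn an illustrative number of atoms
S = D.T @ X                         # transform coefficients (D has orthonormal columns)

print("mean sparsity index of raw frames:  ", sparsity_index(X).mean())
print("mean sparsity index of GAD atoms:   ", sparsity_index(D).mean())
print("mean sparsity index of coefficients:", sparsity_index(S).mean())
```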
# d) Representation Accuracy

The accuracy of the signal approximation given by each algorithm can be assessed with the reconstruction error, as defined in (8), after the dictionary has been learned:

$$E = \|x - \hat{x}\|_2 \qquad (9)$$

where $\hat{x}$ is the signal approximation obtained from the transform coefficients $D^{\dagger} X$, and $D^{\dagger}$ is the right pseudoinverse of $D$.

This is plotted in Fig. 3 for each algorithm, as the number of atoms omitted in the signal reconstruction goes from 0 to 462 (or, equivalently, the total number of atoms used goes from 512 down to 50). K-SVD has a nonzero reconstruction error even when all atoms are included in the signal approximation, because the transform is not complete, and therefore it does not result in an exact reconstruction.

In general, the results show that all algorithms perform quite well when few atoms are omitted in the reconstruction. As more and more atoms are omitted, the reconstruction error increases. PCA performs best, because the transform arranges the signal components so that most energy is concentrated in a small number of components, corresponding to those extracted earlier. The GAD transform also gives good signal approximations as more atoms are excluded from the reconstruction, although its performance seems to worsen as the number of omitted atoms becomes more than 300 (or fewer than 200 atoms are used in the reconstruction). This corresponds to the number, identified in Fig. 2, below which the GAD atoms are sparsest, and above which the sparsity index reaches its maximum value. K-SVD yields signal approximations that suffer most from the reduction in the number of atoms.

The dictionary constructed by GAD separates the coherent components from the incoherent components. The latter can be discarded from the representation to reduce incoherent background noise. This suggests that the GAD algorithm might be suitable for denoising applications. Hence, we consider this problem in the following section.

# Application to Speech Denoising

The term denoising refers to the removal of noise from a signal. Sparse transforms have been found to be the most successful methods for denoising [27], and dictionary learning methods have been used for this application [13]. Table 3 shows the tolerance of the PCA, K-SVD, and GAD algorithms to a noise level changing from 10 dB to -10 dB, as the number of atoms in the reconstruction is reduced from 512 to 50. This is evaluated with the improvement in signal-to-noise ratio (ISNR):

$$\mathrm{ISNR} = 10 \log_{10} \frac{\sum_t |x(t) - y(t)|^2}{\sum_t |x(t) - \hat{x}(t)|^2} \qquad (10)$$

where $x$ is the original signal, $y$ is the observed distorted (noisy) signal, and $\hat{x}$ is the source approximated by the transform. As the signal approximation becomes closer to the original source, the ISNR increases.
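A small sketch of how the ISNR in (10) can be evaluated, together with a reduced-atom reconstruction of the kind described above, is given below. The helper names are ours, and keeping the first n_keep atoms assumes the GAD convention that the sparsest (coherent) atoms are extracted first.

```python
import numpy as np

def isnr(x, y, x_hat):
    """Improvement in signal-to-noise ratio (dB): compares the noisy observation y
    and the approximation x_hat against the clean signal x, as in (10)."""
    return 10.0 * np.log10(np.sum((x - y) ** 2) / np.sum((x - x_hat) ** 2))

def denoise_with_dictionary(Y, D, n_keep):
    """Reconstruct the noisy frame matrix Y using only the first n_keep atoms of the
    orthonormal dictionary D (the sparsest atoms extracted earliest by GAD)."""
    D_sub = D[:, :n_keep]
    return D_sub @ (D_sub.T @ Y)
```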
When all atoms are used in the reconstruction, the complete transforms PCA and GAD yield an ISNR of 0 dB, while K-SVD gives a nonzero ISNR, since the approximation is not exact. Generally, K-SVD has been shown to perform well for tasks such as image denoising [7], and the results in Table 3 show that this is also true for speech: the algorithm yields the highest ISNR values across all experiments. For the remaining algorithms, when the noise is low (10 dB), reducing the number of atoms in the reconstruction leads to distortion in the signal approximation. As the level of noise increases, the high ISNR values for PCA and GAD indicate that there are benefits in reducing the number of atoms used in the signal approximation. It is well known that PCA can reduce the level of noise present, because it decomposes the space into signal and noise subspaces [28], and the results in Table 3 show that the performance of GAD is similar.

It should be emphasized that the advantage of using the GAD algorithm over PCA is that methods based on sparse representations do not enforce decorrelation on the data. This results in greater flexibility in adapting the representation to the data, and in uncovering previously unobserved structure in the data. Moreover, sparse representations allow the use of powerful and efficient tools for signal analysis.

# Conclusions

In this paper, we have presented a greedy adaptive dictionary learning algorithm that finds new dictionary elements that are sparse. The algorithm constructs a signal-adaptive orthogonal dictionary, whose atoms encode local properties of the signal. The algorithm has been shown to yield sparse atoms and a sparse signal representation. Its performance was compared to that of the PCA and K-SVD methods, and it was found to give good signal approximations even as the number of atoms in the reconstruction decreases considerably. It results in better signal reconstruction than K-SVD, it has good tolerance to noise, and it does not exhibit distortion when noise reduction is performed at low noise levels.

GAD is a computationally fast algorithm that finds atoms that are sparse. It is motivated by the observation that sparsity in the dictionary and sparsity in the decomposition appear to be linked, for certain types of signals. This notion is supported by the results in this paper: whilst looking for sparse atoms and making no assumptions on the decomposition, the GAD algorithm yields a signal decomposition that is sparse.

Although the only sparsity measure considered here is the sparsity index, we have compared the results to other measures of sparsity, including the Gini index, which was found to outperform several other sparsity measures [29]. Our experimental results indicated that the performance of the algorithm is not noticeably different.
However, we are considering studying this further in future work. In its present form, the GAD method looks for onsets, and it might be argued that it does not take advantage of all the possible redundancy in the speech signal, because it does not exploit the pitch structure of the signal. In future work we are considering searching for sparsity in the frequency domain, and perhaps in the prototype waveform domain [30]. On the other hand, in its present form GAD is a general algorithm that can be used with a variety of data, because it does not make any assumptions on the data characteristics.

Although the GAD algorithm is currently at the theoretical stage, it is a fast method that might in future be used in real practical applications such as speech coding. In this case, as with PCA, the method would require the transmission of the signal-adaptive dictionary. Other applications to which we are particularly interested in applying the GAD method include image processing and biomedical signal processing. Biomedical applications typically give rise to large data sets; for instance, in microarray experiments the expression values of thousands of genes are generated. Therefore, in this case the algorithm would have to be extended to deal with large data sets. We are also considering the application of this approach to the problem of source separation.

![Figure 1: Examples of the frames of the original speech signals, and of the atoms learned with the PCA, K-SVD, and GAD algorithms](image-4.png "Figure 1")

![Figure 2: Sparsity index for the atoms learned with the GAD, PCA, and K-SVD algorithms from a female speech signal, averaged over 100 trials](image-5.png "Figure 2")

![Table 2: Mean value and standard deviation (std) of the sparsity index of the atoms and of the signal representation obtained with the PCA, K-SVD, and GAD algorithms, compared to that of the original signal blocks. The values for the original data blocks and for PCA, K-SVD, and GAD were averaged across 100 trials and 512 atoms](image-6.png "Table 2")

![Figure 3: Reconstruction error for the GAD, PCA, and K-SVD algorithms, averaged over 100 trials](image-7.png "Figure 3")

¹ The approach proposed in [16] looks for a sparse dictionary over a base dictionary, as well as a sparse decomposition, and is therefore quite different to the method proposed here.

## Acknowledgment

The authors would like to thank the anonymous reviewers for their input that helped improve the quality of the presentation of this work.

# References

[1] P. Földiák, "Forming sparse representations by local anti-Hebbian learning," Biol. Cybern., vol. 64, 1990.
[2] B. Olshausen and D. Field, "Emergence of simple-cell receptive field properties by learning a sparse code for natural images," Nature, vol. 381, 1996.
[3] T. Shan and L. Jiao, "New evidences for sparse coding strategy employed in visual neurons: From the image processing and nonlinear approximation viewpoint," in Proc. Eur. Symp. Artif. Neural Netw. (ESANN), 2005.
[4] M. Zibulevsky and B. A. Pearlmutter, "Blind source separation by sparse decomposition in a signal dictionary," Neural Comput., vol. 13, no. 4, 2001.
[5] Ö. Yilmaz and S. Rickard, "Blind separation of speech mixtures via time-frequency masking," IEEE Trans. Signal Process., vol. 52, no. 7, Jul. 2004.
[6] R. Gribonval, "Sparse decomposition of stereo signals with matching pursuit and application to blind separation of more than two sources from a stereo mixture," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), vol. 3, 2002.
[7] M. Elad and M. Aharon, "Image denoising via sparse and redundant representations over learned dictionaries," IEEE Trans. Image Process., vol. 15, no. 12, Dec. 2006.
[8] R. Gribonval and K. Schnass, "Some recovery conditions for basis learning by L1-minimization," in Proc. Int. Symp. Commun., Control, Signal Process. (ISCCSP), 2008.
[9] L. Rebollo-Neira, "Dictionary redundancy elimination," IEE Proc.-Vis., Image, Signal Process., vol. 151, 2004.
[10] M. S. Lewicki and T. J. Sejnowski, "Learning overcomplete representations," Neural Comput., vol. 12, 2000.
[11] K. Kreutz-Delgado, J. Murray, B. D. Rao, K. Engan, T.-W. Lee, and T. J. Sejnowski, "Dictionary learning algorithms for sparse representations," Neural Comput., vol. 15, 2003.
[12] I. Gorodnitsky, J. George, and B. Rao, "Neuromagnetic source imaging with FOCUSS: A recursive weighted minimum norm algorithm," J. Electroencephalogr. Clin. Neurophysiol., vol. 95, 1995.
[13] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representations," IEEE Trans. Signal Process., vol. 54, no. 11, Nov. 2006.
[14] M. D. Plumbley, "Dictionary learning for L1-exact sparse coding," in Proc. Int. Conf. Independent Component Analysis (ICA), 2007.
[15] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online dictionary learning for sparse coding," in Proc. Int. Conf. Mach. Learn. (ICML), 2009.
[16] R. Rubinstein, M. Zibulevsky, and M. Elad, "Double sparsity: Learning sparse dictionaries for sparse signal approximation," IEEE Trans. Signal Process., vol. 58, no. 3, Mar. 2010.
[17] M. G. Jafari, E. Vincent, S. A. Abdallah, M. D. Plumbley, and M. E. Davies, "An adaptive stereo basis method for convolutive blind audio source separation," Neurocomputing, vol. 71, 2008.
[18] S. A. Abdallah and M. D. Plumbley, "If edges are the independent components of natural images, what are the independent components of natural sounds?," in Proc. Int. Conf. Independent Component Analysis (ICA), 2001.
[19] M. Goodwin and M. Vetterli, "Matching pursuit and atomic signal models based on recursive filter banks," IEEE Trans. Signal Process., vol. 47, no. 7, Jul. 1999.
[20] V. Tan and C. Févotte, "A study of the effect of source sparsity for various transforms on blind audio source separation performance," in Proc. Workshop Signal Process. With Adapt. Sparse Structured Represent. (SPARS), 2005.
[21] M. Jafari and M. Plumbley, "An adaptive orthogonal sparsifying transform for speech signals," in Proc. Int. Symp. Commun., Control, Signal Process. (ISCCSP), 2008.
[22] S. Mallat and Z. Zhang, "Matching pursuit with time-frequency dictionaries," IEEE Trans. Signal Process., vol. 41, no. 12, Dec. 1993.
[23] S. Mallat, G. Davis, and Z. Zhang, "Adaptive time-frequency decompositions," SPIE J. Opt. Eng., vol. 33, 1994.
[24] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed. Upper Saddle River, NJ: Prentice-Hall, 1999.
[25] The Freesound Project, a collaborative database of Creative Commons licensed sounds [Online].
[26] R. Rubinstein, K-SVD Matlab Toolbox [Online].
[27] S. Valiollahzadeh, H. Firouzi, M. Babaie-Zadeh, and C. Jutten, "Image denoising using sparse representations," in Proc. Int. Conf. Independent Component Analysis (ICA), 2009.
[28] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis. New York: Wiley, 2000.
[29] N. Hurley and S. Rickard, "Comparing measures of sparsity," IEEE Trans. Inf. Theory, vol. 55, no. 10, Oct. 2009.
[30] W. B. Kleijn, "Encoding speech using prototype waveforms," IEEE Trans. Speech Audio Process., vol. 1, no. 4, Oct. 1993.