## Abstract

We present a fast, efficient algorithm for learning an overcomplete dictionary for the sparse representation of signals. The whole problem is considered as a minimization of the approximation error function with a coherence penalty on the dictionary atoms and a sparsity regularization on the coefficient matrix. Because the problem is nonconvex and nonsmooth, this minimization cannot be solved efficiently by an ordinary optimization method. We propose a decomposition scheme and an alternating optimization that turn the problem into a set of piecewise quadratic, univariate subproblems, each over a single vector variable: either one dictionary atom or one coefficient vector. Although the subproblems are still nonsmooth, remarkably they become much simpler, so that we can find a closed-form solution by introducing a proximal operator. This leads to an efficient algorithm for sparse representation. To our knowledge, applying the proximal operator to a problem with an incoherence term, and obtaining the optimal dictionary atoms in closed form with a proximal operator technique, have not previously been studied. The main advantages of the proposed algorithm are that, as suggested by our analysis and simulation study, it has lower computational complexity and a higher convergence rate than state-of-the-art algorithms. In addition, in real applications, it shows good performance and significant reductions in computational time.

## 1 Introduction

Situated at the heart of signal processing, data models are fundamental for stabilizing the solutions of inverse problems. The matrix factorization model has become a prominent technique for various tasks, such as independent component analysis (Hyvarinen, 1999), nonnegative matrix factorization (Lee & Seung, 1999; Cichocki & Phan, 2009), compressive sensing (Baraniuk, 2007), and sparse representation (Elad, 2010; Lewicki & Sejnowski, 2000; Kreutz-Delgado, Murray, & Rao, 2003). For the matrix factorization model, we collect the measurements and stack them as the observation matrix $\mathbf{Y} \in \mathbb{R}^{m \times L}$, where *m* is the dimensionality of each sample. Matrix factorization consists of finding a factorization of the form $\mathbf{Y} \approx \mathbf{D}\mathbf{X}$ by minimizing $\|\mathbf{Y} - \mathbf{D}\mathbf{X}\|_F^2$, where the factor $\mathbf{D} \in \mathbb{R}^{m \times r}$ is called the dictionary (its column vectors are called atoms), while the other factor, $\mathbf{X} \in \mathbb{R}^{r \times L}$, is called the coefficient matrix, and $\mathbf{Y} - \mathbf{D}\mathbf{X}$ is the approximation error. The factors from different tasks are shown to have very different representational properties. The differences mainly come from two aspects: the constraints imposed, such as sparseness, nonnegativity, and smoothness, and the relative dimensionality of $\mathbf{D}$, which is assumed a priori. The relative dimensionality of matrix factorization has two categories: overdetermined ($m > r$) systems (Lee & Seung, 1999; Cichocki & Phan, 2009) and underdetermined ($m < r$) systems (Baraniuk, 2007; Elad, 2010; Lewicki & Sejnowski, 2000; Kreutz-Delgado et al., 2003) of linear equations, which have completely different application backgrounds. We investigate the underdetermined linear inverse problem. The straightforward optimization of underdetermined linear inverse problems, with infinitely many solutions, is an ill-posed problem. The problem can be solved by imposing a sparsity constraint on $\mathbf{X}$, where the sparsity is associated with the overcompleteness ($r > m$) of $\mathbf{D}$, so that most coefficients in $\mathbf{X}$ become zero.
Then each signal is represented as a linear combination of a few dictionary atoms, leading to the issue of sparse representation, where the dictionary is overcomplete.

Sparse representation is of significant importance in signal processing (Elad, Figueiredo, & Ma, 2010; Jafari & Plumbley, 2011; Fadili, Starck, & Murtagh, 2009; Adler et al., 2012; Caiafa & Cichocki, 2012). A key problem related to sparse representation is the choice of the dictionary on which the signals of interest are decomposed. One simple approach is to consider predefined dictionaries, such as the discrete Fourier transform, the discrete cosine transform, and wavelets (Selesnick, Baraniuk, & Kingsbury, 2005). Another approach is to use an adaptive dictionary learned from the signals that are to be represented, which results in better matching to the contents of the signals. The learned dictionary has the potential to offer improved performance compared with predefined dictionaries. Many approaches (Engan, Aase, & Husoy, 1999; Aharon, Elad, & Bruckstein, 2006; Dai, Xu, & Wang, 2012; Yaghoobi, Daudet, & Davies, 2009; Bao, Ji, Quan, & Shen, 2014) have been proposed. Most of them use two alternating stages: sparse coding and dictionary update. The sparse coding step finds the sparsest coefficient matrix of the training signals. For the sparse coding, the $\ell_0$-norm is treated as a sparsity constraint. A typical approach is orthogonal matching pursuit (OMP) (Tropp, 2004). OMP, a greedy algorithm, can find reasonably good representations. However, it does not always provide accurate enough estimations of the signals, and its computational complexity is very high. Another appealing method for sparse coding is basis pursuit (BP) (Chen, Donoho, & Saunders, 2001) or LASSO (Tibshirani, 1994), which replaces the $\ell_0$-norm by an $\ell_1$-norm. Gradient-based methods (Yaghoobi, Daudet, & Davies, 2009; Rakotomamonjy, 2013) are very effective approaches to solving $\ell_1$-norm optimizations.
The gradient method leads to an iterative soft-thresholding method, which requires many iterations, especially when the solution is not very sparse or the initialization is not ideal. For the dictionary update, the atoms are updated either one by one or simultaneously. By optimizing a least squares problem, the whole set of atoms is updated at once in Engan, Aase, et al. (1999), but this is not guaranteed to converge. One-by-one atom updating is implemented in Aharon et al. (2006) through singular value decomposition (SVD); this approach can learn a better dictionary. However, SVD is computationally expensive. The approach in Bao, Ji, Quan, and Shen (2014) updates the dictionary atoms one by one based on a proximal operation method, which is essentially a gradient-based method and converges slowly. Therefore, it is important to develop an efficient strategy to accelerate the dictionary learning process.
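For concreteness, the iterative soft-thresholding approach described above can be sketched as follows. This is a minimal illustration in Python; the function name, step-size choice, and parameter values are our own, not code from the cited works:

```python
import numpy as np

def ista_sparse_code(Y, D, lam=0.1, n_iter=200):
    """Iterative soft thresholding for min_X 0.5*||Y - D X||_F^2 + lam*||X||_1."""
    # Step size 1/L, where L = ||D||_2^2 is the Lipschitz constant of the gradient.
    L = np.linalg.norm(D, 2) ** 2
    X = np.zeros((D.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        G = X - (D.T @ (D @ X - Y)) / L                        # gradient step
        X = np.sign(G) * np.maximum(np.abs(G) - lam / L, 0.0)  # soft threshold
    return X
```

As the text notes, such a scheme may need many iterations when the solution is not very sparse or the initialization is poor.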

In this letter, we address the problem of learning an overcomplete dictionary for sparse representation. According to the theory of compressive sensing, the mutual incoherence of the dictionary plays a crucial role in sparse coding (Sigg, Dikk, & Buhmann, 2012; Lin, Liu, & Zha, 2012; Mailhe, Barchiesi, & Plumbley, 2012; Bao, Ji, & Quan, 2014; Wang, Cai, Shi, & Yin, 2014). If the dissimilarities between any two atoms in the dictionary are high, the sparse representation generated from this dictionary is more efficient. Therefore, we impose an incoherence constraint on the dictionary. For simplicity, we use the $\ell_1$-norm for sparsity. Hence, the whole problem is constructed as a minimization of the approximation error with the sparsity regularization on the coefficient matrix and with the coherence penalty on the dictionary atoms (see section 3.1). From the viewpoint of optimization, the whole problem is nonconvex with respect to the dictionary and the coefficient matrix and nonsmooth because of the sparsity regularization and the coherence penalty; the problem therefore cannot be solved efficiently by an ordinary optimization method. To address this, we separate the problem into a series of subproblems, each of which is a minimization over a single vector variable (i.e., a univariate function), which can be solved more easily. Remarkably, each such function has a piecewise quadratic form. Thus, we can apply the proximal operator (PO) (Moreau, 1962) to handle the nonsmoothness of each subproblem and thereby obtain a closed-form solution of each subproblem. This holds for the dictionary atoms, even though the coherence penalty term is included, as well as for the sparse coefficient vectors. To our knowledge, applying PO to the problem with a coherence penalty term has not appeared in the literature.

An appealing feature of this method is that the subproblems for atom updating and for coefficient updating have similar forms. Both are piecewise quadratic and thus can be solved efficiently by the PO technique, leading to explicit closed-form solutions. In this way, the whole problem can be solved efficiently and quickly, and we have developed an algorithm, the fast proximal dictionary learning algorithm (fastPDL). This algorithm does not include inner optimization steps, though it is still a recursive procedure because it treats the atoms and coefficient vectors one by one in each round, and the results from one round affect the results of the next round recursively. The proposed algorithm gains from directly obtaining the closed-form solutions with low complexity and a high convergence rate, avoiding costly sparse coding techniques, such as OMP (Tropp, 2004) and BP (Chen et al., 2001), and avoiding gradient-based methods (Rakotomamonjy, 2013; Bao, Ji, Quan, & Shen, 2014) with their slow convergence rates. Our theoretical analysis indicates that the fastPDL algorithm is substantially more efficient than state-of-the-art algorithms.

The proposed algorithm is expected to have the following desirable characteristics:

- •
Our dictionary learning problem is formulated as the minimization of the approximation error with the sparsity regularization on the coefficient matrix and the coherence penalty on the dictionary. The coherence of $\mathbf{D}$ and the sparsity of $\mathbf{X}$ can be flexibly controlled by adjusting the corresponding regularization parameters.

- •
We turn the whole problem into a set of univariate optimization subproblems. Each is piecewise quadratic, so we can apply PO to solve it and give a closed-form solution explicitly. This leads to an algorithm with low computational complexity and a high convergence rate.

- •
While the PO method has been used to solve sparse coding or compressive sensing problems with an $\ell_1$-norm, to our knowledge, no such treatment exists for dictionary learning. In this letter, because we use a coherence penalty term, the dictionary learning subproblem is also nonsmooth, so ordinary optimization methods become inefficient. We show that the PO technique can also be applied to this problem and gives a closed-form solution. Thus, although the coherence penalty term of the dictionary is incorporated into the dictionary learning problem as an additional constraint, it does not increase the complexity of the fastPDL algorithm.

The remainder of the letter is organized as follows. In section 2, we introduce the dictionary learning problem, including sparse coding and dictionary update, and provide an overview of the state-of-the-art algorithms. In section 3, we describe and analyze the fastPDL algorithm in detail. Numerical experimental studies described in section 4 clearly establish the practical advantages of the proposed fastPDL algorithm. We show that this algorithm converges significantly faster than state-of-the-art algorithms. We also present two applications, one involving nonnegative signals and the other involving a general real-world valued signal. The letter concludes in section 5.

### 1.1 Notation

A boldface uppercase letter, such as $\mathbf{X}$, denotes a matrix, and a lowercase letter $x_{ij}$ denotes the $(i,j)$th entry of $\mathbf{X}$. A boldface lowercase letter $\mathbf{x}$ denotes a vector, and a lowercase letter $x_j$ denotes the $j$th entry of $\mathbf{x}$. $\mathbf{x}^i$ denotes the $i$th row, and $\mathbf{x}_j$ denotes the $j$th column of matrix $\mathbf{X}$. $\mathbf{X}^T$ denotes the transpose of the matrix $\mathbf{X}$. The Frobenius norm of $\mathbf{X}$ is defined as $\|\mathbf{X}\|_F = \sqrt{\sum_{i,j} x_{ij}^2}$. The $\ell_2$-norm of $\mathbf{x}$ is defined as $\|\mathbf{x}\|_2 = \sqrt{\sum_j x_j^2}$. The $\ell_1$-norms of $\mathbf{X}$ and $\mathbf{x}$ are defined as $\|\mathbf{X}\|_1 = \sum_{i,j} |x_{ij}|$ and $\|\mathbf{x}\|_1 = \sum_j |x_j|$, respectively. $\|\mathbf{x}\|_0$ is the so-called $\ell_0$-norm, which counts the number of nonzero elements. $\operatorname{tr}(\cdot)$ denotes the trace of a square matrix. In this letter, all the parameters take real values, even though we do not state this explicitly each time.

## 2 Overview of Dictionary Learning Problem

When *m* is less than *r*, the case that we are interested in, the problem becomes ill posed. Therefore, some constraints, such as sparsity, should be imposed on $\mathbf{X}$. Usually the sparsity can be measured by, for instance, the $\ell_0$-norm or the $\ell_1$-norm. It is worth noting that the error function, equation 2.1, is not convex with respect to $\mathbf{D}$ and $\mathbf{X}$ jointly. Most dictionary learning algorithms attack this problem by iteratively performing a two-stage procedure: sparse coding and dictionary update. Starting with an initial dictionary, the following two stages are repeated until convergence.

### 2.1 Sparse Coding

### 2.2 Dictionary Update

The procedure for dictionary learning is summarized in algorithm 1.
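The two-stage structure of algorithm 1 can be sketched as follows. Here `sparse_code` and `update_dict` are placeholder stage functions of our own, standing in for whichever concrete methods (OMP, MOD, and so on) are plugged in:

```python
import numpy as np

def dictionary_learning(Y, r, sparse_code, update_dict, n_iter=50, tol=1e-6):
    """Generic two-stage loop: alternate sparse coding and dictionary update."""
    rng = np.random.default_rng(0)
    D = Y[:, rng.choice(Y.shape[1], r, replace=False)].astype(float)  # init from data
    D /= np.linalg.norm(D, axis=0)                 # unit l2-norm atoms
    prev = None
    for _ in range(n_iter):
        X = sparse_code(Y, D)                      # stage 1: fix D, solve for X
        D = update_dict(Y, X, D)                   # stage 2: fix X, refine D
        err = np.linalg.norm(Y - D @ X, 'fro')
        if prev is not None and abs(prev - err) <= tol * max(prev, 1.0):
            break                                  # relative change small: stop
        prev = err
    return D, X
```

The loop stops when the relative change of the error between consecutive iterations falls below a tolerance, mirroring the stopping criterion used in section 4.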

### 2.3 State-of-the-Art Algorithms

The method of optimal directions (MOD) is one of the simplest dictionary learning algorithms. In the sparse coding stage, $\mathbf{X}$ is solved using OMP (Engan, Aase, et al., 1999) or FOCUSS (Engan, Rao, & Kreutz-Delgado, 1999). In the dictionary update stage, MOD finds the minimum of $\|\mathbf{Y} - \mathbf{D}\mathbf{X}\|_F^2$ with $\mathbf{X}$ fixed. This leads to the closed-form expression $\mathbf{D} = \mathbf{Y}\mathbf{X}^T(\mathbf{X}\mathbf{X}^T)^{-1}$, followed by normalizing the columns of $\mathbf{D}$. K-SVD (Aharon et al., 2006) is a classic and successful dictionary learning approach. In its sparse coding stage, the sparse coding is also solved using OMP.

The difference between K-SVD and MOD is in the dictionary update stage. MOD updates the whole set of atoms at once with $\mathbf{X}$ fixed, but it is not guaranteed to converge. In contrast, K-SVD, through singular value decomposition (SVD), updates the atoms one by one and simultaneously updates the nonzero entries in the associated row vector of $\mathbf{X}$. The work of Dai et al. (2012) extends the K-SVD algorithm by allowing, in the dictionary update stage, the simultaneous optimization of several dictionary atoms and the related approximation coefficients. With this possibility of optimizing several atoms at a time, the running efficiency of their algorithm is better than that of the original K-SVD, although still worse than that of MOD.
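The MOD update described above has a simple closed form; a minimal sketch follows (the function name and the pseudoinverse guard against a singular $\mathbf{X}\mathbf{X}^T$ are our choices):

```python
import numpy as np

def mod_update(Y, X, eps=1e-12):
    """MOD dictionary update: D = Y X^T (X X^T)^{-1}, then column normalization."""
    D = Y @ X.T @ np.linalg.pinv(X @ X.T)    # least-squares optimum for fixed X
    norms = np.maximum(np.linalg.norm(D, axis=0), eps)
    return D / norms                          # unit l2-norm atoms
```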

More recently, Bao, Ji, Quan, and Shen (2014) proposed an alternating proximal iteration scheme for dictionary learning with an $\ell_0$-norm constraint. The alternating proximal method, actually called the proximal gradient method, is essentially a gradient-based method. Although it can improve on a pure gradient method because of PO, such gradient-based methods usually have linear convergence, so there is still room for improvement. Rakotomamonjy (2013) proposed an algorithm named Dir that jointly optimizes the dictionary and the sparse decompositions in one stage, based on a nonconvex proximal splitting framework instead of alternating optimizations. This approach is another extension of the gradient-based method. Other relevant research is the work on online dictionary learning (Mairal, Bach, Ponce, & Sapiro, 2010; Skretting & Engan, 2010). Online algorithms continuously update the dictionary as each training vector is processed. Because of this, they can handle very large sets of signals, such as commonly occur in image processing tasks (Mairal et al., 2010). Skretting and Engan (2010) proposed an extension of the method of optimal directions based on recursive least squares for this online framework.

## 3 Fast Dictionary Learning Based on the Proximal Operator

In this section, we formulate the dictionary learning problem with constraints over the coefficient matrix and the dictionary and introduce the proximal operator. Then we propose the fastPDL algorithm based on PO and analyze this algorithm, including determining the parameters and the computational complexity.

### 3.1 Problem Formulation

where $\mathbf{d}_l$ and $\mathbf{d}_k$ are different column vectors in $\mathbf{D}$. If each column of $\mathbf{D}$ is normalized to the unit $\ell_2$-norm, the constraint term in equation 3.2 is equivalent to $\sum_{l \neq k} |\cos \theta_{lk}|$, which sums the cosine values of the angles between any two different atoms. $\theta_{lk}$ is the angle between atoms *l* and *k*, where $\cos \theta_{lk} = \mathbf{d}_l^T \mathbf{d}_k$. Hence, reducing equation 3.2 means enlarging the angles between the column vectors in $\mathbf{D}$. This also means enhancing the incoherence of the dictionary. Thus, a dictionary learned with such a constraint will be more efficient.
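The pairwise angles and a coherence penalty of this kind can be computed as follows. Treating the penalty as the sum of absolute cosines over atom pairs is our reading of equation 3.2, not a verbatim reproduction:

```python
import numpy as np

def coherence_stats(D):
    """Mutual coherence, summed |cos(theta)| penalty, and pairwise angles (degrees)."""
    Dn = D / np.linalg.norm(D, axis=0)     # unit l2-norm columns
    G = Dn.T @ Dn                          # Gram matrix: cosines of pairwise angles
    iu = np.triu_indices(G.shape[1], k=1)  # distinct pairs l < k
    cos = np.clip(np.abs(G[iu]), 0.0, 1.0)
    return cos.max(), cos.sum(), np.degrees(np.arccos(cos))
```

Larger angles (cosines near zero) indicate a more incoherent, and hence more efficient, dictionary.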

### 3.2 The Proximal Operator

Here equation 3.4 is separable over each entry $y_j$ (the $j$th entry of $\mathbf{y}$). The closed-form solution of equation 3.4 can be expressed by the entrywise soft-thresholding operation $x_j = \operatorname{sign}(y_j)\max(|y_j| - \lambda, 0)$.
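Assuming the sparsity term is the $\ell_1$-norm, the proximal operator reduces to entrywise soft thresholding, which can be sketched as:

```python
import numpy as np

def prox_l1(y, lam):
    """Proximal operator of lam*||.||_1: argmin_x 0.5*(x - y)^2 + lam*|x|, entrywise."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)  # soft threshold
```

Each entry is shrunk toward zero by `lam` and clipped at zero, which is what produces exact zeros in the coefficients.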

### 3.3 Proposed Algorithm: The Fast Proximal Dictionary Learning Algorithm

The unknown $\mathbf{X}$ contains *r* pieces arranged as its rows, while the unknown $\mathbf{D}$ contains *r* pieces arranged as its columns. Denote the *k*th column in $\mathbf{D}$ by the corresponding vector $\mathbf{d}_k$ and the *k*th row in $\mathbf{X}$ by the corresponding vector $\mathbf{x}^k$ ($\mathbf{x}^k$ is the transposed vector of the *k*th row but not the *k*th column in $\mathbf{X}$). Thus, we decompose the multiplication $\mathbf{D}\mathbf{X} = \sum_{k=1}^{r} \mathbf{d}_k \mathbf{x}^k$ into the sum of *r* rank-1 matrices. Hence, the cost function, equation 3.3, can be rewritten as

Hence, we minimize a set of local cost functions, equation 3.12, for all *k*, instead of minimizing the cost function, equation 3.3. The set of local cost functions, equation 3.12, is not only nonconvex with respect to $\mathbf{d}_k$ and $\mathbf{x}^k$ jointly but also nonsmooth because of the regularization term on $\mathbf{x}^k$ and the penalty term on $\mathbf{d}_k$. As mentioned, we use an alternating optimization so that the nonconvex optimization is transformed into two convex subproblems. Basically, the alternating optimization consists of updating $\mathbf{x}^k$ with $\mathbf{d}_k$ fixed and then updating $\mathbf{d}_k$ with $\mathbf{x}^k$ fixed, alternately.

Note that $\mathbf{x}^k$ is not the *k*th column vector in $\mathbf{X}$ but the transposed vector of the *k*th row. Although this subproblem is nonsmooth, it has the exact form of equation 3.6, so we can apply PO to obtain the closed-form solution. From the definition of PO, equation 3.8, it is natural to obtain the closed-form solution of equation 3.13, given in equation 3.14, where $h_{kj}$ is the *j*th entry of $\mathbf{h}^k$. By this, we can sequentially optimize the set of $\{\mathbf{x}^k\}$.

The coherence between any two different atoms, *l* and *k*, satisfies $|\mathbf{d}_l^T \mathbf{d}_k| < 1$. Thus, a reasonable approximation is to disregard this cross term. This means that the third term on the right-hand side of equation 3.9 disappears, and equation 3.10 simplifies accordingly. Then we can get the closed-form solution of equation 3.15 from equations 3.9 and 3.10, given in equation 3.16, where $w_{ki}$ and $w_{li}$ are the *i*th entries of $\mathbf{w}_k$ and $\mathbf{w}_l$, respectively. By this, we can sequentially optimize the set of $\{\mathbf{d}_k\}$. In addition, to prevent the dictionary $\mathbf{D}$ from having arbitrarily large values, each column is normalized to the unit $\ell_2$-norm after each update.

The use of coherence as a penalty term imposed on the approximation error was studied in Sigg et al. (2012), Lin et al. (2012), Mailhe et al. (2012), Bao, Ji, and Quan (2014), and Wang et al. (2014). In those studies, the dictionary is updated by the SVD method or by partial derivatives, which is costly. In this letter, we treat the coherence as a penalty term on $\mathbf{D}$ and use one-by-one atom updating. While we update one atom, the others are fixed. The set of subproblems with respect to the dictionary atoms can thus be easily solved by PO. Here, PO successfully handles the nonsmooth coherence penalty, which is quite different from the approach in Sigg et al. (2012), Lin et al. (2012), Mailhe et al. (2012), Bao, Ji, and Quan (2014), and Wang et al. (2014).

Note that some methods such as ISTA (Daubechies et al., 2004) or its accelerated version (FISTA) (Beck & Teboulle, 2009) are also based on PO but are mainly devoted to sparse coding. In addition, some approaches for dictionary learning such as Dir (Rakotomamonjy, 2013) and Bao’s work (Bao, Ji, Quan, & Shen, 2014) are also based on PO. The proximal algorithm in these studies (Daubechies et al., 2004; Beck & Teboulle, 2009; Combettes & Wajs, 2005; Combettes & Pesquet, 2010; Kazerouni, Kamilov, Bostan, & Unser, 2013; Rakotomamonjy, 2013; Bao, Ji, Quan, & Shen, 2014) is also called the proximal gradient method. A difference in this letter is that after applying the PO technique, we further obtained closed-form solutions, as shown in equations 3.14 and 3.16, to the optimization subproblems, which can make the algorithm more efficient.

By minimizing the set of piecewise quadratic and univariate functions 3.13 and 3.15 for all *k* based on PO, we obtain the set of sequential learning rules 3.14 and 3.16 in closed form. To reduce the computational complexity further, we work factor by factor to avoid computing the residual matrix for each *k*. In other words, before updating the coefficient rows, we need only compute the products of the factors ($\mathbf{D}^T\mathbf{Y}$ and $\mathbf{D}^T\mathbf{D}$) once to update all $\mathbf{x}^k$. While updating the dictionary, we need only compute the products of the factors ($\mathbf{Y}\mathbf{X}^T$ and $\mathbf{X}\mathbf{X}^T$) once to update all $\mathbf{d}_k$. To accelerate the convergence rate, we update all the coefficient vectors sequentially while fixing all dictionary atoms; we then update all the atoms sequentially while fixing all the coefficient vectors. The entire process is repeated until convergence is achieved. Hence, the whole problem, equation 3.3, is solved, leading to an efficient and fast algorithm. Following the analysis above, the proposed fastPDL algorithm for dictionary learning is summarized in algorithm 2.
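The factor-by-factor scheme of this paragraph can be sketched as follows. This is a simplified illustration rather than the letter's exact rules 3.14 and 3.16: the coherence penalty is omitted, and the cached products ($\mathbf{D}^T\mathbf{Y}$, $\mathbf{D}^T\mathbf{D}$, $\mathbf{Y}\mathbf{X}^T$, $\mathbf{X}\mathbf{X}^T$) are our assumptions about which factor products are precomputed:

```python
import numpy as np

def fastpdl_sketch(Y, D, X, lam=0.1, n_iter=30):
    """Sequential one-by-one row/atom updates with cached factor products."""
    D, X = D.copy(), X.copy()
    for _ in range(n_iter):
        # Update rows of X with D fixed; compute D^T Y and D^T D once per sweep.
        DtY, DtD = D.T @ Y, D.T @ D
        for k in range(D.shape[1]):
            c = DtY[k] - DtD[k] @ X + DtD[k, k] * X[k]   # equals d_k^T E_k
            X[k] = np.sign(c) * np.maximum(np.abs(c) - lam, 0.0) / DtD[k, k]
        # Update atoms of D with X fixed; compute Y X^T and X X^T once per sweep.
        YXt, XXt = Y @ X.T, X @ X.T
        for k in range(D.shape[1]):
            if XXt[k, k] < 1e-12:
                continue                                 # unused atom: skip
            d = (YXt[:, k] - D @ XXt[:, k] + D[:, k] * XXt[k, k]) / XXt[k, k]
            nrm = max(np.linalg.norm(d), 1e-12)
            D[:, k] = d / nrm                            # unit l2-norm atom
            X[k] *= nrm                                  # rescale row: product unchanged
    return D, X
```

Each row update is the exact minimizer of its local piecewise quadratic subproblem (soft thresholding of the correlation with the residual), so the objective decreases monotonically across sweeps.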

In the fastPDL algorithm, convergence is guaranteed. The whole dictionary learning problem is decomposed into a set of piecewise quadratic and univariate functions, 3.13 and 3.15, for all *k*. Moreover, the value of *a*, the second-order derivative of each piecewise quadratic and univariate function, is greater than zero. The exact minimizers can be obtained using equations 3.14 and 3.16, which are stationary points. Because the cost function, equation 3.3, decreases monotonically over each update of $\mathbf{x}^k$ and $\mathbf{d}_k$, alternately, and the cost function is bounded below, the algorithm converges.

### 3.4 Determining the Parameters in the FastPDL Algorithm

Our dictionary learning algorithm requires the coefficient matrix to be as sparse as possible and simultaneously requires the atoms of the dictionary to be as incoherent as possible. For the cost function, equation 3.3, the sparsity regularization parameter can be adjusted to control the trade-off between the approximation error and the sparsity of the coefficient matrix $\mathbf{X}$; larger values result in a sparser coefficient matrix. This parameter can be determined by offline calibration or adaptive tuning. Experiments show that the values determined in these two ways are very close; in this letter, we find the optimal value by offline calibration. The coherence penalty parameter can be adjusted for a trade-off between the accuracy of reconstruction and the incoherence of $\mathbf{D}$. We chose a setting that yields a relatively low approximation error and adequate reduction of the redundancy of the dictionary.

To show the performance of the fastPDL algorithm, one evaluation compares the angles between any two atoms in two dictionaries: one learned by fastPDL with both penalties and one learned by fastPDL with only the sparsity penalty term (coherence penalty set to zero). With these two variants, we obtained two groups of dictionary learning results from a natural image (see "Peppers" in Figure 9), and we plotted the sorted angles between atoms in the learned dictionaries in Figure 1. It can be seen that the angles between atoms in the dictionary learned with the coherence penalty are enlarged, and the incoherence of the learned dictionary is obviously enhanced.

### 3.5 Analysis of the Computational Complexity

From the implementation point of view, we consider the computational complexity of our algorithm. The complexity is represented by the parameters *s*, *m*, *r*, and *L*, which correspond to the number of nonzero elements in each column of $\mathbf{X}$, the dimensionality of the signals, the number of columns in the dictionary, and the number of training samples, respectively. Generally, we can assume $s \ll m < r \ll L$ for dictionary learning. FastPDL decomposes the matrices into sets of row vectors or column vectors because it is then easy to find the optimal solutions in closed form. K-SVD uses the same decomposition scheme; however, its dictionary update is performed through an SVD decomposition of the residual matrix, which is costly. For fastPDL's dictionary update step, we used the closed-form solution, equation 3.16. The complexity mainly depends on the products of the factors and is much lower than that of the SVD. The complexity of the sparse coding is also much lower than that of the OMP used in the K-SVD and MOD methods. In addition, in fastPDL both the supports (nonzero positions) and the entries at the supports in $\mathbf{X}$ are updated, unlike in the MOD and K-SVD methods, in which the supports are not changed. This contributes to further improvement in the performance of dictionary learning. Gradient-based methods, such as Dir (Rakotomamonjy, 2013) and Bao's work (Bao, Ji, Quan, & Shen, 2014), require gradient computations whose per-iteration cost is slightly less than that of fastPDL. On the other hand, because they are gradient based, they require more iterations than fastPDL, and their convergence rates rely on the step sizes. Hence, the fastPDL algorithm has advantages in complexity and convergence rate and has fewer parameters that require tuning.

## 4 Simulations

In this section, to evaluate the learning capability and efficiency of fastPDL, we present the results of some numerical experiments. The first experiment is on synthetic data generated from a random dictionary taken as ground truth; we then use the proposed algorithm on the data to learn a new dictionary to compare with the ground truth dictionary. In this way, a dictionary learning algorithm should be able to extract the common features of the set of signals, which are exactly the generating atoms. Furthermore, we perform another experiment in which white gaussian noise of various signal-to-noise ratios (SNRs) is added to the synthetic signals, to evaluate the algorithm's performance and robustness against noise.^{1} In addition, we apply fastPDL to real-world signals with noise to verify the applicability of the proposed algorithm.

In the experiments, all programs were coded in Matlab and were run within Matlab (R2014a) on a PC with a 3.0 GHz Intel Core i5 CPU and 12 GB of memory, under the Microsoft Windows 7 operating system.

### 4.1 Dictionary Recovery from Synthetic Data

First, we generated a dictionary by normalizing a random matrix of size $m \times r$ with independent and identically distributed (i.i.d.) uniformly random entries. Each column was normalized to the unit $\ell_2$-norm. This dictionary, referred to as the ground truth dictionary, was not used in the learning but only for evaluation. Then a collection of *L* samples of dimensionality *m* was synthesized, each as a linear combination of *s* different columns of the dictionary, with i.i.d. uniformly random coefficients in independent, random positions. Here *s* also denotes the number of nonzero elements in each column of the coefficient matrix $\mathbf{X}$.
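The data-generation protocol just described can be sketched as follows (the default sizes follow this section's settings; the function name is ours):

```python
import numpy as np

def make_synthetic(m=20, r=50, L=1500, s=3, seed=0):
    """Ground-truth dictionary, s-sparse coefficients, and the synthesized signals."""
    rng = np.random.default_rng(seed)
    D = rng.uniform(-1.0, 1.0, size=(m, r))
    D /= np.linalg.norm(D, axis=0)                    # unit l2-norm atoms
    X = np.zeros((r, L))
    for j in range(L):
        supp = rng.choice(r, size=s, replace=False)   # random support positions
        X[supp, j] = rng.uniform(-1.0, 1.0, size=s)   # i.i.d. uniform coefficients
    return D, X, D @ X
```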

Then we applied all the algorithms (K-SVD, MOD, Rakotomamonjy's Dir, and the proposed fastPDL) to the generated signals to learn dictionaries and recorded their performance. Matlab code for the K-SVD and MOD algorithms is available online (http://www.cs.technion.ac.il/∼elad//software/), as is Matlab code for the Dir algorithm (http://asi.insa-rouen.fr/enseignants/∼arakoto/publication.html/). All algorithms were initialized with the same dictionary (made by randomly choosing *r* columns of the generated signals, followed by normalization) and with the same randomly generated coefficient matrix.

In the experiment, the number of signals *L* was set to 1500, and the dimensionality *m* was 20. For the dictionary, the number of atoms *r* was 50. We varied *s* from 3 to 12. All trials were repeated 15 times for each algorithm and for each value of *s*. For all algorithms, the stopping criterion was that the relative change of the objective function value in two consecutive iterations fall below a small, positive constant. For Dir and fastPDL, a small threshold was used for the stopping criterion, and the maximum number of iterations was set to 10,000 for both. K-SVD and MOD imposed a very strong sparsity constraint on the data by limiting the number of nonzero elements in each column of $\mathbf{X}$ produced by OMP to a small number. These methods thus yielded signal representations that are not particularly accurate, so a larger threshold was used for K-SVD and MOD. In addition, the implementation of OMP was very slow, so for K-SVD and MOD, the maximal number of iterations was set to a reasonable value (e.g., 3000). These settings ensured that the stopping criterion would be satisfied within the maximal number of iterations and that the algorithms could find the minimum of the objective function. The average recovery ratios and corresponding standard deviations for the learned dictionaries are shown with respect to the number of nonzero elements in each column of $\mathbf{X}$ in Figure 2a. The average running time and standard deviation of each algorithm are shown versus the number of nonzero elements in Figure 2b. It can be seen that in all cases, fastPDL performed the best for both the recovery ratio and the running time. The recovery ratio of fastPDL was much higher than those of K-SVD and MOD in all cases. In particular, fastPDL remained effective even when the number of nonzero elements *s* was very high. Furthermore, in computational running time, fastPDL was remarkably fast and stable compared with K-SVD and MOD. It is clear that K-SVD and MOD, with their two costly stages, were more expensive. Compared with Dir, fastPDL was slightly better in the ratio of recovered atoms, and it had a significant advantage in running time. The reason fastPDL outperformed Dir is that the latter employs a gradient descent method that optimizes gradually, which is time-consuming even though Dir's optimization occurs in one stage only.

Obviously, the four algorithms described above have different convergence properties. To examine the convergence behaviors of all the algorithms, we executed them repeatedly and determined numbers of iterations that could ensure convergence of each algorithm: 3000 iterations for Dir and fastPDL and 1000 iterations for K-SVD and MOD, because K-SVD and MOD cost much more time than Dir and fastPDL per iteration. Fifteen trials were conducted, and performance results were averaged. With respect to the number of iterations and the running time per iteration, the average recovery ratios for the learned dictionaries are shown in Figure 3. It can be seen in Figure 3a that the number of iterations required for convergence for K-SVD and MOD was smaller than that for fastPDL; however, the computational complexity of each iteration of K-SVD and MOD was much greater than that of fastPDL. Hence, the convergence rate of fastPDL versus the running time per iteration was faster than that of K-SVD and MOD in Figure 3b. The computational complexity of each iteration of Dir was a little less than that of fastPDL; however, the number of iterations required for convergence for Dir was greater than that for fastPDL in Figure 3a, because gradient descent converges slowly. Hence, the convergence rate of fastPDL versus the running time per iteration was faster than that of Dir in Figure 3b. Therefore, fastPDL had significant advantages in convergence rate.

As is well known, K-SVD and MOD require a specific, fixed number of nonzero elements in each column of the coefficient matrix, so their sparsity is fixed during the iterations. For Dir and fastPDL, the sparsity of the coefficient matrix was adjusted via the regularization parameters. Figure 4a depicts the average recovery ratios and corresponding standard deviations for the dictionary learned by fastPDL with respect to the regularization parameters. The cost function values of fastPDL versus iteration number are shown in Figure 4b for different parameter values; for the appropriate choice, the recovery ratio was better and the cost function values were smaller and converged faster. This parameter was fixed during the iterations to reduce the number of iterations and the computational requirements. The sparsity of the coefficient matrix, measured as the number of nonzero elements divided by the total number of elements (zero and nonzero), is shown in Figures 5a and 5b against the number of iterations and the running time per iteration, respectively. Lower sparsity values generally make the dictionary more efficient. The sparsity of fastPDL converged faster and reached a lower value than that of Dir; the sparsities of K-SVD and MOD are not plotted because they are fixed (3/50) by the OMP method. In addition, we show histograms of the effective sparsity (the number of nonzero elements in each column of the coefficient matrix) of the samples' representations for fastPDL. In the randomly generated ground truth, each column had three nonzero elements; hence the number of coefficient vectors with three nonzero elements was 1500 in Figure 6a. Note that the number of coefficient vectors equals the number of samples. For fastPDL, the number of coefficient vectors with three nonzero elements was 685 (Figure 6b), and such vectors were dominant in its output. Thus, the histogram confirms that the columns of the coefficient matrix obtained by fastPDL were sparse.
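The sparsity measure used above (nonzero entries divided by total entries) and the per-column effective sparsity behind the histograms can be computed directly. A minimal NumPy sketch, with a small tolerance standing in for exact zeros:

```python
import numpy as np

def sparsity(X, tol=1e-12):
    """Overall sparsity level of a coefficient matrix X: the number of
    nonzero elements divided by the total number of elements."""
    return np.count_nonzero(np.abs(X) > tol) / X.size

def column_support_sizes(X, tol=1e-12):
    """Effective sparsity: the number of nonzero elements in each column,
    as used for the per-sample histograms."""
    return np.count_nonzero(np.abs(X) > tol, axis=0)
```

A histogram of `column_support_sizes(X)` then reproduces the kind of plot shown in Figure 6.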

In addition, we conducted an experiment to show how the recovery rate of the atoms depends on the redundancy ratio of the dictionary, defined in terms of *m*, the dimensionality of the dictionary atoms, and *r*, the number of dictionary atoms. In the experiment, *r* was fixed at 50. If the dictionary can be recovered even when the redundancy ratio is small, the sparse coefficient matrix can be recovered from fewer linear combinations of atoms, and the sparse representation is therefore more efficient. To evaluate the dictionary learning ability as the redundancy ratio is reduced, we repeated the dictionary recovery experiment under different redundancy ratios. Fifteen trials were conducted, and the performance results were averaged. The average recovery rates and corresponding standard deviations are shown in Figure 7a. The fastPDL algorithm displayed a superior recovery rate to the other algorithms; in particular, it remained effective at small redundancy ratios, where the other algorithms deteriorated greatly. Furthermore, the sparsity of the coefficient matrix is shown in Figure 7b. It can be seen that the sparsity of fastPDL reached lower values than that of Dir under the various redundancy ratios; the sparsities of K-SVD and MOD are not plotted because they are fixed.
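The recovery rate of atoms can be measured by matching each ground-truth atom to its closest learned atom. A common criterion in the dictionary learning literature, assumed here as an illustration, counts an atom as recovered when the absolute inner product of the unit-norm atoms exceeds a threshold such as 0.99 (the absolute value handles the sign ambiguity of atoms):

```python
import numpy as np

def recovery_ratio(D_true, D_learned, thresh=0.99):
    """Fraction of ground-truth atoms (columns of D_true) that have a
    near-identical counterpart among the learned atoms."""
    # Normalize all atoms to unit l2 norm.
    Dt = D_true / np.linalg.norm(D_true, axis=0, keepdims=True)
    Dl = D_learned / np.linalg.norm(D_learned, axis=0, keepdims=True)
    # |inner product| close to 1 means the atom was recovered.
    G = np.abs(Dt.T @ Dl)  # r_true x r_learned similarity matrix
    recovered = np.sum(G.max(axis=1) >= thresh)
    return recovered / Dt.shape[1]
```

The threshold and matching rule are assumptions for illustration; the paper's exact criterion may differ.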

### 4.2 Evaluating the Performance and Robustness against Noise for Different Noise Levels

Besides the noiseless condition, we performed another experiment in which white gaussian noise at various SNRs corrupted the signals, to evaluate each algorithm's performance and robustness against noise. We added noise at SNR levels of 10 dB, 20 dB, and 30 dB according to the model with an additive noise matrix. The settings for the experiment are , , , and . For each of the tested algorithms and each SNR level, 15 trials were conducted and the results averaged. The average recovery ratios and corresponding standard deviations for the learned dictionaries at 10 dB, 20 dB, 30 dB, and in the noiseless case are shown in Figure 8a. The means and standard deviations of the running times (until the relative change of the objective function value in two consecutive iterations fell below a small positive constant) are shown in Figure 8b. fastPDL performed best at every noise level; although the differences in recovery ratio between fastPDL and Dir were small under the various conditions, fastPDL had a remarkable advantage in running time.
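Generating the noisy observations for a target SNR can be sketched as follows. This is a hypothetical helper, not the authors' code: the noise matrix is scaled so that the SNR of note 1 (20·log10 of the ratio of signal norm to noise norm, here in the Frobenius norm) hits the requested level:

```python
import numpy as np

def add_noise_snr(Y, snr_db, seed=None):
    """Corrupt Y with white gaussian noise scaled to a target SNR in dB,
    where SNR = 20*log10(||Y||_F / ||noise||_F)."""
    rng = np.random.default_rng(seed)
    V = rng.standard_normal(Y.shape)
    # Rescale the noise so the achieved SNR equals snr_db exactly.
    V *= np.linalg.norm(Y) / (np.linalg.norm(V) * 10.0 ** (snr_db / 20.0))
    return Y + V
```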

### 4.3 Applications and Performance Evaluations

Dictionary learning for sparse representation of signals has many applications, such as denoising, inpainting, and classification. In this section, we apply fastPDL to one particular application, denoising, that is, the removal of noise from a signal. An image contains only nonnegative values, whereas audio data contain both nonnegative and negative values; this motivates us to apply the fastPDL algorithm to both image processing and audio processing.

#### 4.3.1 Image Denoising

Although the proposed fastPDL algorithm does not require signals to be nonnegative, it works effectively in the nonnegative case, as we now show.

In the additive noise model, white gaussian noise with various standard deviations corrupted the clean image. We used a noisy image as test data to learn a dictionary, which was then used to remove the noise from the noisy observed image. The denoising method is based on the K-SVD method implemented in Elad and Aharon (2006); to provide a fair comparison, all settings were the same as those used with K-SVD. For the dictionary learning stage, we chose an overcomplete DCT dictionary as the initialization for the fastPDL algorithm, and the number of iterations was set to 10. The sparse coding stage was based on the OMP algorithm. After dictionary learning, we used the learned dictionary to reconstruct the images. Some standard test images, including Boat (512×512), Barbara (512×512), Peppers (256×256), and House (256×256), were used in the experiment, and one of the reconstruction results is presented in Figure 9. We deliberately show the denoised images and the final adaptive dictionaries for a case with very strong noise, for which fastPDL was still effective. We used two quality measures, the peak SNR (PSNR) and the structural similarity (SSIM) (Wang, Bovik, Sheikh, Sapiro, & Simoncelli, 2004), to assess the denoised images. SSIM ranges between 0 and 1 and equals 1 only when the two images are identical. The average PSNR and SSIM of the denoised images over 10 trials at different noise levels are summarized in Table 1: in some situations fastPDL obtained better results, and its averaged results were very close to those of K-SVD. The average running times are summarized in Table 2, which shows that the running time of fastPDL was shorter than that of K-SVD. An interesting phenomenon is that the running time for all the images decreased with increasing noise level. This may be because at a higher noise level, the image must be represented with fewer atoms, since more atoms are noisy.
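The PSNR quality measure reported in Table 1 is the standard one for 8-bit images; a minimal sketch:

```python
import numpy as np

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio (dB) between a clean reference image
    and a denoised estimate, for intensities in [0, peak]."""
    diff = np.asarray(reference, float) - np.asarray(estimate, float)
    mse = np.mean(diff ** 2)  # mean squared error over all pixels
    return 10.0 * np.log10(peak ** 2 / mse)
```

SSIM is more involved (local means, variances, and covariances with stabilizing constants); standard implementations exist, e.g., in scikit-image.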

Table 1: Average PSNR (dB) and SSIM of the denoised images over 10 trials.

| σ / PSNR | Method | Boat PSNR | Boat SSIM | Barbara PSNR | Barbara SSIM | Peppers PSNR | Peppers SSIM | House PSNR | House SSIM |
|---|---|---|---|---|---|---|---|---|---|
| 5 / 34.16 | K-SVD | 37.0255 | 0.9396 | 37.6528 | 0.9594 | **37.901** | 0.9548 | **39.4905** | **0.9555** |
| | fastPDL | **37.0284** | 0.9396 | **37.6529** | **0.9595** | 37.897 | 0.9548 | 39.4765 | 0.9554 |
| 10 / 28.15 | K-SVD | **33.3949** | 0.8814 | **33.9599** | **0.9257** | 34.2115 | 0.9226 | 35.965 | 0.9062 |
| | fastPDL | 33.3943 | 0.8814 | 33.9241 | 0.9252 | **34.2357** | **0.9227** | **35.9705** | **0.9063** |
| 15 / 24.61 | K-SVD | 31.4239 | 0.8357 | **31.96** | **0.8971** | **32.1934** | **0.8981** | **34.3171** | **0.8771** |
| | fastPDL | **31.442** | **0.8358** | 31.9022 | 0.896 | 32.1466 | 0.8970 | 34.2136 | 0.8764 |
| 25 / 20.12 | K-SVD | 29.0198 | 0.7621 | **29.2317** | **0.8333** | **29.7746** | **0.8574** | **32.2081** | **0.8463** |
| | fastPDL | **29.0523** | **0.7626** | 29.1088 | 0.8295 | 29.6686 | 0.8551 | 32.0013 | 0.8453 |
| 50 / 14.09 | K-SVD | **25.6511** | **0.6367** | **25.2143** | **0.6899** | **26.1915** | **0.7764** | **27.9702** | **0.7604** |
| | fastPDL | 25.5263 | 0.6310 | 25.0605 | 0.6846 | 25.71391 | 0.7661 | 27.4141 | 0.7512 |


Notes: In each cell, top row: K-SVD; second row: fastPDL. Figures in bold show the best result in each cell.

Table 2: Average running times.

| σ / PSNR | Method | Boat | Barbara | Peppers | House |
|---|---|---|---|---|---|
| 5 / 34.16 | K-SVD | 8.8048 | 8.8978 | 8.3413 | 5.1428 |
| | fastPDL | **3.9209** | **4.2362** | **3.6768** | **2.5609** |
| 10 / 28.15 | K-SVD | 4.7485 | 4.9622 | 4.5821 | 3.3541 |
| | fastPDL | **2.5749** | **2.534** | **2.2583** | **1.7438** |
| 15 / 24.61 | K-SVD | 3.3470 | 3.5784 | 3.4730 | 2.7144 |
| | fastPDL | **1.9409** | **2.023** | **1.8688** | **1.4860** |
| 25 / 20.12 | K-SVD | 2.8252 | 2.916 | 2.8641 | 2.4572 |
| | fastPDL | **1.6982** | **1.6823** | **1.4967** | **1.3891** |
| 50 / 14.09 | K-SVD | 2.4674 | 2.5537 | 2.3606 | 2.3212 |
| | fastPDL | **1.4493** | **1.4858** | **1.3119** | **1.2871** |


Notes: In each cell, top row: K-SVD; second row: fastPDL. Figures in bold show the best result for each cell.

#### 4.3.2 Noise Cancellation of Audio Signal

In this section, we apply the proposed fastPDL algorithm to a more general signal, that is, one that is not constrained to be nonnegative. An audio signal is an appropriate example.

## 5 Conclusion

The main motivation of this letter was to develop a fast and efficient algorithm for learning an incoherent dictionary for sparse representation. The problem was cast as minimizing the approximation error function with a coherence penalty on the dictionary atoms and a sparsity regularization on the coefficient matrix. To solve the problem efficiently, we turned it into a set of minimizations of piecewise quadratic, univariate subproblems, each a single-variable vector problem over one dictionary atom or one coefficient vector. Although the subproblems were still nonsmooth, remarkably, we could find their optimal solutions in closed form using proximal operators. This led to the so-called fastPDL algorithm. Interestingly, this algorithm updates the dictionary and the coefficient matrix in the same manner, forcing the coefficients to be as sparse as possible while simultaneously forcing the dictionary atoms to be as incoherent as possible. We have verified that fastPDL is efficient in both computational complexity and convergence rate; these advantages arise because the optimal dictionary atoms and coefficient vectors are obtained in closed form without iterative inner optimization. The proposed fastPDL algorithm is quite simple but very efficient, and our numerical experiments have shown that it outperforms state-of-the-art algorithms. In addition, we have described two applications of the proposed algorithm, one for a nonnegative signal and the other for a general signal, and have shown that fastPDL has a significant advantage in computation time for both. Further applications remain as future work.
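To illustrate the kind of closed-form update that proximal operators provide, consider the classic instance, the prox of the ℓ1 sparsity penalty, which is elementwise soft thresholding. This is only an illustration of the general technique; the full operator for the coherence penalty used by fastPDL is derived in the appendix:

```python
import numpy as np

def soft_threshold(z, lam):
    """prox of lam*||.||_1: the closed-form minimizer of
    0.5*||x - z||^2 + lam*||x||_1, applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
```

Entries smaller than the threshold are set exactly to zero, which is how a proximal update produces genuinely sparse coefficient vectors in a single step.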

## Appendix: Derivation of the Proximal Operator for the Coherence Penalty

## Acknowledgments

This work was supported by the Japan Society for the Promotion of Science under grant no. 26-10950.

## References

## Note

^{1}

SNR$_{\mathrm{dB}}$ is defined as $\mathrm{SNR_{dB}} = 20\log_{10}\left(\|x\|_2 / \|x - y\|_2\right)$, where *x* and *y* denote the original signal and the signal polluted by noise, respectively.
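Under this definition (with the standard ratio of signal norm to noise norm), the SNR computation is a one-liner:

```python
import numpy as np

def snr_db(x, y):
    """SNR (dB) between an original signal x and its noisy version y,
    where the noise is y - x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return 20.0 * np.log10(np.linalg.norm(x) / np.linalg.norm(y - x))
```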