Abstract

We present a fast, efficient algorithm for learning an overcomplete dictionary for sparse representation of signals. The whole problem is cast as the minimization of the approximation error function with a coherence penalty on the dictionary atoms and a sparsity regularization on the coefficient matrix. Because the problem is nonconvex and nonsmooth, it cannot be solved efficiently by ordinary optimization methods. We propose a decomposition scheme and an alternating optimization that turn the problem into a set of minimizations of piecewise quadratic and univariate subproblems, each involving a single vector variable: either one dictionary atom or one coefficient vector. Although the subproblems are still nonsmooth, remarkably they become much simpler, so that we can find a closed-form solution by introducing a proximal operator. This leads to an efficient algorithm for sparse representation. To our knowledge, applying the proximal operator to a problem with an incoherence term and obtaining the optimal dictionary atoms in closed form with a proximal operator technique have not previously been studied. The main advantages of the proposed algorithm are that, as suggested by our analysis and simulation study, it has lower computational complexity and a higher convergence rate than state-of-the-art algorithms. In addition, for real applications, it shows good performance and significant reductions in computational time.

1  Introduction

Situated at the heart of signal processing, data models are fundamental for stabilizing the solutions of inverse problems. The matrix factorization model has become a prominent technique for various tasks, such as independent component analysis (Hyvarinen, 1999), nonnegative matrix factorization (Lee & Seung, 1999; Cichocki & Phan, 2009), compressive sensing (Baraniuk, 2007), and sparse representation (Elad, 2010; Lewicki & Sejnowski, 2000; Kreutz-Delgado, Murray, & Rao, 2003). For the matrix factorization model, we collect the measurements and stack them as the observation matrix Y ∈ R^{m×L}, where m is the dimensionality of each sample y_i. Matrix factorization consists of finding a factorization of the form Y ≈ DX by minimizing the approximation error ‖Y − DX‖_F^2, where the factor D ∈ R^{m×r} is called the dictionary (its column vectors are called atoms) and the other factor, X ∈ R^{r×L}, is called the coefficient matrix. The factors obtained for different tasks are shown to have very different representational properties. The differences mainly come from two aspects: the constraints imposed, such as sparseness, nonnegativity, and smoothness, and the relative dimensionality of D, which is assumed a priori. The relative dimensionality of matrix factorization has two categories: overdetermined (m ≥ r) systems (Lee & Seung, 1999; Cichocki & Phan, 2009) and underdetermined (m < r) systems (Baraniuk, 2007; Elad, 2010; Lewicki & Sejnowski, 2000; Kreutz-Delgado et al., 2003) of linear equations, which have completely different application backgrounds. We investigate the underdetermined linear inverse problem. The straightforward optimization of underdetermined linear inverse problems, with infinitely many solutions, is ill posed. The problem can be solved by imposing a sparsity constraint on X, where the sparsity is associated with the overcompleteness (r > m) of D, so that most coefficients in X become zero. Then each signal is represented as a linear combination of a few dictionary atoms, leading to the issue of sparse representation, where the dictionary is overcomplete.

Sparse representation is of significant importance in signal processing (Elad, Figueiredo, & Ma, 2010; Jafari & Plumbley, 2011; Fadili, Starck, & Murtagh, 2009; Adler et al., 2012; Caiafa & Cichocki, 2012). A key problem related to sparse representation is the choice of the dictionary on which the signals of interest are decomposed. One simple approach is to consider predefined dictionaries, such as the discrete Fourier transform, the discrete cosine transform, and wavelets (Selesnick, Baraniuk, & Kingsbury, 2005). Another approach is to use an adaptive dictionary learned from the signals that are to be represented, which results in better matching to the contents of the signals. The learned dictionary has the potential to offer improved performance compared with that obtained using predefined dictionaries. Many approaches (Engan, Aase, & Husoy, 1999; Aharon, Elad, & Bruckstein, 2006; Dai, Xu, & Wang, 2012; Yaghoobi, Daudet, & Davies, 2009; Bao, Ji, Quan, & Shen, 2014) have been proposed. Most of them use two alternating stages: sparse coding and dictionary update. The sparse coding step finds the best, sparsest coefficient matrix for the training signals. For the sparse coding, the ℓ0-norm is treated as a sparsity constraint. A typical approach is orthogonal matching pursuit (OMP) (Tropp, 2004). OMP, a greedy algorithm, can find sufficiently good representations. However, it may not provide accurate enough estimates of the signals, and its computational complexity is very high. Another appealing method for sparse coding is basis pursuit (BP) (Chen, Donoho, & Saunders, 2001) or LASSO (Tibshirani, 1994), which replaces the ℓ0-norm by an ℓ1-norm. Gradient-based methods (Yaghoobi, Daudet, & Davies, 2009; Rakotomamonjy, 2013) are very effective approaches to solving ℓ1-norm optimizations. The gradient method leads to an iterative soft-thresholding scheme, which requires many iterations, especially if the solution is not very sparse or the initialization is not ideal. For the dictionary update, the atoms are updated either one by one or simultaneously. By optimizing a least squares problem, the whole set of atoms is updated at once in Engan, Aase, et al. (1999), but this is not guaranteed to converge. One-by-one atom updating is implemented in Aharon et al. (2006) through singular value decomposition (SVD); this approach can learn a better dictionary. However, SVD is computationally expensive. The approach in Bao, Ji, Quan, and Shen (2014) updates the dictionary atoms one by one based on a proximal operation method, which is essentially a gradient-based method and converges slowly. Therefore, it is important to develop an efficient strategy to accelerate the dictionary learning process.

In this letter, we address the problem of learning an overcomplete dictionary for sparse representation. According to the theory of compressive sensing, the mutual incoherence of the dictionary plays a crucial role in sparse coding (Sigg, Dikk, & Buhmann, 2012; Lin, Liu, & Zha, 2012; Mailhe, Barchiesi, & Plumbley, 2012; Bao, Ji, & Quan, 2014; Wang, Cai, Shi, & Yin, 2014). If the dissimilarities between any two atoms in the dictionary are high, the sparse representation generated from this dictionary is more efficient. Therefore, we impose an incoherence constraint on the dictionary. For simplicity, we use the ℓ1-norm for sparsity. Hence, the whole problem is constructed as a minimization of the approximation error with a sparsity regularization on the coefficient matrix and a coherence penalty on the dictionary atoms (see section 3.1). From the viewpoint of optimization, the whole problem is nonconvex with respect to the dictionary and the coefficient matrix, and nonsmooth because of the sparsity regularization and the coherence penalty; it therefore cannot be solved efficiently by an ordinary optimization method. To address this, we separate the problem into a series of subproblems, each of which is a minimization over a single vector variable (i.e., a univariate function) and can be solved more easily. Remarkably, each single-vector-variable function has a piecewise quadratic form. Thus, we can apply the proximal operator (PO) (Moreau, 1962) to handle the nonsmoothness of each subproblem and thereby obtain a closed-form solution. This is the case for the dictionary atoms, even though the coherence penalty term is included, as well as for the sparse coefficient vectors. To our knowledge, applying PO to a problem with a coherence penalty term has not appeared in the literature.

An appealing feature of this method is that the subproblems for atom updating and coefficient updating have similar forms. Both are piecewise quadratic and thus can be solved efficiently by the PO technique. Interestingly, this leads explicitly to closed-form solutions. In this way, the whole problem can be solved efficiently and quickly, and we have developed an algorithm, the fast proximal dictionary learning algorithm (fastPDL). This algorithm does not include iterative optimization steps within each update, though it is still a recursive procedure because it treats the atoms and coefficient vectors one by one in each round, and the results from one round affect the results in the next round recursively. The proposed algorithm gains from directly obtaining the closed-form solutions with low complexity and a high convergence rate, avoiding costly techniques for sparse coding, such as OMP (Tropp, 2004) and BP (Chen et al., 2001), and avoiding gradient-based methods (Rakotomamonjy, 2013; Bao, Ji, Quan, & Shen, 2014) with their slow convergence rates. From the theoretical analysis, the fastPDL algorithm has efficiency well beyond that of state-of-the-art algorithms.

The proposed algorithm is expected to have the following desirable characteristics:

  • Our dictionary learning problem is formulated as the minimization of the approximation error with a sparsity regularization on the coefficient matrix and a coherence penalty on the dictionary. The coherence of D and the sparsity of X can be flexibly controlled by adjusting the corresponding regularization parameters.

  • We turn the whole problem into a set of univariate optimization subproblems. Each is piecewise quadratic, so we can apply PO to solve it and give a closed-form solution explicitly. These lead to an algorithm with low computational complexity and a high convergence rate.

  • While the PO method has been used to solve sparse coding or compressive sensing problems with an ℓ1-norm, to our knowledge, no such treatment exists for dictionary learning. In this letter, because we use a coherence penalty term, the dictionary learning subproblem is also nonsmooth, so that ordinary optimization methods become inefficient. We show that the PO technique can also be applied to this problem and gives a closed-form solution. Thus, while the coherence penalty term of the dictionary is incorporated into the dictionary learning problem as an additional constraint, it does not increase the complexity of the fastPDL algorithm.

The remainder of the letter is organized as follows. In section 2, we introduce the dictionary learning problem, including sparse coding and dictionary update, and provide an overview of state-of-the-art algorithms. In section 3, we describe and analyze the fastPDL algorithm in detail. Numerical experimental studies described in section 4 clearly establish the practical advantages of the proposed fastPDL algorithm. We show that this algorithm converges significantly faster than state-of-the-art algorithms. We also present two applications, one involving nonnegative signals and the other a general real-valued signal. The letter concludes in section 5.

1.1  Notation

A boldface uppercase letter, such as X, denotes a matrix, and a lowercase letter x_{ij} denotes the (i, j)th entry of X. A boldface lowercase letter x denotes a vector, and a lowercase letter x_j denotes the jth entry of x. x^i denotes the ith row, and x_j denotes the jth column of the matrix X. X^T denotes the transpose of the matrix X. The Frobenius norm of X is defined as ‖X‖_F = (Σ_{ij} x_{ij}^2)^{1/2}. The ℓ1-norm of X is defined as ‖X‖_1 = Σ_{ij} |x_{ij}|. The ℓ2-norm and the ℓ1-norm of x are defined as ‖x‖_2 = (Σ_j x_j^2)^{1/2} and ‖x‖_1 = Σ_j |x_j|, respectively. ‖x‖_0 is the so-called ℓ0-norm, which counts the number of nonzero elements. tr(·) denotes the trace of a square matrix. In this letter, all parameters take real values, even though we do not state this explicitly each time.

2  Overview of Dictionary Learning Problem

Dictionary learning, as a matrix factorization, is a procedure for factorizing the signal data Y as the product of D and X. Generally the problem is solved by the following minimization:

\min_{D, X} \|Y - DX\|_F^2.   (2.1)

However, if m is less than r, the case that we are interested in, the problem becomes ill posed. Therefore, some constraints, such as sparsity, should be imposed on X. Usually the sparsity can be measured by, for instance, the ℓ0-norm or the ℓ1-norm. It is worth noting that the error function, equation 2.1, is not convex with respect to D and X. Most dictionary learning algorithms attack this problem by iteratively performing a two-stage procedure: sparse coding and dictionary update. Starting with an initial dictionary, the following two stages are repeated until convergence.

2.1  Sparse Coding

Sparse coding is the process of computing the coefficient matrix X based on the set of signals Y and a known dictionary D by solving the following problem,
\min_{X} \|Y - DX\|_F^2 \quad \text{subject to} \quad \|x_i\|_0 \le s, \; i = 1, \ldots, L,   (2.2)

where s is a given value bounding the number of nonzero entries in each coefficient vector. Exact determination of the problem has been proven to be NP-hard. Thus, approximate solutions are considered instead, and several efficient algorithms have been proposed. The simplest ones are matching pursuit (MP) (Mallat & Zhang, 1993) and OMP, which are greedy algorithms. However, these methods may not provide a good enough estimation of the signals and are not suitable for high-dimensional problems. The relaxation method uses the ℓ1-norm instead of the ℓ0-norm and is suitable for large-scale optimization problems. The most successful approach is BP. The problem is given as
\min_{X} \|X\|_1 \quad \text{subject to} \quad Y = DX.   (2.3)
Solutions can be found by linear programming methods. For large-scale problems with thousands of signals, this is tractable but very slow. In recent years, several authors have proposed improvements to BP to speed up the algorithm, based on the following regularized problem:
\min_{X} \|Y - DX\|_F^2 + \lambda \, \psi(X),   (2.4)
where ψ(X) is a sparsity-inducing regularizer on X and λ is a regularization parameter. Thus, the solution to sparse approximation becomes an ℓp-regularized optimization problem, where ψ(X) = ‖X‖_p^p with 0 ≤ p ≤ 1. When p = 1, it is an ℓ1-regularized problem that has gained increasing popularity because the problem is convex and can be efficiently solved using, for example, the iterative shrinkage-thresholding algorithm (ISTA) (Daubechies, Defrise, & De Mol, 2004; Beck & Teboulle, 2009), iteratively reweighted least squares (IRLS) (Chartrand & Yin, 2008), LARS (Efron, Hastie, Johnstone, & Tibshirani, 2004), and the method of Lee et al. (Lee, Battle, Raina, & Ng, 2007).
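As a concrete illustration of the ℓ1-regularized route in equation 2.4, the sketch below implements plain ISTA with a constant step size. The function name, the default parameter values, and the fixed iteration count are illustrative choices, not the settings used in this letter.

```python
import numpy as np

def ista(Y, D, lam=0.1, step=None, n_iter=100):
    """Minimal ISTA sketch for min_X 0.5*||Y - D X||_F^2 + lam*||X||_1.

    Y: (m, L) signals, D: (m, r) dictionary.  `lam`, `step`, and `n_iter`
    are illustrative, not values from the paper.
    """
    if step is None:
        # 1/Lipschitz constant of the gradient (largest eigenvalue of D^T D)
        step = 1.0 / np.linalg.norm(D, 2) ** 2
    X = np.zeros((D.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        grad = D.T @ (D @ X - Y)                                   # gradient of the smooth term
        Z = X - step * grad                                        # gradient step
        X = np.sign(Z) * np.maximum(np.abs(Z) - step * lam, 0.0)   # soft-thresholding step
    return X
```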

2.2  Dictionary Update

Once the sparse coding task is done, the second stage is to find the best dictionary D based on the current X. This procedure is called the dictionary update and optimizes the dictionary based on the current sparse coding:
\min_{D} \|Y - DX\|_F^2.   (2.5)
To avoid a trivial result or scale ambiguity, the constraints frequently imposed on D include requiring the whole D to have a unit Frobenius norm (Yaghoobi, Blumensath, & Davies, 2009) or each atom to have a unit ℓ2-norm (Engan, Aase, et al., 1999). One of the differences among the various dictionary learning algorithms lies in the dictionary update stage, in which they update either the whole set of atoms at once (Engan, Aase, et al., 1999; Rakotomamonjy, 2013) or each atom one by one (Aharon et al., 2006; Bao, Ji, Quan, & Shen, 2014).

The procedure for dictionary learning is summarized in algorithm 1.

Algorithm 1: Dictionary learning by alternating sparse coding and dictionary update.
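The following Python sketch shows the generic two-stage structure of algorithm 1. The hooks `sparse_code(Y, D)` and `update_dict(Y, X, D)` are placeholders standing in for any of the coding and update methods discussed in the text (they are not functions defined in the letter), and the initialization and iteration count are illustrative.

```python
import numpy as np

def dictionary_learning(Y, r, sparse_code, update_dict, n_outer=30):
    """Skeleton of the alternating two-stage procedure of Algorithm 1."""
    m, L = Y.shape                      # assumes r <= L training signals
    # initialize D with r randomly chosen, normalized training signals
    D = Y[:, np.random.choice(L, r, replace=False)].astype(float)
    D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12
    X = np.zeros((r, L))
    for _ in range(n_outer):
        X = sparse_code(Y, D)           # stage 1: sparse coding
        D = update_dict(Y, X, D)        # stage 2: dictionary update
        D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12   # unit-norm atoms
    return D, X
```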

2.3  State-of-the-Art Algorithms

The method of optimal directions (MOD) is one of the simplest dictionary learning algorithms. In the sparse coding stage, X is solved using OMP (Engan, Aase, et al., 1999) or FOCUSS (Engan, Rao, & Kreutz-Delgado, 1999). MOD then finds the minimum of ‖Y − DX‖_F^2 with X fixed. This leads to the closed-form expression D = YX^T(XX^T)^{-1}, followed by normalizing the columns of D. K-SVD (Aharon et al., 2006) is a classic and successful dictionary learning approach. In its sparse coding stage, the sparse coding is also solved using OMP.
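For reference, a minimal sketch of the MOD dictionary update just described, using the least-squares closed form followed by column normalization; the small ridge term is added only for numerical safety and is not part of MOD itself.

```python
import numpy as np

def mod_dictionary_update(Y, X, eps=1e-12):
    """MOD update: D = Y X^T (X X^T)^{-1}, then normalize each column."""
    D = Y @ X.T @ np.linalg.inv(X @ X.T + eps * np.eye(X.shape[0]))
    D /= np.linalg.norm(D, axis=0, keepdims=True) + eps
    return D
```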

The difference between K-SVD and MOD is in the dictionary update stage. MOD updates the whole set of atoms at once with X fixed, but it is not guaranteed to converge. In contrast, K-SVD, through singular value decomposition (SVD), updates the atoms one by one and simultaneously updates the nonzero entries in the associated row vector of X. The work of Dai et al. (2012) extends the K-SVD algorithm by allowing, in the dictionary update stage, the simultaneous optimization of several dictionary atoms and the related approximation coefficients. With this possibility of optimizing several atoms at a time, the running efficiency of their algorithm is better than that of the original K-SVD, although still worse than that of MOD.

More recently, Bao, Ji, Quan, and Shen (2014) proposed an alternating proximal iteration scheme for dictionary learning with an ℓ0-norm constraint. The alternating proximal method, also called the proximal gradient method, is essentially a gradient-based method. Although it can improve on a pure gradient method because of PO, such gradient-based methods usually have linear convergence, so there is still room for improvement. Rakotomamonjy (2013) proposed an algorithm named Dir that optimizes the dictionary and the sparse decompositions jointly in one stage based on a nonconvex proximal splitting framework instead of alternating optimizations. This approach is another extension of the gradient-based method. Other relevant research is the work on online dictionary learning (Mairal, Bach, Ponce, & Sapiro, 2010; Skretting & Engan, 2010). Online algorithms continuously update the dictionary as each training vector is processed. Because of this, they can handle very large sets of signals, such as commonly occur, for example, in image processing tasks (Mairal et al., 2010). Skretting and Engan (2010) proposed an extension of the method of optimal directions based on recursive least squares for dealing with this online framework.

3  Fast Dictionary Learning Based on the Proximal Operator

In this section, we formulate the dictionary learning problem with constraints over the coefficient matrix and the dictionary and introduce the proximal operator. Then we propose the fastPDL algorithm based on PO and analyze this algorithm, including determining the parameters and the computational complexity.

3.1  Problem Formulation

Inheriting the definitions and notation given in section 2, we formulate our dictionary learning problem. For simplicity, we consider the convex ℓ1-norm over the coefficient matrix as the sparsity constraint, as follows:
\|X\|_1 = \sum_{i=1}^{L} \|x_i\|_1 = \sum_{i,j} |x_{ij}|.   (3.1)
More particularly, we impose an additional constraint on the atoms of D to enhance the incoherence between any two atoms,
\sum_{l \neq k} |d_l^{T} d_k|,   (3.2)

where d_l and d_k are different column vectors in D. If each column of D is normalized to the unit ℓ2-norm, the constraint term in equation 3.2 is equivalent to Σ_{l≠k} |cos θ_{lk}|, which sums the absolute cosine values of the angles between any two different atoms; θ_{lk} is the angle between atoms l and k, where d_l^T d_k = cos θ_{lk}. Hence, reducing equation 3.2 means enlarging the angles between the column vectors of D. This also means enhancing the incoherence of the dictionary. Thus, a dictionary learned with such a constraint will be more efficient.
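A small sketch of how this coherence penalty can be evaluated for a given dictionary, assuming the penalty is the sum of |d_l^T d_k| over distinct pairs of unit-norm atoms as described above.

```python
import numpy as np

def coherence_penalty(D):
    """Sum of |cos(angle)| over all pairs of distinct atoms,
    i.e. sum_{l != k} |d_l^T d_k| for column-normalized D (cf. equation 3.2)."""
    Dn = D / np.linalg.norm(D, axis=0, keepdims=True)
    G = np.abs(Dn.T @ Dn)            # absolute Gram matrix of normalized atoms
    return G.sum() - np.trace(G)     # drop the diagonal (self-coherence terms)
```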

Hence, our dictionary learning can be formulated as the following minimization problem,
\min_{D, X} \; \frac{1}{2}\|Y - DX\|_F^2 + \lambda \|X\|_1 + \mu \sum_{l \neq k} |d_l^{T} d_k|,   (3.3)
where λ is a regularization parameter used to suppress ‖X‖_1, which controls the sparsity of X, and μ is also a regularization parameter, used to adjust the incoherence of D. The problem, equation 3.3, is nonconvex with respect to D and X and nonsmooth because of the sparsity regularization and the incoherence penalty.

3.2  The Proximal Operator

In our problem, because the regularization term on X and the penalty term on D are both nonsmooth, we adopt PO to handle the nonsmooth terms. For the nonsmooth term λ‖x‖_1, the proximal operator is defined as

\operatorname{prox}_{\lambda \|\cdot\|_1}(b) = \arg\min_{x} \; \frac{1}{2}\|x - b\|_2^2 + \lambda \|x\|_1,   (3.4)

where the parameter λ > 0. This problem is the minimization of a piecewise quadratic function. It is easy to find the closed-form solution based on PO, which acts element-wise and maps the vector b to the entries y_j (the jth entry of the minimizer y). The closed-form solution of equation 3.4 can be expressed by

y_j = \operatorname{sign}(b_j) \max(|b_j| - \lambda, 0).   (3.5)
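In code, the element-wise closed form of equation 3.5 is the familiar soft-thresholding operation; the sketch below is generic and not specific to this letter.

```python
import numpy as np

def prox_l1(b, lam):
    """Proximal operator of lam*||.||_1 at b (equation 3.5), applied element-wise."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

# example: prox_l1(np.array([1.5, -0.3, 0.8]), 0.5) -> array([1.0, 0.0, 0.3])
```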
The cost function in our dictionary learning is a piecewise quadratic function. Extending PO to fit our problem is the key to obtaining closed-form solutions directly. We derive PO for piecewise quadratic functions of the following two forms (the parameters b and c_l are column vectors):

f(x) = \frac{a}{2}\|x\|_2^2 - b^{T} x + \lambda \|x\|_1,   (3.6)

where a > 0, and

g(x) = \frac{a}{2}\|x\|_2^2 - b^{T} x + \mu \sum_{l} |c_l^{T} x|,   (3.7)

where a > 0, μ > 0, and the c_l are given vectors.
Remarkably, the function 3.6 has the closed-form minimizer

x_j = \frac{1}{a}\operatorname{sign}(b_j) \max(|b_j| - \lambda, 0).   (3.8)
The function in equation 3.7 is nonsmooth and differs from the form in equation 3.4. However, the principle of minimization for equation 3.7 is similar to that of equation 3.4 because it is also a piecewise quadratic function. We can find the minimal solution for each piece and combine all the solutions to obtain the global minimizer. The detailed derivation is given in the appendix. Accordingly, the closed-form solution is given as
formula
3.9
formula
3.10

3.3  Proposed Algorithm: The Fast Proximal Dictionary Learning Algorithm

As is well known, imposing constraints on both D and X further increases the technical complexity of designing an effective algorithm. We aim at developing a fast algorithm for solving equation 3.3. The unknown factor X can be decomposed into r pieces arranged as its rows, while the unknown D contains r pieces arranged as its columns. Denote the kth column of D by the vector d_k and the kth row of X by the vector x^k (its transpose x_k is a column vector, but not the kth column of X). Thus, we decompose the product DX into the sum of r rank-1 matrices, DX = Σ_{k=1}^{r} d_k x^k. Hence, the cost function, equation 3.3, can be rewritten as

F(D, X) = \frac{1}{2}\Big\|Y - \sum_{k=1}^{r} d_k x^k\Big\|_F^2 + \lambda \sum_{k=1}^{r} \|x^k\|_1 + \mu \sum_{k=1}^{r} \sum_{l \neq k} |d_l^{T} d_k|.   (3.11)

With respect to d_k and x^k, the cost function, equation 3.11, reduces to

F_k(d_k, x^k) = \frac{1}{2}\|Y_k - d_k x^k\|_F^2 + \lambda \|x^k\|_1 + \mu \sum_{l \neq k} |d_l^{T} d_k|,   (3.12)

where Y_k = Y - \sum_{j \neq k} d_j x^j is the residual matrix that excludes the contribution of the kth atom.

Hence, we minimize the set of local cost functions, equation 3.12, for all k = 1, …, r, instead of minimizing the cost function, equation 3.3. The set of local cost functions, equation 3.12, is not only nonconvex with respect to d_k and x^k but also nonsmooth because of the regularization term on x^k and the penalty term on d_k. As mentioned, we use an alternating optimization so that the nonconvex optimization is transformed into two convex subproblems. Basically, the alternating optimization consists of updating x^k with d_k fixed and then updating d_k with x^k fixed, alternately.

First, by fixing D and ignoring the constant terms, the problem of equation 3.12 becomes the minimization of a piecewise quadratic and univariate function of x^k as follows,

f(x_k) = \frac{a}{2}\|x_k\|_2^2 - h_k^{T} x_k + \lambda \|x_k\|_1,   (3.13)

where a = ‖d_k‖_2^2 = 1 and h_k = Y_k^{T} d_k, because each column of D is normalized to the unit ℓ2-norm. Here, f is termed a univariate function of the vector x_k because it depends only on x_k. We note again that x^k is the kth row vector of X; x_k is not the kth column vector of X but the transposed vector of x^k. Although this subproblem is nonsmooth, it has the exact form of equation 3.6, so we can apply PO to obtain the closed-form solution. From the definition of PO, equation 3.8, it is natural to obtain the closed-form solution of equation 3.13 as follows:

x_{kj} = \operatorname{sign}(h_{kj}) \max(|h_{kj}| - \lambda, 0),   (3.14)

where h_{kj} is the jth entry of h_k. By this, we can sequentially optimize the set of coefficient rows.
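A sketch of this coefficient-row update, under the reading of equations 3.13 and 3.14 given above (unit-norm atoms, with h_k computed from the residual that excludes atom k); the function signature is illustrative.

```python
import numpy as np

def update_coeff_row(Y, D, X, k, lam):
    """Update the k-th row of X by soft-thresholding h_k = Y_k^T d_k (cf. eq. 3.14)."""
    Yk = Y - D @ X + np.outer(D[:, k], X[k, :])   # residual excluding atom k
    h = Yk.T @ D[:, k]                            # correlations with the unit-norm atom
    X[k, :] = np.sign(h) * np.maximum(np.abs(h) - lam, 0.0)
    return X
```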
Similarly, by fixing X and ignoring the constant terms, the local cost function, equation 3.12, can also be cast into the minimization of a piecewise quadratic and univariate function as follows:

g(d_k) = \frac{a}{2}\|d_k\|_2^2 - w_k^{T} d_k + \mu \sum_{l \neq k} |d_l^{T} d_k|.   (3.15)

Here, a = ‖x^k‖_2^2 and w_k = Y_k (x^k)^{T}; g depends only on d_k (fixing X and the other atoms). Notice that if the dictionary is sufficiently overcomplete, θ_{lk}, the angle between atoms l and k, satisfies cos θ_{lk} ≈ 0. Thus a reasonable approximation is to disregard the terms involving cos θ_{lk}. This means that the third term on the right-hand side of equation 3.9 disappears, and equation 3.10 simplifies accordingly. Then we can get the closed-form solution of equation 3.15 from equations 3.9 and 3.10 as follows:
formula
3.16
where w_{ki} and w_{li} are the ith entries of w_k and w_l, respectively. By this, we can sequentially optimize the set of atoms. In addition, to prevent the dictionary from having arbitrarily large values, each column is normalized to the unit ℓ2-norm after each update.
Remark 1.

Using the coherence of the dictionary as a penalty term added to the approximation error was studied in Sigg et al. (2012), Lin et al. (2012), Mailhe et al. (2012), Bao, Ji, and Quan (2014), and Wang et al. (2014). In those works, the dictionary is updated by the SVD method or by partial derivatives, which is costly. In this letter, we treat the coherence as a penalty term on D and use one-by-one atom updating. While we update one atom, the others are fixed. The set of subproblems with respect to the dictionary atoms can thus be easily solved by PO. Here, PO successfully handles the nonsmooth coherence penalty, which is quite different from the approach in Sigg et al. (2012), Lin et al. (2012), Mailhe et al. (2012), Bao, Ji, and Quan (2014), and Wang et al. (2014).

Remark 2.

Note that some methods such as ISTA (Daubechies et al., 2004) or its accelerated version (FISTA) (Beck & Teboulle, 2009) are also based on PO but are mainly devoted to sparse coding. In addition, some approaches for dictionary learning such as Dir (Rakotomamonjy, 2013) and Bao’s work (Bao, Ji, Quan, & Shen, 2014) are also based on PO. The proximal algorithm in these studies (Daubechies et al., 2004; Beck & Teboulle, 2009; Combettes & Wajs, 2005; Combettes & Pesquet, 2010; Kazerouni, Kamilov, Bostan, & Unser, 2013; Rakotomamonjy, 2013; Bao, Ji, Quan, & Shen, 2014) is also called the proximal gradient method. A difference in this letter is that after applying the PO technique, we further obtained closed-form solutions, as shown in equations 3.14 and 3.16, to the optimization subproblems, which can make the algorithm more efficient.

By minimizing the set of piecewise quadratic and univariate functions 3.13 and 3.15 for all k based on PO, we obtain the set of sequential learning rules 3.14 and 3.16 for all k in closed form. To reduce the computational complexity further, we work factor by factor to avoid computing the residual matrix Y_k explicitly for each k. In other words, before updating the rows of X, we need only compute the required products involving D and Y once; while updating the atoms of D, we need only compute the required products involving X and Y once. To accelerate the convergence rate, we update all the coefficient vectors sequentially while fixing all dictionary atoms. We then update all the atoms sequentially while fixing all the coefficient vectors. The entire process is repeated until convergence is achieved. Hence, the whole problem, equation 3.3, is solved, leading to an efficient and fast algorithm. Based on the analysis above, the proposed fastPDL algorithm for dictionary learning is summarized in algorithm 2.

Algorithm 2: The fastPDL algorithm.
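The sketch below illustrates the overall alternating structure of algorithm 2. It is deliberately simplified: the coherence correction in the atom step (the μ-dependent part of equations 3.9 and 3.10) is omitted, so each atom update reduces to a rank-1 residual fit followed by normalization, and the initialization and all parameter values are illustrative rather than the letter's settings.

```python
import numpy as np

def fastpdl_sketch(Y, r, lam=0.1, n_iter=50, seed=0):
    """Simplified sketch of the alternating closed-form updates of Algorithm 2
    (coherence correction omitted; parameters illustrative only)."""
    rng = np.random.default_rng(seed)
    m, L = Y.shape
    D = Y[:, rng.choice(L, r, replace=False)].astype(float)
    D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12
    X = np.zeros((r, L))
    for _ in range(n_iter):
        R = Y - D @ X                                    # full residual
        for k in range(r):                               # sweep coefficient rows
            Rk = R + np.outer(D[:, k], X[k, :])          # residual without atom k
            h = Rk.T @ D[:, k]
            X[k, :] = np.sign(h) * np.maximum(np.abs(h) - lam, 0.0)
            R = Rk - np.outer(D[:, k], X[k, :])
        for k in range(r):                               # sweep atoms
            Rk = R + np.outer(D[:, k], X[k, :])
            w = Rk @ X[k, :]
            if np.linalg.norm(w) > 0:
                D[:, k] = w / np.linalg.norm(w)          # rank-1 fit + unit-norm
            R = Rk - np.outer(D[:, k], X[k, :])
    return D, X
```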

In the fastPDL algorithm, convergence is guaranteed. The whole dictionary learning problem is decomposed into a set of piecewise quadratic and univariate functions, 3.13 and 3.15, for all k. Moreover, the value of a, the second-order derivative of each piecewise quadratic and univariate function, is greater than zero. The exact minimizers can be obtained using equations 3.14 and 3.16, which are stationary points. Because the cost function, equation 3.3, decreases monotonically over each update of x^k and d_k, alternately, and the cost function is bounded below, the algorithm converges.

3.4  Determining the Parameters in the FastPDL Algorithm

Our dictionary learning algorithm requires the coefficient matrix to be as sparse as possible and simultaneously requires the atoms of the dictionary to be as incoherent as possible. For the cost function, equation 3.3, the parameter λ can be adjusted to control the trade-off between the approximation error and the sparsity of the coefficient matrix X. λ plays an important role in the proposed method: larger values of λ result in a sparser coefficient matrix. The parameter λ can be determined by offline calibration or by adaptive tuning. Experiments show that the values determined in these two ways are very close. In this letter, we find the optimal λ in the first way. The parameter μ can be adjusted for a trade-off between the accuracy of reconstruction and the incoherence of D. We found that a suitable fixed choice of μ yields a relatively low approximation error and an adequate reduction of the redundancy of the dictionary.

To show the performance of the fastPDL algorithm, one evaluation is to compare the angles between any two atoms in two different dictionaries: one learned by fastPDL with both the sparsity and coherence penalty terms and one learned by fastPDL with only the sparsity penalty term. With these two settings, we obtained two groups of dictionary learning results from a natural image (see "Peppers" in Figure 9), and we plotted the sorted angles between atoms in the learned dictionaries in Figure 1. It can be seen that the angles between atoms in the dictionary learned by fastPDL with the coherence penalty are enlarged, and the incoherence of the learned dictionary is obviously enhanced.

Figure 1: Averaged angles between any two atoms in the dictionaries that are learned by the cost functions with and without the dictionary incoherence penalty term.

3.5  Analysis of the Computational Complexity

From the implementation point of view, we consider the computational complexity of our algorithm. The complexity is expressed in terms of the parameters s, m, r, and L, which correspond to the number of nonzero elements in each column of X, the dimensionality of the signals, the number of columns in the dictionary, and the number of training samples, respectively. Generally, for dictionary learning, we can assume s < m < r ≪ L. FastPDL decomposes the matrices into sets of row vectors or column vectors because it is then easy to find the optimal solutions in closed form. K-SVD uses the same decomposition scheme; however, its dictionary update is performed through an SVD of the restricted residual matrix, which is costly. For fastPDL's dictionary update step, we use the closed-form solution, equation 3.16. The complexity depends mainly on the matrix products involved and is much lower than that of SVD. The complexity of the sparse coding step is likewise much lower than that of the OMP used in the K-SVD and MOD methods. In addition, both the supports (nonzero positions) and the entries at the supports in X are updated by fastPDL, unlike in the MOD and K-SVD methods, in which the supports are not changed. This contributes to a further improvement in the performance of dictionary learning. Gradient-based methods, such as Dir (Rakotomamonjy, 2013) and Bao's work (Bao, Ji, Quan, & Shen, 2014), require gradient computations whose per-iteration cost is slightly lower than that of fastPDL. On the other hand, because they are gradient-based methods, they require more iterations than fastPDL, and their convergence rates rely on the step sizes. Hence, the fastPDL algorithm has advantages in complexity and convergence rate and has fewer parameters that require tuning.

4  Simulations

In this section, to evaluate the learning capability and efficiency of fastPDL, we present the results of some numerical experiments. The first experiment is on synthetic data generated from a random dictionary taken as ground truth; we then use the proposed algorithm on the data to learn a new dictionary and compare it with the ground truth dictionary. In this way, a dictionary learning algorithm should be able to extract the common features of the set of signals, which are actually the generating atoms. Furthermore, we perform another experiment in which white gaussian noise at various signal-to-noise ratios (SNRs) is added to the synthetic signals to evaluate the performance and robustness against noise. In addition, we apply fastPDL to real-valued signals with noise to verify the applicability of the proposed algorithm.

In the experiments, all programs were coded in Matlab and were run within Matlab (R2014a) on a PC with a 3.0 GHz Intel Core i5 CPU and 12 GB of memory, under the Microsoft Windows 7 operating system.

4.1  Dictionary Recovery from Synthetic Data

First, we generated a dictionary by normalizing a random matrix of size m × r with independent and identically distributed (i.i.d.) uniformly random entries. Each column was normalized to the unit ℓ2-norm. This dictionary was referred to as the ground truth dictionary; it was not used in the learning but only for evaluation. Then a collection of L samples of dimensionality m was synthesized, each as a linear combination of s different columns of the dictionary, with i.i.d. uniformly random coefficients in random and independent positions. Here s also denotes the number of nonzero elements in each column of the coefficient matrix X.

Then we applied all algorithms—K-SVD, MOD, Dir of Rakotomamonjy, and the proposed fastPDL—to the generated signals to learn dictionaries and record their performance. Matlab codes for K-SVD algorithm and MOD algorithm are available online (http://www.cs.technion.ac.il/∼elad//software/), as is Matlab code for the Dir algorithm (http://asi.insa-rouen.fr/enseignants/∼arakoto/publication.html/). All algorithms were initialized with the same dictionary (made by randomly choosing r columns of the generated signals, followed by a normalization) and with the same coefficient matrix (randomly generated).

To evaluate the performance of the learned dictionary D̂, we compared the D̂ obtained by each of the above algorithms with the ground truth D. Because the order of the columns in D and D̂ may differ, the comparison was made by sweeping through the columns of D and D̂ and measuring the distance between columns via

1 - |d_i^{T} \hat{d}_j|,

where d_i is an atom of the ground truth dictionary D and d̂_j is an atom of the learned dictionary D̂. If the distance was less than 0.01, the atom was regarded as successfully recovered. The recovery ratio for the learned dictionary was calculated by dividing the number of recovered atoms by the total number of atoms in D.
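A sketch of this evaluation, assuming the per-atom distance is 1 − |d_i^T d̂_j| as written above and that both dictionaries have unit-norm columns; the function name is illustrative.

```python
import numpy as np

def recovery_ratio(D_true, D_learned, tol=0.01):
    """Fraction of ground-truth atoms matched (within tol) by some learned atom."""
    recovered = 0
    for i in range(D_true.shape[1]):
        dists = 1.0 - np.abs(D_true[:, i] @ D_learned)   # distances to all learned atoms
        if dists.min() < tol:
            recovered += 1
    return recovered / D_true.shape[1]
```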

In the experiment, the number of signals L was set to 1500, and the dimensionality m was 20. For the dictionary, the number of atoms r was 50. We varied s from 3 to 12. All trials were repeated 15 times for each algorithm and for each value of s. For all algorithms, the stopping criterion was based on the relative change of the objective function value in two consecutive iterations falling below a small positive constant ε. For Dir and fastPDL, a small value of ε was used, and the maximum number of iterations was set to 10,000 for both. K-SVD and MOD imposed a very strong sparsity constraint on the data by limiting the number of nonzero elements in X obtained from OMP to a small number; these methods thus yielded signal representations that are not particularly accurate, and a different value of ε was therefore used for K-SVD and MOD. In addition, the implementation of OMP was very slow, so for K-SVD and MOD the maximal number of iterations was set to a reasonable value (e.g., 3000). These settings ensured that the stopping criterion would be satisfied within the maximal number of iterations and that the algorithms could find the minimum of the objective function. The average recovery ratios and corresponding standard deviations for the learned dictionaries are shown with respect to the number of nonzero elements s in each column of X in Figure 2a. The average running time and standard deviation of each algorithm are shown versus s in Figure 2b. It can be seen that in all cases fastPDL performed the best in both recovery ratio and running time. The recovery ratio of fastPDL was much higher than those of K-SVD and MOD in all cases. In particular, fastPDL was still effective even when the number of nonzero elements s was very high. Furthermore, in computational running time, fastPDL was remarkably fast and stable compared with K-SVD and MOD. It is clear that K-SVD and MOD, which use two costly stages, were more expensive. Compared with Dir, fastPDL was slightly better in the ratio of recovered atoms, and it had a significant advantage in running time. The reason fastPDL outperformed Dir is that the latter employs a gradient descent method to optimize gradually, which is time-consuming even though Dir's optimization occurs in one stage only.

Figure 2: Experimental results for the synthetic signal. (a) Average recovery ratios and corresponding standard deviations for the learned dictionaries versus s. (b) Average running time and corresponding standard deviation of each algorithm versus s.

Obviously, the four algorithms described above have different convergence properties. To see the convergence behaviors of all the algorithms, we executed them many times and determined reasonable numbers of iterations that could ensure convergence of each algorithm: 3000 iterations for Dir and fastPDL and 1000 iterations for K-SVD and MOD, because K-SVD and MOD cost much more time per iteration than Dir and fastPDL. Fifteen trials were conducted, and the performance results were averaged. The average recovery ratios for the learned dictionaries are shown in Figure 3 with respect to the number of iterations and the running time per iteration. It can be seen in Figure 3a that the number of iterations required for convergence of K-SVD and MOD was smaller than that of fastPDL; however, the computational complexity of each iteration of K-SVD and MOD was much greater than that of fastPDL. Hence, the convergence of fastPDL versus the running time per iteration was faster than that of K-SVD and MOD, as shown in Figure 3b. The computational complexity of each iteration of Dir was a little less than that of fastPDL; however, the number of iterations required for convergence of Dir was greater than that of fastPDL in Figure 3a, because gradient descent converges slowly. Hence, the convergence of fastPDL versus the running time per iteration was also faster than that of Dir in Figure 3b. Therefore, fastPDL has significant advantages in convergence rate.

Figure 3: Experimental results for the synthetic signal. (a) Average recovery ratios for the learned dictionaries versus iteration number. (b) Average recovery ratios for the learned dictionaries versus running time per iteration.

As is well known, K-SVD and MOD require a specific and fixed number of nonzero elements in each column of X, so they have a fixed sparsity during the iterations. For Dir and fastPDL, the sparsity of the coefficient matrix was adjusted via the regularization parameters. Figure 4a depicts the average recovery ratios and corresponding standard deviations for the dictionary learned by fastPDL with respect to the regularization parameter. The cost function values of fastPDL versus iteration number are shown in Figure 4b for different values of the parameter. We can see that the recovery ratio was better for an appropriate value of the parameter; moreover, for that value the cost function values were smaller and converged faster. This parameter was fixed during the iterations to reduce the number of iterations and the computational requirements. The sparsity of the coefficient matrix X, measured by dividing the number of nonzero elements in X by the total number of elements in X, is shown in Figures 5a and 5b against the number of iterations and the running time per iteration, respectively. Generally, lower values of this sparsity measure make the dictionary more efficient. Obviously, the sparsity of fastPDL converged faster and reached a lower value than that of Dir, while the sparsities for K-SVD and MOD are not plotted because they had a fixed sparsity (3/50) determined by the OMP method. In addition, we show histograms of the effective sparsity (the number of nonzero elements in each column of X) of the samples' representations for fastPDL. In the randomly generated ground truth, the number of nonzero elements in each column was three; hence the number of coefficient vectors with three nonzero elements was 1500 in Figure 6a. Note that the number of coefficient vectors is the same as the number of samples (L = 1500). The number of coefficient vectors for fastPDL with three nonzero elements was 685 in Figure 6b. It can be seen that the vectors with three nonzero elements were dominant in the output of fastPDL. Thus, the histogram confirms that the columns of X obtained by fastPDL were sparse.

Figure 4: Experimental results for the synthetic signal. (a) Average recovery ratios and corresponding standard deviations for the fastPDL dictionary versus the regularization parameter. (b) Cost function values of fastPDL (approximation error + sparsity measure + incoherence measure) versus iteration number for different values of the parameter.

Figure 5: Experimental results for the synthetic signal. (a) Sparsity of the coefficient matrix versus iteration number. (b) Sparsity of the coefficient matrix versus running time per iteration. The sparsities of K-SVD and MOD are not plotted because they are fixed.

Figure 6: Experimental results for the synthetic signal. (a) A histogram of the effective sparsity of the 1500 signals' representation for the randomly generated ground truth coefficients. (b) A histogram of the effective sparsity of the 1500 signals' representation for fastPDL.

In addition, we performed an experiment to show the dependence between the rate of recovery of atoms and the redundancy ratio of the dictionary. Here, we define the redundancy ratio of the dictionary as r/m, where m is the dimensionality of the dictionary atoms and r is the number of dictionary atoms. In the experiment, r was fixed to 50. If the dictionary can be recovered even though the redundancy ratio is small, it means that the sparse coefficient matrix can be recovered from fewer linear combinations of atoms, so the sparse representation is more efficient. To evaluate the ability to learn the dictionary as the redundancy ratio is reduced, we performed the dictionary recovery experiment under different redundancy ratios. Fifteen trials were conducted, and the performance results were averaged. The average recovery rates and corresponding standard deviations are shown in Figure 7a. We found that the fastPDL algorithm displayed a superior recovery rate to the other algorithms. In particular, fastPDL was still effective at small redundancy ratios, while the other algorithms deteriorated greatly. Furthermore, we show the sparsity of the coefficient matrix in Figure 7b. It can be seen that the sparsity of fastPDL reaches lower values than that of Dir under various redundancy ratios, while the sparsities for K-SVD and MOD are not plotted because they are fixed.

Figure 7: Experimental results for the synthetic signal. (a) Average recovery rates for the learned dictionaries versus the redundancy ratio. (b) Sparsity of the coefficient matrix versus the redundancy ratio.

4.2  Evaluating the Performance and Robustness against Noise for Different Noise Levels

Besides the noiseless condition, we also performed another experiment in which white gaussian noise at various SNRs corrupted the signals, to evaluate the algorithm's performance and robustness against noise. We added noise at SNR levels of 10 dB, 20 dB, and 30 dB according to the additive model in which a noise matrix is added to the clean signals. The other settings for the experiment (m, r, L, and s) were as in section 4.1. For each of the tested algorithms and for each SNR level, 15 trials were conducted and their results averaged. The average recovery ratios and corresponding standard deviations for the learned dictionaries at 10 dB, 20 dB, 30 dB, and in the noiseless case are shown in Figure 8a. The means and standard deviations of the running time (until the relative change of the objective function value in two consecutive iterations was less than a small positive constant) are shown in Figure 8b. It can be seen that fastPDL performed the best on dictionary learning at each noise level. Although the differences in recovery ratio between fastPDL and Dir under the various conditions were small, fastPDL had a remarkable advantage in running time.
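For reproducibility, a generic way to scale white gaussian noise to a target SNR is sketched below; the letter does not specify its exact noise-generation code, so this is only one reasonable choice.

```python
import numpy as np

def add_noise_at_snr(Y, snr_db, rng=None):
    """Corrupt Y with white gaussian noise scaled to the target SNR (in dB)."""
    rng = np.random.default_rng() if rng is None else rng
    V = rng.standard_normal(Y.shape)
    signal_power = np.mean(Y ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    V *= np.sqrt(noise_power / np.mean(V ** 2))   # rescale noise to the desired power
    return Y + V
```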

Figure 8: Experimental results versus noise. (a) Average recovery ratios and corresponding standard deviations for the learned dictionaries at different noise levels. (b) Average running time and corresponding standard deviations of each algorithm at different noise levels.

4.3  Applications and Performance Evaluations

Dictionary learning for sparse representation of signals has many applications, such as denoising, inpainting, and classification. In this section, we apply fastPDL in one particular application, denoising. Signal denoising refers to the removal of noise from a signal. An image contains only nonnegative signals, while both nonnegative and negative signals occur in audio data. This motivates us to apply fastPDL algorithm to image processing and audio processing.

4.3.1  Image Denoising

Although the proposed fastPDL algorithm does not require the signal to be nonnegative, it can indeed work effectively in such cases, as we now show.

In the additive-noise model, white gaussian noise with various standard deviations corrupts the clean image. We used the noisy image itself as test data to learn a dictionary, which was then used to remove the noise from the noisy observed image. The denoising method is based on the K-SVD denoising method implemented in Elad and Aharon (2006). To provide a fair comparison, all the settings were the same as those used with the K-SVD method. For the dictionary learning stage, we chose an overcomplete DCT dictionary as the initialization for the fastPDL algorithm, and the number of iterations was set to 10. The sparse coding stage was based on the OMP algorithm. After the whole process of dictionary learning, we used the learned dictionary to reconstruct the images. Some standard test images, including Boats (512×512), Barbara (512×512), Peppers (256×256), and House (256×256), were used in the experiment, and one of the reconstruction results is presented in Figure 9. We deliberately show denoised images and the final adaptive dictionaries for a case with very strong noise, where fastPDL remains effective. Here, we used two quality measures, the peak SNR (PSNR) and the structural similarity (SSIM) (Wang, Bovik, Sheikh, Sapiro, & Simoncelli, 2004), to assess the denoised images. SSIM values range between 0 and 1, and the value equals 1 when the denoised image is identical to the original. The average PSNR and SSIM of the denoised images over 10 trials at different noise levels are summarized in Table 1. It shows that in some situations fastPDL obtains better results, and the averaged results of fastPDL were very close to those of K-SVD. The average running times are summarized in Table 2, which shows that the running time of fastPDL was shorter than that of K-SVD. An interesting phenomenon is that the running time for all the images decreases with increasing noise levels. This may be because, at a higher noise level, the image must be represented with fewer atoms, because additional atoms would mainly represent noise.
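A schematic of the patch-based denoising pipeline used here (after Elad & Aharon, 2006) is sketched below; the patch size, the placeholder `sparse_code` hook standing in for OMP with a noise-dependent stopping threshold, and the simple overlap averaging are illustrative assumptions, not the exact implementation used in the experiments.

```python
import numpy as np

def denoise_image(img, D, sparse_code, patch=8):
    """Code every overlapping patch over D, reconstruct it, and average overlaps.

    `sparse_code(p, D)` should return the coefficient vector (r, 1) of patch p."""
    H, W = img.shape
    acc = np.zeros_like(img, dtype=float)
    weight = np.zeros_like(img, dtype=float)
    for i in range(H - patch + 1):
        for j in range(W - patch + 1):
            p = img[i:i + patch, j:j + patch].reshape(-1, 1)
            mean = p.mean()
            x = sparse_code(p - mean, D)                  # code the zero-mean patch
            rec = (D @ x + mean).reshape(patch, patch)    # reconstructed patch
            acc[i:i + patch, j:j + patch] += rec
            weight[i:i + patch, j:j + patch] += 1.0
    return acc / weight                                   # average overlapping estimates
```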

Figure 9: Example of denoising results for the image Peppers with the noise level of 20 dB. The corresponding dictionaries trained from Peppers are also shown. The first items show PSNR values, and the second items the SSIM index.

Table 1:
PSNR (dB) and SSIM of Denoising Results.
                     Boat              Barbara           Peppers           House
σ / PSNR             PSNR     SSIM     PSNR     SSIM     PSNR     SSIM     PSNR     SSIM
5 / 34.16   K-SVD    37.0255  0.9396   37.6528  0.9594   37.901   0.9548   39.4905  0.9555
            fastPDL  37.0284  0.9396   37.6529  0.9595   37.897   0.9548   39.4765  0.9554
10 / 28.15  K-SVD    33.3949  0.8814   33.9599  0.9257   34.2115  0.9226   35.965   0.9062
            fastPDL  33.3943  0.8814   33.9241  0.9252   34.2357  0.9227   35.9705  0.9063
15 / 24.61  K-SVD    31.4239  0.8357   31.96    0.8971   32.1934  0.8981   34.3171  0.8771
            fastPDL  31.442   0.8358   31.9022  0.896    32.1466  0.8970   34.2136  0.8764
25 / 20.12  K-SVD    29.0198  0.7621   29.2317  0.8333   29.7746  0.8574   32.2081  0.8463
            fastPDL  29.0523  0.7626   29.1088  0.8295   29.6686  0.8551   32.0013  0.8453
50 / 14.09  K-SVD    25.6511  0.6367   25.2143  0.6899   26.1915  0.7764   27.9702  0.7604
            fastPDL  25.5263  0.6310   25.0605  0.6846   25.71391 0.7661   27.4141  0.7512

Notes: In each cell, top row: K-SVD; second row: fastPDL. Figures in bold show the best result in each cell.

Table 2:
Comparison of Running Time in Seconds for Image Signals.
σ / PSNR             Boat     Barbara  Peppers  House
5 / 34.16   K-SVD    8.8048   8.8978   8.3413   5.1428
            fastPDL  3.9209   4.2362   3.6768   2.5609
10 / 28.15  K-SVD    4.7485   4.9622   4.5821   3.3541
            fastPDL  2.5749   2.534    2.2583   1.7438
15 / 24.61  K-SVD    3.3470   3.5784   3.4730   2.7144
            fastPDL  1.9409   2.023    1.8688   1.4860
25 / 20.12  K-SVD    2.8252   2.916    2.8641   2.4572
            fastPDL  1.6982   1.6823   1.4967   1.3891
50 / 14.09  K-SVD    2.4674   2.5537   2.3606   2.3212
            fastPDL  1.4493   1.4858   1.3119   1.2871

Notes: In each cell, top row: K-SVD; second row: fastPDL. Figures in bold show the best result for each cell.

4.3.2  Noise Cancellation of Audio Signal

In this section, we present the application of the proposed fastPDL algorithm to a more general signal, that is, a signal without nonnegativity. An audio signal is an appropriate example.

The audio data used in the experiments were recorded from BBC radio music, sampled at 16 kHz and lasting 4.1 s. Different levels of white gaussian noise corrupted the audio data. We considered the corrupted audio as training samples, each composed of 64 time samples, and learned an overcomplete dictionary from them. The noise cancellation procedure was similar to that used for image denoising, and 10 trials were conducted. The reconstructed result is presented in Figure 10. For the audio signal, we used the improvement in signal-to-noise ratio (ISNR) to evaluate the performance of noise cancellation,
\mathrm{ISNR} = 10 \log_{10} \frac{\|y - \tilde{y}\|_2^2}{\|y - \hat{y}\|_2^2},

where y is the original signal, ỹ is the observed noisy signal, and ŷ is the reconstructed signal. As the reconstructed signal becomes closer to the original signal, the ISNR increases. Tables 3 and 4 show that the results of fastPDL are very close to those of K-SVD in ISNR and that fastPDL significantly outperforms K-SVD in running time.
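A direct implementation of this measure, assuming the standard ISNR definition written above.

```python
import numpy as np

def isnr(y_clean, y_noisy, y_rec):
    """Improvement in SNR: 10*log10(||y - y_noisy||^2 / ||y - y_rec||^2), larger is better."""
    return 10.0 * np.log10(np.sum((y_clean - y_noisy) ** 2) /
                           np.sum((y_clean - y_rec) ** 2))
```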
Figure 10: Example of ISNR for audio data with noise level of 0 dB.

Table 3:
Comparisons of ISNR for Audio Signal.
           10 dB    0 dB     -10 dB
K-SVD      8.4263   9.5779   11.8957
FastPDL    8.4212   9.3561   11.7532

Note: Figures in bold show the best result in each cell.

Table 4:
Comparisons of Running Time in Seconds for Audio Signal.
           10 dB    0 dB     -10 dB
K-SVD      3.974    2.7029   2.2987
FastPDL    2.0012   1.5346   1.2446

Note: Figures in bold show the best result in each cell.

5  Conclusion

The main motivation of this letter was to develop a fast and efficient algorithm for learning an incoherent dictionary for sparse representation. The problem was cast as the minimization of the approximation error function with a coherence penalty on the dictionary atoms and a sparsity regularization on the coefficient matrix. To solve the problem efficiently, we turned it into a set of minimizations of piecewise quadratic, univariate subproblems, each involving a single vector variable: one dictionary atom or one coefficient vector. Although the subproblems were still nonsmooth, remarkably, we could find the optimal solution in closed form using proximal operators. This led to the so-called fastPDL algorithm. Interestingly, this algorithm updates the dictionary and the coefficient matrix in the same manner, forcing the coefficients to be as sparse as possible while simultaneously forcing the atoms of the dictionary to be as incoherent as possible. We have verified that fastPDL is efficient in both computational complexity and convergence rate. These advantages are achieved because the optimal dictionary atoms and coefficient vectors are obtained in closed form without iterative optimization. The proposed fastPDL algorithm is quite simple but very efficient. The numerical experiments have shown that fastPDL is more efficient than state-of-the-art algorithms. In addition, we have described two applications of the proposed algorithm, one for a nonnegative signal and the other for a general signal, and have shown that the fastPDL algorithm has a significant advantage in computation time for both. Further applications remain as future work.

Appendix:  Derivation of the Proximal Operator for the Coherence Penalty

Problem 3.7 includes a nonsmooth function that is complicated because it is a summation of absolute-value terms. For simplicity, we first consider the case of a single such term, that is,
formula
A.1
where , and . Then we derive PO for of equation A1,
formula
A.2
where denotes PO and denotes the subgradient to . To compute , we consider three cases: , , and . For , is , where . Then, we get the solution of equation A.1 in the case of ,
formula
A.3
since and , we get , and thus . Hence, the solution A.3 can be rewritten as
formula
A.4
The same derivation applies for and . is for and for , respectively. Then we obtain
formula
A.5
formula
A.6
Summarizing equations A.4 to A.6, we get the closed-form solution of equation A.1,
formula
A.7
where x_j and b_j are the jth entries of x and b, respectively.
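As a quick numerical sanity check of this kind of closed form, the snippet below compares the assumed scalar prototype argmin_x (a/2)x^2 − bx + λ|x| (cf. equations 3.8 and A.7) against a brute-force grid search; the grid range, the random parameter ranges, and the tolerance are arbitrary choices for illustration.

```python
import numpy as np

def closed_form(a, b, lam):
    """Assumed closed form of argmin_x (a/2)x^2 - b*x + lam*|x| (cf. eq. 3.8)."""
    return np.sign(b) * max(abs(b) - lam, 0.0) / a

def brute_force(a, b, lam, grid=np.linspace(-10, 10, 400001)):
    """Grid search over the same piecewise quadratic, for comparison."""
    vals = 0.5 * a * grid ** 2 - b * grid + lam * np.abs(grid)
    return grid[np.argmin(vals)]

# quick numerical check on a few random instances
rng = np.random.default_rng(0)
for _ in range(5):
    a, b, lam = rng.uniform(0.5, 3), rng.uniform(-3, 3), rng.uniform(0, 1)
    assert abs(closed_form(a, b, lam) - brute_force(a, b, lam)) < 1e-3
```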
Now we derive the PO for the nonsmooth of equation 3.7, where . In addition, we have and . Thus we obtain
formula (A.8)
To compute , we define , and , where is the cardinality of a set and . Then we get the solution of equation 3.7,
formula (A.9)
For the case of , we have , which reduces to
formula (A.10)
Since and , we have
formula (A.11)
Plugging equation A.11 into A.10, we obtain
formula (A.12)
To make equation A.12 more readable, we consider the different situations separately. If and , equation A.12 gives
formula (A.13)
If , equation A.12 gives
formula (A.14)
Since , according to equations A.13 and A.14, satisfies
formula (A.15)
Similarly, in the cases of , we have
formula (A.16)
and in the case of , we have
formula (A.17)
Summarizing equations A.9, A.15, and A.17, we get the closed-form solution of equation 3.7:
formula (A.18)
where clj and cgj are the jth entries of and , respectively.

Acknowledgments

This work was supported by the Japan Society for the Promotion of Science under grant no. 26-10950.

References

Adler, A., Emiya, V., Jafari, M. G., Elad, M., Gribonval, R., & Plumbley, M. D. (2012). Audio inpainting. IEEE Transactions on Audio, Speech, and Language Processing, 20, 922–932.
Aharon, M., Elad, M., & Bruckstein, A. (2006). K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54, 4311–4322.
Bao, C., Ji, H., & Quan, Y. (2014). A convergent incoherent dictionary learning algorithm for sparse coding. In Proceedings of the European Conference on Computer Vision (pp. 302–316). New York: Springer.
Bao, C., Ji, H., Quan, Y., & Shen, Z. (2014). L0 norm based dictionary learning by proximal methods with global convergence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3858–3865). Piscataway, NJ: IEEE.
Baraniuk, R. G. (2007). Compressive sensing. IEEE Signal Processing Magazine, 24, 118–121.
Beck, A., & Teboulle, M. (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1), 183–202.
Caiafa, C. F., & Cichocki, A. (2012). Computing sparse representations of multidimensional signals. Neural Computation, 25(1), 186–220.
Chartrand, R., & Yin, W. (2008). Iteratively reweighted algorithms for compressive sensing. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 3869–3872). Piscataway, NJ: IEEE.
Chen, S. S., Donoho, D. L., & Saunders, M. A. (2001). Atomic decomposition by basis pursuit. SIAM Review, 43(1), 129–159.
Cichocki, A., & Phan, A. H. (2009). Fast local algorithms for large scale nonnegative matrix and tensor factorizations. IEICE Transactions on Fundamentals of Electronics, E92-A(3), 708–721.
Combettes, P. L., & Pesquet, J. (2010). Proximal splitting methods in signal processing. arXiv:0912.3522v4.
Combettes, P. L., & Wajs, V. R. (2005). Signal recovery by proximal forward-backward splitting. Multiscale Modeling and Simulation, 4(4), 1168–1200.
Dai, W., Xu, T., & Wang, W. (2012). Simultaneous codeword optimization (SimCO) for dictionary update and learning. IEEE Transactions on Signal Processing, 60(12), 6340–6353.
Daubechies, I., Defrise, M., & De Mol, C. (2004). An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11), 1413–1457.
Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. Annals of Statistics, 32, 407–499.
Elad, M. (2010). Sparse and redundant representations. Berlin: Springer.
Elad, M., & Aharon, M. (2006). Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12), 3736–3745.
Elad, M., Figueiredo, M. A. T., & Ma, Y. (2010). On the role of sparse and redundant representations in image processing. Proceedings of the IEEE, 98, 972–982.
Engan, K., Aase, S. O., & Husoy, J. H. (1999). Method of optimal directions for frame design. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (Vol. 5, pp. 2443–2446). Piscataway, NJ: IEEE.
Engan, K., Rao, B. D., & Kreutz-Delgado, K. (1999). Frame design using FOCUSS with method of optimal directions (MOD). In Proceedings of the Norwegian Signal Processing Symposium (pp. 65–69). Piscataway, NJ: IEEE.
Fadili, M. J., Starck, J.-L., & Murtagh, F. (2009). Inpainting and zooming using sparse representations. Computer Journal, 52, 64–79.
Hyvarinen, A. (1999). Fast and robust fixed-point algorithms for independent component analysis. IEEE Transactions on Neural Networks, 10(3), 626–634.
Jafari, M. G., & Plumbley, M. D. (2011). Fast dictionary learning for sparse representations of speech signals. IEEE Journal of Selected Topics in Signal Processing, 5, 1025–1031.
Kazerouni, A., Kamilov, U. S., Bostan, E., & Unser, M. (2013). Bayesian denoising: From MAP to MMSE using consistent cycle spinning. IEEE Signal Processing Letters, 20(3), 249–252.
Kreutz-Delgado, K., Murray, J. F., & Rao, B. D. (2003). Dictionary learning algorithms for sparse representation. Neural Computation, 15, 349–396.
Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by nonnegative matrix factorization. Nature, 401, 788–791.
Lee, H., Battle, A., Raina, R., & Ng, A. Y. (2007). Efficient sparse coding algorithms. In S. A. Solla, T. K. Leen, & K.-R. Muller (Eds.), Advances in neural information processing systems, 19 (pp. 801–808). Cambridge, MA: MIT Press.
Lewicki, M. S., & Sejnowski, T. J. (2000). Learning overcomplete representations. Neural Computation, 12(2), 337–365.
Lin, T., Liu, S., & Zha, H. (2012). Incoherent dictionary learning for sparse representation. In Proceedings of the IEEE 21st International Conference on Pattern Recognition (pp. 1237–1240). Piscataway, NJ: IEEE.
Mailhe, B., Barchiesi, D., & Plumbley, M. D. (2012). INK-SVD: Learning incoherent dictionaries for sparse representations. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 3573–3576). Piscataway, NJ: IEEE.
Mairal, J., Bach, F., Ponce, J., & Sapiro, G. (2010). Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11, 19–60.
Mallat, S. G., & Zhang, Z. (1993). Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41, 3397–3415.
Moreau, J. J. (1962). Fonctions convexes duales et points proximaux dans un espace hilbertien. Comptes rendus de l'Académie des sciences de Paris, 255, 2897–2899.
Rakotomamonjy, A. (2013). Direct optimization of the dictionary learning problem. IEEE Transactions on Signal Processing, 61(22), 5495–5506.
Selesnick, I. W., Baraniuk, R. G., & Kingsbury, N. C. (2005). The dual-tree complex wavelet transform. IEEE Signal Processing Magazine, 22, 123–151.
Sigg, C. D., Dikk, T., & Buhmann, J. M. (2012). Learning dictionaries with bounded self-coherence. IEEE Signal Processing Letters, 19(12), 861–864.
Skretting, K., & Engan, K. (2010). Recursive least squares dictionary learning algorithm. IEEE Transactions on Signal Processing, 58, 2121–2130.
Tibshirani, R. (1994). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.
Tropp, J. A. (2004). Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50, 2231–2242.
Wang, J., Cai, J.-F., Shi, Y., & Yin, B. (2014). Incoherent dictionary learning for sparse representation based image denoising. In Proceedings of the IEEE International Conference on Image Processing. Piscataway, NJ: IEEE.
Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. (2004). Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13, 600–612.
Yaghoobi, M., Blumensath, T., & Davies, M. E. (2009). Dictionary learning for sparse approximations with the majorization method. IEEE Transactions on Signal Processing, 57, 2178–2191.
Yaghoobi, M., Daudet, L., & Davies, M. E. (2009). Parametric dictionary design for sparse coding. IEEE Transactions on Signal Processing, 57, 4800–4810.

Note

1. SNRdB is defined as SNRdB = 10 log10(||x||^2 / ||x − y||^2), where x and y denote the original signal and the signal polluted by noise, respectively.