Abstract

In this letter, we consider the nonnegative matrix factorization (NMF) problem and several NMF variants. Two approaches based on DC (difference of convex functions) programming and DCA (DC algorithm) are developed. The first approach follows the alternating framework that requires solving, at each iteration, two nonnegativity-constrained least squares subproblems for which DCA-based schemes are investigated. The convergence property of the proposed algorithm is carefully studied. We show that with suitable DC decompositions, our algorithm generates most of the standard methods for the NMF problem. The second approach directly applies DCA on the whole NMF problem. Two algorithms—one computing all variables and one deploying a variable selection strategy—are proposed. The proposed methods are then adapted to solve various NMF variants, including the nonnegative factorization, the smooth regularization NMF, the sparse regularization NMF, the multilayer NMF, the convex/convex-hull NMF, and the symmetric NMF. We also show that our algorithms include several existing methods for these NMF variants as special versions. The efficiency of the proposed approaches is empirically demonstrated on both real-world and synthetic data sets. It turns out that our algorithms compete favorably with five state-of-the-art alternating nonnegative least squares algorithms.

1  Introduction

Nonnegative matrix factorization (NMF) is the problem of approximating a given nonnegative matrix by the product of two low-rank nonnegative matrices. Given a matrix $A \in \mathbb{R}^{m \times n}_+$ and a positive integer $r < \min(m, n)$, one desires to compute two nonnegative matrix factors $U \in \mathbb{R}^{m \times r}_+$ and $V \in \mathbb{R}^{n \times r}_+$ such that
$$A \approx U V^T. \tag{1.1}$$

This problem was introduced in 1994 by Paatero and Tapper (1994) and then received considerable interest after the work of Lee and Seung (1999, 2001). Due to the nonnegativity, equation 1.1 means that the ith column of A is approximated by an additive linear combination of the r basis columns of U, weighted by the components of the ith row of V. NMF has been successfully applied in many applications such as text mining (Shahnaz, Berry, Langville, Pauca, & Plemmons, 2006; Berry, Browne, Langville, Pauca, & Plemmons, 2006), image processing (Lee & Seung, 1999), bioinformatics (Kim & Park, 2007), and spectral data analysis (Pauca, Piper, & Plemmons, 2006; Berry et al., 2006).

The tightness of approximation 1.1 can be assessed using various measures, such as the Frobenius norm, the Kullback-Leibler divergence (Lee & Seung, 1999, 2001), the Bregman divergence (Dhillon & Sra, 2006), the Earth Mover's distance metric (Sandler & Lindenbaum, 2011), and many more (see Cichocki, Zdunek, Phan, & Amari, 2009, for a comprehensive survey). In this letter, we focus on the most widely used measure, the Frobenius norm $\|A - UV^T\|_F$. The NMF problem is then formulated as the following optimization problem:
$$\min_{U \geq 0,\, V \geq 0}\; F(U, V) := \frac{1}{2}\|A - U V^T\|_F^2. \tag{1.2}$$

Problem 1.2 is a nonconvex optimization problem with respect to the pair $(U, V)$, and it has been shown that solving the NMF is NP-hard (Vavasis, 2009).

1.1  Existing Work

Research has been very active on solution methods for the NMF problem and its extensions. Work can be divided into two categories: the first, widely investigated, is based on alternating schemes, and the second, less studied, directly tackles the NMF problem by minimizing the function F with respect to the pair $(U, V)$.

The idea of alternating schemes is based on the fact that when one of the factors U or V is fixed, problem 1.2 reduces to an efficiently solvable convex nonnegativity-constrained least squares problem (NNLS). Exploiting this property, most algorithms proposed for the NMF solve, at each iteration, problem 1.2 alternately with respect to one variable U (resp. V) while the other variable V (resp. U) is fixed. Hence the algorithms differ from one another in the way they solve the NNLS subproblems, which follows two directions: some algorithms try to compute an optimal solution, while the others settle for an approximate solution (sometimes a very rough one).

Algorithms that compute optimal solutions to NNLS subproblems belong to the class of alternating nonnegative least squares (ANLS). The first proposed algorithms in this class include the projected gradient method (Lin, 2007b), the projected quasi-Newton method (Kim, Sra, & Dhillon, 2007), the active set method (Kim & Park, 2008), and the block principal pivoting method (Kim & Park, 2011). The name of these algorithms indicates the methods they use to solve the NNLS subproblems. However, these methods are quite expensive and complicated to implement. Recently, Guan, Tao, Lou, and Yuan (2012) proposed the so-called NeNMF, which applies Nesterov’s optimal gradient method to solve the NNLS subproblems. Gong and Zhang (2012) solved the NNLS subproblems by a second-order algorithm, the projected Newton method (PNM). In another work (Huang, Liu, & Zhou, 2014), a quadratic regularization projected Barzilai-Borwein (QRPBB) method is used to solve the NNLS subproblems.

Algorithms that compute only an approximate solution to the NNLS subproblem have a cheaper computational cost at each iteration than ANLS-based ones. One such procedure is the multiplicative updating (MU) algorithm proposed by Lee and Seung (1999, 2001). This algorithm has been one of the most commonly used for the NMF, but there are some issues related to its performance and convergence (Gonzalez & Zhang, 2005; Lin, 2007a, 2007b; Kim et al., 2007). Lin (2007a) presented a modified version of the MU to fix its convergence issue. However, the modification is more costly and has a slow convergence rate. Berry et al. (2006) provided a simple way to treat the NNLS problem. It first finds the solution to the unconstrained least squares problem and then projects this solution onto the nonnegative orthant. This method is simple, but it is unstable and lacks a convergence guarantee. An algorithm that iteratively updates each column of V and U per iteration, the hierarchical alternating least squares (HALS) algorithm, also called the rank-one residue iteration (RRI), was introduced in Cichocki, Zdunek, and Amari (2007), Cichocki and Phan (2009), Ho (2008), and Gillis and Glineur (2008). In fact, this method amounts to solving the NNLS subproblems approximately by executing only one step of the sequential coordinate-wise algorithm (Franc, Hlavac, & Navara, 2005). By carefully analyzing the computational cost needed at each iteration, Gillis and Glineur (2012) proposed an accelerated version of the HALS. The idea of this method is to repeat the sequential coordinate-wise algorithm several times at cheap additional cost. This is possible because the computational cost of each sequential coordinate-wise update is much less than the computational cost of the preparation step.
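To fix ideas, here is a minimal NumPy sketch of one MU iteration under the convention $A \approx UV^T$ used in this letter; the small constant eps, which guards the divisions against zero denominators, is our addition and not part of the original rule.

```python
import numpy as np

def mu_step(A, U, V, eps=1e-12):
    """One multiplicative update (Lee & Seung) for A ~ U @ V.T, with A, U, V >= 0."""
    V = V * (A.T @ U) / (V @ (U.T @ U) + eps)  # update V with U fixed
    U = U * (A @ V) / (U @ (V.T @ V) + eps)    # update U with V fixed
    return U, V
```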

In the second category of solution methods for the NMF, the algorithms simultaneously compute both factors U and V at each iteration. This approach seems to be more difficult and is therefore less developed; only a little work has been done, by Chu, Diele, Plemmons, and Ragni (2004), Lin (2007b), and Le Thi, Pham Dinh, and Vo (2014).

1.2  Our Contribution

We propose two approaches based on DC (difference of convex functions) programming and DCA (DC algorithms; see Pham Dinh & Le Thi, 1997, 1998; Le Thi & Pham Dinh, 2005), one in each of the categories noted above, for solving the NMF problem. The first approach follows the general alternating framework and applies DCA to the NNLS subproblems; the second approach directly investigates DCA on the NMF problem and simultaneously computes both factors U and V at each iteration.

Our motivation is based on the fact that DC programming and DCA, powerful tools and innovative approaches in nonconvex optimization, have been successfully applied to many (smooth or nonsmooth) large-scale nonconvex programs in various domains of applied sciences and have proved to be more robust and efficient than standard methods, particularly in data analysis and data mining (see, e.g., Le Thi, Belghiti, & Pham Dinh, 2006; Le Thi, Le Hoai, & Pham Dinh, 2007; Le Thi, Le, & Pham Dinh, 2007; Le Thi, Le, Nguyen, & Pham Dinh, 2008; Le, Le Thi, Pham Dinh, & Huynh, 2013; Le Thi, Le, Pham Dinh, & Huynh, 2013; Le Thi, Le, & Pham Dinh, 2014, 2015; Le Thi, Nguyen, & Pham Dinh, 2014; Le Thi, Vo, & Pham Dinh, 2014; Le Thi & Nguyen, 2014; Le Thi, Pham Dinh, Le, & Vo, 2015; Le, Le Thi, & Nguyen, 2015; see also the references in Le Thi, 2005).

Our contributions are multiple. First, we develop a DCA scheme for solving the NNLS convex subproblem in the alternating method. Although convex programming has been studied for about a century, an increasing amount of effort has recently been put into developing fast and scalable algorithms to deal with large-scale problems. For the NNLS problem in particular, several efficient convex optimization methods have been investigated. Since a convex problem can be formulated as a DC program, DCA can also be applied to convex programs. Exploiting the specific structure of the NNLS problem and using a suitable DC decomposition of its objective function, we propose a DCA scheme whose calculations are all explicit. Furthermore, we improve this DCA scheme by considering a like DCA algorithm that requires much less computational effort.

Second, following the general alternating method and using the DCA designed for the NNLS problem, we propose an alternating DCA-based algorithm for solving the NMF problem and prove its convergence properties. We show that with a suitable DC decomposition, our method includes many existing algorithms for the NMF as special cases.

Third, based on the conventional DCA, we develop two algorithms for the NMF problem that simultaneously compute the two factors U and V of equation 1.2 at each iteration. Exploiting the fact that even though equation 1.2 is NP-hard, its structure is quite simple (the objective function F is twice differentiable and the feasible set is simply defined by nonnegativity constraints), we propose two versions of DCA where all computations are explicit. The first requires the projection of points on a ball, while the second is based on the projection of points on a box in combination with a variable selection updating strategy. This constitutes an original contribution of the letter, given that most existing work shares the alternating scheme and little previous work has solved the whole NMF problem directly.

Fourth, we adapt the proposed DCA-based methods for several NMF variants. We also show that these algorithms cover several existing methods for these NMF problems as special cases. Preliminary experiments on both synthetic and real data sets show the efficiency of our proposed NMF algorithms.

The rest of the letter is organized as follows. In the next section we briefly present DC programming and DCA for general DC programs. Section 3 deals with the first approach, the alternating DCA-based algorithm, while section 4 concerns the second approach, with two DCA-based algorithms for solving the whole NMF problem. Several NMF variants and their solution methods are considered in section 5. The numerical experiments are presented in section 6. Section 7 concludes the letter.

Before beginning, we mention some notations we use. We use uppercase letters to denote matrices and lowercase letters for vectors or scalars. Vectors are also regarded as matrices with one column. $\mathbf{1}_{m \times n}$ denotes the $m \times n$ matrix whose entries are all 1. For a matrix X, the notation $X_{i:}$ (resp. $X_{:j}$) refers to the ith row (resp. jth column) of X. The inequality $X \geq 0$ means that all entries of X are nonnegative. Each entry of the absolute matrix $|X|$ is defined as $|X|_{ij} = |X_{ij}|$ for all $(i, j)$. The notation $\circ$ (resp. $\oslash$) denotes the element-wise product (resp. division) of matrices. The matrix inner product of two matrices is defined by $\langle X, Y \rangle = \sum_{i,j} X_{ij} Y_{ij}$. For two matrices X and Y of the same size, the inequality $X \leq Y$ means that $X_{ij} \leq Y_{ij}$ for all indices $(i, j)$, while $\max(X, Y)$ (resp. $\min(X, Y)$) denotes the matrix Z such that $Z_{ij} = \max(X_{ij}, Y_{ij})$ (resp. $Z_{ij} = \min(X_{ij}, Y_{ij})$) for all $(i, j)$. For a scalar $a$, $\max(X, a\mathbf{1})$ is shortly denoted as $\max(X, a)$.

For a matrix $X \in \mathbb{R}^{m \times n}$, $\mathrm{vec}(X)$ is the $mn$-vector obtained by "stacking" X's columns. If $B \in \mathbb{R}^{m \times n}$ and $C \in \mathbb{R}^{p \times q}$, then their Kronecker product $B \otimes C$ is the $mp \times nq$ block matrix whose $(i, j)$ block is the matrix $B_{ij} C$. For matrices B, C, and X with compatible sizes, we have the relation (Van Loan, 2000)
$$\mathrm{vec}(B X C) = (C^T \otimes B)\, \mathrm{vec}(X).$$
For a matrix X, $\|X\|_F = \sqrt{\langle X, X \rangle}$ denotes its Frobenius norm. The set of nonnegative (resp. positive) real numbers is denoted by $\mathbb{R}_+$ (resp. $\mathbb{R}_{++}$). The projection of a vector x on a closed convex subset C, denoted by $P_C(x)$, is the nearest element of C to x with regard to the Euclidean distance. When $C = \mathbb{R}^n_+$, we use the notation $[x]_+$ instead of $P_{\mathbb{R}^n_+}(x)$. It is clear that $[x]_+ = \max(x, 0)$.

2  Outline of DC Programming and DCA

DC programming and DCA constitute the backbone of smooth and nonsmooth nonconvex programming and global optimization. They address general DC programs of the form
$$(P_{dc}) \qquad \alpha = \inf\{F(x) := G(x) - H(x) : x \in \mathbb{R}^n\},$$
where G, H are lower semicontinuous proper convex functions on $\mathbb{R}^n$. Such a function F is called a DC function, and $G - H$ a DC decomposition of F, while G and H are the DC components of F. Note that a closed convex constraint $x \in C$ can be incorporated in the objective function of $(P_{dc})$ by using the indicator function on C, denoted $\chi_C$, which is defined by $\chi_C(x) = 0$ if $x \in C$ and $+\infty$ otherwise:
$$\alpha = \inf\{G(x) + \chi_C(x) - H(x) : x \in \mathbb{R}^n\}.$$
Recall that for a convex function $\theta$, the subdifferential of $\theta$ at $x_0$, denoted by $\partial \theta(x_0)$, is defined by
$$\partial \theta(x_0) := \{y \in \mathbb{R}^n : \theta(x) \geq \theta(x_0) + \langle x - x_0, y \rangle,\ \forall x \in \mathbb{R}^n\}.$$
The subdifferential generalizes the derivative in the sense that $\theta$ is differentiable at $x_0$ if and only if $\partial \theta(x_0)$ reduces to the singleton $\{\nabla \theta(x_0)\}$.
A point $x^*$ is said to be a local minimizer of $G - H$ if $G(x^*) - H(x^*)$ is finite and there exists a neighborhood $\mathcal{U}$ of $x^*$ such that
$$G(x) - H(x) \geq G(x^*) - H(x^*), \quad \forall x \in \mathcal{U}. \tag{2.1}$$
The necessary local optimality condition for the (primal) DC program $(P_{dc})$ is given by
$$\emptyset \neq \partial H(x^*) \subset \partial G(x^*). \tag{2.2}$$
Condition 2.2 is also sufficient (for local optimality) in many important classes of DC programs, for example, for polyhedral DC programs with H polyhedral convex, or when the function F is locally convex at $x^*$ (Pham Dinh & Le Thi, 1997; Le Thi & Pham Dinh, 2005).
A point $x^*$ is called a critical point of $G - H$, or a generalized Karush-Kuhn-Tucker (KKT) point of $(P_{dc})$, if
$$\partial H(x^*) \cap \partial G(x^*) \neq \emptyset. \tag{2.3}$$

2.1  Philosophy of DCA

DCA is based on local optimality conditions and duality in DC programming (for simplicity, we omit here the presentation of DC duality and the dual part of DCA). The main idea of DCA is simple: each iteration k of DCA approximates the concave part $-H$ by its affine majorization (which corresponds to taking $y^k \in \partial H(x^k)$) and minimizes the resulting convex function.

The generic DCA scheme can be described as follows:

Generic DCA Scheme

  • Initialization: Let $x^0 \in \mathbb{R}^n$ be an initial guess, $k := 0$.

  • Repeat

    • Calculate $y^k \in \partial H(x^k)$.

    • Calculate $x^{k+1} \in \arg\min\{G(x) - \langle x, y^k \rangle : x \in \mathbb{R}^n\}$.

  • Until convergence of $\{x^k\}$.
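As a concrete illustration, the sketch below instantiates this generic scheme in Python for a quadratic program over the nonnegative orthant (which will reappear in section 3), with the DC components $G(x) = \frac{\rho}{2}\|x\|^2 + \chi_{\mathbb{R}^n_+}(x)$ and $H(x) = \frac{\rho}{2}\|x\|^2 - f(x)$; the iteration count and the choice $\rho = \lambda_{\max}(Q)$ are illustrative assumptions, not prescriptions from the letter.

```python
import numpy as np

def dca_quadratic(Q, p, x0, rho=None, iters=200):
    """Generic DCA on f(x) = 0.5 x'Qx - p'x over x >= 0, with
    G(x) = rho/2 ||x||^2 + indicator(x >= 0) and H(x) = rho/2 ||x||^2 - f(x).
    H is convex whenever rho >= lambda_max(Q)."""
    if rho is None:
        rho = np.linalg.eigvalsh(Q).max()   # makes H convex
    x = x0.copy()
    for _ in range(iters):
        y = rho * x - (Q @ x - p)           # y^k = grad H(x^k)
        x = np.maximum(y / rho, 0.0)        # x^{k+1} = argmin G(x) - <x, y^k>
    return x
```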

Convergence properties of DCA and its theoretical basis can be found in Pham Dinh and Le Thi (1997) and Le Thi and Pham Dinh (2005). It is worth mentioning that:

  • DCA is a descent method (i.e., the sequence $\{F(x^k)\}$ is decreasing) with global convergence (it converges from any starting point). Moreover, DCA does not require a line search (its step size is always equal to one).

  • If $x^{k+1} = x^k$, then $x^k$ is a critical point of $G - H$. In such a case, DCA terminates at the kth iteration.

  • If the optimal value of problem $(P_{dc})$ is finite and the infinite sequence $\{x^k\}$ is bounded, then every limit point of $\{x^k\}$ is a critical point of $G - H$.

  • DCA has linear convergence for general DC programs.

A deeper insight into DCA has been given in Le Thi and Pham Dinh (2005). For instance, it is crucial to note the main feature of DCA: DCA is constructed from the DC components and their conjugates, not from the DC function f itself, which has infinitely many DC decompositions, and there are as many DCA versions as there are DC decompositions. Such decompositions play a crucial role in determining the speed of convergence, stability, robustness, and globality of computed solutions. This flexibility of DC programming and DCA is of particular interest from both a theoretical and an algorithmic point of view. Moreover, it is worth pointing out that with suitable DC decompositions, DCA generates most standard methods in convex and nonconvex programming. For instance, the four well-known methods in machine learning, EM (expectation-maximization), SLA (successive linear approximation), CCCP (convex-concave procedure), and ISTA (iterative shrinkage-thresholding algorithm), are particular cases of DCA (see Pham Dinh & Le Thi, 1998).

For a complete study of DC programming and DCA, readers are referred to Pham Dinh and Le Thi (1997, 1998, 2014) and Le Thi and Pham Dinh (2005). The solution of a nonconvex program $(P_{dc})$ by DCA should be composed of two stages: the search for an appropriate DC decomposition of f and the search for a good initial point.

In the last decade, a variety of work on machine learning based on DCA has been conducted. The efficiency and the scalability of DCA have been demonstrated in many studies (see Le et al., 2013; Le Thi, Pham Dinh et al., 2015; and the list of references at http://lita.sciences.univ-metz.fr/lethi/DCA.html). This successful work on DCA motivated us to investigate it for solving the NMF problem.

3  The First Approach: Alternating Method Based on DCA

The alternating methods for solving the NMF problem consist of holding U fixed and optimizing with respect to V, and then holding V fixed and optimizing with respect to U. More precisely, these methods share the following general scheme:

  1. Initialize $(U^0, V^0)$ with nonnegative elements.

  2. For $k = 0, 1, 2, \ldots$ do

    1. Fix $U^k$ and find $V^{k+1}$ by solving
      $$\min\{\tfrac{1}{2}\|A - U^k V^T\|_F^2 : V \geq 0\}. \tag{3.1a}$$
    2. Fix $V^{k+1}$ and find $U^{k+1}$ by solving
      $$\min\{\tfrac{1}{2}\|A^T - V^{k+1} U^T\|_F^2 : U \geq 0\}. \tag{3.1b}$$

Similarly, one may exchange the order of updating U and V.

Problems 3.1a and 3.1b take the form of a nonnegativity-constrained least squares (NNLS) problem with multiple right-hand sides,
$$\min\{\tfrac{1}{2}\|U V^T - A\|_F^2 : V \geq 0\},$$
that is separable with respect to the rows of V. Hence, solving the last problem amounts to solving n NNLS problems with a single right-hand side of the form
$$\min\{\tfrac{1}{2}\|U v - a\|^2 : v \geq 0\},$$
where $v = (V_{j:})^T$ and $a = A_{:j}$ for $j = 1, \ldots, n$.

In section 3.1, we develop DCA-based schemes for solving the NNLS problem with a single right-hand side and then with multiple right-hand sides. Since problems 3.1a and 3.1b are similar (replace A with $A^T$ and exchange the roles of U and V), we present the solution method for problem 3.1a only.

3.1  DCA for Solving the NNLS Problem

3.1.1  DCA for Solving the NNLS Problem with a Single Right-Hand-Side Vector

Consider the NNLS problem with a single right-hand-side vector,
$$\min\{f(v) := \tfrac{1}{2}\|U v - a\|^2 : v \geq 0\}, \tag{3.2}$$
where $U \in \mathbb{R}^{m \times r}$ and $a \in \mathbb{R}^m$. By letting $Q = U^T U$ and $p = U^T a$, we have $\nabla f(v) = Q v - p$.
The KKT conditions of problem 3.2 are given by
$$v \geq 0, \quad \nabla f(v) \geq 0, \quad v \circ \nabla f(v) = 0, \tag{3.3}$$
where $\nabla f(v) = Q v - p$. For $v \geq 0$, these conditions are equivalent to $I(v) = \emptyset$, where $I(v)$ is the set of the indices violating the KKT conditions, equation 3.3, namely,
$$I(v) = \big\{i \in \{1, \ldots, r\} : \min\{v_i, \nabla f(v)_i\} \neq 0\big\}. \tag{3.4}$$
Let $d \in \mathbb{R}^r_{++}$ be such that $\mathrm{diag}(d) - Q$ is positive semidefinite. Then the objective function f can be expressed as a DC function,
$$f(v) = g(v) - h(v),$$
where
$$g(v) = \frac{1}{2} \langle \mathrm{diag}(d)\, v, v \rangle, \qquad h(v) = \frac{1}{2} \langle (\mathrm{diag}(d) - Q)\, v, v \rangle + \langle p, v \rangle - \frac{1}{2}\|a\|^2.$$
Thus, problem 3.2 can be reformulated as a DC program:
$$\min\{g(v) + \chi_{\mathbb{R}^r_+}(v) - h(v) : v \in \mathbb{R}^r\}. \tag{3.5}$$
Applying DCA to equation 3.5, we have to determine, at each iteration k,
$$y^k = \nabla h(v^k) = (\mathrm{diag}(d) - Q)\, v^k + p,$$
and then compute the next iterate by
$$v^{k+1} = \arg\min\{g(v) - \langle y^k, v \rangle : v \geq 0\} = [\,y^k \oslash d\,]_+.$$
To simplify the presentation, we drop the index k; then the update procedure can be briefly described by
$$v^+ = \max\big(0,\; v - (Q v - p) \oslash d\big). \tag{3.6}$$
A common choice of d is $d = \rho \mathbf{1}$ with $\rho \geq \lambda_{\max}(Q)$, the largest eigenvalue of Q (Pham Dinh & Le Thi, 1998). With this choice, the update rule, equation 3.6, can be interpreted as a projected gradient scheme with the constant step size $1/\rho$. However, by modifying d, we get different DC decompositions of f that may influence the efficiency of the corresponding DCA. The following result suggests a more flexible way to choose d.
Lemma 1.
Given an $r \times r$ symmetric matrix M and a vector $x \in \mathbb{R}^r_{++}$, the matrix $\mathrm{diag}(d) - M$ with $d_i = (|M| x)_i / x_i$, $i = 1, \ldots, r$, is positive semidefinite.
Proof.
For any $z \in \mathbb{R}^r$, using the symmetry of M and the inequality $\frac{x_j}{x_i} z_i^2 + \frac{x_i}{x_j} z_j^2 \geq 2 |z_i| |z_j|$, we have
$$\langle \mathrm{diag}(d)\, z, z \rangle = \sum_{i,j} |M_{ij}| \frac{x_j}{x_i} z_i^2 = \frac{1}{2} \sum_{i,j} |M_{ij}| \Big( \frac{x_j}{x_i} z_i^2 + \frac{x_i}{x_j} z_j^2 \Big) \geq \sum_{i,j} |M_{ij}|\, |z_i|\, |z_j| \geq \langle M z, z \rangle.$$
Thus, $\mathrm{diag}(d) - M$ is positive semidefinite.
From lemma 1, for any $x \in \mathbb{R}^r_{++}$, we can choose
$$d_i = \frac{(|Q| x)_i}{x_i} + \varepsilon, \quad i = 1, \ldots, r,$$
where $\varepsilon \geq 0$ is a safety parameter that guarantees the positivity of d. Then the update rule, equation 3.6, becomes
$$v^+ = \max\big(0,\; v - x \circ (Q v - p) \oslash (|Q| x + \varepsilon x)\big), \tag{3.7}$$
or, equivalently,
$$v_i^+ = \max\Big(0,\; v_i - \frac{x_i \big((Q v)_i - p_i\big)}{(|Q| x)_i + \varepsilon x_i}\Big), \quad i = 1, \ldots, r.$$
Hence, by varying x, we get various vectors d with different component values.

Improvement of Rule 3.7: A like DCA algorithm. Despite the condition $d \in \mathbb{R}^r_{++}$, the update rule, equation 3.7, is well defined for any d with components in $(0, +\infty]$, with the convention that $v_i$ is kept unchanged when $d_i = +\infty$. Moreover, we observe that if the index $i \notin I(v)$, that is, $\min\{v_i, \nabla f(v)_i\} = 0$, then for any $d_i > 0$, the update rule, equation 3.7, does not affect the value of $v_i$. Thus, at each iteration, we can discard the variables $v_i$ with $i \notin I(v)$. Consequently, to reduce the computational effort, it is reasonable to consider in equation 3.7 only the variables $v_i$ with $i \in I(v)$, and to let $d_i = +\infty$ when $v_i$ is not under consideration. The following result proves that this modified DCA scheme is still a descent method, like the standard DCA.

Lemma 2.
Suppose that d satisfies the condition
formula
3.8
Then if v is not a critical point of problem 3.2, we have $f(v^+) < f(v)$, where $v^+$ is computed with the update rule, equation 3.7.
Proof.

The assumption that v is not a critical point of problem 3.2 implies that $I(v) \neq \emptyset$. Considering the DC program, equation 3.5, restricted to the variables $v_i$, $i \in I(v)$, the assertion of lemma 2 is a consequence of the general convergence properties of DCA.

To exploit the effect of the DC decomposition on the resulting DCA, we can further refine the modified DCA by computing the vector d dynamically during DCA's iterations. This procedure can be seen as the use of local DC decompositions; that is, the DC decomposition changes from one iteration to the next.

A dynamic computation of d. For , define
formula
Then $v_t^+$ is the updated value of $v_t$ obtained by applying the update rule 3.7 with $x = e_t$, the tth canonical basis vector in $\mathbb{R}^r$, and $\delta_t$ is the corresponding displacement. Define $\delta \in \mathbb{R}^r$ by $\delta_t = v_t^+ - v_t$ for $t \in I(v)$ and $\delta_t = 0$ for all $t \notin I(v)$. For $t \in I(v)$, if we update only the tth component of v, the objective function f will decrease by an amount of
formula
Let
formula
and
formula
3.9
Let , where v0 is the starting point and is a small positive number. Then the vector in the update rule, equation 3.7, is defined by
formula
3.10
It is clear that if and , d defined by equation 3.10 satisfies condition 3.8.

In the sequel we will use equation 3.10 to define d; the resulting algorithm is named the like DCA algorithm.
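Because several of the quantities above are only sketched here, the following Python fragment is an illustrative sketch rather than a transcription of rule 3.10: it updates only the coordinates currently violating the KKT conditions (the set I(v)), freezes the others via $d_t = +\infty$, and uses the simple diagonal choice $d_t = Q_{tt} + \text{eps}$ (which corresponds to $x = e_t$ in rule 3.7); the tolerances eps and tol are our assumptions.

```python
import numpy as np

def like_dca_step(Q, p, v, eps=1e-9, tol=1e-8):
    """One variable-selection update for min 0.5 v'Qv - p'v over v >= 0."""
    grad = Q @ v - p                                  # gradient of f at v
    violated = np.minimum(v, grad) ** 2 > tol         # indices in I(v)
    d = np.where(violated, np.diag(Q) + eps, np.inf)  # d_t = +inf freezes v_t
    return np.maximum(v - grad / d, 0.0)              # update rule 3.7
```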

Remark 1.
Consider the following general form of problem 3.2:
$$\min\{f(v) := \tfrac{1}{2} \langle Q v, v \rangle - \langle p, v \rangle : v \geq 0\}, \tag{3.11}$$
where $Q \in \mathbb{R}^{r \times r}$ is a symmetric matrix and $p \in \mathbb{R}^r$. Then the update rule, equation 3.7, and lemma 2 are still valid for equation 3.11. We can also replace $|Q|$ in the denominator of the update rule 3.7 by a symmetric matrix $\bar{Q}$ such that $\bar{Q} \geq |Q|$.

3.1.2  DCA for Solving the NNLS Problem with Multiple Right-Hand-Side Vectors

Now we consider the NNLS problem with multiple right-hand-side vectors,
$$\min\{F(V) := \tfrac{1}{2}\|U V^T - A\|_F^2 : V \geq 0\}, \tag{3.12}$$
where $U \in \mathbb{R}^{m \times r}$, $A \in \mathbb{R}^{m \times n}$, and $V \in \mathbb{R}^{n \times r}$.
First, note that by letting $v = \mathrm{vec}(V^T)$, $a = \mathrm{vec}(A)$, and $\bar{U} = I_n \otimes U$, we can express
$$F(V) = \tfrac{1}{2}\|\bar{U} v - a\|^2.$$
Thus, the NNLS problem with multiple right-hand sides, equation 3.12, can be considered an NNLS problem with a single right-hand side of the form in equation 3.2. However, the size of the resulting problem is very large: nr variables. Therefore, rather than considering problem 3.12 directly in the form of equation 3.2, we will decouple the former into several problems like equation 3.2.
In fact, the objective function in equation 3.12 can be written as
$$F(V) = \sum_{i=1}^{n} \frac{1}{2}\|U v_i - a_i\|^2,$$
where $v_i = (V_{i:})^T$ and $a_i = A_{:i}$, $i = 1, \ldots, n$. Therefore, solving equation 3.12 amounts to solving n separate problems of the form of equation 3.2, namely,
$$\min\{\tfrac{1}{2}\|U v_i - a_i\|^2 : v_i \geq 0\}, \quad i = 1, \ldots, n.$$
Let $Q = U^T U$ and $P = A^T U$, and let $D \in \mathbb{R}^{n \times r}$ be the matrix whose ith row is the vector d associated with the ith problem. Then equation 3.7 gives us the update rule,
$$V^+ = \max\big(0,\; V - (V Q - P) \oslash D\big), \tag{3.13}$$
where the ith row of $V Q - P$ is the gradient of the ith separate problem at $v_i$.

Now let D be computed row by row by equation 3.10. Then the like DCA for solving equation 3.12 is described in algorithm DCA-NNLS:

DCA-NNLS: DCA for Solving NNLS problem 3.12

  1. Input: A, U, and an initial point $V^0 \geq 0$.

  2. Output: V, an approximate solution to problem 3.12.

  3. Compute $Q = U^T U$ and $P = A^T U$, and let $l = 1$.

  4. while … do

    1. Compute $W^l = V^{l-1} Q - P$.

    2. Compute $D^l$ row by row using equation 3.10 and set $V^l = \max(0, V^{l-1} - W^l \oslash D^l)$.

    3. If there exists a component …, then set
      formula
      Else break.
    4. $l \leftarrow l + 1$.

  5. end while

  6. $V = V^l$.
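For concreteness, here is a compact Python sketch of this procedure with the preparation step $Q = U^T U$, $P = A^T U$ done once; the inner iteration cap, the KKT tolerance, and the diagonal choice of D (as in the sketch after rule 3.10) are our simplifying assumptions.

```python
import numpy as np

def dca_nnls(A, U, V0, inner_iters=10, eps=1e-9, tol=1e-8):
    """Approximately solve min 0.5 ||U V' - A||_F^2 over V >= 0 by
    repeating the row-wise update rule 3.13 with variable selection."""
    Q, P = U.T @ U, A.T @ U                  # preparation step, done once
    V = V0.copy()
    for _ in range(inner_iters):
        W = V @ Q - P                        # gradient of F at V
        viol = np.minimum(V, W) ** 2 > tol   # entries violating the KKT conditions
        if not viol.any():                   # V is (numerically) a critical point
            break
        D = np.where(viol, np.diag(Q)[None, :] + eps, np.inf)
        V = np.maximum(V - W / D, 0.0)       # update rule 3.13
    return V
```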

The following assertion is a direct consequence of lemma 2:

Lemma 3.
Suppose that D verifies the condition
formula
3.14
where … , and $V^+$ is computed via the update rule, equation 3.13. Then if V is not a critical point of equation 3.12, that is, $I(V) \neq \emptyset$, we have $F(V^+) < F(V)$; otherwise $V^+ = V$.

Consequently, let $V^0$ and V be the input and output of algorithm DCA-NNLS. If $V^0$ is not a solution of problem 3.12, then $F(V) < F(V^0)$; otherwise $V = V^0$.

Remark 2.
In DCA-NNLS, we need to compute $W^l = V^{l-1} Q - P$ at each inner iteration. For $l \geq 2$, instead of using this formula directly, we can update $W^l$ from $W^{l-1}$ as follows:
$$W^l = W^{l-1} + (V^{l-1} - V^{l-2})\, Q.$$
In this way, the computational cost is proportional to the number of rows of V modified at the previous iteration, which would be much less than n by our construction; this will reduce the computational effort.

3.2  The Alternating DCA-Based Algorithm and Its Convergence Properties

The alternating DCA-based algorithm for solving the NMF problem, equation 1.2, is described in the ADCA algorithm:

ADCA: Alternating DCA-Based Algorithm for Solving the NMF Problem

  1. Initialize $(U^0, V^0)$ with nonnegative elements, $k = 0$.

  2. repeat

    1. Compute $V^{k+1} = \text{DCA-NNLS}(A, U^k, V^k)$.

    2. Compute $U^{k+1} = \text{DCA-NNLS}(A^T, V^{k+1}, U^k)$; $k \leftarrow k + 1$.

  3. until Stopping criterion is satisfied.
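Reusing the dca_nnls sketch given after algorithm DCA-NNLS, the outer loop might look as follows; the fixed outer iteration count stands in for the unspecified stopping criterion.

```python
def adca(A, U, V, outer_iters=100):
    """Sketch of ADCA: alternate DCA-NNLS on V (U fixed) and on U (V fixed)."""
    for _ in range(outer_iters):
        V = dca_nnls(A, U, V)      # approximately solve 3.1a
        U = dca_nnls(A.T, V, U)    # approximately solve 3.1b
    return U, V
```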

Now we will prove that any limit point of the sequence $\{(U^k, V^k)\}$ generated by the ADCA algorithm is a critical point of problem 1.2, that is, one that satisfies the KKT conditions. Formally, the KKT optimality conditions for problem 1.2 are given as
$$U \geq 0, \quad \nabla_U F(U, V) \geq 0, \quad U \circ \nabla_U F(U, V) = 0, \tag{3.15a}$$
$$V \geq 0, \quad \nabla_V F(U, V) \geq 0, \quad V \circ \nabla_V F(U, V) = 0, \tag{3.15b}$$
where
$$\nabla_U F(U, V) = (U V^T - A)\, V, \qquad \nabla_V F(U, V) = (U V^T - A)^T U.$$
This means that $(U, V)$ is a KKT point of problem 1.2 if and only if V and U are KKT points of the problems of the form 3.12,
$$\min\{\tfrac{1}{2}\|U V^T - A\|_F^2 : V \geq 0\} \quad \text{and} \quad \min\{\tfrac{1}{2}\|V U^T - A^T\|_F^2 : U \geq 0\},$$
respectively.

First, let us prove a general result, which will be used for deriving the convergence of ADCA.

Theorem 1.
Consider the optimization problem
$$\min\{f(x) : x \in C\}, \tag{P}$$
where f is a continuous real-valued function on the closed convex set $C \subseteq \mathbb{R}^n$. Suppose that $\{x^k\} \subset C$ is a sequence satisfying the following conditions:
  a. The sequence $\{f(x^k)\}$ is monotonically decreasing.

  b. For any subsequence $\{x^{k_j}\}$, there exist a continuous function h and a sub-subsequence $\{x^{k_{j_l}}\}$ such that
    $$x^{k_{j_l} + 1} = h(x^{k_{j_l}}) \quad \text{for all } l,$$
    and $f(h(x)) < f(x)$ if x is not a critical point of the problem (P).

Then every limit point of $\{x^k\}$ is a critical point of the problem (P).

Proof.

Assume that $x^*$ is a limit point of $\{x^k\}$ and is not a critical point. Then there is a subsequence $\{x^{k_j}\}$ such that $\lim_{j \to \infty} x^{k_j} = x^*$. By the assumption, there are a function h and a sub-subsequence $\{x^{k_{j_l}}\}$ that satisfy condition b.

Using the continuity of f and condition a, we get
$$\lim_{k \to \infty} f(x^k) = f(x^*).$$
On the other hand,
$$\lim_{l \to \infty} x^{k_{j_l} + 1} = \lim_{l \to \infty} h(x^{k_{j_l}}) = h(x^*).$$
Thus, $h(x^*)$ is also a limit point of $\{x^k\}$, and we also have $\lim_{k \to \infty} f(x^k) = f(h(x^*))$. Consequently, $f(h(x^*)) = f(x^*)$. But condition b and the assumption that $x^*$ is not a critical point imply that $f(h(x^*)) < f(x^*)$, which is a contradiction. The theorem is proved.

The following lemma is useful to prove that the sequence generated by ADCA satisfies condition b of theorem 1.

Lemma 4.

There is a finite set $\mathcal{H}$ of continuous functions such that for any A, U, and $V^0$, the output V computed by algorithm DCA-NNLS satisfies $V = h(U, V^0)$ for some $h \in \mathcal{H}$.

Proof.
It is easy to see that the functions () defined by
formula
where , are continuous functions from .
Let be the set of functions from given by
formula
Then is a finite set of continuous functions. Moreover, it is clear that the function T, defined by
formula
is continuous with regard to . Therefore,
formula
where stands for the composite function (l times), is a finite set of continuous functions from .

By the construction of DCA-NNLS, if U and $V^0$ are the inputs and V the output of DCA-NNLS, then there is an $h \in \mathcal{H}$ such that $V = h(U, V^0)$.

Now we are in a position to prove the convergence of ADCA.

Theorem 2.

Every limit point of the sequence $\{(U^k, V^k)\}$ generated by algorithm ADCA is a critical point of the NMF problem, equation 1.2.

Proof.

According to lemma 3, the sequence generated by algorithm ADCA verifies condition a of theorem 1.

By lemma 4, there are finite families $\mathcal{H}_1$ and $\mathcal{H}_2$ of continuous functions such that, for every k,
$$V^{k+1} = h_1(U^k, V^k), \qquad U^{k+1} = h_2(V^{k+1}, U^k),$$
for some $h_1 \in \mathcal{H}_1$, $h_2 \in \mathcal{H}_2$.
For any $(h_1, h_2) \in \mathcal{H}_1 \times \mathcal{H}_2$, define a continuous function h by
$$h(U, V) = \big(h_2(h_1(U, V), U),\; h_1(U, V)\big).$$
Then we have
$$(U^{k+1}, V^{k+1}) = h(U^k, V^k).$$
Since the sets $\mathcal{H}_1$ and $\mathcal{H}_2$ are finite, so is the set of functions h so constructed. This implies that the sequence generated by algorithm ADCA verifies condition b of theorem 1. Applying theorem 1, we deduce that every limit point of the sequence generated by algorithm ADCA is a critical point of the NMF problem, equation 1.2.

It is worth noting that this convergence analysis is valid not only for our specific choice of D but also for any other D satisfying condition 3.14 and the assumption that there is a finite set of continuous functions such that each $D^l$ in algorithm DCA-NNLS can be expressed through one of them. The latter assumption simply requires that each element of D be computed from U and V via a finite number of formulas. It is easy to see that this assumption is satisfied when D is defined by equation 3.10 or constructed by equation 3.17. The two latter constructions of D also verify condition 3.14, so the convergence of the corresponding NMF algorithms is guaranteed.

3.3  Link between ADCA and Existing Methods

We will see that the algorithm ADCA includes some well-known alternating methods for the NMF problem. As indicated in section 1, the alternating algorithms differ from one another in how they solve the NNLS subproblems. We show below that the update rules for solving the NNLS problem, equation 3.12, in some existing NMF algorithms are nothing other than our update rule, equation 3.13, with a suitable choice of D.

3.3.1  Multiplicative Updating (MU) Algorithm (Lee & Seung, 2001)

Since $Q = U^T U$ is nonnegative, we have $|Q| = Q$. If we take for x the rows of V themselves and let the safety parameter $\varepsilon = 0$, the update rule, equation 3.13, becomes
$$V^+ = V \circ P \oslash (V Q). \tag{3.16}$$
This is the multiplicative update rule used in Lee and Seung (2001). Note that this choice of D may not satisfy condition 3.14. Thus, the convergence of the MU algorithm cannot be guaranteed.

3.3.2  Modified MU Algorithm (Lin, 2007a)

Let D be computed by
formula
3.17
with . It is easy to see that for any . Then equation 3.13 becomes
formula
3.18
By replacing with (this is another way of bounding the denominator away from zero), we get back the update rule proposed in Lin (2007a).

3.3.3  Hierarchical Alternating Least Squares (HALS) Algorithm (Ho, 2008; Cichocki & Phan, 2009)

For $t = 1, \ldots, r$, by successively letting $x = e_t$, where $\{e_1, \ldots, e_r\}$ is the canonical basis of $\mathbb{R}^r$, and conducting the update rule, equation 3.13, we get back the update rule used in the RRI/HALS algorithm (Ho, 2008; Cichocki & Phan, 2009):
$$V_{:t} \leftarrow \max\Big(0,\; V_{:t} - \frac{(V Q)_{:t} - P_{:t}}{Q_{tt}}\Big), \quad t = 1, \ldots, r. \tag{3.19}$$
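In Python, one HALS/RRI sweep reads as follows; the sketch assumes no column of U is identically zero, so that the diagonal entries $Q_{tt}$ are positive.

```python
import numpy as np

def hals_pass(A, U, V):
    """One HALS/RRI sweep over the columns of V (update rule 3.19)."""
    Q, P = U.T @ U, A.T @ U
    for t in range(V.shape[1]):   # rule 3.13 with x = e_t, t = 1, ..., r
        V[:, t] = np.maximum(V[:, t] - (V @ Q[:, t] - P[:, t]) / Q[t, t], 0.0)
    return V
```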

4  The Second Approach: DCA Applied on the Whole NMF Problem

Even though problem 1.2 is NP-hard, its structure is quite suitable for investigating DCA: the objective function F is twice differentiable, and the feasible set is simply defined by nonnegativity constraints. For a DC program with a twice differentiable objective function, it is well known that the projection DC decomposition results in very inexpensive DCAs when the feasible set is a ball or a box (see, e.g., Le et al., 2013; Le Thi, Le Hoai et al., 2007; Le Thi, Le et al., 2007, 2013; Le Thi & Nguyen, 2014). In this section we propose two such versions of DCA. The first requires the projection of points on a ball, and the second is based on the projection of points on a box in combination with a variable selection updating strategy similar to the technique developed in section 3.1.

4.1  The First DCA for the Whole NMF Problem

Attempting to design efficient DCAs based on the projection DC decomposition, we first construct two balls bounding the variables U and V (naturally, the smaller the balls are, the better DCA performs) and then restrict the feasible set of the NMF problem to the product of these balls.

Let and define the sets
formula
Consider the problem
formula
4.1
The following result shows the equivalence between problems 4.1 and 1.2.
Lemma 5.
(i) Suppose that is a critical point (resp. optimal solution) of problem 1.2 and defined by
formula
where
formula
Then is a critical point (resp. optimal solution) of both problems 1.2 and 4.1.

(ii) If is a critical point (resp. optimal solution) of problem 4.1, then it is also a critical point (resp. optimal solution) of problem 1.2.

Proof.
First, we mention the KKT optimality conditions for problem 4.1. There exist and , such that
formula
4.2
(i) By the definition of we have , so and
formula
It is trivial that if is an optimal solution to problem 4.1, then it is also an optimal solution to problem 1.2. Now we suppose that satisfies conditions 3.15. Since , it is easy to see that satisfies conditions 3.15. Then we have
formula
This implies that , so . Moreover, we have for all , so
formula
This means that , so satisfies equation 4.2 with .

(ii) Following part i, it is easy to see that if a point is an optimal solution to problem 4.1, then it is an optimal solution to problem 1.2.

Assuming now that satisfies conditions 4.2, we will show that , and so verifies equation 3.15. If, on the contrary, , we have and, from the first three properties in equation 4.2,
formula
Combining this and the fifth property in equation 4.2, we deduce that and . For any , from the first three properties in equation 4.2, we have
formula
Therefore, for any . Similar to the proof in condition i, we have
formula
Moreover, we have
formula
These imply that , which contradicts . Thus, . Symmetrically, we also have . Then satisfies equation 3.15.

By virtue of lemma 5, instead of solving problem 1.2 on the whole space, we only need to consider the restricted problem 4.1. In the rest of this section, we reformulate problem 4.1 as a DC program and develop DCA for solving it.

For any and for any and , we have
formula
where
formula
and
formula
are the first-order and the second-order derivatives of F, respectively.
For any , we have the estimation
formula
Therefore, for , the function
formula
is convex on the set .
Now let
formula
which is clearly a convex function. Then problem 4.1 can be written as the following DC program:
formula
4.3
DCA applied to equation 4.3 consists of computing two sequences and ,
formula
4.4a
formula
4.4b
or, more precisely,
formula
Hence, (resp. ) is the projection of the point (resp. ) onto (resp. ) and can be explicitly computed as follows:
formula
4.5a
formula
4.5b
Finally, DCA applied to problem 4.1 is summarized in algorithm DCA1:

DCA1: The First DCA for Solving the Whole NMF Problem

  1. Initialize $(U^0, V^0)$, $k = 0$.

  2. repeat

    1. Compute using equation 4.4

    2. Compute using equation 4.5

  3. until Stopping criterion is satisfied.
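A minimal NumPy sketch of this scheme follows. It assumes the feasible sets are intersections of the nonnegative orthant with Frobenius balls of radii r_u and r_v (obtained from lemma 5), in which case the projection splits into taking the nonnegative part and then rescaling into the ball; rho is assumed to dominate the curvature of F on the feasible set, and the iteration count is illustrative.

```python
import numpy as np

def project(X, radius):
    """Projection onto {X >= 0, ||X||_F <= radius}."""
    X = np.maximum(X, 0.0)                    # nonnegative part
    n = np.linalg.norm(X)
    return X if n <= radius else X * (radius / n)

def dca1(A, U, V, rho, r_u, r_v, iters=500):
    """Sketch of DCA1: a gradient step of length 1/rho on F(U, V),
    followed by the explicit projections of equation 4.5."""
    for _ in range(iters):
        R = U @ V.T - A                       # residual U V' - A
        # simultaneous update of both factors (old U, V used on the right)
        U, V = project(U - (R @ V) / rho, r_u), project(V - (R.T @ U) / rho, r_v)
    return U, V
```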

Theorem 3.

The sequence $\{(U^k, V^k)\}$ generated by algorithm DCA1 has at least one limit point, and every limit point of this sequence is a stationary point of problem 1.2.

Proof.

The first conclusion is trivial because the sequence is bounded. The second conclusion follows from lemma 5 and the general convergence properties of DCA.

4.2  The Second DCA for the Whole NMF Problem

In the first DCA, we have to compute all the variables $U_{ij}$ and $V_{ij}$ at each iteration. Although the projection of points on a ball is explicit, DCA1 requires substantial computational effort if we want to determine an optimal solution of the convex subproblem at each iteration. In order to reduce this effort, we investigate a variable selection strategy similar to the like DCA for the NNLS problem; that is, at each iteration of the DCA scheme, some variables are discarded and the others are updated. Such a procedure is possible if the subproblem at each iteration is separable in its variables. This suggests using the projection DC decomposition on rectangles. For this purpose, we design local DC decompositions of F by bounding the variables to be updated on a box and then decomposing F on this reduced domain, as we show below.

Denote by $\{(U^k, V^k)\}$ the sequence constructed by DCA for solving the NMF problem. In the following, we describe the procedure to compute $(U^{k+1}, V^{k+1})$ from $(U^k, V^k)$. Let the following quantities be defined by
formula
and let
formula
4.6
Consider the sets
formula
and let
formula
4.7
Then for all , and , we have
formula
On the other hand, by virtue of lemma 1, we have
formula
Therefore,
formula
Consider now functions and defined by
formula
From the above estimation of , for any and , we have
formula
Therefore, and are convex functions on . Then the NMF problem, equation 1.2, is a DC program of the form
formula
4.8
Hence, each iteration of DCA applied to equation 4.8 consists of the two following steps:
  1. Compute by
    formula
  2. Compute
    formula
    or, equivalently,
    formula
    formula
The remaining question is the choice of the matrices $D_U$ and $D_V$. For the same reason as in section 3.1.2, we see that $D_U$ and $D_V$ in equations 4.9 and 4.10 need not be strictly positive; instead, they must only satisfy
formula
where … . As in algorithm DCA-NNLS, we will compute $D_U$ and $D_V$ row by row using equation 3.10. Then only the entries $U_{it}$ (resp. $V_{jt}$) selected by this rule are updated at iteration k.

Finally, the second DCA for the NMF problem is described in algorithm DCA2:

DCA2: The Second DCA for Solving the Whole NMF Problem

  1. Initialize $(U^0, V^0)$ with nonnegative elements, $k = 0$.

  2. repeat

    1. Compute and by equation 4.6

    2. Compute

    3. Compute row by row using equation 3.10

    4. Compute and using equations 4.9 and 4.10 with and .

  3. until Stopping criterion is satisfied.

With the same arguments as in lemmas 3 and 4 and theorem 2, together with the fact that the quantity computed in step 2 of algorithm DCA2 depends on $(U^k, V^k)$, we have the following convergence result:

Theorem 4.

Let $\{(U^k, V^k)\}$ be the sequence generated by algorithm DCA2. Then

  • $\{F(U^k, V^k)\}$ is monotonically decreasing.

  • $F(U^{k+1}, V^{k+1}) < F(U^k, V^k)$ if $(U^k, V^k)$ is not a critical point of the NMF problem 1.2.

  • Every limit point of the sequence $\{(U^k, V^k)\}$ is a critical point of the NMF problem 1.2.

In practice, computing the bound by equation 4.7 is difficult. We can replace it by an upper bound given by
formula
4.11
Then the convergence result in theorem 4 still holds.

5  Solving NMF Variants

In this section, we adapt our proposed approaches for solving several NMF variants, including the nonnegative factorization, the smooth regularization NMF, the sparse regularization NMF, the multilayer NMF, the convex/convex-hull NMF, and the symmetric NMF. For the first five classes of problems, we use an alternating framework similar to scheme 3.1 and the method in section 3.1 for solving the subproblems. We also show that some existing algorithms are actually special cases of our method. For solving the symmetric NMF, we develop a DCA-based algorithm similar to the method in section 4.1 for the original NMF.

To simplify the presentation, we do not describe the complete alternating procedure as algorithms DCA-NNLS and ADCA. Instead, we only derive the update rule that is the essential part in the alternating framework.

5.1  Nonnegative Factorization

Gillis and Glineur (2008) introduced the problem of nonnegative factorization (NF), which is the same as NMF except that the matrix A to factorize can be any real matrix. We still consider problem 1.2 but do not assume $A \geq 0$.

Note that the update rules derived in section 3.1.1 (resp. 3.1.2) do not use the assumption $a \geq 0$ (resp. $A \geq 0$). Thus, they can be applied to the NF problem. Below, we propose another update rule for the NF problem.

Reconsider the NNLS problem, equation 3.2, and the update rule, equation 3.6. Suppose that $p = b - c$, where b and c are nonnegative vectors. It is clear that by letting
$$d_i = \frac{(Q v)_i + c_i}{v_i}, \quad i = 1, \ldots, r,$$
we have $\mathrm{diag}(d) - Q$ positive semidefinite. Then the update rule, equation 3.6, becomes
$$v^+ = v \circ b \oslash (Q v + c). \tag{5.1}$$
As in the case of NMF, we accept the convention $0/0 = 0$ in equation 5.1.
Let us now consider the NF problem. The real matrix A can always be written as the difference of two nonnegative matrices B and C: $A = B - C$. Deriving from equation 5.1, we get the following update rule for factor V:
$$V^+ = V \circ (B^T U) \oslash (V Q + C^T U).$$
Especially by taking $B = \max(A, 0)$ and $C = \max(-A, 0)$, we have
$$V^+ = V \circ \big(\max(A, 0)^T U\big) \oslash \big(V Q + \max(-A, 0)^T U\big).$$
This multiplicative rule was also introduced in Gillis and Glineur (2008).

5.2  Regularized Nonnegative Matrix Factorization

For various application purposes, additional regularizations are incorporated to impose prior knowledge. The regularized NMF problem takes the form
$$\min\{F(U, V) + \lambda_1 J_1(U) + \lambda_2 J_2(V) : U \geq 0,\ V \geq 0\}, \tag{5.2}$$
where the functions $J_1$ and $J_2$ are regularization terms enforcing certain application-dependent constraints on the solutions, and $\lambda_1, \lambda_2 \geq 0$ are trade-off parameters.
Applying the alternating method to problem 5.2, we have to compute U and V alternately at each iteration by solving two subproblems. For instance, the subproblem to compute V when U is fixed takes the form
$$\min\{f(V) := \tfrac{1}{2}\|U V^T - A\|_F^2 + \lambda_2 J_2(V) : V \geq 0\}. \tag{5.3}$$
We will derive the update rules for computing a solution to this problem in specific cases.

5.2.1  Smooth Regularization

The smooth regularization is defined by
$$J_2(V) = \frac{1}{2} \langle L V, V \rangle,$$
where the symmetric matrix L is a regularization operator. The objective function f of problem 5.3 can be represented as
formula
where P is a permutation matrix such that $\mathrm{vec}(V) = P\, \mathrm{vec}(V^T)$.
Therefore, problem 5.3 has the form of equation 3.11. According to equation 3.7, for , an update rule of V will be
formula
5.4
where .
If L is represented as $L = L_1 - L_2$, with $L_1$ and $L_2$ being symmetric nonnegative matrices, we have
formula
As indicated in remark 1, we can replace Q with a dominating symmetric matrix in the denominator of equation 5.4. Thus, an update rule of V takes the form
formula
5.5

Below, we study the update rules, equations 5.4 and 5.5, for some standard smooth regularizations and show that existing methods are special cases of our methods.

Tikhonov regularization (Berry et al., 2006). If $L = I_n$, then $J_2(V) = \frac{1}{2}\|V\|_F^2$, which is known as the Tikhonov regularization. The update rule, equation 5.4, becomes
formula
or, under the matrix form,
formula
When x is taken as V itself and $\varepsilon = 0$, the preceding update rule becomes
$$V^+ = V \circ (A^T U) \oslash (V U^T U + \lambda_2 V).$$
This is the multiplicative update rule in the constrained NMF algorithm proposed in Berry et al. (2006).
Temporal smoothness regularization (Chen & Cichocki, 2005). Let , where is a template operator. We can apply directly update rules, equations 5.4 and 5.5, by representing . In a special case, when is sufficiently small so that , and with and , equation 5.4 becomes
formula
or, equivalently,
formula
We obtain the update rule proposed in Chen and Cichocki (2005).
Graph Laplacian regularization (Cai, He, Wu, & Han, 2008). Consider $L = M - T$, where T is a symmetric edge weight matrix with nonnegative elements and M is the diagonal matrix whose diagonal elements are the column sums of T. Since M and T are nonnegative, we can apply the update rule, equation 5.5. Specifically, if we take x as V itself and $\varepsilon = 0$, equation 5.5 becomes
formula
or, equivalently,
$$V^+ = V \circ (A^T U + \lambda_2 T V) \oslash (V U^T U + \lambda_2 M V).$$
This is also the update rule proposed in Cai et al. (2008).

5.2.2  Sparse Regularization

Recall that one of the important applications of NMF is to learn parts-based representations of nonnegative data. Although NMF itself usually produces sparse representations of data, it is useful to impose more sparsity (Hoyer, 2002, 2004). Below, we consider some existing sparse regularizations.

$\ell_1$-norm regularization (Hoyer, 2002). The regularization term is given by
$$J_2(V) = \sum_{i,j} V_{ij}.$$
The objective function of problem 5.3 becomes
formula
where P is the permutation matrix defined in section 5.2.1. For , let
formula
According to equation 3.6, the update rule takes the form
formula
or, under the matrix form,
formula
Particularly by taking x as V itself, we get back the multiplicative rule proposed in Hoyer (2002):
$$V^+ = V \circ (A^T U) \oslash (V U^T U + \lambda_2).$$
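As a sketch, this rule reads in Python as follows; the constant eps guarding the division is our addition.

```python
import numpy as np

def sparse_mu_step(A, U, V, lam, eps=1e-12):
    """Hoyer's (2002) multiplicative update for the l1-regularized
    subproblem min 0.5 ||U V' - A||_F^2 + lam * V.sum() over V >= 0."""
    return V * (A.T @ U) / (V @ (U.T @ U) + lam + eps)
```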
Squared $\ell_1$-norm regularization. A sparse regularization similar to the $\ell_1$-norm regularization has been proposed by Kim and Park (2011):
formula
The objective function of problem 5.3 has the quadratic form
formula
where . For , the update rule derived from equation 3.7 becomes
formula
or, again, under the matrix form:
formula
The Hoyer regularization (Hoyer, 2004). Hoyer proposed a sparsity measure of a nonzero vector $x \in \mathbb{R}^n$ as follows:
$$\mathrm{sp}(x) = \frac{\sqrt{n} - \|x\|_1 / \|x\|_2}{\sqrt{n} - 1}.$$
To control the sparsity of V, Hoyer (2004) imposed the constraints
formula
where is the desired sparsity of V. These constraints can be rewritten as, when V is represented as a vector,
formula
This can be penalized in the objective function, and then the regularization term is
formula
where .
To the best of our knowledge, there is no solution method to deal with this Hoyer regularization. Observing that J2 is nonconvex but can be expressed as a DC function,
formula
we investigate DCA for solving equation 5.3. For this purpose, we consider the following DC decomposition of f, the objective function of equation 5.3:
formula
DCA applied to problem 5.3 consists of, at each iteration, computing and solving the convex subproblem,
formula
that has form 3.11. Performing one iteration of the update rule, 3.7, on this subproblem, we derive the update rule for problem 5.3 as follows,
formula
5.6
where , , and .

5.3  Multilayer Nonnegative Matrix Factorization

The multilayer nonnegative matrix factorization aims to represent a nonnegative data matrix A as a product of N nonnegative factor matrices with compatible sizes (Cichocki & Zdunek, 2006). Measuring discrepancy by Euclidean distance, the multilayer nonnegative matrix factorization takes the form (Ho, 2008)
formula
5.7
For , we set
formula
and and . Then the objective function F in equation 5.7 can be expressed as
formula
According to the alternating framework, we iteratively update the factors $X_i$ as follows:
  1. for $i = 1, \ldots, N$ do

  2. Fix $X_j$ ($j \neq i$), and update $X_i$ as an approximate solution to the problem of minimizing F with respect to $X_i$.

  3. .

  4. end for

Similar to the usual NMF, the KKT conditions of the multilayer NMF can be stated as
formula
where .
In the remainder of this section, we discuss the solution method for these subproblems. They have the common form
$$\min\{f(X) := \tfrac{1}{2}\|B X C - A\|_F^2 : X \geq 0\}, \tag{5.8}$$
where B, C, and A are given matrices with compatible sizes.
Using the operator vec, we can express
$$f(X) = \tfrac{1}{2}\|(C^T \otimes B)\, \mathrm{vec}(X) - \mathrm{vec}(A)\|^2,$$
where we used the relation $\mathrm{vec}(B X C) = (C^T \otimes B)\, \mathrm{vec}(X)$. Therefore, solving problem 5.8 amounts to solving an NNLS problem of the form of equation 3.2, and the gradient of f is given by
$$\nabla f(X) = B^T (B X C - A)\, C^T. \tag{5.9}$$
For any , we have
formula
Then an update rule of X derived from equation 3.7 will be
formula
5.10
Especially, if B and C are nonnegative, by taking x as X itself and letting $\varepsilon = 0$, we obtain the following multiplicative update rule:
$$X^+ = X \circ (B^T A C^T) \oslash (B^T B X C C^T).$$
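A one-line NumPy sketch of this rule follows; eps is our guard against zero denominators.

```python
import numpy as np

def multilayer_mu_step(A, B, X, C, eps=1e-12):
    """Multiplicative update for min ||B X C - A||_F^2 over X >= 0,
    valid when B and C are nonnegative."""
    return X * (B.T @ A @ C.T) / (B.T @ B @ X @ C @ C.T + eps)
```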

5.4  Convex and Convex-Hull Nonnegative Matrix Factorization

5.4.1  Convex Coding

In certain applications, one desires to find an approximation $U V^T$ of an input matrix A such that $V \geq 0$ and each row of V sums to one. It means that each column of A has to be approximated by a convex combination of the columns of U. Most popular algorithms for NMF cannot directly handle the constraint $V \mathbf{1} = \mathbf{1}$. Lee and Seung (1997) and Berry et al. (2006) proposed penalizing it in the objective function; the regularization term is then
formula
Then the subproblem for computing V has the form of equation 3.11. However, it is very difficult to find solutions satisfying the constraint exactly in this way. We show in this section that DCA can be efficiently applied to this problem without the penalty technique.
The mathematical formulation of this NMF variant is written as
$$\min\{\tfrac{1}{2}\|U V^T - A\|_F^2 : U \geq 0,\ V_{i:} \in \Delta_r,\ i = 1, \ldots, n\}. \tag{5.11}$$
The alternating scheme applied to this problem differs from the one for the original NMF only in the step computing V (U is computed by solving an NNLS problem using the algorithm DCA-NNLS). Computing V amounts to solving a sequence of separate problems of the form
$$\min\{f(v) := \tfrac{1}{2}\|U v - a\|^2 : v \in \Delta_r\}, \tag{5.12}$$
where $\Delta_r$ is the canonical simplex defined by
$$\Delta_r = \Big\{v \in \mathbb{R}^r_+ : \sum_{i=1}^r v_i = 1\Big\}.$$
For $\rho \geq \lambda_{\max}(Q)$, we consider the following DC decomposition of f:
$$f(v) = \frac{\rho}{2}\|v\|^2 - \Big(\frac{\rho}{2}\|v\|^2 - f(v)\Big).$$
DCA applied to this DC decomposition proceeds iteratively as follows:
  1. Compute $y^k = \rho v^k - (Q v^k - p)$.

  2. Update $v^{k+1} = P_{\Delta_r}(y^k / \rho)$.

Note that efficient algorithms for computing the projection of a point onto a simplex are available (see, e.g., Michelot, 1986; Júdice & Pires, 1992).

In short, for $\rho \geq \lambda_{\max}(Q)$, the update rule for V will be
$$V_{i:} \leftarrow P_{\Delta_r}\big(V_{i:} - (V_{i:} Q - P_{i:}) / \rho\big), \quad i = 1, \ldots, n, \tag{5.13}$$
where $P = A^T U$.
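For completeness, here is a standard sort-based projection onto the canonical simplex in Python; it is a common textbook routine offered as an alternative to Michelot's iterative method cited above, not a procedure taken from this letter.

```python
import numpy as np

def project_simplex(y):
    """Euclidean projection of y onto {v >= 0, sum(v) = 1}."""
    u = np.sort(y)[::-1]                  # sort in decreasing order
    css = np.cumsum(u)
    idx = np.arange(1, len(y) + 1)
    k = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    tau = (css[k] - 1.0) / (k + 1)        # optimal shift
    return np.maximum(y - tau, 0.0)
```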

5.4.2  Convex Nonnegative Matrix Factorization

To enhance the interpretability of the matrix factors, Ding, Li, and Jordan (2010) proposed an NMF variant, the so-called convex-NMF. Differing from the convex coding model, where the constraint is imposed on the second factor V, the convex-NMF imposes the constraint on the first factor U so that its columns are convex combinations of columns of the data matrix A. This yields an interesting interpretation: each data point can be expressed as a weighted sum of given data points. The mathematical formulation of convex-NMF takes the form (here, the variable W is used, with $U = A W$)
$$\min\{\tfrac{1}{2}\|A W V^T - A\|_F^2 : W \geq 0,\ V \geq 0\}. \tag{5.14}$$
Similar to equation 5.11, the alternating scheme applied to equation 5.14 differs from the one for the original NMF only in the step computing W (V is computed by solving an NNLS problem via the algorithm DCA-NNLS). Computing W amounts to solving the convex problem,
formula
5.15
Since the product $W V^T$ is invariant under diagonal scaling (for a diagonal matrix D with positive diagonal elements, $(W D)(V D^{-1})^T = W V^T$), a first approach to compute W can be carried out through these two steps:
  1. Ignore the additivity constraint and compute W using the method discussed at the end of section 5.3.

  2. Normalization step: for
    formula

We can also completely ignore the additivity constraint and apply the normalization step only when the iterative process of computing W and V is finished. However, it may occur that some columns of W are zero, in which case the normalization step is invalid. In other words, this approach cannot be applied when at least one column of W is zero.

The second approach consists of directly solving problem 5.15 by DC programming and DCA.

As in section 5.3, we express in the least-square form:
formula
For , we reformulate equation 5.15 as a DC program,
formula
with
formula
Applying DCA on this DC program, we iteratively:
  1. Compute and then

  2. Update
    formula
Briefly, the update rule for W can be presented by
formula
5.16
where (r times).

5.4.3  Convex-Hull Nonnegative Matrix Factorization

Thurau, Kersting, Wahabzada, and Bauckhage (2011) extended the idea of convex-NMF to the so-called convex-hull NMF. This NMF variant adds the constraint of the convex coding problem on the second factor V (i.e., $V_{i:} \in \Delta_r$) to the convex-NMF 5.14. Now each data point can be expressed as a convex combination of convex combinations of specific data points. For this problem, we can alternately compute W using equation 5.16 and compute V by DCA as discussed in section 5.4.1.

It is worth mentioning that neither Ding et al. (2010) nor Thurau et al. (2011) dealt directly with the additivity constraint. Ding et al. (2010) simply ignored this constraint and alternately computed W and V by a majorization-minimization method. Meanwhile, instead of computing W and setting $U = A W$, Thurau et al. (2011) determined U directly by (approximately) finding appropriate vertices of the convex hull of A.

5.5  Symmetric Nonnegative Matrix Factorization

The symmetric nonnegative matrix factorization (SNMF) problem (Ho, 2008) arises in various applications (e.g., graph clustering, correlation matrix approximation). In such situations, the data matrix A is symmetric (e.g., the adjacency matrix of an undirected graph, a correlation matrix), and we want to find a nonnegative factorization of A that preserves the symmetry. The SNMF aims at approximating a symmetric data matrix by a matrix of the form $U U^T$ with $U \geq 0$. This means that we desire the two matrix factors in equation 1.1 to be equal. The mathematical formulation of the SNMF then takes the form
$$\min\{F_s(U) := \tfrac{1}{2}\|A - U U^T\|_F^2 : U \geq 0\}, \tag{5.17}$$
where A is an $n \times n$ symmetric nonnegative matrix and $U \in \mathbb{R}^{n \times r}$.
The KKT conditions for this problem are given by
$$U \geq 0, \quad \nabla F_s(U) \geq 0, \quad U \circ \nabla F_s(U) = 0,$$
where $\nabla F_s(U) = 2 (U U^T - A)\, U$.

As with the case of NMF, we can bound the solution of the SNMF problem as follows:

Lemma 6.

If U is a solution of problem 5.17, then , where .

This result allows us to consider problem 5.17 under the form
formula
5.18
where the feasible set is bounded as in lemma 6. We will develop DCA for problem 5.18. For any U in this set, we have the following estimation of the second-order derivative of $F_s$:
formula
Thus, for , the objective function of equation 5.18 admits the following DC decomposition:
formula
and the resulting DCA can be summarized as shown here:

DCA-sym: DCA for Solving the SNMF

  1. Initialize $U^0$ satisfying the constraint of problem 5.18, $k = 0$.

  2. repeat

    1. Compute $Y^k = \rho U^k - \nabla F_s(U^k) = \rho U^k - 2 (U^k U^{kT} - A)\, U^k$.

    2. Compute $U^{k+1}$ as the projection of $Y^k / \rho$ onto the feasible set of problem 5.18.

    3. $k \leftarrow k + 1$.

  3. until Convergence of $\{U^k\}$.
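A minimal NumPy sketch of DCA-sym follows; it assumes the feasible set of problem 5.18 is the intersection of the nonnegative orthant with a Frobenius ball whose radius comes from lemma 6, and that rho dominates the curvature of $F_s$ there (both radius and rho are inputs here, not values prescribed by the letter).

```python
import numpy as np

def dca_sym(A, U, rho, radius, iters=500):
    """Sketch of DCA-sym for min 0.5 ||A - U U'||_F^2 over the bounded set."""
    for _ in range(iters):
        Y = rho * U - 2.0 * (U @ U.T - A) @ U  # Y^k = rho U^k - grad F_s(U^k)
        U = np.maximum(Y / rho, 0.0)           # project onto the orthant...
        n = np.linalg.norm(U)
        if n > radius:
            U *= radius / n                    # ...then rescale into the ball
    return U
```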

6  Numerical Experiments

In this section, we evaluate our proposed algorithms, ADCA, DCA1, and DCA2, and compare them with the following state-of-the-art algorithms:

  • BPP: Alternating nonnegative least squares with the block principal pivoting method (Kim & Park, 2011)

  • HAac: The accelerated hierarchical alternating least squares (Gillis & Glineur, 2012)

  • NeNMF: An NMF solver based on Nesterov’s optimal gradient method (Guan et al., 2012)

  • PNM: NMF via the projected Newton method (Gong & Zhang, 2012)

  • QRPBB: NMF using the quadratic regularization projected Barzilai-Borwein method (Huang et al., 2014)

We use the source codes of these algorithms available on the authors' home pages, except for QRPBB, which was provided directly by the authors. For the NNLS subproblems, we used the default parameter settings in their codes, which are the best settings according to the authors' suggestions.

In all experiments, the settings … were used for algorithm DCA-NNLS, and D was constructed row by row using equation 3.10.

All of these algorithms were executed with Matlab R2008b on a machine with an Intel Core i5-2500S 2 × 2.7 GHz processor and 4 GB of memory. The multithreading option of Matlab was disabled.

All initializations in experiments were computed as follows. First, each element of U0 and V0 is picked randomly in