## Abstract

Sparsity is a desirable property in many nonnegative matrix factorization (NMF) applications. Although some sparseness of NMF solutions can be achieved by using regularization, the resulting sparsity depends strongly on a regularization parameter that is usually set in an ad hoc way. In this letter, we formulate sparse NMF as a mixed-integer optimization problem with sparsity as a binary constraint. A discrete-time projection neural network is developed for solving the formulated problem. Sufficient conditions for its stability and convergence are analytically characterized by using Lyapunov's method. Experimental results on sparse feature extraction are discussed to substantiate the superiority of this approach in extracting highly sparse features.

## 1  Introduction

Nonnegative matrix factorization (NMF) seeks to reveal the hidden structure of nonnegative data (Lee & Seung, 1999). It decomposes a nonnegative matrix $V\in\mathbb{R}^{m\times n}$ into a product of two low-rank nonnegative factor matrices $W\in\mathbb{R}^{m\times r}$ and $H\in\mathbb{R}^{r\times n}$, where $r<\min(m,n)$ (Lin, 2007). A common approach to NMF minimizes the Frobenius norm of the error matrix. NMF has been widely applied in numerous fields, such as image representation (Lee & Seung, 1999), document clustering (Xu, Liu, & Gong, 2003), and data analysis (Kim & Park, 2007), to name a few.
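As a concrete reference point, the classical multiplicative-update rules of Lee and Seung for the Frobenius-norm objective can be sketched in a few lines of NumPy. The function name, iteration count, and stabilizing constant below are illustrative and not part of the formulations in this letter.

```python
import numpy as np

def nmf_multiplicative(V, r, iters=200, seed=0, eps=1e-9):
    """Factorize a nonnegative V (m x n) into W (m x r) and H (r x n)
    by Lee-Seung multiplicative updates minimizing ||V - WH||_F^2."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r))
    H = rng.random((r, n))
    for _ in range(iters):
        # Each update keeps the factors nonnegative by construction.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

The multiplicative updates are monotone nonincreasing in the Frobenius error, which is why they serve as a common baseline for the regularized and constrained variants discussed below.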

Sparse solutions are often desirable in machine learning applications (Field, 1994; Hoyer, 2004). Several formulations for generating sparse solutions are presented in the literature. For example, the $l_1$-norm is added to the objective function as a regularization term to enhance sparsity (Hoyer, 2002). A sparseness-constrained formulation is introduced in Hoyer (2004), where factorization accuracy is compromised for sparseness, as evidenced by the experimental results in section 4. An $l_1/l_2$ norm-based matrix factorization method named VSMF is proposed in Li and Ngom (2013). All of these regularization-based methods entail proper weighting by a regularization parameter as a trade-off between factorization accuracy and solution sparsity.

Neurodynamic optimization is a parallel optimization approach pioneered by Hopfield and Tank (Hopfield & Tank, 1986; Tank & Hopfield, 1986) in the 1980s. It demonstrates admirable search capability for various optimization problems such as linear programming (Liu & Wang, 2008; Liu, Cao, & Chen, 2010; He, Li, Huang, & Huang, 2014), quadratic programming (Xia, Feng, & Wang, 2004; Qin, Le, & Wang, 2017), minimax optimization (Liu & Wang, 2016; Le & Wang, 2017; Leung & Wang, in press), nonsmooth optimization (Liu, Guo, & Wang, 2012; Liu & Wang, 2013; Alireza, Wang, & Hosseini, 2013; Li, Yu, Huang, Chen, & He, 2015), distributed optimization (Liu, Yang, & Wang, 2017; Yang, Liu, & Wang, 2018), biconvex optimization (Che & Wang, 2019b), multiobjective optimization (Leung & Wang, 2018, in press), and global and combinatorial optimization (Yan, Wang, & Li, 2014; Yan, Fan, & Wang, 2017; Che & Wang, 2019a, in press). In particular, a collaborative neurodynamic approach (Fan & Wang, 2017) and a discrete-time neurodynamic model (Che & Wang, 2018) have been developed for NMF.

In this letter, we present a sparsity-constrained NMF. We introduce a discrete-time projection neural network (DTPNN) for solving the proposed problem. The convergence of the model is analytically characterized.

The contributions of this letter are summarized as follows:

1. A sparsity-constrained NMF formulation is proposed for obtaining a desired level of sparse solutions while minimizing the factorization error without using regularization.

2. A discrete-time projection neural network is developed for solving the above problem. The convergence and stability are theoretically guaranteed if the step size is smaller than the derived upper bound.

3. The experimental results show the superiority of the performance of this approach when the required level of sparsity is high.

The rest of this letter is organized as follows. Section 2 reviews the related problems and introduces a sparse NMF formulation in the form of an integer optimization problem. Section 3 presents theoretical analyses of the DTPNN. Section 4 discusses the experimental results of sparse feature learning obtained by using DTPNN. Finally, section 5 gives conclusions.

## 2  Problem Formulations

Throughout this letter, lowercase and uppercase letters denote vectors and matrices, respectively. For a matrix $A\in\mathbb{R}^{n\times m}$, the operator $\mathrm{vec}(A)$ stacks the columns of $A$ into a single vector; that is, $\mathrm{vec}(A)=[a_{11},\dots,a_{n1},a_{12},\dots,a_{nm}]^\top$. The $l_0$-norm and the $l_1$-norm of the vectorized matrix $A$, $\|\mathrm{vec}(A)\|_0$ and $\|\mathrm{vec}(A)\|_1$, are abbreviated as $\|A\|_0$ and $\|A\|_1$, respectively, throughout the rest of the letter.
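The notation above can be illustrated numerically; in NumPy, `flatten(order="F")` performs exactly the column stacking of $\mathrm{vec}(\cdot)$ (the variable names below are ours):

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, -2.0],
              [3.0, 0.0]])

# vec(A): stack the columns of A into one vector (column-major order).
vec_A = A.flatten(order="F")

l0 = np.count_nonzero(vec_A)   # ||A||_0: number of nonzero entries
l1 = np.abs(vec_A).sum()       # ||A||_1: sum of absolute values
```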

### 2.1  Existing Formulations

A naive sparsity-constrained NMF is usually formulated as follows:
$\min_{W,H\ge 0}\ \|V-WH\|_F^2 \quad \text{subject to} \quad \|H\|_0\le s$
(2.1)
or $\min_{W,H\ge 0}\ \|V-WH\|_F^2 \quad \text{subject to} \quad \|W\|_0\le s,$
(2.2)
where $\|\cdot\|_F$ is the Frobenius norm and $\|\cdot\|_0$ is the $l_0$-norm, which counts the number of nonzero elements. The above problems are hard to solve due to the combinatorial nature of the $l_0$-norm.
Instead of using the $l_0$-norm as a constraint, the $l_1$-norm is added as a regularization term to induce sparseness in Hoyer (2002):
$\min_{W,H\ge 0}\ \|V-WH\|_F^2+\mu\|H\|_1 \quad \text{or} \quad \min_{W,H\ge 0}\ \|V-WH\|_F^2+\mu\|W\|_1,$
(2.3)
where $\mu$ is a regularization parameter. A larger value of $\mu$ leads to a sparser factorization result. In practice, the parameter is chosen in an ad hoc manner. A small regularization parameter value does not affect the level of sparsity, whereas a large value results in a significant factorization error. This leads to an inevitable dilemma.
To achieve the required level of sparsity, a sparsity-constrained NMF formulation is introduced in Hoyer (2004):
$\min_{W,H\ge 0}\ \|V-WH\|_F^2 \quad \text{subject to} \quad \mathrm{sparseness}(w_i)\le s \quad \text{or} \quad \min_{W,H\ge 0}\ \|V-WH\|_F^2 \quad \text{subject to} \quad \mathrm{sparseness}(h_j)\le s,$
(2.4)
where sparseness($\cdot$) is a function defined on a column $w_i$ of $W$ or a row $h_j$ of $H$:
$\mathrm{sparseness}(x)=\frac{\sqrt{n}-\big(\sum_i|x_i|\big)\big/\sqrt{\sum_i x_i^2}}{\sqrt{n}-1},\quad \text{for } n>1.$
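For reference, this sparseness measure can be sketched directly from the formula above (assuming a nonzero input vector, since the measure is undefined for $x=0$; the function name is ours):

```python
import numpy as np

def sparseness(x):
    """Hoyer's sparseness measure: 1 for a vector with a single
    nonzero entry, 0 for a vector with all entries equal."""
    x = np.asarray(x, dtype=float)
    n = x.size
    l1 = np.abs(x).sum()
    l2 = np.sqrt((x ** 2).sum())
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1)
```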
Later, an $l1/l2$-based matrix factorization formulation named versatile sparse matrix factorization (VSMF) is proposed in Li and Ngom (2013):
$\min_{W,H\ge 0}\ \|V-WH\|_F^2+\sum_{i=1}^{k}\Big(\frac{\alpha_1}{2}\|w_i\|_2^2+\alpha_2\|w_i\|_1\Big)+\sum_{i=1}^{n}\Big(\frac{\mu_1}{2}\|h_i\|_2^2+\mu_2\|h_i\|_1\Big),$
(2.5)
where $α1$ and $μ1$ are positive constants to control the smoothness and $α2$ and $μ2$ are positive constants to control the sparseness. Similar to Hoyer (2002), this formulation depends highly on the choice of penalty parameters.

### 2.2  Proposed Formulation

Let $Z\in\{0,1\}^{n\times m}$ be a binary matrix containing zeros and ones only. The number of nonzero elements in $Z$ is equal to the sum of all the elements in $Z$; hence $\|Z\|_0=\|Z\|_1$.

Let $W=\hat W\circ Z$ and $H=\hat H\circ Z$, where $\circ$ denotes the Hadamard product operator (element-wise matrix multiplication). Problems 2.1 and 2.2 can be reformulated as mixed-integer optimization problems with binary constraints as follows:
$\min_{W,\hat H\ge 0}\ \|V-W(\hat H\circ Z)\|_F^2 \quad \text{subject to} \quad \|Z\|_1\le s,\ Z\in\{0,1\}^{r\times n}$
(2.6)
or $\min_{\hat W,H\ge 0}\ \|V-(\hat W\circ Z)H\|_F^2 \quad \text{subject to} \quad \|Z\|_1\le s,\ Z\in\{0,1\}^{m\times r}.$
(2.7)
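The reformulation can be checked on a small example: the binary mask $Z$ encodes the support of $H$, the Hadamard product $\hat H\circ Z$ recovers $H$, and $\|Z\|_1$ equals $\|H\|_0$; note also that $Z\circ(Z-1)=0$ holds exactly because $Z$ is binary. A minimal NumPy sketch (variable names ours):

```python
import numpy as np

H = np.array([[0.5, 0.0, 1.2],
              [0.0, 0.0, 2.0]])

# Binary mask encoding the support of H.
Z = (H != 0).astype(float)
H_hat = H.copy()            # any H_hat agreeing with H on the support works

assert np.array_equal(H_hat * Z, H)   # Hadamard product recovers H
nnz_H = np.count_nonzero(H)           # ||H||_0
ones_Z = Z.sum()                      # ||Z||_1 (Z is binary)
```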
Proposition 1.

Problem 2.1 is equivalent to 2.6 and problem 2.2 is equivalent to problem 2.7 in terms of optimal solutions.

Proof.

First, we show that problem 2.1 is equivalent to 2.6.

Problem 2.1 is biconvex in $W$ and $H$ (Che & Wang, 2018). Since $H=\hat H\circ Z$, problem 2.6 is also biconvex in $W$ and $\hat H\circ Z$. This implies that if $H$ and $\hat H\circ Z$ produce the same optimal value, $W$ also gives the same solution. Therefore, we only need to verify that, when $W$ is fixed, the following problems 2.8 and 2.9 are equivalent:
$\min_{H\ge 0}\ \|V-WH\|_F^2 \quad \text{subject to} \quad \|H\|_0\le s,$
(2.8)
$\min_{\hat H,Z\ge 0}\ \|V-W(\hat H\circ Z)\|_F^2 \quad \text{subject to} \quad \|Z\|_1\le s,\ z_{ij}\in\{0,1\}\ \forall i,j,$
(2.9)
where $zij$ denotes the $ij$th entry of the given matrix $Z$.
Let $\Omega_1$ and $\Omega_2$ be the feasible solution sets, and let $f^*_{(2.8)}$ and $f^*_{(2.9)}$ denote the optimal objective function values of problems 2.8 and 2.9, respectively. For all $H\in\Omega_1$, there always exists $\hat H$ such that $\hat H=H$, $z_{ij}=1_{\{H_{ij}\ne 0\}}$, and $\|Z\|_1=\|H\|_0\le s$. Since the pair $(\hat H,Z)\in\Omega_2$, this implies that
$f^*_{(2.8)}=\|V-WH^*\|_F^2=\|V-W(\hat H^*\circ Z^*)\|_F^2\ge f^*_{(2.9)}.$
Conversely, we show that $f^*_{(2.9)}\ge f^*_{(2.8)}$. For every pair $(\hat H,Z)\in\Omega_2$, setting $H=\hat H\circ Z$ gives $\|H\|_0\le\|Z\|_1\le s$. Since $H\in\Omega_1$, this implies
$f^*_{(2.9)}=\|V-W(\hat H^*\circ Z^*)\|_F^2=\|V-WH^*\|_F^2\ge f^*_{(2.8)}.$

Therefore, the optimal objective function values are equal, that is, $f^*_{(2.8)}=f^*_{(2.9)}$, which immediately implies that problem 2.1 is equivalent to problem 2.6.

Analogously, problem 2.2 is equivalent to 2.7.$□$

Since $Z$ is a binary matrix, problems 2.6 and 2.7 are essentially mixed-integer optimization problems. In order to solve the problem, the binary constraint is replaced by an equality constraint:
$Z\in\{0,1\}^{r\times n} \quad \text{if and only if} \quad z_{ij}(z_{ij}-1)=0,\ \forall i,j.$
(2.10)
For $zij(zij-1)=0$ to hold, $zij$ can only be 0 or 1.
The proposed formulations for sparsity-constrained NMF are
$\min_{W,\hat H\ge 0}\ \|V-W(\hat H\circ Z)\|_F^2 \quad \text{subject to} \quad \|Z\|_1\le s,\ z_{ij}(z_{ij}-1)=0,\ \forall i,j$
(P2.1)
or $\min_{\hat W,H\ge 0}\ \|V-(\hat W\circ Z)H\|_F^2 \quad \text{subject to} \quad \|Z\|_1\le s,\ z_{ij}(z_{ij}-1)=0,\ \forall i,j.$
(P2.2)

In view of the nonconvexity of the equality constraint, problems P2.1 and P2.2 become global optimization problems.

## 3  Neurodynamic Approaches

### 3.1  Global Optimization

Including problems P2.1 and P2.2 as two special cases, a general global optimization problem is defined as follows:
$\text{minimize}\ f(x) \quad \text{subject to} \quad g(x)\le 0,\ h(x)=0,\ l\le x\le u,$
(3.1)
where $x:=(W,H,Z)$, $g(x)=\|Z\|_1-s$, and $h(x)=[z_{11}(z_{11}-1),\dots,z_{kn}(z_{kn}-1)]^\top$. $l$ denotes the lower bound, and $u$ denotes the upper bound: for $w_{ij}\in W$ and $h_{ij}\in H$, $l=0$ and $u=+\infty$; for $z_{ij}\in Z$, $l=0$ and $u=1$.
The Lagrangian function of problem 3.1 is defined as follows:
$L(x,\lambda,\nu)=f(x)+\lambda^\top g(x)+\nu^\top h(x),$
(3.2)
where $λ=[λ1,…,λm]T∈Rm$ and $ν=[ν1,…,νq]T∈Rq$ are the Lagrangian multipliers. Specifically, $m=1$, $q=k×n$ for equation P2.1 and $q=m×k$ for P2.2.
The Karush-Kuhn-Tucker (KKT) conditions are shown as follows:
$\nabla f(x)+\nabla g(x)\lambda+\nabla h(x)\nu=0,\quad l\le x\le u,\quad \lambda\ge 0,\quad g(x)\le 0,\quad \lambda^\top g(x)=0,\quad h(x)=0,$
(3.3)
where $\nabla g(x)=[\nabla g_1(x),\dots,\nabla g_m(x)]$ and $\nabla h(x)=[\nabla h_1(x),\dots,\nabla h_q(x)]$ collect the constraint gradients.
The optimality conditions of problem 3.1 can be equivalently transformed, based on the projection theorem (Kinderlehrer & Stampacchia, 1980), as follows:
$-x+P_\Omega\big(x-(\nabla f(x)+\nabla g(x)\lambda+\nabla h(x)\nu)\big)=0,\quad -\lambda+(\lambda+g(x))_+=0,\quad h(x)=0,$
(3.4)
where $(\xi)_+=\max(0,\xi)$ and $P_\Omega:\mathbb{R}^n\mapsto\Omega$ is a piecewise-linear activation function defined element-wise as
$P_\Omega(x_i)=\begin{cases}u_i, & x_i>u_i,\\ x_i, & l_i\le x_i\le u_i,\\ l_i, & x_i<l_i.\end{cases}$
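A minimal sketch of the box projection $P_\Omega$ applied element-wise, with $u_i=+\infty$ allowed for the entries of $W$ and $H$ (the function name is ours; `np.clip` would serve equally well):

```python
import numpy as np

def project_box(x, l, u):
    """Element-wise projection P_Omega onto the box [l, u]:
    values below l are raised to l, values above u lowered to u."""
    return np.minimum(np.maximum(x, l), u)
```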

### 3.2  Augmented Lagrangian Function

The Lagrangian dual method can be used for solving constrained optimization problems if the duality gap between the primal and dual problems is zero (Mangasarian, 1987), and this is equivalent to the existence of a saddle point of the Lagrangian function (Minoux, 1986). However, the existence of a saddle point cannot be ensured for a global optimization problem due to its nonconvexity.

In order to ensure the existence of a saddle point and solve the global optimization problem 3.1, an augmented Lagrangian function is introduced in Che and Wang (2019a),
$\Phi(x,\lambda,\nu)=L(x,\lambda,\nu)+\frac{\alpha}{2}\sum_{j=1}^{m}\big(\lambda_j g_j(x)\big)^2+\frac{\beta}{2}\sum_{i=1}^{q}\big(\nu_i h_i(x)\big)^2,$
(3.5)
where $\alpha$ and $\beta$ are two nonnegative parameters. A modified augmented Lagrangian function, linear in $\lambda$ and without $\nu$ in the last term, is as follows:
$\Psi(x,\lambda,\nu)=L(x,\lambda,\nu)+\frac{\alpha}{2}\sum_{j=1}^{m}\lambda_j g_j(x)^2+\frac{\beta}{2}\sum_{i=1}^{q}h_i(x)^2.$
(3.6)
The gradient of the augmented Lagrangian function 3.6 is defined as
$\nabla_x\Psi(x,\lambda,\nu)=\nabla_x L(x,\lambda,\nu)+\alpha\sum_{j=1}^{m}\lambda_j g_j(x)\nabla_x g_j(x)+\beta\sum_{i=1}^{q}h_i(x)\nabla_x h_i(x).$
(3.7)
By the KKT conditions in equation 3.3, we have
$\nabla_x\Psi(x^*,\lambda^*,\nu^*)=\nabla_x L(x^*,\lambda^*,\nu^*).$
Based on the saddle-point theorem (Kinderlehrer & Stampacchia, 1980), $x*$ is an optimal solution of problem 3.1 if and only if there exist $λ*$ and $ν*$ such that $(x*,λ*,ν*)$ is a saddle point. For $(x*,λ*,ν*)$ being a saddle point, it has to be a solution of the following min-max optimization problem (Sun et al., 2005):
$\min_x\max_{\lambda,\nu}\ \Psi(x,\lambda,\nu) \quad \text{subject to} \quad x\in\Omega,\ \lambda\ge 0.$
(3.8)
Therefore, the minimization problems P2.1 and P2.2 can be turned into the following min-max optimization problem:
$\min_x\max_{\lambda,\nu}\ f(x)+\lambda(\|Z\|_1-s)+\sum_i\sum_j\nu_{ij}(z_{ij}^2-z_{ij})+\frac{\alpha}{2}\lambda(\|Z\|_1-s)^2+\frac{\beta}{2}\sum_i\sum_j(z_{ij}^2-z_{ij})^2 \quad \text{subject to} \quad W,H,\lambda\ge 0,$
(3.9)
where the ranges of $i$ and $j$ depend on the problem, and
$f(x)=\begin{cases}\|V-W(\hat H\circ Z)\|_F^2, & \text{for problem P2.1},\\ \|V-(\hat W\circ Z)H\|_F^2, & \text{for problem P2.2}.\end{cases}$
The $x^*$ in a saddle point $(x^*,\lambda^*,\nu^*)$ of problem 3.9 is a local optimum of problem P2.1 or P2.2.
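The augmented Lagrangian of equation 3.9 for problem P2.1 can be evaluated directly; the function name and argument order below are ours. At a binary $Z$ with $\|Z\|_1=s$, the constraint values $g$ and $h$ vanish, so all multiplier and penalty terms drop out and $\Psi$ reduces to $f(x)$:

```python
import numpy as np

def psi_p21(V, W, H_hat, Z, lam, nu, alpha, beta, s):
    """Augmented Lagrangian of problem P2.1 (equation 3.9), with
    g(Z) = ||Z||_1 - s and h(Z) = Z o (Z - 1) taken element-wise."""
    f = np.linalg.norm(V - W @ (H_hat * Z), "fro") ** 2
    g = Z.sum() - s                   # ||Z||_1 - s for box-constrained Z
    h = Z * (Z - 1.0)                 # zero exactly at binary entries
    return (f + lam * g + (nu * h).sum()
            + 0.5 * alpha * (lam * g) ** 2
            + 0.5 * beta * (h ** 2).sum())
```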

### 3.3  Continuous-Time Neurodynamic Model

Similar to Che and Wang (2019a), a continuous-time projection neural network for solving problem 3.9 is defined as follows:
$\varepsilon\frac{dx}{dt}=-x+P_\Omega\big(x-\nabla_x\Psi(x,\lambda,\nu)\big)=-x+P_\Omega\Big(x-\big(\nabla f(x)+\nabla g(x)\lambda+\nabla h(x)\nu+\alpha\nabla g(x)(\lambda\circ g(x))+\beta\nabla h(x)h(x)\big)\Big),$
$\varepsilon\frac{d\lambda}{dt}=-\lambda+P_\Omega\big(\lambda+\nabla_\lambda\Psi(x,\lambda,\nu)\big)=-\lambda+(\lambda+g(x))_+,$
$\varepsilon\frac{d\nu}{dt}=-\nu+\big(\nu+\nabla_\nu\Psi(x,\lambda,\nu)\big)=h(x),$
(3.10)
where $ε$ is a time constant.
In view of equation 3.10, the equilibrium denoted as $(x¯,λ¯,ν¯)$ satisfies the following equations:
$\bar x=P_\Omega\big(\bar x-\nabla_x\Psi(\bar x,\bar\lambda,\bar\nu)\big),\quad \bar\lambda=(\bar\lambda+g(\bar x))_+,\quad h(\bar x)=0.$
(3.11)
Lemma 1
(Second-Order Sufficiency Conditions; Bazaraa, 2013). $x^*$ is a strict local minimum of problem 3.1 if there exist $\lambda^*\in\mathbb{R}^m$ and $\nu^*\in\mathbb{R}^q$ such that $(x^*,\lambda^*,\nu^*)$ is a KKT point and $\nabla_{xx}L(x^*,\lambda^*,\nu^*)$ is positive definite on the cone
$C=\big\{d\in\mathbb{R}^n \mid \nabla g_j(x^*)^\top d=0,\ \forall j\in\{j\mid\lambda_j^*>0\};\ \nabla g_j(x^*)^\top d\le 0,\ \forall j\in\{j\mid g_j(x^*)=0\};\ \nabla h_i(x^*)^\top d=0,\ \forall i\in\{1,\dots,q\}\big\}.$
(3.12)
For the second-order sufficiency conditions to hold, the regularity condition needs to be verified. For both problems P2.1 and P2.2, let $ℓ$ be the total number of the equality constraints. Then the active constraints can be written as $[g,h1,…,hℓ]$. The gradients of the active constraints in terms of $x=vec(W,H,Z)$ are
$[\nabla_x g,\nabla_x h_1,\dots,\nabla_x h_\ell]=\begin{bmatrix}1 & 2z_1-1 & 0 & \cdots & 0\\ 1 & 0 & 2z_2-1 & \cdots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & 0 & 0 & \cdots & 2z_\ell-1\end{bmatrix}.$

It is clear that the columns are linearly independent of one another. Therefore, the regularity condition holds, which implies that lemma 1 and the KKT conditions in equations 3.3 and 3.4 hold. Thus, every equilibrium of model 3.10 is a strict local minimum and vice versa.

Therefore, the equilibrium $μ¯=(W¯,H¯,Z¯,λ¯,ν¯)$ for problem P2.1 or P2.2 is
$\bar W=\big(\bar W-\nabla_W\Psi(\bar\mu)\big)_+,\quad \bar H=\big(\bar H-\nabla_H\Psi(\bar\mu)\big)_+,\quad \bar Z=P_{[0,1]}\big(\bar Z-\nabla_Z\Psi(\bar\mu)\big),$
$\bar\lambda=\big(\bar\lambda+g(\bar Z)\big)_+=\Big(\bar\lambda+\sum_i\sum_j\bar z_{ij}-s\Big)_+,\quad h(\bar Z)=\bar Z\circ(\bar Z-\mathbf{1})=0,$
$\nabla_W\Psi(\bar\mu)=\begin{cases}2\big(\bar W(\bar Z\circ\bar H)-V\big)(\bar Z\circ\bar H)^\top, & \text{for P2.1},\\ 2\bar Z\circ\big(((\bar W\circ\bar Z)\bar H-V)\bar H^\top\big), & \text{for P2.2};\end{cases}$
$\nabla_H\Psi(\bar\mu)=\begin{cases}2\bar Z\circ\big(\bar W^\top(\bar W(\bar Z\circ\bar H)-V)\big), & \text{for P2.1},\\ 2(\bar W\circ\bar Z)^\top\big((\bar W\circ\bar Z)\bar H-V\big), & \text{for P2.2};\end{cases}$
$\nabla_Z\Psi(\bar\mu)=\begin{cases}2\bar H\circ\big(\bar W^\top(\bar W(\bar Z\circ\bar H)-V)\big)+\bar\lambda\mathbf{1}+\alpha\bar\lambda\big(\sum_i\sum_j\bar z_{ij}-s\big)\mathbf{1}+\beta(2\bar Z-\mathbf{1})\circ\big(\bar Z\circ(\bar Z-\mathbf{1})\big), & \text{for P2.1},\\ 2\bar W\circ\big(((\bar W\circ\bar Z)\bar H-V)\bar H^\top\big)+\bar\lambda\mathbf{1}+\alpha\bar\lambda\big(\sum_i\sum_j\bar z_{ij}-s\big)\mathbf{1}+\beta(2\bar Z-\mathbf{1})\circ\big(\bar Z\circ(\bar Z-\mathbf{1})\big), & \text{for P2.2};\end{cases}$
(3.13)
where $1$ is a matrix with all elements being ones.

The derivations of gradient functions are given in the appendix.

### 3.4  Discrete-Time Neurodynamic Model

Through Euler discretization, the continuous-time model 3.10 is described as follows:
$w_{k+1}=w_k+\delta_k\big(-w_k+(w_k-\nabla_w\Psi)_+\big),$
$h_{k+1}=h_k+\delta_k\big(-h_k+(h_k-\nabla_h\Psi)_+\big),$
$z_{k+1}=z_k+\delta_k\big(-z_k+P_{[0,1]}(z_k-\nabla_z\Psi)\big),$
$\lambda_{k+1}=\lambda_k+\delta_k\big(-\lambda_k+(\lambda_k+\nabla_\lambda\Psi)_+\big),$
$\nu_{k+1}=\nu_k+\delta_k\big(-\nu_k+(\nu_k+\nabla_\nu\Psi)\big),$
(3.14)
where $Ψ$ is the abbreviation of $Ψ(wk,hk,zk,λk,νk)$.
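The fixed-point character of update 3.14 can be illustrated on a one-dimensional surrogate: for $f(x)=\frac{1}{2}\|x-a\|^2$ on the box $[0,1]$, we have $x-\nabla f(x)=a$, so the Euler iteration contracts geometrically toward $P_{[0,1]}(a)$. A sketch in which the toy objective stands in for $\Psi$ (function name and parameters ours):

```python
import numpy as np

def euler_projection_descent(a, x0, delta=0.5, iters=200):
    """Euler-discretized projection dynamics
    x_{k+1} = x_k + delta * (-x_k + P_[0,1](x_k - grad f(x_k)))
    for the toy objective f(x) = 0.5 * ||x - a||^2."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(iters):
        grad = x - a
        x = x + delta * (-x + np.clip(x - grad, 0.0, 1.0))
    return x
```

Each step moves a fraction `delta` of the way toward the projected point, so for $\delta\in(0,2)$ the error shrinks geometrically, mirroring the step-size condition derived in section 3.5.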
Lemma 2

(Kinderlehrer & Stampacchia, 1980). $(PΩ(u)-u)⊺(v-PΩ(u))≥0,∀u∈Rn,∀v∈Ω.$

Negating both sides, we obtain
$(PΩ(u)-u)⊺(PΩ(u)-v)≤0.$
(3.15)
Lemma 3.
Let $qxi=xki-∇fx(xki)$ and $qyi=yki+∇fy(yki)$. There exist functions $γ(·)$ and $ζ(·)$, such that the following two discrete-time projection neural networks (DTPNN) are equivalent:
$\text{DTPNN}_1:\quad x^i_{k+1}=x^i_k+\delta_k\big(-x^i_k+P_\Omega(q^i_x)\big),\quad y^i_{k+1}=y^i_k+\delta_k\big(-y^i_k+P_\Omega(q^i_y)\big);$
$\text{DTPNN}_2:\quad x^i_{k+1}=x^i_k-\eta^i_k\nabla f_x(x^i_k),\quad y^i_{k+1}=y^i_k+\eta^i_k\nabla f_y(y^i_k),$
where
$\eta^i_k=\begin{cases}\delta_k, & q^i_x\in[l_i,u_i]\ \text{or}\ q^i_y\in[l_i,u_i],\\ \delta_k\gamma_i(x_k), & q^i_x>u_i,\\ \delta_k\gamma_i(y_k), & q^i_y>u_i,\\ \delta_k\zeta_i(x_k), & q^i_x<l_i,\\ \delta_k\zeta_i(y_k), & q^i_y<l_i.\end{cases}$

The proof is given in the appendix.

Lemma 3 says that DTPNN$_1$ can be converted into DTPNN$_2$ with a suitable step size $\eta^i_k$, since there always exists a function that maps the input into the range $[l_i,u_i]$. The bound on the step size is detailed in theorem 1.

According to lemma 3, model 3.14 is equivalent to the following DTPNN:
$w^i_{k+1}=w^i_k-\eta^i_k\nabla_w\Psi_i,\quad h^i_{k+1}=h^i_k-\eta^i_k\nabla_h\Psi_i,\quad z^i_{k+1}=z^i_k-\eta^i_k\nabla_z\Psi_i,\quad \lambda^i_{k+1}=\lambda^i_k+\eta^i_k g(z_k),\quad \nu^i_{k+1}=\nu^i_k+\eta^i_k h(z_k).$
(3.16)
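For illustration, the updates of model 3.16 for problem P2.1 can be sketched with a uniform step size. The step size, penalty parameters, and iteration count below are illustrative placeholders rather than the tuned values used in the experiments; the $\nu$ term of the augmented Lagrangian 3.9 is included in the $Z$ gradient, and $\lambda$ is kept nonnegative by projection:

```python
import numpy as np

def dtpnn_h(V, r, s, eta=1e-3, alpha=1.0, beta=1.0, iters=500, seed=0):
    """Sketch of the DTPNN-H iteration for problem P2.1: projected
    gradient steps on the augmented Lagrangian with multiplier ascent."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r))
    H = rng.random((r, n))
    Z = np.full((r, n), 0.5)          # relaxed binary variables in [0, 1]
    lam = 0.0
    nu = np.zeros((r, n))
    for _ in range(iters):
        R = W @ (H * Z) - V           # residual W(H o Z) - V
        g = Z.sum() - s               # inequality constraint value
        h = Z * (Z - 1.0)             # equality constraint values
        grad_W = 2.0 * R @ (H * Z).T
        grad_H = 2.0 * Z * (W.T @ R)
        grad_Z = (2.0 * H * (W.T @ R)
                  + lam + alpha * lam * g
                  + (nu + beta * h) * (2.0 * Z - 1.0))
        W = np.maximum(W - eta * grad_W, 0.0)
        H = np.maximum(H - eta * grad_H, 0.0)
        Z = np.clip(Z - eta * grad_Z, 0.0, 1.0)
        lam = max(lam + eta * g, 0.0)
        nu = nu + eta * h
    return W, H, Z
```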

### 3.5  Stability and Convergence Analysis

In this section, we analyze the stability and convergence of the DTPNN by using Lyapunov's method. The main result is given in theorem 1.

Lemma 4.

The equilibrium sets of the continuous-time neural network 3.10 and the discrete-time neural network 3.14 are identical.

The proof is given in the appendix.

Theorem 1.
The discrete-time projection neural network 3.14 is Lyapunov stable and convergent to an equilibrium from any initial state $(w_0,h_0,z_0,\lambda_0,\nu_0)\in W\times H\times Z\times\Lambda\times N$ if
$(1-c)_+\le\delta_k\le 1+c,$
where
$c=\sqrt{1-\frac{2\left\|\begin{bmatrix}\eta_k^\top\nabla_{x_k}\Psi\\-\eta_k^\top\nabla_{y_k}\Psi\end{bmatrix}\right\|_2^2}{\left\|\begin{bmatrix}-x_k+P_\Omega(x_k-\nabla_{x_k}\Psi)\\-y_k+P_\Omega(y_k+\nabla_{y_k}\Psi)\end{bmatrix}\right\|_2^2}}.$
Proof.
Let $x=(w,h,z)$ and $y=(\lambda,\nu)$. Then $\nabla_x\Psi=(\nabla_w\Psi,\nabla_h\Psi,\nabla_z\Psi)^\top$ and $\nabla_y\Psi=(\nabla_\lambda\Psi,\nabla_\nu\Psi)^\top$. We rewrite model 3.14 as
$\begin{bmatrix}x_{k+1}\\y_{k+1}\end{bmatrix}=\begin{bmatrix}x_k\\y_k\end{bmatrix}+\delta_k\begin{bmatrix}-x_k+P_\Omega(x_k-\nabla_{x_k}\Psi)\\-y_k+P_\Omega(y_k+\nabla_{y_k}\Psi)\end{bmatrix}.$
(3.17)
Let $[x¯,y¯]⊺$ denote the equilibrium that satisfies equation 3.13 for model 3.17. Then, for all $k>0$,
$\left\|\begin{bmatrix}x_{k+1}-\bar x\\y_{k+1}-\bar y\end{bmatrix}\right\|_2^2=\left\|\begin{bmatrix}x_k-\bar x\\y_k-\bar y\end{bmatrix}+\delta_k\begin{bmatrix}-x_k+P_\Omega(x_k-\nabla_{x_k}\Psi)\\-y_k+P_\Omega(y_k+\nabla_{y_k}\Psi)\end{bmatrix}\right\|_2^2=\left\|\begin{bmatrix}x_k-\bar x\\y_k-\bar y\end{bmatrix}\right\|_2^2+2\delta_k\begin{bmatrix}x_k-\bar x\\y_k-\bar y\end{bmatrix}^\top\begin{bmatrix}-x_k+P_\Omega(x_k-\nabla_{x_k}\Psi)\\-y_k+P_\Omega(y_k+\nabla_{y_k}\Psi)\end{bmatrix}+\delta_k^2\left\|\begin{bmatrix}-x_k+P_\Omega(x_k-\nabla_{x_k}\Psi)\\-y_k+P_\Omega(y_k+\nabla_{y_k}\Psi)\end{bmatrix}\right\|_2^2.$
(3.18)
According to inequality 3.15 in lemma 2, the following inequality is obtained:
$\left(\begin{bmatrix}P_\Omega(x_k-\nabla_{x_k}\Psi)\\P_\Omega(y_k+\nabla_{y_k}\Psi)\end{bmatrix}-\begin{bmatrix}x_k-\nabla_{x_k}\Psi\\y_k+\nabla_{y_k}\Psi\end{bmatrix}\right)^\top\left(\begin{bmatrix}P_\Omega(x_k-\nabla_{x_k}\Psi)\\P_\Omega(y_k+\nabla_{y_k}\Psi)\end{bmatrix}-\begin{bmatrix}\bar x\\\bar y\end{bmatrix}+\begin{bmatrix}x_k\\y_k\end{bmatrix}-\begin{bmatrix}x_k\\y_k\end{bmatrix}\right)\le 0.$
By straightforward algebra, it can be rearranged to
$\begin{bmatrix}x_k-\bar x\\y_k-\bar y\end{bmatrix}^\top\begin{bmatrix}-x_k+P_\Omega(x_k-\nabla_{x_k}\Psi)\\-y_k+P_\Omega(y_k+\nabla_{y_k}\Psi)\end{bmatrix}\le-\left\|\begin{bmatrix}-x_k+P_\Omega(x_k-\nabla_{x_k}\Psi)\\-y_k+P_\Omega(y_k+\nabla_{y_k}\Psi)\end{bmatrix}\right\|_2^2-\begin{bmatrix}\nabla_{x_k}\Psi\\-\nabla_{y_k}\Psi\end{bmatrix}^\top\begin{bmatrix}-x_k+P_\Omega(x_k-\nabla_{x_k}\Psi)\\-y_k+P_\Omega(y_k+\nabla_{y_k}\Psi)\end{bmatrix}-\begin{bmatrix}\nabla_{x_k}\Psi\\-\nabla_{y_k}\Psi\end{bmatrix}^\top\begin{bmatrix}x_k-\bar x\\y_k-\bar y\end{bmatrix}.$
Since $\nabla_{x_k}\Psi^\top(x_k-\bar x)\ge 0$ and $-\nabla_{y_k}\Psi^\top(y_k-\bar y)\ge 0$, the above inequality can be converted to
$\begin{bmatrix}x_k-\bar x\\y_k-\bar y\end{bmatrix}^\top\begin{bmatrix}-x_k+P_\Omega(x_k-\nabla_{x_k}\Psi)\\-y_k+P_\Omega(y_k+\nabla_{y_k}\Psi)\end{bmatrix}\le-\left\|\begin{bmatrix}-x_k+P_\Omega(x_k-\nabla_{x_k}\Psi)\\-y_k+P_\Omega(y_k+\nabla_{y_k}\Psi)\end{bmatrix}\right\|_2^2-\begin{bmatrix}\nabla_{x_k}\Psi\\-\nabla_{y_k}\Psi\end{bmatrix}^\top\begin{bmatrix}-x_k+P_\Omega(x_k-\nabla_{x_k}\Psi)\\-y_k+P_\Omega(y_k+\nabla_{y_k}\Psi)\end{bmatrix}.$
By lemma 3, the inequality can be rewritten as
$\begin{bmatrix}x_k-\bar x\\y_k-\bar y\end{bmatrix}^\top\begin{bmatrix}-x_k+P_\Omega(x_k-\nabla_{x_k}\Psi)\\-y_k+P_\Omega(y_k+\nabla_{y_k}\Psi)\end{bmatrix}\le-\left\|\begin{bmatrix}-x_k+P_\Omega(x_k-\nabla_{x_k}\Psi)\\-y_k+P_\Omega(y_k+\nabla_{y_k}\Psi)\end{bmatrix}\right\|_2^2+\frac{1}{\delta_k}\left\|\begin{bmatrix}\eta_k^\top\nabla_{x_k}\Psi\\-\eta_k^\top\nabla_{y_k}\Psi\end{bmatrix}\right\|_2^2,$
(3.19)
where $\eta_k=[\eta^1_k,\dots,\eta^n_k]^\top$ and $\eta^i_k$ is defined in lemma 3.
Substituting 3.19 into 3.18 gives:
$\left\|\begin{bmatrix}x_{k+1}-\bar x\\y_{k+1}-\bar y\end{bmatrix}\right\|_2^2-\left\|\begin{bmatrix}x_k-\bar x\\y_k-\bar y\end{bmatrix}\right\|_2^2\le 2\delta_k\left(-\left\|\begin{bmatrix}-x_k+P_\Omega(x_k-\nabla_{x_k}\Psi)\\-y_k+P_\Omega(y_k+\nabla_{y_k}\Psi)\end{bmatrix}\right\|_2^2+\frac{1}{\delta_k}\left\|\begin{bmatrix}\eta_k^\top\nabla_{x_k}\Psi\\-\eta_k^\top\nabla_{y_k}\Psi\end{bmatrix}\right\|_2^2\right)+\delta_k^2\left\|\begin{bmatrix}-x_k+P_\Omega(x_k-\nabla_{x_k}\Psi)\\-y_k+P_\Omega(y_k+\nabla_{y_k}\Psi)\end{bmatrix}\right\|_2^2=(\delta_k^2-2\delta_k)\left\|\begin{bmatrix}-x_k+P_\Omega(x_k-\nabla_{x_k}\Psi)\\-y_k+P_\Omega(y_k+\nabla_{y_k}\Psi)\end{bmatrix}\right\|_2^2+2\left\|\begin{bmatrix}\eta_k^\top\nabla_{x_k}\Psi\\-\eta_k^\top\nabla_{y_k}\Psi\end{bmatrix}\right\|_2^2.$
(3.20)
Now consider the Lyapunov function
$V([x_k,y_k])=\left\|\begin{bmatrix}x_k-\bar x\\y_k-\bar y\end{bmatrix}\right\|_2^2.$
According to the inequality given in equation 3.20, we have
$V([x_{k+1},y_{k+1}])-V([x_k,y_k])\le(\delta_k^2-2\delta_k)\left\|\begin{bmatrix}-x_k+P_\Omega(x_k-\nabla_{x_k}\Psi)\\-y_k+P_\Omega(y_k+\nabla_{y_k}\Psi)\end{bmatrix}\right\|_2^2+2\left\|\begin{bmatrix}\eta_k^\top\nabla_{x_k}\Psi\\-\eta_k^\top\nabla_{y_k}\Psi\end{bmatrix}\right\|_2^2.$
(3.21)
Thus, for
$V([xk+1,yk+1])-V([xk,yk])≤0,$
(3.22)
it directly implies
$(\delta_k^2-2\delta_k)\left\|\begin{bmatrix}-x_k+P_\Omega(x_k-\nabla_{x_k}\Psi)\\-y_k+P_\Omega(y_k+\nabla_{y_k}\Psi)\end{bmatrix}\right\|_2^2\le-2\left\|\begin{bmatrix}\eta_k^\top\nabla_{x_k}\Psi\\-\eta_k^\top\nabla_{y_k}\Psi\end{bmatrix}\right\|_2^2\ \Rightarrow\ (1-c)_+\le\delta_k\le 1+c,\quad \text{where } c=\sqrt{1-\frac{2\left\|\begin{bmatrix}\eta_k^\top\nabla_{x_k}\Psi\\-\eta_k^\top\nabla_{y_k}\Psi\end{bmatrix}\right\|_2^2}{\left\|\begin{bmatrix}-x_k+P_\Omega(x_k-\nabla_{x_k}\Psi)\\-y_k+P_\Omega(y_k+\nabla_{y_k}\Psi)\end{bmatrix}\right\|_2^2}}.$
(3.23)
When $\delta_k$ satisfies condition 3.23, $V([x_k,y_k])$ is a Lyapunov function, which implies the Lyapunov stability of model 3.14. For a discrete-time model, LaSalle's invariance principle (LaSalle, 1976) states that $x_k$ and $y_k$ converge to the largest invariant subset of the following set:
$\Theta=\{x_k,y_k\in\mathbb{R}^n : V([x_{k+1},y_{k+1}])-V([x_k,y_k])=0\}.$
From equation 3.21, it can be seen that $V([x_{k+1},y_{k+1}])-V([x_k,y_k])=0$ requires
$\left\|\begin{bmatrix}\eta_k^\top\nabla_{x_k}\Psi\\-\eta_k^\top\nabla_{y_k}\Psi\end{bmatrix}\right\|_2^2=\left\|\begin{bmatrix}-x_k+P_\Omega(x_k-\nabla_{x_k}\Psi)\\-y_k+P_\Omega(y_k+\nabla_{y_k}\Psi)\end{bmatrix}\right\|_2^2=0.$
According to lemma 4, $x_k$ and $y_k$ converge to the equilibrium of model 3.14. Following from inequality 3.22, we obtain the monotone sequence
$\left\|\begin{bmatrix}x_{k+1}-\bar x\\y_{k+1}-\bar y\end{bmatrix}\right\|_2^2\le\left\|\begin{bmatrix}x_k-\bar x\\y_k-\bar y\end{bmatrix}\right\|_2^2\le\cdots\le\left\|\begin{bmatrix}x_0-\bar x\\y_0-\bar y\end{bmatrix}\right\|_2^2.$
This indicates that $x_k$ and $y_k$ are bounded in $\mathbb{R}^n$. Therefore, there exists an increasing sequence $\{k_l\}_{l=1}^{\infty}$ such that
$\lim_{l\to\infty}\begin{bmatrix}x_{k_l}\\y_{k_l}\end{bmatrix}=\begin{bmatrix}\bar x\\\bar y\end{bmatrix}.$
Based on the above analysis, for any $k$, there exists $kl$ such that
$\left\|\begin{bmatrix}x_k-\bar x\\y_k-\bar y\end{bmatrix}\right\|_2^2\le\left\|\begin{bmatrix}x_{k_l}-\bar x\\y_{k_l}-\bar y\end{bmatrix}\right\|_2^2.$
Then, letting $l\to\infty$ and hence $k\to\infty$,
$\lim_{k\to\infty}\left\|\begin{bmatrix}x_k-\bar x\\y_k-\bar y\end{bmatrix}\right\|_2^2=\lim_{l\to\infty}\left\|\begin{bmatrix}x_{k_l}-\bar x\\y_{k_l}-\bar y\end{bmatrix}\right\|_2^2=0.$
Therefore, $\lim_{k\to\infty}x_k=\bar x$ and $\lim_{k\to\infty}y_k=\bar y$.$\square$

## 4  Experimental Results

### 4.1  Setup

To characterize the convergence behaviors of DTPNN and substantiate the performance of the proposed sparse NMF formulation for feature extraction, the experiments are based on the following two commonly used sparse NMF data sets:

• The ORL data set consists of face images collected from 40 individuals, each with 10 different views of dimension $32\times 32$. Thus, the input matrix $V$ for factorization has dimension $1024\times 400$. The factorization rank for this experiment is set to 25, as in Hoyer (2004).

• The CBCL data set contains a training set and a testing set. In this experiment, the training set is used. It contains 2429 face images, each of dimension $19\times 19$. Therefore, the input matrix $V$ for factorization has dimension $361\times 2429$. Following Hoyer (2004), the factorization rank for this experiment is set to 49.

In every experiment, the error tolerance is set as $10-6$. The maximum iteration number is set as $103$. For brevity, the neurodynamic models for solving problems P2.1 and P2.2 are denoted as DTPNN-H and DTPNN-W, respectively.

### 4.2  Convergence Behaviors

In this section, transient objective function values and transient neuronal states are plotted to show the convergence behaviors. According to theorem 1, the discrete-time model is stable when the step size $\eta\in(0,2)$. The sparsity level $s$ is fixed at $50\%$ in the experiment.

Then, for problem P2.1, the desired sparsity levels are set as
$s=\begin{cases}5000\ (=25\times 400\times 0.5), & \text{for ORL},\\ 59510.5\ (=49\times 2429\times 0.5), & \text{for CBCL}.\end{cases}$
For problem P2.2, the desired sparsity levels are set as
$s=\begin{cases}12800\ (=1024\times 25\times 0.5), & \text{for ORL},\\ 8844.5\ (=361\times 49\times 0.5), & \text{for CBCL}.\end{cases}$

Figure 1 shows the convergence behaviors of the DTPNN for solving problem P2.1 on the ORL and CBCL data sets. In panel a, the first and second rows show the transient states of $W$ and $H$. The plots in the third row show that all the entries of $Z$ converge to either zero or one. The plots in the fourth and fifth rows show the transient states of the inequality and equality Lagrangian multipliers, $\lambda$ and $\nu$. Their transient states become stable, indicating that the inequality and equality constraints are eventually satisfied. Panel b shows that the objective function and the augmented Lagrangian function converge.

Figure 1:

Transient states of DTPNN-H and corresponding function values for solving problem P2.1 over iterations.


Figure 2 shows the convergence behaviors of the DTPNN for solving problem P2.2 on the ORL and CBCL data sets. Panel a plots the stable behaviors of the transient states. The first two rows plot the trends of the decision variables $W$ and $H$. In the third row, all the entries of $Z$ converge to either zero or one, as required by the binary constraint. The transient behaviors of the Lagrangian multipliers show that the constraints are satisfied, and thus the augmented Lagrangian functions converge to the same values as the objective functions, as plotted in panel b.

Figure 2:

Transient states of DTPNN-W and corresponding function values for solving problem P2.2 over iterations.


### 4.3  Factorization Results

While a desirable sparsity level must be satisfied, factorization accuracy remains essential. Therefore, the performance is demonstrated in two aspects, accuracy and sparsity, measured by the relative reconstruction error (RRE) and the sparsity evaluation measure (SEM), respectively.

The RRE and SEM for problem P2.1 are defined as
$\mathrm{RRE}_H=\frac{\|V-W(H\circ Z)\|_F^2}{\|V\|_F^2},\quad \mathrm{SEM}_H=\frac{\|H\circ Z>\varepsilon\|_0}{r\times n},$
(4.1)
and the RRE and SEM for problem P2.2 are defined as
$\mathrm{RRE}_W=\frac{\|V-(W\circ Z)H\|_F^2}{\|V\|_F^2},\quad \mathrm{SEM}_W=\frac{\|W\circ Z>\varepsilon\|_0}{m\times r},$
(4.2)
where $\varepsilon$ is a small error tolerance, set to $10^{-6}$ in the experiments. The lower the RRE value, the higher the factorization accuracy; a small SEM value indicates a high sparsity level.
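Both measures are straightforward to compute; a sketch for problem P2.1 (the function name is ours):

```python
import numpy as np

def rre_sem_h(V, W, H, Z, eps=1e-6):
    """Relative reconstruction error and sparsity measure for P2.1
    (equations 4.1): SEM is the fraction of entries of H o Z above eps."""
    E = V - W @ (H * Z)
    rre = np.linalg.norm(E, "fro") ** 2 / np.linalg.norm(V, "fro") ** 2
    sem = np.count_nonzero(H * Z > eps) / Z.size
    return rre, sem
```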

The performance is compared with various sparse NMF formulations, including NMFSC (Hoyer, 2004), $l_1$-norm regularization, and VSMF (Li & Ngom, 2013), which were introduced in section 2. For the $l_1$-norm NMF formulation, we used two algorithms from a MATLAB toolbox, based on nonnegative alternating least squares (NALS) (Cichocki, Zdunek, Phan, & Amari, 2009) and the active set method (AS) (Kim & Park, 2008). The SNMF-H algorithm (Le Roux, Hershey, & Weninger, 2015), which controls only the sparsity of $H$, is compared on problem P2.1. Note that for the same $l_1$-norm regularized formulation, different algorithms require different weights to reach the same sparsity degree; therefore, it is meaningless to compare the results with fixed weights. Since the proposed formulations do not require weight tuning, we first ran DTPNN on four test cases, with the required sparsity level set in the constraint, and recorded the exact sparsity values obtained as SEM. We then carefully selected the weights of the compared algorithms to reach similar SEM values. Since a good model should obtain a desirable degree of sparsity while minimizing the reconstruction error, the relative reconstruction errors (RRE) are compared against DTPNN at the end. The four test cases require the factorization results to have fewer than $\{20\%,40\%,60\%,80\%\}$ nonzero elements, corresponding to degrees of sparsity of at least $\{80\%,60\%,40\%,20\%\}$.

Tables 1 and 2 record the results of solving problems P2.1 and P2.2, respectively. In both experiments on the ORL data set, DTPNN-H and DTPNN-W obtain RRE results comparable to the best competitors (differing only in the third decimal place) when the required sparsity is low, and achieve better RRE when high sparsity levels (low SEM) are required.

Table 1:
Performance Measures for Obtaining a Required Sparsity Level of $H$ on the ORL Data Set.

| $\|H\|_0\le$ | RRE (≤20%) | SEM (≤20%) | RRE (≤40%) | SEM (≤40%) | RRE (≤60%) | SEM (≤60%) | RRE (≤80%) | SEM (≤80%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DTPNN-H | **0.1605** | 15.70% | **0.1411** | 34.36% | 0.1358 | 48.06% | 0.1314 | 66.34% |
| SNMF-H (Le Roux et al., 2015) | infeasible | infeasible | infeasible | infeasible | 0.4121 | 59.94% | 0.2163 | 70.43% |
| NMFSC (Hoyer, 2004) | 0.2505 | 19.01% | 0.2413 | 39.75% | 0.2286 | 56.29% | 0.2231 | 66.52% |
| NALS (Cichocki, Zdunek, Phan, & Amari, 2009) | 0.1653 | 19.42% | 0.1427 | 34.41% | **0.1322** | 50.02% | **0.1305** | 67.02% |
| AS (Kim & Park, 2008) | 0.2006 | 16.69% | 0.1848 | 36.91% | 0.1697 | 53.85% | 0.1558 | 67.58% |
| VSMF (Li & Ngom, 2013) | 0.1816 | 18.93% | 0.1580 | 36.35% | 0.1367 | 58.98% | 0.1319 | 66.43% |

Note: The best RRE values are in bold.

Table 2:
Performance Measures for Obtaining a Required Sparsity Level of $W$ on the ORL Data Set.

| $\|W\|_0\le$ | RRE (≤20%) | SEM (≤20%) | RRE (≤40%) | SEM (≤40%) | RRE (≤60%) | SEM (≤60%) | RRE (≤80%) | SEM (≤80%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DTPNN-W | **0.1566** | 15.59% | **0.1482** | 35.95% | **0.1468** | 38.87% | **0.1422** | 54.32% |
| NMFSC (Hoyer, 2004) | 0.3188 | 18.41% | 0.2517 | 38.26% | 0.2480 | 39.31% | 0.2291 | 54.81% |
| NALS (Cichocki et al., 2009) | 0.1989 | 16.73% | 0.1787 | 36.04% | 0.1750 | 40.16% | 0.1679 | 54.48% |
| AS (Kim & Park, 2008) | 0.2059 | 15.98% | 0.1827 | 37.40% | 0.1725 | 39.79% | 0.1511 | 58.59% |
| VSMF (Li & Ngom, 2013) | 0.1891 | 15.71% | 0.1557 | 36.04% | 0.1545 | 40.19% | 0.1432 | 54.37% |

Note: The best RRE values are in bold.

When a matrix becomes sparser, less information is preserved, which leads to a higher reconstruction error, and thus the RRE increases. Although NMFSC reaches a given sparsity after careful selection of the parameter of its sparseness constraint, the final factorization result has low accuracy. SNMF-H reaches a sparseness level of 60% with a weight value of approximately 0.7; then, even increasing its weight slightly, by 0.001, drives all the elements of $H$ to zero. In other words, SNMF-H fails to obtain a sparse coefficient matrix $H$ on the ORL data set. NALS, AS, and VSMF are applied to the $l_1$-norm regularized formulation. They all perform well when the required sparsity levels are low in both experiments; when higher sparsity is required, their RRE increases. It is worth noting that, during the experiments, these algorithms require weights of different orders of magnitude even for the same level of sparsity on the same $l_1$-norm regularized formulation.

Figure 3 shows the basis matrix $W$ obtained from the two data sets. The first row plots the learned basis obtained by solving problem P2.1. Since the matrix is decomposed into a product of two nonnegative matrices, the sparseness of one matrix will cause the other one to be dense. As $H$ becomes sparser, the basis matrix $W$ gets denser, which finally shows a whole face instead of representing the features as parts. The second row shows the results by solving problem P2.2. When the sparsity level on the basis matrix $W$ is directly constrained, the face components are clearly separated.

Figure 3:

Learned basis matrix $W$ of ORL data set with rank $=$ 25 obtained by solving problem P2.1 using DTPNN-H (top row) and solving problem P2.2 using DTPNN-W (bottom row).


Tables 3 and 4 record the results obtained on the CBCL data set. It can be seen that DTPNN performs best for both $W$ and $H$ when high sparsity levels are required, whereas NALS and AS achieve better RRE when less sparsity is required. Since the initialization involves some randomness, small differences in the SEM values are expected.

Table 3:
Performance Measures for Obtaining a Required Sparsity Level of $H$ on the CBCL Data Set.

| $\|H\|_0\le$ | RRE (≤20%) | SEM (≤20%) | RRE (≤40%) | SEM (≤40%) | RRE (≤60%) | SEM (≤60%) | RRE (≤80%) | SEM (≤80%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DTPNN-H | **0.1863** | 11.38% | **0.1568** | 23.05% | 0.1415 | 35.28% | 0.1200 | 53.87% |
| SNMF-H (Le Roux et al., 2015) | 0.9891 | 14.88% | 0.3359 | 25.32% | 0.2274 | 35.85% | 0.1347 | 55.88% |
| NMFSC (Hoyer, 2004) | 0.2673 | 12.50% | 0.2529 | 23.38% | 0.1971 | 35.35% | 0.1621 | 0.54% |
| NALS (Cichocki et al., 2009) | 0.1887 | 11.58% | 0.1570 | 23.54% | **0.1336** | 35.28% | **0.1102** | 53.87% |
| AS (Kim & Park, 2008) | 0.1977 | 11.79% | 0.1704 | 23.78% | 0.1416 | 37.54% | 0.1215 | 54.82% |
| VSMF (Li & Ngom, 2013) | 0.1923 | 11.82% | **0.1568** | 23.56% | 0.1386 | 35.33% | 0.1124 | 54.69% |

Note: The best RRE values are in bold.

Table 4:
Performance Measures for Obtaining a Required Sparsity Level of $W$ on the CBCL Data Set.

| $\|W\|_0\le$ | RRE (≤20%) | SEM (≤20%) | RRE (≤40%) | SEM (≤40%) | RRE (≤60%) | SEM (≤60%) | RRE (≤80%) | SEM (≤80%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DTPNN-W | **0.1204** | 14.21% | **0.1207** | 20.54% | 0.1112 | 29.10% | 0.1008 | 39.01% |
| NMFSC (Hoyer, 2004) | 0.2429 | 15.78% | 0.2223 | 21.84% | 0.1909 | 29.29% | 0.1677 | 39.92% |
| NALS (Cichocki et al., 2009) | 0.1385 | 15.26% | 0.1298 | 21.84% | 0.1202 | 29.63% | 0.1087 | 40.05% |
| AS (Kim & Park, 2008) | 0.1281 | 14.28% | 0.1208 | 20.68% | **0.1100** | 29.50% | **0.0974** | 39.08% |
| VSMF (Li & Ngom, 2013) | infeasible | infeasible | 0.1242 | 29.77% | 0.1179 | 33.11% | 0.0988 | 41.21% |

Note: The best RRE values are in bold.

For VSMF, various values of the penalty parameter ranging from 0.1 to 50 were tested. The lowest sparsity degree that VSMF can achieve is approximately 29%; therefore, it fails to achieve the desired sparsity for the basis matrix $W$. SNMF-H and NMFSC both fail to achieve high accuracy when the required sparsity level is high. All of the algorithms used for comparison require weight parameters different from those used in the ORL experiments.

Figure 4 shows the basis matrices $W$ obtained by DTPNN-H and DTPNN-W. From Tables 3 and 4, it can be seen that the CBCL data set tends to generate a sparser basis matrix $W$: the matrix reaches about 50% sparsity even when no extra sparseness constraint is added. By enhancing sparsity on $H$, the basis matrix becomes denser. The result of solving problem P2.2 on CBCL is shown in the second row. When $W$ is constrained to have at most 20% nonzero entries, the result is significantly sparse, with only 14.21% of the elements being nonzero.
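As a note on measurement, the sparsity levels quoted in the text and tables are percentages of nonzero entries, so a solution can land well below its required level. A minimal sketch of this measure (the matrix below is hypothetical, not from the experiments):

```python
import numpy as np

def nonzero_percentage(M, tol=0.0):
    """Percentage of entries of M whose magnitude exceeds tol."""
    return 100.0 * np.count_nonzero(np.abs(M) > tol) / M.size

# Hypothetical 2 x 4 factor with 2 nonzero entries.
W = np.array([[0.0, 0.3, 0.0, 0.0],
              [0.0, 0.0, 0.7, 0.0]])
print(nonzero_percentage(W))  # -> 25.0
```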

Figure 4:

Learned basis matrix $W$ of the CBCL data set with rank $=$ 49 obtained by solving problem P2.1 using DTPNN-H (top row), and solving problem P2.2 using DTPNN-W (bottom row).

## 5  Conclusion

In this letter, we propose a mixed-integer optimization formulation with explicit sparsity constraints for sparse nonnegative matrix factorization, and we develop a discrete-time projection neural network for solving the formulated problem. An upper bound on the step size of the neural network is derived to guarantee its stability and convergence. The superiority of the approach is substantiated by two feature-extraction experiments with high-level sparsity requirements. Further investigations may aim at other reformulations of the NMF problem and at more efficient and effective solution procedures.

## Appendix:  Proofs of Lemmas 4 and 5 and Gradient Derivation

### A.1  Proof of Lemma 4

There are three cases that need to be considered separately. For all $k\in\{k_0,k_1,\ldots\}$:

• Case 1: $(x_i^k-\nabla f(x_i^k))\in[l_i,u_i]$ or $(y_i^k+\nabla f(y_i^k))\in[l_i,u_i]$. By the definition of the projection operator $P_\Omega$, this directly implies
$$x_i^{k+1}=x_i^k-\eta_i^k\nabla f_x(x_i^k),\qquad y_i^{k+1}=y_i^k+\eta_i^k\nabla f_y(y_i^k).$$
(A.1)
Model A.1 reaches its equilibrium within the given interval if and only if $\nabla f_x(x_i^k)=0$ and $\nabla f_y(y_i^k)=0$.

• Case 2: $(x_i^k-\nabla f(x_i^k))>u_i$ or $(y_i^k+\nabla f(y_i^k))>u_i$. This directly implies
$$x_i^{k+1}=x_i^k+\eta_i^k(-x_i^k+u_i)\ge x_i^k,\qquad y_i^{k+1}=y_i^k+\eta_i^k(-y_i^k+u_i)\ge y_i^k.$$
(A.2)
Since $(x_i^k-\nabla f(x_i^k))>u_i$ and $x_i^k\in[l_i,u_i]$, it follows that $\nabla f(x_i^k)\le 0$, so there exists a function $\gamma_i(x^k)\ge 0$. Similarly, since $(y_i^k+\nabla f(y_i^k))>u_i$ and $y_i^k\in[l_i,u_i]$, it follows that $\nabla f(y_i^k)\ge 0$, so there exists a function $\gamma_i(y^k)\ge 0$. Then model A.2 can be equivalently transformed into
$$x_i^{k+1}=x_i^k-\gamma_i(x^k)\nabla f(x_i^k)\ge x_i^k,\qquad y_i^{k+1}=y_i^k+\gamma_i(y^k)\nabla f(y_i^k)\ge y_i^k.$$
(A.3)
Model A.3 reaches its equilibrium within the given interval if and only if $\gamma_i(x^k)=0$ and $\gamma_i(y^k)=0$.

• Case 3: $(x_i^k-\nabla f(x_i^k))<l_i$ or $(y_i^k+\nabla f(y_i^k))<l_i$. This directly implies
$$x_i^{k+1}=x_i^k+\eta_i^k(-x_i^k+l_i)\le x_i^k,\qquad y_i^{k+1}=y_i^k+\eta_i^k(-y_i^k+l_i)\le y_i^k.$$
(A.4)
Since $(x_i^k-\nabla f(x_i^k))<l_i$ and $x_i^k\in[l_i,u_i]$, it follows that $\nabla f(x_i^k)\ge 0$, so there exists a function $\zeta_i(x^k)\ge 0$. Similarly, since $(y_i^k+\nabla f(y_i^k))<l_i$ and $y_i^k\in[l_i,u_i]$, it follows that $\nabla f(y_i^k)\le 0$, so there exists a function $\zeta_i(y^k)\ge 0$. Then model A.4 can be equivalently transformed into
$$x_i^{k+1}=x_i^k-\zeta_i(x^k)\nabla f(x_i^k)\le x_i^k,\qquad y_i^{k+1}=y_i^k+\zeta_i(y^k)\nabla f(y_i^k)\le y_i^k.$$
(A.5)
Model A.5 reaches its equilibrium within the given interval if and only if $\zeta_i(x^k)=0$ and $\zeta_i(y^k)=0$.

Therefore, the two models are equivalent. $\square$
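The case analysis above can be illustrated numerically. The sketch below is a toy quadratic with made-up bounds, not the letter's NMF network: it iterates the componentwise update $x_i^{k+1}=x_i^k+\eta(-x_i^k+P_{[l_i,u_i]}(x_i^k-\nabla f(x_i^k)))$, and components whose unconstrained stationary point lies inside the box settle there (case 1), while the others stop at the violated bound (cases 2 and 3):

```python
import numpy as np

def dtpnn_step(x, grad, l, u, eta):
    """One discrete-time projection step: x <- x + eta * (-x + P_[l,u](x - grad(x)))."""
    proj = np.clip(x - grad(x), l, u)  # projection onto the box [l, u]
    return x + eta * (-x + proj)

# Toy objective f(x) = 0.5 * ||x - c||^2; two components of c violate the box [0, 1].
c = np.array([0.5, 2.0, -1.0])
grad = lambda x: x - c
l, u = np.zeros(3), np.ones(3)

x = np.array([0.9, 0.1, 0.8])
for _ in range(500):
    x = dtpnn_step(x, grad, l, u, eta=0.5)

# Component 0 converges to its stationary point 0.5 (case 1);
# components 1 and 2 stop at the bounds u = 1 and l = 0 (cases 2 and 3).
print(np.round(x, 4))  # converges to P_[0,1](c) = [0.5, 1.0, 0.0]
```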

### A.2  Proof of Lemma 5

Let $(\bar w_c,\bar h_c,\bar z_c,\bar\lambda_c,\bar\nu_c)$ be the equilibrium point of the continuous-time projection neural network (CTPNN), and let $(\bar w_{k,d},\bar h_{k,d},\bar z_{k,d},\bar\lambda_{k,d},\bar\nu_{k,d})$ be the equilibrium point of the discrete-time projection neural network (DTPNN) at iteration $k$. Abbreviate $\Psi(w,h,z,\lambda,\nu)$ as $\Psi$.

For CTPNN to be stable,
$$\begin{aligned}
\varepsilon\frac{dw}{dt}&=-\bar w_c+P_\Omega(\bar w_c-\nabla_w\Psi)=0,\\
\varepsilon\frac{dh}{dt}&=-\bar h_c+P_\Omega(\bar h_c-\nabla_h\Psi)=0,\\
\varepsilon\frac{dz}{dt}&=-\bar z_c+P_\Omega(\bar z_c-\nabla_z\Psi)=0,\\
\varepsilon\frac{d\lambda}{dt}&=-\bar\lambda_c+P_\Omega(\bar\lambda_c+\nabla_\lambda\Psi)=0,\\
\varepsilon\frac{d\nu}{dt}&=-\bar\nu_c+P_\Omega(\bar\nu_c+\nabla_\nu\Psi)=0,
\end{aligned}$$
which requires
$$\nabla_w\Psi=0,\quad\nabla_h\Psi=0,\quad\nabla_z\Psi=0,\quad\nabla_\lambda\Psi=0,\quad\nabla_\nu\Psi=0.$$
For DTPNN to be stable,
$$\begin{aligned}
\bar w_{k+1,d}-\bar w_{k,d}&=\delta_k\big(-\bar w_{k,d}+(\bar w_{k,d}-\nabla_w\Psi)^+\big)=0,\\
\bar h_{k+1,d}-\bar h_{k,d}&=\delta_k\big(-\bar h_{k,d}+(\bar h_{k,d}-\nabla_h\Psi)^+\big)=0,\\
\bar z_{k+1,d}-\bar z_{k,d}&=\delta_k\big(-\bar z_{k,d}+P_{[0,1]}(\bar z_{k,d}-\nabla_z\Psi)\big)=0,\\
\bar\lambda_{k+1,d}-\bar\lambda_{k,d}&=\delta_k\big(-\bar\lambda_{k,d}+(\bar\lambda_{k,d}+\nabla_\lambda\Psi)^+\big)=0,\\
\bar\nu_{k+1,d}-\bar\nu_{k,d}&=\delta_k\big(-\bar\nu_{k,d}+(\bar\nu_{k,d}+\nabla_\nu\Psi)\big)=0,
\end{aligned}$$
which also requires
$$\nabla_w\Psi=0,\quad\nabla_h\Psi=0,\quad\nabla_z\Psi=0,\quad\nabla_\lambda\Psi=0,\quad\nabla_\nu\Psi=0.$$
Since the same gradients must vanish at the equilibrium states of both CTPNN and DTPNN, the two networks have the same equilibrium set. $\square$

### A.3  Gradient Computation for Equation 3.13

Here, we present the details of the matrix gradient computation for equation 3.13.

For $\ell=1,2$, the original functions are
$$\Psi_\ell(W,H,Z,\lambda,\nu)=f_\ell(W,H,Z)+\lambda\Big(\sum_{i=1}^{k}\sum_{j=1}^{m}z_{ij}-s\Big)+\sum_{i=1}^{k}\sum_{j=1}^{m}\nu_{ij}(z_{ij}^2-z_{ij})+\frac{\alpha}{2}\Big(\sum_{i=1}^{k}\sum_{j=1}^{m}z_{ij}-s\Big)^2+\frac{\beta}{2}\sum_{i=1}^{k}\sum_{j=1}^{m}(z_{ij}^2-z_{ij})^2,$$
where
$$\text{P2.1:}\quad f_1(W,H,Z)=\|V-W(H\circ Z)\|_F^2,\qquad\text{P2.2:}\quad f_2(W,H,Z)=\|V-(W\circ Z)H\|_F^2.$$
For convenience, we use a colon in the following equations to denote the trace product, that is,
$$A:B=\operatorname{Tr}(A^\top B).$$
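For readers less familiar with this notation, the identity $A:B=\operatorname{Tr}(A^\top B)=\sum_{ij}A_{ij}B_{ij}$ can be verified numerically (a throwaway check with random matrices, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.standard_normal((4, 3)), rng.standard_normal((4, 3))

trace_form = np.trace(A.T @ B)    # Tr(A^T B)
elementwise_form = np.sum(A * B)  # sum of the Hadamard product

assert np.isclose(trace_form, elementwise_form)
```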

#### A.3.1  Gradient Computation for Problem P2.1

Define matrices
$$A_1=H\circ Z,\qquad B_1=WA_1-V.$$
Rewriting the function $f_1$ with the new variables,
$$f_1=\|B_1\|_F^2=B_1:B_1,$$
$$\partial f_1=2B_1:\partial B_1=2B_1:W\,\partial A_1=2W^\top B_1:\partial A_1=2W^\top B_1:Z\circ\partial H=2Z\circ(W^\top B_1):\partial H,$$
$$\frac{\partial f_1}{\partial H}=2Z\circ(W^\top B_1),\qquad\frac{\partial\Psi_1}{\partial H}=2Z\circ\big(W^\top(W(H\circ Z)-V)\big).$$
Since $H$ and $Z$ are multiplied element-wise, their order does not affect the computation. The partial derivative with respect to $Z$ is
$$\frac{\partial f_1}{\partial Z}=2H\circ(W^\top B_1),$$
$$\frac{\partial\Psi_1}{\partial Z}=2H\circ\big(W^\top(W(H\circ Z)-V)\big)+\lambda\mathbf{1}+\alpha\Big(\sum_{i=1}^{k}\sum_{j=1}^{m}z_{ij}-s\Big)\mathbf{1}+\nu\circ(2Z-\mathbf{1})+\beta(2Z-\mathbf{1})\circ Z\circ(Z-\mathbf{1}).$$
The partial derivative with respect to $W$ is
$$\partial f_1=2B_1:\partial B_1=2B_1:\partial W\,A_1=2B_1A_1^\top:\partial W,$$
$$\frac{\partial\Psi_1}{\partial W}=2B_1A_1^\top=2(W(H\circ Z)-V)(H\circ Z)^\top.$$
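The closed forms for $f_1$ can be sanity-checked against finite differences. The snippet below is an illustrative check with random data (the dimensions, seed, and helper `num_grad` are not from the letter); it compares $\partial f_1/\partial H=2Z\circ(W^\top(W(H\circ Z)-V))$ and $\partial f_1/\partial W=2(W(H\circ Z)-V)(H\circ Z)^\top$ with numerical gradients:

```python
import numpy as np

rng = np.random.default_rng(1)
m, r, n = 5, 3, 4
V = rng.random((m, n))
W = rng.random((m, r))
H = rng.random((r, n))
Z = (rng.random((r, n)) > 0.5).astype(float)  # binary mask on H

f1 = lambda W, H: np.linalg.norm(V - W @ (H * Z), 'fro') ** 2

# Closed-form gradients of f1.
B1 = W @ (H * Z) - V
grad_H = 2 * Z * (W.T @ B1)
grad_W = 2 * B1 @ (H * Z).T

def num_grad(f, X, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at X."""
    G = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        E = np.zeros_like(X); E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

assert np.allclose(grad_H, num_grad(lambda H_: f1(W, H_), H), atol=1e-5)
assert np.allclose(grad_W, num_grad(lambda W_: f1(W_, H), W), atol=1e-5)
```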

#### A.3.2  Gradient Computation for Problem P2.2

Define matrices
$$A_2=W\circ Z,\qquad B_2=A_2H-V.$$
Rewriting the function $f_2$ with the new variables,
$$f_2=\|B_2\|_F^2=B_2:B_2,$$
$$\partial f_2=2B_2:\partial B_2=2B_2:\partial A_2\,H=2B_2H^\top:Z\circ\partial W=2Z\circ(B_2H^\top):\partial W,$$
$$\frac{\partial f_2}{\partial W}=2Z\circ(B_2H^\top),\qquad\frac{\partial\Psi_2}{\partial W}=2Z\circ\big(((W\circ Z)H-V)H^\top\big).$$
The partial derivative with respect to $Z$ is
$$\frac{\partial f_2}{\partial Z}=2W\circ(B_2H^\top),$$
$$\frac{\partial\Psi_2}{\partial Z}=2W\circ\big(((W\circ Z)H-V)H^\top\big)+\lambda\mathbf{1}+\alpha\Big(\sum_{i=1}^{k}\sum_{j=1}^{m}z_{ij}-s\Big)\mathbf{1}+\nu\circ(2Z-\mathbf{1})+\beta(2Z-\mathbf{1})\circ Z\circ(Z-\mathbf{1}).$$
The partial derivative with respect to $H$ is
$$\partial f_2=2B_2:A_2\,\partial H=2A_2^\top B_2:\partial H,$$
$$\frac{\partial\Psi_2}{\partial H}=2A_2^\top B_2=2(W\circ Z)^\top((W\circ Z)H-V).$$
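The same finite-difference test applies to problem P2.2. Again an illustrative check with random data (dimensions, seed, and the helper `num_grad` are made up for the demo), comparing $\partial f_2/\partial W=2Z\circ(((W\circ Z)H-V)H^\top)$ and $\partial\Psi_2/\partial H$'s data term $2(W\circ Z)^\top((W\circ Z)H-V)$ with numerical gradients:

```python
import numpy as np

rng = np.random.default_rng(2)
m, r, n = 5, 3, 4
V = rng.random((m, n))
W = rng.random((m, r))
H = rng.random((r, n))
Z = (rng.random((m, r)) > 0.5).astype(float)  # binary mask on W

f2 = lambda W, H: np.linalg.norm(V - (W * Z) @ H, 'fro') ** 2

# Closed-form gradients of f2.
B2 = (W * Z) @ H - V
grad_W = 2 * Z * (B2 @ H.T)
grad_H = 2 * (W * Z).T @ B2

def num_grad(f, X, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at X."""
    G = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        E = np.zeros_like(X); E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

assert np.allclose(grad_W, num_grad(lambda W_: f2(W_, H), W), atol=1e-5)
assert np.allclose(grad_H, num_grad(lambda H_: f2(W, H_), H), atol=1e-5)
```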

## Acknowledgments

This work was supported in part by the Research Grants Council of the Hong Kong Special Administrative Region of China under grants 11208517, 11202318, 11202019, 11200116 (9042322), 11206317 (9042489), and 11209819 (9042816); and by the National Natural Science Foundation of China under grant 61673330; and by the Key Project of Science and Technology Innovation 2030 supported by the Ministry of Science and Technology of China (grant 2018AAA0101301).

## References

Alireza, H., Wang, J., & Hosseini, S. (2013). A recurrent neural network for solving a class of generalized convex optimization problems. Neural Networks, 44, 78–86.

Bazaraa, S. (2013). Nonlinear programming: Theory and algorithms (3rd ed.). Hoboken, NJ: Wiley.

Che, H., & Wang, J. (2018). A nonnegative matrix factorization algorithm based on a discrete-time projection neural network. Neural Networks, 103, 63–71.

Che, H., & Wang, J. (2019a). A collaborative neurodynamic approach to global and combinatorial optimization. Neural Networks, 114, 15–27.

Che, H., & Wang, J. (2019b). A two-timescale duplex neurodynamic approach to biconvex optimization. IEEE Transactions on Neural Networks and Learning Systems, 30, 2503–2514.

Che, H., & Wang, J. (in press). A two-time-scale duplex neurodynamic approach to mixed-integer optimization. IEEE Transactions on Neural Networks and Learning Systems, 31.

Cichocki, A., Zdunek, R., Phan, A. H., & Amari, S. (2009). Nonnegative matrix and tensor factorizations: Applications to exploratory multi-way data analysis and blind source separation. Hoboken, NJ: Wiley.

Fan, J., & Wang, J. (2017). A collective neurodynamic optimization approach to nonnegative matrix factorization. IEEE Transactions on Neural Networks and Learning Systems, 28(10), 2344–2356.

Field, D. J. (1994). What is the goal of sensory coding? Neural Computation, 6(10), 559–601.

He, X., Li, C., Huang, T., & Huang, J. (2014). A recurrent neural network for solving bilevel linear programming problem. IEEE Transactions on Neural Networks and Learning Systems, 25, 824–830.

Hopfield, J., & Tank, D. (1986). Computing with neural circuits: A model. Science, 233, 625–633.

Hoyer, P. O. (2002). Non-negative sparse coding. In Proceedings of the IEEE Workshop on Neural Networks for Signal Processing (pp. 557–565). Piscataway, NJ: IEEE.

Hoyer, P. O. (2004). Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5, 1457–1469.

Kim, H., & Park, H. (2007). Sparse non-negative matrix factorizations via alternating non-negativity constrained least squares for microarray data analysis. Bioinformatics, 23(12), 1495–1502.

Kim, H., & Park, H. (2008). Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM Journal on Matrix Analysis and Applications, 30(2), 713–730.

Kinderlehrer, D., & Stampacchia, G. (1980). An introduction to variational inequalities and their applications. New York: Academic Press.

LaSalle, J. (1976). The stability of dynamical systems. Philadelphia: Society for Industrial and Applied Mathematics.

Le, X., & Wang, J. (2017). A two-time-scale neurodynamic approach to constrained minimax optimization. IEEE Transactions on Neural Networks and Learning Systems, 28(3), 620–629.

Le Roux, J., Hershey, J. R., & Weninger, F. (2015). Sparse NMF half-baked or well done? (Technical Report TR2015-023). Cambridge, MA: Mitsubishi Electric Research Labs.

Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by nonnegative matrix factorization. Nature, 401, 788–791.

Leung, M., & Wang, J. (2018). A collaborative neurodynamic approach to multiobjective optimization. IEEE Transactions on Neural Networks and Learning Systems, 29(11), 5738–5748.

Leung, M., & Wang, J. (in press). Minimax and biobjective portfolio selection based on collaborative neurodynamic optimization. IEEE Transactions on Neural Networks and Learning Systems, 31.

Li, C., Yu, X., Huang, T., Chen, G., & He, X. (2015). A generalized Hopfield network for nonsmooth constrained convex optimization: Lie derivative approach. IEEE Transactions on Neural Networks and Learning Systems, 27, 1–14.

Li, Y., & Ngom, A. (2013). Versatile sparse matrix factorization and its applications in high-dimensional biological data analysis. In M. Comin, L. Käll, E. Marchiori, A. Ngom, & J. Rajapakse (Eds.), Pattern recognition in bioinformatics (pp. 91–101). Berlin: Springer.

Lin, C. (2007). Projected gradient methods for nonnegative matrix factorization. Neural Computation, 19(10), 2756–2779.

Liu, Q., Cao, J., & Chen, G. (2010). A novel recurrent neural network with finite-time convergence for linear programming. Neural Computation, 22(11), 2962–2978.

Liu, Q., Guo, Z., & Wang, J. (2012). A one-layer recurrent neural network for constrained pseudoconvex optimization and its application for dynamic portfolio optimization. Neural Networks, 28, 99–109.

Liu, Q., & Wang, J. (2008). A one-layer recurrent neural network with a discontinuous activation function for linear programming. Neural Computation, 20(5), 1366–1383.

Liu, Q., & Wang, J. (2013). A one-layer projection neural network for nonsmooth optimization subject to linear equalities and bound constraints. IEEE Transactions on Neural Networks and Learning Systems, 24(5), 812–824.

Liu, Q., & Wang, J. (2016). $l_1$-minimization algorithms for sparse signal reconstruction based on a projection neural network. IEEE Transactions on Neural Networks and Learning Systems, 27(3), 698–707.

Liu, Q., Yang, S., & Wang, J. (2017). A collective neurodynamic approach to distributed constrained optimization. IEEE Transactions on Neural Networks and Learning Systems, 28, 1747–1758.

Mangasarian, O. L. (1987). Nonlinear programming (10th ed.). Philadelphia: Society for Industrial and Applied Mathematics.

Minoux, M. (1986). Mathematical programming: Theory and algorithms. Hoboken, NJ: Wiley.

Qin, S., Le, X., & Wang, J. (2017). A neurodynamic optimization approach to bilevel quadratic programming. IEEE Transactions on Neural Networks and Learning Systems, 28(11), 2580–2591.

Sun, X. L., Li, D., & McKinnon, K. I. M. (2005). On saddle points of augmented Lagrangians for constrained nonconvex optimization. SIAM Journal on Optimization, 15(4), 1128–1146.

Tank, D., & Hopfield, J. (1986). Simple “neural” optimization networks: An A/D converter, signal decision circuit, and a linear programming circuit. IEEE Transactions on Circuits and Systems, 33(5), 533–541.

Xia, Y., Feng, G., & Wang, J. (2004). A recurrent neural network with exponential convergence for solving convex quadratic program and related linear piecewise equations. Neural Networks, 17(7), 1003–1015.

Xu, W., Liu, X., & Gong, Y. (2003). Document clustering based on non-negative matrix factorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 267–273). New York: ACM.

Yan, Z., Fan, J., & Wang, J. (2017). A collective neurodynamic approach to constrained global optimization. IEEE Transactions on Neural Networks and Learning Systems, 28(5), 1206–1215.

Yan, Z., Wang, J., & Li, G. (2014). A collective neurodynamic optimization approach to bound-constrained nonconvex optimization. Neural Networks, 55, 20–29.

Yang, S., Liu, Q., & Wang, J. (2018). A collaborative neurodynamic approach to multiple-objective distributed optimization. IEEE Transactions on Neural Networks and Learning Systems, 29(4), 981–992.