## Abstract

Sparsity is a desirable property in many nonnegative matrix factorization (NMF) applications. Although some level of sparseness in NMF solutions can be achieved by using regularization, the resulting sparsity depends heavily on a regularization parameter that must be set in an ad hoc way. In this letter, we formulate sparse NMF as a mixed-integer optimization problem with sparsity imposed as binary constraints. A discrete-time projection neural network is developed for solving the formulated problem. Sufficient conditions for its stability and convergence are analytically characterized by using Lyapunov's method. Experimental results on sparse feature extraction are discussed to substantiate the superiority of this approach for extracting highly sparse features.

## 1 Introduction

Nonnegative matrix factorization (NMF) seeks to reveal the hidden structure of nonnegative data (Lee & Seung, 1999). It decomposes a nonnegative matrix $V \in \mathbb{R}^{m \times n}$ into a product of two low-rank nonnegative factor matrices $W \in \mathbb{R}^{m \times r}$ and $H \in \mathbb{R}^{r \times n}$, where $r \ll m$ (Lin, 2007). A common approach to NMF minimizes the Frobenius norm of the error matrix. NMF has been widely applied in numerous fields, such as image representation (Lee & Seung, 1999), document clustering (Xu, Liu, & Gong, 2003), and data analysis (Kim & Park, 2007), to name a few.
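For illustration, the Frobenius-norm NMF objective $\min_{W, H \geq 0} \|V - WH\|_F^2$ can be minimized with the classic multiplicative updates of Lee and Seung; a minimal numpy sketch (the rank, iteration count, and variable names are arbitrary choices for the example, not the letter's method):

```python
import numpy as np

def nmf(V, r, n_iter=200, eps=1e-9, seed=0):
    """Factor a nonnegative V (m x n) into W (m x r) and H (r x n)
    by minimizing ||V - W H||_F with multiplicative updates."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r))
    H = rng.random((r, n))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)  # update H; eps avoids division by zero
        W *= (V @ H.T) / (W @ H @ H.T + eps)  # update W
    return W, H

V = np.random.default_rng(1).random((20, 30))  # nonnegative test matrix
W, H = nmf(V, r=5)
err = np.linalg.norm(V - W @ H, "fro") / np.linalg.norm(V, "fro")
```

The multiplicative form keeps $W$ and $H$ nonnegative at every iteration, which is why no explicit projection is needed in this baseline.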

Sparse solutions are often desirable in machine learning applications (Field, 1994; Hoyer, 2004). Several formulations for generating sparse solutions have been presented in the literature. For example, the $l_1$-norm is added to the objective function as a regularization term for enhancing sparsity (Hoyer, 2002). A sparseness-constrained formulation is introduced in Hoyer (2004), where factorization accuracy is compromised for sparseness, as evidenced in the experimental results in section 4. An $l_1$/$l_2$-norm-based matrix factorization method named VSMF is proposed in Li and Ngom (2013). All of these regularization-based methods entail proper weighting by a regularization parameter to trade off factorization accuracy against solution sparsity.

Neurodynamic optimization is a parallel optimization approach pioneered by Hopfield and Tank (Hopfield & Tank, 1986; Tank & Hopfield, 1986) in the 1980s. It demonstrates admirable search capability for various optimization problems such as linear programming (Liu & Wang, 2008; Liu, Cao, & Chen, 2010; He, Li, Huang, & Huang, 2014), quadratic programming (Xia, Feng, & Wang, 2004; Qin, Le, & Wang, 2017), minimax optimization (Liu & Wang, 2016; Le & Wang, 2017; Leung & Wang, in press), nonsmooth optimization (Liu, Guo, & Wang, 2012; Liu & Wang, 2013; Alireza, Wang, & Hosseini, 2013; Li, Yu, Huang, Chen, & He, 2015), distributed optimization (Liu, Yang, & Wang, 2017; Yang, Liu, & Wang, 2018), biconvex optimization (Che & Wang, 2019b), multiobjective optimization (Leung & Wang, 2018, in press), and global and combinatorial optimization (Yan, Wang, & Li, 2014; Yan, Fan, & Wang, 2017; Che & Wang, 2019a, in press). In particular, a collaborative neurodynamic approach (Fan & Wang, 2017) and a discrete-time neurodynamic model (Che & Wang, 2018) have been developed for NMF.

In this letter, we present a sparsity-constrained NMF. We introduce a discrete-time projection neural network (DTPNN) for solving the proposed problem. The convergence of the model is analytically characterized.

The contributions of this letter are summarized as follows:

- A sparsity-constrained NMF formulation is proposed for obtaining a desired level of solution sparsity while minimizing the factorization error, without using regularization.
- A discrete-time projection neural network is developed for solving the formulated problem. Its convergence and stability are theoretically guaranteed when the step size is smaller than the derived upper bound.
- The experimental results show the superior performance of this approach when the required level of sparsity is high.

The rest of this letter is organized as follows. Section 2 reviews the related problems and introduces a sparse NMF formulation in the form of an integer optimization problem. Section 3 presents theoretical analyses of the DTPNN. Section 4 discusses the experimental results of sparse feature learning obtained by using DTPNN. Finally, section 5 gives conclusions.

## 2 Problem Formulations

Throughout this letter, lowercase and uppercase letters denote vectors and matrices, respectively. For a matrix $A \in \mathbb{R}^{n \times m}$, the operator $\mathrm{vec}(A)$ stacks the columns of $A$ into a single column vector; that is, $\mathrm{vec}(A) = [a_{11}, \ldots, a_{n1}, a_{12}, \ldots, a_{nm}]^\top$. The $l_0$-norm and $l_1$-norm of the vectorized matrix $A$, $\|\mathrm{vec}(A)\|_0$ and $\|\mathrm{vec}(A)\|_1$, are abbreviated as $\|A\|_0$ and $\|A\|_1$, respectively, throughout the rest of the letter.
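This notation can be checked numerically; numpy's Fortran-order `ravel` performs exactly this column stacking (the variable names here are illustrative):

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [2.0, 3.0]])

vec_A = A.ravel(order="F")     # column stacking: [a11, a21, a12, a22]
l0 = np.count_nonzero(vec_A)   # ||A||_0: number of nonzero entries
l1 = np.abs(vec_A).sum()       # ||A||_1: sum of absolute values
```

Since the norms only depend on the multiset of entries, the stacking order does not affect $\|A\|_0$ or $\|A\|_1$.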

### 2.1 Existing Formulations

*sparseness($\cdot$)* is a function defined on a row of $W$ or a column of $H$; for a vector $x \in \mathbb{R}^n$ (Hoyer, 2004),

$$\mathrm{sparseness}(x) = \frac{\sqrt{n} - \|x\|_1 / \|x\|_2}{\sqrt{n} - 1}.$$
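A minimal sketch of Hoyer's (2004) sparseness measure, $(\sqrt{n} - \|x\|_1/\|x\|_2)/(\sqrt{n} - 1)$, which evaluates to 0 for a uniform vector and 1 for a vector with a single nonzero entry:

```python
import numpy as np

def sparseness(x):
    """Hoyer (2004): (sqrt(n) - ||x||_1 / ||x||_2) / (sqrt(n) - 1)."""
    n = x.size
    l1 = np.abs(x).sum()
    l2 = np.sqrt((x ** 2).sum())
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1)

s_dense = sparseness(np.ones(4))                     # uniform vector -> 0.0
s_sparse = sparseness(np.array([0.0, 0.0, 0.0, 5.0]))  # single nonzero -> 1.0
```

The measure is scale invariant, so it constrains the shape of each row or column rather than its magnitude.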

### 2.2 Proposed Formulation

Let $Z \in \{0, 1\}^{n \times m}$ be a binary matrix containing zeros and ones only. The number of nonzero elements in $Z$ is equal to the sum of all the elements in $Z$, that is, $\|Z\|_0 = \|Z\|_1$.
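The identity $\|Z\|_0 = \|Z\|_1$, which allows the nonconvex count to be handled as a linear function of $Z$, is easy to verify numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = (rng.random((6, 4)) < 0.3).astype(float)  # a random binary matrix

l0 = np.count_nonzero(Z)  # number of nonzero entries
l1 = np.abs(Z).sum()      # sum of all entries (each entry is 0 or 1)
```

For any binary matrix the two values coincide, since each nonzero entry contributes exactly 1 to both norms.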

First, we show that problem 2.1 is equivalent to 2.6.

Therefore, the optimal objective function values are equal, that is, $f^*_{(2.8)} = f^*_{(2.9)}$, and this immediately implies that problem 2.1 is equivalent to problem 2.6.

## 3 Neurodynamic Approaches

### 3.1 Global Optimization

### 3.2 Augmented Lagrangian Function

The Lagrangian dual method can be used for solving constrained optimization problems if the duality gap between the primal and dual problems is zero (Mangasarian, 1987), and this is equivalent to the existence of a saddle point of the Lagrangian function (Minoux, 1986). However, the existence of a saddle point cannot be ensured for a global optimization problem due to its nonconvexity.

### 3.3 Continuous-Time Neurodynamic Model

It is clear that the columns are linearly independent of each other. Therefore, the regularity condition holds, which implies that Lemma 2 and the KKT conditions in equations 3.3 and 3.4 hold. Thus, every equilibrium of model 3.10 is a strict local minimum, and vice versa.

The derivations of gradient functions are given in the appendix.

### 3.4 Discrete-Time Neurodynamic Model

(Kinderlehrer & Stampacchia, 1980). $(P_\Omega(u) - u)^\top (v - P_\Omega(u)) \geq 0, \quad \forall u \in \mathbb{R}^n, \; \forall v \in \Omega.$

The proof is given in the appendix.
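The variational inequality of Kinderlehrer and Stampacchia (1980) can be checked numerically for a box set $\Omega = [l, u]^n$, where the projection reduces to componentwise clipping (the bounds and sample sizes below are arbitrary):

```python
import numpy as np

def project_box(u, lo, hi):
    """Projection onto the box Omega = [lo, hi]^n (componentwise clipping)."""
    return np.clip(u, lo, hi)

rng = np.random.default_rng(0)
lo, hi = 0.0, 1.0
u = 3.0 * rng.normal(size=5)  # an arbitrary point, possibly outside Omega
Pu = project_box(u, lo, hi)

# Check (P(u) - u)^T (v - P(u)) >= 0 for many sampled v in Omega.
ok = all((Pu - u) @ (v - Pu) >= -1e-12
         for v in rng.uniform(lo, hi, size=(100, 5)))
```

The inequality holds componentwise: wherever $u_i$ lies inside the box the term vanishes, and wherever it is clipped the two factors share the same sign.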

Lemma 4 states that DTPNN-1 can be converted into DTPNN-2 with a suitable step size $\eta_i$, since there always exists a function that maps the input into the range $[l_i, u_i]$. The bound on the step size $\eta_i$ is detailed in Theorem 6.
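As a generic illustration of a discrete-time projection iteration (not the letter's full DTPNN, which also involves Lagrange multiplier dynamics), a sketch of $x_{k+1} = P_\Omega(x_k - \eta \nabla f(x_k))$ on a box set:

```python
import numpy as np

def projected_gradient(grad, x0, lo, hi, eta=0.5, n_iter=500):
    """Iterate x_{k+1} = P_[lo, hi](x_k - eta * grad(x_k))."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = np.clip(x - eta * grad(x), lo, hi)  # gradient step, then projection
    return x

# Minimize f(x) = 0.5 * ||x - c||^2 over the box [0, 1]^3;
# the constrained minimizer is the projection of c onto the box.
c = np.array([2.0, -1.0, 0.5])
x_star = projected_gradient(lambda x: x - c, np.zeros(3), 0.0, 1.0)
```

A fixed point of this iteration satisfies the projection equation, which is exactly the variational-inequality characterization of an equilibrium used in the analysis.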

By Lemma 4, model 3.13 is equivalent to the following DTPNN:

### 3.5 Stability and Convergence Analysis

In this section, we analyze the stability and convergence of the DTPNN by using Lyapunov's method. The main result is given in Theorem 6.

The proof is given in the appendix.

By Lemma 3, the following inequality is obtained:

By Lemma 4, the inequality can be rewritten as

by Lemma 4.

By Lemma 4, $x_k$ and $y_k$ converge to the equilibrium of model 3.14. It follows from equation 3.22 that we can obtain the sequence

## 4 Experimental Results

### 4.1 Setup

To characterize the convergence behaviors of DTPNN and substantiate the performance of the proposed sparse NMF formulation for feature extraction, the experiments are based on the following two commonly used sparse NMF data sets:

- The ORL data set consists of face images collected from 40 individuals, each with 10 different views of dimension $32 \times 32$. Thus, the input matrix $V$ for factorization has dimension $1024 \times 400$. The factorization rank for this experiment is set to 25, as in Hoyer (2004).
- The CBCL data set contains a training set and a testing set; the training set is used in this experiment. It contains 2429 face images, each of dimension $19 \times 19$. Therefore, the input matrix $V$ for factorization has dimension $361 \times 2429$. Following Hoyer (2004), the factorization rank for this experiment is set to 49.

### 4.2 Convergence Behaviors

In this section, transient objective function values and transient neuronal states are plotted to show the convergence behaviors. According to Theorem 6, the discrete-time model is stable when the step size $\eta \in (0, 2)$. The sparsity level $s$ is fixed at 50% in the experiment.

Figure 1 shows the convergence behaviors of the DTPNN for solving problem P2.1 on the ORL and CBCL data sets. In panel a, the first and second rows show the transient states of $W$ and $H$. The plots in the third row show that all the entries of $Z$ converge to either zero or one. The plots in the fourth and fifth rows show the transient states of the equality and inequality Lagrange multipliers, $\lambda$ and $\nu$. Their transient states become stable, indicating that the equality and inequality constraints are eventually satisfied. Panel b shows that the objective function and the augmented Lagrangian function converge.

Figure 2 shows the convergence behaviors of the DTPNN for solving problem P2.2 on the ORL and CBCL data sets. Panel a plots the stable behaviors of the transient states. The first two rows plot the trajectories of the decision variables $W$ and $H$. In the third row, all the entries of $Z$ converge to either zero or one, as required by the binary constraint. The transient behaviors of the Lagrange multipliers show that the constraints are satisfied, and thus the augmented Lagrangian functions converge to the same values as the objective functions, as plotted in panel b.

### 4.3 Factorization Results

While satisfying a desired sparsity level, factorization accuracy remains essential. Therefore, the performance is evaluated in two respects, accuracy and sparsity, measured by the relative reconstruction error (RRE) and the sparsity evaluation measure (SEM), respectively.
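The two measures can be sketched as follows; RRE is the standard relative Frobenius reconstruction error, while SEM is implemented here as the fraction of nonzero entries, an assumption consistent with the tables in this section (lower SEM means a sparser matrix):

```python
import numpy as np

def rre(V, W, H):
    """Relative reconstruction error ||V - W H||_F / ||V||_F."""
    return np.linalg.norm(V - W @ H, "fro") / np.linalg.norm(V, "fro")

def sem(A, tol=1e-8):
    """Fraction of entries with magnitude above tol (lower = sparser)."""
    return np.count_nonzero(np.abs(A) > tol) / A.size

A = np.array([[0.0, 2.0],
              [0.0, 0.0]])
s = sem(A)  # 1 nonzero entry out of 4
```

The tolerance `tol` is an illustrative choice; in practice, entries driven to numerical zero by the solver should fall below it.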

The performance is compared with various sparse NMF formulations, including NMFSC (Hoyer, 2004), the $l_1$-norm formulation, and VSMF (Li & Ngom, 2013), which were introduced in section 2. For the $l_1$-norm NMF formulation, we used two algorithms from the Matlab toolbox, based on nonnegative alternating least squares (NALS) (Cichocki, Zdunek, Phan, & Amari, 2009) and the active set (AS) method (Kim & Park, 2008), for comparison. We also compared the SNMF-H algorithm (Le Roux, Hershey, & Weninger, 2015), which controls the sparsity of $H$ only, with problem P2.1. Note that for the same $l_1$-norm regularized formulation, different algorithms require different weights to reach the same degree of sparsity; it is therefore meaningless to compare the results by fixing the weights. Since the proposed formulations do not require weight tuning, we first ran four test cases on this problem setting: by setting the required sparsity level in the constraint, the exact sparsity values obtained by the DTPNN were recorded as SEM for later comparison. We then carefully selected the weights of the compared algorithms to reach similar SEM values. Since a good model should attain a desired degree of sparsity while minimizing the reconstruction error, the relative reconstruction errors (RRE) are finally compared against the DTPNN. The four test cases require the results to have fewer than 20%, 40%, 60%, and 80% nonzero elements, corresponding to degrees of sparsity of at least 80%, 60%, 40%, and 20%, respectively.

Tables 1 and 2 record the results of solving problems P2.1 and P2.2, respectively, on the ORL data set. DTPNN-H and DTPNN-W obtain RRE results comparable to the best competitors (differing only in the third decimal place) when the required sparsity is low, and achieve better RRE when high sparsity levels (low SEM values) are required.

Table 1: RRE and SEM on the ORL data set for problem P2.1, with $\|H\|_0$ constrained to at most 20%, 40%, 60%, or 80% of the entries.

| Method | RRE (20%) | SEM (20%) | RRE (40%) | SEM (40%) | RRE (60%) | SEM (60%) | RRE (80%) | SEM (80%) |
|---|---|---|---|---|---|---|---|---|
| DTPNN-H | 0.1605 | 15.70% | 0.1411 | 34.36% | 0.1358 | 48.06% | 0.1314 | 66.34% |
| SNMF-H (Le Roux et al., 2015) | infeasible | infeasible | infeasible | infeasible | 0.4121 | 59.94% | 0.2163 | 70.43% |
| NMFSC (Hoyer, 2004) | 0.2505 | 19.01% | 0.2413 | 39.75% | 0.2286 | 56.29% | 0.2231 | 66.52% |
| NALS (Cichocki, Zdunek, Phan, & Amari, 2009) | 0.1653 | 19.42% | 0.1427 | 34.41% | 0.1322 | 50.02% | 0.1305 | 67.02% |
| AS (Kim & Park, 2008) | 0.2006 | 16.69% | 0.1848 | 36.91% | 0.1697 | 53.85% | 0.1558 | 67.58% |
| VSMF (Li & Ngom, 2013) | 0.1816 | 18.93% | 0.1580 | 36.35% | 0.1367 | 58.98% | 0.1319 | 66.43% |


Note: The best values are in bold.

Table 2: RRE and SEM on the ORL data set for problem P2.2, with $\|W\|_0$ constrained to at most 20%, 40%, 60%, or 80% of the entries.

| Method | RRE (20%) | SEM (20%) | RRE (40%) | SEM (40%) | RRE (60%) | SEM (60%) | RRE (80%) | SEM (80%) |
|---|---|---|---|---|---|---|---|---|
| DTPNN-W | 0.1566 | 15.59% | 0.1482 | 35.95% | 0.1468 | 38.87% | 0.1422 | 54.32% |
| NMFSC (Hoyer, 2004) | 0.3188 | 18.41% | 0.2517 | 38.26% | 0.2480 | 39.31% | 0.2291 | 54.81% |
| NALS (Cichocki et al., 2009) | 0.1989 | 16.73% | 0.1787 | 36.04% | 0.1750 | 40.16% | 0.1679 | 54.48% |
| AS (Kim & Park, 2008) | 0.2059 | 15.98% | 0.1827 | 37.40% | 0.1725 | 39.79% | 0.1511 | 58.59% |
| VSMF (Li & Ngom, 2013) | 0.1891 | 15.71% | 0.1557 | 36.04% | 0.1545 | 40.19% | 0.1432 | 54.37% |


Note: The best values are in bold.

When a matrix becomes sparser, less information is preserved; this leads to a higher reconstruction error, and thus the RRE increases. Although NMFSC reaches a given sparsity after careful selection of the parameter of its sparseness($\cdot$) constraint, the final factorization has low accuracy. SNMF-H reaches a sparseness level of 60% with a weight value of approximately 0.7; increasing that weight by as little as 0.001 then drives its SEM directly to one (all the elements become zero). In other words, SNMF-H fails to obtain a sparse coefficient matrix $H$ on the ORL data set. NALS, AS, and VSMF are applied to the $l_1$-norm regularized formulation. They all perform well when the required sparsity levels are low in both experiments; when higher sparsity is required, the RRE increases. It is worth noting that during the experiments, these algorithms required weights of different orders of magnitude even for the same level of sparsity on the same $l_1$-norm regularized formulation.

Figure 3 shows the basis matrix $W$ obtained from the two data sets. The first row plots the learned basis obtained by solving problem P2.1. Since the matrix is decomposed into a product of two nonnegative matrices, the sparseness of one matrix will cause the other one to be dense. As $H$ becomes sparser, the basis matrix $W$ gets denser, which finally shows a whole face instead of representing the features as parts. The second row shows the results by solving problem P2.2. When the sparsity level on the basis matrix $W$ is directly constrained, the face components are clearly separated.

Tables 3 and 4 record the results obtained on the CBCL data set. DTPNN performs best for both $W$ and $H$ when high sparsity levels are required, while NALS and AS achieve better RRE when less sparsity is required. Since the initialization involves some randomness, small differences in the SEM values are to be expected.

Table 3: RRE and SEM on the CBCL data set for problem P2.1, with $\|H\|_0$ constrained to at most 20%, 40%, 60%, or 80% of the entries.

| Method | RRE (20%) | SEM (20%) | RRE (40%) | SEM (40%) | RRE (60%) | SEM (60%) | RRE (80%) | SEM (80%) |
|---|---|---|---|---|---|---|---|---|
| DTPNN-H | 0.1863 | 11.38% | 0.1568 | 23.05% | 0.1415 | 35.28% | 0.1200 | 53.87% |
| SNMF-H (Le Roux et al., 2015) | 0.9891 | 14.88% | 0.3359 | 25.32% | 0.2274 | 35.85% | 0.1347 | 55.88% |
| NMFSC (Hoyer, 2004) | 0.2673 | 12.50% | 0.2529 | 23.38% | 0.1971 | 35.35% | 0.1621 | 0.54% |
| NALS (Cichocki et al., 2009) | 0.1887 | 11.58% | 0.1570 | 23.54% | 0.1336 | 35.28% | 0.1102 | 53.87% |
| AS (Kim & Park, 2008) | 0.1977 | 11.79% | 0.1704 | 23.78% | 0.1416 | 37.54% | 0.1215 | 54.82% |
| VSMF (Li & Ngom, 2013) | 0.1923 | 11.82% | 0.1568 | 23.56% | 0.1386 | 35.33% | 0.1124 | 54.69% |


Note: The best values are in bold.

Table 4: RRE and SEM on the CBCL data set for problem P2.2, with $\|W\|_0$ constrained to at most 20%, 40%, 60%, or 80% of the entries.

| Method | RRE (20%) | SEM (20%) | RRE (40%) | SEM (40%) | RRE (60%) | SEM (60%) | RRE (80%) | SEM (80%) |
|---|---|---|---|---|---|---|---|---|
| DTPNN-W | 0.1204 | 14.21% | 0.1207 | 20.54% | 0.1112 | 29.10% | 0.1008 | 39.01% |
| NMFSC (Hoyer, 2004) | 0.2429 | 15.78% | 0.2223 | 21.84% | 0.1909 | 29.29% | 0.1677 | 39.92% |
| NALS (Cichocki et al., 2009) | 0.1385 | 15.26% | 0.1298 | 21.84% | 0.1202 | 29.63% | 0.1087 | 40.05% |
| AS (Kim & Park, 2008) | 0.1281 | 14.28% | 0.1208 | 20.68% | 0.1100 | 29.50% | 0.0974 | 39.08% |
| VSMF (Li & Ngom, 2013) | infeasible | infeasible | 0.1242 | 29.77% | 0.1179 | 33.11% | 0.0988 | 41.21% |


Note: The best values are in bold.

For VSMF, various values of the penalty parameter, ranging from 0.1 to 50, were tested. The lowest degree of sparsity that VSMF can achieve is approximately 29%; therefore, it fails to achieve the desired sparsity for the coefficient matrix $H$. SNMF and NMFSC both fail to achieve high accuracy when the required sparsity level is high. All of the compared algorithms require weight parameters different from those used in the ORL experiment.

Figure 4 shows the basis matrix $W$ obtained by DTPNN-W and DTPNN-H. From Tables 3 and 4, it can be seen that the CBCL data set tends to generate a sparser coefficient matrix $H$: it reaches 50% sparsity even when no extra sparseness constraint is added. As the sparsity of $H$ is enhanced, the basis matrix becomes denser. The result of solving problem P2.2 on CBCL is shown in the second row. When $W$ is constrained to have fewer than 20% nonzero entries, the result reaches a significantly high sparsity, with only 14.21% of the elements being nonzero.

## 5 Conclusion

In this letter, we propose a mixed-integer optimization formulation with a sparsity constraint for sparse nonnegative matrix factorization. A discrete-time projection neural network is developed for solving the formulated problem. An upper bound on the step size of the neural network is derived to guarantee its stability and convergence. The superiority of this approach is substantiated by two experiments on feature extraction with high-level sparsity requirements. Further investigations may aim at other NMF problem reformulations and more efficient and effective solution procedures.

## Appendix: Proofs of Lemmas 4 and 5 and Gradient Derivation

### A.1 Proof of Lemma 4

There are three cases to consider separately. For all $k \in \{k_0, k_1, \ldots\}$:

- Case 1: $(x^i_k - \nabla f(x^i_k)) \in [l_i, u_i]$ or $(y^i_k + \nabla f(y^i_k)) \in [l_i, u_i]$. By the definition of the projection operator $P_\Omega$, this directly implies
  $$x^i_{k+1} = x^i_k - \eta^i_k \nabla f_x(x^i_k), \qquad y^i_{k+1} = y^i_k + \eta^i_k \nabla f_y(y^i_k). \tag{A.1}$$
  Model A.1 reaches its equilibrium within the given interval if and only if $\nabla f_x(x^i_k) = 0$ and $\nabla f_y(y^i_k) = 0$.
- Case 2: $(x^i_k - \nabla f(x^i_k)) > u_i$ or $(y^i_k + \nabla f(y^i_k)) > u_i$. This directly implies
  $$x^i_{k+1} = x^i_k - \eta^i_k(-x^i_k + u^i_k) \geq 0, \qquad y^i_{k+1} = y^i_k + \eta^i_k(-y^i_k + u^i_k) \geq 0. \tag{A.2}$$
  Since $(x^i_k - \nabla f(x^i_k)) > u_i$ and $x^i_k \in [l_i, u_i]$, it follows that $\nabla f(x^i_k) \leq 0$, so there exists a function $\gamma_i(x_k) \geq 0$. Similarly, since $(y^i_k + \nabla f(y^i_k)) > u_i$ and $y^i_k \in [l_i, u_i]$, it follows that $\nabla f(y^i_k) \geq 0$, so there exists a function $\gamma_i(y_k) \leq 0$. Then model A.2 can be equivalently transformed into
  $$x^i_{k+1} = x^i_k - \gamma_i(x_k) \nabla f(x^i_k) \geq 0, \qquad y^i_{k+1} = y^i_k + \gamma_i(y_k) \nabla f(y^i_k) \geq 0. \tag{A.3}$$
  Model A.3 reaches its equilibrium within the given interval if and only if $\gamma_i(x_k) = 0$.
- Case 3: $(x^i_k - \nabla f(x^i_k)) < l_i$ or $(y^i_k + \nabla f(y^i_k)) < l_i$. This directly implies
  $$x^i_{k+1} = x^i_k - \eta^i_k(-x^i_k + l^i_k) \leq 0, \qquad y^i_{k+1} = y^i_k + \eta^i_k(-y^i_k + l^i_k) \leq 0. \tag{A.4}$$
  Since $(x^i_k - \nabla f(x^i_k)) < l_i$ and $x^i_k \in [l_i, u_i]$, it follows that $\nabla f(x^i_k) \geq 0$, so there exists a function $\zeta_i(x_k) \geq 0$. Similarly, since $(y^i_k + \nabla f(y^i_k)) < l_i$ and $y^i_k \in [l_i, u_i]$, it follows that $\nabla f(y^i_k) \leq 0$, so there exists a function $\zeta_i(y_k) \leq 0$. Then model A.4 can be equivalently transformed into
  $$x^i_{k+1} = x^i_k - \zeta_i(x_k) \nabla f(x^i_k) \leq 0, \qquad y^i_{k+1} = y^i_k + \zeta_i(y_k) \nabla f(y^i_k) \leq 0. \tag{A.5}$$
  Model A.5 reaches its equilibrium within the given interval if and only if $\zeta_i(x_k) = 0$.

Therefore, the two models are equivalent. $\square$

### A.2 Proof of Lemma 5

Let $(\bar{w}_c, \bar{h}_c, \bar{z}_c, \bar{\lambda}_c, \bar{\nu}_c)$ be the equilibrium point of the continuous-time projection neural network (CTPNN). Let $(\bar{w}_{k,d}, \bar{h}_{k,d}, \bar{z}_{k,d}, \bar{\lambda}_{k,d}, \bar{\nu}_{k,d})$ be the equilibrium point of the discrete-time projection neural network (DTPNN) at iteration $k$. Abbreviate $\Psi(w, h, z, \lambda, \nu)$ as $\Psi$.

### A.3 Gradient Computation for Equation 3.13

Here, we present the details of matrix gradient computation of equation 3.13.

#### A.3.1 Gradient Computation for Problem P2.1

#### A.3.2 Gradient Computation for Problem P2.2

## Acknowledgments

This work was supported in part by the Research Grants Council of the Hong Kong Special Administrative Region of China under grants 11208517, 11202318, 11202019, 11200116 (9042322), 11206317 (9042489), and 11209819 (9042816); by the National Natural Science Foundation of China under grant 61673330; and by the Key Project of Science and Technology Innovation 2030 of the Ministry of Science and Technology of China under grant 2018AAA0101301.