## ABSTRACT

The wide application of generative adversarial networks (GANs) benefits from successful training methods that guarantee the objective function converges to a local minimum. Nevertheless, designing an efficient and competitive training method is still a challenging task, owing to the cyclic behaviors of some gradient-based methods and the expensive computational cost of acquiring the Hessian matrix. To address this problem, we propose the *Adaptive Composite Gradients (ACG)* method, which is linearly convergent in bilinear games under suitable settings. Theoretical analysis and toy-function experiments both suggest that our approach alleviates cyclic behaviors and converges faster than recently proposed state-of-the-art algorithms; in our experiments, ACG improves convergence speed over the other methods by 33%. ACG is a novel *Semi-Gradient-Free* algorithm that reduces the computational cost of gradients and Hessians by utilizing predictive information about future iterations. Mixture-of-Gaussians experiments and real-world digital image generation experiments show that ACG outperforms several existing techniques, illustrating the superiority and efficacy of our method.

## 1. INTRODUCTION

Gradient descent-based machine learning and deep learning methods have been widely used in various computer science tasks over the past several decades. Optimizing a single-objective problem with gradient descent can easily converge to a saddle point in some cases [1]. However, a growing set of multi-objective problems needs to be optimized in numerous fields, such as deep reinforcement learning [2, 3], game theory, machine learning, and deep learning. Generative Adversarial Networks (GANs) [4] are a classical multi-objective problem in deep learning. GANs have a wide range of applications [5] because of their capability to learn to generate complex and high-dimensional target distributions. The existing literature on GANs can be divided into four categories: music generation [6, 7, 8], natural language [9, 10, 11, 12], methods of training GANs [13, 14, 15, 16], and image processing [17, 18, 19, 20]. GANs have made remarkable progress in image processing, such as video generation [21, 22], noise removal [23], deblurring [24], image-to-image translation [25, 26], image super-resolution [17], and medical image processing [27].

The GAN framework consists of two deep neural networks: a generator network and a discriminator network. The generator network is given a noise sample from a simple known distribution as input and produces a fake sample as output. The generator learns to make such fake samples not by directly using real data, but by adversarial training against a discriminator network. Bilinear games are two-player, non-cooperative zero-sum games with compact polytopal strategy sets. If the generator and discriminator exchange no information, then training a GAN is a non-cooperative zero-sum game. Therefore, GANs can be considered a bilinear game under suitable scenarios. The objective function of GANs [4] is often formulated as a two-player min-max game with a Nash equilibrium at the saddle points:

where $x \sim P_X(x)$ denotes a real data sample and $z \sim P_Z(z)$ denotes a sample from a noise distribution (often a uniform or Gaussian distribution). More forms of the GAN objective function are discussed in [28]. Although GANs have achieved remarkable applications, training GANs stably and quickly [29, 30] is still a challenging task, since training suffers from a strongly rotational gradient vector field circling around a Nash equilibrium (see Figure 1). Moreover, the gradient descent ascent-based methods used to optimize the GAN objective tend to exhibit limit oscillatory behavior because of imaginary components in the Jacobian eigenvalues.

The main idea of this work is to reduce the computational cost of the Hessian matrix in consensus optimization and SGA. Motivated by [15, 16] and [31], we propose a novel Adaptive Composite Gradient method, which can be used to calibrate and accelerate traditional methods such as SGD, RMSProp, Adam, consensus optimization, and SGA. The ACG method exploits three kinds of information in the iteration process: gradient information from past iteration steps, adaptive and predictive information for future iteration steps, and the projection information of the current iteration step mentioned in [16]. We fuse this information into a composite gradient for the update scheme of our algorithm, which can be deployed in deep networks and used to train GANs. The main contributions of this paper are as follows:

We propose a novel adaptive composite gradient (ACG) method, which can alleviate cyclic behaviors in training GANs. Meanwhile, ACG can reduce the computational consumption of gradients and improve convergence speed.

For purely adversarial bilinear game problems, we prove that the ACG method is linearly convergent under suitable conditions. In addition, we extend the ACG method to three-player game problems and verify its effect and efficiency with toy models.

Comprehensive experiments are conducted to test the effect of training GANs and Deep Convolutional Generative Adversarial Networks. The proposed method obtains competitive results compared with state-of-the-art (SOTA) methods.

## 2. RELATED WORK

There are several distinctive approaches to improving the training of GANs, but they show limitations in some cases. Some of them depend closely on prior assumptions, which can render these methods invalid. Moreover, some of them must pay the computing cost of the Hessian in the dynamics. We discuss related research in this section.

**Symplectic Gradient Adjustment (SGA) [32]:** Compared with traditional games, general games do not constrain the players' parameter sets or require the loss functions to be convex. General games can be decomposed into a potential game and a Hamiltonian game [32]. To introduce our method, we first review the SGA method as follows.

**Definition 2.1** A game is a set of players $[p]=\{1,2,\dots,n\}$ together with twice continuously differentiable loss functions $\{\ell_i:\mathbb{R}^d\to\mathbb{R}\}_{i=1}^{n}$. The players' parameters are $w=(w_1,w_2,\dots,w_n)\in\mathbb{R}^d$ with $w_i\in\mathbb{R}^{d_i}$, where $\sum_{i}^{n} d_i=d$. The $i$-th player controls $w_i$.

We use $g(w)$ to denote the simultaneous gradient, i.e., the gradient of the losses with respect to the players' parameters: $g(w)=(\nabla_{w_1}\ell_1,\nabla_{w_2}\ell_2,\dots,\nabla_{w_n}\ell_n)$. A bilinear game requires the losses to satisfy $\sum_{i=1}^{n}\ell_i=0$, for example:

This kind of game has a Nash equilibrium at $(x,y)=(\mathbf{0},\mathbf{0})$. The simultaneous gradient $g(x,y)=(Cy,-C^{T}x)$ rotates around the Nash equilibrium, as shown in Figure 6.
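The rotation can be checked directly: for $V(x,y)=x^{T}Cy$ the simultaneous gradient is everywhere orthogonal to the position vector, so following it circles the equilibrium rather than approaching it. A minimal sketch, with a hypothetical matrix $C$ and arbitrary starting points:

```python
import numpy as np

# Bilinear game V(x, y) = x^T C y with simultaneous gradient
# g(x, y) = (C y, -C^T x); C and the points are illustrative choices.
C = np.array([[1.0, 0.5], [0.0, 1.0]])
x = np.array([0.3, -0.7])
y = np.array([0.4, 0.2])

g = np.concatenate([C @ y, -C.T @ x])   # simultaneous gradient
z = np.concatenate([x, y])              # position relative to the Nash at 0

# The field is orthogonal to the position vector, so gradient steps
# rotate around the Nash equilibrium (0, 0) instead of approaching it.
print(np.dot(g, z))  # 0 up to floating-point error
```

The inner product vanishes because $x^{T}Cy - y^{T}C^{T}x = 0$, which is exactly the conservation law behind the cyclic behavior.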

We can derive the Hessian of an $n$-player game from the simultaneous gradient $g(w)$. The Hessian is $H(w)=\nabla_w \cdot g(w)^{T}=\left(\frac{\partial g_i(w)}{\partial w_j}\right)_{i,j=1}^{d}$, where $H\in\mathbb{R}^{d\times d}$. Further, the matrix form of the Hessian is as follows:

Applying the generalized Helmholtz decomposition [**Lemma 1** in [32]] to the above-mentioned Hessian of the game, we have H(*w*) = S(*w*)+A(*w*). David et al. (2018) [32] pointed out that a game is a potential game if A(*w*)≡0, and a Hamiltonian game if S(*w*)≡0. Potential games and Hamiltonian games are both well studied, and they are easy to solve. Since the cyclic behavior around the Nash equilibrium is caused by the simultaneous gradient, David et al. [32] proposed the Symplectic Gradient Adjustment method, which is as follows:

where **A** is from the Helmholtz decomposition of the Hessian. *g*_{λ} replaces the gradient among the iterates in GDA-based methods, and using *g*_{λ} to train GANs can alleviate the cyclic behaviors. If we consider the players in a bilinear game as a GAN, the SGA algorithm needs to pay the expensive computing cost of the Hessian, which lowers the algorithm's efficiency.
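As a rough sketch of this adjustment (not the authors' implementation), consider the scalar Hamiltonian game $\ell_1 = xy$, $\ell_2 = -xy$, where the Hessian is purely antisymmetric, so $A = H$; the adjusted gradient $g + \lambda A^{T}g$ then pulls the rotating field inward. The step size and λ below are assumed toy values:

```python
import numpy as np

# Sketch of symplectic gradient adjustment on the bilinear game
# l1 = x*y, l2 = -x*y (a Hamiltonian game, so H = A is antisymmetric).
alpha, lam = 0.1, 1.0            # step size and adjustment strength (assumed)
w = np.array([2.0, 1.0])          # (x, y)

A = np.array([[0.0, 1.0], [-1.0, 0.0]])   # antisymmetric part of the Hessian

for _ in range(200):
    g = np.array([w[1], -w[0]])           # simultaneous gradient (y, -x)
    g_adj = g + lam * (A.T @ g)           # SGA-adjusted gradient
    w = w - alpha * g_adj

print(np.linalg.norm(w))  # decays toward the Nash equilibrium at (0, 0)
```

With λ = 0 the iterates would spiral outward; the adjustment term adds a component pointing at the equilibrium.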

**Centripetal Acceleration [15]:** The simultaneous gradient shows cyclic behaviors around the Nash equilibrium. Hamiltonian games obey a conservation law in these gradient descent-based methods, so the cyclic behaviors can be considered a uniform circular motion process. As is well known, the direction of centripetal acceleration in a uniform circular motion points to the center of the circle. This characteristic can be used to modify the direction of the simultaneous gradient vector field and alleviate the cyclic behaviors. Based on these observations, Peng et al. (2020) [15] proposed the centripetal acceleration methods, which come in two versions used to train GANs: Simultaneous Centripetal Acceleration (Grad-SCA) and Alternating Centripetal Acceleration (Grad-ACA). Next, we review the centripetal acceleration methods.

Given a bilinear game, its losses are $\ell_1(\theta,\varphi)$ and $\ell_2(\theta,\varphi)$, corresponding to player 1 and player 2. The parameter space is $\Theta\times\Phi$, where $\theta,\varphi\in\mathbb{R}^{n}$. Player 1 controls the parameter $\theta$ and tries to minimize the payoff function $\ell_1$, while player 2 controls the parameter $\varphi$ and tries to minimize the payoff function $\ell_2$ under the non-cooperative situation. This game is a process of the two players adjusting their parameters to find a local Nash equilibrium that satisfies the following two requirements:

The centripetal acceleration methods require the two-player game to be differentiable. Then the above two payoff functions can be combined into a joint payoff function because of the zero-sum property of the game:

The derivation of Eq. (1) leads to a two-player game, which can be rewritten in terms of $\mathbf{V}(\theta,\varphi)$. The problem becomes finding a local Nash equilibrium:

where

To introduce the centripetal acceleration methods, we first review the simultaneous gradient descent method in [33]:

and its alternating version is

where α is the learning rate. The centripetal acceleration methods directly utilize the centripetal acceleration term to adjust simultaneous gradient descent. Gradient descent with simultaneous centripetal acceleration is then introduced as:

We can also obtain the gradient descent with the alternating centripetal acceleration method:

where $\alpha_1$, $\beta_1$, $\alpha_2$, $\beta_2$ in the above two versions of the centripetal acceleration methods are hyperparameters. The centripetal acceleration methods can calibrate other gradient-based methods. An intuitive illustration of the centripetal acceleration method is shown in Figure 2.
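A minimal sketch of the simultaneous variant on the toy game $V(\theta,\varphi)=\theta\varphi$, assuming the update form described above (a gradient step plus a correction built from the difference of consecutive gradients); the hyperparameter values are our own toy choices, not those of [15]:

```python
import numpy as np

# Sketch of gradient descent with simultaneous centripetal acceleration
# (Grad-SCA) on the toy bilinear game V(theta, phi) = theta * phi.
# alpha and beta are assumed values; plain GDA would cycle here.
alpha, beta = 0.1, 0.3
theta, phi = 1.0, 1.0
g_theta_prev, g_phi_prev = phi, theta    # gradients at the previous step

for _ in range(600):
    g_theta, g_phi = phi, theta          # grad_theta V and grad_phi V
    # descend on theta, ascend on phi, each corrected by the
    # centripetal term (current gradient minus previous gradient)
    theta_new = theta - alpha * g_theta - beta * (g_theta - g_theta_prev)
    phi_new = phi + alpha * g_phi + beta * (g_phi - g_phi_prev)
    g_theta_prev, g_phi_prev = g_theta, g_phi
    theta, phi = theta_new, phi_new

print(abs(theta), abs(phi))  # both approach the Nash equilibrium at 0
```

The gradient-difference term approximates the centripetal acceleration of the cyclic trajectory and bends the iterates toward its center.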

**Predictive Projection Centripetal Acceleration (PPCA) [16]:** The centripetal acceleration methods use information from the last iterative step to update $(\theta_{t+1},\varphi_{t+1})$. However, some methods utilize predictive step information to update $(\theta_{t+1},\varphi_{t+1})$, such as MPM, OMD, and OGDA. MPM was introduced by Liang et al. (2019) [34], and its dynamics are as follows:

Motivated by MPM and the centripetal acceleration methods, Li et al. (2020) [16] proposed the predictive projection centripetal acceleration methods. They also approximate the cyclic behavior around a Nash equilibrium as a uniform circular motion process. However, differently from Grad-SCA and Grad-ACA, they construct the centripetal acceleration term from the predictive step information rather than that of the last step. Meanwhile, they argue that the centripetal acceleration term only points to the matched center approximately. To make the centripetal acceleration term point to the center precisely, they proposed the projection centripetal acceleration methods. PPCA can directly modify gradient descent ascent and alternating gradient descent ascent. We can understand PPCA intuitively from Figure 2. The dynamics of predictive projection centripetal acceleration are given by the following formula:

where $\overline{\nabla V}(\theta_t,\varphi_t)=\left(-\nabla_{\theta}V(\theta_t,\varphi_t),\,\nabla_{\varphi}V(\theta_t,\varphi_t)\right)$ is the signed gradient vector at time $t$, and $\Pi_{\overline{\nabla V}(\theta_t,\varphi_t)}\left[\overline{\nabla V}(\theta_{t+\frac{1}{2}},\varphi_{t+\frac{1}{2}})-\overline{\nabla V}(\theta_t,\varphi_t)\right]$ is the projection of the centripetal acceleration term $\left[\overline{\nabla V}(\theta_{t+\frac{1}{2}},\varphi_{t+\frac{1}{2}})-\overline{\nabla V}(\theta_t,\varphi_t)\right]$ onto the vector $\overline{\nabla V}(\theta_t,\varphi_t)$.

Li et al. (2020) proposed two versions of the PPCA method by constraining the coefficient matrix, which must be full rank in bilinear games under the specified situation (**Lemma 3.2 in [16]**). The PPCA method for a bilinear game takes the form

We can also obtain the alternating PPCA formula as follows:

where *γ, α, β* are all hyperparameters.

Although the methods above have achieved significant results in training GANs, some of them require high computing cost and memory. The rest depend closely on the approximate circular motion process; if practical numerical experiments do not satisfy this approximation, these methods become invalid. In contrast, we propose the adaptive composite gradient method, which reduces the computing cost and removes the limitation brought by the approximate circular motion assumption.

## 3. MOTIVATION

### 3.1 Limitation Analysis

Hessian-based methods such as consensus optimization and SGA bring high computing costs when optimizing game problems. The centripetal acceleration algorithm reduces these expensive computational costs but depends on the assumption of an approximately uniform circular motion process, shown in Figure 3. PPCA is an improved version of the centripetal acceleration algorithm, shown in Figure 4(b). However, PPCA needs a full-rank coefficient matrix, without which the projection in the PPCA method becomes zero (**Lemma 3.2 in [16]**). The proposed ACG is motivated by two aspects. Firstly, we consider the cyclic behavior as a general circular motion process but not a uniform one. Therefore, similarly to the centripetal acceleration method, we modify directions by adding the projection of the centripetal acceleration term at time $t$. Secondly, **A**^{3}**DMM** provides the idea of using past iterations to predict future iterations, because the trajectory of the sequence $Z_k$ is either straight or spiral, and the cyclic behavior is also approximately a spiral. Motivated by these two aspects, we propose a novel adaptive composite gradient (ACG) method to alleviate the cyclic behavior in training GANs. ACG reduces computing costs and accelerates iteration by predicting future iterative information, which is why we call it a **Semi-Gradient-Free** method.

### 3.2 Motivational Theory

Our idea is motivated by **A**^{3}**DMM** [31]. We now review the **A**^{3}**DMM** method. Given an optimisation problem

where the essential assumptions are:

- **R** ∈ Γ_{0}(ℝ^{n}) and **J** ∈ Γ_{0}(ℝ^{m}) are proper convex and lower semi-continuous functions.
- **A**, **B** are injective linear operators.
- *ri*(*dom*(**R**) ∩ *dom*(**J**)) ≠ Ø and the set of minimizers is non-empty.

To derive the iteration scheme, we consider the augmented Lagrangian and rewrite the optimisation problem, which reads

where γ > 0 and Ψ is the Lagrange multiplier; then we have the iteration forms:

We can rewrite the above iteration into the following formula by introducing a new variable $Z_k \overset{\mathrm{def}}{=} \Psi_{k-1}+\gamma A x_k$:

The trajectory of the sequence $Z_k$ depends closely on the value of γ, where $k\in\mathbb{N}$. If a proper γ is selected, the eventual trajectory of $Z_k$ is a spiral, as shown in Figure 4. Since the trajectory of $Z_k$ has this spiral characteristic, it provides a way to use the previous $q$ iterations to predict the future $s$ iterations. The $Z_k$ of ADMM can be estimated by $\bar Z_{k,s}$, which is defined as follows:

for the choice of $s=1$. Given a sequence $Z_{k-i}$, $i=1,2,\dots,q+1$, we can define $v_i=Z_i-Z_{i-1}$ and obtain the past differences $v_{k-1},v_{k-2},\dots,v_{k-q}$, which can be used to estimate $v_k$. Let $V_{k-1}=[v_{k-1},v_{k-2},\dots,v_{k-q}]\in\mathbb{R}^{n\times q}$ and $c_k=\arg\min_{c\in\mathbb{R}^{q}}\|V_{k-1}c-v_k\|^2=\|\sum_{i=1}^{q}c_i v_{k-i}-v_k\|^2$. Then we can use $V_k c_k$ to approximate $v_{k+1}$, that is $V_k c_k\approx v_{k+1}$, and compute $\bar Z_{k+1}=Z_k+V_k c_k\approx Z_{k+1}$. By iterating $s$ times, we obtain $\bar Z_{k,s}\approx Z_{k+s}$. This method was proposed by Clarice Poon et al. [31].
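A minimal numpy sketch of this extrapolation step, using a hypothetical spiral sequence to stand in for the ADMM iterates $Z_k$:

```python
import numpy as np

# Fit coefficients c_k on the past q differences v_i = Z_i - Z_{i-1},
# then predict the next iterate; the spiral below is an assumed example.
q = 4
ks = np.arange(13)
spiral = np.stack([0.9**ks * np.cos(0.3 * ks),
                   0.9**ks * np.sin(0.3 * ks)], axis=1)
Z, Z_true_next = spiral[:-1], spiral[-1]      # hold out the point to predict

V = np.diff(Z, axis=0).T                      # columns are the differences v_i
V_prev, v_k = V[:, -q-1:-1], V[:, -1]         # past q differences and latest

# c_k = argmin_c || V_{k-1} c - v_k ||^2, solved by least squares
c_k, *_ = np.linalg.lstsq(V_prev, v_k, rcond=None)

V_k = V[:, -q:]                               # shift the window by one step
Z_bar = Z[-1] + V_k @ c_k                     # \bar Z_{k+1} = Z_k + V_k c_k
print(np.linalg.norm(Z_bar - Z_true_next))    # tiny prediction error
```

Because a rotation-scaling spiral satisfies a fixed linear recurrence on its differences, the fitted coefficients carry over to the next step, and the held-out point is recovered almost exactly.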

## 4. ADAPTIVE COMPOSITE GRADIENT METHOD

To make our method easier to understand, we fix some symbol conventions throughout the paper. Let $\vec a \odot \vec b$ denote the projection of $\vec a$ onto $\vec b$, where $\vec a,\vec b\in\mathbb{R}^{n}$ and ⊙ denotes the projection operation between two vectors. $w_t^{i}$ denotes the parameter controlled by the $i$-th player at time $t$, and $W_t=(w_t^{1},w_t^{2},\dots,w_t^{n})$. We use $\{\ell_i:\mathbb{R}^{d}\to\mathbb{R}\}_{i=1}^{n}$ to denote the losses of the $n$ players, as in **Definition 2.1**. We then obtain the payoff vector of the $n$ players at time $t$.
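The projection operation ⊙ can be written as a one-line helper; the function name is our own choice:

```python
import numpy as np

# Projection a ⊙ b of vector a onto vector b, as used throughout
# this section; "project" is a hypothetical helper name.
def project(a, b):
    """Return the component of a along b, i.e. (a·b / b·b) * b."""
    return (np.dot(a, b) / np.dot(b, b)) * b

a = np.array([2.0, 1.0])
b = np.array([1.0, 0.0])
print(project(a, b))  # [2. 0.]
```

Consistent with Proposition 5.3 later in the paper, the result is always a scalar multiple of $\vec b$.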

We consider a bilinear game problem with the following form

**Desiderata.** This two-player game has a Nash equilibrium, which must satisfy

D1. The two losses satisfy $\sum_{i=1}^{2}\ell_i(w^{1},w^{2})\equiv 0$;

D2. Each $\ell_i$ is differentiable over the parameter space $\Omega(w^{1})\times\Omega(w^{2})$, where $\Omega(w^{1})\times\Omega(w^{2})\subseteq\mathbb{R}^{n}\times\mathbb{R}^{n}$.

Player 1 holds the parameter $w^{1}$ and tries to minimize the loss $\ell_1$, while player 2 holds the parameter $w^{2}$ and tries to minimize the loss $\ell_2$. From D1 we get $\ell_1=-\ell_2$, so formula (22) can be rewritten as

The dynamics of the gradient descent ascent-based method are

The ACG method consists of three parts, which form the composite gradients. Firstly, we introduce the predictive part. In this section, $W_t=(w_t^{1},w_t^{2})$ is the parameter vector at time $t$. Similarly to **A**^{3}**DMM**, we utilize the $W$ of the previous $q$ iterations to predict the future $s$ iterations, denoted by $\bar W_{t,s}$. We then get the following formula for $\bar W_{t,s}$:

for the value of $s=1$. Define $v_i=W_i-W_{i-1}$, where $W_i$ is from the sequence $\{W_{k-i}\}_{i=0}^{q}$. We can use the past $v_{k-1},v_{k-2},\dots,v_{k-q}$ to approximate the last $v_k$. Then $c_k=\arg\min_{c\in\mathbb{R}^{q}}\|V_{k-1}c-v_k\|^2$. Finally, we obtain $\bar W_{k,1}=W_k+V_k c_k\approx W_{k+1}$. By looping $s$ times, we get $\bar W_{k,s}\approx W_{k+s}$. The second and third parts of our ACG method are $\nabla L((w_t^{1},w_t^{2}))$ and the projection of the centripetal acceleration term. The dynamics of the proposed ACG method are

where $\nabla_{w^{i}}L((w_t^{1},w_t^{2}))$ denotes the partial derivative with respect to $w^{i}$ corresponding to $\ell_i$ at time $t$, $\vec a_i$ denotes $\nabla_{w^{i}}L((w_t^{1},w_t^{2}))-\nabla_{w^{i}}L((w_{t-1}^{1},w_{t-1}^{2}))$, and $\vec b_i$ denotes $\vec a_i \odot \nabla_{w^{i}}L((w_{t-1}^{1},w_{t-1}^{2}))$, which is the projection of $\vec a_i$ onto the vector $\nabla_{w^{i}}L((w_{t-1}^{1},w_{t-1}^{2}))$. The basic intuition of our proposed method is shown in Figure 5.

For clarity, we present the scheme of the proposed adaptive composite gradient method in Algorithm 1, applied to a two-player game. Note that the ACG method can calibrate any optimizer based on gradient descent ascent. Meanwhile, the ACG method extends to $n$-player games. Given an $n$-player game, let $g(W_t)$ be the gradient of the losses for all players at time $t$. It is worth noting that all loss functions must be differentiable. We adopt the same way as Algorithm 1 to compute $\bar W_t$. The dynamics of the ACG method for $n$ players read

where $\vec a$ denotes $g(W_t)-g(W_{t-1})$ and $\vec b$ denotes $\vec a \odot g(W_{t-1})$, which is the projection of $\vec a$ onto $g(W_{t-1})$.

**Remark 4.1** The value of $k$ can be controlled in both Algorithm 1 and Algorithm 2. Let $k=q+i$ where $i\in\mathbb{N}^{+}$; we can set different acceleration ratios of the algorithms by adjusting the values of $k$ and $s$.

In Algorithm 1 and Algorithm 2, there is no need to calculate the gradient at every iteration because of $\bar W_{t+s}$. Therefore, we also call it a **Semi-Gradient-Free** method, whose merit is that it reduces the computational cost and converges fast. $\beta_1$, $\beta_2$ can be used to control the convergence speed in our algorithms.
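The Semi-Gradient-Free idea can be illustrated in isolation: run a few gradient steps, then predict a future iterate from the past $q$ parameter differences alone, with no further gradient evaluations. The sketch below does this for alternating GDA on the toy game $V(\theta,\varphi)=\theta\varphi$; the values of α, $q$, $s$ are assumed, and this is only the prediction component, not the full Algorithm 1:

```python
import numpy as np

# Predict W_{t+s} from past parameter differences (Semi-Gradient-Free),
# using alternating GDA iterates on V(theta, phi) = theta * phi.
alpha, q, s = 0.1, 4, 3            # assumed toy values
w = np.array([1.0, 0.5])            # (theta, phi)
history = [w.copy()]

def alt_gda(w):
    theta, phi = w
    theta = theta - alpha * phi     # player 1 descends
    phi = phi + alpha * theta       # player 2 ascends using the new theta
    return np.array([theta, phi])

for _ in range(q + 1):              # q+1 differences need q+2 iterates
    w = alt_gda(w)
    history.append(w.copy())

V = np.diff(np.array(history), axis=0).T   # columns are differences v_i
c, *_ = np.linalg.lstsq(V[:, :-1], V[:, -1], rcond=None)

# predict s steps ahead by repeatedly appending the extrapolated difference
W_bar, Vk = history[-1].copy(), V[:, 1:]
for _ in range(s):
    v_next = Vk @ c
    W_bar = W_bar + v_next
    Vk = np.column_stack([Vk[:, 1:], v_next])

actual = w.copy()                   # ground truth: keep running alt-GDA
for _ in range(s):
    actual = alt_gda(actual)
print(np.linalg.norm(W_bar - actual))   # small prediction error
```

Since the alternating GDA map is linear here, the extrapolated $\bar W_{t+s}$ matches the true iterate almost exactly, and the $s$ gradient evaluations are saved.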

## 5. THE CONVERGENCE OF ADAPTIVE COMPOSITE GRADIENT METHOD

### 5.1 The Convergence Analysis for the Bilinear Game

In this subsection, we mainly discuss the convergence of Adaptive Composite Gradient Method in the bilinear game, which reads

Any local Nash equilibrium $(\theta^{*},\varphi^{*})$ of the bilinear game satisfies the following conditions:

The local Nash equilibrium exists if and only if the ranks of $A$ and $A^{T}$ are the same as the dimensions of $B$ and $C$. In this way, without loss of generality, we can translate $(\theta,\varphi)$ to $(\theta-\theta^{*},\varphi-\varphi^{*})$, which is used to rewrite bilinear game (29) as:

Before analyzing the convergence property of the Adaptive Composite Gradient Method in this situation, we introduce some essential theorems and propositions.

**Theorem 5.1** *Suppose $F\in\mathbb{R}^{d\times d}$ defines the iterative system $x_{k+1}=Fx_k$. If $F$ is nonsingular and satisfies the spectral radius condition $\rho(F)<1$, then the iterates $x_k$ converge to 0 linearly.*
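A quick numerical illustration of Theorem 5.1, with an arbitrary contraction matrix $F$ of our own choosing (not the ACG iteration matrix):

```python
import numpy as np

# If the spectral radius of F is below 1, the iterates x_{k+1} = F x_k
# shrink linearly to 0; F here is an illustrative example matrix.
F = np.array([[0.5, 0.2], [-0.1, 0.6]])
rho = max(abs(np.linalg.eigvals(F)))
print(rho < 1)  # True

x = np.array([1.0, -1.0])
for _ in range(50):
    x = F @ x
print(np.linalg.norm(x))  # essentially zero after 50 iterations
```

Each iteration contracts the state by roughly a factor of ρ(F), which is the linear convergence the theorem guarantees.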

**Theorem 5.2** (OMD) *Consider a bilinear game $V(\theta,\varphi)=\theta^{T}A\varphi$, where $A\in\mathbb{R}^{d\times d}$. Assume $A$ is full rank. Then the following dynamics,*

*with the learning rate*

*obtain an ε-minimizer such that $(\theta_T,\varphi_T)\in B_2(\epsilon)$, provided*

*under the assumption that $\|(\theta_0,\varphi_0)\|,\|(\theta_1,\varphi_1)\|\leq\delta$.*

To discuss the convergence of the ACG method for the bilinear game, we divide the analysis into three parts (two cases). Without loss of generality, let $t$ represent the iterative step and $k$ the number of previous steps. The convergence property of Algorithm 1 is as follows.

**Case 1.** $\mathrm{mod}(t,k)\neq 0$, or $\mathrm{mod}(t,k)=0$ and $\rho(C_k)\geq 1$. The ACG method adopts the dynamics:

Taking $\alpha=2\beta=2\eta$ in **Case 1**, the dynamics reduce to **OMD**, which can be written as:

**Theorem 5.2** gives the condition on the learning rate of **OMD**, which converges exponentially. The convergence of **OMD** can be found in [34] (Theorem 3).

**Case 2.** $\mathrm{mod}(t,k)=0$ and $\rho(C_k)<1$. From Algorithm 1, we first compute $\bar W_{t+s}$ by:

Using the fixed-point formulation of ADMM, (35) can be written in a unified form $\bar W_t=\varepsilon(\bar W_{t-1})$; letting $V_t c_t=\sigma_t$, we have $\bar W_t=\varepsilon(\bar W_{t-1}+\sigma_t)$. We obtain the convergence of (35) iff $\sigma_t$ converges to 0. The convergence of (35) is based on the convergence of the inexact Krasnosel'skii-Mann fixed-point iteration in [31] (Proposition 5.34). The detailed convergence analysis of $\bar W_t=\varepsilon(\bar W_{t-1}+\sigma_t)$ is discussed in [31] (Proposition 4.2).

Next, we discuss the convergence of the composite gradient update scheme, which is written as follows:

where $G_{w^{1}},G_{w^{2}}$ are defined as:

**Proposition 5.3** *For any two vectors $\vec a,\vec b$, the projection of $\vec b$ onto $\vec a$ can be denoted as $\gamma\vec a$, where $\gamma\in\mathbb{R}$.*

According to Proposition 5.3, for the bilinear game the composite gradient update scheme reduces to

We can obtain the iterative matrix as:

where τ is defined by (35).

According to the iterative matrix, it is easy to obtain that $[\theta_{t+1},\varphi_{t+1},\theta_t,\varphi_t,\bar W_{\theta_{t+1}},\bar W_{\varphi_{t+1}}]^{\top}=F[\theta_t,\varphi_t,\theta_{t-1},\varphi_{t-1},\bar W_{\theta_t},\bar W_{\varphi_t}]^{\top}$, where $(\theta_t,\varphi_t)$ are generated by (38). With the assumption that $A$ is square and nonsingular in Proposition 5.4, we use the well-known Theorem 5.1 to establish linear convergence for the update scheme (38).

**Proposition 5.4** *Suppose that $A$ is square and nonsingular. Then the eigenvalues of $F$ are the roots of the sixth-order polynomial:*

*where Sp(·) denotes the collection of all eigenvalues.*

**Proposition 5.5** *Suppose that $A$ is square and nonsingular. Then $\Delta_t:=\|\theta_t\|^2+\|\varphi_t\|^2+\|\theta_{t+1}\|^2+\|\varphi_{t+1}\|^2+\|\bar w_{\theta_{t+1}}\|^2+\|\bar w_{\varphi_{t+1}}\|^2$ converges linearly to 0 for given γ with α and $\beta_1$ satisfying*

*where $\lambda_{\max}(\cdot)$, $\lambda_{\min}(\cdot)$ denote the largest and smallest eigenvalues of $A^{T}A$.*

### 5.2 The Convergence Analysis for the n-player Game

This subsection mainly discusses the convergence of the Adaptive Composite Gradient method in the general $n$-player game. The problem is described as in Definition 2.1. According to Algorithm 2, the convergence analysis in the general $n$-player game proceeds in the same way as that of the bilinear game, with three parts and two cases. Before analyzing the convergence property, we introduce some basic definitions.

**Definition 5.6** Suppose that $f:\mathbb{R}^{n}\to\mathbb{R}$ is convex and continuously differentiable. For $\forall x,y\in\mathbb{R}^{n}$, the gradient of $f$ is Lipschitz continuous with constant $L$ such that:

we say that $f$ belongs to the class $F_L^{1,1}$. If $f$ is strongly convex with modulus μ > 0 such that:

we say that $f$ belongs to $F_{\mu,L}^{1,1}$.

Next, we suppose that all $\ell_i$, $i=1,2,\dots,n$, belong to $F_L^{1,1}$. We now give the definition of a fixed point, which is also called a local Nash equilibrium in a game.

**Definition 5.7** $W^{*}$ is a local Nash equilibrium (fixed point) if $W^{*}$ satisfies $g(W^{*})=0$. We say that it is stable if $\nabla g(W^{*})\geq 0$ and unstable if $\nabla g(W^{*})\leq 0$.

**Theorem 5.8** *[Nesterov 1983] Let $f$ be a convex and β-smooth function. The well-known Nesterov's Accelerated Gradient Descent update scheme can be written as*

Then Nesterov's Accelerated Gradient Descent satisfies

*Nesterov (1983) proposed this accelerated gradient method, which achieves the optimal $O(1/t^{2})$ convergence rate.*
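A minimal sketch of this update scheme on a simple convex quadratic; the objective, the momentum schedule $(t-1)/(t+2)$, and the step size $\eta=1/L$ are standard illustrative choices, not settings from this paper:

```python
import numpy as np

# Nesterov's accelerated gradient descent on f(x) = 0.5 x^T Q x,
# a hypothetical smooth convex objective with L = 10.
Q = np.diag([1.0, 10.0])
grad = lambda x: Q @ x

eta = 1.0 / 10.0                 # step size 1/L
x = y = np.array([1.0, 1.0])

for t in range(1, 201):
    x_next = y - eta * grad(y)                       # gradient step at the lookahead
    y = x_next + (t - 1) / (t + 2) * (x_next - x)    # momentum extrapolation
    x = x_next

print(np.linalg.norm(x))  # approaches the minimizer at 0
```

The momentum term extrapolates beyond the current iterate, which is what lifts plain gradient descent's $O(1/t)$ rate to the $O(1/t^{2})$ rate of Theorem 5.8.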

The convergence of the ACG method for the general $n$-player game is also divided into two cases. Let $t$ represent the iterative step and $k$ the number of previous steps. The convergence property of Algorithm 2 is as follows.

**Case 1.** $\mathrm{mod}(t,k)\neq 0$, or $\mathrm{mod}(t,k)=0$ and $\rho(C_k)\geq 1$. From Algorithm 2, if $t$ and $k$ satisfy the conditions of **Case 1**, our proposed ACG method is the same as classical gradient descent, and the update scheme is

where α is a positive step-size parameter. According to Definition 5.7, let $W^{*}$ be the local Nash equilibrium and $L^{*}=L(W^{*})$. Based on Definition 5.6, if $\ell_i$, $i=1,2,\dots,n$, belong to $F_L^{1,1}$, then $L(W_t)-L^{*}$ associated with $\{W_t\}$ converges at rate $O(1/t)$. The more detailed convergence of averaged iterates with the generalized gradient descent update scheme in convex-concave games is analyzed in [36, 37].

**Case 2.** $\mathrm{mod}(t,k)=0$ and $\rho(C_k)<1$. According to Algorithm 2, we first compute $\bar W_{t+s}$ by:

The convergence of formula (44) is the same as that of **Case 2** in Section 5.1. More detail about the convergence of (44) is discussed in [31] (Proposition 4.2).

In **Case 2**, we mainly analyze the convergence of the composite gradient update scheme in formula (28). Before illustrating the convergence property of our proposed method, note that formula (42) can be equivalently written as

where α and β are step-size parameters. Based on Proposition 5.3, our proposed composite gradient method (28) can be transformed into a formula similar to (45). That is,

where $\bar W_{t+s}$ is equivalent to $(W_t-W_{t-1})$. Comparing (45) with (46), if the parameters are equivalently transformed, our proposed adaptive composite gradient method reduces to Nesterov's Accelerated Gradient (NAG) method. In 1983, Nesterov gave the $O(1/t^{2})$ convergence rate for convex smooth optimization in [38]. Convergence bounds for convex, non-convex, and smooth optimization are given in [39] (Theorems 1–4).

**Remark 5.1** In **Case 2**, our proposed adaptive composite gradient method has the same convergence rate and convergence bounds as the NAG method, given the assumption that all $\ell_i$, $i=1,2,\dots,n$, belong to $F_L^{1,1}$. We naturally obtain the convergence rate of the ACG method, which is $O(1/t^{2})$, based on Theorem 5.8.

## 6. EXPERIMENTS

### 6.1 Toy Functions Simulation

We tested our ACG methods of Algorithm 1 and Algorithm 2 on a bilinear game and a general game with 3 players, respectively. We tested the ACG method of Algorithm 1 on the following bilinear game:

It is obvious that the Nash equilibrium (stationary point) is (0,0). We compared our ACG with several other methods, whose results are presented in Figure 6(a). From the behaviors in Figure 6, the Sim-GDA method diverges and the Alt-GDA method rotates around the stationary point, while the other methods all converge to the Nash equilibrium. Our proposed ACG method converges faster than the other convergent methods.
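The qualitative gap between Sim-GDA and Alt-GDA can be reproduced on the generic toy game $V(x,y)=xy$ (a stand-in, since the exact game above is not restated here): Sim-GDA spirals outward while Alt-GDA stays on a bounded cycle.

```python
import numpy as np

# Sim-GDA vs Alt-GDA on V(x, y) = x * y; alpha is an assumed toy value.
alpha = 0.1
sim = np.array([1.0, 1.0])
alt = np.array([1.0, 1.0])

for _ in range(500):
    x, y = sim
    sim = np.array([x - alpha * y, y + alpha * x])   # simultaneous update
    x, y = alt
    x_new = x - alpha * y                            # alternating update:
    alt = np.array([x_new, y + alpha * x_new])       # y sees the new x

print(np.linalg.norm(sim) > 10)   # True: Sim-GDA has spiraled outward
print(np.linalg.norm(alt) < 10)   # True: Alt-GDA stays bounded
```

The simultaneous update multiplies the distance to the origin by $\sqrt{1+\alpha^2}>1$ every step, while the alternating update preserves a conserved quantity and merely rotates.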

In Figure 6 (b), we test our ACG method on the following general zero-sum game

The effects of the compared methods on this game show that all methods converge to the origin. Notably, the cyclic behavior of the Alt-GDA method has disappeared, and the Sim-GDA method converges. It is worth noting that the trajectory of our ACG method is the same as that of PPCA [16], and both ACG and PPCA [16] appear faster than the others. We also compared ACG with other methods on the following general game:

Its effects are presented in Figure 6(c), which shows that Sim-GDA and Grad-SCA diverge while the rest of the methods converge. APPCA [16] is faster than our ACG method in this game.

We used the last general zero-sum game (49) to test the robustness of the proposed ACG method in Figure 7. The learning rate α increases through {0.01, 0.05, 0.1} while the other parameters are kept the same. The ACG method converges fast with α = 0.01 and α = 0.05; although it converges more slowly with α = 0.1, it still converges to the origin rapidly.

The proposed Adaptive Composite Gradient (ACG) method is also suitable for general games with $n$ players. However, it is challenging to present the effect of a general $n$-player game with a toy function. To illustrate that Algorithm 2 adapts to $n$-player games, we show its effects on a general 3-player game. The payoff functions can be written as

where the local Nash equilibrium is (0,0,0). The effects are shown in Figure 8: the top row and the bottom left of Figure 8 are the trajectories of the compared methods, and the bottom right of Figure 8 shows the Euclidean distance of each iterate from the origin for the compared methods. Figure 8 shows that SGD, SGA, and our ACG method all converge to the origin. There is cyclic behavior in SGD, which leads to slow convergence. The second row of Figure 8 shows that the proposed ACG method approaches the origin faster than SGA and SGD.

### 6.2 Mixtures of Gaussians

We also tested the ACG method by training a toy GAN model, and we compared our method with other well-known methods on learning 5 Gaussians and 16 Gaussians. The mixtures of 16 Gaussians and 5 Gaussians are both assigned a standard deviation of 0.02. The ground truths for the mixtures of 16 Gaussians and 5 Gaussians are presented in Figure B1.

**Details on network architecture.** A GAN consists of a generator network and a discriminator network. We set up both the generator and discriminator networks with six fully connected layers, each with 256 neurons. We used a fully connected layer to replace the sigmoid function layer appended to the discriminator. We append a ReLU activation to each of the six layers in the generator and discriminator networks. The generator network has two output neurons, while the discriminator network has one output. The input of the generator is random noise sampled from a standard Gaussian distribution. The output of the generator is used as the input of the discriminator, and the output of the discriminator evaluates the quality of the points generated by the generator.

**Experimental environments.** We ran the mixture-of-Gaussians experiments on a computer with **CPU AMD Ryzen 7 3700, GPU RTX 2060, 6 GB RAM**, Python (version 3.6.7), Keras (version 2.3.1), TensorFlow (version 1.13.1), and PyTorch (version 1.3.1). Each compared method was run for 10,000 iterations.

We conducted experiments on the mixture of 5 Gaussians with the proposed ACG method and several other methods. The training settings of all compared algorithms are as follows:

- **RMSP**: the simultaneous RMSPropOptimizer provided by TensorFlow with learning rate α = 5×10^{−4}.
- **RMSP-alt**: the alternating RMSPropOptimizer realized by TensorFlow with learning rate α = 5×10^{−4}.
- **ConOpt** [40]: the Consensus Optimizer realized by TensorFlow with *h* = 1×10^{−4}, *γ* = 1.
- **RMSP-SGA** [32]: the Symplectic Gradient Adjusted RMSPropOptimizer realized by TensorFlow with learning rate α = 1×10^{−4}, *ξ* = 1.
- **RMSP-ACA** [15]: the Alternating Centripetal Acceleration on the RMSPropOptimizer realized by TensorFlow with learning rate α = 5×10^{−4}, *β* = 0.5.
- **SGA-ACG (ours)**: our proposed ACG method in Algorithm 1 on the SGA optimizer realized by PyTorch with learning rate α = 5×10^{−4}, β_{1} = 5×10^{−7}, β_{2} = α.

The numerical simulation results for the mixture of 5 Gaussians are shown in Figure 9. ConOpt, RMSP-SGA, and SGA-ACG all converge, and the generated mixture of 5 Gaussians closely approaches the ground truth in Figure B1. ConOpt, RMSP-SGA, and SGA-ACG (ours) are observed to converge at the same speed. We also compared these methods with other SOTA methods, such as RMSP, RMSP-alt, and RMSP-ACA; the simulation results are shown in Figure B2. To compare convergence speed among all six algorithms, we ran the mixture of 16 Gaussians with the same training settings as the mixture of 5 Gaussians, as shown in Figure 10.

From Figure 10, it is clear that our proposed SGA-ACG method converges faster than ConOpt and RMSP-SGA. More comparison results are shown in Figure B3: RMSP, RMSP-alt, and RMSP-ACA still fail to converge after 10,000 iterations. To compare convergence speed further, Figure 11 shows the time consumption of all compared methods.

Algorithms 1 and 2 contain a parameter *s*. We explored its influence on the final results through a series of experiments with *s* ranging over {50, 100, 150, 200} on the mixture of 16 Gaussians, as shown in Figure 12. The proposed SGA-ACG method converges faster as *s* increases.

### 6.3 Experiments on Prevalent Datasets

In the third experiment, we test our proposed ACG method on image generation tasks. We employ four prevalent datasets to illustrate that the ACG method can be applied in deep learning, choosing the standard MNIST [41], Fashion-MNIST [42], CIFAR-10 [43], and CelebA [44] datasets for realistic experiments.

**Network architecture.** We use two kinds of network architectures for GANs on the MNIST dataset. In the first, the generator has 2 fully connected layers with 256 and 512 neurons, each followed by a LeakyReLU layer with α = 0.2, and a Tanh activation layer as the last layer. The generator's input is 100-dimensional random noise sampled from a standard Gaussian distribution, and its output is an image of shape (28, 28, 1). The discriminator likewise has 2 fully connected layers with 512 and 256 neurons, each followed by a LeakyReLU layer with α = 0.2, the same as the generator; its last layer, however, is a Sigmoid activation. The discriminator's input includes both generated images and ground-truth MNIST images, and its output evaluates the quality of the images produced by the generator. The second architecture uses 4 layers of DCGANs [45] for both the generator and discriminator networks.
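The first (linear) architecture can be sketched in PyTorch as follows. The layer sizes and activations match the description above; the final linear projection to 28×28 in the generator is an assumption implied by the stated output shape.

```python
import torch
import torch.nn as nn

# Generator: 100-d Gaussian noise -> FC(256) -> FC(512) -> 28x28 image in [-1, 1].
generator = nn.Sequential(
    nn.Linear(100, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 512), nn.LeakyReLU(0.2),
    nn.Linear(512, 28 * 28), nn.Tanh(),      # Tanh as the last layer
)

# Discriminator: flattened image -> FC(512) -> FC(256) -> realness score in (0, 1).
discriminator = nn.Sequential(
    nn.Linear(28 * 28, 512), nn.LeakyReLU(0.2),
    nn.Linear(512, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),         # Sigmoid as the last layer
)

z = torch.randn(64, 100)                     # 100-dim standard Gaussian noise
img = generator(z).view(64, 1, 28, 28)       # a batch of (28, 28, 1) images
prob = discriminator(img.view(64, -1))       # quality score for each image
```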

**Experimental environments.** The experiments in this section run on a server equipped with **CPU E5-2698**, **4× GPU GTX 3090 aero**, **24 GB RAM**, Python (version 3.6.13), and PyTorch (version 1.8.0). We implemented all compared algorithms in PyTorch; their training settings on the four datasets are as follows:

- **SGD**: linear GANs on MNIST with learning rate *α* = 2×10^{−4}, except DCGANs on MNIST with *α* = 5×10^{−4}; Fashion-MNIST (*α* = 2×10^{−4}), CIFAR-10 (*α* = 2×10^{−4}), CelebA (*α* = 2×10^{−4}).
- **Adam**: linear GANs on MNIST (*α* = 3×10^{−4}), DCGANs on MNIST (*α* = 2×10^{−4}), Fashion-MNIST (*α* = 2×10^{−4}), CIFAR-10 (*α* = 2×10^{−4}), CelebA (*α* = 2×10^{−4}).
- **RMSP**: linear GANs on MNIST (*α* = 2×10^{−4}), DCGANs on MNIST (*α* = 5×10^{−4}), Fashion-MNIST (*α* = 5×10^{−4}), CIFAR-10 (*α* = 5×10^{−4}), CelebA (*α* = 5×10^{−4}).
- **RMSP-ACG**: only on the linear GANs (*α* = 5×10^{−4}, β_{1} = 5×10^{−7}, β_{2} = α).
- **Adam-ACG**: on all datasets, our proposed Adam-ACG method applied to both linear GANs and DCGANs with *α* = 5×10^{−4}, β_{1} = 5×10^{−7}, β_{2} = α.

For the linear GANs experiment on MNIST, we set the batch size to 64 and the number of epochs to 324. The generation results of our proposed methods are shown in Figure 13; further comparisons among these algorithms on MNIST are shown in Figure B4.

For the DCGANs experiments, the batch size is 64 and the number of epochs is 110 on the MNIST dataset; the same settings apply to Fashion-MNIST and CIFAR-10. On CelebA, the batch size and number of epochs are 128 and 70, respectively. The results of our methods on the four datasets are shown in Figure 14, with further comparisons among several algorithms on the same datasets in Figure B5.

## 7. CONCLUSION

We proposed the Adaptive Composite Gradients (ACG) method for finding a local Nash Equilibrium in game problems. The ACG algorithm alleviates cyclic behaviors, is robust, and can be easily integrated with SGD, Adam, RMSP, SGA, and other gradient-based optimizers. Since the ACG method exploits predicted information over the next *s* iterations, it is a novel semi-gradient-free algorithm, and it has a linear convergence rate. We further showed that SGA-ACG is competitive with the ConOpt and SGA methods on mixture-of-Gaussians generation tasks, and the toy-function experiment demonstrated that ACG applies to general zero-sum games with *n* players. Extensive image generation experiments show that our method can optimize generic deep learning models. However, our analysis is limited to convex and smooth zero-sum games. Non-convex and non-smooth games are more complex, and finding local solutions for them remains a challenging task worth researching in the future.

## ACKNOWLEDGEMENT

This work is supported by the National Key Research and Development Program of China (No.2018AAA0101001), Science and Technology Commission of Shanghai Municipality (No.20511100200), and supported in part by the Science and Technology Commission of Shanghai Municipality (No.18dz2271000).

## AUTHOR CONTRIBUTION STATEMENT

Conceptualization, methodology, algorithm designing, coding, original draft preparation and survey references: Huiqing Qi. Methodology, data analyses, manuscript review, funding acquisition, original draft preparation and manuscript revising: Fang Li. Methodology, data analyses, manuscript review and funding acquisition: Shengli Tan. Methodology, algorithm designing, manuscript review, funding acquisition: Xiangyun Zhang. All authors have read and agreed to the published version of the manuscript.

## REFERENCES


## 8. APPENDICES

### APPENDIX A. PROOFS IN SECTION 5

#### A.1 Proof of Proposition 5.3

**Proof.** Without loss of generality, let $\vec{a}=(a_1,a_2,a_3,\cdots,a_n)$ and $\vec{b}=(b_1,b_2,b_3,\cdots,b_n)$, where $n\geq 2$, $n\in\mathbb{N}^{*}$. Then, we have

The projection $\vec{p}$ of $\vec{b}$ onto $\vec{a}$ can be written as

Incorporating (65) into (66), we have

Using $\gamma$ to replace $\frac{\vec{a}\cdot\vec{b}}{\|\vec{a}\|_2^2}$, we obtain $\vec{p}=\gamma\vec{a}$, where $\gamma\in\mathbb{R}$.
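In compact form, the projection derived above is

```latex
\vec{p}
  = \operatorname{proj}_{\vec{a}}\vec{b}
  = \frac{\vec{a}\cdot\vec{b}}{\lVert\vec{a}\rVert_2^{2}}\,\vec{a}
  = \gamma\,\vec{a},
\qquad
\gamma := \frac{\vec{a}\cdot\vec{b}}{\lVert\vec{a}\rVert_2^{2}} \in \mathbb{R}.
```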

#### A.2 Proof of Proposition 5.4

**Proof.** The characteristic polynomial of the matrix (39) is

which is equivalent to

From (A5) we can derive

According to (A6), 0 and 1 cannot be its roots, since *A* is square and nonsingular. Eq. (A6) is equivalent to

Then, we can conclude that the eigenvalues of *F* are the roots of the sixth-order polynomial:

#### A.3 Proof of Proposition 5.5

**Proof.** Set the characteristic polynomial of the matrix (39) to 0, written as follows:

It is obvious that (A9) has 6 roots, of which λ_{1} = λ_{2} = τ are two. By the convergence of formula (35), τ is small and |τ| < 1. We mainly discuss the following polynomial:

Using Proposition 5.4, we have

Denote $a:=\alpha+\beta_1$ and $b:=\beta_1(1+\gamma)$; then (A11) can be written as

The four roots of (A12) are

Let $u:=\sqrt{a\xi+b\xi}$ and $v:=\sqrt{a\xi-b\xi}$; then we can obtain

Denote $s:=\frac{u+v}{2}$ and $t:=\frac{\sqrt{3}(v-u)}{2}$; then we have

The following proof process is the same as **(A.2)** in [29]. For a given complex number *z*, the absolute value of the real part of $\sqrt{z}$ is $\sqrt{\frac{|z|+\Re(z)}{2}}$ and the absolute value of the imaginary part of $\sqrt{z}$ is $\sqrt{\frac{|z|-\Re(z)}{2}}$. Since $s\leq 1$ by this Proposition, all the real parts of the four roots lie in the interval $[-S,S]$, where

and all the imaginary parts of the roots lie in the interval $[-T,T]$, where

Using the inequality $\sqrt{x+y}\leq\sqrt{x}+\frac{y}{2\sqrt{x}}$ for $x>0$, $y>0$, we can obtain
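As a quick numerical sanity check of this inequality (a sketch, not part of the proof): the right-hand side is the tangent line of the square root at $x$, which lies above the concave function $\sqrt{x+y}$.

```python
import math
import random

# Spot-check sqrt(x + y) <= sqrt(x) + y / (2 * sqrt(x)) for x > 0, y > 0.
# The right-hand side is the first-order expansion of sqrt at x; since
# sqrt is concave, its tangent line over-estimates the function.
random.seed(0)
for _ in range(10_000):
    x = random.uniform(1e-6, 100.0)
    y = random.uniform(0.0, 100.0)
    lhs = math.sqrt(x + y)
    rhs = math.sqrt(x) + y / (2 * math.sqrt(x))
    assert lhs <= rhs + 1e-9, (x, y)
```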

Then, we analyze the two cases $s\in(0,\frac{1}{2}]$ and $s\in(\frac{1}{2},1]$ separately.

**Case 1.** Suppose $0<s\leq\frac{1}{2}$. By the assumption of this proposition, $\left(|\alpha+\beta_1|+|2\beta_1(1+\gamma)|\right)/(\alpha+\beta_1)^2\leq\frac{0.1}{\xi}$ for all $\xi^2\in\mathrm{Sp}(A^{\top}A)$, so $|t|\leq\frac{s^2}{10}$. Then, based on $\frac{s^2}{2}\leq 1-\sqrt{1-s^2}$, we have

Combining $s\leq\frac{1}{2}$ with (A20), we can obtain

from which it follows that

Inequality (A22) follows from the fact that $\frac{|t|}{1-s^2}\leq\frac{s}{1-s^2}\leq 1$, and inequality (A23) uses $\sqrt{x+y}\leq\sqrt{x}+\frac{y}{2\sqrt{x}}$. Together, (A21)-(A23) can be written equivalently as

According to (A18) and (A19), we have

It is worth noting that $\sqrt{x+y}\leq\sqrt{x}+\frac{y}{2\sqrt{x}}$ holds with equality if and only if $y=0$. Then (A25) holds with equality when $t=0$ and $s=0$. Since $s>0$, we have the strict inequality $\rho(F)<1$, which implies linear convergence of the unit time $\nabla_t$.

**Case 2.** Suppose $\frac{1}{2}<s\leq 1$. Since $|t|\leq\frac{s^2}{10}\leq 0.1$, combining (A16) and (A17) directly, we can obtain

which also implies linear convergence.

### APPENDIX B. THE APPENDIX FIGURES OF EXPERIMENTS

This section shows additional figures. The two ground truths appear in Figure B1. Figure B2 compares our proposed method with other SOTA algorithms on the mixture of 5 Gaussians, and Figure B3 shows the compared methods on the mixture of 16 Gaussians. Figure B4 shows the compared methods with linear GANs on the MNIST dataset, and Figure B5 shows the compared methods with DCGANs on the four datasets (MNIST, Fashion-MNIST, CIFAR-10, and CelebA).