The wide applications of generative adversarial networks (GANs) benefit from successful training methods that guarantee the objective function converges to a local minimum. Nevertheless, designing an efficient and competitive training method remains challenging due to the cyclic behaviors of some gradient-based methods and the expensive computational cost of acquiring the Hessian matrix. To address these problems, we propose the Adaptive Composite Gradients (ACG) method, which is linearly convergent in bilinear games under suitable settings. Theoretical analysis and toy-function experiments both suggest that our approach alleviates cyclic behaviors and converges faster than recently proposed state-of-the-art algorithms; the convergence speed of ACG is improved by 33% over other methods. ACG is a novel semi-gradient-free algorithm that reduces the computational cost of gradients and Hessians by exploiting predictive information from future iterations. Experiments on mixtures of Gaussians and real-world image generation show that ACG outperforms several existing techniques, illustrating the superiority and efficacy of our method.

Gradient-descent-based machine learning and deep learning methods have been widely used in various computer science tasks over the past several decades. Optimizing a single-objective problem with gradient descent can easily converge to a saddle point in some cases [1]. However, there is a growing set of multi-objective problems to be optimized in numerous fields, such as deep reinforcement learning [2, 3], game theory, machine learning, and deep learning. Generative Adversarial Networks (GANs) [4] are a classical multi-objective problem in deep learning. GANs have a wide range of applications [5] because of their capability to learn complex, high-dimensional target distributions. The existing literature on GANs can be divided into four categories: music generation [6, 7, 8], natural language [9, 10, 11, 12], methods of training GANs [13, 14, 15, 16], and image processing [17, 18, 19, 20]. GANs have made remarkable progress in image processing, including video generation [21, 22], noise removal [23], deblurring [24], image-to-image translation [25, 26], image super-resolution [17], and medical image processing [27].

The GAN framework consists of two deep neural networks: a generator network and a discriminator network. The generator is given a noise sample from a simple known distribution as input and produces a fake sample as output. It learns to make such fake samples not by directly using real data, but through adversarial training against the discriminator network. Bilinear games are two-player, non-cooperative zero-sum games with compact polytopal strategy sets. If the generator and discriminator have no information communication, then training GANs is a non-cooperative zero-sum game; therefore, GANs can be considered a bilinear game under suitable scenarios. The objective function of GANs [4] is often formulated as a two-player min-max game with a Nash equilibrium at the saddle points:

\min_G \max_D V(D,G) = \mathbb{E}_{x\sim P_X(x)}[\log D(x)] + \mathbb{E}_{z\sim P_Z(z)}[\log(1 - D(G(z)))].
(1)

where x \sim P_X(x) denotes a real data sample and z \sim P_Z(z) denotes a sample from a noise distribution (often uniform or Gaussian). More forms of the GAN objective function are discussed in [28]. Though GANs have achieved remarkable applications, training them stably and quickly [29, 30] is still a challenging task, since training suffers from a strong gradient vector field rotating around a Nash equilibrium (see Figure 1). Moreover, the gradient-descent-ascent-based methods used to optimize the GAN objective tend to exhibit limit oscillatory behavior because of imaginary components in the Jacobian eigenvalues.
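The rotation can be seen directly from the eigenvalues. A minimal numpy check (assuming the scalar toy game V(θ, φ) = θφ, not the full GAN objective) shows the Jacobian of the simultaneous-gradient field has purely imaginary eigenvalues:

```python
import numpy as np

# Min-max toy game V(theta, phi) = theta * phi. The simultaneous-gradient field
# (theta descends, phi ascends) is v(theta, phi) = (phi, -theta).
def v(theta, phi):
    return np.array([phi, -theta])

# Jacobian of v at the Nash equilibrium (0, 0): its eigenvalues are purely
# imaginary, so plain gradient descent-ascent rotates around (0, 0).
J = np.array([[0.0, 1.0],
              [-1.0, 0.0]])
eigvals = np.linalg.eigvals(J)
print(eigvals)
```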

Figure 1.

(a): the strong rotational gradient field around the Nash equilibrium; (b): comparison of convergence behaviors among several recently proposed methods. Our ACG method clearly converges faster than the others. See Section 6.1 for more details.


The main idea of this work is to reduce the computational cost of the Hessian matrix in consensus optimization and SGA. Motivated by [15, 16] and [31], we propose a novel Adaptive Composite Gradient (ACG) method, which can calibrate and accelerate traditional methods such as SGD, RMSProp, Adam, consensus optimization, and SGA. The ACG method exploits three kinds of information in the iteration process: gradient information from past iteration steps, adaptive and predictive information for future iteration steps, and the projection information of the current iteration step mentioned in [16]. We fuse this information into a composite gradient for the update scheme of our algorithm, which can be deployed in deep networks and used to train GANs. The main contributions of this paper are as follows:

  • We propose a novel adaptive composite gradient (ACG) method that alleviates cyclic behaviors in training GANs. Meanwhile, ACG reduces the computational consumption of gradients and improves convergence speed.

  • For purely adversarial bilinear game problems, we prove that the ACG method is linearly convergent under suitable conditions. In addition, we extend the ACG method to three-player game problems and verify its effectiveness and efficiency on toy models.

  • Comprehensive experiments are conducted to test the effectiveness of training GANs and Deep Convolutional Generative Adversarial Networks. The proposed method obtains competitive results compared with state-of-the-art (SoTA) methods.

There are several distinctive approaches to improving the training of GANs, but each shows limitations in some cases. Some depend closely on prior assumptions and become invalid when those assumptions fail; others must pay the computational cost of the Hessian in their dynamics. We discuss the related research in this section.

Symplectic Gradient Adjustment (SGA) [32]: Compared with traditional games, general games do not constrain the players' parameter sets or require the loss functions to be convex. A general game can be decomposed into a potential game and a Hamiltonian game [32]. To introduce our method, we first review the SGA method as follows.

Definition 2.1 A game is a set of players [p] = \{1, 2, \ldots, n\} together with twice continuously differentiable loss functions \{\ell_i : \mathbb{R}^d \to \mathbb{R}\}_{i=1}^n. The players' parameters are w = (w_1, w_2, \ldots, w_n) \in \mathbb{R}^d with w_i \in \mathbb{R}^{d_i}, where \sum_i d_i = d. The i-th player controls w_i.

We use g(w) to denote the simultaneous gradient, i.e., the gradient of the losses with respect to each player's own parameters: g(w) = (\nabla_{w_1}\ell_1, \nabla_{w_2}\ell_2, \ldots, \nabla_{w_n}\ell_n). For a bilinear game, the losses must satisfy \sum_{i=1}^n \ell_i = 0, for example:

\ell_1(x,y) = x^\top C y \quad \text{and} \quad \ell_2(x,y) = -x^\top C y.
(2)

This kind of game has a Nash equilibrium at (x, y) = (0, 0). The simultaneous gradient g(x, y) = (Cy, -C^\top x) rotates around the Nash equilibrium, as shown in Fig. 6.

We can derive the Hessian of an n-player game from the simultaneous gradient g(w). The Hessian is H(w) = \nabla_w \cdot g(w)^\top = \left(\frac{\partial g_i(w)}{\partial w_j}\right)_{i,j=1}^d, where H \in \mathbb{R}^{d\times d}. In matrix form, the Hessian is as follows:

H(w) = \begin{bmatrix} \nabla^2_{w_1}\ell_1 & \nabla_{w_1}\nabla_{w_2}\ell_1 & \cdots & \nabla_{w_1}\nabla_{w_n}\ell_1 \\ \nabla_{w_2}\nabla_{w_1}\ell_2 & \nabla^2_{w_2}\ell_2 & \cdots & \nabla_{w_2}\nabla_{w_n}\ell_2 \\ \vdots & \vdots & \ddots & \vdots \\ \nabla_{w_n}\nabla_{w_1}\ell_n & \nabla_{w_n}\nabla_{w_2}\ell_n & \cdots & \nabla^2_{w_n}\ell_n \end{bmatrix}.
(3)

Applying the generalized Helmholtz decomposition (Lemma 1 in [32]) to the above Hessian of the game, we have H(w) = S(w) + A(w). Balduzzi et al. (2018) [32] pointed out that a game is a potential game if A(w) ≡ 0 and a Hamiltonian game if S(w) ≡ 0. Potential games and Hamiltonian games are both well studied and easy to solve. Since the cyclic behavior around the Nash equilibrium is caused by the simultaneous gradient, Balduzzi et al. [32] proposed the Symplectic Gradient Adjustment method, as follows:

g_\lambda := g + \lambda A^\top g.
(4)

where A is the antisymmetric part from the Helmholtz decomposition of the Hessian. g_\lambda replaces the gradient in the iterates of GDA-based methods, and training GANs with g_\lambda can alleviate the cyclic behaviors. However, if we consider the players in a bilinear game as GANs, the SGA algorithm must pay the expensive computational cost of the Hessian, which lowers its efficiency.
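A minimal sketch of the adjustment on the scalar Hamiltonian game ℓ1 = xy, ℓ2 = −xy, where S = 0 and A is the whole Hessian. Following Balduzzi et al., the transpose Aᵀ is applied so the adjusted field points toward the equilibrium; the step size, λ, and iteration count below are illustrative choices:

```python
import numpy as np

# Scalar Hamiltonian game l1 = x*y, l2 = -x*y: the simultaneous gradient is
# g = (y, -x) and the Hessian H = [[0, 1], [-1, 0]] is purely antisymmetric,
# so S = 0 and A = H in the Helmholtz decomposition.
A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])

def sga_step(w, lr=0.05, lam=1.0):
    g = np.array([w[1], -w[0]])
    g_adj = g + lam * (A.T @ g)   # adjusted gradient; A^T aligns with descent
    return w - lr * g_adj

w = np.array([1.0, 1.0])
for _ in range(200):
    w = sga_step(w)
print(np.linalg.norm(w))  # the adjusted dynamics spiral into the equilibrium
```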

Centripetal Acceleration [15]: The simultaneous gradient exhibits cyclic behaviors around the Nash equilibrium. Hamiltonian games obey a conservation law under these gradient-descent-based methods, so the cyclic behavior can be regarded as uniform circular motion. As is well known, the centripetal acceleration in uniform circular motion points to the center of the circle; this characteristic can be used to modify the direction of the simultaneous gradient vector field and alleviate the cyclic behaviors. Based on these observations, Peng et al. (2020) [15] proposed the centripetal acceleration methods, which come in two versions, Simultaneous Centripetal Acceleration (Grad-SCA) and Alternating Centripetal Acceleration (Grad-ACA), used to train GANs. Next, we review the centripetal acceleration methods.

Given a bilinear game, its losses are \ell_1(\theta,\phi) and \ell_2(\theta,\phi), corresponding to player 1 and player 2. The parameter space is \theta \times \phi, where \theta, \phi \in \mathbb{R}^n. Player 1 controls the parameter \theta and tries to minimize the payoff function \ell_1, while player 2 controls the parameter \phi and tries to minimize the payoff function \ell_2 in the non-cooperative situation. The game is a process of the two players adjusting their parameters to find a local Nash equilibrium, which satisfies the following two requirements:

\theta^* \in \arg\min_\theta \ell_1(\theta,\phi^*) \quad \text{and} \quad \phi^* \in \arg\min_\phi \ell_2(\theta^*,\phi).
(5)

The centripetal acceleration methods require that the two-player game is differentiable. Then, the above two payoff functions can be combined into a joint payoff function because of the zero-sum property of the game:

\min_\theta \max_\phi V(\theta,\phi).
(6)

Eq. (1) yields such a two-player game, which can be rewritten as V(\theta,\phi). The problem becomes finding a local Nash equilibrium:

\theta^* \in \arg\min_\theta V(\theta,\phi^*) \quad \text{and} \quad \phi^* \in \arg\max_\phi V(\theta^*,\phi),
(7)

where

V(\theta,\phi) = \mathbb{E}_{x\sim P_X(x)}[\log D(x;\phi)] + \mathbb{E}_{z\sim P_Z(z)}[\log(1 - D(G(z;\theta);\phi))].
(8)

To introduce the centripetal acceleration methods, we first review the simultaneous gradient descent method in [33]:

\theta_{t+1} = \theta_t - \alpha\nabla_\theta V(\theta_t,\phi_t), \quad \phi_{t+1} = \phi_t + \alpha\nabla_\phi V(\theta_t,\phi_t).
(9)

The alternating version based on simultaneous gradient descent is

\theta_{t+1} = \theta_t - \alpha\nabla_\theta V(\theta_t,\phi_t), \quad \phi_{t+1} = \phi_t + \alpha\nabla_\phi V(\theta_{t+1},\phi_t),
(10)

where \alpha is the learning rate. The centripetal acceleration methods directly use a centripetal-acceleration term to adjust simultaneous gradient descent. Gradient descent with simultaneous centripetal acceleration is:

\theta_{t+1} = \theta_t - \alpha_1\nabla_\theta V(\theta_t,\phi_t) - \beta_1[\nabla_\theta V(\theta_t,\phi_t) - \nabla_\theta V(\theta_{t-1},\phi_{t-1})], \\
\phi_{t+1} = \phi_t + \alpha_2\nabla_\phi V(\theta_t,\phi_t) + \beta_2[\nabla_\phi V(\theta_t,\phi_t) - \nabla_\phi V(\theta_{t-1},\phi_{t-1})].
(11)

We can also obtain the gradient descent with the alternating centripetal acceleration method:

\theta_{t+1} = \theta_t - \alpha_1\nabla_\theta V(\theta_t,\phi_t) - \beta_1[\nabla_\theta V(\theta_t,\phi_t) - \nabla_\theta V(\theta_{t-1},\phi_{t-1})], \\
\phi_{t+1} = \phi_t + \alpha_2\nabla_\phi V(\theta_{t+1},\phi_t) + \beta_2[\nabla_\phi V(\theta_{t+1},\phi_t) - \nabla_\phi V(\theta_t,\phi_{t-1})],
(12)

where \alpha_1, \beta_1, \alpha_2, \beta_2 in the above two versions of the centripetal acceleration methods are hyperparameters. The centripetal acceleration methods can calibrate other gradient-based methods. The intuitive understanding of the centripetal acceleration method is shown in Figure 2.
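A minimal sketch of Grad-SCA, Eq. (11), on the toy game V(θ, φ) = θφ, taking shared hyperparameters α1 = α2 = α and β1 = β2 = β; the parameter values and iteration count are illustrative:

```python
# Grad-SCA, Eq. (11), on V(theta, phi) = theta * phi, with alpha1 = alpha2 and
# beta1 = beta2 for simplicity. dV/dtheta = phi and dV/dphi = theta.
alpha, beta = 0.1, 0.3
theta, phi = 1.0, 1.0
g_theta_prev, g_phi_prev = phi, theta   # duplicate the first gradients to start

for _ in range(300):
    g_theta, g_phi = phi, theta         # gradients at the current point
    theta = theta - alpha * g_theta - beta * (g_theta - g_theta_prev)
    phi = phi + alpha * g_phi + beta * (g_phi - g_phi_prev)
    g_theta_prev, g_phi_prev = g_theta, g_phi

print(abs(theta), abs(phi))  # both shrink toward the Nash equilibrium (0, 0)
```

The centripetal term [grad_t − grad_{t−1}] is what damps the rotation of plain gradient descent ascent.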

Figure 2.

Left: the basic intuition of centripetal acceleration methods in [29]. Right: the basic intuition of PPCA methods in [12].


Predictive Projection Centripetal Acceleration (PPCA) [16]: The centripetal acceleration methods use information from the last iteration step to update (\theta_{t+1}, \phi_{t+1}). However, some methods use predictive-step information for this update, such as MPM, OMD, and OGDA. MPM was introduced by Liang et al. (2019) [34], and its dynamics are as follows:

predictive step: \theta_{t+\frac12} = \theta_t - \alpha\nabla_\theta V(\theta_t,\phi_t), \quad \phi_{t+\frac12} = \phi_t + \alpha\nabla_\phi V(\theta_t,\phi_t);
(13a)
gradient step: \theta_{t+1} = \theta_t - \beta\nabla_\theta V(\theta_{t+\frac12},\phi_{t+\frac12}), \quad \phi_{t+1} = \phi_t + \beta\nabla_\phi V(\theta_{t+\frac12},\phi_{t+\frac12}).
(13b)

Motivated by MPM and the centripetal acceleration methods, Li et al. (2020) [16] proposed the predictive projection centripetal acceleration methods. They also approximate the cyclic behavior around a Nash equilibrium as uniform circular motion, but unlike Grad-SCA and Grad-ACA, they construct the centripetal-acceleration term from the predictive step instead of the last step. Meanwhile, they argue that the centripetal-acceleration term only points to the matched center approximately; to make it point to the center precisely, they proposed the projection centripetal acceleration methods. PPCA can directly modify gradient descent ascent and alternating gradient descent ascent, and can be understood intuitively from Figure 2. The dynamics of predictive projection centripetal acceleration are:

predictive step: \theta_{t+\frac12} = \theta_t - \gamma\nabla_\theta V(\theta_t,\phi_t); \quad \phi_{t+\frac12} = \phi_t + \gamma\nabla_\phi V(\theta_t,\phi_t);
(14a)
gradient step: \begin{pmatrix}\theta_{t+1}\\ \phi_{t+1}\end{pmatrix} = \begin{pmatrix}\theta_t\\ \phi_t\end{pmatrix} + \alpha\bar{V}(\theta_t,\phi_t) + \beta\left\{[\bar{V}(\theta_{t+\frac12},\phi_{t+\frac12}) - \bar{V}(\theta_t,\phi_t)] - [\bar{V}(\theta_{t+\frac12},\phi_{t+\frac12}) - \bar{V}(\theta_t,\phi_t)] \odot \bar{V}(\theta_t,\phi_t)\right\}.
(14b)

where \bar{V}(\theta_t,\phi_t) = (-\nabla_\theta V(\theta_t,\phi_t), \nabla_\phi V(\theta_t,\phi_t)) is the signed gradient vector at time t, and [\bar{V}(\theta_{t+\frac12},\phi_{t+\frac12}) - \bar{V}(\theta_t,\phi_t)] \odot \bar{V}(\theta_t,\phi_t) is the projection of the centripetal-acceleration term \bar{V}(\theta_{t+\frac12},\phi_{t+\frac12}) - \bar{V}(\theta_t,\phi_t) onto the vector \bar{V}(\theta_t,\phi_t).

Li et al. (2020) proposed two versions of the PPCA method, requiring the coefficient matrix to be full rank in bilinear games under the specified situation (Lemma 3.2 in [16]). The PPCA method for the bilinear game is

predictive step: \theta_{t+\frac12} = \theta_t - \gamma\nabla_\theta V(\theta_t,\phi_t); \quad \phi_{t+\frac12} = \phi_t + \gamma\nabla_\phi V(\theta_t,\phi_t);
(15a)
gradient step: \theta_{t+1} = \theta_t - \alpha\nabla_\theta V(\theta_t,\phi_t) - \beta[\nabla_\theta V(\theta_{t+\frac12},\phi_{t+\frac12}) - \nabla_\theta V(\theta_t,\phi_t)]; \quad \phi_{t+1} = \phi_t + \alpha\nabla_\phi V(\theta_t,\phi_t) + \beta[\nabla_\phi V(\theta_{t+\frac12},\phi_{t+\frac12}) - \nabla_\phi V(\theta_t,\phi_t)].
(15b)

We can also obtain the alternating PPCA formula as follows:

predictive step: \theta_{t+\frac12} = \theta_t - \gamma\nabla_\theta V(\theta_t,\phi_t); \quad \phi_{t+\frac12} = \phi_t + \gamma\nabla_\phi V(\theta_t,\phi_t);
(16a)
gradient step: \theta_{t+1} = \theta_t - \alpha\nabla_\theta V(\theta_t,\phi_t) - \beta[\nabla_\theta V(\theta_{t+\frac12},\phi_{t+\frac12}) - \nabla_\theta V(\theta_t,\phi_t)]; \quad \phi_{t+1} = \phi_t + \alpha\nabla_\phi V(\theta_{t+1},\phi_t) + \beta[\nabla_\phi V(\theta_{t+\frac12},\phi_{t+\frac12}) - \nabla_\phi V(\theta_t,\phi_t)];
(16b)

where \gamma, \alpha, and \beta are all hyperparameters.
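The bilinear form of PPCA, Eq. (15), can be sketched on the same toy game V(θ, φ) = θφ; the hyperparameter values are illustrative:

```python
# PPCA for the bilinear game, Eq. (15), on V(theta, phi) = theta * phi.
gamma, alpha, beta = 1.0, 0.1, 0.3
theta, phi = 1.0, 1.0

for _ in range(100):
    g_t, g_p = phi, theta                      # dV/dtheta, dV/dphi at time t
    theta_half = theta - gamma * g_t           # predictive step, Eq. (15a)
    phi_half = phi + gamma * g_p
    g_t_half, g_p_half = phi_half, theta_half  # gradients at the predicted point
    theta = theta - alpha * g_t - beta * (g_t_half - g_t)   # gradient step, Eq. (15b)
    phi = phi + alpha * g_p + beta * (g_p_half - g_p)

print(abs(theta), abs(phi))
```

On this toy game the per-step map contracts strongly, so the iterates reach the Nash equilibrium in a few dozen steps.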

Although the methods mentioned above have achieved significant results in training GANs, some of them require high computational cost and memory, and the rest depend closely on the approximately uniform circular-motion assumption; if practical numerical experiments do not satisfy this approximation, those methods become invalid. In contrast, our adaptive composite gradient method reduces the computational cost and overcomes the limitation brought by the circular-motion approximation.

3.1 Limitation Analysis

Hessian-based methods such as consensus optimization and SGA incur high computational costs when optimizing game problems. The centripetal acceleration algorithm reduces these costs but depends on the approximately uniform circular-motion assumption, shown in Figure 3. PPCA is an improved version of the centripetal acceleration algorithm, shown in Figure 4(b). However, PPCA needs a full-rank coefficient matrix, otherwise its projection term becomes zero (Lemma 3.2 in [16]). The proposed ACG is motivated by two observations. First, we consider the cyclic behavior as a general, not necessarily uniform, circular-motion process; therefore, similar to the centripetal acceleration method, we modify directions by adding the projection of the centripetal-acceleration term at time t. Second, A3DMM provides the idea of using past iterations to predict future iterations, because the trajectory of the sequence Z_k is either straight or spiral, and the cyclic behavior is also approximately a spiral. Motivated by these two aspects, we propose the novel adaptive composite gradient (ACG) method to alleviate cyclic behavior in training GANs. ACG reduces computational cost and accelerates iteration by predicting future iterative information, which is why we call it a semi-gradient-free method.

Figure 3.

The limitations of centripetal acceleration method. Left: |∇Vt+δt|>|∇Vt|, Right |∇Vt+δt|<|∇Vt|.

Figure 4.

(a): The spiral trajectory of Zk. (b): Degenerated PPCA method in case |b| = 0. The degenerated PPCA is the same as centripetal acceleration method, if δt = 1/2.


3.2 Motivational Theory

Our idea is motivated by A3DMM [31], so we review the A3DMM method here. Consider an optimization problem

\min_{x\in\mathbb{R}^n,\, y\in\mathbb{R}^m} R(x) + J(y) \quad \text{s.t.} \quad Ax + By = b,
(17)

where the following essential assumptions hold:

  • R ∈ Γ0(ℝn) and J ∈ Γ0(ℝm) are proper convex and lower semi-continuous functions.

  • A, B are injective linear operators.

  • ri (dom (R) ∩ dom (J)) ≠ Ø and the set of minimizers is non-empty.

To derive the iteration scheme, consider the augmented Lagrangian and rewrite the optimization problem, which reads

\mathcal{L}(x,y;\Psi) \stackrel{\text{def}}{=} R(x) + J(y) + \langle\Psi, Ax + By - b\rangle + \frac{\gamma}{2}\|Ax + By - b\|^2,
(18)

where \gamma > 0 and \Psi is the Lagrange multiplier; then we have the iteration forms:

x_k = \arg\min_{x\in\mathbb{R}^n} R(x) + \frac{\gamma}{2}\left\|Ax + By_{k-1} - b + \frac{1}{\gamma}\Psi_{k-1}\right\|^2,
(19a)
y_k = \arg\min_{y\in\mathbb{R}^m} J(y) + \frac{\gamma}{2}\left\|Ax_k + By - b + \frac{1}{\gamma}\Psi_{k-1}\right\|^2,
(19b)
\Psi_k = \Psi_{k-1} + \gamma(Ax_k + By_k - b).
(19c)

We can rewrite the above iteration into the following form by introducing a new variable Z_k \stackrel{\text{def}}{=} \Psi_{k-1} + \gamma A x_k:

x_k = \arg\min_{x\in\mathbb{R}^n} R(x) + \frac{\gamma}{2}\left\|Ax - \frac{1}{\gamma}(Z_{k-1} - 2\Psi_{k-1})\right\|^2,
(20a)
Z_k = \Psi_{k-1} + \gamma A x_k,
(20b)
y_k = \arg\min_{y\in\mathbb{R}^m} J(y) + \frac{\gamma}{2}\left\|By + \frac{1}{\gamma}(Z_k - \gamma b)\right\|^2,
(20c)
\Psi_k = Z_k + \gamma(By_k - b).
(20d)

The trajectory of the sequence Z_k depends closely on the value of \gamma, where k \in \mathbb{N}. If a proper \gamma is selected, the eventual trajectory of Z_k is a spiral, as shown in Figure 4. Since the trajectory of Z_k has this spiral characteristic, the previous q iterations can be used to predict the future s iterations. The Z_k of ADMM can be estimated by \bar{Z}_{k,s}, which is defined as follows:

\bar{Z}_k = \mathcal{F}(Z_k, Z_{k-1}, \ldots, Z_{k-q}),
(21)

for the choice of s = 1. Given the sequence Z_{k-i}, i = 0, 1, \ldots, q+1, define v_i = Z_i - Z_{i-1} and collect the past differences v_{k-1}, v_{k-2}, \ldots, v_{k-q}, which are used to estimate v_k. Let V_{k-1} = [v_{k-1}, v_{k-2}, \ldots, v_{k-q}] \in \mathbb{R}^{n\times q} and C_k = \arg\min_{C\in\mathbb{R}^q} \|V_{k-1}C - v_k\|^2 = \|\sum_{i=1}^q C_i v_{k-i} - v_k\|^2. Then V_k C_k approximates v_{k+1}, i.e., V_k C_k \approx v_{k+1}, and we can compute \bar{Z}_{k+1} = Z_k + V_k C_k \approx Z_{k+1}. By iterating s times, we obtain \bar{Z}_{k,s} \approx Z_{k+s}. This method was proposed by Poon et al. [31].
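The least-squares prediction behind Eq. (21) can be sketched as follows; the helper `extrapolate` and the test matrix M (a contraction-plus-rotation generating a spiral) are our own illustrative constructions, not code from [31]:

```python
import numpy as np

# Fit coefficients c by least squares on the past difference vectors
# v_i = Z_i - Z_{i-1}, then extrapolate Z_{k+1} ~ Z_k + V_k c (one step, s = 1).
def extrapolate(Z_hist, q):
    """Z_hist: list of iterates Z_{k-q-1}, ..., Z_k (length q + 2)."""
    V = np.diff(np.stack(Z_hist), axis=0).T           # columns v_{k-q}, ..., v_k
    V_past, v_k = V[:, :-1], V[:, -1]
    c, *_ = np.linalg.lstsq(V_past, v_k, rcond=None)  # c = argmin ||V_past c - v_k||
    v_next = V[:, 1:] @ c                             # shift window, predict v_{k+1}
    return Z_hist[-1] + v_next

# Sanity check on a linearly converging spiral Z_{k+1} = M Z_k.
M = 0.9 * np.array([[np.cos(0.3), -np.sin(0.3)],
                    [np.sin(0.3),  np.cos(0.3)]])
Z = [np.array([1.0, 0.0])]
for _ in range(6):
    Z.append(M @ Z[-1])
Z_pred = extrapolate(Z, q=4)
print(np.linalg.norm(Z_pred - M @ Z[-1]))  # error against the true next iterate
```

Because the spiral here is exactly linear, the fitted coefficients transfer from the past window to the next step, so the prediction matches the true iterate up to floating-point error.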

To ease understanding of our method, we fix some notational conventions throughout the paper. Let a \odot b denote the projection of a onto b, where a, b \in \mathbb{R}^n and \odot denotes the projection operation between two vectors. w_t^i denotes the parameter controlled by the i-th player at time t, and W_t = (w_t^1, w_t^2, \ldots, w_t^n). We use \{\ell_i : \mathbb{R}^d \to \mathbb{R}\}_{i=1}^n to denote the losses of the n players, as in Definition 2.1. We then obtain the payoff vector of the n players at time t.

We consider a bilinear game problem with the following form

\ell_1(w^1,w^2) = (w^1)^\top A w^2, \quad \ell_2(w^1,w^2) = -(w^1)^\top A w^2.
(22)

Desiderata. This two-player game has a Nash equilibrium, which must satisfy:

D1. The two losses satisfy \sum_{i=1}^2 \ell_i(w^1,w^2) \equiv 0;

D2. Each \ell_i is differentiable over the parameter space \Omega(w^1) \times \Omega(w^2), where \Omega(w^1) \times \Omega(w^2) \subseteq \mathbb{R}^n \times \mathbb{R}^n.

Player 1 holds the parameter w^1 and tries to minimize the loss \ell_1, while player 2 holds the parameter w^2 and tries to minimize the loss \ell_2. From D1 we get \ell_1 = -\ell_2, so formula (22) can be rewritten as

\min_{w^1\in\Omega(w^1)} \max_{w^2\in\Omega(w^2)} L(w^1,w^2).
(23)

The dynamics of the gradient-descent-ascent-based method are

w^1_{t+1} = w^1_t - \alpha\nabla_{w^1}L(w^1_t,w^2_t), \quad w^2_{t+1} = w^2_t + \alpha\nabla_{w^2}L(w^1_t,w^2_t).
(24)

The ACG method composes its gradient from three parts. First, we introduce the predictive part. In this section, W_t = (w_t^1, w_t^2) is the parameter vector at time t. Similar to A3DMM, we use the W of the previous q iterations to predict the future s iterations, denoted \bar{W}_{t,s}. We then get the following formula for \bar{W}_{t,s}:

\bar{W}_{t,s} = \mathcal{F}(W_t, W_{t-1}, \ldots, W_{t-q}),
(25)

for the value of s = 1. Define v_i = W_i - W_{i-1}, where W_i is from the sequence \{W_{k-i}\}_{i=0}^q. We use the past v_{k-1}, v_{k-2}, \ldots, v_{k-q} to approximate the latest v_k; then c_k = \arg\min_{c\in\mathbb{R}^q} \|V_{k-1}c - v_k\|^2. Finally, \bar{W}_{k,1} = W_k + V_k c_k \approx W_{k+1}; by looping s times, we get \bar{W}_{k,s} \approx W_{k+s}. The second and third parts of our ACG method are \nabla L(w_t^1, w_t^2) and the projection of the centripetal-acceleration term. The dynamics of the proposed ACG method are

composite gradients: G_{w^1} = \nabla_{w^1}L(w^1_t,w^2_t) + \frac{\beta_1}{\alpha}(a_1 - b_1) + \frac{\beta_2}{\alpha}\bar{w}^1_{t+s}, \quad G_{w^2} = \nabla_{w^2}L(w^1_t,w^2_t) + \frac{\beta_1}{\alpha}(a_2 - b_2) + \frac{\beta_2}{\alpha}\bar{w}^2_{t+s};
(26a)
gradient step: w^1_{t+s} = w^1_t - \alpha G_{w^1}, \quad w^2_{t+s} = w^2_t + \alpha G_{w^2},
(26b)

where \nabla_{w^i}L(w^1_t,w^2_t) denotes the partial derivative with respect to w^i of \ell_i at time t, a_i denotes \nabla_{w^i}L(w^1_t,w^2_t) - \nabla_{w^i}L(w^1_{t-1},w^2_{t-1}), and b_i denotes a_i \odot \nabla_{w^i}L(w^1_{t-1},w^2_{t-1}), the projection of a_i onto the vector \nabla_{w^i}L(w^1_{t-1},w^2_{t-1}). The basic intuition of our proposed method is shown in Figure 5.
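A single composite-gradient evaluation, Eq. (26a), can be sketched as follows. The predictive term w̄ is set to zero here (in Algorithm 1 it only enters every k-th step), and the 2-d game matrix, iterates, and previous gradient are illustrative assumptions; the check at the end confirms that a − b is the component of a orthogonal to the previous gradient:

```python
import numpy as np

def proj(a, b):
    """a (.) b: projection of a onto b, i.e. gamma * b with gamma = <a,b>/<b,b>."""
    return (np.dot(a, b) / np.dot(b, b)) * b

# Illustrative 2-d bilinear game L(w1, w2) = w1^T A w2; A, the iterates and the
# previous gradient are made-up values, and the predictive term w_bar is omitted.
A = np.array([[0.0, -1.0],
              [1.0, 0.0]])
alpha, beta1 = 0.1, 0.3
w1, w2 = np.array([1.0, 0.5]), np.array([0.5, -1.0])
g1_prev = np.array([0.4, -0.9])

g1 = A @ w2                              # grad_{w1} L at time t
a1 = g1 - g1_prev                        # centripetal-acceleration term a_1
b1 = proj(a1, g1_prev)                   # b_1 = a_1 projected onto previous grad
G1 = g1 + (beta1 / alpha) * (a1 - b1)    # composite gradient, Eq. (26a)
w1_next = w1 - alpha * G1                # gradient step, Eq. (26b)

print(np.dot(a1 - b1, g1_prev))          # a_1 - b_1 is orthogonal to g1_prev
```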

Figure 5.

The basic intuition of our proposed Adaptive Composite Gradient method. To illustrate our approach, we chose s = 20 in this figure.


For clarity, we present the scheme of the proposed adaptive composite gradient method in Algorithm 1, applied to a two-player game problem. Note that the ACG method can calibrate any optimizer based on gradient descent ascent. Meanwhile, the ACG method extends to n-player game problems. Given an n-player game, let g(W_t) be the gradient of the losses of all players at time t; note that all loss functions must be differentiable. We adopt the same procedure as in Algorithm 1 to compute \bar{W}_t. The dynamics of the ACG method for n players read

Algorithm 1.

ACG: Adaptive Composite Gradients method for the bilinear game.

\bar{W}_{t+s} = \mathcal{F}(W_t, W_{t-1}, \ldots, W_{t-q}),
(27)
composite gradients: G_W = g(W_t) + \frac{\beta_1}{\alpha}(a - b) + \frac{\beta_2}{\alpha}\bar{W}_{t+s},
(28a)
gradient step: W_{t+s} = W_t - \alpha G_W,
(28b)

where a denotes g(W_t) - g(W_{t-1}) and b denotes a \odot g(W_{t-1}), the projection of a onto g(W_{t-1}).

Remark 4.1 The value of k can be controlled in both Algorithm 1 and Algorithm 2. Letting k = q + i with i \in \mathbb{N}^+, we can set different acceleration ratios of the algorithms by adjusting the values of k and s.

Algorithm 2.

ACG: Adaptive Composite Gradients method for the n-player game.


In Algorithm 1 and Algorithm 2, there is no need to calculate the gradient at every iteration because of \bar{W}_{t+s}. Therefore, we also call ACG a semi-gradient-free method, whose merit is that it reduces computational cost and converges quickly. \beta_1 and \beta_2 can be used to control the convergence speed in our algorithms.

5.1 The Convergence Analysis for the Bilinear Game

In this subsection, we mainly discuss the convergence of Adaptive Composite Gradient Method in the bilinear game, which reads

\min_{\theta\in\mathbb{R}^d} \max_{\phi\in\mathbb{R}^d} \theta^\top A\phi + \theta^\top B + C^\top\phi, \quad A\in\mathbb{R}^{d\times d}, \; B, C\in\mathbb{R}^d.
(29)

Any local Nash equilibrium (\theta^*, \phi^*) of the bilinear game satisfies the following conditions:

A\phi^* + B = 0,
(30a)
A^\top\theta^* + C = 0.
(30b)

The local Nash equilibrium exists if and only if the ranks of A and A^\top equal the dimensions of B and C. In this way, without loss of generality, we can shift (\theta, \phi) to (\theta - \theta^*, \phi - \phi^*), which rewrites the bilinear game (29) as:

\min_{\theta\in\mathbb{R}^d} \max_{\phi\in\mathbb{R}^d} \theta^\top A\phi, \quad A\in\mathbb{R}^{d\times d}.
(31)

Before analyzing the convergence property of the Adaptive Composite Gradient Method in this situation, we introduce some essential theorems and propositions.

Theorem 5.1 Suppose F \in \mathbb{R}^{d\times d} defines the iterative system x_{k+1} = F x_k. If F is nonsingular and its spectral radius satisfies \rho(F) < 1, then x_k converges to 0 linearly.

Theorem 5.2 (OMD) Consider a bilinear game V(\theta, \phi) = \theta^\top A\phi, where A \in \mathbb{R}^{d\times d}. Assume A is full rank. Then the following dynamics,

\theta_{t+1} = \theta_t - 2\eta\nabla_\theta V(\theta_t,\phi_t) + \eta\nabla_\theta V(\theta_{t-1},\phi_{t-1}), \\
\phi_{t+1} = \phi_t + 2\eta\nabla_\phi V(\theta_t,\phi_t) - \eta\nabla_\phi V(\theta_{t-1},\phi_{t-1}),
(32)

with the learning rate

\eta = \frac{1}{2\sqrt{2\lambda_{\max}(AA^\top)}},

obtain an \varepsilon-minimizer such that (\theta_T,\phi_T) \in B_2(\varepsilon), provided

T \ge T_{\mathrm{OMD}} := \frac{16\,\lambda_{\max}(A^\top A)}{\lambda_{\min}(A^\top A)} \log\!\left(\frac{4\sqrt{2}\,\delta}{\varepsilon}\right),

under the assumption that \|(\theta_0,\phi_0)\|, \|(\theta_1,\phi_1)\| \le \delta.

To discuss the convergence of the ACG method for the bilinear game, we divide the analysis into three parts (two cases). Without loss of generality, let t denote the iteration step and k the number of previous steps. The convergence property of Algorithm 1 is as follows.

Case 1. mod(t,k) \neq 0, or mod(t,k) = 0 and \rho(C_k) \ge 1. The ACG method adopts the dynamics:

w^1_{t+1} = w^1_t - \alpha\nabla_{w^1}L(w^1_t,w^2_t) + \beta\nabla_{w^1}L(w^1_{t-1},w^2_{t-1}); \\
w^2_{t+1} = w^2_t + \alpha\nabla_{w^2}L(w^1_t,w^2_t) - \beta\nabla_{w^2}L(w^1_{t-1},w^2_{t-1}).
(33)

Taking \alpha = 2\eta and \beta = \eta in Case 1, the dynamics scheme reduces to OMD, which can be written as:

w^1_{t+1} = w^1_t - 2\eta\nabla_{w^1}L(w^1_t,w^2_t) + \eta\nabla_{w^1}L(w^1_{t-1},w^2_{t-1}); \\
w^2_{t+1} = w^2_t + 2\eta\nabla_{w^2}L(w^1_t,w^2_t) - \eta\nabla_{w^2}L(w^1_{t-1},w^2_{t-1}).
(34)

Theorem 5.2 gives the learning-rate condition for OMD and its exponential convergence; the convergence of OMD can be found in [34] (Theorem 3).
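A scalar sanity check of the OMD dynamics, Eq. (32), on V(θ, φ) = θφ (so λmax(AAᵀ) = 1); the learning-rate constant follows our reading of Theorem 5.2 and is an assumption, though the iteration converges for any sufficiently small η:

```python
import numpy as np

# OMD, Eq. (32), on the scalar bilinear game V(theta, phi) = theta * phi,
# with eta = 1 / (2 * sqrt(2 * lambda_max)) and lambda_max = 1.
eta = 1.0 / (2.0 * np.sqrt(2.0))
theta, phi = 1.0, 1.0
theta_prev, phi_prev = 1.0, 1.0   # (theta_0, phi_0) = (theta_1, phi_1)

for _ in range(300):
    g_t, g_p = phi, theta                       # current gradients
    g_t_prev, g_p_prev = phi_prev, theta_prev   # previous gradients
    theta_new = theta - 2 * eta * g_t + eta * g_t_prev
    phi_new = phi + 2 * eta * g_p - eta * g_p_prev
    theta_prev, phi_prev = theta, phi
    theta, phi = theta_new, phi_new

print(abs(theta), abs(phi))
```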

Case 2. mod(t,k) = 0 and \rho(C_k) < 1. From Algorithm 1, we first compute \bar{W}_{t+s} by:

\bar{W}^1_{t+s} = \mathcal{F}(w^1_t, w^1_{t-1}, \ldots, w^1_{t-q}), \quad \bar{W}^2_{t+s} = \mathcal{F}(w^2_t, w^2_{t-1}, \ldots, w^2_{t-q}).
(35)

Using the fixed-point formulation of ADMM, (35) can be written in the unified form \bar{W}_t = \varepsilon(\bar{W}_{t-1}); letting V_t c_t = \sigma_t, we have \bar{W}_t = \varepsilon(\bar{W}_{t-1} + \sigma_t). We obtain the convergence of (35) iff \sigma_t converges to 0. The convergence of (35) is based on the convergence of the inexact Krasnosel'skii-Mann fixed-point iteration in [31] (Proposition 5.34). The detailed convergence analysis of \bar{W}_t = \varepsilon(\bar{W}_{t-1} + \sigma_t) is discussed in [31] (Proposition 4.2).

We then discuss the convergence of the composite-gradient update scheme, written as follows:

w^1_{t+s} = w^1_t - \alpha G_{w^1}, \quad w^2_{t+s} = w^2_t + \alpha G_{w^2},
(36)

where the Gw1,Gw2 are defined as:

G_{w^1} = \nabla_{w^1}L(w^1_t,w^2_t) + \frac{\beta_1}{\alpha}(a_1 - b_1) + \frac{\beta_2}{\alpha}\bar{w}^1_{t+s}; \quad G_{w^2} = \nabla_{w^2}L(w^1_t,w^2_t) + \frac{\beta_1}{\alpha}(a_2 - b_2) + \frac{\beta_2}{\alpha}\bar{w}^2_{t+s}.
(37)

Proposition 5.3 Given any two vectors a and b, the projection of vector b onto vector a can be denoted \gamma a for some \gamma \in \mathbb{R}.

According to Proposition 5.3, in our dynamics of the bilinear game the composite-gradient update scheme reduces to

\theta_{t+1} = \theta_t - (\alpha+\beta_1)A\phi_t + \beta_1(1+\gamma)A\phi_{t-1} - \beta_2\bar{W}_{\theta_t}; \\
\phi_{t+1} = \phi_t + (\alpha+\beta_1)A^\top\theta_t - \beta_1(1+\gamma)A^\top\theta_{t-1} + \beta_2\bar{W}_{\phi_t}.
(38)

We can obtain the iterative matrix as:

F := \begin{bmatrix} I_d & -(\alpha+\beta_1)A & 0 & \beta_1(1+\gamma)A & -\beta_2 I_d & 0 \\ (\alpha+\beta_1)A^\top & I_d & -\beta_1(1+\gamma)A^\top & 0 & 0 & \beta_2 I_d \\ I_d & 0 & 0 & 0 & 0 & 0 \\ 0 & I_d & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & \tau I_d & 0 \\ 0 & 0 & 0 & 0 & 0 & \tau I_d \end{bmatrix},
(39)

where τ is defined by (35).

According to the iterative matrix, it is easy to see that [\theta_{t+1}, \phi_{t+1}, \theta_t, \phi_t, \bar{W}_{\theta_{t+1}}, \bar{W}_{\phi_{t+1}}]^\top = F[\theta_t, \phi_t, \theta_{t-1}, \phi_{t-1}, \bar{W}_{\theta_t}, \bar{W}_{\phi_t}]^\top, where (\theta_t, \phi_t) are generated by (38). Under the assumption of Proposition 5.4 that A is square and nonsingular, we use the well-known Theorem 5.1 to establish the linear convergence of the update scheme (38).
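Theorem 5.1 can be sanity-checked numerically; the 2 × 2 contraction-plus-rotation matrix below is an arbitrary illustration, not the block matrix of Eq. (39):

```python
import numpy as np

# For x_{k+1} = F x_k with spectral radius rho(F) < 1, the iterates converge
# to 0 linearly (Theorem 5.1). F here is an illustrative 0.9-scaled rotation.
F = 0.9 * np.array([[np.cos(0.5), -np.sin(0.5)],
                    [np.sin(0.5),  np.cos(0.5)]])
rho = max(abs(np.linalg.eigvals(F)))
x = np.array([1.0, 1.0])
for _ in range(200):
    x = F @ x
print(rho, np.linalg.norm(x))  # rho = 0.9 < 1, so the norm decays like 0.9^k
```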

Proposition 5.4 Suppose that A is square and nonsingular. Then the eigenvalues of F are the roots of the sixth-order polynomials:

(\tau-\lambda)^2\left[\lambda^2(1-\lambda)^2 + (\lambda(\alpha+\beta_1) - \beta_1(1+\gamma))^2\xi^2\right], \quad \xi^2 \in \mathrm{Sp}(A^\top A),
(40)

where Sp(·) denotes the collection of all eigenvalues.

Proposition 5.5 Suppose that A is square and nonsingular. Then \Delta_t := \|\theta_t\|^2 + \|\phi_t\|^2 + \|\theta_{t+1}\|^2 + \|\phi_{t+1}\|^2 + \|\bar{w}_{\theta_{t+1}}\|^2 + \|\bar{w}_{\phi_{t+1}}\|^2 is linearly convergent to 0 for given \gamma with \alpha and \beta_1 satisfying

0 < \alpha+\beta_1 \le \frac{1}{\sqrt{\lambda_{\max}(A^\top A)}}, \qquad |\alpha+\beta_1| + |2\beta_1(1+\gamma)| - (\alpha+\beta_1)^2\lambda_{\min}(A^\top A) - 1 \le 0,
(41)

where \lambda_{\max}(\cdot), \lambda_{\min}(\cdot) denote the largest and smallest eigenvalues of A^\top A.

5.2 The Convergence Analysis for the n-player Game

This subsection mainly discusses the convergence of the Adaptive Composite Gradient method in the general n-player game. The problem is described as in Definition 2.1; according to Algorithm 2, the convergence analysis in the general n-player game parallels that of the bilinear game, with three parts and two cases. Before analyzing the convergence property, we introduce some basic definitions.

Definition 5.6 Suppose that f : \mathbb{R}^n \to \mathbb{R} is convex and continuously differentiable. For x, y \in \mathbb{R}^n, if the gradient of f is Lipschitz continuous with constant L such that:

0 \le f(y) - f(x) - \langle\nabla f(x), y - x\rangle \le \frac{L}{2}\|x-y\|^2,

then we say f belongs to the class \mathcal{F}_L^{1,1}. If f is strongly convex with modulus \mu > 0 such that:

\frac{\mu}{2}\|x-y\|^2 \le f(y) - f(x) - \langle\nabla f(x), y - x\rangle,

then we say f belongs to \mathcal{F}_{\mu,L}^{1,1}.

Next, we suppose that all \ell_i, i = 1, 2, \ldots, n belong to \mathcal{F}_L^{1,1}. We now give the definition of a fixed point, also called a local Nash equilibrium of the game.

Definition 5.7 W^* is a local Nash equilibrium (fixed point) if it satisfies g(W^*) = 0. We say it is stable if \nabla g(W^*) \succeq 0 and unstable if \nabla g(W^*) \preceq 0.

Theorem 5.8 [Nesterov 1983] Let f be a convex and \beta-smooth function. The well-known Nesterov's Accelerated Gradient Descent update scheme can be written as

y_{t+1} = x_t - \alpha\nabla f(x_t), \quad x_{t+1} = y_{t+1} + \beta(y_{t+1} - y_t).
(42)

Then Nesterov's Accelerated Gradient Descent satisfies

f(x_t) - f(x^*) \le \frac{2\beta\|x_1 - x^*\|^2}{t^2}.
(43)

Nesterov (1983) proposed the accelerated gradient method, which achieves the optimal O(1/t^2) convergence rate.

The convergence of the ACG method for the general n-player game is also divided into two cases. Let t denote the iteration step and k the number of previous steps. The convergence property of Algorithm 2 is as follows.

Case 1. mod(t,k) \neq 0, or mod(t,k) = 0 and \rho(C_k) \ge 1. From Algorithm 2, if t and k satisfy the conditions of Case 1, our proposed ACG method is the same as classical gradient descent, with the update scheme

W_{t+1} = W_t - \alpha g(W_t),

where \alpha is a positive step-size parameter. According to Definition 5.7, let W^* be the local Nash equilibrium and L^* = L(W^*). Based on Definition 5.6, if all \ell_i, i = 1, 2, \ldots, n belong to \mathcal{F}_L^{1,1}, then L(W_t) - L^* along \{W_t\} converges at rate O(1/t). More detailed convergence of averaged iterates with the generalized gradient-descent update scheme in convex-concave games is analyzed in [36, 37].

Case 2. mod(t,k) = 0 and \rho(C_k) < 1. According to Algorithm 2, we first compute \bar{W}_{t+s} by:

\bar{W}_{t+s} = \mathcal{F}(W_t, W_{t-1}, \ldots, W_{t-q}).
(44)

The convergence of formula (44) is the same as that of Case 2 in Section 5.1; further details on the convergence of (44) are discussed in [31] (Proposition 4.2).

In Case 2, we mainly analyze the convergence of the composite-gradient update scheme in formula (28). Before illustrating the convergence property of our proposed method, note that formula (42) can be equivalently written as

x_{t+1} = x_t - (\alpha + \alpha\beta)\nabla f(x_t) + \alpha\beta\nabla f(x_{t-1}) + \beta(x_t - x_{t-1}),
(45)

where \alpha and \beta are step-size parameters. Our proposed composite-gradient method (28) can be transformed into a form similar to (45) based on Proposition 5.3. That is,

W_{t+1} = W_t - (\alpha+\beta_1)g(W_t) + \beta_1(1+\gamma)g(W_{t-1}) - \beta_2\bar{W}_{t+s},
(46)

where \bar{W}_{t+s} plays the role of (W_t - W_{t-1}). Comparing (45) with (46), if the parameters are equivalently transformed, our proposed adaptive composite gradient method reduces to Nesterov's Accelerated Gradient (NAG) method. Nesterov (1983) gave the O(1/t^2) convergence rate for convex smooth optimization in [38], and convergence bounds for convex, non-convex, and smooth optimization are given in [39] (Theorems 1-4).
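The equivalence between Eq. (42) and Eq. (45) can be verified numerically on f(x) = x²/2; the matched initialization y₀ = x₀ − α∇f(x₀) (i.e., taking x₋₁ = x₀) is our assumption for aligning the two starting points:

```python
# Two-sequence NAG, Eq. (42), vs the single-sequence form, Eq. (45), on
# f(x) = x^2 / 2, so grad f(x) = x. Parameter values are illustrative.
grad = lambda x: x
alpha, beta = 0.1, 0.9
x0 = 1.0

x, y = x0, x0 - alpha * grad(x0)    # y_0 = x_{-1} - alpha * grad(x_{-1})
xs = [x]
for _ in range(50):
    y_next = x - alpha * grad(x)    # Eq. (42)
    x = y_next + beta * (y_next - y)
    y = y_next
    xs.append(x)

x_prev, x_cur = x0, x0
zs = [x_cur]
for _ in range(50):
    x_next = (x_cur - (alpha + alpha * beta) * grad(x_cur)
              + alpha * beta * grad(x_prev) + beta * (x_cur - x_prev))  # Eq. (45)
    x_prev, x_cur = x_cur, x_next
    zs.append(x_cur)

gap = max(abs(a - b) for a, b in zip(xs, zs))
print(gap)  # the two schemes generate the same iterates
```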

Remark 5.1 In Case 2, our proposed adaptive composite gradient method has the same convergence rate and the same convergence bounds as the NAG method, under the assumption that all \ell_i, i = 1, 2, \ldots, n belong to \mathcal{F}_L^{1,1}. We thus naturally obtain the O(1/t^2) convergence rate of the ACG method from Theorem 5.8.

6.1 Toy Functions Simulation

We tested our ACG method of Algorithm 1 and Algorithm 2 on the bilinear game and on a general game with three players, respectively. We tested the ACG method of Algorithm 1 on the following bilinear game:

\min_{\theta\in\mathbb{R}^d} \max_{\phi\in\mathbb{R}^d} \theta\phi, \quad d = 1.
(47)

It is obvious that the Nash equilibrium (stationary point) is (0, 0). We compared our ACG with several other methods, whose results are presented in Figure 6(a). As the behaviors in Figure 6 show, the Sim-GDA method diverges and the Alt-GDA method rotates around the stationary point, while the other methods all converge to the Nash equilibrium. Our proposed ACG method converges faster than the other convergent methods.
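The divergence of Sim-GDA and the bounded cycling of Alt-GDA on Eq. (47) can be reproduced in a few lines (α = 0.1, 150 steps, matching the setup of Figure 6):

```python
import numpy as np

# Sim-GDA vs Alt-GDA on min_theta max_phi theta * phi.
alpha = 0.1

theta, phi = 1.0, 1.0
for _ in range(150):
    theta, phi = theta - alpha * phi, phi + alpha * theta   # simultaneous update
sim_norm = np.hypot(theta, phi)

theta, phi = 1.0, 1.0
for _ in range(150):
    theta = theta - alpha * phi                             # alternating update
    phi = phi + alpha * theta
alt_norm = np.hypot(theta, phi)

print(sim_norm, alt_norm)  # Sim-GDA spirals outward; Alt-GDA stays bounded
```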

Figure 6.

The effects of the compared methods in two-player games with 150 iterations. The parameters of the compared methods differ across the three toy functions. For the g1 function: Sim-GDA (α = 0.1), Alt-GDA (α = 0.1), Grad-SCA (α = 0.1, β = 0.3), Grad-ACA (α = 0.1, β = 0.3), OMD (α = 0.1, β = 0.1), MPM (α = 0.1, γ = 1.0), PPCA (α = 0.1, β = 0.3, γ = 1.0), APPCA (α = 0.1, β = 0.3, γ = 1.0), ACG (α = 0.05, β1 = 0.5, β2 = 1.0). For the g2 function: Sim-GDA (α = 0.1), Alt-GDA (α = 0.1), Grad-SCA (α = 0.1, β = 0.3), Grad-ACA (α = 0.1, β = 0.03), OMD (α = 0.1, β = 0.1), MPM (α = 0.3, γ = 0.2), PPCA (α = 0.1, β = 0.3, γ = 1.0), APPCA (α = 0.1, β = 0.02, γ = 0.25), ACG (α = 0.1, β1 = 0.3, β2 = 1.0). For the g3 function: Sim-GDA (α = 0.1), Alt-GDA (α = 0.1), Grad-SCA (α = 0.1, β = 0.3), Grad-ACA (α = 0.1, β = 0.01), OMD (α = 0.1, β = 0.1), MPM (α = 0.1, γ = 0.2), PPCA (α = 0.1, β = 0.1, γ = 1.0), APPCA (α = 0.1, β = 0.3, γ = 0.2), ACG (α = 0.1, β1 = 0.05, β2 = 1.0).


In Figure 6 (b), we test our ACG method on the following general zero-sum game:

$$\min_{\theta \in \mathbb{R}^d}\ \max_{\phi \in \mathbb{R}^d}\ 3\theta^2 + \phi^2 + 4\theta\phi, \qquad d = 1. \tag{48}$$

The results of the compared methods on this game show that all methods converge to the origin. Notably, the cyclic behavior of the Alt-GDA method disappears, and the Sim-GDA method converges. It is worth noting that the trajectory of our ACG method coincides with that of PPCA [16]; both appear faster than the others. We also compared ACG with the other methods on the following general game:

$$\min_{\theta \in \mathbb{R}^d}\ \max_{\phi \in \mathbb{R}^d}\ \theta^2 + \phi^2 - 4\theta\phi, \qquad d = 1. \tag{49}$$

The results are presented in Figure 6 (c), which shows that Sim-GDA and Grad-SCA diverge while the remaining methods converge. APPCA [16] is faster than our ACG method in this game.

We used the last general zero-sum game (49) to test the robustness of the proposed ACG method in Figure 7, increasing the learning rate α through {0.01, 0.05, 0.1} while keeping the other parameters fixed. The ACG method converges faster with α = 0.01 and α = 0.05; although it converges more slowly with α = 0.1, it still reaches the origin rapidly.

Figure 7.

The robustness of the Adaptive Composite Gradient method on the g3 toy function. It is worth noting that the proposed method is significantly robust: the ACG algorithm quickly converges to the Nash point regardless of how α changes.


The proposed Adaptive Composite Gradient (ACG) method is also suitable for general games with n players. Since it is difficult to visualize a general n-player game with a toy function, we illustrate that Algorithm 2 adapts to n-player games using a general 3-player game whose payoff functions can be written as

$$\ell_1(x,y,z) = \tfrac{1}{4}x^2 + xy + xz, \tag{50a}$$
$$\ell_2(x,y,z) = -xy + \tfrac{1}{10}y^2 + yz, \tag{50b}$$
$$\ell_3(x,y,z) = -xz - yz + \tfrac{1}{10}z^2. \tag{50c}$$

The local Nash Equilibrium is (0, 0, 0). The results are shown in Figure 8: the top row and the bottom-left panel show the trajectories of the compared methods, and the bottom-right panel shows the Euclidean distance from the origin at each iteration. Figure 8 shows that SGD, SGA, and our ACG method all converge to the origin. SGD exhibits cyclic behavior, which leads to slow convergence. The second row of Figure 8 shows that the proposed ACG method approaches the origin faster than SGA and SGD.
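Simultaneous gradient descent on the three payoffs can be sketched directly, with each player descending its own loss. The signs below follow the reconstructed payoffs (50a)-(50c), and the step size and starting point are assumed values, so this is an illustration rather than the paper's SGD configuration.

```python
# Each player i descends its own loss l_i from (50a)-(50c):
# dl1/dx = x/2 + y + z, dl2/dy = -x + y/5 + z, dl3/dz = -x - y + z/5.
eta = 0.05               # illustrative step size
x, y, z = 0.5, 0.5, 0.5  # arbitrary starting point

for _ in range(2000):
    gx = 0.5 * x + y + z      # partial of l1 w.r.t. x
    gy = -x + 0.2 * y + z     # partial of l2 w.r.t. y
    gz = -x - y + 0.2 * z     # partial of l3 w.r.t. z
    x, y, z = x - eta * gx, y - eta * gy, z - eta * gz

dist = (x * x + y * y + z * z) ** 0.5
print(dist)  # the joint iterate approaches the Nash point (0, 0, 0)
```

The symmetric part of the game Jacobian here is positive definite, so simultaneous gradient play contracts toward the origin, matching the convergence seen in Figure 8.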

Figure 8.

The effects of SGD, SGA, and the proposed ACG method in a general 3-player game. It shows projection trajectories of the competing algorithms on three coordinate planes. The right subplot of the second row shows the Euclidean distance from the current point to the Nash point as the iterations increase.


6.2 Mixtures of Gaussians

We also tested the ACG method by training a toy GAN model, comparing our method with other well-known methods on learning mixtures of 5 and 16 Gaussians. Both mixtures use components with a standard deviation of 0.02. The ground truths for the mixtures of 16 and 5 Gaussians are presented in Figure B1.
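For reference, such a target distribution can be sampled in a few lines. The sketch below assumes the common arrangement of 16 Gaussians on a 4×4 grid; the paper specifies only the standard deviation of 0.02, so the grid layout and its spacing are assumptions.

```python
import random

def sample_grid_mixture(n, std=0.02, grid=4, spacing=1.0):
    """Sample n points from a mixture of grid*grid Gaussians whose
    means lie on a regular grid centered at the origin."""
    offset = (grid - 1) * spacing / 2.0
    means = [(i * spacing - offset, j * spacing - offset)
             for i in range(grid) for j in range(grid)]
    samples = []
    for _ in range(n):
        mx, my = random.choice(means)           # pick a mode uniformly
        samples.append((random.gauss(mx, std),  # add N(0, std^2) noise
                        random.gauss(my, std)))
    return samples

pts = sample_grid_mixture(1000)
print(len(pts))  # 1000 two-dimensional samples
```

A generator is judged by how well its samples cover all modes of this mixture without collapsing onto a subset of them.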

Details on network architecture. A GAN consists of a generator network and a discriminator network. We set up both the generator and the discriminator with six fully connected layers of 256 neurons each, and we append a ReLU activation layer to each of the six layers in both networks. In the discriminator, we use a fully connected layer in place of a final sigmoid layer. The generator has two output neurons, while the discriminator has one output. The generator's input is random noise sampled from a standard Gaussian distribution; the generator's output serves as the input of the discriminator, whose output evaluates the quality of the generated points.
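A forward pass through this architecture can be sketched with plain NumPy to make the layer shapes concrete. Random weights stand in for trained parameters; the widths and activations follow the description above, the noise dimension of 8 is an assumption, and this is only a shape-level illustration, not the training code.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, sizes):
    """Fully connected layers with a ReLU after each hidden layer and a
    linear output layer, mirroring the toy GAN description."""
    h = x
    for k, (n_in, n_out) in enumerate(zip(sizes[:-1], sizes[1:])):
        w = rng.normal(0.0, 0.1, (n_in, n_out))  # stand-in random weights
        h = h @ w
        if k < len(sizes) - 2:                   # no activation on the output layer
            h = np.maximum(h, 0.0)
    return h

z = rng.normal(size=(64, 8))                      # noise batch; input dim 8 assumed
fake = mlp_forward(z, [8] + [256] * 6 + [2])      # generator: 2-D points out
score = mlp_forward(fake, [2] + [256] * 6 + [1])  # discriminator: 1 output
print(fake.shape, score.shape)  # (64, 2) (64, 1)
```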

Experimental environments. We ran the mixture-of-Gaussians experiments on a computer with an AMD Ryzen 7 3700 CPU, an RTX 2060 GPU with 6 GB RAM, Python (version 3.6.7), Keras (version 2.3.1), TensorFlow (version 1.13.1), and PyTorch (version 1.3.1). We ran each compared method for 10,000 iterations.

We conducted the experiments on the mixture of 5 Gaussians with the proposed ACG method and several other methods. The training settings of all compared algorithms are as follows:

  • RMSP: The simultaneous RMSPropOptimizer provided by TensorFlow, with learning rate α = 5×10−4.

  • RMSP-alt: The alternating RMSPropOptimizer implemented in TensorFlow, with learning rate α = 5×10−4.

  • ConOpt [40]: The Consensus Optimizer implemented in TensorFlow, with h = 1×10−4, γ = 1.

  • RMSP-SGA [32]: The Symplectic Gradient Adjusted RMSPropOptimizer implemented in TensorFlow, with learning rate α = 1×10−4, ξ = 1.

  • RMSP-ACA [15]: Alternating Centripetal Acceleration on the RMSPropOptimizer, implemented in TensorFlow, with learning rate α = 5×10−4, β = 0.5.

  • SGA-ACG (ours): Our proposed ACG method in Algorithm 1 on the SGA optimizer, implemented in PyTorch, with learning rate α = 5×10−4, β1 = 5×10−7, β2 = α.

The numerical results on the mixture of 5 Gaussians are shown in Figure 9, which shows that the ConOpt, RMSP-SGA, and SGA-ACG algorithms all converge. Meanwhile, the generated mixture of 5 Gaussians closely approaches the ground truth in Figure B1. We observe that ConOpt, RMSP-SGA, and SGA-ACG (ours) have similar convergence speeds. We also compared these methods with other SOTA methods such as RMSP, RMSP-alt, and RMSP-ACA; the results are shown in Figure B2. To compare the convergence speed of all six algorithms, we used the same training settings as for the mixture of 5 Gaussians to conduct the mixture-of-16-Gaussians experiment shown in Figure 10.

Figure 9.

Compared results on the mixture of 5 Gaussians. Each row corresponds to a different method, and each column shows the results at iteration numbers through {2000, 4000, 6000, 8000, 10000}.

Figure 10.

Compared results on the mixture of 16 Gaussians. Each row represents an algorithm, and the columns show each algorithm at 2000, 4000, 6000, 8000, and 10000 iterations, respectively.


From Figure 10, it is obvious that our proposed SGA-ACG method converges faster than ConOpt and RMSP-SGA. More comparison results are shown in Figure B3: RMSP, RMSP-alt, and RMSP-ACA still fail to converge after 10,000 iterations. To compare computational cost, Figure 11 shows the time consumption of all compared methods.

Figure 11.

The time consumption of the compared methods on the mixture of 16 Gaussians in Figure B3. Our proposed SGA-ACG method takes more time than RMSProp, RMSProp-alt, and RMSProp-ACA, but less time than the ConOpt and RMSProp-SGA methods.


Our proposed Algorithm 1 and Algorithm 2 contain a parameter s. We explored the influence of s on the final results through a series of experiments with s ∈ {50, 100, 150, 200} on the mixture of 16 Gaussians, shown in Figure 12. The results show that the proposed SGA-ACG method converges faster as s increases.

Figure 12.

Exploring s on the mixture of 16 Gaussians. Each row shows results for a different value of s, and each column shows the results at iteration numbers through {2000, 4000, 6000, 8000, 10000}.


6.3 Experiments on Prevalent Datasets

This section presents the third experiment, which tested our proposed ACG method on image generation tasks. We employ four prevalent datasets to show that the ACG method can be applied in deep learning: the standard MNIST [41], Fashion-MNIST [42], CIFAR-10 [43], and CelebA [44] datasets.

Network architecture. We choose two kinds of network architectures for GANs on the MNIST dataset. In the first, the generator uses 2 fully connected layers with 256 and 512 neurons, each followed by a LeakyReLU layer with α = 0.2, and a Tanh activation as the last layer. The generator's input is 100-dimensional random noise sampled from a standard Gaussian distribution, and its output is an image of shape (28, 28, 1). The discriminator also uses 2 fully connected layers, with 512 and 256 neurons, each followed by a LeakyReLU layer with α = 0.2 as in the generator, but its last layer uses a Sigmoid activation. The discriminator's input consists of the generated image and the ground-truth image from the MNIST dataset, and its output evaluates the quality of the image produced by the generator. For the second kind of network structure, we adopted the DCGAN architecture [45], using only 4 of its layers for both the generator and the discriminator.

Experimental environments. We conducted the experiments in this section on a server equipped with an E5-2698 CPU, 4× GTX 3090 GPUs, 24 GB RAM, Python (version 3.6.13), and PyTorch (version 1.8.0). We implemented all compared algorithms in PyTorch; their training settings on the four datasets are as follows:

  • SGD: The learning rate of linear GANs on the MNIST dataset is α = 2×10−4, except that the learning rate of DCGANs on MNIST is α = 5×10−4; Fashion-MNIST (learning rate α = 2×10−4), CIFAR-10 (learning rate α = 2×10−4), CelebA (learning rate α = 2×10−4).

  • Adam: Linear GANs on MNIST (learning rate α = 3×10−4), DCGANs on MNIST (learning rate α = 2×10−4), Fashion-MNIST (learning rate α = 2×10−4), CIFAR-10 (learning rate α = 2×10−4), CelebA (learning rate α = 2×10−4).

  • RMSP: Linear GANs on MNIST (learning rate α = 2×10−4), DCGANs on MNIST (learning rate α = 5×10−4), Fashion-MNIST (learning rate α = 5×10−4), CIFAR-10 (learning rate α = 5×10−4), CelebA (learning rate α = 5×10−4).

  • RMSP-ACG: Only on the linear GANs (learning rate α = 5×10−4, β1 = 5×10−7, β2 = α).

  • Adam-ACG: On all datasets, our proposed Adam-ACG method applied to both linear GANs and DCGANs, with learning rate α = 5×10−4, β1 = 5×10−7, β2 = α.

In the experiment with linear GANs on MNIST, we set the batch size to 64 and the number of epochs to 324. The generation results of our proposed methods are shown in Figure 13; more comparisons among these algorithms on MNIST are shown in Figure B4.
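As a sanity check, the iteration counts reported in Figure 13 are consistent with these settings: assuming the standard 60,000 MNIST training images, 324 epochs at batch size 64 give roughly 3×10⁵ generator updates. The arithmetic below uses ceiling division, which assumes the partial final batch of each epoch is kept.

```python
import math

train_images = 60000  # standard MNIST training-set size (assumed)
batch_size = 64
epochs = 324

batches_per_epoch = math.ceil(train_images / batch_size)  # 938 per epoch
total_iterations = batches_per_epoch * epochs
print(total_iterations)  # 303912, close to the ~300000 iterations in Figure 13
```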

Figure 13.

Compared results of linear GANs on the MNIST dataset. The first and second rows are the results of RMSP-ACG and Adam-ACG; the first through fourth columns show the results at 50000, 150000, 250000, and 300000 iterations, respectively.


For the DCGAN experiments, the batch size is 64 and the number of epochs is 110 on the MNIST dataset; the same batch size and epoch number are used for Fashion-MNIST and CIFAR-10. In contrast, the batch size and number of epochs on the CelebA dataset are 128 and 70, respectively. The results of our methods on the four datasets are shown in Figure 14, and more comparison results among several algorithms on the same datasets are shown in Figure B5.

Figure 14.

Comparison of DCGANs for our proposed method on the four datasets. The first column shows the results on the MNIST dataset at 100000 iterations, the second column the Fashion-MNIST dataset at 100000 iterations, the third column the CIFAR-10 dataset at 80000 iterations, and the last column the CelebA dataset at 100000 iterations.


We proposed the Adaptive Composite Gradients (ACG) method to find a local Nash Equilibrium in game problems. The ACG algorithm alleviates cyclic behaviors, is robust, and can easily be integrated with SGD, Adam, RMSP, SGA, and other gradient-based optimizers. Since the ACG method employs predicted information from s future iterations, it is a novel semi-gradient-free algorithm with a linear convergence rate. Furthermore, SGA-ACG is competitive with the ConOpt and SGA methods on mixture-of-Gaussians generation tasks, and the toy-function experiment demonstrates that ACG applies to general zero-sum games with n players. The extensive image generation experiments show that our method can optimize generic deep learning models. However, our research objectives are limited to convex and smooth simple zero-sum games. Non-convex and non-smooth games are more complex, and finding a local Nash Equilibrium for them is harder; optimizing and finding local solutions for such games remains a challenging direction for future research.

This work is supported by the National Key Research and Development Program of China (No.2018AAA0101001), Science and Technology Commission of Shanghai Municipality (No.20511100200), and supported in part by the Science and Technology Commission of Shanghai Municipality (No.18dz2271000).

Conceptualization, methodology, algorithm designing, coding, original draft preparation and survey references: Huiqing Qi. Methodology, data analyses, manuscript review, funding acquisition, original draft preparation and manuscript revising: Fang Li. Methodology, data analyses, manuscript review and funding acquisition: Shengli Tan. Methodology, algorithm designing, manuscript review, funding acquisition: Xiangyun Zhang. All authors have read and agreed to the published version of the manuscript.

[1] Maleknia, M., Shamsi, M.: A quasi-Newton proximal bundle method using gradient sampling technique for minimizing nonsmooth convex functions. Optimization Methods and Software, 37(4), 1415-1446 (2022).
[2] Li, K., Zhang, T., Wang, R.: Deep reinforcement learning for multiobjective optimization. IEEE Transactions on Cybernetics, 51(6), 3103-3114 (2020).
[3] Vezhnevets, A.S., Osindero, S., Schaul, T., et al.: Feudal networks for hierarchical reinforcement learning. In: International Conference on Machine Learning, pp. 3540-3549 (2017).
[4] Goodfellow, I., Pouget-Abadie, J., Mirza, M., et al.: Generative adversarial networks. Communications of the ACM, 63(11), 139-144 (2020).
[5] Hong, Y., Hwang, U., Yoo, J., et al.: How generative adversarial networks and their variants work: An overview. ACM Computing Surveys (CSUR), 52(1), 1-43 (2019).
[6] Guimaraes, G.L., Sanchez-Lengeling, B., Outeiral, C., et al.: Objective-reinforced generative adversarial networks (ORGAN) for sequence generation models. arXiv preprint arXiv:1705.10843 (2017).
[7] Lee, S., Hwang, U., Min, S., et al.: Polyphonic music generation with sequence generative adversarial networks. arXiv preprint arXiv:1710.11418 (2017).
[8] Yu, L., Zhang, W., Wang, J., et al.: SeqGAN: Sequence generative adversarial nets with policy gradient. arXiv preprint arXiv:1609.05473 (2016).
[9] Hsu, C.C., Hwang, H.T., Wu, Y.C., et al.: Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks. arXiv preprint arXiv:1704.00849 (2017).
[10] Lin, K., Li, D., He, X., et al.: Adversarial ranking for language generation. Advances in Neural Information Processing Systems, 30 (2017).
[11] Akmal Haidar, M., Rezagholizadeh, M.: TextKD-GAN: Text generation using knowledge distillation and generative adversarial networks. arXiv preprint arXiv:1905.01976 (2019).
[12] Croce, D., Castellucci, G., Basili, R.: GAN-BERT: Generative adversarial learning for robust text classification with a bunch of labeled examples. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2114-2119 (2020).
[13] Neyshabur, B., Bhojanapalli, S., Chakrabarti, A.: Stabilizing GAN training with multiple random projections. arXiv preprint arXiv:1705.07831 (2017).
[14] Qin, C., Wu, Y., Springenberg, J.T., et al.: Training generative adversarial networks by solving ordinary differential equations. Advances in Neural Information Processing Systems, 33, 5599-5609 (2020).
[15] Peng, W., Dai, Y.H., Zhang, H., et al.: Training GANs with centripetal acceleration. Optimization Methods and Software, 35(5), 955-973 (2020).
[16] Keke, L., Ke, Z., Qiang, L., et al.: Training GANs with predictive projection centripetal acceleration. arXiv preprint arXiv:2010.03322 (2020).
[17] Ledig, C., Theis, L., Huszár, F., et al.: Photo-realistic single image super-resolution using a generative adversarial network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4681-4690 (2017).
[18] Wu, J., Zhang, C., Xue, T., et al.: Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. Advances in Neural Information Processing Systems, 29 (2016).
[19] Zhu, J.Y., Park, T., Isola, P., et al.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2223-2232 (2017).
[20] Wang, C., Xu, C., Wang, C., et al.: Perceptual adversarial networks for image-to-image transformation. IEEE Transactions on Image Processing, 27(8), 4066-4079 (2018).
[21] Walker, J., Marino, K., Gupta, A., et al.: The pose knows: Video forecasting by generating pose futures. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3332-3341 (2017).
[22] Tulyakov, S., Liu, M.Y., Yang, X., et al.: MoCoGAN: Decomposing motion and content for video generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1526-1535 (2018).
[23] Yue, Z., Zhao, Q., Zhang, L., et al.: Dual adversarial network: Toward real-world noise removal and noise generation. In: European Conference on Computer Vision, pp. 41-58 (2020).
[24] Kupyn, O., Budzan, V., Mykhailych, M., et al.: DeblurGAN: Blind motion deblurring using conditional adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8183-8192 (2018).
[25] Kim, T., Cha, M., Kim, H., et al.: Learning to discover cross-domain relations with generative adversarial networks. In: International Conference on Machine Learning, pp. 1857-1865 (2017).
[26] Yi, Z., Zhang, H., Tan, P., et al.: DualGAN: Unsupervised dual learning for image-to-image translation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2849-2857 (2017).
[27] Yang, D., Xiong, T., Xu, D., et al.: Automatic vertebra labeling in large-scale 3D CT using deep image-to-image network with message passing and sparsity regularization. In: 25th International Conference on Information Processing in Medical Imaging, pp. 633-644 (2017).
[28] Wang, Y.: A mathematical introduction to generative adversarial nets (GAN). arXiv preprint arXiv:2009.00169 (2020).
[29] Odena, A.: Open questions about generative adversarial networks. Distill, 4(4), 18 (2019).
[30] Goodfellow, I.: NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160 (2016).
[31] Poon, C., Liang, J.: Trajectory of alternating direction method of multipliers and adaptive acceleration. Advances in Neural Information Processing Systems, 32 (2019).
[32] Balduzzi, D., Racaniere, S., Martens, J., et al.: The mechanics of n-player differentiable games. In: International Conference on Machine Learning, pp. 354-363 (2018).
[33] Nowozin, S., Cseke, B., Tomioka, R.: f-GAN: Training generative neural samplers using variational divergence minimization. Advances in Neural Information Processing Systems, 29 (2016).
[34] Liang, T., Stokes, J.: Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. In: The 22nd International Conference on Artificial Intelligence and Statistics, pp. 907-915 (2019).
[35] Bauschke, H.H., Combettes, P.L.: Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Corrected printing (2019).
[36] Bruck, R.E.: On the weak convergence of an ergodic iteration for the solution of variational inequalities for monotone operators in Hilbert space. Journal of Mathematical Analysis and Applications, 61(1), 159-164 (1977).
[37] Nedic, A., Ozdaglar, A.: Subgradient methods for saddle-point problems. Journal of Optimization Theory and Applications, 142, 205-228 (2009).
[38] Nesterov, Y.: A method for solving the convex programming problem with convergence rate O(1/k²). In: Proceedings of the USSR Academy of Sciences, 269, pp. 543-547 (1983).
[39] Yang, T., Lin, Q., Li, Z.: Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization. arXiv preprint arXiv:1604.03257 (2016).
[40] Mescheder, L., Nowozin, S., Geiger, A.: The numerics of GANs. Advances in Neural Information Processing Systems, 30 (2017).
[41] LeCun, Y., Bottou, L., Bengio, Y., et al.: Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324 (1998).
[42] Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017).
[43] Krizhevsky, A., Hinton, G.: Convolutional deep belief networks on CIFAR-10. Unpublished manuscript, 40(7), 1-9 (2010).
[44] Liu, Z., Luo, P., Wang, X., et al.: Deep learning face attributes in the wild. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3730-3738 (2015).
[45] Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015).

APPENDIX A. PROOFS IN SECTION 5

A.1 Proof of Proposition 5.3

Proof. Without loss of generality, let $a=(a_1,a_2,a_3,\ldots,a_n)$ and $b=(b_1,b_2,b_3,\ldots,b_n)$, where $n\ge 2$, $n\in\mathbb{N}$. Then, we have

$$\cos\langle a,b\rangle = \frac{a\cdot b}{|a|\,|b|}. \tag{A1}$$

The projection p of b onto a can be written as

$$p = |b|\cos\langle a,b\rangle\,\frac{a}{|a|}. \tag{A2}$$

Substituting (A1) into (A2), we have

$$p = |b|\,\frac{a}{|a|}\cdot\frac{a\cdot b}{|a|\,|b|} = \frac{a\cdot b}{|a|^2}\,a. \tag{A3}$$

Using $\gamma$ to denote $\frac{a\cdot b}{|a|^2}$, we obtain $p=\gamma a$, where $\gamma\in\mathbb{R}$.
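The projection formula (A3) can be checked numerically: with p = ((a·b)/|a|²)a, the residual b − p should be orthogonal to a. A small sketch with arbitrarily chosen vectors:

```python
# Verify that p = ((a.b)/|a|^2) a leaves a residual b - p orthogonal to a.
a = [1.0, 2.0, 3.0]
b = [4.0, -1.0, 0.5]

dot_ab = sum(ai * bi for ai, bi in zip(a, b))  # a . b
norm_a_sq = sum(ai * ai for ai in a)           # |a|^2
gamma = dot_ab / norm_a_sq                     # the scalar gamma in A.1
p = [gamma * ai for ai in a]                   # projection of b onto a

residual_dot = sum(ai * (bi - pi) for ai, bi, pi in zip(a, b, p))
print(abs(residual_dot) < 1e-12)  # True: (b - p) is orthogonal to a
```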

A.2 Proof of Proposition 5.4

Proof. The characteristic polynomial of the matrix (39) is

$$\det\begin{pmatrix}
(1-\lambda)I_d & -(\alpha+\beta_1)A & 0 & \beta_1(1+\gamma)A & -\beta_2 I_d & 0\\
(\alpha+\beta_1)A^{\top} & (1-\lambda)I_d & -\beta_1(1+\gamma)A^{\top} & 0 & 0 & -\beta_2 I_d\\
I_d & 0 & -\lambda I_d & 0 & 0 & 0\\
0 & I_d & 0 & -\lambda I_d & 0 & 0\\
0 & 0 & 0 & 0 & (\tau-\lambda)I_d & 0\\
0 & 0 & 0 & 0 & 0 & (\tau-\lambda)I_d
\end{pmatrix}, \tag{A4}$$

which is equivalent to

$$(\tau-\lambda)^2\det\begin{pmatrix}
(1-\lambda)I_d & -(\alpha+\beta_1)A & 0 & \beta_1(1+\gamma)A\\
(\alpha+\beta_1)A^{\top} & (1-\lambda)I_d & -\beta_1(1+\gamma)A^{\top} & 0\\
I_d & 0 & -\lambda I_d & 0\\
0 & I_d & 0 & -\lambda I_d
\end{pmatrix}. \tag{A5}$$

From (A5) we can derive

$$(\tau-\lambda)^2\det\begin{pmatrix}
\lambda(1-\lambda)I_d & -\left[\lambda(\alpha+\beta_1)-\beta_1(1+\gamma)\right]A\\
\left[\lambda(\alpha+\beta_1)-\beta_1(1+\gamma)\right]A^{\top} & \lambda(1-\lambda)I_d
\end{pmatrix}. \tag{A6}$$

According to (A6), since A is square and nonsingular, 0 and 1 cannot be its roots. Eq. (A6) is equivalent to

$$(\tau-\lambda)^2\det\left[\lambda^2(1-\lambda)^2 I_d + \left(\lambda(\alpha+\beta_1)-\beta_1(1+\gamma)\right)^2 A^{\top}A\right]. \tag{A7}$$

Then, we obtain that the eigenvalues of F are the roots of the sixth-order polynomials:

$$(\tau-\lambda)^2\left[\lambda^2(1-\lambda)^2 + \left(\lambda(\alpha+\beta_1)-\beta_1(1+\gamma)\right)^2\xi^2\right], \qquad \xi^2\in\mathrm{Sp}(A^{\top}A). \tag{A8}$$

A.3 Proof of Proposition 5.5

Proof. Set the characteristic polynomial of the matrix (39) to 0, which is written as follows:

$$(\tau-\lambda)^2\left[\lambda^2(1-\lambda)^2 + \left(\lambda(\alpha+\beta_1)-\beta_1(1+\gamma)\right)^2\xi^2\right]=0, \qquad \xi^2\in\mathrm{Sp}(A^{\top}A). \tag{A9}$$

It is obvious that (A9) has 6 roots, two of which are λ1 = λ2 = τ. From the convergence of formula (35), τ is small with |τ| < 1. We mainly discuss the following polynomial:

$$\lambda^2(1-\lambda)^2 + \left(\lambda(\alpha+\beta_1)-\beta_1(1+\gamma)\right)^2\xi^2 = 0. \tag{A10}$$

Using Proposition 5.4, we have

$$\left(\lambda^2-\lambda-i\left[\lambda(\alpha+\beta_1)-\beta_1(1+\gamma)\right]\xi\right)\times\left(\lambda^2-\lambda+i\left[\lambda(\alpha+\beta_1)-\beta_1(1+\gamma)\right]\xi\right)=0. \tag{A11}$$

Denoting $a:=\alpha+\beta_1$ and $b:=\beta_1(1+\gamma)$, (A11) can be written as

$$\left[\lambda^2-\lambda-i(\lambda a-b)\xi\right]\left[\lambda^2-\lambda+i(\lambda a-b)\xi\right]=0. \tag{A12}$$

The four roots of (A12) are

$$\lambda_{1\pm}=\frac{1-ia\xi\pm\sqrt{1-a^2\xi^2-2ia\xi+4ib\xi}}{2},\qquad \lambda_{2\pm}=\frac{1+ia\xi\pm\sqrt{1-a^2\xi^2+2ia\xi-4ib\xi}}{2}. \tag{A13}$$

Let $u:=a\xi+b\xi$ and $v:=a\xi-b\xi$; then we obtain

$$\lambda_{1\pm}=\frac{1-\left(\frac{u+v}{2}\right)i\pm\sqrt{1-\left(\frac{u+v}{2}\right)^2-(3v-u)i}}{2};\qquad \lambda_{2\pm}=\frac{1+\left(\frac{u+v}{2}\right)i\pm\sqrt{1-\left(\frac{u+v}{2}\right)^2+(3v-u)i}}{2}. \tag{A14}$$

Denote $s:=\frac{u+v}{2}$ and $t:=\frac{3v-u}{2}$; then we have

$$\lambda_{1\pm}=\frac{1-si\pm\sqrt{1-2ti-s^2}}{2};\qquad \lambda_{2\pm}=\frac{1+si\pm\sqrt{1+2ti-s^2}}{2}. \tag{A15}$$
(A15)

The remainder of the proof follows (A.2) in [29]. For a given complex number z, the absolute value of the real part of $\sqrt{z}$ is $\sqrt{\frac{|z|+\Re(z)}{2}}$ and the absolute value of the imaginary part of $\sqrt{z}$ is $\sqrt{\frac{|z|-\Re(z)}{2}}$. Since s ≤ 1 by this proposition, all the real parts of the four roots lie in the interval [−S, S], where

$$S=\frac{1}{2}\sqrt{\frac{\sqrt{(1-s^2)^2+4t^2}+1-s^2}{2}}+\frac{1}{2}, \tag{A16}$$

and all the imaginary parts of the roots lie in the interval [−T, T], where

$$T=\frac{1}{2}\sqrt{\frac{\sqrt{(1-s^2)^2+4t^2}-1+s^2}{2}}+\frac{s}{2}. \tag{A17}$$

Using the inequality $\sqrt{x+y}\le\sqrt{x}+\frac{y}{2\sqrt{x}}$ $(x>0,\ y\ge 0)$, we obtain

$$S\le\frac{1}{2}\sqrt{1-s^2+\frac{t^2}{1-s^2}}+\frac{1}{2}, \tag{A18}$$
$$T\le\frac{s}{2}+\frac{|t|}{2\sqrt{1-s^2}}. \tag{A19}$$

Then, we analyze s in the two cases $(0,\frac{1}{2}]$ and $(\frac{1}{2},1]$ separately.

Case 1. Suppose $0<s\le\frac{1}{2}$. By the assumption of this proposition that $\frac{|\alpha+\beta_1|+|2\beta_1(1+\gamma)|}{(\alpha+\beta_1)^2}\le 0.1\,\xi$ for all $\xi^2\in\mathrm{Sp}(A^{\top}A)$, we have $|t|\le\frac{s^2}{10}$. Then, since $\frac{s^2}{2}\le 1-\sqrt{1-s^2}$, we have

$$|t|\le\frac{1-\sqrt{1-s^2}}{5}. \tag{A20}$$

Combining $s\le\frac{1}{2}$ with (A20), we can obtain

$$|t|^2\le\frac{2\left(1-\sqrt{1-s^2}\right)\left(1-s^2\right)}{5},$$

which yields the chain

$$\frac{t^2}{1-s^2}+\sqrt{1-s^2+\frac{t^2}{1-s^2}}+\frac{s|t|}{\sqrt{1-s^2}} \tag{A21}$$
$$\le \frac{t^2}{1-s^2}+\sqrt{1-s^2}+\frac{t^2}{2\left(1-s^2\right)^{3/2}}+\frac{s|t|}{\sqrt{1-s^2}} \tag{A22}$$
$$\le \frac{t^2}{1-s^2}+\sqrt{1-s^2}+\frac{t^2}{1-s^2}+\frac{s|t|}{\sqrt{1-s^2}}\le 1. \tag{A23}$$

The inequality (A22) uses $\sqrt{x+y}\le\sqrt{x}+\frac{y}{2\sqrt{x}}$, and the inequality (A23) follows from the fact that $2\sqrt{1-s^2}\ge 1$ when $s\le\frac{1}{2}$. The chain (A21)-(A23) can be written equivalently as

$$\left(\frac{1}{2}\sqrt{1-s^2+\frac{t^2}{1-s^2}}+\frac{1}{2}\right)^2+\left(\frac{s}{2}+\frac{|t|}{2\sqrt{1-s^2}}\right)^2\le 1. \tag{A24}$$

According to (A18) and (A19), we have

$$\rho(F)\le\sqrt{S^2+T^2}\le 1. \tag{A25}$$

It is worth noting that $\sqrt{x+y}\le\sqrt{x}+\frac{y}{2\sqrt{x}}$ holds with equality if and only if y = 0, so (A25) holds with equality only when t = 0 and s = 0. Since s > 0, we have the strict inequality ρ(F) < 1, which implies linear convergence.

Case 2. Suppose $\frac{1}{2}<s\le 1$. Since $|t|\le\frac{s^2}{10}\le 0.1$, combining (A16) and (A17) directly, we can obtain

$$\rho(F)\le\sqrt{S^2+T^2}<1, \tag{A26}$$

which also implies linear convergence.
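The root formula (A15) and the conclusion ρ(F) < 1 can be spot-checked numerically. The sketch below picks values satisfying the Case 1 conditions (s = 0.4 and t = s²/10) and verifies that the four roots of (A15) solve the quartic in (A10), using the substitutions s = aξ and t = aξ − 2bξ, and that all roots have modulus below 1. The specific numbers are illustrative assumptions.

```python
import cmath

# Case 1 sample values: s <= 1/2 and |t| <= s^2 / 10 (here t = s^2 / 10).
s, t = 0.4, 0.016
a_xi = s                  # s = a*xi by the substitutions in the proof
b_xi = (s - t) / 2.0      # from t = a*xi - 2*b*xi

roots = []
for sigma in (-1.0, 1.0):                 # lambda_1 (sigma=-1), lambda_2 (sigma=+1)
    inner = 1 - s * s + sigma * 2j * t    # radicand in (A15)
    for sign in (1.0, -1.0):
        roots.append((1 + sigma * 1j * s + sign * cmath.sqrt(inner)) / 2.0)

# Each root must satisfy the quartic (A10):
# lambda^2 (1 - lambda)^2 + (lambda * a*xi - b*xi)^2 = 0.
residuals = [abs(l ** 2 * (1 - l) ** 2 + (l * a_xi - b_xi) ** 2) for l in roots]
moduli = [abs(l) for l in roots]
print(max(residuals) < 1e-9, max(moduli) < 1.0)
```

All four moduli come out strictly below 1, consistent with the spectral-radius bound ρ(F) < 1 behind the linear convergence claim.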

APPENDIX B. THE APPENDIX FIGURES OF EXPERIMENTS

This section shows additional figures. The two ground truths are shown in Figure B1. Figure B2 shows the comparison of our proposed method and other SOTA algorithms on the mixture-of-5-Gaussians experiments, and Figure B3 the results on the mixture-of-16-Gaussians experiments. Figure B4 shows the results of the compared methods with linear GANs on the MNIST dataset, and Figure B5 the results with DCGANs on the four datasets (MNIST, Fashion-MNIST, CIFAR-10, and CelebA).

Figure B1.

Ground truth of the mixtures of Gaussians.

Figure B2.

Compared results on the mixture of 5 Gaussians. Each row corresponds to a different method, and each column shows the results at iteration numbers through {2000, 4000, 6000, 8000, 10000}. The figure shows that RMSP, RMSP-alt, and RMSP-ACA cannot converge to the ground truth, while ConOpt, RMSP-SGA, and SGA-ACG all converge to it; our proposed method is competitive with ConOpt and RMSP-SGA.

Figure B3.

Compared results on the mixture of 16 Gaussians. Each row corresponds to a different method, and each column shows the results at iteration numbers through {2000, 4000, 6000, 8000, 10000}. The figure shows that RMSP, RMSP-alt, and RMSP-ACA cannot converge to the ground truth, while ConOpt, RMSP-SGA, and SGA-ACG all converge to it. Moreover, our method converges faster than ConOpt and RMSP-SGA at iteration 2000.

Figure B4.

Comparison results of linear GANs on the MNIST dataset. Each row corresponds to a different method, and each column shows the results at iteration numbers {50000, 150000, 250000, 300000}. The figure shows that SGD and Adam cannot generate correct handwritten digits, while RMSP, RMSP-ACG (ours), and Adam-ACG (ours) can. However, all the compared methods, including ours, suffer from the mode collapse problem.

Figure B5.

Comparison of DCGANs for several algorithms on the four datasets. The first, second, third, and fourth rows show the results of SGD, RMSP, Adam, and Adam-ACG (ours), respectively; the first, second, third, and fourth columns correspond to the MNIST, Fashion-MNIST, CIFAR10, and CelebA datasets, respectively. We run 100000 iterations on the MNIST, Fashion-MNIST, and CelebA datasets and 80000 iterations on the CIFAR10 dataset. The SGD method fails on the MNIST and Fashion-MNIST experiments but succeeds on the CIFAR10 and CelebA experiments. Notably, the RMSP method fails on all four datasets. The figure shows that our proposed method is competitive with Adam on all four datasets.

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.