Training Generative Adversarial Networks with Adaptive Composite Gradient

ABSTRACT The wide applications of Generative Adversarial Networks benefit from successful training methods that guarantee the objective function converges to a local minimum. Nevertheless, designing an efficient and competitive training method is still challenging due to the cyclic behaviors of some gradient-based methods and the expensive computational cost of acquiring the Hessian matrix. To address this problem, we propose the Adaptive Composite Gradients (ACG) method, which is linearly convergent in bilinear games under suitable settings. Theoretical analysis and toy-function experiments both suggest that our approach alleviates cyclic behaviors and converges faster than recently proposed state-of-the-art algorithms; the convergence speed of ACG is improved by 33% compared with other methods. ACG is a novel semi-gradient-free algorithm that reduces the computational cost of gradients and Hessians by utilizing predictive information from future iterations. Experiments on mixtures of Gaussians and real-world digital image generation show that ACG outperforms several existing techniques, illustrating the superiority and efficacy of our method.


Introduction
Gradient-descent-based machine learning and deep learning methods have been widely used in various computer science tasks over the past several decades. Optimizing a single-objective problem with gradient descent can easily converge to a saddle point in some cases [21]. Meanwhile, a growing number of multi-objective problems need to be optimized in numerous fields, such as deep reinforcement learning [22,41], game theory, machine learning, and deep learning. Generative Adversarial Networks (GANs) [10] are a classical multi-objective problem in deep learning. GANs have a wide range of applications [13] because of their ability to learn to generate complex and high-dimensional target distributions. The existing literature on GANs can be divided into four categories: music generation [8,11,52], natural language [5,12,14,25], methods for training GANs [15,33,36,38], and image processing [20,44,47,55]. GANs have made remarkable progress in image processing, such as video generation [40,42,43], noise removal [53], deblurring [18], image-to-image translation [16,51], image super-resolution [20], and medical image processing [6,27,49].
The framework of generative adversarial networks consists of two deep neural networks: a generator network and a discriminator network. The generator network takes as input a noise sample drawn from a simple known distribution and produces a fake sample as output. The generator learns to make such fake samples not by directly using real data, but by adversarial training against a discriminator network. The essence of GANs is a zero-sum game between the generator and the discriminator. The objective function of GANs [10] is often formulated as a two-player min-max game with a Nash equilibrium at the saddle points:
$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_X(x)}[\log D(x)] + \mathbb{E}_{z \sim P_Z(z)}[\log(1 - D(G(z)))]$,
where $x \sim P_X(x)$ denotes a real data sample and $z \sim P_Z(z)$ denotes a sample from a noise distribution (often a uniform or Gaussian distribution). More forms of the GAN objective function are discussed in [46]. Although GANs have achieved remarkable applications, training GANs stably and quickly [9,35] is still a challenging task, since training suffers from the strongly associated gradient vector field rotating around a Nash equilibrium (see Figure 1). Moreover, the gradient-descent-ascent-based methods used to optimize the GAN objective tend to exhibit limit oscillatory behavior because of imaginary components in the Jacobian eigenvalues.
In recent years, there has been a great amount of remarkable work proposing novel algorithms for training GANs. [38] considered the dynamics as a continuous-time process and proposed using ordinary differential equations to train GANs. Consensus optimization [28] with Jacobian information diverts gradient updates toward the descent direction of the field magnitudes. [1] developed a method called Symplectic Gradient Adjustment (SGA). Motivated by SGA, [36] proposed the centripetal acceleration method and its alternating version. Based on several predictive methods [4,7,29,30,45,54], [15] proposed the predictive projection centripetal acceleration method.
The main idea of this work is to reduce the computing cost of the Hessian matrix in consensus optimization and SGA. Motivated by [15,36] and [37], we propose a novel Adaptive Composite Gradient method, which can be used to calibrate and accelerate traditional methods such as SGD, RMSProp, Adam, consensus optimization, and SGA. The ACG method exploits three kinds of information in the iteration process: gradient information from past iteration steps, adaptive and predictive information for future iteration steps, and the projection information of the current iteration step mentioned in [15]. We fuse this information into a composite gradient to update the parameters in our algorithm, which can be deployed in deep networks and used to train GANs.
Contributions. The main contributions of this paper are as follows: • We propose the Adaptive Composite Gradient (ACG) method, which can alleviate the cyclic behaviors around the Nash equilibria in games. Meanwhile, we prove its convergence in bilinear games. Our algorithm can be used to train GANs.
• Our ACG method applies not only to bilinear games but also to general game problems. Furthermore, we experimentally demonstrate its applicability to three-player game problems on a toy model.
• The Adaptive Composite Gradient method can reduce the computing cost of the Hessian when it calibrates SGA, consensus optimization, or other Hessian-based methods. ACG can also reduce the computing cost of gradients when calibrating gradient-descent-based methods. In other words, our method is a novel semi-gradient-free algorithm.

Related Work
There are several distinctive approaches to improving the training of GANs, but each shows limitations in some cases. Some depend closely on prior assumptions, which renders these methods invalid when the assumptions fail. Moreover, some of them must pay the computing cost of the Hessian in the dynamics. We discuss related research in this section.
Symplectic Gradient Adjustment (SGA) [1]: Compared with traditional games, general games do not constrain the players' parameter sets or require the loss functions to be convex. General games can be decomposed into a potential game and a Hamiltonian game [1]. To introduce our method, we first recall the SGA method. We use $g(w)$ to denote the simultaneous gradient, which is the gradient of the losses with respect to the players' parameters, $g(w) = (\nabla_{w^1}\ell_1, \nabla_{w^2}\ell_2, \dots, \nabla_{w^n}\ell_n)$. A bilinear game requires the losses to satisfy $\sum_{i=1}^n \ell_i \equiv 0$, for example $\ell_1(x, y) = x^\top C y$ and $\ell_2(x, y) = -x^\top C y$. This kind of game has a Nash equilibrium at $(x, y) = (0, 0)$. The simultaneous gradient $g(x, y) = (Cy, -C^\top x)$ rotates around the Nash equilibrium, as shown in Figure 6.
We can derive the Hessian of an n-player game from the simultaneous gradient $g(w)$. The Hessian of the game is $H(w) = \nabla g(w)$, where $H \in \mathbb{R}^{d \times d}$; in matrix form, it is the Jacobian of the simultaneous gradient, whose $(i, j)$ block is $\nabla_{w^j}\nabla_{w^i}\ell_i$. (2)
Applying the generalized Helmholtz decomposition [Lemma 1 in [1]] to the above Hessian of the game gives $H(w) = S(w) + A(w)$, where $S$ is symmetric and $A$ is antisymmetric. Balduzzi et al. (2018) [1] pointed out that a game is a potential game if $A(w) \equiv 0$ and a Hamiltonian game if $S(w) \equiv 0$. Potential games and Hamiltonian games are both well studied and easy to solve. Since the cyclic behavior around the Nash equilibrium is caused by the simultaneous gradient, [1] proposed the Symplectic Gradient Adjustment method:
$g_\lambda := g + \lambda A^\top g$,
where $A$ is from the Helmholtz decomposition of the Hessian. $g_\lambda$ replaces the gradient in the iterates of gradient-descent-based methods, and using $g_\lambda$ to train GANs can alleviate cyclic behaviors. If we consider the players in a bilinear game as GANs, the SGA algorithm has to pay an expensive computing cost for the Hessian, which lowers the algorithm's efficiency.
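To make the adjustment concrete, the following minimal sketch (our own illustration, not the authors' reference code) applies SGA to the bilinear game $\ell_1 = x^\top C y$, $\ell_2 = -x^\top C y$; for this game the Hessian of the game is constant, so its antisymmetric part can be formed in closed form.

```python
import numpy as np

def sga_step(x, y, C, lr=0.05, lam=1.0):
    """One SGA step on the bilinear game l1 = x^T C y, l2 = -x^T C y (sketch)."""
    g = np.concatenate([C @ y, -C.T @ x])          # simultaneous gradient g(w)
    d1, d2 = len(x), len(y)
    # Hessian of the game H = grad g(w); constant for this bilinear game.
    H = np.block([[np.zeros((d1, d1)), C],
                  [-C.T, np.zeros((d2, d2))]])
    A = 0.5 * (H - H.T)                            # antisymmetric part of H
    g_adj = g + lam * A.T @ g                      # symplectic gradient adjustment
    w = np.concatenate([x, y]) - lr * g_adj        # simultaneous descent step
    return w[:d1], w[d1:]

x, y = np.array([1.0]), np.array([1.0])
C = np.array([[1.0]])
for _ in range(200):
    x, y = sga_step(x, y, C)
print(x, y)   # both coordinates shrink toward the Nash equilibrium (0, 0)
```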
Centripetal Acceleration [36]: The simultaneous gradient exhibits cyclic behaviors around the Nash equilibrium. Hamiltonian games obey a conservation law under these gradient-descent-based methods, so the cyclic behavior can be considered a uniform circular motion process. As is well known, the centripetal acceleration of a uniform circular motion points to the center of the circle. Using this characteristic to modify the direction of the simultaneous gradient vector field alleviates the cyclic behaviors. Based on these observations, Peng et al. (2020) [36] proposed the centripetal acceleration methods, which come in two versions, Simultaneous Centripetal Acceleration (Grad-SCA) and Alternating Centripetal Acceleration (Grad-ACA), used to train GANs.
Given a bilinear game, let the losses be $\ell_1(\theta, \phi)$ and $\ell_2(\theta, \phi)$ for player 1 and player 2, respectively. The parameter space is $\theta \times \phi$, where $\theta, \phi \in \mathbb{R}^n$. Player 1 controls the parameter $\theta$ and tries to minimize the payoff function $\ell_1$, while player 2 controls the parameter $\phi$ and tries to minimize the payoff function $\ell_2$ in a non-cooperative setting. The game is a process in which the two players adjust their parameters to find a local Nash equilibrium $(\theta^*, \phi^*)$ satisfying $\ell_1(\theta^*, \phi^*) \le \ell_1(\theta, \phi^*)$ and $\ell_2(\theta^*, \phi^*) \le \ell_2(\theta^*, \phi)$ for all $(\theta, \phi)$ in a neighborhood of $(\theta^*, \phi^*)$. The centripetal acceleration methods require the two-player game to be differentiable. Because of the zero-sum property of the game, the two payoff functions can be combined into a joint payoff function $V(\theta, \phi) = \ell_1(\theta, \phi) = -\ell_2(\theta, \phi)$. To introduce the centripetal acceleration methods, we first review the simultaneous gradient descent method in [34]:
$\theta_{t+1} = \theta_t - \alpha \nabla_\theta V(\theta_t, \phi_t), \quad \phi_{t+1} = \phi_t + \alpha \nabla_\phi V(\theta_t, \phi_t)$,
and its alternating version:
$\theta_{t+1} = \theta_t - \alpha \nabla_\theta V(\theta_t, \phi_t), \quad \phi_{t+1} = \phi_t + \alpha \nabla_\phi V(\theta_{t+1}, \phi_t)$,
where $\alpha$ is the learning rate. The centripetal acceleration methods directly utilize a centripetal acceleration term to adjust simultaneous gradient descent. Gradient descent with simultaneous centripetal acceleration (Grad-SCA) is
$\theta_{t+1} = \theta_t - \alpha_1 \nabla_\theta V(\theta_t, \phi_t) - \beta_1 [\nabla_\theta V(\theta_t, \phi_t) - \nabla_\theta V(\theta_{t-1}, \phi_{t-1})]$,
$\phi_{t+1} = \phi_t + \alpha_2 \nabla_\phi V(\theta_t, \phi_t) + \beta_2 [\nabla_\phi V(\theta_t, \phi_t) - \nabla_\phi V(\theta_{t-1}, \phi_{t-1})]$,
and gradient descent with alternating centripetal acceleration (Grad-ACA) updates $\phi_{t+1}$ using $\theta_{t+1}$ instead of $\theta_t$, where $\alpha_1, \beta_1, \alpha_2, \beta_2$ are hyper-parameters. The centripetal acceleration methods can calibrate other gradient-based methods. The intuition behind the centripetal acceleration method is shown in Figure 2.
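As an illustration, the following minimal sketch (our own toy run; a single $\alpha$ and $\beta$ are used for both players and their values are chosen only for this example) applies Grad-SCA to the scalar bilinear game $V(\theta, \phi) = \theta\phi$.

```python
import numpy as np

def grad_sca(theta, phi, steps=500, alpha=0.05, beta=0.3):
    """Simultaneous centripetal acceleration on V(theta, phi) = theta * phi (sketch)."""
    g_theta_prev, g_phi_prev = phi, theta          # gradients at step t-1
    for _ in range(steps):
        g_theta, g_phi = phi, theta                # grad_theta V = phi, grad_phi V = theta
        theta = theta - alpha * g_theta - beta * (g_theta - g_theta_prev)
        phi   = phi   + alpha * g_phi   + beta * (g_phi   - g_phi_prev)
        g_theta_prev, g_phi_prev = g_theta, g_phi
    return theta, phi

print(grad_sca(1.0, 1.0))   # iterates approach the Nash equilibrium (0, 0)
```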
Predictive Projection Centripetal Acceleration (PPCA) [15]: The centripetal acceleration methods use the last iteration's information to update $(\theta_{t+1}, \phi_{t+1})$. However, there are methods that utilize a predictive step's information to update $(\theta_{t+1}, \phi_{t+1})$, such as MPM, OMD, and OGDA. MPM was introduced by Liang et al. (2019) [23]; it first takes a prediction (midpoint) step and then updates the current iterate with the gradient evaluated at that midpoint. Motivated by MPM and the centripetal acceleration methods, Li et al. (2020) [15] proposed the predictive projection centripetal acceleration method. They also approximate the cyclic behavior around a Nash equilibrium as a uniform circular motion process, but PPCA differs from Grad-SCA and Grad-ACA: the centripetal acceleration term is constructed from the information of a predictive step rather than of the last step. Meanwhile, they argue that the approximated centripetal acceleration term may not point exactly to the center. To make it point to the center precisely, they project the term $\nabla V(\theta_{t+1/2}, \phi_{t+1/2}) - \nabla V(\theta_t, \phi_t)$ onto the vector $\nabla V(\theta_t, \phi_t)$ at time t. PPCA can directly modify gradient descent ascent and alternating gradient descent ascent, and it can be understood intuitively from Figure 2. Li et al. (2020) propose two versions of the PPCA method, a simultaneous one and an alternating one, by constraining the coefficient matrix, which must be full rank in bilinear games under the specified situation [Lemma 3.2 in [15]]; $\gamma, \alpha, \beta$ are hyper-parameters. Although the methods mentioned above have achieved significant results in training GANs, some require high computing cost and large memory, and the rest depend closely on the approximate circular motion process. If a practical numerical experiment does not satisfy this approximation, these methods become invalid. In contrast, our adaptive composite gradient method reduces the computing cost and removes the limitation imposed by the approximate circular motion assumption.

Motivation
In this section, we first focus on the limitations of the methods mentioned in Section 2 and then describe the theory that motivates the method we propose in the next section.

Limitation Analysis
Hessian-based methods used to optimize n-player game problems, such as Consensus Optimization and Symplectic Gradient Adjustment, incur high computing costs. The dynamics of SGA are $g_\lambda := g + \lambda \cdot A^\top g$. Before updating the parameters, the SGA method must obtain the Hessian matrix. However, the time complexity of computing the Hessian is $O(n^3)$ and its space complexity is $O(n^2)$, just for one layer of a deep neural network. Consider a generative adversarial network with depth m and width n and a maximum iteration number I; the cost and memory of computing the Hessian are well known to be expensive. In contrast, our method uses predictive information to update the dynamics, reducing the computing cost of gradients and improving the efficiency of training deep networks because of its semi-gradient-free characteristic.
The centripetal acceleration methods depend closely on the assumption that the cyclic behavior around a Nash equilibrium is approximately a uniform circular motion around the origin. In realistic experiments, however, the cyclic behavior is not a uniform circular motion. The centripetal acceleration methods change the direction of the gradient vector field, which can push the iterates away from the Nash equilibrium in exceptional cases, as shown in Figure 3. If the assumption is not satisfied, the centripetal acceleration method becomes invalid.
The PPCA method is an improved version of the centripetal acceleration method. It also assumes that the cyclic behavior around the origin is approximately a circular motion process. PPCA uses the projection of the centripetal acceleration term, which points precisely to the origin, to compensate for the limitation of the centripetal acceleration methods, as shown in Figure 2. However, PPCA reduces to the centripetal acceleration method when the coefficient matrix A is full rank in bilinear games (shown in Figure 4), since the projection term is then zero [Lemma 3.2 in [15]], and other situations are not discussed in the PPCA paper. Meanwhile, the centripetal acceleration methods and PPCA methods only apply to two-player games, whereas our proposed Adaptive Composite Gradient method can be applied to n-player games.

Motivational Theory
Our idea is motivated by A³DMM [37]; we briefly review the A³DMM method here. Consider an optimization problem
$\min_{x \in \mathbb{R}^n,\ y \in \mathbb{R}^m} R(x) + J(y) \quad \text{s.t.} \quad Ax + By = b$,
under the following essential assumptions: • $R \in \Gamma_0(\mathbb{R}^n)$ and $J \in \Gamma_0(\mathbb{R}^m)$ are proper, convex, and lower semi-continuous functions. • $A$, $B$ are injective linear operators.
To derive the ADMM iteration, consider the augmented Lagrangian of the optimization problem,
$\mathcal{L}(x, y; \Psi) = R(x) + J(y) + \langle \Psi, Ax + By - b \rangle + \frac{\gamma}{2}\|Ax + By - b\|^2$,
where $\gamma > 0$ and $\Psi$ is the Lagrangian multiplier. The ADMM iteration then reads
$x_{k+1} = \arg\min_x \mathcal{L}(x, y_k; \Psi_k)$, $\quad y_{k+1} = \arg\min_y \mathcal{L}(x_{k+1}, y; \Psi_k)$, $\quad \Psi_{k+1} = \Psi_k + \gamma (A x_{k+1} + B y_{k+1} - b)$.
The trajectory of the sequence $Z_k$ generated by ADMM depends closely on the value of $\gamma$, where $k \in \mathbb{N}$. If a proper $\gamma$ is selected, the eventual trajectory of $Z_k$ is a spiral, as shown in Figure 4. Since the trajectory of $Z_k$ has this spiral characteristic, the previous q iterates can be used to predict the future s iterates: the update $Z_k$ of ADMM is estimated by $\bar{Z}_{k,s}$, obtained by fitting a linear combination of the last q differences $Z_k - Z_{k-1}, \dots, Z_{k-q+1} - Z_{k-q}$ and rolling that combination forward s steps.

By introducing a new variable, the above extrapolation can be rewritten as a one-step recursion for the choice of s = 1; iterating this recursion s times yields $\bar{Z}_{k,s} \approx Z_{k+s}$. This scheme was proposed by Clarice Poon and Jingwei Liang (2019) [37].
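The following minimal sketch (our own illustration, not the reference implementation of [37]) shows one way such a q-point linear extrapolation can be realized: fit coefficients to the past differences by least squares, then roll the fitted recursion forward s steps.

```python
import numpy as np

def extrapolate(Z_hist, q, s):
    """Predict Z_{k+s} from the last iterates Z_{k-q-1}, ..., Z_k (sketch).

    Fits v_k ~ V_{k-1} c by least squares, where v_j = Z_j - Z_{j-1},
    then rolls the fitted linear recursion forward s steps.
    """
    Z = [np.asarray(z, dtype=float) for z in Z_hist]
    v = [Z[i] - Z[i - 1] for i in range(1, len(Z))]      # differences v_j
    V_prev = np.stack(v[-q - 1:-1], axis=1)              # [v_{k-q}, ..., v_{k-1}]
    c, *_ = np.linalg.lstsq(V_prev, v[-1], rcond=None)   # fit v_k ~ V_prev @ c
    z, vs = Z[-1].copy(), list(v[-q:])
    for _ in range(s):                                   # extrapolate s steps
        v_next = np.stack(vs[-q:], axis=1) @ c
        z = z + v_next
        vs.append(v_next)
    return z

# Toy spiral trajectory: Z_{k+1} = M Z_k with a contracting rotation M.
M = 0.98 * np.array([[np.cos(0.3), -np.sin(0.3)], [np.sin(0.3), np.cos(0.3)]])
Z_hist, z = [], np.array([1.0, 0.0])
for _ in range(12):
    Z_hist.append(z.copy())
    z = M @ z
print(extrapolate(Z_hist, q=5, s=3), M @ M @ M @ Z_hist[-1])  # prediction vs. truth
```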
Our Adaptive Composite Gradient method is motivated by two observations. First, we still consider the cyclic behavior around a Nash equilibrium as a circular motion process around an origin, but not a uniform one. Therefore, similar to the centripetal acceleration method, we modify the directions by adding the projection of the centripetal acceleration term at time t. Second, A³DMM provides the idea that past iterates can be used to predict future iterates because the trajectory of the sequence $Z_k$ is either a straight line or a spiral, and the cyclic behavior around a Nash equilibrium is also approximately a spiral. In our method, to reduce the computing cost and accelerate the iteration, we treat the parameters controlled by the players as following a spiral trajectory, as shown in Figure 1. Based on these two motivations, we propose a semi-gradient-free method named the Adaptive Composite Gradient method to optimize game problems, which also alleviates cyclic behaviors and can be used to train GANs.

Adaptive Composite Gradient Method
In this section, we introduce the proposed Adaptive Composite Gradient method. First, to make our method easier to follow, we fix some notational conventions used throughout the paper. We write $\mathrm{Proj}_b(a)$ for the projection of $a$ onto $b$, where $a, b \in \mathbb{R}^n$. $w^i_t$ denotes the parameter controlled by the $i$-th player at time t and $W_t = (w^1_t, w^2_t, \dots, w^n_t)$. We use $\{\ell_i : \mathbb{R}^d \to \mathbb{R}\}_{i=1}^n$ to denote the losses of the n players, as in Definition 2.1, and $L(W_t) = (\ell_1(W_t), \dots, \ell_n(W_t))$ is the payoff vector of the n players at time t.
We consider a bilinear game problem of the following form:
$\min_{w^1} \max_{w^2} (w^1)^\top A\, w^2 + (w^1)^\top B + C^\top w^2$. (3)
Desiderata. This two-player game has a Nash equilibrium that must satisfy: D1. the two losses satisfy $\ell_1 + \ell_2 \equiv 0$; D2. each $\ell_i$ is differentiable over the parameter space $\Omega(w^1) \times \Omega(w^2)$. Player 1, holding the parameter $w^1$, tries to minimize the loss $\ell_1$, while player 2, holding the parameter $w^2$, tries to minimize the loss $\ell_2$. From D1 we get $\ell_1 = -\ell_2$, so equation (3) can be rewritten as a min-max problem over a single loss. As is well known, the traditional gradient descent ascent dynamics read $w^1_{t+1} = w^1_t - \alpha \nabla_{w^1} L((w^1_t, w^2_t))$, $w^2_{t+1} = w^2_t + \alpha \nabla_{w^2} L((w^1_t, w^2_t))$. Following the motivational theory of the previous section, we exploit the spiral characteristic to design the proposed Adaptive Composite Gradient (ACG) method. The ACG method involves three parts that constitute the composite gradients. First, we introduce the predictive part. In this section, $W_t = (w^1_t, w^2_t)$ is the parameter vector at time t. Similar to A³DMM, we utilize the W of the previous q iterations to predict the future s iterations, denoted by $\bar{W}_{t,s}$; looping s times gives $\bar{W}_{k,s} \approx W_{k+s}$. The second and third parts of our ACG method are $\nabla L((w^1_t, w^2_t))$ and the projection of the centripetal acceleration term.
The dynamics of the proposed Adaptive Composite Gradients combine these composite gradients, the predictive term, the gradient term, and the projected centripetal acceleration term, where $\nabla_{w^i} L((w^1_t, w^2_t))$ denotes the partial derivative with respect to $w^i$ at time t, $a_i$ denotes $\nabla_{w^i} L((w^1_t, w^2_t)) - \nabla_{w^i} L((w^1_{t-1}, w^2_{t-1}))$, and $\mathrm{Proj}_{\nabla_{w^i} L((w^1_{t-1}, w^2_{t-1}))}(a_i)$ is the projection of $a_i$ onto the vector $\nabla_{w^i} L((w^1_{t-1}, w^2_{t-1}))$.
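Since the exact update coefficients appear in the algorithm below, here is only a small, self-contained sketch (our own illustration; the weighting $\beta_1$ and the vector names are assumptions) of how the projected centripetal-acceleration term enters the composite gradient.

```python
import numpy as np

def proj(a, b):
    """Projection of vector a onto vector b: (a.b / b.b) * b."""
    denom = float(b @ b)
    return (a @ b) / denom * b if denom > 1e-12 else np.zeros_like(a)

def composite_gradient(grad_t, grad_prev, beta1=0.1):
    """Sketch: combine the current gradient with the projected centripetal term.

    a = grad_t - grad_prev is the discrete centripetal-acceleration term, and
    proj(a, grad_prev) is its projection onto the previous gradient, as in Section 4.
    The weighting by beta1 is an assumption made for illustration only.
    """
    a = grad_t - grad_prev
    return grad_t + beta1 * proj(a, grad_prev)

g_prev = np.array([1.0, 0.0])
g_t = np.array([0.8, 0.6])
print(composite_gradient(g_t, g_prev))   # current gradient nudged by the projected term
```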

Algorithm 1 ACG - Adaptive Composite Gradient method for the bilinear game
Input: losses $L = (\ell_1, \ell_2)$ and $W = (w^1, w^2)$.
Initialize: integers $s \ge 1$, $q \ge 1$, $k = q + 1$, learning rate $\alpha$, adaptive rates $\beta_1, \beta_2$, $W_0 = (w^1_0, w^2_0)$.
while not converged do
  for $t \ge 1$ do
    if mod(t, k) == 0 then
      Compute $C_t$ and the predictions $\bar{w}^1_{t+s}, \bar{w}^2_{t+s}$;
      Compute the composite gradients and perform the gradient update;
    else
      OMD update. //Replace this step with any optimizer
    end if
  end for
end while
In Algorithm 1, the Adaptive Composite Gradient method is stated for the bilinear game with two players. Remarkably, the ACG method can calibrate any optimizer based on gradient descent ascent, and it extends to games with n players. $g(W_t)$ is the gradient of the losses with respect to the corresponding players' parameters. Note that the losses are required to be differentiable. We compute $\bar{W}_{t+s}$ in the same way as in Algorithm 1. The dynamics of the ACG method for n players read: predict $\bar{W}_{t+s}$; form the composite gradient $G_W$; gradient step: $W_{t+s} = \bar{W}_{t,s} - \alpha G_W$, (5) where $a$ denotes $g(W_t) - g(W_{t-1})$ and $\mathrm{Proj}_{g(W_{t-1})}(a)$ is the projection of $a$ onto $g(W_{t-1})$.

Algorithm 2 ACG - Adaptive Composite Gradient method for the general game with n players
Input: losses $L(W)$ and $W = (w^1, w^2, \dots, w^n)$.
Initialize: integers $s \ge 1$, $q \ge 1$, $k = q + 1$, learning rate $\alpha$, adaptive rates $\beta_1, \beta_2$, $W_0 = (w^1_0, w^2_0, \dots, w^n_0)$.
while not converged do
  for $t \ge 1$ do
    if mod(t, k) == 0 then
      Compute $C_t$ and $\bar{W}_{t+s}$;
      Compute the composite gradients and perform the gradient update;
    else
      Gradient update;
    end if
  end for
end while
Remark 4.1. The value of k can be controlled in both Algorithm 1 and Algorithm 2. Letting $k = q + i$ with $i \in \mathbb{N}^+$, we can set different acceleration ratios of the algorithms by adjusting the values of k and s.

Note: (1) In Algorithm 1 and Algorithm 2, the memory cost of storing $V_k$ is $nq$ and the computational cost of obtaining the pseudoinverse of $V_k$ is $nq^2$. (2) Thanks to the prediction $\bar{W}_{t+s}$, there is no need to calculate the gradient at every iteration, so this is a semi-gradient-free method that reduces the computational cost of calculating gradients. (3) The $\beta_1, \beta_2$ can be used to control the convergence of the algorithms.
The basic intuition of our proposed Adaptive Composite Gradient method. To illustrate our approach, we chose s = 20 in this figure. We explore the influence of s on the convergence in Figure 14 of the Appendix.

The Convergence of Adaptive Composite Gradient Method for Bilinear Game
In this subsection, we discuss the convergence of the Adaptive Composite Gradient method in the bilinear game
$\min_\theta \max_\phi \theta^\top A \phi + \theta^\top B + C^\top \phi$. (6)
Any local Nash equilibrium $(\theta^*, \phi^*)$ of the bilinear game satisfies $A\phi^* + B = 0$ and $A^\top \theta^* + C = 0$. The local Nash equilibrium exists if and only if the ranks of A and $A^\top$ match the dimensions of B and C. In this way, without loss of generality, we can shift $(\theta, \phi)$ to $(\theta - \theta^*, \phi - \phi^*)$ and rewrite the bilinear game (6) as $\min_\theta \max_\phi \theta^\top A \phi$. Before analyzing the convergence of the Adaptive Composite Gradient method in a bilinear game, we introduce some essential theorems and propositions. Theorem 5.1 Suppose $F \in \mathbb{R}^{d \times d}$ comes from the iterative system $x_{k+1} = F x_k$. If F is nonsingular and its spectral radius satisfies $\rho(F) < 1$, then the $x_k$ of the iterative system converge to 0 linearly.
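As a quick numerical illustration of Theorem 5.1 (our own example, not taken from the paper), one can check the spectral radius of an iteration matrix and observe the linear decay of the iterate norm:

```python
import numpy as np

F = np.array([[0.8, 0.3],
              [-0.3, 0.8]])                      # contracting rotation, rho(F) < 1
rho = max(abs(np.linalg.eigvals(F)))
print("spectral radius:", rho)                   # about 0.854

x = np.array([1.0, 1.0])
for k in range(1, 31):
    x = F @ x
    if k % 10 == 0:
        print(k, np.linalg.norm(x))              # norm shrinks roughly like rho**k
```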
To discuss the convergence of the ACG method for the bilinear game, we divide the analysis into three parts (two cases). Without loss of generality, let t represent the iterative step and k the number of previous steps. The convergence property of Algorithm 1 is as follows.
When the iterative step t and the previous step count k satisfy the conditions of Case 1, our ACG method adopts the corresponding dynamics. Taking $\alpha = 2\beta = 2\eta$ in Case 1, the dynamics reduce to OMD, which can be written as
$\theta_{t+1} = \theta_t - 2\eta \nabla_\theta V(\theta_t, \phi_t) + \eta \nabla_\theta V(\theta_{t-1}, \phi_{t-1})$,
$\phi_{t+1} = \phi_t + 2\eta \nabla_\phi V(\theta_t, \phi_t) - \eta \nabla_\phi V(\theta_{t-1}, \phi_{t-1})$.
Theorem 5.2 specifies the condition on the learning rate of OMD, under which the convergence is exponential. The convergence of OMD can be found in [24] [Theorem 3].
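As a hedged illustration (our own toy run, not the paper's code), the OMD update above can be simulated on the scalar bilinear game $V(\theta, \phi) = \theta\phi$:

```python
import numpy as np

def omd_bilinear(theta=1.0, phi=1.0, eta=0.1, steps=2000):
    """OMD on V(theta, phi) = theta * phi: x_{t+1} = x_t - 2*eta*g_t + eta*g_{t-1} (sketch)."""
    g_theta_prev, g_phi_prev = phi, theta              # gradients at t-1
    for _ in range(steps):
        g_theta, g_phi = phi, theta                    # grad_theta V = phi, grad_phi V = theta
        theta = theta - 2 * eta * g_theta + eta * g_theta_prev
        phi   = phi   + 2 * eta * g_phi   - eta * g_phi_prev
        g_theta_prev, g_phi_prev = g_theta, g_phi
    return theta, phi

print(omd_bilinear())   # both coordinates decay toward the Nash equilibrium (0, 0)
```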
Then we discuss the convergence of the composite gradient update scheme, in which the composite gradients $G_{w^1}, G_{w^2}$ are as defined in Section 4. Proposition 5.3 For any two vectors a, b, the projection of the vector b onto the vector a can be written as $\gamma a$ for some $\gamma \in \mathbb{R}$.
According to Proposition 5.3, the composite gradient update scheme of our dynamics for the bilinear game reduces to a linear iteration, from which we can obtain the iterative matrix F, where τ is defined by (12).
Proposition 5.4 Suppose that A is square and nonsingular. Then the eigenvalues of F are the roots of a sixth-order polynomial, where Sp(·) denotes the collection of all eigenvalues.
Proposition 5.5 Suppose that A is square and nonsingular. Then $\|w^\phi_{t+1}\|_2$ converges linearly to 0 for a given γ when α and β1 satisfy the stated condition, where $\lambda_{\max}(\cdot)$ and $\lambda_{\min}(\cdot)$ denote the largest and smallest eigenvalues of $A^\top A$.

The Convergence of Composite Gradient Method for General N-player Game
This subsection discusses the convergence of the Adaptive Composite Gradient method in the general n-player game. The problem is described as in Definition 2.1; according to Algorithm 2, the convergence analysis for a general n-player game proceeds the same way as for the bilinear game, with three parts and two cases. Before analyzing the convergence property, we introduce some basic definitions. Definition 5.6 Suppose that $f: \mathbb{R}^n \to \mathbb{R}$ is convex and continuously differentiable. If for all $x, y \in \mathbb{R}^n$ the gradient of f is Lipschitz continuous with constant L, i.e.
$\|\nabla f(x) - \nabla f(y)\| \le L\|x - y\|$,
we say that f belongs to the class $F^{1,1}_L$. If f is in addition strongly convex with modulus $\mu > 0$, i.e.
$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\mu}{2}\|y - x\|^2$,
we say that f belongs to $F^{1,1}_{\mu, L}$.
Next, we suppose that all $\ell_i$, $i = 1, 2, \dots, n$, belong to $F^{1,1}_L$. We then define a fixed point, also called a local Nash equilibrium of the game. Definition 5.7 $W^*$ is a local Nash equilibrium (fixed point) if $g(W^*) = 0$. We say that it is stable if $\nabla g(W^*) \succeq 0$ and unstable if $\nabla g(W^*) \preceq 0$. Theorem 5.8 [Nesterov 1983] Let f be a convex and β-smooth function. The well-known Nesterov's Accelerated Gradient Descent update can be written as
$y_{t+1} = x_t - \frac{1}{\beta}\nabla f(x_t), \quad x_{t+1} = (1 - \gamma_t)\, y_{t+1} + \gamma_t\, y_t$,
and it satisfies $f(y_t) - f(x^*) = O(1/t^2)$. Nesterov (1983) proposed the accelerated gradient method, which achieves the optimal $O(1/t^2)$ convergence rate.
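A minimal numerical sketch of Nesterov's scheme on a simple smooth convex quadratic (our own example; the momentum schedule below is the standard one associated with Nesterov (1983), not something stated in this paper):

```python
import numpy as np

def nesterov_agd(grad, x0, beta, iters=200):
    """Nesterov's accelerated gradient descent for a beta-smooth convex f (sketch)."""
    x, y_prev = x0.copy(), x0.copy()
    lam = 1.0                                    # lambda_1 (lambda_0 = 0 gives lambda_1 = 1)
    for _ in range(iters):
        lam_next = (1.0 + np.sqrt(1.0 + 4.0 * lam ** 2)) / 2.0
        gamma = (1.0 - lam) / lam_next           # momentum coefficient gamma_t
        y = x - grad(x) / beta                   # gradient step
        x = (1.0 - gamma) * y + gamma * y_prev   # momentum combination
        y_prev, lam = y, lam_next
    return y_prev

# f(x) = 0.5 * x^T Q x with Q positive definite; beta = largest eigenvalue of Q.
Q = np.diag([1.0, 10.0])
x_star = nesterov_agd(lambda x: Q @ x, x0=np.array([5.0, 5.0]), beta=10.0)
print(x_star)   # approaches the minimizer (0, 0)
```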
The convergence of the ACG method for the general n-player game is also divided into two cases. Let t represent the iterative step and k the number of previous steps. The convergence property of Algorithm 2 is as follows.
From Algorithm 2, if t and k satisfy the conditions of Case 1, our proposed ACG method coincides with classical gradient descent, and the update scheme is $W_{t+1} = W_t - \alpha\, g(W_t)$, where α is a positive step-size parameter. According to Definition 5.7, let $W^*$ be the local Nash equilibrium and $L^* = L(W^*)$.
Based on Definition 5.6, if all the losses belong to $F^{1,1}_L$, the standard descent guarantees apply. A more detailed convergence analysis of averaged iterates under the generalized gradient descent update scheme in convex-concave games is given in [3,31].
According to Algorithm 2, we first compute $\bar{W}_{t+s}$ by the extrapolation step (21). The convergence of formula (21) is the same as that of Case 2 in the previous Section 5.1; more details on the convergence of (21) are discussed in [37] (Proposition 4.2).
In Case 2, we mainly analyze the convergence of the composite gradient update scheme in formula (5). Before illustrating the convergence property of our method, note that formula (19) can be equivalently written in a two-step momentum form (22), where α and β are step-size parameters. Based on Proposition 5.3, our proposed composite gradient method (5) can be transformed into a formula (23) similar to (22), where $\bar{W}_{t+s}$ plays the role of $(W_t - W_{t-1})$. Comparing (22) with (23), if the parameters are equivalently transformed, our proposed adaptive composite gradient method reduces to Nesterov's Accelerated Gradient (NAG) method. In 1983, Nesterov gave the $O(1/t^2)$ convergence rate for convex smooth optimization [32], and convergence bounds for convex, non-convex, and smooth optimization are given in [50] (Theorems 1-4). Remark 5.9. In Case 2, our proposed Adaptive Composite Gradient method has the same convergence rate and the same convergence bounds as the NAG method, under the assumption that all $\ell_i$, $i = 1, 2, \dots, n$, belong to $F^{1,1}_L$. We thus naturally obtain the $O(1/t^2)$ convergence rate of the ACG method from Theorem 5.8, and we do not repeat the description of the convergence of our algorithm, which is the same as that of the NAG method.

Experiments
This article conducts numerical experiments in three parts: toy-function simulations, mixtures of Gaussians, and four prevalent datasets. We give details of each experimental setup and provide the detailed experimental environment for the last two parts of the experiments.

Toy Functions Simulation
In this section, we describe our experiments on toy functions. We tested our ACG method of Algorithm 1 and Algorithm 2 on a bilinear game and on a general game with three players, respectively. We tested the ACG method of Algorithm 1 on a simple bilinear game whose Nash equilibrium (stationary point) is obviously (0, 0). We compared our ACG with several other methods, whose results are presented in Figure 6 (a). From the behaviors in Figure 6, the Sim-GDA method diverges and the Alt-GDA method rotates around the stationary point, whereas the other methods all converge to the Nash equilibrium. Our proposed ACG method converges faster than the other convergent methods.
In Figure 6 (b), we test our ACG method on a general zero-sum game. All the compared methods converge to the origin in this game. Notably, the cyclic behavior of the Alt-GDA method disappears, and the Sim-GDA method converges. It is worth noting that the trajectory of our ACG method is the same as that of PPCA [15]; both our ACG and PPCA [15] appear faster than the others. We also compared ACG with other methods on another general game, whose results are presented in Figure 6 (c): Sim-GDA and Grad-SCA diverge, while the remaining methods converge. APPCA [15] is faster than our ACG method in this game.
We used the last general zero-sum game (24) to test the robustness of the proposed ACG method in Figure 7, increasing the learning rate α through {0.01, 0.05, 0.1} while keeping the other parameters the same. The ACG method converges faster with learning rates α = 0.01 and α = 0.05; although it converges more slowly with α = 0.1, it still converges rapidly to the origin.
The proposed Adaptive Composite Gradient (ACG) method is also suitable for general games with n players, although it is challenging to visualize a general n-player game with a toy function. To illustrate that Algorithm 2 adapts to n-player games, we show its effect on a general three-player game whose local Nash equilibrium is (0, 0, 0). The results are shown in Figure 8: the top row shows the trajectories of the compared methods, and the second row shows the Euclidean distance from the origin at each iteration. Figure 8 shows that SGD, SGA, and our ACG method all converge to the origin. SGD exhibits cyclic behavior, which makes it converge slowly. The second row of Figure 8 shows that the proposed ACG method approaches the origin faster than SGA and SGD.

Mixtures of Gaussians
In this section, we concentrate on the mixture-of-Gaussians experiments. GANs are a typical example of a two-player game in deep learning. We tested the proposed ACG method by training a toy GAN model and compared our method with other well-known methods on learning mixtures of 5 Gaussians and 16 Gaussians. Both the 16-Gaussian and 5-Gaussian mixtures use a standard deviation of 0.02. The ground truths for the mixtures of 16 and 5 Gaussians are presented in Appendix Figure 13.
Details on the network architecture. The GAN consists of a generator network and a discriminator network. We set up both the generator and discriminator networks with six fully connected layers of 256 neurons each. We use a fully connected layer, instead of a sigmoid layer, as the final layer appended to the discriminator. A ReLU activation follows each of the six layers in both the generator and discriminator networks. The generator network has two output neurons, while the discriminator network has one output. The input of the generator is random noise sampled from a standard Gaussian distribution, the output of the generator is used as the input of the discriminator, and the output of the discriminator evaluates the quality of the points produced by the generator.
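A minimal PyTorch sketch of this toy architecture (our own reconstruction from the description above; the layer sizes follow the text, while everything else, such as the helper name and the noise dimension of 2, is an assumption):

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256, depth=6):
    """Six fully connected layers of 256 neurons, each followed by ReLU, then a linear head."""
    layers, d = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))   # final fully connected layer (no sigmoid)
    return nn.Sequential(*layers)

generator = mlp(in_dim=2, out_dim=2)       # noise z -> 2-D point; noise dim assumed to be 2
discriminator = mlp(in_dim=2, out_dim=1)   # 2-D point -> single score

z = torch.randn(64, 2)                     # standard Gaussian noise
fake_points = generator(z)
scores = discriminator(fake_points)
print(fake_points.shape, scores.shape)     # torch.Size([64, 2]) torch.Size([64, 1])
```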
We conducted the experiments on the mixture of 5 Gaussians with the proposed ACG method and several other methods, as shown in Appendix Figure 15. The training settings of all compared algorithms are as follows: • RMSP: We use the simultaneous RMSPropOptimizer provided by TensorFlow with learning rate α = 5 × 10⁻⁴.
• SGA-ACG (ours): Our proposed ACG method of Algorithm 1 on top of the SGA optimizer, realized in PyTorch with learning rate α = 5 × 10⁻⁴, β1 = 5 × 10⁻⁷, β2 = α. Figure 15 shows that RMSP, RMSP-alt, and RMSP-ACA do not converge after 10,000 iterations. In contrast, the ConOpt, RMSP-SGA, and SGA-ACG algorithms all converge, and the generated mixture of 5 Gaussians nearly matches the ground truth in Figure 13. ConOpt, RMSP-SGA, and SGA-ACG (ours) appear to have the same convergence speed. To compare the convergence speed of all six algorithms, we use the same training settings as for the mixture of 5 Gaussians to conduct the mixture of 16 Gaussians experiment, as shown in Figure 10.
From Figure 10, it is obvious that our proposed SGA-ACG method converges faster than ConOpt and RMSP-SGA. More comparisons are shown in Appendix Figure 16, where RMSP, RMSP-alt, and RMSP-ACA still do not converge after 10,000 iterations. To present the running time, Figure 9 shows the time consumed by the methods compared in Figure 16. There is a parameter s in our proposed Algorithm 1 and Algorithm 2. To explore the influence of s on the final results, we conduct a series of experiments with s in {50, 100, 150, 200} on the mixture of 16 Gaussians, as shown in Appendix Figure 14, which shows that the proposed SGA-ACG method converges faster as s increases.

Figure 8: The effects of SGD, SGA, and the proposed ACG method in the general three-player game.

Experiments on Prevalent Datasets
This section presents the third experiment, testing our proposed ACG method on image generation tasks. We employ four prevalent datasets to illustrate that our ACG method can be applied in deep learning: the standard MNIST [19], Fashion-MNIST [48], CIFAR-10 [17], and CelebA [26] datasets.

Network Architecture
We choose two kinds of network architectures for GANs on the MNIST dataset. For the first kind of network structure, we employ two fully connected layers with 256 and 512 neurons for the generator network, each followed by a LeakyReLU layer with α = 0.2, and a Tanh activation layer as the last layer of the generator. The input of the generator is 100-dimensional random noise sampled from a standard Gaussian distribution, and its output is an image of shape (28, 28, 1). For the discriminator network, we also use two fully connected layers with 512 and 256 neurons, each followed by a LeakyReLU layer with α = 0.2, as in the generator. The last layer of the discriminator uses a Sigmoid activation. The input of the discriminator includes both the generated image and the ground-truth MNIST image, and its output evaluates the quality of the image produced by the generator. For the second kind of network structure, we use the DCGAN architecture [39], keeping only 4 layers of DCGAN [39] for both the generator and discriminator networks.
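A hedged PyTorch sketch of the first (fully connected) architecture, reconstructed from the description above; the output sizes not stated explicitly in the text (such as the generator's final layer mapping to 28×28 pixels) are assumptions:

```python
import torch
import torch.nn as nn

generator = nn.Sequential(
    nn.Linear(100, 256), nn.LeakyReLU(0.2),   # 100-D Gaussian noise -> 256
    nn.Linear(256, 512), nn.LeakyReLU(0.2),
    nn.Linear(512, 28 * 28), nn.Tanh(),       # assumed final layer to a (28, 28, 1) image
)

discriminator = nn.Sequential(
    nn.Linear(28 * 28, 512), nn.LeakyReLU(0.2),
    nn.Linear(512, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),          # assumed final layer to a single score
)

z = torch.randn(16, 100)
fake = generator(z).view(-1, 1, 28, 28)       # reshape to image form (N, 1, 28, 28)
score = discriminator(fake.view(-1, 28 * 28))
print(fake.shape, score.shape)                # torch.Size([16, 1, 28, 28]) torch.Size([16, 1])
```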

Conclusion
This article proposed the Adaptive Composite Gradient (ACG) method to find a local Nash equilibrium in general zero-sum smooth games. Inspired by PPCA, OMD, and A³DMM, the ACG method alleviates cyclic behaviors in bilinear games and in training GANs. The proposed algorithm has strong compatibility and robustness; it can easily be integrated with SGD, Adam, RMSProp, SGA, and other gradient-based optimizers. Since the ACG method employs predicted information for the next s iterations, it is a novel semi-gradient-free algorithm. The ACG method has a linear convergence rate in general zero-sum games, and the three parts of our experiments show that our algorithm is preferable to and faster than previous works. Furthermore, SGA-ACG is competitive with ConOpt and SGA on the mixture-of-Gaussians generation tasks. Finally, we show by a toy-function experiment that our ACG method can be extended to general zero-sum games with n players.
However, our research objectives are limited to convex and smooth simple zero-sum games. Non-convex and non-smooth games are more complex, and it is more challenging to find a local Nash equilibrium for them. Therefore, optimizing and finding local solutions for non-convex and non-smooth games remains a challenging task worth researching in the future.

A Proofs in Section 5

According to (A.5), 0 and 1 cannot be its roots because A is nonsingular and square, so (A.5) is equivalent to a reduced polynomial, and we obtain that the eigenvalues of F are the roots of the sixth-order polynomial.

A.3 Proof of Proposition 5.5. Proof. Set the characteristic polynomial of the matrix (16) equal to 0, written as (A.7). It is obvious that (A.7) has 6 roots, and λ1 = λ2 = τ are two of them. According to the convergence of formula (12), τ is small and |τ| < 1. In this case, whatever the values of α, β1, and γ are, the dynamic system converges to the Nash equilibrium, which is trivial, so we mainly discuss the remaining polynomial.

Figure 2: Left: the basic intuition of the centripetal acceleration methods in [36]. Right: the basic intuition of the PPCA methods in [15].


Figure 9: The time consumed by the compared methods on the mixture of 16 Gaussians in Appendix Figure 16. Our proposed method SGA-ACG takes more time than RMSProp, RMSProp-alt, and RMSProp-ACA, but less time than the ConOpt and RMSProp-SGA methods.

Figure 10: Compared results on the mixture of 16 Gaussians. Each row represents one algorithm, and the columns show that algorithm at 2000, 4000, 6000, 8000, and 10000 iterations, respectively.

Figure 14: Exploring s on the mixture of 16 Gaussians. Each row shows results with s at a different value, and each column shows the results at iteration numbers {2000, 4000, 6000, 8000, 10000}. This figure shows that the method converges faster as s increases.
Definition 2.1 A game is a set of players $[p] = \{1, 2, \dots, n\}$ with loss functions $\{\ell_i : \mathbb{R}^d \to \mathbb{R}\}_{i=1}^n$ that are twice continuously differentiable. The players' parameters are $w = (w_1, w_2, \dots, w_n) \in \mathbb{R}^d$ with $w_i \in \mathbb{R}^{d_i}$, where $\sum_{i=1}^n d_i = d$. The $i$-th player controls $w_i$.