ABSTRACT
The wide applications of generative adversarial networks (GANs) benefit from successful training methods that guarantee an objective function converges to a local minimum. Nevertheless, designing an efficient and competitive training method remains challenging because of the cyclic behaviors of some gradient-based methods and the expensive computational cost of acquiring the Hessian matrix. To address this problem, we propose the Adaptive Composite Gradients (ACG) method, which is linearly convergent in bilinear games under suitable settings. Theoretical analysis and toy-function experiments both suggest that our approach alleviates cyclic behaviors and converges faster than recently proposed SOTA algorithms; the convergence speed of ACG is improved by 33% over the other methods. Our ACG method is a novel Semi-Gradient-Free algorithm that reduces the computational cost of gradients and Hessians by utilizing predictive information from future iterations. Mixture-of-Gaussians experiments and real-world image generation experiments show that our ACG method outperforms several existing techniques, illustrating the superiority and efficacy of our method.
1. INTRODUCTION
Gradient descent-based machine learning and deep learning methods have been widely used in various computer science tasks over the past several decades. Optimizing a single-objective problem with gradient descent can easily converge to a saddle point in some cases [1]. However, there is a growing set of multi-objective problems in numerous fields, such as deep reinforcement learning [2, 3], game theory, machine learning, and deep learning. Generative Adversarial Networks (GANs) [4] are a classical multi-objective problem in deep learning. GANs have a wide range of applications [5] because of their capability to learn to generate complex and high-dimensional target distributions. The existing literature on GANs can be divided into four categories: music generation [6, 7, 8], natural language [9, 10, 11, 12], methods of training GANs [13, 14, 15, 16], and image processing [17, 18, 19, 20]. GANs have made remarkable progress in image processing, including video generation [21, 22], noise removal [23], deblurring [24], image-to-image translation [25, 26], image super-resolution [17], and medical image processing [27].
The GAN framework consists of two deep neural networks: a generator network and a discriminator network. The generator is given a noise sample from a simple known distribution as input and produces a fake sample as output. The generator learns to make such fake samples not by directly using real data, but by adversarial training against the discriminator network. Bilinear games are two-player, non-cooperative zero-sum games with compact polytopal strategy sets. If the generator and discriminator exchange no information, then training GANs is a non-cooperative zero-sum game; therefore, GANs can be considered a bilinear game under suitable scenarios. The objective function of GANs [4] is often formulated as a two-player min-max game with a Nash equilibrium at the saddle points:

min_G max_D V(D, G) = E_{x∼P_X(x)}[log D(x)] + E_{z∼P_Z(z)}[log(1 − D(G(z)))]
where x∼P_X(x) denotes an actual data sample and z∼P_Z(z) denotes a sample from a noise distribution (often a uniform or Gaussian distribution). More forms of the GAN objective function are surveyed in [28]. Although GANs have achieved remarkable applications, training them stably and quickly [29, 30] is still challenging, since the strongly coupled gradient vector field rotates around a Nash equilibrium (see Figure 1). Moreover, the gradient descent ascent-based methods used to optimize the GAN objective tend to exhibit limiting oscillatory behavior because of the imaginary components of the Jacobian eigenvalues.
(a): the strong rotational gradient field around a Nash equilibrium; (b): comparison of convergence behaviors among several recently proposed methods. Our ACG method clearly converges faster than the others. See Section 6.1 for more details.
The main idea of this work is to reduce the computational cost of the Hessian matrix in consensus optimization and SGA. Motivated by [15, 16] and [31], we propose a novel Adaptive Composite Gradient method, which can be used to calibrate and accelerate traditional methods such as SGD, RMSProp, Adam, consensus optimization, and SGA. The ACG method exploits three kinds of information in the iteration process: gradient information from past iteration steps, adaptive and predictive information for future iteration steps, and the projection information of the current iteration step mentioned in [16]. We fuse this information into a composite gradient to update the scheme in our algorithm, which can be deployed in deep networks and used to train GANs. The main contributions of this paper are as follows:
We propose a novel adaptive composite gradient (ACG) method, which can alleviate cyclic behaviors in training GANs. Meanwhile, ACG can reduce the computational consumption of gradients and improve convergence speed.
For purely adversarial bilinear game problems, we prove that the ACG method is linearly convergent under suitable conditions. In addition, we extend the ACG method to three-player game problems and verify its effect and efficiency with toy models.
Comprehensive experiments are conducted to test the effect of training GANs and Deep Convolutional Generative Adversarial Networks (DCGANs). The proposed method obtains competitive results compared with state-of-the-art (SoTA) methods.
2. RELATED WORK
There are several distinctive approaches to improving the training of GANs, but each shows limitations in some cases. Some depend closely on prior assumptions, which can render them invalid when those assumptions fail. Moreover, some must pay the computational cost of the Hessian in the dynamics. We discuss related research in this section.
Symplectic Gradient Adjustment (SGA) [32]: Compared with traditional games, SGA does not constrain the players' parameter sets or require the loss functions to be convex. General games can be decomposed into a potential game and a Hamiltonian game [32]. To introduce our method, we first review the SGA method as follows.
Definition 2.1 A game is a set of n players {1, …, n} with loss functions {ℓ1, …, ℓn} that are twice continuously differentiable. The players' parameters are w = (w1, …, wn) ∈ ℝ^d, where d = d1 + … + dn. The i-th player controls wi.
We use g(w) to denote the simultaneous gradient, i.e., the gradient of each loss with respect to the corresponding player's parameters, g(w) = (∇w1ℓ1, …, ∇wnℓn). For a bilinear game, the losses satisfy:

ℓ1(x, y) = xᵀCy,  ℓ2(x, y) = −xᵀCy
This kind of game has a Nash equilibrium at (x, y) = (0, 0). The simultaneous gradient g(x, y) = (Cy, −Cᵀx) rotates around the Nash equilibrium, as shown in Fig. 6.
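This rotation is easy to reproduce numerically. The following sketch (our illustration, not the paper's code) runs simultaneous gradient descent ascent on ℓ1 = xᵀCy, ℓ2 = −xᵀCy and shows that the iterates spiral away from the Nash point (0, 0):

```python
import numpy as np

# Simultaneous GDA on the bilinear game l1 = x^T C y, l2 = -x^T C y.
# The simultaneous gradient g(x, y) = (C y, -C^T x) rotates around (0, 0),
# so simultaneous updates spiral outward instead of converging.

def sim_gda(C, x, y, lr=0.1, steps=100):
    """Run simultaneous GDA; return the final distance to (0, 0)."""
    for _ in range(steps):
        gx = C @ y           # gradient of l1 w.r.t. x
        gy = -C.T @ x        # gradient of l2 w.r.t. y
        x, y = x - lr * gx, y - lr * gy   # both players step at once
    return np.sqrt(np.linalg.norm(x) ** 2 + np.linalg.norm(y) ** 2)

C = np.eye(2)
x0 = np.array([1.0, 0.0])
y0 = np.array([0.0, 1.0])
d0 = np.sqrt(np.linalg.norm(x0) ** 2 + np.linalg.norm(y0) ** 2)
d = sim_gda(C, x0, y0)   # the distance to the Nash point has grown
```

With C = I each coordinate pair evolves as z ← (1 + i·lr) z, so the distance to the origin grows by a factor √(1 + lr²) every step.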
We can derive the Hessian of an n-player game from the simultaneous gradient g(w): H(w) = ∇w g(w), whose (i, j)-th block is ∇wj ∇wi ℓi. The matrix form of the Hessian is as follows:
Applying the generalized Helmholtz decomposition [Lemma 1 in [32]] to the above Hessian of the game, we have H(w) = S(w) + A(w), where S is symmetric and A is antisymmetric. David et al. (2018) [32] pointed out that a game is a potential game if A(w) ≡ 0 and a Hamiltonian game if S(w) ≡ 0. Potential games and Hamiltonian games are both well studied and easy to solve. Since the cyclic behavior around the Nash equilibrium is caused by the simultaneous gradient, David et al. [32] proposed the Symplectic Gradient Adjustment method, which is as follows:
where A is from the Helmholtz decomposition of the Hessian. gλ replaces the gradient in the iterates of GDA-based methods, and using gλ to train GANs can alleviate cyclic behaviors. If we consider the players in a bilinear game as the networks of a GAN, the SGA algorithm must pay the expensive computational cost of the Hessian, which lowers its efficiency.
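The adjustment can be made concrete on the bilinear game above. The sketch below is our specialization, not the paper's code: for ℓ1 = xᵀCy, ℓ2 = −xᵀCy the game Hessian H = [[0, C], [−Cᵀ, 0]] is purely antisymmetric (S = 0, A = H), so the correction Aᵀg has the closed form (CCᵀx, CᵀCy):

```python
import numpy as np

# SGA [32] specialized to the bilinear game l1 = x^T C y, l2 = -x^T C y.
# Here S = 0 and A = H, so the adjustment A^T g can be written directly.

def sga_step(C, x, y, lr=0.1, lam=1.0):
    gx, gy = C @ y, -C.T @ x                  # simultaneous gradient g
    adj_x, adj_y = C @ C.T @ x, C.T @ C @ y   # A^T g for this game
    x = x - lr * (gx + lam * adj_x)
    y = y - lr * (gy + lam * adj_y)
    return x, y

C = np.eye(2)
x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
for _ in range(200):
    x, y = sga_step(C, x, y)
# The adjusted dynamics spiral inward toward the Nash point (0, 0).
```

With C = I, lr = 0.1 and λ = 1 each step multiplies the distance to the origin by √((1 − lr·λ)² + lr²) ≈ 0.906, so the iterates contract linearly.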
Centripetal Acceleration [15]: The simultaneous gradient exhibits cyclic behavior around the Nash equilibrium. Hamiltonian games obey a conservation law under these gradient descent-based methods, so the cyclic behavior can be considered a uniform circular motion. As is well known, the centripetal acceleration in uniform circular motion points to the center of the circle; this characteristic can be used to modify the direction of the simultaneous gradient vector field and alleviate the cyclic behavior. Based on these observations, Peng et al. (2020) [15] proposed the centripetal acceleration methods, derived in two versions, Simultaneous Centripetal Acceleration (Grad-SCA) and Alternating Centripetal Acceleration (Grad-ACA), which are used to train GANs. Next, we review the centripetal acceleration methods.
Given a bilinear game, the losses are ℓ1(θ, φ) and ℓ2(θ, φ), corresponding to player 1 and player 2. The parameter space is θ × φ, where θ, φ ∈ ℝn. Player 1 controls θ and tries to minimize the payoff function ℓ1, while player 2 controls φ and tries to minimize the payoff function ℓ2 under the non-cooperative setting. The game is a process in which the two players adjust their parameters to find a local Nash equilibrium satisfying the following two requirements:
The centripetal acceleration methods require that the two-player game is differentiable. Then, the above two payoff functions can be combined into a joint payoff function because of the zero-sum property of the game:
The derivation of Eq. (1) leads to a two-player game, which can be rewritten as V(θ, φ). The problem becomes finding a local Nash equilibrium:
where
To introduce the centripetal acceleration methods, we first review the simultaneous gradient descent method in [33]:
The alternating version based on simultaneous gradient descent is
where α is the learning rate. The centripetal acceleration methods directly utilize a centripetal-acceleration term to adjust simultaneous gradient descent. Gradient descent with simultaneous centripetal acceleration is then introduced as:
We can also obtain the gradient descent with the alternating centripetal acceleration method:
where α1, β1, α2, β2 in the above two versions are hyperparameters. The centripetal acceleration methods can calibrate other gradient-based methods. The intuition behind the centripetal acceleration method is shown in Figure 2.
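The simultaneous version can be sketched on the bilinear game V(θ, φ) = θᵀφ. The code below is our illustration under those assumptions, not the authors' implementation: the correction term is the difference between the current and last-step gradients.

```python
import numpy as np

# Grad-SCA sketch on V(theta, phi) = theta^T phi, where grad_theta V = phi
# and grad_phi V = theta. The gradient difference (g_t - g_{t-1}) acts as
# the centripetal-acceleration correction.

def grad_sca(theta, phi, alpha=0.1, beta=0.3, steps=1000):
    g_theta_prev, g_phi_prev = phi.copy(), theta.copy()  # gradients at t-1
    for _ in range(steps):
        g_theta, g_phi = phi, theta        # gradients at the current point
        theta = theta - alpha * g_theta - beta * (g_theta - g_theta_prev)
        phi = phi + alpha * g_phi + beta * (g_phi - g_phi_prev)
        g_theta_prev, g_phi_prev = g_theta, g_phi
    return theta, phi

theta, phi = grad_sca(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
# Both parameters approach the Nash point (0, 0).
```

For α = 0.1, β = 0.3 the linearized dynamics have spectral radius about 0.975, so the iterates contract toward the origin instead of cycling.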
Left: the basic intuition of the centripetal acceleration methods in [15]. Right: the basic intuition of the PPCA methods in [16].
Predictive Projection Centripetal Acceleration (PPCA) [16]: The centripetal acceleration methods use the last iteration's information to update (θt+1, φt+1). However, some methods utilize predictive step information to update (θt+1, φt+1), such as MPM, OMD, and OGDA. MPM was introduced by Liang et al. (2019) [34], and its dynamics are as follows:
Motivated by MPM and the centripetal acceleration methods, Li et al. (2020) [16] proposed the predictive projection centripetal acceleration methods. They also approximate the cyclic behavior around a Nash equilibrium as a uniform circular motion, but differently from Grad-SCA and Grad-ACA: they construct the centripetal-acceleration term from the predictive step information rather than that of the last step. Meanwhile, they argue that the centripetal-acceleration term only points to the matched center approximately; to make it point to the center precisely, they proposed the projection centripetal acceleration methods. PPCA can directly modify gradient descent ascent and alternating gradient descent ascent. Figure 2 gives the intuition. The dynamics of predictive projection centripetal acceleration are as follows:
where the first term is the signed gradient vector at time t and the second is the projection of the centripetal-acceleration term onto that gradient vector.
Li et al. (2020) proposed two versions of the PPCA methods by constraining the coefficient matrix, which must be full rank in bilinear games under the specified situation (Lemma 3.2 in [16]). The form of the PPCA method for a bilinear game is
The alternating PPCA formula is as follows:
where γ, α, and β are all hyperparameters.
Although the methods above have made significant progress in training GANs, some require high computational cost and memory, and the rest depend closely on the approximate circular-motion assumption. If practical numerical experiments do not satisfy this approximation, these methods become invalid. In contrast, our adaptive composite gradient method reduces the computational cost and removes the limitation imposed by the approximate circular-motion assumption.
3. MOTIVATION
3.1 Limitation Analysis
Hessian-based methods for game problems, such as consensus optimization and SGA, bring high computational costs. The centripetal acceleration algorithm reduces these costs but depends on the approximately uniform circular motion assumption, shown in Figure 3. PPCA is an improved version of the centripetal acceleration algorithm, shown in Figure 4(b); however, PPCA needs a full-rank coefficient matrix, without which its projection degenerates to zero (Lemma 3.2 in [16]). The proposed ACG is motivated by two observations. First, we treat the cyclic behavior as a general circular motion rather than a uniform one; therefore, similarly to the centripetal acceleration method, we modify directions by adding the projection of the centripetal-acceleration term at time t. Second, A3DMM provides the idea of using past iterations to predict future iterations, because the trajectory of the sequence Zk is either straight or spiral, and the cyclic behavior is also approximately a spiral. Motivated by these two aspects, we propose a novel adaptive composite gradient (ACG) method to alleviate cyclic behavior in training GANs. ACG reduces computational cost and accelerates iteration by predicting future iterates, which is why we call it a Semi-Gradient-Free method.
The limitations of the centripetal acceleration method. Left: |∇Vt+δt| > |∇Vt|; Right: |∇Vt+δt| < |∇Vt|.
(a): The spiral trajectory of Zk. (b): The degenerated PPCA method in the case |b| = 0. The degenerated PPCA is the same as the centripetal acceleration method if δt = 1/2.
3.2 Motivational Theory
Our idea is motivated by A3DMM [31], which we now review. Given an optimisation problem
where the essential assumptions are as follows:
R ∈ Γ0(ℝn) and J ∈ Γ0(ℝm) are proper convex and lower semi-continuous functions.
A, B are injective linear operators.
ri (dom (R) ∩ dom (J)) ≠ Ø and the set of minimizers is non-empty.
To derive the iteration scheme, consider the augmented Lagrangian and rewrite the optimisation problem, which reads
where γ >0 and Ψ is the Lagrangian multiplier, then we have iteration forms:
We can rewrite the above iteration into the following formula by introducing a new variable
The trajectory of the sequence Zk depends closely on the value of γ, where k ∈ ℕ. With a proper choice of γ, the eventual trajectory of Zk is a spiral, as shown in Figure 4. Since the trajectory of Zk is spiral-like, the previous q iterations can be used to predict the future s iterations. The Zk of ADMM can be estimated by Z̄k,s, defined as follows:
for the choice of s = 1. Given a sequence Zk−j, j = 0, 1, …, q+1, we define the differences vi = Zi − Zi−1 and use the past q differences, collected into Vk, to estimate vk+1: fitting coefficients Ck, we approximate vk+1 by VkCk and advance the iterate accordingly. By iterating s times, we obtain the s-step prediction. This method was proposed by Clarice Poon et al. [31].
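The fit-and-roll-forward idea can be sketched in a few lines. The code below is our simplified reading of the extrapolation step (variable names are ours, not from [31]): fit the next difference as a linear combination of the previous q differences by least squares, then roll the prediction forward s steps.

```python
import numpy as np

# Extrapolation sketch in the spirit of A3DMM [31]: predict future
# iterates of a spiral-like sequence from its past differences.

def extrapolate(points, q=2, s=1):
    """points: iterates Z_0..Z_k; return a prediction of Z_{k+s}."""
    Z = [np.asarray(p, dtype=float) for p in points]
    v = [Z[i] - Z[i - 1] for i in range(1, len(Z))]   # differences v_i
    V = np.stack(v[-q - 1:-1], axis=1)                # past q differences
    c, *_ = np.linalg.lstsq(V, v[-1], rcond=None)     # fit V c ~= v_k
    z, vs = Z[-1], list(v[-q:])
    for _ in range(s):                                # roll forward s steps
        v_next = np.stack(vs[-q:], axis=1) @ c
        z = z + v_next
        vs.append(v_next)
    return z

# Example: iterates that spiral linearly, Z_{k+1} = 0.9 R Z_k.
ang = 0.3
R = 0.9 * np.array([[np.cos(ang), -np.sin(ang)], [np.sin(ang), np.cos(ang)]])
pts = [np.array([1.0, 0.0])]
for _ in range(6):
    pts.append(R @ pts[-1])
pred = extrapolate(pts[:-1], q=2, s=1)   # predict the held-out last point
```

For this linear spiral the differences satisfy an exact two-term recurrence (Cayley-Hamilton), so the one-step prediction matches the held-out point up to floating-point error.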
4. ADAPTIVE COMPOSITE GRADIENT METHOD
For easier understanding of our method, we adopt some symbol conventions throughout the paper. Let a ⊙ b denote the projection of a onto b, where a, b ∈ ℝn and ⊙ denotes the projection calculation between two vectors. wti denotes the parameter controlled by the i-th player at time t, and Wt = (wt1, …, wtn). We use ℓ1, …, ℓn to denote the losses corresponding to the n players, as in Definition 2.1. We then obtain the payoff vector of the n players at time t.
We consider a bilinear game problem with the following form
Desiderata. This two-player game has a Nash equilibrium which must satisfy
D1. The two losses satisfy:
D2. Each ℓi is differentiable over the parameter space Ω(w1) × Ω(w2).
Player 1 holds the parameter w1 and tries to minimize the loss ℓ1, while player 2 holds w2 and tries to minimize ℓ2. From D1 we get ℓ1 = −ℓ2, so formula (22) can be rewritten as
The dynamics of the gradient descent ascent-based method are
The ACG method involves three parts, which together constitute the composite gradient. First, we introduce the predictive part. In this section, Wt is the parameter vector at time t. Similarly to A3DMM, we utilize the W of the previous q iterations to predict the future s iterations. We obtain the following formula:
for the value of s = 1. Define vi = Wi − Wi−1, where Wi is from the iterate sequence. We use the past q differences to approximate the latest vk and roll the approximation forward; by looping s times, we obtain the s-step prediction. The second and third parts of our ACG method are the gradient term and the projection of the centripetal-acceleration term. The dynamics of the proposed ACG method are
where the first term denotes the partial derivative of ℓi with respect to wi at time t, and the last term is the projection of the centripetal-acceleration term onto the gradient vector. The basic intuition of our proposed method is shown in Figure 5.
The basic intuition of our proposed Adaptive Composite Gradient method. To illustrate our approach, we chose s = 20 in this figure.
For clarity, we draft the scheme of the proposed adaptive composite gradient method in Algorithm 1, applied to a two-player game. Note that the ACG method can calibrate any optimizer based on gradient descent ascent. Meanwhile, the ACG method extends to n-player games. Given an n-player game, let g(Wt) be the gradient of the losses for all players at time t. It is worth noting that all loss functions must be differentiable. We compute the prediction in the same way as in Algorithm 1. The dynamics of the ACG method for n players read
where the last term is the projection of the centripetal-acceleration term onto g(Wt−1).
Remark 4.1 The value of k can be controlled in both Algorithm 1 and Algorithm 2. Letting k = q + i, we can set different acceleration ratios of the algorithms by adjusting the values of k and s.
ACG - Adaptive Composite Gradient method for the n-player game.
In Algorithms 1 and 2, there is no need to compute the gradient at every iteration, because predicted iterates are used instead. Therefore, we also call it a Semi-Gradient-Free method, whose merit is that it reduces the computational cost and converges quickly. β1 and β2 can be used to control the convergence speed in our algorithms.
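The Semi-Gradient-Free mechanism can be illustrated on the bilinear game V(θ, φ) = θᵀφ. The sketch below is our heavily simplified reading of Algorithms 1 and 2, not the paper's exact scheme: the base step is the OMD-style update of Case 1 below, and every k iterations we fit extrapolation coefficients c to recent iterates; when the characteristic roots of c are contractive (ρ < 1), we jump s steps ahead with no gradient evaluations. The seed values for the "previous" gradients are arbitrary constants of ours.

```python
import numpy as np

# Hedged sketch of the ACG idea: optimistic gradient steps, interleaved
# with gradient-free extrapolation jumps whenever the fitted difference
# recurrence is contractive.

def acg_sketch(theta, phi, eta=0.1, k=10, q=4, s=5, steps=400):
    hist = [np.concatenate([theta, phi])]
    gt_prev = np.full_like(phi, 0.2)     # arbitrary seed "past" gradients
    gp_prev = np.full_like(theta, -0.1)
    for t in range(1, steps + 1):
        jumped = False
        if t % k == 0 and len(hist) >= q + 2:
            V = np.diff(np.stack(hist[-(q + 2):]), axis=0).T  # q+1 diffs
            c, *_ = np.linalg.lstsq(V[:, :-1], V[:, -1], rcond=None)
            rho = np.max(np.abs(np.roots(np.concatenate([[1.0], -c[::-1]]))))
            if rho < 1:                   # contractive: extrapolate s steps
                vs, z = [V[:, i] for i in range(1, q + 1)], hist[-1]
                for _ in range(s):
                    v_next = np.stack(vs[-q:], axis=1) @ c
                    z = z + v_next
                    vs.append(v_next)
                    hist.append(z)
                theta, phi = z[:2], z[2:]
                gt_prev, gp_prev = hist[-2][2:], hist[-2][:2]
                jumped = True
        if not jumped:                    # optimistic (OMD-style) step
            gt, gp = phi, theta           # grad_theta V, grad_phi V
            theta = theta - eta * (2 * gt - gt_prev)
            phi = phi + eta * (2 * gp - gp_prev)
            gt_prev, gp_prev = gt, gp
            hist.append(np.concatenate([theta, phi]))
    return theta, phi

theta0, phi0 = np.array([1.0, 0.4]), np.array([-0.3, 1.0])
theta, phi = acg_sketch(theta0.copy(), phi0.copy())
```

On this linear game the difference sequence satisfies an exact fourth-order recurrence, so the fitted jumps continue the true trajectory and the iterates contract toward the Nash point without extra gradient evaluations.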
5. THE CONVERGENCE OF ADAPTIVE COMPOSITE GRADIENT METHOD
5.1 The Convergence Analysis for the Bilinear Game
In this subsection, we mainly discuss the convergence of Adaptive Composite Gradient Method in the bilinear game, which reads
Any local Nash equilibrium (θ∗, φ∗) of the bilinear game satisfies the following conditions:
The local Nash equilibrium exists if and only if the ranks of A and Aᵀ equal the dimensions of B and C. In this way, without loss of generality, we can translate (θ, φ) so that the bilinear game (29) is rewritten as:
Before analyzing the convergence property of the Adaptive Composite Gradient Method in this situation, we introduce some essential theorems and propositions.
Theorem 5.1 Suppose {xk} is generated by the iterative system xk+1 = Fxk. If F is nonsingular and its spectral radius satisfies ρ(F) < 1, then xk converges to 0 linearly.
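A quick numerical check of Theorem 5.1 (the matrix F here is an arbitrary test matrix of ours, not one derived from the update scheme): when ρ(F) < 1, the iterates of xk+1 = Fxk shrink geometrically to 0.

```python
import numpy as np

# Verify linear convergence of x_{k+1} = F x_k when rho(F) < 1.
F = np.array([[0.5, 0.3], [-0.2, 0.6]])
rho = np.max(np.abs(np.linalg.eigvals(F)))   # spectral radius (= 0.6 here)

x = np.array([1.0, -1.0])
norms = []
for _ in range(50):
    x = F @ x
    norms.append(np.linalg.norm(x))
# norms decay roughly like rho**k, i.e. linear (geometric) convergence.
```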
Theorem 5.2 (OMD) Consider a bilinear game V(θ, φ) = θᵀAφ, where A is assumed to be full rank. Then the following dynamics,
with the learning rate
obtain an ε-minimizer, provided
under the assumption that
To discuss the convergence of the ACG method for the bilinear game, we divide the analysis into three parts (two cases). Without loss of generality, let t denote the iterative step and k the number of previous steps. The convergence property of Algorithm 1 is as follows.
Case 1. mod(t, k) ≠ 0, or mod(t, k) = 0 and ρ(Ck) ≥ 1. The ACG method adopts the dynamics:
Taking α = 2β = 2η in Case 1, the dynamics reduce to OMD, which can be written as:
Theorem 5.2 gives the condition on the learning rate of OMD, which converges exponentially. The convergence of OMD is established in [34] (Theorem 3).
Case 2. mod(t, k) = 0 and ρ(Ck) < 1. Following Algorithm 1, we first compute the prediction by:
Using the fixed-point formulation of ADMM, (35) can be written in a unified form; letting σt denote the residual, (35) converges iff σt converges to 0. The convergence of (35) is based on the convergence of the inexact Krasnosel'skiĭ-Mann fixed-point iteration in [31] (Proposition 5.34). The detailed convergence analysis has been discussed in [31] (Proposition 4.2).
Next we discuss the convergence of the composite-gradient update scheme, which is written as follows:
where the terms are defined as:
Proposition 5.3 Given any two vectors a and b, the projection of a onto b can be denoted as a ⊙ b = (aᵀb / bᵀb) b, which is parallel to b.
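As a minimal sketch of this projection (variable names ours): the projected vector is parallel to b, and the residual a − proj is orthogonal to b.

```python
import numpy as np

# Orthogonal projection of vector a onto vector b: (a.b / b.b) * b.
def project(a, b):
    return (a @ b) / (b @ b) * b

a = np.array([2.0, 1.0])
b = np.array([1.0, 0.0])
p = project(a, b)          # the component of a along b
# The residual a - p is orthogonal to b.
```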
According to Proposition 5.3, the composite-gradient update scheme of our bilinear-game dynamics reduces to
We can obtain the iterative matrix as:
where τ is defined by (35).
According to the iterative matrix, it is easy to obtain the iterates generated by (38). Under the assumption that A is square and nonsingular in Proposition 5.4, we use the well-known Theorem 5.1 to establish the linear convergence of the update scheme (38).
Proposition 5.4 Suppose that A is square and nonsingular. Then the eigenvalues of F are the roots of a sixth-order polynomial:
where Sp(·) denotes the collection of all eigenvalues.
Proposition 5.5 Suppose that A is square and nonsingular. Then the iterates are linearly convergent to 0 for a given γ with α and β1 satisfying
where λmax and λmin denote the largest and smallest eigenvalues of AᵀA.
5.2 The Convergence Analysis for the n-player Game
This subsection discusses the convergence of the Adaptive Composite Gradient method in the general n-player game, as described in Definition 2.1. According to Algorithm 2, the convergence analysis for the general n-player game parallels that of the bilinear game, with three parts and two cases. Before analyzing the convergence property, we introduce some basic definitions.
Definition 5.6 Suppose that f is convex and continuously differentiable. If the gradient of f is Lipschitz continuous with constant L, such that:
we say that f belongs to the class of L-smooth functions. If f is strongly convex with modulus μ > 0, such that:
we say that f belongs to the class of μ-strongly convex functions.
Next, we suppose that all the losses ℓi are L-smooth. We then give the definition of a fixed point, also called a Local Nash Equilibrium of the game.
Definition 5.7 W∗ is a Local Nash Equilibrium (fixed point) if it satisfies g(W∗) = 0. We say that it is stable if ∇g(W∗) ⪰ 0 and unstable if ∇g(W∗) ⪯ 0.
Theorem 5.8 [Nesterov 1983] Let f be a convex and β-smooth function. The well-known Nesterov's Accelerated Gradient Descent update scheme can be written as
Then Nesterov's Accelerated Gradient Descent satisfies
Nesterov (1983) proposed the accelerated gradient method which achieves the optimal convergence rate.
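A minimal runnable sketch of the NAG update scheme, on a β-smooth convex quadratic of our choosing (the test function is ours, used only to illustrate the scheme):

```python
import numpy as np

# Nesterov's Accelerated Gradient Descent on f(x) = 0.5 * x^T D x with
# diagonal D; f is beta-smooth with beta = max(D).
D = np.array([1.0, 10.0, 100.0])
beta = D.max()
grad = lambda x: D * x

x = np.array([1.0, 1.0, 1.0])
y_prev = x.copy()
for t in range(1, 201):
    y = x - grad(x) / beta                    # gradient step
    x = y + (t - 1) / (t + 2) * (y - y_prev)  # momentum/extrapolation step
    y_prev = y
f_final = 0.5 * np.sum(D * y ** 2)            # should be near the minimum 0
```

The classical guarantee f(y_T) − f* ≤ 2β‖x0 − x*‖²/T² gives a bound of about 0.015 after 200 iterations here; in practice the iterates get much closer to the minimum.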
The convergence of the ACG method for the general n-player game is also divided into two cases. Let t denote the iterative step and k the number of previous steps. The convergence property of Algorithm 2 is as follows.
Case 1. mod(t, k) ≠ 0, or mod(t, k) = 0 and ρ(Ck) ≥ 1. From Algorithm 2, if t and k satisfy the conditions of Case 1, our proposed ACG method is the same as classical gradient descent, and the update scheme is
where α is a positive step-size parameter. According to Definition 5.7, let W∗ be the Local Nash Equilibrium. Based on Definition 5.6, if all ℓi are L-smooth, then the associated averaged iterates {Wt} converge; a detailed convergence analysis of averaged iterates with generalized gradient descent updates in convex-concave games is given in [36, 37].
Case 2. mod(t, k) = 0 and ρ(Ck) < 1. According to Algorithm 2, we first compute the prediction by:
The convergence of formula (44) is the same as that of Case 2 in Section 5.1; more detail on the convergence of (44) is discussed in [31] (Proposition 4.2).
In Case 2, we mainly analyze the convergence of the composite-gradient update scheme in formula (28). Before illustrating the convergence property of our proposed method, we note that formula (42) can be equivalently written as
where α and β are step-size parameters. Our proposed composite gradient method (28) can be transformed into a formula similar to (45), based on Proposition 5.3. That is,
where the momentum term is equivalent to (Wt − Wt−1). Comparing (45) with (46), if the parameters are equivalently transformed, our proposed adaptive composite gradient method reduces to Nesterov's Accelerated Gradient (NAG) method. In 1983, Nesterov gave the O(1/t²) convergence rate for convex smooth optimization in [38]; convergence bounds for convex, non-convex, and smooth optimization are also given in [39] (Theorems 1-4).
Remark 5.1 In Case 2, our proposed Adaptive Composite Gradient method has the same convergence rate and convergence bounds as the NAG method, under the assumption that all ℓi are convex and L-smooth. We thus naturally obtain the O(1/t²) convergence rate of the ACG method based on Theorem 5.8.
6. EXPERIMENTS
6.1 Toy Functions Simulation
We tested our ACG method from Algorithms 1 and 2 on a bilinear game and on a general game with three players. We first tested Algorithm 1 on the following bilinear game:
The Nash equilibrium (stationary point) is obviously (0, 0). We compared our ACG with several other methods; the results are presented in Figure 6(a). As Figure 6 shows, Sim-GDA diverges and Alt-GDA rotates around the stationary point, while the other methods all converge to the Nash equilibrium. Our proposed ACG method converges faster than the other convergent methods.
The effects of various compared methods in two-player games with 150 iterations. The parameters of the compared methods differ across the three toy functions. For g1: Sim-GDA (α = 0.1), Alt-GDA (α = 0.1), Grad-SCA (α = 0.1, β = 0.3), Grad-ACA (α = 0.1, β = 0.3), OMD (α = 0.1, β = 0.1), MPM (α = 0.1, γ = 1.0), PPCA (α = 0.1, β = 0.3, γ = 1.0), APPCA (α = 0.1, β = 0.3, γ = 1.0), ACG (α = 0.05, β1 = 0.5, β2 = 1.0). For g2: Sim-GDA (α = 0.1), Alt-GDA (α = 0.1), Grad-SCA (α = 0.1, β = 0.3), Grad-ACA (α = 0.1, β = 0.03), OMD (α = 0.1, β = 0.1), MPM (α = 0.3, γ = 0.2), PPCA (α = 0.1, β = 0.3, γ = 1.0), APPCA (α = 0.1, β = 0.02, γ = 0.25), ACG (α = 0.1, β1 = 0.3, β2 = 1.0). For g3: Sim-GDA (α = 0.1), Alt-GDA (α = 0.1), Grad-SCA (α = 0.1, β = 0.3), Grad-ACA (α = 0.1, β = 0.01), OMD (α = 0.1, β = 0.1), MPM (α = 0.1, γ = 0.2), PPCA (α = 0.1, β = 0.1, γ = 1.0), APPCA (α = 0.1, β = 0.3, γ = 0.2), ACG (α = 0.1, β1 = 0.05, β2 = 1.0).
In Figure 6(b), we test our ACG method on the following general zero-sum game:
The compared methods all converge to the origin on this game. Notably, the cyclic behavior of Alt-GDA disappears, and Sim-GDA converges. The trajectory of our ACG method is the same as that of PPCA [16], and both appear faster than the others. We also compared ACG with other methods on the following general game:
The results are presented in Figure 6(c): Sim-GDA and Grad-SCA diverge, while the remaining methods converge. APPCA [16] is faster than our ACG method on this game.
We used the last general zero-sum game (49) to test the robustness of the proposed ACG method in Figure 7, increasing the learning rate α through {0.01, 0.05, 0.1} while keeping the other parameters fixed. ACG converges faster with α = 0.01 and α = 0.05; although it converges more slowly with α = 0.1, it still converges to the origin rapidly.
The robustness of the Adaptive Composite Gradient method on the g3 toy function. It is worth noting that the proposed model has significant robustness: the ACG algorithm quickly converges to the Nash point regardless of how the value of α changes.
The proposed Adaptive Composite Gradient method (ACG) is also suitable for general n-player games. However, it is challenging to visualize a general n-player game with a toy function, so we illustrate Algorithm 2 on a general three-player game. The payoff functions can be written as
where the local Nash equilibrium is (0, 0, 0). The effects are shown in Figure 8: the top row and bottom left show the trajectories of the compared methods, and the bottom right shows the Euclidean distance from the origin at each iteration. Figure 8 shows that SGD, SGA, and our ACG method all converge to the origin. SGD exhibits cyclic behavior, which slows its convergence. The second row of Figure 8 shows that the proposed ACG method approaches the origin faster than SGA and SGD.
The effects of SGD, SGA, and the proposed ACG method in the general three-player game. The figure shows projection trajectories of the competing algorithms on three coordinate planes. The right subplot of the second row shows the Euclidean distance from the current point to the Nash point as the iteration increases.
6.2 Mixtures of Gaussians
We also tested the ACG method by training a toy GAN model, comparing it with other well-known methods on learning mixtures of 5 and 16 Gaussians. Every component of both mixtures is assigned a standard deviation of 0.02. The ground truths for the mixtures of 16 and 5 Gaussians are presented in Figure B1.
Details on network architecture. A GAN consists of a generator network and a discriminator network. We build both the generator and the discriminator with six fully connected layers of 256 neurons each, and append a ReLU activation to each of the six layers in both networks. In the discriminator, we replace the usual sigmoid output layer with a fully connected layer. The generator has two output neurons, while the discriminator has one. The generator's input is random noise sampled from a standard Gaussian distribution; its output is fed to the discriminator, whose output evaluates the quality of the generated points.
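The toy-GAN architecture above can be sketched as plain forward passes. The snippet below uses NumPy rather than a training framework, and the 2-dimensional latent noise is an assumption (the text does not state the noise dimension):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(sizes):
    # One (weights, bias) pair per fully connected layer.
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    # ReLU after each of the six 256-neuron layers; the last fully connected
    # layer stays linear, matching the FC layer used in place of a sigmoid.
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = params[-1]
    return x @ W + b

G = mlp([2] + [256] * 6 + [2])   # generator: noise -> 2-D point
D = mlp([2] + [256] * 6 + [1])   # discriminator: 2-D point -> score

z = rng.standard_normal((64, 2))  # standard Gaussian noise (latent dim assumed)
fake = forward(G, z)
score = forward(D, fake)
print(fake.shape, score.shape)    # (64, 2) (64, 1)
```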
Experimental environments. We run the mixture-of-Gaussians experiments on a computer with an AMD Ryzen 7 3700 CPU, an RTX 2060 GPU, 6 GB RAM, Python 3.6.7, Keras 2.3.1, TensorFlow 1.13.1, and PyTorch 1.3.1. Each compared method is run for 10,000 iterations.
We conducted the mixture-of-5-Gaussians experiments with the proposed ACG method and several other methods. The training settings of all compared algorithms are as follows:
RMSP: the simultaneous RMSPropOptimizer provided by TensorFlow, with learning rate α = 5×10−4.
RMSP-alt: the alternating RMSPropOptimizer implemented in TensorFlow, with learning rate α = 5×10−4.
ConOpt [40]: the Consensus Optimizer implemented in TensorFlow, with h = 1×10−4, γ = 1.
RMSP-SGA [32]: the Symplectic Gradient Adjusted RMSPropOptimizer implemented in TensorFlow, with learning rate α = 1×10−4, ξ = 1.
RMSP-ACA [15]: Alternating Centripetal Acceleration on the RMSPropOptimizer, implemented in TensorFlow, with learning rate α = 5×10−4, β = 0.5.
SGA-ACG (ours): our proposed ACG method of Algorithm 1 on the SGA optimizer, implemented in PyTorch, with learning rate α = 5×10−4, β1 = 5×10−7, β2 = α.
The numerical results for the mixture of 5 Gaussians are shown in Figure 9: ConOpt, RMSP-SGA, and SGA-ACG all converge, and the generated mixture closely approaches the ground truth in Figure B1. ConOpt, RMSP-SGA, and SGA-ACG (ours) show comparable convergence speed. We also compared these against other SOTA methods, namely RMSP, RMSP-alt, and RMSP-ACA; the results are shown in Figure B2. To compare convergence speed among all six algorithms, we use the same training settings on the mixture of 16 Gaussians, as shown in Figure 10.
Compared results on the mixture of 5 Gaussians. Each row corresponds to a different method, and each column shows the results at iteration numbers {2000, 4000, 6000, 8000, 10000}.
Compared results on the mixture of 16 Gaussians. Each row represents an algorithm, and the columns show its results at 2000, 4000, 6000, 8000, and 10000 iterations, respectively.
From Figure 10, our proposed SGA-ACG method clearly converges faster than ConOpt and RMSP-SGA. More comparison results are shown in Figure B3, where RMSP, RMSP-alt, and RMSP-ACA still have not converged after 10,000 iterations. To compare running cost, Figure 11 shows the time consumption of all compared methods.
Time consumption of the compared methods on the mixture of 16 Gaussians in Figure B3. Our proposed SGA-ACG takes more time than RMSProp, RMSProp-alt, and RMSProp-ACA, but less than ConOpt and RMSProp-SGA.
Our Algorithm 1 and Algorithm 2 contain a parameter s. We explored its influence on the final results with a series of experiments on the mixture of 16 Gaussians, varying s over {50, 100, 150, 200}, as shown in Figure 12. The proposed SGA-ACG method converges faster as s increases.
Exploration of s on the mixture of 16 Gaussians. Each row shows results for a different value of s, and each column shows the results at iteration numbers {2000, 4000, 6000, 8000, 10000}.
6.3 Experiments on Prevalent Datasets
In this third experiment, we test the proposed ACG method on image generation tasks. We employ four prevalent datasets, namely standard MNIST [41], Fashion-MNIST [42], CIFAR-10 [43], and CelebA [44], to show that the ACG method can be applied in deep learning.
Network architecture. We use two network architectures for GANs on the MNIST dataset. In the first, the generator has two fully connected layers with 256 and 512 neurons, each followed by a LeakyReLU layer with α = 0.2, and a Tanh activation as the last layer. The generator's input is 100-dimensional random noise sampled from a standard Gaussian distribution, and its output is an image of shape (28, 28, 1). The discriminator likewise has two fully connected layers, with 512 and 256 neurons, each followed by a LeakyReLU layer with α = 0.2; its last layer, however, is a Sigmoid activation. The discriminator's inputs are the generated image and the ground-truth MNIST image, and its output evaluates the quality of the image produced by the generator. For the second architecture, we adopt DCGANs [45], keeping only 4 of its layers for both the generator and the discriminator.
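The first (linear) architecture can be sketched as NumPy forward passes. The final 512-to-784 generator layer before the Tanh is an assumption inferred from the (28, 28, 1) output shape, and the weight scale is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(m, n):
    return rng.normal(0.0, 0.02, size=(m, n)), np.zeros(n)

def leaky_relu(x, slope=0.2):
    return np.where(x > 0.0, x, slope * x)

def generator(z, params):
    # 100 -> 256 -> 512 -> 784: LeakyReLU(0.2) after both hidden layers,
    # Tanh on the output, reshaped to images of shape (28, 28, 1).
    (W1, b1), (W2, b2), (W3, b3) = params
    h = leaky_relu(z @ W1 + b1)
    h = leaky_relu(h @ W2 + b2)
    return np.tanh(h @ W3 + b3).reshape(-1, 28, 28, 1)

def discriminator(img, params):
    # 784 -> 512 -> 256 -> 1: LeakyReLU(0.2), then a Sigmoid output.
    (W1, b1), (W2, b2), (W3, b3) = params
    h = leaky_relu(img.reshape(len(img), -1) @ W1 + b1)
    h = leaky_relu(h @ W2 + b2)
    return 1.0 / (1.0 + np.exp(-(h @ W3 + b3)))

G = [layer(100, 256), layer(256, 512), layer(512, 28 * 28)]
D = [layer(28 * 28, 512), layer(512, 256), layer(256, 1)]

z = rng.standard_normal((64, 100))   # 100-dimensional Gaussian noise
fake = generator(z, G)
prob = discriminator(fake, D)
print(fake.shape, prob.shape)        # (64, 28, 28, 1) (64, 1)
```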
Experimental environments. We conduct the experiments in this section on a server with an E5-2698 CPU, 4 GTX 3090 aero GPUs, 24 GB RAM, Python 3.6.13, and PyTorch 1.8.0. All compared algorithms are implemented in PyTorch, and their training settings on the four datasets are as follows:
SGD: linear GANs on MNIST with learning rate α = 2×10−4, except that DCGANs on MNIST use α = 5×10−4; Fashion-MNIST (α = 2×10−4), CIFAR-10 (α = 2×10−4), CelebA (α = 2×10−4).
Adam: linear GANs on MNIST (α = 3×10−4); DCGANs on MNIST (α = 2×10−4), Fashion-MNIST (α = 2×10−4), CIFAR-10 (α = 2×10−4), CelebA (α = 2×10−4).
RMSP: linear GANs on MNIST (α = 2×10−4); DCGANs on MNIST (α = 5×10−4), Fashion-MNIST (α = 5×10−4), CIFAR-10 (α = 5×10−4), CelebA (α = 5×10−4).
RMSP-ACG: only on the linear GANs (α = 5×10−4, β1 = 5×10−7, β2 = α).
Adam-ACG: on all datasets, applied to both linear GANs and DCGANs with α = 5×10−4, β1 = 5×10−7, β2 = α.
For linear GANs on MNIST, we set the batch size to 64 and train for 324 epochs. The generation results of our proposed methods are shown in Figure 13; further comparisons on MNIST are shown in Figure B4.
Compared results of linear GANs on the MNIST dataset. The first and second rows are the results of RMSP-ACG and Adam-ACG; the four columns are the results at 50000, 150000, 250000, and 300000 iterations, respectively.
For the DCGAN experiments, the batch size is 64 and the number of epochs is 110 on MNIST; the same batch size and epoch number are used for Fashion-MNIST and CIFAR-10. On CelebA, the batch size and number of epochs are 128 and 70, respectively. The results of our methods on the four datasets are shown in Figure 14, and further comparisons on the same datasets are shown in Figure B5.
Comparison of DCGANs for our proposed method on the four datasets. The first column shows the MNIST results at 100000 iterations, the second the Fashion-MNIST results at 100000 iterations, the third the CIFAR-10 results at 80000 iterations, and the last the CelebA results at 100000 iterations.
7. CONCLUSION
We proposed the Adaptive Composite Gradients (ACG) method to find local Nash equilibria in game problems. The ACG algorithm alleviates cyclic behaviors, is robust, and integrates easily with SGD, Adam, RMSP, SGA, and other gradient-based optimizers. Since ACG employs predicted information from s future iterations, it is a novel semi-gradient-free algorithm, and it enjoys a linear convergence rate. Furthermore, SGA-ACG is competitive with ConOpt and SGA on mixture-of-Gaussians generation tasks, and the toy-function experiment shows that ACG applies to general games with n players. The extensive image generation experiments show that our method can optimize generic deep learning models. However, our analysis is limited to convex, smooth zero-sum games; non-convex and non-smooth games are more complex, and finding local Nash equilibria there remains a challenging direction for future research.
ACKNOWLEDGEMENT
This work is supported by the National Key Research and Development Program of China (No.2018AAA0101001), Science and Technology Commission of Shanghai Municipality (No.20511100200), and supported in part by the Science and Technology Commission of Shanghai Municipality (No.18dz2271000).
AUTHOR CONTRIBUTION STATEMENT
Conceptualization, methodology, algorithm designing, coding, original draft preparation and survey references: Huiqing Qi. Methodology, data analyses, manuscript review, funding acquisition, original draft preparation and manuscript revising: Fang Li. Methodology, data analyses, manuscript review and funding acquisition: Shengli Tan. Methodology, algorithm designing, manuscript review, funding acquisition: Xiangyun Zhang. All authors have read and agreed to the published version of the manuscript.
REFERENCES
8. APPENDICES
APPENDIX A. PROOFS IN SECTION 5
A.1 Proof of Proposition 5.3
Proof. Without loss of generality, let , where . Then, we have
The projection of onto can be written as
Incorporating 65 into 66, we have
Using the γ to replace we can obtain where the .
A.2 Proof of Proposition 5.4
Proof. The characteristic polynomial of the matrix (39) is
which is equivalent to
From (A5) we can derive to
According to (A6), 0 and 1 cannot be roots, since A is nonsingular and square. Eq. (A6) is equivalent to
Then the eigenvalues of F are the roots of the sixth-order polynomial:
A.3 Proof of Proposition 5.5
Proof. Set the characteristic polynomial of the matrix (39) to 0, written as follows:
It is obvious that (A9) has 6 roots, two of which are λ1 = λ2 = τ. By the convergence of formula (35), τ is small and |τ| < 1. We mainly discuss the following polynomial:
Using Proposition (5.4), we have
Denote and , then (A11) can be written as
The four roots of (A12) are
Let and , then we can obtain
Denote and , then we have
The rest of the proof follows (A.2) in [29]. For a given complex number z, the absolute value of the real part of z is , and the absolute value of the imaginary part of z is . Since s ≤ 1 by this proposition, all the real parts of the four roots lie in the interval , where
and all the imaginary parts of the roots lie in the interval , where
Using the Inequality we can obtain
Then, we analyze s in the two cases and separately.
Case 1. We suppose . According to this proposition, for all we have . Then, based on , we have
Integrating with (A20), we can obtain
from which it follows that
Inequality (A22) follows from the fact that , and inequality (A23) uses . Then (A21)-(A23) can be written equivalently as
According to (A18) and (A19), we have
It is worth noting that equality holds if and only if y = 0. Hence (A25) holds with equality only when t = 0 and s = 0. Since s > 0, we have the strict inequality ρ(F) < 1, which implies linear convergence per unit time ∇t.
Case 2. We suppose , since . Combining (A16) and (A17) directly, we obtain
which also yields linear convergence.
APPENDIX B. THE APPENDIX FIGURES OF EXPERIMENTS
This section collects additional figures. Figure B1 shows the two ground truths. Figure B2 compares our proposed method with other SOTA algorithms on the mixture of 5 Gaussians. Figure B3 shows the compared methods on the mixture of 16 Gaussians. Figure B4 shows the compared methods with linear GANs on the MNIST dataset. Figure B5 shows the compared methods with DCGANs on the four datasets (MNIST, Fashion-MNIST, CIFAR-10, and CelebA).
Compared results on the mixture of 5 Gaussians. Each row corresponds to a different method, and each column shows the results at iteration numbers {2000, 4000, 6000, 8000, 10000}. The figure shows that RMSP, RMSP-alt, and RMSP-ACA fail to converge to the ground truth, while ConOpt, RMSP-SGA, and SGA-ACG all converge; our proposed method is competitive with ConOpt and RMSP-SGA.
Compared results on the mixture of 16 Gaussians. Each row corresponds to a different method, and each column shows the results at iteration numbers {2000, 4000, 6000, 8000, 10000}. The figure shows that RMSP, RMSP-alt, and RMSP-ACA fail to converge to the ground truth, while ConOpt, RMSP-SGA, and SGA-ACG all converge. Moreover, our method converges faster than ConOpt and RMSP-SGA by iteration 2000.
Compared results of linear GANs on the MNIST dataset. Each row corresponds to a different method, and each column shows the results at iteration numbers {50000, 150000, 250000, 300000}. The figure shows that SGD and Adam cannot generate correct handwritten digits, while RMSP, RMSP-ACG (ours), and Adam-ACG (ours) can. However, all compared methods, including ours, face the mode collapse problem.
Comparison of DCGANs for several algorithms on the four datasets. The first through fourth rows are the results of SGD, RMSP, Adam, and Adam-ACG (ours); the first through fourth columns are the results on MNIST, Fashion-MNIST, CIFAR-10, and CelebA, respectively. We run 100000 iterations on MNIST, 100000 on Fashion-MNIST, 80000 on CIFAR-10, and 100000 on CelebA. The SGD method fails on the MNIST and Fashion-MNIST experiments but succeeds on CIFAR-10 and CelebA. Notably, the RMSP method fails on all four datasets. From this figure, our proposed method is competitive with Adam on all four datasets.