Abstract

The concave-convex procedure (CCCP) is an iterative algorithm that solves d.c. (difference of convex functions) programs as a sequence of convex programs. In machine learning, CCCP is extensively used in many learning algorithms, including sparse support vector machines (SVMs), transductive SVMs, and sparse principal component analysis. Though widely used in many applications, its convergence behavior has not received sufficient attention. Yuille and Rangarajan analyzed its convergence in their original paper; however, we believe the analysis is not complete. The convergence of CCCP can be derived from the convergence of the d.c. algorithm (DCA), proposed in the global optimization literature to solve general d.c. programs, whose proof relies on d.c. duality. In this note, we follow a different reasoning and show how Zangwill's global convergence theory of iterative algorithms provides a natural framework to prove the convergence of CCCP. This underlines Zangwill's theory as a powerful and general framework for dealing with the convergence issues of iterative algorithms; it has previously been used to prove the convergence of algorithms such as expectation-maximization and generalized alternating minimization. We provide a rigorous analysis of the convergence of CCCP by addressing two questions: When does CCCP find a local minimum or a stationary point of the d.c. program under consideration, and when does the sequence generated by CCCP converge? We also present an open problem on the issue of local convergence of CCCP.

1.  Introduction

The concave-convex procedure (CCCP) (Yuille & Rangarajan, 2003) is a widely used algorithm to solve d.c. (difference of convex functions) programs of the form
min{f(x) ≔ u(x) − v(x) : fi(x) ≤ 0, i ∈ [m]},
1.1
where u, v, and {fi}mi=1 are real-valued convex functions, all defined on ℝn. Here, [m] ≔ {1, …, m}. Suppose v is differentiable. The CCCP algorithm is an iterative procedure that solves equation 1.1 as the following sequence of convex programs:
x(l+1) ∈ arg min{u(x) − xT∇v(x(l)) : fi(x) ≤ 0, i ∈ [m]}, l ≥ 0.
1.2
As can be seen from equation 1.2, the idea of CCCP is to linearize the concave part of f, which is −v, around a solution obtained in the current iteration so that u(x) − xT∇v(x(l)) is convex in x, and therefore the nonconvex program in equation 1.1 is solved as a sequence of convex programs, as shown in equation 1.2. CCCP has been extensively used in solving many nonconvex programs of the form in equation 1.1 that appear in machine learning. For example, Bradley and Mangasarian (1998) proposed a successive linear approximation (SLA) algorithm for feature selection in support vector machines, which can be seen as a special case of CCCP. Other applications where CCCP has been used include sparse principal component analysis (Sriperumbudur, Torres, & Lanckriet, 2007), transductive SVMs (Fung & Mangasarian, 2001; Collobert, Sinz, Weston, & Bottou, 2006; Wang, Shen, & Pan, 2007), feature selection in SVMs (Neumann, Schnörr, & Steidl, 2005), structured estimation (Do, Le, Teo, Chapelle, & Smola, 2009), and missing data problems in gaussian processes and SVMs (Smola, Vishwanathan, & Hofmann, 2005).
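To make the iteration in equation 1.2 concrete, the following is a minimal sketch on a toy one-dimensional d.c. program; the instance (u(x) = x², v(x) = 2√(1 + x²), with constraint set Ω = ℝ) is our own choice, not from the text. Here the convex subproblem min{u(z) − z v′(x(l))} has the closed-form solution x(l+1) = v′(x(l))/2, and the iterates exhibit the monotone descent property f(x(l+1)) ≤ f(x(l)) of Yuille and Rangarajan.

```python
import math

def u(x): return x * x                        # convex part
def v(x): return 2.0 * math.sqrt(1 + x * x)   # v is convex and differentiable; -v is the concave part
def grad_v(x): return 2.0 * x / math.sqrt(1 + x * x)
def f(x): return u(x) - v(x)                  # the d.c. objective

def cccp_step(x):
    # Convex subproblem: argmin_z  z^2 - z * grad_v(x),
    # whose unique minimizer is grad_v(x) / 2.
    return grad_v(x) / 2.0

x = 5.0                                       # arbitrary starting point x^(0)
values = [f(x)]
for _ in range(200):
    x = cccp_step(x)
    values.append(f(x))

# Monotone descent holds at every iteration, and the iterates approach
# the unique stationary point x* = 0 of f.
assert all(b <= a + 1e-12 for a, b in zip(values, values[1:]))
print(x, f(x))
```

The descent is monotone but, for this particular instance, sublinear near x* = 0, which already hints that monotone descent alone says nothing about how (or whether) the iterates themselves converge.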

The algorithm in equation 1.2 starts at some random point x(0) ∈ Ω ≔ {x:fi(x) ≤ 0, i ∈ [m]}, iteratively solves the program in equation 1.2, and therefore generates a sequence {x(l)}l=0. The goal of this note is to study the convergence of {x(l)}l=0: When does CCCP find a local minimum or a stationary point of the program in equation 1.1?1 Does {x(l)}l=0 converge? If so, to what and under what conditions? From a practical perspective, these questions are highly relevant, given that CCCP is widely applied in machine learning.

In their original CCCP paper, Yuille and Rangarajan (2003, theorem 2) analyzed its convergence, but we believe the analysis is not complete. They showed that {x(l)}l=0 satisfies the monotone descent property, f(x(l+1)) ≤ f(x(l)), and argued that this property ensures the convergence of {x(l)}l=0 to a minimum or saddle point of the program in equation 1.1. However, the monotone descent property by itself is not sufficient to claim the convergence of {x(l)}l=0.

In the d.c. programming literature, Pham Dinh and Le Thi (1997) proposed a primal-dual subdifferential method called the DCA (d.c. algorithm) for solving general d.c. programs of the form min{u(x) − v(x) : x ∈ ℝn}, where u and v are assumed to be proper lower semicontinuous convex functions, which form a larger class of convex functions than the class of differentiable convex functions (note that in the case of CCCP, v is assumed to be differentiable). Unlike CCCP, DCA involves constructing two sets of convex programs (called the primal and dual programs) and solving them iteratively in succession such that the solution of the primal is the initialization to the dual and vice versa. However, when v is differentiable, DCA and CCCP can be shown to be equivalent. Pham Dinh and Le Thi (1997, theorem 3) provide a proof for the convergence of DCA for general d.c. programs, which therefore proves the convergence of CCCP. The proof exploits d.c. duality and follows an approach that is tailored specifically to d.c. programs solved by DCA. We refer readers to section 2 for a brief review of d.c. duality and a summary of convergence results for DCA.

In this note, we follow a fundamentally different approach and show that the convergence of CCCP, specifically, can be analyzed by relying on Zangwill's (1969) global convergence theory of iterative algorithms. The tools employed in our proof are of a completely different flavor from the ones used in the proof of DCA convergence: the DCA convergence analysis exploits d.c. duality, while we use the notion of point-to-set maps as introduced by Zangwill. Zangwill's theory is a powerful and general framework for dealing with the convergence issues of iterative algorithms; it has also been used to prove the convergence of the expectation-maximization (EM) algorithm (Wu, 1983), generalized alternating minimization algorithms (Gunawardana & Byrne, 2005), and multiplicative updates in nonnegative quadratic programming (Sha, Lin, Saul, & Lee, 2007). It is therefore a natural framework for analyzing the convergence of CCCP in a direct way.

The paper is organized as follows. Following Pham Dinh and Le Thi (1997, 1998), in section 2, we review d.c. duality and summarize the convergence results obtained for DCA. In section 3, we present Zangwill's theory of global convergence, a general framework to analyze the convergence behavior of iterative algorithms. This theory is used to address the global convergence of CCCP in section 4.1. This involves analyzing the fixed points of the CCCP algorithm in equation 1.2 and then showing that the fixed points are the stationary points of the program in equation 1.1. The results in section 4.1 are extended in section 4.2 to analyze the convergence of the constrained concave-convex procedure that Smola et al. (2005) proposed to deal with d.c. programs involving d.c. constraints (note that in contrast, CCCP in equation 1.2 deals with convex constraints). We briefly discuss the local convergence issues of CCCP in section 5 and conclude the section with an example and an open question.

2.  Review of D.C. Duality, DCA, and Convergence of DCA

Let X ≔ ℝn, which means the dual space Y of X can be identified with X itself. Suppose Γ0(X) is the set of all proper lower semicontinuous convex functions on X. The conjugate function u* of u ∈ Γ0(X) is a function belonging to Γ0(Y), defined as
u*(y) ≔ sup{xTy − u(x) : x ∈ X}.
Pham Dinh and Le Thi (1998) considered d.c. programs of the form
α ≔ inf{f(x) ≔ u(x) − v(x) : x ∈ X},
P
where u, v ∈ Γ0(X). Note that this primal program, labeled P, can handle the minimization of f over a closed convex subset C of X. This is because the constraint set {x:x ∈ C} can be absorbed into the objective function through its indicator function, χC(x) ≔ 0 if x ∈ C and +∞ otherwise, and the modified objective would be (u + χC) − v. Using the definition of conjugate functions, we have
α = inf{u(x) − sup{xTy − v*(y) : y ∈ Y} : x ∈ X} = inf{β(y) : y ∈ Y},
where β(y) ≔ inf{u(x) − xTy : x ∈ X} + v*(y). It is clear that β(y) = v*(y) − u*(y) if y ∈ dom v* ≔ {y ∈ Y : v*(y) < +∞}, and +∞ otherwise. Therefore, the dual problem can be written as
α = inf{v*(y) − u*(y) : y ∈ dom v*},
which is equivalent to
α = inf{v*(y) − u*(y) : y ∈ Y}.
D
The perfect symmetry between the primal program (P) and the dual program (D) is referred to as the d.c. duality.
Based on the above d.c. duality, Pham Dinh and Le Thi (1997, 1998) proposed DCA, which involves constructing two sequences {x(l)} and {y(l)}, starting from a given point x(0) ∈ dom ∂v, by setting
y(l) ∈ ∂v(x(l)),   x(l+1) ∈ ∂u*(y(l)),   l ≥ 0,
where ∂v(x(0)) is the subdifferential of v at x(0), that is, ∂v(x(0)) ≔ {y ∈ Y : v(x) ≥ v(x(0)) + (x − x(0))Ty, ∀x ∈ X}. Lemma 3.6 in Pham Dinh and Le Thi (1998) shows that the sequences {x(l)} and {y(l)} are well defined if and only if dom ∂u ⊂ dom ∂v and dom ∂v* ⊂ dom ∂u*. DCA can be interpreted as follows: at each iteration l, we have
x(l+1) ∈ arg min{u(x) − xTy(l) : x ∈ X},   (Pl)
y(l+1) ∈ arg min{v*(y) − yTx(l+1) : y ∈ Y}.   (Dl)
Note that (Pl) is a convex program obtained from P by replacing v with its affine minorization defined by y(l) ∈ ∂v(x(l)). Similarly, the convex program (Dl) is obtained from D by using the affine minorization of u* defined by x(l+1) ∈ ∂u*(y(l)). Suppose v is differentiable. Then DCA reduces to CCCP:
y(l) = ∇v(x(l)),   x(l+1) ∈ arg min{u(x) − xT∇v(x(l)) : x ∈ X}.
We now summarize the convergence of DCA for general d.c. programs. To do that, we need some definitions. A point x* is said to be a critical point of u − v if ∂u(x*)∩∂v(x*) ≠ ∅. If u and v are differentiable, then x* is a stationary point of u − v as ∇u(x*) = ∇v(x*). Let ρ ≥ 0 and C be a convex subset of X. A function θ : C → ℝ is said to be ρ-convex if
θ(λx + (1 − λ)y) ≤ λθ(x) + (1 − λ)θ(y) − (ρ/2)λ(1 − λ)‖x − y‖22, ∀λ ∈ (0, 1), ∀x, y ∈ C,
where ‖x‖22 = xTx. This is equivalent to saying that θ − (ρ/2)‖·‖22 is convex on C. The modulus of strong convexity of θ on C, denoted by ρ(θ, C), is given by ρ(θ, C) ≔ sup{ρ ≥ 0 : θ − (ρ/2)‖·‖22 is convex on C}. θ is said to be strongly convex on C if ρ(θ, C)>0.
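The ρ-convexity definition can be checked numerically on a toy function (our own example, not from the text): for θ(x) = 2x² + |x| on C = [−1, 1], the modulus of strong convexity is ρ(θ, C) = 4, since θ − (ρ/2)‖·‖22 = (2 − ρ/2)x² + |x| is convex exactly when ρ ≤ 4. A midpoint-convexity test on a grid (a necessary condition, used here only as a numeric check) exhibits this threshold:

```python
# theta(x) = 2x^2 + |x| on C = [-1, 1]; its strong-convexity modulus is 4.
def theta(x): return 2 * x * x + abs(x)

def is_midpoint_convex(g, pts):
    # Midpoint convexity check on a grid: g((p+q)/2) <= (g(p)+g(q))/2.
    return all(g((p + q) / 2) <= (g(p) + g(q)) / 2 + 1e-12
               for p in pts for q in pts)

pts = [i / 50.0 for i in range(-50, 51)]   # grid over C = [-1, 1]

# theta - (rho/2) x^2 stays convex up to rho = 4 (the quadratic part
# contributes modulus 4; the |x| part contributes nothing) ...
assert is_midpoint_convex(lambda x: theta(x) - (4.0 / 2) * x * x, pts)
# ... but fails beyond it.
assert not is_midpoint_convex(lambda x: theta(x) - (4.2 / 2) * x * x, pts)
```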

Convergence of DCA.

  1. Pham Dinh and Le Thi (1998, theorem 3.7) showed that DCA is a descent method for both P and D: (uv)(x(l+1)) ≤ (v* − u*)(y(l)) ≤ (uv)(x(l)) with equality if x(l) ∈ ∂u*(y(l)), y(l) ∈ ∂v(x(l)), u, v are strongly convex on X and u*, v* are strongly convex on Y. In addition, when equality holds, x(l) and y(l) are the critical points of P and D respectively.

  2. If α is finite, then the decreasing sequences {(uv)(x(l))} and {(v* − u*)(y(l))} converge to the same limit β ≥ α, that is, liml→∞(uv)(x(l)) = liml→∞(v* − u*)(y(l)) = β. In addition, if the sequences {x(l)} and {y(l)} are bounded, then for every limit point x* of {x(l)} (resp. y* of {y(l)}), there exists a limit point y* of {y(l)} (resp. x* of {x(l)}) such that (uv)(x*) = (v* − u*)(y*) = β. This means that every limit point x* of {x(l)} is a critical point of uv.

  3. The convergence of the whole sequence {x(l)} (resp. {y(l)}) can be ensured if the following hold: {x(l)} is bounded, the set of limit points of {x(l)} is finite, and liml→∞x(l+1)x(l)‖ = 0.

Having summarized the convergence properties of DCA (and therefore of CCCP), in section 4, we state and prove the convergence results for CCCP using a completely different framework, Zangwill's global convergence theory, which is briefly discussed in the following section.

3.  Global Convergence Theory of Iterative Algorithms

For an iterative procedure like CCCP to be useful, it must converge to a local optimum or a stationary point from all or at least a significant number of initialization states and not exhibit other nonlinear system behaviors, such as divergence or oscillation. This behavior can be analyzed by using the global convergence theory of iterative algorithms developed by Zangwill (1969). Note that the term global convergence is a misnomer. We will clarify it below and also introduce some notation and terminology.

To understand the convergence of an iterative procedure like CCCP, we need to understand the notion of a set-valued mapping, or point-to-set mapping, which is central to the theory of global convergence.2 A point-to-set map Ψ from a set X into a set Y is defined as Ψ : X → P(Y), which assigns a subset of Y to each point of X, where P(Y) denotes the power set of Y. We introduce some definitions related to the properties of point-to-set maps that will be used later. Suppose X and Y are two topological spaces. A point-to-set map Ψ is said to be closed at x0 ∈ X if xk → x0 as k → ∞, xk ∈ X, and yk → y0 as k → ∞, yk ∈ Ψ(xk), imply y0 ∈ Ψ(x0). This concept of closedness generalizes the concept of continuity for ordinary point-to-point mappings. A point-to-set map Ψ is said to be closed on S ⊂ X if it is closed at every point of S. A fixed point of Ψ is a point x for which {x} = Ψ(x), whereas a generalized fixed point of Ψ is a point for which x ∈ Ψ(x). Ψ is said to be uniformly compact on X if there exists a compact set H independent of x such that Ψ(x) ⊂ H for all x ∈ X. Note that if X is compact, then Ψ is uniformly compact on X. Let φ : X → ℝ be a continuous function. Ψ is said to be monotone with respect to φ whenever y ∈ Ψ(x) implies that φ(y) ≤ φ(x). If, in addition, y ∈ Ψ(x) and φ(y) = φ(x) imply that y = x, then we say that Ψ is strictly monotone.

Many iterative algorithms in mathematical programming can be described using the notion of point-to-set maps. Let X be a set and x0 ∈ X a given point. Then an algorithm A with initial point x0 is a point-to-set map A : X → P(X), which generates a sequence {xk}k=1 via the rule xk+1 ∈ A(xk), k ≥ 0. A is said to be globally convergent if for any chosen initial point x0, the sequence {xk}k=0 generated by A (or a subsequence) converges to a point for which a necessary condition of optimality holds. The property of global convergence expresses, in a sense, the certainty that the algorithm works. It is very important to stress that it does not imply (contrary to what the term might suggest) convergence to a global optimum for all initial points x0.

With these concepts in place, we now state Zangwill's global convergence theorem (Zangwill, 1969):

Theorem 1. 

Let A : X → P(X) be a point-to-set map (an algorithm) that, given a point x0 ∈ X, generates a sequence {xk}k=0 through the iteration xk+1 ∈ A(xk). Also let a solution set Γ ⊂ X be given. Suppose

  1. All points xk are in a compact set SX.

  2. There is a continuous function φ : X → ℝ such that:

    1. x ∉ Γ ⇒ φ(y) < φ(x), ∀y ∈ A(x).

    2. x ∈ Γ ⇒ φ(y) ≤ φ(x), ∀y ∈ A(x).

  3. A is closed at x if x ∉ Γ.

Then the limit of any convergent subsequence of {xk}k=0 is in Γ. Furthermore,
limk→∞ φ(xk) = φ(x*)
for all limit points x*.

The general idea when proving the global convergence of an algorithm A is to invoke theorem 1 by appropriately defining φ and Γ. For an algorithm that solves the minimization problem min{f(x):x ∈ Ω}, the solution set Γ is usually chosen to be the set of corresponding stationary points, and φ can be chosen to be the objective function itself, that is, f, if f is continuous. In theorem 1, the convergence of φ(xk) to φ(x*) does not automatically imply the convergence of xk to x*. However, if A is strictly monotone with respect to φ, then theorem 1 can be strengthened by using the following result due to Meyer (1976, theorem 3.1, corollary 3.2):

Theorem 2. 

Let A : X → P(X) be a point-to-set map that is uniformly compact, closed, and strictly monotone on X, where X is a closed subset of ℝn. If {xk}k=0 is any sequence generated by A, then all limit points will be fixed points of A, φ(xk) → φ(x*) ≕ φ* as k → ∞, where x* is a fixed point, ‖xk+1 − xk‖ → 0, and either {xk}k=0 converges or the set of limit points of {xk}k=0 is connected. Define F(a) ≔ {x ∈ F : φ(x) = a}, where F is the set of fixed points of A. If F(φ*) is finite, then any sequence {xk}k=0 generated by A converges to some x* in F(φ*).

Using these results on the global convergence of algorithms, Wu (1983) has studied the convergence properties of the EM algorithm, while Gunawardana and Byrne (2005) analyzed the convergence of generalized alternating minimization procedures. In the following section, we use these results to analyze the convergence of CCCP.

4.  Main Results

In section 4.1, we analyze the global convergence of CCCP. In section 4.2, we extend these results and present a global convergence theorem for the constrained concave-convex procedure, a generalization of CCCP proposed by Smola et al. (2005) to deal with d.c. programs involving d.c. constraints. Proofs for the results in sections 4.1 and 4.2 are provided in section 4.3.

4.1.  Convergence Theorems for CCCP.

To analyze the global convergence of the CCCP algorithm in equation 1.2, pertaining to the d.c. program in equation 1.1, we consider the point-to-set map Acccp : Ω → P(Ω), defined as
Acccp(y) ≔ arg min{u(x) − xT∇v(y) : fi(x) ≤ 0, i ∈ [m]},
4.1
where Ω ≔ {x:fi(x) ≤ 0, i ∈ [m]}. We now present two global convergence theorems for CCCP.
Theorem 3 

(Global convergence of CCCP—I). Let u, {fi}mi=1 be real-valued continuous convex functions and v be a real-valued differentiable convex function, all defined on ℝn. Suppose ∇v is continuous. Let {x(l)}l=0 be any sequence generated by Acccp defined by equation 4.1. Suppose Acccp is uniformly compact on Ω ≔ {x:fi(x) ≤ 0, i ∈ [m]} and Acccp(x) is nonempty for every x ∈ Ω.3 Then, assuming suitable constraint qualification, all the limit points of {x(l)}l=0 are generalized fixed points of Acccp, which are stationary points of equation 1.1.4 In addition, liml→∞(u(x(l)) − v(x(l))) = u(x*) − v(x*), where x* is some generalized fixed point of Acccp.

Remark. 

Note that if Ω is compact, then Acccp is uniformly compact on Ω. In addition, since u is continuous on Ω, by the Weierstrass theorem (Minoux, 1986), it follows that Acccp(x) is nonempty for every x ∈ Ω and therefore Acccp is also closed on Ω (by lemma 1; see the appendix).5 Therefore, the assumptions of uniform compactness and nonemptiness of Acccp(x) are trivially satisfied if Ω is compact.

The result obtained in theorem 3 is similar to the convergence result for DCA but with slightly stronger assumptions. In theorem 3, we require u to be continuous, v to be differentiable, and ∇v to be continuous, while DCA requires only that u and v be lower semicontinuous convex functions on ℝn. However, the assumptions on u and v as mentioned in theorem 3 are usually satisfied in machine learning applications, examples of which include sparse principal component analysis (Sriperumbudur et al., 2007), feature selection in SVMs (Neumann et al., 2005), and transductive SVMs (Collobert et al., 2006).

In theorem 3, we considered the generalized fixed points of Acccp. The disadvantage with this case is that it does not rule out “oscillatory” behavior (Meyer, 1976). To elaborate, let Ω0 ≔ {x1, x2} ⊂ Ω with x1 ≠ x2, and suppose Acccp(x1) = Acccp(x2) = Ω0 and u(x1) − v(x1) = u(x2) − v(x2) = 0. Then the sequence {x1, x2, x1, x2, …} could be generated by Acccp, with the convergent subsequences converging to the generalized fixed points x1 and x2. Such oscillatory behavior can be avoided if we ensure that Acccp has fixed points instead of generalized fixed points. With appropriate assumptions on u, the following stronger result can be obtained on the convergence of CCCP.

Theorem 4

(Global convergence of CCCP—II). Let u be a real-valued continuous and strictly convex function, {fi}mi=1 be real-valued continuous convex functions, and v be a real-valued differentiable convex function with continuous ∇v, all defined on ℝn. Let {x(l)}l=0 be any sequence generated by Acccp defined by equation 4.1. Suppose Acccp is uniformly compact on Ω ≔ {x:fi(x) ≤ 0, i ∈ [m]} and Acccp(x) is nonempty for every x ∈ Ω. Then, assuming suitable constraint qualification, all the limit points of {x(l)}l=0 are fixed points of Acccp, which are stationary points of the d.c. program in equation 1.1; u(x(l)) − v(x(l)) → u(x*) − v(x*) ≕ f* as l → ∞, for some fixed point x* (also a stationary point of equation 1.1); ‖x(l+1) − x(l)‖ → 0; and either {x(l)}l=0 converges or the set of limit points of {x(l)}l=0 is a connected and compact subset of F(f*), where F(f*) ≔ {x ∈ F : u(x) − v(x) = f*} and F is the set of fixed points of Acccp. If F(f*) is finite, then any sequence {x(l)}l=0 generated by Acccp converges to some x* in F(f*).

Note that the main difference between the assumptions in theorems 3 and 4 is that u is assumed to be strictly convex in theorem 4. This is not a strong assumption, as it can be achieved as follows. Suppose u is convex but not strictly convex. Let t be a real-valued strictly convex function defined on ℝn. Then u + t is strictly convex on ℝn and
u − v = (u + t) − (v + t) ≕ ũ − ṽ.
If t is differentiable with continuous ∇t (e.g., t(x) = λ‖x‖22, λ>0), then it is clear that ũ and ṽ satisfy the conditions in theorem 4, which means that with the same assumptions as in theorem 3, we obtain the stronger result in theorem 4. However, since theorem 4 is applied to ũ and ṽ, it has to be noted that the sequence {x(l)}l=0 is generated by the following point-to-set map,
Ãcccp(y) ≔ arg min{u(x) + t(x) − xT(∇v(y) + ∇t(y)) : fi(x) ≤ 0, i ∈ [m]},
4.2
instead of equation 4.1, the point-to-set map corresponding to theorem 3, which is applied to u and v.

Given the stronger guarantees on the convergence behavior of {x(l)}l=0 in equation 4.2, as provided by theorem 4, it may be preferable to use equation 4.2 instead of equation 4.1 to solve equation 1.1 when u is convex but not strictly convex. On the other hand, equation 4.1 may be computationally simpler and more efficient to solve than equation 4.2 (for example, if u is linear and Ω is a polyhedral set). If the latter is preferred, theorem 3 can still be used to provide convergence guarantees, so theorem 3 is not completely redundant.
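A small numerical illustration of this regularization trick follows; the instance is our own choice, not from the text. Take u(x) = x (convex but not strictly convex), v(x) = √(1 + x²), Ω = [−1, 2], and t(x) = λx². The subproblem of equation 4.2 is then a strictly convex quadratic whose unconstrained minimizer can be clipped to the interval, and the resulting sequence reaches the boundary stationary point x* = −1 of f = u − v on Ω (where f is increasing):

```python
import math

lam = 0.5                        # strict-convexity parameter lambda in t(x) = lam * x^2
lo, hi = -1.0, 2.0               # Omega = [-1, 2]

def grad_v(x): return x / math.sqrt(1 + x * x)   # v(x) = sqrt(1 + x^2)
def f(x): return x - math.sqrt(1 + x * x)        # f = u - v with u(x) = x

def step(x):
    # Subproblem of equation 4.2: minimize (z + lam z^2) - z (grad_v(x) + 2 lam x)
    # over Omega; strictly convex quadratic, so clip its unconstrained minimizer.
    z = (grad_v(x) + 2 * lam * x - 1.0) / (2 * lam)
    return min(max(z, lo), hi)

x = 2.0
vals = [f(x)]
for _ in range(100):
    x = step(x)
    vals.append(f(x))

assert all(b <= a + 1e-12 for a, b in zip(vals, vals[1:]))  # descent, strict until fixed
assert abs(x + 1.0) < 1e-9       # converges to the stationary point x* = -1
```

Without the t(x) term the subproblem would be a linear program over the interval, which here also lands on the boundary; the regularized map has the advantage that its strict monotonicity rules out the oscillatory behavior discussed after theorem 3.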

From theorem 4, it should be clear that convergence of f(x(l)) to f* does not automatically imply the convergence of x(l) to x*. The convergence in the latter sense requires more stringent conditions like the finiteness of the set of stationary points of equation 1.1 that assume the value of f*. Note that a similar condition of the set of limit points of {x(l)} being finite is also required for the convergence of the whole DCA sequence.

4.2.  Extensions.

So far, we have considered d.c. programs with a convex constraint set and analyzed, using Zangwill's theory, the global convergence behavior of CCCP, which is used to solve such programs. In the following, we consider general d.c. programs where the constraints need not be convex and present the global convergence analysis (again using Zangwill's theory) of an iterative algorithm, an extension of CCCP, that solves such general d.c. programs. Note that DCA can also be used to solve such general d.c. programs (see equation 4.3); its convergence properties are summarized in section 2.6

Let us consider a general d.c. program (Horst & Thoai, 1999), given by
min{u0(x) − v0(x) : ui(x) − vi(x) ≤ 0, i ∈ [m]},
4.3
where {ui}mi=0, {vi}mi=0 are real-valued continuous convex functions defined on ℝn, with {vi}mi=0 being continuously differentiable. While dealing with kernel methods for missing variables, Smola et al. (2005) encountered a problem of the form in equation 4.3, for which they proposed a constrained concave-convex procedure given by
x(l+1) ∈ arg min{u0(x) − xT∇v0(x(l)) : ui(x) − v̂i(x; x(l)) ≤ 0, i ∈ [m]},
4.4
where
v̂i(x; x(l)) ≔ vi(x(l)) + (x − x(l))T∇vi(x(l)), i ∈ [m].
Note that, similar to CCCP, the algorithm in equation 4.4 solves a sequence of convex programs. Although Smola et al. (2005, theorem 1) provided some convergence analysis for the algorithm in equation 4.4, their analysis is not complete because the convergence of {x(l)}l=0 is assumed. In this section, we provide its convergence analysis, following an approach similar to what we did for CCCP, by considering a point-to-set map Accp : Ω → P(Ω) associated with the iterative algorithm in equation 4.4, where Ω ≔ {x : ui(x) − vi(x) ≤ 0, i ∈ [m]}. Note that, unlike in equation 1.2, the constraint set in equation 4.4 varies with l; since v̂i(·; x(l)) minorizes vi, we have ui(x) − vi(x) ≤ ui(x) − v̂i(x; x(l)) for any x(l), which therefore implies x(l+1) ∈ Ω. In theorem 5, we provide the global convergence result for the constrained concave-convex procedure, an analogue of theorem 4 for CCCP. Theorem 5 provides a result similar to the convergence result for DCA but under slightly stronger assumptions of {ui}mi=0, ∇v0 being continuous and {vi}mi=0 being differentiable on ℝn.
Theorem 5

(Global convergence of constrained CCP). Let u0 be a real-valued continuous and strictly convex function, {ui}mi=1 be real-valued continuous convex functions, and {vi}mi=0 be real-valued convex differentiable functions with continuous ∇v0, all defined on ℝn. Let {x(l)}l=0 be any sequence generated by Accp defined by equation 4.4. Suppose Accp is uniformly compact on Ω ≔ {x:ui(x) − vi(x) ≤ 0, i ∈ [m]} and Accp(x) is nonempty for every x ∈ Ω. Then, assuming suitable constraint qualification, all the limit points of {x(l)}l=0 are fixed points of Accp, which are stationary points of the d.c. program in equation 4.3; u0(x(l)) − v0(x(l)) → u0(x*) − v0(x*) ≕ f* as l → ∞, for some fixed point x* of Accp (also a stationary point of equation 4.3); ‖x(l+1) − x(l)‖ → 0; and either {x(l)}l=0 converges or the set of limit points of {x(l)}l=0 is a connected and compact subset of F(f*), where F(f*) ≔ {x ∈ F : u0(x) − v0(x) = f*} and F is the set of fixed points of Accp. If F(f*) is finite, then any sequence {x(l)}l=0 generated by Accp converges to some x* in F(f*).

In the following section, we present the proofs of theorems 3, 4, and 5.

4.3.  Proofs.

Proof of Theorem 3. The assumption of Acccp being uniformly compact on Ω ensures that condition 1 in theorem 1 is satisfied. Let Γ be the set of all generalized fixed points of Acccp, and let φ = f = u − v. Because of the descent property, f(x(l+1)) ≤ f(x(l)), as shown in Yuille and Rangarajan (2003), condition 2 in theorem 1 is satisfied. By our assumptions on u and v, g(x, y) ≔ u(x) − xT∇v(y) is continuous in x and y. Therefore, by lemma 1 (in the appendix), the assumption of nonemptiness of Acccp(x) for every x ∈ Ω ensures that Acccp is closed on Ω, so condition 3 in theorem 1 is satisfied. Therefore, by theorem 1, all the limit points of {x(l)}l=0 are generalized fixed points of Acccp and liml→∞(u(x(l)) − v(x(l))) = u(x*) − v(x*), where x* is some generalized fixed point of Acccp. We now show that any generalized fixed point of Acccp is a stationary point of equation 1.1, therefore proving the result.

Suppose x* is a generalized fixed point of Acccp, that is, x* ∈ Acccp(x*). Since the constraints in equation 4.1 are qualified at x*, there exist Lagrange multipliers {η*i}mi=1 such that the following KKT conditions hold:
0 ∈ ∂u(x*) − ∇v(x*) + Σmi=1 η*i ∂fi(x*);   fi(x*) ≤ 0, η*i ≥ 0, η*i fi(x*) = 0, i ∈ [m].
4.5

Equation 4.5 is exactly the set of KKT conditions of equation 1.1, which are satisfied by (x*, {η*i}), and therefore x* is a stationary point of equation 1.1.

Proof of Theorem 4. Since u is strictly convex, the strict descent property holds: f(x(l+1)) < f(x(l)) unless x(l+1) = x(l), and therefore Acccp is strictly monotone with respect to f. The assumption of nonemptiness of Acccp(x) for every x ∈ Ω ensures that Acccp is closed on Ω (which follows from lemma 1 in the appendix). Since Acccp is uniformly compact on Ω by assumption, invoking theorem 2 shows that all the limit points of {x(l)}l=0 are fixed points of Acccp and that either {x(l)}l=0 converges or its set of limit points forms a connected compact set. Since any fixed point of Acccp is a generalized fixed point, which is also a stationary point of equation 1.1 (see the proof of theorem 3), the desired result follows.

Proof of Theorem 5. The proof is very similar to that of theorem 4. Note that v0(x) ≥ v0(x(l)) + (x − x(l))T∇v0(x(l)) for all x, by the convexity of v0. Since u0 is strictly convex, we have u0(x(l+1)) − x(l+1)T∇v0(x(l)) < u0(x(l)) − x(l)T∇v0(x(l)) unless x(l+1) = x(l), which, combined with the above inequality, means u0(x(l+1)) − v0(x(l+1)) < u0(x(l)) − v0(x(l)) unless x(l+1) = x(l), and therefore Accp is strictly monotone. Since u0 and ∇v0 are continuous and Accp(x) is nonempty for every x ∈ Ω, by invoking lemma 1, we obtain that Accp is closed on Ω. The result therefore follows from theorem 2, which shows that all the limit points of {x(l)}l=0 are fixed points of Accp and that either {x(l)}l=0 converges or its set of limit points forms a connected compact set. We now show that any fixed point of Accp is a stationary point of equation 4.3.

Suppose x* is a fixed point of Accp and assume that the constraints in equation 4.4 are qualified at x*. Then there exist Lagrange multipliers {η*i}mi=1 such that the following KKT conditions hold (note that v̂i(x*; x*) = vi(x*) and ∇xv̂i(x*; x*) = ∇vi(x*)):
0 ∈ ∂u0(x*) − ∇v0(x*) + Σmi=1 η*i (∂ui(x*) − ∇vi(x*));   ui(x*) − vi(x*) ≤ 0, η*i ≥ 0, η*i (ui(x*) − vi(x*)) = 0, i ∈ [m],
4.6
which are exactly the KKT conditions for equation 4.3, satisfied by (x*, {η*i}), and therefore x* is a stationary point of equation 4.3.

5.  Local Convergence Analysis of CCCP

The study so far has been devoted to the global convergence analysis of CCCP and the constrained concave-convex procedure. We say an algorithm A is globally convergent if for any chosen starting point x0, the sequence {xk}k=0 generated by A converges to a point for which a necessary condition of optimality holds. In the results so far, we have shown that all the limit points of any sequence generated by CCCP (resp. its constrained version) are the stationary points (local extrema or saddle points) of the program in equation 1.1 (resp. 4.3). Suppose that x0 is chosen such that it lies in an ε-neighborhood around a local minimum, x*. Will the CCCP sequence then converge to x*? If so, what is the rate of convergence? These are questions of local convergence.

Salakhutdinov, Roweis, and Ghahramani (2003) studied the local convergence of bound optimization algorithms (of which CCCP is an example) to compare the rate of convergence of such methods to that of gradient and second-order methods. In their work, they considered the unconstrained version of CCCP with Acccp as a differentiable point-to-point map. They showed that, depending on the curvature of u and v, CCCP will exhibit either quasi-Newton behavior with fast, typically superlinear convergence or extremely slow, first-order convergence behavior. However, extending these results to the constrained setup in equation 1.2 is not obvious. The following result due to Ostrowski, which can be found in Ortega and Rheinboldt (1970, theorem 10.1.3), provides a way to study the local convergence of iterative algorithms.

Proposition 1. 

Suppose that Ψ : U ⊂ ℝn → ℝn has a fixed point x* in the interior of U and Ψ is Fréchet differentiable at x*. If the spectral radius ρ(Ψ′(x*)) of Ψ′(x*) satisfies ρ(Ψ′(x*)) < 1, and if x0 is sufficiently close to x*, then the iterates {xk} defined by xk+1 = Ψ(xk) all lie in U and converge to x*.

We now discuss how proposition 1 can be used to study the local convergence of CCCP. First note that the proposition treats Ψ (in our case, Acccp) as a point-to-point map, which can be obtained by choosing u to be strictly convex so that x(l+1) is the unique minimizer of equation 1.2. Suppose we choose x* in proposition 1 to be a local minimum of equation 1.1. Then the desired result of local convergence, with at least a linear rate, is obtained if we show that ρ(A′cccp(x*)) < 1. However, currently we are not aware of a way to compute the Fréchet derivative of Acccp and, moreover, to impose conditions on the functions in equation 1.2 so that Acccp is a Fréchet-differentiable map. This is an open question coming out of this work. However, in the following, we present a simple example for which Acccp is Fréchet differentiable and ρ(A′cccp(x*)) < 1.

Example. 
Consider the following nonconvex program,
min{xTAx + bTx + c : Cx = d},
5.1
where A ∈ Sn (the space of n × n symmetric matrices over ℝ), b ∈ ℝn, c ∈ ℝ, C ∈ ℝm×n, and d ∈ ℝm. Assume rank(C) = m, where m < n. Although the objective in equation 5.1 need not be convex, it can be written as a difference of convex functions:
xTAx + bTx + c = (ρ‖x‖22 + bTx + c) − xT(ρIn − A)x,
where ρ>max(0, λmax (A)), so that ρIn − A is positive definite. Here In denotes the n × n identity matrix and λmax (A) is the largest eigenvalue of A. Define u(x) ≔ ρ‖x‖22 + bTx + c and v(x) ≔ xT(ρIn − A)x. Using CCCP, equation 5.1 can be solved as
x(l+1) = arg min{ρ‖x‖22 + bTx + c − 2xT(ρIn − A)x(l) : Cx = d}.
5.2
By solving the Lagrangian of equation 5.2, we get
x(l+1) = ρ−1(In − C†C)(ρIn − A)x(l) − (2ρ)−1(In − C†C)b + C†d,
5.3
where C† ≔ CT(CCT)−1. Note that the point-to-point map Acccp : ℝn → ℝn, defined as in equation 5.3, is affine and therefore is Fréchet differentiable at any x ∈ ℝn, with A′cccp = ρ−1(In − C†C)(ρIn − A). Suppose A, C, and ρ are such that maxi∈[n]|λi| < 1, where {λi} are the eigenvalues of ρ−1(In − C†C)(ρIn − A); then the conditions in proposition 1 are satisfied. Therefore, if x* is a local minimum of equation 5.1, then choosing any x(0) that is sufficiently close to x* results in a sequence of iterates {x(l)} converging to x* with a rate of convergence that is at least linear.
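The closed form in equation 5.3 can be checked numerically. The sketch below uses our own toy data (n = 2, m = 1, with A indefinite, ρ > λmax(A) = 2, and rank(C) = m < n); rather than forming C† explicitly, it minimizes the strictly convex subproblem of equation 5.2 in closed form by projecting the unconstrained minimizer of ρ‖x‖² + (b − 2(ρIn − A)x(l))Tx onto the affine set {x : Cx = d}, which is algebraically the same update.

```python
# Toy instance of equation 5.1 (our own choice of data).
A = [[0.0, -2.0], [-2.0, 0.0]]   # symmetric, indefinite: eigenvalues +-2
b = [1.0, -1.0]
c_vec = [1.0, 1.0]               # single constraint c_vec . x = d, so m = 1 < n = 2
d = 1.0
rho = 3.0                        # rho > lambda_max(A) = 2

def dot(p, q): return p[0] * q[0] + p[1] * q[1]

def f(x):
    # objective of equation 5.1 (the constant c is omitted)
    Ax = [A[0][0] * x[0] + A[0][1] * x[1], A[1][0] * x[0] + A[1][1] * x[1]]
    return dot(x, Ax) + dot(b, x)

def cccp_step(x):
    # Unconstrained minimizer of rho||z||^2 + (b - 2(rho I - A)x)^T z is
    # z = (rho I - A)x / rho - b/(2 rho); then project z onto {z : c.z = d}.
    z = [(rho * x[i] - (A[i][0] * x[0] + A[i][1] * x[1])) / rho - b[i] / (2 * rho)
         for i in range(2)]
    shift = (d - dot(c_vec, z)) / dot(c_vec, c_vec)
    return [z[0] + shift * c_vec[0], z[1] + shift * c_vec[1]]

x = [1.0, 0.0]                   # feasible start: c_vec . x = 1 = d
vals = [f(x)]
for _ in range(100):
    x = cccp_step(x)
    vals.append(f(x))

assert all(v2 <= v1 + 1e-12 for v1, v2 in zip(vals, vals[1:]))  # monotone descent
# The KKT point of this instance is x* = (1/4, 3/4); the contraction factor
# along the feasible direction is (rho - 2)/rho = 1/3, i.e., linear convergence.
assert abs(x[0] - 0.25) < 1e-9 and abs(x[1] - 0.75) < 1e-9
```

For this instance the map contracts the feasible direction [1, −1] by (ρ − λ)/ρ per step, where λ = 2 is the eigenvalue of A along that direction, matching the spectral-radius condition of proposition 1.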

6.  Conclusion

The concave-convex procedure (CCCP) is widely used in machine learning. In this note, we provide a proof of its global convergence by using results from the global convergence theory of iterative algorithms. The proposed approach is fundamentally different from that used for the convergence of DCA. It illustrates the power and generality of Zangwill's global convergence theory as a framework for proving the convergence of iterative algorithms. We briefly discuss the local convergence of CCCP and present an open question, the settlement of which would address the local convergence behavior of CCCP.

Appendix: Supplementary Result

The following result from Gunawardana and Byrne (2005, proposition 7) shows that the minimization of a continuous function forms a closed point-to-set map.7 A similar sufficient condition is also provided in Wu (1983, equation 10).

Lemma 1. 
Given a real-valued continuous function h on X × Y, define the point-to-set map Ψ : X → P(Y) by
Ψ(x) ≔ arg min{h(x, y) : y ∈ Y}.
A.1
Then Ψ is closed at x if Ψ(x) is nonempty.
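To make the closedness property concrete, the following sketch (with h(x, y) = (y − x)² and Ω = [0, 1], choices made here purely for illustration) traces a convergent sequence x_k → x, collects minimizers y_k ∈ Ψ(x_k), and checks that their limit again lies in Ψ(x), as closedness requires:

```python
import numpy as np

def psi(x, grid):
    """Psi(x) = argmin over Omega of h(x, y), with h(x, y) = (y - x)^2 and
    Omega = [0, 1] discretized on a grid (illustrative choices)."""
    vals = (grid - x) ** 2
    return grid[np.argmin(vals)]

grid = np.linspace(0.0, 1.0, 100001)       # fine discretization of Omega

# Closedness at x: take x_k -> x = 1.5 and y_k in Psi(x_k); the limit of
# the y_k should belong to Psi(x).  Here every minimizer clips to 1.0.
x_seq = 1.5 + 1.0 / np.arange(1, 200)      # x_k -> 1.5 from above
y_seq = np.array([psi(xk, grid) for xk in x_seq])

y_limit = y_seq[-1]                        # numerical stand-in for lim y_k
print(y_limit, psi(1.5, grid))             # both 1.0: the limit stays in Psi(x)
```

Continuity of h is what makes this work; for a discontinuous h, the limit of the y_k can fall outside Ψ(x).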

Acknowledgments

We thank anonymous reviewers for their constructive comments that greatly improved the manuscript. The work was carried out when B.K.S. was a Ph.D. student at the University of California, San Diego. We acknowledge support from the National Science Foundation (grants DMS-MSPA 0625409 and IIS-1054960).

Notes

1

x* is said to be a stationary point of a constrained optimization problem if it satisfies the corresponding Karush-Kuhn-Tucker (KKT) conditions. Assuming constraint qualification, KKT conditions are necessary for the local optimality of x*. See Bonnans, Gilbert, Lemaréchal, and Sagastizábal (2006, section 11.3) for details.

2

Note that, depending on the objective and constraints, the minimizer of equation 1.2 need not be unique. Therefore, the algorithm takes x^(l) as its input and returns a set of minimizers, from which an element x^(l+1) is chosen. Hence, the notion of point-to-set maps arises naturally in such iterative algorithms.

3

Instead of uniform compactness, one could also assume that for every x ∈ Ω, the set is bounded for the claims in theorem 3 to hold.

4

Examples include Slater's qualification and Mangasarian-Fromovitz qualification, among others. See Bonnans et al. (2006).

5

The Weierstrass theorem states: if f is a real-valued continuous function on a compact set K, then the problem min{f(x) : x ∈ K} has an optimal solution x* ∈ K.

6
While equation 4.3 is not a d.c. program in the sense of equation 1.1, where the constraint set is convex, an exact penalty approach can be used to transform equation 4.3 into a d.c. program. Consider a modified form of equation 4.3,
min{u_0(x) − v_0(x) : u_i(x) − v_i(x) ≤ 0, i ∈ [m], x ∈ K},
A
where K is a nonempty closed convex set in ℝ^n. A penalty approach penalizes the constraints and introduces the following nondifferentiable d.c. program:
min{u_0(x) − v_0(x) + t·p(x) : x ∈ K}, where p(x) ≔ max(0, max_{i∈[m]}(u_i(x) − v_i(x))),
B
with t > 0, which can be solved using DCA. (Note that to apply DCA to equation B, the continuity and differentiability conditions on {u_i} and {v_i}, i = 0, …, m, mentioned in the paragraph following equation 4.3 are not needed.) To solve equation A by applying DCA to equation B, exact penalty must hold; that is, there must exist t_0 ≥ 0 such that equations A and B are equivalent for all t > t_0. As far as we know, the existence of such a t_0 is guaranteed if K is a nonempty bounded polyhedral convex set in ℝ^n and the feasible set of equation A is nonempty. See Pham Dinh and Le Thi (1997, section 8.1) and Le Thi, Pham Dinh, and Le Dung (1999).
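As a toy illustration of the exact-penalty threshold (the one-dimensional instance and the max-form penalty below are assumptions chosen for illustration, not taken from the cited references), consider minimizing x² subject to 1 − x ≤ 0 over K = [−3, 3]: below some t_0 the penalized minimizer violates the constraint, while above it the constrained solution is recovered.

```python
import numpy as np

# Toy instance of (A): minimize x^2 subject to 1 - x <= 0, x in K = [-3, 3].
# Penalized form (B) with a max-type penalty (an assumed concrete form):
#     x^2 + t * max(0, 1 - x)  over  K.
grid = np.linspace(-3.0, 3.0, 600001)      # fine discretization of K

def penalized_min(t):
    """Grid-search minimizer of the penalized objective for penalty weight t."""
    vals = grid ** 2 + t * np.maximum(0.0, 1.0 - grid)
    return grid[np.argmin(vals)]

# For this instance the threshold is t_0 = 2 (the slope of x^2 at x = 1).
print(penalized_min(1.0))                  # t below t_0: infeasible x = 0.5
print(penalized_min(5.0))                  # t above t_0: recovers x = 1.0
```

For t < 2 the penalty is too weak and the unconstrained pull of x² wins; for t > 2 the penalized and constrained problems share the minimizer x* = 1, which is the equivalence that exact penalty demands.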
7

Lemma 1 is a slight modification of proposition 7 of Gunawardana and Byrne (2005), with exactly the same proof: the latter deals with unconstrained minimization, that is, the minimization in equation A.1 is carried out over all of Y, whereas the lemma deals with constrained minimization over a subset Ω of Y.

References

Bonnans, J. F., Gilbert, J. C., Lemaréchal, C., & Sagastizábal, C. A. (2006). Numerical optimization: Theoretical and practical aspects. New York: Springer-Verlag.

Bradley, P. S., & Mangasarian, O. L. (1998). Feature selection via concave minimization and support vector machines. In Proc. 15th International Conf. on Machine Learning (pp. 82–90). San Francisco: Morgan Kaufmann.

Collobert, R., Sinz, F., Weston, J., & Bottou, L. (2006). Large scale transductive SVMs. Journal of Machine Learning Research, 7, 1687–1712.

Do, C. B., Le, Q. V., Teo, C. H., Chapelle, O., & Smola, A. J. (2009). Tighter bounds for structured estimation. In D. Koller, D. Schuurmans, Y. Bengio, & L. Bottou (Eds.), Advances in neural information processing systems, 21 (pp. 281–288). Cambridge, MA: MIT Press.

Fung, G., & Mangasarian, O. L. (2001). Semi-supervised support vector machines for unlabeled data classification. Optimization Methods and Software, 15, 29–44.

Gunawardana, A., & Byrne, W. (2005). Convergence theorems for generalized alternating minimization procedures. Journal of Machine Learning Research, 6, 2049–2073.

Horst, R., & Thoai, N. V. (1999). D.C. programming: Overview. Journal of Optimization Theory and Applications, 103, 1–43.

Le Thi, H. A., Pham Dinh, T., & Le Dung, M. (1999). Exact penalty in d.c. programming. Vietnam Journal of Mathematics, 27, 169–179.

Meyer, R. R. (1976). Sufficient conditions for the convergence of monotonic mathematical programming algorithms. Journal of Computer and System Sciences, 12, 108–121.

Minoux, M. (1986). Mathematical programming: Theory and algorithms. Hoboken, NJ: Wiley.

Neumann, J., Schnörr, C., & Steidl, G. (2005). Combined SVM-based feature selection and classification. Machine Learning, 61, 129–150.

Ortega, J. M., & Rheinboldt, W. C. (1970). Iterative solution of nonlinear equations in several variables. New York: Academic Press.

Pham Dinh, T., & Le Thi, H. A. (1997). Convex analysis approach to d.c. programming: Theory, algorithms and applications. Acta Mathematica Vietnamica, 22(1), 289–355.

Pham Dinh, T., & Le Thi, H. A. (1998). D.C. optimization algorithms for solving the trust region subproblem. SIAM Journal on Optimization, 8, 476–505.

Salakhutdinov, R., Roweis, S., & Ghahramani, Z. (2003). On the convergence of bound optimization algorithms. In Proc. 19th Conference in Uncertainty in Artificial Intelligence (pp. 509–516). San Francisco: Morgan Kaufmann.

Sha, F., Lin, Y., Saul, L. K., & Lee, D. D. (2007). Multiplicative updates for nonnegative quadratic programming. Neural Computation, 19, 2004–2031.

Smola, A. J., Vishwanathan, S.V.N., & Hofmann, T. (2005). Kernel methods for missing variables. In R. Cowell & Z. Ghahramani (Eds.), Proc. of the Tenth International Workshop on Artificial Intelligence and Statistics (pp. 325–332). N.p.: Society for Artificial Intelligence & Statistics.

Sriperumbudur, B. K., Torres, D. A., & Lanckriet, G.R.G. (2007). Sparse eigen methods by d.c. programming. In Z. Ghahramani (Ed.), Proc. of the 24th Annual International Conference on Machine Learning (pp. 831–838). N.p.: Omnipress.

Wang, L., Shen, X., & Pan, W. (2007). On transductive support vector machines. In J. Verducci, X. Shen, & J. Lafferty (Eds.), Prediction and discovery. Providence, RI: American Mathematical Society.

Wu, C.F.J. (1983). On the convergence properties of the EM algorithm. Annals of Statistics, 11(1), 95–103.

Yuille, A. L., & Rangarajan, A. (2003). The concave-convex procedure. Neural Computation, 15, 915–936.

Zangwill, W. I. (1969). Nonlinear programming: A unified approach. Englewood Cliffs, NJ: Prentice-Hall.