## Abstract

The concave-convex procedure (CCCP) is an iterative algorithm that solves d.c. (difference of convex functions) programs as a sequence of convex programs. In machine learning, CCCP is extensively used in many learning algorithms, including sparse support vector machines (SVMs), transductive SVMs, and sparse principal component analysis. Though CCCP is widely used, its convergence behavior has received little specific attention. Yuille and Rangarajan analyzed its convergence in their original paper; however, we believe the analysis is not complete. The convergence of CCCP can be derived from the convergence of the d.c. algorithm (DCA), proposed in the global optimization literature to solve general d.c. programs, whose proof relies on d.c. duality. In this note, we follow a different reasoning and show how Zangwill's global convergence theory of iterative algorithms provides a natural framework to prove the convergence of CCCP. This underlines Zangwill's theory as a powerful and general framework to deal with the convergence issues of iterative algorithms, after also being used to prove the convergence of algorithms like expectation-maximization and generalized alternating minimization. We provide a rigorous analysis of the convergence of CCCP by addressing two questions: When does CCCP find a local minimum or a stationary point of the d.c. program under consideration? When does the sequence generated by CCCP converge? We also present an open problem on the issue of local convergence of CCCP.

## 1. Introduction

The concave-convex procedure (CCCP) (Yuille & Rangarajan, 2003) solves d.c. (difference of convex functions) programs of the form

$$\min_{x} \; f(x) \quad \text{s.t.} \quad f_i(x) \le 0, \; i \in [m], \tag{1.1}$$

where *f* = *u* − *v*, and *u*, *v*, and {*f*_{i}}^{m}_{i=1} are real-valued convex functions, all defined on ℝ^{n}. Here, [*m*] ≔ {1, …, *m*}. Suppose *v* is differentiable. The CCCP algorithm is an iterative procedure that solves equation 1.1 as the following sequence of convex programs:

$$x^{(l+1)} \in \arg\min_{x} \; u(x) - x^{T}\nabla v(x^{(l)}) \quad \text{s.t.} \quad f_i(x) \le 0, \; i \in [m]. \tag{1.2}$$

As can be seen from equation 1.2, the idea of CCCP is to linearize the concave part of *f*, which is −*v*, around the solution obtained in the current iteration so that *u*(*x*) − *x*^{T}∇*v*(*x*^{(l)}) is convex in *x*. The nonconvex program in equation 1.1 is therefore solved as a sequence of convex programs, as shown in equation 1.2.

CCCP has been extensively used in solving many nonconvex programs of the form in equation 1.1 that appear in machine learning. For example, Bradley and Mangasarian (1998) proposed a successive linear approximation (SLA) algorithm for feature selection in support vector machines, which can be seen as a special case of CCCP. Other applications where CCCP has been used include sparse principal component analysis (Sriperumbudur, Torres, & Lanckriet, 2007), transductive SVMs (Fung & Mangasarian, 2001; Collobert, Sinz, Weston, & Bottou, 2006; Wang, Shen, & Pan, 2007), feature selection in SVMs (Neumann, Schnörr, & Steidl, 2005), structured estimation (Do, Le, Teo, Chapelle, & Smola, 2009), and missing data problems in gaussian processes and SVMs (Smola, Vishwanathan, & Hofmann, 2005).
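The iteration can be made concrete with a minimal sketch on a one-dimensional toy problem. Everything in it is an illustrative assumption rather than part of the original analysis: the choice *u*(*x*) = *x*⁴, *v*(*x*) = *x*², the interval constraint, and the function names. Because the convex surrogate *x*⁴ − 2*x*^{(l)}*x* has a closed-form minimizer, each convex program is solved exactly by clipping that minimizer to the interval.

```python
import math

def f(x):
    """D.c. objective f = u - v with u(x) = x**4 and v(x) = x**2 (toy choice)."""
    return x**4 - x**2

def cccp_step(x_l, lo, hi):
    """One CCCP step: minimize u(x) - x * v'(x_l) over the interval [lo, hi].

    The surrogate x**4 - 2*x_l*x is convex in x, so its minimizer over an
    interval is the unconstrained minimizer clipped to that interval.
    """
    z = x_l / 2.0                                     # stationarity: 4*x**3 = 2*x_l
    x_free = math.copysign(abs(z) ** (1.0 / 3.0), z)  # real cube root of z
    return min(max(x_free, lo), hi)

def cccp(x0, lo=-2.0, hi=2.0, tol=1e-12, max_iter=1000):
    """Run CCCP from x0 and return the list of iterates."""
    xs = [x0]
    for _ in range(max_iter):
        xs.append(cccp_step(xs[-1], lo, hi))
        if abs(xs[-1] - xs[-2]) <= tol:
            break
    return xs

iterates = cccp(1.5)
```

In this toy case the iterates approach 1/√2, a stationary point of *x*⁴ − *x*² inside the interval, and the objective values are nonincreasing along the way.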

The algorithm in equation 1.2 starts at some random point *x*^{(0)} ∈ Ω ≔ {*x*: *f*_{i}(*x*) ≤ 0, *i* ∈ [*m*]}, iteratively solves the program in equation 1.2, and therefore generates a sequence {*x*^{(l)}}^{∞}_{l=0}. The goal of this note is to study the convergence of {*x*^{(l)}}^{∞}_{l=0}: When does CCCP find a local minimum or a stationary point of the program in equation 1.1?^{1} Does {*x*^{(l)}}^{∞}_{l=0} converge? If so, to what and under what conditions? From a practical perspective, these questions are highly relevant, given that CCCP is widely applied in machine learning.

In their original CCCP paper, Yuille and Rangarajan (2003, theorem 2) analyzed its convergence, but we believe the analysis is not complete. They showed that {*x*^{(l)}}^{∞}_{l=0} satisfies the monotone descent property, *f*(*x*^{(l+1)}) ≤ *f*(*x*^{(l)}), and argued that this property ensures the convergence of {*x*^{(l)}}^{∞}_{l=0} to a minimum or saddle point of the program in equation 1.1. However, the monotone descent property by itself is not sufficient to claim the convergence of {*x*^{(l)}}^{∞}_{l=0}.
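A small numerical illustration of why descent alone is not enough is sketched below. The sequence is hypothetical and is not produced by CCCP; it only shows that objective values can decrease strictly at every step while the iterates approach a point that is not stationary.

```python
def f(x):
    """Convex objective whose only stationary point is x = 0."""
    return x * x

def grad_f(x):
    return 2.0 * x

# Hypothetical iterates x_l = 1 + 1/(l + 1); they decrease toward 1, not toward 0.
xs = [1.0 + 1.0 / (l + 1) for l in range(10000)]

# Strict monotone descent holds along the whole sequence ...
descent = all(f(b) < f(a) for a, b in zip(xs, xs[1:]))

# ... yet the gradient at the limit point x = 1 is nonzero.
limit_gradient = grad_f(1.0)
```

Ruling out this kind of behavior is what the additional conditions of Zangwill's theory (closedness of the algorithmic map and strict descent outside the solution set) are for.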

In the d.c. programming literature, Pham Dinh and Le Thi (1997) proposed a primal-dual subdifferential method called the DCA (d.c. algorithm) for solving general d.c. programs of the form min{*u*(*x*) − *v*(*x*): *x* ∈ ℝ^{n}}, where it is assumed that *u* and *v* are proper lower semicontinuous convex functions, which form a larger class of convex functions than the class of differentiable convex functions (note that in the case of CCCP, *v* is assumed to be differentiable). Unlike CCCP, DCA involves constructing two sets of convex programs (called the primal and dual programs) and solving them iteratively in succession such that the solution of the primal is the initialization to the dual and vice versa. However, when *v* is differentiable, DCA and CCCP can be shown to be equivalent. Pham Dinh and Le Thi (1997, theorem 3) provide a proof for the convergence of DCA for general d.c. programs, which therefore proves the convergence of CCCP. The proof exploits d.c. duality (and follows an approach that is tailored specifically to d.c. programs solved by DCA). We refer readers to section 2 for a brief review of d.c. duality and a summary of convergence results for DCA.

In this note, we follow a fundamentally different approach and show that the convergence of CCCP, specifically, can be analyzed by relying on Zangwill's (1969) global convergence theory of iterative algorithms. The tools employed in our proof are of a completely different flavor from the ones used in the proof of DCA convergence: DCA convergence analysis exploits d.c. duality, while we use the notion of point-to-set maps as introduced by Zangwill. Zangwill's theory is a powerful and general framework to deal with the convergence issues of iterative algorithms. It has also been used to prove the convergence of the expectation-maximization (EM) algorithm (Wu, 1983), generalized alternating minimization algorithms (Gunawardana & Byrne, 2005), multiplicative updates in nonnegative quadratic programming (Sha, Lin, Saul, & Lee, 2007), and so on and is therefore a natural framework to analyze the convergence of CCCP in a direct way.

The paper is organized as follows. Following Pham Dinh and Le Thi (1997, 1998), in section 2, we review d.c. duality and summarize the convergence results obtained for DCA. In section 3, we present Zangwill's theory of global convergence, a general framework to analyze the convergence behavior of iterative algorithms. This theory is used to address the global convergence of CCCP in section 4.1. This involves analyzing the fixed points of the CCCP algorithm in equation 1.2 and then showing that the fixed points are the stationary points of the program in equation 1.1. The results in section 4.1 are extended in section 4.2 to analyze the convergence of the constrained concave-convex procedure that Smola et al. (2005) proposed to deal with d.c. programs involving d.c. constraints (note that in contrast, CCCP in equation 1.2 deals with convex constraints). We briefly discuss the local convergence issues of CCCP in section 5 and conclude the section with an example and an open question.

## 2. Review of D.C. Duality, DCA, and Convergence of DCA

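Let *X* be the Euclidean space ℝ^{n}, so that the dual space *Y* of *X* can be identified with *X* itself. Suppose Γ_{0}(*X*) is the set of all proper lower semicontinuous convex functions on *X*. The conjugate function *u** of *u* ∈ Γ_{0}(*X*) is a function belonging to Γ_{0}(*Y*), defined as

$$u^{*}(y) = \sup\{x^{T}y - u(x) : x \in X\}.$$

Pham Dinh and Le Thi (1998) considered d.c. programs of the form

$$(\mathrm{P}) \qquad \alpha = \inf\{f(x) := u(x) - v(x) : x \in X\},$$

where *u*, *v* ∈ Γ_{0}(*X*). Note that this primal program, labeled P, can handle the minimization of *f* over a closed convex subset, *C*, of *X*. This is because the constraint set, {*x*: *x* ∈ *C*}, can be absorbed into the objective function through its indicator, χ_{C}(*x*) = 0 if *x* ∈ *C*, +∞ otherwise, and the modified objective would be (*u* + χ_{C}) − *v*. Using the definition of conjugate functions, we have

$$\alpha = \inf\{u(x) - \sup\{x^{T}y - v^{*}(y) : y \in Y\} : x \in X\} = \inf\{\beta(y) : y \in Y\},$$

where β(*y*) ≔ inf{*u*(*x*) − (*x*^{T}*y* − *v**(*y*)): *x* ∈ *X*}. It is clear that β(*y*) = *v**(*y*) − *u**(*y*) if *y* ∈ dom *v**, and +∞ otherwise. Therefore, the dual problem can be written as

$$\alpha = \inf\{v^{*}(y) - u^{*}(y) : y \in \operatorname{dom} v^{*}\},$$

which is equivalent to

$$(\mathrm{D}) \qquad \alpha = \inf\{v^{*}(y) - u^{*}(y) : y \in Y\}.$$

The perfect symmetry between the primal program (P) and the dual program (D) is referred to as the d.c. duality.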
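As a quick sanity check of this duality, consider a one-dimensional instance chosen here purely for illustration (it does not appear in the cited works): *u*(*x*) = *x*² and *v*(*x*) = *x*²/2 on *X* = ℝ. The conjugates and the two optimal values can be computed by hand:

$$
\begin{aligned}
u^{*}(y) &= \sup_{x}\{xy - x^{2}\} = \tfrac{1}{4}y^{2}, &
v^{*}(y) &= \sup_{x}\{xy - \tfrac{1}{2}x^{2}\} = \tfrac{1}{2}y^{2},\\
\inf_{x}\{u(x) - v(x)\} &= \inf_{x} \tfrac{1}{2}x^{2} = 0, &
\inf_{y}\{v^{*}(y) - u^{*}(y)\} &= \inf_{y} \tfrac{1}{4}y^{2} = 0.
\end{aligned}
$$

Both programs attain the same optimal value, as the duality predicts, even though the primal and dual objectives are different functions.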

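Based on d.c. duality, DCA constructs two sequences {*x*^{(l)}} and {*y*^{(l)}}, starting from a given point *x*^{(0)} ∈ dom *u*, by setting

$$y^{(l)} \in \partial v(x^{(l)}), \qquad x^{(l+1)} \in \partial u^{*}(y^{(l)}),$$

where ∂*v*(*x*^{(0)}) is the subdifferential of *v* at *x*^{(0)}, that is, ∂*v*(*x*^{(0)}) ≔ {*y* ∈ *Y*: *v*(*x*) ≥ *v*(*x*^{(0)}) + (*x* − *x*^{(0)})^{T}*y*, ∀*x* ∈ *X*}. Lemma 3.6 in Pham Dinh and Le Thi (1998) shows that the sequences {*x*^{(l)}} and {*y*^{(l)}} are well defined if and only if dom ∂*u* ⊂ dom ∂*v* and dom ∂*v** ⊂ dom ∂*u**. DCA can be interpreted as follows: at each iteration *l*, we have

$$
\begin{aligned}
(\mathrm{P}_{l}) \qquad x^{(l+1)} &\in \partial u^{*}(y^{(l)}) = \arg\min\{u(x) - [v(x^{(l)}) + (x - x^{(l)})^{T}y^{(l)}] : x \in X\},\\
(\mathrm{D}_{l}) \qquad y^{(l+1)} &\in \partial v(x^{(l+1)}) = \arg\min\{v^{*}(y) - [u^{*}(y^{(l)}) + (y - y^{(l)})^{T}x^{(l+1)}] : y \in Y\}.
\end{aligned}
$$

Note that (P_{l}) is a convex program obtained from P by replacing *v* with its affine minorization defined by *y*^{(l)} ∈ ∂*v*(*x*^{(l)}). Similarly, the convex problem (D_{l}) is obtained from D by using the affine minorization of *u** defined by *x*^{(l+1)} ∈ ∂*u**(*y*^{(l)}). Suppose *v* is differentiable. Then DCA reduces to CCCP:

$$x^{(l+1)} \in \arg\min\{u(x) - x^{T}\nabla v(x^{(l)}) : x \in X\}.$$

We now summarize the convergence of DCA for general d.c. programs. To do that, we need some definitions. A point *x** is said to be a critical point of *u* − *v* if ∂*u*(*x**) ∩ ∂*v*(*x**) ≠ ∅. If *u* and *v* are differentiable, then *x** is a stationary point of *u* − *v*, as ∇*u*(*x**) = ∇*v*(*x**). Let ρ ≥ 0 and *C* be a convex subset of *X*. A function θ: *C* → ℝ ∪ {+∞} is said to be ρ-convex if

$$\theta(\lambda x + (1-\lambda)y) \le \lambda\theta(x) + (1-\lambda)\theta(y) - \frac{\lambda(1-\lambda)}{2}\rho\|x - y\|_{2}^{2}, \quad \forall \lambda \in (0,1), \; \forall x, y \in C,$$

where ‖*x*‖^{2}_{2} = *x*^{T}*x*. This is equivalent to saying that θ − (ρ/2)‖·‖^{2}_{2} is convex on *C*. The modulus of strong convexity of θ on *C*, denoted by ρ(θ, *C*), is given by ρ(θ, *C*) = sup{ρ ≥ 0: θ − (ρ/2)‖·‖^{2}_{2} is convex on *C*}. θ is said to be strongly convex on *C* if ρ(θ, *C*) > 0.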

**Convergence of DCA.**

Pham Dinh and Le Thi (1998, theorem 3.7) showed the following:

1. DCA is a descent method for both P and D: (*u* − *v*)(*x*^{(l+1)}) ≤ (*v** − *u**)(*y*^{(l)}) ≤ (*u* − *v*)(*x*^{(l)}), with equality if *x*^{(l)} ∈ ∂*u**(*y*^{(l)}), *y*^{(l)} ∈ ∂*v*(*x*^{(l)}), *u*, *v* are strongly convex on *X*, and *u**, *v** are strongly convex on *Y*. In addition, when equality holds, *x*^{(l)} and *y*^{(l)} are the critical points of P and D, respectively.
2. If α is finite, then the decreasing sequences {(*u* − *v*)(*x*^{(l)})} and {(*v** − *u**)(*y*^{(l)})} converge to the same limit β ≥ α, that is, lim_{l→∞}(*u* − *v*)(*x*^{(l)}) = lim_{l→∞}(*v** − *u**)(*y*^{(l)}) = β. In addition, if the sequences {*x*^{(l)}} and {*y*^{(l)}} are bounded, then for every limit point *x** of {*x*^{(l)}} (resp. *y** of {*y*^{(l)}}), there exists a limit point *y** of {*y*^{(l)}} (resp. *x** of {*x*^{(l)}}) such that (*u* − *v*)(*x**) = (*v** − *u**)(*y**) = β. This means that every limit point *x** of {*x*^{(l)}} is a critical point of *u* − *v*.
3. The convergence of the whole sequence {*x*^{(l)}} (resp. {*y*^{(l)}}) can be ensured if the following hold: {*x*^{(l)}} is bounded, the set of limit points of {*x*^{(l)}} is finite, and lim_{l→∞}‖*x*^{(l+1)} − *x*^{(l)}‖ = 0.
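The primal-dual structure of DCA and the descent chain in the first item can be sketched on a toy problem in which *v* is not differentiable. The problem *u*(*x*) = *x*², *v*(*x*) = |*x*| and all function names are assumptions made for illustration. Here *u**(*y*) = *y*²/4, and *v** is the indicator function of [−1, 1].

```python
def u(x):
    return x * x

def v(x):
    return abs(x)

def u_conj(y):
    """Conjugate of u: sup_x {x*y - x**2} = y**2 / 4."""
    return y * y / 4.0

def v_conj(y):
    """Conjugate of |x|: indicator function of the interval [-1, 1]."""
    return 0.0 if abs(y) <= 1.0 else float("inf")

def subgrad_v(x):
    """Select one element of the subdifferential of |x|; at 0 we pick 0."""
    if x > 0:
        return 1.0
    if x < 0:
        return -1.0
    return 0.0

def grad_u_conj(y):
    """u* is differentiable, so its subdifferential is the singleton {y / 2}."""
    return y / 2.0

def dca(x0, n_iter=5):
    """Alternate y_l in dv(x_l) and x_{l+1} in du*(y_l); return both sequences."""
    xs, ys = [x0], []
    for _ in range(n_iter):
        ys.append(subgrad_v(xs[-1]))
        xs.append(grad_u_conj(ys[-1]))
    return xs, ys

xs, ys = dca(2.0)
```

Starting from any positive point, the primal iterates reach the critical point 1/2 after one step. Starting exactly at 0 with the subgradient selection above, the iteration stays at 0, which is also a critical point of *u* − *v* but is a local maximum, so criticality alone does not certify a minimum.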

Having summarized the convergence properties of DCA (and therefore of CCCP), in section 4, we state and prove the convergence results for CCCP using a completely different framework, Zangwill's global convergence theory, which is briefly discussed in the following section.

## 3. Global Convergence Theory of Iterative Algorithms

For an iterative procedure like CCCP to be useful, it must converge to a local optimum or a stationary point from all or at least a significant number of initialization states and not exhibit other nonlinear system behaviors, such as divergence or oscillation. This behavior can be analyzed by using the global convergence theory of iterative algorithms developed by Zangwill (1969). Note that the term *global convergence* is a misnomer. We will clarify it below and also introduce some notation and terminology.

To understand the convergence of an iterative procedure like CCCP, we need to understand the notion of a set-valued mapping, or point-to-set mapping, which is central to the theory of global convergence.^{2} A point-to-set map Ψ from a set *X* into a set *Y* is defined as Ψ: *X* → 𝒫(*Y*), which assigns a subset of *Y* to each point of *X*, where 𝒫(*Y*) denotes the power set of *Y*. We introduce some definitions related to the properties of point-to-set maps that will be used later.

Suppose *X* and *Y* are two topological spaces. A point-to-set map Ψ is said to be closed at *x*_{0} ∈ *X* if *x*_{k} → *x*_{0} as *k* → ∞, *x*_{k} ∈ *X*, and *y*_{k} → *y*_{0} as *k* → ∞, *y*_{k} ∈ Ψ(*x*_{k}), imply *y*_{0} ∈ Ψ(*x*_{0}). This concept of closure generalizes the concept of continuity for ordinary point-to-point mappings. A point-to-set map Ψ is said to be closed on *S* ⊂ *X* if it is closed at every point of *S*. A fixed point of the map Ψ: *X* → 𝒫(*X*) is a point *x* for which {*x*} = Ψ(*x*), whereas a generalized fixed point of Ψ is a point for which *x* ∈ Ψ(*x*). Ψ is said to be uniformly compact on *X* if there exists a compact set *H* independent of *x* such that Ψ(*x*) ⊂ *H* for all *x* ∈ *X*. Note that if *X* is compact, then Ψ is uniformly compact on *X*.

Let φ: *X* → ℝ be a continuous function. Ψ is said to be monotone with respect to φ whenever *y* ∈ Ψ(*x*) implies that φ(*y*) ≤ φ(*x*). If, in addition, *y* ∈ Ψ(*x*) and φ(*y*) = φ(*x*) imply that *y* = *x*, then we say that Ψ is strictly monotone.

Many iterative algorithms in mathematical programming can be described using the notion of point-to-set maps. Let *X* be a set and *x*_{0} ∈ *X* a given point. Then an algorithm, 𝒜, with initial point *x*_{0} is a point-to-set map 𝒜: *X* → 𝒫(*X*), which generates a sequence {*x*_{k}}^{∞}_{k=1} via the rule *x*_{k+1} ∈ 𝒜(*x*_{k}), *k* = 0, 1, …. 𝒜 is said to be globally convergent if for any chosen initial point *x*_{0}, the sequence {*x*_{k}}^{∞}_{k=0} generated by *x*_{k+1} ∈ 𝒜(*x*_{k}) (or a subsequence) converges to a point for which a necessary condition of optimality holds. The property of global convergence expresses, in a sense, the certainty that the algorithm works. It is very important to stress that it does not imply (contrary to what the term might suggest) convergence to a global optimum for all initial points *x*_{0}.

With these concepts in place, we now state Zangwill's global convergence theorem (Zangwill, 1969):

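**Theorem 1.** *Let 𝒜: X → 𝒫(X) be a point-to-set map (an algorithm) that given a point x_{0} ∈ X generates a sequence {x_{k}}^{∞}_{k=0} through the iteration x_{k+1} ∈ 𝒜(x_{k}). Also let a solution set Γ ⊂ X be given. Suppose*

1. All points *x*_{k} are in a compact set *S* ⊂ *X*.
2. There is a continuous function φ: *X* → ℝ such that:
   - *x* ∉ Γ ⇒ φ(*y*) < φ(*x*), ∀*y* ∈ 𝒜(*x*).
   - *x* ∈ Γ ⇒ φ(*y*) ≤ φ(*x*), ∀*y* ∈ 𝒜(*x*).
3. 𝒜 is closed at *x* if *x* ∉ Γ.

*Then the limit of any convergent subsequence of {x_{k}}^{∞}_{k=0} is in Γ. Furthermore, lim_{k→∞} φ(x_{k}) = φ(x*) for all limit points x*.*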

The general idea when proving the global convergence of an algorithm 𝒜 is to invoke theorem 1 by appropriately defining φ and Γ. For an algorithm 𝒜 that solves the minimization problem, min{*f*(*x*): *x* ∈ Ω}, the solution set, Γ, is usually chosen to be the set of corresponding stationary points, and φ can be chosen to be the objective function itself, that is, *f*, if *f* is continuous. In theorem 1, the convergence of φ(*x*_{k}) to φ(*x**) does not automatically imply the convergence of *x*_{k} to *x**. However, if 𝒜 is strictly monotone with respect to φ, then theorem 1 can be strengthened by using the following result due to Meyer (1976, theorem 3.1, corollary 3.2):

**Theorem 2.** *Let 𝒜: X → 𝒫(X) be a point-to-set map such that 𝒜 is uniformly compact, closed, and strictly monotone on X, where X is a closed subset of ℝ^{n}. If {x_{k}}^{∞}_{k=0} is any sequence generated by 𝒜, then all limit points will be fixed points of 𝒜, φ(x_{k}) → φ(x*) ≕ φ* as k → ∞, where x* is a fixed point, ‖x_{k+1} − x_{k}‖ → 0, and either {x_{k}}^{∞}_{k=0} converges or the set of limit points of {x_{k}}^{∞}_{k=0} is connected. Define ℱ(a) ≔ {x ∈ ℱ: φ(x) = a}, where ℱ is the set of fixed points of 𝒜. If ℱ(φ*) is finite, then any sequence {x_{k}}^{∞}_{k=0} generated by 𝒜 converges to some x* in ℱ(φ*).*

Using these results on the global convergence of algorithms, Wu (1983) has studied the convergence properties of the EM algorithm, while Gunawardana and Byrne (2005) analyzed the convergence of generalized alternating minimization procedures. In the following section, we use these results to analyze the convergence of CCCP.

## 4. Main Results

In section 4.1, we analyze the global convergence of CCCP. In section 4.2, we extend these results and present a global convergence theorem for the constrained concave-convex procedure, a generalization of CCCP proposed by Smola et al. (2005) to deal with d.c. programs involving d.c. constraints. Proofs for the results in sections 4.1 and 4.2 are provided in section 4.3.

### 4.1. Convergence Theorems for CCCP.

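The CCCP iteration in equation 1.2 can be written in terms of a point-to-set map 𝒜_{cccp}:

$$x^{(l+1)} \in \mathcal{A}_{cccp}(x^{(l)}) := \arg\min\{u(x) - x^{T}\nabla v(x^{(l)}) : x \in \Omega\}. \tag{4.1}$$

**Theorem 3** (Global convergence of CCCP-I). *Let u, {f_{i}}^{m}_{i=1} be real-valued continuous convex functions and v be a real-valued differentiable convex function, all defined on ℝ^{n}. Suppose ∇v is continuous. Let {x^{(l)}}^{∞}_{l=0} be any sequence generated by 𝒜_{cccp} defined by equation 4.1. Suppose 𝒜_{cccp} is uniformly compact on Ω ≔ {x: f_{i}(x) ≤ 0, i ∈ [m]} and 𝒜_{cccp}(x) is nonempty for every x ∈ Ω.^{3} Then, assuming suitable constraint qualification, all the limit points of {x^{(l)}}^{∞}_{l=0} are generalized fixed points of 𝒜_{cccp}, which are stationary points of equation 1.1.^{4} In addition, lim_{l→∞}(u(x^{(l)}) − v(x^{(l)})) = u(x*) − v(x*), where x* is some generalized fixed point of 𝒜_{cccp}.*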

Note that if Ω is compact, then 𝒜_{cccp} is uniformly compact on Ω. In addition, since *u* is continuous on Ω, by the Weierstrass theorem (Minoux, 1986), it follows that 𝒜_{cccp}(*x*) is nonempty for every *x* ∈ Ω, and therefore 𝒜_{cccp} is also closed on Ω (by lemma 1; see the appendix).^{5} Therefore, the assumptions of uniform compactness and nonemptiness of 𝒜_{cccp} are trivially satisfied if Ω is compact.

The result obtained in theorem 3 is similar to the convergence result for DCA but with slightly stronger assumptions. In theorem 3, we require *u* to be continuous, *v* to be differentiable, and ∇*v* to be continuous, while DCA requires *u* and *v* to be lower semicontinuous convex functions on ℝ^{n}. However, the assumptions on *u* and *v* in theorem 3 are usually satisfied in machine learning applications, examples of which include sparse principal component analysis (Sriperumbudur et al., 2007), feature selection in SVMs (Neumann et al., 2005), and transductive SVMs (Collobert et al., 2006).

In theorem 3, we considered the generalized fixed points of 𝒜_{cccp}. The disadvantage with this case is that it does not rule out “oscillatory” behavior (Meyer, 1976). To elaborate, recall that a generalized fixed point *x** only satisfies *x** ∈ 𝒜_{cccp}(*x**), so 𝒜_{cccp}(*x**) may contain other points as well. For example, let Ω_{0} = {*x*_{1}, *x*_{2}} and let 𝒜_{cccp}(*x*_{1}) = 𝒜_{cccp}(*x*_{2}) = Ω_{0} and *u*(*x*_{1}) − *v*(*x*_{1}) = *u*(*x*_{2}) − *v*(*x*_{2}) = 0. Then the sequence {*x*_{1}, *x*_{2}, *x*_{1}, *x*_{2}, …} could be generated by 𝒜_{cccp}, with the convergent subsequences converging to the generalized fixed points *x*_{1} and *x*_{2}. Such oscillatory behavior can be avoided if we ensure that 𝒜_{cccp} has fixed points instead of generalized fixed points. With appropriate assumptions on *u*, the following stronger result can be obtained on the convergence of CCCP.
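The oscillation described above can be written out as a minimal sketch. The two-point set is represented by string labels, and the map and selection rule are hypothetical stand-ins for a point-to-set map with the stated properties, not an actual CCCP instance.

```python
def A(x):
    """Hypothetical point-to-set map: both points map to the whole set Omega_0."""
    return {"x1", "x2"}

def f(x):
    """The objective u - v takes the same value at both points."""
    return 0.0

def is_generalized_fixed_point(x):
    return x in A(x)

def is_fixed_point(x):
    return A(x) == {x}

# A valid selection rule: always choose the element that differs from the
# current iterate. The objective never increases, yet the sequence oscillates.
seq = ["x1"]
for _ in range(7):
    seq.append(sorted(A(seq[-1]) - {seq[-1]})[0])
```

Both points are generalized fixed points and neither is a fixed point, which is exactly the situation that strict monotonicity excludes.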

**Theorem 4** (Global convergence of CCCP-II). *Let u be a real-valued strictly convex function, {f_{i}}^{m}_{i=1} be real-valued continuous convex functions, and v be a differentiable convex function with continuous ∇v, all defined on ℝ^{n}. Let {x^{(l)}}^{∞}_{l=0} be any sequence generated by 𝒜_{cccp} defined by equation 4.1. Suppose 𝒜_{cccp} is uniformly compact on Ω ≔ {x: f_{i}(x) ≤ 0, i ∈ [m]} and 𝒜_{cccp}(x) is nonempty for every x ∈ Ω. Then, assuming suitable constraint qualification, all the limit points of {x^{(l)}}^{∞}_{l=0} are fixed points of 𝒜_{cccp}, which are stationary points of the d.c. program in equation 1.1, u(x^{(l)}) − v(x^{(l)}) → u(x*) − v(x*) ≕ f* as l → ∞, for some fixed point x* of 𝒜_{cccp} (also a stationary point of equation 1.1), ‖x^{(l+1)} − x^{(l)}‖ → 0, and either {x^{(l)}}^{∞}_{l=0} converges or the set of limit points of {x^{(l)}}^{∞}_{l=0} is a connected and compact subset of ℱ(f*), where ℱ(a) ≔ {x ∈ ℱ: u(x) − v(x) = a} and ℱ is the set of fixed points of 𝒜_{cccp}. If ℱ(f*) is finite, then any sequence {x^{(l)}}^{∞}_{l=0} generated by 𝒜_{cccp} converges to some x* in ℱ(f*).*

*u* is assumed to be strictly convex in theorem 4. This is not a strong assumption, as it can be achieved as follows. Suppose *u* is convex but not strictly convex. Let *t* be a real-valued strictly convex function defined on ℝ^{n}. Then ũ ≔ *u* + *t* is strictly convex on ℝ^{n}. If, in addition, *t* is differentiable with ∇*t* continuous (e.g., *t*(*x*) = λ‖*x*‖^{2}_{2}, λ > 0), then it is clear that ũ = *u* + *t* and ṽ ≔ *v* + *t* satisfy the conditions in theorem 4 while leaving the objective unchanged, since ũ − ṽ = *u* − *v*. This means that with the same assumptions as in theorem 3, we obtain the stronger result of theorem 4. However, since theorem 4 is applied to ũ and ṽ, it has to be noted that the sequence {*x*^{(l)}}^{∞}_{l=0} is generated by the following point-to-set map,

$$x^{(l+1)} \in \arg\min\{u(x) + t(x) - x^{T}(\nabla v(x^{(l)}) + \nabla t(x^{(l)})) : x \in \Omega\}, \tag{4.2}$$

instead of equation 4.1, which is the point-to-set map corresponding to theorem 3 applied to *u* and *v*.
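For the quadratic choice *t*(*x*) = λ‖*x*‖^{2}_{2} mentioned above, a short calculation (added here as an illustration) shows what the regularized update looks like. Using ∇*t*(*x*^{(l)}) = 2λ*x*^{(l)} and completing the square,

$$
u(x) + \lambda\|x\|_{2}^{2} - x^{T}\big(\nabla v(x^{(l)}) + 2\lambda x^{(l)}\big)
= u(x) - x^{T}\nabla v(x^{(l)}) + \lambda\|x - x^{(l)}\|_{2}^{2} - \lambda\|x^{(l)}\|_{2}^{2}.
$$

The last term does not depend on *x*, so with this particular *t* the modified iteration minimizes the original CCCP surrogate plus a quadratic penalty on the distance to the current iterate. This is one way to see why the minimizer becomes unique.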

Given the stronger guarantees about the convergence behavior of {*x*^{(l)}}^{∞}_{l=0} in equation 4.2, as provided by theorem 4, it may be preferable to use equation 4.2 instead of 4.1 to solve equation 1.1 when *u* is convex (but not strictly convex). On the other hand, equation 4.1 may be computationally simpler and more efficient to solve than equation 4.2—for example, if *u* is linear and Ω is a polyhedral set. In case the latter is more desirable, then theorem 3 can be used to provide convergence guarantees so that theorem 3 is not completely redundant.

From theorem 4, it should be clear that convergence of *f*(*x*^{(l)}) to *f** does not automatically imply the convergence of *x*^{(l)} to *x**. The convergence in the latter sense requires more stringent conditions like the finiteness of the set of stationary points of equation 1.1 that assume the value of *f**. Note that a similar condition of the set of limit points of {*x*^{(l)}} being finite is also required for the convergence of the whole DCA sequence.

### 4.2. Extensions.

So far, we have considered d.c. programs where the constraint set is convex and analyzed the global convergence behavior of CCCP—using Zangwill's theory—that is used to solve such programs. In the following, we consider general d.c. programs where the constraints need not be convex and present the global convergence analysis (using Zangwill's theory) of an iterative algorithm (which is an extension of CCCP) that solves such general d.c. programs. Note that DCA can be used to solve such general d.c. programs (see equation 4.3) whose convergence properties are summarized in section 2.^{6}

Consider the general d.c. program

$$\min_{x} \; u_{0}(x) - v_{0}(x) \quad \text{s.t.} \quad u_{i}(x) - v_{i}(x) \le 0, \; i \in [m], \tag{4.3}$$

where {*u*_{i}}^{m}_{i=0}, {*v*_{i}}^{m}_{i=0} are real-valued continuous convex functions defined on ℝ^{n}, with {*v*_{i}}^{m}_{i=0} being continuously differentiable. While dealing with kernel methods for missing variables, Smola et al. (2005) encountered a problem of the form in equation 4.3, for which they proposed a constrained concave-convex procedure given by

$$x^{(l+1)} \in \arg\min\{u_{0}(x) - \hat{v}_{0}(x; x^{(l)}) : u_{i}(x) - \hat{v}_{i}(x; x^{(l)}) \le 0, \; i \in [m]\}, \tag{4.4}$$

where

$$\hat{v}_{i}(x; x^{(l)}) := v_{i}(x^{(l)}) + (x - x^{(l)})^{T}\nabla v_{i}(x^{(l)}).$$

Note that similar to CCCP, the algorithm in equation 4.4 is a sequence of convex programs. Although Smola et al. (2005, theorem 1) provided some convergence analysis for the algorithm in equation 4.4, their analysis is not complete because the convergence of {*x*^{(l)}}^{∞}_{l=0} is assumed. In this section, we provide its convergence analysis, following an approach similar to what we did for CCCP, by considering a point-to-set map 𝒜_{ccp}, associated with the iterative algorithm in equation 4.4, where *x*^{(l+1)} ∈ 𝒜_{ccp}(*x*^{(l)}). Let Ω^{(l)} ≔ {*x*: *u*_{i}(*x*) − v̂_{i}(*x*; *x*^{(l)}) ≤ 0, *i* ∈ [*m*]} denote the constraint set of equation 4.4. Note that unlike in equation 1.2, the constraint set in equation 4.4 varies with *l*, and Ω^{(l)} ⊂ Ω ≔ {*x*: *u*_{i}(*x*) − *v*_{i}(*x*) ≤ 0, *i* ∈ [*m*]} for any *x*^{(l)} because each convex *v*_{i} lies above its linearization, which therefore implies *x*^{(l+1)} ∈ Ω. In theorem 5, we provide the global convergence result for the constrained concave-convex procedure, the analog of theorem 4 for CCCP. Theorem 5 provides a result similar to the convergence result for DCA but under the slightly stronger assumptions of {*u*_{i}}^{m}_{i=0} and ∇*v*_{0} being continuous and {*v*_{i}}^{m}_{i=1} being differentiable on ℝ^{n}.

**Theorem 5** (Global convergence of constrained CCP). *Let u_{0} be a real-valued continuous and strictly convex function, {u_{i}}^{m}_{i=1} be real-valued continuous convex functions, and {v_{i}}^{m}_{i=0} be real-valued convex differentiable functions with continuous ∇v_{0}, all defined on ℝ^{n}. Let {x^{(l)}}^{∞}_{l=0} be any sequence generated by 𝒜_{ccp} defined in equation 4.4. Suppose 𝒜_{ccp} is uniformly compact on Ω ≔ {x: u_{i}(x) − v_{i}(x) ≤ 0, i ∈ [m]} and 𝒜_{ccp}(x) is nonempty for every x ∈ Ω. Then, assuming suitable constraint qualification, all the limit points of {x^{(l)}}^{∞}_{l=0} are fixed points of 𝒜_{ccp}, which are stationary points of the d.c. program in equation 4.3, u_{0}(x^{(l)}) − v_{0}(x^{(l)}) → u_{0}(x*) − v_{0}(x*) ≕ f* as l → ∞, for some fixed point x* of 𝒜_{ccp} (also a stationary point of equation 4.3), ‖x^{(l+1)} − x^{(l)}‖ → 0, and either {x^{(l)}}^{∞}_{l=0} converges or the set of limit points of {x^{(l)}}^{∞}_{l=0} is a connected and compact subset of ℱ(f*), where ℱ(a) ≔ {x ∈ ℱ: u_{0}(x) − v_{0}(x) = a} and ℱ is the set of fixed points of 𝒜_{ccp}. If ℱ(f*) is finite, then any sequence {x^{(l)}}^{∞}_{l=0} generated by 𝒜_{ccp} converges to some x* in ℱ(f*).*
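A minimal sketch of the constrained procedure on a one-dimensional toy problem may help fix ideas. The problem is an assumption chosen so that every convex subproblem has a closed-form solution: minimize *x*² subject to 1 − *x*² ≤ 0, that is, *u*_{0}(*x*) = *x*², *v*_{0} = 0, *u*_{1}(*x*) = 1, *v*_{1}(*x*) = *x*². The feasible set {*x*: |*x*| ≥ 1} is nonconvex, but linearizing *v*_{1} at the current iterate yields a half-line.

```python
def objective(x):
    """u_0(x) - v_0(x) with u_0(x) = x**2 and v_0 = 0 (toy choice)."""
    return x * x

def ccp_step(x_l):
    """One constrained CCP step for: minimize x**2 subject to 1 - x**2 <= 0.

    Linearizing v_1(x) = x**2 at x_l gives the convex constraint
    1 - (2*x_l*x - x_l**2) <= 0, a half-line that excludes 0, so the
    minimizer of x**2 over it is the boundary point computed below.
    Requires x_l != 0, which holds for every feasible x_l.
    """
    return (1.0 + x_l * x_l) / (2.0 * x_l)

def constrained_ccp(x0, tol=1e-14, max_iter=100):
    xs = [x0]
    for _ in range(max_iter):
        xs.append(ccp_step(xs[-1]))
        if abs(xs[-1] - xs[-2]) <= tol:
            break
    return xs

xs = constrained_ccp(3.0)
```

Every iterate stays feasible for the original nonconvex constraint, the objective decreases, and the sequence approaches the stationary point *x* = 1. From a negative feasible start it approaches *x* = −1 instead.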

In the following section, we present the proofs of theorems 3, 4, and 5.

### 4.3. Proofs.

**Proof of Theorem 3.** The assumption of 𝒜_{cccp} being uniformly compact on Ω ensures that condition 1 in theorem 1 is satisfied. Let Γ be the set of all generalized fixed points of 𝒜_{cccp}, and let φ = *f* = *u* − *v*. Because of the descent property, *f*(*x*^{(l+1)}) ≤ *f*(*x*^{(l)}), as shown in Yuille and Rangarajan (2003), condition 2 in theorem 1 is satisfied. By our assumption on *u* and *v*, *g*(*x*, *y*) ≔ *u*(*x*) − *x*^{T}∇*v*(*y*) is continuous in *x* and *y*. Therefore, by lemma 1 (in the appendix), the assumption of nonemptiness of 𝒜_{cccp}(*x*) for every *x* ∈ Ω ensures that 𝒜_{cccp} is closed on Ω and so satisfies condition 3 in theorem 1. Therefore, by theorem 1, all the limit points of {*x*^{(l)}}^{∞}_{l=0} are generalized fixed points of 𝒜_{cccp} and lim_{l→∞}(*u*(*x*^{(l)}) − *v*(*x*^{(l)})) = *u*(*x**) − *v*(*x**), where *x** is some generalized fixed point of 𝒜_{cccp}. We now show that any generalized fixed point of 𝒜_{cccp} is a stationary point of equation 1.1, therefore proving the result.

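Suppose *x** is a generalized fixed point of 𝒜_{cccp}, that is, *x** ∈ 𝒜_{cccp}(*x**). Since the constraints in equation 4.1 are qualified at *x**, there exist Lagrange multipliers {η*_{i}}^{m}_{i=1} ⊂ ℝ_{+} such that the following KKT conditions hold (with each subdifferential ∂ reducing to the gradient where the corresponding function is differentiable):

$$0 \in \partial u(x^{*}) - \nabla v(x^{*}) + \sum_{i=1}^{m}\eta_{i}^{*}\,\partial f_{i}(x^{*}), \qquad f_{i}(x^{*}) \le 0, \quad \eta_{i}^{*} \ge 0, \quad \eta_{i}^{*}f_{i}(x^{*}) = 0, \quad \forall i \in [m]. \tag{4.5}$$

Equation 4.5 is exactly the set of KKT conditions of equation 1.1, which are satisfied by (*x**, {η*_{i}}), and therefore *x** is a stationary point of equation 1.1.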

**Proof of Theorem 4.** Since *u* is strictly convex, the strict descent property holds, that is, *f*(*x*^{(l+1)}) < *f*(*x*^{(l)}) unless *x*^{(l+1)} = *x*^{(l)}, and therefore 𝒜_{cccp} is strictly monotone with respect to *f*. The assumption of nonemptiness of 𝒜_{cccp}(*x*) for every *x* ∈ Ω ensures that 𝒜_{cccp} is closed on Ω (which follows from lemma 1 in the appendix). Since 𝒜_{cccp} is uniformly compact on Ω by assumption, invoking theorem 2 shows that all the limit points of {*x*^{(l)}}^{∞}_{l=0} are fixed points of 𝒜_{cccp} and that either the sequence converges or its limit points form a connected compact set. Since any fixed point of 𝒜_{cccp} is a generalized fixed point, which is also a stationary point of equation 1.1 (see the proof of theorem 3), the desired result follows.

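**Proof of Theorem 5.** The proof is very similar to that of theorem 4. Note that *x*^{(l)} is feasible for the convex program solved at iteration *l* in equation 4.4 whenever *x*^{(l)} ∈ Ω, because the linearization of each *v*_{i} at *x*^{(l)} agrees with *v*_{i} at that point. Since *u*_{0} is strictly convex, we have *u*_{0}(*x*^{(l+1)}) − (*x*^{(l+1)})^{T}∇*v*_{0}(*x*^{(l)}) < *u*_{0}(*x*^{(l)}) − (*x*^{(l)})^{T}∇*v*_{0}(*x*^{(l)}) unless *x*^{(l+1)} = *x*^{(l)}, which, together with the convexity of *v*_{0}, means *u*_{0}(*x*^{(l+1)}) − *v*_{0}(*x*^{(l+1)}) < *u*_{0}(*x*^{(l)}) − *v*_{0}(*x*^{(l)}) unless *x*^{(l+1)} = *x*^{(l)}, and therefore 𝒜_{ccp} is strictly monotone. Since *u*_{0} and ∇*v*_{0} are continuous and 𝒜_{ccp}(*x*) is nonempty for every *x* ∈ Ω, by invoking lemma 1, we obtain that 𝒜_{ccp} is closed on Ω. The result therefore follows from theorem 2, which shows that all the limit points of {*x*^{(l)}}^{∞}_{l=0} are fixed points of 𝒜_{ccp} and that either the sequence converges or its limit points form a connected compact set. We now show that any fixed point of 𝒜_{ccp} is a stationary point of equation 4.3.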

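Suppose *x** is a fixed point of 𝒜_{ccp} and assume that the constraints in equation 4.4 are qualified at *x**. Then there exist Lagrange multipliers {η*_{i}}^{m}_{i=1} ⊂ ℝ_{+} such that the following KKT conditions hold:

$$0 \in \partial u_{0}(x^{*}) - \nabla v_{0}(x^{*}) + \sum_{i=1}^{m}\eta_{i}^{*}\big(\partial u_{i}(x^{*}) - \nabla v_{i}(x^{*})\big), \qquad u_{i}(x^{*}) - v_{i}(x^{*}) \le 0, \quad \eta_{i}^{*} \ge 0, \quad \eta_{i}^{*}\big(u_{i}(x^{*}) - v_{i}(x^{*})\big) = 0, \quad \forall i \in [m],$$

which are exactly the KKT conditions for equation 4.3 satisfied by (*x**, {η*_{i}}), and, therefore, *x** is a stationary point of equation 4.3.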

## 5. Local Convergence Analysis of CCCP

The study so far has been devoted to the global convergence analysis of CCCP and the constrained concave-convex procedure. We say an algorithm 𝒜 is globally convergent if for any chosen starting point, *x*_{0}, the sequence {*x*_{k}}^{∞}_{k=0} generated by *x*_{k+1} ∈ 𝒜(*x*_{k}) converges to a point for which a necessary condition of optimality holds. In the results so far, we have shown that all the limit points of any sequence generated by CCCP (resp. its constrained version) are the stationary points (local extrema or saddle points) of the program in equation 1.1 (resp. 4.3). Suppose that *x*_{0} is chosen such that it lies in an ε-neighborhood around a local minimum, *x**. Will the CCCP sequence then converge to *x**? If so, what is the rate of convergence? These are questions of local convergence.

Salakhutdinov, Roweis, and Ghahramani (2003) studied the local convergence of bound optimization algorithms (of which CCCP is an example) to compare the rate of convergence of such methods to that of gradient and second-order methods. In their work, they considered the unconstrained version of CCCP with 𝒜_{cccp} as a point-to-point map that is differentiable. They showed that, depending on the curvature of *u* and *v*, CCCP will exhibit either quasi-Newton behavior with fast, typically superlinear convergence or extremely slow, first-order convergence behavior. However, extending these results to the constrained setup in equation 1.2 is not obvious. The following result due to Ostrowski, which can be found in Ortega and Rheinboldt (1970, theorem 10.1.3), provides a way to study the local convergence of iterative algorithms.

**Proposition 1.** *Suppose that Ψ: U ⊂ ℝ^{n} → ℝ^{n} has a fixed point x* ∈ int(U) and Ψ is Fréchet-differentiable at x*. If the spectral radius, ρ(Ψ′(x*)), of Ψ′(x*) satisfies ρ(Ψ′(x*)) < 1, and if x_{0} is sufficiently close to x*, then the iterates {x_{k}} defined by x_{k+1} = Ψ(x_{k}) all lie in U and converge to x*.*

We now discuss how proposition 1 can be used to study the local convergence of CCCP. First note that the proposition treats Ψ (in our case, 𝒜_{cccp}) as a point-to-point map, which can be obtained by choosing *u* to be strictly convex so that *x*^{(l+1)} is the unique minimizer of equation 1.2. Suppose we choose *x** in proposition 1 to be a local minimum of equation 1.1. Then the desired result of local convergence with at least a linear rate of convergence is obtained if we show that ρ(𝒜′_{cccp}(*x**)) < 1. However, currently we are not aware of a way to compute the Fréchet differential of 𝒜_{cccp} and, moreover, to impose conditions on the functions in equation 1.2 so that 𝒜_{cccp} is a Fréchet-differentiable map. This is an open question coming out of this work. However, in the following, we present a simple example for which 𝒜_{cccp} is Fréchet differentiable and ρ(𝒜′_{cccp}(*x**)) < 1.

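Consider the following quadratic program:

$$\min_{x} \; x^{T}Ax + b^{T}x + c \quad \text{s.t.} \quad Cx = d, \tag{5.1}$$

where *A* ∈ 𝕊^{n} (the set of all *n* × *n* symmetric matrices over ℝ), *b* ∈ ℝ^{n}, *c* ∈ ℝ, *C* ∈ ℝ^{m×n}, and *d* ∈ ℝ^{m}. Assume rank(*C*) = *m*, where *m* < *n*. Although the objective in equation 5.1 need not be convex, it can be written as a difference of convex functions:

$$x^{T}Ax + b^{T}x + c = \big(\rho\|x\|_{2}^{2} + b^{T}x + c\big) - x^{T}(\rho I_{n} - A)x,$$

where ρ > max(0, λ_{max}(*A*)), so that ρ*I*_{n} − *A* is positive definite. Here *I*_{n} denotes the *n* × *n* identity matrix and λ_{max}(*A*) is the largest eigenvalue of *A*. Define *u*(*x*) ≔ ρ‖*x*‖^{2}_{2} + *b*^{T}*x* + *c* and *v*(*x*) ≔ *x*^{T}(ρ*I*_{n} − *A*)*x*. Using CCCP, equation 5.1 can be solved as

$$x^{(l+1)} = \arg\min\{\rho\|x\|_{2}^{2} + b^{T}x - 2x^{T}(\rho I_{n} - A)x^{(l)} : Cx = d\}. \tag{5.2}$$

By solving the Lagrangian of equation 5.2, we get

$$x^{(l+1)} = \rho^{-1}(I_{n} - C^{+}C)(\rho I_{n} - A)x^{(l)} - (2\rho)^{-1}(I_{n} - C^{+}C)b + C^{+}d, \tag{5.3}$$

where *C*^{+} ≔ *C*^{T}(*CC*^{T})^{−1}. Note that the point-to-point map 𝒜_{cccp}, defined as in equation 5.3, is affine and therefore is Fréchet differentiable at any *x* ∈ ℝ^{n}, with derivative ρ^{−1}(*I*_{n} − *C*^{+}*C*)(ρ*I*_{n} − *A*). Suppose *A*, *C*, and ρ are such that max_{i∈[n]}(|λ_{i}|) < 1, where {λ_{i}} are the eigenvalues of ρ^{−1}(*I*_{n} − *C*^{+}*C*)(ρ*I*_{n} − *A*). Then the conditions in proposition 1 are satisfied. Therefore, if *x** is a local minimum of equation 5.1, then choosing any *x*^{(0)} that is sufficiently close to *x** results in a sequence of iterates, {*x*^{(l)}}, converging to *x** with a rate of convergence that is at least linear.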
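The linear rate can be observed numerically with a minimal sketch of the update, specialized to a single equality constraint (*m* = 1) so that *CC*^{T} is a scalar and no matrix inversion is needed. The specific data (*A*, *b*, *C*, *d*, ρ) and the function name are assumptions chosen for illustration; for these values the matrix ρ^{−1}(*I*_{n} − *C*^{+}*C*)(ρ*I*_{n} − *A*) equals diag(0, 0.5), so its spectral radius is 0.5.

```python
def cccp_quadratic_step(x, A, b, C, d, rho):
    """One CCCP update for min x'Ax + b'x + c s.t. Cx = d, with one constraint row C."""
    n = len(x)
    # z = rho^{-1} (rho*I - A) x - (2*rho)^{-1} b
    z = [x[i] - sum(A[i][j] * x[j] for j in range(n)) / rho - b[i] / (2.0 * rho)
         for i in range(n)]
    cc = sum(ci * ci for ci in C)                      # C C^T, a scalar when m = 1
    resid = sum(C[i] * z[i] for i in range(n)) - d     # C z - d
    # (I - C^+ C) z + C^+ d  ==  z - C^T (C z - d) / (C C^T)
    return [z[i] - C[i] * resid / cc for i in range(n)]

A = [[-1.0, 0.0], [0.0, 1.0]]    # indefinite, so the objective is not convex
b = [0.0, 2.0]
C, d = [1.0, 0.0], 1.0           # single constraint: first coordinate equals 1
rho = 2.0                        # rho > max(0, lambda_max(A)) = 1
x_star = [1.0, -1.0]             # minimizer on the feasible set, computed by hand

x = [5.0, 5.0]
errors = []
for _ in range(40):
    x = cccp_quadratic_step(x, A, b, C, d, rho)
    errors.append(max(abs(x[i] - x_star[i]) for i in range(2)))
```

In this instance the error is halved at every step, matching the spectral radius of the derivative of the update map.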

## 6. Conclusion

The concave-convex procedure (CCCP) is widely used in machine learning. In this note, we provide a proof of its global convergence by using results from the global convergence theory of iterative algorithms. The proposed approach is fundamentally different from that used for the convergence of DCA. It illustrates the power and generality of Zangwill's global convergence theory as a framework for proving the convergence of iterative algorithms. We briefly discuss the local convergence of CCCP and present an open question, the settlement of which would address the local convergence behavior of CCCP.

## Appendix: Supplementary Result

## Acknowledgments

We thank anonymous reviewers for their constructive comments that greatly improved the manuscript. The work was carried out when B.K.S. was a Ph.D. student at the University of California, San Diego. We acknowledge support from the National Science Foundation (grants DMS-MSPA 0625409 and IIS-1054960).

## Notes

^{1}

*x** is said to be a stationary point of a constrained optimization problem if it satisfies the corresponding Karush-Kuhn-Tucker (KKT) conditions. Assuming constraint qualification, KKT conditions are necessary for the local optimality of *x**. See Bonnans, Gilbert, Lemaréchal, and Sagastizábal (2006, section 11.3) for details.

^{2}

Note that depending on the objective and constraints, the minimizer of equation 1.2 need not be unique. Therefore, the algorithm takes *x*^{(l)} as its input and returns a set of minimizers from which an element, *x*^{(l+1)} is chosen. Hence, the notion of point-to-set maps appears naturally in such iterative algorithms.

^{3}

Instead of uniform compactness, one could also assume that for every *x* ∈ Ω, the set is bounded for the claims in theorem 3 to hold.

^{4}

Examples include Slater's qualification and Mangasarian-Fromovitz qualification, among others. See Bonnans et al. (2006).

^{5}

The Weierstrass theorem states: If *f* is a real-valued continuous function on a compact set *K*, then the problem min{*f*(*x*): *x* ∈ *K*} has an optimal solution *x** ∈ *K*.

^{6}

In the general d.c. program (equation A), *K* is a nonempty closed convex set in ℝ^{n}. A penalty approach penalizes the constraints and introduces a nondifferentiable d.c. program (equation B) with penalty parameter *t* > 0, which can be solved using DCA. To solve general d.c. programs via DCA on equation B (note that to apply DCA to equation B, the continuity and differentiability conditions on {*u*_{i}}^{m}_{i=0} and {*v*_{i}}^{m}_{i=0} mentioned in the paragraph following equation 4.3 are not needed), exact penalty must hold, that is, there must exist *t*_{0} ≥ 0 such that equations A and B are equivalent for all *t* > *t*_{0}. As far as we know, the existence of such *t*_{0} is guaranteed if *K* is a nonempty bounded polyhedral convex set in ℝ^{n} and the feasible set of equation A is nonempty. See Pham Dinh and Le Thi (1997, section 8.1) and Le Thi, Pham Dinh, and Le Dung (1999).

^{7}

Lemma 1 is a slight modification to proposition 7 of Gunawardana and Byrne (2005), with exactly the same proof, wherein the latter deals with unconstrained minimization, that is, the minimization in equation A.1 is over *Y*, while the lemma deals with constrained minimization carried out over a subset, Ω of *Y*.