Abstract

This letter proposes a novel approach to the sparse covariance matrix estimation (SCME) problem based on $\ell_0$-norm regularization. The objective function of the SCME problem is composed of a nonconvex part and the $\ell_0$ term, which is discontinuous and difficult to tackle. Appropriate DC (difference of convex functions) approximations of the $\ell_0$-norm are used, resulting in approximate SCME problems that are still nonconvex. DC programming and DCA (DC algorithm), powerful tools in the nonconvex programming framework, are investigated. Two DC formulations are proposed and the corresponding DCA schemes developed. Two applications of the SCME problem are considered: classification via sparse quadratic discriminant analysis and portfolio optimization. Careful empirical experiments are performed on simulated and real data sets to study the performance of the proposed algorithms. Numerical results show their efficiency and their superiority over seven state-of-the-art methods.

1  Introduction

The estimation of a covariance matrix is a common statistical problem that emerges from many scientific applications, and it has quickly become an active and fast-growing field of research. Much statistical analysis of high-dimensional data requires estimating a covariance matrix or its inverse. Several applications in numerous domains, such as portfolio management and risk assessment (Ledoit & Wolf, 2003, 2004; Jagannathan & Ma, 2003; Kourtis, Dotsis, & Markellos, 2012; Fan, Liao, & Mincheva, 2013; Xue, Ma, & Zuo, 2012; Lai, Xing, & Chen, 2011; Deng & Tsui, 2013), high-dimensional classification (Guo, Hastie, & Tibshirani, 2007; Witten & Tibshirani, 2011; Tibshirani, Hastie, Narasimhan, & Chu, 2004), analysis of independence and conditional independence relationships between components in graphical models, statistical inference like controlling false discoveries in multiple testing (Leek & Storey, 2008; Efron, 2010), finding quantitative trait loci based on longitudinal data (Yap, Fan, & Wu, 2009; Xiong et al., 2011), and testing the capital asset pricing model (Sentana, 2009), have reported success in using covariance matrix estimation. For instance, principal component analysis (PCA) applies the eigendecomposition of the covariance matrix for dimension reduction. In classification, linear discriminant analysis (LDA), quadratic discriminant analysis, and other procedures exploit the inverse of a covariance matrix to compute the classification rule. In finance, portfolio optimization often uses the covariance matrix for minimizing portfolio risk. More notably, graphical models are especially of interest in the analysis of gene expression data, since it is believed that genes operate in pathways, or networks. Graphical models based on gene expression data can provide a powerful tool for visualizing the relationships of genes and generating biological hypotheses (Toh & Horimoto, 2002; Dobra et al., 2004; Schafer & Strimmer, 2005a, 2005b).

Let $X = (X_1, \ldots, X_p)$ be a $p$-dimensional random vector with the covariance matrix $\Sigma = (\sigma_{ij})_{p \times p}$, where $\sigma_{ij}$ is the covariance between $X_i$ and $X_j$. Suppose that we observe a sample including $n$ observational data points $x_1, \ldots, x_n$ from a multivariate normal distribution $N(\mu, \Sigma)$. The general purpose is to estimate $\Sigma$ from this sample. This amounts to minimizing the negative log-likelihood function (Mardia, Kent, & Bibby, 1979) defined by
$$\min_{\Sigma \succ 0} \; \log\det\Sigma + \operatorname{tr}(S\Sigma^{-1}), \tag{1.1}$$
where $S = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^T$ is the sample covariance matrix.

The problem is that with the increasing abundance of high-dimensional data sets, the sample covariance matrix becomes an extremely noisy estimator of the covariance matrix, and the number of parameters to estimate grows quadratically with the number of variables. Intuitively, the most suitable approach to cope with this problem is finding an estimate of the covariance matrix that is as sparse as possible, since sparsity leads to an effective reduction in the number of parameters. In addition, the sparsity is visualized by the so-called covariance graph (Chaudhuri, Drton, & Richardson, 2007). In the covariance graph, each node represents a random variable in a random vector, and two nodes are connected by a bidirectional edge if the covariance between the corresponding variables is nonzero. Note that two random variables $X_i$ and $X_j$ are marginally independent if and only if the covariance between $X_i$ and $X_j$ is zero. Hence, the zeros in a covariance matrix correspond to marginal independencies between variables, and sparse estimation of the covariance matrix is equivalent to estimating a covariance graph having a small number of edges. Thus, the sparsity of the covariance matrix or its inverse is useful to improve the estimation accuracy or explore the structure of the covariance graphical model.

In recent years, in connection with the big data phenomenon, the sparse covariance matrix estimation (SCME) problem has attracted the attention of researchers and constitutes a challenge for the machine learning community. Existing methods for the SCME problem follow two directions. The first direction is to estimate the sparse inverse covariance matrix. A popular method in this direction consists in adding the Lasso penalty ($\ell_1$-norm regularization) on the entries of the inverse covariance matrix to the normal likelihood (see, e.g., Meinshausen & Buhlmann, 2006; Yuan & Lin, 2007; Banerjee, El Ghaoui, & d'Aspremont, 2008; Friedman, Hastie, & Tibshirani, 2008; Rothman, Levina, & Zhu, 2008; Danaher, Wang, & Witten, 2014; Cai, Liu, & Luo, 2011; Zhang & Zou, 2014). The second direction is to estimate directly the sparse covariance matrix. In this letter, we follow the latter.

A natural way to deal with sparsity in machine learning is using the $\ell_0$-norm in the regularization term, which leads to the following mathematical formulation of the SCME problem:
$$\min_{\Sigma \succ 0} \; \log\det\Sigma + \operatorname{tr}(S\Sigma^{-1}) + \lambda \|\Sigma\|_0, \tag{1.2}$$
where $\lambda$ is a nonnegative tuning parameter, the notation $\Sigma \succ 0$ means that $\Sigma$ is symmetric positive definite, and $\|\Sigma\|_0$ denotes the $\ell_0$-norm of $\Sigma$, that is, the number of nonzero elements of the matrix $\Sigma$. It is clear that the SCME problem is much more difficult than the classical covariance matrix estimation problem. In view of optimization, the SCME problem includes a double difficulty: both the negative log-likelihood function and the $\ell_0$-norm are nonconvex.
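To make the formulation concrete, the objective of equation 1.2 can be evaluated numerically. The following sketch is our illustration only (the function name `scme_objective` and the zero-counting tolerance are our own choices, not part of the letter):

```python
import numpy as np

def scme_objective(Sigma, S, lam, tol=1e-10):
    """Negative log-likelihood plus an l0 penalty, as in equation 1.2.

    Sigma : candidate covariance matrix (symmetric positive definite)
    S     : sample covariance matrix
    lam   : nonnegative tuning parameter lambda
    tol   : threshold below which an entry is counted as zero
    """
    sign, logdet = np.linalg.slogdet(Sigma)
    if sign <= 0:
        return np.inf  # outside the feasible set Sigma > 0
    trace_term = np.trace(S @ np.linalg.inv(Sigma))
    l0 = np.count_nonzero(np.abs(Sigma) > tol)
    return logdet + trace_term + lam * l0

# A diagonal example: log det I = 0, trace term = tr(S), three nonzeros
S = np.diag([2.0, 1.0, 0.5])
val = scme_objective(np.eye(3), S, lam=0.1)
```

Note how the $\ell_0$ term makes the objective discontinuous: an arbitrarily small perturbation of a zero entry changes the penalty by a full $\lambda$, which is precisely the difficulty the DC approximations below address.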

Note that the solution to the problem 1.2 is positive definite. This property is crucial for any covariance matrix estimator from both methodological and practical aspects. Positive-definite covariance matrices are required in all statistical procedures that use the normal distribution (e.g., principal component analysis, the parametric bootstrap method, and linear or quadratic discriminant analysis). Even some important statistical methods that do not use the normal distribution still need positive-definite covariance matrix estimators, such as some portfolio optimization methods based on the celebrated Markowitz model.

Optimization methods involving the $\ell_0$-norm can be divided into three categories according to the way the $\ell_0$-norm is treated: convex approximation, nonconvex approximation, and nonconvex exact reformulation. We refer to Le Thi, Pham Dinh, Le Hoai, and Vo Xuan (2015) for an excellent review of exact and approximate approaches to the $\ell_0$-norm. When the objective function (besides the $\ell_0$ term) is convex, convex approximation techniques result in a convex optimization problem that is then relatively easy to solve. Unfortunately, due to the nonconvexity of the negative log-likelihood function, the SCME problem, equation 1.2, remains nonconvex with any approximation, convex or nonconvex, of the $\ell_0$-norm. How to deal with the $\ell_0$-norm and how to treat the nonconvexity of the negative log-likelihood loss function are two crucial questions to study. Several works have been developed for the SCME problem, but designing an efficient method for it is still a challenge in this research field.

The approaches for solving the SCME problem can be divided into two groups. Most belong to the first group, which we call the convex approach. Here, one avoids the nonconvexity by replacing the negative log-likelihood function with a surrogate convex loss function and using the $\ell_1$-norm (or another convex penalty) instead of the $\ell_0$-norm to deal with sparsity (see, e.g., Deng & Tsui, 2013; Liu, Wang, & Zhao, 2014; Rothman, Levina, & Zhu, 2009; Rothman, 2012; Xue et al., 2012). The resulting problems are then convex and thus solvable, but it is unsurprising that the quality of solutions is not always good. In the second group, including Bien and Tibshirani (2011) and Lam and Fan (2009), the negative log-likelihood function is kept and the $\ell_1$-regularization is used to deal with sparsity. As mentioned above, due to the nonconvexity of the negative log-likelihood function, the resulting problem is still nonconvex. Bien and Tibshirani (2011) applied the majorization-minimization (MM) approach for solving the resulting problem. Previous works on sparse optimization showed that the use of the $\ell_1$-norm as a convex approximation of the $\ell_0$-norm is not a good way in general (see Le Thi, Pham Dinh et al., 2015). Instead, approximating the $\ell_0$-norm by a nonconvex continuous function (actually a DC function) is more suitable from a theoretical perspective (Le Thi, Pham Dinh et al., 2015). The advantage of the $\ell_0$-regularization over the $\ell_1$-regularization in learning with sparsity has been demonstrated by several machine learning algorithms; however, the difficulty caused by the $\ell_0$-norm has discouraged researchers from using the $\ell_0$-regularization.

1.1  Our Contributions

In this letter, taking into account the advantages of some DC approximations of the $\ell_0$-norm developed in Le Thi, Pham Dinh et al. (2015, and references therein), we adopt the most natural way to tackle sparsity, $\ell_0$-regularization, and use the DC approximations to design two new models for the SCME problem. More precisely, we consider an equivalent problem of the problem 1.2 in which the constraint $\Sigma \succ 0$ is replaced by $\Sigma \succeq \epsilon I$, after showing that there exists a lower bound $\epsilon > 0$ of the smallest eigenvalue of the solutions. We maintain the negative log-likelihood function in the equivalent problem and replace the $\ell_0$-regularization by the DC approximations. The resulting approximate problems are far more difficult than the existing models, but they are DC programs. To the best of our knowledge, this is the first time in the literature that the $\ell_0$-regularization and its nonconvex approximations have been used for the SCME problem.

Furthermore, we tackle the resulting approximate SCME problem by DC programming and DCA, powerful tools in the nonconvex programming framework. Our motivation is based on the fact that DCA is a fast and scalable approach that has been successfully applied to many large-scale (smooth or nonsmooth) nonconvex programs in various domains of applied science, in particular in data analysis and machine learning (see, e.g., Collobert, Sinz, Weston, & Bottou, 2006; Krause & Singer, 2004; Le Thi & Pham Dinh, 2005; Le Thi, Pham Dinh, & Huynh, 2012; Le Thi, Le Hoai, & Pham Dinh, 2007, 2014, 2015; Le Thi, Le Hoai, Nguyen, & Pham Dinh, 2008; Pham Dinh & Le Thi, 1997, 1998, 2014; Le Hoai, Le Thi, Pham Dinh, & Huynh, 2013; Liu, Shen, & Doss, 2005; and the list of references at http://lita.sciences.univ-metz.fr/∼lethi/DCA.html). We note in passing that the MM-based approach proposed in Bien and Tibshirani (2011) for the SCME problem is also a version of DCA. Constituting the backbone of smooth/nonsmooth nonconvex programming and global optimization, DC programming and DCA address general DC programs of the form
$$\alpha = \inf\{f(x) := g(x) - h(x) : x \in \mathbb{R}^n\}, \qquad (P)$$
where $g$ and $h$ are lower semicontinuous proper convex functions on $\mathbb{R}^n$. Such a function $f$ is called a DC function, and $g - h$ a DC decomposition of $f$, while $g$ and $h$ are the DC components of $f$. The general DCA scheme is a philosophy rather than a single algorithm: there is not one DCA but a family of DCAs for a given problem. The main feature of DCA is that it is constructed from the DC components $g$ and $h$ and not from the DC function $f$ itself, which has infinitely many DC decompositions, and there are as many DCAs as there are DC decompositions. Such decompositions play a critical role in determining the speed of convergence, stability, robustness, and globality of sought solutions. Hence what is a “good” DC decomposition is a crucial question when developing DCA for a DC program. The design of an efficient DCA for a concrete problem should be based on its special structure.

In this work, we propose two DC formulations of the approximate SCME problem based on two DC decompositions of its objective function. The first results from a natural DC decomposition; the second is introduced to obtain cheaper convex subproblems, illustrating the effect of the choice of DC decomposition. It turns out that the complexity of the two corresponding DCA schemes is quite different because the convex subproblems in the second DCA scheme can be solved by a very inexpensive algorithm. The ratio of gain between the two DCAs in terms of CPU time in our numerical experiments is up to 44 times. Among the various existing sparsity-inducing functions, we choose, for implementing our algorithms, the piecewise linear approximation (Capped-$\ell_1$; Peleg & Meir, 2008) and the piecewise exponential approximation (Bradley & Mangasarian, 1998). This choice is motivated by the fact that the Capped-$\ell_1$ has been proved theoretically in Le Thi, Pham Dinh et al. (2015) to be the tightest approximation, while the piecewise exponential function has been shown to be efficient via numerical results in numerous works (Bradley & Mangasarian, 1998; Le Thi et al., 2008; Le Thi, Vo Xuan, & Pham Dinh, 2014; Le Thi, Pham Dinh et al., 2015; Ong & Le Thi, 2013). Applying DCA to the two DC formulations with the two approximations, we obtain four DCA-based algorithms for the approximate SCME problem. Convergence analysis results for our algorithms are provided. We consider two important applications of the SCME problem in our experiments. The first is quadratic discriminant analysis using sparse covariance matrices estimated by the proposed algorithms. The second is a portfolio optimization problem. Numerical experiments are carefully carried out on several test problems on both simulated and real data sets with 11 algorithms, including 7 state-of-the-art methods and the 4 proposed DCA schemes.

The rest of the letter is organized as follows. DC programming and DCA are introduced in section 2, while the DCA-based methods for solving the SCME problem are presented in section 3. The numerical experiments are reported in section 4. Section 5 concludes the letter.

Throughout the letter, for matrices $A, B \in \mathbb{R}^{p \times p}$, the inner product of $A$ and $B$ is defined as $\langle A, B \rangle = \operatorname{tr}(A^T B)$. If $A$ is symmetric, the notations $\lambda_{\min}(A)$ and $\lambda_{\max}(A)$ refer to the smallest and largest eigenvalues of $A$, respectively. The spectral and Frobenius norms are $\|A\|_2 = \sqrt{\lambda_{\max}(A^T A)}$ and $\|A\|_F = \big(\sum_{i,j} a_{ij}^2\big)^{1/2}$, respectively. The Kronecker product $A \otimes B$ of $A$ and $B$ is a $p^2 \times p^2$ block matrix whose $(i, j)$ block is the matrix $a_{ij} B$.
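This notation is easy to check numerically. The sketch below (our illustration; the variable names are arbitrary) verifies the inner product identity, the two norms, and the block structure of the Kronecker product:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
A = (M + M.T) / 2  # a symmetric matrix
B = rng.standard_normal((3, 3))

# Inner product <A, B> = tr(A^T B) equals the elementwise sum of products
assert np.isclose(np.trace(A.T @ B), np.sum(A * B))

# Spectral norm = sqrt of the largest eigenvalue of A^T A; it never
# exceeds the Frobenius norm
spec = np.linalg.norm(A, 2)
frob = np.linalg.norm(A, 'fro')
assert np.isclose(spec, np.sqrt(np.max(np.linalg.eigvalsh(A.T @ A))))
assert spec <= frob + 1e-12

# Kronecker product: the (i, j) block of A ⊗ B is a_ij * B
K = np.kron(A, B)
assert np.allclose(K[0:3, 0:3], A[0, 0] * B)
```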

2  A Brief Introduction to DC Programming and DCA

DC programming and DCA were introduced by Pham Dinh Tao in their preliminary form in 1985. They have been extensively developed since 1994 by Le Thi Hoai An and Pham Dinh Tao and have become classic and increasingly popular (see Le Thi & Pham Dinh, 2005; Pham Dinh & Le Thi, 1997, 1998, 2014; Le Thi et al., 2012). The standard DC program takes the form of the problem $(P)$ above. It is worthwhile to note that a closed convex constraint $x \in C$ can be incorporated into the objective function of $(P)$ by using the indicator function of $C$, denoted $\chi_C$, which is defined by $\chi_C(x) = 0$ if $x \in C$, and $+\infty$ otherwise:
$$\inf\{f(x) := g(x) - h(x) : x \in C\} = \inf\{(g + \chi_C)(x) - h(x) : x \in \mathbb{R}^n\}.$$
A DC program is called a polyhedral DC program when either or is a polyhedral convex function (i.e., the sum of the indicator function of a polyhedral convex set and the pointwise supremum of a finite collection of affine functions). Note that a polyhedral convex function is almost always differentiable, that is to say, it is differentiable everywhere except on a set of measure zero.
Recall that for a convex function $\theta$, the subdifferential of $\theta$ at $x_0$, denoted by $\partial\theta(x_0)$, is defined by
$$\partial\theta(x_0) = \{y \in \mathbb{R}^n : \theta(x) \ge \theta(x_0) + \langle x - x_0, y \rangle, \ \forall x \in \mathbb{R}^n\}.$$
The subdifferential generalizes the derivative in the sense that $\theta$ is differentiable at $x_0$ if and only if $\partial\theta(x_0)$ reduces to the singleton $\{\nabla\theta(x_0)\}$.
The complexity of DC programs resides, of course, in the lack of practical optimal globality conditions. Local optimality conditions are then useful in DC programming.

A point $x^*$ is said to be a local minimizer of $g - h$ if $g(x^*) - h(x^*)$ is finite and there exists a neighborhood $\mathcal{U}$ of $x^*$ such that
$$g(x^*) - h(x^*) \le g(x) - h(x), \quad \forall x \in \mathcal{U}. \tag{2.1}$$
The necessary local optimality condition for a (primal) DC program is given by
$$\emptyset \ne \partial h(x^*) \subset \partial g(x^*). \tag{2.2}$$
Condition 2.2 is also sufficient (for local optimality) in many important classes of DC programs, for example, for polyhedral DC programs, or when the function $f$ is locally convex at $x^*$ (Le Thi & Pham Dinh, 2005).
A point $x^*$ is called a critical point of $g - h$, or a generalized Karush-Kuhn-Tucker (KKT) point of $(P)$, if
$$\partial h(x^*) \cap \partial g(x^*) \ne \emptyset. \tag{2.3}$$
It follows that if $h$ is a polyhedral convex function, then a critical point of $g - h$ is almost always a local solution to $(P)$.

DCA is based on local optimality conditions and duality in DC programming (for simplicity, we do not present DC duality and the dual part of DCA). The main idea of DCA is simple: each iteration of DCA approximates the concave part $-h$ by its affine majorization (which corresponds to taking $y^k \in \partial h(x^k)$) and minimizes the resulting convex function.

The generic DCA scheme can be described as follows:

Generic DCA Scheme

Initialization: Let $x^0 \in \mathbb{R}^n$ be an initial guess, $k = 0$.

Repeat

Calculate $y^k \in \partial h(x^k)$.

Calculate $x^{k+1} \in \arg\min\{g(x) - \langle x, y^k \rangle : x \in \mathbb{R}^n\}$; set $k = k + 1$.

Until convergence of $\{x^k\}$.
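To illustrate the scheme on a toy problem (our example, not from the letter): minimize $f(x) = x^4 - x^2$ with the DC components $g(x) = x^4$ and $h(x) = x^2$. Then $y^k = h'(x^k) = 2x^k$, and the convex subproblem $\min_x x^4 - y^k x$ has the closed-form solution $x^{k+1} = (y^k/4)^{1/3}$:

```python
def dca_toy(x0, iters=100):
    """Generic DCA on f(x) = x^4 - x^2 with g(x) = x^4, h(x) = x^2.

    Assumes x0 > 0 so that the real cube root is well defined.
    """
    x = x0
    for _ in range(iters):
        y = 2.0 * x                        # y^k = h'(x^k)
        x_next = (y / 4.0) ** (1.0 / 3.0)  # argmin of g(x) - x*y
        if abs(x_next - x) < 1e-12:        # convergence of {x^k}
            break
        x = x_next
    return x

x_star = dca_toy(1.0)
# DCA converges to the critical point 1/sqrt(2) ≈ 0.70711 of f
```

Each iteration decreases $f$, in line with the descent property listed below; the limit $1/\sqrt{2}$ satisfies the criticality condition $h'(x^*) \in \partial g(x^*)$.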

Convergence properties of DCA and its theoretical basis can be found in Pham Dinh and Le Thi (1997) and Le Thi and Pham Dinh (2005). For instance, it is worth mentioning that:

  • DCA is a descent method (without line search but with global convergence): the sequence $\{g(x^k) - h(x^k)\}$ is decreasing.

  • If $g(x^{k+1}) - h(x^{k+1}) = g(x^k) - h(x^k)$, then $x^k$ is a critical point of $g - h$. In such a case, DCA terminates at the $k$th iteration.

  • If the optimal value of problem $(P)$ is finite and the infinite sequences $\{x^k\}$ and $\{y^k\}$ are bounded, then every limit point of the sequence $\{x^k\}$ is a critical point of $g - h$.

  • DCA has linear convergence for general DC programs. In DC programming with subanalytic data, the whole sequence generated by DCA converges, and DCA's convergence rate can be established.

  • In polyhedral DC programs, the sequences $\{x^k\}$ and $\{y^k\}$ contain finitely many elements, and DCA converges to a solution after a finite number of iterations. In particular, if $h$ is polyhedral convex and differentiable at $x^*$, then $x^*$ is a local minimizer of $(P)$.

As mentioned in section 1 and described in the generic DCA scheme, the construction of DCA involves the DC components $g$ and $h$ but not the function $f$ itself. Hence, for a DC program, each DC decomposition corresponds to a different version of DCA, and the search for a “good” DC decomposition is important from an algorithmic point of view. For a given DC program, the choice of an optimal DC decomposition is still open and depends strongly on the specific structure of the problem being considered. In practice, one tries to choose $g$ and $h$ such that the sequences $\{x^k\}$ and $\{y^k\}$ can be easily calculated—either they have an explicit form or their computation is inexpensive. Generally, the sequence $\{y^k\}$ is explicitly computed by using the usual rules for calculating the subdifferential of convex functions. But the solution of the convex program defining $x^{k+1}$, if not explicit, should be achieved by efficient algorithms well adapted to its special structure in order to handle the large-scale setting. How to develop an efficient algorithm based on the generic DCA scheme for a practical problem is thus a sensible question to study. The solution of a nonconvex program by DCA involves two stages: the search for an appropriate DC decomposition of the objective function and the search for a good initial point. For a complete study of DC programming and DCA, the reader is referred to Le Thi and Pham Dinh (2005), Pham Dinh and Le Thi (1997, 1998), and references therein.

Note that with suitable DC decompositions, DCA recovers most standard methods in convex and nonconvex programming, and one finds versions of DCA among methods widely used by the machine learning community. For instance, it is useful to point out that four well-known methods—expectation-maximization (Dempster, Laird, & Rubin, 1977), successive linear approximation (Bradley & Mangasarian, 1998), the convex-concave procedure (Yuille & Rangarajan, 2003), and the iterative shrinkage-thresholding algorithm (Chambolle, Devore, Lee, & Lucier, 1998)—are particular cases of DCA. Moreover, these four methods apply only to differentiable functions and were originally proposed without proofs of convergence.

3  DCA for Solving the Sparse Covariance Matrix Estimation (SCME) Problem

3.1  The Approximation SCME Problem

First, we show that there exists a lower bound $\epsilon > 0$ on the smallest eigenvalue of the solutions to problem 1.2.

Proposition 1.
If $S$ is nonsingular, then there exists $\epsilon > 0$ such that problem 1.2 is equivalent to the following problem,
$$\min\left\{\log\det\Sigma + \operatorname{tr}(S\Sigma^{-1}) + \lambda\|\Sigma\|_0 : \Sigma \succeq \epsilon I\right\}, \tag{3.1}$$
where $I$ denotes the identity matrix, and the notation $\Sigma \succeq \epsilon I$ means that $\Sigma - \epsilon I$ is symmetric positive semidefinite.
Proof.
The proof of this proposition is similar to the one in Bien and Tibshirani (2011) proving a lower bound of the smallest eigenvalue of the solution in the $\ell_1$-regularization model. Assume that $\Sigma^*$ is a solution to problem 1.2. To prove this proposition, it is sufficient to show that there exists $\epsilon > 0$ such that $\lambda_{\min}(\Sigma^*) \ge \epsilon$. From the proof of proposition 1 in Bien and Tibshirani (2011), we have
formula
3.2
where . Since and for some , we obtain . Thus it follows from the inequality 3.2 that
formula
3.3
Let be the set defined by , where . Since is continuous on and
formula
it follows that is a compact set and then there exists . Therefore we get from the inequality 3.3.
Note that if is singular, there exists a vector such that and . Let be vectors such that is an orthogonal basis of . We have and
formula
Thus we obtain
formula
It follows from this that the objective function of problem 1.2 is unbounded below if $S$ is singular. We therefore replace $S$ with $S + \epsilon_0 I$ for some $\epsilon_0 > 0$ when $S$ is singular. Adding $\epsilon_0 I$ to a singular matrix is a usual (and trivial) way to get a nonsingular matrix. Such a procedure implies a small perturbation of the observed data, but it does not have an impact on the model. This technique has already been used by Bien and Tibshirani (2011), who justified that adding $\epsilon_0 I$ to $S$ is equivalent to augmenting the data set with points that do not lie perfectly in the span of the observed data.
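The rank argument above is easy to reproduce numerically. In the sketch below (our illustration), a sample covariance matrix built from fewer observations than variables is singular, and adding a small multiple of the identity restores positive definiteness:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 5, 10                      # fewer observations than dimensions
X = rng.standard_normal((n, p))
Xc = X - X.mean(axis=0)
S = (Xc.T @ Xc) / n               # sample covariance: rank at most n-1 < p

eps0 = 1e-2
S_eps = S + eps0 * np.eye(p)      # perturbed, nonsingular estimate

assert np.min(np.linalg.eigvalsh(S)) < 1e-10            # S is singular
assert np.min(np.linalg.eigvalsh(S_eps)) >= eps0 - 1e-8 # S + eps0*I is PD
```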
The discontinuity of the $\ell_0$-norm is overcome by using a DC approximation function. Define the step function $s : \mathbb{R} \to \mathbb{R}$ by $s(t) = 1$ for $t \ne 0$ and $s(t) = 0$ otherwise. Then $\|\Sigma\|_0 = \sum_{i,j} s(\sigma_{ij})$. The idea of approximation methods is to replace the discontinuous step function $s$ by a continuous approximation $\varphi_\theta$, where $\theta > 0$ is a parameter controlling the tightness of the approximation. The approximation of the $\ell_0$-norm is then defined by
$$\|\Sigma\|_0 \approx \sum_{i,j} \varphi_\theta(\sigma_{ij}). \tag{3.4}$$
This leads to the approximate SCME problem associated with problem 3.1, which takes the form
$$\min\Big\{F(\Sigma) := \log\det\Sigma + \operatorname{tr}(S\Sigma^{-1}) + \lambda\sum_{i,j}\varphi_\theta(\sigma_{ij}) : \Sigma \in \mathcal{C}\Big\}, \tag{3.5}$$
where $\mathcal{C} = \{\Sigma : \Sigma \succeq \epsilon I\}$.
We consider in this work two approximation functions $\varphi^{\mathrm{exp}}_\theta$ and $\varphi^{\mathrm{cap}}_\theta$ given by
$$\varphi^{\mathrm{exp}}_\theta(t) = 1 - e^{-\theta|t|}$$
(the piecewise exponential concave approximation; Bradley & Mangasarian, 1998) and
$$\varphi^{\mathrm{cap}}_\theta(t) = \min(1, \theta|t|)$$
(the Capped-$\ell_1$; Peleg & Meir, 2008). In the sequel, for convenience, we use the common notation $\varphi_\theta$ to designate both $\varphi^{\mathrm{exp}}_\theta$ and $\varphi^{\mathrm{cap}}_\theta$. It has been proved in Le Thi, Pham Dinh et al. (2015) that $\varphi_\theta$ is a DC function verifying their assumption 1, and that with a suitable value of $\theta$ the Capped-$\ell_1$ approximate SCME problem (say, problem 3.5 with $\varphi_\theta = \varphi^{\mathrm{cap}}_\theta$) is equivalent to the original SCME problem 3.1.
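Assuming the standard forms of these two approximations, $1 - e^{-\theta|t|}$ (piecewise exponential) and $\min(1, \theta|t|)$ (Capped-$\ell_1$), they can be coded in a few lines. The sketch below (our illustration; function names are ours) checks that both vanish at zero and tend to the step function pointwise as $\theta$ grows:

```python
import numpy as np

def phi_exp(t, theta):
    """Piecewise exponential approximation (Bradley & Mangasarian, 1998)."""
    return 1.0 - np.exp(-theta * np.abs(t))

def phi_cap(t, theta):
    """Capped-l1 approximation (Peleg & Meir, 2008)."""
    return np.minimum(1.0, theta * np.abs(t))

# Both vanish at 0 and approach 1 for t != 0 as theta grows,
# mimicking the step function s(t)
for phi in (phi_exp, phi_cap):
    assert phi(0.0, theta=5.0) == 0.0
    assert abs(phi(0.5, theta=100.0) - 1.0) < 1e-6
```

The Capped-$\ell_1$ reaches the value 1 exactly once $\theta|t| \ge 1$, which is what makes the exact-penalty equivalence with problem 3.1 possible for a suitable $\theta$.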

We now investigate DCA for solving the nonconvex problem 3.5.

3.2  The First DCA Scheme for Solving the Approximation SCME Problem 3.5

The approximation $\varphi_\theta$ can be presented as a DC function, $\varphi_\theta(t) = g_\theta(t) - h_\theta(t)$, where for the Capped-$\ell_1$,
$$g_\theta(t) = \theta|t|, \qquad h_\theta(t) = \max(\theta|t| - 1, 0), \tag{3.6}$$
and for the piecewise exponential approximation,
$$g_\theta(t) = \theta|t|, \qquad h_\theta(t) = \theta|t| - 1 + e^{-\theta|t|}. \tag{3.7}$$
In addition, we see that $\log\det\Sigma$ is concave in $\Sigma$ (Boyd & Vandenberghe, 2004) while $\operatorname{tr}(S\Sigma^{-1})$ is convex. Indeed, we have
$$\operatorname{tr}(S\Sigma^{-1}) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^T \Sigma^{-1} (x_i - \bar{x}),$$
and the function $\Sigma \mapsto x^T \Sigma^{-1} x$ is convex in $\Sigma \succ 0$ (Boyd & Vandenberghe, 2004). It follows that $\operatorname{tr}(S\Sigma^{-1})$ is convex. Consequently, the following DC decomposition of $F$ seems to be natural,
$$F(\Sigma) = G_1(\Sigma) - H_1(\Sigma), \tag{3.8}$$
where
$$G_1(\Sigma) = \operatorname{tr}(S\Sigma^{-1}) + \lambda\sum_{i,j} g_\theta(\sigma_{ij}) + \chi_{\mathcal{C}}(\Sigma)$$
and
$$H_1(\Sigma) = -\log\det\Sigma + \lambda\sum_{i,j} h_\theta(\sigma_{ij})$$
are clearly convex functions. Now, the optimization problem 3.5 can be rewritten as
$$\min_{\Sigma}\ \{G_1(\Sigma) - H_1(\Sigma)\}. \tag{3.9}$$
According to the generic DCA scheme, at each iteration $k$, we have to compute $Y^k \in \partial H_1(\Sigma^k)$ and then solve the convex program of the form $(P_k)$:
$$\min_{\Sigma}\ \{G_1(\Sigma) - \langle \Sigma, Y^k \rangle\}. \tag{3.10}$$
The computation of $Y^k$ depends on $\varphi_\theta$. More precisely, for $\varphi_\theta = \varphi^{\mathrm{exp}}_\theta$, $Y^k$ is computed by
$$Y^k = -(\Sigma^k)^{-1} + \lambda\theta\left[\operatorname{sign}(\sigma^k_{ij})\big(1 - e^{-\theta|\sigma^k_{ij}|}\big)\right]_{i,j}, \tag{3.11}$$
where $\operatorname{sign}(\sigma^k_{ij})$ denotes the sign of $\sigma^k_{ij}$. For $\varphi_\theta = \varphi^{\mathrm{cap}}_\theta$, $Y^k$ is calculated as
$$Y^k = -(\Sigma^k)^{-1} + \lambda\theta\left[\operatorname{sign}(\sigma^k_{ij})\,\mathbf{1}_{\{\theta|\sigma^k_{ij}| > 1\}}\right]_{i,j}. \tag{3.12}$$
For solving the convex problem 3.10, we have to use an iterative method in convex programming. For instance, we use the generalized gradient descent (GGD) algorithm (Beck & Teboulle, 2009), which iteratively solves the following problem at each iteration $l$:
formula
3.13
where $\alpha_l > 0$ is a suitable step size. Problem 3.13 can then be solved by the alternating direction method of multipliers (ADMM; Boyd, Parikh, Chu, Peleato, & Eckstein, 2011). The augmented Lagrangian function of this problem is
formula
and ADMM solves the following problems at each iteration :
formula
Let $T_\gamma$ be the elementwise soft-thresholding operator defined by $[T_\gamma(X)]_{ij} = \operatorname{sign}(x_{ij}) \max(|x_{ij}| - \gamma, 0)$. Then, finally, DCA for solving the problem 3.9 can be described as follows:

DCA1

Initialization: Let the tolerance $\tau > 0$ be sufficiently small, set $k = 0$, and choose an initial point $\Sigma^0$.

repeat

1. Compute $Y^k$ according to equation 3.11 (resp. equation 3.12) when $\varphi_\theta = \varphi^{\mathrm{exp}}_\theta$ (resp. $\varphi_\theta = \varphi^{\mathrm{cap}}_\theta$).

2. Set and choose .

repeat

- Compute .

- Set , choose .

repeat

- Compute where and .

- Compute .

- Compute .

- .

until.

- .

- .

until.

3. .

4. .

until or .
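The elementwise soft-thresholding operator used inside the ADMM step of DCA1 admits a one-line implementation (a sketch; the name `soft_threshold` is ours):

```python
import numpy as np

def soft_threshold(X, gamma):
    """Elementwise soft-thresholding: sign(x) * max(|x| - gamma, 0)."""
    return np.sign(X) * np.maximum(np.abs(X) - gamma, 0.0)

X = np.array([[1.5, -0.2], [0.05, -2.0]])
out = soft_threshold(X, 0.5)
# Entries with |x| <= 0.5 are zeroed; the others shrink toward 0 by 0.5
```

It is this operator that produces exact zeros in the iterates and hence the sparsity of the estimated covariance matrix.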

The complexity of one iteration of DCA1 is determined as follows. The computation of the subgradient $Y^k$ needs $O(p^3)$ operations (it involves a matrix inversion). At each iteration of ADMM, the computation of the eigenvalue decomposition also needs $O(p^3)$ operations. Therefore, one iteration of DCA1 requires $O(m_1 m_2 p^3)$ operations, where $m_1$ and $m_2$ are the numbers of iterations of the GGD and ADMM algorithms, respectively. Thus, the complexity of DCA1 is
$$O(m_0\, m_1\, m_2\, p^3), \tag{3.14}$$
where $m_0$ denotes the number of iterations of DCA1.

DCA1 might be quite expensive because we have to use two iterative methods for solving the convex subproblem 3.10. This motivates us to consider another DC formulation of problem 3.5.

3.3  The Second DCA Scheme for Solving the Approximation SCME Problem 3.5

Observing that the convex problem 3.10 is difficult due to the presence of the term $\operatorname{tr}(S\Sigma^{-1})$ in $G_1$, we seek another DC decomposition by moving this term to the second DC component. We then propose the following second DC formulation of problem 3.5:
$$F(\Sigma) = G_2(\Sigma) - H_2(\Sigma), \tag{3.15}$$
where
$$G_2(\Sigma) = \frac{\rho}{2}\|\Sigma\|_F^2 + \lambda\sum_{i,j} g_\theta(\sigma_{ij}) + \chi_{\mathcal{C}}(\Sigma) \tag{3.16}$$
and
$$H_2(\Sigma) = \frac{\rho}{2}\|\Sigma\|_F^2 - \log\det\Sigma - \operatorname{tr}(S\Sigma^{-1}) + \lambda\sum_{i,j} h_\theta(\sigma_{ij}) \tag{3.17}$$
are convex functions when $\rho$ is large enough. For estimating $\rho$, we state the following lemma.
Lemma 1.

If $\rho \ge 2\|S\|_2/\epsilon^3$, then $H_2$ is convex.

Proof.
Since the functions $-\log\det\Sigma$ and $\lambda\sum_{i,j}h_\theta(\sigma_{ij})$ are convex and the sum of convex functions is also convex, it is sufficient to show that $\frac{\rho}{2}\|\Sigma\|_F^2 - \operatorname{tr}(S\Sigma^{-1})$ is convex, that is, that $\rho$ is greater than the spectral radius of the Hessian matrix of $f(\Sigma) := \operatorname{tr}(S\Sigma^{-1})$. We have
$$\rho\big(\nabla^2 f(\Sigma)\big) \le \big\|\nabla^2 f(\Sigma)\big\|_2, \tag{3.18}$$
where $\rho(\nabla^2 f(\Sigma))$ and $\|\nabla^2 f(\Sigma)\|_2$ are the spectral radius and the spectral norm of the Hessian matrix of $f$, respectively. The differential of $f$ is defined by
$$df(\Sigma) = -\operatorname{tr}\big(\Sigma^{-1} S \Sigma^{-1}\, d\Sigma\big).$$
Hence we get the gradient of $f$ as follows:
$$\nabla f(\Sigma) = -\Sigma^{-1} S \Sigma^{-1}. \tag{3.19}$$
We recall the product rule for matrix derivatives
formula
where . Using this product rule, we have
formula
Moreover, by the product rule, we also obtain and
formula
Therefore the Hessian of $f$ is
$$\nabla^2 f(\Sigma) = \Sigma^{-1} S \Sigma^{-1} \otimes \Sigma^{-1} + \Sigma^{-1} \otimes \Sigma^{-1} S \Sigma^{-1}.$$
Using the property $\|A \otimes B\|_2 = \|A\|_2 \|B\|_2$, we finally get
$$\big\|\nabla^2 f(\Sigma)\big\|_2 \le 2\,\big\|\Sigma^{-1} S \Sigma^{-1}\big\|_2\,\big\|\Sigma^{-1}\big\|_2 \le \frac{2\|S\|_2}{\epsilon^3}. \tag{3.20}$$
We can deduce from equation 3.20 that $\rho(\nabla^2 f(\Sigma)) \le 2\|S\|_2/\epsilon^3$, and then, by the inequality 3.18, $H_2$ is convex whenever $\rho \ge 2\|S\|_2/\epsilon^3$.
Remark 1.

From lemma 1, we can choose $\rho = 2\|S\|_2/\epsilon^3$.

Applying DCA to the problem 3.15, we have to compute $Y^k \in \partial H_2(\Sigma^k)$ and then solve the convex program of the form $(P_k)$, say,
$$\min_{\Sigma}\ \{G_2(\Sigma) - \langle \Sigma, Y^k \rangle\}. \tag{3.21}$$
The computation of $Y^k$ is given by, for $\varphi_\theta = \varphi^{\mathrm{exp}}_\theta$,
$$Y^k = \rho\Sigma^k - (\Sigma^k)^{-1} + (\Sigma^k)^{-1} S (\Sigma^k)^{-1} + \lambda\theta\left[\operatorname{sign}(\sigma^k_{ij})\big(1 - e^{-\theta|\sigma^k_{ij}|}\big)\right]_{i,j}, \tag{3.22}$$
and for $\varphi_\theta = \varphi^{\mathrm{cap}}_\theta$:
$$Y^k = \rho\Sigma^k - (\Sigma^k)^{-1} + (\Sigma^k)^{-1} S (\Sigma^k)^{-1} + \lambda\theta\left[\operatorname{sign}(\sigma^k_{ij})\,\mathbf{1}_{\{\theta|\sigma^k_{ij}| > 1\}}\right]_{i,j}. \tag{3.23}$$
For solving the convex subproblem 3.21, we use the ADMM algorithm. The augmented Lagrangian function of the subproblem is
formula
More specifically, ADMM solves the following problems at each iteration :
formula
3.24
formula
3.25
formula
3.26

Hence DCA for solving the problem 3.15 can be described as follows.

DCA2

Initialization: Let the tolerance $\tau > 0$ be sufficiently small, set $k = 0$, and choose an initial point $\Sigma^0$.

repeat

1. Compute $Y^k$ according to equation 3.22 (resp. equation 3.23) when $\varphi_\theta = \varphi^{\mathrm{exp}}_\theta$ (resp. $\varphi_\theta = \varphi^{\mathrm{cap}}_\theta$).

2. Set , choose .

repeat

1. Compute where and .

2. Compute .

3. Compute

4. .

until.

3. .

4. .

until or .

We observe that the convex subproblem 3.21 is easier to solve than the convex subproblem 3.10 in DCA1: it requires only one iterative algorithm (ADMM), whose updates are explicit at each iteration. The complexity of DCA2 is
$$O(m_3\, m_4\, p^3), \tag{3.27}$$
where $m_3$ and $m_4$ denote the numbers of iterations of DCA2 and ADMM, respectively.

3.4  Convergence Analysis

Now we prove the convergence properties of DCA1 and DCA2. For a DC program, the convergence of DCA is guaranteed when the optimal value is finite and the sequences generated by DCA are bounded.

Lemma 2.

The optimal value of the problem 3.5 is finite.

Proof.

Since $\Sigma$ is positive definite and $S$ is positive semidefinite, $\operatorname{tr}(S\Sigma^{-1})$ is always nonnegative. On the other hand, $\log\det\Sigma \ge p\log\epsilon$ for all $\Sigma \succeq \epsilon I$, and $\varphi_\theta(\sigma_{ij}) \ge 0$ for all $i, j$. Hence $F(\Sigma) \ge p\log\epsilon$ for all feasible $\Sigma$.

Lemma 3.

Let $A$ and $B$ be positive-definite matrices. We have

  1. .

  2. .

  3. .

Proof.

a. Part a is a consequence of theorem III.2.1 in Bhatia (1997).

b. For any positive-definite matrix , we have
formula
where denotes the operator norm of . Hence, we get
formula
Since the eigenvalues of the inverse matrix are the inverses of the eigenvalues, we obtain
formula
c. First, we will show that and . Let and be the eigendecomposition of and , respectively. We have
formula
Note that and is a diagonal matrix with diagonal entries . Then, are also the eigenvalues of . It follows that . Similarly, we have
formula
Since and is a diagonal matrix with the diagonal entries , we get .
On the other hand, we have . Then, from part b of the lemma, we obtain
formula

The convergence properties of DCA1 and DCA2 are given in the following theorem:

Theorem 1.

Let $\{\Sigma^k\}$ be the sequence generated by DCA1 (resp. DCA2). We have

  1. The sequence $\{F(\Sigma^k)\}$ is decreasing.

  2. The sequence $\{\Sigma^k\}$ is bounded.

  3. $\sum_{k=0}^{\infty} \|\Sigma^{k+1} - \Sigma^k\|_F^2 < +\infty$, and hence $\|\Sigma^{k+1} - \Sigma^k\|_F \to 0$ as $k \to \infty$.

  4. The sequence $\{\Sigma^k\}$ has at least one limit point, and every limit point of this sequence is a critical point of problem 3.9 (resp., problem 3.15).

Proof.

a. This is a direct consequence of the convergence properties of general DC programs.

b. First, we prove that the level set $\mathcal{L}(\Sigma^0) := \{\Sigma \in \mathcal{C} : F(\Sigma) \le F(\Sigma^0)\}$ is bounded. Assume that this level set is not bounded. Then there is a sequence $\{\Sigma^l\} \subset \mathcal{L}(\Sigma^0)$ such that $\|\Sigma^l\|_F \to \infty$ as $l \to \infty$. Let $\lambda^l_1 \ge \lambda^l_2 \ge \cdots \ge \lambda^l_p \ge \epsilon$ be the eigenvalues of $\Sigma^l$. We have
formula
Besides, we have as . Thus, as . Since is always nonnegative and , we have
formula
3.28
It follows that $F(\Sigma^l) \to \infty$ as $l \to \infty$. But the fact that $\Sigma^l \in \mathcal{L}(\Sigma^0)$ implies $F(\Sigma^l) \le F(\Sigma^0)$ for all $l$. Thus, we have a contradiction, and the level set $\mathcal{L}(\Sigma^0)$ is bounded.

Since the sequence $\{F(\Sigma^k)\}$ is monotonically decreasing, we have $F(\Sigma^k) \le F(\Sigma^0)$ for all $k$. This and the boundedness of the level set imply property b.

c. We will show that the first DC components ($G_1$ and $G_2$) are strongly convex. By the definition of $G_2$, it is obvious that $G_2$ is strongly convex. As for $G_1$, it is sufficient to show that $\operatorname{tr}(S\Sigma^{-1})$ is strongly convex. From the proof of lemma 1, we have
formula
3.29
Applying lemma 3, we get
formula
As mentioned in section 3.1, we assumed that $S$ is positive definite. If $S$ is not of full rank, we replace $S$ by $S + \epsilon_0 I$ for some $\epsilon_0 > 0$. Thus, $\lambda_{\min}(S) > 0$, and $\operatorname{tr}(S\Sigma^{-1})$ is strongly convex.
Let $\{\Sigma^k\}$ be the sequence generated by DCA1. If $\{\Sigma^k\}$ is generated by DCA2, then part c of the theorem is proved analogously. Recall that $\Sigma^{k+1}$ is an optimal solution of the problem
formula
where $Y^k \in \partial H_1(\Sigma^k)$. Then the first-order optimality condition holds at $\Sigma^{k+1}$, that is, $Y^k \in \partial G_1(\Sigma^{k+1})$, and then
formula
3.30
Hence we have
formula
3.31
where $\mu > 0$ is the modulus of the strong convexity of $G_1$. Since $Y^k \in \partial H_1(\Sigma^k)$, we also have
formula
3.32
Combining the inequalities 3.31 and 3.32, we have
formula
Moreover, since is strongly convex, . Hence we obtain
formula
3.33
Let $N$ be a positive integer. Summing the inequality 3.33 from $k = 0$ to $k = N$, we get
formula
3.34
On the other hand, from the proof of lemma 2 we see that the sequence $\{F(\Sigma^k)\}$ is bounded below. Combining this and the inequality 3.34, we get
formula
Taking the limit as $N \to \infty$, we obtain
formula
3.35
and hence $\|\Sigma^{k+1} - \Sigma^k\|_F \to 0$ as $k \to \infty$.

Part d is deduced from part b, lemma 2, and DCA's convergence property iii in section 2.

4  Numerical Experiments

4.1  Comparative Algorithms

For each DCA scheme, we use the two DC approximations of the $\ell_0$-norm. Hence we have four DCA-based algorithms: DCA1-CaP, DCA1-PiE, DCA2-CaP, and DCA2-PiE, where CaP and PiE denote the algorithm using the Capped-$\ell_1$ ($\varphi^{\mathrm{cap}}_\theta$) and the piecewise exponential approximation function ($\varphi^{\mathrm{exp}}_\theta$), respectively. To study the performance of the proposed algorithms, we compare them with seven standard methods that cover all the types of algorithms mentioned in section 1.

  • Methods following the first direction (estimating the sparse inverse covariance matrix): CLIME (Cai et al., 2011) and SPME (Zhang & Zou, 2014).

  • Methods following the second direction (estimating the sparse covariance matrix directly) in the first group (the convex approach), that is, methods using a surrogate convex loss function in place of the negative log-likelihood together with the -norm (resp. -norm): PDSCE (Rothman, 2012) (resp. PCME; Deng & Tsui, 2013).

  • Methods following the second direction in the second group, which replace the -norm by the -norm: SPCOV1 and SPCOV2 (Bien & Tibshirani, 2011).

Moreover, for the two real applications (classification and portfolio selection), we consider in addition a method using the sample covariance matrix: the quadratic discriminant algorithm (QDA). We also compare the four DCA-based algorithms to study the performance of the two DC approximations of the -norm, as well as the two different DC decompositions.

4.1.1  Constrained Minimization Approach to Sparse Precision Matrix Estimation (CLIME)

CLIME (Cai et al., 2011) estimated an inverse covariance matrix by solving the following problem,
formula
4.1
where is a tuning parameter. The convex program 4.1 can be further decomposed into vector minimization problems that are solved by a primal-dual interior-point method. The clime package for CLIME is available from CRAN (http://cran.r-project.org/).
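The column-by-column decomposition can be illustrated with a small linear-programming sketch. This is an illustration using `scipy.optimize.linprog`, not the primal-dual interior-point solver of the clime package; `clime_sketch` is a hypothetical helper name, and the formulation (minimize the elementwise 1-norm of each column subject to an elementwise infinity-norm constraint, then symmetrize by keeping the smaller-magnitude entry) follows the standard description of CLIME:

```python
import numpy as np
from scipy.optimize import linprog

def clime_sketch(S, lam):
    """Sketch of CLIME: for each column j, solve
        min ||beta||_1  s.t.  ||S @ beta - e_j||_inf <= lam,
    then symmetrize by keeping the entry of smaller magnitude."""
    p = S.shape[0]
    Omega = np.zeros((p, p))
    # beta = u - v with u, v >= 0; the block matrix encodes |S(u-v) - e_j| <= lam
    A_ub = np.block([[S, -S], [-S, S]])
    for j in range(p):
        e = np.zeros(p)
        e[j] = 1.0
        b_ub = np.concatenate([lam + e, lam - e])
        res = linprog(np.ones(2 * p), A_ub=A_ub, b_ub=b_ub,
                      bounds=[(0, None)] * (2 * p))
        Omega[:, j] = res.x[:p] - res.x[p:]
    keep = np.abs(Omega) <= np.abs(Omega.T)
    return np.where(keep, Omega, Omega.T)
```

For instance, with the identity as sample covariance and a small tuning parameter, each column solution simply shrinks the corresponding unit vector toward zero by the tolerance.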

4.1.2  Sparse Precision Matrix Estimation via Lasso Penalized D-Trace Loss (SPME)

SPME (Zhang & Zou, 2014) used a surrogate loss function instead of the negative log-likelihood function with the penalty on the precision matrix , namely,
formula
4.2
where is a tuning parameter and . Zhang and Zou (2014) used the alternating direction method of multipliers for solving problem 4.2. The source code of this method is available at https://math.cos.ucf.edu/tengz/.
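The D-trace loss and its gradient are cheap to evaluate, which is what makes the ADMM iterations inexpensive. The sketch below assumes the commonly cited form of the loss, L(Θ) = ½⟨Θ², Σ̂⟩ − tr(Θ) with gradient ½(Σ̂Θ + ΘΣ̂) − I (an assumption, since the displayed formula is not reproduced here); the gradient vanishes at Θ = Σ̂⁻¹:

```python
import numpy as np

def dtrace_loss(Theta, S):
    # assumed D-trace form: 0.5 * <Theta^2, S> - tr(Theta)
    return 0.5 * np.trace(Theta @ Theta @ S) - np.trace(Theta)

def dtrace_grad(Theta, S):
    # gradient: 0.5 * (S @ Theta + Theta @ S) - I, zero at Theta = inv(S)
    return 0.5 * (S @ Theta + Theta @ S) - np.eye(S.shape[0])
```

Unlike the negative log-likelihood, this loss is quadratic in Θ, so each penalized subproblem has a closed-form proximal structure.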

4.1.3  Positive-Definite Sparse Covariance Estimators (PDSCE)

Rothman (2012) proposed the sparse covariance matrix estimator by solving the following problem,
formula
4.3
where is a tuning parameter and is fixed at a small value. The author developed a blockwise coordinate descent algorithm to compute the solution to the problem 4.3. The PDSCE algorithm is included in the PDSCE package of R software.

4.1.4  Penalized Covariance Matrix Estimation Using a Matrix Logarithm Transformation (PCME)

The PCME method (Deng & Tsui, 2013) used an estimate of covariance matrix , where solves the following problem:
formula
4.4
where is a tuning parameter. The Matlab code for PCME is available at http://www.tandfonline.com/doi/suppl/10.1080/10618600.2012.715556.

4.1.5  Sparse Estimation of a COVariance Matrix (SPCOV)

Bien and Tibshirani (2011) considered the negative log-likelihood function with an penalty on the entries of the covariance matrix, namely,
formula
4.5
where is an arbitrary matrix with nonnegative elements. Problem 4.5 is nonconvex. Bien and Tibshirani (2011) used the majorization-minimization (MM) approach to solve this problem (it is, in fact, a version of DCA). Denote by SPCOV1 (resp. SPCOV2) the MM approach for solving problem 4.5 with if and 1 otherwise (resp. with if and otherwise) (Bien & Tibshirani, 2011). The R package spcov for SPCOV1 and SPCOV2 is available from CRAN.
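For reference, the objective can be evaluated directly. The sketch below assumes the standard form of the penalized negative log-likelihood in Bien and Tibshirani (2011), log det(Σ) + tr(SΣ⁻¹) + λ‖P∘Σ‖₁ with ∘ the elementwise product (an assumption, since the displayed formula is not reproduced here); `spcov_objective` is an illustrative helper, not part of the spcov package:

```python
import numpy as np

def spcov_objective(Sigma, S, P, lam):
    """Assumed objective of problem 4.5:
    log det(Sigma) + tr(S @ inv(Sigma)) + lam * sum_ij |P_ij * Sigma_ij|."""
    sign, logdet = np.linalg.slogdet(Sigma)
    if sign <= 0:
        raise ValueError("Sigma must be positive definite")
    return (logdet + np.trace(S @ np.linalg.inv(Sigma))
            + lam * np.abs(P * Sigma).sum())
```

Choosing P with zero diagonal (as in one of the two SPCOV variants) leaves the variances unpenalized and shrinks only the covariances.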

4.2  Experimental Setups

The proposed algorithms are implemented in R, and all algorithms are run on a PC with an Intel Core i7-3770 CPU (3.40 GHz) and 8 GB of RAM.

In the experiments, we set the stopping tolerance for the ADMM, GGD, and DCA-based algorithms. The initial point of DCA is the sample covariance matrix . By exploiting the special structure of the problem, we found that could be a good initial point for the DCA-based algorithms. This is justified by the fact that is an optimal solution of problem 1.2 without the regularization. When is singular, we replace by , where is chosen through five-fold cross-validation on a set of candidates .
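The choice of initial point can be sketched as follows. This is a minimal illustration: in the paper the regularization constant is chosen by five-fold cross-validation over a candidate grid that is not reproduced here, so the grid below and the rule of taking the smallest candidate restoring positive definiteness are assumptions:

```python
import numpy as np

def initial_point(X, eps_candidates=(1e-4, 1e-3, 1e-2, 1e-1)):
    """Sample covariance matrix, regularized by eps*I when singular."""
    S = np.cov(X, rowvar=False, bias=True)   # MLE scaling (1/n)
    p = S.shape[0]
    if np.linalg.matrix_rank(S) == p:
        return S                             # already nonsingular
    for eps in eps_candidates:
        S_eps = S + eps * np.eye(p)
        if np.linalg.eigvalsh(S_eps).min() > 0:
            return S_eps
    return S + eps_candidates[-1] * np.eye(p)
```

In the high-dimensional regime (n < p) the sample covariance is always singular, so the ridge term is always active there.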

The values of parameter and approximation parameter of the Capped- are chosen through a five-fold cross-validation procedure on sets of candidates given by and , respectively. The approximation parameter of the piecewise exponential approximation function is chosen as , which gives a reasonable approximation of the -norm as suggested in Bradley and Mangasarian (1998). From the proof of proposition 1, we can compute using the Newton method.

We know that is a sufficient condition for the function to be convex. But it is not a necessary condition; the function may still be convex for some . From a numerical point of view, we observe that among the values for which is convex, the smaller is, the more efficient DCA2 can be. Hence, to find a good value of in our experiments, we use an updating rule for in DCA2 as follows. Starting DCA2 with , at each iteration we halve as long as the objective function still decreases (i.e., as long as DCA2 still works). We take the smallest such value of and then run DCA2 to convergence with this value.
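The updating rule can be rendered generically as below. This is a toy sketch of the control flow only: `step(x, tau)` is a hypothetical stand-in for one inner DCA2 iteration, and `f` for the objective; the parameter is halved while the objective keeps decreasing, after which the smallest working value is kept fixed:

```python
def run_with_tau_halving(step, f, x0, tau0, max_iter=100, tol=1e-6):
    """Generic sketch: halve tau while the objective still decreases,
    then continue with the smallest working tau until convergence."""
    x, tau = x0, tau0
    for _ in range(max_iter):
        x_new = step(x, tau / 2)
        if f(x_new) < f(x) - tol:
            tau /= 2           # halving still works: accept and keep shrinking
            x = x_new
        else:
            break              # smallest working tau found
    for _ in range(max_iter):  # fixed-tau phase until convergence
        x_new = step(x, tau)
        if abs(f(x_new) - f(x)) < tol:
            break
        x = x_new
    return x, tau
```

On a toy quadratic with a gradient step as the inner iteration, the scheme settles on a step size and converges to the minimizer.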

4.3  Numerical Results on Synthetic Data Sets

We evaluate the performance of DCA1-CaP, DCA2-CaP, DCA1-PiE, and DCA2-PiE on four synthetic data sets. We generate from a multivariate normal distribution , where is a sparse symmetric positive-definite matrix. We consider three types of covariance graphs and a moving average model as follows (Bien & Tibshirani, 2011):

  • Cliques model: , where are dense matrices.

  • Hubs model: as in the cliques model; however, each submatrix is zero except for the elements in its last row and last column. This corresponds to a graph with five connected components, each of which has all nodes connected to one particular node.

  • Random model: In this model, we take to be nonzero with probability 0.02, independently of the other elements.

  • First-order moving average model: We generate to be nonzero for .

In the first three cases, the nonzero entries of matrix are randomly drawn in the set . In the moving average model, all nonzero values are set to be 0.4. In this experiment, for each covariance model, we generate training sets with size .
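The four models can be generated along the following lines. This is a sketch: the block sizes and the nonzero value set {−0.5, −0.4, 0.4, 0.5} are illustrative assumptions (the original value set is not reproduced above), while the moving-average value 0.4 follows the text; positive definiteness is restored by a diagonal shift:

```python
import numpy as np

def make_sigma(p=100, model="random", seed=0):
    """Sketch of the four sparse covariance models."""
    rng = np.random.default_rng(seed)
    vals = np.array([-0.5, -0.4, 0.4, 0.5])   # illustrative nonzero values
    A = np.zeros((p, p))
    b = p // 5                                 # five diagonal blocks
    if model == "cliques":                     # dense blocks
        for k in range(5):
            A[k*b:(k+1)*b, k*b:(k+1)*b] = np.triu(
                rng.choice(vals, size=(b, b)), 1)
    elif model == "hubs":                      # one hub node per block
        for k in range(5):
            A[k*b:(k+1)*b - 1, (k+1)*b - 1] = rng.choice(vals, size=b - 1)
    elif model == "random":                    # nonzero w.p. 0.02
        mask = rng.random((p, p)) < 0.02
        A = np.triu(np.where(mask, rng.choice(vals, size=(p, p)), 0.0), 1)
    elif model == "moving":                    # first-order moving average
        A = np.diag(np.full(p - 1, 0.4), 1)
    Sigma = A + A.T
    # shift the diagonal so that Sigma is positive definite
    Sigma += (abs(np.linalg.eigvalsh(Sigma).min()) + 0.1) * np.eye(p)
    return Sigma
```

Training data are then drawn from the multivariate normal distribution with this covariance, e.g. via `rng.multivariate_normal`.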

To evaluate the performance of each method, we consider three loss functions: the root-mean-square error (RMSE), the entropy loss (EN), and the Kullback-Leibler loss (KL), respectively:
formula
where is a sparse estimate of the covariance matrix . We also consider the log-likelihood function (LL) given by
formula
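The three losses can be computed as below. The sketch uses commonly adopted definitions (an assumption: the RMSE normalization and the exact EN/KL forms below are standard choices and may differ from the authors' displayed formulas): RMSE as the scaled Frobenius distance, and EN/KL as tr(M) − log det(M) − p with M the appropriate ratio of the two matrices:

```python
import numpy as np

def losses(Sigma_hat, Sigma):
    """Assumed standard forms of the three evaluation losses."""
    p = Sigma.shape[0]
    rmse = np.linalg.norm(Sigma_hat - Sigma, "fro") / p
    M = np.linalg.solve(Sigma, Sigma_hat)          # Sigma^{-1} Sigma_hat
    en = np.trace(M) - np.linalg.slogdet(M)[1] - p # entropy loss
    N = np.linalg.solve(Sigma_hat, Sigma)          # Sigma_hat^{-1} Sigma
    kl = np.trace(N) - np.linalg.slogdet(N)[1] - p # Kullback-Leibler loss
    return {"RMSE": rmse, "EN": en, "KL": kl}
```

All three losses are zero exactly when the estimate coincides with the true covariance matrix.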
The cross-validation procedure on synthetic data sets is described as follows (Bien & Tibshirani, 2011). For , let , and let denote the complement of . We divide into five subsets, , and then compute
formula
4.6
where is an estimate of the covariance matrix with the parameter and , and . Finally, we choose .
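The selection loop of equation 4.6 can be sketched generically: fit on the complement of each fold, score the held-out Gaussian log-likelihood (up to additive constants), and keep the best-scoring candidate. Here `estimate(X, lam)` is a hypothetical stand-in for any of the covariance estimators, and the zero-mean held-out score is an assumption:

```python
import numpy as np

def cv_select(X, candidates, estimate, K=5, seed=0):
    """Five-fold selection sketch following equation 4.6."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)

    def heldout_ll(Sigma, Xk):               # Gaussian log-likelihood, up to constants
        Sk = (Xk.T @ Xk) / Xk.shape[0]
        return -np.linalg.slogdet(Sigma)[1] - np.trace(Sk @ np.linalg.inv(Sigma))

    scores = []
    for lam in candidates:
        s = 0.0
        for k in range(K):
            train = np.concatenate([folds[j] for j in range(K) if j != k])
            s += heldout_ll(estimate(X[train], lam), X[folds[k]])
        scores.append(s / K)
    return candidates[int(np.argmax(scores))]
```

With data drawn from a well-conditioned model, heavy additional shrinkage hurts the held-out likelihood, so small regularization values are selected.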

The experimental results on synthetic data sets are given in Table 1. In this table, the averages of the root-mean-square error (RMSE), entropy loss (EN), Kullback-Leibler loss (KL), log likelihood (LL), number of nonzero elements whose absolute values are not smaller than (NZ), CPU time in seconds, and number of iterations of DCA () over 10 samples are reported.

Table 1:
Comparative Results of DCA1-CaP, DCA2-CaP, DCA1-PiE, DCA2-PiE, SPCOV1, SPCOV2, CLIME, PCME, SPME, and PDSCE on the Simulated Data Sets in Terms of the Average of Root-Mean-Square Error (RMSE), Entropy Loss (EN), Kullback-Leibler Loss (KL), Log Likelihood (LL), Number of Nonzero Elements, CPU Time in Seconds and Number of Iterations of DCA () over 10 Runs.
DCA1-CaPDCA2-CaPDCA1-PiEDCA2-PiESPCOV1SPCOV2CLIMEPCMESPMEPDSCE
Cliques 
RMSE 0.399 0.381 0.456 0.413 0.398 0.384 0.387 0.474 0.544 0.418 
 (0.003) (0.004) (0.01) (0.008) (0.004) (0.005) (0.004) (0.003) (0.005) (0.004) 
EN 15.37 14.15 14.99 13.54 15.77 16.83 83.54 27.85 31.81 80.42 
 (0.95) (0.61) (1.07) (0.7) (0.57) (0.53) (3.32) (0.36) (0.62) (2.68) 
KL 23.23 20.56 15.77 14.29 31.43 34.39 21.33 71.23 47.12 21.02 
 (4.11) (1.22) (1.28) (0.59) (2.23) (2.04) (0.32) (1.94) (1.33) (0.21) 
LL 143.66 142.39 146.45 144.76 151.58 148.24 149.63 146.51 166.05 146.7 
 (0.52) (0.48) (0.7) (0.82) (0.89) (0.82) (0.41) (0.48) (0.83) (0.36) 
NZ 2623 2419.6 1011.4 1018.2 3775.4 3565.8 8998.4 9847 2624.4 2755.6 
 (510.08) (209.34) (30.21) (18.59) (156.13) (94.57) (18.37) (15.72) (821.28) (72.65) 
CPUs 76.96 70.77 130.97 114.06 180.04 112.99 687.95 223.87 33.51 2.72 
 907.5 1475.3 3531.4 3650.11 – – – – – – 
Hubs 
RMSE 0.085 0.077 0.062 0.072 0.073 0.073 0.194 0.237 0.183 0.109 
 (0.004) (0.004) (0.008) (0.006) (0.003) (0.004) (0.084) (0.002) (0.007) (0.003) 
EN 3.32 3.09 1.53 1.98 3.54 2.63 264.8 29.03 61.46 141.61 
 (0.25) (0.34) (0.23) (0.38) (0.27) (0.21) (56.5) (0.76) (1.53) (2.35) 
KL 4.44 3.83 1.72 2.2 5.79 3.6 18.94 64.92 22.41 13.99 
 (0.38) (0.35) (0.27) (0.34) (0.73) (0.29) (3.85) (2.45) (0.82) (0.19) 
LL 116.36 116.52 112.49 115.87 118.24 116.41 127.3 115.65 124.85 121.68 
 (0.59) (0.46) (0.58) (0.53) (0.89) (0.54) (0.57) (0.54) (0.3) (0.38) 
NZ 552.6 530.2 301.2 328.4 879 582.2 1174 9493.8 4640 597.6 
 (25.02) (21.36) (5.38) (12.51) (33.06) (24.5) (124.21) (30.68) (1195.54) (30.56) 
CPUs 96.1 53.43 74.63 49.52 99.73 86.31 676.59 279.42 29.68 3.24 
 7758.4 6355.6 4815.1 7783.4 – – – – – – 
Random 
RMSE 0.096 0.086 0.051 0.052 0.086 0.066 0.089 0.177 0.125 0.083 
 (0.001) (0.002) (0.003) (0.003) (0.002) (0.002) (0.002) (0.001) (0.003) (0.002) 
EN 5.42 3.91 1.58 1.61 3.9 2.47 32 23.57 16.21 27.69 
 (0.43) (0.15) (0.21) (0.19) (0.16) (0.15) (4.06) (0.35) (0.63) (2.04) 
KL 5.68 4.53 1.7 1.74 5.07 3.02 7.41 52.61 12.17 6.67 
 (0.42) (0.22) (0.24) (0.21) (0.49) (0.24) (0.16) (1.17) (0.49) (0.15) 
LL 103.4 103.6 102.93 103.56 106.92 104.38 105.3 108.67 108.39 106.1 
 (0.39) (0.46) (0.46) (0.36) (0.66) (0.43) (0.29) (0.25) (0.26) (0.31) 
NZ 527.2 604.2 287.8 289 791.2 518.4 8972.6 9349.6 5932.8 614.22 
 (26.73) (31.6) (4.68) (5.31) (46.64) (20.09) (449.65) (37.82) (943.41) (37.28) 
CPUs 124.39 82.6 156.26 30.08 127.65 87.3 598.09 292.6 29.48 3.25 
 7991.9 14481.1 7500.2 5747.3 – – – – – – 
Moving 
RMSE 0.009 0.01 0.015 0.012 0.038 0.015 0.025 0.05 0.044 0.021 
 (0.0007) (0.0004) (0.0008) (0.0008) (0.0007) (0.0009) (0.0007) (0.004) (0.0009) (0.001) 
EN 1.02 1.14 2.05 1.51 7.95 2.29 41.23 30.64 171.97 82.13 
 (0.11) (0.09) (0.15) (0.15) (0.38) (0.17) (1.65) (11.84) (4.23) (3.5) 
KL 1.09 1.18 2.48 1.66 11.71 2.84 15.21 43.44 37.23 20.65 
 (0.19) (0.15) (0.22) (0.21) (0.82) (0.25) (0.3) (8.2) (0.89) (0.63) 
LL 10.05 11.78 12.4 11.07 13.29 13.15 20.04 22.6 28.77 23.79 
 (0.31) (0.42) (0.35) (0.28) (0.22) (0.39) (0.4) (0.25) (0.37) (0.36) 
NZ 298.4 298.4 451.4 361.6 1380 591.8 9986.4 8440.66 4311 640 
 (1.2) (1.2) (26.3) (14.41) (53.4) (25.16) (6.24) (213.4) (2258.65) (30.24) 
CPUs 48.91 6.67 86.4 27.81 110.86 78.98 679.42 258.07 29.45 8.85 
 2301.6 1834.2 2561.29 4385.7 – – – – – – 

Notes: The numbers in parentheses are standard deviations. The best result in each row is in bold.

We observe from Table 1 that in the cliques model, DCA2-CaP gives the lowest root-mean-square error and log likelihood, while DCA2-PiE gives the best results in terms of the entropy loss and Kullback-Leibler loss. We further note that in terms of sparsity (the number of nonzero elements), the DCA-based algorithms achieve much better performance than the other six approaches.

For the hubs and random models, the best and the second-best performing methods with respect to the losses and the sparsity are DCA1-PiE and DCA2-PiE, respectively.

When using the moving average model, DCA1-CaP outperforms the others on the loss and sparsity measures, although its margin over DCA2-CaP is slight. Notably, the results obtained with DCA2-CaP and DCA2-PiE are still superior to those of SPCOV1, SPCOV2, CLIME, PCME, SPME, and PDSCE.

Regarding the training time, DCA2-CaP and DCA2-PiE are remarkably faster than the other algorithms in the nonconvex approach that keeps the negative log-likelihood function, namely DCA1-CaP, DCA1-PiE, SPCOV1, and SPCOV2, as well as the two methods CLIME and PCME. This can be explained by the fact that DCA2-CaP and DCA2-PiE lead to sequences of convex subproblems that are easily solved by an explicit ADMM algorithm. As for the convex approaches SPME and PDSCE, not surprisingly, they are faster than the DCA-based algorithms but achieve much worse performance on all losses as well as on sparsity. The ratio of gain of DCA varies from 1.1 to 168.6 in terms of losses and from 2 to 20.6 in terms of sparsity.

4.4  Numerical Results on Real Data Sets

We illustrate the use of the sparse covariance matrix estimation problem with two real applications: a classification problem and a portfolio optimization problem of stock data. These applications require an estimate of the covariance matrix.

4.4.1  Sparse Quadratic Discriminant Analysis

Let be an training data matrix with observations () on the rows and features on the columns. We assume that the observations within the th class are normally distributed . We denote the prior probability of the th class by . The quadratic discriminant function is