## Abstract

We study dimensionality reduction for multi-instance (MI) learning through sparsity and orthogonality, which is especially useful for high-dimensional MI data sets. We develop a novel algorithm to handle both sparsity and orthogonality constraints, which existing methods do not handle well simultaneously. Our main idea is to formulate an optimization problem in which the sparsity term appears in the objective function and the orthogonality requirement is imposed as a constraint. The resulting optimization problem can be solved by using approximate augmented Lagrangian iterations as the outer loop and inertial proximal alternating linearized minimization (iPALM) iterations as the inner loop. The main advantage of this method is that both sparsity and orthogonality can be satisfied by the proposed algorithm. We show the global convergence of the proposed iterative algorithm. We also demonstrate that the proposed algorithm can meet high sparsity and orthogonality requirements, which are very important for dimensionality reduction. Experimental results on both synthetic and real data sets show that the proposed algorithm can obtain learning performance comparable to that of other tested MI learning algorithms.

## 1 Introduction

In multi-instance (MI) learning (Dietterich, Lathrop, & Lozano-Pérez, 1997), data observations (bags) can have different alternative descriptions (instances). Labels are assigned to bags, but they are not assigned to instances. Under the standard MI assumption, a bag is positively labeled if it contains at least one positive instance and negatively labeled if all instances contained in a bag are negative. The goal of MI learning is to learn from a training data set and build a classifier for correctly labeling unseen bags. MI learning naturally fits various real-world applications—for example, drug activity prediction (Dietterich et al., 1997), text categorization (Andrews, Tsochantaridis, & Hofmann, 2003), image retrieval (Andrews et al., 2003), medical diagnosis (Fung & Ng, 2007), and face detection (Viola & Jones, 2004).
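The standard MI assumption above amounts to a logical OR over the (hidden) instance labels; a minimal Python sketch, with instance labels assumed known purely for illustration:

```python
def bag_label(instance_labels):
    """Standard MI assumption: a bag is positive (label 1) iff it
    contains at least one positive instance; otherwise negative (0)."""
    return 1 if any(y == 1 for y in instance_labels) else 0

# A bag with one positive instance is positive; an all-negative bag is not.
assert bag_label([0, 0, 1]) == 1
assert bag_label([0, 0, 0]) == 0
```

In real MI data only the bag labels are observed; the learner never sees the instance labels used here.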

In the literature, MI learning for drug activity prediction was first presented in Dietterich et al. (1997). The task of drug activity prediction is to predict whether a given molecule is qualified to make a drug. A molecule is qualified to make a drug when one of its low-energy shapes can bind tightly to the target object; conversely, a molecule is not qualified when none of its low-energy shapes can bind tightly to the target object. The main challenge of drug activity prediction is that each molecule can have many possible low-energy shapes, and only one or a few of them can bind tightly to the target object. In practice, some training molecules are known to be qualified to make a drug, but we may not know which low-energy shapes are responsible for such qualification. In MI learning, each molecule is described by a bag, and each instance corresponds to one low-energy shape.

### 1.1 Related Work on MI Learning

The axis-parallel rectangle (APR) algorithm, based on greedy selection, was proposed in Dietterich et al. (1997) to predict drug activity. The learnability and computational complexity of the APR algorithm were analyzed in Auer, Long, and Srinivasan (1997), Long and Tan (1998), and Blum and Kalai (1998). Several MI learning algorithms were subsequently developed and studied, such as diverse density (Maron & Lozano-Pérez, 1998), citation-KNN (Wang & Zucker, 2000), ID3-MI (Chevaleyre & Zucker, 2001), RIPPER-MI (Chevaleyre & Zucker, 2001), BP-MIP (Zhang & Zhou, 2004), and MI-SVM (Andrews et al., 2003). The diverse density algorithm searches for a point of maximum diverse density in the feature space, where diverse density measures how many different positive bags are near a point and how far the negative bags are from it. Citation-KNN is a nearest-neighbor algorithm that measures the distance between bags by the minimal Hausdorff distance. ID3-MI is a decision tree algorithm that follows the divide-and-conquer strategy and uses MI entropy to distinguish bags instead of splitting on instances. RIPPER-MI is a rule induction algorithm that uses MI coverage to adapt the divide-and-conquer method to rule inducers. BP-MIP is a feedforward neural network algorithm that uses diverse density for feature scaling and PCA (principal component analysis; Jolliffe, 2002) for feature reduction. These MI learning algorithms operate at the bag level (i.e., they discriminate among bags rather than among instances). More detailed information on the differences among these methods can be found in Zhou (2004). There are also survey papers and books on MI learning (Foulds & Frank, 2010; Amores, 2013; Herrera et al., 2016). Foulds and Frank (2010) classify MI learning methods according to the assumptions each method makes, and Amores (2013) examines what level of information is used in MI learning (i.e., instance level or bag level). However, dimensionality reduction for MI learning has not attracted much attention.

Many MI learning problems involve high-dimensional data: each instance has a large number of features. In addition, MI data may contain noisy and redundant features. Feature selection and dimensionality reduction are two ways to manage high-dimensional data. Feature selection attempts to select a subset of the original features according to some measurements, whereas dimensionality reduction attempts to identify a small set of features in a new feature space derived from the original features. HyDR-MI (Zafra, Pechenizkiy, & Ventura, 2013) is a feature selection method that uses a filter to determine the important attributes in the feature space and then adopts a wrapper to select the best feature subset. MIDR (Sun, Ng, & Zhou, 2010) aims to drive the posterior probability of a truly positive bag to one and that of a negative bag to zero; the projection matrix for dimensionality reduction is required to satisfy both sparsity and orthogonality. MidLABS (Ping, Xu, Ren, Chi, & Shen, 2010) uses the trace-ratio expression to simultaneously maximize between-class scattering and minimize within-class scattering; it constructs scattering matrices by directly evaluating the scattering among bags and captures the structure information of each bag by building an $\epsilon$-graph. MIDA (Chai, Ding, Chen, & Li, 2014) uses the selected positive instance in each positive bag and the mean of all negative instances in all negative bags to construct scattering matrices, and it simultaneously maximizes the between-class scattering and minimizes the within-class scattering by solving a trace-difference formulation. CLFDA (Kim & Choi, 2010) prelabels all instances with their bag labels and then uses neighborhood information to detect the false-positive ones. MidLABS, MIDA, and CLFDA can be viewed as MI extensions of LDA (linear discriminant analysis; Fukunaga, 1990) using different scattering matrices.

### 1.2 Motivation

The main aim of this letter is to study MI dimensionality reduction with sparsity and orthogonality. The optimization problem of the MIDR method (Sun et al., 2010) is solved by the gradient descent method along the tangent space of the orthogonal matrices. To improve efficiency, the sparsity and orthogonality constraints are further approximated in that method; therefore, the calculated solution is not necessarily sparse and orthogonal, and learning performance may be affected. Moreover, there is no convergence proof for the MIDR algorithm, so the computed solution is not guaranteed to satisfy the optimality conditions. Unlike Sun et al. (2010), we do not use any approximation of sparsity and orthogonality. Our idea is to formulate the optimization problem using the corresponding scaled augmented Lagrangian function with sparsity and orthogonality. The resulting Lagrangian function can be minimized iteratively by updating the involved variables. In particular, the variables associated with the sparsity and orthogonality constraints can be handled by the inertial proximal alternating linearized minimization (iPALM) method (Pock & Sabach, 2016; Zhu, 2016). The advantage of this approach is that these variables can be managed separately and updated effectively. We show the global convergence of the proposed algorithm, which combines the outer approximate augmented Lagrangian iteration step and the inner iPALM iteration step. The computed solution can be sparse and orthogonal, and it also satisfies the optimality conditions of the optimization problem. Experimental results on both synthetic and real data sets demonstrate that the proposed algorithm achieves good MI learning performance together with both sparsity and orthogonality.
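The two-level structure described above, outer augmented Lagrangian updates wrapped around an inner solver, can be sketched on a toy scalar problem. Plain gradient steps stand in for the iPALM inner solver, and all function names, step sizes, and the toy objective are illustrative, not the paper's exact updates:

```python
def alm_sketch(f_grad, c, c_grad, x0, rho=1.0, mu=1.02, outer=50, inner=200, lr=0.05):
    """Augmented Lagrangian method for min f(x) s.t. c(x) = 0.
    Outer loop: multiplier and penalty updates.
    Inner loop: gradient descent on the augmented Lagrangian
    L(x) = f(x) + lam * c(x) + (rho / 2) * c(x)^2 (placeholder for iPALM)."""
    x, lam = x0, 0.0
    for _ in range(outer):
        for _ in range(inner):
            g = f_grad(x) + (lam + rho * c(x)) * c_grad(x)
            x -= lr * g
        lam += rho * c(x)   # multiplier update
        rho *= mu           # penalty growth (cf. mu = 1.02 in section 3)
    return x

# Toy problem: minimize (x - 2)^2 subject to x - 1 = 0; solution x* = 1.
x = alm_sketch(lambda x: 2.0 * (x - 2.0), lambda x: x - 1.0, lambda x: 1.0, x0=3.0)
```

The multiplier update drives the constraint violation to zero without requiring the penalty parameter to blow up, which is the advantage over the pure penalty approach discussed for MIDR in section 3.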

The outline of this letter is as follows. In section 2, we review the MI dimensionality-reduction formula with sparsity and orthogonality and study the convergence of the proposed algorithm. In addition, we present the iPALM solver for the subproblems arising from the proposed algorithm. In section 3, we present experimental results to show the effectiveness of the proposed algorithm. Finally, we give some concluding remarks in section 4.

## 2 Dimensionality Reduction

In this section, we review the MIDR method proposed by Sun et al. (2010) and apply the approximate augmented Lagrangian method to solve the corresponding optimization problem.

### 2.1 The Optimization Problem

Let $\{(X_1,y_1),\ldots,(X_N,y_N)\}$ be the training data set, where $X_i=\{x_{i,1},\ldots,x_{i,n_i}\}\subset \mathbb{R}^D$ is the $i$th bag, which contains $n_i$ instances ($n_i$ can vary across the bags), and $y_i\in\{0,1\}$ is the label of $X_i$. Here $x_{i,j}\in \mathbb{R}^D$ denotes the $j$th instance in the $i$th bag, and its hidden label is $y_{i,j}\in\{0,1\}$. Each instance contains $D$ attributes. Under the standard assumption of MI learning, $X_i$ has label $y_i=1$ and is said to be a positive bag if there exists at least one instance $x_{i,j}\in X_i$ with label $y_{i,j}=1$ (the concrete value of the index $j$ is usually unknown). Otherwise, $X_i$ is said to be a negative bag with label $y_i=0$.

In MIDR, Sun et al. (2010) studied the projection matrix $A\in \mathbb{R}^{D\times d}$ ($d\ll D$) to discriminate positive and negative bags: project $X_i\subset \mathbb{R}^D$ to $\{A^T x_{i,1},\ldots,A^T x_{i,n_i}\}\subset \mathbb{R}^d$. $A$ is required to be orthogonal to guarantee that the resulting features are uncorrelated and nonredundant in the new feature representation. Each new feature $A^T x_{i,j}\in \mathbb{R}^d$ is a linear combination of all features in the original data $x_{i,j}\in \mathbb{R}^D$, and the coefficients of such a linear combination are generally nonzero. Therefore, the importance of the original features and the interpretation of the features obtained in the lower-dimensional space may be difficult to assess, especially when the data dimension $D$ is large. To improve the interpretability of the model and the visualization of the results, the sparsity-promoting term $\|A\|_1\ (=\sum_{i,j}|A_{ij}|)$ is incorporated into the new feature representation. Sparse representations of features for real data sets have been studied and reported (see Dundar, Fung, Bi, Sathyakama, & Rao, 2005; Fung & Ng, 2007; Qiao, Zhou, & Huang, 2009; Ng, Liao, & Zhang, 2011, and references therein).
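A toy numerical example of the projection and of the $\ell_1$ measure of sparsity; the matrix $A$ below is hand-built and hypothetical, chosen only so that its sparsity is visible:

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, n_i = 6, 2, 4              # toy sizes; the paper assumes d << D
A = np.zeros((D, d))             # a sparse projection matrix (hypothetical)
A[0, 0] = 1.0                    # first new feature uses only original feature 0
A[3, 1] = 1.0                    # second new feature uses only original feature 3
bag = rng.normal(size=(n_i, D))  # one bag X_i with n_i instances as rows

projected = bag @ A              # {A^T x_{i,j}} stacked as rows, each in R^d
l1_norm = np.abs(A).sum()        # ||A||_1 = sum_{ij} |A_ij|

assert projected.shape == (n_i, d)
```

With a sparse $A$, each reduced feature depends on only a few original features, which is exactly the interpretability argument made above; a dense $A$ (e.g., from plain PCA) mixes all $D$ original features into every new one.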

### 2.2 Our Algorithm

**Theorem 1.** Let $\{(A^{(k)},B^{(k)},w^{(k)})\}_{k\in\mathbb{N}}$ be a sequence generated by algorithm 1. If the sequence $\{(A^{(k)},B^{(k)},w^{(k)})\}_{k\in\mathbb{N}}$ is bounded, then any cluster point $(A^{*},B^{*},w^{*})$ of the sequence $\{(A^{(k)},B^{(k)},w^{(k)})\}_{k\in\mathbb{N}}$ satisfies the first-order optimality condition of problem 2.3. Moreover, $(A^{*},w^{*})$ satisfies the first-order optimality condition of problem 2.1.

The proof of Theorem 1 is given in appendix A. Theorem 1 implies that the MI-ALM algorithm can provide a theoretically guaranteed optimal solution with the required sparsity and orthogonality, which is important for MI dimensionality reduction.

### 2.3 iPALM Algorithm

We note that step 1 is the crucial part of the MI-ALM algorithm: it requires that criterion 2.5 be satisfied in each iteration. Next, we show how the iPALM method can be employed to provide $(A^{(k)},B^{(k)},w^{(k)})$ such that equation 2.5 holds.

For each fixed $k$, the implementation of the iPALM method employed in step 1 of MI-ALM is listed in algorithm 2.

and $\xi^{(k,j)}=(\xi_A^{k,j},\xi_B^{k,j},\xi_w^{k,j})$. The following theorem indicates that $\xi^{(k,j)}$ indeed satisfies criterion 2.5, which means that step 1 of MI-ALM is well defined with the iPALM method as the inner solver.

**Theorem 2.** For each $k\ge 1$, let $\{(A^{(k,j)},B^{(k,j)},w^{(k,j)})\}_{j\in\mathbb{N}}$ be a sequence generated by equations 2.9 to 2.11.

- The sequence $\{(A^{(k,j)},B^{(k,j)},w^{(k,j)})\}_{j\in\mathbb{N}}$ has finite length:
$$\sum_{j=1}^{\infty}\big\|(A^{(k,j+1)},B^{(k,j+1)},w^{(k,j+1)})-(A^{(k,j)},B^{(k,j)},w^{(k,j)})\big\|<\infty .$$
Moreover, $\{(A^{(k,j)},B^{(k,j)},w^{(k,j)})\}_{j\in\mathbb{N}}$ converges to a critical point $(A^{(k,*)},B^{(k,*)},w^{(k,*)})$ of the function $L_k(A,B,w)$.

The proof of Theorem 2 is given in appendix B.

In the next section, we test the performance of the proposed algorithm.

## 3 Experimental Results

We evaluate the effectiveness and efficiency of the MI-ALM algorithm on some synthetic data sets and five MI benchmark data sets. In the following experiments, we set $\beta=3$ as the approximation degree of the softmax function, $\epsilon^{(k)}=0.999^k$, $\tau=0.99$, $\mu=1.02$, $\bar{\Lambda}_{\min}=-10^2$, and $\bar{\Lambda}_{\max}=10^2$ in algorithm 1. For each $k\ge 1$, we set $a^{(j)}=\gamma_1(\alpha^{(k)}+\beta^{(k)})$, $b^{(j)}=\gamma_2\big(\alpha^{(k)}\big(2N\|w^{(k,j)}\|\max_{1\le i\le N}\sum_{u=1}^{n_i}\|x_{i,u}\|\big)+\beta^{(k)}\big)$, and $c^{(j)}=\gamma_3\alpha^{(k)}(|1-\beta|+\beta)\|B^{(k,j+1)}\|^2\sum_{i=1}^{N}\sum_{u=1}^{n_i}\|x_{i,u}\|n_i+\beta^{(k)}$ for all $j\in\mathbb{N}$ in the iPALM method, where $\alpha^{(k)}\equiv 0.95$, $\beta^{(k)}\equiv 0.05$, and $\gamma_i=1.01$, $i=1,2,3$. All experiments were performed in Matlab R2013a on a MacBook Pro laptop with a quad-core Intel Core i7 CPU at 2.2 GHz and 16 GB of RAM.

### 3.1 Synthetic Data Sets

Two experiments were carried out on synthetic data sets to verify the effectiveness of algorithm 1. In this section, we set $\alpha =0.2$ in problem 2.1 for algorithm 1.

Figure 1 shows a simple two-dimensional synthetic example. Our goal is to reduce the dimensionality of the synthetic data from two to one. The structure of this test is shown in Table 1. There are four bags; the first three are positive, and the last one is negative. Figure 1 shows the one-dimensional results after dimensionality reduction generated by LDA, PCA, MIDR, and MI-ALM. For MIDR, $\min_{A,H}\sum_i (P_i(A)-y_i)^2+\frac{c_2}{2}\|A-H\|_F^2+c_1\|H\|_1$ is used as an approximation of problem 2.1. Here, we set $c_1=0.5$ and $c_2=20$ to keep $A$ and $H$ close. Figure 1 shows that LDA is misled by the negative instances in the positive bags: after dimensionality reduction, the positive and negative instances in positive bags 2 and 3 are very close. This is consistent with the fact that LDA, as a supervised learning algorithm, assigns each instance the label of its bag. It can be seen from bag 4 in the PCA panel of Figure 1 that the positive and negative bags are not well separated after dimensionality reduction. This can be explained by the fact that PCA, as an unsupervised learning method, does not consider label information. MIDR and MI-ALM both try to enlarge the distances between positive and negative instances to separate positive and negative bags. It can be seen in the figures that the performance of MIDR and MI-ALM is better than that of PCA and LDA.

Table 1: Structure of the data used in Figure 1.

| Bag | Label | Negative Instances | Positive Instances |
|---|---|---|---|
| Bag 1 | Positive | 15 | 5 |
| Bag 2 | Positive | 15 | 5 |
| Bag 3 | Positive | 15 | 5 |
| Bag 4 | Negative | 10 | 0 |

Figure 1 marker legend: $\circ$, $\square$, $\diamond$.


Next, we generate a synthetic data set (data set II) in a high-dimensional feature space containing more bags and instances. Each data set has 80 bags (60 positive bags and 20 negative bags). Each positive bag contains 5 positive instances and 15 negative instances. Each negative bag contains 10 negative instances. There are $D$ features in each instance, with $d$ relevant dimensions and $D-d$ noisy (or irrelevant) dimensions. For a positive instance, the means of the relevant dimensions are $[-4,4,-4,4,\ldots]\in\mathbb{R}^d$; for a negative instance, the means of the relevant dimensions are $[4,-4,4,-4,\ldots]\in\mathbb{R}^d$. The covariance matrices of the relevant dimensions for both positive and negative instances are $2I_d$. The noisy dimensions follow a normal distribution with mean 0 and standard deviation 8. The $d$ relevant dimensions are randomly selected among the $D$ dimensions. In this setting, we would like to reduce the dimension from $D$ to $d$.
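The construction of data set II can be sketched as follows; the function name is ours, and minor sampling details may differ from the authors' implementation:

```python
import numpy as np

def make_bag(n_pos, n_neg, D, rel, rng):
    """One bag: rows are instances; `rel` indexes the d relevant dimensions."""
    d = len(rel)
    pos_mean = np.tile([-4.0, 4.0], d)[:d]   # [-4, 4, -4, 4, ...] for positives
    neg_mean = -pos_mean                     # [ 4, -4, 4, -4, ...] for negatives
    # Noisy dimensions everywhere: N(0, 8^2); relevant dims overwritten below.
    X = rng.normal(0.0, 8.0, size=(n_pos + n_neg, D))
    means = np.vstack([np.tile(pos_mean, (n_pos, 1)),
                       np.tile(neg_mean, (n_neg, 1))])
    # Relevant dimensions: mean +/-4 with covariance 2 * I_d.
    X[:, rel] = means + rng.normal(0.0, np.sqrt(2.0), size=(n_pos + n_neg, d))
    return X

rng = np.random.default_rng(0)
D, d = 100, 20
rel = rng.choice(D, size=d, replace=False)   # relevant dims chosen at random
pos_bags = [make_bag(5, 15, D, rel, rng) for _ in range(60)]  # 5 pos + 15 neg
neg_bags = [make_bag(0, 10, D, rel, rng) for _ in range(20)]  # 10 neg only
```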

Table 2 reports the average results obtained by the MI-ALM, MIDR, PCA, and LDA algorithms on 20 randomly generated data sets; standard deviations are given in parentheses. We randomly selected 50% of the positive bags and 50% of the negative bags as training data and used the remaining bags as testing data. The competing algorithms perform only dimensionality reduction, so an additional classifier is needed to evaluate classification performance. Here we use MILR solved by the BFGS method (Ray & Craven, 2005) as the classifier for these four methods. For comparison, we also tested MILR on the original features. The AUROC value gives the classification accuracy based on the area under the receiver operating characteristic (ROC) curve (AUROC; Bradley, 1997; Fawcett, 2006). The sparsity value refers to the ratio between the number of zero entries in the computed solution $A_c$ and its size; as the sparsity value approaches one (zero), the solution becomes sparser (denser). The orthogonality values for MI-ALM and MIDR are calculated from the computed solution $A_c$ as $\|A_c^T A_c-I\|_F/d$. Table 3 shows that the MIDR method only approximates sparsity and orthogonality; it is not effective at handling the sparsity and orthogonality constraints simultaneously. Notice that MIDR separates the sparsity and orthogonality constraints by introducing the constraint $H=A$, requiring $H$ to satisfy the sparsity constraint and $A$ to satisfy the orthogonality constraint. In the objective function of MIDR, a penalty parameter is used to penalize violation of the constraint $A=H$. Table 3 also reports the difference between $A_c$ and $H$, calculated as $\|A_c-H\|_F/d$ ("Error"), for reference. It can be seen that when $D=500$, MIDR cannot enforce the constraint $A=H$ with high accuracy. This is because the penalty method requires the penalty parameter to be sufficiently large, but then the number of iterations required to solve the resulting minimization problem becomes very large. Therefore, it is difficult to balance this computational trade-off by tuning the penalty parameter. In contrast, the proposed MI-ALM algorithm provides sparse and orthogonal solutions. Table 3 does not list sparsity and orthogonality results for PCA and LDA because they do not provide sparse solutions, and their solutions satisfy orthogonality via the eigendecomposition procedure. We also remark that in the LDA setting, the upper bound on the subspace dimension equals the number of classes minus one; this applies to the binary classes (positive and negative labels) in multiple-instance learning.
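The two quality measures reported for a computed solution $A_c$ follow directly from their definitions above; a short sketch (the tolerance for counting an entry as zero is our assumption):

```python
import numpy as np

def sparsity(A, tol=1e-8):
    """Fraction of (near-)zero entries in the computed solution A_c.
    The tolerance `tol` is an assumption, not specified in the text."""
    return float(np.mean(np.abs(A) < tol))

def orthogonality(A):
    """|| A_c^T A_c - I ||_F / d, the orthogonality measure of Table 3."""
    d = A.shape[1]
    return float(np.linalg.norm(A.T @ A - np.eye(d), "fro") / d)

A = np.zeros((6, 2))
A[0, 0] = A[3, 1] = 1.0   # sparse columns that are exactly orthonormal
```

For this toy matrix, 10 of the 12 entries are zero, so the sparsity value is 10/12, and the columns are orthonormal, so the orthogonality measure is exactly zero.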

Table 2: AUROC values.

| $D$ | $d$ | MILR | MI-ALM | MIDR | PCA | LDA |
|---|---|---|---|---|---|---|
| 40 | 0.2D | 0.68 (1.71E-01) | 0.84 (2.58E-01) | 0.98 (6.49E-02) | 0.79 (2.10E-01) | 0.81 (2.14E-01) |
| 40 | 0.3D | 0.82 (2.10E-01) | 0.96 (1.15E-01) | 0.96 (1.38E-01) | 0.96 (7.72E-02) | 0.82 (2.0E-01) |
| 40 | 0.4D | 0.94 (9.25E-02) | 0.95 (1.01E-01) | 1.00 (0.00E+00) | 0.92 (1.51E-01) | 0.84 (1.6E-01) |
| 40 | 0.5D | 0.96 (1.13E-01) | 1.00 (0.00E+00) | 1.00 (0.00E+00) | 1.00 (0.00E+00) | 0.89 (1.5E-01) |
| 40 | 0.6D | 0.99 (2.52E-02) | 1.00 (0.00E+00) | 1.00 (1.58E-02) | 0.99 (1.54E-02) | 0.90 (9.4E-02) |
| 100 | 0.2D | 0.71 (2.37E-01) | 1.00 (1.58E-02) | 0.99 (2.64E-02) | 0.93 (1.33E-01) | 0.77 (9.35E-02) |
| 100 | 0.3D | 0.96 (8.17E-02) | 0.99 (3.27E-02) | 0.96 (9.28E-02) | 0.98 (4.96E-02) | 0.77 (9.08E-02) |
| 100 | 0.4D | 0.91 (1.67E-01) | 1.00 (0.00E+00) | 0.94 (1.86E-01) | 0.99 (3.21E-02) | 0.76 (1.10E-01) |
| 100 | 0.5D | 0.95 (1.00E-01) | 1.00 (0.00E+00) | 0.99 (4.43E-02) | 1.00 (1.26E-02) | 0.78 (1.25E-01) |
| 100 | 0.6D | 0.94 (1.33E-01) | 1.00 (0.00E+00) | 0.96 (1.41E-01) | 1.00 (0.00E+00) | 0.77 (1.03E-01) |
| 500 | 0.2D | 0.76 (2.50E-01) | 0.97 (5.02E-02) | 0.96 (1.32E-01) | 0.82 (3.34E-01) | 0.56 (1.13E-01) |
| 500 | 0.3D | 0.99 (2.71E-02) | 1.00 (0.00E+00) | 1.00 (0.00E+00) | 0.74 (4.09E-01) | 0.55 (1.37E-01) |
| 500 | 0.4D | 0.95 (1.12E-01) | 1.00 (0.00E+00) | 0.99 (2.11E-02) | 0.96 (6.27E-02) | 0.54 (1.22E-01) |
| 500 | 0.5D | 0.93 (1.87E-01) | 1.00 (0.00E+00) | 1.00 (0.00E+00) | 0.94 (1.76E-01) | 0.54 (1.04E-01) |
| 500 | 0.6D | 0.92 (3.16E-03) | 0.94 (1.99E-01) | 0.81 (4.00E-01) | 1.00 (0.00E+00) | 0.54 (1.40E-01) |


Notes: Values in parentheses are standard deviations. The best results are highlighted in bold.

Table 3: Sparsity, orthogonality, and error values.

| $D$ | $d$ | Sparsity (MI-ALM) | Sparsity (MIDR) | Orthogonality (MI-ALM) | Orthogonality (MIDR) | Error (MIDR) |
|---|---|---|---|---|---|---|
| 20 | 0.2D | 0.87 (1.22E-01) | 0.19 (2.99E-01) | 5.12E-04 (5.82E-04) | 5.82E-01 (6.63E-01) | 3.63E-03 (2.04E-04) |
| 20 | 0.3D | 0.94 (2.91E-02) | 0.46 (3.20E-01) | 2.53E-04 (2.59E-04) | 8.64E-02 (1.89E-01) | 3.54E-03 (1.07E-04) |
| 20 | 0.4D | 0.92 (7.12E-02) | 0.66 (3.06E-01) | 3.95E-04 (3.97E-04) | 9.41E-02 (1.89E-01) | 3.67E-03 (5.45E-04) |
| 20 | 0.5D | 0.94 (2.81E-02) | 0.56 (2.31E-01) | 2.15E-04 (2.58E-04) | 4.22E-02 (2.50E-02) | 3.50E-03 (4.30E-05) |
| 20 | 0.6D | 0.94 (2.77E-02) | 0.61 (1.00E-01) | 3.78E-04 (3.51E-04) | 5.75E-02 (1.16E-02) | 4.23E-03 (1.48E-03) |
| 100 | 0.2D | 0.90 (1.07E-01) | 0.90 (8.35E-02) | 1.15E-03 (1.13E-03) | 2.53E-02 (1.00E-02) | 9.79E-03 (8.89E-04) |
| 100 | 0.3D | 0.80 (2.12E-01) | 0.91 (3.67E-02) | 1.64E-03 (4.83E-04) | 2.65E-02 (6.75E-03) | 9.69E-03 (6.47E-04) |
| 100 | 0.4D | 0.91 (9.91E-02) | 0.84 (4.71E-02) | 1.33E-03 (1.30E-03) | 3.51E-02 (7.26E-03) | 9.63E-03 (2.28E-04) |
| 100 | 0.5D | 0.83 (1.82E-01) | 0.77 (6.80E-02) | 1.85E-03 (1.09E-03) | 4.30E-02 (5.26E-03) | 9.38E-03 (1.89E-04) |
| 100 | 0.6D | 0.91 (5.48E-02) | 0.63 (2.93E-02) | 1.60E-03 (7.07E-03) | 5.68E-02 (6.95E-03) | 1.04E-02 (2.09E-03) |
| 500 | 0.2D | 0.95 (5.59E-02) | 0.96 (3.47E-03) | 1.29E-03 (1.52E-03) | 9.78E-03 (1.28E-03) | 1.23E-01 (8.61E-03) |
| 500 | 0.3D | 0.96 (1.62E-02) | 0.94 (1.10E-02) | 7.50E-04 (1.97E-04) | 1.22E-02 (2.37E-03) | 1.19E-01 (1.04E-03) |
| 500 | 0.4D | 0.96 (2.06E-02) | 0.94 (3.86E-02) | 1.44E-03 (9.51E-04) | 1.61E-02 (1.66E-03) | 1.23E-01 (1.56E-02) |
| 500 | 0.5D | 0.93 (3.14E-02) | 0.92 (6.42E-02) | 1.76E-03 (7.36E-04) | 2.48E-02 (7.14E-03) | 1.18E-01 (5.33E-04) |
| 500 | 0.6D | 0.96 (1.93E-02) | 0.64 (3.19E-01) | 1.85E-03 (1.40E-03) | 1.02E-01 (1.41E-01) | 1.18E-01 (3.05E-03) |


Note: Values in parentheses are standard deviations.

According to the mean values and standard deviations listed in Table 2, the classification performance of MI-ALM is quite competitive with that of MIDR, PCA, LDA, and MILR. We can also see from Table 2 that for each fixed $D$, MILR obtains improved accuracy as $d$ (the number of relevant dimensions) increases. In most cases, PCA achieves better accuracy than LDA and MILR. We find that LDA misclassifies negative bags as positive bags, especially for large $D$. Table 2 demonstrates that classification performance can be improved by dimensionality reduction.

### 3.2 Real Data Sets

In this section, we test the MI-ALM algorithm on five MI benchmark data sets: Musk1, Musk2, Tiger, Elephant (Elep), and Fox. Musk1 and Musk2 are drug activity prediction tasks that aim to predict whether a drug molecule can bind well to a target protein related to certain disease states, which is primarily determined by the low-energy shapes of the molecule. A molecule can bind well if at least one of its shapes can bind well. Hence, by modeling a molecule as a bag and its low-energy shapes as instances, MI learning can be used to learn the right shape that determines binding and to predict whether a new molecule can bind to the target protein. Tiger, Elephant, and Fox are image classification tasks: an image is considered positive when at least one segment of the image contains the desired animal. Therefore, by modeling an image as a bag and each image segment as an instance, MI learning can be applied to determine whether an image contains the desired animal. A description of these five data sets is given in Table 4. Before performing multi-instance dimensionality reduction and learning on each data set, we apply $Z$-score standardization so that each feature, over all instances, has zero mean and unit standard deviation.
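The $Z$-score standardization step can be written as follows (a straightforward sketch; instances from all bags are pooled into one matrix before standardizing):

```python
import numpy as np

def zscore(instances):
    """Standardize each feature over all instances (pooled across bags)
    to zero mean and unit standard deviation."""
    mu = instances.mean(axis=0)
    sigma = instances.std(axis=0)
    sigma[sigma == 0] = 1.0   # guard against constant features
    return (instances - mu) / sigma

# Toy pooled instance matrix: 3 instances, 2 features.
X = np.array([[1.0, 10.0], [3.0, 10.0], [5.0, 40.0]])
Z = zscore(X)
```

After standardization, the same bag structure is kept; only the feature values change.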

Table 4: Description of the five benchmark data sets.

| Category | Data Set | Features ($D$) | Positive Bags | Negative Bags | Total Bags | Positive Instances | Negative Instances | Total Instances | Average Instances per Bag |
|---|---|---|---|---|---|---|---|---|---|
| Bioinformatics | Musk1 | 166 | 47 | 45 | 92 | 207 | 269 | 476 | 5.57 |
| Bioinformatics | Musk2 | 166 | 39 | 63 | 102 | 1017 | 5581 | 6598 | 64.69 |
| Image classification and retrieval | Tiger | 230 | 100 | 100 | 200 | 544 | 676 | 1220 | 6.69 |
| Image classification and retrieval | Elep | 230 | 100 | 100 | 200 | 762 | 629 | 1391 | 6.10 |
| Image classification and retrieval | Fox | 230 | 100 | 100 | 200 | 647 | 673 | 1320 | 6.60 |


In the following experiments, we repeat 10-fold cross-validation 10 times with random partitions. More precisely, each data set is randomly partitioned into 10 samples. A single sample is retained as the testing data, and the remaining 9 samples are used as training data; each of the 10 samples is used exactly once as the testing data. The whole process is then repeated 10 times, and we report the average performance over the 10 random repetitions for the different MI algorithms. Here, we compare the performance of MI-ALM with that of other MI learning methods: PCA, LDA, HyDR-MI, CLFDA, and MIDR. For the dimensionality-reduction methods PCA, CLFDA, MIDR, and MI-ALM, we reduce the dimensionality to (20%, 30%, 40%, 50%, 60%) of the number of original features. We use MILR as the classifier to evaluate classification performance, and we also test MILR without dimensionality reduction for reference. For HyDR-MI, we use the adapted Hausdorff distance to select the initial (15%, 30%, 45%, 60%, 75%, 90%) of the original features in the filter component. We use AUROC as the evaluation criterion. For MI-ALM, we set $\alpha=0.3$; for MIDR, we set $c_1=0.1$ and $c_2=10$ to keep $A$ and $H$ close.
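The repeated cross-validation protocol operates at the bag level and can be sketched as follows (function name and seed handling are ours):

```python
import numpy as np

def repeated_kfold_indices(n_bags, k=10, repeats=10, seed=0):
    """Yield (train, test) bag-index splits for `repeats` rounds of
    k-fold cross-validation, with a fresh random partition each round."""
    rng = np.random.default_rng(seed)
    for _ in range(repeats):
        perm = rng.permutation(n_bags)
        folds = np.array_split(perm, k)      # k nearly equal folds
        for i in range(k):
            test = folds[i]                  # each fold is the test set once
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            yield train, test

# e.g., Musk1 has 92 bags: 10 repeats x 10 folds = 100 splits in total.
splits = list(repeated_kfold_indices(92, k=10, repeats=10))
```

Splitting by bags rather than by instances is essential in the MI setting: all instances of a bag must stay on the same side of the split, since labels exist only at the bag level.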

Table 5 reports the AUROC values generated by each algorithm. It can be seen that in most cases, the results obtained by the dimensionality-reduction methods are higher than those based on the original feature space (e.g., MILR). For convenience, we highlight the best result under dimensionality reduction for each data set. We see that the learning performance of the MI-ALM algorithm is better than that of the other dimensionality-reduction algorithms.

Table 5: AUROC values on the five benchmark data sets.

AUROC of the dimensionality-reduction methods (the reduced dimension is given as a percentage of the original feature dimension):

| Algorithm | Reduced Dim. | Musk1 | Musk2 | Tiger | Elep | Fox |
|---|---|---|---|---|---|---|
| PCA | 10% | 0.87 | 0.84 | 0.86 | 0.89 | 0.62 |
| PCA | 20% | 0.82 | 0.88 | 0.86 | 0.88 | 0.64 |
| PCA | 30% | 0.82 | 0.88 | 0.85 | 0.89 | 0.61 |
| PCA | 40% | 0.81 | 0.86 | 0.87 | 0.88 | 0.59 |
| PCA | 50% | 0.81 | 0.88 | 0.86 | 0.87 | 0.59 |
| PCA | 60% | 0.82 | 0.87 | 0.85 | 0.88 | 0.58 |
| CLFDA | 10% | 0.77 | 0.83 | 0.84 | 0.86 | 0.57 |
| CLFDA | 20% | 0.79 | 0.85 | 0.90 | 0.88 | 0.56 |
| CLFDA | 30% | 0.83 | 0.86 | 0.87 | 0.89 | 0.54 |
| CLFDA | 40% | 0.83 | 0.87 | 0.85 | 0.86 | 0.54 |
| CLFDA | 50% | 0.82 | 0.88 | 0.87 | 0.86 | 0.52 |
| CLFDA | 60% | 0.82 | 0.89 | 0.85 | 0.87 | 0.52 |
| MIDR | 10% | 0.83 | 0.87 | 0.76 | 0.85 | 0.61 |
| MIDR | 20% | 0.82 | 0.89 | 0.78 | 0.87 | 0.62 |
| MIDR | 30% | 0.82 | 0.87 | 0.79 | 0.80 | 0.58 |
| MIDR | 40% | 0.82 | 0.88 | 0.81 | 0.83 | 0.55 |
| MIDR | 50% | 0.80 | 0.90 | 0.80 | 0.83 | 0.53 |
| MIDR | 60% | 0.86 | 0.88 | 0.81 | 0.87 | 0.54 |
| MI-ALM | 10% | 0.85 | 0.83 | 0.86 | 0.89 | 0.64 |
| MI-ALM | 20% | 0.83 | 0.88 | 0.87 | 0.90 | 0.62 |
| MI-ALM | 30% | 0.83 | 0.88 | 0.89 | 0.87 | 0.59 |
| MI-ALM | 40% | 0.85 | 0.88 | 0.87 | 0.88 | 0.59 |
| MI-ALM | 50% | 0.86 | 0.88 | 0.92 | 0.84 | 0.60 |
| MI-ALM | 60% | 0.82 | 0.86 | 0.89 | 0.88 | 0.58 |

AUROC of HyDR-MI for each initial feature-selection percentage (the value in parentheses is the percentage of original features ultimately retained):

| Initial Selection | Musk1 | Musk2 | Tiger | Elep | Fox |
|---|---|---|---|---|---|
| 15% | 0.75 (8.9%) | 0.78 (9.5%) | 0.87 (13.0%) | 0.86 (11.3%) | 0.54 (13.2%) |
| 30% | 0.87 (15.7%) | 0.89 (15.8%) | 0.87 (23.2%) | 0.85 (23.1%) | 0.58 (22.7%) |
| 45% | 0.80 (17.5%) | 0.88 (24.9%) | 0.88 (30.7%) | 0.87 (33.4%) | 0.60 (31.2%) |
| 60% | 0.79 (24.2%) | 0.86 (31.3%) | 0.89 (43.8%) | 0.85 (39.9%) | 0.53 (44.7%) |
| 75% | 0.80 (33.2%) | 0.84 (39.3%) | 0.88 (50.5%) | 0.88 (50.8%) | 0.57 (54.4%) |
| 90% | 0.81 (41.1%) | 0.88 (43.6%) | 0.90 (66.9%) | 0.85 (46.6%) | 0.58 (58.2%) |

AUROC without dimensionality reduction:

| Algorithm | Musk1 | Musk2 | Tiger | Elep | Fox |
|---|---|---|---|---|---|
| MILR | 0.83 | 0.88 | 0.90 | 0.90 | 0.55 |
| LDA | 0.80 | 0.54 | 0.85 | 0.90 | 0.60 |


Note: The best result is highlighted in bold, and the underlined result refers to the best result obtained by HyDR-MI for each data set.

The performance of LDA is not good in most cases, since LDA misclassifies negative bags as positive bags. It is worth noting that for the image classification data sets, the performance of CLFDA is worse than that of LDA. CLFDA can be seen as supervised dimensionality reduction for multiple-instance learning that incorporates both citation and reference information to detect false-positive instances. In most cases, HyDR-MI obtains better accuracy than CLFDA and LDA. We list the percentages of features finally selected by HyDR-MI in Table 5 and underline the best result obtained by HyDR-MI for each data set. We see that the performance of the MI-ALM algorithm is better than that of the LDA, CLFDA, and HyDR-MI methods.

The sparsity results generated by the MIDR and MI-ALM methods are reported in Table 6. Under the current parameter setting, the sparsity results of MI-ALM are better than those of MIDR except on the Fox data set, and the orthogonality results of MI-ALM are better than those of MIDR except on the Musk2 data set. Table 6 also reports the error between $Ac$ and $H$ (Error) for the MIDR algorithm; MI-ALM does not have this issue, as it provides the solution $A$ directly. These results demonstrate that a sparse and orthogonal solution can provide a better projection matrix for discriminating positive and negative bags. The important point is that the MI-ALM algorithm achieves better learning performance via sparsity and orthogonality than the MIDR algorithm does.
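The sparsity and orthogonality scores of Table 6 are computed from the projection matrix $A$. Since their exact definitions are not restated in this section, the sketch below uses common choices as assumptions: sparsity as the fraction of (near-)zero entries of $A$, and orthogonality as the maximum deviation of $A^T A$ from $I_d$.

```python
# Illustrative metrics for a projection matrix A (D x d). These are assumed
# definitions (fraction of near-zero entries; deviation of A^T A from I_d),
# not necessarily the exact formulas used in the paper.
import numpy as np

def sparsity(A, tol=1e-6):
    """Fraction of entries of A with magnitude below tol."""
    A = np.asarray(A)
    return float(np.mean(np.abs(A) < tol))

def orthogonality_error(A):
    """Maximum absolute deviation of A^T A from the identity I_d."""
    A = np.asarray(A)
    d = A.shape[1]
    return float(np.max(np.abs(A.T @ A - np.eye(d))))
```

A higher sparsity value and a smaller orthogonality error both indicate a projection matrix closer to the sparse, orthonormal ideal targeted by MI-ALM.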

Table 6: Sparsity, orthogonality, and error results generated by the MIDR and MI-ALM algorithms.

**Dimension reduction 10%**

| Metric | Algorithm | Musk1 | Musk2 | Tiger | Elep | Fox |
| --- | --- | --- | --- | --- | --- | --- |
| Sparsity | MIDR | 0.581 | 0.584 | 0.814 | 0.808 | **0.848** |
| | MI-ALM | **0.895** | **0.593** | **0.817** | **0.848** | 0.741 |
| Orthogonality | MIDR | 2.1E-03 | 1.1E-01 | 3.4E-03 | 1.3E-02 | 5.9E-02 |
| | MI-ALM | **1.1E-03** | **1.5E-03** | **2.1E-03** | **1.7E-03** | **1.4E-03** |
| Error | MIDR | 9.9E-02 | 9.8E-02 | 7.9E-02 | 7.9E-02 | 7.1E-02 |

**Dimension reduction 20%**

| Metric | Algorithm | Musk1 | Musk2 | Tiger | Elep | Fox |
| --- | --- | --- | --- | --- | --- | --- |
| Sparsity | MIDR | 0.693 | 0.669 | 0.861 | 0.835 | **0.868** |
| | MI-ALM | **0.945** | **0.789** | **0.910** | **0.892** | 0.788 |
| Orthogonality | MIDR | 3.6E-02 | 4.8E-03 | 6.2E-03 | 5.5E-03 | 4.7E-02 |
| | MI-ALM | **8.4E-04** | **1.9E-03** | **1.7E-03** | **2.3E-03** | **2.8E-03** |
| Error | MIDR | 8.9E-02 | 9.3E-02 | 7.0E-02 | 7.4E-02 | 7.0E-02 |

**Dimension reduction 30%**

| Metric | Algorithm | Musk1 | Musk2 | Tiger | Elep | Fox |
| --- | --- | --- | --- | --- | --- | --- |
| Sparsity | MIDR | 0.720 | 0.674 | 0.904 | 0.871 | **0.904** |
| | MI-ALM | **0.944** | **0.848** | **0.917** | **0.910** | 0.827 |
| Orthogonality | MIDR | 4.8E-02 | **1.7E-03** | 1.4E-02 | 8.0E-03 | 1.9E-02 |
| | MI-ALM | **1.1E-03** | 1.8E-03 | **1.9E-03** | **2.8E-03** | **3.2E-03** |
| Error | MIDR | 8.7E-02 | 9.4E-02 | 5.8E-02 | 6.5E-02 | 6.0E-02 |

**Dimension reduction 40%**

| Metric | Algorithm | Musk1 | Musk2 | Tiger | Elep | Fox |
| --- | --- | --- | --- | --- | --- | --- |
| Sparsity | MIDR | 0.772 | 0.696 | 0.898 | 0.864 | **0.897** |
| | MI-ALM | **0.959** | **0.851** | **0.922** | **0.916** | 0.868 |
| Orthogonality | MIDR | 6.1E-02 | **2.0E-03** | 7.4E-03 | 1.7E-01 | 1.3E-02 |
| | MI-ALM | **8.5E-04** | 2.2E-03 | **2.1E-03** | **2.4E-03** | **2.7E-03** |
| Error | MIDR | 7.9E-02 | 9.2E-02 | 6.1E-02 | 6.7E-02 | 6.3E-02 |

**Dimension reduction 50%**

| Metric | Algorithm | Musk1 | Musk2 | Tiger | Elep | Fox |
| --- | --- | --- | --- | --- | --- | --- |
| Sparsity | MIDR | 0.803 | 0.713 | 0.906 | 0.880 | **0.906** |
| | MI-ALM | **0.963** | **0.797** | **0.930** | **0.922** | 0.876 |
| Orthogonality | MIDR | 6.0E-02 | **2.6E-03** | 1.1E-01 | 4.9E-03 | 4.4E-02 |
| | MI-ALM | **1.0E-03** | 2.7E-03 | **2.2E-03** | **2.4E-03** | **3.2E-03** |
| Error | MIDR | 7.5E-02 | 9.0E-02 | 6.0E-02 | 6.3E-02 | 6.1E-02 |

**Dimension reduction 60%**

| Metric | Algorithm | Musk1 | Musk2 | Tiger | Elep | Fox |
| --- | --- | --- | --- | --- | --- | --- |
| Sparsity | MIDR | 0.850 | 0.710 | 0.914 | 0.896 | **0.922** |
| | MI-ALM | **0.965** | **0.884** | **0.933** | **0.916** | 0.901 |
| Orthogonality | MIDR | 5.1E-02 | 2.6E-03 | 5.0E-03 | 7.6E-03 | 7.2E-02 |
| | MI-ALM | **1.0E-03** | **2.1E-03** | **2.1E-03** | **2.6E-03** | **2.9E-03** |
| Error | MIDR | 6.5E-02 | 9.2E-02 | 5.7E-02 | 5.8E-02 | 5.5E-02 |


Note: The best result is highlighted in bold.

## 4 Conclusion

In this letter, we have proposed an augmented Lagrangian method to deal with MI learning via sparsity and orthogonality. The subproblems arising from the augmented Lagrangian method are solved by the iPALM method, and the convergence of the proposed algorithm is established. The important result is that the algorithm can guarantee both the sparsity and orthogonality constraints for the solutions. The effectiveness of the proposed algorithm is verified on both synthetic and real data sets, and its learning performance is better than that of existing projection-based algorithms. As future research, we would like to extend the approach to MI multilearning problems by incorporating both sparsity and orthogonality constraints to find better solutions in learning. The identification of features by sparse projection matrices would be useful for dealing with high-dimensional problems.

## Appendix A: The Convergence of the MI-ALM Algorithm

Before we study the convergence result of the MI-ALM algorithm (theorem 1), we discuss the optimality conditions of problem 2.3.

The equalities $A^* - B^* = 0$ and $(A^*)^T A^* = I_d$ hold since $(A^*, B^*, w^*)$ is feasible as a stationary point.

$\square$

Next, we prove theorem 1, which implies that any cluster point generated by the MI-ALM algorithm satisfies the first-order optimality conditions in equation A.1. In the analysis, we assume that step 1 of MI-ALM is established.

For any cluster point $(A^*, B^*, w^*)$ of the bounded sequence $\{(A^{(k)}, B^{(k)}, w^{(k)})\}_{k \in \mathbb{N}}$, there exists an index set $K \subset \mathbb{N}$ such that $\{(A^{(k)}, B^{(k)}, w^{(k)})\}_{k \in K}$ converges to $(A^*, B^*, w^*)$. To prove that $(A^*, B^*, w^*)$ satisfies the first-order optimality condition, equation A.1, we first show that it is a feasible point.

The equality $(A^*)^T A^* = I_d$ is easy to check since $(A^{(k)})^T A^{(k)} = I_d$ holds for any $k \in \mathbb{N}$.

If $\{\rho^{(k)}\}$ is bounded, then there exists a $k_0 \in \mathbb{N}$ such that $\|A^{(k)} - B^{(k)}\|_\infty \le \tau \|A^{(k-1)} - B^{(k-1)}\|_\infty$ for all $k \ge k_0$ (by the updating rule of $\rho^{(k)}$ in MI-ALM). Hence, we have $A^* - B^* = 0$.
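The updating rule referenced here is the standard augmented Lagrangian penalty update: $\rho^{(k)}$ is enlarged only when the infeasibility $\|A^{(k)} - B^{(k)}\|_\infty$ fails to shrink by the factor $\tau$. A minimal sketch (the growth factor `sigma` and default values are illustrative assumptions, not the paper's exact parameters):

```python
# Sketch of a penalty-parameter update in an outer ALM loop: rho is increased
# (by an assumed factor sigma > 1) only when ||A - B||_inf has not decreased
# by the factor tau relative to the previous iterate.
import numpy as np

def update_rho(rho, A, B, A_prev, B_prev, tau=0.9, sigma=10.0):
    viol = np.max(np.abs(A - B))                 # current infeasibility
    viol_prev = np.max(np.abs(A_prev - B_prev))  # previous infeasibility
    if viol <= tau * viol_prev:
        return rho         # sufficient decrease: keep the penalty
    return sigma * rho     # otherwise enlarge rho to enforce A = B
```

Under this rule, if $\{\rho^{(k)}\}$ stays bounded, the sufficient-decrease branch must eventually hold at every iteration, which forces $\|A^{(k)} - B^{(k)}\|_\infty \to 0$, matching the argument above.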

Therefore, $(A^*, B^*, w^*)$ satisfies the first-order optimality condition of problem 2.3. Moreover, $(A^*, w^*)$ satisfies the first-order optimality condition of problem 2.1. The result follows.

We remark that $v^{(k)} \in \partial \|B^{(k)}\|_1$ and $2 A^{(k)} \Gamma^{(k)} \in \partial \delta(A^{(k)})$ are used only for the theoretical analysis; $v^{(k)}$ and $\Gamma^{(k)}$ do not need to be calculated in the implementation.

$\square$

## Appendix B: Details and Analysis of iPALM Algorithm

Before we study the convergence result of algorithm 2 (theorem 2), we first demonstrate that the scaled augmented Lagrangian function, equation 2.8, satisfies the following blanket assumptions for the iPALM method proposed in Zhu (2016).

**Blanket Assumptions**

- 1. $\|\cdot\|_1 : \mathbb{R}^{D \times d} \to (-\infty, +\infty]$ and $\delta : \mathbb{R}^{D \times d} \to (-\infty, +\infty]$ are proper and lower semicontinuous functions.

- 2. $H_k(\cdot, \cdot, \cdot) : \mathbb{R}^{D \times d} \times \mathbb{R}^{D \times d} \times \mathbb{R}^{d} \to (-\infty, +\infty]$ is a $C^1$ function.

- 3. For any fixed $B$ and $w$, the function $A \mapsto H_k(A, B, w)$ is $C^{1,1}_{1}$ (i.e., $\nabla_A H_k(A, B, w)$ is globally Lipschitz with modulus 1). Similarly, for any fixed $A$ and $w$, the function $B \mapsto H_k(A, B, w)$ is $C^{1,1}_{l_k^1(A, w)}$, and for any fixed $A$ and $B$, the function $w \mapsto H_k(A, B, w)$ is $C^{1,1}_{l_k^2(B)}$.

- 4. There exist $\lambda_1^- > -\infty$ and $\lambda_1^+ < +\infty$ such that
$$\inf\{l_k^1(A, w) : k \in \mathbb{N}\} \ge \lambda_1^- \quad \text{and} \quad \sup\{l_k^1(A, w) : k \in \mathbb{N}\} \le \lambda_1^+.$$
There also exist $\lambda_2^- > -\infty$ and $\lambda_2^+ < +\infty$ such that
$$\inf\{l_k^2(B) : k \in \mathbb{N}\} \ge \lambda_2^- \quad \text{and} \quad \sup\{l_k^2(B) : k \in \mathbb{N}\} \le \lambda_2^+.$$

- 5. $\nabla H_k$ is Lipschitz continuous on bounded subsets of $\mathbb{R}^{D \times d} \times \mathbb{R}^{D \times d} \times \mathbb{R}^{d}$. That is, for each bounded subset $\Omega_1 \times \Omega_2 \times \Omega_3$ of $\mathbb{R}^{D \times d} \times \mathbb{R}^{D \times d} \times \mathbb{R}^{d}$, there exists $\theta > 0$ such that for all $(A, B, w), (A', B', w') \in \Omega_1 \times \Omega_2 \times \Omega_3$,
$$\left\| \begin{pmatrix} \nabla_A H_k(A, B, w) - \nabla_A H_k(A', B', w') \\ \nabla_B H_k(A, B, w) - \nabla_B H_k(A', B', w') \\ \nabla_w H_k(A, B, w)^T - \nabla_w H_k(A', B', w')^T \end{pmatrix} \right\| \le \theta \left\| \begin{pmatrix} A - A' \\ B - B' \\ w^T - (w')^T \end{pmatrix} \right\|.$$
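Under these assumptions, each inner iPALM sweep on the blocks $(A, B, w)$ alternates an inertial extrapolation, a gradient step on the smooth coupling term $H_k$, and a proximal step on the block's nonsmooth term: the prox of $\alpha \|\cdot\|_1$ is soft-thresholding, and the prox of the indicator $\delta(A)$ of the orthogonality constraint is the projection onto the Stiefel manifold via an SVD. The sketch below illustrates one such block update; the step size, inertia parameter, and function names are assumptions for illustration, not the paper's exact scheme.

```python
# One inertial proximal-gradient step on a single block, as in iPALM:
# extrapolate with inertia beta, take a gradient step on the smooth part,
# then apply the block's proximal operator.
import numpy as np

def soft_threshold(X, t):
    """Prox of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)

def project_stiefel(A):
    """Prox of the indicator of {A : A^T A = I_d}: nearest matrix with
    orthonormal columns, obtained from the reduced SVD of A."""
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ Vt

def ipalm_block_step(X, X_prev, grad_smooth, prox, step, beta=0.5):
    """Inertial extrapolation, gradient step on the smooth part, then prox."""
    Y = X + beta * (X - X_prev)          # inertial point
    return prox(Y - step * grad_smooth(Y))
```

For the $B$-block one would pass `prox=lambda Z: soft_threshold(Z, step * alpha)`; for the $A$-block, `prox=project_stiefel`; the $w$-block has no nonsmooth term, so its prox is the identity.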

Next, we prove theorem 2, which implies that step 1 in algorithm 1 is well defined with the iPALM method as a solver.

We consider a fixed value of $k$. For simplicity, we set $V := (A, B, w)$ and $L_k(V) := L_k(A, B, w)$.

- (i) By the first-order optimality conditions of subproblems 2.9 to 2.11, it is easy to check that
$$\xi_A^{(k,j)} \in \partial_A L_k(V^{(k,j)}), \quad \xi_B^{(k,j)} \in \partial_B L_k(V^{(k,j)}), \quad \text{and} \quad \xi_w^{(k,j)} \in \partial_w L_k(V^{(k,j)}).$$
Since $H_k(V)$ is continuously differentiable, by the subdifferentiability property,
$$\partial L_k(V) = \partial_A L_k(V) \times \partial_B L_k(V) \times \partial_w L_k(V),$$
which implies
$$\xi^{(k,j)} \in \partial L_k(V^{(k,j)}) = \partial L_k(A^{(k,j)}, B^{(k,j)}, w^{(k,j)}), \quad j \in \mathbb{N}.$$
To show $\|\xi^{(k,j)}\|_\infty \to 0$, we only need to verify that $\{V^{(k,j)}\}_{j \in \mathbb{N}}$ generated by equations 2.9 to 2.11 is bounded and use lemma 4.6 and theorem 4.3 in Zhu (2016) (or proposition 4.4 in Pock & Sabach, 2016). The boundedness of $\{V^{(k,j)}\}_{j \in \mathbb{N}}$ is proved by contradiction. Notice that for any $\rho^{(k-1)} > 0$,
$$\tilde{L}_k(V) = \rho^{(k-1)} L_k(V) = f(B, w) + \alpha \|B\|_1 + \delta(A) + \frac{\rho^{(k-1)}}{2} \left\| A - B + \bar{\Lambda}^{(k-1)} / \rho^{(k-1)} \right\|_F^2 - \frac{1}{2 \rho^{(k-1)}} \|\bar{\Lambda}^{(k-1)}\|_F^2$$
is a coercive function: $f(B, w)$ is bounded, $\alpha \|B\|_1$ and $\delta(A)$ are coercive, $\|A - B + \bar{\Lambda}^{(k-1)} / \rho^{(k-1)}\|_F^2 \ge 0$, and $-\frac{1}{2 \rho^{(k-1)}} \|\bar{\Lambda}^{(k-1)}\|_F^2$ is bounded from below since $\{\rho^{(k)}\}_{k \in \mathbb{N}}$ is nondecreasing. Suppose $\lim_{j \to \infty} \|V^{(k,j)}\|_\infty = +\infty$. Then there must hold
$$\lim_{j \to \infty} \tilde{L}_k(V^{(k,j)}) = +\infty.$$
On the other hand, we know from proposition 4.3 in Zhu (2016) that there exists $M > 0$ such that $\{L_k(V^{(k,j)}) + M \|V^{(k,j)} - V^{(k,j-1)}\|_F^2\}_{j \in \mathbb{N}}$ is a decreasing sequence, which implies that
$$\lim_{j \to \infty} \tilde{L}_k(V^{(k,j)}) < +\infty.$$
Hence, by the contradiction argument, $\{V^{(k,j)}\}_{j \in \mathbb{N}}$ is bounded.
- (ii) From the proof of the previous point (i), we know that $\{V^{(k,j)}\}_{j \in \mathbb{N}}$ is bounded. Then by theorem 4.3 in Zhu (2016) (or theorem 4.1 in Pock & Sabach, 2016), it remains to verify that $L_k(V)$ is a Kurdyka-Łojasiewicz (KL) function. Notice that
$$L_k(V) = \frac{1}{\rho^{(k-1)}} f(B, w) + \frac{\alpha}{\rho^{(k-1)}} \|B\|_1 + \frac{1}{\rho^{(k-1)}} \delta(A) + \frac{1}{2} \left\| A - B + \frac{\bar{\Lambda}^{(k-1)}}{\rho^{(k-1)}} \right\|_F^2,$$
where $f(B, w)$ satisfies the KL property since the exponential function is definable (Wilkie, 1996) and the composition of definable functions is definable. Therefore, $L_k(V)$ is definable in an o-minimal structure, and the result holds directly.

$\square$

## Notes

^{1}

The code of the proposed MI-ALM algorithm can be found at www.math.hkbu.edu.hk/~mng/mi-alm.html.

^{2}

These five data sets can be downloaded from http://www.uco.es/grupos/kdis/momil/.

## Acknowledgments

The work of H.Z. was supported by NSF of China grant NSFC11701227, NSF of Jiangsu Province under project BK20170522, and Jiangsu University 17JDG013. The work of L.-Z. L. was supported in part by HKRGC GRF 12319816 and HKBU FRG2/15-16/058. The work of M.N. was supported in part by HKRGC GRF 12302715, 12306616, 12200317 and 12300218, HKBU RC-ICRS/16-17/03.