## Abstract

Many classification tasks require both labeling objects and determining label associations for parts of each object. Example applications include labeling segments of images or determining relevant parts of a text document when the training labels are available only at the image or document level. This task is usually referred to as multi-instance (MI) learning, where the learner typically receives a collection of labeled (or sometimes unlabeled) bags, each containing several segments (instances). We propose a semisupervised MI learning method for multilabel classification. Most MI learning methods treat instances in each bag as independent and identically distributed samples. However, in many practical applications, instances are related to each other and should not be considered independent. Our model discovers a latent low-dimensional space that captures structure within each bag. Further, unlike many other MI learning methods, which are primarily developed for binary classification, we model multiple classes jointly, thus also capturing possible dependencies between different classes. We develop our model within a semisupervised framework, which leverages both labeled and, typically, a larger set of unlabeled bags for training. We develop several efficient inference methods for our model. We first introduce a Markov chain Monte Carlo method for inference, which can handle arbitrary relations between bag labels and instance labels, including the standard hard-max MI assumption. We also develop an extension of our model that uses stochastic variational Bayes methods for inference, and thus scales better to massive data sets. Experiments show that our approach outperforms several MI learning and standard classification methods on both bag-level and instance-level label prediction. All code for replicating our experiments is available from https://github.com/hsoleimani/MLTM.

## 1 Introduction

In standard classification tasks (including single-label and multilabel classification), each individual sample, described by a single feature vector, is associated with a subset of the class labels. In supervised classification, for instance, the goal is to predict class labels for test samples given a training set of labeled feature vectors. By contrast, in multi-instance (MI) classification, each sample (bag) consists of *multiple* feature vectors (instances) (Dietterich, Lathrop, & Lozano-Pérez, 1997; Foulds & Frank, 2010; Amores, 2013; Hernández-González, Inza, & Lozano, 2016). Each bag is associated with a subset of the class labels, but individual instances in a bag do not necessarily carry the same labels; each instance is associated with a subset (including possibly none) of the bag’s class labels. In supervised MI classification, we observe only ground-truth labels for training bags, and the goal is to predict class labels for unseen bags and possibly also determine label associations for each bag’s instances (e.g., for both the test *and* training bags).

MI learning was introduced by Dietterich et al. (1997) as a binary classification problem for drug activity prediction. In the application they described, each drug molecule (bag) can take multiple conformations (instances). A drug molecule is considered effective (positive class) if it contains at least one conformation that binds well to a target binding site. If none of the molecule’s conformations bind to the target, the molecule belongs to the negative class. This assumption that a class label is present in a bag *if and only if* at least one of its instances has that label is generally known as the standard MI assumption (Foulds & Frank, 2010).

The standard MI assumption also reasonably applies to other domains such as images or text documents (Chen & Wang, 2004; Andrews, Tsochantaridis, & Hofmann, 2002). For instance, a document (bag) contains a topic if one of its paragraphs (instances) is related to that topic.

Note that in several important domains, there are statistical dependencies and relations between instances in each bag. For example, in text documents, sentences (instances) in a document (bag) are clearly related to each other. Each sentence in a document may be ambiguous and may seem to belong to multiple topics (class labels) if considered independently of other sentences in the document (i.e., ignoring the document’s context). Similarly, spatial relations between pixels or blocks of an image or temporal relations between consecutive segments in a time series could be very informative about the labels present at the bag and instance levels. This type of data in MI learning is sometimes referred to as *structured* data (Zhang et al., 2011; Warrell & Torr, 2011; Guan, Raich, & Wong, 2016). Clearly in these types of domains, instances in each bag should not be modeled as independent and identically distributed (i.i.d.) feature vectors.

The typical definition of MI learning, however, allows another possible scenario, under which no particular structure exists in the bags. This alternative scenario is in fact a special case of weakly supervised problems (Cour & Sapp, 2011), where the instances are i.i.d. feature vectors but there is ambiguity in their labels. In some cases, not every feature vector in the training set can be manually labeled due to finite time and resources or sometimes due to technical constraints; instead, ground-truth labels are provided only for groups (bags) of feature vectors. For instance, consider a batch of headshot photos. We know the names of the people in the photo collection but do not observe the identity of the person in each individual photo (Cour & Sapp, 2011). In this example, no particular relation exists between the photographs; they are independent.

Note that in the latter, weakly supervised scenario, *instances* are i.i.d. samples, whereas in the former, structured case, *bags* are the i.i.d. samples, but the instances in each bag are related to each other and should not be considered independent. In the structured case, methods that capture the possible relations between instances in the bags are expected to perform better than those that ignore a bag’s structure. Most existing MI learning methods, however, focus on the weakly supervised scenario and consider instances as i.i.d. samples. In this letter, we consider the structured scenario and propose an MI learning model that does not assume instances in a bag are i.i.d. and captures relations between instances in each bag. For example, by attributing individual instances to a subset of the class labels, our method identifies which instances are related to each other through shared class labels.

In this letter, we also consider the *multilabel*, multi-instance classification problem, where each bag can possibly have multiple labels and each instance in every bag is associated with a subset of the bag’s labels (Zhou & Zhang, 2006). Our proposed model is consistent with the standard MI assumption; the label set of each bag under our model is the union of the label sets of all of its instances.

Most existing MI classification approaches have been developed primarily for binary classification problems and cannot naturally handle multilabel problems. Of course, we can in principle cast any multilabel problem as multiple separate binary classification problems (one for each of the distinct class labels). However, we expect that modeling all classes jointly and taking into account possible dependencies between them should improve the overall classification performance (see Zhang & Zhou, 2014, for a comprehensive review of multilabel classification algorithms). Our proposed multilabel MI approach jointly models all classes.

We also develop our model within a semisupervised framework (Miller & Uyar, 1997; Chapelle, Schölkopf, & Zien, 2006), where we train on both labeled and unlabeled bags. In many domains, manually labeling a sufficient number of bags is expensive, but a large number of unlabeled bags are readily available. Using a semisupervised framework generally leads to better performance compared to purely supervised models, which use only the labeled samples for learning (Miller & Uyar, 1997; Chapelle et al., 2006).

Note that multi-instance learning is loosely related to semisupervised learning. Zhou and Xu (2007) show that MI learning can be modeled as a special case of semisupervised learning. In a binary classification problem, labels for instances in the negative bags are all specified. Instances in positive bags, however, are unlabeled, albeit with the constraint that at least one of them should belong to the positive class. Zhou and Xu (2007) accordingly developed a special semisupervised support vector machine to address the MI learning problem. We note that this is not the approach we take in this letter. We use a generative semisupervised framework (Miller & Uyar, 1997), treating bags as i.i.d. samples that are either labeled or unlabeled. Aside from instance-level labels, which are never observed, we assume that for each bag, we either observe all its bag-level labels (for labeled bags) or none of them (for unlabeled bags). Nevertheless, we can easily modify our learning algorithm to handle the scenario where only some of a bag’s class labels are observed.

Our model can in principle be applied to any MI domain. However, here we focus on modeling text documents, as this is one prime example of MI learning where instances in each bag are related to each other. Our model can handle other data types (e.g., with continuous-valued features) by suitably changing the likelihood model. Alternatively, one can quantize continuous-valued features and follow a bag-of-words approach (Fei-Fei & Perona, 2005).

Labeling text documents and determining labels for parts of documents, sometimes referred to as *credit attribution* (Ramage, Hall, Nallapati, & Manning, 2009; Ramage, Manning, & Dumais, 2011), is an important task in text mining and information retrieval. For instance, a user may wish to personalize her newsfeed to highlight snippets relevant to a class label or view a summary of the documents in a database that focus on a certain topic. Similarly, an analyst may wish to filter documents based on certain labels to identify trending topics. These are all important applications of credit attribution, which we cast within an MI learning framework in this letter.

In our framework, we treat each document as a bag and every sentence in each document as an instance. Alternatively, constituent parts of documents such as paragraphs could instead be chosen as instances. The proper document segmentation and the choice of instances may depend on the application.

We build our model using topic models (Blei, Ng, & Jordan, 2003) as a foundation. Topic models have been widely used to discover latent topics from collections of text documents (Blei, Carin, & Dunson, 2012; Blei, Ng, & Jordan, 2003). We develop our semisupervised multilabel topic model (MLTM) by extending the ideas of latent Dirichlet allocation (LDA) (Blei et al., 2003), a popular probabilistic topic model. This generative framework allows us to capture relations between sentences in each bag. Our model jointly discovers a set of latent topics, learns associations between each topic and the class labels, labels every (unlabeled) document with (possibly) multiple class labels, and performs sentence-level class attribution for every sentence in every document. We develop a computationally efficient Markov chain Monte Carlo (MCMC) method for inference in our MLTM model. Our model and its MCMC inference were first introduced *solely* within a document classification context in Soleimani and Miller (2016). The earlier work (Soleimani & Miller, 2016) did not consider more general application to MI learning or the conceptual relationship with and experimental comparison to existing MI learning methods. These are considered here. We also introduce a new inference method for our model, as discussed next.

Our particular choice of domain imposes certain requirements on the scalability and computational efficiency of our model. Text is a domain with a very high-dimensional feature space and typically very large data sets, consisting of, for example, tens of thousands of documents. Most existing MI learning methods do not scale well to such large data sets; standard MI benchmark data sets usually consist only of hundreds of bags (Foulds & Frank, 2010; Amores, 2013; Zhou, Sun, & Li, 2009).

In this letter, we thus pay special attention to the efficiency of our model and experiment on data sets that are orders of magnitude larger than typical MI learning data sets. We develop a variant of our MLTM model, MLTMVB, which replaces the MCMC of MLTM with stochastic variational Bayes (VB) methods for inference. Our MLTMVB model scales better than MLTM to larger data sets by exploiting parallelism.

The rest of the letter is organized as follows. We introduce some background material and review related work in section 2. We present our MLTM model in section 3. In section 4, we introduce our MLTMVB model, which scales better to larger data sets. Experimental results are reported in section 5. We discuss the results and further compare different methods in section 6. Section 7 presents our conclusions.

## 2 Background and Related Work

In this section, we introduce our problem and review some related MI learning and other supervised and semisupervised topic modeling approaches, including the baseline methods against which we will compare our proposed method.

Assume we have a $C$-class training data set consisting of unlabeled bags, $\mathcal{D}_u$, and labeled bags, $\mathcal{D}_l$. Let $\mathcal{D} = \mathcal{D}_u \cup \mathcal{D}_l$. Throughout the letter, we sometimes refer to bags as documents and instances as sentences in order to aid understanding of our model. This choice of terminology does not restrict the applicability of our model solely to the document classification domain. Each bag $d$ has a $C$-dimensional binary label vector $y_d$, which specifies whether class label $c$ is present ($y_{dc} = 1$) or absent ($y_{dc} = 0$) in $d$. The set of labels can have zero, one, or multiple elements: $0 \le \sum_{c=1}^{C} y_{dc} \le C$.

Note that in our semisupervised framework, we distinguish between a labeled bag with no class label ($y_{dc} = 0$ for every class $c$) and an unlabeled bag. The former has been examined and determined to have no defined class labels, while the latter has not been labeled; that is, the class labels are latent variables in the latter case. Labeled bags with no class labels are informative about *general* topics (clusters), that is, those not associated with any known class.

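There are, in total, $V$ unique words in the dictionary, indexed $v = 1, \dots, V$. Document $d$ has $S_d$ sentences, and sentence $s$ has $N_{ds}$ words $w_{dsn}$, where $n = 1, \dots, N_{ds}$ and each $w_{dsn} \in \{1, \dots, V\}$. We also define .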

We treat each sentence as one instance in the document (bag). Each label $c$ for every sentence $s$ of document $d$ has a latent variable $y_{dsc}$ indicating whether class label $c$ is present in the sentence ($y_{dsc} = 1$) or not ($y_{dsc} = 0$). Note that treating $y_{dsc}$ as a latent variable is consistent with standard MI learning settings where we observe only bag labels (in our semisupervised case, only for *some* of the bags) and we do not have any ground-truth information about instance labels. Accordingly, our observations consist of the words of all documents and the bag-level label vectors $y_d$ of the labeled documents.

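To follow the multilabel version of the standard MI learning assumption, for every class $c$ in each bag $d$, we let $y_{dc} = \max_{s} y_{dsc}$, where $y_{dsc}$ indicates whether class $c$ is present in sentence $s$; that is, $y_{dc} = 1$ iff $y_{dsc} = 1$ for at least one sentence $s$ in $d$.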
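As a minimal sketch of this assumption, the snippet below derives bag-level labels from a table of sentence-level indicators by taking a per-class maximum over sentences. The function name `bag_labels_from_instances` and the list-of-lists encoding are illustrative choices, not part of the authors' released code.

```python
def bag_labels_from_instances(instance_labels, num_classes):
    """Return the bag-level label vector under the multilabel standard MI assumption.

    instance_labels: one binary list of length num_classes per sentence.
    A class is present in the bag iff at least one sentence carries it.
    """
    bag = [0] * num_classes
    for sentence in instance_labels:
        for c, indicator in enumerate(sentence):
            bag[c] = max(bag[c], indicator)
    return bag


# Hypothetical document with three sentences and four classes.
sentences = [
    [1, 0, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],  # a sentence may carry none of the bag's labels
]
print(bag_labels_from_instances(sentences, 4))  # [1, 0, 1, 0]
```

A bag with no sentences, or with only unlabeled sentences, maps to the all-zero label vector, which corresponds to a labeled bag with no class label.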

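For unlabeled documents (belonging to either the unlabeled training set $\mathcal{D}_u$ or an independent test set), our goal is to predict both the bag-level labels $y_{dc}$ and the sentence-level labels $y_{dsc}$. For labeled documents (belonging to either the labeled training set $\mathcal{D}_l$ or an independent test set), we only need to predict the sentence-level labels $y_{dsc}$.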

Next, we first review some related supervised and semisupervised topic models and then discuss some existing MI learning approaches.

### 2.1 Topic Models

Latent Dirichlet allocation (LDA), introduced in Blei et al. (2003), is perhaps the most popular unsupervised topic model. LDA uses word occurrences to discover a set of $M$ topics exhibited in the document collection $\mathcal{D}$. Each topic $m$ is a multinomial distribution $\beta_m$ over all $V$ words in the dictionary; that is, $\sum_{v=1}^{V} \beta_{mv} = 1$, where $\beta_{mv}$ is the probability of word $v$ under topic $m$. LDA also posits that multiple topics may be present in each document and accordingly estimates topic proportions $\theta_d$, a multinomial distribution over topics, for every document $d$: $\sum_{m=1}^{M} \theta_{dm} = 1$.

Several supervised and semisupervised extensions of LDA have been proposed for incorporating ground-truth class labels (Blei & McAuliffe, 2010; Lacoste-Julien, Sha, & Jordan, 2008; Wang, Blei, & Li, 2009; Dai & Storkey, 2014; Lu & Zhai, 2008; Mao et al., 2012). These topic models are not primarily appropriate for MI learning as they are developed only to predict bag-level class labels and not to solve the credit attribution problem. However, we can still use some of these methods to make instance-level predictions in a heuristic fashion. Some other topic models focusing on credit attribution in text documents have also been proposed (Ramage et al., 2009, 2011; Kim, Kim, & Oh, 2012; Yang, Kotov, Mohan, & Lu, 2015). These methods are better suited to MI learning problems. However, they do not follow the standard MI learning assumption and suffer from some other limitations which we discuss next.

#### 2.1.1 Semisupervised LDA

Blei and McAuliffe (2010) introduced supervised LDA (sLDA), an extension of LDA for incorporating document response variables. The distribution of the response variable in every document in sLDA is a generalized linear model with the empirical topic proportions of that document as covariates. sLDA is a fully supervised model that requires ground-truth labels for all training documents; however, as Blei and McAuliffe (2010) discussed, a semisupervised extension can be easily achieved by integrating out the label variable for unlabeled documents. In our letter, we focus on such an approach, which we dub semisupervised LDA (ssLDA), with a response variable distribution appropriate for documents with multiple labels.

The generative process of each document under ssLDA is as follows:

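1. Draw topic proportions $\theta_d \sim \mathrm{Dir}(\alpha)$.
2. For every word position $n$ in each sentence $s$:
   1. Choose a topic $z_{dsn} \sim \mathrm{Mult}(\theta_d)$.
   2. Draw a word $w_{dsn} \sim \mathrm{Mult}(\beta_m)$, where $m$ is the chosen topic.
3. Draw each class label $y_{dc} \sim p(y_{dc} \mid \bar{z}_d, \eta_c)$, $c = 1, \dots, C$.

Here, $\bar{z}_d = \frac{1}{N_d} \sum_{s} \sum_{n} z_{dsn}$ denotes the empirical topic proportions, and $N_d$ is the total number of words in document $d$. Also, $z_{dsn}$ is an $M$-dimensional binary random vector with only a single element equal to one and all other elements equal to zero. That is, $\sum_{m=1}^{M} z_{dsnm} = 1$, and $z_{dsnm} = 1$ indicates that topic $m$ is used to generate $w_{dsn}$.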

Word probabilities, $\beta$, and class-dependent regression variables, $\eta_c$, are the parameters of ssLDA. Topic proportions $\theta_d$ and topic assignments $z_{dsn}$ are latent variables, which are integrated out.
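The generative process above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions rather than the authors' implementation: the logistic link between the empirical topic proportions and each class label is one plausible choice of generalized linear model, and the names `sample_dirichlet`, `generate_document`, `alpha`, `beta`, and `eta` are hypothetical.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalizing independent Gamma(alpha_m, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def generate_document(sentence_lengths, alpha, beta, eta, rng):
    """Sample the words of one document and one binary label per class.

    beta[m] is the word distribution of topic m; eta[c] holds the regression
    weights of class c. The logistic link is an assumption of this sketch.
    """
    num_topics = len(alpha)
    theta = sample_dirichlet(alpha, rng)  # document-level topic proportions
    topic_counts = [0] * num_topics
    sentences = []
    for length in sentence_lengths:
        words = []
        for _ in range(length):
            topic = rng.choices(range(num_topics), weights=theta)[0]
            word = rng.choices(range(len(beta[topic])), weights=beta[topic])[0]
            topic_counts[topic] += 1
            words.append(word)
        sentences.append(words)
    total_words = sum(topic_counts)
    # Empirical topic proportions: fraction of words assigned to each topic.
    z_bar = [count / total_words for count in topic_counts]
    labels = []
    for weights in eta:
        score = sum(w * z for w, z in zip(weights, z_bar))
        prob = 1.0 / (1.0 + math.exp(-score))
        labels.append(1 if rng.random() < prob else 0)
    return sentences, z_bar, labels


rng = random.Random(0)
alpha = [0.5, 0.5, 0.5]  # three topics
beta = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.4, 0.4],
]
eta = [[4.0, -4.0, 0.0], [-4.0, 4.0, 0.0]]  # two classes
sentences, z_bar, labels = generate_document([5, 3], alpha, beta, eta, rng)
```

Note that the labels depend on the document only through the empirical topic proportions, so the sketch produces bag-level labels but no sentence-level labels.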

In the variational M-step, given the current estimate of the variational parameters, we update word probabilities by . We also use conjugate gradient methods (Nocedal & Wright, 2006) to update the class-dependent regression parameters, , and the Dirichlet hyperparameter, . We alternately repeat the variational E-step and M-step until the increase in the variational lower bound is smaller than a threshold.

We can estimate the posterior probability of each class label by , where . ssLDA does not directly predict instance-level labels; however, we can approximate the probability of each class label in every sentence using where . Note that this approximate method does not set and consistent with the standard MI assumption.

#### 2.1.2 Partially Labeled LDA

Ramage et al. (2009) proposed labeled LDA, a generative model for credit attribution in multilabeled documents. Labeled LDA assigns *each* word in a document to *one* of the document’s labels. Labeled LDA assumes a one-to-one association between topics and classes; unlike ssLDA where the number of topics $M$ is a hyperparameter, in labeled LDA, $M$ equals $C$, the total number of classes. Under labeled LDA, words in each document are generated using only the topics associated with the *observed* label set, $\mathcal{C}_d = \{c : y_{dc} = 1\}$, of that document.

Ramage et al. (2011) developed partially labeled LDA (PLLDA), an extension of labeled LDA that allows more than one topic for every class label, as well as some general topics that are not associated with any class. The general topics are used to explain labeled documents that have not been assigned to any classes ($y_{dc} = 0$ for every class $c$), as well as words in labeled documents that are not well associated with any of the classes. Under PLLDA, each nongeneral topic belongs to exactly one class. That is, the topic sets $\mathcal{T}_1, \dots, \mathcal{T}_C, \mathcal{T}_g$ are pairwise disjoint, where $\mathcal{T}_c$ is the set of topics assigned to class $c$ and $\mathcal{T}_g$ is the set of general topics. In PLLDA, the number of topics per class ($|\mathcal{T}_c|$) is a hyperparameter that is chosen up front.

The generative process of PLLDA is as follows:

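1. Generate word probabilities $\beta_m$ for every topic $m$.
2. For every document $d$ with label set $\mathcal{C}_d$:
   1. Generate topic proportions $\theta_d$ over the general topics and the topics assigned to the classes in $\mathcal{C}_d$.
   2. For each word position $n$ in every sentence $s$:
      1. Choose a topic $z_{dsn} \sim \mathrm{Mult}(\theta_d)$.
      2. Draw a word $w_{dsn} \sim \mathrm{Mult}(\beta_m)$, where $m$ is the chosen topic.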

PLLDA was developed to address the credit attribution problem; however, its generative process does not satisfy the standard MI assumption. Although only the general topics as well as the topics associated with the labels of each document contribute to generating the words in that document, the generative process does not guarantee that every class label of a document is assigned to (is explained by) at least one word in that document. We call this type of model failure *inconsistent labeling*. Experimentally, we have observed that inconsistent labeling occurs with high frequency in PLLDA (see Figure 4 and its discussion). Unlike PLLDA, our proposed method guarantees that each label of a document is assigned to at least one sentence in that document.

PLLDA has some other limitations that we will overcome through our proposed model. First, it assumes that the class labels of each document are observed. In order to predict the class labels for an unlabeled test document, we need to average over $2^C$ cases (all possible label assignments for $C$ classes), which may be practically infeasible. Both PLLDA and labeled LDA thus in general require heuristic inference in practice. Ramage et al. (2009, 2011) first make the simplifying assumption that for test documents with unknown labels, all topics (hence all classes) are present, reducing the model to standard LDA. They then estimate topic proportions as in standard LDA and, by thresholding the topic proportions, determine which topics (hence, which labels) are present in each test document.

Second, PLLDA assumes that each topic belongs to only *one* class. This significantly simplifies interpretation of the topics and the label assignments to words; however, it may also result in redundant topics (highly similar topics) that are assigned to different classes. This also exacerbates the inconsistent labeling problem (see Figure 4 and its discussion). It is conceivable that some topics should be associated with multiple classes.

Third, PLLDA does not directly assign labels to what we consider as instances (sentences or paragraphs); it assigns labels to individual words. Label posteriors for each sentence can still be approximated by averaging over label posteriors of all words in that sentence. However, this is a heuristic way to infer sentence-level labels for PLLDA. Moreover, individual words are often ambiguous by themselves and are more meaningful in the context of a sentence. If considered alone, each word may appear to belong to multiple classes. In fact, this is another potential issue with PLLDA: it assigns every word to *one* class and does not allow a word to be associated with more than one class.
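The averaging heuristic described above can be sketched as follows. The function name `sentence_label_posteriors` and the input layout (one list of class posteriors per word) are assumptions made for illustration; how the per-word posteriors are obtained from a trained PLLDA model is outside the scope of the sketch.

```python
def sentence_label_posteriors(word_posteriors):
    """Average per-word label posteriors to score one sentence.

    word_posteriors: one list per word; entry c is the posterior
    probability that the word is assigned to class c.
    """
    if not word_posteriors:
        raise ValueError("sentence must contain at least one word")
    num_classes = len(word_posteriors[0])
    return [
        sum(word[c] for word in word_posteriors) / len(word_posteriors)
        for c in range(num_classes)
    ]


# Hypothetical posteriors for a three-word sentence over two classes.
words = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]
scores = sentence_label_posteriors(words)
```

Because every word contributes equally, a few strongly class-specific words can be diluted by many ambiguous ones, which is one reason this remains a heuristic.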

A nonparametric extension of PLLDA has been proposed in Kim et al. (2012), who estimate the number of topics for each class based on the data. Nevertheless, it still retains most PLLDA limitations. In this letter, we take basic PLLDA as one baseline for comparison with our methods.

### 2.2 Multi-Instance Learning

#### 2.2.1 Binary MI Classification Methods

Most MI learning methods have been proposed primarily for binary classification problems.

*miSVM and MISVM.* Andrews et al. (2002) proposed two discriminative methods, miSVM and MISVM, for binary MI classification by extending support vector machines (SVMs). miSVM is an iterative algorithm based on SVMs that maximizes the margin on instances. The labels for instances in positive bags are unobserved variables. At each iteration of the algorithm, miSVM alternates between completing the data (i.e., imputing labels for instances in the positive bags) and learning an SVM model on the completed data. To complete the data, in each positive bag, the instance with the greatest decision function value is assigned a positive label, while the rest are considered negative instances. This iterative algorithm is terminated when there is no change in the imputed instance labels. miSVM is most suitable for instance-level classification (Andrews et al., 2002).
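The data-completion step can be illustrated with a short sketch that follows the description above. The decision values are assumed to come from whatever SVM was fitted in the previous iteration; the function name `impute_instance_labels` and the +1/-1 label encoding are illustrative assumptions.

```python
def impute_instance_labels(decision_values, bag_labels):
    """Impute instance labels from per-instance decision values.

    decision_values: one list of classifier scores per bag.
    bag_labels: +1 for positive bags, -1 for negative bags.
    Following the description above, the top-scoring instance of each
    positive bag becomes positive and every other instance is negative.
    """
    imputed = []
    for scores, bag_label in zip(decision_values, bag_labels):
        labels = [-1] * len(scores)
        if bag_label == 1 and scores:
            best = max(range(len(scores)), key=lambda i: scores[i])
            labels[best] = 1
        imputed.append(labels)
    return imputed


# Hypothetical scores: one positive bag and one negative bag.
decision_values = [[-0.2, 1.3, 0.4], [0.5, -1.0]]
bag_labels = [1, -1]
print(impute_instance_labels(decision_values, bag_labels))
# [[-1, 1, -1], [-1, -1]]
```

An outer loop would retrain the SVM on the imputed labels and stop once two consecutive calls return identical labels, which matches the termination rule stated in the text.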

In contrast, MISVM maximizes the bag margin, and hence it is more appropriate for bag classification (Andrews et al., 2002). MISVM represents each positive bag with one of its instances (again, the instance with the maximum decision function) and trains an SVM model on the representatives from all positive bags and all instances from negative bags. Iteratively, MISVM alternates between choosing the positive bag representatives and learning the SVM model. The algorithm is terminated when the representatives do not change.

Both miSVM and MISVM fail to exploit possible relations between instances in each bag. Also, MISVM essentially treats instances as i.i.d. samples. Moreover, they are both fully supervised models, requiring ground-truth labels for all training bags.

*EM-DD.* Zhang and Goldman (2002) proposed EM diverse density (EM-DD), an extension of the DD method introduced by Maron and Lozano-Pérez (1998). EM-DD posits that in each bag, one instance, which is a priori unknown, is responsible for the bag’s class label. This unknown instance is treated as missing data within an approximate EM framework (Zhang & Goldman, 2002).

The EM-DD algorithm consists of an approximate E-step and an M-step, alternately, iteratively applied. In the E-step, Zhang and Goldman (2002) compute to determine the instance responsible for each bag’s class label. In the M-step, they optimize , the log likelihood of class labels given , with respect to the model parameters , where is the likelihood of the label of bag () conditioned on and .

Note that EM-DD is not a true EM algorithm (Dempster, Laird, & Rubin, 1977; Bishop, 2006): in EM-DD is not the expectation of a complete-data log likelihood with respect to the posterior of missing variables. For this reason, each iteration in EM-DD is not guaranteed to increase the objective function . Instead, Zhang and Goldman (2002) propose to stop the algorithm when decreases or when the relative absolute change in is smaller than a threshold value.

Although EM-DD is primarily developed for binary classification, it can be easily modified for multilabel MI problems. Specifically, redefine the objective function , where , , and .

Similar to miSVM and MISVM, EM-DD is purely supervised. EM-DD also does not exploit relations between instances in each bag.

*MIMLSVM*. Zhou and Zhang (2006) proposed MIMLSVM, a binary classification MI approach that is used for tackling multilabel problems by treating them as separate single-class MI classifications. MIMLSVM first performs a $k$-medoids clustering of all training bags and divides the data into $k$ partitions, using Hausdorff distance to measure the distance between bags (Zhou & Zhang, 2006). Then each training bag is transformed into a $k$-dimensional feature vector whose $j$th component is the Hausdorff distance between the bag and the medoid of the $j$th partition ($j = 1, \dots, k$). MIMLSVM learns one binary SVM model for each class (class present or not), where now the training samples are the transformed feature vectors for each (labeled) bag and its corresponding class label.
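The bag transformation can be sketched as below, assuming the medoid bags have already been found by a clustering step that is not shown. The sketch uses the maximal Hausdorff distance with a Euclidean ground metric, which is an assumption about the variant used; the names `hausdorff` and `transform_bag` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hausdorff(bag_a, bag_b):
    """Maximal Hausdorff distance between two bags of feature vectors."""
    def directed(src, dst):
        # Largest distance from a point in src to its nearest point in dst.
        return max(min(euclidean(p, q) for q in dst) for p in src)
    return max(directed(bag_a, bag_b), directed(bag_b, bag_a))


def transform_bag(bag, medoids):
    """Map a bag to one Hausdorff distance per medoid bag."""
    return [hausdorff(bag, medoid) for medoid in medoids]


# Two hypothetical medoid bags and one bag to transform.
medoids = [[(0.0, 0.0), (1.0, 0.0)], [(5.0, 5.0)]]
bag = [(0.0, 0.0), (0.0, 3.0)]
features = transform_bag(bag, medoids)
```

The resulting fixed-length vector can be passed to any standard single-instance classifier, at the cost of discarding instance-level information.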

*miGraph and MIGraph*. Zhou et al. (2009) proposed two MI classification methods, MIGraph and miGraph, which incorporate the structure in each bag and do not treat instances as i.i.d. samples. These methods are based on constructing a graph for each bag in which instances are the nodes and the weight of each edge between two nodes is computed based on a similarity measure between their corresponding instances. After constructing the graphs, a graph kernel function is defined to measure the similarity between graphs, which then is used in an SVM for solving the classification problem.

Both miGraph and MIGraph can predict only bag-level labels. Also, they are both fully supervised models and, unlike our model, do not learn the latent structure (topics) in the data.

*MIMM*. Foulds and Smyth (2011) proposed multi-instance mixture models (MIMM) for binary classification, a class of generative models adapted from naive Bayes mixture models where the instances in each bag are assumed to be independent conditioned on their binary class labels. To generate each bag, they first choose a latent binary label indicator for each instance in the bag from a Bernoulli probability , , and then generate each instance conditioned on its class label , , where is an arbitrary distribution from the exponential family whose parameters are class specific. After generating all the instances in each bag, Foulds and Smyth (2011) set the bag binary label indicator using . They treat instance labels as latent variables and propose an EM algorithm for learning the model parameters.
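A rough sketch of this kind of generative process is given below. Several details are assumptions made only for illustration: a single shared Bernoulli probability for instance labels, a univariate Gaussian as the class-specific exponential-family distribution, and a bag label set by the standard MI assumption (logical OR over instance labels). The names `generate_bag` and `class_params` are hypothetical.

```python
import random


def generate_bag(num_instances, positive_prob, class_params, rng):
    """Sample one bag: instances, latent instance labels, and the bag label.

    class_params[y] = (mean, std) of a univariate Gaussian for class y; the
    Gaussian is only one possible exponential-family choice.
    """
    instance_labels = []
    instances = []
    for _ in range(num_instances):
        label = 1 if rng.random() < positive_prob else 0
        mean, std = class_params[label]
        instance_labels.append(label)
        # Instances are independent given their latent labels.
        instances.append(rng.gauss(mean, std))
    # Assumed rule: the bag is positive iff any instance is positive.
    bag_label = max(instance_labels, default=0)
    return instances, instance_labels, bag_label


rng = random.Random(0)
class_params = {0: (0.0, 1.0), 1: (4.0, 1.0)}
instances, instance_labels, bag_label = generate_bag(6, 0.3, class_params, rng)
```

In this sketch the instances share no bag-specific latent variable, so they are independent given their labels, which is the naive Bayes assumption the text attributes to MIMM.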

Our proposed framework improves the ideas presented in MIMM in two important aspects. First, we do not make a naive Bayes assumption for generating the instances in each bag. Instead, we propose a hierarchical Bayesian model and posit that the instances in each bag are independent, conditioned on a low-dimensional latent space. This allows our model to capture the relations between instances. Second, our framework can also jointly model multiple labels and learn the possible correlations among different classes. As will be discussed in section 3, the bag-specific low-dimensional latent space is our primary mechanism for capturing the relations between the instances in each bag, as well as learning the correlations among class labels.

Foulds and Smyth (2011) also discussed the cost of approximating MI learning problems as semisupervised problems, where instances in negative and positive bags are treated as labeled and unlabeled i.i.d. samples, respectively. We again emphasize that this is not the approach we take in this letter; the semisupervised approach presented here is with respect to bag labels, which may or may not be observed.

*DPMIL.* Kandemir and Hamprecht (2014) proposed a Dirichlet process mixture of gaussian models for binary MI classification. They assume that each class has its own Dirichlet process mixture model. To understand how this method works, suppose at first that the class labels are given for each instance. Then each instance in every bag is generated by first choosing a component from its class-specific mixture model and then drawing its feature vector from the component’s multivariate gaussian distribution. Given the instance labels, the label for each bag is determined using the standard MI assumption. Kandemir and Hamprecht (2014) treat instance labels as unknown deterministic binary variables that are in fact determined to maximize the complete data log likelihood using an iterative constrained optimization. Each iteration alternates between estimating model parameters given fixed instance labels and estimating instance labels given fixed parameters. Note that the optimization of instance labels is done subject to the standard MI learning assumption, given the observed bag labels. This requires a constrained optimization over the binary variables for each instance in every positive bag (essentially, cyclical optimization over these binary variables), which may not easily scale to large data sets. Also, Kandemir and Hamprecht (2014) treat instances in each bag as i.i.d. samples.

*Other approaches.* Warrell and Torr (2011) considered MI learning with structured data and used conditional random field models to capture the dependencies between instances in each bag.

Doran and Ray (2014) took a more theoretical approach and discussed the conditions under which a classifier with arbitrarily small error rate under a probability measure can be learned for an MI learning problem.

Guan et al. (2016) proposed an MI learning framework for activity recognition in time-series data. They use an autoregressive hidden Markov model to capture the temporal relations between instances in each bag. They consider a binary classification problem and, instead of the standard MI assumption, label a bag as positive if the majority of the instances in that bag are from the positive class; that is, they follow a soft-max approach.

#### 2.2.2 Multilabel MI Classification Methods

Some MI learning methods have been proposed for handling multilabel problems. These approaches can exploit possible relations between different classes by modeling all classes jointly rather than treating them as multiple separate binary classification problems.

*MIMLBOOST.* Zhou and Zhang (2006) proposed a supervised learning method, MIMLBOOST, for multilabel MI classification. In MIMLBOOST, they first build a new data set by transforming each pair into pairs of bags and labels, . MIMLBOOST then follows a boosting approach (Freund & Schapire, 1996; Xu & Frank, 2004) to minimize the classification error by developing a series of SVM classifiers. At each boosting level, a new multiclass SVM classifier is learned on instances of the expanded data set, considering each bag’s class label as the ground-truth label for all its instances. Note that, quite importantly, there are replicas of each instance in every bag in the training set for any SVM classifier at each boosting level—one replica for every class label. All of these replicas have the same input feature values as they are copies of the same instance but can have different target values (ground-truth class labels). In this case, the SVM model receives conflicting target values for the same input feature vector.^{1} To distinguish these replicas from each other, Zhou and Zhang (2006) augment the feature vector of every instance in each bag by an additional class-dependent feature. Clearly, this additional feature does not have any significant effect for high-dimensional feature spaces such as text documents, where the typical feature dimensionality is on the order of tens of thousands. Also, due to expansion of the data set by a factor of , this method is not computationally practical for our application.

*Dirichlet-Bernoulli alignment.* An MI learning method that also captures structure in each bag is Dirichlet-Bernoulli alignment (DBA), introduced by Yang, Zha, and Hu (2009). To generate each document (bag) , they first generate class proportions . Then, for each sentence (instance) , they choose a class (topic) and generate all words in that sentence from topic , . Finally, they draw each class label for document from a Bernoulli distribution with probability proportional to , where is now the vector of empirical class frequencies for document , estimated by a normalized sum of over all sentences in the document. Under DBA, sentences are independent conditioned on the bag-specific class proportions.

Note that DBA is a special case of ssLDA, discussed in section 2.1.1, in which the same topic is used to generate all words in a sentence. Unlike ssLDA, where the number of topics is a hyperparameter, there is a one-to-one association between topics and classes in DBA, which results in . Also, DBA does not follow the standard MI assumption.

*SSMIML.* Xu, Jiang, Xue, and Zhou (2012) developed SSMIML, a semisupervised multilabel MI learning framework based on manifold regularization for video annotation. In addition to labeled samples, they exploit unlabeled training bags through a term that captures the similarities among all bags. However, the computational complexity of this term grows quadratically with the sample size, which restricts the applicability of their method to large data sets. Also, training in Xu et al. (2012) involves an iterative concave-convex constrained optimization that is not scalable (Zhou, Zhang, Huang, & Li, 2012).

*EnMIMLNN.* Wu, Huang, and Zhou (2014) proposed EnMIMLNN, an ensemble supervised learning framework for multilabel MI classification. For every class label , they first perform a K-medoids clustering on the subset of training bags in which class is present () and divide into clusters, , where is a hyperparameter. They use the Hausdorff distance defined as below to compute the distance between bags:
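
For orientation, the classical (maximal) Hausdorff distance between two bags can be sketched in a few lines of Python. This is the textbook definition with Euclidean distance between instances assumed; EnMIMLNN may rely on a different variant, and the function names below are illustrative only.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def directed_hausdorff(bag_a, bag_b):
    """Largest distance from any instance in bag_a to its nearest instance in bag_b."""
    return max(min(euclidean(u, v) for v in bag_b) for u in bag_a)

def hausdorff(bag_a, bag_b):
    """Classical (maximal) Hausdorff distance: the symmetrized directed distance."""
    return max(directed_hausdorff(bag_a, bag_b), directed_hausdorff(bag_b, bag_a))

# Two toy bags of two-dimensional instances (made-up values).
bag_1 = [(0.0, 0.0), (1.0, 0.0)]
bag_2 = [(0.0, 0.0), (4.0, 0.0)]
distance = hausdorff(bag_1, bag_2)
```

The distance is driven by the single instance that is farthest from the other bag, which is why it is sensitive to outlying instances.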

Note that under EnMIMLNN, each class has its own specific set of weights. Also, the error function, equation 2.1, separates for each class . That is, the weights for each class are essentially estimated independently from other classes. The only common parameter across classes is , which is not optimized. This prevents EnMIMLNN from learning strong relations between classes. Also, unlike our proposed framework, EnMIMLNN is a purely supervised model.

*Other approaches.* Briggs, Fern, Raich, and Lou (2013) proposed a multilabel MI learning framework, primarily focused on instance-level label prediction. Their instance annotation problem is tackled by optimizing a nonconvex regularized rank-loss objective function.

Pham, Raich, Fern, and Arriaga (2015) proposed a discriminative probabilistic model for multilabel MI learning in the presence of novel class instances. Conditioned on observed features, they use multinomial logistic regression to generate class labels for each instance; that is, each instance will be assigned one label from a set of predefined classes or from an unknown class (multiclass classification with a novel class). The bag labels are then determined by taking the union of instance labels. Pham et al. (2015) treat instance labels as latent random variables and propose an EM algorithm to estimate instance labels and model parameters.

Adel, Smith, Urner, Stashuk, and Lizotte (2013) consider diagnosis of neuromuscular disorders using sets of motor unit potential trains (MUPTs) as a three-class MI learning problem and compare different graphical models with simple structures for this task.

*Proposed baseline: Discriminative likelihood model.* As a baseline to compare against our model, we also propose here a discriminative likelihood model (DLM), a variation on a logistic regression classifier, for multilabel MI classification. To follow the standard MI assumption, we let . Under DLM, each is a Bernoulli random variable with probability proportional to , where and are both vectors of length ; is the feature vector of instance of document ; and ’s are the model parameters. Also, , the number of exponential components, is a hyperparameter. Note that we do not discover topics in this model; we simply predict bag and instance labels conditioned on the feature vectors. The likelihood of the occurring class labels, over all labeled documents, under DLM is where and

We use conjugate gradient methods (Nocedal & Wright, 2006) to estimate the model parameters, , so as to maximize the class posterior log likelihood.

## 3 Semisupervised Multilabel Topic Model

In this section, we introduce our semisupervised, multilabel topic model (MLTM). Our model discovers a set of topics (latent structure in the data) and predicts bag-level and instance-level class labels by learning the associations between class labels and the discovered topics. We also explicitly impose the standard MI assumption by setting .

MLTM posits that each topic has a Bernoulli probability parameter , associating it with every class . This allows topics to belong to multiple, one, or even zero classes probabilistically. We assume these class association probabilities are generated from a common beta distribution with parameters and , which is the conjugate prior for the Bernoulli distribution.

Each document (bag) has its own topic proportions generated from a Dirichlet distribution over the topics (i.e., ). To generate each word in a document, like LDA, we first choose a topic according to the topic proportions of that document (i.e., ) and then choose a word according to the word probabilities under that topic.

After generating all words in each sentence of document , we generate , a binary random variable that determines whether this sentence has the label . To do this, we first uniformly choose one of the words in the sentence (an anchor word): . Then we generate based on the class association probability of the topic of origin of that word, that is, where and is the topic with . By using this structure, we allow every sentence to have multiple, one, or zero labels. Note that the anchor words used for generating different class labels in a sentence are not necessarily the same.

After generating all sentences in a document and their labels, the document-level class labels are determined using .

Figure 1 shows the graphical model for MLTM. The generative process of MLTM is summarized as follows:

1. Generate topics: .
2. Generate the class association probability for each topic under every class: .
3. For every document :
   - (a) Generate topic proportions .
   - (b) For each sentence :
     - (i) For each word position :
       - Choose a topic .
       - Generate a word .
     - (ii) For every class label :
       - Choose a word position .
       - Choose class presence where = 1 and .
   - (c) Set .
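
The generative process above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `topics`, `mu`, `alpha`, and `generate_document` are hypothetical, the topics are assumed to be drawn from a symmetric Dirichlet prior, and all numeric values are made up.

```python
import random

def dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(sentence_lengths, topics, mu, alpha, rng):
    """Generate one bag: words, sentence labels, and document labels.

    topics[k][v]: probability of word v under topic k.
    mu[k][c]: probability that topic k is associated with class c.
    """
    n_topics, n_classes = len(topics), len(mu[0])
    vocab = range(len(topics[0]))
    theta = dirichlet([alpha] * n_topics, rng)  # bag-specific topic proportions
    sentences, sentence_labels = [], []
    for length in sentence_lengths:
        z = rng.choices(range(n_topics), weights=theta, k=length)  # topic per word
        words = [rng.choices(vocab, weights=topics[k])[0] for k in z]
        labels = []
        for c in range(n_classes):
            anchor = rng.randrange(length)  # uniformly chosen anchor word
            labels.append(int(rng.random() < mu[z[anchor]][c]))
        sentences.append(words)
        sentence_labels.append(labels)
    # Standard MI assumption: a class is present in the bag if any sentence has it.
    doc_labels = [max(lab[c] for lab in sentence_labels) for c in range(n_classes)]
    return sentences, sentence_labels, doc_labels

rng = random.Random(0)
K, V, C = 3, 20, 2  # illustrative numbers of topics, vocabulary words, and classes
topics = [dirichlet([0.5] * V, rng) for _ in range(K)]
mu = [[rng.betavariate(1.0, 1.0) for _ in range(C)] for _ in range(K)]
sentences, sentence_labels, doc_labels = generate_document([5, 8, 4], topics, mu, 0.5, rng)
```

Each class uses its own anchor word, so a single sentence can receive several labels, one label, or none.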

The bag-specific topic proportions in our hierarchical graphical model play an important role in capturing the relations between sentences in each bag and the correlations between class labels. The sentences in each bag are independent conditioned on the bag’s topic proportions. But as we will discuss in the sequel, the posterior on the topic proportions for each bag is determined based on *all* sentences in the bag. Accordingly, the topic proportions in essence capture the context of the bag/document: the topics that the sentences jointly exhibit. For the purpose of illustration, consider a document with three sentences and a model with topics. Suppose that if considered independently, these sentences would have topic proportions , and , respectively. We see that the first sentence is ambiguous in that it is equally associated with topics 1 and 2. However, our model treats them jointly as parts of a single document and learns a single (posterior) topic proportion vector for the document. In this case, all sentences will have the same vector of topic proportions: . Now the entire document, including the first sentence, is more likely to be associated with topic 1. In this example, the second and third sentences help disambiguate the first sentence.

Consequently, the sentence labels (on classes), which are determined based on the topics of origin of the words in that sentence and their class association probabilities, are influenced by all other sentences in the document. In other words, mediated by the topic proportions, all sentences jointly contribute to determining class associations for every sentence in the document. Moreover, through these class associations, relations between instances (sentences) are revealed: two sentences with strong associations to the same class should be more related to each other than two sentences with strong associations to different classes.

Moreover, the topic proportions also capture correlations among different classes. In particular, classes that tend to co-occur in documents are more likely to be associated with the same topics that appear in those documents. Similarly, classes that are anticorrelated are not expected to be associated with similar topics. For instance, consider a problem with four class labels, , , , and , where is equally likely to co-occur with and but never co-occurs with . Through learning these correlations and anticorrelations, our model can make better predictions. In this example, the presence of topics associated with increases the probability of being present and, to a lesser degree, (if *is* present, is more likely to be present), but at the same time decreases the likelihood of being present.

Another key element in the generative process of our model is the process we choose for determining a sentence’s class labels. The presence of every class in a sentence is determined based on the topic of origin of an anchor word in , which is selected uniformly from all words in that sentence. We emphasize that this does not mean that the rest of the words in do not play a role in generating ; in fact, all words in uniformly contribute to . Note that the anchor word used to determine is a latent variable and is integrated out in the inference process. That is, the estimated posterior for class in sentence with words is the average over probabilities, each corresponding to a different anchor word. An obvious alternative choice is the linear model approach used in ssLDA (e.g., ). However, this approach unnecessarily complicates inference with no significant improvement in performance. Our choice, in contrast, as we will show, results in an elegant and efficient inference algorithm. Similar ideas have been used in Blei and Jordan (2003) for modeling annotated images.
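
The averaging over anchor words described above can be illustrated with a short sketch. It conditions on one fixed assignment of topics to words; in the full inference procedure these assignments are themselves sampled. The names `word_topics` and `mu` and the numeric values are hypothetical.

```python
def sentence_class_probability(word_topics, mu, c):
    """Average class-association probability over all possible anchor words.

    word_topics: topic index assigned to each word of the sentence.
    mu[k][c]: probability that topic k is associated with class c.
    """
    return sum(mu[k][c] for k in word_topics) / len(word_topics)

mu = [[0.9, 0.1], [0.2, 0.7]]  # two topics, two classes (made-up values)
word_topics = [0, 0, 1, 0]     # topics of origin of the four words
p_class_0 = sentence_class_probability(word_topics, mu, c=0)
p_class_1 = sentence_class_probability(word_topics, mu, c=1)
```

Because every word is equally likely to be the anchor, each word contributes the same weight to the sentence-level probability.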

We treat all variables in our model as random latent variables and integrate them out in the inference process, whose goal is to predict class labels and determine label attributions by computing the posteriors of and .

Note that, as stated in section 1, we can modify the generative process of MLTM for modeling data with continuous features. For instance, we can replace the multinomial distribution of word probabilities with a multivariate gaussian distribution. Then in step 3b of the generative process, we first choose a topic, , and then generate the feature vector of instance from the multivariate gaussian distribution of the selected topic: , where and are, respectively, the mean vector and covariance matrix under the selected topic. Finally, we generate the class labels—for every instance : , where .

Exact inference is not feasible, so we must resort to approximate inference methods such as MCMC or variational inference (Griffiths & Steyvers, 2004; Jordan et al., 1999; Blei et al., 2003). Note that due to the MI constraint in our model, , sentence labels for a class are not independent conditioned on the document label for that class. Thus, naively integrating out requires integrating over an -dimensional constrained discrete distribution with possible outcomes for any class with . This, in particular, makes application of variational inference methods computationally intensive. Thus, in this section, we choose MCMC for conducting inference in our method. We later describe a Hamiltonian Monte Carlo method that allows us to efficiently sample from the posterior distributions over sentence labels.

### 3.1 Hamiltonian Monte Carlo for Sampling Instance Labels

The primary challenge in sampling from equation 3.5 is the indicator function , especially when .

If a class label is absent in a labeled document, then it is absent in all sentences in that document: if for some and . In this case, no sampling for is required as they are all zero with probability 1: equation 3.5 reduces to .

The only challenging situation is sampling in labeled documents where . In this case, for every class with , we need to sample from a discrete distribution with possible outcomes, excluding the case of as must equal 1. Note that sampling from such a distribution is more challenging as the probability space grows exponentially with .

A possible strategy for sampling is to independently sample each instance label and then reject those realizations that do not satisfy the MI constraint . This naive sampling strategy, however, is wasteful, since many samples need to be rejected and the Markov chain may not mix properly. Instead, we use a Hamiltonian Monte Carlo (HMC) approach based on the auxiliary-variable HMC method proposed in Pakman and Paninski (2013).
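
To see why the naive strategy is wasteful, consider the following sketch of rejection sampling under the hard-max constraint. It assumes independent Bernoulli proposals with hypothetical per-sentence probabilities `probs`; the function names are illustrative.

```python
import random

def rejection_sample_labels(probs, rng, max_tries=100000):
    """Sample independent Bernoulli labels; reject draws violating max(labels) == 1."""
    for tries in range(1, max_tries + 1):
        labels = [int(rng.random() < p) for p in probs]
        if max(labels) == 1:
            return labels, tries
    raise RuntimeError("no valid sample found")

def acceptance_probability(probs):
    """Probability that an independent draw has at least one positive label."""
    prob_all_zero = 1.0
    for p in probs:
        prob_all_zero *= 1.0 - p
    return 1.0 - prob_all_zero

rng = random.Random(1)
probs = [0.01] * 10  # ten sentences, each weakly associated with the class
labels, tries = rejection_sample_labels(probs, rng)
rate = acceptance_probability(probs)  # fraction of proposals that survive
```

With ten sentences that each have probability 0.01 of carrying the label, fewer than one proposal in ten satisfies the constraint, so most of the sampling effort is discarded.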

Hamiltonian Monte Carlo is an MCMC method typically used for sampling from probability distributions on continuous random variables. HMC has been shown to achieve better convergence properties than the random walk Metropolis algorithm by using Hamiltonian dynamics equations for moving between states (Neal, 2011). Sampling using the HMC method is achieved by solving the equations of motion, and , where is the Hamiltonian, which is the sum of potential and kinetic energies. and are, respectively, the position (variable of interest) and momentum variables. The potential energy is the negative log probability of the variables of interest, while the kinetic energy is typically defined as (Neal, 2011). At each iteration, a new momentum variable is obtained by sampling from a standard gaussian distribution, ; then the motion equations are solved for a time step of duration . The values of the position vector at time are the new proposal samples.
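
For reference, the standard HMC quantities described above can be written out as follows. The notation is an assumption that follows common convention (Neal, 2011): $\mathbf{q}$ for the position, $\mathbf{p}$ for the momentum, $U$ and $K$ for the potential and kinetic energies, and $P$ for the target distribution; these symbols are not taken from the original equations.

```latex
\begin{aligned}
H(\mathbf{q}, \mathbf{p}) &= U(\mathbf{q}) + K(\mathbf{p}), &
U(\mathbf{q}) &= -\log P(\mathbf{q}), &
K(\mathbf{p}) &= \tfrac{1}{2}\,\mathbf{p}^{\top}\mathbf{p}, \\
\frac{d\mathbf{q}}{dt} &= \frac{\partial H}{\partial \mathbf{p}} = \mathbf{p}, &
\frac{d\mathbf{p}}{dt} &= -\frac{\partial H}{\partial \mathbf{q}} = -\nabla U(\mathbf{q}), &
\mathbf{p}(0) &\sim \mathcal{N}(\mathbf{0}, \mathbf{I}).
\end{aligned}
```

With this quadratic kinetic energy, the momentum is simply the velocity of the particle, and the state reached after integrating for the chosen duration serves as the proposal.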

If the value of computed using equation 3.9 is positive, then crosses the boundary and changes its value. Otherwise, if equation 3.9 results in , the particle is reflected from the boundary; we set , and keeps its current value. Note that if the value of after crossing the boundary does not satisfy , equation 3.10 becomes , which causes the particle to be reflected from the boundary. If we start from a sample that satisfies the constraint, this equation guarantees that at every time . Also, note that we can easily compute the boundary hit times for every sentence by solving in equation 3.8 (Pakman & Paninski, 2013).

Pakman and Paninski (2013) recommend setting the step length where is a constant. We experimented with different values of and chose .

Note that our sampling algorithm described above does not depend on the condition inside the indicator function. Depending on the application, we can have any other arbitrary constraint on instance labels instead of the hard-max constraint and use the same sampling algorithm for inference on sentence labels. For example, in some domains, we may want to impose a majority constraint on instance and bag labels, that is, we replace with . In this case, we would need to use the HMC algorithm to sample sentence labels both when and when . Beyond this, in the HMC algorithm, we only need to change the condition in equation 3.10 to or , respectively, for and . For unlabeled documents, we would use the Gibbs sampling algorithm as described above.
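
The two constraints discussed above can be expressed as interchangeable predicates, as in the sketch below. The strict-majority threshold is an assumption made for illustration, and the function names are hypothetical.

```python
def hard_max_label(instance_labels):
    """Standard MI assumption: the bag is positive if any instance is positive."""
    return int(any(instance_labels))

def majority_label(instance_labels):
    """Majority rule: the bag is positive if more than half of its instances are."""
    return int(2 * sum(instance_labels) > len(instance_labels))

def satisfies(instance_labels, bag_label, rule):
    """Check whether sampled instance labels agree with an observed bag label."""
    return rule(instance_labels) == bag_label

sample = [0, 1, 0]
ok_hard_max = satisfies(sample, 1, hard_max_label)   # one positive instance suffices
ok_majority = satisfies(sample, 1, majority_label)   # one of three is not a majority
```

Swapping the rule changes only the feasibility check; the sampler that proposes instance labels can stay the same.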

### 3.2 Determining Hyperparameters

The priors we choose in MLTM have four hyperparameters: , , , and . In this letter, we take a fully Bayesian approach and treat these hyperparameters as random variables with a gamma distribution as their prior: . We sample from the posterior of each hyperparameter given the data and other variables using a Metropolis-Hastings algorithm. We set , although our experiments show that the performance is not sensitive to this choice.
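
One way to realize such an update is a random-walk Metropolis-Hastings step on the log scale, sketched below. This is an illustration under assumptions rather than the procedure used in our experiments: the proposal distribution, the step size, the gamma prior parameters, and the placeholder likelihood are all hypothetical.

```python
import math
import random

def log_gamma_prior(x, shape, rate):
    """Log density of a Gamma(shape, rate) prior, up to an additive constant."""
    return (shape - 1.0) * math.log(x) - rate * x

def mh_update(x, log_likelihood, shape, rate, rng, step=0.8):
    """One Metropolis-Hastings update of a positive hyperparameter x.

    The proposal is a random walk on log(x), so the acceptance ratio
    includes the Jacobian term log(proposal) - log(x).
    """
    proposal = x * math.exp(rng.gauss(0.0, step))
    log_ratio = (log_likelihood(proposal) + log_gamma_prior(proposal, shape, rate)
                 - log_likelihood(x) - log_gamma_prior(x, shape, rate)
                 + math.log(proposal) - math.log(x))
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposal
    return x

# With a flat placeholder likelihood, the chain targets the prior itself,
# whose mean is shape / rate = 2.0.
rng = random.Random(0)
x, samples = 1.0, []
for _ in range(20000):
    x = mh_update(x, lambda value: 0.0, shape=2.0, rate=1.0, rng=rng)
    samples.append(x)
posterior_mean = sum(samples) / len(samples)
```

Proposing on the log scale keeps the hyperparameter positive without any explicit boundary handling.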

### 3.3 Posterior Mean Estimates

We take averages over posterior estimates to compute our final mean posterior estimates of the latent variables, for example, , where is the length of the burn-in period. During the burn-in period, the first iterations of the Markov chain, the samples are discarded and are not used in computing the posterior estimates as the Markov chain has not converged yet. Similarly, we compute the posterior estimate of document and sentence labels, for example, . Note that we update the posterior estimates recursively at every iteration.
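
The recursive update mentioned above avoids storing the samples. A minimal sketch, with a hypothetical class name and made-up sample values:

```python
class RunningPosteriorMean:
    """Recursively averages post-burn-in samples without storing them."""

    def __init__(self, burn_in):
        self.burn_in = burn_in
        self.count = 0
        self.mean = 0.0

    def update(self, iteration, sample):
        """Fold in the sample from a 1-based iteration; burn-in samples are skipped."""
        if iteration <= self.burn_in:
            return
        self.count += 1
        self.mean += (sample - self.mean) / self.count

estimator = RunningPosteriorMean(burn_in=2)
for t, sample in enumerate([10.0, 10.0, 1.0, 2.0, 3.0], start=1):
    estimator.update(t, sample)
```

Here the first two samples fall in the burn-in period and are ignored, so the estimate is the mean of the remaining three.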

### 3.4 MCMC Standard Error and Determining the Markov Chain Length

To determine the length of the Markov chain, we compute the Monte Carlo (MC) standard error of posterior estimates of the latent variables using the nonoverlapping batch means method (Brooks, Gelman, Jones, & Meng, 2011) and stop the simulation when this standard error falls below a threshold value.

At iteration of the Markov chain, we divide the samples obtained from the beginning, , up to , into batches, each of length so that . Let be the MCMC estimate of a random variable in the th batch, that is, . The batch means estimate of the MC standard error of is where and is the MC estimate of .
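
As an illustration, the textbook nonoverlapping batch means estimator can be sketched as follows. The function name is hypothetical, and the formula is the standard one (the sample variance of the batch means divided by the number of batches), assumed here to match the estimator described above.

```python
import math

def batch_means_standard_error(samples, batch_size):
    """Nonoverlapping batch means estimate of the Monte Carlo standard error."""
    n_batches = len(samples) // batch_size
    if n_batches < 2:
        raise ValueError("need at least two full batches")
    used = samples[: n_batches * batch_size]  # drop an incomplete final batch
    batch_means = [
        sum(used[b * batch_size:(b + 1) * batch_size]) / batch_size
        for b in range(n_batches)
    ]
    overall = sum(batch_means) / n_batches
    # Sample variance of the batch means, divided by the number of batches.
    variance = sum((m - overall) ** 2 for m in batch_means) / (n_batches - 1)
    return math.sqrt(variance / n_batches)

standard_error = batch_means_standard_error([0.0, 0.0, 2.0, 2.0, 4.0, 4.0], batch_size=2)
```

With a fixed batch size, only the running sums of the batch means and their squares need to be kept, which is what makes a recursive computation possible.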

To obtain a consistent estimate of the standard error, the batch size is usually set to and is allowed to vary as the total length of the Markov chain increases (Jones, Haran, Caffo, & Neath, 2006). However, this requires storing all samples. Because we want to compute the standard errors recursively, we instead fix the batch size to 100, the square root of , which is the maximum length of the Markov chain in our experiments. We compute standard errors of each latent variable and take their average over all . We also compute the standard error of our estimate of in a similar way. We stop the sampling process when both of these standard errors are below a threshold or when we reach the maximum number of iterations. In summary, our MCMC algorithm is as follows:

Randomly initialize , , , and the counting variables .

For :

For each document :

Update the posterior estimate of and .

Update MC standard errors, and ; stop if they are less than a threshold.

### 3.5 Inference on Test Documents

## 4 Stochastic Variational Inference for MLTM

Our MLTM model described in section 3 uses MCMC methods for inference. MCMC methods are typically serial (including our own in the previous section) and generally do not scale well to large data sets. In this section, we introduce MLTMVB, an approximation to our original MLTM model. MLTMVB uses stochastic variational inference for approximating intractable expectations in the variational lower bound and for computing stochastic gradients using mini-batches of the training data. This allows us to scale our model to larger data sets and perform training in parallel. Our inference method discussed in this section can also be easily extended to work in online settings to process streaming data (Broderick, Boyd, Wibisono, Wilson, & Jordan, 2013).

We also integrate out topic proportions before applying variational inference; that is, we use collapsed variational inference (Teh, Newman, & Welling, 2006) with respect to . Compared to standard variational inference methods, collapsed variational Bayes can better approximate the log-likelihood function and has been shown to be computationally more efficient (Teh et al., 2006). We could similarly integrate out word probabilities (), but this would complicate parallelization of our inference algorithm. As we will show next, by keeping , we can easily perform the E-step of our stochastic variational inference in parallel. (See Teh et al., 2006, and Foulds, Boyles, DuBois, Smyth, & Welling, 2013, for fully collapsed variational inference for LDA.)

Also, and are the token counts as defined in section 3. We define a variational distribution on word probabilities: . For simplicity, we choose not to integrate out ; instead, we perform MAP estimation of . Thus, we do not need to define variational distributions on .

Our goal is to maximize with respect to the variational parameters () and the model parameters () using a stochastic variational approach (Hoffman, Blei, Wang, & Paisley, 2012). We first make a distinction between *local* and *global* variables in . The parameters of the variational distributions on and in each document, and , are local variables; and for every document appear only in in equation 4.3, and given and , they can be updated independently from the local variables in other documents. Thus, the local variables can be trivially updated in parallel. In contrast, and are global variables that appear in the variational lower bound of all documents.

We first turn our attention to updating local variables given fixed global variables. To update and for every document, we need to compute gradients of with respect to these variables, but, unfortunately, we cannot analytically compute the expectations on the fourth and fifth lines of equation 4.4. Our basic strategy here is to obtain stochastic estimates of these gradients using Monte Carlo methods.

Consider, for instance, the gradient of the term on the fourth line of equation 4.4 with respect to : where and . For simplicity, we drop the subscript in and . Different methods have been proposed for approximating this gradient (Paisley, Blei, & Jordan, 2012; Titsias & Lazaro-Gredilla, 2015). We use an approximation method based on the local expectation gradients approach introduced in Titsias and Lazaro-Gredilla (2015). The authors showed that the stochastic gradients computed using this method had lower variance compared to other methods.

In this approach, we take the *exact* integral with respect to and compute a Monte Carlo estimate of the expectation with respect to the remaining variables using *one* sample drawn from . We follow the same approach in approximating the gradient of the last term in equation 4.4 with respect to . Finally, by setting and after some simplifications, we obtain

As Titsias and Lazaro-Gredilla (2015) noted, the local expectation gradients method closely resembles Gibbs sampling. In the Gibbs sampling algorithm for MLTM discussed in section 3, we initialize our algorithm by drawing one sample from the distribution of each latent variable. We then iteratively visit each variable, and at each step, we draw one sample from the posterior distribution of that variable conditioned on the samples drawn from the other latent variables. Here, similarly, we start the algorithm by generating one sample from the variational distribution of each latent variable. Then, when updating each local latent variable, instead of drawing a new sample from the posterior of that variable as in Gibbs sampling, we compute the exact expectation with respect to the variational distribution of that variable conditioned on the samples from all other latent variables. It is worth noting that a major difference between these two algorithms is that the sequence of samples drawn in Gibbs sampling constitutes a Markov chain whose stationary distribution is the *true* posterior of latent variables (Brooks et al., 2011). The samples in local expectation gradients, in contrast, are drawn from the mean field variational distributions, which, in general, even after convergence, are different from the true posterior distributions.
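
The idea behind local expectation gradients can be illustrated on a toy problem. The sketch below assumes a mean-field product of Bernoulli distributions over binary variables and differentiates with respect to each Bernoulli mean; this parameterization, the function `f`, and all names are assumptions made for illustration and differ from the variational family used in MLTMVB.

```python
import random

def local_expectation_gradient(f, probs, rng):
    """One-sample local expectation gradient of E_q[f(x)] for each probs[i].

    q is a mean-field product of Bernoulli(probs[i]) distributions over binary x.
    For variable i, the sum over x_i is exact; the other variables are held at
    a single sample drawn from q.
    """
    x = [int(rng.random() < p) for p in probs]  # one joint sample from q
    grads = []
    for i in range(len(probs)):
        x_on = x[:i] + [1] + x[i + 1:]
        x_off = x[:i] + [0] + x[i + 1:]
        # d/dp [p * f(x_i=1) + (1 - p) * f(x_i=0)] = f(x_i=1) - f(x_i=0)
        grads.append(f(x_on) - f(x_off))
    return grads

def linear(x):
    """Toy objective whose expectation has gradient (2, -3, 1)."""
    return 2 * x[0] - 3 * x[1] + x[2]

rng = random.Random(0)
grads = local_expectation_gradient(linear, [0.3, 0.6, 0.9], rng)
```

For this linear objective the estimate does not depend on the sampled values at all, which shows how summing exactly over one variable removes that variable's contribution to the variance. For objectives that couple the variables, the estimate remains unbiased but varies with the sample of the other variables.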

We alternate between updates of the local and global variables until a convergence criterion is met. To determine convergence, similar to our MCMC algorithm, we compute the Monte Carlo standard error of posterior estimates of the latent topic variables (). We terminate the algorithm if this MC standard error falls below a threshold or when we reach the maximum number of iterations .

Note that unlike the global variables, which are updated at every iteration, the local variables in any document are updated only at iteration if that document is selected in the mini-batch in that iteration. This suggests that we may need to decrease the step size with different rates for local and global variables, as well as for different local variables. We follow a heuristic approach and decrease the learning rate for the local variables in each document based on the number of times that has been visited. That is, for document at step , we compute , where is the number of times that has been selected in the mini-batches from the beginning of the algorithm. At each step , we also compute the learning rate for the global variables by taking the average over for all documents selected in the mini-batch in this iteration.
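
The heuristic above can be sketched as follows. The decay schedule is assumed to take the Robbins-Monro form commonly used in stochastic variational inference (Hoffman et al., 2012); the names `tau0` and `kappa` and their default values are hypothetical.

```python
def local_learning_rate(visits, tau0=1.0, kappa=0.7):
    """Step size that decays with the number of times a document has been visited."""
    return (tau0 + visits) ** (-kappa)

def step_sizes(mini_batch, visit_counts, tau0=1.0, kappa=0.7):
    """Return per-document rates and their average, used for the global variables."""
    local_rates = {}
    for d in mini_batch:
        visit_counts[d] = visit_counts.get(d, 0) + 1
        local_rates[d] = local_learning_rate(visit_counts[d], tau0, kappa)
    global_rate = sum(local_rates.values()) / len(local_rates)
    return local_rates, global_rate

visit_counts = {}
first_rates, first_global = step_sizes([3, 7], visit_counts)
second_rates, second_global = step_sizes([3, 11], visit_counts)
```

In the second step, document 3 has been visited twice and therefore receives a smaller rate than document 11, which is seen for the first time; the global rate is the average of the two.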

In summary, our stochastic variational inference algorithm is as follows:

Set and , and randomly initialize , , , and .

Sample , .

For :

Randomly choose documents from : .

For each document :

Compute .

Update and .

Update MC standard errors and stop if they are less than a threshold.

### 4.1 Inconsistent Labeling in MLTMVB

MLTMVB is not guaranteed to satisfy the standard MI assumption and, similar to PLLDA, could suffer from inconsistent labeling. MLTMVB may incur two types of labeling inconsistencies: (1) when MLTMVB predicts for some and , but (type 1 error, or false positive), and (2) when MLTMVB predicts for some and , but (type 2 error, or false negative). We borrow the terms *type 1* and *type 2* errors from binary classification. We assume that the true label is determined according to the standard MI assumption, with , but MLTMVB makes predictions using .
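
Under one reading of these definitions, the two kinds of inconsistency can be counted by comparing each predicted bag label with the label implied by the predicted sentence labels through the standard MI assumption. The sketch below follows that reading; the data layout and function name are hypothetical.

```python
def labeling_inconsistencies(doc_predictions, sentence_predictions):
    """Count disagreements between predicted bag labels and the MI rule.

    doc_predictions[d][c]: predicted 0/1 label of class c for document d.
    sentence_predictions[d][s][c]: predicted 0/1 label of class c for sentence s.
    """
    type_1 = type_2 = 0
    for doc_labels, sentences in zip(doc_predictions, sentence_predictions):
        for c, predicted in enumerate(doc_labels):
            implied = max(labels[c] for labels in sentences)  # MI-implied bag label
            if predicted == 1 and implied == 0:
                type_1 += 1  # bag predicted positive, but no sentence is
            elif predicted == 0 and implied == 1:
                type_2 += 1  # some sentence predicted positive, but the bag is not
    return type_1, type_2

doc_predictions = [[1, 0], [0, 1]]
sentence_predictions = [
    [[0, 0], [0, 1]],  # document 0: class 1 appears in its second sentence only
    [[0, 1], [0, 0]],  # document 1: class 1 appears in its first sentence only
]
type_1, type_2 = labeling_inconsistencies(doc_predictions, sentence_predictions)
```

In this toy example, document 0 contributes one inconsistency of each type, while document 1 is labeled consistently.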