Abstract

In structured output learning, obtaining labeled data for real-world applications is usually costly, while unlabeled examples are available in abundance. Semisupervised structured classification deals with a small number of labeled examples and a large number of unlabeled structured data. In this work, we consider semisupervised structural support vector machines with domain constraints. The optimization problem, which in general is not convex, contains the loss terms associated with the labeled and unlabeled examples, along with the domain constraints. We propose a simple optimization approach that alternates between solving a supervised learning problem and a constraint matching problem. Solving the constraint matching problem is difficult for structured prediction, and we propose an efficient and effective label switching method to solve it. The alternating optimization is carried out within a deterministic annealing framework, which helps in effective constraint matching and in avoiding poor local minima. The algorithm is simple and easy to implement. Further, it is suitable for any structured output learning problem where exact inference is available. Experiments on benchmark sequence labeling data sets and a natural language parsing data set show that the proposed approach, though simple, achieves comparable generalization performance.

1  Introduction

Structured classification involves learning a classifier to predict objects like trees, graphs, and image segments. Such objects are usually composed of several components with complex interactions and are hence called structured. Typical structured classification techniques learn from a set of labeled training examples $\{(x_i, y_i)\}_{i=1}^{n}$, where the instances $x_i$ are from an input space $\mathcal{X}$ and the corresponding labels $y_i$ belong to a structured output space $\mathcal{Y}$. Support vector machines (SVMs) have shown promise for structured output learning. Some efficient SVM-based algorithms for structured classification include SVM$^{struct}$ (Joachims, Finley, & Yu, 2009) and a sequential dual method (SDM) (Balamurugan, Shevade, Sundararajan, & Keerthi, 2011). In many practical applications, however, obtaining the label of every training example is a tedious task, and we are often left with only a very few labeled training examples.

When the training set contains only a few training examples with labels and a large number of unlabeled examples, a common approach is to use semisupervised learning methods (Chapelle, Schölkopf, & Zien, 2010). In this work, we consider the semisupervised structured classification problem, where the aim is to design a classifier using the labeled training examples $\{(x_i, y_i)\}_{i=1}^{n_l}$ and unlabeled training examples $\{x_j\}_{j=1}^{n_u}$. Finding a solution to this problem is hard due to each label $y_j$ having combinatorially many possibilities, unlike in the case of the semisupervised binary classification problem. Early semisupervised approaches like those proposed by Kate and Mooney (2007) converted the structured outputs into labels for multiple two-class SVMs and used semisupervised approaches available for SVMs. By constraining the relative frequencies of the output symbols, Zien, Brefeld, and Scheffer (2007) proposed a semisupervised learning method for structured outputs. Significant improvement in performance over purely supervised learning was not observed, possibly because of the lack of domain-dependent prior knowledge. A graph-regularized kernel on parts of the structured output was used in Altun, McAllester, and Belkin (2006). Chang, Ratinov, and Roth (2007) proposed a constraint-driven learning algorithm (CODL) that incorporates domain knowledge in the constraints and uses a perceptron-style learning algorithm. This approach resulted in high-performance learning using significantly fewer training data. Bellare, Druck, and McCallum (2009) proposed an alternating projection method to optimize an objective function that used auxiliary expectation constraints. Ganchev, Graça, Gillenwater, and Taskar (2010) proposed a posterior regularization (PR) method to optimize a similar objective function, and a related higher-order regularization was studied in Li and Zemel (2014). Yu (2012) considered transductive structural SVMs and used a convex-concave procedure to solve the resultant nonconvex problem.

A new probabilistic approach for semisupervised structured prediction was proposed by Dhillon, Keerthi, Bellare, Chapelle, and Sellamanickam (2012). It uses domain constraints and deals with the combinatorial nature of the label space by using relaxed labeling on unlabeled data. This approach, DASO, uses a deterministic annealing procedure to achieve a better local optimum of a nonconvex objective function. DASO was found to perform better than approaches like CODL and PR. Note that the problem formulation in Dhillon et al. (2012) uses a probabilistic model, and the algorithm presented is more appropriate for the corresponding loss function due to its natural decomposition. As Dhillon et al. (2012) reported, DASO cannot be directly used for other losses such as large-margin losses, as the computations become intractable. Extending DASO to general structured output classification beyond sequence labeling problems is also difficult, for the reasons discussed in section 3. The aim of this letter is to propose a simple algorithm for large-margin semisupervised structured classification. As will become clear, dealing with the combinatorial label space is not straightforward for large-margin methods, and the relaxation idea that Dhillon et al. (2012) proposed cannot be easily extended to handle the large-margin formulation.

In this letter, we propose a simple and efficient algorithm to solve the semisupervised structured output learning problem in the large-margin setting. The concerned nonconvex optimization problem is solved by alternating between a supervised learning algorithm and a constraint matching algorithm that also minimizes the loss on unlabeled examples. Solving the constraint matching problem is difficult, and we propose an efficient and effective label switching algorithm to solve it. Deterministic annealing is used in conjunction with alternating optimization to avoid poor local minima. The proposed approach is general and is suitable for any structured output learning problem where exact inference is available. Numerical experiments on real-world sequence labeling and natural language parsing data sets demonstrate that the proposed algorithm is competitive.

2  Related Work

For a set of labeled training examples $\{(x_i, y_i)\}_{i=1}^{n_l}$, $y_i \in \mathcal{Y}$, and a set of unlabeled examples $\{x_j\}_{j=1}^{n_u}$, we consider the following semisupervised learning problem:
$\min_{w,\, \bar{Y}_u}\; \mathcal{L}_l(w) + \mathcal{L}_u(w, \bar{Y}_u) \quad \text{subject to the predictions } \bar{Y}_u \text{ satisfying the constraints } \mathcal{C}$
2.1
where $\mathcal{L}_l$ and $\mathcal{L}_u$ denote the loss functions corresponding to the labeled and unlabeled sets of examples, respectively. In addition to minimizing the loss functions, we want to ensure that the predictions over the unlabeled data satisfy a certain set of constraints $\mathcal{C}$, specified using domain knowledge. Unlike the binary or multiclass classification problem, the solution of equation 2.1 is hard due to each $\bar y_j$ having combinatorially many possibilities. Further, the constraints play a key role in semisupervised structured output learning, as demonstrated in Chang et al. (2007) and Dhillon et al. (2012). Chang et al. (2007) also provide a list of constraints (Table 1 in their paper) useful in sequence labeling. Each of these constraints can be expressed using a function $g$, with the constraint written as $g(x, \bar y) \le c$, which must hold exactly for a hard constraint or may be violated at a penalty for soft constraints. For example, the constraint that “a citation can only start with author” is a hard constraint, and the violation of this constraint can be denoted as $g(x, \bar y) = \mathbb{1}[\bar y^1 \ne \text{AUTHOR}]$. The constraint that “each output has at least one author” can be expressed as $\sum_m \mathbb{1}[\bar y^m = \text{AUTHOR}] \ge 1$. Violation of constraints can be penalized by using an appropriate constraint loss function in the objective function. The domain constraints can be further divided into two broad categories: instance-level constraints, which are imposed over individual training examples, and corpus-level constraints, imposed over the entire corpus.
Table 1:
Data Set Characteristics.

Data Set     n_labeled    n_dev   n_unlabeled   n_test
Citations    5; 20; 300   100     1000          100
Apartments   5; 20; 100   100     1000          100
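To make the constraint machinery above concrete, here is a minimal Python sketch of the two citation constraints just discussed; the label names and the penalty form (a scaling factor times the violation count) are illustrative assumptions, not the authors' implementation:

```python
# Sketch of instance-level constraint functions for citation sequences.
# Label names and the linear penalty form are illustrative assumptions.

def starts_with_author(labels):
    """Hard constraint: a citation must start with AUTHOR.
    Returns the number of violations (0 or 1)."""
    return 0 if labels and labels[0] == "AUTHOR" else 1

def has_author(labels):
    """Soft constraint: each output has at least one AUTHOR token.
    Returns 1 if violated, 0 otherwise."""
    return 0 if "AUTHOR" in labels else 1

def constraint_penalty(labels, r=10.0):
    """Total penalty: r times the number of violated constraints."""
    return r * (starts_with_author(labels) + has_author(labels))
```

A sequence such as `["AUTHOR", "TITLE", "JOURNAL"]` incurs zero penalty, while one that neither starts with nor contains AUTHOR is penalized for both constraints.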

A related work to our approach is the transductive SVM (TSVM) for multiclass and hierarchical classification by Keerthi, Sundararajan, and Shevade (2012), where the idea of TSVMs in Joachims (1999) was extended to multiclass problems. The main challenge for multiclass problems was in designing an efficient procedure to handle the combinatorial optimization involving the labels for unlabeled examples. Note that for multiclass problems, $\mathcal{Y} = \{1, 2, \ldots, k\}$ for some $k$. Keerthi et al. (2012) showed that the combinatorial optimization for multiclass label switching results in an integer program and proposed a transportation simplex method to solve it approximately. However, the transportation simplex method turned out to be inefficient, and an efficient label switching procedure was given in Keerthi et al. (2012). A deterministic annealing method and domain constraints in the form of class ratios were also used in the training. We note, however, that a straightforward extension of TSVM to structured output learning is hindered by the complexity of solving the associated label switching problem. In particular, an integer program for equation 2.1 will include constraints spanning multiple parts of the output $\bar y_j$; therefore, a straightforward application of the label switching procedure illustrated in Keerthi et al. (2012) cannot be done for structured outputs. Hence, extending the label switching procedure to structured outputs is much more challenging.

Semisupervised structural SVMs considered in Zien et al. (2007) avoid the combinatorial optimization of the structured output labels and instead consider a working set of labels. We also note that the combinatorial optimization of the label space is avoided in the recent work on transductive structural SVMs by Yu (2012); instead, a working set of cutting planes is maintained. The related work DASO (Dhillon et al., 2012) also does not consider the combinatorial problem of the label space directly; rather, it solves the following problem:
$\min_{w,\, \{q_j\}}\; \mathcal{L}_l(w) + \sum_{x_j \in X_u} \mathbb{E}_{y \sim q_j}\big[\ell(x_j, y; w)\big]$
2.2
where $X_u$ denotes the set of data instances $x_j$, $q_j$ denotes a distribution over the label space $\mathcal{Y}$, and $\ell$ is the log-linear loss. Note that the relaxation from labels $\bar y_j$ to distributions $q_j$ avoids dealing with the combinatorial issues (associated with $\mathcal{Y}$) and simplifies computations. To the best of our knowledge, no prior work exists that directly tackles the combinatorial optimization problem involving structured outputs.

3  Semisupervised Learning of Structural SVMs

We consider the sequence labeling problem as an illustrative running example throughout this letter. The sequence labeling problem is a well-known structured classification problem, where an input $x = (x^1, \ldots, x^M)$ is labeled using a corresponding sequence of tokens $y = (y^1, \ldots, y^M)$, $y^m \in \Sigma$, a fixed set of alphabets. An input $x$ is associated with an output $y$ of the same length, and this association is captured using the feature vector $\phi(x, y)$. Consider a structured input-output space pair $(\mathcal{X}, \mathcal{Y})$. In a supervised setting, a structural SVM learns a parameterized classification rule of the form $h_w(x) = \arg\max_{y \in \mathcal{Y}} w^\top \phi(x, y)$ by solving the following convex optimization problem:
$\min_{w}\; \frac{1}{2}\|w\|^2 + C_l \sum_{i=1}^{n_l} L(x_i, y_i; w)$
3.1
where $C_l > 0$ is a regularization constant and $L$ is a suitable loss term; a standard choice is the margin-rescaled hinge loss $L(x_i, y_i; w) = \max_{y \in \mathcal{Y}} \big[\Delta(y_i, y) - w^\top \big(\phi(x_i, y_i) - \phi(x_i, y)\big)\big]$. For sequence labeling applications, $\Delta$ can be chosen to be the Hamming loss function.
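For intuition, the margin-rescaled hinge with a Hamming $\Delta$ can be evaluated by brute-force enumeration over a tiny label space. The sketch below is for illustration only: the score function stands in for $w^\top \phi(x, y)$, and in practice loss-augmented inference replaces the enumeration:

```python
from itertools import product

def hamming(y1, y2):
    """Hamming distance between two equal-length label sequences."""
    return sum(a != b for a, b in zip(y1, y2))

def hinge_loss(score, y_true, alphabet):
    """Margin-rescaled hinge: max_y [Delta(y_true, y) + score(y) - score(y_true)].
    score(y) plays the role of w^T phi(x, y); brute force over all sequences."""
    s_true = score(y_true)
    return max(hamming(y_true, y) + score(y) - s_true
               for y in product(alphabet, repeat=len(y_true)))
```

For example, with a score function that rewards label 1 at every position, a true label of all zeros incurs a large loss, while a true label of all ones incurs zero loss.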
With the availability of a set of unlabeled examples $X_u = \{x_j\}_{j=1}^{n_u}$, the semisupervised learning problem for structural SVMs with domain constraints aims to find a suitable parameter $w$ along with a set of candidate outputs $\bar{Y}_u = \{\bar y_j\}_{j=1}^{n_u}$ for the unlabeled examples. The associated optimization problem is given by
$\min_{w,\, \bar{Y}_u}\; \frac{1}{2}\|w\|^2 + C_l \sum_{i=1}^{n_l} L(x_i, y_i; w) + C_u \sum_{j=1}^{n_u} L(x_j, \bar y_j; w) + \rho(X_u, \bar{Y}_u)$
3.2
Note that problem 3.2 is an extension of the semisupervised learning problems associated with binary (Joachims, 1999) and multiclass (Keerthi et al., 2012) outputs. The nonconvex semisupervised learning problem, equation 3.2, is solved by an alternating procedure in which the following problems, equation 3.3 (obtained by fixing $\bar{Y}_u$) and equation 3.4 (for a fixed $w$), are solved alternately until some stopping condition is satisfied:
$\min_{w}\; \frac{1}{2}\|w\|^2 + C_l \sum_{i=1}^{n_l} L(x_i, y_i; w) + C_u \sum_{j=1}^{n_u} L(x_j, \bar y_j; w)$
3.3
and
$\min_{\bar{Y}_u}\; C_u \sum_{j=1}^{n_u} L(x_j, \bar y_j; w) + \rho(X_u, \bar{Y}_u)$
3.4
Note that problem 3.3 is a supervised learning problem, which can be solved using any efficient algorithm like SDM (Balamurugan et al., 2011). However, the solution of problem 3.3 requires the knowledge of $\bar{Y}_u$, which is obtained by solving problem 3.4.

The alternating optimization procedure described above is carried out within a deterministic annealing framework. The regularization parameter $C_u$ associated with the unlabeled examples plays an important role in obtaining a good solution for the nonconvex optimization problem 3.2. Setting inappropriate values can lead to poor local minima, resulting in inferior generalization performance. To address this issue, we use a deterministic annealing approach (Chang et al., 2013), where $C_u$ is treated as an annealing parameter and is slowly increased until the unlabeled loss term contributes similarly to the labeled loss term. We found this approach to be quite effective in our experiments.
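The annealing loop and the alternation between equations 3.3 and 3.4 can be sketched as follows; `train_supervised` and `match_constraints` are placeholders for the SDM solver and the label switching step, and the geometric schedule for $C_u$ is an illustrative choice, not the paper's exact schedule:

```python
def semisupervised_train(labeled, unlabeled, train_supervised, match_constraints,
                         cu_init=0.01, cu_max=1.0, growth=2.0):
    """Alternating optimization with C_u treated as an annealing parameter.
    train_supervised(labeled, unlabeled, pseudo_labels, cu) -> model w
    match_constraints(w, unlabeled, cu) -> pseudo labels for unlabeled data
    """
    w = train_supervised(labeled, unlabeled, None, 0.0)    # labeled data only
    y_u = match_constraints(w, unlabeled, cu_init)         # initial predictions
    cu = cu_init
    while cu <= cu_max:                                    # annealing loop
        w = train_supervised(labeled, unlabeled, y_u, cu)  # solve eq. 3.3
        y_u = match_constraints(w, unlabeled, cu)          # solve eq. 3.4
        cu *= growth                                       # slowly increase C_u
    return w, y_u
```

The two callbacks can be swapped for any supervised solver and any constraint matching routine; the skeleton only fixes the order of the alternation and the annealing of $C_u$.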

The overall optimization flow is given in algorithm 1. In step 2, we learn the model parameters $w$ using only the labeled data. In step 4, we predict the labels $\bar y_j$ for the unlabeled examples. One possible way to initialize $\bar{Y}_u$ is to assign arbitrary random outputs. However, with the availability of a suitable parameter $w$ from step 2, the initial candidate set $\bar{Y}_u$ can be found for the unlabeled examples $X_u$ in a principled manner. Given the parameter $w$, we find $\bar y_j$ using
$\bar y_j = \arg\max_{y \in \mathcal{Y}} w^\top \phi(x_j, y), \quad j = 1, \ldots, n_u.$
3.5
The $\arg\max$ on the right-hand side of equation 3.5 can be found using the Viterbi algorithm (Forney, 1973) for the sequence labeling task and the CYK algorithm (Hopcroft & Ullman, 2000) for the natural language parsing problem. We use these predictions as the initial labels for the unlabeled examples. The alternating optimization using the deterministic annealing approach (steps 3 to 16) is done by progressively increasing the regularization parameter $C_u$. A key step in the alternating optimization algorithm is finding $\bar{Y}_u$ under the soft/hard constraints using equation 3.4. Note that equation 3.4 is a combinatorial optimization problem for which, to the best of our knowledge, no efficient polynomial time algorithm exists. Therefore, we propose an efficient greedy label switching algorithm for this purpose.
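For sequence labeling, the $\arg\max$ in equation 3.5 is computed by the standard Viterbi recursion over per-position and transition scores. A minimal self-contained sketch, with plain-list score tables for illustration only:

```python
def viterbi(unary, pairwise):
    """unary[m][a]: score of label a at position m;
    pairwise[a][b]: score of transition a -> b.
    Returns the highest-scoring label sequence (list of label indices)."""
    M, A = len(unary), len(unary[0])
    score = [unary[0][:]]                  # best score ending in each label
    back = []                              # backpointers per position
    for m in range(1, M):
        row, ptr = [], []
        for b in range(A):
            best_a = max(range(A), key=lambda a: score[-1][a] + pairwise[a][b])
            row.append(score[-1][best_a] + pairwise[best_a][b] + unary[m][b])
            ptr.append(best_a)
        score.append(row)
        back.append(ptr)
    y = [max(range(A), key=lambda a: score[-1][a])]
    for ptr in reversed(back):             # follow backpointers
        y.append(ptr[y[-1]])
    return y[::-1]
```

With zero transition scores the recursion reduces to a per-position argmax, which makes the sketch easy to check by hand.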
[Algorithm 1: semisupervised training of structural SVMs via alternating optimization with deterministic annealing]
A general label switching framework is provided in algorithm 2. The switching process starts with a suitable initial $\bar{Y}_u$ (step 4 of algorithm 1). The availability of the candidate set $\bar{Y}_u$ for all unlabeled examples makes it possible to compute the objective function in equation 3.4, which includes the constraint loss term $\rho$. Now, using $\bar{Y}_u$ obtained via equation 3.5, the objective value of equation 3.4 can be computed as
$F(\bar{Y}_u) = C_u \sum_{j=1}^{n_u} L(x_j, \bar y_j; w) + \rho(X_u, \bar{Y}_u).$
3.6
The label switching algorithm aims at finding a new set of labels for the unlabeled examples, starting from $\bar{Y}_u$, so that the reduction in the objective value of equation 3.6 is significant.

The label switching is performed over multiple passes through the unlabeled examples (see steps 6 to 15 of algorithm 1). In each pass, we do a random permutation of the examples and visit each example in the order given by the permutation. Similarly, for each unlabeled example, we do a random permutation over the parts of the structured output. Note that the label for a structured output is a collection of label parts assigned to corresponding parts of the output. For example, the label in a sequence labeling task is a sequence of label parts, and the label for natural language parsing is a tree made of grammar rule label parts. Each label part is assumed to take its value from a suitable space $\Sigma$. After we select a particular part q of the output randomly, we evaluate the cost function over all candidates in $\Sigma$ and choose the one that minimizes the objective function. This evaluation is done by assigning each candidate (which constitutes the label switching step) to the selected part q of the output. This includes checking for any hard constraint violations (see below for more details). If the current assignment does not violate hard constraints (if any), then the objective function value cannot increase. Hence, during every label switch, we attempt to satisfy hard constraints and try to minimize the objective function value as well. It is quite possible that the objective function value increases in order to satisfy hard constraints. Unless the hard constraints have an empty solution space, the algorithm continues to make progress while considering constraints. The algorithm terminates when no label switch that improves the objective function (without violating hard constraints, if any) is found in a complete pass. This implies that the algorithm has reached a local optimum.
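One greedy pass of the switching scheme above can be sketched as follows, with `objective` standing in for the full expression in equation 3.4 and `violates_hard` for the hard-constraint check (both placeholders, not the authors' code):

```python
import random

def label_switch_pass(labels, alphabet, objective, violates_hard, seed=0):
    """One pass of greedy label switching over a single structured output.
    labels: current list of label parts (modified in place);
    objective(labels) -> value to minimize;
    violates_hard(labels) -> True if a hard constraint is violated.
    Returns True if any switch improved the objective."""
    rng = random.Random(seed)
    improved = False
    positions = list(range(len(labels)))
    rng.shuffle(positions)                      # visit parts in random order
    for q in positions:
        best, best_val = labels[q], objective(labels)
        for cand in alphabet:                   # try every candidate label part
            old = labels[q]
            labels[q] = cand
            if not violates_hard(labels):
                val = objective(labels)
                if val < best_val:
                    best, best_val = cand, val
            labels[q] = old
        if best != labels[q]:
            labels[q] = best                    # accept the best switch
            improved = True
    return improved
```

Repeating such passes until `label_switch_pass` returns `False` corresponds to the termination condition above: no improving switch found in a complete pass.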

Note that the label switching procedure in algorithm 2 is generic and is applicable for general structured output problems. However, to give more clarity, we present and discuss it in detail in the context of sequence labeling and natural language parsing tasks.

3.1  Sequence Labeling Task

The following label switching procedure is carried out over the set of unlabeled examples $X_u$. For each unlabeled example $x_j$ associated with its initial label $\bar y_j$, the label switching algorithm examines whether switching the label of a particular mth token of $\bar y_j$ results in a decrease in the objective function value (obtained from equation 3.4), as shown in steps 6 to 13 of algorithm 2. Note that during the switching, the new label of the token is chosen randomly from the alphabet $\Sigma$. Among all possible label choices from $\Sigma$, the one that gives the maximum decrease in the objective value is accepted. This is repeated over all the output tokens associated with the input $x_j$. The label switching procedure for sequence labeling can also be performed by switching the labels for a pair of adjacent tokens or groups of adjacent tokens. In this letter, we focus on the simplest case where a single token is subjected to label switching, which was found to achieve good performance in our experiments.

[Algorithm 2: greedy label switching for constraint matching]
The constraint matching is handled for each switch as follows. Whenever a new label is considered, instance-level constraint violation can be checked with the new label in a straightforward way. Any corpus-level constraint violation is usually decomposable over the instances and hence can be handled in a simple way. For example, the corpus-level constraints are of the form (Dhillon et al., 2012)
$\sum_{j=1}^{n_u} g(x_j, \bar y_j) \le c.$
3.7
Thus, the constraint violation term can be easily computed for each switch. Such a simple label switching procedure was found to be very effective in our experiments. Going through a randomly permuted set of output tokens was observed to give considerable speed-up.
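A corpus-level constraint of this decomposable form can be evaluated, and incrementally updated per switch, along the following lines (a hypothetical AUTHOR-fraction constraint for illustration; not the authors' code):

```python
def corpus_violation(all_labels, target_frac, label="AUTHOR"):
    """Corpus-level constraint: the fraction of tokens carrying a given
    label across the corpus should match target_frac.
    Returns the absolute deviation, which a penalty term can scale."""
    total = sum(len(seq) for seq in all_labels)
    count = sum(tok == label for seq in all_labels for tok in seq)
    return abs(count / total - target_frac)

def delta_after_switch(count, total, old, new, label, target_frac):
    """Incremental update: recompute the deviation after switching one
    token from old to new, without rescanning the corpus."""
    count += (new == label) - (old == label)
    return abs(count / total - target_frac)
```

Because a single switch changes the corpus-wide count by at most one, maintaining the running count makes each per-switch constraint check constant time.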

We now compare the computational complexity of the proposed approach with that of DASO (Dhillon et al., 2012). Note that DASO involves solving two problems alternately, one of which is called the $w$-problem and the other the $q$-problem (respectively, equations 14 and 23 in Dhillon et al., 2012). The $w$-problem is a conditional random field (CRF)–like function, and the $q$-problem is also solved using CRF-like computations. When compared to our alternating optimization, we note that structural SVM methods to solve equation 3.3 are much faster than the L-BFGS methods used to solve the $w$-problem. This can be observed from the empirical evaluation done in Balamurugan, Shevade, Sundararajan, and Keerthi (2013). The $q$-problem is also solved using the L-BFGS method. Every gradient computation step to solve the $q$-problem involves computing a partition function using a forward-backward approach, which requires $O(M|\Sigma|^2)$ cost for each example of length M. For the proposed label switching algorithm, each label switch involves a Viterbi call per unlabeled example to compute the objective function value in equation 3.6. Since each Viterbi call also involves an $O(M|\Sigma|^2)$ computation cost for a sequence of length M, the worst-case complexity of our method is similar to that of DASO. However, in practice, we can efficiently compute equation 3.6 by making use of scores computed before a label switch. In particular, if the labels before and after a switch contain the same token at the mth position, then a label switch at that position will not result in a reduced value in equation 3.6, thereby avoiding the need to make a Viterbi call. If they contain different tokens at the mth position, then the Viterbi procedure can be performed only for the subsequences ending at the mth position. Such efficient forward-backward computations cannot be used for solving the $q$-problem in Dhillon et al. (2012). For these reasons, the proposed algorithm 1, though simple, is more efficient than DASO and gives comparable generalization performance.
However, as we describe below, extending DASO to problems involving general structured outputs like parse trees is difficult.

3.2  Natural Language Parsing Task

Though the previous discussion focuses on the sequence labeling problem in particular, there are other well-known structured prediction problems like natural language parsing (Joachims et al., 2009) and image segmentation (Taskar, 2004). In this section, we focus on the parsing task. Existing works on semisupervised parsing (see McClosky, Charniak, & Johnson, 2006) propose techniques for the domain adaptation setting and do not use the structural SVM framework for learning. In this work, we use structural SVMs for semisupervised natural language parsing. For natural language parsing problems (see Figure 1), the input sentence $x$ is associated with a parse-tree structured output $y$. The feature vector $\phi(x, y)$ contains the count of each grammar rule present in the parse-tree outputs available in the training data. The grammar rules present in the output form the equivalent of individual tokens in a sequence labeling problem.

Figure 1:

A natural language parsing example, where the output $y$ is a parse tree associated with an input sentence $x$. The feature vector $\phi(x, y)$ is constructed using the count of each grammar rule present in output $y$. Note that the leaf-level grammar rules are independent of the actual words in the sentence. This is indicated by a placeholder symbol denoting an arbitrary word.


We implemented the label switching for parse trees using the following procedure. Similar to the sequence labeling task, the initial candidate set $\bar{Y}_u$ for the unlabeled examples $X_u$ in the parsing task can be found by equation 3.5 using the CYK algorithm (Hopcroft & Ullman, 2000; Joachims et al., 2009). Now the loss term can be computed for each unlabeled example with the available $w$. The objective function value is then calculated by adding up the constraint violation terms as in equation 3.6. The label switching procedure used for parse trees for the unlabeled examples is illustrated below.

3.2.1  Label Switching Algorithm

Note that the CYK algorithm is a dynamic programming method that builds the parse tree in a bottom-up fashion. It takes as input the example $x_j$, the grammar rule dictionary $\mathcal{G}$, and the weights associated with the grammar rules in $\mathcal{G}$. Using these weights, CYK returns a suitable parse tree, provided one can be constructed for the given input; otherwise, it returns an empty tree. When CYK returns an empty tree, the objective function value for that candidate can be safely assumed to be $\infty$. During the construction of each level of the parse tree, the CYK algorithm checks whether a grammar rule can be used stand-alone or in combination with another grammar rule, based on the corresponding weights on the grammar rules and the weight accumulated from the previous level using the dynamic programming approach. For the label switching algorithm, we focus only on the leaf-level grammar rules. Hence switching the label at the leaf level amounts to adjusting the weights associated with the grammar rules so as to bias the construction of the parse tree toward certain rules. This is done by splitting the grammar rules into two partitions and assigning a high scaling factor (set to 100 in our experiments) to the weights of one partition and a low scaling factor to the other. This partition is based on the domain constraints on the grammar rules. In the simplest possible scenario, this can be done by assigning a high scaling factor to a single grammar rule and a zero scaling factor to all the other grammar rules.

This scale factor assignment handles the label assignment implicitly in the inference process of the CYK algorithm. During the switching procedure, it is possible that CYK returns only an empty tree for certain switches. Such label switches are ignored. It may also turn out that the initial candidate set contains empty trees for certain unlabeled examples for certain $C_u$ values. We do not perform the label switching procedure for such examples for that particular $C_u$ value. The label switching for parse trees is slightly different from that employed for the sequence labeling case. In sequence labeling, an explicit label is assigned to the unlabeled example during the label switching process, and the Viterbi algorithm is then used to find the function value in equation 3.6. For parse trees, this is handled implicitly in the CYK algorithm, where the partitioning of rules and the scaling factor assignment are guided by the domain constraints.
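The rule-weight scaling step described above might be sketched as follows; the high factor of 100 follows the text, and treating the low factor as zero corresponds to the simplest scenario mentioned (the rule strings are hypothetical examples):

```python
def scale_rule_weights(weights, favored_rules, high=100.0, low=0.0):
    """Bias parse construction by partitioning grammar rules and scaling
    their weights: high for the favored partition, low for the rest.
    weights: dict mapping rule -> weight. Returns a new scaled dict."""
    return {rule: w * (high if rule in favored_rules else low)
            for rule, w in weights.items()}
```

A CYK call with the scaled weights then prefers trees built from the favored partition, which is how a "switch" is realized without assigning leaf labels explicitly.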

Hence, for an unlabeled example $x_j$ with a nonempty parse tree and the given objective function value, the label switching algorithm for parse trees is carried out by the weight-altering procedure described above. Among a certain number of label switching steps (kept to 1000 in our experiments) for each example, the parse tree that gives the maximum decrease in the objective function value is accepted. This tree is considered to be $\bar y_j$ for example $x_j$ and is used in the supervised training step that follows the label switching. During the label switch, we also considered domain constraints in the form of grammar rule proportions in the training data. The details are given in section 4. Similar to the sequence labeling application, where label switching can be extended to pairwise tokens, the label switching for natural language parsing can be performed for grammar rules at higher levels of the parse tree, other than the leaf-level rules. In this letter, we focus on switching at the leaf level and found that such a procedure is useful in achieving improvement over fully supervised training.

3.2.2  Discussion

It is important to note that extending probabilistic approaches like DASO (Dhillon et al., 2012) to handle general structured outputs like parse trees is complicated, primarily due to the loss term associated with such outputs. When general loss functions are considered, solving equation 2.2 becomes difficult because of the computation cost involved in finding the partition function (Dhillon et al., 2012). Hence, unlike our method, which can be easily used for general structured output learning problems, DASO is suitable mainly for sequence labeling problems.

4  Experiments and Results

We performed experiments on two structured classification tasks: sequence labeling and natural language parsing.

Before we move on to the details, we discuss certain speed-up heuristics and termination criteria employed in the label switching procedure. The label switching procedure turns out to be expensive for long sequences and for parse trees with a large number of leaf-level rules. After a few initial passes, we found that the number of constraint violations affecting the objective function value decreased significantly; also, the objective function improvement over one complete pass was not significant. So it is useful to terminate the algorithm with a limit on the number of passes or on the maximum number of switches. In our experiments, we followed the latter approach, where we used a bound on the number of label switches. The bound on the number of label switches is also used for annealing $C_u$. For example, in the sequence labeling experiments, the bound was set as a function of M and $|\Sigma|$, where M is the maximum length of the input sentence and $|\Sigma|$ is the size of the alphabet set. Hence the algorithm terminates when $C_u$ has reached its final value and no label switching happens (or the maximum number of label switches is reached).

We now describe the experiments.

4.1  Sequence Labeling Task

Two benchmark sequence labeling data sets were considered for our experiments: citations and apartment advertisements. These data sets were originally introduced in Grenager, Klein, and Manning (2005) and contain manually labeled training and test examples. The data sets also contain a set of unlabeled examples that is proportionally large when compared to the training set. The evaluation is done in terms of labeling accuracy on the test data. The data set characteristics are given in Table 1.

The Apartments data set used in our experiments contains 300 sequences. These sequences are apartment advertisements from craigslist.org and are labeled using a set of 12 labels like features, rent, contact, photos, size, and restriction. The average sequence length is 119. The Citations data set contains 500 sequences, which are citations of computer science papers. The labeling is done from a set of 13 labels like author, title, publisher, pages, and journal. The average sequence length for the citations data set is 35.

The partitions of citations data were taken to be the same as considered in Dhillon et al. (2012). For the Apartments data set, we considered data sets of 5, 20, and 100 labeled examples. We generated five random partitions for each case and provide the averaged results over these partitions. We used the sequential dual method (SDM) (code available at http://drona.csa.iisc.ernet.in/~shirish/structsvm_sdm.html) for supervised learning (steps 2 and 12 in algorithm 1) and compared the following methods for our experiments:

  • Semisupervised structural SVMs proposed in this letter (referred to as SSVM-SDM)

  • Constraint-driven learning (M. W. Chang et al., 2007) (referred to as CODL)

  • Deterministic annealing for semisupervised structured classification (Dhillon et al., 2012) (referred to as DASO)

  • Posterior-regularization (Ganchev et al., 2010) (referred to as PR)

For methods such as DASO (Dhillon et al., 2012) and PR (Ganchev et al., 2010), the code is not available, and implementing these methods as used in the respective papers is quite difficult. Therefore, we directly use the results from Dhillon et al. (2012) for comparison. There is a subtle point to be noted here: the data partitions are different for our method and for DASO, PR, and CODL. To handle this, we resort to generating several partitions and reporting the average result. This should help in reducing any incorrect conclusions from the observed differences. More important, our goal is to show the comparable generalization performance achievable with a simple label switching algorithm for a large-margin semisupervised method for structured output classification problems such as sequence labeling and natural language parsing. We do not compare our results with those in Yu (2012), since the features considered there are different and test examples were used for training in Yu (2012), which is not the case for SSVM-SDM.

4.1.1  Description of the Constraints

We describe the instance-level and corpus-level constraints considered for the citations data. A similar description holds for the Apartments data set. We used the same set of constraints given in Chang et al. (2007) and Dhillon et al. (2012). The constraints considered are of the form $g(x, \bar y) \le c$, which are further subdivided into instance-level constraints of the form
$g_I(x_j, \bar y_j) \le c_I, \quad j = 1, \ldots, n_u,$
4.1
and the corpus-level constraints of the form $g_C(X_u, \bar{Y}_u) \le c_C$, where
$g_C(X_u, \bar{Y}_u) = \sum_{j=1}^{n_u} g(x_j, \bar y_j).$
4.2
We consider the following examples of instance-level domain constraints:
  1. The AUTHOR label can appear at most once in each citation sequence. For this instance-level constraint, we can consider
    [formula omitted]
    with the corresponding cI equal to 1, so that the instance-level domain constraint has the form of equation 4.1. The penalty function can then be defined as
    [formula omitted]
    where r is a suitable penalty scaling factor whose value was determined by cross-validation in our experiments.
  2. The word CA must be labeled LOCATION. For this instance-level constraint, we can consider
    [formula omitted]
    where 1(z) is the indicator function, which is 1 if z is true and 0 otherwise. The corresponding cI for this constraint is set to 1. The penalty function can then be defined as
    [formula omitted]
  3. Each label must span a consecutive list of words and can occur at most once. For this instance-level constraint, we can consider
    [formula omitted]
    The corresponding cI for this constraint is set to 0, so that the instance-level domain constraint has the form of equation 4.1. The penalty function can then be defined as
    [formula omitted]
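To make the penalty computation concrete, here is a minimal sketch of the first two instance-level penalties applied to a predicted citation label sequence. The helper names and the exact functional forms (a hinge on the number of AUTHOR runs, an indicator count for the CA rule) are our illustrative assumptions, not the authors' code.

```python
# Minimal sketch of instance-level constraint penalties for a citation
# label sequence. Function names and functional forms are illustrative
# simplifications, not the paper's implementation.

def author_segments(labels):
    """Count maximal runs of the AUTHOR label in a label sequence."""
    runs = 0
    prev = None
    for lab in labels:
        if lab == "AUTHOR" and prev != "AUTHOR":
            runs += 1
        prev = lab
    return runs

def penalty_author_once(labels, r=1.0):
    """Penalize sequences in which AUTHOR appears as more than one run
    (here phi_I counts AUTHOR runs, with c_I = 1)."""
    return r * max(0, author_segments(labels) - 1)

def penalty_ca_is_location(words, labels, r=1.0):
    """Penalize labeling the token 'CA' with anything but LOCATION,
    counting violations with an indicator function."""
    violations = sum(1 for w, lab in zip(words, labels)
                     if w == "CA" and lab != "LOCATION")
    return r * violations

labels = ["AUTHOR", "AUTHOR", "TITLE", "AUTHOR", "DATE"]
print(penalty_author_once(labels))          # two AUTHOR runs -> penalty 1.0
words = ["Smith", ",", "CA", ".", "CA"]
labs  = ["AUTHOR", "AUTHOR", "LOCATION", "LOCATION", "DATE"]
print(penalty_ca_is_location(words, labs))  # one mislabeled 'CA' -> 1.0
```

A valid sequence (a single AUTHOR run, every CA labeled LOCATION) incurs zero penalty under both functions.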

Next, we consider some corpus-level constraints:

  1. Thirty percent of the tokens should be labeled AUTHOR. For this corpus-level constraint, we can consider
    [formula omitted]
    with the corresponding cI equal to 30, so that the corpus-level domain constraint has the form of equation 4.2. The penalty function can then be defined as
    [formula omitted]
  2. The fraction of label transitions that occur on nonpunctuation characters should be 0.01. For this corpus-level constraint, we can consider
    [formula omitted]
    The corresponding cI for this constraint is set to 0.01, so that the corpus-level domain constraint has the form of equation 4.2. The penalty function can then be defined as
    [formula omitted]
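A corpus-level constraint aggregates a statistic over all unlabeled sequences rather than over one instance. The sketch below, with hypothetical function names and an absolute-deviation penalty form of our own choosing, illustrates the 30%-AUTHOR constraint:

```python
# Minimal sketch of a corpus-level constraint penalty: 30% of tokens
# across all sequences should be labeled AUTHOR. The function names and
# the absolute-deviation penalty form are illustrative assumptions.

def author_token_percentage(corpus_labels):
    """phi_C: percentage of tokens labeled AUTHOR over the whole corpus."""
    total = sum(len(seq) for seq in corpus_labels)
    authors = sum(lab == "AUTHOR" for seq in corpus_labels for lab in seq)
    return 100.0 * authors / total

def corpus_penalty(corpus_labels, target=30.0, r=1.0):
    """Penalize deviation of the corpus-level statistic from c_I = 30."""
    return r * abs(author_token_percentage(corpus_labels) - target)

corpus = [["AUTHOR", "AUTHOR", "TITLE", "DATE"],
          ["AUTHOR", "TITLE", "TITLE", "TITLE", "TITLE", "TITLE"]]
print(author_token_percentage(corpus))  # 3 of 10 tokens -> 30.0
print(corpus_penalty(corpus))           # exactly on target -> 0.0
```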

4.1.2  Experiments on the Citations Data

We considered the citations data set with 5, 20, and 300 labeled examples, along with 1000 unlabeled examples, and measured the performance on a test set of 100 examples. The parameters Cl and r were tuned using a development data set of 100 examples. The average performance on the test set was computed by training on five different partitions for each case of 5, 20, and 300 labeled examples. The average test set accuracy comparison is presented in Table 2.

Table 2:
Comparison of Average Test Accuracy (%) Obtained from SSVM-SDM with Results in Dhillon et al. (2012) (Denoted by ^a).

Data Set     n_labeled   Baseline CRF   Baseline SDM   DASO (I)^a   SSVM-SDM (I)   PR (I)^a   CODL (I)^a
Citations    5           63.1           66.82          75.2         74.74          62.7       71
             20          79.1           78.25          84.9         86.2           76         79.4
             300         89.9           91.54          91.1         92.92          87.29      88.8
Apartments   5           65.1           64.06          67.9         68.28          66.5       66
             20          72.7           73.63          76.2         76.37          74.9       74.6
             100         76.4           79.95          80           81.93          79         78.6

Note: I denotes an inductive setting in which test examples were not used as unlabeled examples for training.

^a Results from Dhillon et al. (2012).

The results for CODL, DASO, and PR are quoted from the inductive setting in Dhillon et al. (2012), since our experiments used the same set of features and constraints as in Dhillon et al. (2012) and test examples were not used in our training. From the results in Table 2, we see that for the citations data set with five labeled examples, the performance of SSVM-SDM is slightly worse than that of DASO. For the other data sets, however, SSVM-SDM achieves comparable performance.

We present plots of test accuracy and primal objective value for partitions with 5, 20, and 300 labeled examples in Figure 2. Although five different partitions were used in our experiments for each case, the plots in Figure 2 correspond to a single partition. These plots indicate that as the annealing temperature increases, the generalization performance first increases and then drops. This drop in generalization performance may be the result of overfitting caused by an inappropriate weight Cu for the unlabeled examples. A similar observation has been made in other semisupervised structured output learning work using deterministic annealing (Chang et al., 2013). These observations suggest that finding a suitable stopping criterion for semisupervised structured output learning in the deterministic annealing framework requires further study. For our comparison results, we used the following procedure. For each partition, we first found the Cu value at which maximum generalization performance was obtained on the development data set. The maximum accuracy obtained on the test set for this particular Cu value was then used in our results. This value is indicated by a square marker in the test accuracy plots in Figure 2. The spikes in the objective function value plots correspond to changes in the annealing hyperparameter Cu.
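The selection procedure described above can be sketched in a few lines; the accuracy dictionaries below are hypothetical stand-ins for the per-Cu results of actual runs:

```python
# Sketch of the model-selection procedure: for each partition, pick the
# annealing value Cu that maximizes development-set accuracy, then report
# the test accuracy recorded at that Cu. The accuracy values are
# hypothetical, not results from the paper.

def select_cu(dev_acc, test_acc):
    """dev_acc / test_acc map each Cu value to an accuracy (%)."""
    best_cu = max(dev_acc, key=dev_acc.get)  # Cu with best dev accuracy
    return best_cu, test_acc[best_cu]

dev_acc  = {0.01: 70.2, 0.1: 74.8, 1.0: 73.1}
test_acc = {0.01: 69.5, 0.1: 74.7, 1.0: 72.0}
best_cu, reported = select_cu(dev_acc, test_acc)
print(best_cu, reported)  # -> 0.1 74.7
```

Note that the test accuracy is read off at the dev-selected Cu, not maximized over Cu directly, so the test set never drives the selection.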

Figure 2:

Primal objective value and test accuracy behavior for a partition of the citations data set. The rows correspond to 5, 20, and 300 labeled examples, in that order. The square marker in the test accuracy plots denotes the best generalization performance. The spikes in the objective function value plots correspond to changes in the annealing hyperparameter Cu.


4.1.3  Experiments on the Apartments Data

Experiments were performed on the Apartments data set with five partitions each for 5, 20, and 100 labeled examples. One thousand unlabeled examples were considered, and a test set of 100 examples was used to measure the generalization performance. The parameters Cl and r were tuned using a development data set of 100 examples. The average test set accuracy comparison is presented in Table 2.

The results in Table 2 show that SSVM-SDM achieves average performance comparable to DASO on all Apartments data sets. The plots of test accuracy and primal objective value for partitions with 5, 20, and 100 labeled examples are given in Figure 3. Note that these plots correspond to a single one of the five partitions used in our experiments. The plots show behavior similar to that seen for the citations data set.

Figure 3:

Primal objective value and test accuracy behavior for a partition of the Apartments data set. The rows correspond to 5, 20, and 100 labeled examples, in that order. The square marker in the test accuracy plots denotes the best generalization performance. The spikes in the objective function value plots correspond to changes in the annealing hyperparameter Cu.


4.1.4  Remarks on the Number of Label Switches

We give plots depicting the number of label switches as the iterations progress in Figure 4. The plots are provided for a partition of the citations and Apartments data with five labeled examples, for two different Cl values. The plots in Figure 4 show that a significant number of label switches is performed at smaller Cu values, and that the number gradually reduces as Cu increases. For example, the plot for the citations data in Figure 4 indicates that around 10,000 label switches were performed at a small Cu value, whereas only around 2000 label switches were performed at a larger Cu value. This behavior is observed in the other plots as well. We also find distinct spikes in the plots, which indicate a significant number of label switches being performed when Cu changes. After a change in Cu, the number of label switches drops to zero within a few iterations, as shown by the flat parts of the plots immediately following the spikes.

Figure 4:

Number of label switches versus iterations. The rows correspond to a single run of algorithm 1 on a partition of the citations and Apartments data, respectively, with five labeled examples. The left and right columns correspond to two different Cl values. The spikes in the plots correspond to changes in Cu. The change in Cu is performed when there is no further label switch in algorithm 2 or when the maximum number of label switches is reached.


We also observed that the algorithm required a large number of alternating steps for smaller Cu values. This behavior is shown by the width of the flat stretches between the spikes in Figure 4. This is expected in a deterministic annealing setup, as the algorithm makes considerable effort to find useful candidate labelings for the unlabeled examples at lower values of Cu. As Cu becomes large, the supervised step in the alternating optimization penalizes small variations in the unlabeled-example labelings more heavily, and hence the number of alternating optimization steps and label switches is reduced.
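The interplay between Cu and the number of label switches can be illustrated with a control-flow sketch of the annealing loop. The toy train/switch functions below are placeholders, not algorithms 1 and 2 themselves; only the loop structure — alternate until no switches occur or a switch budget is exhausted, then raise Cu — follows the description above.

```python
# Control-flow sketch of the annealing loop: alternate a supervised step
# with label switching on the unlabeled data, raising Cu once switching
# stops (or a switch budget is hit). The train/switch bodies are toy
# placeholders; only the loop structure mirrors the text.

import random

def train_supervised(labels_u, cu):
    """Placeholder for the supervised step (e.g., SDM); here it simply
    returns the current unlabeled-label assignment."""
    return list(labels_u)

def switch_labels(labels_u, cu, rng):
    """Placeholder switching step: flip labels with a probability that
    shrinks as Cu grows, and report how many switches were made."""
    switches = 0
    for i in range(len(labels_u)):
        if rng.random() < 0.5 / (1.0 + 10.0 * cu):
            labels_u[i] = 1 - labels_u[i]
            switches += 1
    return switches

def anneal(labels_u, cu_schedule, max_switches=1000, seed=0):
    rng = random.Random(seed)
    history = []
    for cu in cu_schedule:
        total = 0
        while True:
            train_supervised(labels_u, cu)
            s = switch_labels(labels_u, cu, rng)
            total += s
            if s == 0 or total >= max_switches:
                break  # move on to the next (larger) Cu
        history.append((cu, total))
    return history

hist = anneal([0] * 50, cu_schedule=[0.001, 0.01, 0.1, 1.0])
# Switch counts typically shrink as Cu grows, mirroring Figure 4.
print(hist)
```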

4.2  Natural Language Parsing Task

The experiments for natural language parsing were done on a data set extracted from the Spanish parse-tree data CESS-CAT (available at http://nltk.org). The data set contains 92 examples, of which 5 were treated as labeled examples, 65 as unlabeled examples, and 22 as test examples. We randomly permuted the examples to generate five different copies of the data set with distinct sets of labeled, unlabeled, and test examples. The following instance-level and corpus-level constraints were considered for the CESS-CAT data. At the instance level, some grammar rules were fixed: cc → y, neg → no, prep → a | de | en | del. By fixing, we mean that a large weight was associated with these rules, biasing the CYK algorithm toward them. At the corpus level, the constraints we used were related to the proportion of each grammar rule. For example, we used the constraint "more than 50% of grammar rules should start with the symbol S."
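The rule-fixing mechanism can be sketched with a small weighted CYK chart parser: adding a large score to a favored rule changes which analysis wins. The toy grammar, weights, and helper names below are invented for illustration and are not the CESS-CAT grammar or the authors' parser.

```python
# Sketch of how a large rule weight biases CYK toward a fixed grammar
# rule. The toy CNF grammar and all scores are invented; only the
# mechanism (a big additive score for preferred rules) reflects the text.

import math
from collections import defaultdict

def cyk_best(words, lexicon, binary, boost=None):
    """Weighted CYK for a CNF grammar: chart[(i, j)][A] = best score of
    deriving words[i:j] from A. `boost` maps favored (lexical or binary)
    rules to a large additive score."""
    boost = boost or {}
    n = len(words)
    chart = defaultdict(dict)
    for i, w in enumerate(words):
        for A, s in lexicon.get(w, {}).items():
            chart[(i, i + 1)][A] = s + boost.get((A, w), 0.0)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (A, B, C), s in binary.items():
                    if B in chart[(i, k)] and C in chart[(k, j)]:
                        cand = (chart[(i, k)][B] + chart[(k, j)][C]
                                + s + boost.get((A, B, C), 0.0))
                        if cand > chart[(i, j)].get(A, -math.inf):
                            chart[(i, j)][A] = cand
    return chart[(0, n)]

lexicon = {"de": {"prep": 0.0, "noun": 0.5},  # 'de' is made ambiguous
           "casa": {"noun": 0.0}}
binary = {("pp", "prep", "noun"): 0.0, ("np", "noun", "noun"): 0.2}
# Without a boost the noun-noun analysis wins; fixing prep -> de flips it.
free = cyk_best(["de", "casa"], lexicon, binary)
fixed = cyk_best(["de", "casa"], lexicon, binary,
                 boost={("prep", "de"): 100.0})
print(max(free, key=free.get), max(fixed, key=fixed.get))  # -> np pp
```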

The parameter Cl was fixed at 1 for our experiments. After training, the generalization performance was measured using the 0/1 loss over the unlabeled and test examples. The experiments were conducted to study the following important factors:

  • Improvement in generalization performance obtained by semisupervised learning over supervised learning for parsing

  • Role of grammar rules associated with the labeled, unlabeled, and test examples

Because the distributions of grammar rules applicable to the labeled, unlabeled, and test set examples are different, it is important to use all of these grammar rules during training; doing so resulted in a significant improvement in generalization performance. In our experiments, it was also observed that if we do not include the grammar rules associated with the unlabeled and test set examples during semisupervised training, the performance on the unlabeled and test set examples deteriorates. It is worth mentioning that the inclusion of grammar rules associated with the unlabeled and test set examples is not unrealistic, as an expert linguist can provide these rules before training starts. Note, however, that semisupervised training does not use the labels associated with the unlabeled and test set examples.

In our experiments, we investigated the effect of including the grammar rules concerned with the unlabeled and test examples in the training. Without knowledge of these grammar rules, both supervised and semisupervised learning resulted in an average unlabeled and test accuracy of less than 2%. After including them, the average accuracy obtained with semisupervised learning improved by about 6% over supervised learning on the unlabeled examples (76% for supervised learning versus 82.46% for semisupervised learning with label switching) and by approximately 3% on the test examples (79.1% for supervised learning versus 81.8% for semisupervised learning with label switching). These observations are summarized in Table 3.

Table 3:
Effect of Domain Knowledge about Grammar Rules Associated with Unlabeled and Test Examples on the Generalization Performance (Measured Using 0/1 Loss).

                                     No Domain Knowledge     Domain Knowledge on Grammar Rules
                                     on Grammar Rules        Included in Semisupervised Learning
Performance on unlabeled examples    No improvement          Improves by 6%
Performance on test examples         No improvement          Improves by 2.7%

These results suggest that, with the relevant grammar rules included, the proposed semisupervised learning with domain constraints and label switching is useful for the parsing task as well.

5  Conclusion

In this letter, we proposed a simple and efficient label switching algorithm for semisupervised structural SVMs. An important feature of the proposed algorithm is that it is applicable to the general structured output setting as long as inference can be done exactly; this is not the case with DASO, which is fine-tuned for sequence labeling problems. Further, the proposed algorithm is easy to implement. Experimental results on sequence labeling and natural language parsing data sets demonstrated that the proposed algorithm gives comparable generalization performance and is a useful alternative for semisupervised structured output learning. The proposed label switching algorithm can also be used to handle complex constraints that are imposed over only parts of the structured output; we are investigating this extension.

References

Altun, Y., McAllester, D., & Belkin, M. (2006). Maximum-margin semi-supervised learning for structured variables. In Y. Weiss, B. Schölkopf, & J. Platt (Eds.), Advances in neural information processing systems, 18. Cambridge, MA: MIT Press.

Balamurugan, P., Shevade, S., Sundararajan, S., & Keerthi, S. S. (2011). A sequential dual method for structural SVMs. In Proceedings of the 11th International Conference on Data Mining (pp. 223–234). Madison, WI: Omnipress.

Balamurugan, P., Shevade, S., Sundararajan, S., & Keerthi, S. S. (2013). An empirical evaluation of sequence-tagging trainers. Computing Research Repository, abs/1311.2378.

Bellare, K., Druck, G., & McCallum, A. (2009). Alternating projections for learning with expectation constraints. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (pp. 43–50). AUAI Press.

Chang, K.-W., Sundararajan, S., & Keerthi, S. S. (2013). Tractable semi-supervised learning of complex structured prediction models. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases (pp. 176–191). New York: Springer.

Chang, M. W., Ratinov, L., & Roth, D. (2007). Guiding semi-supervision with constraint-driven learning. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (pp. 280–287). Stroudsburg, PA: ACL.

Chapelle, O., Schölkopf, B., & Zien, A. (2010). Semi-supervised learning. Cambridge, MA: MIT Press.

Dhillon, P. S., Keerthi, S. S., Bellare, K., Chapelle, O., & Sellamanickam, S. (2012). Deterministic annealing for semi-supervised structured output learning. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (pp. 299–307). JMLR.org.

Forney, G. D. (1973). The Viterbi algorithm. Proceedings of the IEEE, 61(3), 268–278.

Ganchev, K., Graça, J. A., Gillenwater, J., & Taskar, B. (2010). Posterior regularization for structured latent variable models. Journal of Machine Learning Research, 11, 2001–2049.

Grenager, T., Klein, D., & Manning, C. D. (2005). Unsupervised learning of field segmentation models for information extraction. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (pp. 371–378). Stroudsburg, PA: ACL.

Hopcroft, J. E., & Ullman, J. D. (2000). Introduction to automata theory, languages and computation (2nd ed.). Reading, MA: Addison-Wesley.

Joachims, T. (1999). Transductive inference for text classification using support vector machines. In Proceedings of the 16th International Conference on Machine Learning (pp. 200–209). San Mateo, CA: Morgan Kaufmann.

Joachims, T., Finley, T., & Yu, C.-N. J. (2009). Cutting-plane training of structural SVMs. Machine Learning, 77(1), 27–59.

Kate, R. J., & Mooney, R. J. (2007). Semi-supervised learning for semantic parsing using support vector machines. In Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers (pp. 81–84). Stroudsburg, PA: ACL.

Keerthi, S. S., Sundararajan, S., & Shevade, S. (2012). Extension of TSVM to multi-class and hierarchical text classification problems with general losses. In Proceedings of the 24th International Conference on Computational Linguistics (pp. 1091–1100). COLING 2012 Organizing Committee.

Li, Y., & Zemel, R. (2014). High order regularization for semi-supervised learning of structured output problems. In Proceedings of the 31st International Conference on Machine Learning (pp. 1368–1376). JMLR.org.

McClosky, D., Charniak, E., & Johnson, M. (2006). Effective self-training for parsing. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (pp. 152–159).

Taskar, B. (2004). Learning structured prediction models: A large margin approach. Doctoral dissertation, Stanford University.

Yu, C.-N. (2012). Transductive learning of structural SVMs via prior knowledge constraints. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (pp. 1367–1376). JMLR.org.

Zien, A., Brefeld, U., & Scheffer, T. (2007). Transductive support vector machines for structured variables. In Proceedings of the 24th International Conference on Machine Learning (pp. 1183–1190). New York: ACM.

Author notes

This work was done when P.B. was at the Computer Science and Automation Department, Indian Institute of Science, Bangalore, India.