Abstract
Neural text classification models typically treat output labels as categorical variables that lack description and semantics. This forces their parametrization to be dependent on the label set size, and, hence, they are unable to scale to large label sets and generalize to unseen ones. Existing joint input-label text models overcome these issues by exploiting label descriptions, but they are unable to capture complex label relationships, have rigid parametrization, and their gains on unseen labels happen often at the expense of weak performance on the labels seen during training. In this paper, we propose a new input-label model that generalizes over previous such models, addresses their limitations, and does not compromise performance on seen labels. The model consists of a joint nonlinear input-label embedding with controllable capacity and a joint-space-dependent classification unit that is trained with cross-entropy loss to optimize classification performance. We evaluate models on full-resource and low- or zero-resource text classification of multilingual news and biomedical text with a large label set. Our model outperforms monolingual and multilingual models that do not leverage label semantics and previous joint input-label space models in both scenarios.
1 Introduction
Text classification is a fundamental NLP task with numerous real-world applications such as topic recognition (Tang et al., 2015; Yang et al., 2016), sentiment analysis (Pang and Lee, 2005; Yang et al., 2016), and question answering (Chen et al., 2015; Kumar et al., 2015). Classification also appears as a sub-task for sequence prediction tasks such as neural machine translation (Cho et al., 2014; Luong et al., 2015) and summarization (Rush et al., 2015). Despite numerous studies, existing models are trained on a fixed label set using k-hot vectors, and therefore treat target labels as mere atomic symbols without any particular structure to the space of labels, ignoring potential linguistic knowledge about the words used to describe the output labels. Given that semantic representations of words have been shown to be useful for representing the input, it is reasonable to expect that they are going to be useful for representing the labels as well.
Previous work has leveraged knowledge from the label texts through a joint input-label space, initially for image classification (Weston et al., 2011; Mensink et al., 2012; Frome et al., 2013; Socher et al., 2013). Such models generalize to labels both seen and unseen during training, and scale well on very large label sets. However, as we explain in Section 2, existing input-label models for text (Yazdani and Henderson, 2015; Nam et al., 2016) have the following limitations: (i) their embedding does not capture complex label relationships due to its bilinear form, (ii) their output layer parametrization is rigid because it depends on the dimensionality of the encoded text and labels, and (iii) they are outperformed on seen labels by classification baselines trained with cross-entropy loss (Frome et al., 2013; Socher et al., 2013).
In this paper, we propose a new joint input-label model that generalizes over previous such models, addresses their limitations, and does not compromise performance on seen labels (see Figure 1). The proposed model is composed of a joint nonlinear input-label embedding with controllable capacity and a joint-space-dependent classification unit which is trained with cross-entropy loss to optimize classification performance.1 The need for capturing complex label relationships is addressed by two nonlinear transformations that have the same target joint space dimensionality. The parametrization of the output layer is not constrained by the dimensionality of the input or label encoding, but is instead flexible with a capacity that can be easily controlled by choosing the dimensionality of the joint space. Training is performed with cross-entropy loss, which is a suitable surrogate loss for classification problems, as opposed to a ranking loss such as WARP loss (Weston et al., 2010), which is more suitable for ranking problems.
Evaluation is performed on full-resource and low- or zero-resource scenarios of two text classification tasks, namely, on biomedical semantic indexing (Nam et al., 2016) and on multilingual news classification (Pappas and Popescu-Belis, 2017), against several competitive baselines. In both scenarios, we provide a comprehensive ablation analysis that highlights the importance of each model component and the difference with previous embedding formulations when using the same type of architecture and loss function.
Our main contributions are the following:
- (i) We identify key theoretical and practical limitations of existing joint input-label models.
- (ii) We propose a novel joint input-label embedding with flexible parametrization that generalizes over previous such models and addresses their limitations.
- (iii) We provide empirical evidence of the superiority of our model over monolingual and multilingual models that ignore label semantics, and over previous joint input-label models, on both seen and unseen labels.
The remainder of this paper is organized as follows. Section 2 provides background knowledge and explains limitations of existing models. Section 3 describes the model components, training, and relation to previous formulations. Section 4 describes our evaluation results and analysis, while Section 5 provides an overview of previous work and Section 6 concludes the paper and provides future research directions.
2 Background: Neural Text Classification
We are given a collection D = {(x_i, y_i), i = 1, …, N} of N documents, where each document x_i is associated with labels y_i = {y_ij ∈ {0, 1} | j = 1, …, k}, and k is the total number of labels. Each document is a sequence of words grouped into sentences, with K_i being the number of sentences in document i and T_j the number of words in sentence j. Each label j has a textual description composed of multiple words, with L_j being the number of words in the description. Given the input texts and their associated labels seen during the training portion of D, our goal is to learn a text classifier that is able to predict labels both in the seen and the unseen label sets, 𝒴_seen and 𝒴_unseen, defined as the sets of unique labels that have or have not been seen during training, respectively; hence, 𝒴_seen ∩ 𝒴_unseen = ∅ and 𝒴 = 𝒴_seen ∪ 𝒴_unseen.2
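As a toy illustration of this setup in code (hypothetical data, not drawn from the corpora used later; all variable names are illustrative):

```python
# Toy illustration of the notation: each document x_i is a list of sentences,
# each sentence a list of words; y_i is a k-hot vector over the k labels;
# each label has a textual description.
documents = [
    [["stocks", "fell", "sharply"], ["markets", "reacted", "to", "the", "news"]],  # x_1
    [["the", "team", "won", "the", "final"]],                                      # x_2
]
label_descriptions = ["business and economy", "sports", "politics"]  # k = 3
labels = [
    [1, 0, 0],  # y_1: document 1 carries the "business and economy" label
    [0, 1, 0],  # y_2: document 2 carries the "sports" label
]
# Seen vs. unseen label sets: labels observed (or not) in the training portion of D.
seen_labels = {0, 1}
unseen_labels = {2}          # e.g., "politics" never appears at training time
assert seen_labels.isdisjoint(unseen_labels)
```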
2.1 Input Text Representation
To encode the input text, we focus on hierarchical attention networks (HANs), which are competitive for monolingual (Yang et al., 2016) and multilingual text classification (Pappas and Popescu-Belis, 2017). The model takes as input a document x and outputs a document vector h. The input words and label words are represented by vectors in ℝ^d from the same3 embedding matrix E ∈ ℝ^{|V|×d}, where V is the vocabulary and d is the embedding dimension; E can be pre-trained or learned jointly with the rest of the model. The model has two levels of abstraction, word and sentence. The word level is made of an encoder network g_w and an attention network a_w, while the sentence level similarly includes an encoder and an attention network.
Encoders.
Attention.
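Since the exact encoder and attention definitions follow the cited work, the sketch below only illustrates the word-level step in numpy under assumed forms: a Dense encoder (one of the encoder types compared later in Table 4) and a simple attention-weighted sum; the parameter names (W_enc, v_att) and the nonlinearities are illustrative, not the paper's exact equations.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d, T = 100, 6                                  # embedding dimension, words in one sentence
words = rng.normal(size=(T, d))                # word vectors looked up from E

# Word-level encoder g_w (here a simple Dense projection, illustrative form).
W_enc = rng.normal(size=(d, d)) * 0.01
H = np.tanh(words @ W_enc)                     # (T, d) encoded words

# Word-level attention a_w: score each encoded word and take the weighted sum.
v_att = rng.normal(size=d) * 0.01
alpha = softmax(H @ v_att)                     # (T,) attention weights over words
sentence_vec = alpha @ H                       # (d,) sentence representation

# The sentence level applies the same encoder + attention pattern to the sentence
# vectors of a document, yielding the document vector h used by the output layer.
```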
2.2 Label Text Representation
To encode the label text, we use an encoder function that takes as input a label description c_j and outputs a label vector e_j ∈ ℝ^{d_c} for every j = 1, …, k. For efficiency reasons, we use a simple, parameter-free function to compute e_j, namely the average of the embedding vectors of the words in c_j, e_j = (1/L_j) Σ_{w ∈ c_j} E_w, and hence d_c = d in this case. By stacking all k label vectors into a matrix, we obtain the label embedding E_c ∈ ℝ^{k×d_c}. In principle, we could also use the same encoder functions as the ones for the input text, but this would increase the computation significantly; hence, we leave this direction as future work.
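A minimal sketch of this parameter-free label encoding, assuming a toy vocabulary and randomly initialized word embeddings; the helper label_vector is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"business": 0, "and": 1, "economy": 2, "sports": 3, "politics": 4}
d = 100
E = rng.normal(size=(len(vocab), d))            # shared word embeddings (pre-trained or learned)

def label_vector(description, E, vocab):
    """e_j: average of the word vectors of the label description, so d_c = d."""
    idx = [vocab[w] for w in description.split()]
    return E[idx].mean(axis=0)

label_descriptions = ["business and economy", "sports", "politics"]   # k = 3 toy labels
E_c = np.stack([label_vector(c, E, vocab) for c in label_descriptions])
print(E_c.shape)                                # (3, 100): the label embedding matrix
```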
2.3 Output Layer Parametrizations
2.3.1 Typical Linear Unit
2.3.2 Bilinear Input-Label Unit
Limitations.
Summary.
We hypothesize that these are the reasons why these models do not yet perform well on seen labels compared to models that make use of the typical linear unit, and they do not take full advantage of the structure of the problem when tested on unseen labels. Ideally, we would like to have a model that will address these issues and will combine the benefits from both the typical linear unit and the joint input-label models.
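To make the contrast concrete, the bilinear unit of Section 2.3.2 can be written schematically as a single matrix that couples the encoded text and the label vector (the exact notation in the cited works may differ):

```latex
\mathrm{score}(x_i, y_j) \;=\; \mathbf{h}_i^{\top} W \, \mathbf{e}_j
```

Because this score is linear in each argument and the size of W is fixed by the dimensionalities of h_i and e_j, such a unit can neither model more complex input-label interactions nor have its capacity adjusted independently of the encoders.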
3 The Proposed Output Layer Parametrization for Text Classification
We propose a new output layer parametrization for neural text classification which is composed of a generalized input-label embedding that captures the structure of the labels, the structure of the encoded texts and the interactions between the two, followed by a classification unit which is independent of the label set size. The resulting model has the following properties: (i) it is able to capture complex output structure, (ii) it has a flexible parametrization that allows its capacity to be controlled, and (iii) it is trained with a classification surrogate loss such as cross-entropy. The model is depicted in Figure 1. In this section, we describe the model in detail, showing how it can be trained efficiently for arbitrarily large label sets and how it is related to previous models.
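As a rough orientation before the detailed subsections, the sketch below shows in numpy the kind of computation this output layer performs, with assumed nonlinearities and illustrative parameter names (U, V, w, b): the encoded document and all encoded label descriptions are projected nonlinearly into a joint space of dimension d_j, combined multiplicatively, and scored by a classification unit whose parameters do not grow with the label set size.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_h, d_c, d_j, k = 100, 100, 500, 1000   # text dim, label dim, joint dim, #labels (toy sizes)

h   = rng.normal(size=d_h)               # encoded document, from the HAN of Section 2.1
E_c = rng.normal(size=(k, d_c))          # encoded label descriptions, Section 2.2

# Two nonlinear projections into the joint space (illustrative parameters and tanh).
U = rng.normal(size=(d_c, d_j)) * 0.01   # label-side projection
V = rng.normal(size=(d_h, d_j)) * 0.01   # input-side projection
G = np.tanh(E_c @ U)                     # (k, d_j): all labels mapped to the joint space
g = np.tanh(h @ V)                       # (d_j,):  the document mapped to the joint space

# Classification unit: its parameters (w, b) depend only on d_j, not on k, so scoring
# additional (even unseen) labels only requires their description vectors in E_c.
w, b = rng.normal(size=d_j) * 0.01, 0.0
probs = sigmoid((G * g) @ w + b)         # (k,) one probability per label, trained with BCE
print(probs.shape)                       # (1000,)
```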
3.1 A Generalized Input-Label Embedding
3.2 Classification Unit
Summary.
3.3 Training Objectives
3.4 Scaling Up to Large Label Sets
3.5 Relation to Previous Parametrizations
4 Experiments
The evaluation is performed on large-scale biomedical semantic indexing using the BioASQ data set, obtained by Nam et al. (2016), and on multilingual news classification using the DW corpus, which consists of eight language data sets obtained by Pappas and Popescu-Belis (2017). The statistics of these data sets are listed in Table 1.
Table 1: Statistics of the BioASQ and DW data sets.

| Data set | # documents | # words | avg. words/doc | # labels | avg. words/label |
|---|---|---|---|---|---|
| BioASQ | 11,705,534 | 528,156 | 214 | 26,104 | 35.0 |
| DW | 598,304 | 884,272 | 436 | 5,637 | 2.3 |
| – en | 112,816 | 110,971 | 516 | 1,385 | 2.1 |
| – de | 132,709 | 261,280 | 424 | 1,176 | 1.8 |
| – es | 75,827 | 130,661 | 412 | 843 | 4.7 |
| – pt | 39,474 | 58,849 | 571 | 396 | 1.8 |
| – uk | 35,423 | 105,240 | 342 | 288 | 1.7 |
| – ru | 108,076 | 123,493 | 330 | 916 | 1.8 |
| – ar | 57,697 | 58,922 | 357 | 435 | 2.4 |
| – fa | 36,282 | 34,856 | 538 | 198 | 2.5 |
4.1 Biomedical Text Classification
We evaluate on biomedical text classification to demonstrate that our generalized input-label model scales to very large label sets and performs better than previous joint input-label models on both seen and unseen label prediction scenarios.
4.1.1 Settings
We follow the exact evaluation protocol, data, and settings of Nam et al. (2016), as described below. We use the BioASQ Task 3a data set, a collection of scientific publications in biomedical research. The data set contains about 12M documents, each labeled with around 11 labels out of 27,455 possible ones, which are defined according to the Medical Subject Headings (MeSH) hierarchy. The data were minimally pre-processed with tokenization, number replacement (NUM), and rare-word replacement (UNK), and split by year with the provided script, so that the training set includes all documents up to 2004 and those from 2005 to 2015 form the test set; this corresponds to 6,692,815 documents for training and 4,912,719 for testing. For validation, a set of 100,000 documents was randomly sampled from the training set. We report the same ranking-based evaluation metrics as Nam et al. (2016), namely, rank loss (RL), average precision (AvgPr), and one-error loss (OneErr).
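For reference, these ranking metrics can be computed on a toy example as follows; this is a sketch based on the standard definitions and scikit-learn's implementations, and the exact evaluation script of Nam et al. (2016) may differ in details such as tie handling.

```python
import numpy as np
from sklearn.metrics import label_ranking_loss, label_ranking_average_precision_score

# Toy multi-label example: 3 documents, 4 labels (1 = relevant).
y_true  = np.array([[1, 0, 0, 1],
                    [0, 1, 0, 0],
                    [1, 1, 0, 0]])
y_score = np.array([[0.9, 0.2, 0.1, 0.4],
                    [0.3, 0.8, 0.6, 0.1],
                    [0.2, 0.7, 0.9, 0.1]])

rl    = label_ranking_loss(y_true, y_score)                     # RL: lower is better
avgpr = label_ranking_average_precision_score(y_true, y_score)  # AvgPr: higher is better
# OneErr: fraction of documents whose top-ranked label is not a relevant one.
one_err = np.mean(y_true[np.arange(len(y_true)), y_score.argmax(axis=1)] == 0)
print(rl, avgpr, one_err)
```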
Our hyper-parameters were selected on the validation data based on average precision, as follows: 100-dimensional word embeddings, encoders, and attention (the same dimensions as the baselines), a joint input-label embedding of dimension 500, a batch size of 64, a maximum of 300 words per document and 50 words per label, ReLU activations, 0.3% negative label sampling, and optimization with Adam until convergence. The word embeddings were learned end-to-end on the task.4
The baselines are the joint input-label models from Nam et al. (2016), noted as [N16], namely:
- WSABIE+: This model is an extension of the original WSABIE model by Weston et al. (2011), which, instead of learning a ranking model with fixed document features, jointly learns features for documents and words, and is trained with the WARP ranking loss.
- AiTextML: This model is the one proposed by Nam et al. (2016) with the purpose of learning joint representations of documents, labels, and words, along with a joint input-label space that is trained with the WARP ranking loss.
The scores of the WSABIE+ and AiTextML baselines in Table 2 are the ones reported by Nam et al. (2016). In addition, we report scores of a word-level attention neural network (WAN) with a Dense encoder and attention, followed by a sigmoid output layer, trained with binary cross-entropy loss.5 Our model replaces WAN’s output layer with a generalized input-label embedding layer and its variations, noted GILE-WAN. We also compare against bilinear input-label embedding versions of WAN, namely, the model by Yazdani and Henderson (2015), noted BIL-WAN [YH15], and the one by Nam et al. (2016), noted BIL-WAN [N16]. Note that the AiTextML parameter space is huge, since it grows linearly with the number of labels and documents, which makes learning difficult; instead, we make sure that our models have far fewer parameters than the baselines (Table 2).
4.1.2 Results
The results on biomedical semantic indexing for seen and unseen labels are shown in Table 2. We observe that the neural baseline, WAN, outperforms WSABIE+ and AiTextML on the seen labels, by +5.73 and +9.59 points in terms of AvgPr, respectively. The differences are even more pronounced on the ranking loss and one-error metrics. This result is consistent with previous findings that existing joint input-label models are not able to outperform strong supervised baselines on seen labels. However, WAN cannot generalize at all to unseen labels, hence WSABIE+ and AiTextML have a clear advantage in the zero-resource setting.
In contrast, our generalized input-label model, GILE-WAN, outperforms WAN even on seen labels, with higher average precision by +2.02 points, better ranking loss by +43%, and comparable OneErr (−3%). This gain does not come at the expense of performance on unseen labels: GILE-WAN outperforms the WSABIE+ and AiTextML variants6 by a large margin in both cases, for example, by +7.75 and +11.61 points in average precision on seen labels and by +12.58 and +10.29 points on unseen labels, respectively. Interestingly, our GILE-WAN model also outperforms the two previous bilinear input-label embedding formulations of Yazdani and Henderson (2015) and Nam et al. (2016), namely, BIL-WAN [YH15] and BIL-WAN [N16], by +3.71 and +2.48 points on seen labels and by +3.45 and +2.39 points on unseen labels, respectively, even when they are trained with the same encoders and loss as ours. These models are not able to outperform the WAN baseline when evaluated on seen labels, that is, they have 1.68 and 0.46 points lower average precision than WAN, but they do outperform WSABIE+ and AiTextML on both seen and unseen labels. Overall, the results show a clear advantage of our generalized input-label embedding model over previous models on both seen and unseen labels.
4.1.3 Ablation Analysis
To evaluate the effectiveness of individual components of our model, we performed an ablation study (last three rows in Table 2). Note that when we use only the label or only the input embedding in our generalized input-label formulation, the dimensionality of the joint space is constrained to be that of the encoded labels or inputs, respectively (i.e., d_j = 100 in our experiments).
All three variants of our model outperform the previous embedding formulations of Nam et al. (2016) and Yazdani and Henderson (2015) on all metrics except AvgPr on seen labels, where they score slightly lower. The decrease in AvgPr for our model variants with d_j = 100 compared with the neural baselines could be attributed to the difficulty of learning the parameters of a highly nonlinear space with only a few hidden dimensions. Indeed, when we increase the number of dimensions (d_j = 500), our full model outperforms them by a large margin. Recall that this increase in capacity is only possible with our full model definition in Equation (9); none of the other variants allows it without interfering with the original dimensionality of the encoded labels (e_j) and input (h). In addition, our model variants with d_j = 100 exhibit consistently higher scores than the baselines on most metrics for both seen and unseen labels, which suggests that they are able to capture more complex relationships across labels and between encoded inputs and labels.
Overall, the best performance among our model variants is achieved when using only the label embedding; hence, it is the most significant component of our model. Surprisingly, our model with only the label embedding achieves higher performance than our full model on unseen labels, but it is far behind the full model when we consider performance on both seen and unseen labels. When we constrain our full model to have the same dimensionality as the other variants (i.e., d_j = 100), it outperforms the variant that uses only the input embedding on most metrics and is outperformed by the one that uses only the label embedding.
4.2 Multilingual News Text Classification
We evaluate on multilingual news text classification to demonstrate that our output layer based on the generalized input-label embedding outperforms previous models with a typical output layer in a wide variety of settings, even for labels that have been seen during training.
4.2.1 Settings
We follow the exact evaluation protocol, data, and settings of Pappas and Popescu-Belis (2017), as described below. The data set is split per language into 80% for training, 10% for validation, and 10% for testing. We evaluate on both types of labels (general, Yg, and specific, Ys) in the full-resource scenario, and only on the general labels (Yg) in the low-resource scenario. Performance is measured with micro-averaged F1 percentage scores.
The hierarchical models have Dense encoders in all scenarios (Tables 3, 6, and 7), except for the varying-encoder experiment (Table 4). For the low-resource scenario, the levels of data availability are: tiny, from 0.1% to 0.5%; small, from 1% to 5%; and medium, from 10% to 50% of the original training set. For each level, we report the average F1 across increments of 0.1%, 1%, and 10%, respectively. The decision thresholds, which were tuned on validation data by Pappas and Popescu-Belis (2017), are set as follows: for the full-resource scenario, 0.4 for |Ys| < 400 and 0.2 for |Ys| ≥ 400; for the low-resource scenario, 0.3 for all sets.
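To illustrate how these thresholds are used, the sketch below converts per-label probabilities into predicted label sets and scores them with micro-averaged F1 (toy values; the threshold follows the full-resource setting quoted above):

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy example: 2 documents, 3 labels; y_prob are the per-label sigmoid outputs.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
y_prob = np.array([[0.85, 0.10, 0.55],
                   [0.20, 0.45, 0.35]])

threshold = 0.4                                  # full-resource setting with |Ys| < 400
y_pred = (y_prob >= threshold).astype(int)       # predicted label sets
print(100 * f1_score(y_true, y_pred, average="micro"))   # micro-averaged F1 (%)
```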
The baselines are all the monolingual and multilingual neural networks from Pappas and Popescu-Belis (2017),8 noted as [PB17], namely:
- NN: A neural network that feeds the average vector of the input words directly to a classification layer, as the one used by Klementiev et al. (2012).
- HNN: A hierarchical network with encoders and average pooling at every level, followed by a classification layer, as the one used by Tang et al. (2015).
- HAN: A hierarchical network with encoders and attention, followed by a classification layer, as the one used by Yang et al. (2016).
- MHAN: Three multilingual hierarchical networks with shared encoders, noted MHAN-Enc, shared attention, noted MHAN-Att, and shared attention and encoders, noted MHAN-Both, as the ones used by Pappas and Popescu-Belis (2017).
To ensure a controlled comparison with the above baselines, for each model we evaluate a version in which its output layer is replaced by our generalized input-label embedding output layer with the same number of parameters; these versions have the abbreviation “GILE” prepended to their name (e.g., GILE-HAN). The scores of the HAN and MHAN models in Tables 3, 6, and 7 are the ones reported by Pappas and Popescu-Belis (2017), while for Table 4 we train them ourselves using their code. Lastly, the best score for each pairwise comparison between a joint input-label model and its counterpart is marked in bold.
4.2.2 Results
Table 3 displays the results of full-resource document classification using Dense encoders for both general and specific labels. On the left, we display the performance of models on the English sub-corpus when English and an auxiliary language are used for training, and on the right, the performance on the auxiliary language sub-corpus when that language and English are used for training.
The results show that in 98% of comparisons on general labels (top half of Table 3) the joint input-label models improve consistently over the corresponding models using a typical sigmoid classification layer. This finding validates our main hypothesis that the joint input-label models successfully exploit the semantics of the labels, which provide useful cues for classification, as opposed to models which are agnostic to label semantics. The results for specific labels (bottom half of Table 3) demonstrate the same trend, with the joint input-label models performing better in 87% of comparisons.
In Table 5, we also directly compare our embedding to previous bilinear input-label embedding formulations when using the best monolingual configuration (HAN) from Table 3, exactly as done in Section 4.1. The results on the general labels show that GILE outperforms the previous bilinear input-label models, BIL [YH15] and BIL [N16], by +1.62 and +3.3 percentage points on average, respectively. This difference is much more pronounced on the specific labels, where the label set is much larger, namely, +6.5 and +13.5 percentage points, respectively. Similarly, our model with constrained dimensionality is as good as or better than the bilinear input-label models on average, by +0.9 and +2.2 points on general labels and by −0.5 and +6.1 points on specific labels, respectively, which highlights the importance of learning nonlinear relationships across encoded labels and documents. Among our ablated model variants, as in the previous section, the best is the one with only the label projection, but it is still worse than our full model by 5.2 percentage points. The improvements of GILE over each baseline are significant and consistent on both data sets. Hence, in the following experiments we only consider the best of these alternatives.
Table 5: Output layer comparison using the HAN model: micro-averaged F1 (%) per language on the general (Yg, top) and specific (Ys, bottom) labels.

| Yg output layer | en | de | es | pt | uk | ru | ar | fa |
|---|---|---|---|---|---|---|---|---|
| Linear [PB17] | 71.2 | 71.8 | 82.8 | 71.3 | 85.3 | 79.8 | 80.5 | 76.6 |
| BIL [YH15] | 71.7 | 70.5 | 82.0 | 71.1 | 86.6 | 80.6 | 80.4 | 76.0 |
| BIL [N16] | 69.8 | 69.1 | 80.9 | 67.4 | 87.5 | 79.9 | 78.4 | 75.1 |
| GILE (Ours) | 76.5 | 74.2 | 83.4 | 71.9 | 86.1 | 82.7 | 82.6 | 77.2 |
| – constrained d_j | 73.6 | 73.1 | 83.3 | 71.0 | 87.1 | 81.6 | 80.4 | 76.4 |
| – only label | 71.4 | 69.6 | 82.1 | 70.3 | 86.2 | 80.6 | 81.1 | 76.2 |
| – only input | 55.1 | 54.2 | 80.6 | 66.5 | 85.6 | 60.8 | 78.9 | 74.0 |

| Ys output layer | en | de | es | pt | uk | ru | ar | fa |
|---|---|---|---|---|---|---|---|---|
| Linear [PB17] | 43.4 | 44.8 | 46.3 | 41.9 | 46.4 | 45.8 | 41.2 | 49.4 |
| BIL [YH15] | 40.7 | 37.8 | 38.1 | 33.5 | 44.6 | 38.1 | 39.1 | 42.6 |
| BIL [N16] | 34.4 | 30.2 | 34.4 | 33.6 | 31.4 | 22.8 | 35.6 | 38.9 |
| GILE (Ours) | 45.9 | 47.3 | 47.4 | 42.6 | 46.6 | 46.9 | 41.9 | 48.6 |
| – constrained d_j | 38.5 | 38.0 | 36.8 | 35.1 | 42.1 | 36.1 | 36.7 | 48.7 |
| – only label | 38.4 | 41.5 | 42.9 | 38.3 | 44.0 | 39.3 | 37.2 | 43.4 |
| – only input | 12.1 | 10.8 | 8.8 | 20.5 | 11.8 | 7.8 | 12.0 | 24.6 |
The best bilingual performance on average is that of the GILE-MHAN-Att model, for both general and specific labels. This improvement can be attributed to the effective sharing of label semantics across languages through the joint multilingual input-label output layer. Effectively, this model has the same multilingual sharing scheme as the best model reported by Pappas and Popescu-Belis (2017), MHAN-Att, namely, sharing attention at each level of the hierarchy, which agrees well with their main finding.
Interestingly, the improvement holds when using different types of hierarchical encoders, namely, Dense, GRU, and biGRU, as shown in Table 4, which demonstrates the generality of the approach. In addition, our best models outperform logistic regression trained either on the top 10% most frequent words or on the full vocabulary, even though our models use far fewer parameters, namely, 377K/138K vs. 26M/5M. Increasing the capacity of our models should lead to even further improvements.
Multilingual learning.
So far, we have shown that the proposed joint input-label models outperform typical neural models when training with one or two languages. Does the improvement remain when the number of languages is increased further? To answer this question, we report in Table 6 the average F1 score per language for the best baselines from the previous experiment (HAN and MHAN-Att) and the proposed joint input-label versions of them (GILE-HAN and GILE-MHAN-Att) when increasing the number of training languages (1, 2, and 8). Overall, we observe that the joint input-label models outperform all the baselines independently of the number of languages involved in training, while having the same number of parameters. We also replicate the previous finding that a second language helps, but that adding more languages beyond two brings no further improvement.
Low-resource transfer.
We investigate here whether joint input-label models are useful for low-resource languages. Table 7 shows the low-resource classification results from English to seven other languages when varying the amount of their training data. Our model with both shared encoders and attention, GILE-MHAN, outperforms the previous models on average, namely, HAN (Yang et al., 2016) and MHAN (Pappas and Popescu-Belis, 2017), for low-resource classification in the majority of the cases.
The shared input-label space appears to be helpful especially when transferring from English to German, Portuguese, and Arabic. GILE-MHAN is significantly behind MHAN when transferring from English to Spanish and to Russian in the 0.1% to 0.5% resource setting, but in the remaining cases the two have very similar scores.
Label sampling.
To speed up computation, it is possible to train our model by sampling labels instead of training over the whole label set. How much speed-up can we achieve with this label sampling approach while still retaining good performance? In Figure 2, we attempt to answer this question by reporting the performance of GILE-HNN when varying the percentage of labels it uses for training over the English general and specific labels of the DW data set. In both cases, the performance of GILE-HNN tends to increase as the percentage of sampled labels increases, but it levels off at the higher percentages.
For general labels, top performance is reached with a 40% to 50% sampling rate, which translates to a 22% to 18% speed-up, whereas for the specific labels it is reached with a 60% to 70% sampling rate, which translates to a 40% to 36% speed-up. The speed-up is correlated with the size of the label set, since there are far fewer general labels than specific ones, namely, 327 vs. 1,058 here. Hence, we expect even higher speed-ups for bigger label sets. Interestingly, GILE-HNN with label sampling reaches the performance of the baseline with a 25% and a 60% sample for general and specific labels, respectively. This translates to a speed-up of 30% and 50%, respectively, compared with GILE-HNN trained over all labels. Overall, these results show that our model is effective and that it can also scale to large label sets. Label sampling should also be useful in settings where computational resources are limited or budgeted.
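For concreteness, the sketch below shows one simple way to implement the negative label sampling described above; the function name and exact sampling scheme are illustrative rather than the implementation used in our experiments. At each training step, the loss would be computed over all positive labels of a document plus a random subset of its negatives.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_label_subset(y_true, sample_rate):
    """Indices of all positive labels plus a random fraction of the negative labels."""
    positives = np.flatnonzero(y_true == 1)
    negatives = np.flatnonzero(y_true == 0)
    n_neg = max(1, int(sample_rate * len(negatives)))
    sampled_neg = rng.choice(negatives, size=n_neg, replace=False)
    return np.concatenate([positives, sampled_neg])

y_true = np.zeros(1058, dtype=int)      # e.g., the 1,058 specific labels mentioned above
y_true[[3, 17, 250]] = 1                # a document with three gold labels
subset = sample_label_subset(y_true, sample_rate=0.6)   # 60% sampling rate, cf. Figure 2
print(len(subset))                      # the loss is computed only over this label subset
```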
5 Related Work
5.1 Neural Text Classification
Research in neural text classification was initially based on feed-forward networks, which required unsupervised pre-training (Collobert et al., 2011; Mikolov et al., 2013; Le and Mikolov, 2014); later work focused on networks with hierarchical structure. Kim (2014) proposed a convolutional neural network (CNN) for sentence classification. Johnson and Zhang (2015) proposed a CNN for high-dimensional data classification, while Zhang et al. (2015) adopted a character-level CNN for text classification. Lai et al. (2015) proposed a recurrent CNN to capture sequential information, which outperformed simpler CNNs. Lin et al. (2015) and Tang et al. (2015) proposed hierarchical recurrent neural networks and showed that they were superior to CNN-based models. Yang et al. (2016) demonstrated that a hierarchical attention network with bidirectional gated encoders outperforms previous alternatives. Pappas and Popescu-Belis (2017) adapted such networks to learn hierarchical document structures with shared components across different languages.
The issue of scaling to large label sets has been addressed previously with output layer approximations (Morin and Bengio, 2005) and with the use of sub-word units or character-level modeling (Sennrich et al., 2016; Lee et al., 2017), which is mainly applicable to structured prediction problems. Despite these numerous studies, most existing neural text classification models ignore label descriptions and semantics. Moreover, they are based on typical output layer parametrizations that depend on the label set size, and thus can neither scale well to large label sets nor generalize to unseen labels. Our output layer parametrization addresses these limitations and could potentially improve such models.
5.2 Output Representation Learning
There exist studies that aim to learn output representations directly from data without any semantic grounding in word embeddings (Srikumar and Manning, 2014; Yeh et al., 2018; Augenstein et al., 2018). Such methods have a label-set-size dependent parametrization, which makes them data-hungry, less scalable to large label sets, and incapable of generalizing to unseen classes. Wang et al. (2018) addressed the lack of semantic grounding in word embeddings by proposing an efficient method based on label-attentive text representations that are helpful for text classification. However, in contrast to our study, their parametrization is still label-set-size dependent, and thus their model can neither scale well to large label sets nor generalize to unseen labels.
5.3 Zero-shot Text Classification
Several studies have focused on learning joint input-label representations grounded in word semantics for unseen label prediction on images (Weston et al., 2011; Socher et al., 2013; Norouzi et al., 2014; Zhang et al., 2016; Fu et al., 2018), called zero-shot classification. However, there are fewer such studies for text classification. Dauphin et al. (2014) predicted semantic utterances of text by mapping them into the same semantic space as the class labels using an unsupervised learning objective. Yazdani and Henderson (2015) proposed a zero-shot spoken language understanding model based on a bilinear input-label model able to generalize to previously unseen labels. Nam et al. (2016) proposed a bilinear joint document-label embedding that learns shared word representations between documents and labels. More recently, Shu et al. (2017) proposed an approach for open-world classification that aims to identify novel documents during testing, but it is not able to generalize to unseen classes. Perhaps the model most similar to ours is that of the recent study by Pappas et al. (2018) on neural machine translation, with the difference that they have single-word label descriptions and use a label-set-dependent bias in a softmax linear prediction unit, which is designed for structured prediction. Hence, unlike ours, their model can handle neither unseen labels nor multi-label classification.
Compared with previous joint input-label models, the proposed model has a more general and flexible parametrization, which allows the output layer capacity to be controlled. Moreover, it is not restricted to linear mappings, which have limited expressivity, but uses nonlinear mappings, similar to energy-based learning networks (LeCun et al., 2006; Belanger and McCallum, 2016). The link to the latter can be made if we regard the score in Equation (11) as an energy function for the i-th document and the j-th label, the calculation of which uses a simple multiplicative transformation (Equation (10)). Lastly, the proposed model performs well on both seen and unseen label sets by leveraging the binary cross-entropy loss, which is the standard loss for classification problems, instead of a ranking loss.
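Written generically, with f and g standing for the two nonlinear projections into the joint space (illustrative notation; the exact forms are those of Equations (9)-(11)), this reading is:

```latex
e_{ij} \;=\; \mathbf{w}^{\top}\!\bigl( f(\mathbf{h}_i) \odot g(\mathbf{e}_j) \bigr) + b,
\qquad
P(y_{ij} = 1 \mid x_i) \;=\; \sigma(e_{ij})
```

where e_{ij} plays the role of the energy of the document-label pair and σ is the sigmoid of the classification unit.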
6 Conclusion
We proposed a novel joint input-label embedding model for neural text classification that generalizes over existing input-label models and addresses their limitations while preserving high performance on both seen and unseen labels. Compared with baseline neural models with a typical output layer, our model is more scalable and has better performance on seen labels. Compared with previous joint input-label models, it performs significantly better on unseen labels without compromising performance on seen labels. These improvements can be attributed to the ability of our model to capture complex input-label relationships, to its controllable capacity, and to its training objective, which is based on cross-entropy loss.
As future work, the label representation could be learned by a more sophisticated encoder, and the label sampling could benefit from importance sampling to avoid revisiting uninformative labels. Another interesting direction would be to find a more scalable way of increasing the output layer capacity—for instance, using a deep rather than a wide classification network. Moreover, adapting the proposed model to structured prediction, for instance by using a softmax classification unit instead of a sigmoid one, would benefit tasks such as neural machine translation, language modeling, and summarization in isolation but also when trained jointly with multi-task learning.
Acknowledgments
We are grateful for the support from the European Union through its Horizon 2020 program in the SUMMA project n. 688139, see http://www.summa-project.eu. We would also like to thank our action editor, Eneko Agirre, and the anonymous reviewers for their invaluable suggestions and feedback.
Notes
Our code is available at: github.com/idiap/gile.
Note that depending on the number of labels per document the problem can be a multi-label or multi-class problem.
This statement holds true for multilingual classification problems, too, if the embeddings are aligned across languages.
Here, the word embeddings are included in the parameter statistics because they are variables of the network.
In our preliminary experiments, we also trained the neural model with a hinge loss, as in WSABIE+ and AiTextML, but it performed similarly to them and much worse than WAN, so we did not experiment with it further.
Namely, avg when using the average of word vectors and inf when using inferred label vectors to make predictions.
The word embeddings are not included in the parameter statistics because they are not variables of the network.