Abstract
Recent years have witnessed increasing interest in developing interpretable models in Natural Language Processing (NLP). Most existing models aim at identifying input features such as words or phrases important for model predictions. Neural models developed in NLP, however, often compose word semantics in a hierarchical manner. As such, interpretation by words or phrases only cannot faithfully explain model decisions in text classification. This article proposes a novel Hierarchical Interpretable Neural Text classifier, called HINT, which can automatically generate explanations of model predictions in the form of label-associated topics in a hierarchical manner. Model interpretation is no longer at the word level, but built on topics as the basic semantic unit. Experimental results on both review datasets and news datasets show that our proposed approach achieves text classification results on par with existing state-of-the-art text classifiers, and generates interpretations more faithful to model predictions and better understood by humans than other interpretable neural text classifiers.1
1 Introduction
Deep Learning (DL) models have achieved state-of-the-art performance in many NLP tasks (Devlin et al. 2019; Yang et al. 2019; Brown et al. 2020; Yan et al. 2021). A deep neural network containing many layers is usually viewed as a black box that has limited interpretability. Recently, the field of explainable AI has exploded with various new approaches proposed to address the problem of the lack of interpretability of deep learning models (Lipton 2018; Jacovi and Goldberg 2020; Ribeiro et al. 2020).
Methods for the interpretation of DL models can be broadly classified into post-hoc interpretation methods and self-explanatory methods. The former typically aim to establish the relationship between changes in the prediction output and changes in the input of a DL model in order to identify features important for model decisions. For example, Jawahar, Sagot, and Seddah (2019) used probing to examine BERT intermediate layers. Abdou et al. (2020) modified input text by linguistic perturbations and observed their impacts on model outputs. Selvaraju et al. (2020) tracked the impact of gradient changes. Kim et al. (2020) erased word tokens from input text by marginalizing out the tokens. On the other hand, self-explanatory models are able to generate explanations during model training by “twinning” a black-box Machine Learning model with transparent modules. For example, in parallel to model learning, an additional module is trained to interpret model behavior and is used to regularize the model for interpretability (Alvarez-Melis and Jaakkola 2018; Rieger et al. 2020). Such models, however, usually require expert prior knowledge or annotated data to guide the learning of interpretability modules. Chen and Ji (2020) proposed to improve the interpretability of neural text classifiers by inserting variational word masks into the classifier after the word embedding layer in order to filter out noisy word-level features. The interpretations generated by their model are only at the word level and ignore hierarchical semantic compositions in text.
We argue that existing word-level or phrase-level interpretations are not sufficient for interpreting text classifier behaviors, as documents tend to exhibit topic and label shifts. It is therefore more desirable to explore hierarchical structures to capture semantic shifts in text at different granularity levels (O’Hare et al. 2009; Lin et al. 2012; Yang et al. 2016; Wang et al. 2020; Arnold et al. 2019; Xie et al. 2021; Gui and He 2021). Moreover, simply establishing the relationship between the changes in the input and the changes in the output in a DL model could identify features that are important for predictions, but ignores subtle interactions among input features. Recent approaches have been developed to build explanations through detecting feature interactions (Singh, Murdoch, and Yu 2019; Chen, Zheng, and Ji 2020; Jin et al. 2020; Gui et al. 2022). Nevertheless, they are only able to identify sub-text spans that are important for model decisions and largely focus on sentence-level classification tasks.
We speculate that a good interpretation model for text classification should be able to identify the key latent semantic factors and their interactions that contribute to the model’s final decision. This is often beyond what word-level interpretations could capture. To this end, considering the hierarchical structure of input documents, we propose a novel Hierarchical Interpretable Neural Text classifier, called Hint, which can generate interpretations in a hierarchical manner. One example output generated by Hint is shown in Figure 1 in which a review document consisting of 6 sentences (shown in the upper box) is fed to a classifier for the prediction of a sentiment label. Traditional interpretation methods can only identify words that are indicative of sentiment categories, as shown in the middle left box in Figure 1. However, it is still unclear how these words contribute to the document-level sentiment label, especially when there are words with mixed polarities. Moreover, humans may also be interested in topics discussed in the document and their associated sentiments, and how they are combined to reach the final document-level class label. The middle center box highlights the important topic words in each sentence. Note that traditional post-hoc word-level interpretation methods would not be able to identify these words since they are less relevant to sentiment class labels when considered in isolation. The middle right box in Figure 1 shows the associated topic for each sentence and its respective sentiment label. For example, sentence S2 is associated with the topic “Actor performance” and has a negative polarity. The lower box shows the topic partition of sentences based on their topic semantic similarities. Such hierarchical explanations (from word-level label-dependent and label-independent interpretations, to sentence-level topics with their associated labels, and finally to the document-level topic partition) are generated automatically by our proposed approach. The only supervision required for model learning is documents paired with their class labels.
As will be shown in the Experiments section, our proposed approach achieves classification performance comparable to existing state-of-the-art neural text classifiers when evaluated on three document classification datasets. Moreover, it generates interpretations that are more faithful to model predictions and better understood by humans than word-level interpretation methods. In summary, our contributions are threefold:
We propose a neural text classifier with built-in interpretability that can generate hierarchical explanations by identifying both label-dependent and topic-related words at the word-level, detecting topics and their associated labels at the sentence-level, and finally producing the document-level topic and sentiment composition.
The evaluation of explanations generated by our approach shows that it produces interpretations that are better understood by humans and more faithful to model predictions than existing word-level interpretation methods.
Experimental results show that our proposed approach performs on par with the existing state-of-the-art methods on the three document classification datasets.
2 Related Work
Our work is related to the following lines of research:
Post-hoc Interpretation.
Post-hoc interpretation methods typically aim to identify the contribution of input attributes or features to model predictions. For example, Wu et al. (2020) proposed a perturbation-based method to interpret pre-trained language models used for dependency parsing. Niu et al. (2020) evaluated the robustness and interpretability of neural-network-based machine translation models by slightly perturbing input. Abdou et al. (2020) proposed a new dataset for evaluating model interpretability by seven different types of linguistic perturbations. Kim et al. (2020) argued that interpretation methods that measure the changes in prediction probabilities by erasing word tokens from input text may face the out-of-distribution (OOD) problem. They proposed to marginalize out a token in an input sentence to mitigate the OOD problem. Jin et al. (2020) and Chen, Zheng, and Ji (2020) built hierarchical explanations through detecting feature interactions. But their models can only identify sub-text spans that are important for model decisions and largely focus on sentence-level classification tasks.
Self-Explanatory Models.
Different from post-hoc interpretation, self-explanatory methods aim to generate explanations during model training with the interpretability naturally built in. Existing work utilizes mutual information (Chen et al. 2018; Guan et al. 2019), attention signals (Zhou, Zhang, and Yang 2020), Bayesian networks (Chen et al. 2020; Tang, Hahn-Powell, and Surdeanu 2020), or the information bottleneck (Alvarez-Melis and Jaakkola 2018; Bang et al. 2021) to identify the key attributes or features from input data. For example, Zhou, Zhang, and Yang (2020) used a variational autoencoder-based classifier to identify operational risk in model training. Chen et al. (2020) recognized named entities from clinical records and used a Bayesian network to obtain interpretable predictions. Zhang et al. (2020) proposed an interpretable relation recognition approach via Bayesian Structure Learning. Tang, Hahn-Powell, and Surdeanu (2020) proposed a rule-based decoder to generate rules for model explanation. Zanzotto et al. (2020) proposed a kernel-based encoder with an interpretable embedding metric to visualize how syntax is used in inference. Jiang et al. (2020) incorporated regular expressions into recurrent neural network training for cold-start scenarios in order to obtain interpretable outputs. Chen and Ji (2020) proposed variational word masks (VMASK) that are inserted into a neural text classifier after the word embedding layer in order to filter out noisy word-level features, forcing the classifier to focus on important features to make predictions.
In general, existing self-explanatory methods mainly focus on tracking the influence of input features on model outputs and using this influence as a constraint for model learning, but they ignore the subtle interplay of input attributes. In this article, we propose a novel hierarchical interpretation model, which generates interpretations at different granularity levels and achieves classification performance on par with existing state-of-the-art neural classifiers.
Interpretation Based on Attentions.
Attention mechanisms have been widely used in neural architectures applied to various NLP tasks. It is common to use attention weights to interpret models’ predictive decisions (Li, Monroe, and Jurafsky 2016; Lai and Tan 2019; De-Arteaga et al. 2019). In recent years, however, there has been work showing that attention is not a valid explanation. For example, Jain and Wallace (2019) found that it is possible to identify alternative attention weights after the model is trained that produce the same predictions. Serrano and Smith (2019) modified attention weights in already-trained text classification models and analyzed the resulting differences in their predictions. They concluded that attention cannot be used as a valid indicator for model predictions. While the aforementioned work modified attention weights in a post-hoc manner after a model was trained, Pruthi et al. (2020) proposed modifying attention weights during model learning and produced models whose actual weights could lead to deceptive interpretations. Wiegreffe and Pinter (2019) questioned the validity of the claims in prior work (Jain and Wallace 2019) and proposed an alternative experimental design to test when/whether attention can be used as an explanation. Their results showed that prior work does not disprove the usefulness of attention mechanisms for explainability.
3 Hierarchical Interpretable Neural Text Classifier (Hint)
Our proposed Hierarchical Interpretable Neural Text (Hint) classification model is shown in Figure 2. For each sentence in an input document, a dual representation learning module (§3.1) is used to generate the contextual representation guided by the document class label and the latent topic representation, from which the word-level interpretations can be generated. To aggregate the sentences with similar topic representations, we create a fully connected graph (§3.2) whose nodes are initialized by the sentence contextual representations and edge weights are topic similarity values of the respective sentence nodes. Sentence interactions are captured by a single-layer Graph Attention Network to derive the document representation for classification. In what follows, we describe each of the modules of Hint in detail. The notations used in this article are shown in Table 1.
| Symbol | Description |
|---|---|
| Sentence Representation Learning Module | |
| N | The dimension of input word embeddings and learned sentence embeddings. |
| xij ∈ ℝN | The input embedding of the j-th word in the i-th sentence. |
| si ∈ ℝN | The learned embedding of the i-th sentence. |
| biLSTMϕ | The bidirectional LSTM encoder with learnable parameter set ϕ. |
| uij ∈ ℝN/2 | The attention vector for the j-th word in the i-th sentence. |
| aij ∈ ℝ | The attention signal for the j-th word in the i-th sentence. |
| Topic Representation Learning Module | |
| K | The dimension of topic embeddings. |
| βij | The topic-based attention signal for the j-th word in the i-th sentence. |
| ω | The parameter set in topic-based attention learning, which includes Wμ, Wω ∈ ℝN×K and bμ, bω ∈ ℝK. |
| ri ∈ ℝN | The representation of the i-th sentence derived from the word-level topic attentions βij. |
| zi ∈ ℝK | The topic representation of the i-th sentence. |
| ri′ ∈ ℝN | The reconstructed representation of the i-th sentence produced by the learned autoencoder. |
|  | The i-th regularization term and its corresponding weight λi. |
| zik | The probability that sentence i belongs to the k-th topic, also written as P(tk\|si). |
|  | The occurrence probability of the k-th topic in document d, also defined as P(tk\|d). |
| Document Representation Learning Module | |
| cij | The similarity between the i-th and j-th sentences based on the learned topic representations. |
| eij | The static edge weight derived by normalizing cij. |
|  | The representation of the i-th sentence learned by the graph attention network at the l-th iteration; it is initialized by si. |
| wd | The representation of document d. |
| ηa, ηb | The weights of the different terms in the loss function for document representation learning. |
3.1 Dual Module for Sentence-Level Representation Learning
The dual module captures the sentence contextual and latent topic information separately. In particular, we hoped that the sentence-level context representation would capture the label-dependent semantic information, while the sentence-level topic representation would encode label-independent semantic information shared across documents regardless of their class labels.
3.1.1 Context Representation Learning
In the sentence-level context representation learning module shown in Figure 3a, the goal is to capture the contextual representation of a sentence with word-level label-relevant features. We choose a bidirectional LSTM (biLSTM) network, which captures the contextual semantic information conveyed in a sentence, with an attention mechanism that can capture the task-relevant weights for interpretation (Yang et al. 2016).
The aforementioned approach of producing sentence-level contextual representations is a typical way of encoding sentence semantics. When used in building neural classifiers, we would expect such representations to implicitly capture the class label information. More concretely, the word-level attention weights can be used to identify words that are important for text classification decisions. Taking sentiment classification as an example, as has been previously shown in hierarchically stacked LSTM or GRU networks, words with higher attention weights are often indicative of polarities (Yang et al. 2016).
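As a concrete illustration, below is a minimal PyTorch sketch of such a biLSTM-plus-attention sentence encoder. The layer sizes (hidden size 150, attention size 200) follow the settings listed in Appendix C, but the exact implementation details and names are our own assumptions rather than the released code.

```python
import torch
import torch.nn as nn

class ContextSentenceEncoder(nn.Module):
    """Sketch of the context branch: biLSTM + word-level attention -> sentence vector s_i."""
    def __init__(self, emb_dim=300, hidden_dim=150, attn_dim=200):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.attn_proj = nn.Linear(2 * hidden_dim, attn_dim)
        self.attn_score = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, word_embs):              # word_embs: (batch, seq_len, emb_dim)
        h, _ = self.bilstm(word_embs)          # contextualized word states: (batch, seq_len, 2*hidden_dim)
        u = torch.tanh(self.attn_proj(h))      # word-level attention vectors u_ij
        a = torch.softmax(self.attn_score(u).squeeze(-1), dim=-1)  # attention weights a_ij
        s = torch.bmm(a.unsqueeze(1), h).squeeze(1)  # attention-weighted sentence vector s_i
        return s, a                            # a_ij can be inspected for label-dependent words
```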
3.1.2 Topic Representation Learning
Sentence-level contextual representations learned in Section 3.1.1 implicitly inject label information into si and capture label-dependent word features. Here, we propose using a label-independent approach to capture the hidden relationships between input words and infer latent topics, which can subsequently be used to determine their potential contributions to class labels, in order to enhance generalization and interpretability.
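To make the topic branch easier to follow, here is a minimal sketch of how it could be implemented, loosely following the architecture summarized in Appendix C (a TFIDF-weighted initial sentence vector, a reparameterized latent variable ω, topic attention β over words, and an autoencoder whose bottleneck zi is the topic distribution). The reconstruction objective (mean squared error) and the variable names are assumptions, not the exact formulation of Hint.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopicSentenceEncoder(nn.Module):
    """Sketch of the label-independent topic branch of a sentence."""
    def __init__(self, emb_dim=300, num_topics=50):
        super().__init__()
        self.mu = nn.Linear(emb_dim, emb_dim)       # produces mu_omega
        self.logvar = nn.Linear(emb_dim, emb_dim)   # produces log(sigma_omega^2)
        self.encoder = nn.Linear(emb_dim, num_topics)   # W_c, b_c
        self.decoder = nn.Linear(num_topics, emb_dim)   # W_c', b_c'

    def forward(self, word_embs, word_weights):
        # word_embs: (L, emb_dim); word_weights: (L,) normalized TFIDF values of the words
        x = word_weights @ word_embs                    # initial TFIDF-weighted sentence vector
        mu, logvar = self.mu(x), self.logvar(x)
        omega = F.softmax(mu + torch.randn_like(mu) * torch.exp(0.5 * logvar), dim=-1)
        beta = F.softmax(F.relu(word_embs @ omega), dim=-1)   # topic attention beta over words
        r = beta @ word_embs                            # beta-weighted sentence representation r_i
        z = F.softmax(self.encoder(r), dim=-1)          # topic distribution z_i
        r_rec = torch.tanh(self.decoder(z))             # reconstruction r_i'
        recon_loss = F.mse_loss(r_rec, r)               # assumed reconstruction objective
        return z, beta, recon_loss
```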
Regularization Terms.
In the following, we introduce a number of regularization terms used in our model.
Orthogonal regularization.
Topic uniqueness regularization.
Topic discrepancy regularization.
Based on the topic representations, zi, we can essentially partition text into different groups. Inspired by Johansson, Shalit, and Sontag (2016), we propose another regularization term to re-weigh different partitions.
Intuitively, we want to reduce the discrepancy between different latent topics, weighted by the posterior probability of topics given the text, in order to prevent the learner from using “unreliable” topics of the data when trying to generalize from the factual to the counterfactual domains. For example, if in our movie reviews very few people mentioned the topic of “source effect,” inferring the attitude toward this topic is highly prone to error. As such, the importance of this topic should be down-weighted.
The above result states that if two input topics have similar representations, or have higher occurrence probabilities in the current document, they will obtain larger updates that push their representations closer to the representation of the current document calculated based on the mean pooling of its constituent sentences. Essentially, the discrepancy term separates the document representations based on the topic similarities, then updates the corresponding topic distribution based on the topic occurrence probabilities and the document representation.
Final objective function.
3.2 Document Modeling
After obtaining the sentence-level context representations si and latent topic representations zi, the next step is to aggregate such representations to derive the document-level representation for classification.
Graph Node Update.
We represent each document by a graph in which the nodes represent sentences in the document d and the edges linking every two nodes measure their topic similarities. Each graph node is initialized by its respective sentence-level context representation si. The topic similarity cij between the i-th sentence and the j-th sentence is defined as the inner product of their latent topic vectors, cij = zi · zj.
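A minimal sketch of the graph construction and a single GAT-style update is given below. How Hint combines the static topic-similarity edge weights with the learned attention coefficients, and how the document vector is read out, are assumptions here rather than the exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def topic_similarity_edges(Z):
    """Z: (M, K) sentence topic vectors -> normalized static edge weights e_ij."""
    c = Z @ Z.t()                          # c_ij: inner products of topic vectors
    return F.softmax(c, dim=-1)            # row-normalize to obtain e_ij

class OneLayerGAT(nn.Module):
    """A single graph-attention layer over the fully connected sentence graph (a sketch)."""
    def __init__(self, in_dim=300, out_dim=50):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, S, e):
        # S: (M, in_dim) sentence context vectors; e: (M, M) static topic-similarity weights
        h = self.proj(S)                                       # (M, out_dim)
        M = h.size(0)
        pairs = torch.cat([h.unsqueeze(1).expand(M, M, -1),
                           h.unsqueeze(0).expand(M, M, -1)], dim=-1)
        alpha = F.softmax(F.leaky_relu(self.attn(pairs).squeeze(-1)), dim=-1)
        alpha = alpha * e                                      # modulate attention by topic similarity
        alpha = alpha / alpha.sum(dim=-1, keepdim=True)        # renormalize per node
        h_new = torch.relu(alpha @ h)                          # updated node representations
        doc = h_new.mean(dim=0)                                # document readout (assumed mean pooling)
        return h_new, doc
```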
Document Classification.
4 Model Interpretation Generation
For a given document d, the proposed model can not only predict a label, but also generate a hierarchical interpretation for its prediction. Taking the document in Figure 1 as an example, we will elaborate below how to generate the word- and sentence-level explanations, as well as how to aggregate the hierarchical information to generate the final model prediction.
Word-Level Interpretation Generation.
The dual module learns a context representation and a topic representation of a sentence. The approach of producing the context representations is a typical way of encoding sentence semantics (§3.1.1). When used in building neural classifiers, we would expect that such representations implicitly capture the class label information. More concretely, the word-level attention weights can be used to identify words that are associated with the class label. For the example shown in Figure 1, label-relevant words such as “worst” are indicative of the negative polarity. The latent topic learning module (§3.1.2) aims to capture latent topics shared across all documents regardless of their class labels. The word-level attention weights are generated by a stochastic process as shown in Equation (4). It can be observed that words identified in this way are more topic-related (such as “production” and “actors”) and are less relevant to the class label.
To extract topic words (the word cloud in Figure 1), we first multiply the word embedding matrix E ∈ ℝV×N with the weight matrix Wc ∈ ℝN×K from the topic encoder network (Equation (8) in §3.1.2), where V denotes the vocabulary size, N is the word embedding dimension, and K is the number of topics. From the resulting matrix π ∈ ℝV×K, we can then extract the top n words from each topic dimension as the topic words (Chaney and Blei 2021). In the following, we explain why each column in π can be considered as a topic.
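A small sketch of this extraction step, with illustrative variable names:

```python
import numpy as np

def top_topic_words(E, W_c, vocab, n=10):
    """E: (V, N) word embedding matrix; W_c: (N, K) topic-encoder weights;
    vocab: list of V words. Returns the top-n words for each of the K topics."""
    pi = E @ W_c                              # (V, K): word-topic association scores
    topics = []
    for k in range(pi.shape[1]):
        top_ids = np.argsort(-pi[:, k])[:n]   # indices of the n highest-scoring words for topic k
        topics.append([vocab[i] for i in top_ids])
    return topics
```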
Sentence-Level Interpretation Generation.
From the sentence-level latent topic representation, we can identify the most prominent topic dimension in the hidden topic vector zi and use it as the topic label for each sentence. As illustrated in the lower part (word cloud) of Figure 1, the topics that correspond to the six sentences can be summarized as “Awful movie,” “Actor performance,” “Injury,” and “Inconsistent Plot,” from left to right. Here, we represent each topic as a word cloud, which contains the top-10 topic-associated words from the corpus vocabulary as shown in Figure 1. The topic labels are manually assigned for better illustration. We can also automatically generate topic labels by selecting the most relevant phrase from the document to represent each topic (see Figure 10). Specifically, for each topic, we first find the most relevant sentence according to the sentence-topic distribution, where M is the number of sentences in a document, d is the dimension of a sentence representation, and S and Z denote the sentence contextual representations and the topic representations, respectively. Then, we extract the key phrase from the sentence via the Rapid Automatic Keyword Extraction algorithm. We infer the class label of each sentence by feeding the sentence contextual representation into the classification layer of Hint and obtain probabilities of class labels, thus obtaining its class-associated intensity.
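A sketch of the sentence-level step, assuming access to the trained classification head (classifier_head is a placeholder name for that layer):

```python
import torch
import torch.nn.functional as F

def sentence_level_interpretation(Z, S, classifier_head):
    """Z: (M, K) sentence topic distributions; S: (M, N) sentence context vectors;
    classifier_head: the classification layer of the trained model (assumed callable).
    Returns, per sentence, its dominant topic id and class-label probabilities."""
    topic_ids = Z.argmax(dim=-1)                             # most prominent topic per sentence
    with torch.no_grad():
        label_probs = F.softmax(classifier_head(S), dim=-1)  # class-associated intensity per sentence
    return topic_ids.tolist(), label_probs
```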
Document-Level Interpretation Generation.
Once the topic and class label for each sentence is obtained, we can aggregate sentences based on the similarity of their latent topic representations. The contextual representation of the document is obtained by taking the weighted aggregation of its constituent sentence contextual representations, where the weights are the topic similarity values. As sentences are assigned to various topics, we can easily study how topics and their associated class labels change throughout the document. In addition, we can also infer the most prominent topic in the document.
5 Experimental Setup
5.1 Datasets
We conduct experiments on three English document datasets, including two review datasets: patient reviews extracted from Yelp and the IMDB movie reviews (Maas et al. 2011), as well as the Guardian News dataset. For Yelp reviews, we retrieve patient reviews based on a set of predefined keywords (listed in Appendix B). Each review is accompanied by keywords indicating its associated healthcare categories. Because the majority of reviews have ratings of either 1 or 5 stars, we only keep the reviews with 1 and 5 stars as negative and positive instances, respectively. The IMDB dataset also has two class categories (positive and negative). Yelp has twice as many positive reviews as negative ones, while IMDB has a balanced class distribution. As the IMDB dataset does not provide a train/test split, we follow the same split proportion as that used in the implementation of Scholar. The Guardian News dataset contains 5 categories, namely, Sports, Politics, Business, Technology, and Culture, in which nearly 40% of the documents are in the Sports category and less than 10% and 5% of the documents are in the Technology and Culture categories, respectively. The data statistics are shown in Table 2.
5.2 Baselines
We compare our approach with the following baselines:
CNN: In our experiments, the kernel sizes are 3,4,5, and the number of kernels of each size is 100.
LSTM: For each document, words are fed into LSTM sequentially and composed by mean pooling. A softmax layer is stacked to generate the class prediction. We also report the results of LSTM with an attention mechanism (LSTM+Att).
HAN (Yang et al. 2016): The Hierarchical Attention Network stacks two bidirectional Gated Recurrent Units and applies two levels of attention mechanisms at the sentence-level and at the document-level, respectively.
BERT (Devlin et al. 2019): We feed each document into BERT as a long sequence with sentences separated by the [SEP] token, which is fine-tuned on our data. We truncate documents with length over 512 tokens and use the representation of the [CLS] token for classification.
Scholar (Card, Tan, and Smith 2018): A neural topic model trained with variational autoencoder (Kingma and Welling 2014) with document-level class labels incorporated as supervised information. Scholar essentially learns a latent topic representation of an input document and then predicts the class label conditional on the latent topic representation.
VMASK (Chen and Ji 2020): The model applies a variational word masking strategy to mask out unimportant words and thereby improve the interpretability of neural text classifiers. During training, the binary mask is derived from the Gumbel-softmax operator applied to a non-linear transformation of an input sentence, and element-wise multiplication is then applied to the mask and the sentence to remove the unimportant words. At inference time, they instead use softmax to obtain a softened version of the mask. We report results from two variants of VMASK, with the text input encoded either by BERT or by LSTM.
5.3 Data Pre-processing
For the IMDB reviews, we use the processed IMDB dataset provided by Scholar. For both review datasets, we set the maximum sentence length to 60 words and the maximum document length to 10 sentences. We only keep the most frequent 15,000 words in the training set, and mark the other words as [unk]. Sentences with more than 30% [unk] tokens are removed from our training set. For the Guardian news data, we download the dataset from Kaggle and follow its provided train/test split. We set the maximum sentence length to 60 words and the maximum document length to 18 sentences.
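A minimal sketch of this preprocessing pipeline (tokenization is assumed to have been done already; the helper names are illustrative):

```python
from collections import Counter

def build_vocab(tokenized_docs, max_size=15000):
    """Keep the most frequent max_size words from the training set; all others become [unk]."""
    counts = Counter(w for doc in tokenized_docs for sent in doc for w in sent)
    return {w for w, _ in counts.most_common(max_size)}

def preprocess(doc, vocab, max_sents=10, max_words=60, max_unk_ratio=0.3):
    """doc: list of tokenized sentences. Truncates the document and its sentences and,
    for training data, drops sentences with more than 30% [unk] tokens."""
    out = []
    for sent in doc[:max_sents]:
        mapped = [w if w in vocab else "[unk]" for w in sent[:max_words]]
        if mapped and mapped.count("[unk]") / len(mapped) <= max_unk_ratio:
            out.append(mapped)
    return out
```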
6 Experimental Results
6.1 Text Classification Results
The text classification results are shown in Table 3. Methods marked with † are re-implemented by us. As shown in Table 3, the vanilla classification models, such as CNN and LSTM, show inferior performance across the three datasets. With the incorporation of the attention mechanism, LSTM-att slightly improves over LSTM. HAN was built on bidirectional GRUs but with two levels of attention mechanisms at the word- and the sentence-level, and it outperforms LSTM-att. BERT was built on the Transformer architecture, but it gives slightly worse results compared to HAN. The hierarchical modeling in HAN may explain its comparatively superior performance. The neural topic modeling approach, Scholar, performs better than CNN, but slightly worse than the other baselines on IMDB and Yelp. VMASK learns to assign different weights to word-level features by minimizing the classification loss. Its BERT variant generally outperforms the LSTM variant and gives the best results among the baselines. We have additionally performed a statistical significance test, the Student’s t-test, to compare the performance of Hint with VMASK-BERT by training both models 10 times, and show the results in Table 3. In general, Hint outperforms all baselines and the improvement is more prominent on the largest Guardian News dataset, which has the longest average document length.
| Methods | IMDB | Yelp | Guardian |
|---|---|---|---|
| CNN† | 83.36 | 94.16 | 92.82 |
| LSTM† | 87.30 | 97.10 | 93.57 |
| LSTM-att† | 87.56 | 97.30 | 93.97 |
| HAN | 87.92 | 97.70 | 94.34 |
| BERT | 87.59 | 97.52 | 94.28 |
| Scholar | 86.10 | 96.87 | 93.97 |
| VMASK-BERT | 88.23*** | 98.10** | 94.49** |
| VMASK-LSTM | 87.40 | 98.04 | 93.79 |
| Hint | 89.11*** | 98.42** | 95.38** |
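The significance test mentioned above can be reproduced with a few lines of SciPy; the accuracy lists below are illustrative placeholders, and since the paper does not say whether a paired or unpaired test was used, an unpaired Student's t-test is assumed here.

```python
from scipy import stats

# Accuracies from 10 training runs of each model (illustrative placeholder numbers).
hint_acc  = [89.1, 89.0, 89.2, 88.9, 89.3, 89.0, 89.1, 89.2, 88.8, 89.1]
vmask_acc = [88.2, 88.1, 88.4, 88.0, 88.3, 88.2, 88.1, 88.5, 88.0, 88.2]

t_stat, p_value = stats.ttest_ind(hint_acc, vmask_acc)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # a small p-value indicates a significant gap
```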
To further examine the ability of Hint in dealing with imbalanced data, we plot in Figure 4 the per-class precision, recall, and F1 results on the Guardian News data. While Hint generally outperforms VMASK in F1 across all classes, it achieves much better results on minority classes. For example, Hint improves upon VMASK by nearly 8% in precision on the smallest Culture class.
6.2 Topic Evaluation Results
The sentence-level topic representation learning module in the Hint framework generates latent topic vectors that allow us to extract top associated words for each latent topic dimension by the weights of the decoder layer which transforms the latent variables to the reconstructed input. Existing work shows that good latent variables should be able to cluster the high-dimensional text representations into coherent semantic groups (Kingma and Welling 2014). As described in Section 4, we can interpret the top associated words for each latent topic dimension as topic words. In this subsection, we show the topic extraction results by displaying the top 10 words in each latent dimension as a word cloud.
Figures 5a and 5b show the word clouds of the generated example topic words on Yelp and IMDB, respectively. It can easily be inferred from Figure 5a that users express generally positive comments, praising convenient facility locations and competitive pricing, while they complain about the dusty environment and service quality and express negative feelings relating to their diseases. In Figure 5b, we can observe reviewers’ attitudes toward different genres of movies. They like thrillers and animated movies. On the contrary, they show negative feelings toward luxury lifestyles or movies relating to misogyny. These results show that Hint can indeed extract topics discussed under different polarity categories despite using no topic-level polarity annotations for topic learning. Figure 6 shows example topic words, each of which corresponds to one of the five news categories from the Guardian news data.
In addition to visualizing the extracted topics, we also evaluate their quality using three different topic coherence measures: the normalized Pointwise Mutual Information (NPMI), a lexicon-based measure (UCI), and a context-vector-based coherence measure (CV). We compare the results with LDA (Blei, Ng, and Jordan 2003) and Scholar (Card, Tan, and Smith 2018) in Table 4. It can be observed that, overall, Hint gives the best results on Yelp. It performs worse than Scholar on the Guardian News in UCI, but achieves better results in CV and NPMI. On the IMDB dataset, however, Hint only outperforms the other two models in CV and is outperformed by LDA in both NPMI and UCI. One possible reason is that Hint estimates the word probability by context embeddings. Hence, it beats the baselines on context-vector-based coherence, but only achieves comparable performance on the lexicon-based metrics.
| Method | IMDB CV | IMDB NPMI | IMDB UCI | Yelp CV | Yelp NPMI | Yelp UCI | Guardian CV | Guardian NPMI | Guardian UCI |
|---|---|---|---|---|---|---|---|---|---|
| LDA | 0.341 | −0.032 | −1.936 | 0.377 | −0.039 | −1.495 | 0.362 | −0.140 | −2.858 |
| Scholar | 0.351 | −0.057 | −2.010 | 0.424 | −0.061 | −2.188 | 0.373 | −0.286 | −1.207 |
| Hint | 0.401 | −0.068 | −1.992 | 0.445 | −0.039 | −1.385 | 0.423 | −0.108 | −3.085 |
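The coherence measures in Table 4 can be computed, for example, with gensim; the paper does not state which toolkit was used, so this is only one possible way to obtain the CV, UCI, and NPMI scores.

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

def topic_coherence(topics, tokenized_docs, measure="c_npmi"):
    """topics: list of top-word lists (one per topic); tokenized_docs: reference corpus.
    measure: 'c_v', 'c_uci', or 'c_npmi', matching the CV / UCI / NPMI columns in Table 4."""
    dictionary = Dictionary(tokenized_docs)
    cm = CoherenceModel(topics=topics, texts=tokenized_docs,
                        dictionary=dictionary, coherence=measure)
    return cm.get_coherence()
```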
6.3 Interpretability Evaluation
A good interpretation method should give explanations that are (i) easily understood by humans and (ii) indicative of true importance of input features. We conduct both quantitative and human evaluations on the interpretation results generated by Hint.
6.3.1 Word Removal Experiments
A good interpretation model should be able to identify truly important features when making predictions (Alvarez-Melis and Jaakkola 2018). A common evaluation strategy is to remove features identified by the interpretation model, and measure the drop in the classification accuracy (Chen and Ji 2020).
Figure 7 shows the correlation score between the accuracy drop and the number of removed words evaluated on the three datasets. In addition to VMASK, we also include two variants of Hint for comparison, namely, Hint-Context and Hint-Topic. The former masks the words assigned large αij weights by the context learning module; the latter masks the words with large βij attention weights in the topic learning module. Hint masks the top-K unique topic words according to their weights in each topic. It can be observed that simply masking words with higher weights identified by the context learning module or the topic learning module does not give good correlation scores. The results are worse than VMASK, which automatically determines which words to mask based on the information bottleneck theory. Nevertheless, when masking the words identified by Hint, we observe better correlation scores with smaller spreads compared to VMASK, showing the effectiveness of Hint in identifying task-important words.
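A sketch of the word-removal protocol follows; whether removed words are deleted or replaced by [unk] is an assumption (we replace them with [unk] here), and eval_fn is a placeholder for the model's evaluation routine.

```python
def accuracy_drop_after_removal(model, eval_fn, test_docs, words_to_remove):
    """eval_fn(model, docs) -> accuracy. Masks the given words in every test document
    and reports the resulting drop in classification accuracy."""
    base_acc = eval_fn(model, test_docs)
    masked_docs = [
        [[w if w not in words_to_remove else "[unk]" for w in sent] for sent in doc]
        for doc in test_docs
    ]
    return base_acc - eval_fn(model, masked_docs)
```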
In Table 5, we list some of the top-k words removed by Hint and VMASK on the IMDB dataset with the corresponding performance drops. It can be observed that when k is 20, both methods tend to identify task-relevant opinion words as key features for removal, resulting in a similar drop in classification accuracy. As k increases, VMASK still primarily focuses on opinion words, which leads to a further modest accuracy drop. On the contrary, when k increases, Hint starts to extract topic-related words such as “comedies” and “screenwriter.” These words may seem to be task-irrelevant. However, their removal causes a more noticeable accuracy drop. We speculate that such words are highly relevant to the latent topics discussed in text, which in turn are associated with implicit polarities important for the decision of document-level sentiment classification.
| k | Acc. drop (↓) | Method | Removed words |
|---|---|---|---|
| 20 | 1.1% | VMASK | brilliantly best intelligently tough cynicism |
| 20 | 1.2% | Hint | unwatchable highest dramas deeply flawless |
| 40 | 2.8% | VMASK | interesting timeless lacks failed remaining |
| 40 | 3.8% | Hint | perfection comedies screenwriter reporter disagree |
| 60 | 2.3% | VMASK | recommend like pretty suggestion poorly |
| 60 | 4.6% | Hint | scripts mysteries complaint funeral werewolf |
6.3.2 Human Evaluation for Interpretability
We conduct human evaluation to validate the interpretability of our proposed method on the following criteria, inspired by existing methods for human evaluation (Zhou et al. 2020):
Correctness. It measures to what extent users can make correct predictions given the model interpretations. That is, users are asked to predict the document label based on the model-generated interpretation. If the interpretation is correct, then users should be able to predict the document label easily.
Faithfulness. It measures to what extent the generated explanation is faithful to the model prediction.
Informativeness. It measures to what extent the interpretation reveals the key information conveyed in text such as the main topic discussed in text, its associated polarity, and the secondary topic (if there is any) mentioned in text.
We randomly select 100 samples with the interpretations generated by HAN (Yang et al. 2016), VMASK (Chen and Ji 2020), and our model for evaluation. We invite three evaluators, all proficient in English and with at least MSc degrees in Computer Science, to score the interpretations generated on the sampled data on a Likert scale of 1 to 5. Details of the evaluation protocol are presented in Appendix A.
We show interpretations generated from each of the three models in Figure 8 for a movie review with mixed sentiments. The review expresses a negative polarity toward the topic of acting and script and a positive polarity toward the soundtrack, resulting in an overall negative sentiment. For Correctness, the evaluators are required to predict the document label based only on the generated interpretations without reading the document content in detail. We can observe that VMASK highlights both positive and negative words (e.g., “junk” and “enjoy”), making it relatively difficult to infer the document-level sentiment label. HAN additionally provides the sentence-level importance from which we know that sentence S2 is more important than the others and it contains the negative word “avoid.” Compared with the baselines, Hint reveals much richer information. One can easily tell that the document contains mixed sentiments as the first three sentences carry a negative sentiment while the last one bears a positive polarity. In addition, the document discusses three topics (shown as three word clouds), with S1 and S3 associated with the most prominent topic about acting and script, carrying a negative sentiment. Thus, the Hint-generated interpretations align with the model-predicted label (i.e., Faithfulness) and also provide a higher level of Informativeness. The human evaluation results are shown in Table 6. It can be observed that Hint gives the best results on all criteria.
6.3.3 Completeness and Sufficiency on ERASER
We also use the Evaluating Rationales And Simple English Reasoning (ERASER) benchmark (DeYoung et al. 2020) to evaluate model interpretability. ERASER contains seven datasets that are repurposed from existing NLP corpora originally used for sentiment analysis, natural language inference, question answering, and so forth. Each dataset is augmented with human-annotated rationales (supporting evidence) that support output predictions. We select Movie Reviews as our evaluation dataset. In ERASER, Movie Reviews contains only a total of 1,600 documents; a further 200 test samples have been annotated with human rationales, that is, text spans indicative of the document polarity labels. We train all the models on our IMDB dataset and evaluate on the annotated Movie Reviews.
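Following the ERASER definitions (DeYoung et al. 2020), the completeness (comprehensiveness) and sufficiency scores used in this section can be computed as below; predict_proba is a placeholder for the trained model's prediction function.

```python
def completeness_and_sufficiency(predict_proba, doc_tokens, rationale_mask, label):
    """predict_proba(tokens) -> class-probability mapping; rationale_mask: per-token booleans.
    Completeness compares the prediction with and without the rationale;
    sufficiency compares it with the rationale alone."""
    full = predict_proba(doc_tokens)[label]
    without_rationale = predict_proba(
        [t for t, keep in zip(doc_tokens, rationale_mask) if not keep])[label]
    rationale_only = predict_proba(
        [t for t, keep in zip(doc_tokens, rationale_mask) if keep])[label]
    completeness = full - without_rationale   # higher is better
    sufficiency = full - rationale_only       # lower is better
    return completeness, sufficiency
```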
6.3.4 Agreement with Human Rationales
We use the multiple-aspect sentiment analysis dataset BeerAdvocate (McAuley, Leskovec, and Jurafsky 2012), consisting of beer reviews, each of which is annotated with 5 aspects and aspect-level rating scores in the range of 0 to 5. It has been widely used to evaluate rationale extraction models (Bastings, Aziz, and Titov 2019; Lei, Barzilay, and Jaakkola 2016; Li and Eisner 2019; Yu et al. 2021) by calculating the agreement between the annotated sentence-level rationales and model-identified text spans. The common pipeline in much rationale extraction work is to predict binary masks for rationale selection, that is, masking the unimportant text spans and then predicting the sentiment scores based only on the selected rationales. In the HardKuma approach proposed for rationale extraction, constraints are further imposed to guarantee the continuity and sparsity of the selected text spans (Bastings, Aziz, and Titov 2019). More recently, Yu et al. (2021) argued that such a two-component pipeline tends to generate suboptimal results because, even when the first rationale-selection step selects a sub-optimal rationale, the sentiment predictor can still produce a low prediction loss. To overcome this problem, they proposed the Attention-to-Rationale (A2R) approach, which adds an additional predictor that predicts the sentiment scores based on soft attentions as opposed to the selected rationales. During training, the gap between the two predictors, one based on the selected rationales and the other based on soft attentions, is minimized. We show the results of both HardKuma and A2R reported in Yu et al. (2021) in the upper part of Table 7.
| Method | Look Precision | Look Recall | Smell Precision | Smell Recall | Palate Precision | Palate Recall |
|---|---|---|---|---|---|---|
| HardKuma | 81.0 | 69.9 | 74.0 | 72.4 | 45.4 | 73.0 |
| A2R | 84.7 | 71.2 | 79.3 | 71.3 | 64.2 | 60.9 |
| VMASK | 33.8 | 28.5 | 16.0 | 13.5 | 27.0 | 36.8 |
| HAN | 76.1 | 58.2 | 56.0 | 48.1 | 71.6 | 66.0 |
| Hint | 84.4 | 67.0 | 59.4 | 54.8 | 70.4 | 65.1 |
In our experiments, we train Hint and the baselines, VMASK and HAN, on the BeerAdvocate training set, and stop training when the models reach the smallest Mean Square Error on the validation set. Afterward, rationale selection is performed based on the word-level attention for VMASK (we use the top 15% of words as rationales), and based on the sentence-level importance scores for HAN and Hint. For the latter two models, we only extract the top sentence as the rationale for each document in the test set.
Following the setting in A2R (Yu et al. 2021), the overlap between the selected important words or sentences and the gold-standard rationales is calculated as precision and recall values, shown in Table 7. It can be observed that approaches specifically designed for rationale extraction, HardKuma and A2R, give better results than approaches that are not optimized for rationale extraction. VMASK performs the worst as it can only select token-level rationales. Hint outperforms HAN on both the Look and the Smell aspects by a large margin, and the two models give similar results on the Palate aspect.
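A sketch of the agreement computation; here the overlap is measured over token positions, with selected_tokens coming either from the word-level attentions (VMASK) or from the tokens of the top-ranked sentence (HAN and Hint).

```python
def rationale_agreement(selected_tokens, gold_tokens):
    """selected_tokens / gold_tokens: sets of token positions marked as rationales.
    Returns precision and recall of the selection against the human-annotated rationales."""
    if not selected_tokens:
        return 0.0, 0.0
    overlap = len(selected_tokens & gold_tokens)
    precision = overlap / len(selected_tokens)
    recall = overlap / len(gold_tokens) if gold_tokens else 0.0
    return precision, recall
```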
6.4 Ablation Study
To study the effects of different modules in our model, we perform an ablation study and show the results in Table 8. In addition to the accuracy on the three datasets, we also report the interpretability metrics, namely, completeness and sufficiency, for the different variants. For Topic Representation Learning (§3.1.2), we remove the Bayesian inference part that is used to learn the word-level weight βij. That is, rather than using βij to aggregate the word representations xij in order to derive the sentence embedding ri as shown in Figure 3b, we derive the sentence embedding ri using the word-level TFIDF weights to aggregate the word representations xij. We also explore the effects of removing the regularization terms defined in Equations 10 and 17, respectively. Finally, we study the impact of removing the GAT and of varying the number of GAT layers in Document Representation Learning (§3.2). From the results in Table 8, Hint achieves the best performance on accuracy and overall better performance on the interpretability metrics. The variant using uniform weights as an initialization for topic learning shows good interpretability on IMDB. This shows that, with our proposed stochastic learning process for topic-related weights, it does not matter whether the word token weights are initialized by TFIDF or by a uniform distribution. Although using multiple GAT layers fails to improve classification accuracy, the 4-layer GAT has overall better interpretability performance than the other GAT configurations.
| Methods | IMDB Acc (↑) | IMDB Com (↑) | IMDB Suff (↓) | Yelp Acc (↑) | Yelp Com (↑) | Yelp Suff (↓) | Guardian Acc (↑) | Guardian Com (↑) | Guardian Suff (↓) |
|---|---|---|---|---|---|---|---|---|---|
| Hint | 89.11 | 0.21 | 0.11 | 98.52 | 0.22 | 0.09 | 95.37 | 0.16 | 0.05 |
| Remove Bayesian inference for β learning | 89.02 | 0.17 | 0.14 | 98.45 | 0.16 | 0.11 | 95.21 | 0.16 | 0.06 |
| Replace TFIDF with uniform weight | 88.62 | 0.20 | 0.10 | 98.31 | 0.12 | 0.09 | 95.08 | 0.18 | 0.06 |
| w/o Regularization Term 1 (Equation (10)) | 89.03 | 0.18 | 0.11 | 98.49 | 0.11 | 0.07 | 95.20 | 0.15 | 0.07 |
| w/o Regularization Term 2 (Equation (17)) | 89.06 | 0.19 | 0.13 | 98.50 | 0.13 | 0.08 | 95.26 | 0.13 | 0.08 |
| Remove GAT | 89.00 | 0.17 | 0.10 | 98.41 | 0.22 | 0.10 | 94.87 | 0.08 | 0.04 |
| 2-layer GAT | 88.93 | 0.17 | 0.11 | 98.52 | 0.18 | 0.10 | 94.99 | 0.14 | 0.06 |
| 4-layer GAT | 88.85 | 0.18 | 0.07 | 98.50 | 0.18 | 0.09 | 93.58 | 0.13 | 0.04 |
6.5 Case Study
To show the capability of Hint in dealing with documents with mixed sentiments, we select one document from the IMDB dataset to illustrate the interpretations generated in Figure 10. The figure consists of three parts. The top part shows the word-level interpretations in the form of label-dependent words (in yellow) and label-independent words (in blue), as well as the sentence-level sentiment labels (sentence IDs highlighted with red or green colors). The middle part shows a heat map illustrating the association strengths between sentences and topics with a darker value indicating a stronger association. The lower part presents a bar chart showing the sentiment strength of each topic with the green and the red color for the positive and the negative sentiment, respectively. To make it easier to understand what each topic is about, we automatically extract the most relevant text span in the document to represent each topic (shown under the bar chart) by the approach described in Section 4.
Our model derives the document label by aggregating the sentence-level context representations weighted by their topic similarities (see §3.2). From the sentence-topic heat map in Figure 10, we can observe that Topic 1 appears to be the most prominent topic in the document. Sentence S2 is related to Topic 2, while both sentences S3 and S4 are grouped under Topic 3. Among the three topics, Topic 1 is positive, while Topics 2 and 3 are negative. After aggregating sentences weighted by their topic similarities, the model infers an overall positive sentiment since the most prominent Topic 1 is positive. This example shows that Hint is able to capture both the topic and sentiment changes in text.
7 Conclusion and Future Work
In this article, we have proposed a Hierarchical Interpretable Neural Text classifier, called Hint, which automatically generates hierarchical interpretations of text classification results. It learns the sentence-level context and topic representations in an orthogonal manner in which the former captures the label-dependent semantic information while the latter encodes the label-independent topic information shared across documents. The learned sentence representations are subsequently aggregated by a Graph Attention Network to derive the document-level representation for text classification. We have evaluated Hint on both review data and news data and shown that it achieves text classification performance on par with the existing neural text classifiers and generates more faithful interpretations as verified by both quantitative and qualitative evaluations.
Although we only focus on interpreting neural text classifiers here, the proposed framework can be extended to deal with other tasks such as content-based recommendation. In such a setup, we will need to learn both user- and item-based latent interest factors by analyzing reviews written by users and those associated with particular products. Because the proposed Hint is able to extract topics and their associated polarity strengths from reviews, it is possible to derive user- and item-based latent interest factors based on the outputs produced by Hint. Moreover, many NLP tasks such as natural language inference, rumor veracity classification, extractive question answering, and information extraction can be framed as classification problems. The proposed framework has a great potential to be extended to a wide range of NLP tasks.
Appendix A: Human Evaluation Instruction
Hint, HAN, and VMASK generate different forms of interpretations. HAN can generate interpretations based on the attention weights at both the word level and the sentence level. VMASK can only generate interpretations at the word level. Apart from the word-level and the sentence-level interpretations, Hint can also generate interpretations at the document level by partitioning sentences into various topics and associating with each topic a polarity label. In the actual evaluation, to reduce cognitive load, we only present the most prominent topic in the document and the most contrastive topic, in the form of word clouds, to the evaluators. To retrieve the most prominent topic, we first identify the topic dimension with the largest value in the latent topic vector for each sentence and then select the most common topic dimension among all sentences. To select the most contrastive topic, we choose the one that has the minimal similarity with the first chosen topic. To generate the word cloud, we retrieve the topic words following the approach discussed in Section 4, with the vocabulary constrained to the local document. The evaluation schema is shown below:
Table A1. Keywords used for retrieving patient reviews from Yelp:

Walk-in Clinics, Surgeons, Oncologist, Cardiologists, Hospitals, Internal Medicine, Assisted Living Facilities, Cannabis Dispensaries, Doctors, Home Health Care, Health Coach, Emergency Pet Hospital, Pharmacy, Sleep Specialists, Professional Services, Addiction Medicine, Weight Loss Centers, Pediatric Dentists, Cosmetic Surgeons, Nephrologists, Naturopathic, Holistic, Pediatricians, Nurse Practitioner, Urgent Care, Orthopedists, Drugstores, Optometrists, Rehabilitation Center, Hypnosis, Hypnotherapy, Physical Therapy, Neurologist, Memory Care, Allergists, Counseling & Mental Health, Pet Groomers, Podiatrists, Dermatologists, Diagnostic Services, Radiologists, Medical Centers, Gastroenterologist, Obstetricians & Gynecologists, Pulmonologist, Ear Nose & Throat, Ophthalmologists, Sports Medicine, Nutritionists, Psychiatrists, Vascular Medicine, Cannabis Clinics, Hospice, First Aid Classes, Medical Spas, Spine Surgeons, Health Retreats, Medical Transportation, Dentists, Health & Medical, Speech Therapists, Emergency Medicine, Chiropractors, Medical Supplies, General Dentistry, Occupational Therapy, Urologists
Table A2. The Hint model architecture:

| Module | Component | Formulation |
|---|---|---|
| Input |  | A document d consisting of Md sentences |
| Word embedding |  | Initialized by the GloVe embeddings |
| Context learning | Word-level biLSTM |  |
|  | Attention layer |  |
|  | Context aggregation |  |
| Topic learning | Word weight init. | TFIDFij · xij |
|  | Bayesian inference | μω ∈ ℝN; ω = Softmax(μω + σω · ϵ) |
|  | Autoencoder | β1×L = softmax(ReLU(ω1×d · xi)); zi = softmax(Wc · xi + bc); ri′ = tanh(Wc′ · zi + bc′) |
| Document modeling | Node init. |  |
|  | Edge weight init. |  |
|  | Node update |  |
| Classification |  |  |
Appendix B: List of keywords used for Yelp reviews retrieval
The list of keywords used for retrieving patient reviews from Yelp is shown in Table A1.
Appendix C: Model Architecture and Parameter Setting
Our model architecture is shown in Table A2. We describe the parameter setup for each part of the model below:
Context Learning We use the pretrained 300-dimension GloVe embeddings with the dimension N = 300. The dimension of the word-level biLSTM hidden states is 150, and the dimension of the output x is also 300. The word embedding sequence is fed to two consecutive linear layers to obtain the attention weights. The weight matrices for the linear layers, Linear1 and Linear2, are (300,200) and (200,1), respectively. Then we aggregate by attention weights to obtain the sentence-level contextual representation si.
Topic Learning We calculate the TFIDF values for words offline. During inference, the TFIDF value of out-of-vocabulary words is set to 1e − 4. For each sentence, we first normalize the TFIDF values of its constituent words and then aggregate the word embeddings weighted by their respective TFIDF values. This gives an initial sentence representation xi ∈ ℝN, which is then fed into two MLPs to generate the mean μω and the variance σω², which are used to generate the output latent variable ω ∈ ℝN. After a non-linear (ReLU) transformation and normalization (Softmax), we obtain the topic-aware weights β that are used to generate the input pi for the autoencoder. The encoder and decoder in our autoencoder are 1-layer MLPs with non-linear transformations. The weight matrices Wc and Wc′ are (300, K) and (K, 300), respectively. K is the number of pre-defined topics. We set K = 50 for the two review datasets and K = 30 for the Guardian News data empirically.
Document Modeling Graph nodes are initialized by the linear-transformed contextual sentence-level representations. The weight matrices in Linear3 and Linear4 are (300, 200) and (200, 50), respectively. The graph node dimension is 50.
Classification The Linear5 and Linear6 have the dimensions of (300,200) and (200,#labels), respectively.
We use dropout layers to alleviate overfitting, and insert a dropout layer after the word embedding layer, after the word-level biLSTM layer, and after obtaining zi and ω, respectively. The dropout rate is 0.4. We use the Adam (Kingma and Ba 2015) optimizer and set the learning rate to 1e − 4. The λ1 and λ2 in the regularization terms are set to 0.05 and 0.01, respectively. ηa and ηb are set to 0.001 and 1, respectively. We train the model for 30 epochs and evaluate the performance at the end of each epoch. We report the average results over 5 runs with random seeds.
Acknowledgments
This work was funded by the UK Engineering and Physical Sciences Research Council (grant no. EP/T017112/1, EP/V048597/1, EP/X019063/1). Hanqi Yan receives the PhD scholarship funded jointly by the University of Warwick and the Chinese Scholarship Council. Yulan He is supported by a Turing AI Fellowship funded by the UK Research and Innovation (grant no. EP/V020579/1).
Notes
Our source code can be accessed at https://github.com/hanqi-qi/SINE.
Note that ω is shared among all inputs, which is different from typical latent variable models in which a local latent variable is associated with each individual input.
We can also use the decoder weight matrix , which is symmetrical to Wc.
All the keywords are listed in Appendix B.
Our results on IMDB are different from those reported in the original paper due to different train/test splits.
For hint, we select the important words according to their weights from both the context representation learning and the topic representation modules.
We randomly select 200 test samples to evaluate the interpretability.
References
Author notes
Action Editor: Tal Linzen