Adversarial Deep Averaging Networks for Cross-Lingual Sentiment Classification

In recent years great success has been achieved in sentiment classification for English, thanks in part to the availability of copious annotated resources. Unfortunately, most languages do not enjoy such an abundance of labeled data. To tackle the sentiment classification problem in low-resource languages without adequate annotated data, we propose an Adversarial Deep Averaging Network (ADAN) to transfer the knowledge learned from labeled data in a resource-rich source language to low-resource languages where only unlabeled data exist. ADAN has two discriminative branches: a sentiment classifier and an adversarial language discriminator. Both branches take input from a shared feature extractor to learn hidden representations that are simultaneously indicative of the classification task and invariant across languages. Experiments on Chinese and Arabic sentiment classification demonstrate that ADAN significantly outperforms state-of-the-art systems.


Introduction
There has been significant progress on English sentence- and document-level sentiment classification in recent years using models based on neural networks (Socher et al., 2013; İrsoy and Cardie, 2014a; Le and Mikolov, 2014; Tai et al., 2015; Iyyer et al., 2015). Most of these, however, rely on a massive amount of labeled training data or fine-grained annotations such as the Stanford Sentiment Treebank (Socher et al., 2013), which provides sentiment annotations for each phrase in the parse tree of every sentence. On the other hand, such a luxury is not available to many other languages, for which only a handful of sentence- or document-level sentiment annotations exist. To aid the creation of sentiment classification systems in such low-resource languages, we propose the ADAN model that uses the abundant resources for a source language (here, English) to produce sentiment analysis models for a target language with little or no available labeled data. Our system is unsupervised in the sense that it does not require annotations in the target language. In this paper, we use Chinese sentiment classification as a motivating example, but our method can be readily applied to other languages, as we only require unlabeled text in the target language, which is fairly accessible for most languages.
In particular, Chinese sentiment analysis remains much less explored compared to English, mostly due to the lack of large-scale labeled training corpora. Previous methods perform Chinese sentiment classification by training linear classifiers on small domain-specific datasets with hundreds to a few thousand instances. Training modern deep neural networks is impossible on such small datasets, and consequently a lot of research effort goes into hand-crafting better features that do not necessarily generalize well (Tan and Zhang, 2008). Although some prior work tries to alleviate the scarcity of sentiment annotations by leveraging labeled English data (Wan, 2008; Wan, 2009; Lu et al., 2011), these methods rely on external knowledge such as bilingual lexicons or machine translation (MT) systems, both of which are difficult and expensive to obtain. In this work, we propose an end-to-end neural network model that only requires labeled English data and unlabeled Chinese text as input, and explicitly transfers the knowledge learned from English sentiment analysis to Chinese. Our trained system directly operates on Chinese sentences to predict their sentiment (e.g. positive or negative).
We hypothesize that an ideal model for cross-lingual sentiment analysis should learn features that both perform well on the English sentiment classification task and are invariant with respect to the shift in language. Therefore, ADAN simultaneously optimizes two components: i) a sentiment classifier P for English; and ii) an adversarial language predictor Q that tries to predict whether a sentence x is from English or Chinese. The structure of the model is shown in Figure 1. The two classifiers take input from the jointly learned feature extractor F, which is trained to maximize accuracy on English sentiment analysis and simultaneously to minimize the language predictor's chance of correctly predicting the language of the text. This is why the language predictor Q is called "adversarial".
The model is exposed to both English and Chinese sentences during training, but only the labeled English sentences pass through the sentiment classifier. The feature extractor and the sentiment classifier are then used for Chinese sentences at test time.
In this manner, we can train the system with massive amounts of unlabeled text in Chinese. Upon convergence, the joint features (the output of F) are thus encouraged to be both discriminative for sentiment analysis and invariant across languages.
The idea of incorporating an adversary in a neural network model has achieved great success in computer vision for image generation (Goodfellow et al., 2014) and visual domain adaptation (Ganin and Lempitsky, 2015; Ajakan et al., 2014). However, to the best of our knowledge, this work is the first to develop an adversarial network for a cross-lingual NLP task. While conceptually similar to domain adaptation, most research on domain adaptation assumes that inputs from both domains share the same representation, such as image pixels for image recognition and bag-of-words vectors for text classification. In our setting, however, the bag-of-words representation is infeasible because the two languages have completely different vocabularies.
In the following sections, we present our method in more detail and show experimental results in which ADAN significantly outperforms several baselines, some even with access to the powerful commercial MT system of Google Translate.

Related Work
Sentence-level Sentiment Classification is a popular NLP task on which neural network models have demonstrated tremendous power when coupled with copious data (Socher et al., 2013; İrsoy and Cardie, 2014a; Le and Mikolov, 2014; Tai et al., 2015; Iyyer et al., 2015). In principle, our cross-lingual framework could use any one of those methods for its feature extractor and sentiment classifier.
We choose to build upon the Deep Averaging Network (DAN), a very simple neural network model that yields surprisingly good performance, comparable to complicated syntactic models like recursive neural networks (Iyyer et al., 2015). The simplicity of DAN helps to illustrate the effectiveness of our framework. For each document, DAN takes the arithmetic mean of the document's word vectors (Mikolov et al., 2013; Pennington et al., 2014) as input and passes it through several fully-connected layers, followed by a softmax layer for classification.
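To make the model concrete, the DAN forward pass described above can be sketched in plain Python as follows. The list-based representation and layer shapes are our own simplifications for illustration, not the authors' Torch7 implementation:

```python
import math

def dan_forward(word_vectors, layers):
    """Average the document's word embeddings, then apply fully-connected
    layers. `layers` is a list of (weight_matrix, bias) pairs, where each
    weight matrix is a list of rows. Hidden layers use ReLU; the final
    layer's output goes through a softmax for classification."""
    dim = len(word_vectors[0])
    # 1. Arithmetic mean of the word vectors (the "deep averaging" input).
    avg = [sum(v[i] for v in word_vectors) / len(word_vectors) for i in range(dim)]
    h = avg
    for idx, (W, b) in enumerate(layers):
        z = [sum(w * x for w, x in zip(row, h)) + b_j for row, b_j in zip(W, b)]
        if idx < len(layers) - 1:
            h = [max(0.0, v) for v in z]          # ReLU nonlinearity
        else:
            m = max(z)                             # numerically stable softmax
            e = [math.exp(v - m) for v in z]
            s = sum(e)
            h = [v / s for v in e]
    return h
```

With two toy 2-dimensional "word vectors" and a single identity layer, the averaged input [2, 2] is mapped to the uniform distribution [0.5, 0.5].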
Cross-lingual Sentiment Analysis (Mihalcea et al., 2007; Banea et al., 2008; Banea et al., 2010) is motivated by the lack of high-quality labeled data in many non-English languages. For Chinese in particular, there have been several representative works in both the machine learning direction (Wan, 2008; Wan, 2009; Lu et al., 2011) and the more traditional lexical direction (He et al., 2010). Our work is comparable to these papers in objective but very different in method. The work by Wan uses machine translation to directly convert English training data to Chinese; this is one of our baselines. Lu et al. (2011) instead use labeled data from both languages to improve the performance on both.

Domain Adaptation methods (Blitzer et al., 2007; Glorot et al., 2011; Chen et al., 2012) try to learn effective classifiers for which the training and test samples come from different underlying distributions. This can be thought of as a generalization of cross-lingual text classification. However, one main difference is that, when applied to text classification tasks such as sentiment analysis, most work in domain adaptation is evaluated on adapting product reviews from one domain to another (e.g. books to electronics), where the divergence in distribution is much less significant than that between two languages. In addition, for cross-lingual sentiment analysis, it might be difficult to find data from exactly the same domain in both languages, in which case our model still demonstrates impressive performance.

Adversarial Networks (Goodfellow et al., 2014; Ganin and Lempitsky, 2015) have enjoyed much success in computer vision, but to the best of our knowledge have not yet been applied in NLP with comparable success. We are the first to apply adversarial training to the cross-lingual setting in NLP.
A series of works in image generation has used architectures similar to ours, pitting a neural image generator against a discriminator that learns to classify real versus generated images (Goodfellow et al., 2014; Denton et al., 2015). More relevant to this work, adversarial architectures have produced the state of the art in unsupervised domain adaptation for image object recognition, where Ganin and Lempitsky (2015) train with many labeled source images and unlabeled target images, similar to our setup.
The ADAN Model

Network Architecture
As illustrated in Figure 1, the ADAN model is a feed-forward network with two branches. There are three main components in the network: a joint feature extractor F that maps an input sequence x to the feature space, a sentiment classifier P that predicts the sentiment label for x given the feature representation F(x), and a language predictor Q that also takes the features F(x) but predicts whether x is from English or Chinese.
An input document is modeled as a sequence of words x = {w_1, ..., w_n}, where n is the number of tokens in x. Each word w ∈ x is represented by its word embedding v_w (Mikolov et al., 2013). As the same feature extractor F is trained and tested on both English and Chinese sentences, it is favorable if the word representations for both languages align approximately in a shared space. Prior work on bilingual word embeddings (Zou et al., 2013; Vulić and Moens, 2015; Gouws et al., 2015) attempts to induce distributed word representations that encode semantic relatedness between words across languages, so that similar words are closer in the embedded space regardless of language.
Since many of these methods require a large parallel corpus or extended training time, in this work we leverage the pre-trained bilingual word embeddings (BWE) of Zou et al. (2013). Their work provides 50-dimensional embeddings for 100k English words and a different set of 100k Chinese words. Note that using these word embeddings makes our work implicitly dependent on a parallel corpus between the two languages, since one is required to train the bilingual embeddings. However, more recent approaches to training bilingual word embeddings require little or even no parallel text (Gouws et al., 2015; Vulić and Moens, 2015), and these methods can alleviate or eliminate this dependence. For experiments and more discussion of word embeddings, see Section 4.5.
The feature extractor F is a Deep Averaging Network (DAN) (Iyyer et al., 2015). F first calculates the arithmetic mean of the word vectors in the input sequence, then passes the average through a feed-forward network with ReLU nonlinearities. The activations of the last layer in F are considered the extracted features for the input and are then passed on to P and Q. The sentiment classifier P and the language predictor Q are standard feed-forward networks, with a softmax layer on top for classification.
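The way the three components compose can be sketched as follows, with F, P and Q as placeholder callables standing in for the actual sub-networks:

```python
def adan_forward(x, F, P, Q):
    """One conceptual forward pass of ADAN: the shared feature extractor
    feeds both discriminative branches."""
    features = F(x)                  # joint, ideally language-invariant features
    sentiment_scores = P(features)   # sentiment branch (trained on English labels)
    language_scores = Q(features)    # adversarial language-prediction branch
    return sentiment_scores, language_scores
```

At test time on Chinese input, only the F → P path is used.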

Training
The ADAN model can be trained end-to-end with standard back-propagation, which we detail in this section. For the two classifiers P and Q, parametrized by θ_p and θ_q respectively, we use the traditional cross-entropy loss, denoted L_p(ŷ, y) and L_q(ŷ, l): L_p is the negative log-likelihood of P predicting the correct sentiment label, and L_q is that of Q predicting the correct language. We therefore seek the minima of the following loss functions for P and Q:

min_{θ_p} L_p(P(F(x)), y)
min_{θ_q} L_q(Q(F(x)), l)

where y is the sentiment label and l the language label of x. To accomplish the aforementioned joint learning of features, the feature extractor F, parameterized by θ_f, solves the following optimization problem:

min_{θ_f} L_p(P(F(x)), y) − λ L_q(Q(F(x)), l)     (1)

where λ is a hyper-parameter that balances the two branches P and Q. The intuition is that while P and Q are individually trying to excel at their own classification tasks, F drives its parameters to extract hidden representations that help the sentiment prediction of P and hamper the language prediction of Q. Therefore, upon successful training, F extracts language-invariant features suitable for sentiment analysis.
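Assuming P and Q output probability distributions, F's objective can be sketched as below. This is a hedged illustration of the loss arithmetic only, not the authors' training code:

```python
import math

def cross_entropy(probs, label):
    """Negative log-likelihood of the correct class."""
    return -math.log(probs[label])

def feature_extractor_loss(p_probs, y, q_probs, lang, lam):
    """F's loss: L_p(P(F(x)), y) - lam * L_q(Q(F(x)), lang).
    Minimizing it rewards correct sentiment predictions while
    penalizing confident language predictions by Q."""
    return cross_entropy(p_probs, y) - lam * cross_entropy(q_probs, lang)
```

For example, with uniform outputs from both branches and λ = 1, the two terms cancel exactly.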
There are two approaches to training ADAN. The first, inspired by Goodfellow et al. (2014), performs alternate training: first the sentiment classifier and the feature extractor are trained together, then the adversarial language predictor is trained while the other two are "frozen". This method offers more control over convergence when the learning progress of P and Q is not fully in sync. For instance, Goodfellow et al. (2014) introduce a hyper-parameter k to train one component for k iterations and then the other for one, which is helpful since their generator learns faster than their discriminator.
The other method is to use a Gradient Reversal Layer (GRL), proposed by Ganin and Lempitsky (2015). The GRL does nothing during forward propagation, but negates the gradients it receives during backward propagation. Specifically, a GRL R_λ with hyper-parameter λ behaves as follows:

R_λ(x) = x
dR_λ(x)/dx = −λI

where I is the identity matrix. After a GRL is inserted between F and Q, running standard Stochastic Gradient Descent (SGD) on the entire network optimizes for (1).
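In a framework with explicit forward and backward passes, the GRL's behavior can be sketched as a pair of functions. This illustrates the mechanism described above; it is not Ganin and Lempitsky's implementation:

```python
def grl_forward(x, lam):
    """Gradient Reversal Layer: identity on the forward pass."""
    return x

def grl_backward(grad_output, lam):
    """On the backward pass, scale the incoming gradients by -lam, so F
    receives reversed gradients from the language predictor Q."""
    return [-lam * g for g in grad_output]
```

Inserted between F and Q, the reversal turns Q's gradient signal into an adversarial update for F.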
In our experiments, we found the GRL easier to work with since the entire network can be optimized en masse. On the other hand, we also observed that the training progress of P and Q is usually not in sync, and it takes more than one batch of training for the language predictor to adapt to changes in the joint features. Therefore, the first method's flexibility in coordinating the training of P and Q is indeed important. We thus combine both approaches by adapting the GRL method to perform alternate training. This is achieved by setting λ to a non-zero value only once out of every k batches. When λ = 0, the gradients from Q are not back-propagated to F, which allows Q more iterations to adapt to F before F makes another adversarial update. We set λ = 0.02 and k = 3 in all experiments, selected based on performance on a held-out validation set.
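The alternating schedule can be sketched as a list of per-batch λ values. Interpreting "once out of k batches" as the first batch in every block of k is our assumption; the paper does not specify which batch in the block carries the adversarial update:

```python
def grl_lambda_schedule(num_batches, lam=0.02, k=3):
    """Per-batch GRL lambda values: non-zero once every k batches
    (lam=0.02, k=3 match the paper's validation-set choice), zero
    otherwise so Q can catch up with F between adversarial updates."""
    return [lam if step % k == 0 else 0.0 for step in range(num_batches)]
```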
For unsupervised cross-lingual sentiment analysis, we have access to labeled English data and unlabeled Chinese data during training. To train the ADAN model, we assemble mini-batches of samples drawn from both the English and Chinese training data with equal probability, and use them to train the ADAN network. However, only the labeled English instances go through the sentiment classifier P during training, while both English and Chinese instances go through the language predictor Q. At test time, to predict the sentiment of a Chinese sentence, we pass it through the trained F and then P.
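The mini-batch assembly might look like the following sketch. The tuple data format and the fixed RNG seed are illustrative assumptions:

```python
import random

def assemble_batch(en_labeled, zh_unlabeled, batch_size, seed=0):
    """Draw a mini-batch from English and Chinese data with equal
    probability. English samples carry sentiment labels and go through
    both P and Q; Chinese samples are unlabeled (label None) and are
    used only by the language predictor Q."""
    rng = random.Random(seed)
    batch = []
    for _ in range(batch_size):
        if rng.random() < 0.5:            # pick a labeled English sample
            text, label = rng.choice(en_labeled)
            batch.append((text, label, "en"))
        else:                             # pick an unlabeled Chinese sample
            text = rng.choice(zh_unlabeled)
            batch.append((text, None, "zh"))
    return batch
```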

Experiments and Discussions
To demonstrate the effectiveness of our model, we experiment on sentence-level sentiment classification with 5 labels (strongly negative, slightly negative, neutral, slightly positive, and strongly positive).

Data
Labeled English Data. We use a balanced dataset of 700k Yelp reviews from Zhang et al. (2015) with their ratings as labels (scale 1-5). We also adopt their train-validation split: 650k reviews form the training set and the remaining 50k form a validation set used in English-only benchmarks. We then compare ADAN with domain adaptation baselines, which did not yield satisfactory results for our task. TCA (Pan et al., 2011) did not work since it requires space quadratic in the number of samples (650k in our case). SDA (Glorot et al., 2011) and the subsequent mSDA (Chen et al., 2012) have proven very effective for cross-domain sentiment classification on Amazon reviews. However, as shown in Table 1, mSDA did not perform competitively. We postulate that this is because many domain adaptation models, including mSDA, were designed for bag-of-words features, which are ill-suited to our task, where the two languages have completely different vocabularies. We instead use the same input representation as DAN for mSDA, the averaged word vector for each instance, which may violate the underlying assumptions of mSDA and thus lead to poor results. In summary, this suggests that even strong domain adaptation algorithms cannot be used out of the box for our task with satisfactory results.
One strong baseline we compare against is the English DAN with MT. We use the commercial Google Translate engine, which is highly engineered, trained on enormous resources, and arguably one of the best MT systems in the world. It translates the Chinese reviews into English, and we make predictions using the best-performing DAN model trained on Yelp reviews. Previous studies (Banea et al., 2008; Salameh et al., 2015) on sentiment analysis for Arabic and European languages claim that this MT approach is very competitive and can sometimes match the state-of-the-art system trained on the target language. Nevertheless, as shown in Table 1, our ADAN model significantly outperforms the MT baseline, showing that our adversarial model can successfully perform cross-lingual sentiment analysis without annotated data in the target language.
Semi-supervised Learning. In practice, it is usually not very difficult to obtain at least a small amount of annotated data. ADAN can be readily adapted to exploit such extra labeled data in the target language by letting those labeled instances pass through the sentiment classifier P as the English samples do during training. We simulate this semi-supervised scenario by using 1k labeled Chinese reviews for training. As shown in Table 1 (right), ADAN significantly outperforms DAN trained on the combination of the English and Chinese training data.

Qualitative Analysis and Visualizations
To qualitatively demonstrate how ADAN bridges the distributional discrepancies between English and Chinese instances, t-SNE (Van der Maaten and Hinton, 2008) visualizations of the activations at various layers are shown in Figure 2. We randomly select 1000 sentences from the Chinese and English validation sets respectively, and plot the t-SNE of the hidden node activations at three locations in our model: the averaging layer, the end of the joint feature extractor, and the last hidden layer in the sentiment classifier before the softmax. The train-on-English model is the DAN+BWE baseline in Table 1. Note that there is actually only one "branch" in this baseline model, but in order to compare with ADAN, we conceptually treat its first three layers as the feature extractor. In addition, a t-SNE plot of the bilingual word embeddings of the 1000 most frequent words in both languages is shown in Figure 3. It can be seen from Figure 3 that the BWE indeed provide a more or less mixed distribution of words across languages. There are still some dense clusters of monolingual words in the pre-trained BWE, which is slightly alleviated after ADAN training, since ADAN updates the word embeddings. However, a somewhat surprising finding in Figure 2a is the clear dichotomy between the averaged word vectors in the two languages, despite the distributions of the word embeddings themselves being mixed (Figure 3). This suggests that the divergence between languages is determined not only by word semantics but also, to a large extent, by the way words are used. Therefore, one needs to look beyond word representations when tackling cross-lingual NLP problems. Furthermore, we can see in Figure 2b that the distributional discrepancies between Chinese and English are significantly reduced after passing through the joint feature extractor F, and the features learned in ADAN bring the distributions of the two languages dramatically closer compared to the monolingually trained baseline. Finally, when looking at
the last hidden layer activations in the sentiment classifier of the baseline model (Figure 2c), there are several notable clusters of red dots (English data) that roughly correspond to the class labels. However, most Chinese samples are not close to any of those clusters due to the distributional divergence, which may cause degraded performance in Chinese sentiment classification. On the other hand, the Chinese samples are much more in line with the English ones in the ADAN model, which results in the accuracy boost over the baseline model.

Side: English Sentiment Classification
Although not an objective of this work, we also show that DAN performs well in the purely supervised setting of English sentiment classification, indicating that the DAN baselines in our experiments are competitive for English sentiment classification. While Iyyer et al. (2015) tested DAN for sentiment analysis on the smaller Stanford Sentiment Treebank dataset, we also demonstrate its effectiveness on the large Yelp reviews dataset. Table 2 compares our DAN with the results in (Zhang et al., 2015): DAN beats most of their baselines including an LSTM, and is close to the best model in (Zhang et al., 2015), a very large convolutional neural network. For more discussion of the different embedding choices, please refer to Section 4.5.

Impact of Bilingual Word Embeddings
In this section we discuss the effect of the bilingual word embeddings, which we deem a key factor for future improvement. Table 2 shows that the small dimensionality of the BWE limits DAN's performance on English sentiment classification, especially since there is a large gap between the dimensionality of the embeddings and that of the hidden layers (50 vs. 900). Even using 300-dimensional random embeddings outperforms the 50d BWE, and is only marginally worse than BWE (50d) + Random 250d (padding the BWE with random dimensions to reach 300). Furthermore, with better word embeddings (word2vec; Mikolov et al., 2013), the performance is 2% higher and close to the state of the art.
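The "BWE (50d) + Random 250d" variant pads each 50-dimensional embedding with random dimensions up to 300. A minimal sketch, where the uniform range of the random dimensions is our assumption:

```python
import random

def pad_embedding(vec, target_dim=300, scale=0.1, seed=0):
    """Pad a pre-trained embedding with uniform random dimensions up to
    target_dim, as in the 'BWE (50d) + Random 250d' setting. The scale
    of the random dimensions is a hypothetical choice."""
    rng = random.Random(seed)
    extra = [rng.uniform(-scale, scale) for _ in range(target_dim - len(vec))]
    return list(vec) + extra
```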
Nevertheless, when performing cross-lingual training for ADAN, random word embeddings no longer work, as seen in Table 1. This makes intuitive sense because the word representations for both languages are drawn from one identical random distribution, so the adversarial language predictor receives little useful information to distinguish between the underlying language distributions, hence producing rather useless gradients. We can see that adding randomness to the BWE also yields worse performance, and the best-performing model uses only the 50-dimensional BWE. Based on Table 2, the accuracy of ADAN could probably be further improved if we could increase the quality, or at least the dimensionality, of the BWE.

Implementation Details
For all our experiments, the feature extractor F has three fully-connected hidden layers with ReLU nonlinearities, while both P and Q have two. All hidden layers contain 900 hidden units. This choice is more or less ad hoc, and the performance could potentially be improved with more careful model selection. Batch Normalization (Ioffe and Szegedy, 2015) is used in each hidden layer. We corroborate the observation of Ioffe and Szegedy (2015) that Batch Normalization can sometimes eliminate the need for Dropout (Srivastava et al., 2014) or Word Dropout (Iyyer et al., 2015), which make little difference in our experiments. We use Adam (Kingma and Ba, 2014) with a learning rate of 0.05 for optimization. ADAN is implemented in Torch7 (Collobert et al., 2011), and the code will be made available after the review process.
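The reported architecture can be summarized by its layer widths; the sketch below also counts fully-connected parameters. The exact input/output wiring is our reading of the description, not the released code, and it ignores Batch Normalization parameters:

```python
def adan_layer_dims(embed_dim=50, hidden=900, n_classes=5, n_langs=2):
    """Layer widths per the paper's description: F has three 900-unit
    hidden layers; P and Q each have two hidden layers plus a softmax
    output (5 sentiment classes, 2 languages)."""
    F = [embed_dim, hidden, hidden, hidden]
    P = [hidden, hidden, hidden, n_classes]
    Q = [hidden, hidden, hidden, n_langs]
    return F, P, Q

def param_count(dims):
    """Weights plus biases for a stack of fully-connected layers."""
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))
```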
Training ADAN is very efficient due to its simplicity. It takes less than 6 minutes to finish one epoch (1.3M instances) on one Titan X GPU. We train ADAN for 30 epochs and use early stopping to select the best model on the validation set.

Conclusion and Future Work
In this work, we presented ADAN, an adversarial deep averaging network for cross-lingual sentiment classification. ADAN leverages the abundant resources available for English to help sentiment categorization in other languages where little or no annotated data exists. We validated our hypothesis with empirical experiments on Chinese sentiment categorization, where we have labeled English training data and only unlabeled Chinese data. Experiments show that ADAN outperforms several baselines, including domain adaptation models and a highly competitive MT baseline using Google Translate. Furthermore, we showed that in the presence of labeled data in the target language, ADAN can naturally incorporate this additional supervision and yields even more competitive results.
For future work, one direction is to explore better bilingual word embeddings, which we identify as one key factor limiting the performance of ADAN. In another direction, our adversarial framework for cross-lingual text categorization can be applied not only to DAN but also to many other neural models such as LSTMs. Further, our framework is not limited to text classification tasks, and can be extended to, for instance, phrase-level opinion mining (İrsoy and Cardie, 2014b), which extracts phrase-level opinion expressions from sentences using deep recurrent neural networks. Our framework can be applied to such phrase-level models for languages where labeled data might not exist.

Figure 1 :
Figure 1: Adversarial Deep Averaging Network. Orange: sentiment classifier P; Red: language classifier Q; Green & Blue: feature extractor F. Bilingual Word Embeddings are discussed in Section 3.1.

Figure 2
Figure 2: t-SNE visualizations of activations at various layers for the train-on-English baseline model (top) and ADAN (bottom). The three columns are activations taken from (a) the averaging layer, (b) the end of the joint feature extractor F, and (c) the last hidden layer in the sentiment classifier P before the softmax. Red dots are English samples and blue dots are Chinese. Numbers indicate the class labels for each instance (zoom in for details). The distributions of the two languages are brought much closer in ADAN as they are represented deeper in the network (left to right).
Figure 3: t-SNE visualizations of the word embeddings of the 1000 most frequent words from both languages. Red dots are English words and blue ones are Chinese (better viewed in color). (a) is the pre-trained embeddings from (Zou et al., 2013), and (b) is the updated embeddings in the trained ADAN model.

Table 1 :
ADAN Performance for Fine-grained Chinese Hotel Review Sentiment Classification in both unsupervised and semi-supervised settings.
Our main results are shown in Table 1. The left table shows the unsupervised setting, where no labeled Chinese data is used. The DAN+BWE baseline model uses bilingual word embeddings to map both English and Chinese reviews into the same space and is trained using only English Yelp reviews. We can see from Table 1 that the bilingual embeddings by themselves do not suffice to transfer knowledge of English sentiment classification to Chinese, and the performance is poor (29.11%).

Table 2 :
DAN Performance for English Yelp Review Sentiment Classification.