Learning Neural Sequence-to-Sequence Models from Weak Feedback with Bipolar Ramp Loss

In many machine learning scenarios, supervision by gold labels is not available and consequently neural models cannot be trained directly by maximum likelihood estimation. In a weak supervision scenario, metric-augmented objectives can be employed to assign feedback to model outputs, which can be used to extract a supervision signal for training. We present several objectives for two separate weakly supervised tasks, machine translation and semantic parsing. We show that objectives should actively discourage negative outputs in addition to promoting a surrogate gold structure. This notion of bipolarity is naturally present in ramp loss objectives, which we adapt to neural models. We show that bipolar ramp loss objectives outperform other non-bipolar ramp loss objectives and minimum risk training on both weakly supervised tasks, as well as on a supervised machine translation task. Additionally, we introduce a novel token-level ramp loss objective, which is able to outperform even the best sequence-level ramp loss on both weakly supervised tasks.


Introduction
Sequence-to-sequence neural models are typically trained using a maximum likelihood estimation (MLE) objective. However, MLE training requires full supervision by gold target structures, which in many scenarios are too difficult or expensive to obtain. For example, in semantic parsing for question-answering it is often easier to collect gold answers rather than gold parses (Clarke et al., 2010; Berant et al., 2013; Pasupat and Liang, 2015; Rajpurkar et al., 2016, inter alia). In machine translation, there are many domains for which no gold references exist, although cross-lingual document-level links are present for many multilingual data collections.
* Both authors contributed equally to this publication.
In this paper we investigate methods where a supervision signal for output structures can be extracted from weak feedback. In the following, we use learning from weak feedback, or weakly supervised learning, to refer to a scenario where output structures generated by the model are judged according to an external metric, and this feedback is used to extract a supervision signal that guides the learning process. Metric-augmented sequence-level objectives from reinforcement learning (Williams, 1992; Ranzato et al., 2016), minimum risk training (MRT) (Smith and Eisner, 2006; Shen et al., 2016) or margin-based structured prediction objectives (Taskar et al., 2005; Edunov et al., 2018) can be seen as instances of such algorithms.
In natural language processing applications, such algorithms have mostly been used in combination with full supervision tasks, where a feedback score can be computed from metrics such as BLEU or F-score that measure the similarity of output structures against gold structures. Our main interest is in weak supervision tasks where the calculation of a feedback score cannot fall back onto gold structures. For example, matching proposed answers to a gold answer can guide a semantic parser towards correct parses, and matching proposed translations against linked documents can guide learning in machine translation.
In such scenarios the judgments by the external metric may be unreliable and thus unable to select a good update direction. It is our intuition that a more reliable signal can be produced by not just encouraging outputs that are good according to weak positive feedback, but also by actively discouraging bad structures. In this way, a system can more effectively learn what distinguishes good outputs from bad ones. We call an objective that incorporates this idea a bipolar objective. The bipolar idea is naturally captured by the structured ramp loss objective (Chapelle et al., 2009), especially in the formulation by Gimpel and Smith (2012) and Chiang (2012), who use ramp loss to separate a hope from a fear output in a linear structured prediction model. We employ several ramp loss objectives for two weak supervision tasks, and adapt them to neural models.
First, we turn to the task of semantic parsing in a setup where only question-answer pairs, but no gold semantic parses, are given. We assume a baseline system has been trained using a small supervised data set of question-parse pairs under the MLE objective. The goal is to improve this system by leveraging a larger data set of question-answer pairs. During learning, the semantic parser suggests parses for which corresponding answers are retrieved. These answers are then compared to the gold answer and the resulting weak supervision signal guides the semantic parser towards finding correct parses. We show that a bipolar ramp loss objective can improve upon the baseline by over 12 percentage points in F1 score.
Second, we use ramp losses on a machine translation task where only weak supervision in the form of cross-lingual document-level links is available. We assume a translation system has been trained using MLE on out-of-domain data. We then investigate whether document-level links can be used as a weak supervision signal to adapt the translation system to the target domain. We formulate ramp loss objectives that incorporate bipolar supervision from relevant and irrelevant documents. We also present a metric that allows us to include bipolar supervision in an MRT objective. Experiments show that bipolar supervision is crucial for obtaining gains over the baseline. Even with this very weak supervision, we are able to achieve an improvement of over 0.4% BLEU over the baseline using a bipolar ramp loss.
Finally, we turn to a fully supervised machine translation task. Even in a fully supervised scenario, MLE training has been associated with two issues. First, it can cause exposure bias (Ranzato et al., 2016) because during training the model receives its context from the gold structures of the training data, but at test time the context is drawn from the model distribution instead. Second, the MLE objective is agnostic to the final evaluation metric, causing a loss-evaluation mismatch (Wiseman and Rush, 2016). Our experiments use a setup similar to that of Edunov et al. (2018), who apply structured prediction losses to two fully supervised sequence-to-sequence tasks, but do not consider structured ramp loss objectives. Like our predecessors, we want to understand whether further training a pre-trained machine translation model with a metric-informed sequence-level objective will improve translation performance by alleviating the above-mentioned issues. Gauging the potential of bipolar ramp loss in a full supervision scenario, we obtain the best results with a bipolar ramp loss, which improves over the baseline by over 0.4% BLEU.
In sum, we show that bipolar ramp loss is superior to other sequence-level objectives for all investigated tasks, supporting our intuition that a bipolar approach is crucial where strong positive supervision is not available. In addition to adapting the ramp loss objective to weak supervision, our ramp loss objective can also be adapted to operate at the token level, which makes it particularly suitable for neural models as they produce their outputs token by token. A token-level objective also better emulates the behavior of the ramp loss for linear models, which only update the weights of features that differ between hope and fear. Finally, the token-level objective allows us to capture token-level errors in a setup where MLE training is not available. Using this objective, we obtain additional gains on top of the sequence-level ramp loss for weakly supervised tasks.

Related Work
Training neural models with metric-augmented objectives has been explored for various NLP tasks in supervised and weakly supervised scenarios. MRT for neural models has previously been used for machine translation (Shen et al., 2016) and semantic parsing (Guu et al., 2017). Other objectives based on classical structured prediction losses have been used for both machine translation and summarization (Edunov et al., 2018), as well as semantic parsing (Iyyer et al., 2017; Misra et al., 2018). Objectives inspired by REINFORCE have, for example, been applied to machine translation (Ranzato et al., 2016; Norouzi et al., 2016), semantic parsing (Mou et al., 2017; Guu et al., 2017), and reading comprehension (Choi et al., 2017; Yang et al., 2017); see footnote 2. Misra et al. (2018) are the first to compare several objectives for neural semantic parsing. They find that objectives employing structured prediction losses perform best. Edunov et al. (2018) compare different classical structured prediction objectives including MRT on a fully supervised machine translation task. They find MRT to perform best. However, they only obtain larger gains by interpolating MRT with the MLE loss. Neither Misra et al. (2018) nor Edunov et al. (2018) investigate objectives that correspond to the bipolar ramp loss that is central in our work.
The ramp loss objective (Chapelle et al., 2009) has been applied to supervised phrase-based machine translation (Gimpel and Smith, 2012; Chiang, 2012). We adapt these objectives to neural models and extend them to incorporate bipolar weak supervision, while also introducing a novel token-level ramp loss objective.

Neural Sequence-to-Sequence Learning
Our neural sequence-to-sequence models utilize an encoder-decoder setup (Cho et al., 2014; Sutskever et al., 2014) with an attention mechanism (Bahdanau et al., 2015). Specifically, we employ the framework NEMATUS (Sennrich et al., 2017). Given an input sequence $x = x_1, x_2, \ldots, x_{|x|}$, the probability that a model assigns to an output sequence $y = y_1, y_2, \ldots, y_{|y|}$ is given by $\pi_w(y|x) = \prod_{j=1}^{|y|} \pi_w(y_j | y_{<j}, x)$. Using beam search, we can obtain a sorted k-best list $K(x)$ of most likely to least likely outputs, and we define the most likely output as $\hat{y} = \arg\max_{y \in K(x)} \pi_w(y|x)$.
Footnote 2: We do not use REINFORCE because its updates are based on only one sampled model output, which can lead to high variance. Because it is possible for us to obtain feedback for more than one model output, we employ the more robust MRT that calculates an average over several outputs.
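The factorization of the sequence probability and the choice of the most likely output from a k-best list can be illustrated with a minimal sketch. The function names and the toy token probabilities are hypothetical and are not taken from NEMATUS; log-probabilities are used only for numerical convenience.

```python
import math

def sequence_log_prob(token_probs):
    """Sum of token log-probabilities, i.e. the log of the product over tokens."""
    return sum(math.log(p) for p in token_probs)

def most_likely(k_best):
    """Return the candidate with the highest sequence probability.

    k_best maps each candidate output (a tuple of tokens) to the list of
    per-token conditional probabilities pi_w(y_j | y_<j, x).
    """
    return max(k_best, key=lambda y: sequence_log_prob(k_best[y]))

# Toy k-best list with two candidate parses (illustrative values only).
k_best = {
    ("query", "(", "area", ")"): [0.9, 0.8, 0.7, 0.9],
    ("query", "(", "node", ")"): [0.9, 0.8, 0.2, 0.9],
}
y_hat = most_likely(k_best)
```

In this toy list the two candidates differ only in the third token, so the candidate with the higher probability for that token is selected as the most likely output.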

Maximum Likelihood Estimation (MLE).
Prior to employing metric-augmented objectives, we assume that a model has been pre-trained with a maximum likelihood estimation (MLE) objective. Given inputs $x$ and gold structures $\bar{y}$, the parameters of the neural network are updated using Stochastic Gradient Descent (SGD) with minibatches of size $M$, leading to the following objective:
$$L_{MLE} = -\frac{1}{M} \sum_{m=1}^{M} \sum_{j=1}^{|\bar{y}_m|} \log \pi_w(\bar{y}_{m,j} | \bar{y}_{m,<j}, x_m) \quad (1)$$
Minimum Risk Training (MRT). We compare our ramp loss objectives to MRT (Shen et al., 2016), which uses an external metric to assign rewards to model outputs. Given an input $x$, $S$ outputs are sampled from the model distribution and updates are performed based on the following MRT objective:
$$L_{MRT} = -\frac{1}{M} \sum_{m=1}^{M} \frac{1}{S} \sum_{s=1}^{S} \pi_w(y_{m,s}|x_m) \left( \delta(y_{m,s}) - b(x_m) \right) \quad (2)$$
where $\delta(y_{m,s})$ is the reward returned for $y_{m,s}$ by the external metric, and $\pi_w(y_{m,s}|x_m)$ is a distribution over outputs that is normalized over the $S$ samples and can be controlled for sharpness by a temperature parameter. Following Shen et al. (2016), we use a baseline term $b(x_m)$ that acts as a control variate for variance reduction of the stochastic gradient (Williams, 1992; Greensmith et al., 2004) and allows negative updates for rewards smaller than the baseline. We compute this term by sampling $S'$ outputs from the model distribution such that $b(x) = \frac{1}{S'} \sum_{s'=1}^{S'} \delta(y_{s'})$.
Ramp Loss Objectives. Our ramp loss objectives can be formulated as follows:
$$L_{RAMP} = -\frac{1}{M} \sum_{m=1}^{M} \left( \pi_w(y^+_m|x_m) - \pi_w(y^-_m|x_m) \right) \quad (3)$$
where $y^-$ is a fear output that is to be discouraged and $y^+$ is a hope output that is to be encouraged.
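Loss | $y^+$ | $y^-$
--- | --- | ---
RAMP | $\arg\max_{y \in P(x)} \pi_w(y \mid x)$ | $\arg\max_{y \in N(x)} \pi_w(y \mid x)$
RAMP1 | $\hat{y}$ | $\arg\max_{y \in N(x)} \pi_w(y \mid x)$
RAMP2 | $\arg\max_{y \in P(x)} \pi_w(y \mid x)$ | $\hat{y}$
Table 1: Configurations for $y^+$ and $y^-$ for semantic parsing. We abbreviate $P(x) = \{y \in K(x) : \delta(y) = 1\}$, the outputs in the k-best list $K(x)$ that lead to the correct answer, and $N(x) = \{y \in K(x) : \delta(y) = 0\}$, the outputs in $K(x)$ that lead to a wrong answer. The arg max thus selects the most likely output in $K(x)$ that leads to the correct answer or to a wrong answer, respectively.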
Intuitively, $y^-$ should be an output which has high probability, but receives a bad reward from the external metric. Analogously, $y^+$ should be an output which has high probability and receives a high reward from the external metric. The concrete instantiations of $y^-$ and $y^+$ depend on the underlying task and are thus deferred to the respective sections below (see Tables 1, 4, and 7). The RAMP loss defined in equation (3) was introduced as equation (8) in Gimpel and Smith (2012). This loss naturally incorporates a bipolarity principle by including both hope and fear in one objective. An alternative formulation of ramp loss favors the current model prediction, that is, it sets $y^+ = \hat{y}$ and searches for a fear output. This has been called "cost-augmented decoding" and was formalized in equation (6) in Gimpel and Smith (2012). This loss dates back to the "margin-rescaled hinge loss" of Taskar et al. (2004) and will be called RAMP1 in the following. The converse approach has been called "cost-diminished decoding" and was formalized in equation (7) in Gimpel and Smith (2012). Here the model prediction is penalized by setting $y^- = \hat{y}$ and searching for a hope output. This objective has been called "direct loss" in Hazan et al. (2010), and will be called RAMP2 in the following.
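The three sequence-level variants can be sketched as follows for the binary-reward case. This is a minimal illustration rather than the implementation used in the experiments: the tuple layout, the function names, and the loss form (negative difference between the hope and fear probabilities, averaged over the minibatch) are assumptions of the sketch.

```python
def select_hope_fear(k_best, variant="RAMP"):
    """Pick (y_plus, y_minus) from a k-best list.

    k_best: list of (output, probability, reward) tuples sorted by descending
    probability, with reward in {0, 1}. Returns None if a required output is
    missing from the list.
    """
    y_hat = k_best[0]
    hope = next((c for c in k_best if c[2] == 1), None)  # most likely good output
    fear = next((c for c in k_best if c[2] == 0), None)  # most likely bad output
    if variant == "RAMP":
        pair = (hope, fear)
    elif variant == "RAMP1":
        pair = (y_hat, fear)
    elif variant == "RAMP2":
        pair = (hope, y_hat)
    else:
        raise ValueError(f"unknown variant: {variant}")
    if pair[0] is None or pair[1] is None:
        return None
    return pair

def ramp_loss(pairs):
    """Assumed loss form: -(1/M) * sum over the batch of p(y+) - p(y-)."""
    return -sum(plus[1] - minus[1] for plus, minus in pairs) / len(pairs)

# Toy k-best list: the most likely parse is wrong, the second one is correct.
k_best = [("parse_a", 0.5, 0), ("parse_b", 0.3, 1), ("parse_c", 0.2, 0)]
pair = select_hope_fear(k_best, "RAMP")
loss = ramp_loss([pair])
```

In the toy example the loss is positive because the fear output is currently more probable than the hope output, so a gradient step would raise the probability of the hope output and lower that of the fear output.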
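Finally, we introduce a ramp loss objective that can operate on the token level. To be able to adjust individual tokens, we move to log probabilities, so that the sequence decomposes as a sum over individual tokens and it is possible to ignore tokens while encouraging or discouraging others. This leads to the RAMP-T objective:
$$L_{RAMP-T} = -\frac{1}{M} \sum_{m=1}^{M} \left( \sum_{j=1}^{|y^+_m|} \tau^+_{m,j} \log \pi_w(y^+_{m,j} | y^+_{m,<j}, x_m) + \sum_{j=1}^{|y^-_m|} \tau^-_{m,j} \log \pi_w(y^-_{m,j} | y^-_{m,<j}, x_m) \right) \quad (4)$$
where $\tau^+_{m,j}$ and $\tau^-_{m,j}$ are set to 0, 1, or −1 depending on whether the corresponding token $y^+_{m,j}$ or $y^-_{m,j}$ should be left untouched, encouraged, or discouraged. Concretely, we define:
$$\tau^+_{m,j} = \begin{cases} 0 & \text{if } y^+_{m,j} \in y^-_m \\ 1 & \text{otherwise} \end{cases} \quad (5)$$
$$\tau^-_{m,j} = \begin{cases} 0 & \text{if } y^-_{m,j} \in y^+_m \\ -1 & \text{otherwise} \end{cases} \quad (6)$$
With this definition, tokens that appear in both $y^+$ and $y^-$ are left untouched, whereas tokens that appear only in the hope output are encouraged, and tokens that appear only in the fear output are discouraged (see Figure 1 for an example). This more fine-grained contrast allows the model to learn what distinguishes a good output from a bad one more effectively.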
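The token-level weighting can be sketched as below. The sketch assumes that "appears in both outputs" is decided by simple token membership and that the per-token log-probabilities are supplied by the model; the function names and the example parses are hypothetical.

```python
def token_weights(hope, fear):
    """Per-token weights for the hope and fear outputs.

    Shared tokens get weight 0, hope-only tokens get +1 (encouraged),
    fear-only tokens get -1 (discouraged).
    """
    hope_set, fear_set = set(hope), set(fear)
    tau_plus = [0 if tok in fear_set else 1 for tok in hope]
    tau_minus = [0 if tok in hope_set else -1 for tok in fear]
    return tau_plus, tau_minus

def ramp_t_loss(hope, hope_logp, fear, fear_logp):
    """Token-level ramp loss for a single input, given token log-probabilities."""
    tau_plus, tau_minus = token_weights(hope, fear)
    encouraged = sum(t * lp for t, lp in zip(tau_plus, hope_logp))
    discouraged = sum(t * lp for t, lp in zip(tau_minus, fear_logp))
    return -(encouraged + discouraged)

# Hope and fear parses that differ in a single token.
hope = ["query", "(", "area", ")"]
fear = ["query", "(", "node", ")"]
tau_plus, tau_minus = token_weights(hope, fear)
```

Only the differing tokens (`area` and `node` in the example) contribute to the loss, which mirrors the behavior of ramp loss in linear models, where only features that differ between hope and fear are updated.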

Semantic Parsing
Ramp Loss Objectives. In semantic parsing for question answering, natural language questions are mapped to machine-readable parses. Such a parse, $y$, can be executed against a database that returns an answer $a$. This answer can be compared to the available gold answer $\bar{a}$ and the following metric can be defined:
$$\delta(y) = \begin{cases} 1 & \text{if } a = \bar{a} \\ 0 & \text{otherwise} \end{cases} \quad (7)$$
For RAMP, $y^+$ is defined as the most probable output in the k-best list $K(x)$ that leads to the correct answer, that is, where $\delta(y) = 1$. In contrast, $y^-$ is defined as the most probable output in $K(x)$ that does not lead to the correct answer, namely, where $\delta(y) = 0$. The definitions of $y^+$ and $y^-$ for this objective and the related ramp loss objectives can be found in Table 1. If $y^+$ or $y^-$ is found, the parse is cached as a hope or fear output, respectively, for the corresponding input $x$. If at a later point $y^+$ or $y^-$ cannot be found in the current k-best list, then previously cached outputs are accessed instead. Should no cached output exist, the corresponding sample is skipped.
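The caching of hope and fear parses described above can be sketched as follows. The class and method names are hypothetical, and the sketch assumes that the k-best list already carries the binary reward for each parse.

```python
def first_with_reward(k_best, reward):
    """Most likely parse in the k-best list that received the given reward."""
    return next((parse for parse, r in k_best if r == reward), None)

class HopeFearCache:
    """Remembers the last hope and fear parse found for each input."""

    def __init__(self):
        self.hope = {}
        self.fear = {}

    def get_pair(self, x, k_best):
        """k_best: list of (parse, reward) sorted from most to least likely.

        Returns (hope, fear), falling back to cached parses where needed,
        or None if the sample has to be skipped.
        """
        hope = first_with_reward(k_best, 1)
        fear = first_with_reward(k_best, 0)
        if hope is not None:
            self.hope[x] = hope
        if fear is not None:
            self.fear[x] = fear
        hope, fear = self.hope.get(x), self.fear.get(x)
        if hope is None or fear is None:
            return None
        return hope, fear

cache = HopeFearCache()
first = cache.get_pair("question", [("parse_a", 0), ("parse_b", 0)])
second = cache.get_pair("question", [("parse_c", 1), ("parse_d", 1)])
```

In the usage example the first call is skipped because no correct parse has been seen yet, while the second call pairs the newly found hope parse with the fear parse cached from the first call.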
Experimental Setup. Our experiments are conducted on the NLMAPS V2 corpus (Lawrence and Riezler, 2018), which is a publicly available corpus for geographical questions that can be answered with the OPENSTREETMAP database. The corpus is a recent extension of its predecessor (Haas and Riezler, 2016), which has been used in Kočiský et al. (2016) and Duong et al. (2018).
For each question, the corpus provides both gold parses and gold answers that can be obtained by executing the parses against the database. We take a random subset of 2,000 question-parse pairs to train an initial model π w with the MLE objective. Following Lawrence and Riezler (2018), we take a pre-order traversal of the tree-structured parses to obtain individual tokens. A further 1,843 and 2,000 instances of the corpus are retained for development and test set, respectively. For the remaining 22,766 questions, we assume that no gold parses exist and only gold answers are available. With the gold answers as a guide, the initial model π w is further improved using the metric-augmented objectives of Section 3 and the metric defined in equation (7).
The model has 1,024 hidden units (GRUs) and word embeddings of size 1,000. The optimal learning rate was chosen in preliminary experiments on the development set and is set to 0.1. Gradients are clipped to 1.0 if they exceed a value of 1.0 and the sentence length is capped at 200. In the case of the MRT objectives, we set $S = S' = 10$. For the RAMP objectives the size of the k-best list K is 10. For objectives with minibatches, the size of a minibatch is M = 80 and validation on the development set is performed after every 100 updates. For objectives where updates are performed after each seen input, the validation is run after every 8,000 updates, leading to the same number of seen inputs as for the objectives with minibatches.
For validation and at test time, the most likely parse is obtained after a beam search with a beam of size 12. The obtained parse is executed against the database to retrieve its corresponding answer, which is compared to the available gold answer. We define recall as the percentage of correct answers in the entire set and precision as the percentage of correct answers in the set of non-empty answers. The harmonic mean of recall and precision constitutes the F1 score. The stopping point is determined by the highest F1 score on the development set after 30 validations or 30 days of run time (see footnote 7) and corresponding results are reported on the test set. To measure statistical significance between models we use an approximate randomization test (Noreen, 1989).
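The answer-level precision, recall, and F1 defined above can be computed as in the following sketch. It assumes that answers are represented as strings and that an empty string or `None` stands for an empty answer; the function name is hypothetical.

```python
def answer_f1(predicted, gold):
    """Answer-level precision, recall and F1.

    predicted: list of retrieved answers ("" or None for an empty answer).
    gold: list of gold answers, aligned with predicted.
    """
    correct = sum(1 for p, g in zip(predicted, gold) if p and p == g)
    non_empty = sum(1 for p in predicted if p)
    recall = correct / len(gold) if gold else 0.0
    precision = correct / non_empty if non_empty else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Four questions: two correct answers, one empty answer, one wrong answer.
precision, recall, f1 = answer_f1(["a", "", "c", "x"], ["a", "b", "c", "d"])
```

Because empty answers are excluded from the precision denominator but not from the recall denominator, a parser that often returns nothing is penalized in recall only.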
Experimental Results. Results using the various ramp loss objectives as well as MRT are shown in Table 2. MRT outperforms the MLE baseline by about 6 percentage points in F1 score. RAMP1 performs worse than MRT, but can still significantly outperform the baseline by 3.05 points in F1 score. RAMP2 performs better than RAMP1, but outperforms MRT only nominally.
In contrast to this, by carefully selecting both a hope and fear parse, RAMP achieves a significant further 5.43 points in F1 score over MRT. By incorporating token-level feedback, our novel objective RAMP-T outperforms all other models significantly and beats the baseline by over 12 points in F1 score. Compared with RAMP, RAMP-T can take advantage of the token-level feedback that allows a model to determine which tokens in the hope output are instrumental to obtain a positive reward but are missing in the fear output. Analogously, it is possible to identify which tokens in the fear output lead to an incorrect parse, rather than also punishing the tokens in the fear output which are actually correct.
MRT is not naturally a bipolar objective. It can only discourage wrong parses if the baseline is larger than 0. Investigating the value of the baseline for 10,000 instances shows that in 37% of the cases the baseline is 0 (i.e., none of the sampled parses leads to the correct answer). As a result, 37% of the time, wrong parses are ignored rather than discouraged. To explore the importance of always discouraging wrong parses, we introduce the objective MRT NEG: it modifies the feedback for parses with a wrong answer to be −1 rather than 0, which resembles the fear output that is discouraged in the RAMP objective. With this change, the MRT objective always behaves in a bipolar manner, irrespective of the baseline's value. As a consequence, MRT NEG can significantly outperform MRT by 2.33 points in F1 score (see Table 3). This showcases the importance of utilizing bipolar supervision and it constitutes an important finding compared to previous approaches (Misra et al., 2018), where the feedback is defined to lie in the range of [0, 1]. However, MRT NEG still falls short of RAMP by 3.1 points in F1 score. This could be because of the different batch sizes, as MRT uses a batch size of 1, whereas RAMP employs a batch size of 80. To ensure that the difference between the objectives does not stem from this difference, we run an experiment with RAMP where the batch size is also set to 1 (i.e., RAMP M=1). Crucially, it still significantly outperforms MRT. At the same time, it has a lower F1 score than RAMP (see Table 3). This showcases the importance of using a larger minibatch size, so that an average over several inputs is computed before updating. In fact, its F1 score is on par with the MRT NEG objective, which uses the same minibatch size and incorporates bipolar supervision just as RAMP does.
Footnote 7: The 30-day mark was only hit by RAMP2.
Table 3: Answer F1 scores on the NLMAPS V2 test set for RAMP and the MRT objective as well as two further objectives, which help crystallize the difference between the two former objectives, averaged over two independent runs. M is the minibatch size. All differences between models are statistically significant at p < 0.01, except for the pair (3, 4).
However, RAMP M=1 should still be preferred because the RAMP objectives are more efficient than MRT objectives.
In the case of MRT, for every training instance $S + S' = 20$ queries need to be executed against the database to obtain an answer and corresponding reward. On the other hand, RAMP has to execute at most the 10 queries of the k-best list K, but often fewer if both a correct and an incorrect query are found earlier.
To summarize, RAMP can attribute its success to two factors: First, it discourages parses that receive a wrong answer rather than ignoring them as MRT often does. Second, a larger minibatch size leads to improvements because updates are based on an average over several inputs. Further performance gains can be obtained by using the token-level objective RAMP-T. Finally, RAMP objectives are more efficient because fewer outputs have to be judged.
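The first factor can be made concrete with a small sketch of the reward-minus-baseline weight that determines whether MRT promotes or discourages a sampled parse. The sketch assumes that the baseline is the average 0/1 reward of separately sampled parses; how the baseline is computed under MRT NEG is not spelled out here, so it is simply passed in. All function names are hypothetical.

```python
def average_reward(rewards):
    """Baseline: mean reward of the parses sampled for the baseline."""
    return sum(rewards) / len(rewards)

def mrt_weight(reward, baseline):
    """Positive weights promote a parse, negative weights discourage it."""
    return reward - baseline

def neg_feedback(reward):
    """MRT NEG feedback: wrong parses receive -1 instead of 0."""
    return 1 if reward == 1 else -1

# Case where none of the parses sampled for the baseline is correct.
baseline = average_reward([0] * 10)
weight_mrt = mrt_weight(0, baseline)                    # wrong parse under 0/1 feedback
weight_mrt_neg = mrt_weight(neg_feedback(0), baseline)  # wrong parse under -1 feedback
```

With a baseline of 0, the wrong parse receives a weight of 0 under the original feedback and is therefore ignored, whereas the modified feedback yields a negative weight, so the parse is actively discouraged.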

Weakly Supervised Machine Translation
Ramp Loss Objectives. We consider machine translation (MT) in a weakly supervised domain adaptation setting, where in-domain references are unavailable. In this setting, we obtain weak feedback by matching translation model outputs against cross-lingually linked documents. For each input sentence $x$, we can obtain a set of relevant documents $D^+(x) \subset D$, where $D$ is a collection of target language documents. Cross-lingual link structures can be found in many multilingual document collections, such as cross-lingual citations in patent documents or product categories in e-commerce data. In our experiments, we use links between Wikipedia documents. Instead of a reference translation, we use a relevant document $d^+$ sampled from $D^+(x)$ to guide our search for $y^+$ and $y^-$. As a relevant document provides much weaker supervision than a reference translation, we construct a more informative supervision signal by integrating negative supervision from an irrelevant document $d^-$ sampled from a collection of irrelevant contrast documents. For each input $x$, the bipolar supervision signal then consists of a pair of sampled documents $(d^+, d^-)$.
Unlike semantic parsing for question answering, our task uses a continuous reward $\delta(y) \in [0, 1]$. In fully supervised MT, a sentence-level approximation of the BLEU score can serve as the reward. However, a BLEU score between a translation and an entire document is not meaningful. We therefore propose two alternative metrics. The first, $\delta_1(y, d)$, computes how well a translation matches a relevant document. The second, $\delta_2(y, d^+, d^-)$, computes how well a translation differentiates between a relevant and an irrelevant document. $\delta_1(y, d)$ is defined as the average n-gram precision between a hypothesis and a document, multiplied by a brevity penalty. As we do not have a reference length, we include a brevity penalty term $BP$ that compares the output length to the input length. This ratio can be modified by a factor $r$ that represents the average length difference between source and target language and which can be computed over the training data:
$$\delta_1(y, d) = BP \cdot \frac{1}{N} \sum_{n=1}^{N} \frac{\sum_{u_n} c(u_n, y) \cdot \mathbb{1}_{u_n \in d}}{\sum_{u_n} c(u_n, y)} \quad (8)$$
where $u_n$ are the n-grams present in $y$, $c()$ counts the occurrences of an n-gram in $y$, $\mathbb{1}_{u_n \in d}$ is 1 if $u_n$ occurs in $d$ and 0 otherwise, and $N$ is the maximum order of n-grams used. $\delta_2(y, d^+, d^-)$ is defined as the difference between $\delta_1(y, d^+)$ and $\delta_1(y, d^-)$, subject to a linear transformation to allow values to lie between 0 and 1:
$$\delta_2(y, d^+, d^-) = 0.5 \cdot \left( \delta_1(y, d^+) - \delta_1(y, d^-) + 1 \right) \quad (9)$$
Our intuition behind this metric is that it should measure how well a translation differentiates between the relevant and irrelevant document, leading to domain-specific translations being weighted higher than domain-agnostic ones.
Table 4 shows our loss functions for the weakly supervised case. RAMP and RAMP2 define $y^+$ and $y^-$ in the same way as is done in the semantic parsing task, except that the metric $\delta_1(y, d^+)$ is used to match outputs against documents. Like Gimpel and Smith (2012), we include a scaling factor $\alpha$ to trade off the importance of the reward against the model score in determining $y^+$ and $y^-$.
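The two document-based metrics can be sketched as follows. Several details are assumptions of the sketch rather than statements about the experiments: documents are treated as flat token lists, an n-gram order for which the hypothesis has no n-grams contributes a precision of 0, the brevity penalty is passed in as a precomputed value instead of being derived from the input length and the factor r, and the rescaling to [0, 1] is taken to be 0.5 times the difference plus 1.

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of order n in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def delta1(hyp, doc, max_n=4, brevity_penalty=1.0):
    """Average n-gram precision of hyp against doc, times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        counts = Counter(ngrams(hyp, n))
        total = sum(counts.values())
        doc_ngrams = set(ngrams(doc, n))
        matched = sum(c for u, c in counts.items() if u in doc_ngrams)
        precisions.append(matched / total if total else 0.0)
    return brevity_penalty * sum(precisions) / max_n

def delta2(hyp, rel_doc, irr_doc, **kwargs):
    """Difference of delta1 on relevant and irrelevant document, mapped to [0, 1]."""
    return 0.5 * (delta1(hyp, rel_doc, **kwargs) - delta1(hyp, irr_doc, **kwargs) + 1.0)

hyp = "the old town".split()
relevant = "the old town lies in the north".split()
irrelevant = "markets fell sharply".split()
score = delta2(hyp, relevant, irrelevant, max_n=2)
```

A hypothesis that matches both documents equally well receives a score of 0.5, so only translations that are specific to the relevant document are rewarded above that level.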
Note that these objectives do not include negative supervision from $d^-$. Using the metrics defined above, we formulate two objectives that include $d^-$. RAMP− defines $y^+$ in the same way as RAMP, but uses a different definition of $y^-$: instead of using a fear output with respect to $d^+$ (i.e., a translation with high probability and low reward $\delta_1(y, d^+)$), we use a hope output with respect to $d^-$ (i.e., a translation with high probability and high reward $\delta_1(y, d^-)$). As this translation matches an irrelevant document well, it can be used as a negative output. The same definition of $y^-$ is also used in RAMP1−. Note that this objective does not include positive supervision from $d^+$. Finally, RAMP δ2 incorporates $d^+$ and $d^-$ in a different way. This objective defines $y^+$ as a hope and $y^-$ as a fear, but uses the joint metric $\delta_2(y, d^+, d^-)$ with respect to the document pair $(d^+, d^-)$.
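How the different objectives pick their hope and fear outputs can be sketched as below. The scoring functions are an assumption of the sketch: a hope maximizes the model probability minus α times the cost (one minus the reward), and a fear maximizes the model probability plus α times the cost, by analogy with cost-diminished and cost-augmented decoding. The dictionary keys and function names are hypothetical.

```python
def hope(k_best, reward_key, alpha):
    """High probability and high reward (assumed cost-diminished scoring)."""
    return max(k_best, key=lambda c: c["prob"] - alpha * (1.0 - c[reward_key]))

def fear(k_best, reward_key, alpha):
    """High probability and low reward (assumed cost-augmented scoring)."""
    return max(k_best, key=lambda c: c["prob"] + alpha * (1.0 - c[reward_key]))

def select_pair(k_best, objective, alpha=10.0):
    """Return (y_plus, y_minus) for the given objective."""
    if objective == "RAMP":         # hope and fear with respect to d+
        return hope(k_best, "d1_pos", alpha), fear(k_best, "d1_pos", alpha)
    if objective == "RAMP-":        # hope w.r.t. d+, hope w.r.t. d- as negative
        return hope(k_best, "d1_pos", alpha), hope(k_best, "d1_neg", alpha)
    if objective == "RAMP-delta2":  # hope and fear with respect to the joint metric
        return hope(k_best, "d2", alpha), fear(k_best, "d2", alpha)
    raise ValueError(f"unknown objective: {objective}")

def with_joint_metric(candidate):
    """Add the joint metric, assuming the 0.5 * (difference + 1) rescaling."""
    joint = 0.5 * (candidate["d1_pos"] - candidate["d1_neg"] + 1.0)
    return {**candidate, "d2": joint}

# Toy k-best list with precomputed rewards against d+ and d-.
k_best = [with_joint_metric(c) for c in (
    {"y": "a", "prob": 0.5, "d1_pos": 0.2, "d1_neg": 0.9},
    {"y": "b", "prob": 0.3, "d1_pos": 0.8, "d1_neg": 0.1},
    {"y": "c", "prob": 0.2, "d1_pos": 0.1, "d1_neg": 0.2},
)]
ramp = select_pair(k_best, "RAMP")
ramp_minus = select_pair(k_best, "RAMP-")
```

In the toy list, RAMP selects as its negative output the candidate that matches the relevant document worst, whereas RAMP− selects the candidate that matches the irrelevant document best, which illustrates how the two objectives can disagree on what to discourage.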
Experimental Setup. We test our objectives on a weakly supervised English-German Wikipedia translation task first proposed in Jehl and Riezler (2016). In-domain training data are 10,000 English sentences with relevant German documents sampled from the WikiCLIR corpus (Schamoni et al., 2014). The task includes a small in-domain development and test set (dev: 1,712 sentences, test: 1,526 sentences), each consisting of four Wikipedia articles on diverse subjects. Irrelevant documents $d^-$ are sampled from the German side of the News Commentary data set, which contains document boundary information.
Byte-pair encoding (Sennrich et al., 2016) with 30,000 merge operations is applied to all source and target data. Sentences longer than 80 words are removed from the training set. Our neural MT model uses 500-dimensional word embeddings and a hidden layer dimension of 1,024. Encoder and decoder use GRU units. An out-of-domain model is trained on 2.1 million sentence pairs from Europarl v7 (Koehn, 2005), News Commentary v10, and the MultiUN v1 corpus (Eisele and Chen, 2010). The baseline (MLE) is trained using the MLE objective and ADADELTA (Zeiler, 2012) for 20 epochs. We train on batches of 64 and use dropout for regularization, with a dropout rate of 0.2 for embedding and hidden layers and 0.1 for source and target layers. Gradients are clipped if their norm exceeds 1.0. The metric-augmented objectives are trained using SGD. All hyperparameters are chosen on the development set. For the ramp loss objectives, we use a learning rate of 0.005, α = 10, and a k-best size of 16. We compare ramp loss to MRT using both $\delta_1(y, d^+)$ and $\delta_2(y, d^+, d^-)$ as the external cost function, denoted as MRT δ1 and MRT δ2, respectively. MRT is trained using a learning rate of 0.05, $S = 16$, and $S' = 10$. For testing and validation, translations are obtained using beam search with a beam size of 16. Results are validated every 200 updates and training is run for 25 validations. The stopping point is determined by the BLEU score (Papineni et al., 2001) on the development set. We report scores computed with Moses' multi-bleu.perl on tokenized, truecased output. Results are averaged over 2 runs.
Table 4: Configurations for $y^+$ and $y^-$ for weakly supervised MT adaptation. $\hat{y}$ is the highest-probability model output. $\pi_w(y|x)$ is the probability of $y$ under the model. The arg max over $y$ is taken over the k-best list $K(x)$. $\alpha$ is a scaling factor regulating the influence of the metric compared to the model probability. $\delta_1$ and $\delta_2$ are metrics defined with respect to relevant and irrelevant documents $d^+$ and $d^-$ (see Eq. 8 and 9).
Experimental Results. Results for the different objectives can be found in Table 5. The objectives that rely on only $d^+$ or only $d^-$ lead to a decrease in performance. This shows that supervision from only $d^+$ or only $d^-$ is insufficient. The deteriorating effect is strongest for RAMP, which uses $d^+$ to select both $y^+$ and $y^-$. We explain this by the fact that $d^+$ is an imperfect label. Trying to push the model to perfectly reproduce $d^+$ will not lead to a good translation. The same observation holds true for MRT δ1. This objective only includes the reward $\delta_1(y, d^+)$. Compared with the RAMP objectives, the decrease for MRT δ1 is smaller. On the other hand, MRT δ2, which incorporates bipolar supervision, produces a nominal improvement over the MLE baseline. This objective is outperformed by RAMP− and RAMP δ2. Both objectives produce a small, but significant, improvement of 0.3% BLEU over the MLE baseline. This result shows that bipolar supervision is crucial for success in this weak supervision scenario. It also shows that unlike MRT, for the bipolar ramp loss it does not matter whether $\delta_1$ or $\delta_2$ is used, as they both capture the same idea. The superiority of these objectives over MRT shows again the success of intelligently selecting positive and negative outputs. Another small, but significant, improvement is produced by the token-level variant RAMP−-T, leading to the best overall result.
To summarize, we find that for this task, which uses very weak supervision from document-level links, small improvements can be obtained. To achieve these improvements, it is imperative to use objectives that include bipolar supervision from d + and d − . This finding holds for both ramp loss and MRT. The best overall result is obtained using ramp loss in the token-level variant.
Analysis of Translation Results. As the improvements in the translation experiments are very small, we conduct a small-scale analysis to better determine the nature of the gains. Our analysis is inspired by Bentivogli et al. (2016). We compare the weakly supervised MLE baseline to the best experiment in this setting, which uses the bipolar token-level ramp loss RAMP − -T.
We first analyze the performance by sentence length. We separate the translations into source length brackets and score each bracket separately. The brackets represent quartiles of the source length distribution, ensuring an approximately equal number of sentences in each bracket. Results are shown in Figure 2. For all systems, we observe a drop in performance up to an input length of 33. Surprisingly, BLEU scores increase again for the top bracket (source length > 33). For this bracket, we also see the biggest gap between MLE and RAMP−-T of 0.52 and 0.67% BLEU for the two runs. This increase is mitigated by much weaker increases in the bottom brackets. A possible explanation for the weaker performance of MLE in the top bracket is the observation that hypotheses produced by the MLE system are longer than for RAMP−-T. For the top bracket, hypothesis lengths exceed reference lengths for all systems. However, for MLE this over-generation is more severe at 106% of the reference length, compared to RAMP−-T at 102%, potentially causing a higher loss in precision.
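The bracketing by quartiles of the source length distribution can be sketched as follows. The function name is hypothetical, the sketch only groups sentence indices and leaves the per-bracket BLEU scoring to an external tool, and the quartile boundaries follow the default method of Python's `statistics.quantiles`, which may differ slightly from the boundaries used in the analysis.

```python
import statistics
from bisect import bisect_left

def length_brackets(source_lengths):
    """Assign sentence indices to four brackets by source length quartile.

    Returns the three cut points and a dict mapping bracket id (0 = shortest
    sentences, 3 = longest) to the list of sentence indices in that bracket.
    """
    cuts = statistics.quantiles(source_lengths, n=4)  # three cut points
    brackets = {i: [] for i in range(4)}
    for idx, length in enumerate(source_lengths):
        brackets[bisect_left(cuts, length)].append(idx)
    return cuts, brackets

# Toy source lengths 1..20 (illustrative only).
cuts, brackets = length_brackets(list(range(1, 21)))
sizes = [len(brackets[i]) for i in range(4)]
```

Each bracket can then be scored separately by extracting the corresponding hypotheses and references and running the same BLEU script as for the full test set.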
As our test set consists of parallel sentences extracted from four Wikipedia articles, we can examine the performance for each article separately. Figure 3 shows the results. We observe large differences in performance according to article ID. These are probably caused by some articles being more similar to the out-of-domain training data than others. Comparing RAMP − -T and MLE, we see that RAMP − -T outperforms MLE for each article by a small margin. Figure 4 shows the size of the improvements by article. We observe that margins are bigger on articles with better baseline performance. This suggests that there are challenges arising from domain mismatch that are not addressed by our method.
Lastly, we present an examination of example outputs. Table 6 shows an example of a long sentence from Article 2, which describes the German town of Schüttorf. This article is originally in German, meaning that our model is back-translating from English into German. The reference contains some awkward or even ungrammatical phrases such as ''was developing itself'', a literal translation from German. The example also illustrates that translating Wikipedia involves handling frequent proper names (there are 11 proper names in the example). Both models struggle with translating proper names, but RAMP − -T produces the correct phrase ''Gathmann & Gerdemann'', while MLE fails to do so. The RAMP − -T translation is also fully grammatical, whereas MLE incorrectly translates the main verb phrase ''was developing itself'' into a relative clause, and contains an agreement error in the translation of the noun phrase ''one of the original textile companies''. Although making fewer errors in grammar and proper name translation, RAMP − -T contains two deletion errors and MLE only contains one. This could be caused by the active optimization of sentence length in the ramp loss model.

Table 6, Source: Towards the end of the 19th century, a strong textile industry was developing itself in Schüttorf with several large local businesses (Schlikker & Söhne, Gathmann & Gerdemann, G. Schümer & Co. and ten Wolde, later Carl Remy; today's RoFa is not one of the original textile companies, but was founded by H. Lammering and later taken over by Gerhard Schlikker jun., Levert Rost [...]

Fully Supervised Machine Translation
Our work focuses on weakly supervised tasks, but we also conduct experiments using a fully supervised MT task. These experiments are motivated on the one hand by adapting the findings of Gimpel and Smith (2012) to the neural MT paradigm, and on the other hand by expanding the work by Edunov et al. (2018) on applying classical structured prediction losses to neural MT.
Ramp Loss Objectives. For fully supervised MT we assume access to one or more reference translations $\bar{y}$ for each input $x$. The reward $\mathrm{BLEU}_{+1}(y, \bar{y})$ is a per-sentence approximation of the BLEU score.¹¹ Table 7 shows the different definitions of $y^+$ and $y^-$, which give rise to different ramp losses. RAMP, RAMP1, and RAMP2 are defined analogously to the other tasks. We again include a hyperparameter $\alpha > 0$ interpolating cost function and model score when searching for $y^+$ and $y^-$. Gimpel and Smith (2012) also include the perceptron loss in their analysis. PERC1 is a re-formulation of the Collins perceptron (Collins, 2002) where the reference is used as $y^+$ and $\hat{y}$ is used as $y^-$. A comparison with PERC1 is not possible for the weakly supervised tasks in the previous sections, as gold structures are not available for these tasks. With neural MT and subword methods we are able to compute this loss for any reference without running into the problem of reachability that was faced by phrase-based MT (Liang et al., 2006). However, using sequence-level training towards a reference can lead to degenerate solutions where the model gives low probability to all its predictions (Shen et al., 2016). PERC2 addresses this problem by replacing $\bar{y}$ by a surrogate translation that achieves the highest $\mathrm{BLEU}_{+1}$ score in $K(x)$. This approach is also used by Edunov et al. (2018) for the loss functions which require an oracle. PERC1 corresponds to equation (9), PERC2 to equation (10) of Gimpel and Smith (2012).

| Loss  | $y^+$ | $y^-$ |
|-------|-------|-------|
| RAMP  | $\arg\max_y \pi_w(y|x) - \alpha(1 - \mathrm{BLEU}_{+1}(y, \bar{y}))$ | $\arg\max_y \pi_w(y|x) + \alpha(1 - \mathrm{BLEU}_{+1}(y, \bar{y}))$ |
| RAMP1 | $\hat{y}$ | $\arg\max_y \pi_w(y|x) + \alpha(1 - \mathrm{BLEU}_{+1}(y, \bar{y}))$ |
| RAMP2 | $\arg\max_y \pi_w(y|x) - \alpha(1 - \mathrm{BLEU}_{+1}(y, \bar{y}))$ | $\hat{y}$ |
| PERC1 | $\bar{y}$ | $\hat{y}$ |
| PERC2 | $\arg\max_y \mathrm{BLEU}_{+1}(y, \bar{y})$ | $\hat{y}$ |

Table 7: Configurations for $y^+$ and $y^-$ for fully supervised MT. $\hat{y}$ is the highest-probability model output, $\bar{y}$ is a gold standard reference. $\pi_w(y|x)$ is the probability of $y$ according to the model. The $\arg\max_y$ is taken over the k-best list $K(x)$. $\mathrm{BLEU}_{+1}$ is smoothed per-sentence BLEU and $\alpha$ is a scaling factor.
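The configurations in Table 7 can be illustrated with a short sketch that picks the two outputs from a k-best list. The `Candidate` container and the `select_pair` function are hypothetical names introduced only for this illustration, and we assume that each candidate already carries its model score and its BLEU+1 value against the reference. How the selected pair then enters the loss is not shown here.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Candidate(NamedTuple):
    text: str
    score: float  # model score for y given x (assumed to be precomputed)
    bleu: float   # BLEU+1 against the reference, assumed to lie in [0, 1]


def select_pair(
    kbest: Sequence[Candidate],
    loss: str,
    alpha: float = 10.0,
    reference: Optional[Candidate] = None,
) -> Tuple[Candidate, Candidate]:
    """Return (y_plus, y_minus) for one input, following Table 7."""
    # Highest-scoring model output.
    y_hat = max(kbest, key=lambda c: c.score)
    # "Hope": high model score and low cost (cost-diminished search).
    hope = max(kbest, key=lambda c: c.score - alpha * (1.0 - c.bleu))
    # "Fear": high model score and high cost (cost-augmented search).
    fear = max(kbest, key=lambda c: c.score + alpha * (1.0 - c.bleu))
    # Surrogate reference: best BLEU+1 within the k-best list.
    oracle = max(kbest, key=lambda c: c.bleu)

    if loss == "PERC1":
        if reference is None:
            raise ValueError("PERC1 needs the gold reference as y_plus")
        return reference, y_hat
    pairs = {
        "RAMP": (hope, fear),
        "RAMP1": (y_hat, fear),
        "RAMP2": (hope, y_hat),
        "PERC2": (oracle, y_hat),
    }
    return pairs[loss]
```

Only RAMP searches in both directions, which is what makes it bipolar: RAMP1 and RAMP2 each replace one side of the pair with the plain model prediction.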
Experimental Setup. We conduct experiments on the IWSLT 2014 German-English task, which is based on Cettolo et al. (2012) in the same way as Edunov et al. (2018). The training set contains 160K sentence pairs. We set the maximum sentence length to 50 and use BPE with 14,000 merge operations. Edunov et al. (2018) sample 7K sentences from the training set as heldout data. We do the same, but use only one tenth of the data as the heldout set so that we can validate often.
Our baseline system (MLE) is a BiLSTM encoder-decoder with attention, which is trained using the MLE objective. Word embedding and hidden layer dimensions are set to 256. We use batches of 64 sentences for baseline training and batches of 40 inputs for training RAMP and PERC variants. MRT makes an update after each input using all sampled outputs, resulting in a batch size of 1. All experiments use dropout for regularization, with dropout probability set to 0.2 for embedding and hidden layers and to 0.1 for source and target layers. During MLE training, the model is validated every 2500 updates and training is stopped if the MLE loss on the heldout set worsens for 10 consecutive validations.
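The stopping criterion for MLE training can be sketched as a simple patience counter. The class name `PatienceStopper` is hypothetical, and the sketch adopts one possible reading of the criterion, namely that a validation counts as worse whenever the heldout loss fails to improve on the best value seen so far. Comparing against the immediately preceding validation would be an equally plausible reading.

```python
class PatienceStopper:
    """Stop training after `patience` consecutive validations without improvement."""

    def __init__(self, patience: int = 10):
        self.patience = patience
        self.best = float("inf")
        self.bad_validations = 0

    def update(self, heldout_loss: float) -> bool:
        """Record one validation result; return True if training should stop."""
        if heldout_loss < self.best:
            # New best heldout loss: reset the counter.
            self.best = heldout_loss
            self.bad_validations = 0
        else:
            self.bad_validations += 1
        return self.bad_validations >= self.patience
```

In a training loop, `update` would be called once per validation, that is, every 2500 updates in the setup described above.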
For metric-augmented training, we use SGD for optimization with learning rates optimized on the development set. Ramp losses and PERC2 use a k-best list of size 16. For ramp loss training, we set α = 10. RAMP and PERC variants both use a learning rate of 0.001. A new k-best list is generated for each input using the current model parameters. We compare ramp loss to MRT as described above. For MRT, we use SGD with a learning rate of 0.01 and set S = 16 and S = 10. As Edunov et al. (2018) observe beam search to work better than sampling for MRT, we also run an experiment in this configuration, but find no difference between results. As beam search runs significantly slower, we only report sampling experiments.
The model is validated on the development set after every 200 updates for experiments with batch size 40 and after 8,000 updates for MRT experiments with batch size 1. The stopping point is determined by the BLEU score on the heldout set after 25 validations. As we are training on the same data as the MLE baseline, we also apply dropout during ramp loss training to prevent overfitting. BLEU scores are computed with Moses' multi-bleu.perl on tokenized, truecased output. Each experiment is run 3 times and results are averaged over the runs.
Experimental Results. As shown in Table 8, all experiments except for PERC1 yield improvements over MLE, confirming that sequence-level losses that update towards the reference can lead to degenerate solutions. For MRT, our findings show similar performance to the initial experiments reported by Edunov et al. (2018), who gain 0.24 BLEU points on the same test set.¹² PERC2 and RAMP2 improve over the […]

The main difference between RAMP and RAMP1, compared to PERC2 and RAMP2, is the fact that the latter objectives use $\hat{y}$ as $y^-$, whereas the former use a fear translation with high probability and low $\mathrm{BLEU}_{+1}$. We surmise that for this fully supervised task, selecting a $y^-$ which has some known negative characteristics is more important for success than finding a good $y^+$. RAMP, which fulfills both criteria, still outperforms RAMP2. This result re-confirms the superiority of bipolar objectives compared to non-bipolar ones. Although still improving over MLE, token-level ramp loss RAMP-T is outperformed by RAMP by a small margin. This result suggests that when using a metric-augmented objective on top of an MLE-trained model in a full supervision scenario without domain shift, there is little room for improvement from token-level supervision, while gains can still be obtained from additional sequence-level information captured by the external metric, such as information about the sequence length.

¹² See their Table 2. Using interpolation with the MLE objective, Edunov et al. (2018) achieve +0.7 BLEU points. As we are only interested in the effect of sequence-level objectives, we do not add MLE interpolation. The best model by Edunov et al. (2018) achieved a BLEU score of 32.91%. It is possible that these scores are not directly comparable to ours due to different pre- and post-processing. They also use a multi-layer CNN architecture (Gehring et al., 2017), which has been shown to outperform a simple RNN architecture such as ours.
To summarize, our findings on a fully supervised task show the same small margin for improvement as Edunov et al. (2018), without any further tuning of performance (e.g., by interpolation with the MLE objective). Bipolar RAMP is found to outperform the other losses. This observation is also consistent with the results by Gimpel and Smith (2012) for phrase-based MT. We conclude that for fully supervised MT, deliberately selecting a hope and fear translation is beneficial.

Conclusion
We presented a study of weakly supervised learning objectives for three neural sequence-to-sequence learning tasks. In our first task of semantic parsing, question-answer pairs provide a weak supervision signal to find parses that execute to the correct answer. We show that ramp loss can outperform MRT if it incorporates bipolar supervision where parses that receive negative feedback are actively discouraged. The best overall objective is constituted by the token-level ramp loss. Next, we turn to weak supervision for machine translation in the form of cross-lingual document-level links. We present two ramp loss objectives that combine bipolar weak supervision from a linked document d + and an irrelevant document d − . Again, the bipolar ramp loss objectives outperform MRT, and the best overall result is obtained using token-level ramp loss. Finally, to tie our work to previous work on supervised machine translation, we conduct experiments in a fully supervised scenario where gold references are available and a metric-augmented loss is desired to reduce the exposure bias and the loss-evaluation mismatch. Again, the bipolar ramp loss objective performs best, but we find that the overall margin for improvement is small without any additional engineering. We conclude that ramp loss objectives show promise for neural sequence-to-sequence learning, especially when it comes to weakly supervised tasks where the MLE objective cannot be applied. In contrast to ramp losses that either operate only in the undesirable region of the search space (''cost-augmented decoding'' as in RAMP1) or only in the desirable region of the search space (''cost-diminished decoding'' as in RAMP2), bipolar RAMP operates in both regions of the search space when extracting supervision signals from weak feedback. We showed that MRT can be turned into a bipolar objective by defining a metric that assigns negative values to bad outputs. This improves the performance of MRT objectives. However, the ramp loss objective is still superior as it is easy to implement and efficient to compute. Furthermore, on weakly supervised tasks our novel token-level ramp loss objective RAMP-T can obtain further improvements over its sequence-level counterpart because it can more directly assess which tokens in a sequence are crucial to its success or failure.