Are Ellipses Important for Machine Translation?

Abstract
This article describes an experiment to evaluate the impact of different types of ellipses discussed in theoretical linguistics on Neural Machine Translation (NMT), using English as the source language and Hindi and Telugu as the target languages. Manual evaluation shows that most of the errors made by Google NMT are located in the clause containing the ellipsis, that such errors are slightly more frequent in Telugu than in Hindi, and that translation adequacy improves when ellipses are reconstructed with their antecedents. These findings not only confirm the importance of ellipses and their resolution for MT, but also hint at a possible correlation between the translation of discourse devices like ellipses and the morphological incongruity of the source and target. We also observe that not all ellipses are translated poorly or benefit from reconstruction, which argues for treating different ellipsis types differently in MT research.


Introduction
Ellipsis is a linguistic phenomenon in which parts of a sentence are omitted, and have to be retrieved from discourse or real-world context. For example, in (1), the phrase like apples is deleted at the site marked by [e] and can be understood from the context.

1. Kim likes apples, but Alex does not [e].
Ellipsis is a form of anaphora that often functions to reduce redundancy in language and improve discourse cohesion (Menzel 2017; Mitkov 1999). Languages provide various mechanisms to elide information, based on which different ellipses are defined in linguistics. For our study, following the theory of ellipses in Halliday and Hasan (1976) and Miller and Pullum (2013), we classify ellipses as nominal, verbal, and clausal. Nominal ellipses correspond to the deletion of the head noun, as in (2), sometimes with its dependents, as in (3). They are also called head noun ellipsis (McShane, Nirenburg, and Babkin 2015) and Noun Phrase Ellipsis (Corver and van Koppen 2011). A more recent, theory-neutral term for such constructions is fused-head NPs, since the phrasal head here is realized jointly with a dependent function (Huddleston and Pullum 2002).
2. My sister's two boys are wild, but John's two [e] are really quite well-behaved.
3. They adopted Mary's analysis on data as John's [e] was poorly structured.
Ellipses occur in the environment of certain syntactic structures or trigger words, known as the licensors of ellipses. The nominal ellipsis in (2) is licensed by the cardinal number two and in (3) by the genitive proper noun John's. Demonstrative determiners, quantifiers, and so forth, can also license nominal ellipses (Khullar, Majmundar, and Shrivastava 2020; Khullar, Anthony, and Shrivastava 2019; Menzel 2017; Halliday and Hasan 1976).
Verbal ellipsis, verb ellipsis, or verb phrase ellipsis (VPE) is the deletion of the main verb, as in (4). We also have instances of post-auxiliary ellipsis (PAE), as in (5), where the ellipsis is licensed by a modal or auxiliary verb (Sag 1976; Hankamer 1978).
Finally, when an entire clause in a sentence is deleted to avoid repetition of information, it is called clausal ellipsis, as in (6). This phenomenon, also known as sluicing, is licensed by wh-words. Predicate ellipses, as in (7), have been loosely grouped into the clausal ellipsis category by Halliday and Hasan (1976). However, we identify them with instances of PAE where the negation is contracted to the auxiliary.
6. We have a linguistics exam, but I do not remember when [e].
7. She was always talking about who was good-looking and who wasn't [e].
The context from which an ellipsis gets its sense and/or reference is called the antecedent. If the antecedent is present textually, the ellipsis is endophoric. However, if the ellipsis cannot be recovered from a co-text, it is exophoric (Miller and Pullum 2013) or situational ellipsis, such as in (8), where the interlocutors infer the missing information using situational cues and the knowledge of the grammar of the language.
8. I will take two [e].
For this study, we do not take into account closely related phenomena such as do-so anaphora, as in (9), where the exact site of ellipsis is not evident, and one-anaphora, as in (10), where a noun is replaced with the non-lexical proform one. They are usually discussed as cases of substitution rather than ellipsis.
9. She swims really fast and so do I.
10. The upper room is smaller than the lower one.
Ellipses are not very frequent in text, but they are important for improving the accuracy of Natural Language Processing (NLP) systems that handle data with ellipses (Zhang et al. 2019; Dean, Cheung, and Precup 2016). One such NLP application is Machine Translation (MT). This is because the elided parts of the text are not overtly available at the surface syntax for text processing, and their meaning may come from context that is often present outside the current sentence, or, in some cases, may not be endophorically available at all. Thus, the MT process has to represent this missing information from the source correctly in the target, which could become more challenging when the source and target languages use different strategies to elide information. The empirical evidence to confirm the extent of this impact is sparse. In this article, we conduct a data-driven study to gauge the size of the research problem we are addressing for different ellipsis types, using English-Hindi and English-Telugu as source and target language pairs. Both English and Hindi belong to the Indo-European language family and, hence, show some linguistic similarities, although Hindi is more inflectional and morphologically richer than English. Telugu, on the other hand, is an agglutinative language from the Dravidian language family, and is unrelated to English. Selecting these language pairs allows us to assess the errors in relation to the degree of morphological dissimilarity between the source and the target languages.
The main source of inspiration for this empirical study comes from the recent work on MT for English-Russian by Voita, Sennrich, and Titov (2019), where VPE has been identified as one of the linguistic phenomena that cause inconsistencies in translation output, along with discourse structures like deixis and lexical cohesion. Using their findings as the starting point, we conduct an empirical study to determine the impact of different ellipses on an existing NMT system for English to Hindi/Telugu.

Experimental Setup
We prepare two test sets: the first contains sentences with the three aforementioned ellipsis types, and the second contains the same sentences with resolved ellipses. We gather the sentences from various popular annotated corpora, such as the VPE corpus by Bos and Spenader (2011), the NoEl corpus (Khullar, Majmundar, and Shrivastava 2020), a curated ellipses dataset by Khullar, Anthony, and Shrivastava (2019), and the GECCo Corpus (Menzel and Lapshinova-Koltunski 2014). For consistency and fair analysis, we randomly pick 500 sentences for each ellipsis type, which results in a total of 1,500 sentences. The antecedent of the ellipsis is frequently present in the same sentence as the ellipsis. It can also occur in the previous or following sentence, although the latter is comparatively rare (Khullar, Majmundar, and Shrivastava 2020). For our study, we pick sentences where the ellipsis and its antecedent occur in the same sentence. Hence, they can be handled by MT systems that only operate sentence by sentence.
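The sampling step described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `sample_per_type`, the `(ellipsis_type, text)` input format, and the fixed random seed are hypothetical and are not taken from the corpora or from our actual tooling.

```python
import random
from collections import defaultdict


def sample_per_type(sentences, per_type=500, seed=0):
    """Randomly pick `per_type` sentences for each ellipsis type.

    `sentences` is an iterable of (ellipsis_type, text) pairs
    (an assumed input format, used here for illustration only).
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_type = defaultdict(list)
    for ellipsis_type, text in sentences:
        by_type[ellipsis_type].append(text)

    sample = {}
    for ellipsis_type, pool in by_type.items():
        if len(pool) < per_type:
            raise ValueError(
                f"only {len(pool)} sentences available for {ellipsis_type!r}"
            )
        # random.sample draws without replacement, so no sentence repeats
        sample[ellipsis_type] = rng.sample(pool, per_type)
    return sample
```

With three ellipsis types and the default `per_type=500`, the result holds 1,500 sentences in total, matching the size of the test set.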
For the second test set, we manually reconstruct each ellipsis using the resolution marked in the respective ellipsis corpus. Reconstruction is one of the acceptable ways to resolve ellipses in some linguistic theories, which treat resolution as a search for an antecedent that can be substituted at the ellipsis site to produce a well-formed string with the same meaning as the elided string (Lappin and Shih 1996; Chomsky 1995; Wasow 1972). Thus, for a sentence like (11), the ellipsis reconstruction procedure leads to (12).
11. You definitely saved Kendall's life today, but not Pike's.
12. You definitely saved Kendall's life today, but not Pike's life.
Because the antecedents do not always occur strictly under identity with the elided material, reconstructed sentences such as (13) and (14) become grammatically incorrect. In the ellipses test set, we find 38 such sentences involving VPE or PAE and 12 nominal ellipses. All these errors are caused by a mismatch in agreement morphology. We manually correct the sentences and verify them with a native English speaker.
13. *I gave him one pencil, but he wanted three [pencil].
14. *John lives with his grandparents, but Bill does not [lives with his grandparents].
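The reconstruction procedure, and the agreement problem it can create, can be illustrated with a minimal sketch. It assumes that the ellipsis site is marked with the `[e]` placeholder used in the introduction and that the antecedent string is already known; the function name `reconstruct` is hypothetical. In our study this step was done manually, so the code only mirrors the idea of verbatim substitution.

```python
def reconstruct(sentence, antecedent, site="[e]"):
    """Substitute the antecedent at the (assumed) ellipsis site marker."""
    if site not in sentence:
        raise ValueError("no ellipsis site marked in sentence")
    # Replace only the first marker; one ellipsis per sentence is assumed.
    return sentence.replace(site, antecedent, 1)


# Substitution under identity yields a well-formed sentence, as in (12).
print(reconstruct(
    "You definitely saved Kendall's life today, but not Pike's [e].", "life"))

# A verbatim copy of the antecedent ignores agreement, as in (13).
print(reconstruct(
    "I gave him one pencil, but he wanted three [e].", "pencil"))
```

The second call returns the ungrammatical string with "three pencil", which is why such sentences need a manual correction of the agreement morphology after substitution.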
We create the second test set to check if reconstruction offers any advantage in MT research. We use manually reconstructed, gold sentences to analyze the exact impact of this procedure on MT. In practice, the text can be fed into an ellipses resolution system (Khullar 2020; Zhang et al. 2019) as a preprocessing step. This means that the accuracy of such a system will also contribute to the final translation quality.
To obtain the translations, we use Google NMT, which comprises a deep LSTM network with 8 encoder and 8 decoder layers with attention and residual connections (Wu et al. 2016). It is freely available for translations between English and nine Indian languages, including Hindi and Telugu. It is competitive with state-of-the-art MT systems and allows us to use the same underlying model for both language pairs.

Results and Discussion
We opt for manual evaluation to focus on the translation of the parts of the sentence containing the ellipsis and its antecedent. For each language pair, two linguists assign a category to each translated sentence from a list of seven proposed categories, summarized in Table 1. The sentences for which both linguists assign the same category are separated out directly for analysis. The inter-annotator agreement is high (0.89 for Hindi and 0.91 for Telugu), which indicates the reliability of our evaluation. For the sentences where the category labels differ, the linguists discuss the disagreement and check whether they can agree on a single category. No disagreement remains unresolved for any sentence at the end. Therefore, no sample is excluded from the analysis. The correctly translated English sentences are further analyzed for the representation of the ellipses. When the source and target show a similar ellipsis strategy, as in (15), they are assigned the category A. For each source (S) and target (T) pair, we present a gloss (G) of the latter as per the Leipzig Glossing Rules. To avoid repetitiveness, we add the meaning (M) of the target only when it is different from the source.

15. S She bought a car but I don't know when.
T Āme kāru konnad-i kānī eppuu uundō nāku teliya-du
G she car bought-a but when be I know-NEG
The assigned category is B when the target sentence has a different ellipsis strategy than the source but the meaning remains unchanged. For example, in (16), the target has noun modifier coordination and not ellipsis.
16. S I will just take a minute or two.
T mujh-e bas ek-do minat lagenge
G 1SG-OBL just one-two minute take-FT
The category C is for samples in which the target does not have the ellipsis seen in the source, such as in (17), although the meaning is perfectly localized.

17. S But you knew that, didn't you?
T Kānī mīku adi telusu, kādā?
G but you that knew right
M 'But you knew that, right?'
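The inter-annotator agreement figures reported above are chance-corrected scores on the category labels of the two linguists. As a hedged illustration, the sketch below computes Cohen's kappa, a common choice for two annotators and nominal labels; treating the reported values as kappa is an assumption of this example, and the function name `cohen_kappa` is hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: the agreement score is Cohen's kappa; the labels would be
    the category letters A-G assigned to each translated sentence.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labelled independently
    # according to their own label distributions.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)
```

A value of 1 means perfect agreement, and 0 means agreement no better than chance.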

Error Analysis
Of the 1,500 sentences in the first test set, the Hindi translations of 1,066 sentences and the Telugu translations of 1,201 sentences receive a label from categories D-G (see Table 2). Hence, over 70% of the translations in both languages are poor (about 71% for Hindi and 80% for Telugu). The higher frequency of errors in Telugu could hint at a possible relation between the translation of ellipses and the degree of morphological dissimilarity between the source and the target. Among the incorrect translations, the number of sentences assigned categories F/G is far greater than the number assigned D/E. This implies that, despite the translation errors, most of these sentences are still grammatically acceptable. In other words, the translation adheres to the target language (fluency), but does not capture the source text well (adequacy). This is in line with the observation made in Voita, Sennrich, and Titov (2019) that the translation of a sentence containing a discourse structure such as ellipsis often looks correct when read independently but not in context. We also note that most of the errors are located in the phrase containing the ellipsis, indicating that translating elided parts of a sentence is indeed hard. We now analyze the errors for each ellipsis type.
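The error rates behind these counts follow from simple arithmetic, shown here as a short sketch. The counts are the ones reported above; the helper name `error_rate` is hypothetical.

```python
TOTAL_SENTENCES = 1500
# Sentences whose translation received a label from categories D-G.
ERROR_COUNTS = {"Hindi": 1066, "Telugu": 1201}


def error_rate(errors, total):
    """Share of sentences with an erroneous translation."""
    return errors / total


for language, count in ERROR_COUNTS.items():
    rate = error_rate(count, TOTAL_SENTENCES)
    print(f"{language}: {count}/{TOTAL_SENTENCES} = {rate:.1%}")
# Hindi: 1066/1500 = 71.1%
# Telugu: 1201/1500 = 80.1%
```

The Telugu output thus contains 135 more erroneous sentences than the Hindi output, a gap of about nine percentage points.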

Noun Ellipsis.
In the first type of error, from category D, the translated sentence is fairly comprehensible but has small grammatical errors. These errors stem from incorrect agreement morphology between the elided noun (and/or the noun modifiers) and the verb. For example, in (18), the Hindi word for gave has masculine gender, which is incorrect as the elided noun baskets in Hindi bears feminine gender, resulting in a gender agreement mismatch between the subject and verb. We display the errors in red.
18. S She brought three baskets, and gave us one.
T *vah teen tokariyaan laee, aur ham-en ek di-ya
G she three baskets(F) brought, and 1PL-ACC one give-M.PERF
The phrase containing the ellipsis is sometimes translated literally from the source into the target, even though the latter does not have the same ellipsis strategy. This results in a grammatically odd construction, as in (19). The translators agreed that this sentence should have used clausal coordination, as without it the adjective scary does not necessarily modify the elided noun costume in the target.
19. S We are looking for the funniest costume, and the scariest.
T ham sabase majedaar poshaak-ki talaash kar rahe hai aur sabase daraavane
G 1PL most funny costume-ACC search do PROG PRS and most scary
When the meaning of the elided noun is slightly changed or ambiguous in the target, it results in errors from category F. For example, the target sentence in (20) reads oddly due to the incorrect translation of the intended meaning of the NP mine. Finally, when the meaning of the elided noun is completely lost in the target, it results in errors from category G, such as in (21), where the ellipsis is so poorly translated that the intended meaning his story is not present at all in the target.
21. S Everyone believed her story as his wasn't all that dramatically told.
T Nāakīyagā ceppabainadi antā āme katha kādani andarū viśvasincāru
G dramatically being-said everything her story not everyone believed
M 'Not everything that was said dramatically was her story, everyone believed.'

Verbal Ellipsis.
We find small grammatical errors like agreement feature mismatch in the sentences with verbal ellipses as well. For example, in (22), the subject in the clause containing the ellipsis misses the ergative marker, making the sentence grammatically incorrect. The meaning, however, is still comprehensible.
27. S He has not changed, but those around him have changed.
T atanu maraledu, kanī atani cuu unnavaru mararu
G he not.have.changed, but his around those.who.have changed
M 'He didn't change, but everyone around him have changed.'
If the error is not due to the ellipsis, as in (28), the reconstruction of the ellipsis, as in (29), does not improve the translation. However, it also does not make it worse.
28. S Some students in the class like physics and some don't.
T *kaksha mein kuchh chhaatr jaise bhautikee aur kuchh nahin
G class LOC some students like physics and some not
M 'Some students in the class are like physics and some are not'.
29. S Some students in the class like physics and some don't like physics.
T *kaksha mein kuchh chhaatr bhautikee kee tarah hain aur kuchh bhautikee kee tarah nahin hain
G class LOC some students physics ACC similar PRS and some physics ACC similar not PRS
M 'Some students in the class are like physics and some are not like physics'.
All in all, since the elided material gets overtly represented, this procedure is successful in improving the adequacy. In no sample does it make the meaning worse. A drawback, as discussed previously, is that it adds redundant information that lowers the fluency. Since the sentences with clausal ellipsis require repetition of an entire clause, their fluency is most negatively impacted. More importantly, since this procedure does not impact the translation of the wh-word, it is not of much use for clausal ellipsis.

Conclusion
Translating missing information that can be retrieved from elsewhere in the context poses an attractive goal for MT. We carried out an experiment to test the impact of different ellipses discussed in linguistics on NMT for English to Hindi/Telugu. The experimental results confirmed that ellipsis is hard for MT. We also found that ellipsis reconstruction is useful, mostly for sentences with noun and verb ellipses to improve their translation adequacy, although at the cost of their fluency.