Abstract
Most recent coreference resolution systems use search algorithms over possible spans to identify mentions and resolve coreference. We instead present a coreference resolution system that uses a text-to-text (seq2seq) paradigm to predict mentions and links jointly. We implement the coreference system as a transition system and use multilingual T5 as the underlying language model. We obtain state-of-the-art accuracy on the CoNLL-2012 datasets with an 83.3 F1-score for English (2.3 F1 higher than previous work [Dobrovolskii, 2021]) using only CoNLL data for training, a 68.5 F1-score for Arabic (+4.1 over previous work), and a 74.3 F1-score for Chinese (+5.3). In addition, we use the SemEval-2010 datasets for experiments in a zero-shot setting, a few-shot setting, and a supervised setting using all available training data. We obtain substantially higher zero-shot F1-scores for 3 out of 4 languages than previous approaches and significantly exceed previous supervised state-of-the-art results for all five tested languages. We provide the code and models as open source.
1 Introduction
There has been a great deal of recent research in pretrained language models that employ encoder-decoder or decoder-only architectures (e.g., see GPT-3, GLaM, LaMDA [Brown et al., 2020; Du et al., 2021; Thoppilan et al., 2022]), and that can generate text using autoregressive or text-to-text (seq2seq) models (e.g., see T5, mT5 [Raffel et al., 2019; Xue et al., 2021]). These models have led to remarkable results on a number of problems.
Coreference resolution is the task of finding referring expressions in text that point to the same entity in the real world. Coreference resolution is a core task in NLP, relevant to a wide range of applications (e.g., see Jurafsky and Martin [2021] Chapter 21 for discussion), but somewhat surprisingly, there has been relatively limited work on coreference resolution using encoder-decoder or decoder-only architectures.
The state-of-the-art models on coreference problems are based on encoder-only models, such as BERT (Devlin et al., 2019) or SpanBERT (Joshi et al., 2020). All recent state-of-the-art coreference models (see Table 2), however, have the disadvantages of a) requiring engineering of a specialized search or structured prediction step for coreference resolution, on top of the encoder's output representations; b) often requiring a pipelined approach with intermediate stages of prediction (e.g., mention detection followed by coreference prediction); and c) an inability to leverage more recent work in pretrained seq2seq models.
This paper describes a text-to-text (seq2seq) approach to coreference resolution that can directly leverage modern encoder-decoder or decoder-only models. The method takes as input a sentence at a time, together with prior context, encoded as a string, and makes predictions corresponding to coreference links. The method has the following advantages over previous approaches:
Simplicity: We use greedy seq2seq prediction without a separate mention detection step and do not employ a higher order decoder to identify links.
Accuracy: The accuracy of the method exceeds the previous state of the art.
Text-to-text (seq2seq) based: The method can make direct use of modern generation models that employ the generation of text strings as the key primitive.
A key question that we address in our work is how to frame coreference resolution as a seq2seq problem. We describe three transition systems in which the seq2seq model takes a single sentence as input and outputs a sequence of actions encoding the coreference links involving that sentence. Figure 1 gives an overview of the highest performing system, "Link-Append", which encodes prior coreference decisions in the input to the seq2seq model, and predicts new coreference links (either to existing clusters, or creating a new cluster) as its output. We provide the code and models as open source. Section 4 describes ablations considering other systems, such as a "Link-only" system (which does not encode previous coreference decisions in the input), and a mention-based system (Mention-Link-Append), which has a separate mention detection step, in some sense mirroring prior work (see Section 5).
We describe results on the CoNLL-2012 data set in Section 4. In addition, Section 5 describes multilingual results in two settings: first, a setting where we fine-tune on each language of interest; second, a zero-shot setting, where an mT5 model fine-tuned on English alone is applied to languages other than English. The zero-shot experiments show that for most languages, accuracies are higher than those of recent translation-based approaches and early supervised systems.
2 Related Work
Most similar to our approach is the work of Webster and Curran (2014), who use a shift-reduce transition-based system for coreference resolution. The transition system uses two data structures, a queue initialized with all mentions and a list. The Shift transition moves a mention from the queue to the top of the list. The Reduce transition merges the top mention with a selected cluster. Webster and Curran (2014) consider the approach to better reflect human cognitive processing, to be simple, and to have small memory requirements. Xia et al. (2020) use this transition-based system together with a neural approach for mention identification and transition prediction; this neural model (Xia et al., 2020) gives higher accuracy scores (see Table 2) than Webster and Curran (2014).
Lee et al. (2017) focus on predicting mentions and spans using an end-to-end neural model based on LSTMs (Hochreiter and Schmidhuber, 1997), while Lee et al. (2018) extend this to a differentiable higher-order model considering directed paths in the antecedent tree.
Another important method to gain higher accuracy is to use stronger pretrained language models, which we follow in this paper as well. A number of recent coreference resolution systems kept the essential architecture fixed while replacing the pretrained models with increasingly stronger ones. Lee et al. (2018) used ELMo (Peters et al., 2018), including feature tuning, and showed an impressive improvement of 5.1 F1 on the English CoNLL-2012 test set over the baseline score of Lee et al. (2017). The extension from an end-to-end model to differentiable higher-order inference provides an additional 0.7 F1 on the test set, leading to a final F1-score of 73.0 for this approach. Joshi et al. (2019) use the same inference model, explore how to best use BERT, and gain another significant improvement of 3.9 points absolute, reaching a 76.9 F1-score on the test set (see Table 2). Finally, Joshi et al. (2020) use SpanBERT, which leads to an even higher accuracy score of 79.6. SpanBERT performs well for coreference resolution due to its span-based pretraining objective.
Dobrovolskii (2021) considers coreference links between words instead of spans, which reduces the complexity of the coreference model to O(n²), and uses RoBERTa as the language model, which provides better results than SpanBERT for many tasks.
Similarly, Kirstain et al. (2021) reduce the high memory footprint of mention detection by using the start- and end-points of mention spans to identify mentions with a bilinear scoring function. The top λn scored mentions are used to restrict the search space for coreference prediction, again using a bilinear function for scoring. The algorithm has quadratic complexity since each possible coreference pair has to be scored.
Wu et al. (2020) cast coreference resolution as question answering and report gains of 1 F1 point on the development set from pretraining on Quoref and SQuAD 2.0. The approach first predicts mentions with a recall-oriented objective, then creates queries from these potential mentions for cluster prediction. This procedure requires applying the model once per mention candidate, many times per document, which leads to high execution times.
Our work makes direct use of T5-based models (Raffel et al., 2019). T5 adopts the idea of treating tasks in Natural Language Processing uniformly as "text-to-text" problems, that is, taking only text as input and generating only text as output. This idea simplifies and unifies the approach for a large number of tasks by applying the same model, objective, training procedure, and decoding process.
3 Three seq2seq Transition Systems
3.1 The Link-Append System
The Link-Append system processes the document a single sentence at a time. At each point the input to the seq2seq model is a text string that encodes the first i sentences together with coreference clusters that have been built up over the first (i − 1) sentences. As an example, the input for i = 3 for the example in Figure 1 is the following:
Input:
Speaker-A [1 I ] still have n’t gone to that fresh French restaurant by your house # Speaker-A [1 I ] ’m like dying to go there | # Speaker-B You mean the one right next to the apartment **
Here the # symbol is used to delimit sentences, the start of the focus sentence is marked using the pipe symbol |, and the end of the focus sentence is marked with two asterisks **.
We have three sentences (i = 3). There is a single coreference cluster in the first i − 1 = 2 sentences, marked using the [1 …] bracketings.
The output from the seq2seq model is also a text string. The text string encodes a sequence of 0 or more actions, terminated by the SHIFT token. Each action links some mention (a span) in the ith sentence to some mention in the previous context (often in the first i − 1 sentences, but sometimes also in the ith sentence). An example prediction given the above input is the following:
Prediction:
You [1 ; the apartment → your house ; the one right next to the apartment → that fresh French restaurant by your house ; SHIFT
More precisely, the first action would actually be “You ## mean the one [1” where the substring “mean the one” is the 3-gram in the original text immediately after the mention “You”. The 3-gram helps to disambiguate the mention fully, in the case where the same string might appear multiple times in the sentence of interest. For brevity we omit these 3-grams in the following discussion, but they are used throughout the model's output to specify mentions (see the Notes for discussion).
In this case there are three actions, separated by the “;” symbol, followed by the terminating SHIFT action. The first action is
You [1
This is an append action: specifically, it appends the mention “You” in the third sentence to the existing coreference cluster labeled [1 …]. The second action is
the apartment → your house
This is a link action. It links the mention “the apartment” in the third sentence to “your house” in the previous context. Similarly the third action,
the one right next to the apartment → that fresh French restaurant by your house
is also a link action, in this case linking the mention “the one right next to the apartment” to a previous mention in the discourse.
The sequence of actions is terminated by the SHIFT symbol. At this point the ith sentence has been processed, and the model moves to the next step, where the (i + 1)th sentence will be processed. Assuming the next sentence is “Speaker-B yeah yeah yeah”, the input at the (i + 1)th step will be
Input:
Speaker-A [1 I ] still have n’t gone to [3 that fresh French restaurant by [2 your house ] ] # Speaker-A [1 I ] ’m like dying to go there # Speaker-B [1 You ] mean [3 the one right next to [2 the apartment ] ] | # Speaker-B yeah yeah yeah
Note that the three actions in the previous prediction have been reflected in the new input, which now includes three coreference clusters, labeled [1 …], [2 …] and [3 …].
In summary, the method processes a sentence at a time, and uses append and link actions to build up links between mentions in the current sentence under focus and previous mentions in the discourse.
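To make the overall control flow concrete, the following is a minimal sketch of the per-sentence decode loop. The helpers `encode_state` (which renders the sentences and the clusters built so far using the #, |, **, and [k …] markers described above), `parse_actions` (which splits the predicted string on the ";" separator), and `model.predict` are hypothetical placeholders, not part of our released code.

```python
# Minimal sketch of the per-sentence Link-Append decode loop.
# `model.predict`, `encode_state`, and `parse_actions` are hypothetical
# placeholders, not part of the released implementation.

def resolve_document(model, sentences):
    """Greedy sentence-by-sentence coreference resolution."""
    clusters = []  # each cluster is a list of mention strings
    for i in range(len(sentences)):
        # Render sentences 0..i plus the clusters built so far as one
        # string, using the "#", "|", "**", and "[k ... ]" markers.
        input_text = encode_state(sentences, clusters, focus=i)
        # The model emits actions separated by ";", ending with SHIFT.
        prediction = model.predict(input_text)
        for action in parse_actions(prediction):
            if action.kind == "append":   # "mention [k": extend cluster k
                clusters[action.cluster_id].append(action.mention)
            elif action.kind == "link":   # "mention -> antecedent"
                # In oracle data a link always starts a new cluster;
                # the general case is described in the Notes.
                clusters.append([action.antecedent, action.mention])
    return clusters
```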
A critical question is how to map training data examples (which contain coreference clusters for entire documents) to sequences of actions for each sentence. Clearly there is some redundancy in the system, in that in many cases either link or append actions could be used to build up the same set of coreference clusters. We use the following method for creation of training examples:
Process mentions in the order in which they appear in the sentence. Specifically, mentions are processed in order of their end-point (earlier end-points are earlier in the ordering). Ties are broken by their start-point (later start-points are earlier in the ordering). It can be seen that the order in the previous example, You, the apartment, the one right next to the apartment, follows this procedure.
For each mention, if there is another mention in the same coreference cluster earlier in the document, either:
Create an append action if there are at least two members of the cluster in the previous i − 1 sentences.
Otherwise create a link action to the most recent member of the coreference cluster (this may be either in the first i − 1 sentences, or in the ith sentence).
The basic idea then will be to always use append actions where possible, but to use link actions where a suitable append action is not available.
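As a sketch of how these rules translate into code, the following oracle (our own illustration; the mention objects with .start/.end offsets and the `history` bookkeeping are assumptions, not the released implementation) produces the action sequence for sentence i from gold clusters:

```python
# Sketch of the oracle that converts gold clusters into the per-sentence
# action sequence for the Link-Append system.

def oracle_actions(gold_mentions_in_sentence, cluster_id_of, history, i):
    """history[c] lists mentions of cluster c seen so far, as
    (sentence_index, mention) pairs; i is the current sentence index."""
    actions = []
    # Earlier end-points first; ties broken by later start-points first.
    for m in sorted(gold_mentions_in_sentence,
                    key=lambda m: (m.end, -m.start)):
        c = cluster_id_of[m]
        earlier = history[c]
        if earlier:
            in_prev = sum(1 for (j, _) in earlier if j < i)
            if in_prev >= 2:
                # Cluster already visible in the annotated input: append.
                actions.append(("APPEND", m, c))
            else:
                # Otherwise link to the most recent member of the
                # cluster (which may itself be in sentence i).
                actions.append(("LINK", m, earlier[-1][1]))
        history[c].append((i, m))
    actions.append(("SHIFT",))
    return actions
```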
3.2 The Link-only System
The Link-only system is a simple variant of the Link-Append system. There are two changes: First, the only actions in the Link-only system are link and SHIFT, as described in the previous section. Second, when encoding the input in the Link-only system, the first i sentences are taken again with the # separator, but no information about coreference clusters over the first i − 1 sentences is included.
The Link-only system can therefore be viewed as a simplification of the Link-Append system. We will compare the two systems in experiments, in general seeing that the Link-Append system provides significant improvements in performance.
3.3 The Mention-Link-Append System
The Mention-Link-Append system is a modification of the Link-Append system, which includes an additional class of actions, the mention actions. A mention action selects a single sub-string from the sentence under focus, and creates a singleton coreference cluster. The algorithm that creates training examples is modified to have an additional step for the creation of mention actions, as follows:
Process mentions in the order in which they appear in the sentence.
For each mention, if it is the first mention in a coreference structure, introduce a mention action for that mention.
For each mention, if there is another mention in the same coreference cluster earlier in the document, either:
Create an append action if there are at least two members of the cluster in the previous i − 1 sentences.
Otherwise create a link action to the most recent member of the coreference cluster (this may be either in the first i − 1 sentences, or in the ith sentence).
Note that the Mention-Link-Append system can create singleton coreference structures, unlike the Link-Append or Link-only systems. This is its primary motivation.
3.4 A Formal Description
We now give a formal definition of the three systems. This section can be safely skipped on a first reading of the paper.
3.4.1 Initial Definitions and Problem Statement
We introduce some key initial definitions (of documents, potential mentions, and clusterings) before giving a problem statement:

A document is a pair (w1 … wn, s1 … sm), where wi is the ith word in the document, and s1, s2, …, sm is a sequence of integers specifying a segmentation of w1 … wn into m sentences. Each si is the endpoint of sentence i in the document. Hence 1 ≤ s1 < s2 < … < sm−1 < sm, and sm = n. The ith sentence spans words (si−1 + 1) … si inclusive (where for convenience we define s0 = 0).

A potential mention is a span of words that falls entirely within a single sentence. We write ℳ for the set of all potential mentions in a document, and Mj ⊆ ℳ for the set of potential mentions within sentence j.

A clustering K is a sequence of sets K1, K2, … K|K|, where each Ki ⊆ ℳ, and for any i, j such that i ≠ j, we have Ki ∩ Kj = ∅. We in addition assume that for all i, |Ki| ≥ 2 (although see Section 3.5 for discussion of the case where |Ki| ≥ 1). We define 𝒦 to be the set of all possible clusterings.

The coreference problem is to take a document x as input, and to predict a clustering K as the output. We assume a training set of N examples, (x(1), K(1)), …, (x(N), K(N)), consisting of documents paired with clusterings.
3.5 The Three Transition Systems
The transition systems considered in this paper take a document x as input, and produce a coreference clustering K as the output. We assume a definition of transition systems that is closely related to work on deterministic dependency parsing (Nivre, 2003, 2008), and which is very similar to the conventional definition of deterministic finite-state machines. Specifically, a transition system consists of: 1) A set of states 𝒮. 2) An initial state c0 ∈ 𝒮. 3) A set of actions 𝒜. 4) A transition function δ : 𝒮 × 𝒜 → 𝒮. This will usually be a partial function: that is, for a particular state c, there will be some actions a such that δ(c, a) is undefined. For convenience, for any state c we define 𝒜(c) ⊆ 𝒜 to be the set of actions such that for all a ∈ 𝒜(c), δ(c, a) is defined. 5) A set of final states ℱ ⊆ 𝒮.
A path is then a sequence c0, a0, c1, a1, … cN where for i = 0 … N − 1, ci+1 = δ(ci, ai), and where cN ∈ ℱ.
All transition systems in this paper use the following definition of states:
A state is a pair (i, K) such that 1 ≤ i ≤ (m + 1) and K is a clustering such that for k ∈ 1 … |K|, for j ∈ (i + 1) … m, Kk ∩ Mj = ∅ (i.e., K is a clustering over the mentions in the first i sentences). In addition we define the following:

𝒮 is the set of all possible states.

c0 = (1, ϵ) is the initial state, where ϵ is the empty sequence.

ℱ = {(i, K) ∈ 𝒮 : i = m + 1} is the set of final states.
Intuitively, the state (i,K) keeps track of which sentence is being worked on, through the index i, and also keeps track of a clustering of the partial mentions up to and including sentence i.
We now describe the actions used by the various transition systems. The actions will either augment the clustering K, or increment the index i. The actions fall into four classes—link actions, append actions, mention actions, and the shift action—defined as follows:
Link Actions. A link action links a mention m in sentence i to a mention m′ earlier in the document (m′ may be in the first i − 1 sentences or in sentence i itself). In the oracle action sequences used in this paper, a link always creates a new cluster {m′, m} within K; see the Notes for the general definition of its effect on K. We write L(i, K) for the set of link actions available in state (i, K).

Append Actions. An append action appends a mention m in sentence i to an existing cluster Kk, replacing Kk by Kk ∪ {m}. We write App(i, K) for the set of append actions available in state (i, K).

Mention Actions. A mention action creates a new singleton cluster {m} for a mention m in sentence i (used only in the Mention-Link-Append system). We write Mention(i, K) for the set of mention actions available in state (i, K).

The SHIFT Action. The SHIFT action maps a state (i, K) with i ≤ m to the state (i + 1, K), moving the focus to the next sentence.
We are now in a position to define the transition systems:
The link-append transition system is defined as follows:
𝒮, c0, and ℱ are as defined above.
For any state (i, K), the set of possible actions is A(i, K) = L(i, K) ∪ App(i, K) ∪ {SHIFT}. The full set of actions is 𝒜 = ⋃(i,K)∈𝒮 A(i, K).
The transition function δ is as defined above.
The Link-only system is identical to the above, but with A(i, K) = L(i, K) ∪ {SHIFT}. The Mention-Link-Append system is identical to the above, but with A(i, K) = L(i, K) ∪ App(i, K) ∪ Mention(i, K) ∪ {SHIFT}.
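As a direct reading of these definitions, the following minimal sketch implements the transition function δ over states (i, K); the set-based cluster representation and the tuple-based action encoding mirror the oracle sketch in Section 3.1 and are our own illustration.

```python
# Sketch of the transition function delta over states (i, K), with
# clusters modelled as Python sets.

def delta(state, action, m):
    """m is the number of sentences in the document."""
    i, clusters = state
    kind = action[0]
    if kind == "SHIFT" and i <= m:
        return (i + 1, clusters)
    if kind == "APPEND":                   # add a mention to cluster k
        _, mention, k = action
        return (i, [c | {mention} if idx == k else c
                    for idx, c in enumerate(clusters)])
    if kind == "LINK":                     # new cluster {antecedent, mention}
        _, mention, antecedent = action
        return (i, clusters + [{antecedent, mention}])
    if kind == "MENTION":                  # Mention-Link-Append only
        _, mention = action
        return (i, clusters + [{mention}])
    raise ValueError("delta is undefined for this state/action pair")
```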
All that remains in defining the seq2seq method for each transition system is to: a) define an encoding of the state (i, K) as a string input to the seq2seq model; b) define an encoding of each type of action, and of a sequence of actions corresponding to a single sentence; c) define a mapping from a training example consisting of an (x, K) pair to a sequence of input-output texts corresponding to training examples.
4 Experimental Setup
We train an mT5 model to predict a target text from an input text. We use the provided training, development, and test splits as described in Section 4.1. For the preparation of the input text, we follow previous work and include the speaker in the input text before each sentence (Wu et al., 2020), as well as the text genre at the document start if this information is available in the corpus. We apply the corresponding transitions, as described in Section 3, as an oracle to obtain the input and target texts. If the input text is longer than the sentence piece token input size of the language model, we shorten the text at the front; we add further context beyond sentence i when the input space is not filled up (note that, as described in Section 3, the pipe symbol | marks the start of the focus sentence and two asterisks ** mark its end).
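The following is a minimal sketch of this length handling; `tokenizer.encode` stands in for a generic SentencePiece tokenizer, and the truncation granularity (whole sentences from the front) is an illustrative assumption rather than the released preprocessing.

```python
# Sketch: truncate context sentences at the front to fit the token
# budget, then fill the remaining space with text beyond the focus
# sentence (which is marked by "|" ... "**").

def build_input(prefix_sents, focus_sent, future_sents, tokenizer,
                budget=2048):
    def render(prefix):
        return " # ".join(prefix) + " | # " + focus_sent + " **"

    text = render(prefix_sents)
    # Drop the oldest context sentences while the input is too long.
    while len(tokenizer.encode(text)) > budget and prefix_sents:
        prefix_sents = prefix_sents[1:]
        text = render(prefix_sents)
    # Add context beyond sentence i while space remains.
    for s in future_sents:
        extended = text + " # " + s
        if len(tokenizer.encode(extended)) > budget:
            break
        text = extended
    return text
```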
4.1 Data
We use the English coreference resolution dataset from the CoNLL-2012 Shared Task (Pradhan et al., 2012) and the SemEval-2010 Shared Task datasets (Recasens et al., 2010) for multilingual coreference resolution experiments. The SemEval-2010 datasets include six languages and are therefore a good test bed for multilingual coreference resolution. We excluded English as the data overlaps with our training data.
The statistics on the dataset sizes are summarized in Table 1. The table shows that the English CoNLL-2012 Shared Task is substantially larger than any of the other data sets.
4.2 Experiments
Setup for English.
For our experiments, we use mT5 and initialize our model with either the xl or xxl checkpoints. For fine-tuning, we use the hyperparameters suggested by Raffel et al. (2019): a batch size of 128 sequences and a constant learning rate of 0.001. We use micro-batches of 8 to reduce the memory requirements. We save checkpoints every 2k steps and select the checkpoint with the best development results. We train for 100k steps. We use inputs of 2048 sentence piece tokens and 384 output tokens for training. All our models are tested with an input length of 3k sentence piece tokens unless stated otherwise. Training the xxl model takes about 2 days on 128 TPU-v4 chips. On the development set, inference takes about 30 minutes on 8 TPUs.
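For reference, the settings stated above collected in one place; the dictionary form is our own summary, and the framework-specific wiring is omitted.

```python
# Fine-tuning settings as stated in the text (our own summary).
FINETUNE_CONFIG = {
    "checkpoint": "mt5-xxl",       # or "mt5-xl"
    "batch_size": 128,             # sequences per step
    "micro_batch_size": 8,         # reduces memory requirements
    "learning_rate": 1e-3,         # constant, following Raffel et al. (2019)
    "train_steps": 100_000,
    "checkpoint_every": 2_000,     # best checkpoint selected on dev
    "input_length": 2048,          # sentence piece tokens (training)
    "output_length": 384,
    "eval_input_length": 3072,     # 3k tokens at test time
}
```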
Setup for Other Languages.
We used the English model from this work as the starting point for continued training, with the above settings, on languages other than English (Arabic, Chinese, and the SemEval-2010 datasets). For few-shot learning, we use the first 10 documents of each language and train for only 200 steps, at which point evaluation shows a 100% fit to the training set.
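The few-shot setting differs from the configuration above only in the amount of data and the number of steps (again our own summary, extending the config sketch above):

```python
# Few-shot variant: only the data amount and step count change.
FEWSHOT_CONFIG = dict(
    FINETUNE_CONFIG,
    train_documents=10,   # first 10 documents of the target language
    train_steps=200,      # the model then fits the training set fully
)
```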
The experimental evaluation for SemEval-2010 changed in recent publications, which report F1-scores as an average of MUC, B3, and CEAFe, following the CoNLL-2012 evaluation schema (Roesiger and Kuhn, 2016; Schröder et al., 2021; Xue et al., 2021). We follow this schema in this paper as well. Another important difference between the SemEval-2010 and CoNLL-2012 datasets is the annotation of singletons (mentions without antecedents) in the SemEval datasets. Most recent systems predict only coreference chains. This has also led to different evaluation methods for the SemEval-2010 datasets. The first method keeps the singletons for evaluation purposes (e.g., Xia and Durme, 2021) and the second excludes the singletons from the evaluation set (e.g., Roesiger and Kuhn, 2016; Schröder et al., 2021; Bitew et al., 2021). The exclusion of singletons seems better suited to comparing recent systems but makes direct comparison with earlier work difficult. We report numbers for both setups.
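For concreteness, a trivial sketch of the two evaluation conventions just described, assuming the per-metric F1 values are already computed by a standard scorer:

```python
# The CoNLL-style score is the unweighted mean of the three metric F1s.
def conll_f1(muc_f1, b3_f1, ceafe_f1):
    return (muc_f1 + b3_f1 + ceafe_f1) / 3.0

# Second SemEval-2010 setup: drop singletons before scoring.
def drop_singletons(clusters):
    return [c for c in clusters if len(c) >= 2]
```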
5 Multilingual Coreference Resolution Results
5.1 Zero-Shot and Few-Shot
Since mT5 is pretrained on 100+ languages (Xue et al., 2021), we evaluate zero-shot transfer from English to other languages. We apply our system trained on the English CoNLL-2012 Shared Task dataset to the non-English SemEval-2010 test sets. Table 3 shows evaluation scores for our transition-based systems and reference systems. In our results overview (Table 3), the sing. column reports whether singletons are included (Y) or excluded (N), with the P-column for prediction and the E-column for evaluation. For training, we use the same setting as the reference systems (Kobdani and Schütze, 2010; Roesiger and Kuhn, 2016; Schröder et al., 2021). In the zero-shot experiments, the transition-based systems are trained only on the English CoNLL-2012 datasets and applied without modification to the multilingual SemEval-2010 test sets.
Bitew et al. (2021) use machine translation for coreference prediction on the SemEval-2010 datasets. They obtained the best accuracy by first translating the test sets to English, then predicting the English coreferences with the system of Joshi et al. (2020), and finally projecting the predictions back. They apply this method to four of the six SemEval-2010 languages. We include their results in Table 3 as a comparison to our zero-shot results. The two methods are directly comparable as neither uses the target-language annotations for training. Our zero-shot F1-scores are substantially higher than those of the machine translation approach for Dutch, Italian, and Spanish, and a bit lower for Catalan (cf. Table 3).
Xia and Durme (2021) explored few-shot learning with the continued-training approach across a large number of settings. We use the same approach with a single setting that uses the first 10 documents of each language. For details of the experimental setup see Section 4.2. Table 3 presents the results for the Link-Append system. They show that high accuracy can already be reached with a few additional training documents. This could be useful either to adapt to a specific coreference annotation schema or to a specific language (see examples in Figures 2 and 3).
5.2 Supervised
We also carried out experiments in a fully supervised setup in which we use all available training data of the SemEval-2010 Shared Task. We adopted the continued-training method of Xia and Durme (2021). In our experiments, we start from our fine-tuned English model and continue training on the SemEval-2010 datasets and the Arabic OntoNotes dataset; for the latter we use the data and splits of the CoNLL-2012 Shared Task.
To verify the finding of Xia and Durme (2021), we compared the results of continuing training from a fine-tuned model and from the initial mT5 model. We conducted this exploratory experiment using 1k training steps on the German dataset. The results favor continued training from an already fine-tuned model, with a score of 84.5 F1 vs. 81.0 F1 for the fresh mT5 model. This model also achieves 77.3 F1 when evaluated without singletons (cf. Table 3), surpassing the previous state of the art of 74.5 F1 (Schröder et al., 2021). Due to the computational cost, we did not explore whether longer training from a fresh mT5 model would reach better performance. We adopted this approach for all datasets of the SemEval-2010 Shared Task, as it provides competitive coreference models at low training cost.
Table 3 includes the accuracy scores for the cluster/mention-based transition system, which reaches state-of-the-art accuracy for all languages when prediction and evaluation include singletons (P = Y, E = Y). In order to compare with Xia and Durme (2021), we also removed the singletons from the predictions of the cluster/mention-based transition system while still including them in the evaluation (P = N, E = Y).
6 Discussion
In this section, we analyze performance factors with an ablation study, conduct an error analysis, and reflect on design choices that are important for the model's performance. Table 2 shows the results for our systems on the English, Arabic, and Chinese CoNLL-2012 Shared Task and compares with previous work.
6.1 Ablation Study
Our best-performing transition system is Link-Append, which predicts links and clusters iteratively for the sentences of a document without predicting mentions beforehand. Table 4 shows an ablation study. The top of the table shows development set results for the Link-Append system, in which, with each SHIFT, the already identified coreference clusters are annotated in the input. This information is then available in the next step, and the clusters can be extended by the Append transition.
System | Ablation | F1
---|---|---
Link-Append | 100k steps/3k pieces | 83.2
Link-Append | 2k sentence pieces | 83.1
Link-Append | 50k steps | 82.9
Link-Append | no context beyond i | 82.8
Link-Append | xxl-T5.1.1 | 82.7
Link-Append | xl-mT5 | 78.0
Mention-Link-Append | 3k pieces | 82.6
Mention-Link-Append | 2k pieces | 82.2
Link-only | link transitions only | 81.4
The models are trained with an input size of 2048 tokens using mT5. We use a larger input size of 3000 (3k) tokens for decoding to accommodate long documents and very long distances between coreferent mentions. With 2k sentence pieces, the averaged F1-score on the development set is 83.1 instead of 83.2, using the model trained for 100k steps.
At the bottom of Table 4, we show the performance of a system that does not annotate the identified clusters in the input. In this system the Append transition cannot be applied, and hence only the Link and SHIFT transitions are used. The accuracy of this system is substantially lower, by 1.8 F1.
We observe drops in accuracy when we do not use context beyond sentence i or when we train for only 50k steps. We observe a 0.5 lower F1-score when we use xxl-T5.1.1 (see the Notes) instead of the xxl-mT5 model. An analysis shows that the English OntoNotes corpus contains some non-English text, speaker names, and special symbols. For instance, there are Arabic names that are mapped to OOV, but also the curly brackets {}. There are also cases where T5 translated non-English words to English (e.g., German ‘nicht’ to ‘not’).
With the Mention-Link-Append system, we introduced a system capable of introducing mentions, which is useful for datasets that include singleton mentions, such as the SemEval-2010 datasets. This transition system reaches an 82.6 F1-score on the development set with an input context of 3k sentence pieces, which is 0.6 F1 lower than the Link-Append transition system. We added examples in the Appendix to illustrate mistakes in a zero-shot setting (Figure 2) and in a supervised English example.
6.2 Error Analysis
We observe two problems originating from the sequence-to-sequence models: first, hallucinations (words not found in the input) and, second, ambiguous matches of mentions to the input. To evaluate the frequency of hallucinations, we counted cases where the predicted mentions and their context could not be matched to a word sequence in the input. We found only 11 such cases (0.07%) among all 14.5k Link and Append predictions for the development set. The second problem concerns mentions whose n-gram context is found more than once in the input; this accounts for 84 cases (0.6%) of all 14.5k Link and Append predictions.
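A minimal sketch of the matching behind these counts, assuming the predicted mention and its 3-gram context are available as plain strings and the input as a word list (names and tokenization are illustrative):

```python
# Classify a predicted mention plus its 3-gram context by how often it
# matches the input word sequence.

def classify_prediction(mention, trigram, input_words):
    target = mention.split() + trigram.split()
    n = len(target)
    hits = sum(1 for j in range(len(input_words) - n + 1)
               if input_words[j:j + n] == target)
    if hits == 0:
        return "hallucination"  # mention + context not found in the input
    if hits > 1:
        return "ambiguous"      # mention + context occurs more than once
    return "matched"
```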
Table 5 shows average F1-scores for buckets of documents grouped by length range, analogous to the analysis of Xia et al. (2020). All systems' F1-scores drop substantially, by about 3-4 points, after the 257-512 bucket. The Link-Append (LA) system appears to have two relatively stable F1-score regions, 1-512 and 513-1152 tokens, separated by this larger drop, while the other systems show slightly lower accuracy in each bucket.
Subset | #Docs | JS-L | CM | LA
---|---|---|---|---
1-128 | 57 | 84.6 | 84.5 | 85.8
129-256 | 73 | 83.7 | 83.6 | 85.2
257-512 | 78 | 82.9 | 83.4 | 86.0
513-768 | 71 | 80.1 | 79.3 | 83.2
769-1152 | 52 | 79.1 | 78.6 | 83.3
1153+ | 12 | 71.3 | 69.6 | 74.9
all | 343 | 80.1 | 79.5 | 83.2
6.3 Design Choices
With this paper, we follow the paradigm of a text-to-text approach. Our goal was to use only the text output from a seq2seq model, and potentially the score associated with the output. Crucial for the high accuracy of the Link-Append system are design choices that fit a text-to-text approach well. (1) Initial experiments, not presented in the paper, showed lower performance for a standard two-stage approach using mention prediction followed by mention linking. The Link-only transition system, which we included as a baseline in the paper, was the first system we implemented that predicted only coreference links, avoiding mention detection. Hence the first crucial design choice is to predict links rather than to predict mentions first. (2) The prediction of links in a stateful fashion, where the prior input records previous coreference decisions, finally leads to the superior accuracy of the text-to-text model. (3) The larger model enables us to use the simpler paradigm of a text-to-text model successfully; the smaller models provide substantially lower performance. We speculate, in line with the arguments of Kaplan et al. (2020), that distinct capabilities of a model become strong or even emerge with model size. (4) The strong multilingual results originate from the multilingual T5 model, which was initially surprising to us. For English, the mT5 model performed better as well, which we attribute to the larger vocabulary of mT5's sentence piece encoding model.
7 Conclusions
In this paper, we combine a text-to-text (seq2seq) language model with a transition-based system to perform coreference resolution. We reach an 83.3 F1-score on the English CoNLL-2012 dataset, surpassing the previous state of the art. In the text-to-text framework, the Link-Append transition system proved superior to the hybrid Mention-Link-Append transition system with its mixed prediction of mentions, links, and clusters. Our trained models are useful for future work as they can be used to initialize models for continued training or zero-shot transfer to new languages.
Acknowledgments
We would like to thank the action editor and three anonymous reviewers for their thoughtful and insightful comments, which were very helpful in improving the paper.
Notes
Note that no explicit constraints are placed on the model’s output, so there is the potential for the model to generate mention references that do not correspond to substrings within the input; however, this happens very rarely in practice, see Section 6.2 for discussion. There is also the potential for the 3-gram to be insufficient context to disambiguate the exact location of a mention; again, this happens rarely, see Section 6.2.
Specifically, the addition of the link can either: 1) create a new cluster within K, if neither m nor m′ is in an existing cluster within K; 2) add m to an existing cluster within K, if m′ is already in some cluster in K, and m is not in an existing cluster; 3) add m′ to an existing cluster within K, if m is already in some cluster in K, and m′ is not in an existing cluster; 4) merge two clusters, if m and m′ are both in clusters within K, and the two clusters are different; 5) leave K unchanged, if m and m′ are both within the same existing cluster within K. In practice cases (2), (3), (4), and (5) are never seen in oracle sequences of actions, but for completeness we include them.
The xxl-T5.1.1 model refers to a model provided by Xue et al. (2021) trained for 1.1 million steps on English data.