Coreference Resolution through a seq2seq Transition-Based System

Most recent coreference resolution systems use search algorithms over possible spans to identify mentions and resolve coreference. We instead present a coreference resolution system that uses a text-to-text (seq2seq) paradigm to predict mentions and links jointly. We implement the coreference system as a transition system and use multilingual T5 as the underlying language model. We obtain state-of-the-art accuracy on the CoNLL-2012 datasets with an 83.3 F1-score for English (a 2.3 higher F1-score than previous work [Dobrovolskii, 2021]) using only CoNLL data for training, a 68.5 F1-score for Arabic (+4.1 higher than previous work), and a 74.3 F1-score for Chinese (+5.3). In addition, we use the SemEval-2010 data sets for experiments in a zero-shot setting, a few-shot setting, and a supervised setting using all available training data. We obtain substantially higher zero-shot F1-scores for 3 out of 4 languages than previous approaches and significantly exceed previous supervised state-of-the-art results for all five tested languages. We provide the code and models as open source.¹

¹ https://github.com/google-research/google-research/tree/master/coref_mt5

Input:
  Speaker-A I still have n't gone to that fresh French restaurant by your house
Prediction:
  SHIFT: next sentence

Input:
  Speaker-A I 2 still have n't gone to that fresh French restaurant by your house
  Speaker-A I 17 'm like dying to go there
Prediction:
  A. I 17 → I 2
  B. SHIFT: next sentence

Input:
  Speaker-A [1 I ] still have n't gone to that fresh French restaurant by your house
  Speaker-A [1 I ] 'm like dying to go there
  Speaker-B You mean the one right next to the apartment
Prediction:
  A. You → [1
  B. the apartment → your house
  C. the one right next to the apartment → that fresh French restaurant by your house
  D. SHIFT: next sentence

Input:
  Speaker-A [1 I ] still have n't gone to [3 that fresh French restaurant by [2 your house ] ]
  Speaker-A [1 I ] 'm like dying to go there
  Speaker-B [1 You ] mean [3 the one right next to [2 the apartment ] ]
  Speaker-B yeah yeah yeah
Prediction:
  SHIFT: next sentence

Figure 1: Example of one of our transition-based coreference systems, the Link-Append system. The system processes a single sentence at a time, using an input encoding of the prior sentences annotated with coreference clusters, followed by the new sentence. As output, the system makes predictions that link mentions in the new sentence either to previously created coreference clusters (e.g., "You → [1") or, when a new cluster is created, to previous mentions (e.g., "the apartment → your house"). The system predicts "SHIFT" when processing of the sentence is complete. Note that in the figure we use the word indices 2 and 17 to distinguish the two occurrences of "I" in the text.

Coreference resolution is the task of finding referring expressions in text that point to the same entity in the real world. It is a core task in NLP, relevant to a wide range of applications (e.g., see Jurafsky and Martin (2021), Chapter 21, for discussion), but somewhat surprisingly, there has been relatively limited work on coreference resolution using encoder-decoder or decoder-only architectures.
The state-of-the-art models for coreference are based on encoder-only models, such as BERT (Devlin et al., 2019) or SpanBERT (Joshi et al., 2020). All recent state-of-the-art coreference models (see Table 2), however, have the disadvantages of a) requiring the engineering of a specialized search or structured prediction step for coreference resolution on top of the encoder's output representations; b) often requiring a pipelined approach with intermediate stages of prediction (e.g., mention detection followed by coreference prediction); and c) being unable to leverage more recent work on pretrained seq2seq models.
This paper describes a text-to-text (seq2seq) approach to coreference resolution that can directly leverage modern encoder-decoder or decoder-only models. The method takes as input one sentence at a time, together with the prior context encoded as a string, and makes predictions corresponding to coreference links. The method has the following advantages over previous approaches:
• Simplicity: We use greedy seq2seq prediction without a separate mention detection step and do not employ a higher-order decoder to identify links.
• Accuracy: The accuracy of the method exceeds the previous state of the art.
• Text-to-text (seq2seq) based: the method can make direct use of modern generation models that treat the generation of text strings as the key primitive.
A key question that we address in our work is how to frame coreference resolution as a seq2seq problem. We describe three transition systems, in which the seq2seq model takes a single sentence as input and outputs an action corresponding to a set of coreference links involving that sentence. Figure 1 gives an overview of the highest-performing system, "Link-Append", which encodes prior coreference decisions in the input to the seq2seq model and predicts new coreference links (either to existing clusters, or creating a new cluster) as its output. We provide the code and models as open source. Section 4 describes ablations considering other systems, such as a "Link-only" system (which does not encode previous coreference decisions in the input) and a mention-based system (Mention-Link-Append), which has a separate mention detection step, in some sense mirroring prior work (see Section 5).
We describe results on the CoNLL-2012 data set in Section 4. In addition, Section 5 describes multilingual results in two settings: first, the setting where we fine-tune on each language of interest; second, zero-shot results, where an mT5 model fine-tuned on English alone is applied to languages other than English. The zero-shot experiments show that for most languages, accuracies are higher than recent translation-based approaches and early supervised systems.

Related Work
Most similar to our approach is the work of Webster and Curran (2014), which uses a shift-reduce transition-based system for coreference resolution. The transition system uses two data structures: a queue, initialized with all mentions, and a list. The SHIFT transition moves a mention from the queue to the top of the list. The REDUCE transition merges the top mention of the list into a selected cluster. Webster and Curran (2014) consider the approach to better reflect human cognitive processing, to be simple, and to have small memory requirements. Xia et al. (2020) use this transition-based system together with a neural approach for mention identification and transition prediction; this neural model gives higher accuracy scores (see Table 2) than Webster and Curran (2014). Lee et al. (2017) focus on predicting mentions and spans using an end-to-end neural model based on LSTMs (Hochreiter and Schmidhuber, 1997), while Lee et al. (2018) extend this to a differentiable higher-order model considering directed paths in the antecedent tree.
Another important way to gain higher accuracy, which we follow in this paper as well, is to use stronger pretrained language models. A number of recent coreference resolution systems keep the essential architecture fixed while replacing the pretrained models with increasingly stronger ones. Lee et al. (2018) used ELMo (Peters et al., 2018), including feature tuning, and showed an impressive improvement of 5.1 F1 on the English CoNLL-2012 test set over the baseline score of Lee et al. (2017). The extension from the end-to-end model to differentiable higher-order inference provides an additional 0.7 F1-score on the test set, which leads to a final F1-score of 73.0 for this approach. Joshi et al. (2019) use the same inference model, explore how best to use BERT, and gain another significant improvement of 3.9 points absolute, reaching a 76.9 F1-score on the test set (see Table 2). Finally, Joshi et al. (2020) use SpanBERT, which leads to an even higher accuracy score of 79.6. SpanBERT performs well for coreference resolution due to its span-based pretraining objective. Dobrovolskii (2021) considers coreference links between words instead of spans, which reduces the complexity of the coreference model to O(n²), and uses RoBERTa as the language model, which provides better results than SpanBERT for many tasks.
Similarly, Kirstain et al. (2021) reduce the high memory footprint of mention detection by using the start- and end-points of mention spans to identify mentions with a bilinear scoring function. The top λn scored mentions are used to restrict the search space for coreference prediction, again using a bilinear function for scoring. The algorithm has quadratic complexity, since each possible coreference pair has to be scored. Wu et al. (2020) cast coreference resolution as question answering and report gains of 1 F1-score on the development set originating from pretraining on Quoref and SQuAD 2.0. The approach first predicts mentions with a recall-oriented objective, then creates queries for these potential mentions for the cluster prediction. This procedure requires applying the model multiple times per document, once per mention candidate, which leads to high execution times.
Our work makes direct use of T5-based models (Raffel et al., 2019). T5 adopts the idea of treating tasks in Natural Language Processing uniformly as "text-to-text" problems, which means having only text as input and generating text as output. This idea simplifies and unifies the approach for a large number of tasks by applying the same model, objective, training procedure, and decoding process.
Three seq2seq Transition Systems

The Link-Append System
The Link-Append system processes the document a single sentence at a time. At each point, the input to the seq2seq model is a text string that encodes the first i sentences together with the coreference clusters that have been built up over the first (i − 1) sentences. As an example, the input for i = 3 for the example in Figure 1 is the following:

Input: Speaker-A [1 I ] still have n't gone to that fresh French restaurant by your house # Speaker-A [1 I ] 'm like dying to go there | # Speaker-B You mean the one right next to the apartment **

Here the # symbol is used to delimit sentences, the start of the focus sentence is marked using the pipe symbol |, and the end of the sentence with two asterisk symbols **.
The output from the seq2seq model is also a text string. The text string encodes a sequence of zero or more actions, terminated by the SHIFT token. Each action links some mention (a span) in the ith sentence to some mention in the previous context (often in the first i − 1 sentences, but sometimes also in the ith sentence). An example prediction given the above input is the following:

Prediction: You → [1 ; the apartment → your house ; the one right next to the apartment → that fresh French restaurant by your house ; SHIFT

More precisely, the first action would actually be "You ## mean the one → [1", where the substring "mean the one" is the 3-gram in the original text immediately after the mention "You". The 3-gram helps to fully disambiguate the mention in the case where the same string might appear multiple times in the sentence of interest. For brevity we omit these 3-grams in the following discussion, but they are used throughout the model's output to specify mentions.

In this case there are three actions, separated by the ";" symbol, followed by the terminating SHIFT action. The first action is

You → [1

This is an append action: specifically, it appends the mention "You" in the 3rd sentence to the existing coreference cluster labeled [1 ...]. The second action is

the apartment → your house

This is a link action. It links the mention "the apartment" in the 3rd sentence to "your house" in the previous context. Similarly, the third action,

the one right next to the apartment → that fresh French restaurant by your house

is also a link action, in this case linking the mention "the one right next to the apartment" to a previous mention in the discourse.
The sequence of actions is terminated by the SHIFT symbol. At this point the ith sentence has been processed, and the model moves to the next step, where the (i + 1)th sentence will be processed. Assuming the next sentence is "Speaker-B yeah yeah yeah", the input at the (i + 1)th step is the annotated context with the predicted links reflected as cluster annotations, followed by the new sentence. In summary, the method processes a sentence at a time, and uses append and link actions to build up links between mentions in the current sentence under focus and previous mentions in the discourse.
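The action syntax above is simple enough that the model's output string can be decoded with a few lines of code. The following sketch parses a prediction string into actions; it is our own illustration of the scheme (the exact separator handling is an assumption), not the paper's implementation.

```python
import re

def parse_prediction(pred: str):
    """Split a Link-Append prediction string into (kind, payload) actions.

    The ';'-separated action syntax and the '[k' cluster notation follow
    the examples in the text; token-level details are assumptions.
    """
    actions = []
    for part in pred.split(";"):
        part = part.strip()
        if not part:
            continue
        if part == "SHIFT":
            # SHIFT terminates the action sequence for this sentence.
            actions.append(("shift", None))
            break
        lhs, rhs = (s.strip() for s in part.split("→", 1))
        m = re.fullmatch(r"\[(\d+)", rhs)
        if m:   # append to an existing cluster, e.g. "You → [1"
            actions.append(("append", (lhs, int(m.group(1)))))
        else:   # link to an earlier mention, e.g. "the apartment → your house"
            actions.append(("link", (lhs, rhs)))
    return actions
```

In practice the left-hand side would also carry the 3-gram context ("You ## mean the one"), which a fuller parser would strip off before matching.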
A critical question is how to map training data examples (which contain coreference clusters for entire documents) to sequences of actions for each sentence. Clearly there is some redundancy in the system, in that in many cases either link or append actions could be used to build up the same set of coreference clusters. We use the following method for the creation of training examples:

• Process mentions in the order in which they appear in the sentence. Specifically, mentions are processed in order of their end-point (earlier end-points are earlier in the ordering). Ties are broken by their start-point (later start-points are earlier in the ordering). It can be seen that the order in the previous example, You, the apartment, the one right next to the apartment, follows this procedure.
• For each mention, if there is another mention in the same coreference cluster earlier in the document, either:
  1. Create an append action if there are at least two members of the cluster in the previous i − 1 sentences.
  2. Otherwise, create a link action to the most recent member of the coreference cluster (this may be either in the first i − 1 sentences, or in the ith sentence).
The basic idea, then, is to always use append actions where possible, but to use link actions where a suitable append action is not available.
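As a concrete illustration of the ordering and the append-vs-link rule above, the following sketch derives the oracle actions for one sentence. The data structures (span tuples, cluster dictionaries) are our own assumptions, not the paper's code.

```python
def oracle_actions(sent_mentions, clusters, history):
    """Create training actions for one focus sentence (a sketch of the
    procedure described above).

    sent_mentions: mentions of the focus sentence as (start, end, cluster_id)
    clusters:      cluster_id -> mentions (start, end) from the first i-1
                   sentences, in document order
    history:       mentions of the focus sentence already processed, per cluster
    """
    # Order: earlier end-point first; ties broken by *later* start-point first.
    ordered = sorted(sent_mentions, key=lambda m: (m[1], -m[0]))
    actions = []
    for start, end, cid in ordered:
        prior_in_doc = clusters.get(cid, []) + history.get(cid, [])
        if len(clusters.get(cid, [])) >= 2:
            # At least two members already in the first i-1 sentences: append.
            actions.append(("append", (start, end), cid))
        elif prior_in_doc:
            # Otherwise link to the most recent earlier member of the cluster.
            actions.append(("link", (start, end), prior_in_doc[-1]))
        # First mentions of a cluster yield no action in Link-Append.
        history.setdefault(cid, []).append((start, end))
    actions.append(("shift",))
    return actions
```

Note that a mention whose cluster has no earlier member produces no action here; the Mention-Link-Append variant described below adds a mention action for that case.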

The Link-only System
The Link-only system is a simple variant of the Link-Append system. There are two changes. First, the only actions in the Link-only system are link and SHIFT, as described in the previous section. Second, when encoding the input in the Link-only system, the first i sentences are again joined with the # separator, but no information about the coreference clusters over the first i − 1 sentences is included.
The Link-only system can therefore be viewed as a simplification of the Link-Append system. We compare the two systems in our experiments, in general finding that the Link-Append system gives significant improvements in performance.

The Mention-Link-Append System
The Mention-Link-Append system is a modification of the Link-Append system that includes an additional class of actions, the mention actions. A mention action selects a single sub-string from the sentence under focus and creates a singleton coreference cluster. The algorithm that creates training examples is modified to have an additional step for the creation of mention actions, as follows:

• Process mentions in the order in which they appear in the sentence.
• For each mention, if it is the first mention in a coreference structure, introduce a mention action for that mention.
• For each mention, if there is another mention in the same coreference cluster earlier in the document, either:
  1. Create an append action if there are at least two members of the cluster in the previous i − 1 sentences.
  2. Otherwise, create a link action to the most recent member of the coreference cluster (this may be either in the first i − 1 sentences, or in the ith sentence).
Note that the Mention-Link-Append system can create singleton coreference structures, unlike the Link-Append or Link-only systems. This is its primary motivation.

A Formal Description
We now give a formal definition of the three systems.This section can be safely skipped on a first reading of the paper.

Initial Definitions, and Problem Statement
We introduce some key initial definitions (of documents, potential mentions, and clusterings) before giving a problem statement.

Definition 1 (Documents) A document is a pair (w_1 ... w_n, s_1 ... s_m), where w_i is the ith word in the document, and s_1, s_2, ..., s_m is a sequence of integers specifying a segmentation of w_1 ... w_n into m sentences. Each s_i is the end-point of sentence i in the document. Hence 1 ≤ s_1 < s_2 < ... < s_{m−1} < s_m, and s_m = n. The ith sentence spans words (s_{i−1} + 1) ... s_i inclusive (where for convenience we define s_0 = 0).
Definition 2 (Potential Mentions) Assume an input document (w_1 ... w_n, s_1 ... s_m). For each i ∈ 1 ... m we define M_i to be the set of potential mentions in the ith sentence; specifically,

M_i = {(a, b) : (s_{i−1} + 1) ≤ a ≤ b ≤ s_i}

Hence each member of M_i is a pair (a, b) specifying a subspan of the ith sentence. We define M = M_1 ∪ ... ∪ M_m to be the set of all potential mentions in the document, and M_{≤i} = M_1 ∪ ... ∪ M_i to be the set of potential mentions in sentences 1 ... i.
Definition 3 (Clusterings) A clustering K is a sequence of sets K_1, K_2, ..., K_|K|, where each K_j ⊆ M, and for any i, j such that i ≠ j, we have K_i ∩ K_j = ∅. In addition we assume that for all i, |K_i| ≥ 2 (although see Section 3.5 for discussion of the case where |K_i| ≥ 1). We define 𝒦 to be the set of all possible clusterings.

Definition 4 (Problem Statement) The coreference problem is to take a document x as input, and to predict a clustering K as the output. We assume a training set of N examples, {(x^(i), K^(i))} for i = 1 ... N, consisting of documents paired with clusterings.
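Definitions 1-3 translate directly into code. The sketch below (our own, using 1-indexed word positions as in the text) enumerates the potential mentions M_i of a sentence and checks the clustering conditions of Definition 3.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

Span = Tuple[int, int]

@dataclass
class Document:
    """A document (w_1..w_n, s_1..s_m): words plus sentence end-points
    (Definition 1). Field names are our own."""
    words: List[str]   # w_1 .. w_n (stored 0-indexed)
    ends: List[int]    # s_1 .. s_m, with ends[-1] == len(words)

    def potential_mentions(self, i: int) -> Set[Span]:
        """M_i: all sub-spans (a, b) of the i-th sentence (Definition 2),
        with 1-indexed word positions."""
        lo = (self.ends[i - 2] if i > 1 else 0) + 1   # s_{i-1} + 1
        hi = self.ends[i - 1]                          # s_i
        return {(a, b) for a in range(lo, hi + 1) for b in range(a, hi + 1)}

def is_clustering(K: List[Set[Span]]) -> bool:
    """Definition 3: pairwise-disjoint clusters, each of size >= 2."""
    seen: Set[Span] = set()
    for k in K:
        if len(k) < 2 or k & seen:
            return False
        seen |= k
    return True
```

A sentence of length L contributes L(L+1)/2 potential mentions, which is why span-enumeration approaches pay a quadratic cost that the seq2seq formulation avoids.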

The Three Transition Systems
The transition systems considered in this paper take a document x as input, and produce a coreference clustering K as the output. We assume a definition of transition systems that is closely related to work on deterministic dependency parsing (Nivre, 2003, 2008), and which is very similar to the conventional definition of deterministic finite-state machines. Specifically, a transition system consists of:

1. A set of states C.
2. An initial state c_0 ∈ C.
3. A set of actions A.
4. A transition function δ : C × A → C. This will usually be a partial function: that is, for a particular state c, there will be some actions a such that δ(c, a) is undefined. For convenience, for any state c we define A(c) ⊆ A to be the set of actions such that for all a ∈ A(c), δ(c, a) is defined.
5. A set of final states F ⊆ C.
All transition systems in this paper use the following definition of states:

Definition 5 (States) A state is a pair (i, K), where i ∈ {1 ... (m + 1)} is a sentence index and K is a clustering over the mentions in the first i sentences (that is, each cluster in K is a subset of M_{≤i}). In addition we define the following:

• C is the set of all possible states.
• c_0 = (1, ⟨⟩) is the initial state, where ⟨⟩ is the empty sequence.
• F = {(i, K) ∈ C : i = m + 1} is the set of final states: a state is final once the index has advanced past the last sentence.

Intuitively, the state (i, K) keeps track of which sentence is being worked on, through the index i, and also keeps track of a clustering of the potential mentions up to and including sentence i.
We now describe the actions used by the various transition systems. The actions either augment the clustering K or increment the index i. The actions fall into four classes (link actions, append actions, mention actions, and the shift action) defined as follows:

Link actions. Given a state (i, K), we define the set of possible link actions as

L(i, K) = {(m → m′) : m ∈ M_i, m′ ∈ M_{≤i}, m′ precedes m in the document}

A link action (m → m′) augments K by adding a link between mentions m and m′. We define K ⊕ (m → m′) to be the result of adding the link m → m′ to clustering K. We can then define the transition function associated with a link action:

δ((i, K), (m → m′)) = (i, K ⊕ (m → m′))

Append actions. Given a state (i, K), we define the set of possible append actions as

P(i, K) = {(m → k) : m ∈ M_i, k ∈ {1 ... |K|}}

An append action (m → k) augments K by adding mention m to the cluster K_k within the sequence K. We define K ⊕ (m → k) to be the result of this action (thereby overloading the ⊕ operator); the transition function associated with an append action is then

δ((i, K), (m → k)) = (i, K ⊕ (m → k))

Mention actions. Given a state (i, K), we define the set of possible mention actions as

N(i, K) = {Add(m) : m ∈ M_i}

A mention action Add(m) augments K by creating a new singleton cluster containing m alone, assuming that m does not currently appear in K; otherwise it leaves K unchanged. We define K ⊕ Add(m) to be the result of this action, and

δ((i, K), Add(m)) = (i, K ⊕ Add(m))

The SHIFT action. The final action in the system is the SHIFT action. This can be applied in any state, and simply advances the index i, leaving the clustering K unchanged:

δ((i, K), SHIFT) = ((i + 1), K)

We are now in a position to define the transition systems:

Definition 6 (The three transition systems) The Link-Append transition system is defined as follows:

• C, c_0, and F are as defined in Definition 5.
• For any state (i, K), the set of possible actions is A(i, K) = L(i, K) ∪ P(i, K) ∪ {SHIFT}.
• The transition function δ is as defined above.

The Link-only system is identical to the above, but with A(i, K) = L(i, K) ∪ {SHIFT}. The Mention-Link-Append system is identical to the above, but with A(i, K) = L(i, K) ∪ P(i, K) ∪ N(i, K) ∪ {SHIFT}.

All that remains to define the seq2seq method for each transition system is to: a) define an encoding of the state (i, K) as a string input to the seq2seq model; b) define an encoding of each type of action, and of a sequence of actions corresponding to a single sentence; and c) define a mapping from a training example consisting of an (x, K) pair to a sequence of input-output texts corresponding to training examples.
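A minimal sketch of the transition function δ, representing a state as (i, K) with K a list of mention sets. The action encoding, and the reading of ⊕ under which a link either grows the cluster containing m′ or starts a new two-element cluster, are our own assumptions.

```python
def delta(state, action):
    """Transition function for the Link-Append system (a sketch).

    state:  (i, K) with K a list of mention sets (the clusters)
    action: ("shift",) | ("link", m, m2) | ("append", m, k) with 1-indexed k
    """
    i, K = state
    kind = action[0]
    if kind == "shift":
        # SHIFT advances the sentence index and leaves K unchanged.
        return (i + 1, K)
    K = [set(c) for c in K]          # keep delta side-effect free
    if kind == "link":               # m -> m': grow the cluster holding m',
        _, m, m2 = action            # or start a new cluster {m', m}
        for c in K:
            if m2 in c:
                c.add(m)
                break
        else:
            K.append({m2, m})
    elif kind == "append":           # m -> [k: add m to existing cluster k
        _, m, k = action
        K[k - 1].add(m)
    return (i, K)
```

Running a document through the system is then a fold of `delta` over the predicted action sequence, starting from the initial state (1, []).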

Experimental Setup
We train an mT5 model to predict a target text from an input text. We use the provided training, development and test splits as described in Section 4.1. For the preparation of the input text, we follow previous work and include the speaker in the input text before each sentence (Wu et al., 2020), as well as the text genre at the document start if this information is available in the corpus. As described in Section 3, we apply the corresponding transitions as an oracle to obtain the input and target texts. We truncate the text at the front if the input text is larger than the SentencePiece token input size of the language model, and add further context beyond sentence i when the input space is not filled up (note that, as described in Section 3, we use the pipe symbol | to mark the start of the focus sentence and two asterisk symbols ** to mark its end).
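The input assembly described above (the # sentence separator, the | and ** focus markers, front-truncation, and context fill-up) can be sketched as follows. The `length` parameter stands in for SentencePiece token counting, and all details beyond the markers quoted in the text are assumptions.

```python
def build_input(prev, focus, extra, max_len, length=len):
    """Assemble the model input string (a sketch of the scheme above).

    prev:  prior sentences, already annotated with cluster brackets
    focus: the focus sentence i
    extra: sentences after i, used to fill up unused input space
    """
    text = " # ".join(prev) + " | # " + focus + " **"
    # Truncate at the front while the input exceeds the model's input size.
    while prev and length(text) > max_len:
        prev = prev[1:]
        text = " # ".join(prev) + " | # " + focus + " **"
    # Add further context beyond sentence i while space remains.
    for s in extra:
        cand = text + " # " + s
        if length(cand) > max_len:
            break
        text = cand
    return text
```

With a real tokenizer, `length` would be replaced by a function counting SentencePiece tokens rather than characters.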
For few-shot learning, we use the first 10 documents for each language and train for only 200 steps, since the evaluation then shows a 100% fit to the training set. All our models have been tested with an input length of 3k SentencePiece tokens. In Section 5, we present our work on multilingual coreference resolution, and Section 6 discusses the results for all languages.

Multilingual Coreference Resolution Results
The SemEval-2010 datasets (Recasens et al., 2010) include six languages and are therefore a good test bed for multilingual coreference resolution. We excluded English as the data overlaps with our training data. The experimental evaluation has changed in recent publications, which report F1-scores as an average of MUC, B³ and CEAFφ4, following the CoNLL-2012 evaluation schema. We follow this schema in this paper as well.
Another important difference between the SemEval-2010 and the CoNLL-2012 datasets is the annotation of singletons (mentions without antecedents) in the SemEval datasets. Most recent systems predict only coreference chains. This has also led to different evaluation methods for the SemEval-2010 datasets. The first method keeps the singletons for evaluation purposes (e.g., Xia and Durme (2021)), and the second excludes the singletons from the evaluation set (e.g., Roesiger and Kuhn (2016); Schröder et al. (2021); Bitew et al. (2021)). The exclusion of singletons seems better suited to comparing recent systems, but makes direct comparison with previous work difficult. In our results overview (Table 3), we report in the Sing. columns whether singletons are included (Y) or excluded (N): in the P-column for the prediction and in the E-column for the evaluation.
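A system that annotates singletons can be scored under either convention simply by filtering its output before evaluation. A minimal sketch (our own helper, not from the paper):

```python
def drop_singletons(clusters):
    """Exclude singleton clusters (mentions without antecedents) before
    scoring, i.e. the N setting of the Sing. columns in Table 3."""
    return [c for c in clusters if len(c) >= 2]
```

Applying the same filter to the gold clusters as well gives the stricter setting where singletons are removed from both prediction and evaluation.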

Zero-Shot and Few-Shot
Since mT5 is pretrained on 100+ languages (Xue et al., 2021), we evaluate the zero-shot transfer ability from English to other languages. We apply our system trained on the English CoNLL-2012 Shared Task dataset to the non-English SemEval-2010 test sets. Table 3 shows evaluation scores for our transition-based systems and reference systems. We use for training the same setting as the reference systems (Kobdani and Schütze, 2010).

Xia and Durme (2021) use machine translation for the coreference prediction of the SemEval-2010 datasets. The authors found they got the best accuracy when they first translated the test sets to English, then predicted the English coreferences with the system of Joshi et al. (2020), and finally projected the predictions back. They apply this method to four of the six languages of the SemEval-2010 datasets. We included their results in Table 3 as a comparison to our zero-shot results. The two methods are directly comparable, as neither uses the target-language annotations for training. Our zero-shot F1-scores are substantially higher than those of the machine translation approach for Dutch, Italian and Spanish, and a bit lower for Catalan (cf. Table 3).

Xia and Durme (2021) explored few-shot learning with the continued training approach for a large number of settings. We use the same approach with a single setting that uses the first 10 documents for each language. For details about the experimental setup, see Section 4.2. Table 3 presents the results for the Link-Append system. This shows that a high accuracy can be reached with just a few additional training documents. This could be useful either to adapt to a specific coreference annotation schema or to a specific language (see examples in Figures 2 and 3).

Supervised
We also carried out experiments in a fully supervised setup in which we use all available training data of the SemEval-2010 Shared Task. We adopted the continued-training method of Xia and Durme (2021). In our experiments, we start from our fine-tuned English model and continue training on the SemEval-2010 datasets and the Arabic OntoNotes dataset; for the latter, we use the data and splits of the CoNLL-2012 Shared Task.
To verify the finding of Xia and Durme (2021), we compared the results when we continue training from a fine-tuned model and from the initial mT5 model. We conducted these exploratory experiments using 1k training steps on the German dataset. The results are in favor of continued training from an already fine-tuned model, with a score of 84.5 F1 vs. 81.0 F1 for the fresh mT5 model. This model also achieves 77.3 F1 when evaluated without singletons (cf. Table 3), surpassing the previous SotA of 74.5 F1 (Schröder et al., 2021). We did not explore training longer, due to the computational cost of training from a fresh mT5 model to reach a potentially better performance. We adopted the continued-training approach for all datasets of the SemEval-2010 Shared Task, as it provides competitive coreference models at low training cost.
Table 3 includes the accuracy scores for the cluster/mention-based transition system, which reaches SotA for all languages when the prediction and evaluation include the singletons (P=Y, E=Y). In order to compare the results with Xia and Durme (2021), we removed the singletons from the predictions of the cluster/mention-based transition system but still include them in the evaluation (P=N, E=Y).
Table 2 compares the results of our model for Arabic and Chinese with recent work. The Link-Append system is 4.1 points better than Min (2021) and 5.3 points better than Xia and Durme (2021), which present the previous SotA for Arabic and Chinese, respectively.

Discussion
In this section, we analyse performance factors with an ablation study, analyse errors, and reflect on design choices that are important for the model's performance. The models have been trained with 100k training steps and tested with 2k sentence pieces, filling up the remaining input space beyond the focus sentence i with further sentences of the document as context. In inference mode, the model uses an input length of 3k sentence pieces if not stated otherwise.

Ablation Study
Our best-performing transition system is Link-Append, which predicts links and clusters iteratively for the sentences of a document without predicting mentions beforehand. Table 4 shows an ablation study. The results at the top of the table show the development set results for the Link-Append system when, with each SHIFT, the already identified coreference clusters are annotated in the input.
This information is then available in the next step and the clusters can be extended by the APPEND transition.
The models are trained with an input size of 2048 tokens using mT5. We use a larger input size of 3000 (3k) tokens for decoding to accommodate long documents and very long distances between coreferent mentions. When we use 2k sentence pieces, the accuracy is 83.1 instead of 83.2 averaged F1-score on the development set, using the model trained for 100k steps.
At the bottom of Table 4, the performance of a system is shown that does not annotate the identified clusters in the input. In this system the Append transition cannot be applied, and hence only the Link and SHIFT transitions are used. The accuracy of this system is substantially lower, by 1.8 F1-score.
We observe drops in accuracy when we do not use context beyond the sentence i, or when we train for only 50k steps. We observe a 0.5 lower F1-score when we use xxl-T5.1.1 instead of the xxl-mT5 model. An analysis shows that the English OntoNotes corpus contains some non-English text, speaker names and special symbols. For instance, there are Arabic names that are mapped to OOV, but also the curly brackets {}. There are also other cases where T5 translated non-English words to English (e.g., German 'nicht' to 'not').
With the Mention-Link-Append system, we introduced a system that is capable of introducing mentions, which is useful for data sets that include singleton mentions, such as the SemEval-2010 data sets. This transition system has an 82.6 F1-score on the development set with an input context of 3k sentence pieces, which is 0.6 F1-score lower than the Link-Append transition system. We added examples in the appendix to illustrate mistakes in a zero-shot setting (Figure 2) and a supervised English example (Figure 3).

Error Analysis
We observe two problems originating from the sequence-to-sequence models: first, hallucinations (words not found in the input), and second, ambiguous matches of mentions to the input. In order to evaluate the frequency of hallucinations, we counted cases where the predicted mentions and their context could not be matched to a word sequence in the input. We found only 11 such cases (0.07%) among all 14.5k Link and Append predictions for the development set. The second problem concerns mentions which, together with their n-gram context, are found more than once in the input. This constitutes 84 cases (0.6%) of all 14.5k Link and Append predictions.
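Both error types can be detected by matching the predicted mention, with its context, against the input string. A sketch of such a check (our own reconstruction; the "##" separator follows the example in Section 3.1):

```python
def match_mention(mention_with_context, input_text):
    """Classify a predicted mention by how often its surface form, joined
    with its following n-gram context, occurs in the input."""
    needle = mention_with_context.replace(" ## ", " ")
    count = input_text.count(needle)
    if count == 0:
        return "hallucination"   # words not found in the input
    if count > 1:
        return "ambiguous"       # more than one possible source span
    return "unique"
```

In our data both failure modes are rare (0.07% and 0.6% of predictions, respectively), so a simple first-match or skip policy suffices in practice.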
Table 5 shows average F1-scores for buckets of documents within length ranges incremented by 128 tokens, analogous to the analysis of Xia et al. (2020). All systems' F1-scores drop substantially, by about 3-4 points, after the segment length 257-512. The Link-Append (LA) system seems to have two more stable F1-score regions, 1-512 and 513-1152 tokens, divided by the mentioned larger drop, while for the other systems we see slightly lower accuracy in each segment.

Design Choice
With this paper, we follow the paradigm of a text-to-text approach. Our goal was to use only the text output from a seq2seq model, and potentially the score associated with the output. Crucial for the high accuracy of the Link-Append system are design choices that seem to fit a text-to-text approach well. (1) Initial experiments, not presented in the paper, showed lower performance for a standard two-stage approach using mention prediction followed by mention linking. The Link-only transition system, which we included as a baseline in the paper, was the first system that we implemented that only predicted coreference links, avoiding mention detection. Hence this crucial first design choice is to predict links and not to predict mentions first. (2) The prediction of links in a stateful fashion, where the prior input records previous coreference decisions, finally leads to the superior accuracy of the text-to-text model. (3) The larger model enables us to use the simpler text-to-text paradigm successfully. The smaller models provide substantially lower performance. We speculate, in line with the arguments of Kaplan et al. (2020), that distinct capabilities of a model get stronger or even emerge with model size. (4) The strong multilingual results originate from the multilingual T5 model, which was initially surprising to us. For English, the mT5 model performed better as well, which we attribute to the larger vocabulary of the SentencePiece encoding model of mT5.

Conclusions
In this paper, we combine a text-to-text (seq2seq) language model with a transition-based system to perform coreference resolution. We reach 83.3 F1-score on the English CoNLL-2012 data set, surpassing the previous SotA. In the text-to-text framework, the Link-Append transition system has been superior to the hybrid Mention-Link-Append transition system with its mixed prediction of mentions, links and clusters. Our trained models are useful for future work, as they could be used to initialize models for continued training or zero-shot transfer to new languages.

Input: Speaker-A [1 I ] still have n't gone to [3 that fresh French restaurant by [2 your house ] ] # Speaker-A [1 I ] 'm like dying to go there # Speaker-B [1 You ] mean [3 the one right next to [2 the apartment ] ] | # Speaker-B yeah yeah yeah

Note that the three actions in the previous prediction have been reflected in the new input, which now includes three coreference clusters, labeled [1 ...], [2 ...] and [3 ...].

Figure 2: German zero-shot predictions. The text marked in bold red indicates wrong predictions.

Figure 3: Mistakes picked from the CoNLL-2012 development set. For example, Hong Kong should have been identified recursively within Hong Kong Disneyland; in the last sentence, [3 Disney] is assigned to the [3 Disney Corporation] cluster instead of, correctly, to the [4 The world 's fifth Disney park] cluster.

Figure 4: Mistakes picked from the CoNLL-2012 development set. For example, the coreferences [18 a lot of blood] as well as [27 [13 the ship 's ] attorneys] are not in the gold annotation.

Table 2 :
English, Arabic and Chinese test set results and comparison with previous work on the CoNLL-2012 Shared Task test data set. The average F1-score of MUC, B³ and CEAFφ4 is the main evaluation criterion.

Table 3 :
Test set results for the SemEval-2010 datasets. The Sing. columns show whether the singletons are included (Y) or removed (N) in the prediction (P) and in the evaluation set (E). The last column shows the average F1-score of MUC, B³ and CEAFφ4.

Table 4 :
Development set results for an ablation study using the English CoNLL-2012 data sets, reporting average F1-scores. (Table 2 shows the results for our systems on the English, Arabic and Chinese CoNLL-2012 Shared Task and compares them with previous work.)

Table 5 :
Average F1-score on the development set for buckets of document length incremented by 128 tokens. The column JS-L shows average F1-scores for the SpanBERT-Large model (Joshi et al., 2020), CM stands for the Constant Memory model (Xia et al., 2020), and LA for the Link-Append system. The entries for JS-L and CM are taken from Xia et al. (2020).