Multilingual Coreference Resolution in Multiparty Dialogue

Abstract Existing multiparty dialogue datasets for entity coreference resolution are nascent, and many challenges are still unaddressed. We create a large-scale dataset, Multilingual Multiparty Coref (MMC), for this task based on TV transcripts. Due to the availability of gold-quality subtitles in multiple languages, we propose reusing the annotations to create silver coreference resolution data in other languages (Chinese and Farsi) via annotation projection. On the gold (English) data, off-the-shelf models perform relatively poorly on MMC, suggesting that MMC has broader coverage of multiparty coreference than prior datasets. On the silver data, we find success both in using it for data augmentation and in training from scratch, which effectively simulates the zero-shot cross-lingual setting.


Introduction
Coreference resolution is a challenging aspect of understanding natural language dialogue (Khosla et al., 2021). Many dialogue datasets are between two participants, even though distinct challenges arise in the multiparty setting with more than two speakers. Fig. 1 shows how "you" could refer to any subset of the listeners of an utterance. While there are some datasets on multiparty conversations from TV transcripts (Choi and Chen, 2018), they only annotate people, resulting in incomplete annotations across entity types. Moreover, these datasets are limited to English, and work on dialogue coreference resolution in other languages is rare (Muzerelle et al., 2014).
We introduce a new (entity) coreference resolution dataset focused on multiparty dialogue that supports experiments in multiple languages.* We first annotate for coreference on the transcripts from two popular TV shows, in English. We then leverage existing gold subtitle translations (Creutz, 2018) in Chinese and Farsi to project our annotations, resulting in a multilingual corpus (Fig. 1).

* Work done at JHU/HLTCOE.
Our experiments demonstrate that coreference resolution models trained on existing datasets are not robust to a shift to this domain. Further, we demonstrate that training on our annotations projected to non-English languages leads to improvements in non-English evaluation. Finally, we lay out an evaluation for zero-shot cross-lingual coreference resolution, requiring models to be tested on other languages with no in-language examples. We release over 1,200 scenes from TV shows with all annotations and related metadata in English, Chinese, and Farsi, which we call MMC: Multilingual Multiparty Coreference.

Motivation and Related Work
Many works on coreference resolution primarily study documents with a single author or speaker. OntoNotes (Weischedel et al., 2013) is a widely used dataset that mostly consists of single-author documents, like newswire, while other datasets like PreCo (Chen et al., 2018), LitBank (Bamman et al., 2020), and WikiCoref (Ghaddar and Langlais, 2016) also consist of documents like books. Many recent modeling contributions also focus primarily on this setting and these datasets (Lee et al., 2017, 2018; Xu and Choi, 2020; Bohnet et al., 2022), and some offload it to pretrained language models (Wu et al., 2020; Toshniwal et al., 2021) or ignore the speaker identity entirely (Xia et al., 2020) in an attempt to unify dialogue with non-dialogue domains.
The dialogue domain is less studied because we lack a suitable dataset, even though such datasets exist for other NLP tasks (Section 2.1). In addition to filling this gap, we also present a scalable solution for dataset creation in other languages, following related work in data projection methods (Section 2.2). The limitations of existing works motivate the creation of our dataset.

Multiparty Conversations
One of the focuses of this work is multiparty coreference resolution, which concerns coreference in conversational text with multiple participants. In particular, we are interested in conversations with more than two participants since this brings additional challenges not present in typical dialogue datasets. For example, in two-way conversations, "you" is typically deducible as the listener of an utterance. However, as shown in Fig. 1, "you" in multiparty conversations with more participants could refer to any of the participants present in the conversation. Additional challenges include using a third person pronoun to refer to one of the interlocutors and plural mentions ("we", "you all") that refer to a subset of the participants in the conversation (Zhou and Choi, 2018).
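The ambiguity can be made concrete: with n listeners, "you" has 2^n − 1 possible non-empty referent sets. A minimal sketch of this enumeration (the participant names are illustrative, not drawn from the dataset):

```python
from itertools import combinations

def you_referents(participants, speaker):
    """Enumerate candidate referents of "you" in a multiparty utterance:
    any non-empty subset of the listeners (everyone but the speaker)."""
    listeners = [p for p in participants if p != speaker]
    subsets = []
    for r in range(1, len(listeners) + 1):
        subsets.extend(frozenset(c) for c in combinations(listeners, r))
    return subsets

cands = you_referents(["Ross", "Rachel", "Joey", "Monica"], speaker="Ross")
# 3 listeners -> 2**3 - 1 = 7 candidate referent sets
```

In a two-way conversation the list collapses to a single candidate, which is why "you" is easy there and hard in the multiparty setting.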

Dialogue Coreference Resolution
Coreference resolution in dialogue has recently reemerged as an area of research, with multiple datasets created and annotated for coreference resolution (Li et al. (2016), Khosla et al. (2021), more examples in Table 1) and the development of dialogue-specific models (Xu and Choi, 2021; Kobayashi et al., 2021; Kim et al., 2021). The datasets can be broadly categorized into transcripts of spoken conversations (e.g., interviews), meeting notes, online discussions, and one-on-one goal-driven genres. Table 1 shows that none of the datasets sufficiently covers spontaneous multiparty conversations. The datasets that are multiparty are either incompletely annotated (Friends CI only annotates mentions referring to people), task-oriented (AMI), or discussion forums (ON-web, BOLT-DF). As a result, there are drawbacks to each of these datasets, like an expectation of formality (without the types of language found in spontaneous dialogue) or missing clarity on the listener or reader identities (e.g., missing usage of second person pronouns). None of these datasets aims for exhaustive annotation of multiparty dialogue in spontaneous social interactions.
Friends CI (Choi and Chen, 2018) is the closest dataset to the goals of this work. Different from our goals, Friends CI is focused on character linking instead of general entity coreference. While pronouns like "you" are annotated, other entities, like objects or locations, are not. However, if we want to use coreference resolution models in downstream systems for information extraction (Li et al., 2020b) or dialogue understanding (Rolih, 2018; Liu et al., 2021), we need a dataset that aligns closer with multiparty spontaneous conversations. We contribute a large-scale and more exhaustively annotated dataset for multiparty coreference resolution.

Multilinguality
Coreference Resolution Coreference resolution models are typically developed for a single language, and while there is some prior work on cross-lingual and multilingual models (Xia and Van Durme, 2021), these methods still require some data in the desired language for best performance. While there are coreference resolution datasets in many languages (Weischedel et al., 2013; Recasens et al., 2010), they are often limited and expensive to annotate from scratch for each new language. We take a step towards a more general solution for building coreference resolution models from scratch in (almost) any language. By collecting and annotating data that already exists in a highly parallel corpus, we suggest a different approach to expensive in-language annotation: data projection.

(Table 1 compares dialogue coreference datasets, including Byron and Allen (1998), Friends CI (Chen and Choi, 2016), the tc, bc, and wb genres of OntoNotes (Weischedel et al., 2013), and the DF, SMS, and CTS portions of BOLT (Li et al., 2016), following Khosla et al. (2021). OntoNotes (ON) is divided by genre.)
Data Projection Using annotations in English to create data in a target language has been useful for tasks such as semantic role labeling (Akbik et al., 2015; Aminian et al., 2019), information extraction (Riloff et al., 2002), POS tagging (Yarowsky and Ngai, 2001), and dependency parsing (Ozaki et al., 2021). Previous works find improvements when training on a mixture of gold source language data and projected silver target language data in cross-lingual tasks such as semantic role labeling (Fei et al., 2020; Daza and Frank, 2020) and information extraction (Yarmohammadi et al., 2021). The intuition of using both gold and projected silver data is to allow the model to see high-quality gold data as well as data with target language statistics. In this work, we extend projection to coreference resolution both for creating a model without in-language data and for augmenting existing annotations.

Multilingual Multiparty Dialogue Coreference Dataset
In this section, we present our multilingual multiparty coreference (MMC) dataset, including the construction process of data alignment and filtering, annotation, and projection. Core to our contribution is the choice of a multiparty dataset that already has gold translations and the prioritization of multilinguality throughout the data collection process.

Parallel Dialogue Corpus
We construct a parallel corpus of multiparty dialogue by aligning the English transcripts from TV shows with parallel subtitles from the OpenSubtitles corpus (Tiedemann, 2012; Lison and Tiedemann, 2016), a sentence-aligned parallel corpus widely used in machine translation. TV sitcoms are an ideal target for meeting our criteria for a spontaneous multiparty genre, as they contain rich multiparty dialogues, multiple references to interlocutors, and spontaneous utterances. We select Friends and The Big Bang Theory (TBBT) because there is prior work in preprocessing and speaker identification for the transcripts of these shows (Roy et al., 2014; Choi and Chen, 2018; Sang et al., 2022).

(Figure 2: The annotation interface. Given a set of proposed markables ("queries"), users highlight the best antecedent or speaker that the markable refers to, or select "no previous mention" or "not a mention." Plural entities and uncertainty due to missing context can also be annotated.)
We align the available data with that from two languages distant from English: Chinese and Farsi (Section 3.3). Due to missing episodes and alignments for some languages, the final three-way aligned corpus is an intersection of what is available in all three languages, and empty or clearly misaligned scenes are removed (Table 2).

English Coreference Annotation
We automatically create an initial set of proposed markable mentions, aiming for high recall. As in prior work (Pradhan et al., 2012; Poesio et al., 2018; Bamman et al., 2020), these markables are then considered for coreference linking, which keeps annotation consistent. We mainly follow the annotation process of OntoNotes 5.0 (Weischedel et al., 2013). However, we make some simplifications that are easier to understand for crowdworkers, roughly following those made by Chen et al. (2018). Unlike OntoNotes, we do not consider verbs and verb phrases as markable. Entities mentioned once (singletons) are annotated. Also, non-proper modifiers can be coreferred with generic mentions, and subspans can be coreferred with the whole span.

Markable Mention Proposal
We ensemble predictions from the Berkeley parser with T5-Large (Kitaev and Klein, 2018; Raffel et al., 2020) and RoBERTa-based (Liu et al., 2019) spaCy to detect nouns, noun phrases, and pronouns. These constitute our proposed markable mention spans.
Interface Our annotation interface (Fig. 2) is derived from that of Yuan et al. (2022). The interface simplifies coreference annotation to selecting any antecedent for each query span (proposed markable) found by the parser. For consistency, the interface encourages users to select proposed markables, although they can also add a new antecedent mention if it is not among those proposed by the parser. They can also label a markable span as not a mention. Coreference clusters are formed by taking the transitive closure after annotation.
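The transitive-closure step can be sketched with a small union-find over the annotated links; the mention identifiers below are invented for illustration:

```python
def coref_clusters(links):
    """Form coreference clusters as the transitive closure of pairwise
    (mention, antecedent) links, using union-find with path compression."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for mention, antecedent in links:
        parent[find(mention)] = find(antecedent)

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return [sorted(c) for c in clusters.values()]

# Two links to "Monica_1" merge into one three-mention cluster.
clusters = coref_clusters(
    [("you_2", "Monica_1"), ("she_3", "Monica_1"), ("it_5", "car_4")]
)
# -> [["Monica_1", "she_3", "you_2"], ["car_4", "it_5"]]
```

This is why annotators only ever pick one antecedent per query span: the closure recovers full clusters automatically.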
We make several modifications to the interface to annotate coreference more completely and in the dialogue setting. These include permitting the selection of speakers, mentions of arbitrary size for plural mentions, and an indication of uncertainty (e.g., without further context, the example in Fig. 1 requires audiovisual cues). While the annotations of plural mentions and uncertainty labels are not used in this work, we hope they enable future studies.

Pilot Annotation
We sampled three scenes of differing lengths from the training set for a qualification study. For these scenes, we adjudicated annotations from four experts as the gold labels. Then, we invited annotators from Amazon Mechanical Turk to participate and receive feedback on their errors. Nine high-scoring annotators on the pilot (>80 MUC score) were selected for annotating the full training set. We paid USD $7 for each pilot study, which could be completed in 25-35 minutes, including reading the instructions.
Full Annotation For the training set, the scenes were batched into roughly 250 proposed markables each. We paid $4 per batch (expected $12/hour) for each of the nine high-scoring annotators. Each of the scenes was annotated once, although we inspected these annotations to ensure they were nontrivial (i.e., not all-blank or all-singletons).
For the dev and test splits, three professional data specialists, in consultation with the authors, annotated the documents with two-way redundancy. After reviewing common error and disagreement types with the authors, one of the specialists performed adjudication of the disagreements (described in Appendix B). Following several prior works (Weischedel et al., 2013; Chen et al., 2018; Toshniwal et al., 2020), we adopt the MUC score as an approximate measure of agreement between annotators. The average MUC score of each annotator against the adjudicated clusters is 86.1. This agreement score is comparable to reported scores in widely used datasets: OntoNotes (89.60) and PreCo (85.30). The inter-annotator MUC agreement score on this combined split is 80.3 and the inter-annotator CoNLL-F1 score is 81.55. The Cohen's kappa score is 0.7911, which is interpreted as "substantial agreement." Note that the high agreement can be partially attributed to agreement over non-mentions and starts of coreference chains.
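For reference, the MUC agreement computation can be sketched as follows. MUC recall counts, per gold cluster, how many links survive after partitioning it by the predicted clusters; precision is the same computation with the roles swapped (Vilain et al., 1995). The example clusterings are invented:

```python
def muc(key, response):
    """MUC precision/recall/F1. `key` and `response` are lists of sets
    of mention ids (gold and predicted clusterings, respectively)."""
    def _recall(gold, pred):
        num = den = 0
        for k in gold:
            # Partition k by the predicted clusters; each unmatched
            # mention forms its own singleton piece.
            pieces = [k & r for r in pred if k & r]
            covered = set().union(*pieces) if pieces else set()
            n_pieces = len(pieces) + len(k - covered)
            num += len(k) - n_pieces
            den += len(k) - 1
        return num / den if den else 0.0

    r = _recall(key, response)
    p = _recall(response, key)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

p, r, f = muc([{"a", "b", "c"}], [{"a", "b"}, {"c", "d"}])
# -> p = 0.5, r = 0.5, f = 0.5
```

Being link-based, MUC gives no credit for singletons or for the first mention of a chain, which is part of why agreement over non-mentions inflates perceived consistency less under MUC than under mention-level measures.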
7.2%, 8.8%, and 10.0% of the clusters in training, dev, and test splits contain plural mentions.Meanwhile, 0.4%, 1.4%, and 1.6% of the mentions are marked as "uncertain."The specialists working on the dev and test sets were more likely to mark an annotation as uncertain than crowdworkers.

Silver Data via Annotation Projection
Data projection transfers span-level annotations in a source language to a target language via word-to-word alignments in a fast, fully automatic way. The projected (silver) target data can be used directly or combined with gold data as data augmentation.
Alignment We need to align English (source side) mention spans to Chinese or Farsi (target side) text spans. Our cleaned dataset contains utterance-level aligned English to Chinese and Farsi text. Using automatic tools, we obtain more fine-grained (word-level) alignments, and project source spans to target spans according to these alignments. For multi-token spans, the target is a contiguous span containing all aligned tokens from the source span.
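Under this definition, span projection reduces to a min/max over word alignments. A sketch, with token indices and alignment pairs invented for illustration:

```python
def project_span(span, alignment):
    """Project a source token span (start, end inclusive) to the target
    side: the smallest contiguous target span covering every token
    aligned to any source token in the span. Returns None (a "null
    projection") when nothing aligns. `alignment` holds (src, tgt) pairs."""
    start, end = span
    tgt = [t for s, t in alignment if start <= s <= end]
    if not tgt:
        return None
    return (min(tgt), max(tgt))

align = {(0, 1), (1, 0), (3, 2), (4, 4)}
project_span((0, 1), align)  # -> (0, 1): reordered tokens still project
project_span((2, 2), align)  # -> None: a null projection
```

Null projections are exactly how annotations silently disappear from the silver data, which motivates the alignment-correction pass described below.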
We use Awesome-align (Dou and Neubig, 2021), a contextualized embedding-based word aligner that extracts word alignments based on token embedding similarities. We fine-tune the underlying XLM-R encoder on around two million parallel sentences from the OSCAR corpus (Abadji et al., 2022). We further fine-tune on Farsi-English gold alignments by Tavakoli and Faili (2014) and the GALE Chinese-English gold alignments (Li et al., 2015). See Appendix C for dataset statistics and fine-tuning hyperparameters. By projecting the annotated English mentions to the target side, the entity clusters associated with each mention are also implicitly projected. Some coreference annotations are not transferred to the target language side, either due to empty subtitles in our cleaned data or erroneous automatic word alignment and projection of the source text span. We refer to such cases as null projections.
Fig. 3 shows parallel utterances with their gold English and projected Chinese and Farsi annotations. Some short English utterances do not have counterparts, such as the second utterance ("Huh"). Chinese and Farsi annotations are also a subset of English annotations due to null projections. For example, the English mention "it" in the last utterance is missing in the target transcripts, so this span's annotation is missing in the projected data.
While there are the same number of episodes in the English and projected data, the number of scenes, mentions, and clusters in the projected data are smaller due to missing scenes or null projections. Since all episodes are three-way parallel, the splits for each language contain the same scenes (some empty scenes are omitted). We see around a 30% (Zh) and 40% (Fa) drop in aligned mentions (Table 3).
Alignment Correction We conducted alignment annotation for both English-Chinese and English-Farsi utterances to collect alignment corrections for the Chinese and Farsi test sets, with four Chinese speakers and three Farsi speakers. For each language pair, we presented the user with the utterance in each language and one of the English spans highlighted. On the target language side, the prediction by the projection model is displayed.
The user makes corrections to the automatic alignments if necessary. This is conducted via the TASA interface (Stengel-Eskin et al., 2019). These corrected annotations serve as the test sets for both languages. Chinese and Farsi are pro-drop languages. Most of the addition operations are related to pronouns, where the target is corrected from an empty string to the location of the trace of the pronoun (in Chinese) or the implied pronoun affix (in Farsi). In the modification operation, a small number of target mentions are also corrected to an empty string. This resulted in 401 and 823 additional pronoun mentions in Chinese and Farsi, respectively.
Dataset Statistics MMC contains about 101 hours of episodes, resulting in 323,627 English words, 226,045 Chinese words, and 258,244 Farsi words. Table 3 shows the final statistics of our three-way aligned, multiparty coreference corpus. This dataset is used for the remainder of the paper. To summarize, English dev and test data are two-way annotated followed by adjudication; English train is one-way annotated; and Chinese and Farsi are automatically derived via projection, but both Chinese and Farsi test alignments are corrected.

Methods

Model
For all experiments, we use the higher-order inference (HOI) coreference resolution model (Xu and Choi, 2020), modified slightly to predict singleton clusters (Xu and Choi, 2021). Given a document, HOI encodes texts with an encoder and enumerates all possible spans to detect mentions. These spans are scored by a mention detector, which prunes the spans to a small list of candidate mentions. The candidate mentions are scored pairwise, corresponding to the likelihood of being coreferring, and the resulting scores are used in clustering. While mentions can be linked to their top-scoring antecedent, higher-order inference goes further and ensures high agreement between all mentions in a cluster by making additional passes. Singletons can be predicted when a high-scoring (via the mention detector) mention only has low-scoring (via the pairwise scorer) candidate antecedents. For English-only experiments, SpanBERT-large (Joshi et al., 2020) is used as the encoder, while for the other experiments XLM-R-base (Conneau et al., 2020) is used. More hyperparameter details are in Appendix D.
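At a high level, the prune-link-or-singleton decision can be sketched as follows. The scores and thresholds here are toy values, not the model's learned scores, and the higher-order refinement passes are omitted:

```python
def cluster_mentions(mention_scores, pair_scores, m_thresh=0.0, a_thresh=0.0):
    """Toy sketch of the clustering step: keep spans whose mention score
    clears a threshold; link each kept mention to its best-scoring earlier
    antecedent, or leave it a singleton when every pairwise score is low."""
    kept = [i for i, s in enumerate(mention_scores) if s > m_thresh]
    antecedent = {}
    for j in kept:
        candidates = [(pair_scores.get((i, j), float("-inf")), i)
                      for i in kept if i < j]
        if candidates:
            best_score, best_i = max(candidates)
            if best_score > a_thresh:
                antecedent[j] = best_i
            # otherwise j remains a predicted singleton
    return kept, antecedent

kept, links = cluster_mentions(
    mention_scores=[2.0, -1.0, 1.5, 0.5],
    pair_scores={(0, 2): 3.0, (0, 3): -2.0, (2, 3): -1.0},
)
# span 1 is pruned; span 2 links to span 0; span 3 stays a singleton
```

The singleton case (span 3) is exactly the modification of Xu and Choi (2021): a confident mention with no confident antecedent forms its own cluster rather than being discarded.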

Noise-tolerant Mention Loss
The loss function used by Xu and Choi (2021) consists of a cluster loss, L_c, typically used for coreference resolution (Lee et al., 2017; Joshi et al., 2020; Xu and Choi, 2020), and a binary cross-entropy mention detection loss, L_m, used to better predict singletons.
Compared to the two-way annotated and merged dev/test set, the one-way annotated train set is more likely to be subject to annotator biases, leading to noise in the train set.These inconsistencies are further exacerbated when projected to silver data, leading to a low recall of mentions in training, as evidenced by the number of "additions" in Table 4.
To address this noise, we propose a modification of L_m that downweights negative labels. Following the notation from Xu and Choi (2021), let Ψ+ be the set of gold candidate mentions and Ψ− be the remainder of the candidate spans. Applying a hyperparameter τ ∈ [0, 1], we can rewrite the binary cross-entropy loss, L_m^τ, as

L_m^τ = − Σ_{x_i ∈ Ψ+} log P(x_i) − τ · Σ_{x_i ∈ Ψ−} log(1 − P(x_i)),

where x_i is a candidate span and P(x_i) is the output of the mention scorer. Following Xu and Choi (2021), the mention loss is also weighted in the final loss, L = L_c + α_m · L_m^τ.
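A plain-Python sketch of this loss (the probabilities are toy values; the actual model computes this over neural span scores). Setting τ = 1 recovers standard binary cross-entropy, while τ = 0 ignores negative spans entirely:

```python
import math

def mention_loss(pos_probs, neg_probs, tau):
    """Noise-tolerant mention detection loss: standard BCE on gold
    (positive) candidate spans, with the negative-span term scaled by
    tau in [0, 1] to tolerate unannotated true mentions in silver data."""
    pos = -sum(math.log(p) for p in pos_probs)
    neg = -tau * sum(math.log(1.0 - p) for p in neg_probs)
    return pos + neg

mention_loss([0.9], [0.2], tau=1.0)  # standard BCE over the two spans
mention_loss([0.9], [0.2], tau=0.5)  # negative span counts half as much
```

Intuitively, a "negative" span in projected data may simply be a mention the projection dropped, so its penalty is discounted rather than trusted outright.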

Data
We evaluate the performance of models across three datasets: MMC, OntoNotes (Pradhan et al., 2012), and Friends CI (Choi and Chen, 2018). OntoNotes is a collection of documents spanning multiple domains, some of which include multiparty conversations or dialogues, like weblogs, telephone conversations, and broadcast news (interviews). Furthermore, OntoNotes is available in English and Chinese. Friends CI is a collection of annotations on TV transcripts from Friends, including entity linking where character entities are linked. As the focus of this work is on multiparty conversations, we further separate OntoNotes into documents with 0 or 1 (ON≤1), 2 (ON2), or more than two (ON>2) speakers/authors for evaluation. We did not include split antecedents or dropped pronouns in the experiments, since the baseline model does not support predicting them. The statistics of the datasets used in our experiments are in Table 5.

Evaluation
We use the average of MUC (Vilain et al., 1995), B³ (Bagga and Baldwin, 1998), and CEAF-φ4 (Luo, 2005), which is also used for OntoNotes. Furthermore, each model is trained three times and the average test score (CoNLL F1) is reported.

Experiments and Results
First, we highlight the differences of MMC in contrast to Friends CI and OntoNotes. To do so, we train an off-the-shelf model on the three datasets. Additionally, we establish monolingual baselines for all three languages. Finally, we explore the cross- and multi-lingual settings to validate the recipe of using data projection for coreference resolution.

Monolingual Models
Table 6 shows the performance of several monolingual models. They highlight that models trained on other datasets (ON, Friends CI) perform substantially worse than models trained in-domain (on MMC). Additionally, we find that both combining datasets and using continued training from OntoNotes (Xia and Van Durme, 2021) can be effective for further improving model performance: for English, this leads to gains of 2.7 F1 points (combining) and 5.2 F1 points (continued training), and continued training is also effective in Chinese.
Notably, combining the Chinese datasets yields the best scores on dialogues (ON2, ON>2) in OntoNotes. This highlights the utility of the silver MMC data as a resource for augmenting preexisting in-domain data. Combining data is less helpful for English than Chinese, possibly because there is more training data in ON-En than ON-Zh, making the Chinese data augmentation more useful. The baselines for ON-Zh may also be less optimized by prior work than models for ON-En.

(Table 7: Performance of models trained on datasets of different languages (English, Farsi, and Chinese) and the combination of all three of them. All four models use XLM-R-base as the encoder.)

Cross-lingual and Multilingual Models
Next, we demonstrate the ability of the silver data in Chinese and Farsi to contribute towards creating a model with no in-language coreference resolution data. While Chinese and Farsi are the two languages we choose to study in this work, parallel subtitles for the TV shows in MMC are available in at least 60 other languages and can be used similarly, given a projection model.

Simple Baseline We adopt a simple head lemma match baseline to determine a lower bound for each language if we did not have any training data. We first find the NP constituencies as candidate mentions, derived from off-the-shelf constituency parsers. We adopt the Berkeley parser with T5-Large (Kitaev and Klein, 2018; Raffel et al., 2020) for English and the multilingual self-attention parser (Kitaev et al., 2019) with Chinese ELECTRA-180G-large (Cui et al., 2020) for Chinese. For Farsi, we adopted the constituency parser in DadmaTools (Etezadi et al., 2022). However, we were not confident in the Farsi parser quality (under 5 CoNLL F1 when evaluated on Farsi MMC), and could not find another widely used constituency parser for Farsi, so we omit Farsi in our results. To predict the clusters, we extract and lemmatize the head word for each mention. We link any two mentions that have the same head word lemma.
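The clustering rule of this baseline is a simple group-by on head lemmas. A sketch with a toy lemma table standing in for the parser and lemmatizer:

```python
def head_lemma_clusters(mentions, head_lemma):
    """Head-lemma-match baseline: link any two mentions whose head words
    share a lemma. `head_lemma` maps each mention to its head lemma (here
    a hand-written dict standing in for a parser plus lemmatizer)."""
    groups = {}
    for m in mentions:
        groups.setdefault(head_lemma[m], []).append(m)
    return list(groups.values())

mentions = ["the cars", "a red car", "the dog", "my car"]
lemmas = {"the cars": "car", "a red car": "car",
          "the dog": "dog", "my car": "car"}
clusters = head_lemma_clusters(mentions, lemmas)
# -> [["the cars", "a red car", "my car"], ["the dog"]]
```

The baseline over-merges (every "car" mention lands in one cluster regardless of referent) and cannot handle pronouns, which is exactly why it serves only as a lower bound.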

Cross-lingual Transfer
We evaluate the monolingual XLM-R models for English, Chinese, and Farsi on each of the languages, i.e., "test" for English, "test_correct" for Chinese, and "test_silver" for Farsi. This effectively evaluates the zero-shot ability for the other two languages.
Table 7 shows that models trained on English data or on silver projected data in Farsi and Chinese can achieve reasonable performance on the test set of their own language. Models trained on projected silver data in Farsi or Chinese achieve the best performance on their own test sets, compared with the zero-shot performance of models trained on another language. Consequently, this implies that the recipe of projecting a coreference resolution dataset to another language and using that data to train from scratch outperforms naive zero-shot transfer via multilingual encoders.

Multilingual Models
We combine the training data of the three languages and train multilingual models. Table 7 shows that these multilingual models achieve slightly to moderately worse performance on each test set compared to their monolingual counterparts. This contrasts with prior work (Fei et al., 2020; Daza and Frank, 2020; Yarmohammadi et al., 2021) that finds benefits to using silver data. The poorer performance of the multilingual model could be due to using the same set of hyperparameters for all three languages. While it does not surpass the monolingual models, it enjoys the benefit of being more parameter efficient.

Noise-tolerant Loss Results
Table 8 shows model performance using our modified loss. We find some benefits to downweighting negatively labeled spans, obtaining 1-3 points of improvement compared to the original loss across all three languages. Thus, MMC could also enable exploration into additional modeling questions around the use of projected and noisy annotations.

Analysis
We analyze our modeling results in relation to our original motivation.First, we explore differences between datasets (Section 6.1), the number of speakers (Section 6.2), and overfitting (Section 6.3).
For the data construction, we analyze the alignment corrections process (Section 6.4) and compare recipes for annotation projection (Section 6.5).

Comparison of Datasets
Since Friends CI is also based on TV shows (Friends) and its dataset overlaps with MMC, we would expect a model trained on Friends CI to perform well on MMC. Instead, we find that its performance is over 23 F1 points worse. The main difference between Friends CI and MMC is that Friends CI only annotates characters instead of all possible mentions, and therefore there are fewer mentions per document in Friends CI than in MMC. A closer inspection of the precision and recall appears to validate this hypothesis, as the macro precision (across the three metrics) is 65.8% compared to a recall of 37.5%. This is also evident in the mention span precision and recall, where a model trained on Friends CI scores 91.5% precision but only 50.3% recall. We see the same trend for OntoNotes: high precision and low recall, both on the coreference metrics and on mention boundaries.

Number of Speakers

One might expect performance to degrade with more speakers. However, this is not the case with both Friends CI and MMC, which perform best on two-person dialogues. Nonetheless, the drop in performance from ON2 to ON>2 highlights the additional difficulty of multiparty dialogue documents (in OntoNotes). These trends are similar for both English and Chinese.

Overfitting to Specific Shows
As one of our goals is a dataset enabling a better understanding of multiparty conversations, a concern is that models may overfit to the limited (two) TV shows and the subset of characters (and names) in the training set.While the test set contains our target domain (multiparty conversation), it also shares characters and themes with the training set.
Names We test whether models are sensitive to speaker names, perhaps overfitting to the character names and their traits. We replace speaker names in the original MMC dataset with random names. First, we assume the self-identified genders of the speakers through their pronoun usage. Next, for each scene, we replace the name of a character with a randomly sampled name of the same gender. The results in Table 9 show that models do overfit to character names: for models trained on MMC, Friends CI, and ON+MMC, performance on MMC test sets drops after replacing names, thereby showing that they are sensitive to names seen in training. On the other hand, both ON and ON→MMC show more robustness to changes in speaker name. This is likely because ON does not have a persistent set of characters for the entire dataset.

We create a training set (MMC-Name) without a persistent set of characters or speakers by randomly replacing the character names. While MMC performance drops slightly compared to a model trained with the original data, it outperforms on the name-replaced test set. Since we have the {original, replaced} name mapping, we can convert predictions from MMC-Name to MMC, resulting in an F1 on MMC competitive with the baseline, after post-processing. These findings support the hypothesis that models that see names used in a "generic fashion" are more robust towards name changes (Shwartz et al., 2020).
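The name-replacement perturbation can be sketched as below. The names, genders, and name pool are invented; the real experiment infers gender from pronoun usage and samples from larger name lists:

```python
import random

def replace_speaker_names(scene, name_pool, seed=0):
    """Perturbation sketch: remap each character to a random name of the
    same gender, consistently within a scene, so coreference structure is
    preserved while surface names change. `scene` is a list of
    (speaker, gender, utterance) tuples."""
    rng = random.Random(seed)  # seeded for reproducibility
    mapping = {}
    out = []
    for speaker, gender, utterance in scene:
        if speaker not in mapping:
            mapping[speaker] = rng.choice(name_pool[gender])
        out.append((mapping[speaker], gender, utterance))
    return out, mapping

scene = [("Ross", "m", "hi"), ("Ross", "m", "bye"), ("Monica", "f", "hey")]
out, mapping = replace_speaker_names(scene, {"m": ["Alex"], "f": ["Dana"]})
# both Ross utterances map to the same replacement name
```

Because the mapping is stored, predictions on the perturbed data can be converted back to the original names for evaluation, as done for MMC-Name.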
TV Series To determine overfitting to a specific TV show, we split MMC (English) into two components, MMC-Friends and MMC-TBBT, shown in Table 10. In this analysis, we find that the variance due to random seed is high, which might explain why training with MMC-TBBT appears to be the best model. The results suggest both models find MMC-TBBT easier to predict. Furthermore, training with MMC-TBBT outperforms MMC-Friends when evaluated on MMC-Friends, suggesting that the substantially larger size of the MMC-TBBT portion beats any in-domain advantages MMC-Friends may have.

Alignment Correction
To identify the types of systematic errors made by automatic projection, we analyzed the corrected Chinese alignments; there is a similar pattern in Farsi. Most of the drop is in recall, since many new mentions are added via alignment correction. These new additions are mostly words that contain compound possessive pronouns or verbal inflectional suffixes that align to a source English word, which are not often captured by automatic word alignment methods. For example, the word " " is a verb with the inflectional suffix " " aligning to the source mention "I". Another example is " ", composed of the noun " " plus the possessive pronoun " " aligning to the source span "your" in Fig. 3.

Annotation from Scratch
Instead of relying on noisy (but free) projections of parallel English data, one could directly annotate coreference in the target language with native speakers. To investigate the quality of test_silver and test_correct, we perform an analysis study on three randomly sampled scenes from the Chinese test set and ask an annotator to complete the full coreference annotation task. We also obtain oracle word alignments to explore the effect of alignment errors in our data projection framework.
We find MUC score (agreement) rates of 71.84, 78.23, and 87.25 using test_silver, test_correct, and oracle projections, respectively. This suggests that the corrected test set has a comparable agreement rate to that of the gold data, while the oracle projections are also within the range of inter-annotator agreement. As automatic alignment methods improve, our recipe for creating multilingual coreference data will also benefit. Nonetheless, one limitation of MMC is that the quality of the Chinese and Farsi test sets could still be higher.
Advantages Despite lower quality, the data projection method still has several advantages over from-scratch annotation: it is faster, and it places less demand on in-language experts.
First, annotation from scratch requires a syntactic parser to find constituents for mention linking (Sec. 3.2). The zero-shot transfer setting usually involves lower-resource languages, where parsers, if they exist at all, may not perform well. Thus, projection may be the only solution in these cases.
Second, annotation quality depends on the guidelines. Linguistic experts in the target language would need to design annotation guidelines, and such experts are not always available. This step can be skipped with projection (since we are releasing MMC, which has parallel text in numerous languages). Not only is the projection task itself significantly simpler to explain, it is easier to understand and can be faster than annotating from scratch. In our setting, around 70% of the predicted alignments were marked as correct. One could design heuristics to present only the difficult mention pairs, which would further reduce annotation cost.

Conclusion
Motivated by a desire to better understand spontaneous multilingual conversations, we developed a collection of coreference annotations on the transcripts and subtitles of two popular TV sitcoms. To reduce the cost of annotating from scratch for each language, we selected English data for which gold human translations already existed in the form of subtitles, allowing us to automatically project our annotations from English. After manually correcting these projections, we observe only a few points of difference in the reported scores of various multilingual models.
There exist dozens of additional languages to which our annotations could be projected in the future. If automatic projection introduces only a few points of variance in the estimated performance of a model, we believe this framework is sufficient to drive significant new work in coreference across many non-English languages.

Limitations
There are several limitations in the dataset, inherent to the difficulty of the task, crowdsourcing, and the use of models for candidate proposals. The inter-annotator agreement scores are not perfect. One contributing factor is that we do not postprocess or provide explicit instructions for pleonastic pronouns, so annotators used their own judgment; these account for 3.15% of the mentions in the pilot annotation. There is also a distribution difference between the (noisier) train set and the dev/test sets, caused by different annotator sources, how they were paid, and whether the annotations were adjudicated. Additionally, annotation was performed without access to the underlying video, which could impede annotation or encourage guessing when situatedness is required. Finally, since annotation in MMC is aided by other models (a parser and an aligner), systematic errors from those models may not be caught during annotation.

Appendices A Split Antecedent Statistics
MMC-En has a number of split antecedents: 1,156 antecedents across 2,745 spans in the training set, 255 across 717 spans in the dev set, and 178 across 444 spans in the test set.

B Merging Two-way Annotations
A third annotator adjudicates disagreements in the two-way annotations of the dev/test sets. To decide whether a pair of annotations disagrees, we first build common clusters between the two annotations. After annotation, each query mention is annotated with two antecedents:

A = {(q_1, a_1^1, a_1^2), (q_2, a_2^1, a_2^2), ..., (q_n, a_n^1, a_n^2)}

where q_i is the i-th query, n is the number of candidate queries, and a_i^1 and a_i^2 are the two antecedents linked to the i-th query. We build initially agreed clusters by taking the transitive closure of the subset of A where each triplet agrees exactly (i.e., a_i^1 = a_i^2 for q_i) between the two annotations. Note that an annotation a_i^j can itself be another query span q_k that is also annotated; this lets us connect the annotations and form clusters.
Next, we incrementally add query spans to these clusters if both annotators link them to the same cluster (a_i^1 ≠ a_i^2, but a_i^1 and a_i^2 belong to the same cluster anyway), continuing until no further pairs agree. At the end, if there exist q_i where a_i^1 ≠ a_i^2, then each such (q_i, a_i^1, a_i^2) is marked for adjudication. The adjudicator is prompted to select between a_i^1 and a_i^2, or to relabel q_i entirely; their annotation is final.
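The merging procedure can be sketched with a union-find structure. This is a minimal illustration with invented names; the actual implementation may differ:

```python
def merge_two_way(annotations):
    """Merge two-way coreference annotations.
    annotations: list of (query, antecedent_1, antecedent_2) triplets,
    where spans are hashable ids. Returns (clusters, to_adjudicate)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:          # path halving
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # 1) Transitive closure over exactly-agreeing triplets.
    pending = []
    for q, a1, a2 in annotations:
        if a1 == a2:
            union(q, a1)
        else:
            pending.append((q, a1, a2))

    # 2) Add queries whose two antecedents already share a cluster,
    #    repeating until no further pairs agree (a fixpoint).
    changed = True
    while changed:
        changed = False
        rest = []
        for q, a1, a2 in pending:
            if find(a1) == find(a2):
                union(q, a1)
                changed = True
            else:
                rest.append((q, a1, a2))
        pending = rest

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values()), pending  # pending -> adjudication
```

Here the second annotation in each triplet may itself be a query span, so union-find naturally connects chains of agreeing links into clusters.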

C Word Alignment
Word alignments are extracted from the fine-tuned XLM-R-large model using Awesome-align. We first fine-tuned XLM-R on English-{Chinese, Farsi} parallel data that had been filtered using LASER semantic similarity scores (Schwenk and Douze, 2017; Thompson and Post, 2020). We reuse empirically chosen Awesome-align hyperparameters from prior work on a similar task (Yarmohammadi et al., 2021): softmax normalization with a probability threshold of 0.001, 4 gradient accumulation steps, 1 training epoch with a learning rate of 2 × 10^-5, alignment layer 16, and the masked language modeling ("mlm"), translation language modeling ("tlm"), self-training ("so"), and parallel sentence identification ("psi") training objectives. We further fine-tuned the resulting model on gold word alignments for 1,500 En-Fa and 2,800 En-Zh sentence pairs with the same hyperparameters, except for 5 training epochs, a learning rate of 10^-4, and only "so" as the training objective.
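The softmax extraction step can be illustrated with a toy sketch. This is our own simplification of the idea, not Awesome-align's actual code; `extract_alignments` and its inputs are hypothetical:

```python
import numpy as np

def extract_alignments(sim, threshold=0.001):
    """Softmax-based word alignment extraction with probability
    thresholding: keep a source-target pair when its probability
    exceeds the threshold in BOTH directions.
    sim: (len_src, len_tgt) similarity matrix of token embeddings.
    Returns a sorted list of (src_index, tgt_index) pairs."""
    def softmax(x, axis):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    p_src2tgt = softmax(sim, axis=1)  # each source token over targets
    p_tgt2src = softmax(sim, axis=0)  # each target token over sources
    keep = (p_src2tgt > threshold) & (p_tgt2src > threshold)
    return sorted(zip(*np.nonzero(keep)))
```

With the very low 0.001 threshold used above, many candidate links survive; the bidirectional requirement is what prunes most spurious pairs.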

D Hyperparameters
We reuse most of the hyperparameters from Xu and Choi (2020): we enumerate spans up to a maximum span width of 30 and set the maximum number of speakers to 200, the "top span ratio" to 0.4, and the maximum number of top antecedents (beam size) to 50. For XLM-R models, we set the LM learning rate to 10^-5 and the task learning rate to 3 × 10^-4. For SpanBERT models, we use an LM learning rate of 2 × 10^-5 and a task learning rate of 2 × 10^-4. Following a grid search, we set the mention loss weight (α_m) for each language and dataset: 5 for MMC-Zh and MMC-En, 6.5 for MMC-Fa, and 0 for OntoNotes. For τ, we find that τ_Fa = 0.55, τ_Zh = 0.7, and τ_En = 0.7 performed best on dev.
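For reference, the settings above can be collected into a single configuration sketch (the key names are illustrative, not the actual code's):

```python
# Hypothetical config dict summarizing the hyperparameters above.
HPARAMS = {
    "max_span_width": 30,
    "max_num_speakers": 200,
    "top_span_ratio": 0.4,
    "max_top_antecedents": 50,          # antecedent beam size
    "lm_lr": {"xlm-r": 1e-5, "spanbert": 2e-5},
    "task_lr": {"xlm-r": 3e-4, "spanbert": 2e-4},
    # mention loss weight (alpha_m), chosen by grid search
    "mention_loss_weight": {
        "mmc-en": 5, "mmc-zh": 5, "mmc-fa": 6.5, "ontonotes": 0,
    },
    # tau per language, chosen on dev
    "tau": {"fa": 0.55, "zh": 0.7, "en": 0.7},
}
```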
Figure 1: (Top) An example of ambiguous coreference due to multi-person context: Penny's "you" could refer to Sheldon, Howard, or both. (Bottom) Annotations can be projected to new languages, enabling model training beyond English.

Figure 3: Example utterances with gold English and projected Chinese and Farsi coreference annotations.

Table 1: Examples of dialogue coreference datasets. Nothing to our knowledge satisfies our desire for modeling spontaneous multiparty conversations. Additionally, parallel data is available for MMC, which enables exploration in non-English languages. Superscript C indicates that they were additionally annotated by

Table 3: MMC statistics. English (En) is manually annotated, while Chinese (Zh) and Farsi (Fa) are projected automatically.

Table 4 :
Corrections in Chinese and Farsi test sets.The number in brackets is the number of dropped pronouns that are recovered.
In total, 1,904 (24.81%) projections are corrected in Chinese and 2,485 (32.26%) in Farsi. There are three types of corrections, shown in Table 4: addition, deletion, and modification. For addition, a mention boundary is added for a null projection. For deletion, the predicted projection is discarded. For modification, the predicted mention boundaries are adjusted.
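The three correction types can be distinguished mechanically. A small sketch, assuming spans are (start, end) token offsets and None marks a null/absent projection (the function name is ours):

```python
def correction_type(projected, corrected):
    """Classify a correction, following the three types above.
    projected: automatically projected span, or None if no projection.
    corrected: human-corrected span, or None if the mention was removed."""
    if projected is None and corrected is not None:
        return "addition"      # mention boundary added for a null projection
    if projected is not None and corrected is None:
        return "deletion"      # predicted projection discarded
    if projected != corrected:
        return "modification"  # mention boundaries adjusted
    return "unchanged"         # projection accepted as-is
```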

Table 5 :
Statistics for additional datasets used in this work.OntoNotes has a Chinese split; we are not aware of other Farsi coreference datasets.

Table 6: F1 (%) scores of models trained on combinations of different datasets for English and Chinese. All English models except MMC (XLM-R) use SpanBERT-Large as the encoder, while MMC (XLM-R) and the Chinese models use XLM-R-base as the encoder.

Table 8: Test set performance of models trained with different τ. τ = 1 is the regular binary cross-entropy mention loss reported earlier in the paper. τ is chosen according to a grid search (Sec. 4.2).
Table 6 also shows that on OntoNotes, models perform worse on documents with more speakers.

Table 11: F1 of models on the Chinese and Farsi test sets before and after correction.
Table 11 shows the difference in model performance between the corrected and the silver test set. Performance drops a few F1 points on the corrected set, which is caused by the distribution shift from the (uncorrected, silver) training data. Naturally, MMC-Zh suffers the largest drop because it is closest in domain to test_silver; however, it is still one of the best performing models. The performance drop of the ON-only trained model is only 0.85 points, possibly because this model is trained on the cleaner (gold) training labels. These observations suggest that while alignment correction yields a cleaner test set, the automatic silver data is still a good substitute for model development when no gold data is available.