A Survey on Cross-Lingual Summarization

Abstract Cross-lingual summarization is the task of generating a summary in one language (e.g., English) for the given document(s) in a different language (e.g., Chinese). Against the backdrop of globalization, this task has attracted increasing attention from the computational linguistics community. Nevertheless, a comprehensive review of this task is still lacking. Therefore, we present the first systematic critical review of the datasets, approaches, and challenges in this field. Specifically, we carefully organize existing datasets and approaches according to their construction methods and solution paradigms, respectively. For each type of dataset or approach, we thoroughly introduce and summarize previous efforts and compare them with each other to provide deeper analyses. Finally, we discuss promising directions and offer our thoughts to facilitate future research. This survey is intended for both beginners and experts in cross-lingual summarization, and we hope it will serve as a starting point as well as a source of new ideas for researchers and engineers interested in this area.


Introduction
To help people efficiently grasp the gist of documents in a foreign language, Cross-Lingual Summarization (XLS) aims to generate a summary in the target language from the given document(s) in a different source language. This task can be regarded as a combination of monolingual summarization (MS) and machine translation (MT), both of which are unsolved natural language processing (NLP) tasks that have been studied continuously for decades (Paice, 1990; Brown et al., 1993).
XLS is an extremely challenging task: (1) from the perspective of data, unlike MS, naturally occurring documents in a source language paired with corresponding summaries in different target languages are rare, making it difficult to collect large-scale, human-annotated datasets (Ladhak et al., 2020; Perez-Beltrachini and Lapata, 2021); (2) from the perspective of models, XLS requires both the ability to translate and the ability to summarize, which makes it hard to generate accurate summaries by conducting XLS directly (Cao et al., 2020).
Despite its importance, XLS attracted little attention (Leuski et al., 2003; Wan et al., 2010) in the statistical learning era due to its difficulty and the scarcity of parallel corpora. Recent years have witnessed the rapid development of neural networks, especially the emergence of pre-trained encoder-decoder models (Zhang et al., 2020a; Raffel et al., 2020; Lewis et al., 2020; Liu et al., 2020; Tang et al., 2021; Xue et al., 2021), enabling neural summarizers and translators to achieve impressive performance. Meanwhile, creating large-scale XLS datasets has proven feasible by utilizing existing MS datasets (Zhu et al., 2019; Wang et al., 2022b) or Internet resources (Ladhak et al., 2020; Perez-Beltrachini and Lapata, 2021). These successes have laid the foundation for the XLS research field and gradually attracted interest in XLS. In particular, researchers have recently put substantial effort into the XLS task, publishing more than 20 papers over the past five years. Nevertheless, a systematic review of the progress, challenges, and opportunities of XLS is still lacking.
To fill this gap and help new researchers, in this paper we provide the first comprehensive review of existing efforts relevant to XLS and suggest multiple promising directions for future research. Specifically, we first briefly introduce the formal definition and evaluation metrics of XLS (§ 2), which serve as necessary background before delving further into XLS. Then, we provide an exhaustive overview of existing XLS research datasets (§ 3). In detail, to alleviate the scarcity of XLS data, previous work resorts to different ways of constructing large-scale benchmark datasets, which we divide into synthetic datasets and multi-lingual website datasets. The synthetic datasets (Zhu et al., 2019; Bai et al., 2021a; Wang et al., 2022b) are constructed by (manually or automatically) translating the summaries of existing MS datasets from a source language into target languages, while the multi-lingual website datasets (Nguyen and Daumé III, 2019; Ladhak et al., 2020; Fatima and Strube, 2021; Perez-Beltrachini and Lapata, 2021) are collected from websites that provide multi-lingual versions of their content.
Next, we thoroughly introduce and summarize existing models, which are organized with respect to two paradigms, i.e., pipeline (§ 4) and end-to-end (§ 5). In detail, the pipeline models adopt either translate-then-summarize approaches (Leuski et al., 2003; Boudin et al., 2011; Wan, 2011; Yao et al., 2015; Zhang et al., 2016; Linhares Pontes et al., 2018; Wan et al., 2018; Ouyang et al., 2019) or summarize-then-translate approaches (Orȃsan and Chiorean, 2008; Wan et al., 2010). In this manner, the pipeline models avoid conducting XLS directly, thus bypassing the model challenge discussed previously. However, the pipeline method suffers from error propagation and recurring latency, making it unsuitable for real-world scenarios (Ladhak et al., 2020). Consequently, the end-to-end method has attracted more attention. To alleviate the model challenge, it generally utilizes related tasks (e.g., MS and MT) as auxiliaries or resorts to external resources. The end-to-end models mainly fall into four categories, i.e., multi-task methods (Zhu et al., 2019; Takase and Okazaki, 2020; Cao et al., 2020; Bai et al., 2021a; Liang et al., 2022), knowledge-distillation methods (Ayana et al., 2018; Duan et al., 2019; Nguyen and Luu, 2022), resource-enhanced methods (Zhu et al., 2020; Jiang et al., 2022) and pre-training methods (Dou et al., 2020; Xu et al., 2020; Ma et al., 2021; Chi et al., 2021a; Wang et al., 2022b). For each category, we thoroughly go through the previous work and discuss the corresponding pros and cons. Finally, we also point out multiple promising directions for XLS to push forward future research (§ 6), followed by the conclusion (§ 7). Our contributions are summarized as follows:
• To the best of our knowledge, this survey is the first to present a thorough review of XLS.
• We comprehensively review existing XLS work and carefully organize it according to different frameworks.
• We suggest multiple promising directions to facilitate future research on XLS.

Task Definition
Given a collection of documents in the source language $\mathcal{D} = \{D_i\}_{i=1}^{m}$ ($m$ denotes the number of documents and $m \geq 1$), the goal of XLS is to generate the corresponding summary in the target language $Y = \{y_i\}_{i=1}^{n}$ with $n$ words. The conditional distribution of XLS models is:

$$P(Y \mid \mathcal{D}; \theta) = \prod_{t=1}^{n} p(y_t \mid \mathcal{D}, y_{1:t-1}; \theta)$$

where $\theta$ represents the model parameters and $y_{1:t-1}$ is the partial ground-truth summary.
It is worth noting that: (1) when $m > 1$, the task becomes cross-lingual multi-document summarization (XLMS), which has been discussed in some previous studies (Orȃsan and Chiorean, 2008; Boudin et al., 2011; Zhang et al., 2016); (2) when the given documents are dialogues, the task becomes cross-lingual dialogue summarization (XLDS), which was recently proposed by Wang et al. (2022b). Both XLMS and XLDS are within the scope of this survey. Furthermore, we require that the source and target languages in XLS be two distinct human languages, which means that (1) if the source language is a code-mixed blend of two natural languages (e.g., Chinese and English), the target language should be neither of the two; (2) programming languages (e.g., PYTHON or JAVA) may serve as neither the source nor the target language.

Evaluation
Following MS, ROUGE scores (Lin, 2004) are universally adopted as the basic automatic metrics for XLS, especially the F1 scores of ROUGE-1, ROUGE-2 and ROUGE-L, which measure the unigram, bigram and longest-common-subsequence overlap between the ground-truth and generated summaries, respectively. Nevertheless, the original ROUGE scores are specifically designed for English. To make these metrics suitable for other languages, some useful toolkits have been released, e.g., multi-lingual ROUGE and MLROUGE. In addition to these metrics based on lexical overlap, recent work proposes new metrics based on semantic similarity (token/word embeddings), such as MoverScore (Zhao et al., 2019) and BERTScore (Zhang et al., 2020b), which have shown strong agreement with human judgements on MS (Koto et al., 2021).
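As a concrete illustration, ROUGE-1 F1 reduces to clipped unigram overlap between the two texts. The following minimal sketch assumes whitespace-tokenized text; real toolkits (including the multi-lingual variants mentioned above) add stemming and language-specific tokenization:

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Minimal ROUGE-1 F1: clipped unigram overlap between reference
    and candidate. Whitespace splitting is a simplification; real
    toolkits add stemming and language-specific tokenization."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    # Clipped overlap: each candidate unigram counts at most as often
    # as it appears in the reference.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("the cat sat on the mat", "the cat lay on the mat"), 3))  # → 0.833
```

ROUGE-2 follows the same pattern over bigrams, and ROUGE-L replaces the overlap count with the longest common subsequence length.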

Datasets
In this section, we review available large-scale XLS datasets and further divide them into two categories: synthetic datasets (§ 3.1) and multi-lingual website datasets (§ 3.2). For each category, we introduce the construction details and key characteristics of the corresponding datasets. In addition, we compare the two categories to provide a deeper understanding (§ 3.3).

Synthetic Datasets
Intuitively, one straightforward way to build XLS datasets is to directly translate the summaries of an MS dataset from their original language into different target languages. Datasets built in this way are called synthetic datasets, which benefit from existing MS datasets. (There are also some XLS datasets from the statistical learning era, e.g., several MultiLing datasets (Giannakopoulos, 2013; Giannakopoulos et al., 2015) and the translated DUC2001 dataset (Wan, 2011). However, these datasets are either not public or extremely limited in scale (typically fewer than 100 samples), so we do not discuss them in depth.)

Datasets Construction. En2ZhSum (Zhu et al., 2019) is constructed by utilizing a sophisticated MT service to translate the summaries of CNN/Dailymail (Hermann et al., 2015) and MSMO (Zhu et al., 2018) from English into Chinese. In the same way, Zh2EnSum (Zhu et al., 2019) is built by translating the summaries of LCSTS (Hu et al., 2015) from Chinese into English. Later, Bai et al. (2021a) construct En2DeSum by translating the English Gigaword into German using the WMT'19 English-German winner MT model (Ng et al., 2019). More recently, Wang et al.
(2022b) construct XSAMSum and XMediaSum, which directly employ professional translators to translate the summaries of two dialogue-oriented MS datasets, i.e., SAMSum (Gliwa et al., 2019) and MediaSum (Zhu et al., 2021), from English into both German and Chinese. In this way, their datasets achieve much higher quality than the automatically constructed ones.

Quality Controlling. Since translation results provided by MT services might contain flaws, En2ZhSum, Zh2EnSum and En2DeSum further use the round-trip translation (RTT) strategy to filter out low-quality samples. Specifically, given a monolingual document-summary pair ⟨D_src, S_src⟩, the summary S_src is first translated into the target language, yielding S_tgt, and then S_tgt is translated back into the source language, yielding S'_src. Next, ⟨D_src, S_tgt⟩ is retained as an XLS sample only if the ROUGE scores between S_src and S'_src exceed pre-defined thresholds. In addition, the translated summaries in the test sets of En2ZhSum and Zh2EnSum are post-edited by human annotators to ensure the reliability of model evaluation.
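The RTT filtering strategy can be sketched as follows. Here `translate` is a hypothetical placeholder for any MT system, a simplified unigram F1 stands in for the full ROUGE scores, and the threshold value is illustrative rather than taken from the cited papers:

```python
from collections import Counter
from typing import Callable

def unigram_f1(a: str, b: str) -> float:
    """Simplified unigram-overlap F1, standing in for full ROUGE."""
    ca, cb = Counter(a.split()), Counter(b.split())
    overlap = sum((ca & cb).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / sum(cb.values()), overlap / sum(ca.values())
    return 2 * p * r / (p + r)

def rtt_filter(pairs, translate: Callable[[str, str, str], str],
               threshold: float = 0.45):
    """Keep <document, translated summary> pairs whose round-trip
    translation stays close to the original summary.

    `translate(text, src, tgt)` is a placeholder for any MT system;
    the 0.45 threshold is illustrative, not from any specific paper."""
    kept = []
    for doc_src, sum_src in pairs:
        sum_tgt = translate(sum_src, "en", "zh")   # forward translation
        sum_back = translate(sum_tgt, "zh", "en")  # back translation
        if unigram_f1(sum_src, sum_back) >= threshold:
            kept.append((doc_src, sum_tgt))
    return kept

# Toy identity "MT system" for demonstration only.
demo = rtt_filter([("doc", "a short summary")],
                  translate=lambda text, s, t: text)
print(len(demo))  # identity translation passes the filter
```

In the actual pipelines, the forward and backward steps use a commercial MT service and the comparison uses ROUGE-1/2/L with per-metric thresholds.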
As for the manually translated synthetic datasets, i.e., XSAMSum and XMediaSum, Wang et al. (2022b) design a quality control loop, in which data reviewers and experts participate, to ensure the accuracy of the translation.

Datasets Statistics. Table 1 compares previous synthetic datasets in terms of translation method, genre, scale, source language and target language. We conclude that: (1) There is a trade-off between scale and quality. In line with MS, the scale of XLS datasets in the news domain is much larger than in other domains, since news articles are convenient to collect. Faced with such large-scale datasets, it is expensive and even impractical to manually translate or post-edit all the summaries. Thus, these datasets generally adopt automatic translation methods, which limits their quality.
(2) The XLS datasets in the dialogue domain are considerably more challenging than those in the news domain. Besides their limited scale, the key information of a dialogue is often scattered across multiple utterances, leading to low information density (Feng et al., 2022c), which, together with complex dialogue phenomena (e.g., coreference, repetition and interruption), makes the task quite challenging (Wang et al., 2022b).

Multi-Lingual Website Datasets
Amid globalization, online resources in different languages are growing rapidly. One reason is that many websites have begun to provide multi-lingual versions of their content to serve global users. Therefore, these websites may contain a large number of parallel documents in different languages, and some researchers utilize such resources to build XLS datasets.

Datasets Construction. Nguyen and Daumé III (2019) collect news articles from the Global Voices website, which reports and translates news about unheard voices across the globe. Translation of the news on this website is performed by volunteer translators. Each news article also links to its parallel articles in other languages, if available, so it is convenient to obtain different language versions of an article. They then employ crowdworkers to write English summaries for hundreds of selected English articles. In this manner, the non-English articles together with the English summaries constitute the Global Voices XLS dataset. Although this dataset utilizes online resources, the way its summaries are collected (i.e., crowd-sourcing) limits its scale and directions (the target language must be English).
To alleviate this dilemma, WikiLingua (Ladhak et al., 2020) collects multi-lingual guides from WikiHow, where each step in a guide consists of a paragraph and a corresponding one-sentence summary. Heuristically, the dataset combines the paragraphs and one-sentence summaries of all steps in a guide to create a monolingual article-summary pair. With the help of hyperlinks between parallel guides in different languages, an article in one language and its summary in another are easy to align. In this way, WikiLingua collects articles and corresponding summaries in 18 different languages, leading to 306 (18×17) directions. Similarly, Perez-Beltrachini and Lapata (2021) construct XLS datasets from Wikipedia, a widely-used multi-lingual encyclopedia. In detail, Wikipedia articles are typically organized into lead sections and bodies. They focus on 4 languages and pair lead sections with the corresponding bodies in different languages to construct XLS samples. The collected samples form the XWikis dataset with 12 directions. Moreover, Hasan et al. (2021a) construct the CrossSum dataset by automatically aligning identical news articles written in different languages from the XL-Sum dataset (Hasan et al., 2021b). The multi-lingual news article-summary pairs in XL-Sum are collected from the BBC website. As a result, CrossSum involves 45 languages and 1,936 directions.
Quality Controlling. For the manually annotated dataset, i.e., Global Voices, Nguyen and Daumé III (2019) employ human evaluation to remove low-quality annotated summaries. For the automatically collected datasets, i.e., WikiLingua and XWikis, the desired content is typically extracted from the websites via heuristic matching rules to ensure correctness. As for the automatically aligned dataset, i.e., CrossSum, Hasan et al. (2021a) adopt LaBSE (Feng et al., 2022a) to encode all summaries from XL-Sum (Hasan et al., 2021b). They then align documents in different languages based on the cosine similarity of the corresponding summaries, and pre-define a minimum similarity score to reduce the number of incorrect alignments.

Datasets Statistics. The number of samples in each direction of the same dataset may differ, since different articles are available in different languages. Hence, we measure the overall scale of each dataset by its average, maximum and minimum number of XLS samples per direction, respectively. We find that: (1) The scale of Global Voices is far smaller than that of the other datasets due to the different method used to collect summaries. Specifically, the WikiLingua, XWikis and XL-Sum (the basis of CrossSum) datasets automatically extract a huge number of summaries from online resources via simple strategies rather than crowd-sourcing.
(2) CrossSum and WikiLingua involve more languages than the other datasets, and most language pairs have intersecting articles, resulting in numerous cross-lingual directions.
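The similarity-based alignment behind CrossSum can be sketched as below. The toy vectors stand in for LaBSE summary embeddings, and the greedy best-match strategy and 0.8 threshold are illustrative assumptions rather than the exact procedure of Hasan et al. (2021a):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def align_by_summary(embs_a, embs_b, min_sim=0.8):
    """Greedy cross-lingual article alignment: pair each article in
    language A with its most similar article in language B, keeping a
    pair only if the summary-embedding similarity clears a minimum
    threshold.

    `embs_a`/`embs_b` map article ids to summary embeddings (LaBSE in
    the actual pipeline; toy vectors here). The 0.8 threshold and the
    greedy strategy are illustrative assumptions."""
    pairs = []
    for id_a, e_a in embs_a.items():
        best_id, best_sim = None, min_sim
        for id_b, e_b in embs_b.items():
            sim = cosine(e_a, e_b)
            if sim >= best_sim:
                best_id, best_sim = id_b, sim
        if best_id is not None:
            pairs.append((id_a, best_id, round(best_sim, 3)))
    return pairs
```

The minimum-similarity cutoff is what keeps the number of incorrect alignments low: a pair of articles that merely share a topic, but are not translations of the same story, tends to fall below it.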

Discussion
Based on the above review of large-scale XLS datasets, the approaches to building datasets can be summarized as: (I) manually or (II) automatically translating the summaries of MS datasets; (III) automatically collecting documents as well as summaries from multi-lingual websites. Among them, approach I introduces less noise than the others, since its translation and quality control are performed by professional translators rather than machine translation or volunteers. However, this approach is too labor-intensive and costly for building large-scale datasets. For instance, to control costs, XMediaSum (Wang et al., 2022b) only manually translates a portion (~8.6%) of the summaries of MediaSum (Zhu et al., 2021). Besides, Zh2EnSum and En2ZhSum (Zhu et al., 2019) are automatically collected via approach II, and only their test sets have been manually corrected. Therefore, despite the high quality of the constructed data, approach I is more suitable for building the validation and test sets of large-scale XLS datasets rather than whole datasets.
Approaches II and III can be adopted to build whole XLS datasets. We discuss them in the following situations: (1) High-resource source languages ⇒ high-resource target languages: This situation has been well studied in previous work, and most of the proposed XLS datasets focus on it. Both approaches II and III are useful for constructing XLS datasets whose source and target languages are both high-resource.
(2) High-resource source languages ⇒ low-resource target languages: When the documents and summaries of XLS datasets are in a high-resource and a low-resource language, respectively, approach III loses its effectiveness. This is because, for a multi-lingual website, the content in a low-resource language is typically much scarcer than that in a high-resource language. As a result, the number of collected XLS samples involving low-resource languages is significantly limited. For example, WikiLingua (Ladhak et al., 2020), a multi-lingual website dataset, contains 113.2k English⇒Spanish samples but only 7.2k English⇒Czech samples. In this situation, approach II might be a viable way to collect a large number of samples. Note that MT from a high-resource language into a low-resource language might involve more translation flaws than MT between two high-resource languages. Thus, beyond the RTT strategy, how to filter out the potential flaws is worth further study.
(3) Low-resource source languages ⇒ high- or low-resource target languages: If the source language is low-resource, there might be neither an MS dataset nor enough website content in that language, causing both approaches II and III to fail. Therefore, how to build datasets in this situation remains an open problem that needs to be explored in the future. As pointed out by Feng et al. (2022b), one straightforward approach is to automatically translate both documents and summaries from high-resource MS datasets. However, translating documents of hundreds of words might introduce substantial noise, especially when low-resource languages are involved. Thus, its practicality and reliability require more careful justification.

Pipeline Methods
Early XLS work generally focuses on pipeline methods, whose main idea is to decompose XLS into MS and MT sub-tasks and then accomplish them step by step. These methods can be further divided into summarize-then-translate (Sum-Trans) and translate-then-summarize (Trans-Sum) types according to the order in which the sub-tasks are performed. For each type, we systematically present previous methods. Besides, we compare the two types to provide deeper analyses.

Sum-Trans
Orȃsan and Chiorean (2008) utilize the Maximum Marginal Relevance (MMR) algorithm to summarize Romanian news, and then translate the summaries from Romanian into English via the eTranslator MT service. Furthermore, Wan et al. (2010) find that the translated summaries could suffer from low readability due to the limited MT performance at that time. To alleviate this issue, they first use a trained SVM model (Cortes and Vapnik, 1995) to predict the translation quality of each English sentence, where the model only leverages features of the English sentences. They then select sentences with high quality and informativeness to form summaries, which are finally translated into Chinese by the Google MT service.
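For reference, MMR greedily picks the sentence that is most relevant to the document while least similar to the sentences already selected. The sketch below uses word-set Jaccard similarity as a stand-in for the richer similarity measures used in the actual systems:

```python
def jaccard(a: set, b: set) -> float:
    """Word-set Jaccard similarity, a stand-in for richer measures."""
    return len(a & b) / len(a | b) if a | b else 0.0

def mmr_summarize(sentences, k=2, lam=0.7):
    """Maximal Marginal Relevance sentence selection, a minimal sketch.

    Relevance is similarity to the whole document, redundancy is the
    maximum similarity to already-selected sentences; `lam` trades the
    two off. Returns the selected sentences in document order."""
    sent_words = [set(s.lower().split()) for s in sentences]
    doc_words = set().union(*sent_words) if sent_words else set()
    selected = []
    while len(selected) < min(k, len(sentences)):
        best_i, best_score = None, float("-inf")
        for i, words in enumerate(sent_words):
            if i in selected:
                continue
            relevance = jaccard(words, doc_words)
            redundancy = max((jaccard(words, sent_words[j]) for j in selected),
                             default=0.0)
            score = lam * relevance - (1 - lam) * redundancy
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
    return [sentences[i] for i in sorted(selected)]
```

With `lam` close to 1, MMR reduces to pure relevance ranking; lowering it penalizes redundancy, which is why the second selected sentence tends to cover different content than the first.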

Trans-Sum
Compared with Sum-Trans, Trans-Sum attracts more research attention, and this type of pipeline method can be further classified into three subtypes depending on whether its summarizer is extractive, compressive or abstractive:
• The extractive method selects complete sentences from the translated documents as summaries.
• The compressive method first extracts key sentences from the translated documents, and then removes irrelevant or redundant words in the key sentences to obtain the final summaries.
• The abstractive method generates new sentences as summaries, which are not limited to the original words or phrases.
Note that we do not classify the Sum-Trans approaches in the same manner since their summarizers are all extractive.

Extractive Trans-Sum. Leuski et al. (2003) build a cross-lingual information delivery system which first translates Hindi documents into English via a statistical MT model and then selects important English sentences to form summaries. In this system, the summarizer only uses document information from the target-language side, which makes it heavily dependent on the MT results and might lead to flawed summaries; ideally, semantic information from both sides should be taken into account.
To this end, after translating English documents into Chinese, Wan (2011) designs two graph-based summarizers (i.e., SimFusion and CoRank) which utilize bilingual information to produce the final Chinese summaries: (i) the SimFusion summarizer first measures the saliency scores of Chinese sentences by combining English-side and Chinese-side similarity, and the salient Chinese sentences then constitute the final summaries; (ii) the CoRank summarizer simultaneously ranks English and Chinese sentences by incorporating the mutual influences between them, and the top-ranking Chinese sentences are then used to constitute summaries.
Later, Boudin et al. (2011) translate documents from English into French, and then use an SVM regression method to predict the translation quality of each sentence based on bilingual features. Next, the crucial translated sentences are selected by a modified PageRank algorithm (Page et al., 1999) that takes translation quality into account. Lastly, redundant sentences are removed from the selected sentences to form the final summaries.
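A quality-aware graph ranker of this kind can be sketched as a PageRank-style iteration whose teleport term is biased toward sentences with high estimated translation quality. This is a simplified reading of the idea, not the exact formulation of Boudin et al. (2011):

```python
def quality_weighted_rank(sim, quality, d=0.85, iters=50):
    """PageRank-style sentence ranking where the random-jump term is
    biased toward sentences with high estimated translation quality.

    `sim[i][j]` is a sentence-similarity matrix, `quality[i]` lies in
    [0, 1]. A simplified sketch of the general idea, not the exact
    formulation of any one paper."""
    n = len(quality)
    q_total = sum(quality) or 1.0
    scores = [1.0 / n] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            rank_in = 0.0
            for j in range(n):
                if j == i:
                    continue
                # Each sentence j distributes its score along its
                # outgoing similarity edges, normalized by out-weight.
                out = sum(sim[j][k] for k in range(n) if k != j)
                if out > 0:
                    rank_in += sim[j][i] / out * scores[j]
            new.append((1 - d) * quality[i] / q_total + d * rank_in)
        scores = new
    return scores
```

With equal similarities, the ranking is driven entirely by the quality-biased teleport term, so poorly translated sentences sink even when they are central in the graph.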
Compressive Trans-Sum. Inspired by phrase-based MT, Yao et al. (2015) propose a compressive summarization method that simultaneously selects and compresses sentences. Specifically, sentence selection is based on bilingual features, and sentence compression is performed by removing redundant or poorly translated phrases within a single sentence. To further exploit the complementary information of similar sentences, Zhang et al. (2016) first parse bilingual documents into predicate-argument structures (PAS), and then produce summaries by fusing bilingual PAS structures. In this way, several salient PAS elements (concepts or facts) from different sentences can be merged into one summary sentence. Similarly, Linhares Pontes et al. (2018) take bilingual lexical chunks into account when measuring sentence similarity and further compress sentences at both the single- and multi-sentence levels.
Abstractive Trans-Sum. With the emergence of large-scale synthetic XLS datasets (Zhu et al., 2019), researchers attempt to adopt sequence-to-sequence models as abstractive summarizers. For example, Ouyang et al. (2019) train an abstractive summarizer (i.e., PGNet, See et al. 2017) on English pairs of a noisy document and a clean summary. In this manner, the summarizer achieves good robustness when summarizing English documents that have been translated from a low-resource language.

Sum-Trans vs. Trans-Sum
We compare Sum-Trans and Trans-Sum in the following situations:
• When using extractive or compressive summarizers, the summarizers of the Trans-Sum methods can benefit from bilingual documents, while their Sum-Trans counterparts can only utilize the source-language documents. Thus, the Trans-Sum methods typically achieve better performance than the Sum-Trans counterparts. For instance, on the manually translated DUC 2001 dataset, PBCS (Yao et al., 2015), a Trans-Sum method, outperforms its Sum-Trans baseline by 8%/8.4%/10.4% in terms of ROUGE-1/2/L. On the other hand, the Trans-Sum methods are less efficient, since they need to translate whole documents rather than just the summaries.
• When adopting abstractive summarizers, a large-scale MS dataset is required to train the summarizers. It is also worth noting that MS datasets in low-resource languages are much smaller than their MT counterparts (Tiedemann and Thottingal, 2020; Hasan et al., 2021b). Thus, the Trans-Sum methods are helpful if the source language is low-resource. In contrast, if the target language is low-resource in MS, the Sum-Trans methods are more useful (Ouyang et al., 2019; Ladhak et al., 2020).

End-to-End Methods
Though the pipeline method is intuitive, it (1) suffers from error propagation; (2) needs either a large corpus to train MT models or the monetary cost of paid MT services; and (3) incurs additional latency during inference. Thanks to the rapid development of neural networks, many end-to-end XLS models have been proposed to alleviate these issues.
In this section, we take stock of previous end-to-end XLS models and further divide them into four frameworks (cf. Figure 1): the multi-task framework (§ 5.1), the knowledge-distillation framework (§ 5.2), the resource-enhanced framework (§ 5.3) and the pre-training framework (§ 5.4). For each framework, we introduce its core idea and the corresponding models. Finally, we discuss the pros and cons of each framework (§ 5.5).

Multi-Task Framework
It is challenging for an end-to-end model to conduct XLS directly, since the task requires both the ability to translate and the ability to summarize (Cao et al., 2020). As shown in Figure 1(a), many researchers use related tasks (e.g., MT and MS) together with XLS to train unified models. In this way, XLS models can also benefit from the related tasks. Zhu et al. (2019) utilize a shared transformer encoder to encode the input sequences of both XLS and MT/MS. Then, two independent transformer decoders are used to conduct XLS and MT/MS, respectively. This was the first work to show that an end-to-end method can outperform pipeline ones. Later, Cao et al. (2020) use two encoder-decoder models to perform MS in the source and target languages, respectively. Meanwhile, the source encoder and the target decoder jointly conduct XLS. Two linear mappers are then used to convert the context representation (i.e., the output of the encoders) from the source to the target language and vice versa. In addition, two discriminators are adopted to distinguish between the encoded and mapped representations. Thereby, the overall model jointly learns to summarize documents and to align representations between the two languages.
Although the above efforts design unified models within the multi-task framework, their decoders are independent across tasks, limiting their ability to capture the relationships among the multiple tasks. To solve this problem, Takase and Okazaki (2020) train a single encoder-decoder model on MS, MT and XLS datasets jointly. They prepend a special token to the input sequences to indicate which task is being performed. In addition, Bai et al. (2021a) make MS a prerequisite for XLS and propose MCLAS, an XLS model with a single encoder-decoder architecture. For the given documents, MCLAS generates the sequential concatenation of the corresponding monolingual and cross-lingual summaries. In this way, translation alignment is implicit in the generation process, enabling MCLAS to achieve strong XLS performance. More recently, Liang et al. (2022) utilize a conditional variational auto-encoder (CVAE) (Sohn et al., 2015) to capture the hierarchical relationship among MT, MS and XLS. Specifically, three latent variables are adopted in the proposed model to reconstruct the results of MT, MS and XLS, respectively. Besides, the encoder and decoder are shared among all tasks, while the prior and recognition networks are kept independent to distinguish the different tasks. Considering the limited XLS data in low-resource languages, Bai et al. (2021a) and Liang et al. (2022) also investigate XLS in the few-shot setting.
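The two single-model formats above can be sketched as simple data-formatting functions. The special tokens `<MS>`, `<MT>`, `<XLS>` and `<LSEP>` are illustrative placeholders, not the exact tokens used in the cited papers:

```python
def build_multitask_example(task: str, source: str, target: str) -> dict:
    """Single-model multi-task formatting in the spirit of Takase and
    Okazaki (2020): a task token prepended to the input tells one
    encoder-decoder which task to perform. Token names are placeholders."""
    assert task in {"MS", "MT", "XLS"}
    return {"input": f"<{task}> {source}", "output": target}

def build_mclas_target(mono_summary: str, cross_summary: str,
                       sep: str = "<LSEP>") -> str:
    """MCLAS-style target (Bai et al., 2021a): the decoder generates the
    monolingual summary first, then the cross-lingual one, so that
    translation alignment is learned implicitly during generation.
    The separator token is an illustrative placeholder."""
    return f"{mono_summary} {sep} {cross_summary}"
```

At inference time, an MCLAS-style model's output is split on the separator and only the second half (the cross-lingual summary) is kept.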

Knowledge-Distillation Framework
The original idea of knowledge distillation is to distill the knowledge of an ensemble of models (i.e., teacher models) into a single model (i.e., the student model) (Hinton et al., 2015). Due to the close relationship between MT/MS and XLS, some researchers use MS or MT models, or both, to teach the XLS model within the knowledge-distillation framework. In this way, besides the XLS labels, the student model can also learn from the output or hidden states of the teacher models. Ayana et al. (2018) utilize large-scale MS and MT corpora to train MS and MT models, respectively. Then, they use the trained MS or MT model, or both, as teacher models to teach the XLS student model. Both the teacher and student models are bi-directional GRU models (Cho et al., 2014). To let the student model mimic the output of the teacher model, the KL-divergence between the generation probabilities of the two models is used as the training objective. Later, Duan et al. (2019) implement the transformer (Vaswani et al., 2017) as the backbone of the MS teacher model and the XLS student model, and further train the student model with two objectives: (1) the cross-entropy between the generation distributions of the two models; (2) the Euclidean distance between the attention weights of both models. It is worth noting that both Ayana et al. (2018) and Duan et al. (2019) focus on zero-shot XLS due to the scarcity of XLS datasets at that time, so their training objectives do not include XLS itself.
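The mimicking objective can be sketched as a per-step KL divergence between the teacher's and the student's output distributions, averaged over decoding steps. This is a simplified scalar version; the actual models compute it over full vocabulary distributions inside the training loop, and Duan et al. (2019) additionally match attention weights:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions over the vocabulary.
    `eps` guards against log(0) for zero-probability entries."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))

def distillation_loss(teacher_probs, student_probs):
    """Sketch of the distillation objective: the student's per-step
    output distribution is pulled toward the teacher's via KL
    divergence, averaged over decoding steps."""
    steps = list(zip(teacher_probs, student_probs))
    return sum(kl_divergence(t, s) for t, s in steps) / len(steps)
```

When the student matches the teacher exactly, the loss vanishes; any mismatch contributes a positive penalty, so gradient descent pushes the student's generation distribution toward the teacher's.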
After the emergence of large-scale XLS datasets, Nguyen and Luu (2022) confirm that the knowledge-distillation framework can also be adopted in rich-resource scenarios. Specifically, they employ transformer student and teacher models, and further propose a variant of the Sinkhorn divergence, which together with the XLS objective supervises the student XLS model.

Resource-Enhanced Framework
As shown in Figure 1(c), the resource-enhanced framework utilizes additional resources to enrich the information of the input documents, and the generation probability of the output summaries is conditioned on both the encoded and the enriched information. Zhu et al. (2020) explore the translation pattern in XLS. In detail, they first encode the input documents in the source language via a transformer encoder, and then obtain the translation distribution for the words of the input documents with the fast-align toolkit (Dyer et al., 2013). Lastly, a transformer decoder generates summaries in the target language based on both its own output distribution and the translation distributions. In this way, the extra bilingual alignment information helps the XLS model better learn the transformation from the source to the target language. Jiang et al. (2022) first utilize TextRank (Mihalcea and Tarau, 2004) to extract key clues from the input sequences, and then construct article graphs based on these clues via a designed algorithm. Next, they encode the clues and the article graphs with a clue encoder (of transformer encoder architecture) and a graph encoder (based on graph neural networks), respectively. Finally, a transformer decoder with two types of cross-attention (performed on the outputs of the clue and graph encoders) is adopted to generate the final summaries. In addition, they incorporate the translation distribution used in Zhu et al. (2020) to further strengthen the proposed model.
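The core decoding idea of this framework, combining the decoder's own vocabulary distribution with a translation distribution over aligned source words, can be sketched as a gated interpolation. The gate here is a toy scalar, whereas the actual models predict it per decoding step from the decoder state:

```python
def mix_distributions(p_vocab, p_translate, gate):
    """Sketch of the resource-enhanced decoding idea: the final
    next-token distribution mixes the decoder's own distribution with a
    translation distribution derived from bilingual alignments, weighted
    by a gate in [0, 1]. Toy values here; real models predict the gate
    per decoding step."""
    assert 0.0 <= gate <= 1.0
    # A convex combination of two probability distributions is itself
    # a probability distribution (it still sums to 1).
    return [gate * pt + (1 - gate) * pv
            for pv, pt in zip(p_vocab, p_translate)]
```

With the gate at 0 the model ignores the translation distribution entirely; larger gate values let the bilingual alignment information steer generation toward faithful translations of salient source words.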

Pre-Training Framework
The emergence of pre-trained models has brought NLP into a new era (Qiu et al., 2020). Pre-trained models typically first learn general representations from large-scale corpora, and then adapt to a specific task through fine-tuning.
More recently, general multi-lingual pre-trained generative models have shown impressive performance on many multi-lingual NLP tasks. For example, mBART (Liu et al., 2020) is a multi-lingual pre-trained model derived from BART (Lewis et al., 2020). mBART is pre-trained with BART-style denoising objectives on a huge volume of unlabeled multi-lingual data. mBART originally showed its superiority in MT (Liu et al., 2020), and Liang et al. (2022) find that it can also outperform many multi-task XLS models on large-scale XLS datasets through simple fine-tuning. Later, mBART-50 (Tang et al., 2021) goes a step further and extends the language coverage of mBART from 25 languages to 50. In addition to BART-style pre-trained models, mT5 (Xue et al., 2021) is a multi-lingual T5 (Raffel et al., 2020) model, pre-trained on 101 languages with the T5-style span-corruption objective. Although great performance has been achieved, these general pre-trained models only utilize denoising or span-corruption objectives in multiple languages without any cross-lingual supervision, leaving their cross-lingual abilities under-explored.
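To give a concrete (simplified) picture of the BART-style denoising objective behind mBART, the sketch below applies text infilling in pure Python. Span positions are passed explicitly for determinism, whereas real pre-training samples them (span lengths from a Poisson(λ=3) distribution in BART/mBART) and also applies further noising such as sentence permutation:

```python
def text_infilling(tokens, spans, mask_token="<mask>"):
    """BART-style text infilling: each (start, length) span is
    replaced by a single mask token; the model is trained to
    reconstruct the original, uncorrupted sequence."""
    out, i = [], 0
    for start, length in sorted(spans):
        out.extend(tokens[i:start])  # copy tokens before the span
        out.append(mask_token)       # the whole span becomes one mask
        i = start + length           # skip over the masked span
    out.extend(tokens[i:])
    return out
```

For example, masking the span starting at token 1 with length 2 in a five-token sentence yields a four-token corrupted input, so the model must also infer how many tokens each mask hides.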
To solve this problem, Xu et al. (2020) propose a mixed-lingual XLS model pre-trained with masked language model (MLM), denoising auto-encoder (DAE), MS, translation span corruption (TSC) and MT tasks. The TSC and MT pre-training samples are derived from the OPUS English↔Chinese parallel corpus. Dou et al. (2020) utilize XLS, MT and MS tasks to pre-train another XLS model. They leverage the English↔German/Chinese MT samples from the WMT2014/WMT2017 datasets. For XLS, they pre-train the model on the En2ZhSum and English-German datasets (Dou et al., 2020). Wang et al. (2022b) focus on dialogue-oriented XLS and extend mBART-50 with MS, MT and two dialogue-oriented pre-training objectives (i.e., action infilling and utterance permutation) via a second pre-training stage on the MediaSum and XMediaSum datasets. Note that Xu et al. (2020), Dou et al. (2020) and Wang et al. (2022b) focus only on the XLS task, and the languages supported by these models are limited to a few specific ones.
Moreover, there are also some general cross-lingual pre-trained models that have not been evaluated on XLS, e.g., XNLG (Chi et al., 2020) and VECO (Luo et al., 2021).
Table 3 shows the details of the above cross-lingual pre-training tasks. TSC and TPSC predict the masked spans from a translation pair. The input sequence of TSC is masked in only one language, while the counterpart of TPSC is masked in both languages.
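Based on that description, the two corruption schemes can be sketched as follows (pure Python; the Chinese translation and the span positions are illustrative assumptions, and the unique mask tokens follow the [M1]/[M2] notation of Table 3):

```python
def corrupt(tokens, spans, mask_ids):
    """Replace each (start, length) span with a unique mask token."""
    out, i = [], 0
    for (start, length), m in zip(sorted(spans), mask_ids):
        out.extend(tokens[i:start])
        out.append(f"[M{m}]")
        i = start + length
    out.extend(tokens[i:])
    return out

en = "everything that kills me makes me feel alive".split()
zh = list("每一个将我毁灭的东西都让我感觉活着")  # illustrative translation

# TSC: only one language of the pair is corrupted; the model
# predicts the masked English spans from the bilingual context.
tsc_input = corrupt(en, [(2, 2), (6, 2)], [1, 2]) + zh

# TPSC: both sides of the translation pair are corrupted.
tpsc_input = corrupt(en, [(2, 2)], [1]) + corrupt(zh, [(0, 3)], [2])
```

The difference is purely in which side carries masks: TSC lets the intact translation serve as cross-lingual evidence, while TPSC forces the model to fill gaps in both languages at once.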

Discussion
Table 4 summarizes all end-to-end XLS models. We conclude that all four frameworks resort to external resources to improve XLS performance: (1) The multi-task framework uses large-scale MS and MT corpora to help XLS. Though multi-task learning is intuitive, its training strategy and the weights of the different tasks are non-trivial to determine. (2) The knowledge-distillation framework is another way to utilize large-scale MS and MT corpora. This framework is most suitable for zero-shot XLS since it can be supervised by the MS and MT teacher models without any XLS labels. Nevertheless, knowledge distillation often fails to live up to its name, transferring very limited knowledge from teacher to student (Stanton et al., 2021); thus, it should be verified more thoroughly in rich-resource XLS. (3) The resource-enhanced framework employs off-the-shelf toolkits to enhance the representation of the input documents. This framework significantly relaxes the dependence on external data, but it suffers from error propagation. (4) The pre-training framework can benefit from both unlabeled and labeled corpora. In detail, pre-trained models learn general language knowledge from large-scale unlabeled data with self-supervised objectives. To improve their cross-lingual ability, they can resort to MT parallel corpora and design supervised signals. This framework absorbs more knowledge from more external corpora than the others, leading to promising performance on XLS.
To give a deeper comparison of end-to-end XLS models, as shown in Table 5, we organize a leaderboard with unified evaluation metrics, based on the released code and generated results from representative published literature.

Table 5: The leaderboard of end-to-end XLS models on the En2ZhSum and Zh2EnSum datasets (Zhu et al., 2019) in terms of ROUGE(R)-1/2/L (Lin, 2004). The evaluation scripts follow Zhu et al. (2020). ♥: multi-task framework; ♣: resource-enhanced framework; ♠: pre-training framework. † indicates results obtained by evaluating output files provided by the authors; ‡ denotes results from running their released code; * indicates results reported in the original papers, which adopt the same evaluation scripts as Zhu et al. (2020).

The models in the pre-training framework (Liu et al., 2020; Dou et al., 2020; Xu et al., 2020) generally outperform the others. Besides, the pre-training framework can also serve the other frameworks. For example, Liang et al. (2022) use mBART weights to initialize VHM (i.e., mVHM), bringing decent gains over vanilla VHM. Therefore, combining the advantages of different frameworks is possible and valuable, and worth exploring in the future.
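For reference, the ROUGE-1 F1 underlying the leaderboard can be sketched in a few lines of pure Python (a simplification: the official script and the evaluation setup of Zhu et al. (2020) additionally handle tokenization details, stemming and multiple references):

```python
from collections import Counter

def rouge_1_f(candidate, reference):
    """ROUGE-1 F1: unigram-overlap F-measure between a candidate
    summary and a single reference, both given as token lists."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

ROUGE-2 replaces unigrams with bigrams, and ROUGE-L scores the longest common subsequence instead of n-gram overlap.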

Prospects
In this section, we discuss and suggest the following promising future directions, which meet actual application needs:

The Essence of XLS. Unifying two abilities (i.e., translation and summarization) in a single model is non-trivial (Cao et al., 2020). Even though the effectiveness of state-of-the-art models has been proved, the essence of XLS remains unclear, especially (1) the hierarchical relationship between MT&MS and XLS (Liang et al., 2022), and (2) a theoretical analysis of what makes MT&MS help XLS.

XLS Datasets with Low-Resource Languages. There are thousands of languages in the world, and most of them are low-resource. Despite the practical significance, building high-quality and large-scale XLS datasets whose source or target language is low-resource remains challenging (cf. Section 3.3) and needs to be further explored in the future.
Unified XLS across Genres and Domains. As described in Section 3, existing XLS datasets cover multiple genres or domains, i.e., news, dialogue, guides and encyclopedia. The diversity across them naturally calls for unified XLS, rather than the current trend of devising unique models for individual genres or domains. At present, unified XLS is still under-explored, which we believe makes it an urgent need.
Controllable XLS. Bai et al. (2021b) integrate a compression rate to control how much information should be kept in the target language. If the compression rate is 100%, XLS degrades to MT. Thus, this continuous variable unifies the XLS and MT tasks, introducing a new research view that leverages MT to help XLS. In addition, controlling other attributes of the target summary may be useful in real applications, such as entity-centric XLS and aspect-based XLS.
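One common way to implement such rate control is to prefix the source with a bucketed control token; the sketch below is purely illustrative (the token vocabulary and bucketing are our assumptions, not necessarily the scheme of Bai et al. (2021b)):

```python
def with_rate_token(src_tokens, rate, step=10):
    """Prefix the source sequence with a control token encoding the
    desired compression rate, bucketed to multiples of `step` percent.
    A rate of 1.0 (token <rate_100>) asks the model to keep all the
    information, i.e., XLS degrades to MT."""
    percent = max(step, min(100, round(rate * 100 / step) * step))
    return [f"<rate_{percent}>"] + src_tokens
```

The control token is added to the model's vocabulary, and at training time the bucket is computed from the gold summary length, so the model learns to associate each token with an output length.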
Low-Resource XLS. Most languages in the world are low-resource, which makes large-scale parallel datasets across these languages rare and expensive. Hence, low-resource XLS is more realistic. Nevertheless, current work has not well investigated this situation. Recently, prompt-based learning has become a new paradigm in NLP (Liu et al., 2021). With the help of well-designed prompting functions, a pre-trained model is able to perform few-shot or even zero-shot learning. Future work can adopt prompt-based learning to deal with low-resource XLS.
Triangular XLS. Following triangular MT, triangular XLS is a special case of low-resource XLS where the language pair of interest has limited parallel data, but both languages have abundant parallel data with a pivot language. This situation typically appears in multi-lingual website datasets (a category of XLS datasets, cf. § 3.2), because their documents are usually centered in English and then translated into other languages to facilitate global users; hence, English acts as the pivot language. How to exploit such abundant parallel data to improve XLS for the language pairs of interest remains challenging.
Many-to-Many XLS. Most previous work trains XLS models separately in each cross-lingual direction. In this way, the knowledge of XLS cannot be transferred among different directions. Besides, a trained model can only perform in a single direction, resulting in limited usage. To solve this problem, Hasan et al. (2021a) jointly fine-tune mT5 in multiple directions. The fine-tuned model can then perform in arbitrary, even unseen, directions, which is named many-to-many XLS. Future work can focus on designing robust and effective training strategies for many-to-many XLS.
Long-Document XLS. Recently, long-document MS has attracted wide research attention (Cohan et al., 2018; Sharma et al., 2019; Wang et al., 2021, 2022a). Long-document XLS is also important in real-world scenarios, e.g., helping researchers access the arguments of scientific papers in foreign languages. Nevertheless, this direction has not been noticed by previous work. Interestingly, we find that many non-English scientific papers have corresponding English abstracts due to the regulations of publishers. For example, many Chinese academic journals require researchers to write abstracts in both Chinese and English. This might be a feasible way to construct long-document XLS datasets. We hope future work can promote this direction.
Multi-Document XLS. Previous multi-document XLS work (Orăsan and Chiorean, 2008; Boudin et al., 2011; Zhang et al., 2016) only utilizes statistical features to build pipeline systems, and is evaluated on early XLS datasets. Multi-document XLS is also worth revisiting in the era of pre-trained models.
Multi-Modal XLS. With the increase of multimedia data on the internet, some researchers have put effort into multi-modal summarization (Zhu et al., 2018; Sanabria et al., 2018; Li et al., 2018, 2020; Fu et al., 2021), where the input of a summarization system is a document together with images or videos. Nevertheless, existing multi-modal summarization work only focuses on the monolingual scenario and ignores cross-lingual ones. We hope future work will light up multi-modal XLS.
Evaluation Metrics. Developing evaluation metrics for XLS is still an open problem that needs to be further studied. Current XLS metrics typically inherit from MS. However, different from MS, XLS samples consist of a source document, (optionally) a target document, a source summary and a target summary. Besides the target summary, how to exploit the other information to assess summary quality would be an interesting point for further study.
Others.Considering current XLS research is still in the preliminary stage, many research points of MS are missing in XLS, such as the factual inconsistency and hallucination problems.These directions are also worthy to be deeply explored in further work.

Conclusion
In this paper, we present the first comprehensive survey of current research efforts on XLS. We systematically summarize existing XLS datasets and methods, highlight their characteristics and compare them with each other to provide deeper analyses. In addition, we suggest multiple promising directions to facilitate further research on XLS. We hope that this survey provides a clear picture of the topic and boosts the development of current XLS technologies.

Figure 1 :
Figure 1: Overview of the four end-to-end frameworks (best viewed in color). XLS: cross-lingual summarization; MT: machine translation; MS: monolingual summarization. Dashed arrows indicate supervised signals. Rimless colored blocks denote the input or output sequences of the corresponding tasks. Note that the knowledge-distillation framework might contain more than one teacher model, and the auxiliary/pre-training tasks used in the multi-task/pre-training framework are not limited to MT and MS; we omit these for simplicity.
Table 2 lists the key characteristics of the representative multi-lingual website datasets. It is worth noting that the number of XLS

Table 2 :
Overview of representative multi-lingual website datasets. "L" denotes the number of languages involved in each dataset. "D" indicates the number of cross-lingual directions. "Scale (avg/max/min)" gives the average/maximum/minimum number of XLS samples per direction.

Table 3 :
Examples of inputs and targets used by different cross-lingual pre-training tasks for the sentence "Everything that kills me makes me feel alive" with its Chinese translation and summarization. The randomly selected spans are replaced with unique mask tokens (i.e., [M1], [M2] and [M3]) in TSC and TPSC.

Table 4 :
Overview of end-to-end XLS models. "Transformer" means the vanilla transformer encoder-decoder architecture; * denotes a variant architecture. "REC" represents the reconstruction objective, which is used to supervise the linear mappers in the model proposed by Cao et al. (2020). "KD" denotes the knowledge-distillation objectives, derived from the output or hidden states of the corresponding teacher models, such as MS and MT models. The "Training Objective" of pre-trained models lists the pre-training objectives. Language nomenclature used in "Evaluation Direction" follows ISO 639-1 codes. † indicates the number of samples in the dataset is less than 2,000. ‡ denotes unreleased datasets.