Abstract
Text style transfer is an important task in natural language generation, which aims to control certain attributes in the generated text, such as politeness, emotion, humor, and many others. It has a long history in the field of natural language processing, and has recently regained significant attention thanks to the promising performance brought by deep neural models. In this article, we present a systematic survey of the research on neural text style transfer, spanning over 100 representative articles since the first neural text style transfer work in 2017. We discuss the task formulation, existing datasets and subtasks, evaluation, as well as the rich methodologies in the presence of parallel and non-parallel data. We also provide discussions on a variety of important topics regarding the future development of this task.
1. Introduction
Language is situational. Every utterance fits in a specific time, place, and scenario, conveys specific characteristics of the speaker, and typically has a well-defined intent. For example, someone who is uncertain is more likely to use tag questions (e.g., “This is true, isn’t it?”) than declarative sentences (e.g., “This is definitely true.”). Similarly, a professional setting is more likely to include formal statements (e.g., “Please consider taking a seat.”) as compared to an informal situation (e.g., “Come and sit!”). For artificial intelligence systems to accurately understand and generate language, it is necessary to model language with style/attribute, which goes beyond merely verbalizing the semantics in a non-stylized way. The values of the attributes can be drawn from a wide range of choices depending on pragmatics, such as the extent of formality, politeness, simplicity, personality, emotion, partner effect (e.g., reader awareness), genre of writing (e.g., fiction or non-fiction), and so on.
The goal of TST is to automatically control the style attributes of text while preserving the content. TST has a wide range of applications, as outlined by McDonald and Pustejovsky (1985) and Hovy (1987), and the style of language matters because it makes natural language processing more user-centered. One immediate application is intelligent bots, for which users prefer a distinct and consistent persona (e.g., empathetic) over an emotionless or inconsistent one. Another application is the development of intelligent writing assistants: non-expert writers often need to polish their writing to better fit their purpose, for example, to make it more professional, polite, objective, or humorous, or to meet other advanced writing requirements, which may take years of experience to master. Other applications include automatic text simplification (where the target style is “simple”), debiasing online text (where the target style is “objective”), fighting against offensive language (where the target style is “non-offensive”), and so on.
To formally define TST, let us denote the target utterance as x′ and the target discourse style attribute as a′. TST aims to model p(x′|a′, x), where x is a given text carrying a source attribute value a. Consider the previous example of text expressed with two different extents of formality:
Source sentence x: “Come and sit!” Source attribute a: Informal
Target sentence x′: “Please consider taking a seat.” Target attribute a′: Formal
In this case, a TST model should be able to modify the formality and generate the formal sentence x′ = “Please consider taking a seat.” given the informal input x = “Come and sit!”. Note the key difference between TST and another NLP task, style-conditioned language modeling: the latter is conditioned only on a style token, whereas TST takes as input both the target style attribute a′ and a source sentence x that constrains the content.
Crucial to the definition of style transfer is the distinction between “style” and “content,” for which there are two common practices. The first follows the linguistic definition, classifying non-functional linguistic features (e.g., formality) as style and the semantics as content. The second practice is data-driven: given two corpora (e.g., a positive review set and a negative review set), the invariance between the two corpora is the content, whereas the variance is the style (e.g., sentiment, topic) (Mou and Vechtomova 2020).
Driven by the growing need for TST, active research has emerged in this field, from traditional linguistic approaches to the more recent neural network–based approaches. Traditional approaches rely on term replacement and templates. For example, early work in NLG for weather forecasts built domain-specific templates to express different types of weather with different levels of uncertainty for different users (Sripada et al. 2004; Reiter et al. 2005; Belz 2008; Gkatzia, Lemon, and Rieser 2017). Research that more explicitly focuses on TST started with frame language-based systems (McDonald and Pustejovsky 1985) and schema-based NLG systems (Hovy 1987, 1990), which generate text with pragmatic constraints such as formality under small-scale, well-defined schemata. Most of this earlier work required domain-specific templates, hand-crafted phrase sets that express a certain attribute (e.g., friendly), and sometimes a look-up table of expressions with the same meaning but multiple different attributes (Bateman and Paris 1989; Stamatatos et al. 1997; Power, Scott, and Bouayad-Agha 2003; Reiter, Robertson, and Osman 2003; Sheikha and Inkpen 2011; Mairesse and Walker 2011).
With the success of deep learning in the last decade, a variety of neural methods have recently been proposed for TST. If parallel data are provided, standard sequence-to-sequence models are often directly applied (Rao and Tetreault 2018) (see Section 4). However, most use cases do not have parallel data, so TST on non-parallel corpora has become a prolific research area (see Section 5). The first line of approaches disentangles text into its content and attribute in the latent space and applies generative modeling (Hu et al. 2017; Shen et al. 2017). This trend was then joined by another distinctive line of approaches, prototype editing (Li et al. 2018), which extracts a sentence template and its attribute markers to generate the text. A third paradigm soon followed, namely, pseudo-parallel corpus construction, which trains the model in a supervised way on pseudo-parallel data (Zhang et al. 2018d; Jin et al. 2019). These three directions, (1) disentanglement, (2) prototype editing, and (3) pseudo-parallel corpus construction, are further advanced with the emergence of Transformer-based models (Sudhakar, Upadhyay, and Maheswaran 2019; Malmi, Severyn, and Rothe 2020).
Given these advances in TST methodologies, the field is now starting to expand its impact to downstream applications, such as persona-based dialog generation (Niu and Bansal 2018; Huang et al. 2018), stylistic summarization (Jin et al. 2020a), stylized language modeling to imitate specific authors (Syed et al. 2020), online text debiasing (Pryzant et al. 2020; Ma et al. 2020), simile generation (Chakrabarty, Muresan, and Peng 2020), and many others.
Motivation of a Survey on TST.
The increasing interest in modeling the style of text reflects the fact that NLP researchers are starting to focus more on user-centeredness and personalization. However, despite the growing interest in TST, the existing literature shows large diversity in the selection of benchmark datasets, methodological frameworks, and evaluation metrics. The aim of this survey is therefore to provide summaries and potential standardizations of some important aspects of TST, such as terminology, problem definition, benchmark datasets, and evaluation metrics. We also aim to provide different perspectives on TST methodology and suggest cross-cutting research questions for our proposed research agenda for the field. As shown in Table 1, the key contributions targeted by this survey are as follows:
We conduct the first comprehensive review that covers most existing works (more than 100 papers) on deep learning-based TST.
We provide an overview of the task setting, terminology definition, benchmark datasets (Section 2), and evaluation metrics, for which we propose standard practices that can be helpful for future work (Section 3).
We categorize the existing approaches on parallel data (Section 4) and non-parallel data (Section 5) for which we distill some unified methodological frameworks.
We discuss a potential research agenda for TST (Section 6), including expanding the scope of styles, improving the methodology, loosening dataset assumptions, and improving evaluation metrics.
We provide a vision for how to broaden the impact of TST (Section 7), including connecting to more NLP tasks, and more specialized downstream applications, as well as considering some important ethical impacts.
[Table 1: Overview of the key aspects covered by this survey, organized into four parts: Motivation, Data (e.g., tasks), Method (e.g., on parallel data), and Extended Applications (e.g., helping other NLP tasks).]
Paper Selection.
The neural TST papers reviewed in this survey are mainly from top conferences in NLP and artificial intelligence (AI), including ACL, EMNLP, NAACL, COLING, CoNLL, NeurIPS, ICML, ICLR, AAAI, and IJCAI. Other than conference papers, we also include some non-peer-reviewed preprint papers that can offer some insightful information about the field. The major factors for selecting non-peer-reviewed preprint papers include novelty and completeness, among others.
2. What Is Text Style Transfer?
This section provides an overview of the style transfer task. Section 2.1 goes through the definition of style and the scope of this survey. Section 2.2 gives a task formulation and introduces the notation used across the survey. Finally, Section 2.3 lists the common subtasks for neural TST, which can save literature review effort for future researchers.
2.1 How to Define Style?
Linguistic Definition of Style.
An intuitive notion of style refers to the manner in which the semantics is expressed (McDonald and Pustejovsky 1985). Just as everyone has their own signature, style originates as the characteristics inherent to every person’s utterance, which can be expressed through the use of certain stylistic devices such as metaphors, as well as choice of words, syntactic structures, and so on. Style can also go beyond the sentence level to the discourse level, such as the stylistic structure of an entire piece of work, for example, stream of consciousness or flashbacks.
Beyond the intrinsic personal styles, for pragmatic uses, style further becomes a protocol to regularize the manner of communication. For example, for academic writing, the protocol requires formality and professionalism. Hovy (1987) defines style by its pragmatic aspects, including both personal (e.g., personality, gender) and interpersonal (e.g., humor, romance) aspects. Most existing literature also takes these well-defined categories of styles.
Data-Driven Definition of Style as the Scope of this Survey.
This survey aims to provide an overview of existing neural TST approaches. To be concise, we limit the scope to the most common settings in the existing literature. Specifically, most deep learning work on TST adopts a data-driven definition of style, and the scope of this survey covers the styles in currently available TST datasets. The data-driven definition of style differs from the linguistic or rule-based definition, which theoretically constrains what constitutes a style and what does not, such as a style guide (e.g., American Psychological Association 2020) that requires that formal text not include any contractions, e.g., “isn’t.” The distinction between the two definitions of style is shown in Figure 1.
With the rise of deep learning methods for TST, the data-driven definition of style extends the linguistic style to a broader concept—the general attributes in text. It regards “style” as the attributes that vary across datasets, as opposed to the characteristics that stay invariant (Mou and Vechtomova 2020). The reason is that deep learning models (which are the focus of this survey) need large corpora to learn the style from, but not all styles have well-matched large corpora. Therefore, apart from the very few manually annotated datasets with linguistic style definitions, such as formality (Rao and Tetreault 2018) and humor & romance (Gan et al. 2017), many recent dataset collection works automatically look for meta-information to link a corpus to a certain attribute. A typical example is the widely used Yelp review dataset (Shen et al. 2017), where reviews with low ratings are put into the negative corpus and reviews with high ratings into the positive corpus, although negative vs. positive opinion is not a style under the linguistic definition, but rather a content-related attribute.
Most methods mentioned in this survey can be applied to scenarios that follow this data-driven definition of style. As a double-edged sword, the prerequisite for most methods is that there exist style-specific corpora for each style of interest, either parallel or non-parallel. Note that future work may drop this assumption, as discussed in Section 6.3.
Comparison of the Two Definitions.
There are two phenomena arising from the data-driven definition of style as opposed to the linguistic style. One is that the data-driven definition can include a broader range of attributes, including content and topic preferences of the text. The other is that data-driven styles, if collected through automatic classification by meta-information such as ratings, user information, and source of text, can be more ambiguous than the linguistically defined styles. As shown in Jin et al. (2019, Section 4.1.1), some automatically collected datasets have a concerningly high undecidable rate and inter-annotator disagreement rate when annotators are asked to associate the dataset with human-defined styles such as political slant and gender-specific tones.
The advantage of the data-driven style is that it marries well with deep learning methods, because most neural models learn the concept of style by learning to distinguish the multiple style corpora. For the (non-data-driven) linguistic style, although it is under-explored in existing deep learning work on TST, we provide in Section 6.3 a discussion of how future work can learn TST of linguistic styles with no matched data.
2.2 Task Formulation
We define the main notations used in this survey in Table 2.
| Category | Notation | Meaning |
|---|---|---|
| Attribute | a | An attribute value, e.g., the formal style |
| | a′ | An attribute value different from a |
| | 𝔸 | A predefined set of attribute values |
| | ai | The i-th attribute value in 𝔸 |
| Sentence | x | A sentence with attribute value a |
| | x′ | A sentence with attribute value a′ |
| | Xi | A corpus of sentences with attribute value ai |
| | xi | A sentence from the corpus Xi |
| | x̂′ | Attribute-transferred sentence of x learned by the model |
| Model | E | Encoder of a TST model |
| | G | Generator of a TST model |
| | fc | Attribute classifier |
| | θE | Parameters of the encoder |
| | θG | Parameters of the generator |
| | θfc | Parameters of the attribute classifier |
| Embedding | z | Latent representation of the text, i.e., z = E(x) |
| | a | Latent representation of the attribute value in text |
As mentioned previously in Section 2.1, most neural approaches assume a given set of attribute values 𝔸, and each attribute value has its own corpus. For example, if the task is about formality transfer, then for the attribute of text formality, there are two attribute values, a = “formal” and a′ = “informal,” corresponding to a corpus X1 of formal sentences and another corpus X2 of informal sentences. The style corpora can be parallel or non-parallel. Parallel data means that each sentence with the attribute a is paired with a counterpart sentence with another attribute a′. In contrast, non-parallel data only assumes mono-style corpora.
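For concreteness, the two data settings can be sketched as follows. This is a minimal illustration in Python; the field names are ours, not from any particular library:

```python
# Parallel setting: each sentence is aligned with a rewrite in the other style.
parallel_data = [
    {"x": "come and sit!", "a": "informal",
     "x_prime": "Please consider taking a seat.", "a_prime": "formal"},
]

# Non-parallel setting: mono-style corpora with no sentence-level alignment.
X1 = ["Please consider taking a seat.", "We look forward to your reply."]  # formal
X2 = ["come and sit!", "hit me up when u can"]                             # informal
```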
2.3 Existing Subtasks with Datasets
We list the common subtasks and corresponding datasets for neural TST in Table 3. The attributes of interest vary from style features (e.g., formality and politeness) to content preferences (e.g., sentiment and topics). Each task is elaborated below.
| Task | Attribute Values | Dataset | Size | Pa? |
|---|---|---|---|---|
| **Style Features** | | | | |
| Formality | Informal ↔ Formal | GYAFC (Rao and Tetreault 2018) | 50K | ✓ |
| | | XFORMAL (Briakou et al. 2021b) | 1K | ✓ |
| Politeness | Impolite → Polite | Politeness (Madaan et al. 2020) | 1M | ✗ |
| Gender | Masculine ↔ Feminine | Yelp Gender (Prabhumoye et al. 2018) | 2.5M | ✗ |
| Humor & Romance | Factual ↔ Humorous ↔ Romantic | FlickrStyle (Gan et al. 2017) | 5K | ✓ |
| Biasedness | Biased → Neutral | Wiki Neutrality (Pryzant et al. 2020) | 181K | ✓ |
| Toxicity | Offensive → Non-offensive | Twitter (dos Santos, Melnyk, and Padhi 2018) | 58K | ✗ |
| | | Reddit (dos Santos, Melnyk, and Padhi 2018) | 224K | ✗ |
| | | Reddit Politics (Tran, Zhang, and Soleymani 2020) | 350K | ✗ |
| Authorship | Shakespearean ↔ Modern | Shakespeare (Xu et al. 2012) | 18K | ✓ |
| | Different Bible translators | Bible (Carlson, Riddell, and Rockmore 2018) | 28M | ✓ |
| Simplicity | Complicated → Simple | PWKP (Zhu, Bernhard, and Gurevych 2010) | 108K | ✓ |
| | | Expert (den Bercken, Sips, and Lofi 2019) | 2.2K | ✓ |
| | | MIMIC-III (Weng, Chung, and Szolovits 2019) | 59K | ✗ |
| | | MSD (Cao et al. 2020) | 114K | ✓ |
| Engagingness | Plain → Attractive | Math (Koncel-Kedziorski et al. 2016) | <1K | ✓ |
| | | TitleStylist (Jin et al. 2020a) | 146K | ✗ |
| **Content Preferences** | | | | |
| Sentiment | Positive ↔ Negative | Yelp (Shen et al. 2017) | 250K | ✗ |
| | | Amazon (He and McAuley 2016) | 277K | ✗ |
| Topic | Entertainment ↔ Politics | Yahoo! Answers (Huang et al. 2020) | 153K | ✗ |
| Politics | Democratic ↔ Republican | Political (Voigt et al. 2018) | 540K | ✗ |

(Pa? = whether parallel data are available.)
Formality.
Adjusting the extent of formality in text was first proposed by Hovy (1987). It is one of the most distinctive stylistic aspects, observable through many linguistic phenomena, such as using full names (e.g., “television”) instead of abbreviations (e.g., “TV”), and nouns (e.g., “solicitation”) instead of verbs (e.g., “request”). The formality dataset, Grammarly’s Yahoo Answers Formality Corpus (GYAFC) (Rao and Tetreault 2018), contains 50K formal-informal pairs, built by first collecting 50K informal sentences from the Yahoo Answers corpus and then recruiting crowdsource workers to rewrite them in a formal way. Briakou et al. (2021b) extend the formality dataset to a multilingual version with three more languages: Brazilian Portuguese, French, and Italian.
Politeness.
Politeness transfer (Madaan et al. 2020) aims to control the politeness in text. For example, “Could you please send me the data?” is a more polite expression than “send me the data!”. Madaan et al. (2020) compiled a dataset of 1.39 million automatically labeled instances from the raw Enron corpus (Shetty and Adibi 2004). As politeness is culture-dependent, this dataset mainly focuses on politeness in North American English.
Gender.
Linguistic phenomena related to gender are an active research area (Trudgill 1972; Lakoff 1973; Tannen 1990; Argamon et al. 2003; Boulis and Ostendorf 2005). The gender-related TST dataset was proposed by Prabhumoye et al. (2018), who compiled 2.5M reviews from the Yelp Dataset Challenge labeled with the gender of the user.
Humor & Romance.
Humor and romance are artistic attributes that can provide readers with joy. Li et al. (2018) first proposed borrowing the FlickrStyle stylized caption dataset (Gan et al. 2017) from the computer vision domain. In the FlickrStyle image caption dataset, each image has three captions, with a factual, a humorous, and a romantic style, respectively. By keeping only the captions of the three styles, Li et al. (2018) created a subset of the FlickrStyle dataset with 5K parallel (factual, humorous, romantic) triplets.
Biasedness.
Wiki Neutrality Corpus (Pryzant et al. 2020) is the first corpus of biased and neutralized sentence pairs. It is collected from Wikipedia revisions that adjusted the tone of existing sentences to a more neutral voice. The types of bias in the biased corpus include framing bias, epistemological bias, and demographic bias.
Toxicity.
Another important use of TST is to fight against offensive language. Tran, Zhang, and Soleymani (2020) collect 350K offensive sentences and 7M non-offensive sentences by crawling sentences from Reddit using a list of restricted words.
Authorship.
Changing the tone of the author is an artistic use of TST. Xu et al. (2012) created an aligned corpus of 18K pairs of Shakespearean English and their modern English translation. Carlson, Riddell, and Rockmore (2018) collected 28M parallel data from English versions of the Bible by different translators.
Simplicity.
Another important use of TST is to lower the language barrier for readers, such as translating legalese, medical jargon, or other professional text into simple English, to avoid discrepancies between expert wordings and lay understanding (Tan and Goonawardene 2017). Common tasks include converting standard English Wikipedia into Simple Wikipedia, whose dataset contains 108K samples (Zhu, Bernhard, and Gurevych 2010). Another task is to simplify medical descriptions into patient-friendly text, including a dataset with 2.2K samples (den Bercken, Sips, and Lofi 2019), another non-parallel dataset with 59K free-text discharge summaries compiled from MIMIC-III (Weng, Chung, and Szolovits 2019), and a more recent parallel dataset with 114K samples compiled from the health reference Merck Manuals (MSD), where discussions of each medical topic have one version for professionals and another for consumers (Cao et al. 2020).
Sentiment.
Sentiment modification is the most popular task in previous work on TST. It aims to change the sentiment polarity in reviews, for example, from a negative review to a positive review, or vice versa. There is also work on transferring sentiments on fine-grained review ratings (e.g., 1–5 scores). Commonly used datasets include Yelp reviews (Shen et al. 2017) and Amazon product reviews (He and McAuley 2016).
Topic.
There are a few works that cover topic transfer. For example, Huang et al. (2020) form a two-topic corpus by compiling Yahoo! Answers under two topics, entertainment and politics, respectively. There is also a recent dataset with 21 text styles such as Sciences, Sport, Politics, and others (Zeng, Shoeybi, and Liu 2020).
Political Slant.
Political slant transfer proposed by Prabhumoye et al. (2018) aims to transfer the political view in text. For example, a Republican’s comment can be “defund all illegal immigrants,” while Democrats are more likely to support humanistic actions towards immigrants. The political slant dataset (Voigt et al. 2018) is collected from comments on Facebook posts of the United States Senate and House members. The dataset uses top-level comments directly responding to the posts of a Democratic or Republican congressperson. There are 540K training, 4K development, and 56K test instances in the dataset.
Combined Attributes.
Lample et al. (2019) propose a more challenging setting of text attribute transfer: multi-attribute transfer. For example, the source sentence can be a positive review of an Asian restaurant written by a male reviewer, and the target sentence a negative review of an American restaurant written by a female reviewer. Each of their datasets has 1–3 independent categories of attributes. Their first dataset is FYelp, which is compiled from the Yelp Dataset Challenge, labeled with sentiment (positive or negative), gender (male or female), and eatery category (American, Asian, Mexican, bar, or dessert). Their second dataset, Amazon, which is based on the Amazon product review dataset (Li et al. 2018), contains the following attributes: sentiment (positive or negative) and product category (book, clothing, electronics, movies, or music). Their third dataset, the Social Media Content dataset, collected from internal Facebook data that is private, contains gender (male or female), age group (18–24 or 65+), and writer-annotated feeling (relaxed or annoyed).
3. How to Evaluate Style Transfer?
A successful style-transferred output not only needs to demonstrate the correct target style; due to the uncontrollability of neural networks, we also need to verify that it preserves the original semantics and maintains natural language fluency. Therefore, common evaluation practice considers the following three criteria: (1) transferred style strength, (2) semantic preservation, and (3) fluency.
We will first introduce the practice of automatic evaluation on the three criteria, discuss the benefits and caveats of automatic evaluation, and then introduce human evaluation as a remedy for some of the intrinsic weaknesses of automatic evaluation. Finally, we will suggest some standard practice of TST evaluation for future work. The overview of evaluation methods regarding each criterion is listed in Table 4.
| Criterion | Automatic Evaluation | Human Evaluation |
|---|---|---|
| Overall | BLEU with gold references | Rating or ranking |
| Transferred Style Strength | Accuracy by a separately trained style classifier | Rating or ranking |
| Semantic Preservation | BLEU/ROUGE/etc. with (modified) inputs | Rating or ranking |
| Fluency | Perplexity by a separately trained language model | Rating or ranking |
3.1 Automatic Evaluation
Automatic evaluation provides an economic, reproducible, and scalable way to assess the quality of generation results. However, due to the complexities of natural language, each metric introduced below can address certain aspects, but also has intrinsic blind spots.
BLEU with Gold References.
Similar to many text generation tasks, TST also has human-written references on several datasets (Yelp, Captions, etc.), so it is common to use the BLEU score (Papineni et al. 2002) between the gold references and model outputs. Using BLEU to evaluate TST models has been seen across pre-deep learning works (Xu et al. 2012; Jhamtani et al. 2017) and deep learning approaches (Rao and Tetreault 2018; Li et al. 2018; Jin et al. 2019).
There are three problems with using BLEU between the gold references and model outputs:
- Problem 1.
It mainly evaluates content and simply copying the input can result in high BLEU scores.
- Problem 2.
BLEU is shown to have low correlation with human evaluation.
- Problem 3.
Some datasets do not have human-written references.
Problem 1: Different from machine translation, where using BLEU alone is sufficient, TST has to consider the caveat that simply copying the input sentence can achieve high BLEU scores with the gold references on many datasets (e.g., ∼40 on Yelp, ∼20 on Humor & Romance, ∼50 for informal-to-formal style transfer, and ∼30 for formal-to-informal style transfer). This is because most text rewrites have a large extent of n-gram overlap with the source sentence. Machine translation does not have this concern, because the input and output vocabularies differ, so copying the input sequence does not give high BLEU scores. A possible fix is to combine BLEU with PINC (Chen and Dolan 2011), as in paraphrasing (Xu et al. 2012; Jhamtani et al. 2017). By using PINC and BLEU as a 2-dimensional metric, we can minimize the n-gram overlap with the source sentence while maximizing the n-gram overlap with the reference sentences.
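As an illustration of this 2-dimensional metric, the following is a minimal sketch (assuming NLTK is available) that scores a candidate with BLEU against the reference and with PINC against the source; the PINC implementation follows Chen and Dolan (2011), and the example sentences are ours:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def pinc(source, candidate, max_n=4):
    """Fraction of candidate n-grams NOT in the source (higher = more novel)."""
    scores = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        if not cand:
            continue
        scores.append(1 - len(cand & ngrams(source, n)) / len(cand))
    return sum(scores) / len(scores) if scores else 0.0

src = "come and sit !".split()
hyp = "please consider taking a seat .".split()
ref = "please take a seat .".split()

bleu = sentence_bleu([ref], hyp, smoothing_function=SmoothingFunction().method1)
print(f"BLEU vs. reference: {bleu:.3f}, PINC vs. source: {pinc(src, hyp):.3f}")
```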
Problems 2 & 3: Other problems include insufficient correlation of BLEU with human evaluations (e.g., ≤0.30 with respect to human-rated grammaticality shown in Li et al. [2018] and ≤0.45 with respect to human evaluations shown in Mir et al. [2019]), and the unavailability of human-written references for some datasets (e.g., gender and political datasets [Prabhumoye et al. 2018], and the politeness dataset [Madaan et al. 2020]). A commonly used fix is to make the evaluation more fine-grained using three different independent aspects, namely, transferred style strength, semantic preservation, and fluency, which will be detailed below.
Transferred Style Strength.
To automatically evaluate the transferred style strength, most works separately train a style classifier to distinguish the attributes (Hu et al. 2017; Shen et al. 2017; Fu et al. 2018; Li et al. 2018; Prabhumoye et al. 2018). This classifier is used to judge whether each sample generated by the model conforms to the target attribute, and the transferred style strength is calculated as #(outputs classified as carrying the target attribute) / #(all outputs). Li et al. (2018) show that the attribute classifier correlates well with human evaluation on some datasets (e.g., Yelp and Captions), but has almost no correlation on others (e.g., Amazon). The reason is that some product genres have a dominant number of positive or negative reviews.
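In code, this metric is simply classifier accuracy over the system outputs; a minimal sketch follows (the toy classifier here is purely illustrative):

```python
def transferred_style_strength(outputs, target_attribute, style_classifier):
    """Fraction of outputs the classifier assigns to the target attribute."""
    hits = sum(style_classifier(x) == target_attribute for x in outputs)
    return hits / len(outputs)

# Toy attribute classifier, for illustration only.
toy_clf = lambda s: "positive" if "good" in s else "negative"
print(transferred_style_strength(["good food", "bad service"],
                                 "positive", toy_clf))  # 0.5
```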
Semantic Preservation.
Many metrics can be applied to measure the similarity between the input and output sentence pairs, including BLEU (Papineni et al. 2002), ROUGE (Lin and Och 2004), METEOR (Banerjee and Lavie 2005), chrF (Popović 2015), and Word Mover’s Distance (WMD) (Kusner et al. 2015). Recently, some additional deep-learning-based metrics have been proposed, such as cosine similarity based on sentence embeddings (Fu et al. 2018) and BERTScore (Zhang et al. 2020). There are also evaluation metrics specific to TST, such as the Part-of-Speech distance (Tian, Hu, and Yu 2018). Another newly proposed metric first deletes all attribute-related expressions in the text and then applies the above similarity evaluations (Mir et al. 2019). Among all the metrics, Mir et al. (2019) and Yamshchikov et al. (2021) showed that METEOR and WMD correlate better with human evaluation than BLEU, although, in practice, BLEU is the most widely used metric to evaluate the semantic similarity between the source sentence and the style-transferred output (Yang et al. 2018; Madaan et al. 2020).
Fluency.
Fluency is a basic requirement for natural language outputs. To automate this evaluation, perplexity is calculated via a language model (LM) pretrained on the training data of all attributes (Yang et al. 2018). However, the effectiveness of perplexity remains debatable: Pang and Gimpel (2019) showed a high correlation with human ratings of fluency, whereas Mir et al. (2019) found no significant correlation between perplexity and human scores. We note that perplexity computed by an LM can suffer from the following undesired properties:
- It is biased toward shorter sentences over longer ones.
- For the same meaning, less frequent words (e.g., “agreeable”) receive worse perplexity than more frequent ones (e.g., “good”).
- A sentence’s perplexity changes if the sentence preceding it changes.
- LMs are not good enough yet:
  - They do not necessarily handle well the domain shift between their training corpus and the style-transferred text.
  - Perplexity scores are sensitive to the training corpora, the LM architecture and configuration, and the optimization configuration. Therefore, different models’ outputs must be evaluated by exactly the same LM for a fair comparison, which adds difficulty to benchmarking.
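Despite these caveats, perplexity remains the standard automatic fluency proxy. Below is a minimal sketch of how it is typically computed with a pretrained LM, here GPT-2 via the Hugging Face `transformers` library (the model choice is illustrative; as noted above, all systems under comparison must be scored with the same LM):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(sentence: str) -> float:
    """Exponentiated mean token-level cross-entropy under the LM."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # GPT-2 shifts labels internally
    return torch.exp(loss).item()

print(perplexity("Please consider taking a seat."))
```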
Task-Specific Criteria.
As TST can serve as a component of other downstream applications, task-specific criteria have also been proposed to evaluate the quality of generated text. For example, Reiter, Robertson, and Osman (2003) evaluated the effect of their tailored text on reducing smokers’ intent to smoke through clinical trials. Jin et al. (2020a) applied TST to generate eye-catching headlines, evaluated by an attractiveness score; future work in this direction could also test click-through rates. Hu et al. (2017) evaluated how the generated text, used as augmented data, can improve downstream attribute classification accuracy.
Tips for Automatic Metrics.
For the evaluation metrics that rely on the pretrained models, namely, the style classifier and LM, we need to beware of the following:
The pretrained models for automatic evaluation should be separate from the proposed TST model.
Machine learning models can be imperfect, so we should be aware of the potential false positives and false negatives.
The pretrained models are imperfect in the sense that they may favor certain types of methods.
For the second point, we need to understand what the false positives and false negatives of the generated outputs can be. An illustrative example: if the style classifier only achieves 80+% accuracy (e.g., on the gender dataset [Prabhumoye et al. 2018] and the Amazon dataset [Li et al. 2018]), even perfect style rewrites can only score 80+%, whereas an imperfect model may score 90% because it resembles the imperfect style classification model more closely and takes advantage of its false positives. Other sources of false positives include adversarial attacks. Jin et al. (2020b) showed that merely paraphrasing with synonyms can drop the performance of high-accuracy classification models, from TextCNN (Kim 2014) to BERT (Devlin et al. 2019), by 90+%. Therefore, higher scores by the style classifier do not necessarily indicate more successful transfer. Moreover, the style classifier can produce false negatives if there is a distribution shift between the training data and the style-transferred outputs. For example, a product may appear mostly with the positive attribute in the training corpus, but co-occur with the opposite, negative attribute in the style-transferred outputs. Such false negatives are observed on the Amazon product review dataset (Li et al. 2018). On the other hand, the biases of the LM correlate with sentence length, synonym replacement, and prior context.
The third point follows directly from the second, so in practice we need to check whether a proposed model makes generalizable improvements or merely takes advantage of the evaluation metrics.
3.2 Human Evaluation
Compared with the pros and cons of the automatic evaluation metrics mentioned above, human evaluation stands out for its flexibility and comprehensiveness. For example, when asking humans to evaluate fluency, we do not need to worry about the bias toward shorter sentences, as with the LM. We can also design criteria that are not easy to compute, such as comparing and ranking the outputs of multiple models. There are several ways to conduct human evaluation. In terms of evaluation type, there is pointwise scoring, namely, asking humans to provide absolute scores for the model outputs, and pairwise comparison, namely, asking humans to judge which of two outputs is better, or to rank multiple outputs. In terms of criteria, humans can provide an overall evaluation, or separate scores for transferred style strength, semantic preservation, and fluency.
However, the well-known limitations of human evaluation are cost and irreproducibility. Human evaluations can be time-consuming and financially costly. Moreover, the human evaluation results of two studies are often not directly comparable, because such results tend to be subjective and not easily reproducible (Belz et al. 2020). Some styles are also very difficult to evaluate without expertise and extensive reading experience.
Tips for Human Evaluation.
As a remedy, we encourage future researchers to report inter-rater agreement scores such as Cohen’s kappa (Cohen 1960) and Krippendorff’s alpha (Krippendorff 2018). Briakou et al. (2021a) also recommend standardizing and describing evaluation protocols (e.g., linguistic background of the annotators, compensation, detailed annotation instructions for each evaluation aspect), and releasing annotations.
3.3 Suggested Evaluation Settings for Future Work
Currently, different TST works do not adopt the same experimental setting, making head-to-head comparison among the empirical results of multiple studies difficult. Although it is reasonable to customize experimental settings according to the needs of a certain study, we suggest using the standard setting in at least one of the reported experiments, to ease comparison with previous and future studies. Specifically: (1) experiment on at least one commonly used dataset, (2) list up-to-date best-performing previous models as baselines, (3) report a superset of the most commonly used metrics, and (4) release system outputs.
For (1), we suggest that future work use at least one of the most commonly used benchmark datasets, such as the Yelp data preprocessed by Shen et al. (2017) and its five human references provided by Jin et al. (2019), Amazon data preprocessed by Li et al. (2018), and formality data provided by Rao and Tetreault (2018).
For (2), we suggest that future studies actively check the latest style transfer papers curated at https://github.com/fuzhenxin/Style-Transfer-in-Text and our repository https://github.com/zhijing-jin/Text_Style_Transfer_Survey, and compare with state-of-the-art performance instead of older results. We also call for more reproducibility in the community, including releasing source code and evaluation code, because, for example, there are several different scripts to compute BLEU scores.
For (3), because no single evaluation metric is perfect and comprehensive enough for TST, it is strongly suggested to use both human and automatic evaluation on three criteria. In evaluation, apart from customized use of metrics, we suggest that most future work include at least the following evaluation practices:
Human evaluation: Rate at least two state-of-the-art models according to the curated paper lists.
Automatic evaluation: At least report the BLEU score with all available references if there exist human-written references (e.g., the five references for the Yelp dataset provided by Jin et al. [2019]), and BLEU with the input only when there are no human-written references.
For (4), it will also be very helpful to provide system outputs for each TST paper, so that future works can better reproduce both human and automatic evaluation results. Releasing system outputs helps future studies compare automatic evaluation results, because there can be different scripts to compute BLEU scores, as well as different style classifiers and LMs. It would be a great addition to the TST community if future work established an online leaderboard that lets research groups upload their output files and automatically evaluates the model outputs using a standard set of automatic evaluation scripts.
4. Methods on Parallel Data
Over the last several years, various methods have been proposed for TST. In general, they can be categorized based on whether the dataset has parallel text with different styles or several non-parallel mono-style corpora. The rightmost column “Pa?” in Table 3 shows whether parallel data exist for each TST subtask. In this section, we cover TST methods on parallel datasets, and in Section 5 we detail the approaches on non-parallel datasets. To ease understanding, we will in most cases explain TST on one attribute with two values, such as transferring formality between informal and formal tones, which can potentially be extended to multiple attributes.
Most methods adopt the standard neural sequence-to-sequence (seq2seq) model with the encoder-decoder architecture, which was initially developed for neural machine translation (NMT) (Sutskever, Vinyals, and Le 2014; Bahdanau, Cho, and Bengio 2015; Cho et al. 2014) and is extensively used in text generation tasks such as summarization (Rush, Chopra, and Weston 2015) and many others (Song et al. 2019). The encoder-decoder seq2seq model can be implemented with LSTMs, as in Rao and Tetreault (2018) and Shang et al. (2019), or Transformers (Vaswani et al. 2017), as in Xu, Ge, and Wei (2019). A copy mechanism (Gülçehre et al. 2016; See, Liu, and Manning 2017) is often added to better handle stretches of text that should not be changed (e.g., some proper nouns and rare words) (Gu et al. 2016; Merity et al. 2017). Based on this architecture, recent work has developed multiple directions of improvement: multi-tasking, inference techniques, and data augmentation, which are introduced below.
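As a minimal illustration of this supervised setup, the sketch below fine-tunes a pretrained encoder-decoder on one formality pair. T5 is chosen here purely as a convenient example backbone, not as the model used in the cited works:

```python
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# One parallel training pair; a task prefix tells the model the target style.
src = tokenizer("transfer to formal: come and sit!", return_tensors="pt")
tgt = tokenizer("Please consider taking a seat.", return_tensors="pt")

optimizer.zero_grad()
loss = model(input_ids=src.input_ids,
             attention_mask=src.attention_mask,
             labels=tgt.input_ids).loss  # token-level cross-entropy on the target
loss.backward()
optimizer.step()
```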
Multi-tasking.
In addition to the seq2seq learning on paired attributed text, Xu, Ge, and Wei (2019) propose adding three other loss functions: (1) a classifier-guided loss, which is calculated using a well-trained attribute classifier and encourages the model to generate sentences conforming to the target attribute, (2) a self-reconstruction loss, which encourages the seq2seq model to reconstruct the input itself when the desired style is specified to be the same as the input style, and (3) a cycle loss, which first transfers the input sentence to the target attribute and then transfers the output back to its original attribute. Each of the three losses yields an improvement of 1–5 BLEU points against the human references (Xu, Ge, and Wei 2019). Another type of multi-tasking is to jointly learn TST and machine translation from French to English, which improves performance by 1 BLEU point with human-written references (Niu, Rao, and Carpuat 2018). Specifically for formality transfer, Zhang, Ge, and Sun (2020) multi-task TST and grammatical error correction (GEC) so that knowledge from GEC data can be transferred to the informal-to-formal style transfer task.
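Schematically, the combined objective can be written as a weighted sum of the four losses. In the sketch below, `model.nll`, `model.generate`, and `classifier.nll` are hypothetical interfaces standing in for the actual modules of Xu, Ge, and Wei (2019), and the uniform weights are illustrative:

```python
def multitask_loss(model, classifier, x, a, x_prime, a_prime,
                   w=(1.0, 1.0, 1.0, 1.0)):
    l_seq2seq = model.nll(src=x, tgt=x_prime, style=a_prime)  # supervised seq2seq
    y_hat = model.generate(x, style=a_prime)
    l_cls = classifier.nll(y_hat, label=a_prime)              # classifier-guided loss
    l_self = model.nll(src=x, tgt=x, style=a)                 # self-reconstruction
    l_cycle = model.nll(src=y_hat, tgt=x, style=a)            # cycle loss
    return w[0]*l_seq2seq + w[1]*l_cls + w[2]*l_self + w[3]*l_cycle
```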
Inference Techniques.
To avoid the model copying too many parts of the input sentence and not performing sufficient edits to flip the attribute, Kajiwara (2019) first identifies words in the source sentence requiring replacement, and then changes the words by negative lexically constrained decoding (Post and Vilar 2018) that avoids naive copying. Because this method only changes the beam search process for model inference, it can be applied to any TST model without model re-training.
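Off-the-shelf toolkits offer a rough approximation of this idea: banning the identified source words from the beam. The sketch below uses the `bad_words_ids` option of Hugging Face's `generate` (reusing the `model` and `tokenizer` from the sketch above); note this only mimics, rather than reimplements, the decoding algorithm of Post and Vilar (2018):

```python
# Words identified as requiring replacement in the source sentence.
words_to_replace = ["gonna", "wanna"]
bad_words_ids = tokenizer(words_to_replace, add_special_tokens=False).input_ids

outputs = model.generate(
    **tokenizer("transfer to formal: i am gonna go now .", return_tensors="pt"),
    num_beams=5,
    bad_words_ids=bad_words_ids,  # beam search may not emit these token sequences
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```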
Data Augmentation.
Because style transfer data is expensive to annotate, there are not as many parallel datasets as in machine translation. Hence, various methods have been proposed for data augmentation to enrich the data. For example, Rao and Tetreault (2018) first train a phrase-based machine translation (PBMT) model on a given parallel dataset and then use back-translation (Sennrich, Haddow, and Birch 2016b) to construct a pseudo-parallel dataset as additional training data, which leads to an improvement of around 9.7 BLEU points with respect to human written references.
Most recently, Zhang, Ge, and Sun (2020) augment data by making use of largely available online text. They scrape informal text from online forums and generate back-translations, that is, informal English → a pivot language such as French → formal English, where a formality classifier is used to keep only the back-translated English text that is classified as formal.
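A schematic of this classifier-filtered back-translation pipeline is sketched below; `translate_en_fr`, `translate_fr_en`, and `formality_classifier` are hypothetical stand-ins for whatever MT systems and classifier one has available:

```python
def augment(informal_sentences, translate_en_fr, translate_fr_en,
            formality_classifier):
    """Build a pseudo-parallel corpus from raw informal text."""
    pseudo_parallel = []
    for x in informal_sentences:
        pivot = translate_en_fr(x)               # informal English -> French
        y = translate_fr_en(pivot)               # French -> (hopefully formal) English
        if formality_classifier(y) == "formal":  # keep only confident formal rewrites
            pseudo_parallel.append((x, y))
    return pseudo_parallel
```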
5. Methods on Non-Parallel Data
Parallel data for TST are difficult to obtain, and for some styles impossible to crowd-source (e.g., Mark Twain novels rewritten in Hemingway’s style). Hence, the majority of TST methods assume only non-parallel mono-style corpora and investigate how to build deep learning models under this constraint. In this section, we introduce the three main branches of TST methods: disentanglement (Section 5.1), prototype editing (Section 5.2), and pseudo-parallel corpus construction (Section 5.3).
5.1 Disentanglement
Disentanglement-based models usually perform the following three actions:
Encode the text x with attribute a into a latent representation z (i.e., x → z)
Manipulate the latent representation z to remove the source attribute (i.e., z → z′)
Decode into text x′ with the target attribute a′ (i.e., z′ → x′)
To build such models, the common workflow in disentanglement papers consists of the following three steps:
- Step 1.
Select a model as the backbone for the encoder-decoder learning (Section 5.1.1).
- Step 2.
Select a manipulation method of the latent representation (Section 5.1.2).
- Step 3.
For the manipulation method chosen above, select (multiple) appropriate loss functions (Section 5.1.3).
The organization of this section starts with Section 5.1.1, which introduces the encoder-decoder training objectives that are used for Step 1. Next, Section 5.1.2 overviews three main approaches to manipulating the latent representation for Step 2, and Section 5.1.3 goes through a plethora of training objectives for Step 3. Table 5 provides an overview of existing models and their corresponding configurations. To give a rough idea of the effectiveness of each model, we show their performance on the Yelp dataset.
| Model | Enc-Dec | Disen. | Style Control | Content Control | BL-Ref | Acc. (%) | BL-Inp | PPL↓ |
|---|---|---|---|---|---|---|---|---|
| Mueller, Gifford, and Jaakkola (2017) | VAE | LRE | – | – | – | – | – | – |
| Hu et al. (2017) | VAE | ACC | ACO | – | 22.3 | 86.7 | 58.4 | – |
| Shen et al. (2017) | AE&GAN | ACC | AdvR∥AdvO | – | 7.8 | 73.9 | 20.7 | 72* |
| Fu et al. (2018) | AE | ACC | AdvR | – | 12.9 | 46.9 | 40.1 | 166.5* |
| Prabhumoye et al. (2018) | AE | ACC | ACO | – | 6.8 | 87.2 | – | 32.8* |
| Zhao et al. (2018) | GAN | ACC | AdvR | – | – | 73.4 | 31.2 | 29.7 |
| Yang et al. (2018) | AE | ACC | LMO | – | – | 91.2 | 57.8 | 47.0&60.9 |
| Logeswaran, Lee, and Bengio (2018) | AE | ACC | AdvO | Cycle | – | 90.5 | – | 133 |
| Tian, Hu, and Yu (2018) | AE | ACC | AdvO | Noun | 24.9 | 92.7 | 63.3 | – |
| Liao et al. (2018) | VAE | LRE | – | – | – | 88.3 | – | – |
| Romanov et al. (2019) | AE | LRS | ACR&AdvR | – | – | – | – | – |
| John et al. (2019) | AE&VAE | LRS | ACR&AdvR | BoW&AdvBoW | – | 93.4 | – | – |
| Bao et al. (2019) | VAE | LRS | ACR&AdvR | BoW&AdvBoW | – | – | – | – |
| Dai et al. (2019) | AE | ACC | ACO | Cycle | 20.3 | 87.7 | 54.9 | 73 |
| Wang, Hua, and Wan (2019) | AE | LRE | – | – | 24.6 | 95.4 | – | 46.2 |
| Li et al. (2020) | GAN | ACC | ACO&AdvR | – | – | 95.5 | 53.3 | – |
| Liu et al. (2020) | VAE | LRE | – | – | 18.8 | 92.3 | – | 18.3 |
| Yi et al. (2020) | VAE | ACC | ACO | Cycle | 26.0 | 90.8 | – | 109 |
| Jin et al. (2020a) | AE | LRE | – | – | – | – | – | – |

(Enc-Dec: encoder-decoder backbone; Disen.: type of latent representation manipulation; BL-Ref: BLEU with human references; Acc.: target-attribute accuracy by a style classifier; BL-Inp: BLEU with the input; PPL: perplexity, lower is better.)
5.1.1 Encoder-Decoder Training Method.
There are three model choices to obtain the latent representation z from the discrete text x and then decode it into the new text x′ via reconstruction training: auto-encoder (AE), variational auto-encoder (VAE), and generative adversarial networks (GANs).
Auto-Encoder (AE).
Auto-encoding is a commonly used method to learn the latent representation z, which first encodes the input sentence x into a latent vector z and then reconstructs a sentence as similar to the input sentence as possible. AE is used in many TST works (e.g., Shen et al. 2017; Hu et al. 2017; Fu et al. 2018; Zhao et al. 2018; Prabhumoye et al. 2018; Yang et al. 2018). To avoid auto-encoding from blindly copying all the elements from the input, Hill, Cho, and Korhonen (2016) adopt denoising auto-encoding (DAE) (Vincent et al. 2010) to replace AE in NLP tasks. Specifically, DAE first passes the input sentence x through a noise model to randomly drop, shuffle, or mask some words, and then reconstructs the original sentence from this corrupted sentence. This idea is used in later TST works, for example, Lample et al. (2019) and Jin et al. (2020a). As pre-trained models became prevalent in recent years, the DAE training method has increased in popularity relative to its counterparts such as GAN and VAE, because pre-training over large corpora can grant models better performance in terms of semantic preservation and fluency (Lai, Toral, and Nissim 2021; Riley et al. 2021).
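A minimal sketch of such a noise model, with random word dropping, masking, and local shuffling, is given below (the corruption probabilities and window size are illustrative):

```python
import random

def corrupt(tokens, p_drop=0.1, p_mask=0.1, shuffle_window=3):
    """Corrupt a token sequence for denoising auto-encoding."""
    out = [t for t in tokens if random.random() > p_drop]             # random drops
    out = ["<mask>" if random.random() < p_mask else t for t in out]  # random masks
    # Local shuffle: each token moves by roughly at most `shuffle_window` positions.
    keys = [i + random.uniform(0, shuffle_window) for i in range(len(out))]
    return [t for _, t in sorted(zip(keys, out), key=lambda kv: kv[0])]

print(corrupt("please consider taking a seat .".split()))
```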
Variational Auto-Encoder (VAE).
Generative Adversarial Networks (GANs).
5.1.2 Latent Representation Manipulation.
Based on the general encoder and decoder training method, the core element of disentanglement is the manipulation of latent representation z. Figure 2 illustrates three main methods: latent representation editing, attribute code control, and latent representation splitting. In addition, the “Disen.” column of Table 5 shows the type of latent representation manipulation for each work in disentanglement.
The first approach, Latent Representation Editing (LRE), shown in Figure 2a, is achieved by ensuring two properties of the latent representation z. The first property is that z should be able to serve as the latent representation for auto-encoding, namely, aligning the decoded output G(z) with the input x, where z = E(x). The second property is that z should be learned such that it incorporates the new attribute value of interest a′. To achieve this, the common practice is to first learn an attribute classifier fc, for example, a multilayer perceptron that takes the latent representation z as input, and then iteratively update z within the space constrained by the first property while maximizing the classifier's prediction confidence for a′ (Mueller, Gifford, and Jaakkola 2017; Liao et al. 2018; Wang, Hua, and Wan 2019; Liu et al. 2020). An alternative way to achieve the second property is to multi-task with another auto-encoding task on the corpus with attribute a′, sharing most layers of the Transformer except the query transformation and layer normalization layers (Jin et al. 2020a).
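The gradient-based update of z can be sketched as follows; `attr_classifier` stands for a trained classifier fc, and for brevity the sketch performs only the attribute-confidence updates (real systems also keep z within the space where the decoder reconstructs well):

```python
import torch

def edit_latent(z, attr_classifier, target_label, steps=10, lr=0.1):
    """Gradient steps on z that raise the classifier's confidence in a'."""
    z = z.clone().detach().requires_grad_(True)
    for _ in range(steps):
        loss = torch.nn.functional.cross_entropy(
            attr_classifier(z), torch.tensor([target_label]))
        loss.backward()
        with torch.no_grad():
            z -= lr * z.grad  # move z toward the target attribute
        z.grad.zero_()
    return z.detach()         # decode with G(z) to obtain the transferred text

# Toy usage: a linear classifier over a 16-dim latent space.
clf = torch.nn.Linear(16, 2)
z_new = edit_latent(torch.randn(1, 16), clf, target_label=1)
```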
The second approach, Attribute Code Control (ACC), shown in Figure 2b, first enforces via adversarial learning that the latent representation z of the sentence x contains all information except its attribute value, and then decodes the transferred output from the combination of z and a structured attribute code a′ corresponding to the target attribute value a′. During the decoding process, the attribute code vector controls the attribute of the generated text by acting as either the initial state (Shen et al. 2017; Yi et al. 2020) or the embedding (Fu et al. 2018; Dai et al. 2019).
The third approach, Latent Representation Splitting (LRS), as illustrated in Figure 2c, first disentangles the input text into two parts: the latent attribute representation a, and semantic representation z that captures attribute-independent information. We then replace the source attribute a with the target attribute a′, and the final transferred text is generated using the combination of z and a′ (John et al. 2019; Romanov et al. 2019).
5.1.3 Training Objectives.
When disentangling the attribute information a and the attribute-independent semantic information z, we need to achieve two aims:
- Aim 1.
The target attribute is fully and exclusively controlled by a (and not z). We typically use style-oriented losses to achieve this aim (Section 5.1.3.1).
- Aim 2.
The attribute-independent information is fully and exclusively captured by z (and not a). Content-oriented losses are more often used for this aim (Section 5.1.3.2).
5.1.3.1 Style-Oriented Losses.
To achieve Aim 1, many different style-oriented losses have been proposed, to nudge the model to learn a more clearly disentangled a and exclude the attribute information from z.
Attribute Classifier on Outputs (ACO).
Attribute Classifier on Representations (ACR).
Adversarial Learning on Representations (AdvR).
Adversarial Learning on Outputs (AdvO).
Language Modeling on Outputs (LMO).
5.1.3.2 Content-Oriented Losses.
The style-oriented losses introduced above ensure that the attribute information is contained in a, but they do not necessarily constrain the style-independent semantics z. To learn the attribute-independent information fully and exclusively in z, the following content-oriented losses have been proposed:
Cycle Reconstruction (Cycle).
One way to train the above cycle loss is by reinforcement learning, as done by Luo et al. (2019), who use the negative of this loss as a reward for content preservation.
Bag-of-Words Overlap (BoW).
To approximately measure content preservation, bag-of-words (BoW) features are used by John et al. (2019) and Bao et al. (2019). To focus on content information only, John et al. (2019) exclude stopwords and style-specific words.
Adversarial BoW Overlap (AdvBoW).
BoW ensures that the content is fully captured in z. As a further step, we want to ensure that the content information is exclusively captured in z, namely, not contained in a at all, via an adversarial AdvBoW loss on a (John et al. 2019; Bao et al. 2019).
Other Losses/Rewards.
There are also other losses/rewards in recent work such as the noun overlap loss (Noun) (Tian, Hu, and Yu 2018), as well as rewards for semantics and fluency (Xu et al. 2018; Gong et al. 2019; Sancheti et al. 2020). We do not discuss them in much detail because they do not directly operate on the disentanglement of latent representations.
5.2 Prototype Editing
Despite the plethora of models that use end-to-end training of neural networks, the prototype-based text editing approach still attracts considerable attention, since the proposal of the pipeline method delete, retrieve, and generate (Li et al. 2018).
Prototype editing is reminiscent of early word replacement methods used for TST, such as synonym matching using a style dictionary (Sheikha and Inkpen 2011), WordNet (Khosmood and Levinson 2010; Mansoorizadeh et al. 2016), hand-crafted rules (Khosmood and Levinson 2008; Castro, Ortega, and Muñoz 2017), or using hypernyms and definitions to replace the style-carrying words (Karadzhov et al. 2017).
Featuring more controllability and interpretability, prototype editing builds an explicit pipeline for TST from x with attribute a to its counterpart x′ with attribute a′:
- Step 1.
Detect attribute markers of a in the input sentence x, and delete them, resulting in a content-only sentence (Section 5.2.1).
- Step 2.
Retrieve candidate attribute markers carrying the desired attribute a′ (Section 5.2.2).
- Step 3.
Infill the sentence with the new attribute markers, making sure the generated sentence is fluent (Section 5.2.3).
5.2.1 Attribute Marker Detection.
Extracting attribute markers is a non-trivial NLP task. Traditional ways to do it involve first using tagging, parsing, and morphological analysis to select features, and then filtering by mutual information and chi-square testing. In recent deep learning pipelines, there are three major types of approaches to identify attribute markers: frequency-ratio methods, attention-based methods, and fusion methods.
Frequency-ratio methods calculate statistics for each n-gram in the corpora. For example, Li et al. (2018) detect attribute markers by calculating each n-gram's relative frequency of co-occurrence with attribute a versus a′; those with frequencies higher than a threshold are considered markers of a. Using a similar approach, Madaan et al. (2020) first calculate the ratio of mean TF-IDF between the two attribute corpora for each n-gram, then normalize this ratio across all possible n-grams, and finally mark those n-grams with a normalized ratio p higher than a pre-set threshold as attribute markers.
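A minimal sketch of the frequency-ratio idea, in the spirit of Li et al. (2018), follows; the smoothing constant and threshold are illustrative choices, not the values from the paper.

```python
from collections import Counter

def attribute_markers(corpus_a, corpus_b, n=2, threshold=3.0, smoothing=1.0):
    """Detect attribute markers of style a by n-gram frequency ratio.

    `corpus_a`/`corpus_b` are lists of token lists. An n-gram counts as a
    marker of a if it occurs much more often in corpus_a than in corpus_b.
    """
    def ngram_counts(corpus):
        counts = Counter()
        for tokens in corpus:
            for size in range(1, n + 1):
                for i in range(len(tokens) - size + 1):
                    counts[tuple(tokens[i:i + size])] += 1
        return counts

    counts_a, counts_b = ngram_counts(corpus_a), ngram_counts(corpus_b)
    return {
        gram for gram, c in counts_a.items()
        if (c + smoothing) / (counts_b[gram] + smoothing) >= threshold
    }

# Toy example: "delicious" surfaces as a positive-sentiment marker.
pos = [["the", "food", "was", "delicious"], ["delicious", "and", "cheap"]]
neg = [["the", "food", "was", "awful"], ["too", "salty"]]
print(attribute_markers(pos, neg, n=1))  # {('delicious',)}
```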
Attention-based methods train an attribute classifier using the attention mechanism (Bahdanau, Cho, and Bengio 2015), and consider words with attention weights higher than average as markers (Xu et al. 2018). For the architecture of the classifier, Zhang et al. (2018c) use LSTM, and Sudhakar, Upadhyay, and Maheswaran (2019) use a BERT classifier, where the BERT classifier has shown higher detection accuracy for the attribute markers.
Fusion methods combine the advantages of the two methods above. For example, Wu et al. (2019) prioritize the attribute markers predicted by frequency-ratio methods, and use attention-based methods as an auxiliary backup. One use case is when frequency-ratio methods fail to identify any attribute markers in a given sentence, in which case the attention-based methods serve as a secondary choice to generate attribute markers. Another case is reducing false positives: Wu et al. (2019) set a threshold to filter out low-quality attribute markers proposed by frequency-ratio methods, and in cases where all attribute markers are filtered out, they fall back to the markers predicted by attention-based methods.
The previous methods still have limitations, such as the imperfect accuracy of the attribute classifier and the unclear relation between attributes and attention scores. Hence, Lee (2020) proposes word importance scoring, similar to what is used by Jin et al. (2020b) for adversarial paraphrasing, which measures how important a token is to the attribute by the difference between the attribute probability of the original sentence and that of the sentence with the token deleted.
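A sketch of this leave-one-token-out scoring follows; `attr_prob` stands in for any pre-trained style classifier's probability of the attribute and is a hypothetical placeholder.

```python
def word_importance(tokens, attr_prob):
    """Word importance scoring, sketched after Lee (2020).

    `attr_prob` is a callable returning the attribute probability of a token
    list under a pre-trained style classifier. A token's importance is the
    drop in that probability when the token is deleted.
    """
    base = attr_prob(tokens)
    return [
        (tok, base - attr_prob(tokens[:i] + tokens[i + 1:])){}
        for i, tok in enumerate(tokens)
    ] if False else [
        (tok, base - attr_prob(tokens[:i] + tokens[i + 1:]))
        for i, tok in enumerate(tokens)
    ]
```

The highest-scoring tokens are then treated as the attribute markers to delete.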
5.2.2 Target Attribute Retriever.
After deleting the attribute markers Markera(x) of the sentence x with attribute a, we need to find a counterpart attribute marker Markera′(x′) from another sentence x′ carrying a different attribute a′. Denote the sentence template with all attribute markers deleted as Template(x) = x∖Markera(x). Similarly, the template of the sentence x′ is Template(x′) = x′∖Markera′(x′). A common approach is to find the counterpart attribute marker by its context, because the templates of the original attribute marker and its counterpart should be similar. Specifically, we first match a template Template(x) with the most similar template Template(x′) in the opposite attribute corpus, and then identify the attribute markers Markera(x) and Markera′(x′) as counterparts of each other. To match templates with their counterparts, most previous works find the nearest neighbors by the cosine similarity of sentence embeddings. Commonly used sentence embeddings include TF-IDF and averaged GloVe embeddings, both used in Li et al. (2018) and Sudhakar, Upadhyay, and Maheswaran (2019), and the Universal Sentence Encoder (Cer et al. 2018) used in Sudhakar, Upadhyay, and Maheswaran (2019). Apart from sentence embeddings, Tran, Zhang, and Soleymani (2020) use Part-of-Speech templates to match several candidates in the opposite corpus, and conduct an exhaustive search to fill parts of the candidate sentences into the masked positions of the original attribute markers.
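The TF-IDF variant of this template matching can be sketched as follows; the function is illustrative and omits practical details such as approximate nearest-neighbor search for large corpora.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def match_templates(templates_a, templates_b):
    """Match each marker-deleted template from corpus a with its nearest
    template in corpus b by TF-IDF cosine similarity (one of the options
    used by Li et al. 2018). Inputs are lists of template strings.
    """
    vectorizer = TfidfVectorizer().fit(templates_a + templates_b)
    va = vectorizer.transform(templates_a)
    vb = vectorizer.transform(templates_b)
    sims = cosine_similarity(va, vb)   # shape: (len(a), len(b))
    return sims.argmax(axis=1)         # index of each template's counterpart in b

# The markers deleted from matched templates are then treated as counterparts.
```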
5.2.3 Generation from Prototypes.
Li et al. (2018) and Sudhakar, Upadhyay, and Maheswaran (2019) feed the content-only sentence template and new attribute markers into a pretrained language model that rearranges them into a natural sentence. This infilling process can naturally be achieved by a masked language model (MLM) (Malmi, Severyn, and Rothe 2020). For example, Wu et al. (2019) use an MLM of the template conditioned on the target attribute, and this MLM is trained with an additional attribute classification loss using the model output and a fixed pre-trained attribute classifier. Because these generation practices are complicated, Madaan et al. (2020) propose a simpler way: they skip Step 2, which explicitly retrieves attribute candidates, and instead directly learn a generation model that takes only attribute-masked sentences as inputs. This generation model is trained on data where the attribute-carrying sentences x are paired with their templates Template(x). Training on the (Template(x), x) pairs constructed in this way teaches the model to fill the masked sentence template with the target attribute a.
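For the mechanics of MLM-based infilling, the snippet below uses a generic pre-trained BERT rather than a style-conditioned MLM as in Wu et al. (2019), so it only illustrates how a deleted attribute-marker slot can be filled.

```python
from transformers import pipeline

# A generic pre-trained MLM, not one fine-tuned for style transfer;
# this only demonstrates the infilling step.
fill = pipeline("fill-mask", model="bert-base-uncased")

template = "the food was [MASK] ."   # attribute marker deleted from the prototype
for candidate in fill(template, top_k=5):
    print(candidate["token_str"], round(candidate["score"], 3))
```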
5.3 Pseudo-Parallel Corpus Construction
To provide more signals for training, it is also helpful to generate pseudo-parallel data for TST. Two major approaches are retrieval-based and generation-based methods.
Retrieval-Based Corpora Construction.
One common way to construct pseudo-parallel data is through retrieval, namely, extracting aligned sentence pairs from two mono-style corpora. Jin et al. (2019) empirically observe that semantically similar sentences in the two mono-style corpora tend to be the attribute-transferred counterparts of each other. Hence, they construct the initial pseudo corpora by matching sentence pairs in the two attributed corpora according to the cosine similarity of pretrained sentence embeddings. Formally, for each sentence x, its pseudo counterpart x̂′ is its most similar sentence in the other attribute corpus X′, namely, x̂′ = argmaxx′∈X′ Similarity(x, x′). This approach is extended by Nikolov and Hahnloser (2019), who use large-scale hierarchical alignment to extract pseudo-parallel style transfer pairs. Such retrieval-based pseudo-parallel data construction is also useful for machine translation (Munteanu and Marcu 2005; Uszkoreit et al. 2010; Marie and Fujita 2017; Grégoire and Langlais 2018; Ren et al. 2020).
Generation-Based Corpora Construction.
Another way is through generation, such as iterative back-translation (IBT) (Hoang et al. 2018). IBT is a widely used method in machine translation (Artetxe et al. 2018; Lample et al. 2018a, 2018b; Dou, Anastasopoulos, and Neubig 2020) that adopts an iterative process to generate pseudo-parallel corpora.
Before starting the iterative process, IBT needs to first initialize two style transfer models: Ma→a′, which transfers from the attribute a to the other attribute a′, and Ma′→a, which transfers from a′ to a. Then, in each iteration, it executes the following steps:
- Step 1.
Use the models to generate pseudo-parallel corpora. Specifically, Ma→a′ generates pseudo pairs (x, x̂′) for all x ∈ X, and Ma′→a generates pairs (x̂, x′) for all x′ ∈ X′.
- Step 2.
Re-train these two style transfer models on the datasets generated in Step 1, that is, re-train Ma→a′ on the (x̂, x′) pairs and Ma′→a on the (x̂′, x) pairs.
For Step 1, in order to generate the initial pseudo-parallel corpora, a simple baseline is to randomly initialize the two models Ma→a′ and Ma′→a and use them to transfer the attribute of each sentence x ∈ X and x′ ∈ X′. However, this simple initialization is subject to randomness and may not bootstrap well. Another way, adopted by Zhang et al. (2018d), borrows an idea from unsupervised machine translation (Lample et al. 2018a): first learn an unsupervised word-to-word translation table between attributes a and a′, and use it to generate initial pseudo-parallel corpora. Based on such initial corpora, they train initial style transfer models and bootstrap the IBT process. Another model, Iterative Matching and Translation (IMaT) (Jin et al. 2019), does not learn the word translation table, and instead trains the initial style transfer models on the retrieval-based pseudo-parallel corpora introduced above.
For Step 2, during the iterative process, it is possible to encounter divergence, as there is no constraint ensuring that each iteration produces better pseudo-parallel corpora than the previous one. One way to enhance the convergence of IBT is to add additional losses. For example, Zhang et al. (2018d) use the attribute classification loss ACO, as in Equation (3), to check whether the sentence generated by back-translation fits the desired attribute according to a pre-trained style classifier. Alternatively, IMaT (Jin et al. 2019) uses a checking mechanism instead of additional losses: at the end of each iteration, IMaT looks at all candidate pseudo-pairs of an original sentence, and uses the Word Mover's Distance (WMD) (Kusner et al. 2015) to select the sentence that has the desired attribute and is closest to the original sentence.
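Putting the two steps together, an interface-level sketch of the IBT loop follows; the model methods are hypothetical, and real implementations add the initialization and convergence safeguards discussed above.

```python
def iterative_back_translation(m_a2b, m_b2a, corpus_a, corpus_b, rounds=5):
    """Iterative back-translation (IBT), sketched at the interface level.

    `m_a2b`/`m_b2a` are hypothetical transfer models exposing
    `.transfer(sentences)` and `.train_on(pairs)`; both must be initialized
    before this loop (e.g., from retrieval-based pseudo-pairs as in IMaT).
    """
    for _ in range(rounds):
        # Step 1: generate pseudo-parallel corpora with the current models.
        pseudo_b = m_a2b.transfer(corpus_a)   # x̂' for every x  in X
        pseudo_a = m_b2a.transfer(corpus_b)   # x̂  for every x' in X'
        # Step 2: re-train each model with the generated side as source
        # and the real side as target.
        m_a2b.train_on(list(zip(pseudo_a, corpus_b)))   # (x̂, x') pairs
        m_b2a.train_on(list(zip(pseudo_b, corpus_a)))   # (x̂', x) pairs
    return m_a2b, m_b2a
```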
6. Research Agenda
In this section, we propose some potential directions for future TST research, including expanding the scope of styles (Section 6.1), improving the methodology (Section 6.2), loosening the style-specific data assumptions (Section 6.3), and improving evaluation metrics (Section 6.4).
6.1 Expanding the Scope of Styles
More Styles.
Extending the list of styles for TST is one popular research direction. Existing research originally focused on styles such as simplification (Zhu, Bernhard, and Gurevych 2010), formality (Sheikha and Inkpen 2011), and sentiment transfer (Shen et al. 2017), while the last two years have seen a richer set of styles, such as politeness (Madaan et al. 2020), biasedness (Pryzant et al. 2020), medical text simplification (Cao et al. 2020), and so on.
Such extension of styles is driven by the advancement of TST methods, and also various downstream needs, such as persona-based dialog generation, customized text rewriting applications, and moderation of online text. Apart from the styles that have been researched as listed in Table 3, there are also many other new styles that can be interesting to conduct new research on, including but not limited to the following:
Factual-to-empathetic transfer, to improve counseling dialogs (after the first version of this survey in 2020, we gladly found that this direction now has a preliminary exploration by Sharma et al. [2021])
Non-native-to-native transfer (i.e., reformulating grammatical error correction with TST)
Sentence disambiguation, to resolve nuance in text
More Difficult Forms of Style.
Another direction is to explore more complicated forms of styles. As covered by this survey, the early work on deep learning-based TST explores relatively simple styles, such as verb tenses (Hu et al. 2017) and positive-vs.-negative Yelp reviews (Shen et al. 2017). In these tasks, each data point is one sentence with a clear, categorized style, and the entire dataset is in the same domain. Moreover, the existing datasets can decouple style and style-independent contents relatively well.
We propose that TST can potentially be extended into the following settings:
Aspect-based style transfer (e.g., transferring the sentiment on one aspect but not the other aspects on aspect-based sentiment analysis data)
Authorship transfer (which has tightly coupled style and content)
Document-level style transfer (which includes discourse planning)
Domain adaptive style transfer (which is preceded by Li et al. [2019])
Style Interwoven with Semantics.
In some cases, it can be difficult or impossible to separate attributes from meaning, namely, the subject matter or the argument that the author wants to convey. One reason is that the subject that the author is going to write about can influence the choice of writing style. For example, science fiction writing can use the first-person voice and a fancy, flowery tone when describing a place. Another reason is that many stylistic devices, such as allusion, depend on content words.
Currently, limiting the problem setting to scenarios where the attribute and semantics can be approximately separated is a simplification. For evaluation, researchers have so far let human judges decide the scores for transferred style strength and content preservation.
In future work, it will be an interesting direction to address the more challenging scenarios where the style and semantics are interwoven.
6.2 Improving the Methodology on Non-Parallel Data
Because the majority of TST research focuses on non-parallel data, we discuss here its strengths and limitations.
6.2.1 Understanding the Strengths and Limitations of Existing Methods.
To come up with improvement directions for TST methods, it is important to first investigate the strengths and limitations of existing methods. We analyze the three major streams of approaches for unsupervised TST in Table 6, including their strengths, weaknesses, and future directions.
Table 6: Strengths (+), weaknesses (−), and future directions (?) of the three major streams of unsupervised TST methods.

| Method | Strengths (+), Weaknesses (−), and Future Directions (?) |
|---|---|
| Disentanglement | (+) More profound in theoretical analysis, e.g., disentangled representation learning. (−) Difficulties of training deep generative models (VAEs, GANs) for text. (−) Hard to represent all styles as latent code. (−) Computational cost rises with the number of styles to model. |
| Prototype Editing | (+) High BLEU scores due to large word preservation. (−) The attribute marker detection step can fail if the style and semantics are confounded. (−) The target attribute retrieval by templates can fail if there are large rewrites between styles, e.g., Shakespearean English vs. modern English. (−) The target attribute retrieval step has large complexity (quadratic in the number of sentences). (−) Large computational cost if there are many styles, each of which needs a pre-trained LM for the generation step. (?) Future work can enable matching for syntactic variation. (?) Future work can use grammatical error correction to post-edit the output. |
| Pseudo-Parallel Corpus Construction | (+) Performance can approximate supervised model performance if the pseudo-parallel data are of good quality. (−) May fail for small corpora. (−) May fail if the mono-style corpora do not have many samples with similar content. (−) For IBT, divergence is possible and sometimes needs special designs to prevent it. (−) For IBT, time complexity is high (due to iterative pseudo data generation). (?) Improve the convergence of IBT. |
Challenges for Disentanglement.
Theoretically, although disentanglement is impossible without inductive biases or other forms of supervision (Locatello et al. 2019), disentanglement is achievable with some weak signals, such as only knowing how many factors have changed, but not which ones (Locatello et al. 2020).
In practice, major challenges for disentanglement-based methods include the difficulty of training deep text generative models such as VAEs and GANs, and the fact that it is not easy to represent all styles as latent codes. Moreover, when targeting multiple styles, the computational complexity increases linearly with the number of styles to model.
Challenges for Prototype Editing.
Prototype-editing approaches usually result in relatively high BLEU scores, partly because the output text largely overlaps with the input text. This line of methods is likely to perform well on tasks such as sentiment modification, for which it is easy to identify “attribute markers,” and the input and output sentences share an attribute-independent template.
However, prototype editing cannot be applied to all types of style transfer tasks. The first step, attribute marker detection, might not work if the datasets have confounded style and content, because such confounds can lead to the wrong extraction of attribute markers, such as content words or artifacts that also happen to distinguish the style-specific data.
The second step, target attribute retrieval by templates, will fail if there is too little word overlap between a sentence and its counterpart carrying another style. An example is the TST task to "Shakespearize" modern English: there is little lexical overlap between a Shakespearean sentence written in early modern English and its corresponding modern English expression. In such cases, the retrieval step is likely to fail, because there is a large number of rewrites between the two styles, and the template might be almost hollow. Moreover, this step is computationally expensive if there are a large number of sentences in the data (e.g., all Wikipedia text), since it needs to calculate the pair-wise similarity among all available sentences across style-specific corpora.
The third step, generation from prototype, requires a separate pretrained LM for each style corpus. When there are multiple styles of interest (e.g., multiple persona), this will induce a large computational cost.
The last limitation of prototype editing is that it amplifies the intrinsic problem of using BLEU to evaluate TST (Problem 1, namely, the fact that simply copying the input can result in a high BLEU score, as introduced in Section 3.1). One can argue that part of the performance gain of the retrieval-based method arises because, in practice, it copies more expressions from the input sentence than other lines of methods do.
For future study, there are many interesting directions to explore, for example, investigating the performance of existing prototype editing models on a challenging dataset that reveals the above shortcomings, proposing new models to improve this line of approaches, and developing better evaluation methods for prototype editing models.
Challenges for Pseudo-Parallel Corpus Construction.
The method of constructing pseudo-parallel data can be effective, especially when the pseudo-parallel corpora resemble supervised data. The challenge is that this approach may not work if the non-parallel corpora do not have enough samples that can be matched to create the pseudo-parallel corpora, or if IBT cannot bootstrap well or fails to converge. The time complexity of training IBT is also very high, because it iteratively generates pseudo-parallel corpora and re-trains models. Interesting future directions include reducing the computational cost, designing more effective bootstrapping, and improving the convergence of IBT.
6.2.2 Understanding the Evolution from Traditional NLG to Deep Learning Methods.
Despite the exciting methodological revolution led by deep learning recently, we are also interested in the merging point of traditional computational linguistics and deep learning techniques (Henderson 2020). Specific to the context of TST, we introduce the traditional NLG framework and its impact on current TST approaches, especially the prototype editing method.
Traditional NLG Framework.
The traditional NLG framework stages sentence generation into the following steps (Reiter and Dale 1997):
Content determination (not applicable)
Discourse planning (not applicable)
Sentence aggregation
Lexicalization
Referring expression generation
Linguistic realization
The first two steps, content determination and discourse planning, are not applicable to most datasets because the current focus of TST is sentence-level and not discourse-level.
Among Steps 3 to 6, sentence aggregation groups necessary information into a single sentence, lexicalization chooses the right word to express the concepts generated by sentence aggregation, referring expression generation produces surface linguistic forms for domain entities, and linguistic realization edits the text so that it conforms to grammar, including syntax, morphology, and orthography. This framework is widely applied to NLG tasks (e.g., Zue and Glass 2000; Mani 2001; McTear 2002; Gatt and Reiter 2009; Androutsopoulos and Malakasiotis 2010).
Re-Viewing Prototype-Based TST.
Among the approaches introduced so far, the most relevant to traditional NLG is prototype-based text editing, which was introduced in Section 5.2.
Using the language of the traditional NLG framework, the prototype-based techniques can be viewed as a combination of sentence aggregation, lexicalization, and linguistic realization. Specifically, prototype-based techniques first prepare an attribute-free sentence template, and supply it with candidate attribute markers that carry the desired attribute, both of which are sentence aggregation. Then, using language models to infill the prototype with the correct expressions corresponds to lexicalization and linguistic realization. Note that the existing TST systems do not explicitly deal with referring expression generation (e.g., generating co-references), leaving it to be handled by language models.
Meeting Point of Traditional and New Methods.
Viewing prototype-based editing as a merging point where the traditional, controllable framework meets deep learning models, we can see that it takes advantage of both the powerful deep learning models and the interpretable pipeline of traditional NLG. There are several advantages in merging traditional NLG with deep learning models. First, sentence planning–like steps make the generated content more controllable. For example, the template of the original sentence is saved, and the counterpart attributes can also be explicitly retrieved, as a preparation for the final rewriting. Such a controllable, white-box approach can be easy to tune, debug, and improve. The accuracy of attribute marker extraction, for example, is constantly improving across the literature (Sudhakar, Upadhyay, and Maheswaran 2019), and different ways to extract attribute markers can be easily fused (Wu et al. 2019). Second, sentence planning–like steps ensure the truthfulness of information. As most content words are kept and no additional information is hallucinated by the black-box neural networks, we can better ensure that the information of the attribute-transferred output is consistent with the original input.
6.2.3 Inspiration from Tasks with Similar Nature.
An additional perspective that can inspire new methodological innovation comes from other tasks that share a similar nature with TST. In this section, we introduce several closely related tasks: machine translation, data-to-text generation, neural style transfer, style-conditioned language modeling, counterfactual story rewriting, contrastive text generation, and prototype-based text editing.
Machine Translation.
The problem settings of machine translation and TST share much in common: the source and target languages in machine translation are analogous to the original and desired attributes, a and a′, respectively. The major difference is that in machine translation the source and target corpora are in completely different languages with almost disjoint vocabularies, whereas in TST the input and output are in the same language, and the model is usually encouraged to copy most content words from the input, e.g., via the BoW loss introduced in Section 5.1.3.2. Some TST works have been inspired by machine translation, such as pseudo-parallel corpus construction (Nikolov and Hahnloser 2019; Zhang et al. 2018d), and in the future there may be more interesting intersections.
Data-to-Text Generation.
Data-to-text generation is another domain that can exchange inspiration with TST. The data-to-text generation task is to generate textual descriptions from structured data such as tables (Wiseman, Shieber, and Rush 2017; Parikh et al. 2020), meaning representations (Novikova, Dusek, and Rieser 2017), or Resource Description Framework triples (Gardent et al. 2017; Ferreira et al. 2020). With the recent rise of pretrained seq2seq models for transfer learning (Raffel et al. 2020), it is common to formulate data-to-text as a seq2seq task by serializing the structured data into a sequence (Kale and Rastogi 2020; Ribeiro et al. 2020; Guo et al. 2020). Data-to-text generation can then be seen as a special form of TST from structured information to text. This potential connection has not yet been investigated, but is worth exploring.
Neural Style Transfer.
Neural style transfer originated in image style transfer (Gatys, Ecker, and Bethge 2016), and its disentanglement ideas inspired some early TST research (Shen et al. 2017). The difference between image style transfer and TST is that, for images, it is feasible to disentangle an explicit representation of the image texture as the Gram matrix of neural feature vectors, whereas for text, styles have no such explicit representation, only more abstract attributes. Beyond this difference, many aspects of style transfer research share common ground. Note that there are style transfer works across different modalities, including images (Gatys, Ecker, and Bethge 2016; Zhu et al. 2017; Chen et al. 2017b), text, voice (Gao, Singh, and Raj 2018; Qian et al. 2019; Yuan et al. 2021), handwriting (Azadi et al. 2018; Zhang and Liu 2013), and videos (Ruder, Dosovitskiy, and Brox 2016; Chen et al. 2017a). New advances in one style transfer field can often inspire another. For example, image style transfer has been used for data augmentation (Zheng et al. 2019; Jackson et al. 2019) and adversarial attack (Xu et al. 2020), but TST has not yet been applied to such usage.
Style-Conditioned Language Modeling.
Different from language modeling, which learns how to generate general natural language text, conditional language modeling learns how to generate text given a condition, such as some context or a control code (Pfaff 1979; Poplack 2000). Recent advances in conditional language models (Keskar et al. 2019; Dathathri et al. 2020) also include text generation conditioned on a style token, such as positive or negative. Possible conditions include author style (Syed et al. 2020); speaker identity, persona, and emotion (Li et al. 2016); genre; and attributes derived from text, topics, and sentiment (Ficler and Goldberg 2017). These models are currently limited to a small set of pre-defined "condition" tokens and can only generate a sentence from scratch; they are not yet able to condition on an original sentence for style rewriting. The interesting finding in this research direction is that it can make good use of a pretrained LM with only lightweight inference-time techniques to generate style-conditioned text, so such approaches may inspire future TST methods and reduce the carbon footprint of training TST models from scratch.
Counterfactual Story Rewriting.
Counterfactual story rewriting aims to learn a new event sequence in the presence of a perturbation of a previous event (i.e., a counterfactual condition) (Goodman 1947; Starr 2019). Qin et al. (2019) propose the first dataset, each sample of which takes an original five-sentence story and changes the event in the second sentence to a new, counterfactual event. The task is to generate the last three sentences of the story based on the newly altered second sentence. The criteria for counterfactual story rewriting include relevance to the first two sentences and minimal edits from the original story ending. This line of research is relatively difficult to apply directly to TST, because its motivation and dataset nature differ from general TST; more importantly, the task is conditioned not on a predefined categorized style token but on the free-form textual story beginning.
Contrastive Text Generation.
As neural network-based NLP models tend to learn spurious statistical correlations in the data rather than achieve robust understanding (Jia and Liang 2017), there is recent work on constructing auxiliary datasets composed of near-misses of the original data. For example, Gardner et al. (2020) ask crowdsource workers to rewrite the input of a task with minimal changes such that it matches a different target label. To alleviate expensive human labor, Xing et al. (2020) develop an automatic text editing approach to generate contrast sets for aspect-based sentiment analysis. The difference between contrastive text generation and TST is that the former does not require content preservation; it mainly aims to construct a slightly textually different input that changes the ground-truth output, in order to test model robustness. So the two tasks are not the same, although they have intersections that might inspire future work, such as the aspect-based style transfer suggested in Section 6.1.
Prototype-Based Text Editing.
Prototype editing is not unique to TST; it is also widely used in other NLP tasks, and knowing the new advances in prototype editing for other tasks can inspire new method innovations in TST. Guu et al. (2018) first proposed the prototype editing approach to improve language modeling by first sampling a lexically similar sentence prototype and then editing it using variational encoders and decoders. This prototype-then-edit approach can also be seen in summarization (Wang, Quan, and Wang 2019), machine translation (Cao and Xiong 2018; Wu, Wang, and Wang 2019; Gu et al. 2018; Zhang et al. 2018a; Bulté and Tezcan 2019), conversation generation (Weston, Dinan, and Miller 2018; Cai et al. 2019), code generation (Hashimoto et al. 2018), and question answering (Lewis et al. 2020). As an extension to the retrieve-and-edit steps, Hossain, Ghazvininejad, and Zettlemoyer (2020) use an ensemble approach that retrieves a set of relevant prototypes, edits them, and finally reranks to pick the best output for machine translation. Such an extension can also potentially be applied to TST.
6.3 Loosening the Style-Specific Dataset Assumptions
A common assumption for most deep learning-based TST works, as mentioned in Section 2.1, is the availability of style-specific corpora for each style of interest, either parallel or non-parallel. This assumption can potentially be loosened in two ways.
Linguistic Styles with No Matched Data.
Because there are various concerns raised by the data-driven definition of style as described in Section 2.1, a potentially good research direction is to bring back the linguistic definition of style, and thus remove some of the concerns associated with large datasets. Several methods can be a potential fit for this: prompt design (Li and Liang 2021; Qin and Eisner 2021; Scao and Rush 2021) that passes a prompt to GPT (Radford et al. 2019; Brown et al. 2020) to obtain a style-transferred text (see the sketch below); style-specific template design; or using templates to first generate synthetic data from which models can learn. Prompt design has not yet been investigated as a direction for TST research, but it is an interesting direction to explore.
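As a hedged illustration of prompt design for TST (not an approach evaluated in the surveyed papers), one could pass a rewrite instruction to a generative LM; the prompt wording below is invented, and a small model such as GPT-2 will not reliably follow it, so a larger instruction-following model would be needed in practice.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# A hypothetical zero-shot prompt for formality transfer; the wording is
# illustrative only.
prompt = (
    "Rewrite the sentence in a formal style, keeping its meaning.\n"
    "Informal: Come and sit!\n"
    "Formal:"
)
print(generator(prompt, max_new_tokens=20)[0]["generated_text"])
```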
Distinguishing Styles from a Mixed Corpus.
It might also be possible to distinguish styles from a mixed corpus with no style labels. For example, Riley et al. (2021) learn a style vector space from text; Xu, Cheung, and Cao (2020) use unsupervised representation learning to separate the style and contents from a mixed corpus of unspecified styles; and Guo et al. (2021) use cycle training with a conditional variational auto-encoder to unsupervisedly learn to express the same semantics through different styles. Theoretically, although disentanglement is impossible without inductive biases or other forms of supervision (Locatello et al. 2019), disentanglement is achievable with some weak signals, such as only knowing how many factors have changed, but not which ones (Locatello et al. 2020). A more advanced direction can be emergent styles (Kang, Wang, and de Melo 2020), since styles can be evolving, for example across dialog turns.
6.4 Improving Evaluation Metrics
There has been much attention to the problems of TST evaluation metrics and their potential improvements (Pang and Gimpel 2019; Tikhonov and Yamshchikov 2018; Mir et al. 2019; Fu et al. 2019; Pang 2019; Yamshchikov et al. 2021; Jafaritazehjani et al. 2020). Recently, Gehrmann et al. (2021) have proposed a new framework, a live environment to evaluate NLG in a principled and reproducible manner. Apart from the existing scoring methods, future work can also make use of linguistic rules, such as a checklist, to evaluate what capabilities a TST model has achieved. For example, there can be a checklist for formality transfer according to existing style guidelines, such as the APA style guide (American Psychological Association 2020). Such checklist-based evaluation can make the performance of black-box deep learning models more interpretable, and also allow for more insightful error analysis.
7. Expanding the Impact of TST
In this last section of this survey, we highlight several directions to expand the impact of TST. First, TST can be used to help other NLP tasks such as paraphrasing, data augmentation, and adversarial robustness probing (Section 7.1). Moreover, many specialized downstream tasks can be achieved with the help of TST, such as persona-consistent dialog generation, attractive headline generation, style-specific machine translation, and anonymization (Section 7.2). Last but not least, we overview the ethical impacts that are important to take into consideration for future development of TST (Section 7.3).
7.1 Connecting TST to More NLP Tasks
TST can be applied to other important NLP tasks, such as paraphrase generation, data augmentation, and adversarial robustness probing.
Paraphrase Generation.
Paraphrase generation is to express the same information in alternative ways (Madnani and Dorr 2010). The nature of paraphrasing shares a lot in common with TST, which transfers the style of text while preserving the content. One of the common forms of paraphrasing is syntactic variation, such as "X wrote Y.", "Y was written by X.", and "X is the writer of Y." (Androutsopoulos and Malakasiotis 2010). Besides syntactic variation, it also makes sense to include stylistic variation as a form of paraphrase, which means that linguistic style transfer (not the content preference transfer in Table 3) can be regarded as a subset of paraphrasing. The caution here is that if the paraphrasing is for a downstream task, researchers should first check whether the downstream task is compatible with the styles used. For example, dialog generation may be sensitive to all linguistic styles, whereas summarization can allow linguistic style-varied paraphrases in the dataset.
There are three implications of this connection between TST and paraphrase generation. First, many trained TST models, such as formality transfer and simplification models, can be borrowed for paraphrasing. Second, the method innovations proposed in the two fields can inspire each other. For example, Krishna, Wieting, and Iyyer (2020) formulate style transfer as a paraphrasing task. Third, the evaluation metrics of the two tasks can also inspire each other. For example, Yamshchikov et al. (2021) associate the semantic similarity metrics of the two tasks.
Data Augmentation.
Data augmentation generates text similar to the existing training data so that the model has more data to train on. TST is a good method for data augmentation because it can produce text with different styles but the same meaning. Image style transfer has already been used for data augmentation (Zheng et al. 2019; Jackson et al. 2019), so it will be interesting to see future work apply TST for data augmentation as well.
Adversarial Robustness Probing.
Another use of style transferred text is adversarial robustness probing. For example, styles that are task-agnostic can be used for general adversarial attack (e.g., politeness transfer to probe sentiment classification robustness) (Jin et al. 2020b), while the styles that can change the task output can be used to construct contrast sets (e.g., sentiment transfer to probe sentiment classification robustness) (Xing et al. 2020). Xu et al. (2020) apply image style transfer to adversarial attack, and future research can also explore the use of TST in the two ways suggested above.
7.2 Connecting TST to More Specialized Applications
TST can be applied not only to other NLP tasks, as introduced in the previous section, but also to specialized downstream applications. In practice, when applying NLP models, it is important to customize for specific needs, such as generating dialog with a consistent persona, writing headlines that are attractive and engaging, making machine translation models adapt to different styles, and anonymizing user identity by obfuscating the style.
Persona-Consistent Dialog Generation.
A useful downstream application of TST is persona-consistent dialog generation (Li et al. 2016; Zhang et al. 2018b; Shuster et al. 2020). Because conversational agents directly interact with users, there is a strong demand for human-like dialog generation. Previously, this has been done by encoding speaker traits into a vector and conditioning the conversation on this vector (Li et al. 2016). As future work, TST can also be used as part of the pipeline of persona-based dialog generation, where the persona can be categorized into distinctive style types, and the generated text can then be post-processed by a style transfer model.
Attractive Headline Generation.
In journalism, it is crucial to generate engaging headlines. Jin et al. (2020a) first use TST to generate eye-catching headlines with three different styles: humorous, romantic, and clickbaity. Li et al. (2021) follow this direction and propose a disentanglement-based model to generate attractive headlines for Chinese news.
Style-Specific Machine Translation.
In machine translation, it is useful to have an additional control of the style for the translated text. Commonly used styles for TST in machine translation are politeness (Sennrich, Haddow, and Birch 2016a) and formality (Niu, Martindale, and Carpuat 2017; Wu, Wang, and Liu 2020). For example, Wu, Wang, and Liu (2020) translate from informal Chinese to formal English.
Anonymization.
TST can also be used for anonymization, which is an important way to protect user privacy, especially since there are ongoing heated discussions of ethics in the AI community. Many concerns have been raised about the discriminative task of author profiling, which can mine the demographic identities of the author of a writing, even including privacy-invading properties such as gender and age (Schler et al. 2006). As a potential solution, TST can be applied to alter the text and obfuscate the real identity of the users (Reddy and Knight 2016; Gröndahl and Asokan 2020).
7.3 Ethical Implications of TST
Recently, more and more attention has been paid to the ethical concerns associated with AI research. We discuss the following two ethical considerations: (1) the social impact of TST applications, and (2) the data privacy problem of TST.
Fields that involve human subjects or directly apply to humans work under a set of core principles and guidelines (Beauchamp, Childress et al. 2001). Before initiating a research project, responsible research bodies use these principles as a ruler to judge whether the research is ethically sound. NLP research and applications that directly involve human users, including TST, are regulated by a central review board, the Institutional Review Board (IRB). We also provide several guidelines below to avoid ethical misconduct in future publications on TST.
7.3.1 Social Impact of TST Applications.
Technologies can have unintended negative consequences (Hovy and Spruit 2016). For example, TST can facilitate the automation of intelligent assistants with designed attributes, but can also be used to create fake text or fraud.
Thus, the inventors of a technology should be aware of how other people might adopt it for their own incentives. Because TST has a wide range of subtasks and applications, we examine each of them with the following two questions:
Who will benefit from such a technology?
Who will be harmed by such a technology?
Beneficial Impact.
An important direction of NLP for social good is to fight against abusive online text. TST can serve as a very helpful tool as it can be used to transfer malicious text to normal language. Shades of abusive language include hate speech, offensive language, sexist and racist language, aggression, profanity, cyberbullying, harassment, trolling, and toxic language (Waseem et al. 2017). There is also other negative text such as propaganda (Bernays 2005; Carey 1997), and others. It is widely known that malicious text is harmful to people. For example, research shows that cyberbullying victims tend to have more stress and suicidal ideation (Kowalski et al. 2014), and also detachment from family and offline victimization (Oksanen et al. 2014). There are more and more efforts put into combating toxic language, such as 30K content moderators that Facebook and Instagram employ (Harrison 2019). Therefore, the automatic malicious-to-normal language transfer can be a helpful intelligent assistant to address such needs. Apart from purifying malicious text on social media, it can also be used on social chatbots to make sure there is no bad content in the language they generate (Roller et al. 2021).
Neutral Impact.
Most TST tasks are neutral. For example, informal-to-formal transfer can be used as a writing assistant to help make writing more professional, and formal-to-informal transfer can tune the tone of bots to be more casual. Most applications to customize the persona of bots are also neutral with regard to their societal impact.
Dual Use.
Besides positive and neutral applications, there are, unfortunately, several TST tasks that are double-edged swords. Take, for example, one of the most popular TST tasks, sentiment modification: although it can be used to change intelligent assistants or robots from a negative to a positive mood (which is unlikely to harm any parties), the vast majority of research applies this technology to manipulate the polarity of reviews, such as Yelp (Shen et al. 2017) and Amazon reviews (He and McAuley 2016). This leads to a setting where a negative restaurant review is changed to a positive comment, or vice versa, with debatable ethics. Such a technique can be used by a commercial body to polish its own reviews or to harm the reputation of its competitors. Once deployed, it would automatically manipulate online text to carry the polarity that the model owner desires. Hence, we suggest the research community raise serious concerns against the review sentiment modification task.
Another task, political slant transfer, may raise concerns in some specific contexts. For example, social bots (i.e., autonomous bots on social media, such as Twitter and Facebook bots) are a big problem in the United States, having even played a significant role in the 2016 U.S. presidential election (Bessi and Ferrara 2016; Shao et al. 2018). It is reported that at least 400,000 bots were responsible for about 19% of the total Tweets. Social bots usually aim to advocate certain ideas, support campaigns, or aggregate other sources, either by acting as "followers" and/or by gathering followers themselves. So the political slant transfer task, which transfers the tone and content between Republican and Democratic comments, is highly sensitive and may face the risk of being used by social bots to manipulate the political views of the masses.
More arguable is male-to-female tone transfer, which can potentially be used for identity deception: a cheater can create an online account and pretend to be an attractive young woman. The reversed direction, female-to-male tone transfer, can be used for applications such as authorship obfuscation (Shetty, Schiele, and Fritz 2018), which anonymizes author attributes, for example, hiding the gender of a female author by re-synthesizing the text with male textual attributes.
7.3.2 Data Privacy Issues for TST.
Another ethical concern is the use of data in research practice. Researchers should not overmine user data, such as demographic identities. Such data privacy concerns exist widely in the data science community as a whole, and there have been many ethics discussions (Tse et al. 2015; Russell, Dewey, and Tegmark 2015).
The TST task needs data containing some attributes along with the text content. Although it is acceptable to use ratings of reviews that are classified as positive or negative, user attributes are sensitive, including the gender of the user’s account (Prabhumoye et al. 2018), and age (Lample et al. 2019). The collection and potential use of such sensitive user attributes can have implications that need to be carefully considered.
8. Conclusion
This article presented a comprehensive review of TST with deep learning methods. We have surveyed recent research efforts in TST and developed schemes to categorize and distill the existing literature. This survey has covered the task formulation, evaluation metrics, and methods on parallel and non-parallel data. We also discussed several important topics in the research agenda of TST, and how to expand the impact of TST to other tasks and applications, including ethical considerations. This survey provides a reference for future researchers working on TST.
Acknowledgments
We thank Qipeng Guo for his insightful discussions and the anonymous reviewers for their constructive suggestions.
Notes
Our curated paper list is at https://github.com/zhijing-jin/Text_Style_Transfer_Survey.
Note that we interchangeably use the terms style and attribute in this survey. Attribute is a broader terminology that can include content preferences, e.g., sentiment, topic, and so on. This survey uses style in the same broad way, following the common practice in recent papers (see Section 2.1).
GYAFC data: https://github.com/raosudha89/GYAFC-corpus.
XFORMAL data: https://github.com/Elbria/xformal-FoST.
Politeness data: https://github.com/tag-and-generate/politeness-dataset.
The Yelp Gender dataset is from the Yelp Challenge https://www.yelp.com/dataset and its preprocessing needs to follow Prabhumoye et al. (2018).
Wiki Neutrality data: http://bit.ly/bias-corpus.
Bible data: https://github.com/keithecarlson/StyleTransferBibleData.
MIMIC-III data: Request access at https://mimic.physionet.org/gettingstarted/access/ and follow the preprocessing of Weng, Chung, and Szolovits (2019).
MSD data: https://srhthu.github.io/expertise-style-transfer/.
Math data: https://gitlab.cs.washington.edu/kedzior/Rewriter/.
TitleStylist data: https://github.com/jind11/TitleStylist.
Yelp data: https://github.com/shentianxiao/language-style-transfer.
Yahoo! Answers data: https://webscope.sandbox.yahoo.com/catalog.php?datatype=l&did=11.
Political data: https://nlp.stanford.edu/robvoigt/rtgender/.
Note that this style classifier usually reports 80+% or 90+% accuracy, and we will discuss the problem of false positives and false negatives in the last paragraph of this section.
References
Author notes
Equal contribution.