Abstract
Given a sentence (e.g., “I like mangoes”) and a constraint (e.g., sentiment flip), the goal of controlled text generation is to produce a sentence that adapts the input sentence to meet the requirements of the constraint (e.g., “I hate mangoes”). Going beyond such simple constraints, recent work has started exploring the incorporation of complex syntactic guidance as constraints in the task of controlled paraphrase generation. In these methods, syntactic guidance is sourced from a separate exemplar sentence. However, these prior works have only utilized limited syntactic information available in the parse tree of the exemplar sentence. We address this limitation in the paper and propose Syntax Guided Controlled Paraphraser (Sgcp), an end-to-end framework for syntactic paraphrase generation. We find that Sgcp can generate syntax-conforming sentences while not compromising on relevance. We perform extensive automated and human evaluations over multiple real-world English language datasets to demonstrate the efficacy of Sgcp over state-of-the-art baselines. To drive future research, we have made Sgcp’s source code available.
1 Introduction
Controlled text generation is the task of producing a sequence of coherent words based on given constraints. These constraints can range from simple attributes like tense, sentiment polarity, and word-reordering (Hu et al., 2017; Shen et al., 2017; Yang et al., 2018) to more complex syntactic information. For example, given a sentence “The movie is awful!” and a simple constraint like flip sentiment to positive, a controlled text generator is expected to produce the sentence “The movie is fantastic!”.
These constraints are important not only in providing information about what to say but also about how to say it. Without any constraint, the ubiquitous sequence-to-sequence neural models often tend to produce degenerate outputs and favor generic utterances (Vinyals and Le, 2015; Li et al., 2016). Although simple attributes are helpful in addressing what to say, they provide very little information about how to say it. Syntactic control over generation helps in filling this gap by providing that missing information.
Incorporating complex syntactic information has shown promising results in neural machine translation (Stahlberg et al., 2016; Aharoni and Goldberg, 2017; Yang et al., 2019), data-to-text generation (Peng et al., 2019), abstractive text-summarization (Cao et al., 2018), and adversarial text generation (Iyyer et al., 2018). Additionally, recent work (Iyyer et al., 2018; Kumar et al., 2019) has shown that augmenting lexical and syntactical variations in the training set can help in building better performing and more robust models.
In this paper, we focus on the task of syntactically controlled paraphrase generation, that is, given an input sentence and a syntactic exemplar, produce a sentence that conforms to the syntax of the exemplar while retaining the meaning of the original input sentence. While syntactically controlled generation of paraphrases finds applications in multiple domains like data augmentation and text passivization, we highlight its importance in the particular task of text simplification. As pointed out in Siddharthan (2014), depending on the literacy skill of an individual, certain syntactic forms of English sentences are easier to comprehend than others. As an example, consider the following two sentences:
- S1: Because it is raining today, you should carry an umbrella.
- S2: You should carry an umbrella today, because it is raining.
Connectives that permit pre-posed adverbial clauses have been found to be difficult for third to fifth grade readers, even when the order of mention coincides with the causal (and temporal) order (Anderson and Davison, 1986; Levy, 2003). Hence, they prefer sentence S2. However, various other studies (Clark and Clark, 1968; Katz and Brent, 1968; Irwin, 1980) have suggested that for older school children, college students, and adults, comprehension is better for the cause-effect presentation, hence sentence S1. Thus, modifying a sentence, syntactically, would help in better comprehension based on literacy skills.
Prior work in syntactically controlled paraphrase generation addressed this task by conditioning the semantic input either on features learned from a linearized constituency-based parse tree (Iyyer et al., 2018) or on latent syntactic information (Chen et al., 2019a) learned from exemplars through variational auto-encoders. Linearizing parse trees typically results in loss of essential dependency information. On the other hand, as noted in Shi et al. (2016), an autoencoder-based approach might not offer syntactic information as rich as that guaranteed by actual constituency parse trees. Moreover, as noted in Chen et al. (2019a), Scpn (Iyyer et al., 2018) and Cgen (Chen et al., 2019a) tend to generate sentences of the same length as the exemplar. This is an undesirable characteristic because it often results in sentences that end abruptly, thereby compromising on grammaticality and semantics. Please see Table 1 for sample generations from each of the models.
Table 1: Sample generations from each model.

Source | how do i predict the stock market ?
Exemplar | can a brain transplant be done ?
Scpn | how can the stock and start ?
Cgen | can the stock market actually happen ?
Sgcp (Ours) | can i predict the stock market ?
Source | what are some of the mobile apps you ca n’t live without and why ?
Exemplar | which is the best resume you have come across ?
Scpn | what are the best ways to lose weight ?
Cgen | which is the best mobile app you ca n’t ?
Sgcp (Ours) | which is the best app you ca n’t live without and why ?
To address these gaps, we propose Syntax Guided Controlled Paraphraser (Sgcp), which uses the full syntactic tree of the exemplar. Additionally, our model provides an easy mechanism to incorporate different levels of syntactic control (granularity) based on the height of the tree being considered. The decoder in our framework is augmented with rich enough syntactic information to produce syntax-conforming sentences without losing out on semantics and grammaticality.
The main contributions of this work are as follows:
- We propose Sgcp, an end-to-end model to generate syntactically controlled paraphrases at different levels of granularity using a parsed exemplar.
- We provide a new decoding mechanism to incorporate syntactic information from the exemplar sentence’s syntactic parse.
- We provide a dataset formed from Quora Question Pairs for evaluating the models. We also perform extensive experiments to demonstrate the efficacy of our model using multiple automated metrics as well as human evaluations.
2 Related Work
Controllable Text Generation.
This is an important problem in NLP that has received significant attention in recent times. Prior work includes generating text using models conditioned on attributes like formality, sentiment, or tense (Hu et al., 2017; Shen et al., 2017; Yang et al., 2018), as well as on syntactic templates (Iyyer et al., 2018; Chen et al., 2019a). These systems find applications in adversarial sample generation (Iyyer et al., 2018), text summarization, and table-to-text generation (Peng et al., 2019). While achieving state-of-the-art results in their respective domains, these systems typically rely on a known, finite set of attributes, making them quite restrictive in terms of the styles they can offer.
Paraphrase Generation.
While generation of paraphrases has been addressed in the past using traditional methods (McKeown, 1983; Barzilay and Lee, 2003; Quirk et al., 2004; Hassan et al., 2007; Zhao et al., 2008; Madnani and Dorr, 2010; Wubben et al., 2010), these have recently been superseded by deep learning-based approaches (Prakash et al., 2016; Gupta et al., 2018; Li et al., 2019; Kumar et al., 2019). The primary aim of these methods (Prakash et al., 2016; Gupta et al., 2018; Li et al., 2018) is to generate the most semantically similar sentence, and they typically rely on beam search to obtain any kind of lexical diversity. Kumar et al. (2019) try to tackle the problem of achieving lexical, and limited syntactic, diversity using submodular optimization, but do not provide any syntactic control over the type of utterance that might be desired. These methods are therefore restrictive in terms of the syntactic diversity they can offer.
Controlled Paraphrase Generation.
Our task is similar in spirit to Iyyer et al. (2018) and Chen et al. (2019a), which also deal with syntactic paraphrase generation. However, their approach differs from ours in at least two aspects. First, Scpn (Iyyer et al., 2018) uses an attention-based (Bahdanau et al., 2014) pointer-generator network (See et al., 2017) to encode input sentences and a linearized constituency tree to produce paraphrases. Because of the linearization of the syntactic tree, considerable dependency-based information is generally lost. Our model, instead, directly encodes the tree structure to produce a paraphrase. Second, the inference (or generation) process in Scpn is computationally very expensive because it involves a two-stage generation process: full parse trees are first generated from incomplete templates, and the final generations are then produced from those full parse trees. In contrast, inference in our method is a single-stage process, wherein the model takes as input a semantic source, a syntactic tree, and the level of syntactic style that needs to be transferred, and directly produces the generations. Additionally, we observed that Scpn does not perform well in low-resource settings. This, again, can be attributed to the compounding implicit noise during training due to linearized trees and to the generation of full linearized trees before obtaining the final paraphrases.
Chen et al. (2019a) propose a syntactic exemplar-based method for controlled paraphrase generation using an approach based on latent variable probabilistic modeling, neural variational inference, and multi-task learning. This, in principle, is very similar to Chen et al. (2019b). As opposed to our model, which provides different levels of syntactic control of the exemplar-based generation, this approach is restrictive in terms of the flexibility it can offer. Also, as noted in Shi et al. (2016), an autoencoder-based approach might not offer rich enough syntactic information as offered by actual constituency parse trees. Additionally, VAEs (Kingma and Welling, 2014) are generally unstable and harder to train (Bowman et al., 2016; Gupta et al., 2018) than seq2seq-based approaches.
3 Sgcp: Proposed Method
3.1 Inputs
Given an input sentence X and a syntactic exemplar Y, our goal is to generate a sentence Z that conforms to the syntax of Y while retaining the meaning of X.
The semantic encoder (Section 3.2) works on the sequence of input tokens, while the syntactic encoder (Section 3.3) operates on constituency-based parse trees. We parse the syntactic exemplar Y to obtain its constituency-based parse tree. The leaf nodes of this parse tree consist of the tokens of the sentence Y. These tokens carry the semantic information of sentence Y, which we do not need for generating paraphrases. In order to prevent any meaning propagation from the exemplar sentence Y into the generation, we remove these leaf/terminal nodes from its constituency parse. We denote the resulting tree by 𝒯.
The syntactic encoder additionally takes as input H, which governs the level of syntactic control to be induced. The utility of H is described in Section 3.3.
3.2 Semantic Encoder
3.3 Syntactic Encoder
This encoder provides the necessary syntactic guidance for the generation of paraphrases. Formally, let the constituency tree be 𝒯 = (V, E, L), where V is the set of nodes, E the set of edges, and L the labels associated with each node.
As indicated in Section 3.1, the syntactic encoder takes as input the height H, which governs the level of syntactic control. We randomly prune the tree to height H ∈ {3,…,Hmax}, where Hmax is the height of the full constituency tree 𝒯. As an example, in Figure 2b, we prune the constituency-based parse tree of the exemplar sentence to height H = 3. The leaf nodes of this pruned tree have the labels WP, VBZ, NP, and <DOT>. Although we calculate the hidden-state representation of all the nodes, only the terminal nodes are responsible for providing the syntactic signal to the decoder (Section 3.4).
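As an illustration of this pruning step, the following is a minimal sketch that cuts an NLTK-style constituency parse (as produced by CoreNLP) at a given height and reads off the frontier labels that would act as terminal nodes; `prune_to_height`, the depth convention, and the example bracketing are our own illustrative choices, not the released implementation.

```python
from nltk import Tree

def prune_to_height(tree, h):
    """Return a copy of `tree` cut at height `h`; word-level leaves are always
    dropped, so the frontier of the pruned tree carries only syntactic labels."""
    def recurse(node, depth):
        # Stop expanding past the requested height; word tokens (plain strings)
        # are never kept, mirroring the removal of terminal tokens above.
        if depth >= h or not isinstance(node, Tree):
            return None
        kept = [c for c in (recurse(child, depth + 1) for child in node) if c is not None]
        return Tree(node.label(), kept)
    return recurse(tree, 1)

# Illustrative exemplar parse for "what is a species ?"
exemplar = Tree.fromstring(
    "(ROOT (SBARQ (WHNP (WP what)) (SQ (VBZ is) (NP (DT a) (NN species))) (. ?)))")
pruned = prune_to_height(exemplar, 4)
frontier = [t.label() for t in pruned.subtrees(lambda t: len(t) == 0)]
print(frontier)  # frontier labels that would provide the syntactic signal
```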
3.4 Syntactic Paraphrase Decoder
In the subsequent sections, we use t to denote the decoder time step.
3.4.1 Using Semantic Information
3.4.2 Using Syntactic Information
During training, each terminal node in the tree 𝒯, pruned at height H, is equipped with information about the span of words it needs to generate. At each time step t, only one terminal-node representation is responsible for providing the syntactic signal. The hidden-state representation to be used is governed by a signalling vector a, where each ai ∈ {0,1}: a value of 0 indicates that the decoder should keep using the hidden representation currently in use, while 1 indicates that the next element (hidden representation) in the queue should be used for decoding.
In other words, ai = 1 signals that an element should be popped from the queue, while ai = 0 signals that the last popped element should continue to be used. This element is then used to guide the decoder syntactically by providing a signal in the form of its hidden-state representation (Equation 8).
Specifically, in this example, a1 = 1 signals a pop from the queue to provide syntactic guidance to the decoder for generating the first token; a2 = 1 signals another pop to guide the second token; and a3 = 1 retrieves the next representation to guide generation of the third token. As described earlier, a4,…,a8 = 0 indicate that the same representation should be used for syntactically guiding tokens z4,…,z8. Finally, a9 = 1 retrieves the next representation to guide the decoder in generating token z9.
Although a is provided to the model during training, this information is not available during inference. Moreover, providing a during generation makes the model restrictive and might result in ungrammatical sentences. Sgcp is therefore tasked with learning a proxy for the signalling vector a, using a transition probability vector p.
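The queue mechanism described above can be summarized with a small, self-contained sketch; the function name, the use of strings in place of GRU hidden states, and the toy signalling vector are ours, and at inference time the pops would be driven by the learned transition probabilities p rather than a gold vector a.

```python
from collections import deque

def syntactic_signals(leaf_states, a):
    """Map the signalling vector `a` onto a per-time-step syntactic signal.

    `leaf_states` are the terminal-node representations of the pruned exemplar
    tree, ordered left to right. a[t] = 1 pops the next state from the queue;
    a[t] = 0 keeps reusing the last popped state."""
    queue = deque(leaf_states)
    current, signals = None, []
    for a_t in a:
        if a_t == 1:              # move on to the next terminal node
            current = queue.popleft()
        signals.append(current)   # reused while subsequent a_t == 0
    return signals

# Toy example: three terminal-node states guiding a six-token generation.
states = ["h_leaf_1", "h_leaf_2", "h_leaf_3"]
print(syntactic_signals(states, [1, 0, 1, 0, 0, 1]))
# ['h_leaf_1', 'h_leaf_1', 'h_leaf_2', 'h_leaf_2', 'h_leaf_2', 'h_leaf_3']
```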
3.4.3 Overall
4 Experiments
Our experiments are geared towards answering the following questions:
- Q1. Is Sgcp able to generate syntax-conforming sentences without losing out on meaning? (Sections 5.1, 5.4)
- Q2. What level of syntactic control does Sgcp offer? (Sections 5.2, 5.3)
- Q3. How does Sgcp compare against prior models, qualitatively? (Section 5.4)
- Q4. Are the improvements achieved by Sgcp statistically significant? (Section 5.1)
Based on these questions, we outline the methods compared (Section 4.1), the datasets used (Section 4.2), the evaluation criteria (Section 4.3), and the experimental setup (Section 4.4).
4.1 Methods Compared
As in Chen et al. (2019a), we first highlight the results of the two direct return-input baselines.
- 1. Source-as-Output: Baseline where the output is the semantic input.
- 2. Exemplar-as-Output: Baseline where the output is the syntactic exemplar.
We compare the following competitive methods:
- 3. Scpn (Iyyer et al., 2018) is a sequence-to-sequence based model comprising two encoders built with LSTMs (Hochreiter and Schmidhuber, 1997) to encode semantics and syntax, respectively. Once the encoding is obtained, it serves as input to an LSTM-based decoder, which is augmented with soft attention (Bahdanau et al., 2014) over the encoded states as well as a copying mechanism (See et al., 2017) to deal with out-of-vocabulary tokens.
- 4. Cgen (Chen et al., 2019a) is a VAE (Kingma and Welling, 2014) model with two encoders to project the semantic and syntactic inputs to a latent space. They obtain a syntactic embedding from one encoder, using a standard Gaussian prior. To obtain the semantic representation, they use a von Mises-Fisher prior, which can be thought of as a Gaussian distribution on a hypersphere. They train the model using a multi-task paradigm, incorporating a paraphrase generation loss and a word position loss. We considered their best model, VGVAE + LC + WN + WPL, which incorporates the above objectives.
- 5. Sgcp (Section 3) is a sequence-and-tree-to-sequence based model that encodes semantics and tree-level syntax to produce paraphrases. It uses a GRU-based (Chung et al., 2014) decoder with soft attention over the semantic encodings and a begin-of-phrase (bop) gate to select a leaf node in the exemplar syntax tree. We compare the following two variants of Sgcp:
(a) Sgcp-F: Uses the full constituency parse tree of the exemplar for generating paraphrases.
(b) Sgcp-R: Sgcp can produce multiple paraphrases by pruning the exemplar tree at various heights. This variant first generates five candidate generations, corresponding to five different heights of the exemplar tree, namely {Hmax, Hmax−1, Hmax−2, Hmax−3, Hmax−4}, for each (source, exemplar) pair. From these candidates, the one with the highest ROUGE-1 score with the source sentence is selected as the final generation (a sketch of this selection step is given below).
Note that, except for the return-input baselines, all methods use beam search during inference.
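The candidate-selection step of Sgcp-R can be sketched as follows using the `rouge_score` package; the function name and the example candidates are illustrative, and the choice of this particular ROUGE implementation is our assumption.

```python
from rouge_score import rouge_scorer

def select_sgcp_r(source, candidates):
    """Pick the candidate paraphrase with the highest ROUGE-1 against the source.

    `candidates` would hold the generations obtained by pruning the exemplar
    tree at heights Hmax, Hmax-1, ..., Hmax-4."""
    scorer = rouge_scorer.RougeScorer(["rouge1"])
    return max(candidates, key=lambda c: scorer.score(source, c)["rouge1"].fmeasure)

source = "how do i predict the stock market ?"
candidates = [
    "can i predict the stock market ?",
    "can the stock market be predicted ?",
    "can a prediction be done ?",
]
print(select_sgcp_r(source, candidates))  # -> "can i predict the stock market ?"
```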
4.2 Datasets
We train the models and evaluate them on the following datasets:
(1) ParaNMT-small (Chen et al., 2019a) contains 500K sentence-paraphrase pairs for training and 1,300 manually labeled sentence-exemplar-reference triples, which are further split into 800 test and 500 dev data points.
As in Chen et al. (2019a), our model uses only the (sentence, paraphrase) pairs during training; the paraphrase itself serves as the exemplar input.
This dataset is a subset of the original ParaNMT-50M dataset (Wieting and Gimpel, 2018). ParaNMT-50M was generated automatically through backtranslation of original English sentences. It is inherently noisy because of imperfect neural machine translation quality, with many sentences being non-grammatical and some even being non-English. Because of such noisy data points, it is optimistic to assume that the corresponding constituency parse trees would be well aligned. To that end, we propose to use the following additional dataset, which is better formed and involves more human intervention than the ParaNMT-50M dataset.
(2) QQP-Pos: The original Quora Question Pairs (QQP) dataset contains about 400K sentence pairs labeled positive if they are duplicates of each other and negative otherwise. It is composed of about 150K positive and 250K negative pairs. We select those positive pairs in which both sentences have at most 30 tokens, leaving us with ∼146K pairs. We call this dataset QQP-Pos.
Similar to ParaNMT-small, we use only the sentence-paraphrase pairs as the training set and sentence-exemplar-reference triples for testing and validation. We randomly choose 140K sentence-paraphrase pairs as the training set, and the remaining 6K pairs are used to form the evaluation set. The sentences occurring in these evaluation pairs additionally serve as the pool from which exemplar candidates are drawn; note that this pool is a set of sentences, whereas the evaluation set is a set of sentence-paraphrase pairs.
For selecting an exemplar for each sentence-paraphrase pair (X, Z) in the evaluation set, we adopt the following procedure (a code sketch follows the list):
- Step 1: For a given pair (X, Z), construct an exemplar candidate set ℂ, with |ℂ| ≈ 12,000.
- Step 2: Retain only those sentences C ∈ ℂ whose sentence length (number of tokens) differs by at most two from that of the paraphrase Z. This is done since sentences with similar constituency-based parse tree structures tend to have similar token lengths.
- Step 3: Remove those candidates C ∈ ℂ that are very similar to the source sentence X, that is, BLEU(X, C) > 0.6.
- Step 4: From the remaining instances in ℂ, choose as the exemplar Y the sentence C with the least tree-edit distance to the paraphrase Z of the selected pair, namely, Y = arg min over C ∈ ℂ of TED(C, Z). This ensures that the constituency-based parse tree of the exemplar Y is quite similar to that of Z in terms of tree-edit distance.
- Step 5: Add the resulting triple (X, Y, Z) to the final evaluation set.
- Step 6: Repeat the procedure for all other pairs in the evaluation set.
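A compact sketch of Steps 2 to 4 is given below; `tree_edit_distance` and `parse` are stand-ins for a Zhang-Shasha TED implementation (see the TED sketch in Section 4.3) and a constituency parser, and the exact BLEU variant used for the 0.6 threshold is not specified, so NLTK's sentence-level BLEU is only an assumption.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def select_exemplar(X, Z, candidate_pool, parse, tree_edit_distance):
    """Choose an exemplar Y for a (source X, paraphrase Z) pair (Steps 2-4)."""
    smooth = SmoothingFunction().method1
    kept = []
    for C in candidate_pool:
        # Step 2: candidate and paraphrase lengths differ by at most two tokens.
        if abs(len(C.split()) - len(Z.split())) > 2:
            continue
        # Step 3: drop candidates too similar to the source (BLEU(X, C) > 0.6).
        if sentence_bleu([X.split()], C.split(), smoothing_function=smooth) > 0.6:
            continue
        kept.append(C)
    # Step 4: among the survivors, pick the one whose parse is closest to Z's.
    return min(kept, key=lambda C: tree_edit_distance(parse(C), parse(Z)))
```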
From the obtained evaluation set, we randomly choose 3K triples for the test set and the remaining 3K for the validation set.
4.3 Evaluation
It should be noted that there is no single fully reliable metric for evaluating syntactic paraphrase generation. Therefore, we evaluate on the following metrics to showcase the efficacy of syntactic paraphrasing models.
Automated Evaluation.
(i) Alignment based metrics: We compute BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), ROUGE-1, ROUGE-2, and ROUGE-L (Lin, 2004) scores between the generated and the reference paraphrases in the test set.
(ii) Syntactic Transfer: We evaluate the syntactic transfer using Tree-edit distance (Zhang and Shasha, 1989) between the parse trees of:
- (a) the generated paraphrase and the syntactic exemplar in the test set (TED-E), and
- (b) the generated paraphrase and the reference paraphrase in the test set (TED-R).
(A short sketch of the TED computation is given below.)
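As one concrete way to compute these scores, the `zss` package implements the Zhang-Shasha tree-edit distance; the conversion helper, the toy bracketings, and the choice to drop word-level leaves before comparison are our illustrative assumptions.

```python
from nltk import Tree
from zss import Node, simple_distance

def to_zss(t):
    """Convert an NLTK constituency tree into a zss Node, ignoring word leaves."""
    node = Node(t.label())
    for child in t:
        if isinstance(child, Tree):
            node.addkid(to_zss(child))
    return node

generated = Tree.fromstring("(ROOT (SQ (MD can) (NP (PRP i)) (VP (VB predict))))")
exemplar = Tree.fromstring("(ROOT (SQ (MD can) (NP (DT a) (NN transplant)) (VP (VB be))))")
print(simple_distance(to_zss(generated), to_zss(exemplar)))  # TED-E for this toy pair
```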
(iii) Model-based evaluation: Because our goal is to generate paraphrases of the input sentences, we need some measure to determine whether the generations indeed convey the same meaning as the original text. To achieve this, we adopt a model-based evaluation metric as used by Shen et al. (2017) for text style transfer and Isola et al. (2017) for image transfer. Specifically, classifiers are trained on the task of paraphrase detection and then used as oracles to evaluate the generations of our model and the baselines. We fine-tune two RoBERTa (Liu et al., 2019) based sentence-pair classifiers, one on Quora Question Pairs (Classifier-1) and the other on ParaNMT + PAWS (Classifier-2), which achieve accuracies of 90.2% and 94.0% on their respective test sets.
Once trained, we use Classifier-1 to evaluate generations on QQP-Pos and Classifier-2 on ParaNMT-small.
We first generate syntactic paraphrases using all the models (Section 4.1) on the test splits of the QQP-Pos and ParaNMT-small datasets. We then pair each source sentence with its corresponding generated paraphrase and feed the pair to the classifier. The Paraphrase Detection Score (PDS in Table 2) is defined as the ratio of the number of generations predicted to be paraphrases of their corresponding source sentences by the classifier to the total number of generations.
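A minimal sketch of this oracle evaluation using the Hugging Face transformers API is shown below; the checkpoint path is a placeholder for the fine-tuned RoBERTa classifier described above, and the label index for the "paraphrase" class is an assumption.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder path for a RoBERTa sentence-pair classifier fine-tuned on QQP
# (Classifier-1 above); ParaNMT-small generations would use Classifier-2 instead.
CKPT = "path/to/finetuned-roberta-qqp"
tokenizer = AutoTokenizer.from_pretrained(CKPT)
classifier = AutoModelForSequenceClassification.from_pretrained(CKPT).eval()

def paraphrase_detection_score(sources, generations, paraphrase_label=1):
    """PDS = fraction of generations the oracle labels as paraphrases of their source."""
    hits = 0
    with torch.no_grad():
        for src, gen in zip(sources, generations):
            enc = tokenizer(src, gen, return_tensors="pt", truncation=True)
            pred = classifier(**enc).logits.argmax(dim=-1).item()
            hits += int(pred == paraphrase_label)
    return hits / len(generations)
```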
Human Evaluation.
Although TED is sufficient to highlight syntactic transfer, there has been some scepticism regarding automated metrics for paraphrase quality (Reiter, 2018). To address this issue, we perform human evaluation on 100 randomly selected data points from the test set. In the evaluation, three judges (non-researchers proficient in the English language) were asked to assign scores to generated sentences based on the semantic similarity with the given source sentence. The annotators were shown a source sentence and the corresponding outputs of the systems in random order. The scores ranged from 1 (doesn’t capture meaning at all) to 4 (perfectly captures the meaning of the source sentence).
Table 2: Results on QQP-Pos (top) and ParaNMT-small (bottom).

QQP-Pos

| Model | BLEU ↑ | METEOR ↑ | ROUGE-1 ↑ | ROUGE-2 ↑ | ROUGE-L ↑ | TED-R ↓ | TED-E ↓ | PDS ↑ |
|---|---|---|---|---|---|---|---|---|
| Source-as-Output | 17.2 | 31.1 | 51.9 | 26.2 | 52.9 | 16.2 | 16.6 | 99.8 |
| Exemplar-as-Output | 16.8 | 17.6 | 38.2 | 20.5 | 43.2 | 4.8 | 0.0 | 10.7 |
| Scpn (Iyyer et al., 2018) | 15.6 | 19.6 | 40.6 | 20.5 | 44.6 | 9.1 | 8.0 | 27.0 |
| Cgen (Chen et al., 2019a) | 34.9 | 37.4 | 62.6 | 42.7 | 65.4 | 6.7 | 6.0 | 65.4 |
| Sgcp-F | 36.7 | 39.8 | 66.9 | 45.0 | 69.6 | 4.8 | 1.8 | 75.0 |
| Sgcp-R | 38.0 | 41.3 | 68.1 | 45.7 | 70.2 | 6.8 | 5.9 | 87.7 |

ParaNMT-small

| Model | BLEU ↑ | METEOR ↑ | ROUGE-1 ↑ | ROUGE-2 ↑ | ROUGE-L ↑ | TED-R ↓ | TED-E ↓ | PDS ↑ |
|---|---|---|---|---|---|---|---|---|
| Source-as-Output | 18.5 | 28.8 | 50.6 | 23.1 | 47.7 | 12.0 | 13.0 | 99.0 |
| Exemplar-as-Output | 3.3 | 12.1 | 24.4 | 7.5 | 29.1 | 5.9 | 0.0 | 14.0 |
| Scpn (Iyyer et al., 2018) | 6.4 | 14.6 | 30.3 | 11.2 | 34.6 | 6.2 | 1.4 | 15.4 |
| Cgen (Chen et al., 2019a) | 13.6 | 24.8 | 44.8 | 21.0 | 48.3 | 6.7 | 3.3 | 70.2 |
| Sgcp-F | 15.3 | 25.9 | 46.6 | 21.8 | 49.7 | 6.1 | 1.4 | 76.6 |
| Sgcp-R | 16.4 | 27.2 | 49.6 | 22.9 | 50.5 | 8.7 | 7.0 | 83.5 |
4.4 Setup
(a) Pre-processing. Because our model needs access to constituency parse trees, we tokenize and parse all our data points using the fully parallelizable Stanford CoreNLP parser (Manning et al., 2014). This is done prior to training to avoid the additional computational cost of repeatedly parsing the same data points across epochs.
(b) Implementation Details. We train both our models using the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of 7e-5. We use a bidirectional three-layered GRU for encoding the tokenized semantic input and a standard pointer-generator network with a GRU for decoding. The token embeddings are learnable, with dimension 300. To reduce the training complexity of the model, the maximum sequence length is kept at 60. The vocabulary size is kept at 24K for QQP-Pos and 50K for ParaNMT-small.
Sgcp needs access to the level of syntactic granularity used for decoding, depicted as H in Figure 2. During training, we vary H randomly between 3 and Hmax, changing it with each training epoch. The implicit regularization attained through this procedure helps the model generalize. At each time step of the decoding process, we use a teacher forcing ratio of 0.9.
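The schedule just described can be summarized in a small sketch; whether H is resampled once per epoch or per example, and the helper names below, are our interpretation rather than details taken from the released code.

```python
import random

H_MIN = 3

def granularity_for_epoch(h_max):
    """Sample the syntactic granularity H in {3, ..., Hmax} for this epoch."""
    return random.randint(H_MIN, h_max)

def feed_gold_token(teacher_forcing_ratio=0.9):
    """Teacher forcing: at each decoding step, feed the gold token with prob. 0.9."""
    return random.random() < teacher_forcing_ratio

for epoch in range(3):  # toy schedule; Hmax = 8 is just an example tree height
    print("epoch", epoch, "-> H =", granularity_for_epoch(h_max=8))
```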
5 Results
5.1 Semantic Preservation and Syntactic Transfer
1. Automated Metrics: As can be observed in Table 2, our methods (Sgcp-F/R, Section 4.1) outperform the existing baselines on both datasets. Source-as-Output is independent of the exemplar sentence being used, and since a sentence is a paraphrase of itself, its paraphrastic scores are generally high while its syntactic scores are below par. The opposite is true for Exemplar-as-Output. These baselines also serve as dataset quality indicators: the source is semantically similar to, but syntactically different from, the target sentence, whereas the opposite holds when the exemplar is compared to the target. Additionally, source sentences are both syntactically and semantically different from exemplar sentences, as can be observed from the TED-E and PDS scores. This shows that the dataset has rich enough syntactic diversity to learn from.
The TED-E scores show that Sgcp-F adheres to the syntax of the exemplar template to a much larger degree than the baseline models. This verifies that our model is able to generate meaning-preserving sentences that conform to the syntax of the exemplars when measured using standard metrics.
It can also be seen that Sgcp-R tends to perform better than Sgcp-F in terms of paraphrastic scores while taking a hit on the syntactic scores. This makes sense, intuitively, because in some cases Sgcp-R tends to select lower H values for syntactic granularity. This can also be observed from the example given in Table 6 where H = 6 is more favorable than H = 7, because of better meaning retention.
Although Cgen performs close to our model in terms of BLEU, ROUGE, and METEOR scores on the ParaNMT-small dataset, its PDS is still much lower than that of our model, suggesting that our model is better at capturing the original meaning of the source sentence. In order to show that the results are not coincidental, we test for statistical significance. We follow the non-parametric Pitman's permutation test (Dror et al., 2018) and observe that the improvements of our model are statistically significant at a significance level (α) of 0.05. Note that this holds for all metrics on both datasets except ROUGE-2 on ParaNMT-small.
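For reference, a simple paired (Pitman-style) permutation test over per-sentence metric scores can be sketched as below; the use of per-sentence scores, the number of random permutations, and the two-sided mean-difference statistic are our assumptions about how the test would be instantiated.

```python
import random

def permutation_test(scores_a, scores_b, trials=10_000, seed=0):
    """Approximate p-value for the mean difference between paired system scores."""
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(a - b for a, b in zip(scores_a, scores_b)) / n)
    extreme = 0
    for _ in range(trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:   # randomly swap the pair's system labels
                a, b = b, a
            diff += a - b
        extreme += abs(diff / n) >= observed
    return extreme / trials

# e.g., per-sentence BLEU of Sgcp vs. Cgen; improvements are significant if p < 0.05
```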
2. Human Evaluation: Table 4 shows the results of the human assessment. Annotators generally tend to rate Sgcp-F and Sgcp-R (Section 4.1) higher than the baseline models, highlighting the efficacy of our models. This evaluation additionally shows that the automated metrics are reasonably consistent with the human evaluation scores.
Table 3: Sample generations from each model.

Source | what should be done to get rid of laziness ?
Template Exemplar | how can i manage my anger ?
Scpn (Iyyer et al., 2018) | how can i get rid ?
Cgen (Chen et al., 2019a) | how can i get rid of ?
Sgcp-F (Ours) | how can i stop my laziness ?
Sgcp-R (Ours) | how do i get rid of laziness ?
Source | what books should entrepreneurs read on entrepreneurship ?
Template Exemplar | what is the best programming language for beginners to learn ?
Scpn (Iyyer et al., 2018) | what are the best books books to read to read ?
Cgen (Chen et al., 2019a) | what ’s the best book for entrepreneurs read to entrepreneurs ?
Sgcp-F (Ours) | what is a best book idea that entrepreneurs to read ?
Sgcp-R (Ours) | what is a good book that entrepreneurs should read ?
Source | how do i get on the board of directors of a non profit or a for profit organisation ?
Template Exemplar | what is the best way to travel around the world for free ?
Scpn (Iyyer et al., 2018) | what is the best way to prepare for a girl of a ?
Cgen (Chen et al., 2019a) | what is the best way to get a non profit on directors ?
Sgcp-F (Ours) | what is the best way to get on the board of directors ?
Sgcp-R (Ours) | what is the best way to get on the board of directors of a non profit or a for profit organisation ?
5.2 Syntactic Control
1. Syntactical Granularity: Our model can work with different levels of granularity for the exemplar syntax, namely, different tree heights of the exemplar tree can be used for decoding the output.
As can be seen in Table 6, at height 4 the syntax tree provided to the model is not sufficient to generate a full sentence that captures the meaning of the original sentence. As we increase the height to 5, the model captures the semantics better by producing the phrase “some of” in the sentence. At heights 6 and 7, Sgcp is able to capture both the semantics of the source and the syntax of the exemplar. However, when provided with the complete tree (i.e., H = 7), it tries to follow the syntactic input even more closely, leading to a sacrifice in overall relevance, since the original sentence asks about pure substances and not a pure substance. It can be inferred from this example that, because a source sentence and an exemplar's syntax might not be fully compatible with each other, using the complete syntax tree can potentially lead to loss of relevance and grammaticality. Hence, by choosing different levels of syntactic granularity, one can address the issue of compatibility to a certain extent.
2. Syntactic Variety: Table 5 shows sample generations of our model for multiple exemplars given a single source sentence. It can be observed that Sgcp can generate high-quality outputs for a variety of template exemplars, even ones that differ substantially from the original sentence in terms of syntax. A particularly interesting exemplar is what is chromosomal mutation ? what are some examples ?. Here, Sgcp is able to generate a sentence with two question marks while preserving the essence of the source sentence. It should also be noted that the exemplars used in Table 5 were selected manually from the test sets, considering only their qualitative compatibility with the source sentence. Unlike the procedure used for the creation of the QQP-Pos dataset, the gold paraphrases were not consulted while selecting the exemplars. In real-world settings, where a gold paraphrase will not be available, these results are indicative of the qualitative efficacy of our method.
Table 5: Sgcp generations for a single source with different exemplars.

SOURCE: how do i develop my career in software ?

SYNTACTIC EXEMPLAR | Sgcp GENERATIONS
how can i get a domain for free ? | how can i develop a career in software ?
what is the best way to register a company ? | what is the best way to develop career in software ?
what are good places to visit in new york ? | what are good ways to develop my career in software ?
can i make 800,000 a month betting on horses ? | can i develop my career in software ?
what is chromosomal mutation ? what are some examples ? | what is good career ? what are some of the ways to develop my career in software ?
is delivery free on quikr ? | is career useful in software ?
is it possible to mute a question on quora ? | is it possible to develop my career in software ?
Table 6: Generations at different levels of syntactic granularity H.

S (source) | what are pure substances ? what are some examples ?
E (exemplar) | what are the characteristics of the elizabethan theater ?
H = 4 | what are pure substances ?
H = 5 | what are some of pure substances ?
H = 6 | what are some examples of pure substances ?
H = 7 | what are some examples of a pure substance ?
5.3 Sgcp-R Analysis
ROUGE-based selection from the candidates favors paraphrases that have higher n-gram overlap with their respective source sentences, and hence may capture the source's meaning better. This can be observed directly in the results in Tables 2 and 4, where Sgcp-R obtains higher automated semantic and human evaluation scores. Although this helps in obtaining semantically better generations, it tends to result in higher TED values. One possible reason is that, when provided with the complete tree, the model has fine-grained information available for decoding, which forces the generations to adhere to the syntactic structure. In contrast, at lower heights, the model is provided with less syntactic information but equivalent semantic information.
5.4 Qualitative Analysis
As can be seen from Table 7, Sgcp not only incorporates the best aspects of both the prior models, namely Scpn and Cgen, but also utilizes the complete syntactic information obtained using the constituency-based parse trees of the exemplar.
Table 7: Comparison of model characteristics.

| | Single-Pass | Syntactic Signal | Granularity |
|---|---|---|---|
| Scpn | ✗ | Linearized Tree | ✓ |
| Cgen | ✓ | POS Tags (during training) | ✗ |
| Sgcp | ✓ | Constituency Parse Tree | ✓ |
From the generations in Table 3, we can see that our model is able to capture both the semantics of the source text and the syntax of the template. Scpn, evidently, can produce outputs with the template syntax, but it does so at the cost of the semantics of the source sentence. This can also be verified from the results in Table 2, where Scpn performs poorly on PDS as compared with the other models. In contrast, Cgen and Sgcp retain much more of the semantic information, as is desirable. While generating sentences, Cgen often ends them abruptly, as in example 1 in Table 3, where the sentence ends with of as the penultimate token. The problem of abrupt endings due to insufficient syntactic input length was highlighted in Chen et al. (2019a), and we observe similar trends. Sgcp, on the other hand, generates more relevant and grammatical sentences.
Based on empirical evidence, Sgcp alleviates this shortcoming, possibly owing to its dynamic syntactic control and decoding. This can be seen, for example, in example 3 of Table 3, where Cgen truncates the sentence abruptly (penultimate token = directors) but Sgcp is able to generate a relevant sentence without compromising on grammaticality.
5.5 Limitations and Future Directions
Not every English sentence can be converted to an arbitrary desired syntax. We note that Sgcp does not take into account the compatibility of the source sentence and the template exemplar and can freely generate syntax-conforming paraphrases. This, at times, leads to imperfect paraphrase conversion and nonsensical sentences, like example 6 in Table 5 (is career useful in software ?). Identifying compatible exemplars is an important but separate task in itself, which we defer to future work.
Another important aspect is that the task of paraphrase generation is inherently domain agnostic: it is easy for humans to adapt to new domains when paraphrasing. However, because of the way the problem is formulated in NLP, all the baselines, as well as our models, suffer from dataset bias and are not directly applicable to new domains. A prospective future direction is to explore the problem through the lens of domain independence.
Analyzing the utility of controlled paraphrase generations for the task of data augmentation is another interesting possible direction.
6 Conclusion
In this paper we proposed Sgcp, an end-to-end framework for the task of syntactically controlled paraphrase generation. Sgcp generates a paraphrase of an input sentence while conforming to the syntax of an exemplar sentence provided along with the input. Sgcp comprises a GRU-based sentence encoder, a modified RNN-based tree encoder, and a novel pointer-generator-based decoder. In contrast to previous work that focuses on a limited amount of syntactic control, our model can generate paraphrases at different levels of granularity of syntactic control without compromising on relevance. Through extensive evaluations on real-world datasets, we demonstrate Sgcp's efficacy over state-of-the-art baselines.
We believe that the above approach can be useful for a variety of text generation tasks including syntactic exemplar-based abstractive summarization, text simplification and data-to-text generation.
Acknowledgments
This research is supported in part by the Ministry of Human Resource Development (Government of India). We thank the action editor Asli Celikyilmaz and the three anonymous reviewers for their helpful suggestions in preparing the manuscript. We also thank Chandrahas for his indispensable comments on earlier drafts of this paper.
Notes
Obtained using the Stanford CoreNLP toolkit (Manning et al., 2014).
Because the ParaNMT dataset only contains paraphrase pairs, we augment it with the PAWS (Zhang et al., 2019) dataset to acquire negative samples.
Because the test set of QQP is not public, the 90.2% number was computed on the available dev set (not used for model selection).
References
Author notes
This research was conducted during the authors' internship at the Indian Institute of Science.