Syntax-Guided Controlled Generation of Paraphrases

Given a sentence (e.g., “I like mangoes”) and a constraint (e.g., sentiment flip), the goal of controlled text generation is to produce a sentence that adapts the input sentence to meet the requirements of the constraint (e.g., “I hate mangoes”). Going beyond such simple constraints, recent work has started exploring the incorporation of complex syntactic guidance as constraints in the task of controlled paraphrase generation. In these methods, syntactic guidance is sourced from a separate exemplar sentence. However, these prior works have utilized only limited syntactic information available in the parse tree of the exemplar sentence. We address this limitation in this paper and propose Syntax Guided Controlled Paraphraser (SGCP), an end-to-end framework for syntactic paraphrase generation. We find that SGCP can generate syntax-conforming sentences while not compromising on relevance. We perform extensive automated and human evaluations over multiple real-world English language datasets to demonstrate the efficacy of SGCP over state-of-the-art baselines. To drive future research, we have made SGCP’s source code available. 1


Introduction
Controlled text generation is the task of producing a sequence of coherent words based on given constraints. These constraints can range from simple attributes like tense, sentiment polarity and word reordering (Hu et al., 2017; Shen et al., 2017; Yang et al., 2018) to more complex syntactic information. For example, given a sentence "The movie is awful!" and a simple constraint like flip sentiment to positive, a controlled text generator is expected to produce the sentence "The movie is fantastic!".
Table 1: Sample generations using SCPN (Iyyer et al., 2018), CGEN (Chen et al., 2019a), and SGCP (Ours). We observe that SGCP is able to generate syntax conforming paraphrases without compromising much on relevance.
SOURCE - how do i predict the stock market ?
EXEMPLAR - can a brain transplant be done ?
SCPN - how can the stock and start ?
CGEN - can the stock market actually happen ?
SGCP (Ours) - can i predict the stock market ?
SOURCE - what are some of the mobile apps you ca n't live without and why ?
EXEMPLAR - which is the best resume you have come across ?
SCPN - what are the best ways to lose weight ?
CGEN - which is the best mobile app you ca n't ?
SGCP (Ours) - which is the best app you ca n't live without and why ?
* This research was conducted during the author's internship at Indian Institute of Science.
1 https://github.com/malllabiisc/SGCP
These constraints are important in not only providing information about what to say but also how to say it. Without any constraint, the ubiquitous sequence-to-sequence neural models often tend to produce degenerate outputs and favour generic utterances (Vinyals and Le, 2015;Li et al., 2016). While simple attributes are helpful in addressing what to say, they provide very little information about how to say it. Syntactic control over generation helps in filling this gap by providing that missing information.
Incorporating complex syntactic information has shown promising results in neural machine translation (Stahlberg et al., 2016; Aharoni and Goldberg, 2017; Yang et al., 2019), data-to-text generation (Peng et al., 2019), abstractive text summarization (Cao et al., 2018) and adversarial text generation (Iyyer et al., 2018). Additionally, recent work (Iyyer et al., 2018; Kumar et al., 2019) has shown that augmenting lexical and syntactical variations in the training set can help in building better performing and more robust models.
Figure 1: Architecture of SGCP (proposed method). SGCP aims to paraphrase an input sentence, while conforming to the syntax of an exemplar sentence (provided along with the input). The input sentence is encoded using the Sentence Encoder (Section 3.2) to obtain a semantic signal $c_t$. The Syntactic Encoder (Section 3.3) takes a constituency parse tree (pruned at height H) of the exemplar sentence as an input, and produces representations for all the nodes in the pruned tree. Once both of these are encoded, the Syntactic Paraphrase Decoder (Section 3.4) uses a pointer-generator network, and at each time step takes the semantic signal $c_t$, the decoder recurrent state $s_t$, the embedding of the previous token and the syntactic signal $h^Y_t$ to generate a new token. Note that the syntactic signal remains the same for each token in a span (shown in the figure above the curly braces; please see Figure 2 for more details). The gray shaded region (not part of the model) illustrates a qualitative comparison of the exemplar syntax tree and the syntax tree obtained from the generated paraphrase. Please refer to Section 3 for details.
In this paper, we focus on the task of syntactically controlled paraphrase generation, i.e., given an input sentence and a syntactic exemplar, produce a sentence which conforms to the syntax of the exemplar while retaining the meaning of the original input sentence. While syntactically controlled generation of paraphrases finds applications in multiple domains like data-augmentation and text passivization, we highlight its importance in the particular task of text simplification. As pointed out in Siddharthan (2014), depending on the literacy skill of an individual, certain syntactical forms of English sentences are easier to comprehend than others. As an example, consider the following two sentences:
S1: Because it is raining today, you should carry an umbrella.
S2: You should carry an umbrella today, because it is raining.
Connectives that permit pre-posed adverbial clauses have been found to be difficult for third to fifth grade readers, even when the order of mention coincides with the causal (and temporal) order (Anderson and Davison, 1986; Levy, 2003). Hence, they prefer sentence S2. However, various other studies (Clark and Clark, 1968; Katz and Brent, 1968; Irwin, 1980) have suggested that for older school children, college students and adults, comprehension is better for the cause-effect presentation, hence sentence S1. Thus, modifying a sentence syntactically would help in better comprehension based on literacy skills.
Prior work in syntactically controlled paraphrase generation addressed this task by conditioning the semantic input on either the features learnt from a linearized constituency-based parse tree (Iyyer et al., 2018), or the latent syntactic information (Chen et al., 2019a) learnt from exemplars through variational auto-encoders. Linearizing parse trees typically results in loss of essential dependency information. On the other hand, as noted in Shi et al. (2016), an auto-encoder based approach might not offer rich enough syntactic information as guaranteed by actual constituency parse trees. Moreover, as noted in Chen et al. (2019a), SCPN (Iyyer et al., 2018) and CGEN (Chen et al., 2019a) tend to generate sentences of the same length as the exemplar. This is an undesirable characteristic because it often results in producing sentences that end abruptly, thereby compromising on grammaticality and semantics. Please see Table 1 for sample generations using each of the models.
To address these gaps, we propose Syntax Guided Controlled Paraphraser (SGCP) which uses full exemplar syntactic tree information. Additionally, our model provides an easy mechanism to incorporate different levels of syntactic control (granularity) based on the height of the tree being considered. The decoder in our framework is augmented with rich enough syntactical information to be able to produce syntax conforming sentences while not losing out on semantics and grammaticality.
The main contributions of this work are as follows:
1. We propose Syntax Guided Controlled Paraphraser (SGCP), an end-to-end model to generate syntactically controlled paraphrases at different levels of granularity using a parsed exemplar.
2. We provide a new decoding mechanism to incorporate syntactic information from the exemplar sentence's syntactic parse.
3. We provide a dataset formed from Quora Question Pairs (QQP) for evaluating the models. We also perform extensive experiments to demonstrate the efficacy of our model using multiple automated metrics as well as human evaluations.

Related Work
Controllable Text Generation is an important problem in NLP which has received significant attention in recent times. Prior works include generating text using models conditioned on attributes like formality, sentiment or tense (Hu et al., 2017; Shen et al., 2017; Yang et al., 2018) as well as on syntactical templates (Iyyer et al., 2018; Chen et al., 2019a). These systems find applications in adversarial sample generation (Iyyer et al., 2018), text summarization and table-to-text generation (Peng et al., 2019). While achieving state-of-the-art results in their respective domains, these systems typically rely on a known finite set of attributes, thereby making them quite restrictive in terms of the styles they can offer.
Paraphrase generation. While generation of paraphrases has been addressed in the past using traditional methods (McKeown, 1983; Barzilay and Lee, 2003; Quirk et al., 2004; Hassan et al., 2007; Zhao et al., 2008; Madnani and Dorr, 2010; Wubben et al., 2010), they have recently been superseded by deep learning-based approaches (Prakash et al., 2016; Gupta et al., 2018; Li et al., 2019; Kumar et al., 2019). The primary task of all these methods (Prakash et al., 2016; Gupta et al., 2018) is to generate the most semantically similar sentence, and they typically rely on beam search to obtain any kind of lexical diversity. Kumar et al. (2019) try to tackle the problem of achieving lexical and limited syntactical diversity using submodular optimization, but do not provide any syntactic control over the type of utterance that might be desired. These methods are therefore restrictive in terms of the syntactical diversity that they can offer.
Controlled Paraphrase Generation. Our task is similar in spirit to Iyyer et al. (2018) and Chen et al. (2019a), which also deal with the task of syntactic paraphrase generation. However, the approach taken by them is different from ours in at least two aspects. Firstly, SCPN (Iyyer et al., 2018) uses an attention (Bahdanau et al., 2014) based pointer-generator network (See et al., 2017) to encode input sentences and a linearised constituency tree to produce paraphrases. Due to the linearization of the syntactic tree, a lot of dependency-based information is generally lost. Our model, instead, directly encodes the tree structure to produce a paraphrase. Secondly, the inference (or generation) process in SCPN is computationally very expensive, since it involves a two-stage generation process. In the first stage, they generate full parse trees from incomplete templates, and then from full parse trees to final generations. In contrast, the inference in our method involves a single-stage process, wherein our model takes as input a semantic source, a syntactic tree and the level of syntactic style that needs to be transferred, to obtain the generations. Additionally, we observed that SCPN does not perform well in low resource settings. This, again, can be attributed to the compounding implicit noise in the training due to linearised trees and generation of full linearised trees before obtaining the final paraphrases.
Chen et al. (2019a) propose a syntactic exemplar-based method for controlled paraphrase generation using an approach based on latent variable probabilistic modeling, neural variational inference, and multi-task learning. This, in principle, is very similar to Chen et al. (2019b). As opposed to our model, which provides different levels of syntactic control of the exemplar-based generation, this approach is restrictive in terms of the flexibility it can offer. Also, as noted in Shi et al. (2016), an auto-encoder based approach might not offer rich enough syntactic information as offered by actual constituency parse trees. Additionally, VAEs (Kingma and Welling, 2014) are generally unstable and harder to train (Bowman et al., 2016; Gupta et al., 2018) than seq2seq based approaches.

SGCP: Proposed Method
In this section, we describe the inputs and various architectural components, essential for building SGCP, an end-to-end trainable model. Our model, as shown in Figure 1, comprises a sentence encoder (3.2), syntactic tree encoder (3.3), and a syntactic-paraphrase-decoder (3.4).

Inputs
Given an input sentence $X$ and a syntactic exemplar $Y$, our goal is to generate a sentence $Z$ that conforms to the syntax of $Y$ while retaining the meaning of $X$.
While the semantic encoder (Section 3.2) works on a sequence of input tokens, the syntactic encoder (Section 3.3) operates on constituency-based parse trees. We parse the syntactic exemplar $Y$ to obtain its constituency-based parse tree. The leaf nodes of the constituency-based parse tree consist of the tokens of the sentence $Y$. These tokens, in some sense, carry the semantic information of sentence $Y$, which we do not need for generating paraphrases. In order to prevent any meaning propagation from the exemplar sentence $Y$ into the generation, we remove these leaf/terminal nodes from its constituency parse. The tree thus obtained is denoted as $C^Y$.
The syntactic encoder, additionally, takes as input H, which governs the level of syntactic control needed to be induced. The utility of H will be described in Section 3.3.
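The leaf-removal step described above can be illustrated with a small sketch. The tree representation (a `(label, children)` tuple in which words are plain strings), the function name `strip_terminals`, and the toy parse are all assumptions made for illustration; this is not the authors' implementation.

```python
# Assumed representation: a node is (label, [children]); a word-level
# terminal is a plain string sitting in a children list.

def strip_terminals(tree):
    """Return a copy of `tree` with every word-level leaf removed."""
    label, children = tree
    kept = [strip_terminals(child) for child in children
            if not isinstance(child, str)]
    return (label, kept)


# Hypothetical parse of an exemplar question (labels only illustrative).
parse = ("SBARQ", [
    ("WHNP", [("WP", ["what"])]),
    ("SQ", [("VBZ", ["is"]),
            ("NP", [("DT", ["the"]), ("JJS", ["best"]), ("NN", ["language"])])]),
    ("<DOT>", ["?"]),
])

c_y = strip_terminals(parse)  # only syntactic labels remain
```

After this step the part-of-speech nodes become the new leaves, so no token of the exemplar can leak into the generation through the tree.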

Semantic Encoder
The semantic encoder, a multi-layered Gated Recurrent Unit (GRU), receives the tokenized sentence $X = \{x_1, \ldots, x_{T_X}\}$ as input and computes the contextualized hidden state representation $h^X_t$ for each token using:
$$h^X_t = \mathrm{GRU}(h^X_{t-1}, e(x_t)) \tag{1}$$
where $e(x_t)$ represents the learnable embedding of the token $x_t$ and $t \in \{1, \ldots, T_X\}$. Note that we use byte-pair encoding (Sennrich et al., 2016) for word/token segmentation.

Syntactic Encoder
This encoder provides the necessary syntactic guidance for the generation of paraphrases. Formally, let the constituency tree be $C^Y = \{V, E, \mathcal{Y}\}$, where $V$ is the set of nodes, $E$ the set of edges and $\mathcal{Y}$ the labels associated with each node. We calculate the hidden-state representation $h^Y_v$ of each node $v \in V$ using the hidden-state representation of its parent node $pa(v)$ and the embedding associated with its label $y_v$ as follows:
$$h^Y_v = \mathrm{GeLU}(W_{pa} h^Y_{pa(v)} + W_v e(y_v) + b_v) \tag{2}$$
where $e(y_v)$ is the embedding of the node label $y_v$, and $W_{pa}$, $W_v$, $b_v$ are learnable parameters. This approach can be considered similar to TreeLSTM (Tai et al., 2015). We use the GeLU activation function (Hendrycks and Gimpel, 2016) rather than the standard tanh or relu, because of its superior empirical performance.
As indicated in Section 3.1, the syntactic encoder takes as input the height $H$, which governs the level of syntactic control. We randomly prune the tree $C^Y$ to height $H \in \{3, \ldots, H_{max}\}$, where $H_{max}$ is the height of the full constituency tree $C^Y$. As an example, in Figure 2b, we prune the constituency-based parse tree of the exemplar sentence to height $H = 3$. The leaf nodes for this tree have the labels WP, VBZ, NP and <DOT>. While we calculate the hidden-state representation of all the nodes, only the terminal nodes are responsible for providing the syntactic signal to the decoder (Section 3.4).
We maintain a queue $L^Y_H$ of such terminal node representations, where elements are inserted from left to right for a given $H$. Specifically, for the particular example given in Figure 2b, $L^Y_H = [h^Y_{WP}, h^Y_{VBZ}, h^Y_{NP}, h^Y_{<DOT>}]$. We emphasize the fact that the length of the queue $|L^Y_H|$ is a function of the height $H$.
Figure 2: The constituency parse tree serves as an input to the syntactic encoder (Section 3.3). The first step is to remove the leaf nodes which contain meaning representative tokens (Here: What is the best language ...). $H$ denotes the height to which the tree can be pruned and is an input to the model. Figure (a) shows the full constituency parse tree annotated with vector $a$ for different heights. Figure (b) shows the same tree pruned at height $H = 3$ with its corresponding $a$ vector. The vector $a$ serves as a signalling vector (Section 3.4.2) which helps in deciding the syntactic signal to be passed on to the decoder. Please refer to Section 3 for details.
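To make the effect of the height $H$ concrete, the following sketch prunes a label-only tree and reads off its frontier from left to right. It assumes the root sits at depth 1 (a convention chosen here so that $H = 3$ yields WP, VBZ, NP and <DOT>), uses a hypothetical toy tree, and lets node labels stand in for the hidden representations that the actual queue would hold.

```python
# Assumed representation: a node is (label, [children]); words already removed.

def prune(tree, height):
    """Keep only nodes whose depth is <= height (the root has depth 1)."""
    label, children = tree
    if height <= 1:
        return (label, [])
    return (label, [prune(child, height - 1) for child in children])


def frontier(tree):
    """Labels of the leaves of `tree`, collected from left to right."""
    label, children = tree
    if not children:
        return [label]
    leaves = []
    for child in children:
        leaves.extend(frontier(child))
    return leaves


def tree_height(tree):
    """Number of levels in `tree`, counting the root as level 1."""
    _, children = tree
    return 1 + max((tree_height(child) for child in children), default=0)


# Hypothetical exemplar tree (terminals already stripped).
c_y = ("SBARQ", [
    ("WHNP", [("WP", [])]),
    ("SQ", [("VBZ", []),
            ("NP", [("NP", [("DT", []), ("JJS", []), ("NN", [])]),
                    ("PP", [("IN", []), ("NP", [("NNS", [])])])])]),
    ("<DOT>", []),
])

queue_h3 = frontier(prune(c_y, 3))    # coarse syntactic control
queue_full = frontier(prune(c_y, tree_height(c_y)))  # finest control
```

In this toy tree the queue grows from four elements at $H = 3$ to eight at the full height, which is the sense in which the queue length is a function of $H$.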

Syntactic Paraphrase Decoder
Having obtained the semantic and syntactic representations, the decoder is tasked with the generation of syntactic paraphrases. This can be modeled as finding the best $Z = Z^*$ that maximizes the probability $P(Z|X, Y)$, which can further be factorized as:
$$P(Z|X, Y) = \prod_{t=1}^{T_Z} P(z_t \mid z_1, \ldots, z_{t-1}, X, Y) \tag{3}$$
where $T_Z$ is the maximum length up to which decoding is required.
In the subsequent sections, we use t to denote the decoder time step.

Using Semantic Information
At each decoder time step $t$, the attention distribution $\alpha^t$ is calculated over the encoder hidden states $h^X_i$, obtained using Equation 1, as:
$$e^t_i = v^\top \tanh(W_h h^X_i + W_s s_t + b_{attn}), \qquad \alpha^t = \mathrm{softmax}(e^t) \tag{4}$$
where $s_t$ is the decoder cell-state and $v$, $W_h$, $W_s$, $b_{attn}$ are learnable parameters.
The attention distribution provides a way to jointly align and train sequence-to-sequence models by producing a weighted sum of the semantic encoder hidden states, known as the context vector $c_t$, given by:
$$c_t = \sum_i \alpha^t_i h^X_i \tag{5}$$
$c_t$ serves as the semantic signal which is essential for generating meaning preserving sentences.
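As a rough illustration of how the attention distribution and the context vector fit together, the sketch below assumes the standard additive (Bahdanau-style) scoring form built from the parameters named in the text ($v$, $W_h$, $W_s$, $b_{attn}$). The function names, the plain-list vectors and the toy values are assumptions of this example, not the authors' code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def attention(enc_states, s_t, W_h, W_s, b_attn, v):
    """Return (alpha, context) for one decoder step."""
    proj_s = matvec(W_s, s_t)
    scores = []
    for h in enc_states:
        proj_h = matvec(W_h, h)
        hidden = [math.tanh(a + b + c)
                  for a, b, c in zip(proj_h, proj_s, b_attn)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    alpha = softmax(scores)
    dim = len(enc_states[0])
    # Context vector: attention-weighted sum of encoder states.
    context = [sum(a * h[k] for a, h in zip(alpha, enc_states))
               for k in range(dim)]
    return alpha, context


# Toy values chosen only to exercise the code.
identity = [[1.0, 0.0], [0.0, 1.0]]
enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha, c_t = attention(enc_states, s_t=[0.5, -0.5], W_h=identity,
                       W_s=identity, b_attn=[0.0, 0.0], v=[1.0, 1.0])
```

Because the weights in `alpha` are positive and sum to one, every coordinate of `c_t` lies between the smallest and largest value of that coordinate across the encoder states.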

Using Syntactic Information
During training, each terminal node in the tree $C^Y$, pruned at $H$, is equipped with information about the span of words it needs to generate. At each time step $t$, only one terminal node representation $h^Y_v \in L^Y_H$ is responsible for providing the syntactic signal, which we call $h^Y_t$. The hidden-state representation to be used is governed through a signalling vector $a = (a_1, \ldots, a_{T_Z})$, where each $a_i \in \{0, 1\}$. 0 indicates that the decoder should keep on using the same hidden representation $h^Y_v \in L^Y_H$ that is currently being used, and 1 indicates that the next element (hidden representation) in the queue $L^Y_H$ should be used for decoding. The utility of $a$ can be best understood through Figure 2b. Consider the syntactic tree pruned at height $H = 3$. For this example, $a = (1, 1, 1, 0, 0, 0, 0, 0, 1)$. $a_i = 1$ provides a signal to pop an element from the queue $L^Y_H$, while $a_i = 0$ provides a signal to keep on using the last popped element. This element is then used to guide the decoder syntactically by providing a signal in the form of a hidden-state representation (Equation 8).
Specifically, in this example, $a_1 = 1$ signals $L^Y_H$ to pop $h^Y_{WP}$ to provide syntactic guidance to the decoder for generating the first token. $a_2 = 1$ signals $L^Y_H$ to pop $h^Y_{VBZ}$ to provide syntactic guidance to the decoder for generating the second token. $a_3 = 1$ helps in obtaining $h^Y_{NP}$ from $L^Y_H$ to provide guidance to generate the third token. As described earlier, $a_4, \ldots, a_8 = 0$ indicate that the same representation $h^Y_{NP}$ should be used for syntactically guiding tokens $z_4, \ldots, z_8$. Finally, $a_9 = 1$ helps in retrieving $h^Y_{<DOT>}$ for guiding the decoder to generate token $z_9$. Note that $|L^Y_H| = \sum_{i=1}^{T_Z} a_i$.
While $a$ is provided to the model during training, this information might not be available during inference. Providing $a$ during generation makes the model restrictive and might result in producing ungrammatical sentences. SGCP is therefore tasked with learning a proxy for the signalling vector $a$, using a transition probability vector $p$.
At each time step $t$, we calculate $p_t \in (0, 1)$, which determines the probability of changing the syntactic signal. When the signal changes, pop removes and returns the next element in the queue $L^Y_H$. Here, $s_t$ is the decoder state, and $e(z_t)$ is the embedding of the input token at time $t$ during decoding.
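The routing of syntactic signals can be sketched as follows. The first function follows the description of the signalling vector $a$ used during training. The second is only a guess at how $p_t$ might be used at inference: the 0.5 threshold, the rule that the first token always pops, and both function names are assumptions of this sketch, and node labels again stand in for hidden representations.

```python
from collections import deque


def signals_from_a(queue_items, a):
    """Training-time routing: a_i = 1 pops the next node, a_i = 0 reuses it."""
    queue = deque(queue_items)
    current = None
    per_token = []
    for a_i in a:
        if a_i == 1:
            current = queue.popleft()
        per_token.append(current)
    return per_token


def signals_from_p(queue_items, probs, threshold=0.5):
    """Inference-time routing (assumed rule): pop when p_t >= threshold."""
    queue = deque(queue_items)
    current = None
    per_token = []
    for p_t in probs:
        # Pop for the very first token, or whenever the gate fires,
        # as long as the queue still has elements left.
        if queue and (current is None or p_t >= threshold):
            current = queue.popleft()
        per_token.append(current)
    return per_token


leaves = ["WP", "VBZ", "NP", "<DOT>"]     # queue for H = 3 in Figure 2b
a = (1, 1, 1, 0, 0, 0, 0, 0, 1)           # signalling vector from the text
train_signals = signals_from_a(leaves, a)

# Hypothetical gate probabilities that happen to mimic `a`.
probs = [0.9, 0.8, 0.7, 0.1, 0.2, 0.1, 0.3, 0.2, 0.95]
infer_signals = signals_from_p(leaves, probs)
```

With these inputs, tokens four through eight all share the NP representation, and the number of ones in `a` equals the queue length, matching the identity stated above.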

Overall
The semantic signal $c_t$, together with the decoder state $s_t$, the embedding of the input token $e(z_t)$ and the syntactic signal $h^Y_t$, is fed through a GRU followed by a softmax of the output to produce a vocabulary distribution as:
$$P_{vocab} = \mathrm{softmax}(W[c_t; h^Y_t; s_t; e(z_t)] + b) \tag{8}$$
where $[;]$ represents concatenation of constituent elements, and $W$, $b$ are trainable parameters. We augment this with the copying mechanism as in the pointer-generator network (See et al., 2017). Usage of such a mechanism offers a probability distribution over the extended vocabulary (the union of vocabulary words and words present in the source sentence) as follows:
$$p_{gen} = \sigma(w_c^\top c_t + w_s^\top s_t + w_x^\top e(z_t) + b_{gen})$$
$$P(z) = p_{gen} P_{vocab}(z) + (1 - p_{gen}) \sum_{i : x_i = z} \alpha^t_i \tag{9}$$
where $w_c$, $w_s$, $w_x$ and $b_{gen}$ are learnable parameters, $e(z_t)$ is the input token embedding to the decoder at time step $t$ and $\alpha^t_i$ is the element corresponding to the $i$-th co-ordinate in the attention distribution as defined in Equation 4.
The overall objective can be obtained by taking the negative log-likelihood of the distributions obtained in Equation 6 and Equation 9:
$$\mathcal{L} = -\sum_{t=1}^{T_Z} \left[ \log P(z_t) + a_t \log p_t + (1 - a_t) \log(1 - p_t) \right]$$
where $a_t$ is the $t$-th element of the vector $a$.
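A small numeric sketch may help in reading the copy mechanism and the objective. It assumes the usual pointer-generator mixture from See et al. (2017), in which copy mass is added to source tokens in proportion to their attention weights, and an unweighted sum of the token and signal log-likelihoods. The function names, the equal weighting and the toy numbers are assumptions, not details confirmed by the source.

```python
import math


def final_distribution(p_gen, p_vocab, alpha, source_tokens):
    """Mix generation and copy probabilities over the extended vocabulary."""
    dist = {word: p_gen * prob for word, prob in p_vocab.items()}
    for weight, token in zip(alpha, source_tokens):
        # Source tokens missing from the vocabulary get copy mass only.
        dist[token] = dist.get(token, 0.0) + (1.0 - p_gen) * weight
    return dist


def step_loss(p_target, p_t, a_t):
    """Per-step negative log-likelihood of the target token and of a_t."""
    signal_ll = a_t * math.log(p_t) + (1 - a_t) * math.log(1.0 - p_t)
    return -(math.log(p_target) + signal_ll)


# Toy values for illustration only.
p_vocab = {"how": 0.5, "can": 0.3, "<unk>": 0.2}
alpha = [0.1, 0.6, 0.3]
source_tokens = ["how", "stock", "market"]
dist = final_distribution(0.7, p_vocab, alpha, source_tokens)

loss = step_loss(p_target=dist["stock"], p_t=0.8, a_t=1)
```

Here "stock" and "market" are outside the toy vocabulary yet still receive probability through copying, which is how the decoder can reproduce source words it cannot generate.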

Experiments
Our experiments are geared towards answering the following questions:
Q1. Is SGCP able to generate syntax conforming sentences without losing out on meaning? (Section 5)
Based on these questions, we outline the methods compared (Section 4.1), along with the datasets (Section 4.2) used, evaluation criteria (Section 4.3) and the experimental setup (Section 4.4).

Methods Compared
As in Chen et al. (2019a), we first highlight the results of the two direct return-input baselines.
1. Source-as-Output: Baseline where the output is the semantic input.
2. Exemplar-as-Output: Baseline where the output is the syntactic exemplar.
We compare the following competitive methods:
3. SCPN (Iyyer et al., 2018) is a sequence-to-sequence based model comprising two encoders built with LSTM (Hochreiter and Schmidhuber, 1997) to encode semantics and syntax respectively. Once the encoding is obtained, it serves as an input to the LSTM based decoder, which is augmented with soft-attention (Bahdanau et al., 2014) over encoded states as well as a copying mechanism (See et al., 2017) to deal with out-of-vocabulary tokens. Note that the results for SCPN differ from the ones shown in Iyyer et al. (2018). This is because the dataset used in Iyyer et al. (2018) is at least 50 times larger than the largest dataset (ParaNMT-small) in this work.
4. CGEN (Chen et al., 2019a) is a VAE (Kingma and Welling, 2014) model with two encoders to project semantic input and syntactic input to a latent space. They obtain a syntactic embedding from one encoder, using a standard Gaussian prior. To obtain the semantic representation, they use a von Mises-Fisher prior, which can be thought of as a Gaussian distribution on a hypersphere. They train the model using a multi-task paradigm, incorporating paraphrase generation loss and word position loss. We considered their best model, VGVAE + LC + WN + WPL, which incorporates the above objectives.
5. SGCP (Section 3) is a sequence-and-tree-to-sequence based model which encodes semantics and tree-level syntax to produce paraphrases. It uses a GRU (Chung et al., 2014) based decoder with soft-attention on semantic encodings and a begin of phrase (bop) gate to select a leaf node in the exemplar syntax tree. We compare the following two variants of SGCP:
(a) SGCP-F: Uses full constituency parse tree information of the exemplar for generating paraphrases.
(b) SGCP-R: SGCP can produce multiple paraphrases by pruning the exemplar tree at various heights. This variant first generates 5 candidate generations, corresponding to 5 different heights of the exemplar tree, namely $\{H_{max}, H_{max} - 1, H_{max} - 2, H_{max} - 3, H_{max} - 4\}$, for each (source, exemplar) pair. From these candidates, the one with the highest ROUGE-1 score with the source sentence is selected as the final generation.
Note that, except for the return-input baselines, all methods use beam search during inference.
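The candidate selection used by SGCP-R can be sketched as below. The text does not say which ROUGE-1 variant is used, so this example assumes a whitespace-tokenized unigram F1; the function names are hypothetical, and the two candidate strings are borrowed from Table 3 purely as illustrative inputs rather than as actual per-height outputs.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between two whitespace-tokenized strings."""
    cand, ref = candidate.split(), reference.split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def select_sgcp_r(source, candidates):
    """Pick the candidate (one per pruning height) closest to the source."""
    return max(candidates, key=lambda cand: rouge1_f1(cand, source))


source = "what should be done to get rid of laziness ?"
candidates = [
    "how can i stop my laziness ?",
    "how do i get rid of laziness ?",
]
best = select_sgcp_r(source, candidates)
```

Selecting by overlap with the source favours meaning retention, which is consistent with SGCP-R trading some syntactic conformity for better paraphrastic scores.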

Datasets
We train the models and evaluate them on the following datasets:
(1) ParaNMT-small (Chen et al., 2019a) contains 500K sentence-paraphrase pairs for training, and 1300 manually labeled sentence-exemplar-reference triples, which are further split into 800 test data points and 500 dev data points. As in Chen et al. (2019a), our model uses only (sentence, paraphrase) pairs during training. The paraphrase itself serves as the exemplar input during training.
This dataset is a subset of the original ParaNMT-50M dataset. ParaNMT-50M is a dataset generated automatically through backtranslation of original English sentences. It is inherently noisy due to imperfect neural machine translation quality, with many sentences being non-grammatical and some even being non-English sentences. Because of such noisy data points, it is optimistic to assume that the corresponding constituency parse trees would be well aligned. To that end, we propose to use the following additional dataset, which is more well-formed and has more human intervention than the ParaNMT-50M dataset.
(2) QQP-Pos: The original Quora Question Pairs (QQP) dataset contains about 400K sentence pairs labeled positive if they are duplicates of each other and negative otherwise. The dataset is composed of about 150K positive and 250K negative pairs. We select those positive pairs in which both sentences have a maximum token length of 30, leaving us with ~146K pairs. We call this dataset QQP-Pos.
Similar to ParaNMT-small, we use only the sentence-paraphrase pairs as training set and sentence-exemplar-reference triples for testing and validation.
We randomly choose 140K sentence-paraphrase pairs as the training set $T_{train}$, and the remaining 6K pairs $T_{eval}$ are used to form the evaluation set $E$. Additionally, let $T_{eset} = \bigcup \{\{X, Z\} : (X, Z) \in T_{eval}\}$. Note that $T_{eset}$ is a set of sentences while $T_{eval}$ is a set of sentence-paraphrase pairs. Let $E = \emptyset$ be the initial evaluation set. For selecting an exemplar for each sentence-paraphrase pair $(X, Z) \in T_{eval}$, we adopt the following procedure:
Step 1: For a given $(X, Z) \in T_{eval}$, construct an exemplar candidate set $\mathcal{C} = T_{eset} - \{X, Z\}$, so that $|\mathcal{C}| \approx 12{,}000$.
Step 2: Retain only those sentences $C \in \mathcal{C}$ whose sentence length (= number of tokens) differs by at most 2 when compared to the paraphrase $Z$. This is done since sentences with similar constituency-based parse tree structures tend to have similar token lengths.
Step 3: Remove those candidates $C \in \mathcal{C}$ which are very similar to the source sentence $X$, i.e. BLEU$(X, C) > 0.6$.
Step 4: From the remaining instances in $\mathcal{C}$, choose as the exemplar $Y$ the sentence $C$ which has the least Tree-Edit distance with the paraphrase $Z$ of the selected pair, i.e. $Y = \arg\min_{C \in \mathcal{C}} \mathrm{TED}(Z, C)$. This ensures that the constituency-based parse tree of the exemplar $Y$ is quite similar to that of $Z$, in terms of Tree-Edit distance.
Step 5: $E := E \cup \{(X, Y, Z)\}$.
Step 6: Repeat the procedure for all other pairs in $T_{eval}$.
From the obtained evaluation set $E$, we randomly choose 3K triplets for the test set $T_{test}$, and the remaining 3K for the validation set $V$.
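The exemplar selection procedure above can be sketched as a filter followed by an argmin. The BLEU and Tree-Edit distance functions are passed in as callables; `toy_bleu` and `toy_ted` below are crude stand-ins invented for this example (not real BLEU or TED), and the function names and toy pairs are likewise hypothetical.

```python
def select_exemplar(x, z, pool, bleu, ted, max_len_diff=2, max_bleu=0.6):
    """Steps 1-4: filter the candidate pool and return the exemplar for (x, z)."""
    candidates = [c for c in sorted(pool) if c not in (x, z)]           # Step 1
    candidates = [c for c in candidates                                 # Step 2
                  if abs(len(c.split()) - len(z.split())) <= max_len_diff]
    candidates = [c for c in candidates if bleu(x, c) <= max_bleu]      # Step 3
    if not candidates:
        return None
    return min(candidates, key=lambda c: ted(z, c))                     # Step 4


def build_eval_set(pairs, bleu, ted):
    """Steps 5-6: collect (source, exemplar, reference) triples."""
    pool = {sentence for pair in pairs for sentence in pair}  # T_eset
    triples = []
    for x, z in pairs:
        y = select_exemplar(x, z, pool, bleu, ted)
        if y is not None:
            triples.append((x, y, z))
    return triples


def toy_bleu(a, b):
    """Stand-in similarity: Jaccard overlap of token sets (not real BLEU)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)


def toy_ted(a, b):
    """Stand-in distance: length gap plus position-wise mismatches (not real TED)."""
    ta, tb = a.split(), b.split()
    return abs(len(ta) - len(tb)) + sum(u != v for u, v in zip(ta, tb))


pairs = [
    ("how do i learn python ?", "what is the best way to learn python ?"),
    ("can a brain transplant be done ?", "is a brain transplant possible ?"),
    ("what is the best way to lose weight ?", "how do i lose weight ?"),
]
triples = build_eval_set(pairs, toy_bleu, toy_ted)
```

In practice the two callables would wrap a real BLEU implementation and a tree-edit distance computed over constituency parses; the control flow stays the same.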

Evaluation
It should be noted that there is no single fully reliable metric for evaluating syntactic paraphrase generation. Therefore, we evaluate on the following metrics to showcase the efficacy of syntactic paraphrasing models.
(ii) Syntactic Transfer: We evaluate the syntactic transfer using Tree-edit distance (Zhang and Shasha, 1989) between the parse trees of:
(a) the generated paraphrase and the syntactic exemplar in the test set (TED-E)
(b) the generated paraphrase and the reference paraphrase in the test set (TED-R)
(iii) Model-based evaluation: Since our goal is to generate paraphrases of the input sentences, we need some measure to determine if the generations indeed convey the same meaning as the original text. To achieve this, we adopt a model-based evaluation metric as used by Shen et al. (2017) for Text Style Transfer and Isola et al. (2017) for Image Transfer. Specifically, classifiers are trained on the task of Paraphrase Detection and then used as Oracles to evaluate the generations of our model and the baselines. We fine-tune two RoBERTa based sentence pair classifiers, one on Quora Question Pairs (Classifier-1) and the other on ParaNMT + PAWS datasets (Classifier-2), which achieve accuracies of 90.2% and 94.0% on their respective test sets.
Once trained, we use Classifier-1 to evaluate generations on QQP-Pos and Classifier-2 on ParaNMT-small.
Table 2: Results on QQP and ParaNMT-small dataset. Higher↑ BLEU, METEOR, ROUGE and PDS is better whereas lower↓ TED score is better. SGCP-R selects the best candidate out of many, resulting in performance boost for semantic preservation (shown in box). We bold the statistically significant results of SGCP-F, only, for a fair comparison with the baselines. Note that Source-as-Output and Exemplar-as-Output are only dataset quality indicators and not the competitive baselines. Please see Section 5 for details.
We first generate syntactic paraphrases using all the models (Section 4.1) on the test splits of QQP-Pos and ParaNMT-small datasets. We then pair the source sentences with their corresponding generated paraphrases and send them as input to the classifiers. The Paraphrase Detection score, denoted as PDS in Table 2, is defined as the ratio of the number of generations predicted as paraphrases of their corresponding source sentences by the classifier to the total number of generations.
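The Paraphrase Detection score reduces to a simple ratio, sketched below. The classifier is passed in as a callable; `overlap_classifier` is a trivial stand-in invented for this example and has nothing to do with the fine-tuned RoBERTa classifiers used in the paper.

```python
def paraphrase_detection_score(sources, generations, is_paraphrase):
    """Fraction of (source, generation) pairs labeled as paraphrases."""
    if len(sources) != len(generations):
        raise ValueError("sources and generations must be aligned")
    if not sources:
        return 0.0
    hits = sum(1 for src, gen in zip(sources, generations)
               if is_paraphrase(src, gen))
    return hits / len(sources)


def overlap_classifier(src, gen):
    """Stand-in oracle: call it a paraphrase if two or more tokens are shared."""
    return len(set(src.split()) & set(gen.split())) >= 2


sources = ["a b c", "d e f", "g h i", "j k l"]
generations = ["a b c", "x y z", "g h q", "j z z"]
pds = paraphrase_detection_score(sources, generations, overlap_classifier)
```

Two of the four toy generations pass the stand-in oracle, so `pds` is 0.5 here.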

Human Evaluation.
While TED is sufficient to highlight syntactic transfer, there has been some scepticism regarding automated metrics for paraphrase quality (Reiter, 2018). To address this issue, we perform human evaluation on 100 randomly selected data points from the test set. In the evaluation, 3 judges (non-researchers proficient in the English language) were asked to assign scores to generated sentences based on the semantic similarity with the given source sentence. The annotators were shown a source sentence and the corresponding outputs of the systems in random order. The scores ranged from 1 (doesn't capture meaning at all) to 4 (perfectly captures the meaning of the source sentence).

Setup
(a) Pre-processing. Since our model needs access to constituency parse trees, we tokenize and parse all our data points using the fully parallelizable Stanford CoreNLP Parser (Manning et al., 2014) to obtain their respective parse trees. This is done prior to training in order to prevent any additional computational costs that might be incurred because of repeated parsing of the same data points during different epochs.
(b) Implementation details. We train both our models using the Adam Optimizer (Kingma and Ba, 2014) with an initial learning rate of 7e-5. We use a bidirectional 3-layered GRU for encoding the tokenized semantic input and a standard pointer-generator network with GRU for decoding. The token embedding is learnable with dimension 300. To reduce the training complexity of the model, the maximum sequence length is kept at 60. The vocabulary size is kept at 24K for QQP and 50K for ParaNMT-small. SGCP needs access to the level of syntactic granularity for decoding, depicted as $H$ in Figure 2. During training, we keep on varying it randomly from 3 to $H_{max}$, changing it with each training epoch. This ensures that our model is able to generalize because of an implicit regularization attained using this procedure. At each time-step of the decoding process, we keep a teacher forcing ratio of 0.9.

Semantic Preservation and Syntactic transfer
Table 3: Sample generations of the competitive models. Please refer to Section 5.5 for details.
Source - what should be done to get rid of laziness ?
Template Exemplar - how can i manage my anger ?
SCPN (Iyyer et al., 2018) - how can i get rid ?
CGEN (Chen et al., 2019a) - how can i get rid of ?
SGCP-F (Ours) - how can i stop my laziness ?
SGCP-R (Ours) - how do i get rid of laziness ?
Source - what books should entrepreneurs read on entrepreneurship ?
Template Exemplar - what is the best programming language for beginners to learn ?
SCPN (Iyyer et al., 2018) - what are the best books books to read to read ?
CGEN (Chen et al., 2019a) - what 's the best book for entrepreneurs read to entrepreneurs ?
SGCP-F (Ours) - what is a best book idea that entrepreneurs to read ?
SGCP-R (Ours) - what is a good book that entrepreneurs should read ?
Source - how do i get on the board of directors of a non profit or a for profit organisation ?
Template Exemplar - what is the best way to travel around the world for free ?
SCPN (Iyyer et al., 2018) - what is the best way to prepare for a girl of a ?
CGEN (Chen et al., 2019a) - what is the best way to get a non profit on directors ?
SGCP-F (Ours) - what is the best way to get on the board of directors ?
SGCP-R (Ours) - what is the best way to get on the board of directors of a non profit or a for profit organisation ?
1. Automated Metrics: As can be observed in Table 2, our method(s) (SGCP-F/R (Section 4.1)) are able to outperform the existing baselines on both the datasets. Source-as-Output is independent of the exemplar sentence being used and, since a sentence is a paraphrase of itself, the paraphrastic scores are generally high while the syntactic scores are below par. The opposite is true for Exemplar-as-Output. These baselines also serve as dataset quality indicators. It can be seen that the source is semantically similar to, while being syntactically different from, the target sentence, whereas the opposite is true when the exemplar is compared to the target sentences. Additionally, source sentences are syntactically and semantically different from exemplar sentences, as can be observed from the TED-E and PDS scores. This helps in showing that the dataset has rich enough syntactic diversity to learn from. Through TED-E scores it can be seen that SGCP-F is able to adhere to the syntax of the exemplar template to a much larger degree than the baseline models. This verifies that our model is able to generate meaning preserving sentences while conforming to the syntax of the exemplars when measured using standard metrics.
It can also be seen that SGCP-R tends to perform better than SGCP-F in terms of paraphrastic scores while taking a hit on the syntactic scores. This makes sense, intuitively, because in some cases SGCP-R tends to select lower H values for syntactic granularity. This can also be observed from the example given in Table 6 where H = 6 is more favourable than H = 7, because of better meaning retention.
Although CGEN performs close to our model in terms of BLEU, ROUGE and METEOR scores on the ParaNMT-small dataset, its PDS is still much lower than that of our model, suggesting that our model is better at capturing the original meaning of the source sentence. In order to show that the results are not coincidental, we test their statistical significance. We follow the non-parametric Pitman's permutation test (Dror et al., 2018) and observe that the improvements of our model are statistically significant when the significance level (α) is taken to be 0.05. Note that this holds true for all metrics on both the datasets except ROUGE-2 on ParaNMT-small.

Table 4: A comparison of human evaluation scores for comparing quality of paraphrases generated using all models. Higher score is better. Please refer to Section 5.1 for details.

Table 4 shows the results of human assessment. It can be seen that annotators generally tend to rate SGCP-F and SGCP-R (Section 4.1) higher than the baseline models, thereby highlighting the efficacy of our models. This evaluation additionally shows that automated metrics are somewhat consistent with the human evaluation scores.

As can be seen in Table 6, at height 4 the syntax tree provided to the model is not enough to generate the full sentence that captures the meaning of the original sentence. As we increase the height to 5, it is able to capture the semantics better by predicting "some of" in the sentence. We see that at heights 6 and 7 SGCP is able to capture both the semantics of the source and the syntax of the exemplar. However, when we provide the complete height of the tree, i.e., 7, it tries to follow the syntactic input even more closely, which sacrifices overall relevance, since the original sentence is about "pure substances" and not "a pure substance".
It can be inferred from this example that, since the syntax of a source sentence and that of an exemplar might not be fully compatible with each other, using the complete syntax tree can potentially lead to loss of relevance and grammaticality. Hence, by choosing different levels of syntactic granularity, one can address the issue of compatibility to a certain extent.

Table 5 shows sample generations of our model on multiple exemplars for a given source sentence. It can be observed that SGCP can generate high-quality outputs for a variety of different template exemplars, even ones which differ a lot from the original sentence in terms of their syntax. A particularly interesting exemplar is "what is chromosomal mutation ? what are some examples ?". Here, SGCP is able to generate a sentence with two question marks while preserving the essence of the source sentence. It should also be noted that the exemplars used in Table 5 were selected manually from the test sets, considering only their qualitative compatibility with the source sentence. Unlike the procedure used for the creation of the QQP-Pos dataset, the final paraphrases were not consulted while selecting the exemplars. In real-world settings, where a gold paraphrase won't be present, these results are indicative of the qualitative efficacy of our method.
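The notion of syntactic granularity discussed above, i.e. exposing only the top H levels of the exemplar's constituency tree, can be sketched as follows. This is a simplified illustration rather than the authors' implementation: the Node class, the convention that the root sits at level 1, and the hand-written parse (with words omitted) for "how can i manage my anger ?" are all assumptions of the sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A constituency-tree node holding only a syntactic label (no words)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def height(node: Node) -> int:
    """Number of levels in the tree; a single node has height 1."""
    return 1 + max((height(c) for c in node.children), default=0)


def truncate(node: Node, h: int) -> Node:
    """Return a copy of the tree keeping only its top h levels."""
    if h < 1:
        raise ValueError("h must be at least 1")
    if h == 1:
        return Node(node.label)  # drop everything below this level
    return Node(node.label, [truncate(c, h - 1) for c in node.children])


def to_bracket(node: Node) -> str:
    """Serialize the tree in bracketed notation."""
    if not node.children:
        return f"({node.label})"
    inner = " ".join(to_bracket(c) for c in node.children)
    return f"({node.label} {inner})"


# Hand-written, illustrative parse of "how can i manage my anger ?".
exemplar = Node("ROOT", [Node("SBARQ", [
    Node("WHADVP", [Node("WRB")]),
    Node("SQ", [
        Node("MD"),
        Node("NP", [Node("PRP")]),
        Node("VP", [Node("VB"), Node("NP", [Node("PRP$"), Node("NN")])]),
    ]),
    Node("."),
])])

for h in (3, 4, height(exemplar)):
    print(h, to_bracket(truncate(exemplar, h)))
```

Lower values of h give the decoder a coarser template and therefore more freedom to preserve the source meaning, which matches the trade-off described for Table 6.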

SGCP-R Analysis
ROUGE-based selection from the candidates favours paraphrases which have higher n-gram overlap with their respective source sentences, and hence may capture the source's meaning better. This hypothesis is supported by the results in Table 2 and Table 4, where we see higher values on automated semantic and human evaluation scores. While this helps in getting better semantic generations, it tends to result in higher TED values. One possible reason is that, when provided with the complete tree, fine-grained information is available to the model for decoding and it forces the generations to adhere to the syntactic structure. In contrast, at lower heights, the model is provided with less syntactic information but equivalent semantic information.

As can be seen from Table 7, SGCP not only incorporates the best aspects of both the prior models, namely SCPN and CGEN, but also utilizes the complete syntactic information obtained using the constituency-based parse trees of the exemplar.
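The selection step described above can be sketched as picking, among candidate paraphrases, the one with the highest n-gram overlap with the source. The sketch below is an assumption-laden stand-in: it uses a simple unigram F1 over whitespace tokens instead of an official ROUGE implementation, and the function names are hypothetical. The two candidates are reused from Table 3 purely as example strings, not as outputs at particular tree heights.

```python
from collections import Counter
from typing import List


def unigram_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between two whitespace-tokenized sentences.

    A simplified stand-in for ROUGE-1; not the official scorer.
    """
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def select_by_overlap(source: str, candidates: List[str]) -> str:
    """Return the candidate with the highest overlap with the source."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=lambda c: unigram_f1(c, source))


source = "what should be done to get rid of laziness ?"
candidates = [
    "how can i stop my laziness ?",
    "how do i get rid of laziness ?",
]
best = select_by_overlap(source, candidates)
print(best)
```

Because the score rewards lexical overlap with the source only, a selector of this kind has no incentive to prefer the candidate that follows the exemplar's syntax most closely, which is consistent with the higher TED values noted above.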

Qualitative Analysis
From the generations in Table 3, it can be observed that our model is able to capture both the semantics of the source text and the syntax of the template. SCPN, evidently, can produce outputs with the template syntax, but it does so at the cost of the semantics of the source sentence. This can also be verified from the results in Table 2, where SCPN performs poorly on PDS as compared to other models. In contrast, CGEN and SGCP retain much better semantic information, as is desirable. While generating sentences, CGEN often ends the sentence abruptly, as in example 1 in Table 3, where the output is cut off with "of" as its penultimate token. The problem of abrupt ending due to insufficient syntactic input length was highlighted in Chen et al. (2019a), and we observe similar trends. SGCP, on the other hand, generates more relevant and grammatical sentences.
Based on empirical evidence, SGCP alleviates this shortcoming, possibly due to dynamic syntactic control and decoding. This can be seen in example 3 in Table 3, where CGEN truncates the sentence abruptly (penultimate token = directors) but SGCP is able to generate a relevant sentence without compromising on grammaticality.

Limitations and Future directions
Not every natural-language English sentence can be converted to an arbitrary desired syntax. We note that SGCP does not take into account the compatibility of the source sentence and the template exemplar, and can freely generate syntax-conforming paraphrases. This, at times, leads to imperfect paraphrase conversion and nonsensical sentences like example 6 in Table 5 (is career useful in software ?). Identifying compatible exemplars is an important but separate task in itself, which we defer to future work.
Another important aspect is that the task of paraphrase generation is inherently domain agnostic. It is easy for humans to adapt to new domains for paraphrasing. However, due to the nature of the formulation of the problem in NLP, all the baselines as well as our models suffer from dataset bias and are not directly applicable to new domains. A prospective future direction is to explore paraphrase generation through the lens of domain independence.
Analyzing the utility of controlled paraphrase generations for the task of data augmentation is another interesting possible direction.

Conclusion
In this paper, we proposed SGCP, an end-to-end framework for the task of syntactically controlled paraphrase generation. SGCP generates a paraphrase of an input sentence while conforming to the syntax of an exemplar sentence provided along with the input. SGCP comprises a GRU-based sentence encoder, a modified RNN-based tree encoder, and a novel pointer-generator-based decoder. In contrast to previous works that focus on a limited amount of syntactic control, our model can generate paraphrases at different levels of granularity of syntactic control without compromising on relevance. Through extensive evaluations on real-world datasets, we demonstrate SGCP's efficacy over state-of-the-art baselines. We believe that the above approach can be useful for a variety of text generation tasks including syntactic exemplar-based abstractive summarization, text simplification and data-to-text generation.