Iterative Paraphrastic Augmentation with Discriminative Span Alignment

We introduce a novel paraphrastic augmentation strategy based on sentence-level lexically constrained paraphrasing and discriminative span alignment. Our approach allows for the large-scale expansion of existing resources, or the rapid creation of new resources from a small, manually-produced seed corpus. We illustrate our framework on the Berkeley FrameNet Project, a large-scale language understanding effort spanning more than two decades of human labor. Based on roughly four days of collecting training data for the alignment model and approximately one day of parallel compute, we automatically generate 495,300 unique (Frame, Trigger) combinations annotated in context, a roughly 50x expansion atop FrameNet v1.7.


Introduction
Data augmentation is the process of automatically increasing the size of a dataset with the goal of improving performance on a task of interest. It has been applied in many areas of machine learning including computer vision (Shorten and Khoshgoftaar, 2019) and speech recognition (Ragni et al., 2014; Ko et al., 2015).
In this paper, we focus on paraphrastic augmentation, a technique to automatically expand text-based datasets both in their overall size and in their degree of lexical and syntactic diversity, via the use of a paraphrase model. Broadly speaking, a paraphrase model outputs a sentence $S'$ given an input sentence $S$ such that $\mathrm{meaning}(S) \approx \mathrm{meaning}(S')$ and $S \neq S'$. Prior work has demonstrated the efficacy of paraphrastically augmented datasets on a variety of sentence-level tasks, including machine translation, natural language inference, and intent classification (e.g., Ribeiro et al., 2018; Hu et al., 2019a; Kumar et al., 2019). Here we focus on augmenting data for span labeling problems, where we are concerned with balancing the joint objectives of finding different ways to express meaning at the level of a word or phrase while ensuring the paraphrase is sensitive to the context of the surrounding sentence.
Often in paraphrastic augmentation an input sentence is rewritten one or more times, with the assumption that the outputs are meaning-preserving. For example, in sentiment analysis, data consists of $(\mathrm{Sentence}_i, \mathrm{Label}_i)$ pairs, where each $\mathrm{Label}_i \in \{0, 1\}$ indicates negative or positive sentiment. To augment this kind of dataset, we can paraphrase each $\mathrm{Sentence}_i$ with a model $f$ and thereby produce an additional $(f(\mathrm{Sentence}_i), \mathrm{Label}_i)$ pair, doubling the size of the dataset.
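The sentence-level case can be sketched in a few lines of Python. This is only an illustration: `toy_paraphrase` is a hypothetical stand-in for a real paraphrase model $f$, and the two-example dataset is invented.

```python
from typing import Callable, List, Tuple

LabeledSentence = Tuple[str, int]  # (sentence, label in {0, 1})

def augment(dataset: List[LabeledSentence],
            paraphrase: Callable[[str], str]) -> List[LabeledSentence]:
    """Return the original pairs plus one paraphrased copy of each, labels unchanged."""
    return dataset + [(paraphrase(sentence), label) for sentence, label in dataset]

def toy_paraphrase(sentence: str) -> str:
    # Hypothetical stand-in for a paraphrase model f; a real model rewrites freely.
    return sentence.replace("great", "excellent").replace("awful", "terrible")

data = [("The film was great.", 1), ("The food was awful.", 0)]
augmented = augment(data, toy_paraphrase)
```

Because the label is copied unchanged, the approach relies entirely on the paraphrase being meaning-preserving.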
In many language understanding tasks, however, data contains span labels of the form $(\mathrm{Sentence}_i, \{(\mathrm{start}_{i,1}, \mathrm{end}_{i,1}, \mathrm{type}_{i,1}), \ldots\})$, where the latter element is a set of tuples indicating each label's location (as a contiguous subsequence of the input tokens) and type. Although a paraphrase is expected to have the same meaning as the sentence from which it was generated, words and phrases are usually added, removed, or reordered. For a given annotated sentence, while we expect the same label types to still apply to a paraphrase, the location (start and end) will likely shift. To address this issue, we introduce a new model for span-based discriminative sentence alignment. Given an input sentence $S$, a paraphrase $f(S)$, and a span of tokens in $S$ representing a label location, the alignment model finds a semantically equivalent span in $f(S)$. We present the architectural details of this model, a dataset for span alignment, and corresponding results in §4.
A second problem is that most paraphrase models offer no control over specific words or phrases that are included in or excluded from the final output. Text-based data augmentation typically aims to increase lexical diversity, so it would be useful to force each tagged text span in the input to be rewritten in the paraphrase, ideally as a synonymous or semantically similar phrase via lexically constrained decoding (§3).
Figure 1: Framework for iterative paraphrastic augmentation illustrated on an actual system output. The original, manually-annotated sentence contains a tag over the word "corroborate". In Iteration 1, the sentence is paraphrased using a lexically constrained decoder with a negative constraint on "corroborate" and all associated inflectional forms, guaranteeing that it will not appear in the output. Then, a span alignment model is used to obtain a link between "corroborate" in the original sentence and "confirm" in the paraphrased sentence. All inflectional forms of "confirm" are then unioned with the set of negative constraints and the process repeats for a predetermined number of iterations.
Finally, we describe in §5 a framework that utilizes constrained paraphrasing and alignment in conjunction, iteratively, to augment datasets for span labeling problems. A schematic is given in Figure 1. We present the results of applying this framework to FrameNet in §6, including a new dataset with 495,300 unique (Frame, Trigger) pairs annotated in context.¹

Background
Monolingual Paraphrasing Coinciding with the improvement of machine translation, several works have explored sentential paraphrasing through back-translations (Mallinson et al., 2017; Wieting and Gimpel, 2018). One such model (Wieting and Gimpel, 2018) was used for sentence canonicalization, although its further usefulness was hindered by the lack of control over the paraphrasing process. Hu et al. (2019b) introduced constrained decoding (Post and Vilar, 2018) to sentential paraphrasing, enabling lexical control over the paraphrases.

Automatic Lexicon Expansion
As an alternative to manual labor, past work has sought to automatically build on existing semantic resources. Snow et al. (2006) used hypernym predictions and coordinate term classifiers to add 10,000 new WordNet entries with high precision. FrameNet+ (Pavlick et al., 2015) tripled the size of FrameNet by substituting words from PPDB (Ganitkevitch et al., 2013), a collection of primarily word-level paraphrases obtained via bilingual pivoting. The paraphrases lack context, so, for example, "quite" might be listed as a paraphrase of "especially" without any means to determine when one might not be an appropriate substitute. While the expansion itself involved little cost, the lexicalized nature of their procedure failed to capture word senses in context and resulted in many false positives, requiring costly manual evaluation of every sentence. In contrast, we seek to mitigate false positives and enhance lexical and syntactic diversity by using a context-aware paraphrase model.
¹ http://nlp.jhu.edu/parabank
Paraphrasing for Structured Prediction Structured prediction finds a mapping between a surface form and some aspect of its underlying structure. Natural language allows for surface forms that express the same meaning (paraphrases), which makes learning this mapping nontrivial. Berant and Liang (2014) leveraged unstructured Q&A data by learning a paraphrasing model which maps a new query to existing ones with known structures. More relevant to our work, Wang et al. (2015) built a semantic parser from a small seed lexicon by generating canonical utterances from a domain-general grammar and then manually collecting paraphrases of these utterances through crowd-sourcing. A semantic parser is then trained on the paraphrases to produce the underlying structures that generated them. Our work is distinct in that we automatically expand our seed lexicon, collecting human judgments on a small subset of outputs in order to assess quality. Moreover, we introduce a general framework for augmenting data for span labeling, while Wang et al. (2015) focused on parsing.
(2019) introduced a pointer-network-based phrase-level aligner for paraphrase alignment which obtains high recall on several tasks. Syntactic chunking is used to build a candidate set of phrases in both source and paraphrase sequences, which the model is then tasked with aligning. Their model is applied to an open alignment task, where more than one phrase in the source and paraphrase should be aligned, differing from the setting described in §4.
The Berkeley FrameNet Project FrameNet (Baker et al., 2007) is the application of frame-semantic theory (Fillmore, 1982) to real-world data. Organizationally, each frame contains a description of a concept, a list of entities participating in the frame (frame elements), and a list of lexical units, which are the semantically similar words that evoke, or trigger, the given concept. Figure 2 illustrates a sentence labeled under the FrameNet protocol. As of FrameNet v1.7, the resource contains roughly 1,200 frames, 8,500 annotated lexical units, and 200,000 annotations.
FrameNet has been applied to a variety of NLP tasks, including semantic role labeling (Gildea and Jurafsky, 2002), question-answering (Shen and Lapata, 2007), information extraction (Ruppenhofer and Rehbein, 2012), and recognizing textual entailment (Burchardt and Frank, 2006). As an entirely manually-created resource, FrameNet's utility is limited by the size of its lexical inventory and number of annotations (Shen and Lapata, 2007; Pavlick et al., 2015), which makes it an ideal candidate for augmentation.

Lexically Constrained Paraphrasing
Sentential paraphrasing is a sequence generation problem where the goal is to find an output sequence conveying similar semantics to the input sequence while also ensuring that the two sequences are lexically or syntactically distinct. Prior work has approached this problem with sequence-to-sequence neural networks (Wieting and Gimpel, 2018; Hu et al., 2019a), where an encoder embeds the input sequence into a fixed-dimensional space and a decoder produces a sequence autoregressively. Often, the decoder uses beam search to explore the output space more efficiently. Lexically constrained decoding allows one to dynamically include or exclude token sequences from the output via user-supplied positive or negative constraints. When combined with paraphrasing, it can boost external NLP task performance via data augmentation (Hu et al., 2019a). Our work employs negative constraints, which exclude certain token sequences from the output. This is achieved by setting the likelihood of the last token of each constrained sequence to zero whenever the preceding tokens of that sequence have just been generated (Hu et al., 2019a).
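The negative-constraint rule can be illustrated with a minimal sketch for a single decoding step. It is not the decoder used in this work: the function name, the dictionary of next-token log-probabilities, and the subword split of "corroborate" are all assumptions made for illustration, and a real implementation would operate on a beam of hypotheses.

```python
import math
from typing import Dict, List, Sequence

def apply_negative_constraints(log_probs: Dict[str, float],
                               generated: Sequence[str],
                               banned: List[Sequence[str]]) -> Dict[str, float]:
    """Mask every next token that would complete a banned token sequence."""
    masked = dict(log_probs)  # leave the caller's scores untouched
    for seq in banned:
        prefix, last = list(seq[:-1]), seq[-1]
        # Block the last token only when the tokens generated so far end with the prefix.
        prefix_matches = not prefix or list(generated[-len(prefix):]) == prefix
        if last in masked and prefix_matches:
            masked[last] = -math.inf  # log-probability of -inf is probability zero
    return masked

# Hypothetical next-token scores and a hypothetical subword split of "corroborate".
step_scores = {"confirm": -0.5, "corroborate": -0.2, "rate": -1.0}
banned_sequences = [["corroborate"], ["corrobo", "rate"]]
after_to = apply_negative_constraints(step_scores, ["to"], banned_sequences)
after_prefix = apply_negative_constraints(step_scores, ["to", "corrobo"], banned_sequences)
```

In the first call only the single-token constraint fires; in the second, "rate" is also blocked because the banned prefix has just been generated.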
We recreated the rewriter described in the prior work by using a paraphrase corpus (Hu et al., 2019c) that offers richer lexical diversity. We followed the model architecture described in Hu et al. with a few minor changes: 1) we use SentencePiece (Kudo and Richardson, 2018) unigrams instead of tokenization, following Hu et al.; 2) we do not use source factors, as SentencePiece unigrams are case-sensitive. These changes allow us to rewrite raw text without tokenization.²

Alignment Models
We present a BERT-based model (Devlin et al., 2018) to align spans of text between paraphrastic sentence pairs. The model is trained and evaluated on a new dataset released alongside this paper, consisting of 36,417 labeled sentence pairs.

Word Alignment Baselines
We compare our span alignment model (§4.2) with two word-level alignment baselines: FastAlign (Dyer et al., 2013) and DiscAlign (Stengel-Eskin et al., 2019). The former is a fast implementation of IBM Model 2 (Brown et al., 1993) which decomposes the conditional probability of a target sequence given a source sequence into a lexical model and an alignment model. FastAlign is an asymmetric model, meaning that it must be run in both directions (source to paraphrase and paraphrase to source) and then these alignments must be combined using some heuristic; we use the grow-diag-final-and heuristic. A FastAlign model was run over the concatenation of the test data, the train data, and paraphrased FrameNet data to obtain the final test alignments.
DiscAlign is a discriminatively-trained neural alignment model which uses the matrix product of contextualized encodings of the source and paraphrase word sequences to directly model the probability of an alignment given the source and paraphrase sequences. Unlike FastAlign, which is trained on bitext alone, DiscAlign is pre-trained on bitext and fine-tuned on gold-standard alignments. For this task, a DiscAlign model was pre-trained with 141 million sentences of ParaBank data (Hu et al., 2019b) and fine-tuned on a 713-sentence subset of the Edinburgh++ corpus (Cohn et al., 2008). Both DiscAlign and FastAlign have been successfully used for cross-lingual word alignment, with DiscAlign outperforming FastAlign on Arabic-English and Chinese-English alignment by a large margin (Stengel-Eskin et al., 2019).

Span Alignment Model
Architecture Our model takes as input two tokenized English-language sentences $S$ (source, with $n$ tokens) and $S'$ (reference, with $m$ tokens), where $S'$ is a paraphrase of $S$. The model also takes as input a span $s$ in $S$: a contiguous subsequence of tokens with length between 1 and $n$, initially represented as a tuple of (start, end) offsets into the source-side token sequence. Given this, the model predicts a span $\hat{s} \in \{(i, j) \mid 1 \le i \le j \le m\}$, representing the best alignment between $s$ and the $O(n^2)$ possible candidate spans³ in $S'$.
In the forward pass, we embed $S$ and $S'$ using a pre-trained 12-layer BERT-Base model with frozen parameters, obtaining a hidden vector $t_i \in \mathbb{R}^{768}$ for each of the $(m + n + 3)$ input tokens. $S$ and $S'$ are embedded at the same time, i.e. as a single sentence-pair input whose three extra positions are BERT's special tokens. We then obtain a fixed-size representation $S \in \mathbb{R}^{768}$ of the source-side span by mean-pooling the hidden states corresponding to the token positions from the start offset $s_1$ to the end offset $s_2$. In the same way, we compute span representations $C_i$ for each of the $O(n)$ reference-side candidate answer spans whose length⁴ is within $k$ of the length of the source-side span $s$. For each span pair representation $(S, C_i)$ we create an aggregate $V_i \in \mathbb{R}^{1540}$ by concatenating three vectors:
• The element-wise difference of the two span representations
• The element-wise maximum of the two span representations
• Positional cues (Cue): start index and length per span⁵
Intuitively, if the element-wise difference of the two span representations is close to the zero vector, the spans are likely close in meaning. Concatenating element-wise maxima to the representation worked best empirically, suggesting that extreme values may contain information not present in other parts of the representation. Since word spans in the source likely start in a similar position and are of a similar length as compared to corresponding word spans in the reference, the positional cues provide a useful signal. The aggregate vector $V_i$ is fed into a simple feedforward neural network $f$, consisting of one layer with 770 hidden units, PReLU activations, batchnorm, and a sigmoid output layer.
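The feature construction can be sketched in plain Python, with lists of floats standing in for BERT hidden states. The function names, the 0-based inclusive offsets, and the ordering of the concatenated parts are assumptions for illustration; only the dimensions (768 + 768 + 4 = 1540) follow from the description above.

```python
from typing import List, Tuple

Vector = List[float]
Span = Tuple[int, int]  # 0-based inclusive (start, end) token offsets (an assumption)

def mean_pool(hidden: List[Vector], span: Span) -> Vector:
    """Average the hidden vectors of the tokens covered by the span."""
    rows = hidden[span[0]:span[1] + 1]
    return [sum(column) / len(rows) for column in zip(*rows)]

def candidate_spans(m: int, source_length: int, k: int) -> List[Span]:
    """All spans over m reference tokens whose length is within k of the source span's."""
    spans = []
    for start in range(m):
        for length in range(max(1, source_length - k), source_length + k + 1):
            if start + length <= m:
                spans.append((start, start + length - 1))
    return spans

def aggregate(src_vec: Vector, cand_vec: Vector, src_span: Span, cand_span: Span) -> Vector:
    """Concatenate element-wise difference, element-wise maximum, and positional cues."""
    difference = [a - b for a, b in zip(src_vec, cand_vec)]
    maximum = [max(a, b) for a, b in zip(src_vec, cand_vec)]
    cues = [float(src_span[0]), float(src_span[1] - src_span[0] + 1),
            float(cand_span[0]), float(cand_span[1] - cand_span[0] + 1)]
    return difference + maximum + cues

# Fake 768-dimensional hidden states: token i is the constant vector [i, i, ...].
hidden_states = [[float(i)] * 768 for i in range(10)]
source_repr = mean_pool(hidden_states, (2, 3))
candidate_repr = mean_pool(hidden_states, (5, 5))
v = aggregate(source_repr, candidate_repr, (2, 3), (5, 5))
```

Restricting candidates to lengths within $k$ of the source span is what reduces the candidate set from quadratic to linear in the sentence length.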
We use binary cross entropy loss with soft labels: rather than each candidate span $C_i$ being labeled as 1 or 0 depending on whether it is the gold-standard span or not, we assign labels according to the function $2^{-d(S, C_i)}$, where $d$ measures the absolute difference of the start and end offsets between two spans, $d(a, b) = |a_1 - b_1| + |a_2 - b_2|$. In this way, the gold span is given a label of 1, candidate spans that are close to the gold-standard span are given partial credit, and partial credit exponentially decreases towards 0 as the distance between the candidate span and gold-standard span increases. In Tables 1 and 2, we refer to this loss as "soft binary cross entropy", or SBCE. At inference time, we choose the span corresponding to the aggregate representation $V_i$ that is assigned the highest score by the neural network $f$, i.e. $\hat{s} = \arg\max_i f(V_i)$. A diagram illustrating the inference procedure is given in Figure 3.
Figure 3: Span alignment inference. A BERT-based representation of the input span "corroborate" is passed to a neural network $f$ that scores the input span against each possible candidate span.
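A small sketch of the soft labels and the inference rule follows. It assumes, as the prose describes, that the distance is taken between each candidate span and the gold-standard span; the function names and the example spans and scores are hypothetical.

```python
import math
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets

def span_distance(a: Span, b: Span) -> int:
    """d(a, b) = |a_1 - b_1| + |a_2 - b_2|."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def soft_label(gold: Span, candidate: Span) -> float:
    """Label 1 for the gold span, decaying as 2^(-d) for candidates further away."""
    return 2.0 ** -span_distance(gold, candidate)

def soft_bce(prediction: float, target: float) -> float:
    """Binary cross entropy against a soft target in [0, 1]; prediction must be in (0, 1)."""
    return -(target * math.log(prediction) + (1.0 - target) * math.log(1.0 - prediction))

def predict(candidates: List[Span], scores: List[float]) -> Span:
    """Inference: return the candidate span with the highest score."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return candidates[best]

gold_span = (3, 4)
labels = [soft_label(gold_span, c) for c in [(3, 4), (2, 4), (1, 6)]]
chosen = predict([(0, 0), (1, 2), (3, 3)], [0.1, 0.8, 0.3])
```

A candidate one offset away from the gold span receives a target of 0.5 rather than 0, so near misses are penalized less than distant spans.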
Data To train and evaluate our model we crowdsourced a span-alignment dataset consisting of 36,417 labeled sentence pairs, which we release to the community. Each instance in the dataset consists of a sentence, a span in the sentence, an automatic paraphrase, and a span in the automatic paraphrase, where the two spans have been manually aligned. The source text was taken from FrameNet, which already has span annotations, so we fixed these spans and asked annotators to identify the corresponding spans in each automatic paraphrase. Annotators were given the option to decide that no semantically equivalent phrase was present, which occurred roughly 9% of the time. Of the cases where annotators did select a span, they chose the same span approximately 88% of the time. The text content under the original sentence spans was diverse, with roughly 10k unique phrases; approximately 4 alignments per phrase.

Results
Since the baseline aligners are word-level, and our model is span-level, in order to have a fair comparison we evaluate on span F1 (Table 1), computing the overlap between the reference span in the paraphrase and the predicted span. Predicted spans are obtained from word-level alignments by following the alignments of each word in the reference span to the paraphrase, and taking the maximal span covered by those alignments. The span F1 metric allows partial credit to be assigned to the model in cases where the predicted span and reference span do not match exactly.
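The conversion from word alignments to a span, and one way of scoring span overlap, can be sketched as follows. The token-level definition of precision and recall used here is an assumption (the text does not spell out the exact formula), and the alignment links are invented.

```python
from typing import Iterable, Optional, Tuple

Span = Tuple[int, int]  # inclusive (start, end) token offsets

def span_from_word_alignments(source_span: Span,
                              links: Iterable[Tuple[int, int]]) -> Optional[Span]:
    """Follow each aligned source word into the paraphrase and take the maximal covered span."""
    targets = [t for s, t in links if source_span[0] <= s <= source_span[1]]
    return (min(targets), max(targets)) if targets else None

def span_prf(predicted: Optional[Span], gold: Span) -> Tuple[float, float, float]:
    """Token-overlap precision, recall, and F1 between a predicted and a gold span."""
    if predicted is None:
        return 0.0, 0.0, 0.0
    overlap = max(0, min(predicted[1], gold[1]) - max(predicted[0], gold[0]) + 1)
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / (predicted[1] - predicted[0] + 1)
    recall = overlap / (gold[1] - gold[0] + 1)
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical word alignment links (source index, paraphrase index).
word_links = [(0, 0), (1, 2), (2, 4), (3, 5)]
predicted_span = span_from_word_alignments((1, 2), word_links)
p, r, f1 = span_prf(predicted_span, (2, 3))
exact_match = predicted_span == (2, 3)
```

Here the predicted span overshoots the gold span by one token, so it earns partial credit under overlap scoring but none under exact matching.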
We also evaluate spans with exact matching (Table 2), where credit is only assigned if the predicted span matches the gold span exactly. Table 1 shows that when evaluated on span overlap, our model significantly outperforms both baselines. Table 2 shows that these results generalize to the more difficult exact match setting. While all models experience a drop in performance, our model continues to outperform both baselines. Because no prediction threshold was used in the baselines (unlike in our model) the values for precision and recall are equal for the baselines but can differ slightly for our model, as the addition of a threshold allows the model to incur a false negative without predicting a false positive.

Discussion
Tables 1 and 2 reflect the strength of our model for span alignment. Because our model is trained to choose spans by design, the probability of an exact match is higher a priori since its task is more constrained: rather than choosing the words of a span independently, it chooses them as a set, with limits on the difference in length between the source and target spans. This is reflected in the better performance of our model on both the exact match and the soft matching evaluation (where an exact match counts as perfect precision and perfect recall, greatly boosting scores).
The last two rows of Table 2 illustrate that SBCE boosts recall while keeping precision virtually intact; our intuition is that this training regime gives the model more confidence at inference time when scoring spans which appear similar to, but slightly different from the assumed correct answer, where those spans were then ultimately correct.
To determine whether our model was simply memorizing information associated with each lexical unit, we ran an experiment where all source-side spans in the test set were guaranteed to not have been observed at training time⁶. Under this setting, we lost roughly two points of F1, suggesting that the model generalizes well to unseen words.

Iterative Augmentation Procedure
Our alignment model (§4) is paired with the lexically constrained paraphrase model (§3) to form an iterative procedure for augmenting span-labeled data of the form introduced in §1: sentences paired with sets of (start, end, type) labels. The process consists of three steps: constraint expansion, paraphrasing, and aligning. In constraint expansion, we negatively constrain on a text span of interest, including its upper/lowercase counterparts and morphological variants using the pattern software package (Smedt and Daelemans, 2012). By applying negative constraints, the paraphrase model is forced to generate a semantically equivalent sentence with a different surface form of the labeled text, thereby increasing the size of the lexicon. In the alignment stage, we score representations of each candidate span in the paraphrase together with the representation of the original text span, selecting the one with the highest score under the model. Using the newly obtained aligned phrase as the input to constraint expansion, we repeat the process for a predetermined number of iterations.
To encourage the model to produce as many new words as possible, we perform frame-wise constraint unioning: taking the union of all the constraint sets from sentences that originated in the same frame, and then using that as the constraint set for those sentences in the next iteration. This prevents the same lexical unit from being used by rewrites of different example sentences in the same frame.
⁶ In our main experiments, (original sentence, trigger, paraphrase, alignment) combinations are disjoint between train and test, but it is possible to observe the same trigger (with a different sentence, paraphrase, or alignment) at both train- and test-time.
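The three steps and the frame-wise unioning can be sketched as a loop. This is a toy illustration under stated assumptions: the paraphraser, aligner, and inflection function are hypothetical stubs rather than the models of §3 and §4, each iteration is assumed to rewrite the previous iteration's output, and the seed triggers of a frame are assumed to be unioned before the first iteration.

```python
from typing import Callable, Dict, List, Set, Tuple

Example = Tuple[str, str, str]  # (frame, sentence, trigger)

def iterative_augment(seeds: List[Example],
                      paraphrase: Callable[[str, Set[str]], str],
                      align: Callable[[str, str, str], str],
                      inflections: Callable[[str], Set[str]],
                      iterations: int) -> List[Tuple[int, str, str, str]]:
    """Repeat constraint expansion, paraphrasing, and aligning for a fixed number of rounds."""
    constraints: Dict[str, Set[str]] = {}
    for frame, _, trigger in seeds:
        constraints.setdefault(frame, set()).update(inflections(trigger))
    current, outputs = list(seeds), []
    for iteration in range(1, iterations + 1):
        produced = []
        for frame, sentence, trigger in current:
            rewritten = paraphrase(sentence, constraints[frame])  # negative constraints
            new_trigger = align(sentence, trigger, rewritten)     # span alignment
            produced.append((frame, rewritten, new_trigger))
            outputs.append((iteration, frame, rewritten, new_trigger))
        # Frame-wise constraint unioning at the end of the iteration.
        for frame, _, trigger in produced:
            constraints[frame].update(inflections(trigger))
        current = produced
    return outputs

# Hypothetical stubs standing in for the real models.
SYNONYMS = ["corroborate", "confirm", "verify", "validate"]

def toy_inflections(word: str) -> Set[str]:
    return {word, word + "s"}

def toy_paraphrase(sentence: str, banned: Set[str]) -> str:
    replacement = next(w for w in SYNONYMS if w not in banned)
    return " ".join(replacement if tok in banned else tok for tok in sentence.split())

def toy_align(source: str, trigger: str, rewritten: str) -> str:
    return next(tok for tok in rewritten.split() if tok in SYNONYMS)

results = iterative_augment([("Evidence", "They corroborate the story", "corroborate")],
                            toy_paraphrase, toy_align, toy_inflections, iterations=3)
```

Because the constraint set only grows, each round is pushed towards a trigger that has not yet been produced for that frame.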

Experiments
Our approach lends itself to two scenarios: in §6.1 we are concerned with building a semantic resource from scratch, whereas in §6.2 we are concerned with expanding a pre-existing resource. We demonstrate the usefulness of our approach on downstream tasks in §6.3, where we apply our generated paraphrastic dataset to the task of Frame Identification. Following Pavlick et al. (2015), we consider FrameNet as an illustrative resource motivating augmentation. In all experiments we treat each system output (paraphrase and alignment) as evoking the same frame as the original FrameNet input.

Building FrameNet (nearly) from Scratch
To simulate constructing a resource using iterative paraphrastic augmentation we consider what FrameNet would have looked like in its earliest stages of development⁷. Using each object's "created date" attribute, we ablate out all but the 20 earliest-added frames, the three earliest-added lexical units per frame, and the three earliest-added annotations per lexical unit, for a total of at most⁸ 180 annotations in our seed corpus.
We then ran 10 iterations of augmentation with a beam size of 30 for the paraphrase model. For each input, we ran the alignment model on each of the top-20 beam elements and chose the beam element with the highest score under the alignment model. At the end of each iteration, constraints were unioned frame-wise. This resulted in 1710 paraphrased and aligned sentences⁹, and 1316 unique (Frame, LexicalUnit) combinations. Some generated words lemmatized to the same form, causing the number of lexical units to be less than the number of sentences.
Automatic Evaluation Prior to ablation, the 20 frames in the seed corpus contained a total of 360 lexical units, of which 60 were chosen to remain in the seed. We treat the set of 300 unobserved lexical units as gold standard and compute precision and recall of the lexical units contained within the 1710-sentence system output. Lexical units were only considered correct if they were in the correct frame; comparisons were made between (Frame, LexicalUnit) combinations.
⁷ The decision to select our seeds based on frame creation date (in contrast to some other sub-selection strategy) was informed by discussions with FrameNet creators (personal communication).
⁸ In practice we were left with slightly fewer (171), as we removed sentences that were observed by the alignment model at training time, and some lexical units contained fewer than three annotations.
⁹ I.e., 171 sentences rewritten 10 times each.
Our system produced 128 true positives, 1188 false positives, and 172 false negatives, yielding a precision of 9.7% and recall of 42.7%. If we include the 60 lexical units from the seed corpus, recall increases to 52.7% of the total 360. Although we recover over half of the lexical units, there are many false positives. Upon manual inspection, we found that many of the words predicted by the framework were valid, yet absent from FrameNet, motivating us to develop a more comprehensive method of evaluation.
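The precision and recall figures for the 300 unobserved lexical units follow directly from the reported counts, as this short check shows (variable names are ours):

```python
true_positives, false_positives, false_negatives = 128, 1188, 172

predicted_units = true_positives + false_positives  # unique (Frame, LexicalUnit) outputs
gold_units = true_positives + false_negatives       # unobserved gold lexical units

precision = true_positives / predicted_units
recall = true_positives / gold_units
```

The denominators match the 1316 unique (Frame, LexicalUnit) combinations produced by the system and the 300 held-out gold lexical units.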

Manual Evaluation
We conducted a 3x-redundant manual evaluation of the 1710 system outputs using trusted, locally trained annotators. For each system output (a paraphrase with a highlighted phrase corresponding to the span predicted by the alignment model) we provided a description of the anticipated frame¹⁰ and three gold-standard example annotations¹¹ to reinforce the frame definition. Workers were then asked to rate three candidate sentences, each with a highlighted trigger phrase, on a scale of 0-100, as to how well the highlighted trigger evoked the given frame in the context of the sentence. Unknown to annotators, of the three candidate sentences in each task, only one of them (in a random position) was an actual system output; the other two were positive or negative gold-standard sentences taken from FrameNet:
1. System output. Frame a and lexical unit b.
2. Gold in-frame sentence. Frame a and lexical unit ¬b.
3. Gold out-of-frame adversarial example. Frame ¬a.
The scores collected on gold in- and out-of-frame control sentences provide a means to ground the interpretation of scores on system outputs and also enable us to gauge overall annotator understanding of the task by scoring sentences for which we know the correct response.
¹⁰ We assume that the paraphrase transformation is label-preserving, so the anticipated frame is simply the frame of the original FrameNet sentence.
¹¹ The trigger words in the example sentences were made to be disjoint with the trigger words in the candidate sentences in order to avoid biasing annotators.
Since each system output was judged by three distinct annotators, we average each triple of judgments and treat values less than 50 as a rejection ("the highlighted trigger, in the context of the sentence, does not evoke the given frame") and values greater than or equal to 50 as an acceptance. Gold in- and out-of-frame sentences had acceptance rates of 95.26% and 6.57% respectively, suggesting workers possessed a relatively strong understanding of the task. Figure 4 provides a sample of actual system outputs and associated individual (non-aggregated) scores.
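The aggregation rule amounts to averaging each triple of scores and thresholding at 50. A minimal sketch, with invented judgment triples and hypothetical function names:

```python
from statistics import mean
from typing import List, Sequence

def is_accepted(judgments: Sequence[float], threshold: float = 50) -> bool:
    """Average the judgments for one item; a mean at or above the threshold is an acceptance."""
    return mean(judgments) >= threshold

def acceptance_rate(items: List[Sequence[float]]) -> float:
    """Fraction of items whose averaged judgment counts as an acceptance."""
    return sum(is_accepted(j) for j in items) / len(items)

# Hypothetical triples of 0-100 scores, one triple per judged sentence.
example_items = [[40, 50, 60], [15, 45, 80]]
rate = acceptance_rate(example_items)
```

Averaging before thresholding means a single dissenting annotator does not by itself reject an output.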
Filtering Methods We experiment with several methods of filtering system outputs, providing a trade-off between the competing goals of quality and size. Each system output has an associated iteration number, score under the paraphrase model, and score under the alignment model; each filtering method then uses this information to select a subset of the unfiltered system outputs.
We report the precision (the ratio of elements in the subset that had a score over 50) and recall (the number of elements in the subset with a score over 50, divided by the number of elements in the unfiltered set that also had a score over 50) in Table 3. The upper section of Table 3 presents results for a variety of heuristic filtering methods, e.g. the subset of system outputs with an iteration number of three or lower, while the lower section presents results for a neural filtering model.
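The precision and recall of a filter can be computed as sketched below. The records, field names, and scores are invented for illustration, and the acceptance test uses the "greater than or equal to 50" rule from the manual evaluation, which is an assumption about how boundary cases are treated.

```python
from typing import Callable, Dict, List, Tuple

Output = Dict[str, float]  # hypothetical record: iteration, model scores, human score

def evaluate_filter(outputs: List[Output],
                    keep: Callable[[Output], bool],
                    threshold: float = 50) -> Tuple[float, float]:
    """Precision and recall of the kept subset against averaged human judgments."""
    kept = [o for o in outputs if keep(o)]
    acceptable_all = sum(o["human"] >= threshold for o in outputs)
    acceptable_kept = sum(o["human"] >= threshold for o in kept)
    precision = acceptable_kept / len(kept) if kept else 0.0
    recall = acceptable_kept / acceptable_all if acceptable_all else 0.0
    return precision, recall

# Invented system outputs with averaged human scores.
system_outputs = [
    {"iteration": 1, "human": 90},
    {"iteration": 2, "human": 45},
    {"iteration": 3, "human": 70},
    {"iteration": 8, "human": 60},
    {"iteration": 9, "human": 10},
]
filter_precision, filter_recall = evaluate_filter(
    system_outputs, keep=lambda o: o["iteration"] <= 3)
```

Any of the heuristic filters can be expressed as a different `keep` predicate over the same records.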
The neural model takes as input a system output's iteration number, score under the paraphrase model, and score under the alignment model, and produces a score between 0 and 1, where 0 represents a decision to filter an output, and 1 represents a decision to keep it. Architecturally, the model is a feed-forward neural network with two hidden layers, 10 units per hidden layer, and a sigmoid output layer, trained to minimize binary cross entropy loss. We trained one model to favor precision by downweighting the training loss when the label was 1, and a second model to favor recall by downweighting when the label was 0. As training data, we used the 1710 aggregated manual judgments from above (where each system output has a label of 0 or 1), plus 2988 additional judgments collected specifically for this model. We split the data as 90% train (4228) and 10% test (470), and present results¹² in the lower section of Table 3.
¹² Results in the upper section of Table 3 are reported over the 1710 system outputs from §6.1, while the results in the lower section of the table are reported over the 470-element test set.
Frame: Judgment
Original: British television is almost as widely admired abroad as it is at home.
Paraphrase: Britain's TV is almost as much advertised abroad as it is at home.
Score: 15
Frame: Posture
Original: They sat facing each other, so they might look as much as they wished, and then began to talk.
Paraphrase: The two of them gathered together to appear as they wished, and then began to speak.
Score: 45
Frame: Motion
Original: The smoke was drifting slowly across the farm buildings in the still air.
Paraphrase: In the still air, the smoke streaked slowly through the farm buildings.
Score: 90
Figure 4: Sample of actual system outputs and associated manually judged scores. Annotators did not have access to the original sentence when assigning scores, but the originals are provided here to illustrate the way in which the paraphrase and alignment models function. In the first example, the paraphrase model makes a mistake; in the second, the sentence is roughly synonymous but borderline out-of-frame; in the third, both the paraphrase and alignment are high-quality.

Discussion
The upper section of Table 3 suggests that iteration number, paraphrase model score, and aligner model score each have slightly different filtering characteristics, and a simple conjunction of criteria achieves higher precision than any condition alone. The P-Classifier, optimized to select a high-precision subset of the data, achieves higher precision than any of the heuristic methods, and higher recall than the highest-precision heuristic method. The precision of the P-Classifier (95%) is roughly the same as the human-level acceptance rate on gold in-frame sentences (95.26%) while generating a resource that is 2.28x as large as the original. A higher recall subset may be obtained with the R-Classifier, which retains 96.99% of acceptable outputs with a precision of 81.19%.

Expanding Existing FrameNet
In this section we report the results of applying large-scale iterative augmentation to an existing resource. As in our reconstruction experiment, we ran 10 iterations of augmentation, but with minor configuration changes to enable faster processing over the roughly 200,000 FrameNet annotations¹³. The paraphrase model used a beam size of 20 and we ran the alignment model on each of the top-3 beam elements, choosing the beam element with the highest score under the alignment model. We did not perform frame-wise constraint unioning.
¹³ In practice, we filtered out sentences with greater than 80 tokens due to a limitation in the paraphrase model, leaving 198,368, or 99.55% of the original sentences.
Our unfiltered dataset, which excludes the original FrameNet data, contains 1,983,680 automatically paraphrased and aligned sentences and 495,300 (Frame, Trigger) combinations¹⁴ annotated in context. Of the 495,300 new triggers, 428,416 are unique after applying lemmatization; each lemma has 4.63 automatic in-context annotations on average. We use the filter models from §6.1 to select high quality and high quantity subsets of the unfiltered data; each system output in our data release has an associated score from both filter classifiers to enable post-hoc filtering. The P-Classifier retains 138,797 sentences and 33,332 (Frame, Trigger) combinations, while the R-Classifier retains 1,807,235 sentences and 425,050 combinations. To enable further experimentation, each sentence in our release contains a unique identifier linking it to FrameNet v1.7. Because our data only contains alignments of triggers and not frame elements, it cannot be directly used for full FrameNet SRL. However, by additionally applying positive constraints on frame element spans during lexically constrained decoding, an alignment link may be trivially obtained, allowing our framework to be used for full SRL.

Using Paraphrastic Data on a Downstream Task
Table 3: Human evaluation of system outputs across several filtering methods, with manually-judged Precision for the subset of outputs remaining after applying the given filter, Recall of sentences manually judged to be acceptable, and the Multiple (in terms of number of sentences) of the resulting dataset in relation to the original seed corpus. Filtering methods consider the iteration number, and scores from the paraphrase and aligner models for a given system output. The "lax" row applies a filter consisting of the conjunction of the criteria from rows 3, 5, and 7 (relatively lenient conditions) whereas the "strict" row conjoins the criteria from rows 2, 4, and 6 (which are stricter, and lead to higher precision but fewer lexical units).
We have demonstrated the usefulness of iterative paraphrastic augmentation for expanding lexical resources but have not shown how the resulting data is useful for downstream tasks, other than as a means to guide future lexicographical additions. The dataset generated in §6.2 naturally lends itself to several downstream tasks such as word sense disambiguation (Das et al., 2010b) or Frame Identification, a major subtask (Das et al., 2010a; Hermann et al., 2014) of FrameNet semantic role labeling (SRL). In this section, we show how paraphrastic augmentation can improve Frame ID model robustness in low-resource settings.
Given an ontology in a new domain, it is often prohibitively expensive to annotate entire documents, and full-document annotation may not provide full coverage of the ontology due to the rarity of some ontological types. A commonly-used alternative to full-document annotation is exemplar-based annotation, where several canonical examples (or "exemplars") are identified for each ontological type, ensuring at least full coverage of the ontology. Below, we conduct experiments to show that the addition of paraphrastic data to full-document and exemplar annotations boosts Frame Identification model performance.
Task FrameNet parsing (Kshirsagar et al., 2015; Roth and Lapata, 2015; Swayamdipta et al., 2018) is an established task in the field of semantic parsing. Most previous work has viewed FrameNet parsing as a semantic role labeling task, where the goal is to identify the frame and label all the arguments given a sentence with a known trigger span, but little attention has been paid to identifying trigger spans themselves.
Given the practical importance of finding triggers, we focus on jointly identifying both triggers and frames, rather than frames alone.
Specifically, given a sentence consisting of a sequence of words, our task is to find all substrings 15 of the sentence that trigger a frame and to identify the corresponding frames. We pose this as a span tagging problem, with trigger spans being tagged with the associated frame and non-trigger spans tagged as NULL. 16
Model We adopt a two-pass Long Short-Term Memory (LSTM) model for the frame identification task. We first convert the sentence $s = s_1, s_2, \ldots, s_I$ into a sequence of embedding vectors $e^0_1, e^0_2, \ldots, e^0_I$, where each embedding $e^0_i$ is a concatenation of GloVe, BERT (first subtoken, fixed), character and POS embeddings (Pennington et al., 2014; Devlin et al., 2018; Alberti et al., 2019). Then we use an $l$-layer stacked bidirectional LSTM model (Hochreiter and Schmidhuber, 1997) to obtain a contextual embedding for each word:
$$e^l_1, e^l_2, \ldots, e^l_I = \mathrm{BiLSTM}(e^0_1, e^0_2, \ldots, e^0_I).$$
We then apply another unidirectional LSTM model on top to get a representation for a span $s_{i:j}$:
$$e_{i:j} = \mathrm{LSTM}(e^l_i, e^l_{i+1}, \ldots, e^l_j).$$
As in the alignment model, we set a maximum span length to reduce the computational complexity from $O(I^2)$ to $O(I)$ 17 . A fully-connected neural network is then applied to transform the representation $e_{i:j}$ into a logit vector, which is then normalized by a softmax into a distribution over the label set comprising the frames and NULL. We train with cross-entropy loss.
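To make the span tagging formulation and the effect of the length cap concrete, the following minimal sketch enumerates candidate spans and assigns each one a frame label or NULL. The function names, the example sentence, and the cap of 3 tokens are illustrative assumptions; the maximum span length actually used is not stated in this section.

```python
from typing import Dict, Iterator, List, Tuple

Span = Tuple[int, int]  # inclusive 0-based token indices (i, j)


def enumerate_spans(num_tokens: int, max_len: int) -> Iterator[Span]:
    """Yield every span of at most `max_len` tokens.

    With a fixed cap the number of spans is at most num_tokens * max_len,
    i.e. linear in sentence length, instead of quadratic without a cap.
    """
    for i in range(num_tokens):
        for j in range(i, min(i + max_len, num_tokens)):
            yield (i, j)


def label_spans(
    num_tokens: int, max_len: int, triggers: Dict[Span, str]
) -> List[Tuple[Span, str]]:
    """Tag each candidate span with its frame, or NULL if it is not a trigger."""
    return [
        (span, triggers.get(span, "NULL"))
        for span in enumerate_spans(num_tokens, max_len)
    ]


# Illustrative example (not taken from the corpus).
tokens = "She bought a new car yesterday".split()
triggers = {(1, 1): "Commerce_buy"}  # the span covering "bought"
labeled = label_spans(len(tokens), max_len=3, triggers=triggers)
```

For this six-token sentence the cap of 3 leaves 15 candidate spans, compared with 21 when every span is considered.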
The FrameNet corpus provides two sets of annotated sentences: full-text and exemplars. The full-text set contains fully annotated documents, whereas each exemplar sentence is annotated with only one frame. For the full-text sentences, we treat both the trigger and non-trigger spans as training examples, but we exclude the non-trigger spans in the exemplar and paraphrastic sentences because they represent incomplete annotations rather than true negative examples. Furthermore, prior work has pointed out that some triggers are not annotated in the full-text sentences, leading to false negative training examples. In light of this, we apply label smoothing (Szegedy et al., 2016) 18 to negative examples to smooth the point distribution, resulting in an improvement of 3 F1 points.
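The idea of smoothing only the negative (NULL) targets can be sketched in plain Python as below. This is a sketch under stated assumptions, not the authors' training code: the smoothed mass is spread uniformly over all labels as in Szegedy et al. (2016), the factor 0.2 comes from footnote 18, and the helper names are hypothetical.

```python
import math
from typing import List


def smoothed_target(num_labels: int, gold_index: int, epsilon: float) -> List[float]:
    """Return (1 - epsilon) * one_hot(gold) + epsilon * uniform."""
    target = [epsilon / num_labels] * num_labels
    target[gold_index] += 1.0 - epsilon
    return target


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def cross_entropy(logits: List[float], target: List[float]) -> float:
    """Cross-entropy between a target distribution and softmax(logits)."""
    return -sum(t * lp for t, lp in zip(target, log_softmax(logits)))


def span_loss(
    logits: List[float], gold_index: int, null_index: int, epsilon: float = 0.2
) -> float:
    """Smooth the target only when the gold label is NULL (a negative example)."""
    eps = epsilon if gold_index == null_index else 0.0
    return cross_entropy(logits, smoothed_target(len(logits), gold_index, eps))
```

Because unannotated triggers appear in the data as NULL spans, softening only those targets limits the penalty when the model assigns a frame to a span that is labeled NULL, while leaving annotated triggers as hard targets.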
Experiments To illustrate the utility of our method in a low-resource setting, we use a 10% sample of the full-text sentences as our full-text dataset, choosing the first $n_l$ lexical units by order of appearance, and subsequently sampling $n_e$ exemplar sentences for each lexical unit. We augment the dataset by adding the top-$n_p$ paraphrases (ranked by the product of paraphrase and alignment model scores) for each exemplar sentence. In our experiments, we try combinations of $(n_l, n_e, n_p) \in \{1, 3\} \times \{1, 3\} \times \{0, 1, 4\}$, where $n_p = 0$ means only exemplar sentences are used for training.
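The experimental grid and the paraphrase selection rule can be sketched as follows. The `Paraphrase` record and `top_paraphrases` helper are hypothetical, and the sketch assumes both model scores are probability-like values in [0, 1]; if the scores were log-probabilities, ranking by their product would not be appropriate.

```python
from itertools import product
from typing import List, NamedTuple


class Paraphrase(NamedTuple):
    text: str
    paraphrase_score: float  # assumed to lie in [0, 1]
    alignment_score: float   # assumed to lie in [0, 1]


def top_paraphrases(candidates: List[Paraphrase], n_p: int) -> List[Paraphrase]:
    """Keep the n_p candidates with the highest product of the two scores."""
    ranked = sorted(
        candidates,
        key=lambda p: p.paraphrase_score * p.alignment_score,
        reverse=True,
    )
    return ranked[:n_p]


# Every (n_l, n_e, n_p) setting: lexical units per frame, exemplars per
# lexical unit, and paraphrases per exemplar.
GRID = list(product((1, 3), (1, 3), (0, 1, 4)))

# Toy candidates for a single exemplar sentence (illustrative values only).
candidates = [
    Paraphrase("paraphrase a", 0.9, 0.5),
    Paraphrase("paraphrase b", 0.6, 0.9),
    Paraphrase("paraphrase c", 0.8, 0.8),
]
best = top_paraphrases(candidates, n_p=1)
```

The grid contains 12 settings; with `n_p = 0` the helper returns an empty list, which corresponds to training on exemplar sentences alone.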
We use the FrameNet v1.7 release as the dataset, 19 and adopt the same development and test split as proposed by Das and Smith (2011), treating all the other documents as training examples. We use greedy search to find the optimal hyper-parameters and conduct all the experiments under the same hyper-parameters.
The evaluation metric used is the frame identification F1 score, where a frame prediction is counted as a true positive only when the trigger span and the frame both match exactly.
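A minimal sketch of this exact-match metric is given below. It assumes micro-averaging over all predictions in the evaluation set, which the text does not specify, and the tuple layout and function name are our own.

```python
from typing import Set, Tuple

# (sentence_id, span_start, span_end, frame); the layout is an assumption.
Prediction = Tuple[int, int, int, str]


def frame_id_f1(
    gold: Set[Prediction], predicted: Set[Prediction]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact span-and-frame matching."""
    true_positives = len(gold & predicted)  # span and frame must both match
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Illustrative example: one exact match, one wrong frame, one wrong span.
gold = {(0, 1, 1, "Commerce_buy"), (0, 4, 4, "Vehicle"), (1, 0, 1, "Arriving")}
predicted = {(0, 1, 1, "Commerce_buy"), (0, 4, 4, "Artifact"), (1, 0, 0, "Arriving")}
precision, recall, f1 = frame_id_f1(gold, predicted)
```

A prediction with the right frame on a slightly different span counts as both a false positive and a false negative, so the metric rewards trigger detection and frame classification jointly.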

Results and Analysis
Results are shown in Figure 5, where the leftmost bar shows the 10% full-text-only result for reference, with 5 repetitions. 20
18 A smoothing factor of 0.2 is empirically chosen.
19 We use the FrameNet support within NLTK (Schneider and Wooters, 2017) to process the raw data.
[Figure 5: bar groups, left to right: full-text; ($n_l = 1$, $n_e = 1$); ($n_l = 1$, $n_e = 3$); ($n_l = 3$, $n_e = 1$); ($n_l = 3$, $n_e = 3$)]
We see that in both experiments increased numbers of exemplars and paraphrases improve frame identification. When we only have one annotated lexical unit for each frame, generating one paraphrase is beneficial and generating four is even better. With three lexical units, the relative impact of paraphrasing begins to diminish. When we have one exemplar sentence per lexical unit, adding one paraphrase is helpful, while adding four is less so. As the number of diverse examples increases, the impact of paraphrasing wanes.
Future Work While we have shown that paraphrasing is beneficial for training a Frame Identification model in a low-resource setting, it is important to be aware of the limitations of paraphrastic data. The paraphrase generation process does not guarantee that the resulting data will be beneficial for training and evaluation, since some of the paraphrases may already be well understood by the model (Ribeiro et al., 2018). Furthermore, generated paraphrases could include lexical units that fall outside of the ontology being used. Both issues can negatively affect evaluation. Future work may investigate tactical data augmentation, for example by applying a filtering score such as the one proposed by Ribeiro et al. (2018) or by limiting the paraphrastic data to its intersection with the FrameNet ontology.

Conclusion
We introduced a novel approach for the iterative construction of semantic resources via automatic paraphrasing. To demonstrate two possible uses of our framework, we simulated the rapid creation of a new semantic resource from a small seed corpus and generated a large-scale expansion of an existing resource. The latter experiment, run on FrameNet data, generated a lexically diverse dataset with 495,300 unique (Frame, Trigger) combinations annotated in context, 50x the number of such combinations originally in FrameNet, which we release to the community alongside our 36,417-instance span-alignment dataset.