Abstract
Sequence-to-Sequence (S2S) models have achieved remarkable success on various text generation tasks. However, learning complex structures with S2S models remains challenging because external neural modules and additional lexicons are often supplemented to predict non-textual outputs. We present a systematic study of S2S modeling using constrained decoding on four core tasks: part-of-speech tagging, named entity recognition, constituency parsing, and dependency parsing, to develop efficient exploitation methods costing zero extra parameters. In particular, three lexically diverse linearization schemas and corresponding constrained decoding methods are designed and evaluated. Experiments show that although more lexicalized schemas yield longer output sequences that require heavier training, their output being closer to natural language makes them easier to learn. Moreover, S2S models using our constrained decoding outperform other S2S approaches that use external resources. Our best models perform better than or comparably to the state-of-the-art on all four tasks, highlighting the promise of S2S models for generating non-sequential structures.
1 Introduction
Sequence-to-Sequence (S2S) models pretrained for language modeling (PLM) and denoising objectives have been successful on a wide range of NLP tasks where both inputs and outputs are sequences (Radford et al., 2019; Raffel et al., 2020; Lewis et al., 2020; Brown et al., 2020). However, for non-sequential outputs like trees and graphs, a procedure called linearization is often required to flatten them into ordinary sequences (Li et al., 2018; Fernández-González and Gómez-Rodríguez, 2020; Yan et al., 2021; Bevilacqua et al., 2021; He and Choi, 2021a), where labels in non-sequential structures are heuristically mapped to individual tokens in sequences, and numerical properties like indices are either predicted using an external decoder such as Pointer Networks (Vinyals et al., 2015a) or cast to additional tokens in the vocabulary. While these methods have been found effective, we hypothesize that S2S models can learn complex structures without resorting to such patches.
To challenge the limit of S2S modeling, BART (Lewis et al., 2020) is finetuned on four tasks without extra decoders: part-of-speech tagging (POS), named entity recognition (NER), constituency parsing (CON), and dependency parsing (DEP). Three novel linearization schemas are introduced for each task: label sequence (LS), label with text (LT), and prompt (PT). From LS to PT, the schemas feature an increasing amount of lexical text and a decreasing number of labels, i.e., tokens not in the original vocabulary (Section 3). Every schema is equipped with a constrained decoding algorithm searching over valid sequences (Section 4).
Our experiments on three popular datasets show that S2S models can learn these linguistic structures without external resources such as index tokens or Pointer Networks. Our best models perform on par with or better than the other state-of-the-art models for all four tasks (Section 5). Finally, a detailed analysis is provided to compare the distinctive natures of our proposed schemas (Section 6).
2 Related Work
S2S (Sutskever et al., 2014) architectures have been effective on many sequential modeling tasks. Conventionally, S2S is implemented as an encoder and decoder pair, where the encoder learns input representations used to generate the output sequence via the decoder. Since the input sequence can be very long, attention mechanisms (Bahdanau et al., 2015; Vaswani et al., 2017) focusing on particular positions are often added to the basic architecture. With transfer learning, S2S models pretrained on large unlabeled corpora have given rise to a variety of new approaches that convert language problems into a text-to-text format (Akbik et al., 2018; Lewis et al., 2020; Radford et al., 2019; Raffel et al., 2020; Brown et al., 2020). Among them, the tasks most related to our work are linguistic structure predictions using S2S: POS, NER, DEP, and CON.
POS has been commonly tackled as a sequence tagging task, where the input and output sequences have equal lengths. S2S, on the other hand, does not enjoy such constraints as the output sequence can be arbitrarily long. Therefore, S2S is not as popular as sequence tagging for POS. Prevailing neural architectures for POS are often built on top of a neural sequence tagger with rich embeddings (Bohnet et al., 2018; Akbik et al., 2018) and Conditional Random Fields (Lafferty et al., 2001).
NER has been cast to a neural sequence tagging task using the IOB notation (Lample et al., 2016) over the years, which benefits most from contextual word embeddings (Devlin et al., 2019; Wang et al., 2021). Early S2S-based works cast NER to a text-to-IOB transduction problem (Chen and Moschitti, 2018; Straková et al., 2019; Zhu et al., 2020), which is included as a baseline schema in Section 3.2. Yan et al. (2021) augment Pointer Networks to generate numerical entity spans, which we refrain from using because the focus of this work is purely on S2S itself. Most recently, Cui et al. (2021) propose the first template prompting approach, which queries all possible spans against an S2S language model and which we highly simplify into a one-pass generation in our PT schema. Instead of directly prompting for the entity type, Chen et al. (2022) propose to generate its concepts first and its type later. Their two-step generation is tailored for few-shot learning, orthogonal to our approach. Moreover, our prompt approach does not rely on non-textual tokens as theirs does.
CON is a more established task for S2S models since the bracketed constituency tree is naturally a linearized sequence. Top-down tree linearizations based on brackets (Vinyals et al., 2015b) or shift-reduce actions (Sagae and Lavie, 2005) rely on a strong encoder over the sentence while bottom-up ones (Zhu et al., 2013; Ma et al., 2017) can utilize rich features from readily built partial parses. Recently, the in-order traversal has proved superior to bottom-up and top-down in both transition (Liu and Zhang, 2017) and S2S (Fernández-González and Gómez-Rodríguez, 2020) constituency parsing. Most recently, a Pointer Networks augmented approach (Yang and Tu, 2022) is ranked top among S2S approaches. Since we are interested in the potential of S2S models without patches, a naive bottom-up baseline and its novel upgrades are studied in Section 3.3.
DEP has been underexplored as S2S due to the linearization complexity. The first S2S work maps a sentence to a sequence of source sentence words interleaved with the arc-standard reduce actions in its parse (Wiseman and Rush, 2016), which is adopted as our LT baseline in Section 3.4. Zhang et al. (2017) introduce a stack-based multi-layer attention mechanism to leverage structural linguistic information from the decoding stack in arc-standard parsing. Arc-standard is also used in our LS baseline; however, we use no such extra layers. Apart from transition parsing, Li et al. (2018) directly predict the relative head position instead of the transition. This schema is later extended to multilingual and multitask settings by Choudhary and O'Riordan (2021). Their encoder and decoder use different vocabularies, while in our PT setting, we re-use the vocabulary of the S2S language model.
S2S appears to be more prevailing for semantic parsing due to two reasons. First, synchronous context-free grammar bridges the gap between natural text and meaning representation for S2S. It has been employed to obtain silver annotations (Jia and Liang, 2016), and to generate canonical natural language paraphrases that are easier to learn for S2S (Shin et al., 2021). This trend of insights viewing semantic parsing as prompt-guided generation (Hashimoto et al., 2018) and paraphrasing (Berant and Liang, 2014) has also inspired our design of PT. Second, the flexible input/output format of S2S facilitates joint learning of semantic parsing and generation. Latent variable sharing (Tseng et al., 2020) and unified pretraining (Bai et al., 2022) are two representative joint modeling approaches, which could be augmented with our idea of PT schema as a potentially more effective linearization.
Our finding that core NLP tasks can be solved using LT overlaps with the Translation between Augmented Natural Languages (Paolini et al., 2021). However, we take one step further to study the impacts of textual tokens in schema design choices. Our constrained decoding is similar to existing work (Hokamp and Liu, 2017; Deutsch et al., 2019; Shin et al., 2021). We craft constrained decoding algorithms for our proposed schemas and provide a systematic ablation study in Section 6.1.
3 Schemas
This section presents our output schemas for POS, NER, CON, and DEP in Table 1. For each task, three lexically diverse schemas are designed as follows to explore the best practice for structure learning. First, Label Sequence (LS) is defined as a sequence drawn from a finite set of task-related labels, which are merged into the S2S vocabulary, with zero text. Second, Label with Text (LT) includes tokens from the input text on top of the labels such that it has a medium amount of labels and text. Third, PrompT (PT) gives a list of sentences describing the linguistic structure in natural language with no label. We hypothesize that the closer the output is to natural language, the more advantage the S2S model takes from the PLM.
3.1 Part-of-Speech Tagging (POS)
LS
LS defines the output as a sequence of POS tags. Formally, given an input sentence of n tokens x = {x1, x2, ⋯, xn}, its output is a tag sequence of the same length yLS = {y1, y2, ⋯, yn}. Unlike sequence tagging, any LS output sequence is terminated by the “end-of-sequence” (EOS) token, which is omitted from yLS for simplicity. Predicting POS tags often depends on their neighboring context. We posit that the autoregressive decoder of an S2S model can capture this dependency through self-attention.
LT
For LT, each input token is inserted before its corresponding tag. Formally, the output is defined as yLT = {(x1, y1), (x2, y2), ⋯, (xn, yn)}. Both x and y are part of the output, and the S2S model is trained to generate each pair sequentially.
PT
PT is a human-readable text describing the POS sequence. Specifically, we use a phrase yiPT = “xi is di” for the i-th token, where di is the definition of its POS tag yi, e.g., a noun. The final prompt is the semicolon-separated concatenation of all phrases: yPT = y1PT; y2PT; ⋯; ynPT.
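For concreteness, the sketch below (ours, not the released code) linearizes a toy sentence under all three POS schemas; the tag-definition map is an illustrative stand-in for the full inventory.

```python
# A minimal sketch of the three POS linearizations (LS, LT, PT).
# TAG_DEF is a hypothetical stand-in for the full tag-definition inventory.
TAG_DEF = {"PRP$": "a possessive pronoun", "NN": "a noun", "VBD": "a past-tense verb"}

def pos_ls(tags):
    # LS: the bare tag sequence, e.g. "PRP$ NN VBD"
    return " ".join(tags)

def pos_lt(tokens, tags):
    # LT: each token followed by its tag, e.g. "My PRP$ friend NN bought VBD"
    return " ".join(f"{tok} {tag}" for tok, tag in zip(tokens, tags))

def pos_pt(tokens, tags):
    # PT: one "x_i is <definition>" phrase per token, joined by semicolons
    return " ; ".join(f'"{tok}" is {TAG_DEF[tag]}' for tok, tag in zip(tokens, tags))

tokens, tags = ["My", "friend", "bought"], ["PRP$", "NN", "VBD"]
print(pos_ls(tags))
print(pos_lt(tokens, tags))
print(pos_pt(tokens, tags))
```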
3.2 Named Entity Recognition (NER)
LS
LS of an input sentence comprising n tokens x = {x1, x2, ⋯, xn} is defined as the BIEOS tag sequence yLS = {y1, y2, ⋯, yn}, which labels each token as the Beginning, Inside, or End of an entity, Outside any entity, or a Single-token entity.
LT
LT uses a pair of entity type labels to wrap each entity: yLT = {⋯, B-yj, xi, ⋯, xi+k, E-yj, ⋯}, where yj is the type label of the j-th entity spanning tokens xi through xi+k.
PT
PT is defined as a list of sentences describing each entity: yiPT = “xi is di”, where xi is the surface form of the i-th entity and di is the definition of its type yi, e.g., a person. Different from the prior prompt work (Cui et al., 2021), our model generates all entities in one pass, which is more efficient than their brute-force approach.
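The sketch below (ours) illustrates the three NER linearizations on a toy sentence; the entity-type definitions and tag spellings are illustrative assumptions rather than the exact inventory used in the paper.

```python
# A minimal sketch of the NER linearizations; TYPE_DEF and the B-/E- spellings
# are illustrative assumptions.
TYPE_DEF = {"PER": "a person", "ORG": "an organization"}

def ner_ls(bieos_tags):
    # LS: the BIEOS tag sequence, e.g. "S-PER O B-ORG E-ORG"
    return " ".join(bieos_tags)

def ner_lt(tokens, entities):
    # LT: wrap each entity span [i, j] of type t with an opening/closing tag pair.
    out, opening, closing = [], {}, {}
    for i, j, t in entities:
        opening[i], closing[j] = f"B-{t}", f"E-{t}"
    for k, tok in enumerate(tokens):
        if k in opening: out.append(opening[k])
        out.append(tok)
        if k in closing: out.append(closing[k])
    return " ".join(out)

def ner_pt(tokens, entities):
    # PT: one "<entity> is <definition>" sentence per entity, joined by semicolons
    return " ; ".join(f'"{" ".join(tokens[i:j + 1])}" is {TYPE_DEF[t]}' for i, j, t in entities)

tokens = ["Alice", "joined", "Acme", "Corp"]
entities = [(0, 0, "PER"), (2, 3, "ORG")]
print(ner_lt(tokens, entities))   # B-PER Alice E-PER joined B-ORG Acme Corp E-ORG
print(ner_pt(tokens, entities))   # "Alice" is a person ; "Acme Corp" is an organization
```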
3.3 Constituency Parsing (CON)
Schemas for CON are developed on constituency trees pre-processed by removing the first level of non-terminals (POS tags) and rewiring their children (tokens) to parents, e.g., (NP (PRON My) (NOUN friend))→(NP My friend).
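A minimal sketch of this pre-processing step, assuming NLTK-style bracketed trees, is given below; it is our own illustration, not the released code.

```python
# A minimal sketch of the CON pre-processing: remove the POS pre-terminal level
# and rewire the tokens to their grandparents.
from nltk.tree import Tree

def strip_pos_level(tree):
    if isinstance(tree, str):                      # terminal token
        return tree
    if len(tree) == 1 and isinstance(tree[0], str):
        return tree[0]                             # (PRON My) -> "My"
    return Tree(tree.label(), [strip_pos_level(child) for child in tree])

t = Tree.fromstring("(NP (PRON My) (NOUN friend))")
print(strip_pos_level(t))                          # (NP My friend)
```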
LS
LS is based on a top-down shift-reduce system consisting of a stack, a buffer, and a depth record d. Initially, the stack contains only the root constituent with label TOP and depth 0; the buffer contains all tokens from the input sentence; d is set to 1. A Node-X (N-X) transition creates a new non-terminal of depth d labeled with X, pushes it to the stack, and sets d ← d + 1. A Shift (SH) transition removes the first token from the buffer and pushes it to the stack as a new terminal of depth d. A Reduce (RE) transition pops all elements of depth d from the stack, makes them the children of the constituent now on top of the stack, and sets d ← d − 1. The linearization of a constituency tree using our LS schema can be obtained by applying three string substitutions: replace each left bracket and the label X following it with Node-X, replace each terminal with SH, and replace each right bracket with RE.
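The three substitutions can be sketched as follows (our simplified implementation, assuming whitespace-separated brackets, labels, and terminals):

```python
import re

# A minimal sketch of the LS linearization: apply the three substitutions to a
# pre-processed bracketed tree.
def con_ls(bracketed):
    pieces = re.findall(r"\(([^\s()]+)|(\))|([^\s()]+)", bracketed)
    out = []
    for label, rbracket, terminal in pieces:
        if label:            # "(X"  -> Node-X
            out.append(f"N-{label}")
        elif rbracket:       # ")"   -> Reduce
            out.append("RE")
        else:                # token -> Shift
            out.append("SH")
    return " ".join(out)

print(con_ls("(TOP (NP My friend) (VP bought (NP a book)))"))
# N-TOP N-NP SH SH RE N-VP SH N-NP SH SH RE RE RE
```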
LT
LT is derived by reverting each SH in LS back to its corresponding token, so tokens in LT effectively serve as SH in our transition system.
PT
PT is also based on a top-down linearization, although it describes a constituent using the template “pi has {cj}”, where pi is a constituent and the cj are its children. To describe a constituent, the indefinite article “a” denotes a new constituent (e.g., “…has a noun phrase”). The definite article “the” is used for referring to an existing constituent mentioned before (e.g., “the noun phrase has …”), or for describing a constituent whose children are all terminals (e.g., “…has the noun phrase ‘My friend’”). When describing a constituent that directly follows its mention, the relative pronoun “which” is used instead of repeating the phrase (e.g., “…and the subordinating clause, which has …”). Sentences are joined with a semicolon “;” to form the final prompt.
3.4 Dependency Parsing (DEP)
LS
LS uses three transitions from the arc-standard system (Nivre, 2004): shift (SH), left arc (<), and right arc (>).
LT
LT for DEP is obtained by replacing each SH in a LS with its corresponding token.
PT
PT is derived from its LS sequence by removing all SH. Then, for each left arc creating an arc from xj to xi with dependency relation r (e.g., a possessive modifier), a sentence is created by applying the template “xi is r of xj”. For each right arc creating an arc from xi to xj with dependency relation r, a sentence is created with another template “xi has r xj”. The prompt is finalized by joining all such sentences with a semicolon.
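To illustrate, the sketch below (ours) derives all three DEP schemas from a toy arc-standard transition sequence; the relation-description map is an illustrative assumption.

```python
# A minimal sketch deriving the three DEP schemas from an arc-standard transition
# sequence; REL_DEF is a hypothetical stand-in for the relation descriptions.
REL_DEF = {"poss": "a possessive modifier", "nsubj": "a nominal subject"}

def dep_ls(transitions):
    # LS: the raw transition sequence, e.g. "SH SH < SH <"
    return " ".join(op for op, *_ in transitions)

def dep_lt(transitions, tokens):
    # LT: every SH is replaced by the next input token, left to right.
    it = iter(tokens)
    return " ".join(next(it) if op == "SH" else op for op, *_ in transitions)

def dep_pt(transitions):
    # PT: drop SH; verbalize "<" (left arc) as "dep is rel of head" and
    # ">" (right arc) as "head has rel dep", joined by semicolons.
    sents = []
    for op, *args in transitions:
        if op == "<":
            dep, rel, head = args
            sents.append(f'"{dep}" is {REL_DEF[rel]} of "{head}"')
        elif op == ">":
            head, rel, dep = args
            sents.append(f'"{head}" has {REL_DEF[rel]} "{dep}"')
    return " ; ".join(sents)

tokens = ["My", "friend", "bought"]
transitions = [("SH",), ("SH",), ("<", "My", "poss", "friend"),
               ("SH",), ("<", "friend", "nsubj", "bought")]
print(dep_ls(transitions), "|", dep_lt(transitions, tokens), "|", dep_pt(transitions))
```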
4 Decoding Strategies
To ensure well-formed output sequences that match the schemas (Section 3), a set of constrained decoding strategies is designed for each task except CON, which has already been tackled as S2S modeling without constrained decoding (Vinyals et al., 2015b; Fernández-González and Gómez-Rodríguez, 2020). Formally, given an input x and any partial output y<i, a constrained decoding algorithm defines a function NextY returning the set of all possible values for yi that can immediately follow y<i without violating the schema. For brevity, subword handling is explained separately in Section 5.
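In practice, such a NextY function can be plugged into beam search through the prefix_allowed_tokens_fn hook of the Hugging Face generate API; the sketch below is our own illustration, not the released implementation, and uses a trivial NextY that permits the whole vocabulary.

```python
# A minimal sketch of constrained decoding with Hugging Face BART: a NextY-style
# function is wired into beam search via prefix_allowed_tokens_fn.
from transformers import BartForConditionalGeneration, BartTokenizerFast

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

def make_prefix_fn(next_y):
    # next_y(prefix_ids) must return the set of subword ids allowed at this step.
    def prefix_fn(batch_id, prefix_ids):
        return list(next_y(prefix_ids.tolist()))
    return prefix_fn

# A trivial NextY that allows the whole vocabulary (i.e., unconstrained decoding);
# the task-specific schemas replace this with the algorithms of Sections 4.1-4.4.
full_vocab = list(range(len(tokenizer)))
def next_y(prefix_ids):
    return full_vocab

inputs = tokenizer("My friend bought a book .", return_tensors="pt")
out = model.generate(**inputs, num_beams=4, max_length=64,
                     prefix_allowed_tokens_fn=make_prefix_fn(next_y))
print(tokenizer.decode(out[0], skip_special_tokens=True))
```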
4.1 Part-of-Speech Tagging
LS
A finite set of POS tags is collected from the training set. NextY returns this tag set if i ≤ n; otherwise, it returns EOS.
LT
Since LT interleaves input tokens and tags, decoding alternates between the two: when the previously generated output is a tag (or the output is empty), NextY returns the next input token; otherwise, it returns the POS tag set.
PT
The PT generation can be divided into two phases: token and “is-a-tag” statement generations. A binary status u is used to indicate whether yi is expected to be a token. To generate a token, an integer k ≤ n is used to track the index of the next token. To generate an “is-a-tag” statement, an offset o is used to track the beginning of an “is-a-tag” statement. Each description dj of a POS tag tj is extended to a suffix sj = “is dj ;”.
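A word-level sketch of this procedure is given below (ours; the real algorithm operates on subwords, and the suffix inventory is an illustrative stand-in). It assumes no suffix is a prefix of another.

```python
# A word-level sketch of POS-PT constrained decoding (subword handling omitted).
# SUFFIXES is a hypothetical stand-in for the "is <definition> ;" suffixes.
SUFFIXES = [("is", "a", "noun", ";"), ("is", "a", "verb", ";"), ("is", "a", "pronoun", ";")]

def next_y(y, x):
    # Reconstruct the state: u (expecting a token?), k (next token index), o (suffix start).
    u, k, o = True, 0, 0
    for i, w in enumerate(y):
        if u:                               # a token was generated: switch to suffix mode
            u, k, o = False, k + 1, i + 1
        elif tuple(y[o:i + 1]) in SUFFIXES:
            u = True                        # a complete suffix was closed by ";": back to token mode
    if u:                                   # next output must be the k-th input token (or EOS)
        return {x[k]} if k < len(x) else {"<eos>"}
    prefix = tuple(y[o:])                   # otherwise, extend a partially generated suffix
    return {s[len(prefix)] for s in SUFFIXES if s[:len(prefix)] == prefix}

x = ["My", "friend", "bought"]
print(next_y([], x))                                  # {'My'}
print(next_y(["My", "is", "a"], x))                   # {'noun', 'verb', 'pronoun'}
print(next_y(["My", "is", "a", "pronoun", ";"], x))   # {'friend'}
```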
4.2 Named Entity Recognition
LS
Similar to POS-LS, NextY for NER returns the set of BIEOS tags if i ≤ n; otherwise, EOS.
LT
Opening tags in NER-LT are grouped into a vocabulary. The last generated output token yi−1 (assuming y0 = BOS, the beginning-of-sequence token) is looked up in this vocabulary to decide what type of token will be generated next. To enforce label consistency between a pair of tags, a variable e is introduced to record the expected closing tag. Reusing the definition of k in Algorithm 3, the decoding of NER-LT is described in Algorithm 4.
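A word-level sketch of this NextY is shown below (ours; subword handling and the exact tag spellings are simplified assumptions).

```python
# A word-level sketch of NER-LT constrained decoding. OPEN maps each opening tag
# to the closing tag it requires; the tag names are illustrative.
OPEN = {"B-PER": "E-PER", "B-ORG": "E-ORG"}

def next_y(y, x):
    k, e = 0, None                          # k: next input token index; e: expected closing tag
    for w in y:
        if w in OPEN:
            e = OPEN[w]                     # an entity was opened
        elif e is not None and w == e:
            e = None                        # the entity was closed
        else:
            k += 1                          # a copied input token
    prev = y[-1] if y else "<bos>"
    if k == len(x):                         # all tokens copied: close an open entity or stop
        return {e} if e else {"<eos>"}
    if prev in OPEN:                        # right after an opening tag, a token must follow
        return {x[k]}
    if e is not None:                       # inside an entity: copy the next token or close it
        return {x[k], e}
    return {x[k]} | set(OPEN)               # outside: copy a token or open a new entity

x = ["Alice", "joined", "Acme", "Corp"]
print(next_y(["B-PER", "Alice", "E-PER", "joined"], x))  # next token "Acme" or an opening tag
```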
PT
For each entity type ei, its description di is filled into the template “is di;” to create an “is-a” suffix si. Since the prompt is constructed using text while the number of entities is variable, it is not straightforward to tell whether a token belongs to an entity or an “is-a” suffix. Therefore, a noisy segmentation procedure is utilized to split a phrase into two parts: entity and “is-a” suffix. Each si is collected into a trie to perform segmentation of a partially generated phrase p (Algorithm 5).
Once a segment is obtained, the decoder is constrained to generate either the entity or the suffix. For the generation of an entity, string matching is used to find every occurrence o of its partial generation in x, and the following token xo+1 is added to the candidate set. String matching could be noisy when an entity shares the same surface form with a non-entity phrase, although no such cases are found in our datasets. Entities are generated sequentially, and no nested entity is considered. To complete an “is-a” suffix, the children of the prefix-matched node are added to the candidates (Algorithm 6).
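The trie-based segmentation can be sketched as follows (a word-level simplification of Algorithm 5 with illustrative type descriptions).

```python
# A word-level sketch of the noisy segmentation for NER-PT; the real algorithm
# operates on subwords, and the suffix inventory is an illustrative assumption.
SUFFIXES = [("is", "a", "person", ";"), ("is", "an", "organization", ";")]

def build_trie(suffixes):
    root = {}
    for s in suffixes:
        node = root
        for w in s:
            node = node.setdefault(w, {})
    return root

TRIE = build_trie(SUFFIXES)

def segment(phrase):
    # Split a partially generated phrase into (entity part, suffix part): the suffix
    # is the last span of the phrase that can be traced from the trie root.
    for start in range(len(phrase)):
        node, ok = TRIE, True
        for w in phrase[start:]:
            if w in node:
                node = node[w]
            else:
                ok = False
                break
        if ok:
            return phrase[:start], phrase[start:]
    return phrase, ()

print(segment(("Acme", "Corp", "is", "an")))   # (('Acme', 'Corp'), ('is', 'an'))
print(segment(("Alice",)))                     # (('Alice',), ())
```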
4.3 Constituency Parsing
PT
The reverse linearization algorithm for CON-PT is non-trivial. To restore a constituency tree from PT, the prompt is split into sentences creating new constituents, and sentences attaching new constituents to existing ones. Splitting is done by longest-prefix-matching (Algorithm 7) using a trie built with the definite and indefinite article versions of the description of each constituent label e.g., “the noun phrase” and “a noun phrase” of NP.
Algorithm 8 describes the splitting procedure.
Once a prompt is split into the two types of sentences, a constituency tree is built accordingly. We use a variable parent to track the last constituent that received attachments, and another variable latest to track the newest constituent that was created. Due to the top-down nature of the linearization, the target constituent to which new constituents are attached is always among the siblings of parent or the ancestors of parent. The search for the target constituent is described in Algorithm 9.
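A hedged sketch of this search is given below; the node representation and the exact preference order (siblings of parent first, then parent and its ancestors) are our simplifications of Algorithm 9.

```python
# A simplified sketch of the target-constituent search for CON-PT.
class Node:
    def __init__(self, label, parent=None):
        self.label, self.parent, self.children = label, parent, []
        if parent is not None:
            parent.children.append(self)

def find_target(parent, label):
    # Candidates: siblings of `parent` (nearest first), then `parent` and its ancestors.
    if parent.parent is not None:
        for sib in reversed(parent.parent.children):
            if sib is not parent and sib.label == label:
                return sib
    node = parent
    while node is not None:
        if node.label == label:
            return node
        node = node.parent
    return None

top = Node("TOP")
s = Node("S", top)
np = Node("NP", s)
vp = Node("VP", s)
print(find_target(vp, "NP").label)   # NP (a sibling of the current parent)
print(find_target(np, "S").label)    # S  (an ancestor)
```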
4.4 Dependency Parsing
LS
Arc-standard (Nivre, 2004) transitions are added to a candidate set and only transitions permitted by the current parsing state are allowed.
LT
DEP-LT replaces all SH transitions with input tokens in left-to-right order. Therefore, an incremental offset is kept to generate the next input token in place of each SH.
PT
DEP-PT is more complicated than CON-PT because each sentence contains one more token. Its generation is therefore divided into 4 possible states: first token (1st), relation (rel), second token (2nd), and semicolon. An arc-standard transition system is executed synchronously with constrained decoding, since PT is essentially a simplified transition sequence with all SH removed. Let b and s be the system buffer and stack, respectively. Let c be the set of candidate tokens that may be generated in y, which initially contains all input tokens and an inserted token “sentence” that is only used to represent the root in “the sentence has a root …”. A token is removed from c once it gets popped out of s. Since DEP-PT generates no SH, each input token xj generated in y effectively introduces as many SHs as needed until xj is pushed onto s at index i (i ∈ {1, 2}), as formally described in Algorithm 11.
After the first token is generated, its offset o in y is recorded so that the following relation sequence y>o can be located. To decide the next token of y>o, it is prefix-matched against a trie built from the set of “has-” and “is-” dependency relations. The children of the prefix-matched node are considered candidates if the node has any; otherwise, the dependency relation is marked as completed. Once a relation is generated, the second token is generated in a similar way. Finally, upon the completion of a sentence, the transition it describes is applied to the system and c is updated accordingly. The full procedure is described in Algorithm 12. Since a transition system has been synchronously maintained with constrained decoding, no extra reverse linearization is needed.
5 Experiments
For all tasks, BART-Large (Lewis et al., 2020) is finetuned as our underlying S2S model. We also tried T5 (Raffel et al., 2020), although its performance was less satisfactory. Every model is trained three times using different random seeds and their average scores and standard deviations on the test sets are reported. Our models are experimented on the OntoNotes 5 (Weischedel et al., 2013) using the data split suggested by Pradhan et al. (2013). In addition, two other popular datasets are used for fair comparisons to previous works: the Wall Street Journal corpus from the Penn Treebank 3 (Marcus et al., 1993) for POS, DEP, and CON, as well as the English portion of the CoNLL’03 dataset (Tjong Kim Sang and De Meulder, 2003) for NER.
Each token is independently tokenized using the subword tokenizer of BART and merged into an input sequence. The boundary information for each token is recorded to ensure full tokens are generated in LT and PT without broken pieces. To fit in the positional embeddings of BART, sentences longer than 1,024 subwords are discarded, which include 1 sentence from the Penn Treebank 3 training set, and 24 sentences from the OntoNotes 5 training set. Development sets and test sets are not affected.
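A minimal sketch of this per-token tokenization with boundary tracking is shown below (ours; the recorded offsets are relative to the content subwords, before special tokens are added).

```python
# A minimal sketch of per-token subword tokenization with boundary tracking, so that
# constrained decoding can emit full tokens only.
from transformers import BartTokenizerFast

tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-large")

def encode_with_boundaries(tokens):
    ids, boundaries = [], []
    for i, tok in enumerate(tokens):
        # Prefix a space for non-initial tokens so BPE sees word boundaries.
        piece_ids = tokenizer.encode((" " if i else "") + tok, add_special_tokens=False)
        boundaries.append((len(ids), len(ids) + len(piece_ids)))  # [start, end) in subwords
        ids.extend(piece_ids)
    return [tokenizer.bos_token_id] + ids + [tokenizer.eos_token_id], boundaries

ids, spans = encode_with_boundaries(["My", "friend", "bought", "a", "dictionary"])
print(spans)  # e.g., [(0, 1), (1, 2), (2, 3), (3, 4), (4, 6)] depending on the BPE merges
```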
5.1 Part-of-Speech Tagging
Token-level accuracy is used as the metric for POS. LT outperforms LS even though LT is twice as long as LS, suggesting that textual tokens positively impact the learning of the decoder (Table 2). PT performs almost the same as LT, perhaps because POS does not require a powerful decoder.
Results for POS.

| Model | PTB | OntoNotes |
|---|---|---|
| Bohnet et al. (2018) | 97.96 | – |
| He and Choi (2021b) | – | 98.32 ± 0.02 |
| LS | 97.51 ± 0.11 | 98.21 ± 0.02 |
| LT | 97.70 ± 0.02 | 98.40 ± 0.01 |
| PT | 97.64 ± 0.01 | 98.37 ± 0.02 |
5.2 Named Entity Recognition
For CoNLL’03, the provided splits without merging the development and training sets are used. For OntoNotes 5, the same splits as Chiu and Nichols (2016), Li et al. (2017), Ghaddar and Langlais (2018), and He and Choi (2020, 2021b) are used. Labeled span-level F1 score is used for evaluation.
We acknowledge that the performance of NER systems can be largely improved by rich embeddings (Wang et al., 2021), document context features (Yu et al., 2020), dependency tree features (Xu et al., 2021), and other external resources. While our focus is the potential of S2S, we mainly consider two strong baselines that also use BART as the only external resource: the generative BART-Pointer framework (Yan et al., 2021) and the recent template-based BART NER (Cui et al., 2021).
As shown in Table 3, LS performs the worst on both datasets, possibly because the autoregressive decoder overfits the high-order left-to-right dependencies of BIEOS tags. LT performs close to the BERT-Large biaffine model (Yu et al., 2020). PT performs comparably to the Pointer Networks approach (Yan et al., 2021) and outperforms the template prompting (Cui et al., 2021) by a large margin, suggesting that S2S has the potential to learn structures without using external modules.
Results for NER. ‡ denotes S2S.
| Model | CoNLL’03 | OntoNotes 5 |
|---|---|---|
| Clark et al. (2018) | 92.60 | – |
| Peters et al. (2018) | 92.22 | – |
| Akbik et al. (2019) | 93.18 | – |
| Straková et al. (2019) | 93.07 | – |
| Yamada et al. (2020) | 92.40 | – |
| Yu et al. (2020)† | 92.50 | 89.83 |
| Yan et al. (2021)‡ | 93.24 | 90.38 |
| Cui et al. (2021) | 92.55 | – |
| He and Choi (2021b) | – | 89.04 ± 0.14 |
| Wang et al. (2021) | 94.6 | – |
| Zhu and Li (2022) | – | 91.74 |
| Ye et al. (2022) | – | 91.9 |
| LS | 70.29 ± 0.70 | 84.61 ± 1.18 |
| LT | 92.75 ± 0.03 | 89.60 ± 0.06 |
| PT | 93.18 ± 0.04 | 90.33 ± 0.04 |
5.3 Constituency Parsing
All POS tags are removed and not used in training or evaluation. Terminals belonging to the same non-terminal are flattened into one constituent before training and unflattened in post-processing. The standard constituent-level F-score produced by EVALB is used as the evaluation metric.
Table 4 shows the results on OntoNotes 5 and PTB 3. Incorporating textual tokens into the output sequence is important on OntoNotes 5, leading to a +0.9 F-score improvement, while this is not the case on PTB 3, possibly because OntoNotes is more diverse in domains, requiring greater utilization of the pretrained S2S model for domain transfer. PT performs the best and is competitive with recent works, despite using no extra decoders.
Results for CON.
| Model | PTB 3 | OntoNotes 5 |
|---|---|---|
| Fernández-González and Gómez-Rodríguez (2020) | 91.6 | – |
| Mrini et al. (2020) | 96.38 | – |
| He and Choi (2021b) | – | 94.43 ± 0.03 |
| Yang and Tu (2022) | 96.01 | – |
| LS | 95.23 ± 0.08 | 93.40 ± 0.31 |
| LT | 95.24 ± 0.04 | 94.32 ± 0.11 |
| PT | 95.34 ± 0.06 | 94.55 ± 0.03 |
5.4 Dependency Parsing
The constituency trees from PTB and OntoNotes are converted into Stanford dependencies v3.3.0 (De Marneffe and Manning, 2008) for the DEP experiments. Forty and one non-projective trees are removed from the training and development sets of PTB 3, respectively; for OntoNotes 5, these numbers are 262 and 28. Test sets are not affected.
As shown in Table 5, textual tokens are crucial in learning arc-standard transitions using S2S, leading to +2.6 and +7.4 LAS improvements, respectively. Although our PT method underperforms recent state-of-the-art methods, it has the strongest performance among all S2S approaches. Interestingly, our S2S model manages to learn a transition system without explicitly modeling the stack, the buffer, the partial parse, or pointers.
Results for DEP.
(a) PTB results for DEP.

| Model | UAS | LAS |
|---|---|---|
| Wiseman and Rush (2016) | 91.17 | 87.41 |
| Zhang et al. (2017) | 93.71 | 91.60 |
| Li et al. (2018) | 94.11 | 92.08 |
| Mrini et al. (2020) | 97.42 | 96.26 |
| LS | 92.83 ± 0.43 | 90.50 ± 0.53 |
| LT | 95.79 ± 0.07 | 93.17 ± 0.16 |
| PT | 95.91 ± 0.06 | 94.31 ± 0.09 |

(b) OntoNotes results for DEP.

| Model | UAS | LAS |
|---|---|---|
| He and Choi (2021b) | 95.92 ± 0.02 | 94.24 ± 0.03 |
| LS | 86.54 ± 0.12 | 83.84 ± 0.13 |
| LT | 94.15 ± 0.14 | 91.27 ± 0.19 |
| PT | 94.51 ± 0.22 | 92.81 ± 0.21 |
We believe that the performance of DEP with S2S can be further improved with a larger and more recent pretrained S2S model and dynamic oracle (Goldberg and Nivre, 2012).
6 Analysis
6.1 Ablation Study
We perform an ablation study to show the performance gain of our proposed constrained decoding algorithms on different tasks. Constrained decoding (CD) is compared against free generation (w/o CD), where a model freely generates an output sequence that is later post-processed into task-specific structures using string-matching rules. Invalid outputs are patched to the greatest extent possible, e.g., POS label sequences are padded or truncated. As shown in Table 6, ablating constrained decoding seldom impacts the performance of LS on all tasks, suggesting that the S2S decoder can acclimatize to the newly added label tokens. Interestingly, the less performant NER-LS model degrades the most, underscoring the necessity of constrained decoding for weaker S2S models. The performance of LT on all tasks is marginally degraded when constrained decoding is ablated, indicating that the decoder begins to generate structurally invalid outputs when textual tokens are freely generated. This problem is exacerbated when more tokens are freely generated in the PT schemas, especially for DEP-PT.
Ablation test results.
(a) Accuracy of ablation tests for POS.

| Model | PTB | OntoNotes |
|---|---|---|
| LS | 97.51 ± 0.11 | 98.21 ± 0.02 |
| w/o CD | 97.51 ± 0.11 | 98.21 ± 0.02 |
| LT | 97.70 ± 0.02 | 98.40 ± 0.01 |
| w/o CD | 97.67 ± 0.02 | 98.39 ± 0.01 |
| PT | 97.64 ± 0.01 | 98.37 ± 0.02 |
| w/o CD | 97.55 ± 0.02 | 98.29 ± 0.05 |

(b) F1 of ablation tests for NER.

| Model | CoNLL 03 | OntoNotes 5 |
|---|---|---|
| LS | 70.29 ± 0.70 | 84.61 ± 1.18 |
| w/o CD | 66.33 ± 0.73 | 84.57 ± 1.16 |
| LT | 92.75 ± 0.03 | 89.60 ± 0.06 |
| w/o CD | 92.72 ± 0.02 | 89.50 ± 0.07 |
| PT | 93.18 ± 0.04 | 90.33 ± 0.04 |
| w/o CD | 93.12 ± 0.06 | 90.23 ± 0.05 |

(c) LAS of ablation tests for DEP.

| Model | PTB | OntoNotes |
|---|---|---|
| LS | 90.50 ± 0.53 | 83.84 ± 0.13 |
| w/o CD | 90.45 ± 0.47 | 83.78 ± 0.13 |
| LT | 93.17 ± 0.16 | 91.27 ± 0.19 |
| w/o CD | 93.12 ± 0.14 | 91.05 ± 0.20 |
| PT | 94.31 ± 0.09 | 92.81 ± 0.21 |
| w/o CD | 81.50 ± 0.27 | 81.76 ± 0.36 |
Unlike POS and NER, DEP is more prone to hallucinated textual tokens, as early errors in the transition sequence accumulate in the arc-standard system and shift all later predictions off track. This is not a critical problem for LS and LT: LS generates no textual tokens, and a hallucinated textual token in LT still serves as a valid shift action. However, a hallucinated textual token in PT is catastrophic, as it could be part of any arc-standard transition. Since no explicit shift transition is designed, a hallucinated token could lead to multiple missing shifts in Algorithm 12.
6.2 Case Study
To facilitate understanding and comparison of different models, a concrete example of input (I), gold annotation (G), and actual model prediction is provided below for each schema and task. Wrong predictions and the corresponding ground truth are highlighted in red and teal, respectively.
POS
NER
CON
DEP
6.3 Design Choices
To experimentally compare the schema variants, we would like each design we consider to be equivalent in some systematic way. To this end, we fix other aspects and vary two dimensions of the prompt design, lexicality and verbosity, to isolate the impact of each variable.
Lexicality
We call the portion of textual tokens in a sequence its lexicality. Thus, LS and PT have zero and full lexicality, respectively, while LT falls in the middle. To tease apart the impact of lexicality, we substitute the lexical phrases in PT with the corresponding tag abbreviations on POS and DEP, e.g., “friend” is a noun → “friend” is a NN, and “friend” is a nominal subject of “bought” → “friend” is a nsubj of “bought”. Tags are added to the BART vocabulary and learned from scratch, as in LS and LT. As shown in Table 7, decreasing the lexicality of PT marginally degrades the performance of S2S on POS. On DEP, the performance drop is rather significant. Similar trends are observed comparing LT and LS in Section 5, confirming that lexicons play an important role in prompt design.
Study of lexicality on POS and DEP.
| Model | PTB 3 | OntoNotes 5 |
|---|---|---|
| POS-PT | 97.64 ± 0.01 | 98.37 ± 0.02 |
| dec.LEX | 97.63 ± 0.02 | 98.35 ± 0.03 |
| DEP-PT | 94.31 ± 0.09 | 92.81 ± 0.21 |
| dec.LEX | 93.89 ± 0.18 | 91.19 ± 0.86 |
Verbosity
Our PT schemas for NER and CON are designed to be as concise as human narrative and thus easy for S2S to generate. Another design choice would be to make them as verbose as some LS and LT schemas. To explore this dimension, we increase the verbosity of NER-PT and CON-PT by adding “isn’t an entity” for all non-entity tokens and by substituting each “which” with the phrase it actually refers to, respectively. The results are presented in Table 8. Though increased verbosity eliminates ambiguity, it unfortunately hurts performance. Emphasizing that a token “isn’t an entity” might cause over-confidence, as boundary annotations can be ambiguous in gold NER data (Zhu and Li, 2022). CON-PT deviates from human language style when references are forbidden, which eventually makes it lengthy and hard to learn.
Study of verbosity on NER and CON.
| Model | CoNLL 03 | OntoNotes 5 |
|---|---|---|
| NER-PT | 93.18 ± 0.04 | 90.33 ± 0.04 |
| inc.VRB | 92.47 ± 0.03 | 89.63 ± 0.23 |

| Model | PTB 3 | OntoNotes 5 |
|---|---|---|
| CON-PT | 95.34 ± 0.06 | 94.55 ± 0.03 |
| inc.VRB | 95.19 ± 0.06 | 94.02 ± 0.49 |
6.4 Stratified Analysis
Section 5 shows that our S2S approach performs comparably to most ad-hoc models. To reveal its pros and cons, we further partition the test data using task-specific factors and run tests on them. The stratified performance on OntoNotes 5 is compared to the strong BERT baseline (He and Choi, 2021b), which is representative of non-S2S models implementing many state-of-the-art decoders.
For POS, we consider the rate of Out-Of-Vocabulary tokens (OOV, tokens unseen in the training set) in a sentence as the most significant factor. As illustrated in Figure 1a, the OOV rate degrades the baseline performance rapidly, especially when over half of the tokens in a sentence are OOV. However, all S2S approaches show strong resistance to OOV, suggesting that our S2S models unleash greater potential through transfer learning.
Figure 1: Factors impacting each task: the rate of OOV tokens for POS, the rate of unseen entities for NER, the sentence length for CON, and the head-dependent distance for DEP.
For NER, entities unseen during training often confuse a model. This negative impact can be observed on the baseline and LS in Figure 1b. However, the other two schemas generating textual tokens, LT and PT, are less severely impacted by unseen entities. It further supports the intuition behind our approach and agrees with the finding by Shin et al. (2021): with the output sequence being closer to natural language, the S2S model has less difficulty generating it even with unseen entities.
Since the number of binary parses for a sentence of n + 1 tokens is the nth Catalan number (Church and Patil, 1982), sentence length is a crucial factor for CON. As shown in Figure 1c, all models, especially LS, perform worse as sentences get longer. Interestingly, by simply recalling all the lexicons, LT easily regains the ability to parse long sentences. Using an even more natural representation, PT outperforms them with performance on par with the strong baseline. It again supports our intuition that natural-language output is beneficial for a pretrained S2S model.
For DEP, the distance between each dependent and its head is used to stratify the overall performance. As shown in Figure 1d, the gap between S2S models and the baseline increases with the head-dependent distance. The degradation on relatively longer arc-standard transition sequences could be attributed to the static oracle used in finetuning.
Comparing the three schemas across all subgroups, LS uses the most special tokens but performs the worst, while PT uses zero special tokens and outperforms the other two. It suggests that special tokens could harm the performance of the pretrained S2S model as they introduce a mismatch between pretraining and finetuning. With zero special tokens, PT is the most similar to natural language, and it also introduces no extra parameters in finetuning, leading to better performance.
7 Conclusion
We aim to unleash the true potential of S2S models for sequence tagging and structure parsing. To this end, we develop S2S methods that rival state-of-the-art approaches more complicated than ours, without substantial task-specific architecture modifications. Our experiments with three novel linearization schemas on four core NLP tasks demonstrate the effectiveness of natural language in S2S outputs. Our systematic analysis reveals the pros and cons of S2S models, calling for more exploration of structure prediction with S2S.
Our proposed S2S approach reduces the need for many heavily engineered task-specific architectures. It can be readily extended to multi-task and few-shot learning. We have a vision of S2S playing an integral role in more language understanding and generation systems. The limitation of our approach is its relatively slow decoding speed due to serial generation. This issue can be mitigated with non-autoregressive generation and model compression techniques in the future.
Acknowledgments
We would like to thank Emily Pitler, Cindy Robinson, Ani Nenkova, and the anonymous TACL reviewers for their insightful and thoughtful feedback on the early drafts of this paper.
Notes
All our resources, including source code, are publicly available: https://github.com/emorynlp/seq2seq-corenlp.
We experimented with constrained decoding for LS on PTB and OntoNotes, and the improvement is marginal (+0.01).
References
Author notes
Action Editor: Emily Pitler