Unleashing the True Potential of Sequence-to-Sequence Models for Sequence Tagging and Structure Parsing

Sequence-to-Sequence (S2S) models have achieved remarkable success on various text generation tasks. However, learning complex structures with S2S models remains challenging, as external neural modules and additional lexicons are often supplemented to predict non-textual outputs. We present a systematic study of S2S modeling using constrained decoding on four core tasks: part-of-speech tagging, named entity recognition, constituency parsing, and dependency parsing, and develop efficient exploitation methods that cost zero extra parameters. In particular, three lexically diverse linearization schemas and corresponding constrained decoding methods are designed and evaluated. Experiments show that although more lexicalized schemas yield longer output sequences that are heavier to train, their outputs are closer to natural language and therefore easier to learn. Moreover, S2S models using our constrained decoding outperform other S2S approaches that use external resources. Our best models perform better than or comparably to the state of the art on all four tasks, showing promise for S2S models to generate non-sequential structures.


Introduction
Sequence-to-Sequence (S2S) models pretrained with language modeling (PLM) and denoising objectives have been successful on a wide range of NLP tasks where both inputs and outputs are sequences (Radford et al., 2019; Raffel et al., 2020; Lewis et al., 2020; Brown et al., 2020). However, for non-sequential outputs like trees and graphs, a procedure called linearization is often required to flatten them into ordinary sequences (Li et al., 2018; F-G and G-R, 2020; Yan et al., 2021; Bevilacqua et al., 2021; He and Choi, 2021a), where labels in non-sequential structures are mapped heuristically to individual tokens in sequences, and numerical properties like indices are either predicted using an external decoder such as Pointer Networks (Vinyals et al., 2015a) or cast to additional tokens in the vocabulary. While these methods are found to be effective, we hypothesize that S2S models can learn complex structures without such patches.
To challenge the limit of S2S modeling, BART (Lewis et al., 2020) is finetuned on four tasks without extra decoders: part-of-speech tagging (POS), named entity recognition (NER), constituency parsing (CON), and dependency parsing (DEP). Three novel linearization schemas are introduced for each task: label sequence (LS), label with text (LT), and prompt (PT). From LS to PT, the schemas feature an increasing number of lexical tokens and a decreasing number of labels that are not in the vocabulary (Section 3). Every schema is equipped with a constrained decoding algorithm that searches over valid sequences (Section 4).
Our experiments on three popular datasets show that S2S models can learn these linguistic structures without external resources such as index tokens or Pointer Networks. Our best models perform on par with or better than other state-of-the-art models on all four tasks (Section 5). Finally, a detailed analysis is provided to compare the distinctive natures of our proposed schemas (Section 6). All our resources, including source code, are publicly available at https://github.com/emorynlp/seq2seq-corenlp.

Related Work

S2S (Sutskever et al., 2014) architectures have been effective on many sequential modeling tasks. Conventionally, S2S is implemented as an encoder and decoder pair, where the encoder learns input representations used to generate the output sequence via the decoder. Since the input sequence can be very long, attention mechanisms (Bahdanau et al., 2015; Vaswani et al., 2017) focusing on particular positions are often augmented to the basic architecture. With transfer learning, S2S models pretrained on large unlabeled corpora have given rise to a diversity of new approaches that convert language problems into a text-to-text format (Akbik et al., 2018; Lewis et al., 2020; Radford et al., 2019; Raffel et al., 2020; Brown et al., 2020). Among them, the tasks most related to our work are linguistic structure predictions using S2S: POS, NER, DEP, and CON.
POS has been commonly tackled as a sequence tagging task, where the input and output sequences have equal lengths. S2S, on the other hand, does not enjoy such constraints, as the output sequence can be arbitrarily long. Therefore, S2S is not as popular as sequence tagging for POS. Prevailing neural architectures for POS are often built on top of a neural sequence tagger with rich embeddings (Bohnet et al., 2018; Akbik et al., 2018) and Conditional Random Fields (Lafferty et al., 2001).
NER has been cast as a neural sequence tagging task using the IOB notation (Lample et al., 2016) over the years, which benefits most from contextual word embeddings (Devlin et al., 2019; Wang et al., 2021). Early S2S-based works cast NER as a text-to-IOB transduction problem (Chen and Moschitti, 2018; Straková et al., 2019; Zhu et al., 2020), which is included as a baseline schema in Section 3.2. Yan et al. (2021) augment Pointer Networks to generate numerical entity spans, which we refrain from using because the focus of this work is purely on S2S itself. Most recently, Cui et al. (2021) propose the first template prompting to query all possible spans against a S2S language model, which is highly simplified into a one-pass generation in our PT schema. Instead of directly prompting for the entity type, Chen et al. (2022) propose to generate its concepts first and its type later. Their two-step generation is tailored for few-shot learning, orthogonal to our approach. Moreover, our prompt approach does not rely on non-textual tokens as theirs does.
CON is a more established task for S2S models since the bracketed constituency tree is naturally a linearized sequence. Top-down tree linearizations based on brackets (Vinyals et al., 2015b) or shift-reduce actions (Sagae and Lavie, 2005) rely on a strong encoder over the sentence, while bottom-up ones (Zhu et al., 2013; Ma et al., 2017) can utilize rich features from readily built partial parses. Recently, the in-order traversal has proved superior to bottom-up and top-down in both transition-based (Liu and Zhang, 2017) and S2S (F-G and G-R, 2020) constituency parsing. Most recently, a Pointer Networks augmented approach (Yang and Tu, 2022) ranks top among S2S approaches. Since we are interested in the potential of S2S models without patches, a naive bottom-up baseline and its novel upgrades are studied in Section 3.3.
DEP has been underexplored as S2S due to the linearization complexity. The first S2S work maps a sentence to a sequence of source sentence words interleaved with the arc-standard reduce actions in its parse (Wiseman and Rush, 2016), which is adopted as our LT baseline in Section 3.4. Zhang et al. (2017) introduce a stack-based multi-layer attention mechanism to leverage structural linguistic information from the decoding stack in arc-standard parsing. Arc-standard transitions are also used in our LS baseline; however, we use no such extra layers. Apart from transition parsing, Li et al. (2018) directly predict the relative head position instead of the transition. This schema is later extended to multilingual and multitask settings by Choudhary and O'riordan (2021). Their encoder and decoder use different vocabularies, while in our PT setting, we re-use the vocabulary of the S2S language model. S2S appears to be more prevalent for semantic parsing for two reasons. First, synchronous context-free grammar bridges the gap between natural text and meaning representation for S2S. It has been employed to obtain silver annotations (Jia and Liang, 2016), and to generate canonical natural language paraphrases that are easier for S2S to learn (Shin et al., 2021). This line of work, which views semantic parsing as prompt-guided generation (Hashimoto et al., 2018) and paraphrasing (Berant and Liang, 2014), has also inspired our design of PT. Second, the flexible input/output format of S2S facilitates joint learning of semantic parsing and generation. Latent variable sharing (Tseng et al., 2020) and unified pretraining (Bai et al., 2022) are two representative joint modeling approaches, which could be augmented with our PT schema as a potentially more effective linearization.
Our finding that core NLP tasks can be solved using LT overlaps with the Translation between Augmented Natural Languages (Paolini et al., 2021). However, we take one step further and study the impact of textual tokens on schema design choices. Our constrained decoding is similar to existing works (Hokamp and Liu, 2017; Deutsch et al., 2019; Shin et al., 2021). We craft constrained decoding algorithms for our proposed schemas and provide a systematic ablation study in Section 6.1.

Table 1: Schemas for the sentence "My friend who lives in Orlando bought me a gift from Disney World".

Part-of-Speech Tagging
LS: PRP$ NN WP VBZ IN NNP VBD PRP DT NN IN NNP NNP
LT: My/PRP$ friend/NN who/WP lives/VBZ in/IN Orlando/NNP bought/VBD me/PRP a/DT gift/NN from/IN Disney/NNP World/NNP
PT: "My" is a possessive pronoun; "friend" is a noun; "who" is a wh-pronoun; "lives" is a 3rd-person present verb; "in" is a preposition; "Orlando" is a proper noun; "bought" is a past verb; "me" is a personal pronoun; "a" is a determiner; "gift" is a noun; "from" is a preposition; "Disney" is a proper noun; "World" is a proper noun.

Constituency Parsing
PT: The sentence has a noun phrase and a verb phrase; the noun phrase has the noun phrase "My friend" and the subordinating clause, which has the wh-noun phrase "who" and the clause, which has the verb phrase, which has "lives" and the preposition phrase, which has "in" and the noun phrase "Orlando"; the verb phrase has "bought" and the noun phrase "me" and the noun phrase "a gift" and the preposition phrase, which has "from" and the noun phrase "Disney World".

Dependency Parsing
PT: "My" is a possessive modifier of "friend"; "who" is a nominal subject of "lives"; "in" is a case marker of "Orlando"; "lives" has an oblique "Orlando"; "friend" has a relative clause "lives"; "friend" is a nominal subject of "bought"; "bought" has an indirect object "me"; "a" is a determiner of "gift"; "Disney" is a compound word of "World"; "from" is a case marker of "World"; "gift" has a nominal modifier "World"; "bought" has an object "gift".

Schemas
This section presents our output schemas for POS, NER, CON, and DEP in Table 1. For each task, three lexically diverse schemas are designed as follows to explore the best practice for structure learning. First, Label Sequence (LS) is defined as a sequence of labels drawn from a finite set of task-related labels that are merged into the S2S vocabulary, with zero text. Second, Label with Text (LT) includes tokens from the input text on top of the labels, so that it has a medium amount of labels and text. Third, PrompT (PT) gives a list of sentences describing the linguistic structure in natural language, with no labels. We hypothesize that the closer the output is to natural language, the more advantage the S2S model takes from the PLM.

Part-of-Speech Tagging (POS)

LS LS defines the output as a sequence of POS tags. Formally, given an input sentence of n tokens x_1, ..., x_n, the output is y^LS = y_1, ..., y_n, where y_i is the POS tag of x_i. Distinguished from sequence tagging, any LS output sequence is terminated by the "end-of-sequence" (EOS) token, which is omitted from y^LS for simplicity. Predicting POS tags often depends on their neighboring context. We expect the autoregressive decoder of a S2S model to capture this dependency through self-attention.
LT For LT, the token from the input is inserted before its corresponding tag. Formally, the output is defined as y^LT = x_1, y_1, ..., x_n, y_n. Both x and y are part of the output, and the S2S model is trained to generate each pair sequentially.
PT PT is a human-readable text describing the POS sequence. Specifically, we use the phrase y^PT_i = "x_i is d_i" for the i-th token, where d_i is the definition of its POS tag y_i, e.g., a noun. The final prompt is the semicolon-joined concatenation of all phrases.
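As an illustration, a minimal sketch of this linearization is shown below; the tag-definition map is a hypothetical, abbreviated example and not the mapping released with the paper.

    # Hypothetical tag-definition map covering only a few tags, for illustration.
    POS_DEFINITIONS = {"PRP$": "a possessive pronoun", "NN": "a noun", "VBD": "a past verb"}

    def pos_to_pt(tokens, tags):
        # Build the PT prompt: one '"x_i" is <definition>' phrase per token,
        # joined with semicolons.
        phrases = [f'"{tok}" is {POS_DEFINITIONS[tag]}' for tok, tag in zip(tokens, tags)]
        return "; ".join(phrases) + "."

    # pos_to_pt(["My", "friend"], ["PRP$", "NN"])
    # -> '"My" is a possessive pronoun; "friend" is a noun.'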

Named Entity Recognition (NER)
LS LS defines the output as the BIEOS tag sequence of an input sentence comprising n tokens, which labels each token as the Beginning, Inside, or End of an entity, Outside any entity, or a Single-token entity.
LT LT uses a pair of entity type labels to wrap each entity: y^LT = ..., B-y_j, x_i, ..., x_{i+k}, E-y_j, ..., where y_j is the type label of the j-th entity consisting of k tokens.
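As a concrete illustration, a minimal sketch of this wrapping is shown below; the span format (0-based, end-exclusive) and the tag spelling are assumptions for illustration, not the released implementation.

    def ner_to_lt(tokens, entities):
        # entities: list of (start, end, type) spans, 0-based and end-exclusive.
        # Wrap each entity with an opening B-type tag and a closing E-type tag,
        # keeping all other tokens as-is.
        opening = {s: t for s, e, t in entities}
        closing = {e: t for s, e, t in entities}
        out = []
        for i, tok in enumerate(tokens):
            if i in opening:
                out.append(f"B-{opening[i]}")
            out.append(tok)
            if i + 1 in closing:
                out.append(f"E-{closing[i + 1]}")
        return " ".join(out)

    # ner_to_lt(["Jim", "lives", "in", "Orlando"], [(0, 1, "PER"), (3, 4, "LOC")])
    # -> 'B-PER Jim E-PER lives in B-LOC Orlando E-LOC'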
PT PT is defined as a list of sentences describing each entity: y^PT_i = "x_i is d_i", where d_i is the definition of the NER tag y_i, e.g., a person. Different from prior prompt work (Cui et al., 2021), our model generates all entities in one pass, which is more efficient than their brute-force approach.

Constituency Parsing (CON)
Schemas for CON are developed on constituency trees pre-processed by removing the first level of non-terminals (POS tags) and rewiring their children (tokens) to parents, e.g., (NP (PRON My) (NOUN friend)) → (NP My friend).
LS LS is based on a top-down shift-reduce system consisting of a stack, a buffer, and a depth record d. Initially, the stack contains only the root constituent with label TOP and depth 0; the buffer contains all tokens from the input sentence; d is set to 1. A Node-X (N-X) transition creates a new depth-d non-terminal labeled with X, pushes it onto the stack, and sets d ← d + 1. A Shift (SH) transition removes the first token from the buffer and pushes it onto the stack as a new terminal with depth d. A Reduce (RE) transition pops all elements with depth d from the stack, makes them the children of the top constituent of the stack, and sets d ← d − 1. The linearization of a constituency tree using our LS schema can be obtained by applying three string substitutions: replace each left bracket and the label X following it with Node-X, replace each terminal with SH, and replace each right bracket with RE.
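A minimal sketch of these three substitutions is shown below; the bracketed input format and the inclusion of the root bracket are illustrative assumptions, not the released implementation.

    import re

    def con_tree_to_ls(tree: str) -> list[str]:
        # A minimal sketch of the three substitutions described above, assuming the
        # tree is given as a bracketed string such as "(NP My friend)".
        pieces = re.findall(r"\(\s*([^\s()]+)|(\))|([^\s()]+)", tree)
        out = []
        for label, close, terminal in pieces:
            if label:            # "(X ..." -> Node-X
                out.append(f"N-{label}")
            elif close:          # ")"      -> Reduce
                out.append("RE")
            else:                # terminal -> Shift
                out.append("SH")
        return out

    # con_tree_to_ls("(TOP (NP My friend))") -> ['N-TOP', 'N-NP', 'SH', 'SH', 'RE', 'RE']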
LT LT is derived by reverting all SH in LS back to the corresponding tokens, so that the tokens in LT effectively serve as SH in our transition system.
PT PT is also based on a top-down linearization, although it describes each constituent using the template "p_i has {c_j}", where p_i is a constituent and the c_j are its children. To describe a constituent, the indefinite article "a" is used to denote a new constituent (e.g., "... has a noun phrase"). The definite article "the" is used for referring to an existing constituent mentioned before (e.g., "the noun phrase has ...") or for describing a constituent whose children are all terminals (e.g., "... has the noun phrase 'My friend'"). When describing a constituent that directly follows its mention, "which" is used instead of repeating the full phrase (e.g., "... and the subordinating clause, which has ..."). Sentences are joined with a semicolon ";" to form the final prompt.

Dependency Parsing (DEP)

LS LS defines the output as the arc-standard (Nivre, 2004) transition sequence of the dependency tree, where each left-arc and right-arc transition is labeled with its dependency relation.
LT LT for DEP is obtained by replacing each SH in LS with its corresponding token.
PT PT is derived from the LS sequence by removing all SH. Then, for each left arc from x_j to x_i with dependency relation r (e.g., a possessive modifier), a sentence is created by applying the template "x_i is r of x_j". For each right arc from x_i to x_j with dependency relation r, a sentence is created with another template: "x_i has r x_j". The prompt is finalized by joining all such sentences with semicolons.
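A minimal sketch of the two templates is given below; the relation description (e.g., "a possessive modifier") is assumed to come from a hand-written map that is not reproduced here.

    def dep_arc_to_pt(dependent: str, head: str, relation_desc: str, left_arc: bool) -> str:
        # A minimal sketch of the two templates above; relation_desc is the
        # natural-language description of the dependency relation.
        if left_arc:
            return f'"{dependent}" is {relation_desc} of "{head}"'
        return f'"{head}" has {relation_desc} "{dependent}"'

    # dep_arc_to_pt("My", "friend", "a possessive modifier", left_arc=True)
    # -> '"My" is a possessive modifier of "friend"'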

Decoding Strategies
To ensure well-formed output sequences that match the schemas (Section 3), a set of constrained decoding strategies is designed per task, except for CON, which has already been tackled as S2S modeling without constrained decoding (Vinyals et al., 2015b; F-G and G-R, 2020). Formally, given an input x and any partial output y_{<i}, a constrained decoding algorithm defines a function NextY that returns the set of all possible values of y_i that can immediately follow y_{<i} without violating the schema. For brevity, the handling of subwords is explained separately in Section 5.

Part-of-Speech Tagging
LS A finite set of POS tags D is collected from the training set. NextY returns D if i ≤ n; otherwise, it returns EOS.
LT Since tokens from the input sentence are interleaved with their POS tags in LT, NextY depends on the parity of i, as defined in Algorithm 1.
PT The PT generation can be divided into two phases: token generation and "is-a-tag" statement generation. A binary status u is used to indicate whether y_i is expected to be a token. To generate a token, an integer k ≤ n is used to track the index of the next token. To generate an "is-a-tag" statement, an offset o is used to track the beginning of the statement. Each description d_j of a POS tag t_j is extended to a suffix s_j = "is d_j;". Suffixes are stored in a trie T to facilitate prefix matching between a partially generated statement and all candidate suffixes, as shown in Algorithm 2. The full decoding is depicted in Algorithm 3.

Named Entity Recognition

LS Similar to POS-LS, NextY for NER returns the BIEOS tags if i ≤ n and EOS otherwise.
LT Opening tags (<>) in NER-LT are grouped into a vocabulary O. The last generated output token y_{i−1} (assuming y_0 = BOS, the beginning of the sequence) is looked up in O to decide what type of token will be generated next. To enforce label consistency between a pair of tags, a variable e records the expected closing tag. Reusing the definition of k from Algorithm 3, the decoding of NER-LT is described in Algorithm 4.
PT For each entity type e_i, its description d_i is filled into the template "is d_i;" to create an "is-a" suffix s_i. Since the prompt is constructed from text and the number of entities is variable, it is not straightforward to tell whether a token belongs to an entity or to an "is-a" suffix. Therefore, a noisy segmentation procedure is used to split a phrase into two parts, an entity and an "is-a" suffix, and each s_i is collected into a trie S to segment a partially generated phrase p (Algorithm 5). Once a segment is obtained, the decoder is constrained to generate either the entity or the suffix. For the generation of an entity, string matching is used to find every occurrence o of its partial generation in x and to add the following token x_{o+1} to the candidate set Y. String matching can be noisy when an entity shares its surface form with a non-entity phrase, although no such cases are found in our datasets. Entities are generated sequentially, and nested entities are not considered. To complete an "is-a" suffix, the children of the prefix-matched node are added to the candidates (Algorithm 6).
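To make the NextY interface concrete, the following minimal sketch illustrates constrained candidate sets for POS-LS and POS-LT; it is an illustration under simplified assumptions (word-level vocabulary, no subword handling) rather than the released implementation.

    EOS = "</s>"

    def next_y_pos(x_tokens, y_prev, tagset, schema="LT"):
        # A minimal sketch (not the released implementation) of NextY for POS.
        # LS: emit one tag per input token, then EOS.
        # LT: alternate between copying the next input token and emitting a tag.
        i = len(y_prev)               # index of the output position being generated
        n = len(x_tokens)
        if schema == "LS":
            return set(tagset) if i < n else {EOS}
        if schema == "LT":
            if i >= 2 * n:
                return {EOS}
            if i % 2 == 0:            # even positions: the next input token
                return {x_tokens[i // 2]}
            return set(tagset)        # odd positions: any POS tag
        raise ValueError(schema)

    # next_y_pos(["My", "friend"], ["My"], {"PRP$", "NN"}) -> {"PRP$", "NN"}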

Constituency Parsing
S2S on CON using LS and LT has been studied, and the results are good without constrained decoding (Vinyals et al., 2015b; F-G and G-R, 2020); thus, we focus only on PT in this work.
PT The reverse linearization algorithm for CON-PT is non-trivial. To restore a constituency tree from PT, the prompt is split into sentences that create new constituents and sentences that attach new constituents to existing ones. Splitting is done by longest-prefix matching (Algorithm 7) using a trie T built with the definite and indefinite article versions of the description of each constituent label, e.g., "the noun phrase" and "a noun phrase" for NP.
Once a prompt is split into the two types of sentences by Split(T, x) (Algorithm 8: Longest Prefix Splitting), a constituency tree is built accordingly. We use a variable parent to track the last constituent that receives attachments, and another variable latest to track the newest constituent that has been created. Due to the top-down nature of the linearization, the target constituent to which new constituents are attached is always among the siblings of parent or the ancestors of parent. The search for the target constituent is described in Algorithm 9.
Algorithm 9: Find Target
Function FindTarget(parent, label):
    while parent do
        foreach sibling of parent do
            if sibling.label is label and sibling has no children then return sibling
        parent ← parent.parent
    return null

Algorithm 10 shows the final reverse linearization.

Algorithm 10: Reverse CON-PT
Function Reverse(T, x):
    root ← parent ← new TOP-tree
    latest ← null
    foreach (i, j, v) ∈ Split(T, x) do
        if v then
            if x_{i:j} starts with "the" then
                target ← FindTarget(parent, v)
            else
                latest ← new v-tree
                add latest to parent.children
                latest.parent ← parent
        else
            if x_{i:j} starts with "has" or "which has" then
                parent ← latest
            add tokens in "…" into latest
    return root

Dependency Parsing
LS Arc-standard (Nivre, 2004) transitions are added to a candidate set and only transitions permitted by the current parsing state are allowed.
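A minimal sketch of such state-dependent filtering is shown below; the exact precondition checks and the label handling in the released code may differ.

    def permitted_transitions(stack_depth: int, buffer_size: int, labels) -> set[str]:
        # A minimal sketch of arc-standard preconditions used to filter candidates:
        # SH needs a non-empty buffer; labeled LA/RA need at least two stack items.
        allowed = set()
        if buffer_size > 0:
            allowed.add("SH")
        if stack_depth >= 2:
            for rel in labels:
                allowed.add(f"LA-{rel}")
                allowed.add(f"RA-{rel}")
        return allowed

    # permitted_transitions(stack_depth=1, buffer_size=3, labels={"nsubj"}) -> {'SH'}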
LT DEP-LT replaces all SH transitions in DEP-LS with input tokens in left-to-right order. Therefore, an incremental offset is kept to generate the next token in place of each SH.
PT DEP-PT is more complicated than CON-PT because each sentence contains one more token. Its generation is therefore divided into four possible states: first token (1st), relation (rel), second token (2nd), and semicolon. An arc-standard transition system is executed synchronously with constrained decoding, since PT is essentially a simplified transition sequence with all SH removed. Let b and s be the system buffer and stack, respectively. Let c be the set of candidate tokens that may be generated in y, which initially contains all input tokens and an inserted token "sentence" that is only used to represent the root in "the sentence has a root ...". A token is removed from c once it gets popped out of s. Since DEP-PT generates no SH, each input token x_j in y effectively introduces SH(s) until it is pushed onto s at index i (i ∈ {1, 2}), as formally described in Algorithm 11.
Algorithm 11: Recall Shift
Function RecallShift(system, i, x_j):
    while system.s_i is not x_j do
        system.apply(SH)

After the first token is generated, its offset o in y is recorded so that the following relation sequence y_{i>o} can be located. To decide the next token of y_{i>o}, it is prefix-matched against a trie T built with the set of "has-" and "is-" dependency relation descriptions. The children of the prefix-matched node are considered candidates if there are any; otherwise, the dependency relation is marked as completed. Once a relation is generated, the second token is generated in a similar way. Finally, upon the completion of a sentence, the transition it describes is applied to the system and c is updated accordingly. The full procedure is described in Algorithm 12. Since a transition system is maintained synchronously with constrained decoding, no extra reverse linearization is needed.
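The trie-based prefix matching used throughout this section (Algorithms 2 and 5, and the trie T here) can be sketched as follows; the token-level granularity and the sentinel key are assumptions for illustration, not the released implementation.

    def build_trie(phrases):
        # Insert each phrase (e.g. an "is-a" suffix or a relation description)
        # into a nested-dict trie, token by token.
        root = {}
        for phrase in phrases:
            node = root
            for tok in phrase.split():
                node = node.setdefault(tok, {})
            node["<end>"] = {}
        return root

    def next_candidates(trie, prefix_tokens):
        # Children of the prefix-matched node are the allowed next tokens;
        # an empty result means the suffix or relation is complete.
        node = trie
        for tok in prefix_tokens:
            if tok not in node:
                return set()
            node = node[tok]
        return {k for k in node if k != "<end>"}

    # trie = build_trie(["is a noun ;", "is a past verb ;"])
    # next_candidates(trie, ["is", "a"]) -> {'noun', 'past'}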

Experiments
For all tasks, BART-Large (Lewis et al., 2020) is finetuned as our underlying S2S model. We also tried T5 (Raffel et al., 2020), although its performance was less satisfactory. Every model is trained three times using different random seeds, and the average scores and standard deviations on the test sets are reported. Our models are evaluated on OntoNotes 5 (Weischedel et al., 2013) using the data split suggested by Pradhan et al. (2013). In addition, two other popular datasets are used for fair comparison with previous works: the Wall Street Journal corpus of the Penn Treebank 3 (Marcus et al., 1993) for POS, DEP, and CON, as well as the English portion of the CoNLL'03 dataset (Tjong Kim Sang and De Meulder, 2003) for NER.
Each token is independently tokenized using the subword tokenizer of BART and merged into an input sequence. The boundary information of each token is recorded to ensure that full tokens are generated in LT and PT without broken pieces. To fit the positional embeddings of BART, sentences longer than 1,024 subwords are discarded, which includes 1 sentence from the Penn Treebank 3 training set and 24 sentences from the OntoNotes 5 training set. Development and test sets are not affected.
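A minimal sketch of this per-token tokenization is shown below, assuming the Hugging Face BART tokenizer; the exact merging logic in our code may differ.

    from transformers import BartTokenizerFast  # assumed setup: Hugging Face transformers

    def encode_with_boundaries(tokens, tokenizer):
        # Tokenize each token independently, merge the pieces into one sequence,
        # and record (start, end) subword boundaries so constrained decoding can
        # force whole tokens to be generated in LT and PT.
        ids, boundaries = [], []
        for tok in tokens:
            pieces = tokenizer(" " + tok, add_special_tokens=False)["input_ids"]
            boundaries.append((len(ids), len(ids) + len(pieces)))
            ids.extend(pieces)
        return ids, boundaries

    # tokenizer = BartTokenizerFast.from_pretrained("facebook/bart-large")
    # encode_with_boundaries(["My", "friend"], tokenizer)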

Part-of-Speech Tagging
Token-level accuracy is used as the metric for POS. LT outperforms LS even though LT is twice as long as LS, suggesting that textual tokens positively impact the learning of the decoder (Table 2). PT performs almost the same as LT, perhaps because POS is not a task requiring a powerful decoder.

Named Entity Recognition
For CoNLL'03, the provided splits are used without merging the development and training sets. For OntoNotes 5, the same splits as Chiu and Nichols (2016); Li et al. (2017); Ghaddar and Langlais (2018); He and Choi (2020, 2021b) are used. The labeled span-level F1 score is used for evaluation.
We acknowledge that the performance of NER systems can be largely improved by rich embeddings (Wang et al., 2021), document context features (Yu et al., 2020), dependency tree features (Xu et al., 2021), and other external resources. Since our focus is the potential of S2S, we mainly consider two strong baselines that also use BART as the only external resource: the generative BART-Pointer framework (Yan et al., 2021) and the recent template-based BART NER (Cui et al., 2021). As shown in Table 3, LS performs the worst on both datasets, possibly because the autoregressive decoder overfits the high-order left-to-right dependencies of BIEOS tags. LT performs close to the BERT-Large biaffine model (Yu et al., 2020).
PT performs comparably to the Pointer Networks approach (Yan et al., 2021) and outperforms the template prompting (Cui et al., 2021) by a large margin, suggesting that S2S has the potential to learn structures without external modules.

Constituency Parsing
All POS tags are removed and not used in training or evaluation. Terminals belonging to the same non-terminal are flattened into one constituent before training and unflattened in post-processing. The standard constituent-level F-score produced by EVALB (https://nlp.cs.nyu.edu/evalb/) is used as the evaluation metric.
Table 4 shows the results on OntoNotes 5 and PTB 3. Incorporating textual tokens into the output sequence is important on OntoNotes 5, leading to a +0.9 F-score improvement, while this is not the case on PTB 3, possibly because OntoNotes is more diverse in domains and thus requires a higher utilization of the pretrained S2S for domain transfer. PT performs the best, and it is competitive with recent works despite using no extra decoders.

Dependency Parsing

As shown in Table 5, textual tokens are crucial in learning arc-standard transitions with S2S, leading to +2.6 and +7.4 LAS improvements, respectively. Although our PT method underperforms recent state-of-the-art methods, it has the strongest performance among all S2S approaches. Interestingly, our S2S model manages to learn a transition system without explicitly modeling the stack, the buffer, the partial parse, or pointers. We believe that the performance of DEP with S2S can be further improved with a larger and more recent pretrained S2S model and a dynamic oracle (Goldberg and Nivre, 2012).

Ablation Study
We perform an ablation study to show the performance gain of our proposed constrained decoding algorithms on different tasks. Constrained decoding (CD) is compared against free generation (w/o CD), where a model freely generates an output sequence that is later post-processed into task-specific structures using string-matching rules.
Invalid outputs are patched to the greatest extent possible, e.g., POS label sequences are padded or truncated. As shown in Table 6, ablating constrained decoding barely impacts the performance of LS on any task, suggesting that the decoder of S2S can acclimatize to the newly added label tokens.
Interestingly, the less performant NER-LS model degrades the most, underscoring the necessity of constrained decoding for weaker S2S models. The performance of LT on all tasks is marginally degraded when constrained decoding is ablated, indicating that the decoder begins to generate structurally invalid outputs when textual tokens are freely generated. This problem is exacerbated when more tokens are freely generated in the PT schemas, especially for DEP-PT.
Unlike POS and NER, DEP is more prone to hallucinated textual tokens, as early errors in the transition sequence accumulate in the arc-standard system and shift all later predictions off track. This is not yet a critical problem for LS and LT: LS generates no textual tokens, while a textual token in LT still serves as a valid shift action even if it is hallucinated. However, a hallucinated textual token in PT is catastrophic, as it could be part of any arc-standard transition. Since no explicit shift transition is designed, a hallucinated token could lead to multiple missing shifts in Algorithm 12.

Case Study
To facilitate understanding and comparison of the different models, a concrete example of the input (I), the gold annotation (G), and the actual prediction of each schema is provided below for each task. Wrong predictions and the corresponding ground truth are highlighted in red and teal, respectively.

POS In the following example, only PT correctly detects the past tense (VBD) of "put".

I: The word I put in boldface is extremely interesting.
G: DT NN PRP VBD IN NN VBZ RB JJ .
LS: DT NN PRP VBP IN NN VBZ RB RB JJ
LT: The/DT word/NN I/PRP put/VBP in/IN boldface/NN is/VBZ extr./RB interesting/JJ ./.
PT: "The" is a determiner; "word" is a singular noun; "I" is a personal pronoun; "put" is a past tense verb; "in" is a preposition or subordinating conjunction; "boldface" is a singular noun; "is" is a 3rd person singular present verb; "extremely" is an adverb; "interesting" is an adjective; "." is a period.

NER In the following example, LS and LT could not correctly recognize "HIStory" as a work of art, possibly due to its leading uppercase letters.

I: Large image of the Michael Jackson HIStory statue.

CON In the following example ("It's crazy how much he eats."), the PT prediction is:

PT: a sentence has a simple clause, which has a noun phrase and a verb phrase and "."; the noun phrase has a noun phrase "It", the verb phrase has "'s" and an adjective phrase "crazy" and a subordinating clause, which has a wh-noun phrase and a simple clause; the wh-noun phrase has a wh-adjective phrase "how much", the simple clause has a noun phrase "he" and a verb phrase "eats".

DEP In the following example, LS incorrectly attached "so out of" to "place", and LT wrongly attached "so" to "looks".

I: It looks so out of place.
PT: "It" is a nominal subject of "looks"; "so" is an adverbial modifier of "out"; "of" has an object of a preposition "place"; "out" has a prepositional complement "of"; "looks" has a prepositional modifier "out"; "looks" has a punctuation "."; "sentence" has a root "looks".

Design Choices
In the interest of experimentally comparing the schema variants, we would like each design we consider to be equivalent in some systematic way.
To this end, we fix other aspects and vary two dimensions of the prompt design, lexicality and verbosity, to isolate the impact of individual variables.
Lexicality We call the portion of textual tokens in a sequence its lexicality. Thus, LS and PT have zero and full lexicality, respectively, while LT falls in the middle. To tease apart the impact of lexicality, we substitute the lexical phrases in PT with the corresponding tag abbreviations on POS and DEP, e.g., "friend" is a noun → "friend" is a NN, and "friend" is a nominal subject of "bought" → "friend" is a nsubj of "bought". Tags are added to the BART vocabulary and learned from scratch, as in LS and LT. As shown in Table 7, decreasing the lexicality of PT marginally degrades the performance of S2S on POS; on DEP, the performance drop is rather significant. Similar trends are observed when comparing LT and LS in Section 5, confirming that lexicons play an important role in prompt design.

Verbosity Our PT schemas on NER and CON are designed to be as concise as a human narrative, and as easy as possible for S2S to generate. Another design choice would be to be as verbose as some LS and LT schemas. To explore this dimension, we increase the verbosity of NER-PT and CON-PT by adding "isn't an entity" for all non-entity tokens and by substituting each "which" with the phrase it refers to, respectively. The results are presented in Table 8. Though increased verbosity would eliminate ambiguity, it unfortunately hurts performance. Emphasizing that a token "isn't an entity" may run into an over-confidence issue, as boundary annotations can be ambiguous in gold NER data (Zhu and Li, 2022). CON-PT deviates from human language style when references are forbidden, which eventually makes it lengthy and hard to learn.

Stratified Analysis
Section 5 shows that our S2S approach performs comparably to most ad-hoc models.To reveal its pros and cons, we further partition the test data using task-specific factors and run tests on them.The stratified performance on OntoNotes 5 is compared to the strong BERT baseline (He and Choi, 2021b), which is representative of non-S2S models implementing many state-of-the-art decoders.
For POS, we consider the rate of out-of-vocabulary tokens (OOV, tokens unseen in the training set) in a sentence as the most significant factor. As illustrated in Figure 1a, the OOV rate degrades the baseline performance rapidly, especially when over half of the tokens in a sentence are OOV. However, all S2S approaches show strong resistance to OOV, suggesting that our S2S models unleash greater potential through transfer learning.
For NER, entities unseen during training often confuse a model. This negative impact can be observed on the baseline and LS in Figure 1b. However, the other two schemas that generate textual tokens, LT and PT, are less severely impacted by unseen entities. This further supports the intuition behind our approach and agrees with the finding of Shin et al. (2021): with the output sequence being closer to natural language, the S2S model has less difficulty generating it, even with unseen entities.
Since the number of binary parses for a sentence of n + 1 tokens is the n-th Catalan number (Church and Patil, 1982), sentence length is a crucial factor for CON. As shown in Figure 1c, all models, especially LS, perform worse as sentences get longer. Interestingly, by simply recalling all the lexicons, LT easily regains the ability to parse long sentences. Using an even more natural representation, PT outperforms both, with performance on par with the strong baseline. This again supports our intuition that natural language is beneficial for pretrained S2S.
For DEP, the distance between each dependent and its head is used to factorize the overall performance. As shown in Figure 1d, the gap between the S2S models and the baseline increases with the head-dependent distance. The degeneration on relatively longer arc-standard transition sequences could be attributed to the static oracle used in finetuning.
Comparing the three schemas across all subgroups, LS uses the most special tokens but performs the worst, while PT uses zero special tokens and outperforms the other two. This suggests that special tokens can harm the performance of a pretrained S2S model, as they introduce a mismatch between pretraining and finetuning. With zero special tokens, PT is the most similar to natural language, and it also introduces no extra parameters in finetuning, leading to better performance.

Conclusion
We aim to unleash the true potential of S2S models for sequence tagging and structure parsing. To this end, we develop S2S methods that rival state-of-the-art approaches more complicated than ours, without substantial task-specific architecture modifications. Our experiments with three novel prompting schemas on four core NLP tasks demonstrate the effectiveness of natural language in S2S outputs. Our systematic analysis reveals the pros and cons of S2S models, calling for more exploration of structure prediction with S2S.
Our proposed S2S approach reduces the need for heavily engineered, task-specific architectures. It can be readily extended to multi-task and few-shot learning. We envision S2S playing an integral role in more language understanding and generation systems. The limitation of our approach is its relatively slow decoding speed due to serial generation. This issue can be mitigated with non-autoregressive generation and model compression techniques in the future.


Figure 1: Factors impacting each task: the rate of OOV tokens for POS, the rate of unseen entities for NER, the sentence length for CON, and the head-dependent distance for DEP.

Table 2: Results for POS.

Table 7: Study of lexicality on POS and DEP.

Table 8: Study of verbosity on NER and CON.