A Generative Model for Punctuation in Dependency Trees

Treebanks traditionally treat punctuation marks as ordinary words, but linguists have suggested that a tree’s “true” punctuation marks are not observed (Nunberg, 1990). These latent “underlying” marks serve to delimit or separate constituents in the syntax tree. When the tree’s yield is rendered as a written sentence, a string rewriting mechanism transduces the underlying marks into “surface” marks, which are part of the observed (surface) string but should not be regarded as part of the tree. We formalize this idea in a generative model of punctuation that admits efficient dynamic programming. We train it without observing the underlying marks, by locally maximizing the incomplete data likelihood (similarly to the EM algorithm). When we use the trained model to reconstruct the tree’s underlying punctuation, the results appear plausible across 5 languages, and in particular are consistent with Nunberg’s analysis of English. We show that our generative model can be used to beat baselines on punctuation restoration. Also, our reconstruction of a sentence’s underlying punctuation lets us appropriately render the surface punctuation (via our trained underlying-to-surface mechanism) when we syntactically transform the sentence.


Introduction
Punctuation enriches the expressiveness of written language. When converting from spoken to written language, punctuation indicates pauses or pitches; expresses propositional attitude; and is conventionally associated with certain syntactic constructions such as apposition, parenthesis, quotation, and conjunction.
In this paper, we present a latent-variable model of punctuation usage, inspired by the rule-based approach to English punctuation of Nunberg (1990). Training our model on English data learns rules that are consistent with Nunberg's hand-crafted rules. Our system is automatic, so we use it to obtain rules for Arabic, Chinese, Spanish, and Hindi as well.
Moreover, our rules are stochastic, which allows us to reason probabilistically about ambiguous or missing punctuation. Across the 5 languages, our model predicts surface punctuation better than baselines, as measured both by perplexity (§4) and by accuracy on a punctuation restoration task (§6.1). We also use our model to correct the punctuation of non-native writers of English (§6.2), and to maintain natural punctuation style when syntactically transforming English sentences (§6.3). In principle, our model could also be used within a generative parser, allowing the parser to evaluate whether a candidate tree truly explains the punctuation observed in the input sentence (§8).
Punctuation is interesting. In The Linguistics of Punctuation, Nunberg (1990) argues that punctuation (in English) is more than a visual counterpart of spoken-language prosody, but forms a linguistic system that involves "interactions of point indicators (i.e. commas, semicolons, colons, periods and dashes)." He proposes that much as in phonology (Chomsky and Halle, 1968), a grammar generates underlying punctuation which then transforms into the observed surface punctuation.
Consider generating a sentence from a syntactic grammar as follows: Note that these modifications are string transformations that do not see or change the tree. The resulting surface punctuation marks may be clues to the parse tree, but (contrary to NLP convention) they should not be included as nodes in the parse tree. Only the underlying marks play that role.
Punctuation is meaningful. Pang et al. (2002) use question and exclamation marks as clues to sentiment. Similarly, quotation marks may be used to mark titles, quotations, reported speech, or dubious terminology (University of Chicago, 2010). Because of examples like this, methods for determining the similarity or meaning of syntax trees, such as a tree kernel (Agarwal et al., 2011) or a recursive neural network (Tai et al., 2015), should ideally be able to consider where the underlying punctuation marks attach.
Punctuation is helpful. Surface punctuation remains correlated with syntactic phrase structure. NLP systems for generating or editing text must be able to deploy surface punctuation as human writers do. Parsers and grammar induction systems benefit from the presence of surface punctuation marks (Jones, 1994; Spitkovsky et al., 2011). It is plausible that they could do better with a linguistically informed model that explains exactly why the surface punctuation appears where it does. Patterns of punctuation usage can also help identify the writer's native language (Markov et al., 2018).
Punctuation is neglected. Work on syntax and parsing tends to treat punctuation as an afterthought rather than a phenomenon governed by its own linguistic principles. Treebank annotation guidelines for punctuation tend to adopt simple heuristics like "attach to the highest possible node that preserves projectivity" (Bies et al., 1995; Nivre et al., 2018). 1 Many dependency parsing works exclude punctuation from evaluation (Nivre et al., 2007b; Koo and Collins, 2010; Chen and Manning, 2014; Lei et al., 2014; Kiperwasser and Goldberg, 2016), although some others retain punctuation (Nivre et al., 2007a; Goldberg and Elhadad, 2010; Dozat and Manning, 2017). In tasks such as word embedding induction (Mikolov et al., 2013; Pennington et al., 2014) and machine translation (Zens et al., 2002), punctuation marks are usually either removed or treated as ordinary words (Řehůřek and Sojka, 2010).

Figure 1: The generative story of a sentence. Given an unpunctuated tree $T$ at top, at each node $w \in T$, the ATTACH process stochastically attaches a left puncteme $l$ and a right puncteme $r$, which may be empty. The resulting tree $T'$ has underlying punctuation $u$. Each slot's punctuation $u_i \in u$ is rewritten to $x_i$ by NOISYCHANNEL. (Example shown in the figure: the punctemes attached are . at the right of root, “ ” around nsubj, and “ ” around dobj; the underlying string “ Dale ” means “ river valley ” . is rendered as the surface string “ Dale ” means “ river valley . ” with slots $x_0, \ldots, x_4$.)
Yet to us, building a parse tree on a surface sentence seems as inappropriate as morphologically segmenting a surface word. In both cases, one should instead analyze the latent underlying form, jointly with recovering that form. For example, the proper segmentation of English hoping is not hop-ing but hope-ing (with underlying e), and the proper segmentation of stopping is neither stopp-ing nor stop-ping but stop-ing (with only one underlying p). Cotterell et al. (2015, 2016) get this right for morphology. We attempt to do the same for punctuation.

Formal Model
We propose a probabilistic generative model of sentences (Figure 1):

$$p(\bar{x}) = \sum_{T, T'} p_{\mathrm{syn}}(T) \cdot p_\theta(T' \mid T) \cdot p_\phi(\bar{x} \mid \bar{u}(T')) \tag{1}$$

First, an unpunctuated dependency tree $T$ is stochastically generated by some recursive process $p_{\mathrm{syn}}$ (e.g., Eisner, 1996, Model C). 2 Second, each constituent (i.e., dependency subtree) sprouts optional underlying punctuation at its left and right edges, according to a probability distribution $p_\theta$ that depends on the constituent's syntactic role (e.g., dobj for "direct object"). This punctuated tree $T'$ yields the underlying string $\bar{u} = \bar{u}(T')$, which is edited by a finite-state noisy channel $p_\phi$ to arrive at the surface sentence $\bar{x}$.
This third step may alter the sequence of punctuation tokens at each slot between words, for example, in §1, collapsing the double comma , , between Pendragon and who. $u$ and $x$ denote just the punctuation at the slots of $\bar{u}$ and $\bar{x}$ respectively, with $u_i$ and $x_i$ denoting the punctuation token sequences at the $i$th slot. Thus, the transformation at the $i$th slot is $u_i \mapsto x_i$.
Since this model is generative, we could train it without any supervision to explain the observed surface string $\bar{x}$: maximize the likelihood $p(\bar{x})$ in (1), marginalizing out the possible $T, T'$ values.
In the present paper, however, we exploit known $T$ values (as observed in the "depunctuated" version of a treebank). Because $T$ is observed, we can jointly train $\theta, \phi$ to maximize just

$$p(x \mid T) = \sum_{T'} p_\theta(T' \mid T) \cdot p_\phi(x \mid u(T')) \tag{2}$$

That is, the $p_{\mathrm{syn}}$ model that generated $T$ becomes irrelevant, but we still try to predict what surface punctuation will be added to $T$. We still marginalize over the underlying punctuation marks $u$. These are never observed, but they must explain the surface punctuation marks $x$ (§2.2), and they must be explained in turn by the syntax tree $T$ (§2.1). The trained generative model then lets us restore or correct punctuation in new trees $T$ (§6).
where $V$ is the finite set of possible punctemes and $W_d \subseteq V^2$ gives the possible puncteme pairs for a node $w$ that has dependency relation $d = d(w)$ to its parent. $V$ and $W_d$ are estimated heuristically from the tokenized surface data (§4). $f(l, r, w)$ is a sparse binary feature vector, and $\theta$ is the corresponding parameter vector of feature weights. The feature templates in Appendix A 4 consider the symmetry between $l$ and $r$, and their compatibility with (a) the POS tag of $w$'s head word, (b) the dependency paths connecting $w$ to its children and the root of $T$, (c) the POS tags of the words flanking the slots containing $l$ and $r$, (d) surface punctuation already added to $w$'s subconstituents.
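As a rough illustration of how a weight vector over sparse binary features can define a distribution over the allowed puncteme pairs, the sketch below assumes a log-linear (softmax) form normalized over $W_d$. That form, the function name `attach_distribution`, and the toy feature names and weights are assumptions made here for illustration; they are not the feature templates of Appendix A.

```python
import math

def attach_distribution(allowed_pairs, features, theta):
    """Distribution over puncteme pairs, ASSUMING a log-linear form.

    allowed_pairs: list of (l, r) pairs, standing in for the set W_d.
    features: function (l, r) -> set of active binary feature names.
    theta: dict feature name -> weight (absent features weigh 0).
    """
    scores = [sum(theta.get(f, 0.0) for f in features(l, r))
              for l, r in allowed_pairs]
    top = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return {pair: e / z for pair, e in zip(allowed_pairs, exps)}

# Hypothetical allowed pairs for one dependency relation.
W_dobj = [("", ""), ("“", "”"), (",", ",")]

def toy_features(l, r):
    """Hypothetical features: the pair's identity, plus an 'empty' indicator."""
    feats = {"pair=" + l + "_" + r}
    if l == r == "":
        feats.add("both_empty")
    return feats

theta = {"both_empty": 2.0, "pair=“_”": 1.0}   # made-up weights
p = attach_distribution(W_dobj, toy_features, theta)
```

Restricting the normalization to `allowed_pairs` mirrors the role of $W_d$: pairs outside the set simply receive no probability.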

From Underlying to Surface
From the tree $T'$, we can read off the sequence of underlying punctuation tokens $u_i$ at each slot $i$ between words. Namely, $u_i$ concatenates the right punctemes of all constituents ending at $i$ with the left punctemes of all constituents starting at $i$ (as illustrated by the examples in §1 and Figure 1). The NOISYCHANNEL model then transduces $u_i$ to a surface token sequence $x_i$, for each $i = 0, \ldots, n$ independently (where $n$ is the sentence length).
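The reading-off step can be sketched with an in-order traversal of the punctuated tree. The `Node` class and `underlying_slots` function are hypothetical names introduced for this illustration. The ordering within a slot (right punctemes innermost first, then left punctemes outermost first) is an assumption that follows from reading the tree as a bracketed string.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    word: str
    left: tuple = ()                 # left puncteme, as a tuple of tokens
    right: tuple = ()                # right puncteme, as a tuple of tokens
    left_kids: list = field(default_factory=list)
    right_kids: list = field(default_factory=list)

def underlying_slots(root):
    """Return [u_0, ..., u_n], the underlying tokens at each of the n+1 slots."""
    slots = [[]]
    def visit(node):
        slots[-1].extend(node.left)      # constituent starts at the current slot
        for kid in node.left_kids:
            visit(kid)
        slots.append([])                 # passing the head word opens a new slot
        for kid in node.right_kids:
            visit(kid)
        slots[-1].extend(node.right)     # constituent ends at the current slot
    visit(root)
    return slots

# The example of Figure 1: “ Dale ” means “ river valley ” .
dale = Node("Dale", left=("“",), right=("”",))
valley = Node("valley", left=("“",), right=("”",), left_kids=[Node("river")])
means = Node("means", right=(".",), left_kids=[dale], right_kids=[valley])
u = underlying_slots(means)
```

Here the last slot collects the right puncteme of the object followed by that of the root, which is the sequence that the noisy channel may then rewrite.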
Nunberg's formalism. Much like Chomsky and Halle's (1968) phonological grammar of English, Nunberg's (1990) descriptive English punctuation grammar (Table 1) can be viewed computationally as a priority string rewriting system, or Markov algorithm (Markov, 1960; Caracciolo di Forino, 1968). The system begins with a token string $u$.
Figure 2: Editing abcde ↦ ade with a sliding window. The successive edits are ab ↦ ab, bc ↦ b, bd ↦ db, and be ↦ e, so the string passes through abcde, a bde, a dbe, and finally a d e. (When an absorption rule maps 2 tokens to 1, our diagram leaves blank space that is not part of the output string.) At each step, the left-to-right process has already committed to the green tokens as output; has not yet looked at the blue input tokens; and is currently considering how to (further) rewrite the black tokens. The right column shows the chosen edit.
At each step it selects the highest-priority local rewrite rule that can apply, and applies it as far left as possible. When no more rules can apply, the final state of the string is returned as x.
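A priority string rewriting system of this kind can be sketched in a few lines. The function name `markov_rewrite` and the two rules in the example are illustrative assumptions (they are not copied from Table 1); the step limit is a safeguard added here because such systems need not terminate in general.

```python
def markov_rewrite(tokens, rules, max_steps=10_000):
    """Apply prioritized rewrite rules until none applies.

    rules: list of (lhs, rhs) token tuples, highest priority first.
    Each step applies the highest-priority applicable rule at its
    leftmost match, then starts over from the top of the rule list.
    """
    s = list(tokens)
    for _ in range(max_steps):
        for lhs, rhs in rules:
            k = len(lhs)
            pos = next((i for i in range(len(s) - k + 1)
                        if tuple(s[i:i + k]) == tuple(lhs)), None)
            if pos is not None:
                s[pos:pos + k] = rhs     # rewrite in place
                break
        else:
            return s                     # no rule applies: final string x
    raise RuntimeError("rewriting did not terminate within max_steps")

# Hypothetical absorption rules, for illustration only.
rules = [((",", ","), (",",)),   # a double comma collapses to one
         ((",", "."), (".",))]   # a comma is absorbed by a following period
x = markov_rewrite([",", ",", "."], rules)
```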
Simplifying the formalism. Markov algorithms are Turing complete. Fortunately, Johnson (1972) noted that in practice, phonological $u \mapsto x$ maps described in this formalism can usually be implemented with finite-state transducers (FSTs).
For computational simplicity, we will formulate our punctuation model as a probabilistic FST (PFST), a locally normalized left-to-right rewrite model (Cotterell et al., 2014). The probabilities for each language must be learned, using gradient descent. Normally we expect most probabilities to be near 0 or 1, making the PFST nearly deterministic (i.e., close to a subsequential FST). However, permitting low-probability choices remains useful to account for typographical errors, dialectal differences, and free variation in the training corpus.
Our PFST generates a surface string, but the invertibility of FSTs will allow us to work backwards when analyzing a surface string (§3).
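The left-to-right editing illustrated in Figure 2 can be sketched directly. This is a minimal illustration rather than the PFST construction itself: the function name, the edit labels, and the dictionary `edit_probs` (standing in for learned conditional edit probabilities) are assumptions introduced here. Each window holds two tokens, and one of four edits is drawn: keep both, delete the left token, delete the right token, or transpose them.

```python
import random

KEEP, ABSORB_LEFT, ABSORB_RIGHT, SWAP = "ab->ab", "ab->b", "ab->a", "ab->ba"

def sliding_window_rewrite(u, edit_probs, rng=random):
    """One left-to-right pass of a 2-token window over the token list u.

    edit_probs maps a bigram (a, b) to {edit: probability};
    bigrams that are not listed are left unchanged.
    """
    if not u:
        return []
    out = []                  # tokens already committed to the output
    a = u[0]                  # left token of the current window
    for b in u[1:]:           # right token: the next unread input token
        dist = edit_probs.get((a, b), {KEEP: 1.0})
        edit = rng.choices(list(dist), weights=list(dist.values()))[0]
        if edit == KEEP:
            out.append(a)     # commit a; b becomes the new left token
            a = b
        elif edit == ABSORB_LEFT:
            a = b             # a is deleted
        elif edit == ABSORB_RIGHT:
            pass              # b is deleted; a stays in the window
        elif edit == SWAP:
            out.append(b)     # b moves to the left of a and is committed
        else:
            raise ValueError("unknown edit: " + edit)
    out.append(a)             # the last window token always survives
    return out

# Deterministic rules that reproduce the trace of Figure 2.
fig2 = {("a", "b"): {KEEP: 1.0}, ("b", "c"): {ABSORB_RIGHT: 1.0},
        ("b", "d"): {SWAP: 1.0}, ("b", "e"): {ABSORB_LEFT: 1.0}}
result = sliding_window_rewrite(list("abcde"), fig2)
```

Because the final window token is always emitted and no edit inserts tokens, a nonempty input never rewrites to the empty string, and every output token comes from the input.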
A sliding-window model. Instead of having rule priorities, we apply Nunberg-style rules within a 2-token window that slides over $u$ in a single left-to-right pass (Figure 2). Conditioned on the current window contents $ab$, a single edit is selected stochastically: either $ab \mapsto ab$ (no change), $ab \mapsto b$ (left absorption), $ab \mapsto a$ (right absorption), or $ab \mapsto ba$ (transposition). Then the window slides rightward to cover the next input token, together with the token that is (now) to its left. $a$ and $b$ are always real tokens, never boundary symbols. $\phi$ specifies the conditional edit probabilities. 5

These specific edit rules (like Nunberg's) cannot insert new symbols, nor can they delete all of the underlying symbols. Thus, surface $x_i$ is a good clue to $u_i$: all of its tokens must appear underlyingly, and if $x_i = \epsilon$ (the empty string) then $u_i = \epsilon$.

The model can be directly implemented as a PFST (Appendix D 4 ) using Cotterell et al.'s (2014) more general PFST construction.

Our single-pass formalism is less expressive than Nunberg's. It greedily makes decisions based on at most one token of right context ("label bias"). It cannot rewrite '". ↦ .'" or ",. ↦ ." because the . is encountered too late to percolate leftward; luckily, though, we can handle such English examples by sliding the window right-to-left instead of left-to-right. We treat the sliding direction as a language-specific parameter. 6

5 Rather than learn a separate edit probability distribution for each bigram $ab$, one could share parameters across bigrams. For example, Table 1's caption says that "stronger" tokens tend to absorb "weaker" ones. A model that incorporated this insight would not have to learn $O(|\Sigma|^2)$ separate absorption probabilities (two per bigram $ab$), but only $O(|\Sigma|)$ strengths (one per unigram $a$, which may be regarded as a 1-dimensional embedding of the punctuation token $a$). We figured that the punctuation vocabulary $\Sigma$ was small enough (Table 2) that we could manage without the additional complexity of embeddings or other featurization, although this does presumably hurt our generalization to rare bigrams.
6 We could have handled all languages uniformly by making 2 passes of the sliding window (via a composition of 2 PFSTs), with at least one pass in each direction.

Training Objective
Building on equation (2), we train $\theta, \phi$ to locally maximize the regularized conditional log-likelihood

$$\sum_{x, T} \Big( \log p(x \mid T) - \xi \cdot \mathbb{E}_{T'}[c(T')]^2 \Big) - \varsigma \cdot \|\theta\|^2 \tag{5}$$

where the sum is over a training treebank. 7 The expectation $\mathbb{E}[\cdots]$ is over $T' \sim p(\cdot \mid T, x)$. This generalized expectation term provides posterior regularization (Mann and McCallum, 2010; Ganchev et al., 2010), by encouraging parameters that reconstruct trees $T'$ that use symmetric punctuation marks in a "typical" way. The function $c(T')$ counts the nodes in $T'$ whose punctemes contain "unmatched" symmetric punctuation tokens: for example, ) is "matched" only when it appears in a right puncteme with ( at the comparable position in the same constituent's left puncteme. The precise definition is given in Appendix B. 4

7 In retrospect, there was no good reason to square the $\mathbb{E}_{T'}[c(T')]$ term. However, when we started redoing the experiments, we found the results essentially unchanged.
In our development experiments on English, the posterior regularization term was necessary to discover an aesthetically appealing theory of underlying punctuation. When we dropped this term ($\xi = 0$) and simply maximized the ordinary regularized likelihood, we found that the optimization problem was underconstrained: different training runs would arrive at different, rather arbitrary underlying punctemes. For example, one training run learned an ATTACH model that used underlying ". to terminate sentences, along with a NOISYCHANNEL model that absorbed the left quotation mark into the period. By encouraging the underlying punctuation to be symmetric, we broke the ties. We also tried making this a hard constraint ($\xi = \infty$), but then the model was unable to explain some of the training sentences at all, giving them probability of 0. For example, I went to the " special place " cannot be explained, because special place is not a constituent. 8

Inference
In principle, working with the model (1) is straightforward, thanks to the closure properties of formal languages. Provided that $p_{\mathrm{syn}}$ can be encoded as a weighted CFG, it can be composed with the weighted tree transducer $p_\theta$ and the weighted FST $p_\phi$ to yield a new weighted CFG (similarly to Bar-Hillel et al., 1961; Nederhof and Satta, 2003). Under this new grammar, one can recover the optimal $T, T'$ for $\bar{x}$ by dynamic programming, or sum over $T, T'$ by the inside algorithm to get the likelihood $p(\bar{x})$. A similar approach was used by Levy (2008) with a different FST noisy channel.
In this paper we assume that $T$ is observed, allowing us to work with equation (2). This cuts the computation time from $O(n^3)$ to $O(n)$. 9 Whereas the inside algorithm for (1) must consider $O(n^2)$ possible constituents of $\bar{x}$ and $O(n)$ ways of building each, our algorithm for (2) only needs to iterate over the $O(n)$ true constituents of $T$ and the 1 true way of building each. However, it must still consider the $|W_d|$ puncteme pairs for each constituent.

8 Recall that the NOISYCHANNEL model family (§2.2) requires the surface " before special to appear underlyingly, and also requires the surface $\epsilon$ after special to be empty underlyingly. These hard constraints clash with the $\xi = \infty$ hard constraint that the punctuation around special must be balanced. The surface " after place causes a similar problem: no edge can generate the matching underlying ".
9 We do $O(n)$ multiplications of $N \times N$ matrices where $N = O(\#\text{ of punc types} \cdot \max \#\text{ of punc tokens per slot})$.

Algorithm 1: The algorithm for scoring a given $(T, x)$ pair. The code in blue is used during training to get the posterior regularization term in (5).

Algorithms
Given an input sentence $\bar{x}$ of length $n$, our job is to sum over possible trees $T'$ that are consistent with $T$ and $\bar{x}$, or to find the best such $T'$. This is roughly a lattice parsing problem, made easier by knowing $T$. However, the possible $\bar{u}$ values are characterized not by a lattice but by a cyclic WFSA (as $|u_i|$ is unbounded whenever $|x_i| > 0$).

For each slot $0 \le i \le n$, transduce the surface punctuation string $x_i$ by the inverted PFST for $p_\phi$ to obtain a weighted finite-state automaton (WFSA) that describes all possible underlying strings $u_i$. 10 This WFSA accepts each possible $u_i$ with weight $p_\phi(x_i \mid u_i)$. If it has $N_i$ states, we can represent it (Berstel and Reutenauer, 1988) with a family of sparse weight matrices $M_i(\sigma) \in \mathbb{R}^{N_i \times N_i}$, whose element at row $s$ and column $t$ is the weight of the $s \to t$ arc labeled with $\sigma$, or 0 if there is no such arc. Additional vectors $\lambda_i, \rho_i \in \mathbb{R}^{N_i}$ specify the initial and final weights. ($\lambda_i$ is one-hot if the PFST has a single initial state, of weight 1.)

For any puncteme $l$ (or $r$) in $V$, we define $M_i(l) = \prod_{\sigma \in l} M_i(\sigma)$, a product over the 0 or more tokens in $l$. This gives the total weight of all $s \to^* t$ WFSA paths labeled with $l$. The subprocedure in Algorithm 1 essentially extends this to obtain a new matrix $\mathrm{IN}(w) \in \mathbb{R}^{N_i \times N_k}$, where the subtree rooted at $w$ stretches from slot $i$ to slot $k$. Its element $\mathrm{IN}(w)_{st}$ gives the total weight of all extended paths in the $\bar{u}$ WFSA from state $s$ at slot $i$ to state $t$ at slot $k$. An extended path is defined by a choice of underlying punctemes at $w$ and all its descendants. These punctemes determine an $s$-to-final path at $i$, then initial-to-final paths at $i + 1$ through $k - 1$, then an initial-to-$t$ path at $k$.

10 Constructively, compose the $u$-to-$x$ PFST (from the end of §2.2) with a straight-line FSA accepting only $x_i$, and project the resulting WFST to its input tape (Pereira and Riley, 1996), as explained at the end of Appendix D.
The weight of the extended path is the product of all the WFSA weights on these paths (which correspond to transition probabilities in the $p_\phi$ PFST) times the probability of the choice of punctemes (from $p_\theta$).
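The recursion over constituents can be sketched in plain Python, reconstructed from the prose description rather than from the listing of Algorithm 1. Several simplifications are assumptions of this sketch: the noisy channel is the identity, so each slot's WFSA accepts exactly the observed $x_i$ (`exact_wfsa`); a node is a tuple `(head_position, deprel, left_kids, right_kids)` with 1-based word positions; `p_attach` is a made-up table of puncteme-pair probabilities per relation; and the posterior regularization term is omitted.

```python
def matmul(A, B):
    """Dense matrix product on lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def exact_wfsa(x_i):
    """(M, lam, rho) for a WFSA accepting only x_i (identity channel assumed)."""
    n = len(x_i) + 1
    def M(tok):   # weight matrix of the arcs labeled tok
        return [[float(t == s + 1 and x_i[s] == tok) for t in range(n)]
                for s in range(n)]
    return M, [float(s == 0) for s in range(n)], [float(s == n - 1) for s in range(n)]

def puncteme_matrix(wfsa, puncteme):
    """Product of token matrices over the puncteme (identity if it is empty)."""
    M, lam, _ = wfsa
    out = [[float(s == t) for t in range(len(lam))] for s in range(len(lam))]
    for tok in puncteme:
        out = matmul(out, M(tok))
    return out

def inside(node, wfsas, p_attach):
    """Return (i, k, IN) for the constituent spanning slots i..k."""
    pos, deprel, left_kids, right_kids = node
    i, k = pos - 1, pos
    M_left = [[r] for r in wfsas[pos - 1][2]]   # final weights left of the head
    for kid in reversed(left_kids):
        i, _, K = inside(kid, wfsas, p_attach)
        M_left = matmul(K, M_left)
    M_right = [wfsas[pos][1]]                   # initial weights right of the head
    for kid in right_kids:
        _, k, K = inside(kid, wfsas, p_attach)
        M_right = matmul(M_right, K)
    M0 = matmul(M_left, M_right)                # outer product, N_i x N_k
    total = [[0.0] * len(M0[0]) for _ in M0]
    for (l, r), p in p_attach[deprel].items():  # sum over puncteme pairs
        term = matmul(matmul(puncteme_matrix(wfsas[i], l), M0),
                      puncteme_matrix(wfsas[k], r))
        total = [[a + p * b for a, b in zip(ra, rb)] for ra, rb in zip(total, term)]
    return i, k, total

def total_score(root, wfsas, p_attach):
    """p(x | T): start in the initial state of slot 0, end in a final state of slot n."""
    _, _, M = inside(root, wfsas, p_attach)
    lam, rho = wfsas[0][1], wfsas[-1][2]
    return sum(lam[s] * M[s][t] * rho[t]
               for s in range(len(lam)) for t in range(len(rho)))

# Toy instance: Dale(1) means(2) river(3) valley(4), observed slots x_0..x_4.
x = [["“"], ["”"], ["“"], [], ["”", "."]]
tree = (2, "root", [(1, "nsubj", [], [])],
        [(4, "dobj", [(3, "compound", [], [])], [])])
quoted, empty = (("“",), ("”",)), ((), ())
p_attach = {"root": {((), (".",)): 0.9, empty: 0.1},
            "nsubj": {empty: 0.7, quoted: 0.3},
            "dobj": {empty: 0.6, quoted: 0.4},
            "compound": {empty: 1.0}}
score = total_score(tree, [exact_wfsa(xi) for xi in x], p_attach)
```

In this toy instance only one assignment of punctemes explains the observed slots, so the score equals the product 0.9 · 0.3 · 0.4 of the corresponding attachment probabilities.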
Specifically, to modify Algorithm 1 to maximize over $T'$ values (§§6.2-6.3) instead of summing over them, we switch to the derivation semiring (Goodman, 1999), as follows. Whereas $\mathrm{IN}(w)_{st}$ used to store the total weight of all extended paths from state $s$ at slot $i$ to state $t$ at slot $k$, now it will store the weight of the best such extended path. It will also store that extended path's choice of underlying punctemes, in the form of a puncteme-annotated version of the subtree of $T$ that is rooted at $w$. This is a potential subtree of $T'$.
Thus, each element of $\mathrm{IN}(w)$ has the form $(r, D)$ where $r \in \mathbb{R}$ and $D$ is a tree. We define addition and multiplication over such pairs:

$$(r, D) + (r', D') = \begin{cases} (r, D) & \text{if } r > r' \\ (r', D') & \text{otherwise} \end{cases} \tag{6}$$

$$(r, D) \cdot (r', D') = (r r', D D') \tag{7}$$

where $DD'$ denotes an ordered combination of two trees. Matrix products $UV$ and scalar-matrix products $p \cdot V$ are defined in terms of element addition and multiplication as usual:

$$(UV)_{st} = \sum_{q} U_{sq} \cdot V_{qt}, \qquad (p \cdot V)_{st} = (p, \epsilon) \cdot V_{st}$$

What is $DD'$? For presentational purposes, it is convenient to represent a punctuated dependency tree as a bracketed string. For example, the underlying tree $T'$ in Figure 1 would be [ [" Dale "] means [" [ river ] valley "] ] where the words correspond to nodes of $T$. In this case, we can represent every $D$ as a partial bracketed string and define $DD'$ by string concatenation. This presentation ensures that multiplication (7) is a complete and associative (though not commutative) operation, as in any semiring. As base cases, each real-valued element of $M_i(l)$ or $M_k(r)$ is now paired with the string [$l$ or $r$] respectively, 11 and the real number 1 at line 10 is paired with the string $w$. The real-valued elements of the $\lambda_i$ and $\rho_i$ vectors and the 0 matrix at line 11 are paired with the empty string $\epsilon$, as is the real number $p$ at line 13.
In practice, the D strings that appear within the matrix M of Algorithm 1 will always represent complete punctuated trees. Thus, they can actually be represented in memory as such, and different trees may share subtrees for efficiency (using pointers). The product in line 10 constructs a matrix of trees with root w and differing sequences of left/right children, while the product in line 14 annotates those trees with punctemes l, r.
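A small sketch of such a semiring over (weight, derivation) pairs may make the construction concrete. The class name `Deriv`, the use of token tuples for partial bracketed strings, and the toy matrices are assumptions of this illustration; addition keeps the heavier derivation and multiplication concatenates.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Deriv:
    weight: float
    tree: tuple = ()          # partial bracketed string as tokens; () is epsilon

    def __add__(self, other):
        # keep whichever derivation has the larger weight
        return self if self.weight > other.weight else other

    def __mul__(self, other):
        # multiply the weights and concatenate the partial derivations in order
        return Deriv(self.weight * other.weight, self.tree + other.tree)

ZERO = Deriv(0.0)

def matmul(U, V):
    """Matrix product using the semiring's own + and * on each element."""
    out = []
    for row in U:
        out_row = []
        for col in zip(*V):
            best = ZERO
            for a, b in zip(row, col):
                best = best + a * b
            out_row.append(best)
        out.append(out_row)
    return out

# Two competing analyses of one word: quoted (0.3 * 0.5) or bare (0.7 * 0.1).
U = [[Deriv(0.3, ("[", "“")), Deriv(0.7, ("[",))]]
V = [[Deriv(0.5, ("Dale", "”", "]"))],
     [Deriv(0.1, ("Dale", "]"))]]
best = matmul(U, V)[0][0]
```

Since the same `matmul` shape is used as for real-valued weights, swapping the element type is the only change needed to move from summing to maximizing.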
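To sample a possible $T'$ from the derivation forest in proportion to its probability (§6.1), we use the same algorithm but replace equation (6) with

$$(r, D) + (r', D') = \begin{cases} (r + r', D) & \text{if } u < \frac{r}{r + r'} \\ (r + r', D') & \text{otherwise} \end{cases}$$

with $u \sim \mathrm{Uniform}(0, 1)$ being a random number.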
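The sampling variant of addition can be sketched as follows, with pairs represented as plain `(weight, derivation)` tuples. The function names and the zero-weight guard are assumptions of this sketch. Folding the operation over a list of alternatives leaves each derivation selected with probability proportional to its weight.

```python
import random
from functools import reduce

def sample_add(x, y, rng=random):
    """Add the weights; keep x's derivation with probability r / (r + r')."""
    (r1, d1), (r2, d2) = x, y
    total = r1 + r2
    if total == 0.0:              # guard added here: both alternatives impossible
        return (0.0, d1)
    return (total, d1 if rng.random() < r1 / total else d2)

def sample_from(alternatives, rng=random):
    """Fold sample_add over weighted alternatives; returns (total weight, sample)."""
    return reduce(lambda x, y: sample_add(x, y, rng), alternatives)

rng = random.Random(0)
draws = [sample_from([(1.0, "A"), (3.0, "B")], rng)[1] for _ in range(10000)]
frac_b = draws.count("B") / len(draws)   # should be close to 3 / (1 + 3)
```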

Optimization
Having computed the objective (5) […] (5), were tuned on dev data (§4) for each language respectively.) We train the punctuation model for 30 epochs. The initial NOISYCHANNEL parameters ($\phi$) are drawn from $\mathcal{N}(0, 1)$, and the initial ATTACH parameters ($\theta$) are drawn from $\mathcal{N}(0, 1)$ (with one minor exception described in Appendix A).

Intrinsic Evaluation of the Model
Data. Throughout §§4-6, we will examine the punctuation model on a subset of the Universal Dependencies (UD) version 1.4 (Nivre et al., 2016), a collection of dependency treebanks across 47 languages with unified POS-tag and dependency label sets. Each treebank has designated training, development, and test portions. We experiment on Arabic, English, Chinese, Hindi, and Spanish (Table 2), languages with diverse punctuation vocabularies and punctuation interaction rules, not to mention script directionality. For each treebank, we use the tokenization provided by UD, and take the punctuation tokens (which may be multi-character, such as ...) to be the tokens with the PUNCT tag. We replace each straight double quotation mark " with either “ or ” as appropriate, and similarly for single quotation marks. 12 We split each non-punctuation token that ends in . (such as etc.) into a shorter non-punctuation token (etc) followed by a special punctuation token called the "abbreviation dot" (which is distinct from a period). We prepend a special punctuation mark ˆ to every sentence $\bar{x}$, which can serve to absorb an initial comma, for example. 13 We then replace each token with the special symbol UNK if its type appeared fewer than 5 times in the training portion. This gives the surface sentences.
To estimate the vocabulary $V$ of underlying punctemes, we simply collect all surface token sequences $x_i$ that appear at any slot in the training portion of the processed treebank. This is a generous estimate. Similarly, we estimate $W_d$ (§2.1) as all pairs $(l, r) \in V^2$ that flank any $d$ constituent.
Recall that our model generates surface punctuation given an unpunctuated dependency tree. We train it on each of the 5 languages independently. We evaluate on conditional perplexity, which will be low if the trained model successfully assigns a high probability to the actual surface punctuation in a held-out corpus of the same language.
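For concreteness, per-slot conditional perplexity can be computed as in the sketch below, assuming natural-log probabilities and counting $n + 1$ slots for a sentence of $n$ words. The function name and the toy numbers are illustrative.

```python
import math

def perplexity_per_slot(sentence_log_probs, sentence_lengths):
    """exp of the negative average log-probability per punctuation slot.

    sentence_log_probs[k]: natural log of p(x | T) for held-out sentence k.
    sentence_lengths[k]:   number of words n in that sentence (n + 1 slots).
    """
    total_slots = sum(n + 1 for n in sentence_lengths)
    return math.exp(-sum(sentence_log_probs) / total_slots)

# Toy check: if every slot's punctuation had probability 1/2,
# the perplexity per slot would be 2.
ppl = perplexity_per_slot([4 * math.log(0.5), 6 * math.log(0.5)], [3, 5])
```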
12 For en and en_esl, “ and ” are distinguished by language-specific part-of-speech tags. For the other 4 languages, we identify two " dependents of the same head word, replacing the left one with “ and the right one with ”.
13 For symmetry, we should also have added a final mark.

Table 2: Statistics of our datasets. "Treebank" is the UD treebank identifier, "#Token" is the number of tokens, "%Punct" is the percentage of punctuation tokens, "#Omit" is the small number of sentences containing non-leaf punctuation tokens (see footnote 19), and "#Type" is the number of punctuation types after preprocessing. (Recall from §4 that preprocessing distinguishes between left and right quotation mark types, and between abbreviation dot and period dot types.)

Baselines. We compare our model against three baselines to show that its complexity is necessary. Our first baseline is an ablation study that does not use latent underlying punctuation, but generates the surface punctuation directly from the tree. (To implement this, we fix the parameters of the noisy channel so that the surface punctuation equals the underlying with probability 1.) If our full model performs significantly better, it will demonstrate the importance of a distinct underlying layer. Our other two baselines ignore the tree structure, so if our full model performs significantly better, it will demonstrate that conditioning on explicit syntactic structure is useful. These baselines are based on previously published approaches that reduce the problem to tagging: Xu et al. (2016) use a BiLSTM-CRF tagger with bigram topology; Tilk and Alumäe (2016) use a BiGRU tagger with attention. In both approaches, the model is trained to tag each slot $i$ with the correct string $x_i \in V^*$ (possibly $\epsilon$ or ˆ). These are discriminative probabilistic models (in contrast to our generative one). Each gives a probability distribution over the taggings (conditioned on the unpunctuated sentence), so we can evaluate their perplexity. 14

Results. As shown in Table 3, our full model beats the baselines in perplexity in all 5 languages. Also, in 4 of 5 languages, allowing a trained NOISYCHANNEL (rather than the identity map) significantly improves the perplexity.

14 These methods learn word embeddings that optimize conditional log-likelihood on the punctuation restoration training data. They might do better if these embeddings were shared with other tasks, as multi-task learning might lead them to discover syntactic categories of words.

Table 3: Results of the conditional perplexity experiment (§4), reported as perplexity per punctuation slot, where an unpunctuated sentence of $n$ words has $n + 1$ slots. Column "Attn." is the BiGRU tagger with attention, and "CRF" stands for the BiLSTM-CRF tagger. "ATTACH" is the ablated version of our model where surface punctuation is directly attached to the nodes. Our full model "+NC" adds NOISYCHANNEL to transduce the attached punctuation into surface punctuation. DIR is the learned direction (§2.2) of our full model's noisy channel PFST: Left-to-right or Right-to-left. Our models are given oracle parse trees $T$. The best perplexity is boldfaced, along with all results that are not significantly worse (paired permutation test, $p < 0.05$).

Rules Learned from the Noisy Channel
We study our learned probability distribution over noisy channel rules ($ab \mapsto b$, $ab \mapsto a$, $ab \mapsto ab$, $ab \mapsto ba$) for English. The probability distributions corresponding to six of Nunberg's English rules are shown in Figure 3. By comparing the orange and blue bars, observe that the model trained on the en_cesl treebank learned different quotation rules from the one trained on the en treebank. This is because en_cesl follows British style, whereas en has American-style quote transposition. 15 We now focus on the model learned from the en treebank. Nunberg's rules are deterministic, and our noisy channel indeed learned low-entropy rules, in the sense that for an input $ab$ with underlying count $\ge 25$, 16 at least one of the possible outputs ($a$, $b$, $ab$ or $ba$) always has probability > 0.75. The one exception is ". ↦ ." for which the argmax output has probability ≈ 0.5, because writers do not apply this quote transposition rule consistently. As shown by the blue bars in Figure 3, the high-probability transduction rules are consistent with Nunberg's hand-crafted deterministic grammar in Table 1.
Our system has high precision when we look at the confident rules. Of the 24 learned edits with conditional probability > 0.75, Nunberg lists 20.
Our system also has good recall. Nunberg's hand-crafted schemata consider 16 punctuation types and generate a total of 192 edit rules, including the specimens in Table 1. That is, of the $16^2 = 256$ possible underlying punctuation bigrams $ab$, 3/4 are supposed to undergo absorption or transposition. Our method achieves fairly high recall, in the sense that when Nunberg proposes $ab \mapsto \gamma$, our learned $p(\gamma \mid ab)$ usually ranks highly among all probabilities of the form $p(\gamma' \mid ab)$. 75 of Nunberg's rules got rank 1, 48 got rank 2, and the remaining 69 got rank > 2. The mean reciprocal rank was 0.621. Recall is quite high when we restrict to those Nunberg rules $ab \mapsto \gamma$ for which our model is confident how to rewrite $ab$, in the sense that some $p(\gamma' \mid ab) > 0.5$. (This tends to eliminate rare $ab$: see footnote 5.) Of these 55 Nunberg rules, 38 rules got rank 1, 15 got rank 2, and only 2 got rank worse than 2. The mean reciprocal rank was 0.836.

¿What about Spanish? Spanish uses inverted question marks ¿ and exclamation marks ¡, which form symmetric pairs with the regular question marks and exclamation marks. If we try to extrapolate to Spanish from Nunberg's English formalization, the English mark most analogous to ¿ is (. Our learned noisy channel for Spanish (not graphed here) includes the high-probability rules ,¿ ↦ ,¿ and :¿ ↦ :¿ and ¿, ↦ ¿ which match Nunberg's treatment of ( in English.
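The rank counts above can be sanity-checked against the reported mean reciprocal ranks. The sketch assumes that each bigram has exactly four candidate outputs ($ab$, $a$, $b$, $ba$), so that a rank worse than 2 is either 3 or 4; this yields bounds on the MRR rather than its exact value. The function name is illustrative.

```python
from fractions import Fraction

def mrr_bounds(n_rank1, n_rank2, n_worse):
    """Lower and upper bounds on MRR when ranks worse than 2 are not itemized.

    ASSUMES four candidate outputs per bigram, so 'worse than 2' means 3 or 4.
    """
    total = n_rank1 + n_rank2 + n_worse
    known = Fraction(n_rank1) + Fraction(n_rank2, 2)
    lo = (known + Fraction(n_worse, 4)) / total   # all remaining at rank 4
    hi = (known + Fraction(n_worse, 3)) / total   # all remaining at rank 3
    return float(lo), float(hi)

n_rules = 16 ** 2 * 3 // 4                 # 3/4 of the 256 bigrams
lo_all, hi_all = mrr_bounds(75, 48, 69)    # all of Nunberg's rules
lo_conf, hi_conf = mrr_bounds(38, 15, 2)   # rules where the model is confident
```

Under that assumption, the reported values 0.621 and 0.836 both fall within the computed bounds (up to rounding).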

Attachment Model
What does our model learn about how dependency relations are marked by underlying punctuation?
,Earlier, Kerry said ," ... ,in fact, answer the question".
,Earlier, Kerry said ," ... ,in fact, answer the question."
root. ,advmod, ,"ccomp" ,nmod,

The above example 17 illustrates the use of specific puncteme pairs to set off the advmod, ccomp, and nmod relations. Notice that said takes a complement (ccomp) that is symmetrically quoted but also left delimited by a comma, which is indeed how direct speech is punctuated in English. This example also illustrates quotation transposition. The top five relations that are most likely to generate symmetric punctemes and their top $(l, r)$ pairs are shown in Table 4.

The above example 18 shows how our model handles commas in conjunctions of 2 or more phrases. UD format dictates that each conjunct after the first is attached by the conj relation. As shown above, each such conjunct is surrounded by underlying commas (via the N.,.,.conj feature from Appendix A), except for the one that bears the conjunction and (via an even stronger weight on the C.ε.ε.→conj.cc feature). Our learned feature weights indeed yield $p(l = \epsilon, r = \epsilon) > 0.5$ for the final conjunct in this example. Some writers omit the "Oxford comma" before the conjunction: this style can be achieved simply by changing "surrounded" to "preceded" (that is, changing the N feature to N.,.ε.conj).
get an honorable discharge does not, in fact, answer that question."

Punctuation Restoration
In this task, we are given a depunctuated sentence $\bar{d}$ 19 and must restore its (surface) punctuation. Our model supposes that the observed punctuated sentence $\bar{x}$ would have arisen via the generative process (1). Thus, we try to find $T$, $T'$, and $\bar{x}$ that are consistent with $\bar{d}$ (a partial observation of $\bar{x}$).

The first step is to reconstruct $T$ from $\bar{d}$. This initial parsing step is intended to choose the $T$ that maximizes $p_{\mathrm{syn}}(T \mid \bar{d})$. 20 This step depends only on $p_{\mathrm{syn}}$ and not on our punctuation model ($p_\theta$, $p_\phi$). In practice, we choose $T$ via a dependency parser that has been trained on an unpunctuated treebank with examples of the form $(\bar{d}, T)$. 21

Equation (2) now defines a distribution over $(T', x)$ given this $T$. To obtain a single prediction for $x$, we adopt the minimum Bayes risk (MBR) approach of choosing surface punctuation $\hat{x}$ that minimizes the expected loss with respect to the unknown truth $x^*$. Our loss function is the total edit distance over all slots (where edits operate on punctuation tokens). Finding $\hat{x}$ exactly would be intractable, so we use a sampling-based approximation and draw $m = 1000$ samples from the posterior distribution over $(T', x)$. We then define

$$\hat{x} = \operatorname*{argmin}_{x \in S(T)} \sum_{x^* \in S(T)} \hat{p}(x^* \mid T) \cdot \mathrm{loss}(x, x^*)$$

where $S(T)$ is the set of unique $x$ values in the sample and $\hat{p}$ is the empirical distribution given by the sample. This can be evaluated in $O(m^2)$ time.
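The sampling-based MBR step can be sketched as follows. The sketch assumes that each sampled punctuation is a tuple of per-slot token tuples and that "edit distance" means token-level Levenshtein distance; both the representation and the function names are assumptions made for illustration.

```python
from collections import Counter

def edit_distance(a, b):
    """Levenshtein distance between two token sequences (assumed loss)."""
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        cur = [i]
        for j, tb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ta
                           cur[j - 1] + 1,             # insert tb
                           prev[j - 1] + (ta != tb)))  # substitute or match
        prev = cur
    return prev[-1]

def loss(x, y):
    """Total edit distance over all slots."""
    return sum(edit_distance(xi, yi) for xi, yi in zip(x, y))

def mbr_decode(samples):
    """Pick the sampled x that minimizes expected loss under the empirical distribution."""
    counts = Counter(samples)      # unique x values with their sample counts
    m = len(samples)
    def risk(x):
        return sum(c / m * loss(x, y) for y, c in counts.items())
    return min(counts, key=risk)

# Toy sample for a one-word sentence (2 slots).
A = ((), (".",))
B = ((), ("!",))
C = ((",",), (".",))
x_hat = mbr_decode([A, A, B, C])
```

Grouping identical samples with `Counter` keeps the double loop over unique values only, so the cost is at most quadratic in the number of samples.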
We evaluate on Arabic, English, Chinese, Hindi, and Spanish. For each language, we train both the parser and the punctuation model on the training split of that UD treebank (§4), and evaluate on held-out data. We compare to the BiLSTM-CRF baseline in §4 (Xu et al., 2016).²² We also compare to a "trivial" deterministic baseline, which merely places a period at the end of the sentence (or a "।" in the case of Hindi) and adds no other punctuation. Because most slots do not in fact have punctuation, the trivial baseline already does very well; to improve on it, we must fix its errors without introducing new ones.
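To make the trivial baseline and the per-slot evaluation concrete, here is a small Python sketch. It assumes that a sentence of n words has n + 1 slots (indexed 0 through n) and that the per-slot edit distance is micro-averaged over the whole corpus; both the representation and the helper names (`trivial_baseline`, `average_edit_distance`) are illustrative assumptions rather than the paper's code.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of punctuation tokens."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        cur = [i]
        for j, tok_b in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (tok_a != tok_b)))
        prev = cur
    return prev[-1]

def trivial_baseline(num_words, final_mark="."):
    """Empty punctuation in every slot except a single final mark in the last slot."""
    slots = [()] * (num_words + 1)   # assumed convention: n words give n + 1 slots
    slots[-1] = (final_mark,)
    return slots

def average_edit_distance(predicted, gold):
    """Total slot-level edit distance divided by the total number of slots."""
    total, n_slots = 0, 0
    for pred_sent, gold_sent in zip(predicted, gold):
        total += sum(edit_distance(p, g) for p, g in zip(pred_sent, gold_sent))
        n_slots += len(gold_sent)
    return total / n_slots

# Gold punctuation for the 3-word sentence "Earlier, Kerry spoke." (toy example).
gold = [[(), (",",), (), (".",)]]
pred = [trivial_baseline(3)]
aed = average_edit_distance(pred, gold)   # one missed comma over four slots
```

In this toy case the baseline misses only the comma, which illustrates why it is hard to beat: most slots are empty, so most of its predictions are already correct.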
Our final comparison on test data is shown in the table in Figure 4. On all 5 languages, our method beats (usually significantly) its 3 competitors: the trivial deterministic baseline, the BiLSTM-CRF, and the ablated version of our model (ATTACH) that omits the noisy channel. Of course, the success of our method depends on the quality of the parse trees T (which is particularly low for Chinese and Arabic). The graph in Figure 4 explores this relationship, by evaluating (on dev data) with noisier trees obtained from parsers that were variously trained on only the first 10%, 20%, . . . of the training data. On all 5 languages, provided that the trees are at least 75% correct, our punctuation model beats both the trivial baseline and the BiLSTM-CRF (which do not use trees). It also beats the ATTACH ablation baseline at all levels of tree accuracy (these curves are omitted from the graph to avoid clutter). In all languages, better parses give better performance, and gold trees yield the best results.

Punctuation Correction
Our next goal is to correct punctuation errors in a learner corpus. Each sentence is drawn from the Cambridge Learner Corpus treebanks, which provide original (en_esl) and corrected (en_cesl) sentences. All kinds of errors are corrected, such as syntax errors, but we use only the 30% of sentences whose depunctuated trees $T$ are isomorphic between en_esl and en_cesl. These en_cesl trees may correct word and/or punctuation errors in en_esl, as we wish to do automatically.
²² We copied their architecture exactly but re-tuned the hyperparameters on our data. We also tried tripling the amount of training data by adding unannotated sentences (provided along with the original annotated sentences by Ginter et al. (2017)), taking advantage of the fact that the BiLSTM-CRF does not require its training sentences to be annotated with trees. However, this actually hurt performance slightly, perhaps because the additional sentences were out-of-domain. We also tried the BiGRU-with-attention architecture of Tilk and Alumäe (2016), but it was also weaker than the BiLSTM-CRF (just as in Table 3). We omit all these results from …
Figure 4: Edit distance per slot (which we call average edit distance, or AED) for each of the 5 corpora. Lower is better. The table gives the final AED on the test data. Its first 3 columns show the baseline methods just as in Table 3: the trivial deterministic method, the BiLSTM-CRF, and the ATTACH ablation baseline that attaches the surface punctuation directly to the tree. Column 4 is our method that incorporates a noisy channel, and column 5 (in gray) is our method using oracle (gold) trees. We boldface the best non-oracle result as well as all that are not significantly worse (paired permutation test, p < 0.05). The curves show how our method's AED (on dev data) varies with the labeled attachment score (LAS) of the trees, where the marker at x = 100 uses the oracle (gold) trees, the marker at x < 100 uses trees from our parser trained on 100% of the training data, and the markers at x ≪ 100 use increasingly worse parsers. The two markers at the right of the graph show the AED of the trivial deterministic baseline and the BiLSTM-CRF baseline, which do not use trees.
We assume that an English learner can make mistakes in both the attachment and the noisy channel steps. A common attachment mistake is the failure to surround a non-restrictive relative clause with commas. In the noisy channel step, mistakes in quote transposition are common.
Correction model. Based on the assumption about the two error sources, we develop a discriminative model for this task. Let $\bar{x}_e$ denote the full input sentence, and let $x_e$ and $x_c$ denote the input (possibly errorful) and output (corrected) punctuation sequences. We model
$$p(x_c \mid \bar{x}_e) = \sum_{T} \sum_{T'_c} p_{\text{syn}}(T \mid \bar{x}_e) \cdot p_\theta(T'_c \mid T, x_e) \cdot p_\phi(x_c \mid T'_c)$$
Here $T$ is the depunctuated parse tree, $T'_c$ is the corrected underlying tree, $T'_e$ is the error underlying tree, and we assume
$$p_\theta(T'_c \mid T, x_e) = \sum_{T'_e} p(T'_e \mid T, x_e) \cdot p(T'_c \mid T'_e)$$
In practice we use a 1-best pipeline rather than summing. Our first step is to reconstruct $T$ from the error sentence $\bar{x}_e$. We choose the $T$ that maximizes $p_{\text{syn}}(T \mid \bar{x}_e)$ from a dependency parser trained on en_esl treebank examples $(\bar{x}_e, T)$. The second step is to reconstruct $T'_e$ based on our punctuation model trained on en_esl. We choose the $T'_e$ that maximizes $p(T'_e \mid T, x_e)$. We then reconstruct $T'_c$ by
$$p(T'_c \mid T'_e) = \prod_{w_e \in T'_e} p(l, r \mid w_e)$$
where $w_e$ is a node in $T'_e$, and $p(l, r \mid w_e)$ is a similar log-linear model to equation (4) with additional features (Appendix C⁴) which look at $w_e$.
Finally, we reconstruct $x_c$ based on the noisy channel $p_\phi(x_c \mid T'_c)$ in §2.2. During training, $\phi$ is regularized to be close to the noisy channel parameters in the punctuation model trained on en_cesl.
We use the same MBR decoder as in §6.1 to choose the best action. We evaluate using AED as in §6.1. As a second metric, we use the script from the CoNLL 2014 Shared Task on Grammatical Error Correction (Ng et al., 2014): it computes the $F_{0.5}$-measure of the set of edits found by the system, relative to the true set of edits.
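As a reminder of how an F-measure with β = 0.5 weights precision over recall, the following Python sketch computes it from two sets of edits. This is only the standard F_β formula applied to already-extracted edit sets; it is not the CoNLL 2014 scoring script, which also has to determine the system's edits from the output text. The edit representation (slot index, original tokens, corrected tokens) and the function name `f_beta` are assumptions made for illustration.

```python
def f_beta(system_edits, gold_edits, beta=0.5):
    """F_beta of the system's edit set relative to the gold edit set."""
    system_edits, gold_edits = set(system_edits), set(gold_edits)
    true_pos = len(system_edits & gold_edits)
    # Convention assumed here: an empty edit set counts as perfect precision/recall.
    precision = true_pos / len(system_edits) if system_edits else 1.0
    recall = true_pos / len(gold_edits) if gold_edits else 1.0
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical edits: (slot index, original tokens, corrected tokens).
gold = {(2, (), (",",)), (7, (",",), ())}
cautious = {(2, (), (",",))}                        # precision 1.0, recall 0.5
eager = {(2, (), (",",)), (7, (",",), ()),
         (4, (), (",",)), (5, (), (",",))}          # precision 0.5, recall 1.0

score_cautious = f_beta(cautious, gold)
score_eager = f_beta(eager, gold)
```

With β = 0.5, the cautious system scores higher than the eager one even though the two have mirrored precision and recall, which is the intended bias of the metric toward avoiding spurious corrections.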
As shown in Table 5, our method achieves better performance than the punctuation restoration baselines (which ignore input punctuation). On the other hand, it is soundly beaten by a new BiLSTM-CRF that we trained specifically for the task of punctuation correction. This is the same as the BiLSTM-CRF in the previous section, except that the BiLSTM now reads a punctuated input sentence (with possibly erroneous punctuation). To be precise, at step $0 \le i \le n$, the BiLSTM reads a concatenation of the embedding of word $i$ (or BOS if $i = 0$) with an embedding of the punctuation token sequence $x_i$. The BiLSTM-CRF wins because it is a discriminative model tailored for this task: the BiLSTM can extract arbitrary contextual features of slot $i$ that are correlated with whether $x_i$ is correct in context.

Sentential Rephrasing
We suspect that syntactic transformations on a sentence should often preserve the underlying punctuation attached to its tree. The surface punctuation can then be regenerated from the transformed tree. Such transformations include edits that are suggested by a writing assistance tool (Heidorn, 2000), or subtree deletions in compressive summarization (Knight and Marcu, 2002).
Table 5: AED and $F_{0.5}$ results on the test split of English-ESL data. Lower AED is better; higher $F_{0.5}$ is better. The first three columns (markers correspond to Figure 4) are the punctuation restoration baselines, which ignore the input punctuation. The fourth and fifth columns are our correction models, which use parsed and gold trees. The final column is the BiLSTM-CRF model tailored for the punctuation correction task.
For our experiment, we evaluate an interesting case of syntactic transformation. Wang and Eisner (2016) consider a systematic rephrasing procedure by rearranging the order of dependent subtrees within a UD treebank, in order to synthesize new languages with different word order that can then be used to help train multi-lingual systems (i.e., data augmentation with synthetic data).
As Wang and Eisner acknowledge (2016, footnote 9), their permutations treat surface punctuation tokens like ordinary words, which can result in synthetic sentences whose punctuation is quite unlike that of real languages.
In our experiment, we use Wang and Eisner's (2016) "self-permutation" setting, where the dependents of each noun and verb are stochastically reordered, but according to a dependent ordering model that has been trained on the same language. For example, rephrasing an English sentence …
Table 6: Perplexity (evaluated on the train split to avoid evaluating generalization) of a trigram language model trained (with add-0.001 smoothing) on different versions of rephrased training sentences. "Punctuation" only evaluates perplexity on the trigrams that have punctuation. "All" evaluates on all the trigrams. "Base" permutes all surface dependents including punctuation (Wang and Eisner, 2016). "Full" is our full approach: recover underlying punctuation, permute remaining dependents, regenerate surface punctuation. "Half" is like "Full" but it permutes the non-punctuation tokens identically to "Base." The permutation model is trained on surface trees or recovered underlying trees $T'$, respectively. In each 3-way comparison, we boldface the best result (always significant under a paired permutation test over per-sentence log-probabilities, p < 0.05).
We leave the handling of capitalization to future work. We test the naturalness of the permuted sentences by asking how well a word trigram language model trained on them could predict the original sentences.²³ As shown in Table 6, our permutation approach reduces the perplexity over the baseline on 4 of the 5 languages, often dramatically.
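The perplexity comparison can be sketched in Python as follows: train a trigram model with add-λ smoothing (λ = 0.001) on one version of the rephrased sentences, then measure perplexity on the original sentences, optionally restricted to trigrams that contain punctuation. The padding symbols, the vocabulary definition, the punctuation set, and all function names are assumptions of this sketch; it also assumes every evaluated word occurs in the training sentences, which holds when the training sentences are permutations of the evaluated ones.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"

def trigrams(tokens):
    """All trigrams of a sentence padded with two BOS symbols and one EOS symbol."""
    padded = [BOS, BOS] + list(tokens) + [EOS]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]

def train(sentences):
    """Collect trigram counts, context (bigram) counts, and the vocabulary."""
    tri, ctx, vocab = Counter(), Counter(), {EOS}
    for sent in sentences:
        vocab.update(sent)
        for g in trigrams(sent):
            tri[g] += 1
            ctx[g[:2]] += 1
    return tri, ctx, vocab

def perplexity(model, sentences, lam=0.001, keep=lambda g: True):
    """Perplexity over the trigrams selected by `keep`, with add-lam smoothing."""
    tri, ctx, vocab = model
    log_prob, n = 0.0, 0
    for sent in sentences:
        for g in trigrams(sent):
            if keep(g):
                p = (tri[g] + lam) / (ctx[g[:2]] + lam * len(vocab))
                log_prob += math.log(p)
                n += 1
    return math.exp(-log_prob / n)

PUNCT = {",", "."}   # toy punctuation inventory (assumed)

def has_punct(g):
    return any(tok in PUNCT for tok in g)

original = [["Earlier", ",", "Kerry", "spoke", "."]]
natural = [["Earlier", ",", "Kerry", "spoke", "."]]     # punctuation kept in place
scrambled = [["Earlier", "Kerry", ",", "spoke", "."]]   # comma moved like an ordinary word

ppl_natural = perplexity(train(natural), original)
ppl_scrambled = perplexity(train(scrambled), original)
ppl_natural_punct = perplexity(train(natural), original, keep=has_punct)
ppl_scrambled_punct = perplexity(train(scrambled), original, keep=has_punct)
```

Passing `keep=has_punct` corresponds to the "Punctuation" rows of Table 6, and the default corresponds to the "All" rows.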

Related Work
Punctuation can aid syntactic analysis, since it signals phrase boundaries and sentence structure. Briscoe (1994) and White and Rajkumar (2008) parse punctuated sentences using hand-crafted constraint-based grammars that implement Nunberg's approach in a declarative way. These grammars treat surface punctuation symbols as ordinary words, but annotate the nonterminal categories so as to effectively keep track of the underlying punctuation. This is tantamount to crafting a grammar for underlyingly punctuated sentences and composing it with a finite-state noisy channel.
²³ So the two approaches to permutation yield different training data, but are compared fairly on the same test data.
The parser of Ma et al. (2014) takes a different approach and treats punctuation marks as features of their neighboring words. Zhang et al. (2013) use a generative model for punctuated sentences, letting them restore punctuation marks during transition-based parsing of unpunctuated sentences. Li et al. (2005) use punctuation marks to segment a sentence: this "divide and rule" strategy reduces ambiguity in parsing of long Chinese sentences. Punctuation can similarly be used to constrain syntactic structure during grammar induction (Spitkovsky et al., 2011).
Punctuation restoration (§6.1) is useful for transcribing text from unpunctuated speech. The task is usually treated by tagging each slot with zero or more punctuation tokens, using a traditional sequence labeling method: conditional random fields (Lui and Wang, 2013; Lu and Ng, 2010), recurrent neural networks (Tilk and Alumäe, 2016), or transition-based systems (Ballesteros and Wanner, 2016).

Conclusion and Future Work
We have provided a new computational approach to modeling punctuation. In our model, syntactic constituents stochastically generate latent underlying left and right punctemes. Surface punctuation marks are not directly attached to the syntax tree, but are generated from sequences of adjacent punctemes by a (stochastic) finite-state string rewriting process. Our model is inspired by Nunberg's (1990) formal grammar for English punctuation, but is probabilistic and trainable. We give exact algorithms for training and inference.
We trained Nunberg-like models for 5 languages and L2 English. We compared the English model to Nunberg's, and showed how the trained models can be used across languages for punctuation restoration, correction, and adjustment.
In the future, we would like to study the usefulness of the recovered underlying trees on tasks such as syntactically sensitive sentiment analysis (Tai et al., 2015), machine translation (Cowan et al., 2006), relation extraction (Culotta and Sorensen, 2004), and coreference resolution (Kong et al., 2010). We would also like to investigate how underlying punctuation could aid parsing. For discriminative parsing, features for scoring the tree could refer to the underlying punctuation, not just the surface punctuation. For generative parsing (§3), we could follow the scheme in equation (1). For example, the $p_{\text{syn}}$ factor in equation (1) might be a standard recurrent neural network grammar (RNNG) (Dyer et al., 2016); when a subtree of $T$ is completed by the REDUCE operation of $p_{\text{syn}}$, the punctuation-augmented RNNG (1) would stochastically attach subtree-external left and right punctemes with $p_\theta$ and transduce the subtree-internal slots with $p_\phi$.
In the future, we are also interested in enriching the $T'$ representation and making it differ more from $T$, to underlyingly account for other phenomena in $T$ such as capitalization, spacing, morphology, and non-projectivity (via reordering).