## Abstract

Treebanks traditionally treat punctuation marks as ordinary words, but linguists have suggested that a tree’s “true” punctuation marks are not observed (Nunberg, 1990). These latent “underlying” marks serve to delimit or separate constituents in the syntax tree. When the tree’s yield is rendered as a written sentence, a string rewriting mechanism transduces the underlying marks into “surface” marks, which are part of the observed (surface) string but should not be regarded as part of the tree. We formalize this idea in a generative model of punctuation that admits efficient dynamic programming. We train it without observing the underlying marks, by locally maximizing the incomplete data likelihood (similarly to the EM algorithm). When we use the trained model to reconstruct the tree’s underlying punctuation, the results appear plausible across 5 languages, and in particular are consistent with Nunberg’s analysis of English. We show that our generative model can be used to beat baselines on punctuation restoration. Also, our reconstruction of a sentence’s underlying punctuation lets us appropriately render the surface punctuation (via our trained underlying-to-surface mechanism) when we syntactically transform the sentence.

## 1 Introduction

Punctuation enriches the expressiveness of written language. When converting from spoken to written language, punctuation indicates pauses or pitches; expresses propositional attitude; and is conventionally associated with certain syntactic constructions such as apposition, parenthesis, quotation, and conjunction.

In this paper, we present a latent-variable model of punctuation usage, inspired by the rule-based approach to English punctuation of Nunberg (1990). Training our model on English data learns rules that are consistent with Nunberg’s hand-crafted rules. Our system is automatic, so we use it to obtain rules for Arabic, Chinese, Spanish, and Hindi as well.

Moreover, our rules are stochastic, which allows us to reason probabilistically about ambiguous or missing punctuation. Across the 5 languages, our model predicts surface punctuation better than baselines, as measured both by perplexity (§4) and by accuracy on a punctuation restoration task (§6.1). We also use our model to correct the punctuation of non-native writers of English (§6.2), and to maintain natural punctuation style when syntactically transforming English sentences (§6.3). In principle, our model could also be used within a generative parser, allowing the parser to evaluate whether a candidate tree truly explains the punctuation observed in the input sentence (§8).

### Punctuation is interesting

In *The Linguistics of Punctuation*, Nunberg (1990) argues that punctuation (in English) is more than a visual counterpart of spoken-language prosody, but forms a linguistic system that involves “interactions of point indicators (i.e. commas, semicolons, colons, periods and dashes).” He proposes that much as in phonology (Chomsky and Halle, 1968), a grammar generates **underlying** punctuation which then transforms into the observed **surface** punctuation.

Consider generating a sentence from a syntactic grammar as follows:

Hail the king [, Arthur Pendragon ,][, who wields [ “ Excalibur ” ] ,] .

Although the full tree is not depicted here, some of the constituents are indicated with brackets. In this underlying generated tree, each appositive NP is surrounded by commas. On the surface, however, the two adjacent commas after Pendragon will now be collapsed into one, and the final comma will be absorbed into the adjacent period. Furthermore, in American English, the typographic convention is to move the final punctuation inside the quotation marks. Thus a reader sees only this modified surface form of the sentence:

Hail the king, Arthur Pendragon, who wields “Excalibur.”

Note that these modifications are *string* transformations that do not see or change the tree. The resulting surface punctuation marks may be clues to the parse tree, but (contrary to NLP convention) they should not be included as nodes in the parse tree. Only the underlying marks play that role.

### Punctuation is meaningful

Pang et al. (2002) use question and exclamation marks as clues to sentiment. Similarly, quotation marks may be used to mark titles, quotations, reported speech, or dubious terminology (University of Chicago, 2010). Because of examples like this, methods for determining the similarity or meaning of syntax trees, such as a tree kernel (Agarwal et al., 2011) or a recursive neural network (Tai et al., 2015), should ideally be able to consider where the underlying punctuation marks attach.

### Punctuation is helpful

Surface punctuation remains correlated with syntactic phrase structure. NLP systems for generating or editing text must be able to deploy surface punctuation as human writers do. Parsers and grammar induction systems benefit from the presence of surface punctuation marks (Jones, 1994; Spitkovsky et al., 2011). It is plausible that they could do better with a linguistically informed model that explains exactly *why* the surface punctuation appears where it does. Patterns of punctuation usage can also help identify the writer’s native language (Markov et al., 2018).

### Punctuation is neglected

Work on syntax and parsing tends to treat punctuation as an afterthought rather than a phenomenon governed by its own linguistic principles. Treebank annotation guidelines for punctuation tend to adopt simple heuristics like “attach to the highest possible node that preserves projectivity” (Bies et al., 1995; Nivre et al., 2018).^{1} Many dependency parsing works exclude punctuation from evaluation (Nivre et al., 2007b; Koo and Collins, 2010; Chen and Manning, 2014; Lei et al., 2014; Kiperwasser and Goldberg, 2016), although some others retain punctuation (Nivre et al., 2007a; Goldberg and Elhadad, 2010; Dozat and Manning, 2017).

In tasks such as word embedding induction (Mikolov et al., 2013; Pennington et al., 2014) and machine translation (Zens et al., 2002), punctuation marks are usually either removed or treated as ordinary words (Řehůřek and Sojka, 2010).

Yet to us, building a parse tree on a *surface* sentence seems as inappropriate as morphologically segmenting a *surface* word. In both cases, one should instead analyze the latent *underlying* form, jointly with recovering that form. For example, the proper segmentation of English hoping is not hop-ing but hope-ing (with underlying e), and the proper segmentation of stopping is neither stopp-ing nor stop-ping but stop-ing (with only one underlying p). Cotterell et al. (2015, 2016) get this right for morphology. We attempt to do the same for punctuation.

## 2 Formal Model

Our model generates a surface sentence $\bar{x}$ in three steps, so that

$$p(\bar{x}) = \sum_{T,T'} p_{\text{syn}}(T) \cdot p_\theta(T' \mid T) \cdot p_\phi(\bar{x} \mid \bar{\mathbf{u}}(T')) \qquad (1)$$

First, an *unpunctuated* dependency tree *T* is stochastically generated by some recursive process *p*_{syn} (e.g., Eisner, 1996, Model C).^{2} Second, each constituent (i.e., dependency subtree) sprouts optional underlying punctuation at its left and right edges, according to a probability distribution *p*_{θ} that depends on the constituent’s syntactic role (e.g., dobj for “direct object”). This *punctuated* tree *T′* yields the underlying string **ū** = **ū**(*T′*), which is edited by a finite-state noisy channel *p*_{ϕ} to arrive at the surface sentence $\bar{x}$.

This third step may alter the sequence of punctuation tokens at each **slot** between words: for example, in §1, collapsing the double comma , , between Pendragon and who. **u** and **x** denote just the punctuation at the slots of **ū** and $\bar{x}$ respectively, with *u*_{i} and *x*_{i} denoting the punctuation token sequences at the *i*^{th} slot. Thus, the transformation at the *i*^{th} slot is *u*_{i}↦*x*_{i}.

Since this model is generative, we could train it without any supervision to explain the observed surface string $\bar{x}$: maximize the likelihood $p(\bar{x})$ in (1), marginalizing out the possible *T*, *T′* values.

In the present paper, however, we exploit known *T* values (as observed in the “depunctuated” version of a treebank). Because *T* is observed, we can jointly train *θ*, *ϕ* to maximize just

$$p(\mathbf{x} \mid T) = \sum_{T'} p_\theta(T' \mid T) \cdot p_\phi(\mathbf{x} \mid \mathbf{u}(T')) \qquad (2)$$

That is, the *p*_{syn} model that generated *T* becomes irrelevant, but we still try to predict what surface punctuation will be added to *T*. We still marginalize over the underlying punctuation marks **u**. These are *never observed*, but they must explain the surface punctuation marks **x** (§2.2), and they must be explained in turn by the syntax tree *T* (§2.1). The trained generative model then lets us restore or correct punctuation in new trees *T* (§6).

### 2.1 Generating Underlying Punctuation

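The Attach model characterizes the probability of an underlying punctuated tree *T′* given its corresponding unpunctuated tree *T*, which is given by

$$p_\theta(T' \mid T) = \prod_{w \in T} p_\theta(l_w, r_w \mid w) \qquad (3)$$

where $l_w, r_w \in \mathcal{V}$ are the left and right **punctemes** that *T′* attaches to the tree node *w*. Each puncteme (Krahn, 2014) in the finite set $\mathcal{V}$ is a string of 0 or more underlying punctuation **tokens**.^{3} The probability *p*_{θ}(*l*, *r* ∣ *w*) is given by a log-linear model

$$p_\theta(l, r \mid w) \propto \begin{cases} \exp\big(\theta^\top f(l,r,w)\big) & \text{if } (l,r) \in \mathcal{W}_{d(w)} \\ 0 & \text{otherwise} \end{cases} \qquad (4)$$

where $\mathcal{W}_d \subseteq \mathcal{V}^2$ gives the possible puncteme pairs for a node *w* that has dependency relation *d* = *d*(*w*) to its parent. $\mathcal{V}$ and $\mathcal{W}_d$ are estimated heuristically from the tokenized surface data (§4). $f(l,r,w)$ is a sparse binary feature vector, and *θ* is the corresponding parameter vector of feature weights. The feature templates in Appendix A^{4} consider the symmetry between *l* and *r*, and their compatibility with (a) the POS tag of *w*’s head word, (b) the dependency paths connecting *w* to its children and the root of *T*, (c) the POS tags of the words flanking the slots containing *l* and *r*, (d) surface punctuation already added to *w*’s subconstituents.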
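As a rough illustration of the log-linear form described above, the following sketch normalizes exponentiated feature scores over the puncteme pairs allowed for a node. The feature function `toy_features`, the weights in `TOY_THETA`, and the pair list `TOY_ALLOWED` are hypothetical stand-ins chosen for this example; they are not the feature templates of Appendix A or any learned values.

```python
import math

def attach_distribution(theta, features, allowed_pairs, w):
    """Return {(l, r): p(l, r | w)} over the allowed puncteme pairs for node w."""
    scores = {}
    for (l, r) in allowed_pairs:
        active = features(l, r, w)  # names of the binary features that fire
        scores[(l, r)] = sum(theta.get(name, 0.0) for name in active)
    top = max(scores.values())      # subtract the max for numerical stability
    exp_scores = {pair: math.exp(s - top) for pair, s in scores.items()}
    z = sum(exp_scores.values())    # normalizer over allowed pairs only
    return {pair: e / z for pair, e in exp_scores.items()}

def toy_features(l, r, w):
    """Hypothetical feature templates (an assumption for illustration)."""
    feats = [f"pair={l}|{r}|rel={w['rel']}"]
    if l == r:
        feats.append("identical")
    return feats

# Hypothetical weights and allowed pairs for an appositive node.
TOY_THETA = {"identical": 1.0, "pair=,|,|rel=appos": 2.0}
TOY_ALLOWED = [("", ""), (",", ","), ("(", ")")]
```

Pairs outside the allowed set never enter the normalizer, which corresponds to giving them probability 0.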

### 2.2 From Underlying to Surface

From the tree *T′*, we can read off the sequence of underlying punctuation tokens *u*_{i} at each slot *i* between words. Namely, *u*_{i} concatenates the right punctemes of all constituents ending at *i* with the left punctemes of all constituents starting at *i* (as illustrated by the examples in §1 and Figure 1). The NoisyChannel model then transduces *u*_{i} to a surface token sequence *x*_{i}, for each *i* = 0, …, *n* independently (where *n* is the sentence length).
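The read-off of slot contents can be sketched as below. The span encoding (a constituent covering words `start..end-1` begins at slot `start` and ends at slot `end`) and the nesting order within a slot (inner right punctemes first, outer left punctemes first) are assumptions made for this illustration; the text above only states that right punctemes precede left punctemes at a slot.

```python
def underlying_slots(n_words, constituents):
    """Compute the underlying token sequence u_i for each slot i = 0..n_words.

    constituents: list of (start, end, left, right), where left and right
    are punctemes given as lists of tokens.
    """
    slots = [[] for _ in range(n_words + 1)]
    # Right punctemes of constituents ending at a slot, innermost first
    # (the constituent that starts latest is the most deeply nested).
    for start, end, _, right in sorted(constituents, key=lambda c: -c[0]):
        slots[end].extend(right)
    # Left punctemes of constituents starting at a slot, outermost first
    # (the constituent that ends latest is the least deeply nested).
    for start, end, left, _ in sorted(constituents, key=lambda c: -c[1]):
        slots[start].extend(left)
    return slots

# The example from §1: Hail the king Arthur Pendragon who wields Excalibur
EXAMPLE = [
    (0, 8, [], ["."]),      # whole sentence (assumed to carry the period)
    (3, 5, [","], [","]),   # Arthur Pendragon
    (5, 8, [","], [","]),   # who wields Excalibur
    (7, 8, ["“"], ["”"]),   # Excalibur
]
```

Under these assumptions, slot 5 receives the double comma and the final slot receives the sequence ” , . that the noisy channel must then rewrite.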

#### Nunberg’s formalism

Much like Halle’s (1968) phonological grammar of English, Nunberg’s 1990 descriptive English punctuation grammar (Table 1) can be viewed computationally as a priority string rewriting system, or **Markov algorithm** (Markov, 1960; Caracciolo di Forino, 1968). The system begins with a token string *u*. At each step it selects the highest-priority local rewrite rule that can apply, and applies it as far left as possible. When no more rules can apply, the final state of the string is returned as *x*.

Table 1: Rules from Nunberg’s (1990) English punctuation grammar.

| # | Rule type | Rewrites |
|---|---|---|
| 1 | Point Absorption | `,,` ↦ `,` &nbsp; `,.` ↦ `.` &nbsp; `-,` ↦ `-` &nbsp; `-;` ↦ `;` &nbsp; `;.` ↦ `.` |
| 2 | Quote Transposition | `",` ↦ `,"` &nbsp; `".` ↦ `."` |
| 3 | Period Absorption | `.?` ↦ `?` &nbsp; `.!` ↦ `!` &nbsp; `abbv.` ↦ `abbv` |
| 4 | Bracket Absorptions | `,)` ↦ `)` &nbsp; `-)` ↦ `)` &nbsp; `(,` ↦ `(` &nbsp; `,"` ↦ `"` &nbsp; `“,` ↦ `“` |
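The priority rewriting procedure described above can be sketched in a few lines. This is a minimal illustration, not Nunberg’s full grammar: `RULES` holds only a subset of the Table 1 rules (point absorption, quote transposition, and two period absorptions), listed in priority order, and the function names are our own.

```python
# Each rule maps a tuple of adjacent tokens to its replacement.
RULES = [
    # 1. Point absorption
    ((",", ","), (",",)),
    ((",", "."), (".",)),
    (("-", ","), ("-",)),
    (("-", ";"), (";",)),
    ((";", "."), (".",)),
    # 2. Quote transposition
    (('"', ","), (",", '"')),
    (('"', "."), (".", '"')),
    # 3. Period absorption
    ((".", "?"), ("?",)),
    ((".", "!"), ("!",)),
]

def find_leftmost(tokens, pattern):
    """Index of the leftmost occurrence of pattern in tokens, or None."""
    k = len(pattern)
    for i in range(len(tokens) - k + 1):
        if tuple(tokens[i:i + k]) == pattern:
            return i
    return None

def markov_rewrite(tokens, rules=RULES, max_steps=1000):
    """Repeatedly apply the highest-priority applicable rule at its
    leftmost match until no rule applies."""
    s = list(tokens)
    for _ in range(max_steps):
        for lhs, rhs in rules:
            pos = find_leftmost(s, lhs)
            if pos is not None:
                s[pos:pos + len(lhs)] = rhs
                break               # restart from the highest priority
        else:
            return s                # no rule applied: s is the surface form
    raise RuntimeError("rewriting did not terminate within max_steps")
```

On the final slot of the §1 example, the tokens `"`, `,`, `.` first lose the comma by point absorption and then undergo quote transposition, leaving the period inside the quotation mark.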

#### Simplifying the formalism

Markov algorithms are Turing complete. Fortunately, Johnson (1972) noted that in practice, phonological *u* ↦ *x* maps described in this formalism can usually be implemented with finite-state transducers (FSTs).

For computational simplicity, we will formulate our punctuation model as a probabilistic FST (PFST)—a locally normalized left-to-right rewrite model (Cotterell et al., 2014). The probabilities for each language must be learned, using gradient descent. Normally we expect most probabilities to be near 0 or 1, making the PFST nearly deterministic (i.e., close to a subsequential FST). However, permitting low-probability choices remains useful to account for typographical errors, dialectal differences, and free variation in the training corpus.

Our PFST *generates* a surface string, but the invertibility of FSTs will allow us to work backwards when *analyzing* a surface string (§3).

#### A sliding-window model

Instead of having rule priorities, we apply Nunberg-style rules within a 2-token window that slides over *u* in a single left-to-right pass (Figure 2). Conditioned on the current window contents *ab*, a single edit is selected stochastically: either *ab* ↦ *ab* (no change), *ab* ↦ *b* (left absorption), *ab* ↦ *a* (right absorption), or *ab* ↦ *ba* (transposition). Then the window slides rightward to cover the next input token, together with the token that is (now) to its left. *a* and *b* are always real tokens, never boundary symbols. *ϕ* specifies the conditional edit probabilities.^{5}

These specific edit rules (like Nunberg’s) cannot insert new symbols, nor can they delete *all* of the underlying symbols. Thus, surface *x*_{i} is a good clue to *u*_{i}: all of its tokens must appear underlyingly, and if *x*_{i} = *ϵ* (the empty string) then *u*_{i} = *ϵ*.

The model can be directly implemented as a PFST (Appendix D^{4}) using Cotterell et al.’s (2014) more general PFST construction.

Our single-pass formalism is less expressive than Nunberg’s. It greedily makes decisions based on at most one token of right context (“label bias”). It cannot rewrite ’”. ↦ .’” or “,. ↦ .’” because the . is encountered too late to percolate leftward; luckily, though, we can handle such English examples by sliding the window right-to-left instead of left-to-right. We treat the sliding direction as a language-specific parameter.^{6}
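To make the single-pass behavior concrete, here is a small sketch of the sliding-window edit process in both directions. The edit names, the `slide` function, and the probabilities in `TOY_PHI` are assumptions for illustration only (the toy table is deterministic, and unseen windows default to no change); it is not the PFST construction of Appendix D.

```python
import random

def choose_edit(phi, a, b, rng):
    """Sample one edit for window contents ab from the table phi."""
    dist = phi.get((a, b), {"keep": 1.0})
    edits = list(dist)
    return rng.choices(edits, weights=[dist[e] for e in edits])[0]

def slide(u, phi, direction="L2R", seed=0):
    """Slide a 2-token window over u once, in the given direction."""
    rng = random.Random(seed)
    left_to_right = direction == "L2R"
    out = []  # output so far; stored reversed when sliding right-to-left
    for tok in (u if left_to_right else reversed(u)):
        if not out:
            out.append(tok)  # a and b are always real tokens
            continue
        # Window contents in reading order: a is the left token, b the right.
        a, b = (out[-1], tok) if left_to_right else (tok, out[-1])
        edit = choose_edit(phi, a, b, rng)
        new = {"keep": [a, b], "absorb_left": [b],
               "absorb_right": [a], "transpose": [b, a]}[edit]
        out[-1:] = new if left_to_right else new[::-1]
    return out if left_to_right else out[::-1]

# Hypothetical, deterministic edit probabilities.
TOY_PHI = {
    (",", ","): {"absorb_left": 1.0},
    (",", "."): {"absorb_left": 1.0},
    ('"', ","): {"transpose": 1.0},
    ('"', "."): {"transpose": 1.0},
}
```

With this toy table, sliding left-to-right over `"`, `,`, `.` transposes the quotation mark past the comma before the period is seen, so the comma survives; sliding right-to-left absorbs the comma first and then transposes, which mirrors the motivation for making the direction a language-specific parameter.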

### 2.3 Training Objective

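Building on equation (2), we train *θ*, *ϕ* to locally maximize the regularized conditional log-likelihood

$$\Big(\sum_{\mathbf{x},T} \log p(\mathbf{x} \mid T) - \xi \cdot \mathbb{E}_{T'}[c(T')]^2\Big) - \varsigma \cdot \lVert\theta\rVert^2 \qquad (5)$$

where the sum is over a training treebank.^{7}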

The expectation $\mathbb{E}[\cdots]$ is over $T' \sim p(\cdot \mid T, \mathbf{x})$. This *generalized expectation* term provides *posterior regularization* (Mann and McCallum, 2010; Ganchev et al., 2010), by encouraging parameters that reconstruct trees *T′* that use symmetric punctuation marks in a “typical” way. The function *c*(*T′*) counts the nodes in *T′* whose punctemes contain “unmatched” symmetric punctuation tokens: for example, ) is “matched” only when it appears in a right puncteme with ( at the comparable position in the same constituent’s left puncteme. The precise definition is given in Appendix B.^{4}

In our development experiments on English, the posterior regularization term was necessary to discover an aesthetically appealing theory of underlying punctuation. When we dropped this term (*ξ* = 0) and simply maximized the ordinary regularized likelihood, we found that the optimization problem was underconstrained: different training runs would arrive at different, rather arbitrary underlying punctemes. For example, one training run learned an Attach model that used underlying " to terminate sentences, along with a NoisyChannel model that absorbed the left quotation mark into the period. By encouraging the underlying punctuation to be symmetric, we broke the ties. We also tried making this a hard constraint (*ξ* = ∞), but then the model was unable to explain some of the training sentences at all, giving them probability of 0. For example, I went to the “ special place ” cannot be explained, because special place is not a constituent.^{8}

## 3 Inference

In principle, working with the model (1) is straightforward, thanks to the closure properties of formal languages. Provided that *p*_{syn} can be encoded as a weighted CFG, it can be composed with the weighted tree transducer *p*_{θ} and the weighted FST *p*_{ϕ} to yield a new weighted CFG (similarly to Bar-Hillel et al., 1961; Nederhof and Satta, 2003). Under this new grammar, one can recover the optimal *T*,*T′* for $\bar{x}$ by dynamic programming, or sum over *T*,*T′* by the inside algorithm to get the likelihood $p(\bar{x})$. A similar approach was used by Levy (2008) with a different FST noisy channel.

In this paper we assume that *T* is observed, allowing us to work with (2). This cuts the computation time from *O*(*n*^{3}) to *O*(*n*).^{9} Whereas the inside algorithm for (1) must consider *O*(*n*^{2}) possible constituents of $\bar{x}$ and *O*(*n*) ways of building each, our algorithm for (2) only needs to iterate over the *O*(*n*) true constituents of *T* and the 1 true way of building each. However, it must still consider the $|\mathcal{W}_d|$ puncteme pairs for each constituent.

### 3.1 Algorithms

Given an input sentence $\bar{x}$ of length *n*, our job is to sum over possible trees *T′* that are consistent with *T* and $\bar{x}$, or to find the best such *T′*. This is roughly a lattice parsing problem, made easier by knowing *T*. However, the possible **ū** values are characterized not by a lattice but by a *cyclic* WFSA (as |*u*_{i}| is unbounded whenever |*x*_{i}| > 0).

For each slot 0 ≤ *i* ≤ *n*, transduce the surface punctuation string *x*_{i} by the *inverted* PFST for *p*_{ϕ} to obtain a weighted finite-state automaton (WFSA) that describes *all possible* underlying strings *u*_{i}.^{10} This WFSA accepts each possible *u*_{i} with weight *p*_{ϕ}(*x*_{i}∣*u*_{i}). If it has *N*_{i} states, we can represent it (Berstel and Reutenauer, 1988) with a family of sparse weight matrices $M_i(\upsilon) \in \mathbb{R}^{N_i \times N_i}$, whose element at row *s* and column *t* is the weight of the *s* → *t* arc labeled with *υ*, or 0 if there is no such arc. Additional vectors $\lambda_i, \rho_i \in \mathbb{R}^{N_i}$ specify the initial and final weights. (*λ*_{i} is one-hot if the PFST has a single initial state, of weight 1.)

For any puncteme *l* (or *r*) in $\mathcal{V}$, we define *M*_{i}(*l*) = *M*_{i}(*l*_{1})*M*_{i}(*l*_{2})⋯*M*_{i}(*l*_{|l|}), a product over the 0 or more tokens in *l*. This gives the total weight of all *s* →^{*}*t* WFSA paths labeled with *l*.
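The matrix representation can be illustrated with a toy two-state automaton. The arc weights in `TOY_M`, `TOY_LAMBDA`, and `TOY_RHO` below are made-up values for this sketch, not weights derived from any trained PFST, and plain Python lists stand in for sparse matrices.

```python
def matmul(A, B):
    """Ordinary matrix product of two list-of-lists matrices."""
    return [[sum(A[s][q] * B[q][t] for q in range(len(B)))
             for t in range(len(B[0]))] for s in range(len(A))]

def identity(n):
    return [[1.0 if s == t else 0.0 for t in range(n)] for s in range(n)]

def puncteme_matrix(M, puncteme, n_states):
    """M(l) = M(l_1) M(l_2) ... M(l_|l|); the empty puncteme gives I."""
    result = identity(n_states)
    for token in puncteme:
        result = matmul(result, M[token])
    return result

def string_weight(lam, M, rho, tokens):
    """Total weight of accepting paths labeled with tokens: lam^T M(tokens) rho."""
    P = puncteme_matrix(M, tokens, len(lam))
    return sum(lam[s] * P[s][t] * rho[t]
               for s in range(len(lam)) for t in range(len(rho)))

# Hypothetical WFSA: state 0 loops on "," (weight 0.5) and moves to the
# final state 1 on "." (weight 0.25).
TOY_M = {
    ",": [[0.5, 0.0], [0.0, 0.0]],
    ".": [[0.0, 0.25], [0.0, 0.0]],
}
TOY_LAMBDA = [1.0, 0.0]   # one-hot initial weights
TOY_RHO = [0.0, 1.0]      # final weights
```

In this toy automaton the string `,` `.` has weight 0.5 × 0.25 = 0.125, the product of the arc weights along its single accepting path.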

The subprocedure in Algorithm 1 essentially extends this to obtain a new matrix $\mathrm{In}(w) \in \mathbb{R}^{N_i \times N_k}$, where the subtree rooted at *w* stretches from slot *i* to slot *k*. Its element In(*w*)_{st} gives the total weight of all **extended paths** in the **ū** WFSA from state *s* at slot *i* to state *t* at slot *k*. An extended path is defined by a choice of underlying punctemes at *w* and all its descendants. These punctemes determine an *s*-to-final path at *i*, then initial-to-final paths at *i* + 1 through *k* − 1, then an initial-to-*t* path at *k*. The weight of the extended path is the product of all the WFSA weights on these paths (which correspond to transition probabilities in the *p*_{ϕ} PFST) times the probability of the choice of punctemes (from *p*_{θ}).

This inside algorithm computes quantities needed for training (§2.3). Useful variants arise via well-known methods for weighted derivation forests (Berstel and Reutenauer, 1988; Goodman, 1999; Li and Eisner, 2009; Eisner, 2016).

Specifically, to modify Algorithm 1 to *maximize* over *T′* values (§§6.2–6.3) instead of summing over them, we switch to the **derivation semiring** (Goodman, 1999), as follows. Whereas In(*w*)_{st} used to store the *total* weight of all extended paths from state *s* at slot *i* to state *t* at slot *k*, now it will store the weight of the *best* such extended path. It will also store that extended path’s choice of underlying punctemes, in the form of a puncteme-annotated version of the subtree of *T* that is rooted at *w*. This is a potential subtree of *T′*.

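In the derivation semiring, each element of In(*w*) has the form (*r*, *D*) where *r* ∈ ℝ and *D* is a tree. We define addition and multiplication over such pairs:

$$(r, D) + (r', D') = \begin{cases} (r, D) & \text{if } r > r' \\ (r', D') & \text{otherwise} \end{cases} \qquad (6)$$

$$(r, D) \cdot (r', D') = (rr', DD') \qquad (7)$$

where *DD′* denotes an ordered combination of two trees. Matrix products **UV** and scalar-matrix products *p* ⋅ **V** are defined in terms of element addition and multiplication as usual:

$$(\mathbf{UV})_{st} = \sum_{q} \mathbf{U}_{sq} \cdot \mathbf{V}_{qt}, \qquad (p \cdot \mathbf{V})_{st} = p \cdot \mathbf{V}_{st}$$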

What is *DD′*? For presentational purposes, it is convenient to represent a punctuated dependency tree as a bracketed string. For example, the underlying tree *T′* in Figure 1 would be [ [“ Dale ”] means [“ [ river ] valley ”] ] where the words correspond to nodes of *T*. In this case, we can represent every *D* as a *partial* bracketed string and define *DD′* by string concatenation. This presentation ensures that multiplication (7) is a complete and associative (though not commutative) operation, as in any semiring. As base cases, each real-valued element of **M**_{i}(*l*) or **M**_{k}(*r*) is now paired with the string [*l* or *r*] respectively,^{11} and the real number 1 at line 10 is paired with the string *w*. The real-valued elements of the *λ*_{i} and *ρ*_{i} vectors and the **0** matrix at line 11 are paired with the empty string *ϵ*, as is the real number *p* at line 13.

In practice, the *D* strings that appear within the matrix **M** of Algorithm 1 will always represent complete punctuated trees. Thus, they can actually be represented in memory as such, and different trees may share subtrees for efficiency (using pointers). The product in line 10 constructs a matrix of trees with root *w* and differing sequences of left/right children, while the product in line 14 annotates those trees with punctemes *l*,*r*.
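The derivation semiring can be sketched with pairs whose second component is a partial bracketed string, so that combining derivations is string concatenation. The function names and the toy matrices in the usage note are ours, chosen for illustration; Algorithm 1 itself is not reproduced here.

```python
ZERO = (0.0, "")   # additive identity: weight 0 paired with the empty string

def plus(x, y):
    """Semiring addition: keep the pair with the larger weight."""
    return x if x[0] > y[0] else y

def times(x, y):
    """Semiring multiplication: multiply weights, concatenate derivations."""
    return (x[0] * y[0], x[1] + y[1])

def semiring_matmul(U, V):
    """Matrix product using plus/times in place of + and *."""
    out = []
    for s in range(len(U)):
        row = []
        for t in range(len(V[0])):
            best = ZERO
            for q in range(len(V)):
                best = plus(best, times(U[s][q], V[q][t]))
            row.append(best)
        out.append(row)
    return out
```

For example, multiplying the 1×2 matrix `[[(0.5, "[“ "), (0.1, "[, ")]]` by the 2×1 matrix `[[(0.4, "Dale")], [(1.0, "Dale")]]` keeps the higher-weight derivation, whose string is `[“ Dale`. Because `times` concatenates strings, it is associative but not commutative, as the text notes.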

### 3.2 Optimization

Having computed the objective (5), we find the gradient via automatic differentiation, and optimize $\theta, \phi$ via Adam (Kingma and Ba, 2014), a variant of stochastic gradient descent, with learning rate 0.07, batch size 5, 400 sentences per epoch, and L2 regularization. (These hyperparameters, along with the regularization coefficients *ς* and *ξ* from (5), were tuned on dev data (§4) separately for each language.) We train the punctuation model for 30 epochs. The initial NoisyChannel parameters (*ϕ*) are drawn from $\mathcal{N}(0,1)$, and the initial Attach parameters (*θ*) are drawn from $\mathcal{N}(0,1)$ (with one minor exception described in Appendix A).

## 4 Intrinsic Evaluation of the Model

### Data.

Throughout §§4–6, we will examine the punctuation model on a subset of the Universal Dependencies (UD) version 1.4 (Nivre et al., 2016), a collection of dependency treebanks across 47 languages with unified POS-tag and dependency label sets. Each treebank has designated training, development, and test portions. We experiment on Arabic, English, Chinese, Hindi, and Spanish (Table 2), languages with diverse punctuation vocabularies and punctuation interaction rules, not to mention script directionality. For each treebank, we use the tokenization provided by UD, and take the punctuation tokens (which may be multi-character, such as …) to be the tokens with the PUNCT tag. We replace each straight double quotation mark " with either “ or ” as appropriate, and similarly for single quotation marks.^{12} We split each non-punctuation token that ends in . (such as etc.) into a shorter non-punctuation token (etc) followed by a special punctuation token called the “abbreviation dot” (which is distinct from a period). We prepend a special punctuation mark ∘ to every sentence $\bar{x}$, which can serve to absorb an initial comma, for example.^{13} We then replace each token with the special symbol UNK if its type appeared fewer than 5 times in the training portion. This gives the surface sentences.

Table 2: Statistics of the treebanks used.

Language | Treebank | #Token | %Punct | #Omit | #Type |
---|---|---|---|---|---|
Arabic | 𝖺𝗋 | 282K | 7.9 | 255 | 18 |
Chinese | 𝗓𝗁 | 123K | 13.8 | 3 | 23 |
English | 𝖾𝗇 | 255K | 11.7 | 40 | 35 |
English | 𝖾𝗇_𝖾𝗌𝗅 | 97.7K | 9.8 | 2 | 16 |
Hindi | 𝗁𝗂 | 352K | 6.7 | 21 | 15 |
Spanish | 𝖾𝗌_𝖺𝗇𝖼𝗈𝗋𝖺 | 560K | 11.7 | 25 | 16 |

To estimate the vocabulary $\mathcal{V}$ of *underlying* punctemes, we simply collect all *surface* token sequences *x*_{i} that appear at any slot in the training portion of the processed treebank. This is a generous estimate. Similarly, we estimate $\mathcal{W}_d$ (§2.1) as all pairs $(l,r) \in \mathcal{V}^2$ that flank any *d* constituent.

Recall that our model generates surface punctuation given an unpunctuated dependency tree. We train it on each of the 5 languages independently. We evaluate on conditional perplexity, which will be low if the trained model successfully assigns a high probability to the actual surface punctuation in a held-out corpus of the same language.

### Baselines.

We compare our model against three baselines to show that its complexity is necessary. Our first baseline is an ablation study that does not use latent underlying punctuation, but generates the surface punctuation directly from the tree. (To implement this, we fix the parameters of the noisy channel so that the surface punctuation equals the underlying with probability 1.) If our full model performs significantly better, it will demonstrate the importance of a distinct underlying layer.

Our other two baselines ignore the tree structure, so if our full model performs significantly better, it will demonstrate that conditioning on explicit syntactic structure is useful. These baselines are based on previously published approaches that reduce the problem to tagging: Xu et al. (2016) use a BiLSTM-CRF tagger with bigram topology; Tilk and Alumäe (2016) use a BiGRU tagger with attention. In both approaches, the model is trained to tag each slot *i* with the correct string $x_i \in \mathcal{V}^*$ (possibly *ϵ* or ^{∧}). These are discriminative probabilistic models (in contrast to our generative one). Each gives a probability distribution over the taggings (conditioned on the unpunctuated sentence), so we can evaluate their perplexity.^{14}

### Results.

As shown in Table 3, our full model beats the baselines in perplexity in all 5 languages. Also, in 4 of 5 languages, allowing a trained NoisyChannel (rather than the identity map) significantly improves the perplexity.

Table 3: Perplexity of the baselines (Attn., CRF, Attach) and of the full model (+NC), with the sliding direction (Dir) used for each language.

Language | Attn. | CRF | Attach | +NC | Dir |
---|---|---|---|---|---|
Arabic | 1.4676 | 1.3016 | 1.2230 | 1.1526 | L |
Chinese | 1.6850 | 1.4436 | 1.1921 | 1.1464 | L |
English | 1.5737 | 1.5247 | 1.5636 | 1.4276 | R |
Hindi | 1.1201 | 1.1032 | 1.0630 | 1.0598 | L |
Spanish | 1.4397 | 1.3198 | 1.2364 | 1.2103 | R |

## 5 Analysis of the Learned Grammar

### 5.1 Rules Learned from the Noisy Channel

We study our learned probability distribution over noisy channel rules (*ab* ↦ *b*, *ab* ↦ *a*, *ab* ↦ *ab*, *ab* ↦ *ba*) for English. The probability distributions corresponding to six of Nunberg’s English rules are shown in Figure 3. By comparing the orange and blue bars, observe that the model trained on the 𝖾𝗇_𝖼𝖾𝗌𝗅 treebank learned different quotation rules from the one trained on the 𝖾𝗇 treebank. This is because 𝖾𝗇_𝖼𝖾𝗌𝗅 follows British style, whereas 𝖾𝗇 has American-style quote transposition.^{15}

We now focus on the model learned from the 𝖾𝗇 treebank. Nunberg’s rules are deterministic, and our noisy channel indeed learned low-entropy rules, in the sense that for an input *ab* with underlying count ≥ 25,^{16} at least one of the possible outputs (*a*, *b*, *ab* or *ba*) always has probability >0.75. The one exception is ”.↦.” for which the argmax output has probability ≈ 0.5, because writers do not apply this quote transposition rule consistently. As shown by the blue bars in Figure 3, the high-probability transduction rules are consistent with Nunberg’s hand-crafted deterministic grammar in Table 1.

Our system has high precision when we look at the confident rules. Of the 24 learned edits with conditional probability >0.75, Nunberg lists 20.

Our system also has good recall. Nunberg’s hand-crafted schemata consider 16 punctuation types and generate a total of 192 edit rules, including the specimens in Table 1. That is, of the 16^{2} = 256 *possible* underlying punctuation bigrams *ab*, $\frac{3}{4}$ are supposed to undergo absorption or transposition. Our method achieves fairly high recall, in the sense that when Nunberg proposes *ab* ↦ *γ*, our learned *p*(*γ*∣*ab*) usually ranks highly among all probabilities of the form *p*(*γ′*∣*ab*). 75 of Nunberg’s rules got rank 1, 48 got rank 2, and the remaining 69 got rank >2. The mean reciprocal rank was 0.621. Recall is quite high when we restrict to those Nunberg rules *ab* ↦ *γ* for which our model is confident how to rewrite *ab*, in the sense that some *p*(*γ′*∣*ab*) > 0.5. (This tends to eliminate rare *ab*: see §5.) Of these 55 Nunberg rules, 38 rules got rank 1, 15 got rank 2, and only 2 got rank worse than 2. The mean reciprocal rank was 0.836.
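As a quick consistency check on these figures, the sketch below bounds the mean reciprocal rank from the reported rank counts. It assumes that each window *ab* has at most four candidate outputs (*a*, *b*, *ab*, *ba*), so that a rank worse than 2 is either 3 or 4; the function name and this reading of "rank >2" are our assumptions.

```python
def mrr_bounds(rank1, rank2, worse, worst_rank=4):
    """Lower and upper bounds on mean reciprocal rank when `worse` items
    have an unknown rank between 3 and worst_rank."""
    n = rank1 + rank2 + worse
    known = rank1 * 1.0 + rank2 / 2.0          # contributions of ranks 1 and 2
    lower = (known + worse / worst_rank) / n   # every unknown rank is the worst
    upper = (known + worse / 3.0) / n          # every unknown rank is 3
    return lower, upper

ALL_RULES = mrr_bounds(75, 48, 69)        # all 192 Nunberg rules
CONFIDENT_RULES = mrr_bounds(38, 15, 2)   # the 55 confident rules
```

Under this assumption the first interval is roughly [0.605, 0.635] and the second roughly [0.836, 0.839], so the reported values of 0.621 and 0.836 are both consistent with the counts (up to rounding).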

¿What about Spanish? Spanish uses inverted question marks ¿ and exclamation marks ¡, which form symmetric pairs with the regular question marks and exclamation marks. If we try to extrapolate to Spanish from Nunberg’s English formalization, the English mark most analogous to ¿ is (. Our learned noisy channel for Spanish (not graphed here) includes the high-probability rules ,¿↦,¿ and :¿↦:¿ and ¿,↦¿ which match Nunberg’s treatment of ( in English.

### 5.2 Attachment Model

What does our model learn about how dependency relations are marked by underlying punctuation?

The above example^{17} illustrates the use of specific puncteme pairs to set off the advmod, ccomp, and nmod relations. Notice that said takes a complement (ccomp) that is symmetrically quoted but also left delimited by a comma, which is indeed how direct speech is punctuated in English. This example also illustrates quotation transposition. The top five relations that are most likely to generate symmetric punctemes and their top (*l*, *r*) pairs are shown in Table 4.

Table 4: The five relations most likely to generate symmetric punctemes, each with its top (*l*, *r*) pairs.

parataxis | 2.38 | appos | 2.29 | list | 1.33 | advcl | 0.77 | ccomp | 0.53 |
---|---|---|---|---|---|---|---|---|---|
, , | 26.8 | , , | 18.8 | ε ε | 60.0 | ε ε | 73.8 | ε ε | 90.8 |
ε ε | 20.1 | : ε | 18.1 | , , | 22.3 | , , | 21.2 | “ ” | 2.4 |
( ) | 13.0 | - ε | 15.9 | , ε | 5.3 | ε , | 3.1 | , , | 2.4 |
- ε | 9.7 | ε ε | 14.4 | < > | 3.0 | ( ) | 0.74 | :“ ” | 0.9 |
: ε | 8.1 | ( ) | 13.1 | ( ) | 3.0 | ε - | 0.21 | “ ,” | 0.8 |

The above example^{18} shows how our model handles commas in conjunctions of 2 or more phrases. UD format dictates that each conjunct after the first is attached by the conj relation. As shown above, each such conjunct is surrounded by underlying commas (via the N.,.,.conj feature from Appendix A), except for the one that bears the conjunction and (via an even stronger weight on the C.*ε*.*ε*.$\overrightarrow{\text{conj}}$.cc feature). Our learned feature weights indeed yield *p*(*ℓ* = *ε*, *r* = *ε*) > 0.5 for the final conjunct in this example. Some writers omit the “Oxford comma” before the conjunction: this style can be achieved simply by changing “surrounded” to “preceded” (that is, changing the N feature to N.,.*ε*.conj).

## 6 Performance on Extrinsic Tasks

We evaluate the trained punctuation model by using it in the following three tasks.

### 6.1 Punctuation Restoration

In this task, we are given a depunctuated sentence $\bar{d}$^{19} and must restore its (surface) punctuation. Our model supposes that the observed punctuated sentence $\bar{x}$ would have arisen via the generative process (1). Thus, we try to find *T*, *T′*, and $\bar{x}$ that are consistent with $\bar{d}$ (a partial observation of $\bar{x}$).

The first step is to reconstruct *T* from $\bar{d}$. This initial parsing step is intended to choose the *T* that maximizes $p_{\text{syn}}(T \mid \bar{d})$.^{20} This step depends only on *p*_{syn} and not on our punctuation model (*p*_{θ}, *p*_{ϕ}). In practice, we choose *T* via a dependency parser that has been trained on an unpunctuated treebank with examples of the form $(\bar{d}, T)$.^{21}

We then consider the posterior distribution over (*T′*, **x**) given this *T*. To obtain a single prediction for **x**, we adopt the minimum Bayes risk (MBR) approach of choosing surface punctuation $\hat{\mathbf{x}}$ that minimizes the expected loss with respect to the unknown truth **x**^{*}. Our loss function is the total edit distance over all slots (where edits operate on punctuation tokens). Finding $\hat{\mathbf{x}}$ exactly would be intractable, so we use a sampling-based approximation and draw *m* = 1000 samples from the posterior distribution over (*T′*, **x**). We then define

$$\hat{\mathbf{x}} = \operatorname*{argmin}_{\mathbf{x} \in S(T)} \sum_{\mathbf{x}^{*} \in S(T)} \hat{p}(\mathbf{x}^{*} \mid T) \cdot \mathrm{loss}(\mathbf{x}, \mathbf{x}^{*}) \qquad (10)$$

where *S*(*T*) is the set of unique **x** values in the sample and $\hat{p}$ is the empirical distribution given by the sample. This can be evaluated in *O*(*m*^{2}) time.
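The sampling-based MBR rule can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes each sampled punctuation **x** is represented as a tuple of slots, each slot being a tuple of punctuation tokens, and the function names (`token_edit_distance`, `total_loss`, `mbr_decode`) are hypothetical. Drawing the samples from the posterior is not shown.

```python
from collections import Counter


def token_edit_distance(a, b):
    """Levenshtein distance between two sequences of punctuation tokens."""
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, start=1):
        curr = [i]
        for j, tb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ta
                            curr[j - 1] + 1,             # insert tb
                            prev[j - 1] + (ta != tb)))   # substitute or match
        prev = curr
    return prev[-1]


def total_loss(x, x_star):
    """Total edit distance over all slots of two punctuation sequences."""
    assert len(x) == len(x_star), "both sequences must cover the same slots"
    return sum(token_edit_distance(s, t) for s, t in zip(x, x_star))


def mbr_decode(samples):
    """Return the sampled x with minimum expected loss under the
    empirical distribution of the sample (quadratic in the sample size)."""
    counts = Counter(samples)   # unique x values with their frequencies
    m = len(samples)

    def risk(x):
        return sum(c / m * total_loss(x, x_star)
                   for x_star, c in counts.items())

    return min(counts, key=risk)


# Toy sample for a two-word sentence (three slots).
samples = [((), (",",), (".",))] * 3 + [((), (), (".",)), ((), (), ("!",))]
best = mbr_decode(samples)
```

Restricting both the candidate set and the expectation to the unique sampled values is what keeps the search tractable; the cost grows with the square of the number of unique samples.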

We evaluate on Arabic, English, Chinese, Hindi, and Spanish. For each language, we train both the parser and the punctuation model on the training split of that UD treebank (§4), and evaluate on held-out data. We compare to the BiLSTM-CRF baseline in §4 (Xu et al., 2016).^{22} We also compare to a “trivial” deterministic baseline, which merely places a period at the end of the sentence (or a "|" in the case of Hindi) and adds no other punctuation. Because most slots do not in fact have punctuation, the trivial baseline already does very well; to improve on it, we must fix its errors without introducing new ones.
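For concreteness, the trivial baseline can be written as follows. This is only a sketch under stated assumptions: a sentence of *n* words is taken to have *n* + 1 slots (slot 0 precedes the first word), the language code `"hi"` for Hindi is a hypothetical convention, and the helper names are illustrative.

```python
def trivial_baseline(words, lang="en"):
    """Leave every slot empty except the last, which gets a final mark."""
    final_mark = "|" if lang == "hi" else "."    # assumed language code for Hindi
    slots = [[] for _ in range(len(words) + 1)]  # slot i follows word i; slot 0 is initial
    slots[-1].append(final_mark)
    return slots


def render(words, slots):
    """Interleave words with the punctuation tokens of each slot."""
    out = list(slots[0])
    for word, slot in zip(words, slots[1:]):
        out.append(word)
        out.extend(slot)
    return " ".join(out)


words = ["we", "agree"]
sentence = render(words, trivial_baseline(words))
```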

Our final comparison on test data is shown in the table in Figure 4. On all 5 languages, our method beats (usually significantly) its 3 competitors: the trivial deterministic baseline, the BiLSTM-CRF, and the ablated version of our model (Attach) that omits the noisy channel.

Of course, the success of our method depends on the quality of the parse trees *T* (which is particularly low for Chinese and Arabic). The graph in Figure 4 explores this relationship, by evaluating (on dev data) with noisier trees obtained from parsers that were variously trained on only the first 10%, 20%, … of the training data. On all 5 languages, provided that the trees are at least 75% correct, our punctuation model beats both the trivial baseline and the BiLSTM-CRF (which do not use trees). It also beats the Attach ablation baseline at all levels of tree accuracy (these curves are omitted from the graph to avoid clutter). In all languages, better parses give better performance, and gold trees yield the best results.

### 6.2 Punctuation Correction

Our next goal is to correct punctuation errors in a learner corpus. Each sentence is drawn from the Cambridge Learner Corpus treebanks, which provide original (𝖾𝗇_𝖾𝗌𝗅) and corrected (𝖾𝗇_𝖼𝖾𝗌𝗅) sentences. All kinds of errors are corrected, such as syntax errors, but we use only the 30% of sentences whose depunctuated trees *T* are isomorphic between 𝖾𝗇_𝖾𝗌𝗅 and 𝖾𝗇_𝖼𝖾𝗌𝗅. These 𝖾𝗇_𝖼𝖾𝗌𝗅 trees may correct word and/or punctuation errors in 𝖾𝗇_𝖾𝗌𝗅, as we wish to do automatically.

We assume that an English learner can make mistakes in both the attachment and the noisy channel steps. A common attachment mistake is the failure to surround a non-restrictive relative clause with commas. In the noisy channel step, mistakes in quote transposition are common.

#### Correction model.

Based on the assumption about the two error sources, we develop a discriminative model for this task. Let $\bar{x}_e$ denote the full input sentence, and let *x*_{e} and *x*_{c} denote the input (possibly errorful) and output (corrected) punctuation sequences. We model $p(x_c \mid \bar{x}_e) = \sum_{T} \sum_{T'_c} p_{\text{syn}}(T \mid \bar{x}_e) \cdot p_\theta(T'_c \mid T, x_e) \cdot p_\phi(x_c \mid T'_c)$. Here *T* is the depunctuated parse tree, *T*_{c}*′* is the corrected underlying tree, *T*_{e}*′* is the error underlying tree, and we assume $p_\theta(T'_c \mid T, x_e) = \sum_{T'_e} p(T'_e \mid T, x_e) \cdot p_\theta(T'_c \mid T'_e)$.

The first step is to reconstruct *T* from the error sentence $\bar{x}_e$. We choose *T* that maximizes $p_{\text{syn}}(T \mid \bar{x}_e)$ from a dependency parser trained on 𝖾𝗇_𝖾𝗌𝗅 treebank examples ($\bar{x}_e$, *T*). The second step is to reconstruct *T*_{e}*′* based on our punctuation model trained on 𝖾𝗇_𝖾𝗌𝗅. We choose *T*_{e}*′* that maximizes *p*(*T*_{e}*′* ∣ *T*, *x*_{e}). We then reconstruct *T*_{c}*′* by

$$p_\theta(T'_c \mid T'_e) = \prod_{w_e \in T'_e} p(l, r \mid w_e)$$

where *w*_{e} is the node in *T*_{e}*′*, and *p*(*l*, *r* ∣ *w*_{e}) is a similar log-linear model to (4) with additional features (Appendix C^{4}) which look at *w*_{e}.

Finally, we reconstruct *x*_{c} based on the noisy channel *p*_{ϕ}(*x*_{c} ∣ *T*_{c}*′*) in §2.2. During training, *ϕ* is regularized to be close to the noisy channel parameters in the punctuation model trained on 𝖾𝗇_𝖼𝖾𝗌𝗅.

We use the same MBR decoder as in §6.1 to choose the best action. We evaluate using AED as in §6.1. As a second metric, we use the script from the CoNLL 2014 Shared Task on Grammatical Error Correction (Ng et al., 2014): it computes the F_{0.5}-measure of the set of edits found by the system, relative to the true set of edits.
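To make the second metric concrete, the sketch below computes an F_{0.5} score from two sets of edits. It is not the CoNLL 2014 scoring script, which also handles how system edits are extracted and matched; here edits are simply assumed to be hashable tuples such as `(slot, old, new)`, and the conventions for empty sets are assumptions.

```python
def f_beta(system_edits, gold_edits, beta=0.5):
    """F-beta of the system's edit set against the gold edit set.

    beta < 1 weights precision more heavily than recall.
    """
    system_edits, gold_edits = set(system_edits), set(gold_edits)
    true_pos = len(system_edits & gold_edits)
    # Assumed conventions: an empty edit set counts as perfect precision/recall.
    precision = true_pos / len(system_edits) if system_edits else 1.0
    recall = true_pos / len(gold_edits) if gold_edits else 1.0
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical edits: (slot index, original punctuation, corrected punctuation).
system = {(3, ",", ""), (7, "", ",")}
gold = {(3, ",", ""), (5, "", ","), (9, ".", "!")}
score = f_beta(system, gold)
```

With β = 0.5, a system that proposes few but correct edits is rewarded more than one that proposes many edits of mixed quality, which suits a correction setting where spurious changes are costly.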

As shown in Table 5, our method achieves better performance than the punctuation restoration baselines (which ignore input punctuation). On the other hand, it is soundly beaten by a new BiLSTM-CRF that we trained specifically for the task of punctuation correction. This is the same as the BiLSTM-CRF in the previous section, except that the BiLSTM now reads a *punctuated* input sentence (with possibly erroneous punctuation). To be precise, at step 0 ≤ *i* ≤ *n*, the BiLSTM reads a concatenation of the embedding of word *i* (or BOS if *i* = 0) with an embedding of the punctuation token sequence *x*_{i}. The BiLSTM-CRF wins because it is a discriminative model tailored for this task: the BiLSTM can extract arbitrary contextual features of slot *i* that are correlated with whether *x*_{i} is correct in context.

### 6.3 Sentential Rephrasing

We suspect that syntactic transformations on a sentence should often preserve the underlying punctuation attached to its tree. The surface punctuation can then be regenerated from the transformed tree. Such transformations include edits that are suggested by a writing assistance tool (Heidorn, 2000), or subtree deletions in compressive summarization (Knight and Marcu, 2002).

For our experiment, we evaluate an interesting case of syntactic transformation. Wang and Eisner (2016) consider a systematic rephrasing procedure by rearranging the order of dependent subtrees within a UD treebank, in order to synthesize new languages with different word order that can then be used to help train multi-lingual systems (i.e., data augmentation with synthetic data).

As Wang and Eisner acknowledge (2016, footnote 9), their permutations treat surface punctuation tokens like ordinary words, which can result in synthetic sentences whose punctuation is quite unlike that of real languages.

In our experiment, we use Wang and Eisner’s (2016) “self-permutation” setting, where the dependents of each noun and verb are stochastically reordered, but according to a dependent ordering model that has been trained on the same language. For example, rephrasing an English sentence

under an English ordering model may yield

which is still grammatical except that , and . are wrongly swapped (after all, they have the same POS tag and relation type). Worse, permutation may yield bizarre punctuation such as , , at the start of a sentence.

Our punctuation model gives a straightforward remedy—instead of permuting the tree directly, we first discover its most likely underlying tree

by the maximizing variant of Algorithm 1 (§3.1). Then, we permute the underlying tree and sample the surface punctuation from the distribution modeled by the trained PFST, yielding

We leave the handling of capitalization to future work.

We test the naturalness of the permuted sentences by asking how well a word trigram language model trained on them could predict the original sentences.^{23} As shown in Figure 6, our permutation approach reduces the perplexity over the baseline on 4 of the 5 languages, often dramatically.
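The evaluation idea can be illustrated with a small word trigram model. This is a sketch under assumptions, not the experimental setup: add-*k* smoothing with the value `k=0.001`, the padding symbols, and the function names are all choices made here for illustration.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"


def trigrams(sentence):
    """Trigrams of a token list, padded with two BOS symbols and one EOS."""
    padded = [BOS, BOS] + list(sentence) + [EOS]
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]


def train(sentences):
    """Collect trigram counts, context (bigram) counts, and the vocabulary."""
    tri, ctx, vocab = Counter(), Counter(), {EOS}
    for s in sentences:
        vocab.update(s)
        for t in trigrams(s):
            tri[t] += 1
            ctx[t[:2]] += 1
    return tri, ctx, vocab


def perplexity(model, sentences, k=0.001):
    """Per-token perplexity under add-k smoothing (k is an assumed setting)."""
    tri, ctx, vocab = model
    v = len(vocab) + 1          # one extra type stands in for unseen words
    log_prob, n = 0.0, 0
    for s in sentences:
        for t in trigrams(s):
            log_prob += math.log((tri[t] + k) / (ctx[t[:2]] + k * v))
            n += 1
    return math.exp(-log_prob / n)


# Toy data: original sentences and a badly punctuated permutation of them.
originals = [["we", "agree", ",", "then", "."], ["we", "agree", "."]]
permuted = [[",", "we", "then", "agree", "."], [".", "we", "agree"]]
ppl_natural = perplexity(train(originals), originals)
ppl_permuted = perplexity(train(permuted), originals)
```

A model trained on permuted text whose punctuation resembles real usage should assign the original sentences a lower perplexity than one trained on text with misplaced punctuation.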

Language | Punctuation: Base | Punctuation: Half | Punctuation: Full | All: Base | All: Half | All: Full |
---|---|---|---|---|---|---|
Arabic | 156.0 | 231.3 | 186.1 | 540.8 | 590.3 | 553.4 |
Chinese | 165.2 | 110.0 | 61.4 | 205.0 | 174.4 | 78.7 |
English | 98.4 | 74.5 | 51.0 | 140.9 | 131.4 | 75.4 |
Hindi | 10.8 | 11.0 | 9.7 | 118.4 | 118.8 | 91.8 |
Spanish | 266.2 | 259.2 | 194.5 | 346.3 | 343.4 | 239.3 |


## 7 Related Work

Punctuation can aid syntactic analysis, since it signals phrase boundaries and sentence structure. Briscoe (1994) and White and Rajkumar (2008) parse punctuated sentences using hand-crafted constraint-based grammars that implement Nunberg’s approach in a declarative way. These grammars treat *surface* punctuation symbols as ordinary words, but annotate the nonterminal categories so as to effectively keep track of the *underlying* punctuation. This is tantamount to crafting a grammar for underlyingly punctuated sentences and composing it with a finite-state noisy channel.

The parser of Ma et al. (2014) takes a different approach and treats punctuation marks as features of their neighboring words. Zhang et al. (2013) use a generative model for punctuated sentences, letting them restore punctuation marks during transition-based parsing of unpunctuated sentences. Li et al. (2005) use punctuation marks to segment a sentence: this “divide and rule” strategy reduces ambiguity in parsing of long Chinese sentences. Punctuation can similarly be used to constrain syntactic structure during grammar induction (Spitkovsky et al., 2011).

Punctuation restoration (§6.1) is useful for transcribing text from unpunctuated speech. The task is usually treated by tagging each slot with zero or more punctuation tokens, using a traditional sequence labeling method: conditional random fields (Lui and Wang, 2013; Lu and Ng, 2010), recurrent neural networks (Tilk and Alumäe, 2016), or transition-based systems (Ballesteros and Wanner, 2016).

## 8 Conclusion and Future Work

We have provided a new computational approach to modeling punctuation. In our model, syntactic constituents stochastically generate latent underlying left and right punctemes. Surface punctuation marks are not directly attached to the syntax tree, but are generated from sequences of adjacent punctemes by a (stochastic) finite-state string rewriting process. Our model is inspired by Nunberg’s (1990) formal grammar for English punctuation, but is probabilistic and trainable. We give exact algorithms for training and inference.

We trained Nunberg-like models for 5 languages and L2 English. We compared the English model to Nunberg’s, and showed how the trained models can be used across languages for punctuation restoration, correction, and adjustment.

In the future, we would like to study the usefulness of the recovered underlying trees on tasks such as syntactically sensitive sentiment analysis (Tai et al., 2015), machine translation (Cowan et al., 2006), relation extraction (Culotta and Sorensen, 2004), and coreference resolution (Kong et al., 2010). We would also like to investigate how underlying punctuation could aid parsing. For discriminative parsing, features for scoring the tree could refer to the underlying punctuation, not just the surface punctuation. For generative parsing (§3), we could follow the scheme in (1). For example, the *p*_{syn} factor in (1) might be a standard recurrent neural network grammar (RNNG) (Dyer et al., 2016); when a subtree of *T* is completed by the Reduce operation of *p*_{syn}, the punctuation-augmented RNNG (1) would stochastically attach subtree-external left and right punctemes with *p*_{θ} and transduce the subtree-internal slots with *p*_{ϕ}.

In the future, we are also interested in enriching the *T′* representation and making it more different from *T*, to underlyingly account for other phenomena in *T* such as capitalization, spacing, morphology, and non-projectivity (via reordering).

## Acknowledgments

This material is based upon work supported by the National Science Foundation under Grant Nos. 1423276 and 1718846, including an REU supplement to the first author. We are grateful to the state of Maryland for the Maryland Advanced Research Computing Center, a crucial resource. We thank Xiaochen Li for early discussion, Argo lab members for further discussion, and the three reviewers for quality comments.

## Notes

Our model could be easily adapted to work on constituency trees instead.

Multi-token punctemes are occasionally useful. For example, the puncteme … might consist of either 1 or 3 tokens, depending on how the tokenizer works; similarly, the puncteme ?! might consist of 1 or 2 tokens. Also, if a single constituent of *T* gets surrounded by both parentheses and quotation marks, this gives rise to punctemes (“ and ”). (A better treatment would add the parentheses as a separate puncteme pair at a unary node above the quotation marks, but that would have required *T′* to introduce this extra node.)

The appendices (supplementary material) are available at https://arxiv.org/abs/1906.11298.

Rather than learn a separate edit probability distribution for each bigram *ab*, one could share parameters across bigrams. For example, Table 1’s caption says that “stronger” tokens tend to absorb “weaker” ones. A model that incorporated this insight would not have to learn *O*(|*Σ*|^{2}) separate absorption probabilities (two per bigram *ab*), but only *O*(|*Σ*|) strengths (one per unigram *a*, which may be regarded as a 1-dimensional embedding of the punctuation token *a*). We figured that the punctuation vocabulary *Σ* was small enough (Table 2) that we could manage without the additional complexity of embeddings or other featurization, although this does presumably hurt our generalization to rare bigrams.

We could have handled all languages uniformly by making ≥ 2 passes of the sliding window (via a composition of ≥ 2 PFSTs), with at least one pass in each direction.

In retrospect, there was no good reason to square the $\mathbb{E}_{T'}[c(T')]$ term. However, when we started redoing the experiments, we found the results essentially unchanged.

Recall that the NoisyChannel model family (§2.2) requires the surface “ before special to appear underlyingly, and also requires the surface *ε* after special to be empty underlyingly. These hard constraints clash with the $\xi = \infty$ hard constraint that the punctuation around special must be balanced. The surface ” after place causes a similar problem: no edge can generate the matching underlying “.

We do *O*(*n*) multiplications of *N* × *N* matrices where *N* = *O*(# of punc types ⋅ max # of punc tokens per slot).

We still construct the real matrix **M**_{i}(*l*) by *ordinary* matrix multiplication before pairing its elements with strings. This involves summation of real numbers: each element of the resulting real matrix is a marginal probability, which sums over possible PFST paths (edit sequences) that could map the underlying puncteme *l* to a certain substring of the surface slot *x*_{i}. Similarly for **M**_{k}(*r*).

For 𝖾𝗇 and 𝖾𝗇_𝖾𝗌𝗅, “ and ” are distinguished by language-specific part-of-speech tags. For the other 4 languages, we identify two " dependents of the same head word, replacing the left one with “ and the right one with ”.

For symmetry, we should also have added a final mark.

These methods learn word embeddings that optimize conditional log-likelihood on the punctuation restoration training data. They might do better if these embeddings were shared with other tasks, as multi-task learning might lead them to discover syntactic categories of words.

American style places commas and periods inside the quotation marks, even if they are not logically in the quote. British style (more sensibly) places unquoted periods and commas in their logical place, sometimes outside the quotation marks if they are not part of the quote.

For rarer underlying pairs *ab*, the estimated distributions sometimes have higher entropy due to undertraining.

[𝖾𝗇] Earlier, Kerry said,“Just because you get an honorable discharge does not, in fact, answer that question.”

[𝖾𝗇] Sections 1, 2, 5, 6, 7, and 8 will survive any termination of this License.

To depunctuate a treebank sentence, we remove all tokens with POS-tag PUNCT or dependency relation punct. These are almost always leaves; else we omit the sentence.

Ideally, rather than maximize, one would integrate over possible trees *T*, in practice by sampling many values *T*_{k} from $p_{\text{syn}}(\cdot \mid \bar{u})$ and replacing *S*(*T*) in (10) with $\bigcup_{k} S(T_k)$.

We copied their architecture exactly but re-tuned the hyperparameters on our data. We also tried tripling the amount of training data by adding unannotated sentences (provided along with the original annotated sentences by Ginter et al. (2017)), taking advantage of the fact that the BiLSTM-CRF does not require its training sentences to be annotated with trees. However, this actually hurt performance slightly, perhaps because the additional sentences were out-of-domain. We also tried the BiGRU-with-attention architecture of Tilk and Alumäe (2016), but it was also weaker than the BiLSTM-CRF (just as in Table 3). We omit all these results from Figure 4 to reduce clutter.

So the two approaches to permutation yield different training data, but are compared fairly on the same test data.

## REFERENCES

## Author notes

Equal contribution.
