What Helps Transformers Recognize Conversational Structure? Importance of Context, Punctuation, and Labels in Dialog Act Recognition

Dialog acts can be interpreted as the atomic units of a conversation, more fine-grained than utterances, characterized by a specific communicative function. The ability to structure a conversational transcript as a sequence of dialog acts -- dialog act recognition, including the segmentation -- is critical for understanding dialog. We apply two pre-trained transformer models, XLNet and Longformer, to this task in English and achieve strong results on Switchboard Dialog Act and Meeting Recorder Dialog Act corpora with dialog act segmentation error rates (DSER) of 8.4% and 14.2%. To understand the key factors affecting dialog act recognition, we perform a comparative analysis of models trained under different conditions. We find that the inclusion of a broader conversational context helps disambiguate many dialog act classes, especially those infrequent in the training data. The presence of punctuation in the transcripts has a massive effect on the models' performance, and a detailed analysis reveals specific segmentation patterns observed in its absence. Finally, we find that the label set specificity does not affect dialog act segmentation performance. These findings have significant practical implications for spoken language understanding applications that depend heavily on a good-quality segmentation being available.


Introduction
The human dialog is a never-ending source of diversity, abundant with exceptions and surprising ways to express one's thoughts. As a community, we have spent a massive effort in the past few decades to help the machine achieve even the slightest level of understanding of our means of communication. Remarkably, to some extent, we have succeeded. A consequence of this fact is the widespread presence of so-called voice assistants, i.e., conversational agents of limited capabilities, which have gained much popularity in recent years.
While the main focus of modern dialog research is placed on these human-machine interactions, it is the conversation between humans that poses the greatest challenges to spoken language understanding. Consider the task of intent recognition: in a goal-oriented dialog, where the human expects their machine interlocutor to have only limited understanding capabilities, one can reasonably expect there to be a single, self-contained and straightforward utterance expressing the person's request. Siegert and Krüger (2018) show in a subjective evaluation of Alexa users that they consider such a conversation "more difficult" than talking to a human. With a simpler dialog structure, it is natural to approach intent recognition as a multi-class classification task, by classifying each utterance's underlying intent.
The same task of intent recognition becomes much more complex when the dialog involves two or more humans. Their conversations are riddled with various disfluencies, such as discourse markers, filled pauses, or back-channeling (Charniak and Johnson, 2001). Shalyminov et al. (2018) propose multi-task training for a disfluency detection model capable of spotting hesitations, prepositional phrase restarts, clausal restarts, and corrections. Spontaneous dialogs are also characterized by a much more dynamic structure than written text data. Kempson et al. (2000, 2016) show that dialog may be viewed as a sequence of incremental contributions, called split utterances, rather than complete sentences, and propose the Dynamic Syntax paradigm, claiming that standard syntactic models are insufficient to capture dialog. Another study (Purver et al., 2009) finds that up to 20% of utterances in the British National Corpus (Burnard, 2000) dialogs fit the definition of split utterances, with about 3% of them being cross-speaker utterance completions. Eshghi et al. (2015) propose to view backchannels and other discourse markers as feedback in conversation that is a core component of its semantic structure, rather than a nuisance in downstream processing. This point is further argued by Purver et al. (2018), who propose incremental models for detecting miscommunication phenomena in human-human conversations. Clearly, an attempt to determine a person's intent grows beyond a turn-level classification task in such scenarios.
Figure 1: An illustration of dialog acts in a Switchboard conversation. Note how the speaker turns may consist of multiple dialog acts, indicating a different function for each utterance. Dialog act annotation allows us to segment the conversation into meaningful units that can be used for downstream processing in spoken language understanding (SLU) applications.
Dialog acts are vital to understanding the structure of the dialog. One of their modern definitions states that they are atomic units of conversation, which are more fine-grained than utterances and more specific in their function (Pareti and Lando, 2018). The part of an utterance that forms a dialog act is also known as a functional segment. Recently, the definition, taxonomy, and annotation process of dialog acts have been standardized through an ISO norm (Bunt et al., 2012, 2017, 2020). Earlier studies on this topic typically used custom-tailored dialog act sets. Notably, this category includes the Dialog Act Markup in Several Layers (DAMSL) scheme (Core and Allen, 1997), which was later adopted and modified to annotate the Switchboard corpus (Jurafsky et al., 1997; Stolcke et al., 2000). Interestingly, dialog acts are related to the speech act theory from the philosophy of language, introduced initially by Austin (1962), in the sense that they view utterances as actions performed by the speakers.
Dialog act recognition typically entails two tasks: dialog act segmentation (DAS) and dialog act classification (DAC). In this work, we address both of them jointly and refer to their combination further as dialog act recognition. At the time of the conception of the first widely studied corpus for this task, the Switchboard Dialog Act (SWDA), DAS was considered a problem too difficult to address, and the pioneering works focused solely on the classification of dialog acts given the oracle segmentation (Stolcke et al., 2000). More recent works attempt to retrieve the segmentation through conditional random fields (CRF) or recurrent neural networks (RNN). However, these models still suffer from a significant margin of error, as shown by Zhao and Kawahara (2019) and later in Section 5.1. It is worth noting that in some downstream applications, the availability of high-quality segmentation is valuable regardless of any classification errors: some examples include intent classification (Pareti and Lando, 2018), semantic clustering (Bergstrom and Karahalios, 2009), or temporal sentiment analysis (Clavel and Callejas, 2015), all of which heavily depend on the segmentation.
To the best of our knowledge, the DAS performance of transformer models (Vaswani et al., 2017) has not yet been investigated. Transformers recently demonstrated state-of-the-art performance across a range of natural language processing (NLP) tasks when combined with language model pre-training (Devlin et al., 2019; Liu et al., 2019; Beltagy et al., 2020). A major obstacle in applying transformer models to DAS is their O(n^2) computational complexity w.r.t. the input sequence length, making it infeasible to process conversations longer than a couple of hundred tokens. Thus, there are few applications of transformers to segmentation tasks. For example, Glavas and Somasundaran (2020) employed transformers for topic segmentation, but they assume that the text has already been segmented into sentences and use sentence representations instead of word representations as input to the transformer.
To address the transformers' limitations, we investigate two approaches. In the first one, we use XLNet, a model based on the Transformer-XL architecture, which is capable of processing the input sequence in windows while propagating the activations of the intermediate layers as additional inputs to the following window. In the second approach, we use Longformer (Beltagy et al., 2020), which processes the whole sequence in a single pass, but for each token attends only to a fixed number m of neighboring tokens, reducing the complexity to O(mn), which is linear w.r.t. the input length n.
Furthermore, we ask several questions to understand better the factors affecting dialog act recognition and design the experiments accordingly:
• What is the significance of seeing a larger context in dialog act recognition? Contextual dialog act models have been considered before, but they were either classification models with oracle segmentation or segmentation models that look at a limited number of past turns (see Sections 2.3 and 2.4).
• How strongly does text formatting, i.e., the presence of punctuation and capitalization, affect the segmentation quality? This question is of significant practical importance: speech transcripts are often obtained through an automatic speech recognition (ASR) system, and many of them do not offer enhanced text formatting capabilities.
• How do the size and the specificity of the dialog act label set affect the recognition difficulty? In some applications, the segmentation itself might be more important than having a dialog act label, e.g., when clustering utterances to discover expressions with similar meaning. Would a large, detailed dialog act label set still be beneficial for such scenarios? Are dialog act labels necessary at all, or is it sufficient to know when they begin and end?
2 Related work

2.1 Switchboard dialog act
The most widely studied dialog act dataset is Switchboard (SWDA) (Jurafsky et al., 1997, 1998). It consists of telephone conversations, first manually segmented into turns and utterances, later formally called functional segments (Bunt et al., 2012), i.e., the units of dialog act annotation. Bunt et al. (2012) define them as a minimal stretch of behavior with one or more communicative functions. The total word count is about 1.4M. The conversations have 1454 words on average, and the longest one has 3122 words. The Switchboard annotators originally used the DAMSL labeling scheme (Core and Allen, 1997) with 220 dialog acts and clustered them after annotation into a reduced label set. There seems to be no consensus on the reduced label set size: some of the works using a 42 labels set are Quarteroni et al.

Meeting recorder dialog act
Meeting Recorder Dialog Act (MRDA) (Shriberg et al., 2004) is a corpus of 75 meetings that took place in the International Computer Science Institute (ICSI). The conversations involve more than two speakers and are significantly longer than those in SWDA. The mean word count is about 11k, and the longest dialog has 22.5k words. There are 850k words in total, making MRDA approximately half the size of SWDA. The dialog act labeling scheme is different from that in SWDA: the annotators used a 51 act set that significantly overlaps with SWDA-DAMSL (we refer to that as the full set). These acts were later clustered, with two granularity levels, into a general set of 12 acts and a basic set of 5 acts. The basic set is reduced to the following classes: Statement, Question, Backchannel, Disruption, and Floor-Grabber. We refer the reader to Shriberg et al. (2004) for a detailed comparison of dialog act classes between SWDA and MRDA.

Dialog act classification
There are two main groups of studies. The first assumes that the segmentation is known and considers dialog act recognition as a pure classification task. The original SWDA authors first took such an approach with a hidden Markov model (HMM) (Jurafsky et al., 1998). Others have introduced conditional random fields (CRF) to solve this task (Quarteroni et al., 2011). Some authors found that considering the context explicitly in RNN models helps dialog act classification (Ortega and Vu, 2017; Liu et al., 2017a; Kumar et al., 2018; Raheja and Tetreault, 2019; Dai et al., 2020). It has also been shown that incorporating acoustic/prosodic features helps to some extent (Ortega and Vu, 2018; Si et al., 2020). Colombo et al. (2020) report the best result to date for SWDA classification: an accuracy of 85%, obtained by a sequence-to-sequence (seq2seq) GRU model with guided attention. For MRDA, the best classification accuracy is 92.2%, reported by Li et al. (2019) and achieved with a dual-attention hierarchical BiGRU with a CRF on top. These approaches are not directly comparable with ours, as they assume an oracle segmentation of the transcript.

Dialog act segmentation and recognition
More interesting in the context of our work are the studies that consider dialog act segmentation and recognition. One of the first attempts was made by Ang et al. (2005) with decision trees and HMMs for the MRDA corpus. CRF has been successfully employed in this task (Quarteroni et al., 2011). The closest work to ours is by Zhao and Kawahara (2019), where a bidirectional gated recurrent unit (BiGRU) model is used to segment and classify dialog acts in SWDA jointly. The model is considered as a sequence tagger with an optional CRF layer or in an encoder-decoder setup. It also integrates the dialog act predictions for the ten previous turns using an attention mechanism. Notably, the main differences from our setup are that Zhao and Kawahara (2019):
1. consider prediction for a single turn at a time, whereas our dialog-level contextual models process multiple turns at the same time, which makes it possible to include both past and future context in the prediction;
2. use exclusively lowercase text without punctuation, whereas we study setups both with and without punctuation and truecasing;
3. limit the vocabulary to 10000 words, whereas we use sub-word tokenizers with no such limitation; as a result, our models can leverage another 10000 less frequent words in SWDA that would otherwise have been replaced by an out-of-vocabulary symbol;
4. connect dialog act continuations (the segments labeled in SWDA with a +) to the previous turn when interrupted, e.g., by a backchannel; we view that operation as a work-around that lets their models see the relevant future context, whereas our proposed models require no such pre-processing.
Finally, we provide a more detailed analysis of the effect of context on the recognition outputs; we also investigate the effect of punctuation and label set specificity, which is not discussed in that work.

The effect of context and punctuation
Liu et al. (2017b) process each dialog act segment in parallel streams using a CNN and combine the sequence of sentence representations using an LSTM to exploit the context. The influence of context is explored by Bothe et al. (2018), who use an LSTM on the segment representations. Here, dialog act classification is achieved in two stages: learning segment representations and dialog act classification using an LSTM. The usage of punctuation marks as features and other heuristics, such as the number of words in the segment, n-grams, the dialog act of the next segment, and others, is explored in (Samuel et al., 1998; Verbree et al., 2006). However, the effect of each of these heuristics, especially punctuation marks, is not analyzed. To the best of our knowledge, there are no studies that attempt to understand in depth the role of context, punctuation, or label set specificity in dialog act recognition.

Transformers
The transformer architecture has been shown to produce state-of-the-art results on several NLP tasks (Vaswani et al., 2017; Devlin et al., 2019). It consists of repeated blocks of a self-attention layer and a feed-forward layer. The self-attention layer processes the entire input sequence and learns to attend to the relevant tokens by computing the cross-token similarity in the input sequence. The similarity computation is implemented with a dot product followed by a softmax operation. Each token's representation in the self-attention layer output is passed through a feed-forward layer before the next self-attention layer. However, as the self-attention layer processes all tokens of the input sequence simultaneously, it is invariant to the input sequence's token order. The ordering information is preserved by adding positional embeddings to the input token embeddings. Positional embeddings include one vector per token position and are learned during model training together with other model parameters.
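The similarity computation described above can be sketched in a few lines of plain Python. This is only an illustrative toy, not the implementation used in this work: the function names `softmax` and `self_attention` are hypothetical, the inputs are used directly as queries, keys, and values (no learned projections), and the 1/sqrt(d) scaling follows Vaswani et al. (2017). The last lines illustrate why positional embeddings are needed: reversing the input order merely reverses the outputs.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head self-attention; queries, keys and values are the inputs themselves."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        # Dot-product similarity between this token and every token in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
        weights = softmax(scores)
        # Each output is the attention-weighted average of all token vectors.
        outputs.append([sum(w * value[j] for w, value in zip(weights, tokens)) for j in range(dim)])
    return outputs

# Toy 3-token sequence with 2-dimensional embeddings (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
original = self_attention(tokens)
reversed_out = self_attention(tokens[::-1])
# Without positional information, each token's output does not depend on its position.
```
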
One major limitation of transformer models is their scalability to longer inputs, as the complexity of each self-attention layer is O(n^2), where n is the input sequence length. More recent works address this limitation in several ways: 1) propagation of context between segments of a long sequence, 2) local attention (Ye et al., 2019; Beltagy et al., 2020; Zaheer et al., 2020), 3) sparse attention (Kitaev et al., 2020; Tay et al., 2020; Zaheer et al., 2020), 4) efficient attention operation (Katharopoulos et al., 2020; Shen et al., 2021). In this work, we explore two of these models for dialog act recognition: XLNet, which is based on the propagation of context, and Longformer (Beltagy et al., 2020), which uses local attention.

XLNet
XLNet is a transformer model pre-trained with a permutation language modeling criterion: it is an autoregressive language model that maximizes the expected likelihood over all permutations of the input sequence factorization order. It consists of 12 (base) or 24 (large) self-attention layers. It is based on Transformer-XL, which enables it to process text sequences in windows while propagating the context in the forward direction. We leverage this property to process conversational transcripts efficiently. It is interesting to note that this model, unlike BERT, uses relative positional encodings that do not need to be learned, making it possible to process sequences of arbitrary lengths. Even then, the quadratic computational complexity renders such processing infeasible for long inputs, making windowed processing a more practical choice.

Longformer
Longformer (Beltagy et al., 2020) is based on a modification of the self-attention layer that reduces the computational complexity by limiting the context available to each input token. It splits the attention into two components: local and global. The local component is a sliding window of fixed size for each self-attention layer, dramatically reducing the computational complexity for long sequences. The global component allows select tokens to attend to the entire sequence. We do not use it in this work: unlike in text classification, where [CLS] uses global attention, or question answering, where the question tokens use global attention (Beltagy et al., 2020), there are no clear candidates for it in dialog act recognition. Following Beltagy et al. (2020), we use RoBERTa (Liu et al., 2019), i.e., BERT with carefully tuned hyperparameters, as the base model to avoid the costly pre-training process. This model's limitation is that it cannot process token sequences longer than those seen during training (4096 tokens for the pre-trained model open-sourced by Beltagy et al. (2020)). We investigate Longformer because we consider its sliding window attention mechanism a natural extension of XLNet's window-processing mechanism.
4 Experimental setup

Model training
For both transformer models, we use pre-trained sub-word tokenizers and weights, as provided by HuggingFace: allenai/longformer-base-4096 for Longformer and xlnet-base-cased for XLNet. These are the base variants with 12 self-attention layers. To adapt the models to the DAS task, we put a token classification layer on top of the transformer and train it with a per-token cross-entropy loss. We fine-tune each model on the training portion of the dataset: 1003 calls for SWDA and 51 meetings for MRDA. We use the validation set (112 SWDA calls; 12 MRDA meetings) to select the best model for each variant and the test set (19 SWDA calls; 12 MRDA meetings) for the final evaluation.
The baseline BiGRU model is trained in the same setup as described in Zhao and Kawahara (2019). For both XLNet and Longformer, we compare their performance to BiGRU by training them as turn-level models that see only a single speaker turn without additional context. In a separate experiment, to measure the effect of providing the surrounding dialog context, we train them as broad-context models processing either full transcripts (Longformer) or chunks (XLNet). All reported metrics are the mean values from three runs with different random seeds (42, 43, 44). We train each model on a single GeForce GTX 1080 Ti GPU, which allowed us to construct batches of 6 chunks with 512 tokens each for XLNet training. The same setup might not be optimal for Longformer, as only the first 512 positional embeddings would have been fine-tuned. Therefore, we train it with 4096-token windows and an effective batch size of 6, using gradient accumulation. All models are trained for ten epochs with an Adam optimizer, a learning rate of 5e-5, and a learning rate schedule that decreases linearly towards 0. We evaluate the model on the validation set after each epoch and select the model that achieved the best macro F1 score to report the test set results.
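As a rough illustration of the learning rate schedule described above, the sketch below computes a linearly decaying learning rate in plain Python. The function name `linear_decay_lr` is hypothetical, and the absence of a warm-up phase is an assumption, since the text only states that the rate decreases linearly towards 0 from 5e-5.

```python
def linear_decay_lr(step, total_steps, base_lr=5e-5):
    """Learning rate at a given optimizer step under linear decay to 0.

    Assumes no warm-up phase; `total_steps` is the number of optimizer
    updates over the whole training run (hypothetical parameter name).
    """
    remaining = max(0.0, 1.0 - step / total_steps)
    return base_lr * remaining

# With gradient accumulation, one optimizer step covers several forward
# passes, e.g. 6 accumulated windows for an effective batch size of 6.
schedule = [linear_decay_lr(step, total_steps=100) for step in (0, 50, 100)]
```
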

Data preparation
To transform the SWDA and MRDA conversational transcripts into model inputs, we perform several steps. First, we remove all annotator comments from the SWDA text. We evaluate each model in two variants, with and without punctuation and truecasing, to investigate how strongly they affect the performance. When punctuation and truecasing are used, they are always the ground truth. To create a single sequence out of speaker turns, we concatenate them with a unique TURN token in between that does not participate in loss computation but explicitly indicates that the speaker has changed. Following Zhao and Kawahara (2019), we encode the dialog act labels using an E joint coding scheme. In the E scheme, each word comprising a dialog act is assigned a label: the E label indicates the end of the dialog act, and the I label indicates a token other than an ending. The joint coding also specializes the E label for each dialog act class in the label set, which makes it possible to perform dialog act recognition. The I label is shared between all dialog act classes. BERT-like models typically use sub-word tokenization: byte-pair encoding (Gage, 1994; Sennrich et al., 2016) for Longformer and SentencePiece (Kudo and Richardson, 2018) for XLNet. When a word is split into multiple tokens, we assign the dialog act label only to the first token and discard the following tokens' predictions (i.e., they do not participate in loss computation and are ignored when reading predictions during inference).
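The E joint coding described above can be illustrated with a short sketch. The label strings (`"I"`, `"E_<act>"`), the function names, and the toy turn are assumptions made for illustration; they are not taken from the original data preparation code.

```python
def encode_segments(segments):
    """Turn a list of (words, dialog_act) segments into per-word (word, label) pairs.

    The last word of each segment gets a class-specific end label "E_<act>";
    every other word gets the shared "I" label.
    """
    labeled = []
    for words, act in segments:
        for i, word in enumerate(words):
            is_last = i == len(words) - 1
            labeled.append((word, f"E_{act}" if is_last else "I"))
    return labeled

def decode_segments(labeled):
    """Recover (words, dialog_act) segments from per-word labels."""
    segments, current = [], []
    for word, label in labeled:
        current.append(word)
        if label != "I":
            segments.append((current, label[len("E_"):]))
            current = []
    if current:
        # Trailing words with no end label form an unfinished segment.
        segments.append((current, None))
    return segments

# Hypothetical two-act turn: a backchannel followed by a statement.
turn = [(["yeah"], "Backchannel"), (["i", "think", "so"], "Statement")]
encoded = encode_segments(turn)
decoded = decode_segments(encoded)
```

Because segment boundaries and classes are carried by the same label, a single token classifier performs segmentation and classification jointly.
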
For SWDA, we are using the 42 dialog act labels (as Abandoned-or-Turn-Exit act is merged with Uninterpretable) encoded into 43 labels in total, including the I label. We experiment with all the label sets available in MRDA -basic with 5 labels, general with 12 labels, and full with 51 labels (6, 13 and 52 respectively when counting the I label). Unless otherwise specified, we always use the 5 labels set for MRDA and 42 labels for SWDA.
Some SWDA dialog acts are extended across turns with a + label, e.g., when somebody interrupted with a backchannel. We respect that by assigning an I label to the last token in the interrupted turn, thus creating a multi-turn functional segment.
For inference, the calls are processed in sliding windows. With XLNet, we use a window size of 512 tokens without overlap. We compare the predictions with and without the context propagation across windows to understand its importance. With Longformer, we do not need to explicitly construct the windows, as each token's attention is limited to a local context of 256 neighboring tokens on each side.
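The two traversal strategies can be contrasted with a small sketch. The helper names `split_into_windows` and `local_attention_span` are hypothetical, and the code only models which positions are visible to a token; it does not reproduce the actual XLNet or Longformer implementations.

```python
def split_into_windows(token_ids, window_size=512):
    """Non-overlapping windows, as in the XLNet inference setup (step = window size)."""
    return [token_ids[i:i + window_size] for i in range(0, len(token_ids), window_size)]

def local_attention_span(position, seq_len, one_sided=256):
    """Positions visible to a token under sliding-window attention (Longformer-style)."""
    start = max(0, position - one_sided)
    end = min(seq_len, position + one_sided + 1)
    return range(start, end)

token_ids = list(range(1200))  # stand-in for a tokenized transcript
windows = split_into_windows(token_ids)
# Token 512 opens the second window: without context propagation it cannot
# see token 511, whereas sliding-window attention still covers it.
same_window = 511 in windows[1]
in_local_span = 511 in local_attention_span(512, len(token_ids))
```
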

Metrics
To measure the model performance, we use standard micro- and macro-averaged F1 metrics, as well as metrics explicitly evaluating the segmentation quality (Granell et al., 2010; Zhao and Kawahara, 2019):
• Dialog Act Segmentation Error Rate (DSER) measures the percentage of reference segments that were not recognized with perfect boundaries, disregarding the dialog act label.
• Segmentation Word Error Rate (SegWER) is additionally weighted by the number of words in a given segment.
• Dialog Act Error Rate (DER) is computed similarly to DSER but also considers whether the dialog act label is correct.
• Joint Word Error Rate (JointWER) is a word count weighted version of DER.
Note that these metrics are strict: if a 3-word turn with a single Statement act is recognized as an Acknowledgment on the first word and Statement on the next two, the micro F1 score is 66.6%, the macro F1 score is 55.5%, but the error rate metrics are all at 100%.
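The 3-word example above can be checked with a small sketch. The function names and label strings are illustrative assumptions, and macro F1 is averaged over the classes that occur in either the reference or the prediction; this is not the evaluation code used in the experiments. The word-weighted variants (SegWER, JointWER) are omitted.

```python
def labels_to_segments(labels):
    """Convert per-word E-scheme labels into (start, end, act) segments."""
    segments, start = [], 0
    for i, label in enumerate(labels):
        if label != "I":
            segments.append((start, i, label))
            start = i + 1
    return segments

def dser(ref, hyp):
    """Percentage of reference segments whose exact boundaries are missing in hyp."""
    hyp_bounds = {(s, e) for s, e, _ in hyp}
    return 100.0 * sum((s, e) not in hyp_bounds for s, e, _ in ref) / len(ref)

def der(ref, hyp):
    """Like DSER, but the dialog act label must match as well."""
    hyp_set = set(hyp)
    return 100.0 * sum(seg not in hyp_set for seg in ref) / len(ref)

def f1_scores(ref_labels, hyp_labels):
    """Word-level micro and macro F1 over the labels seen in ref or hyp."""
    pairs = list(zip(ref_labels, hyp_labels))
    per_class = []
    for c in sorted(set(ref_labels) | set(hyp_labels)):
        tp = sum(r == c and h == c for r, h in pairs)
        fp = sum(r != c and h == c for r, h in pairs)
        fn = sum(r == c and h != c for r, h in pairs)
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    # For single-label classification, micro F1 equals accuracy.
    micro = sum(r == h for r, h in pairs) / len(pairs)
    macro = sum(per_class) / len(per_class)
    return micro, macro

ref_labels = ["I", "I", "E_Statement"]
hyp_labels = ["E_Acknowledgment", "I", "E_Statement"]
ref, hyp = labels_to_segments(ref_labels), labels_to_segments(hyp_labels)
micro, macro = f1_scores(ref_labels, hyp_labels)
seg_error, act_error = dser(ref, hyp), der(ref, hyp)
```

Under these assumptions, micro F1 is 2/3 and macro F1 is 5/9, while both error rates are 100% because the single reference segment is never reproduced with its exact boundaries.
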
For reference, when reading the dialog act metrics, the SWDA and MRDA test sets have respectively 4500 and 16702 functional segments. For reading micro and macro F1 scores, SWDA and MRDA test sets have 29.8K and 100.6K words.

Results
In this section, we present the results of our experimental evaluation. Each result table is first split into lower and nolower sections, which respectively stand for a lowercase transcript with no punctuation, and an original case transcript with punctuation symbols. For both scenarios, we always show the results on both MRDA and SWDA datasets.

Single turn context models
We start our experiments by investigating how much improvement we can achieve by replacing a simple but established BiGRU baseline model with one of the transformer models. The baseline is trained in the same setup as in Zhao and Kawahara (2019). To make the comparison fair, we train XLNet and Longformer on single-turn inputs so that the model does not see any dialog context. The same is true during inference. The results are shown in Table 1.
Both transformer models offer substantial improvements over the BiGRU baseline in all scenarios. In most evaluations, XLNet achieves the best results, outperforming Longformer by a margin that is small compared to the improvement over BiGRU. Since these experiments do not test the model's ability to handle long-range context, these results suggest that XLNet's pre-training procedure is more suitable for dialog act recognition than that of Longformer.

Broad context models
In the second experiment, we investigate how long-document transformers perform in dialog act recognition. As a baseline (Turns), we re-use the best model from Section 5.1, XLNet, processing the dialog transcript on a turn-by-turn basis without additional context. The other proposed models process the whole transcript in sliding windows. XLNet uses a window of 512 tokens with a step size of 512 tokens. This window traversal strategy is not optimal: the tokens on the window boundaries cannot attend to other tokens that are close by but belong to another window. XLNet+prop partially addresses this issue by propagating the intermediate activations between the windows. Longformer uses a window of 512 tokens with a step size of 1 token, which is possible thanks to its special local attention pattern. Therefore, it fully avoids XLNet's traversal strategy issue. The results are in Table 2.
All broad context models outperform the turn-level baseline across all metrics, except the turn-level SWDA nolower baseline in the JointWER metric. XLNet+prop emerges as the best model in all configurations with minor gains over XLNet. Similarly to Section 5.1, we observe consistent improvements in all setups when using XLNet instead of Longformer. However, we cannot conclude that XLNet uses the context more effectively, as its performance on context-less turn prediction was also better than that of Longformer. Besides the attention patterns, there are other differences between the models, such as the pre-training conditions and positional encoding schemes, which could also explain the observed results. However, it is an indication that limiting Longformer's number of positional embeddings to 4096 is not a limiting factor in its performance.
We compare the runtime of XLNet and Longformer models. Average inference with a 512-token window on SWDA transcripts with an eight-core Intel Core i9-9980HK CPU takes 2.8 seconds for Longformer and 14.7 seconds for XLNet, making Longformer about five times faster when deployed on a CPU. Figure 2 shows the time it takes for dialog act prediction on a 1750-word call (sw2229) from SWDA: for smaller windows of 32 and 64, the models take similar time to run, but as the window size increases, Longformer becomes quicker than XLNet. To summarize, Longformer might be more suitable for practical applications, even if it achieves slightly worse recognition results.
An analysis of confusion patterns in the most performant model (nolower XLNet+prop) does not reveal any new insights in SWDA compared to past works: the most confused label pair is Statement-opinion and Statement-non-opinion. For the same model in MRDA, we observe the Question label has the highest F-score of 98.32%, followed by 94.38% for Statements. Backchannels are the most confused label, with 17% of them being classified as Statements, and 19% of predicted Backchannels being in fact Statements. Also, a significant portion of Disruptions (25%) and Floor-grabbers (28%) are confused with the I label and respectively 20% and 14% of them are predicted as an I label. This indicates that these dialog acts are the most difficult to segment correctly, which might be due to only 66.5% aver- [...]
Table 3: [...] (11), basic (5), and pure segmentation (1); in SWDA basic (42) and pure segmentation (1). DER and JointWER are not defined for pure segmentation. All experiments are performed using full dialog context, with identical hyperparameters, except for the output layer size. The asterisk denotes the label sets typically used in other works.
Figure 2: [...] The left-side plot shows the mean time it takes to predict a single window, and the right-side plot shows the time needed to process the full dialog. Window sizes larger than 512 imply sub-windowing for Longformer, which in this experiment has learned only 512 positional embeddings.

Discussion
This section presents a detailed analysis of various factors affecting dialog act segmentation and recognition performance. In particular, we look into the effects of label set specificity, punctuation, and context.

The effect of label set specificity
Since MRDA provides different label set sizes, it is tempting to see how that affects the recognition performance. Furthermore, we investigate a special case where we perform pure segmentation, i.e., the dialog act labels are stripped, and there remains a single generic E token at the end of each segment. For SWDA, we compare the 42 label set performance with pure segmentation. All experiments are performed using the XLNet+prop model, which was the best model in Section 5.2. The results are shown in Table 3.
We do not observe a strong effect of the label set size on segmentation performance; the pure segmentation model is practically on par with the dialog act recognition model. This is indicated by little change in DSER and SegWER metrics across the label sets in each experimental scenario. On the other hand, the label set size has a major effect on the classification performance, reflected in F1, DER, and JointWER. We offer two explanations for that. Firstly, the larger label sets have more imbalanced classes, e.g., in the 51 labels set, 43% of acts are statements, and the 18th most frequent class is already below 1% of all acts. Secondly, we suspect that the inter-annotator agreement is worse for the large label set, but the MRDA authors only reported it for the five label set (80% agreement).

The effect of dialog context
To understand how the dialog context helps improve the models, we analyze the predictions of turn-level XLNet and dialog-level XLNet+prop. In particular, we find the subset of turns in which the turn-level model made either segmentation or classification errors, but the dialog-level model recognized everything correctly (427 turns, which is 16.3% of turns in the SWDA test set). This subset contains 752 dialog acts and suffers mostly from misclassification errors: 19.8% of these dialog acts are mis-segmented with an equal share of over- or under-segmentation, but as many as 75.8% of them have been misclassified.
We take a closer look at the differences between the two models' errors by considering the whole test set again and investigating which dialog acts benefitted the most from dialog-level context. To find them, we first have to perform segment-level alignment (since segment boundaries could be misrecognized) using the Levenshtein algorithm. For this purpose, we assume that the reference and predicted segments are equal when they start and end at the same words for pure segmentation, and we additionally check that their dialog act labels are the same for recognition.
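The alignment step can be sketched as follows. This is a minimal illustration under stated assumptions, not our released implementation: segments are represented as hypothetical `(first_word, last_word, label)` tuples, all edit operations have unit cost, and the names `segments_equal`, `align`, and `count_correct` are introduced here for illustration only.

```python
from typing import List, Optional, Tuple

# Assumed representation: (index of first word, index of last word, label).
Segment = Tuple[int, int, str]
Pair = Tuple[Optional[Segment], Optional[Segment]]


def segments_equal(ref: Segment, hyp: Segment, use_labels: bool) -> bool:
    """Same boundaries; for recognition, the label must match as well."""
    same_span = ref[0] == hyp[0] and ref[1] == hyp[1]
    return same_span and (not use_labels or ref[2] == hyp[2])


def align(refs: List[Segment], hyps: List[Segment],
          use_labels: bool = True) -> List[Pair]:
    """Levenshtein alignment of reference and predicted segment sequences."""
    n, m = len(refs), len(hyps)
    # cost[i][j] is the edit distance between refs[:i] and hyps[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = segments_equal(refs[i - 1], hyps[j - 1], use_labels)
            cost[i][j] = min(
                cost[i - 1][j - 1] + (0 if same else 1),  # match/substitution
                cost[i - 1][j] + 1,                       # deletion
                cost[i][j - 1] + 1,                       # insertion
            )
    # Trace back from the bottom-right corner to recover the aligned pairs.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            same = segments_equal(refs[i - 1], hyps[j - 1], use_labels)
            if cost[i][j] == cost[i - 1][j - 1] + (0 if same else 1):
                pairs.append((refs[i - 1], hyps[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((refs[i - 1], None))  # reference segment unmatched
            i -= 1
        else:
            pairs.append((None, hyps[j - 1]))  # spurious predicted segment
            j -= 1
    pairs.reverse()
    return pairs


def count_correct(pairs: List[Pair], use_labels: bool) -> int:
    """Number of aligned pairs that are equal under the chosen criterion."""
    return sum(
        1 for ref, hyp in pairs
        if ref is not None and hyp is not None
        and segments_equal(ref, hyp, use_labels)
    )


refs = [(0, 1, "Statement-non-opinion"), (2, 5, "Wh-Question"),
        (6, 6, "Response-Acknowledgment")]
hyps = [(0, 1, "Statement-non-opinion"), (2, 5, "Yes-No-Question"),
        (6, 6, "Response-Acknowledgment")]
recognized = count_correct(align(refs, hyps, use_labels=True), use_labels=True)
segmented = count_correct(align(refs, hyps, use_labels=False), use_labels=False)
```

In this toy example all three segments have correct boundaries, but the second one is mislabeled, so it counts as correct for pure segmentation and as an error for recognition.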
Surprisingly, we find that the strongest turn-level model (XLNet) did not correctly recognize a single instance of more than half of the label set (24 dialog act classes, many of which are infrequent), whereas this number drops significantly for the dialog-level model (4 classes: Declarative-Wh-Question, Dispreferred-answers, Self-talk, Hold-before-answer-agreement). The top 10 dialog acts with improved recognition performance that occurred at least 10 times in the SWDA test set are shown in Table 4. The turn-level model lacked the necessary context to correctly classify Yes-answers, Agree-Accept, and Response-Acknowledgment, mistaking them mostly for Acknowledge-Backchannel.
The turn-level model also frequently hypothesized Yes-No-Question in place of Wh-Question. Other highly contextual dialog acts such as Repeat-phrase, Rhetorical-Questions, Backchannel-in-question-form, or Summarize-reformulate also largely improved.
In terms of segmentation performance differences, the improvements with dialog context are consistent across various kinds of dialog acts: both short (Response-Acknowledgment, No-answers) and long (Statement-non-opinion, Action-directive); questions (Rhetorical-Questions, Wh-Question, Open-Question) and statements.
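The count of dialog act classes that a model never recognizes correctly, discussed above, can be obtained from the aligned segments with a few lines of code. The sketch below assumes that each reference dialog act has already been reduced to a hypothetical `(label, correct)` pair; the function name `never_recognized` is illustrative, and only classes that occur in the evaluated data are considered.

```python
from typing import Iterable, Set, Tuple


def never_recognized(outcomes: Iterable[Tuple[str, bool]]) -> Set[str]:
    """Return labels that occur in the references but are never correct.

    Each outcome is (reference label, whether the aligned prediction
    matched the reference in both boundaries and label).
    """
    seen: Set[str] = set()
    recognized: Set[str] = set()
    for label, correct in outcomes:
        seen.add(label)
        if correct:
            recognized.add(label)
    return seen - recognized


# Toy outcomes, not taken from the SWDA test set.
outcomes = [
    ("Wh-Question", False),
    ("Wh-Question", True),
    ("Self-talk", False),
    ("Statement-non-opinion", True),
]
missed = never_recognized(outcomes)
```

Comparing the sizes of the sets returned for two models gives the kind of contrast reported above (24 classes versus 4).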

The effect of punctuation - MRDA
We have previously observed from Table 3 that removing the capitalization and punctuation has a significant effect on dialog act recognition. This suggests a strong correlation between punctuation and dialog acts. For example, a Question dialog act segment might often end with a question mark that could serve as a cue for the model. In this subsection, we show the correlations between dialog acts and punctuation for the MRDA and SWDA datasets. Table 5 presents dialog act vs. punctuation statistics for the MRDA dataset with 5 labels. Each cell contains the frequency of a dialog act and a punctuation symbol occurring together, with the percentage of our model's errors in parentheses.
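A table of this kind can be computed with a short script. The following is a minimal sketch under stated assumptions: each segment is given as a hypothetical `(dialog act, text, is_error)` triple, only the symbol ending the segment is considered, and the set of punctuation symbols is illustrative rather than the exact inventory found in MRDA.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Assumed punctuation inventory; the actual transcripts may use other symbols.
PUNCTUATION = {".", "?", "!", ","}


def final_punctuation(text: str) -> str:
    """Return the punctuation symbol ending the segment, or 'none'."""
    stripped = text.rstrip()
    if stripped and stripped[-1] in PUNCTUATION:
        return stripped[-1]
    return "none"


def punctuation_table(
    segments: Iterable[Tuple[str, str, bool]]
) -> Dict[Tuple[str, str], Tuple[int, float]]:
    """Map (dialog act, final punctuation) to (count, error percentage)."""
    counts: Dict[Tuple[str, str], int] = defaultdict(int)
    errors: Dict[Tuple[str, str], int] = defaultdict(int)
    for act, text, is_error in segments:
        key = (act, final_punctuation(text))
        counts[key] += 1
        errors[key] += int(is_error)
    return {
        key: (counts[key], 100.0 * errors[key] / counts[key])
        for key in counts
    }


# Toy data, not taken from MRDA.
segments = [
    ("Question", "are you sure ?", False),
    ("Question", "is it ?", True),
    ("Statement", "i think so .", False),
    ("Floor-grabber", "well", True),
]
table = punctuation_table(segments)
```

Each entry of `table` corresponds to one cell: the co-occurrence count and the share of erroneous predictions among the segments in that cell.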
We can observe that the frequency of various punctuation symbols is skewed for each dialog act. For example, segments with Statement and Backchannel dialog act labels most often contain a full stop, while those with the Question dialog act label contain a question mark. Similarly, segments labeled Floor-grabber and Disruption contain no punctuation. Given that correlations between dialog acts and punctuation exist, we expect the models to leverage punctuation as a cue for prediction. Fewer errors (in bold) when punctuation is highly correlated with dialog acts confirm our hypothesis. For example, the Question dialog act has a minimal percentage of errors when a question mark is present in the input segment. Upon further investigation, we found that the ending boundary is consistently recognized correctly when a question mark exists, and any errors that occur are at the segment's beginning. Also, the high error percentages for the Disruption and Floor-grabber dialog acts could be explained by their similar distributions of ending punctuation.

The effect of punctuation - SWDA
Given the large label set size of SWDA, we have no straightforward means of visualizing the correlation of punctuation and dialog acts. In order to understand the relationship between punctuation and dialog acts in SWDA, we show the top 10 most affected dialog acts in segmentation and recognition in Table 6. We observe that punctuation is key in recognizing discourse markers such as incomplete utterances, restarts, or repairs that are often labeled as Uninterpretable. Without punctuation, these discourse markers are frequently merged into a neighboring dialog act by the model. It also partially explains the improvements in segmentation of Statements and some less frequent acts such as Hedge, since they are often found next to Uninterpretable (see Figure 3). In many cases, the lack of commas takes away from the model a cue to insert a dialog act boundary. Examples are shown in Figure 4. We hypothesize that prosody or other cues found in the acoustic signal could mitigate that effect, given the usefulness of such features in dialog act classification works (Ortega and Vu, 2018; Si et al., 2020).
Another way to look at the differences in the segmentation structure is to compare the distributions of punctuation symbols found in the middle of the segments (i.e., the punctuation symbols other than the ones ending the previous and the current dialog act). We present them in Table 7. We see that the nolower model uses punctuation as a cue for determining segment boundaries and retains a distribution very similar to that of the ground truth segmentation. On the other hand, the lower model, which cannot see the punctuation, tends to under-segment the transcripts. This is consistent with our previous analyses.
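The mid-segment punctuation statistic can be sketched as follows. The example assumes that each segment is a list of tokens from the original (punctuated) transcript with punctuation symbols as separate tokens; the punctuation inventory and the function name `mid_segment_punctuation` are illustrative assumptions.

```python
from collections import Counter
from typing import List

# Assumed punctuation inventory, for illustration only.
PUNCTUATION = {".", ",", "?", "!"}


def mid_segment_punctuation(segments: List[List[str]]) -> Counter:
    """Count punctuation tokens that are not the last token of a segment."""
    counts: Counter = Counter()
    for tokens in segments:
        # The final token is skipped: punctuation there ends the dialog act.
        for token in tokens[:-1]:
            if token in PUNCTUATION:
                counts[token] += 1
    return counts


# Toy example: the same words under two different segmentations.
reference = [["uh", ","], ["i", "think", "so", "."]]
under_segmented = [["uh", ",", "i", "think", "so", "."]]

reference_counts = mid_segment_punctuation(reference)
predicted_counts = mid_segment_punctuation(under_segmented)
```

In the toy example the reference segmentation has no punctuation in the middle of its segments, whereas the under-segmented variant absorbs a comma into a longer segment. A model that under-segments therefore shifts the distribution toward more mid-segment punctuation.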

Conclusions
We investigated how two transformer models capable of dealing with long sequences, XLNet and Longformer, can be applied to dialog act recognition. We used the well-studied SWDA and MRDA corpora and compared the performance with an established BiGRU baseline. First, we showed that the pre-trained transformers offer a substantial improvement over BiGRU when processing individual speaker turns, without any additional context. Then, we proposed adapting the transformers to consider a broader dialog context through turn concatenation with the TURN token, the use of joint coding, and local attention patterns or windowed processing. With this improvement, we achieved strong segmentation results on SWDA and MRDA dialog act recognition with DSER of 8.4% and 14.2% on the original transcripts and competitive results on lowercase transcripts with no punctuation (17.5% and 32.9%). We found that XLNet was able to get the most out of the additional dialog context. We observed that the additional context is the most beneficial for segmentation while also improving the classification performance. On a practical note, Longformer allowed for approximately five times quicker inference on a modern CPU.
Across all of our experiments, it was evident that punctuation and the original letter case were crucial for both segmentation and classification performance. No other factor influences the results as much: the best lowercase-transcript model (broad-context XLNet+prop) still lags behind the simplest unmodified-transcript model (turn-context BiGRU). We analyzed the effect of punctuation and found that it is often correlated with some dialog act classes. The model leverages punctuation as a cue, especially to insert segment boundaries, but to a lesser extent also to classify dialog acts (e.g., question marks in questions).
By considering the different dialog act label sets available in MRDA and a pure segmentation task, we found that XLNet's segmentation performance does not depend on the dialog act labels, which we further confirmed with segmentation experiments on SWDA. Regardless of the label set size (or whether the task is pure segmentation), the model performs just as well.
Finally, we found that the addition of broader context is beneficial for the model to learn rare dialog act classes; without it, more than 50% of dialog act classes were never correctly recognized even once in SWDA. With the inclusion of context, that number decreased to less than 10%.
Our findings have significant practical implications for applications that depend on text segmentation, such as the automatic discovery of intents and processes in a given domain or building graphs describing conversational flow from unstructured transcripts. We have shown that the dialog act labels do not have to be specific in order to retrieve good segmentation automatically. This can significantly ease the annotation effort, removing the need for the annotators to memorize large label sets. Furthermore, we show that the current pre-trained transformer models suffer from limitations when punctuation is not available. They tend to under-segment the text, often merging disfluencies with neighboring dialog acts. While these phenomena would likely affect, e.g., systems trying to measure the semantic similarity of two segments, we expect that even the segmentation predicted on lowercase text would be useful in practical applications. It would be interesting to see whether automatically retrieved punctuation can close the gap between manually annotated punctuation and no punctuation; we consider this a promising direction for future work.
To foster further research in this direction, we make our code available under the Apache 2.0 license 5.