As more users across the world interact with dialog agents in their daily lives, there is a need for better speech understanding, which calls for renewed attention to the dynamics between research in automatic speech recognition (ASR) and natural language understanding (NLU). We briefly review these research areas and lay out the current relationship between them. In light of the observations we make in this article, we argue that (1) NLU should be cognizant of the presence of ASR models being used upstream in a dialog system’s pipeline, (2) ASR should be able to learn from errors found in NLU, (3) there is a need for end-to-end data sets that provide semantic annotations on spoken input, and (4) there should be stronger collaboration between the ASR and NLU research communities.
Every day, more and more users are communicating with the conversational dialog systems around them, such as Apple Siri, Amazon Alexa, and Google Assistant. As of 2019, 31% of broadband households in the United States have a digital assistant.1 Henceforth, we refer to these systems as dialog agents or simply agents. A majority of queries issued to these dialog agents are in the form of speech, because users talk to these agents directly, hands-free.
This is in contrast to a few years ago, when most of the traffic to search engines like Google Search, Yahoo!, or Microsoft Bing was in the form of text queries. The natural language understanding (NLU) models that underlie these search engines were tuned to handle textual queries typed by users. However, with the query stream changing from text to speech, these NLU models also need to adapt in order to better understand the user.
This is an opportune moment to bring our attention to the current state of automatic speech recognition (ASR) and NLU, and the interface between them. Traditionally, the ASR and NLU research communities have operated independently, with little cross-pollination. Although there is a long history of efforts to get ASR and NLU researchers to collaborate, for example, through conferences like HLT and DARPA programs (Liu et al. 2006; Ostendorf et al. 2008), the two communities are diverging again. This is reflected in their disjoint sets of conference publication venues: ICASSP, ASRU/SLT, and Interspeech are the major conferences for speech processing, whereas ACL, NAACL, and EMNLP are the major venues for NLU. Figure 1 shows the total number of papers submitted to the ACL conference, and to its speech processing track, from 2018–2020.2 Speech-related submissions constituted only 54 papers (3.3%) in 2018, 80 (2.7%) in 2019, and 62 (1.8%) in 2020, showing the limited interaction between these fields.
In this article, we analyze the current state of ASR, NLU, and the relationship between these two large research areas. We classify the different paradigms in which ASR and NLU operate currently and present some ideas on how these two fields can benefit from transfer of signals across their boundaries. We argue that a closer collaboration and re-imagining of the boundary between speech and language processing is critical for the development of next generation dialog agents, and for the advancement of research in these areas.
Our call is especially aimed at the computational linguistics community: to consider peculiarities of spoken language, such as disfluencies and prosody, that may carry additional information for NLU, and to treat errors associated with speech recognition as a core part of the language understanding problem. This change of perspective can lead to the creation of data sets that span the ASR/NLU boundary, which in turn will bring NLU systems closer to real-world settings and increase collaboration between industry and academia.
2. Changing Nature of Queries
As users move from typing their queries to conversing with their agent through dialog, a few subtle phenomena differ in the nature of these queries. We briefly discuss these in the subsections below.
2.1 Structure of the Query
User-typed queries are not always well-formed, nor do they always follow the syntactic and semantic rules of a language (Bergsma and Wang 2007; Barr, Jones, and Regelson 2008; Manshadi and Li 2009; Mishra et al. 2011). This is not surprising: users often want to invest as little effort as possible in typing a query. They provide the minimum amount of information, in the form of words that they think will return the desired result; hence, typed search queries are mostly a bag of keywords (Baeza-Yates, Calderón-Benavides, and González-Caro 2006; Zenz et al. 2009).
On the other hand, spoken utterances addressed to dialog agents are closer to natural language: they contain well-formed sequences of words that form grammatical sentences (though spoken language can also be ungrammatical, cf. §2.2), and they are more complex and interactive (Trippas et al. 2017, 2018). For example, sometimes a user utterance might need to be segmented into multiple parts:
Agent: “What movie genre do you like?”
User: “I don’t know <pause> anime”
Table 1
Examples of typed queries and their spoken equivalents.

Typed                  Spoken
barack obama age       what is the age of barack obama
boston denver flight   book a flight from london to umm no from boston to denver
scooby doo breed       tell me what’s the breed of scooby doo
Another common phenomenon in spoken utterances is disfluency. When users stammer, repeat themselves, correct themselves, or change their mind mid-utterance, they introduce a disfluency into the utterance (Shriberg 1994). For example, as shown in Table 1, “book a flight from london to umm no from boston to denver” is a disfluent query. Despite this, there has been limited NLU research on handling disfluencies: in the last 27 years, only one major data set containing annotated disfluencies in user utterances has been available (Godfrey and Holliman 1993; Shriberg 1994).
2.2 Errors in the Query
Not only do spoken and typed queries differ in structure and style, they also vary in the kind of noise or anomalies present in the input. Whereas typed queries can contain spelling errors, spoken queries can contain speech recognition errors and endpointing issues, which we discuss below.
While there has been extensive research on correcting spelling errors (Hládek, Staš, and Pleva 2020), including state-of-the-art neural machine translation models launched in products (Lichtarge et al. 2019),3 there has been limited NLU research on correcting speech recognition errors.
Handling speech recognition errors is crucial for downstream NLU models to work effectively, because an error in the transcription of a single word can entirely change the meaning of a query. For example, the user utterance “stair light on” is transcribed by the cloud-based Google speech recognizer4 as “sterilite on”. Given the query “sterilite on”, it is very hard for an NLU model to uncover the original intent of the user, which was to turn on the stair lights, unless we force the model to learn to correct or handle such systematic ASR errors. Such error analysis is often done on industrial query logs (Shokouhi et al. 2014), but these data sets are not publicly available for academic research purposes.
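As a toy sketch of what handling such systematic errors could look like, the snippet below snaps an ASR transcript onto the closest entry in a small command vocabulary by edit distance. The vocabulary, and the idea that plain character-level distance suffices, are illustrative assumptions on our part; a real system would use phonetic or learned similarity.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute ca -> cb
        prev = curr
    return prev[-1]

def correct_transcript(hypothesis: str, command_vocab: list[str]) -> str:
    """Map a possibly misrecognized transcript to the nearest known command."""
    return min(command_vocab, key=lambda cmd: edit_distance(hypothesis, cmd))

# Hypothetical smart-home command vocabulary (invented for this example).
COMMANDS = ["stair light on", "turn off the tv", "play jazz music"]
print(correct_transcript("sterilite on", COMMANDS))  # → stair light on
```

Even this crude sketch recovers the intended command from the example above, because “sterilite on” is far closer in edit distance to “stair light on” than to the other vocabulary entries.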
Another common error that affects NLU is speech endpointing. If a user utterance is pre-planned (such as “play music”), the user does not hesitate. But if the utterance is complex, or the user is responding to a request from the agent (such as “do you prefer 5 or 7?”), the user may hesitate and pause while responding, saying “oh <long pause> 7”. This causes the ASR model to assume that the user stopped after saying “oh”, and an incomplete query is transmitted to NLU. On the other hand, if we can learn to provide a signal to endpointing from the dialog manager that the recognized utterance is missing a value (in this case, probably an integer) according to the conversation context, we can improve endpointing, and hence recognition and downstream NLU. Data sets pertaining to the endpointing phenomenon are currently available only in very limited domains (Raux and Eskenazi 2008); there is a need for experimentation in broader domains and for analysis of the impact of such errors on overall user experience.
Thus, it is imperative for NLU models to be aware of the fact that there could be speech recognition errors in the input that might need to be corrected.
3. Current State of Research
We now describe the current core areas of research related to understanding the meaning of a written piece of text or spoken utterance. Figure 2 shows a general spoken language understanding pipeline.
3.1 Natural Language Understanding
Natural language understanding refers to the process of understanding the semantic meaning of a piece of text. More concretely, understanding the semantic meaning often implies semantic parsing in NLP academic research and industry. Semantic parsing is the task of converting a natural language utterance to a logical form: a machine-understandable representation of its meaning. Semantic parsing is a heavily studied research area in NLP (Kamath and Das 2019).
In its simplest form, a semantic parse of a given query can contain a label identifying the desired intent of the query and the arguments that are required to fulfill the given intent. However, a more informative semantic parse can contain edges describing relations between the arguments, as in the case of abstract meaning representations (AMR) (Banarescu et al. 2013), or a graph describing how different words join together to construct the meaning of a sentence, as in the case of combinatory categorial grammar (CCG) (Steedman 1987).5 Figure 3 shows a parse containing semantic role labels of a given query obtained using the AllenNLP tool (Gardner et al. 2018).
In general, all the semantic parsing research done in NLU assumes that the input to the parser is a piece of text (plus available world context). Thus, the training and evaluation splits of all available semantic parsing data sets are devoid of speech phenomena such as speech recognition errors, annotated pauses, or disfluencies.
3.2 Spoken Language Understanding
Spoken language understanding (SLU) refers to the process of identifying the meaning behind a spoken utterance (De Mori et al. 2008; Tur and De Mori 2011). In that regard, the end goals of NLU and SLU are the same, but the inputs to the NLU and SLU components differ: text in the former, speech in the latter. The most common SLU task is intent prediction and slot-filling, which involves classifying the intent of the utterance and identifying any required arguments to fulfill that intent (Price 1990). Figure 3 shows the SLU annotation for a slot-filling task. We now review the main approaches used to solve SLU.
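As a minimal illustration of intent prediction and slot-filling, the flight query from Table 1 could be represented as an intent label plus slot values. The frame schema and the names used below (book_flight, origin, destination) are invented for exposition, not a standard annotation format.

```python
from dataclasses import dataclass, field

@dataclass
class SemanticFrame:
    """A bare-bones intent + slot representation (illustrative only)."""
    intent: str
    slots: dict = field(default_factory=dict)

# One possible parse of "book a flight from boston to denver":
frame = SemanticFrame(
    intent="book_flight",
    slots={"origin": "boston", "destination": "denver"},
)
print(frame)
```

The intent fixes which action to execute and the slots supply the arguments needed to fulfill it, mirroring the task definition above.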
3.2.1 Speech → Text → Semantic Parse.
The traditional way of performing speech understanding is a pipeline approach: first use an ASR system to transcribe the speech into text, and then run NLU on the transcribed text to produce a semantic parse. This two-step pipeline has its own set of pros and cons. On the positive side, it is modular: the first component is an ASR system and the second is an NLU system, so the errors of each module can be independently analyzed and corrected (Fazel-Zarandi et al. 2019; Wang et al. 2020). The training of the two models is also independent, which makes it easy to use off-the-shelf state-of-the-art ASR and NLU models during inference.
The obvious disadvantage of this method is that the NLU and ASR models are unaware of each other. Because the models are not trained jointly, the ASR model cannot learn from the fact that the downstream NLU model may have failed on an erroneous ASR prediction. Similarly, at inference time the NLU model relies only on the best prediction of the ASR model and cannot exploit the uncertainty in ASR’s predictions. However, this can be mitigated to a large extent by having the ASR model propagate the n-best list of speech transcript hypotheses to the NLU system and letting the NLU model use all the hypotheses together to make the semantic parse prediction (Hakkani-Tür et al. 2006; Deoras et al. 2012; Weng et al. 2020; Li et al. 2020b), or by using a lattice or word-confusion network as input (Tür, Deoras, and Hakkani-Tür 2013; Ladhak et al. 2016).
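A minimal sketch of the n-best idea: rather than parsing only the top hypothesis, score every hypothesis in the n-best list with a (toy) measure of how well the NLU grammar covers it, and interpolate that with the ASR confidence. The hypotheses, confidences, keyword "grammar", and the 0.5 interpolation weight are all invented for illustration.

```python
# A toy ASR n-best list: (hypothesis, ASR confidence).
NBEST = [
    ("boston denver flight", 0.40),                    # ASR's top guess
    ("book a flight from boston to denver", 0.35),
    ("book a fight from boston to denver", 0.25),
]

def nlu_score(text: str) -> float:
    """Hypothetical parser confidence: fraction of words the grammar covers."""
    grammar_words = {"book", "flight", "from", "to"}
    words = text.split()
    return sum(w in grammar_words for w in words) / len(words)

def pick_hypothesis(nbest):
    """Rescore the n-best list by interpolating ASR and NLU scores."""
    return max(nbest, key=lambda h: 0.5 * h[1] + 0.5 * nlu_score(h[0]))

best, _ = pick_hypothesis(NBEST)
print(best)  # → book a flight from boston to denver
```

Here the NLU signal overturns ASR's top guess: the second hypothesis scores higher once parseability is taken into account, which is exactly the benefit the n-best interface provides.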
3.2.2 Speech → Semantic Parse.
There is renewed focus on approaches that parse the spoken utterance directly, by having a deep neural network consume the speech input and output the semantic parse (Haghani et al. 2018; Chen, Price, and Bangalore 2018; Serdyuk et al. 2018; Kuo et al. 2020). This is an end-to-end approach that does not rely on an intermediate textual representation of the spoken utterance produced by an ASR system; the model can be trained end-to-end with the loss optimized directly on the semantic parse prediction. However, such end-to-end models are data-hungry and suffer from a lack of training data (Lugosch et al. 2019; Li et al. 2020a). Even though these models often perform better on benchmark data sets, deploying them in a user-facing product is difficult because of the difficulty of debugging and fixing errors in their output (Glasmachers 2017).
4. Reimagining the ASR–NLU Boundary
In Section 3 we saw that parsing the user query is a step in SLU; SLU is thus a large umbrella that uses NLU techniques to parse spoken language. Even though SLU and NLU are at some level solving the same problem, there is a clear disconnect between the way the problems are formulated and the way solutions are devised. On one hand, NLU lays a lot of emphasis on understanding deep semantic structures in text, formulated in the tasks of semantic parsing, dependency parsing, language inference, question answering, coreference resolution, and so forth. On the other hand, SLU is mainly concerned with information extraction from the spoken input, formulated in the tasks of slot filling, dialog state modeling, and so on.
Even though there are academic benchmarks available for SLU that aim to extract information from the spoken input, there is an informal understanding between the ASR and NLU communities that assumes that as long as the ASR component can transcribe the spoken text correctly, the majority of the language understanding burden can be taken up by the NLU community. Similarly, there is an implicit assumption in the NLU community that ASR will provide correct transcription of the spoken input and hence NLU does not need to account for the fact that there can be errors in the ASR prediction. We consider the absence of an explicit two-way communication between the two communities problematic.
Figure 2 shows how the NLU research community can expand its domain to also consider spoken language as input to NLU models, instead of pure text. Similarly, the ASR community can account for whether the text it produces is semantically coherent by explicitly trying to parse the output. That said, a few efforts have already tried to blur the boundary between ASR and NLU. We first give some examples of how ASR and NLU can learn from each other, and then review some initial work in this domain aimed at enriching existing academic benchmark data sets.
4.1 ASR → NLU Transfer
A significant portion of the information missing when understanding spoken input from its transcript alone is the nature of the speech itself, which we often refer to as prosody. Whether the user was angry, happy, in a rush, or frustrated can help us better understand the user’s real intent. For example, “no…don’t stop” and “no don’t…stop” have exactly opposite meanings depending on whether the user paused between the first and second words or the second and third words. This information can only be transferred from speech to NLU. Amazon Alexa already has such a tone-detection feature deployed in production.6 There are academic data sets that map speech to emotion (Livingstone and Russo 2018), but academic benchmarks containing examples of intonation affecting the NLU output do not exist.
An ASR system can provide more information than its best guess to the NLU model by providing a list of n-best speech hypotheses. Unfortunately, most state-of-the-art NLU models are trained to accept only a single string of text as input, be it for parsing, machine translation, or any other established NLU task. To some extent, SLU has enlarged the domain of understanding tasks by creating benchmark data sets that contain n-best speech hypothesis lists—for example, the dialog state tracking challenge data set DSTC-2 (Henderson, Thomson, and Williams 2014). This allows the language understanding model to make use of all the n-best hypotheses instead of relying only on the top ASR output.7
4.2 NLU → ASR Transfer
Making sure that the output produced by an ASR model can be understood by an NLU model can help improve transcription quality (Velikovich et al. 2018). For example, trying to boost paths in the ASR lattice that contain named entities as predicted by a named entity recognition (NER) model can help overcome recognition errors related to out-of-vocabulary words (Serrino et al. 2019).
ASR models can also learn from errors produced in the NLU model. If a downstream NLU model in a conversational agent cannot reliably parse a given ASR output, this might indicate the presence of speech recognition errors. If there is a reliable way to identify the cause of an NLU failure as a speech error, then such examples can be fed back to the ASR module to improve it. For Google Home, Faruqui and Christensen (2020) propose a method for picking out the correct transcription from the n-best hypotheses when the top hypothesis does not parse, and explicitly confirming the new hypothesis with the user in a dialog. If the user accepts the selected speech hypothesis, the correction is provided as a training example to the ASR system. In general, if ASR models start computing error based on whether or not the produced text can be semantically parsed, the performance metric will be more representative of the real-world setting than the currently widely used word-error-rate (WER) metric (He, Deng, and Acero 2011).
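A simplified sketch of such a reselection loop (a caricature of the idea, not the actual production algorithm): if the top ASR hypothesis fails to parse, fall back down the n-best list to the first hypothesis that does, and flag it for explicit user confirmation. The toy grammar and intent names below are our own assumptions.

```python
# Toy grammar: only utterances listed here "parse" to an intent.
KNOWN_UTTERANCES = {"stair light on": "lights.on", "play music": "music.play"}

def parse(text: str):
    """Return an intent for the text, or None if it does not parse."""
    return KNOWN_UTTERANCES.get(text)

def reselect(nbest: list[str]):
    """Return (chosen hypothesis, intent, needs user confirmation)."""
    top = nbest[0]
    if parse(top) is not None:
        return top, parse(top), False       # top hypothesis parses: use it
    for hyp in nbest[1:]:                   # otherwise scan the rest of n-best
        if parse(hyp) is not None:
            return hyp, parse(hyp), True    # found a parse: confirm with user
    return top, None, False                 # nothing parses: keep ASR's best

print(reselect(["sterilite on", "stair light on"]))
# → ('stair light on', 'lights.on', True)
```

If the user then accepts the confirmed hypothesis, the pair (audio, "stair light on") could be logged as a correction example for ASR retraining, closing the NLU → ASR feedback loop described above.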
4.3 Spoken NLU Data Sets
There has already been some progress on enriching existing NLU benchmarks with speech. We now briefly review these efforts.
4.3.1 Data Sets with Speech Input.
Li et al. (2018) presented a new data set called Spoken-SQuAD, which builds on the existing NLU data set SQuAD (Rajpurkar et al. 2016), consisting of textual questions and textual documents. In Spoken-SQuAD, an audio form of each document was artificially constructed using the Google text-to-speech system, and a textual form of the document was then regenerated using the CMU Sphinx speech recognizer (Walker et al. 2004). You et al. (2021) created the Spoken-CoQA data set from the CoQA data set (Reddy, Chen, and Manning 2019) using the same technique. Both of these studies showed that the presence of ASR errors has a devastating effect on the quality of the QA system. However, it is worth noting that such speech still does not reflect what people do in spontaneous interactions.
The above data sets contain artificially synthesized speech. The ODSQA data set (Lee et al. 2018), on the other hand, was constructed by recruiting 20 speakers to read aloud the documents from the original QA data set (Shao et al. 2018). This data set is for Chinese QA and contains spoken Chinese documents as audio. In order to accurately model the real-world setting of SLU, we need to construct data sets containing real spoken utterances, similar to the approach used in ODSQA.
There are certain drawbacks with both the artificial and natural-speech style data sets. While artificially generated speech suffers from lack of sufficient speech style variability and the absence of natural speech cues, naturally generated speech comes with strict privacy and scalability concerns, preventing a large-scale collection of human speech. This privacy concern is even more pressing when dealing with utterances that humans issue to dialog agents at home that contain personal information.
4.3.2 Data Sets with Speech Errors.
Instead of providing audio input in the data set, another line of work adds speech recognition errors to the transcribed text. For example, RADDLE (Peng et al. 2021) is a benchmark data set and evaluation platform for dialog modeling in which the input text can contain speech phenomena like verbosity and speech recognition errors. Similarly, the LAUG toolkit (Liu et al. 2021) provides options to evaluate dialog systems against noise perturbations, speech characteristics like repetitions and corrections, and language variety. NoiseQA (Ravichander et al. 2021) contains ASR errors in the questions of a QA data set, introduced using both synthetic and natural speech.
5. Conclusion

In this article we have argued that there is a need for revisiting the boundary between ASR and NLU systems in the research community. We are calling for stronger collaboration between the ASR and NLU communities given the advent of spoken dialog agent systems that need to understand spoken content. In particular, we are calling for NLU benchmark data sets to revisit the assumption of starting from text, and instead move toward a more end-to-end setting where the input to the models is in the form of speech, as is the case in real-world dialog settings.
Acknowledgments

We thank the anonymous reviewers for their helpful feedback. We thank Shachi Paul, Shyam Upadhyay, Amarnag Subramanya, Johan Schalkwyk, and Dave Orr for their comments on the initial draft of the article.
Notes

2. The speech processing area also included papers from other related areas like multimodal processing, so the numbers presented here are an upper bound on speech processing papers.
5. Reviewing all different semantic parsing formulations is beyond the scope of this paper.
7. Note that lack of diversity in the n-best speech hypotheses could be an issue, and directly using the recognition lattice produced by ASR might be more informative.