Abstract
The transcription bottleneck is often cited as a major obstacle for efforts to document the world’s endangered languages and supply them with language technologies. One solution is to extend methods from automatic speech recognition and machine translation, and recruit linguists to provide narrow phonetic transcriptions and sentence-aligned translations. However, I believe that these approaches are not a good fit with the available data and skills, or with long-established practices that are essentially word-based. In seeking a more effective approach, I consider a century of transcription practice and a wide range of computational approaches, before proposing a computational model based on spoken term detection that I call “sparse transcription.” This represents a shift away from current assumptions that we transcribe phones, transcribe fully, and transcribe first. Instead, sparse transcription combines the older practice of word-level transcription with interpretive, iterative, and interactive processes that are amenable to wider participation and that open the way to new methods for processing oral languages.
1 Introduction
Most of the world’s languages only exist in spoken form. These oral vernaculars include endangered languages and regional varieties of major languages. When working with oral languages, linguists have been quick to set them down in writing: “The first [task] is to get a grip on the phonetics and phonology of the language, so that you can transcribe accurately. Otherwise, you will be seriously hampered in all other aspects of your work” (Bowern 2008, page 34).
There are other goals in capturing language aside from linguistic research, such as showing future generations what a language was like, or transmitting knowledge, or supporting ongoing community use. Within computational linguistics, goals range from modeling language structures, to extracting information, to providing speech or text interfaces. Each goal presents its own difficulties, and learning how to “transcribe accurately” may not be a priority in every case. Nevertheless, in most language situations, extended audio recordings are available, and we would like to be able to index their content in order to facilitate discovery and analysis. How can we best do this for oral languages?
The most common answer in the field of linguistics has been transcription: “The importance of the (edited) transcript resides in the fact that for most analytical procedures …it is the transcript (and not the original recording) which serves as the basis for further analyses” (Himmelmann 2006a, page 259). From this it follows that everything should be transcribed: “For the scientific documentation of a language it would suffice to render all recordings utterance by utterance in a phonetic transcription with a translation” (Mosel 2006, page 70, emphasis mine).
A parallel situation exists in the field of natural language processing (NLP). The “NLP pipeline” can be extended to cover spoken input by prefixing a speech-to-text stage. Given that it is easy to record large quantities of audio, and given that NLP tasks can be performed at scale, we have a problem known as the transcription bottleneck, illustrated in Figure 1.
Transcription Bottleneck: the last frontier in computerizing oral languages?
Meanwhile, linguists have wondered for years whether methods from speech recognition could be applied to automatically transcribe speech in unwritten languages. In such cases there will not be a pronunciation lexicon or a language model, but it is becoming popular to automate at least the phone recognition stage on its own. For example, Michaud et al. report phone error rates in the 0.12–0.16 range, after training on 5 hours of transcribed audio from the Na language of China (Adams et al. 2018; Michaud et al. 2018). The idea is for humans to post-edit the output, in this case, correcting one out of every 6–8 characters, and then to insert word boundaries, drawing on their knowledge of the lexicon and of likely word sequences, to produce a word-level transcription. The belief is that by manually cleaning up an errorful phone transcription, and converting it into a word-level transcription, we will save time compared with entering a word-level transcription from scratch. To date, this position has not been substantiated.
Although such phone transcription methods are intended to support scalability, they actually introduce new problems for scaling: only linguists can provide phone transcriptions or graph-to-phone rules needed for training a phone recognizer, and only linguists can post-edit phone-level transcriptions.
To appreciate the severity of the problem, consider the fact that connected speech is replete with disfluencies and coarticulation. Thus, an English transcriber who hears d’ya, d’ya see might write do you see or /du ju si/, to enable further analysis of the text. Instead, linguists are asked to transcribe at the phone level, i.e., [ʤəʤəsi]. We read this advice in the pages of Language: “field linguists [should modify] their [transcription] practice so as to assist the task of machine learning” (Seifart et al. 2018, page e335); and in the pages of Language Documentation and Conservation: “linguists should aim for exhaustive transcriptions that are faithful to the audio …mismatches result in high error rates down the line” (Michaud et al. 2018, page 12). Even assuming that linguists comply with these exhortations, they must still correct the output of the recognizer while re-listening to the source audio, and they must still identify words and produce a word-level transcription. It seems that the transcription bottleneck has been made more acute.
Three commitments lie at the heart of the transcription bottleneck: transcribing phones, transcribing fully, and transcribing first. None of these commitments is necessary, and all of them are problematic:
1. Transcribing phones. It is a retrograde step to build a phone recognition stage into the speech processing pipeline when the speech technology community has long moved away from the “beads-on-a-string” model. There is no physical basis for steady-state phone-sized units in the speech stream: “Optimizing for accuracy of low-level unit recognition is not the best choice for recognizing higher-level units when the low-level units are sequentially dependent” (Ostendorf 1999, page 79).
2. Transcribing fully. The idea that search and analysis depend on written text has led to the injunction to transcribe fully: transcriptions have become the data. Yet no transcription is transparent. Transcribers are selective in what they observe (Ochs 1979). Transcriptions are subject to ongoing revision (Crowley 2007, pages 139f). “It would be rather naive to consider transcription exclusively, or even primarily, a process of mechanically converting a dynamic acoustic signal into a static graphic/visual one. Transcription involves interpretation...” (Himmelmann 2018, page 35). In short, transcription is observation: “a transcription, whatever the type, is always the result of an analysis or classification of speech material. Far from being the reality itself, transcription is an abstraction from it. In practice this point is often overlooked, with the result that transcriptions are taken to be the actual phonetic ‘data’ ” (Cucchiarini 1993, page 3).
3. Transcribing first. In the context of language conservation, securing an audio recording is not enough by itself. Our next most urgent task is to capture the meaning while speakers are on hand. Laboriously re-representing oral text as written text has lower priority.
What would happen if we were to drop these three commitments and instead design computational methods that leverage the data and skills that are usually available for oral languages? This data goes beyond a small quantity of transcriptions. There will usually be a larger quantity of translations, because translations are easier to curate than transcriptions (cf. Figure 3). There will be a modest bilingual lexicon, because lexicons are created as part of establishing the distinct identity of the language. It will usually be straightforward to obtain audio for the entries in the lexicon. Besides the data, there are locally available skills, such as the ability of speakers to recognize words in context, repeat them in isolation, and say something about what they mean.
This leads us to consider a new model for large scale transcription that consists of identifying and cataloging words in an open-ended speech collection. Part of the corpus will be densely transcribed, akin to glossed text. The rest will be sparsely transcribed: words that are frequent in the densely transcribed portion may be detectable in the untranscribed portion. As we confirm the system’s guesses, it gets better at identifying tokens, and we leverage this to help us with the orthodox task of creating contiguous transcriptions.
I elaborate this “Sparse Transcription Model” and argue that it is a good fit to the task of transcribing oral languages, in terms of the available inputs, the desired outputs, and the available human capacity. This leads to new tasks and workflows that promise to accelerate the transcription of oral languages.
This article is organized as follows. I begin by examining how linguists have worked with oral languages over the past century (Section 2). Next, I review existing computational approaches to transcription, highlighting the diverse range of input and output data types and varying degrees of fit to the task (Section 3). I draw lessons from these linguistic and computational contributions to suggest a new computational model for transcription, along with several new tasks and workflows (Section 4). I conclude with a summary of the contributions, highlighting benefits for flexibility, for scalability, and for working effectively alongside speakers of oral languages (Section 5).
2 Background: How Linguists Work with Oral Languages
The existing computational support for transcribing oral languages has grown from observations of the finished products of documentary and descriptive work. We see that the two most widely published textual formats, namely, phonetic transcriptions and interlinear glossed text, correspond to the two most common types of transcription tool (cf. Section 2.3). However, behind the formats is the process for creating them:
No matter how careful I think I am being with my transcriptions, from the very first text to the very last, for every language that I have ever studied in the field, I have had to re-transcribe my earliest texts in the light of new analyses that have come to light by the time I got to my later texts. Not infrequently, new material that comes to light in these re-transcribed early texts then leads to new ways of thinking about some of the material in the later texts and those transcriptions then need to be modified. You can probably expect to be transcribing and re-transcribing your texts until you get to the final stages of your linguistic analysis and write-up (Crowley 2007, pages 139f).
Put another way, once we reach the point where our transcriptions do not need continual revision, we are sufficiently confident about our analysis to publish it. By this time the most challenging learning tasks—identifying the phonemes, morphemes, and lexemes—have been completed. From here on, transcription is relatively straightforward. The real problem, I believe, is the task discussed by Crowley, which we could call “learning to transcribe.” To design computational methods for this task we must look past the well-curated products to the processes that created them (cf. Norman 2013).
For context, we begin by looking at the why (Section 2.1) before elaborating on the how (Section 2.2), including existing technological support (Section 2.3). We conclude with a set of requirements for the task of learning to transcribe (Section 2.4).
2.1 Why Linguists Transcribe
Linguists transcribe oral languages for a variety of reasons: to preserve records of linguistic events and facilitate access to them, and to support the learning of languages and the discovery of linguistic structures. We consider these in turn.
2.1.1 Preservation and Access
The original purpose of a transcription was to document a communicative event. Such “texts” have long been fundamental for documentation and description (Boas 1911; Himmelmann 2006b).
For most of history, writing has been the preferred means for inscribing speech. In the early decades of modern fieldwork, linguists would ask people to speak slowly so they could keep up, or take notes and reconstruct a text from memory. Today we can capture arbitrary quantities of spontaneous speech, and linguists are exhorted to record as much as possible: “Every chance should be seized immediately, for it may never be repeated…the investigator should not hesitate to record several versions of the same story” (Bouquiaux and Thomas 1992, page 58).
We translate recordings into a widely spoken language to secure their interpretability (Schultze-Berndt 2006, page 214). Obtaining more than one translation increases the likelihood of capturing deeper layers of meaning (Bouquiaux and Thomas 1992, page 57; Evans and Sasse 2007; Woodbury 2007).
Once we have recorded several linguistic events, there comes the question of access: how do we locate items of interest? The usual recommendation is to transcribe everything: “If you do a time-aligned transcription …you will be able to search across an entire corpus of annotated recordings and bring up relevant examples” (Jukes 2011, page 441). Several tools support this, representing transcriptions as strings anchored to spans of audio (e.g., Bird and Harrington 2001; Sloetjes, Stehouwer, and Drude 2013; Winkelmann and Raess 2014).
Because transcriptions are seen as the data, we must transcribe fully. Deviations are noteworthy: “we are also considering not transcribing everything...” (Woodbury 2003, page 11). The existence of untranscribed recordings is seen as dirty laundry: “Elephant in the room: Most language documentation and conservation initiatives that involve recording end up with a backlog of unannotated, ‘raw’ recordings” (Cox, Boulianne, and Alam 2019). There is broad consensus: preservation and access are best served by transcribing fully. Nevertheless, we will establish a way to drop this requirement.
2.1.2 Learning and Discovery
The task of learning to transcribe can be difficult when the sounds are foreign and we don’t know many words. The situation is nothing like transcribing one’s own language. At first we hear a stream of sounds, but after a while we notice recurring forms. When they are meaningful, speakers can readily reproduce them in isolation, and offer a translation, or point at something, or demonstrate an action. As Himmelmann also notes, “transcription necessarily involves hypotheses as to the meaning of the segment being transcribed” (Himmelmann 2018, page 35). We recognize more and more words over time (Newman and Ratliff 2001; Rice 2001). As a consequence, “transcription also involves language learning” (Himmelmann 2018, page 35), and transcribers inevitably “learn the language by careful, repeated listening” (Meakins, Green, and Turpin 2018, page 83).
For readers who are not familiar with the process of transcribing an oral language, I present an artificial example from the TIMIT Corpus (Garofolo et al. 1986). Consider the sentence she had your dark suit in greasy wash water all year. At first, we recognize nothing other than a stream of phones (1a).
The story is a little more complex, as there are places where we equivocate between writing and omitting a phone, such as in rapid utterances of half a cup ∼ [hafkʌp]. It requires some knowledge of a language to be able to go looking for something that is heavily reduced. Additionally, there are places where we might entertain more than one hypothesis. After all, language learners routinely mis-parse speech (Cairns et al. 1997), and even fluent adult speakers briefly recognize words that do not form part of the final hypothesis (e.g., recognizing bone en route to trombone, Shillcock 1990). This phenomenon of multiple hypotheses occurs when transcribing field recordings (Hermes and Engman 2017), and it is recommended practice to keep track of them:
Don’t be afraid of writing multiple alternative transcriptions for a word you hear, especially when the language is new to you... Similarly, it is wise not to erase a transcription in your notebook; simply put a line through it and write the correction next to it. Make note of any transcriptions you are unsure about. Conversely, keep a record of all those you are confident about (Meakins, Green, and Turpin 2018, page 99).
In summary, learning to transcribe is a discovery process with an indeterminate endpoint: we come across new words in each new spoken text; we encounter variability in pronunciation; we run into allophony and allomorphy; we are tripped up by disfluency and coarticulation; and we guess at the contents of indistinct passages.
2.2 How Linguists Transcribe
Over an extended period, linguists have primarily transcribed at the word level. Often, translation was given higher priority than transcription. There have been many efforts to delegate transcription to speakers. We consider these points in turn.
2.2.1 Transcribing Words
Since the start of the modern period of linguistic description, linguists have transcribed at the word level. Working in the Arctic in the 1880s, Franz Boas “listened to stories and wrote down words” (Sanjek 1990, page 195). His early exposure to narratives only produced a wordlist: “my glossary is really growing.” Once he gained facility in the language, Boas would transcribe from memory, or back-translate from his English notes (Sanjek 1990, pages 198f). This is hardly a way to capture idiosyncratic pronunciations. Thus, the texts that have come down to us from this period are not transparent records of speech (Clifford 1990, page 63).
Half a century later, a similar practice was codified in phonemic theory, most notably in Kenneth Pike’s Phonemics: A Technique for Reducing Language to Writing (Pike 1947). The steps were to (a) collect words, (b) identify “minimal pairs,” (c) establish the phonemic inventory, and (d) collect texts using this phonemic orthography. Ken Hale describes a process that began with a version of the phonemic approach and continued with Boas’ practice of building up more vocabulary in sentential contexts:
In starting work on Ulwa, I decided to follow the procedure I have used elsewhere – North America, Mexico, Australia – in working on a “new” language. The first session, for example, would involve eliciting basic vocabulary – I usually start with body part terms – with a view, at this early point, of getting used to the sounds of the language and developing a way of writing it. And I would proceed in this manner through the basic vocabulary (of some 500 items) ... until I felt enough at ease with the Ulwa sound system to begin getting the vocabulary items in sentences rather than in isolation (Hale 2001, page 85, emphasis mine).
When approaching the task of transcription, both linguists and speakers come to the table with naive notions of the concept of word. Any recognized “word” is a candidate for later re-interpretation as a morpheme or multiword expression. Indeed, all levels of segmentation are suspect: texts into sentences, sentences into words, and words into morphemes. When it comes to boundaries, our typography—with its periods, spaces, and hyphens—makes it easy to create textual artifacts which are likely to be more precise as to boundaries than they are accurate.
A further issue arises from the “articulatory overlap” of words (Browman and Goldstein 1989), shown in (2). The desire to segment phone sequences into words encourages us to modify the phone transcription:
(2)
a. tɛmpɪn ∼ tɛn pɪn ‘ten pin’ (assimilation)
b. ɦæʤɨ ∼ ɦæd jɨ ‘had your’ (palatalization)
c. tɛntsɛnts ∼ tɛn sɛnts ‘ten cents’ (intrusive stop)
d. lɔrænd ∼ lɔ ænd ‘law and (order)’ (epenthesis)
There are just two coherent, homogeneous representations: phonetic sequences (1a), and canonical lexical sequences (1c). We can insist on the former, but writing out the idiosyncratic detail of individual tokens is arduous, is subject to unknown inter-annotator agreement (Himmelmann 2018, page 36), and is consequently no longer accepted as the basis for research in phonetics (Valenta et al. 2014; Maddieson 2001, page 213). This leaves the latter, and we observe that writing canonical lexical sequences does not lose idiosyncratic phonetic detail when there is a time-aligned speech signal. It is common practice to align at the level of easily identifiable “breath groups” (Voegelin and Voegelin 1959, page 25), illustrated in Figure 2.
Transcription from the author’s fieldwork with Kunwinjku [gup], aligned at the level of breath groups (Section 2.3).
2.2.2 Prioritizing Translation Over Transcription
The language preservation agenda involves capturing linguistic events. Once we have a recording, how are we to prioritize transcription and translation? Recent practice in documentary linguistics has been to transcribe then translate, as Chelliah explains:
Once a recording is made, it must be transcribed and translated to be maximally useful, but as is well-known, transcription is a significant bottleneck to translation. … For data gathering and analysis, the following workflow is typical: recording language interactions or performances > collecting metadata > transcribing > translating > annotating > archiving > disseminating (Chelliah 2018, pages 149, 160, emphasis mine).
In fact, transcription is only a bottleneck for translation if we assume that translation involves written sources. Mid-century linguists made no such assumption, believing that far more material would be translated than we could ever hope to transcribe (Figure 3). The same was true fifty years earlier, when Boas prioritized translation over transcription (Sanjek 1990, pages 198f). Fifty years later it was still considered best practice to prioritize translation over transcription:
At least a rough word-for-word translation must be done immediately along with a free translation…Putting off the transcription may spare the narrator’s patience…(Bouquiaux and Thomas 1992).
The tapered corpus. The quantity of data at each level follows a power law based on the amount of curation required (after Samarin 1967, page 70; Twaddell 1954, page 108). A similar situation has been observed in NLP (Abney and Bird 2010).
It remains likely for endangered languages that more materials will be translated than transcribed. Translation is usually quicker, easier, and more important for the documentary record (Section 2.1.1). Moreover, translation sits better with speakers who assume that linguists would be more interested in the content of their stories, than in how they would be represented in a written code (cf. Maddieson 2001, page 215; Bird 2020).
2.2.3 Working with Speakers
Language documentation involves substantial collaboration with local people (Austin 2007; Rice 2011). Often, these people are bilingual in a language of wider communication. Local bilinguals could be retired teachers or government employees. They could be students who meet the visiting linguist in the provincial capital and escort her to the village. They might have moved from the ancestral homeland to a major city where they participate in urban fieldwork (Kaufman and Perlin 2018). Whatever the situation, these people may be able to provide transcriptions, perhaps by adapting the orthography of another language (Figure 4). Sometimes, one finds speakers who are proficient in the official orthography of the language, even though that orthography might not be in widespread use.
Transcription and glossing performed by a speaker (Bird and Chiang 2012).
Literate speakers have some advantages over linguists when it comes to transcription. They have a comprehensive lexicon and language model. They can hold conversations in the language with other speakers to clarify nuances of meaning. Their professional work may have equipped them with editorial and keyboarding skills.
In many places, employing locals is inexpensive and delivers local economic benefits. There are many initiatives to train these people up into “community linguists” (Dobrin 2008; Rice 2009; Bird and Chiang 2012; Yamada 2014; Sapién 2018). This is suggested as a solution to the transcription bottleneck:
Ideally we should be getting transcriptions of all the recordings, but that is not always feasible or affordable... Most funders will pay for transcription work in the heritage language, so add that line item to your budget to get more recordings transcribed by someone else (King 2015, page 10).
When speakers are not literate—in the narrow western sense—they can still identify words in connected speech, supporting linguists as they transcribe (Gudschinsky 1967, page 9; Nathan and Fang 2009, page 109; Meakins, Green, and Turpin 2018, page 230).
Another way to involve speakers is through oral transcription or respeaking: “starting with hard-to-hear tapes and asking elders to ‘respeak’ them to a second tape slowly so that anyone with training in hearing the language can make the transcription” (Woodbury 2003, page 11). Respeaking is tantamount to dictation to future transcribers (Abney and Bird 2010; Sperber et al. 2013). Respeaking reduces the number of transcription mistakes made by non-speakers (Bettinson 2013). Of note for present purposes, respeaking reproduces words. Speakers do not reproduce disfluencies or mimic dialects. Thus, the respeaking task supports transcription at the word level.
This concludes our discussion of how linguists typically work with oral languages, showing that the most prevalent transcriptional practice takes place at the word level.
2.3 Technological Support for Working with Oral Languages
For many years, the Linguist’s Shoebox (later, Toolbox) was the mainstay of language data collection, replacing the traditional shoebox of file cards with hand-written lexical entries and cultural notes (Buseman, Buseman, and Early 1996). It stored language data in a text file as a sequence of blankline-delimited records, each containing a sequence of newline-delimited fields, each consisting of a field name such as \lx followed by whitespace followed by text content. This format, known as SIL Standard Format, is supported in the Natural Language Toolkit (Robinson, Aumann, and Bird 2007), to facilitate processing of texts and lexicons coming from linguistic fieldwork (Bird, Klein, and Loper 2009, §11.5).
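To make the format concrete, the following sketch shows how such a file might be read. The two-entry sample is invented for illustration (the markers \lx, \ps, and \ge are common Toolbox conventions for lexeme, part of speech, and English gloss), and in practice one would use an existing reader, such as the one in the Natural Language Toolkit, rather than this minimal parser.

```python
# Minimal sketch of reading SIL Standard Format (Toolbox) records.
# The sample entries are illustrative only.

SAMPLE = """
\\lx korroko
\\ps adv
\\ge long.ago

\\lx kunwardde
\\ps n
\\ge rock
"""

def parse_standard_format(text):
    """Yield each blank-line-delimited record as a list of (field, value) pairs."""
    for chunk in text.strip().split("\n\n"):
        record = []
        for line in chunk.splitlines():
            marker, _, value = line.partition(" ")   # field name, then whitespace, then content
            record.append((marker.lstrip("\\"), value.strip()))
        yield record

for entry in parse_standard_format(SAMPLE):
    print(dict(entry))
# {'lx': 'korroko', 'ps': 'adv', 'ge': 'long.ago'}
# {'lx': 'kunwardde', 'ps': 'n', 'ge': 'rock'}
```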
SIL Standard Format was an early version of semistructured data, for which XML was later devised (Abiteboul, Buneman, and Suciu 2000). Fieldworks Language Explorer (FLEx, Butler and Volkinburg 2007; Moe 2008) switched to XML in order to benefit from schemas, validation, stylesheets, and so forth. FLEx, like Shoebox, is “especially useful for helping researchers build a dictionary as they use it to analyze and interlinearize text” (SIL Language Technology 2000). This functionality—updating a lexicon while glossing text—parallels the approach to transcription described in this article.
To accelerate the collection of lexical data, the Rapid Words Method was devised, organizing teams of speakers and collecting upwards of 10,000 words over a ten-day period (Rapidwords 2019). The Rapid Words Method is implemented in WeSay (Albright and Hatton 2008), which permits lexical data to be interchanged with FLEx.
The curation of text data in FLEx is a more involved process. In FLEx, the “baseline” of a glossed text is imported from a text file, i.e., a transcription. The user works through the words adding glosses, cf. (3).
FLEx supports morphological segmentation and glossing, as shown in (5). The baseline includes allophonic detail. The second row shows a phonemic transcription with morphological segmentation. The third row contains morphological glosses with information about tense, aspect, person, number, and noun class. The last row shows the phrasal translation.
There is a qualitative distinction between (3) and (5) aside from the level of detail. The first format is a documentary artifact. It is a work in progress, a sense-disambiguated word-level transcription where the segmentation and the glosses are volatile. I will call this IGT1. The second format is an analytical product. It is a completed work, illustrating a morphophonemic and morphosyntactic analysis (cf. Evans 2003, page 663). I will call this IGT2.
In this view, IGT1 is a kind of word-level transcription where we take care not to collapse homographs. Here, words are not ambiguous grapheme strings, but unique lexical identifiers. Where words are morphologically complex, as we see in (5), we simply expect each morph-gloss pair to uniquely identify a lexical or grammatical morpheme. Getting from IGT1 to IGT2 involves analytical work, together with refactoring the lexicon, merging and splitting entries as we discover the morphology. No software support has been devised for this task.
To accelerate the collection of text data, the most obvious technology is the audio recording device, coupled with speech recognition technology. This promises to deliver us the “radically expanded text collection” required for language documentation (Himmelmann 1998). However, these technological panaceas bring us back to the idea of the processing pipeline (Figure 1) with its attendant assumptions of transcribing first and transcribing fully, and to the transcription bottleneck. Tedlock offers a critique:
Those who deal with the spoken word …seem to regard phonography as little more than a device for moving the scene of alphabetic notation from the field interview to the solitude of an office... The real analysis begins only after a document of altogether pre-phonographic characteristics has been produced... The alphabet continues to be seen as an utterly neutral, passive, and contentless vehicle (Tedlock 1983, page 195).
Here, the possibility of time-aligned annotation offers a solution, cf. Figure 2 (Bird and Harrington 2001; Jacobson, Michailovsky, and Lowe 2001; Sloetjes, Stehouwer, and Drude 2013; Winkelmann and Raess 2014). We do not need to view transcriptions as the data, but just as annotations of the data:
The importance of the (edited) transcript resides in the fact that for most analytical procedures …it is the transcript (and not the original recording) which serves as the basis for further analyses. Obviously, whatever mistakes or inconsistencies have been included in the transcript will be carried on to these other levels of analysis... This problem may become somewhat less important in the near future inasmuch as it will become standard practice to link transcripts line by line (or some other unit) to the recordings, which allows direct and fast access to the original recording whenever use is made of a given segment in the transcript (Himmelmann 2006a, page 259).
Unfortunately, the available software tools fall short, representing the time-aligned transcription of a phrase as a character string. None of the analysis and lexical linking built into FLEx and its predecessors is available. With some effort it is possible to export speech annotations to FLEx, and then to import this as a text for lexical linking and interlinear glossing (Gaved and Salffner 2014). However, this brings us back to the pipeline, and to text standing in for speech as primary data.
As a result, there is a fundamental shortcoming in the technological support for working with oral languages. We have tools for linking unanalyzed texts to speech, and tools for linking analyzed texts to the lexicon. However, to date, there is no transcription tool that supports simultaneous linking to the audio and to the lexicon. To the extent that both speech and lexical knowledge inform transcription, linguists are on their own.
A recent development in the technological support for processing oral languages comes under the heading of Basic Oral Language Documentation (BOLD) (Bird 2010; Reiman 2010). Here, we stay in the oral domain, and speakers produce respeakings and spoken interpretations into a language of wider communication, all aligned at the granularity of breath groups to the source audio (cf. Figure 2). Tools that support the BOLD method include SayMore and Aikuma (Hatton 2013; Hanke and Bird 2013; Bird et al. 2014). SayMore incorporates automatic detection of breath groups, while Aikuma leaves this in the hands of users.
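The details of SayMore’s detector are not at issue here; as a rough sketch of how automatic breath group detection can work in principle, the following code marks a pause wherever short-time energy stays below a threshold for a minimum duration, and treats the stretches between pauses as breath groups. The parameter values are illustrative defaults, not settings taken from any of the tools just mentioned.

```python
import numpy as np

def breath_groups(samples, rate, frame_ms=25, hop_ms=10,
                  silence_db=-35.0, min_pause_s=0.3):
    """Return (start, end) times in seconds for stretches of speech
    separated by sufficiently long pauses, as a rough proxy for breath
    groups. `samples` is a 1-D NumPy float array scaled to [-1, 1]."""
    frame = int(rate * frame_ms / 1000)
    hop = int(rate * hop_ms / 1000)
    # Short-time log energy for each analysis frame.
    energy = [10 * np.log10(np.mean(samples[i:i + frame] ** 2) + 1e-10)
              for i in range(0, len(samples) - frame, hop)]
    voiced = np.array(energy) > silence_db

    min_pause = int(min_pause_s * 1000 / hop_ms)   # pause length in frames
    groups, start, silent_run = [], None, 0
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i                          # speech begins
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_pause:            # pause long enough: close the group
                groups.append((start, i - silent_run + 1))
                start, silent_run = None, 0
    if start is not None:                          # speech runs to the end of the file
        groups.append((start, len(voiced)))
    return [(s * hop_ms / 1000, e * hop_ms / 1000) for s, e in groups]
```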
2.4 Requirements for Learning to Transcribe
In this section I have explored why and how linguists learn to transcribe, and the existing technological support for transcription, leading to six central requirements. First, documentary and descriptive practice has focused on transcribing words, not phones. We detect repeated forms in continuous speech, and construct an inventory (Section 2.2.1). Second, in the early stages we have no choice but to transcribe naively, not knowing whether a form is a morpheme, a word, or a multiword expression, only knowing that we are using meaningful units (Section 2.1.2). Third, transcription can proceed by leveraging translations. The translation task is more urgent and speakers find it easier. Thus, we can assume access to meanings, not just forms, during transcription (Sections 2.1.1, 2.2.2). Fourth, our representations and practices need to enable transcribing partially. There will always be regions of signal that are difficult to interpret given the low quality of a field recording or the evolving state of our knowledge (Section 2.1.2). Fifth, we are engaging with a speech community, and so we need to find effective ways of working with speakers in the course of our transcription work (Section 2.2.3). Finally, given our concern with consistency across transcribers, texts, and the lexicon, we want our tools to support simultaneous linking of transcriptions to both the source audio and to the lexicon (Section 2.3).
These observations shape our design of the Sparse Transcription Model (Section 4). Before presenting the model, we review existing computational approaches to transcription that go beyond the methods inspired by automatic speech recognition, and consider to what extent they already address the requirements coming from the practices of linguists.
3 Computation: Beyond Phone Recognition
If language documentation proceeds from a “radically expanded text collection,” how can speech and language technology support this? Approaches inspired by automatic speech recognition draw linguists into laborious phone transcription work. Automatic segmentation delivers pseudowords, not actual words, and depends on the false assumption that words in connected speech do not overlap (Section 2.2.1). As they stand, these are not solutions to the word-level transcription task as practiced in documentary and descriptive linguistics, and as required for natural language processing.
In this section we consider approaches that go beyond phone transcription. Several approaches leverage the translations that are readily available for endangered languages (Section 2.2.2). They may segment phone transcriptions into pseudowords then align these with translations (Section 3.1). Segmentation and alignment may be performed jointly (Section 3.2). Segmentation and alignment may operate directly on the speech, bypassing transcription altogether (Section 3.3). An entirely different approach to the problem is based on spoken term detection (Section 3.4). Next, we consider the main approaches to evaluation and their suitability for the transcription task (Section 3.5). The section concludes with a summary of approaches to computation and some high-level remarks (Section 3.6), and a discussion of how well these methods address the requirements for learning to transcribe (Section 3.7).
3.1 Segmenting and Aligning Phone Sequences
This approach involves segmenting phone sequences into pseudowords then aligning pseudowords with a translation. We construe the aligned words as glosses. At this point, we have produced interlinear glossed text of the kind we saw in (3). Now we can use alignments to infer structure in the source utterances (cf. Xia and Lewis 2007).
Phone sequences can be segmented into word-sized units using methods developed for segmentation in non-roman orthographies and in first language acquisition (Cartwright and Brent 1994; Goldwater, Griffiths, and Johnson 2006; Johnson and Goldwater 2009; Elsner et al. 2013). Besacier et al. took a corpus of Iraqi Arabic text with sentence-aligned English translations, converted the Arabic text to phone transcriptions by dictionary-lookup, then performed unsupervised segmentation of the transcriptions (Besacier, Zhou, and Gao 2006). In a second experiment, Besacier et al. replaced the canonical phone transcriptions with the output of a phone recognizer trained on 200 hours of audio. This resulted in a reasonable phone error rate of 0.15, and slightly lower translation scores. One of their evaluations is of particular interest: a human was asked to judge which of the automatically identified pseudowords were actual words, and then the system worked with this hybrid input of identified words with intervening phone sequences representing unidentified words, the same scenario discussed in Section 2.1.2.
Although the translation scores were mediocre and the size of training data for the phone recognizer was not representative for most oral languages, the experimental setup is instructive: it leverages prior knowledge of the phone inventory and a partial lexicon. We can be confident of having such resources for any oral language. Zanon Boito et al. used a similar idea, expanding a bilingual corpus with additional input pairs to teach the learner the most frequent 100 words, “representing the information a linguist could acquire after a few days” (Zanon Boito et al. 2017).
Unsupervised segmentation methods often detect sequences of phones that are unlikely to appear within morphemes. Thus, in English, the presence of a non-homorganic sequence [np] is evidence of a boundary. However, the ability of a system to segment [tɛnpɪn] as [tɛn pɪn] tells us little about its performance on more plausible input where coarticulation has removed a key piece of evidence: [tɛmpɪn] (2a). Unsupervised methods also assume no access to speakers, with their comprehension of the input and their ability to respeak it or to supply a translation (Bird 2020). Thus, given the available data and skills (Figure 3; Abney and Bird 2010), learning to transcribe is not so much language acquisition as cross-linguistic bootstrapping.
Another type of information readily available for endangered languages is translation, as just mentioned. Several computational approaches have leveraged translations to support segmentation of the phone sequence. This implements the translation-before-transcription workflow that was discussed in Section 2.2.2.
3.2 Leveraging Translations for Segmentation
More recent work has drawn on translations to support segmentation. Neubig et al. (2012) were the first to explore this idea, leveraging translations to group consecutive phones into pseudowords. They drew on an older proposal for translation-driven segmentation in phrase-based machine translation in which contiguous words of a source sentence are grouped into phrases which become the units of alignment (Wu 1997).
Stahlberg et al. began with IBM Model 3, which incorporates a translation model for mapping words to their translations, and a distortion model for placing translated words in the desired position (Brown et al. 1993; Stahlberg et al. 2016). In Model 3P, distortion is performed first, to decide the position of the translated word. A target word length is chosen and then filled in with the required number of phones. Stahlberg et al. trained their model using 600k words of English–Spanish data, transliterating the source language text to canonical phone sequences and adding 25% noise, approximating the situation of unwritten languages. They reported segmentation F-scores in the 0.6–0.7 range. Stahlberg et al. do not report whether these scores represent an improvement over the scores that would have been obtained with an unsupervised approach.
Godard et al. applied Stahlberg’s method using a more realistic data size, i.e., a corpus of 1.2k utterances from Mboshi with gold transcriptions and sentence-aligned French translations (Godard et al. 2016). They segmented transcriptions with the support of translations, reporting that it performed less well than unsupervised segmentation. Possible factors are the small size of the data, the reliance on spontaneous oral interpretations, and working across language families.
Godard et al. extended their work using a corpus of 5k Mboshi utterances, using phone recognizer output instead of gold transcriptions (Godard et al. 2018b). Although promising, their results demonstrate the difficulty of segmentation in the presence of noise, and sensitivity to the method chosen for acoustic unit discovery.
Adams et al. performed joint word segmentation and alignment between phone sequences and translations using pialign, for German text transliterated into canonical phone sequences (Adams et al. 2015). They extracted a bilingual lexicon and showed that it was possible to generate hundreds of bilingual lexical entries on the basis of just 10k translated sentences. They extended this approach to speech input, training a phone recognizer on Japanese orthographic transcriptions transliterated to gold phoneme transcriptions, and segmented and aligned the phone lattice output (Adams et al. 2016b). They evaluated this in terms of improvement in phone error rate.
This shift to speech input opens the way to a more radical possibility: bypassing transcription altogether.
3.3 Bypassing Transcription
If the downstream application does not require it, why would we limit ourselves to an impoverished alphabetic representation of speech when a higher-dimensional, or non-linear, or probabilistic representation would capture the input more faithfully? Segmentation and alignment tasks can be performed on richer representations of the speech input, for example, building language models over automatically segmented speech (Neubig et al. 2010), aligning word lattices to translations (Adams et al. 2016a), training a speech recognizer on probabilistic transcriptions (Hasegawa-Johnson et al. 2016), or translating directly from speech (Duong et al. 2016; Bansal et al. 2017; Weiss et al. 2017; Chung et al. 2019).
Anastasopoulos, Chiang, and Duong (2016) inferred alignments directly between source audio and text translations first for Spanish–English, then for 330 sentences of Griko speech with orthographic translations into Italian. The task is to “identify recurring segments of audio and cluster them while aligning them to words in a text translation.” Anastasopoulos et al. (2017) took this further, discovering additional tokens of the pseudowords in untranslated audio, in experiments involving 1–2 hours of speech in Ainu and Arapaho.
There is another way to free up our idea of the required output of transcription, namely, to relax the requirement that it be contiguous. We turn to this next.
3.4 Spoken Term Detection
Spoken term detection, also known as keyword spotting, is a long-standing task in speech recognition (Myers, Rabiner, and Rosenberg 1980; Rohlicek 1995; Fiscus et al. 2007). The primary impetus to the development of this method was to detect a term in the presence of irrelevant speech, perhaps to perform a command in real time, or to retrieve a spoken document from a collection. Spoken term detection is much like early transcription in which the non-speaker transcriber is learning to recognize words. Some methods involve modeling the “filler” between keywords, cf. (1b) (Rohlicek 1995).
Spoken term detection methods have been applied to low-resource languages in the context of the IARPA Babel Program (e.g., Rath et al. 2014; Gales et al. 2014; Liu et al. 2014; Chen et al. 2015; Metze et al. 2015), but with far more data than we can expect to have for many oral languages, including 10+ hours of phone transcriptions and more-or-less complete lexicons.
Spoken term detection has been applied to detect recurring forms in untranscribed speech, forms that do not necessarily carry any meaning (Park and Glass 2008; Jansen, Church, and Hermansky 2010; Dunbar et al. 2017). However, this unsupervised approach is less well aligned to the goals of transcription where we expect terms to be meaningful units (Section 2.1.2; Fiscus et al. 2007, page 52).
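The systems cited above are considerably more sophisticated, but the core idea of query-by-example spoken term detection can be sketched as follows, under the assumption that frame-level features (e.g., MFCCs) for the query and the utterance have already been computed: slide the query across the utterance, score each candidate region with dynamic time warping, and accept a detection when the normalized cost falls below a threshold.

```python
import numpy as np

def dtw_cost(a, b):
    """Length-normalized DTW alignment cost between two feature
    matrices of shape (frames, dims), using Euclidean frame distance."""
    n, m = len(a), len(b)
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j],      # skip a query frame
                acc[i, j - 1],      # skip an utterance frame
                acc[i - 1, j - 1],  # match
            )
    return acc[n, m] / (n + m)

def detect_term(query, utterance, hop=5):
    """Return (best_start_frame, best_cost) for the query within the
    utterance, scanning candidate windows of roughly the query's length."""
    width = len(query)
    best = (None, np.inf)
    for start in range(0, max(1, len(utterance) - width + 1), hop):
        cost = dtw_cost(query, utterance[start:start + width])
        if cost < best[1]:
            best = (start, cost)
    return best
```

In a transcription setting, the query could be a confirmed token of a glossary entry, or a word respoken in isolation by a speaker, and the acceptance threshold would be tuned against tokens that a human has already confirmed.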
3.5 Test Sets and Evaluation Measures
In evaluating the above methods, a popular approach is to take an existing large corpus with gold transcriptions (or with transcriptions that are simulated using grapheme-to-phoneme transliteration), along with translations. We simulate the low-resource scenario by giving the system access to a subset of the data with the possible addition of noise (e.g., Besacier, Zhou, and Gao 2006; Stahlberg et al. 2016). Some have compiled small corpora in the course of their work, for example, Griko (Boito et al. 2018) or Mboshi (Godard et al. 2018a; Rialland et al. 2018). Some collaborations have tapped a long-standing collection activity by one of the partners, for example, Yongning Na (Adams et al. 2017). Others have found small collections of transcribed and translated audio on the Web, for example, Arapaho and Ainu (Anastasopoulos et al. 2017).
Phone error rate is the accepted measure for phone transcriptions, and it is easy to imagine how this could be refined for phonetic or phonemic similarity. However, optimizing phone error rate comes at a high cost for linguists, because it requires that they “aim for exhaustive transcriptions that are faithful to the audio” (Michaud et al. 2018, page 12). Reducing phone error rate is not necessarily the most effective way to improve word-level recognition accuracy.
Word error rate is a popular measure, but it suffers from a fundamental problem of representation. Words are identified using character strings, and a transcription word is considered correct if it is string-identical to a reference word. Yet these string identifiers are subject to ongoing revision. Our evolving orthography may collapse phonemic distinctions, creating homographs (cf. English [wɪnd] vs [waɪnd] both written wind). Our incomplete lexicon may omit homophones (cf. an English lexicon containing bat1flying mammal but not bat2striking instrument). Our orthographic practice might not enable us to distinguish homophones (cf. English [sʌn] written son vs sun). This reliance on inadequate word identifiers and an incomplete lexicon will cause us to underestimate word error rate. Conversely, a lack of understanding about dialect variation, or lack of an agreed standard spelling, or noise in word segmentation, all serve to multiply the spellings for a word. This variability will cause us to overestimate word error rate. These problems carry forward to type-based measures of lexicon quality such as precision at k entries (Adams et al. 2015).
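A minimal calculation makes the fragility of string identity plain. In the sketch below, word error rate is computed as edit distance over word strings; the words are invented for illustration, and a single unstandardized spelling is enough to register an error even though the transcriber identified the right lexeme.

```python
def word_error_rate(reference, hypothesis):
    """Edit distance over word lists, divided by reference length."""
    r, h = reference, hypothesis
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j - 1] + (r[i - 1] != h[j - 1]),  # substitution
                          d[i - 1][j] + 1,                           # deletion
                          d[i][j - 1] + 1)                           # insertion
    return d[len(r)][len(h)] / len(r)

# The transcriber heard the right word but spelled it differently;
# string identity counts this as one substitution in three words.
print(word_error_rate("korroko bininj wam".split(),
                      "korokko bininj wam".split()))  # 0.333...
```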
Translation quality is another popular measure, and many of the above-cited papers report BLEU scores (Papineni et al. 2002). However, the data requirements for measuring translation quality amplify our sparse data problems. Data sparseness becomes more acute when we encounter mismatches between which concepts are lexicalized across source and target languages:
Parallel texts only address standardized, universal stories, and fail to explore what is culture-specific, either in terms of stories or in terms of lexical items. Parallel bible or other corpora may tell us how to say ‘arise!’ or ‘Cain fought with Abel.’ But we will not encounter the whole subworld of lexical particularities that make a language unique, such as Dalabon dalabborrord ‘place on a tree where the branches rub together, taken advantage of in sorcery by placing something that has been in contact with the victim, such as clothes, in such a way that it will be rubbed as the tree blows in the wind, gradually sickening and weakening the victim.’ The thousands of fascinating words of this type are simply bracketed out from traditions of parallel translation (Evans and Sasse 2007, page 71).
Each “untranslatable word” leads to parallel texts that include lengthy interpretations, with a level of detail that varies according to the translator’s awareness of the lexical resources of the target language, and beliefs about the cultural awareness of the person who elicits the translation. The problem runs deeper than individual lexical items. For example, Woodbury describes a class of suffixes that express affect in Cup’ik discourse, which translators struggle to render into English (Woodbury 1998, page 244). We expect great diversity in how these suffixes are translated. For these reasons, it seems far-fetched to evaluate translations with BLEU, counting n-gram overlap with reference translations. Perhaps the main benefit of translation models lies not in translation per se, but in their contribution to phone or word recognition and segmentation.
3.6 Summary
For linguists, transcription is a unitary task: listen to speech and write down words. Computational approaches, by contrast, show great diversity. Indeed, it is rare for two approaches to use the same input and output settings, as we see in Figure 5.
Summary of transcription methods. Inputs and outputs are indicated with distinct symbols; sometimes, an output type also serves as an input type for training. Shaded regions represent data whose existence is independently motivated. When a paper includes multiple experiments, we only document the most relevant one(s). I do not include models as inputs or outputs; if data X is used to train a model, and then the model is used to produce data Y, we identify a task mapping from X to Y. The first section of the table lists tools for use by humans; the second section lists automatic tools; the final section shows our objective, relying on independently motivated data.
One source of diversity is the input: a large corpus of transcriptions; a small corpus of transcriptions augmented with material from other languages; linguist- or speaker- or non-speaker-supplied phonemic, phonetic, or orthographic transcriptions; transcriptions with or without word breaks; transcriptions derived from graph-to-phone rules applied to lexemes; partial transcriptions; or probabilistic transcriptions. Outputs are equally diverse: phones with or without word segmentation; a lexicon or not; associated meanings; meaning represented using glosses, alignments, or a bilingual lexicon. The computational methods are no less diverse, drawing on speech recognition, machine translation, and spoken term detection.
From the standpoint of our discussion in Section 2, the existing computational approaches incorporate some unwarranted and problematic assumptions. First, all approaches set up the problem in terms of inputs and outputs. However, linguists regularly access speakers during their transcription work. These people not only have a greater command of the language, they may be familiar with the narrative being transcribed, or may have been present during the linguistic performance that was recorded. Given how commonly a linguist transcriber has a speaker “in the loop,” it would be worthwhile to investigate human-in-the-loop approaches. For instance, facilitating access to spoken and orthographic translations can support transcription by non-speakers (Anastasopoulos and Chiang 2017).
Second, the approaches inspired by speech recognition and machine translation proceed by segmenting a phone sequence into words. Evaluating these segmentations requires a gold standard that is not available in the early stages of language documentation. At this point we only have partial knowledge of the morphosyntax; our heightened awareness of form (the speech signal) and reduced awareness of meaning cause us to favor observable phonological words over the desired lexical words; we cannot be confident about the treatment of clitics and compounds as independent or fused; and transcribers may use orthographic conventions influenced by another language, among many other factors (Himmelmann 2006a; Cahill and Rice 2014). Add to all this the existence of coarticulation in connected speech: consecutive words may not be strictly contiguous thanks to overlaps and gaps in the phone sequence, with the result that segmentation is an unsound practice (cf. Section 2.2.1).
3.7 Addressing the Requirements
How well does this computational work address our six requirements (Section 2.4)?
1. Transcribing words. Much computational work has approached phone recognition in the absence of a lexicon. If we cannot be sure of having a lexicon, the interests of general-purpose methods seem to favor approaches that do not require a lexicon. However, the process of identifying a language and differentiating it from its neighbors involves eliciting a lexicon. Establishing a phoneme inventory involves a lexicon. Early transcription involves a lexicon. In short, we always have a lexicon. To the extent that accelerating the transcription of words means accelerating the construction of a lexicon, then it makes sense to devote some effort early on to lexical elicitation. There are creative ways to simulate partial knowledge of the lexicon, for example, by substituting system output with correct output for the n most frequent words (Besacier, Zhou, and Gao 2006), or by treating the initial bilingual lexicon as a set of mini parallel texts and adding them to the parallel text collection.
2. Using meaningful units. Much computational work has discovered “words” as a byproduct of detecting word boundaries or repeated subsequences, sometimes inspired by simulations of child language acquisition. The lexicon is the resulting accidental inventory of forms. However, transcribers access meanings, and even non-speakers inevitably learn some vocabulary and leverage this in the course of transcribing. It is premature to “bake in” the word boundaries before we know whether a form is meaningful.
3. Leveraging translations. Many computational methods leverage translations (Section 3.2), benefiting from the fact that they are often available (Section 2.2.2). This accords with the decision to view translations as independently motivated in Figure 5.
4. Transcribing partially. Sections of a corpus that need to be submitted to conventional linguistic analysis require contiguous or “dense” transcription. We can locate items for dense transcription by indexing and searching the audio using spoken term detection or alignment to translations. This aligns with the assumptions of mid-century linguists, and the present reality, that not everything will be transcribed (Section 2.2.2).
5. Working with speakers. We observed that there is usually some local capacity for transcription (Section 2.2.3). However, the phone recognition approaches demand a linguist’s expertise in phonetic transcription. In contrast, the other approaches stand to make better use of speakers who are not trained linguists.
6. Simultaneous linking. Users of existing software must decide if they want their transcriptions to be anchored to the audio or the lexicon (Section 2.3). A major source of the difficulty in creating an integrated tool has been the reliance on phone transcription. However, if we transcribe at the word level, there should be no particular difficulty in simultaneously linking to the signal and the lexicon.
This concludes our analysis of existing computational approaches to the processing of oral languages. Next we turn to a new approach, sparse transcription.
4 The Sparse Transcription Model
The Sparse Transcription Model supports interpretive, iterative, and interactive processes for transcribing oral languages while avoiding the transcription bottleneck (Figure 1). It organizes the work into tasks that can be combined into a variety of workflows. We avoid making premature commitments to details that can only be resolved much later, if at all, such as the precise locus of boundaries, the representation of forms as alphabetic strings, or the status of forms as morphemes, words, or multiword expressions. We privilege the locally available resources and human capacity in a strengths-based approach.
In this section I present an overview of the model (Section 4.1), followed by a discussion of transcription tasks (Section 4.2) and workflows (Section 4.3), and then some thoughts on evaluation (Section 4.4). I conclude the section by showing how the Sparse Transcription Model addresses the requirements for transcription (Section 4.5).
4.1 Overview
The Sparse Transcription Model consists of several data types, together serving as the underlying representation for IGT1 (3). The model includes sparse versions of IGT1 in which transcriptions are not contiguous. The data types are as follows:
Timeline: an abstraction over one or more audio recordings that document a linguistic event. A timeline serves to resolve time values to offsets in an audio file, or the audio track of a video file, following Bird and Liberman (2001, pages 43f). Timelines are per-participant in the case of conversational speech. We assume the existence of some independent means to identify breath groups per speaker in the underlying audio files.
Glossary Entry: a form-gloss pair. Figure 6 shows a fragment of a glossary. Rows 47–49 represent transient stages: we have identified a form but not its meaning (entry 47), identified a meaning without specifying the form (entry 48), or used a stub to represent multiple detected tokens of a single yet-to-be-identified lexeme (entry 49). The string representation of a form is subject to revision.
Sparse Transcription Model: Breath groups contain tokens which link to glossary entries; breath groups are anchored to the timeline and contained inside interpretations. Tokens are marked for their likelihood of being attested in a breath group, and optionally with a time, allowing them to be sequenced within this breath group; cf. (4).
This representation is related to OntoLex (McCrae et al. 2017). A form korroko stands in for a pointer to an OntoLex Form, and a gloss long.ago stands in for a pointer to a LexicalSense, and thence an OntologyEntry. A glossary entry is a subclass of LexicalEntry, or one of its subclasses Word, MultiwordExpression, or Affix. Glossary entries may contain further fields such as the morphosyntactic category, the date of elicitation, and so on. Entries may reference constituents, via a pointer from one LexicalEntry to another.
Token: a tuple consisting of a glossary entry and a breath group, along with a confidence value and an optional time. Examples of tokens are shown in Figure 6. A single time point enables us to sequence all tokens of a breath group. This avoids preoccupation with precise boundary placement, and it leaves room for later addition of hard-to-detect forms that appear between previously identified forms.
Breath Group: a span of a timeline that corresponds to a (partial) utterance produced between breath pauses. These are visible in an audio waveform and serve as the units of manual transcription (cf. Figure 2). Breath groups become the mini spoken documents that are the basis for spoken term detection. Breath groups are non-overlapping. Consecutive breath groups are not necessarily contiguous, skipping silence or extraneous content. Breath groups carry a size value that stores the calculated duration of speech activity.
Interpretation: a representation of the meaning of an utterance (one or more consecutive breath groups). Interpretations may be in the form of a text translation into another language, or an audio recording, or even an image. There may be multiple interpretations assigned to any timeline, and there is no requirement that two interpretations that include the same breath group have the same temporal extent.
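To make these data types concrete, here is a minimal sketch of the model as Python dataclasses. The field names and types are illustrative assumptions rather than a normative schema; in particular, pointers into OntoLex are reduced to plain strings and identifiers are opaque. The sketches accompanying the tasks of Section 4.2 reuse these definitions.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Timeline:
    """Abstraction over one or more recordings that document a linguistic event."""
    id: str
    audio_files: List[str]                 # audio files, or audio tracks of video

@dataclass
class GlossaryEntry:
    """A form-gloss pair; either side may be provisional or absent (Figure 6)."""
    id: str
    form: Optional[str] = None             # e.g. "korroko"; subject to revision
    gloss: Optional[str] = None            # e.g. "long.ago"; stands in for a LexicalSense
    typical_size: Optional[float] = None   # estimated duration of its tokens (seconds)

@dataclass
class BreathGroup:
    """A non-overlapping span of a timeline produced between breath pauses."""
    id: str
    timeline: str                          # Timeline.id
    start: float                           # seconds
    end: float
    speech_activity: float = 0.0           # calculated duration of speech in the span

@dataclass
class Token:
    """A (possible) occurrence of a glossary entry in a breath group."""
    entry: str                             # GlossaryEntry.id
    breath_group: str                      # BreathGroup.id
    confidence: float = 1.0                # likelihood of being attested
    time: Optional[float] = None           # single point, used only for sequencing

@dataclass
class Interpretation:
    """Meaning of one or more consecutive breath groups (text, audio, or image)."""
    timeline: str                          # Timeline.id
    start: float
    end: float
    content: str                           # e.g. a free translation, or a media path
```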
4.2 Transcription Tasks
Glossary
Many workflows involve growing the glossary: reviewing existing glossaries with speakers; eliciting words from speakers; previewing some recordings and eliciting words; processing archived transcriptions and entering the words.
Task G (Growing the Glossary: Manual)
Identify a candidate entry for addition and check whether a similar entry is already present. If so, possibly update the existing entry. If not, create a new entry ge, elicit one or more reference recordings for this item, add them to the audio collection A, and create the token(s) linking them to ge. Update our estimate of the typical size of this entry (the duration of its tokens).
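A minimal sketch of the bookkeeping behind Task G, reusing the dataclasses of Section 4.1. The string-similarity check and the shape of reference_recordings are assumptions for illustration; the judgment about whether a similar entry already exists rests with the person doing the elicitation.

```python
from difflib import SequenceMatcher

def similar_entries(glossary, form, threshold=0.8):
    """Return existing entries whose provisional form resembles the candidate,
    best match first. A first filter only; the transcriber decides."""
    scored = [(SequenceMatcher(None, e.form, form).ratio(), e)
              for e in glossary if e.form]
    return sorted([pair for pair in scored if pair[0] >= threshold],
                  key=lambda pair: pair[0], reverse=True)

def add_entry(glossary, tokens, entry_id, form, gloss, reference_recordings):
    """Create a new entry, link its elicited reference recordings via tokens,
    and record an initial estimate of the entry's typical duration.

    reference_recordings: (breath_group, start, end) triples wrapping the
    elicited audio; this shape is an assumption for illustration."""
    entry = GlossaryEntry(id=entry_id, form=form, gloss=gloss)
    durations = []
    for bg, start, end in reference_recordings:
        tokens.append(Token(entry=entry_id, breath_group=bg.id, time=start))
        durations.append(end - start)
    if durations:
        entry.typical_size = sum(durations) / len(durations)
    glossary.append(entry)
    return entry
```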
Task R (Refactoring the Glossary: Semi-automatic)
Identify entries for merging, based on clustering existing entries; or for splitting, based on clustering the tokens of an entry; or for decomposition, based on morphological discovery. Submit these for manual validation.
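As a sketch of the merging case of Task R: group entries whose provisional forms are highly similar and submit each group for manual validation. Single-link clustering over string similarity is a crude stand-in; clustering over reference recordings or token acoustics would be a natural refinement.

```python
from difflib import SequenceMatcher

def merge_proposals(glossary, threshold=0.85):
    """Return groups of entry ids that are candidates for merging (Task R)."""
    entries = [e for e in glossary if e.form]
    parent = {e.id: e.id for e in entries}

    def find(x):                        # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(entries):     # single-link: similar pairs share a cluster
        for b in entries[i + 1:]:
            if SequenceMatcher(None, a.form, b.form).ratio() >= threshold:
                parent[find(a.id)] = find(b.id)

    clusters = {}
    for e in entries:
        clusters.setdefault(find(e.id), []).append(e.id)
    return [ids for ids in clusters.values() if len(ids) > 1]
```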
Breath Groups
Each audio timeline contains breath groups B. We store the amount of speech activity in a breath group. We mark fragments that will not be transcribed (e.g., disfluent or unintelligible speech).
Task B (Breath Group Identification: Semi-automatic)
Automatically identify breath groups and submit for manual review. Compute and store the duration of speech activity in a breath group. Permit material within a breath group to be excluded from transcription. Permit updating of the span of a breath group through splitting or merging.
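A minimal sketch of the automatic step of Task B, using a simple RMS-energy voice activity heuristic over a mono float sample array; any off-the-shelf voice activity detector could be substituted. The window sizes and thresholds are illustrative assumptions, and the proposed spans are intended for manual review.

```python
import numpy as np

def detect_breath_groups(samples, sr, frame_ms=25, hop_ms=10,
                         energy_quantile=0.4, min_pause_s=0.35, min_group_s=0.5):
    """Propose (start, end, speech_activity) spans in seconds for manual review."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(samples) - frame) // hop
    if n_frames <= 0:
        return []
    rms = np.array([np.sqrt(np.mean(samples[i * hop:i * hop + frame] ** 2))
                    for i in range(n_frames)])
    voiced = rms > np.quantile(rms, energy_quantile)

    groups, start_f, last_f = [], None, None       # frame indices
    for i, v in enumerate(voiced):
        if v:
            if start_f is None:
                start_f = i
            last_f = i
        elif start_f is not None and (i - last_f) * hop / sr > min_pause_s:
            if (last_f - start_f) * hop / sr >= min_group_s:
                activity = float(voiced[start_f:i].sum()) * hop / sr
                groups.append((start_f * hop / sr, (last_f * hop + frame) / sr, activity))
            start_f = None
    if start_f is not None and (last_f - start_f) * hop / sr >= min_group_s:
        activity = float(voiced[start_f:].sum()) * hop / sr
        groups.append((start_f * hop / sr, (last_f * hop + frame) / sr, activity))
    return groups
```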
Interpretation
Interpretations are any kind of semiotic material. This task elicits interpretations, and aligns existing interpretations to utterances of the source audio. We assume that the span of an interpretation [t1, t2] corresponds to one or more breath groups.
Task I (Interpretation: Manual)
Associate interpretations with timeline spans that correspond to one or more consecutive breath groups.
Transcription
The first task is word spotting, or spoken term detection, with the requirement that terms can be assigned a gloss. This may leverage time-aligned translations if they are available. The second task is contiguous, or dense, transcription, which leverages word spotting to allocate transcription effort to breath groups of interest, and then to accelerate transcription by guessing which words are present.
Task S (Word Spotting: Automatic)
Find possible instances of a given glossary entry and create a new token for each one. This token identifies a breath group and specifies a confidence value.
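One way to realize Task S, sketched under stated assumptions: query-by-example spotting with subsequence dynamic time warping over frame-level features (e.g., MFCCs; feature extraction is not shown). The cosine frame distance, the cost-to-confidence mapping, and the threshold are all illustrative; the model leaves the choice of spotting method open (Section 4.5).

```python
import numpy as np

def subsequence_dtw_cost(query, doc):
    """Normalized cost of the best match of `query` anywhere inside `doc`.

    query, doc: feature matrices of shape (n_frames, n_dims).
    Lower is better; the cost is divided by the query length."""
    q = query / (np.linalg.norm(query, axis=1, keepdims=True) + 1e-8)
    d = doc / (np.linalg.norm(doc, axis=1, keepdims=True) + 1e-8)
    dist = 1.0 - q @ d.T                       # frame-wise cosine distance

    n, m = dist.shape
    acc = np.full((n, m), np.inf)
    acc[0, :] = dist[0, :]                     # a match may start anywhere in doc
    for i in range(1, n):
        for j in range(m):
            best_prev = acc[i - 1, j]
            if j > 0:
                best_prev = min(best_prev, acc[i - 1, j - 1], acc[i, j - 1])
            acc[i, j] = dist[i, j] + best_prev
    end = int(np.argmin(acc[-1, :]))           # and end anywhere in doc
    return float(acc[-1, end]) / n

def spot(entry_features, breath_group_features, threshold=0.35):
    """Task S: propose (breath_group_id, confidence) pairs for one glossary
    entry across a dictionary of breath-group feature matrices."""
    proposals = []
    for bg_id, feats in breath_group_features.items():
        cost = subsequence_dtw_cost(entry_features, feats)
        if cost <= threshold:
            proposals.append((bg_id, 1.0 - cost))  # crude confidence mapping
    return proposals
```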
Task T (Contiguous Transcription: Manual)
Identify successive forms in a breath group, specifying a time, and linking them to the glossary, expanding the glossary as needed (Task G).
Completion
Here we focus on gaps, aiming to produce contiguous transcriptions. We need to be able to identify how much of a breath group still needs to be transcribed in order to prioritize work. Once a breath group is completed, we want to display a view of the transcription, consisting of form-gloss pairs like (3) or a speech concordance, with support for navigating to the audio or the lexicon.
Task C (Coverage: Automatic)
Calculate the transcriptional coverage for the breath groups of an audio recording.
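A sketch of Task C, reusing the dataclasses of Section 4.1: since tokens carry only a point time, coverage is estimated from the typical durations of the entries attested in a breath group, relative to its measured speech activity, with no reliance on boundaries. The confidence cutoff is an illustrative assumption.

```python
def coverage(breath_group, tokens, glossary_index, min_confidence=0.5):
    """Estimate what fraction of a breath group's speech has been transcribed."""
    covered = 0.0
    for tok in tokens:
        if tok.breath_group != breath_group.id or tok.confidence < min_confidence:
            continue
        entry = glossary_index[tok.entry]          # dict: entry id -> GlossaryEntry
        covered += entry.typical_size or 0.0       # entries of unknown size add nothing
    if breath_group.speech_activity <= 0:
        return 0.0
    return min(1.0, covered / breath_group.speech_activity)
```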
Task V (View: Automatic)
Generate transcript and concordance views of one or more audio recordings.
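A sketch of Task V, rendering transcript and concordance views from the same token data; the plain-text formatting is a placeholder for whatever an interface would actually display.

```python
def transcript_view(breath_groups, tokens, glossary_index):
    """One line per breath group, tokens ordered by time (untimed tokens last)."""
    lines = []
    for bg in sorted(breath_groups, key=lambda b: b.start):
        bg_tokens = sorted((t for t in tokens if t.breath_group == bg.id),
                           key=lambda t: (t.time is None, t.time or 0.0))
        cells = []
        for t in bg_tokens:
            e = glossary_index[t.entry]
            cells.append(f"{e.form or '???'}/{e.gloss or '???'}")
        lines.append(f"[{bg.start:8.2f}-{bg.end:8.2f}]  " + "  ".join(cells))
    return "\n".join(lines)

def concordance_view(entry_id, breath_groups, tokens):
    """All spotted or transcribed occurrences of one entry, for audio navigation."""
    bg_index = {b.id: b for b in breath_groups}
    return [(bg_index[t.breath_group].timeline, bg_index[t.breath_group].start,
             t.confidence) for t in tokens if t.entry == entry_id]
```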
This concludes our formalization of transcription tasks. We envisage that further tasks will be defined, extending this set. Next we turn to the workflows that are built from these tasks.
4.3 Transcription Workflows
Classical Workflows
Following Franz Boas, Ken Hale, and countless others working in the intervening century, we begin by compiling a glossary, and we regularly return to it with revisions and enrichments. We may initialize it by incorporating existing wordlists and eliciting further words (Task G). As we grow the glossary, we ideally record each headword and add the recording to the corpus.
Now we begin collecting and listening to recordings. Speakers readily interpret them utterance by utterance into another language (Task I). These utterances will involve one or more breath groups, and this is an efficient point to create and review the breath groups, possibly splitting or merging them, and bracketing out any extraneous material (Task B). These two tasks might be interleaved.
Next, we leverage our glossary and interpretations to perform word spotting (Task S), producing an initial sparse transcription. We might prioritize transcription of a particular topic, or of texts that are already well covered by our sparse transcription.
Now we are ready for contiguous transcription (Task T), working one breath group at a time, paying attention to regions not covered by our transcription (Task C), and continually expanding the glossary (Task G). Now that we have gained evidence for new words, we return to word spotting (our previous step).
We postpone transcription of any item that is too difficult. We add new forms to the glossary without regard for whether they are morphemes, words, or multiword expressions. We take advantage of the words that have already been spotted in the current breath group (Task S), and of system support that displays them according to confidence and suggests their location. Once it is complete, we visualize the transcription (Task V).
This workflow is notated in (6). Concurrent tasks are comma-separated, and there is no requirement to complete a task before continuing to a later task. We can jump back any number of steps to revise and continue from that point.
G → B, I → S → T, G, C → V
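As a sketch, this notation can be operationalized by treating a workflow as a sequence of stages, each a set of concurrently available tasks; the encoding below is one possible reading of (6) and of the convention that earlier tasks may be revisited at any point.

```python
# Workflow (6) as data: each stage is a set of tasks that may proceed concurrently.
CLASSICAL_WORKFLOW = [{"G"}, {"B", "I"}, {"S"}, {"T", "G", "C"}, {"V"}]

def available_tasks(workflow, stage):
    """Tasks open at a given stage: the stage's concurrent tasks, plus any
    earlier task, since nothing requires a task to be completed before
    continuing and we may jump back to revise."""
    earlier = set().union(*workflow[:stage]) if stage else set()
    return workflow[stage] | earlier
```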
Sociolinguistic Workflow
Example (7) represents a practice in sociolinguistics, where we identify words that reveal sociolinguistic variables and then try to spot them in our audio corpus. When we find some hits, we review the automatically generated breath groups, interpret, and transcribe. This workflow cuts across audio recordings and does not result in a normal text view, only a concordance view.
G → S → B, I, T → V
Elicitation Workflow
Example (8) shows a workflow for elicitation. Here the audio recordings were collected by prompting speakers, perhaps using images. The first step is to align these prompts with the audio.
I → B → T, G
Archival Workflows
Another workflow comes from the context of working with materials found in libraries and archives. We may OCR the texts that appear in the back of a published grammar, then locate these in a large audio collection with the help of grapheme-to-phoneme rules and a universal pronunciation model. We would probably extract vocabulary from the transcriptions to add to the glossary. We might force-align the transcriptions to the audio, and then create tokens for each word. Then begins the process of refactoring the glossary.
Glossary Workflows
Working on the glossary may be a substantial undertaking in itself, and may even be the driver of the sparse transcription activity. Early on, we will not know if a form is a morph, a word, or a multiword expression. The presence of allophony and allomorphy may initially multiply the number of entries. What originally looks like homonymy (two entries with a shared form) may turn out to be a case of polysemy (one entry with two glosses), or vice versa, leading us to refactor the glossary (Task R). Clues to word meaning may come from interpretations (if available), but they are just as likely to be elicited, a process that may generate extended commentary, itself a source text which may be added to the audio corpus. Word-spotting and the concordance view may support lexicographic research over a large audio corpus, which itself may never be transcribed more than required for the lexicographic endeavor.
Collaborative Workflows
Transcribing oral languages is collaborative, and the use of computational methods brings us into the space of computer-supported cooperative language documentation (Hanke 2017). We note that the Sparse Transcription Model is designed to facilitate asynchronous update: e.g., we can insert a token without needing to modify the boundaries of existing tokens; multiple transcribers can be spotting words in the same audio files concurrently; several hypotheses for the token at a given location can be represented. It is a separate task, outside the scope of this article, to model lifecycle workflows covering the selection of audio, allocation of transcription tasks to individuals, and various processes for approving content and making it available.
Transcribing First vs. Translating First
Note that interpretations and transcriptions are independently related to the timeline, and so the model is neutral as to which comes first. The similarity-based clustering of tokens or of lexical entries could be sensitive to interpretations when they are available.
Finally, we can observe that the notation for representing transcription workflows offers a systematic means for documenting transcription practices “in order to get the full picture of which strategies and which forms of knowledge are applied in transcription” (Himmelmann 2018, page 37).
4.4 Evaluation
When it comes to recordings made in field conditions, we need to take seriously the fact that there is no such thing as “gold transcription,” as there are so many sources of significant variability (Cucchiarini 1993; Bucholtz 2007). I am not aware of spontaneous speech corpora where multiple human-validated phone transcriptions of the same audio are supplied; cf. the situation for machine translation where source texts may be supplied with multiple translations to support the BLEU metric (Papineni et al. 2002).
One proposal is to apply abductive inference in the presence of a comprehensive lexicon: “a transcription is ‘good enough’ when a human being looking at the transcription can work out what the words are supposed to be” (Bettinson 2013, page 57). This is the case of inscription, efficiently capturing an event in order to reconstruct it later.
The metric of word error rate is usually applied to orthographic words, which tend to collapse meaningful distinctions or introduce spurious ones (e.g., wind [wɪnd] vs. [waɪnd]; color vs. colour). Instead, we propose “lexeme error rate,” the obvious measure of the correctness of a lexeme-level transcription (Task T).
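A sketch of lexeme error rate as ordinary edit distance over sequences of glossary-entry identifiers, normalized by reference length; hypothesis and reference are compared by the entries their tokens link to, not by orthographic strings.

```python
def lexeme_error_rate(reference, hypothesis):
    """reference, hypothesis: lists of glossary entry ids for one breath group."""
    n, m = len(reference), len(hypothesis)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i                      # deletions
    for j in range(m + 1):
        dist[0][j] = j                      # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = dist[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            dist[i][j] = min(substitution, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return dist[n][m] / max(n, 1)
```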
Word spotting (Task S) can be evaluated using the usual retrieval measures such as precision and recall. This generalizes to the use of word spotting for selecting audio recordings to transcribe. We have no need for measures of segmentation quality. Our measure of completeness (Task C) does not require segmentation. A measure of the quality of lexical normalization still needs to be considered. We must avoid the combinatorial explosion caused by multiplying out morphological possibilities, while retaining human judgments about the acceptability of morphologically complex forms.
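Returning to the retrieval measures for Task S, a sketch of how precision and recall can be computed at the breath-group level: a proposal counts as correct if its entry is attested anywhere in that breath group, so no boundary judgments are involved. The pair-set representation is an assumption for illustration.

```python
def spotting_precision_recall(proposed, attested):
    """proposed, attested: sets of (entry_id, breath_group_id) pairs."""
    true_positives = len(proposed & attested)
    precision = true_positives / len(proposed) if proposed else 0.0
    recall = true_positives / len(attested) if attested else 0.0
    return precision, recall
```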
4.5 Addressing the Requirements
The Sparse Transcription Model meets the requirements for transcription (Section 2.4). We transcribe at the word level, and these words are our meaningful units. We transcribe partially, leveraging translations. We draw on the knowledge and skills of speakers, and produce a result that is simultaneously linked to the audio and the lexicon. To reiterate, the model supports “standard practice to link transcripts [to] recordings, which allows direct and fast access” (Himmelmann 2006a, page 259), while taking seriously the fact that “transcription necessarily involves hypotheses as to the meaning of the segment being transcribed and the linguistic forms being used” (Himmelmann 2018, page 35).
Some significant modeling topics remain. First, the glossary may contain redundancy in the form of variants, constituents, and compositions. We assume there is computational support for refactoring the glossary. Second, nothing has been said about how word spotting should be performed. When identifying new tokens of a glossary entry, we would like to call on all available information, including elicited forms, contextual forms, differently inflected forms, unrelated forms which have syllables in common, and so forth. Available interpretations may contribute to this process. Many approaches are conceivable depending on the typology of the language: Does the language have syllable margins characterized by minimal coarticulation? How synthetic is the language, and what is the degree of agglutination or fusion?
Significant implementation topics remain open. First, we say little about the user interface except that it must be able to display sense-disambiguated word-level transcriptions. Each manual task could have software support of its own, for example, a mobile interface where users swipe right or left to confirm or disconfirm hits for a lexeme in phrasal context. Second, we have not articulated how such tools might be integrated with established methods for language processing. These topics are left for further investigation.
5 Conclusion
Oral languages present a challenge for natural language processing, thanks to the difficulties of mapping from spontaneous speech to the expected text input. The most widely explored approach is to prefix a speech-to-text component to the existing NLP pipeline, keeping alive the hope for “a full-fledged automatic system which outputs not only transcripts, but glossed and translated texts” (Michaud et al. 2018, page 22).
However, this position ignores the interpretive nature of transcription, and it intensifies the transcription bottleneck by introducing tasks that can only be performed by the most scarce and expensive participants, namely linguists. Instead of a deficit framing where “unwritten languages” are developed into written languages, we can adopt a strengths-based model where “oral languages” present their own opportunities. We embrace the interests and capacities of the speech community and design tasks and tools that support local participation. After all, these people represent the main workforce and the main beneficiary of language work.
In developing a new approach, I have shown how a pipeline conception of the processing task has led us to make unhelpful assumptions, to transcribe phones, to transcribe fully, and to transcribe first. Long-established transcriptional practices demonstrate that these assumptions are unwarranted. Instead, I proposed to represent a transcription as a mapping from locations in the speech stream to an inventory of meaningful units. We use a combination of spoken term detection and manual transcription to create these tokens. As we identify more instances of a word, and similar sounding words, we get better at spotting them, and accelerate our work of orthodox, contiguous transcription.
Have we addressed the transcription bottleneck? The answer is a qualified yes. When transcriptions are not the data, we are freed from transcribing everything. When we avoid unnecessary and time-consuming tasks, we can allocate scarce resources more judiciously. When we use spoken term detection to identify recordings of interest, we go straight to our descriptive or analytical tasks. We identify linguistically meaningful units at an early stage without baking in the boundaries. We are more selective in where to “transcribe exhaustively.” This is how we manage the transcription trap.
This approach offers new promises of scalability. It embraces the Zipfian distribution of linguistic forms, allocating effort according to frequency. It facilitates linguists in identifying the words, phrases, and passages of interest. It treats each additional resource—including untranscribed audio and interpretations—as further supervision to help us identify the meaningful units. It postpones discussion of orthographic representation. It offers diverse workflows and flexibility in the face of diverse language situations. Finally, it opens up new opportunities to engage with speakers. These people offer the deepest knowledge about the language and the best chance of scaling up the work. This is their language.
Acknowledgments
I am indebted to the Bininj people of the Kuwarddewardde “Stone Country” in Northern Australia for the opportunity to live and work in their community, where I gained many insights in the course of learning to transcribe Kunwinjku. Thanks to Steve Abney, Laurent Besacier, Mark Liberman, Maïa Ponsonnet, to my colleagues and students in the Top End Language Lab at Charles Darwin University, and to several anonymous reviewers for thoughtful feedback. This research has been supported by a grant from the Australian Research Council, Learning English and Aboriginal Languages for Work.