Sensory processing is increasingly conceived in a predictive framework in which neurons would constantly process the error signal resulting from the comparison of expected and observed stimuli. Surprisingly, few data exist on the accuracy of predictions that can be computed in real sensory scenes. Here, we focus on the sensory processing of auditory and audiovisual speech. We propose a set of computational models based on artificial neural networks (mixing deep feedforward and convolutional networks), which are trained to predict future audio observations from present and past audio or audiovisual observations (i.e., including lip movements). Those predictions exploit purely local phonetic regularities with no explicit call to higher linguistic levels. Experiments are conducted on the multispeaker LibriSpeech audio speech database (around 100 hours) and on the NTCD-TIMIT audiovisual speech database (around 7 hours). Predictions appear to be efficient in a short temporal range (25–50 ms), predicting 50% to 75% of the variance of the incoming stimulus, which could result in potentially saving up to three-quarters of the processing power. Prediction accuracy then decreases quickly and almost vanishes after 250 ms. Adding information on the lips slightly improves predictions, with a 5% to 10% increase in explained variance. Interestingly, the visual gain vanishes more slowly, and the gain is maximum for a delay of 75 ms between image and predicted sound.
1.1 The Predictive Brain
The concept of “predictive brain” progressively emerged in neurosciences in the 1950s (Attneave, 1954; Barlow, 1961). It assumes that the brain is constantly exploiting the redundancy and regularities of the perceived information, hence reducing the amount of processing by focusing on what is new and eliminating what is already known. After half a century of experimental developments, the predictive brain has been mathematically encapsulated by Friston and colleagues into a powerful framework based on Bayesian modeling (Friston, 2005), assimilating such concepts as perceptual inference (Friston, 2003), reinforcement learning (Friston, Daunizeau, & Kiebel, 2009), and optimal control (Friston, 2011). In this framework, it has been proposed that the minimization of free energy, a concept from thermodynamics, could provide a general principle associating perception and action in interaction with the environment in a coherent, predictive process (Friston, Kilner, & Harrison, 2006; Friston, 2010). A number of recent neurophysiological studies confirm the accuracy of the predictive coding paradigm for analyzing sensory processing in the human brain (e.g., Keller & Mrsic-Flogel, 2018).
Actually, predictive coding is a general methodological paradigm in information processing that consists of analyzing the local regularities in an input data stream in order to extract the predictable part of these input data. After optimization of the coding process, the difference signal (i.e., prediction error) provides an efficient summary of the original signal. Technically, this can be cast in terms of minimum description lengths (MDL; Grünwald, Myung, & Pitt, 2005) and accompanying (variational) free energy minimization. The predictive brain is conceived as an inference engine with a fixed structure whose parameters are tuned to provide optimal predictions, that is, predictions of incoming signals from past ones, optimizing mutual information between both sets. This principle was introduced by Barlow (1961) in terms of redundancy reduction.
The information processing system can then focus on the difference between input data and their prediction. In a very general manner, whatever the processing system is, there are two main advantages to processing the difference signal over directly processing the input signal. First, if the prediction is efficient, the difference signal is generally of (much) lower energy than the original signal, which leads to energy consumption saving in subsequent processes and resource saving for representing the signal with a given accuracy (e.g., bit rate saving in an audio or a video coder). In short, this reduces the “cost” of information processing. Second, novel or unpredictable information is concentrated in the difference signal, which can be exploited, for example, for the detection of new events. Because of these advantages, predictive coding has been largely exploited in technological applications, in particular in signal processing for telecommunications (Gersho & Gray, 1992; Jayant & Noll, 1984).
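The first advantage can be illustrated numerically. The following is a minimal sketch, not taken from the study: it assumes a synthetic first-order autoregressive signal (coefficient 0.9, chosen arbitrarily) and a one-tap linear predictor, and it computes the prediction gain, that is, the ratio between the energy of the signal and the energy of the difference signal. The function name `prediction_gain` is hypothetical.

```python
import random


def prediction_gain(signal):
    """Energy of the signal divided by the energy of its prediction error,
    for a one-tap linear predictor x_hat[t] = coeff * x[t-1]."""
    # Optimal one-tap coefficient: lag-1 autocorrelation over lag-0 autocorrelation.
    r0 = sum(x * x for x in signal)
    r1 = sum(a * b for a, b in zip(signal[1:], signal[:-1]))
    coeff = r1 / r0
    # Difference signal (prediction error).
    residual = [signal[t] - coeff * signal[t - 1] for t in range(1, len(signal))]
    signal_energy = sum(x * x for x in signal[1:])
    residual_energy = sum(e * e for e in residual)
    return signal_energy / residual_energy


rng = random.Random(0)

# Assumption: a synthetic, strongly correlated signal x[t] = 0.9 * x[t-1] + noise.
x = [0.0]
for _ in range(5000):
    x.append(0.9 * x[-1] + rng.gauss(0.0, 1.0))

gain = prediction_gain(x)
print(f"prediction gain on the correlated signal: {gain:.2f}")
```

A gain well above 1 means that the difference signal carries much less energy than the input. For an unpredictable signal such as white noise, the same predictor yields a gain close to 1.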
1.2 Predictions in Speech
Speech involves different linguistic levels from the acoustic-phonetic level up to the lexical/syntactic/semantic and pragmatic levels. Each level of language processing is likely to provide predictions (Manning & Schütze, 1999), and globally, automatic speech recognition (ASR) systems are based on statistical predictive models of the structure of speech units in the acoustic input (Jelinek, 1976; Rabiner, 1989; Deng & Li, 2013). Generative grammar traditionally conceives linguistic rules as providing the basis of language complexity at all levels (Jackendoff, 2002), and the existence of correlations between linguistic units at various ranges has been the focus of a large amount of scientific research—for example, by Kaplan and Kay (1994) and Berent (2013) for phonology, Heinz and Idsardi (2011) for lexicon or syntax, and Oberlander and Brew (2000) and Altmann, Cristadoro, and Degli Esposti (2012) for semantics in textual chains. The search for dependencies between linguistic units at various scales may also be related to the more general framework describing long-term dependencies in symbolic sequences (Li, 1990; Li & Kaneko, 1992; Ebeling & Neiman, 1995; Montemurro & Pury, 2002; Lin & Tegmark, 2017).
These different levels of prediction in speech correspond to different temporal scales, ranging from a few tens or hundreds of milliseconds for the lowest linguistic units (i.e., phoneme/syllable) up to a few hundred milliseconds or seconds for the highest ones (i.e., word/phrase/utterance). In the human brain, exploiting these different levels is done by hierarchically organized computational processes that correspond to a large network of cortical areas, as described in a number of recent review papers (e.g., Friederici & Singer, 2015). In this network, fast auditory and phonetic processing is supposed to occur locally in the auditory cortex (superior temporal sulcus/gyrus, STS/STG), while slower lexical access and syntactic processing involve information propagation within a larger network associating the temporal, parietal, and frontal cortices (Hickok & Poeppel, 2007; Giraud & Poeppel, 2012; Friederici & Singer, 2015).
Importantly, Arnal and Giraud (2012) have identified rapid cortical circuits that seem to provide predictions at low temporal ranges in the human brain. They propose that the predictions in time (“when” something important would happen) are based on a coupling between low-frequency oscillations driven by the syllabic rhythm in the delta-theta channel of neural firing (around 2–8 Hz) and midfrequency regulation in the beta channel of neural firing (12–30 Hz). The “what” information would combine top-down predictions conveyed by the beta channel with analysis of the sensory input providing prediction errors to be conveyed to higher centers in a bottom-up process through the gamma channel (30–100 Hz). Such local gamma-theta-beta auditory structures typically operate at relatively short temporal scales (up to a few hundreds of milliseconds) characteristic of phonetic processes and likely to operate at the level of auditory cortical areas in the STS/STG region (Gagnepain, Henson, & Davis, 2012; Mesgarani, Cheung, Johnson, & Chang, 2014). These local circuits mostly operate without lexical and postlexical processes that would require both larger temporal scales and longer cortical loops.
To our knowledge, such predictions occurring at the acoustic-phonetic level have never been quantified. Still, it is of real importance to evaluate the nature and amount of phonetic predictions that can be made locally in the speech input. This letter is focused on the quantitative analysis of temporal predictions in speech signals at a phonetic, sublexical level, using state-of-the-art machine learning models.
1.3 Predictions in Speech Coding Systems
In essence, the largest and historically prominent family of computational models for speech signal predictions is found in speech coding techniques for telecommunications. The vast majority of standardized predictive speech codecs apply prediction of a speech signal waveform sample from a linear combination of the preceding samples in the range of about 1 ms. This is the basis of the famous linear predictive coding (LPC) technique and LPC family of speech coders (Markel & Gray, 1976). The predictor coefficients are calculated over successive so-called short time frames of signal of a few tens of milliseconds (typically 20–30 ms), every 10–20 ms. Globally, LPC techniques may be related to the general principle of MDL minimization mentioned in section 1.1, where the MDL model is the set of prediction coefficients (Kleijn & Ozerov, 2007).
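As an illustration of sample-level linear prediction, the sketch below estimates predictor coefficients from the autocorrelation of a frame using the Levinson-Durbin recursion, a standard way of solving the LPC normal equations. It is a sketch under stated assumptions rather than the procedure of any particular codec: the test signal is a synthetic second-order autoregressive process, and the function names are hypothetical.

```python
import random


def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one frame (no normalization)."""
    n = len(frame)
    return [sum(frame[t] * frame[t - lag] for t in range(lag, n))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations.

    Returns (coeffs, error) where the prediction is
    x_hat[t] = sum_k coeffs[k-1] * x[t-k] and error is the residual energy.
    """
    a = [0.0] * (order + 1)
    error = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for the current order.
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        error *= 1.0 - k * k
    return a[1:], error


# Assumption: synthetic AR(2) signal x[t] = 1.3 x[t-1] - 0.4 x[t-2] + noise.
rng = random.Random(0)
x = [0.0, 0.0]
for _ in range(5000):
    x.append(1.3 * x[-1] - 0.4 * x[-2] + rng.gauss(0.0, 1.0))

r = autocorrelation(x, 2)
coeffs, residual_energy = levinson_durbin(r, 2)
print("estimated predictor coefficients:", [round(c, 2) for c in coeffs])
```

On this synthetic signal, the estimated coefficients should be close to the generating ones, and the residual energy is lower than the signal energy `r[0]`.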
The prediction power of the LPC technique within a single short-term frame has been largely quantified in the speech coding literature. However, this literature has paid little attention to the prediction of speech at the level of one to several short time frames ahead (i.e., a few tens to a few hundred milliseconds; in other words, an intermediary timescale in between the speech sample level and the lexical level). This is mostly due to constraints on latency in telecommunications. For example, only a very few studies have applied some form of predictive coding on vectors of parameters encoding a short-term speech frame. This has been done using differential coding (Yong, Davidson, & Gersho, 1988), recursive coding (Samuelsson & Hedelin, 2001; Subramaniam, Gardner, & Rao, 2006), or Kalman filtering (Subasingha, Murthi, & Andersen, 2009). Yet these approaches are limited to one-step frame prediction. A few other “unconventional” studies (Atal, 1983; Farvardin & Laroia, 1989; Mudugamuwa & Bradley, 1998; Dusan, Flanagan, Karve, & Balaraman, 2007; Girin, Firouzmand, & Marchand, 2007; Girin, 2010; Ben Ali, Djaziri-Larbi, & Girin, 2016) have proposed “long-term” speech coders, which aim at exploiting speech signal redundancy and predictability over larger time spans, typically in the range of a few hundreds of milliseconds. However, these methods actually implement a joint coding of several short-term frames (basically, by using trajectory models or projections) but do not apply any explicit prediction of a frame given past frames. In short, to the best of our knowledge, no study has yet attempted to systematically quantify the predictability of the acoustic speech signal at the phonetic to syllable timescale (i.e., one to several short-term frames). A first objective of this letter is to address this question thanks to a (deep) machine learning approach that we present.
1.4 Visual Potential Contribution to Phonetic Predictions in Speech
Importantly, the visual input can also convey relevant information for acoustic-phonetic predictions. As a matter of fact, pioneer studies such as Besle, Fort, Delpuech, and Giard (2004) and Van Wassenhove, Grant, and Poeppel (2005) showed that the visual component of an audiovisual speech input (e.g., “ba”) could result in decreasing the first negative peak N1 in the auditory event-related potential pattern in electroencephalographic (EEG) data. Peak decrease has been related to the ability of the visual input to provide predictive cues likely to suppress the auditory response displayed in N1. The potential predictive role of vision is supported by behavioral data showing that vision of the speaker's face may indeed provide cues for auditory prediction (e.g., Sánchez-García, Alsius, Enns, & Soto-Faraco, 2011; Venezia, Thurman, Matchin, George, & Hickok, 2016).
It has been claimed that the predictive aspect of visual speech information might be enhanced by the fact that there is often an advance of image on sound in natural speech (Chandrasekaran, Trubanova, Stillittano, Caplier, & Ghazanfar, 2009). Actually, this remains a matter of controversy (Schwartz & Savariaux, 2014). Still, studies on audiovisual speech coders capable of exploiting correlation between audio and visual speech are extremely sparse (though see the pioneering studies in Rao & Chen, 1996, and Girin, 2004). Hence, here also, no systematic quantification of the potential role of the visual input in the predictive coding of speech stimuli has yet been carried out. Providing such a quantification, using a machine learning approach, is the second objective of this study.
1.5 Our Contribution: Modeling and Assessing Mid-Term Predictability in Acoustic and Audiovisual Speech
The goal of this study is to quantify what is really predictable online from the speech acoustic signal and the visual speech information (mostly lip movements). To this aim, we propose to use a series of computational models based on artificial (deep) neural networks trained to predict future acoustic features from past information. We focus on midterm prediction, that is, a prediction at the level of sequences of multiple consecutive short-term frames (from 25 ms to 450 ms in our experiments) and with no explicit access to lexical or postlexical information. Depending on the representation (audio or audiovisual), different network architectures such as feedforward neural networks and convolutional neural networks are used to learn sequences of acoustic and visual patterns. In order to generalize the network speech prediction capabilities across many speakers, these networks are trained on large multispeaker audio and audiovisual speech databases. More specifically, we use the LibriSpeech corpus (Panayotov, Chen, Povey, & Khudanpur, 2015), one of the largest publicly available acoustic speech databases, and the NTCD-TIMIT corpus (Abdelaziz, 2017), one of the largest publicly available audiovisual speech databases.
The choice of a statistical framework based on deep learning was motivated by its ability to build successive levels of increasingly meaningful abstractions in order to learn and perform complex (e.g., nonlinear) mapping functions. By combining different types of generic layers (e.g., fully connected, convolutional, recurrent) and training their parameters jointly from raw data, deep neural networks provide a generic methodology for feature extraction, classification, and regression. Deep learning–based models have led to significant performance improvement in many speech processing problems, for example, acoustic automatic speech recognition (ASR; Abdel-Hamid et al., 2014), speech enhancement (Wang & Chen, 2018), audiovisual and visual ASR (Mroueh, Marcheret, & Goel, 2015; Wand, Koutník, & Schmidhuber, 2016; Tatulli & Hueber, 2017), articulatory-to-acoustic mapping (Bocquelet, Hueber, Girin, Savariaux, & Yvert, 2016), and more generally for tasks involving speech-related biosignals (Schultz et al., 2017). Thus, deep-learning models are here considered as providing an accurate evaluation of the amount of information and regularities present in the auditory and visual inputs and likely to intervene in speech-predictive coding in the human brain. Though substantially different from biological neural networks, artificial deep neural networks provide a computational solution to cognitive questions and may thus provide some insight into the nature of biological processes (Kell, Yamins, Shook, Norman-Haignere, & McDermott, 2018).
The proposed computational models of predictive speech coding enabled us to address the following questions:
How much of the future speech sounds can be predicted from the present and the past ones? What is the temporal range at which acoustic-phonetic predictions may operate, and how much of past information do they capitalize on for predicting future events?
If the visual input (i.e., information on the speaker's lip movements) is added to the acoustic input (i.e., the speech sound), how much gain can occur in prediction, and what temporal window of visual information is typically useful for augmenting auditory predictions? Crucially, can audiovisual predictions confirm the assumption that visual information would be available prior to auditory information in the predictive coding of speech, and by what temporal amount?
2 Materials and Methods
Two publicly available data sets were used in this study. The first is the LibriSpeech corpus, which is derived from read audiobooks from the LibriVox project (Panayotov et al., 2015). In this study, we used the “train-clean-100” subset of LibriSpeech, which contains 100.6 hours of read English speech, uttered by 251 speakers (125 female speakers and 126 male speakers). The second data set is the NTCD-TIMIT data set (Abdelaziz, 2017), which contains audio and video recordings of 59 English speakers, each uttering the same 98 sentences extracted from the TIMIT corpus (Garofolo et al., 1993), that is, 5782 sentences in total, representing around 7 hours of speech. NTCD-TIMIT contains both clean and noisy versions of the audio material. In our study, we used only the clean audio signals. As for the video material, NTCD-TIMIT provides a postprocessed version of raw video sequences of the speaker's face focusing on the region of interest (ROI) around the mouth. This includes cropping, rotation, and scaling of the extracted ROI so that the mouths of all speakers approximately lie on the same horizontal line and have the same width. Each ROI image is finally resized as a pixels 8-bit gray-scale image (Abdelaziz, 2017).
LibriSpeech is our favored data set here for quantifying the auditory speech prediction from audio-only input. Experiments conducted on the NTCD-TIMIT corpus aim more specifically at quantifying the potential benefit of combining audio and visual inputs (over audio-only input) for such prediction. In spite of its reduced size compared to LibriSpeech (around 7 hours and 59 speakers versus 100 hours and 251 speakers), NTCD-TIMIT remains one of the largest publicly available audiovisual data sets of continuous speech.
2.2 Data Preprocessing
For the LibriSpeech corpus, no specific preprocessing of the audio signal was done. For the NTCD-TIMIT corpus, each audiovisual recording was first cropped in order to reduce the amount of silence before and after each uttered sentence. Temporal boundaries of silence portions were extracted from the phonetic alignment file provided with the data set. In order to take into account anticipatory lip gestures, a safe margin of 150 ms of silence was kept intact before and after each recorded sentence.
A sliding window was used to segment each waveform into short-term acoustic frames. A classical frame length of 25 ms was used in our study (400 samples at 16 kHz). Importantly, a frame shift of 25 ms was chosen in order to avoid any overlap between consecutive frames (i.e., the frame shift was set equal to the frame length). This aimed at preventing the introduction of artificial correlation due to shared samples, which could introduce some bias in the midterm prediction (i.e., the prediction of a speech frame given the preceding ones).
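The segmentation described above can be sketched as follows. This is a minimal illustration, assuming that a trailing incomplete frame is simply dropped (the handling of signal edges is not specified in the text); the function name `split_into_frames` is hypothetical.

```python
def split_into_frames(waveform, frame_length=400, frame_shift=400):
    """Cut a waveform into short-term frames.

    With frame_shift equal to frame_length (25 ms at 16 kHz), consecutive
    frames share no samples. Assumption: a trailing incomplete frame is dropped.
    """
    if len(waveform) < frame_length:
        return []
    n_frames = 1 + (len(waveform) - frame_length) // frame_shift
    return [waveform[i * frame_shift: i * frame_shift + frame_length]
            for i in range(n_frames)]


sample_rate = 16000
one_second = list(range(sample_rate))  # stand-in for one second of audio samples
frames = split_into_frames(one_second)
print(len(frames), "frames of", len(frames[0]), "samples")
```

One second of audio at 16 kHz thus yields 40 non-overlapping frames, which corresponds to an analysis rate of 40 Hz.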
The discrete Fourier transform (DFT) was applied on each frame to represent its spectral content. The overall process is referred to as the short-term Fourier transform (STFT) analysis, and the resulting signal representation is the STFT (complex-valued) spectrogram. In our study, a 512-point fast Fourier transform (FFT) was used to calculate each DFT (each 400-sample short-term frame was zero-padded with 112 zeros, and a Hanning analysis window was then applied). Only the first 257 coefficients in the frequency dimension, corresponding to nonnegative frequencies, were retained. Then we computed the log magnitude of the STFT spectrogram (on a dB scale) and rescaled the resulting values to the range dB for each sentence of the data set (the maximum value over each sentence was set to 0 dB and all values below dB were set to dB). Finally, the short-term speech spectrum was converted into a set of so-called Mel-frequency cepstral coefficients (MFCC). Such coefficients were obtained by integrating subbands of the log power spectrum using a set of 40 triangular filters equally spaced on a nonlinear Mel frequency scale and converting the resulting 40-dimensional Mel-frequency log spectrum into a 13-dimensional vector using the discrete cosine transform (DCT). The resulting representation for a complete utterance (a sequence of frames) is referred to as the MFCC spectrogram.
MFCCs are widely used in many fields such as automatic speech recognition (ASR; Rabiner, 1989) and music information retrieval (e.g., classification of musical sound; Kim et al., 2010). MFCC analysis can be seen as a high-level biologically inspired process related to psychoacoustics (i.e., simulating the cochlear filtering). Moreover, MFCC analysis leads to a compact representation of the short-term speech spectrum, which may be of significant interest in the context of statistical learning since it may limit the number of free model parameters to estimate. All the above audio analysis procedures were performed using the Librosa Python open-source library, release 0.6.0 (McFee et al., 2018).
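The analysis chain for a single frame can be sketched in plain Python as follows. This is an illustrative approximation and not the Librosa implementation used in the study. The assumptions are: a symmetric Hann window applied to the 400 samples before zero-padding, the common 2595 log10(1 + f/700) Mel formula, a logarithm taken after the Mel integration, an unnormalized DCT-II, and no dB rescaling. All function names are hypothetical.

```python
import cmath
import math

SAMPLE_RATE = 16000
N_FFT = 512
N_MELS = 40
N_MFCC = 13


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Window the frame, zero-pad it to N_FFT, keep the 257 nonnegative-frequency bins."""
    n = len(frame)
    windowed = [x * 0.5 * (1.0 - math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    padded = windowed + [0.0] * (N_FFT - n)
    spectrum = []
    for k in range(N_FFT // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * t / N_FFT)
                for t, x in enumerate(padded))
        spectrum.append(abs(s) ** 2)
    return spectrum


def mel_filterbank():
    """40 triangular filters whose edges are equally spaced on the Mel scale."""
    top = hz_to_mel(SAMPLE_RATE / 2.0)
    edges_hz = [mel_to_hz(i * top / (N_MELS + 1)) for i in range(N_MELS + 2)]
    bin_freqs = [k * SAMPLE_RATE / N_FFT for k in range(N_FFT // 2 + 1)]
    bank = []
    for m in range(1, N_MELS + 1):
        lo, mid, hi = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        bank.append([max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
                     for f in bin_freqs])
    return bank


def mfcc_from_spectrum(spectrum, bank):
    """Mel integration, log compression (dB), then DCT-II truncated to 13 coefficients."""
    log_mel = [10.0 * math.log10(max(sum(w * p for w, p in zip(row, spectrum)), 1e-10))
               for row in bank]
    return [sum(v * math.cos(math.pi * c * (j + 0.5) / N_MELS)
                for j, v in enumerate(log_mel))
            for c in range(N_MFCC)]


# Stand-in for one 25 ms speech frame: a 1 kHz sinusoid sampled at 16 kHz.
frame = [math.sin(2.0 * math.pi * 1000.0 * t / SAMPLE_RATE) for t in range(400)]
spectrum = power_spectrum(frame)
bank = mel_filterbank()
coeffs = mfcc_from_spectrum(spectrum, bank)
print(len(spectrum), "spectral bins ->", len(coeffs), "cepstral coefficients")
```

The direct DFT used here is slow but keeps the sketch dependency-free; a practical implementation would rely on an FFT routine.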
As concerns the video sequences (for the NTCD-TIMIT corpus), a linear interpolation across successive images in the pixel domain was performed in order to adjust the video frame rate (originally 30 fps) to the analysis rate of the audio recordings (40 Hz). Each frame of pixels was then resized to pixels using linear interpolation. Eight-bit integer pixel intensity values were divided by 255 in order to work with normalized values in the [0, 1] range. Video analysis was performed using the openCV2 Python open-source library (Bradski, 2000; release 18.104.22.168).
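The temporal resampling and intensity normalization of the video stream can be sketched as follows. This is a minimal illustration, assuming that each image is stored as a flat list of pixel intensities and that output instants are spaced regularly from the first image; the function names are hypothetical and the spatial resizing step is not covered.

```python
def resample_frames(frames, src_rate=30, dst_rate=40):
    """Change the frame rate by linear interpolation between successive images.

    Each image is a flat list of pixel intensities. Rates are integers (Hz).
    """
    n_out = ((len(frames) - 1) * dst_rate) // src_rate + 1
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate  # position expressed in source-frame units
        lo = int(pos)
        hi = min(lo + 1, len(frames) - 1)
        w = pos - lo
        out.append([(1.0 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return out


def normalize(image):
    """Map 8-bit intensities to the [0, 1] range."""
    return [p / 255.0 for p in image]


# Toy sequence: four one-pixel images whose intensity grows linearly with time.
video = [[0.0], [30.0], [60.0], [90.0]]
resampled = resample_frames(video)
print([img[0] for img in resampled])
```

Four images at 30 fps span 100 ms, which corresponds to five images at 40 Hz, one every 25 ms.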
2.3 Computational Models of Speech Prediction from Audio-Only Data
2.3.3 General Methodology for Model Training
All parameters (i.e., weights) of FF-DNNs are learned from data, usually by stochastic gradient descent and backpropagation. Briefly, this consists of iterating the following process: (1) evaluating a loss function, which measures the average discrepancy between the prediction of the network and the ground-truth value for a subset of the training data (called a minibatch); (2) calculating the gradient of this loss function with respect to all the network weights, starting from the output layer and backpropagating it through all the hidden layers; and (3) updating all weights using the gradient in order to decrease the loss function. This process is applied over all minibatches of the training data and repeated a certain number of times, called epochs, until the loss function no longer significantly evolves.
In addition to this general process, three strategies are often used to prevent model overfitting and accelerate training convergence: (1) early stopping, which consists of monitoring the loss function on a validation data set and stopping the training once its value has not decreased for a given number of epochs; (2) batch normalization, which consists of applying a transformation so that the inputs to each layer have zero mean and unit variance (Ioffe & Szegedy, 2015); and (3) dropout, which consists of randomly deactivating a fraction of the neurons in a given layer at each training step. In our study, we combined these three strategies.
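The three steps can be made concrete on the simplest possible model. The sketch below assumes a single linear unit (y = w x + b) trained by plain minibatch stochastic gradient descent on synthetic data, so that the gradient can be written by hand; a deep network applies the same loop, with backpropagation providing the gradient for the hidden layers. Names and hyperparameter values are hypothetical.

```python
import random


def train_linear(xs, ys, epochs=50, batch_size=8, learning_rate=0.1, seed=0):
    """Minibatch SGD on a one-unit linear model with a mean squared error loss."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # (1) Prediction errors on the minibatch (the loss is their mean square).
            errors = [(w * xs[i] + b) - ys[i] for i in batch]
            # (2) Gradient of the loss with respect to each parameter.
            grad_w = 2.0 * sum(e * xs[i] for e, i in zip(errors, batch)) / len(batch)
            grad_b = 2.0 * sum(errors) / len(batch)
            # (3) Update the parameters in the direction that decreases the loss.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Assumption: noise-free synthetic data generated by y = 2x - 1.
data_rng = random.Random(1)
xs = [data_rng.uniform(-1.0, 1.0) for _ in range(64)]
ys = [2.0 * x - 1.0 for x in xs]

w, b = train_linear(xs, ys)
print(f"w = {w:.3f}, b = {b:.3f}")
```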
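The early stopping criterion can be sketched as follows, assuming that the validation loss is recorded once per epoch and that training stops once a given number of consecutive epochs (the patience) has passed without improvement. The function name and the example loss values are hypothetical.

```python
def early_stopping_epoch(validation_losses, patience=10):
    """Return the index of the epoch at which training would be stopped."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    # The criterion was never met: training runs until the last epoch.
    return len(validation_losses) - 1


# Hypothetical validation curve: the best value is reached at epoch 2.
losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]
stop = early_stopping_epoch(losses, patience=3)
print("training stops at epoch", stop)
```

In practice, the weights kept are usually those of the best epoch rather than those of the stopping epoch, although the text does not state which option was used here.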
2.3.4 Model Selection and Training
As in many modeling studies based on deep learning, complex architectures require setting a large number of hyperparameters, mostly related to the sizing of the network, a process known as model selection. Several training settings must also be chosen. An exhaustive search for the optimal combination of these hyperparameters and settings is out of reach. Therefore, we optimized only some of them on a subset of each database. We tested combinations of 1, 2, 3, and 4 layers with either 128, 256, or 512 neurons each. This converged to the same architecture for the two data sets, with three groups of 256-neuron fully connected layers. This model is represented in Figure 1a. All models were trained using the Adam optimizer, a popular variant of stochastic gradient descent (Kingma & Ba, 2014), on minibatches of 256 observations. The Leaky ReLU was used as an activation function (for the neurons of the hidden layers). It is defined as $f(x) = x$ for $x > 0$ and $f(x) = \alpha x$ for $x \le 0$ (with a small fixed value of $\alpha$ in our experiments). The mean squared error (MSE) was used as the loss function. In each experiment, 66% of the data (randomly partitioned) were used for training, and the remaining 33% were used for testing. Twenty percent of the training data were used for validation (early stopping). The early-stopping patience was set to 10 epochs.
After model selection, the optimal set of hyperparameters and the same training settings were then used to train and evaluate the final computational models of speech prediction from audio-only data. Two separate series of experiments were conducted. In the first, the models were trained and evaluated using the entire train-clean-100 subset of the LibriSpeech corpus (around 100 hours, 251 speakers). In the second, they were trained and evaluated on the audio data of the entire NTCD-TIMIT audiovisual corpus (around 7 hours, 59 speakers). The latter series of experiments was mostly done for comparison with its audiovisual counterpart.
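The activation and loss functions named above can be written in a few lines. This sketch uses the standard definition of the Leaky ReLU; the slope `alpha=0.01` is a placeholder assumption, not the value used in the experiments.

```python
def leaky_relu(x, alpha=0.01):
    """Standard Leaky ReLU: identity for positive inputs, small slope otherwise.

    Assumption: alpha=0.01 is only a placeholder value.
    """
    return x if x > 0 else alpha * x


def mean_squared_error(predicted, target):
    """Average squared difference between two vectors of equal length."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


print(leaky_relu(2.0), leaky_relu(-2.0, alpha=0.1))
print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))
```

Unlike the plain ReLU, the Leaky ReLU keeps a nonzero gradient for negative inputs, so that a neuron with a negative pre-activation can still be updated.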
Technical implementation of all models was performed using the Keras open-source library (Chollet et al., 2015, release 2.1.3). All models were trained using GPU-based acceleration.
2.4 Computational Models of Speech Prediction from Both Audio and Visual Data
Integration of audio and visual speech information has been largely considered for automatic audiovisual speech recognition (Potamianos, Neti, Gravier, Garg, & Senior, 2003; Mroueh et al., 2015) and also (though much less extensively) for other applications, such as speech enhancement (Girin, Schwartz, & Feng, 2001) and speech source separation (Rivet, Girin, & Jutten, 2007). Basically, the general principle is that integration can be processed at the input signal level (concatenation of the input data from each modality, that is, early integration), at the output level (combination of the outputs obtained separately from each modality, that is, late integration), or somewhere in between those extremes (after some separate processing of the inputs and before final calculation of the output, that is, midlevel integration) (Schwartz, Robert-Ribes, & Escudier, 1998). Artificial neural networks provide an excellent framework for such multimodal integration, since it can be easily implemented with a fusion layer receiving the inputs from different streams and generating a corresponding output. Moreover, the fusion layer can be placed arbitrarily close to the input or the output.
In this study, we propose a computational model of auditory speech prediction from both audio and visual inputs based on artificial neural networks. We adopt the midlevel fusion strategy, which makes it possible to benefit from the design and training of the audio (FF-DNN) network used for predictive coding based on audio-only input presented in the previous section, and from the design and training of a visual model dedicated to processing the lip images.
A convolutional neural network (CNN; LeCun, Bengio, & Hinton, 2015) was used as the core of this visual model. A CNN is a powerful network architecture well adapted to process 2D data for classification and regression. It can extract a set of increasingly meaningful representations along its successive layers. It is thus widely used in image and video processing—for example, object detection (Szegedy et al., 2015), gesture recognition (Baccouche, Mamalet, Wolf, Garcia, & Baskurt, 2012; Ji, Xu, Yang, & Yu, 2013; Karpathy et al., 2014; Simonyan & Zisserman, 2014), or visual speech recognition (Noda, Yamaguchi, Nakadai, Okuno, & Ogata, 2014; Tatulli & Hueber, 2017).
Technically, a CNN is a deep (multilayer) neural network classically composed of one or several convolutional layers, pooling layers, fully connected layers, and one output layer. In a nutshell, a convolutional layer convolves an input 2D image with a set of so-called local filters and then applies a nonlinear transformation to the convolved image. The output is a set of so-called feature maps. Each feature map can be seen as the (nonlinear) response of the input image to the corresponding local filter. One important concept in the convolutional layer is weight sharing, which states that the parameters of each filter remain the same whatever the position of the filter in the image. This allows the CNN to exploit spatial data correlation and build translation-invariant features. A pooling layer then downsamples each feature map in order to build a scale-invariant representation. For example, a so-called max-pooling layer outputs the max value observed on subpatches of a feature map. The convolutional + pooling process can be cascaded several times. A CNN generally ends up with a series of fully connected layers that have the same function as in a standard feedforward deep neural network, as described in section 2.3. Note that in a CNN, the first fully connected layer usually operates over a vectorized form of the downsampled feature maps provided by the last pooling layer.
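A single convolution + pooling stage can be sketched in plain Python as follows. This is a toy illustration with a hand-set filter rather than a learned one; as in most deep learning libraries, the "convolution" is computed without flipping the filter (i.e., as a cross-correlation). The function names, the filter, the ReLU nonlinearity, and the image are assumptions made for the example.

```python
def conv2d_valid(image, kernel):
    """Slide one filter over the image; the same weights are used at every position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def relu_map(feature_map):
    """Nonlinear transformation applied elementwise to the convolved image."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size subpatch."""
    return [[max(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in range(0, len(feature_map[0]) - size + 1, size)]
            for r in range(0, len(feature_map) - size + 1, size)]


# Toy 5 x 5 image containing a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-set filter responding to dark-to-bright vertical edges.
edge_filter = [[-1, 1], [-1, 1]]

feature_map = relu_map(conv2d_valid(image, edge_filter))
pooled = max_pool(feature_map)
print(feature_map[0], pooled)
```

The feature map responds only at the position of the edge, and the pooled map preserves this response at half the resolution.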
We thus first designed and trained such a visual CNN for efficient visual speech feature extraction from speakers' lip images. Its architecture is represented in Figure 1b. This visual CNN maps a sequence of a speaker's lip images into the corresponding future MFCC vector. Because we process a sequence of () images, the 2D convolution is extended to a 3D convolution, including the temporal dimension, as illustrated by the red cube in Figure 1b. Then, the convolutional + pooling layers of the visual CNN and the fully connected layers of the MFCC FF-DNN were selected. These subnetworks were merged using a fully connected fusion layer, which is followed by other usual layers. The resulting network regressing audio and visual data into audio data is represented in Figure 2. Each portion of the present and past MFCC spectrogram and associated sequence of lip images is mapped into an output predicted (future) MFCC vector. This audiovisual model was trained anew with the audiovisual training data.
2.4.3 Model Selection and Training
As concerns the CNN model processing the speakers' lip images (used to initialize the audiovisual model), we tested one, two, or three groups of convolutional or pooling layers. As often done in computer vision tasks involving CNNs (e.g., Noda et al., 2014), the number of filters was incremented in each layer: 16 for the first layer, 32 for the second, 64 for the third. Filters of size , , and were tested. The pooling factor was fixed to .
For the audiovisual model (the one jointly processing audio and visual data to predict audio), we used the subnetworks of the selected audio and visual networks, and we varied only the number of fully connected layers , with either 256 or 512 neurons each.
The selected visual CNN has three groups of convolution pooling layers, with 16, 32, and 64 filters of size , and a single 256-neuron fully connected layer, as represented in Figure 1b. Finally, the audiovisual model merges the subnetworks from the selected audio and visual models using a 256-neuron, fully connected fusion layer, as represented in Figure 2.
Finally, after model selection, both the visual-only model of Figure 1b and the audiovisual model of Figure 2 were (separately) trained on the entire NTCD-TIMIT data set. In each experiment, the settings of the training were very similar to the ones used for training the MFCC spectrogram FF-DNNs (e.g., use of the Adam optimizer, use of 66% of the data set for training, test on the remaining 33%, validation with early stopping on 20% of the training data).
Two metrics were used to assess the prediction performance of the different models: (1) the mean squared error (MSE) between the predicted audio vector and the corresponding ground-truth audio vector (this MSE was also used as a loss function to train the different models) and (2) the weighted explained variance (EV) regression score, evaluating the proportion to which the predicted coefficients account for the variation of the actual ones.
The weighted EV is within the interval (−∞, 1]. A value close to 1 indicates that the error between predicted and ground-truth data is small compared to the ground-truth data themselves, and hence a strong correlation between them. This corresponds to a large prediction gain (larger than 1). An EV value close to 0 generally indicates a poor correlation and a very weak prediction gain (close to 1). Negative EV values indicate that the error is larger than the ground-truth data, hence very inefficient predictions.
Note that in contrast to the EV, the MSE is not weighted and not normalized in any way. Therefore, it is expected to be more sensitive than the EV to potential differences in data sets (e.g., recording material, waveform scaling to avoid clipping). This means that EV values can be more easily compared across our two data sets than MSE values.
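As an illustration, the two metrics can be sketched in plain Python as follows. This is a minimal sketch, not the authors' implementation. The function names are hypothetical. The weighting is assumed to follow the common convention in which the EV of each coefficient is weighted by the variance of the corresponding ground-truth coefficient. The relation gain = 1 / (1 - EV) is also an assumption; it is consistent with the values quoted in this article (e.g., an EV of 0.5 for a gain of 2).

```python
def column_variances(rows):
    """Population variance of each column in a list of equal-length rows."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dims)]
    return [sum((r[j] - means[j]) ** 2 for r in rows) / n for j in range(dims)]


def mean_squared_error(y_true, y_pred):
    """Unweighted MSE, averaged over all frames and all coefficients."""
    total = sum(
        (t - p) ** 2
        for row_true, row_pred in zip(y_true, y_pred)
        for t, p in zip(row_true, row_pred)
    )
    return total / (len(y_true) * len(y_true[0]))


def weighted_explained_variance(y_true, y_pred):
    """Per-coefficient EV, averaged with weights equal to each coefficient's variance.

    Algebraically this equals 1 - sum_j Var(error_j) / sum_j Var(y_j).
    """
    errors = [
        [t - p for t, p in zip(row_true, row_pred)]
        for row_true, row_pred in zip(y_true, y_pred)
    ]
    return 1.0 - sum(column_variances(errors)) / sum(column_variances(y_true))


def prediction_gain(ev):
    """Assumed relation between prediction gain and EV: gain = 1 / (1 - EV)."""
    return 1.0 / (1.0 - ev)
```

Because the EV is normalized by the variance of the ground-truth data, rescaling both the targets and the predictions by the same factor leaves it unchanged, whereas the MSE scales with the square of that factor.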
3 Results and Discussion
The prediction performances of the audio models trained and evaluated on LibriSpeech are presented in Figure 3. The prediction performances of both audio and audiovisual models trained and evaluated on NTCD-TIMIT are presented in Figure 4. Note that in this section, we express the time lags and in milliseconds for convenience of discussion. For example, denotes the weighted explained variance obtained when predicting a 25 ms audio frame 50 ms in the future, looking 75 ms in the past—that is, using the current frame and the three previous past frames.
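To make the relation between time lags and frames concrete, the sketch below builds (context, target) pairs from a sequence of feature frames. It is an illustration only: the function name is hypothetical, and it assumes consecutive, non-overlapping 25 ms frames, so that a past context of 75 ms means the current frame plus three previous frames and a future lag of 50 ms means two frames ahead.

```python
FRAME_MS = 25  # frame duration quoted in the text


def make_pairs(frames, past_ms, future_ms, frame_ms=FRAME_MS):
    """Build (context, target) pairs from a list of per-frame feature vectors.

    context = the current frame plus past_ms / frame_ms previous frames;
    target  = the frame located future_ms / frame_ms steps after the current one.
    """
    if past_ms % frame_ms or future_ms % frame_ms:
        raise ValueError("lags must be multiples of the frame duration")
    n_past = past_ms // frame_ms
    n_future = future_ms // frame_ms
    pairs = []
    for t in range(n_past, len(frames) - n_future):
        context = frames[t - n_past : t + 1]
        target = frames[t + n_future]
        pairs.append((context, target))
    return pairs
```

With a future lag of 0 ms, the target is the current frame itself, which corresponds to the synchronous case considered for the visual-only models.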
3.1 Speech Prediction from Audio Data
3.1.1 General Trends
The results show that it is indeed possible to predict, to a certain extent, the spectral information in the acoustic speech signal in a temporal range of 250 ms following the current frame. As expected, the accuracy of such predictions decreases rapidly when the temporal horizon increases. This evolution more or less follows a logarithmic shape for and an exponential decay toward 0 for . These general trends are observed on both LibriSpeech and NTCD-TIMIT. The audio-only predictive models trained on the large-scale LibriSpeech corpus are globally slightly more accurate than the ones trained on the smaller data set, NTCD-TIMIT. This is likely due to the better generalization capacity of the networks when the data set is larger (in terms of number of speakers and speech material per speaker).
At a future time lag of 25 ms, the weighted explained variance is about 0.75 for the audio-only model trained on LibriSpeech and about 0.65 for the one trained on NTCD-TIMIT (e.g., for ms, on NTCD-TIMIT and on LibriSpeech; compare the cyan solid lines in Figures 3 and 4). This corresponds to an average predictive coding gain around 4 and 2.8, respectively. This provides a rough estimate of the factor by which the power of the error signal (input minus predicted) to be transmitted by neural processes is reduced compared to the original input. It thus provides some quantification of the amount of biological energy that the system might save by exploiting a short-range (25 ms) predictive process.
At a future time lag of 50 ms, the weighted explained variance is about 0.5 on LibriSpeech (e.g., ) and about 0.4 on NTCD-TIMIT (e.g., ), which corresponds to an average prediction gain between 1.7 and 2. For the audio-only models trained on LibriSpeech, prediction becomes poor above ms: the explained variance goes below 0.1 and keeps on decreasing toward 0. For models trained on NTCD-TIMIT, the performance degradation occurs a bit sooner: the explained variance goes below 0.1 for between 100 ms and 150 ms, and prediction keeps on decreasing toward 0. Again, the difference between the results obtained with the two data sets is likely due to their difference in size and thus to the resulting difference in generalization properties of the corresponding models. Nevertheless, these results provide a rather coherent estimation of the temporal window in which acoustical predictions are available, typically around the duration of a syllable.
3.1.2 Impact of Past Information
Another aspect of acoustic prediction concerns the role of the temporal context. Unsurprisingly, adding one context frame to the current one provides significant improvement in the prediction of the next frame for both data sets (e.g., and on LibriSpeech, and on NTCD-TIMIT). Adding a second context frame is also beneficial for the large LibriSpeech corpus (e.g., ) though more marginally for the smaller NTCD-TIMIT corpus (e.g., ). Such past information may enable the model to evaluate speech trajectories and extract relevant information on the current dynamics, related, for example, to formant transitions, known to be crucial in speech perception. Adding a third frame of past context ( ms) marginally improves prediction, but only for larger than about 75 ms and only for LibriSpeech data. Adding a fourth past frame ( ms) provides no further gain. While being related to a different task, such results may be compared to classical ones in automatic speech recognition, where adding first and second derivatives of the spectral parameters is classically considered as the optimal choice for reaching the best performance.
3.1.3 Prediction Accuracy per Class of Speech Sound
A fine-grained analysis of the prediction accuracy for five major classes of speech sounds is presented in Figure 5 (this analysis was conducted on the NTCD-TIMIT data set for which phonetic alignment is available).
Interestingly, the prediction accuracy at ms is in a comparable range for vowels, fricatives, nasals, and semivowels but significantly lower for plosive sounds (i.e., plosives exhibit a significantly larger MSE; see Figure 5). This may be explained by the difficulty of predicting the precise timing of the occlusion release within the plosive closure and the shape of the corresponding short-term spectrum from the prerelease signal. This pattern is also visible but decreased for predictions at ms and ms, probably due to a ceiling effect of the prediction power at these temporal horizons.
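A per-class analysis of this kind can be sketched as below, assuming one phone label per frame obtained from the phonetic alignment. This is a hypothetical illustration, not the authors' analysis code: the function name is invented, and the phone-to-class map is deliberately partial and only covers a few TIMIT-style labels that appear in this article.

```python
from collections import defaultdict

# Hypothetical, partial phone-to-class map (TIMIT-style labels).
PHONE_CLASS = {
    "iy": "vowel", "er": "vowel", "uw": "vowel", "ey": "vowel",
    "d": "plosive", "t": "plosive",
    "s": "fricative",
    "m": "nasal", "n": "nasal",
    "l": "semivowel",
}


def mse_per_class(phone_labels, y_true, y_pred, phone_class=PHONE_CLASS):
    """Average squared prediction error per sound class.

    phone_labels: one phone label per frame;
    y_true, y_pred: one feature vector per frame.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for phone, target, pred in zip(phone_labels, y_true, y_pred):
        cls = phone_class.get(phone)
        if cls is None:  # skip silence and unmapped labels
            continue
        sums[cls] += sum((t - p) ** 2 for t, p in zip(target, pred))
        counts[cls] += len(target)
    return {cls: sums[cls] / counts[cls] for cls in sums}
```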
3.2 Speech Prediction from Audio and Visual Data
The performance of visual-only models (i.e., predictive models of acoustic speech that rely only on lip images, trained and tested on the NTCD-TIMIT corpus) is presented in Figure 6 (left).
As expected, the information provided by the visual modality is real though limited. For example, the best performance obtained at ms is only, which corresponds to a prediction gain of 1.59. This result can be put in perspective with respect to the literature on automatic lip-reading (also known as visual speech recognition), where the typical phone accuracy of a visuo-phonetic decoder that does not exploit any high-level linguistic knowledge (via statistical language models) is between 30% and 40% (i.e., 60% to 70% phone error rate). As with the audio-only models, adding past context frames to the current one provides significant improvement in the prediction accuracy. Most of this improvement is observed when considering past information at ms (one additional lip image). Adding another past context frame (i.e., ms) only marginally improves prediction, and going to three past frames does not provide further improvement.
Interestingly, the performance of visual-only models decreases relatively slowly in comparison with the rapid decrease in prediction accuracy for the audio models (for the NTCD-TIMIT data set) in the same range of time lags. For example, for ms, we have , and for ms, we have . However, on average, the visual modality does not seem to convey useful information above ms (e.g., at ms, ). These results may contribute to the debate in the neuroscience literature on whether lip movements lead the sound because of anticipatory processes in speech production (see Chandrasekaran et al., 2009; Golumbic, Cogan, Schroeder, & Poeppel, 2013). The prediction of the spectral parameters from the lip information is maximal for the frame synchronous with the current input lip image (i.e., ms), or for the next frame ( ms) only for ms, and then decreases smoothly with time. This is not in agreement with a stable advance of lips on sound.
As illustrated in Figure 4, combining audio with visual information improves the prediction over using audio only (compare solid lines with dashed lines). The gain is small but real, increasing the weighted explained variance by up to 0.1 depending on and . To better illustrate the dynamics of the gain brought by the visual input, Figure 6 (right) displays the difference in weighted explained variance between audiovisual models and audio-only models. Importantly, results show a peak in the gain provided by the visual input for a prediction at 75 ms. Therefore, even if there is no systematic lead of lips on sounds, there is a temporal window between 50 ms and 100 ms where the use of visual information is most helpful.
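The difference curve discussed above can be computed from two tables of explained variance indexed by future time lag. The following is a minimal, hypothetical sketch: the function names and the dictionary-based layout (lag in ms mapped to EV) are assumptions for illustration and do not reflect the authors' code or data format.

```python
def visual_gain(ev_audio, ev_audiovisual):
    """EV(audiovisual) - EV(audio-only) for every time lag present in both tables.

    Both arguments map a future time lag (in ms) to a weighted explained variance.
    """
    lags = sorted(set(ev_audio) & set(ev_audiovisual))
    return {lag: ev_audiovisual[lag] - ev_audio[lag] for lag in lags}


def peak_lag(gain):
    """Time lag (in ms) at which the gain from the visual input is largest."""
    return max(gain, key=gain.get)
```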
3.2.1 Qualitative Evaluation
All of these quantitative results were averaged over many test sentences and speakers. Here, we finally discuss from a qualitative point of view the accuracy of the predicted spectral content at the utterance level. An example of a prediction error at ms using either an audio-only or audiovisual predictive model is shown in Figure 7.
For the audio-only model (see the blue curve in plot c), peaks in prediction errors are mainly observed either at the vowel onset of consonant-vowel sequences (e.g., [d-iy], [l-(hh)-er]) or at the onset of the consonant of vowel-consonant sequences (e.g., [er-m], [er-t]), for which the precise initiation of the trajectory after a period of relative stability is hardly predictable. As concerns the audiovisual model, the average gain is accompanied by a large range of variations, leading to fluctuations between large gains and large losses provided by lip movements. A substantial gain from the visual input may occur when the speaker produces preparatory lip gestures before beginning to speak, as in the [s] onset after the silence at the beginning of the utterance. Visible though poorly audible gestures, such as the closure for [m] in [l-ey-m], also lead to a visual gain in prediction. Another source of gain could be related to a coarticulation effect, when lips anticipate the upcoming vowel, as during the first [d] in the utterance, where the stretching gesture starts before the onset of the [iy]. Conversely, cases of error increase due to the visual input concern occurrences of nonvisible tongue gestures, such as the intensity decrease in the vowel [uw] due to displacement of the tongue apex in the dental region in the following [nd] cluster around 0.4 s, which is detected in the auditory stream but not in the visual stream.
3.3 A Database of Prediction Errors for Future Neurocognitive Experiments
A number of recent neurophysiological experiments have tested the existence and characteristics of predictive patterns in the audio and audiovisual responses to speech in the human brain (Van Wassenhove et al., 2005; Arnal, Morillon, Kell, & Giraud, 2009; Tavano & Scharinger, 2015; Ding, Melloni, Zhang, Tian, & Poeppel, 2016), and predictive processes have recently been the subject of a theoretical review (Keller & Mrsic-Flogel, 2018). Importantly, these experiments lack a ground-truth basis on the natural predictive structure of audio and audiovisual speech, which can lead to misinterpretations or overgeneralizations of observed patterns (Schwartz & Savariaux, 2014). Our study could provide an interesting basis for future studies, providing quantitative knowledge on the amount of “predictability” available in the physical signals considered in the simulations. The source code used to format the data and train and evaluate both audio and audiovisual predictive models on LibriSpeech and NTCD-TIMIT data sets, as well as all simulation results, has been made publicly available on https://github.com/thueber/DeepPredSpeech (for source code) and https://zenodo.org/record/3528068 (for data, simulation results on NTCD-TIMIT and pretrained models, doi:10.5281/zenodo.1487974). We believe that such results could be of interest in future neurophysiological experiments aiming at testing neural predictions in speech processing in the human brain.
4 Conclusion
In the general framework of predictive coding in the human brain, this study aimed at quantifying what is predictable online from the speech acoustic signal and the visual speech information (mostly lip movements). We proposed a set of computational models based on artificial (deep) neural networks that were trained to predict future audio observations from past audio or audiovisual observations. Model training and evaluation were performed on two large and complementary multispeaker data sets, respectively for audio and audiovisual signals, both publicly available. The key results of our study are these:
It is possible to predict the spectral information in the acoustic speech signal in a temporal range of about 250 ms. At 25 ms, prediction reduces the power of the signal to be transmitted by neural processes (i.e., the error signal instead of the input signal) by a factor of up to four. But the accuracy of the prediction decreases rapidly with future time lag: on the larger of our two tested data sets, the average prediction gain is around 2 (EV 0.5) at 50 ms (with ms), 1.6 (EV 0.37) at 75 ms, and almost 1 (EV 0.05), i.e., no gain, at around 350 ms.
The information provided by the visual modality is real but limited. The prediction accuracy of the predictive model based on visual-only information does not evidence a stable advance of lips on sound (as sometimes stated in the literature). The maximum average gain provided by the visual input in addition to the audio input is about +0.1 of explained variance and is obtained for a prediction at 75 ms.
Best prediction accuracy is obtained when considering 50 to 75 ms of past context, for both audio and audiovisual models.
Plosives are more difficult to predict than other types of speech sound.
This study hence provides a set of quantitative evaluations of auditory and audiovisual predictions at the phonetic level, which could be exploited in predictive coding models of speech processing in the human auditory system. These evaluations are based on a specific class of statistical models based on deep learning techniques. Of course, it can be envisioned that the number of regularities in the speech signal might actually be larger than what has been captured by deep learning techniques in this study. Still, the large amount of data exploited here and the acknowledged efficacy of deep learning techniques make us confident that the estimates provided in this work give a reasonable order of magnitude for the regularities that statistical models can capture.
As we stated in section 1, predictive coding should operate at a number of higher stages in speech neurocognitive processing, related to lexical, syntactic, and semantic/pragmatic levels exploiting wider temporal scales. Our study should hence be considered as just a first stage in the analysis of predictive coding in speech processing. It provides a baseline along which further studies on higher-level predictive stages can be evaluated quantitatively, comparing the number of additional predictions that can occur from linguistic models to this audiovisual phonetic reference. Future work will focus on the integration of this linguistic level in a more complete neurocognitive architecture for speech predictive coding.
Such coders are limited to speech storage since interactive communication is not feasible with the resulting high latency.
This work has been supported by the European Research Council under the European Community Seventh Framework Programme (FP7/2007-2013 grant agreement 339152, Speech Unit(e)s).