Abstract
When children learn new words, they employ constraints such as the mutual exclusivity (ME) bias: A novel word is mapped to a novel object rather than a familiar one. This bias has been studied computationally, but only in models that use discrete word representations as input, ignoring the high variability of spoken words. We investigate the ME bias in the context of visually grounded speech models that learn from natural images and continuous speech audio. Concretely, we train a model on familiar words and test its ME bias by asking it to select between a novel and a familiar object when queried with a novel word. To simulate prior acoustic and visual knowledge, we experiment with several initialization strategies using pretrained speech and vision networks. Our findings reveal the ME bias across the different initialization approaches, with a stronger bias in models with more prior (in particular, visual) knowledge. Additional tests confirm the robustness of our results, even when different loss functions are considered. Based on detailed analyses of the model’s representation space, we attribute the ME bias to how familiar and novel classes are distinctly separated in the resulting space.
1 Introduction
When children learn new words, they employ a set of basic constraints to make the task easier. One such constraint is the mutual exclusivity (ME) bias: When a learner hears a novel word, they map it to an unfamiliar object (whose name they don’t know yet), rather than a familiar one. This strategy was first described by Markman and Wachtel (1988) over 30 years ago and has since been studied extensively in the developmental sciences (Merriman et al., 1989; Markman et al., 2003; Mather and Plunkett, 2009; Lewis et al., 2020). With the rise of neural architectures, recent years saw renewed interest in the ME bias, this time from the computational modeling perspective: Several studies have examined whether and under which conditions the ME bias emerges in machine learning models (Gulordava et al., 2020; Gandhi and Lake, 2020; Vong and Lake, 2022; Ohmer et al., 2022).
The models in these studies normally receive input consisting of word and object representations, as the ME strategy is used to learn mappings between words and the objects they refer to. Object representations vary in their complexity, from symbolic representations of single objects (e.g., Gandhi and Lake, 2020) to continuous vectors encoding a natural image (e.g., Vong and Lake, 2022). Word representations, however, are based on their written forms in all these studies. For example, the textual form of the word fish has an invariable representation in the input. This is problematic because children learn words from continuous speech, and there is large variation in how the word fish can be realized depending on the word duration, prosody, the quality of the individual sounds, and so on; see, e.g., Creel (2012) on how atypical pronunciations such as [fesh] instead of [fish] interact with the ME bias. As a result, children face an additional challenge compared to models trained on written words. This is why it is crucial to investigate the ME bias in a more naturalistic setting, with models trained on word representations that take into account variation between acoustic instances of the same word.
Recently, there has been considerable headway in the development of visually grounded speech models that learn from images paired with unlabeled speech (Harwath et al., 2016, 2018a; Kamper et al., 2019; Chrupała, 2022; Peng and Harwath, 2022a; Peng et al., 2023; Berry et al., 2023; Shih et al., 2023). Several studies have shown, for instance, that these models learn word-like units when trained on large amounts of paired speech–vision data (Harwath and Glass, 2017; Harwath et al., 2018b; Olaleye et al., 2022; Peng and Harwath, 2022c; Nortje and Kamper, 2023; Pasad et al., 2023). Moreover, some of these models draw inspiration from the way infants acquire language from spoken words that co-occur with visual cues across different situations in their environments (Miller and Gildea, 1987; Yu and Smith, 2007; Cunillera et al., 2010; Thiessen, 2010). However, the ME bias has not been studied in these models.
In this work we test whether visually grounded speech models exhibit the ME bias. We focus on a recent model by Nortje et al. (2023), as it achieves state-of-the-art performance in a few-shot learning task that resembles the word learning setting considered here. The model’s architecture is representative of many of the other recent visually grounded speech models: It takes a spoken word and an image as input, processes these independently, and then relies on a word-to-image attention mechanism to learn a mapping between a spoken word and its visual depiction. We first train the model to discriminate familiar words. We then test its ME bias by presenting it with a novel word and two objects, one familiar and one novel. To simulate prior acoustic and visual knowledge that a child might have already acquired before word learning, we additionally explore different initialization strategies for the audio and vision branches of the model.
To preview our results, we observe the ME bias across all the different initialization schemes of the visually grounded speech model, and the bias is stronger in models with more prior visual knowledge. We also carry out a series of additional tests to ensure that the observed ME bias is not merely an artefact, and present analyses to pinpoint the relationship between the model’s representation space and the emergence of the ME bias. In experiments where we look at different modeling options (visual initialization and loss functions, in particular), the ME bias is observed in all cases. The code and the accompanying dataset are available from our project website.1
2 Related Work
Visually grounded speech models learn by bringing together representations of paired images and speech while pushing mismatched pairs apart. These models have been used in several downstream tasks, ranging from speech–image retrieval (Harwath et al., 2018b) and keyword spotting (Olaleye et al., 2022) to word (Peng and Harwath, 2022c) and syllable segmentation (Peng et al., 2023).
In terms of design choices, early models used a hinge loss (Harwath et al., 2016, 2018b), while several more advanced losses have been proposed since (Petridis et al., 2018; Peng and Harwath, 2022a; Peng et al., 2023). A common strategy to improve performance is to initialize the vision branch using a supervised vision model, e.g., Harwath et al. (2016) used VGG, Harwath et al. (2020) used ResNet, and recently Shih et al. (2023) and Berry et al. (2023) used CLIP. For the speech branch, self-supervised speech models like wav2vec2.0 and HuBERT have been used for initialization (Peng and Harwath, 2022c). Other extensions include using vector quantization in intermediate layers (Harwath et al., 2020) and more advanced multimodal attention mechanisms to connect the branches (Chrupała et al., 2017; Radford et al., 2021; Peng and Harwath, 2022a, b).
In this work we specifically use the few-shot model of Nortje et al. (2023) that incorporates many of these strategies (Section 5). We also look at how different design choices affect our analysis of the ME bias, e.g., using different losses (Section 7.4).
As noted already, previous computational studies of the ME bias have exclusively used the written form of words as input (Gulordava et al., 2020; Gandhi and Lake, 2020; Vong and Lake, 2022; Ohmer et al., 2022). Visually grounded speech models have the benefit that they can take real speech as input. This better resembles the actual experimental setup with human participants (Markman and Wachtel, 1988; Markman, 1989).
Concretely, since the models in Gulordava et al. (2020) and Vong and Lake (2022) are trained on written words, which are discrete by design, they need to learn a continuous embedding for each of the input classes. However, this makes dealing with novel inputs difficult: If a model never sees a particular item at training time, its embeddings are never updated and remain randomly initialized. As a result, the ME test becomes a comparison of learned vs random embeddings instead of novel vs familiar. To address this issue, Gulordava et al. (2020) use novel examples in their contrastive loss during training, while Vong and Lake (2022) perform one gradient update on novel classes before testing. These strategies mean that, in both cases, the learner has actually seen the novel classes before testing. Such adaptations are necessary in models taking in written input. In contrast, a visually grounded speech model, even when presented with an arbitrary input sequence, can place it in the representation space learned from the familiar classes during training. We investigate whether such a representation space results in the ME bias.
3 Mutual Exclusivity in Visually Grounded Speech Models
Mutual exclusivity (ME) is a constraint used to learn words. It is grounded in the assumption that an object, once named, cannot have another name. The typical setup of a ME experiment (Markman and Wachtel, 1988) involves two steps and is illustrated in Figure 1. First, the experimenter will ensure that the learner (usually a child) is familiar with a set of specific objects by assessing their ability to correctly identify objects associated with familiar words. In this example, the familiar classes are ‘clock’, ‘elephant’, and ‘horse’, as illustrated in the top panel of the figure. Subsequently, at test time the learner is shown a familiar image (e.g., elephant) and a novel image (e.g., guitar) and is asked to determine which of the two corresponds to a novel spoken word, e.g., guitar (middle panel in the figure). If the learner exhibits a ME bias, they would select the corresponding novel object, guitar in this case (bottom panel).
Our primary objective is to investigate the ME bias in computational models that operate on the audio and visual modalities. These models, known as visually grounded speech models, draw inspiration from how children learn words (Miller and Gildea, 1987), by being trained on unlabeled spoken utterances paired with corresponding images. The models learn to associate spoken words and visual concepts, and often do so by predicting a similarity score for a given audio utterance and an input image. This score can then be used to select between competing visual objects given a spoken utterance, as required in the ME test.
4 Constructing a Speech–Image Test for Mutual Exclusivity
To construct our ME test, we need isolated spoken words that are paired with natural images of objects. We also need to separate these paired word–image instances into two sets: familiar classes and novel classes. A large multimodal dataset of this type does not exist, so we create one by combining several image and speech datasets.
For the images, we combine MS COCO (Lin et al., 2014) and Caltech-101 (Fei-Fei et al., 2006). MS COCO contains 328k images of 91 object categories in their natural environment. Caltech-101 contains 9k Google images spanning 101 classes. Ground truth object segmentations are available for both these datasets. During training, we use entire images, but during evaluation, we use segmented objects. This resembles a naturalistic learning scenario in which a learner is familiarized with objects by seeing them in a natural context, but is presented with individual objects (or their pictures) in isolation at test time.
For the audio, we combine the FAAC (Harwath and Glass, 2015), Buckeye (Pitt et al., 2005), and LibriSpeech (Panayotov et al., 2015) datasets. These English corpora respectively span 183, 40, and 2.5k speakers.
To select familiar and novel classes, we do a manual inspection to make sure that object segmentations for particular classes are of a reasonably high quality and that there are enough spoken instances for each class in the segmented speech data (at least 100 spoken examples per class). As an example of an excluded class, we did not use curtain: After segmentation it was often difficult to tell that the resulting image depicted a curtain. The final result is a setup with 13 familiar classes and 20 novel classes, as listed in Table 1.
Table 1: The familiar and novel classes used in our experiments.

| Set | Classes |
|---|---|
| Familiar | bear, bird, boat, car, cat, clock, cow, dog, elephant, horse, scissors, sheep, umbrella |
| Novel | ball, barrel, bench, buck, bus, butterfly, cake, camera, canon, chair, cup, fan, fork, guitar, lamp, nautilus, piano, revolver, toilet, trumpet |
During training (Figure 1, top panel), a model only sees familiar classes. We divide our data so that we have a training set with 18,279 unique spoken word segments and 94,316 unique unsegmented natural images spanning the 13 familiar classes. These are then paired up for training as explained in Section 5.1. During training we also use a development set for early stopping; this small set consists of 130 word segments and 130 images from familiar classes.
For ME testing (Figure 1, middle panel) we require a combination of familiar and novel classes. Our test set in total consists of 8,575 spoken word segments with 22,062 segmented object images. To implement the ME test, we sample 1k episodes: Each episode consists of a novel spoken word (query) with two sampled images, one matching the novel class from the query and the other containing a familiar object. We ensure that the two images always come from the same image dataset to avoid any intrinsic dataset biases. There is no overlap between training, development, and test samples.
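To make the episode construction concrete, the following minimal Python sketch shows how a single test episode could be sampled; the data-structure names (`audio_by_class`, `images_by_class_dataset`) and dataset identifiers are hypothetical illustrations, not part of our released code.

```python
import random

def sample_me_episode(audio_by_class, images_by_class_dataset,
                      novel_classes, familiar_classes,
                      datasets=("coco", "caltech101")):
    """One ME episode: a novel spoken-word query plus two segmented object images,
    one matching the query's novel class and one from a familiar class.
    Both images come from the same image dataset to avoid dataset-specific cues."""
    novel_class = random.choice(novel_classes)
    familiar_class = random.choice(familiar_classes)
    dataset = random.choice(datasets)
    query_audio = random.choice(audio_by_class[novel_class])
    novel_image = random.choice(images_by_class_dataset[novel_class, dataset])
    familiar_image = random.choice(images_by_class_dataset[familiar_class, dataset])
    return query_audio, novel_image, familiar_image

# 1,000 such episodes make up the ME test:
# episodes = [sample_me_episode(...) for _ in range(1000)]
```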
5 A Visually Grounded Speech Model
We want to establish whether visually grounded speech models exhibit the ME bias. While there is a growing number of speech–image models (Section 2), many of them share the same general methodology. We therefore use a visually grounded speech model that is representative of the models in this research area: the Multimodal attention Network (MattNet) of Nortje et al. (2023). This model achieves top performance in a few-shot word–object learning task that resembles the way infants learn words from limited exposure. Most useful for us is that the model is conceptually simple: It takes an image and a spoken word and outputs a score indicating how similar the inputs are, precisely what is required for ME testing.
5.1 Model
MattNet consists of a vision and an audio branch that are connected with a word-to-image attention mechanism, as illustrated in Figure 2.
A spoken word a is first parameterized as a mel-spectrogram with a hop length of 10 ms, a window of 25 ms, and 40 bins. The audio branch takes this input, passes it through an acoustic network consisting of LSTM and BiLSTM layers, and finally outputs a single word embedding by pooling the sequence of representations along the time dimension with a two-layer feedforward network. This method of encoding a variable-length speech segment into a single embedding is similar to the idea behind acoustic word embeddings (Chung et al., 2016; Holzenberger et al., 2018; Wang et al., 2018; Kamper, 2019).
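As an illustration, a minimal PyTorch/torchaudio sketch of such an audio branch is given below. The mel-spectrogram settings follow the values above (25 ms window, 10 ms hop, 40 bins at 16 kHz), but the layer sizes and the mean-pooling-plus-feedforward head are illustrative assumptions rather than MattNet's exact configuration.

```python
import torch
import torch.nn as nn
import torchaudio

class AudioBranch(nn.Module):
    """Sketch of the audio branch; hyperparameters are illustrative assumptions."""
    def __init__(self, embed_dim=512):
        super().__init__()
        # 40-bin mel-spectrogram with a 25 ms window and 10 ms hop at 16 kHz.
        self.mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_fft=400, win_length=400, hop_length=160, n_mels=40)
        self.lstm = nn.LSTM(40, 256, batch_first=True)
        self.bilstm = nn.LSTM(256, 256, batch_first=True, bidirectional=True)
        self.pool = nn.Sequential(  # two-layer feedforward head producing one embedding
            nn.Linear(2 * 256, 512), nn.ReLU(), nn.Linear(512, embed_dim))

    def forward(self, wav):                  # wav: (batch, num_samples)
        x = self.mel(wav).transpose(1, 2)    # (batch, num_frames, 40)
        x, _ = self.lstm(x)
        x, _ = self.bilstm(x)
        x = x.mean(dim=1)                    # simple mean over time (a simplification)
        return self.pool(x)                  # (batch, embed_dim): one acoustic word embedding
```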
The vision branch is an adaptation of AlexNet (Krizhevsky et al., 2017). An image v is first resized to 224×224 pixels and normalized with means and variances calculated on ImageNet (Deng et al., 2009). The vision branch then encodes the input image into a sequence of pixel embeddings.
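The sketch below illustrates this preprocessing and encoding step; the stock torchvision AlexNet encoder stands in for MattNet's adapted vision branch, and a projection layer (not shown) would typically map the local features to the same dimensionality as the acoustic word embedding.

```python
from torchvision import models, transforms

# Resize to 224x224 and normalize with the standard ImageNet channel statistics.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

encoder = models.alexnet(weights=None).features  # convolutional part of AlexNet

def pixel_embeddings(images):
    """Encode preprocessed images (batch, 3, 224, 224) into a sequence of local
    'pixel' embeddings by flattening the final convolutional feature map."""
    feats = encoder(images)                  # (batch, 256, 6, 6)
    return feats.flatten(2).transpose(1, 2)  # (batch, 36, 256)
```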
The audio and vision branches are connected through a multimodal attention mechanism that takes the dot product between the acoustic word embedding and each pixel embedding. The maximum of these attention scores is taken as the final output of the model, the similarity score S. The idea behind this attention mechanism is to focus on the regions within the image that are most indicative of the spoken word.
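In code, this word-to-image attention amounts to a batched dot product followed by a max, as in the sketch below; it assumes the acoustic word embedding and the pixel embeddings share the same dimensionality (in practice a linear projection can align them).

```python
import torch

def mattnet_similarity(word_emb, pixel_embs):
    """Similarity S between a spoken word and an image: dot product between the
    acoustic word embedding and every pixel embedding, followed by a max.
    word_emb: (batch, d); pixel_embs: (batch, num_pixels, d)."""
    attention = torch.bmm(pixel_embs, word_emb.unsqueeze(-1)).squeeze(-1)  # (batch, num_pixels)
    return attention.max(dim=-1).values                                    # (batch,)
```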
As a reminder from Section 4, the model is trained exclusively on familiar classes and never sees any novel classes during training. Novel classes are also never used as negative examples. We train the model with Adam (Kingma and Ba, 2015) for 100 epochs and use early stopping based on a validation task: The model is presented with a familiar word query and two familiar object images and must identify which image the word refers to. We use the spoken words and isolated object images from the development set for this task (see Section 4).
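The training objective itself (Equation 1 in the original paper) is not reproduced here; as a rough illustration only, a margin-based contrastive objective of the kind commonly used in visually grounded speech models could look as follows. The margin value and the choice of negatives are assumptions for this sketch, not the published formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(S_pos, S_neg_image, S_neg_audio, margin=1.0):
    """Illustrative hinge-style contrastive loss (not the paper's exact Equation 1).
    S_pos: similarities of matched word-image pairs; S_neg_image / S_neg_audio:
    similarities of the same words (images) paired with mismatched images (words).
    All inputs have shape (batch,)."""
    loss_image = F.relu(margin - S_pos + S_neg_image)  # push mismatched images below matches
    loss_audio = F.relu(margin - S_pos + S_neg_audio)  # push mismatched words below matches
    return (loss_image + loss_audio).mean()

# One optimization step with Adam:
# optimizer = torch.optim.Adam(model.parameters())
# optimizer.zero_grad()
# loss = contrastive_loss(S_pos, S_neg_image, S_neg_audio)
# loss.backward(); optimizer.step()
```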
5.2 Different Initialization Strategies as a Proxy for Prior Knowledge
The ME bias has been observed in children at the age of around 17 months (e.g., Halberda, 2003). At this age, children have already gained valuable experience from both spoken language used in their surroundings and the visual environment that they navigate (Clark, 2004). For example, 4.5-month-olds can recognize objects (Needham, 2001), and 6.5-month-olds can recognize some spoken word forms (Jusczyk and Aslin, 1995). These abilities can be useful when learning new words. In light of this, we adopt an approach that initializes the vision and audio branches of our model to emulate prior knowledge.
For the vision branch, we use the convolutional encoder of the self-supervised AlexNet (Koohpayegani et al., 2020), which distills the SimCLR ResNet50×4 model (Chen et al., 2020) into AlexNet and trains it on ImageNet (Deng et al., 2009). For the audio branch, we use an acoustic network (van Niekerk et al., 2020) pretrained on the LibriSpeech (Panayotov et al., 2015) and Places (Harwath et al., 2018a) datasets using a self-supervised contrastive predictive coding (CPC) objective (Oord et al., 2019). Both these initialization networks are trained without supervision directly on unlabeled speech or vision data, again emulating the type of data an infant would be exposed to. When these initialization strategies are not in use, we initialize the respective branches randomly.
Considering these strategies, we end up with four possible MattNet variations: one where both the vision and audio branches are initialized from pretrained networks, one where only the audio branch is initialized from a CPC model, one where only the vision branch is initialized from AlexNet, and one where neither branch is initialized with pretrained models (i.e., a full random initialization).
In the following sections, we present our results. We compare them to the performance of a naive baseline that chooses one of the two images at random for a given word query. To determine whether the differences between our model variations and a random baseline are statistically significant, we fit mixed-effects regression models to MattNet’s scores using the lme4 package (Bates et al., 2015). Details are given in Appendix A.
6 Mutual Exclusivity Results
Our main question is whether visually grounded models like MattNet (Section 5) exhibit the ME bias. To test this, we present the trained model with two images: one showing a familiar and one showing a novel object. The model is then prompted to identify which image a novel spoken word refers to (Section 3). We denote this ME test as the familiar–novel test. With this, we also introduce our notation for specific tests: <image one type>–<image two type>. The class of the audio query matches one of the two images, unless explicitly stated otherwise. Whenever the test name alone is ambiguous, we state the query type explicitly (e.g., the familiar–novel test in Section 7.1 uses a familiar query, whereas the ME test here uses a novel one). Table 6 in Appendix B contains a cheat sheet for the tests’ notation.
Before we look at our target familiar–novel ME test, it is essential to ensure that our model has successfully learned to distinguish the familiar classes encountered during training; testing for the ME bias would be premature if the model does not know the familiar classes. We therefore perform a familiar–familiar test, where the task is to match a word query from a familiar class to one of two images containing familiar classes.
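Both tests reduce to the same two-alternative procedure: compute the similarity S between the query and each of the two images and pick the image with the higher score. A minimal sketch is given below; the `model` callable returning S is a hypothetical stand-in for the network described in Section 5.

```python
import torch

def two_image_accuracy(model, episodes):
    """Accuracy over test episodes, where each episode is a tuple
    (query_audio, target_image, other_image) and model(audio, image)
    returns the similarity score S."""
    correct = 0
    with torch.no_grad():
        for query, target, other in episodes:
            correct += int(model(query, target) > model(query, other))
    return correct / len(episodes)
```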
Table 2 presents the results of these two tests for the different MattNet variations described in Section 5.2. The results of the familiar–familiar test show that all the model variations can distinguish between familiar classes. Initializing the vision branch (AlexNet) contributes more than initializing the audio branch (CPC): The two best familiar–familiar models both use vision initialization. Our statistical tests confirm the reported patterns: All model variations are significantly better than the random baseline, and adding the visual (AlexNet) and/or audio (CPC) initialization to the basic model significantly improves MattNet’s accuracy on the familiar–familiar test.
Table 2: Accuracy (%) of the random baseline and the four MattNet initialization variations on the familiar–familiar and familiar–novel (ME) tests.

|   | Model | Audio (CPC) init. | Vision (AlexNet) init. | Familiar–familiar | Familiar–novel |
|---|---|---|---|---|---|
| 1 | Random baseline | N/A | N/A | 50.19 | 49.92 |
| 2 | MattNet | ✗ | ✗ | 72.86 | 57.29 |
| 3 | MattNet | ✗ | ✓ | 85.89 | 59.32 |
| 4 | MattNet | ✓ | ✗ | 75.78 | 55.92 |
| 5 | MattNet | ✓ | ✓ | 83.20 | 60.27 |
We now turn to the ME test. The results are given in the familiar–novel column of Table 2. All MattNet variations exhibit the ME bias, with above-chance accuracy in matching a novel audio segment to a novel image, as also confirmed by our statistical significance test (Appendix A). From the table, the strongest ME bias is found in the MattNet variation that initializes both the audio (CPC) and vision (AlexNet) branches (row 5), followed by the variation with the vision initialization alone (row 3). Surprisingly, using CPC initialization alone reduces the strength of the ME bias (row 2 vs row 4). Again, these results are confirmed by our statistical tests. To summarize: Even the basic MattNet has the ME bias, but the AlexNet initialization makes it noticeably stronger.
To investigate whether the reported accuracies are stable over the course of learning, we consider MattNet’s performance over training epochs on the two tests: familiar–familiar and familiar–novel. We use the model variation with the strongest ME bias, i.e., with both the audio and vision branches initialized. Figure 3 shows that the ME bias (familiar–novel, green solid line) is stronger early on in training and then decreases later on. The pattern is similar for the familiar–familiar score (red dashed line), but the highest score in this case is achieved later in training than the best familiar–novel score. The scores stabilize after approximately 60 epochs; at this epoch, the model’s accuracy is 84.38% on the familiar–familiar task and 56.78% on the familiar–novel task (numbers not shown in the figure). This suggests that the results reported above for both tests are robust and do not only hold for a particular point in training.
In summary, we found that a visually grounded speech model, MattNet, learns the familiar set of classes and thereafter exhibits a consistent and robust ME bias. This bias gets stronger when the model is initialized with prior visual knowledge, although the results for the audio initialization are inconclusive. Whereas the strength of the ME bias slightly changes as the model learns, it is consistently above chance, suggesting that this is a stable effect in our model.
7 Further Analyses
We have shown that our visually grounded speech model has the ME bias. However, we need to make sure that the observed effect is really due to the ME bias and is not a fluke. In particular, because our model is trained on natural images, additional objects might appear in the background, and there is a small chance that some of these objects are from the novel classes. As a result, the model may learn something about the novel classes due to information leaking from the training data. Here we present several sanity-check experiments to show that we observe a small leakage for one model variant, but it does not account for the strong and consistent ME bias reported in the previous section. Furthermore, we provide additional analyses that show how the model structures its audio and visual representation spaces for the ME bias to emerge.
7.1 Sanity Checks
The familiar–novel (novel query) column in Table 3 repeats the ME results from Section 6. We now evaluate these ME results against three sanity-check experiments.
Table 3: Accuracy (%) on the ME test (familiar–novel with a novel query) and on the three sanity-check tests of Section 7.1.

|   | Model | Audio | Vision | Familiar–novel (novel query) | Novel–novel | Familiar–novel* | Familiar–novel (familiar query) |
|---|---|---|---|---|---|---|---|
| 1 | Random baseline | N/A | N/A | 49.92 | 49.85 | 49.72 | 50.58 |
| 2 | MattNet | ✗ | ✗ | 57.29 | 51.05 | 55.52 | 69.68 |
| 3 | MattNet | ✗ | ✓ | 59.32 | 48.74 | 58.51 | 86.92 |
| 4 | MattNet | ✓ | ✗ | 55.92 | 50.52 | 53.41 | 70.93 |
| 5 | MattNet | ✓ | ✓ | 60.27 | 49.92 | 58.41 | 82.88 |
We start by testing the following: If indeed the model has a ME bias, it should not make a distinction between two novel classes. So we present MattNet with two novel images and a novel audio query in a novel–novel test. Here, one novel image depicts the class referred to by the query, and the other image depicts a different novel class. If the model does not know the mappings between novel words and novel images, it should randomly choose between the two novel images. The results for this novel–novel test in Table 3 are close to 50% for all MattNet’s variations, as expected.
Surprisingly, our statistical test shows significant differences between the baseline and two out of the four variations: MattNet with full random initialization scores higher than the baseline on this novel–novel task, and MattNet with the vision initialization lower. Since the differences between each model and the baseline are small and in different directions (one model scores lower and the other higher), we believe these patterns are not meaningful. At the same time, one possible explanation of the above-chance performance of MattNet with random initialization is that there may be some leakage of information about the novel classes that may appear in the background of the training images. To test whether our ME results can be explained away by this minor leakage, we observe that the model’s scores in the familiar–novel task (the ME task) are noticeably higher than the scores in the novel–novel task. An additional statistical test (Appendix A) shows that the differences between MattNet’s scores across the two tasks are, indeed, statistically significant for three out of the four variations (except the one with the audio initialization alone). This suggests that the ME bias cannot be explained away by information leakage for most model variations.
To further stress test that the model does not reliably distinguish between novel classes, we perform an additional test: familiar–novel*. In the standard familiar–novel ME test, the model is presented with a familiar class (e.g., elephant) and a novel class (guitar) and correctly matches the novel query word guitar to the novel class. If the model truly uses a ME bias (and not a mapping between novel classes and novel words that it could potentially infer from the training data), then it should still select the novel image (guitar) even when prompted with a mismatched novel word, say ball. Therefore, we construct a test to see whether a novel audio query would still be matched to a novel image even if the novel word does not refer to the class in the novel image. Results for this familiar–novel* test in Table 3—where the asterisk indicates a mismatch in classes—show that the numbers are very close to those in the standard familiar–novel ME test. All the MattNet variations therefore exhibit a ME bias: A novel word query belongs to any novel object, even if the two are mismatched, since the familiar object already has a name. Our statistical tests support this result.
Finally, in all the results presented above, MattNet has a preference for a novel image. One simple explanation that would be consistent with all these results (but would render them trivial) is that the model always chooses a novel object when encountering one, regardless of the input query. To test this, we again present the model with a familiar and a novel object, but now query it with a familiar word. The results for this familiar–novel test (with a familiar query) in Table 3 show that all MattNet variations achieve high scores in selecting the familiar object. Again, our significance test confirms that all the scores are significantly higher than random.
7.2 Why Do We See a ME Bias?
We have now established that the MattNet visually grounded speech model exhibits the ME bias. But this raises the question: Why does the model select the novel object rather than the familiar one? How is the representation space organized for this to happen? We attempt to answer these questions by analyzing different cross-modal audio–image comparisons made in both the familiar–familiar and familiar–novel (ME) tests. Results are given in Figure 4, where we use MattNet with both visual and audio encoders initialized (row 5, Table 3).
First, in the familiar–familiar setting we compare two similarities: (A) the MattNet similarity scores between a familiar audio query and a familiar image from the same class against (B) the similarity between a familiar audio query and a familiar image from a different class (indicated with familiar*). Perhaps unsurprisingly given the strong familiar–familiar performance in Table 2, we observe that the similarities of matched pairs (familiar audio – familiar image, A) are substantially higher than the similarities of mismatched pairs (familiar audio – familiar* image, B). This organization of the model’s representation space can be explained by the contrastive objective in Equation 1, which ensures that the words and images from the same familiar class are grouped together, and different classes are pushed away from one another.
But where do the novel classes fit in? To answer this question, we consider two types of comparisons from the familiar–novel ME setting: (C) the MattNet similarity scores between a novel query and a novel image (from any novel class) against (D) the similarity between a novel query and a familiar image. We observe that the novel audio – novel image similarities (C) are typically higher than the novel audio – familiar image similarities (D). That is, novel words in the model’s representation space are closer to novel images than to familiar images. As a result, a novel query on average is closer to any novel image than to familiar images, which sheds light on why we observe the ME bias.
The similarities involving novel words (C and D) are normally higher than those of the mismatched familiar classes (B). This suggests that novel samples are closer to familiar samples than familiar samples from different classes are to one another. In other words, during training, the model learns to separate out familiar classes (seen during training), but then places the novel classes (not seen during training) relatively close to at least some of the familiar ones. Crucially, samples in the novel regions are still closer to each other (C) than they are to any of the familiar classes (as indicated by D).
How does the contrastive loss in Equation 1 affect the representations of novel classes during training, given that the model never sees any of these novel instances? In Figure 5 we plot the same similarities as in Figure 4, but using the model weights before training. It is clear how training raises the similarities of matched familiar inputs (A) while keeping the similarities of mismatched familiar inputs low (B), which is exactly what the loss is designed to do. But how are novel instances affected? One a priori hypothesis might be that training has only a limited effect on the representations from novel classes. But, by comparing Figures 4 and 5, we see that this is not the case: Similarities involving novel classes change substantially during training (C and D). The model thus uses information from the familiar classes that it is exposed to in order to update the representation space, affecting both seen and unseen classes.
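For reference, the aggregate comparisons A–D above can be computed with a simple helper like the one below, which averages MattNet's similarity score S over all audio–image pairs drawn from the relevant sets; the exact sampling used for Figures 4 and 5 may differ.

```python
def mean_similarity(model, audio_samples, image_samples):
    """Average similarity S over all pairs of the given audio and image samples."""
    scores = [float(model(a, v)) for a in audio_samples for v in image_samples]
    return sum(scores) / len(scores)

# A: familiar audio vs images of the same familiar class
# B: familiar audio vs images of a different familiar class (familiar*)
# C: novel audio vs novel images (any novel class)
# D: novel audio vs familiar images
```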
7.3 Finer-grained Analysis
We have seen a robust ME bias in the aggregated results above. But what do results look like at a finer level? We now consider each of the 20 novel words individually and compute how often the model selects the corresponding novel image (Figure 6a) or any of the familiar images (Figure 6b). While most of the novel words are associated with the ME bias (Figure 6a, dots to the right of the vertical red line), a small number of words yields a strong anti-ME bias when paired with certain familiar classes (Figure 6b, red cells). For example, for the novel word bus, in 91% of the test cases the model picks an image of the familiar class boat rather than an image of the novel class bus. It is worth emphasizing that the ME bias isn’t absolute: Even in human participants it isn’t seen in 100% of test cases. Nevertheless, it is worth investigating why there is an anti-ME bias for some particular words (something that is easier to do in a computational study than in human experiments).
One reason for an anti-ME result is the phonetic similarity of a novel word to familiar words. For example, bus and boat start with the same consonant followed by a vowel. If we look at Figure 6c, which shows the cosine similarities between the learned audio embeddings from MattNet, we see that spoken instances of bus and boat indeed have high similarity. In fact, several word pairs starting with the same consonant (followed by a vowel) have high learned audio similarities, e.g., buck–boat, bench–boat and cake–cat, all translating to an anti-ME bias in Figure 6b.
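A similarity matrix like the one in Figure 6c can be obtained by comparing acoustic word embeddings with cosine similarity; the sketch below averages the embeddings per word class first, which is an assumption about how the figure was produced rather than a description of our exact procedure.

```python
import torch
import torch.nn.functional as F

def audio_class_similarities(audio_branch, words_by_class):
    """Cosine similarities between class-averaged acoustic word embeddings.
    `words_by_class` maps a word class (e.g., 'bus') to a batch of its spoken instances."""
    classes = sorted(words_by_class)
    means = torch.stack([audio_branch(words_by_class[c]).mean(dim=0) for c in classes])
    embs = F.normalize(means, dim=-1)
    return classes, embs @ embs.T  # (num_classes, num_classes)
```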
However, the anti-ME bias cannot be explained by acoustic similarity alone: Some anti-ME pairs have low audio similarities, e.g., nautilus–elephant. For such cases, the representation space must be structured differently from the aggregated analysis in Section 7.2 (otherwise we would see a ME bias for these pairs). Either the spoken or the visual representation of a particular class can be responsible (or both). To illustrate this, we zoom in on the two novel words showing the strongest anti-ME results in Figure 6a: nautilus and chair.
Figure 7a presents an analysis similar to that of Figure 4, but specifically for nautilus. We see the anti-ME bias: Nautilus audio is more similar to familiar images (C) than to nautilus images (A). This is the reverse of the trend in Figure 4 (C vs D). Is this due to the nautilus word queries or the nautilus images? Box B in Figure 7a shows what happens when we substitute the nautilus images from box A with images from any other novel class: The similarity goes up. This means that nautilus images are not placed in the same region of the representation space as the other novel images. But this is not all: Boxes B and C are also close to each other. Concretely, comparing B vs C in Figure 7a with C vs D in Figure 4, we do not see the gap that corresponds to the ME result in the latter case. This means that the nautilus audio is also partially responsible for the anti-ME result, in that it is placed close to familiar images.
Let us do a similar analysis for chair: Figure 7b. We again see the anti-ME results by comparing A and C. But now swapping out the chair images for other novel images (B) does not change the similarities. In this case, the culprit is therefore mainly the chair audio.
Further similar analyses can be done to look at other anomalous cases. But it is worth noting, again, that the aggregated ME scores from Section 6 are typically between 55% and 61% (not 100%). So we should expect some anti-ME trends in some cases, and the analysis in this section shows how we can shed light on those.
7.4 How Specific Are Our Findings to MattNet?
We have considered one visually grounded speech model, namely, MattNet. How specific are our findings to this particular model? While several parts of our model can be changed to see what impact they have, we limit our investigation to two potentially important components: the loss function and the visual network initialization.
Loss Function.
Apart from changing the loss, the rest of the MattNet structure is retained. Results are shown in Table 4 for models that use the self-supervised CPC and AlexNet initializations. Models trained with either of the two alternative losses learn the familiar classes and exhibit a ME bias. In fact, even better familiar–familiar performance and a stronger ME bias (familiar–novel) are obtained with the InfoNCE loss. This loss should therefore be considered in future work studying the ME bias in visually grounded speech models.
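For completeness, a common form of the InfoNCE objective for paired speech–image batches is sketched below; the symmetric formulation and the temperature value are illustrative choices and may differ from the exact loss used in our experiments.

```python
import torch
import torch.nn.functional as F

def infonce_loss(sim_matrix, temperature=0.07):
    """Symmetric InfoNCE over a batch of word-image pairs. sim_matrix[i, j] is the
    similarity S between word i and image j; matched pairs lie on the diagonal."""
    logits = sim_matrix / temperature
    targets = torch.arange(sim_matrix.size(0), device=sim_matrix.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```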
Visual Network Initialization.
In Section 6 we saw that vision initialization contributes most to the ME strength. Here we investigate whether we can get an even greater performance boost if we initialize MattNet using a supervised version of AlexNet instead of the self-supervised variant used thus far. Both the self-supervised (Koohpayegani et al., 2020) and supervised (Krizhevsky et al., 2017) versions of AlexNet are trained on ImageNet (Deng et al., 2009), so we can fairly compare MattNet when initialized with either option. Both MattNet variants shown in Table 5 make use of CPC initialization. We observe that the supervised AlexNet initialization performs better on the familiar–familiar task than the self-supervised initialization. However, the ME (familiar–novel) results with the supervised AlexNet initialization are only slightly higher than with the self-supervised initialization.
Table 5: Accuracy (%) when MattNet’s vision branch is initialized with a self-supervised vs a supervised AlexNet (both variants use CPC audio initialization).

| Vision initialization | Familiar–familiar | Familiar–novel |
|---|---|---|
| Self-supervised | 83.20 | 60.27 |
| Supervised | 87.08 | 61.66 |
While there is a broad space of visually grounded models that could be used to consider the ME task, it is encouraging that all the variants in this work show the bias.
8 Conclusion and Future Work
Mutual exclusivity (ME) is a constraint employed by children learning new words: A novel word is assumed to belong to an unfamiliar object rather than a familiar one. In this study, we have demonstrated that a representative visually grounded speech model exhibits a consistent and robust ME bias, similar to the one observed in children. We achieved this by training the model on a set of spoken words and images and then asking it to match a novel acoustic word query to an image depicting either a familiar or a novel object. We considered different initialization approaches simulating prior language and visual processing abilities of a learner. The ME bias was observed in all cases, with the strongest bias occurring when more prior knowledge was used in the model (initializing the vision branch had a particularly strong effect).
In further analyses we showed that the ME bias is strongest earlier on in model training and then stabilizes over time. In a series of additional sanity-check tests we showed that the ME bias was not an artefact: It could not be explained away by possible information leakage from the training data or by trivial model behaviors. We found that the resulting embedding space is organized such that novel classes are mapped to a region distinct from the one containing familiar classes, and that different familiar classes are spread out over the space to maximize the distance between them. As a result, novel words are mapped onto novel images, leading to a ME bias. Lastly, we showed that the ME bias is robust to model design choices in experiments where we changed the loss function and used a supervised instead of a self-supervised visual initialization approach.
Future work can consider whether using a larger number of novel and familiar classes affects the results. Another interesting avenue for future studies revolves around multilingualism. Following on from the original ME studies with young children, Byers-Heinlein and Werker (2009) and Kalashnikova et al. (2015), among others, have looked at how multilingualism affects the use of the ME constraint. This setting is interesting since in the multilingual case different words from distinct languages are used to name the same object. These studies showed that in bi- and trilingual children from the same age group, the ME bias is not as strong as in monolingual children. We plan to investigate this computationally in future work.
Acknowledgments
This work was supported through a Google DeepMind scholarship for LN and a research grant from Fab Inc. for HK. DO was partly supported by the European Union’s HORIZON-CL4-2021-HUMAN-01 research and innovation programme under grant agreement no. 101070190 AI4Trust. We would like to thank Benjamin van Niekerk for useful discussions about the analysis. We would also like to thank the anonymous reviewers and action editor for their valuable feedback.
Notes
References
A Testing for Statistical Significance
To determine whether the differences between our model variations and a random baseline are statistically significant, we fit two types of logistic mixed-effects regression models to the data, each predicting the model’s (binary) choice in each test episode. All models are fitted using the lme4 package (Bates et al., 2015). Unlike many other statistical tests, mixed-effects models take into account the structure of the data: For example, certain classes or even individual images/queries are used in multiple pairwise comparisons.
The first mixed-effects model tests whether each MattNet variation is better than the random baseline: It uses the MattNet variation as a predictor variable and random intercepts over trials, test episodes, the specific acoustic realization of the test query, individual image classes and their pairwise combinations, and the specific images in each test episode.
The second mixed-effects model does not consider the random baseline, and instead tests whether adding the visual initialization, the audio initialization, or a combination of both improves MattNet: It uses the presence (or absence) of visual initialization and of audio initialization as two binary independent variables, as well as their interaction, and the same random intercepts as described above.
In Section 7.1 we additionally test whether MattNet’s scores in the familiar–novel test are significantly higher than in the novel–novel test. For this, we fit a logistic mixed-effects model to MattNet’s combined scores from both tests, with test type and model variation as predictor variables, together with their interaction, as well as random intercepts as described above.
B Test Notation
Table 6: Cheat sheet for the test notation. An asterisk (*) marks a class that does not match the query’s class.

| Setup | Query audio | Target image | Other image |
|---|---|---|---|
| Familiar–familiar | familiar | familiar | familiar* |
| Familiar–novel (familiar query) | familiar | familiar | novel |
| Familiar–novel (novel query; ME test) | novel | novel | familiar |
| Novel–novel | novel | novel | novel* |
| Familiar–novel* | novel | novel* | familiar |
Author notes
Action Editor: Mauro Cettolo