Speech Segmentation and Cross-Situational Word Learning in Parallel

Abstract
Language learners track conditional probabilities to find words in continuous speech and to map words and objects across ambiguous contexts. It remains unclear, however, whether learners can leverage the structure of the linguistic input to do both tasks at the same time. To explore this question, we combined speech segmentation and cross-situational word learning into a single task. In Experiment 1, when adults (N = 60) simultaneously segmented continuous speech and mapped the newly segmented words to objects, they demonstrated better performance than when either task was performed alone. However, when the speech stream had conflicting statistics, participants were able to correctly map words to objects, but were at chance level on speech segmentation. In Experiment 2, we used a more sensitive speech segmentation measure to find that adults (N = 35), exposed to the same conflicting speech stream, correctly identified non-words as such, but were still unable to discriminate between words and part-words. Again, mapping was above chance. Our study suggests that learners can track multiple sources of statistical information to find and map words to objects in noisy environments. It also prompts questions on how to effectively measure the knowledge arising from these learning experiences.


INTRODUCTION
Learning a new language requires mastering several complex tasks. Research has shown that language learners can use statistical cues from their linguistic environment to overcome some of these challenges. For instance, learners can track conditional probabilities between syllables to discover words in continuous speech, and between words and objects to learn the meaning of novel words across ambiguous situations. The present study explores how tracking conditional probabilities in audiovisual input may help learners solve both tasks simultaneously. We combine two well-established statistical learning tasks, speech segmentation (e.g., Romberg & Saffran, 2010; Saffran et al., 1996) and cross-situational word learning (e.g., Smith & Yu, 2008; Yu & Smith, 2007), into a single paradigm.
Faced with continuous speech and only a few words in isolation (∼10%; Brent & Siskind, 2001), one of the crucial challenges for language learners is to segment streams of words into discrete units. Conditional probabilities between syllables (i.e., transitional probabilities; Krogh et al., 2013; Romberg & Saffran, 2010; Saffran et al., 1996) provide one cue that aids segmentation (for evidence of other cues, see Hay & Saffran, 2012; Johnson et al., 2014). In natural speech, syllables that form words tend to have a higher likelihood of co-occurrence (higher transitional probabilities, TPs) than syllables that span word boundaries (Swingley, 1999; but see Yang, 2004), which provides a potential cue to segmentation. For instance, in the sequence pretty#baby, the TP from pre to ty is greater than the TP from ty to ba; this difference in TP could signal a word boundary for learners (Saffran et al., 1996). There is now a vast empirical literature showing that language learners can track differences in TPs across syllable sequences to segment continuous speech into discrete words (for reviews see Cannistraci et al., 2019; Cunillera & Guilera, 2018; but see Black & Bergmann, 2017). The experimental task in these studies usually starts by familiarizing participants with a continuous speech stream in which TP is the main cue to word boundaries. For instance, some syllables always occur together (creating a word), sometimes occur together (creating a part-word or a low TP word), or never occur together (creating a non-word). Following familiarization, participants' preferences for words, part-words, or non-words are measured. By and large, participants differentiate words from foils (part-words or non-words), suggesting that they successfully tracked TP information to find words in the continuous speech stream.
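To make the notion of a transitional probability concrete, the sketch below estimates TPs from a toy syllable stream. The stream, the function name, and the choice to count a syllable only where it has a successor are illustrative assumptions, not the stimuli or analysis code of any study cited here.

```python
from collections import Counter

def transitional_probabilities(syllables):
    """Return {(a, b): P(b | a)} for adjacent syllables in a stream."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    # Count each syllable only where it has a successor, so that the
    # probabilities leaving any given syllable sum to 1.
    first_counts = Counter(syllables[:-1])
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}

# Hypothetical toy stream: "pre ty" always stays together, but it is
# followed by different words, so the transition out of "ty" is less
# predictable than the transition inside the word.
stream = "pre ty ba by pre ty do gy pre ty ba by pre ty do gy".split()
tps = transitional_probabilities(stream)

print("within word   pre -> ty:", tps[("pre", "ty")])
print("across words  ty  -> ba:", tps[("ty", "ba")])
```

In this toy stream the within-word transition is fully predictable while the transition across the word boundary is not, which is the kind of contrast that could mark a boundary.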
Phonotactic probability (PP), the conditional probability of a syllable occurring in a given position of a word from a given language (Vitevitch & Luce, 2004), is another statistical cue to word boundaries (Benitez & Saffran, 2021; Mattys & Jusczyk, 2001; Mattys et al., 1999). For instance, in the same sound sequence pretty#baby, the English PPs¹ of the words pretty and baby are comparable (≈ 0.0440, ≈ 0.0050, respectively) and both are higher than the PP of the part-word ty#ba (≈ 0.0022), which could signal word boundaries to language learners. The combined information of TPs and PPs can promote speech segmentation, when both cues point to word boundaries, or impair it, when they provide conflicting information about word boundaries. Evidence suggests that this happens when TP is combined with legal versus illegal PPs (Finn & Hudson Kam, 2008), with high versus low PPs (Mersad & Nazzi, 2011), and even with subtle differences in high PPs (Dal Ben et al., 2021). In previous work, we argued that careful consideration of phonotactics from participants' natural languages should be an integral part of the stimuli design of statistical speech segmentation studies (Dal Ben et al., 2021). This is especially true when studying adults, who will promptly bring their extensive learning history and expectations from their natural languages' PPs to the experimental task (Steber & Rossi, 2020; Sundara et al., 2022).
Assigning meaning to words is another challenge for language learners. There is evidence that, early in development, recently segmented words (with stronger TPs) are treated as better candidate labels on subsequent mapping tasks (Graf Estes et al., 2007; Hay et al., 2011). While the benefit of high TP sequences during word learning appears to diminish across development (Mirman et al., 2008; Shoaib et al., 2018), learners continue to be remarkably successful both at segmenting speech using TP information (Saffran et al., 1996; but see Black & Bergmann, 2017) and at making one-to-one mappings between labels and referents (Graf Estes, 2009; Graf Estes et al., 2007; Lany & Saffran, 2010). Furthermore, across the lifespan, language learners rely on phonotactics from their natural languages when learning novel words, with words with stronger PPs being learned faster and more accurately than words with weaker PPs (Storkel et al., 2013; but see Cristia, 2018). However, this might not be true when learning novel words in ambiguous situations (Dal Ben et al., 2022).
In everyday life, several words are presented with several potential referents at the same time, creating ambiguous learning experiences (Quine, 1960). A growing empirical literature shows that learners can track word-object co-occurrences across ambiguous situations to find the meaning of words (for a recent meta-analysis, see Dal Ben et al., 2019; but see Smith et al., 2014). The experimental task in these studies usually familiarizes participants with a series of ambiguous trials. On each trial, two (or more) words are presented with two (or more) objects. On any given trial, there is insufficient information to solve the ambiguity. However, if participants compare word-object conditional probabilities across trials, word-object relations can be learned² (Smith & Yu, 2008; Yu & Smith, 2007).
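As an illustration of how ambiguity within a trial can be resolved across trials, the sketch below accumulates word-object co-occurrences over three hypothetical 2 × 2 trials. The labels, the trial list, and the simple counting rule are assumptions made for the example; the rule is one associative reading of the task and is not meant as a claim about the underlying learning mechanism.

```python
from collections import defaultdict

def cooccurrence_probabilities(trials):
    """Return {word: {object: P(object present | word heard)}}."""
    cooc = defaultdict(lambda: defaultdict(int))
    word_trials = defaultdict(int)
    for words, objects in trials:
        for word in words:
            word_trials[word] += 1
            for obj in objects:
                cooc[word][obj] += 1
    return {
        word: {obj: n / word_trials[word] for obj, n in objs.items()}
        for word, objs in cooc.items()
    }

def best_referent(probs, word):
    """Pick the object that most reliably accompanied the word."""
    return max(probs[word], key=probs[word].get)

# Hypothetical trials: each pairs two words with two objects, so no
# single trial reveals which word goes with which object.
trials = [
    (("tibu", "golo"), ("obj1", "obj2")),
    (("tibu", "pami"), ("obj1", "obj3")),
    (("golo", "pami"), ("obj2", "obj3")),
]
probs = cooccurrence_probabilities(trials)
for word in ("tibu", "golo", "pami"):
    print(word, "->", best_referent(probs, word), probs[word])
```

Each word co-occurs with its referent on every trial in which it appears and with each other object on only half of them, so the mapping emerges only when trials are compared.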
The evidence that statistical information can promote both speech segmentation and cross-situational word learning prompts the question of whether these processes unfold in sequence or in parallel. Related evidence for the latter is reported by . Adults were familiarized with a continuous speech stream and, at the same time, with a stream of objects. When the first word was being played, its corresponding object was displayed on the screen; when the second word started, its corresponding object replaced the previous one, and so forth. From this dynamic presentation, participants were able to segment words from the continuous speech and to map them to their corresponding objects in parallel. In addition, in a follow-up study, François et al. (2017) replicated the findings and showed neurophysiological markers for online simultaneous speech segmentation and mapping. Although these studies have shown that segmentation and mapping can happen in parallel (see also Shukla et al., 2011, for a related task with infants), both used non-ambiguous word learning tasks.
Intuitively, adding mapping ambiguity could make the simultaneous task too challenging. However, Yurovsky et al. (2012) have shown that adults can simultaneously segment labels from phrases and map them to objects across ambiguous presentations. Using an adaptation of the cross-situational word learning paradigm (Yu & Smith, 2007), adults were exposed to scenes with two novel objects. On each trial, they would see only one object and hear a sentence that included a word labeling it among other function words. When the position and the onset of labels in the sentences matched the patterns of their natural language (i.e., final position, label preceded by a small set of words), participants were able to segment the labels and to map them to objects. Despite the additional demands that ambiguity might impose, the authors argued that the parallel solution of segmentation and mapping might happen in continuous iterations, as even partial speech segmentation would reduce mapping ambiguity and vice versa (for similar evidence with multilingual adults see Tachakourt, 2023; for related evidence with other linguistic cues, see Feldman, Myers, et al., 2013). This is in line with proposals by Räsänen and Rasilo (2015). In a comprehensive combination of computational simulations and reanalyses of empirical data, the authors argue that tracking cross-modal conditional probabilities between words and objects in ambiguous situations may boost both speech perception and word learning, in comparison to tracking only TPs or word-object co-occurrences (for a similar argument, see Jones et al., 2010). Moreover, recent meta-analytic findings show that infants effectively integrate audio and visual information, from a variety of sources, when learning language (e.g., Cox et al., 2022; but see Frank et al., 2007, Johnson & Tyler, 2010, and Thiessen, 2010, for potential limits of this integration).
Here we further explore whether the integration of transitional probabilities, phonotactic probabilities, and word-object co-occurrences would promote speech segmentation and word learning across ambiguous presentations. Our study is guided by three main questions. First, we ask whether words can be segmented and mapped at the same time across dynamic ambiguous presentations. To answer this question, we adapted the design by  to combine a speech stream with several new objects in an ambiguous fashion. Second, we ask whether phonotactic properties of our stimuli would impact speech segmentation and cross-situational word learning in parallel. Answering this question allows us to better understand how multiple linguistic statistics can be combined when learning novel words across ambiguous situations (Saffran, 2020; Smith et al., 2018). Third, we ask whether this joint task would improve segmentation and mapping in comparison to separate tasks. To answer this question, we compared our current findings to data from our previous studies testing speech segmentation (Dal Ben et al., 2021) and cross-situational word learning (Dal Ben et al., 2022) separately, but using the same stimuli (same TP and phonotactic properties) and population.
² Here we do not join the productive debate between hypothesis-testing and aggregation as learning mechanisms for cross-situational word learning (e.g., Yurovsky & Frank, 2015), as we believe it is beyond the scope of our study.
OPEN MIND: Discoveries in Cognitive Science

EXPERIMENT 1
To investigate whether words can be segmented and mapped simultaneously and whether differences in phonotactics would impact this joint performance, we exposed participants to continuous speech streams with varying distributions of phonotactics and TPs. At the same time, we also presented them with a series of objects, two at a time, that corresponded to the words in the speech streams. Critically, one of the languages had TPs and phonotactics aligned, consistently pointing to word boundaries. In another language, words and part-words had balanced phonotactics, with TPs being the only informative statistic to word boundaries. In a third language, TPs and phonotactics were in conflict: TPs pointed to word boundaries and phonotactic information pointed to syllables within-words (part-words).
To investigate whether the joint task would improve segmentation and mapping in comparison to separate tasks, we compared segmentation and mapping performance in the present combined task with performance in the individual tasks (i.e., speech segmentation only and cross-situational word learning only; Dal Ben et al., 2021, 2022, respectively).

Method
Participants. Sixty native Brazilian-Portuguese-speaking adults (M age = 21.37 years, ± 3.27 SD, 32 female) participated. None of the participants reported any visual or auditory impairments that could interfere with the task. Participants were recruited online at the official Facebook group of Universidade Federal de São Carlos, where data was collected. They received no compensation for their in-person participation. The study was conducted according to the Declaration of Helsinki and the Ethics Committee of the host university approved the research (#1.484.847). Participants were randomly assigned to one of three groups.

Stimuli and Design
Auditory Stimuli. Three frequency-balanced languages from Dal Ben et al. (2021) were used (see Table 1). Each language contained six statistically defined disyllabic pseudo-words (TP = 1), which served as labels in our task. Test words and part-words in all Languages were frequency balanced (Aslin et al., 1998). In each language, half of the words were repeated 300 times (labeled H on Table 1) and the other half were repeated 150 times (labeled L on Table 1). The recombination of syllables from the words with higher frequency generated three part-words, used during test phase, that had lower TPs (TP = 0.5), but that were balanced in frequency with the test words (150 repetitions each; Aslin et al., 1998).
In addition, all words and part-words had legal and high phonotactic probabilities in Brazilian-Portuguese. Following previous research (Dal Ben et al., 2021), we decided to use only syllable sequences with high phonotactics (instead of legal vs. illegal or high vs. low; Finn & Hudson Kam, 2008; Mersad & Nazzi, 2011) so that all syllable sequences would be phonotactically plausible in the participants' native language. However, some syllable sequences had higher phonotactic probability than others (Table 1, PP+ or PP−). Phonotactics were calculated using Vitevitch and Luce's (2004) algorithm and Estivalet and Meunier's (2015) database of Brazilian-Portuguese biphones. Briefly, we divided the sum of the log (base 10) of token frequency of each biphone in each word position by the total log frequency of words with biphones in that given position (e.g., /mae/ in the third biphone divided by the total log frequency of all words with at least three biphones). Then, using a custom search engine, we created six novel disyllabic words with consonant-vowel structure (CVCV) and with the highest possible phonotactic probability before becoming actual words in Brazilian-Portuguese (labeled PP+; Table 1). Lastly, we recombined their biphones to create six other novel words that had slightly less probable, but still high, phonotactic probabilities (labeled PP−; Table 1). For a full description of the phonotactic calculations, see Dal Ben et al. (2021) and Vitevitch and Luce (2004). Languages were synthesized using the MBROLA speech synthesizer with a Portuguese female voice³ (Dutoit et al., 1996). Prosodic cues were minimized by setting the pitch constant at 180 Hz, the intensity at 77 dB, and the duration of each word to 696 ms (cf. . The total duration of each language was 15 min 39 s and 424 ms. Following our previous studies, TPs and phonotactics were combined to create three languages.
The Balanced language had test words (TP = 1.0) and part-words (TP = 0.5) with balanced phonotactic probabilities (M words = 0.0072, M part-words = 0.0075; Table 1); this language served as a control. The Aligned language had test words with higher phonotactic probabilities in comparison to part-words (M words = 0.0085, M part-words = 0.0072; Table 1). Thus, both TPs and phonotactics signaled word boundaries. Finally, in the Conflict language, test words had lower phonotactic probabilities in comparison to part-words (M words = 0.0072, M part-words = 0.0085; Table 1). Thus, TPs highlighted word boundaries whereas phonotactics highlighted part-words.
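One reading of the positional biphone calculation described above can be written as follows. The symbols are introduced here only for illustration: $L$ is the reference lexicon, $f(w)$ the token frequency of word $w$, $b_p(w)$ the biphone in position $p$ of $w$, and $|w|$ the number of biphones in $w$. The exact normalization is an assumption based on the verbal description; the full procedure is given in Vitevitch and Luce (2004) and Dal Ben et al. (2021).

```latex
\[
  PP(b, p) \;=\;
  \frac{\displaystyle\sum_{w \in L \,:\, b_p(w) = b} \log_{10} f(w)}
       {\displaystyle\sum_{w \in L \,:\, |w| \ge p} \log_{10} f(w)}
\]
```

Under this reading, the numerator collects the log frequency of every word that has biphone $b$ in position $p$, and the denominator collects the log frequency of every word long enough to have any biphone in that position.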
Visual Stimuli. Six novel objects, used by Dal Ben et al. (2022), were also used in the present experiment. They were realistic, colorful, 3D objects that are part of the NOUN object base (Horst & Hout, 2016) and were chosen based on their high degree of novelty (M = 77%) and discriminability (M = 90%). For each language, objects and words were randomly paired, forming six word-object pairs. All stimuli are openly available at https://osf.io/rs2bm/.
Design. Our paradigm (Figure 1) was an adaptation of  and combined speech segmentation and cross-situational word learning in the same task. It had two phases: familiarization and test. During familiarization, one of the languages (Balanced, Aligned, Conflict) was played while objects were displayed on the computer screen. We matched words from the speech stream and objects on the screen in such a way that, at any given time, two objects were displayed while their corresponding words were presented (≈ 1392 ms; Figure 1). For instance, when the first word was first presented, the objects corresponding to the first and second words were displayed; when the third word was played, the first two objects were replaced by two other objects and so on. This created a highly dynamic adaptation of the classic 2 × 2 cross-situational word learning arrangement (for a video sample, see https://osf.io/rs2bm/; cf. Smith & Yu, 2008). Importantly, the onset and offset of the words and objects were desynchronized (± 100, ± 150, or ± 200 ms) to avoid additional cues to speech segmentation . In addition, the entire audio stream had a fade-in and fade-out effect of 500 ms to minimize cues for the initial and final words' boundaries. Finally, to minimize fatigue from this extensive exposure (a total of 1350 word-object presentations, or 675 2 × 2 "trials", over ≈ 15 minutes), we divided the familiarization into five blocks. Each block had 270 word-object presentations (60 for each high frequency word-object pair and 30 for each low frequency pair) and lasted a little over 3 minutes. Between blocks, participants were given a 5-second pause on a screen displaying the task progress (e.g., "Block 2 of 5").
³ We used the MBROLA database br4 (available at: https://github.com/numediart/MBROLA-voices).
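The exposure figures reported for the familiarization phase follow from the word frequencies and word duration given in the stimuli description. The short check below reproduces that arithmetic; the variable names are ours, and the block duration is the nominal product of word count and word duration, which ignores the 500 ms fades and the onset jitter.

```python
WORD_MS = 696                    # duration of each synthesized word
HIGH_REPS, LOW_REPS = 300, 150   # repetitions per high/low frequency word
N_HIGH, N_LOW = 3, 3             # words per frequency class
N_BLOCKS = 5

total_presentations = N_HIGH * HIGH_REPS + N_LOW * LOW_REPS
pair_trials = total_presentations // 2          # two words per 2 x 2 "trial"
per_block = total_presentations // N_BLOCKS
high_per_block = HIGH_REPS // N_BLOCKS
low_per_block = LOW_REPS // N_BLOCKS
display_ms = 2 * WORD_MS                         # two objects stay on for two words
block_seconds = per_block * WORD_MS / 1000       # nominal, without fades or jitter

print(total_presentations, pair_trials, per_block)
print(high_per_block, low_per_block, display_ms, block_seconds)
```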
Following familiarization, two tests were performed, always in the same order: segmentation and mapping. The segmentation test followed a two-alternative forced-choice structure. On each trial, a frequency-balanced word (i.e., a low frequency word, TP = 1, 150 repetitions) and a part-word (TP = 0.5, 150 repetitions) were played with a pause of 500 ms between them. Participants were prompted to indicate which one was a word from the speech stream they had just heard. The order of presentation of words and part-words was counterbalanced across trials. Each of the three low frequency words was tested six times across 18 test trials, with each word being tested against each part-word twice⁴.
The mapping test followed a four-alternative forced-choice structure. Each trial began with four objects displayed in the corners of the screen: one target object (co-occurrence probability = 1 with target word) and three distractors (co-occurrence probability = 0.2 with target word). After 1 second, a target word was played and participants were prompted to select the matching object. Each of the 6 word-object pairs (3 high frequency words and 3 low frequency words) was tested twice across 12 trials.
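The structure of the two tests can be sketched as follows. The item labels are placeholders, and we assume that the two tests of each word against each part-word correspond to the two presentation orders, which is one way to satisfy the counterbalancing described above; the actual trial lists are available in the project repository.

```python
from itertools import product

def segmentation_trials(words, part_words):
    """Pair every word with every part-word, once in each presentation order."""
    trials = []
    for word, part_word in product(words, part_words):
        trials.append((word, part_word))   # word played first
        trials.append((part_word, word))   # part-word played first
    return trials

def chance_level(n_alternatives):
    """Expected accuracy when guessing among equally likely alternatives."""
    return 1 / n_alternatives

words = ["W1", "W2", "W3"]            # placeholder low frequency words
part_words = ["PW1", "PW2", "PW3"]    # placeholder part-words
trials = segmentation_trials(words, part_words)

print(len(trials), "segmentation trials; chance =", chance_level(2))
print("mapping trials:", 6 * 2, "; chance =", chance_level(4))
```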
Procedure. The experiment was conducted in a sound-attenuated room and was computer-administered using PsychoPy2 (Peirce et al., 2019). Auditory stimuli were played on high-definition neutral headphones (AKG K240 powered by a FiiO E10K DAC/amp). All responses were entered on an adapted numeric keyboard with only the keys: 1, 2, 3, 4, Return, +, and − (to increase or decrease the audio volume). At the beginning of the experiment, music with the same intensity as the experimental stimuli (77 dB) was played and participants were instructed to adjust the volume to a comfortable level.
Next, they were instructed that they would hear a new language and see new objects and that their task was to discover which words corresponded to which objects. Following familiarization, they were tested on segmentation and mapping. The first two trials of each testing phase were warm-up trials used to familiarize participants with the structure of the tasks. For example, before the segmentation test trials began, participants were presented with two practice trials with a common word from Brazilian-Portuguese versus a nonsense word (e.g., pato [duck] vs. tafi). Similarly, before the mapping test trials began, participants were presented with two practice trials during which they heard a familiar word and were presented with 4 familiar objects (e.g., "pato" + picture of a duck, house, cat, ball). In addition, after each test phase, participants were asked to estimate their performance by indicating if the percentage of correct responses was between 0-25%, 25-50%, 50-75%, or 75-100%. Participants' compliance with instructions was continuously assessed using a CCTV system. At the end, participants answered a questionnaire about their educational background and language abilities.
⁴ The decision to test each word six times was based on our previous investigation of speech segmentation only (Dal Ben et al., 2021). Although this number of repetitions is higher than in similar studies (e.g., François et al., 2017), follow-up analyses revealed that trial number did not predict performance in either Experiment 1 or 2. Full analysis available at: https://osf.io/rs2bm/.
Data Analysis. After excluding inattentive responses, defined as test trials with reaction times greater than 3 SDs away from the mean (segmentation: 15 trials, 1% of the data; mapping: 17 trials, 2% of the data), we fitted mixed-effects logistic regressions using the lme4 package for R (Bates et al., 2015; R Core Team, 2021) and Spearman's correlations, also in R, to explore speech segmentation performance, cross-situational word learning performance, relationships between them, and self-evaluation. Specific models, outcomes, and predictors are described in the next section. Given the exploratory nature of our investigation, we report effect size estimations and confidence intervals, but not p-values (Scheel et al., 2021). All scripts and data are openly available at https://osf.io/rs2bm/.
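The exclusion rule can be illustrated with a minimal sketch. It assumes that the mean and SD are computed over all trials of a given test and that the sample SD is used; the text does not specify either choice, and the reaction times below are invented for the example. The analysis scripts in the repository remain the authoritative version.

```python
from statistics import mean, stdev

def exclude_rt_outliers(rts, n_sd=3.0):
    """Keep reaction times within n_sd sample SDs of the mean."""
    m, s = mean(rts), stdev(rts)
    return [rt for rt in rts if abs(rt - m) <= n_sd * s]

# Hypothetical reaction times in seconds: twenty ordinary responses
# and one very slow response.
rts = [0.8 + 0.01 * i for i in range(20)] + [9.0]
kept = exclude_rt_outliers(rts)

print(len(rts) - len(kept), "trial(s) excluded")
```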

Results and Discussion
Speech Segmentation. To analyze speech segmentation performance, our mixed-effects logistic regression had selection of the target word (either correct or incorrect) as our outcome variable and chance level (logit of 0.5) and language (Balanced, Aligned, Conflict, respectively) as predictor variables. Our initial model had a maximal random structure with stimuli as random slopes and participants as random intercepts⁵ (Barr et al., 2013), but this model did not converge. We then pruned it to include only random intercepts for stimuli and participants⁶.
Participants from the Balanced language were much more likely to select the words over the part-words at test (Odds Ratio = 10.95, 95% CI [4.19, 28.57]⁷; M = 0.85, SD = 0.16; Figure 2). Participants from the Aligned language, in which both TP and phonotactic probability pointed to word boundaries, were even more likely to select words over part-words (change in OR = 1.61, 95% CI [0.41, 6.27]; M = 0.87, SD = 0.18). On the other hand, participants from the Conflict language, in which TP and phonotactic probabilities worked against each other, were equally likely to select words and part-words (change in OR = 0.13, 95% CI [0.04, 0.42]; M = 0.57, SD = 0.3). These results are in line with our previous findings that adults not only track both TP and PP at the same time, but that these statistics can be combined to improve (i.e., Aligned language) or impair (i.e., Conflict language) speech segmentation (Dal Ben et al., 2021).
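To help read the odds ratios above, the sketch below converts them into model-implied probabilities of choosing the word. It assumes that the Balanced language is the reference level, that each "change in OR" multiplies the Balanced odds, and that chance enters as a fixed offset on the logit scale. These are assumptions about the model's parameterization rather than details confirmed in the text, and the resulting values describe a typical participant and item, so they need not equal the raw means reported in parentheses.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def odds_to_prob(odds):
    return odds / (1 + odds)

chance_odds = math.exp(logit(0.5))   # odds of a correct guess in a 2AFC trial
balanced_or = 10.95                  # Balanced language versus chance
change_in_or = {"Aligned": 1.61, "Conflict": 0.13}

implied = {"Balanced": odds_to_prob(chance_odds * balanced_or)}
for language, change in change_in_or.items():
    implied[language] = odds_to_prob(chance_odds * balanced_or * change)

for language, p in implied.items():
    print(f"{language}: {p:.3f}")
```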
In addition, segmentation performance and self-evaluation (Figure 2) were positively correlated for the Balanced (r_s = 0.45) and Aligned (r_s = 0.48) languages, but not for the Conflict language (r_s = 0.12). This suggests that exposure to continuous speech in which TPs and PPs were either aligned or balanced within words led to clearer word representations, which allowed participants to estimate their knowledge of the words more accurately.
To explore whether our joint task impacts speech segmentation, we compared the present data with data from a previous investigation testing speech segmentation only (Dal Ben et al., 2021). Because we used the exact same languages as previous studies, we fit separate mixed-effects logistic regressions⁸ for each language (Balanced, Aligned, Conflict), having the selection of target words (correct or incorrect) as our outcome variable, experiment (segmentation only or simultaneous task) as a predictor variable, and participants as random intercepts.
⁵ lme4 syntax: selection ∼ chance level + language + (stimuli|participant).
⁶ lme4 syntax: selection ∼ chance level + language + (1|stimuli) + (1|participant).
⁷ Regression tables are available at https://osf.io/rs2bm/.
For the Balanced language, participants in the simultaneous task were approximately three times more likely to choose the target word compared to the separate task (change in OR = 3.21, 95% CI [1.50, 6.88]; Figure 3). The difference was even higher for the Aligned language: participants from the simultaneous task were almost five times more likely to make correct selections in comparison to the separate task (change in OR = 4.93, 95% CI [1.34, 18.16]). On the other hand, in the Conflict language, although participants in the simultaneous task still
These results show that adults will use any statistic available (phonetic and audiovisual co-occurrences) to find words in continuous speech. Moreover, the improvement in segmentation in our current task indicates that adults benefit from tracking multiple statistical sources. This provides initial empirical support for the model proposed by Räsänen and Rasilo (2015) and is in line with recent research on language development in natural environments (Clerkin et al., 2017; Smith et al., 2018; Yu et al., 2021).
Cross-situational Word Learning. To analyze cross-situational word learning, our mixed-effects logistic regression⁹ had selection of the target object (either correct or incorrect) as the outcome variable, chance level (logit of 0.25), language (Balanced, Aligned, Conflict, respectively), the frequency of word-object pairs (low or high), and their interaction as predictor variables, and stimuli and participants as random intercepts.
Across all languages and pair frequencies, participants were much more likely to select the correct object in comparison to the distractors (Figure 4; full regression table available at https://osf.io/rs2bm/). Mapping and self-evaluation (Figure 4) were positively correlated for all languages. They were strongly correlated for the Balanced language (r_s = 0.9), and moderately for the Aligned (r_s = 0.59) and the Conflict languages (r_s = 0.53). This suggests that participants from all languages were able to form clear word-object relationships. It was surprising to see that participants from the Conflict language, who performed at chance on the speech segmentation task, were able to form strong word-object relationships, a point to which we return later.
⁹ lme4 syntax: object selection ∼ chance level + language * pair frequency + (1|stimuli) + (1|participant).
To explore whether our simultaneous task impacts mapping performance, we compared the present data with data from a previous experiment that only tested cross-situational word learning but using the same stimuli and population (Dal Ben et al., 2022). We fitted one mixed-effects logistic model that had mapping (correct or incorrect) as the outcome variable, the interaction between experiment (separate or simultaneous task) and language (Balanced, Aligned, Conflict) as a predictor, and participants as random intercepts¹⁰.
¹⁰ lme4 syntax: object selection ∼ experiment:language + (1|participant).

Overall, cross-situational word learning improved for all languages during the parallel task in comparison to the separate task (Figure 5). The improvement was greatest for participants from the Aligned language (change in OR = 7.39, 95% CI [2.10, 25.98]), followed by participants from the Balanced language (change in OR = 3.34, 95% CI [1.37, 5.52]). Although less pronounced, there was also an improvement for the Conflict language (change in OR = 1.60, 95% CI [0.50, 5.05]), which indicates that participants can benefit from word-object co-occurrence even when TP and phonotactics point to different word boundaries.
Relationship Between Speech Segmentation and Word Mapping. To explore potential relationships between speech segmentation and word mapping, we ran Spearman's correlations between words' and objects' selections (average scores per participant) for each Language. We found moderate positive correlations between segmentation and mapping for all Languages (r_s Balanced = 0.49; r_s Aligned = 0.52; r_s Conflict = 0.42; Figure 6A). Overall, participants who were better at segmentation were also better at mapping. To further explore if that was true for participants from the Conflict Language, we performed a median split of segmentation performance (Mdn = 0.66, IQR = 0.4) and ran Spearman correlation tests for each group separately (Figure 6B). Participants who successfully segmented the speech (above the median) were also successful in mapping words to objects (r_s = 0.46). However, we found no relationship between segmentation and mapping for those who performed poorly on segmentation (below the median; r_s = 0.003).
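The correlational analysis can be sketched as follows. This is a self-contained illustration, not the R code used for the reported values: Spearman's coefficient is computed as the Pearson correlation of average ranks, the per-participant scores are invented, and participants falling exactly on the median are left out of both groups, which is an assumption, since the text does not say how such cases were handled.

```python
from statistics import median

def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1          # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

def median_split(segmentation, mapping):
    """Split (segmentation, mapping) pairs by the segmentation median."""
    mdn = median(segmentation)
    pairs = list(zip(segmentation, mapping))
    above = [p for p in pairs if p[0] > mdn]
    below = [p for p in pairs if p[0] < mdn]
    return above, below

# Hypothetical per-participant proportions correct.
seg = [0.33, 0.44, 0.50, 0.56, 0.61, 0.72, 0.83, 0.89, 0.94, 1.00]
mapp = [0.50, 0.25, 0.58, 0.42, 0.67, 0.75, 0.67, 0.92, 0.83, 1.00]
above, below = median_split(seg, mapp)
print("all:", round(spearman(seg, mapp), 2))
print("above median:", round(spearman(*zip(*above)), 2))
print("below median:", round(spearman(*zip(*below)), 2))
```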
Our design does not inform us about potential learning sequences. Intuitively, strong speech segmentation skills should lead to strong word mapping, which is confirmed to some extent by the positive correlation between word and object selections for participants above the median in the Conflict language, but not for those below the median. Interestingly, simulations by Räsänen and Rasilo (2015) favor a simultaneous performance in which speech segmentation and mapping feed back into each other, driving performance on both tasks. In this regard, the absence of a relationship between segmentation and mapping for participants below the median in the Conflict Language indicates that these performances could be independent from one another.
Overall, results from the present experiment suggest that not only can adults simultaneously track conditional probabilities between audio and visual stimuli to segment words from continuous speech streams and map them to referents under ambiguous learning contexts, but that both segmentation and mapping improve when a greater set of cues, even from different modalities, is available (Figures 3 and 5). Such results provide preliminary empirical evidence for the model of simultaneous segmentation and ambiguous mapping proposed by Räsänen and Rasilo (2015) and Jones et al. (2010).
Figure 6. (A) Correlations between speech segmentation and mapping in each Language. The size of dots indicates the number of participants that overlap in each coordinate (from 1 to 6). (B) Correlations between speech segmentation and mapping in the Conflict Language for participants with speech segmentation above the median (Mdn = 0.66, IQR = 0.4; r s = 0.45) and below the median (r s = 0.003). The size of dots indicates the number of participants that overlap in each coordinate (from 1 to 2).
Our results also indicate that phonotactic probabilities, or how familiar syllables' positional probabilities are in the native language of the participants, also impact such joint performance. When transitional and phonotactic probabilities worked together to signal word boundaries, segmentation and mapping improved (Aligned language) in contrast to when the phonotactic probabilities were balanced among test items (Balanced language). However, the impact of phonotactics was most pronounced when it conflicted with TP information. In the Conflict language, overall, participants failed to show a preference for words when compared to part-words at test (Figure 2). Nonetheless, they were able to map words and objects (Figure 4). How could this happen?
If we assume that segmentation is a necessary pre-step to cross-situational mapping, then this result is hard to explain. However, if adults use whatever informative statistics they have at hand to solve linguistic ambiguity, they would take advantage of both transitional and phonotactic statistics and word-object co-occurrences in the Aligned and Balanced languages. On the other hand, in the Conflict language, statistics were not consistent enough to promote segmentation, but co-occurrences between word syllables and objects were consistent enough to promote mapping and, to some extent, speech segmentation, even without clear and explicit word representations. It is worth noting that objects were consistently paired with words only, and not with part-words. This might have provided some participants with enough information for speech segmentation. It might also have decreased the influence of statistical cues on segmentation (both TPs and PPs). Nonetheless, if word-object co-occurrence was the main source of information for speech segmentation, we should have seen similar levels of segmentation in all languages.
Moreover, our two-alternative forced-choice test might not have been sensitive enough to capture the weaker and implicit word representations that might have arisen in the Conflict language, providing us with only a partial picture of participants' speech segmentation. Our two-alternative forced-choice trials contrasted words, which had stronger TPs and weaker phonotactic probability, with part-words, which had weaker TPs but stronger phonotactic probability. The contrast between recently acquired TP knowledge and language-specific phonotactic knowledge learned across the lifespan may have impaired word selection (Finn & Hudson Kam, 2008). With this in mind, we replicated the current experiment using an arguably more sensitive speech segmentation measure.
Finally, it is worth noting that our careful selection and combination of syllables to create disyllabic words with varying TP and PP contrasts introduced an important confound to our study: none of our words shared syllables. As all syllables were unique to a given word, tracking co-occurrences between individual syllables and objects would be enough to solve the mapping task, but not the segmentation task. This learning strategy would greatly reduce mapping complexity: participants could ignore half of the syllables and all linguistic regularities (i.e., TP and PP). Although this strategy may be the computationally simplest one, it seems that, as a group, participants in the Balanced and Aligned languages did track word-level statistics, as indicated by their segmentation performance. However, when faced with conflicting linguistic regularities, participants in the Conflict language might have defaulted to this simpler learning strategy and solved the mapping task without relying on word representations. Importantly, this confound extends to Experiment 2, and we discuss it further in the General Discussion.

EXPERIMENT 2
In an attempt to capture the potentially nuanced word form knowledge implicitly arising from experience with the Conflict language, in the current experiment we use a more sensitive word segmentation test: go/no-go (François et al., 2017). In this test, each item is presented and evaluated separately, one at a time. By avoiding the contrast between stimuli (i.e., word, part-word, non-word) with different statistics (TP and phonotactics) at test and by adding a new stimulus type (i.e., non-words), we aim for a more fine-grained understanding of word representations in the Conflict language. The mapping test is the same as in Experiment 1.

Method
This experiment was a replication of Experiment 1, but it was fully online due to the COVID-19 pandemic. Differences in methodology are described below.
Participants. Forty-five adults, all native speakers of Brazilian Portuguese, with no reported visual or auditory impairment that could interfere with the task, participated. However, 10 participants were excluded from the final analyses because they failed or missed attention check questions, or reported using their mobile phones or taking notes during the experiment (see Data Analysis for further details). The final sample consisted of 35 adults (M age = 23.51, SD = 4.01; 22 female). As in Experiment 1, participants were recruited from the official Facebook group of the Universidade Federal de São Carlos and received no compensation for their participation. The study was conducted according to the Declaration of Helsinki and the Ethics Committee of the host university approved the research (#3.085.914).
Stimuli and Design. We used the Conflict language from Experiment 1, with the same word-object pairs. As a brief reminder, in this Language, words had high TPs (TP = 1; Table 1) and lower phonotactic probabilities (M words = 0.0072), while part-words had lower TPs (TP = 0.5) and higher phonotactic probabilities (M part-words = 0.0085). In addition, we created three non-words with balanced phonotactics by recombining the initial syllables of words (i.e., /visu/, /tami/, /rako/; PPs = 0.0080, 0.0074, 0.0069, respectively). Because their syllables never occurred together in the Language, their TP was zero.
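To make the three TP levels concrete (1 within words, 0.5 across word boundaries, 0 for non-words), the following minimal sketch computes forward transitional probabilities from a toy syllable stream. The three disyllabic words, the word order, and all function names are hypothetical and chosen only for illustration; they are not the study's stimuli, and the actual familiarization stream was built differently (e.g., with high- and low-frequency words).

```python
from collections import Counter

# Hypothetical disyllabic words (NOT the study's stimuli).
WORDS = {"A": ("pa", "bi"), "B": ("ku", "do"), "C": ("ti", "ga")}

# Word order in which each word is followed equally often by each of the other two.
CYCLE = ["A", "B", "A", "C", "B", "C"]


def build_stream(n_cycles):
    """Concatenate words into one continuous syllable sequence."""
    order = CYCLE * n_cycles + ["A"]  # the final word gives the last "C" a successor
    return [syllable for label in order for syllable in WORDS[label]]


def transitional_probabilities(stream):
    """Forward TP(x -> y) = count(x followed by y) / count(x followed by anything)."""
    pair_counts = Counter(zip(stream, stream[1:]))
    first_counts = Counter(stream[:-1])
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


def tp(table, first, second):
    """Look up a TP; syllable pairs that never occur adjacently have TP = 0."""
    return table.get((first, second), 0.0)


table = transitional_probabilities(build_stream(50))

word_tp = tp(table, "pa", "bi")       # within a word
part_word_tp = tp(table, "bi", "ku")  # across a word boundary
non_word_tp = tp(table, "pa", "ku")   # two word-initial syllables, never adjacent
```

In this toy stream the second syllable of a word always follows the first, each word boundary is followed by one of two possible words, and two word-initial syllables are never adjacent, which mirrors the word, part-word, and non-word contrast described above.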
A design similar to that of Experiment 1 was used here, with four main differences. First, given the online nature of the study, before beginning the experimental task, participants were instructed to move to a quiet room, to turn off any electronic devices (e.g., cellphone, TV), to wear earphones, and not to take notes during the experiment. Second, the segmentation test followed a go/no-go structure: test words (i.e., /nipe/, /tadi/, /mide/), part-words (i.e., /sute/, /viko/, /bara/), and non-words (i.e., /visu/, /tami/, /rako/) were presented one at a time and participants were instructed to indicate whether they were or were not words from the language they had just heard (by pressing "s" or "n", corresponding to "sim" [yes] or "não" [no] in Portuguese). Each stimulus was tested 6 times (total of 54 trials). Third, attention checks were conducted during the familiarization and the segmentation test. At each familiarization block, participants were prompted to answer five simple questions (i.e., "Are you alive?", "Are you sleeping?", "Are you breathing?", "Are you dead?", "Are you awake?"). Between segmentation test trials, attention checks displayed either a Portuguese word or a made-up word (e.g., "mesa" [table], "drevo") printed on the screen and participants were prompted to indicate whether the word existed in Portuguese or not. During both familiarization and test, participants indicated their answers to attention checks by pressing the "s" or "n" keys on the keyboard. Fourth, at the end of the experiment, we checked for compliance with instructions by asking participants whether they had used a cellphone or taken notes during the experiment.
Procedure. The experiment was entirely online, hosted on Pavlovia and programmed using Psychopy3 (Bridges et al., 2020). After agreeing to participate, participants were instructed to avoid distractions (see previous section), answered a questionnaire about their educational background and language abilities, and then started the experimental task. As in Experiment 1, they were exposed to three phases: familiarization, segmentation test, and mapping test (same as Experiment 1). In addition, attention checks (described before) were presented between familiarization blocks and between trials during the segmentation test.
Data Analysis. We followed analytical steps similar to those of Experiment 1. We first excluded participants who reported using their mobile phones during the experiment (n = 3) and those (n = 2) who failed two or more attention checks (out of five questions) during familiarization. Another five participants were excluded because their reaction times to attention checks in the familiarization or segmentation tests were greater than 3 SDs from the mean. For the remaining participants (n = 35), we excluded trials with reaction times greater than 3 SDs away from the mean (segmentation: 32 trials overall, 1% of the data; mapping: 7 trials overall, 1% of the data). The final data were entered in mixed-effects logistic regressions. The outcome, predictors, and random effects for each model are described in the next section.
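The 3-SD reaction-time criterion can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the reaction times are hypothetical, the function name is ours, the rule is applied two-sided to one pooled list, and the sample standard deviation is used; the text does not specify whether the mean and SD were computed per participant or across the sample.

```python
from statistics import mean, stdev


def trim_by_sd(reaction_times, n_sd=3.0):
    """Split reaction times into kept and excluded values with a mean +/- n_sd * SD rule."""
    center = mean(reaction_times)
    spread = stdev(reaction_times)  # sample SD (assumption; the source does not say which SD)
    kept, excluded = [], []
    for rt in reaction_times:
        if abs(rt - center) > n_sd * spread:
            excluded.append(rt)
        else:
            kept.append(rt)
    return kept, excluded


# Hypothetical reaction times in milliseconds: 20 typical trials and one very slow trial.
rts = [500.0] * 20 + [5000.0]
kept, excluded = trim_by_sd(rts)
```

Because an extreme value inflates both the mean and the SD, a rule of this kind can only flag a single outlier when the list is reasonably long, which is one reason the choice of pooling level matters.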

Results and Discussion
Speech Segmentation. To analyze speech segmentation, we fitted a mixed-effects logistic regression with evaluations of words, part-words, and non-words as the outcome variable. Selections of words and rejections of part-words and non-words were coded as correct responses. Predictors were the chance level (logit of 0.5) and stimulus type (words, part-words, non-words); stimuli and participants were random intercepts.¹¹
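The coding rule above (selecting a word or rejecting a part-word or non-word counts as correct) can be sketched as a small scoring function. The item lists and the "s"/"n" keys come from the Stimuli and Design section; the function names and the example trials are hypothetical.

```python
# Items as listed in the Stimuli and Design section.
STIMULUS_TYPE = {
    "nipe": "word", "tadi": "word", "mide": "word",
    "sute": "part-word", "viko": "part-word", "bara": "part-word",
    "visu": "non-word", "tami": "non-word", "rako": "non-word",
}


def is_correct(item, key):
    """Return True when the key press matches the coding rule.

    key is "s" (sim: "yes, this is a word") or "n" (nao: "no").
    """
    if key not in ("s", "n"):
        raise ValueError(f"unexpected key: {key!r}")
    said_yes = key == "s"
    # Words should be accepted; part-words and non-words should be rejected.
    return said_yes if STIMULUS_TYPE[item] == "word" else not said_yes


def proportion_correct(trials):
    """trials: iterable of (item, key) pairs; returns accuracy per stimulus type."""
    totals, hits = {}, {}
    for item, key in trials:
        kind = STIMULUS_TYPE[item]
        totals[kind] = totals.get(kind, 0) + 1
        hits[kind] = hits.get(kind, 0) + int(is_correct(item, key))
    return {kind: hits[kind] / totals[kind] for kind in totals}


# Hypothetical responses from one participant.
example = [("nipe", "s"), ("nipe", "n"), ("sute", "s"), ("rako", "n")]
accuracy = proportion_correct(example)
```

Under this coding, a participant who answers "s" to every item would be fully correct on words and fully incorrect on part-words and non-words, so accuracy has to be read per stimulus type rather than pooled.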
These results indicate that participants might have tracked both transitional and phonotactic statistics from familiarization, but used them differently when evaluating stimuli during test. For instance, they might have relied on TP information when evaluating words (higher TP and lower phonotactics) and phonotactic information when evaluating part-words (lower TP and higher phonotactics). Finally, the lack of familiarity with non-words (no TP information), and the balanced phonotactic statistics, might have generated correct non-word rejections. Overall, our nuanced results could indicate that the go/no-go procedure is not sensitive enough to capture implicit word representation arising from speech segmentation of a language with conflicting statistics, a point we return to in the General Discussion.
Cross-situational Word Learning. To model mapping performance, our mixed-effects logistic regression had object selection (correct or incorrect) as the outcome variable, chance level (logit of 0.25) and target stimuli frequency (150 or 300 repetitions) as predictors, and stimuli and participants as random intercepts.¹² As in Experiment 1, participants correctly mapped both high and low frequency words above chance level (M high = 0.56, SD high = 0.31; M low = 0.49, SD low = 0.31; Figure 8), with small differences in the likelihood of correctly selecting high or low frequency word-object pairs (OR high = 1.51, 95% CI [0.75, 3.02]; change in OR low = 0.68, 95% CI [0.35, 1.36]). Again, we found a moderate positive correlation between mapping and self-evaluation (r s = 0.67; Figure 8).
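The chance levels used in the two models are expressed on the logit (log-odds) scale. The minimal sketch below shows the conversion for a chance probability of 0.5 (segmentation test) and 0.25 (mapping test). The function names are ours, and how exactly the chance level entered each model (for instance, as an offset) is not specified in this section, so the sketch only covers the transformation itself.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def inverse_logit(x):
    """Map a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


chance_segmentation = logit(0.5)   # even odds
chance_mapping = logit(0.25)       # odds of 1 to 3, i.e. ln(1/3)
```

On this scale, a model estimate above the chance logit corresponds to above-chance accuracy, and differences between logits exponentiate to odds ratios.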
Relationship Between Speech Segmentation and Word Mapping. As in Experiment 1, we ran Spearman's correlation tests between word and object selections (average scores per participant) to explore potential relationships between speech segmentation and word mapping. We found a weak positive correlation between segmentation and mapping (r s = 0.32; Figure 9). Again, overall, participants who were better at segmentation were also better at mapping. Further exploration by speech segmentation median split (Mdn = 0.5, IQR = 0.24) revealed little difference between participants above the median (r s = 0.05) and below the median (r s = 0.11), with no correlation between segmentation and mapping for either group.
The current experiment was designed to further evaluate the effects of the conflict between transitional and phonotactic statistics on simultaneous speech segmentation and cross-situational word learning. Overall, we replicated Experiment 1: speech segmentation, as measured by a go/no-go test, was at chance level, but word-object mapping performance was above chance. Nonetheless, our more sensitive word segmentation test provided some nuanced information about stimulus representations.
We found that participants were likely to correctly evaluate non-words as such. This indicates that how participants represented words and part-words was most likely the result of the interplay between phonotactic and transitional probabilities. For instance, stronger phonotactics combined with a probabilistic transitional probability (TP = 0.5) led participants to incorrectly evaluate part-words as words. On the other hand, the weaker phonotactics combined with deterministic transitional information (TP = 1) prompted only a slight tendency to correctly evaluate words as such.
As in Experiment 1, speech segmentation performance and self-evaluation indicate that the conflict between transitional and phonotactic probabilities impaired the formation of clear word representations, which could have impaired participants' accuracy when estimating their knowledge of words from speech. Again, however, despite the absence of clear word representations, participants were able to map words to objects. Consistent word-object co-occurrences might have provided sufficient information to promote mapping and some level of segmentation, despite conflicting phonetic information (Räsänen & Rasilo, 2015). Furthermore, as in Experiment 1, syllables were not shared between words. Participants could have tracked co-occurrences between individual syllables and objects to solve the mapping task, without relying on any word-level phonetic information. Although this strategy would allow participants to solve the mapping task, it would not allow them to gain any word-level information. Thus, if this were the entire explanation, speech segmentation performance should have been at chance level for words, part-words, and non-words.

However, during the go/no-go test, participants were more likely to correctly select words as such and to correctly reject non-words, indicating that they tracked word-level statistics to some degree. Next, we discuss a possible design to overcome this important confound as well as some of the limitations of our study. Moreover, we discuss how our preliminary findings broaden our understanding of statistical learning from multiple cues and prompt further research on the subject.

GENERAL DISCUSSION
In the present study, we explored whether adults could segment speech streams into words and map them to objects simultaneously by tracking conditional probabilities across ambiguous presentations. We also investigated the effects of word-level phonotactics on segmentation and mapping. Phonotactics were either balanced, aligned, or in conflict with transitional probabilities. We found that participants were successful at both the segmentation and mapping tasks when transitional and phonotactic probabilities were either aligned or balanced across words. In contrast, when transitional and phonotactic probabilities were in conflict, we did not find evidence for speech segmentation, but we still found evidence for mapping. Our results offer preliminary support for the idea that a greater set of cues, even in different modalities, supports speech segmentation and word learning in ambiguous situations (Jones et al., 2010; Räsänen & Rasilo, 2015). Not only were participants able to segment and map words simultaneously by tracking conditional probabilities, but their overall performance was stronger in this simultaneous task in comparison to separate tasks of segmentation and cross-situational word learning. This adds to the evidence showing that language learners benefit from combining several sources of linguistic information when learning a new language (Choi et al., 2018; Johnson, 2016; Saffran, 2020; Smith et al., 2018; Tachakourt, 2023; Yurovsky et al., 2012). This combination might be especially useful when dealing with ambiguity, as even a partial solution to one linguistic challenge could reduce ambiguity in other linguistic challenges (for related evidence, see Feldman, Myers, et al., 2013).
Our design does not provide insights into specific learning strategies used by our participants. For instance, they could have used an aggregation strategy, gradually segmenting speech and using the segmented words as anchors for further segmentation and mapping. Or they could have used hypothesis testing from the start, electing syllable sequences and testing their co-occurrence with each other and with objects over time. Participants might have also used a blend of these strategies depending on the level of ambiguity they were facing (Yurovsky & Frank, 2015). In addition, our selection of unique syllables for each word introduced an important confound to our study. Participants could have solved the mapping task even when ignoring half of the syllables and all word-level statistics. In contrast, more efficient segmentation performance for all languages (except for the Conflict language in Experiment 1) suggests that participants tracked word-level statistics to some degree. Furthermore, the positive correlation between segmentation and mapping, found in all languages of both experiments, suggests that speech segmentation, which relied on word-level statistics, and cross-situational mapping were in close interaction, potentially feeding back into each other over time (in line with proposals by Räsänen & Rasilo, 2015).
Interestingly, the conflict between phonotactics and transitional information might have impaired the formation of clear and explicit word representations, but not the formation of strong word-object relationships. Whereas this could point to independent processing of phonetic and audiovisual statistics, it could also be that participants did form clear, but implicit, word representations that our explicit measurements (either a two-alternative forced-choice or go/no-go) were not able to capture. For instance, despite the conflict between transitional and phonotactic statistics, participants were still able to consistently reject non-words during Experiment 2, showing that stimuli with different degrees of statistical information were treated differently. It is worth reinforcing that we manipulated TPs and PPs differently. Whereas TPs were determined when designing the experimental stimuli, PPs were estimated from participants' native language. This led to differences in strength between cues that might have had different effects on participants' processing. For instance, they could have relied more on TP to find word boundaries and on PPs to form strong word representations. A more direct way to assess these implicit processes and to overcome the confound of not repeating syllables across words may be to use EEG measures during the passive familiarization phase.
Recording neural activity during familiarization could inform us about how the alignment or conflict between TP and PP is processed and what happens when these statistics are violated (e.g., Elmer et al., 2021; François et al., 2017). Furthermore, the use of neural entrainment analysis could provide direct evidence as to whether participants track word-level statistics (disyllabic words) or individual syllables (e.g., Batterink & Paller, 2017; Choi et al., 2020). In an ongoing EEG study in one of our labs, we are measuring: a) whether participants will show similar ERPs to violations of transitional and phonotactic information presented in the Balanced and in the Conflict languages and b) the temporal entrainment of their neural activity during familiarization.
Another invaluable source of information on the learning mechanisms involved in simultaneous speech segmentation and mapping is the set of cognitive processes underlying such performances. For instance, auditory and visual memory have been shown to predict cross-situational word learning (Vlach & DeBrock, 2017). Differences in attention have also been found to impact statistical learning (Yurovsky et al., 2013). Future research could measure these and other cognitive processes to better understand their role in statistical language learning.
Our study was exploratory in nature. Building on our promising initial findings, future replications should put our findings to the test. They could also address some of the shortcomings of the present investigation. Future investigations could, for example, manipulate both TPs and PPs in a similar way, leading to a finer control of their contrast. Both cues could be defined experimentally (e.g., Benitez & Saffran, 2021), or estimated directly from participants' natural language. The latter estimation could lead to the design of natural continuous speech (in line with Hay et al., 2011) that comprises the variability in transitional and phonotactic probabilities that learners face in the wild, increasing the ecological validity of statistical learning findings (cf. Smith et al., 2014). Future research could also include participants from more heterogeneous backgrounds, for example, by recruiting participants of different ages, in different countries, and with different socioeconomic status. We tested fairly homogeneous samples of young college students from a single language background. The trends we found may not generalize to other populations (Simons et al., 2017). Also, it could be that the statistical learning mechanisms involved in this simultaneous task have different roles across development (Choi et al., 2018; Danielson et al., 2017; Smith et al., 2018). Future research could investigate simultaneous statistical language learning across development to bridge the gaps between young adults, infants, and older adults. Finally, although highly dynamic, our task comprises only a small sampling of the challenges (i.e., segmentation and mapping) and statistics (i.e., conditional probabilities) available for language learners in natural environments. Future studies could improve ecological validity by, for instance, combining statistical, prosodic, and semantic information (Hay et al., 2011; Karaman & Hay, 2018), or diving into natural environments (Bogaerts et al., 2022; Yu et al., 2021).
Learning languages is difficult. To overcome many linguistic challenges, learners can rely on several cues. Here we provide preliminary evidence that adults can track conditional probabilities to simultaneously find words in continuous speech and map them to objects across ambiguous situations. We also show that the level of pre-experimental familiarity with words can impact their representation. By doing so, we contribute to a more nuanced understanding of how statistical cues interact to promote language learning.

DATA AVAILABILITY STATEMENT
The materials, code, and data from this study are openly available on Open Science Framework at https://osf.io/rs2bm/.