Categorical perception occurs when a perceiver's stimulus classifications affect their ability to make fine perceptual discriminations and is the most intensively studied form of category learning. On the basis of categorical perception studies, it has been proposed that category learning proceeds by the deformation of an initially homogeneous perceptual space (“perceptual warping”), so that stimuli within the same category are perceived as more similar to each other (more difficult to tell apart) than stimuli that are the same physical distance apart but that belong to different categories. Here, we present a significant counterexample in which robust category learning occurs without these differential perceptual space deformations. Two artificial categories were defined along the dimension of pitch for a perceptually unfamiliar, multidimensional class of sounds. A group of participants (selected on the basis of their listening abilities) were trained to sort sounds into these two arbitrary categories. Category formation, verified empirically, was accompanied by a heightened sensitivity along the entire pitch range, as indicated by changes in an EEG index of implicit perceptual distance (mismatch negativity), with no significant resemblance to the local perceptual deformations predicted by categorical perception. This demonstrates that robust categories can be initially formed within a continuous perceptual dimension without perceptual warping. We suggest that perceptual category formation is a flexible, multistage process sequentially combining different types of learning mechanisms rather than a single process with a universal set of behavioral and neural correlates.
Categorization is an essential part of cognition; to make sense of the outside world, perceivers need to continually interpret, recognize, and interact with direct and indirect indicators of objects and events, categorizing things that can and cannot be dealt with in the same way (Harnad, 2005). For salient stimuli that are frequently encountered, it may be efficient to “automate” this perceptual sorting by programming it within the central nervous system. One proposed way that perceivers may do this, called categorical perception, is a mode of information processing where a physical continuum is divided into separate perceptual categories, and sensitivity to stimulus variation is higher at the category boundaries than within each category. For example, in speech and color perception, our classification of phonemes and shades of color affects the precision with which we are able to discriminate them. A shade of blue and a shade of green appear more different than two shades of green, even when the two colors within each test pair have the same absolute difference in wavelength between them (Roberson, Davies, & Davidoff, 2000; Bornstein & Korda, 1984). Similarly, with acoustic stimulus pairs that have equal physical differences between them, we can discriminate between speech sounds labeled as different phonemes faster and more reliably than between speech sounds labeled as the same phoneme (Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967). An essential signature of categorical perception is that discrimination performance can be predicted from classification data (Livingston, Andrews, & Harnad, 1998; Liberman et al., 1967).
The general way in which perceptual categories arise is a continuing subject of debate. One commonly articulated theory based on categorical perception studies posits that perceptual systems form categories by distorting the perceived differences and similarities between objects, compressing the perceptual space within each category and expanding the perceptual space at the borders between categories. This perceptual warping could be inborn and/or acquired/modified with learning (Prather, Nowicki, Anderson, Peters, & Mooney, 2009; Kuhl & Miller, 1975; Liberman et al., 1967). The study of category learning processes therefore offers a great opportunity to uncover the types of neural plasticity that support the efficient integration of stimulus-related and goal-related information, resulting in category formation (Jiang et al., 2007).
Three elements are necessary to study the development of perceptual categories directly: (1) a physical space that is continuously perceived at the initiation of the study, (2) a training paradigm that reliably forms perceptual categories on this space, and (3) a testing paradigm with sufficiently high resolution that compares perceptual distances before and after category acquisition in a within-subject design. Most previous studies on auditory category formation have failed to provide all three of these elements: For example, language studies have used an overlearned categorical space that is not continuously perceived at study initiation (language acquisition studies using very young infants could be an exception, but they lack high-resolution discrimination tests; Jusczyk, 1997). Visual category learning studies have started from continuous perceptual spaces but have not often made use of within-subject designs and/or have obtained contradictory results (Gillebert, Op de Beeck, Panis, & Wagemans, 2009; De Baene, Ons, Wagemans, & Vogels, 2008; Jiang et al., 2007; Sigala & Logothetis, 2002; Livingston et al., 1998; Goldstone, 1994). Taken together, previous perceptual categorization studies have supported the idea that category training can enhance perceptual acuity across (or reduce perceptual acuity within) category borders but have failed to demonstrate whether category formation inevitably results in the perceptual warping specifically seen in categorical perception studies. We seek an answer to this question with the present set of experiments.
These experiments were designed to examine the category learning process using a laboratory training procedure that mimicked salient aspects of natural category learning. The experiments used a set of synthetic sounds with unfamiliar timbres arranged along a dimension that went from tonal to noise-like and pitches that continuously varied from low to high values. Sufficiently skilled listeners (randomly selected volunteers who passed basic auditory perceptual tests) learned to sort these sounds into two arbitrarily defined pitch categories that were never explicitly explained but rather acquired through trial and error. The dimension of timbre was not relevant for the categorization task but varied to make the stimulus set more natural and thus promote robust category formation (Lively, Logan, & Pisoni, 1993). Category learning was assessed by comparing identification performance before and after training. To test for the presence of the perceptual deformations thought to be an integral part of the category learning process, each individual's sensitivity to small and large pitch changes within and across categories was measured, using both psychometric tests and electrophysiological measurements. For the latter, the amplitude of the mismatch negativity (MMN, a derived component of the ongoing EEG, which has been associated with sound discriminability) served as a measure of intrinsic perceptual distance before and after learning (Kujala & Näätänen, 2010; Näätänen, 2001; Alho, 1995; Giard et al., 1995; Javitt, Steinschneider, Schroeder, Vaughan, & Arezzo, 1994). The amplitude of the MMN is known to scale with the perceptual magnitude of stimulus deviance and to increase as a function of discrimination learning (Kujala & Näätänen, 2010; Näätänen, 2000; Kraus et al., 1995).
Thus, we used MMN amplitude as an implicit measure of the perceptual distance between sounds, and training-induced changes in MMN amplitude as an implicit measure of how the training procedure altered those perceptual distances.
Collectively, listeners all acquired robust pitch categories and showed a homogeneous enhancement of sensitivity to small suprathreshold pitch distances, with no significant local deformations of the perceptual space (either within categories or across category boundaries).
In Equation 1, S is the filtering spectrum, G is an individual Gaussian filter (with unitary amplitude, peak at the central frequency n ∙ f0 and width σk), and A is the exponentially decaying weight scaling each Gaussian filter (10 filters in the spectrum, indicated by the index n). The width of the Gaussian filters (σk in Equation 1) correlates with the perceived timbre of the sounds (on a continuum from tonal to noise-like), and the fundamental frequency f0 correlates with the perceived pitch.
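Equation 1 itself is not reproduced in this excerpt; from the definitions given, it can be reconstructed as follows (the decay constant α of the weights is not specified in the text and is shown here as a free parameter):

```latex
S(f) \;=\; \sum_{n=1}^{10} A(n)\, G_n(f),
\qquad
G_n(f) \;=\; \exp\!\left(-\frac{\bigl(f - n f_0\bigr)^2}{2\sigma_k^2}\right),
\qquad
A(n) \;=\; e^{-\alpha\,(n-1)}
```

Each Gaussian has unit amplitude, is centered on the nth harmonic of f0, and shares the common width σk; the exponentially decaying weights A(n) give the lower harmonics more energy.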
Data from a previous set of psychophysical experiments (Caruso & Balaban, 2014) revealed how naive listeners perceive these sounds in terms of pitch and timbre and guided the selection of the stimuli used in the present experiment. The sound stimuli had one of five possible timbre levels (five filter widths, σk = 10, 16.9, 25.1, 32.6, and 48 Hz) and fundamental frequency in the range from 200 to 682 Hz.
In all behavioral tasks, each sound lasted 300 msec, including a 30-msec cosine rise and fall at the sound edges. When presented in pairs, the gap between the two sounds was 2.5 sec long. In the EEG recordings, sounds were 100 msec long including a 10-msec rise and fall time. To obtain the same perceived loudness, the sounds were equated for their root mean square value (Soulodre, 2004).
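As a minimal sketch of the envelope shaping and RMS equalization described above (a pure tone stands in for the paper's filtered-spectrum stimuli; parameter names and the target RMS value are hypothetical):

```python
import numpy as np

def make_tone(f0, dur=0.300, ramp=0.030, fs=44100, rms_target=0.1):
    """Synthesize a carrier at f0 with raised-cosine rise/fall ramps at the
    sound edges, then scale it to a fixed RMS value so that all stimuli
    have approximately the same loudness."""
    t = np.arange(int(dur * fs)) / fs
    x = np.sin(2 * np.pi * f0 * t)
    # raised-cosine (half-Hann) onset and offset ramps
    n_ramp = int(ramp * fs)
    env = np.ones_like(x)
    ramp_curve = 0.5 * (1 - np.cos(np.pi * np.arange(n_ramp) / n_ramp))
    env[:n_ramp] = ramp_curve
    env[-n_ramp:] = ramp_curve[::-1]
    x *= env
    # equate RMS across stimuli
    x *= rms_target / np.sqrt(np.mean(x ** 2))
    return x
```

The same routine with dur=0.100 and ramp=0.010 would produce the shorter EEG stimuli.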
Two artificial, nonoverlapping categories (A and B in Figure 1) were defined along the dimension of pitch. Category A was associated with sounds with fundamental frequencies in the interval of 215–332 Hz, and Category B was associated with sounds with fundamental frequencies in the interval of 412–635 Hz. The region between these two intervals was not sampled during training (a “gap region” of 3.25 semitones), whereas sounds with fundamental frequencies higher and lower than those associated with the categories were explicitly indicated as being outside A and B (Figure 1B), suggesting that the categories were bounded. Timbre variations between sounds were irrelevant to the categorization task and were included to add richness to the stimuli, promoting categorical learning in a natural setting (Lively et al., 1993).
Figure 1 describes the stimulus pitches used for all test and training tasks (see Procedure). During all training tasks, sounds were drawn from the entire pitch region (except the gap) according to a probabilistic structure that oversampled the center of the categories (Figure 1B, gray). During the identification test, used to confirm category learning, sounds were presented with the same rate of occurrence at pitch levels interleaved between those drawn during training (Figure 1B, green). The distance between adjacent sounds was 1.25 semitones for identification test sounds and 1.25 semitones (±0.25 semitones of maximum random jitter) for training sounds (Figure 1B). Two tests measured pitch discriminability at different distances and different locations along the pitch dimension: A battery of auditory staircases was centered within the two categories and in the untrained gap region (Figure 1C), and MMN amplitude was measured for pairs of sounds within and across category borders (Figure 1D).
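Semitone spacing is multiplicative in frequency (one semitone is a ratio of 2^(1/12)), so the pitch grids above can be generated as in this sketch (starting frequency and level count are illustrative):

```python
def semitone_steps(f_start, step_semitones, n_levels):
    """Frequencies spaced by a fixed semitone step upward from f_start;
    one semitone corresponds to a frequency ratio of 2 ** (1 / 12)."""
    return [f_start * 2 ** (step_semitones * k / 12) for k in range(n_levels)]

# e.g., a grid of 1.25-semitone steps upward from 215 Hz
grid = semitone_steps(215.0, 1.25, 17)
```

The training sounds would additionally receive a random jitter of up to ±0.25 semitones around each grid point.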
Note that the high variability of pitch and timbre levels, together with the interleaved design allowing for different test and training sounds, was essential to promote effective category learning and discourage the use of a labeling strategy (the association between a particular set of sound examples and a specific category).
Participants and Their Selection
Caruso and Balaban (2014) previously characterized how naive listeners perceived these stimuli, finding that pitch and timbre variations perceptually interfered with each other. To avoid the possibility that participants might vary in how much timbre variations could obscure pitch variations or in their propensity to be distracted by the timbre variations, we devised a participant selection procedure consisting of timbre and pitch discrimination tests. Participants heard pairs of sounds that could vary along the dimension of pitch (in steps of 1–4 semitones) and/or timbre (five evenly spaced timbre levels described above in Stimuli). They had to concentrate on a specific dimension (first condition: attend timbre; second condition: attend pitch) and indicate whether the pairs were the same or different with regard to the specified dimension. Both conditions included an initial training phase to familiarize participants with the task and the terminology and a test phase where all the possible pairs were presented six times in random order, with no feedback.
Two criteria were applied to the results of the test phase for each participant, corresponding to a minimum requirement for accuracy and a maximum threshold for interference between dimensions. First, we set a minimum of 70% correct answers for sound pairs for which the variation along the relevant dimension was either zero or greater than one step. This allowed for the possibility that a particular step size in the presentation was lower than the discriminable distance for each single participant. Second, we quantified the interference between the two dimensions by regressing accuracy onto the magnitude of the irrelevant variations: We observed that, as the irrelevant variation grew larger, the interference grew larger (lower accuracy), with a negative regression slope. We therefore set a limit for the slope of an individual participant equivalent to that of a line that goes from perfect accuracy (100%) when the distance along the irrelevant dimension is zero to 70% when the distance along the irrelevant dimension is maximal.
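A minimal sketch of the two screening criteria (the data layout is hypothetical, and the accuracy floor is simplified here to a mean over all irrelevant-distance levels rather than the paper's per-pair rule):

```python
import numpy as np

def passes_selection(accuracy_by_irrelevant, max_irrelevant_distance):
    """accuracy_by_irrelevant maps each irrelevant-dimension distance to
    the mean proportion correct at that distance; returns True if the
    participant passes both criteria."""
    d = np.array(sorted(accuracy_by_irrelevant))
    acc = np.array([accuracy_by_irrelevant[k] for k in d])
    # criterion 1: minimum overall accuracy (simplified to the mean)
    if acc.mean() < 0.70:
        return False
    # criterion 2: interference slope no steeper than a line falling from
    # 100% at zero irrelevant distance to 70% at the maximal distance
    slope = np.polyfit(d, acc, 1)[0]
    limit = (0.70 - 1.00) / max_irrelevant_distance
    return slope >= limit
```

A participant whose accuracy decays gently with irrelevant variation passes; one whose accuracy collapses fails on the slope criterion.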
The initial pool of participants consisted of 75 randomly selected volunteers (24 men, age = 18–35 years) who gave written informed consent and reported having no hearing deficits. Eighteen participants (four men) were selected, and 11 (three men) completed all testing procedures. The remaining seven participants abandoned the procedure before completion (five participants withdrew because of the end of their courses for the academic year, and two suffered injuries in a car accident that prevented their further participation). All participants were paid €8 per hour for participation in the behavioral pretest and posttest and up to a maximum of €10 per hour (in proportion to their performance) for the training tasks. After each session, only half of the payment was remitted; the remaining half was paid at the completion of all procedures.
All tasks took place in a sound-attenuated room with the participants seated in front of a computer monitor and keyboard. The sounds were amplified and delivered binaurally through Sennheiser HD 580 headphones (Wedemark, Germany). Participants chose a comfortable sound level that was kept constant throughout the entire test series. During the EEG recordings, participants were seated in a comfortable chair in front of a computer screen at a distance of approximately 1.5 m. They were asked to attend to a silent movie. Sounds were amplified and delivered binaurally through E-A-RTONE 3A insert earphones (Aearo Corporation, Indianapolis, IN).
To test the hypothesis that category perception necessarily involves differential perceptual warping, participants were trained to classify unfamiliar experimental sounds according to the two arbitrary categories described above, so that category boundaries would be built up over time and maintained. This required the experiment to follow a pretest–training–posttest design. We compared the results of perceptual tests and EEG responses to sound variation before and after training to measure category learning performance (via an identification test) and both explicit and implicit discrimination performance (using auditory staircases and EEG activity differences).
The goal of training was to allow participants to learn the stimulus categories without direct instruction or an explicit statement of where the boundary was. Our intent was to do this in a way that resembles how natural perceptual categories, such as phonemes, are presumably learned. Two procedures were used. To promote understanding, we initially allowed participants unlimited time to provide answers; then, as performance became stable over days of training, we introduced a time limit of 900 msec from sound onset (600 msec from sound offset; Figure 2A and C) to make the categorization process automatic. The participants were never directly instructed about time limits but were informed that training would become more challenging with practice. Similarly, initially, there was no time limit imposed for reading posttrial feedback (“Well done!” in green for correct trials and “Oh, no! Wrong!” in red for wrong trials), whereas in the limited-time training phase, feedback was provided for 800 msec in the form of a red, green, or yellow screen indicating wrong, correct, and timed-out trials, respectively. The criterion for stable performance (after which time limits were introduced) was a minimum of 85% correct responses for both tasks for three sessions over 4 consecutive days or 80% correct responses for four sessions over 5 consecutive days. This same criterion was adopted after the imposition of time limits to terminate training. In the time-limited phase, RTs were recorded. Training sessions were given at least 1 day apart. All participants completed training in 7–48 days.
Participants were asked to classify single sounds as “A,” “B,” or “neither A nor B.” The sound stimuli came from the training set and reflected the probabilistic structure of the categories described in Methods (Figure 1B). After each trial, participants were given feedback in the form of a sentence or a colored screen as indicated above. The time course of the trials for both the time-unlimited and time-limited training phases is shown in Figure 2A. In each session, 270 sounds were randomly presented.
Paired presentation training
The purpose of this task was to familiarize the participant with sounds presented in pairs rather than in isolation and, at the same time, promote category learning (without explicitly encouraging discrimination between the sounds in the pairs). As in the identification training task, sounds were drawn according to the probability structure of the training set and then arranged in pairs (thus, the pitch distance between the two sounds varied pseudorandomly). Participants reported whether the two sounds in each pair belonged to the same category or not—they did not have to indicate what the category was.
The trial time course for the unlimited and limited time phases of training is shown in Figure 2C. Participants were given feedback after each trial as described above. In each session, 270 pairs were presented in random order, 140 from different categories (120 crossing the border between A and B and 10 crossing each of the two extreme borders) and 130 from inside one of the categories (65 for each category). Pairs were made anew for each session, respecting the probability structure in Figure 1.
To document category learning, we tested the accuracy with which single sounds were assigned to the A and B categories before and after category learning (identification tests). Once category learning had taken place, we tested the specific predictions about how the perceptual space should be deformed as a result of category formation. If such warping occurred, pairs of sounds with the same parametric pitch distance should be perceived as more similar when they occur within the same pitch category than when they belong to different categories. Thus, pairs of sounds with the same semitone distance between their fundamental frequencies (the semitone is a constant fraction of a musical octave) should be perceived as being more similar when they both belong to A or B than when one of them belongs to A whereas the other belongs to B or is at the border between A and B.
In principle, this mechanism could apply to multiple scales of pitch variation: The warping of the perceptual space could work all the way down to changing the discrimination threshold, or it could act on sounds that are always discriminable but nevertheless become perceptually closer or further apart as a result of training. We took both of these possibilities into account by assessing sensitivity in two ways: measuring discrimination thresholds explicitly with auditory staircases and measuring subjective perceptual distances implicitly for easily discriminable pairs of sounds (distances of 2.5 or 5 semitones) using EEG recordings (calculating the MMN).
Probing category formation using an identification test
In this test, similar to the identification training task, participants had to classify single sounds as “A,” “B,” or “neither A nor B.” The sounds regularly sampled the entire pitch–timbre domain (Figure 1B, green bars). One hundred seventy sounds (17 pitch levels × 5 timbre levels × 2 repetitions, i.e., 10 presentations of each pitch level) were presented in random order, with no time limit and no feedback. The time course of each trial is shown in Figure 2B. On average, the test lasted 15–25 min.
Probing perceptual difference thresholds using auditory staircases
We assessed participants' pitch discrimination thresholds at three locations along the pitch dimension (Figure 1C) with staircase procedures centered at 258 Hz (“low pitch,” within Category A), 530 Hz (“high pitch,” within Category B), and 369 Hz (“medium pitch,” in the middle of the untrained gap region between A and B). All sound stimuli in the staircases had the same intermediate timbre level; one sound (random position in the pair) had a fundamental frequency equal to the staircase central frequency, whereas the other varied according to a weighted 2-down 1-up adaptive procedure in steps of one fortieth of a semitone, starting with an initial pitch distance of 2 semitones. This procedure leads to asymptotic performance at the point of 70.7% correct answers (García-Pérez, 1998; Kaernbach, 1991; Levitt, 1971). For each central frequency, the staircase was repeated twice, starting from higher and lower pitch levels, in random order before and after category learning; the discrimination threshold, computed over the last 15 reversals, was averaged across repetitions. The time course of a trial is shown in Figure 2D. On average, the test lasted between 30 and 40 min.
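The adaptive track can be sketched as follows (a simplification: up and down steps are equal here, whereas the published weighted procedure scales them differently; the deterministic responder is for illustration only):

```python
def staircase_2down1up(respond, start_dist=2.0, step=1 / 40, n_reversals=20):
    """2-down 1-up adaptive track: the pitch distance shrinks after two
    consecutive correct answers and grows after each error, so performance
    converges near the 70.7%-correct point. `respond(dist)` returns True
    when the listener discriminates a pair at that distance correctly.
    Returns the threshold as the mean of the last 15 reversal distances."""
    dist, correct_run, last_dir = start_dist, 0, 0
    reversals = []
    while len(reversals) < n_reversals:
        if respond(dist):
            correct_run += 1
            if correct_run == 2:            # two correct in a row -> harder
                correct_run = 0
                if last_dir == +1:          # direction change = reversal
                    reversals.append(dist)
                last_dir = -1
                dist = max(dist - step, step)
        else:                               # one error -> easier
            correct_run = 0
            if last_dir == -1:
                reversals.append(dist)
            last_dir = +1
            dist += step
    return sum(reversals[-15:]) / 15
```

With an idealized responder that is always correct above some true threshold, the track oscillates around that value.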
Using differences in EEG signals to measure implicit subjective perceptual distances
We recorded EEG responses to a stream of standard (more frequent) and deviant (less frequent) sounds that participants were instructed to ignore while watching a silent movie of their choice. Figure 1D shows the characteristics of the sounds chosen for the oddball tasks. There were two conditions, with a standard sound from inside Category A or B and four deviant sounds at a distance of 2.5 and 5 semitones on either side of the standard (referred to as medium and large deviants, respectively). For example, when the standard was chosen within Category A, the two deviants on the low pitch side belonged to the same Category A; whereas on the high pitch side, one deviant, closer to the standard, was placed within the untrained gap region between categories, and the other, further from the standard, belonged to Category B. The second condition included symmetrical cases with the standard in Category B. Eight participants were randomly assigned to the first or second condition and tested before and after training; three participants were tested twice, once in each condition, in random order within the same day. For these participants, only one condition, randomly chosen, was included in the analysis.
The complete oddball sequence consisted of 5,000 sounds, 80% of the sounds being standards and 20% being deviants. Twenty repetitions of the standard were added at the beginning of each recording block to facilitate the formation of the MMN. Sound order was pseudorandomized, as each deviant had to be preceded by at least three standards. The stimulus onset asynchrony was 800 msec. Each recording session lasted around 75 min: two blocks of around 35 min with a short break in between.
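One way to realize the pseudorandomization constraint (every deviant preceded by at least three standards) is to build a run of standards before each deviant and spread the surplus standards randomly across runs; this is an illustrative sketch with hypothetical deviant labels:

```python
import random

def oddball_sequence(n_standards=4000, n_per_deviant=250,
                     deviants=("L-", "M-", "M+", "L+"),
                     min_run=3, seed=0):
    """Pseudorandom oddball stream: each deviant token is preceded by a run
    of at least `min_run` standards ("S"), with the surplus standards
    allotted randomly to the runs."""
    rng = random.Random(seed)
    devs = [d for d in deviants for _ in range(n_per_deviant)]
    rng.shuffle(devs)
    surplus = n_standards - min_run * len(devs)
    assert surplus >= 0, "not enough standards to satisfy the constraint"
    extra = [0] * len(devs)
    for _ in range(surplus):
        extra[rng.randrange(len(devs))] += 1
    seq = []
    for d, e in zip(devs, extra):
        seq.extend(["S"] * (min_run + e))
        seq.append(d)
    return seq
```

With the defaults, this yields the 80/20 split of 4,000 standards and 1,000 deviants (250 per deviant type); the 20 extra standards prepended to each block would be added separately.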
All recordings were made with an ActiveTwo data acquisition system from BioSemi (Amsterdam, The Netherlands; www.biosemi.com).
The EEG consisted of 128 scalp sensors (Biosemi coordinate system) plus electrodes at both mastoids for offline referencing; we also recorded electro-ocular activity from two bipolar montage electrodes (the vertical EOG from a supraorbital and infraorbital electrode at the left eye and the horizontal EOG from the outer canthi of both eyes). Sound stimuli were delivered to the participant binaurally through insert earphones as described above. All channels were low-pass filtered online with a cutoff frequency of 417 Hz and digitized with 24-bit resolution at a sampling rate of 2048 Hz.
Data were preprocessed in the following way. Recordings were downsampled to 256 Hz and rereferenced off-line to the average of the two mastoids (to maximize the MMN amplitude at frontal sites; Kujala, Tervaniemi, & Schröger, 2007). They were subsequently band-pass filtered with a finite impulse response filter (768-point high-pass filtering, 1-Hz cutoff frequency, 0.2-Hz transition bandwidth; 24-point low-pass filtering, 30-Hz cutoff frequency, 5-Hz transition bandwidth). Recorded epochs of 800 msec were analyzed (200-msec prestimulus baseline and 600 msec after stimulus onset) at Fz.
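A sketch of the re-referencing and band-pass stage using SciPy window-design FIR filters; the published filters' exact design method and transition bands may differ, and zero-phase filtfilt is used here for simplicity:

```python
import numpy as np
from scipy.signal import firwin, filtfilt

FS = 256  # Hz, sampling rate after downsampling

def preprocess(eeg, mastoids):
    """Re-reference a channel to the average of the two mastoid recordings,
    then band-pass it 1-30 Hz with linear-phase FIR filters whose tap
    counts approximate the ones reported in the text."""
    eeg = eeg - mastoids.mean(axis=0)
    hp = firwin(769, 1.0, fs=FS, pass_zero=False)  # high-pass, 1 Hz cutoff
    lp = firwin(25, 30.0, fs=FS)                   # low-pass, 30 Hz cutoff
    eeg = filtfilt(hp, 1.0, eeg)
    return filtfilt(lp, 1.0, eeg)
```

Applied to a signal with a DC offset and a 10-Hz oscillation, the offset is removed while the in-band component passes through essentially unchanged.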
For each recording session, we acquired 4,040 standards (4,000 within the oddball sequence plus 20 added at the beginning of each of the two subblocks, as specified above) and 1,000 deviants (250 trials for each of the four deviants). We did not include in the analysis the ERPs to the first 10 standards in each block or to any standard immediately following a deviant: thus, each ERP in the analysis was preceded by the same standard stimulus. After baseline subtraction, we rejected each epoch exceeding ±50 μV on one of the channels (including the bipolar horizontal and vertical EOGs). On average, after the rejection procedure, each session provided 2,066 trials for the standard (minimum of 1,597 and maximum of 3,349) and 170 for each deviant (minimum of 117 and maximum of 293). We did not set an exclusion criterion based on the number of nonrejected trials for each condition.
For each participant, we subtracted the average response to the standard from the average response to each deviant and quantified MMN as the average value in the time window between 70 and 200 msec after stimulus onset. We chose this relatively broad time window to include all latency variations seen across the participants (see Figure 3). Note that this definition also includes the components of the N1 evoked potential (Näätänen & Picton, 1987); in practice, we found it difficult to make a consistent and objective separation between these two components in the type of recordings collected here.
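The MMN quantification reduces to a difference wave averaged over a fixed window; a sketch (epoch layout follows the 800-msec epochs with a 200-msec prestimulus baseline described above):

```python
import numpy as np

FS = 256        # Hz, sampling rate of the downsampled epochs
BASELINE = 0.200  # s of prestimulus baseline at the start of each epoch

def mmn_amplitude(deviant_erp, standard_erp, t0=0.070, t1=0.200):
    """Mean of the deviant-minus-standard difference wave in the 70-200 msec
    post-onset window; inputs are 1-D averaged epochs time-locked to the
    same onset, including the prestimulus baseline."""
    diff = deviant_erp - standard_erp
    i0 = int((BASELINE + t0) * FS)
    i1 = int((BASELINE + t1) * FS)
    return diff[i0:i1].mean()
```

A more negative value indicates a larger MMN, that is, a larger implicit perceptual distance between deviant and standard.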
All analyses were conducted in MATLAB (The MathWorks, Natick, MA) using custom-made routines and the EEGlab toolbox (Delorme & Makeig, 2004).
Normality for all perceptual and MMN data was assessed using Lilliefors tests (separately for medium and large deviants); homogeneity of variances was assessed using Bartlett tests (separately for medium and large deviants). For some of the training tasks (see below), normality and homogeneity assumptions were not met, and nonparametric statistics were used. However, all data for the discrimination tests and MMN recordings met the assumptions of normality and homogeneity of variances, justifying the use of parametric statistical tests.
In the posttraining identification test, we considered a sound to be consistently categorized as A or B if the frequency of correct classification was higher than 80%. This 80% criterion value was derived as follows. If one conservatively assumes that the chance level for responding A or B is 50%, the probability of randomly answering A (or B) more than 80% of the time in the posttraining identification task (considering the number of trials run) is less than .001 according to the binomial distribution. Thus, a participant who classified a sound as A or B more than 80% of the time was highly unlikely to be guessing. (This is a conservative assumption because three responses are possible in the test [A, B, and N], so that the real chance level of answering A or B is <50%, and the choice of 80% correct as the criterion for proficient identification is associated with a chance probability that is lower than .001). Table 1 lists how each participant perceptually classified the sounds used in the staircase and MMN procedures after training (see also Figure 6 later in this paper).
|Participant||Staircase||MMN First Condition: Standard in A||MMN Second Condition: Standard in B|
|LR||MR||HR||L (−5 sts)||M (−2.5 sts)||S||M (+2.5 sts)||L (+5 sts)||L (−5 sts)||M (−2.5 sts)||S||M (+2.5 sts)||L (+5 sts)|
For each test sound (presented either in staircase perceptual tests or during MMN recordings), the table reports subjective classifications after training. Sounds were classified as belonging to Categories A or B if the frequency of correct classification was higher than 80%, whereas lower classification scores indicated unreliable classification (denoted by an N). See Methods and Classification analysis for details. LR = low frequency range; MR = middle frequency range; HR = high frequency range; L = large; M = medium; S = standard; sts = semitones; A = category “A”; B = category “B”; N = neither category “A” nor “B.”
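The conservative guessing bound behind the 80% criterion can be computed directly from the binomial tail; this sketch uses illustrative trial counts (the probability reported in the text reflects the actual trial counts and the true sub-50% chance level of a three-alternative test):

```python
from scipy.stats import binom

def guessing_probability(n_trials, criterion=0.8, p_chance=0.5):
    """Probability that a purely guessing participant gives one response
    on strictly more than `criterion` of the trials, under a conservative
    50% chance level (with three response options the true chance level,
    and hence this probability, is lower)."""
    k = int(criterion * n_trials)  # need strictly more than k successes
    return binom.sf(k, n_trials, p_chance)
```

As the number of trials considered grows, the tail probability falls rapidly, so exceeding 80% becomes vanishingly unlikely under guessing.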
All participants (n = 11) met the criterion set to complete each training phase (unlimited time and limited time) within a maximum of six sessions per phase (the overall average number of sessions was 3.5 and 4.4 for the two phases, respectively). We further confirmed learning by comparing the group mean accuracy and RT in the first and last sessions of both tasks, separately for the time-unlimited and time-limited conditions (RTs were not recorded in the unlimited phase). Because the assumptions of normality and homogeneity of variance were violated for some training data groups, we used nonparametric Wilcoxon signed rank tests (parametric t tests yielded the same results).
In the identification task, group accuracy significantly increased as a result of the unlimited-time training (W = 3, p = .005) and the limited-time training (W = 2, p = .003), whereas group RT significantly decreased as a result of the limited-time training (W = 7, p = .019). Similarly, in the pair task, the mean accuracy significantly increased in both phases (unlimited time: W = 6, p = .014; limited time: W = 1, p = .002), whereas the reduction in RT did not reach significance (W = 15, p = .123). The results for the unlimited-time accuracy also remained significant when we removed the data from one outlying participant whose performance in the first session was particularly low (Figure 4; W = 6, p = .027).
The mean accuracy and RT in each training session are shown for each participant in Figure 4. With a single exception, participants were already performing above chance in their first session in both phases, suggesting that the task was not prohibitively difficult. Nevertheless, these results indicate that all participants were actively and successfully engaged by the training tasks. To further verify their ability to properly group sounds into the two arbitrary Categories A and B, identification performance was compared before and after training.
During training, Categories A and B were implicitly imposed onto the pitch dimension, whereas timbre variations of sounds added variability unrelated to the categories. Thus, learning to categorize sounds in A and B should result in a sorting decision based solely on pitch, with timbre variations being ignored. We tested both these outcomes.
First, all participants learned to categorize based solely on pitch after training. Figure 5A shows the group categorization of sounds as a function of pitch level before and after training. Categorization accuracy was compared before and after training for the test sounds within the pitch intervals associated with A and B during training (shaded areas in Figure 5A). The performance of each participant (Figure 5B) increased significantly, as assessed by a Wilcoxon signed rank test (all ps < .03). For each participant, we also fit the A and B curves before and after training with a sigmoid function: The absolute values of the sigmoid slopes significantly increased at the group level (Wilcoxon signed rank test: W = 27, p = .001; mean ± SE of the absolute slope: 0.42 ± 0.06 before training and 1.65 ± 0.46 after training).
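As a concrete illustration of the identification-curve model, a sketch of the logistic (sigmoid) function whose slope parameter indexes boundary steepness (the exact parameterization used in the study is not specified here; names and defaults are illustrative):

```python
from math import exp

def logistic(x, x0=0.0, k=1.0):
    """Probability of an "A" (or "B") response at pitch level x.

    x0 is the category boundary position and k the steepness:
    a larger |k| gives a sharper category transition, which is
    what an increased absolute fitted slope after training means.
    """
    return 1.0 / (1.0 + exp(-k * (x - x0)))
```

Under this model, the post-training increase in absolute slope reported above corresponds to a larger fitted |k|, i.e., a more step-like transition between the A and B response regions.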
Second, the percepts of all participants were less affected by timbre variation after training than before. Figure 5C shows the group categorization of sounds as a function of timbre levels before and after training. For each participant, we regressed the three classification curves (A, B, neither-A-nor-B) against timbre levels before and after training: The absolute values of the slopes of their best linear fits (Figure 5D) significantly decreased at the group level (Wilcoxon signed rank test; A curve: W = 3, p = .005; B curve: W = 1, p = .004; neither-A-nor-B curve: W = 0, p < .001; the mean ± SE of the absolute slopes before and after training were 0.095 ± 0.019 and 0.021 ± 0.004 for the A curve, 0.072 ± 0.017 and 0.018 ± 0.003 for the B curve, and 0.128 ± 0.025 and 0.013 ± 0.003 for the neither-A-nor-B curve).
Note that the sounds with very low and very high pitch (presented rarely during training as “neither A nor B”) were indeed less likely to be classified as belonging to either category (Wilcoxon signed rank test comparing the group response outside the categories vs. the average response inside the categories: Z = −2.14, W = 60.5, p = .032). We can therefore conclude that the participants learned to sort sounds into two “boxes,” A and B, with specific locations on the pitch axis and boundaries on both the low- and high-pitch sides, rather than learning to follow a general rule such as “sounds in B have a higher pitch than sounds in A,” which would result in a continuous transition from A to B analogous to the one seen before training.
During category training, the pitch space between the categories was left unsampled. As a consequence, after training each participant placed the category boundary at a slightly different position and with a different steepness (although subgroups with similar category borders can be recognized; Figure 6). Because the perceptual deformations suggested by categorical perception studies predict that discrimination performance can be inferred from identification performance (both reflect the combination of expanded sensitivity at the borders between categories and compressed sensitivity within categories), a precise account of border position is needed for each individual participant. To this end, we set an identification threshold (see Methods): Sounds had to be labeled “A” (or “B”) more than 80% of the time to be considered “within category”; sounds classified less consistently were considered to be at the border between categories (Table 1).
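The 80% labeling rule can be sketched as follows (a minimal illustration; function and variable names are hypothetical, not taken from the study's analysis code):

```python
def label_pitch_levels(p_a, p_b, threshold=0.80):
    """Assign each pitch level to "A", "B", or "border".

    p_a[i] and p_b[i] are the proportions of "A" and "B" responses
    at pitch level i. A level counts as within-category only if it
    was labeled with that category more than `threshold` of the
    time; any less consistent level is treated as border region.
    """
    labels = []
    for a, b in zip(p_a, p_b):
        if a > threshold:
            labels.append("A")
        elif b > threshold:
            labels.append("B")
        else:
            labels.append("border")
    return labels
```

Applied to one participant's post-training identification data, this yields the per-sound category assignments summarized in Table 1.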
Discrimination Testing: Auditory Staircases
A set of auditory staircases were used to evaluate pitch discrimination abilities in different places along the pitch dimension before and after training (see Methods). Ideally, to assess changes in sensitivity within categories and at the border between categories, we aimed at placing the “low pitch” staircase within category A, the “high pitch” staircase within category B, and the “medium pitch” staircase at the border between them (Figure 1C). The identification test after training confirmed that the “low pitch” and “high pitch” staircases were indeed placed within Categories A and B, respectively, for all participants, whereas the “medium pitch” staircase was placed at the border between A and B for five participants and within Category B for the remaining six participants (Table 1).
We thus looked at the effects of category learning on these groups of participants. For each participant and each staircase location, we quantified the sensitivity change with a staircase index (SI) defined as the difference between the discrimination thresholds (T) before and after training, normalized by their sum (SI = [Tpre − Tpost] / [Tpre + Tpost]). This index has values between −1 and 1. Negative values indicate a deterioration of performance (higher threshold after training), positive values indicate an improvement of performance (lower threshold after training), and a null value indicates no change in performance.
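In code, the staircase index is a straightforward normalized difference (a sketch; the function name is ours):

```python
def staircase_index(t_pre, t_post):
    """SI = (Tpre - Tpost) / (Tpre + Tpost).

    Thresholds are assumed positive, so SI lies in (-1, 1):
    positive when the threshold drops after training (improvement),
    negative when it rises (deterioration), zero when unchanged.
    """
    return (t_pre - t_post) / (t_pre + t_post)
```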
The perceptual warping account of categorical perception predicts a negative SI for within-category staircases (as discrimination abilities are expected to worsen) and/or a positive SI for between-category staircases (as discrimination abilities are expected to improve; Figure 7A). To be able to detect small local categorical perceptual effects, we considered the three groups of staircases placed within Categories A and B as separate.
We found no significant change in pitch discrimination abilities with training for any staircase in any participant group (Figure 7B). None of the staircase indices was statistically different from zero (t test; low-pitch staircase: t = 0.26, df = 10, p = .80; high-pitch staircase: t = 0.53, df = 10, p = .61; medium-pitch staircase, within one category (six participants): t = 0.82, df = 5, p = .45; medium-pitch staircase, at the border (five participants): t = −0.26, df = 4, p = .81).
Discrimination Testing: MMN Responses
The results of the auditory staircases indicate that threshold pitch distances were not warped by pitch categorization learning either in the central regions of the categories or in the region between categories. However, such distortions in perceptual space might still be present without necessarily affecting perceptual thresholds, because they could be manifested on a larger perceptual scale, especially considering that the coarse sampling of pitch space during training was neither promoting nor requiring finer pitch discrimination. According to this hypothesis, sounds would maintain discriminability but nevertheless become closer or further apart in terms of their perceptual similarity. We therefore assessed perceptual changes at a coarser scale (a medium and a large pitch difference of 2.5 or 5 semitones) by measuring electrophysiological MMN responses. Specifically, we used variations in MMN amplitude to measure implicit perceptual distance changes induced by the training procedure.
As the perceptual warping proposed by categorical perception studies differs according to whether pairs of sounds belong to the same category (perceptual distance decreases) or to different categories (perceptual distance increases), we first grouped participants according to their classification of the standard and deviants after training (labeling each standard–deviant pair as either “within category” or “between categories”).
We separately analyzed the results for large and medium pitch distances (5 and 2.5 semitones), comparing MMN amplitudes across Classification groups (“within category” or “between categories”) and Time (before and after category learning) using two-way ANOVA. The MMN amplitudes for each participant before and after training were divided by their sum to control for individual differences in the overall size of these electrophysiological responses ([MMNbefore / (MMNbefore + MMNafter)] and [MMNafter / (MMNbefore + MMNafter)]), creating a normalized MMN value (nMMN) that was used for statistical analyses.
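The normalization amounts to expressing each participant's pre- and post-training MMN amplitudes as fractions of their sum (a sketch; names are ours):

```python
def normalized_mmn(mmn_before, mmn_after):
    """Return (nMMN_before, nMMN_after).

    Each amplitude is divided by the pre + post sum, so the two
    values add to 1; this removes individual differences in overall
    MMN size while preserving the before/after ratio.
    """
    total = mmn_before + mmn_after
    return mmn_before / total, mmn_after / total
```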
At large pitch distances (standard/deviants 5 semitones apart), we could directly compare the nMMN for within-category and across-category pairs for 10 participants, all of whom categorized the standard within the trained category, one of the deviants within the same category, and the other deviant outside the trained category (either within the other category or in an uncertain boundary area; Table 1). The remaining participant failed to reliably categorize the standard and was not included in this analysis. The nMMN interaction term between Time and Classification group was not significant (F(1, 36) = 0.15, p = .70), indicating that category learning did not have a differential effect on within- and across-category comparisons. Furthermore, we found no main effect of Time (F(1, 36) = 3.18, p = .08), indicating that the perceptual distances of these sounds did not change with training (Figure 8A). There was also no main effect of Classification group (F(1, 36) = 0.0, p = 1.0).
At medium pitch distances (standard/deviants 2.5 semitones apart), only seven participants had stimulus classifications that allowed for a direct comparison between within- and across-category pairs. The other four participants either classified all three sounds in the same category or failed to classify the standard within the trained category and were excluded from the analysis (Table 1). We did not find a significant nMMN interaction term between Time and Classification group (F(1, 24) = 4.13, p = .053), indicating that category learning did not have a differential effect on within- and across-category pairs. We also found no significant main effect for Classification group (F(1, 24) = 0.02, p = .88). On the other hand, we found a highly significant main effect of Time (F(1, 24) = 18.96, p < .001), indicating that the nMMN (and the perceptual distances of these sounds) significantly increased as a result of category training for both within- and between-category stimuli; this is not in accordance with the “classical” pattern of perceptual deformations suggested by categorical perception studies (Figure 8B). Post hoc tests corrected for multiple comparisons (Tukey test, p < .05) showed that (1) the nMMNs for within-category and between-category pairs were not significantly different from each other either before or after training, (2) the nMMN for between-category pairs after training differed significantly from both before-training nMMN values, and (3) the nMMN for within-category pairs after training differed significantly from the nMMN for the between-category pairs before training.
If category learning drives the formation of sharp discontinuities along the perceptual space, short-term training might produce small perceptual discontinuities that vary across participants in proportion to the increase in identification function steepness produced by training. We tested this hypothesis by calculating the correlation coefficient (Pearson's r) between the nMMN and the slope of the identification function at the category boundary (as calculated by a logistic fit) for those participants who reliably classified the standard and one deviant across a category border after training. This analysis revealed that the amplitude of the nMMN was not larger when the category border identification function was steeper, for either the large (r = .032, t = 0.09, p = .93) or the intermediate (r = .27, t = 0.63, p = .56) deviants. Similar results were found for the correlation between the increment of the nMMN and the increment in category steepness after training (large deviants: r = .023, t = 0.06, p = .95; intermediate deviants: r = .21, t = 0.48, p = .65). Therefore, no evidence was found to support the idea of an underlying fine-scale relationship between the steepness of an individual's category boundary percepts and the size of an implicit electrophysiological measure of perceptual distance.
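The correlation step reduces to computing Pearson's r over participant-wise pairs of values (slope, nMMN); a self-contained sketch using only the standard library (names are ours):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length
    sequences of paired observations (e.g., boundary slope and
    nMMN amplitude, one pair per participant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

An r near zero, as found for both deviant sizes, indicates no linear relationship between boundary steepness and the implicit perceptual distance measure.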
This study addressed the generality of the perceptual-space warping mechanism that has been proposed for auditory category learning. Categorization is the ability to recognize physically different objects as members of the same group. The most intensively studied form of category learning, categorical perception, occurs when, given the same physical distance between pairs of objects, those assigned to different categories are perceived as being more different than those assigned to the same category (Harnad, 2005). Conceptually, when new categories are formed for objects lying on a smooth continuum, the perceptual space is thought to undergo a warping process that brings items in the same class closer to each other and/or items from different classes further from each other (Liberman et al., 1967). Such effects have been hypothesized, for example, for phonemic contrasts, which arise and are modified during language learning (Liberman et al., 1967). We investigated the generality of this mechanism in a laboratory training study using psychophysical tests and EEG recordings of brain activity.
We report that the training procedure implemented in this study successfully established two categories along the continuous pitch dimension in all participants, as confirmed by the comparison of classification performance before and after training. Specifically, after training, all participants were able to categorize sounds that varied in pitch and timbre based on the pitch quality alone: Timbre variations did not interfere with the categorization, whereas the different pitches were consistently classified into two groups (the A and B categories). The border between the categories was reliably placed in the untrained region, and its steepness increased with learning, even if there was a great deal of intersubject variability in the precise placement and steepness of the identification function at the category boundary.
The training procedures employed in the present experiment did not result in sharp perceptual discontinuities between the two pitch categories in any of the listeners. This is different from the kinds of preexisting categories with sharply separated boundaries that have been studied in categorical perception experiments (Livingston et al., 1998; Liberman et al., 1967). The opportunity to examine pitch category learning in a context that lacked sharp, preexisting perceptual discontinuities enabled us to test the generality of the perceptual space warping explanation for category learning. This study found that learning uniformly enhanced perceptual sensitivity along the entire pitch dimension rather than producing any perceptual warping. Specifically, intermediate-size pitch variations were uniformly perceived as more dissimilar after training than before, both within and across category borders, as indicated by the fact that they all evoked increased MMN responses.
Some previous category learning studies have found perceptual outcomes congruent with the generality of perceptual warping, such as increased sensitivity at category borders and/or decreased sensitivity within categories (De Baene et al., 2008; Guenther, Husain, Cohen, & Shinn-Cunningham, 1999; Livingston et al., 1998; Goldstone, 1994). In contrast, other studies have found modified sensory representations that are incompatible with perceptual warping and akin to the one found here: a homogeneous enhancement of differences along the category-relevant dimensions (Gillebert et al., 2009; Jiang et al., 2007; Sigala & Logothetis, 2002; Goldstone, 1994).
To reconcile the information from all of these studies, we propose that category learning is a dynamic, flexible, multistage process similar to other forms of perceptual learning (Amitay, Zhang, Jones, & Moore, 2014). Because the present experiment used a rich set of unfamiliar stimuli with perceptual attributes that were either relevant or irrelevant to the categorization task, the learning process had to meet a particular set of challenges. A participant learning to use pitch to categorize sounds that also varied in timbre first needed to tune in to pitch variations and ignore the distracting timbre variations; indeed, the identification tests showed that all participants behaved in this way after training, despite the difficulties posed by timbre–pitch perceptual interference (Caruso & Balaban, 2014; Patterson, Handel, Yost, & Datta, 1996). As has been suggested for visual object learning (Jiang et al., 2007; Freedman, Riesenhuber, Poggio, & Miller, 2003), category training with these sounds might first have resulted in the acquisition of dedicated representations on which the categorization task could then be based. Those representations would be characterized by enhanced pitch differences, as indicated by the MMN results, and possibly by decreased timbre differences, although the latter were not fully documented here. After the initial interference between pitch and timbre was resolved, nothing more was asked of the participants. Had we continued training with task performance demands beyond those required here (for instance, greatly constraining the decision time or asking participants to classify rapid pitch sequences), this might have resulted in the establishment of the kind of sharp perceptual discontinuities documented in categorical perception studies, as suggested by studies of phoneme category formation (Heeren & Schouten, 2008; Xu, Gandour, & Francis, 2006).
Such a scenario might also have resulted in further increases in the steepness of the identification function at the category borders and the emergence of a differential increase in MMN magnitude for intracategory pairs of sounds. Figure 8B hints at a nonsignificant tendency at the group level for pairs of sounds across the category border to have grown more dissimilar than pairs within the same category (although correlation analyses did not reveal any significant relationship between variation in the steepness of the identification function at the category boundary and the magnitude of MMN responses to stimuli that straddled it).
According to the hypothesis that categorization learning is a dynamic, flexible multistage process, successful categorization learning would be initially driven by areas outside sensory cortices, such as pFC (Gotts, Milleville, Bellgowan, & Martin, 2011; Freedman et al., 2003), whereas perceptual changes that are not specific to particular categories occur in the auditory cortex. In time, as categories become more consolidated, specific changes in local auditory neuronal circuits, such as greater neuronal firing changes to stimulus variations near the category boundary than within categories, could be driven by pFC and become the substrates for the perceptual warping observed in categorical perception studies (Gillebert et al., 2009; De Baene et al., 2008; Jiang et al., 2007; Sigala & Logothetis, 2002; Tremblay, Kraus, Carrell, & McGee, 1997).
The results presented here support the first stage of categorization learning delineated within the explanatory framework presented above. Although it is always possible that our failure to find perceptual warping was because of the small number of participants included in the present experiments, we believe that this is highly unlikely for two reasons. First, the sample size was sufficient to reveal a significant implicit perceptual change with a relatively small effect size, suggesting that this perceptual change is an early and strong consequence of categorization learning. Second, we used a within-subject design and saw no evidence of perceptual warping in the data from any single participant, and no relationship between the steepness of perceptual functions at category boundaries and the size of implicit perceptual changes across the participant group. This suggests that our failure to see perceptual warping is not because of either sampling error or an insufficiently large sample size.
It is also noteworthy that the magnification of the MMN after learning was significant only for task-relevant pitch differences. Neither smaller nor larger pitch differences were affected by training, presumably because perceptual rescaling of such differences was not needed to learn these specific categories and thus not promoted by the training paradigm. Small pitch differences (near perceptual threshold) were never heard during training, whereas large pitch variations (experienced during training) have previously been shown to be insensitive to interference from the timbre variations used here (Caruso & Balaban, 2014). The specificity of these training effects provides additional support for the explanatory framework enunciated above.
In summary, this study has documented neuroplastic changes resulting from successful auditory category learning. The establishment of two new perceptual categories within a novel, continuous set of complex sounds occurred in the absence of perceptual warping. Learning was instead accompanied by a homogeneous enhancement of relevant pitch differences. These results indicate that perceptual warping is not a necessary substrate for category learning (at least at the initial stages of the learning process).
Reprint requests should be sent to Valeria C. Caruso, Center for Cognitive Neuroscience, Duke University, Room B 245, 450 Research Drive, Box 90999, Durham, NC 27708, or via e-mail: firstname.lastname@example.org.