
We extracted a number of lexical, discourse, timing, phonetic, acoustic, and prosodic features for each target ACW, which we use in the statistical analysis and machine learning experiments presented in the following sections. Tables 4 through 8 summarize the full feature set. For simplicity, each line in those tables may describe one or more features. Features that may be extracted by on-line applications are marked in the tables; this is further explained later in this section.

Table 4

Lexical and discourse features. Each line may describe one or more features. Features marked may be available in on-line conditions.

Lexical features

Lexical identity of the target word (w).
Part-of-speech tag of w, original and simplified.
Word immediately preceding w, and its original and simplified POS tags. If w is preceded by silence, this feature takes value ‘#’.
Word immediately following w, and its original and simplified POS tags. If w is followed by silence, this feature takes value ‘#’.

Discourse features

Number of words in w's IPU.
Number and proportion of words in w's IPU before and after w.
Number of words uttered by the other speaker during w's IPU.
Number of words in the previous turn by the other speaker.
Number of words in w's turn.
Number and proportion of words and IPUs in w's turn before and after w.
Number and proportion of turns in w's task before and after w.
Number of words uttered by the other speaker during w's turn.
Number of words in the following turn by the other speaker.
Number of ACWs in w's turn other than w.
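
As an illustration of the positional discourse features in Table 4, the following Python sketch counts the words before and after w within its IPU and derives the corresponding proportions. The function name `position_features`, the word-list input, and the choice to take proportions over the full IPU length (including w itself) are assumptions made for this example; the text does not specify how the proportions were computed.

```python
from typing import NamedTuple, Sequence


class PositionFeatures(NamedTuple):
    words_before: int
    words_after: int
    prop_before: float
    prop_after: float


def position_features(ipu_words: Sequence[str], target_index: int) -> PositionFeatures:
    """Count the words before and after the target word within its IPU.

    Assumption: proportions are taken over the total number of words
    in the IPU, including the target word itself.
    """
    total = len(ipu_words)
    if not 0 <= target_index < total:
        raise IndexError("target_index is outside the IPU")
    before = target_index              # words preceding w
    after = total - target_index - 1   # words following w
    return PositionFeatures(before, after, before / total, after / total)


# Hypothetical IPU in which the target ACW "okay" is the first word.
features = position_features(["okay", "so", "the", "lemon"], 0)
```

The same counting scheme would extend to the turn-level and task-level variants by passing the words (or IPUs, or turns) of the larger unit instead.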
Table 5

Timing features. Each line may describe one or more features. Features marked may be available in on-line conditions.

Timing features
 
 Duration (in msec) of w (raw, normalized with respect to all occurrences of the same word by the same speaker, and normalized with respect to all words with the same number of syllables and phonemes uttered by the same speaker). 
 Flag indicating whether there was any overlapping speech from the other speaker. 
 Duration of w's IPU. 
 Latency (in msec) between w's turn and the previous turn by the other speaker. 
 Duration of the silence before w (or 0 if w is not preceded by silence), its IPU, and its turn. 
 Duration and proportion of w's IPU elapsed before and after w. 
 Duration of w's turn before w. 
 Duration of any overlapping speech from the other speaker during w's IPU. 
 Duration of the previous turn by the other speaker. 
Duration of the silence after w (or 0 if w is not followed by silence), its IPU, and its turn. 
Latency between w's turn and the following turn by the other speaker. 
Duration of w's turn, as a whole and after w. 
Duration of any overlapping speech from the other speaker during w's turn. 
Duration of the following turn by the other speaker. 
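
To make the normalized duration feature in Table 5 concrete, the sketch below normalizes each word's duration against all occurrences of the same word by the same speaker. The text does not state which normalization formula was used; z-score normalization is assumed here as one common choice, and the function name `normalize_durations` and the `(speaker, word, duration_ms)` tuple layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_durations(tokens):
    """Return one z-score per token, grouped by (speaker, word).

    tokens: list of (speaker, word, duration_ms) tuples.
    Assumption: z-score normalization; the source does not name the formula.
    """
    groups = defaultdict(list)
    for speaker, word, duration in tokens:
        groups[(speaker, word)].append(duration)

    # Mean and population standard deviation for each (speaker, word) group.
    stats = {key: (mean(vals), pstdev(vals)) for key, vals in groups.items()}

    normalized = []
    for speaker, word, duration in tokens:
        mu, sigma = stats[(speaker, word)]
        # A group with a single token (or no variation) gets 0.0.
        normalized.append((duration - mu) / sigma if sigma > 0 else 0.0)
    return normalized


# Hypothetical durations in msec.
tokens = [("A", "okay", 200.0), ("A", "okay", 400.0), ("B", "okay", 300.0)]
z_scores = normalize_durations(tokens)
```

Grouping by a different key, such as the speaker together with the word's syllable and phoneme counts, would give the second normalization variant listed in the table.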
Table 6

Prosodic features. In all cases, both original and simplified ToBI labels were considered. Each line may describe one or more features. Features marked may be available in on-line conditions.

ToBI prosodic features
 
– Phrase accent, boundary tone, break index, and pitch accent on w. 
– Phrase accent, boundary tone, break index, and final pitch accent on the final intonational phrase of the previous turn by the other speaker (these features are defined only when w is turn initial). 
Table 7

Acoustic features. Each line may describe one or more features. Features marked may be available in on-line conditions.

Acoustic features
 
 w's mean, maximum, and minimum pitch and intensity (raw and speaker normalized). 
 Jitter and shimmer, computed over just the voiced frames of the whole word and of its first and second syllables (raw and speaker normalized). 
 Noise-to-harmonics ratio (NHR), computed over the whole word and over the first and second syllables (raw and speaker normalized). 
 w's ratio of voiced frames to total frames (raw and speaker normalized). 
 Pitch slope, intensity slope, and stylized pitch slope, computed over the whole word, its first and second halves, its first and second syllables, the first and second halves of each syllable, and the word's final 100, 200, and 300 msec (raw and normalized with respect to all other occurrences of the same word by the same speaker). 
 w's mean, maximum, and minimum pitch and intensity, normalized with respect to three types of context: w's IPU, w's immediately preceding word by the same speaker, and w's immediately following word by the same speaker. 
 Voiced-frames ratio, jitter, and shimmer, normalized with respect to the same three types of context. 
 Mean, maximum, and minimum pitch and intensity, ratio of voiced frames (all raw and speaker normalized), jitter, and shimmer, calculated over the final 500, 1,000, 1,500, and 2,000 msec of the previous turn by the other speaker (only defined when w is turn initial but not task initial). 
 Pitch slope, intensity slope, and stylized pitch slope, calculated over the final 100, 200, 300, 500, 1,000, 1,500, and 2,000 msec of the previous turn by the other speaker (only defined when w is turn initial but not task initial). 
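
The slope features in Table 7 can be illustrated with a short Python sketch that fits a straight line to a pitch (or intensity) track over the final N msec of a segment. The text does not say how slopes were estimated; an ordinary least-squares fit over the defined frames is assumed here, the segment is assumed to end at the last frame's timestamp, and the function name `slope_over_final_window` is hypothetical. Unvoiced frames are represented as `None` and skipped.

```python
def slope_over_final_window(times_ms, values, window_ms):
    """Least-squares slope (value units per msec) over the final window.

    times_ms: frame timestamps in msec, in ascending order.
    values:   one value per frame; None marks an undefined (unvoiced) frame.
    Returns None when fewer than two usable frames fall in the window.
    Assumption: the segment ends at the last timestamp in times_ms.
    """
    if not times_ms:
        return None
    start = times_ms[-1] - window_ms
    points = [(t, v) for t, v in zip(times_ms, values)
              if v is not None and t >= start]
    if len(points) < 2:
        return None

    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    denominator = sum((t - mean_t) ** 2 for t, _ in points)
    if denominator == 0:
        return None
    numerator = sum((t - mean_t) * (v - mean_v) for t, v in points)
    return numerator / denominator


# Hypothetical pitch track: one frame every 10 msec, rising 0.5 Hz per msec.
times = list(range(0, 310, 10))
pitch = [100.0 + 0.5 * t for t in times]
pitch[25] = None  # simulate one unvoiced frame
final_200 = slope_over_final_window(times, pitch, 200)
```

Calling the function with windows of 100, 200, and 300 msec on the word, or with the longer windows on the previous turn, would yield the family of slope features the table lists.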
Table 8

Phonetic and session-specific features. Each line may describe one or more features. Features marked may be available in on-line conditions.

Phonetic features

Identity of each of w's phones.
Absolute and relative duration of each phone.
Absolute and relative duration of each syllable.

Session-specific features

– Session number. 
– Identity and gender of both speakers. 
