Abstract
Alzheimer's disease (AD) is an increasingly prevalent cognitive disorder in which memory, language, and executive function deteriorate, usually in that order. There is a growing need to support individuals with AD and other forms of dementia in their daily lives, and our goal is to do so through speech-based interaction. Given that 33% of conversations with people with middle-stage AD involve a breakdown in communication, it is vital that automated dialogue systems be able to identify those breakdowns and, if possible, avoid them.
In this article, we discuss several linguistic features that are verbal indicators of confusion in AD (including vocabulary richness, parse tree structures, and acoustic cues) and apply several machine learning algorithms to identify dialogue-relevant confusion from speech with up to 82% accuracy. We also learn dialogue strategies to avoid confusion in the first place, which is accomplished using a partially observable Markov decision process and which obtains accuracies (up to 96.1%) that are significantly higher than several baselines. This work represents a major step towards automated dialogue systems for individuals with dementia.
1. Introduction
Alzheimer's disease (AD) is a progressive neurodegenerative disorder that deteriorates cognitive, social, and executive functions. Typical effects of AD include declines in memory, executive capacity, and (crucially) linguistic ability (Cummings 2004). These declines complicate or prohibit the completion of activities of daily living, and more severe declines often require caregiver assistance. Caregivers who assist individuals with AD at home are common, but their involvement is often the precursor to placement in a long-term care facility (Gaugler et al. 2009). Approximately $100 billion are spent annually in North America on the direct and indirect care for patients with AD and other dementias, the majority of which is attributed to long-term institutional care (Ernst et al. 1997). As the population ages, the incidence of AD will double or triple, with Medicare costs alone reaching $189 billion in the United States presently (Bharucha et al. 2009).
Given the growing need to support this population, we are designing software that engages in two-way speech communication with individuals with dementia for two purposes: 1) to help guide that individual towards the completion of daily household tasks (e.g., brushing one's teeth), and 2) to fulfill a social function often left empty in relative isolation. Our goal is to encode in software verbal techniques that caregivers use when interacting with their patients, including automatically identifying and recovering from breakdowns in communication. In this article, we show that these breakdowns can reliably be identified by surface-level linguistic features, and we propose a statistical dialogue system for avoiding these breakdowns in the first place.
2. Previous Work
Formal caregivers are often trained in responsive verbal communication strategies that are meant to adapt to an individual's state-of-mind (Hopper 2001; Goldfarb and Pietro 2004). These include (Small et al. 2003) but are not limited to:
- 1.
Relatively slow rate of speech rate.
- 2.
Verbatim repetition of misunderstood prompts.
- 3.
Closed-ended questions (i.e., that elicit yes/no responses).
- 4.
Simple sentences with reduced syntactic complexity.
- 5.
Giving one question or one direction at a time.
- 6.
Minimal use of pronouns.
2.1 Communication Breakdown
Several theoretical models apply to communication breakdown. The trouble source-repair (TSR) model describes difficulties in communication in any dyads, including how repairs are initiated and carried out (Schegloff, Jefferson, and Sacks 1977). Difficulties can be phonological (e.g., mispronunciation), morphological/syntactic (e.g., incorrect agreement among constituents), semantic (e.g., disturbances related to lexical access), and discourse-based (i.e., misunderstanding of shared knowledge, or cohesion) (Orange, Lubinsky, and Higginbotham 1996). The majority of TSR sequences involve self-correction (e.g., by repetition, elaboration, or reduction of a troublesome utterance).
Orange, Lubinsky, and Higginbotham (1996) showed that whereas 18% of non-AD dyad utterances involve TSR, 23.6% of early-stage AD dyads and 33% of middle-stage AD dyads involved TSR. Of these, individuals with middle-stage AD exhibited more discourse-related difficulties including inattention, failure to track propositions and thematic information, and deficits in working memory. Contrary to TSR, the most common repair initiators and repairs given communication breakdown involved frequent wh-questions and hypotheses (e.g., “Do you mean...?”).
An alternative, closely related, paradigm for measuring communication breakdown is trouble-indicating behavior (TIB) involving implicit or explicit requests for aid. In a study of 10 seniors with varying degrees of dementia, Watson (1999) showed that there was a significant difference in TIB use (p < 0.005) between individuals with AD and the general population. This article adopts this model, which incorporates 12 distinct types of TIB (examples are derived from our own annotations of the DementiaBank corpus, described in Section 3):
- 1.
Neutral or non-specific requests for repetition (local). Minimal queries indicating non-understanding, which did not identify the problem specifically. E.g., What?Huh?
- 2.
Request for confirmation – repetition with reduction. Partial repair of a trouble source, often in the form of a question. Speaker 1: just tell me everything that you see happening there.... Speaker 2: uh these are things thatare happening ?
- 3.
Request for confirmation – complete repetition. Recapitulatory “echo” questions, often with pronoun alternation. Speaker 1: You tell me everything you see. Speaker 2: I'm to tell everything I see ?
- 4.
Request for confirmation – repetition with elaboration. Same as TIB 3, but with the inclusion of additional semantic content. Speaker 1: Tell me everything that you see happening in this picture. Speaker 2: Uh do do youwant like the window's open that sort of thing?
- 5.
Request for specific information. Contains a specific semantic concept, content word, or referent to the previous or recent turn. Speaker 1: Did I say about some water?
- 6.
Request for more information. A non-specific request (i.e., without direct mention of semantic concepts in a recent utterance). I don't understand.Tell me more.What do you mean?
- 7.
Corrections. Are the result of a violation in the quality of message or message inaccuracies. Here, semantic confusion often originates from the individual not indicating the TIB. Speaker 1: She dropped a dish no, no shedidn't drop a dish.
- 8.
Lack of uptake / lack of continuation. Verbal behaviors including 1) minimal feedback where back channel responses indicate non-understanding or lack of contribution to or elaboration on topic extension; 2) overriding where a participant does not allow the floor; and 3) topic switch where one participant abruptly changes topic. Speaker 1: Do you know what indexed means? Speaker 2: Yes. Speaker 1: What? Speaker 2: Oh, it's a bit too hard, bit late too late to...
- 9.
Hypothesis formation. Guessing behaviors involving supplying words or speaking for or on behalf of the other participant. This does not include hypotheses in the form of rhetorical questions (which are instead categorized as TIB 5). Speaker 1: We went to the farm. Speaker 2: You went to Riverdale Farm.; Speaker 1: we forgot to turn off the spigot.
- 10.
Metalinguistic comment. This includes “talk about talk” which explicitly refer to non-understanding of message, the interpersonal manner in which the message was conveyed, or the production of the message. I can'tremember.I don't understand.I guess there's more things I'm supposed to see.
- 11.
Reprise / minimal dysfluency. Reprises with partial or whole repetition (or revision) of the message. Minimal dysfluencies indicate difficulties producing a message that involve sound, syllable and word repetition, pauses, and fillers. These are deemed more excessive than the typical dysfluencies that occur in typical speech. the children are wearing umcompletely outfitted.
- 12.
Request for repetition–global. Minimal queries indicating non-understanding. Wait – go back to the part about…You just lost me.
In previous work, we studied ten older adults with Alzheimer's disease as they were guided, through speech, by a robot in the task of making a cup of tea. In that work, older adults were especially likely to exhibit TIB 8 (lack of uptake), but exhibited fewer TIBs, proportionally, when interacting with the robot than with a human conversant (Rudzicz et al. 2015).
2.2 Linguistic Factors in Dementia
Although memory impairment is the main symptom of AD, language impairment can be an important marker. Faber-Langendoen et al. (1988) found that 36% of mild AD patients and 100% of severe AD patients had aphasia, according to standard aphasia testing protocols. Ahmed et al. (2013) found that two-thirds of their participants showed subtle, but significant, changes in connected speech production up to a year before their diagnosis of probable AD. Weiner et al. (2008), in a study of 486 AD patients, reported a significant correlation between dementia severity and a number of different linguistic measures, including confrontation naming, articulation, word-finding ability, and semantic fluency.
Other indicative measures include noun rate, pronoun rate, verb rate, and adjective rate, per 100 words, which have been used in studies of aphasia and speech in affective disorders (Bucks et al. 2000; Fraser, Rudzicz, and Rochon 2013). The noun rate is associated with word-finding difficulties, and the pronoun rate measures the frequency of indirect referencing, which is higher for individuals who commonly forget names. Also, Bucks et al. (2000) suggested that the adjective rate measures the quality and expressiveness of speech and that the verb rate is related to the fluency of speech. Additional measures are described in Section 4.
3. Data
We use the DementiaBank (Goodglass and Kaplan 1983; Becker et al. 1994) database for this study. DementiaBank consists of narrative speech during the standard Cookie Theft picture description task from the Boston Diagnostic Aphasia Examination (Goodglass and Kaplan 1983). In this task, an examiner shows the picture in Figure 1 to the patient, requests a description of its contents, and is permitted to periodically encourage or prompt the patient. Each speech sample was recorded and manually transcribed at the word level following the TalkBank CHAT protocol (MacWhinney 2014). Narratives were segmented into utterances and annotated with filled pauses, paraphasias, and unintelligible words. The transcriptions were tokenized to remove whitespace, punctuation, and CHAT code tokens (e.g., “[//]”). Tokens that were CHAT-coded to indicate repetition were expanded; e.g., that's a dog [x 3] becomes that's a dog dog dog (MacWhinney 2014). The remaining tokens were lemmatized, which is effective in reducing lexical sparseness, but which also removes some information, such as verb tense, that can differentiate between control subjects and those with AD (Cortese et al. 2006).
Every participant is associated with a professionally administered Mini-Mental State Examination (MMSE) score, which provides a unified and validated scale for measuring the severity of dementia on a scale of 0 (greatest cognitive decline) to 30 (no cognitive decline), based on a series of questions in five areas: orientation, registration, attention, memory, and language (Folstein, Folstein, and McHugh 1975). English was the first language of all speakers.
Participants were assigned to the Dementia group primarily based on a history of cognitive and functional decline, and the results of the MMSE. In 1992, several years after the study had ended, the final diagnosis of each patient was reviewed on the basis of their clinical record and any additional relevant information (in 66.7% of the deceased, autopsy). From this group, we consider participants with diagnoses of ‘possible’ or ‘probable’ AD, resulting in 240 samples from 167 participants, because a few performed the description task annually (Boller and Becker 2005). We also consider control participants, resulting in 233 additional files from 97 speakers. Demographics are given in Table 1. The participants with dementia produced an average of 104.3 (s.d. 59.0) words per narrative, and the control participants produced an average of 114.4 (s.d. 59.5) words per narrative, although the distribution in both cases is somewhat right-skewed.
. | AD (n = 167) . | Control (n = 97) . |
---|---|---|
Mean age in years (s.d.) | 71.8 (8.5) | 65.2 (7.8) |
Mean education in years (s.d.) | 12.5 (2.9) | 14.1 (2.4) |
Sex (n male/female) | 52/115 | 37/60 |
Mean MMSE score (s.d.) | 18.5 (5.1) | 29.1 (1.1) |
. | AD (n = 167) . | Control (n = 97) . |
---|---|---|
Mean age in years (s.d.) | 71.8 (8.5) | 65.2 (7.8) |
Mean education in years (s.d.) | 12.5 (2.9) | 14.1 (2.4) |
Sex (n male/female) | 52/115 | 37/60 |
Mean MMSE score (s.d.) | 18.5 (5.1) | 29.1 (1.1) |
Some analysis (Section 4.1.4) uses the Carolina Conversations Collection (CCC), in which transcriptions and audio of undirected conversations between interviewees (older adults) and interviewers (younger adults) are recorded longitudinally (Pope and Davis 2011). In our analysis, we use the data of 31 interviewees with AD (7 men) and 41 without (9 men).
Across both databases, each transcript has an associated audio file, which we convert to 16-bit mono PCM format with a sampling rate of 16 kHz. All data were annotated at the utterance-level with specific TIBs by two speech-language pathologists, with an inter-annotator Fleiss' kappa agreement of κ = 0.84. Utterances with more than one TIB were allowed, but no more than two per utterance were ever detected.
4. Features
The following subsections outline the lexical and acoustic features used across our analyses, with special attention paid to features that are meant to be task-relevant, namely, cue words of TIBs, semantic similarity across utterances, word specificity, vocabulary richness, repetition, incomplete words, and fillers.
4.1 Lexical Features
We include various part-of-speech (POS) features, such as the proportion of nouns, verbs, light verbs, adjectives, adverbs, prepositional words, demonstratives, functional words generally, and the noun-to-verb ratio, obtained using the Stanford POS tagger (Toutanova et al. 2003). We measure the mean word frequency using the SUBTL norms (Brysbaert and New 2009), and each of the Bristol norms: imageability, age-of-acquisition, and familiarity. Stadthagen-Gonzalez and Davis (2006) describe imageability as “a semantic variable that measures how easy it is for a word to elicit mental images… used to evaluate the effects of meaning on memory and word recognition” and as closely related to “concreteness,” with some exceptions (Bird, Franklin, and Howard 2001); raters of imageability were asked to indicate how easily each word elicited mental images, on a 7-point scale. Familiarity is “often … interpreted as a measure of the frequency of exposure to a word”; ratings of familiarity were also on a 7-point scale, from words that they never had seen to those that they had seen very often (nearly every day). The Bristol norms were obtained from the ratings by 100 native English speakers across nouns, verbs, and some additional words (Stadthagen-Gonzalez and Davis 2006).
We estimate vocabulary “richness” through type-token ratio, average word length, BI, and HS, as described in Section 2.2. We also compute objectivity measures using the MPQA Subjectivity Lexicon (Wilson, Wiebe, and Hoffmann 2005), which categorizes words by their subjectivity (strong or weak) and their prior, out-of-context, polarity (positive, negative, both, or neutral). Using the subjectivity lexicon, we compute the proportion of all possible permutations (e.g., strong negative, weak negative, weak positive).
To estimate statistical deviations from typical language, we also compute the average log probability of each word in each utterance given two trigram models with Good-Turing smoothing, from the Brown and Switchboard (Godfrey, Holliman, and McDaniel 1992) corpora, respectively. Additional features are described in the following subsections.
4.1.1 Word Specificity
Word specificity is closely related to the word's position in a hypernym tree. For example, sedan is more specific than car. Because degraded cognitive function is correlated with a smaller active vocabulary, individuals with AD may be more likely to resort to more common, general words and therefore a lower degree of word specificity (Hirst and Feng 2012).
Word specificity, for nouns, can be approximated using WordNet (Miller 1995). Specifically, hypernym relationships between synonym sets form a hierarchal tree where the root (Entity) provides the most abstract sense. Therefore, the “word specificity” feature is the average depth of nouns in the WordNet (noun) hypernym tree, after recent work (Le 2010).
Adjectives and adverbs, unlike nouns, are not organized in hierarchies, but in bipolar scales between pairs of extremes; verbs are structured into several smaller hierarchies with different roots, which may limit the utility of computing depths of verbs in this way.
4.1.2 Semantic Similarity
Watson (1999) showed that almost half of TIBs exhibited by speakers without AD are either TIB 5 (request for specific information) or TIB 9 (hypothesis formation). These TIBs, and others, involve expressions that are semantically related to recent or preceding utterances. This is in stark contrast to how individuals with AD express trouble—over 30% of TIBs exhibited by these individuals are TIB 8 (lack of uptake) and over 65% are TIB 11 (reprise / minimal dysfluency), neither of which involves particular relations with preceding utterances. Our hypothesis is therefore that measures of semantic similarity with previous utterances will be indicative of AD.
Our intuition is that, additionally, TIB 1 (neutral or non-specific requests for repetition (local)) and TIB 10 (metalinguistic comment) would give low similarity scores.
4.1.3 Repetition, Incomplete Words, and Fillers
In slight contrast to Section 4.1.2, repetition of a sequence of words is a common coping mechanism for brief cognitive lapses (Guinn and Habash 2012) and is increased with degradation in cognitive ability. Watson (1999) and Guinn and Habash (2012) showed that those with AD exhibited repetition twice as often as their interviewers, on average.
Incomplete words occur when uttered words are abruptly terminated before completion. This can be caused by brief lapses in cognition (Guinn and Habash 2012) or might imply that the speaker is unsure of what they are trying to communicate. These are about twice as common in people with AD relative to the general population (Guinn and Habash 2012).
Fillers are non-words or short phrases that signal cognitive lapse (e.g., “umm,” “uh”). Single-word fillers are extracted using the UH tag in the Penn tagset. Near the beginning of utterances, fillers often indicate that the speaker has difficulty either comprehending what was said or forming a sentence. In the middle, they often indicate the need for a trouble source repair (Bortfeld et al. 2001). About 2.6% of words spoken by people with AD are filler phrases, compared with 1.2% of words spoken by healthy older adults (Guinn and Habash 2012).
4.1.4 Cue Words
Cue words are expressions that either link spans of discourse or signal semantic relationships in text. They are frequently used as identifying markers of the relative importance of sentences in document summarization (Teufel and Moens 1997) and in dialogue (Gravano, Hirschberg, and Beňuš 2012). We do not remove stop words here from consideration because AD patients are known to use a high proportion of function words such as pronouns, conjunctions, and articles that convey little meaning (Kempler 1995; Almor et al. 1999). Therefore, even non-content words that would typically belong to a stop list can be cues for AD.
To determine which tokens are cue words, we considered term frequency-inverse document frequency (tf-idf), but this was discarded empirically. Instead, we use two-tailed Welch's t-test to compare the mean frequency of each token in utterances with and without each TIB. The tokens were then ranked by p-value, as shown in Table 2, which includes data from the CCC. Despite differences in cue words across data sets, there is also overlap. Across both sets, the word pardon is a cue word for TIB 1 (neutral or non-specific request for repetition), no is a cue word for TIB 7 (corrections), and know is a cue word for TIB 10 (metalinguistic comment). This appears to be consistent with previous analysis of AD language use (Nicholas et al. 1985; Almor et al. 1999).
TIB . | CCC . | DementiaBank . |
---|---|---|
1. Neutral or non-specific request for repetition (local) | huh, pardon, what | hm, mention, pardon, really, anything |
2. Request for confirmation – repetition with reduction | N/A | did, mhm, mention, hear, say |
3. Request for confirmation – complete repetition | N/A | really, good |
4. Request for confirmation – repetition with elaboration | said | missed, again, bad, why, huh |
5. Request for specific information | ummm, what | no, talk, did, wanted, nope |
6. Request for more information | why, what | describe, am, matter, detail, more |
7. Corrections | ohhh, no | no |
8. Lack of uptake/ lack of continuation | umm | please, yes, uhhuh, what |
9. Hypothesis formation | uhoh, uhm, remember, yes | use, watch, my, suppose, find |
10. Metalinguistic comment | remember, know, forget, unbelievable, disappointing | mentioned, know, what, must, can't |
11. Reprise/minimal dysfluency | uh, the, um, but, because | k, hmm, uhhuh |
12. Request for repetition – global | N/A | N/A |
TIB . | CCC . | DementiaBank . |
---|---|---|
1. Neutral or non-specific request for repetition (local) | huh, pardon, what | hm, mention, pardon, really, anything |
2. Request for confirmation – repetition with reduction | N/A | did, mhm, mention, hear, say |
3. Request for confirmation – complete repetition | N/A | really, good |
4. Request for confirmation – repetition with elaboration | said | missed, again, bad, why, huh |
5. Request for specific information | ummm, what | no, talk, did, wanted, nope |
6. Request for more information | why, what | describe, am, matter, detail, more |
7. Corrections | ohhh, no | no |
8. Lack of uptake/ lack of continuation | umm | please, yes, uhhuh, what |
9. Hypothesis formation | uhoh, uhm, remember, yes | use, watch, my, suppose, find |
10. Metalinguistic comment | remember, know, forget, unbelievable, disappointing | mentioned, know, what, must, can't |
11. Reprise/minimal dysfluency | uh, the, um, but, because | k, hmm, uhhuh |
12. Request for repetition – global | N/A | N/A |
4.2 Acoustic Features
Aperiodicity and bradykinesia (slowed articulation) of speech, which are symptoms of neurodegenerative disorders such as Parkinsons disease (Bhatnagar 2002), are typically not associated with AD. Nevertheless, we perform recurrence quantification analysis of the cross recurrence (Marwan and Kurths 2002) of each utterance. We add these features because our recent work has found an acoustic component to AD (Fraser, Meltzer, and Rudzicz 2016). Specifically, for 3rd-order recurrence quantification analysis with delay 2, neighborhood 0.5, we compute the mean recurrence rate, determinism, {mean, maximal, and entropy of} diagonal line length, laminarity, trapping time, maximal vertical line length, recurrence time of 1st and 2nd types, clustering coefficient, and transitivity over windows of length 1,000 samples (with a 500-sample window shift). In total, there are 178 acoustic features.
5. Experiment 1: Identifying TIBs
Our first goal is to automatically identify TIBs using only lexical and acoustic features described in Section 4, and machine learning. Here, we construct three processed data sets: control, dementia, and both, as follows. The nature of the DementiaBank conversations is such that the interviewer is never in a position to exhibit a TIB. Therefore, all interviewer data points are removed. Initially, we are only interested in identifying the presence of a TIB without distinguishing between the different types of TIBs, so we simply conflate all TIB-annotated utterances into one class, and all others into a second class.
A large number of features include not-a-number (NaN) values (typically when the feature represents a ratio, e.g., number of light verbs to number of verbs in the utterance “Cookie” is 0/0). To account for these values, we use k-nearest neighbor (kNN) imputation (Batista and Monard 2002). This imputation substitutes each NaN with a weighted average of the k data points closest to the datum in question. Empirically, we set k = 3.
Furthermore, the density of TIB-class data points is very sparse, especially within the control data set. Models built on this data have the danger of performing majority class classification. Because the TIB-class is the minority class, a majority class classification model would do extremely poorly in identifying TIBs. The imbalance in the data set can be alleviated by several means. In this experiment, we generate synthetic data points using the synthetic minority oversampling technique (SMOTE), which under-samples the majority class and over-samples the minority class by creating synthetic minority class examples (Chawla et al. 2002). Here, each minority class sample introduces synthetic examples randomly along line segments that join the k = 5 minority class nearest neighbors.
5.1 Methodology
We use multilinear logistic regression as a baseline and a mixture density network (MDN) to estimate the presence of TIB. An MDN is a multilayer perceptron with one hidden layer with a variable number of hidden units and a Gaussian mixture probability density as the output (Bishop 1994). For this experiment, the output is a single Gaussian, given that increasing this number reduced performance. We train the network for a maximum of 6,000 cycles optimized by scaled conjugate gradient and negative log likelihood as the error function. The mean of the output Gaussian is rounded to estimate the TIB class. A rounded estimate of 0 signifies non-TIB and a rounded estimate of 1 signifies TIB. As with the logistic regression, if some rounded estimate does not equal 0 or 1, then the estimate is considered to indicate non-TIB.
Evaluation of the models are based on accuracy (% of correct estimates), sensitivity (true positives / all positives), and specificity (true negatives / all negatives) where TIBs are considered positive and non-TIBs are negative. We use 10-fold cross validation on the chosen processed data set (control, dementia, or combined) in which models are iteratively trained on data in 9 of the 10 random partitions and are then evaluated using the remaining partition.
Three models are evaluated: logistic regression, MDN with 100 hidden units, and MDN with a variably scaled number of hidden units where the number of hidden units is five times the number of features selected. The models are tested on each of control, dementia, and combined data.
5.2 Results of Identifying TIBs
First, we compare data from controls and those with dementia separately, before combining the data in Section 5.2.2.
5.2.1 Control vs. Dementia
Table 3 presents the top five features by absolute correlation with TIB class over all of the Control and Dementia data. Note that no features in common are shared between groups, indicating that TIBs may manifest differently in each population. On Dementia data, the MDN with 100 hidden units and 20 to 25 features shows the most promise (78.9% accuracy); on Control data, the logistic regression with 50 features performs the best (93.6% accuracy). Note that although Control models produce higher accuracy and specificity across both logistic regression and MDN methods (two-tailed heteroscedastic t-tests, p < 0.001), sensitivity is far lower than in Dementia data, suggesting that more TIBs are proportionally “missed” in Control data.
Control Features . | Dementia Features . |
---|---|
S1 → SQ (interrogative) | Num. of PRPs :: Num. NNs + Num. PRPs |
Num. DTs | Num. NNs |
Num. DTs :: Num. words | Num. NNs :: ( Num. NNs + Num. VBs) |
Mean imageability | NP → PRP (personal pronoun) |
Mean familiarity | Mean NN imageability |
Control Features . | Dementia Features . |
---|---|
S1 → SQ (interrogative) | Num. of PRPs :: Num. NNs + Num. PRPs |
Num. DTs | Num. NNs |
Num. DTs :: Num. words | Num. NNs :: ( Num. NNs + Num. VBs) |
Mean imageability | NP → PRP (personal pronoun) |
Mean familiarity | Mean NN imageability |
The logistic regression model on Dementia data achieves peak accuracy of 76.48% at 60 features, peak sensitivity of 74.04% at 60 features, and peak specificity of 80.06% at 35 features. On Control data, its peak accuracy is 93.57% at 70 features selected, its peak sensitivity is 31.96% at 50 features, and its peak specificity is 100.00% for ≥95 features. These results are shown in Figure 2.
The MDN with 100 hidden units on Dementia data achieves peak accuracy of 78.87%, peak sensitivity of 75.32%, and peak specificity of 84.05%, each at 20 features. On Control data, it has peak accuracy of 92.70% at 20 features, peak sensitivity of 13.51% at 30 features, and peak specificity of 99.70% at 10 features. This is shown in Figure 3.
The MDNs with five times as many hidden units as selected features on Dementia data achieve peak accuracy of 77.86% at 20 features, peak sensitivity of 75.376% at 20 features, and peak specificity of 82.42% at 25 features. On Control data, they have peak accuracy of 92.70% at 5 features, peak sensitivity of 61.91% at 180 features, and peak specificity of 100.00% at 5 features. This is shown in Figure 4.
5.2.2 Combined Data Set (Control and Dementia)
The top 20 features selected by absolute correlation with TIB class are shown in Table 4, along with p-values for these features after fitting the data with a multilinear logistic regression, given all data together. These features are composed entirely of lexical features, in contrast with related work on the same data in which we found that acoustic measures are highly indicative of AD (Fraser, Meltzer, and Rudzicz 2016). Indeed, pausing and phonation rate, along with word repetition, incomplete words, and fillers (see Section 4.1.3) have high diagnostic value, and we expected these to be related to TIB 11 (reprise / minimal dysfluency), but this was not the case. Interestingly, the average WordNet depth and mean noun familiarity are among the top five features, but make no appearance among the top five features selected from either the Control or Dementia sets separately. In particular, the ratio of pronouns to all nouns correlates positively with the presence of TIBs (as one might expect, given the relatively generic nature of the former) with a correlation of 0.43.
Feature Name . | r . | LR p-value . |
---|---|---|
Number of Pronouns :: Number of Pronouns and Nouns | 0.42812 | < 0.001 |
Mean Noun Imageability | −0.33533 | 0.395 |
Mean Noun Familiarity | −0.32535 | 0.0127 |
Average WordNet Depth (i.e., word specificity §6) | −0.31758 | < 0.001 |
NP → PRP | 0.31205 | 0.560 |
Number of Nouns | −0.31170 | < 0.001 |
Number of Nouns :: Number of Nouns and Verbs | −0.30733 | 0.353 |
Mean Imageability | −0.30579 | 0.187 |
Mean Age of Acquisition | −0.27172 | 0.00721 |
Number of Demonstratives :: Number of Words | 0.26869 | < 0.001 |
WHNP → WP (wh-NP with an NP GAP → wh-pronoun) | 0.26614 | < 0.001 |
Cue words (§6) | 0.25988 | 0.641 |
Number of Nouns :: Number of Words | − 0.25929 | 0.0411 |
NP → DT NN | −0.25321 | < 0.001 |
Number of Complex T-units :: Number of T-units | 0.24475 | 0.00231 |
Number of Dependent Clauses | 0.24265 | 0.00131 |
Noun Imageability | − 0.24141 | 0.456 |
Semantic Similarity (§6) | −0.23932 | 0.633 |
Number of Dependent Clauses :: Number of T-units | 0.23818 | 0.0483 |
Number of Complex T-units | 0.23598 | 0.0694 |
Feature Name . | r . | LR p-value . |
---|---|---|
Number of Pronouns :: Number of Pronouns and Nouns | 0.42812 | < 0.001 |
Mean Noun Imageability | −0.33533 | 0.395 |
Mean Noun Familiarity | −0.32535 | 0.0127 |
Average WordNet Depth (i.e., word specificity §6) | −0.31758 | < 0.001 |
NP → PRP | 0.31205 | 0.560 |
Number of Nouns | −0.31170 | < 0.001 |
Number of Nouns :: Number of Nouns and Verbs | −0.30733 | 0.353 |
Mean Imageability | −0.30579 | 0.187 |
Mean Age of Acquisition | −0.27172 | 0.00721 |
Number of Demonstratives :: Number of Words | 0.26869 | < 0.001 |
WHNP → WP (wh-NP with an NP GAP → wh-pronoun) | 0.26614 | < 0.001 |
Cue words (§6) | 0.25988 | 0.641 |
Number of Nouns :: Number of Words | − 0.25929 | 0.0411 |
NP → DT NN | −0.25321 | < 0.001 |
Number of Complex T-units :: Number of T-units | 0.24475 | 0.00231 |
Number of Dependent Clauses | 0.24265 | 0.00131 |
Noun Imageability | − 0.24141 | 0.456 |
Semantic Similarity (§6) | −0.23932 | 0.633 |
Number of Dependent Clauses :: Number of T-units | 0.23818 | 0.0483 |
Number of Complex T-units | 0.23598 | 0.0694 |
The logistic regression model achieves stable results with the exception of a large variance in sensitivity at 90 features selected. The logistic regression model has the highest rate of identifying TIBs (79.43% accuracy) with 75 selected features with a sensitivity of 60.93% while retaining good specificity (88.49%). The MDN with 100 hidden units obtains greater variance than the logistic regression model, but has several advantages. The peak accuracy and sensitivity at 81.55% and 63.26% of this MDN are strictly higher than the peak accuracy and sensitivity of the logistic regression. Furthermore, the MDN achieves these results at only 20 features selected compared with 75–80 features selected for the logistic regression. In environments where TIB estimation needs to be performed dynamically, it can be computationally intensive to extract a large number of features in real-time. The MDN with five times as many hidden units as input features selected (Figure 5) performs adequately well for up to 80 features selected although it exhibits a high variance as the number of features increases beyond 100, perhaps because of over-fitting. There is a large variance in the results for 30 features selected as well, which could be due to an unusual partitioning of the data during cross-validation.
5.3 Clustering TIBs
Although this work is focused on the detection of any TIB, an automated system should react differently to different types of TIB. In this section, we perform a brief tangential but related experiment in n-way classification for groups of TIBs, rather than the binary case of Section 5.2. To cluster TIBs into those groups by similarity, we first z-score each feature to eliminate the influence of those dimensions with excessively large variances. We then compute the mean of all feature vectors representing utterances for each TIB, and finally perform agglomerative hierarchical clustering using the shortest Euclidean distance between those vectors. The resulting dendrograms are shown in Figure 6.1
Qualitatively, certain patterns are evident; for example, there is high similarity between TIBs 8, 9, and 10 (metalinguistic or unrelated speech) and between TIBs 2 and 3 (different requests for confirmation) in Dementia. However, the latter pair are not as similar among Controls, perhaps because of a different degree of reduction in TIB 2. Also, Control subjects do not exhibit any lack of uptake (TIB 8) at all in this view, which is in sharp contrast to its relatively frequent occurrence among people with dementia, especially in our related work (Rudzicz et al. 2015). Control subjects also exhibit neither reprise (TIB 11) nor global request for repetition (TIB 12) in these data, and subjects with Dementia exhibit neither the latter nor request for confirmation (repetition with elaboration, TIB 4).
For these experiments (and those in Section 6.3, where these clusters are used), we sweep the number of TIB clusters (excluding no-TIB) from 1 to 5. Specifically, one class is “no-TIB” and another is all remaining TIBs, combined, after isolating, iteratively, TIBs 7, 5, 1, and 6 in Dementia; TIBs 5, 6, {3, 4, and 7}, and {9 and 10} in Controls, and TIBs 4, 7, 1, and 6 in the combined set. Figure 7 shows accuracy in TIB cluster classification for each of the Dementia, Controls, and combined data sets, using the top 30 features in each, for each value of n (including no-TIB). As expected, there is a significant trend towards lower accuracy as the number of classes increases.
6. Experiment 2: Avoiding TIBs (While Completing Tasks) with POMDPs
Software that can engage older adults with AD in dialogue can benefit from tracking the speech and language characteristics of those individuals and from automatically identifying when the dialogue breaks down. Such software must also optimize its output to the user, especially if the assistance offered by caregivers in the completion of daily tasks is to be mimicked.
Here, we build a dialogue system to avoid TIBs using a partially observable Markov decision process (POMDP; see Section 6.1) using DementiaBank data. The specification of the data that we used are shown in Table 5. We removed the dialogues that do not include at least one interviewer and one participant turn. Individuals with dementia were much more likely (two-tailed t(1) = 318.36, p < 0.001) to exhibit TIBs in an utterance (10.17%) than Controls (2.25%). Note that in order to design a POMDP for reducing confusion, there must be some other function (e.g., an extrinsic task) to optimize. Our pilot work involved focusing only on TIBs in the CCC database, which involves free-form conversation. However, without a task to complete, the POMDP consistently decided that the best way to avoid confusion is to simply say nothing at all; therefore, we only consider the DementiaBank data here, as described subsequently.
. | No. of dialogues . | No. of turns . | TIB% . | Mean turns/dialogue (s.d.) . | Max turns/dialogue . |
---|---|---|---|---|---|
Dementia data | 283 | 5128 | 10.17% | 3.69 (3.38) | 39 |
Control data | 159 | 2080 | 02.25 % | 2.25 (1.37) | 7 |
Combined data | 442 | 7208 | 07.89% | 3.16 (2.20) | 39 |
. | No. of dialogues . | No. of turns . | TIB% . | Mean turns/dialogue (s.d.) . | Max turns/dialogue . |
---|---|---|---|---|---|
Dementia data | 283 | 5128 | 10.17% | 3.69 (3.38) | 39 |
Control data | 159 | 2080 | 02.25 % | 2.25 (1.37) | 7 |
Combined data | 442 | 7208 | 07.89% | 3.16 (2.20) | 39 |
Table 6 shows a sample dialogue with an individual with dementia. Here, the Speaker column indicates interviewer (INV) or participant (PAR) turns. We recruited speech-language pathologists (SLPs) to annotate the dialogues with actions. For interviewer turns, actions are ASK (explicit requests for information), ENCOURAGEMENT (e.g., “you're doing great”), and REPEAT (verbatim or paraphrased repetitions of previous requests, including continuations). In the interviewer's CHAT transcripts, there are occasionally turns consisting of the “mhm” token that corresponds to “Yes” or ACKNOWLEDGMENT. Note that it is possible for an interviewer to utter none of these actions. As the interviewers in this database were explicitly instructed against other forms of interaction, no other actions (e.g., responses to requests for information, TIB 6) were recorded.
Speaker . | Actions/States transcripts . | Transcript, in reduced CHAT format . |
---|---|---|
PAR | SINK | the pearl [: poor] [* p:w] &mo moms gettin(g) her wet [//] feet wet (be)cause she thinking of days gone by and then the water run . [+ gram] . |
PAR | COOKIE | (.) and &uh that boy whether he knows or not hes gonna [: going to] crack his head on the back of that counter trying to get too many cookies out . . |
PAR | COOKIE-TIB | (..) he [//] (.) &uh I dont know the significance of that little girl putting her hand to her mouth reaching . . |
PAR | TIB | unless thats the way all silly girls act at that age . [+ gram]. |
PAR | TIB | pardon ? [+ exc]. |
… |
Speaker . | Actions/States transcripts . | Transcript, in reduced CHAT format . |
---|---|---|
PAR | SINK | the pearl [: poor] [* p:w] &mo moms gettin(g) her wet [//] feet wet (be)cause she thinking of days gone by and then the water run . [+ gram] . |
PAR | COOKIE | (.) and &uh that boy whether he knows or not hes gonna [: going to] crack his head on the back of that counter trying to get too many cookies out . . |
PAR | COOKIE-TIB | (..) he [//] (.) &uh I dont know the significance of that little girl putting her hand to her mouth reaching . . |
PAR | TIB | unless thats the way all silly girls act at that age . [+ gram]. |
PAR | TIB | pardon ? [+ exc]. |
… |
For participant turns, the SLPs instead annotated both the presence of specific TIBs and whether picture elements were mentioned. Those picture descriptions relate to the state of task completion and are, if extant, either SINK (i.e., information about the right part of the picture in Figure 1) or COOKIE (i.e., information about the left part of the picture). Fleiss' kappa, κ = 0.84, indicates satisfactory agreement, so we consider the annotations of the first SLP here.
6.1 Partially Observable Markov Decision Processes
POMDPs are generalized Markov decision processes that inherently model uncertainty with regards to latent factors in a system, namely, the state space S which can be further factorized into substates with conditional dependencies. It is a model of an active agent in the world whose objective is to map a set of observations O from the environment to the optimal action A to perform to maximize the reward over time. This results in a policy of actions to take to optimize the reward. To learn a full POMDP model, we solve for the transition probabilities between states, P(st +1|st,A) where st ∈ S, the observation function P(ot|st) where ot is the observation in O at time t, and an optimal mapping of actions to observations π = f(ot) ∈ A.
6.2 POMDP Design
We define the POMDP states associated with each turn in the dialogue with the triple [TIB, SINK, COOKIE]. Because each of the component variables is binary, there are 23 = 8 POMDP states in total. Similarly, the POMDP actions are defined using zero or more actions from [ASK, ENCOURAGEMENT, REPEAT]. Thus, we also have eight possible action values. The POMDP observations are based on all features described in Section 4. Each feature is discretized to four different classes. This is done by fitting values of each feature to a normal distribution. The middle class (class 2) is centered on mean μ and covers the range [μ − σ, μ + σ]; class 1 covers feature values less than μ − σ, and class 3 covers feature values greater than μ + σ. Finally, class 0 is reserved for missing values.
Discretizing features in this way allows for discrete-space modeling, but results in 4(81+178) possible observation vectors, which is clearly intractable. We therefore empirically select the top three features from Table 4 with maximal correlation to TIBs. These features include (1) the proportion of number of pronouns to the number of pronouns and nouns, (2) the number of nouns, and (3) the proportion of the number of demonstratives to the number of nouns. We tried more and fewer features; accuracy increases slightly up to eight features, but it was not feasible to add more than 2,048 observations in the POMDP. Furthermore, we use one feature for the SINK state and one for the COOKIE state. For each participant utterance, the SINK feature encodes the number of keywords mentioned about the SINK part of the picture (the right part) and the COOKIE feature encodes the number of keywords concerning the COOKIE part of the picture (the left part).
POMDP reward functions are often defined to encourage task completion (Young et al. 2013). Similarly, we set a reward of +1 when either SINK or COOKIE is mentioned, +10 if both are mentioned, and 0 otherwise. In order to avoid confusion in the participant, TIBs are penalized by −5 if any TIB occurs, otherwise the reward is increased by +5. This reward function naturally favors states in which both SINK and COOKIE occur, without TIB.
Given all annotations of state st, action matrix A and observation vector ot, we calculate the following likelihoods for DementiaBank data using maximum likelihood estimation:
- 1.
P(st +1|st,A)
- 2.
P(ot|st,A)
In our experiments we learn three POMDPs: Dementia POMDP, Control POMDP, and Combined POMDP, where those transition and observation functions are learned separately from dementia dialogues, control dialogues, or both, respectively. The discretization step for observations are also done separately for each type of model given the representative data.
6.3 POMDP Results
After learning the parameters of each POMDP model, the policy is approximated using point-based value iteration, which uses a finite number of beliefs for solving a POMDP model approximately. The number of beliefs is a parameter that is generally hand-tuned empirically. For our experiments, the solver used 10,000 random samples to estimate the optimal policy.
POMDP policies can be evaluated based on aspects such as dialogue lengths and accumulated discounted rewards in simulation runs (Young et al. 2013). We evaluate the learned POMDP policies in 1,000 simulation runs using the default simulation in Perseus (Spaan and Vlassis 2005) in which the true state of the user is estimated from the provided transition function, and the observations are sampled from the provided observation function. The discounted rewards are calculated based on the reward function and the “true” user state.
We compare the mean dialogue turns (until the maximum reward is reached) and the mean of accumulated discounted rewards for the POMDP policy against baseline policies of only performing one action, that is, either ASK, ENCOURAGEMENT, or REPEAT. A random policy (RAND) is also attempted in which one of the eight possible POMDP actions is performed at each turn. We also estimate POMDP accuracy using existing dialogue annotations. This is done by representing each dialogue in the form of an action observation trajectory. After learning the POMDP policy, we start by the initial belief, b0. The action suggested by the POMDP for the initial belief, a0, is compared with the interviewer's action, h0. The belief is updated to b1 based on the received observation. Subsequent POMDP actions are then compared to caregiver actions until the end of that dialogue (trajectory). Accuracy is the proportion of actions taken by the POMDP that match at least one of the annotated actions, in each turn. Because this is a disjunctive operation on binary vectors, we expect a random policy to do well.
The results in Table 7 show that the learned POMDPs perform well based on simulation runs (shorter and more task-oriented conversations are preferred) and accuracy. In particular, the results show that the Dementia POMDP performs better than all the baseline policies, in all cases at p < 0.05 given two-tailed t-tests with Bonferroni correction. The Dementia POMDP produces shorter dialogues to achieve its maximum reward, on average around four turns. The Dementia POMDP also accumulates more rewards on average. The standard deviations for both dialogue turns and accumulated rewards are smaller than those values for the baseline policies. We also calculate the observed agreement of the PODMP actions with those of human caregivers based on Cohen's kappa, achieving κ = 0.66 (Control), κ = 0.42 (Dementia), and κ = 0.55 (Combined), each of which constitutes moderate agreement between the POMDP and human annotators.
Dementia . | POMDP . | ASK . | ENCG . | REPEAT . | RAND . |
---|---|---|---|---|---|
Dialogue turns | 4.23 (2.22) | 5.12 (4.04) | 11.11 (8.61) | 7.29 (6.89) | 8.69 (7.20) |
Accumulated rewards | 51.75 (8.59) | 45.27 (13.24) | 30.23 (11.99) | 42.63 (9.23) | 31.59 (11.53) |
Accuracy | 91.4% | 32.5% | 36.3% | 27.8% | 86.2% |
Control | POMDP | ASK | ENCG | REPEAT | RAND |
Dialogue turns | 2.94 (1.53) | 2.95 (1.82) | 6.53 (4.55) | 3.20 (1.95) | 4.16 (2.87) |
Accumulated rewards | 57.64 (8.30) | 49.70 (10.47) | 45.15 (7.14) | 36.16 (11.39) | 38.47 (11.12) |
Accuracy | 96.1% | 25.6% | 19.5% | 14.8% | 84.1% |
Combined | POMDP | ASK | ENCG | REPEAT | RAND |
Dialogue turns | 4.03 (2.47) | 3.73 (2.53) | 9.52 (7.05) | 6.28 (6.24) | 6.17 (4.76) |
Accumulated rewards | 50.86 (8.99) | 49.34 (10.11) | 33.30 (10.79) | 42.09 (10.92) | 34.42 (11.92) |
Accuracy | 94.1% | 30.8% | 32.0% | 24.4% | 86.7% |
Dementia . | POMDP . | ASK . | ENCG . | REPEAT . | RAND . |
---|---|---|---|---|---|
Dialogue turns | 4.23 (2.22) | 5.12 (4.04) | 11.11 (8.61) | 7.29 (6.89) | 8.69 (7.20) |
Accumulated rewards | 51.75 (8.59) | 45.27 (13.24) | 30.23 (11.99) | 42.63 (9.23) | 31.59 (11.53) |
Accuracy | 91.4% | 32.5% | 36.3% | 27.8% | 86.2% |
Control | POMDP | ASK | ENCG | REPEAT | RAND |
Dialogue turns | 2.94 (1.53) | 2.95 (1.82) | 6.53 (4.55) | 3.20 (1.95) | 4.16 (2.87) |
Accumulated rewards | 57.64 (8.30) | 49.70 (10.47) | 45.15 (7.14) | 36.16 (11.39) | 38.47 (11.12) |
Accuracy | 96.1% | 25.6% | 19.5% | 14.8% | 84.1% |
Combined | POMDP | ASK | ENCG | REPEAT | RAND |
Dialogue turns | 4.03 (2.47) | 3.73 (2.53) | 9.52 (7.05) | 6.28 (6.24) | 6.17 (4.76) |
Accumulated rewards | 50.86 (8.99) | 49.34 (10.11) | 33.30 (10.79) | 42.09 (10.92) | 34.42 (11.92) |
Accuracy | 94.1% | 30.8% | 32.0% | 24.4% | 86.7% |
As mentioned in Section 5.3, we split the TIB class into K = 1…5 clusters. Then, we construct new POMDPs given each K, that is, we designed POMDPs with 1, 2, 3, 4, and 5 TIB states, plus one no-TIB state; there are also four task states, namely, two values on the left and two values on the right of the cookie-theft picture. We also update the transition function of each POMDP based on their subsequent states using a simple maximum-likelihood function with an ε = 0.1 smoothing parameter. The new POMDPs are created using Dementia, Control, and Combined data.
We compare the mean number of dialogue turns (until the maximum reward is reached) and the mean of accumulated discounted rewards for the POMDP policy against baseline policies of only performing one action (i.e., either ASK, ENCOURAGEMENT, or REPEAT). A random policy (RAND) is also attempted in which one of the eight possible POMDP actions is performed at each turn. We also defined two hand-crafted policies. The first (HC policy) starts with the ASK action (explicit request for information), then performs ENCOURAGEMENT (e.g., “you're doing great”), and then the REPEAT action (verbatim or paraphrased repetitions of previous requests, including continuations). This HC policy continues these last two actions until the end of conversation. The second hand-crafted policy takes a frequency-based approach. For each state, the fHC policy (frequentist HC) performs the action that has been performed by caregivers most frequently, given the state. At the time of interaction, the fHC policy uses the POMDP belief; that is, it takes the maximum likelihood state from the belief and performs the caregivers' most frequent action for that state. Note that one separate fHC policy is learned for each population (i.e., Dementia, Control, and Combined) separately. The POMDP performance is shown, over 1,000 simulation runs, in Figures 8 and 9, as the number of TIB states increase from 2 to 6 (note that there is no-TIB state besides K = 1…5 TIB clusters).
Figures 8 (top) and 9 (top) show the performance of the policies on the Dementia model. These show that the optimized policies of the learned Dementia POMDP have consistently better performance than the baseline policies, including the heuristic ones. Figure 9 (top) shows that by increasing the number of classes from 2 to 5 (increasing the number of POMDP states from 8 to 24), the mean of rewards for the Dementia POMDP slightly decreases from 49.23 (s.d. 10.12) to 42.42 (s.d 16.24), over the simulation runs. The second best performance is the ASK policy that, on average, collects 39.21 (s.d. 10.53) and 35.83 (s.d. 13.38) reward, when two and six classes are used, respectively. The third best policy is the Dementia fHC policy that accumulates 30.05 (s.d. 13.77) and 31.48 (s.d. 12.83) rewards on average, respectively, when two and six classes are used.
In addition, the POMDP policies, on average, have shorter dialogue turns to reach the maximum reward than any other policies, based on Figure 8 (top). When six classes are used (24 POMDP states), it takes 4.97 (s.d. 2.99) dialogue turns, on average, for the Dementia POMDP policy to reach the maximum reward, whereas the ASK and Dementia fHC policies take 7.75 (s.d. 8.47) and 6.69 (s.d. 6.75), on average, respectively.
Figures 8 (middle) and 9 (middle) show that, among the Control models, the POMDP and REPEAT policies have the highest performance. The high performance of the REPEAT policy (which asks questions such as “what else do you see in the picture?”) could be attributed to the Control population's consistent providing of new information when this action is performed, achieving greater rewards.
Figures 8 (bottom) and 9 (bottom) describe the performance of the policies on the Combined model. The POMDP policy again has the highest performance; when six classes (24 POMDP states) are used, the Combined POMDP on average accumulates 44.60 (13.07) rewards. For the same group, the ASK policy and the Combined fHC policy, on average, collect 34.24 (s.d. 16.57) and 34.61 (s.d. 13.35) rewards, respectively.
Overall, the POMDP policies show better performance than all other policies, based on the mean dialogue turns (until the maximum reward is reached) and the mean of accumulated discounted rewards in simulation runs. In this work, we manually set the reward function, empirically. In the future, we intend to learn this function based on inverse reinforcement learning (Ng and Russell 2000; Choi and Kim 2011; Chinaei and Chaib-Draa 2014).
7. Discussion
This article provides a first assessment of the automatic detection and avoidance of verbal indicators of confusion in dialogue with individuals who have Alzheimer's disease. We have demonstrated that it is possible to identify trouble indicating behaviors (TIBs, both at all, and in specific groups) explicitly in the speech of older adults with AD and matched controls, with accuracies up to 79% and 93%, respectively, using a neural network model and both acoustic and lexico-syntactic features, although the latter appear to be far more indicative. This was validated on a large set of data involving conversations around a standardized picture-description task. Much of this work is based on prior work in which we used similar linguistic features to train machine learning classifiers to distinguish between participants with and without AD (Fraser, Meltzer, and Rudzicz 2016). In that work, we also studied the degree of heterogeneity of linguistic impairments in AD and found four clear factors: semantic impairment, acoustic abnormality, syntactic/fluency impairment, and information/situational impairment. With some modification, assessment of this type (longitudinal or otherwise) can be seamlessly integrated into the conversation-focused systems described in the current article.
Furthermore, we learned several POMDPs using data from individuals with and without AD (both separately and combined) and evaluated those models using simulation runs and accuracy relative to trained human participants. In simulation, learned POMDPs produce shorter dialogues than four baseline models and accumulate more reword, on average; they also resemble humans far more accurately. This is closely related to our previous work in studying TIBs in older adults with moderate AD interacting with a physical (Wizard-of-Oz) conversational robot in a home setting (Rudzicz et al. 2015). In that work, across the verbal behaviors that indicate confusion, older adults with AD were very likely to simply ignore the robot (TIB 8; lack of uptake), which accounts for over 40% of all such behaviors. However, individuals with AD were much more likely (t(18) = −4.78, p < 0.0001) to exhibit no TIB when interacting with a robot (18.1% of utterances) than with a non-familiar human (6.7%). The current work completes the largest missing component in that work, namely, the automatic selection of prompts given speech input. Ongoing work involves the use of inverse reinforcement learning to learn the reward function of the POMDP from data.
Given our rapidly aging populations, it is increasingly vital that we develop technological approaches to caring for the elderly in their homes. To a large extent, this will involve speech interaction with augmentative or assistive technologies. Although speech recognition accuracy can be improved in a home-care environment by limiting the vocabulary, and by adapting the acoustic model to the individual (Rudzicz et al. 2015), word error rates remain high in such an environment. Fortunately, while the accuracy of automatic assessment decreases with increased word-error rates, recent work suggests this is weakly correlated (r = −0.31) (Zhou, Fraser, and Rudzicz 2016). Also, in practice, several of our conclusions must be tempered by the relatively precise nature of the picture description task. Differences will clearly exist in different interaction types, such as robot-directed activities/exercises, or casual (undirected) conversation. We are currently exploring a variety of hidden variables, including task type and complexity, in our own data collection with a small humanoid robot, but other forms of interaction, including with tablet- or computer-based interfaces and ambient technologies, must also be pursued.
Finally, if assessment and monitoring are increasingly aided by communication software, especially during activities of daily living, then certain bioethics questions arise. Who will bear the responsibility for incorrect decisions if those decisions are the product of statistics and the data from which they are derived? If medically relevant information can be obtained from recordings in one's home, how will individual privacy be maintained, especially if increasingly popular cloud-based speech services are to be used? Inequalities of access and potential loss of privacy are but two risks that must be considered now because—if the benefits of quantifiable, repeatable, and accurate clinical care by machine outweigh those risks—then a fundamental change to healthcare is inevitable.
Acknowledgments
Hamidreza Chinaei is supported by a fellowship from the AGE-WELL NCE and another from the Fonds de recherche du Québec – Nature et technologies. Frank Rudzicz's contribution to this work is funded by an NSERC Discovery grant (RGPIN 435874), by a Young Investigator award by the Alzheimer Society of Canada, and from a project grant from the AGE-WELL NCE. We gratefully thank Maria Yancheva for her insightful and incisive suggested revisions to this work. The original acquisition of the DementiaBank data was supported by NIH grants AG005133 and AG003705 to the University of Pittsburgh, and maintenance of the data archive is supported by NIH-NIDCD grant R01-DC008524 to Carnegie Mellon University.
Note
A third dendrogram, for the combined set, is not shown.
References
Author notes
Department of Computer Science, University of Toronto, and Toronto Rehabilitation Institute-University Health Network.
Department of Computer Science, University of Toronto.
Toronto Rehabilitation Institute-University Health Network, and University of Toronto, Department of Computer Science. E-mail: [email protected].