Abstract
Humans rarely perform better than chance at lie detection. To better understand human perception of deception, we created a game framework, LieCatcher, to collect ratings of perceived deception using a large corpus of deceptive and truthful interviews. We analyzed the acoustic-prosodic and linguistic characteristics of language trusted and mistrusted by raters and compared these to characteristics of actual truthful and deceptive language to understand how perception aligns with reality. With this data we built classifiers to automatically distinguish trusted from mistrusted speech, achieving an F1 of 66.1%. We next evaluated whether the strategies raters said they used to discriminate between truthful and deceptive responses were in fact useful. Our results show that, although several prosodic and lexical features were consistently perceived as trustworthy, they were not reliable cues. Also, the strategies that judges reported using in deception detection were not helpful for the task. Our work sheds light on the nature of trusted language and provides insight into the challenging problem of human deception detection.
1 Introduction
Humans are notoriously poor lie detectors, most performing at chance level or worse (Bond Jr and DePaulo, 2006). This result has been found across a wide variety of deception detection tasks, in multiple modalities, and in different cultures. Although poor performance has been well-attested, very little work has been done to understand why humans perform so poorly at detecting lies.
Because humans are so poor at deception detection, there have been many efforts to develop automated methods to detect deception in multiple modalities. Biometric indicators, typically measured by the polygraph (a device used to detect lies by measuring blood pressure, pulse, respiration, and skin conductivity), have been shown to perform poorly at deception detection (Eriksson and Lacerda, 2007). Facial expressions (Ekman, 2009a), gestures and body posture (Lu et al., 2005; Tsechpenakis et al., 2005), and even brain imaging (Meijer and Verschuere, 2017) have been explored as potential indicators of deception. Some of these features are difficult or expensive to capture automatically, or are too invasive to be practical for general use. In recent years, automatic deception detection has gained popularity in the speech and NLP communities. Language cues have the advantage of being inexpensive, non-invasive, and easy to collect automatically. More importantly, prior research examining linguistic cues to deception has been promising. Researchers have used machine learning to identify deceptive language in various domains, including court testimonies (Fornaciari and Poesio, 2013), hotel reviews (Ott et al., 2011), and interview dialogues (Levitan et al., 2018b). These automated methods have demonstrated that machine learning classifiers can indeed identify deceptive language with accuracy between 70% and 90%, depending on the task—much better than human performance on the same task. These studies have also identified specific characteristics of deceptive language.
Despite these important advances in understanding and automatically identifying deception, there has been little work investigating human perception of deception. What linguistic and prosodic characteristics of an utterance lead listeners to believe that it is true—to trust it—regardless of whether it is true or not? Why do people frequently believe lies? How do the strategies humans use in lie detection align with actual indicators of deception and how do they relate to people’s performance in lie detection? Can we in fact train machine learning classifiers to automatically identify speech that will be perceived as truth (trusted) or lie (mistrusted) by humans?
To investigate these questions, we created a lie detection game, LieCatcher, to conduct a large-scale study of human perception of deception. The stimuli for this game were drawn from a large corpus of previously collected truthful and deceptive dialogues; players were asked to judge whether single utterance spoken responses to written questions were truthful or deceptive. We distributed the game on Amazon’s Mechanical Turk crowd-sourcing platform to collect large scale judgments of deceptive or true responses to a set of biographical questions. We systematically analyzed a number of linguistic and prosodic features in the rated responses to understand the characteristics of trusted vs. mistrusted speech. We compared these features to the actual characteristics of truthful and deceptive responses presented in the game to identify the similarities and differences between human perception of deception and the actual production of deception. We also examined player-reported strategies to discover which the raters believed to be useful and which were in fact useful or not useful for detecting deception. Finally, we trained machine learning classifiers using a large set of lexical and speech features to automatically identify human-trusted speech.
The contributions of this paper include: 1) A large-scale analysis of linguistic and prosodic cues to trust compared with cues to deception; this adds considerably to our scientific understanding of human perception of deception. Our results show that there are several prosodic and lexical features that were consistently perceived as trustworthy, but that these were not reliable cues to deceptive speech. 2) A game framework for studying deception perception, which can be extended to other speech and language perception studies. 3) A classifier that uses lexical and acoustic-prosodic features to identify speech that was trusted by humans, achieving an F1 of 66.1%. 4) An analysis of successful and unsuccessful human strategies for detecting deception, showing that strategies that judges reported using in deception detection were not helpful for the task. We further believe that this latter analysis may be useful for training humans to detect lies more successfully.
2 Related Work
Previous studies have examined deceptive language in various domains, including fake reviews (Ott et al., 2011), public trials (Pérez-Rosas et al., 2015), TV shows (Pérez-Rosas et al., 2015), Twitter (Addawood et al., 2019), opinions on controversial topics (Mihalcea and Strapparava, 2009), online games (Zhou et al., 2004), and interviews (Levitan et al., 2018a, b). Machine learning classifiers have been shown to outperform human judges by a large margin. For example, Ott et al. (2011) trained a deception classifier that achieved nearly 90% accuracy on a corpus of fake hotel reviews, whereas human accuracy was about 60%.
Researchers have also examined various features that are characteristic of truthful vs. deceptive language. A meta-study by Bond Jr and DePaulo (2006) highlighted several patterns of deceptive language found in multiple studies, such as shorter responses, fewer details, and more negative emotions. Other cues to deception that have been identified include language that is less sensory or concrete (Ott et al., 2011; Vrij et al., 2006). Truthful language has been found to contain more linguistic markers of certainty (Levitan et al., 2018b; Rubin et al., 2006). Syntactic features such as lexicalized production rules and part of speech tags have also been shown to be useful in predicting deception (Feng et al., 2012; Pérez-Rosas and Mihalcea, 2015). Linguistic Inquiry and Word Count (LIWC) (Pennebaker and King, 1999), which groups words into psychologically meaningful dimensions, has also been used extensively in deception studies (Ott et al., 2011; Pérez-Rosas and Mihalcea, 2015; Pérez-Rosas et al., 2015). Prosodic cues to deception have also been identified; for example, Levitan et al. (2018a) found increased pitch maximum and intensity maximum are indicators of deception. Though these studies are critical for advancing the state of machine deception detection and for understanding the nature of deceptive language, they do not address the question of human perception of deception, which is the focus of this work. We aim to gain insight into why humans are poor judges of deception by comparing actual cues to deception with characteristics of language trusted and mistrusted by humans.
Psychology research of human deception detection has traditionally focused on facial expression cues (Ekman et al., 1991; Frank et al., 2008) and personal beliefs about what characterizes deceptive behavior (The Global Deception Research Team, 2006; Granhag and Strömwall, 2004; Wright et al., 2014). Based on worldwide survey studies, The Global Deception Research Team (2006) found pan-cultural deception stereotypes that liars tend to be nervous with flawed speech. However, Hartwig and Bond (2011) pointed out the methodological limitation of such studies: We cannot be certain that what people report reflects their actual decision process (Nisbett and Wilson, 1977). Our work attempts to decipher the cues people actually use to detect lies by examining features of utterances that are labeled as true by participants, compared with features of utterances rated as lies.
3 CXD Corpus
We used deceptive and truthful utterances from the Columbia X-Cultural (CXD) Corpus for our deception perception study (Levitan et al., 2015). The CXD Corpus is a collection of interviews between native speakers of Standard American English and Mandarin Chinese, all speaking in English. It contains 122 hours of conversational speech between 340 individuals. Previously unacquainted pairs of participants were brought into the lab to interview one another. They were first surveyed for gender and native language and asked to complete the NEO-FFI personality inventory (Costa and McCrae, 1989). They were then asked to provide true answers to a set of 24 biographical questions and then to provide false answers for a random half we chose. Interviews took place in a sound- proof booth and each pair of participants took turns playing the role of interviewer and interviewee. During the game, the interviewer asked the 24 questions in any order and was encouraged to ask follow-up questions to help determine whether the interviewee was lying or telling the truth about each question. Participants were financially compensated for both successful deception and successful deception detection. Table 1 provides sample responses to one of the questions.
T | my dad works i don’t i never really know how to say it he works with computers um in information technology |
T | he’s a technical engineer at draper laboratory |
F | he works for um it’s like um a subsidiary of walgreens kind of it’s very it’s very corporate it’s like a big big very impersonal company which is i think he doesn’t like about it |
F | uh my dad is an official in uh in the government system |
T | my dad works i don’t i never really know how to say it he works with computers um in information technology |
T | he’s a technical engineer at draper laboratory |
F | he works for um it’s like um a subsidiary of walgreens kind of it’s very it’s very corporate it’s like a big big very impersonal company which is i think he doesn’t like about it |
F | uh my dad is an official in uh in the government system |
The recorded interviews were orthographically transcribed using Amazon Mechanical Turk (AMT) crowd-sourcing and the transcripts were force-aligned with the audio recordings using the Kaldi Speech Recognition Toolkit (Povey et al., 2011). The interviews were segmented using a question identification classifier (Maredia et al., 2017). All interviewee turns were automatically identified using the question identification system and subsequently hand-corrected. The corpus was segmented into: 1) question responses: The single interviewee turn directly following the question; 2) question chunks: All interviewee turns in (1) plus answers to subsequent follow-up questions. We used the single turn question response segmentation for our deception perception study, so as not to influence raters’ responses with interviewers’ follow-up questions.
4 LieCatcher
Using the data described in Section 3, we created a lie detection game called LieCatcher.1 LieCatcher is a Game With A Purpose that allows players to assess their overall ability to detect lies, while simultaneously providing deceptive speech judgments that we then use to study deception perception. We developed LieCatcher in Unity2 and hosted the game on the Web. Figure 1 shows a screenshot from the game. In the game, players are shown a text version of a question asked by one interviewer and then listen to the single-turn spoken interviewee response. After listening, the player selects a “Truth” or “Lie” button, indicating their perception of the speech sample as truthful or deceptive. The game was designed so that players could submit their decision only after the audio had finished playing, so they could not make a judgment without hearing the full response. This feature of the game also provided information about raters’ behavior when making judgments, as we recorded the time interval between the end of the audio clip and the time that the player entered their response for each decision. After the gameplay, a score report is displayed summarizing all their judgments for that task, giving players feedback about their performance at the end of each multi-question task.
4.1 Crowdsourcing Experiment
We used the game to collect deception judgments via crowd-sourcing on AMT. On AMT we first vetted potential raters by giving them a language background questionnaire and restricting raters to those who had spoken English fluently since the age of 5 years. In the game, each player was shown a series of 13 questions, one at a time, with the audio recording of the interviewee response. Audio samples were balanced by gender, native language of speaker, and question number (there were no duplicate questions within a game), with half of the responses true and half false. For quality control, we included a randomly placed check question instructing the annotator to select a certain answer for that question (e.g., “wait 5 seconds and then press False”) to help ensure that raters were actually paying attention to the game with their audio on. Annotators were also given a post-game survey including questions on previous experience in jobs related to deception detection, their gender, their own confidence level in spotting lies, and the strategies they used in making judgments. We manually filtered out annotators who answered the check question incorrectly or who failed to finish the game or survey. We obtained institutional review board approval for our deception perception study and followed all human subject protection guidelines.
Each response was rated by three annotators and each annotator was limited to a maximum of 10 total tasks of 13 questions each. In total, 5340 utterances were annotated by 431 total annotators; 4.8% of the raters said they had had previous experience in law enforcement. In our sample, 38.9% of the annotators self-identified as male, 59.1% female, and 2.1% other. On average, annotators judged 49.93% of the utterances correctly, roughly at chance. In cases where all three annotators agreed on a judgment, the accuracy was 50.75%, slightly higher than the overall accuracy but still at chance. This is consistent with decades of research in deception detection (Bond Jr and DePaulo, 2006).
4.2 Inter-annotator Agreement
We used Fleiss’ kappa to measure inter-annotator agreement on whether an utterance was truthful or deceptive. The annotators had a Fleiss’ kappa of 0.135, indicating slight agreement as Landis and Koch (1977) suggests, showing that this task is highly subjective. We also computed inter-annotator agreement across utterances from female vs. male speakers and from native English versus native Mandarin speakers. We found only slight agreement (Fleiss’ kappa in the range (0.10, 0.15)) across all speaker traits, indicating that people did not agree more on speakers with certain traits. Lastly, we considered whether inter-annotator agreement might be affected by utterance length, since annotators might find it more difficult to judge short utterances (e.g., one word utterances like “yes” or “no”) for lack of sufficient information. However, we found that agreement was uniformly low across all utterance lengths and that longer utterance length tended to result in even lower agreement (Fleiss’ kappa: 0.138 for length ≤ 5 words vs. 0.067 for length ≥ 30 words).
We plotted the distribution of trust levels over all responses (Figure 2) and found that the distribution is skewed toward trust, indicating that annotators tend to be more trusting than mistrusting, with 33% of utterances trusted by all annotators and 70% trusted by at least two, consistent with the Truth Default Theory (Levine, 2014), which posits that humans operate on a default presumption that others are basically honest.
5 Textual and Prosodic Indicators of Trust and Deception
In this section, we consider the following questions: What are the characteristics of trusted and mistrusted speech? How do these compare with the characteristics of truthful and deceptive speech? Of all features raters believed to indicate lies, which are valid cues and which not? Also, what are the deceptive cues that raters failed to perceive? We compared features of trusted and mistrusted utterances and features of truthful and deceptive utterances using paired t-tests. Labels for trust were computed using majority vote (i.e., an utterance is considered trusted if at least 2 annotators believed it is true, otherwise mistrusted). We also compared utterances trusted by all and mistrusted by all, observing differences in complexity and prosodic features. For complexity features, #verbs, #nouns, #num, concreteness ceased to be significant, and type-token became significant with p < 0.05. For prosodic features, pitch mean and pitch std ceased to be significant, and intensity max became significant with p < 0.05. Notice that all features that differ had relative small significance levels p > 0.001. To prevent the inflation of false positive errors caused by conducting multiple comparisons, we present only features that are statistically significant after Benjamini–Hochberg correction (Benjamini and Hochberg, 1995).
Disfluency
Social psychologists hypothesize that telling a lie can be more cognitively demanding than truth-telling (Hauch et al., 2015; M DePaulo et al., 2003). False responses are hypothesized to be less fluent than true responses because fabricating a story takes more mental effort than recalling an actual event. We considered a wide range of features indicative of disfluencies and report those that were statistically significant in either the responses raters labeled as lies or those that actually appear in lying responses in the corpus:
Filled Pauses:
Response Latency:
For deception, this is defined as the time span between the interviewer question and the first non-filler word of the interviewee response; for trust, it is defined as the time span between the start of the audio and the first non-filler word of the interviewee response. Note that latency is defined differently for trust since annotators only heard the segmented version of the response which did not include the silence between the question and the response.
False Starts:
A type of speech disfluency where a speaker begins an utterance or a phrase and then self-corrects it; annotation of disfluencies was included in the corpus transcription.
Repetitions:
The number of identical, consecutive words or bigrams (e.g., ‘‘he he has a...’’).
Table 2 shows which features appeared in responses raters believed to be deceptive (Column Trust) and which appeared in responses that actually were deceptive (Column Deception).
Feature name . | Trust . | Deception . |
---|---|---|
has filled pause | ↓↓↓↓ | ↑↑↑↑ |
#filled pauses | ↓↓↓↓ | ↑↑↑↑ |
has false start | ↓↓↓ | ↑↑ |
response latency | ↓↓↓↓ | |
repetitions | ↓↓↓↓ | ↑ |
Feature name . | Trust . | Deception . |
---|---|---|
has filled pause | ↓↓↓↓ | ↑↑↑↑ |
#filled pauses | ↓↓↓↓ | ↑↑↑↑ |
has false start | ↓↓↓ | ↑↑ |
response latency | ↓↓↓↓ | |
repetitions | ↓↓↓↓ | ↑ |
Overall, our findings are consistent with Zuckerman et al. (1981): Speech hesitations and errors are perceived as signs for incompetence and cues for deception. Of all the disfluency features, filled pauses proved to be the strongest reliable indicator of deception and this feature was also perceived correctly by raters. For raters, response latency was the strongest cue for mistrust among all disfluency features; however, this was not a reliable indicator of lies. This is also consistent with the findings of Zuckerman et al. (1981) and of Hartwig and Bond (2011). Word and bigram repetition, as well as false starts, were traits of speech raters mistrusted, even though they were only weak cues to deception.
Complexity
Previous research suggests that deceptive statements tend to be simpler and less complex than true ones. This is because of the theory that cognitive load is increased during deception, which can limit creative and complex utterance production (Hauch et al., 2015; M DePaulo et al., 2003; Hartwig and Bond, 2011). Based on these findings, one might expect lies to be less lexically diverse, shorter, and less elaborate than true responses. Do raters appear to use these cues in judging deception? We used utterance length as the most direct indicator of response complexity and also counted the number of words with more than six characters and the number of content words in the utterance as a more fine-grained indication of complexity. We used type-token ratio to capture lexical diversity. We also considered word entropy but found it to be strongly correlated with number of words so decided not to include it as a separate feature. In addition, we used Flesch reading ease (Kincaid et al., 1975) to identify readability, specificity score (Li and Nenkova, 2015) as an indication of the level of detail on the sentence level, and concreteness score (Brysbaert et al., 2014) as an indicator of the level of details of the speakers’ visual and haptic experiences. We also used discourse markers (causation and conjunction) extracted from LIWC (Pennebaker and King, 1999), inspired by the hypothesis that liars might use fewer discourse markers in their utterances (Newman et al., 2003).
As shown in Table 3, overall, raters were more likely to judge longer and more complex responses as deceptive.
Features . | Trust . | Deception . |
---|---|---|
#sent | ↓↓↓↓ | ↑↑↑↑ |
#word | ↓↓↓ | ↑↑↑↑ |
#word per sent | ↓↓↓ | ↑↑↑↑ |
#word>6 | ↓↓ | ↑↑↑↑ |
type-token | ↑↑↑ | |
#verb | ↓↓ | ↑↑↑↑ |
#noun | ↓↓ | ↑↑↑↑ |
#adj | ↑↑↑↑ | |
#num | ↓ | ↑↑↑ |
#proper nouns | ↓ | |
concreteness | ↓↓ | ↑↑↑↑ |
specificity | ↓↓ | ↑↑↑↑ |
#conj | ↑↑ |
Features . | Trust . | Deception . |
---|---|---|
#sent | ↓↓↓↓ | ↑↑↑↑ |
#word | ↓↓↓ | ↑↑↑↑ |
#word per sent | ↓↓↓ | ↑↑↑↑ |
#word>6 | ↓↓ | ↑↑↑↑ |
type-token | ↑↑↑ | |
#verb | ↓↓ | ↑↑↑↑ |
#noun | ↓↓ | ↑↑↑↑ |
#adj | ↑↑↑↑ | |
#num | ↓ | ↑↑↑ |
#proper nouns | ↓ | |
concreteness | ↓↓ | ↑↑↑↑ |
specificity | ↓↓ | ↑↑↑↑ |
#conj | ↑↑ |
Contrary to previous research, we found that lies tended to be more complex than true utterances: they tended to be longer, included more specific Language, and were more lexically diverse. They were also more concrete and contained higher numbers of verbs, nouns, adjectives, numbers, and conjunctions. Although raters were apparently using these cues to predict lies, they were relatively weak ones.
Sentiment
When lying, people may experience feelings of guilt and fear of being caught, which may result in their use of more negative words (M DePaulo et al., 2003; Hauch et al., 2015; Ekman, 1988, 2009b). Abe et al. (2007) also found that the act of deceiving is uniquely associated with neural structures associated with heightened emotion. We extracted positive emotion and negative emotion using LIWC (Pennebaker and King, 1999). We also extracted Pleasantness, Activation, and Imagery scores for each utterance from the Dictionary of Affect (DAL) (Whissell, 1989) by summing up the scores of all words. We normalized all features to reduce length effect.
As shown in Table 4, truthful utterances in the corpus contained more visually descriptive words than deceptive utterances, and listeners were more apt to rate utterances with descriptive words as truthful. This is consistent with findings in Masip et al. (2005) that providing sensory details is more difficult when fabricating a story. Truthful utterances also contained words with higher activation scores than deceptive utterances, and trusted utterances also had higher activation scores than mistrusted ones. Consistent with Hartwig and Bond (2011), raters judged more pleasant utterances as truthful, although this was not a valid cue in the CXD corpus.
Uncertainty
Our previous research (Levitan et al., 2018b) has found that linguistic markers of certainty and uncertainty are significant indicators of deception. So we measured certainty and uncertainty in two ways: words from LIWC’s “certainty” category as linguistic markers of certainty (e.g., always, never) and hedge words and phrases (e.g., possible, sort of) (Ulinski et al., 2018) as indicators of uncertainty. As shown in Table 5, there was a match between rater trust and true responses for hedge words and phrases. In the CXD corpus, lies included hedge phrases more often than true responses did, and we found that listeners did mistrust responses containing hedge phrases. However, although linguistic markers of certainty in the corpus (e.g., “always,” “never”—which are the opposite of hedge words) were indicators of truth, raters failed to perceive this.
Creativity
Do liars tend to rely upon certain “templates” or generic responses when answering questions for lack of a more detailed story to present? Do truth-tellers provide more creative responses based on reality? To measure creativity of responses, we examined how similar a response was to other responses to the same question. For each question, we converted all responses to TF-IDF vectors on unigrams and bigrams. We built a lexical graph for each question with responses as nodes and cosine similarities between TF-IDF vectors as edge weights. Then we computed the eigenvalue centrality for each node and used its negative value as the measure of creativity. The intuition here is that the more central a response is, the more similar it is to its neighbors and thus less ”creative.”
We found liars to be more creative than truth-tellers. We verified this result by counting the number of neighbors within a certain cosine distance in the TD-IDF space. This result is robust against various threshold of cosine distance (0.1–0.9 with 0.1 as the step size). The difference was not due to response length, as we found no correlation between creativity and response length (spearman, ρ = 0.007, p > 0.05). However, we did find that judgments were not influenced by whether the response was creative or not. Perhaps when people lie they try to tell a compelling story, which results in a more creative response regardless of length.
Prosody
Previous studies have shown that pitch maximum and intensity maximum are significant indicators of deception (Levitan et al., 2018a). We examined whether prosodic features impacted listeners’ trust. We extracted a set of 14 features from Praat (Boersma and Weenink, 2009), an open-source audio processing toolkit, and z-score normalized the features by gender. We used the total number of words divided by duration of utterance as a measure of speaking rate. As shown in Table 6, raters judged speech that was loud (high intensity min and mean) and had less variation in intensity (low std) as trustworthy, perhaps because louder speech can sound more confident. Trusted speech also had higher degrees of jitter, shimmer, and noise to harmonics ratio (NHR), which are measures of voice quality. Though not a valid indication of truth, faster speaking rate was also trusted, perhaps because raters expected speakers to speak more slowly when lying. This is consistent with previous findings that, while faster speaking rate is trusted, it is not an actual cue to trustworthiness (Zuckerman et al., 1981; Hartwig and Bond, 2011). Listeners also trusted speech with higher pitch (max, min, and mean) and greater pitch variance (std). However, a higher pitch max and greater pitch std were in fact signs of deception.
Features . | Trust . | Deception . |
---|---|---|
speaking rate | ↑↑↑↑ | |
pitch max | ↑↑↑↑ | ↑↑↑↑ |
pitch min | ↑↑↑↑ | ↓↓ |
pitch mean | ↑↑ | |
pitch std | ↑↑ | ↑↑ |
intensity max | ↑↑↑ | |
intensity min | ↑↑↑↑ | ↓ |
intensity mean | ↑↑↑↑ | |
intensity std | ↓↓↓↓ | ↑ |
NHR | ↑↑↑↑ | |
jitter | ↑↑↑↑ | |
shimmer | ↑↑↑↑ |
Features . | Trust . | Deception . |
---|---|---|
speaking rate | ↑↑↑↑ | |
pitch max | ↑↑↑↑ | ↑↑↑↑ |
pitch min | ↑↑↑↑ | ↓↓ |
pitch mean | ↑↑ | |
pitch std | ↑↑ | ↑↑ |
intensity max | ↑↑↑ | |
intensity min | ↑↑↑↑ | ↓ |
intensity mean | ↑↑↑↑ | |
intensity std | ↓↓↓↓ | ↑ |
NHR | ↑↑↑↑ | |
jitter | ↑↑↑↑ | |
shimmer | ↑↑↑↑ |
Of the 11 prosodic cues of mistrust, only 3 (27%) are actually valid indicators of deception; of the 6 prosodic cues of deception, 3 (50%) are also valid indicators of mistrust. This is in contrast to the overlap we see across all features reported. Of the 31 cues that are significant indicators of mistrust, 20 (65%) are also valid indicators of deception; of the 28 cues that are indicative of deception in the data, 20 (71%) are actually indicative of mistrust. This suggests that, although there were several characteristics of trusted speech that were in fact associated with truth, and also many characteristics of mistrusted speech that were associated with deception, prosodic cues to deception are far more difficult for humans to correctly perceive than other cues.
5.1 Can We Predict Trust and Deception?
Based upon the analysis of the linguistic and prosodic characteristics of trusted and deceptive utterances above, we developed predictive models of trust and deception to identify the relative strengths of each type of feature. In addition, we included additional speaker traits (gender, native language, personality), which have also been shown to identify significant differences in speaker trust (Levitan et al., 2018a). We observed several differences in trust behavior across these speaker traits.
- •
Gender: We observed a gender difference in trust (χ2(1) = 5.16, N = 5340, p < 0.05) with female speakers trusted (71.50% of all utterances) more than males (68.61%).
- •
Native language: We observed a native language difference in speaker trust by raters (χ2(1) = 30.22, N = 5340, p < 0.0001) with native American English speakers trusted (73.52% of all utterances) more than native Chinese speakers (66.59%).
- •
Personality: We partitioned CXD speakers into the NEO-FFI Five Factor personality groups by binning personality scores into ”high,“ ”average,“ and ”low” in each dimension as described in Levitan (2019). We observed significant differences in speakers’ responses trusted by raters in the following dimensions:
- –
Conscientiousness: Speakers with low scores (71.91%) were more trusted than people with neutral (69.12%) or high scores (67.66%). (χ2(2) = 7.22, N = 5340, p < 0.05)
- –
Openness: Speakers with high scores (71.55%) were more trusted than people with neutral (68.64%) or low scores (67.40%). (χ2(2) = 6.40, N = 5340, p < 0.05)
- –
Neuroticism: Speakers with high scores (71.75%) were more trusted than people with neutral (68.65%) or low scores (66.48%). (χ2(2) = 8.93, N = 5340, p < 0.05)
- –
In addition, we included a large number of data-driven features extracted from the spoken utterances and from their text transcripts. In contrast to the features that were analyzed in Section 5, which were specifically motivated by the deception detection literature, these data-driven features were chosen because of their usefulness for a wide range of NLP and speech processing tasks. Data-driven features included: dependency triples backed off to parts of speech; one-hot encoded unigrams and bigrams; average of word vectors using GloVe embedding pretrained on Twitter; Interspeech 2013 (IS13) ComParE Challenge baseline feature set, which contains 6373 features resulting from the computation of functionals over low-level descriptor contours extracted from openSMILE (Eyben et al., 2010). The prosodic feature analysis in Section 5 was conducted on a small set of prosodic features extracted using Praat. These were a subset of the openSMILE feature set and were used for clarity and interpretability, while the openSMILE features used here are a larger set for classification experiment.
Because of the low agreement of the annotations, we took an approach similar to Danescu-Niculescu-Mizil et al. (2013) and considered only utterances that annotators reached consensus on for experiments on trust. In total, 1762 utterances were trusted by all annotators and 427 utterances were mistrusted by all annotators. Due to the small size of this dataset, we randomly divided the speakers into five bins of similar size and performed cross-validation with utterances from speakers in each bin as the test data and the rest as the training data; thus, learned models were always evaluated on unseen speakers. On each speaker split, we trained logistic regression models and tuned penalty parameter C and the type of regularization using five cross-validation folds on the training data. We also experimented with linear SVM, gradient boosting classifier, and random forest classifier and found no improvement. We normalized all features to have zero mean and unit- variance. For data-driven features, feature selection was performed to prune the feature space. Because this is an unbalanced task, we evaluated our models using precision, recall, and macro-F1 score.
As shown in Table 7, we can predict whether an utterance is trusted with an average macro-F1 score of 66.10%. We found that 5 disfluency features combined with 16 prosody features outperformed 9538 data-driven features, demonstrating the efficacy of the feature sets specifically designed for this task. Prosody (macro-F1, 60.20%) is the strongest feature set that is predicative of trust, with intensity mean as the strongest feature (45.50%) within that set. Disfluency (57.28%) is the next strongest feature set, with response latency (58.40%) as the strongest feature. Of the individual feature sets, we found that speaker traits, sentiment, uncertainty, and creativity did not perform better than the random baseline, which suggested they are not useful cues on their own for predicting rater trust. However, we found that including speaker traits helped improving the classifier’s performance when combined with other features.
Feature Sets . | Precision . | Recall . | macro-F1 . |
---|---|---|---|
random | 50.19 | 50.29 | 44.97 |
majority | 40.27 | 50.00 | 44.61 |
data-driven (9538) | 73.45 | 59.72 | 61.51 |
disfluency (5) | 76.83 | 56.73 | 57.28 |
+ prosody (16) | 78.90 (80.14) | 60.52 (58.66) | 62.74 (60.20) |
+ sentiment (5) | 78.75 (40.27) | 61.19 (50.00) | 63.61 (44.61) |
+ uncertainty (3) | 78.11 (40.27) | 61.80 (50.00) | 64.36 (44.61) |
+ creativity (1) | 77.94 (40.27) | 61.90 (50.00) | 64.48 (44.61) |
+ complexity (16) | 77.71 (55.86) | 62.24 (50.31) | 64.87 (45.77) |
+ speaker traits (7) | 77.55 (40.27) | 63.34 (50.00) | 66.10 (44.61) |
all (9591) | 74.26 | 60.34 | 62.37 |
Feature Sets . | Precision . | Recall . | macro-F1 . |
---|---|---|---|
random | 50.19 | 50.29 | 44.97 |
majority | 40.27 | 50.00 | 44.61 |
data-driven (9538) | 73.45 | 59.72 | 61.51 |
disfluency (5) | 76.83 | 56.73 | 57.28 |
+ prosody (16) | 78.90 (80.14) | 60.52 (58.66) | 62.74 (60.20) |
+ sentiment (5) | 78.75 (40.27) | 61.19 (50.00) | 63.61 (44.61) |
+ uncertainty (3) | 78.11 (40.27) | 61.80 (50.00) | 64.36 (44.61) |
+ creativity (1) | 77.94 (40.27) | 61.90 (50.00) | 64.48 (44.61) |
+ complexity (16) | 77.71 (55.86) | 62.24 (50.31) | 64.87 (45.77) |
+ speaker traits (7) | 77.55 (40.27) | 63.34 (50.00) | 66.10 (44.61) |
all (9591) | 74.26 | 60.34 | 62.37 |
In addition to classifying trust, we trained a logistic regression model using these same features to classify deception. Our best model achieved an F1 score of 55.5% using a combination of disfluency, prosody, and uncertainty features. We note that this performance is substantially lower than the trust classification results, suggesting that distinguishing between truthful and deceptive utterances is much harder than distinguishing between utterances trusted and mistrusted by all annotators. However, we also note that this model was not optimized for deception classification (e.g., no feature selection or parameter tuning) and was trained on a very small amount of data. Our previous work obtained better performance at deception detection (69.8 F1) using more data and using models that were optimized for the task (Levitan, 2019). The purpose of this current experiment was to directly compare trust and deception classification using the same data and features.
Figures 3 and 4 show the most important 10 features for predicting trust and deception. To compute feature importance, we averaged the absolute values of the feature weights across the 5 cross-validation splits. We found that DAL-pleasant, has filled pauses, intensity mean, and type-token are important for both prediction tasks. Of the top three features in either task, speaking rate and response latency are only important for predicting trust, and DAL-activation and creativity are only important for predicting deception.
The most important 10 features for predicting trust. The x-axis denotes the absolute values of the feature weights averaged over 5 cross-validation splits.
The most important 10 features for predicting trust. The x-axis denotes the absolute values of the feature weights averaged over 5 cross-validation splits.
The most important 10 features for predicting deception. The x-axis denotes the absolute values of the feature weights averaged over 5 cross-validation splits.
The most important 10 features for predicting deception. The x-axis denotes the absolute values of the feature weights averaged over 5 cross-validation splits.
5.2 The Mechanism Behind a Successful Lie
To better understand why people are “vulnerable” to deceptive utterances, and to understand the characteristics of successful lies, we ran paired t-tests with Benjamini–Hochberg correction to compare the linguistic and prosodic features of successful vs. unsuccessful lies. As shown in Table 8, we found that successful lies differed most from unsuccessful lies in that they contained fewer sentences (1) and were shorter in duration (2).
. | Features . | Successful? . |
---|---|---|
1. | #sent | ↓↓↓↓ |
2. | response latency | ↓↓↓↓ |
3. | has filled pause | ↓↓↓↓ |
4. | #filled pauses | ↓↓↓↓ |
5. | repetitions | ↓↓↓↓ |
6. | intensity min | ↑↑↑↑ |
7. | intensity mean | ↑↑↑↑ |
8. | intensity std | ↓↓↓↓ |
9. | speaking rate | ↑↑↑↑ |
10. | shimmer | ↑↑↑↑ |
. | Features . | Successful? . |
---|---|---|
1. | #sent | ↓↓↓↓ |
2. | response latency | ↓↓↓↓ |
3. | has filled pause | ↓↓↓↓ |
4. | #filled pauses | ↓↓↓↓ |
5. | repetitions | ↓↓↓↓ |
6. | intensity min | ↑↑↑↑ |
7. | intensity mean | ↑↑↑↑ |
8. | intensity std | ↓↓↓↓ |
9. | speaking rate | ↑↑↑↑ |
10. | shimmer | ↑↑↑↑ |
Successful lies were also louder (6, 7), faster (9), had fewer filled pauses (3, 4), varied less in intensity (8), and were harsher (10) in voice quality. When people were more successful at lying, they tended to respond quicker (2) and did not repeat themselves (5).
We also conducted this analysis for classifier judgments, to understand the characteristics of lies that successfully deceived a lie detection classifier. Table 9 shows the top features that discriminate between successful and unsuccessful lies. There are several features that were similar for both human and machine judgments. For example, successful lies had fewer filled pauses (2, 3) and were shorter in duration (5) and number of sentences (9). Some features, however, were unique to deceiving a classifier. For example, lies that successfully deceived the deception classifier were less creative (1) and less specific (4). It seems that different kinds of lies were successful at deceiving humans and an automated deception classifier.
. | Features . | Successful? . |
---|---|---|
1. | creativity | ↓↓↓↓ |
2. | has filled pause | ↓↓↓↓ |
3. | #filled pauses | ↓↓↓↓ |
4. | specificity | ↓↓↓↓ |
5. | duration | ↓↓↓↓ |
6. | pitch max | ↓↓↓↓ |
7. | #word | ↓↓↓↓ |
8. | #word per sent | ↓↓↓↓ |
9. | #sent | ↓↓↓↓ |
10. | concreteness | ↓↓↓↓ |
. | Features . | Successful? . |
---|---|---|
1. | creativity | ↓↓↓↓ |
2. | has filled pause | ↓↓↓↓ |
3. | #filled pauses | ↓↓↓↓ |
4. | specificity | ↓↓↓↓ |
5. | duration | ↓↓↓↓ |
6. | pitch max | ↓↓↓↓ |
7. | #word | ↓↓↓↓ |
8. | #word per sent | ↓↓↓↓ |
9. | #sent | ↓↓↓↓ |
10. | concreteness | ↓↓↓↓ |
5.3 Individual Differences in Lie Detection Ability
We found that gender was not significantly related to accuracy (Mann–Whitney U, p > 0.05), but female participants took longer to judge (Mann–Whitney U, p < 1e-20). In addition, we did not observe female participants to be more or less trusting than male participants (Mann–Whitney U, p > 0.05) or significantly different from male participants in their level of confidence (Mann–Whitney U, p > 0.05). Raters with previous job experience related to lie detection did not perform better than those without such experience (1-sample t-test, p > 0.05), but they did take longer to make judgments (1-sample t-test, p < 0.05). They were at the same level of trust (1-sample t-test, p > 0.05) and same level of confidence (1-sample t-test, p > 0.05) as those without prior experience. This is consistent with previous findings that persons in the legal professions are no better at detecting deception than others (Aamodt and Custer, 2006).
6 Characterizing Strategies for Detecting Lies
We summarized annotator-provided strategies for detecting lies based on previous work in deception detection (Hauch et al., 2015; M DePaulo et al., 2003; Albrechtsen et al., 2009; Blair et al., 2010; Vrij et al., 2006) and annotators’ responses. The strategies were manually labeled by domain experts with previous research experience in deception detection and all ambiguities were discussed by three people. For each strategy, we computed the average percentage of utterances judged correctly by annotators who reported using this strategy and compared it with the average percentage of utterances judged correctly across all annotators. We performed the same analysis for the percent of utterances the raters believed to be true (trusted). In addition, we reported the percentage of annotators who claimed to have used these strategies in Table 10.
Strategy . | %Correct . | %Trust . | %Used . | Example . |
---|---|---|---|---|
Prosody | − 0.25 | − 0.17 | 45.74% | voice tone and pattern |
Response latency | + 0.11 | − 2.13** | 30.71% | listened for delays in the speakers response |
Pauses | − 0.52 | − 2.95** | 24.66% | I listened for pauses to see... |
Disfluency | − 0.59 | − 1.88* | 22.87% | If they said “um” I thought they were lying |
Intuition | + 1.09 | + 0.52 | 22.87% | My gut instinct... |
Details | + 0.81 | − 2.95* | 17.26% | ...how much or how little detail they used... |
Prior | + 1.95 | + 2.85* | 13.90% | How realistic the answers were |
Style | − 0.65 | + 0.86 | 11.88% | Anxiety in voice |
Confidence | − 2.83* | − 1.60 | 11.21% | paying attention to the person’s confidence.. |
Duration | − 0.94 | − 2.80 | 9.41% | length of answer |
Speaking rate | + 0.39 | − 0.64 | 6.72% | Speed of answer |
Speaker traits | + 0.07 | − 0.00 | 6.05% | how relaxed they were |
Lexical | + 1.53 | + 1.00 | 5.16% | Look for context around the words |
Laughter | + 1.04 | + 0.40 | 1.79% | if they laugh its false |
Clarity | + 2.52 | + 9.71* | 1.35% | People usually give more and clearer details... |
Breathing | + 5.33 | − 2.73 | 1.12% | I tried to notice when they breathe so deeply.. |
Repeat question | + 0.36 | + 6.10 | 0.67% | I id notice one person repeat the question.. |
Contradictions | + 0.04 | + 1.24 | 0.67% | ...the person blatantly contradicted themselves... |
Repetition | + 1.24 | + 6.52 | 0.44% | repetition when lying |
Strategy . | %Correct . | %Trust . | %Used . | Example . |
---|---|---|---|---|
Prosody | − 0.25 | − 0.17 | 45.74% | voice tone and pattern |
Response latency | + 0.11 | − 2.13** | 30.71% | listened for delays in the speakers response |
Pauses | − 0.52 | − 2.95** | 24.66% | I listened for pauses to see... |
Disfluency | − 0.59 | − 1.88* | 22.87% | If they said “um” I thought they were lying |
Intuition | + 1.09 | + 0.52 | 22.87% | My gut instinct... |
Details | + 0.81 | − 2.95* | 17.26% | ...how much or how little detail they used... |
Prior | + 1.95 | + 2.85* | 13.90% | How realistic the answers were |
Style | − 0.65 | + 0.86 | 11.88% | Anxiety in voice |
Confidence | − 2.83* | − 1.60 | 11.21% | paying attention to the person’s confidence.. |
Duration | − 0.94 | − 2.80 | 9.41% | length of answer |
Speaking rate | + 0.39 | − 0.64 | 6.72% | Speed of answer |
Speaker traits | + 0.07 | − 0.00 | 6.05% | how relaxed they were |
Lexical | + 1.53 | + 1.00 | 5.16% | Look for context around the words |
Laughter | + 1.04 | + 0.40 | 1.79% | if they laugh its false |
Clarity | + 2.52 | + 9.71* | 1.35% | People usually give more and clearer details... |
Breathing | + 5.33 | − 2.73 | 1.12% | I tried to notice when they breathe so deeply.. |
Repeat question | + 0.36 | + 6.10 | 0.67% | I id notice one person repeat the question.. |
Contradictions | + 0.04 | + 1.24 | 0.67% | ...the person blatantly contradicted themselves... |
Repetition | + 1.24 | + 6.52 | 0.44% | repetition when lying |
Prosody, response latency, pauses, disfluency, and intuition were the top five strategies mentioned by annotators, and in Section 5 we verified that the set of features related to prosody, response latency, pauses, and disfluency were indeed significant indications of trust. As shown in Table 10, none of the reported strategies was associated with an improved deception detection performance. However, we did find that using speaker “confidence” as a cue to deception was negatively associated with the annotators’ performance.
Which strategies are reported by raters who are more or less trusting over all?
We found that people who reported using response latency, pauses, and disfluency when judging deception trusted a smaller percentage of utterances. This could be because of the high prevalence of disfluencies in spontaneous speech, regardless of whether the utterance was deceptive or not. We also found that raters who used the level of detail in a response as a cue were more mistrusting in their judgments. Conversely, those who used clarity of a response and prior domain knowledge were more trusting.
Does complex reasoning correlate with accuracy in lie detection or trust level?
We examined two measures of player behavior that approximate complex reasoning: How long do people take to make judgments? How many strategies do they report in total?
We measured how long people took to judge responses using the time interval between the end of the audio clip and the time that annotator entered his/her response. We found no correlation between response time and the percentage of answers correct. However, we did find a negative correlation between response time and the percentage of answers trusted (spearman, ρ = −0.101, p < 0.0001). We found a similar result using the number of strategies as a proxy for complex reasoning. There was no correlation between the number of strategies reported and the accuracy score, but we did discover a negative correlation between the percentage of answers trusted and the number of strategies raters reported using (spearman, ρ = −0.133, p < 0.01). These findings indicate that complexity of reasoning process does not correlate with lie detection performance but negatively correlates with trust level.
7 Conclusion
In this paper we presented a framework for understanding human deception perception. We created a lie detection game, LieCatcher, and used it to collect large-scale judgments of deceptive speech. We analyzed a large set of linguistic and prosodic cues to deception and identified some mismatches between the responses people perceived as deceptive and those that were actual deceptive responses. Particularly notable in these mismatches were prosodic features, suggesting that humans have difficulty interpreting prosodic cues to deception.
We built a predictive model of trust with a macro-F1 score of 66.1%, and showed that disfluencies and prosody were most useful for predicting trust. We summarized and manually annotated annotator-provided strategies and found that none of them were associated with an improvement in lie detection ability; however, some were associated with raters’ tendency to trust. The identified mismatches between features of trusted vs. deceptive speech, as well as the lack of useful strategies reported by raters, shed new light on the poor performance of humans at deception detection. In addition, we showed that complex reasoning did not correlate with accuracy in deception detection but negatively correlated with trust level.
This work has implications for several applications in multiple disciplines. In business, politics, and interpersonal relationships, it is critical to cultivate the trust of others. Our empirically identified characteristics of trusted language provide useful information for training individuals who want to speak in a more trustworthy manner. Furthermore, we are interested in using these findings to synthesize voices that are likely to be trusted by others and we have already begun that process. Potential applications that can benefit from trustworthy voices include dialogue systems and robots, especially for assistive technologies (e.g., for individuals with disabilities, elderly individuals) where trust is crucial for successful interactions. Our LieCatcher game was a useful framework for studying perceived deception in an engaging format. The experiments presented in this paper were conducted using stimuli from the CXD corpus. In future experiments, we plan to conduct a cross domain analysis to see if these findings generalize to other domains and corpora. Because of the lack of corpora with annotations of human perceptions of deception, we plan to conduct similar perception studies using the LieCatcher framework to enable this cross-domain analysis. In addition, the LieCatcher game can be extended to explore the perception of other aspects of spoken language. We are currently exploring its use for training purposes. In the future, we will provide more immediate feedback to players about their judgments for each response, with the goal of training practitioners to improve their performance at deception detection.
Acknowledgments
This work was funded by AFOSR grant FA9550-18-1-0039, “Spoken Indicators of Trust Across Cultures.” We thank the anonymous reviewers for their helpful feedback. Thanks also to James Shin, Ivy Chen, and William Wang (ww2742) for their help implementing LieCatcher.
Notes
The LieCatcher game framework is publicly available at https://github.com/sarahita/LieCatcherGame.