Table 2: Model performance (accuracy) on the idiom and simile discriminative tasks. * Difference is significant (α < 0.07) between the supervised and knowledge-enhanced models via t-test.

Method              Model        Idiom   Simile
Majority                         50.0    50.8
Zero-shot           GPT2-XL      53.6    53.7
                    GPT3         60.2    62.4
                    UnifiedQA    67.7    60.6
Few-shot            GPT3         54.1    51.7
                    PET          66.1    55.2
Supervised          RoBERTa      82.0    80.4
                    -narrative   65.0    67.9
Knowledge Enhanced  Context      82.8    79.9
                    Literal      83.5    80.6
Human Performance                92.0    95.0