Model performance (accuracy) on the idiom and simile discriminative tasks. * Difference is significant (α < 0.07) between the supervised and knowledge-enhanced models via t-test.
| Method | Model | Idiom | Simile |
|---|---|---|---|
| Majority | | 50.0 | 50.8 |
| Zero-shot | GPT2-XL | 53.6 | 53.7 |
| | GPT3 | 60.2 | 62.4 |
| | UnifiedQA | 67.7 | 60.6 |
| Few-shot | GPT3 | 54.1 | 51.7 |
| | PET | 66.1 | 55.2 |
| Supervised | RoBERTa | 82.0 | 80.4 |
| | -narrative | 65.0 | 67.9 |
| Knowledge Enhanced | Context | 82.8 | 79.9 |
| | Literal | 83.5* | 80.6 |
| Human Performance | | 92.0 | 95.0 |