Percentage of times that the generations from each model and the human-written references were chosen as plausible (absolute) or preferred (comparative) by the majority of workers.
| Model | Absolute: Idiom | Absolute: Simile | Comparative: Idiom | Comparative: Simile |
|---|---|---|---|---|
| GPT2-XL | 56 | 60 | 15 | 18.6 |
| +Context | 68 | 68 | 45 | 16 |
| +Literal | 48 | 76 | 13 | 46.7 |
| Human | 80 | 88 | − | − |
| All | − | − | 8 | 12 |
| Neither | − | − | 17 | 6.7 |