Table 3: Percentage accuracy of four baseline models and raw human performance on BLiMP using a forced-choice task. A random guessing baseline would achieve an accuracy of 50%.
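| Model | Overall | Ana. agr | Arg. str | Binding | Ctrl. rais. | D-n agr | Ellipsis | Filler. gap | Irregular | Island | NPI | Quantifiers | S-v agr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5-gram | 61.2 | 47.9 | 71.9 | 64.4 | 68.5 | 70.0 | 36.9 | 60.2 | 79.5 | 57.2 | 45.5 | 53.5 | 60.3 |
| LSTM | 69.8 | 91.7 | 73.2 | 73.5 | 67.0 | 85.4 | 67.6 | 73.9 | 89.1 | 46.6 | 51.7 | 64.5 | 80.1 |
| TXL | 69.6 | 94.1 | 69.5 | 74.7 | 71.5 | 83.0 | 77.2 | 66.6 | 78.2 | 48.4 | 55.2 | 69.3 | 76.0 |
| GPT-2 | 81.5 | 99.6 | 78.3 | 80.1 | 80.5 | 93.3 | 86.6 | 81.3 | 84.1 | 70.6 | 78.9 | 71.3 | 89.0 |
| Human | 88.6 | 97.5 | 90.0 | 87.3 | 83.9 | 92.2 | 85.0 | 86.9 | 97.0 | 84.9 | 88.1 | 86.6 | 90.9 |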
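
The accuracies reported in this table come from a forced-choice comparison over minimal pairs. As a minimal sketch of how such a score could be computed, assume each model assigns a score (for example, a log-probability) to both sentences of a pair and is counted correct when the acceptable sentence receives the strictly higher score. The function name `forced_choice_accuracy` and the example scores below are hypothetical and are not taken from the BLiMP release.

```python
from typing import Sequence, Tuple


def forced_choice_accuracy(pairs: Sequence[Tuple[float, float]]) -> float:
    """Return the percentage of minimal pairs judged correctly.

    Each pair is (score_acceptable, score_unacceptable). A pair counts as
    correct only when the acceptable sentence scores strictly higher.
    """
    if not pairs:
        raise ValueError("need at least one minimal pair")
    correct = sum(1 for good, bad in pairs if good > bad)
    return 100.0 * correct / len(pairs)


# Hypothetical log-probabilities, made up purely for illustration.
example_pairs = [
    (-12.3, -15.1),  # acceptable sentence preferred: correct
    (-20.4, -19.8),  # unacceptable sentence preferred: incorrect
    (-8.7, -9.9),    # correct
    (-14.0, -16.2),  # correct
]

accuracy = forced_choice_accuracy(example_pairs)
print(f"{accuracy:.1f}")  # 3 of 4 pairs correct -> 75.0
```

Under this scoring rule, a scorer that picks either sentence at random is right about half the time, which is where the 50% baseline in the caption comes from.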