Table 1: Performance (macro F1 multiplied by 100) of baselines and Pet on the RAFT benchmark (Alex et al., 2021). Best model performance is shown in bold; best overall performance (including human annotators) is underlined. The final column shows average performance across all 11 tasks.

| Method   | ADE | B77 | NIS | OSE | Over | SOT | SRI | TAI | ToS | TEH | TC | Avg |
|----------|-----|-----|-----|-----|------|-----|-----|-----|-----|-----|----|-----|
| GPT-2    | 60.0 | 12.1 | 56.1 | 24.5 | 49.8 | 38.0 | 49.2 | 61.2 | 49.8 | 31.1 | 72.3 | 45.8 |
| GPT-Neo  | 45.2 | 14.9 | 40.8 | 34.3 | 68.1 | 40.6 | 49.3 | 60.5 | 56.5 | **55.4** | 63.6 | 48.1 |
| AdaBoost | 54.3 | 2.3 | 62.6 | 47.5 | 83.8 | 45.5 | 50.6 | 55.6 | 56.0 | 44.3 | 62.5 | 51.4 |
| snlt     | 60.3 | 24.8 | 58.5 | 30.2 | 83.1 | 33.6 | 49.2 | 62.6 | 54.0 | 44.9 | 79.1 | 52.8 |
| GPT-3    | 68.6 | 29.9 | 67.9 | 43.1 | <u>**93.7**</u> | 76.9 | <u>**51.6**</u> | <u>**65.6**</u> | 57.4 | 52.6 | 82.1 | 62.7 |
| SetFit   | 72.6 | 53.8 | <u>**87.2**</u> | 52.1 | 90.7 | 68.2 | 49.3 | 62.8 | **62.0** | 53.2 | **83.7** | 66.9 |
| Pet      | **82.2** | **59.3** | 85.7 | <u>**64.6**</u> | 90.8 | **81.6** | 49.3 | 63.8 | 57.6 | 48.3 | 82.4 | **69.6** |
| Human    | <u>83.0</u> | <u>60.7</u> | 85.7 | <u>64.6</u> | 91.7 | <u>90.8</u> | 46.8 | 60.9 | <u>62.7</u> | <u>72.2</u> | <u>89.7</u> | <u>73.5</u> |
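
For reference, macro F1 averages the per-class F1 scores with equal weight, so minority classes count as much as majority ones; the table reports this value multiplied by 100. Below is a minimal sketch of the computation using scikit-learn's `f1_score` with `average="macro"`; the label arrays are hypothetical placeholders, not RAFT data:

```python
from sklearn.metrics import f1_score

# Hypothetical gold labels and model predictions for a 3-class task.
y_true = [0, 0, 1, 1, 2, 2, 2, 0]
y_pred = [0, 1, 1, 1, 2, 0, 2, 0]

# average="macro" computes F1 per class, then takes the unweighted
# mean -- the metric reported (x100) in Table 1.
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(f"macro F1 x 100 = {100 * macro_f1:.1f}")
```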