Table 3: Human ratings of summaries along four evaluation dimensions, averaged over three expert annotators, broken down by extractive and abstractive models. The M* codes follow the notation described in Section 3.2. The three highest-rated models in each column are in bold.

Method                         Coherence  Consistency  Fluency  Relevance
CNN/DM Reference Summary            3.26         4.47     4.79       3.77

Extractive Models
M0 - LEAD-3                         4.16         4.98     4.94       4.14
M1 - NEUSUM                         3.22         4.98     4.90       3.82
M2 - BanditSum                      3.28         4.99     4.83       3.81
M5 - RNES                           3.71         4.97     4.81       4.06

Abstractive Models
M8 - Pointer Generator              3.29         4.65     4.79       3.55
M9 - Fast-abs-rl                    2.38         4.67     4.50       3.52
M10 - Bottom-Up                     2.73         4.25     4.42       3.38
M11 - Improve-abs                   2.28         3.27     3.65       3.15
M12 - Unified-ext-abs               3.60         4.96     4.85       3.85
M13 - ROUGESal                      3.44         4.82     4.86       3.83
M14 - Multi-task (Ent + QG)         3.20         4.90     4.74       3.63
M15 - Closed book decoder           3.35         4.95     4.80       3.67
M17 - T5                            4.00         4.93     4.93       4.23
M20 - GPT-2 (zero shot)             3.63         3.40     3.97       3.30
M22 - BART                          4.18         4.94     4.90       4.25
M23 - Pegasus (C4)                  4.16         4.91     4.88       4.26
M23 - Pegasus (dynamic mix)         4.09         4.85     4.79       4.27
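The following is a minimal sketch, not the authors' code, of how the table's values can be reproduced from raw annotations: each model's per-annotator scores are averaged per dimension, and the three highest averages per column are the bolded entries. The `ratings` dictionary and its two example models are hypothetical placeholders for the full annotation data.

```python
# Sketch: aggregate per-annotator ratings (1-5 scale) into per-model averages
# and identify the top-3 models per evaluation dimension.
from statistics import mean

DIMENSIONS = ["coherence", "consistency", "fluency", "relevance"]

# Hypothetical raw data: model -> dimension -> one score per expert annotator.
ratings = {
    "M22 - BART":  {"coherence": [4, 4, 5], "consistency": [5, 5, 5],
                    "fluency":   [5, 5, 5], "relevance":   [4, 4, 5]},
    "M0 - LEAD-3": {"coherence": [4, 4, 4], "consistency": [5, 5, 5],
                    "fluency":   [5, 5, 5], "relevance":   [4, 4, 4]},
}

# Average over annotators, as reported in the table cells.
averages = {
    model: {dim: mean(scores[dim]) for dim in DIMENSIONS}
    for model, scores in ratings.items()
}

# Top-3 models per dimension (the bolded entries in each column).
for dim in DIMENSIONS:
    top3 = sorted(averages, key=lambda m: averages[m][dim], reverse=True)[:3]
    print(f"{dim}: {top3}")
```

In the paper the averages are also taken over the evaluated articles, not only over the three annotators; the sketch omits that outer loop for brevity.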