Human evaluation on 1600 generated FaithDial responses (200 × 8) from different models on the test data. * and ** indicate that a result is significantly different from the best result in that column (bolded), with p-value < 0.05 and < 0.01, respectively. ‘Coop.’, ‘Abst.’, and ‘Enga.’ stand for cooperativeness, abstractiveness, and engagingness, respectively.
| | Models | Interpretable | Hallucination | Faithfulness (Coop.) | Faithfulness (Abst.) | Faithfulness (Enga.) | Generic |
|---|---|---|---|---|---|---|---|
| WoW | T5 | 93.2% | 55.8%** | 2.97* | 1.95* | 1.72* | 2.2% |
| | T5-CTRL | 95.2% | 44.2%* | 1.97* | 0.92* | 1.33* | 0.9% |
| | T5-LossTruncation | 94.3% | 42.5%** | 2.87* | 1.87* | 1.83* | 1.2% |
| FaithDial | T5 | 94.4% | 23.2%* | 3.63 | 2.43* | 2.33 | 1.4% |
| | T5-WoW | 95.2% | 20.9%* | 3.59 | 2.44 | 2.37 | 1.0% |
| | T5-CTRL | 96.7% | 20.8%* | 2.55* | 1.42* | 2.10* | 1.0% |
| | T5-LossTruncation | 94.2% | 24.2%* | 3.59 | 2.42* | 2.03* | 0.9% |
| | T5-InfoNCE | 97.2% | 19.9% | 3.79 | 2.92 | 2.60 | 0.9% |