Comparison between I-VILA models and other layout-aware methods that require expensive pretraining. I-VILA achieves comparable accuracy at less than 5% of the training cost.
| Model | GROTOAP2 F1 | GROTOAP2 H(G) | DocBank F1 | DocBank H(G) | S2-VL F1 | S2-VL H(G) | Training Cost¹ |
|---|---|---|---|---|---|---|---|
| BERTBASE (Devlin et al., 2019) | 90.78 | 1.58 | 87.24 | 3.50 | 78.34 (6.53) | 7.17 (0.95) | 40 hr fine-tuning |
| BERTBASE + I-VILA (Text Line) | 91.65 | 1.13 | 90.25 | 2.56 | 81.15 (4.83) | 4.76 (1.28) | 40 hr fine-tuning |
| BERTBASE + I-VILA (Text Block) | 92.31 | 0.63 | 89.49 | 2.25 | 81.82 (4.88) | 3.65 (0.26) | 40 hr fine-tuning |
| LayoutLMBASE (Xu et al., 2020) | 92.34 | 0.78 | 91.06 | 2.64 | 82.69 (6.04) | 4.19 (0.25) | 1.2k hr pretraining + 50 hr fine-tuning |
| LayoutLMv2BASE (Xu et al., 2021) | –² | – | 93.33 | 1.93 | 83.05 (4.51) | 3.34 (0.82) | 9.6k hr pretraining³ + 130 hr fine-tuning |
¹ We report the equivalent V100 GPU hours on the GROTOAP2 dataset in this column.
² LayoutLMv2 cannot be trained on the GROTOAP2 dataset because almost 30% of its instances do not have compatible PDF images.
³ The authors do not report the exact pretraining cost in the paper; the number is a rough estimate based on our experimental results.
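The "less than 5% of the training cost" claim in the caption can be checked directly from the GPU-hour figures in the Training Cost column. A minimal sketch (the model labels and dictionary are illustrative, not from the paper's code):

```python
# Verify the "<5% of the training cost" claim from the table.
# All figures are equivalent V100 GPU hours as reported above.
costs = {
    "BERT + I-VILA": 40,           # fine-tuning only
    "LayoutLM": 1200 + 50,         # pretraining + fine-tuning
    "LayoutLMv2": 9600 + 130,      # pretraining (rough estimate) + fine-tuning
}

for model in ("LayoutLM", "LayoutLMv2"):
    ratio = costs["BERT + I-VILA"] / costs[model]
    print(f"I-VILA cost relative to {model}: {ratio:.1%}")
# 40 / 1250 = 3.2% and 40 / 9730 ≈ 0.4%, both under 5%
```

Note that the comparison charges LayoutLM and LayoutLMv2 for their layout-aware pretraining, which is the point of the caption: I-VILA injects layout signals at fine-tuning time and skips that pretraining entirely.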