Table 5: Comparison between I-VILA models and other layout-aware methods that require expensive pretraining. I-VILA achieves comparable accuracy with less than 5% of the training cost.

| Model | GROTOAP2 F1 | GROTOAP2 H(G) | DocBank F1 | DocBank H(G) | S2-VL F1 | S2-VL H(G) | Training Cost¹ |
|---|---|---|---|---|---|---|---|
| BERTBASE (Devlin et al., 2019) | 90.78 | 1.58 | 87.24 | 3.50 | 78.34 (6.53) | 7.17 (0.95) | 40 hr fine-tuning |
| BERTBASE + I-VILA (Text Line) | 91.65 | 1.13 | 90.25 | 2.56 | 81.15 (4.83) | 4.76 (1.28) | 40 hr fine-tuning |
| BERTBASE + I-VILA (Text Block) | 92.31 | 0.63 | 89.49 | 2.25 | 81.82 (4.88) | 3.65 (0.26) | 40 hr fine-tuning |
| LayoutLMBASE (Xu et al., 2020) | 92.34 | 0.78 | 91.06 | 2.64 | 82.69 (6.04) | 4.19 (0.25) | 1.2k hr pretraining + 50 hr fine-tuning |
| LayoutLMv2BASE (Xu et al., 2021)² | – | – | 93.33 | 1.93 | 83.05 (4.51) | 3.34 (0.82) | 9.6k hr pretraining³ + 130 hr fine-tuning |
¹ We report the equivalent V100 GPU hours on the GROTOAP2 dataset in this column.

² LayoutLMv2 cannot be trained on the GROTOAP2 dataset because almost 30% of its instances do not have compatible PDF images.

³ The authors do not report the exact cost in the paper; the number is a rough estimate based on our experimental results.
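For reference, the F1 scores in the table are macro-averaged classification F1 over predicted token categories. A minimal sketch of the metric itself (this is an illustration of standard macro F1, not the authors' evaluation code; the labels are hypothetical):

```python
def f1_per_label(gold, pred, label):
    """Precision/recall-based F1 for a single label."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p == label)
    fp = sum(1 for g, p in zip(gold, pred) if p == label and g != label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over the labels present in gold."""
    labels = sorted(set(gold))
    return sum(f1_per_label(gold, pred, lab) for lab in labels) / len(labels)

# Hypothetical token categories for a title/author/body labeling task.
gold = ["title", "title", "author", "body", "body"]
pred = ["title", "author", "author", "body", "body"]
print(round(macro_f1(gold, pred), 4))  # → 0.7778
```

Macro averaging weights every category equally, so rare categories (e.g. captions or footnotes in page-layout datasets) influence the score as much as abundant body text.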
