Table 3: Content extraction performance for H-VILA. The H-VILA models significantly reduce the inference time cost compared to LayoutLM, while achieving comparable accuracy on the three benchmark datasets.

                          GROTOAP2          DocBank           S2-VL                      Inference Time (ms)
                          Macro F1  H(G)    Macro F1  H(G)    Macro F1      H(G)
LayoutLM_BASE             92.34     0.78    91.06     2.64    82.69(6.04)   4.19(0.25)   52.56(0.25)
Simple Group Classifier   92.65     0.00    87.01     0.00    –¹            –            82.57(0.30)
H-VILA(Text Line)         91.65     0.32    91.27     1.07    83.69(2.92)   1.70(0.68)   28.07(0.37)²
H-VILA(Text Block)        92.37     0.00    87.78     0.00    82.09(5.89)   0.36(0.12)   16.37(0.15)
¹ The simple group classifier failed to converge in one run; for a fair comparison, we do not report its results.

² When reporting efficiency in other parts of the paper, we use this result because of its optimal combination of accuracy and efficiency.
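
As a worked reading of the Inference Time column, the sketch below computes how much inference time each model saves relative to LayoutLM_BASE, using only the mean values listed in the table. The dictionary keys and the helper name `relative_reduction` are illustrative and not taken from the paper; the standard deviations in parentheses are ignored here.

```python
# Mean inference times (ms) copied from the Inference Time column of Table 3.
inference_ms = {
    "LayoutLM_BASE": 52.56,
    "Simple Group Classifier": 82.57,
    "H-VILA(Text Line)": 28.07,
    "H-VILA(Text Block)": 16.37,
}


def relative_reduction(baseline_ms: float, model_ms: float) -> float:
    """Percentage of the baseline's inference time that the model saves.

    A negative value means the model is slower than the baseline.
    """
    return 100.0 * (baseline_ms - model_ms) / baseline_ms


baseline = inference_ms["LayoutLM_BASE"]

# Compare every other model against the LayoutLM_BASE baseline.
reductions = {
    name: relative_reduction(baseline, ms)
    for name, ms in inference_ms.items()
    if name != "LayoutLM_BASE"
}

for name, pct in reductions.items():
    print(f"{name}: {pct:+.1f}% vs. LayoutLM_BASE")
```

Under these numbers, H-VILA(Text Line) saves roughly 47% of the LayoutLM_BASE inference time and H-VILA(Text Block) roughly 69%, while the simple group classifier is about 57% slower.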
