Content extraction performance for H-VILA. The H-VILA models cut inference time substantially relative to LayoutLM (from 52.56 ms to 28.07 ms with text lines and 16.37 ms with text blocks) while achieving comparable accuracy on the three benchmark datasets.
| Model | GROTOAP2 Macro F1 | GROTOAP2 H(G) | DocBank Macro F1 | DocBank H(G) | S2-VL Macro F1 | S2-VL H(G) | Inference Time (ms) |
|---|---|---|---|---|---|---|---|
| LayoutLM_BASE | 92.34 | 0.78 | 91.06 | 2.64 | 82.69 (6.04) | 4.19 (0.25) | 52.56 (0.25) |
| Simple Group Classifier | 92.65 | 0.00 | 87.01 | 0.00 | –¹ | – | 82.57 (0.30) |
| H-VILA (Text Line) | 91.65 | 0.32 | 91.27 | 1.07 | 83.69 (2.92) | 1.70 (0.68) | 28.07 (0.37)² |
| H-VILA (Text Block) | 92.37 | 0.00 | 87.78 | 0.00 | 82.09 (5.89) | 0.36 (0.12) | 16.37 (0.15) |
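To make the size of the inference-time reduction concrete, the following minimal sketch computes the relative reduction and speedup of the two H-VILA variants against the LayoutLM baseline. The timings are copied from the table; the dictionary keys and the helper name `relative_reduction` are illustrative assumptions, not part of the original work.

```python
# Mean inference times in ms, copied from the table above.
# Keys and helper names are illustrative, not taken from the paper.
inference_ms = {
    "LayoutLM_BASE": 52.56,
    "Simple Group Classifier": 82.57,
    "H-VILA (Text Line)": 28.07,
    "H-VILA (Text Block)": 16.37,
}


def relative_reduction(baseline_ms, model_ms):
    """Percentage reduction in inference time relative to the baseline."""
    return 100.0 * (baseline_ms - model_ms) / baseline_ms


baseline = inference_ms["LayoutLM_BASE"]
for name in ("H-VILA (Text Line)", "H-VILA (Text Block)"):
    pct = relative_reduction(baseline, inference_ms[name])
    speedup = baseline / inference_ms[name]
    print(f"{name}: {pct:.1f}% less time, {speedup:.2f}x speedup")
```

Under these numbers, the text-line variant takes a little under half as long as the baseline and the text-block variant roughly a third as long. The comparison uses only the reported means and ignores the values given in parentheses.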