Table 6: VILA model performance when using different layout group detectors for text blocks G(𝓑) and lines G(𝓛) on the S2-VL dataset.

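| Experiment | Group Source | Group-uniform Oracle: Max Macro F1 | Group-uniform Oracle: H(G) | I-VILA: Macro F1 | I-VILA: H(G) | H-VILA: Macro F1 | H-VILA: H(G) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Varying G(𝓑) | Ground-Truth | 100.00 (0.00) | 0.00 (0.00) | 86.50 (4.52) | 1.86 (0.29) | 85.91 (3.13) | 0.35 (0.19) |
| Varying G(𝓑) | Vision Model | 99.31 (0.23) | 1.09 (0.30) | 83.44 (6.48) | 2.83 (0.34) | 82.09 (5.89) | 0.36 (0.12) |
| Varying G(𝓑) | PDF Parsing | 96.91 (1.09) | 2.06 (0.86) | 83.95 (4.45) | 3.93 (0.93) | 78.69 (4.90) | 0.02 (0.01) |
| Varying G(𝓛) | Vision Model | 99.57 (0.13) | 0.42 (0.18)¹ | 83.77 (5.75) | 1.20 (0.16) | 83.69 (2.92) | 0.20 (0.12) |
| Varying G(𝓛) | PDF Parsing | 99.70 (0.12) | 0.38 (0.26) | 82.97 (5.56) | 1.28 (0.13) | 82.61 (4.10) | 0.00 (0.00) |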
¹ For text line detector experiments, we report H(G) based on text lines rather than blocks.
