VILA model performance when using different layout group detectors for text blocks G(𝓑) and lines G(𝓛) on the S2-VL dataset.
| Experiment | Group Source | Group-uniform Oracle: Max Macro F1 | Group-uniform Oracle: H(G) | I-VILA: Macro F1 | I-VILA: H(G) | H-VILA: Macro F1 | H-VILA: H(G) |
|---|---|---|---|---|---|---|---|
| Varying G(𝓑) | Ground-Truth | 100.00(0.00) | 0.00(0.00) | 86.50(4.52) | 1.86(0.29) | 85.91(3.13) | 0.35(0.19) |
| Varying G(𝓑) | Vision Model | 99.31(0.23) | 1.09(0.30) | 83.44(6.48) | 2.83(0.34) | 82.09(5.89) | 0.36(0.12) |
| Varying G(𝓑) | PDF Parsing | 96.91(1.09) | 2.06(0.86) | 83.95(4.45) | 3.93(0.93) | 78.69(4.90) | 0.02(0.01) |
| Varying G(𝓛) | Vision Model | 99.57(0.13) | 0.42(0.18)¹ | 83.77(5.75) | 1.20(0.16) | 83.69(2.92) | 0.20(0.12) |
| Varying G(𝓛) | PDF Parsing | 99.70(0.12) | 0.38(0.26) | 82.97(5.56) | 1.28(0.13) | 82.61(4.10) | 0.00(0.00) |
¹ For text line detector experiments, we report H(G) based on text lines rather than blocks.