VILA: Improving Structured Content Extraction from Scientific PDFs Using Visual Layout Groups

Accurately extracting structured content from PDFs is a critical first step for NLP over scientific papers. Recent work has improved extraction accuracy by incorporating elementary layout information, for example, each token’s 2D position on the page, into language model pretraining. We introduce new methods that explicitly model VIsual LAyout (VILA) groups, that is, text lines or text blocks, to further improve performance. In our I-VILA approach, we show that simply inserting special tokens denoting layout group boundaries into model inputs can lead to a 1.9% Macro F1 improvement in token classification. In the H-VILA approach, we show that hierarchical encoding of layout-groups can result in up to 47% inference time reduction with less than 0.8% Macro F1 loss. Unlike prior layout-aware approaches, our methods do not require expensive additional pretraining, only fine-tuning, which we show can reduce training cost by up to 95%. Experiments are conducted on a newly curated evaluation suite, S2-VLUE, that unifies existing automatically labeled datasets and includes a new dataset of manual annotations covering diverse papers from 19 scientific disciplines. Pre-trained weights, benchmark datasets, and source code are available at https://github.com/allenai/VILA.


Introduction
Scientific papers are usually distributed in Portable Document Format (PDF) without extensive semantic markup. Extracting structured document representations from these PDF files, i.e., identifying title and author blocks, figures, references, and so on, is a critical first step for downstream NLP tasks (Beltagy et al., 2019; Wang et al., 2020) and is important for improving PDF accessibility (Wang et al., 2021).
Recent work demonstrates that document layout information can be used to enhance content extraction via large-scale, layout-aware pretraining (Xu et al., 2020, 2021; Li et al., 2021). However, these methods only consider individual tokens' 2D positions and do not explicitly model high-level layout structures like the grouping of text into lines and blocks (see Figure 1 for an example), limiting accuracy. Further, existing methods come with enormous computational costs: they rely on further pretraining an existing pretrained model like BERT (Devlin et al., 2019) on layout-enriched input, and achieving the best performance from the models requires more than a thousand (Xu et al., 2020) to several thousand (Xu et al., 2021) GPU-hours. This means swapping in a new pretrained text model or experimenting with new layout-aware architectures can be prohibitively expensive, incompatible with the goals of green AI (Schwartz et al., 2020).
In this paper, we explore how to improve the accuracy and efficiency of structured content extraction from scientific documents by using VIsual LAyout (VILA) groups. Following Zhong et al. (2019) and Tkaczyk et al. (2015), our methods use the idea that a document page can be segmented into visual groups of tokens (either lines or blocks), and that the tokens within each group generally have the same semantic category, which we refer to as the group uniformity assumption (see Figure 1(b)). Given text lines or blocks generated by rule-based PDF parsers (Tkaczyk et al., 2015) or vision models (Zhong et al., 2019), we design two different methods to incorporate the VILA groups and the assumption into modeling: the I-VILA model adds layout indicator tokens to textual inputs to improve the accuracy of existing BERT-based language models, while the H-VILA model uses VILA structures to define a hierarchical model that treats pages as collections of groups rather than of individual tokens, increasing inference efficiency.
Previous datasets for evaluating PDF content extraction rely on machine-generated labels of imperfect quality, and comprise papers from a limited range of scientific disciplines. To better evaluate our proposed methods, we design a new benchmark suite, Semantic Scholar Visual Layout-enhanced Scientific Text Understanding Evaluation (S2-VLUE). The benchmark extends two existing resources (Tkaczyk et al., 2015; Li et al., 2020) and introduces a newly curated dataset, S2-VL, which contains high-quality human annotations for papers across 19 disciplines.
Our contributions are as follows:
1. We introduce a new strategy for PDF content extraction that uses VILA structures to inject layout information into language models, and show that this improves accuracy without the expensive pretraining required by existing methods, and generalizes to different language models.
2. We design two models that incorporate VILA features differently. The I-VILA model injects layout indicator tokens into the input texts and improves prediction accuracy (up to +1.9% Macro F1) and consistency compared to the previous layout-augmented language model LayoutLM (Xu et al., 2020). The H-VILA model performs group-level predictions and can reduce model inference time by 47% with less than 0.8% loss in Macro F1.
3. We construct a unified benchmark suite S2-VLUE which enhances existing datasets with VILA structures, and introduce a novel dataset S2-VL that addresses gaps in existing resources. S2-VL contains hand-annotated gold labels for 15 token categories on papers spanning 19 disciplines.
Related Work

Structured Content Extraction for Scientific Documents
Prior work on structured content extraction for scientific documents usually relies on textual or visual features. Text-based methods like ScienceParse (Ammar et al., 2018), GROBID (GRO, 2008-2021), or Corpus Conversion Service (Staar et al., 2018) combine PDF-to-text parsing engines like CERMINE (Tkaczyk et al., 2015) or pdfalto, which output a sequence of tokens extracted from a PDF, with machine learning models like RNNs (Hochreiter and Schmidhuber, 1997), CRFs (Lafferty et al., 2001), or Random Forests (Breiman, 2001) trained to classify the token categories of the sequence. Though these models are practical and fairly efficient, they fall short in prediction accuracy or generalize poorly to out-of-domain documents. Vision-based approaches (Zhong et al., 2019; He et al., 2017; Siegel et al., 2018), on the other hand, treat the parsing task as an image object detection problem: given document images, the models predict rectangular bounding boxes, segmenting the page into individual components of different categories. These models excel at capturing complex visual layout structures like figures or tables, but because they operate only on visual signals without textual information, they cannot accurately predict fine-grained semantic categories like title, author, or abstract, which are of central importance for parsing scientific documents.

Layout-aware Language Models
Recent methods on layout-aware language models improve prediction accuracy by jointly modeling documents' textual and visual signals. LayoutLM (Xu et al., 2020) learns a set of novel positional embeddings that can encode tokens' 2D spatial location on the page and improves accuracy on scientific document parsing (Li et al., 2020). More recent work (Xu et al., 2021; Li et al., 2021) aims to encode the document in a multimodal fashion by modeling text and images together. However, these existing joint-approach models require expensive pretraining, and may be less efficient as a consequence of their joint inputs (Xu et al., 2021), making them less suitable for deployment at scale. In this work, we aim to incorporate document layout features in the form of visual layout groupings, in novel ways that improve or match performance without the need for expensive pretraining. Our work is well-aligned with recent efforts for incorporating structural information into language models (Lee et al., 2020; Bai et al., 2021; Yang et al., 2020; Zhang et al., 2019).

Training and Evaluation Datasets
The available training and evaluation datasets for scientific content extraction models are automatically generated from author-provided source data, e.g., GROTOAP2 (Tkaczyk et al., 2014) and PubLayNet (Zhong et al., 2019) are constructed from PubMed Central XML, and DocBank (Li et al., 2020) from arXiv LaTeX source. Despite their large sample sizes, these datasets have limited layout variation, leading to poor generalization to papers from other disciplines with distinct layouts. Also, due to the heuristic nature in which the data are automatically labeled, they contain systematic classification errors that can affect downstream modeling performance. We elaborate on the limitations of GROTOAP2 (Tkaczyk et al., 2014), DocBank (Li et al., 2020), and PubLayNet (Zhong et al., 2019) in Section 4.

Methods

Problem Formulation
Following prior work (Tkaczyk et al., 2015; Li et al., 2020), our task is to map each token t_i in an input sequence T = (t_1, ..., t_n) to its semantic category c_i (e.g., title, body text, reference, etc.). Input tokens are extracted via PDF-to-text tools, which output both the word t_i and its 2D position on the page, a rectangular bounding box a_i = (x_0, y_0, x_1, y_1) denoting the left, top, right, and bottom coordinates of the word. The order of tokens in sequence T may not reflect the actual reading order of the text due to errors in PDF-to-text conversion (e.g., in the original DocBank dataset (Li et al., 2020)), which poses an additional challenge to language models pre-trained on regular texts.
Besides the token sequence T, additional visual structures G can also be retrieved from the source document. Scientific papers are organized into groups of tokens (lines or blocks), which consist of consecutive pieces of text that can be segmented from other pieces based on spatial gaps. The group information can be extracted via visual layout detection models (Zhong et al., 2019; He et al., 2017) or rule-based PDF parsing (Tkaczyk et al., 2015).
Formally, given an input page, the group detector identifies a series of m rectangular boxes B = {b_1, ..., b_m} in the input document page, where b_j = (x_0, y_0, x_1, y_1) denotes the box coordinates. Page tokens are allocated to the visual groups g_j = (b_j, T^(j)), where T^(j) = {t_i | a_i ⊏ b_j, t_i ∈ T} contains all tokens in the j-th group, and a_i ⊏ b_j denotes that the center point of token t_i's bounding box a_i is strictly within the group box b_j. When two group regions overlap and share common tokens, the system assigns the common tokens to the earlier group by the estimated reading order from the PDF parser. We refer to the text block groups of a page as G^(B) and the text line groups as G^(L). In our case, we define text lines as consecutive tokens appearing at nearly the same vertical position. Text blocks are sets of adjacent text lines with gaps smaller than a certain threshold, and ideally the same semantic category. That is, even two close lines of different semantic categories should be allocated to separate blocks, and in our models we use a block detector trained toward this objective. In practice, block or line detectors may generate incorrect predictions.
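The token-to-group allocation rule above can be sketched in a few lines. This is a minimal illustration with helper names of our own choosing, not the VILA codebase's implementation: a token belongs to the first group (in estimated reading order) whose box strictly contains the center of the token's box.

```python
def center(box):
    """Center point (cx, cy) of a box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

def contains(group_box, token_box):
    """True if the token box's center lies strictly within the group box."""
    cx, cy = center(token_box)
    x0, y0, x1, y1 = group_box
    return x0 < cx < x1 and y0 < cy < y1

def allocate_tokens(token_boxes, group_boxes):
    """Assign each token index to the first group containing its center.

    Returns one list of token indices per group; when group boxes overlap,
    the earlier group (by estimated reading order) wins, as in the text.
    """
    groups = [[] for _ in group_boxes]
    for i, a in enumerate(token_boxes):
        for j, b in enumerate(group_boxes):
            if contains(b, a):
                groups[j].append(i)
                break  # earlier group wins on overlap
    return groups
```

Tokens whose centers fall in no group box are simply left unassigned in this sketch; a production parser would need a fallback rule for them.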
In the following sections, we describe our two models, I-VILA and H-VILA. The models take a BERT-based pretrained language model as a foundation, which may or may not itself be layout-aware (we experiment with DistilBERT, BERT, RoBERTa, and LayoutLM). Our models then augment the base model to incorporate group structures, as detailed below.

I-VILA: Injecting Visual Layout Indicators
According to the group uniformity assumption, token categories are homogeneous within a group, and categorical changes should happen at group boundaries. This suggests that layout information should be incorporated in a way that encourages consistent token categories within a group and signals possible category changes across group boundaries.
Our first method supplies VILA structures by inserting a special layout indicator token at each group boundary in the input text, and models this with a pretrained language model (which may or may not be position-aware). We refer to this as the I-VILA method. As shown in Figure 2(a), the inserted tokens partition the text into segments that provide helpful structure to the model, hinting at possible category changes. In I-VILA, the special tokens are seen at all layers of the model, providing VILA signals at different stages of modeling, rather than only providing positional information at the initial embedding layers as in LayoutLM (Xu et al., 2020). We empirically show that BERT-based models can learn to leverage such special tokens to improve both the accuracy and the consistency of category predictions, even without an additional loss penalizing inconsistent intra-group predictions.
In practice, given G, we linearize tokens T^(j) from each group and flatten them into a 1D sequence. To avoid capturing confounding information in existing pretraining tasks, we insert a new token previously unseen by the model, [BLK], in between text from different groups T^(j). The resulting input sequence is of the form [CLS] T^(1)_1 ... T^(1)_{n_1} [BLK] T^(2)_1 ... T^(2)_{n_2} [BLK] ... T^(m)_{n_m} [SEP], where T^(j)_i and n_j indicate the i-th token and the total number of tokens respectively in the j-th group, and [CLS] and [SEP] are the special tokens used by the BERT model, inserted to preserve a similar input structure. The BERT-based models are fine-tuned on the token classification objective with a cross-entropy loss. When I-VILA uses a visual pretrained language model as input, such as LayoutLM (Xu et al., 2020), the positional embeddings for the newly injected [BLK] tokens are generated from the corresponding group's bounding box b_j.
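The input construction above can be sketched as a small helper (a minimal sketch; the function name is ours, and a real implementation would operate on tokenizer ids rather than strings): flatten the group texts and place the [BLK] indicator only between groups, wrapped in [CLS] and [SEP].

```python
def build_ivila_input(group_tokens, indicator="[BLK]"):
    """Build the I-VILA input sequence from per-group token lists.

    group_tokens: list of token lists, one per VILA group, in reading order.
    Returns [CLS] t ... t [BLK] t ... t [SEP], i.e. the indicator is placed
    only *between* groups, never before the first or after the last one.
    """
    seq = ["[CLS]"]
    for j, toks in enumerate(group_tokens):
        if j > 0:
            seq.append(indicator)  # marks a group boundary
        seq.extend(toks)
    seq.append("[SEP]")
    return seq
```

With a Hugging Face tokenizer, [BLK] would additionally be registered as a new special token so it receives its own embedding rather than being split into subwords.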

H-VILA: Visual Layout-guided Hierarchical Model
The uniformity of group token categories also suggests the possibility of building a group-level classifier. Inspired by recent advances in modeling long documents, hierarchical structures (Yang et al., 2020; Zhang et al., 2019) provide an ideal architecture for the end task while optimizing for computational cost. As illustrated in Figure 3, our hierarchical approach uses two transformer-based models: one encodes each group in terms of its words, and the other models the whole document in terms of the groups. We provide the details below.
The Group Encoder is an l_g-layer transformer that converts each group g_j into a hidden vector h_j. Following the typical transformer model setting (Vaswani et al., 2017), the model takes the sequence of tokens T^(j) within a group as input, and maps each token T^(j)_i into a dense vector e^(j)_i of dimension d. Subsequently, a group vector aggregation function f : R^{n_j × d} → R^d is applied that projects the token representations e^(j)_1, ..., e^(j)_{n_j} to a single vector h̃_j that represents the group's textual information. A group's 2D spatial information is incorporated in the form of positional embeddings, and the final group representation h_j can be calculated as:

h_j = h̃_j + p(b_j),

where p is the 2D positional embedding similar to the one used in LayoutLM:

p(b) = E_x(x_0) + E_y(y_0) + E_x(x_1) + E_y(y_1) + E_w(w) + E_h(h),

where E_x, E_y, E_w, E_h are the embedding matrices for the x and y coordinates and the width and height. In practice, we find that injecting positional information using the bounding box of the first token within the group leads to better results, and we choose the group vector aggregation function f to be the average over all token representations.

The Page Encoder is another stacked transformer model of l_p layers that operates on the group representations h_j generated by the group encoder. It generates a final group representation s_j for downstream classification. An MLP-based linear classifier is attached thereafter, and is trained to generate the group-level category probability p_{jc}.
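The group representation described above (mean-pooled token embeddings plus a 2D positional embedding) can be sketched numerically. This is an illustration only: the embedding matrices below are random stand-ins for the learned E_x, E_y, E_w, E_h, the dimensions are arbitrary, and we index with the group's own box for simplicity, whereas the model injects the first token's box.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden dimension

# Random stand-ins for the learned coordinate/size embedding matrices.
E_x = rng.normal(size=(100, d))
E_y = rng.normal(size=(100, d))
E_w = rng.normal(size=(100, d))
E_h = rng.normal(size=(100, d))

def positional_embedding(box):
    """p(b) = E_x[x0] + E_y[y0] + E_x[x1] + E_y[y1] + E_w[w] + E_h[h]."""
    x0, y0, x1, y1 = box
    return (E_x[x0] + E_y[y0] + E_x[x1] + E_y[y1]
            + E_w[x1 - x0] + E_h[y1 - y0])

def group_representation(token_embeddings, box):
    """h_j = f(e_1, ..., e_n) + p(b_j), with f = average over tokens."""
    return np.mean(token_embeddings, axis=0) + positional_embedding(box)
```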
Different from previous work (Yang et al., 2020), we restrict the choices of l_g and l_p to {1, 12} so that we can load pre-trained weights from BERT-base models. Therefore, no additional pretraining is required, and the H-VILA model can be fine-tuned directly for the downstream classification task. Specifically, we set l_g = 1 and initialize the group encoder from the first-layer transformer weights of BERT. The page encoder is configured as either a one-layer transformer or a 12-layer transformer that resembles a full LayoutLM model. Weights are initialized from the first layer or the full 12 layers of the LayoutLM model, which is trained to model text in conjunction with its position.
Group Token Truncation As suggested in Yang et al. (2020)'s work, when an input document of length N is evenly split into segments of length L_s, the memory footprint of the hierarchical model is O(l_g N L_s + l_p (N/L_s)^2), and for long documents with N ≫ L_s, it approximates O(N^2 / L_s^2). However, in our case, it is infeasible to adopt the Greedy Sentence Filling technique (Yang et al., 2020) as it mingles signals from different groups and obfuscates group structures. It is also less desirable to simply use the maximum token count per group, max_{1≤j≤m} n_j, to batch the contents due to the high variance of group token length (see Table 1). Instead, we choose a group token truncation count ñ empirically based on statistics of the group token length distribution such that N ≈ ñm, and use the first ñ tokens of each group to aggregate the group hidden vector h_j (we pad the sequence to ñ when it is shorter).
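One simple way to pick the truncation count ñ from the group-length distribution is to take a high percentile of the lengths, so most groups fit without padding the batch out to the rare very long group. The percentile rule is our assumption for illustration; the paper only says ñ is chosen empirically such that N ≈ ñm.

```python
import math

def truncation_count(group_lengths, pct=0.9):
    """Return a per-group token budget that fully covers `pct` of groups."""
    ordered = sorted(group_lengths)
    idx = min(len(ordered) - 1, math.ceil(pct * len(ordered)) - 1)
    return ordered[idx]

def truncate_or_pad(tokens, n, pad="[PAD]"):
    """Keep the first n tokens of a group; pad shorter groups to exactly n."""
    out = tokens[:n]
    return out + [pad] * (n - len(out))
```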

Benchmark Suite
To systematically evaluate the proposed methods, we develop the Semantic Scholar Visual Layout-enhanced Scientific Text Understanding Evaluation (S2-VLUE) benchmark suite. S2-VLUE consists of three datasets: two previously released resources which we augment with VILA information, and a new hand-curated dataset, S2-VL.
Key statistics for S2-VLUE are provided in Table 1. Notably, the three constituent datasets differ with respect to their: 1) annotation method, 2) VILA generation method, and 3) paper domain coverage. We provide details below.
GROTOAP2 The GROTOAP2 dataset (Tkaczyk et al., 2014) is automatically annotated. Its text block and line groupings come from the CERMINE PDF parsing tool (Tkaczyk et al., 2015); text block category labels are then obtained by pairing block texts with structured data from document source files obtained from PubMed Central. A small subset of data is inspected by experts, and a set of post-processing heuristics is developed to further improve annotation quality. Since token categories are annotated by group, the dataset achieves perfect accordance between token labels and VILA structures. However, the rule-based PDF parsing employed by the authors introduces labeling inaccuracies due to imperfect VILA detection: the authors find that block-level annotation achieves only 92 Macro F1 on a small gold evaluation set. Additionally, all samples are extracted from the PMC Open Access Subset, which includes only life sciences publications; these papers have less representation of categories like "equation", which are common in other scientific disciplines.

DocBank
The DocBank dataset (Li et al., 2020) is fully machine-labeled without any post-processing heuristics or human assessment. The authors first identify token categories by automatically parsing the source TeX files available from arXiv. Text block annotations are then generated by grouping together tokens of the same category using connected component analysis. However, only a specific set of token tags is extracted from the main TeX file for each paper, leading to inaccurate and incomplete token labels, especially for papers employing LaTeX macro commands, and thus incorrect visual groupings. Hence, we develop a Mask R-CNN-based vision layout detection model based on a collection of existing resources (Zhong et al., 2019; MFD, 2021; He et al., 2017; Shen et al., 2021) to fix these inaccuracies and generate trustworthy VILA annotations at both the text block and line level. As a result, this dataset can be used to evaluate VILA models under a different setting, since the VILA structures are generated independently from the token annotations. Because the papers in DocBank are from arXiv, however, they primarily represent domains like Computer Science, Physics, and Mathematics, limiting the amount of layout variation.

S2-VL
We introduce a new dataset to address the three major drawbacks in existing work: 1) annotation quality, 2) VILA fidelity, and 3) domain coverage. S2-VL is manually labeled by graduate students who frequently read scientific papers. Overall, the datasets in S2-VLUE cover a wide range of academic disciplines with different layouts. The VILA structures in the three component datasets are curated differently, which helps to evaluate the generality of VILA-based methods.

Implementation Details
Our models are implemented using PyTorch (Paszke et al., 2019) and the transformers library (Wolf et al., 2020). A series of baseline and VILA models are fine-tuned on 4-GPU RTX 8000 or A100 machines. The AdamW optimizer (Kingma and Ba, 2015; Loshchilov and Hutter, 2019) is adopted with a 5 × 10^-5 learning rate and (β_1, β_2) = (0.9, 0.999). The learning rate is linearly warmed up over the first 5% of steps, then linearly decayed. For all datasets (GROTOAP2, DocBank, S2-VL), unless otherwise specified, we select the best fine-tuning batch size (40, 40, and 12) and training epochs (24, 6, and 10) for all models. For S2-VL, given its smaller size, we use 5-fold cross validation and report averaged scores, and use a 2 × 10^-5 learning rate with 20 epochs. We split S2-VL based on papers rather than pages to avoid exposing the paper templates of test samples in the training data. Mixed precision training (Micikevicius et al., 2018) is used to speed up the training process.

Of our defined categories, 12 are common fields taken directly from other similar datasets, e.g., title, abstract, etc. We add three categories: equation, header, and footer, which commonly occur in scientific papers and are included in full text mining resources like S2ORC (Lo et al., 2020) and CORD-19 (Wang et al., 2020).
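The warmup/decay schedule described above can be written as a pure function of the step count (a sketch of the standard linear schedule; in practice this is what utilities like transformers' linear scheduler compute for the optimizer):

```python
def linear_warmup_decay(step, total_steps, base_lr=5e-5, warmup_frac=0.05):
    """Linear warmup over the first `warmup_frac` of steps, then linear decay to 0."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # linear decay from base_lr at the end of warmup to 0 at total_steps
    remaining = total_steps - step
    return base_lr * remaining / (total_steps - warmup_steps)
```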
For I-VILA models, we fine-tune several BERT variants with VILA-enhanced text inputs; the models are initialized from pre-trained weights available in the transformers library. The H-VILA models are initialized as mentioned in Section 3.3, and by default, positional information is injected for each group.

Competing Methods
We consider three approaches that compete with the proposed methods from different perspectives:

1. Baselines The LayoutLM (Xu et al., 2020) model is the main baseline method. It is the closest model counterpart to our VILA-augmented models, as it also injects layout information and achieves the previous SOTA performance on the scientific PDF parsing task (Li et al., 2020).
Table 2: Performance of baseline (LayoutLM_BASE (Xu et al., 2020)) and I-VILA models on the scientific document extraction task. I-VILA provides consistent accuracy improvements over the baseline LayoutLM model on all three benchmark datasets. Notes: (1) For S2-VL, we show averaged scores with standard deviation in parentheses across the 5-fold cross validation subsets. (2) In this table, we report S2-VL results using VILA structures detected by visual layout models. When the ground-truth VILA structures are available, both I-VILA and H-VILA models can achieve better accuracy, as shown in Table 6.
2. Sentence Breaks For I-VILA models, besides using VILA-based indicators, we also compare with indicators generated from sentence breaks detected by PySBD (Sadvilkar and Neumann, 2020). Figure 2(a) shows that the inserted sentence-break indicators may provide both "false-positive" and "false-negative" hints for token semantic category changes, making them less helpful for the end task.

3. Simple Group Classifier For hierarchical models, we consider another baseline approach, where the group texts are separately fed into a LayoutLM-based group classifier. It doesn't require a complicated model design, and uses a full LayoutLM to model each group's text, as opposed to the l_g = 1 layer used in the H-VILA models. However, this method cannot account for inter-group interactions, and is far less efficient.

Metrics

Prediction Accuracy
The token label distribution is heavily skewed towards categories corresponding to paper body text (e.g., the "BODY_CONTENT" category in GROTOAP2 or the "paragraph" category in S2-VL and DocBank). Therefore, we choose Macro F1 as our primary evaluation metric for prediction accuracy.
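Macro F1 averages per-category F1 with equal weight, so rare categories like "title" count as much as the dominant body-text category. A pure-Python sketch (the helper name is ours; in practice this is equivalent to scikit-learn's f1_score with average="macro"):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-category F1 scores over all observed categories."""
    cats = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in cats:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)
```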
Group Category Inconsistency To better characterize how different models behave with respect to group structure, we also report a diagnostic metric that calculates the uniformity of the token categories within a group. Hypothetically, the tokens T^(j) in the j-th group g_j share the same category c, and naturally the group inherits the semantic label c. We use the group token category entropy to measure the inconsistency of a model's predicted token categories within the same group:

H(g) = -Σ_c p_c log p_c,

where p_c denotes the probability of a token in group g being classified as category c. When all tokens in a group have the same category, the group token category inconsistency is zero. H(g) reaches its maximum when p_c is a uniform distribution across all possible categories. The inconsistency for G is the arithmetic mean over all individual groups g_j:

H(G) = (1/m) Σ_{j=1}^{m} H(g_j).

H(G) acts as an auxiliary metric for evaluating prediction quality with respect to the provided VILA structures. In the remainder of this paper, we report the inconsistency metric for text blocks G^(B) by default, and scale the values by a factor of 100.

Despite the group texts being relatively short, the simple group classifier baseline causes extra computational overhead, as the full LayoutLM model needs to be run m times for all groups in a page. The simple group classifier models are only trained for 5, 2, and 5 epochs for GROTOAP2, DocBank, and S2-VL for tractability.
Measuring Efficiency We report the inference time per sample as a measure of model efficiency.
We select 1,000 pages from the GROTOAP2 test set, and report the average model runtime over 3 runs on this subset. All models are tested on an isolated machine with a single V100 GPU. We report the time incurred for text classification; time costs associated with PDF-to-text conversion or VILA structure detection are not included (these are treated as pre-processing steps, which can be cached and re-used when processing documents with different content extractors). Notes: (1) The simple group classifier fails to converge for one run; we do not report the results for fair comparison. (2) When reporting efficiency in other parts of the paper, we use this result because of its optimal combination of accuracy and efficiency.

Results

I-VILA Achieves Better Accuracy
Table 2 shows that I-VILA models lead to consistent accuracy improvements without further pretraining. Compared to the baseline LayoutLM model, inserting layout indicators results in +1.13%, +1.90%, and +1.29% Macro F1 improvements across the three benchmark datasets. I-VILA models also achieve better token prediction consistency; the corresponding group category inconsistency is reduced by 32.1%, 21.7%, and 21.7% compared to the baseline. Moreover, VILA information is also more helpful than language structures: I-VILA models based on text blocks and lines both outperform the sentence boundary-based method by a similar margin. Figure 4 shows an example of the VILA model predictions.

H-VILA is More Efficient
Table 3 summarizes the efficiency improvements of the H-VILA models with l g = 1 and l p = 12.
As block-level models perform predictions directly at the text block level, the group category inconsistency is naturally zero. Compared to LayoutLM, H-VILA models with text lines bring a 46.59% reduction in inference time, without heavily penalizing the final prediction accuracies (-0.75%, +0.23%, +1.21% Macro F1). When text blocks are used, H-VILA models are even more efficient (68.85% and 80.17% inference time reduction compared to the LayoutLM and simple group classifier baselines), and they also achieve similar or better accuracy compared to the simple group classifier (-0.30%, +0.88% Macro F1 for GROTOAP2 and DocBank). However, in H-VILA models, the inductive bias from the group uniformity assumption also has a drawback: the models are often less accurate than their I-VILA counterparts, and performing block-level classification may sometimes lead to worse results (-3.60% and -0.73% Macro F1 on the DocBank and S2-VL datasets compared to LayoutLM). Moreover, as shown in Figure 5, when the injected layout group is incorrect, the H-VILA method lacks the flexibility to assign different token categories within a group, leading to lower accuracy. Additional analysis of the impact of the layout group predictions is detailed in Section 8.

Notes: (1) We report the equivalent V100 GPU hours on the GROTOAP2 dataset in this column. (2) LayoutLMv2 cannot be trained on the GROTOAP2 dataset because almost 30% of its instances do not have compatible PDF images. (3) The authors do not report the exact cost in the paper; the number is a rough estimate based on our experimental results.

Ablation Studies

I-VILA is Effective Across BERT Variants
To test the applicability of the VILA methods, we adapt I-VILA to different BERT variants and train them on the GROTOAP2 dataset. As shown in Table 4, I-VILA leads to consistent improvements on DistilBERT (Sanh et al., 2019), BERT, and RoBERTa (Liu et al., 2019), yielding up to +1.77%, +1.69%, and +0.96% Macro F1 compared to non-VILA counterparts.

I-VILA Improves Accuracy without Pretraining
In Table 5, we fine-tune a series of I-VILA models based on BERT, and compare their performance with LayoutLM and LayoutLMv2 (Xu et al., 2021), which require additional large-scale pretraining on corpora with layout. BERT+I-VILA achieves comparable accuracy to LayoutLM (0.00%, -0.89%, -1.05%), with only 5% of the training cost. I-VILA also closes the gap with the latest multimodal method LayoutLMv2 (Xu et al., 2021) with only 1% of the training cost. This further verifies that injecting layout indicator tokens is a novel and effective way of incorporating layout information into language models.
Positional embeddings are not used in these models.

It takes 10.5 hours to finish fine-tuning I-VILA on the GROTOAP2 dataset using a machine with 4 RTX 8000 GPUs, equivalent to around 60 V100 GPU hours, approximately 5% of the 1,280 hours of pretraining time for LayoutLM.

VILA in Practice: The Impact of Layout Group Detectors
Applying VILA methods in practice requires running a group layout detector as a critical first step.
In this section, we analyze how the accuracy of different block and line group detectors affects the accuracy of H-VILA and I-VILA models.
The results are shown in Table 6.We report on the S2-VL dataset using two automated group detectors: the CERMINE PDF parser (Tkaczyk et al., 2015) and the Mask R-CNN vision model trained on the PubLayNet dataset (Zhong et al., 2019).We also report on using ground truth blocks as an upper bound.The "Group-uniform Oracle" illustrates how well the different group detectors reflect the group uniformity assumption; in the oracle setting, one is given ground truth labels but is restricted to assigning the same label to all tokens in a group.
When using text blocks, the performance of H-VILA hinges on the accuracy of group detection, while I-VILA shows more reliable results when using different group detectors.This suggests that improvements in vision models for block detection could be a promising avenue for improving content extraction performance, especially when using H-VILA, and I-VILA may be the better choice when block detection accuracy is lower.
We also observe that text line-based methods tend to perform better for both group detectors, by a small margin for I-VILA and a larger one for H-VILA. The group detectors in our experiments are trained on data from PubLayNet and applied to a different dataset, S2-VL. This domain transfer affects block detectors more than line detectors, because the two datasets define blocks differently. This setting is realistic because ground truth blocks from the target dataset may not always be available for training (even when labeled tokens are). Training a group detector on S2-VL is likely to improve performance.

Table 6: VILA model performance when using different layout group detectors for text blocks G^(B) and lines G^(L) on the S2-VL dataset.

Conclusion
In this paper, we introduce two new ways to integrate Visual Layout (VILA) structures into the NLP pipeline for structured content extraction from scientific paper PDFs. We show that inserting special indicator tokens based on VILA (I-VILA) can lead to robust improvements in token classification accuracy (up to +1.9% Macro F1) and consistency (up to -32% group category inconsistency). In addition, we design a hierarchical transformer model based on VILA (H-VILA), which can reduce inference time by 46% with less than 0.8% Macro F1 reduction compared to previous SOTA methods. These VILA-based methods can be easily incorporated into different BERT variants with only fine-tuning, achieving performance comparable to existing work at only 5% of the training cost. We ablate the influence of different visual layout detectors on VILA-based models and provide suggestions for practical use. We release a benchmark suite, along with a newly curated dataset, S2-VL, to systematically evaluate the proposed methods.
Our study is well-aligned with recent explorations of injecting structure into language models, and provides new perspectives on how to incorporate documents' visual structures. The approach shows how explicitly modeling task structure can help achieve "green AI" goals, dramatically reducing computation and energy costs without significant loss in accuracy. While we evaluate on scientific documents, related visual group structures also exist in other kinds of documents; adapting our techniques to domains such as corporate reports, historical archives, or legal documents is an item of future work.

A Model Performance Breakdown
In Tables 7, 8, and 9, we present per-category model accuracies on GROTOAP2, DocBank, and S2-VL for the results reported in the main paper.

B Improvements of the DocBank Dataset
We implement several fixes for the public version of the DocBank dataset to improve its accuracy and create faithful VILA structures.

B.1 Dataset Artifacts
As the DocBank dataset is automatically generated by parsing LaTeX source from arXiv, it inevitably includes noise. Moreover, the authors released only the document screenshots and token information parsed using PDFMiner, rather than the source PDF files, which causes additional issues when using the dataset. We identified several major error categories during the course of our project, detailed as follows:

Incorrect PDF Parsing The PDFMiner software does not work reliably when parsing CID fonts, which are often used for rendering special symbols in PDFs. For example, the software may incorrectly parse 25°C as 25(cid:176)C. Including such (cid: * ) tokens in the input text is undesirable, because they break the natural flow of the text and most pretrained language model tokenizers cannot appropriately encode them.
Erroneous Label Generation Token labels in DocBank are extracted by parsing LaTeX commands. For example, all text in the command \abstract{ * } is labeled "abstract". Though this approach may in theory work well for "standard" documents, we find the resulting label quality is far from ideal when processing real-world documents at scale. One major issue is that it cannot appropriately handle user-created macros, which are often used for compiling complex math equations. This leads to very low label accuracy for the "equation" category in the dataset: in fact, we manually inspected 10 pages and found that 60% of the math equation tokens were wrongly labeled as other classes. The approach also fails to appropriately label some document text that is implicitly generated by LaTeX commands, e.g., the "Figure *" text produced by the \caption command is treated as "paragraph".
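As a toy illustration of this command-based heuristic (the function and regex below are our own simplification, not DocBank's actual pipeline), one might label the tokens inside an \abstract{...} command as follows; a user-defined macro wrapping the command, or nested braces, already defeat such pattern matching:

```python
import re

# Toy DocBank-style label generation: every token inside \abstract{...}
# is labeled "abstract".  A user-defined macro or nested braces break
# this simple pattern, which is the failure mode described above.
def label_abstract(latex_source):
    match = re.search(r"\\abstract\{([^{}]*)\}", latex_source)
    if match is None:
        return []
    return [(token, "abstract") for token in match.group(1).split()]
```

Running this on a document whose abstract is emitted by a custom macro returns nothing, mirroring the mislabeled tokens observed in the released dataset.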

Lack of VILA Structures
As the DocBank dataset-generation method operates solely on the documents' LaTeX sources, it does not include visual layout information. The missing VILA structures lead to low label accuracy for layout-sensitive categories like figures and tables. For example, when a figure contains selectable text (i.e., it is not stored in a format like PNG or JPG, but instead contains text tokens returned by the PDF parser), the method cannot recognize such tokens and thus assigns incorrect labels (other than "figure"). Though the authors tried to create layout group structures by applying a connected component analysis method to PDF tokens, we observed different types of errors in the generated groups, e.g., mis-identified paragraph breaks (combining multiple paragraph blocks into one) or overlapping layout groups (caused by incorrect token labels), and chose not to use them.

B.2 Fixes and Enhancement
Based on the aforementioned issues, we implement the following fixes and enhance the DocBank dataset with VILA structures.
Remove Incorrect PDF Tokens Since there is no simple way to recover the incorrect (cid: * ) tokens generated by PDFMiner, we simply remove them from the input text.
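A minimal sketch of this cleanup step (the function name and pattern are our own; the actual preprocessing code may differ):

```python
import re

# Matches PDFMiner's placeholder for unparseable CID-font glyphs,
# e.g. "(cid:176)".  These tokens are removed rather than recovered.
CID_TOKEN = re.compile(r"\(cid:\d+\)")

def strip_cid_tokens(tokens):
    """Drop any token containing a (cid:*) parsing artifact."""
    return [t for t in tokens if not CID_TOKEN.search(t)]
```

For the example above, the garbled degree symbol is dropped while the surrounding tokens are kept.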
Generate VILA Structures We use pretrained Faster R-CNN models (Ren et al., 2015) from the LayoutParser (Shen et al., 2021) tool to identify both text lines and blocks based on the page images. Specifically, for text blocks, we use the PubLayNet/mask_rcnn_R_50_FPN_3x/ model to detect body content regions (including title, paragraph, figure, table, and list) and the MFD/faster_rcnn_R_50_FPN_3x/ model to detect display math equation regions. We also fine-tune a Faster R-CNN model on the GROTOAP2 dataset (which has text line annotations) and use it to detect text lines. All other regions (or texts not covered by the detected blocks or lines) are created by the connected component analysis method.

Table 9: Prediction F1 breakdown for all models on the S2-VL dataset. Similar to the results in the main paper, we show averaged scores with standard deviation in parentheses across the 5-fold cross-validation subsets.
Correct Label Errors Given the VILA structures, we can easily correct some of the previously mentioned errors, such as incorrect labels for "Figure *", by applying majority voting over token labels within a text block. However, for the "equation" category, given the low accuracy of the original DocBank labels, neither majority voting nor other automatic methods can easily recover the correct token categories. Hence, we discard this category in the modeling phase, i.e., we convert all existing "equation" labels to the background category "paragraph". We iterated on these fixes and enhancements over several rounds, ultimately reducing label errors for figure and table captions by more than 90%. Thanks to the accurate pretrained layout detection models, the generated VILA structures are more than 95% accurate.
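The majority-voting correction can be sketched as follows (a minimal illustration with hypothetical names, assuming each block is given as a list of token indices):

```python
from collections import Counter

def majority_vote_labels(token_labels, blocks):
    """Relabel every token in a block with the block's majority label.
    token_labels: list of category strings; blocks: lists of token indices."""
    fixed = list(token_labels)
    for block in blocks:
        majority, _ = Counter(token_labels[i] for i in block).most_common(1)[0]
        for i in block:
            fixed[i] = majority
    return fixed
```

For instance, a caption block in which most tokens are labeled "caption" but a stray token is labeled "paragraph" comes out uniformly labeled "caption".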

Figure 1 :
Figure 1: (a) Real-world scientific documents often have intricate layout structures, so analyzing only flattened raw text forfeits valuable information, yielding sub-optimal results. (b) The complex structures can be broken down into groups (text blocks or lines) that are composed of tokens with the same semantic category.

Figure 2 :
Figure 2: Comparison of inserting indicator tokens [BLK] based on VILA groups versus sentence boundaries. Indicators representing VILA groups (e.g., text blocks in the left figure) are usually consistent with token category changes (illustrated by the background color in (a)), while sentence boundary indicators fail to provide helpful hints (both false positives and false negatives occur frequently in (b)). Best viewed in color.

Figure 3 :
Figure 3: Illustration of the H-VILA model. Texts from each visual layout group are encoded separately using the group encoder, and the generated representations are subsequently modeled by a page encoder. Semantic categories are predicted at the group level, which significantly improves efficiency.

Figure 4 :
Figure 4: Model predictions for the 10th page of our paper draft. We present the token categories and text block bounding boxes (highlighted in red rectangles) based on (a) the ground-truth annotations and model predictions from both the I-VILA and H-VILA models (the three results happen to be identical) and (b) model predictions from the LayoutLM model. When VILA is injected, the model achieves more consistent predictions for this example, as indicated by arrows (1) and (2) in the figure. Best viewed in color.

Figure 5 :
Figure 5: Illustration of models trained and evaluated with incorrect text block detections (only the top half of the page is shown). The blocks are created by vision predictions, which fail to capture the correct caption text structure (arrow 1). Because the I-VILA model can generate different token predictions within a group, it maintains high accuracy, whereas H-VILA assigns the same category to all tokens in the incorrect block, leading to lower accuracy.

Table 1 :
Datasets in the S2-VLUE benchmark. Using the PAWLS annotation tool (Neumann et al., 2021), annotators draw rectangular text blocks directly on each PDF page and specify the block-level semantic categories from 15 possible candidates. Tokens within a group can therefore inherit the category from the parent text block. Inter-annotator agreement, in terms of token-level accuracy measured on a 12-paper subset, is high at 0.95. The ground-truth VILA labels in S2-VL can be used to fine-tune visual layout detection models, and paper PDFs are also included, making PDF-based structure parsing feasible: this enables VILA annotations to be created by different means, which is helpful for benchmarking new VILA-based models.

Table 3 :
Content extraction performance for H-VILA.The H-VILA models significantly reduce the inference time cost compared to LayoutLM, while achieving comparable accuracy on the three benchmark datasets.

Table 4 :
Content extraction performance (Macro F1 on the GROTOAP2 dataset) for I-VILA using different BERT model variants.I-VILA can be applied to both standard BERT-based models and layout-aware ones, and consistently improves the classification accuracy.

Table 5 :
Comparison between I-VILA models and other layout-aware methods that require expensive pretraining.I-VILA achieves comparable accuracy with less than 5% of the training cost.

Table 7 :
Prediction F1 breakdown for all models on the GROTOAP2 dataset.