Daniel S. Weld
Transactions of the Association for Computational Linguistics (2022) 10: 376–392.
Published: 06 April 2022
Abstract
Accurately extracting structured content from PDFs is a critical first step for NLP over scientific papers. Recent work has improved extraction accuracy by incorporating elementary layout information, for example, each token’s 2D position on the page, into language model pretraining. We introduce new methods that explicitly model VIsual LAyout (VILA) groups, that is, text lines or text blocks, to further improve performance. In our I-VILA approach, we show that simply inserting special tokens denoting layout group boundaries into model inputs can lead to a 1.9% Macro F1 improvement in token classification. In the H-VILA approach, we show that hierarchical encoding of layout groups can result in up to 47% inference time reduction with less than 0.8% Macro F1 loss. Unlike prior layout-aware approaches, our methods do not require expensive additional pretraining, only fine-tuning, which we show can reduce training cost by up to 95%. Experiments are conducted on a newly curated evaluation suite, S2-VLUE, that unifies existing automatically labeled datasets and includes a new dataset of manual annotations covering diverse papers from 19 scientific disciplines. Pre-trained weights, benchmark datasets, and source code are available at https://github.com/allenai/VILA.
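As a rough sketch of the I-VILA input construction described above (not the released VILA code; the pre-grouped tokens and the "[BLK]" boundary token name are assumptions for illustration):

```python
# Illustrative sketch of inserting layout-group boundary tokens (I-VILA style).
# Assumptions: tokens are already grouped into visual layout groups (text lines
# or blocks), and "[BLK]" is a hypothetical special token added to the vocabulary.

def insert_group_boundaries(layout_groups, boundary_token="[BLK]"):
    """Flatten layout groups into one token sequence, placing a boundary
    token between consecutive groups so the model sees group breaks."""
    tokens = []
    for i, group in enumerate(layout_groups):
        if i > 0:
            tokens.append(boundary_token)
        tokens.extend(group)
    return tokens

groups = [
    ["Deep", "Learning", "for", "PDF", "Parsing"],   # e.g., a title block
    ["Jane", "Doe", ",", "John", "Smith"],           # e.g., an author block
]
print(insert_group_boundaries(groups))
# ['Deep', 'Learning', 'for', 'PDF', 'Parsing', '[BLK]', 'Jane', 'Doe', ',', 'John', 'Smith']
```

The flattened sequence would then be fed to a standard token classifier during fine-tuning; no additional pretraining is involved, which is the point of the approach.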
Transactions of the Association for Computational Linguistics (2020) 8: 64–77.
Published: 01 January 2020
Abstract
We present SpanBERT, a pre-training method that is designed to better represent and predict spans of text. Our approach extends BERT by (1) masking contiguous random spans, rather than random tokens, and (2) training the span boundary representations to predict the entire content of the masked span, without relying on the individual token representations within it. SpanBERT consistently outperforms BERT and our better-tuned baselines, with substantial gains on span selection tasks such as question answering and coreference resolution. In particular, with the same training data and model size as BERT-large, our single model obtains 94.6% and 88.7% F1 on SQuAD 1.1 and 2.0, respectively. We also achieve a new state of the art on the OntoNotes coreference resolution task (79.6% F1), strong performance on the TACRED relation extraction benchmark, and even gains on GLUE.
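A minimal, simplified sketch of contiguous span masking in the spirit of the first component above (not the authors' implementation; the 15% budget, clipped geometric span lengths, and the [MASK] symbol are the usual BERT-style conventions assumed here):

```python
import random

def mask_spans(tokens, mask_ratio=0.15, p=0.2, max_span=10, mask_token="[MASK]"):
    """Return a copy of `tokens` with roughly `mask_ratio` of positions masked,
    chosen as contiguous spans rather than independent tokens."""
    out = list(tokens)
    budget = max(1, int(round(mask_ratio * len(out))))
    masked = set()
    while len(masked) < budget:
        # Sample a span length from a geometric distribution clipped at max_span.
        length = 1
        while random.random() > p and length < max_span:
            length += 1
        start = random.randrange(len(out))
        masked.update(range(start, min(start + length, len(out))))
    for i in masked:
        out[i] = mask_token
    return out

random.seed(0)
print(mask_spans("the quick brown fox jumps over the lazy dog today".split()))
```

The second component of the method, training span boundary representations to predict the masked content, is omitted from this sketch.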
Transactions of the Association for Computational Linguistics (2015) 3: 315–328.
Published: 01 June 2015
Abstract
Recent research on entity linking (EL) has introduced a plethora of promising techniques, ranging from deep neural networks to joint inference. But despite numerous papers there is surprisingly little understanding of the state of the art in EL. We attack this confusion by analyzing differences between several versions of the EL problem and presenting a simple yet effective, modular, unsupervised system, called Vinculum, for entity linking. We conduct an extensive evaluation on nine data sets, comparing Vinculum with two state-of-the-art systems, and elucidate key aspects of the system that include mention extraction, candidate generation, entity type prediction, entity coreference, and coherence.
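The abstract frames Vinculum as a modular pipeline; the sketch below only illustrates that decomposition with placeholder stages (every function and heuristic here is hypothetical, not Vinculum's actual logic):

```python
# Hypothetical sketch of a modular, unsupervised entity-linking pipeline with the
# stages named in the abstract; each stage is a trivial placeholder, not Vinculum.

def extract_mentions(text):
    """Placeholder mention extraction: treat capitalized words as mentions."""
    return [w.strip(".,") for w in text.split() if w[:1].isupper()]

def generate_candidates(mention):
    """Placeholder candidate generation: one made-up KB identifier per mention."""
    return [f"KB/{mention}"]

def link(text):
    """Run the stages in order and return a mention -> candidate mapping.
    A full system would re-rank candidates using entity type prediction,
    entity coreference, and document-level coherence."""
    links = {}
    for mention in extract_mentions(text):
        candidates = generate_candidates(mention)
        links[mention] = candidates[0] if candidates else None
    return links

print(link("Barack Obama visited Berlin."))
# {'Barack': 'KB/Barack', 'Obama': 'KB/Obama', 'Berlin': 'KB/Berlin'}
```

The value of the modular layout is that each stage can be swapped or ablated independently, which is how the evaluation isolates the contribution of each component.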
Transactions of the Association for Computational Linguistics (2015) 3: 117–129.
Published: 01 February 2015
Abstract
Most approaches to relation extraction, the task of extracting ground facts from natural language text, are based on machine learning and thus starved by scarce training data. Manual annotation is too expensive to scale to a comprehensive set of relations. Distant supervision, which automatically creates training data, only works with relations that already populate a knowledge base (KB). Unfortunately, KBs such as FreeBase rarely cover event relations (e.g., “person travels to location”). Thus, the problem of extracting a wide range of events, for example from news streams, is an important, open challenge. This paper introduces NewsSpike-RE, a novel, unsupervised algorithm that discovers event relations and then learns to extract them. NewsSpike-RE uses a novel probabilistic graphical model to cluster sentences describing similar events from parallel news streams. These clusters then comprise training data for the extractor. Our evaluation shows that NewsSpike-RE generates high-quality training sentences and learns extractors that perform much better than rival approaches, more than doubling the area under a precision-recall curve compared to Universal Schemas.
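As a loose illustration of the parallel-news intuition in the abstract (sentences from different outlets that mention the same entity pair on the same day tend to describe the same event), and not the paper's probabilistic graphical model, one could group sentences like this:

```python
from collections import defaultdict

# Toy grouping heuristic, not NewsSpike-RE's model: cluster sentences that share
# a (date, entity-pair) key across parallel news streams. All example data are
# made up for illustration.

articles = [
    ("2013-02-05", ("Obama", "Israel"), "Obama will travel to Israel this spring."),
    ("2013-02-05", ("Obama", "Israel"), "President Obama plans a trip to Israel."),
    ("2013-02-05", ("Kerry", "Berlin"), "Kerry arrived in Berlin on Tuesday."),
]

clusters = defaultdict(list)
for date, entity_pair, sentence in articles:
    clusters[(date, entity_pair)].append(sentence)

for key, sentences in clusters.items():
    print(key, "->", sentences)
```

Each resulting cluster would then serve as (noisy) training data for an event-relation extractor, which is the role the paper's clusters play.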