Abstract
When searching for information, a human reader first glances over a document, spots relevant sections, and then focuses on a few sentences to resolve her search intent. However, the high variance of document structure makes it difficult to identify the salient topic of a given section at a glance. To tackle this challenge, we present SECTOR, a model to support machine reading systems by segmenting documents into coherent sections and assigning topic labels to each section. Our deep neural network architecture learns a latent topic embedding over the course of a document. This can be leveraged to classify local topics from plain text and segment a document at topic shifts. In addition, we contribute WikiSection, a publicly available data set with 242k labeled sections in English and German from two distinct domains: diseases and cities. From our extensive evaluation of 20 architectures, we report a best score of 71.6% F1 for the segmentation and classification of 30 topics from the English city domain, achieved by our SECTOR long short-term memory model with Bloom filter embeddings and bidirectional segmentation. This is a significant improvement of 29.5 points F1 over state-of-the-art CNN classifiers with baseline segmentation.
1 Introduction
Today’s systems for natural language understanding are composed of building blocks that extract semantic information from the text, such as named entities, relations, topics, or discourse structure. In traditional natural language processing (NLP), these extractors are typically applied to bags of words or full sentences (Hirschberg and Manning, 2015). Recent neural architectures build upon pre-trained word or sentence embeddings (Mikolov et al., 2013; Le and Mikolov, 2014), which focus on semantic relations that can be learned from large sets of paradigmatic examples, even from long ranges (Dieng et al., 2017).
From a human perspective, however, it is mostly the authors themselves who help best to understand a text. Especially in long documents, an author thoughtfully designs a readable structure and guides the reader through the text by arranging topics into coherent passages (Glavaš et al., 2016). In many cases, this structure is not formally expressed as section headings (e.g., in news articles, reviews, discussion forums) or it is structured according to domain-specific aspects (e.g., health reports, research papers, insurance documents).
Ideally, systems for text analytics, such as topic detection and tracking (TDT) (Allan, 2002), text summarization (Huang et al., 2003), information retrieval (IR) (Dias et al., 2007), or question answering (QA) (Cohen et al., 2018), could access a document representation that is aware of both topical (i.e., latent semantic content) and structural information (i.e., segmentation) in the text (MacAvaney et al., 2018). The challenge in building such a representation is to combine these two dimensions that are strongly interwoven in the author’s mind. It is therefore important to understand topic segmentation and classification as a mutual task that requires encoding both topic information and document structure coherently.
In this paper, we present Sector,1 an end-to-end model that learns an embedding of latent topics from potentially ambiguous headings and can be applied to entire documents to predict local topics on sentence level. Our model encodes topical information on a vertical dimension and structural information on a horizontal dimension. We show that the resulting embedding can be leveraged in a downstream pipeline to segment a document into coherent sections and classify the sections into one of up to 30 topic categories reaching 71.6% F1—or alternatively, attach up to 2.8k topic labels with 71.1% mean average precision (MAP). We further show that segmentation performance of our bidirectional long short-term memory (LSTM) architecture is comparable to specialized state-of-the-art segmentation methods on various real-world data sets.
To the best of our knowledge, the combined task of segmentation and classification has not been approached on the full document level before. There exist a large number of data sets for text segmentation, but most of them do not reflect real-world topic drifts (Choi, 2000; Sehikh et al., 2017), do not include topic labels (Eisenstein and Barzilay, 2008; Jeong and Titov, 2010; Glavaš et al., 2016), or are heavily normalized and too small to be used for training neural networks (Chen et al., 2009). We can utilize a generic segmentation data set derived from Wikipedia that includes headings (Koshorek et al., 2018), but there is also a need in IR and QA for supervised structural topic labels (Agarwal and Yu, 2009; MacAvaney et al., 2018), different languages and more specific domains, such as clinical or biomedical research (Tepper et al., 2012; Tsatsaronis et al., 2012), and news-based TDT (Kumaran and Allan, 2004; Leetaru and Schrodt, 2013).
Therefore we introduce WikiSection,2 a large novel data set of 38k articles from the English and German Wikipedia labeled with 242k sections, original headings, and normalized topic labels for up to 30 topics from two domains: diseases and cities. We chose these subsets to cover both clinical/biomedical aspects (e.g., symptoms, treatments, complications) and news-based topics (e.g., history, politics, economy, climate). Both article types are reasonably well-structured according to Wikipedia guidelines (Piccardi et al., 2018), but we show that they are also complementary: diseases is a typical scientific domain with low entropy (i.e., very narrow topics, precise language, and low word ambiguity). In contrast, cities resembles a diversified domain with high entropy (i.e., broader topics, common language, and higher word ambiguity) and will be more applicable to, for example, news, risk reports, or travel reviews.
We compare Sector to existing segmentation and classification methods based on latent Dirichlet allocation (LDA), paragraph embeddings, convolutional neural networks (CNNs), and recurrent neural networks (RNNs). We show that Sector significantly improves these methods in a combined task by up to 29.5 points F1 when applied to plain text with no given segmentation.
2 Related Work
The analysis of emerging topics over the course of a document is related to a large number of research areas. In particular, topic modeling (Blei et al., 2003) and TDT (Jin et al., 1999) focus on representing and extracting the semantic topical content of text. Text segmentation (Beeferman et al., 1999) is used to split documents into smaller coherent chunks. Finally, text classification (Joachims, 1998) is often applied to detect topics on text chunks. Our method unifies those strongly interwoven tasks and is the first to evaluate the combined topic segmentation and classification task using a corresponding data set with long structured documents.
Topic modeling is commonly applied to entire documents using probabilistic models, such as LDA (Blei et al., 2003). AlSumait et al. (2008) introduced an online topic model that captures emerging topics when new documents appear. Gabrilovich and Markovitch (2007) proposed the Explicit Semantic Analysis method in which concepts from Wikipedia articles are indexed and assigned to documents. Later, and to overcome the vocabulary mismatch problem, Cimiano et al. (2009) introduced a method for assigning latent concepts to documents. More recently, Liu et al. (2016) represented documents with vectors of closely related domain keyphrases. Yeh et al. (2016) proposed a conceptual dynamic LDA model for tracking topics in conversations. Bhatia et al. (2016) utilized Wikipedia document titles to learn neural topic embeddings and assign document labels. Dieng et al. (2017) focused on the issue of long-range dependencies and proposed a latent topic model based on RNNs. However, the authors did not apply the RNN to predict local topics.
Text segmentation has been approached with a wide variety of methods. Early unsupervised methods utilized lexical overlap statistics (Hearst, 1997; Choi, 2000), dynamic programming (Utiyama and Isahara, 2001), Bayesian models (Eisenstein and Barzilay, 2008), or pointwise boundary sampling (Du et al., 2013) on raw terms.
Later, supervised methods included topic models (Riedl and Biemann, 2012) by calculating a coherence score using dense topic vectors obtained by LDA. Bayomi et al. (2015) exploited ontologies to measure semantic similarity between text blocks. Alemi and Ginsparg (2015) and Naili et al. (2017) studied how word embeddings can improve classical segmentation approaches. Glavaš et al. (2016) utilized semantic relatedness of word embeddings by identifying cliques in a graph.
More recently, Sehikh et al. (2017) utilized LSTM networks and showed that cohesion between bidirectional layers can be leveraged to predict topic changes. In contrast to our method, the authors focused on segmenting speech recognition transcripts on word level without explicit topic labels. The network was trained with supervised pairs of contrary examples and was mainly evaluated on artificially segmented documents. Our approach extends this idea so it can be applied to dense topic embeddings which are learned from raw section headings.
Wang et al. (2017) tackled segmentation by training a CNN to learn coherence scores for text pairs. Similar to Sehikh et al. (2017), the network was trained with short contrary examples and no topic objective. The authors showed that their pointwise ranking model performs well on data sets by Jeong and Titov (2010). In contrast to our method, the ranking algorithm strictly requires a given ground truth number of segments for each document and no topic labels are predicted.
Koshorek et al. (2018) presented a large new data set for text segmentation based on Wikipedia that includes section headings. The authors introduced a neural architecture for segmentation that is based on sentence embeddings and four layers of bidirectional LSTM. Similar to Sehikh et al. (2017), the authors used a binary segmentation objective on the sentence level, but trained on entire documents. Our work takes up this idea of end-to-end training and enriches the neural model with a layer of latent topic embeddings that can be utilized for topic classification.
Text classification is mostly applied at the paragraph or sentence level using machine learning methods such as support vector machines (Joachims, 1998) or, more recently, shallow and deep neural networks (Le et al., 2018; Conneau et al., 2017). Notably, paragraph vectors (Le and Mikolov, 2014) is an extension of word2vec for learning fixed-length distributed representations from texts of arbitrary length. The resulting model can be utilized for classification by providing paragraph labels during training. Furthermore, Kim (2014) has shown that CNNs combined with pre-trained task-specific word embeddings achieve the highest scores for various text classification tasks.
Combined approaches to topic segmentation and classification are rare. Agarwal and Yu (2009) classified sections of BioMed Central articles into four structural classes (introduction, methods, results, and discussion). However, their manually labeled data set only contains a sample of sentences from the documents, so they evaluated sentence classification as an isolated task. Chen et al. (2009) introduced two Wikipedia-based data sets for segmentation, one about large cities and one about chemical elements. Although these data sets have been used to evaluate word-level and sentence-level segmentation (Koshorek et al., 2018), we are not aware of any topic classification approach on these data sets.
Tepper et al. (2012) approached segmentation and classification in a clinical domain as a supervised sequence labeling problem. The documents were segmented using a maximum entropy model and then classified into 11 or 33 categories. A similar approach by Ajjour et al. (2017) used sequence labeling with a small number of 3–6 classes. Their model is extractive, so it does not produce a continuous segmentation over the entire document. Finally, Piccardi et al. (2018) did not approach segmentation, but recommended an ordered set of section labels based on Wikipedia articles.
Finally, we were inspired by passage retrieval (Liu and Croft, 2002) as an important downstream task for topic segmentation and classification. For example, Hewlett et al. (2016) proposed WikiReading, a QA task to retrieve values from sections of long documents. The objective of TREC Complex Answer Retrieval is to retrieve a ranking of relevant passages for a given outline of hierarchical sections (Nanni et al., 2017). Both tasks highly depend on a building block for local topic embeddings such as our proposed model.
3 Task Overview and Data Set
We start with a definition of the WikiSection machine reading task shown in Figure 1. We take a document D = 〈S, T〉 consisting of N consecutive sentences S = [s1, …, sN] and empty segmentation T = ∅ as input. In our example, this is the plain text of a Wikipedia article (e.g., about Trichomoniasis) without any section information. For each sentence sk, we assume a distribution of local topics ek that gradually changes over the course of the document.
The task is to split D into a sequence of distinct topic sections T = [T1, …, TM], so that each predicted section Tj = 〈Sj, yj〉 contains a sequence of coherent sentences Sj ⊆ S and a topic label yj that describes the common topic in these sentences. For the document Trichomoniasis, the sequence of topic labels is y1…M = [ symptom, cause, diagnosis, prevention, treatment, complication, epidemiology ].
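To make the task input and output concrete, the following minimal Python sketch models the data structures defined above; the class and field names are illustrative and not part of the data set format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Section:            # T_j = <S_j, y_j>
    sentences: List[str]  # S_j: a sequence of coherent, consecutive sentences
    topic: str            # y_j: the topic label for these sentences

# Input: plain sentences s_1..s_N with empty segmentation T = [].
# Expected output for the Trichomoniasis example is M = 7 sections
# whose labels follow the sequence:
topic_sequence = ["symptom", "cause", "diagnosis", "prevention",
                  "treatment", "complication", "epidemiology"]
```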
3.1 WikiSection Data Set
For the evaluation of this task, we created WikiSection, a novel data set containing a gold standard of 38k full-text documents from English and German Wikipedia comprehensively annotated with sections and topic labels (see Table 1).
Table 1:

| Data set | disease (en) | disease (de) | city (en) | city (de) |
|---|---|---|---|---|
| total docs | 3.6k | 2.3k | 19.5k | 12.5k |
| avg sents per doc | 58.5 | 45.7 | 56.5 | 39.9 |
| avg sects per doc | 7.5 | 7.2 | 8.3 | 7.6 |
| headings | 8.5k | 6.1k | 23.0k | 12.2k |
| topics | 27 | 25 | 30 | 27 |
| coverage | 94.6% | 89.5% | 96.6% | 96.1% |
The documents originate from recent dumps in English and German. We filtered the collection using SPARQL queries against Wikidata (Tanon et al., 2016). We retrieved instances of Wikidata categories disease (Q12136) and their subcategories (e.g., Trichomoniasis or Pertussis) or city (Q515) (e.g., London or Madrid).
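As an illustration, a query along the following lines retrieves English articles about diseases. This is a minimal sketch assuming the public Wikidata Query Service endpoint, not the authors' exact query.

```python
import requests

# Hypothetical retrieval sketch: all English Wikipedia articles whose
# Wikidata item is an instance of disease (Q12136) or one of its subclasses.
QUERY = """
SELECT ?article WHERE {
  ?item wdt:P31/wdt:P279* wd:Q12136 .
  ?article schema:about ?item ;
           schema:isPartOf <https://en.wikipedia.org/> .
}
"""

resp = requests.get("https://query.wikidata.org/sparql",
                    params={"query": QUERY, "format": "json"})
articles = [b["article"]["value"]
            for b in resp.json()["results"]["bindings"]]
```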
Our data set contains the article abstracts, plain text of the body, positions of all sections given by the Wikipedia editors with their original headings (e.g., "Causes | Genetic sequence") and a normalized topic label (e.g., disease.cause). We randomized the order of documents and split them into 70% training, 10% validation, and 20% test sets.
3.2 Preprocessing
To obtain plain document text, we used Wikiextractor, split the abstract sections, and stripped all section headings and other structure tags except newline characters and lists.
Vocabulary Mismatch in Section Headings.
Table 2 shows examples of section headings from disease articles separated into head (most common), torso (frequently used), and tail (rare). Initially, we expected articles to share congruent structure in naming and order. Instead, we observe a high variance with 8.5k distinct headings in the diseases domain and over 23k for English cities. A closer inspection reveals that Wikipedia authors utilize headings at different granularity levels, frequently copy and paste from other articles, but also introduce synonyms or hyponyms, which leads to a vocabulary mismatch problem (Furnas et al., 1987). As a result, the distribution of headings is heavy-tailed across all articles. Roughly 1% of headings appear more than 25 times, whereas the vast majority (88%) appear only once or twice.
3.3 Synset Clustering
In order to use Wikipedia headlines as a source for topic labels, we contribute a normalization method to reduce the high variance of headings to a few representative labels based on the clustering of BabelNet synsets (Navigli and Ponzetto, 2012).
We create a set ℋ that contains all headings in the data set and use the BabelNet API to match3 each heading h ∈ ℋ to its corresponding synsets Sh ⊂ S. For example, "Cognitive behavioral therapy" is assigned to synset bn:03387773n. Next, we insert all matched synsets into an undirected graph G with nodes s ∈ S and edges e. We create an edge between two synsets whenever they are matched by a common heading lemma h′ ∈ ℋ. Finally, we apply a community detection algorithm (Newman, 2006) on G to find dense clusters of synsets. We use these clusters as normalized topics and assign the synset with the most outgoing edges as representative label; in our example, therapy.
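The following sketch illustrates this clustering step. It uses networkx's greedy modularity communities as a stand-in for the Newman (2006) algorithm, and the heading-to-synset mapping is a toy example in place of the BabelNet API call.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy stand-in for the BabelNet matching step: heading -> synset ids.
heading_synsets = {
    "Cognitive behavioral therapy": {"bn:03387773n"},
    "Therapy": {"bn:03387773n", "bn:00026546n"},
    "Treatment": {"bn:00026546n"},
}

G = nx.Graph()
for synsets in heading_synsets.values():
    G.add_nodes_from(synsets)
    # connect synsets that are matched by a common heading lemma
    G.add_edges_from((s, t) for s in synsets for t in synsets if s < t)

for cluster in greedy_modularity_communities(G):
    # representative label: the synset with the most edges in G
    representative = max(cluster, key=G.degree)
    print(representative, sorted(cluster))
```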
From this normalization step we obtain 598 synsets, which we prune using the head/tail division rule count(s) < (1/|S|) · Σ_{sᵢ∈S} count(sᵢ) (Jiang, 2012), i.e., we discard all synsets with below-average frequency. This method covers 94.6% of all headings and yields 26 normalized labels and one other class in the English disease data set. Table 1 shows the corresponding numbers for the other data sets. We verify our normalization process by manual inspection of 400 randomly chosen heading–label assignments by two independent judges and report an accuracy of 97.2% with an average observed inter-annotator agreement of 96.0%.
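A minimal sketch of the pruning rule as reconstructed above: synsets whose frequency falls below the mean are discarded.

```python
import numpy as np

def head_tail_prune(counts):
    """Head/tail division (Jiang, 2012), as reconstructed above: keep only
    synsets whose count lies at or above the mean of all synset counts."""
    mean = np.mean(list(counts.values()))
    return {s: c for s, c in counts.items() if c >= mean}

print(head_tail_prune({"therapy": 120, "symptom": 95, "folklore": 2}))
# {'therapy': 120, 'symptom': 95}
```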
4 SECTOR Model
Our Sector architecture consists of four stages, shown in Figure 2: sentence encoding, topic embedding, topic classification, and topic segmentation. We now discuss each stage in more detail.
4.1 Sentence Encoding
The first stage of our Sector model transforms each sentence sk from plain text into a fixed-size sentence vector xk that serves as input into the neural network layers. Following Hill et al. (2016), word order is not critical for document-centric evaluation settings such as our WikiSection task. Therefore, we mainly focus on unsupervised compositional sentence representations.
Bag-of-Words Encoding.
Bloom Filter Embedding.
We set parameters to m = 4096 and k = 5 to achieve a compression factor of 0.2, which showed good performance in the original paper.
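A minimal sketch of such an encoding, assuming a straightforward hash-based construction (the exact hash functions are an implementation detail not specified here): each word sets up to k = 5 of m = 4096 bits, and a sentence is the union of its word filters.

```python
import hashlib
import numpy as np

def bloom_sentence_vector(tokens, m=4096, k=5):
    """Encode a sentence as an m-bit Bloom filter: every token is hashed
    with k seeded hash functions, and the resulting bits are OR-ed."""
    v = np.zeros(m, dtype=np.float32)
    for tok in tokens:
        for seed in range(k):
            digest = hashlib.md5(f"{seed}:{tok}".encode()).hexdigest()
            v[int(digest, 16) % m] = 1.0
    return v

x = bloom_sentence_vector("fever and chills are common symptoms".split())
print(x.sum())  # at most 5 bits per distinct token
```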
Sentence Embeddings.
4.2 Topic Embedding
Now, a simple concatenation ek = fk ⊕ bk of the forward (fk) and backward (bk) LSTM layer outputs can be used as topic vector by downstream applications.
4.3 Topic Classification
Ranking Loss for Multi-Label Optimization.
4.4 Topic Segmentation
In the final stage, we leverage the information encoded in the topic embedding and output layers to segment the document and classify each section.
Baseline Segmentation Methods.
As a simple baseline method, we use prior information from the text and split sections at newline characters (NL). Additionally, we merge two adjacent sections if they are assigned the same topic label after classification. If there is no newline information available in the text, we use a maximum label (max) approach: We first split sections at every sentence break (i.e., Sj = sk; j = k = 1, …, N) and then merge all sections that share at least one label in the top-2 predictions.
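A minimal sketch of one plausible reading of the max strategy, assuming per-sentence top-2 label sets as input:

```python
def max_label_segmentation(top2):
    """top2[k]: set of the two highest-ranked topic labels for sentence k.
    Start with one section per sentence and merge a sentence into the
    previous section whenever they share at least one top-2 label."""
    sections = [[0]]
    for k in range(1, len(top2)):
        if top2[k] & top2[k - 1]:
            sections[-1].append(k)   # merge with previous section
        else:
            sections.append([k])     # start a new section
    return sections

print(max_label_segmentation([{"symptom", "cause"},
                              {"cause", "genetics"},
                              {"treatment", "prevention"}]))
# [[0, 1], [2]]
```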
Using Deviation of Topic Embeddings for Segmentation.
Finally, we use the resulting sequence d1…N, computed with parameters D = 16 and σ = 2.5, to locate the spots of fastest movement (see Figure 4), i.e., all k where dk−1 < dk > dk+1, k = 1…N, in our discrete case. We use these positions to start a new section.
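A minimal sketch of this edge detection, assuming Gaussian smoothing (σ = 2.5) along the sentence axis and the Euclidean norm of consecutive differences as deviation measure; the paper's exact treatment of the parameter D = 16 is not reproduced here.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def deviation_boundaries(E, sigma=2.5):
    """E: (N, dim) array of sentence-level topic embeddings e_1..e_N.
    Smooth each dimension over the document, measure the movement d_k
    between consecutive sentences, and start a new section at every
    local maximum (d_{k-1} < d_k > d_{k+1})."""
    E_smooth = gaussian_filter1d(E, sigma=sigma, axis=0)
    d = np.linalg.norm(np.diff(E_smooth, axis=0), axis=1)
    return [k + 1 for k in range(1, len(d) - 1)
            if d[k - 1] < d[k] > d[k + 1]]

E = np.random.randn(60, 128)    # toy document with 60 sentences
print(deviation_boundaries(E))  # indices where a new section starts
```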
Improving Edge Detection with Bidirectional Layers.
Finally, we show in the evaluation that our Sector model, which was optimized for sentence-level predictions ŷk, can be applied to the WikiSection task to predict coherently labeled sections Tj = 〈Sj, ŷj〉.
5 Evaluation
We conduct three experiments to evaluate the segmentation and classification task introduced in Section 3. The WikiSection-topics experiment constitutes segmentation and classification of each section with a single topic label out of a small number of clean labels (25–30 topics). The WikiSection-headings experiment extends this to multi-label classification per section with a larger target vocabulary (1.0k–2.8k words). This is important because often there are no clean topic labels available for training or evaluation. Finally, we conduct a third experiment to see how Sector performs across existing segmentation data sets.
Evaluation Data Sets.
For the first two experiments we use the WikiSection data sets introduced in Section 3.1, which contain documents about diseases and cities in both English and German. The subsections are retained with full granularity. For the third experiment, text segmentation results are often reported on artificial data sets (Choi, 2000). It was shown that this scenario is hardly applicable to topic-based segmentation (Koshorek et al., 2018), so we restrict our evaluation to real-world data sets that are publicly available. The Wiki-727k data set by Koshorek et al. (2018) contains Wikipedia articles with a broad range of topics and their top-level sections. However, it is too large to compare exhaustively, so we use the smaller Wiki-50 subset. We further use the Cities and Elements data sets introduced by Chen et al. (2009), which also provide headings. These sets are typically used for word-level segmentation, so they do not contain any punctuation and are lowercased. Finally, we use the Clinical Textbook chapters introduced by Eisenstein and Barzilay (2008), which do not supply headings.
Text Segmentation Models.
We compare Sector to common text segmentation methods as baselines, C99 (Choi, 2000) and TopicTiling (Riedl and Biemann, 2012), as well as the state-of-the-art TextSeg segmenter (Koshorek et al., 2018). In the third experiment we report numbers for BayesSeg (Eisenstein and Barzilay, 2008) (configured to predict with unknown number of segments) and GraphSeg (Glavaš et al., 2016).
Classification Models.
We compare Sector to existing models for single and multi-label sentence classification. Because we are not aware of any existing method for combined segmentation and classification, we first compare all methods using given prior segmentation from newlines in the text (NL) and then additionally apply our own segmentation strategies for plain text input: maximum label (max), embedding deviation (emd), and bidirectional embedding deviation (bemd).
For the experiments, we train a Paragraph Vectors (PV) model (Le and Mikolov, 2014) using all sections of the training sets. We utilize this model for single-label topic classification (depicted as PV>T) by assigning the given topic labels as paragraph IDs. Multi-label classification is not possible with this model. We use the paragraph embedding for our own segmentation strategies. We set the layer size to 256, window size to 7, and trained for 10 epochs using a batch size of 512 sentences and a learning rate of 0.025. We further use an implementation of CNN (Kim, 2014) with our pre-trained word vectors as input for single-label topics (CNN>T) and multi-label headings (CNN>H). We configured the models using the hyperparameters given in the paper and trained the model using a batch size of 256 sentences for 20 epochs with learning rate 0.01.
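For orientation, the PV>T setup can be approximated with gensim's Doc2Vec. This is a hedged stand-in for the authors' implementation (gensim exposes no batch-size parameter, and the corpus below is a toy example).

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: each training section is tagged with its topic label, so the
# model learns one paragraph vector per topic (PV>T).
sections = [(["fever", "and", "chills", "are", "common"], "disease.symptom"),
            (["the", "city", "was", "founded", "in", "1200"], "city.history")]
corpus = [TaggedDocument(words=w, tags=[t]) for w, t in sections]

model = Doc2Vec(vector_size=256, window=7, epochs=10,
                alpha=0.025, min_count=1)
model.build_vocab(corpus)
model.train(corpus, total_examples=len(corpus), epochs=model.epochs)

# Classify a new section by its nearest topic tag in the embedding space.
vec = model.infer_vector(["patients", "often", "report", "fever"])
print(model.dv.most_similar([vec], topn=1))
```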
Sector Configurations.
We evaluate the various configurations of our model discussed in prior sections. SEC>T depicts the single-label topic classification model which uses a softmax activation output layer, SEC>H is the multi-label variant with a larger output and sigmoid activations. Other options are: bag-of-words sentence encoding (+bow), Bloom filter encoding (+bloom) and sentence embeddings (+emb); multi-class cross-entropy loss (as default) and ranking loss (+rank).
We chose network hyperparameters using grid search on the en_disease validation set and keep them fixed over all evaluation runs. For all configurations, we set the LSTM layer size to 256 and the topic embedding dimensionality to 128. Models are trained on the complete train splits with a batch size of 16 documents (reduced to 8 for bag-of-words), 0.01 learning rate, 0.5 dropout, and ADAM optimization. We used early stopping after 10 epochs without MAP improvement on the validation data sets. We pre-trained word embeddings with 256 dimensions for the specific tasks using word2vec on lowercase English and German Wikipedia documents using a window size of 7. All tests are implemented in Deeplearning4j and run on a Tesla P100 GPU with 16 GB memory. Training a SEC+bloom model on en_city takes roughly 5 hours; inference on CPU takes on average 0.36 seconds per document. In addition, we trained a SEC>H@fullwiki model with raw headings from a complete English Wikipedia dump,4 and use this model for cross-data set evaluation.
Quality Measures.
We measure text segmentation at sentence level using the probabilistic Pk error score (Beeferman et al., 1999), which calculates the probability of a false boundary in a window of size k; lower numbers mean better segmentation. As relevant section boundaries we consider all section breaks where the topic label changes. We set k to half of the average segment length. We measure classification performance on section level by comparing the topic labels of all ground truth sections with predicted sections. We select the pairs by matching their positions using maximum boundary overlap. We report micro-averaged F1 score for single-label classification or Precision@1 (P@1) for multi-label classification. Additionally, we measure mean average precision (MAP), which evaluates the average fraction of true labels ranked above a particular label (Tsoumakas et al., 2009).
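For reference, a compact sketch of the Pk computation as described above (boundaries are given as the indices of sentences that start a new segment):

```python
def pk_score(ref_bounds, hyp_bounds, n_sents, k):
    """Pk (Beeferman et al., 1999): slide a window of size k over the
    document and count how often reference and hypothesis disagree on
    whether the two window endpoints lie in the same segment."""
    def same_segment(bounds, i, j):
        return not any(i < b <= j for b in bounds)

    disagreements = sum(
        same_segment(ref_bounds, i, i + k) != same_segment(hyp_bounds, i, i + k)
        for i in range(n_sents - k)
    )
    return disagreements / (n_sents - k)

# Example: reference segments [0..4][5..9], hypothesis [0..6][7..9], k = 3
print(round(pk_score({5}, {7}, 10, 3), 3))  # 0.571
```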
5.1 Results
WikiSection-topics, single-label classification (en_disease: 27 topics, de_disease: 25 topics, en_city: 30 topics, de_city: 27 topics); per data set: Pk, F1, MAP.

| model configuration | segm. | en_disease Pk | F1 | MAP | de_disease Pk | F1 | MAP | en_city Pk | F1 | MAP | de_city Pk | F1 | MAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Classification with newline prior segmentation | | | | | | | | | | | | | |
| PV>T* | NL | 35.6 | 31.7 | 47.2 | 36.0 | 29.6 | 44.5 | 22.5 | 52.9 | 63.9 | 27.2 | 42.9 | 55.5 |
| CNN>T* | NL | 31.5 | 40.4 | 55.6 | 31.6 | 38.1 | 53.7 | 13.2 | 66.3 | 76.1 | 13.7 | 63.4 | 75.0 |
| SEC>T+bow | NL | 25.8 | 54.7 | 68.4 | 25.0 | 52.7 | 66.9 | 21.0 | 43.7 | 55.3 | 20.2 | 40.5 | 52.2 |
| SEC>T+bloom | NL | 22.7 | 59.3 | 71.9 | 27.9 | 50.2 | 65.5 | 9.8 | 74.9 | 82.6 | 11.7 | 73.1 | 81.5 |
| SEC>T+emb* | NL | 22.5 | 58.7 | 71.4 | 23.6 | 50.9 | 66.8 | 10.7 | 74.1 | 82.2 | 10.7 | 74.0 | 83.0 |
| Classification and segmentation on plain text | | | | | | | | | | | | | |
| C99 | – | 37.4 | n/a | n/a | 42.7 | n/a | n/a | 36.8 | n/a | n/a | 38.3 | n/a | n/a |
| TopicTiling | – | 43.4 | n/a | n/a | 45.4 | n/a | n/a | 30.5 | n/a | n/a | 41.3 | n/a | n/a |
| TextSeg | – | 24.3 | n/a | n/a | 35.7 | n/a | n/a | 19.3 | n/a | n/a | 27.5 | n/a | n/a |
| PV>T* | max | 43.6 | 20.4 | 36.5 | 44.3 | 19.3 | 34.6 | 31.1 | 28.1 | 43.1 | 36.4 | 20.2 | 35.5 |
| PV>T* | emd | 39.2 | 32.9 | 49.3 | 37.4 | 32.9 | 48.7 | 24.9 | 53.1 | 65.1 | 32.9 | 40.6 | 55.0 |
| CNN>T* | max | 40.1 | 26.9 | 45.0 | 40.7 | 25.2 | 43.8 | 21.9 | 42.1 | 58.7 | 21.4 | 42.1 | 59.5 |
| SEC>T+bow | max | 30.1 | 40.9 | 58.5 | 32.1 | 38.9 | 56.8 | 24.5 | 28.4 | 43.5 | 28.0 | 26.8 | 42.6 |
| SEC>T+bloom | max | 27.9 | 49.6 | 64.7 | 35.3 | 39.5 | 57.3 | 12.7 | 63.3 | 74.3 | 26.2 | 58.9 | 71.6 |
| SEC>T+bloom | emd | 29.7 | 52.8 | 67.5 | 35.3 | 44.8 | 61.6 | 16.4 | 65.8 | 77.3 | 26.0 | 65.5 | 76.7 |
| SEC>T+bloom | bemd | 26.8 | 56.6 | 70.1 | 31.7 | 47.8 | 63.7 | 14.4 | 71.6 | 80.9 | 16.8 | 70.8 | 80.1 |
| SEC>T+bloom+rank* | bemd | 26.8 | 56.7 | 68.8 | 33.1 | 44.0 | 58.5 | 15.7 | 71.1 | 79.1 | 18.0 | 66.8 | 76.1 |
| SEC>T+emb* | bemd | 26.3 | 55.8 | 69.4 | 27.5 | 48.9 | 65.1 | 15.5 | 71.6 | 81.0 | 16.2 | 71.0 | 81.1 |
WikiSection-headings, multi-label classification (en_disease: 1.5k topics, de_disease: 1.0k topics, en_city: 2.8k topics, de_city: 1.1k topics); per data set: Pk, P@1, MAP.

| model configuration | segm. | en_disease Pk | P@1 | MAP | de_disease Pk | P@1 | MAP | en_city Pk | P@1 | MAP | de_city Pk | P@1 | MAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CNN>H* | max | 40.9 | 36.7 | 31.5 | 41.3 | 14.1 | 21.1 | 36.9 | 43.3 | 46.7 | 42.2 | 40.9 | 46.5 |
| SEC>H+bloom | bemd | 35.4 | 35.8 | 38.2 | 36.9 | 31.7 | 37.8 | 20.0 | 65.2 | 62.0 | 23.4 | 49.8 | 53.4 |
| SEC>H+bloom+rank | bemd | 40.2 | 47.8 | 49.0 | 42.8 | 28.4 | 33.2 | 41.9 | 66.8 | 59.0 | 34.9 | 59.6 | 54.6 |
| SEC>H+emb* | bemd | 30.7 | 50.5 | 57.3 | 32.9 | 26.6 | 36.7 | 17.9 | 72.3 | 71.1 | 19.3 | 68.4 | 70.2 |
| SEC>H+emb+rank* | bemd | 30.5 | 47.6 | 48.9 | 42.9 | 32.0 | 36.4 | 16.1 | 65.8 | 59.0 | 18.3 | 69.2 | 58.9 |
| SEC>H+emb@fullwiki* | bemd | 42.4 | 9.7 | 17.9 | 42.7 | (0.0) | (0.0) | 20.3 | 59.4 | 50.4 | 38.5 | (0.0) | (0.1) |
Cross-data set segmentation and multi-label classification; per data set: Pk (and MAP where applicable).

| Segmentation and multi-label classification | Wiki-50 Pk | MAP | Cities Pk | MAP | Elements Pk | MAP | Clinical Pk |
|---|---|---|---|---|---|---|---|
| GraphSeg | 63.6 | n/a | 40.0 | n/a | 49.1 | n/a | – |
| BayesSeg | 49.2 | n/a | 36.2 | n/a | 35.6 | n/a | 57.8 |
| TextSeg | 18.2* | n/a | 19.7* | n/a | 41.6 | n/a | 30.8 |
| SEC>H+emb@en_disease | – | – | – | – | 43.3 | 9.5 | 36.5 |
| SEC>C+emb@en_disease | – | – | – | – | 45.1 | n/a | 35.6 |
| SEC>H+emb@en_city | 30.0 | 31.4 | 28.2 | 56.5 | 41.0 | 7.9 | – |
| SEC>C+emb@en_city | 31.3 | n/a | 22.9 | n/a | 48.8 | n/a | – |
| SEC>H+emb@cities | 33.3 | 15.3 | 21.4* | 52.3* | 39.2 | 12.1 | 37.7 |
| SEC>H+emb@fullwiki | 28.6* | 32.6* | 33.4 | 40.5 | 42.8 | 14.4 | 36.9 |
Sector Outperforms Existing Classifiers.
With our given segmentation baseline (NL), the best sentence classification model CNN achieves 52.1% F1 averaged over all data sets. Sector improves this score significantly by 12.4 points. Furthermore, in the setting with plain text input, Sector improves the CNN score by 18.8 points using identical baseline segmentation. Our model finally reaches an average of 61.8% F1 on the classification task using sentence embeddings and bidirectional segmentation. This is a total improvement of 27.8 points over the CNN model.
Topic Embeddings Improve Segmentation.
Sector outperforms C99 and TopicTiling significantly by 16.4 and 18.8 points Pk, respectively, on average. Compared to the maximum label baseline, our model gains 3.1 points by using the bidirectional embedding deviation and 1.0 points using sentence embeddings. Overall, Sector misses only 4.2 points Pk and 2.6 points F1 compared with the experiments with prior newline segmentation. The third experiment reveals that our segmentation method in isolation almost reaches state-of-the-art on existing data sets and beats the unsupervised baselines, but lacks performance on cross-data set evaluation.
Bloom Filters on Par with Word Embeddings.
Bloom filter encoding achieves high scores across all data sets and outperforms our bag-of-words baseline, possibly because of larger training batch sizes and reduced model parameters. Surprisingly, word embeddings did not improve the model significantly: on average, German models gained 0.7 points F1 and English models declined by 0.4 points compared with Bloom filters. However, model training and inference using pre-trained embeddings is faster by an average factor of 3.2.
Topic Embeddings Perform Well on Noisy Data.
In the multi-label setting with unprocessed Wikipedia headings, classification precision of Sector reaches up to 72.3% P@1 for 2.8k labels. This score is on average 9.5 points lower than for the models trained on the small number of 25–30 normalized labels. Furthermore, segmentation performance only misses 3.8 points Pk compared with the topics task. Ranking loss could not improve our models significantly, but achieved better segmentation scores on the headings task. Finally, the cross-domain English fullwiki model performs only on baseline level for segmentation, but still achieves better classification performance than CNN on the English cities data set.
5.2 Discussion and Model Insights
Figure 5 shows classification and segmentation of our Sector model compared to the PV baseline.
Sector Captures Latent Topics from Context.
We clearly see from NL predictions (left side of Figure 5) that Sector produces coherent results with sentence granularity, with topics emerging and disappearing over the course of a document. In contrast, PV predictions are scattered across the document. Both models successfully classify first (symptoms) and last sections (epidemiology). However, only Sector can capture diagnosis, prevention, and treatment. Furthermore, we observe additional screening predictions in the center of the document. This section is actually labeled "Prevention | Screening" in the source document, which explains this overlap.
Furthermore, we observe low confidence in the second section, labeled cause. Our multi-label model predicts {diagnosis, cause, genetics} for this section. The ground truth heading for this section is "Causes | Genetic sequence," but even for a human reader this assignment is not clear. This shows that the multi-label approach fills an important gap and can even serve as an indicator for low-quality article structure.
Finally, both models fail to segment the complication section near the end because it consists of an enumeration. The embedding deviation segmentation strategy (right side of Figure 5) completely solves this issue for both models. Our Sector model gives nearly perfect segmentation using the bidirectional strategy; it only misses the discussed part of cause and is off by one sentence at the start of prevention. Furthermore, averaging over sentence-level predictions reveals clearly distinguishable section class labels.
6 Conclusions and Future Work
We presented Sector, a novel model for coherent text segmentation and classification based on latent topics. We further contributed WikiSection, a collection of four large data sets in English and German for this task. Our end-to-end method builds on a neural topic embedding which is trained using Wikipedia headings to optimize a bidirectional LSTM classifier. We showed that our best performing model is based on sparse word features with Bloom filter encoding and significantly improves classification precision for 25–30 topics on comprehensive documents by up to 29.5 points F1 compared with state-of-the-art sentence classifiers with baseline segmentation. We used the bidirectional deviation in our topic embedding to segment a document into coherent sections without additional training. Finally, our experiments showed that extending the task to multi-label classification of 2.8k ambiguous topic words still produces coherent results with 71.1% average precision.
We see an exciting future application of Sector as a building block to extract and retrieve topical passages from unlabeled corpora, such as medical research articles or technical papers. One possible task is WikiPassageQA (Cohen et al., 2018), a benchmark to retrieve passages as answers to non-factoid questions from long articles.
Acknowledgments
We would like to thank the editors and anonymous reviewers for their helpful suggestions and comments. Our work is funded by the German Federal Ministry of Economic Affairs and Energy (BMWi) under grant agreement 01MD16011E (Medical Allround-Care Service Solutions) and H2020 ICT-2016-1 grant agreement 732328 (FashionBrain).
Notes

1. Our source code is available under the Apache License 2.0 at https://github.com/sebastianarnold/SECTOR.

2. The data set is available under the CC BY-SA 3.0 license at https://github.com/sebastianarnold/WikiSection.

3. We match lemmas of main senses and compounds to synsets of type NOUN CONCEPT.

4. Excluding all documents contained in the test sets.