Weakly Supervised Domain Detection

In this paper we introduce domain detection as a new natural language processing task. We argue that the ability to detect textual segments which are domain-heavy, i.e., sentences or phrases which are representative of and provide evidence for a given domain, could enhance the robustness and portability of various text classification applications. We propose an encoder-detector framework for domain detection and bootstrap classifiers with multiple instance learning (MIL). The model is hierarchically organized and suited to multilabel classification. We demonstrate that despite learning with minimal supervision, our model can be applied to text spans of different granularities, languages, and genres. We also showcase the potential of domain detection for text summarization.


Introduction
Text classification is a fundamental task in Natural Language Processing which has been found useful in a wide spectrum of applications, including search engines that enable users to identify content on websites, sentiment and social media analysis, customer relationship management systems, and spam detection. Over the past several years, text classification has been predominantly modeled as a supervised learning problem (e.g., Kim 2014; McCallum and Nigam 1998; Iyyer et al. 2015) for which appropriately labeled data must be collected. Such data is often domain-dependent (i.e., covering specific topics such as those relating to "Business" or "Medicine") and a classifier trained using data from one domain is likely to perform poorly on another. For example, the phrase "the mouse died quickly" may indicate negative sentiment in a customer review describing the hand-held pointing device or positive sentiment when describing a laboratory experiment performed on a rodent. The ability to handle a wide variety of domains 1 has become more pertinent with the rise of data-hungry machine learning techniques like neural networks and their application to a plethora of textual media ranging from news articles to Twitter, blog posts, medical journals, Reddit comments, and parliamentary debates (Kim, 2014; Yang et al., 2016; Conneau et al., 2017; Zhang et al., 2016).
The question of how to best deal with multiple domains when training data is available for one or few of them has met with much interest in the literature. The field of domain adaptation (Jiang and Zhai, 2007; Blitzer et al., 2006; Daume III, 2007; Finkel and Manning, 2009; Lu et al., 2016) aims at improving the learning of a predictive function in a target domain where there is little or no labeled data, using knowledge transferred from a source domain where sufficient labeled data is available. Another line of work (Li and Zong, 2008; Wu and Huang, 2015; Chen and Cardie, 2018) assumes that labeled data may exist for multiple domains, but in insufficient amounts to train classifiers for one or more of them. The aim of multi-domain text classification is to leverage all the available resources in order to improve system performance across domains simultaneously.
In this paper we investigate the question of how domain-specific data might be obtained in order to enable the development of text classification tools as well as more domain-aware applications such as summarization, question answering, and information extraction. We refer to this task as domain detection and assume a fairly common setting where the domains of a corpus collection are known and the aim is to identify textual segments which are domain-heavy, i.e., documents, sentences, or phrases providing evidence for a given domain.
1 The term "domain" has been permissively used in the literature to describe (a) a collection of documents related to a particular topic such as user-reviews in Amazon for a product category (e.g., books, movies), (b) a type of information source (e.g., Twitter, news articles), and (c) various fields of knowledge (e.g., Medicine, Law, Sport). In this paper we adopt the last definition of domains; however, nothing in our approach precludes applying it to different domain labels.
Domain detection can be formulated as a multilabel classification problem, where a model is trained to recognize domain evidence at the sentence-, phrase-, or word-level. By definition then, domain detection would require training data with fine-grained domain labels, thereby increasing the annotation burden; we must provide labels for training domain detectors and for modeling the task we care about in the first place. In this paper we consider the problem of fine-grained domain detection from the perspective of Multiple Instance Learning (MIL; Keeler and Rumelhart 1992) and develop domain models with very little human involvement. Instead of learning from individually labeled segments, our model only requires document-level supervision and optionally prior domain knowledge and learns to introspectively judge the domain of constituent segments. Importantly, we do not require document-level domain annotations either since we obtain these via distant supervision by leveraging information drawn from Wikipedia.
Our domain detection framework comprises two neural network modules: an encoder learns representations for words and sentences together with prior domain information if the latter is available (e.g., domain definitions), while a detector generates domain-specific scores for words, sentences, and documents. We obtain a segment-level domain predictor which is trained end-to-end on document-level labels using a hierarchical, attention-based neural architecture (Vaswani et al., 2017). We conduct domain detection experiments on English and Chinese and measure system performance using both automatic and human-based evaluation. Experimental results show that our model outperforms several strong baselines and is robust across languages and text genres, despite learning from weak supervision. We also showcase our model's application potential for text summarization.
Our contributions in this work are threefold: we propose domain detection as a new fine-grained multilabel learning problem which we argue would benefit the development of domain-aware NLP tools; we introduce a weakly supervised encoder-detector model within the context of multiple instance learning; and we demonstrate that it can be applied across languages and text genres without modification.

Related Work
Our work lies at the intersection of multiple research areas, including domain adaptation, representation learning, multiple instance learning, and topic modeling. We review related work below.
Domain Adaptation A variety of domain adaptation methods (Jiang and Zhai, 2007; Arnold et al., 2007; Pan et al., 2010) have been proposed to deal with the lack of annotated data in novel domains faced by supervised models. Daume III and Marcu (2006) propose to learn three separate models, one specific to the source domain, one specific to the target domain, and a third one representing domain-general information. A simple yet effective feature augmentation technique is further introduced in Daume III (2007) which Finkel and Manning (2009) subsequently recast within a hierarchical Bayesian framework. More recently, Lu et al. (2016) present a general regularization framework for domain adaptation while Camacho-Collados and Navigli (2017) integrate domain information within lexical resources. A popular approach within text classification learns features that are invariant across multiple domains whilst explicitly modeling the individual characteristics of each domain (Chen and Cardie, 2018; Wu and Huang, 2015; Bousmalis et al., 2016).
Similar to domain adaptation, our detection task also identifies the most discriminant features for different domains. However, while adaptation aims to render models more portable by transferring knowledge, detection focuses on the domains themselves and identifies the textual segments which provide the best evidence for their semantics. This makes it possible to create datasets with explicit domain labels to which domain adaptation techniques can be further applied.
Multiple Instance Learning Multiple instance learning (MIL) handles problems where labels are associated with groups or bags of instances (documents in our case), while instance labels (segment-level domain labels) are unobserved. The task is then to make aggregate instance-level predictions, by inferring labels either for bags (Keeler and Rumelhart, 1992; Dietterich et al., 1997; Maron and Ratan, 1998) or jointly for instances and bags (Zhou et al., 2009; Wei et al., 2014; Kotzias et al., 2015). Our domain detection model is an example of the latter variant.
Initial MIL models adopted a relatively strong consistency assumption between bag labels and instance labels. For instance, in binary classification, a bag was considered positive only if all its instances were positive (Dietterich et al., 1997; Maron and Ratan, 1998; Zhang et al., 2002; Andrews and Hofmann, 2004; Carbonetto et al., 2008). The assumption was subsequently relaxed by investigating prediction combinations (Weidmann et al., 2003; Zhou et al., 2009).
Within NLP, multiple instance learning has been predominantly applied to sentiment analysis. Kotzias et al. (2015) use sentence vectors obtained by a pre-trained hierarchical CNN (Denil et al., 2014) as features under a MIL objective which simply averages instance contributions towards bag classification (i.e., positive/negative document sentiment). Pappas and Popescu-Belis (2014) adopt a multiple instance regression model to assign sentiment scores to specific product aspects, using a weighted summation of predictions. More recently, Angelidis and Lapata (2018) propose MILNET, a multiple instance learning network model for sentiment analysis. They employ an attention mechanism to flexibly weigh predictions and recognize sentiment-heavy text snippets (i.e., sentences or clauses).
We depart from previous MIL-based work in devising an encoding module with self-attention and non-recurrent structure, which is particularly suitable for modeling long documents efficiently. Compared to MILNET (Angelidis and Lapata, 2018), our approach generalizes to segments of arbitrary granularity; it introduces an instance scoring function which supports multilabel rather than binary classification, and takes prior knowledge into account (e.g., domain definitions) to better inform the model's predictions.
Topic Modeling Topic models are built around the idea that the semantics of a document collection is governed by latent variables. The aim is therefore to uncover these latent variables (topics) that shape the meaning of the document collection. Latent Dirichlet Allocation (LDA; Blei et al. 2003) is one of the best-known topic models. In LDA, documents are generated probabilistically using a mixture over K topics which are in turn characterized by a distribution over words. Words in a document are generated by repeatedly sampling a topic according to the topic distribution and selecting a word given the chosen topic.
Although most topic models are unsupervised, some variants can also accommodate document-level supervision (Mcauliffe and Blei, 2008; Lacoste-Julien et al., 2009). However, these models are not appropriate for analyzing multiply labeled corpora since they limit documents to being associated with a single label. Multi-Multinomial LDA (MM-LDA; Ramage et al. 2009b) relaxes this constraint by modeling each document as a bag of words with a bag of labels, and topics for each observation are drawn from a shared topic distribution. Labeled LDA (L-LDA; Ramage et al. 2009a) goes one step further by directly associating labels with latent topics thereby learning label-word correspondences. L-LDA is a natural extension of both LDA by incorporating supervision and Multinomial Naive Bayes (McCallum and Nigam, 1998) by incorporating a mixture model (Ramage et al., 2009a).
Similar to L-LDA, DETNET is also designed to perform learning and inference in multi-label settings. Our model adopts a more general solution to the credit attribution problem (i.e., the association of textual units in a document with semantic tags or labels). Despite learning from a weak and distant signal, our model can produce domain scores for text spans of varying granularity (e.g., sentences and phrases), not just words, and achieves this with a hierarchically organized neural architecture. Aside from learning through efficient backpropagation, the proposed framework can incorporate useful prior information (e.g., pertaining to the labels and their meaning).

Problem Formulation
We formulate domain detection as a multilabel learning problem. Our model is trained on samples of document-label pairs. Each document consists of s sentences $x = \{x_1, \dots, x_s\}$ and is associated with discrete labels $y = \{y^{(c)} \mid c \in [1, C]\}$. In this work, domain labels are not annotated manually but extrapolated from Wikipedia (see Section 6 for details). In a non-MIL framework, a model typically learns to predict document labels by directly conditioning on its sentence representations $h_1, \dots, h_s$ or their aggregate. In contrast, y under MIL is a learned function $f_\theta$ of latent instance-level labels, i.e., $y = f_\theta(y_1, \dots, y_s)$. A MIL classifier will therefore first produce domain scores for all instances (aka sentences), and then learn to integrate instance scores into a bag (aka document) prediction.
In this paper we further assume that the instance-bag relation applies to sentences and documents but also to words and sentences. In addition, we incorporate prior domain information to facilitate learning in a weakly supervised setting: each domain is associated with a definition $U^{(c)}$, i.e., a few sentences providing a high-level description of the domain at hand. For example, the definition of the "Lifestyle" domain is "the interests, opinions, behaviors, and behavioral orientations of an individual, group, or culture". Figure 1 provides an overview of our Domain Detection Network, which we call DETNET. The model comprises two modules: an encoder learns representations for words and sentences whilst incorporating prior domain information; a detector generates domain scores for words, sentences, and documents by selectively attending to previously encoded information. We describe the two modules in more detail below.

The Encoder Module
We learn representations for words and sentences using identical encoders with separate learning parameters. Given a document, the two encoders proceed as follows. For each sentence $X = [x_1; \dots; x_n]$, the word-level encoder yields contextualized word representations Z and their attention weights α. Sentence embeddings g are obtained via weighted averaging and then provided as input to the sentence-level encoder, which outputs contextualized representations H and their attention weights β.
In this work we aim to model fairly long documents (e.g., Wikipedia articles; see Section 6 for details). For this reason, our encoder builds on the Transformer architecture (Vaswani et al., 2017), a recently proposed highly efficient model which has achieved state-of-the-art performance in machine translation (Vaswani et al., 2017) and question answering (Yu et al., 2018). The Transformer aims at reducing the fundamental constraint of sequential computation which underlies most architectures based on recurrent neural networks. It eliminates recurrence in favor of applying a self-attention mechanism which directly models relationships between all words in a sentence, regardless of their position.
Self-Attentive Encoder As shown in Figure 2, the Transformer is a non-recurrent framework comprising m identical layers. Information on the (relative or absolute) position of each token in a sequence is represented by the use of positional encodings which are added to input embeddings (see the bottom of Figure 2). We denote position-augmented inputs in a sentence with X. Our model uses four layers in both word and sentence encoders. The first three layers are identical to those in the Transformer (m = 3), comprising a multi-head self-attention sublayer and a position-wise fully-connected feed-forward network. The last layer is simply a multi-head self-attention layer yielding attention weights for subsequent operations.
Single-head attention takes three parameters as input in the Transformer (Vaswani et al., 2017): a query matrix, a key matrix, and a value matrix. These three matrices are identical and equal to the inputs X at the first layer of the word encoder. Multi-head attention makes it possible to jointly attend to information from different representation subspaces at different positions. This is done by first applying different linear projections to the inputs, computing single-head attention on each projection, and then concatenating the outputs; we adopt four heads (r = 4) for both word and sentence encoders. The second sublayer in the Transformer (see Figure 2) is a fully-connected feed-forward network applied to each position separately and identically. 2
After sequentially encoding input embeddings through the first three layers, we obtain contextualized word representations $Z \in \mathbb{R}^{d_z \times n}$. Based on Z, the last multi-head attention layer in the word encoder yields a set of attention matrices $A = \{A^{(k)}\}_{k=1}^{r}$ for each sentence, where $A^{(k)} \in \mathbb{R}^{n \times n}$. Therefore, when measuring the contributions from words to sentences, e.g., in terms of domain scores and representations, we can selectively focus on salient words within the set A: a softmax function outputs the salience distribution α over words, and we obtain sentence embeddings as $g = Z\alpha$.
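For reference, the three Transformer operations described above (single-head attention, multi-head attention, and the position-wise feed-forward network) have the following standard forms in Vaswani et al. (2017). The symbols $Q$, $K$, $V$, $d_k$, $W_k^{Q}$, $W_k^{K}$, $W_k^{V}$, $W^{O}$, $W_1$, $W_2$, $b_1$, and $b_2$ follow that paper's row-wise notation and are assumptions here; the present model stores representations column-wise (e.g., $Z \in \mathbb{R}^{d_z \times n}$), so its own equations may be written in transposed form.

```latex
\begin{align}
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V \\
\mathrm{head}_k &= \mathrm{Attention}(Q W_k^{Q},\, K W_k^{K},\, V W_k^{V}), \quad k = 1, \dots, r \\
\mathrm{MultiHead}(Q, K, V) &= [\mathrm{head}_1; \dots; \mathrm{head}_r]\, W^{O} \\
\mathrm{FFN}(x) &= \max(0,\, x W_1 + b_1)\, W_2 + b_2
\end{align}
```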
In the same vein, we adopt another self-attentive encoder to obtain contextualized sentence representations $H \in \mathbb{R}^{d_h \times s}$. The final layer outputs multi-head attention score matrices $B = \{B^{(k)}\}_{k=1}^{r}$ (with $B^{(k)} \in \mathbb{R}^{s \times s}$), from which we calculate the sentence salience distribution β.
Prior Information In addition to documents (and their domain labels), we might have some prior knowledge about the domain, e.g., its general semantic content and the various topics related to it. For example, we might expect articles from the "Lifestyle" domain to not talk about missiles or warfare, as these are recurrent themes in the "Military" domain. As mentioned earlier, throughout this paper we assume we have domain definitions U expressed in a few sentences as prior knowledge. Domain definitions share parameters with WORDENC and SENTENC and are encoded in a definition matrix $U \in \mathbb{R}^{d_h \times C}$. Intuitively, identifying the domain of a word might be harder than that of a sentence; on account of being longer and more expressive, sentences provide more domain-related cues than words whose meaning often relies on supporting context. We thus inject domain definitions U into our word detector only.

The Detector Module
DETNET adopts three detectors corresponding to words, sentences, and documents: WORDDET first produces word domain scores using both lexical semantic information Z and prior (domain) knowledge U; SENTDET yields domain scores for sentences while integrating downstream instance signals $Q_{\text{instc}}$ and sentence semantics H; finally, DOCDET makes the final document-level predictions based on sentence scores.
Word Detector Our first detector yields word domain scores. For a sentence, we obtain a self-scoring matrix $P_{\text{self}}$ using its own contextual word semantic information. In contrast to the representations used in Angelidis and Lapata (2018), we generate instance scores from contextualized representations, i.e., Z. Since the softmax function normally favors single-mode outputs, we adopt $\tanh(\cdot) \in (-1, 1)$ as our domain scoring function to tailor MIL to our multilabel scenario.
As mentioned earlier, we employ domain definitions as prior information at the word level and compute a prior score matrix $P_{\text{prior}}$, in which $W_u \in \mathbb{R}^{d_u \times d_z}$ projects prior information U onto the input semantic space. The prior score matrix $P_{\text{prior}}$ captures the interactions between domain definitions and sentential contents.
In this work, we flexibly integrate scoring components with gates, as shown in Figure 3. The key idea is to learn a prior gate Γ balancing Equations (8) and (9). In the gating equation, J is an all-ones matrix and $P \in \mathbb{R}^{C \times n}$ is the final domain score matrix at the word level; $\odot$ denotes element-wise multiplication and $[\cdot, \cdot]$ matrix concatenation. $\sigma(\cdot) \in (0, 1)$ is the sigmoid function and $\Gamma \in (0, \gamma)$ the prior gate with scaling factor γ, a hyperparameter controlling the overall effect of prior information and instances. 3
Sentence Detector The second detector identifies sentences with domain-heavy semantics based on signals from the sentence encoder, prior information and word instances. Again we obtain a self-scoring matrix $Q_{\text{self}}$. After computing sentence scores from sentence-level signals, we estimate domain scores from individual words. We do this by reusing α in Equation (5), $q_{\text{instc}} = P\alpha$. After gathering $q_{\text{instc}}$ for each sentence, we obtain $Q_{\text{instc}} \in \mathbb{R}^{C \times s}$ as the full instance score matrix.
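The aggregation step $q_{\text{instc}} = P\alpha$ can be illustrated with a minimal sketch. It assumes plain Python lists, a toy word score matrix with C = 2 domains and n = 3 words, and a hand-picked salience vector; the function name and all numbers are hypothetical and do not come from the released implementation.

```python
def aggregate_instance_scores(P, alpha):
    """Return q_instc = P @ alpha: one domain score per row of P.

    P     -- C x n word-level domain scores, each in (-1, 1)
    alpha -- length-n word salience distribution (non-negative, sums to 1)
    """
    if any(len(row) != len(alpha) for row in P):
        raise ValueError("each row of P needs one score per word")
    return [sum(p * a for p, a in zip(row, alpha)) for row in P]


# Toy example (illustrative numbers only): 2 domains, 3 words.
P = [[0.8, -0.2, 0.1],
     [-0.5, 0.4, 0.0]]
alpha = [0.5, 0.25, 0.25]
q_instc = aggregate_instance_scores(P, alpha)
```

Because the salience weights sum to one, each entry of `q_instc` is a convex combination of word scores and therefore stays within (-1, 1), the same range as the tanh scoring function.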
Analogously to the word-level detector (see Equation (10)), we employ a sentence-level upward gate Λ to dynamically propagate domain scores from downstream word instances to sentence bags, yielding the final sentence score matrix Q.
Document Detector Document-level domain scores $\tilde{y}$ are based on the sentence salience distribution β (see Equation (7)) and are computed as the weighted average of sentence scores. We use only document-level supervision for multilabel learning in C domains; the training objective is defined over a training set of size N. At test time, we partition domains into a relevant set and an irrelevant set for unseen samples. Since the domain scoring function is $\tanh(\cdot) \in (-1, 1)$, we use a threshold of 0 against which $\tilde{y}$ is calibrated. 4
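The test-time prediction step can be sketched as follows. The sketch assumes that the weighted average takes the form of a matrix-vector product between sentence scores and β (by analogy with $q_{\text{instc}} = P\alpha$; the exact equation is an assumption), applies the threshold of 0, and falls back to the highest-scoring domain when no score is positive, as footnote 4 describes. Function names and toy values are hypothetical.

```python
def document_scores(Q, beta):
    """Weighted average of sentence scores: one score per domain.

    Q    -- C x s sentence-level domain scores, each in (-1, 1)
    beta -- length-s sentence salience distribution (sums to 1)
    """
    return [sum(q * b for q, b in zip(row, beta)) for row in Q]


def relevant_domains(y_tilde, threshold=0.0):
    """Indices of domains whose score exceeds the threshold.

    If no domain clears the threshold, fall back to the single
    highest-scoring domain so that a positive prediction is produced.
    Assumption: a score exactly equal to the threshold counts as irrelevant.
    """
    relevant = [c for c, score in enumerate(y_tilde) if score > threshold]
    if not relevant:
        relevant = [max(range(len(y_tilde)), key=lambda c: y_tilde[c])]
    return relevant


# Toy example (illustrative numbers only): 3 domains, 2 sentences.
Q = [[0.6, -0.2],
     [-0.4, -0.8],
     [0.1, 0.3]]
beta = [0.5, 0.5]
y_tilde = document_scores(Q, beta)
predicted = relevant_domains(y_tilde)
fallback = relevant_domains([-0.7, -0.1, -0.3])
```

Here the first call marks domains 0 and 2 as relevant, while the second shows the fallback: all scores are negative, so only the least negative domain is returned.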

Experimental Setup
Datasets DETNET was trained on two datasets created from Wikipedia 5 for English and Chinese. 6 Wikipedia articles are organized according to a hierarchy of categories representing the defining characteristics of a field of knowledge. We recursively collect Wikipedia pages by first determining the root categories based on their match with the domain name. We then obtain their subcategories, the subcategories of these subcategories, and so on. We treat all pages associated with a category as representative of the domain of its root category.
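The recursive collection procedure can be sketched as a breadth-first traversal of the category hierarchy. The sketch assumes the hierarchy is already available as two in-memory dictionaries (`subcategories` and `pages`, both hypothetical); it does not call any real Wikipedia API, and the set of visited categories is an added safeguard against cycles rather than something described in the text.

```python
from collections import deque


def collect_domain_pages(root, subcategories, pages):
    """Gather all pages under a root category, including nested subcategories.

    root          -- name of the root category matched to a domain
    subcategories -- dict: category -> list of direct subcategories
    pages         -- dict: category -> list of page titles filed under it
    """
    seen = {root}  # guards against cycles in the category graph
    queue = deque([root])
    collected = set()
    while queue:
        category = queue.popleft()
        collected.update(pages.get(category, []))
        for child in subcategories.get(category, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return collected


# Toy hierarchy (made-up names, for illustration only).
subcategories = {
    "Military": ["Weapons", "Warfare"],
    "Weapons": ["Missiles"],
    "Missiles": ["Weapons"],  # deliberate cycle
}
pages = {
    "Military": ["Army"],
    "Weapons": ["Rifle"],
    "Missiles": ["ICBM"],
    "Warfare": ["Siege"],
    "Cooking": ["Bread"],
}
military_pages = collect_domain_pages("Military", subcategories, pages)
```

Every page reached this way would be labeled with the domain of the root category, so pages filed under unrelated categories (such as "Cooking" in the toy data) are never collected.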
In our experiments we used seven target domains: "Business and Commerce" (BUS), "Government and Politics" (GOV), "Physical and Mental Health" (HEA), "Law and Order" (LAW), "Lifestyle" (LIF), "Military" (MIL), and "General Purpose" (GEN). Exceptionally, GEN does not have a natural root category. We leverage Wikipedia's 12 Main Categories 7 to ensure that GEN is genuinely different from the other six domains. We used 5,000 pages for each domain. Table 1 shows various statistics on our dataset.
4 If $\forall c \in [1, C]: \tilde{y}_c < 0$ holds, we set $\tilde{y}_{c^*} = 1$ and select $c^*$ as $c^* = \arg\max_c \tilde{y}_c$ to produce a positive prediction.

System Comparisons
We constructed three variants of DETNET to explore the contribution of different model components. DETNET 1H has a single-level hierarchical structure, treating only sentences as instances and documents as bags; DETNET 2H has a two-level hierarchical structure (the instance-bag relation applies to words-sentences and sentences-documents); finally, DETNET * is our full model which is fully hierarchical and equipped with prior information (i.e., domain definitions). We also compared DETNET to a variety of related systems which include:
MAJOR: the Majority domain label applies to all instances.
L-LDA: Labeled LDA (Ramage et al., 2009a) is a topic model that constrains LDA by defining a one-to-one correspondence between LDA's latent topics and observed labels. This allows L-LDA to directly learn word-label correspondences. We obtain domain scores for words through the topic-word-count matrix $M \in \mathbb{R}^{C \times V}$, which is computed during training, where C and V are the number of domain labels and the size of vocabulary, respectively. Scalar β is a prior value set to 1/C, and the resulting score matrix, of size $V \times C$, consists of word scores over domains. Following the snippet extraction approach proposed in Ramage et al. (2009a), L-LDA can also be used to score sentences as the expected probability that the domain label had generated each word. For more details on L-LDA, we refer the interested reader to Ramage et al. (2009a).
HIERNET: A hierarchical neural network model described in Angelidis and Lapata (2018) which produces document-level predictions by attentively integrating sentence representations. For this model we used word and sentence encoders identical to DETNET. HIERNET does not generate instance-level predictions; however, we assume that document-level predictions apply to all sentences.
MILNET: A variant of the MIL-based model introduced in Angelidis and Lapata (2018) which considers sentences as instances and documents as bags (while DETNET generalizes the instance-bag relationship to words and sentences). To make MILNET comparable to our system, we use an encoder identical to DETNET, i.e., two Transformer encoders for words and sentences, respectively. Thus, MILNET differs from DETNET 1H in two respects: (a) word representations are simply averaged without word-level attention to build sentence embeddings and (b) context-free sentence embeddings generate sentence domain scores before being fed to the sentence encoder.

Implementation Details
We used 16 shuffled samples in a batch where the maximum document length was set to 100 sentences with the excess clipped. Word embeddings were initialized randomly with 256 dimensions. All weight matrices in the model were initialized with the fan-in trick (Glorot and Bengio, 2010) and biases were initialized with zero. Apart from using layer normalization (Ba et al., 2016) in the encoders, we applied batch normalization (Ioffe and Szegedy, 2015) and a dropout rate of 0.1 in the detectors to accelerate model training. We trained the model with the Adam optimizer (Kingma and Ba, 2014). We set all three gate scaling factors in our model to 0.1. Hyper-parameters were optimized on the development set. To make our experiments easy to replicate, we release our PyTorch (Paszke et al., 2017) source code. 8

Automatic Evaluation
In this section we present the results of our automatic evaluation for sentence and document predictions. Problematically, for sentence predictions we do not have gold-standard domain labels (we have only extrapolated these from Wikipedia for documents). We therefore developed an automatic approach for creating silver standard domain labels which we describe below.
Test Data Generation In order to obtain sentences with domain labels, we exploit lead sentences in Wikipedia articles. Lead sentences typically define the article's subject matter and emphasize its topics of interest. 9 As most lead sentences contain domain-specific content we can fairly confidently assume that document-level domain labels will apply. To validate this assumption, we randomly sampled 20 documents containing 220 lead sentences and asked two annotators to label these with domain labels. Annotators overwhelmingly agreed in their assignments with the document labels; the (average) agreement was κ = 0.89 using Cohen's Kappa coefficient.
We used the lead sentences to create pseudo documents simulating real documents whose sentences cover multiple domains. To ensure sentence labels are combined reasonably (e.g., MIL is not likely to coexist with LIF), prior to generating synthetic documents, we traverse the training set and acquire all domain combinations S, e.g., S = {{GOV}, {GOV, MIL}}. We then gather lead sentences representing the same domain combinations. We generate synthetic documents with a maximum length of 100 sentences (we also clip real documents to the same length).
Algorithm 1 shows the pseudocode for document generation. We first sample document labels, then derive candidate label sets for sentences by introducing GEN and a noisy label. After sampling sentences for each domain, we shuffle them to achieve domain-varied sentence contexts. We created two synthetic datasets for English and Chinese. Detailed statistics are shown in Table 1.
Table 2: Performance using Macro-F1 % on automatically created Wikipedia test set; models with the symbol † are significantly (p < 0.05) different from the best system in each task using the approximate randomization test (Noreen, 1989).
Evaluation Metric We evaluate system performance automatically using label-based Macro-F1 (Zhang and Zhou, 2014), a widely-used metric for multilabel classification. It measures model performance for each label specifically and then macro-averages the results. For each class c, given a confusion matrix $\begin{pmatrix} tp_c & fn_c \\ fp_c & tn_c \end{pmatrix}$ containing the number of samples classified as true positive, false negative, false positive, and true negative, Macro-F1 is calculated as $\frac{1}{C} \sum_{c=1}^{C} \frac{2\,tp_c}{2\,tp_c + fp_c + fn_c}$, where C is the number of domain labels.
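As an illustration of the metric, the following sketch computes label-based Macro-F1 from gold and predicted label sets. The function name, the input format, and the convention of scoring a label with no true positives, false positives, or false negatives as 0 are assumptions made for this example.

```python
def macro_f1(gold, predicted, labels):
    """Label-based Macro-F1 over multilabel predictions.

    gold, predicted -- lists of label sets, one set per sample
    labels          -- iterable of all C domain labels
    """
    labels = list(labels)
    total = 0.0
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if label in g and label in p)
        fp = sum(1 for g, p in pairs if label not in g and label in p)
        fn = sum(1 for g, p in pairs if label in g and label not in p)
        denom = 2 * tp + fp + fn
        # Assumption: a label absent from both gold and predictions scores 0.
        total += (2 * tp / denom) if denom else 0.0
    return total / len(labels)


# Toy example (illustrative labels only): three samples, three domains.
gold = [{"GOV", "MIL"}, {"BUS"}, {"GOV"}]
pred = [{"GOV"}, {"BUS"}, {"MIL"}]
score = macro_f1(gold, pred, ["BUS", "GOV", "MIL"])
```

In the toy data BUS is predicted perfectly, GOV is found once and missed once, and MIL is never predicted correctly, so the per-label scores are 1, 2/3, and 0 before averaging.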

Results
Our results are summarized in Table 2. We first report domain detection results for documents, since reliable performance on this task is a prerequisite for more fine-grained domain detection. As shown in Table 2, DETNET does well on document-level domain detection, managing to outperform systems over which it has no clear advantage (such as HIERNET or MILNET). As far as sentence-level prediction is concerned, all DETNET variants significantly outperform all comparison systems. Overall, DETNET * is the best system achieving 54.37% and 55.88% Macro-F 1 on English (en) and Chinese (zh), respectively. It outperforms MILNET by 17.25% on English and 11.51% on Chinese. The performance of the fully hierarchical model DETNET 2H is better than DETNET 1H , showing positive effects of directly incorporating word-level domain signals. We also observe that prior information is generally helpful on both languages and both tasks.

Human Evaluation
Aside from automatic evaluation, we also assessed model performance against human elicited domain labels for sentences and words. The purpose of this experiment was threefold: (a) to validate the results obtained from automatic evaluation; (b) to evaluate finer-grained model performance at the word level; and (c) to examine whether our model generalizes to non-Wikipedia articles. For this, we created a third test set from the New York Times 10 , in addition to our Wikipedia-based English and Chinese datasets. For all three corpora, we randomly sampled two documents for each domain, and then from each document, we sampled one long paragraph or a few consecutive short paragraphs containing 8-12 sentences. Amazon Mechanical Turkers were asked to read these sentences and assign a domain based on the seven labels used in this paper (multiple labels were allowed). Participants were provided with domain definitions. We obtained five annotations per sentence and adopted the majority label as the sentence's domain label. We obtained two annotated datasets for English (Wiki-en and NYT-en) and one for Chinese (Wiki-zh), consisting of 122/14, 111/11, and 117/12 sentences/documents each.
Word-level domain evaluation is more challenging; taken out-of-context, individual words might be uninformative or carry meanings compatible with multiple domains. Expecting crowdworkers to annotate domain labels word-by-word with high confidence, might be therefore problematic. In order to reduce annotation complexity, we opted for a retrieval-style task for word evaluation. Specifically, AMT workers were given a sentence and its domain label (obtained from the sentence-level elicitation study described above), and asked to highlight which words they considered consistent with the domain of the sentence. We used the same corpora/sentences as in our first AMT study. Analogously, words in each sentence were annotated by five participants and their labels were determined by majority agreement.
Fully hierarchical variants of our model (i.e., DETNET 2H, DETNET *) and L-LDA are able to produce word-level predictions; we thus retrieved the words within a sentence whose domain score was above the threshold of 0 and compared them against the labels provided by crowdworkers. MILNET and DETNET 1H can only make sentence-level predictions. In this case, we assume that the sentence domain applies to all words therein. HIERNET can only produce document-level predictions, based on which we generate sentence labels and further assume that these apply to sentence words too. Again, we report Macro-F1, which we compute as $\frac{2 p^* r^*}{p^* + r^*}$, where precision $p^*$ and recall $r^*$ are both averaged over all words.

Table 3: System performance using Macro-F1 % (test set created via AMT); models with the symbol † are significantly (p < 0.05) different from the best system in each task using the approximate randomization test (Noreen, 1989).
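The evaluation just described (threshold the scores at 0, then combine averaged precision and recall) can be sketched as below. This is an illustrative reading, not the evaluation script used in the paper: it assumes each item (a word or a sentence) has a predicted and a gold label set, that precision and recall are computed per item and then averaged, and that an empty set contributes 0. The helper names are hypothetical.

```python
def labels_above_threshold(scores, threshold=0.0):
    """Keep the labels whose score is strictly above the threshold."""
    return {label for label, s in scores.items() if s > threshold}

def macro_f1(pred_sets, gold_sets):
    """Macro-F1 = 2 p* r* / (p* + r*), with p* and r* averaged over items."""
    precisions, recalls = [], []
    for pred, gold in zip(pred_sets, gold_sets):
        overlap = len(pred & gold)
        # Convention (assumption): an empty prediction or gold set scores 0.
        precisions.append(overlap / len(pred) if pred else 0.0)
        recalls.append(overlap / len(gold) if gold else 0.0)
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

# Hypothetical scores for three items.
pred = [
    labels_above_threshold({"LAW": 0.7, "GOV": -0.2}),
    labels_above_threshold({"LAW": 0.1, "GOV": 0.4}),
    labels_above_threshold({"MIL": -0.5}),
]
gold = [{"LAW"}, {"GOV"}, {"MIL"}]
print(round(macro_f1(pred, gold), 4))
```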
We show model performance against AMT domain labels in Table 3.
Consistent with the automatic evaluation results, DETNET variants are the best performing models on the sentence-level task. On the Wikipedia datasets, DETNET 2H or DETNET * outperform all baselines and DETNET 1H by a large margin, showing that word-level signals can indeed help detect sentence domains. Although statistical models are typically less accurate when they are applied to data that has a different distribution from the training data, DETNET * works surprisingly well on NYT, substantially outperforming all other systems. We also notice that prior information is useful in making domain predictions for NYT sentences: since our models are trained on Wikipedia, prior domain definitions largely alleviate the genre shift to non-Wikipedia sentences. Table 4 provides a breakdown of the performance of DETNET * across domains. Overall, the model performs worst on LIF and GEN domains (which are very broad) and best on BUS and MIL (which are very narrow).
With regard to word-level evaluation, DETNET 2H and DETNET * are the best systems and are significantly better than all comparison models by a wide margin, except L-LDA. The latter is a strong domain detection system at the word level since it is able to directly associate words with domain labels (see Equation (17)) without resorting to document- or sentence-level predictions. However, our two-level hierarchical model is superior considering all-around performance across sentences and documents. The results here accord with our intuition from previous experiments: hierarchical models outperform simpler variants (including MILNET) since they are able to capture and exploit fine-grained domain signals relatively accurately. Interestingly, prior information does not seem to have an effect on the Wikipedia datasets, but is useful when transferring to NYT. We also observe that models trained on the Chinese datasets perform consistently better than those trained on English. Analysis of the annotations provided by crowdworkers revealed that the ratio of domain words in Chinese is higher compared to English (27.47% vs. 13.86% in Wikipedia and 16.42% in NYT), possibly rendering word retrieval in Chinese an easier task. Table 5 shows the 15 most representative domain words identified by our model (DETNET *) on Wiki-en for our seven domains.
Table 5: Top 15 domain words identified by DETNET * and L-LDA on Wiki-en.

| Domains | DETNET * | L-LDA |
|---|---|---|
| BUS | monopolization, enactment, panama, funding, arbitron, maturity, groceries, os, elevator, salary, organizations, pietism, contract, mercantilism, sectors | also, business, company, used, one, management, may, business, united, 2007, time, first, new, market, new |
| HEA | psychology, divorce, residence, pilates, dorlands, culinary, technique, emotion, affiliation, seafood, famine, malaria, oceans, characters, pregnancy | also, health, may, used, one, disease, medical, use, first, people, 1, many, time, water, care |
| GEN | gender, destruction, beliefs, schizophrenia, area, writers, armor, creativity, propagation, cheminformatics, overpopulation, deity, stimulation, mathematical, cosmology | also, one, theory, 1, used, time, two, may, first, example, many, called, form, would, known |
| GOV | penology, tenure, governance, alloys, biosecurity, authoritarianism, criticisms, burundi, motto, imperium, mesopotamia, juche, 420, krytocracy, criticism | also, government, political, state, united, party, one, minister, national, states, first, would, used, new, university |
| LAW | alloys, biosecurity, authoritarianism, mesopotamia, electronic, economical, pupil, pathophysiology, imperium, phonology, collusion, cantons, auctoritas, sigint, juche | law, also, united, legal, may, act, states, court, rights, one, case, state, would, v, government |
| LIF | teacher, freight, career, agaricomycetes, casein, manga, diplogasteria, benefit, pteridophyta, basidiomycota, ascomycota, letters, eukaryota, carcinogens, lifespan | also, used, may, often, one, made, water, food, many, use, usually, called, known, oil, time |
| MIL | battles, eads, insignia, commanders, artillery, width, episodes, neurasthenia, reconnaissance, elevation, freedom, length, patrol, manufacturer, demise | military, war, army, also, air, united, force, states, one, used, forces, first, royal, british, world |

We obtained this list by weighting word domain scores $P$ with their attention scores, and ranking all words in the development set according to the weighted scores $P^*$, separately for each domain. Since words appearing in different contexts are usually associated with multiple domains, we determine a word's ranking for a given domain based on the highest score. As shown in Table 5, biosecurity and authoritarianism are prevalent in both GOV and LAW domains. Interestingly, with contextualized word representations, fairly general English words are recognized as domain-heavy. For example, technique is a strong domain word in HEA and 420 in GOV (the latter is slang for the consumption of cannabis and highly associated with government regulations).
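The ranking procedure described above can be illustrated with a short sketch. It assumes that the weighted score of a word occurrence is the product of its domain score and its attention score, and that a word type is ranked in each domain by its highest weighted score across occurrences; the function name `top_domain_words` and the toy inputs are hypothetical.

```python
def top_domain_words(tokens, scores, attention, domains, k=15):
    """Rank word types per domain by their best attention-weighted score.

    tokens:    word occurrences from the development set
    scores:    per-occurrence domain scores, one row per token
    attention: per-occurrence attention scores
    """
    best = {d: {} for d in domains}
    for tok, row, att in zip(tokens, scores, attention):
        for d, s in zip(domains, row):
            weighted = s * att  # weighted score (assumed elementwise product)
            if weighted > best[d].get(tok, float("-inf")):
                best[d][tok] = weighted  # keep the highest score per word type
    return {
        d: [t for t, _ in sorted(best[d].items(), key=lambda kv: kv[1], reverse=True)[:k]]
        for d in domains
    }

# Toy example: "tax" occurs twice, in different contexts.
tokens = ["tax", "court", "tax"]
scores = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6]]
attention = [0.5, 1.0, 1.0]
ranking = top_domain_words(tokens, scores, attention, ["BUS", "LAW"], k=2)
print(ranking)
```

Because the maximum is taken per domain, the same word can rank highly in several domains, as biosecurity and authoritarianism do in Table 5.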
For comparison, we also show the top domain words identified by L-LDA via matrix M (see Equation (17)). To produce meaningful output, we have removed stop words and punctuation tokens, which are given very high domain scores by L-LDA (this is not entirely surprising since M is based on simple co-occurrence). Notice that no such post-processing is necessary for our model. As shown in Table 5, the top domain words identified by L-LDA (on the right) are more general and less informative than those from DETNET * (on the left).

Domain-Specific Summarization
In this section we illustrate how fine-grained domain scores can be used to produce domain summaries, following an extractive, unsupervised approach. We assume the user specifies the domains they are interested in a priori (e.g., LAW, HEA) and the system returns summaries targeting the semantics of these domains.
Specifically, we introduce DETRANK, an extension of the well-known TEXTRANK algorithm (Mihalcea and Tarau, 2004), which incorporates domain signals acquired by DETNET *. For each document, TEXTRANK builds a directed graph $G = (V, E)$ with nodes $V$ corresponding to sentences, and undirected edges $E$ whose weights are computed based on sentence similarity. Specifically, edge weights are represented with matrix $E$ where each element $E_{i,j}$ corresponds to the transition probability from vertex $i$ to vertex $j$. Following Barrios et al. (2016), $E_{i,j}$ is computed with the Okapi BM25 algorithm (Robertson et al., 1995), a probabilistic version of TF-IDF, and small weights (< 0.001) are set to zero. Unreachable nodes are further pruned to acquire the final vertex set $V$.

[Figure 4: Example summaries for the Wikipedia article Arms Industry for domains MIL and BUS.]
To enhance TEXTRANK with domain information, we first multiply sentence-level domain scores $Q$ with their corresponding attention scores to obtain $Q^*$, and for a given domain $c$, we can extract a (domain) sentence score vector $q^* = Q^*_{c,*} \in \mathbb{R}^{1 \times s}$. Then, from $q^*$, we produce vector $\tilde{q} \in \mathbb{R}^{1 \times |V|}$ representing a distribution of domain signals over sentences. In order to render domain signals in different sentences more discernible, we scale all elements in $\tilde{q}$ to [0, 1] before obtaining a legitimate distribution with the softmax function. Finally, we integrate the domain component into the original transition matrix to obtain $\tilde{E}$, where $\phi \in (0, 1)$ controls the extent to which domain-specific information influences sentence selection for the summarization task; higher $\phi$ will lead to summaries which are more domain-relevant. Here, we empirically set $\phi = 0.3$. The main difference between DETRANK and TEXTRANK is that TEXTRANK treats $1 - \phi$ as a damping factor and applies a uniform probability distribution in place of $\tilde{q}$.
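The quantities in this paragraph can be written out as follows. This is one reading consistent with the surrounding prose rather than the exact published formulation: the symbol $\beta_j$ for the attention score of sentence $j$, the min-max form of the scaling, and the restriction of $q^*$ to the retained vertices (written $q^*_V$) are assumptions.

```latex
\begin{align}
Q^*_{c,j} &= Q_{c,j}\,\beta_j, \qquad q^* = Q^*_{c,*} \in \mathbb{R}^{1 \times s} \\
\tilde{q} &= \mathrm{softmax}\!\left(\frac{q^*_V - \min(q^*_V)}{\max(q^*_V) - \min(q^*_V)}\right) \in \mathbb{R}^{1 \times |V|} \\
\tilde{E}_{i,j} &= \phi\,\tilde{q}_j + (1 - \phi)\,E_{i,j}
\end{align}
```

Under this reading, each row of $\tilde{E}$ still sums to one whenever the rows of $E$ do, and replacing $\tilde{q}$ with the uniform distribution recovers a TEXTRANK-style update with damping factor $1 - \phi$.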
In order to decide which sentence to include in the summary, a node's centrality is measured using a graph-based ranking algorithm (Mihalcea and Tarau, 2004). Specifically, we run a Markov chain with $\tilde{E}$ on $G$ until it converges to the stationary distribution $e^*$ where each element denotes the salience of a sentence. In the proposed DETRANK algorithm, $e^*$ jointly expresses the importance of a sentence in the document and its relevance to the given domain (controlled by $\phi$). We rank sentences according to $e^*$ and select the top $K$ ones, subject to a budget (e.g., 100 words).
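A minimal sketch of this ranking step is given below, in pure Python. It is illustrative only and makes several assumptions that are not stated in the text: the domain-adjusted transition matrix is taken to be $\phi\,\tilde{q}_j + (1 - \phi)\,E_{i,j}$, the scaling to [0, 1] is min-max scaling, the input matrix is already row-normalized, and sentences that would exceed the word budget are skipped. All function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def domain_distribution(domain_scores):
    """Min-max scale scores to [0, 1] (assumed), then apply softmax."""
    lo, hi = min(domain_scores), max(domain_scores)
    span = hi - lo
    scaled = [(s - lo) / span if span > 0 else 0.0 for s in domain_scores]
    return softmax(scaled)

def detrank(E, domain_scores, phi=0.3, tol=1e-10, max_iter=1000):
    """Power iteration on the domain-adjusted, row-stochastic matrix."""
    n = len(E)
    q = domain_distribution(domain_scores)
    E_tilde = [[phi * q[j] + (1 - phi) * E[i][j] for j in range(n)]
               for i in range(n)]
    e = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        nxt = [sum(e[i] * E_tilde[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, e)) < tol:
            return nxt
        e = nxt
    return e

def select_summary(sentences, salience, budget=100):
    """Take sentences by decreasing salience while within a word budget."""
    chosen, used = [], 0
    order = sorted(range(len(sentences)), key=lambda i: salience[i], reverse=True)
    for idx in order:
        length = len(sentences[idx].split())
        if used + length <= budget:
            chosen.append(idx)
            used += length
    return [sentences[i] for i in sorted(chosen)]  # restore document order

# Toy document with three sentences and hypothetical domain scores.
E = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
salience = detrank(E, domain_scores=[2.0, 0.0, -1.0])
print(select_summary(["a b c", "d e", "f g h i"], salience, budget=5))
```

In this toy graph all sentences are equally central, so the differences in salience come entirely from the domain scores; with $\phi$ close to 0 the ranking would instead approach that of the unmodified transition matrix.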
We ran a judgment elicitation study on summaries produced by TEXTRANK and DETRANK. Participants were provided with domain definitions and asked to decide which summary was best according to the criteria of: Informativeness (does the summary contain more information about a specific domain, e.g., "Government and Politics"?), Succinctness (does the summary avoid unnecessary detail and redundant information?), and Coherence (does the summary make logical sense?). Amazon Mechanical Turk (AMT) workers were allowed to answer "Both" or "Neither" in cases where they could not discriminate between summaries. We sampled 50 summary pairs from the English Wikipedia development set. We collected three responses per summary pair and determined which system participants preferred based on majority agreement. Table 6 shows the proportion of times AMT workers preferred each system according to the criteria of Informativeness, Succinctness, Coherence, and overall. As can be seen, participants find DETRANK summaries more informative and coherent. While it is perhaps not surprising for DETRANK to produce summaries which are domain informative since it explicitly takes domain signals into account, it is interesting to note that focusing on a specific domain also helps discard irrelevant information and produce more coherent summaries. This, on the other hand, possibly renders DETRANK's summaries more verbose (see the Succinctness ratings in Table 6).

Table 6: Human evaluation results for summaries produced by TEXTRANK and DETRANK; proportion of times AMT workers found models Informative (Inf), Succinct (Succ), and Coherent (Coh); All is the average across ratings; symbol † denotes that differences between models are statistically significant (p < 0.05) using a pairwise t-test.

Figure 4 shows example summaries for the Wikipedia article Arms Industry for domains MIL and BUS. 11 Both summaries begin with a sentence which introduces the arms industry to the reader. When MIL is the domain of interest, the summary focuses on military products such as guns and missiles. When the domain changes to BUS, the summary puts more emphasis on trade, e.g., market competition and companies doing military business, such as Boeing and Eurofighter.

Conclusions
In this work, we proposed an encoder-detector framework for domain detection. Leveraging only weak domain supervision, our model achieves results superior to competitive baselines across different languages, segment granularities, and text genres. Aside from identifying domain-specific training data, we also show that our model holds promise for other natural language tasks, such as text summarization. Beyond domain detection, we hope that some of the work described here might be of relevance to other multilabel classification problems such as sentiment analysis (Angelidis and Lapata, 2018), relation extraction (Surdeanu et al., 2012), and named entity recognition (Tang et al., 2017). More generally, our experiments show that the proposed framework can be applied to textual data using minimal supervision, significantly alleviating the annotation bottleneck for text classification problems.

11 https://en.wikipedia.org/wiki/Arms_industry
A key feature in achieving performance superior to competitive baselines is the hierarchical nature of our model, where representations are encoded step-by-step, first for words, then for sentences, and finally for documents. The framework flexibly integrates prior information which can be used to enhance the otherwise weak supervision signal or to render the model more robust across genres. In the future, we would like to investigate semi-supervised instantiations of MIL, where aside from bag labels, small amounts of instance labels are also available (Kotzias et al., 2015). It would also be interesting to examine how the label space influences model performance, especially since in our scenario the labels are extrapolated from Wikipedia and might be naturally noisy and/or ambiguous.