Abstract
Deep learning models have been widely adopted and have achieved promising results in various application domains. Despite their strong performance, most deep learning models function as black boxes, lacking the explicit reasoning capabilities and explanations that are usually essential for complex problems. Take joint inference in information extraction as an example. This task requires identifying multiple inter-correlated types of structured knowledge from text, including entities, events, and the relationships between them. Various deep neural networks have been proposed to jointly perform entity extraction and relation prediction, but they only propagate information implicitly via representation learning and fail to encode the intensive correlations between entity types and relations that would enforce their coexistence. On the other hand, some approaches adopt rules to explicitly constrain certain relational facts, although the separation of rules from representation learning usually leaves these approaches prone to error propagation. Moreover, predefined rules are inflexible and may hurt performance when the data are noisy. To address these limitations, we propose a variational deep logic network that incorporates both representation learning and relational reasoning via the variational EM algorithm. The model consists of a deep neural network that learns high-level features with implicit interactions via the self-attention mechanism and a relational logic network that explicitly exploits target interactions. These two components are trained interactively to combine the best of both worlds. We conduct extensive experiments on fine-grained sentiment term extraction, end-to-end relation extraction, and end-to-end event extraction to demonstrate the effectiveness of our proposed method.
1. Introduction
Joint inference is commonly adopted in the field of information extraction (IE), for example, end-to-end relation extraction and end-to-end event extraction. Compared with a pipelined procedure, joint inference performs multiple correlated subtasks in a single model simultaneously, which avoids error propagation and exploits inter-task correlations. For example, end-to-end relation extraction involves both entity extraction and relation classification between entities. As shown in Figure 1(a), given a text input “W. Dale Nelson covers the White House for The Associated Press,” end-to-end relation extraction requires the identification of W. Dale Nelson as an entity of type person (PER), White House as an entity of type location (LOC), and The Associated Press as an entity of type organization (ORG). At the same time, the relation between W. Dale Nelson and The Associated Press needs to be classified as work_for. For end-to-end event extraction, an event consists of an event trigger and an arbitrary number of arguments. The task involves the identification and classification of the following three items:
Entity mention: An entity mention is a reference to an entity in the form of a noun phrase or a pronoun.
Event trigger: An event trigger usually refers to the main word that clearly expresses an event occurrence. Event triggers can be verbs, nouns, and occasionally adjectives.
Event argument: Event arguments refer to entities that fill specific roles in the event. They mainly include participants, namely, the entities that are involved in the event, and general event attributes such as place and time.
Various deep learning models have been proposed to jointly extract entities or events and their relations, either through parameter/feature sharing (Miwa and Bansal 2016; Katiyar and Cardie 2017) to exploit task commonalities, or by designing loss functions that consider task correlations, for example, by adopting a novel tagging scheme (Li and Ji 2014; Miwa and Sasaki 2014; Gupta, Schütze, and Andrassy 2016; Zhang, Zhang, and Fu 2017; Zheng et al. 2017). However, these joint deep models only exploit task interactions implicitly via parameter sharing or high-level feature learning, without effective integration of relational knowledge. We observe that intensive correlations or relational patterns exist among the targets being extracted. Take Figure 1(a) as an example: if we know that the entity W. Dale Nelson is a person and it has relation work_for with another entity The Associated Press, we can probably infer that The Associated Press is an organization. Note that the widely used BIO segmentation scheme for entity segmentation can be considered a special case of correlation constraints among targets; for example, “I” should not follow “O.”
To exploit such explicit dependencies among different targets, some early studies enforce constraints on the model predictions (Yang and Cardie 2013; Roth, Yih, and Yih 2007) or rely on global graphical models (Yu and Lam 2010) to produce structured predictions. These approaches, however, fail to connect the final predictions with feature updates, resulting in error propagation. Recently, logic rules have been integrated into deep learning architectures for natural language processing as a form of prior knowledge (Hu et al. 2016; Li and Srikumar 2019; Wang and Pan 2020). In existing methods, however, the rules are explicitly given and kept fixed during model learning (with only their weights learnable), which limits the expressiveness of the knowledge and its adaptation to the training data.
To address these limitations, we propose a novel marriage between deep feature learning and relational logic reasoning, named Variational Deep Logic Network (VDLN), for joint inference in the IE domain. The complex relationships among target variables can be effectively captured both implicitly and explicitly via the mutual enhancement of deep neural networks and automatic logic inference in a joint learning framework. Specifically, VDLN consists of two modules: a deep learning module 𝒬 and a logic reasoning module 𝒫. The deep learning module adopts the self-attention mechanism to explore the dependencies among the tokens in a sentence in order to generate word-level and relation-level features. It can also flexibly incorporate structured models, for example, Conditional Random Fields (CRFs) (Lafferty, McCallum, and Pereira 2001), to produce structured outputs for entity segmentation. For the logic reasoning module, we construct a novel logic network that parameterizes the logic inference process via a hierarchy of layers consisting of an atom layer and a rule layer. The final output of the logic network simulates rule entailment and reflects the probability of the target atom being true given the input atoms. The target atom can be regarded as a binary classifier for each target label. The logic network aims to learn relational correlations among the related variables, which is crucial for the task at hand. For example, the aforementioned dependency between entity and relation labels can be reflected via the first-order-logic rule: person(X1) ∧ work_for(X1, X2) ⇒ organization(X2). It is worth noting that the logic reasoning module is flexible enough to support both rule learning from simple rule templates and the integration of predefined logic rules.
To smoothly integrate these two modules and to model dependencies of correlated variables for joint inference, we propose a variational EM learning paradigm. The E-step involves learning of module 𝒬 to produce probabilistic predictions for each variable. For the M-step, the logic reasoning module 𝒫 conducts knowledge inference and updates its parameters according to the outputs of 𝒬. The alternation between E-step and M-step facilitates the integration and mutual enhancement of both knowledge reasoning and abstractive feature learning to achieve the best of both worlds.
To demonstrate our model’s generality, we apply VDLN on a range of challenging IE tasks, focusing on different kinds of correlations and with increasing levels of difficulty. Specifically, we take Aspect and Opinion Extraction as the first IE task that focuses on entity extraction by treating aspect and opinion terms as two different entity types and exploring their interactions to boost the extraction accuracy. The second IE task is End-to-End Relation Extraction, which considers correlations among entities and their relations. We use End-to-End Event Extraction as our third IE task, which contains rich correlations between entities and events. The proposed model achieves better performances across all these tasks without the need to construct any prior knowledge. To summarize, our contributions include:
We propose a novel logic-inspired network incorporating logic semantics for probabilistic reasoning, which is more expressive and beneficial for exploiting target interactions for joint inference. The logic network is able to learn effective reasoning patterns given the training corpus, and at the same time allows the integration of predefined logic rules.
We design a variational EM algorithm within our deep logic networks for IE tasks, which bridges the gap between deep feature learning and knowledge reasoning to enhance the final performance.
We conduct extensive experiments on 6 benchmark data sets across 3 IE tasks with increasing levels of difficulty to demonstrate the effectiveness and generality of our proposed model.
2. Related Work
2.1 Information Extraction
Information extraction aims to extract structured knowledge from texts (e.g., entities, relational triplets). In this paper, we mainly review three IE tasks that are related to our proposals. The first task is aspect and opinion extraction, which focuses on the identification of product aspects/attributes and their corresponding opinion expressions. Existing work either relies on predefined rules and patterns among aspect terms and opinion terms utilizing the syntactic information of a sentence (Hu and Liu 2004; Qiu et al. 2011; Li et al. 2010), or designs deep learning models considering different types of dependencies, for example, contextual dependencies (Liu, Joty, and Meng 2015; Wang et al. 2017; Li and Lam 2017; Xu et al. 2018a), syntactic dependencies (Yin et al. 2016; Wang et al. 2016), and task dependencies (Chen and Qian 2020). Another recent work (Yu, Jiang, and Xia 2019) exploits the combination of explicit rules with deep feature learning via integer linear programming. However, this integration only treats rules as fixed constraints to revise deep learning predictions, without the ability to update the rules or propagate information back to feature learning.
For end-to-end relation extraction, early works adopt a pipelined procedure that first learns an entity extraction model and then trains a relation classifier based on the extracted entities (Chan and Roth 2011; Lin et al. 2016). This strategy is prone to error propagation from the extracted entities. To resolve this limitation, subsequent works propose joint extraction models by sharing parameters (Miwa and Bansal 2016; Katiyar and Cardie 2017; Bekoulis et al. 2018; Bekoulis, Deleu, and Demeester 2018; Takanobu et al. 2019; Dixit and Al-Onaizan 2019; Dai et al. 2019a) or by designing loss functions to encode the task interactions, for example, structured perceptron (Li and Ji 2014), novel labeling strategies (Miwa and Sasaki 2014; Gupta, Schütze, and Andrassy 2016; Zhang, Zhang, and Fu 2017; Zheng et al. 2017; Wang et al. 2018), global loss (Sun et al. 2018; Adel and Schütze 2017), and triplet/answer generation (Zeng et al. 2018; Li et al. 2019). Wang and Lu (2020) proposed combining a sequence encoder and a table encoder together with rich input embeddings for joint extraction. However, these approaches only exploit correlations among the subtasks implicitly. Another strategy is to enforce relational facts via explicit rule constraints (Roth, Yih, and Yih 2007; Yang and Cardie 2013; Kate and Mooney 2010) or graphical models (Yu and Lam 2010), which are separated from feature learning.
The third task, which is more challenging, is event extraction. Pipelined models were proposed first, requiring extensive feature engineering (Ji and Grishman 2008; Liao and Grishman 2010; Patwardhan and Riloff 2009; Hong et al. 2011; McClosky, Surdeanu, and Manning 2011; Miwa et al. 2014). To capture interactions among different subtasks, graphical and structured prediction models have been proposed for joint inference of event triggers and event arguments (Poon and Vanderwende 2010; Venugopal et al. 2014; Riedel et al. 2009; Li et al. 2014; Judea and Strube 2016; Yang and Mitchell 2016). Recently, deep neural networks have also been introduced for joint prediction in the domain of event extraction (Nguyen, Cho, and Grishman 2016; Sha et al. 2018; Liu, Luo, and Huang 2018; Nguyen and Nguyen 2019; Zhang, Ji, and Sil 2019; Wadden et al. 2019). However, most existing research depends on external linguistic resources to generate semantic and syntactic features in order to enhance the final prediction. Lin et al. (2020) adopted manually designed global features to capture cross-task and cross-instance interactions.
2.2 Deep Learning with Logic Reasoning
Considering the limitation of pure deep learning models, which lack reasoning capabilities, and the inflexibility of pure symbolic models, a marriage between them has been proposed, namely, Neural-Symbolic Learning, which aims to equip distributed representation learning with symbolic reasoning capabilities, or, conversely, to help symbolic models handle uncertainty (Garcez, Broda, and Gabbay 2002; Franca, Zaverucha, and D’avila Garcez 2014; Serafini and d’Avila Garcez 2016; Evans and Grefenstette 2018; Manhaeve et al. 2018; Dong et al. 2019; Xu et al. 2018b; Tran and d’Avila Garcez 2018; Wang et al. 2019; d’Avila Garcez et al. 2019; Ciravegna et al. 2020; Lamb et al. 2020; Yang and Song 2020). Deep neural networks have been used to simulate logic reasoning by parameterizing logic operators and logic atoms with neural weights (Franca, Zaverucha, and D’avila Garcez 2014; Tran and d’Avila Garcez 2018). Another line of research focuses on the smooth integration of logic rules within deep learning frameworks (Manhaeve et al. 2018; Xu et al. 2018b). A more challenging direction is to induce logic rules automatically through representation learning and differentiable back-propagation (Evans and Grefenstette 2018; Dong et al. 2019; Wang et al. 2019; Yang and Song 2020).
In the NLP domain, Rocktäschel, Singh, and Riedel (2015) and Guo et al. (2016) embedded logic rules into the distributed feature space for knowledge graph learning. Hu et al. (2016) fused discrete logic rules into deep neural networks (DNNs) through posterior regularization, and Qu and Tang (2019) used a variational EM algorithm to distill knowledge from a graph neural network into a Markov logic network. Other works used logic rules to construct adversarial sets (Minervini et al. 2017; Minervini and Riedel 2018) or as indirect supervision to improve model training (Wang and Poon 2018). Logic knowledge has also been inserted into deep architectures as named neurons (Li and Srikumar 2019). Recently, differentiable theorem proving has been proposed, which parameterizes symbolic unification in the backward chaining process of Prolog (Gallaire and Minker 1978) with neural weight learning (Rocktäschel and Riedel 2017; Campero et al. 2018; Minervini et al. 2020). Inspired by Qu and Tang (2019), we also adopt the variational EM algorithm for knowledge distillation. But different from the aforementioned studies, we design a semantically meaningful deep architecture for automatic logic reasoning. The logic-inspired network is able to learn expressive and useful reasoning patterns adapted to the training corpus, and at the same time is flexible enough to incorporate predefined logic rules. In the domain of information extraction, Wang and Pan (2020) used predefined logic rules as a regularizer imposed on the learning of DNNs. The regularizer is realized via a discrepancy loss between the deep learning predictions and the satisfiability of their corresponding logic rules. However, this mechanism only locally influences the learning of DNNs. Compared with Wang and Pan (2020), our proposed model is able to learn different combinations of logic atoms to form the rules, and it is also flexible enough to incorporate predefined knowledge. Moreover, our EM training algorithm alternates between an inference step and a learning step to achieve mutual enhancement, which globally reinforces the learning of both modules instead of applying sample-wise regularization.
3. Problem Definition and Preliminary
For ease of illustration, we first list all the symbols used in this work together with their descriptions in Table 1.
| Symbols | Description |
| --- | --- |
| 𝓔, 𝓡, 𝒱 | the sets of all entity types, relation categories, and event trigger categories |
| E | the set of segmentation labels E = {Bj, Ij, O}j∈𝓔 with j ∈ 𝓔 an entity type |
| 𝒩ϵ, 𝒩r | the set of all words; the set of all relations within a sentence |
| D | a set of atoms D = {d1, …, dN} |
| ϵi, r(ϵi,ϵj), vi | a constant representing an entity, a relation, an event trigger |
| wi, yi, **y**i | an input word, an output label, an output prediction vector |
| θ, ϕ | all the parameters corresponding to module 𝒬 and module 𝒫, respectively |
| xi, Xi | a logic constant; a logic variable |
| di | a logic atom consisting of a predicate and arguments: di = pred(X1, …, Xm) |
| h | the head atom of a clause d1 ∧ … ∧ dn ⇒ h |
| υ(⋅) | the probabilistic value of an atom or a clause, υ(⋅) ∈ [0, 1] |
| x, h, u, α | vector representations of an input, a hidden neuron, an entity type, and attention scores |
| W, b | a trainable transformation matrix; a trainable bias vector |
| vi,n, Vn | a trainable transformation vector; a bilinear transformation matrix for atom evaluations |
| q, p | output probabilistic vectors from module 𝒬 and module 𝒫, respectively |
| R, γ | a logic rule identifier; the confidence score of a rule |
| m ∈ 𝒩ϵ ∪ 𝒩r | a logic constant referring to either a word or a relation |
| ctx(m) | the set of logic constants that form the context of m |
| **ctx**(m) | a vector of probabilistic inputs for module 𝒫: **ctx**(m) = (qm1, qm2, …, qm\|ctx(m)\|) |
| σ | the sigmoid function |
| βt, d | a weight vector that weighs each logic rule; a vector of atom values |
| Y, Z | the set of target random variables; the set of hidden random variables |
| p(⋅), q(⋅) | probabilistic distributions |
3.1 Problem Definition
For all three IE tasks, the target variables can be categorized as: (1) entities, with the set of all entity types denoted by 𝓔; (2) events, with 𝒱 denoting the set of all event types; and (3) relational triplets (s, r, o) governed by a set of relation categories r ∈ 𝓡, with s and o being the subject and object of relation r, respectively. For convenience, we use r(s,o) to denote the relational triplet. Given an input sentence {w1, w2, …, wn}, entity extraction is formalized as a sequence labeling problem that generates an entity segmentation. Denote the set of segmentation labels by E = {Bj, Ij, O}j∈𝓔, with Bj, Ij, O indicating the beginning, inside, and outside of an entity of type j, respectively. The output is a label sequence {y1, y2, …, yn}, where yi ∈ E. End-to-end relation extraction aims to generate both the entity segmentation and a set of relational triplets r(ϵ1,ϵ2), where ϵ1 and ϵ2 correspond to entities. End-to-end event extraction consists of 3 subtasks: entity extraction, event trigger extraction, and event argument prediction. Event trigger extraction is formalized as a token-based classification problem with |𝒱| classes. Event argument prediction aims to produce relational triplets r(ϵ,v), where ϵ is an entity, v is an event trigger, and r denotes the argument relation between ϵ and v. For relational triplet prediction, we pair all candidate entities (or entities with event triggers) that are extracted in the first stage to predict the relation label.
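To make the formalization concrete, the sketch below shows how the running example from the Introduction would be encoded under this scheme (a minimal illustration; the tokenization and label abbreviations are our assumptions based on the example in Figure 1(a)):

```python
# Hypothetical encoding of the running example under the BIO scheme;
# entity types abbreviated as PER/LOC/ORG for illustration.
tokens = ["W.", "Dale", "Nelson", "covers", "the", "White", "House",
          "for", "The", "Associated", "Press"]
labels = ["B-PER", "I-PER", "I-PER", "O", "O", "B-LOC", "I-LOC",
          "O", "B-ORG", "I-ORG", "I-ORG"]

# A relational triplet r(s, o): subject and object are extracted entity
# spans, given here as (start, end) token offsets.
relations = [("work_for", (0, 3), (8, 11))]  # W. Dale Nelson -> The Associated Press
```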
3.2 Variational EM
Note that Qu and Tang (2019) adopted this variational EM formulation to distill information from graph neural networks into Markov logic networks with given logic rules. Compared to other existing works that either use manually constructed logic rules to enhance the learning of DNNs or learn logic rules but are limited in terms of computational efficiency, we build on top of Qu and Tang (2019) to achieve mutual learning of both DNNs and logic reasoning.
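For reference, the standard variational EM objective (which we assume here, with notation from Table 1: Y the target variables, Z the hidden variables, θ and ϕ the parameters of modules 𝒬 and 𝒫) lower-bounds the log-likelihood as:

```latex
\log p_{\phi}(Y) \;\ge\;
\mathbb{E}_{q_{\theta}(Z)}\!\left[\log p_{\phi}(Y, Z)\right]
\;-\;
\mathbb{E}_{q_{\theta}(Z)}\!\left[\log q_{\theta}(Z)\right].
```

The E-step updates the variational distribution qθ to approximate the posterior over the hidden variables, and the M-step updates pϕ to maximize the expected log-likelihood under qθ.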
3.3 First-Order Logic
In our problem setting, we treat each classifier as an FOL entailment and define each target label as the head atom of a set of FOL formulas. For example, d1 ∧ d2 ∧ … ∧ dn ⇒ person(X) explains how person(X) can be deduced from its body atoms. In this case, if d1 ∧ d2 ∧ … ∧ dn evaluates to True, person(X) will also be True.
To encode uncertainties within probabilistic logic, we assign each FOL formula d1 ∧ d2 ∧ … ∧ dn ⇒ h a learnable confidence score γ ∈ [0, 1]. The higher the score, the larger the role the formula plays in the inference process.
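As a concrete illustration of such soft evaluation, one common relaxation (an assumption here for exposition; the operators our model actually uses are defined later, in (14)-(19)) scores a formula's body with a product t-norm and weights the entailment by its confidence:

```latex
% A common soft-logic relaxation, shown for illustration only:
\upsilon(d_1 \wedge \dots \wedge d_n) = \prod_{i=1}^{n} \upsilon(d_i),
\qquad
\upsilon(h) \;\approx\; \gamma \cdot \upsilon(d_1 \wedge \dots \wedge d_n).
```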
4. Motivation
Conventional deep learning usually lacks knowledge integration and fails to explicitly model the crucial interactions among the targets. Recently, logic reasoning has been adopted and integrated with DNNs to enhance performance by introducing knowledge as FOL rules. Among these approaches, probabilistic logic converts the hard 0/1 assignment to soft probabilities (Nilsson 1986), which facilitates optimization through gradient descent. However, pre-designed FOL rules may not be expressive enough to represent the inherent patterns and cannot adapt to a given training data set. To address this limitation, we propose VDLNs, which inherit the representational power of deep learning and at the same time simulate the logic rule learning process via a novel logic network consisting of a hierarchy of an atom layer, a rule layer, and an output layer. Given some predefined rule templates, the atom layer implements a neural transformation process that converts the inputs to a set of abstract atoms. Our logic network then learns to discriminatively select the most relevant atoms in the atom layer to compose a logic rule in the rule layer. This network design avoids the manual, task-dependent construction of atoms for each rule. It also makes it easy to inject prior knowledge into the logic network when rules are easy to obtain. The combination of automatically learned and predefined logic rules is realized via a form of residual connection.
To integrate a logic system with deep learning, most existing works only use knowledge to regularize feature learning or feed deep learning outputs as inputs to the logic system, ignoring the mutual interactions. In this work, we introduce a novel integration of DNNs and knowledge reasoning via variational EM. Note that Qu, Bengio, and Tang (2019) proposed to adopt variational EM for semi-supervised classification by associating two graph neural networks. Qu and Tang (2019) further extended the algorithm for efficient inference in Markov logic (Richardson and Domingos 2006). However, their work only updates the weights of predefined rules without learning the predicates of the rules. Different from previous works, our proposed model automatically learns useful predicates and the weights of different instantiations of those predicates, exploring the associations among highly dependent classifiers for joint inference.
5. Methodology
An overview of the proposed model VDLN is shown in Figure 2. It consists of 2 modules: (1) Module 𝒬 consists of a DNN that transforms the input sequence of text into abstractive features hi and produces the probabilistic outputs qi. (2) A logic module 𝒫 consists of a set of logic networks (LNets), with one LNet corresponding to each specific word wi and relation r. Each LNet takes ctx(wi) (or ctx(r)) as input, which collects information from all of its associated variables, conducts knowledge reasoning among these variables, and generates the final probabilistic evaluations {pi}. Note that in VDLN, besides modeling complex correlations between targets, the logic module 𝒫 also implements the BIO labeling scheme. The entire model is trained via the variational EM algorithm, which alternates between an E-step (inference) and an M-step (learning). In the E-step, the deep module 𝒬 generates soft predictions for each word and candidate relation by distilling knowledge from 𝒫. In the M-step, the logic module 𝒫 takes the predictions of 𝒬 as input and generates a probabilistic output for each target class of each word and relation.

As a more concrete example, the overall procedure is the following. Given an input sentence of 11 tokens, “W. Dale Nelson covers the White House for The Associated Press,” module 𝒬 first produces the hidden representations {h1, …, h11} and the output vectors {q1, …, q11}, as shown in Figure 2. Likewise, a relation output vector qr is generated for each pair of candidate entities predicted via hi, for example, (W. Dale Nelson, Associated Press), based on their hidden representations {hi}i∈{1,2,3,10,11} and their attention scores. These vectors {{qr}, q1, …, q11}, where {qr} collects the set of all entity pairs for relation prediction, are then used to form the input ctx(wi) (or ctx(r)) for each word (or relation) in module 𝒫, which produces the final probabilistic output vectors {{pr}, p1, …, p11} for all the words and relations. With the output vectors from both modules, we run the EM training algorithm, which first updates the parameters of 𝒬 by treating the predictions pi from 𝒫 as supervision labels. Then, in the next iteration, we update the parameters of 𝒫 by treating the predictions qi from 𝒬 as supervision labels.
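The alternation can be summarized in the minimal runnable sketch below, using two tiny linear classifiers as stand-ins for 𝒬 and 𝒫 (an illustrative assumption; the real modules are the self-attention DNN of Section 5.1 and the logic networks of Section 5.2, and the 0.5 ground-truth mixing follows Section 6.1):

```python
# Minimal runnable sketch of the variational EM alternation (stand-in modules).
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class TinyModule:
    """A linear classifier standing in for module Q or P."""
    def __init__(self, dim, n_classes, lr=0.1):
        self.W = rng.normal(scale=0.1, size=(dim, n_classes))
        self.lr = lr
    def predict(self, X):
        return softmax(X @ self.W)
    def update(self, X, targets):
        # One gradient step on cross-entropy against (possibly soft) targets.
        probs = self.predict(X)
        self.W -= self.lr * X.T @ (probs - targets) / len(X)

X = rng.normal(size=(32, 8))              # stand-in token features
gold = np.eye(3)[rng.integers(0, 3, 32)]  # stand-in one-hot labels

Q, P = TinyModule(8, 3), TinyModule(8, 3)
for _ in range(50):                       # pretrain Q on ground truth (Sec. 6.3)
    Q.update(X, gold)

for _ in range(20):                       # variational EM alternation
    # E-step: update Q toward P's predictions (replaced by ground truth
    # with probability 0.5, per Section 6.1).
    Q.update(X, P.predict(X) if rng.random() < 0.5 else gold)
    # M-step: update P toward Q's predictions.
    P.update(X, Q.predict(X) if rng.random() < 0.5 else gold)
```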
In the following, we will describe the architecture of VDLN in Section 5.1 and Section 5.2 in detail.
5.1 Deep Learning with Self-Attention
For end-to-end event extraction, we use two separate classifiers for entity and event trigger prediction, respectively, which correspond to two different sets of parameters, {Wϵ, bϵ} and {Wv, bv}, in (11).
For end-to-end event extraction, the event argument relation triplet (ϵ, r, v) is generated in a similar manner by replacing ϵ2 with event trigger v. We additionally use a binary classifier to decide whether there is a relation between the entity and event trigger due to the sparsity of relation labels.
5.2 Logic Network
As described in Section 5.1, the deep learning model only implicitly learns word correlations via high-level features and attention, but ignores the explicit correlations among target variables, especially those of different types. In fact, the entity/event labels are highly dependent on the relation types, for example, “person(wi) ∧ work_for(r(wi,wj)) ⇒ organization(wj).” Moreover, the segmentation labels are highly correlated within a context window. Although such segmentation interactions could be captured in 𝒬 via a structured loss, handling them in the logic module is more efficient and can model more complex correlations together with relation information. Here, we treat such segmentation dependencies as a form of knowledge reasoning.
Recently, some approaches have been proposed to combine deep learning with logic reasoning to regularize the learning process or induce new rules. However, most of them are not expressive enough, limiting themselves to tasks within the logic domain, or are too computationally expensive for real application domains. There is also a lack of focus on directly modeling rules for classifiers. Regarding expressiveness, Shanahan et al. (2019) proposed a relational neural network that only translates to a single logic rule, which is propositional in nature. We propose a novel logic network within the logic module 𝒫 that simulates FOL and enhances reasoning capabilities through multilevel rule construction within a deep architecture.
As shown in Figure 2, 𝒫 consists of a separate logic network (LNet) applied to each word and relation. Following the introduction of FOL in Section 3.3, we first adapt the problem to the logic domain, where a logic variable corresponds to a word w or a relational triplet r(ϵ1,ϵ2). All possible words and relations form the set of logic constants. Each target class y ∈ E ∪ 𝒱 ∪ 𝓡 can be regarded as a predicate and, when taking constants as arguments, becomes a grounded atom. When the target class y ∈ E ∪ 𝒱 is an entity type or event type, it takes a single word (or phrase) as the argument, for example, person(W. Dale Nelson) with y = person. When the target class y ∈ 𝓡 is a relation, it takes a relational triplet as the argument. For example, work_for(r(ϵ1,ϵ2)) with y = work_for specifies that entities ϵ1 and ϵ2 have the relation work_for. We use d(x1, …, xn) to denote an n-ary atom and υ(d) ∈ [0, 1] to denote the probability of the atom being true. For example, υ(work_for(r(ϵ1,ϵ2))) = 0.8 indicates that ϵ1 and ϵ2 have the relation “work_for” with probability 0.8.
As discussed in Section 3.3, we treat each target class as a form of logic entailment where the target class is the head atom h of a set of logic rules/formulas R ∈ {R1, …, RS} with R : d1 ∧ d2 ∧ … ∧ dT ⇒ h. Here R is a rule identifier. As a concrete example, if we aim to predict whether a text segment ϵj belongs to the target class “organization (entity),” we may define a logic rule to entail the target entity type “organization”: person(ϵi) ∧ work_for(r(ϵi,ϵj)) ⇒ organization(ϵj), where the head atom h = organization(ϵj) corresponds to the target entity type. The result depends on the rule's precondition, which consists of two atoms d1 = person(ϵi) and d2 = work_for(r(ϵi,ϵj)). An FOL program then aims to produce the truth probability of h given the set of all possible rules {R1, …, RS}. In most cases, such rules may not be readily available. Hence, it is desirable to learn the FOL rules automatically. To achieve this, we use a separate logic network (LNet) to generate relevant rules corresponding to the same head atom and evaluate the head atom's truth probability through the rules' preconditions.
The detailed computation process for each LNet is shown in Figure 3. For a logic constant m ∈ 𝒩ϵ ∪ 𝒩r referring to either a word or a relation, we build a set consisting of its relevant contexts ctx(m) = {m1, …, m|ctx(m)|}. Then the input to a LNet becomes ctx(m) = (qm1, qm2, …, qm|ctx(m)|), which combines deep learning predictions of each element in ctx(m).1 The LNet aims to produce probabilistic evaluations of a set of N atoms D = {d1, …, dN} in the atom layer, which are in turn used to form a logic program consisting of a set of logic rules {R1, …, RS} of the form dj1 ∧ … ∧ djT ⇒ h. All these rules share an identical head atom h = ym that indicates whether m belongs to a target class ym ∈ E ∪ 𝒱 ∪ 𝓡. The final output is a probabilistic evaluation υ(h) by accumulating all the logic rules and considering their confidence scores γ1, …, γS. In this way, the LNet is able to model the correlations of related constants formed by each word’s or relation’s relevant contexts.
Take the sentence “W. Dale Nelson covers the White House for The Associated Press” as an example. To make predictions on the word Dale in module 𝒫, we first identify its context ctx(Dale) = {W., Dale, Nelson, The Associated Press, r(Dale, The Associated Press)} if The Associated Press is extracted as an entity. The input ctx(Dale) is then the concatenation of all the prediction vectors q corresponding to each element in ctx(Dale) obtained from module 𝒬: ctx(Dale) = (qW., qDale, qNelson, qThe Associated Press, qr(Dale, The Associated Press)). Given ctx(Dale), the values of the synthetic atoms are obtained as follows. Specifically, for the first input vector qW., corresponding to the previous word of Dale, we produce the set D1(1) = {d1,1(1), …, d1,n1(1)} with values υ(d1,1(1)) = σ(v1,1⊤qW.), …, υ(d1,n1(1)) = σ(v1,n1⊤qW.), corresponding to n1 atoms capturing different properties of the previous word of Dale, according to (14). We treat these atoms as unary synthetic atoms. In a similar manner, we obtain D2(1) and D3(1) from qDale and qNelson as another 2 sets of n1 1-ary atoms. Each of the n2 produced 2-ary atoms dn(2) ∈ D(2) corresponds to an interaction property among Dale, The Associated Press, and r(Dale, The Associated Press) via υ(dn(2)) = σ(qThe Associated Press⊤ Vn qr(Dale, The Associated Press)), according to (15).
To make the logic network flexible and comprehensive, we further enhance the LNet with residual connections to incorporate predefined atoms and logic rules when they are provided. As shown in Figure 3, a concat operation concatenates the synthetic atoms and the original inputs to form the atom layer D = {d1, …, dNctx(m), dNctx(m)+1, …, dN}, where the first Nctx(m) = |Dctx(m)| atoms are the synthetic atoms, while the last N − Nctx(m) atoms are the predefined atoms. Different from synthetic atoms, which do not have exact semantic meanings, the predefined atoms are formed by the original inputs qi and specify the probability of each target class for the input word/relation; for example, an atom dj = person(ϵ) with Nctx(m) + 1 ≤ j ≤ N inherits the value υ(dj) = (qϵ)[person], the probability of label person for ϵ. The predefined atoms facilitate the incorporation of prior knowledge, for example, person(ϵi) ∧ work_for(r(ϵi,ϵj)) ⇒ organization(ϵj), into the rule layer.
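The numpy sketch below traces one LNet forward pass end to end. It is a sketch under stated assumptions: the unary and binary atom evaluations mirror the forms σ(v⊤q) and σ(q1⊤Vnq2) quoted above, but since only (14), (15), and (19) are cited in this section, the soft atom selection in the rule layer and the confidence-weighted aggregation are our illustrative stand-ins, not the paper's exact operators.

```python
# Illustrative LNet forward pass (assumptions noted inline).
import numpy as np

rng = np.random.default_rng(0)
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

C, n1, n2, T, S = 5, 20, 20, 8, 30   # label dim; atom counts; rule size/count
q_prev, q_cur, q_rel = rng.random(C), rng.random(C), rng.random(C)  # from Q

# Atom layer: 1-ary synthetic atoms v(d) = sigma(v^T q), one set per input.
v = rng.normal(scale=0.1, size=(3, n1, C))
q_in = np.stack([q_prev, q_cur, q_rel])
unary = sigmoid(np.einsum('knc,kc->kn', v, q_in)).ravel()

# 2-ary synthetic atoms v(d) = sigma(q1^T V_n q2) for interacting inputs.
V = rng.normal(scale=0.1, size=(n2, C, C))
binary = sigmoid(np.einsum('c,ncd,d->n', q_cur, V, q_rel))

# Residual concat with predefined atoms (here: the raw class probabilities).
d = np.concatenate([unary, binary, q_cur])

# Rule layer (assumed form): each of S rules softly selects T body atoms;
# a rule's value is the product of its selected atoms' values.
beta = rng.normal(size=(S, T, d.size))
sel = np.apply_along_axis(softmax, 2, beta)           # soft selection per slot
rule_vals = np.einsum('stn,n->st', sel, d).prod(axis=1)

# Output layer: accumulate rules weighted by confidences gamma in [0, 1].
gamma = sigmoid(rng.normal(size=S))
v_h = float((gamma * rule_vals).sum() / gamma.sum())  # truth probability of h
print(f"v(h) = {v_h:.3f}")
```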
As shown both in Figure 2 and (19), the output of LNet υ(h) is used to produce the probabilistic vector pm as the output of module 𝒫 for each constant m. These probabilistic vectors pm’s, together with the outputs qm’s from module 𝒬, will further be used to train our joint model via the EM algorithm, as discussed in the sequel.
6. Learning with Expectation-Maximization
6.1 Inference
During training, as the ground truths are available, we also utilize the label information to update q. Specifically, for each m, we update q using the aforementioned strategy with probability 0.5; otherwise, we replace ym (or {y1, …, yn}) predicted from 𝒫 with the ground-truth label to update q.
6.2 Learning
To compute p(ym|ctx(m)) within the logic module 𝒫, we define the context ctx(m) of each variable m to be those variables that have intensive correlations with m for the task at hand by constructing some rule templates given the output {qi}’s from module 𝒬. When m = wi ∈ 𝒩ϵ, we use 3 types of dependencies for the rule templates:
The prediction of a word wi from 𝒬 is a direct precondition: qi → pi.
The prediction of another word wj from 𝒬 that has relation with wi could inform the target prediction: qj, qrij → pi.
The prediction of wi’s preceding and following words from 𝒬 could inform the target prediction: qi−1, qi+1 → pi. Note that this type of dependency is applicable when the structured prediction is implemented in the logic module (𝒫), not the deep learning module (𝒬).
Similarly, when m = rij ∈ 𝒩r, we use 2 types of dependencies for the rule templates:
The prediction of rij from 𝒬 is a direct precondition: qrij → prij.
The predictions of wi and wj from 𝒬 could inform the target: qi, qj → prij.
Given these dependency templates, we construct the input of the logic network ctx(m) for m = wi ∈ 𝒩ϵ as ctx(m) = (qi−1, qi, qi+1, qj, qrij) for entity (or event) prediction of each word wi, where qi−1, qi, and qi+1 are separately used to construct 1-ary atoms, and both qj and qrij are used to construct 2-ary atoms in module 𝒫. Intuitively, the corresponding words and relations form the context of wi, denoted by ctx(i) = {wi−1, wi, wi+1, wj, rij}. Similarly, the input ctx(m) when m = rij ∈ 𝒩r for relation prediction of rij is ctx(m) = (qrij, qi, qj). We use qrij, qi, qj, respectively, to produce 1-ary atoms. Again, both qi and qj are used to produce 2-ary atoms. Given the construction of ctx(m), the output pm of the logic network will then be computed following (19).
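Under these templates, assembling the LNet inputs is mechanical. A small sketch follows (the q vectors stand in for module 𝒬's outputs; the helper names are ours):

```python
# Building ctx(m) from module Q's prediction vectors (hypothetical helpers).
def ctx_word(q, i, j, q_rel):
    """ctx(m) for entity/event prediction of word w_i:
    (q_{i-1}, q_i, q_{i+1}, q_j, q_{r_ij}). The first three feed 1-ary
    atoms; q_j and q_{r_ij} additionally feed 2-ary atoms."""
    return (q[i - 1], q[i], q[i + 1], q[j], q_rel)

def ctx_relation(q, i, j, q_rel):
    """ctx(m) for relation prediction of r_ij: (q_{r_ij}, q_i, q_j).
    All three feed 1-ary atoms; q_i and q_j also feed 2-ary atoms."""
    return (q_rel, q[i], q[j])
```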
6.3 Optimization
Overall, the training process alternates between the variational E-step and M-step, updating module 𝒫 using (24) and module 𝒬 using (22). For both steps, the output of 𝒫 is obtained by sampling the context predictions of the target from 𝒬, which reflects the intensive interactions between these two modules. This interaction is further enhanced because the two distributions learn to approximate each other throughout the training process. To facilitate training, we first pretrain 𝒬 using the ground-truth labels for several iterations before the variational EM procedure. In the testing phase, both 𝒫 and 𝒬 can be used to generate predictions. In our experiments, we use a strategy similar to ensemble learning: each module is assigned a weight, tuned on the validation set, and the final predictions are a weighted average of the two modules' outputs. The complete training procedure for end-to-end relation extraction is shown in Algorithm 1.
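The test-time combination is then a one-liner; the sketch below uses the 0.6/0.4 weights reported in Section 7.2 as defaults (the function name is ours):

```python
import numpy as np

def ensemble_predict(q_probs, p_probs, w_q=0.6, w_p=0.4):
    """Weighted average of the two modules' probabilistic outputs,
    followed by an argmax over target classes."""
    return np.argmax(w_q * np.asarray(q_probs) + w_p * np.asarray(p_probs),
                     axis=-1)
```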
7. Experiment
7.1 Tasks and Data
We conduct experiments on 6 benchmark data sets from 3 IE tasks:
Aspect and Opinion Terms Extraction: Aspect terms refer to the product features or attributes that the users commented on in the customer reviews. Opinion terms are those carrying subjective opinions toward the products or services. For example, given a review sentence “The service staff is terrible.”, service staff is an aspect term and terrible is the opinion term. We use a restaurant review corpus and a laptop review corpus from SemEval 2014 (Pontiki et al. 2014). The statistics of the two data sets are shown in Table 2.
End-to-end Relation Extraction: This task involves the identification and classification of both entities and relations between entities. For this task, three benchmark data sets are used: CoNLL04 (Roth and Yih 2004), ACE04 (Doddington et al. 2004), and ACE05 (Li and Ji 2014). As shown in Table 3, CoNLL04 consists of 4 entity types and 5 relation categories. ACE04 defines 7 entity types with 7 relation categories, and ACE05 adopts the same entity types as ACE04 but defines 6 relation types. CoNLL04 and ACE04 do not provide official train/test splits, hence we conduct 3-fold and 5-fold cross-validation on CoNLL04 and ACE04, respectively, to report our final results. We follow the same preprocessing and data split as Li and Ji (2014) on the ACE05 data set.
End-to-end Event Extraction: This task involves three subtasks, namely, extraction and classification of entity mentions, extraction and classification of event triggers, and discovery of relationships between entity mentions and event triggers for event argument extraction and classification. For this task, the same ACE05 data set is used. For entity mentions, we consider the ACE entity types PER, ORG, GPE, LOC, FAC, VEH, and WEA, as well as ACE VALUE and TIME expressions, following the common setting of existing works. In total, 33 event subtypes are involved in the event trigger classification task. The total number of different argument roles for entities participating in various events is 35, and we collapse 8 of them that are time-related, following Yang and Mitchell (2016). The detailed statistics of the ACE05 data set for event extraction are shown in Table 4. For evaluation, we treat an entity as correct if both its entity type and its offsets match one of the ground-truth entities. An event trigger is correctly identified if its offset matches one of the reference event triggers, and it is regarded as correctly classified if its type is also correct. An argument role is correctly identified if the corresponding entity type, entity offset, and event type match one of the reference argument roles, and it is correctly classified if the argument role is also correct.
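These correctness criteria translate directly into set membership tests over (offset, type) tuples; a small sketch for the entity case (our own helper, with entities represented as (start, end, type) triples):

```python
# Micro-F1 over predicted vs. ground-truth entities, where an entity is
# correct only if both its offsets and its type match (as defined above).
def micro_f1(pred, gold):
    """pred and gold are sets of (start, end, entity_type) triples."""
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Example: one of two predictions matches one of two gold entities.
print(micro_f1({(0, 3, "PER"), (5, 7, "ORG")},
               {(0, 3, "PER"), (8, 11, "ORG")}))  # 0.5
```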
| Data | Split | # Sentences | Entity Type |
| --- | --- | --- | --- |
| Restaurant 14 | train | 3,041 | aspect, opinion |
| | test | 800 | |
| Laptop 14 | train | 3,045 | aspect, opinion |
| | test | 800 | |
| Data | Split | # Sentences | # Entities | # Relations | Entity Type | Relation Type |
| --- | --- | --- | --- | --- | --- | --- |
| CoNLL04 | | 1,437 | 5,336 | 2,040 | person, location, organization, other | located_in, org_based_in, work_for, live_in, kill |
| ACE04 | | 6,789 | 22,740 | 4,368 | person, vehicle, organization, location, facility, weapon, geographical entity | physical, PER/ORG-affiliation, employment-organization, person-social, GPE-affiliation, Agent-Artifact, discourse |
| ACE05 | train | 7,273 | 26,470 | 4,779 | person, vehicle, organization, location, facility, weapon, geographical entity | physical, ORG-affiliation, employment-organization, agent-artifact, part-whole, person-social, GPE-affiliation |
| | dev | 1,765 | 6,421 | 1,179 | | |
| | test | 1,535 | 5,476 | 1,147 | | |
7.2 Experimental Setting
To integrate the self-attention mechanism, we use a pretrained BERT model (base-uncased) (Devlin et al. 2019) to initialize all the word embeddings and to produce the attention scores for each pair of words. The batch size is 20, and the dimension of the BiGRU is 100 with dropout rate 0.1. For the logic network, we set the number of 1-ary atoms (n1) and 2-ary atoms (n2) for each input variable to 20 and the number of body atoms in each rule to T = 8. The total number of rules is set to S = 30. During training, we use Adadelta with initial learning rate 0.01 for module 𝒬 and Adam with initial learning rate 0.01 for module 𝒫. The sampling rate for both the E-step and the M-step is set to 0.5, that is, 50% of the time the ground-truth label is used for learning the desired module. For each experiment, we first pretrain 𝒬 for 50 epochs and then alternate between 𝒫 and 𝒬, with 2 epochs for each module per alternation. The final prediction is made by the ensemble strategy with weights 0.6 and 0.4 for 𝒬 and 𝒫, respectively. All the hyperparameters are selected via the validation set. For evaluation, we use micro-F1 scores on non-negative classes. An entity is correct if both its segmentation and its entity type are correct. A relation is correct if both of its entities (events) and the relation type match the ground-truth label. We use the same evaluation metric as Yang and Mitchell (2016) for event extraction.
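For reference, the settings above can be collected into a single configuration sketch (values as reported here; the key names are our own):

```python
# Hyperparameters from Section 7.2, gathered for reference (key names ours).
config = {
    "encoder": "bert-base-uncased",
    "batch_size": 20,
    "bigru_dim": 100,
    "dropout": 0.1,
    "n_unary_atoms": 20,    # n1, per input variable
    "n_binary_atoms": 20,   # n2, per input variable
    "atoms_per_rule": 8,    # T
    "n_rules": 30,          # S
    "optimizer_Q": ("Adadelta", 0.01),
    "optimizer_P": ("Adam", 0.01),
    "sampling_rate": 0.5,   # ground-truth mixing for E- and M-steps
    "pretrain_epochs_Q": 50,
    "epochs_per_alternation": 2,
    "ensemble_weights": {"Q": 0.6, "P": 0.4},
}
```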
For time complexity, we report the duration using a single Tesla V100 (250W) GPU. A pure neural model (e.g., BERT) takes 38s and 28s to train 1 epoch on the Res14 and Lap14 data sets, respectively. Training 1 iteration of VDLN, which consists of 2 epochs for each module, takes 556s and 522s on Res14 and Lap14, respectively. On CoNLL04, it takes 56s to train 1 epoch of BERT and 271s to train 1 iteration of VDLN. On ACE04, it takes 131s to train 1 epoch of BERT and 967s to train 1 iteration of VDLN. On ACE05, it takes 242s to train 1 epoch of BERT and 1,713s to train 1 iteration of VDLN. For memory usage, experiments on the Res14 and Lap14 data occupy around 8.9GB. The experiment on CoNLL04 takes 15.9GB. Experiments on ACE04 and ACE05 occupy around 6.7GB. These memory footprints are almost the same as those of pure deep learning models.
7.3 Result
Aspect and Opinion Terms Extraction: To demonstrate the effectiveness of our proposed model, we compare with the following most recent baselines:
GInf: A pipelined model combining deep neural networks with integer linear programming (Yu, Jiang, and Xia 2019). The predictions produced from the deep neural networks are taken as input to the integer linear programming system where explicit relational constraints among aspect terms and opinion terms are enforced considering syntactic information.
Rule-distill: A posterior-regularization-based framework to regularize deep learning predictions via prior knowledge. The training is conducted via a teacher-student knowledge distillation (Hu et al. 2016). To adapt this model to our problem setting, we construct a few logic rules, as shown in Table 14 to form the teacher network. For fairness, we use the same neural model (module 𝒬) as the student network.
DLogic: A joint model incorporating explicit logic rules into the deep learning model (Wang and Pan 2020). The deep learning predictions are made as probabilistic evaluations of input atoms to produce the output for the head atom of each rule. Then a discrepancy loss is computed to align the deep learning predictions with a set of predefined logic rules.
DLogic*: Replace the deep neural networks of DLogic with the one used in our proposed model for fair evaluations.
VDLN: The proposed model consisting of a logic module 𝒫 and a deep learning module 𝒬.
SOTA (Chen and Qian 2020): The current state-of-the-art model on aspect and opinion terms extraction, which implements BERT-large with collaborative learning considering the interactions among aspect terms, opinion terms, and sentiment polarities.
Table 5 shows the results for aspect and opinion terms extraction. Because some of the baseline models do not have published code, we only conduct 3 different runs for Rule-distill, DLogic, DLogic*, and our proposed model VDLN; for these models, results averaged over the 3 runs are reported, while for the other baseline models we use the results as reported in their papers. This task can be cast as a special case of entity extraction by treating aspect terms and opinion terms as 2 different entity types. Yu, Jiang, and Xia (2019) incorporated explicit relational knowledge among aspect and opinion words through integer linear programming. However, the separation of knowledge reasoning from the DNN during learning makes the result suboptimal. In comparison, VDLN makes these 2 components interactive via variational learning. Compared with Rule-distill (Hu et al. 2016), VDLN outperforms the teacher-student network, demonstrating the advantage of the EM algorithm for mutual learning and the ability to learn correlation patterns as logic rules. To verify the expressiveness of our proposed logic network for knowledge reasoning, we compare with explicit rule integration (Wang and Pan 2020), which bridges the DNN outputs with explicit logic rules by minimizing their discrepancies. For a fair comparison, we replace their DNN module with ours, denoted by DLogic*. Clearly, VDLN gives better performance at all times, which demonstrates the advantage of automatically learning a logic network over fixed rules. The SOTA model (Chen and Qian 2020) adopted BERT-large as the feature learning backbone and implemented a multitask learning framework with a collaborative learning mechanism to explore interactions among target terms and sentiment polarities for joint extraction. VDLN with logic reasoning outperforms the SOTA model even with a BERT-base neural component. In general, VDLN significantly outperforms all baselines with p < 0.05 using a paired t-test, except the SOTA model on opinion extraction of Res14.
| Data | Term | GInf | Rule-distill | DLogic | DLogic* | VDLN | SOTA (Chen and Qian 2020) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Res14 | Aspect | 84.50 | 87.27* | 85.41* | 86.57* | 87.71 | 86.71 |
| | Opinion | 85.20 | 86.40* | 84.21* | 86.11* | 87.32 | 87.18 |
| Lap14 | Aspect | 78.69 | 81.05* | 81.01* | 81.25* | 82.44 | 82.34 |
| | Opinion | 79.89 | 80.56* | 79.57* | 79.89* | 81.40 | 81.00 |
End-to-End Relation Extraction: Besides the aforementioned baselines DLogic and DLogic*, we further adopt the following baselines:
Gopt: A globally optimized neural model for end-to-end relation extraction (Zhang, Zhang, and Fu 2017). The work converts the entity and relation extraction problem into a single table filling task, which produces a score for each label in the next step given the state of a partially-filled table. Moreover, global optimization is used, which treats the entire sentence as a unit.
MtQA: The extraction of entities and relations is cast as the task of identifying answer spans from the context given some question templates (Li et al. 2019). The question encodes relevant information corresponding to the target entity or relation to be identified.
SpanRel: An end-to-end deep learning model based on span-level predictions (Dixit and Al-Onaizan 2019). Instead of token-level modeling, span-based models take the features corresponding to all possible spans within a sentence for both entity and relation predictions.
SOTA (Wang and Lu 2020): A joint model using two different encoders, namely, a table encoder and a sequence encoder to intensively exploit the target interactions, together with rich encodings combining word vectors, character vectors, and strong pretrained contextualized vectors.
Table 6 lists the performance of the proposed model and the baseline models on each end-to-end relation extraction data set. Note that although Luan et al. (2019) also showed promising results on the ACE04 and ACE05 data sets, their model depends on auxiliary coreference supervision, which makes a direct comparison unfair; nevertheless, we still achieve comparable performance. MtQA (Li et al. 2019) treats the task as a question answering problem with predefined question templates and uses BERT as the backbone. Compared with Gopt (Zhang, Zhang, and Fu 2017), which lacks self-attention, the improvement shows the advantage of modeling token-level dependencies for information extraction. The results also verify our consistent improvement over Rule-distill (Hu et al. 2016). Compared with the SOTA model (Wang and Lu 2020), which adopted rich encodings combining word vectors (GloVe), character embeddings, and contextualized embeddings (ALBERT-large, an extensively pretrained large model), our model produces slightly lower performance. We conjecture that the high result of the SOTA model depends on its rich encodings: when ALBERT-large is replaced with BERT in their model, the F1 score for entity extraction on ACE05 drops to 87.8, according to Wang and Lu (2020). In general, VDLN significantly outperforms the other baselines with p < 0.05, except the SOTA model, and except Rule-distill on entity extraction of ACE04.
| Data set | Model | Entity | Relation |
| --- | --- | --- | --- |
| CoNLL04 | Gopt (Zhang, Zhang, and Fu 2017) | 85.6 | 67.8 |
| | MtQA (Li et al. 2019) | 87.8 | 68.9 |
| | Rule-distill (Hu et al. 2016) | 88.2* | 71.6* |
| | DLogic (Wang and Pan 2020) | 87.1* | 64.6* |
| | DLogic* (Wang and Pan 2020) | 88.3* | 69.9* |
| | VDLN (ours) | 89.1 | 72.4 |
| | SOTA (Wang and Lu 2020) | 90.1 | 73.6 |
| ACE04 | MtQA (Li et al. 2019) | 83.6 | 49.4 |
| | Rule-distill (Hu et al. 2016) | 87.7 | 58.1 |
| | DLogic (Wang and Pan 2020) | 81.6* | 50.2* |
| | DLogic* (Wang and Pan 2020) | 85.6* | 55.9* |
| | VDLN (ours) | 87.9 | 57.8 |
| | SOTA (Wang and Lu 2020) | 88.6 | 59.6 |
| ACE05 | MtQA (Li et al. 2019) | 84.8 | 60.2 |
| | Rule-distill (Hu et al. 2016) | 87.8* | 62.8* |
| | SpanRel (Dixit and Al-Onaizan 2019) | 86.0 | 62.8 |
| | DLogic (Wang and Pan 2020) | 83.8* | 59.3* |
| | DLogic* (Wang and Pan 2020) | 87.2* | 62.4* |
| | VDLN (ours) | 88.5 | 63.7 |
| | SOTA (Wang and Lu 2020) | 89.5 | 64.3 |
End-to-End Event Extraction: For this task, the state-of-the-art models to be compared are listed in the following.
JEventEntity: A probabilistic model that considers the intensive dependencies between event triggers and entity mentions, as well as the relationships among events (Yang and Mitchell 2016). A joint inference procedure is then applied to globally optimize all the predictions within a text input.
dbRNN: A novel dependency-bridged recurrent neural network for event extraction (Sha et al. 2018), which fully utilizes both the sequential and syntactic structure of a sentence to enhance the extraction performance.
GAIL: A deep learning model based on generative adversarial imitation learning (Zhang, Ji, and Sil 2019). The authors use reinforcement learning to model sequential predictions and aim to produce proper reward values estimated from discriminators in a GAN.
Joint3EE: A joint deep learning model to simultaneously achieve end-to-end event extraction (Nguyen and Nguyen 2019) by decomposing the joint probability into a product of the probability of each target variable conditioned on the processed units.
DYGIE++: A joint model for end-to-end event extraction based on contextualized span representations (Wadden et al. 2019). The span representations encode local and global interactions with a dynamic graph update to propagate long-range information.
SOTA (ONEIE): A joint neural model consisting of contextualized text representations and manually designed global features to capture the cross-task and cross-instance interactions (Lin et al. 2020).
The results on end-to-end event extraction are listed in Table 7. JEventEntity (Yang and Mitchell 2016) adopts joint inference with extensive manually designed linguistic features; its performance is inferior to that of deep learning models. On the other hand, DNNs alone lack the explicit knowledge that is crucial for the task at hand. Hence, dbRNN (Sha et al. 2018) and Joint3EE (Nguyen and Nguyen 2019) incorporate linguistically informed features (e.g., dependency relations) to enhance the performance of DNNs. From the comparison, VDLN achieves the best performance among all models except the SOTA model in the last row, without requiring any external linguistic resources; it only needs simple rule templates that automatically associate extracted entities and events with the argument relation predictions. SOTA (ONEIE) (Lin et al. 2020) produces the best result, which mainly originates from the manually designed global features that enforce cross-task and cross-instance relationships, for example, that a TRANSPORT event has only one DESTINATION argument.
Model | Entity extraction | Event trigger identification | Event trigger classification | Event argument identification | Event argument classification
---|---|---|---|---|---
JEventEntity | 81.8 | 71.0 | 68.8 | 50.6 | 48.4 |
dbRNN | – | – | 69.6 | 57.2 | 50.1 |
GAIL | 87.1 | 73.9 | 72.0 | 55.1 | 52.4 |
Joint3EE | – | 72.5 | 69.8 | 59.9 | 52.1 |
DYGIE++ | 89.7 | – | 69.7 | 53.0 | 48.8 |
VDLN | 87.7 | 75.6 | 73.2 | 56.1 | 52.7 |
SOTA (ONEIE) | 90.2 | 78.2 | 74.7 | 59.2 | 56.8 |
7.4 Analysis
Our default experimental setting uses BERT as the neural component. To demonstrate the generality of the proposed VDLN architecture, we conduct extra experiments by replacing BERT with three other strong contextualized neural models, namely, SpanBERT, RoBERTa, and BERT-large, with fine-tuning, as well as with a non-pretrained model using pure transformers. The results are listed in Table 8. We denote by “VDLN (*)” the proposed joint model with the neural component 𝒬 replaced by * ∈ {BERT-large, SpanBERT, RoBERTa, transformer}. All the experiments with pretrained models involve fine-tuning of the pretrained parameters. The transformer model follows the one in Wang and Pan (2020). From Table 8, we observe that large models (BERT-large) usually produce the best results, whereas VDLN (RoBERTa) performs best on CoNLL04. SpanBERT has inferior performance on average. Clearly, the joint model VDLN outperforms its neural component alone across almost all experiments. These observations show that the proposed methodology benefits a wide variety of neural counterparts.
Model | Res14 Aspect | Res14 Opinion | Lap14 Aspect | Lap14 Opinion | CoNLL04 Entity | CoNLL04 Relation | ACE04 Entity | ACE04 Relation | ACE05 Entity | ACE05 Relation
---|---|---|---|---|---|---|---|---|---|---
BERT | 86.2 | 86.1 | 80.2 | 79.5 | 87.6 | 69.3 | 85.8 | 55.5 | 87.4 | 61.3 |
VDLN (BERT) | 87.5 | 87.1 | 82.7 | 81.3 | 89.1 | 72.4 | 87.9 | 57.8 | 88.3 | 63.8 |
BERT-large | 87.9 | 86.5 | 80.7 | 81.1 | 88.1 | 71.0 | 88.1 | 59.8 | 87.8 | 63.2 |
VDLN (BERT-large) | 88.4 | 87.1 | 81.6 | 81.8 | 88.6 | 72.6 | 88.2 | 59.4 | 87.6 | 64.6 |
SpanBERT | 87.1 | 86.3 | 79.4 | 77.6 | 87.1 | 67.6 | 85.9 | 54.9 | 85.7 | 59.6 |
VDLN (SpanBERT) | 88.2 | 86.4 | 81.2 | 78.8 | 87.2 | 70.3 | 86.2 | 55.1 | 86.2 | 61.6 |
RoBERTa | 86.6 | 86.2 | 79.3 | 78.4 | 89.3 | 72.1 | 85.1 | 55.8 | 86.2 | 61.0 |
VDLN (RoBERTa) | 86.9 | 85.7 | 80.3 | 79.1 | 90.1 | 73.4 | 85.3 | 56.5 | 86.4 | 61.5 |
transformer | 84.3 | 84.2 | 76.2 | 77.3 | 85.8 | 62.7 | 82.1 | 51.4 | 83.4 | 59.1 |
VDLN (transformer) | 85.2 | 85.7 | 76.8 | 78.3 | 86.5 | 63.3 | 82.7 | 53.1 | 83.9 | 58.8 |
To analyze the effect of each module within the proposed framework and the effect of the EM training procedure, we conduct experiments on each separate module as well as on some variations within a module. The results are shown in Tables 9 and 10 for all three tasks. Specifically, 𝒬 and 𝒫 record the performance of each individual module alone, without the EM training alternation. Because 𝒫 requires the output of the deep learning predictions as its input for logic reasoning, we initialize 𝒫 with the output of a pretrained module 𝒬. 𝒬* and 𝒫* record the performance using 𝒬 and 𝒫, respectively, for final predictions after jointly training both modules alternately via variational EM. It can be observed that among the separate models, 𝒫 is slightly better than 𝒬 because it inherits the results of 𝒬 and further conducts logic reasoning based on the intensive interactions among the output variables. However, both are inferior to their counterparts trained with the variational learning paradigm, which indicates that the EM algorithm encourages mutual enhancement between the two modules. For final predictions, the results from 𝒫* and 𝒬* are comparable most of the time. In the end, we use the ensemble model 𝒫 + 𝒬, which produces the best results on Lap14, ACE04, and ACE05.
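To make the alternating schedule concrete, the following minimal Python sketch shows one possible realization of the EM alternation; the Module interface (predict, fit_step) and the variational_em function are hypothetical simplifications of 𝒬 and 𝒫, not our exact implementation.

```python
from typing import Any, Iterable

class Module:
    """Hypothetical stub for either the neural module Q or the logic
    module P; the real modules are far richer than this interface."""
    def predict(self, batch: Any) -> Any: ...
    def fit_step(self, batch: Any, targets: Any) -> None: ...

def variational_em(q_net: Module, p_net: Module,
                   loader: Iterable, num_epochs: int) -> None:
    """One possible alternation: each module is trained on pseudo-labels
    produced by the other, so knowledge flows in both directions."""
    for _ in range(num_epochs):
        # E-step: update Q, supervised in part by P's logic-refined outputs
        for batch in loader:
            q_net.fit_step(batch, targets=p_net.predict(batch))
        # M-step: update P, taking Q's predictions as input observations
        for batch in loader:
            p_net.fit_step(batch, targets=q_net.predict(batch))
```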
Model | Res14 Aspect | Res14 Opinion | Lap14 Aspect | Lap14 Opinion | CoNLL04 Entity | CoNLL04 Relation | ACE04 Entity | ACE04 Relation | ACE05 Entity | ACE05 Relation
---|---|---|---|---|---|---|---|---|---|---
𝒬 | 86.2 | 86.1 | 80.2 | 79.5 | 87.6 | 69.3 | 85.8 | 55.5 | 87.4 | 61.3 |
𝒫 | 86.0 | 86.5 | 81.1 | 79.9 | 88.0 | 70.4 | 86.0 | 55.7 | 87.5 | 60.9 |
𝒬* | 87.3 | 87.1 | 82.5 | 81.2 | 88.9 | 72.5 | 87.9 | 57.8 | 87.8 | 63.8 |
𝒫* | 87.6 | 87.0 | 82.7 | 81.2 | 89.1 | 71.9 | 87.3 | 57.6 | 88.1 | 63.5 |
𝒫 + 𝒬 | 87.5 | 87.1 | 82.7 | 81.3 | 89.1 | 72.4 | 87.9 | 57.8 | 88.3 | 63.8 |
Model | Entity extraction | Event trigger identification | Event trigger classification | Event argument identification | Event argument classification
---|---|---|---|---|---
𝒬 | 86.5 | 75.0 | 72.7 | 54.7 | 51.2 |
𝒫 | 86.9 | 74.7 | 72.7 | 55.3 | 51.5 |
𝒬* | 87.5 | 75.2 | 73.2 | 55.7 | 52.6 |
𝒫* | 87.8 | 75.3 | 72.8 | 56.0 | 52.3 |
𝒫 + 𝒬 | 87.7 | 75.6 | 73.2 | 56.1 | 52.7 |
We further verify the effect of each component within the framework via ablation studies. As shown in Tables 11 and 12, the first column indicates the different model variations. DNN is the deep learning component we adopt, which corresponds to module 𝒬 in VDLN. DNN w/o BiGRU removes the BiGRU layer on top of the BERT model, and DNN + CRF further connects a linear-chain CRF with a structured loss as the last layer. Clearly, the performance with and without BiGRU is similar. However, BiGRU brings some performance gain when associated with the joint model VDLN, compared with VDLN w/o BiGRU, which removes BiGRU from the joint model. A CRF layer brings some performance gain over DNN alone for both aspect/opinion extraction and end-to-end relation extraction. However, it is not beneficial for event trigger and event argument prediction, most probably because most event triggers consist of a single word. Two other variations, namely, VDLN (seg) and VDLN (rel), refer to the model with only segmentation-based rule templates and only relation-based rule templates, respectively, within module 𝒫. Specifically, segmentation-based rule templates only associate token-level interactions, for example, q(yi−1), q(yi+1) → p(yi), whereas relation-based rule templates associate relational triplets with token predictions, for example, q(yj), q(yrij) → p(yi). The results for these two variations demonstrate the contribution of each kind of interaction to the proposed model. From Table 11, we observe that VDLN (seg) is beneficial for entity predictions, whereas VDLN (rel) mostly helps relation extraction. The row VDLN + CRF takes the DNN + CRF model as module 𝒬 in the joint model. Its performance is similar to that of VDLN, which shows that VDLN already learns the structured information captured by a CRF. VDLN + CRF (rel) adopts only relation-based rule templates in module 𝒫; by comparing it with VDLN, we verify that the segmentation rules used in module 𝒫 are more beneficial than a simple graphical model. We also verify the effect of using sparsemax in the rule learning process. The sparsemax operator explicitly constrains the number of atoms selected to form the body of a rule. Replacing it with softmax (VDLN (softmax)) shows that sparsemax provides better results and is semantically more meaningful.
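For reference, the following is a minimal, self-contained sketch of the sparsemax operator (Martins and Astudillo 2016) over the last tensor dimension; unlike softmax, it projects the scores onto the probability simplex and can assign exactly zero weight to irrelevant atoms, which is what makes the selected rule bodies sparse. The code is illustrative rather than our exact implementation.

```python
import torch

def sparsemax(z: torch.Tensor) -> torch.Tensor:
    """Sparsemax over the last dimension: Euclidean projection of the
    score vector onto the probability simplex (can output exact zeros)."""
    z_sorted, _ = torch.sort(z, dim=-1, descending=True)
    k = torch.arange(1, z.size(-1) + 1, device=z.device, dtype=z.dtype)
    z_cumsum = z_sorted.cumsum(dim=-1)
    # Support set: positions where 1 + k * z_(k) > cumulative sum
    support = (1.0 + k * z_sorted) > z_cumsum
    k_z = support.sum(dim=-1, keepdim=True)
    # Threshold tau chosen so the clipped outputs sum to one
    tau = (z_cumsum.gather(-1, k_z - 1) - 1.0) / k_z.to(z.dtype)
    return torch.clamp(z - tau, min=0.0)

scores = torch.tensor([2.0, 1.0, -1.0])
print(sparsemax(scores))  # tensor([1., 0., 0.]): exact zeros appear
```

Conceptually, substituting softmax for this operator in the atom-selection layer yields the VDLN (softmax) variant compared above.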
Model | Res14 Aspect | Res14 Opinion | Lap14 Aspect | Lap14 Opinion | CoNLL04 Entity | CoNLL04 Relation | ACE04 Entity | ACE04 Relation | ACE05 Entity | ACE05 Relation
---|---|---|---|---|---|---|---|---|---|---
DNN w/o BiGRU | 86.5 | 85.8 | 79.0 | 79.2 | 87.2 | 70.8 | 86.2 | 56.7 | 86.9 | 61.5 |
DNN | 86.2 | 86.1 | 80.2 | 79.5 | 87.6 | 69.3 | 85.8 | 55.5 | 87.4 | 61.3 |
DNN + CRF | 87.2 | 86.2 | 80.7 | 80.3 | 88.3 | 70.8 | 87.4 | 57.2 | 87.6 | 62.4 |
VDLN (seg) | 87.1 | 87.0 | 82.0 | 82.5 | 88.8 | 70.7 | 87.6 | 57.6 | 88.0 | 62.8 |
VDLN (rel) | 87.4 | 87.2 | 80.6 | 81.7 | 88.4 | 71.9 | 86.9 | 58.2 | 87.7 | 63.5 |
VDLN w/o BiGRU | 87.1 | 86.6 | 81.9 | 80.7 | 88.3 | 71.9 | 86.1 | 54.2 | 87.8 | 61.5 |
VDLN | 87.5 | 87.1 | 82.7 | 81.3 | 89.1 | 72.4 | 87.9 | 57.8 | 88.3 | 63.8 |
VDLN (softmax) | 86.7 | 87.0 | 81.5 | 81.2 | 88.5 | 71.4 | 87.5 | 57.9 | 87.7 | 62.3 |
VDLN + CRF (rel) | 87.2 | 86.9 | 81.9 | 81.4 | 88.2 | 72.1 | 87.6 | 58.4 | 87.8 | 63.7 |
VDLN + CRF | 87.8 | 87.3 | 82.3 | 82.2 | 88.5 | 72.7 | 87.9 | 58.6 | 88.0 | 63.5 |
Model | Entity extraction | Event trigger identification | Event trigger classification | Event argument identification | Event argument classification
---|---|---|---|---|---
DNN w/o BiGRU | 86.1 | 75.4 | 72.7 | 54.4 | 51.0 |
DNN | 86.5 | 75.0 | 72.7 | 54.7 | 51.2 |
DNN + CRF | 87.2 | 74.6 | 72.8 | 54.3 | 50.9 |
VDLN (seg) | 87.5 | 74.9 | 72.7 | 54.9 | 51.7 |
VDLN (rel) | 86.8 | 74.7 | 71.8 | 55.2 | 52.0 |
VDLN w/o BiGRU | 86.7 | 74.9 | 72.5 | 54.8 | 51.8 |
VDLN | 87.7 | 75.6 | 73.2 | 56.1 | 52.7 |
VDLN + CRF (rel) | 87.1 | 75.8 | 72.8 | 55.7 | 52.5 |
VDLN + CRF | 87.4 | 75.4 | 72.8 | 55.4 | 52.2 |
To investigate the advantage of the logic-inspired network within module 𝒫, we compare the proposed model with two other popular and effective deep learning models for information propagation, namely, graph neural networks (GNNs) (Dai, Dai, and Song 2016) and graph convolutional networks (GCNs) (Kipf and Welling 2017). The results are shown in Table 13. Specifically, we replace the logic network in 𝒫 with a GNN (or GCN), which takes the context ctx(m) of each target node m as the neighboring nodes to update its own feature via non-linear transformations (spectral-based graph convolutions). In other words, the graph structure of the GNN (GCN) is derived from the rule templates used in the logic network: two nodes are connected if they appear in the same rule. We denote this model by GNN + 𝒬 (GCN + 𝒬); GCN + 𝒬 is more expressive than GNN + 𝒬. Clearly, GCN + 𝒬 outperforms GNN + 𝒬 in general, but is still inferior to our proposed model in all but one experiment, indicating that the proposed logic module has better reasoning capabilities than graph-based models in this problem domain.
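As a rough illustration of how such a graph can be derived from rule templates, the sketch below connects any two targets that co-occur in a rule; the rules-as-sets representation and the build_graph helper are hypothetical simplifications, not the exact construction.

```python
from collections import defaultdict

def build_graph(rules: list[frozenset[int]]) -> dict[int, set[int]]:
    """Derive an adjacency structure from rule templates: each rule is
    modeled as the set of target-node ids it mentions, and two nodes
    become neighbors whenever they co-occur in at least one rule."""
    adj: dict[int, set[int]] = defaultdict(set)
    for nodes in rules:
        for u in nodes:
            for v in nodes:
                if u != v:
                    adj[u].add(v)
    return dict(adj)

# e.g., a segmentation rule over tokens {0, 1, 2} and a relation rule
# linking token 1 with a relation node 7:
print(build_graph([frozenset({0, 1, 2}), frozenset({1, 7})]))
```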
Data set | Task | GNN + 𝒬 | GCN + 𝒬 | VDLN | VDLN + rules
---|---|---|---|---|---
Res14 | Aspect | 86.2 | 86.8 | 87.5 | 88.0 |
Opinion | 86.5 | 86.4 | 87.1 | 87.3 | |
Lap14 | Aspect | 81.3 | 81.8 | 82.7 | 82.9 |
Opinion | 81.0 | 81.3 | 81.3 | 82.1 | |
CoNLL04 | Entity | 87.9 | 88.2 | 89.1 | 89.1 |
Relation | 71.3 | 71.4 | 72.4 | 72.7 | |
ACE04 | Entity | 85.7 | 86.7 | 87.9 | 87.7 |
Relation | 56.2 | 56.7 | 57.8 | 58.3 | |
ACE05 | Entity | 86.1 | 86.3 | 88.3 | 88.0 |
Relation | 62.3 | 62.6 | 63.8 | 64.1 | |
ACE05 (event) | Entity | 87.1 | 87.9 | 87.7 | 87.5 |
Trigger (I) | 73.9 | 75.1 | 75.6 | 75.6 | |
Trigger (C) | 71.8 | 72.3 | 73.2 | 72.7 | |
Argument (I) | 53.6 | 54.4 | 56.1 | 55.8 | |
Argument (C) | 51.7 | 52.0 | 52.7 | 52.5 |
As mentioned in Section 5.2, the logic network is able to learn relevant knowledge automatically, as well as to encode prior knowledge when provided. In all the previous experiments, we do not feed any manually designed logic rules into the logic network, for fair comparison and to demonstrate our model’s generality. To investigate how given rules contribute to the actual task, we design easily acquired logic rules for each task and incorporate them into the learning of LNet. The manually designed rules for each task and data set are listed in Table 14. For aspect and opinion terms extraction, we design rules involving dependency relations and POS tags, as adopted in Qiu et al. (2011) and Yu, Jiang, and Xia (2019). For example, the FOL rule “aspect(x) ∧ posnoun(x) ∧ depamod(y, x) ∧ posadj(y) ⇒ opinion(y)” states that if x is an aspect word with POS tag noun, then y with POS tag adj is an opinion word when there is a dependency relation amod between x and y. The dependency structures and POS tags are generated using Stanford CoreNLP (Manning et al. 2014). For end-to-end relation extraction, we mainly adopt relational rules capturing the correlations between entity types and relation types. Lastly, for event extraction, we use event trigger types and entity types as preconditions in the FOL rules to entail the target argument relations. The results are shown in the last column of Table 13 (i.e., “VDLN + rules”). Performance improves on aspect and opinion terms extraction. For end-to-end relation extraction, the additional rules are more beneficial for relation extraction than for entity extraction. However, we observe a degradation in event extraction performance when the manually designed logic rules are inserted, which might be caused by inaccurate event trigger predictions as well as by uncertain rules with sparse coverage in the given corpus.
Task | Data set | FOL formula
---|---|---
Aspect and opinion extraction | | aspect(x) ∧ posnoun(x) ∧ depnn(x, y) ∧ posnoun(y) ⇒ aspect(y)
 | | aspect(x) ∧ posnoun(x) ∧ depconj(x, y) ∧ posnoun(y) ⇒ aspect(y)
 | | opinion(x) ∧ posadj(x) ∧ depconj(x, y) ∧ posadj(y) ⇒ opinion(y)
 | | aspect(x) ∧ posnoun(x) ∧ depnsubj(x, y) ∧ posadj(y) ⇒ opinion(y)
 | | opinion(x) ∧ posadj(x) ∧ depnsubj(y, x) ∧ posnoun(y) ⇒ aspect(y)
 | | aspect(x) ∧ posnoun(x) ∧ depamod(y, x) ∧ posadj(y) ⇒ opinion(y)
 | | opinion(x) ∧ posadj(x) ∧ depamod(x, y) ∧ posnoun(y) ⇒ aspect(y)
End-to-End relation extraction | CoNLL04 | person(x) ∧ live_in(r(x,y)) ⇒ location(y)
 | | location(x) ∧ live_in(r(y,x)) ⇒ person(y)
 | | organization(x) ∧ org_based_in(r(x,y)) ⇒ location(y)
 | | location(x) ∧ org_based_in(r(y,x)) ⇒ organization(y)
 | | location(x) ∧ located_in(r(x,y)) ⇒ location(y)
 | | person(x) ∧ kill(r(x,y)) ⇒ person(y)
 | | person(x) ∧ work_for(r(x,y)) ⇒ organization(y)
 | | organization(x) ∧ work_for(r(y,x)) ⇒ person(y)
 | ACE04 | person(x) ∧ person-social(r(x,y)) ⇒ person(y)
 | | person(x) ∧ discourse(r(x,y)) ⇒ person(y)
 | | geographical(x) ∧ discourse(r(x,y)) ⇒ geographical(y)
 | | organization(x) ∧ discourse(r(x,y)) ⇒ organization(y)
 | | person(x) ∧ employment(r(x,y)) ⇒ organization(y) ∨ geographical(y)
 | | geographical(x) ∧ employment(r(y,x)) ⇒ organization(y) ∨ person(y)
 | | organization(x) ∧ GPE-affiliation(r(x,y)) ⇒ geographical(y)
 | | person(x) ∧ GPE-affiliation(r(x,y)) ⇒ geographical(y)
 | ACE05 | person(x) ∧ person-social(r(x,y)) ⇒ person(y)
 | | vehicle(x) ∧ part-whole(r(x,y)) ⇒ vehicle(y)
 | | geographical(x) ∧ part-whole(r(x,y)) ⇒ geographical(y)
 | | organization(x) ∧ part-whole(r(x,y)) ⇒ organization(y)
 | | person(x) ∧ ORG-affiliation(r(x,y)) ⇒ organization(y) ∨ geographical(y)
 | | (organization(x) ∨ geographical(x)) ∧ ORG-affiliation(r(y,x)) ⇒ person(y)
 | | organization(x) ∧ GPE-affiliation(r(x,y)) ⇒ location(y)
 | | location(x) ∧ GPE-affiliation(r(y,x)) ⇒ organization(y)
End-to-End event extraction | | Movement_Transport(x) ∧ person(y) ⇒ Destination(r(x,y))
 | | (Personnel_Elect(x) ∨ Personnel_StartPosition(x) ∨ Personnel_EndPosition(x) ∨ Life_Marry(x) ∨ Justice_Arrest-Jail(x)) ∧ person(y) ⇒ Position(r(x,y))
 | | (Personnel_StartPosition(x) ∨ Personnel_EndPosition(x)) ∧ organization(y) ⇒ Attacker(r(x,y))
 | | (Contact_Meet(x) ∨ Contact_PhoneWrite(x) ∨ Conflict_Demonstrate(x)) ∧ person(y) ⇒ Attacker(r(x,y))
 | | (Movement_Transport(x) ∨ Contact_Meet(x) ∨ Conflict_Attack(x) ∨ Life_Die(x)) ∧ time(y) ⇒ Target(r(x,y))
 | | (Justice_Sentence(x) ∨ Justice_ChargeIndict(x) ∨ Justice_Convict(x)) ∧ person(y) ⇒ Adjudicator(r(x,y))
 | | Justice_Sentence(x) ∧ sentence(y) ⇒ Crime(r(x,y))
 | | Justice_ChargeIndict(x) ∧ crime(y) ⇒ Prosecutor(r(x,y))
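To make the use of such predefined rules concrete, the sketch below scores a grounding of one CoNLL04 rule under a common soft-logic relaxation (product t-norm for conjunction, Lukasiewicz implication); this is one standard choice of relaxation, not necessarily the exact semantics of our logic network.

```python
def soft_rule_score(p_person_x: float, p_work_for_xy: float,
                    p_org_y: float) -> float:
    """Soft truth value of person(x) ∧ work_for(r(x,y)) ⇒ organization(y),
    given the modules' predicted probabilities for each atom."""
    body = p_person_x * p_work_for_xy          # product t-norm conjunction
    return min(1.0, 1.0 - body + p_org_y)      # Lukasiewicz implication

# A confident body with a weak head signals a violated rule:
print(soft_rule_score(0.9, 0.8, 0.3))  # 0.58 < 1.0
```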
We would like to emphasize that, compared with the predefined logic rules, the learned logic rules differ in that the atoms in their rule bodies are rather abstract and composite. More concretely, if we define the set of all generated synthetic atoms in the atom layer as Dctx(m) = {d1, …, d80},3 we can list a few learned logic rules for relations on CoNLL04:
d11 ∧ d2 ∧ d58 ∧ d31 ∧ d7 ⇒ (r = located_in).
d18 ∧ d34 ∧ d16 ∧ d4 ∧ d10 ⇒ (r = work_for).
d5 ∧ d4 ∧ d31 ⇒ (r = live_in).
located_in(r(the White House,U.S.)) ∧ live_in(r(the White House,U.S.)) ∧ location(U.S.) ∧ organization(the White House) ∧ located_in(r(the White House,U.S.)) ⇒ (r = located_in).
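For illustration, a rule body over such synthetic atoms can be represented differentiably as below; the RuleLayer class, its softmax-based atom selection (sparsemax in our actual setting, as discussed earlier), and the product conjunction are simplifying assumptions rather than the exact logic network.

```python
import torch
import torch.nn as nn

class RuleLayer(nn.Module):
    """Illustrative differentiable rule body: each of T body slots softly
    selects one of D synthetic atoms, and the selected soft truth values
    are conjoined by a product to score the rule body."""
    def __init__(self, num_atoms: int, body_size: int):
        super().__init__()
        self.select = nn.Parameter(torch.randn(body_size, num_atoms))

    def forward(self, atom_vals: torch.Tensor) -> torch.Tensor:
        # atom_vals: (batch, num_atoms), soft truth values in [0, 1]
        weights = torch.softmax(self.select, dim=-1)  # per-slot selection
        chosen = atom_vals @ weights.t()              # (batch, body_size)
        return chosen.prod(dim=-1)                    # soft conjunction

rule = RuleLayer(num_atoms=80, body_size=5)           # T = 5 atoms per body
print(rule(torch.rand(2, 80)).shape)                  # torch.Size([2])
```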
For qualitative analysis, Table 15 lists a few examples showing that incorporating logic reasoning extracts target terms/relations more accurately than the pure neural network 𝒬. Specifically, words in bold indicate aspects or entities, and words in italics indicate opinions. For entity and relation extraction, the second row in each example shows the predicted entity types and relations, where the numbers in a relation indicate the indices of its corresponding entities. For aspect and opinion terms extraction, VDLN is able to identify target aspects or opinions with certain syntactic relations that are missed by the pure DNN. For example, the opinion term laid-back can be extracted by associating it with the aspect term ambience and another opinion term relaxed. For entity and relation extraction, VDLN corrects erroneous predictions of the DNN. For example, the output relation located_in(6th Fleet, Mediterranean) is corrected to org_based_in(6th Fleet, Mediterranean) by VDLN.
DNN | VDLN
---|---
“The ambience is also more laid-back and relaxed.” | “The ambience is also more laid-back and relaxed.” |
“The folding chair I was seated at was uncomfortable.” | “The folding chair I was seated at was uncomfortable.” |
“It is robust, with a friendly use as all Apple products.” | “It is robust, with a friendly use as all Apple products.”
“…on duty with the 6th Fleet in the Mediterranean,…” Entity: location, location; Relation: located_in (1, 2) | “…on duty with the 6th Fleet in the Mediterranean,…” Entity: organization, location; Relation: org_based_in(1, 2) |
“…get them all home, said Ms. Say in Nashville, Tenn.” Entity: people, location, location; Relation: located_in(1, 2) | “…get them all home, said Ms. Say in Nashville, Tenn.” Entity: people, location, location; Relation: located_in(2, 3) |
“…the legislature of the state of Florida …” Entity: facility, geographical; Relation: None | “…the legislature of the state of Florida …” Entity: organization, geographical; Relation: emp-org(1, 2) |
To demonstrate the model’s robustness, we conduct experiments with varying hyperparameters. We choose three parameters, namely, the sampling rate ρ during the EM updates, the number of atoms T in the rule body of each rule formed in the logic network, and the number of rules in the rule set {R1, …, RS} that share the same head atom h. Specifically, we use sampling rates ranging from ρ = 0.1 to ρ = 0.9 when updating both 𝒫 and 𝒬. Here ρ is the probability of using the predictions from 𝒫 or 𝒬 when learning the parameters of 𝒬 or 𝒫, respectively, during the variational EM updates; with probability 1 − ρ, the ground-truth label is used to supervise each module. The results for aspect and opinion terms extraction are shown in Figure 4, and the results for the CoNLL04 and ACE05 data sets are shown in Figure 5. Both figures demonstrate the robustness of VDLN against different sampling rates. The performance drop at ρ = 0.9 is reasonable, as only 10% of the ground-truth labels are then used for supervision during the EM training procedure.
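The role of ρ can be summarized by the following illustrative helper (a hypothetical simplification, not our exact code):

```python
import random
from typing import Any

def mixed_target(gold: Any, other_module_pred: Any, rho: float) -> Any:
    """With probability rho, supervise with the other module's prediction;
    with probability 1 - rho, fall back to the ground-truth label."""
    return other_module_pred if random.random() < rho else gold
```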
Figures 6, 7, and 8 show the F1 scores for entity extraction and relation prediction on the ACE05, ACE04, and CoNLL04 data sets, respectively. The x-axis of each left subfigure indicates the number of atoms T in the body of each rule (i.e., dj1 ∧ ⋯ ∧ djT ⇒ h). The x-axis of each right subfigure indicates the number of rules S for each head atom h. As the figures indicate, the final performance of the proposed framework is not sensitive to these hyperparameters of the logic network. For ACE05 and ACE04, varying T from 1 to 10 yields more stable performance for entity extraction than for relation prediction. On the other hand, the model’s performance is relatively insensitive to the number of rules S: when changing S from 10 to 50, the results stay within a small range. As for CoNLL04, performance decreases when S exceeds 25. This might result from the fact that the interactions between entities and relations are simpler in the smaller CoNLL04 data set, which makes the model easy to overfit.
7.5 Error Analysis and Future Work
Our error analysis shows that the model has limitations when the entities (triggers) are not correctly extracted. Specifically, if an entity (trigger) is missed in the entity (trigger) prediction phase of module 𝒬, that is, when generating predictions from qi for each word, it becomes hard to rectify such predictions in the logic network and during the EM training procedure, because the relations and rules are all built on the candidate entities extracted by 𝒬. Indeed, when some entities are not identified by 𝒬, there are no bilinear interactions between the missed entities and other entities for the logic network to model. Hence, it is difficult for VDLN to learn useful rules to correct these predictions.
In future work, we plan to address the above limitations by revising the extraction mechanism in the neural component, where entities are currently predicted first, followed by relation predictions; a table-filling mechanism might be a good choice (Miwa and Sasaki 2014). We also plan to design more interpretable networks for logic reasoning so that the learned rules can be explicitly explained. In terms of applications, future work includes generalizing the proposed framework to more challenging cases, for example, cross-sentence correlations and cross-instance consistencies, so that it can be applied to other domains such as document-level event extraction and cross-sentence relation extraction.
8. Conclusion
We propose a variational deep logic network that inherits both the representational power of deep learning and the reasoning capabilities of logic systems for joint inference in IE. The two paradigms communicate through the variational EM algorithm. For knowledge reasoning, we introduce a novel logic network that transforms logic semantics into a deep hierarchical architecture to facilitate automatic logic inference. Meanwhile, the logic network enhances expressiveness beyond manually designed rules by learning more effective atom combinations from the training data. It is also flexible enough to incorporate predefined logic rules to further enhance the final performance.
Acknowledgments
This work is supported by NTU Nanyang Assistant Professorship (NAP) grant M4081532.020, 2020 Microsoft Research Asia collaborative research grant, and Singapore Lee Kuan Yew Postdoctoral Fellowship.
Notes
1. How to construct the relevant context for each m ∈ 𝒩ϵ ∪ 𝒩r is explained in detail in Section 6.
2. Here n1 and n2 are hyperparameters corresponding to the number of 1-ary atoms and 2-ary atoms, respectively, for each input.
3. According to the experimental setting, we have 80 generated atoms.