Variational Deep Logic Network for Joint Inference of Entities and Relations

Abstract Currently, deep learning models have been widely adopted and achieved promising results on various application domains. Despite their intriguing performance, most deep learning models function as black boxes, lacking explicit reasoning capabilities and explanations, which are usually essential for complex problems. Take joint inference in information extraction as an example. This task requires the identification of multiple structured knowledge from texts, which is inter-correlated, including entities, events, and the relationships between them. Various deep neural networks have been proposed to jointly perform entity extraction and relation prediction, which only propagate information implicitly via representation learning. However, they fail to encode the intensive correlations between entity types and relations to enforce their coexistence. On the other hand, some approaches adopt rules to explicitly constrain certain relational facts, although the separation of rules with representation learning usually restrains the approaches with error propagation. Moreover, the predefined rules are inflexible and might result in negative effects when data is noisy. To address these limitations, we propose a variational deep logic network that incorporates both representation learning and relational reasoning via the variational EM algorithm. The model consists of a deep neural network to learn high-level features with implicit interactions via the self-attention mechanism and a relational logic network to explicitly exploit target interactions. These two components are trained interactively to bring the best of both worlds. We conduct extensive experiments ranging from fine-grained sentiment terms extraction, end-to-end relation prediction, to end-to-end event extraction to demonstrate the effectiveness of our proposed method.


Introduction
Joint inference is commonly adopted in the field of information extraction (IE), for example, end-to-end relation extraction and end-to-end event extraction. Compared with a pipelined procedure, joint inference performs multiple correlated subtasks in a single model simultaneously, which avoids error propagation and exploits inter-task correlations. For example, end-to-end relation extraction involves both entity extraction and relation classification between entities. As shown in Figure 1(a), given a text input "W. Dale Nelson covers the White House for The Associated Press," end-to-end relation extraction requires the identification of W. Dale Nelson as an entity of type person (PER), White House as an entity of type location (LOC), and The Associated Press as an entity of type organization (ORG). At the same time, the relation between W. Dale Nelson and The Associated Press needs to be classified as work for. For end-to-end event extraction, an event consists of an event trigger and an arbitrary number of arguments. The task involves the identification and classification of the following three items:
• Entity mention: An entity mention is a reference to an entity in the form of a noun phrase or a pronoun.
• Event trigger: An event trigger usually refers to the main word that clearly expresses an event occurrence. Event triggers can be verbs, nouns, and occasionally adjectives.
• Event argument: Event arguments refer to entities that fill specific roles in the event. They mainly include participants, namely, the entities that are involved in the event, and general event attributes such as place and time.
For example, in Figure 1(b), there are four entity mentions with their corresponding types labeled above. The word blow is a trigger for the event Conflict:Attack with two different arguments: He (Attacker) and city (Place). Various deep learning models have been proposed to jointly extract entities, or events and their relations, either through parameter/feature sharing (Miwa and Bansal 2016; Katiyar and Cardie 2017) to exploit task commonalities, or by designing loss functions that consider task correlations, for example, adopting a novel tagging scheme (Miwa and Sasaki 2014; Gupta, Schütze, and Andrassy 2016; Zhang, Zhang, and Fu 2017; Zheng et al. 2017). However, these models only capture task interactions implicitly via parameter sharing or high-level feature learning without effective relational knowledge integration. We observe that intensive correlations or relational patterns exist among targets being extracted. Take Figure 1(a) as an example; if we know entity W. Dale Nelson is a person and it has relation work for with another entity The Associated Press, we can probably infer that The Associated Press is an organization. Note that the widely used BIO segmentation scheme in entity segmentation can be considered as a special case of correlation constraints among targets, for example, "I" should not follow "O." To fuse such explicit dependencies among different targets, some early studies enforce the model predictions with constraints (Yang and Cardie 2013; Roth, Yih, and Yih 2007) or rely on global graphical models (Yu and Lam 2010) to produce structured predictions. These approaches, however, fail to connect final predictions with feature updates, resulting in error propagation. Recently, logic rules have been integrated into deep learning architectures for natural language processing as a form of prior knowledge integration (Hu et al. 2016; Li and Srikumar 2019; Wang and Pan 2020). However, in existing methods, rules are explicitly given and kept fixed with learnable weights during model learning, which limits the expressiveness and adaptation of knowledge from training data.

Figure 1
Examples of (a) end-to-end relation extraction, with the relation work for, and (b) end-to-end event extraction for the sentence "He will blow a city off the earth in a minute if he can get the hold of the means to do it," with the event Conflict:Attack and the argument roles Attacker and Place.
To address these limitations, we propose a novel marriage between deep feature learning and relational logic reasoning, named Variational Deep Logic Network (VDLN), for joint inference in the IE domain. The complex relationships among target variables can be effectively captured both implicitly and explicitly via the mutual enhancement of deep neural networks and automatic logic inference in a joint learning framework. Specifically, VDLN consists of two modules: a deep learning module Q and a logic reasoning module P. The deep learning module adopts the self-attention mechanism to explore the dependencies among the tokens in a sentence in order to generate word-level and relation-level features. It is also flexible enough to incorporate structured models, for example, Conditional Random Fields (CRFs) (Lafferty, McCallum, and Pereira 2001), to produce structured outputs for entity segmentations. For the logic reasoning module, we construct a novel logic network that parameterizes the logic inference process via a hierarchy of layers consisting of an atom layer and a rule layer. The final output of the logic network simulates rule entailments and reflects the probability of the target atom being true given the input atoms. The target atom could be regarded as a binary classifier for each target label. The logic network aims to learn relational correlations among the related variables, which is crucial for the task at hand. For example, the aforementioned dependency between entity and relation labels could be reflected via the first-order-logic rule: $\text{person}(X_1) \wedge \text{work for}(X_1, X_2) \Rightarrow \text{organization}(X_2)$. It is worth noting that the logic reasoning module is flexible enough to achieve both rule learning given some simple rule templates and integration of predefined logic rules.
To smoothly integrate these two modules and to model dependencies of correlated variables for joint inference, we propose a variational EM learning paradigm. The E-step involves learning of module Q to produce probabilistic predictions for each variable. For the M-step, the logic reasoning module P conducts knowledge inference and updates its parameters according to the outputs of Q. The alternation between E-step and M-step facilitates the integration and mutual enhancement of both knowledge reasoning and abstractive feature learning to achieve the best of both worlds.
To demonstrate our model's generality, we apply VDLN on a range of challenging IE tasks, focusing on different kinds of correlations and with increasing levels of difficulty. Specifically, we take Aspect and Opinion Extraction as the first IE task that focuses on entity extraction by treating aspect and opinion terms as two different entity types and exploring their interactions to boost the extraction accuracy. The second IE task is End-to-End Relation Extraction, which considers correlations among entities and their relations. We use End-to-End Event Extraction as our third IE task, which contains rich correlations between entities and events. The proposed model achieves better performances across all these tasks without the need to construct any prior knowledge. To summarize, our contributions include:
• We propose a novel logic-inspired network incorporating logic semantics for probabilistic reasoning, which is more expressive and beneficial for exploiting target interactions for joint inference. The logic network is able to learn effective reasoning patterns given the training corpus, and at the same time allows the integration of predefined logic rules.
• We design a variational EM algorithm within our deep logic networks for IE tasks, which bridges the gap between deep feature learning and knowledge reasoning to enhance the final performance.
• We conduct extensive experiments on 6 benchmark data sets across 3 IE tasks with increasing levels of difficulty to demonstrate the effectiveness and generality of our proposed model.

Information Extraction
Information extraction aims to extract structured knowledge from texts (e.g., entities, relational triplets). In this paper, we mainly review three IE tasks that are related to our proposals. The first task is aspect and opinion extraction, which focuses on the identification of product aspects/attributes and their corresponding opinion expressions. Existing work either relies on predefined rules and patterns among aspect terms and opinion terms utilizing syntactic information of a sentence (Hu and Liu 2004; Qiu et al. 2011; Li et al. 2010), or designs deep learning models considering different types of dependencies, for example, contextual dependencies (Liu, Joty, and Meng 2015; Wang et al. 2017; Li and Lam 2017; Xu et al. 2018a), syntactic dependencies (Yin et al. 2016; Wang et al. 2016), and task dependencies (Chen and Qian 2020). Another recent work (Yu, Jiang, and Xia 2019) exploits the combination of explicit rules with deep feature learning via linear integer programming. However, such integration only treats rules as fixed constraints to revise deep learning predictions, without the ability to update rules and propagate information back to feature learning.
For end-to-end relation extraction, the early works adopt a pipeline procedure that first learns an entity extraction model and then trains a relation classifier based on the extracted entities (Chan and Roth 2011; Lin et al. 2016). This strategy is prone to error propagation resulting from the extracted entities. To resolve this limitation, subsequent works propose joint extraction models by sharing parameters (Miwa and Bansal 2016; Katiyar and Cardie 2017; Takanobu et al. 2019; Dixit and Al-Onaizan 2019; Dai et al. 2019a) or by designing loss functions to encode the task interactions, for example, structured perceptron, novel labeling strategies (Miwa and Sasaki 2014; Gupta, Schütze, and Andrassy 2016; Zhang, Zhang, and Fu 2017; Zheng et al. 2017), global loss (Sun et al. 2018; Adel and Schütze 2017), and triplet/answer generation (Zeng et al. 2018). Wang and Lu (2020) proposed combining both a sequence encoder and a table encoder together with rich input embeddings for joint extraction. However, these approaches only exploit correlations among the subtasks implicitly. Another strategy is to enforce relational facts via explicit rule constraints (Roth, Yih, and Yih 2007; Yang and Cardie 2013; Kate and Mooney 2010) or graphical models (Yu and Lam 2010), which are separated from feature learning.
The third task, which is more challenging, is event extraction. Pipelined models were first proposed, which require extensive feature engineering (Ji and Grishman 2008; Liao and Grishman 2010; Patwardhan and Riloff 2009; Hong et al. 2011; McClosky, Surdeanu, and Manning 2011). To capture interactions among different subtasks, graphical and structured prediction models have been proposed for joint inference of event triggers and event arguments (Poon and Vanderwende 2010; Venugopal et al. 2014; Riedel et al. 2009; Judea and Strube 2016; Yang and Mitchell 2016). Recently, deep neural networks were also introduced for joint prediction in the domain of event extraction (Nguyen, Cho, and Grishman 2016; Sha et al. 2018; Liu, Luo, and Huang 2018; Nguyen and Nguyen 2019; Zhang, Ji, and Sil 2019; Wadden et al. 2019). However, most of the existing research depends on external linguistic resources to generate semantic and syntactic features in order to enhance the final prediction. Lin et al. (2020) adopted manually designed global features to capture cross-task and cross-instance interactions.

Deep Learning with Logic Reasoning
Considering the limitation of pure deep learning models, which lack the reasoning capabilities, and the inflexibility of pure symbolic models, a marriage between them has been proposed, namely, Neural-Symbolic Learning, which aims to equip distributed representation learning with some form of real intelligence, or, on the other hand, assists symbolic models to handle uncertainties (Garcez, Broda, and Gabbay 2002;França, Zaverucha, and D'avila Garcez 2014;Serafini and d'Avila Garcez 2016;Evans and Grefenstette 2018;Manhaeve et al. 2018;Dong et al. 2019;Xu et al. 2018b;Tran and d'Avila Garcez 2018;Wang et al. 2019;Dai et al. 2019b;d'Avila Garcez et al. 2019;Ciravegna et al. 2020;Lamb et al. 2020;Yang and Song 2020). Deep neural networks have been used to simulate logic reasoning by parameterizing logic operators and logic atoms with neural weights (Franca, Zaverucha, and D'avila Garcez 2014;Tran and d'Avila Garcez 2018). Another group of research focuses on smooth integration of logic rules within the deep learning frameworks (Manhaeve et al. 2018;Xu et al. 2018b). A more challenging direction is to induce logic rules automatically through representation learning and differentiable back-propagation (Evans and Grefenstette 2018;Dong et al. 2019;Wang et al. 2019;Yang and Song 2020).
In the NLP domain, Rocktäschel, Singh, and Riedel (2015) and Guo et al. (2016) embedded logic rules into the distributed feature space for knowledge graph learning. Hu et al. (2016) fused discrete logic rules into deep neural networks (DNNs) through posterior regularization, and Qu and Tang (2019) used a variational EM algorithm to distill knowledge from a graph neural network into a Markov logic network. Other works used logic rules to construct adversarial sets (Minervini et al. 2017; Minervini and Riedel 2018), or as indirect supervision to improve model training (Wang and Poon 2018). Logic knowledge has also been inserted into deep architectures as named neurons (Li and Srikumar 2019). Recently, differentiable theorem proving has been proposed that parameterizes symbolic unification in the backward chaining process of Prolog (Gallaire and Minker 1978) with neural weight learning (Campero et al. 2018; Minervini et al. 2020). Inspired by Qu and Tang (2019), we also adopt the variational EM algorithm for knowledge distillation. But different from the previously mentioned studies, we design a semantically meaningful deep architecture for automatic logic reasoning. The logic-inspired network is able to learn expressive and useful reasoning patterns that are adapted to the training corpus, and at the same time is flexible enough to incorporate predefined logic rules. In the domain of information extraction, Wang and Pan (2020) used predefined logic rules as a form of regularizer imposed on the learning of DNNs. The regularizer is realized via a discrepancy loss between the deep learning predictions and the satisfiability of their corresponding logic rules. However, this mechanism only locally influences the learning of DNNs. Compared with Wang and Pan (2020), our proposed model is able to learn different combinations of logic atoms to form the rules, and it is also flexible enough to incorporate predefined knowledge. Moreover, our EM training algorithm alternates between an inference step and a learning step to achieve mutual enhancement, which globally enforces the learning of both modules instead of applying sample-wise regularization.

Problem Definition and Preliminary
For ease of illustration, we first list all the symbols used in this work together with their descriptions in Table 1.

Problem Definition
For all three IE tasks, the target variables can be categorized as:
(1) Entities, with the set of all entity types denoted by $\mathcal{E}$.
(2) Events, with $\mathcal{V}$ denoting the set of all event types.
(3) Relational triplets $(s, r, o)$ governed by a set of relation categories $r \in \mathcal{R}$, with $s$ and $o$ being the subject and object of relation $r$, respectively. For convenience, we use $r_{(s,o)}$ to denote the relational triplet.
Given an input sentence $\{w_1, w_2, \ldots, w_n\}$, entity extraction is formalized as a sequence labeling problem to generate entity segmentation. Denote the set of segmentation labels by $E = \{B_j, I_j, O\}_{j \in \mathcal{E}}$, with $B_j$, $I_j$, $O$ indicating the beginning, inside, and outside of an entity of type $j$, respectively. The output is a label sequence $\{y_1, y_2, \ldots, y_n\}$, where $y_i \in E$. End-to-end relation extraction aims to generate both entity segmentation and a set of relational triplets $r_{(\epsilon_1, \epsilon_2)}$, where $\epsilon_1$ and $\epsilon_2$ correspond to entities. End-to-end event extraction consists of 3 subtasks: entity extraction, event trigger extraction, and event argument prediction. Event trigger extraction is formalized as a token-based classification problem with $|\mathcal{V}|$ classes. Event argument prediction aims to produce a relational triplet $r_{(\epsilon, v)}$, where $\epsilon$ is an entity, $v$ is an event trigger, and $r$ denotes the argument relation between $\epsilon$ and $v$. For relational triplet prediction, we first extract the candidate entities (and event triggers) and then pair them, either entity with entity or entity with event trigger, to predict the relation label of each pair.
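To make the sequence-labeling formulation concrete, the following minimal sketch decodes a BIO label sequence into typed entity spans for the running example. The label strings (`B-PER`, `I-PER`, and so on) and the helper name `decode_bio` are illustrative assumptions, not notation or code from this work.

```python
from typing import List, Tuple


def decode_bio(labels: List[str]) -> List[Tuple[int, int, str]]:
    """Convert a BIO label sequence into (start, end_exclusive, type) spans."""
    spans = []
    start, etype = None, None
    # A trailing "O" sentinel closes any span that runs to the end of the sentence.
    for i, lab in enumerate(labels + ["O"]):
        continues_span = lab.startswith("I-") and etype == lab[2:]
        if start is not None and not continues_span:
            spans.append((start, i, etype))
            start, etype = None, None
        if lab.startswith("B-"):
            start, etype = i, lab[2:]
        # An "I-x" label with no open span of type x is ill-formed and is skipped.
    return spans


tokens = "W. Dale Nelson covers the White House for The Associated Press".split()
labels = ["B-PER", "I-PER", "I-PER", "O", "O", "B-LOC", "I-LOC", "O",
          "B-ORG", "I-ORG", "I-ORG"]
entities = [(" ".join(tokens[s:e]), t) for s, e, t in decode_bio(labels)]
print(entities)
```

Candidate pairs for relation prediction can then be formed by enumerating pairs of the decoded spans.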

Variational EM
Given a model $p_\phi$ parameterized by $\phi$, the objective is to maximize $\mathcal{L} = \log p(Y; \phi)$ with respect to $\phi$, where $Y$ is the target variable. We can re-formalize the objective by introducing hidden variables $Z$ and another model $q$ parameterized by $\theta$:

$$\log p(Y; \phi) = \underbrace{E_{q(Z;\theta)}[\log p(Y, Z; \phi)] - E_{q(Z;\theta)}[\log q(Z; \theta)]}_{\text{ELBO}} + \mathrm{KL}\big(q(Z; \theta) \,\|\, p(Z \mid Y; \phi)\big) \quad (1)$$

Because the KL term is non-negative, the ELBO is a lower bound of $\log p(Y; \phi)$. Variational EM alternates between an E-step, which updates $\theta$ to minimize the KL term with $\phi$ fixed, and an M-step, which updates $\phi$ to maximize $E_{q(Z;\theta)}[\log p(Y, Z; \phi)]$ with $\theta$ fixed, given that the last term $E_{q(Z;\theta)}[\log q(Z; \theta)]$ of the ELBO is a constant with respect to $p$. Such a formulation has 2 advantages:
(1) It promotes mutual learning from 2 different perspectives when optimizing the single model $p$ alone is hard and insufficient. With such consideration, we treat the logic module P as $p$ and the deep learning module Q as $q$ in (1).
(2) EM exploits the dependencies between input and hidden variables, which is beneficial for modeling inter-dependencies for joint inference, for example, the correlations between entity types and relation categories.
Note that Qu and Tang (2019) adopted this formulation to distill information from graph neural networks for Markov logic networks with given logic rules. Different from their work, we design a semantically meaningful deep architecture for automatic logic reasoning. Compared to other existing works that either use manually constructed logic rules to enhance the learning of DNNs, or learn logic rules but are limited in terms of computational efficiency, we build on top of Qu and Tang (2019) to achieve mutual learning of both DNNs and logic reasoning.
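The decomposition behind variational EM, $\log p(Y) = \text{ELBO} + \mathrm{KL}(q \,\|\, p(Z \mid Y))$, can be checked numerically on a small discrete example. The sketch below uses made-up probabilities for a hidden variable with three states; the numbers and function names are assumptions chosen only for illustration.

```python
import math

# Hypothetical toy values: p(Y = y_obs, Z = z) for three hidden states z.
p_joint = [0.10, 0.25, 0.05]
# An arbitrary variational distribution q(z); it must sum to 1.
q = [0.2, 0.5, 0.3]


def elbo(q_dist, joint):
    """E_q[log p(y, z)] - E_q[log q(z)]."""
    return sum(qz * (math.log(pj) - math.log(qz)) for qz, pj in zip(q_dist, joint))


def kl(q_dist, p_dist):
    """KL(q || p) for two discrete distributions with full support."""
    return sum(qz * (math.log(qz) - math.log(pz)) for qz, pz in zip(q_dist, p_dist))


evidence = sum(p_joint)                        # p(y_obs)
log_evidence = math.log(evidence)              # log p(y_obs)
posterior = [pj / evidence for pj in p_joint]  # p(z | y_obs)

gap = kl(q, posterior)
# The ELBO plus the KL gap recovers the log-evidence exactly.
print(log_evidence, elbo(q, p_joint) + gap)

# An exact E-step sets q to the posterior, which closes the gap.
elbo_at_posterior = elbo(posterior, p_joint)
```

In the setting described above the E-step is only approximate, because $q$ is a neural network rather than a free distribution, but the same identity explains why tightening $q$ toward the posterior and then improving $p$ both push the objective upward.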

First-Order Logic
A first-order logic (FOL) program associates constants, variables, and predicates with logic connectives, namely, $\vee$, $\wedge$, and $\neg$, and quantifiers. A constant $x$ is an object, for example, a word or a relation between two words. A variable $X$ refers to a group of constants. A predicate pred can be regarded as a function that maps constants or variables to True or False. An FOL formula consists of atoms connected with $\vee$, $\wedge$, or $\neg$, representing logic "OR," logic "AND," and logic "NOT," respectively. Here an atom is an n-ary predicate taking constants or variables as arguments. For example, $\text{person}(X)$ is a 1-ary atom and $\text{same}(X_1, X_2)$ is a 2-ary atom. An atom is said to be grounded if all of its variables are instantiated with constants. Given these definitions, an FOL formula in the form of entailment could be represented as

$$d_1 \wedge d_2 \wedge \ldots \wedge d_n \Rightarrow h, \quad (4)$$

where $d_i = \text{pred}(X_1, \ldots, X_m)$ is a body atom and $h$ is the head atom of the formula. The connective $\Rightarrow$ in (4) can be replaced with $\vee$ and $\neg$, converting (4) to an equivalent form, $\neg d_1 \vee \neg d_2 \vee \ldots \vee \neg d_n \vee h$, consisting of valid connectives. In our problem setting, we treat each classifier as a set of FOL entailments and define each target label as the head atom of a set of FOL formulas. For example, $d_1 \wedge d_2 \wedge \ldots \wedge d_n \Rightarrow \text{person}(X)$ explains how $\text{person}(X)$ can be deduced from its body atoms. In this case, if $d_1 \wedge d_2 \wedge \ldots \wedge d_n$ evaluates to True, $\text{person}(X)$ will also be True.
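The equivalence between the entailment form and its disjunctive form follows from the standard rewriting of implication together with De Morgan's law. The short derivation below is added for completeness and uses only the symbols already defined.

```latex
\begin{align*}
d_1 \wedge d_2 \wedge \dots \wedge d_n \Rightarrow h
  &\equiv \neg (d_1 \wedge d_2 \wedge \dots \wedge d_n) \vee h
  && \text{(rewriting } a \Rightarrow b \text{ as } \neg a \vee b\text{)} \\
  &\equiv \neg d_1 \vee \neg d_2 \vee \dots \vee \neg d_n \vee h
  && \text{(De Morgan's law)}
\end{align*}
```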
In order to smoothly integrate FOL with deep learning, probabilistic logic has been proposed that translates the hard assignment of True and False to a soft version within $[0, 1]$ that indicates the probability of an atom being true. Hence, we can define $\upsilon(\text{person}(X)) = \upsilon(d_1 \wedge d_2 \wedge \ldots \wedge d_n) \in [0, 1]$, where $\upsilon(\cdot)$ denotes the probabilistic evaluation of the input. Furthermore, we adopt the T-norm (Klement, Mesiar, and Pap 2013) for the probabilistic evaluation of logic connectives. To encode uncertainties within probabilistic logic, we assign each FOL formula $d_1 \wedge d_2 \wedge \ldots \wedge d_n \Rightarrow h$ a learnable confidence score $\gamma \in [0, 1]$. The higher the score, the more the formula contributes to the computation process.
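As an illustration of soft connectives, the sketch below evaluates a rule body under the product T-norm. The choice of the product T-norm and the way the confidence score $\gamma$ scales the result are assumptions made for this example; the text above does not fix either, and other T-norms (for example, Łukasiewicz) would give different values. All function names are hypothetical.

```python
from itertools import product


def soft_and(*values: float) -> float:
    """Product T-norm: conjunction as the product of truth probabilities."""
    out = 1.0
    for v in values:
        out *= v
    return out


def soft_not(value: float) -> float:
    """Standard soft negation."""
    return 1.0 - value


def soft_or(*values: float) -> float:
    """Dual co-norm of the product T-norm: 1 - prod(1 - v)."""
    return 1.0 - soft_and(*[soft_not(v) for v in values])


def rule_score(body: list, gamma: float) -> float:
    """Assumed scoring: confidence gamma times the soft truth of the body."""
    return gamma * soft_and(*body)


# On hard 0/1 inputs the soft connectives reduce to Boolean AND and OR.
for a, b in product([0.0, 1.0], repeat=2):
    assert soft_and(a, b) == float(bool(a) and bool(b))
    assert soft_or(a, b) == float(bool(a) or bool(b))

# Body of person(X1) AND work for(X1, X2), with made-up probabilities.
body = [0.9, 0.8]
print(soft_and(*body), rule_score(body, gamma=0.5))
```

Because every operation is differentiable in its inputs, such evaluations can be trained with gradient descent alongside the neural components.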

Motivation
Conventional deep learning usually lacks knowledge integration and fails to explicitly model the crucial interactions among the targets. Recently, logic reasoning has been adopted and integrated with DNNs to enhance performance by introducing knowledge as FOL rules. Among them, probabilistic logic converts the hard 0/1 assignment to soft probabilities (Nilsson 1986), which facilitates optimization through gradient descent. However, the pre-designed FOL rules may not be expressive enough to represent the inherent patterns and prevent adaptation to a given training data set. To address this limitation, we propose VDLNs, which inherit the representational power of deep learning, and at the same time simulate the logic rule learning process via a novel logic network consisting of a hierarchy of an atom layer, a rule layer, and an output layer. Given some predefined rule templates, the atom layer implements a neural transformation process to convert the inputs to a set of abstract atoms. Then our logic network learns to discriminatively select the most relevant atoms in the atom layer to compose a logic rule in the rule layer. Our network design avoids the task-dependent manual construction of atoms for each rule. It is also flexible enough to inject any prior knowledge into the logic network if the rules are easy to obtain. The combination of automatically learned and predefined logic rules is realized via a form of residual connection.
To integrate a logic system with deep learning, most existing works only use knowledge to regularize feature learning or feed deep learning outputs as the inputs to the logic system, but ignore the mutual interactions. In this work, we introduce a novel integration of DNNs and knowledge reasoning via variational EM. Note that Qu, Bengio, and Tang (2019) proposed to adopt variational EM for semi-supervised classification by associating 2 graph neural networks. Qu and Tang (2019) further extended the algorithm for efficient inference in Markov logic (Richardson and Domingos 2006). However, their work only updates the weights for predefined rules without learning the predicates of rules. Different from previous works, our proposed model automatically learns useful predicates and the weights of different instantiations of those predicates, which explore the associations among highly dependent classifiers for joint inference.

Methodology
The overview of the proposed model VDLN is shown in Figure 2. It consists of 2 modules: (1) Module Q consists of a DNN that transforms the input sequence of text into abstractive features $h_i$'s and produces the probabilistic outputs $q_i$'s. (2) Module P consists of a set of logic networks (LNets), with one LNet corresponding to each specific word $w_i$ and relation $r$. Each LNet takes $\tilde{y}_{ctx(w_i)}$ (or $\tilde{y}_{ctx(r)}$) as input, which consists of information from all of its associated variables, conducts knowledge reasoning among these variables, and generates the final probabilistic evaluations $p_i$'s. Note that in VDLN, besides modeling complex correlations between targets, the logic module P also implements the BIO labeling scheme. The entire model is trained via the variational EM algorithm that alternates between an E-step (inference) and an M-step (learning). In the E-step, the deep module Q generates soft predictions of each word and candidate relation by distilling knowledge from P. In the M-step, the logic module P takes the predictions of Q as input and generates a probabilistic output for each target class of each word and relation.

Figure 2
An overview of the proposed model. The left part corresponds to module Q and the right part corresponds to module P. Module Q transforms the text input to a set of prediction vectors $q_i$'s, which can be fed as input to module P to produce a set of prediction vectors $p_i$'s. Then an EM algorithm that alternates between an E-step and an M-step is used to train all the parameters.

With a more concrete example, the overall procedure is the following: Given an input sentence of 11 tokens "W. Dale Nelson covers the White House for The Associated Press," module Q first produces the hidden representations $\{h_1, \ldots, h_{11}\}$ and the output vectors $\{q_1, \ldots, q_{11}\}$, as shown in Figure 2. Likewise, a relation output vector $q_r$ is generated for each pair of candidate entities predicted via $h_i$, for example, (W. Dale Nelson, Associated Press), based on their hidden representations $\{h_i\}_{i \in \{1,2,3,10,11\}}$ and their attention scores. Then these vectors $\{\{q_r\}\text{'s}, q_1, \ldots, q_{11}\}$, where $\{q_r\}$'s collects the relation output vectors of all candidate entity pairs, are used to form the input $\tilde{y}_{ctx(w_i)}$ (or $\tilde{y}_{ctx(r)}$) for each word (or relation) in module P, which produces the final probabilistic output vectors $\{\{p_r\}\text{'s}, p_1, \ldots, p_{11}\}$ for all the words and relations. With the output vectors from both modules, we conduct the EM training algorithm, which first updates the parameters in Q by treating the predictions $p_i$'s as the supervision labels. Then in the next iteration, we update the parameters in P by treating the predictions $q_i$'s as the supervision labels.
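The alternating schedule can be sketched on a deliberately tiny stand-in: each module is reduced to a single vector of logits over three classes, and each update fits one module to the other's current predictions by gradient descent on the cross-entropy. This is only a toy under strong assumptions (no input text, no gold labels, and P does not consume Q's outputs as features), and all names are hypothetical; it is not the training procedure of VDLN itself.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fit_to_soft_labels(logits, target, lr=0.5, steps=20):
    """Gradient descent on the cross-entropy between target and softmax(logits)."""
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        # For soft labels that sum to 1, the gradient w.r.t. the logits is probs - target.
        logits = [l - lr * (p - t) for l, p, t in zip(logits, probs, target)]
    return logits


def max_gap(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


theta = [2.0, 0.0, -1.0]   # stand-in parameters of the deep module Q
phi = [-1.0, 0.5, 1.5]     # stand-in parameters of the logic module P
initial_gap = max_gap(softmax(theta), softmax(phi))

for _ in range(5):
    # Update Q, treating the current predictions of P as supervision labels.
    theta = fit_to_soft_labels(theta, softmax(phi))
    # Update P, treating the current predictions of Q as supervision labels.
    phi = fit_to_soft_labels(phi, softmax(theta))

final_gap = max_gap(softmax(theta), softmax(phi))
print(initial_gap, final_gap)
```

The point of the sketch is the control flow: the two sets of parameters are never updated in the same step, and each update holds the other module's predictions fixed as targets.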
In the following, we will describe the architecture of VDLN in Section 5.1 and Section 5.2 in detail.

Deep Learning with Self-Attention
As shown in Figure 2, Q is a deep neural network based on the self-attention mechanism and a bidirectional Gated Recurrent Unit (BiGRU) in order to model non-local and contextual token-level interactions, respectively. Specifically, we use a transformer-style framework with multihead self-attention to generate a hidden representation for each word incorporating its correlation with other tokens. Given input embeddings $x_i$'s corresponding to a text sequence $\{w_1, w_2, \ldots, w_n\}$, the multihead self-attention model generates a hidden representation $\tilde{h}_i$ for each word through a linear transformation of the outputs of all attention heads, where each attention head $c$ produces its own representation of the word by attending to all tokens in the sequence. It is flexible to stack multiple layers of self-attention. Then a BiGRU network, $f_G$, is applied on top of $\tilde{h}_i$ to generate the final feature $h_i$ incorporating sequential interactions. A softmax classifier is used to produce the final prediction for each token over the entity labels. For end-to-end event extraction, we use two separate classifiers for entity and event trigger prediction, respectively, which correspond to two different sets of parameters in (11).
To generate entity-relation triplets, we first construct candidate entity pairs for each sentence by enumerating all the predicted entities. For each entity pair $(\epsilon_1, \epsilon_2)$, the relation feature is a concatenation of its associated entity features, entity types, and attention scores obtained through the transformer network. The entity feature of $\epsilon_1$ is computed from the features $h_i$ of the words $w_i$ within $\epsilon_1$, normalized by $|\epsilon_1|$, the number of words within the candidate entity. Similarly, the entity feature of $\epsilon_2$ is obtained from the other candidate entity. $u_{\epsilon_1}$ and $u_{\epsilon_2}$ denote the entity type embeddings for $\epsilon_1$ and $\epsilon_2$, respectively, obtained by looking up an entity type embedding matrix $U$ with $|\mathcal{E}|$ (the total number of entity types) columns that are randomly initialized and trained through the learning process. The attention vector $\alpha_{12}$ corresponds to the averaged multihead attention score between $w_i \in \epsilon_1$ and $w_j \in \epsilon_2$, while $\alpha_{21}$ records the reverse order of the 2 entities. The final prediction for relation $r$ of the entity pair, that is, the triplet $(\epsilon_1, r, \epsilon_2)$, is produced via a classifier applied to this relation feature. For end-to-end event extraction, the event argument relation triplet $(\epsilon, r, v)$ is generated in a similar manner by replacing $\epsilon_2$ with the event trigger $v$. Due to the sparsity of relation labels, we additionally use a binary classifier to decide whether there is a relation between the entity and the event trigger.
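The construction of the relation feature can be sketched with plain lists. The sketch assumes mean pooling of token features within each entity and a single scalar attention score per direction (already averaged over heads); both are simplifying assumptions, and the function and variable names are hypothetical.

```python
from typing import Dict, List


def mean_vector(vectors: List[List[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def mean_attention(attn: List[List[float]], src: List[int], tgt: List[int]) -> float:
    """Average attention weight from tokens in src to tokens in tgt."""
    scores = [attn[i][j] for i in src for j in tgt]
    return sum(scores) / len(scores)


def relation_feature(h: List[List[float]], attn: List[List[float]],
                     ent1: List[int], ent2: List[int],
                     type_emb: Dict[str, List[float]],
                     type1: str, type2: str) -> List[float]:
    """Concatenate entity features, type embeddings, and both attention directions."""
    h1 = mean_vector([h[i] for i in ent1])
    h2 = mean_vector([h[j] for j in ent2])
    a12 = mean_attention(attn, ent1, ent2)
    a21 = mean_attention(attn, ent2, ent1)
    return h1 + h2 + type_emb[type1] + type_emb[type2] + [a12, a21]


# Toy inputs: 4 tokens with 2-dimensional features and a 4 x 4 attention matrix.
h = [[1.0, 0.0], [3.0, 2.0], [0.0, 0.0], [4.0, 4.0]]
attn = [[0.0, 0.0, 0.0, 0.25],
        [0.0, 0.0, 0.0, 0.75],
        [0.0, 0.0, 0.0, 0.0],
        [0.5, 0.125, 0.0, 0.0]]
type_emb = {"PER": [1.0], "ORG": [2.0]}

feature = relation_feature(h, attn, ent1=[0, 1], ent2=[3],
                           type_emb=type_emb, type1="PER", type2="ORG")
print(feature)
```

Keeping both attention directions makes the feature order-sensitive, which matters because a relation such as work for is not symmetric in its two arguments.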

Logic Network
As described in Section 5.1, the deep learning model only implicitly learns word correlations via high-level features and attentions, but ignores the explicit correlations among target variables, especially those of different types. In fact, the entity/event labels are highly dependent on the relation types, for example, $\text{person}(w_i) \wedge \text{work for}(r_{(w_i, w_j)}) \Rightarrow \text{organization}(w_j)$. Moreover, the segmentation labels are highly correlated within a context window. Although such segmentation interactions could be captured in Q via a structured loss, modeling them in P is more efficient and is capable of capturing more complex correlations together with relation information. Here, we treat such segmentation dependencies as a form of knowledge reasoning.
Recently, some approaches have been proposed to combine deep learning with logic reasoning to regularize the learning process or induce new rules. However, most of them are not expressive enough by limiting themselves to the tasks within the logic domain, or are computationally expensive to work on real application domains. There is also a lack of focus on directly modeling rules for classifiers. For expressiveness, Shanahan et al. (2019) proposed a relational neural network, which only translates to a single logic rule that is propositional in nature. We propose a novel logic network within the logic module P that simulates FOL and enhances reasoning capabilities through multilevel rule constructions within a deep architecture.
As shown in Figure 2, P consists of a separate logic network (LNet) applied on each word and relation. Following the introduction of FOL in Section 3.3, we first adapt the problem into the logic domain, where a logic variable corresponds to a word $w$ or a relational triplet $r_{(\epsilon_1, \epsilon_2)}$. All possible words and relations form the set of logical constants. Each target class $y \in \mathcal{E} \cup \mathcal{V} \cup \mathcal{R}$ could be regarded as a predicate, which becomes a grounded atom when taking constants as arguments. When the target class $y \in \mathcal{E} \cup \mathcal{V}$ is an entity type or event type, it takes a single word (or phrase) as the argument, for example, person(W. Dale Nelson) with $y$ = person. When the target class $y \in \mathcal{R}$ is a relation, it takes a relational triplet as the argument. For example, $\text{work for}(r_{(\epsilon_1, \epsilon_2)})$ with $y$ = work for specifies that entity $\epsilon_1$ and entity $\epsilon_2$ have the relation work for. We use $d(x_1, \ldots, x_n)$ to denote an n-ary atom and $\upsilon(d) \in [0, 1]$ to denote the probability of the atom being true. For example, $\upsilon(\text{work for}(r_{(\epsilon_1, \epsilon_2)})) = 0.8$ indicates that $\epsilon_1$ and $\epsilon_2$ have the relation work for with probability 0.8.
As discussed in Section 3.3, we treat each target class as a form of logic entailment where the target class is the head atom h of a set of logic rules/formulas R ∈ {R_1, . . . , R_S}. Here R is a rule identifier. As a concrete example, if we aim to predict whether a text segment w_j belongs to the target class "organization (entity)," we may define a logic rule to entail the target entity type "organization": person(w_i) ∧ work for(r(w_i, w_j)) ⇒ organization(w_j), where the head atom h = organization(w_j) corresponds to the target entity type. The result depends on its precondition, which consists of two atoms d_1 = person(w_i) and d_2 = work for(r(w_i, w_j)). An FOL program then aims to produce the truth probability of h given the set of all possible rules {R_1, . . . , R_S}. In most cases, such rules may not be readily available. Hence, it is desirable to learn the FOL rules automatically. To achieve that, we use a separate logic network (LNet) to generate relevant rules corresponding to the same head atom and evaluate its truth probability through its preconditions.
The detailed computation process for each LNet is shown in Figure 3. For a logic constant $m \in N \cup N_r$ referring to either a word or a relation, we build a set consisting of its relevant contexts $ctx(m) = \{m_1, \ldots, m_{|ctx(m)|}\}$. Then the input to an LNet becomes $\tilde{y}_{ctx(m)} = (q_{m_1}, q_{m_2}, \ldots, q_{m_{|ctx(m)|}})$, which combines the deep learning predictions of each element in $ctx(m)$. The LNet aims to produce probabilistic evaluations of a set of N atoms $D = \{d_1, \ldots, d_N\}$ in the atom layer, which are in turn used to form a logic program consisting of a set of logic rules $\{R_1, \ldots, R_S\}$ of the form $d_{j_1} \wedge \ldots \wedge d_{j_T} \Rightarrow h$. All these rules share an identical head atom $h = y_m$ that indicates whether m belongs to a target class $y_m \in E \cup V \cup R$. The final output is a probabilistic evaluation υ(h) obtained by accumulating all the logic rules and considering their confidence scores $\gamma_1, \ldots, \gamma_S$. In this way, the LNet is able to model the correlations of related constants formed by each word's or relation's relevant contexts.

Figure 3
An example of a logic network. The input consists of prediction vectors $q_i$ from module Q. The synthetic atoms consist of the sets $D^{(1)}_{m_i}$ of 1-ary synthetic atoms, whose values are obtained by transforming $q_{m_i}$ with a vector $v_{i,n}$, and $D^{(2)}$, a set of 2-ary synthetic atoms, whose values are obtained by transforming 2 input vectors $q_{m_{r_1}}, q_{m_{r_2}}$ with a bilinear matrix $V_n$. The set of synthetic atoms can be combined with predefined atoms via a concatenation operation to form the atom layer. The rule layer then selects relevant atoms in the atom layer to form logic rules to generate the final value υ(h) for the head atom.
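As shown in Figure 3, to produce the set of N atoms, each LNet first creates a set of 1-ary atoms $D^{(1)}_{m_i}$ for each input $m_i \in ctx(m)$, where each atom records a unique property of its corresponding input $m_i$. It also produces a set of 2-ary atoms $D^{(2)}$, where each atom indicates a relation between 2 interacting contexts $m_{r_1}, m_{r_2} \in ctx(m)$. The value for each atom is computed automatically via parameterized transformations:

$$\upsilon\big(d^{(1)}_{i,n}\big) = \sigma\big(v_{i,n}^{\top} q_{m_i}\big), \qquad (14)$$

$$\upsilon\big(d^{(2)}_{n}\big) = \sigma\big(q_{m_{r_1}}^{\top} V_n \, q_{m_{r_2}}\big), \qquad (15)$$

where σ is the sigmoid function for probabilistic evaluations, and $v_{i,n} \in \mathbb{R}^{|q_{m_i}|}$ and $V_n \in \mathbb{R}^{|q_{m_{r_1}}| \times |q_{m_{r_2}}|}$ are transformation parameters that convert the input vectors to a scalar.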
We can view (14) and (15) as computing the probability of a specific property of the input (e.g., q m i ) being true. Then D ctx(m) = {D (1) m i } m i ∈ctx(m) ∪ D (2) could be regarded as recording different properties or relationships of the input context ctx(m). We treat these automatically generated atoms D ctx(m) as synthetic atoms.
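The transformations in (14) and (15) can be illustrated with a minimal NumPy sketch. It assumes randomly initialized parameters and toy prediction vectors; the function names, the label counts, and the choice of which two inputs feed the bilinear form are hypothetical and only serve the illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unary_atoms(q, v):
    """Values of 1-ary synthetic atoms: sigma(v_n . q) for each row v_n of v."""
    return sigmoid(v @ q)

def binary_atoms(q_a, q_b, V):
    """Values of 2-ary synthetic atoms: sigma(q_a^T V_n q_b) for each matrix V_n."""
    return sigmoid(np.einsum("a,nab,b->n", q_a, V, q_b))

rng = np.random.default_rng(0)
n1, n2 = 20, 20              # atoms per input, as in the experimental setting
n_ent, n_rel = 4, 5          # hypothetical label-set sizes

# Toy stand-ins for the prediction vectors produced by module Q.
q_word = rng.dirichlet(np.ones(n_ent))    # target word
q_other = rng.dirichlet(np.ones(n_ent))   # related word
q_rel = rng.dirichlet(np.ones(n_rel))     # relation between the two words

v = rng.normal(size=(n1, n_ent))          # parameters of the 1-ary atoms
V = rng.normal(size=(n2, n_ent, n_rel))   # bilinear parameters of the 2-ary atoms

d1 = unary_atoms(q_word, v)
d2 = binary_atoms(q_other, q_rel, V)

# Atom layer: synthetic atoms followed by predefined atoms (the raw prediction).
atom_layer = np.concatenate([d1, d2, q_word])
```

Each entry of `d1` and `d2` lies in (0, 1) and can be read as the probability that one learned property of the input holds.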
Take the sentence "W. Dale Nelson covers the White House for The Associated Press" as an example. To make predictions on the word Dale in module P, we first identify its context ctx(Dale) = {W., Dale, Nelson, The Associated Press, r(Dale, The Associated Press)} if The Associated Press is extracted as an entity. Then the input $\tilde{y}_{ctx(\text{Dale})}$ is the concatenation of all the prediction vectors q corresponding to each element in ctx(Dale) obtained from module Q: $\tilde{y}_{ctx(\text{Dale})} = (q_{\text{W.}}, q_{\text{Dale}}, q_{\text{Nelson}}, q_{\text{The Associated Press}}, q_{r(\text{Dale}, \text{The Associated Press})})$. Given $\tilde{y}_{ctx(\text{Dale})}$, the values of the synthetic atoms are obtained as follows. For the first input vector $q_{\text{W.}}$, corresponding to the word preceding Dale, we produce $\upsilon(d^{(1)}_{1,n}) = \sigma(v_{1,n}^{\top} q_{\text{W.}})$ for $n = 1, \ldots, n_1$ according to (14), which yields $n_1$ atoms recording different properties of the preceding word. We treat these atoms as unary synthetic atoms. In a similar manner, we obtain the 2-ary synthetic atoms in $D^{(2)}$ according to (15).

To make the logic network flexible and comprehensive, we further enhance LNet with residual connections to incorporate predefined atoms and logic rules when provided. As shown in Figure 3, a concat operation concatenates the synthetic atoms and original inputs to form the atom layer $D = \{d_1, \ldots, d_N\}$, in which the first $N_{ctx(m)} = |D_{ctx(m)}|$ atoms are the synthetic atoms, while the last $N - N_{ctx(m)}$ atoms are the predefined atoms. Different from synthetic atoms, which do not have exact semantic meanings, the predefined atoms are formed by the original inputs $q_i$ specifying the probability of each target class corresponding to the input word/relation. For example, a predefined atom $d_j$ with predicate person, $N_{ctx(m)} + 1 \le j \le N$, inherits its value from the entry [person] of the corresponding prediction vector q, which indicates the probability of the label person for that input. The predefined atoms facilitate the incorporation of prior knowledge, for example, person(w_i) ∧ work for(r(w_i, w_j)) ⇒ organization(w_j), into the rule layer.
The rule layer aims to learn a set of logic rules $\{R_1, \ldots, R_S\}$ corresponding to the same head atom h by choosing proper body atoms from D. It consists of two kinds of logic rules: learned rules and fixed rules.

The learned rules are constructed through an iterative selection process via an attention mechanism. Given the set of all possible atoms D, a logic rule $R: d_{j_1} \wedge \ldots \wedge d_{j_T} \Rightarrow h$ is formed by learning to select $d_{j_t} \in D$ at each iteration $t \in \{1, \ldots, T\}$. The selection process is parameterized and approximated using attention, where a trainable weight vector $\beta_t \in \mathbb{R}^N$ records the relevance of all N atoms in D at iteration t. To achieve sparse selection, we adopt sparsemax (Martins and Astudillo 2016), which projects $\beta_t$ onto the probability simplex:

$$\mathrm{sparsemax}(\beta_t) = \arg\min_{a \in \Delta^{N-1}} \lVert a - \beta_t \rVert^2 . \qquad (16)$$

Intuitively, $\mathrm{sparsemax}(\beta_t)$ transforms $\beta_t$ into a sparse probabilistic vector indicating the most relevant atoms to be selected. Note that $\beta_t$ is randomly initialized and trained throughout the learning process. To produce the value for the head atom h, we follow (5) and combine the selected atom values $d^{\top}\mathrm{sparsemax}(\beta_t)$ over the T iterations (17), where $d = [\upsilon(d_1), \ldots, \upsilon(d_N)]$ denotes the vector of atom evaluations. Specifically, for each step t, $d^{\top}\mathrm{sparsemax}(\beta_t) \approx \upsilon(d_{j_t})$ when $\beta_t$ assigns most of the probability mass to $d_{j_t}$, which is more likely with sparsemax than with softmax. After T iterations of selection, the resulting logic rule has the form $d_{j_1} \wedge \ldots \wedge d_{j_T} \Rightarrow h$.

Fixed rules correspond to manually constructed prior knowledge. These rules can be used to enhance the final prediction when the learned rules are not expressive enough. To incorporate such prior knowledge, we transform the body atoms of each given rule into 1-hot attention weights. For example, for person(w_i) ∧ work for(r(w_i, w_j)) ⇒ organization(w_j), we construct the corresponding attention weights $\beta_1 = \mathbb{1}(d_i = \text{person})$, $\beta_2 = \mathbb{1}(d_j = \text{work for})$, where $\mathbb{1}(\cdot)$ is an indicator function. These weight vectors are kept fixed during training.
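The selection mechanism can be sketched as follows. The `sparsemax` function implements the sorting-based projection of Martins and Astudillo (2016). The way the selected atom values are combined into a rule value is given by (5), which is not reproduced in this section, so the product used in `rule_value` is only an assumed stand-in for the conjunction; the atom names and values are likewise hypothetical.

```python
import numpy as np

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    ks = np.arange(1, z.size + 1)
    support = 1.0 + ks * z_sorted > cumsum   # coordinates that stay non-zero
    k = ks[support][-1]
    tau = (cumsum[k - 1] - 1.0) / k          # threshold subtracted from z
    return np.maximum(z - tau, 0.0)

def rule_value(d, attention_rows):
    """Soft value of d_j1 AND ... AND d_jT.

    Each row softly selects one body atom via d . a_t. The product over t is
    an ASSUMED conjunction operator, not necessarily the paper's (5).
    """
    return float(np.prod([d @ a for a in attention_rows]))

# Hypothetical atom layer: [person(w_i), work_for(r_ij), some other atom].
d = np.array([0.9, 0.8, 0.1])

# Learned rule: trainable scores beta_t, sparsified before selection.
learned_betas = np.array([[3.0, 0.5, 0.1],
                          [0.2, 2.5, 0.3]])
learned_value = rule_value(d, [sparsemax(b) for b in learned_betas])

# Fixed rule person AND work_for => organization: 1-hot weights, kept frozen.
fixed_weights = np.eye(3)[[0, 1]]
fixed_value = rule_value(d, fixed_weights)
```

With these scores, sparsemax places all mass on a single atom per step, so the learned rule selects exactly the same two body atoms as the fixed rule.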
The final value for h considering all the relevant rules $\{R_1, \ldots, R_S\}$ is obtained by accumulating the values of the individual rules according to their confidence scores (18), where $\gamma \in \mathbb{R}^S$ indicates the confidence level for each rule and is trainable. υ(h) can be regarded as a binary classifier for its corresponding target class, for example, "organization (entity)." We use the same atom set D with different attention vectors to parameterize different target classes. Denoting by $\upsilon(h_l)$ the output from an LNet for each target class l, we produce the final multi-class predictions $p_m$ in module P from the outputs $\upsilon(h_1), \ldots, \upsilon(h_{L(m)})$ (19), for $m \in N_r \cup N$. The number of nodes in the output layer is L(m) = |E| for an entity (m ∈ N), L(m) = |V| for an event trigger (m ∈ N), and L(m) = |R| for a relation ($m \in N_r$). As shown both in Figure 2 and (19), the output υ(h) of each LNet is used to produce the probabilistic vector $p_m$ as the output of module P for each constant m. These probabilistic vectors $p_m$, together with the outputs $q_m$ from module Q, are further used to train our joint model via the EM algorithm, as discussed in the sequel.

Inference
The E-step involves inference and update of module Q by taking the output $p_m$ from the logic module P. Recall from Section 3.2 that the objective is to solve q(Z; θ) = p(Z|Y; φ) with p fixed. Here $Z = \{\mathbf{y}_m\}_{m \in N \cup N_r}$ corresponds to the predictions for all the words N and relations $N_r$, and $Y = \{\mathbf{y}_{ctx(m)}\}_{m \in N \cup N_r}$, with ctx(m) denoting the context of node m, which will be introduced later. Using the mean-field variational approximation, the above probabilities factorize as

$$q(Z) = \prod_{m \in N \cup N_r} q(\mathbf{y}_m), \qquad p(Z \mid Y) = \prod_{m \in N \cup N_r} p(\mathbf{y}_m \mid \mathbf{y}_{ctx(m)})$$

(θ and φ are omitted for ease of illustration). To train our model with an EM algorithm, we associate $p(\mathbf{y}_m \mid \mathbf{y}_{ctx(m)})$ with the logic module P such that $p(\mathbf{y}_m = y \mid \mathbf{y}_{ctx(m)}) = (p_m)_{[y]}$ represents the probability that m has label y, where $p_m$ is the output of the logic module obtained from (19). The subscript [y] denotes the y-th entry of the corresponding vector. Similarly, we associate $q(\mathbf{y}_m)$ with module Q, where $q(\mathbf{y}_m = y) = (q_m)_{[y]}$. For relation prediction, $q(\mathbf{y}_r = y) = (q_r)_{[y]}$ with $r \in N_r$. Note that the bold symbols (e.g., $\mathbf{y}_m$, $\mathbf{y}_r$, $\mathbf{y}$) within distributions p(·) and q(·) denote random variables for label predictions, and the non-bold symbols (e.g., $y_i$, $y$) indicate the actual label assignment. According to Qu, Bengio, and Tang (2019), the optimal solution satisfies the approximated condition

$$\log q(\mathbf{y}_m) \approx \mathbb{E}_{q(\mathbf{y}_{ctx(m)})}\big[\log p(\mathbf{y}_m \mid \mathbf{y}_{ctx(m)})\big] \qquad (20)$$

for $m \in N_r \cup N$, with $N_r$ and N denoting the set of relations and words, respectively. The above condition can be further converted to

$$\log q(\mathbf{y}_m) \approx \log p(\mathbf{y}_m \mid \tilde{y}_{ctx(m)}) \qquad (21)$$

by approximating the expectation via sampling $\tilde{y}_{ctx(m)}$ from $q(\mathbf{y}_{ctx(m)})$ in module Q. Intuitively, (21) aims to align the distributions from the two modules.
To update q in terms of θ, we minimize the following objective with p fixed:

$$O_E = -\sum_{m \in N \cup N_r} \mathbb{E}_{p(\mathbf{y}_m \mid \tilde{y}_{ctx(m)})}\big[\log q(\mathbf{y}_m; \theta)\big]. \qquad (22)$$

To achieve that, we first generate the prediction $\hat{y}_m = \arg\max_y p(\mathbf{y}_m = y \mid \tilde{y}_{ctx(m)})$ through the logic module P using (19), given $\tilde{y}_{ctx(m)}$ (its construction is discussed later). We use the predicted label $\hat{y}_m$ as the target label to train module Q, replacing $\mathbb{E}_{p(\mathbf{y}_m \mid \tilde{y}_{ctx(m)})}[\log q(\mathbf{y}_m)]$ in (22) with $\log q(\mathbf{y}_m = \hat{y}_m)$. During training, as the ground-truth labels are available, we also utilize label information to update q. Specifically, for each m, we update q using the aforementioned strategy with probability 0.5; otherwise we replace $\hat{y}_m$ (or $\{\hat{y}_1, \ldots, \hat{y}_n\}$) predicted from P with the ground-truth label to update q.
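The target selection in the E-step can be sketched as below. This is a minimal illustration assuming that the outputs of P and Q are available as row-stochastic NumPy arrays; the function names are hypothetical, and averaging the loss over constants (rather than summing) is a scaling choice made here for convenience.

```python
import numpy as np

def choose_targets(p_probs, gold, rng, gold_rate=0.5):
    """Pick a training target for each constant m.

    With probability `gold_rate` the ground-truth label is used; otherwise the
    label predicted by the logic module P (argmax of p_m) is used.
    """
    predicted = p_probs.argmax(axis=1)
    use_gold = rng.random(len(gold)) < gold_rate
    return np.where(use_gold, gold, predicted)

def cross_entropy(q_probs, targets, eps=1e-12):
    """Negative log q(y_m = target), averaged over the constants."""
    picked = q_probs[np.arange(len(targets)), targets]
    return float(-np.log(picked + eps).mean())

rng = np.random.default_rng(0)
p_probs = np.array([[0.9, 0.1], [0.2, 0.8]])    # toy outputs of module P
q_probs = np.array([[0.5, 0.5], [0.25, 0.75]])  # toy outputs of module Q
gold = np.array([1, 1])                         # toy ground-truth labels

targets = choose_targets(p_probs, gold, rng)
loss_for_q = cross_entropy(q_probs, targets)    # would be minimized w.r.t. theta
```

Setting `gold_rate` to 0 recovers pure distillation from P into Q, and setting it to 1 recovers ordinary supervised training of Q.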

Learning
The M-step involves learning and update of module P. Recalling the objective in (3), we fix Q and use it to update P. Assuming conditional independence given the contexts, we have

$$p(Z \mid Y; \phi) = \prod_{m \in N \cup N_r} p(\mathbf{y}_m \mid \mathbf{y}_{ctx(m)}; \phi). \qquad (23)$$

Then the objective of maximizing (3) becomes

$$\max_{\phi} \sum_{m \in N \cup N_r} \log p(\mathbf{y}_m = \hat{y}_m \mid \tilde{y}_{ctx(m)}; \phi). \qquad (24)$$

Here $\hat{y}_m = \arg\max_y q(\mathbf{y}_m = y)$ is the predicted label, which corresponds to the maximum probability in $q(\mathbf{y}_m)$ from module Q. To avoid randomness brought by sampling $\mathbf{y}_{ctx(m)}$, we directly use the real-valued outputs given by module Q such that $\tilde{y}_{ctx(m)} = (q_{m_1}, \ldots, q_{m_{|ctx(m)|}})$ with $ctx(m) = \{m_1, \ldots, m_{|ctx(m)|}\}$. This aligns with the input of the logic network shown in Figure 2. (24) can be viewed as maximizing the log-likelihood of the predictions given by Q using module P. To incorporate label supervision, we use $\hat{y}_m$ with probability 0.5; otherwise we replace $\hat{y}_m$ in (24) with m's true label to update p.
To compute $p(\mathbf{y}_m \mid \tilde{y}_{ctx(m)})$ within the logic module P, we define the context ctx(m) of each variable m as those variables that have intensive correlations with m for the task at hand, by constructing rule templates given the outputs $\{q_i\}$ from module Q. When $m = w_i \in N$, we use 3 types of dependencies for the rule templates:
• The prediction of a word $w_i$ from Q is a direct precondition: $q_i \rightarrow p_i$.
• The prediction of another word $w_j$ from Q that has a relation with $w_i$ could inform the target prediction: $q_j, q_{r_{ij}} \rightarrow p_i$.
• The predictions of $w_i$'s preceding and following words from Q could inform the target prediction: $q_{i-1}, q_{i+1} \rightarrow p_i$. Note that this type of dependency is applicable when the structured prediction is implemented in the logic module (P), not the deep learning module (Q).
For relation prediction when $m = r_{ij} \in N_r$, we use 2 templates:
• The prediction of $r_{ij}$ from Q is a direct precondition: $q_{r_{ij}} \rightarrow p_{r_{ij}}$.
• The predictions of $w_i$ and $w_j$ from Q could inform the target: $q_i, q_j \rightarrow p_{r_{ij}}$.
Given these dependency templates, we construct the input of the logic network $\tilde{y}_{ctx(m)}$ for $m = w_i \in N$ as $\tilde{y}_{ctx(m)} = (q_{i-1}, q_i, q_{i+1}, q_j, q_{r_{ij}})$ for entity (or event) prediction of each word $w_i$, where $q_{i-1}$, $q_i$, and $q_{i+1}$ are separately used to construct 1-ary atoms, and both $q_j$ and $q_{r_{ij}}$ are used to construct 2-ary atoms in module P. Intuitively, the corresponding words and relations form the context of $w_i$, denoted by $ctx(i) = \{w_{i-1}, w_i, w_{i+1}, w_j, r_{ij}\}$. Similarly, the input $\tilde{y}_{ctx(m)}$ when $m = r_{ij} \in N_r$ for relation prediction of $r_{ij}$ is $\tilde{y}_{ctx(m)} = (q_{r_{ij}}, q_i, q_j)$. We use $q_{r_{ij}}$, $q_i$, and $q_j$, respectively, to produce 1-ary atoms. Again, both $q_i$ and $q_j$ are used to produce 2-ary atoms. Given the construction of $\tilde{y}_{ctx(m)}$, the output $p_m$ of the logic network is then computed following (19).
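The context construction can be sketched as two small helper functions. This is an illustration only: the function names, the dictionary keyed by index pairs, and the use of a padding vector at sentence boundaries are assumptions, since the text does not specify how the first and last words are handled.

```python
def word_context(i, j, q_word, q_rel, pad):
    """Input of the logic network for word w_i: (q_{i-1}, q_i, q_{i+1}, q_j, q_{r_ij}).

    q_word: list of per-word prediction vectors from module Q.
    q_rel:  dict mapping a pair (i, j) to the relation prediction vector.
    pad:    ASSUMED placeholder vector used when w_i has no left/right neighbour.
    """
    prev_q = q_word[i - 1] if i > 0 else pad
    next_q = q_word[i + 1] if i + 1 < len(q_word) else pad
    return (prev_q, q_word[i], next_q, q_word[j], q_rel[(i, j)])

def relation_context(i, j, q_word, q_rel):
    """Input of the logic network for relation r_ij: (q_{r_ij}, q_i, q_j)."""
    return (q_rel[(i, j)], q_word[i], q_word[j])

# Toy prediction vectors for a 4-word sentence with 2 entity labels.
q_word = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.8]]
q_rel = {(1, 3): [0.3, 0.7], (0, 3): [0.6, 0.4]}
pad = [0.0, 0.0]

ctx_word = word_context(1, 3, q_word, q_rel, pad)
ctx_first = word_context(0, 3, q_word, q_rel, pad)
ctx_rel = relation_context(1, 3, q_word, q_rel)
```

The first three entries of a word context feed the 1-ary atoms, and the last two feed the 2-ary atoms, mirroring the templates above.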

Optimization
Overall, the training process alternates between the variational E-step and M-step to update module Q using (22) and module P using (24). In both steps, the output of P is obtained from the context predictions of the target produced by Q, which reflects the intensive interactions between the two modules. This interaction is further strengthened by learning to bring the two distributions close to each other throughout training. To facilitate training, we first pretrain Q using the ground-truth labels for several iterations before the variational EM procedure. In the testing phase, both P and Q can be adopted to generate the predictions. In our experiments, we use a strategy similar to ensemble learning: each module is assigned a weight tuned on the validation set, and the final prediction is the weighted average of the outputs of the two modules. The complete training procedure for end-to-end relation extraction is shown in Algorithm 1.
Algorithm 1
    where δ is the learning rate.
  end for
2: EM training between P and Q
  Generate $\tilde{y}_{ctx(m)} = (q_{m_1}, \ldots, q_{m_{|ctx(m)|}})$ from outputs of module Q.
  for each iteration do
3:  Inference
    for k = 1, 2, . . . , K do
      Produce probabilistic output $p_m$ in module P from LNets using (19).
      Obtain predictions $\hat{y}_m = \arg\max_y p(\mathbf{y}_m = y \mid \tilde{y}_{ctx(m)}) = \arg\max_y (p_m)_{[y]}$.
      Update module Q via $\theta := \theta - \delta_q \, \partial O_E / \partial \theta$, where $\delta_q$ is the learning rate for the E-step. $O_E$ is obtained through (22).
    end for
4:  Learning
    for k = 1, 2, . . . , K do
      Obtain predicted label $\hat{y}_m = \arg\max_y q(\mathbf{y}_m = y)$ from module Q.
      Compute distribution $p(\mathbf{y}_m \mid \tilde{y}_{ctx(m)}) = p_m$ from module P using (19).
      Update module P via $\phi := \phi - \delta_p \, \partial O_M / \partial \phi$, where $\delta_p$ is the learning rate for the M-step. $O_M$ is obtained through (24).
    end for
  end for

7. Experiment

Tasks and Data
We conduct experiments on 6 benchmark data sets from 3 IE tasks:
• Aspect and Opinion Terms Extraction: Aspect terms refer to the product features or attributes that users comment on in customer reviews. Opinion terms are those carrying subjective opinions toward the products or services. For example, given a review sentence "The service staff is terrible.", service staff is an aspect term and terrible is the opinion term. We use a restaurant review corpus and a laptop review corpus from Pontiki et al. (2014). The statistics of the two data sets are shown in Table 2.
• End-to-end Relation Extraction: This task involves the identification and classification of both entities and relations between entities. For this task, three benchmark data sets are used, including CoNLL04 (Roth and Yih 2004), ACE04 (Doddington et al. 2004), and ACE05. As shown in Table 3, CoNLL04 consists of 4 entity types and 5 relation categories. ACE04 defines 7 entity types with 7 relation categories, and ACE05 adopts the same entity types as ACE04 but defines 6 relation types. CoNLL04 and ACE04 do not provide an official train/test split; hence we conduct 3-fold and 5-fold cross-validation for CoNLL04 and ACE04, respectively, to report our final results. We follow the same preprocessing and data split as Li and Ji (2014) on the ACE05 data set.
• End-to-end Event Extraction: This task involves three subtasks, namely, extraction and classification of entity mentions, extraction and classification of event triggers, and discovery of relationships between entity mentions and event triggers for event argument extraction and classification. For this task, the same ACE05 data set is used. For entity mentions, we consider ACE entity types PER, ORG, GPE, LOC, FAC, VEH, WEA, and ACE VALUE and TIME expressions, following the common setting of existing work. In total, 33 event subtypes are involved in the event trigger classification task. The total number of different argument roles for entities participating in various events is 35, and we collapse 8 of them that are time-related, following Yang and Mitchell (2016). The detailed statistics of the ACE05 data set for event extraction are shown in Table 4. For evaluation, we treat an entity as correct if both its entity type and offset match one of the ground-truth entities. An event trigger is correctly identified if its offset matches one of the reference event triggers, and it is regarded as correctly classified if its type is also correct. An argument role is correctly identified if the corresponding entity type, entity offset, and event type match one of the reference argument roles, and it is correctly classified if the argument role is also correct.
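The exact-match criterion for entities can be illustrated with a small micro-F1 sketch. It assumes that each sentence is represented by a set of (start, end, type) tuples with token offsets; the function name and the toy spans (taken from the example sentence "W. Dale Nelson covers the White House for The Associated Press") are for illustration only.

```python
def micro_f1(gold_sets, pred_sets):
    """Micro-averaged F1 where an item counts only if offsets and type both match."""
    tp = sum(len(g & p) for g, p in zip(gold_sets, pred_sets))
    n_gold = sum(len(g) for g in gold_sets)
    n_pred = sum(len(p) for p in pred_sets)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# One sentence; spans are (start, end-exclusive, type) over token positions.
gold = [{(0, 3, "PER"), (5, 7, "LOC"), (8, 11, "ORG")}]
# The second span has the right offsets but the wrong type, so it is an error.
pred = [{(0, 3, "PER"), (5, 7, "ORG"), (8, 11, "ORG")}]

score = micro_f1(gold, pred)
```

Here two of three predictions match exactly, so precision and recall are both 2/3. The same function applies to triggers or argument roles by changing what each tuple contains.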

Experimental Setting
To integrate the self-attention mechanism, we use a pretrained BERT model (base-uncased) (Devlin et al. 2019) to initialize all the word embeddings and to produce the attention scores for each pair of words. The batch size is 20 and the dimension for BiGRU is 100 with dropout rate 0.1. For the logic network, we set the number of 1-ary atoms ($n_1$) and 2-ary atoms ($n_2$) for each input variable to 20 and the number of body atoms in each rule to T = 8. The total number of rules is set to S = 30. During training, we use Adadelta with initial learning rate 0.01 for module Q and Adam with initial learning rate 0.01 for module P. The sampling rate for both the E-step and the M-step is set to 0.5; that is, for 50% of the time, the ground-truth label is used for learning the corresponding module. For each experiment, we first pretrain Q for 50 epochs and then alternate between P and Q, training each module for 2 epochs per turn. The final prediction is made by an ensemble strategy with weights 0.6 and 0.4 for Q and P, respectively. All the hyperparameters are selected via the validation set. For evaluation, we use micro-F1 scores on non-negative classes. An entity is correct if both its segmentation and its entity type are correct. A relation is correct if both of its entities (events) and the relation type match the ground-truth label. We use the same evaluation metric as Yang and Mitchell (2016) for event extraction.
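The ensemble step amounts to a weighted average of the two probability vectors. The sketch below uses the weights 0.6 and 0.4 stated above; the function name and the toy distributions are hypothetical.

```python
import numpy as np

def ensemble(q_probs, p_probs, w_q=0.6, w_p=0.4):
    """Weighted average of the class distributions from modules Q and P."""
    return w_q * np.asarray(q_probs) + w_p * np.asarray(p_probs)

q_probs = np.array([0.6, 0.4])   # toy output of the deep learning module Q
p_probs = np.array([0.1, 0.9])   # toy output of the logic module P

final_probs = ensemble(q_probs, p_probs)
final_label = int(final_probs.argmax())
```

In this toy case Q alone would pick class 0, while the combined distribution favors class 1, which shows how a confident logic module can override a weakly confident neural prediction.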
For time complexity, we report the duration using 1 GPU of Tesla V100 250W. Pure neural model (e.g., BERT) takes 38s and 28s for training 1 epoch on Res14 and Lap14 data set, respectively. It takes 556s and 522s for training 1 iteration of VDLN that consists of 2 epochs of both modules on Res14 and Lap14 data set, respectively. On CoNLL04, it takes 56s for training 1 epoch of BERT and 271s for training 1 iteration of VDLN. On ACE04, it takes 131s for training 1 epoch of BERT and 967s for training 1 iteration of VDLN. On ACE05, it takes 242s for training 1 epoch of BERT and 1713s for training 1 iteration of VDLN. For memory usage, experiments on Res14 and Lap14 data occupy around 8.9G. Experiment on CoNLL04 takes 15.9G. Experiments on ACE04 and ACE05 occupy around 6.7G. All these memory usages are almost the same when compared with pure deep learning models.

Aspect and Opinion Terms Extraction:
To demonstrate the effectiveness of our proposed model, we compare with the following most recent baselines:
• GInf: A pipelined model combining deep neural networks with integer linear programming (Yu, Jiang, and Xia 2019). The predictions produced by the deep neural networks are taken as input to the integer linear programming system, where explicit relational constraints among aspect terms and opinion terms are enforced considering syntactic information.
• Rule-distill: A posterior-regularization-based framework to regularize deep learning predictions via prior knowledge. The training is conducted via teacher-student knowledge distillation (Hu et al. 2016). To adapt this model to our problem setting, we construct a few logic rules, as shown in Table 14, to form the teacher network. For fairness, we use the same neural model (module Q) as the student network.
• DLogic: A joint model incorporating explicit logic rules into the deep learning model (Wang and Pan 2020). The deep learning predictions are made as probabilistic evaluations of input atoms to produce the output for the head atom of each rule. Then a discrepancy loss is computed to align the deep learning predictions with a set of predefined logic rules.
• DLogic*: Replace the deep neural networks of DLogic with the one used in our proposed model for fair evaluations.
• VDLN: The proposed model consisting of a logic module P and a deep learning module Q.
• SOTA (Chen and Qian 2020): The current state-of-the-art model on aspect and opinion terms extraction, which implements BERT-large with collaborative learning considering the interactions among aspect terms, opinion terms, and sentiment polarities.

Table 5 shows the results for aspect and opinion terms extraction. Because some of the baseline models do not have published code, we only conduct 3 different runs for Rule-distill, DLogic, DLogic*, and our proposed model VDLN. For the other baseline models, we use the results as reported. The numbers in italics indicate the average results over 3 different runs. This task can be cast as a special case of entity extraction by treating aspect terms and opinion terms as 2 different entity types. Yu, Jiang, and Xia (2019) incorporated explicit relational knowledge among aspect and opinion words through integer linear programming. However, the separation of knowledge reasoning from the DNN during learning makes the result suboptimal. In comparison, VDLN makes these 2 components interactive via variational learning. VDLN also outperforms the teacher-student network Rule-distill (Hu et al. 2016), demonstrating the advantage of the EM algorithm for mutual learning and the ability to learn correlation patterns as logic rules. To verify the expressiveness of our proposed logic network for knowledge reasoning, we compare with explicit rule integration (Wang and Pan 2020), which bridges the DNN outputs with explicit logic rules by minimizing their discrepancies. For a fair comparison, we replace their DNN module with ours, denoted by DLogic*. VDLN performs better in all settings, which supports the advantage of automatically learning a logic network over fixed rules.
The SOTA model (Chen and Qian 2020) adopted BERT-large as the feature learning backbone and implemented a multitask learning framework with a collaborative learning mechanism to explore interactions among target terms and sentiment polarities for joint extraction. VDLN with logic reasoning outperforms the SOTA model even with a BERT-base neural component. In general, VDLN significantly outperforms all baselines with p < 0.05 using a paired t-test, except the SOTA model on opinion extraction of Res14.

End-to-End Relation Extraction:
Besides the aforementioned baselines DLogic and DLogic*, we further adopt the following baselines:
• Gopt: A globally optimized neural model for end-to-end relation extraction (Zhang, Zhang, and Fu 2017). The work converts the entity and relation extraction problem into a single table-filling task, which produces a score for each label in the next step given the state of a partially filled table. Moreover, global optimization is used, which treats the entire sentence as a unit.
• MtQA: The extraction of entities and relations is cast as the task of identifying answer spans from the context given some question templates. The question encodes relevant information corresponding to the target entity or relation to be identified.
• SpanRel: An end-to-end deep learning model based on span-level predictions (Dixit and Al-Onaizan 2019). Instead of token-level modeling, span-based models take the features corresponding to all possible spans within a sentence for both entity and relation predictions.
• SOTA (Wang and Lu 2020): A joint model using two different encoders, namely, a table encoder and a sequence encoder, to intensively exploit the target interactions, together with rich encodings combining word vectors, character vectors, and strong pretrained contextualized vectors.

Table 6 lists the performances of the proposed models and baseline models for each end-to-end relation extraction data set. Note that although Luan et al. (2019) also showed promising results on the ACE04 and ACE05 data sets, their model depends on auxiliary coreference supervision, so a direct comparison would not be fair. Nevertheless, we still achieve comparable performance. MtQA treats the task as a question answering problem with predefined question templates and uses BERT as a backbone. Compared with Gopt (Zhang, Zhang, and Fu 2017) without self-attention, the improvement shows the advantage of modeling token-level dependencies for information extraction. The results also verify our consistent improvement over Rule-distill (Hu et al. 2016). Compared with the SOTA model (Wang and Lu 2020), which adopted rich encodings combining word vectors (GloVe), character embeddings, and contextualized embeddings (ALBERT-large, which is an extensively pretrained large model), our model produces slightly lower performance. We conjecture that the high result of the SOTA model depends on its rich encodings, because when ALBERT-large is replaced with BERT in their model, the F1 score for entity extraction on ACE05 drops to 87.8, according to Wang and Lu (2020). In general, VDLN significantly outperforms the other baselines with p < 0.05, except SOTA, and except Rule-distill on entity extraction of ACE04.

Table 6
Results on 3 benchmark data sets for end-to-end relation extraction. Italic numbers on CoNLL04 and ACE04 indicate average results over 3 random splits and 5 random splits, respectively, and those on ACE05 are averaged over 3 different runs. * indicates the results are significant with p < 0.05.

Data set   Model                                Entity   Relation
CoNLL04    Gopt (Zhang, Zhang, and Fu 2017)     85.6     67.8
CoNLL04    MtQA                                 87.8     68.9
CoNLL04    Rule-distill (Hu et al. 2016)        88.2*    71.6*
CoNLL04    DLogic (Wang and Pan 2020)           87.1*    64.6*
CoNLL04    DLogic* (Wang and Pan 2020)          88.

End-to-End Event Extraction:
For this task, the state-of-the-art models to be compared are listed in the following.
• JEventEntity: A probabilistic model taking into consideration the intensive dependencies between event triggers and entity mentions, as well as relationships among events (Yang and Mitchell 2016). A joint inference procedure is then proposed to globally optimize all the predictions within a text input.
• dbRNN: A novel dependency-bridged recurrent neural network for event extraction (Sha et al. 2018), which fully utilizes both the sequential and syntactic structure of a sentence to enhance the extraction performance.
• GAIL: A deep learning model based on generative adversarial imitation learning (Zhang, Ji, and Sil 2019). The authors use reinforcement learning to model sequential predictions and aim to produce proper reward values estimated from discriminators in a GAN.
• Joint3EE: A joint deep learning model to simultaneously achieve end-to-end event extraction (Nguyen and Nguyen 2019) by decomposing the joint probability into a product of the probability of each target variable conditioned on the processed units.
• DYGIE++: A joint model for end-to-end event extraction based on contextualized span representations. The span representations encode local and global interactions with a dynamic graph update to propagate long-range information.
• SOTA (ONEIE): A joint neural model consisting of contextualized text representations and manually designed global features to capture the cross-task and cross-instance interactions (Lin et al. 2020).
The results on end-to-end event extraction are listed in Table 7. JEventEntity (Yang and Mitchell 2016) adopts joint inference with extensive manually designed linguistic features. Its performance is inferior to that of the deep learning models. On the other hand, DNNs alone lack explicit knowledge that is crucial for the task at hand. Hence, dbRNN (Sha et al. 2018) and Joint3EE (Nguyen and Nguyen 2019) incorporate linguistically informed features (e.g., dependency relations) to enhance the performance of DNNs. From the comparison results, VDLN achieves the best performance among all models except the last row. It does so without requiring any external linguistic resources: it only needs simple rule templates, generated automatically, that associate extracted entities and events with the argument relation predictions. SOTA (ONEIE) (Lin et al. 2020) produces the best result, which mainly originates from the manually designed global features that enforce cross-task and cross-instance relationships, for example, "A TRANSPORT event has only one DESTINATION argument."

Analysis
Our default experimental setting uses BERT as the neural component. To demonstrate the generality of our proposed VDLN architecture, we conduct extra experiments by replacing BERT with three other strong contextualized neural models, namely, SpanBERT, RoBERTa, and BERT-large, respectively, with fine-tuning, as well as a nonpretrained model using pure transformers. The results are listed in Table 8. We denote by "VDLN (*)" the proposed joint model with the neural component Q replaced by * = BERT-large, SpanBERT, RoBERTa, or transformer. All the experiments with pretrained models involve fine-tuning of the pretrained parameters. The transformer model follows the one in Wang and Pan (2020). From Table 8, we observe that large models (BERT-large) usually produce the best results, whereas VDLN (RoBERTa) performs best on CoNLL04. SpanBERT has inferior performance on average. The joint model VDLN outperforms its neural component alone across almost all experiments. This observation shows that the proposed methodology can benefit a wide variety of neural components.

Table 9
Performance for each separate module and its variations on the task of aspect/opinion extraction and end-to-end relation extraction.
To analyze the effect of each module within the proposed framework and the effect of the EM training procedure, we conduct experiments on each separate module as well as some variations within a module. The results are shown in Tables 9 and 10 for all three tasks. Specifically, Q and P record the performance of each individual module alone, without the EM training alternation. Because P requires the output from the deep learning predictions as its input for logic reasoning, we initialize P with the output from a pretrained module Q. Q* and P* record the performance using Q and P, respectively, for final predictions after jointly training both modules alternately via variational EM. For the separate models, P is slightly better than Q because it inherits the result from Q and further conducts logic reasoning based on the intensive interactions among the output variables. However, both of them are inferior to their counterparts trained with the variational learning paradigm, which indicates that the EM algorithm encourages mutual enhancement between the two modules. For final predictions, the results from P* and Q* are comparable most of the time. In the end, we use the ensemble model P + Q, which produces the best result on Laptop-14, ACE04, and ACE05.

We further verify the effect of each component within the framework via ablation studies. As shown in Tables 11 and 12, the first column indicates different model variations. DNN is the deep learning component we adopt, which corresponds to module Q in VDLN. DNN w/o BiGRU removes the BiGRU layer on top of the BERT model, and DNN+CRF further connects a linear-chain CRF with structured loss as the last layer. The performance with and without BiGRU is similar. However, BiGRU brings some performance gain when used within the joint model VDLN, compared with VDLN w/o BiGRU, which removes BiGRU from the joint model.
A CRF layer brings some performance gain over DNN alone for both aspect/opinion extraction and end-to-end relation extraction. However, it is not beneficial for event trigger and event argument prediction, most probably because most event triggers consist of a single word. Another two variations, namely VDLN (seg) and VDLN (rel), refer to the model with only segmentation-based rule templates and only relation-based rule templates, respectively, within module P. Specifically, segmentation-based rule templates only associate token-level interactions, for example, $q(y_{i-1}), q(y_{i+1}) \to p(y_i)$. On the other hand, relation-based rule templates associate relational triplets with token predictions, for example, $q(y_j), q(y^{r}_{ij}) \to p(y_i)$. The results for these two variations demonstrate the contribution of each kind of interaction to the proposed model. From Table 11, we can observe that VDLN (seg) is beneficial for entity predictions, whereas VDLN (rel) mostly helps relation extraction. The last row, VDLN+CRF, takes the DNN+CRF model as module Q in the joint model VDLN. It performs similarly to VDLN, which suggests that VDLN already captures the structured information that a CRF model provides. VDLN+CRF (rel) only adopts relation-based rule templates in module P. By comparing it with VDLN, we can verify that the segmentation rule used in module P is more beneficial than using a simple graphical model.

We also verify the effect of using sparsemax for the rule learning process. The sparsemax operator explicitly constrains the number of atoms to be selected to form the body of a rule. When it is replaced with softmax (VDLN (softmax)), the results show that sparsemax provides better results and is semantically more meaningful.

To investigate the advantage of the logic-inspired network within module P, we compare the proposed model with two popular and effective deep learning models for information propagation: graph neural networks (GNNs) (Dai, Dai, and Song 2016) and graph convolutional networks (GCNs) (Kipf and Welling 2017). The results are shown in Table 13. Specifically, we replace the logic network in P with a GNN (or GCN), which takes the context ctx(m) of each target node m as the neighboring nodes to update its own feature via non-linear transformations (spectral-based graph convolutions). In other words, the graph structure of the GNN (GCN) is provided by the rule templates used in the logic network, where two nodes are connected if they appear in the same rule. We denote this model by GNN+Q (GCN+Q). GCN+Q is more expressive than GNN+Q. Clearly, GCN+Q outperforms GNN+Q in general, but it is still inferior to our proposed model in all but one experiment.

Table 14 Predefined logic rules for each task and data set.
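To make the sparsemax ablation discussed above more concrete, the following minimal Python sketch contrasts softmax with the standard closed-form sparsemax projection onto the probability simplex. It is an illustration only and not the authors' implementation: the function names and the example atom scores are hypothetical, and the way the operator is wired into the rule layer of module P is not modeled here.

```python
import math

def softmax(scores):
    """Dense normalization: every atom receives a strictly positive weight."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sparsemax(scores):
    """Sparse normalization: Euclidean projection of the scores onto the simplex.

    Low-scoring atoms receive exactly zero weight, so only a few atoms
    remain in the (soft) rule body.
    """
    z_sorted = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, z_k in enumerate(z_sorted, start=1):
        cumsum += z_k
        # The indices satisfying this condition form a prefix 1..k_max;
        # the threshold tau is taken from the largest such k.
        if 1 + k * z_k > cumsum:
            tau = (cumsum - 1) / k
    return [max(s - tau, 0.0) for s in scores]

# Hypothetical selection scores for four candidate atoms.
atom_scores = [1.0, 0.8, 0.1, -0.5]
dense = softmax(atom_scores)
sparse = sparsemax(atom_scores)
print("softmax  :", [round(w, 3) for w in dense])
print("sparsemax:", [round(w, 3) for w in sparse])
```

With these scores, sparsemax keeps only the two highest-scoring atoms, whereas softmax spreads weight over all four, which is one way to read the claim that sparsemax yields rule bodies that are easier to interpret.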
As mentioned in Section 5.2, the logic network is able to learn relevant knowledge automatically, as well as encode prior knowledge if provided. In all the previous experiments, we do not feed any manually designed logic rules into the logic network, both for fair comparison and to demonstrate our model's generality. To investigate how given rules contribute to the actual task, we design some easily acquired logic rules for each task and incorporate them into the learning of LNet. The manually designed rules for each specific task and data set are listed in Table 14. For aspect and opinion terms extraction, we design rules involving dependency relations and POS tags, as adopted in Qiu et al. (2011) and Yu, Jiang, and Xia (2019). For example, the FOL rule "$\text{aspect}(x) \wedge \text{pos}_{noun}(x) \wedge \text{dep}_{amod}(y, x) \wedge \text{pos}_{adj}(y) \Rightarrow \text{opinion}(y)$" states that if x is an aspect word with POS tag noun, we can infer that y with POS tag adj is an opinion word when there is a dependency relation amod between x and y. The dependency structures and POS tags are generated using Stanford CoreNLP (Manning et al. 2014). For end-to-end relation extraction, we mainly adopt relational rules that capture the correlations between entity types and relation types. Lastly, for event extraction, we use event trigger types and entity types as preconditions to design the FOL rules in order to entail the target relations. The results are shown in the last column of Table 13 (i.e., "VDLN+rules"). Performance improves on the task of aspect and opinion terms extraction. For end-to-end relation extraction, the additional rules are more beneficial for relation extraction. However, we observe a degradation in the performance of event extraction when inserting the manually designed logic rules. This might be caused by inaccurate event trigger predictions as well as uncertain rules with sparse coverage in the given corpus.
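As an illustration of how the predefined aspect/opinion rule above can be read, the following Python sketch applies it as a hard Boolean rule to a single hand-annotated toy sentence. Everything in it is an assumption made for illustration: the function name, the toy sentence, the tag names, and the hand-written amod edge are hypothetical (no parser is called), and the actual model incorporates such rules softly into the logic network rather than firing them deterministically.

```python
def infer_opinions(pos_tags, amod_edges, aspects):
    """Apply: aspect(x) AND pos_noun(x) AND dep_amod(y, x) AND pos_adj(y) => opinion(y).

    pos_tags   : list of POS tags, one per token index
    amod_edges : list of (y, x) index pairs, in the same argument order as the rule
    aspects    : set of token indices already labeled as aspect terms
    Returns the set of token indices inferred to be opinion terms.
    """
    opinions = set()
    for y, x in amod_edges:
        if x in aspects and pos_tags[x] == "NOUN" and pos_tags[y] == "ADJ":
            opinions.add(y)
    return opinions

# Hypothetical, hand-annotated toy input (not CoreNLP output).
tokens = ["The", "bright", "screen", "works"]
pos_tags = ["DET", "ADJ", "NOUN", "VERB"]
amod_edges = [(1, 2)]   # "bright" modifies "screen"
aspects = {2}           # "screen" is a known aspect term

inferred = infer_opinions(pos_tags, amod_edges, aspects)
print([tokens[i] for i in sorted(inferred)])
```

If the aspect set is empty, the rule body is never satisfied and nothing is inferred, which mirrors the observation that such rules are only as reliable as the predictions they are conditioned on.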
We would like to emphasize that the learned logic rules differ from the predefined ones in that the atoms in their rule bodies are rather abstract and composite. More concretely, if we define the set of all generated synthetic atoms in the atom layer as $D_{ctx(m)} = \{d_1, \ldots, d_{80}\}$ (see footnote 3), we are able to list a few learned logic rules for relations on CoNLL04.
Here each atom $d_n$ is a linear combination of all different classes for a word or a relation. As shown in (14) and (15), each atom $d_n$ is either a 1-ary atom corresponding to a linear combination of the DNN predictions of an argument word, or a 2-ary atom corresponding to a bilinear property of two arguments. For predictions of a relation $r(e_1, e_2)$ in the logic network, the input is $\tilde{y}_{ctx(r(e_1, e_2))} = (q_{r(e_1, e_2)}, q_{e_1}, q_{e_2})$. Then $\{d_1, \ldots, d_{20}\}$ are 1-ary atoms corresponding to different linear combinations of $q_{r(e_1, e_2)}$, and $\{d_{21}, \ldots, d_{40}\}$ are 1-ary atoms corresponding to different linear combinations of $q_{e_1}$, which is the entity class distribution of the head entity $e_1$ from DNNs. Similarly, $\{d_{41}, \ldots, d_{60}\}$ are 1-ary atoms corresponding to linear combinations of $q_{e_2}$ for the tail entity $e_2$. The last 20 atoms $\{d_{61}, \ldots, d_{80}\}$ are 2-ary atoms corresponding to bilinear interactions of $q_{e_1}$ and $q_{e_2}$.

To interpret the example rule "$d_{11} \wedge d_2 \wedge d_{58} \wedge d_{31} \wedge d_7 \Rightarrow (r = \text{located in})$" for the CoNLL04 data set, suppose the pair of entities being queried for the relation is (the White House, U.S.). Then $d_{11}$, $d_2$, and $d_7$ represent linear transformations of $q_{r(\text{the White House}, \text{U.S.})}$, $d_{58}$ represents a linear transformation of $q_{\text{U.S.}}$, and $d_{31}$ represents a linear transformation of $q_{\text{the White House}}$. To be more specific, when the learned linear transformation weights favor a particular entity/relation class, a more concrete interpretation of the above rule could be

$\text{located in}(r(\text{the White House}, \text{U.S.})) \wedge \text{live in}(r(\text{the White House}, \text{U.S.})) \wedge \text{location}(\text{U.S.}) \wedge \text{organization}(\text{the White House}) \wedge \text{located in}(r(\text{the White House}, \text{U.S.})) \Rightarrow (r = \text{located in})$.
3 According to the experimental setting, we have 80 generated atoms.
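The layout of the 80 atoms described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions and not the authors' code: the weights are random placeholders rather than learned parameters, the class distributions are made up, any nonlinearity used in Equations (14) and (15) is omitted, and the names `build_atoms` and `atom_group` are hypothetical.

```python
import random

def linear_atom(w, q):
    """1-ary atom: a linear combination of one class distribution."""
    return sum(w_i * q_i for w_i, q_i in zip(w, q))

def bilinear_atom(W, q_head, q_tail):
    """2-ary atom: a bilinear interaction q_head^T W q_tail."""
    return sum(q_head[i] * W[i][j] * q_tail[j]
               for i in range(len(q_head)) for j in range(len(q_tail)))

def build_atoms(q_rel, q_head, q_tail, per_group=20, seed=0):
    """Return [d_1, ..., d_80]: 20 atoms each for q_rel, q_head, q_tail, then 20 bilinear."""
    rng = random.Random(seed)
    rand_vec = lambda n: [rng.uniform(-1.0, 1.0) for _ in range(n)]
    atoms = []
    for q in (q_rel, q_head, q_tail):
        atoms += [linear_atom(rand_vec(len(q)), q) for _ in range(per_group)]
    for _ in range(per_group):
        W = [rand_vec(len(q_tail)) for _ in range(len(q_head))]
        atoms.append(bilinear_atom(W, q_head, q_tail))
    return atoms

def atom_group(n, per_group=20):
    """Map a 1-based atom index n to the input it is computed from."""
    names = ["relation", "head entity", "tail entity", "head-tail bilinear"]
    return names[(n - 1) // per_group]

# Made-up class distributions for one queried entity pair.
q_rel = [0.7, 0.2, 0.1]     # over relation classes
q_head = [0.1, 0.8, 0.1]    # over entity classes (head)
q_tail = [0.9, 0.05, 0.05]  # over entity classes (tail)

atoms = build_atoms(q_rel, q_head, q_tail)
print(len(atoms), [atom_group(n) for n in (11, 2, 58, 31, 7)])
```

The index-to-group mapping reproduces the reading of the example rule: atoms 11, 2, and 7 depend on the relation distribution, atom 58 on the tail entity, and atom 31 on the head entity.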

Note that these learned rules are all generic rules learned for each specific data set, because the linear and bilinear transformations that compose those atoms are shared across all training instances and are learned throughout the training process.

Table 15
Example outputs by DNN alone and VDLN, respectively.

DNN: "The ambience is also more laid-back and relaxed."
VDLN: "The ambience is also more laid-back and relaxed."

DNN: "The folding chair I was seated at was uncomfortable."
VDLN: "The folding chair I was seated at was uncomfortable."

DNN: "It is robust, with a friendly use as all Apple products."
VDLN: "It is robust, with a friendly use as all Apple products."

DNN: ". . . on duty with the 6th Fleet in the Mediterranean,..."
     Entity: location, location; Relation: located in (1,2)
VDLN: ". . . on duty with the 6th Fleet in the Mediterranean,..."
     Entity: organization, location; Relation: org based in (1,2)

DNN: ". . . get them all home, said Ms. Say in Nashville, Tenn."
     Entity: people, location, location; Relation: located in(1,2)
VDLN: ". . . get them all home, said Ms. Say in Nashville, Tenn."
     Entity: people, location, location; Relation: located in (

For qualitative analysis, Table 15 lists a few examples showing that incorporating logic reasoning extracts target terms and relations more accurately than the pure neural network Q. Specifically, the words in bold indicate aspects or entities and the words in italics indicate opinions. For entity and relation extraction, the second row in each example gives the predicted entity types and relations. The numbers in the relation indicate the indices of its corresponding entities. For aspect and opinion terms extraction, VDLN is able to identify target aspects or opinions with certain syntactic relations that are missed by the pure DNN. For example, the opinion term laid-back can be extracted by associating it with the aspect term ambience and another opinion term relaxed. For entity and relation extraction, VDLN corrects wrong predictions from the DNN. For example, the output relation located in(6th Fleet, Mediterranean) is corrected to org based in(6th Fleet, Mediterranean) by VDLN.
To demonstrate the model's robustness, we conduct experiments with varying hyperparameters. We choose three parameters, namely, the sampling rate ρ during the EM updates, the number of atoms T in the rule body for each rule formed in the logic network, and the number of rules S in the rule set $\{R_1, \ldots, R_S\}$ that share the same head atom h. Specifically, we use different sampling rates ranging from ρ = 0.1 to ρ = 0.9 when updating both P and Q. Here ρ is the probability of using the predictions from P or Q when learning the parameters of Q or P, respectively, during the variational EM updates. With probability 1 − ρ, the ground-truth label is used to supervise each module. The results for aspect and opinion terms extraction are shown in Figure 4, and the results for the CoNLL04 and ACE05 data sets are shown in Figure 5. Both figures demonstrate the robustness of VDLN against different sampling rates. The performance drop for ρ = 0.9 is reasonable, as only 10% of the ground-truth labels are used for supervision during the EM training procedure.

Figures 6, 7, and 8 show the F1 scores for entity extraction and relation prediction on the ACE05, ACE04, and CoNLL04 data sets, respectively. The x-axis of the left subfigure indicates the number of atoms T in the body of each rule (i.e., $d_{j_1} \wedge \cdots \wedge d_{j_T} \Rightarrow h$). The x-axis of the right subfigure indicates the number of rules S for each head atom h. As indicated in the figures, the final performance of our proposed framework is not sensitive to these hyperparameters within the logic network. For ACE05 and ACE04, varying T from 1 to 10 results in more stable performance for entity extraction than for relation prediction. On the other hand, the model's performance is relatively less dependent on the number of rules S. When changing S from 10 to 50, the results stay within a small range. As for CoNLL04, the performance decreases when S is higher than 25. This might be because the interactions between entities and relations are simpler in the smaller CoNLL04 data set, which makes the model prone to overfitting.
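The role of the sampling rate ρ described above can be illustrated with a short Python sketch. It assumes, purely for illustration, that the choice between a module's prediction and the ground-truth label is made independently for each label; the text does not state the granularity of the sampling, and the function name and the toy labels are hypothetical.

```python
import random

def mix_supervision(gold, predicted, rho, rng):
    """Build training targets for one module.

    With probability rho a target comes from the other module's prediction;
    with probability 1 - rho it comes from the ground-truth label.
    """
    return [p if rng.random() < rho else g for g, p in zip(gold, predicted)]

rng = random.Random(0)
gold = ["B-ASP", "O", "B-OPN", "O", "O"]        # toy ground-truth tags
predicted = ["B-ASP", "O", "O", "B-OPN", "O"]   # toy output of the other module

for rho in (0.1, 0.5, 0.9):
    targets = mix_supervision(gold, predicted, rho, rng)
    from_gold = sum(t == g for t, g in zip(targets, gold))
    print(rho, targets, "labels agreeing with gold:", from_gold)
```

At ρ = 0 the targets reduce to the ground truth and at ρ = 1 they reduce to the other module's predictions, which matches the explanation that ρ = 0.9 leaves only about 10% of the supervision coming from gold labels.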

Error Analysis and Future Work
For error analysis, our model has limitations when the entities (triggers) are not correctly extracted. Specifically, if the entities (triggers) are missed in the entity (trigger) prediction phase in module Q, that is, when generating predictions from $q_i$ for each word, it becomes hard to rectify such predictions in the logic network and during the EM training procedure. The reason is that the relations and rules are all based on the candidate entities extracted by Q. Indeed, when some entities are not identified in Q, there are no bilinear interactions between the missed entities and other entities to be modeled in the logic network. Hence, it is difficult for VDLN to learn useful rules to correct its predictions.

Figure 8 Sensitivity study for the logic network on CoNLL04 data set.
In future work, we plan to address the above limitation by revising the extraction mechanism in the neural component, where entities are currently predicted first, followed by relation predictions. A table-filling mechanism might be a good choice (Miwa and Sasaki 2014). We also plan to design more interpretable networks in terms of logic reasoning so that the learned rules can be explicitly explained. In terms of application, our future work may include generalizing the proposed framework to more challenging cases, for example, cross-sentence correlations and cross-instance consistencies, so that it can be applied to other application domains such as document-level event extraction and cross-sentence relation extraction.

Conclusion
We propose a variational deep logic network that inherits both the representational power of deep learning and the reasoning capabilities of logic systems for joint inference in IE. These two paradigms communicate through the variational EM algorithm. For knowledge reasoning, we introduce a novel logic network that transforms logic semantics into a deep hierarchical architecture to facilitate logic inference automatically. Meanwhile, the logic network is more expressive than manually designed rules because it learns more effective atom combinations from the training data. It can also flexibly incorporate predefined logic rules to further enhance the final performance.