Modeling Content and Context with Deep Relational Learning

Building models for realistic natural language tasks requires dealing with long texts and accounting for complicated structural dependencies. Neural-symbolic representations have emerged as a way to combine the reasoning capabilities of symbolic methods with the expressiveness of neural networks. However, most existing frameworks for combining neural and symbolic representations have been designed for classic relational learning tasks that work over a universe of symbolic entities and relations. In this paper, we present DRaiL, an open-source declarative framework for specifying deep relational models, designed to support a variety of NLP scenarios. Our framework supports easy integration with expressive language encoders, and provides an interface to study the interactions between representation, inference and learning.


Introduction
Understanding natural language interactions in realistic settings requires models that can deal with noisy textual inputs, reason about the dependencies between different textual elements, and leverage the dependencies between textual content and the context from which it emerges. Work in linguistics and anthropology has defined context as a frame that surrounds a focal communicative event and provides resources for its interpretation (Gumperz, 1992;Duranti and Goodwin, 1992).
As a motivating example, consider the interactions in the debate network described in Figure 1. Given a debate claim (t1), and two consecutive posts debating it (p1, p2), we define a textual inference task, determining whether a pair of text elements hold the same stance in the debate (denoted using the relation Agree(X, Y)). This task is similar to other textual inference tasks (Bowman et al., 2015) that have been successfully approached using complex neural representations (Peters et al., 2018; Devlin et al., 2019). In addition, we can leverage the dependencies between these decisions. For example, assuming that one post agrees with the debate claim (Agree(t1, p2)), and the other one does not (¬Agree(t1, p1)), the disagreement between the two posts can be inferred: ¬Agree(t1, p1) ∧ Agree(t1, p2) → ¬Agree(p1, p2). Finally, we consider the social context of the text. The disagreement between the posts can reflect a difference in the perspectives their authors hold on the issue. This information might not be directly observed, but it can be inferred using the authors' social interactions and behavior, given the principle of social homophily (McPherson et al., 2001), stating that people with strong social ties are likely to hold similar views and authors' perspectives can be captured by representing their social interactions. Exploiting this information requires models that can align the social representation with the linguistic one.
Motivated by these challenges, we introduce DRAIL, a Deep Relational Learning framework, which uses a combined neuro-symbolic representation for modeling the interaction between multiple decisions in relational domains. Similar to other neuro-symbolic approaches (Cohen et al., 2020), our goal is to exploit the complementary strengths of the two modeling paradigms. Symbolic representations, used by logic-based systems and by probabilistic graphical models (Richardson and Domingos, 2006; Bach et al., 2017), are interpretable, and allow domain experts to directly inject knowledge and constrain the learning problem. Neural models capture dependencies using the network architecture and are better equipped to deal with noisy data, such as text. However, they are often difficult to interpret and constrain according to domain knowledge. Our main design goal in DRAIL is to provide a generalized tool, specifically designed for NLP tasks. Existing approaches designed for classic relational learning tasks (Cohen et al., 2020), such as knowledge graph completion, are not equipped to deal with complex linguistic input, whereas others are designed for very specific NLP settings such as word-based quantitative reasoning problems (Manhaeve et al., 2018) or aligning images with text. We discuss the differences between DRAIL and these approaches in Section 2. The examples in this paper focus on modeling various argumentation mining tasks and their social and political context, but the same principles can be applied to a wide array of NLP tasks with different contextualizing information, such as images that appear next to the text, or prosody when analyzing transcribed speech, to name a few examples.
DRAIL uses a declarative language for defining deep relational models. Similar to other declarative languages (Richardson and Domingos, 2006; Bach et al., 2017), it allows users to inject their knowledge by specifying dependencies between decisions using first-order logic rules, which are later compiled into a factor graph with neural potentials. In addition to probabilistic inference, DRAIL also models dependencies using a distributed knowledge representation, denoted RELNETS, which provides a shared representation space for entities and their relations, trained using a relational multi-task learning approach. This provides a mechanism for explaining symbols, and aligning representations from different modalities. Following our running example, ideological standpoints, such as Liberal or Conservative, are discrete entities embedded in the same space as textual entities and social entities. These entities are initially associated with users; however, using RELNETS, this information will propagate to texts reflecting these ideologies, by exploiting the relations that bridge social and linguistic information (see Figure 1).
To demonstrate DRAIL's modeling approach, we introduce the task of open-domain stance prediction with social context, which combines social network analysis and textual inference over complex opinionated texts, as shown in Figure 1. We complement our evaluation of DRAIL with two additional tasks, issue-specific stance prediction, where we identify the views expressed in debate forums with respect to a set of fixed issues (Walker et al., 2012), and argumentation mining (Stab and Gurevych, 2017), a document-level discourse analysis task.

Related Work
In this section, we survey several lines of work dealing with symbolic, neural, and hybrid representations for relational learning.

Languages for Graphical Models
Several high-level languages for specifying graphical models have been suggested. BLOG (Milch et al., 2005) and CHURCH (Goodman et al., 2008) were suggested for generative models. For discriminative models, we have Markov Logic Networks (MLNs) (Richardson and Domingos, 2006) and Probabilistic Soft Logic (PSL) (Bach et al., 2017). Both PSL and MLNs combine logic and probabilistic graphical models in a single representation, where each formula is associated with a weight, and the probability distribution over possible assignments is derived from the weights of the formulas that are satisfied by such assignments. Like DRAIL, PSL uses formulas in clausal form (specifically collections of Horn clauses). The main difference between DRAIL and these languages is that, in addition to graphical models, it uses distributed knowledge representations to represent dependencies. Other discriminative methods include FACTORIE (McCallum et al., 2009), an imperative language to define factor graphs; Constrained Conditional Models (CCMs) (Rizzolo and Roth, 2010; Kordjamshidi et al., 2015), an interface to enhance linear classifiers with declarative constraints; and ProPPR (Wang et al., 2013), a probabilistic logic for large databases that approximates local groundings using a variant of personalized PageRank.

Node Embedding and Graph Neural Nets
A recent alternative to graphical models is to use neural nets to represent and learn over relational data, represented as a graph. Similar to DRAIL's RELNETS, the learned node representation can be trained by several different prediction tasks. However, unlike DRAIL, these methods do not use probabilistic inference to ensure consistency.
Node embedding approaches (Perozzi et al., 2014; Tang et al., 2015; Pan et al., 2016; Grover and Leskovec, 2016; Tu et al., 2017) learn a feature representation for nodes capturing graph adjacency information, such that the similarity in the embedding space of any two nodes is proportional to their graph distance and overlap in neighboring nodes. Some frameworks (Pan et al., 2016; Xiao et al., 2017; Tu et al., 2017) allow nodes to have textual properties, which provide an initial feature representation when learning to represent the graph relations. When dealing with multi-relational data, such as knowledge graphs, both the nodes and the edge types are embedded (Bordes et al., 2013; Wang et al., 2014; Trouillon et al., 2016; Sun et al., 2019). Finally, these methods learn to represent nodes and relations based on pair-wise node relations, without representing the broader graph context in which they appear. Graph neural nets (Kipf and Welling, 2017; Hamilton et al., 2017; Veličković et al., 2017) create contextualized node representations by recursively aggregating neighboring nodes.

Hybrid Neural-Symbolic Approaches
Several recent systems explore ways to combine neural and symbolic representations in a unified way. We group them into five categories.
Lifted rules to specify compositional nets. These systems use an end-to-end approach and learn relational dependencies in a latent space. Lifted Relational Neural Networks (LRNNs) (Sourek et al., 2018) and RelNNs (Kazemi and Poole, 2018) are two examples. These systems map observed ground atoms, facts, and rules to specific neurons in a network and define composition functions directly over them. While they provide for a modular abstraction of the relational inputs, they assume all inputs are symbolic and do not leverage expressive encoders.
Differentiable inference. These systems identify classes of logical queries that can be compiled into differentiable functions in a neural network infrastructure. In this space we have Logic Tensor Networks (LTNs) (Donadello et al., 2017) and TensorLog (Cohen et al., 2020). Symbols are represented as row vectors in a parameter matrix. The focus is on implementing reasoning using a series of numeric functions.
Rule induction from data. These systems are designed for inducing rules from symbolic knowledge bases, which is not in the scope of our framework. In this space we find Neural Theorem Provers (NTPs) (Rocktäschel and Riedel, 2017), Neural Logic Programming (Yang et al., 2017), DRUM (Sadeghian et al., 2019) and Neural Logic Machines (NLMs) (Dong et al., 2019). NTPs use a declarative interface to specify rules that add inductive bias and perform soft proofs. The other approaches work directly over the database.

Deep classifiers and probabilistic inference. These systems propose ways to integrate probabilistic inference and neural networks for diverse learning scenarios. DeepProbLog (Manhaeve et al., 2018) extends the probabilistic logic programming language ProbLog to handle neural predicates. It learns probabilities for atomic expressions using neural networks. The input data consists of a combination of feature vectors for the neural predicates, together with other probabilistic facts and clauses in the logic program. Targets are only given at the output side of the probabilistic reasoner, so each example is learned with respect to a single query. On the other hand, Deep Probabilistic Logic (DPL) (Wang and Poon, 2018) combines neural networks with probabilistic logic for indirect supervision. It learns classifiers using neural networks and uses probabilistic logic to introduce distant supervision and labeling functions. Each rule is regarded as a latent variable, and the logic defines a joint probability distribution over all labeling decisions. Then, the rule weights and the network parameters are learned jointly using variational EM. In contrast, DRAIL focuses on learning multiple interdependent decisions from data, handling and requiring supervision for all unknown atoms in a given example. Lastly, Deep Logic Models (DLMs) (Marra et al., 2019) learn a set of parameters to encode atoms in a probabilistic logic program. Similarly to Donadello et al. (2017)
Deep structured models. More generally, deep structured prediction approaches have been successfully applied to various NLP tasks such as named entity recognition and dependency parsing (Chen and Manning, 2014; Weiss et al., 2015; Ma and Hovy, 2016; Lample et al., 2016; Kiperwasser and Goldberg, 2016; Malaviya et al., 2018). When the need arises to go beyond the sentence level, some works combine the output scores of independently trained classifiers using inference (Beltagy et al., 2014; Liu et al., 2016; Subramanian et al., 2017; Ning et al., 2018), whereas others implement joint learning for their specific domains (Niculae et al., 2017; Han et al., 2019). Our main differentiating factor is that we provide a general interface that leverages first-order logic clauses to specify factor graphs and express constraints.
To summarize these differences, we outline a feature matrix in Table 1. Given our focus on NLP tasks, we require a neural-symbolic system that (1) allows us to integrate state-of-the-art text encoders and NLP tools, (2) supports structured prediction across long texts, (3) lets us combine several modalities and their representations (e.g., social and textual information), and (4) results in an explainable model where domain constraints can be easily introduced.

The DRAIL Framework
DRAIL was designed for supporting complex NLP tasks. Problems can be broken down into domain-specific atomic components (which could be words, sentences, paragraphs or full documents, depending on the task), and dependencies between them, their properties and contextualizing information about them can be explicitly modeled.
In DRAIL, dependencies can be modeled over the predicted output variables (similar to other probabilistic graphical models), as well as over the neural representation of the atoms and their relationships in a shared embedding space. This section explains the framework in detail. We begin with a high-level overview of DRAIL and the process of moving from a declarative definition to a predictive model.
A DRAIL task is defined by specifying a finite set of entities and relations. Entities are either discrete symbols (e.g., POS tags, ideologies, specific issue stances), or attributed elements with complex internal information (e.g., documents, users). Decisions are defined using rule templates, formatted as Horn clauses: $t_{LH} \Rightarrow t_{RH}$, where $t_{LH}$ (body) is a conjunction of observed and predicted relations, and $t_{RH}$ (head) is the output relation to be learned. Consider the debate prediction task in Figure 1. It consists of several sub-tasks, involving textual inference (Agree(t1, t2)), social relations (VoteFor(u, v)) and their combination (Agree(u, t)). We illustrate how to specify the task as a DRAIL program in Figure 2 (left), by defining a subset of rule templates to predict these relations.
Each rule template is associated with a neural architecture and a feature function, mapping the initial observations to an input vector for each neural net. We use a shared relational embedding space, denoted RELNETS, to represent entities and relations over them. As described in Figure 2 ("RelNets Layer"), each entity and relation type is associated with an encoder, trained jointly across all prediction rules. This is a form of relational multi-task learning, as the same entities and relations are reused in multiple rules and their representation is updated accordingly. Each rule defines a neural net, learned over the relations defined in the body. These nets take a composition of the vectors generated by the relation encoders as input (Figure 2, "Rule Layer"). DRAIL is architecture-agnostic, and neural modules for entities, relations and rules can be specified using PyTorch (code snippets can be found in Appendix C). Our experiments show that we can use different architectures for representing text, users, as well as for embedding discrete entities.
The relations in the Horn clauses can correspond to hidden or observed information, and a specific input is defined by the instantiations (or groundings) of these elements. The collection of all rule groundings results in a factor graph representing our global decision, taking into account the consistency and dependencies between the rules. This way, the final assignments can be obtained by running an inference procedure. For example, the dependency between the views of users on the debate topic (Agree(u, t)) and the agreement between them (VoteFor(u, v)) is modeled as a factor graph in Figure 2 ("Structured Inference Layer").
We formalize the DRAIL language in Section 3.1. Then, in Sections 3.2, 3.3, and 4, we describe the neural components and learning procedures.

Modeling Language
We begin our description of DRAIL by defining the templating language, consisting of entities, relations, and rules, and explaining how these elements are instantiated given relevant data.
Entities are named symbolic or attributed elements. An example of a symbolic entity is a political ideology (e.g., Liberal or Conservative). An example of an attributed entity is a user with age, gender, and other profile information, or a document associated with textual content. In DRAIL, entities can appear either as constants, written as strings in double or single quotes (e.g., "user1"), or as variables, which are identifiers that are substituted with constants when grounded. Variables are written using unquoted upper case strings (e.g., X, X1). Both constants and variables are typed.
Relations are defined between entities and their properties, or other entities. Relations are defined using a unique identifier, a named predicate, and a list of typed arguments. Atoms consist of a predicate name and a sequence of entities, consistent with the type and arity of the relation's argument list. If the atom's arguments are all constants, it is referred to as a ground atom. For example, Agree("user1", "user2") is a ground atom representing whether "user1" and "user2" are in agreement. When atoms are not grounded (e.g., Agree(X, Y)) they serve as placeholders for all the possible groundings that can be obtained by replacing the variables with constants. Relations can either be closed (i.e., all of their atoms are observed) or open, when some of the atoms can be unobserved. In DRAIL, we use a question mark ? to denote unobserved relations. These relations are the units that we reason over.
To help make these concepts concrete, consider the following example analyzing stances in a debate, as introduced in Figure 1. First, we define the entities: User = {"u1", "u2"}, Claim = {"t1"}, Post = {"p1", "p2"}. Users are entities associated with demographic attributes and preferences. Claims are assertions over which users debate. Posts are textual arguments that users write to explain their position with respect to the claim. We create these associations by defining a set of relations, capturing authorship, Author(User, Post); votes between users, VoteFor(User, User)?; and the position that users and their posts take with respect to the debate claim, Agree(Claim, User)? and Agree(Claim, Post)?. The authorship relation is the only closed one; for example, the set of observed atoms may be O = {Author("u1", "p1")}.
Rules are functions that map literals (atoms or their negation) to other literals. Rules in DRAIL are defined using templates formatted as Horn clauses: $t_{LH} \Rightarrow t_{RH}$, where $t_{LH}$ (body) is a conjunction of literals, and $t_{RH}$ (head) is the output literal to be predicted, and can only be an instance of open relations. Horn clauses allow us to describe structural dependencies as a collection of "if-then" rules, which can be easily interpreted. For example, Agree(X, C) ∧ VoteFor(Y, X) ⇒ Agree(Y, C) expresses the dependency between votes and users holding similar stances on a specific claim. We note that rules can be rewritten in disjunctive form by converting the logical implication into a disjunction between the negation of the body and the head. For example, the rule above can be rewritten as ¬Agree(X, C) ∨ ¬VoteFor(Y, X) ∨ Agree(Y, C).
The DRAIL program consists of a set of rules, which can be weighted (i.e., soft constraints), or unweighted (i.e., hard constraints). Each weighted rule template defines a learning problem, used to score assignments to the head of the rule. Because the body may contain open atoms, each rule represents a factor function expressing dependencies between open atoms in the body and head. Unweighted rules, or constraints, shape the space of feasible assignments to open atoms, and represent background knowledge about the domain.
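The equivalence between a Horn clause and its disjunctive form can be checked exhaustively over truth assignments. The following is a minimal Python sketch of that check for the vote rule; the function names `implication` and `disjunctive_form` are illustrative only and are not part of DRAIL.

```python
from itertools import product

def implication(body, head):
    """Truth value of (b_1 AND ... AND b_n) => head."""
    return (not all(body)) or head

def disjunctive_form(body, head):
    """Truth value of (NOT b_1) OR ... OR (NOT b_n) OR head."""
    return any(not b for b in body) or head

# Enumerate every truth assignment for the three atoms of the rule
# Agree(X, C) AND VoteFor(Y, X) => Agree(Y, C).
equivalent = all(
    implication([agree_xc, votefor_yx], agree_yc)
    == disjunctive_form([agree_xc, votefor_yx], agree_yc)
    for agree_xc, votefor_yx, agree_yc in product([False, True], repeat=3)
)
print(equivalent)
```

The only assignment that falsifies either form is the one where both body atoms hold and the head does not.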
Given the set of grounded atoms O, rules can be grounded by substituting their variables with constants, such that the grounded atoms correspond to elements in O. This process results in a set of grounded rules, each corresponding to a potential function or to a constraint. Together they define a factor graph. Then, DRAIL finds the optimally scored assignments for open atoms by performing MAP inference. To formalize this process, we first make the observation that rule groundings can be written as linear inequalities, directly corresponding to their disjunctive form, as follows:
$$\sum_{i \in I_r^+} y_i + \sum_{i \in I_r^-} (1 - y_i) \geq 1 \tag{1}$$
where $I_r^+$ ($I_r^-$) corresponds to the set of open atoms appearing in the rule that are not negated (respectively, negated). Now, MAP inference can be defined as a linear program. Each rule grounding $r$, generated from template $t(r)$, with input features $x_r$ and open atoms $y_r$, defines the potential
$$\psi_r(x_r, y_r) = \min\Big\{ \sum_{i \in I_r^+} y_i + \sum_{i \in I_r^-} (1 - y_i),\ 1 \Big\} \tag{2}$$
added to the linear program with a weight $w_r$. Unweighted rule groundings are defined as
$$c(x_c, y_c) = 1 - \sum_{i \in I_c^+} y_i - \sum_{i \in I_c^-} (1 - y_i) \tag{3}$$
with $c(x_c, y_c) \leq 0$ added as a constraint to the linear program. This way, the MAP problem can be defined over the set of all potentials $\Psi$ and the set of all constraints $C$ as
$$\arg\max_{y} \sum_{\psi_r \in \Psi} w_r\, \psi_r(x_r, y_r) \quad \text{s.t.} \quad c(x_c, y_c) \leq 0,\ \forall c \in C$$
In addition to logical constraints, we also support arithmetic constraints that can be written in the form of linear combinations of atoms with an inequality or an equality. For example, we can enforce the mutual exclusivity of liberal and conservative ideologies for any user X by writing: Ideology(X, "con") + Ideology(X, "lib") = 1.
We borrow some additional syntax from PSL to make arithmetic rules easier to use. Bach et al. (2017) define a summation atom as an atom that takes terms and/or sum variables as arguments. A summation atom represents the summation of ground atoms that can be obtained by substituting individual variables and summing over all possible constants for sum variables. For example, we could rewrite the above ideology constraint as Ideology(X, +I) = 1, where Ideology(X, +I) represents the summation of all atoms with predicate Ideology that share variable X. DRAIL uses two solvers, Gurobi (Gurobi Optimization, 2015) and AD3 (Martins et al., 2015), for exact and approximate inference, respectively.
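To make the MAP formulation concrete, the following Python sketch solves a tiny instance by exhaustive enumeration instead of a linear-programming solver. It assumes that a ground clause is scored by its satisfaction, min(sum of non-negated atoms + sum of (1 - negated atoms), 1), and that hard clauses must be satisfied. The weights and the helper names `clause_value` and `map_inference` are hypothetical and chosen only for illustration; DRAIL itself delegates this step to Gurobi or AD3.

```python
from itertools import product

def clause_value(y, pos, neg):
    """Satisfaction of a ground clause over binary atoms: 1 if satisfied, else 0."""
    return min(sum(y[a] for a in pos) + sum(1 - y[a] for a in neg), 1)

def map_inference(atoms, weighted, hard):
    """Exhaustive MAP: maximize the weighted clause satisfaction subject to hard clauses."""
    best, best_score = None, float("-inf")
    for values in product([0, 1], repeat=len(atoms)):
        y = dict(zip(atoms, values))
        if any(clause_value(y, pos, neg) < 1 for pos, neg in hard):
            continue  # infeasible: a hard constraint is violated
        score = sum(w * clause_value(y, pos, neg) for w, pos, neg in weighted)
        if score > best_score:
            best, best_score = y, score
    return best, best_score

a1, a2, a12 = "Agree(t1,p1)", "Agree(t1,p2)", "Agree(p1,p2)"
weighted = [
    (2.0, [a2], []),    # hypothetical score favoring Agree(t1, p2)
    (1.5, [], [a1]),    # hypothetical score favoring NOT Agree(t1, p1)
    (0.5, [a12], []),   # hypothetical, weaker score favoring Agree(p1, p2)
]
# Hard rule NOT Agree(t1,p1) AND Agree(t1,p2) => NOT Agree(p1,p2),
# written in disjunctive form: Agree(t1,p1) OR NOT Agree(t1,p2) OR NOT Agree(p1,p2).
hard = [([a1], [a2, a12])]

assignment, score = map_inference([a1, a2, a12], weighted, hard)
print(assignment, score)
```

In this instance the locally preferred value for Agree(p1, p2) is overridden, because keeping it would violate the hard constraint given the two stronger decisions.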
To ground DRAIL programs in data, we create an in-memory database consisting of all relations expressed in the program. Observations associated with each relation are provided in column separated text files. DRAIL's compiler instantiates the program by automatically querying the database and grounding the formatted rules and constraints.
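The grounding step can be sketched as enumerating typed substitutions and keeping those whose closed atoms are observed. The Python sketch below is a simplification under stated assumptions: the rule Author(U, P) ∧ Agree(C, P) ⇒ Agree(C, U) and the second authorship fact are hypothetical additions to the running example, atoms are plain tuples, and `ground` is an illustrative helper, not DRAIL's compiler or database interface.

```python
from itertools import product

# Typed constants from the running debate example.
domains = {"User": ["u1", "u2"], "Claim": ["t1"], "Post": ["p1", "p2"]}
# Observed atoms of the closed relation Author; the second fact is assumed here.
observed = {("Author", "u1", "p1"), ("Author", "u2", "p2")}
closed_predicates = {"Author"}

def ground(var_types, body, head):
    """Return (body, head) groundings whose closed body atoms all appear in `observed`."""
    names = list(var_types)
    results = []
    for constants in product(*(domains[var_types[v]] for v in names)):
        sub = dict(zip(names, constants))

        def instantiate(atom):
            # Replace each variable in the atom with its constant.
            return (atom[0],) + tuple(sub[arg] for arg in atom[1:])

        g_body = [instantiate(a) for a in body]
        if all(a in observed for a in g_body if a[0] in closed_predicates):
            results.append((g_body, instantiate(head)))
    return results

rule_groundings = ground(
    {"U": "User", "P": "Post", "C": "Claim"},
    body=[("Author", "U", "P"), ("Agree", "C", "P")],
    head=("Agree", "C", "U"),
)
for g_body, g_head in rule_groundings:
    print(g_body, "=>", g_head)
```

Open atoms such as Agree(C, P) are not filtered, since their values are decided later by inference.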

Neural Components
Let $r$ be a rule grounding generated from template $t$, where $t$ is tied to a neural scoring function $\Phi_t$ and a set of parameters $\theta_t$ (Rule Layer in Figure 2). In the previous section, we defined the MAP problem for all potentials $\psi_r(x, y) \in \Psi$ in a DRAIL program, where each potential has a weight $w_r$. Consider the following scoring function:
$$w_r = \Phi_t(x_r, y_r; \theta_t) \tag{4}$$
Notice that all potentials generated by the same template share parameters. We define each scoring function $\Phi_t$ over the set of atoms on the left hand side of the rule template. Let $t = rel_0 \wedge rel_1 \wedge \ldots \wedge rel_{n-1} \Rightarrow rel_n$ be a rule template. Each atom $rel_i$ is composed of a relation type, its arguments and feature vectors for them, as shown in Figure 2, "Input Layer". Given that a DRAIL program is composed of many competing rules over the same problem, we want to be able to share information between the different decision functions. For this purpose, we introduce RELNETS.

RELNETS
A DRAIL program often uses the same entities and relations in multiple different rules. The symbolic aspect of DRAIL allows us to constrain the values of open relations, and force consistency across all their occurrences. The neural aspect, as defined in Eq. 4, associates a neural architecture with each rule template, which can be viewed as a way to embed the output relation.
We want to exploit the fact that there are repeating occurrences of entities and relations across different rules. Given that each rule defines a learning problem, sharing parameters allows us to shape the representations using complementary learning objectives. This form of relational multi-task learning is illustrated in Figure 2, "RelNets Layer".
We formalize this idea by introducing relation-specific and entity-specific encoders and their parameters $(\phi_{rel}; \theta_{rel})$ and $(\phi_{ent}; \theta_{ent})$, which are reused in all rules. As an example, let's write the formulation for the rules outlined in Figure 2, where each relation and entity encoder is defined over the set of relevant features.
Note that entity and relation encoders can be arbitrarily complex, depending on the application. For example, when dealing with text, we could use BiLSTMs or a BERT encoder.
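The parameter sharing behind RELNETS can be illustrated with a deliberately small Python sketch in which two rule scorers hold references to the same user encoder. Everything here is an assumption made for illustration: the toy linear encoders, the feature dimensions, and the class names `LinearEncoder` and `RuleScorer`. In DRAIL these components would be PyTorch modules such as BiLSTM or BERT encoders.

```python
import random

class LinearEncoder:
    """Toy entity encoder: a linear map from features to an embedding."""
    def __init__(self, in_dim, out_dim, seed):
        rng = random.Random(seed)
        self.out_dim = out_dim
        self.weights = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]

    def __call__(self, features):
        return [sum(w * f for w, f in zip(row, features)) for row in self.weights]

class RuleScorer:
    """Toy rule net: a linear score over the concatenated encoder outputs."""
    def __init__(self, encoders, seed):
        rng = random.Random(seed)
        self.encoders = encoders
        self.weights = [rng.uniform(-1, 1) for _ in range(sum(e.out_dim for e in encoders))]

    def __call__(self, inputs):
        hidden = [h for enc, x in zip(self.encoders, inputs) for h in enc(x)]
        return sum(w * h for w, h in zip(self.weights, hidden))

user_encoder = LinearEncoder(in_dim=3, out_dim=2, seed=0)
text_encoder = LinearEncoder(in_dim=4, out_dim=2, seed=1)

# Both rules reuse the same user encoder object, so its parameters are shared.
agree_rule = RuleScorer([user_encoder, text_encoder], seed=2)
votefor_rule = RuleScorer([user_encoder, user_encoder], seed=3)

user_u1, user_u2 = [1.0, 0.0, 0.5], [0.0, 1.0, 0.5]
post_p1 = [0.2, 0.1, 0.0, 0.7]
print(agree_rule([user_u1, post_p1]), votefor_rule([user_u1, user_u2]))
```

Because the encoder is shared by reference, any update made to it while training one rule changes the inputs seen by the other rule as well.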
Our goal when using RELNETS is to learn entity representations that capture properties unique to their types (e.g., users, issues), as well as relational patterns that contextualize entities, allowing them to generalize better. We make the distinction between raw (or attributed) entities and symbolic entities. Raw entities are associated with rich, yet unstructured, information and attributes, such as text or user profiles. On the other hand, symbolic entities are well-defined concepts, and are not associated with additional information, such as political ideologies (e.g., liberal) and issues (e.g., gun-control). With this consideration, we identify two types of representation learning objectives:
Embed Symbol / Explain Data: Aligns the embedding of symbolic entities and raw entities, grounding the symbol in the raw data, and using the symbol embedding to explain properties of previously unseen raw-entity instances. For example, aligning ideologies and text to (1) obtain an ideology embedding that is closest to the statements made by people with that ideology, or (2) interpret text by providing a symbolic label for it.
Translate / Correlate: Aligns the representation of pairs of symbolic or raw entities. For example, aligning user representations with text, to move between social and textual information, as shown in Figure 1, "Social-Linguistic Relations". Or capturing the correlation between symbolic judgements like agreement and matching ideologies.

Learning
The scoring function used for comparing output assignments can be learned locally for each rule separately, or globally, by considering the dependencies between rules.
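Global Learning: The global approach uses inference to ensure that the parameters for all weighted rule templates are consistent across all decisions. Let $\Psi$ be a factor graph with potentials $\{\psi_r\} \in \Psi$ over all possible structures $Y$. Let $\theta = \{\theta_t\}$ be a set of parameter vectors, and $\Phi_t(x_r, y_r; \theta_t)$ be the scoring function defined for potential $\psi_r(x_r, y_r)$. Here $\hat{y} \in Y$ corresponds to the current prediction resulting from the MAP inference procedure and $y \in Y$ corresponds to the gold structure. We support two ways to learn $\theta$:
(1) The structured hinge loss
$$\max\Big(0,\ \max_{\hat{y} \in Y}\Big(\Delta(\hat{y}, y) + \sum_{\psi_r \in \Psi} \Phi_t(x_r, \hat{y}_r; \theta_t)\Big) - \sum_{\psi_r \in \Psi} \Phi_t(x_r, y_r; \theta_t)\Big) \tag{5}$$
where $\Delta(\hat{y}, y)$ denotes the cost of predicting $\hat{y}$ when the gold structure is $y$.
(2) The general CRF loss
$$-\log p(y|x) = -\log\Big(\frac{1}{Z(x)} \prod_{\psi_r \in \Psi} \exp\{\Phi_t(x_r, y_r; \theta_t)\}\Big) = -\sum_{\psi_r \in \Psi} \Phi_t(x_r, y_r; \theta_t) + \log Z(x) \tag{6}$$
where $Z(x)$ is a global normalization term computed over the set of all valid structures $Y$.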
When inference is intractable, approximate inference (e.g., AD3) can be used to obtain $\hat{y}$. To approximate the global normalization term $Z(x)$ in the general CRF case, we follow Zhou et al. (2015) and Andor et al. (2016) and keep a pool $\beta_k$ of $k$ high-quality feasible solutions during inference. This way, we can sum over the solutions in the pool to approximate the partition function:
$$\sum_{y' \in \beta_k} \prod_{\psi_r \in \Psi} \exp\{\Phi_t(x_r, y'_r; \theta_t)\}$$
In this paper, we use the structured hinge loss for most experiments, and include a discussion on the approximated CRF loss in Section 5.7.
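The two global objectives can be sketched numerically on a toy example. The Python sketch below assumes that the score of a structure is the sum of its factor scores, that the hinge loss is evaluated at a single predicted structure with a given cost (for instance a Hamming distance), and that the pool used to approximate the partition function contains the gold structure. The factor scores are made up, and the function names are illustrative only.

```python
import math

def structure_score(factor_scores):
    """Score of a structure: the sum of the scores of its factors."""
    return sum(factor_scores)

def structured_hinge(gold_scores, pred_scores, cost):
    """max(0, cost + score(prediction) - score(gold))."""
    return max(0.0, cost + structure_score(pred_scores) - structure_score(gold_scores))

def pooled_crf_loss(gold_scores, pool):
    """-score(gold) + log Z, with Z summed over a pool of candidate structures."""
    totals = [structure_score(s) for s in pool]
    m = max(totals)  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(t - m) for t in totals))
    return -structure_score(gold_scores) + log_z

gold = [2.0, 1.0]   # hypothetical factor scores of the gold structure
pred = [2.5, 1.5]   # hypothetical factor scores of the MAP prediction

print(structured_hinge(gold, pred, cost=1.0))
print(pooled_crf_loss(gold, pool=[gold, pred]))
```

Summing exponentiated structure scores over the pool is the same as summing the per-structure products of exponentiated factor scores, since the exponential of a sum is the product of the exponentials.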

Joint Inference
The parameters for each weighted rule template are optimized independently. Following Andor et al. (2016), we show that joint inference serves as a way to greedily approximate the CRF loss, where we replace the normalization term in Eq. (6) with a greedy approximation over local normalization as:
$$-\log \prod_{\psi_r \in \Psi} \frac{1}{Z_L(x_r)} \exp\{\Phi_t(x_r, y_r; \theta_t)\} = -\sum_{\psi_r \in \Psi} \Phi_t(x_r, y_r; \theta_t) + \sum_{\psi_r \in \Psi} \log Z_L(x_r) \tag{7}$$
where $Z_L(x_r)$ is computed over all the valid assignments $y'_r$ for each factor $\psi_r$. We refer to models that use this approach as JOINTINF.
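Local normalization can be sketched as a sum of per-factor negative log-likelihoods, where each factor is normalized only over its own assignments. The Python sketch below assumes each factor is given as its gold assignment plus a dictionary of scores for every valid assignment; the labels and scores are hypothetical, and `local_nll` is an illustrative name.

```python
import math

def local_nll(factors):
    """Sum over factors of -score(gold assignment) + log Z_L for that factor."""
    loss = 0.0
    for gold, scores in factors:
        log_z_local = math.log(sum(math.exp(s) for s in scores.values()))
        loss += -scores[gold] + log_z_local
    return loss

factors = [
    ("agree", {"agree": 2.0, "disagree": 0.0}),   # hypothetical scores for one factor
    ("disagree", {"agree": 0.5, "disagree": 0.5}),
]
print(local_nll(factors))
```

Each term depends on a single factor, which is why the parameters of different rule templates can be optimized independently under this objective.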

Experimental Evaluation
We compare DRAIL to representative models from each category covered in Section 2. Our goal is to examine how different types of approaches capture dependencies and what their limitations are when dealing with language interactions. These baselines are described in Section 5.1. We also evaluate different strategies using DRAIL in Section 5.2. We focus on three tasks: open debate stance prediction (Sec. 5.3), issue-specific stance prediction (Sec. 5.4) and argumentation mining (Sec. 5.5). Details regarding the hyper-parameters used for all tasks can be found in Appendix B.

Baselines
End-to-end Neural Nets: We test all approaches against neural nets trained locally on each task, without explicitly modeling dependencies. In this space, we consider two variants: INDNETS, where each component of the problem is represented using an independent neural network, and E2E, where the features for the different components are concatenated at the input and fed to a single neural network.
Relational Embedding Methods: Introduced in Section 2.2, these methods embed nodes and edge types for relational data. They are typically designed to represent symbolic entities and relations. However, because our entities can be defined by raw textual content and other features, we define the relational objectives over our encoders. This adaptation has proven successful for domains dealing with rich textual information (Lee and Goldwasser, 2019). We test three relational knowledge objectives: TransE (Bordes et al., 2013), ComplEx (Trouillon et al., 2016), and RotatE (Sun et al., 2019). Limitations: (1) These approaches cannot constrain the space using domain knowledge, and (2) they cannot deal with relations involving more than two entities, limiting their applicability to higher order factors.
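As a reminder of what a relational objective of this kind scores, the following Python sketch implements the TransE plausibility score, the negative L2 distance between head + relation and tail. The vectors are hypothetical placeholders; in the adaptation described above, the head and tail vectors would come from the entity encoders instead of a lookup table. ComplEx and RotatE use different scoring functions and are not shown.

```python
import math

def transe_score(head, relation, tail):
    """TransE plausibility: negative L2 distance between head + relation and tail."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))

user_vec = [1.0, 0.0]       # hypothetical encoder output for a user
agree_vec = [0.0, 1.0]      # hypothetical embedding of the relation
claim_vec = [1.0, 1.0]      # hypothetical encoder output for a claim
other_claim_vec = [-1.0, 0.0]

print(transe_score(user_vec, agree_vec, claim_vec))
print(transe_score(user_vec, agree_vec, other_claim_vec))
```

A score closer to zero indicates a more plausible triple. The function takes exactly one head and one tail, which reflects the second limitation noted above for relations with more than two entities.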
Probabilistic Logics: We compare to PSL (Bach et al., 2017), a purely symbolic probabilistic logic, and TensorLog (Cohen et al., 2020), a neuro-symbolic one. In both cases, we instantiate the program using the weights learned with our base encoders. Limitations: These approaches do not provide a way to update the parameters of the base classifiers.

Modeling Strategies
Local vs. Global Learning: The trade-off between local and global learning has been explored for graphical models (MEMM vs. CRF), and for deep structured prediction (Chen and Manning, 2014; Andor et al., 2016; Han et al., 2019). Although local learning is faster, the learned scoring functions might not be consistent with the correct global prediction. Following Han et al. (2019), we initialize the parameters using local models.
RELNETS: We will show the advantage of having relational representations that are shared across different decisions, in contrast to having independent parameters for each rule. Note that in all cases, we will use the global learning objective to train RELNETS.
Modularity: Decomposing decisions into relevant modules has been shown to simplify the learning process and lead to better generalization (Zhang and Goldwasser, 2019). We will contrast the performance of modular and end-to-end models to represent text and user information when predicting stances.
Representation Learning and Interpretability: We will do a qualitative analysis to show how we are able to embed symbols and explain data by moving between symbolic and sub-symbolic representations, as outlined in Section 3.3.
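The local versus global distinction above can be illustrated with a small sketch. It contrasts a sum of independent per-decision losses with a structured hinge loss computed over complete, constraint-satisfying assignments. The structured hinge form, the binary label space, the brute-force enumeration, and all function names are assumptions made for this example; they are not taken from the DRAIL implementation.

```python
import math
from itertools import product

def local_loss(scores, gold):
    """Sum of independent logistic losses, one per binary decision.
    scores[i] is the score of label 1; label 0 implicitly has score 0."""
    loss = 0.0
    for s, y in zip(scores, gold):
        p1 = 1.0 / (1.0 + math.exp(-s))
        loss -= math.log(p1 if y == 1 else 1.0 - p1)
    return loss

def structure_score(scores, labels):
    """Score of a full assignment: sum of the scores of the active decisions."""
    return sum(s for s, y in zip(scores, labels) if y == 1)

def global_hinge_loss(scores, gold, is_valid):
    """Structured hinge: compare the gold structure against the best valid
    structure under a Hamming-augmented score (brute-force enumeration)."""
    best = max(
        structure_score(scores, y) + sum(a != b for a, b in zip(y, gold))
        for y in product([0, 1], repeat=len(scores))
        if is_valid(y)
    )
    return max(0.0, best - structure_score(scores, gold))
```

The local loss never looks at `is_valid`, so its scoring function can be well calibrated per decision and still rank an invalid or incorrect joint assignment highest. The global loss only considers assignments that pass the constraints.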

Open Domain Stance Prediction
Traditionally, stance prediction tasks have focused on predicting stances on a specific topic, such as abortion. Predicting stances for a different topic, such as gun control, would require learning a new model. In this task, we would like to leverage the fact that stances in different domains are correlated. Instead of using a pre-defined set of debate topics (i.e., symbolic entities) we define the prediction task over claims, expressed in text, specific to each debate. Concretely, each debate will have a different claim (i.e., different value for C in the relation Claim(T, C), where T corresponds to a debate thread). We refer to these settings as Open-Domain and write down the task in Figure 3. In addition to the textual stance prediction problem (r0), where P corresponds to a post, we represent users (U) and define a user-level stance prediction problem (r1). We assume that additional users read the posts and vote for content that supports their views, resulting in another prediction problem (r2, r3). Then, we define representation learning tasks, which align symbolic (ideology, defined as I) and raw (users and text) entities (r4-r7). Finally, we write down all dependencies and constrain the final prediction (c0-c7).
Dataset: We collected a set of 7,555 debates from debate.org, containing a total of 42,245 posts across 10 broader political issues. For a given issue, the debate topics are nuanced and vary according to the debate question expressed in text (e.g., Should semi-automatic guns be banned, Conceal handgun laws reduce violent crime).
Debates have at least two posts, containing up to 25 sentences each. In addition to debates and posts, we collected the user profiles of all users participating in the debates, as well as all users  User data is considerably sparse. We create two evaluation scenarios, random and hard. In the random split, debates are randomly divided into ten folds of equal size. In the hard split, debates are separated by political issue. This results in a harder prediction problem, as the test data will not share topically related debates with the training data. We perform 10-fold cross validation and report accuracy.
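The two evaluation scenarios can be sketched as follows. This is a minimal illustration, assuming that each debate is a dictionary with a hypothetical "issue" key; the function names and the data format are not part of the released code.

```python
import random
from collections import defaultdict

def random_folds(debates, k=10, seed=0):
    """Random split: shuffle the debates and deal them into k folds of near-equal size."""
    shuffled = list(debates)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def hard_folds(debates):
    """Hard split: one fold per political issue, so that the held-out fold
    shares no topically related debates with the training folds."""
    by_issue = defaultdict(list)
    for debate in debates:
        by_issue[debate["issue"]].append(debate)
    return [by_issue[issue] for issue in sorted(by_issue)]
```

With 10 broader political issues, the hard split also yields ten folds, although their sizes are no longer equal.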

Entity and Relation Encoders:
We represent posts and titles using a pre-trained BERT-small 2 encoder (Turc et al., 2019), a compact version of the language model proposed by Devlin et al. (2019). For users, we use feed-forward computations with ReLU activations over the profile features and a pre-trained node embedding (Grover and Leskovec, 2016) over the friendship graph. All relation and rule encoders are represented as feed-forward networks with one hidden layer, ReLU activations and a softmax on top. Note that all of these modules are updated during learning.
Table 2 (Left) shows results for all the models described in Section 5.1. In E2E models, post and user information is collapsed into a single module (rule), whereas in INDNETS, JOINTINF, GLOBAL and RELNETS they are modeled separately. All other baselines use the same underlying modular encoders. We can appreciate the advantage of relational embeddings in contrast to INDNETS for user and voter stances, particularly in the case of ComplEx and RotatE. We can attribute this to the fact that all objectives are trained jointly and entity encoders are shared. However, approaches that explicitly model inference, like PSL, TensorLog, and DRAIL, outperform relational embeddings and end-to-end neural networks. This is because they enforce domain constraints.
2 We found negligible difference in performance between BERT and BERT-small for this task, while obtaining a considerable boost in speed.
We explain the difference between the performance of DRAIL and the other probabilistic logics by: (1) The fact that we use exact inference instead of approximate inference, (2) PSL learns to weight the rules without giving priority to a particular task, whereas the JOINTINF model works directly over the local outputs, and most importantly, (3) our GLOBAL and RELNETS models backpropagate to the base classifiers and fine-tune parameters using a structured objective.
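To illustrate what exact inference under domain constraints means in this setting, the sketch below enumerates all assignments to a handful of binary Agree variables and returns the highest-scoring one that satisfies every constraint. It uses the rule from the introduction (two posts that take different sides on the claim cannot agree with each other). The brute-force search, the variable names, and the score format are assumptions made for the example; it stands in for an ILP solver only on very small graphs.

```python
from itertools import product

def map_inference(scores, constraints):
    """Return the best-scoring assignment that satisfies all constraints.
    scores maps a variable name to (score_if_false, score_if_true)."""
    names = sorted(scores)
    best, best_score = None, float("-inf")
    for values in product([False, True], repeat=len(names)):
        y = dict(zip(names, values))
        if not all(constraint(y) for constraint in constraints):
            continue  # skip assignments that violate a domain constraint
        total = sum(scores[n][int(y[n])] for n in names)
        if total > best_score:
            best, best_score = y, total
    return best

def posts_disagree_if_sides_differ(y):
    """If the posts take different sides on the claim, they cannot agree."""
    sides_differ = y["agree_t_p1"] != y["agree_t_p2"]
    return not (sides_differ and y["agree_p1_p2"])
```

When the local scores weakly favor Agree(p1, p2) but strongly indicate that the two posts take opposite sides on the claim, the constrained search flips the pairwise decision instead of returning an inconsistent assignment.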
In Table 2 (Right) we show different versions of the DRAIL program, by adding or removing certain constraints. AC models only enforce author consistency, AC-DC models enforce both author consistency and disagreement between respondents, and finally, AC-DC-SC models introduce social information by considering voting behavior. We get better performance when we model more contextualizing information for the RELNETS case. This is particularly helpful in the Hard case, where contextualizing information, combined with shared representations, help the model generalize to previously unobserved topics. With respect to the modeling strategies listed in Section 5.2, we can observe: (1) The advantage of using a global learning objective, (2) the advantage of using RELNETS to share information and (3) the advantage of breaking down the decision into modules, instead of learning an end-to-end model.
Then, we perform a qualitative evaluation to illustrate our ability to move between symbolic and raw information. Table 3 (Top) takes a set of statements and explains them by looking at the symbols associated with them and their score. For learning to map debate statements to ideological symbols, we rely on the partial supervision provided by the users that self-identify with a political ideology and disclose it on their public profiles. Note that we do not incorporate any explicit expertise in political science to learn to represent ideological information. We chose statements with the highest score for each of the ideologies. We can see that, in the context of guns, statements that have to do with some form of gun control have higher scores for the center-to-left spectrum of ideological symbols (moderate, liberal, progressive), whereas statements that mention gun rights and the ineffectiveness of gun control policies have higher scores for conservative and libertarian symbols.
Table 3: Representation Learning Objectives: Explain Data (Top) and Embed Symbol (Bottom). Note that ideology labels were learned from user profiles, and do not necessarily represent the official stances of political parties.
Issue: LGBT
Libl: gay marriage ought be legalized; gay marriage should be legalized; same-sex marriage should be federally legal
Con: Leviticus 18:22 and 20:13 prove the anti gay marriage position; gay marriage is not bad; homosexuality is not a sin nor taboo
To complement this evaluation, in Table 3 (Bottom), we embed ideologies and find three example statements that are close in the embedding space. In the context of LGBT issues, we find that statements closest to the liberal symbol are those that support the legalization of same-sex marriage, and frame it as a constitutional issue. On the other hand, the statements closest to the conservative symbol frame homosexuality and same-sex marriage as a moral or religious issue, and we find statements both supporting and opposing same-sex marriage. This experiment shows that our model is easy to interpret, and provides an explanation for the decision made.
Finally, we evaluate our learned model over entities that have not been observed during training. To do this, we extract statements made by three prominent politicians from ontheissues.org. Then, we try to explain the politicians by looking at their predicted ideology. Results for this evaluation can be seen in Figure 4. The left part of Figure 4 shows the proportion of statements that were identified for each ideology: left (liberal or progressive), moderate and right (conservative). We find that we are able to recover the relative positions in the political spectrum for the evaluated politicians: Bernie Sanders, Joe Biden, and Donald Trump. We find that Sanders is the most left leaning, followed by Biden. In contrast, Donald Trump stands mostly on the right. We also include some examples of the classified statements. We show that we are able to identify cases in which the statement does not necessarily align with the known ideology for each politician.

Issue-Specific Stance Prediction
Given a debate thread on a specific issue (e.g., abortion), the task is to predict the stance with respect to the issue for each one of the debate posts (Walker et al., 2012). Each thread forms a tree structure, where users participate and respond to each other's posts. We treat the task as a collective classification problem, and model the agreement between posts and their replies, as well as the consistency between posts written by the same author. The DRAIL program for this task can be observed in Appendix A.
Dataset: We use the 4Forums dataset from the Internet Argument Corpus (Walker et al., 2012), consisting of a total of 1,230 debates and 24,658 posts on abortion, evolution, gay marriage, and gun control. We use the same splits as Li et al. (2018) and perform 5-fold cross validation.
Entity and Relation Encoders: We represent posts using pre-trained BERT encoders (Devlin et al., 2019) and do not generate features for authors. As in the previous task, we model all relations and rules using feed-forward networks with one hidden layer and ReLU activations. Note that we fine-tune all parameters during training.
In Table 4 we can observe the general results for this task. We report macro F1 for post stance and agreement between posts for all issues. As in the previous task, we find that ComplEx and RotatE relational embeddings outperform INDNETS, and probabilistic logics outperform methods that do not perform constrained inference. PSL outperforms JOINTINF for evolution and gun control debates, which are the two issues with less training data, whereas JOINTINF outperforms PSL for debates on abortion and gay marriage. This could indicate that re-weighting rules may be advantageous for the cases with less supervision. Finally, we see the advantage of using a global learning objective and augmenting it with shared representations. Table 5 compares our model with previously published results.
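For reference, macro F1 averages the per-class F1 scores with equal weight, so that a minority stance counts as much as the majority one. The sketch below is a generic plain-Python implementation under that standard definition; the function name and the label strings in the usage note are illustrative and do not come from our evaluation scripts.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    f1_scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

For example, with gold labels (pro, pro, con, con) and predictions (pro, con, con, con), the per-class scores are 2/3 for pro and 4/5 for con, giving a macro F1 of 11/15.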

Argument Mining
The goal of this task is to identify argumentative structures in essays. Each argumentative structure corresponds to a tree in a document. Nodes are predefined spans of text and can be labeled either as claims, major claims, or premises, and edges correspond to support/attack relations between nodes. Domain knowledge is injected by constraining sources to be premises and targets to be either premises or major claims, as well as enforcing tree structures. We model nodes, links, and second order relations, grandparent (a → b → c), and co-parent (a → b ← c) (Niculae et al., 2017). Additionally, we consider link labels, denoted stances. The DRAIL program for this task can be observed in Appendix A.
Dataset: We used the UKP dataset (Stab and Gurevych, 2017), consisting of 402 documents, with a total of 6,100 propositions and 3,800 links (17% of pairs). We use the splits used by Niculae et al. (2017), and report macro F1 for components and positive F1 for relations.
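The structural constraints described for this task can be sketched as a validity check over a candidate set of links. The sketch assumes that each node has at most one outgoing link and that a valid structure contains no cycles, which is one way to read the tree requirement. The function name, the type strings, and the `allowed_targets` parameter (whose default follows the description above) are assumptions for illustration.

```python
def is_valid_structure(types, links, allowed_targets=("premise", "major_claim")):
    """types maps node -> 'premise' | 'claim' | 'major_claim';
    links is a list of (source, target) pairs."""
    parent = {}
    for src, tgt in links:
        # Type constraints: sources must be premises, targets are restricted.
        if types[src] != "premise" or types[tgt] not in allowed_targets:
            return False
        # Tree constraint (part 1): at most one outgoing link per node.
        if src in parent:
            return False
        parent[src] = tgt
    # Tree constraint (part 2): following links must never revisit a node.
    for start in parent:
        seen = {start}
        node = start
        while node in parent:
            node = parent[node]
            if node in seen:
                return False
            seen.add(node)
    return True
```

In an ILP formulation, these conditions would be written as linear constraints over binary link and node-type variables rather than checked after the fact.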
Entity and Relation Encoders: To represent the component and the essay, we use a BiLSTM over the words, initialized with GloVe embeddings (Pennington et al., 2014), concatenated with a feature vector following Niculae et al. (2017). For representing the relation, we use a feed-forward computation over the components, as well as the relation features used in Niculae et al. (2017).
We can observe the general results for this task in   trying to predict links correctly. For this task, we did not apply TensorLog, given that we couldn't find a way to express tree constraints using their syntax. Once again, we see the advantage of using global learning, as well as sharing information between rules using RELNETS. Table 7 shows the performance of our model against previously published results. While we are able to outperform models that use the same underlying encoders and features, recent work by Kuribayashi et al. (2019) further improved performance by exploiting contextualized word embeddings that look at the whole document, and making a distinction between argumentative markers and argumentative components. We did not find a significant improvement by incorporating their ELMo-LSTM encoders into our framework, 3 nor by replacing our BiLSTM encoders with BERT. We leave the exploration of an effective way to leverage contextualized embeddings for this task for future work.

Run-time Analysis
In this section, we perform a run-time analysis of all probabilistic logic systems tested. All experiments were run on a 12-core 3.2 GHz Intel i7 CPU machine with 63 GB RAM and an NVIDIA GeForce GTX 1080 Ti 11 GB GDDR5X GPU. Figure 5 shows the overall training time (per fold) in seconds for each of the evaluated tasks. Note that the figure is presented in logarithmic scale. We find that DRAIL is generally more computationally expensive than both TensorLog and PSL. This is expected given that DRAIL backpropagates to the base classifiers at each epoch, while the other frameworks just take the local predictions as priors. However, when using a large number of arithmetic constraints (e.g., Argument Mining), we find that PSL takes a very long time to train. We found no significant difference when using ILP or AD3. We presume that this is due to the fact that our graphs are small and that Gurobi is a highly optimized commercial software.
3 We did not experiment with their normalization approach, extended BoW features, nor AC/AM distinction.
Finally, we find that when using encoders with a large number of parameters (e.g., BERT) in tasks with small graphs, the difference in training time between training local and global models is minimal. In these cases, back-propagation is considerably more expensive than inference, and global models converge in fewer epochs. For Argument Mining, local models are at least twice as fast. BiLSTMs are considerably faster than BERT, and inference is more expensive for this task.

Analysis of Loss Functions
In this section we perform an evaluation of the CRF loss for issue-specific stance prediction. Note that one drawback of the CRF loss (Eq. 6) is that we need to accumulate the gradient for the approximated partition function. When using entity encoders with many parameters (e.g., BERT), the amount of memory needed for a single instance increases. We were unable to fit the full models in our GPU. For the purpose of these tests, we froze the BERT parameters after local training and updated only the relation and rule parameters.
To obtain the solution pool, we use Gurobi's pool search mode to find β high-quality solutions. This also increases the cost of search at inference time. Development set results for the debates on abortion can be observed in Table 8. While increasing the size of the solution pool leads to better performance, it comes at a higher computational cost.
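The role of the solution pool can be illustrated with a generic form of the approximated CRF loss: the log partition function is computed over the scores of the β pooled structures instead of over all valid structures. The sketch below assumes that the pool contains the gold structure and that structure scores are plain floats. It is not a transcription of Eq. 6, and the function names are illustrative.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a non-empty list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def approx_crf_loss(gold_score, pool_scores):
    """Negative log-likelihood of the gold structure, with the partition
    function approximated by the structures in the solution pool."""
    return log_sum_exp(pool_scores) - gold_score
```

Every structure in the pool contributes a term to the gradient, which is why memory use grows with both β and the size of the encoders.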

Conclusions
In this paper, we motivate the need for a declarative neural-symbolic approach that can be applied to NLP tasks involving long texts and contextualizing information. We introduce a general framework to support this, and demonstrate its flexibility by modeling problems with diverse relations and rich representations, and obtain models that are easy to interpret and expand. The code for DRAIL and the application examples in this paper have been released to the community, to help promote this modeling approach for other applications.

C Code Snippets
We include code snippets to show how to load data into DRAIL (Figure 7-a), as well as how to define a neural architecture (Figure 7-b). Neural architectures and feature functions can be programmed by creating Python classes, and the module and classes can be directly specified in the DRAIL program (lines 13, 14, 24, and 29 in Figure 7-a).