Bi-GRU Relation Extraction Model Based on Keywords Attention

Abstract Relation extraction plays an important role in natural language processing: it predicts the semantic relationships between entities in a sentence. Most current models rely on natural language processing tools to capture high-level features and use an attention mechanism to mitigate the adverse effect of noise in sentences on the prediction results. However, in relation classification, these attention mechanisms do not take full advantage of the semantic information carried by keywords that express the relation in a sentence. Therefore, we propose a novel relation extraction model based on an attention mechanism with keywords, named Relation Extraction Based on Keywords Attention (REKA). In particular, the proposed model uses a bi-directional GRU (Bi-GRU) to reduce computation, obtain the representation of sentences, and extract prior knowledge of the entity pair without any NLP tools. Besides calculating the entity-pair similarity, the keywords attention in the REKA model uses a linear-chain conditional random field (CRF) that combines entity-pair features, similarity features between entity-pair features, and the hidden vectors, so that the attention weight is obtained from the marginal distribution of each word. Experiments demonstrate that the proposed approach can exploit keywords that carry relational expression semantics in sentences without the assistance of any high-level features, and that it achieves better performance than traditional methods.


INTRODUCTION
Abundant data are generated and shared on the Web every day, so the relational facts about subjects (entities) in text are often used to represent textual information and to capture associations among those data. Generally, triples are used to represent entities and their relations, which often indicate unambiguous facts about entities. For example, a triple $(e_1, r, e_2)$ denotes that entity $e_1$ has a relation $r$ with another entity $e_2$. Knowledge graphs (KG) such as FreeBase [1] and DBpedia [2] are real examples of such representations in triple form.
Relation extraction (RE) is a sub-task of natural language processing (NLP) that discovers relations between entity pairs in given unstructured text. Earlier work on relation extraction from text depends heavily on kernel and feature methods [3]. Recent studies use data-driven Deep Neural Network (DNN) methods to free RE from the conventional NLP pipeline, since these DNN-based methods [4][5][6] learn features automatically instead of relying on features designed manually with various NLP toolkits. Most of them surpass the traditional methods and achieve excellent results on RE tasks. Among them, DNN-based supervised methods and distant supervision methods are the most popular and reliable solutions for RE, but each has its own characteristics. Supervised methods perform better in specific domains, while distant supervision methods perform better in generic domains. Therefore, it is difficult to say which of the two kinds of methods is better. In line with the focus of this paper, the following part introduces the DNN-based supervised methods in detail.
According to the structure of the DNN, DNN-based supervised RE is usually classified into types such as CNN [6][7][8][9][10], RNN [5,11,12], or mixed structures. In addition, some RNN variants have been adopted in RE systems, such as the Long Short-Term Memory network (LSTM) [13][14][15] and the Gated Recurrent Unit (GRU) [16]. Each kind of DNN has its own characteristics and advantages in dealing with various language tasks. For example, owing to their parallel processing ability, CNNs are good at capturing local and structural information but rarely capture global features and time-sequence information. In contrast, RNNs, LSTMs, and GRUs, which are suitable for modeling sequences and problem transformation, can alleviate the problems that CNNs cannot overcome.
However, these RNN-based methods share a common drawback: many external hand-crafted features are introduced without an effective feature filtering mechanism [17]. Therefore, semantic-oriented approaches are used to improve semantic representation by capturing the internal associations of the text through attention mechanisms. To alleviate the influence of word-level noise within sentences, many efforts have been devoted to getting rid of irrelevant words [18][19][20][21], in particular the recent state-of-the-art attention-based methods such as [19,22,23].
Although inner-sentence noise can be alleviated by attention mechanisms that calculate a weight for each word independently, some information useful for extraction is carried by runs of consecutive words such as phrases. Yu et al. [24] propose an attention mechanism based on conditional random fields (CRF), which incorporates such keyword information into the neural relation extractor. Compared with other strong feature-based classifiers and all baseline neural models, the CRF mechanism is important for this model to construct a better attention weight.
Based on the above analysis, we propose a novel relation extraction model based on an attention mechanism with keywords, named Relation Extraction Based on Keywords Attention (REKA). It incorporates an attention mechanism built on keywords that identify the relation, similar to the segments in [24]. Different from the model in [24], our model uses a bi-directional GRU (Bi-GRU) to reduce computation and requires no NLP tools. In particular, the CRF attention mechanism includes two components: entity pair attention and segment attention.
The proposed entity pair attention adds additional weight to the entity words of each sample so that they play a more decisive role during encoding. The proposed segment attention assumes that each sentence has a corresponding binary sequence of states and that each state variable in the sequence corresponds to a word in the sentence. This binary state variable takes the value 0 or 1 to indicate whether the corresponding word is irrelevant or relevant to the relation extraction task, respectively. Inspired by [24], we use a linear-chain CRF incorporating segment attention to obtain the marginal distribution of each state variable as an attention weight.
To summarize, the contributions of the proposed REKA model are as follows:
• We propose a novel Bi-GRU model based on an attention mechanism with keywords to handle relation extraction.
• Both entity pair similarity features and segment features are incorporated in the proposed attention mechanism with keywords.
• The model achieves state-of-the-art performance without the assistance of any other NLP tools.
• The model is more interpretable than the original Bi-GRU model.

RNN-Based Relation Extraction Models
Recently, relation extraction research has focused on extracting relational features with neural networks [25][26][27]. Zhang et al. [28] claimed that RNN-based relation extraction models perform better than CNN-based models, since CNNs can only obtain local features whereas RNNs are good at learning long-distance dependencies between entities. Afterward, LSTM [15] was proposed, which uses a gate mechanism to solve the problem of gradient explosion in RNN models. Based on this, Xu et al. [5] propose a model that applies LSTM along the shortest dependency path (SDP) between entities, named SDP-LSTM, which uses four types of information (word vectors, POS tags, grammatical relations, and WordNet hypernyms) as external support. To address the difficulty that a shallow architecture has in representing the latent space at different network levels, Xu et al. [29] obtain abstract features along the two sub-paths of the SDP.
Since dependency trees are directed graphs, it is necessary to identify whether the relation holds from the first entity to the second or in the reverse direction. Therefore, the SDP is divided into two sub-paths, each directed from the entity towards the ancestor node. However, one-directional LSTM models lack a representation of the complete sequential information. Thus, Zhang et al. [30] use the bidirectional LSTM model (BiLSTM) to obtain a sentence-level representation with several lexical features. Their experimental results demonstrate that word embeddings alone as input features are enough to achieve excellent results. However, the SDP can filter the input text but does not itself extract features. To address this issue, the attention mechanism is introduced for BiLSTM-based RE [31].

Attention Mechanisms for Relation Extraction
Since useful information can appear anywhere in a sentence, researchers have recently presented attention-based models that capture the important semantic information in a sentence.
Zhou et al. [31] propose an attention mechanism for BiLSTM that automatically captures the important features from the raw text alone. Similar to the work of Zhou et al. [31], Xiao et al. [32] propose a two-level BiLSTM architecture based on a two-level attention mechanism to extract a high-level representation of the raw sentence.
Although the attention mechanism is used to capture the important features extracted by the model, [31] simply uses a randomly initialized weight without considering prior knowledge. Therefore, EAtt-BiGRU, proposed by Qin et al. [33], leverages the entity pair as prior knowledge to form the attention weight. Different from Zhou et al.'s [31] work, EAtt-BiGRU applies a bi-directional GRU (Bi-GRU) to reduce computation and capture the representation of sentences, and adopts a GRU to extract prior knowledge of entity pairs. Zhang et al. [34] propose a Bi-GRU model based on another attention mechanism that uses the SDP as prior knowledge, extracting sentence-level features and attention weights. Nguyen et al. [35] propose a special attention mechanism and introduce dependency analysis that takes into account the interconnections between latent features.
Since the BERT model was proposed and achieved excellent performance on various NLP tasks, more and more studies have applied it to search matching tasks with very good results. In a recent study on pre-trained models, Wei et al. [36] achieved high metric scores using BERT. Although the BERT model has excellent encoding ability and can fully capture the contextual semantic information in a sentence, it still suffers from problems such as high training cost and long prediction time.
Our model is inspired by Lee et al. [22]. Unlike previous works, which can only obtain word-level or sentence-level attention and rarely capture the degree of correlation between entities and other related words, our model uses Bi-GRU instead of BiLSTM to reduce computation. Meanwhile, inspired by the attention model designed by Yu et al. [24] for relation extraction, which is capable of learning phrase-like features and capturing reasonably related segments as relational expressions based on the CRF, we propose a novel attention mechanism that combines entity pair attention with segment attention via the CRF.
Although the above methods provide a solid foundation for research on supervised RE, they still have limitations. For example, the lack of sufficient training corpora hinders the further development of supervised RE. Therefore, Mintz et al. [37] propose a distant supervision approach that relies on a strong assumption in the selection of training examples. Distant supervision methods have also achieved excellent results for RE [38][39][40]. However, they also have drawbacks; for example, the noise in the data sets is considerable. Thus, it is difficult to determine which of the two kinds of methods is currently the best. Hence, we study only supervised methods in this paper.

METHODOLOGY
The proposed REKA model consists of four components, the structure of which is shown in Figure 1. The role of each layer is as follows:
• The input layer contains word vector information and position information.
• The self-attention layer processes the word vectors to obtain word representations.
• The Bi-GRU layer obtains contextual information about each word in a sentence.
• The keyword-attention layer extracts the key information in the sentence and passes it to the final classification layer.

Input Layer
The REKA model's input layer transforms the original input sentence into embedding vectors containing various feature information, where the input sentence is denoted by $\{w_1, w_2, \ldots, w_n\}$ and $\{p_1^{e_j}, p_2^{e_j}, \ldots, p_n^{e_j}\}$ are the relative position features of every word with respect to the entity $e_j$, $j \in \{1, 2\}$, of the entity pair.
To further enhance the model's ability to capture the semantic information in sentences, this paper uses word embeddings from the pre-trained Embeddings from Language Models (ELMo) [43], which handle words with multiple meanings better than the earlier word2vec of Mikolov et al. [41] and GloVe of Pennington et al. [42], in which each word corresponds to a single fixed vector.
ELMo is a trained model: a sentence or a paragraph is fed into it, and it infers the vector for each word from the context. One obvious benefit of ELMo is that words with multiple meanings can be interpreted in the context of the preceding and following words.
After the word embedding process, each word is represented by a $d_w$-dimensional vector, and the sequence $\{x_1, x_2, \ldots, x_n\}$ is passed to the next layer together with the position feature vectors.
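The relative position features described above can be illustrated with a minimal sketch. The function name `relative_positions`, the clipping bound `max_len`, and the shift into non-negative indices are assumptions made for illustration; the paper does not state how offsets are clipped or indexed.

```python
def relative_positions(tokens, e1_index, e2_index, max_len):
    """Return, for every token, an index encoding its offset to each entity.

    Offsets are clipped to [-max_len, max_len] and shifted by max_len so that
    they can be used as row indices into a position embedding matrix
    (the clipping and shifting are assumptions of this sketch).
    """
    def shift(offset):
        clipped = max(-max_len, min(max_len, offset))
        return clipped + max_len

    p_e1 = [shift(i - e1_index) for i in range(len(tokens))]
    p_e2 = [shift(i - e2_index) for i in range(len(tokens))]
    return p_e1, p_e2


# Example sentence from Figure 4; "boy" and "cafeteria" are the entity pair.
tokens = "The boy ran into the school cafeteria".split()
p_e1, p_e2 = relative_positions(tokens, e1_index=1, e2_index=6, max_len=10)
```

Each index would then select one row of a trainable position embedding matrix, giving a $d_p$-dimensional vector per word and per entity.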

Multi-Head Attention Layer
Although this paper uses contextual (non-fixed) word vectors in the input layer, we apply the Multi-Head Attention (MHA) mechanism to the output vectors of the input layer to help the model further capture the deep semantic information in sentences and to address the problem of long-term dependencies. MHA is a special kind of self-attention mechanism [17,19], in which a symmetric similarity matrix of the sequence can be constructed from the sequence of word vectors produced by the input layer.
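As shown in Figure 2, given a query $Q$, a key $K$, and a value $V$, the multi-head attention module performs the attention $h$ times. The calculation uses Equations (1)-(3):

$$\mathrm{MultiHead}(Q, K, V) = W^{M}\,[\mathrm{head}_1; \ldots; \mathrm{head}_h] \quad (1)$$

$$\mathrm{head}_i = \mathrm{Attention}(W_i^{Q} Q,\; W_i^{K} K,\; W_i^{V} V) \quad (2)$$

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_w}}\right) V \quad (3)$$

where $[\,;\,]$ denotes concatenation, $W^{M}$ is the trainable parameter applied to the concatenated heads, Attention is the scaled dot-product attention, and $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the trainable projections of the query, key, and value of the $i$-th head, respectively [17].
The inputs $Q$, $K$, and $V$ are all equal to the word embedding vectors $\{x_1, x_2, \ldots, x_n\}$ in the multi-head attention [17]. The output of the MHA self-attention is a sequence of features $M = \{m_1, m_2, \ldots, m_n\}$ that carries information about the context of the input sentence.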
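The multi-head self-attention step can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the weights are random stand-ins for trained parameters, vectors are stored as rows (so projections are written `x @ W` rather than `W x`), and the helper names (`matmul`, `attention`, `multi_head_self_attention`) are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention; each row of q, k, v is one position."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)


def random_matrix(rows, cols, rng):
    """Random stand-in for a trainable weight matrix (assumption of this sketch)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def multi_head_self_attention(x, num_heads, rng):
    """Self-attention with Q = K = V = x, concatenated heads, and an output projection."""
    d_w = len(x[0])
    d_head = d_w // num_heads
    heads = []
    for _ in range(num_heads):
        w_q, w_k, w_v = (random_matrix(d_w, d_head, rng) for _ in range(3))
        heads.append(attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)))
    # Concatenate the heads position by position, then apply the output projection.
    concat = [sum((head[i] for head in heads), []) for i in range(len(x))]
    w_m = random_matrix(num_heads * d_head, d_w, rng)
    return matmul(concat, w_m)


rng = random.Random(0)
x = random_matrix(7, 8, rng)          # 7 words, d_w = 8 (toy sizes)
m = multi_head_self_attention(x, num_heads=4, rng=rng)
```

The output `m` has one row per input word and the same width as the input, so it can be fed directly to the recurrent layer that follows.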

Bi-GRU Network
The Bi-GRU network layer is used to obtain the semantic information of the sentence from the output sequence of the MHA self-attention layer. As shown in Figure 3, the GRU simplifies the LSTM by retaining only two gates, an update gate and a reset gate, so its units have fewer parameters and converge faster than LSTM units.

Figure 3. The GRU unit
For simplicity, the GRU unit's processing of $m_i$ is written as $\mathrm{GRU}(m_i)$ in this paper. The contextualized word representation is then calculated with Equations (4)-(6):

$$\overrightarrow{h_i} = \overrightarrow{\mathrm{GRU}}(m_i) \quad (4)$$

$$\overleftarrow{h_i} = \overleftarrow{\mathrm{GRU}}(m_i) \quad (5)$$

$$h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}] \quad (6)$$

The input $M$ produced by the MHA self-attention layer is fed into the Bi-GRU network step by step. To use past and future feature information simultaneously at a given time step, we concatenate the hidden state of the forward GRU network with the hidden state of the backward GRU network. Here $d_h$ denotes the dimension of the hidden state of a GRU unit, $\{h_1, h_2, \ldots, h_n\}$ denotes the hidden state vectors of the words, and the arrow represents the direction of the GRU unit.
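A minimal sketch of the bidirectional pass is given below. It assumes the standard GRU formulation with an update gate and a reset gate, omits bias terms for brevity, and uses random weights in place of trained parameters; the names `gru_step`, `run_gru`, and `bi_gru` are hypothetical.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(w, x, u, h):
    """Compute W x + U h for weight matrices given as lists of rows."""
    return [
        sum(wi * xi for wi, xi in zip(w_row, x)) + sum(ui * hi for ui, hi in zip(u_row, h))
        for w_row, u_row in zip(w, u)
    ]


def init_params(d_in, d_h, rng):
    """Random stand-ins for the trainable GRU weights (biases omitted)."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    names = ("W_z", "U_z", "W_r", "U_r", "W_h", "U_h")
    return {name: mat(d_h, d_in if name.startswith("W") else d_h) for name in names}


def gru_step(params, x, h):
    """One GRU update: update gate z, reset gate r, candidate state, new state."""
    z = [sigmoid(v) for v in affine(params["W_z"], x, params["U_z"], h)]
    r = [sigmoid(v) for v in affine(params["W_r"], x, params["U_r"], h)]
    gated = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(v) for v in affine(params["W_h"], x, params["U_h"], gated)]
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def run_gru(params, inputs, d_h):
    """Run a GRU over the sequence and return the hidden state at every step."""
    h = [0.0] * d_h
    states = []
    for x in inputs:
        h = gru_step(params, x, h)
        states.append(h)
    return states


def bi_gru(inputs, d_h, rng):
    """Concatenate forward and backward hidden states for every position."""
    d_in = len(inputs[0])
    forward = run_gru(init_params(d_in, d_h, rng), inputs, d_h)
    backward = run_gru(init_params(d_in, d_h, rng), inputs[::-1], d_h)[::-1]
    return [f + b for f, b in zip(forward, backward)]


rng = random.Random(0)
m_seq = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]  # 5 words, 4 features
h_seq = bi_gru(m_seq, d_h=3, rng=rng)
```

Each element of `h_seq` has dimension $2 d_h$, since it joins the forward state that has seen the words to the left with the backward state that has seen the words to the right.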

Keywords Attention based on CRF
Although attention mechanisms have achieved state-of-the-art results in a variety of NLP tasks, most of them do not fully exploit the keyword information in sentences. Keywords are the words that matter most for solving the relation extraction task, so the performance of a model can be improved if information about these keywords is exploited.
The goal of the attention mechanism with keywords proposed in this paper is to assign more reasonable weights to the hidden layer vectors, where the attention weights are a set of scalars used in a linear combination. A more reasonable weight assignment means that the model pays more attention to the more important words in the sentence than to other words, and all the weights in this attention mechanism take values between 0 and 1.
However, the proposed model calculates the weights differently from traditional attention mechanisms. In particular, the proposed model defines a state variable $z$ for each word in the sentence: when $z$ equals 0, the corresponding word is irrelevant to the relation classification of the sentence, and when $z$ equals 1, it is relevant. Thus, each input sentence has a corresponding sequence of $z$. From the above description, the expected value $N$ of the hidden states, weighted by the probability that each corresponding word is selected, is calculated as in Equation (7):

$$N = \sum_{i=1}^{n} p(z_i = 1 \mid H)\, h_i \quad (7)$$

To calculate $p(z_i = 1 \mid H)$, the CRF is introduced to compute the sequence of weights for the hidden vectors $H = \{h_1, h_2, \ldots, h_n\}$, where $H$ represents the input sequence and $h_i$ represents the hidden output of the GRU layer for the $i$-th word in the sentence. The CRF provides transition probabilities for computing conditional probabilities over sequences.
The linear-chain CRF defines the conditional probability $p(z \mid H)$ given $H$ as in Equations (8)-(9):

$$p(z \mid H) = \frac{1}{Z(H)} \prod_{i=1}^{n} \psi_1(z_i, H) \prod_{i=1}^{n-1} \psi_2(z_i, z_{i+1}) \quad (8)$$

$$Z(H) = \sum_{z'} \prod_{i=1}^{n} \psi_1(z'_i, H) \prod_{i=1}^{n-1} \psi_2(z'_i, z'_{i+1}) \quad (9)$$

For feature extraction, the feature extractor uses two types of feature functions: the vertex feature function $\psi_1(z_i, H)$ and the edge feature function $\psi_2(z_i, z_{i+1})$. $\psi_1$ represents the mapping of the GRU output $h$ to the state variable $z$, and $\psi_2$ models the transition between two state variables at adjacent time steps. Both are defined as exponentials in Equations (11)-(13); the vertex feature function takes the form

$$\psi_1(z_i, H) = \exp\left(W_H F_1 + W_E F_2 + b\right)$$

and the edge feature function is likewise the exponential of a trainable score for the adjacent state pair $(z_i, z_{i+1})$. Here $W_H$ and $W_E$ are trainable parameters, $b$ is a trainable bias term, and $F_1$ and $F_2$ are the entity position feature and the entity feature, respectively. They compute the contextual information as a feature score for each state variable, using the entity position features $p_i^{e_1}, p_i^{e_2}$ as well as the keyword feature embeddings (the entity hidden states $h_{e_1}, h_{e_2}$ and the entity pair hidden similarity features $t_1, t_2$).
For the hidden vectors output by the Bi-GRU layer, the CRF keyword attention mechanism performs soft selection by assigning higher weights to the words in the sentence that are more relevant to the classification. Figure 4 shows how the CRF keywords attention mechanism processes a sentence: it assigns a different weight to each word of the example sentence "The boy ran into the school cafeteria". In addition to the two entity words "boy" and "cafeteria", the word "into" is also assigned a higher weight than the other words, because it is the word associated with the relation classification.
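The marginal $p(z_i = 1 \mid H)$ of a binary linear-chain CRF can be computed with the forward-backward algorithm, which is sketched below. This is an illustration only: the log-potentials are hand-picked numbers rather than outputs of the trained feature functions, and the function names `crf_marginals` and `expected_hidden_state` are hypothetical.

```python
import math


def logsumexp(values):
    """Stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def crf_marginals(unary, edge):
    """Return p(z_i = 1 | H) for a binary linear-chain CRF.

    unary[i][s] is the log vertex potential of state s at position i,
    edge[s][t] is the log edge potential of the transition s -> t.
    """
    n = len(unary)
    alpha = [[0.0, 0.0] for _ in range(n)]   # forward log-messages
    beta = [[0.0, 0.0] for _ in range(n)]    # backward log-messages
    alpha[0] = list(unary[0])
    for i in range(1, n):
        for t in (0, 1):
            alpha[i][t] = unary[i][t] + logsumexp(
                [alpha[i - 1][s] + edge[s][t] for s in (0, 1)]
            )
    for i in range(n - 2, -1, -1):
        for s in (0, 1):
            beta[i][s] = logsumexp(
                [edge[s][t] + unary[i + 1][t] + beta[i + 1][t] for t in (0, 1)]
            )
    log_z = logsumexp(alpha[n - 1])          # log partition function
    return [math.exp(alpha[i][1] + beta[i][1] - log_z) for i in range(n)]


def expected_hidden_state(marginals, hidden):
    """Weighted sum of hidden vectors, using the marginals as attention weights."""
    dim = len(hidden[0])
    return [sum(p * h[d] for p, h in zip(marginals, hidden)) for d in range(dim)]


# Toy log-potentials for "The boy ran into the school cafeteria" (made-up values).
unary = [[1.0, -1.0], [-1.0, 2.0], [0.0, 0.5], [-0.5, 1.5],
         [1.0, -1.0], [0.5, 0.0], [-1.0, 2.0]]
edge = [[0.5, 0.0], [0.0, 0.5]]              # mild preference for staying in a state
weights = crf_marginals(unary, edge)
```

Because the edge potentials couple neighbouring states, a word's weight depends on its neighbours as well as on its own score, which is what lets contiguous phrases receive attention together. With real hidden vectors, `expected_hidden_state(weights, hidden)` would give the vector $N$ passed on to classification.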

Entity position feature: The proposed attention mechanism with keywords uses not only word embedding features but also position embedding features.
To represent contextual information together with the relative position features $p_i^{e_1}, p_i^{e_2}$ of the entities, this paper concatenates them with the output of the corresponding hidden layer $h_i$, as shown by $F_1$ in Equation (12):

$$F_1 = [h_i; p_i^{e_1}; p_i^{e_2}]$$

Position vectors are similar to word embeddings in that a relative position scalar is transformed into a feature embedding vector by a lookup in the embedding matrix.
Entity hidden similarity features: Entity hidden similarity features are extracted as entity features to replace the traditional entity feature extraction method, thus avoiding the use of traditional NLP tools. The calculation is defined in Equations (14)-(15).

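$$a_k^{j} = \frac{\exp\left(h_{e_j}^{\top} c_k\right)}{\sum_{m=1}^{K} \exp\left(h_{e_j}^{\top} c_m\right)} \quad (14)$$

$$t_j = \sum_{k=1}^{K} a_k^{j}\, c_k \quad (15)$$

In this paper, entities are categorized according to the similarity of their hidden vectors to the class vectors $c_k$, where $a_k^{j}$ is the similarity weight between the $j$-th entity and the $k$-th class vector. The $j$-th entity hidden similarity feature $t_j$ is calculated by weighting the class vectors $c$ by their similarity to the hidden layer output $h_{e_j}$ of the $j$-th entity.
Entity features are constructed by concatenating the hidden states corresponding to the entity positions with the latent type representations of the entity pair, $F_2 = [h_{e_1}; t_1; h_{e_2}; t_2]$, shown as $F_2$ in Equation (12).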
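The entity hidden similarity feature can be sketched as a softmax-weighted average of class vectors. The dot product as the similarity function, the concatenation order of the entity feature, and the names `entity_similarity_feature` and `entity_features` are assumptions of this sketch; the class vectors here are toy values rather than trained parameters.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entity_similarity_feature(h_entity, class_vectors):
    """Weight each class vector by its similarity to the entity hidden state.

    Similarity is taken to be the dot product (an assumption of this sketch).
    """
    scores = [sum(a * b for a, b in zip(h_entity, c)) for c in class_vectors]
    weights = softmax(scores)
    dim = len(class_vectors[0])
    return [sum(w * c[d] for w, c in zip(weights, class_vectors)) for d in range(dim)]


def entity_features(h_e1, h_e2, class_vectors):
    """Concatenate entity hidden states with their similarity features."""
    t1 = entity_similarity_feature(h_e1, class_vectors)
    t2 = entity_similarity_feature(h_e2, class_vectors)
    return h_e1 + t1 + h_e2 + t2


# Toy example with K = 3 class vectors of dimension 2.
classes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
f2 = entity_features([0.9, 0.1], [0.2, 0.8], classes)
```

An entity whose hidden state points towards one class vector receives a feature close to that vector, so entities with similar hidden states end up with similar type-like representations without any external tagger.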

Classification Layer
To compute the probability $p$ of the output distribution, a softmax layer is added after the keyword attention layer, as shown in Equation (16).
Here $|D|$ is the size of the training dataset and $(S^{(i)}, y^{(i)})$ is the $i$-th sample in the dataset. The AdaDelta optimizer is used to minimize the loss with respect to the model parameters.
To prevent overfitting, L2 regularization is added to the loss function, where $\lambda_1$ and $\lambda_2$ are the regularization hyperparameters. The second regularizer pushes the model to attend to a small number of significant words and thus yields a sparse weight distribution. The resulting objective function $L$ is shown in Equation (18).
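The classification step and the regularized loss can be sketched as follows. This is a simplified illustration: it assumes a negative log-likelihood loss over the training samples, applies the L2 penalty only to the classifier weights, and leaves out the second (sparsity) regularizer because its exact form is not given in the text. The names `classify` and `objective` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(n_vec, w_y, b_y):
    """Map the expected hidden state to a distribution over relation labels."""
    scores = [sum(w * x for w, x in zip(row, n_vec)) + b for row, b in zip(w_y, b_y)]
    return softmax(scores)


def objective(batch, w_y, b_y, lambda_1):
    """Negative log-likelihood plus an L2 penalty on the classifier weights.

    batch is a list of (expected hidden state, gold label index) pairs.
    The sparsity regularizer weighted by lambda_2 is omitted in this sketch.
    """
    nll = -sum(math.log(classify(n_vec, w_y, b_y)[label]) for n_vec, label in batch)
    l2 = sum(w * w for row in w_y for w in row)
    return nll + lambda_1 * l2


# Toy example: 3 relation labels, expected hidden states of dimension 2.
w_y = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
b_y = [0.0, 0.1, -0.1]
batch = [([1.0, 0.5], 0), ([-0.3, 0.8], 1)]
probs = classify(batch[0][0], w_y, b_y)
loss = objective(batch, w_y, b_y, lambda_1=1e-3)
```

In a full model the same penalty would normally cover all trainable parameters, and the gradient of the objective would be passed to the optimizer.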

Dataset and Metric
We evaluate on the SemEval-2010 Task 8 dataset, a benchmark that is widely used in the field of relation extraction. The dataset has 19 relation types: nine directed relations, each counted in both directions, plus an undirected Other class, as shown in Table 1. The dataset includes 10,717 sentences, of which 8,000 are used for training and the other 2,717 for testing. The evaluation metric is the macro-averaged F1 score, which is the official evaluation metric of the dataset.
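Macro-averaged F1 can be sketched as below. This is a simplified stand-in for the official scorer, not a reimplementation of it: each directed label is treated as its own class and the Other class is excluded from the average, which is an assumption of this sketch. The function name `macro_f1` and the toy labels are hypothetical.

```python
def macro_f1(y_true, y_pred, exclude=("Other",)):
    """Average the per-class F1 scores over all classes except the excluded ones."""
    labels = sorted((set(y_true) | set(y_pred)) - set(exclude))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores) if scores else 0.0


gold = ["Cause-Effect(e1,e2)", "Cause-Effect(e1,e2)", "Component-Whole(e2,e1)", "Other"]
pred = ["Cause-Effect(e1,e2)", "Component-Whole(e2,e1)", "Component-Whole(e2,e1)", "Other"]
score = macro_f1(gold, pred)
```

Macro averaging gives every relation class the same weight regardless of how many sentences it has, so rare relations affect the score as much as frequent ones.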

chinaXiv:202211.00421v1 (ChinaXiv partner journal)

The experimental results show that the proposed REKA model is superior to conventional models while using fewer features, but scores lower than Entity-Aware BERT and CASREL BERT. However, the pre-trained BERT model file is so large that training takes longer and places higher demands on hardware.

As shown in Table 5, we conducted ablation experiments on the development dataset to explore the contribution of the individual components of the keywords-aware attention mechanism to the experimental results. We stripped the components from the original model one at a time. The F1 score decreased by 0.2 when the position embedding component was removed. MHA, the pre-trained ELMo word embeddings, and the entity hidden similarity features contribute 0.5, 1.2, and 0.8 F1 points to the model, respectively. In particular, the keywords-aware attention yields a 2.3% improvement in F1. These results demonstrate that the components contribute to the model in a complementary way rather than working individually, and the combination of all components achieves an F1 score of 84.6.

Table 5. The effect of components on the F1-score of the model.

CONCLUSION
In this paper, we propose a novel Bi-GRU network model based on an attention mechanism with keywords for the task of RE on the SemEval-2010 Task 8 dataset. The model adequately extracts the features available in the dataset through the keyword attention mechanism and achieves an F1 score of 84.8 without the use of other NLP tools. To calculate the marginal distribution of each word, which is chosen as the attention weight, the CRF keyword attention mechanism uses the similarity features of the hidden vectors of the entity words and the relative position feature vectors with respect to the entity words. Our future research will focus on attention mechanisms that can better extract key information from sentences, and we plan to apply them to the identification of relationships among several entities.

Figure 1. The systematic architecture of the REKA model.

Figure 4. CRF keywords attention mechanism architecture shown with an example sentence "The boy ran into the school cafeteria".
is the maximum sentence length, $d_p$ is the dimension of the position vector.
vector constructed to represent the classes of similar entities, where $K$ is a hyperparameter representing the number of classes into which entities are classified by their hidden similarity.
where $|R|$ is the number of relation categories, $b_y \in \mathbb{R}^{|R|}$ is a bias term, and $W_y$ maps the expected value of the hidden state $N$ to the feature scores of the relation labels.

Table 1. Types of relationships in the dataset and their percentages.

Table 3. Comparison of the results on the SemEval-2010 Task 8 test dataset.

Table 4. Average precision score for our model and compared methods (micro-averaged over all classes).
Notes: a. The first columns show how much of the testing data has been used. Performance is on the SemEval-2010 Task 8 dataset.