Relation Extraction Based on Prompt Information and Feature Reuse

ABSTRACT To alleviate the under-utilization of features in sentence-level relation extraction, which limits the performance of the pre-trained language model and leaves the feature vectors underused, a sentence-level relation extraction method based on prompt information and feature reuse is proposed. First, in addition to the pair of nominals and the sentence information, a piece of prompt information is added, so that the overall feature information consists of sentence information, entity pair information, and prompt information; these features are then encoded by the pre-trained language model RoBERTa. Moreover, BiGRU is introduced into the network on top of the pre-trained language model to extract further information, and the feature information is passed through the network to form several sets of feature vectors. These feature vectors are then reused in different combinations to form multiple outputs, and the outputs are aggregated using ensemble-learning soft voting to perform relation classification. In addition, the sum of cross-entropy, KL divergence, and negative log-likelihood loss is used as the final loss function. In the comparison experiments, the proposed model achieved higher results on the SemEval-2010 task 8 relational dataset than the compared methods.


INTRODUCTION
Relation extraction, as a basic information extraction task, aims to identify the relationship between a pair of nominals in a given sentence from a set of predefined relationships of interest. The process can be briefly summarized as follows: the triple r(e1, e2) is extracted from unstructured text, where e1 and e2 are entities in the sentence, generally nouns or noun phrases, and r denotes the relationship between entities e1 and e2.
Relation extraction plays a crucial role in natural language processing applications that require a relational understanding of unstructured text, such as question answering, recommendation algorithms, semantic search, knowledge base filling, and knowledge graph construction. Many natural language processing tasks can benefit from accurate relation classification, so relation extraction has attracted a lot of attention. The common approach nowadays is to fine-tune pre-trained language models such as BERT [1], RoBERTa [2], and GPT [3] to achieve relation classification. Existing sentence-level relation extraction is also mainly based on language models with various innovations. However, when the language model is fine-tuned for the relation extraction task, insufficient feature selection causes the language model to be overly fine-tuned to the downstream task, so that it does not reach its full performance; at the same time, the model does not make sufficient use of the feature vectors.
To this end, this paper proposes a sentence-level relation extraction method based on adding prompt information and feature reuse. The modification of the sentence is shown in Table 1. The method first adds a prompt message to the original sentence-level features and entity-pair features: "What is the relationship between entity one (e1) and entity two (e2) in the above sentence?". The sentence features, entity pair features, and prompt features are then all encoded by RoBERTa [2]. The encoded data is fed into the model, which uses the RoBERTa [2] language model as its basis; BiGRU is introduced during fine-tuning, and another feature is constructed from it. In the hidden layer of the model, five features are proposed, denoted Feature_cls, Feature_bigru, Feature_entity1, Feature_entity2, and Feature_prompt. Finally, feature reuse is performed to form four different outputs, ensemble-learning soft voting is applied to these outputs, and the voted result is used to predict the relation class. The contributions of this method are as follows. First, prompt information is added to the relation extraction task, which not only addresses the problem of insufficient feature information but also allows the model to reach its full performance. Second, feature reuse and ensemble learning are used to address the insufficient utilization of feature vectors and to further improve the robustness of sentence-level relation extraction results. Finally, the same batch of data is fed into the model twice to obtain two different distributions, and a new loss function consisting of the cross-entropy, KL divergence, and negative log-likelihood loss of these two distributions is used to optimize the model.
Table 1: Modification of the sentence.
Sentence: (1) The <e1> legend </e1> was derived from a much older <e2> publication </e2>.
Relationship: Entity-Destination (e1, e2)
Modification: The $ leftovers $ are pushed into the # colon #. @ what is the relationship between leftovers and colon in the above sentence? @

RELATED WORK
The task of relation classification is a very important part of the knowledge graph construction process. Relation extraction methods in general include unsupervised and supervised relation classification, and supervised relation classification is usually treated as a multi-class classification problem. The performance of traditional relation classification depends mainly on the quality of features, but errors often occur during feature extraction with NLP tools, reducing the overall performance of the model. To solve this problem of feature extraction errors, Zeng et al. [4], Zheng et al. [5], and Zheng et al. [6] successively proposed the use of convolutional neural networks, recurrent neural networks, and graph neural networks for relation extraction. Although these neural networks can encode entity pairs and sentence information into feature vectors, which improves model performance to some extent, they do not take into account which information in the sentence is more important. For this reason, Shen et al. [7], Zhou et al. [8], and Guo et al. [9] proposed neural network models with attention mechanisms, added to convolutional neural networks, recurrent neural networks, and graph neural networks, respectively, to further improve the performance of relation extraction models. On top of this, Lee et al. [10] added entity awareness to enhance the robustness of the attention mechanism.
After the emergence of language models such as ELMo [11], BERT [1], and GPT [3], language models were widely used for relation extraction tasks. Alt et al. [12] proposed a new approach to relation extraction based on the Transformer [13] architecture, combining pre-learned implicit language features with the Transformer. Wang et al. [14] applied BERT to relation extraction and used an entity-aware self-attention mechanism to inject relation information related to multiple entities into each layer of the hidden state, so that multiple relations can be predicted with a single encoding pass. Wu et al. [15] similarly proposed a relation extraction model based on BERT that also encodes the information of entity pairs into the feature vector, thus effectively improving the performance of the model. Tian et al. [16] employ an attention-based graph convolutional neural network on top of the BERT encoding, which can better parse the information in the dependency tree. Tao et al. [17] extract syntactic indicators guided by syntactic knowledge and then encode them using language models, which mitigates noise in the data. Han et al. [18] instead propose prompt-based learning that converts relation classification into a fill-in-the-blank task, using the masked prediction task from language model training to predict possible relations.
Language models are less effective in specific domains or on specific tasks. To address this, Peters et al. [19] embed the contents of multiple knowledge bases into BERT, and the knowledge-enhanced BERT also achieves better results in downstream tasks such as relation extraction. Wang et al. [20] configured a neural adapter for each kind of injected knowledge so that previously injected knowledge is not washed away, thus allowing multiple knowledge bases to be fused and improving the model's results.
Recently, great progress has also been made in few-shot relation extraction. Qin et al. [21] used a continual few-shot relation learning method based on embedding space regularization and data augmentation to avoid catastrophic forgetting of previous tasks. Liu et al. [22] proposed a direct addition method that introduces relational information: it generates a relational representation by joining two relations and then adds it to the model for training and prediction. Chia et al. [23] worked on the zero-shot relation triplet extraction task. By unifying language model prompts and structured text methods, they generated relation samples by conditioning on relations with structured prompt templates and decoded them with a triplet search decoding method.
Ensemble learning is also effective in the field of relation extraction, i.e., the performance of relation extraction models can be improved by ensemble learning. Han et al. [24] combined multiple semi-supervised learning methods into a new semi-supervised learning method based on ensemble learning, which is well suited to relation classification. Kim et al. [25] used four classifiers, CRF, CRFext, SEARN, and Bi-LSTM, for relation classification and combined the four models by ensemble learning, which also achieved better results. Yang et al. [26] constructed a more efficient and robust relation extractor based on a joint integrated neural network, using their proposed adaptively boosted multiple LSTM networks with attention. Christopoulou et al. [27] used BiLSTM-CRF and feature-based CRF models as sub-models of an ensemble-learning algorithm and used it to extract relationships between drugs, achieving better results. Rim et al. [28] propose a method that combines predictions from a CNN and an RNN into an integrated model to perform relation classification and extraction simultaneously, with weighted cross-entropy as the objective function and an up-sampling strategy to mitigate the negative effects of class imbalance. Although current sentence-level relation extraction methods have achieved great success, there is still much room for improvement: the feature information extracted from sentences is insufficient, so the language model is overly fine-tuned to downstream tasks and does not fully exploit its performance; the feature vectors are not sufficiently utilized; and the loss functions chosen are outdated.

First, inadequate extraction of sentence feature information means that no information other than the entity pair information and the sentence information is used. If additional information could be added to characterize the relation extraction task, more information would be encoded and injected into the model, and the language model would gain a better understanding of the task, which may enable it to perform to its full potential. Second, underutilization of feature vectors refers to the fact that the feature vectors are used only once during propagation through the neural network; for example, the R-BERT method proposed by Wu et al. [15] uses the sentence and entity pair information only once. If the feature vectors can be used more times, the relation extraction results may improve further. To overcome the above problems, this paper proposes a sentence-level relation extraction method based on adding prompt information and feature reuse.

A RELATION EXTRACTION MODEL BASED ON PROMPT INFORMATION AND FEATURE REUSE
The model is presented in three parts: the encoding layer, the model details, and the model optimization. The encoding layer part describes how the data is modified and the characteristics of the encoded matrix-vectors. In the model details, this paper subdivides the model into three parts: input layer, feature acquisition layer, and feature reuse layer, as shown in Figure 1. Finally, the model optimization part presents how the loss function is composed.

Encoding Layer
For all the data in the dataset, a replacement operation is performed first: "<e1>" and "</e1>" are replaced with the special token "$", and "<e2>" and "</e2>" are replaced with the special token "#". A prompt message, "What is the relationship between entity one (e1) and entity two (e2) in the above sentence?", is then appended to the end of the sentence, with the special token "@" added before and after the prompt message. For example, the statement "The <e1> legend </e1> was derived from a much older <e2> publication </e2>." becomes "The $ legend $ was derived from a much older # publication #. @ What is the relationship between legend and publication in the above sentence? @" as input.
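The replacement and prompt-appending steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' preprocessing code: the function name `build_prompt_input` and the constant `PROMPT_TEMPLATE` are hypothetical, and the sketch assumes each sentence contains exactly one `<e1>` pair and one `<e2>` pair.

```python
import re

# Hypothetical template; the wording follows the prompt quoted in the text.
PROMPT_TEMPLATE = "What is the relationship between {e1} and {e2} in the above sentence?"


def build_prompt_input(sentence: str) -> str:
    """Rewrite a SemEval-style sentence into the prompt-augmented input."""
    # Recover the surface text of both entities before the tags are replaced.
    e1 = re.search(r"<e1>(.*?)</e1>", sentence).group(1).strip()
    e2 = re.search(r"<e2>(.*?)</e2>", sentence).group(1).strip()

    # Replace the opening and closing tags with the special tokens "$" and "#".
    marked = re.sub(r"</?e1>", "$", sentence)
    marked = re.sub(r"</?e2>", "#", marked)

    # Append the prompt, wrapped in the special token "@".
    prompt = PROMPT_TEMPLATE.format(e1=e1, e2=e2)
    return f"{marked.strip()} @ {prompt} @"


example = "The <e1> legend </e1> was derived from a much older <e2> publication </e2>."
converted = build_prompt_input(example)
```

Applied to the example sentence from the text, `converted` reproduces the modified input given above.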
In this paper, we use the pre-trained RoBERTa model to encode the input sentences. For the input format specific to the RoBERTa model, we need to add "[CLS]" and "[SEP]" before and after the sentence to indicate the beginning and end of the sentence, respectively. For the proposed model, five vector matrices are designed as inputs. For each modified statement S, all statements are set to a maximum length L, and statements of insufficient length are padded with zeros. The sentence S can then be represented as a sequence of token ids, the input ids, denoted I_i = {x_1, x_2, ..., x_L}. To perform the self-attention operation only on the specified words, RoBERTa follows the attention mask matrix-vector proposed in BERT, denoted M_a.
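The padding to a fixed length L and the accompanying attention mask can be illustrated with a small pure-Python sketch. The helper `pad_to_length` and the token ids in the example are made up for illustration; the padding id of zero follows the description above, whereas a real tokenizer may use a different padding id.

```python
def pad_to_length(token_ids, max_len, pad_id=0):
    """Truncate or pad token ids to max_len and build the matching attention mask.

    The mask holds 1 for real tokens and 0 for padding, so that self-attention
    ignores the padded positions.
    """
    ids = list(token_ids)[:max_len]
    n_pad = max_len - len(ids)
    input_ids = ids + [pad_id] * n_pad
    attention_mask = [1] * len(ids) + [0] * n_pad
    return input_ids, attention_mask


# Illustrative ids only; they do not come from a real vocabulary.
input_ids, attention_mask = pad_to_length([5, 6, 7], max_len=5)
```

Both returned lists always have length `max_len`, which is what allows a batch of sentences to be stacked into a single matrix.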

Model Details
This section introduces the model in three parts: Input Layer, Feature Acquisition Layer, and Feature Reuse Layer, which describe the model in terms of input, feature acquisition, and feature reuse, respectively. The specific details are shown in Figure 1.

Input Layer
In the input layer of the model, the matrix-vectors I_i and M_a are first fed into the pre-trained RoBERTa model, which outputs a hidden layer H. The matrix-vectors M_e1, M_e2, and M_p are then each multiplied by the hidden layer H. In this way, information about the entity pair and about the prompt part can be extracted in the context of the whole sentence.
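As a rough illustration of how a mask selects part of the hidden layer, the sketch below multiplies a 0/1 mask by a hidden matrix and divides by the number of masked positions, following the averaging described for the feature acquisition layer. The function `masked_average`, the toy shapes, and the guard for an empty mask are assumptions made for this example, not details taken from the paper.

```python
import numpy as np


def masked_average(hidden, mask):
    """Average the rows of `hidden` (L x d) selected by a 0/1 `mask` of length L."""
    mask = np.asarray(mask, dtype=hidden.dtype)
    summed = mask @ hidden          # sum of the selected token vectors, shape (d,)
    count = max(mask.sum(), 1.0)    # guard against division by zero for an empty mask
    return summed / count


# Toy hidden layer: 4 token positions, hidden size 3.
H = np.arange(12, dtype=float).reshape(4, 3)
M_e1 = [0, 1, 1, 0]                 # hypothetical mask covering positions 1 and 2
F_e1 = masked_average(H, M_e1)
```

The same function would serve for the second entity and for the prompt span; only the mask changes.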

Feature Acquisition Layer
The hidden layer H is the output of RoBERTa, and the feature vector H_0 contains the information of the whole sentence. The feature vector of sentence information is denoted F_c. The hidden layer is also passed through the BiGRU; in the GRU update, σ is the sigmoid activation function, ⊙ is the element-wise product, and W_r, W_z, and W_h are the parameters of the GRU network. The state h_t is taken as the output of the BiGRU and is denoted as the feature vector F_b. For the entity and prompt features, the calculated matrix-vector M_sum is divided by the length of the masked part of each mask matrix-vector to obtain the final average vector, and the results are denoted F_e1, F_e2, and F_p. That is, five feature vectors, F_c, F_b, F_e1, F_e2, and F_p, are obtained in the feature acquisition layer as the input to the next step.
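The GRU equations themselves did not survive in the text above; only the symbols (the sigmoid, the element-wise product, and the parameters W_r, W_z, W_h) remain. For reference, one standard formulation of a GRU step that uses these symbols is given below. It is a textbook form and may differ from the authors' exact notation; the input x_t, the candidate state \tilde{h}_t, the concatenation [\cdot,\cdot], and the omission of bias terms are assumptions.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z \cdot [h_{t-1}, x_t]\right) \\
r_t &= \sigma\left(W_r \cdot [h_{t-1}, x_t]\right) \\
\tilde{h}_t &= \tanh\left(W_h \cdot [r_t \odot h_{t-1}, x_t]\right) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

A bidirectional GRU runs this recurrence once left to right and once right to left over the sequence and combines the two resulting states.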

Feature Reuse Layer
Five features, F_c, F_b, F_e1, F_e2, and F_p, are taken as input to the feature reuse layer and processed as shown in Algorithm 2. The input feature vectors are first regularized; then, to avoid overfitting, they pass through a dropout layer that drops some random neurons; finally, after the Tanh function, a linear layer produces the reduced-dimensional output. The outputs obtained from all features are concatenated to obtain O_1. F_c and F_b each go through the same operation separately, and the outputs of the linear layer are O_2 and O_3. F_e1, F_e2, and F_p are likewise passed through Algorithm 2 to obtain O_4. Finally, following the idea of ensemble-learning soft voting, the probabilities of each class in the four outputs are summed and averaged to obtain the final relation classification output.
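The grouping of features into four outputs and the soft vote over them can be sketched with NumPy as follows. This is a simplified illustration under stated assumptions, not the paper's Algorithm 2: the regularization and dropout steps are omitted (as they would be inactive at inference), features are concatenated before a single linear layer, the weights are random, and the names `head` and `soft_vote` as well as all sizes are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with a max shift for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def head(features, weight, bias):
    """Concatenate the given features, apply Tanh, then a linear layer."""
    x = np.tanh(np.concatenate(features, axis=-1))
    return x @ weight + bias


def soft_vote(outputs):
    """Average class probabilities over all outputs and pick the most probable class."""
    mean_probs = np.mean([softmax(o) for o in outputs], axis=0)
    return mean_probs, mean_probs.argmax(axis=-1)


# Hypothetical sizes: batch of 2, feature width 4, 19 relation classes.
rng = np.random.default_rng(0)
batch, dim, n_classes = 2, 4, 19
F_c, F_b, F_e1, F_e2, F_p = (rng.normal(size=(batch, dim)) for _ in range(5))

groups = [
    [F_c, F_b, F_e1, F_e2, F_p],  # O_1: all five features
    [F_c],                        # O_2: sentence feature only
    [F_b],                        # O_3: BiGRU feature only
    [F_e1, F_e2, F_p],            # O_4: entity and prompt features
]
outputs = [
    head(g, rng.normal(size=(dim * len(g), n_classes)), np.zeros(n_classes))
    for g in groups
]
mean_probs, predictions = soft_vote(outputs)
```

Soft voting averages probabilities rather than counting hard votes, so an output that is confident about a class contributes more than one that is nearly uniform.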

Model Optimization
Since deep neural networks are prone to overfitting, regularization methods such as dropout are usually used to reduce the generalization error of the model during the training process. The dropout removes a random portion of units in each layer of the neural network to avoid overfitting the model. It is due to the randomness of dropout that Liang et al. [29] proposed a dropout-based loss function.
In this paper, we add cross-entropy to the loss function based on the above approach. The final loss function consists of cross-entropy, KL divergence, and negative log-likelihood loss. First, each batch of data passes through the forward network twice, which yields two different distributions, P1 and P2, as shown in Figure 2. Because of the randomness of dropout, the two forward passes differ slightly even though the same model is used: the units dropped on the left path, which produces P1, are not the same as those dropped on the right path, which produces P2. The KL divergence is therefore used to describe the difference between the two distributions, denoted L^i_KL.
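To make the two-pass idea concrete, the sketch below computes a bidirectional KL term between two output distributions and adds it to the negative log-likelihood of both passes. It assumes the symmetric form commonly used with dropout-based consistency losses; the weight `alpha` is a hypothetical parameter, and the separate cross-entropy term and any weighting used in the paper are not reproduced because their exact form is not recoverable from the text.

```python
import numpy as np


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for each row of two probability matrices."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)


def symmetric_kl(p1, p2):
    """Bidirectional KL between the two dropout passes, averaged over both directions."""
    return 0.5 * (kl_divergence(p1, p2) + kl_divergence(p2, p1))


def nll(p, labels, eps=1e-12):
    """Negative log-likelihood of the gold labels under distribution p."""
    picked = p[np.arange(len(labels)), labels]
    return -np.log(np.clip(picked, eps, 1.0))


def two_pass_loss(p1, p2, labels, alpha=1.0):
    """NLL of both passes plus alpha times the symmetric KL, averaged over the batch."""
    per_example = nll(p1, labels) + nll(p2, labels) + alpha * symmetric_kl(p1, p2)
    return per_example.mean()


# Two slightly different distributions for the same batch of two examples.
P1 = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
P2 = np.array([[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]])
labels = np.array([0, 1])
loss = two_pass_loss(P1, P2, labels)
```

When the two passes agree exactly, the KL term vanishes and only the likelihood terms remain, so the extra term penalizes only the disagreement introduced by dropout.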

EXPERIMENTS AND ANALYSIS
The experiments aim to demonstrate that prompt information, feature reuse, and the loss function each improve the performance of the model, and thereby improve on existing relation classification methods. The dataset is presented first, then the model in this paper is compared with existing methods, and finally the impact of each part of the model on the results is explored.

Dataset
The dataset used in this paper is the SemEval-2010 task 8 relational dataset. It contains 10717 samples, 8000 for training and 2717 for testing. The dataset contains 9 semantic relationship types and 1 additional type, Other, and the relationships are directed. The directionality effectively doubles the number of relations, since an entity pair is considered correctly labeled only if the order is also correct; for example, Cause-Effect (e1, e2) is different from Cause-Effect (e2, e1). Ultimately, 19 relationships exist. The relationships contained in the dataset and the number of samples for each are shown in Table 2.
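The count of 19 labels follows from 9 directed relations in two directions plus the undirected Other class, which the short sketch below makes explicit. The relation names are those of the public SemEval-2010 Task 8 release and are listed here for illustration; the helper `build_label_set` and the label string format are assumptions, and Table 2 remains the authoritative list.

```python
# Relation names as published with SemEval-2010 Task 8 (listed for illustration).
BASE_RELATIONS = [
    "Cause-Effect", "Instrument-Agency", "Product-Producer",
    "Content-Container", "Entity-Origin", "Entity-Destination",
    "Component-Whole", "Member-Collection", "Message-Topic",
]


def build_label_set(base_relations):
    """Expand each relation into its (e1,e2) and (e2,e1) variants and append 'Other'."""
    labels = [
        f"{rel}({a},{b})"
        for rel in base_relations
        for a, b in (("e1", "e2"), ("e2", "e1"))
    ]
    labels.append("Other")
    return labels


LABELS = build_label_set(BASE_RELATIONS)
```

The resulting list has 9 × 2 + 1 = 19 entries, matching the number of classes the classifier has to distinguish.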

Parameter Setting
In this paper, we use the grid search algorithm to adjust the optimal parameters, the maximum sentence length L ∈

Comparison of Different Methods
The proposed model, denoted as RPR, is compared with the previous methods TRE [12], Entity-Aware BERT [14], R-BERT [15], PTR [18], and Skeleton-Aware BERT [17]. The specific results are shown in Table 3.
(1) Comparison with the TRE [12] method. The TRE approach learns implicit linguistic features from a plain text corpus and combines them in a self-attention Transformer architecture. It does not take into account information other than entity pairs and sentence-level information. The RPR method, in contrast, adds prompt information and is therefore able to better extract features relevant to the relation extraction task.
(2) Comparison with the Entity-Aware BERT [14] method. The Entity-Aware BERT method can accomplish the multi-entity relation extraction task by encoding only once. However, it does not take into account the utilization of feature information. The RPR method, on the other hand, reuses entity pair and sentence-level information multiple times, effectively alleviating the underutilization of feature vectors.
(3) Comparison with the R-BERT [15] method. The R-BERT method uses a pre-trained BERT language model and merges information from the target entities to handle the relation classification task, but it does not sufficiently extract information from the target. The RPR approach adds a prompt to emphasize the target information, which allows the language model to understand the target more fully.
(4) Comparison with the PTR [18] method. The PTR approach proposes prompt-based learning: it adds a piece of information beyond the entity pairs and sentence-level information and applies the masked-prediction training task of the language model to predict the class. However, it does not take into account that the categories to be predicted are too unusual, which leads to unsatisfactory results. The RPR method, while also adding information, still follows the idea of the classification task and is able to better infer the class and achieve better results.
(5) Comparison with the Skeleton-Aware BERT [17] method. The Skeleton-Aware BERT method extracts syntactic indicators guided by syntactic knowledge and merges the syntactic indicators and whole sentences into a better relational representation, but it does not take into account the performance of the language model or the degree of feature utilization. The RPR method uses the stronger RoBERTa language model, as well as feature reuse across the components of the model, which improves the accuracy and F1-score values.

Effect of Model Components on the Model
This paper has demonstrated strong empirical results with the proposed method. To further understand the specific contribution of each component, the following controlled experiments were set up. For the pre-trained language models, BERT and RoBERTa were each used as the base model. Two types of input are used: the original input, in which special symbols mark only the two entities in the sentence, and the prompt-based input, in which a prompt message is additionally appended to the end of the sentence. For the model structure, three groups are used: the first classifies directly on top of the pre-trained language model, the second adds BiLSTM on top of the pre-trained language model, and the third adds BiGRU on top of the pre-trained language model. For the loss functions, two groups are used: one with the cross-entropy loss function only and the other with the loss function proposed in this paper. All experiments were tuned by grid search, and the final results are reported under the optimal parameters.
In Figure 3 (a) and (b), it can be seen that the models reach their maximum value and stabilize after about 10 epochs of fine-tuning. The model model-roberta-prompt-gru achieves the highest value of all models, and overall performance also improves when the loss function proposed in this paper is used in place of the cross-entropy loss function alone. Table 4 and Table 5 report the full set of experiments carried out to verify these conclusions, where dl denotes the loss function used in this paper. The overall performance of the RoBERTa models is better than that of the BERT models, and the tables also show the effectiveness of the proposed loss function. The maximum value of 90.70 is obtained by Roberta-bigru-prompt-dl.

CONCLUSION AND FUTURE WORK
In this paper, an approach to sentence-level relation extraction based on adding prompt information and feature reuse is proposed. Adding a prompt message makes the sentence more informative and allows the pre-trained RoBERTa model to better understand the relation extraction task. On this basis, certain feature information is reused in the model to form multiple outputs, and ensemble-learning soft voting is applied to these outputs, which enhances the robustness of the experimental results. Finally, the model is optimized with the sum of cross-entropy, KL divergence, and negative log-likelihood loss as the loss function, and better results are achieved on the SemEval-2010 task 8 relational dataset. The method can also support more accurate identification of the relationships between entities when building knowledge graphs in fields such as medicine, movies, and music, which helps the accuracy of knowledge graph construction. Future work will introduce graph neural networks alongside prompt information and feature reuse, in order to better capture the information of sentences and entity pairs.

AUTHOR CONTRIBUTIONS
Xin Zhang was responsible for experimental idea construction, method design, data analysis, and thesis writing. Ping Feng was responsible for the thesis review and revision, experimental investigation, and experimental supervision. Yingying Wang, Jian Zhao, and Biao Huang were responsible for project management and experimental hardware preparation.