A Joint Learning Framework for the CCKS-2020 Financial Event Extraction Task

This paper presents the winning solution for the CCKS-2020 financial event extraction task, where the goal is to identify event types, triggers, and arguments in sentences across multiple event types. We focus on resolving two challenging problems in this task, namely low resources and element overlapping, by proposing a joint learning framework named SaltyFishes. We first formulate the event extraction task as a joint probability model. By sharing parameters in the model across different event types, we can learn to adapt to low-resource events based on high-resource events. We further address the element overlapping problem with a mechanism of Conditional Layer Normalization, achieving even better extraction accuracy. The overall approach achieves an F1-score of 87.8%, ranking first in the competition.


INTRODUCTION
The CCKS-2020 financial event extraction task aims at extracting structural events by identifying event types, triggers, and arguments in sentences across multiple types. Figure 1 gives an example of event extraction for a financial news sentence. One structural event belongs to the type 投资/investment, along with the trigger 收购/acquire and its arguments providing complementary details. Note that this sentence contains more than one event, and the trigger and arguments overlap across the events. The CCKS-2020 task provides two kinds of such event sentences. The first kind contains five types of events with an abundant sentence corpus, called source events. The second kind contains another five types of events with a low-resource sentence corpus, called target events. Each type of event sentence is split into training (labeled data) and testing (unlabeled data) parts. Our goal is to evaluate the performance of event extraction on the test set of target events. This poses two main challenges compared to traditional event extraction tasks [1,2,3,4,5]:
• The target events contain only 179 training sentences on average for each type. Such limited supervision cannot provide sufficient contextual information for event extraction.
• Elements can overlap with each other, i.e., the same trigger or argument may belong to different events. As shown in the example, the trigger 收购/acquire and the argument 世纪华通/Shijihuatong belong to both the event types 投资/investment and 股份股权转让/share transfer. Performing event extraction with a simple sequence labeling method would cause label conflicts.
To address these challenges, we devised a joint learning method. In our approach, the overall framework is formulated as a joint probability model, i.e., the joint distribution is decomposed into a product of three conditional distributions. Each subtask corresponds to a specific use of this distribution: event type detection, trigger extraction, and argument extraction.

For the first subtask, given a financial news sentence, we classify the sentence into its correct event types using a multi-class multi-label text classification paradigm. For the other two subtasks, we successively extract triggers and arguments with a pre-training/fine-tuning framework. The pre-training module is implemented with the pre-trained language model BERT [6] on all financial news sentences, and we further fine-tune the pre-trained model for the trigger/argument extraction modules. To deal with the overlapping element issue, we introduce conditional layer normalization, extracting triggers only according to the specific event type and arguments only according to the specific trigger. This method extracts elements separately under different conditions, avoiding overlapping. In addition, by sharing parameters across different types in such a unified model, we can learn to adapt to low-resource events based on high-resource events. Our approach achieved an F1-score of 87.8%, which ranked first in the CCKS-2020 financial event extraction competition.

RELATED WORK
Traditional event extraction research is usually conducted in high-resource settings and assumes that events appear in sentences without overlapping elements. These studies can be roughly categorized into two groups: 1) Traditional joint methods [4,5,7,8,9] that perform trigger extraction and argument extraction simultaneously. They solve the task in a sequence labeling manner and extract triggers or arguments by tagging the sentence only once. However, these methods fail to extract overlapping elements, since overlapping elements cause label conflicts when forced to carry more than one label. 2) Pipeline methods [1,10,11,12,13,14] that perform trigger extraction and argument extraction in separate stages. Though pipeline methods enable extracting overlapping elements in separate stages [14], they usually lack explicit dependencies between triggers and arguments and suffer from error propagation. All the above methods require sufficient training data to learn model parameters for each event type, and few methods can extract complex overlapping elements in event extraction.
Recently, several methods have been proposed to solve event extraction in various low-resource settings, such as the few-shot learning setting [2], the zero-shot learning setting [3,12] and the incremental learning setting [15]. However, these methods cannot be directly transferred to the CCKS-2020 financial event extraction task, because this task aims at extracting low-resource target events with the help of rich-resource source events, which is a completely different setting compared with the above low-resource event extraction settings.

OUR APPROACH
We will present the overview, the design of each component, and some strategies for improvements.

Overview
Given a sentence denoted as s, we propose a joint learning approach to identify its event types C, event triggers T and event arguments A. The approach is formulated as a joint probability model, which is decomposed into three submodels with respect to event type detection, event trigger extraction and event argument extraction:

P(C, T, A | s; Θ) ∝ P(C | s; Θ1) · P(T | s, C; Θ2, Θ3) · P(A | s, C, T; Θ2, Θ4)

The event type detection is modeled by a multi-class multi-label text classification paradigm, where Θ1 is the set of type detection model parameters. The other two extraction parts are modeled by a pre-training/fine-tuning framework, where Θ2 contains model parameters shared by both modules, while Θ3 and Θ4 are their respective private model parameters. All parameters in Θ ≜ {Θ1, Θ2, Θ3, Θ4} are used across different event types (either high-resource or low-resource), which promotes rich interactions between source events and target events. Figure 2 sketches the overall framework.
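The factorization above yields a simple cascaded inference procedure: detect types, then triggers per type, then arguments per trigger. A minimal sketch in Python, where the three functions are hypothetical stubs standing in for the neural submodels:

```python
# Hypothetical stubs for P(C|s), P(T|s,C) and P(A|s,C,T); the real
# submodels are the neural networks described in the following sections.

def detect_types(sentence):
    # P(C | s; Θ1): multi-label event type detection (stubbed).
    return ["investment", "share transfer"]

def extract_triggers(sentence, event_type):
    # P(T | s, C; Θ2, Θ3): triggers conditioned on one event type (stubbed).
    return ["acquire"]

def extract_arguments(sentence, event_type, trigger):
    # P(A | s, C, T; Θ2, Θ4): arguments conditioned on type and trigger (stubbed).
    return {"target-company": "Shijihuatong"}

def extract_events(sentence):
    """Cascade the three conditionals to produce structural events."""
    events = []
    for c in detect_types(sentence):
        for t in extract_triggers(sentence, c):
            events.append((c, t, extract_arguments(sentence, c, t)))
    return events
```

Each event type is processed in its own pass, which is what later allows overlapping triggers and arguments to be extracted without label conflicts.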

Event Detection Model
In order to discover the event types occurring in a sentence, we adopted the code provided by the official competition as our event detection model (EDM). This model utilizes a pre-trained language model (PLM) to derive sentence representations, and formulates event detection as a multi-label multi-class text classification task. Specifically, given the sentence s, the probability of s belonging to a specific type c_m is calculated as in Equation (1):

p_m ≜ p(c_m | s; Θ1) = σ(w_m⊤ z_sent + b_m)   (1)

where z_sent is the hidden state corresponding to the input token <CLS> in the PLM, which encodes the entire sentence representation of s; w_m and b_m are the classifier parameters for type c_m; σ is the sigmoid function; Θ1 includes all parameters used in the PLM and the classifier.
Then, we can update and obtain the desired sentence representation z_sent by minimizing the following binary cross entropy loss function in Equation (2):

Loss_d = −(1/N) Σ_{n=1}^{N} Σ_{m=1}^{M} [ y_nm log p_nm + (1 − y_nm) log(1 − p_nm) ]   (2)

where N is the number of training sentences; M is the number of pre-defined event types; y_nm is the true type label, which is either 0 or 1; p_nm is the predicted probability of sentence n belonging to type c_m. During prediction, we simply set a threshold d and selected the resultant event type set C containing every type c_m such that p(c_m | s; Θ1) > d.
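The thresholded prediction step can be sketched in a few lines; the logits and type names below are hypothetical stand-ins for the PLM classifier output:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def detect_event_types(logits, type_names, d=0.5):
    # Keep every type c_m whose probability p_m = sigmoid(logit) exceeds
    # the threshold d; a sentence may therefore receive several labels.
    return [name for name, z in zip(type_names, logits) if sigmoid(z) > d]

# hypothetical classifier logits for five event types
types = detect_event_types(
    [2.1, -1.3, 0.4, -0.2, 3.0],
    ["investment", "pledge", "share transfer", "lawsuit", "bankruptcy"],
    d=0.5)
# → ["investment", "share transfer", "bankruptcy"]
```

Unlike single-label softmax classification, the per-type sigmoid lets a sentence carry multiple event types at once, which the overlapping examples require.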

Event Extraction Model
This section introduces our event extraction model (EEM), which achieves two subtasks with a pre-training/fine-tuning framework: trigger extraction and argument extraction. The pre-training part encodes sentence tokens as contextualized representations with the pre-trained language model BERT [6], which contains rich language knowledge widely used in natural language processing (NLP) tasks. The fine-tuning part is divided into three modules: a shared module that encodes condition information based on conditional layer normalization, and two private modules that extract triggers and arguments. Note that both extraction modules have a similar model structure.

Shared Module
This section introduces a sentence representation layer shared by both trigger extraction and argument extraction, which will derive a conditional sentence representation H s-typ for the specific event type c, and a syntactic feature representation H syn .
Since we have obtained the event types C occurring in the given sentence s, we derive sentence representations conditioned on each specific event type c ∈ C, so as to avoid element overlapping issues. To this end, we introduced a general module, named conditional layer normalization (CLN) [16,17], to integrate such conditional information into the sentence representation. CLN is largely based on the well-known layer normalization [18], but dynamically generates the gain γ and bias β from the condition information. Given a condition representation c and a sentence representation x, CLN is formulated as follows:

CLN(x_i | c) = γ_c,i · (x_i − μ) / σ + β_c,i,   γ_c = W_γ c + b_γ,   β_c = W_β c + b_β

where x_i is the i-th dimension element in x; μ and σ are the mean and standard deviation over the dimensions of x; γ_c ∈ R^d and β_c ∈ R^d are the conditional gain and bias, respectively. In this way, the given condition representation is encoded into the gain and bias, and then integrated into the contextual representations.
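As a concrete illustration, here is a minimal NumPy sketch of CLN; the weight matrices W_γ, W_β and their biases are randomly initialized stand-ins for the learned parameters:

```python
import numpy as np

def conditional_layer_norm(x, c, W_gamma, b_gamma, W_beta, b_beta, eps=1e-6):
    # Unlike standard LayerNorm, the gain and bias are generated from the
    # condition representation c: gamma_c = W_gamma c + b_gamma, etc.
    gamma = W_gamma @ c + b_gamma          # conditional gain, shape (d,)
    beta = W_beta @ c + b_beta             # conditional bias, shape (d,)
    mu = x.mean(axis=-1, keepdims=True)    # per-token mean over dimensions
    sigma = x.std(axis=-1, keepdims=True)  # per-token std over dimensions
    return gamma * (x - mu) / (sigma + eps) + beta

d, d_c = 4, 3                              # hidden size, condition size
rng = np.random.default_rng(0)
x = rng.normal(size=(5, d))                # 5 token representations
c = rng.normal(size=d_c)                   # condition (event type) vector
out = conditional_layer_norm(
    x, c,
    rng.normal(size=(d, d_c)), np.ones(d),
    rng.normal(size=(d, d_c)), np.zeros(d))
```

The same token matrix x produces different normalized outputs for different condition vectors c, which is exactly how one sentence yields distinct representations per event type.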
Then we utilized CLN to integrate event type information into the sentence. Specifically, we first transformed the event type's name into textual tokens; for example, the type 投资/investment was transformed into the tokens 投 and 资. Then we concatenated these type tokens with the word tokens of the sentence s, forming a sequence X: <CLS> + type tokens + <SEP> + word tokens + <SEP>. The sequence was input into the PLM to derive contextualized representations, and we term the representations corresponding to the type tokens H_c and those of the word tokens H_s. Then, we mean-pooled H_c into a single condition vector and fused it with H_s via CLN, deriving the conditional token representations for s with Equation (6):

H_s-typ = CLN(H_s | MeanPool(H_c))   (6)

where H_s-typ denotes the token representations conditioned on the event type c. This process generates type-aware token representations adaptive to various event types. As such, we can perform trigger extraction and argument extraction in the independent context of each type.
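The input sequence construction is straightforward; a small sketch with a hypothetical helper and a hypothetical sentence, using plain token lists in place of a real tokenizer:

```python
def build_input(type_tokens, word_tokens):
    # X: <CLS> + type tokens + <SEP> + word tokens + <SEP>
    return ["<CLS>"] + type_tokens + ["<SEP>"] + word_tokens + ["<SEP>"]

# the type name 投资/investment becomes the tokens 投 and 资
seq = build_input(list("投资"), list("腾讯收购世纪华通"))
# seq[1:3] are the type tokens; the sentence tokens follow the first <SEP>
```

Feeding the type name through the PLM alongside the sentence lets both share one encoder, so no per-type encoder parameters are needed.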

Private Module
The private module contains the following two sub-modules.

(1) Trigger Extraction Module (TEM)
This module extracts event triggers given the event type c ∈ C. In order to improve textual representations for trigger extraction, we adopted a self-attention (SA) layer. Thus the type-aware token representations are enhanced as in Equations (7) and (8):

H_sa = SA(H_s-typ)   (7)
H_tri = H_sa ⊕ H_syn   (8)

where ⊕ is the concatenation operation. H_syn corresponds to the representation of syntactic features, obtained with the NLP tool LTP. The syntactic features include B/I/O labels of word segmentation, part-of-speech tagging, named entity recognition and dependency parsing, which are initialized randomly as learnable embeddings in our model.

In order to strengthen interactions among triggers of different event types, we predicted triggers with the same trigger extractor. For each token, we adopted a pair of fully connected networks (FCN) to predict whether it is a "begin" or "end" position of a trigger, as summarized in Equations (9) and (10):

p_tri,i(b) = σ(w_t(b) · h_tri,i)   (9)
p_tri,i(e) = σ(w_t(e) · h_tri,i)   (10)

where h_tri,i is the i-th token representation in H_tri; w_t(b) and w_t(e) are learnable parameters; and Θ3 includes w_t(b), w_t(e), and the parameters in SA.
Then, a binary cross entropy loss function was used for begin position prediction and end position prediction, denoted as Loss_t(b) and Loss_t(e). The final loss is defined as in Equation (11):

Loss_t = w_t · Loss_t(b) + (1 − w_t) · Loss_t(e)   (11)

where w_t ∈ (0,1) is a trade-off factor. For prediction, we simply set a threshold φ_tri and selected the positions whose prediction scores are higher than φ_tri. We matched each begin position with the nearest end position to obtain a complete trigger. The final trigger extraction results form the trigger set T.
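The threshold-and-match decoding can be sketched as follows; the per-token scores below are hypothetical stand-ins for the begin/end probabilities of Equations (9) and (10):

```python
def decode_spans(begin_scores, end_scores, threshold=0.5):
    # Keep positions whose scores exceed the threshold, then match each
    # begin position to the nearest end position at or after it.
    begins = [i for i, p in enumerate(begin_scores) if p > threshold]
    ends = [i for i, p in enumerate(end_scores) if p > threshold]
    spans = []
    for b in begins:
        later_ends = [e for e in ends if e >= b]
        if later_ends:
            spans.append((b, min(later_ends)))
    return spans

# hypothetical per-token scores for a 6-token sentence
spans = decode_spans([0.1, 0.9, 0.2, 0.8, 0.1, 0.1],
                     [0.1, 0.2, 0.9, 0.1, 0.95, 0.1])
# → [(1, 2), (3, 4)]
```

Because begin and end are scored independently, several spans can be recovered from one sentence, unlike a single B/I/O tagging pass.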

(2) Argument Extraction Module (AEM)
This module extracts arguments conditioned on one of the triggers in T extracted by the TEM. Given a specific trigger t ∈ T in the sentence s, we obtained the trigger-aware sentence representation H_s-tri conditioned on t, following the same CLN-based process as for the type-conditioned representation. We also utilized a self-attention layer to enhance the sentence representation, termed H_sa-tri. To discern the position of the trigger t, we further added its relative position embedding R, which measures the distance from the current position to the trigger position. The syntactic feature H_syn was also taken into consideration. Thus, the enhanced overall sentence representation is defined as in Equation (12):

H_arg = H_sa-tri ⊕ R ⊕ H_syn   (12)

For argument extraction, we extracted all arguments with pairs of FCNs, one pair per argument role, as in Equations (13) and (14):

p_arg,i(b) = σ(w_a(b) · h_arg,i)   (13)
p_arg,i(e) = σ(w_a(e) · h_arg,i)   (14)

where h_arg,i is the i-th token representation in H_arg, and w_a(b) and w_a(e) are learnable parameters.
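One common way to realize the relative position feature R, sketched here as an assumption since the exact formula is not given, is to index an embedding table by the clipped signed distance of each token from the trigger span:

```python
def relative_positions(seq_len, trig_begin, trig_end, max_dist=64):
    # Signed distance of each token from the trigger span, clipped to the
    # embedding-table range; tokens inside the trigger get distance 0.
    dists = []
    for i in range(seq_len):
        if i < trig_begin:
            d = i - trig_begin
        elif i > trig_end:
            d = i - trig_end
        else:
            d = 0
        dists.append(max(-max_dist, min(max_dist, d)))
    return dists

rel = relative_positions(8, trig_begin=3, trig_end=4)
# → [-3, -2, -1, 0, 0, 1, 2, 3]
```

Each distance then selects a learnable embedding vector, giving the extractor an explicit signal of how far each token lies from the conditioning trigger.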

The loss function is also binary cross entropy for both begin and end position prediction for each argument role, denoted as Loss_a(b) and Loss_a(e), and it is calculated with Equation (15):

Loss_a = w_a · Loss_a(b) + (1 − w_a) · Loss_a(e)   (15)

where w_a ∈ (0,1) is a trade-off factor. For prediction, we simply set a threshold φ_arg and selected the positions whose prediction scores are higher than φ_arg. We matched each begin position with the nearest end position to obtain a complete argument. We removed redundant argument types for each event type based on the event schema constraint. The final results form the argument set A.

Training and Prediction
To jointly learn the TEM and AEM, we combined the losses from the two modules as in Equation (16):

Loss = w_j · Loss_t + (1 − w_j) · Loss_a   (16)

where w_j ∈ (0,1) is a weight hyperparameter that balances the two modules.
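As a toy sketch of the weighting, assuming (our assumption, not stated explicitly) that w_j scales the trigger loss and (1 − w_j) the argument loss:

```python
def joint_loss(loss_t, loss_a, w_j=0.2):
    # Assumed weighting: w_j scales the trigger loss, (1 - w_j) the
    # argument loss; the default 0.2 matches the implementation section.
    return w_j * loss_t + (1 - w_j) * loss_a
```

With w_j = 0.2, the argument loss dominates training, which is plausible since argument extraction is the harder subtask.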
We utilized ground-truth labels to train the overall model. For prediction, we first obtained trigger extraction results, and then input them into the argument extraction module. The results obtained from the two modules were returned as the final predictions.

Additional Improvement Strategy
Besides the labeled training data, the unlabeled data in the testing sets also contain rich information. In order to exploit all the data to improve performance, we employed the following strategies:

(1) Continuing Pre-training on PLM
PLMs are usually pre-trained on general-domain corpora, which may cause semantic bias on the financial corpus. Therefore, we continued pre-training the PLM on all the financial data, including training data and testing data. This strategy was applied to both EDM and EEM.

(2) Model Ensemble on Variant Data Splits
To fully exploit labeled data, we adopted K-fold validation on the labeled data, leading to K models trained on different data splits. Then, we ensembled K model predictions by the voting strategy. This model ensemble strategy was applied to EDM and EEM separately.
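The split-and-vote procedure can be sketched as follows (pure Python; training and inference of the real EDM/EEM models are omitted):

```python
from collections import Counter

def kfold_splits(data, k):
    # Partition the labeled data into K folds; fold i serves as the
    # held-out part for model i, the rest as its training split.
    folds = [data[i::k] for i in range(k)]
    return [(sum(folds[:i] + folds[i + 1:], []), folds[i])
            for i in range(k)]

def vote(predictions):
    # Majority voting over the K per-model predictions for one example.
    return Counter(predictions).most_common(1)[0][0]

splits = kfold_splits(list(range(10)), k=5)   # 5 (train, held-out) pairs
final = vote(["A", "B", "A", "A", "C"])       # → "A"
```

Every labeled sentence contributes to training K − 1 of the K models, so no data is wasted on a fixed held-out split.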

(3) Utilizing Pseudo-Labels on Unlabeled Data
To fully exploit unlabeled data, we employed a pseudo-labeling strategy on the testing data. Specifically, we trained models on the ground-truth data, and then predicted labels for the unlabeled data, yielding pseudo-labeled data. By integrating the pseudo-labeled data into the ground-truth data, we obtained a mixed training set, on which we trained new models. Note that this strategy was only used for the EEM, where it achieved better performance.
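The procedure reduces to a three-step loop; `train` and `predict` below are hypothetical stand-ins for fitting and running the real EEM ensemble:

```python
def pseudo_label_training(labeled, unlabeled, train, predict):
    # 1) fit on ground-truth data, 2) predict pseudo labels for the
    # unlabeled sentences, 3) retrain on the mixed data set.
    model = train(labeled)
    pseudo = [(x, predict(model, x)) for x in unlabeled]
    return train(labeled + pseudo)

# toy stand-ins: a "model" is just its training set, and prediction
# copies the first ground-truth label
mixed_model = pseudo_label_training(
    [("s1", "investment")], ["s2", "s3"],
    train=lambda data: data,
    predict=lambda m, x: m[0][1])
```

In practice the pseudo labels come from an ensemble of models, which makes them less noisy than single-model predictions before they are mixed into the training set.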

EXPERIMENT
This section introduces the data set provided in the competition, and conducts experiments to evaluate the model.

Data Set
The data set provided in the competition contains source event data and target event data, including labeled data and testing data for each event type. The statistics of the labeled data for each event type are shown in Table 1. The competition only evaluated on the testing target event data. For validation, we separated a part of the labeled data as validation data. The details of the data partition are shown in Table 2. Since the ground truth of the testing data was not available, all experiments below were conducted on the validation data.

Implementation
We utilized an extended BERT model as our PLM, which was pre-trained on a mixed large-scale Chinese corpus (termed BERT-ext) and obtained from a public model zoo. We then continued pre-training it on the financial data. For the EDM, we set the learning rate to 2e-5 and the batch size to 16. For the EEM, we applied a learning rate of 2e-5 to the PLM layer and 1e-4 to the other layers, with a batch size of 8. The trade-off weights w_t, w_a, and w_j were set to 0.5, 0.5, and 0.2, respectively. The dimension of each kind of syntactic embedding was set to 40, and the relative position embedding dimension to 64. We applied dropout to the SA layer and all input embeddings with a rate of 0.3. With the model ensemble strategy, we trained five EDMs for better event type prediction. For the EEM, we trained 5 models and ensembled their results to obtain pseudo labels on the testing data. Then, we trained 10 EEMs on the mixed training data and obtained 10 predictions on the testing data. We ensembled all 15 EEM results as the final submissions.

Main Result
Since the ground truth of testing data was not available, we conducted experiments on the validation data. The F1-score of event detection, trigger extraction and argument extraction on validation data was 0.921, 0.970, and 0.889, respectively.
The best result (F1-score) of our approach on official testing data was 0.8781, which was the highest score in the competition.

Ablation Study
We conducted an ablation study on the event extraction model, with results shown in Table 3. As the entire decoding process is a pipelined paradigm, the performance of each submodel is affected by the previous predicted results. To avoid this impact, we adopted ground-truth results as the input of each submodel. Specifically, Line 1 in Table 3 shows the complete model, which was trained on both ground-truth data and pseudo-label data with all components. Line 2 removed the pseudo-label data, and the results show that utilizing pseudo-label data improved performance significantly. The following experiments were ablated based on Line 2. Line 3 removed the source data in training, and the result indicates that learning target events together with source events is effective. Lines 4 and 5 replaced the continued pre-trained PLM with standard BERT and standard BERT-ext, respectively, which suggests the effectiveness of continued pre-training. Line 6 replaced CLN with a simple concatenation operation, which indicates that CLN utilizes condition information more effectively. Line 7 applied the same learning rate to all layers, which indicates that utilizing different learning rates on model layers benefits the learning process. Line 8 removed the syntactic features, which indicates that syntactic features improve extraction performance. All the results demonstrate the effectiveness of each component in the task.

Case Study
To demonstrate the predicted results of the model, we conducted a case study of model predictions. Figure 3 depicts an example of the model results. Given an input sentence, the model sequentially conducted event detection, trigger extraction and argument extraction. The model first predicted all event types occurring in the sentence. Then, it extracted triggers according to each given event type. Next, it extracted all arguments according to the given event type and trigger. Such a process extracts overlapped elements separately. Besides, by sharing parameters across different types in such a unified model, the model learned to extract low-resource events based on high-resource events. Though the model attempts to solve the low-resource issue and the element overlapping issue, two main error patterns remain: 1) error propagation: since the subtasks are conducted in a cascading manner, errors in earlier predictions lead to errors in subsequent predictions; 2) argument prediction errors: since arguments are associated with their roles in complicated ways, the model tends to predict fewer arguments or predict arguments with wrong roles. We will attempt further improvements in future work.

CONCLUSION
In this paper, we proposed a financial event extraction approach based on a joint learning framework, which fully utilizes all the data to improve performance on low-resource event types and effectively solves the element overlapping problem. The experimental results show that the approach achieved strong performance, ranking first in the CCKS-2020 financial event extraction competition.

Shu Guo received her PhD degree from the Institute of Information Engineering, Chinese Academy of Sciences. She is currently working at the National Computer Network Emergency Response Technical Team/Coordination Center of China. Her research interests include knowledge graph embedding, knowledge acquisition and Web mining. ORCID: 0000-0002-9660-0291

Bowen Yu is currently pursuing his PhD degree at the Institute of Information Engineering, Chinese Academy of Sciences. His research interests include information extraction, metric learning and unsupervised learning. ORCID: 0000-0002-6804-1859

Lihong Wang is a Professor at the National Computer Network Emergency Response Technical Team/Coordination Center of China. Her current research interests include big data mining and analytics, information retrieval and natural language processing. ORCID: 0000-0003-0179-2364