Abstract
In the task of Knowledge Graph Completion (KGC), the existing datasets and their inherent subtasks carry a wealth of shared knowledge that can be utilized to enhance the representation of knowledge triples and overall performance. However, no current studies specifically address the shared knowledge within KGC. To bridge this gap, we introduce a multi-level Shared Knowledge Guided learning method (SKG) that operates at both the dataset and task levels. On the dataset level, SKG-KGC broadens the original dataset by identifying shared features within entity sets via text summarization. On the task level, for the three typical KGC subtasks (head entity prediction, relation prediction, and tail entity prediction), we present an innovative multi-task learning architecture with dynamically adjusted loss weights. This approach allows the model to focus on more challenging and underperforming tasks, effectively mitigating the imbalance of knowledge sharing among subtasks. Experimental results demonstrate that SKG-KGC outperforms existing text-based methods significantly on three well-known datasets, with the most notable improvement on WN18RR (MRR: 66.6% → 72.2%, Hit@1: 58.7% → 67.0%).
1 Introduction
Knowledge Graphs (KGs) are directed multi-relation graphs, with entities as nodes and relations as edges, denoted as a set of triples (h, r, t). Their distinctive advantage lies in efficiently representing and managing extensive knowledge, offering high-quality structured information for diverse downstream tasks, including question answering (Saxena et al., 2020), information retrieval systems (Bounhas et al., 2020), and recommendation systems (Gao et al., 2023). Despite these strengths, existing knowledge graphs still lack a substantial amount of valuable information. Effectively addressing this gap in knowledge completeness has given rise to the field of Knowledge Graph Completion (KGC). KGC aims to infer the missing entities and relations from knowledge graphs, significantly enhancing both the quality and coverage of these valuable knowledge repositories.
Existing KGC methods are mainly divided into structure-based methods and text-based methods. Structure-based methods (Bordes et al., 2013; Sun et al., 2019; Balazevic et al., 2019) typically map entities and relations into low-dimensional vectors and calculate the probability of valid triples by various scoring functions. Text-based methods (Yao et al., 2019; Xie et al., 2022; Kim et al., 2020; Yao et al., 2024) adopt pre-trained language models to semantically encode textual descriptions of entities. They can encode entities unseen during training, but at the cost of less efficient inference. Recent advancements, such as the bi-encoder structure proposed by Wang et al. (2021a, 2022), aim to reduce the training cost of language model encoders. As a result, text-based methods have begun to surpass structure-based methods in performance.
While these methods exhibit a strong capability to complete knowledge graphs, challenges persist in the effective sharing of knowledge among datasets and subtasks. Specifically, we find that the same (h, r) or (r, t) often appear in different triples (h, r, t). According to our analysis in Figure 1a, 42.1% of triples can find other triples sharing the same (h, r) with themselves, and 66.9% of triples can find those sharing the same (r, t) with themselves in the WN18RR training set. For instance, (Kirsten Dunst, film actor, Spider-Man), (Willem Dafoe, film actor, Spider-Man), (James Franco, film actor, Spider-Man) all have the same relations and tail entities. This suggests the potential existence of shared knowledge, such as “American film actor,” among various head entities. Leveraging this dataset-level shared knowledge is essential to enhance the learning ability of triples and assist the model in correctly identifying answers from lexically similar candidates.
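The sharing statistics quoted above can be computed with a simple counting pass. The following is a minimal sketch, assuming the training set is a list of unique `(h, r, t)` tuples; the function name `shared_context_ratios` and the toy triples are illustrative and not taken from the paper.

```python
from collections import Counter


def shared_context_ratios(triples):
    """Return the fractions of triples sharing (h, r) or (r, t) with another triple."""
    hr_counts = Counter((h, r) for h, r, _ in triples)
    rt_counts = Counter((r, t) for _, r, t in triples)
    n = len(triples)
    # A triple "shares" a context when at least one other triple has the same pair.
    share_hr = sum(1 for h, r, _ in triples if hr_counts[(h, r)] > 1) / n
    share_rt = sum(1 for _, r, t in triples if rt_counts[(r, t)] > 1) / n
    return share_hr, share_rt


# Toy data for illustration only (assumed to contain no duplicate triples).
toy_triples = [
    ("Kirsten Dunst", "film actor", "Spider-Man"),
    ("Willem Dafoe", "film actor", "Spider-Man"),
    ("James Franco", "film actor", "Spider-Man"),
    ("James Franco", "film actor", "127 Hours"),
    ("Asia", "has part", "China"),
]
share_hr, share_rt = shared_context_ratios(toy_triples)
```

On this toy set, two of the five triples share an (h, r) pair and three share an (r, t) pair.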
Figure 1: (a) The greater the proportion of triples sharing the same (r, t) or (h, r), the more accessible knowledge we acquire. (b) The average number of connected entities in many-to-one and one-to-many relations indicates the imbalanced distribution between head entities and tail entities, respectively.
Notably, we observe considerable performance variations across various KGC subtasks, even when applied to the same dataset. For KG-BERT (Yao et al., 2019) on the WN18RR dataset, the Hit@10 prediction results differ notably for head entities (54%) and tail entities (60.7%). This discrepancy arises from certain relations, such as gender and city, linking more head (tail) entities and fewer tail (head) entities. As shown in Figure 1b, the issue of imbalanced distribution of head entities and tail entities is prevalent in knowledge graphs, yet it receives limited attention in research. Existing multi-task learning methods treat head entity and tail entity prediction equally, ignoring the intricacies of more complex tasks. Therefore, making the model focus on more challenging tasks while learning the shared knowledge across multiple subtasks becomes an urgent concern. This task-level shared knowledge can enhance the model’s learning of entity and relation embeddings.
In this paper, we introduce a multi-level Shared Knowledge Guided learning method (SKG) for knowledge graph completion. To capture dataset-level shared knowledge within specific entity sets, we jointly train on original triples, triples with identical head entities and relations, and triples with identical relations and tail entities. For task-level knowledge sharing, we incorporate relation prediction into multi-task learning to assist the entity prediction tasks, enabling the model to acquire more relation-aware entity information. In each iteration, our loss weight allocation scheme assigns higher loss weights to tasks that are more challenging and underperforming, effectively addressing the imbalanced distribution of head and tail entities. In summary, our contributions include:
We extract dataset-level shared knowledge by extending the original dataset, bolstering the model’s ability to identify correct answers from lexically similar candidates in the bi-encoder architecture.
We design a novel multi-task learning architecture with dynamically adjusted loss weights for task-level knowledge sharing. This ensures the model focuses more on challenging and underperforming tasks, alleviating the imbalance of subtasks in KGC.
SKG-KGC is evaluated on three benchmark datasets: WN18RR, FB15k-237 and Wikidata5M. Experimental results demonstrate the competitive performance of our model in both transductive and inductive settings, with notable success on the WN18RR dataset.
2 Related Work
Knowledge Graph Completion
KGC has been extensively studied for many years as a popular research topic. It can be divided into three subtasks: head entity prediction, relation prediction, and tail entity prediction. Structure-based methods, such as TransE (Bordes et al., 2013), RotatE (Sun et al., 2019), TuckER (Balazevic et al., 2019), and Complex-N3 (Jain et al., 2020), map entities and relations to low-dimensional vector spaces and measure the plausibility of triples by various scoring functions. Recent text-based methods, represented by KG-BERT (Yao et al., 2019), integrate pre-trained language models to encode textual descriptions of entities and relations. PKGC (Lv et al., 2022) converts each triple into natural prompt sentences, utilizing a single encoder for triple encoding. Xie et al. (2022), Saxena et al. (2022), and Yao et al. (2024) formulate KGC as a sequence-to-sequence generation task and explore Seq2Seq PLMs to directly generate the required text. StAR (Wang et al., 2021a) simultaneously learns graph embeddings and contextual information from the text encoding. Chen et al. (2023) employ conditional soft prompts to integrate textual descriptions and structural knowledge. In contrast, SimKGC (Wang et al., 2022) introduces contrastive learning and a bi-encoder with a pre-trained language model to encode entities and relations separately. It proves highly efficient for training with a large negative sample size, enhancing the efficiency of KGC training and inference.
Multi-task Learning
Multi-task learning (MTL) aims to train deep learning models concurrently by leveraging information from multiple interconnected tasks. Balancing losses during training helps tasks provide valuable insights to each other, resulting in a more proficient and robust model. For the KGC task, Kim et al. (2020) first propose a multi-task learning method, integrating relation prediction, relevance ranking, and link prediction tasks. Subsequent models focus on introducing additional knowledge or more powerful pre-trained language models (PLMs). For instance, Dou et al. (2021) propose a novel embedding framework for multi-task learning, enabling the transfer of structural knowledge across different KGs. By incorporating the ALBERT-large (Lan et al., 2020) model, which has more parameters, as the text encoder, Tian et al. (2022) enhance model performance at the expense of increased training costs. Meanwhile, Li et al. (2023) employ a multi-task pre-training strategy to capture relational information and unstructured semantic knowledge within structured knowledge graphs. These studies emphasize the interconnectedness of various KGC subtasks, highlighting that knowledge sharing among them can enhance overall performance.
However, they overlook the distinction between head entity prediction and tail entity prediction tasks, which arises from the imbalanced distribution of head and tail entities. Recognizing this, our SKG-KGC model explicitly distinguishes between head entity prediction and tail entity prediction in the context of multi-task learning. We attempt to achieve superior performance and scalability by employing the basic PLM model and fewer subtasks.
3 Method
In this section, we introduce a multi-level Shared Knowledge Guided learning method (SKG) for knowledge graph completion. Section 3.1 presents the overall architecture of the proposed model. Sections 3.2 and 3.4 describe how our method captures shared knowledge at the dataset and task levels, respectively. These two levels are integrated in the bi-encoder architecture explained in Section 3.3. Sections 3.5 and 3.6 detail the training and inference processes of our model.
3.1 Model Structure
Figure 2 illustrates the overview of the SKG-KGC model. Our model consists of three parts:
Dataset level: During training, the model is simultaneously trained with original triples, triples with identical (h, r), and triples with identical (r, t). This approach strengthens the learning of shared features among entity sets while reducing text redundancy.
Bi-encoder architecture: Two encoders are initialized with the same pre-trained model but do not share parameters. The primary encoder computes the joint embedding of the two known elements in triples, while the secondary encoder computes the representation of the missing entities.
Task level: We design balanced multi-task learning by introducing a relation prediction subtask to assist link prediction. In each iteration, the model assigns higher loss weights to challenging and underperforming subtasks, facilitating dynamic knowledge sharing across different subtasks.
3.2 Dataset Expansion
In addition to the original triples (h, r, t), our proposed model also incorporates triples with the same head entity and relation (h, r, {t0, t1, …, ti}) and triples with the same relation and tail entity ({h0, h1, …, hj}, r, t). Common features among different entities in triples are identified through text summarization. For instance, suppose (h1, r, t), (h2, r, t), and (h3, r, t) are valid triples in the training dataset, where the relation r and the tail entity t are identical. The three head entities h1, h2, and h3 may then share common or similar features. If a new entity h0 also exhibits the common features of the head entity set {h1, h2, h3}, the triple (h0, r, t) is more likely to be valid.
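The grouping step described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `expand_dataset` is hypothetical, and keeping only groups with more than one entity is an assumption, since the text does not say how singleton groups are handled.

```python
from collections import defaultdict


def expand_dataset(triples):
    """Group triples into (h, r, {t...}) and ({h...}, r, t) set-valued triples."""
    tails_by_hr = defaultdict(list)
    heads_by_rt = defaultdict(list)
    for h, r, t in triples:
        tails_by_hr[(h, r)].append(t)
        heads_by_rt[(r, t)].append(h)
    # Assumption: only groups with at least two entities carry shared knowledge.
    shared_hr = [(h, r, tuple(ts)) for (h, r), ts in tails_by_hr.items() if len(ts) > 1]
    shared_rt = [(tuple(hs), r, t) for (r, t), hs in heads_by_rt.items() if len(hs) > 1]
    return shared_hr, shared_rt


example_triples = [("h1", "r", "t"), ("h2", "r", "t"), ("h3", "r", "t"), ("h1", "r", "t2")]
shared_hr, shared_rt = expand_dataset(example_triples)
```

The original triples would be trained jointly with both derived lists.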
The model takes text sequences as input, corresponding to the three types of triples for knowledge graph completion. Each entity text sequence comprises the entity’s name and its corresponding text description. For the triple ({04692908, 00387897}, derivationally related form, 01259005), the input sequence is: “[CLS] chip, a mark left after a small piece has been chopped or broken off of something [PSEP] snick, a small cut [SEP] derivationally related form [SEP] nick, cut a nick into [SEP]”. Each description is preceded by the entity’s name (chip and snick for the head entities, nick for the tail entity). [PSEP] serves as the separator for entities in the head entity set. The use of [CLS] and [SEP] aligns with the BERT-base model. Further details regarding different subtasks are provided in Table 1.
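As a rough illustration of this input format, the sketch below assembles the sequence as a plain string. The helper `build_head_set_input` is hypothetical; in practice a tokenizer would normally insert [CLS] and [SEP] itself, and [PSEP] would presumably have to be registered as an extra special token, which is an assumption here.

```python
def build_head_set_input(head_entities, relation, tail_entity):
    """Build the text input for a ({h0, h1, ...}, r, t) triple.

    head_entities: list of (name, description) pairs.
    tail_entity: a single (name, description) pair.
    """
    head_text = " [PSEP] ".join(f"{name}, {desc}" for name, desc in head_entities)
    tail_name, tail_desc = tail_entity
    return f"[CLS] {head_text} [SEP] {relation} [SEP] {tail_name}, {tail_desc} [SEP]"


sequence = build_head_set_input(
    [
        ("chip", "a mark left after a small piece has been chopped or broken off of something"),
        ("snick", "a small cut"),
    ],
    "derivationally related form",
    ("nick", "cut a nick into"),
)
```

The resulting string reproduces the example sequence quoted in the paragraph.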
Table 1: An example of the head entity prediction (HP), relation prediction (RP), and tail entity prediction (TP) subtasks in KGC.
Subtask | Input: h tokens | Input: r tokens | Input: t tokens | Label | Type |
---|---|---|---|---|---|
HP | [MASK] | has part | China | Asia | Candidate entity ranking |
RP | Asia | [MASK] | China | has part | Multi-classification |
TP | Asia | has part | [MASK] | China | Candidate entity ranking |
In the TextRank update for the i-th sentence Si, Sj and Sk are the nodes pointing to and pointed to by Si, respectively; T(Sj) denotes the TextRank value of the j-th sentence; wji and wjk are the weights of edges between nodes (sentence similarity); and d is the damping ratio, signifying the probability of jumping from one node to another.
After iteration, we obtain the final TextRank value T(Si) for the i-th sentence unit. The top-n sentences with the highest TextRank values are then selected as concise text, providing the model with essential yet condensed descriptive information.
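The summarization step can be sketched with the standard weighted TextRank update, T(S_i) = (1 − d) + d · Σ_j [w_ji / Σ_k w_jk] · T(S_j). That the paper uses exactly this form is an assumption, and the sentence-similarity matrix is taken as given. The function names `textrank` and `top_n_sentences` are illustrative.

```python
def textrank(similarity, d=0.85, iterations=50):
    """Iterate the weighted TextRank update on a sentence-similarity matrix.

    similarity[j][i] is the edge weight w_ji from sentence j to sentence i.
    """
    n = len(similarity)
    scores = [1.0] * n
    # Total outgoing weight of each node, excluding self-loops.
    out_weight = [sum(similarity[j][k] for k in range(n) if k != j) for j in range(n)]
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                if j != i and out_weight[j] > 0:
                    rank += similarity[j][i] / out_weight[j] * scores[j]
            new_scores.append((1 - d) + d * rank)
        scores = new_scores
    return scores


def top_n_sentences(sentences, similarity, n=3, d=0.85):
    """Keep the n highest-scoring sentences, preserving their original order."""
    scores = textrank(similarity, d=d)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(best)]


# Toy similarity matrix: sentences 0 and 1 are similar, sentence 2 is an outlier.
toy_similarity = [[0.0, 0.9, 0.1], [0.9, 0.0, 0.1], [0.1, 0.1, 0.0]]
toy_scores = textrank(toy_similarity)
summary = top_n_sentences(["s0", "s1", "s2"], toy_similarity, n=2)
```

Weakly connected sentences receive low scores and are dropped from the condensed description.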
3.3 Bi-encoder Architecture
Unlike MTL-KGC (Kim et al., 2020) using a single encoder, our proposed model employs two encoders initialized with the same pre-trained language model but without sharing parameters. Each encoder autonomously acquires shared knowledge at both the dataset and task levels. The primary encoder computes the joint embedding of the two known elements in triples, while the secondary encoder computes the representation of the missing entities.
Head Entity Prediction
Tail Entity Prediction
We incorporate the idea of contrastive learning to make the anchor point closer to positive samples (h, r, t) and farther from negative samples (h′, r, t) or (h, r, t′). The proper selection of negative samples significantly impacts the training model’s performance. For ease of comparison, our model employs negative samples consistent with those constructed in SimKGC.
Relation Prediction
Here, eht represents the head entity and tail entity embeddings encoded by the shared main encoder, and WRP is the parameter matrix of the classification layer used for relation prediction.
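A minimal sketch of the relation prediction head follows, assuming the classifier applies a softmax to the logits W_RP · e_ht and is trained with cross-entropy. The vector shapes and the function names are illustrative assumptions, not the paper's implementation.

```python
import math


def relation_probabilities(e_ht, w_rp):
    """Softmax over relation logits; w_rp holds one weight row per relation."""
    logits = [sum(w * x for w, x in zip(row, e_ht)) for row in w_rp]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [v / total for v in exps]


def cross_entropy(probabilities, gold_index):
    """Negative log-likelihood of the gold relation."""
    return -math.log(probabilities[gold_index])


# Toy 2-dimensional embedding and three candidate relations (illustrative values).
toy_e_ht = [1.0, 0.0]
toy_w_rp = [[2.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]
probs = relation_probabilities(toy_e_ht, toy_w_rp)
```

The loss is lowest when the gold relation is the one with the largest logit.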
3.4 Balanced Multi-Task Learning
Here, dk(t) is calculated similarly to focal loss (Lin et al., 2020), augmenting the weight of difficult-to-distinguish samples. Although focal loss was originally designed for classification (Romdhane et al., 2020), we extend its application to multi-task weight assignment. For task k, the normalized accuracy metric is measured on the validation set in the iteration immediately before t. A higher accuracy indicates that the model has learned the task better, so the task receives a smaller weight. The focusing parameter rk smoothly adjusts the rate at which tasks are down-weighted: as a task becomes easier, it is accorded less weight.
In this paper, the focusing parameter rk primarily mirrors the learning difficulty of the head entity prediction and tail entity prediction tasks, and is defined as the ratio of the average number of connected entities in many-to-one and one-to-many relations. A higher count of entities connected by many-to-one relations increases the learning complexity of the head entity prediction task. Similarly, the tail entity prediction task is influenced by the number of entities connected by one-to-many relations. The default value of rk for the relation prediction task is set to 1.
For t = 1, we initialize the loss weight wk(t) of each task to 1, though introducing any non-balanced initialization weight based on prior knowledge is also viable.
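The weighting scheme can be illustrated with a focal-style difficulty term. This is only a sketch under explicit assumptions: the difficulty is taken to be −(1 − a_k)^{r_k} · log(a_k) for validation accuracy a_k, and the weights are normalized to sum to the number of tasks. The paper's exact formula and normalization may differ, and the accuracy and focusing values below are hypothetical.

```python
import math


def focal_difficulty(accuracy, focusing):
    """Focal-style difficulty: large when validation accuracy is low."""
    accuracy = min(max(accuracy, 1e-8), 1.0)  # guard against log(0)
    return -((1.0 - accuracy) ** focusing) * math.log(accuracy)


def task_weights(prev_accuracy, focusing, iteration):
    """Return one loss weight per task for the given iteration."""
    tasks = list(prev_accuracy)
    if iteration == 1:
        return {k: 1.0 for k in tasks}  # balanced initialization, as in the text
    difficulty = {k: focal_difficulty(prev_accuracy[k], focusing[k]) for k in tasks}
    total = sum(difficulty.values())
    if total == 0:
        return {k: 1.0 for k in tasks}
    # Assumed normalization: weights sum to the number of tasks.
    return {k: len(tasks) * difficulty[k] / total for k in tasks}


# Hypothetical accuracies from the previous iteration and focusing parameters.
accuracy = {"HP": 0.5, "RP": 0.9, "TP": 0.7}
focusing = {"HP": 2.0, "RP": 1.0, "TP": 1.5}
initial_weights = task_weights(accuracy, focusing, iteration=1)
later_weights = task_weights(accuracy, focusing, iteration=2)
```

Under these assumptions the weakest task (HP here) receives the largest weight, and the strongest task (RP) the smallest.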
3.5 Training
For different subtasks in KGC, we optimize our proposed model using InfoNCE loss and cross-entropy loss, respectively.
InfoNCE loss
The scoring function f(h, r, t) ∈ [−1, 1] for triples is the cosine similarity of ehr and et. The additive margin γ > 0 enhances the separation between true triples and false triples. We use the temperature τ to adjust the relative importance of negatives and treat it as a learnable parameter during training. The loss for each positive triple is computed against all of its negative samples. The same approach is used to obtain the loss for head entity prediction (HP).
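For reference, an InfoNCE loss with additive margin can be sketched as below. The form shown, −log of exp((f − γ)/τ) for the positive over the sum of that term and exp(f′/τ) for each negative, follows the SimKGC formulation and is assumed here rather than confirmed by this section. The temperature is fixed instead of learned to keep the sketch short, and the default values come from the hyperparameter settings.

```python
import math


def cosine(u, v):
    """Cosine similarity, used as the triple scoring function f."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce_with_margin(pos_score, neg_scores, margin=0.02, temperature=0.05):
    """InfoNCE loss for one positive triple against its negative samples."""
    pos_logit = (pos_score - margin) / temperature
    logits = [pos_logit] + [s / temperature for s in neg_scores]
    peak = max(logits)  # log-sum-exp trick for numerical stability
    log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_denominator - pos_logit


# Illustrative scores: an easy case and a hard case.
loss_easy = info_nce_with_margin(0.9, [0.1, 0.0, -0.2])
loss_hard = info_nce_with_margin(0.3, [0.4, 0.2])
```

A larger margin lowers the positive logit, so the positive must score further above the negatives to reach the same loss.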
Cross-entropy Loss
3.6 Inference
Assume there are |T| test triples and |E| candidate entities in the head entity prediction task. Traditional cross-encoders, such as KG-BERT (Yao et al., 2019) and MTL-KGC (Kim et al., 2020), traverse all |E| entities for each test triple (?, r, t): they repeatedly replace the head entity in the test triple and select the highest-ranking entity as the prediction. A single test triple therefore requires |E| computations, and |T| triples require |E|×|T| computations in total. In contrast, our method employs two independent encoders, similar to SimKGC (Wang et al., 2022). The primary encoder computes the relation-aware tail entity embeddings for the |T| test triples, while the secondary encoder needs only a one-time computation over the |E| candidate entities, without re-traversing all entities for each triple. The embeddings from the two encoders are combined with a dot product to obtain the ranking scores for all entities. This reduces the required BERT forward passes to |E| + |T|, significantly reducing inference time.
Likewise, the reasoning process for the tail entity prediction subtask follows a comparable pattern. The computational complexity also shifts from |E|×|T| to |E| + |T|. The inference complexity of the relation prediction subtask remains |T|, owing to the retention of the cross-encoder. Moreover, we have the capability to pre-compute the embeddings of unseen entities or relations based on their text descriptions. Consequently, our model can also facilitate inductive reasoning for some unseen entities or relations.
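The difference in forward passes can be made concrete with a toy stand-in for the encoders. `CountingEncoder` and its two-dimensional "embedding" are purely hypothetical placeholders for BERT; only the call counts are meaningful.

```python
class CountingEncoder:
    """Stand-in for a BERT encoder that counts forward passes."""

    def __init__(self):
        self.calls = 0

    def encode(self, text):
        self.calls += 1
        # Toy 2-d vector from character statistics; a real encoder returns a dense embedding.
        return (float(len(text)), float(sum(map(ord, text)) % 97))


def rank_heads_bi_encoder(queries, candidates, query_encoder, entity_encoder):
    """Rank all candidate entities for each (?, r, t) query with a dot product."""
    entity_vecs = [entity_encoder.encode(e) for e in candidates]  # |E| passes, done once
    rankings = []
    for query in queries:  # |T| passes
        qv = query_encoder.encode(query)
        scores = [qv[0] * ev[0] + qv[1] * ev[1] for ev in entity_vecs]
        order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
        rankings.append([candidates[i] for i in order])
    return rankings


queries = ["has part [SEP] China", "hypernym [SEP] dog", "member of [SEP] wolf"]
candidates = ["Asia", "canine", "pack", "Europe"]
primary, secondary = CountingEncoder(), CountingEncoder()
rankings = rank_heads_bi_encoder(queries, candidates, primary, secondary)
bi_encoder_passes = primary.calls + secondary.calls    # |T| + |E|
cross_encoder_passes = len(queries) * len(candidates)  # |T| x |E|
```

With 3 queries and 4 candidates the bi-encoder needs 7 forward passes, whereas a cross-encoder would need 12; the gap grows quickly with the number of entities.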
4 Experiments
4.1 Experimental Setup
Dataset
Our model is evaluated on three benchmark datasets: WN18RR (Dettmers et al., 2018), FB15k-237 (Toutanova and Chen, 2015), and Wikidata5M (Wang et al., 2021b). Further details regarding dataset statistics are provided in Table 2. WN18RR is a subset of WordNet (Miller, 1995), containing about 41k entities and 11 semantic relations between words. FB15k-237, a subset of FreeBase (Bollacker et al., 2008), consists of about 15k entities and 237 relations. For text descriptions in WN18RR and FB15k-237, we follow the data provided by KG-BERT (Yao et al., 2019). Wikidata5M integrates the Wikidata knowledge graph and Wikipedia pages, comprising nearly 5 million entities and about 20 million triples. It is used for both transductive and inductive KGC tasks. In the transductive setting, all entities in the test set also appear in the training set, while in the inductive setting, entities in the test set never appear in the training set.
Table 2: Statistics of the datasets used in this paper. “Wikidata5M-Trans” and “Wikidata5M-Ind” refer to the transductive and inductive settings, respectively.
Dataset | #entity | #relation | #train | #valid | #test |
---|---|---|---|---|---|
WN18RR | 40,943 | 11 | 86,835 | 3,034 | 3,134 |
FB15K-237 | 14,541 | 237 | 272,115 | 17,535 | 20,466 |
Wikidata5M-Trans | 4,594,485 | 822 | 20,614,279 | 5,163 | 5,163 |
Wikidata5M-Ind | 4,579,609 | 822 | 20,496,514 | 6,699 | 6,894 |
Evaluation Metrics
For each test triple (h, r, t), our model predicts the tail entity t by ranking all entities based on (h, r), and similarly, predicts the head entity h by ranking all entities based on (r, t). The evaluation employs four metrics: mean reciprocal rank (MRR), Hit@1, Hit@3, and Hit@10. MRR is the average reciprocal rank of all test triples, while Hit@k represents the proportion of correct entities ranked within the top-k candidates. All metrics are reported under the filtered setting (Bordes et al., 2013), and computations involve averaging over head entity prediction (?, r, t) and tail entity prediction (h, r,?) tasks.
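The metrics can be computed as in the sketch below. The helper names are illustrative; the filtered rank ignores other entities that are known to be correct answers for the same query, following the filtered setting described above.

```python
def filtered_rank(scores, gold, known_answers):
    """Rank of the gold entity after filtering out other known correct answers.

    scores: dict mapping each candidate entity to its score.
    known_answers: entities (other than gold) that also form valid triples.
    """
    gold_score = scores[gold]
    better = sum(
        1
        for entity, score in scores.items()
        if score > gold_score and entity != gold and entity not in known_answers
    )
    return better + 1


def mrr_and_hits(ranks, ks=(1, 3, 10)):
    """Mean reciprocal rank and Hit@k over a list of ranks."""
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n
    hits = {k: sum(1 for r in ranks if r <= k) / n for k in ks}
    return mrr, hits


# Toy example: "India" outranks the gold answer but is itself a known valid answer.
toy_scores = {"China": 0.9, "India": 0.95, "Japan": 0.8}
raw = filtered_rank(toy_scores, "China", known_answers=set())
filtered = filtered_rank(toy_scores, "China", known_answers={"India"})
mrr, hits = mrr_and_hits([1, 2, 4, 20])
```

In practice the ranks from head and tail prediction would be pooled before averaging.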
Hyperparameters
The SimKGC model (Wang et al., 2022) serves as our benchmark, with most hyperparameters aligning with it. The encoders are initialized with BERT-base-uncased (English). The AdamW optimizer with linear learning rate decay is employed. All models are trained with batch size 1024 on 4 A100 GPUs. We conduct a grid search on learning rates within {1 × 10⁻⁵, 3 × 10⁻⁵, 5 × 10⁻⁵}. Entity descriptions are truncated to a maximum of 50 tokens. In the TextRank algorithm, we set the damping ratio d to 0.85 and select the top three sentences as the summarized text. Each task’s initial weight in multi-task learning is set to 1. The temperature τ is initialized to 0.05, and the additive margin γ for InfoNCE loss is 0.02. For the WN18RR, FB15k-237, and Wikidata5M datasets, we train for 50, 10, and 1 epochs, respectively.
4.2 Main Results
We compare the performance of SKG-KGC with state-of-the-art baseline models, covering both structure-based methods and text-based methods. Table 3 illustrates the main results on the WN18RR and FB15K-237 datasets, while Table 4 shows the performance on the Wikidata5M dataset under transductive and inductive settings.
Table 3: Main results for the WN18RR and FB15K-237 datasets. Results marked † are from StAR (Wang et al., 2021a); the other results are from the corresponding papers. Bold numbers represent the best results.
Model | WN18RR MRR | WN18RR Hit@1 | WN18RR Hit@3 | WN18RR Hit@10 | FB15K-237 MRR | FB15K-237 Hit@1 | FB15K-237 Hit@3 | FB15K-237 Hit@10 |
---|---|---|---|---|---|---|---|---|
Structure-based Methods | ||||||||
TransE (Bordes et al., 2013) † | 24.3 | 4.3 | 44.1 | 53.2 | 27.9 | 19.8 | 37.6 | 44.1 |
ComplEx (Trouillon et al., 2016) † | 44.9 | 40.9 | 46.9 | 53.0 | 27.8 | 19.4 | 29.7 | 45.0 |
RotatE (Sun et al., 2019) † | 47.6 | 42.8 | 49.2 | 57.1 | 33.8 | 24.1 | 37.5 | 53.3 |
TuckER (Balazevic et al., 2019) † | 47.0 | 44.3 | 48.2 | 52.6 | 35.8 | 26.6 | 39.4 | 54.4 |
Complex-N3 (Jain et al., 2020) | 49.0 | 44.0 | – | 58.0 | 37.0 | 27.0 | – | 56.0 |
TransMTL-H (Dou et al., 2021) | 49.8 | – | – | 57.0 | 34.9 | – | – | 53.7 |
SEPA (Gregucci et al., 2023) | 48.1 | 44.1 | 49.6 | 56.2 | 33.2 | 24.3 | 36.3 | 50.9 |
Text-based Methods | ||||||||
KG-BERT (Yao et al., 2019) † | 21.6 | 4.1 | 30.2 | 52.4 | – | – | – | 42.0 |
MTL-KGC (Kim et al., 2020) | 33.1 | 20.3 | 38.3 | 59.7 | 26.7 | 17.2 | 29.8 | 45.8 |
StAR (Wang et al., 2021a) | 40.1 | 24.3 | 49.1 | 70.9 | 29.6 | 20.5 | 32.2 | 48.2 |
GenKGC (Xie et al., 2022) | – | 28.7 | 40.3 | 53.5 | – | 19.2 | 35.5 | 43.9 |
MIT-KGC (Tian et al., 2022) | – | 33.5 | 58.2 | 76.5 | – | 21.2 | 41.7 | 57.5 |
SimKGC (Wang et al., 2022) | 66.6 | 58.7 | 71.7 | 80.0 | 33.6 | 25.7 | 37.3 | 49.8 |
LP-BERT (Li et al., 2023) | 48.2 | 34.3 | 56.3 | 75.2 | 31.0 | 22.3 | 33.6 | 49.0 |
SKG-KGC | 72.2 | 67.0 | 75.1 | 81.6 | 35.0 | 26.4 | 37.7 | 52.2 |
Table 4: Main results for the Wikidata5M dataset. Results marked ‡ are from SimKGC (Wang et al., 2022); the other results are from the corresponding papers. We follow the evaluation protocol used in SimKGC.
Model | Trans MRR | Trans Hit@1 | Trans Hit@3 | Trans Hit@10 | Ind MRR | Ind Hit@1 | Ind Hit@3 | Ind Hit@10 |
---|---|---|---|---|---|---|---|---|
Structure-based Methods | ||||||||
TransE (Bordes et al., 2013) ‡ | 25.3 | 17.0 | 31.1 | 39.2 | – | – | – | – |
RotatE (Sun et al., 2019) ‡ | 29.0 | 23.4 | 32.2 | 39.0 | – | – | – | – |
Text-based Methods | ||||||||
DKRL (Xie et al., 2016) ‡ | 16.0 | 12.0 | 18.1 | 22.9 | 23.1 | 5.9 | 32.0 | 54.6 |
KEPLER (Wang et al., 2021b) ‡ | 21.0 | 17.3 | 22.4 | 27.7 | 40.2 | 22.2 | 51.4 | 73.0 |
BLP-SimplE (Daza et al., 2021) ‡ | – | – | – | – | 49.3 | 28.9 | 63.9 | 86.6 |
SimKGC (Wang et al., 2022) | 35.8 | 31.3 | 37.6 | 44.1 | 71.4 | 60.9 | 78.5 | 91.7 |
KGT5 (Saxena et al., 2022) | 30.0 | 26.7 | 31.8 | 36.5 | – | – | – | – |
SKG-KGC | 36.6 | 32.3 | 38.2 | 44.6 | 72.0 | 61.6 | 78.8 | 91.7 |
On the WN18RR dataset, the SKG-KGC model outperforms other models significantly. It improves over the state-of-the-art (SOTA) method in MRR, Hit@1, Hit@3, and Hit@10 by 5.6, 8.3, 3.4, and 1.6 percentage points, respectively. The most substantial enhancement is observed in Hit@1, potentially attributed to the presence of more lexically similar entities and a sparser graph structure in the WN18RR dataset. We argue that shared knowledge aids the model in learning crucial textual descriptions, enhancing its ability to distinguish similar candidate entities. The dynamic and balanced loss weight scheme in multi-task learning enables the model to concentrate more on specific subtasks, enhancing its efficacy in handling sparse data in WN18RR. Moreover, the strongest text-based methods outperform structure-based methods on this dataset, underscoring their advantage in grasping the semantics of words.
Compared to the WN18RR dataset, the FB15K-237 dataset features richer relations and fewer entities. Our model outperforms the other text-based methods, with the exception of MIT-KGC, which utilizes the more powerful ALBERT-large encoder and requires longer training. This outcome underscores the effectiveness of shared knowledge and balanced multi-task learning in SKG-KGC for leveraging text information. However, our model still falls short of structure-based methods such as TuckER and Complex-N3. Two main reasons contribute to this shortfall. First, the limited number of entities in the FB15K-237 dataset results in inadequate learning of entity textual descriptions. Second, structure-based methods are more effective at capturing generalizable inference rules, which proves advantageous for the FB15K-237 dataset.
The Wikidata5M dataset spans various domains and is much larger in scale than WN18RR and FB15K-237. As indicated in Table 4, our model demonstrates SOTA performance in both transductive and inductive settings when compared to existing structure-based and text-based methods. Notably, the million-scale data results in a prolonged training time for a single epoch. To facilitate comparisons and minimize training costs, we follow SimKGC and train for only 1 epoch. Consequently, the dynamic and balanced loss weight allocation scheme is not applied to this dataset. Extending the existing dataset and incorporating the relation prediction subtask in multi-task learning still yield some performance gains, although further improvements remain possible. Additionally, the strong performance on Wikidata5M-Ind underscores our model’s capability to infer entities not encountered in the training set.
4.3 Ablation Studies
We conduct ablation studies to explore the impact of each specific component of the SKG-KGC model. Specifically, “w/o dataset expansion” means that the model is trained using only the original triples. “w/o balanced multi-task learning” refers to fixing the loss weights of all subtasks at 1. “w/o multi-level shared knowledge” means removing both components that gather dataset-level and task-level knowledge. “w/o bi-encoder architecture” indicates that we use only one encoder for all triple elements. The results shown in Table 5 highlight that removing any of these components reduces the model’s performance on most metrics.
Table 5: The ablation results on the WN18RR and FB15K-237 datasets.
Model | WN18RR MRR | WN18RR Hit@1 | WN18RR Hit@3 | WN18RR Hit@10 | FB15K-237 MRR | FB15K-237 Hit@1 | FB15K-237 Hit@3 | FB15K-237 Hit@10 |
---|---|---|---|---|---|---|---|---|
w/o dataset expansion | 68.5 | 61.6 | 72.6 | 80.3 | 34.2 | 25.6 | 36.9 | 51.5 |
w/o balanced multi-task learning | 70.8 | 65.3 | 73.6 | 81.5 | 34.7 | 26.0 | 37.4 | 52.3 |
w/o multi-level shared knowledge | 66.9 | 60.3 | 70.5 | 79.3 | 34.1 | 25.4 | 36.8 | 51.7 |
w/o bi-encoder architecture | 68.2 | 61.2 | 72.4 | 80.6 | 33.3 | 24.3 | 36.2 | 51.1 |
SKG-KGC | 72.2 | 67.0 | 75.1 | 81.6 | 35.0 | 26.4 | 37.7 | 52.2 |
Effect of Dataset Expansion
Removing dataset expansion causes a significant decrease in our model’s performance on the WN18RR and FB15K-237 datasets. On WN18RR in particular, which features more textual descriptions, MRR, Hit@1, Hit@3, and Hit@10 drop by 3.7%, 5.4%, 2.5%, and 1.3%, respectively. This emphasizes the effectiveness of common knowledge within entity sets sharing the same (h, r) or (r, t). Such dataset-level shared knowledge enhances the model’s ability to learn common features among interconnected entities.
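To make the notion of entity sets concrete, the following minimal Python sketch groups triples by their shared (h, r) or (r, t) pair. The function name `build_entity_sets` and the toy triples are illustrative assumptions and are not taken from the authors' implementation.

```python
from collections import defaultdict


def build_entity_sets(triples):
    """Group entities that share the same (h, r) or (r, t) pair.

    Returns two dicts:
      head_sets[(r, t)] -> sorted heads observed with that relation and tail
      tail_sets[(h, r)] -> sorted tails observed with that head and relation
    Singleton groups are dropped because they carry no shared knowledge.
    """
    tail_sets = defaultdict(set)
    head_sets = defaultdict(set)
    for h, r, t in triples:
        tail_sets[(h, r)].add(t)
        head_sets[(r, t)].add(h)
    head_sets = {k: sorted(v) for k, v in head_sets.items() if len(v) > 1}
    tail_sets = {k: sorted(v) for k, v in tail_sets.items() if len(v) > 1}
    return head_sets, tail_sets


# Toy triples (hypothetical, for illustration only).
triples = [
    ("dog", "hypernym", "canine"),
    ("wolf", "hypernym", "canine"),
    ("dog", "hypernym", "domestic_animal"),
    ("cat", "hypernym", "feline"),
]
head_sets, tail_sets = build_entity_sets(triples)
print(head_sets)  # {('hypernym', 'canine'): ['dog', 'wolf']}
print(tail_sets)  # {('dog', 'hypernym'): ['canine', 'domestic_animal']}
```

In this sketch, each surviving group corresponds to one expanded sample of the form ({h0, h1, …}, r, t) or (h, r, {t0, t1, …}).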
Effect of Balanced Multi-task Learning
On the WN18RR and FB15K-237 datasets, when balanced multi-task learning is excluded, the MRR, Hit@1, and Hit@3 of the model show a decrease, but the Hit@10 metric is still comparable. This highlights the advantage of our proposed loss weight allocation scheme for multiple subtasks in multi-task learning. The scheme facilitates more accurate identification of the expected entity from candidate entity sets, despite facing challenges in identifying the top-10 entities.
Effect of Bi-encoder Architecture
The removal of the bi-encoder architecture results in a 4% decrease in MRR on WN18RR, and a 1.7% decrease on the FB15K-237 dataset. This indicates that it is reasonable for the model to use two independent encoders to encode unknown and known elements separately, thereby avoiding some potential confusion in the single encoder configuration. These findings highlight the effectiveness of the bi-encoder architecture in seamlessly integrating dataset-level and task-level shared knowledge, significantly improving the model’s proficiency in knowledge graph completion.
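The decreases quoted in this section are absolute differences between the full model and each ablated variant. As a quick check, the short Python sketch below recomputes them from the WN18RR columns of Table 5; the helper name `absolute_drops` is our own.

```python
# WN18RR columns of Table 5 (values copied from the table).
METRICS = ("MRR", "Hit@1", "Hit@3", "Hit@10")
FULL_MODEL = (72.2, 67.0, 75.1, 81.6)
ABLATIONS = {
    "w/o dataset expansion": (68.5, 61.6, 72.6, 80.3),
    "w/o balanced multi-task learning": (70.8, 65.3, 73.6, 81.5),
    "w/o multi-level shared knowledge": (66.9, 60.3, 70.5, 79.3),
    "w/o bi-encoder architecture": (68.2, 61.2, 72.4, 80.6),
}


def absolute_drops(full, ablated):
    """Absolute difference in percentage points, rounded to one decimal."""
    return tuple(round(f - a, 1) for f, a in zip(full, ablated))


drops = {name: absolute_drops(FULL_MODEL, row) for name, row in ABLATIONS.items()}
for name, row in drops.items():
    print(name, dict(zip(METRICS, row)))
```

For example, the entry for “w/o dataset expansion” reproduces the 3.7, 5.4, 2.5, and 1.3 point drops reported above, and the entry for “w/o bi-encoder architecture” reproduces the 4.0 point drop in MRR.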
4.4 Further Exploration of Dataset-level Knowledge
During dataset expansion, we study how different input texts, the number of sentences, and entity sets affect our model, aiming at further exploration of dataset-level knowledge.
Experiment 1: Effect of Input Texts
In this experiment, we assess the impact of different input texts on the model performance. We examine four scenarios: without entity descriptions, without entity names, with both but without text summarization, and with both including text summarization.
The results in Table 6 indicate that removing entity descriptions and entity names leads to a 30.7% and 6.9% decrease in the model’s MRR, respectively, underscoring the importance of these features for capturing in-depth semantic relations in text-based KGC methods. Entity descriptions matter most because they provide extensive textual context. Furthermore, applying the TextRank text summarization algorithm yields a 1.7% increase in MRR, effectively addressing the text redundancy caused by an excess of entities.
Table 6: Performance comparison of different input texts on WN18RR.
Input text | MRR | Hit@1 | Hit@3 | Hit@10 |
---|---|---|---|---|
w/o entity description | 41.5 | 32.7 | 46.0 | 58.1 |
w/o entity name | 65.3 | 57.6 | 70.0 | 79.0 |
w/o text summarization | 70.5 | 64.2 | 74.1 | 81.8 |
SKG-KGC | 72.2 | 67.0 | 75.1 | 81.6 |
Experiment 2: Selection of Top-n Sentences
During text summarization, we select the top n sentences with the highest TextRank values to serve as concise text, providing the model with the necessary but succinct descriptive information. Accordingly, we explore the impact of the number of sentences on the model’s overall performance.
Figure 3 presents the experimental outcomes of selecting the top-n sentences (n ∈ {1, 2, 3, 4, 5, 6}) on the WN18RR dataset. When n = 3, the MRR and Hit@1 metrics of the SKG-KGC model reach their optimal values. When n is less than 3, the model may struggle to capture the more detailed information in the entity context. Conversely, as n increases beyond 3, the influx of descriptive information might lead to information overload and confusion, making it harder to identify the most critical contextual information about entities. Consequently, the top three sentences are selected as the summarized text.
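The top-n selection step can be sketched in a few lines of Python. The code below is a simplified TextRank-style ranking written for illustration: it assumes a word-overlap sentence similarity and a fixed number of power-iteration steps, and the names `sentence_similarity` and `textrank_top_n` are hypothetical. It is not the summarizer used in our experiments.

```python
import math
import re


def sentence_similarity(tokens_a, tokens_b):
    """Word overlap normalized by log sentence lengths (TextRank-style)."""
    words_a, words_b = set(tokens_a), set(tokens_b)
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # avoids a zero denominator for one-word sentences
    return len(words_a & words_b) / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank_top_n(sentences, n=3, damping=0.85, iterations=50):
    """Return the n highest-scoring sentences, kept in their original order."""
    tokens = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    m = len(sentences)
    # Symmetric similarity graph without self-loops.
    weights = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            weights[i][j] = weights[j][i] = sentence_similarity(tokens[i], tokens[j])
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * m
    for _ in range(iterations):
        scores = [
            (1 - damping)
            + damping
            * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(m)
                if out_sum[j] > 0
            )
            for i in range(m)
        ]
    top = sorted(range(m), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]


# Toy descriptions (hypothetical); the last sentence is unrelated to the rest.
sentences = [
    "A dog is a domesticated canine animal.",
    "The wolf is a wild canine animal.",
    "A fox is a small wild canine.",
    "Stock markets closed higher today.",
]
summary = textrank_top_n(sentences, n=3)
print(summary)
```

With n = 3, the unrelated fourth sentence receives the lowest score and is dropped, which mirrors how summarization removes text that is not shared across an entity set.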
Experiment 3: Effect of Entity Sets
We compare SKG-KGC with its two variants that remove head entity sets ({h0, h1,…, hj}, r, t) and tail entity sets (h, r,{t0, t1,…, ti}) on the WN18RR dataset.
Table 7 shows the effect of such exclusions on the model’s performance in predicting head and tail entities. Removing head entity sets significantly reduces the performance of tail entity prediction (MRR decreases by 3.6%), while removing tail entity sets only slightly affects head entity prediction (MRR decreases by 1.1%). The overall performance in entity prediction benefits from shared knowledge across all dataset levels, notably for the tail entity prediction task. We attribute this observed phenomenon to the proportion of triples sharing the same (r, t) or (h, r), as depicted in Figure 1a. The WN18RR dataset contains more triples with the same (r, t), thereby providing a wealth of knowledge about head entity sets and resulting in a more significant enhancement in the tail entity prediction task.
Table 7: Effectiveness of different entity sets on WN18RR. (?, r, t) and (h, r, ?) denote head entity and tail entity prediction, respectively.
Model | Subtask | MRR | Hit@1 | Hit@3 | Hit@10 |
---|---|---|---|---|---|
w/o head entity sets | (?, r, t) | 66.5 | 59.4 | 70.6 | 79.6 |
w/o head entity sets | (h, r, ?) | 73.3 | 65.6 | 78.4 | 87.1 |
w/o head entity sets | Average | 69.9 | 62.5 | 74.5 | 83.3 |
w/o tail entity sets | (?, r, t) | 66.4 | 60.1 | 70.1 | 77.7 |
w/o tail entity sets | (h, r, ?) | 73.8 | 66.6 | 78.6 | 86.5 |
w/o tail entity sets | Average | 70.1 | 63.3 | 74.3 | 82.1 |
SKG-KGC | (?, r, t) | 67.5 | 62.4 | 70.4 | 76.5 |
SKG-KGC | (h, r, ?) | 76.9 | 71.6 | 79.8 | 86.7 |
SKG-KGC | Average | 72.2 | 67.0 | 75.1 | 81.6 |
4.5 Further Exploration of Balanced Multi-Task Learning
We analyze SKG-KGC alongside weight-unadjusted methods from two perspectives: different datasets and different subtasks of the WN18RR dataset.
Experiment 1: Performance of Balanced Multi-task Learning on Different Datasets
As shown in Table 5, balanced multi-task learning works well on WN18RR, but shows only slight improvements on the larger FB15K-237 dataset. We attribute this difference to two main factors. First, the scheme is designed to address imbalanced loss weights among tasks, so it works best when task differences are significant. As shown in Figure 1a, the proportion of triples sharing the same (r, t) or (h, r) is 84.3% and 74.8% on the FB15K-237 dataset, respectively, which means less task disparity compared to the 24.8% on WN18RR. Second, our scheme dynamically updates the loss weights of all tasks after each iteration. Due to computational resource limitations, fewer iterations are performed on larger datasets, leading to less pronounced changes in task weights. Therefore, our proposed scheme performs better when applied to smaller datasets and more diverse tasks.
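The contrast between fixed and dynamically updated loss weights can be illustrated with a small Python sketch. The update rule shown here (weights proportional to each subtask's most recent loss, rescaled to sum to the number of subtasks) is a stand-in assumption chosen for simplicity; it is not the allocation scheme defined in the method section, and the function names and loss values are hypothetical.

```python
def update_loss_weights(task_losses):
    """Return one weight per task, proportional to its latest loss.

    Weights are rescaled to sum to the number of tasks, so equal losses
    reproduce the unweighted baseline in which every weight is 1.
    """
    total = sum(task_losses.values())
    if total == 0:
        return {name: 1.0 for name in task_losses}
    k = len(task_losses)
    return {name: k * loss / total for name, loss in task_losses.items()}


def weighted_total_loss(task_losses, weights):
    """Combine per-task losses into the scalar that would be minimized."""
    return sum(weights[name] * loss for name, loss in task_losses.items())


# Hypothetical per-subtask losses after one iteration.
losses = {"head": 2.0, "relation": 0.5, "tail": 1.5}
weights = update_loss_weights(losses)
print(weights)  # {'head': 1.5, 'relation': 0.375, 'tail': 1.125}
print(weighted_total_loss(losses, weights))

# The "w/o balanced multi-task learning" ablation keeps every weight at 1.
baseline = {name: 1.0 for name in losses}
print(weighted_total_loss(losses, baseline))  # 4.0
```

Under this assumed rule, the harder head-prediction subtask receives the largest weight. When the per-task losses are close to each other, the weights stay near 1, which is consistent with the smaller gains observed when task disparity is low.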
Experiment 2: Performance of Balanced Multi-task Learning on Different Subtasks
Furthermore, Table 8 provides detailed results on the WN18RR dataset, including head entity and tail entity prediction outcomes.
Table 8: Performance of balanced multi-task learning on different subtasks of WN18RR.
Model | Subtask | MRR | Hit@1 | Hit@3 | Hit@10 |
---|---|---|---|---|---|
w/o weight | (?, r, t) | 66.3 | 60.9 | 68.4 | 76.8 |
w/o weight | (h, r, ?) | 75.3 | 69.6 | 78.8 | 86.1 |
w/o weight | Average | 70.8 | 65.3 | 73.6 | 81.5 |
SKG-KGC | (?, r, t) | 67.5 | 62.4 | 70.4 | 76.5 |
SKG-KGC | (h, r, ?) | 76.9 | 71.6 | 79.8 | 86.7 |
SKG-KGC | Average | 72.2 | 67.0 | 75.1 | 81.6 |
Notably, tail entity prediction consistently outperforms head entity prediction. We attribute this to the smaller average number of entities connected in one-to-many relations. Moreover, Figure 1b underscores the imbalanced distribution of head entities and tail entities. While our proposed SKG-KGC improves experimental performance through a designed loss weight allocation scheme in multi-task learning, the challenge of significant performance differences between head entities and tail entities persists.
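For reference, the per-subtask metrics and the “Average” rows can be computed from the rank of the ground-truth entity in each query. The sketch below uses the standard definitions of MRR and Hit@k; the rank lists are made-up toy values, and averaging the two subtask scores assumes that both subtasks have the same number of test queries.

```python
def mrr(ranks):
    """Mean reciprocal rank of the ground-truth entity (ranks start at 1)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of queries whose ground-truth entity is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Toy ranks (hypothetical) for head prediction (?, r, t) and tail prediction (h, r, ?).
head_ranks = [1, 4, 2, 10]
tail_ranks = [1, 1, 2, 5]

head_mrr, tail_mrr = mrr(head_ranks), mrr(tail_ranks)
average_mrr = (head_mrr + tail_mrr) / 2
print(head_mrr, tail_mrr, average_mrr)
print(hits_at_k(head_ranks, 1), hits_at_k(tail_ranks, 1))
```

Reporting the two subtasks separately, as in Table 8, exposes the gap between head and tail prediction that a single averaged score would hide.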
4.6 Parameter Analysis in Bi-Encoder Architecture
In this experiment, we use two encoders, BERT-base-uncased with 110M parameters and BERT-large-uncased with 340M parameters, and assess their performance on WN18RR. Each encoder is evaluated in two configurations: the single-encoder configuration uses one encoder for all elements, while the bi-encoder configuration uses two separate encoders for known and unknown elements.
As shown in Table 9, within a bi-encoder architecture, the Hit@10 metric improves with the substantial increase in parameter volume of BERT-large. However, the more critical MRR and Hit@1 metrics decline significantly by 1.4% and 3.2%, respectively, potentially due to the curse of dimensionality and overfitting. This observation indicates that an increase in parameter volume does not necessarily lead to an overall improvement in model performance, as supported by previous research (Tian et al., 2022).
Table 9: Performance comparison of encoders with varying parameter volumes on WN18RR.
Encoder | # Encoders | MRR | Hit@1 | Hit@3 | Hit@10 |
---|---|---|---|---|---|
BERT-base (110M) | one | 68.2 | 61.2 | 72.4 | 80.6 |
BERT-base (110M) | two | 72.2 | 67.0 | 75.1 | 81.6 |
BERT-large (340M) | one | 69.6 | 62.3 | 74.1 | 82.7 |
BERT-large (340M) | two | 70.8 | 63.8 | 75.2 | 83.2 |
Furthermore, when comparing the performance between single and bi-encoder configurations, it is evident that the bi-encoder consistently outperforms the single encoder configuration. We speculate this could be attributed to the bi-encoder’s explicit differentiation between the embeddings of known and unknown elements in the triplets, thereby avoiding potential confusion in the single encoder configuration. Hence, when selecting encoders and architectures for similar tasks, priority should be given to the selection of architecture rather than simply increasing the model’s parameter volume.
4.7 Efficiency Analysis
Since our model and MTL-KGC (Kim et al., 2020) both engage in multi-task learning in KGC, we employ the same task settings and encoders for comparative analysis. Table 10 reports the approximate time cost for training and inference.
Table 10: Comparison of time cost between our model and MTL-KGC. The terms “T/Ep”, “Train”, and “Inf” denote the training time per epoch, the training time until convergence, and inference time, respectively.
Model | WN18RR T/Ep | WN18RR Train | WN18RR Inf | FB15K-237 T/Ep | FB15K-237 Train | FB15K-237 Inf |
---|---|---|---|---|---|---|
MTL-KGC | 2.6h | 7.8h | 60h | 6.9h | 20.8h | 491h |
Our model | 5.5m | 2.5h | 2.8m | 16.7m | 2.7h | 10m |
Compared with MTL-KGC, SKG-KGC is substantially faster in both training and inference. This efficiency gain stems from our model’s use of independent candidate entity encoders for calculating entity rankings, similar to the approaches employed by StAR (Wang et al., 2021a) and SimKGC (Wang et al., 2022). While not pioneering fast inference, our proposed model achieves a trade-off between efficiency and effectiveness, with a focus on improving the latter. These measurements align with the theoretical analysis outlined in Section 3.6.
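The source of this speed-up can be illustrated with a toy Python sketch: when candidate entities are encoded independently of the query, their embeddings are computed once and reused, so ranking needs one encoder pass per query plus cheap dot products. The `CountingEncoder` class is a hypothetical character-count stand-in for a real text encoder, and the query strings are invented for illustration.

```python
import math


class CountingEncoder:
    """Toy stand-in for a text encoder that counts how often it is called."""

    def __init__(self, dim=16):
        self.dim = dim
        self.calls = 0

    def __call__(self, text):
        self.calls += 1
        vec = [0.0] * self.dim
        for ch in text.lower():
            vec[ord(ch) % self.dim] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]  # unit-length vector


def rank_candidates(query_vec, entity_vecs):
    """Sort entity names by dot-product score with the query, best first."""
    scores = {
        name: sum(q * e for q, e in zip(query_vec, vec))
        for name, vec in entity_vecs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


query_encoder = CountingEncoder()
entity_encoder = CountingEncoder()

entities = ["Rome", "Zurich", "City of London"]
queries = [
    "Cleopatra featured film locations",
    "Curtis Institute of Music contained by",
]

# Candidate embeddings are computed once and reused for every query.
entity_vecs = {name: entity_encoder(name) for name in entities}
rankings = [rank_candidates(query_encoder(q), entity_vecs) for q in queries]

print(entity_encoder.calls)  # 3
print(query_encoder.calls)   # 2
```

Here the total number of encoder passes is |E| + |Q| = 5, whereas a model that must encode every (query, candidate) pair jointly would need |E| × |Q| = 6 passes. The gap widens rapidly as the number of candidate entities grows.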
4.8 Case Study
To conduct a qualitative analysis of the multi-level shared knowledge, we show the top two entities as ranked by SKG-KGC, SKG-KGC without shared knowledge, and the most competitive baseline SimKGC in Table 11.
Table 11: Case study on the FB15K-237 dataset. [*] indicates ground-truth entity. Text in brackets represents the textual description for entities.
Case 1: Input (head and relation):
h: Cleopatra [is a 1963 British-American-Swiss epic drama film...]
r: /film/film/featured_film_locations
Prediction (SKG-KGC):
P1*: Rome [Located in the foothills of the Appalachian Mountains...]
P2*: City of London [is a city within London...]
Prediction (SKG-KGC w/o Shared Knowledge):
P1: Zurich [is the largest city in Switzerland...]
P2*: Rome [Located in the foothills of the Appalachian Mountains...]
Prediction (SimKGC):
P1*: Rome [Located in the foothills of the Appalachian Mountains...]
P2: Zurich [is the largest city in Switzerland...]
Case 2: Input (relation and tail):
r: /location/location/contains
t: Curtis Institute of Music [is a conservatory in Philadelphia...]
Prediction (SKG-KGC):
P1*: United States of America [commonly referred to as the United States...]
P2: Pittsburgh, PA Metropolitan Statistical Area [is the largest population center...]
Prediction (SKG-KGC w/o Shared Knowledge):
P1: Pittsburgh, PA Metropolitan Statistical Area [is the largest population center...]
P2*: United States of America [commonly referred to as the United States...]
Prediction (SimKGC):
P1: Allentown [is a city located in Lehigh County...]
P2*: United States of America [commonly referred to as the United States...]
Case 3: Input (head and relation):
h: Flo Rida [Tramar Lacel Dillard, better known by...]
r: /people/person/profession
Wrong Prediction (All models):
P1: Record producer-GB [is an individual working within the music industry...]
P2: Music executive-GB [is a person within a record label...]
In the first case, SKG-KGC correctly predicts the entity “Rome” and also unexpectedly predicts “City of London”, possibly due to the influence of “London” in the shared tail entity sets. In the second case, SKG-KGC correctly identifies “United States of America” by utilizing shared knowledge from candidate entity sets, while other models fail due to an overemphasis on textual similarity between “Pittsburgh” / “Allentown” and “Philadelphia”. However, in the third case, the training set only reveals that Flo Rida’s profession is that of an actor and songwriter, and the correct tail entity should be Artist-GB. All three models predict incorrectly due to the presence of the word “song” in Flo Rida’s description. Thus, we suspect that text-based methods may excessively focus on certain text descriptions of the entities themselves and overlook structural information in the knowledge graph.
These results highlight that SKG-KGC can mitigate the over-reliance on semantic similarity as compared to previous methods, and effectively improve the ability to identify correct entities from similar candidate entities. Furthermore, these insights prove valuable for considering both textual and structural information in KGC.
5 Conclusion
In this paper, we introduce a multi-level shared knowledge guided method for efficient knowledge graph completion. Our approach effectively addresses the challenges of inadequate knowledge learning and imbalanced subtasks in multi-task learning. Through extensive experiments on benchmark datasets, we demonstrate that SKG-KGC consistently outperforms competitive baseline models, particularly excelling on WN18RR with its extensive entity descriptions. These findings provide new insights for multi-task learning and other tasks related to knowledge graphs. In future research, we aim to explore the integration of text-based methods with graph embeddings to extract both the semantic and structural information in knowledge graphs.
References
Author notes
Action Editor: Sebastia Pado