Visual Entity Linking via Multi-modal Learning

Abstract: Existing visual scene understanding methods mainly focus on identifying coarse-grained concepts about visual objects and their relationships, largely neglecting fine-grained scene understanding. In fact, many data-driven applications on the Web (e.g., news reading and e-shopping) require accurately recognizing much finer-grained concepts, i.e., entities, and properly linking them to a knowledge graph (KG), which can take their performance to the next level. In light of this, this paper identifies a new research task: visual entity linking for fine-grained scene understanding. To accomplish the task, we first extract features of candidate entities from different modalities, i.e., visual features, textual features, and KG features. Then, we design a learning-to-rank method based on a deep modal-attention neural network, which aggregates all the features and maps visual objects to entities in the KG. Extensive experiments on a newly constructed dataset show that our proposed method is effective, significantly improving accuracy from 66.46% to 83.16% compared with the baselines.


INTRODUCTION
Visual scene understanding is widely regarded as one of the core functions of next-generation machine intelligence, and it has been evolving to meet increasing demands, not limited to detecting objects in images and video clips but, more often than not, grasping the story behind the pixels.

Contributions: The contributions of this paper are summarized as follows:

• We are the first to consider visual entity linking over scene graphs, and we constructed a new large-scale dataset for this challenging task.
• We proposed a novel framework that learns features from three different modalities, and we designed a learning-to-rank visual entity linking model that aggregates the different features with a deep modal-attention neural network.
• We conducted extensive experiments to evaluate our visual entity linking model against state-of-the-art methods. Results on the constructed dataset show that our method is effective, significantly improving accuracy from 66.46% to 83.16% compared with the baselines.

RELATED WORK
This section discusses the existing related research from the following aspects: entity linking, visual scene understanding, and multi-modal learning.

Entity Linking
Entity linking is the task of mapping the mentions in the text to the corresponding entities in KG. Conventional models are usually distinguished by the supervision they require, i.e., supervised or unsupervised methods. The supervised methods [2,3,4,5] utilized the annotated data to train binary classifiers or ranking models to realize entity disambiguation. The unsupervised methods [6,7,8,9,10] generally used some similarity measures between the mentions in the text and the entities in KG.
In contrast to traditional research in the textual domain, the visual entity linking task differs in two ways: (1) visual representations are far more complicated than the forms that appear in textual content, and (2) different modalities have different characteristics.

Visual Scene Understanding
Visual scene understanding includes many kinds of work, such as traditional computer vision tasks like image recognition [11,12] and higher-level scene reasoning tasks like scene graph generation. In recent years, considerable progress has been made on many sub-issues of the overall visual scene understanding problem. Since the early work [1,13,14] generated the visually-grounded graph over the objects with their relationships in an image, many models have been proposed to improve the performance, such as adding prior probability distribution [13,15,16] and introducing the message passing mechanism [17,18,19].

Multi-modal Learning
Multi-modal learning [20,21,22] focuses on learning with contextual information from multiple modalities in a joint model. The tasks most relevant to our work are cross-modal entity disambiguation and entity-aware captioning. Ref. [23] built a deep zero-shot multi-modal network for disambiguating social media posts, but it is still limited to linking the textual entities in the posts to the knowledge base. Refs. [24,25,26,27,28] proposed multi-modal entity-aware models for image caption generation, which also differs from our visual entity linking task.
Recently, Li et al. [29] presented a multimedia knowledge extraction system that enables graph-query search and retrieves multimedia evidence. However, for visual entity linking, it adopts redundant rules for different entity types, increasing the complexity of the model. Entity linking is an important branch of natural language processing, but existing models cannot solve the multi-modal learning problem in entity linking. This paper proposes a joint model to address this problem.

Problem Formulation
In the following section, we describe our multi-modal learning model for visual entity linking in detail. As illustrated in Figure 2, the proposed model includes a feature extraction module and a visual entity linking module. Let v_i^j denote the j-th bounding box of the i-th image; each input sample x_i has corresponding textual information t_i of length L_t, and y_i^j denotes the most probable entity in the multi-modal knowledge graph KG linked to v_i^j. We aim to learn a transformation f(·) that satisfies

f = argmax_f sim(f(x_v, x_t), y),

where f(·) is a transformation that projects the input samples x_v and x_t into the same space as y, and sim(·) generates a similarity score between the prediction and the ground truth.
For the testing process, we are given a test sample s_p, i.e., an image with its corresponding caption, s_p = (v_p, t_p). The bounding boxes of the test image are generated by the scene graph and represented by v_p^j.

Feature Extraction Module
The feature extraction module extracts features from three modalities as follows.

Visual features: To link the image bounding boxes x_v to the given KG, we first used the best-performing method from [30] to generate the bounding boxes in the scene graph. Then, we used the VGG-16 network [31] to extract visual features of the image bounding boxes. The final-layer representation e(x_v) of the VGG-16 network is transformed into a low-dimensional vector that describes the features of an image bounding box.
Textual features: To extract textual features from the image caption x_t, we encoded the caption with a GRU language model [32] over distributed word embeddings. The GRU is implemented as follows:

z_t = σ(W_z x_t + U_z h_{t−1}),
r_t = σ(W_r x_t + U_r h_{t−1}),
h̃_t = tanh(W_h x_t + U_h (r_t ⊙ h_{t−1})),
h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t,

where h_t is the hidden state at decoding step t, r_t controls the influence of the previous hidden state h_{t−1} on the current word x_t, z_t decides whether to ignore the current word x_t, and e(x_t) is the textual vector representation output by the GRU at the last decoding step t = T.
Because pre-trained models can effectively represent the semantic distribution of words in a sentence, we used pre-trained embeddings from the GloVe model [33] in the GRU sentence encoder.
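As an illustration, the GRU update rules described above can be sketched in NumPy; the weight shapes and random initialization here are our own assumptions for demonstration, not details taken from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell following the standard update/reset-gate equations."""
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        s = 0.1  # small random initialization (illustrative)
        self.Wz = rng.normal(0, s, (hidden_dim, input_dim))
        self.Uz = rng.normal(0, s, (hidden_dim, hidden_dim))
        self.Wr = rng.normal(0, s, (hidden_dim, input_dim))
        self.Ur = rng.normal(0, s, (hidden_dim, hidden_dim))
        self.Wh = rng.normal(0, s, (hidden_dim, input_dim))
        self.Uh = rng.normal(0, s, (hidden_dim, hidden_dim))
        self.hidden_dim = hidden_dim

    def step(self, x_t, h_prev):
        z = sigmoid(self.Wz @ x_t + self.Uz @ h_prev)   # update gate z_t
        r = sigmoid(self.Wr @ x_t + self.Ur @ h_prev)   # reset gate r_t
        h_tilde = np.tanh(self.Wh @ x_t + self.Uh @ (r * h_prev))
        return (1 - z) * h_prev + z * h_tilde           # h_t

def encode(cell, embeddings):
    """Run the GRU over a word-embedding sequence; the last hidden state is e(x_t)."""
    h = np.zeros(cell.hidden_dim)
    for x_t in embeddings:
        h = cell.step(x_t, h)
    return h
```

In practice a framework implementation (e.g., a library GRU layer with GloVe-initialized embeddings) would replace this loop; the sketch only makes the gate equations concrete.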
Similar to traditional entity linking models, we also need to obtain the list of named entities in the image caption x_t. We implemented a Named Entity Recognizer (NER) based on BERT [34], Bi-LSTM, and Conditional Random Fields (CRF), as shown in Figure 3. The Bi-LSTM extracts higher-level structural information, and the CRF serves as a sequence classifier. We fine-tuned the model on our dataset to achieve the best recognition results. Once the NER established the named entity list, we sent it to a SPARQL query engine to obtain the candidate entities in the KG.
KG features: To generate the linked candidate entities, we proposed a matching algorithm based on rules. Then, we obtained each candidate entity's image embedding e(y v ) and structural text information embedding e(y t ) as its KG features.

Because the entity mentions in the caption x_t can be inaccurate or incomplete (e.g., abbreviations and nicknames of person names), directly sending the entity mentions occurring in captions to the KG SPARQL query engine may not establish a complete list of candidate entities. To this end, we implemented a rule-based candidate entity list generator using a partial matching strategy with the following four rules:
• The entity name has several words in common with the entity mention;
• The entity name is wholly contained in, or contains, the entity mention;
• The entity name exactly pairs the first letters of all words in the entity mention; and
• The entity name has a string similarity above 80% with the entity mention under Levenshtein distance.
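The four rules can be sketched as a small candidate filter; the function names and the plain edit-distance implementation below are illustrative, not taken from the paper:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def string_similarity(a: str, b: str) -> float:
    """Normalized Levenshtein similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def is_candidate(entity_name: str, mention: str) -> bool:
    """Apply the four partial-matching rules to one (entity, mention) pair."""
    e, m = entity_name.lower(), mention.lower()
    # Rule 1: the entity name shares words with the mention.
    if set(e.split()) & set(m.split()):
        return True
    # Rule 2: containment in either direction.
    if e in m or m in e:
        return True
    # Rule 3: the entity name is the acronym of the mention.
    if e == "".join(w[0] for w in m.split()):
        return True
    # Rule 4: Levenshtein similarity above the 80% threshold.
    return string_similarity(e, m) >= 0.8
```

For example, `is_candidate("NBA", "National Basketball Association")` fires via the acronym rule, while `is_candidate("Steve Jobs", "Jobs")` fires via shared words.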
For the KG visual embedding e(y_v) of y, we used the same VGG-16 network to extract dense visual features. For the KG structural text embedding e(y_t) of y, we used DBpedia [35] as our multi-modal knowledge graph and obtained the embedding with the ComplEx model proposed by [36], which is the state-of-the-art model. The KG structural text embeddings are learned with the following score function, which measures a fact <h, r, t> in KG:

s(h, r, t) = Re(<r, h, t̄>),

where t̄ is the conjugate of t and Re(·) denotes taking the real part of a complex value. h and t are entities in KG and are possible matching entities for y.
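A minimal sketch of this ComplEx-style score, assuming complex-valued embedding vectors for the head h, relation r, and tail t:

```python
import numpy as np

def complex_score(h: np.ndarray, r: np.ndarray, t: np.ndarray) -> float:
    """ComplEx triple score: Re(<r, h, conj(t)>) over complex embeddings."""
    return float(np.real(np.sum(r * h * np.conj(t))))
```

Because of the conjugation on t, the score is generally asymmetric in h and t, which lets the model represent non-symmetric relations.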

Visual Entity Linking Module
In this module, we aggregated all three modality features to predict the best-matched KG entity y_i^j for each image bounding box v_i^j.
We proposed a supervised visual entity linking module using visual confidence and textual confidence.
The loss function consists of two parts:

L = l_t · L_T(·) + l_v · L_V(·),

where L_T(·) is the supervised max-margin ranking loss for KG entity prediction on the textual features, L_V(·) is the max-margin ranking loss on the visual features, and l_t and l_v are hyper-parameters that tune the function. conf_v is the confidence score in the visual modality, and conf_t is the confidence score in the textual modality. Through our max-margin ranking loss function, the confidence of the correct linking entity conf(y) should exceed that of any other candidate entity conf(y′) by the margin c:

L(·) = Σ_{y′} [c + conf(y′) − conf(y)]_+,

where [x]_+ denotes the positive part of x.
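The max-margin ranking loss over one gold entity and its negative candidates can be sketched as below; the confidence values and margin c are illustrative:

```python
def margin_rank_loss(conf_true: float, conf_cands, c: float = 0.1) -> float:
    """Hinge ranking loss: sum over negatives of [c + conf(y') - conf(y)]_+."""
    return sum(max(0.0, c + conf_neg - conf_true) for conf_neg in conf_cands)
```

The loss is zero exactly when the correct entity beats every candidate by at least the margin c, so gradient updates only push on candidates that violate the margin.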
m is the number of samples, and s is the serial number of the input sample. f(·) is a function that projects textual embeddings into the KG structural embedding space, and conf_t(y_i) is the confidence score between the caption textual embedding e(x_t) and the KG structural text embedding of the i-th candidate entity.

To learn the different weights of the modalities, we formulated the modal-attention module as follows, which selectively weakens or magnifies the different modalities:

R = softmax(W_att [e(x_v); e(x_t)]),  x = R_v · e(x_v) + R_t · e(x_t),

where R is an attention vector and x is the final context vector that reasonably focuses on the different modalities.
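A minimal sketch of a softmax-based modal-attention step, assuming equal-dimensional visual and textual vectors and an assumed attention-weight shape (the paper does not spell out these details):

```python
import numpy as np

def modal_attention(x_v: np.ndarray, x_t: np.ndarray, W: np.ndarray):
    """Weight the two modality vectors with a softmax attention vector R.

    x_v, x_t: modality embeddings of the same dimension d (assumption).
    W: attention weights of shape (2, 2d), one score per modality (assumption).
    """
    concat = np.concatenate([x_v, x_t])
    scores = W @ concat                               # one raw score per modality
    R = np.exp(scores - scores.max())
    R /= R.sum()                                      # softmax attention vector
    x = R[0] * x_v + R[1] * x_t                       # attended context vector
    return R, x
```

A low R entry attenuates its modality (lighter color in the visualization) while a high entry amplifies it, matching the behavior described for the modality attention module.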
At test time, the following nearest-neighbor (1-NN) classifier is used for entity prediction, where l_1 and l_2 are hyper-parameters:

ŷ = argmax_{y′} ( l_1 · conf_v(y′) + l_2 · conf_t(y′) ).
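This prediction rule amounts to a weighted argmax over candidate confidences, as in the following sketch (candidate names are hypothetical):

```python
def predict_entity(candidates, conf_v, conf_t, l1=0.5, l2=0.5):
    """Pick the candidate maximizing l1*conf_v(y) + l2*conf_t(y)."""
    return max(candidates, key=lambda y: l1 * conf_v[y] + l2 * conf_t[y])
```

Setting l1 or l2 to zero removes a modality from the decision, which is exactly the mechanism the ablation study uses.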

EXPERIMENTS
We constructed a new dataset for the task of visual entity linking and compared our model with state-of-the-art methods on the dataset.

Experiments Setting
Datasets. For the visual entity linking task, we need to link image bounding boxes to specific entities in KG, which goes beyond identifying the category of the objects. However, most existing computer vision datasets contain no named entities in the images or captions. Therefore, we built a new dataset, the Visual Entity Linking Dataset (VELD), composed of 39,000 news image and caption pairs with links to KG entities, manually labeled by expert human annotators (entity types: PER, LOC, ORG). We randomly split the 39,000 pairs into 31,000 for training, 4,000 for validation, and 4,000 for testing.
A caption that mentions no named entity is useless for the visual entity linking task. Therefore, we filtered VELD to remove news items whose captions contain no named entities, ensuring that every caption mentions at least one named entity. Key aspects of the dataset are summarized in Table 1. Similarly to BreakingNews [37], VELD exhibits longer average caption lengths than image-caption datasets such as MSCOCO [38], indicating that news captions tend to be more descriptive.

Evaluation metrics. The primary metric of our evaluation is the accuracy of linking visual objects to KG entities, defined as in Equation (17):

accuracy = (number of correctly linked entity mentions) / (number of all links generated by our method).

Implementation details: We initialized the NER stream of our model with a BERT language model pre-trained on English Wikipedia. Specifically, we used the BERT-BASE model, which has 12 layers of transformer blocks, each with a hidden state size of 768 and 12 attention heads. We trained on four 2080Ti GPUs with a total batch size of 256 for 20 epochs, using the Adam optimizer with an initial learning rate of 0.001 and a decay learning rate schedule with warm-up.
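The accuracy metric can be computed as in this small sketch (the mention and entity identifiers are hypothetical):

```python
def linking_accuracy(predictions: dict, gold: dict) -> float:
    """Accuracy = correctly linked mentions / all links produced by the method."""
    if not predictions:
        return 0.0
    correct = sum(1 for mention, entity in predictions.items()
                  if gold.get(mention) == entity)
    return correct / len(predictions)
```

Note the denominator counts every link the method generates, so spurious links lower the score just as wrong links do.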

Baselines
Because our proposed task is relatively novel, the related models available for our comparison are especially limited. We report the performance of the following state-of-the-art entity linking and visual objects recognition methods as baselines, as well as several configurations of our proposed method to examine contributions of each component (T: textual, V: visual, and KG: knowledge graph).

• CoAtt [39] (T and KG) uses a type-aware co-attention model for entity disambiguation.
• Falcon [40] (T and KG) performs joint entity linking of a short text by leveraging several fundamental principles of English morphology.
• CDTE [41] (T and KG) proposes a neural, modular entity linking method that uses multiple sources of information.
• GENRE [42] (T and KG) system realizes entity retrieval by generating entity names. GENRE generates entity names in a token-by-token auto-regressive manner from left to right, and the generated results are affected by context.

Results
Comparison patterns: Because visual entity linking is a novel task, we built our comparison methods on existing models. Below, we explain the experimental settings and how we achieve relative fairness under the unbalanced experimental conditions, namely the V + KG, V + T, and T + KG modality settings.
Intuitively, in the V + KG setting, the visual training data of a recognizer network plays the same role as the KG: the massive training images and the image resources in the KG modality are equivalent and have the same effect. In the first two experiments of Table 2, we therefore used the V modality alone to stand in for the V + KG modalities.

The V + T setting cannot output the result defined in the task because it lacks the KG entity links. Without the KG modality there are no target entities, so the entity linking task cannot proceed and this setting cannot serve as a comparative experiment. Therefore, due to the lack of target entities, we did not include a V + T configuration in our comparative analysis.
Short-text entity linking based on the KG (T + KG modalities) cannot link KG entities to the corresponding image bounding boxes. For comparison, we first used the scene graph method to generate the bounding boxes and then randomly connected an entity from the candidate list to each bounding box. To keep the comparison fair, we multiplied each accuracy rate by the number of candidate entities per bounding box, which cancels the error introduced by the random connection of candidate entities in the T + KG experiments.
Main results: Table 2 shows the Top-1, 3, 5, and 10 candidate entity list retrieval accuracy results on the VELD dataset. The first two experiments use the information of the visual modality and the knowledge graph modality. The results show that existing deep neural networks based on static off-line training cannot complete the task of visual entity linking well. Because of the limitation of the training dataset, it is difficult to build a dataset containing image resources for all open-domain entities, which demonstrates the validity of our model from another angle.
The third to fifth experiments perform visual entity linking based on the features of the textual and knowledge graph modalities; through a series of post-processing steps, the linking of the target bounding box is not affected by the visual features. The experimental results show that there is still a large gap between these textual-modality baselines and our full model.
Compared with the simple visual object recognition methods and the textual entity linking methods that use text and KG as support, our proposed method significantly outperforms these baselines. The reason is that we jointly fused three kinds of features from different modalities rather than performing simple single-modality linking. Another convincing point is that the similar multi-modal learning model DZMNED achieved only 66.46% Top-1 accuracy on the VELD dataset, while our model reached 83.16%, showing that our model has a great advantage in the task of visual entity linking. Figure 4 visualizes the modality attention module of our model: we list each entity (each column) of some test samples, in which an amplified modality is represented by a darker color and an attenuated modality by a lighter color. The results show that the more relevant modalities for visual entity linking are emphasized by the modality attention module. Specifically, we used the alignments between different modalities from the test set of the VELD dataset.

Visualization of modality attention:
For our full multi-modal visual entity linking model (V + T + KG modalities), the experimental results confirm that the modality attention module successfully enhances relevant modal information (e.g., in linking visually similar celebrities) and amplifies relevant modality-based contexts in prediction.
In the first-row example of Figure 4, "Jobs, Apple's founder, attended the launch of the new iPhone", we first generated the candidate entity lists for "Jobs", "Apple", and "iPhone". When linking the "Jobs" entity, the shading of the modality weights shows that the influence of the visual modality is higher than that of the textual modality. For the other two entities, "Apple" and "iPhone", the influence of the visual modality is much lower than that of the textual modality. Because "Apple" and "iPhone" have few candidate entities, the textual modality alone can easily find the knowledge graph entity corresponding to the contextual semantics; for "Jobs", however, there are many related entities, so the visual-modality feature vector is needed for the entity linking task. This is why different entity categories have different modality weights.
In the second example, "Curry won the NBA Championship for the Golden State Warriors at Auckland Stadium.", the candidate entity list is composed of "Curry", "Golden State Warriors" and "NBA". For "Curry" entity, we found that the modality attention module mainly concentrates on visual modality information. For "Golden State Warriors" entity, the proportion of visual modality information and textual modality information is roughly equal, while for "NBA" entity, it mainly depends on textual modality signals.
In the third case, for the person category like "Francis", the modality attention successfully focuses on the visual modality, and attenuates distracting signals, and for "Oscar", visual modality information and textual modality information have the same status.
Ablation study: To evaluate the effectiveness of our different modules, we considered several ablation experiments, shown in Table 3, validating the effect of the features in the three modalities: visual, textual, and KG. Because our result is a joint expression of multiple modalities (the visual entity linking starts from an image bounding box, the caption description generates the candidate entity list, and the KG entities provide the links), the input data is unchanged; we only set the corresponding parameter (l_1 or l_2) to zero in the confidence calculation to eliminate a particular modality feature. For the reduced knowledge graph setting, we chose KG embeddings learned from the 1M KG subset. The results in Table 3 show that the features of each modality contribute a certain amount to the performance of visual entity linking, and the lack of any modality feature significantly reduces accuracy: removing visual features reduced Top-1 accuracy to 56%, removing textual features reduced it to 73%, and reducing the scale of the KG reduced it to nearly 60%. These results suggest that jointly utilizing multi-modal features obtains the best linking results.
Error analysis: In the example "Robert Downey Jr. plays Iron Man in the movie", our model links the image bounding box to the actor Robert Downey Jr., while the ground truth links it to Iron Man. This means that our model sometimes outputs erroneous results when one person corresponds to multiple roles in KG; in such instances, additional rules and constraints are needed to obtain a better result. Deviations also occur when there is occlusion or concealment in the image. For example, in the second case, we can easily link Suárez in KG to his image bounding box, but it is much more difficult to link Messi because of a visual occlusion in the image. Such errors can be attributed to the incompleteness of the image information and, from another angle, indicate the importance of image features for visual entity linking.

CONCLUSION AND FUTURE WORK
In this paper, we introduced a new task called visual entity linking, which links KG entities to the corresponding image bounding boxes, and we addressed the problem with a novel framework. The proposed framework first extracts features from three modalities (visual, textual, and KG). Then, a deep modal-attention neural network is employed to link the entities to the corresponding image bounding boxes. We constructed a new dataset, VELD, for visual entity linking experiments. The experimental results show that our model achieved state-of-the-art results. Moreover, through extensive ablation experiments, we demonstrated the efficacy of our method.
In the future, a possible improvement direction is to utilize the structural embedding features of the scene graph to improve the performance of visual entity linking. In addition, we hope our model becomes a generic framework for the visual entity linking task, but constructing an ideal complete KG including all the entities in the world is impossible. Therefore, the requirements and effects of visual entity linking need to be determined according to specific applications.