Auto Insurance Fraud Detection with Multimodal Learning

ABSTRACT In recent years, feature engineering-based machine learning models have made significant progress in auto insurance fraud detection. However, most models and systems focus only on structural data and do not use multi-modal data to improve fraud detection. To address this problem, we adapt both natural language processing and computer vision techniques to our knowledge-based algorithm and construct an Auto Insurance Multi-modal Learning (AIML) framework. We then apply AIML to detect fraud behavior in auto insurance cases with data from real scenarios, and conduct experiments to measure the improvement in model performance with multi-modal data compared to a baseline model that uses structural data only. A self-designed Semi-Auto Feature Engineer (SAFE) algorithm for processing auto insurance data and a visual data processing framework are embedded within AIML. Results show that AIML substantially improves performance in detecting fraud behavior compared to models that use only structural data.


INTRODUCTION
According to the insurance industry development report issued by the China Insurance Regulatory Commission (CIRC) in April 2021, by the end of 2020 there were 235 insurance companies in total with total assets of 23 trillion RMB, and income from insurance premiums was 4.53 trillion RMB, making China the second largest insurance market in the world. Conservatively speaking, China's auto insurance fraud leakage accounts for at least 20% of the total compensation amount [1]. The estimate of China's auto
1. How to build AI models that can precisely predict high-risk cases?
2. How to use AI to make maximum use of the multi-modal data collected during the insurance business?
3. How to use AI to extract risk factors from different types of data, and will these factors help predict insurance fraud?
Results show that AIML can extract risk factors from multi-modal data efficiently and improve model performance in predicting auto insurance fraud behavior. Compared to a baseline machine learning model that uses only structural data, the ensemble model in AIML increases the AUC by 12.24% in predicting fraud behavior with multi-modal data. The rest of the paper is organized as follows. Section 2 outlines related work on auto insurance fraud detection and state-of-the-art methods for multi-modal data processing. Section 3 describes details of the experimental dataset and the design of our evaluation. Section 4 shows the results and model performances based on our design. Section 5 concludes and discusses possible future topics.

RELATED WORK
In this section, we summarize related work in two main areas: auto insurance fraud detection and multimodal data processing methods.

Auto Insurance Fraud Detection
Insurance fraud detection can be treated as a binary or multi-class classification problem. Many researchers have adapted machine learning models to auto insurance fraud detection and achieved solid results. Viaene et al. [3], Kašćelan et al. [4] and Li et al. [5] examined the performance of Bayesian modelling, clustering analysis, data mining and random forests in auto insurance fraud detection. David et al. [9] identified the characteristics of populations at high risk of fraud behavior by analyzing the age of the insurance holder. He et al. [6], Guo et al. [7] and Wang et al. [10] further explored the potential of deep learning models in fraud detection. Subudhi et al. [11] and Majhi et al. [12] built mixture models that detect auto insurance fraud effectively. Tuo et al. [13] and Liu et al. [14] first discussed and studied the game theory of insurance fraud in China. Gui et al. [15] reviewed and classified the literature on moral hazard in auto insurance. Zhao et al. [16], Tang et al. [17] and Wang et al. [18] applied traditional machine learning methods to model insurance fraud behavior based on Chinese auto insurance market data. Only recently did Yan et al. [19,20], Yu et al. [1] and Xu et al. [21] start to analyze the insurance fraud problem with deep learning models and mixture models, making progress in the field of auto insurance fraud detection.
Although different methods have been proposed to analyze the different types of data generated in the auto insurance business process, few multi-modal data-oriented models have been built for auto insurance fraud detection. More high-risk factors remain to be extracted from multi-modal data, e.g., images and texts, to detect fraud behavior.

Multi-modal Data Processing
Multi-modal data processing has been widely adopted in scenarios such as multimedia [22], disaster monitoring [23] and intelligence analysis [24]. A representative work is GAIA, proposed by Li et al. [22]. The GAIA system consists of a text knowledge extraction branch and a visual knowledge extraction branch, which enables seamless search with complex graph queries and retrieval of multimedia evidence including text, images and videos.
In the aspect of machine learning for multi-modal processing, Ngiam et al. [25] adopted the idea of shared representation learning to extend unsupervised learning with auto-encoders to multi-modal learning, aiming to map data from different modalities into a unified space. The core idea is to use denoising auto-encoders to represent each modality separately and then use another auto-encoder to fuse them into a multi-modal representation at the fusion layer of the neural network. Another method is coordinated representation learning, whose idea is to project each modality into independent but constrained spaces for representation. For example, Wang et al. [26] proposed a compact hash coding method for multi-modal expression. In their work, a deep learning model is designed to generate hash codes based on inter-modal and intra-modal correlation constraints, and the redundancy of the hash coding features is then reduced with an orthogonal regularization method. Peng et al. [27] proposed the concept of cross-media intelligence, which refers to the way the human brain works across different kinds of sensory information, such as sight, hearing and language, to perceive the outside world. It mainly studies the techniques and applications of multi-modal learning in cross-media reasoning and analysis, including fine-grained image classification, cross-media retrieval, text-to-image generation and video description generation. Wu et al. [28] proposed a neural network that combines visual and text information to recognize and disambiguate entities in short texts. Its core idea is to connect visual and text information through embeddings generated by representation learning and to introduce a co-attention mechanism for fine-grained information interaction. Experiments show that this method is superior to methods that rely only on text information.
In the aspect of knowledge engineering, a representative work is that of Mousselly et al. [29], who constructed a unified knowledge embedding based on visual features, text features and structural features of symbolic knowledge. Compared with traditional structure-based knowledge graph representation learning, their approach improved performance in link prediction and entity classification tasks. Xie et al. [30] later proposed an improved model, IKRL, whose core idea is to jointly model the visual features and structural features of a knowledge graph, so as to generate higher-quality multi-modal knowledge graph embeddings through connections between different modalities. Chen et al. [31] explored how to effectively map and model cross-modal semantic information jointly in the knowledge graph, laying an important foundation for intelligent application services for multi-modal content. Guo et al. [32] further explored the entity alignment task for multi-modal knowledge graphs, mainly extending the task from Euclidean space to hyperbolic space.
Since there are many relatively mature algorithms for each type of data, mining more information and factors from both text data and visual data in the auto insurance scenario is practical and promising.

FRAMEWORK
In this section, the multi-modal insurance fraud detection framework of AIML is explained in detail. Overall, our framework includes three modules, as shown in Figure 1.

Structural data are cleaned and processed by a feature engineering model to extract and generate risk factors for fraud behavior. Text and visual data are processed by systems embedded with Natural Language Processing (NLP) models and Computer Vision (CV) algorithms, respectively, to extract the corresponding risk factors. Finally, the combined factors are fed to a machine learning model to predict fraud behavior.

Structural Data and Baseline Model
The workflow of the baseline model in AIML is:
1. Data are collected from insurance companies by case and stage, including the case reporting stage, the investigating stage and the loss verification stage. All data are labeled and verified by experts and professionals from insurance companies.
2. Collected data are then cleaned and pre-processed: cases with more than 50% missing information are removed, and categorical variables are one-hot encoded.
3. New features are generated from the original features with feature engineering algorithms.
4. The new features are fed to a machine learning model to obtain predicted outcomes.
Feature engineering is an essential part of prediction problems in real-world scenarios. It is divided into feature classification and feature derivation: feature classification refers to classifying the original features based on their distributions, while feature derivation refers to synthesizing features from the classified features in order to obtain richer feature combinations. After comparing multiple popular machine learning and deep learning methods, AIML combines Semi-Auto Feature Engineering (SAFE), a self-designed, semi-automatic method for feature engineering, with an eXtreme Gradient Boosting tree (XGB) [33] to predict whether a case is fraudulent.
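The cleaning rules in step 2 of the workflow can be sketched as follows. This is a minimal illustration and not the SAFE implementation: the function names `drop_sparse_cases` and `one_hot_encode`, the field names, and the use of `None` to mark missing values are all assumptions made for the example.

```python
from typing import Any

Row = dict[str, Any]


def drop_sparse_cases(rows: list[Row], max_missing: float = 0.5) -> list[Row]:
    """Keep only cases whose share of missing (None) fields is at most max_missing."""
    kept = []
    for row in rows:
        missing = sum(value is None for value in row.values())
        if row and missing / len(row) <= max_missing:
            kept.append(row)
    return kept


def one_hot_encode(rows: list[Row], column: str) -> list[Row]:
    """Replace a categorical column with one 0/1 indicator per observed category."""
    categories = sorted({row[column] for row in rows if row[column] is not None})
    encoded = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = int(row[column] == category)
        encoded.append(new_row)
    return encoded


# Hypothetical cases; A2 has 3 of 4 fields missing and is removed.
cases = [
    {"case_id": "A1", "road_type": "highway", "claim_amount": 5200.0, "driver_age": 34},
    {"case_id": "A2", "road_type": None, "claim_amount": None, "driver_age": None},
    {"case_id": "A3", "road_type": "urban", "claim_amount": 1800.0, "driver_age": None},
]

cleaned = one_hot_encode(drop_sparse_cases(cases), "road_type")
```

The 50% threshold is exposed as a parameter so that the same helper could be reused with a different cut-off.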

Unstructured Text Data Processing
Extracting risk factors from auto insurance case description texts is treated as a set of NLP text mining tasks. There are six text mining tasks in AIML in total: recognizing driving status, type of accident, type of road, cause of accident, number of cars and parties involved in the accident. Table 1 illustrates the key information we extracted from the unstructured text data:

Single car accident; Negligence; Driving; Crashed; Highway; Car/Object

AIML uses a multi-task classification framework to mine risk factors, wherein a common backbone representation learning model is shared by the six text mining tasks. The advantage of multi-task learning is that it can reduce computational complexity and training cost while taking into account the different levels of correlation between tasks. Specifically, the feature extraction layer is fully shared and is based on the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model, combined with fine-tuning by linear fusion of the multi-task losses and the Conditional Random Fields (CRF) method to achieve multi-task learning. The multi-task model is shown in Figure 2; it includes an input layer, an encoding layer, a fully connected layer (FC layer), an activation layer, a CRF layer and an output layer.

First, AIML treats the description of the accident as input and chooses BERT (Chinese version) as the pre-trained encoder. BERT can dynamically represent the meaning and relevance of characters in context by using a powerful multi-directional self-attention mechanism combined with self-supervised learning, so as to construct, after weighted combination, a vector that represents the semantic features of the whole sentence. Additionally, the pre-trained BERT model uses massive data, including Wikipedia and other knowledge sources, as its training corpus, which supports its applicability to insurance text. Even without fine-tuning of the model parameters, BERT can still achieve reasonably good classification accuracy when a fully connected layer is added to its output layer.
Second, AIML uses multiple classifiers to extract multi-event factors, taking the sequence vector output by BERT as their input. A cost function is defined for each classification task, and each task is considered independent of the others. The parameters of the newly added FC layer and the BERT sequence output layer are tuned by the multi-task loss linear fusion method.
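One common reading of "multi-task loss linear fusion" is a weighted sum of the per-task losses. The sketch below illustrates that reading only: the choice of softmax cross-entropy as the per-task cost, the task names, and the equal weights are assumptions, since the paper does not specify them.

```python
import math


def softmax_cross_entropy(logits: list[float], target: int) -> float:
    """Cross-entropy for one example: -log softmax(logits)[target], computed stably."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def fused_loss(task_logits: dict[str, list[float]],
               task_targets: dict[str, int],
               task_weights: dict[str, float]) -> float:
    """Linear fusion: weighted sum of the independent per-task losses."""
    return sum(
        task_weights[task] * softmax_cross_entropy(logits, task_targets[task])
        for task, logits in task_logits.items()
    )


# Hypothetical classifier outputs for two of the tasks on one accident description.
logits = {"cause_of_accident": [2.0, 0.5, -1.0], "road_type": [0.0, 0.0]}
targets = {"cause_of_accident": 0, "road_type": 1}
weights = {"cause_of_accident": 1.0, "road_type": 1.0}  # assumed equal weights

total = fused_loss(logits, targets, weights)
```

Because the fused loss is a single scalar, one backward pass would update both the shared encoder output layer and the task-specific FC layers.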
Finally, based on the correlation between the tasks, AIML uses a CRF to calculate the maximum joint probability over the multiple classification results. CRFs are most commonly used for sequence labeling in NLP, where the joint probability captures the co-occurrence relationship between text and labels in order to optimize the overall accuracy of the labeling. Here we use a similar mechanism, a CRF layer, to optimize the overall accuracy of multi-task prediction. The original CRF must satisfy two prerequisites: (1) the distribution is exponential in form, and (2) only adjacent elements are correlated. The input for the CRF is the output sequence vector of the multi-classification tasks. In the CRF formulation, P denotes the probability function, Z the normalization factor, h the mapping function between a single output and the global input, g the function for local correlation between output elements, y a single output element and X the global input.
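For reference, a standard linear-chain CRF written with these symbols takes the form below. It is offered as an assumption consistent with the symbol definitions and the two prerequisites, not as the authors' exact expression; the index $i$ and the number of output elements $n$ are notation introduced here.

```latex
P(y_1, \dots, y_n \mid X)
  = \frac{1}{Z(X)} \exp\left( \sum_{i=1}^{n} h(y_i; X) + \sum_{i=1}^{n-1} g(y_i, y_{i+1}) \right),
\qquad
Z(X) = \sum_{y'_1, \dots, y'_n} \exp\left( \sum_{i=1}^{n} h(y'_i; X) + \sum_{i=1}^{n-1} g(y'_i, y'_{i+1}) \right)
```

The exponential form reflects the first prerequisite, and the pairwise term $g(y_i, y_{i+1})$, which couples only neighbouring outputs, reflects the second.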

Visual Data and Processing
Based on the scenarios of auto insurance fraud detection, this paper mainly focuses on three techniques, namely, Object Detection, Optical Character Recognition (OCR) and Pedestrian Re-identification (ReID). We design a systematic approach for AIML, as shown in Figure 3, to process and extract risk factors from visual data, i.e., photos and pictures of car accidents.
Raw visual data are stored in folders named by case ID. The first step is to classify pictures into seven categories: accident scene, car components, invoices, driver license, driving license, photos of inspectors and cars, and others. A ResNet classification model is trained on 413 cases with 1,392 well-labeled pictures. AIML then applies the trained model to a much larger test set of 22,385 originally unlabeled pictures to obtain a rough classification. Following a manual fine classification, all pictures with correct categories are used to re-train the classification model.
Figure 3. Visual data processing.

The second step is to extract risk factors from each category of pictures. For pictures in the accident scene and car components categories, a YOLOv5 model is used to identify the damage condition of cars, and a ResNet model is used to extract scene information such as daytime or nighttime. For pictures that contain text, AIML uses OCR to recognize information from licenses and invoices. For pictures that contain both investigators and cars, AIML uses ReID to identify different investigators and check for anomalies, i.e., whether they appeared in previously detected fraud cases or appear in multiple cases.
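The two anomaly checks on re-identified people could be expressed as below. This is a sketch under stated assumptions: the ReID model is taken to output `(case_id, person_id)` pairs, and the function name `flag_investigators` and the `max_cases` threshold are hypothetical.

```python
from collections import defaultdict


def flag_investigators(sightings: list[tuple[str, str]],
                       fraud_case_ids: set[str],
                       max_cases: int = 1) -> dict[str, dict[str, bool]]:
    """Flag each re-identified person by the two anomaly rules.

    sightings: (case_id, person_id) pairs, where person_id is the identity
    assigned by a ReID model (assumed output format).
    """
    cases_by_person: dict[str, set[str]] = defaultdict(set)
    for case_id, person_id in sightings:
        cases_by_person[person_id].add(case_id)

    return {
        person_id: {
            "in_prior_fraud_case": bool(case_ids & fraud_case_ids),
            "in_multiple_cases": len(case_ids) > max_cases,
        }
        for person_id, case_ids in cases_by_person.items()
    }


# Hypothetical sightings: p1 appears in two cases, one of them a known fraud case.
sightings = [("C001", "p1"), ("C002", "p1"), ("C003", "p2"), ("C003", "p2")]
flags = flag_investigators(sightings, fraud_case_ids={"C002"})
```

Repeated sightings of the same person within one case are collapsed by the set, so only distinct cases count toward the threshold.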
In the last step, factors extracted from visual data will be merged with structural data by case ID to improve model performance.
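The merge step amounts to a left join of visual factors onto the structural rows. A minimal sketch follows, assuming each source is a list of dictionaries keyed by a `case_id` field; the column names are hypothetical, and cases without usable pictures receive `None` for every visual factor.

```python
def merge_by_case_id(structured: list[dict], visual: list[dict],
                     key: str = "case_id") -> list[dict]:
    """Left-join visual factors onto structured rows; unmatched cases get None."""
    visual_by_id = {row[key]: row for row in visual}
    visual_columns = sorted({c for row in visual for c in row if c != key})
    merged = []
    for row in structured:
        match = visual_by_id.get(row[key], {})
        merged.append({**row, **{c: match.get(c) for c in visual_columns}})
    return merged


structured = [
    {"case_id": "C001", "claim_amount": 5200.0},
    {"case_id": "C002", "claim_amount": 1800.0},
]
visual = [{"case_id": "C001", "night_scene": 1, "damage_front": 1}]

merged = merge_by_case_id(structured, visual)
```

Keeping every structural row, rather than only the matched ones, preserves the sample count of the baseline model so that results stay comparable.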

RESULTS
In this section, we report our experimental results on a real-world auto insurance dataset. The results of the baseline model are presented first. Then, risk factors extracted from text data and visual data are added to show the effectiveness of multi-modal learning in improving fraud detection capability.

Dataset
Experimental data are collected and resampled from 4,946 auto insurance cases dated from November 10, 2014, to October 26, 2020, of which 3,613 are non-fraud cases and 1,333 are confirmed fraud cases. Data are organized on a per-car basis, i.e., cases involving multiple cars are treated as multiple data samples, indicated by a compound Case Unique ID (CaseUID, including both case ID and car plate). Therefore, the number of samples in the entire dataset is larger than the number of cases: it contains 5,034 non-fraud Case Unique IDs and 1,413 fraud Case Unique IDs. There are 216 data fields in total, containing information collected from the case reporting stage, the investigating stage and the loss verification stage. Variables with over 70% missing information are excluded, as are variables that are not suitable for an XGB model (e.g., ID-type variables and names). Only structural data, i.e., mainly categorical and numerical data, are used in the baseline model.

Results of Baseline Models
After the original variables are pre-processed by SAFE, our self-designed feature selection and feature interaction tool, there are 1,155 features in total, generated from the original 216 variables by combining, one-hot encoding, interacting, adding and subtracting them according to their types. One special Boolean feature named 'Compensation Type_Normal Case' is excluded because it is a strong giveaway in predicting fraud cases. To evaluate model performance comprehensively, four criteria are used: precision, recall, F1-score and Area Under the Curve (AUC).
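The "added and subtracted" derivation for numeric variables can be illustrated with pairwise sums and differences. This sketch is not SAFE itself: the function name `derive_pairwise_features` and the column names are assumptions, and SAFE's type-dependent selection of which pairs to combine is not modelled.

```python
from itertools import combinations


def derive_pairwise_features(row: dict[str, float],
                             numeric_columns: list[str]) -> dict[str, float]:
    """Add a sum and a difference feature for every pair of numeric columns."""
    derived = dict(row)
    for left, right in combinations(numeric_columns, 2):
        derived[f"{left}_plus_{right}"] = row[left] + row[right]
        derived[f"{left}_minus_{right}"] = row[left] - row[right]
    return derived


# Hypothetical numeric fields of one case.
row = {"claim_amount": 5200.0, "repair_estimate": 4100.0, "vehicle_value": 60000.0}
features = derive_pairwise_features(row, list(row))
```

With n numeric columns this adds n(n-1) derived features, which is one way a few hundred original variables can grow to more than a thousand.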

$$\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}} \tag{3}$$

$$\text{Recall} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$$

All 6,447 samples were randomly split into training and test sets with a ratio of 80%/20%, and all 1,154 features were fed to the XGB model. The trained model has an overall accuracy of 0.8364, with precision of 0.7095, recall of 0.4441 and F1-score of 0.5462. The ROC and PR curves are presented in Figure 4 and Figure 5 respectively.
Based on the results for baseline model, the model performance is rather moderate in predicting fraud behavior in auto insurance cases.
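The evaluation criteria can be computed from confusion-matrix counts as sketched below. The counts in the usage example are illustrative only; the final line checks that the reported baseline precision and recall reproduce the reported F1-score up to rounding.

```python
def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, f1_score(precision, recall)


# Illustrative counts, not taken from the paper.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)

# Consistency check against the reported baseline precision and recall.
baseline_f1 = f1_score(0.7095, 0.4441)
```

A recall well below precision, as in the baseline, means the model misses many fraud cases even though most of the cases it flags are truly fraudulent.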

Results of Unstructured Text Data Processing
In order to extract information relevant to fraud detection from text data, we formulated five text classification tasks, each of which is a multi-class classification task. For each task, we defined text labels, i.e., 12 types of accidents, types of driving status, 11 types of cause of accident, 4 types of car numbers and 5 types of roads. We manually labeled each accident description text with those five types of labels. To reduce the labeling effort, we first selected 750 relatively uncorrelated samples and labeled them manually. Uncorrelated samples were obtained by clustering the texts and selecting texts from different clusters. Then, for each label within a single task, we ensured that at least 35 of those 750 labeled samples carry it. This gave us a small dataset to train a coarse classifier for each of the five tasks. The coarse classifiers were then used to categorize all text samples, and incorrect categorization results were manually adjusted. Final classification results for the five tasks are shown in Table 3 below.
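One simple way to pick texts from different clusters is to cycle through the clusters and take one text from each in turn until the labeling budget is reached. The sketch below assumes cluster assignments have already been computed by some clustering method; the round-robin strategy and the name `sample_across_clusters` are assumptions, as the paper does not describe the selection rule in detail.

```python
from collections import defaultdict
from itertools import zip_longest


def sample_across_clusters(cluster_of: dict[str, int], budget: int) -> list[str]:
    """Pick up to `budget` text IDs by cycling through clusters, one text at a time."""
    members: dict[int, list[str]] = defaultdict(list)
    for text_id, cluster in cluster_of.items():
        members[cluster].append(text_id)

    picked: list[str] = []
    # Each round takes at most one text from every cluster (None pads short clusters).
    for round_ in zip_longest(*members.values()):
        for text_id in round_:
            if text_id is not None and len(picked) < budget:
                picked.append(text_id)
    return picked


# Hypothetical cluster assignments for six accident descriptions.
cluster_of = {"t1": 0, "t2": 0, "t3": 0, "t4": 1, "t5": 2, "t6": 2}
selected = sample_across_clusters(cluster_of, budget=4)
```

Cycling in this way covers every cluster before any cluster contributes a second text, which keeps the selected samples spread across dissimilar descriptions.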

Five new features were generated from text data. After one-hot encoding, there are 45 new Boolean factors. The XGB model trained with the factors extracted from text has an overall accuracy of 0.8481, with precision of 0.7473, recall of 0.4755 and F1-score of 0.5812.
According to Table 4, although the number of features extracted and derived from text data (45) is small compared to the original number of features (1,154), there is a significant improvement in model performance: both recall and F1-score increase by around 6-7%. The ROC and PR curves are presented in Figure 6 and Figure 7 respectively. The feature importance of some of the factors extracted from text data is listed in Table 5, along with their rankings.
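The 6-7% figure appears to be a relative change with respect to the baseline rather than a difference in absolute points. A short check using the values reported for the baseline and text-augmented models (the helper name `relative_gain` is introduced here for illustration):

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative change of a metric, as a percentage of the baseline value."""
    return (improved - baseline) / baseline * 100.0


# Reported baseline vs. text-augmented values.
recall_gain = relative_gain(0.4441, 0.4755)
f1_gain = relative_gain(0.5462, 0.5812)
```

In absolute terms the same change is about 3 points of recall and 3.5 points of F1-score.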

Results of Visual Data Encoding
As mentioned in Section 3.3, the first task of visual data mining is the categorization of the raw data. The accuracy of the automatic multi-label classification algorithm is listed in Table 6 for each category; most categories are well classified. Risk factor extraction was then carried out according to the scheme described in Section 3.3. Some extraction accuracy results are presented in Table 7, where car part and damage detection accuracy are relatively low due to class imbalance in the corresponding data. Detailed risk factors extracted from visual data are listed in Table 8.

All factors are designed and defined based on prior expert knowledge and reports on detecting fraud cases. However, due to the quality of the pictures, only 10 variables are relatively complete (less than 30% missing), and these were extracted from documentary pictures. After one-hot encoding of categorical variables, there are 29 new features from visual data. The XGB model trained with the factors extracted from visual data has an overall accuracy of 0.8736, with precision of 0.724, recall of 0.6107 and F1-score of 0.6625.
According to Table 9, there is a significant improvement in model performance with these visual features. Both recall and F1-score increase by over 20%, which may be because visual data contain key information that is not included in structural data. The ROC and PR curves are presented in Figure 8 and Figure 9 respectively. To be more specific, some anonymised visual data are shown in Figure 10 to Figure 14.
Figure 14. Accident scene pictures.

The rectangle annotations mark different parts of cars, damage to cars and inspectors hired by insurance companies; these are then converted to structural features as risk factors from visual data.

Results for Ensemble Model
Finally, we combine the high-risk factors extracted from both text data and visual data with our baseline model to check the improvement in model performance brought by multi-modal data. The results in Table 10 show a substantial increase in model performance after adding factors extracted from multi-modal data in auto insurance cases. Compared to the baseline model, AUC increases by 12.24% after adding 45 text features and 29 visual features. The ROC and PR curves for the ensemble model are presented in Figure 15 and Figure 16 respectively.

Limitation Analysis
Although we have achieved good model performance by adding factors extracted from the multi-modal data, we still observe some limitations of the current scheme. First, the categories in the text data are extremely imbalanced. For example, the main causes of accidents are driver's fault and third-party responsibility, while other causes, e.g., bad weather, are not adequately represented in the current dataset. Additionally, the consequences caused by driver's fault are also imbalanced and varied. A case is easily misclassified when the consequence of one accident is semantically close to that of another. Examples are shown below. The cause of the first accident should be driver's fault while the cause of the second accident should be bad weather. However, due to the imbalanced number of samples for those two causes in the training data, the model may easily misclassify the cause of the first case as weather.
Also, due to the limitations of BERT (mainly its adaptation to the specific context of car accident scenarios), semantic ambiguity or limited data samples, the causal relationship or the sequence of multiple elements in one sentence cannot be identified clearly. A large portion of the text data is still unused, for example, information on injuries to drivers or other people involved in the accident, descriptions of injuries in doctors' notes, and traffic police reports.
Due to the poor quality of many of the visual data, only 10 variables were extracted from visual data with satisfactory accuracy. Pictures for some cases cannot be detected or recognized, which leads to a large amount of missing data and consequently introduces a data leakage issue to some extent. Because of the small quantity of data and the bias issue, the performance of the damage detection model is limited as well. More visual data are needed to train on fine-grained images or parts. Additionally, there should be a better way to annotate the visual data. Rectangle annotation is relatively coarse when marking tiny or irregular damage. Semantic segmentation is worth trying in future research.

CONCLUSION AND FUTURE WORK
In this paper, we combine a structural data feature engineering algorithm, a natural language processing model and a processing framework for visual data with a machine learning model to handle the task of auto insurance fraud detection based on multi-modal data. We first design an auto insurance multi-modal learning (AIML) framework to analyze multi-modal data collected during the auto insurance business. With AIML, we can use multi-modal data efficiently and improve model performance in predicting auto insurance fraud behavior. We also design a text mining algorithm and a framework to process visual data, both of which achieve significant improvements in predicting fraud behavior. Experimental results show the high quality of AIML and the effectiveness of applying AIML to auto insurance fraud detection on multi-modal data.
As we have achieved a substantial increase in model performance based on multi-modal data mining with a real-world dataset, constructing a real-time system or pipeline is an appealing next step for introducing multi-modal data mining into the auto insurance industry. One possible challenge is multi-modal big data: as the amount of data increases, a bottleneck will arise in each branch that processes a different type of data. A potential solution is a distributed system with load balancing between the algorithms handling different types of data, e.g., NLP for text data and CV for visual data. For further performance improvement, one may consider using a knowledge graph to connect and represent multi-modal data in a more structured way.

The main research interests include theoretical analysis and research of machine learning, model construction and optimization, data governance, automatic feature engineering and scenario solution construction. He graduated from Shanghai University with a doctoral degree in mathematics and applied mathematics. During his doctoral studies, his research focused on dimension reductions and integrabilities of high-dimensional semi-discrete integrable systems, and he completed seven SCI academic papers, two of which were listed as highly cited papers by Web of Science. After graduation, he went to the School of Mathematics and Science of Fudan University to engage in full-time postdoctoral research, where his main research topic was the algebraic structure of constrained high-dimensional semi-discrete integrable systems. ORCID: 0000-0002-7925-9968

Ding Kai is currently a senior researcher at the Financial Technology Center of ZheJiang Lab. He received his Ph.D. degree from the School of Automation Science and Electronic Engineering at Beihang University, China. His research interests include multimedia information retrieval, natural language processing and deep learning. ORCID: 0000-0003-4534-2904