Fuzzy-Constrained Graph Pattern Matching in Medical Knowledge Graphs

Graph pattern matching (GPM) has attracted considerable research attention. However, most studies focus on complex networks, and few address GPM in the medical field. This paper therefore uses GPM to make a breast cancer-oriented diagnosis before surgery. Technically, this paper first gives a new definition of GPM for the medical field, specifically for Medical Knowledge Graphs (MKGs). Then, for the matching process, it introduces fuzzy calculation and proposes a multi-threaded bidirectional routing exploration (M-TBRE) algorithm based on depth-first search and a two-way routing matching algorithm based on multi-threading. In addition, fuzzy constraints are introduced into the M-TBRE algorithm, which leads to the Fuzzy-M-TBRE algorithm. Experimental results on two datasets show that, compared with existing algorithms, the proposed algorithm is more efficient and effective.


INTRODUCTION
As a basic data structure, graphs are widely used in many applications. For example, in object anomaly checking, objects can be represented by graphs, and anomalies can then be discovered with a suitable graph algorithm [1]. Similarly, to determine whether a user is interested in a certain webpage, the webpage can be converted into multiple graphs; with these graphs taken as a bag, the bag can be classified and judged [2]. As a popular graph-based technology, graph pattern matching (GPM) has attracted a lot of attention. Specifically, given a pattern graph, GPM is the task of finding subgraphs of the data graph whose structure is similar to or the same as that of the pattern graph. As the research field of GPM has moved from the initial protein isomorphism [3,4] to community detection [5,6], expert discovery [7], recommendation systems [8], the discovery of social groups [9][10][11] and the identification of criminal groups [12], the definition of graph pattern has also changed accordingly.
Technically, GPM is originally defined based on subgraph isomorphism. Given a data graph G_D and a pattern graph G_P as input, it returns whether G_D contains a subgraph that has exactly the same topological structure as G_P. For example, the properties of unknown proteins can be inferred from the properties of known proteins through this matching [3,4]. However, traditional subgraph isomorphism is too strict. To extend the application scenarios of GPM, Fan et al. [12] propose bounded simulation, which extends edge-to-edge matching to matching an edge to a path of bounded length. However, this matching still does not make use of the rich attribute information on vertices and edges. Therefore, Liu et al. [13] propose the multi-constrained graph pattern matching (MC-GPM) problem to obtain more effective matching results. Afterwards, considering that some attributes do not require exact matching, Liu et al. [14] propose multiple fuzzy constrained graph pattern matching (MFC-GPM) based on MC-GPM. However, the current application scenarios of GPM are mostly concentrated on complex networks, and there is very little research on GPM in the medical field, especially in Medical Knowledge Graphs.
Nowadays, the incidence of breast cancer keeps rising, and patients are being diagnosed at increasingly younger ages. Breast cancer can be divided into ductal carcinoma in situ, lobular carcinoma in situ, invasive ductal carcinoma, invasive lobular carcinoma, and so on. Each type of breast cancer can be further divided according to the primary tumor staging, regional lymph node staging, and distant metastasis staging. The purpose of this paper is to make a diagnosis through GPM before the patient's condition is diagnosed through surgery.
In this paper, to introduce GPM into the medical field, we propose the problem of GPM in MKGs and give the relevant definitions. In addition, we propose the M-TBRE algorithm, which first divides the pattern graph into pattern subgraphs, then obtains the matching results of the pattern subgraphs, and finally merges these matching results. M-TBRE can give the diagnosis distribution of the pattern graph and return the best k diagnosis classification results according to the frequency of each diagnosis classification. Fuzzy constraints are also introduced into the M-TBRE algorithm, which extends it to the Fuzzy-M-TBRE algorithm, and the effectiveness of our algorithm is verified on two public datasets.

The rest of this paper is organized as follows. The related work of GPM is reviewed in Section 2. Then in Section 3, the concept of pattern matching in MKGs is introduced. Section 4 proposes a multi-threaded bidirectional routing exploration algorithm and a Fuzzy-M-TBRE algorithm to process GPM in MKGs. Section 5 introduces the data sets and conducts experiments to verify our proposed Fuzzy-M-TBRE algorithm, and Section 6 concludes our work in this paper.

RELATED WORK
Depending on whether a match is judged by a bijective function or by a binary relation, research on GPM can be divided into isomorphism-based GPM and simulation-based GPM.

Isomorphism-Based GPM
Isomorphism-based GPM requires a bijective function between the pattern graph and the matched data subgraph, so their topological structures must be the same. Ullmann [15] first proposes a matching algorithm based on depth-first search. Cordella et al. [16] improve Ullmann's algorithm in terms of matching order and pruning strategy, and propose the VF2 algorithm. Tong et al. [17] propose the G-Ray method, which uses a goodness function to measure the degree of matching between a subgraph and the pattern graph, so that the optimal-k subgraphs can be returned. In addition, Cheng et al. [18] propose a top-k matching algorithm, which sorts the matched subgraphs by the number of spanning trees and thereby returns the optimal-k subgraphs. Cheng et al. [19] propose the R-join algorithm based on the join index of the clustering graph and optimize the algorithm. Other representative algorithms include the DDST algorithm [20], the IncBMatch algorithm [21], and so on. Generally, because it is an NP-complete problem, isomorphism-based GPM relies on indexing [22,23] and parallel and distributed computing [24][25][26] to improve matching efficiency.
Isomorphism-based GPM is mostly used in fields with strict structural requirements such as protein isomorphism, 3D object matching [27] and network abnormal behavior detection [28]. However, such matching is too strict for applications such as social networks or knowledge graphs that do not require strict matching accuracy. Therefore, simulation-based GPM research has emerged.

Simulation-Based GPM
Graph simulation, which judges matches through binary relations, is introduced by Henzinger et al. [29], but it is still an edge-to-edge matching, which cannot meet the requirements of many applications. Fan et al. [12] extend graph simulation and propose bounded simulation, where an edge of the pattern graph can be matched to a path whose length does not exceed a given constraint value k. Based on bounded simulation, Ma et al. [30] propose strong simulation, which preserves the topological structure of the pattern graph well. Big graph data carries a lot of attribute information on vertices and edges, but these existing works do not consider it. Liu et al. [13] use this information to extend bounded simulation to MC-GPM and propose a baseline algorithm based on exploration and a heuristic algorithm based on a data graph compression index (HAMC). Since the HAMC algorithm considers only the constraint conditions of the matching path, does not minimize the matching path length, and does not support a distributed computing structure, Liu et al. [31] propose the M-HAMC algorithm. Considering that the attribute values on vertices and edges sometimes do not need to be matched exactly, Liu et al. [14] build on MC-GPM to propose MFC-GPM and the ETOF-K algorithm, which improves matching efficiency in two respects: edge matching and edge connection. Based on the topologically ordered sequence of pattern graph vertices, Liu et al. [32] propose the NTSS algorithm and optimize it with two measures: a caching mechanism and reverse edge matching. The caching mechanism avoids repeated calculation of the same candidate path in multiple matching subgraphs, and reverse edge matching prunes in advance the candidate set of the edge whose in-degree is 0.
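The edge-to-path relaxation of bounded simulation can be illustrated with a short sketch. It is not taken from any of the cited algorithms; it only checks, with a breadth-first search, whether a target vertex can be reached from a source vertex over a directed path of at most k edges. The function name `within_k_hops` and the adjacency-dictionary representation are assumptions made for this illustration.

```python
from collections import deque


def within_k_hops(adjacency, source, target, k):
    """Return True if target is reachable from source over a directed path of 1..k edges."""
    frontier = deque([(source, 0)])
    seen = {source}
    while frontier:
        vertex, depth = frontier.popleft()
        if depth == k:
            continue  # the path budget is used up on this branch
        for neighbour in adjacency.get(vertex, ()):
            if neighbour == target:
                return True
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return False


# Hypothetical toy graph: a -> b -> c -> d
toy_graph = {"a": ["b"], "b": ["c"], "c": ["d"]}
print(within_k_hops(toy_graph, "a", "c", 2))
print(within_k_hops(toy_graph, "a", "d", 2))
```

With k = 1 the check reduces to plain edge-to-edge matching, which is the setting used later for pattern graphs in an MKG.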

GRAPH PATTERN MATCHING
GPM finds all data subgraphs that satisfy the pattern graph G_P in a given data graph G_D. In this section, we give the relevant definitions of data graphs, pattern graphs, and graph pattern matching in MKGs.

Data Graph and Pattern Graph
The related definitions of the data graph and the pattern graph are as follows.

Data Graph
A data graph G_D is a directed graph with vertex attributes and edge attributes, where
• V is the set of vertices of the data graph;
• E is the set of edges of the data graph, and (v_i, v_j) ∈ E represents the directed edge from vertex v_i to vertex v_j;
• f^V_D(v) is the attribute set of vertex v. In an MKG, each vertex v has a label r_r, which represents the type of this vertex. Depending on the value of r_r, the other attributes in the attribute set f^V_D(v) differ. The value of r_r can be DI, BI, MI, GW, OC, AL, or PD;
• f^E_D(e) is the attribute set of edge e. In an MKG, for a directed edge e, ρ is a list that stores patient numbers, that is, it identifies the patients from whom the information on the vertices comes.

DI:
When the value of r_r is DI, the attribute set f^V_D(v) of vertex v describes the diagnostic classification information of breast cancer, which includes the pathological information r_h, the T staging stage r_s, the tumor length r_TS, the N staging stage r_LN describing regional lymph nodes, and the M staging stage r_DM describing distant metastasis. The value of r_h can be 0, 1, 2, or 3, representing "invasive ductal carcinoma", "invasive lobular carcinoma", "ductal carcinoma in situ", and "lobular carcinoma in situ", respectively; the value of r_s can be 0, 1, 2, 3, or 4; r_TS is a floating-point number in cm; the value of r_LN can be N0, N1, N2, or N3; and the value of r_DM can be M0 or M1.

chinaXiv:202211.00420v1

BI:
When the value of r_r is BI, the attribute set f^V_D(v) of vertex v describes the basic information of the patient, which includes r_CN, r_CP, and r_age. r_CN indicates whether the patient currently needs care, and its value is true or false; r_CP indicates whether the patient is currently pregnant, and its value is true or false; r_age indicates the current age of the patient, and its value is a positive integer.

MI:
When the value of r_r is MI, the attribute set f^V_D(v) of vertex v describes the patient's menopausal information, and it contains only r_MS. The value of r_MS can be 0, 1, or 2, indicating pre-menopausal, perimenopausal, and post-menopausal, respectively.

GW:
When the value of r_r is GW, the attribute set f^V_D(v) of vertex v describes the patient's current overall well-being, and it contains only r_GWB. The value of r_GWB can be 0, 1, 2, 3, or 4, which mean "fully active, no complaints or symptoms", "doing normal activities requires a little effort", "occasionally needs help, but can meet most personal needs", "needs a lot of assistance and frequent medical care", and "completely disabled, can only lie in a bed or a chair", respectively.

OC:
When the value of r_r is OC, the attribute set f^V_D(v) of vertex v describes whether the patient has cancers other than breast cancer, which includes r_OCS and r_OCN. When the value of r_OCS is false, the value of r_OCN is none; when the value of r_OCS is true, the value of r_OCN is the names of the patient's other cancers.

AL:
When the value of r_r is AL, the attribute set f^V_D(v) of vertex v describes the patient's axillary lymph nodes, which includes r_LN, r_LS, r_IN, r_SN, and r_CW. The value of r_LS is true or false, indicating whether the patient's axillary lymph nodes are normal. The value of r_IN is true or false, indicating whether the patient's supraclavicular lymph nodes are normal. The value of r_SN is true or false, indicating whether the patient's subclavian lymph nodes are normal. The value of r_CW is true or false, indicating whether the patient's chest wall is normal. The value of r_LN is a positive integer giving how many of the three sites (supraclavicular lymph nodes, subclavian lymph nodes, and chest wall) have problems.

PD:
When the value of r_r is PD, the attribute set of vertex v describes the patient's past diagnoses, which includes r_aid, r_ane, r_aut, r_lun, r_dia, r_car, r_ost, and r_rep. Each of these attributes is true or false: r_aid indicates whether the patient has AIDS; r_ane indicates whether the patient has anemia; r_aut indicates whether the patient has an autoimmune disease; r_lun indicates whether the patient has lung cancer; r_dia indicates whether the patient has diabetes; r_car indicates whether the patient has cardiovascular disease; r_ost indicates whether the patient has osteoporosis; and r_rep indicates whether the patient's reproductive organs are diseased.
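The data graph defined above can be held in a small in-memory structure. The following sketch is only one possible representation, not the authors' implementation: the class names `Vertex`, `Edge`, and `DataGraph`, the vertex identifiers, the edge direction, and all attribute values are assumptions made for illustration. Only the label set and the idea that each edge stores a list ρ of patient numbers come from the definition.

```python
from dataclasses import dataclass, field

# The seven vertex types r_r named in the definition.
VERTEX_LABELS = {"DI", "BI", "MI", "GW", "OC", "AL", "PD"}


@dataclass
class Vertex:
    vertex_id: str
    label: str                      # r_r, the vertex type
    attributes: dict = field(default_factory=dict)

    def __post_init__(self):
        if self.label not in VERTEX_LABELS:
            raise ValueError(f"unknown vertex label: {self.label}")


@dataclass
class Edge:
    source: str
    target: str
    patient_ids: list = field(default_factory=list)  # rho, the patient numbers


@dataclass
class DataGraph:
    vertices: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_vertex(self, vertex):
        self.vertices[vertex.vertex_id] = vertex

    def add_edge(self, source, target, patient_id):
        """Attach patient_id to the edge source -> target, creating the edge if needed."""
        for edge in self.edges:
            if edge.source == source and edge.target == target:
                edge.patient_ids.append(patient_id)
                return edge
        edge = Edge(source, target, [patient_id])
        self.edges.append(edge)
        return edge


# Hypothetical content: attribute values and edge direction are illustrative only.
graph = DataGraph()
graph.add_vertex(Vertex("A1", "DI", {"r_h": 0, "r_s": 2, "r_TS": 2.5, "r_LN": "N1", "r_DM": "M0"}))
graph.add_vertex(Vertex("B1", "BI", {"r_CN": False, "r_CP": False, "r_age": 52}))
graph.add_edge("A1", "B1", 1375)
```

Storing the patient numbers on edges rather than on vertices lets several patients share one vertex when their attribute values coincide, which is consistent with ρ being a list.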

Pattern Graph
A pattern graph G_P is a directed graph with vertex attributes and edge attributes, where:
• V_P is the set of vertices of the pattern graph;
• E_P is the set of edges of the pattern graph, and (u_i, u_j) ∈ E_P represents the directed edge from vertex u_i to vertex u_j;
• f^V_P(u) is the attribute set of vertex u. In an MKG, the function f^V_P(u) corresponding to the vertex u has the same meaning as the attribute set of a vertex in the data graph;
• f^E_P(e) is the attribute set of edge e. In an MKG, for a directed edge e, ρ is a list that stores patient numbers, that is, it identifies the patients from whom the information on the vertices comes;
• f^l_P(u_i, u_j) is the length constraint of the edge (u_i, u_j). Its value is a positive integer k or the symbol *, indicating that the length of the path from v_i to v_j does not exceed k or that there is no length limit, respectively. In an MKG, f^l_P(u_i, u_j) = 1;
• f^m_P is a set of membership constraint functions defined on vertex attributes and edge attributes.

Fuzzy constraints
During matching, it is desirable to obtain more and better matching results, because each matched subgraph corresponds to a patient whose health information is roughly the same as that of the patient to be diagnosed in the pattern graph. The more matches are obtained, the more experience can be drawn on when treating the patient corresponding to the pattern graph. In practice, however, a subgraph of the data graph may satisfy the other constraints well and still fail to become a matching result because a less important attribute constraint on one vertex is not satisfied. In addition, some attribute constraints on vertices do not need to be matched exactly; their differences only need to fall within a certain range. Therefore, we introduce fuzzy constraints to GPM in MKGs.
In the MKG, the membership function

Pattern Matching
The matched subgraph M_sub = (V_sub, E_sub, f^V_sub, f^E_sub) is a subgraph of the data graph G_D that matches the pattern graph G_P. There may be more than one matched subgraph. The definition of pattern matching in the MKG is as follows.
For a pattern graph G_P and a data graph G_D, G_D matches G_P if there is a binary relation:

ρ is 1375, which means that the relevant information on the B_1 vertex comes from the breast cancer patient numbered 1375. The pattern graph G_P is the health status of a patient to be diagnosed. The vertices B, C, D, E, F, and G represent the patient's basic information, menopausal status, general well-being, information on cancers other than breast cancer, axillary lymph nodes, and information about past diagnoses, respectively. Vertex A is the diagnostic information of this patient, but it is unknown and needs to be obtained through GPM. Since all vertex information in the pattern graph comes from the same patient, we need to find a patient number as the attribute constraint information on the edges to get the matching result of the pattern graph.

In Figure 1, it is easy to find a subgraph M_sub1 of the data graph G_D that matches the pattern graph G_P. M_sub1 passes through vertices A_2, B_2, C_1, D_2, E_1, F_2, and G_3. Vertex A_2 is the breast cancer diagnosis result for the pattern graph G_P. The attribute constraint value on the edges in M_sub1 is 2384, which means that the health status of the patient numbered 2384 is close to that of the patient corresponding to G_P.

After introducing fuzzy constraints, since 3

GRAPH PATTERN MATCHING IN MEDICAL KNOWLEDGE GRAPHS
In this section, we propose a multi-threaded bidirectional routing exploration algorithm M-TBRE to solve the GPM problem in MKGs.

Algorithm Description
Multi-core CPUs allow tasks to be processed in parallel, which speeds up program execution. Since the multi-constrained GPM problem is NP-complete, we adopt multi-threading to speed up matching and return the matched results quickly. The matching process follows the idea of divide and conquer. A pattern graph G_P can be divided into several pattern subgraphs. After each pattern subgraph has been matched in the data graph G_D, the matched results of the pattern subgraphs are connected to obtain the matched results of the pattern graph G_P. The matching of pattern subgraphs can be delivered as subtasks to multiple threads that complete them independently, so that matching results can be obtained quickly.
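The divide-and-conquer structure described above can be sketched with a thread pool. This is a minimal illustration under stated assumptions, not the M-TBRE implementation: `match_sub_pattern` is a toy stand-in that filters flat patient records instead of searching a graph, the record layout and patient numbers are hypothetical, and the merge step is simplified to a set intersection.

```python
from concurrent.futures import ThreadPoolExecutor


def match_sub_pattern(sub_pattern, records):
    """Toy stand-in for sub-pattern matching: ids of records satisfying every constraint."""
    return {
        patient_id
        for patient_id, attributes in records.items()
        if all(attributes.get(name) == value for name, value in sub_pattern.items())
    }


def match_pattern(sub_patterns, records, max_workers=2):
    """Match each sub-pattern as an independent subtask, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(match_sub_pattern, sub, records) for sub in sub_patterns]
        partial_results = [future.result() for future in futures]
    # A patient matches the whole pattern only if it matches every part.
    return set.intersection(*partial_results) if partial_results else set()


# Hypothetical records keyed by patient number.
records = {
    1: {"r_MS": 2, "r_GWB": 0},
    2: {"r_MS": 2, "r_GWB": 1},
    3: {"r_MS": 0, "r_GWB": 0},
}
sub_patterns = [{"r_MS": 2}, {"r_GWB": 0}]
print(match_pattern(sub_patterns, records))
```

The sketch shows only the task structure. In standard CPython builds the global interpreter lock prevents threads from running CPU-bound Python code in parallel, so a speed-up comparable to a multi-threaded Java implementation should not be expected from this code.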

Algorithm Flow
In the M-TBRE algorithm, since the pattern graph of the MKG can be regarded as a path, we segment the pattern graph at an intermediate vertex of this path, which divides it into two pattern subgraphs. Next, the two pattern subgraphs are matched, and their matched results are connected to obtain the matched results of the pattern graph.
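The segmentation step can be sketched as follows. The text does not state which intermediate vertex is chosen or whether the two pattern subgraphs share it, so this sketch assumes a split at the middle position with the middle vertex kept in both halves, which makes a later join on that vertex possible. The function name `split_path_pattern` is hypothetical.

```python
def split_path_pattern(path):
    """Split a path of pattern vertices into two sub-paths that share the middle vertex."""
    if len(path) < 3:
        raise ValueError("a path needs at least three vertices to have an intermediate vertex")
    middle = len(path) // 2
    return path[: middle + 1], path[middle:]


# The pattern graph of Figure 1 regarded as a path over vertices A..G.
first_half, second_half = split_path_pattern(["A", "B", "C", "D", "E", "F", "G"])
print(first_half, second_half)
```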

The MC-FEM algorithm completes the matching of the pattern subgraph G_P^sub2. Its processing is similar to that of the MC-SEM algorithm, except that MC-FEM matches the pattern subgraph G_P^sub2 using a reverse depth-first search strategy.
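The difference between a forward and a reverse depth-first search over a path pattern can be sketched in one function. This is not the MC-SEM or MC-FEM algorithm, whose details are not given here; it is a plain depth-first enumeration under assumptions: vertices are dictionaries of attributes, edges map a (source, target) pair to a set of patient numbers, constraints are exact attribute equalities, and every matched edge must carry the same patient number. The name `dfs_match` and all data values are hypothetical.

```python
def dfs_match(path_pattern, vertices, edges, patient_id, reverse=False):
    """Enumerate vertex sequences matching path_pattern whose edges all carry patient_id.

    With reverse=True the search starts at the last pattern vertex and follows
    edges backwards; results are still returned in forward order.
    """
    pattern = path_pattern[::-1] if reverse else path_pattern

    # Neighbour lists restricted to edges that carry the requested patient number.
    step = {}
    for (source, target), ids in edges.items():
        if patient_id in ids:
            origin, destination = (target, source) if reverse else (source, target)
            step.setdefault(origin, []).append(destination)

    def satisfies(vertex_id, constraint):
        attributes = vertices[vertex_id]
        return all(attributes.get(name) == value for name, value in constraint.items())

    results = []

    def extend(partial):
        if len(partial) == len(pattern):
            results.append(partial[::-1] if reverse else partial)
            return
        for candidate in step.get(partial[-1], []):
            if satisfies(candidate, pattern[len(partial)]):
                extend(partial + [candidate])

    for start in vertices:
        if satisfies(start, pattern[0]):
            extend([start])
    return results


# Hypothetical data for the second half of a pattern path (vertices D..G).
vertices = {
    "D2": {"label": "GW", "r_GWB": 1},
    "E1": {"label": "OC", "r_OCS": False},
    "F2": {"label": "AL", "r_LS": True},
    "G3": {"label": "PD", "r_dia": False},
}
edges = {("D2", "E1"): {2384}, ("E1", "F2"): {2384}, ("F2", "G3"): {2384}}
pattern = [{"label": "GW"}, {"label": "OC"}, {"label": "AL"}, {"label": "PD", "r_dia": False}]
print(dfs_match(pattern, vertices, edges, 2384, reverse=True))
```

Both directions enumerate the same set of matches; the direction only changes where the search starts and therefore which constraints prune candidates first.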

The RM algorithm can complete the connection operation of the matching results of pattern subgraphs
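A connection step of this kind can be sketched as a join on two keys. The details of the RM algorithm are not given here, so the sketch assumes that each sub-pattern result is a dictionary from patient number to a list of matched vertex sequences, and that two partial matches connect when they belong to the same patient and share the vertex at which the pattern was split. The function name `join_matches` is hypothetical; the vertex sequence for patient 2384 follows M_sub1 from Figure 1, and the remaining entries are illustrative.

```python
def join_matches(left_matches, right_matches):
    """Join partial matches that share a patient number and the split vertex."""
    joined = {}
    for patient_id in left_matches.keys() & right_matches.keys():
        for left in left_matches[patient_id]:
            for right in right_matches[patient_id]:
                if left[-1] == right[0]:
                    # Drop the duplicated split vertex when concatenating.
                    joined.setdefault(patient_id, []).append(left + right[1:])
    return joined


left_matches = {
    2384: [["A2", "B2", "C1", "D2"]],
    1375: [["A1", "B1", "C1", "D1"]],  # illustrative partial match with no right-hand partner
}
right_matches = {
    2384: [["D2", "E1", "F2", "G3"]],
}
print(join_matches(left_matches, right_matches))
```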

Example 3:
In this example, G_P and G_D in Figure 1 are the pattern graph and the data graph, respectively. The patients numbered 676 and 2384 can be used for reference when treating the patient corresponding to the pattern graph G_P.
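Once the matched patients are known, the best k diagnosis classifications can be ranked by frequency, as described in the introduction. The following sketch assumes that the diagnosis of each matched subgraph has already been reduced to a hashable label; the function name `top_k_diagnoses` and the sample labels are hypothetical.

```python
from collections import Counter


def top_k_diagnoses(matched_diagnoses, k):
    """Return the k most frequent diagnosis labels with their counts."""
    return Counter(matched_diagnoses).most_common(k)


# Hypothetical diagnosis labels, one per matched subgraph.
sample = ["A2", "A1", "A2", "A3", "A2", "A1"]
print(top_k_diagnoses(sample, 2))
```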

EXPERIMENTS
In this section, we conduct experiments on two public MKGs. The details of these two datasets are shown in Table 1. We propose and implement the M-TBRE algorithm to complete pattern matching in MKGs. Since the M-TBRE algorithm divides the pattern graph into two sub-pattern graphs, for different edge attribute constraint values r_pid, the matching of these two sub-pattern graphs is delivered to the thread pool as subtasks for execution. The number of threads in the thread pool affects the performance of the algorithm, so we set different numbers of threads to measure how the efficiency of the M-TBRE algorithm changes. In addition, to obtain more matched results, we introduce a fuzzy constraint, which is a membership function for the age attribute constraint on the vertices. During matching, the age attribute on a data graph vertex does not need to equal the attribute constraint in the pattern graph; it only needs to yield a membership function result that satisfies the corresponding membership constraint value. Combining this constraint with the M-TBRE algorithm gives the Fuzzy-M-TBRE algorithm. Comparing the Fuzzy-M-TBRE and M-TBRE algorithms shows the effectiveness of the introduced fuzzy constraints.
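The effect of a fuzzy age constraint can be sketched as follows. The concrete membership function is not reproduced here, so the sketch assumes, purely for illustration, that it is the absolute age difference in years and that a candidate satisfies the constraint when this difference does not exceed the membership constraint value. The function names, patient numbers, and ages are all hypothetical.

```python
def age_membership(pattern_age, candidate_age):
    """Assumed membership function: absolute age difference in years."""
    return abs(pattern_age - candidate_age)


def matches_age(pattern_age, candidate_age, constraint_value=None):
    """Exact match when constraint_value is None, fuzzy match otherwise."""
    if constraint_value is None:
        return pattern_age == candidate_age
    return age_membership(pattern_age, candidate_age) <= constraint_value


# Hypothetical patient numbers and ages.
candidate_ages = {101: 47, 102: 50, 103: 52, 104: 55}
exact = [pid for pid, age in candidate_ages.items() if matches_age(50, age)]
fuzzy = [pid for pid, age in candidate_ages.items() if matches_age(50, age, 3)]
print(exact, fuzzy)
```

Under this assumption every exact match is also a fuzzy match, so relaxing the constraint can only add matched subgraphs.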

Experimental Settings and Implementation
The MKGs used in the experiments are about breast cancer. Dataset-1 and dataset-2 denote the dataset Female-breast-cancer-2013a and the dataset Breastcancer-femalepatient-2016A, respectively. Dataset-1 is composed of the physical condition information of 10,000 breast cancer patients, and dataset-2 is composed of that of 100,000 breast cancer patients. Several pattern graphs are used in our experiments, all of which are similar to the pattern graph shown in Figure 1. Our membership function applies only to the age attribute of the vertex, and the membership constraint value is set to 3. Both M-TBRE and Fuzzy-M-TBRE are implemented in Java and run on a PC with an Intel(R) Core(TM) i9-10900F CPU @ 2.81 GHz, 32 GB RAM, and the Windows 10 operating system.

Experiments on Execution Time
This experiment studies how the execution time of GPM changes when different numbers of threads are set for the thread pool used in the M-TBRE algorithm. To reduce measurement noise, each reported result is the arithmetic mean of 10 runs.
In Figure 2 and Figure 3, the abscissa represents different pattern graphs, and the ordinate represents the matching time of these pattern graphs. M-TBRE-1 denotes the M-TBRE algorithm with the number of threads in the thread pool set to 1. Since the pattern graph in the MKG is a path, the edge join strategy proposed in the ETOF-K algorithm does not take effect in the matching, so the performance of the ETOF-K algorithm and the M-TBRE-1 algorithm is almost the same on dataset-1 and dataset-2. The reverse matching strategy of the NTSS algorithm is also ineffective in the matching process, but its caching mechanism avoids repeated calculation of the same path, so the NTSS algorithm performs better than the ETOF-K algorithm and the M-TBRE-1 algorithm on dataset-1 and dataset-2. However, our M-TBRE-1 algorithm can be extended to multi-threaded variants such as M-TBRE-2, M-TBRE-4, and M-TBRE-8, in which the number of threads in the thread pool is set to 2, 4, and 8, respectively. As can be seen from Figure 2 and Figure 3, the M-TBRE-2 algorithm already outperforms the NTSS algorithm, which demonstrates the effectiveness of our proposed M-TBRE algorithm. In addition, Table 2 and Table 3 show the detailed execution time in seconds. Table 4 compares the average execution time of these four algorithms on the two datasets. On both datasets, the execution time of the algorithm keeps decreasing as the number of threads increases. For our proposed M-TBRE algorithm, the execution speed increases with the number of threads, but once the number of threads reaches a certain level, the gain in execution speed slows down, as shown by M-TBRE-8 in Figure 2. If the dataset is larger, or if the M-TBRE algorithm submits more subtasks, this slowdown becomes less pronounced. Compared with the M-TBRE-4 algorithm, the M-TBRE-8 algorithm improves by 27.81% on dataset-1 and by 36.06% on dataset-2.

Experiments on Fuzzy Constraints
This experiment studies how the number of matched subgraphs changes when fuzzy constraints are introduced into our M-TBRE algorithm. The Fuzzy-M-TBRE algorithm is the M-TBRE algorithm with fuzzy constraints. Since our Fuzzy-M-TBRE algorithm returns all the matched results of the pattern graph, we can compare the total number of matches before and after the introduction of fuzzy constraints.
In Figure 4 and Figure 5, the abscissa represents different pattern graphs, and the ordinate represents the number of matched subgraphs. On dataset-1 and dataset-2, for the same pattern graph, the Fuzzy-M-TBRE algorithm returns more matched results than the M-TBRE algorithm. Each matched subgraph corresponds to a breast cancer patient. When treating the patient corresponding to the pattern graph, clinicians can refer to the treatment plans of the patients corresponding to these matched subgraphs. Introducing fuzzy constraints therefore yields more treatment options for reference, which is why it is necessary to introduce fuzzy constraints into the M-TBRE algorithm.

CONCLUSION
In this paper, we put forward the problem of GPM in MKGs and provide the related definitions. To solve this problem, we propose the M-TBRE algorithm, which divides the pattern graph into several pattern subgraphs, uses multi-threaded bidirectional routing to complete the matching of the pattern subgraphs, and then merges the matching results. In addition, fuzzy constraints are introduced to obtain more matching subgraphs. Each matched subgraph corresponds to a past patient whose physical condition is roughly the same as that of the patient corresponding to the pattern graph, so the treatment plans of these patients can be used for reference in the treatment of the patient corresponding to the pattern graph. In this way, better and more effective treatment plans can be developed for that patient. We conduct verification experiments on the M-TBRE algorithm on two public MKG datasets. Experimental results show that our proposed M-TBRE algorithm has better performance. Furthermore, the experiments demonstrate the necessity of introducing fuzzy constraints, with the Fuzzy-M-TBRE algorithm returning more matched results. In the future, we will further study and improve the M-TBRE algorithm, and investigate the dynamic graph pattern matching problem in MKGs, in which the content of the pattern graph changes over time.