## Abstract

Graph pattern matching (GPM) has attracted a lot of attention. However, most research has focused on complex networks, and few studies address GPM in the medical field. This paper therefore uses GPM to make a breast cancer-oriented diagnosis before surgery. Technically, this paper first gives a new definition of GPM for the medical field, especially for Medical Knowledge Graphs (MKGs). Then, for the matching process itself, it introduces fuzzy calculation and proposes a multi-threaded bidirectional routing exploration (M-TBRE) algorithm based on depth-first search and a two-way routing matching algorithm based on multi-threading. In addition, fuzzy constraints are introduced into the M-TBRE algorithm, which yields the Fuzzy-M-TBRE algorithm. Experimental results on two datasets show that, compared with existing algorithms, the proposed algorithm is more efficient and effective.

## 1. INTRODUCTION

As a basic data structure, graphs are widely used in many applications. For example, in object anomaly checking, objects can be represented by graphs, and anomalies can then be discovered with a suitable graph algorithm [1]. Similarly, to determine whether a user is interested in a certain webpage, the webpages can be converted into multiple graphs; with these graphs taken as a bag, the bag can be classified and judged [2]. As a popular graph-based technology, graph pattern matching (GPM) has attracted a lot of attention. Specifically, given a pattern graph, GPM is the task of finding subgraphs of the data graph whose structure is the same as, or similar to, that of the pattern graph. As the research field of GPM has moved from the initial protein isomorphism [3, 4] to community detection [5, 6], expert discovery [7], recommendation systems [8], the discovery of social groups [9–11] and the identification of criminal groups [12], the definition of a graph pattern has changed accordingly.

Technically, GPM was originally defined based on subgraph isomorphism. Given a data graph $G_D$ and a pattern graph $G_P$ as input, it returns whether $G_D$ contains a subgraph that has exactly the same topological structure as $G_P$. For example, we can guess the properties of unknown proteins from the properties of known proteins through this matching [3, 4]. However, traditional subgraph isomorphism is too strict. To extend the application scenarios of GPM, Fan et al. [12] propose bounded simulation, which extends edge-to-edge matching to matching an edge to a path of bounded length. However, this matching still does not make use of the rich attribute information on vertices and edges. Therefore, Liu et al. [13] propose the multi-constrained graph pattern matching (MC-GPM) problem to obtain more effective matching results. Afterwards, considering that some attributes do not require exact matching, Liu et al. [14] propose multiple fuzzy constrained graph pattern matching (MFC-GPM) based on MC-GPM. However, current applications of GPM are mostly concentrated on complex networks, and there are very few applications of GPM in the medical field, especially in Medical Knowledge Graphs.

Nowadays, the incidence of breast cancer is rising, and patients are being diagnosed at increasingly younger ages. Breast cancer can be divided into ductal carcinoma in situ, lobular carcinoma in situ, invasive ductal carcinoma, invasive lobular carcinoma, and so on. Each type of breast cancer can be further divided according to the primary tumor staging, regional lymph node staging, and distant metastasis staging. The purpose of this paper is to make a diagnosis through GPM before the patient's condition is confirmed by surgery.

In this paper, to introduce GPM into the medical field, we propose the problem of GPM in MKGs and give the relevant definitions. In addition, we propose the M-TBRE algorithm, which first divides the pattern graph into pattern subgraphs, then obtains the matching results of the pattern subgraphs, and finally merges these matching results. M-TBRE gives the diagnosis distribution of the pattern graph and returns the best *k* diagnosis classification results according to the frequency of each diagnosis classification. Fuzzy constraints are also introduced into the M-TBRE algorithm, which extends it to the Fuzzy-M-TBRE algorithm, and the effectiveness of our algorithm is verified on two public datasets.

The rest of this paper is organized as follows. The related work of GPM is reviewed in Section 2. Then in Section 3, the concept of pattern matching in MKGs is introduced. Section 4 proposes a multi-threaded bidirectional routing exploration algorithm and a Fuzzy-M-TBRE algorithm to process GPM in MKGs. Section 5 introduces the data sets and conducts experiments to verify our proposed Fuzzy-M-TBRE algorithm, and Section 6 concludes our work in this paper.

## 2. RELATED WORK

Depending on whether matching is judged by a bijective function or by a binary relation, research on GPM can be divided into isomorphism-based GPM and simulation-based GPM.

### 2.1 Isomorphism-Based GPM

Isomorphism-based GPM requires a bijective function between the pattern graph and the matched data subgraph, and the topological structures of the two must be the same. Ullmann [15] first proposes a matching algorithm based on depth-first search. Cordella et al. [16] improve Ullmann's algorithm in terms of matching order and pruning strategy, and propose the VF2 algorithm. Tong et al. [17] propose the G-Ray method, which uses a goodness function to measure the degree of matching between a subgraph and the pattern graph, so that the best *k* subgraphs can be returned. In addition, Cheng et al. [18] propose a top-*k* matching algorithm, which sorts the matched subgraphs by the number of spanning trees and thereby returns the best *k* subgraphs. Cheng et al. [19] propose the R-join algorithm based on the join index of the clustering graph and optimize the algorithm. Other representative algorithms include the DDST algorithm [20], the IncBMatch algorithm [21], and so on. Generally, because isomorphism-based GPM is an NP-complete problem, indexing [22, 23] and parallel and distributed computing [24–26] are used to improve matching efficiency.

Isomorphism-based GPM is mostly used in fields with strict structural requirements such as protein isomorphism, 3D object matching [27] and network abnormal behavior detection [28]. However, such matching is too strict for applications such as social networks or knowledge graphs that do not require strict matching accuracy. Therefore, simulation-based GPM research has emerged.

### 2.2 Simulation-Based GPM

Graph simulation, which judges matches through binary relations, is introduced by Henzinger et al. [29], but it is still an edge-to-edge matching, which cannot meet the requirements of many applications. Fan et al. [12] extend graph simulation and propose bounded simulation, where an edge of the pattern graph can be matched to a path whose length does not exceed a given constraint value *k*. Based on bounded simulation, Ma et al. [30] propose strong simulation, which preserves the topological structure of the pattern graph well. Big graph data carry a lot of attribute information on vertexes and edges, but these existing works do not consider this information. Liu et al. [13] take it into account to extend bounded simulation to MC-GPM, and propose a baseline algorithm based on exploration and a heuristic algorithm based on a data graph compression index (HAMC). Since the HAMC algorithm only considers the constraint conditions of the matching path, does not minimize the matching path length, and does not support a distributed computing structure, Liu et al. [31] propose the M-HAMC algorithm. Considering that the attribute values on vertexes and edges sometimes do not need to be exactly matched, Liu et al. [14] propose MFC-GPM on the basis of MC-GPM, together with the ETOF-K algorithm, which improves matching efficiency in two aspects: edge matching and edge connection. Based on the topologically ordered sequence of pattern graph vertexes, Liu et al. [32] propose the NTSS algorithm and optimize it with two measures: a caching mechanism and reverse edge matching. The caching mechanism avoids repeated calculation of the same candidate path in multiple matching subgraphs, and reverse edge matching prunes, in advance, the candidate set of an edge with an in-degree of 0.

## 3. GRAPH PATTERN MATCHING

GPM is to find all data subgraphs that satisfy the pattern graph $G_P$ in a given data graph $G_D$. In this section, we give the relevant definitions of data graphs, pattern graphs, and graph pattern matching in MKGs.

### 3.1 Data Graph and Pattern Graph

The related definitions of the data graph and the pattern graph are as follows.

#### 3.1.1 Data Graph

A data graph $G_D = (V, E, f_V^D, f_E^D)$ is a directed graph with vertex attributes and edge attributes, where

- $V$ is the set of vertices of the data graph; $E$ is the set of edges of the data graph, and $(v_i, v_j) \in E$ represents the directed edge from vertex $v_i \in V$ to vertex $v_j \in V$;
- $f_V^D$ is a function defined on $V$. $\forall v \in V$, $f_V^D(v)$ is the attribute set of $v$. In an MKG, each vertex $v$ has a label $\rho_r$, which represents the type of this vertex. Vertices with different values of $\rho_r$ carry different attributes in the attribute set $f_V^D(v)$. The value of $\rho_r$ can be DI, BI, MI, GW, OC, AL and PD;
- $f_E^D$ is a function defined on $E$. $\forall e \in E$, $f_E^D(e)$ is the attribute set of $e$. In an MKG, for a directed edge $(v_i, v_j)$, $f_E^D(v_i, v_j)$ only contains $\rho_{v_i v_j}^{pids}$, a list of patient numbers that identifies the patients from whom the vertex information comes.

**DI:** When the value of $\rho_r$ is DI, the attribute set $f_V^D(v)$ of vertex $v$ describes the diagnostic classification information of breast cancer, which includes the pathological information $\rho^{h}$, the T stage $\rho^{s}$, the tumor length $\rho^{TS}$, the N stage $\rho^{LN}$ describing the regional lymph nodes, and the M stage $\rho^{DM}$ describing distant metastasis. The value of $\rho^{h}$ can be 0, 1, 2, or 3, respectively representing “invasive ductal carcinoma”, “invasive lobular carcinoma”, “ductal carcinoma in situ”, and “lobular carcinoma in situ”; the value of $\rho^{s}$ can be 0, 1, 2, 3, or 4; $\rho^{TS}$ is a floating-point number in cm; the value of $\rho^{LN}$ can be N0, N1, N2, or N3; and the value of $\rho^{DM}$ can be M0 or M1.

**BI:** When the value of $\rho_r$ is BI, the attribute set $f_V^D(v)$ of vertex $v$ describes the basic information of the patient, which includes $\rho^{CN}$, $\rho^{CP}$ and $\rho^{age}$. $\rho^{CN}$ indicates whether the patient currently needs care, and its value is true or false; $\rho^{CP}$ indicates whether the patient is currently pregnant, and its value is true or false; $\rho^{age}$ indicates the current age of the patient, and its value is a positive integer.

**MI:** When the value of $\rho_r$ is MI, the attribute set $f_V^D(v)$ of vertex $v$ describes the patient's menopausal information, and it only contains $\rho^{MS}$. The value of $\rho^{MS}$ can be 0, 1, or 2, respectively indicating pre-menopausal, perimenopausal, and post-menopausal.

**GW:** When the value of $\rho_r$ is GW, the attribute set $f_V^D(v)$ of vertex $v$ describes the patient's current overall well-being, and it only contains $\rho^{GWB}$. The value of $\rho^{GWB}$ can be 0, 1, 2, 3, or 4, respectively meaning “fully active, no complaints or symptoms”, “doing normal activities requires a little effort”, “occasionally needs help, but can meet most personal needs”, “needs a lot of assistance and frequent medical care”, and “completely disabled, can only lie in a bed or a chair”.

**OC:** When the value of $\rho_r$ is OC, the attribute set $f_V^D(v)$ of vertex $v$ describes whether the patient has cancers other than breast cancer, which includes $\rho^{OCS}$ and $\rho^{OCN}$. When the value of $\rho^{OCS}$ is false, the value of $\rho^{OCN}$ is none; when the value of $\rho^{OCS}$ is true, the value of $\rho^{OCN}$ is the names of the patient's other cancers.

**AL:** When the value of $\rho_r$ is AL, the attribute set $f_V^D(v)$ of vertex $v$ describes the patient's axillary lymph nodes, which includes $\rho^{LN}$, $\rho^{LS}$, $\rho^{IN}$, $\rho^{SN}$ and $\rho^{CW}$. The value of $\rho^{LS}$ is true or false, indicating whether the patient's axillary lymph nodes are normal. The value of $\rho^{IN}$ is true or false, indicating whether the patient's supraclavicular lymph nodes are normal. The value of $\rho^{SN}$ is true or false, indicating whether the patient's subclavian lymph nodes are normal. The value of $\rho^{CW}$ is true or false, indicating whether the patient's chest wall is normal. The value of $\rho^{LN}$ is a positive integer giving how many of the patient's supraclavicular lymph nodes, subclavian lymph nodes, and chest wall have problems.

**PD:** When the value of $\rho_r$ is PD, the attribute set of vertex $v$ describes diagnosis information from the patient's past, which includes $\rho^{aid}$, $\rho^{ane}$, $\rho^{aut}$, $\rho^{lun}$, $\rho^{dia}$, $\rho^{car}$, $\rho^{ost}$ and $\rho^{rep}$. Each value is true or false: $\rho^{aid}$ indicates whether the patient has AIDS; $\rho^{ane}$ indicates whether the patient has anemia; $\rho^{aut}$ indicates whether the patient has autoimmune disease; $\rho^{lun}$ indicates whether the patient has lung cancer; $\rho^{dia}$ indicates whether the patient has diabetes; $\rho^{car}$ indicates whether the patient has cardiovascular disease; $\rho^{ost}$ indicates whether the patient has osteoporosis; and $\rho^{rep}$ indicates whether the patient's reproductive organs are diseased.
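To make the definition concrete, the following is a minimal Java sketch of how such a data graph could be held in memory. It is not the authors' implementation: the class, field, and method names (`MkgDataGraphSketch`, `Vertex`, `Edge`, `successorEdges`) are hypothetical, attributes are stored as a plain name-to-value map, and the attribute values in `main` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical container types for an MKG data graph; every name here is illustrative. */
final class MkgDataGraphSketch {

    /** The seven vertex types that the label rho_r can take. */
    enum VertexType { DI, BI, MI, GW, OC, AL, PD }

    /** A vertex: its type label plus the remaining attributes, keyed by attribute name. */
    static final class Vertex {
        final String id;
        final VertexType type;
        final Map<String, Object> attributes;

        Vertex(String id, VertexType type, Map<String, Object> attributes) {
            this.id = id;
            this.type = type;
            this.attributes = new HashMap<>(attributes);
        }
    }

    /** A directed edge whose only attribute is the set of patient numbers (pids). */
    static final class Edge {
        final Vertex from;
        final Vertex to;
        final Set<Integer> pids;

        Edge(Vertex from, Vertex to, Set<Integer> pids) {
            this.from = from;
            this.to = to;
            this.pids = new TreeSet<>(pids);
        }
    }

    private final List<Edge> edges = new ArrayList<>();

    Edge addEdge(Vertex from, Vertex to, Set<Integer> pids) {
        Edge edge = new Edge(from, to, pids);
        edges.add(edge);
        return edge;
    }

    /** Edges leaving the given vertex (its successor edges). */
    List<Edge> successorEdges(Vertex v) {
        List<Edge> result = new ArrayList<>();
        for (Edge e : edges) {
            if (e.from == v) {
                result.add(e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        MkgDataGraphSketch graph = new MkgDataGraphSketch();
        // Attribute values below are invented for illustration only.
        Vertex a1 = new Vertex("A1", VertexType.DI,
                Map.of("h", 0, "s", 2, "TS", 2.5, "LN", "N1", "DM", "M0"));
        Vertex b1 = new Vertex("B1", VertexType.BI,
                Map.of("CN", false, "CP", false, "age", 52));
        graph.addEdge(a1, b1, Set.of(1375));
        System.out.println(graph.successorEdges(a1).get(0).pids);
    }
}
```

Keeping the patient numbers as a set on each edge is a design choice of this sketch; it makes the later "does this edge carry patient *p*" test a single lookup.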

#### 3.1.2 Pattern Graph

A pattern graph $G_P = (V_P, E_P, f_V^P, f_E^P, f_l^P, f_m^P)$ is a directed graph with vertex attributes and edge attributes, where:

- $V_P$ is the set of vertices of the pattern graph; $E_P$ is the set of edges of the pattern graph, and $(u_i, u_j) \in E_P$ represents the directed edge from vertex $u_i \in V_P$ to vertex $u_j \in V_P$;
- $f_V^P$ is a function defined on $V_P$, and $\forall u \in V_P$, $f_V^P(u)$ is the attribute set of $u$. In an MKG, the attribute set $f_V^P(u)$ of vertex $u$ has the same meaning as the attribute set of a vertex in the data graph described above;
- $f_E^P$ is a function defined on $E_P$. $\forall e \in E_P$, $f_E^P(e)$ is the attribute set of $e$. In an MKG, for a directed edge $(u_i, u_j)$, $f_E^P(u_i, u_j)$ only contains $\rho_{u_i u_j}^{pids}$, a list of patient numbers that identifies the patients from whom the vertex information comes;
- $f_l^P$ is a function defined on $E_P$. $\forall (u_i, u_j) \in E_P$, $f_l^P(u_i, u_j)$ is the length constraint of the edge $(u_i, u_j)$. Its value is either a positive integer $k$, indicating that the length of the path from $v_i$ to $v_j$ does not exceed $k$, or the symbol $*$, indicating that there is no length limit. In an MKG, $f_l^P(u_i, u_j) = 1$;
- $f_m^P$ is a set of membership constraint functions defined on vertex attributes and edge attributes.

#### 3.1.3 Fuzzy constraints

During matching, it is desirable to obtain as many good matching results as possible, because each matched subgraph corresponds to a patient whose health information is roughly the same as that of the patient to be diagnosed in the pattern graph. The more matches are obtained, the more experience can be used for reference when treating the patient corresponding to the pattern graph. In practice, however, a subgraph of the data graph may satisfy most constraints well and still fail to become a matching result because some less important attribute constraint on a vertex is not satisfied. In addition, some attribute constraints on vertexes do not need to be matched exactly; their differences only need to fall within a certain range. Therefore, we introduce fuzzy constraints to GPM in MKGs.

In the MKG, a fuzzy constraint is introduced only for the age attribute, so the membership function set is $f_m^P = \{f_{age}^m\}$. $f_{age}^m$ represents the membership function defined on the vertex age attribute $\rho^{age}$, and its constraint value is set to 3. The membership function $f_{age}^m$ is defined as Eq. (1), where $abs$ is the absolute value function, $\rho_v^{age}$ represents the age attribute constraint value of vertex $v$ in pattern graph $G_P$, and $\rho_u^{age}$ represents the age attribute constraint value of vertex $u$ in data graph $G_D$.

$$f_{age}^m = abs(\rho_v^{age} - \rho_u^{age}) \tag{1}$$

During matching, the age attribute $\rho^{age}$ only needs to satisfy $f_{age}^m \le 3$.
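Assuming Eq. (1) is the absolute difference between the two ages, as the calculation in Example 2 suggests, the fuzzy age check can be sketched in a few lines of Java. The class and method names are hypothetical, and the ages in `main` are invented.

```java
/** Sketch of the fuzzy age check; class and method names are illustrative. */
final class AgeMembershipSketch {

    /** Membership constraint value stated in the text. */
    static final int AGE_CONSTRAINT = 3;

    /** f_age^m: absolute difference between the pattern-side and data-side ages. */
    static int ageMembership(int patternAge, int dataAge) {
        return Math.abs(patternAge - dataAge);
    }

    /** True when the data-graph age is close enough to the pattern-graph age. */
    static boolean satisfiesAgeConstraint(int patternAge, int dataAge) {
        return ageMembership(patternAge, dataAge) <= AGE_CONSTRAINT;
    }

    public static void main(String[] args) {
        System.out.println(satisfiesAgeConstraint(45, 46)); // difference of 1
        System.out.println(satisfiesAgeConstraint(45, 49)); // difference of 4
    }
}
```

With the constraint value fixed at 3, a data vertex whose age differs from the pattern age by at most three years passes the check, and a larger difference fails it.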

### 3.2 Pattern Matching

The matched subgraph $G_{sub} = (V_{sub}, E_{sub}, f_{V_{sub}}^D, f_{E_{sub}}^D)$ is a subgraph of the data graph $G_D$ that matches the pattern graph $G_P$, where $G_{sub} \subset G_D$, $V_{sub} \subset V$, $E_{sub} \subset E$, $f_{V_{sub}}^D \subset f_V^D$, and $f_{E_{sub}}^D \subset f_E^D$. The matched subgraph may not be unique. The definition of pattern matching in the MKG is as follows.

For a pattern graph $G_P = (V_P, E_P, f_V^P, f_E^P, f_l^P, f_m^P)$ and a data graph $G_D = (V, E, f_V^D, f_E^D)$, $G_D$ matches $G_P$, denoted as $G_P \trianglelefteq G_D$, if there is a binary relation $S \subseteq V_P \times V$ such that:

- for all $u \in V_P$, there is $v \in V$ such that $(u, v) \in S$, which means that there is a vertex $v$ in $V$ that matches $u$, that is, $v$ satisfies $f_V^P(u)$. If the age attribute $\rho_u^{age}$ is included, $f_m^P = \{f_{age}^m\}$ represents the membership function defined on the age attribute of $u$, and $\rho_u^{age}$ only needs to satisfy $f_{age}^m \le 3$. Except for the age attribute $\rho_u^{age}$, the values of the other attributes of $v$ must be equal to the values of the corresponding attributes of $u$ before it can be determined that $v$ matches $u$;
- for each pair $(u_i, v_i) \in S$,
  - $u_i \sim v_i$, and
  - for each edge $(u_i, u_j)$ in $E_P$, there is a path from $v_i$ to $v_j$ in $G_D$ such that $(u_j, v_j) \in S$. Because $f_l^P(u_i, u_j) = 1$, this path can be regarded as the edge from $v_i$ to $v_j$ in $G_D$.
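The vertex-level condition in this definition (same type, equal attributes, age allowed to differ by at most 3) can be sketched as a single predicate. This is an illustrative Java sketch rather than the paper's code: attributes are assumed to be stored in a name-to-value map, the key `"age"` and all identifiers are hypothetical, and the type label is passed as a plain string.

```java
import java.util.Map;
import java.util.Objects;

/** Sketch of the vertex matching test; all identifiers are illustrative. */
final class VertexMatchSketch {

    static final String AGE = "age";
    static final int AGE_CONSTRAINT = 3;

    /**
     * Returns true when the data vertex satisfies the pattern vertex: same type label,
     * every non-age pattern attribute equal, and ages within the fuzzy constraint.
     */
    static boolean matches(String patternType, Map<String, Object> patternAttrs,
                           String dataType, Map<String, Object> dataAttrs) {
        if (!patternType.equals(dataType)) {
            return false;
        }
        for (Map.Entry<String, Object> entry : patternAttrs.entrySet()) {
            Object expected = entry.getValue();
            Object actual = dataAttrs.get(entry.getKey());
            if (AGE.equals(entry.getKey())) {
                if (!(expected instanceof Integer) || !(actual instanceof Integer)) {
                    return false;
                }
                int patternAge = (Integer) expected;
                int dataAge = (Integer) actual;
                if (Math.abs(patternAge - dataAge) > AGE_CONSTRAINT) {
                    return false;
                }
            } else if (!Objects.equals(expected, actual)) {
                return false;
            }
        }
        return true;
    }
}
```

Dropping the `AGE` branch turns the predicate into exact matching, which corresponds to the behaviour without fuzzy constraints.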

**Example 1**: As shown in Figure 1, $G_D$ is a data graph composed of related information of multiple breast cancer patients. The attribute information of some vertexes in the data graph stores the diagnostic classification information of breast cancer. In the data graph $G_D$, each vertex represents some information of a patient. For the function $f_E^D(A_1, B_1)$ defined on the directed edge $(A_1, B_1)$ in $G_D$, $f_E^D(A_1, B_1)$ only contains the attribute $\rho_{A_1B_1}^{pids}$. For example, the value of $\rho_{A_1B_1}^{pids}$ is 1375, which means that the relevant information on vertex $B_1$ comes from the breast cancer patient numbered 1375. The pattern graph $G_P$ is the health status of a patient to be diagnosed. The vertices $B$, $C$, $D$, $E$, $F$, and $G$ respectively represent the patient's basic information, menopausal status, general well-being, information on cancers other than breast cancer, axillary lymph nodes, and information about past diagnoses. Vertex $A$ is the diagnostic information of this patient, but it is unknown and needs to be obtained through GPM. Since all vertex information in the pattern graph comes from the same patient, we need to find a patient number as the attribute constraint information on the edges to get the matching result of the pattern graph.

##### Data graph and pattern graph in an MKG

**Example 2:** As shown in Figure 1, it is easy to find a subgraph $M_{sub1}$ of data graph $G_D$ that matches pattern graph $G_P$. $M_{sub1}$ passes through vertexes $A_2$, $B_2$, $C_1$, $D_2$, $E_1$, $F_2$ and $G_3$. Vertex $A_2$ is the breast cancer diagnosis result of the pattern graph $G_P$. The attribute constraint value on the edges in $M_{sub1}$ is 2384, which means that the health status of the patient numbered 2384 is close to that of the patient corresponding to $G_P$.

After introducing fuzzy constraints, since $f_{age}^m = abs(\rho_B^{age} - \rho_{B_3}^{age}) = 1$, the membership function constraint value 3 on the age attribute is not exceeded. In addition, the remaining attributes of $B$ and $B_3$ agree (for example, $\rho_B^{CN} = \rho_{B_3}^{CN}$ and $\rho_B^{CP} = \rho_{B_3}^{CP}$), so vertex $B$ matches vertex $B_3$. We therefore get a new matched subgraph $M_{sub2}$ that passes through vertexes $A_2$, $B_3$, $C_1$, $D_2$, $E_1$, $F_2$ and $G_3$. The attribute constraint value on the edges in $M_{sub2}$ is 676.

## 4. GRAPH PATTERN MATCHING IN MEDICAL KNOWLEDGE GRAPHS

In this section, we propose a multi-threaded bidirectional routing exploration algorithm M-TBRE to solve the GPM problem in MKGs.

### 4.1 Algorithm Description

Multi-core CPUs make it possible to process tasks in parallel and speed up program execution. Since the multi-constrained GPM problem is NP-complete, we adopt multi-threading to speed up matching and return the matched results quickly. The matching process follows the idea of divide and conquer. A pattern graph $G_P$ can be divided into several pattern subgraphs. After each pattern subgraph has been matched in the data graph $G_D$, the matched results of the pattern subgraphs are joined to obtain the matched results of the pattern graph $G_P$. The matching of the pattern subgraphs can be handed to multiple threads as independent subtasks, so that matching results are obtained quickly.

### 4.2 Algorithm Flow

In the M-TBRE algorithm, since the pattern graph of the MKG can be regarded as a path, we can split the pattern graph at the intermediate vertex of this path and obtain two pattern subgraphs. Next, the two pattern subgraphs are matched, and their matched results are joined to obtain the matched results of the pattern graph.

The detailed steps of the M-TBRE algorithm are shown in Algorithm 1. First, the intermediate vertex $V_P^{mid}$ of the pattern graph $G_P$ and the candidate vertex set $cand_{mid}$ of $V_P^{mid}$ need to be obtained, as shown in lines 1-2. In line 3, *pool* and *tempInfo* represent the thread pool and the temporary result of the matching, respectively. The number of threads in *pool* can be set according to the actual situation. Then the pattern graph $G_P$ is divided into two sub-pattern graphs $G_P^{sub1}$ and $G_P^{sub2}$ with the intermediate vertex $V_P^{mid}$ as the dividing point, and the two sub-pattern graphs are matched in the data graph $G_D$. The candidate vertex set $cand_{mid}$ of $V_P^{mid}$ is traversed to complete the matching of the sub-pattern graphs $G_P^{sub1}$ and $G_P^{sub2}$, as shown in lines 4-27. For each element $cand_{mid}[i]$ in $cand_{mid}$, the attribute constraint $\rho_{pe}^{pids}$ on each forward edge $e_D^{pe}$ of $cand_{mid}[i]$ is intersected with the attribute constraint $\rho_{ae}^{pids}$ on each successor edge $e_D^{ae}$ to obtain $\rho^{pids}$, which stores the patient numbers common to the current forward edge $e_D^{pe}$ and the current successor edge $e_D^{ae}$, as shown in lines 6-25. For each patient number $\rho^{pid}$ in $\rho^{pids}$, $\rho^{pid}$ is taken as the attribute constraint on the edges to complete the matching of $G_P^{sub1}$ and $G_P^{sub2}$, as shown in lines 14-21. *tempInfo* stores the partial matched result with $\rho^{pid}$ as the edge attribute constraint, as shown in lines 15-16. The thread pool submits the subtasks MC-SEM and MC-FEM to complete the matching of $G_P^{sub1}$ and $G_P^{sub2}$, respectively, as shown in lines 19-20. The algorithm RM merges the matched results, as shown in line 28.

The MC-SEM algorithm completes the matching of the pattern subgraph $G_P^{sub1}$, where $v_P^{curr}$, $v_D^{curr}$, $\rho^{pid}$ and *tempInfo* respectively represent the pattern vertex to be matched, the candidate vertex of the pattern vertex $v_P^{curr}$, the attribute constraint value (patient number) of the edge, and the temporary result of the matching. If vertex $v_D^{curr}$ matches vertex $v_P^{curr}$ but $v_P^{curr}$ does not have a successor edge, that is, the out-degree of $v_P^{curr}$ is 0, the matching of the pattern subgraph $G_P^{sub1}$ is completed, and the matched result obtained with $\rho^{pid}$ as the attribute constraint on the edges is saved in *tempInfo*, as shown in lines 2-7 of Algorithm 2. If vertex $v_D^{curr}$ matches vertex $v_P^{curr}$ and $v_P^{curr}$ has a successor edge, then each successor edge $e_D^{ae}$ of $v_D^{curr}$ is traversed. When the attribute constraint $\rho_{ae}^{pids}$ on $e_D^{ae}$ includes $\rho^{pid}$, the matching of the pattern vertex $e_P^{ae}.tailNode$ is completed recursively, as shown in lines 8-16.
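The dispatch step described above (intersect the patient numbers of a forward edge and a successor edge, then hand the two half-matches for each common patient number to a thread pool) can be sketched with the standard `java.util.concurrent` API. This is a simplified illustration, not Algorithm 1 itself: the two half-matchers are passed in as `IntPredicate` placeholders, and all identifiers are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntPredicate;

/** Sketch of the per-candidate dispatch step; all identifiers are illustrative. */
final class MidVertexDispatchSketch {

    /** Patient numbers shared by a forward edge and a successor edge of the middle vertex. */
    static Set<Integer> commonPids(Set<Integer> forwardEdgePids, Set<Integer> successorEdgePids) {
        Set<Integer> common = new TreeSet<>(forwardEdgePids);
        common.retainAll(successorEdgePids);
        return common;
    }

    /**
     * Submits both half-matches for every pid to a fixed-size pool and keeps the pids
     * for which both halves report a match. matchSub1 and matchSub2 stand in for the
     * two sub-pattern matchers.
     */
    static List<Integer> matchBothHalves(Set<Integer> pids, IntPredicate matchSub1,
                                         IntPredicate matchSub2, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Integer> order = new ArrayList<>(pids);
            List<Future<Boolean>> sub1 = new ArrayList<>();
            List<Future<Boolean>> sub2 = new ArrayList<>();
            for (int pid : order) {
                Callable<Boolean> first = () -> matchSub1.test(pid);
                Callable<Boolean> second = () -> matchSub2.test(pid);
                sub1.add(pool.submit(first));
                sub2.add(pool.submit(second));
            }
            List<Integer> matched = new ArrayList<>();
            for (int i = 0; i < order.size(); i++) {
                if (sub1.get(i).get() && sub2.get(i).get()) {
                    matched.add(order.get(i));
                }
            }
            return matched;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("matching was interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a matching subtask failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The `threads` argument plays the role of the configurable pool size; setting it to 1 serialises the subtasks without changing the result.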

The MC-FEM algorithm completes the matching of the pattern subgraph $G_P^{sub2}$. Its processing is similar to that of the MC-SEM algorithm, except that MC-FEM matches the pattern subgraph $G_P^{sub2}$ according to a reverse depth-first search strategy.
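Because the two matchers differ only in search direction, both can be illustrated by one recursive routine with a direction flag. The Java sketch below makes strong simplifying assumptions: vertices are plain strings, a data vertex "matches" a pattern label when its name starts with that label, and `toyEdges()` is invented data loosely modelled on the vertex names of Figure 1. None of these identifiers come from the paper.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Sketch of pid-constrained depth-first matching along a path pattern. */
final class DirectionalPathMatchSketch {

    /** A directed data-graph edge carrying the patient numbers (pids) it belongs to. */
    static final class Edge {
        final String from;
        final String to;
        final Set<Integer> pids;

        Edge(String from, String to, Set<Integer> pids) {
            this.from = from;
            this.to = to;
            this.pids = pids;
        }
    }

    /** Simplifying assumption: vertex "D2" matches pattern label "D". */
    static boolean vertexMatches(String patternLabel, String dataVertex) {
        return dataVertex.startsWith(patternLabel);
    }

    /**
     * Matches pattern.get(index) against current, then recurses along edges that carry pid.
     * pattern lists labels from the middle vertex outward; forward = true follows successor
     * edges, forward = false follows predecessor edges (the reverse search).
     * On success, path holds the matched data vertices in visiting order.
     */
    static boolean match(List<Edge> edges, List<String> pattern, int index, String current,
                         int pid, boolean forward, List<String> path) {
        if (!vertexMatches(pattern.get(index), current)) {
            return false;
        }
        path.add(current);
        if (index == pattern.size() - 1) {
            return true; // no further pattern edge in this direction
        }
        for (Edge e : edges) {
            String here = forward ? e.from : e.to;
            String next = forward ? e.to : e.from;
            if (here.equals(current) && e.pids.contains(pid)
                    && match(edges, pattern, index + 1, next, pid, forward, path)) {
                return true;
            }
        }
        path.remove(path.size() - 1); // backtrack
        return false;
    }

    /** Invented toy edges for illustration. */
    static List<Edge> toyEdges() {
        List<Edge> edges = new ArrayList<>();
        edges.add(new Edge("A2", "B2", Set.of(2384)));
        edges.add(new Edge("A2", "B3", Set.of(676)));
        edges.add(new Edge("B2", "C1", Set.of(2384)));
        edges.add(new Edge("B3", "C1", Set.of(676)));
        edges.add(new Edge("C1", "D2", Set.of(676, 2384)));
        edges.add(new Edge("D2", "E1", Set.of(676, 2384)));
        edges.add(new Edge("E1", "F2", Set.of(676, 2384)));
        edges.add(new Edge("F2", "G3", Set.of(676, 2384)));
        return edges;
    }
}
```

Calling `match` from `"D2"` with labels `D, E, F, G` and `forward = true` walks the successor side; the same call with labels `D, C, B, A` and `forward = false` walks the predecessor side for the chosen patient number.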

The RM algorithm joins the matching results of pattern subgraphs $G_P^{sub1}$ and $G_P^{sub2}$. When a given value is used as the attribute constraint on all edges and the flag bits representing the matching results of $G_P^{sub1}$ and $G_P^{sub2}$ are both 1, the combination of the matching results of $G_P^{sub1}$ and $G_P^{sub2}$ is a matching result of the pattern graph $G_P$, as shown in lines 4-6 of Algorithm 4.
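The join can be pictured as follows: for each patient number, keep the pair of partial results only if both flags are set, then stitch the two vertex sequences together at the shared middle vertex. This Java sketch is an assumption about how such a merge might look, not a transcription of Algorithm 4; `HalfResult`, `merge`, and the representation of a partial result as a vertex list running from the middle vertex outward are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of joining the two partial results per patient number. */
final class ResultMergeSketch {

    /** Partial result of one pattern subgraph: a flag and the vertices from the middle outward. */
    static final class HalfResult {
        final boolean matched;
        final List<String> path;

        HalfResult(boolean matched, List<String> path) {
            this.matched = matched;
            this.path = new ArrayList<>(path);
        }
    }

    /**
     * For every pid whose two flags are both set, reverses the predecessor-side path and
     * appends the successor-side path without repeating the shared middle vertex.
     */
    static Map<Integer, List<String>> merge(Map<Integer, HalfResult> predecessorSide,
                                            Map<Integer, HalfResult> successorSide) {
        Map<Integer, List<String>> full = new LinkedHashMap<>();
        for (Map.Entry<Integer, HalfResult> entry : predecessorSide.entrySet()) {
            HalfResult back = entry.getValue();
            HalfResult front = successorSide.get(entry.getKey());
            if (front == null || !back.matched || !front.matched) {
                continue; // at least one half failed for this pid
            }
            List<String> joined = new ArrayList<>(back.path);
            Collections.reverse(joined); // now ends at the middle vertex
            joined.addAll(front.path.subList(1, front.path.size()));
            full.put(entry.getKey(), joined);
        }
        return full;
    }
}
```

A patient number that succeeds on only one side is simply dropped, which mirrors the requirement that both flag bits be 1.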

**Example 3:** In this example, $G_P$ and $G_D$ in Figure 1 are the pattern graph and the data graph, respectively. First, we obtain the intermediate vertex $D$ of pattern graph $G_P$ and the candidate vertex set $cand_{mid} = \{D_2\}$ of $D$. The pattern graph $G_P$ is divided into pattern subgraph $G_P^{sub2}$, which passes through vertexes $A$, $B$ and $C$, and pattern subgraph $G_P^{sub1}$, which passes through vertexes $E$, $F$ and $G$. The forward edge $(C_1, D_2)$ and the successor edge $(D_2, E_1)$ of $D_2$ have the same attribute constraint $\rho^{pids} = \{676, 2384\}$. Taking the matching of $G_P^{sub2}$ as an example, the attribute constraint $\rho_{C_1D_2}^{pids} = \{676, 2384\}$ on edge $(C_1, D_2)$ contains $\rho^{pid} = 2384$, and $C_1$ matches $C$ at the same time, so $(C_1, D_2, G_D) \simeq (C, D, G_P)$. Similarly, we can get $(B_2, C_1, G_D) \simeq (B, C, G_P)$ and $(A_2, B_2, G_D) \simeq (A, B, G_P)$. The matching of $G_P^{sub2}$ takes $\rho^{pid} = 2384$ as the attribute constraint, and the attribute constraint information of vertex $A_2$ is the diagnosis classification result of $G_P^{sub2}$. In the same way, we can get the matched results $(D_2, E_1, G_D) \simeq (D, E, G_P)$, $(E_1, F_2, G_D) \simeq (E, F, G_P)$ and $(F_2, G_3, G_D) \simeq (F, G, G_P)$ of $G_P^{sub1}$ when $\rho^{pid} = 2384$ is the attribute constraint on the edges in $E_P$. Both $G_P^{sub1}$ and $G_P^{sub2}$ have matched results, so the diagnostic classification information of $G_P$ is the diagnostic classification information of the matched vertex $A_2$, and the corresponding patient number in $G_D$ is 2384.

Finally, two matching subgraphs are obtained through the M-TBRE algorithm. $M_{sub1} = \{V_M, E_M, f_v, f_e\}$, where $V_M = \{A_2, B_2, C_1, D_2, E_1, F_2, G_3\}$, $f_e = \{2384\}$ and $E_M = \{(A_2, B_2), (B_2, C_1), (C_1, D_2), (D_2, E_1), (E_1, F_2), (F_2, G_3)\}$. $M_{sub2} = \{V_M, E_M, f_v, f_e\}$, where $V_M = \{A_2, B_3, C_1, D_2, E_1, F_2, G_3\}$, $f_e = \{676\}$ and $E_M = \{(A_2, B_3), (B_3, C_1), (C_1, D_2), (D_2, E_1), (E_1, F_2), (F_2, G_3)\}$. The patients numbered 676 and 2384 can be used for reference when treating the patient corresponding to the pattern graph $G_P$.

## 5. EXPERIMENTS

In this section, we conduct experiments on two public MKGs. The details of these two datasets are shown in Table 1. We implement the proposed M-TBRE algorithm to complete pattern matching in MKGs. Since the M-TBRE algorithm divides the pattern graph into two sub-pattern graphs, for each edge attribute constraint value $\rho^{pid}$, the matching of these two sub-pattern graphs is submitted to the thread pool as subtasks. The number of threads in the thread pool affects the execution time of the algorithm, so we vary the number of threads to measure how the efficiency of the M-TBRE algorithm changes. In addition, to obtain more matched results, we introduce a fuzzy constraint, namely a membership function for the age attribute of the vertexes. During matching, the age attribute on a data graph vertex does not need to equal the age attribute in the pattern graph; the value computed by the membership function only needs to satisfy the membership constraint value. Adding this constraint to the M-TBRE algorithm gives the Fuzzy-M-TBRE algorithm. Comparing the Fuzzy-M-TBRE and M-TBRE algorithms shows the effectiveness of the introduced fuzzy constraints.

| Dataset | Vertices | Edges | Description |
|---|---|---|---|
| Female-breast-cancer-2013a | 10812 | 20366 | A graph about breast cancer patients |
| Breastcancer-femalepatient-2016A | 101221 | 200845 | A graph about breast cancer patients |


### 5.1 Experimental Settings and Implementation

The MKGs used in the experiments are about breast cancer. Dataset-1 and dataset-2 denote the dataset Female-breast-cancer-2013a and the dataset Breastcancer-femalepatient-2016A, respectively. Dataset-1 is composed of the physical condition information of 10,000 breast cancer patients, and dataset-2 of that of 100,000 breast cancer patients. Several pattern graphs are used in our experiments, all of them similar to the pattern graph shown in Figure 1. Our membership function applies only to the age attribute of a vertex, and the membership constraint value is set to 3. Both M-TBRE and Fuzzy-M-TBRE are implemented in Java and run on a PC with an Intel(R) Core(TM) i9-10900F CPU @ 2.81 GHz, 32 GB RAM and the Windows 10 operating system.

### 5.2 Experimental Results and Analysis

#### 5.2.1 Experiments on Execution Time

This experiment studies how the execution time of GPM changes when different numbers of threads are set for the thread pool used in the M-TBRE algorithm. To reduce measurement noise, each reported result is the arithmetic mean of 10 runs.

In Figure 2 and Figure 3, the abscissa represents the different pattern graphs, and the ordinate represents the matching time of these pattern graphs. M-TBRE-1 denotes the M-TBRE algorithm with the number of threads in the thread pool set to 1. Since the pattern graph in the MKG is a path, the edge join strategy proposed in the ETOF-K algorithm does not take effect during matching, so the performance of the ETOF-K algorithm and the M-TBRE-1 algorithm is almost the same on dataset-1 and dataset-2. The reverse matching strategy of the NTSS algorithm is likewise ineffective during matching, but its caching mechanism avoids repeated computation of the same path, so the NTSS algorithm performs better than the ETOF-K algorithm and the M-TBRE-1 algorithm on both datasets.

##### Figure 2. Matching time of different pattern graphs on Female-breast-cancer-2013a.

##### Figure 3. Matching time of different pattern graphs on Breastcancer-femalepatient-2016A.

However, the M-TBRE algorithm extends naturally to multiple threads: M-TBRE-2, M-TBRE-4 and M-TBRE-8 denote the algorithm with the number of threads in the thread pool set to 2, 4 and 8, respectively. As can be seen from Figure 2 and Figure 3, M-TBRE-2 already outperforms the NTSS algorithm, which demonstrates the effectiveness of the proposed M-TBRE algorithm. In addition, Table 2 and Table 3 list the detailed execution times in seconds, and Table 4 compares the average execution times of the four M-TBRE variants on the two datasets. On both datasets, the execution time decreases steadily as the number of threads increases.

For dataset-1, M-TBRE-2 reduces the execution time by 43.11% compared with M-TBRE-1, and M-TBRE-4 reduces it by 39.56% compared with M-TBRE-2, but M-TBRE-8 reduces it by only 27.81% compared with M-TBRE-4. This is because dataset-1 is small in scale, so the time spent on thread context switching and system state transitions accounts for a large proportion of the total time.

Dataset-2 is larger in scale, and the total execution time of the algorithm is correspondingly longer. The time spent on thread context switching and system state transitions therefore accounts for a relatively small proportion of the total time, and M-TBRE-8 still reduces the execution time by 36.06% compared with M-TBRE-4 on dataset-2.
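The percentages quoted above are relative reductions in average execution time, that is, (t_before - t_after) / t_before. As a quick check, the short sketch below recomputes them from the average times reported in Table 4; the helper name `relative_reduction` and the dictionary layout are illustrative choices and do not come from the paper.

```python
def relative_reduction(t_before, t_after):
    """Fraction of execution time saved when going from t_before to t_after."""
    return (t_before - t_after) / t_before


# Average execution times in seconds, copied from Table 4 (key = thread count).
avg_time = {
    "dataset-1": {1: 0.1684, 2: 0.0958, 4: 0.0579, 8: 0.0418},
    "dataset-2": {1: 27.5845, 2: 16.4254, 4: 9.4538, 8: 6.0452},
}

for name, times in avg_time.items():
    for fewer, more in ((1, 2), (2, 4), (4, 8)):
        pct = 100 * relative_reduction(times[fewer], times[more])
        print(f"{name}: M-TBRE-{fewer} -> M-TBRE-{more}: {pct:.2f}% less time")
```

The printed values agree with the last column of Table 4 (43.11%, 39.56% and 27.81% for dataset-1; 40.45%, 42.44% and 36.06% for dataset-2).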

##### Table 2. Execution time (in seconds) of different pattern graphs on Female-breast-cancer-2013a.

Algorithm | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 |
---|---|---|---|---|---|---|---|---|
ETOF-K | 0.1625 | 0.1933 | 0.1742 | 0.1659 | 0.1743 | 0.1601 | 0.1742 | 0.1633 |
NTSS | 0.1398 | 0.1472 | 0.1109 | 0.1365 | 0.1247 | 0.0973 | 0.1148 | 0.1245 |
M-TBRE-1 | 0.1832 | 0.1869 | 0.1516 | 0.1807 | 0.1654 | 0.1560 | 0.1535 | 0.1700 |
M-TBRE-2 | 0.1161 | 0.1097 | 0.0905 | 0.1059 | 0.0890 | 0.0785 | 0.0789 | 0.0979 |
M-TBRE-4 | 0.0869 | 0.0709 | 0.0488 | 0.0595 | 0.0550 | 0.0459 | 0.0448 | 0.0512 |
M-TBRE-8 | 0.0750 | 0.0488 | 0.0430 | 0.0373 | 0.0366 | 0.0311 | 0.0294 | 0.0331 |

##### Table 3. Execution time (in seconds) of different pattern graphs on Breastcancer-femalepatient-2016A.

Algorithm | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 |
---|---|---|---|---|---|---|---|---|
ETOF-K | 25.1406 | 26.1477 | 29.3101 | 28.1559 | 26.7859 | 27.4765 | 28.6518 | 27.4800 |
NTSS | 20.7231 | 21.6519 | 19.9583 | 20.4127 | 19.8864 | 22.7456 | 19.7193 | 21.5986 |
M-TBRE-1 | 27.2807 | 27.6337 | 28.7748 | 27.8562 | 27.7634 | 27.1356 | 26.8715 | 27.3600 |
M-TBRE-2 | 16.2298 | 16.3104 | 16.1557 | 16.3067 | 16.6475 | 16.5609 | 16.4358 | 16.7567 |
M-TBRE-4 | 9.4020 | 9.7056 | 9.8746 | 9.9239 | 9.7943 | 9.0300 | 8.9316 | 8.9685 |
M-TBRE-8 | 5.8198 | 5.9732 | 6.0509 | 6.0256 | 5.9963 | 6.1483 | 6.1579 | 6.1895 |

##### Table 4. Average execution time (in seconds) of M-TBRE with different thread numbers, and the reduction between consecutive settings.

Dataset | M-TBRE-1 | M-TBRE-2 | M-TBRE-4 | M-TBRE-8 | Reduction |
---|---|---|---|---|---|
Female-breast-cancer-2013a | 0.1684 | 0.0958 | - | - | 43.11% |
Female-breast-cancer-2013a | - | 0.0958 | 0.0579 | - | 39.56% |
Female-breast-cancer-2013a | - | - | 0.0579 | 0.0418 | 27.81% |
Breastcancer-femalepatient-2016A | 27.5845 | 16.4254 | - | - | 40.45% |
Breastcancer-femalepatient-2016A | - | 16.4254 | 9.4538 | - | 42.44% |
Breastcancer-femalepatient-2016A | - | - | 9.4538 | 6.0452 | 36.06% |

For the proposed M-TBRE algorithm, execution becomes faster as the number of threads increases. Beyond a certain number of threads, however, the gain diminishes, as shown by M-TBRE-8 in Figure 2. If the dataset is larger, or if the M-TBRE algorithm submits more subtasks, the gain diminishes more slowly. Compared with M-TBRE-4, M-TBRE-8 reduces the execution time by 27.81% on dataset-1, but by 36.06% on dataset-2.
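The diminishing gain can also be expressed with two standard measures that the paper itself does not report: speedup (single-thread time divided by multi-thread time) and parallel efficiency (speedup divided by the number of threads). The sketch below computes both from the average times in Table 4; the function names are assumptions made for this illustration.

```python
def speedup(t_single, t_multi):
    """How many times faster the multi-threaded run is than the single-threaded one."""
    return t_single / t_multi


def efficiency(t_single, t_multi, num_threads):
    """Speedup per thread; 1.0 would mean perfect linear scaling."""
    return speedup(t_single, t_multi) / num_threads


# Average execution times in seconds, copied from Table 4 (key = thread count).
avg_time = {
    "dataset-1": {1: 0.1684, 2: 0.0958, 4: 0.0579, 8: 0.0418},
    "dataset-2": {1: 27.5845, 2: 16.4254, 4: 9.4538, 8: 6.0452},
}

for name, times in avg_time.items():
    for n in (2, 4, 8):
        s = speedup(times[1], times[n])
        e = efficiency(times[1], times[n], n)
        print(f"{name}, {n} threads: speedup {s:.2f}, efficiency {e:.2f}")
```

With 8 threads the speedup is about 4.03 on dataset-1 and about 4.56 on dataset-2, which is consistent with the observation that the larger dataset amortizes the thread management overhead better.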

#### 5.2.2 Experiments on Fuzzy Constraints

This experiment studies how the number of matched subgraphs changes when fuzzy constraints are introduced into the M-TBRE algorithm, which yields the Fuzzy-M-TBRE algorithm. Since the Fuzzy-M-TBRE algorithm returns all the matched results of a pattern graph, we can compare the total number of matches before and after the introduction of fuzzy constraints.

In Figure 4 and Figure 5, the abscissa represents the different pattern graphs, and the ordinate represents the number of matched subgraphs. On both dataset-1 and dataset-2, the Fuzzy-M-TBRE algorithm returns more matched results than the M-TBRE algorithm for the same pattern graph. Each matched subgraph corresponds to a breast cancer patient, so when treating the patient represented by the pattern graph, clinicians can refer to the treatment plans of the patients in the matched subgraphs. Introducing fuzzy constraints therefore makes more treatment options available for reference, which is why it is necessary to introduce fuzzy constraints into the M-TBRE algorithm.
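The reason fuzzy constraints can only add matches is easiest to see on a single vertex attribute. The sketch below is a simplified illustration and not the authors' implementation: it assumes the membership function on age is the absolute age difference and that a candidate satisfies the constraint when this value is at most the membership constraint value of 3. The `Patient` class, the function names and the sample data are all hypothetical, and only the attribute check is shown, not the full graph matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patient:
    patient_id: str
    age: int


def age_membership(pattern_age, candidate_age):
    """Assumed membership function: absolute age difference in years."""
    return abs(pattern_age - candidate_age)


def exact_matches(pattern_age, patients):
    """Candidates whose age equals the pattern age exactly (no fuzzy constraint)."""
    return [p for p in patients if p.age == pattern_age]


def fuzzy_matches(pattern_age, patients, constraint=3):
    """Candidates whose membership value satisfies the constraint value."""
    return [p for p in patients if age_membership(pattern_age, p.age) <= constraint]


# Hypothetical sample data, for illustration only.
patients = [Patient("a", 45), Patient("b", 47), Patient("c", 52), Patient("d", 43)]
exact = exact_matches(45, patients)
fuzzy = fuzzy_matches(45, patients)
```

Under these assumptions every exact match has a membership value of 0 and so also satisfies the fuzzy constraint, which means the fuzzy result set always contains the exact one. Here `exact` holds only patient `a`, while `fuzzy` holds `a`, `b` and `d`.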

##### Figure 4. The number of matched subgraphs of different pattern graphs on Female-breast-cancer-2013a.

##### Figure 5. The number of matched subgraphs of different pattern graphs on Breastcancer-femalepatient-2016A.

## 6. CONCLUSION

In this paper, we put forward the problem of GPM in MKGs and provide the related definitions. To solve this problem, we propose the M-TBRE algorithm, which divides the pattern graph into several pattern subgraphs, uses multi-threaded bidirectional routing to match the pattern subgraphs, and then merges the matching results. In addition, fuzzy constraints are introduced to obtain more matching subgraphs. Each matched subgraph corresponds to a past patient whose physical condition is the same as, or similar to, that of the patient represented by the pattern graph, so the treatment plans of these past patients can serve as a reference when treating the patient represented by the pattern graph. In this way, better and more effective treatment plans can be developed for that patient. We conduct verification experiments on the M-TBRE algorithm on two public MKG datasets. The experimental results show that, with two or more threads, the proposed M-TBRE algorithm achieves shorter execution times than the ETOF-K and NTSS algorithms. The results also demonstrate the necessity of introducing fuzzy constraints, since the Fuzzy-M-TBRE algorithm returns more matched subgraphs than the M-TBRE algorithm. In the future, we will further study and improve the M-TBRE algorithm, and investigate the dynamic graph pattern matching problem in MKGs, in which the content of the pattern graph changes over time.

## ACKNOWLEDGMENTS

This work has been supported by the National Natural Science Foundation of China under grants 62076087 & 61906059 and the Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT) of the Ministry of Education of China under grant IRT17R32.

The first author would like to thank his wife Jun Zhang, his parents and friends during his fight with lung adenocarcinoma. “I leave no trace of wings in the air, but I am glad I have had my flight.”

## AUTHOR CONTRIBUTIONS

All authors including L. Li (lilei@hfut.edu.cn), X. Du (dlx4339@163.com), Z. Zhang (zanzhang@hfut.edu.cn), and Z. Tao (zctao@ustc.edu.cn) took part in writing the paper. In addition, L. Li designed the algorithm and experiments, and provided the funding; X. Du designed and conducted experiments, and analyzed the data; Z. Tao analyzed the data.