Research on graph pattern matching (GPM) has attracted a lot of attention. However, most of this research has focused on complex networks, and there is little work on GPM in the medical field. Hence, this paper uses GPM to make a breast cancer-oriented diagnosis before surgery. Technically, this paper first gives a new definition of GPM, aiming to explore GPM in the medical field, especially in Medical Knowledge Graphs (MKGs). Then, in the specific matching process, this paper introduces fuzzy calculation, and proposes a multi-threaded bidirectional routing exploration (M-TBRE) algorithm based on depth-first search and a two-way routing matching algorithm based on multi-threading. In addition, fuzzy constraints are introduced in the M-TBRE algorithm, which leads to the Fuzzy-M-TBRE algorithm. The experimental results on two datasets show that, compared with existing algorithms, our proposed algorithm is more efficient and effective.

As a basic data structure, graphs are widely used in many applications. For example, in object anomaly checking, objects can be represented by graphs, and anomalies can then be discovered with a suitable graph algorithm [1]. Similarly, to determine whether a user is interested in a certain webpage, the webpage can be converted into multiple graphs; with these graphs taken as a bag, the bag can be classified and judged [2]. As a popular graph-based technology, graph pattern matching (GPM) has attracted a lot of attention. Specifically, given a pattern graph, GPM is the task of finding subgraphs of the data graph whose structure is the same as or similar to that of the pattern graph. As the research field of GPM has moved from the initial protein isomorphism [3, 4] to community detection [5, 6], expert discovery [7], recommendation systems [8], the discovery of social groups [9–11] and the identification of criminal groups [12], the definition of graph pattern has also changed accordingly.

Technically, GPM was originally defined based on subgraph isomorphism. Given a data graph GD and a pattern graph GP as input, it returns whether GD contains a subgraph that has exactly the same topological structure as GP. For example, we can infer the properties of unknown proteins from the properties of known proteins through this matching [3, 4]. However, traditional subgraph isomorphism is too strict. To extend the application scenarios of GPM, Fan et al. [12] propose bounded simulation, which extends edge-to-edge matching to matching an edge to a path of bounded length. However, this matching still does not make use of the rich attribute information on vertices and edges. Therefore, Liu et al. [13] propose the multi-constrained graph pattern matching (MC-GPM) problem to obtain more effective matching results. Afterwards, considering that some attributes do not require exact matching, Liu et al. [14] propose multiple fuzzy constrained graph pattern matching (MFC-GPM) based on MC-GPM. However, the current application scenarios of GPM are mostly concentrated on complex networks, and there are very few applications of GPM in the medical field, especially in Medical Knowledge Graphs.

Nowadays, the incidence of breast cancer is rising, and patients are being diagnosed at increasingly younger ages. Breast cancer can be divided into ductal carcinoma in situ, lobular carcinoma in situ, invasive ductal carcinoma, invasive lobular carcinoma, and so on. Each type of breast cancer can be further divided according to the primary tumor staging, regional lymph node staging, and distant metastasis staging. The purpose of this paper is to make a diagnosis through GPM technology before the patient's condition is confirmed by surgery.

In this paper, to introduce GPM into the medical field, we propose the problem of GPM in MKGs and give relevant definitions. In addition, the M-TBRE algorithm is proposed, which first divides the pattern graph into pattern subgraphs, then obtains the matching results of the pattern subgraphs, and finally merges these matching results. M-TBRE can give the diagnosis distribution of the pattern graph, and return the best k diagnosis classification results according to the frequency of each diagnosis classification. Fuzzy constraints are also introduced in the M-TBRE algorithm, which extends it to the Fuzzy-M-TBRE algorithm, and the effectiveness of our algorithm is verified on two public data sets.
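The frequency-based top-k selection mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `top_k_diagnoses` and the diagnosis labels are hypothetical, and the sketch assumes each matched subgraph contributes exactly one diagnosis label.

```python
from collections import Counter

def top_k_diagnoses(matched_diagnoses, k):
    """Return the k most frequent diagnosis labels with their counts."""
    return Counter(matched_diagnoses).most_common(k)

# Hypothetical labels, one per matched subgraph.
labels = ["IDC-T2", "IDC-T2", "DCIS-T1", "IDC-T2", "DCIS-T1", "ILC-T3"]
best_two = top_k_diagnoses(labels, 2)
```

Here `best_two` lists the two most frequent labels together with how often each occurred among the matches.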

The rest of this paper is organized as follows. The related work of GPM is reviewed in Section 2. Then in Section 3, the concept of pattern matching in MKGs is introduced. Section 4 proposes a multi-threaded bidirectional routing exploration algorithm and a Fuzzy-M-TBRE algorithm to process GPM in MKGs. Section 5 introduces the data sets and conducts experiments to verify our proposed Fuzzy-M-TBRE algorithm, and Section 6 concludes our work in this paper.

Depending on whether matching is judged by a bijective function or by a binary relation, research on GPM can be divided into isomorphism-based GPM and simulation-based GPM.

2.1 Isomorphism-Based GPM

Isomorphism-based GPM requires a bijective function between the pattern graph and the data graph, and the topological structure of the matched data subgraph and the pattern graph must be the same. Ullmann [15] first proposes a matching algorithm based on depth-first search. Cordella et al. [16] improve Ullmann's algorithm in terms of matching order and pruning strategy, and propose the VF2 algorithm. Tong et al. [17] propose the G-Ray method, which uses a goodness function to measure the degree of matching between a subgraph and the pattern graph, so that the optimal-k subgraphs can be returned. In addition, Cheng et al. [18] also propose a top-k matching algorithm, which sorts the matched subgraphs by the number of spanning trees, thereby returning the optimal-k subgraphs. Cheng et al. [19] propose the R-join algorithm based on the join index of the clustering graph and optimize the algorithm. Other representative algorithms include the DDST algorithm [20], the IncBMatch algorithm [21], and so on. Because the underlying problem is NP-complete, isomorphism-based GPM generally relies on indexing [22, 23] and on parallel and distributed processing [24–26] to improve the efficiency of matching.

Isomorphism-based GPM is mostly used in fields with strict structural requirements such as protein isomorphism, 3D object matching [27] and network abnormal behavior detection [28]. However, such matching is too strict for applications such as social networks or knowledge graphs that do not require strict matching accuracy. Therefore, simulation-based GPM research has emerged.

2.2 Simulation-Based GPM

Graph simulation, which judges matching through binary relations, is introduced by Henzinger et al. [29], but it is still an edge-to-edge matching, which cannot meet the requirements of many applications. Fan et al. [12] extend graph simulation and propose bounded simulation, where an edge of the pattern graph can be matched to a path whose length does not exceed a given constraint value k. Based on bounded simulation, Ma et al. [30] propose strong simulation, which can well preserve the topological structure of the pattern graph. There is a lot of attribute information on vertices and edges in big graph data, but these existing works do not consider this information. Liu et al. [13] take this information into account to extend bounded simulation to MC-GPM, and propose a baseline algorithm based on exploration and a heuristic algorithm based on a data graph compression index (HAMC). Since the HAMC algorithm only considers the constraint conditions of the matching path, does not consider minimizing the matching path length, and does not support a distributed computing structure, Liu et al. [31] propose the M-HAMC algorithm. Considering that the attribute values on vertices and edges sometimes do not need to be exactly matched, Liu et al. [14] propose MFC-GPM on the basis of MC-GPM, together with the ETOF-K algorithm, which improves matching efficiency from two aspects: edge matching and edge connection. Based on the topologically ordered sequence of pattern graph vertices, Liu et al. [32] propose the NTSS algorithm and optimize it with two measures: a caching mechanism and reverse edge matching. The caching mechanism avoids repeated calculation of the same candidate path in multiple matching subgraphs, and reverse edge matching prunes in advance the candidate set of each edge whose start vertex has in-degree 0.

GPM is to find all data subgraphs that satisfy the pattern graph GP in a given data graph GD. In this section, we will give the relevant definitions of data graphs, pattern graphs, and graph pattern matching in MKGs.

3.1 Data Graph and Pattern Graph

The related definitions of the data graph and the pattern graph are as follows.

3.1.1 Data Graph

A data graph GD = (V, E, fVD, fED) is a directed graph with vertex attributes and edge attributes, where

  • V is the set of vertices of the data graph;

  • E is the set of edges of the data graph, and (vi, vj) ∊ E represents the directed edge from vertex vi ∊ V to vertex vj ∊ V;

  • fVD is a function defined on V. ∀v ∊ V, fVD(v) is the attribute set of v. In an MKG, each vertex v has a label ρr, which represents the type of this vertex. Vertices with different values of ρr have different sets of other attributes in fVD(v). The value of ρr can be DI, BI, MI, GW, OC, AL or PD;

  • fED is a function defined on E. ∀e = (vi, vj) ∊ E, fED(vi, vj) is the attribute set of e. In an MKG, for a directed edge (vi, vj), fED(vi, vj) only contains ρvivjpids. ρvivjpids is a list that stores patient numbers, that is, it identifies the patients from whom the information on the vertex comes.

DI: When the value of ρr is DI, the attribute set fVD(v) of vertex v describes the diagnostic classification information of breast cancer, which includes pathological information ρh, T staging stage ρs, tumor length ρTS, the description of regional lymph nodes N staging stage ρLN, and M staging stage ρDM describing distant metastasis. The value of ρh can be 0, 1, 2, and 3 respectively representing “invasive ductal carcinoma”, “invasive lobular carcinoma”, “ductal carcinoma in situ”, and “lobular carcinoma in situ”; the value of ρs can be 0, 1, 2, 3, and 4; ρTS is a floating point number in cm. The value of ρLN can be N0, N1, N2, and N3; the value of ρDM can be M0, and M1.

BI: When the value of ρr is BI, the attribute set fVD(v) of vertex v describes the basic information of the patient, which includes ρCN, ρCP and ρage. ρCN indicates whether the patient currently needs care, and its value is true or false; ρCP indicates that the patient is currently pregnant, and its value is true or false; ρage indicates the current age of the patient, and its value is a positive integer.

MI: When the value of ρr is MI, the attribute set fVD(v) of vertex v describes the patient's menopausal information, and it only contains ρMS. The value of ρMS can be 0, 1, and 2, respectively, indicating pre-menopausal, perimenopausal, and post-menopausal.

GW: When the value of ρr is GW, the attribute set fVD(v) of vertex v describes the patient's current overall well-being, and it only contains ρGWB. The value of ρGWB can be 0, 1, 2, 3, and 4, respectively, which means “fully active, no complaints or symptoms”, “doing normal activities requires a little effort”, “occasionally need help, but can meet most of the personal needs”, “needs a lot of assistance and frequent medical care”, “completely disabled, can only lie in a bed or a chair.”

OC: When the value of ρr is OC, the attribute set fVD(v) of vertex v describes whether the patient has cancers other than breast cancer, which includes ρOCS and ρOCN. When the value of ρOCS is false, the value of ρOCN is none; when the value of ρOCS is true, the value of ρOCN is the names of the patient's other cancers.

AL: When the value of ρr is AL, the attribute set fVD(v) of vertex v describes the patient's axillary lymph nodes, which includes ρLN, ρLS, ρIN, ρSN and ρCW. The value of ρLS is true or false, indicating whether the patient's axillary lymph nodes are normal or not. The value of ρIN is true or false, indicating whether the supraclavicular lymph nodes of the patient are normal or not. The value of ρSN is true or false, indicating whether the subclavian lymph nodes of the patient are normal or not. The value of ρCW is true or false, indicating whether the patient's chest wall is normal or not. The value of ρLN is a positive integer giving how many of the three sites (supraclavicular lymph nodes, subclavian lymph nodes, and chest wall) have problems.

PD: When the value of ρr is PD, the attribute set fVD(v) of vertex v describes the patient's past diagnosis information, which includes ρaid, ρane, ρaut, ρlun, ρdia, ρcar, ρost and ρrep. Each of these attributes is true or false: ρaid indicates whether the patient has AIDS; ρane indicates whether the patient has anemia; ρaut indicates whether the patient has autoimmune disease; ρlun indicates whether the patient has lung cancer; ρdia indicates whether the patient has diabetes; ρcar indicates whether the patient has cardiovascular disease; ρost indicates whether the patient has osteoporosis; and ρrep indicates whether the patient's reproductive organs are diseased.
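As an illustration of the data graph definition above, the following minimal sketch stores vertex attribute sets and per-edge patient-number lists. The class name `MKGDataGraph`, the dictionary layout, and all attribute values are assumptions made for this example; only the seven vertex labels and the patient number 1375 on edge (A1, B1) are taken from the text.

```python
from dataclasses import dataclass, field

# The seven vertex types defined for the MKG.
VERTEX_LABELS = {"DI", "BI", "MI", "GW", "OC", "AL", "PD"}

@dataclass
class MKGDataGraph:
    # vertex id -> attribute dict; the type label is stored under "r"
    vertices: dict = field(default_factory=dict)
    # (source id, target id) -> set of patient numbers stored on that edge
    edge_pids: dict = field(default_factory=dict)

    def add_vertex(self, vid, label, **attrs):
        if label not in VERTEX_LABELS:
            raise ValueError(f"unknown vertex label: {label}")
        self.vertices[vid] = {"r": label, **attrs}

    def add_edge(self, src, dst, pid):
        if src not in self.vertices or dst not in self.vertices:
            raise KeyError("both endpoints must be added before the edge")
        self.edge_pids.setdefault((src, dst), set()).add(pid)

g = MKGDataGraph()
# Attribute values below are hypothetical.
g.add_vertex("A1", "DI", h=0, s=2, TS=2.5, LN="N1", DM="M0")
g.add_vertex("B1", "BI", CN=False, CP=False, age=52)
g.add_edge("A1", "B1", 1375)
```

Keeping the patient numbers as a set per edge makes the later intersection of patient-number lists on adjacent edges a single set operation.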

3.1.2 Pattern Graph

A pattern graph GP = (VP, EP, fVP, fEP, fip, fmP) is a directed graph with vertex attributes and edge attributes, where:

  • VP is the set of vertices of the pattern graph;

  • EP is the set of edges of the pattern graph, and (ui, uj) ∊ EP represents the directed edge from vertex ui ∊ VP to vertex uj ∊ VP;

  • fVP is a function defined on VP, and ∀u ∊ VP, fVP(u) is the attribute set of u. In an MKG, the attribute set fVP(u) of a pattern vertex u has the same meaning as the attribute set of a vertex in the data graph described above;

  • fEP is a function defined on EP. ∀e ∊ EP, fEP(e) is the attribute set of e. In an MKG, for a directed edge (ui, uj), fEP(ui, uj) only contains ρuiujpids. ρuiujpids is a list that stores patient numbers, that is, it identifies the patients from whom the information on the vertices comes;

  • fip is a function defined on EP. ∀(ui, uj) ∊ EP, fip(ui, uj) is the length constraint of the edge (ui, uj). Its value is either a positive integer k, indicating that the length of the matched path from vi to vj in the data graph does not exceed k, or the symbol *, indicating that there is no length limit. In an MKG, fip(ui, uj) = 1;

  • fmP is a set of membership constraint functions defined on vertex attributes and edge attributes.

3.1.3 Fuzzy constraints

During matching, it is desirable to obtain more matching results of good quality, because each matched subgraph corresponds to a patient who has roughly the same health information as the patient to be diagnosed in the pattern graph. The more matches are obtained, the more experience can be used for reference in the treatment of the patient corresponding to the pattern graph. However, in practice, a subgraph in the data graph may satisfy all other constraints well and still fail to become a matching result because some less important attribute constraint on a vertex is not satisfied. In addition, some attribute constraints on vertices do not need to be matched exactly; it is enough for the difference to fall within a certain range. Therefore, we introduce fuzzy constraints to GPM in MKGs.

In the MKG, only the membership function set fmP = {fagem} is considered, which introduces a fuzzy constraint on the age attribute. fagem represents the membership function defined on the vertex age attribute ρage, and its constraint value is set to 3. The membership function fagem is defined as Eq. (1), where abs is the absolute value function, ρvage represents the age attribute constraint value of vertex v in pattern graph GP, and ρuage represents the age attribute constraint value of vertex u in data graph GD. During matching, the age attribute ρage only needs to satisfy fagem ≤ 3.

fagem = abs(ρvage − ρuage)    (1)
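The age membership constraint can be written as a two-line check. This is a minimal sketch under the definition above; the function names `age_membership` and `age_matches` are hypothetical, and the bound of 3 is the constraint value stated in the text.

```python
def age_membership(pattern_age, data_age):
    """Eq. (1): absolute difference between the two age values."""
    return abs(pattern_age - data_age)

def age_matches(pattern_age, data_age, bound=3):
    """The fuzzy age constraint holds when the membership value is at most bound."""
    return age_membership(pattern_age, data_age) <= bound
```

For example, ages 45 and 48 satisfy the constraint, while ages 45 and 49 do not. Setting `bound=0` recovers exact matching on age.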

3.2 Pattern Matching

The matched subgraph Gsub = (Vsub, Esub, fVsubD, fEsubD) is a subgraph of the data graph GD that matches the pattern graph GP, where Gsub ⊆ GD, Vsub ⊆ V, Esub ⊆ E, fVsubD ⊆ fVD and fEsubD ⊆ fED. There may be more than one matched subgraph. The definition of pattern matching in the MKG is as follows.

For a pattern graph GP = (VP, EP, fVP, fEP, fip, fmP) and a data graph GD = (V, E, fVD, fED), GD matches GP, denoted as GPGD, if there is a binary relation S ⊆ VP × V such that:

  • for all u ∊ VP, there is v ∊ V such that (u, v) ∊ S, which means that there is a vertex v in V that matches u, that is, v satisfies fVP(u). If the age attribute ρuage is included and fmP = {fagem} represents the membership function defined on the age attribute of u, then ρuage only needs to satisfy fagem ≤ 3. Except for the age attribute ρuage, the value of every other attribute of v must be equal to the value of the corresponding attribute of u before it can be determined that v matches u.

  • for each pair (ui, vi) ∊ S,

    • ui matches vi, that is, vi satisfies fVP(ui), and

    • for each edge (ui, uj) in EP, there is a path from vi to vj in GD such that (uj, vj) ∊ S. Because fip(ui, uj) = 1, this path can be regarded as the edge from vi to vj in GD.
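The edge condition of this definition can be checked mechanically for a candidate mapping. The sketch below is an assumption-laden illustration, not the paper's algorithm: it takes a mapping from pattern vertices to data vertices whose vertex condition has already been verified, uses length bound 1 (so every pattern edge must map to a single data edge), and additionally requires each mapped edge to carry the same patient number, as in the examples of this section. The name `edges_match` and the dictionary layout are hypothetical.

```python
def edges_match(pattern_edges, mapping, data_edge_pids, pid):
    """Check that every pattern edge maps to a data edge carrying pid.

    pattern_edges: iterable of (ui, uj) pattern vertex pairs
    mapping: pattern vertex -> data vertex (vertex condition checked elsewhere)
    data_edge_pids: (vi, vj) -> set of patient numbers on that data edge
    """
    for ui, uj in pattern_edges:
        if ui not in mapping or uj not in mapping:
            return False
        pids = data_edge_pids.get((mapping[ui], mapping[uj]))
        if pids is None or pid not in pids:
            return False
    return True

# Small illustration with hypothetical identifiers.
data_edge_pids = {("A2", "B2"): {2384}, ("B2", "C1"): {2384}, ("A2", "B3"): {676}}
mapping = {"A": "A2", "B": "B2", "C": "C1"}
ok = edges_match([("A", "B"), ("B", "C")], mapping, data_edge_pids, 2384)
```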

Example 1: As shown in Figure 1, GD is a data graph composed of related information of multiple breast cancer patients. The attribute information of some vertexes contained in the data graph saves the diagnostic classification information of breast cancer. In the data graph GD, each vertex represents some information of the patient. For the function fED(A1, B1) defined on the directed edge (A1, B1) in GD, fED(A1, B1) only contains the attribute ρA1B1pids. For example, the value of ρA1B1pids is 1375, which means that the relevant information on the B1 vertex comes from the breast cancer patient numbered 1375. The pattern graph GP is the health status of a patient to be diagnosed. The vertices B, C, D, E, F, and G respectively represent the patient's basic information, menopausal status, general well-being, information on cancers other than breast cancer, axillary lymph nodes and information about past diagnoses. Vertex A is the diagnostic information of this patient, but it is unknown and needs to be obtained through GPM. Since all vertex information in the pattern graph comes from the same patient, we need to find a patient number as the attribute constraint information on the edges to get the matching result of the pattern graph.

Figure 1.
Data graph and pattern graph in an MKG

Example 2: As shown in Figure 1, it is easy to find a subgraph Msub1 from data graph GD that matches pattern graph GP. Msub1 passes through vertexes A2, B2, C1, D2, E1, F2 and G3. Vertex A2 is the breast cancer diagnosis result of the pattern graph GP. The attribute constraint value on the edges in Msub1 is 2384, which means that the patient with the number 2384 is closer to the health status of the patient corresponding to Gp.

After introducing fuzzy constraints, since fagem = abs(ρBage − ρB3age) = 1, the difference does not exceed the membership function constraint value 3 on the age attribute. In addition, ρBCN = ρB3CN and ρBCP = ρB3CP, so vertex B matches vertex B3. We can get a new matched subgraph Msub2 that passes through vertices A2, B3, C1, D2, E1, F2 and G3. The attribute constraint value on the edges in Msub2 is 676.

In this section, we propose a multi-threaded bidirectional routing exploration algorithm M-TBRE to solve the GPM problem in MKGs.

4.1 Algorithm Description

Multi-core CPUs make it possible to process tasks in parallel and speed up the execution of programs. Since the multi-constrained GPM problem is NP-complete, we adopt multi-threading to speed up matching and return the matched results quickly. In the matching process, the idea of divide and conquer is adopted. A pattern graph GP can be divided into several pattern subgraphs. After each pattern subgraph has been matched in the data graph GD, the matched results of the pattern subgraphs can be connected to obtain the matched results of the pattern graph GP. The matching of the pattern subgraphs can be delivered as subtasks to multiple threads that complete them independently, so that matching results can be obtained quickly.

4.2 Algorithm Flow

In the M-TBRE algorithm, since the pattern graph of the MKG can be regarded as a path, we can split the pattern graph at the intermediate vertex of this path into two parts and obtain two pattern subgraphs. The two pattern subgraphs are then matched, and their matched results are connected to obtain the matched results of the pattern graph.

The detailed steps of the M-TBRE algorithm are shown in Algorithm 1. First, the intermediate vertex vPmid of the pattern graph GP and the candidate vertex set candmid of vPmid are obtained, as shown in lines 1-2. In line 3, pool and tempInfo represent the thread pool and the temporary result of the matching, respectively. The number of threads in pool can be set according to the actual situation. Then the pattern graph GP is divided into two sub-pattern graphs GPsub1 and GPsub2 with the intermediate vertex vPmid as the dividing point, and the two sub-pattern graphs are matched in the data graph GD. The candidate vertex set candmid of vPmid is traversed to complete the matching of GPsub1 and GPsub2, as shown in lines 4-27. For each element candmid[i] in candmid, the attribute constraint ρpepids on each forward (incoming) edge eDpe of candmid[i] is intersected with the attribute constraint ρaepids on each successor edge eDae to obtain ρpids, which holds the patient numbers common to the current forward edge eDpe and the current successor edge eDae, as shown in lines 6-25. For each patient number ρpid in ρpids, ρpid is taken as the attribute constraint on the edges to complete the matching of GPsub1 and GPsub2, as shown in lines 14-21. tempInfo stores the partial matched result with ρpid as the edge attribute constraint, as shown in lines 15-16. The thread pool submits the subtasks MC-SEM and MC-FEM to complete the matching of GPsub1 and GPsub2 respectively, as shown in lines 19-20. The algorithm RM merges the matched results, as shown in line 28.

The MC-SEM algorithm completes the matching of the pattern subgraph GPsub1, where vPcurr, vDcurr, ρpid and tempInfo respectively represent the pattern vertex to be matched, the candidate vertex of vPcurr, the attribute constraint value (patient number) of the edge, and the temporary result of the matching. If vertex vDcurr matches vertex vPcurr but vPcurr does not have a successor edge, that is, when the out-degree of vPcurr is 0, the matching of GPsub1 is completed, and the matched result with ρpid as the attribute constraint on the edges is saved in tempInfo, as shown in lines 2-7 of Algorithm 2. If vertex vDcurr matches vertex vPcurr and vPcurr has a successor edge, then the successor edges eDae are traversed. When the attribute constraint ρaepids on eDae includes ρpid, the matching of pattern vertex ePae.tailNode is recursively completed, as shown in lines 8-16.
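The overall flow described above (split a path pattern at its middle vertex, intersect the patient numbers on the incoming and outgoing edges of each candidate, then explore both directions as thread-pool subtasks) can be sketched as follows. This is a simplified illustration and not Algorithm 1 itself: the names `m_tbre`, `explore`, `vertex_ok` and `build_graph`, the dictionary-based graph, and all attribute values are assumptions, and the pattern is assumed to be a simple path. The `age_bound` parameter stands in for the fuzzy constraint (0 means exact age matching). In CPython the global interpreter lock generally prevents CPU-bound threads from running in parallel, so the sketch shows the structure rather than the speed-up reported for the Java implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def vertex_ok(p_attrs, d_attrs, age_bound):
    """Exact match on every pattern attribute, except age, which may differ by age_bound."""
    for name, wanted in p_attrs.items():
        if name not in d_attrs:
            return False
        if name == "age":
            if abs(wanted - d_attrs[name]) > age_bound:
                return False
        elif wanted != d_attrs[name]:
            return False
    return True

def explore(pattern, i, v, pid, g, step, age_bound):
    """Depth-first match of pattern[i], pattern[i+step], ... from data vertex v,
    following only edges whose patient numbers include pid (step=1: successor
    edges, step=-1: incoming edges). Returns the matched vertex ids or None."""
    if not vertex_ok(pattern[i], g["attrs"][v], age_bound):
        return None
    last = len(pattern) - 1 if step == 1 else 0
    if i == last:
        return [v]
    edges = g["out"] if step == 1 else g["in"]
    for nxt, pids in edges.get(v, {}).items():
        if pid in pids:
            rest = explore(pattern, i + step, nxt, pid, g, step, age_bound)
            if rest is not None:
                return [v] + rest
    return None

def m_tbre(pattern, g, workers=2, age_bound=0):
    """Return {patient number: matched vertex path} for a path-shaped pattern."""
    mid = len(pattern) // 2
    jobs = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for v, attrs in g["attrs"].items():
            if not vertex_ok(pattern[mid], attrs, age_bound):
                continue
            in_pids = set().union(*g["in"].get(v, {}).values())
            out_pids = set().union(*g["out"].get(v, {}).values())
            for pid in sorted(in_pids & out_pids):
                back = pool.submit(explore, pattern, mid, v, pid, g, -1, age_bound)
                fwd = pool.submit(explore, pattern, mid, v, pid, g, 1, age_bound)
                jobs.append((pid, back, fwd))
    matches = {}
    for pid, back, fwd in jobs:
        b, f = back.result(), fwd.result()
        if b is not None and f is not None:
            matches[pid] = b[::-1] + f[1:]  # join the halves at the middle vertex
    return matches

def build_graph(attrs, edges):
    g = {"attrs": attrs, "out": {}, "in": {}}
    for src, dst, pids in edges:
        g["out"].setdefault(src, {})[dst] = set(pids)
        g["in"].setdefault(dst, {})[src] = set(pids)
    return g

# Hypothetical data loosely modelled on Figure 1, shortened to five vertex types.
g = build_graph(
    {"A2": {"r": "DI"}, "B2": {"r": "BI", "age": 50}, "B3": {"r": "BI", "age": 51},
     "C1": {"r": "MI", "MS": 2}, "D2": {"r": "GW", "GWB": 1}, "E1": {"r": "OC", "OCS": False}},
    [("A2", "B2", [2384]), ("A2", "B3", [676]), ("B2", "C1", [2384]),
     ("B3", "C1", [676]), ("C1", "D2", [676, 2384]), ("D2", "E1", [676, 2384])])
pattern = [{"r": "DI"}, {"r": "BI", "age": 50}, {"r": "MI", "MS": 2},
           {"r": "GW", "GWB": 1}, {"r": "OC", "OCS": False}]
exact = m_tbre(pattern, g)               # exact age matching
fuzzy = m_tbre(pattern, g, age_bound=3)  # fuzzy age constraint
```

With exact matching only the patient whose age equals the pattern age is found; with the fuzzy bound the second patient, whose age differs by 1, is found as well.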

The MC-FEM algorithm completes the matching of the pattern subgraph GPsub2. Its procedure is similar to that of the MC-SEM algorithm, except that MC-FEM matches the pattern subgraph GPsub2 according to a reverse depth-first search strategy.

The RM algorithm completes the connection of the matching results of the pattern subgraphs GPsub1 and GPsub2. When a given value is used as the attribute constraint on all edges and the flag bits representing the matching results of GPsub1 and GPsub2 are both 1, the combination of the matching results of GPsub1 and GPsub2 is a matching result of the pattern graph GP, as shown in lines 4-6 of Algorithm 4.
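The merge step can be sketched as a join on the patient number. This is an illustrative reading of the description above, not Algorithm 4 itself: the name `merge_results`, the `(flag, vertices)` tuples, and the convention that both halves include the shared intermediate vertex are assumptions of this example.

```python
def merge_results(first_half, second_half):
    """Join partial matches that share a patient number and both succeeded.

    Each argument maps a patient number to (flag, vertices), where flag is 1
    when that pattern subgraph was matched. The first half is assumed to end
    at the intermediate vertex and the second half to start at it.
    """
    merged = {}
    for pid, (flag1, left) in first_half.items():
        flag2, right = second_half.get(pid, (0, []))
        if flag1 == 1 and flag2 == 1:
            merged[pid] = left + right[1:]  # keep the shared vertex once
    return merged

# Hypothetical partial results for two patient numbers.
first = {2384: (1, ["A2", "B2", "C1", "D2"]), 676: (0, [])}
second = {2384: (1, ["D2", "E1", "F2", "G3"]), 676: (1, ["D2", "E1", "F2", "G3"])}
merged = merge_results(first, second)
```

Only patient number 2384 survives here, because the first half failed for 676.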

Example 3: In this example, GP and GD in Figure 1 are the pattern graph and the data graph, respectively. First, we obtain the intermediate vertex D of pattern graph GP and the candidate vertex set candmid = {D2} of D. The pattern graph GP is divided into pattern subgraph GPsub2, which passes through vertices A, B and C, and pattern subgraph GPsub1, which passes through vertices E, F and G. The forward edge (C1, D2) and the successor edge (D2, E1) of D2 have the same attribute constraint ρpids = {676, 2384}. Taking the matching of GPsub2 as an example, the attribute constraint ρC1D2pids = {676, 2384} on edge (C1, D2) contains ρpid = 2384, and C1 matches C at the same time, so (C1, D2, GD) ⋍ (C, D, GP). In the same way, we can get (B2, C1, GD) ⋍ (B, C, GP) and (A2, B2, GD) ⋍ (A, B, GP). The matching of GPsub2 takes ρpid = 2384 as the attribute constraint, and the attribute constraint information of vertex A2 is the diagnosis classification result of GPsub2. Similarly, we can get the matched results (D2, E1, GD) ⋍ (D, E, GP), (E1, F2, GD) ⋍ (E, F, GP) and (F2, G3, GD) ⋍ (F, G, GP) of GPsub1 when ρpid = 2384 is the attribute constraint on the edges in EP. Both GPsub1 and GPsub2 have matched results, so the diagnostic classification information of GP is the diagnostic classification information of GPsub2, and the patient number in GD is 2384. Finally, two matching subgraphs are obtained through the M-TBRE algorithm. Msub1 = {VM, EM, fv, fe}, where VM = {A2, B2, C1, D2, E1, F2, G3}, fe = {2384} and EM = {(A2, B2), (B2, C1), (C1, D2), (D2, E1), (E1, F2), (F2, G3)}. Msub2 = {VM, EM, fv, fe}, where VM = {A2, B3, C1, D2, E1, F2, G3}, fe = {676} and EM = {(A2, B3), (B3, C1), (C1, D2), (D2, E1), (E1, F2), (F2, G3)}. The patients numbered 676 and 2384 can be used for reference when treating the patient corresponding to the pattern graph GP.

In this section, we conduct experiments on two public MKGs. The details of these two datasets are shown in Table 1. We implement the proposed M-TBRE algorithm to complete pattern matching on the MKGs. Since the M-TBRE algorithm divides the pattern graph into two sub-pattern graphs, for each edge attribute constraint value ρpid, the matching of these two sub-pattern graphs is delivered to the thread pool as subtasks for execution. The number of threads in the thread pool affects the execution time of the algorithm, so we set different numbers of threads to measure how the efficiency of the M-TBRE algorithm changes. In addition, to obtain more matched results, we introduce the fuzzy constraint, which is a membership function for the age attribute constraint on the vertices. During matching, the age attribute on a data graph vertex does not need to be the same as the age attribute constraint in the pattern graph; the value computed by the membership function only needs to satisfy the corresponding membership constraint value. Combining this constraint with the M-TBRE algorithm gives the Fuzzy-M-TBRE algorithm. The Fuzzy-M-TBRE and M-TBRE algorithms are compared to demonstrate the effectiveness of the introduced fuzzy constraints.

Table 1.
Details of the two datasets.
Dataset | Vertices | Edges | Description
Female-breast-cancer-2013a | 10812 | 20366 | A graph about breast cancer patients
Breastcancer-femalepatient-2016A | 101221 | 200845 | A graph about breast cancer patients

5.1 Experimental Settings and Implementation

The MKGs used in the experiments are about breast cancer. Dataset-1 and dataset-2 denote the dataset Female-breast-cancer-2013a and the dataset Breastcancer-femalepatient-2016A, respectively. Dataset-1 is composed of the physical condition information of 10,000 breast cancer patients, and dataset-2 is composed of that of 100,000 breast cancer patients. In our experiments, several pattern graphs are used, all of which are similar to the pattern graph shown in Figure 1. Our membership function is only defined for the age attribute of the vertex, and the membership constraint value is set to 3. Both M-TBRE and Fuzzy-M-TBRE are implemented in Java and run on a PC with an Intel(R) Core(TM) i9-10900F CPU @ 2.81 GHz, 32 GB RAM and the Windows 10 operating system.

5.2 Experimental Results and Analysis

5.2.1 Experiments on Execution Time

This experiment studies the execution time change when we set different thread numbers for the thread pool used in the M-TBRE algorithm, and the algorithm completes the GPM. To prevent error interference, the results in the experiment are the arithmetic mean after 10 runs.
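The averaging protocol above (arithmetic mean over 10 runs) can be sketched with a small timing helper. This is an assumed harness for illustration only: the function name `average_runtime` is hypothetical, and the paper does not state how its Java implementation measured time.

```python
import time
from statistics import mean

def average_runtime(task, runs=10):
    """Run task() several times and return the arithmetic mean of the wall-clock times in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        samples.append(time.perf_counter() - start)
    return mean(samples)

# Placeholder workload standing in for one pattern-matching call.
avg_seconds = average_runtime(lambda: sum(range(10_000)))
```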

As shown in Figure 2 and Figure 3, the abscissa represents different pattern graphs, and the ordinate represents the matching time of these pattern graphs. The M-TBRE-1 algorithm represents that the number of threads in the thread pool in the M-TBRE algorithm is set to 1. Since the pattern graph in the MKG is a path, and the edge join strategy proposed in the ETOF-K algorithm does not take effect in the matching, the performance of the ETOF-K algorithm and the M-TBRE-1 algorithm is almost the same on dataset-1 and dataset-2. The reverse matching strategy of the NTSS algorithm is invalid in the matching process, but its caching mechanism avoids the double calculation of the same path, so the NTSS algorithm is better than the ETOF-K algorithm and the M-TBRE-1 algorithm on dataset-1 and dataset-2.

Figure 2.
Matching time of different pattern graphs on Female-breast-cancer-2013a.

Figure 3.
Matching time of different pattern graphs on Breastcancer-femalepatient-2016A.

However, our M-TBRE-1 algorithm can be extended to multithreaded algorithms, such as the M-TBRE-2 algorithm, M-TBRE-4 algorithm, M-TBRE-8 algorithm, which means that the number of threads in the thread pool is set to 2, 4, and 8, respectively. As can be seen from Figure 2 and Figure 3, the effect of the M-TBRE-2 algorithm has already exceeded the NTSS algorithm, which also proves the effectiveness of our proposed M-TBRE algorithm. In addition, Table 2 and Table 3 show the detailed execution time in seconds. Table 4 shows the comparison of the average execution time of these four algorithms on the two datasets. It can be seen that on the two data sets, as the number of threads increases, the execution time of the algorithm continues to decrease.

  • For dataset-1, the execution time of the M-TBRE-2 algorithm is reduced by 43.11% compared with the M-TBRE-1 algorithm, and the execution time of the M-TBRE-4 algorithm is reduced by 39.56% compared with the M-TBRE-2 algorithm, but compared with the M-TBRE-4 algorithm, the execution time of the M-TBRE-8 algorithm is only reduced by 27.81%. This is because dataset-1 itself is small in scale, so the time spent on thread context switching and system state transitions occupies a large proportion of the total time.

  • For dataset-2, the scale is larger, and the total execution time of the algorithm is also larger. The time spent on thread context switching and system state transitions takes up a relatively small proportion of the total time. Therefore, on dataset-2 the M-TBRE-8 algorithm still reduces the execution time by 36.06% compared with the M-TBRE-4 algorithm.
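The percentages above are relative reductions in average execution time. A minimal sketch of the calculation follows, using the dataset-1 averages reported in Table 4 (0.1684, 0.0958 and 0.0579 seconds); the function name `reduction_percent` is hypothetical.

```python
def reduction_percent(before, after):
    """Relative reduction of execution time, in percent."""
    return (before - after) / before * 100.0

# Average execution times on dataset-1 as reported in Table 4 (seconds).
m_tbre_1, m_tbre_2, m_tbre_4 = 0.1684, 0.0958, 0.0579

two_vs_one = round(reduction_percent(m_tbre_1, m_tbre_2), 2)
four_vs_two = round(reduction_percent(m_tbre_2, m_tbre_4), 2)
```

Both values agree with the percentages listed in Table 4 when rounded to two decimal places.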

Table 2.
Execution time (in seconds) on the Female-breast-cancer-2013a dataset.
Algorithm  P1      P2      P3      P4      P5      P6      P7      P8
ETOF-K     0.1625  0.1933  0.1742  0.1659  0.1743  0.1601  0.1742  0.1633
NTSS       0.1398  0.1472  0.1109  0.1365  0.1247  0.0973  0.1148  0.1245
M-TBRE-1   0.1832  0.1869  0.1516  0.1807  0.1654  0.1560  0.1535  0.1700
M-TBRE-2   0.1161  0.1097  0.0905  0.1059  0.0890  0.0785  0.0789  0.0979
M-TBRE-4   0.0869  0.0709  0.0488  0.0595  0.0550  0.0459  0.0448  0.0512
M-TBRE-8   0.0750  0.0488  0.0430  0.0373  0.0366  0.0311  0.0294  0.0331
Table 3.
Execution time (in seconds) on the Breastcancer-femalepatient-2016A dataset.
Algorithm  P1       P2       P3       P4       P5       P6       P7       P8
ETOF-K     25.1406  26.1477  29.3101  28.1559  26.7859  27.4765  28.6518  27.4800
NTSS       20.7231  21.6519  19.9583  20.4127  19.8864  22.7456  19.7193  21.5986
M-TBRE-1   27.2807  27.6337  28.7748  27.8562  27.7634  27.1356  26.8715  27.3600
M-TBRE-2   16.2298  16.3104  16.1557  16.3067  16.6475  16.5609  16.4358  16.7567
M-TBRE-4    9.4020   9.7056   9.8746   9.9239   9.7943   9.0300   8.9316   8.9685
M-TBRE-8    5.8198   5.9732   6.0509   6.0256   5.9963   6.1483   6.1579   6.1895
Table 4.
Comparison of average execution time (in seconds) on the two datasets. The last column is the reduction in execution time between the two algorithms listed in the row.
Dataset                           M-TBRE-1  M-TBRE-2  M-TBRE-4  M-TBRE-8  Reduction
Female-breast-cancer-2013a        0.1684    0.0958    -         -         43.11%
Female-breast-cancer-2013a        -         0.0958    0.0579    -         39.56%
Female-breast-cancer-2013a        -         -         0.0579    0.0418    27.81%
Breastcancer-femalepatient-2016A  27.5845   16.4254   -         -         40.45%
Breastcancer-femalepatient-2016A  -         16.4254   9.4538    -         42.44%
Breastcancer-femalepatient-2016A  -         -         9.4538    6.0452    36.06%

For the proposed M-TBRE algorithm, the execution time falls as the number of threads increases. However, once the number of threads reaches a certain level, the gain from adding further threads diminishes, as shown by M-TBRE-8 in Figure 2. If the dataset is larger, or if the M-TBRE algorithm submits more subtasks, this diminishing effect is less pronounced: relative to M-TBRE-4, M-TBRE-8 reduces the execution time by 27.81% on dataset-1 but by 36.06% on dataset-2.
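The diminishing gain can also be expressed as speedup and parallel efficiency relative to M-TBRE-1. These two metrics are standard in parallel computing but are not reported in this paper, so the sketch below is only an illustrative reading of the Table 4 averages; the names `speedup` and `efficiency` are ours.

```python
# Average execution times in seconds, copied from Table 4.
AVERAGE_SECONDS = {
    "dataset-1": {1: 0.1684, 2: 0.0958, 4: 0.0579, 8: 0.0418},
    "dataset-2": {1: 27.5845, 2: 16.4254, 4: 9.4538, 8: 6.0452},
}


def speedup(times: dict, threads: int) -> float:
    """Single-thread time divided by the time with the given thread count."""
    return times[1] / times[threads]


def efficiency(times: dict, threads: int) -> float:
    """Speedup per thread; 1.0 would mean perfectly linear scaling."""
    return speedup(times, threads) / threads


if __name__ == "__main__":
    for name, times in AVERAGE_SECONDS.items():
        for threads in (2, 4, 8):
            print(
                f"{name}: {threads} threads, "
                f"speedup {speedup(times, threads):.2f}, "
                f"efficiency {efficiency(times, threads):.2f}"
            )
```

On both datasets the efficiency drops with every doubling of the thread count, and at 8 threads it is higher on the larger dataset-2 than on dataset-1, which is consistent with the explanation given above.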

5.2.2 Experiments on Fuzzy Constraints

This experiment studies how the number of matched subgraphs changes when fuzzy constraints are introduced into the M-TBRE algorithm; we call the resulting algorithm Fuzzy-M-TBRE. Since the Fuzzy-M-TBRE algorithm returns all matches of the pattern graph, we can compare the total number of matches before and after the introduction of fuzzy constraints.

In Figure 4 and Figure 5, the abscissa represents the different pattern graphs and the ordinate represents the number of matched subgraphs. On both dataset-1 and dataset-2, the Fuzzy-M-TBRE algorithm returns more matches than the M-TBRE algorithm for the same pattern graph. Each matched subgraph corresponds to a breast cancer patient, so when treating the patient represented by the pattern graph, clinicians can refer to the treatment plans of the patients in the matched subgraphs. Introducing fuzzy constraints therefore yields more treatment options for reference, which is why it is necessary to introduce fuzzy constraints into the M-TBRE algorithm.
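Why a fuzzy constraint returns at least as many matches as an exact one can be seen in a small sketch. The triangular membership function, the tolerance, the threshold, and the records below are all hypothetical and chosen only for illustration; they are not the fuzzy constraints defined for Fuzzy-M-TBRE, and the values are not taken from either dataset.

```python
def triangular_membership(value: float, target: float, tolerance: float) -> float:
    """Membership degree: 1.0 at the target, falling linearly to 0.0 at target +/- tolerance."""
    if tolerance <= 0:
        return 1.0 if value == target else 0.0
    return max(0.0, 1.0 - abs(value - target) / tolerance)


def exact_matches(records, target):
    """Records whose attribute value equals the target exactly."""
    return [name for name, value in records if value == target]


def fuzzy_matches(records, target, tolerance, threshold):
    """Records whose membership degree reaches the threshold."""
    return [
        name
        for name, value in records
        if triangular_membership(value, target, tolerance) >= threshold
    ]


# Invented (name, attribute value) pairs; not real patient data.
RECORDS = [("record-a", 20.0), ("record-b", 21.0), ("record-c", 24.0), ("record-d", 30.0)]

if __name__ == "__main__":
    print(exact_matches(RECORDS, target=20.0))
    print(fuzzy_matches(RECORDS, target=20.0, tolerance=5.0, threshold=0.5))
```

Every record that satisfies the exact constraint has membership degree 1.0, so the exact matches are always a subset of the fuzzy matches for any threshold of at most 1.0.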

Figure 4. The number of matched subgraphs of different pattern graphs on Female-breast-cancer-2013a.
Figure 5. The number of matched subgraphs of different pattern graphs on Breastcancer-femalepatient-2016A.

In this paper, we put forward the problem of GPM in MKGs and provide the related definitions. To solve this problem, we propose the M-TBRE algorithm, which divides the pattern graph into several pattern subgraphs, uses multi-threaded bidirectional routing to match the pattern subgraphs, and then merges the matching results. In addition, fuzzy constraints are introduced to obtain more matched subgraphs. Each matched subgraph corresponds to a past patient. These patients have the same physical condition as the patient corresponding to the pattern graph, so their treatment plans can serve as a reference when treating that patient. In this way, better and more effective treatment plans can be developed for the patient corresponding to the pattern graph. We verify the M-TBRE algorithm experimentally on two public MKG datasets. The results show that, with two or more threads, the proposed M-TBRE algorithm runs faster than the compared algorithms. Furthermore, the experiments demonstrate the necessity of introducing fuzzy constraints, since the Fuzzy-M-TBRE algorithm returns more matched subgraphs than M-TBRE. In the future, we will further improve the M-TBRE algorithm and study the dynamic graph pattern matching problem in MKGs, in which the content of the pattern graph changes over time.
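The divide, match, and merge structure summarized above can be sketched with a thread pool. This is a simplified illustration under our own assumptions, not the M-TBRE implementation: the toy graph and labels are invented, each pattern subgraph is a plain label path that starts at a shared root label, and the matcher is a one-directional depth-first search rather than bidirectional routing.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Toy data graph (invented): vertex -> (label, successor vertices).
DATA_GRAPH = {
    "p1": ("patient", ["s1", "e1"]),
    "p2": ("patient", ["s2"]),
    "s1": ("symptom", []),
    "s2": ("symptom", []),
    "e1": ("exam", ["r1"]),
    "r1": ("result", []),
}

# Toy pattern graph, already divided into pattern subgraphs (label paths).
PATTERN_SUBGRAPHS = [
    ["patient", "symptom"],
    ["patient", "exam", "result"],
]


def match_subgraph(label_path):
    """Depth-first search for data-graph paths whose labels equal label_path.

    Returns a dict mapping each root vertex to its list of matching paths.
    """
    found = {}

    def dfs(vertex, depth, path):
        label, successors = DATA_GRAPH[vertex]
        if label != label_path[depth]:
            return
        path = path + (vertex,)
        if depth == len(label_path) - 1:
            found.setdefault(path[0], []).append(path)
            return
        for successor in successors:
            if successor not in path:
                dfs(successor, depth + 1, path)

    for vertex in DATA_GRAPH:
        dfs(vertex, 0, ())
    return found


def match_pattern(pattern_subgraphs, workers):
    """Match each pattern subgraph in a thread pool, then merge on the root."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(match_subgraph, pattern_subgraphs))
    # A root matches the whole pattern only if every subgraph matched at it.
    roots = set(partial[0]).intersection(*partial[1:])
    merged = []
    for root in sorted(roots):
        merged.extend(product(*(result[root] for result in partial)))
    return merged


if __name__ == "__main__":
    for workers in (1, 2, 4, 8):
        print(workers, match_pattern(PATTERN_SUBGRAPHS, workers))
```

The `workers` argument plays the role of the thread-pool size in M-TBRE-1, M-TBRE-2, M-TBRE-4, and M-TBRE-8, and the merged result does not depend on it. Note that in CPython the global interpreter lock limits the speedup of CPU-bound threads, so this sketch illustrates the structure rather than the timings reported in the tables.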

This work has been supported by the National Natural Science Foundation of China under grants 62076087 & 61906059 and the Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT) of the Ministry of Education of China under grant IRT17R32.

The first author would like to thank his wife Jun Zhang, his parents and friends during his fight with lung adenocarcinoma. “I leave no trace of wings in the air, but I am glad I have had my flight.”

All authors, including L. Li, X. Du, Z. Zhang, and Z. Tao, took part in writing the paper. In addition, L. Li designed the algorithm and experiments, and provided the funding; X. Du designed and conducted experiments, and analyzed the data; Z. Tao analyzed the data.

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.