Relational Topology-based Heterogeneous Network Embedding for Predicting Drug-Target Interactions

ABSTRACT Predicting interactions between drugs and target proteins has become an essential task in the drug discovery process. Although validation via wet-lab experiments is available, experimental methods for drug-target interaction (DTI) identification remain either time-consuming or heavily dependent on domain expertise. Therefore, various computational models have been proposed to predict possible interactions between drugs and target proteins. However, most prediction methods do not consider the topological characteristics of the relationships involved. In this paper, we propose a relational topology-based heterogeneous network embedding method to predict drug-target interactions, abbreviated as RTHNE_DTI. We first construct a heterogeneous information network based on the interactions between different types of nodes, to enhance the ability of association discovery by fully considering the topology of the network. Drug and target protein nodes can then be represented through the other types of nodes. According to the different topological structures of the relationships between nodes, we divide the relationships in the heterogeneous network into two categories and model them separately. In extensive experiments on real-world drug datasets, RTHNE_DTI is highly efficient and outperforms other state-of-the-art methods. RTHNE_DTI can further be used to predict interactions for drug-target pairs whose interaction status is unknown.


Introduction
The prediction of drug-target interactions (DTIs) is key to the development of new drugs. It plays an important role in the study of drug toxicity and side effects and in the treatment of diseases. However, traditional methods based on large-scale biological experiments usually take several years and are often very expensive [1]. In recent years, with the rapid development of computer technology and the accumulation of large amounts of medical data, methods such as machine learning and data mining have been widely used to solve various complex problems in the field of biomedicine [2,3,4]. Currently, there are three types of prediction approaches in computer-aided drug discovery, namely similarity-based [5], deep-learning-based [6,7], and network-based [8,9] methods.
In the branch of similarity-based methods, Yamanishi et al. [5] proposed a strategy mainly based on kernel regression, which takes the information of known drug-target interactions as input to infer new DTIs. This was one of the earlier methods to reveal the significant correlation between drug structure similarity, target sequence similarity, and drug-target interaction network topology. Zheng et al. [10] proposed a method to predict the interaction between drugs and targets by using a multi-similarity cooperation matrix. The core idea is to obtain the interaction matrix and the similarity score matrix by multiplying two matrices representing drugs and targets. As another similarity-based approach to DTI prediction, Ezzat et al. [11] observed that many edges that do not appear in the network are actually unknown or missing interactions, and used graph regularized matrix factorization to predict these unknown edges. However, most of these methods rely on the chemical structure of the drug and the protein sequence, and in public data sets it is often difficult to obtain the protein sequence and chemical information of many polymers.
Recently, the rapid development of deep learning technology has offered effective ways of predicting DTIs. Mayr et al. [12] compared several deep learning methods with other machine learning and target prediction methods on large-scale drug discovery datasets, and concluded that deep learning methods have the best prediction performance. Lee et al. [13] predicted DTIs through convolutional neural networks (CNNs) on raw protein sequences. In a study called DeepDTA, Öztürk et al. [6] proposed a deep-learning-based model to predict the binding affinity between drugs and targets, in which CNNs were mainly used to model protein sequences and 1D compound representations.
There are various networks in practice, such as social networks [14], citation networks [15], and biological information networks [16], and research on network analysis has attracted increasing attention. In particular, link prediction is one of the most active tasks in network analysis. Currently, most network-based DTI prediction is based on machine learning [8]. Wang et al. cast the DTI prediction problem as a two-layer graphical model, the restricted Boltzmann machine (RBM). Wan et al. [17] developed a nonlinear end-to-end learning model, called NeoDTI, which integrates diverse heterogeneous information about drugs and targets and learns representations of drugs and targets to predict DTIs. However, one drawback of these methods is that they may not work when the pathways through which chemicals and proteins interact are unknown.
Heterogeneous network representation learning is an active research topic and performs well in link prediction [18]. Although heterogeneous network representation learning methods have been widely adopted for link prediction in social networks with good results, they have not, to the best of our knowledge, been used for link prediction of DTIs. Most previous studies on networks were based on homogeneous networks, that is, networks whose nodes are all of the same type. With the development of network representation learning, several works have turned to heterogeneous network representation learning to capture the heterogeneity of networks. For instance, Shang et al. [19] proposed a framework named ESim, which uses meta-path-based random walks to generate node sequences and optimize the similarity between nodes. Fu et al. [20] proposed HIN2Vec, a heterogeneous information network representation method. Unlike many previous works based on the skip-gram language model, the core of HIN2Vec is a neural network model that learns the representations of both nodes and relationships (meta-paths) in a network. Han et al. [21] proposed an aspect-level collaborative filtering model based on neural networks. In their model, they extracted similarity matrices of different aspect levels of nodes through different meta-paths, and fed these matrices into a deep neural network to learn the aspect-level latent factors. These methods are usually applied to social networks, scholar networks, and similar settings. Because the drug network is structurally very similar to these networks, we approach the problem of drug-target prediction in the same way.
In particular, our main contributions are as follows:
• We propose a heterogeneous network representation learning method named RTHNE_DTI to predict DTIs, which learns distributed representations of nodes by embedding heterogeneous networks into low-dimensional spaces so as to make full use of the network topology information. In addition, we apply heterogeneous network representation learning to the field of drug-target interaction prediction, which enables faster and more effective use of medical data and thereby improves prediction accuracy.
• Traditional heterogeneous network representation learning methods use a single model to deal with all relationships. Here, according to the different topological structures of the relationships between nodes in a heterogeneous network, we divide the relationships into two types, affiliation relationships and peer relationships, and build different models for them to better capture the rich semantic information between nodes and make full use of the diversity of drug network relationships.
• In general, the prediction of drug-target interactions is carried out on a labeled network (that is, some drug-target pairs with known interactions are added to the training set). Our model can also achieve good prediction results on an unlabeled network, which addresses the problem of insufficient labeled drug data and the resulting low prediction accuracy.
• We conduct extensive experiments on real drug data sets and compare with other predictive models. The results show that RTHNE_DTI has the best predictive performance.

Problem Formulation
In this section, we introduce some basic definitions of heterogeneous network embedding to predict DTIs.

Definition 1: Heterogeneous Network (HN).
A heterogeneous network is defined as a graph G = (V, E, A, φ, ψ), where V represents the set of nodes and E ⊆ V × V represents the set of edges. φ and ψ are the type mapping functions of nodes and edges, respectively, where φ : V → N and ψ : E → R. Here N and R are the type sets of nodes and edges, respectively, and A = N ∪ R. If |N| + |R| > 2, the network is called a heterogeneous network; otherwise it is a homogeneous network.
Definition 2: Meta-path.
In a heterogeneous network, a meta-path P is a sequence of node types n_1, n_2, ..., n_m and edge types r_1, r_2, ..., r_{m−1}, in the form of:
$$P = n_1 \xrightarrow{r_1} n_2 \xrightarrow{r_2} \cdots \xrightarrow{r_{m-1}} n_m$$
Definition 3: Heterogeneous Network Embedding.
Given a heterogeneous network G, heterogeneous network embedding learns a low-dimensional vector E_v ∈ ℝ^d for each vertex v ∈ V by a mapping function f : V → ℝ^d.
As shown in Figure 1, (a) is a heterogeneous network constructed from five types of nodes (drug, target protein, disease, side-effect, action). In this network, there are not only simple relationships, such as D-D, but also compound relationships, such as D-P-Di. In (b) we divide all relationships into two categories according to the relationship topology and model them separately. Finally, we apply the model to different scenarios to verify its performance.
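To make Definition 1 and the notion of a meta-path concrete, the following is a minimal sketch of a typed graph. The class name `HeteroNetwork`, its methods, and the toy node identifiers are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from collections import namedtuple

# Hypothetical container for illustration; names are not from the paper.
Edge = namedtuple("Edge", ["u", "v", "etype"])


class HeteroNetwork:
    """Typed graph: phi maps nodes to node types, psi maps edges to edge types."""

    def __init__(self):
        self.phi = {}    # node -> node type (phi in Definition 1)
        self.edges = []  # each Edge stores psi(e) in its etype field

    def add_node(self, node, ntype):
        self.phi[node] = ntype

    def add_edge(self, u, v, etype):
        self.edges.append(Edge(u, v, etype))

    def is_heterogeneous(self):
        # Definition 1: heterogeneous when |N| + |R| > 2
        node_types = set(self.phi.values())
        edge_types = {e.etype for e in self.edges}
        return len(node_types) + len(edge_types) > 2

    def type_sequence(self, path):
        """Node-type sequence n_1, ..., n_m traced by a path of nodes."""
        return [self.phi[n] for n in path]


# Toy network (made-up identifiers): one drug, one protein, one disease.
g = HeteroNetwork()
g.add_node("d1", "drug")
g.add_node("p1", "target")
g.add_node("di1", "disease")
g.add_edge("d1", "p1", "drug-target")
g.add_edge("p1", "di1", "target-disease")

print(g.is_heterogeneous())                  # True: 3 node types + 2 edge types
print(g.type_sequence(["d1", "p1", "di1"]))  # ['drug', 'target', 'disease']
```

The path d1, p1, di1 is an instance of the compound relationship D-P-Di mentioned for Figure 1.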

Affiliation relationship and peer relationship
In studying the data sets associated with the prediction of drug and target protein interactions, we found that not all relationships connect comparable numbers of nodes of their two endpoint types; for some relationships, the numbers of nodes of the two types differ significantly, as shown in Figure 2.
Our study of DrugBank found that there are very few types of protein action, only 47, whereas the variety of proteins is very large, so their relationship network looks like an action-centered network spreading outward, as shown in Figure 2 (a). However, most of the relationships in the drug data set resemble the one between drugs and proteins: the two types of nodes do not differ greatly in number, so they form a well-balanced network, as shown in Figure 2 (b).
To make full use of the topology characteristics of heterogeneous networks (HNs), we study the topological features of different relationships in a heterogeneous network. In a network, the degree of a node, that is, the number of edges associated with the node, reflects the topological structure of the network well [22]. In order to explore the difference between the topological structures of different relationships in an HN, we use the degree-based measure D(e):
$$D(e) = \frac{\max\left(\bar{d}_{n_u}, \bar{d}_{n_v}\right)}{\min\left(\bar{d}_{n_u}, \bar{d}_{n_v}\right)}$$
where n_u and n_v are the node types of nodes u and v in a relation tuple (u, v, e), and $\bar{d}_{n_u}$ and $\bar{d}_{n_v}$ are the average degrees of node types n_u and n_v, respectively. By construction, D(e) ≥ 1. A greater D(e) value indicates that the topology of the two types of connected nodes is not symmetric and one side is biased toward the other; that is, nodes connected by a relationship with a high D(e) value show a stronger affiliation relationship (AR), and nodes connected by this relationship have more similar properties [23]. A smaller D(e) value indicates that the two types of connected nodes are topological peers, which we call a peer relationship (PR).
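As an illustration of the degree-based measure, the sketch below computes D(e) for one relation from its edge list. It assumes D(e) is the ratio of the larger to the smaller average degree of the two endpoint types, which is consistent with the stated property D(e) ≥ 1, and it averages only over nodes that appear in at least one edge of the relation. The function name and the toy edges are hypothetical.

```python
from collections import defaultdict


def degree_measure(edges):
    """D(e) for one relation, given its edges as (u, v) pairs.

    Assumption: D(e) = max(avg_deg_u, avg_deg_v) / min(avg_deg_u, avg_deg_v),
    with averages taken over nodes that occur in this relation.
    """
    deg_u = defaultdict(int)
    deg_v = defaultdict(int)
    for u, v in edges:
        deg_u[u] += 1
        deg_v[v] += 1
    avg_u = len(edges) / len(deg_u)
    avg_v = len(edges) / len(deg_v)
    return max(avg_u, avg_v) / min(avg_u, avg_v)


# Toy data (made up): six proteins attached to two action types.
ar_edges = [("p1", "a1"), ("p2", "a1"), ("p3", "a1"),
            ("p4", "a2"), ("p5", "a2"), ("p6", "a2")]
# Toy data (made up): two drugs and two proteins, fully connected.
pr_edges = [("d1", "p1"), ("d1", "p2"), ("d2", "p1"), ("d2", "p2")]

print(degree_measure(ar_edges))  # 3.0, lopsided: affiliation-like
print(degree_measure(pr_edges))  # 1.0, balanced: peer-like
```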
To make the comparison more reliable, we also compare these relationships in terms of sparseness, denoted S(e), so as to discover the differences in the network structure of different relationships. We define S(e) as follows:
$$S(e) = \frac{N_e}{N_{n_u} \times N_{n_v}}$$
where N_e is the number of edges of type e, and $N_{n_u}$ and $N_{n_v}$ are the numbers of nodes of types n_u and n_v, respectively. We conducted a comprehensive analysis of the obtained data according to the above indicators, as shown in Table 1.
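The sparseness indicator can be sketched in the same spirit. The snippet assumes S(e) is the number of edges of a relation divided by the product of the node counts of its two endpoint types, as suggested by the symbol definitions above. The function name and toy numbers are hypothetical.

```python
def sparsity(edges, num_nodes_u, num_nodes_v):
    """S(e) = N_e / (N_u * N_v) for one relation; duplicate edges count once.

    Assumption: the node counts are the total numbers of nodes of the two
    endpoint types, passed in by the caller.
    """
    return len(set(edges)) / (num_nodes_u * num_nodes_v)


# Toy data (made up): 6 protein-action edges among 6 proteins and 2 actions.
ar_edges = [("p1", "a1"), ("p2", "a1"), ("p3", "a1"),
            ("p4", "a2"), ("p5", "a2"), ("p6", "a2")]
print(sparsity(ar_edges, num_nodes_u=6, num_nodes_v=2))  # 0.5
```

A value of 1.0 would mean that every possible pair of the two node types is connected.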

Different models for PR and AR
To respect their different characteristics, we design separate models for the two relationship types. Two nodes connected by a PR have a strong interactive relationship and a very similar topological structure. The connection between them carries rich structural information about the two nodes, so we model the PR as a translation between nodes in a low-dimensional vector space.
For the AR, Euclidean distance is used to measure the proximity of interacting nodes in the low-dimensional space. Note that the distance measures used for the two relationship types are mathematically consistent [24]. We use Euclidean distance for the AR for two reasons. First, the nodes connected by this relationship share the same attributes [25], so nodes connected by an AR can be brought directly close to each other in the vector space, which is consistent with optimizing the Euclidean distance [26]. Second, the purpose of heterogeneous network representation is to preserve the structural characteristics of the high-dimensional network. Euclidean distance satisfies the triangle inequality [27], which helps preserve the first-order and second-order proximities of the nodes.
Translation-based distance for peer relations. From Table 1, we found that in the drug heterogeneous network we constructed, most of the relationships are peer relationships. Specifically, a drug acts on multiple diseases, and a disease can also be treated by multiple drugs, while the numbers of drug nodes and disease nodes differ very little. Peer relationships indicate strong interactions between nodes with a peer-to-peer structure. To define the score function of the PR, consider a PR-type relationship tuple (a, r, b), where r ∈ R_PR, with weight w_{a,b}. We denote the embeddings of nodes a and b by P_a and P_b, respectively, and the embedding of relation r by Q_r. The score function is defined as follows:
$$f(a, r, b) = w_{a,b} \, \lVert P_a + Q_r - P_b \rVert$$
For the relationship tuples (a, r, b) ∈ T_PR whose relationship is a PR in the heterogeneous network, the margin-based loss function [24] is defined as follows:
$$\mathcal{L}_{PR} = \sum_{(a, r, b) \in T_{PR}} \; \sum_{(a', r, b') \in T'_{PR}} \max\left[0, \; \gamma + f(a, r, b) - f(a', r, b')\right]$$
In the above formula, T_PR represents the set of positive PR triples, T'_PR is the set of negative samples, and γ > 0 is a margin hyperparameter.
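The translation-based score and its margin loss can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the score has the form w·‖P_a + Q_r − P_b‖ with a Euclidean norm, and that the loss is the usual hinge on the gap between a positive and a negative score. All function names and toy vectors are hypothetical.

```python
import math


def pr_score(p_a, q_r, p_b, w_ab=1.0):
    """Translation-style score: w_ab * ||P_a + Q_r - P_b|| (Euclidean norm assumed)."""
    return w_ab * math.sqrt(sum((x + r - y) ** 2 for x, r, y in zip(p_a, q_r, p_b)))


def margin_loss(pos_score, neg_score, gamma=1.0):
    """Hinge loss: zero once the negative is at least gamma farther than the positive."""
    return max(0.0, gamma + pos_score - neg_score)


# Toy 2-d embeddings (made up): P_a + Q_r lands exactly on P_b.
p_a, q_r, p_b = [0.0, 0.0], [1.0, 0.0], [1.0, 0.0]
p_b_neg = [4.0, 4.0]  # corrupted tail, far from P_a + Q_r

pos = pr_score(p_a, q_r, p_b)      # 0.0
neg = pr_score(p_a, q_r, p_b_neg)  # 5.0
print(margin_loss(pos, neg))       # 0.0, the margin is already satisfied
print(margin_loss(neg, pos))       # 6.0, a badly ordered pair is penalised
```

Lower scores mean the triple fits the translation better, so training pushes positive scores down and negative scores up until they are separated by the margin.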
Euclidean distance for affiliation relations. In the heterogeneous network we constructed, only the relationship between a target protein and its action type is an AR. Specifically, the numbers of protein nodes and action nodes differ greatly. Nodes connected by this relationship can be brought directly close to each other in the vector space, so we use Euclidean distance to calculate the proximity between two nodes. Given a triple (m, i, n) with relationship type AR, where i ∈ R_AR represents the action relationship between nodes m and n with weight w_{m,n}, the score function takes the following form:
$$g(m, n) = w_{m,n} \, \lVert P_m - P_n \rVert_2^2 \tag{6}$$
As before, P_m and P_n are the embeddings of nodes m and n, respectively, and g(m, n) measures the distance between m and n in the low-dimensional space. To ensure that nodes connected by an AR are close to each other, we minimize g(m, n), and therefore define the margin-based loss function as:
$$\mathcal{L}_{AR} = \sum_{(m, i, n) \in T_{AR}} \; \sum_{(m', i, n') \in T'_{AR}} \max\left[0, \; \gamma + g(m, n) - g(m', n')\right]$$
As before, T_AR and T'_AR are the sets of positive and negative examples for the AR, respectively.
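A minimal sketch of the affiliation score in equation (6) follows. Only the formula g(m, n) = w_{m,n}·‖P_m − P_n‖² is taken from the text; the function names, the hinge helper, and the toy vectors are hypothetical.

```python
def ar_score(p_m, p_n, w_mn=1.0):
    """Equation (6): weighted squared Euclidean distance between two embeddings."""
    return w_mn * sum((x - y) ** 2 for x, y in zip(p_m, p_n))


def hinge(pos_score, neg_score, gamma=1.0):
    """Margin-based term for one (positive, negative) pair of AR scores."""
    return max(0.0, gamma + pos_score - neg_score)


# Toy embeddings (made up) for a protein and two action nodes.
protein = [1.0, 2.0]
true_action = [1.0, 2.5]
other_action = [4.0, 6.0]

pos = ar_score(protein, true_action)   # 0.25
neg = ar_score(protein, other_action)  # 25.0
print(pos, neg, hinge(pos, neg))       # 0.25 25.0 0.0
```

Because the score involves no relation vector, minimising it simply pulls a protein toward its action node, which matches the action-centred structure described for this relation.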

Conjunctive Model
To obtain a complete model, we combine the two models presented in the previous section into a single objective and train them jointly by minimizing the overall loss.
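The joint objective is not written out at this point. One simple reading, stated here as an assumption rather than as the authors' exact formula, is an unweighted sum of the margin-based loss over peer relations and the margin-based loss over affiliation relations:

```latex
\mathcal{L} = \mathcal{L}_{PR} + \mathcal{L}_{AR}
```

Here $\mathcal{L}_{PR}$ and $\mathcal{L}_{AR}$ denote the losses of the translation-based and Euclidean models, respectively. Under this reading the two models share the node embeddings, so minimising the sum fits both relation types at once.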
Table 1 clearly shows that the distribution of PR and AR is quite unbalanced. In the second experiment, to prevent the traditional edge sampling method from being biased toward the more numerous relationship type, we draw positive samples according to a probability distribution. Negative samples are constructed in the usual way, by corrupting positive tuples:
$$T'_{(m, i, n)} = \{(m', i, n) \mid m' \in V\} \cup \{(m, i, n') \mid n' \in V\}$$
That is, for a positive relationship tuple (m, i, n), either m or n can be randomly replaced, but not both simultaneously.
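The corruption rule for negative samples (replace one endpoint of a positive tuple, never both) can be sketched as below. The function name, the rejection of corrupted tuples that happen to be known positives, the retry cap, and the toy data are assumptions made for illustration.

```python
import random


def corrupt(triple, head_candidates, tail_candidates, positives, rng, max_tries=100):
    """Return a negative triple by replacing the head OR the tail, never both.

    Assumption: corrupted triples that are known positives are rejected and redrawn.
    """
    m, i, n = triple
    for _ in range(max_tries):
        if rng.random() < 0.5:
            candidate = (rng.choice(head_candidates), i, n)  # replace head only
        else:
            candidate = (m, i, rng.choice(tail_candidates))  # replace tail only
        if candidate not in positives:
            return candidate
    raise RuntimeError("no negative sample found; candidate pools may be too small")


# Toy data (made up): proteins, action types and two known positive triples.
proteins = ["p1", "p2", "p3"]
actions = ["a1", "a2"]
positives = {("p1", "has_action", "a1"), ("p2", "has_action", "a2")}

rng = random.Random(0)
negative = corrupt(("p1", "has_action", "a1"), proteins, actions, positives, rng)
print(negative)
```

The relation type is kept fixed, so each negative differs from its positive in exactly one endpoint.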

Datasets
In this paper, the data set used to construct the heterogeneous network includes the node type set N = {drug, target, diseases, side-effects, action} and the relationship type set R = {drug-drug, drug-target, drug-diseases, drug-side-effects, target-target, target-diseases, target-action}. The data sources are as follows:
DrugBank is a bioinformatics and cheminformatics resource that combines detailed drug data with comprehensive drug-target interaction, target-action, and drug-drug interaction information. We use DrugBank version 3.0 and DrugBank version 5.1.6 [28].
HPRD (Human Protein Reference Database) contains manually curated scientific information pertaining to the biology of most human proteins. The protein-protein interaction data were extracted from HPRD Release 9 [29].
CTD (Comparative Toxicogenomics Database) is a public website and research tool that provides four types of core data: chemical-gene interactions, chemical-disease associations, gene-disease associations, and chemical-phenotype associations. The drug-disease and protein-disease associations used in this paper were extracted from CTD [30].
SIDER contains information about marketed drugs and their recorded adverse reactions. In this paper, the drug-side-effect associations were extracted from SIDER Version 2 [31].
We obtained data from the above four sources and, after preprocessing, finally obtained 708 drugs, 1512 target proteins, 5603 diseases, 47 actions, and 4192 side effects. Some descriptive statistics of the dataset are shown in Table 1.

Experimental Settings
RTHNE_DTI has three parameters: the embedding dimension d, the margin γ, and α. We set γ = 1 and α = 0.01. To study the influence of the embedding dimension on our model, we varied d. As shown in Figure 3, the predicted AUC value is highest when the dimension is 300, so we set d = 300 in the experiments unless otherwise stated.
As evaluation metrics, we use AUC and AUPR to measure prediction performance. We compare RTHNE_DTI with the following methods.
DT-Hybrid [32] is a recommendation method relying on network-based inference, which is based on domain knowledge including drug and target similarity.
BLM-NII [33] improves the traditional BLM method with neighbor-based interaction-profile inferring, and can be used to deal with new drug and target candidates.
HNM [34] combines drug-target information and calculates the strength of each drug-disease pair with an iterative algorithm on the heterogeneous graph.
MSCMF [10] uses multiple drug similarity matrices and multiple target similarity matrices to project drugs and targets into a common low-dimensional feature space to predict DTIs.
NetLapRLS [35] is a semi-supervised learning method based on Laplacian regularized least squares (LapRLS), which uses a small amount of available labeled data together with a large amount of unlabeled data to obtain maximum generalization ability from chemical structure and genome sequence.
DTINet [36] is a network integration approach that integrates heterogeneous information from drug-target heterogeneous networks.
RHINE [18] is a heterogeneous information network (HIN) embedding method that uses the structural characteristics of heterogeneous relations.
NeoDTI [17] integrates diverse information from heterogeneous networks and uses a graph neural network to automatically learn the representations of drugs and targets.
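The two metrics used throughout the experiments, AUC and AUPR, can be computed from ranked predictions as sketched below. This is a dependency-free illustration, not the evaluation code used in the paper: AUC is computed as the fraction of positive-negative pairs ranked correctly, and AUPR is approximated by average precision, which is one common estimator. The function names and toy scores are hypothetical.

```python
def auc_score(labels, scores):
    """ROC AUC: probability that a positive is scored above a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """AUPR estimate: mean of the precision values at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for k, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / k
    return total / hits


# Toy predictions (made up): higher score means "more likely to interact".
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
print(round(auc_score(labels, scores), 4))          # 0.8333
print(round(average_precision(labels, scores), 4))  # 0.8333
```

For an embedding model whose output is a distance (smaller means more likely to interact), the negated distance can be passed as the score.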

Task 1: Predictive performance on the labeled network based on PR relationships
In this paper, we conducted experiments in two different scenarios to predict the interactions between drugs and targets, one based on the labeled network and the other based on the unlabeled network.
Our analysis of the data set (Table 1) showed that most of the relationships in the drug-target heterogeneous network are of PR type, so here we used only the data from PR relationships, specifically {drug-drug, drug-target, drug-diseases, drug-side-effects, target-target, target-diseases}, and compared the performance of our model with other DTI prediction models.
In this experiment, we used 90% of the drug-target relationships and all other PR relationships as the training set, and the remaining 10% of the drug-target relationships were held out as the test set. We conducted two experiments that differ in how negative examples are chosen: in the first, the ratio between positive and negative samples was set to 1:10; in the second, all unknown drug-target pairs were treated as negative samples.
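The evaluation protocol described above (hold out part of the known drug-target pairs, then pair each held-out positive with a fixed number of unknown pairs as negatives) can be sketched as follows. The training fraction is deliberately left as a required argument. The function name, the seeding, and the toy identifiers are hypothetical, and the sketch assumes that negatives are drawn uniformly from the unknown pairs.

```python
import random


def split_and_sample(dt_pairs, drugs, targets, train_frac, neg_ratio=10, seed=0):
    """Split known drug-target pairs and draw negatives for the test set.

    Assumption: negatives are sampled uniformly from pairs with no known interaction.
    """
    rng = random.Random(seed)
    pairs = sorted(dt_pairs)
    rng.shuffle(pairs)
    cut = int(round(train_frac * len(pairs)))
    train_pos, test_pos = pairs[:cut], pairs[cut:]

    known = set(dt_pairs)
    unknown = [(d, t) for d in drugs for t in targets if (d, t) not in known]
    k = min(len(unknown), neg_ratio * len(test_pos))
    test_neg = rng.sample(unknown, k)
    return train_pos, test_pos, test_neg


# Toy data (made up): 10 drugs, 10 targets, one known target per drug.
drugs = [f"d{i}" for i in range(10)]
targets = [f"t{i}" for i in range(10)]
known_pairs = [(f"d{i}", f"t{i}") for i in range(10)]

train_pos, test_pos, test_neg = split_and_sample(
    known_pairs, drugs, targets, train_frac=0.9, neg_ratio=10)
print(len(train_pos), len(test_pos), len(test_neg))  # 9 1 10
```

Setting `neg_ratio` large enough that all unknown pairs are drawn corresponds to the second setting, in which every unknown pair is treated as a negative.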
The comparison between our model and other models is shown in Table 2. The AUC scores obtained by our model in the two prediction settings are 94.3% and 95.8%, which exceed those of NeoDTI by 3% and 2%, respectively. Moreover, the embedding dimension of our method is 300, whereas that of NeoDTI is 1024.
It should be noted that NeoDTI, in addition to the data mentioned above, also uses drug structure similarity and protein sequence similarity information. Furthermore, NeoDTI is very time-consuming: its running time is about 100 times that of our method.

Task 2: Predictive performance on the labeled network based on all relationships
In this part we used all relationships and compared with more advanced methods. As before, we use 90% of the drug-target relationships and all other relationships as the training set, and the remaining data as the test set. For a fair comparison, we set the embedding dimension d = 100, because the two compared models run most efficiently when the dimensionality is low, and all unknown pairs were treated as negative samples for all methods in this experiment. The results are shown in Figure 4. It can be seen that our model is superior to the other two methods. Although NeoDTI also utilizes similarity information for drugs and target proteins, the AUC value of our model is still 9% higher. Compared with RHINE, our method considers more heterogeneous relationships and therefore performs better in this task. In addition, in this experiment the AUC value of our method is 96.93%, which is about 3% higher than the result obtained using only PR relationships. This indicates that the AR-type relationship is also important for improving prediction ability.

Task 3: Predictive performance on the unlabeled network based on all relationships
Existing DTI prediction methods generally add drug-target pairs with known relationships to the training set to train the model. We ask whether it is possible to make predictions without adding any drug-target relationships to the training set, using only the other relationships. To test this conjecture, we conducted the experiment of Task 3, in which all drug-target relationship pairs are used as the test set, with a ratio of positive to negative samples of 1:10. In this experiment, the AUC and AUPR scores obtained by our model are 92.11% and 63.69%, respectively. Therefore, when we have no clue as to whether a drug and a target interact, we can use their external relationships for prediction. As shown in Figure 5, although the AUC value of our method in Task 3 is lower than that in Task 2, it is very close to the result of the NeoDTI method. In addition, in terms of AUPR, the result of RTHNE_DTI in Task 3 is the best.

Predictive performance based on other datasets
DBLP is an integrated, author-centered bibliographic database of English-language computer science literature. The details of the DBLP dataset are shown in Table 3. From Table 3 we can see that the DBLP dataset contains more IR relations. In the experiment, we predict two relationship pairs, author-author and author-conference. The results are shown in Table 4. The above experiments show that our method achieves good results not only on the drug network but also on the scholar network. Compared with traditional drug-target prediction methods, our method is more general.

Conclusions
Accurately predicting the interaction between drugs and target proteins is very important for drug discovery. In this paper, we applied heterogeneous network representation learning to predict drug-target interactions. We built a heterogeneous network from the rich external relationships of drugs and target proteins, and learned drug and protein representations through neighboring nodes. According to the different topological structures of the relationships in the heterogeneous network, we divided the relationships into two categories, affiliation relations and peer relations, and modeled them separately. In this way, our model can better capture the topological and semantic information of the network of drugs and target proteins. To evaluate our method, we compared it with several state-of-the-art approaches. The results show that RTHNE_DTI takes less time and is more efficient on the DTI prediction task. Furthermore, RTHNE_DTI achieves good results on both the labeled and the unlabeled network of drug-target relationships. In the future, we will incorporate rich domain knowledge about drugs and target proteins into the heterogeneous network to further improve prediction.