Deep Learning for Medication Recommendation: A Systematic Survey

ABSTRACT Prescribing medication in response to a patient's diagnosis is a challenging task. The number of pharmaceutical companies, their inventories of medicines, and the recommended dosages confront a doctor with the well-known problem of information and cognitive overload. To assist medical practitioners in making informed prescription decisions, researchers have exploited electronic health records (EHRs) to recommend medication automatically. In recent years, medication recommendation using EHRs has become a salient research direction, attracting researchers to apply various deep learning (DL) models to patients' EHRs. Yet, in the absence of a holistic survey article, studying these publications to understand the current state of research and to identify the best-performing models, trends, and challenges takes considerable effort and time. To fill this research gap, this survey reports on state-of-the-art DL-based medication recommendation methods. It reviews the classification of DL-based medication recommendation (MR) models, compares their performance, and discusses the unavoidable issues they face. It also reports on the most common datasets and metrics used in evaluating MR models. The findings of this study have implications for researchers interested in MR models.


INTRODUCTION
A recommender system is an information retrieval and filtering mechanism that attempts to mitigate the negative impact of the well-known problems of information and cognitive overload resulting from the ever-growing size of information repositories [1,2]. Medical science is no exception: the abundance of pharmaceutical companies and their growing number of medicines make it hard for a doctor to prescribe medication against the diagnosis and medical history of a patient. To address this issue, researchers have used electronic health records (EHRs) to recommend medication automatically so that a medical practitioner can make an informed decision when selecting a drug for a prescription. EHRs present a comprehensive picture of the medical history of patients and may include previous medications, diagnoses, laboratory tests, treatment plans, and medical imaging such as X-rays, ultrasounds, and magnetic resonance imaging (MRI) scans [3]. They are the main data carriers for personalized medical research [4]. In addition, recent improvements in the quality of EHRs have attracted researchers because of their potential applications, viz., medical diagnosis and recommendation. EHRs are semantics-rich and are represented as a patient's temporal admission sequence with a series of clinical events, including procedures, diagnoses, and medications [4]. When these records are combined with the current clinical status (events, diagnoses, etc.) of a patient and fed into a medication recommendation system, they yield personalized medication recommendations, which assist medical practitioners in making informed prescriptions for the current health condition of the patient [5]. However, the recommendation task is challenging and highly non-trivial, with a long history of machine-aided medical diagnosis and treatment.
A medication recommender system can employ content-based (CB), collaborative (CF), or hybrid filtering [6,7]. However, these traditional filtering approaches produce inadequate results due to issues like data sparsity, cold-start, and lack of personalization [8]. In response to these issues, researchers have employed deep learning (DL) to produce quality medication recommendations. Notable examples of DL-based medication recommendation (MR) models include [9,10,11,12,13,3,14,15].
Several surveys and review articles [6,16,17,18,19,20,7] have explored the domain of healthcare and medication recommendation. Sezgin and Ozkan [6] discussed traditional MR models using information filtering methods. However, they were unable to report on the current state of DL-based MR models and the issues they face.
Hors-Fraile et al. [16] presented a general overview of the technical aspects of MR models, including filtering methods and profile adaptation techniques, covering work published during 2007-2016. However, they covered very few works on MR models; most of the studies they reviewed relate to health and lifestyle, with no analysis of DL-based MR models. Their coverage of the latest DL-based MR models was also limited.
Zhang et al. [17] reviewed ML- and DL-based models for personalized medicine, touching only briefly on the MR task. They covered challenges in personalized medicine and some future opportunities. However, they did not cover technical aspects such as filtering methods and information sources, and they performed no analysis of ML- and DL-based MR models and optimization methods.

| Model reference | Duration | Model types | Issues explored | Trends | Strengths and limitations |
| Sezgin and Özkan [6] | 1998-2012 | General | Few issues only | Limited | No coverage of the issues faced by MR models; no classification of MR models based on information sources and filtering methods; no analysis of the DL-based MR models; relatively old study with no coverage of latest MR models |
| Hors-Fraile et al. [16] | 2007-2016 | General | Few issues only | Derived | Presents technical aspects including filtering methods (CB, CF), profile representation, and adaptation techniques; negligible works on MR models, most studies are related to health and lifestyle; no analysis of the DL-based MR models; limited coverage of latest DL-based MR models |
| Zhang et al. [17] | N.G | ML- and DL-based | Issues | Limited | Presents ML and DL models for personalized medicine with a little touch to MR task; covers challenges in personalized medicine and future opportunities; no coverage of technical aspects including filtering methods, information sources; no analysis of the DL-based MR models and optimization methods |
| Rajkomar et al. [18] | N.G | General | Challenges | Limited | Presents a general overview on how ML can be used in medicine; presents how ML works and the type of input and output medicinal data that power ML algorithms; no discussion on any aspect of ML algorithms for MR task |
| Ngiam and Khor [19] | N.G | ML-based | Benefits and issues of ML algorithms | Limited | Presents some benefits and challenges of ML-based models in health-care delivery; covers certain ML platforms and tools that may offer recommendations in addition to other services; no coverage of recommendation-specific details including filtering methods; no coverage of information sources and factors; few works on MR models, most studies are related to health care delivery; no analysis of the DL-based MR models; no coverage of optimization methods |
| Su et al. [20] | N.G | DL-based | Challenges and opportunities | Limited | Presents network embedding models widely used in the biomedical domain and assesses their performance; presents software tools used for network embedding in the biomedical domain; covers challenges faced by network embedding models and future directions on how to improve them; no coverage of recommendation-specific details including filtering methods, sources, factors, and optimization methods |
| Etemadi, Maryam, et al. [7] | 2010-2021 | General | Issues only | Derived | Presents technical aspects including filtering methods (CB, CF, hybrid, knowledge- and context-based |

Considering the above discussion and the recent emergence of novel DL-based MR models, an inclusive and comprehensive analysis is required to analyze the area, find interesting trends, and highlight the main issues. With this study, we explore the domain of MR models that employ DL methods.
Coverage and contributions. This study presents a comprehensive review of the literature on DL-based MR systems by reporting on 37 MR models that employed deep neural networks and were published during 2013-2022. It classifies these DL models with regard to their platform, problems addressed, DL-based information filtering, information factors exploited, optimization methods adopted, and the type of recommendation, viz., personalized vs. non-personalized. This review has implications for researchers working in the DL-based MR domain by reporting on the strengths, limitations, and trends in DL-based MR models. It also reports on open research issues, challenges, and research opportunities in DL-based MR models.
Structure of this article. The remainder of the paper has five sections. Section 2 presents a taxonomy of MR models by covering platform, information factors, information filtering methods, optimization, and recommendation types. Section 3 covers datasets and metrics used in evaluating these models. Section 4 presents a comparison of the experimental results of the explored models using different datasets and evaluation metrics. Section 5 discusses issues and challenges faced by the reported DL-based MR models and the opportunities to address them. Section 6 concludes the article with the main findings and future directions derived from this study.

TAXONOMY OF MODELS
This section presents a taxonomy of DL-based MR models developed by reviewing the 37 selected studies on medication recommendation, as illustrated in Figure 1. The classification is based on the platform used (online vs. offline), data features considered, deep neural networks used, issues and challenges faced, optimization methods adopted, and recommendation types such as personalized vs. non-personalized. The following subsections present this taxonomy.

Platform
The term platform indicates whether an MR model has been deployed in a real online recommendation system. This shows how many MR research works are actually part of practical applications. Table 2 shows that only one model [23] is part of an online system, while the other models work offline, indicating that most of the proposed models are not used in practical applications.

Information Factors
This section reports on the information sources and features used by reviewed DL-based MR models.
Medication history. An accurate medication history offers the foundation to assess the suitability of medication in the current therapy of a patient and directs future treatment choices. It helps in preventing errors in the prescription of medicines and avoids other pharmaceutical issues, including poor adherence or nonadherence to the recommended doses. This is the most important factor in the explored MR models, as all 37 models adopt it.

Time/Temporal dynamics. Time is among the crucial dimensions in generating recommendations [49]. A patient who feels sick visits the hospital, where the doctors prescribe drugs after examining the lab tests. This clinical practice leads to the irregular production of medical records. It is widely assumed that the recent medical records of a patient are more important than the previous ones in predicting their current health status [22]. However, even these irregular historical records hold valuable clinical data that may not exist in the latest record (e.g., an extremely abnormal glucose level in the blood). Therefore, it is essential to build a time-aware and more adaptive mechanism that flexibly learns the impact of the time interval for each clinical feature. In addition, the temporal aspect of the patients' conditions and of their hospital visits needs to be considered in recommending medications. In line with this need, the reported literature (Table 2) reveals that many models, 29 out of 37, used the time factor in recommending medications [9,21,10,23,11,24,25,12,13,5,26,27,14,28,15,29,3,30,31,50,32,34,35,39,22,41,44,42,47,48].
Diagnoses. The process of medical diagnosis determines the relationship of a disease with the signs and symptoms of a patient. The diagnosis draws on the physical examination and medical history of the patient by employing one or more diagnostic procedures, including lab tests. An accurate and timely diagnosis has a high probability of a positive health outcome for the patient, as a correct understanding of the health problem enables effective decision-making [51]. This factor has been used by several studies, as shown in Table 2.
Symptoms and signs. Symptoms describe a disease from the perspective of the patient, offer subjective evidence, and describe the complaints that lead the patient to the health care unit, while signs are the manifestations of the disease that a doctor perceives. Few models [37,38,41,36] have used this feature, as shown in Table 2, because symptoms alone may not provide sufficient evidence for a certain disease.

Table 2. Continued
Word embedding is widely used in natural language processing (NLP) for learning the latent representations of words and phrases. Several word embedding models have been proposed to capture rich syntactic and semantic information about words and phrases. The most accepted and widely used among these include word2vec [54], doc2vec [57], and BERT [58]. They have been exploited in embedding items, users, documents, and locations [59] into a latent space. In network/graph embedding [55,1], the networks/graphs and their nodes are converted into low-dimensional representations by considering the structure of the networks, their topological configurations, their relationships with the nodes, and other auxiliary details, including content and attributes. Graph embedding methods capture meaningful relationships between nodes (medications, patients, procedures, diagnoses, etc.), which depend on the node-to-node differences in the embedding space [60].

Figure 2. Information factors used in the LSTM-DE Model.
A knowledge graph (KG) is a heterogeneous graph that represents entities as nodes and the relationships among these entities as edges between nodes [61]. KG-embedding models, such as TransD [62], GCN [63], GNN [64], and GAN [65], enrich the representation of users and medications. Such models mostly have two modules: a graph embedding module that learns the representations of entities and relationships, and a recommendation module that estimates the preference of the patient for a certain medication so that the medical practitioner can prescribe it if appropriate. An example of KG embedding in MR models using an EHR graph is GAMENet [21], which embeds the KG of drug-drug interactions (DDI) via a memory module that is implemented as a GCN [63], defined in Equation 1.
where D and I denote the diagonal and identity matrices, respectively. The model then applies a two-layer GCN on each graph to learn extended embeddings of drug combinations and DDIs, respectively. Through this model, the longitudinal patient records are jointly learned as an EHR graph and the drug knowledge base as the DDI KG to recommend safe and effective medications. Longitudinal methods such as RETAIN [10] and DMNC [14] outperform traditional DL baselines, which confirms the importance of temporal data in medication recommendation. However, they recommend a large number of medication combinations. To address this issue, GAMENet uses the KG to improve performance and reduce the DDI rate. Yet, using the DDI graph alone may restrict some medication rules considering the external knowledge [27]. The patient representation and the memory output are exploited in predicting the multi-label medication $\hat{y}_t$, as defined by Equation 2.
where $q^t$ is the query at the $t$-th visit. The first memory output is retrieved directly from the current memory state $M_b$ using content attention, and the second is obtained using the information retrieved from $M_b$ and the attention $a_m^t$ from the temporal aspect. In the same direction, G-BERT utilizes GCN [63] to learn the initial embedding of medical codes using a medical ontology. It exploits EHR data by pre-training an adapted BERT [58] embedding model on single-visit data that is usually discarded, and learns the patient's visit embedding $v$ as follows.
where [CLS] denotes the special token used in BERT, $c^*$ represents a medical code, and $o_{c^*}$ denotes the ontology embedding vector for leaf node $c^*$. Finally, G-BERT applies a prediction layer to generate medication recommendations. The results show that G-BERT achieves better Jaccard and F-scores than GAMENet and the attention-based RETAIN [10] model, which indicates that incorporating hierarchical ontology information with a pre-training procedure results in improved predictions.
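The GCN-style propagation that GAMENet and G-BERT build on can be illustrated with a minimal sketch. It assumes the row-normalized form D^-1 (A + I) for the preprocessed adjacency matrix and one-hot node features; the function names, the tanh nonlinearity, and the toy weights are illustrative assumptions, not the authors' implementation.

```python
import math


def normalize_adjacency(adj):
    """Return D^-1 (A + I): add self-loops, then row-normalize (assumed form)."""
    n = len(adj)
    with_loops = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                  for i in range(n)]
    normalized = []
    for row in with_loops:
        degree = sum(row)  # diagonal entry of D; >= 1 thanks to the self-loop
        normalized.append([value / degree for value in row])
    return normalized


def matmul(a, b):
    """Plain dense matrix product for small lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def gcn_two_layer(adj, w1, w2):
    """Z = A_hat * tanh(A_hat * W1) * W2, with one-hot (identity) node features."""
    a_hat = normalize_adjacency(adj)
    hidden = [[math.tanh(x) for x in row] for row in matmul(a_hat, w1)]
    return matmul(a_hat, matmul(hidden, w2))


# Toy graph of three drugs: drugs 0 and 1 interact, drug 2 is isolated.
toy_adj = [[0.0, 1.0, 0.0],
           [1.0, 0.0, 0.0],
           [0.0, 0.0, 0.0]]
toy_w1 = [[0.2, -0.1], [0.4, 0.3], [-0.5, 0.1]]   # hypothetical weights
toy_w2 = [[1.0, 0.0], [0.5, -1.0]]                # hypothetical weights
drug_embeddings = gcn_two_layer(toy_adj, toy_w1, toy_w2)
```

In this toy graph, drugs 0 and 1 share the same normalized neighborhood and therefore receive identical embeddings, while the isolated drug 2 keeps only its own signal.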
In the same direction, MedGCN [23] makes medication predictions for patients with incomplete lab tests. The authors explain this with the example scenario illustrated in Figure 3. Here, the need is to predict the missing values of lab test results, e.g., for encounters 2, 3, and 4, and to recommend a full or partial medication list for encounters 3 and 4. MedGCN exploits the relations among entities (encounters, patients, medications, and lab tests) using a heterogeneous graph (called MedGraph) of their inherent features. For each entity in this graph, it learns a vector representation based on GCN [63]. To deal with different entities, the model decomposes the heterogeneous graph into multiple subgraphs, each holding one type of edge (relation) and represented by a single adjacency matrix. In each GCN layer, the model aggregates the representations of each node in all the subgraphs to learn its final embedding. These representations are then fed to two fully connected neural networks, $f_M^h$ and $f_L^h$, each followed by the sigmoid activation.
SMGCN [37] proposed a multi-layer neural network to simulate the interactions between herbs and symptoms for recommending herbs. Given the set of symptoms, it first employs Bipar-GCN [66], which propagates the symptom-oriented embedding for the target symptom node and the herb-oriented embedding for the target herb node, respectively. This way, symptom representations $b_s$ and herb representations $b_h$ are learned. Second, it employs synergy graph encoding (SGE) to capture the synergy information of symptom and herb pairs. The symptom embedding $r_s$ is learned by executing GCN on the symptom-symptom graph, constructed from the co-occurrence frequency of symptom pairs. In a similar manner, SMGCN learns the herb embedding $r_h$ from a graph of herbs. Third, it creates the integrated embedding for each symptom (herb) by fusing the two types of embeddings, $b$ and $r$, from Bipar-GCN and SGE.
Finally, it applies the syndrome-aware prediction layer to feed symptoms in the symptom set Sc into an MLP to produce overall syndrome embeddings e syndrome (sc
Deep reinforcement learning techniques. Deep reinforcement learning (DRL) lets machines and software agents mimic the human ability to learn from their own actions. Models employing DRL either penalize or reward an agent for the actions it takes in an environment [67]. The actions that help the agent achieve its goals are rewarded, i.e., reinforced. If an agent performs an action at time t, the environment assigns a quantitative reward to the agent and changes its state in response to the action. The agent repeatedly takes such actions until it reaches a terminal state [68]. These models are well suited to dynamic and changing environments like medication recommendation, and several researchers have used them for recommending medications. Zhang et al. [3] proposed the LEAP (LEArn to Prescribe) model to learn the connections between the categories of medications and multiple diseases and to capture the dependencies among medication categories in recommending medications. They used a recurrent decoder (GRU) for modeling label dependencies and content-based attention [69] for capturing the label-instance mapping. The prediction at step t is given by Equation 4.
where $y$ and $Y$ represent a medication and the full medication set, respectively, and $s_t$ is the variable summarizing the state at step $t$. Here, $\Psi(\cdot)$ denotes the attention mechanism employed and $y_t$ denotes the medication at step $t$. The attention uses a mapping matrix $M$, in which each element $M_{ti}$ indicates the contribution of the $i$-th diagnosis code $x_i$ to generating the $t$-th medication $y_t$. To learn these, the model optimizes the cross-entropy loss function. The basic LEAP model has several issues. For example, it may produce adverse drug interactions because negative training samples are unavailable, and it may generate incomplete medication sequences. To address these issues, it is fine-tuned via model-free policy-based reinforcement learning [70], which increases the expected reward of the treatment set $Y$ suggested by the policy, as given in Equation 5.
where $R(X, Y, \hat{Y})$ is a scalar reward function that assesses the quality of $Y$, and $\hat{Y}$ is the treatment set for $X$ that the doctors have prescribed considering the EHR data.
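The idea of policy-based fine-tuning, raising the expected reward of what the policy suggests, can be sketched on a toy problem. The example below is not LEAP itself: it assumes a softmax policy over three hypothetical candidate medications, hand-picked scalar rewards, and the exact expected policy gradient in place of sampled episodes.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(logits, rewards):
    """Expected reward E[R] of the softmax policy defined by the logits."""
    return sum(p * r for p, r in zip(softmax(logits), rewards))


def policy_gradient_step(logits, rewards, lr):
    """One gradient-ascent step on E[R].

    For a softmax policy, dE[R]/dlogit_k = pi_k * (R_k - E[R]).
    """
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))
    return [x + lr * p * (r - baseline)
            for x, p, r in zip(logits, probs, rewards)]


# Hypothetical rewards: medication 0 matches the doctor's prescription (+1),
# medication 1 is neutral (0), medication 2 causes an adverse interaction (-1).
rewards = [1.0, 0.0, -1.0]
logits = [0.0, 0.0, 0.0]  # start from a uniform policy
for _ in range(200):
    logits = policy_gradient_step(logits, rewards, lr=0.5)
final_probs = softmax(logits)
```

After training, the policy concentrates on the rewarded medication and suppresses the penalized one, which is the behavior the reward function is meant to induce.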
Post-processing and fine-tuning, e.g., using DDI knowledge to remove adverse medication combinations from the prediction results, as adopted in existing models like LEAP, affects the optimal parameters learned in the prediction process. This is illustrated in Figure 4, which demonstrates an adverse DDI between "insulin" and "sulfonamides." If "insulin" is removed, the "diabetes" is not treated, and if "sulfonamides" is removed, the "respiratory tract bacterial infection" receives no treatment. These issues were addressed in CompNet (Combined Order-free Medicine Prediction Network), a graph convolutional reinforcement learning model that avoids unreasonable assumptions on the order of medicines and leverages the correlations among them. It applies Dual-CNN on EHRs to produce patient representations, as given in Equation 6.
where $Z = z_d \oplus z_p$ results from concatenating the representations of diagnoses $z_d$ and procedures $z_p$ along the first axis. These representations are balanced using attention weights $\alpha_t$ to make the attention mechanism more effective. Using a DNN, CompNet approximates the Q-function $Q(s_t, a_t, h)$, which produces a Q-value for each state-action pair $(s_t, a_t)$ at timestamp $t$. The state $s_t$ results from combining the patient's representation $\hat{z}_t$ and the KG representation $t_t$ of the medicines related to the currently predicted medicines. The model parameters are represented with $h$. The model applies a greedy approach at each timestamp $t$ to select a medicine $a_t$ according to the Q-value.
The doctors assign a reward $r_t$ for the selected medicine $a_t$, and the model updates its policy according to this reward. Here, $s_t$ is computed as $s_t = \sigma(W_s h_t)$, where $\sigma$ is the sigmoid activation function, $W_s$ is a learnable parameter matrix, and $h_t$ is the hidden state, computed using Equation 7.
where $W_h$ and $U_h$ are parameter matrices, $h_{t-1}$ is the hidden state representation at the previous step $t-1$, $h_0$ is a zero vector, and $x_t$ is the interaction representation between the KGs of the patient and the medicine at timestamp $t$.

Wang et al. [30] proposed SRL-RNN (Supervised Reinforcement Learning with RNN) to produce recommendations for a general dynamic treatment regime (DTR), i.e., a sequence of tailored treatments in response to the dynamic patient states, that involves multiple medications and diseases. It combines evaluation and indicator signals in learning an integrated policy. SRL-RNN offers an off-policy actor-critic framework for learning complex relations among individuals, their diseases, and medications. The actor network recommends time-varying medications in response to the changing states of patients, where supervision from the decisions made by the doctors helps ensure safe actions and accelerates the learning process by leveraging the doctors' knowledge. The critic network encourages or discourages the recommended treatments by estimating the action value corresponding to the actor network. SRL-RNN is extended with an LSTM to handle the lack of fully observed states in real-world applications, where the entire historical observations are summarized to capture the dependence on the temporal and longitudinal records of the patients. This is achieved by optimizing the loss function given in Equation 8.
where $J_{RL}(h)$ is the objective function of the reinforcement learning task, which attempts to maximize the expected return, and $J_{SL}(h)$ is the objective function of the supervised learning task. However, the limited experience of doctors and the knowledge gap leave the ground truth of a "good" treatment strategy unclear in supervised learning, which may result in imprecise predictions. Compared to the PMDC-RNN and LEAP models, SRL-RNN gives better predictions because its use of reinforcement learning infers optimal policies well even from non-optimal prescriptions. According to this study, only four models adopted DRL [30,31,41,3].
Recurrent neural networks. Unlike feed-forward neural networks, RNNs employ gates such as input, output, and forget gates to hold useful data and long-term dependencies [53]. They are close to CNNs, yet they preserve previously learned data by employing the concept of memory and use it in upcoming operations. This aspect makes these networks suitable for sequential data [71]. They keep previous data using a directional loop and feed it to the output. Depending on the nature of the problem, they have many variants, of which gated recurrent units (GRU) [72,73] and long short-term memory (LSTM) [53] are the most widely used.
To deal with the vanishing gradient problem [72] encountered by traditional RNNs, extensions of RNNs, viz., GRUs and LSTMs, introduced gates. Among these, the LSTM uses input, output, and forget gates to either keep or discard information. GRUs, on the other hand, use hidden states to pass information and employ reset and update gates: the update gate is similar in functionality to the input and forget gates of the LSTM, whereas the reset gate decides which past information is forwarded to the next level. The RNN model and its variants capture long-range dependencies and temporal dynamics [72,74] and are therefore well suited to medication recommendation, where various models use them. For example, PMDC-RNN [45] predicts multiple medications by applying a three-layered GRU model [73] on the patients' diagnosis records, i.e., diagnostic billing codes. However, it may predict imprecise medications due to discontinued medications or missing billing codes. LSTM-DE [39] is a next-period prescription prediction model that uses a heterogeneous LSTM with several hidden temporal sequences to capture the dynamics of medical sequences. The model constructs one hidden temporal sequence to model the prediction sequence and the other hidden temporal sequences to model physical examination results. Correspondingly, one hidden sequence each reflects the treatment course and the recovery progress. Then, three heterogeneous LSTM models exploit the interactions of the various medical sequences: a fully connected heterogeneous LSTM keeps the interactions of hidden states bidirectional and parallel; a partially connected heterogeneous LSTM keeps only the interactions from physical hidden states to treatment hidden states; and, in the decomposed LSTM model, the physical examination results are directly imposed on the treatment hidden states. Finally, the model incorporates demographics and diagnostics in the hidden states to predict the next-time prescriptions. Because the model uses auxiliary information sources, it achieves improved area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPR) scores compared to vanilla LSTM and other baselines.
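The role of the GRU's update and reset gates described above can be made concrete with a minimal single-step sketch. It uses one common formulation (bias terms omitted, new state = (1 - z) * previous + z * candidate); the parameter names, toy weights, and toy visit vectors are assumptions for illustration and do not reproduce any of the surveyed models.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(weights, vector):
    """Matrix-vector product for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def gru_step(x, h_prev, params):
    """One GRU step over input x and previous hidden state h_prev (biases omitted)."""
    xh = x + h_prev  # list concatenation: [x ; h_prev]
    z = [sigmoid(a) for a in affine(params["W_z"], xh)]  # update gate
    r = [sigmoid(a) for a in affine(params["W_r"], xh)]  # reset gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)]       # reset applied to history
    candidate = [math.tanh(a) for a in affine(params["W_h"], x + gated)]
    # The update gate blends the old state with the candidate state.
    return [(1.0 - zi) * hi + zi * ci
            for zi, hi, ci in zip(z, h_prev, candidate)]


# Hypothetical weights: input size 2, hidden size 2, so each matrix is 2 x 4.
toy_params = {
    "W_z": [[0.5, -0.3, 0.1, 0.2], [0.0, 0.4, -0.2, 0.1]],
    "W_r": [[-0.1, 0.2, 0.3, 0.0], [0.3, 0.1, 0.0, -0.4]],
    "W_h": [[0.6, -0.5, 0.2, 0.1], [-0.2, 0.7, 0.1, 0.3]],
}
toy_visits = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy visit vectors
hidden = [0.0, 0.0]
for visit in toy_visits:
    hidden = gru_step(visit, hidden, toy_params)
```

With all weights set to zero, both gates equal 0.5 and the candidate is zero, so each step simply halves the previous hidden state; this is a quick sanity check on the blending rule.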
The RETAIN model [10] addressed the interpretability issue by employing two-level neural attention for sequential data, offering a detailed interpretation of prediction findings while preserving RNN-like prediction accuracy. To generate more stable attention, it mimics physician behavior during an encounter by looking at the past visits of the patient in reverse temporal order. This way, it identifies important visits and quantifies the visit-specific properties that contribute to a prediction. Because it exploits temporal data, it outperforms MLP-based MRS and vanilla GRU, which use no such data [5]. However, because it considers only the patient's history, the recommendations it produces are of low quality [5]. An unfolded view of its architecture is shown in Figure 5. In the first step, embeddings are generated. In the second and third steps, the $\alpha$ and $\beta$ values are produced using RNN$_\alpha$ and RNN$_\beta$, respectively. In the fourth step, the generated attentions of the third step are exploited to produce the context vector $c_j$ for a patient up to the $j$-th visit, given by Equation 9.

where $v_i, v_{i-1}, \ldots, v_1$ represent the visit embeddings in reverse order and $\odot$ represents element-wise multiplication. In the fifth step, the context vector $c_j \in \mathbb{R}^n$ is used to predict the true label $y_j \in \{0, 1\}$, as given by Equation 10.
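How two-level attention combines visit embeddings into a context vector can be sketched as follows. The sketch assumes that the context vector is the sum over visits of a scalar visit-level weight (alpha) times a variable-level weight vector (beta) multiplied element-wise with the visit embedding, which is how RETAIN is commonly described; the attention scores and embeddings are toy values, and the RNNs that would produce them are omitted.

```python
import math


def softmax(scores):
    """Normalize visit-level scores into attention weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(visits, alpha, beta):
    """c = sum_i alpha_i * (beta_i element-wise-times v_i)."""
    size = len(visits[0])
    context = [0.0] * size
    for v_i, a_i, b_i in zip(visits, alpha, beta):
        for k in range(size):
            context[k] += a_i * b_i[k] * v_i[k]
    return context


# Toy inputs: two visits with 2-dimensional embeddings (hypothetical values).
toy_visit_embeddings = [[1.0, 2.0], [3.0, 4.0]]
toy_alpha = [0.25, 0.75]                 # visit-level attention
toy_beta = [[1.0, 0.5], [0.0, 1.0]]      # variable-level attention
toy_context = context_vector(toy_visit_embeddings, toy_alpha, toy_beta)
```

Because each term is an explicit product of an attention weight and an input feature, the contribution of any single visit or feature to the context vector can be read off directly, which is the source of the interpretability discussed above.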
Le, Tran, and Svetha [14] proposed DMNC, which uses a memory-augmented neural network (MANN) to address the problem of long-term dependencies and asynchronous interactions. Here, three neural controllers and two external memories are employed, resulting in a dual-memory neural computer. To model the intra-view interactions, each view has its own controller and memory. The controller is responsible for reading input events, updating the memory, reading vectors from memory at each timestamp, and generating output considering its current hidden state. The inter-view interactions are modeled with two memory modes, namely late-fusion and early-fusion. In the late-fusion mode, no information is exchanged between the two memories during the encoding process, as the memory space of each view is kept independent and separate; in the decoding process, the read values of the memories are used to generate inter-view knowledge. In the early-fusion mode, unlike late-fusion, the views share the addressing space of the memory to ensure information sharing. This asynchronous sharing is achieved by temporarily holding the write values of each time step in a cache so that information from different time steps can be written to the memories simultaneously. The decoding process employs a write-protected mechanism on the memory to improve inference efficiency. Each encoder employs an LSTM to convert embedding vectors into h-dimensional vectors. Although DMNC uses attention-based DNC blocks, which enable it to recognize the interactions between sequences, it ignores the medications prescribed during previous visits [11]. Similarly, AMANet [34] ignores previously prescribed medications. However, it captures the intra- and inter-correlations of heterogeneous sequences using multiple attention networks, which helps it achieve relatively better performance.
Some models treat drugs as mutually independent, ignoring their latent DDIs. In contrast, DPR [15] considers the interaction effects among drugs, which can be affected by the conditions of the patient, in recommending drug packages. More specifically, a pre-training method is applied that uses collaborative filtering to obtain the initial embeddings of drugs and patients. A DDI graph is then constructed from domain knowledge and medical records. The drug package recommendation (DPR) framework comes in two variants, one using a weighted graph (DPR-WG) and one using an attributed graph (DPR-AG), in which each interaction is described by assigned weights or attribute vectors, respectively.
In embedding the package, a mask layer captures the impact of the patient's condition, and graph neural networks (GNNs) perform the final graph induction. During pre-training, an MLP and a char-LSTM [75] learn the disease document and admission note, respectively. DPR [15] outperforms AMANet [34] as the latter is unable to capture evolution information, including disease progression, via temporal sequence learning networks, which remains a significant information source for decision-making. Similarly, MeSIN [11] addresses the complexity of EHR data, which contain a large number of patient records, visits, and sequential laboratory results, by introducing an interactive and multi-level selective network to recommend medications. An interactive LSTM reinforces the interactions among multi-level medical sequences in EHR data by employing an enhanced input gate and a calibrated memory-augmented cell. An attentional selective module assigns flexible attention scores to the various medical code representations on the basis of their relatedness to the suggested medications in each admission. Finally, a global selective fusion module incorporates the embeddings of information from multiple sources into the patient representations for recommending medication.
A patient's health representation is a compact and indicative vector that represents the patient's status, defined by diagnosis and procedure information, to enable doctors to recommend medications [50]. In this regard, MICRON [50] learns the sequential data locally from two consecutive visits, i.e., the $(t-1)$-th and the $t$-th, and propagates them visit by visit to keep the longitudinal information of the patient. Given the health representations $h^{(t-1)}$ and $h^{(t)}$, the model learns a prescription network $\mathrm{NET}_{med}$ from the hidden embedding space and applies it to the two visits separately to recommend medications. Formally,

$$m^{(t-1)} = \mathrm{NET}_{med}\big(h^{(t-1)}\big), \qquad m^{(t)} = \mathrm{NET}_{med}\big(h^{(t)}\big),$$

where $m^{(t-1)}$ and $m^{(t)}$ are the medication representations, in which each entry is a real value for the corresponding medication. Here, a fully connected neural network implements $\mathrm{NET}_{med}$. The difference $r^{(t)} = h^{(t)} - h^{(t-1)}$ is called the residual health representation; it encodes the changes in clinical health measurements and thus indicates an update in the health condition of the patient. This health update $r^{(t)}$ causes an update $u^{(t)}$ in the resulting medication representation. The authors reasoned that if $\mathrm{NET}_{med}$ can map a complete $h^{(t)}$ into a complete $m^{(t)}$, then $r^{(t)}$ should also be mapped into an update in the same representation space through $\mathrm{NET}_{med}$. In other words, $r^{(t)}$ and $u^{(t)}$ follow the same $\mathrm{NET}_{med}$:

$$u^{(t)} = \mathrm{NET}_{med}\big(r^{(t)}\big). \tag{13}$$

According to the authors, the mappings for the individual visits can be learned using the medication combinations in the dataset as supervision; however, formulating direct supervision for the update mapping in Equation 13 is challenging. Therefore, they proposed modeling the addition and the removal of medication sets separately and considered reconstructing $u^{(t)}$ from the medication representations of the two consecutive visits. MICRON differs from existing MR models such as GAMENet [21] and RETAIN [10] in that it learns sequential information locally, whereas the latter use global sequential patterns learned by RNNs.
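The residual idea can be illustrated with a small sketch. It assumes, purely for illustration, that NET_med is a bias-free linear map (the survey only states that it is a fully connected network) and that the residual is taken as the current health representation minus the previous one. Under these assumptions, mapping the residual gives exactly the difference between the two medication representations. The weights, the vectors, and the helper name `net_med` are hypothetical.

```python
def net_med(weights, h):
    """Bias-free linear stand-in for NET_med (an assumption for illustration)."""
    return [sum(w * x for w, x in zip(row, h)) for row in weights]

# Hypothetical weights: 3 medications, 2-dimensional health representation.
W = [[0.5, -1.0],
     [1.5, 0.25],
     [-0.75, 2.0]]

h_prev = [1.0, 2.0]   # health representation at visit t-1
h_curr = [2.0, 1.0]   # health representation at visit t

# Residual health representation: current minus previous (assumed sign convention).
r = [c - p for c, p in zip(h_curr, h_prev)]

m_prev = net_med(W, h_prev)   # medication representation at visit t-1
m_curr = net_med(W, h_curr)   # medication representation at visit t
u = net_med(W, r)             # medication update caused by the health change

# For a linear map, previous representation + update reproduces the current one.
reconstructed = [p + d for p, d in zip(m_prev, u)]
```

With a non-linear NET_med this identity holds only approximately, which is one way to read why direct supervision of the update is described as challenging.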

ConCare [22] captures the interdependencies among features using a self-attention mechanism [76], where fixed positional encoding is used to offer relative position information for timestamps [77]. It embeds the time series of each feature separately by employing a multi-channel GRU, using Equation 14:

$$h_{n,1}, \ldots, h_{n,T} = \mathrm{GRU}_n\big(r_{n,1}, \ldots, r_{n,T}\big), \tag{14}$$

where the time series of feature $n$ is represented as $r_{n,:} = (r_{n,1}, \ldots, r_{n,T})$.
The hidden representation is summarized for the whole time span. Time-aware attention is employed to capture the impact of time intervals in each sequence. An attention function maps a query and a set of key-value pairs to an output [76]. The hidden representations produce the query vector and the key vectors, where the former is produced at the last time step $T$; these are described by Equation 15 and Equation 16. The alignment model quantifies the contribution of each hidden representation to the densely summarized representation of each feature. Here, $\Delta t$ is the time interval to the latest record, $\sigma$ represents the sigmoid function, and $\beta_n$ is a feature-specific learnable parameter that controls the impact of the time interval on the corresponding feature. The attention weight $\alpha_{n,T}$ decays significantly if:
• $\Delta t$ is long, meaning that the value was recorded a long time ago. A feature's most recent value, i.e., $\Delta t = 0$, decays only slightly, since $\log(e) = 1$.
• The time-decay ratio $\beta_n$ is high, meaning that only recently recorded values of a particular feature matter. If the influence of a clinical feature persists, i.e., $\beta_n$ is small, it is decayed only slightly.
• The historical record has no active response to the current health condition.
The inter-dependencies among dynamic features are captured using visits and the static baseline data, whereas self-attention enables further re-encoding of the feature embedding under personal context. While processing a feature, ConCare attempts a better encoding by looking at other features for clues. In addition, it employs a multi-head mechanism to improve the attention layer with multiple representation subspaces. The self-attention heads are expected to capture dependencies from different aspects. In practice, however, they may tend to learn similar dependencies [76]; therefore, non-redundant or diverse representations [78,79] are encouraged by minimizing the cross-covariance of hidden activations across different heads. Following [78], a cross-head decorrelation module is employed to enable the model to focus on different features.
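The decay behavior listed above can be sketched numerically. The snippet below is a minimal illustration and not ConCare's implementation: it assumes an alignment score of the form tanh(q·k / (β · log(e + (1 − σ(q·k)) · Δt))) followed by a softmax, which is consistent with the log(e) = 1 remark for Δt = 0 but is an assumed form. The function name `time_aware_weights` and all input values are hypothetical.

```python
import math

def time_aware_weights(q, keys, deltas, beta):
    """Softmax attention over records, damped by elapsed time.

    Assumed score: tanh(q.k / (beta * log(e + (1 - sigmoid(q.k)) * dt))).
    """
    scores = []
    for k, dt in zip(keys, deltas):
        qk = sum(a * b for a, b in zip(q, k))
        sig = 1.0 / (1.0 + math.exp(-qk))
        # Equals beta when dt == 0, because log(e) == 1.
        decay = beta * math.log(math.e + (1.0 - sig) * dt)
        scores.append(math.tanh(qk / decay))
    z = sum(math.exp(s) for s in scores)
    return [math.exp(s) / z for s in scores]

# Three records with identical keys; only the time gap differs.
q = [1.0, 0.5]
keys = [[1.0, 0.5], [1.0, 0.5], [1.0, 0.5]]
deltas = [48.0, 12.0, 0.0]   # hypothetical hours since the latest record
weights = time_aware_weights(q, keys, deltas, beta=1.0)
```

Because the keys are identical, any difference between the weights comes from the time gap alone: the oldest record receives the smallest weight and the latest record the largest.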
The RETAIN model [10] uses two RNNs to learn time and feature attention and combines the weighted visit embeddings for prediction. However, it lacks advanced feature extraction and has limited prediction accuracy [80,81]. In this direction, Lee et al. [82] proposed a medical contextual attention-based RNN that uses the individual information derived from conditional variational auto-encoders. However, these studies could not explore the inter-dependencies among dynamic records and static baseline data from a global view. ConCare, on the other hand, adaptively captures the relations among clinical features to produce personalized recommendations for patients in diverse health contexts. It performs better than positional encoding-based methods such as SAnD [77] and Transformer-Encoder, the attention-based RETAIN [10], and time-aware approaches such as T-LSTM [74], showing that considering each feature's time-decay impact separately in a global view is better than decaying the hidden memory of all visits directly. Overall, a large share of the surveyed models use RNNs and their variants [11,45,24,10,34,39,14,38,30,9,12,26,33,3,15,47,48].

Convolutional neural network.
A convolutional neural network (CNN) [83] is a DL-based model that produces efficient results with little pre-processing and less memory for training than RNNs. A CNN structure has several layers, including input, convolutional, sub-sampling, fully connected, and output layers, which respectively receive input data, perform convolution, perform pooling, learn non-linear combinations among features, and produce the final predictions. A CNN model creates a feature map, which is implemented as a non-linear function and computed using Equation 19:

$$c_i = f\big(h * x_{i:i+l-1} + b\big), \tag{19}$$

where $*$ represents the convolution operator and $f$ is a non-linear function. Here, a sentence of size $n$ has the word embedding matrix $x_{1:n}$, to which a filter $h$ is applied, where $l$ ($l \le n$) is the window length of the filter and $b \in \mathbb{R}$ is a bias. Sub-sampling reduces the size of the layer, and the execution cost decreases accordingly. Similar operations are carried out repeatedly on various layers to find useful features, which enables the CNN to work as a classifier. The second-to-last layer computes the probability of each class for the item being classified. The last layer produces the final classification results [53] using the softmax function. Different objective functions, including cross-entropy, are employed.
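As a concrete illustration of the feature map and the sub-sampling step, the sketch below slides a filter of window length l over a one-dimensional input and then applies max pooling. It is a simplification under stated assumptions: the inputs are scalars rather than word embedding vectors, the non-linearity is taken to be ReLU, and the sliding product is the cross-correlation form commonly used by DL frameworks. The function names and all values are hypothetical.

```python
def relu(v):
    """Non-linearity applied to each window response (assumed to be ReLU)."""
    return max(0.0, v)

def conv1d_feature_map(x, h, b):
    """c_i = relu(sum_j h[j] * x[i + j] + b) for every window of length len(h)."""
    l = len(h)
    return [relu(sum(h[j] * x[i + j] for j in range(l)) + b)
            for i in range(len(x) - l + 1)]

x = [1.0, 2.0, -1.0, 0.0, 3.0]   # toy input of size n = 5
h = [1.0, 0.0, -1.0]             # filter with window length l = 3
feature_map = conv1d_feature_map(x, h, b=0.5)

# Max pooling (sub-sampling) keeps the strongest response of the filter.
pooled = max(feature_map)
```

The feature map has n − l + 1 entries, and pooling collapses it to a single value per filter, which is why the cost of later layers drops as the layer size shrinks.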
SD_CNN [42] uses the CNN [83] framework to learn patient similarity [84]. The framework maps patient A's one-hot feature matrix via the embedding layer to a low-dimensional sparse matrix. Convolution and maximum pooling are applied to each of these matrices, and the resulting feature vectors are aggregated into a composite vector. The same embedding and CNN parameters are applied to patient B. Through the matching-matrix and conversion layers, the composite vectors of the two patients yield a similarity feature vector, which is used to obtain their similarity probability via the softmax layer. On the other hand, GAMENet [21] combines a DDI KG with a memory module implemented as a GCN, using the longitudinal records of the patient as the query in recommending medications.
The framework of TAHDNet [13] comprises three blocks, namely a 1D-CNN, a transformer, and a time-aware block. The model uses the 1D-CNN for local dependency, the transformer for global dependency, and the time-aware block for dynamic time-aware attention to learn hierarchical dependencies on longitudinal EHR data (where each record is represented as a multivariate sequence). A new representation for each patient is produced by concatenating the outputs of these blocks, which is then fed to the prediction layer for recommending medication. The model uses a DDI loss for co-determining the final recommendation. It adapts the transformer structure and, following G-BERT [25], uses a pre-trained transformer-based module to model the global dependency over the whole patient record. Each patient's input data is represented by $E = (e_1, e_2, \ldots, e_r)$. A pre-trained transformer is then used to learn the interactions among medical ontologies as $h_T = \mathrm{Transformer}(e_1, e_2, \ldots, e_r)$. For the local block, the representation is the output of the 1D-CNN's hidden layer, where $h$ denotes its hidden size.
TAHDNet avoids internal covariate shift by introducing layer normalization into the 1D-CNN:

$$\mathrm{LN}(x) = \alpha \odot \frac{x - \mu}{\sqrt{\sigma^{2}}} + \beta,$$

where $\mu$ is a layer's mean value, $\sigma^2$ is its variance, and $\alpha$ and $\beta$ are the parameter vectors for scaling and translation, respectively. In the time-aware block, TAHDNet introduces a fused decay function to consider periodic and monotonic decay and then, using the transformer's self-attention mechanism [76], computes the attention weights and produces the latent-space representation of time intervals. Table 2 reports that only three models [42,13,84] have adopted CNNs.

Generative adversarial networks. Generative adversarial networks (GANs) adopt an unsupervised learning approach that automatically discovers and learns the patterns or regularities in the data so that the model can generate new examples that could plausibly have been drawn from the original data [85]. A generative model is trained by employing two sub-models, a generator and a discriminator. The former generates new samples and the latter classifies them as either real (i.e., from the domain) or fake (i.e., generated). They are trained in an adversarial manner until the discriminator is fooled about half the time, which means that the generator is producing plausible samples [53].

To this end, the ARMR [9] model uses two GRU networks [71] to build an encoder that exploits patient diagnoses and procedures to generate robust patient representations. It then uses a key-value memory network [86] to keep historical representations and the associated medications as pairs and performs multi-hop reading on the memory network to obtain case-based similar information from historical EHRs, which is used to update the patient's embedding. It combines the encoder and the memory network [86] to build the Medication Recommendation (MedRec) module. The model forms a GAN by pairing the encoder, acting as the generator, with a discriminator, and treats the representations of patients with DDI rates below a preset threshold as real data, so that the GAN shapes the distribution of patient representations generated by the encoder to reduce DDI. MedRec and the GAN are trained jointly within each mini-batch with two objectives: a traditional error criterion for medication recommendation and an adversarial training criterion that regularizes the distribution. In this way, ARMR simultaneously learns meaningful patient representations and regulates the data distribution to maintain a low DDI rate. Meanwhile, ARMR fits a Gaussian distribution to $q_T$, which provides the real data for the GAN, while the encoder is responsible for generating the fake data. During regularization, the GAN first updates the discriminator to distinguish real data $p(z)$ from fake data $q_T^{f}$ and then updates the generator to confuse the discriminator; the cost function for regularizing the GAN is defined by Equation 21 [85].

Here, $D$ and $G$ denote the discriminator and generator networks, respectively. Experiments show that ARMR achieves improved results in terms of DDI rate and medication prediction compared with competitive baselines, namely LEAP, DMNC, RETAIN, GAMENet, and MedRec, because it regulates the distribution of the patient representations.
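For reference, the adversarial regularization described above follows the usual minimax pattern of GANs [85]. The block below writes the standard GAN objective in the notation of this discussion; it is a sketch under the assumption that samples $z \sim p(z)$ play the role of real data and that the generator $G$ (the encoder) maps patient inputs $x$ to representations. The exact form used by ARMR may differ, and the symbols $x$ and $p(x)$ are introduced here only for illustration.

```latex
\min_{G}\,\max_{D}\;
  \mathbb{E}_{z \sim p(z)}\big[\log D(z)\big]
  + \mathbb{E}_{x \sim p(x)}\big[\log\big(1 - D(G(x))\big)\big]
```

The discriminator maximizes this value by scoring real samples high and generated ones low, while the generator minimizes it by producing representations that the discriminator cannot tell apart from the real data.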
To deal with the potentially fatal side effects of DDIs, SARMR [12] processes raw EHRs to obtain the probability distributions of patient representations related to safe combinations of medication in the feature space. It then adversarially regularizes these distributions to reduce DDI rates by using this knowledge as real data. The model treats and regularizes patients with different DDI rates as different cohorts; in this way, it avoids the adverse impact on generalization caused by treating them as a single cohort. In contrast to SARMR, the RNN-based baselines, including LEAP, RETAIN, and DMNC, are limited in capturing the factors that most strongly affect the patient's health state. GAMENet uses additional DDI knowledge as a memory component to alleviate DDI; however, its reasoning capability over interactions between patients and doctors is limited, which results in lower Jaccard and F-score figures. Finally, the statistics of the examined works show that this area still needs further research, as very few models [24,9,12] use GANs for medication recommendation.
Attention networks and transformer-based models. Attention networks are very popular among researchers [87,88] as they produce robust recommendations by paying more attention to the salient information [89,90]. They have been successful in producing interpretable and explainable medication recommendations [91]. To this end, RETAIN [10] employs the attention mechanism and GRU [71] to leverage sequence information and improve prediction interpretability. In particular, it relies on an attention mechanism modeled on the behavior of physicians during an encounter. To mimic this behavior, RETAIN analyzes a patient's past visits in reverse time order, which enables more stable attention generation. Consequently, RETAIN determines the most significant visits and quantifies the visit-specific features that contribute to medication predictions. Most of the existing models, namely PREMIER [24], GAMENet [21], and SRL-RNN [30], use the longitudinal EHRs of the few patients having multiple visits but ignore the many patients with a single visit, which leads to selection bias. In addition, hierarchical knowledge such as the hierarchy of diagnoses, which is important from the recommendation perspective, is not considered in representation learning. G-BERT [25] addresses these issues by employing a graph attention network [65] to represent the hierarchical structures of medical codes using ontology embedding. It uses BERT [76] to pre-train on each visit in the EHR so that even EHR data with a single hospital visit is considered. It then fine-tunes the pre-trained visit encoder and representation for downstream predictions on the longitudinal EHRs of patients having multiple visits. A visit at time $t$ is the combination of its medical diagnosis codes and medication codes.

In this direction, COGNet [5] recommends a combination of medications considering the current health conditions of the patient via an encoder-decoder generation network. The encoder contains two transformer-based networks [76], which use a multi-head self-attention mechanism to encode the diagnosis and procedure information, and two graph convolutional encoders [63] to model the relations between medications. The copy module compares the current health conditions with those of previous visits in order to copy reusable medications when prescribing drugs for the current visit, taking changes in the health condition into account. A hierarchical selection mechanism combines the visit-level and medication-level scores to compute the copy probability for each medication. With the copy module, COGNet outperforms its counterparts, including LEAP, RETAIN, DMNC, GAMENet, MICRON, and SafeDrug, because, in clinical practice, the recommendations for the same patient are closely related. In contrast to COGNet, these baseline models ignore the historical visit information of the patient. Moreover, they consider no relationship between the medication recommendations of the same patient and are unable to capture long-range visit dependency. Finally, there is a positive trend towards BERT-based and attention networks, which have been adopted by ten models [11,42,10,34,22,25,26,5,47,48] in recent years.
Hybrid and other networks. A hybrid network integrates two or more DL methods to capture their inherent benefits and alleviate their potential limitations in producing robust medication recommendations. For example, an unavoidable challenge is handling the difficulty in learning the inter-view interactions due to the unaligned nature of multiple sequences. This is addressed by a hybrid model, AMANet [34] that integrates memory network [92] and attention by employing three main components. These include a neural controller that uses self-attention to capture the intra-view interactions by encoding the input sequence.
The inter-view interaction is learned by an inter-attention mechanism. To connect the positions of a single sequence, a self-attention (intra-attention) mechanism is used, which captures the relationships between different elements of the same sequence. The inter-attention, in turn, connects positions in two sequences. Specifically, in the inter-attention, one input embedding is projected to the query, and the other is projected to the key and value. The encoding vector of the sequence is then produced by concatenating the inter-attention and self-attention vectors. The history attention memory keeps the previous encoding vectors of the same object. The dynamic external memory stores the common knowledge about the data and is shared by all training objects. The predictions are generated by concatenating the encoding vector, read vector, and historical attention vector. However, AMANet is unable to fully exploit evolution information, including disease progression, through temporal sequence learning networks, which, if exploited, could lead to more robust recommendations [11].
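The difference between intra-attention and inter-attention can be sketched with plain scaled dot-product attention. This is a minimal illustration and not AMANet's implementation: the learned query, key, and value projections are omitted, and the two toy sequences, the function names, and the interpretation of the sequences as diagnosis and procedure embeddings are assumptions made for the example.

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]

def attention(queries, keys, values):
    """Scaled dot-product attention; each query attends over all keys."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

seq_a = [[1.0, 0.0], [0.0, 1.0]]                # e.g. diagnosis embeddings
seq_b = [[1.0, 1.0], [0.5, -0.5], [0.0, 2.0]]   # e.g. procedure embeddings

intra = attention(seq_a, seq_a, seq_a)   # query, key, value from the same sequence
inter = attention(seq_a, seq_b, seq_b)   # query from seq_a, key and value from seq_b

# Encoding per position: concatenate the intra- and inter-attention vectors.
encoding = [s + c for s, c in zip(intra, inter)]
```

The only change between the two calls is where the keys and values come from, which is what lets the inter-attention relate positions across two unaligned sequences of different lengths.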
The ARMR [9] model proposes an encoder with two GRU networks [73] that exploits diagnoses and procedures to produce patient representations. The model stores historical representations and the associated medications in a key-value memory network [93] and reads it via multi-hop reading to extract case-based similar data from historical EHRs, which are used to update the patient representations. This results in a medication recommendation (MedRec) module that comprises the encoder and the memory network. The model integrates the encoder, as a generator, with a discriminator to form a GAN [85]. The GAN reduces DDI by using patient representations with DDI rates below a preset threshold as real data to shape the distribution of patient representations produced by the encoder. MedRec and the GAN are trained jointly within each mini-batch with a traditional error criterion for recommending medications and an adversarial training criterion for regulating the distribution. This strategy allows the model to learn meaningful patient representations and maintain a low DDI rate at the same time, which leads to quality medication recommendations.
Avoiding fatal DDIs is among the prominent challenges in recommending medications. This issue is addressed by the SARMR model [12], which processes raw EHRs to obtain the probability distributions of patient representations for safe medication combinations. It reduces DDI by adversarially regularizing the distributions of patient representations using this knowledge as real data. It treats and regularizes patients with varying DDI rates as distinct cohorts to avoid the negative effects on generalization that may occur if they are treated as a single cohort. First, it models the interactions between patients and physicians by encoding EHRs with GRUs [73] and constructs a key-value memory neural network [93] with keys denoting admissions and values denoting the corresponding medications. Second, it uses the representation of the most recent admission as a query to carry out multi-hop reading on the MemNN [93], with a GCN [63] module embedding the read results. The medications are recommended considering the updated query. Next, it uses the records of all patients, regardless of their DDI rates, to recommend medications and regularizes the distribution adversarially with a GAN [85] on the basis of the representations obtained in the first step, to achieve both reduced DDI and effective medication combinations. The final results are predicted as in Equation 22,
where $q_T$ is the patient representation, $v_M$ is the multi-hop reading result, $i$ is the medication with weighted embeddings, $g(\cdot)$ is a fully connected layer, and $\sigma(\cdot)$ is the sigmoid function.
To consider the consecutive correlation in the dynamic prescription history and understand irregular time-series dependencies, MERITS [27] employs neural ordinary differential equations (Neural ODE) so that the continuous inner process can be better modeled. It employs an encoder-decoder architecture to predict the next medication sequence and combines static and dynamic features using self-attention. At the same time, it embeds and uses knowledge about drugs and the experience of doctors by exploiting three graphs, namely sequential, DDI, and co-occurrence graphs, to represent drug sequential relationships, conflicts, and co-occurrences. The encoder has three modules: a medical embedding module that employs a self-attention module [76] and an RNN to capture sequential information; a dynamic encoding module that models irregular time-series data at a specific timestamp using Neural ODE; and a patient aggregation module that uses a simple linear map to model the patient's state by aggregating the sequential medications and the static as well as dynamic features. The encoder produces a representation of the patient at the current timestamp by extracting medication strategies and patient status from irregularly sampled time-series data. The decoder employs a medication generator and a graph attention module. It recommends medications at timestamp $t + 1$ using the patient representation and the graphs that establish the relationships between drugs in the medication history.
The TAHDNet model [13] captures the dependence information between medications and patients at local and global levels by adopting hierarchical learning. Figure 7 presents its architecture, consisting of transformer, time-aware, and 1D-CNN blocks. It employs a 1D-CNN [83] to learn the patient's local representation and uses adapted transformer-based learning [25] to learn the patient's global representation via a self-supervised pre-training process. It models disease progression by employing a fused temporal decay function with monotonic and periodic decay for dynamic time-aware attention, which leads to a more realistic evaluation of disease progression. The model outperforms several baseline models, including LEAP [3], RETAIN [10], G-BERT [25], and GAMENet [21]. Here, LEAP, which is instance-based, performed worse than the temporal method RETAIN, which underscores the importance of temporal data in EHRs. G-BERT performed comparatively well and outperformed GAMENet due to learning additional information about DDI and procedure codes. This indicates that transformer-based models are more effective for recommending medications. Yet, G-BERT considers no temporal information and thus is unable to learn disease progression, which is one of the main causes of its sub-optimal performance. TAHDNet gives better results due to its capability of extracting as many details as possible from EHRs while reducing noise.
Recommending medications is a time-consuming process for experienced medical practitioners and an error-prone one for inexperienced practitioners, especially in complicated cases. The COGNet model [5] addresses this issue by employing an encoder-decoder generation network to recommend suitable medications in a sequential manner. It represents the patient's historical health conditions by encoding all medical codes from previous visits in the encoder network. It represents the patient's current health condition by encoding the diagnosis and procedure codes of the $t$-th visit. It employs a decoder to generate the medication codes of the $t$-th visit one by one, which represent the drug combination suggested for the patient's current visit. During each decoding step, the decoder collects information from procedures, diagnoses, and medications to suggest the next medication. If the diseases of the current visit are consistent with previous visits, the copy module copies the associated medications directly from the historical medicine combinations. In other words, the copy module extends the basic model by comparing the health conditions of historical and current visits and then copying the reusable medications to write prescriptions for the current visit based on condition changes. The diagnosis and procedure encoders are transformer-based networks [76] with different parameters.

The sets of a patient's symptoms and medications define the input to the medication recommender; however, this input still lacks sufficient details that relate these two entities. MedRec [36] addresses this issue by including knowledge about medicines and their attribute graphs in its model to connect medications with symptoms. A medical KG of symptoms and medications is created, which results in their richer representation. This KG holds four key node types: physical examination, symptom, disease, and medicine. An edge connects two related nodes. For example, a disease has certain symptoms and requires specific medications; all three are connected by different edges. The attribute graph models the interrelationships among medicines. If two medicines belong to the same category or have the same sub-molecular structure, then they are related. In recommending medications, MedRec first applies a multi-relational GCN [63] to learn the embeddings of entities and relations and uses the objective function of the link-prediction task to optimize the model. The embeddings of medicines and symptoms are produced in the same way. It fuses the attention mechanism with the embedding of each symptom to produce a syndrome representation. MedRec employs a GCN [63] to obtain the embedding of the attribute graph, which is used in combination with the medical KG to produce the overall representation of a medicine. Finally, it produces the prediction scores by learning the interaction of medicine and syndrome (see Figure 8). The score $score(sc, M)$ characterizes the ranking score in recommending medicines. Given the symptom set $sc$, the ground-truth medicine set is represented as a multi-hot vector $mc$ of dimension $|M|$, and $score(sc, M)$ is the output probability vector over all medicines; the mean square loss between $score(sc, M)$ and $mc$ is computed using Equation 24.
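The loss described above can be illustrated with a toy example. The sketch below builds a multi-hot ground-truth vector over a hypothetical vocabulary of four medicines and computes the mean square error against a vector of predicted scores. Averaging over the number of medicines is an assumption about the normalization, and the helper names and score values are invented for the example.

```python
def multi_hot(indices, size):
    """Multi-hot vector: 1.0 at each prescribed medicine index, 0.0 elsewhere."""
    v = [0.0] * size
    for i in indices:
        v[i] = 1.0
    return v

def mse(scores, target):
    """Mean square error, averaged over the medicine vocabulary (assumed)."""
    return sum((s - t) ** 2 for s, t in zip(scores, target)) / len(target)

num_medicines = 4                       # |M| in the text
scores = [0.75, 0.25, 0.5, 0.0]         # hypothetical score(sc, M)
mc = multi_hot({0, 2}, num_medicines)   # ground truth: medicines 0 and 2
loss = mse(scores, mc)
```

Medicines that are prescribed but scored low, and medicines that are not prescribed but scored high, both increase the loss, so minimizing it pushes the score vector toward the multi-hot target.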
Generally, medicine recommenders consider drugs as individual items and thus neglect the unique requirements of recommending drugs as a set of items while keeping DDIs as low as possible. This issue is addressed by 4SDrug [28], which recommends medications by performing set-to-set comparison, designing set-oriented representations and similarity measurements for both medicines and symptoms. It takes the set of medicines $D_i$ and the set of symptoms $S_i$ as inputs and employs three modules in recommending medicines for a set of symptoms. The set-to-set comparison module employs a representation $h_{S_i}$ for the $i$-th symptom set and a corresponding representation for the drug set. The drug set module recommends sets of medicines using intersection-based set augmentation and a hybrid DDI penalty mechanism to ensure the principle of a small and safe drug set. Figure 9 illustrates an example of this recommendation: two patients, Jack and Lisa, share similar symptoms, such as fever, cough, chills, and headache, and thus most likely have the same disease, i.e., viral influenza. Therefore, they will be recommended the same medications, such as Ibuprofen, Ambroxol, and Oseltamivir. Thus, the physical status of a patient can be judged from the symptoms without disclosing any personal data [94,95], and symptom-based medication recommenders can be widely adopted in drug prescription to avoid privacy issues. The sets of symptoms $S^{(j)}$ and medicines $D^{(j)}$ are each represented by a set-level embedding, and the model uses Equation 26 to optimize the objective function,
where $D^{(j)}$ are the medicines used in the treatment of the symptoms $S^{(j)}$.

Figure 9. A toy instance of the symptom-based set-to-set medicine recommendation.
The experimental results indicate that 4SDrug outperforms other competitors, including GAMENet and LEAP. It outperforms GAMENet because the latter does not consider the number of recommended drugs and yields an undesirable DDI rate, consistent with the results reported in [33]. In addition, 4SDrug has lower space and time complexity because it requires a comparatively simpler neural architecture and is compatible with efficient mini-batch training. GAMENet [21] requires more space due to its large memory bank, whereas LEAP [3] is computationally complex due to sequential modeling and recommending medications one by one. Considering all these factors, 4SDrug is more suitable for real-world industrial applications as it is more efficient and adaptable.

Optimization Methods
A DL model uses its learning algorithm to generalize from the data so that it can make predictions on unseen data. It is therefore necessary to find an algorithm that not only makes such predictions but also optimizes the results. By optimization, we mean finding the values of the parameters or weights that reduce the error and enhance model accuracy while mapping inputs to outputs. Such optimization accelerates training and helps improve performance while learning from data. However, finding the optimal weights for a DL model is challenging due to the millions of parameters within it. Therefore, choosing an appropriate optimization algorithm is key to success [96]. This section discusses the optimization algorithms most widely used with DL models for recommending medications.

Gradient descent.
The gradient descent is an iterative first-order algorithm that attempts to find a local minimum/maximum for a given function [97].
Stochastic gradient descent. Stochastic gradient descent extends gradient descent by reducing its computational cost: instead of computing the gradient over the entire training set at every step, it computes the gradient for one training sample (or a small mini-batch) at a time [96].

Momentum.
A gradient descent algorithm finds it challenging to navigate ravines, i.e., areas where the surface curves much more steeply in one dimension than in another, which are common around local optima. In such areas, stochastic gradient descent oscillates across the slopes of the ravine while making only hesitant progress toward the local optimum. Momentum extends gradient descent by accelerating stochastic gradient descent in the relevant direction and keeping the oscillations caused by noisy gradients to a minimum [97,96].
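The effect of momentum on a ravine can be illustrated with a minimal sketch. The quadratic toy function, the starting point, the learning rate, and the momentum coefficient `beta` below are illustrative assumptions and are not taken from any of the surveyed models.

```python
def grad(w):
    """Gradient of the toy ravine f(w) = 0.5 * (w0**2 + 50 * w1**2)."""
    return [w[0], 50.0 * w[1]]


def loss(w):
    """Value of the toy ravine; it is steep along w1 and shallow along w0."""
    return 0.5 * (w[0] ** 2 + 50.0 * w[1] ** 2)


def run(steps, lr, beta):
    """Gradient descent with classical momentum; beta = 0 gives plain gradient descent."""
    w = [10.0, 1.0]
    v = [0.0, 0.0]
    for _ in range(steps):
        g = grad(w)
        # The velocity accumulates past gradients, damping sign-flipping components.
        v = [beta * vi + gi for vi, gi in zip(v, g)]
        w = [wi - lr * vi for wi, vi in zip(w, v)]
    return w


plain = run(steps=100, lr=0.03, beta=0.0)
with_momentum = run(steps=100, lr=0.03, beta=0.9)
print(loss(plain), loss(with_momentum))
```

With the same learning rate and number of steps, the run with momentum ends at a lower loss in this toy setting because it makes faster progress along the shallow direction.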

RMSProp. Root Mean Square Propagation (RMSProp) is another adaptive learning rate method that tries to improve on AdaGrad [98]. Whereas AdaGrad accumulates the sum of all past squared gradients, RMSProp keeps an exponentially decaying moving average of the squared gradients and divides the learning rate by the square root of this average [99].
Adam. Adam [99,97] combines the advantages of Momentum and RMSProp to compute an adaptive learning rate for each parameter. It stores an exponentially decaying average of past squared gradients, as RMSProp does, and an exponentially decaying average of past gradients, similar to Momentum. Table 3 shows that the majority of the models, i.e., 24 out of 37, used Adam and its variants. A possible reason for the popularity of Adam is its capability to converge faster. Gradient descent and its variants rank second and are employed by 8 models. Only one model used AdaGrad, while the others share no details regarding their optimization method.
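The difference between the adaptive methods discussed above can be sketched for a single scalar parameter. The function names, the `state` dictionary, and the default hyperparameters are assumptions chosen for illustration; they follow the commonly cited textbook forms of the update rules rather than any specific surveyed implementation.

```python
import math


def adagrad_step(w, g, state, lr=0.1, eps=1e-8):
    """AdaGrad: divide by the root of the cumulative sum of squared gradients."""
    state["sum_sq"] = state.get("sum_sq", 0.0) + g * g
    return w - lr * g / (math.sqrt(state["sum_sq"]) + eps)


def rmsprop_step(w, g, state, lr=0.1, rho=0.9, eps=1e-8):
    """RMSProp: divide by the root of an exponentially decaying average instead."""
    state["avg_sq"] = rho * state.get("avg_sq", 0.0) + (1 - rho) * g * g
    return w - lr * g / (math.sqrt(state["avg_sq"]) + eps)


def adam_step(w, g, state, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: momentum-like average of gradients plus RMSProp-like scaling."""
    t = state.get("t", 0) + 1
    m = beta1 * state.get("m", 0.0) + (1 - beta1) * g
    v = beta2 * state.get("v", 0.0) + (1 - beta2) * g * g
    state.update(t=t, m=m, v=v)
    m_hat = m / (1 - beta1 ** t)  # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)


def minimize(step, steps=200, w0=5.0):
    """Minimize the toy objective f(w) = w**2 (gradient 2w) with a given update rule."""
    w, state = w0, {}
    for _ in range(steps):
        w = step(w, 2.0 * w, state)
    return w


for rule in (adagrad_step, rmsprop_step, adam_step):
    print(rule.__name__, minimize(rule))
```

Each rule keeps its own running statistics in `state`, which is why a separate dictionary is created per run.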

Recommendation Types
A drug recommendation can be personalized or non-personalized. In the first case, recommendations are made on the basis of the user profile and personal interests. For instance, a personalized recommender exploits patients' medical history, diagnoses, procedures, symptoms, and the temporal dynamics of their visits to understand their medical status and generate individualized predictions. A non-personalized medication recommender system considers generic features and exploits no additional rich semantics about the patients. Table 2 reports that most of the models adopted a personalized approach.

EVALUATION METHODS
This section gives a brief account of the evaluation methodology (datasets and evaluation metrics) adopted by the MR models in evaluating their experimental results.

Evaluation Metrics
We provide details of the evaluation metrics that are commonly used in the medication recommendation literature.

Recall. assesses an MR model
where $TP_{seen}$ represents the total number of true positives seen up to position k. Generally, 10 is set as the cut-off value for the average precision (AP), i.e., AP@10.
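As a minimal sketch of how recall and average precision at a cut-off k can be computed from a ranked list, consider the following. The function names and the drug names are hypothetical, and normalizing AP by min(|relevant|, k) is one common convention that is assumed here; other normalizations exist.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant medications that appear in the top-k list."""
    hits = sum(1 for med in ranked[:k] if med in relevant)
    return hits / len(relevant) if relevant else 0.0


def average_precision_at_k(ranked, relevant, k=10):
    """Average of precision values taken at each position that holds a true positive."""
    tp_seen = 0  # true positives seen up to the current position
    precision_sum = 0.0
    for position, med in enumerate(ranked[:k], start=1):
        if med in relevant:
            tp_seen += 1
            precision_sum += tp_seen / position
    denom = min(len(relevant), k)  # assumed normalization
    return precision_sum / denom if denom else 0.0


ranked = ["ibuprofen", "aspirin", "oseltamivir", "ambroxol"]
relevant = {"ibuprofen", "oseltamivir", "ambroxol"}
print(recall_at_k(ranked, relevant, k=2))
print(average_precision_at_k(ranked, relevant, k=10))
```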

Normalized discounted cumulative gain. nDCG [100] assesses the position (rank) of the truly relevant medications in the list of top-N recommendations. It adopts graded relevance to assess the effectiveness of an MR model using Equation 29.
where $\mathrm{nDCG}_g$ represents the accumulated normalized gain at rank g, and G is the list of relevant medications in the collection up to position g. To ensure that the most relevant medications appear at the top of the recommendation list, a weighted sum of the relevance degrees of the suggested medications is defined and referred to as the discounted cumulative gain (DCG). This leads to $\mathrm{IDCG}_g$, the DCG of the ideal ordering, which is used to normalize the DCG scores.
Mean reciprocal rank. analyzes an MR model's capability to suggest relevant medications in the list of top k results, and is computed using Equation 30.
$$\mathrm{MRR} = \frac{1}{|Q_T|}\sum_{q \in Q_T}\frac{1}{\mathrm{rank}_q} \tag{30}$$
where $Q_T$ is the testing set and $\mathrm{rank}_q$ is the rank of the first ground-truth medicine for test case q.
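The two rank-based metrics can be sketched as follows. The logarithmic (base 2) position discount is a widely used choice that is assumed here, since the discount is not spelled out in the text; the function names are hypothetical, and the ideal ordering is obtained by sorting the supplied relevance grades.

```python
import math


def dcg_at_g(relevances, g):
    """Discounted cumulative gain of the first g graded relevance scores."""
    return sum(rel / math.log2(pos + 1)
               for pos, rel in enumerate(relevances[:g], start=1))


def ndcg_at_g(relevances, g):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_g(sorted(relevances, reverse=True), g)
    return dcg_at_g(relevances, g) / ideal if ideal > 0 else 0.0


def mean_reciprocal_rank(first_hit_ranks):
    """Mean of 1/rank of the first ground-truth medicine per test case.

    A rank of None means no ground-truth medicine was retrieved and contributes 0.
    """
    total = sum(1.0 / rank for rank in first_hit_ranks if rank)
    return total / len(first_hit_ranks)


print(ndcg_at_g([0, 1], g=2))
print(mean_reciprocal_rank([1, 2, 4]))
```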
Accuracy. measures the quality of medication predictions, i.e., whether the next recommended medicine is guessed correctly or incorrectly [101]. Equation 31 computes it.
$$\mathrm{Accuracy}@n = \frac{\mathrm{TruePositive}@n}{|D_{test}|} \tag{31}$$
where $|D_{test}|$ is the size of the test set and n represents the number of top suggestions against the query medicine.
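F-measure. combines precision and recall through a harmonic mean [102]. Compared with accuracy, it gives a better assessment of the suggested medications and can be calculated using Equation 32.
$$F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{32}$$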
Area under curve. is considered for MR models that formulate recommendation as a classification task.
$$\mathrm{AUC} = \frac{1}{N_p N_n}\sum_{j=1}^{N_p}\sum_{k=1}^{N_n}\mathbb{1}\left(p_j > n_k\right) \tag{33}$$
where $p_j$ denotes the predicted score of the j-th positive sample, while $n_k$ is the predicted score of the k-th negative sample. $N_p$ and $N_n$ represent the total number of positive and negative samples, respectively.
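Using the symbols defined above, AUC can be read as the fraction of positive-negative pairs in which the positive sample receives the higher score. The sketch below assumes this pairwise formulation; counting ties as half a win is an additional convention assumed here, and the function name is hypothetical.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly by the scores."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # assumed tie convention
    return wins / (len(pos_scores) * len(neg_scores))


print(pairwise_auc([0.9, 0.4], [0.5, 0.1]))
```

The double loop costs N_p x N_n comparisons, which is acceptable for illustration; rank-based formulas are normally preferred for large test sets.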
Jaccard similarity. is a common proximity measurement that computes the similarity between two nodes/vectors. It is defined using Equation 34 as the ratio of the intersection of the ground truth $Y_t$ and the predicted result $\hat{Y}_t$ to the union of $Y_t$ and $\hat{Y}_t$, where N is the total number of patients.
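For set-valued predictions, the Jaccard similarity and the F-measure described above can be computed per sample and then averaged. The sketch below assumes that ground truth and prediction are Python sets of drug identifiers and that a simple arithmetic mean over samples is used; the helper names are hypothetical.

```python
def jaccard(true_set, pred_set):
    """Size of the intersection divided by the size of the union."""
    union = true_set | pred_set
    return len(true_set & pred_set) / len(union) if union else 0.0


def f_measure(true_set, pred_set):
    """Harmonic mean of set-based precision and recall."""
    tp = len(true_set & pred_set)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_set)
    recall = tp / len(true_set)
    return 2 * precision * recall / (precision + recall)


def mean_over_samples(metric, truths, preds):
    """Average a per-sample metric over all (ground truth, prediction) pairs."""
    return sum(metric(t, p) for t, p in zip(truths, preds)) / len(truths)


truths = [{"a", "b", "c"}, {"a"}]
preds = [{"b", "c", "d"}, {"a"}]
print(mean_over_samples(jaccard, truths, preds))
print(mean_over_samples(f_measure, truths, preds))
```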

DDI rate. measures the medication safety of a model and is defined as the percentage of medication recommendations that contain DDIs.
where the set counts each medication pair $(c_i, c_j)$ in the recommendation set $\hat{Y}$ if the pair belongs to the edge set $e_d$ of the DDI graph. Here, N is the size of the test dataset and $T_k$ is the number of visits of the k-th patient.
Table 4 reports that the most widely used metrics are F-Score (24 out of 37) and AUC (23 out of 37), indicating a greater interest of researchers in generating accurate medication predictions. These are followed by Jaccard (20 out of 37), showing that a considerable number of MR models treat recommendation as a classification problem. Next come the DDI rate (13 out of 37) and recall (11 out of 37). In addition, the majority of the models adopted a combination of metrics.
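The DDI rate described above can be sketched as a count over recommended drug pairs. The nested-list layout of the recommendations (patients, then visits, then a set of drugs) and the representation of DDI edges as `frozenset` pairs are assumptions made for this illustration, as is the choice to count unordered pairs.

```python
from itertools import combinations


def ddi_rate(recommendations, ddi_edges):
    """Share of recommended drug pairs that are edges of the DDI graph.

    recommendations: list of patients, each a list of visits, each a set of drugs.
    ddi_edges: set of frozenset({drug_a, drug_b}) pairs with a known interaction.
    """
    interacting = 0
    total = 0
    for visits in recommendations:          # k-th patient
        for drug_set in visits:             # t-th visit of that patient
            for a, b in combinations(sorted(drug_set), 2):
                total += 1
                if frozenset((a, b)) in ddi_edges:
                    interacting += 1
    return interacting / total if total else 0.0


ddi_edges = {frozenset(("A", "B"))}
recs = [[{"A", "B", "C"}], [{"A", "C"}, {"B"}]]
print(ddi_rate(recs, ddi_edges))
```

Because the DDI graph is undirected, counting unordered pairs yields the same ratio as counting ordered pairs in both numerator and denominator.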
The classification or ranking accuracy measures are employed to optimize recommendations with the aim of finding the most relevant medications for a patient. Most of the reported MR models use accuracy measures of different types, including coverage and precision measures (recall, precision), rank-based measures (nDCG or MRR), and prediction measures (RMSE). Finally, we noticed that the majority of models (21 out of 37) used three or more evaluation metrics, which suggests that evaluating on several metrics is seen as making the experiments of MR models more robust.

Datasets
Table 5 reports on the most widely used medication recommendation datasets. This section gives a brief overview of these datasets to enable researchers to choose the right dataset for their experiments.
MIMIC-III. The Medical Information Mart for Intensive Care (MIMIC-III) is the richest dataset. Developed by the computational physiology lab of the Massachusetts Institute of Technology (MIT), it provides access to information sources including patients, diagnosis records, clinical events, procedures, medicines, and symptoms. Therefore, the majority of the models, i.e., 24 out of 37, used this dataset [9,21,23,11,24,25,45,13,5,14,28,29,103,41,46].
NELL [104] is the most recently released dataset, which has been used by only one model. This dataset provides access to information sources such as 2, 78, 388 clinical events, and 230 medicines.

ICD-9.
The International Classification of Diseases version 9 (ICD-9) is the official standard set of codes for diagnoses and procedures. It contains 13000 disease codes in tabular form. Each disease has a unique code, and the codes are used in EHRs for billing. Several models utilized ICD-9 based datasets [29,42,44].

Proprietary and non-public datasets. Several studies developed proprietary and non-public datasets to evaluate their MR models. Table 5 reports that six models have used such datasets, making it challenging for researchers to compare the results of these models with those of other models [10,27,15,36,38,39,22]. Other datasets adopted by the explored models include Sutter [3], TCM [36,37], DrugBank [29], IQVIA [90], and PRIVATE [11,24]. Since these datasets give access to limited information sources, they are employed by only a few studies.

COMPARATIVE ANALYSIS OF THE EXPERIMENTAL RESULTS OF THE MODELS
This section compares the experimental results reported by the examined models using different evaluation metrics and datasets. Among the models evaluated on the MIMIC-III dataset (Table 7), the best performance is attained by DMNC [14]. DMNC owes its performance to a new memory-augmented neural network that models the complex interactions between two asynchronous sequential views. It uses two encoders that read from and write to two external memories to encode the input views. The intra-view interactions and the long-term dependencies are captured through these memories during encoding. DMNC [14] supports two modes of memory access: late-fusion and early-fusion, corresponding to late and early inter-view interactions. In the late-fusion mode, the two memories are separate and contain only view-specific contents. In the early-fusion mode, the two memories share the same addressing space, allowing cross-memory access. In both cases, a decoder combines the knowledge from the memories to make predictions over the output space.
The second best performance is attained by the COGNet model [5], which uses an encoder-decoder generation network to recommend suitable medications sequentially. In the encoder network, it represents the patient's historical health conditions by encoding all medical codes from previous visits, and the patient's current health condition by encoding the diagnosis and procedure codes of the current visit. The diagnosis and procedure encoders are transformer-based networks [76] with different parameters. A decoder then generates the medication codes of the visit one by one to form the drug combination suggested for the patient. During each decoding step, the decoder draws on information from procedures, diagnoses, and medications to suggest the next medication. If the diseases in the current visit are consistent with those of previous visits, the copy module copies the associated medications directly from the historical medication combinations.
On this dataset, the third best performer is the PREMIER [24] model. PREMIER is a two-stage recommender system comprising attention-based RNNs to model patient visits and graph networks to model drug co-occurrences in the EHR and known drug interactions. It adapts GAT to incorporate the varying importance of drug interactions and learn effective drug embeddings for the task of medication recommendation. PREMIER also justifies the key reasons for recommending a particular medication by providing the percentage of contributions among the diagnoses, procedures, and previously prescribed medications.
In contrast, the MERITS [27] model produces superior results on the non-public dataset compared to other models in terms of precision, recall, F-score, and AUC. This is credited to its use of neural ordinary differential equations (Neural ODE) to represent irregular time-series dependencies, which can better learn the continuous inner process. Moreover, it incorporates static and dynamic features through self-attention and uses an encoder-decoder architecture to forecast the next sequence of medications. In the same direction, SMGCN [37] generates better results than its counterpart MedRec [36] on the TCM dataset in terms of precision and recall. A possible reason for the improved results of SMGCN is its combination of an MLP and a GCN, which fuse symptom representations into the overall implicit syndrome embedding and learn symptom and herb representations, respectively. MedRec, on the other hand, employs a knowledge graph to link symptoms, diseases, medicines, and examinations. It also uses an attribute graph to link medications that share similar characteristics and molecular structures. The jointly learned representations of symptoms and medicines are then employed in medication recommendation.
Finally, for the results reported on other datasets, viz., Private, eICU, NMEDW, Sutter, and NELL, we cannot draw meaningful conclusions, since each of these datasets has been used by only one model to report its performance.

OPEN ISSUES AND OPPORTUNITIES
This section reports on the problems faced by the chosen MR approaches and, based on the research examined in this article, presents research opportunities for addressing them.

Cold-start Problem
One of the well-known issues that MR methods encounter is the "cold-start" issue [53], which is further classified into cold-start patients and cold-start medications. In these situations, the approach cannot provide trustworthy medication recommendations due to insufficient knowledge about patients and medications. For example, when a new patient appears, the system has insufficient patient information and is therefore unable to create reasonable recommendations. To address the cold-start issue, most of the models employed medication history, time, diagnoses, and procedures. For instance, SMR [29] first connects medical knowledge and EMR graphs in order to construct a superior heterogeneous graph. The approach then encodes patients, diseases, medications, and their relationships in a common lower-dimensional space. Finally, in order to cast medication recommendation as a link prediction task, SMR also considers the patient's diagnoses and adverse drug reactions. Likewise, MetaCare++ [43] introduced a meta-learning technique to address the cold-start diagnosis task; it dynamically forecasts future diagnoses and timestamps for infrequent patients and explicitly encodes the impact of disease progression over time as a generalization prior.

Sparsity
This issue is most common in CF techniques [8] and is faced by several MR models when the dataset or patient information is sparse. The lack of information makes it difficult for the method to produce pertinent recommendations. If the number of medications in the database is small relative to the number of patients, the MR model faces network sparsity or data sparsity problems. The examined studies show that sparsity problems have been addressed by employing secondary information. In the case of network sparsity, side information enhances the MR model's knowledge about patients by extending the network of connections with new objects and relations. A new node may, for example, indicate an association between medications, patients, diseases, symptoms, and lab tests. Most of the approaches investigated in this study employ hybrid strategies that combine CF and CB to address data sparsity. The main distinction between them is the DL technique used to generate personalized medication recommendations. For the task of recommending herbs, SMGCN [37] utilizes a multi-layer neural network model that simulates the interactions between syndromes and herbs. The representations of the symptoms in a given symptom set are combined using an MLP to produce the overall implied syndrome representation. The model then combines the syndrome representation with herb embeddings to produce the final predictions.
In the same direction, MedRec [36] uses a knowledge graph to link medications, diseases, examinations, and symptoms. Additionally, it relates medications through common molecular structures and attributes using an attribute graph. Together, the two graphs strengthen the relationship between symptoms and treatments, which alleviates the data sparsity problem.

Drug-Drug Adverse Interactions
The recommendation model should take the interactions between drugs seriously into consideration. If a model recommends drugs that have adverse interactions, it can cause serious damage to a patient's health. Different models in the literature have proposed solutions to this problem. For instance, GAMENet [21] integrates the DDI KG through a memory module implemented as a GCN and models patients' longitudinal records to produce safe and personalized drug recommendations. Similarly, 4SDrug [28] introduces a drug set module that devises intersection-based set augmentation together with knowledge-based and data-driven penalties to ensure small and safe drug set recommendations. COGNet [5] uses a basic module to recommend the medication combination based on the patient's health condition in the current visit using an encoder-decoder architecture. Moreover, to consider the patient's historical visit information, the model introduces a copy module that compares the current health conditions against previous visits and copies reusable medications into the prescription for the current visit, taking changes in the health condition into account. A hierarchical selection mechanism combines the visit-level and medication-level scores to compute the copy probability for each medication.
Comparably, ARMR [9] first uses RNNs to generate patient representations and employs a key-value memory to store historical representations and the associated medications. As a result, a case-based approach with related results can be employed for medication recommendation. To reduce DDIs, ARMR incorporates a GAN model that aligns the distribution of patient representations with a prior Gaussian distribution. The MedRec component and the GAN model are conversely trained with double objectives in a mini-batch. The majority of available techniques constrain models by adding more DDI knowledge in an effort to address the DDI problem. To overcome this issue, SARMR [12] extracts from raw patient records the target distribution associated with safer drug combinations and uses it for adversarial regularization. In this way, the technique can modify the distribution of patient representations to lessen DDIs. SafeDrug [33] flexibly and adaptively merges the supervised loss with unsupervised DDI constraints. Specifically, if the DDI rate of individual samples is higher than a specific threshold (target) during training, the negative DDI signal will be emphasized and back-propagated.
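The idea of emphasizing the DDI signal only when the observed DDI rate exceeds a target can be sketched as a weighted loss. The code below is an illustration under stated assumptions, not the published formulation of any surveyed model: the linear weighting rule, the constants `target` and `kp`, and all function names are hypothetical, and plain Python floats stand in for the differentiable tensors a real model would use.

```python
def ddi_penalty(probs, ddi_adj):
    """Soft DDI penalty: sum of adj[i][j] * p_i * p_j over all drug pairs."""
    n = len(probs)
    return sum(ddi_adj[i][j] * probs[i] * probs[j]
               for i in range(n) for j in range(n))


def prediction_weight(batch_ddi_rate, target=0.06, kp=0.05):
    """Weight on the supervised loss; it shrinks once the DDI rate exceeds the target."""
    if batch_ddi_rate <= target:
        return 1.0
    return max(0.0, 1.0 - (batch_ddi_rate - target) / kp)


def combined_loss(pred_loss, probs, ddi_adj, batch_ddi_rate, target=0.06, kp=0.05):
    """Blend the supervised loss and the DDI penalty according to the observed DDI rate."""
    beta = prediction_weight(batch_ddi_rate, target, kp)
    return beta * pred_loss + (1.0 - beta) * ddi_penalty(probs, ddi_adj)


# Hypothetical example: drugs 0 and 1 interact, drug 2 is safe.
adj = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
probs = [0.9, 0.8, 0.1]
print(combined_loss(0.7, probs, adj, batch_ddi_rate=0.05))
print(combined_loss(0.7, probs, adj, batch_ddi_rate=0.085))
```

When the DDI rate is below the target, only the supervised loss is optimized; above it, the penalty term receives a growing share of the weight.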

Capturing Temporal Dynamics
The patient's recent health conditions and tests play a vital role in recommending precise medications. For certain diseases, such as flu, the recommendation depends mainly on the patient's recent clinical records. For other diseases, such as cardiovascular diseases, the patient's earlier records contain valuable information that helps produce precise recommendations. To this end, RETAIN [10] predicts future diagnoses by calculating a visit's attention weights at time t, considering the medical information in the current visit and the hidden state of the recurrent neural network at time t, to predict the visit at time t + 1. However, the relationships among all visits from time 1 to t are ignored. Dipole [48] handles this issue by embedding high-dimensional medical codes into a low-dimensional code-level space. These code representations are then fed to an attention-based bidirectional GRU [71] to produce the hidden state representation, and a softmax layer predicts the medical codes in future visits. ConCare [22], on the other hand, proposes a multichannel medical feature embedding architecture that learns the representation of various feature sequences through separate GRUs and uses time-aware attention to adaptively capture the effect of time intervals between records. Similarly, MeSIN [11] employs an interactive temporal sequence learning network to incorporate the intra-correlations of several visits within a single medical sequence and the inter-correlations of various sequences of EHR data. In particular, the improved laboratory finding embeddings are fed into the temporal sequence learning network, i.e., a long short-term memory (LSTM) network, to be combined with the historical laboratory results. To provide a more accurate representation for the prediction task, TAHDNet [13] incorporates a time-aware block to reflect the irregular time intervals. Specifically, an interval gate fuses two decay functions in order to take into account both periodic decay and monotonic decay.

Personalized Patient's Modeling
The patient's medical needs evolve over time. For example, a patient may visit a hospital to get treatment for the flu, but the next visit might be to treat stomach issues. Therefore, it is pertinent to exploit such evolving factors to capture the patient's recent medical requirements. To this end, ConCare [22] uses multi-head self-attention to explicitly extract the dependencies among clinical features, learn the personal health context, and regenerate the feature embedding under that context. The diversity among heads is encouraged using cross-head decorrelation. A multichannel medical feature embedding architecture is employed to learn the representation of various feature sequences via separate GRUs, and the effect of time intervals between the records of each feature is adaptively captured using time-aware attention.
Similarly, G-BERT [25] employs GCN [63] and BERT [58] to learn medical code representations and recommend medications, respectively. In particular, the approach integrates the GNN representation into a transformer-based visit encoder and pre-trains it on EHR data from patients with a single visit. To address the issue of asynchronous multi-view learning, AMANet [34] combines an attention mechanism with memory. Self-attention and inter-attention mechanisms are utilized to learn intra-view and inter-view interactions, respectively. Information about a specific object is maintained by a historical attention memory, which serves as a local knowledge store. In contrast, a dynamic external memory is utilized to keep the global knowledge for each view. MERITS [27] uses neural ordinary differential equations (Neural ODE) to capture irregular time-series dependencies. In addition, the model employs a DDI knowledge graph and two learned medication relation graphs to investigate the medications' co-occurrence and sequential correlations. It also applies an attention-based encoder-decoder framework to combine patient and medication history from the EMR.
Finally, the ARMR [9] model utilizes two GRU networks [71] to build an encoder that exploits patient diagnosis and procedure information to generate robust patient representations, which are employed in generating the final predictions.

CONCLUSION AND IMPLICATIONS
This paper explored DL-based MR models with respect to the platform, information filtering, information features and factors, recommendation type, evaluation methodology including datasets and metrics, the issues they face, and opportunities in addressing them. The following points summarize some of the main findings of this study.
• The majority of the examined models utilized medication history, diagnoses, time, and procedures as data factors, which are important aspects when making a personalized medication prediction for a patient. Besides, models that employ auxiliary information, such as medication history, diagnoses, time, procedures, symptoms, and physical examinations, can provide precise recommendations and alleviate the sparsity problem because such techniques exploit rich information and enrich knowledge about the patient's disease.
• The embedding-based methods are most common in DL-based MR approaches due to their ability to exploit multiple information sources and capture the users' preference dynamics. These are followed by RNNs due to their good performance in NLP tasks and in capturing long-range dependencies. They are also useful in the MR domain, which considers the changes in a patient's health over time. These are followed by the CNN variants, as they can exploit contextual details and capture locally relevant features.
• Recently, transformer-based models with attention networks are getting popular because they capture salient information factors and features regarding patients and medication and consider complex relations among them. We have found 10 out of 37 MR models that employed transformers to recommend medications.
• According to the survey, the majority of models viz. 24 out of 37 used the Adam optimization technique, while eight used gradient descent. One model employs Adagrad. Similarly, one of the 37 models used RMSprop. The possible reason behind the usage of Adam and SGD could be their capability to converge and generalize better compared to others.
• The main issues experienced by the examined models are personalization, exploiting temporal dynamics, and DDI. As a consequence of a lack of sufficient information about the patient's disease, some of the models struggled with the sparsity and cold-start problems. Interpretability is the least explored issue among the selected models. According to the study results, embedding methods and RNNs have better addressed the personalization, robustness, and DDI problems. The main reason is that embedding methods exploit robust semantic relations in EHR networks. Moreover, RNNs can better capture long-range dependencies and perform better on NLP tasks. In contrast, the survey shows that graph/network embedding methods have better addressed the sparsity and cold-start issues. The primary reason for this is that GCN embeds diseases, symptoms, medicines, patients, and their corresponding relationships into a shared lower-dimensional space.
• The MIMIC-III dataset contains rich information sources, namely patient information, diagnosis records, clinical events, procedures, medicines, and symptoms. As a result, the survey found that MIMIC-III is the most commonly used dataset in the domain of medication recommendation. Other datasets are generally employed by only a few models. For instance, NELL is the most recently published dataset and has been used in only one approach.
We hope the research avenues identified in this survey will assist researchers in exploring interesting trends and devising robust medication recommender systems.
degree in Software Engineering in 2022 from Southeast University, Nanjing, China. As a student, he won the title of excellent graduate twice and received the principal scholarship and the Huawei scholarship. In 2021, he also worked at Tencent as a research intern. Rui Wu is currently working as a machine learning engineer at Ant Group. His research interests include deep learning, natural language processing, and AI for healthcare. He has published several papers at academic conferences such as WWW and DASFAA.