## Abstract

Recurrent neural networks (RNNs) have been widely adopted in research areas concerned with sequential data, such as text, audio, and video. However, RNNs consisting of sigma cells or tanh cells are unable to learn the relevant information of input data when the input gap is large. By introducing gate functions into the cell structure, the long short-term memory (LSTM) could handle the problem of long-term dependencies well. Since its introduction, almost all the exciting results based on RNNs have been achieved by the LSTM. The LSTM has become the focus of deep learning. We review the LSTM cell and its variants to explore the learning capacity of the LSTM cell. Furthermore, the LSTM networks are divided into two broad categories: LSTM-dominated networks and integrated LSTM networks. In addition, their various applications are discussed. Finally, future research directions are presented for LSTM networks.

## 1 Introduction

Over the past few years, deep learning techniques have been well developed and widely adopted to extract information from various kinds of data (Ivakhnenko & Lapa, 1965; Ivakhnenko, 1971; Bengio, 2009; Carrio, Sampedro, Rodriguez-Ramos, & Campoy, 2017; Khan & Yairi, 2018). Considering different characteristics of input data, there are several types of architectures for deep learning, such as the recurrent neural network (RNN; Robinson & Fallside, 1987; Werbos, 1988; Williams, 1989; Ranzato et al., 2014), convolutional neural network (CNN; Fukushima, 1980; LeCun et al., 1989; Weng, Ahuja, & Huang, 1993; Rawat & Wang, 2017), and deep neural network (DNN; Guo, Liu, Georgiou, & Lew, 2017; Sharma & Singh, 2017). Usually the CNN and DNN cannot deal with the temporal information of input data. Therefore, in research areas that contain sequential data, such as text, audio, and video, RNNs are dominant. Specifically, there are two types of RNNs: discrete-time RNNs and continuous-time RNNs (Pearlmutter, 1989; Brown, Yu, & Garverick, 2004; Gallagher, Boddhu, & Vigraham, 2005). The focus of this review is on discrete-time RNNs.

The typical feature of the RNN architecture is a cyclic connection, which enables the RNN to possess the capacity to update the current state based on past states and current input data. These networks, such as fully RNNs (Elman, 1990; Jordan, 1986; Chen & Soo, 1996) and selective RNNs (Šter, 2013), consisting of standard recurrent cells (e.g., sigma cells), have had incredible success on some problems. Unfortunately, when the gap between the relevant input data is large, the above RNNs are unable to connect the relevant information. In order to handle the “long-term dependencies,” Hochreiter and Schmidhuber (1997) proposed long short-term memory (LSTM).

Almost all exciting results based on RNNs have been achieved by LSTM, and thus it has become the focus of deep learning. Because of their powerful learning capacity, LSTMs work tremendously well and have been widely used in various kinds of tasks, including speech recognition (Fernández, Graves, & Schmidhuber, 2007; He & Droppo, 2016; Hsu, Zhang, Lee, & Glass, 2016), acoustic modeling (Sak, Senior, & Beaufays, 2014; Qu, Haghani, Weinstein, & Moreno, 2017), trajectory prediction (Altché & Fortelle, 2017), sentence embedding (Palangi et al., 2015), and correlation analysis (Mallinar & Rosset, 2018). In this review, we explore these LSTM networks. This review is different from other review work on RNNs (Deng, 2013; Lipton, Berkowitz, & Elkan, 2015); it focuses only on advances in the LSTM cell and structures of LSTM networks. Here, the LSTM cell denotes the recurrent unit in LSTM networks.

## 2 LSTM Cells and Their Variants

In RNNs, the recurrent layers or hidden layers consist of recurrent cells whose states are affected by both past states and current input with feedback connections. The recurrent layers can be organized in various architectures to form different RNNs. Therefore, RNNs are mainly distinguished by the recurrent cell and network architecture. Different cells and inner connections enable RNNs to possess different capacities. In order to explore the development of LSTM networks, this section first gives a brief review of the LSTM cell and its variants.

### 2.1 Standard Recurrent Cell

### 2.2 LSTM

In order to deal with the problem of “long-term dependencies,” Hochreiter and Schmidhuber (1997) proposed the LSTM cell. They improved the remembering capacity of the standard recurrent cell by introducing a “gate” into the cell. Since this pioneering work, LSTMs have been modified and popularized by many researchers (Gers, 2001; Gers & Schmidhuber, 2000). Variations include LSTM without a forget gate, LSTM with a forget gate, and LSTM with a peephole connection. Usually the term *LSTM cell* denotes LSTM with a forget gate. We first introduce the original LSTM model that possesses only input and output gates.

#### 2.2.1 LSTM without a Forget Gate

#### 2.2.2 LSTM with a Forget Gate

Gers, Schmidhuber, and Cummins (2000) modified the original LSTM by introducing a forget gate into the cell. Figure 3 presents the inner connections of this modified LSTM cell, from which its mathematical expressions can be obtained.

The forget gate decides what information will be thrown away from the cell state. When the value of the forget gate, $f_t$, is 1, the cell keeps this information; a value of 0 means the cell discards all of it. Jozefowicz, Zaremba, and Sutskever (2015) found that when increasing the bias of the forget gate, $b_f$, the performance of the LSTM network usually became better. Furthermore, Schmidhuber, Wierstra, Gagliolo, and Gomez (2007) proposed that LSTM was sometimes better trained by evolutionary algorithms combined with other techniques rather than by pure gradient descent.
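The effect of the forget gate and its bias can be illustrated with a minimal scalar sketch. It assumes the commonly used formulation of the LSTM cell with a forget gate; the parameter names (`w_fx`, `b_f`, and so on) and the helper `make_params` are illustrative choices, not notation taken from this review.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell with a forget gate.

    `p` is a dict of scalar weights and biases; the key names are
    illustrative assumptions.
    """
    f = sigmoid(p["w_fx"] * x + p["w_fh"] * h_prev + p["b_f"])  # forget gate
    i = sigmoid(p["w_ix"] * x + p["w_ih"] * h_prev + p["b_i"])  # input gate
    o = sigmoid(p["w_ox"] * x + p["w_oh"] * h_prev + p["b_o"])  # output gate
    c_tilde = math.tanh(p["w_cx"] * x + p["w_ch"] * h_prev + p["b_c"])  # candidate
    c = f * c_prev + i * c_tilde  # the forget gate scales the old cell state
    h = o * math.tanh(c)
    return h, c

def make_params(b_f):
    # All weights are zero and the input gate is closed (b_i = -20),
    # so the forget-gate bias alone decides what happens to c_prev.
    keys = ["w_fx", "w_fh", "w_ix", "w_ih", "w_ox", "w_oh",
            "w_cx", "w_ch", "b_i", "b_o", "b_c"]
    p = {k: 0.0 for k in keys}
    p["b_i"] = -20.0
    p["b_f"] = b_f
    return p

# Large positive bias: gate value near 1, the cell state is kept.
_, c_keep = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, p=make_params(b_f=20.0))
# Large negative bias: gate value near 0, the cell state is discarded.
_, c_drop = lstm_step(x=1.0, h_prev=0.0, c_prev=0.7, p=make_params(b_f=-20.0))
print(c_keep, c_drop)
```

With the gate saturated near 1 the previous cell state passes through almost unchanged, and with the gate near 0 it is almost entirely erased. This is one intuition for why a larger initial $b_f$ can help the cell retain information early in training.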

#### 2.2.3 LSTM with a Peephole Connection

As the gates of the above LSTM cells do not have direct connections from the cell state, there is a lack of essential information that harms the network's performance. In order to solve this problem, Gers and Schmidhuber (2000) extended the LSTM cell by introducing a peephole connection, as shown in Figure 4.
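One common way to write the peephole LSTM is sketched below. The notation is an assumption made for illustration: $W$ denotes weight matrices, $b$ biases, $p_f$, $p_i$, and $p_o$ peephole weight vectors, $\sigma$ the logistic sigmoid, and $\odot$ the element-wise product. The forget and input gates see the previous cell state $c_{t-1}$, while the output gate sees the updated state $c_t$.

$$
\begin{aligned}
f_t &= \sigma\left(W_{fh} h_{t-1} + W_{fx} x_t + p_f \odot c_{t-1} + b_f\right),\\
i_t &= \sigma\left(W_{ih} h_{t-1} + W_{ix} x_t + p_i \odot c_{t-1} + b_i\right),\\
\tilde{c}_t &= \tanh\left(W_{\tilde{c}h} h_{t-1} + W_{\tilde{c}x} x_t + b_{\tilde{c}}\right),\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t,\\
o_t &= \sigma\left(W_{oh} h_{t-1} + W_{ox} x_t + p_o \odot c_t + b_o\right),\\
h_t &= o_t \odot \tanh(c_t).
\end{aligned}
$$

Setting $p_f = p_i = p_o = 0$ recovers the LSTM with a forget gate and no peephole connections.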

### 2.3 Gated Recurrent Unit

The learning capacity of the LSTM cell is superior to that of the standard recurrent cell. However, the additional parameters increase computational burden. Therefore, the gated recurrent unit (GRU) was introduced by Cho et al. (2014). Figure 5 shows the details of the architecture and connections of the GRU cell.

In order to reduce the number of parameters, the GRU cell integrates the forget gate and input gate of the LSTM cell as an update gate. The GRU cell has only two gates: an update gate and a reset gate. Therefore, it saves one gating signal and the associated parameters. The GRU is essentially a variant of vanilla LSTM with a forget gate. Since one gate is missing, the single GRU cell is less powerful than the original LSTM. The GRU cannot be taught to count or to recognize context-free languages (Weiss, Goldberg, & Yahav, 2018) and also does not work for translation (Britz, Goldie, Luong, & Le, 2017). Chung, Gulcehre, Cho, and Bengio (2014) empirically evaluated the performance of the LSTM network, GRU network, and traditional tanh-RNN and found that both the LSTM cell and GRU cell were superior to the traditional tanh unit, under the condition that each network had approximately the same number of parameters. Dey and Salem (2017) modified the original GRU and evaluated properties of three GRU variants (GRU-1, GRU-2, and GRU-3) on MNIST and IMDB data sets. The results showed that these three variants could reduce the computational expense while performing as well as the original GRU cell.
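A minimal scalar sketch of one GRU step shows how a single update gate takes over the roles of the LSTM forget and input gates. The parameter names are illustrative assumptions. The interpolation convention used here, $h_t = (1 - z_t)\, h_{t-1} + z_t\, \tilde{h}_t$, is one of two that appear in the literature; some papers swap the roles of $z_t$ and $1 - z_t$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x, h_prev, p):
    """One time step of a scalar GRU cell (key names in `p` are illustrative)."""
    z = sigmoid(p["w_zx"] * x + p["w_zh"] * h_prev + p["b_z"])  # update gate
    r = sigmoid(p["w_rx"] * x + p["w_rh"] * h_prev + p["b_r"])  # reset gate
    # The reset gate controls how much of the old state enters the candidate.
    h_tilde = math.tanh(p["w_hx"] * x + p["w_hh"] * (r * h_prev) + p["b_h"])
    # (1 - z) acts like a forget gate and z like an input gate.
    return (1.0 - z) * h_prev + z * h_tilde

def make_params(b_z):
    keys = ["w_zx", "w_zh", "b_z", "w_rx", "w_rh", "b_r", "w_hx", "w_hh", "b_h"]
    p = {k: 0.0 for k in keys}
    p["w_hx"] = 1.0
    p["b_z"] = b_z
    return p

h_hold = gru_step(x=0.5, h_prev=0.3, p=make_params(b_z=-20.0))  # z near 0
h_new = gru_step(x=0.5, h_prev=0.3, p=make_params(b_z=20.0))    # z near 1
print(h_hold, h_new)
```

When the update gate is near 0 the previous state is copied forward, and when it is near 1 the state is replaced by the candidate, here $\tanh(0.5)$. The two decisions are tied together, which is where the saving of one gate comes from.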

### 2.4 Minimal Gated Unit

The minimal gated unit (MGU) cell involves only one gate, a forget gate. Hence, this kind of cell has a simpler structure and fewer parameters compared with LSTM and the GRU. Evaluation results (Zhou, Sun et al., 2016), based on various sequence data, showed that the performance of the MGU is comparable to that of the GRU. In order to further simplify the MGU, Heck and Salem (2017) introduced three model variants (MGU-1, MGU-2, and MGU-3), which reduced the number of parameters in the forget gate. They found that the performances of these variants were comparable to those of the MGU in tasks on the MNIST data set and the Reuters Newswire Topics data set.
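The parameter savings can be made concrete with a rough count. The sketch below assumes that each gate or candidate computation is one affine map of the concatenated input and previous hidden state, with a single bias vector and no peephole weights. Under that assumption, the LSTM with a forget gate uses four such maps, the GRU three, and the MGU two. The function name and the sizes are hypothetical, and some software implementations keep two bias vectors per map, which changes the totals slightly.

```python
def gated_cell_params(input_size, hidden_size, n_maps):
    """Count parameters of a cell built from `n_maps` affine maps of [x_t, h_{t-1}].

    Each map has hidden_size * (input_size + hidden_size) weights
    plus hidden_size biases.
    """
    per_map = hidden_size * (input_size + hidden_size) + hidden_size
    return n_maps * per_map

# Number of affine maps per cell type (gates plus the candidate state).
MAPS = {"LSTM": 4, "GRU": 3, "MGU": 2}

m, n = 100, 128  # illustrative input and hidden sizes
counts = {name: gated_cell_params(m, n, k) for name, k in MAPS.items()}
print(counts)  # {'LSTM': 117248, 'GRU': 87936, 'MGU': 58624}
```

For equal layer sizes, the GRU therefore needs three quarters, and the MGU one half, of the parameters of the LSTM cell.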

### 2.5 Other LSTM Variants

In order to determine whether the architecture of the LSTM cell is optimal, Jozefowicz et al. (2015) evaluated over 10,000 different architectures and found three that outperformed both the LSTM and GRU on the considered tasks: MUT-1, MUT-2, and MUT-3. The architectures of these cells are similar to that of the GRU with only two gates: an update gate and a reset gate. Neil, Pfeiffer, and Liu (2016) extended the LSTM cell by adding a new time gate, and proposed the phased LSTM cell. This new time gate leads to a sparse updating for the cell, which makes the phased LSTM achieve faster convergence than regular LSTMs on tasks of learning long sequences. Nina and Rodriguez (2016) simplified the LSTM cell by removing the input gate and coupling the forget gate and input gate. Removing the input gate produced better results in tasks that did not need to recall very long sequences.

Unlike the variants that modified the LSTM cell through decreasing or increasing gate functions, there are other kinds of LSTM variants. For example, Rahman, Mohammed, and Azad (2017) incorporated a biologically inspired variation into the LSTM cell and proposed the biologically variant LSTM. They changed only the updating of cell state $c(t)$ to improve cell capacity. Moreover, LSTM with working memory was introduced by Pulver and Lyu (2017). This modified version replaced the forget gate with a functional layer, whose input is decided by the previous memory cell value. The gated orthogonal recurrent unit (GORU) was proposed by Jing et al. (2017). They modified the GRU by an orthogonal matrix, which replaced the hidden state loop matrix of the GRU. This made the GORU possess the advantages of both an orthogonal matrix and the GRU structure. Irie, Tüske, Alkhouli, Schlüter, and Ney (2016) added highway connections to the LSTM cell and GRU cell to ensure an unobstructed information flow between adjacent layers. Furthermore, Veeriah, Zhuang, and Qi (2015) introduced a differential gating scheme for the LSTM neural network to solve the impact of spatiotemporal dynamics and then proposed the differential LSTM cell.

Although variants and analogous neural cells have been introduced, they can only be used on one or some specific data sets. Additionally, there is no cell variant that can outperform the LSTM cell on the whole. Thus, the LSTM cell is still the focus of deep learning for the processing of sequential data, and it remains the most popular recurrent cell in the literature. Therefore, in section 3, we give a comprehensive review of networks that consist of LSTM cells.

## 3 Different LSTM Networks

Due to the limited capacity of the single LSTM cell in handling engineering problems, the LSTM cells have to be organized into a specific network architecture when processing practical data. Despite the fact that all RNNs can be modified as LSTM networks by replacing the standard recurrent cell with the LSTM cell, this review discusses only the verified LSTM networks. We divide these LSTM networks into two broad categories: LSTM-dominated networks and integrated LSTM networks. LSTM-dominated networks are neural networks that are mainly constructed by LSTM cells. These networks focus on optimizing the connections of the inner LSTM cells so as to enhance the network properties (Nie, An, Huang, Yan, & Han, 2016; Zhao, Wang, Yan, & Mao, 2016). The integrated LSTM networks consist of LSTM layers and other components, such as a CNN and an external memory unit. The integrated LSTM networks mainly pay attention to integrating the advantageous features of different components when dealing with the target task.

Before introducing the LSTM networks, we simplify the schematic of the LSTM cell in section 2.2 as in Figure 7. In the figure, the dashed line indicates identity transformations. Therefore, the LSTM cell only outputs $h_t$ along the depth dimension with no time delay and transmits both $c_t$ and $h_t$ along the time dimension with one-unit-interval delay.

### 3.1 LSTM-Dominated Neural Networks

#### 3.1.1 Stacked LSTM Network

In the usual application, the simplest method to add capacity and depth to the LSTM network is to stack the LSTM layers. Therefore, the stacked LSTM network is the most basic and simplest structure of the LSTM network (Fernández, Graves, & Schmidhuber, 2007). It can also be treated as a multilayer fully connected structure. A block of three recurrent layers is illustrated in Figure 8.

Then we unroll this schematic along the time dimension in Figure 9, where we assume that the sequence length is 4.
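The unrolled computation can be sketched as two nested loops, one over time and one over depth. The example below is a minimal scalar illustration with three layers and a sequence of length 4; the cell parameterization (`w[g] = (w_x, w_h, b)`) and the weight values are assumptions chosen only to make the sketch runnable.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    """Scalar LSTM step; w[g] = (w_x, w_h, b) for g in 'f', 'i', 'o', 'c'."""
    pre = {g: wx * x + wh * h_prev + b for g, (wx, wh, b) in w.items()}
    f, i, o = sigmoid(pre["f"]), sigmoid(pre["i"]), sigmoid(pre["o"])
    c = f * c_prev + i * math.tanh(pre["c"])
    return o * math.tanh(c), c

def stacked_lstm(xs, layers):
    """Run a stack of LSTM layers over the sequence `xs`.

    Returns the outputs of the top layer, one per time step.
    """
    states = [(0.0, 0.0) for _ in layers]  # (h, c) for every layer
    outputs = []
    for x in xs:                           # time dimension
        inp = x
        for k, w in enumerate(layers):     # depth dimension
            h, c = lstm_step(inp, states[k][0], states[k][1], w)
            states[k] = (h, c)             # h and c travel along time
            inp = h                        # only h travels up the stack
        outputs.append(inp)
    return outputs

layer = {g: (0.5, 0.5, 0.0) for g in "fioc"}  # illustrative weights
ys = stacked_lstm([1.0, 0.5, -0.5, 0.2], [layer, layer, layer])
print(ys)
```

Each layer keeps its own pair of hidden and cell states across time steps, while only the hidden state is handed to the next layer within a time step.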

Because of the simple and efficient architecture, the stacked LSTM network has been widely adopted by researchers. Du, Zhang, Nguyen, and Han (2017) adopted the stacked LSTM network to solve the problem of vehicle-to-vehicle communication and discovered that the efficiency of the stacked LSTM-based regression model was much higher than that of logistic regression. Sutskever, Vinyals, and Le (2014) used a stacked LSTM network with four layers, and 1000 cells at each layer, to accomplish an English-to-French translation task. They found that reversing the order of source words could introduce short-term dependencies between the source and the target sentence; thus, the performance of this LSTM network was remarkably improved. Saleh, Hossny, and Nahavandi (2018) constructed a deep stacked LSTM network to predict the intent of vulnerable road users. When evaluated on the Daimler pedestrian path prediction benchmark data set (Schneider & Gavrila, 2013) for intent and path prediction of pedestrians in four unique traffic scenarios, the stacked LSTM was more powerful than the methodology relying on a set of specific hand-crafted features.

#### 3.1.2 Bidirectional LSTM Network

Conventional RNNs are only able to make use of a previous context. In order to overcome this shortcoming, the bidirectional RNN (BRNN) was introduced by Schuster and Paliwal (1997). This kind of architecture could be trained in both time directions simultaneously, with separate hidden layers (i.e., forward layers and backward layers). Graves and Schmidhuber (2005) combined the BRNN with the LSTM cell and proposed the bidirectional LSTM. Figure 10 shows the internal connections of bidirectional LSTM recurrent layers.
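The wiring of a bidirectional layer can be sketched independently of the cell type. In the example below, `toy_step` is a deliberately trivial stand-in for an LSTM cell (a running sum), used only so that the forward and backward contexts are easy to read off; all function names are illustrative.

```python
def run_direction(step, xs, h0):
    """Apply a recurrent step along `xs` and collect the hidden states."""
    hs, h = [], h0
    for x in xs:
        h = step(x, h)
        hs.append(h)
    return hs

def bidirectional(step_fwd, step_bwd, xs, h0=0):
    """Pair forward and backward hidden states at every time step."""
    fwd = run_direction(step_fwd, xs, h0)
    # Process the reversed sequence, then flip back to the original order.
    bwd = run_direction(step_bwd, xs[::-1], h0)[::-1]
    return list(zip(fwd, bwd))

def toy_step(x, h):
    # Stand-in for a recurrent cell: accumulates everything seen so far.
    return h + x

out = bidirectional(toy_step, toy_step, [1, 2, 3, 4])
print(out)  # [(1, 10), (3, 9), (6, 7), (10, 4)]
```

At each position the first entry summarizes the inputs up to and including that step, and the second summarizes the inputs from that step to the end, so the pair carries both past and future context. In a bidirectional LSTM the two directions use separate LSTM layers with separate parameters.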

Bidirectional LSTM networks have been widely adopted by researchers because of their excellent properties (Han, Wu, Jiang, & Davis, 2017; Yu, Xu, & Zhang, 2018). Thireou and Reczko (2007) applied this architecture to the sequence-based prediction of protein localization. The results showed that the bidirectional LSTM network outperforms the feedforward network and standard recurrent network. Wu, Zhang, and Zong (2016) investigated the different skip connections in a stacked bidirectional LSTM network and found that adding skip connections to the cell outputs with a gated identity function could improve network performance on the part-of-speech tagging task. Brahma (2018) extended the bidirectional LSTM to suffix bidirectional LSTM (SuBiLSTM), which improved the bidirectional LSTM network by encoding each suffix of the sequence. The SuBiLSTM outperformed the bidirectional LSTM in learning general sentence representations.

#### 3.1.3 Multidimensional LSTM Network

The standard RNNs can only be used to deal with one-dimensional data. However, in fields with multidimensional data like video processing, the properties of RNNs (the ability to access contextual information and robustness to input warping) are also desired. Graves, Fernández, and Schmidhuber (2007) introduced multidimensional LSTM (MDLSTM) to extend the application fields of RNNs. The core idea of MDLSTM was to build as many recurrent connections as the dimensions of the data. At each point in the data sequence, the recurrent layer $L$ receives both the output of layer $L-1$ and its own activations from one step back along all dimensions. This means that the LSTM cells in layer $L$ have $n$ forget gates in the $n$-dimensional LSTM network (one for each of the cell's previous states along every dimension). Figure 11 shows the unrolled two-dimensional (2D) LSTM network case.
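The role of the $n$ forget gates can be summarized by the cell-state update, sketched here in assumed notation: $\mathbf{p}$ is a position in the $n$-dimensional grid, $\mathbf{e}_d$ is the unit step along dimension $d$, $i_{\mathbf{p}}$ is the input gate, $\tilde{c}_{\mathbf{p}}$ is the candidate state, and $f^{(d)}_{\mathbf{p}}$ is the forget gate associated with dimension $d$.

$$
c_{\mathbf{p}} = i_{\mathbf{p}} \odot \tilde{c}_{\mathbf{p}} + \sum_{d=1}^{n} f^{(d)}_{\mathbf{p}} \odot c_{\mathbf{p} - \mathbf{e}_d}
$$

For $n = 1$ the sum has a single term and the expression reduces to the usual update $c_t = i_t \odot \tilde{c}_t + f_t \odot c_{t-1}$.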

Graves et al. (2007) extended the above unidirectional MDLSTM to the multidirectional MDLSTM and adopted this architecture to deal with the Air Freight database (McCarter & Storkey, 2007) and the MNIST database (LeCun, Bottou, Bengio, & Haffner, 1998). The results showed that this architecture is more robust to input warping than the state-of-the-art digit recognition algorithm. Li, Mohamed, Zweig, and Gong (2016a) constructed a MDLSTM network by time-frequency LSTM cells (Li, Mohamed, Zweig, & Gong, 2016b), which performed recurrence along the time and frequency axes. On the Microsoft Windows phone short message dictation task, the recurrence over both time and frequency axes promoted learning accuracy compared with a network that consisted of only time LSTM cells.

#### 3.1.4 Graph LSTM Network

When processing graph-structured data, the existing pixel-wise LSTM structures usually assume that each pixel is influenced by fixed neighboring pixels. These structures include row LSTM (Oord, Kalchbrenner, & Kavukcuoglu, 2016), diagonal BiLSTM (Shabanian, Arpit, Trischler, & Bengio, 2017), and local-global LSTM (Liang, Shen, Xiang et al., 2016). These predefined recurrent sequences usually cause redundant computational costs. Liang, Shen, Xiang et al. (2016) extended the fixed topologies and proposed the graph LSTM network, based on the graph RNN network (Goller & Kuchler, 1996). The graph LSTM model assumes that each superpixel node is defined by its previous states and adaptive neighboring nodes. Figures 12a and 12b show a superpixel graph topology and the corresponding graph LSTM architecture, respectively. Instead of using the fixed starting node and predefined updating route for all images, the starting node and node updating scheme of graph LSTM are dynamically specified.

Liang, Shen, Feng et al. (2016) demonstrated the superiority of the graph LSTM network on four data sets: the Fashionista data set (Yamaguchi, 2012), PASCAL-Person-Part data set (Chen et al., 2014), ATR data set (Liang et al., 2015), and Horse-Cow Parsing data set (Wang & Yuille, 2015), and the results showed that this network could perform well in semantic object parsing. Liang et al. (2017) further extended the graph LSTM model to the structure-evolving LSTM. The structure-evolving LSTM merges the graph nodes stochastically and thus learns the multilevel graph structures in a progressive and stochastic way. The effectiveness evaluation shows that the structure-evolving LSTM outperforms the state-of-the-art LSTM models.

#### 3.1.5 Grid LSTM Network

The Grid LSTM network, proposed by Kalchbrenner, Danihelka, and Graves (2015), could also be used to process multidimensional data. This architecture arranges the LSTM cells in a grid of one or more dimensions. Different from the existing networks, the grid LSTM network has recurrent connections along the depth dimension. In order to explain the architecture of the grid LSTM, the blocks from the standard LSTM and Grid LSTM are shown in Figure 13.

In the blocks, the dashed lines denote identity transformations. The red lines, green lines, and purple lines indicate the transformations along different dimensions. Compared with the standard LSTM block, the 2D Grid LSTM block has the cell memory vector along the vertical dimension.

Kalchbrenner et al. (2015) put forward that the one-dimensional (1D) grid LSTM network is the architecture that replaces the transfer functions (e.g., tanh and ReLU; Nair & Hinton, 2010) of the feedforward network by the 1D grid LSTM block. Thus, the 2D Grid LSTM network is similar to the stacked LSTM network, but there are recurrent connections along the depth dimension. The grid LSTM with three or more dimensions corresponds to MDLSTM (Graves, 2012), but the $N$-way recurrent interactions exist along all dimensions. Figure 14 shows the stacked LSTM network and 2D Grid LSTM network.

Kalchbrenner et al. (2015) found that the 2D grid LSTM outperformed the stacked LSTM on the task of memorizing sequences of numbers (Zaremba & Sutskever, 2014). The recurrent connections along the depth dimension could improve the learning properties of grid LSTM. In order to lower the computational complexity of the grid LSTM network, Li and Sainath (2017) compared four grid LSTM variations and found that the frequency-block grid LSTM reduced computation costs without loss of accuracy on a 12,500-hour voice search task.

#### 3.1.6 Convolutional LSTM Network

The fully connected LSTM layer contains too much redundancy for spatial data (Sainath, Vinyals, Senior, & Sak, 2015). Therefore, to accomplish a spatiotemporal sequence forecasting problem, Shi et al. (2015) proposed the convolutional LSTM (ConvLSTM), which had convolutional structures in the recurrent connections. Figure 15 shows the recurrent structure of the ConvLSTM layer.
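For orientation, the ConvLSTM update is commonly written as below, following the general form given by Shi et al. (2015). The notation is an assumption for this sketch: $*$ denotes convolution, $\odot$ the element-wise product, and the inputs $X_t$, hidden states $H_t$, and cell states $C_t$ are 3D tensors whose last two dimensions are spatial.

$$
\begin{aligned}
i_t &= \sigma\left(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \odot C_{t-1} + b_i\right),\\
f_t &= \sigma\left(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \odot C_{t-1} + b_f\right),\\
C_t &= f_t \odot C_{t-1} + i_t \odot \tanh\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right),\\
o_t &= \sigma\left(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \odot C_t + b_o\right),\\
H_t &= o_t \odot \tanh(C_t).
\end{aligned}
$$

Compared with a fully connected LSTM layer, the matrix products in the input-to-state and state-to-state transitions are replaced by convolutions, so the number of weights depends on the kernel size rather than on the spatial size of the input.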

Wei, Zhou, Sankaranarayanan, Sengupta, and Samet (2018) adopted the ConvLSTM network to solve tweet count prediction, a spatiotemporal sequence forecasting problem. The results of experiments on the city of Seattle showed that the proposed network consistently outperforms the competitive baseline approaches: the autoregressive integrated moving average model, ST-ResNet (Zhang, Zheng, & Qi, 2016), and Eyewitness (Krumm & Horvitz, 2015). Zhu, Zhang, Shen, and Song (2017) combined the 3D CNN and ConvLSTM to construct a multimodal gesture recognition model. The 3D CNN and ConvLSTM learned the short-term and long-term spatiotemporal features of gestures, respectively. When verified on the Sheffield Kinect Gesture data set and the ChaLearn LAP large-scale isolated gesture data set, the results showed that the proposed method performs better than other models. Liu, Zhou, Hang, and Yuan (2017) extended both the ConvLSTM and bidirectional LSTM and presented the bidirectional–convolutional LSTM architecture to learn spectral–spatial features from hyperspectral images.

#### 3.1.7 Depth-gated LSTM Network

The stacked LSTM network is the simplest way to construct a DNN. However, Yao, Cohn, Vylomova, Duh, and Dyer (2015) pointed out that the error signals in the stacked LSTM network might be either diminished or exploded because the error signals from the top have to be backpropagated through many layers of nonlinear transformations. In order to solve this problem, they proposed the depth-gated LSTM (DGLSTM), in which a depth gate was used to connect memory cells of neighboring LSTM layers. An illustration of the DGLSTM architecture is shown in Figure 16.

The DGLSTM architecture is inspired by the highway network (Kim, El-Khamy, & Lee, 2017; Srivastava, Greff, & Schmidhuber, 2015) and grid LSTM (Kalchbrenner et al., 2015). Yao et al. (2015) used DGLSTMs of varied depth to accomplish the Chinese-to-English machine translation task. The results showed that the performance of the DGLSTM network is always better than that of the stacked LSTM network. Zhang, Chen et al. (2015) proposed efficient algorithms to train the DGLSTM network using both frame and sequence discriminative criteria and achieved even better improvements.

#### 3.1.8 Gated-Feedback LSTM Network

To deal with the problem of learning multiple adaptive timescales, Chung, Gulcehre, Cho, and Bengio (2015) proposed the gated-feedback RNN (GF-RNN) and the gated-feedback LSTM (GF-LSTM) network. For each pair of layers, the GF-RNN uses a global gating unit to allow and control signals flowing from upper recurrent layers to lower layers, which are called gated-feedback connections. Figure 17 shows the architecture of the GF-RNN.

The proposed GF-RNN was evaluated against the conventional stacked RNN on the task of Python program evaluation (Koutnik, Greff, Gomez, & Schmidhuber, 2014), and the results showed that the GF-RNN outperforms the stacked RNN.

#### 3.1.9 Tree-Structured LSTM Network

Most of the existing LSTM networks are chain-structured. These networks perform well in machine translation and speech recognition. However, when these chain-structured networks need to combine words and phrases in natural language processing, they exhibit poor properties (Li, Luong, Dan, & Hovy, 2015; Zhang, Lu, & Lapata, 2015). Therefore, Zhu, Sobhani, and Guo (2015) and Tai, Socher, and Manning (2015) extended the chain-structured LSTM networks to tree-structured networks, based on the original tree-structured recursive models (Goller & Kuchler, 1996; Sperduti & Starita, 1997; Francesconi et al., 1997; Frasconi, Gori, & Sperduti, 1998). Figure 18 shows a tree-structured LSTM network.

Zhu et al. (2015) showed that the tree-structured LSTM network outperformed other recursive models in learning distributed sentiment representations for texts. Tai et al. (2015) proposed two tree-structured LSTM architectures: the Child-Sum Tree-LSTM and the N-ary Tree-LSTM. The tree nodes in both models are dependent on multiple child units. The Child-Sum Tree-LSTM adopts the sum of child hidden states $h_k$ to update the cell. However, the N-ary Tree-LSTM introduces separate parameter matrices for each child and is able to learn more finely grained information from its children. Teng and Zhang (2016) further extended the tree-structured LSTM to the bidirectional tree-structured LSTM with head lexicalization. The general tree-structured LSTM network constitutes trees in the bottom-up direction. Therefore, only the leaf nodes can use the input word information. In order to propagate head words from leaf nodes to every constituent node, Teng and Zhang (2016) proposed the automatic head lexicalization for a general tree-structured LSTM network, and built a tree LSTM in the top-down direction. Miwa and Bansal (2016) further stacked a bidirectional tree-structured LSTM network on a bidirectional sequential LSTM network to accomplish end-to-end relation extraction. Additionally, Niu, Zhou, Wang, Gao, and Hua (2017) extended the tree-structured LSTM to the hierarchical multimodal LSTM (HM-LSTM) for the problem of dense visual-semantic embedding. Since the HM-LSTM could exploit the hierarchical relations between whole images and image regions and between sentences and phrases, this model outperformed other methods on Flickr8K (Graves, 2014), Flickr30K (Plummer et al., 2017), and MS-COCO (Chen et al., 2015; Lin et al., 2014) data sets.
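As a point of reference for the Child-Sum Tree-LSTM mentioned above, its update at a node $j$ can be sketched as follows, in the general form used by Tai et al. (2015). The notation is assumed for illustration: $C(j)$ is the set of children of node $j$, $x_j$ is the input at the node, and $W$, $U$, and $b$ are weight matrices and biases.

$$
\begin{aligned}
\tilde{h}_j &= \sum_{k \in C(j)} h_k,\\
i_j &= \sigma\left(W^{(i)} x_j + U^{(i)} \tilde{h}_j + b^{(i)}\right),\\
f_{jk} &= \sigma\left(W^{(f)} x_j + U^{(f)} h_k + b^{(f)}\right), \quad k \in C(j),\\
o_j &= \sigma\left(W^{(o)} x_j + U^{(o)} \tilde{h}_j + b^{(o)}\right),\\
u_j &= \tanh\left(W^{(u)} x_j + U^{(u)} \tilde{h}_j + b^{(u)}\right),\\
c_j &= i_j \odot u_j + \sum_{k \in C(j)} f_{jk} \odot c_k,\\
h_j &= o_j \odot \tanh(c_j).
\end{aligned}
$$

Each child $k$ has its own forget gate $f_{jk}$, so the node can keep information from some children and discard it from others. When every node has exactly one child, the expressions reduce to those of a chain-structured LSTM.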

#### 3.1.10 Other Networks

*Coupled LSTM Network.* Inspired by the grid LSTM and the multidimensional RNN, Liu, Qiu, and Huang (2016) proposed two coupled LSTM architectures, the loosely coupled LSTM (LC-LSTM) and the tightly coupled LSTM (TC-LSTM), to model the interactions of sentence pairs. These two coupled LSTM networks model the strong interactions of two sentences through the connections between hidden states.

*Deep Fusion LSTM Networks.* Liu, Qiu, Chen, and Huang (2016) adopted a deep fusion LSTM (DF-LSTM) network to extract the strong interaction of text pairs. This network consisted of two interdependent LSTMs. In order to capture more complicated matching patterns, the authors then used external memory to enhance the memory capacity of LSTMs. They gave a qualitative analysis for the model capacity and demonstrated the model efficacy on the Stanford Natural Language Inference Corpus.

*Multiplicative LSTM Network.* The weight matrices of the usual LSTM networks are fixed for different inputs. These networks are inexpressive when dealing with sequences of discrete mutually exclusive elements. Therefore, Krause, Lu, Murray, and Renals (2016) combined the LSTM with the multiplicative RNN architecture and obtained the multiplicative LSTM, which possessed flexible input-dependent weight matrices. In a series of character-level language modeling tasks, this model outperformed the stacked LSTM.

*Nested LSTM Network.* The nested LSTM (NLSTM) network was proposed by Moniz and Krueger (2018). They used nesting to add depth to the network. In the NLSTM network, the memory cell of the outer LSTM cell is computed by the inner LSTM cell, and $c_t^{\text{outer}} = h_t^{\text{inner}}$. In the task of character-level language modeling, the NLSTM network is a promising alternative to the stacked architecture, and it outperforms both stacked and single-layer LSTM networks with similar numbers of parameters.

### 3.2 Integrated LSTM Networks

Sometimes the capacity of a single LSTM network cannot meet practical engineering requirements, and thus there are some integrated LSTM networks that consist of an LSTM layer/network and other components, such as a convolutional neural network and an external memory unit. These integrated LSTM networks take advantage of different components.

#### 3.2.1 Neural Turing Machine

Because the RNN needs to update the current cell states by inputs and previous states, the memory capacity is the essential feature of the RNN. In order to enhance the memory capacity of networks, neural networks have been adopted to learn to control end-to-end differentiable stack machines (Sun, 1990; Mozer & Das, 1993). Schmidhuber (1993) and Schlag and Schmidhuber (2017) extended the above structure and used the RNN to learn and control end-to-end differentiable fast weight memory. Furthermore, Graves, Wayne, and Danihelka (2014) proposed the neural Turing machine (NTM) with an LSTM controller, which took inspiration from both the design of digital computers and the biological working memory models.

In Figure 19, the dashed line displays the division between the NTM and the external environment. The RNN receives inputs and emits outputs during each update cycle. At the same time, the RNN also exchanges information with the memory matrix via the read and write heads. The mathematical expression of the neural network is decided by the type of the selected RNN. Graves et al. (2014) selected the LSTM cell to construct the neural network and compared the performance of this NTM and a standard LSTM network. The results showed that NTM possessed advantages over the LSTM network. Xie and Shi (2018) improved the training speed by designing a new read–write mechanism for the NTM, which used convolution operations to extract global memory features.
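How the controller exchanges information with the memory matrix can be sketched with plain Python lists. The example below is a simplified illustration in the spirit of the addressing described by Graves et al. (2014): it shows only content-based focusing, a weighted read, and an erase-then-add write, and it omits the interpolation, shifting, and sharpening stages of the full mechanism. The function names and the example values are assumptions.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-8)  # small constant avoids division by zero

def content_weights(memory, key, beta):
    """Softmax over memory rows of beta * cosine similarity to `key`."""
    scores = [beta * cosine(row, key) for row in memory]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def read(memory, w):
    """Read vector: weighted sum of memory rows."""
    width = len(memory[0])
    return [sum(w[i] * memory[i][j] for i in range(len(memory)))
            for j in range(width)]

def write(memory, w, erase, add):
    """Erase-then-add write, applied to each row in proportion to its weight."""
    return [[m * (1.0 - w_i * e) + w_i * a
             for m, e, a in zip(row, erase, add)]
            for row, w_i in zip(memory, w)]

memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = content_weights(memory, key=[0.0, 1.0], beta=50.0)
r = read(memory, w)  # close to the second row, which matches the key best
updated = write(memory, [0.0, 1.0, 0.0], erase=[1.0, 1.0], add=[0.5, 0.5])
print(w, r, updated)
```

Because every operation is a smooth function of the weights, the controller can be trained by gradient descent to decide where to read and write. A larger `beta` concentrates the weights on the best-matching row.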

#### 3.2.2 DBN-LSTM Network

This network is the combination of a deep belief network and LSTM (DBN-LSTM), and was introduced by Vohra, Goel, and Sahoo (2015). It is an extension of the RNN-DBN (Goel, Vohra, & Sahoo, 2014). In the amalgamation of the DBN and LSTM, the multilayer DBN helps in high-level representation of data, while the LSTM network provides temporal information. Compared with the RNN-DBN, the most notable improvement of DBN-LSTM is that it replaces the standard recurrent cell with the LSTM cell, which ensures that the model can deal with a longer time duration.

#### 3.2.3 Multiscale LSTM Network

Cheng et al. (2016) combined a preprocessing block and LSTM network to form the multiscale LSTM (MS-LSTM) network. The preprocessing block helps select a proper timescale for the input data, and the LSTM layer is adopted to model the processed sequential data. In processing dynamic Internet traffic, Cheng et al. (2016) used this model to learn the Internet traffic pattern in a flexible time window. Peng, Zhang, Liang, Liu, and Lin (2016) combined the MS-LSTM with a CNN and constructed a recurrent architecture for geometric scene parsing.

#### 3.2.4 CFCC-LSTM Network

To predict sea surface temperature, Yang et al. (2017) proposed a CFCC-LSTM network, which was composed of a fully connected LSTM (FC-LSTM) and a CNN. They first introduced a 3D grid to preprocess the information and then adopted an FC-LSTM layer to predict the temporal information in the 3D grid and a convolution operation to handle the spatial information. The results on two data sets (the China Ocean data set and the Bohai Sea data set) showed the effectiveness of this model.

#### 3.2.5 C-LSTM Network

The C-LSTM network, proposed by Zhou, Wu, Zhang, and Zhou (2016), is also a combination of a CNN and an LSTM network. The CNN extracts a sequence of higher-level phrase representations, which is then fed into the LSTM network to obtain the sentence representation for sentence and document modeling tasks.

#### 3.2.6 LSTM-in-LSTM Network

Song, Tang, Xiao, Wu, and Zhang (2016) combined a Deep-CNN network and an LSTM-in-LSTM architecture to generate rich, fine-grained textual descriptions of images. The LSTM-in-LSTM architecture consists of an inner LSTM cell and an outer LSTM cell. This architecture can learn the contextual interactions between visual cues and can thus predict long sentence descriptions.

## 4 Conclusion

We have systematically reviewed various LSTM cell variants and LSTM networks. The LSTM cell is the basic node of LSTM networks. Different variants outperform the standard LSTM cell on some characteristics and tasks. For example, the GRU has fewer cell parameters. However, no variant surpasses the standard LSTM cell in all aspects. As for LSTM networks, there are two major categories: LSTM-dominated neural networks and integrated LSTM networks. To enhance network properties for specific tasks, LSTM-dominated networks focus on optimizing the connections between inner LSTM cells. Integrated LSTM networks mainly integrate the advantageous features of different components, such as the convolutional neural network and the external memory unit, when dealing with the target task. Current LSTM models have achieved remarkable success on numerous tasks. Nevertheless, there are still some directions for augmenting RNNs with more powerful properties.
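The parameter saving of the GRU can be made concrete with a short calculation. The sketch below assumes a standard LSTM cell without peephole connections (three gates plus one candidate block) and a GRU with two gates plus one candidate block, each block having one input weight matrix, one recurrent weight matrix, and one bias vector. The function names and layer sizes are illustrative, and some software libraries keep two bias vectors per block, which changes the totals slightly.

```python
def block_params(input_size, hidden_size):
    """Weights of one gate or candidate block: W (h x d), U (h x h), bias (h)."""
    return hidden_size * (input_size + hidden_size) + hidden_size

def lstm_params(input_size, hidden_size):
    """Standard LSTM cell without peepholes: 3 gates + 1 candidate block."""
    return 4 * block_params(input_size, hidden_size)

def gru_params(input_size, hidden_size):
    """GRU cell: 2 gates + 1 candidate block (single-bias convention assumed)."""
    return 3 * block_params(input_size, hidden_size)

n_in, n_hid = 128, 256  # illustrative sizes, not taken from any cited experiment
lstm_total = lstm_params(n_in, n_hid)
gru_total = gru_params(n_in, n_hid)
print(lstm_total, gru_total, gru_total / lstm_total)
```

Under these assumptions the GRU uses three quarters of the parameters of an LSTM cell with the same input and hidden sizes, regardless of the sizes chosen.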

First, more efficient recurrent cells should be explored through explicit knowledge. The recurrent cells are the basic nodes, and the properties of the networks depend on recurrent cells to some extent. However, the current studies on recurrent cells are mainly empirical explorations. The handling capacity of a specific cell structure is ambiguous. Therefore, the relationship between the cell's handling capacity and data structure needs to be clarified.

Second, RNNs should be combined with external memory to strengthen their memory capacity, as in the NTM and the differentiable neural computer (Graves et al., 2016). Unlike the CNN and DNN, RNNs require memory capacity because they update the cell states from the inputs and the previous states. However, neural networks are not good at storing information. External memory could greatly increase the memory capacity of RNNs and help them handle long-term dependencies.

Finally, adopting adaptive computation time is an interesting direction. At present, the amount of computation in RNNs is the same at every time step. This poses no problem when tasks have the same degree of difficulty. However, more time should be spent on individual tasks of higher difficulty, and researchers should allow RNNs to execute multiple steps of computation during each time step, depending on the difficulty of the task. The self-delimiting (SLIM) RNN (Schmidhuber, 2012) is a typical example: it can decide during the computation which parts of itself are used at all, and for how long, learning to control time through a “halt unit.” In RNNs with adaptive computation time, the amount of computation may differ from one time step to the next, so they can handle tasks of varying difficulty well.
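The idea of spending a variable amount of computation per time step can be sketched as a loop that repeats an inner update until a halt unit signals that enough work has been done. The code below is a toy illustration only: it is not the SLIM RNN algorithm, the constants are hypothetical rather than learned, and the magnitude of the input merely stands in for task difficulty.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def adaptive_step(state, x, max_ponder=10, threshold=0.99):
    """Process one input x with a variable number of inner updates.

    Returns the new state and the number of inner updates used.
    The update rule and halt unit use fixed, hypothetical constants.
    """
    halt_total = 0.0
    steps = 0
    while steps < max_ponder and halt_total < threshold:
        state = math.tanh(0.5 * state + x)            # toy recurrent update
        halt_total += sigmoid(3.0 - 2.0 * abs(x))     # toy halt unit: small |x| halts sooner
        steps += 1
    return state, steps

_, easy_steps = adaptive_step(0.0, 0.1)  # "easy" input: halts after few updates
_, hard_steps = adaptive_step(0.0, 3.0)  # "hard" input: runs until the step budget
print(easy_steps, hard_steps)
```

The `max_ponder` budget bounds the cost of any single time step, which keeps the total computation finite even when the halt unit never accumulates enough evidence to stop.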

## Acknowledgments

We thank LetPub (www.letpub.com) for its linguistic assistance during the preparation of this review. This work was supported by the National Natural Science Foundation of China (grants 61773386, 61573365, 61573366, 61573076), the Young Elite Scientists Sponsorship Program of China Association for Science and Technology (grant 2016QNRC001), and the National Key R&D Program of China (grant 2018YFB1306100).
