## Abstract

Learning useful information across long time lags is a critical and difficult problem for temporal neural models in tasks such as language modeling. Existing architectures that address the issue are often complex and costly to train. The differential state framework (DSF) is a simple and high-performing design that unifies previously introduced gated neural models. DSF models maintain longer-term memory by learning to interpolate between a fast-changing, data-driven representation and a slowly changing, implicitly stable state. Within the DSF, a new architecture is presented, the Delta-RNN. This model requires hardly any more parameters than a classical, simple recurrent network. In language modeling at the word and character levels, the Delta-RNN outperforms popular complex architectures, such as the long short-term memory (LSTM) and the gated recurrent unit (GRU), and, when regularized, performs comparably to several state-of-the-art baselines. At the subword level, the Delta-RNN's performance is comparable to that of complex gated architectures.

## 1 Introduction

Recurrent neural networks are becoming increasingly popular models for sequential data. The simple recurrent neural network (RNN) architecture (Elman, 1990), however, is not suitable for capturing longer-distance dependencies. Architectures that address this shortcoming include the long short-term memory (LSTM; Hochreiter & Schmidhuber, 1997a), the gated recurrent unit (GRU; Chung, Gulcehre, Cho, & Bengio, 2014, 2015), and the structurally constrained recurrent network (SCRN; Mikolov, Joulin, Chopra, Mathieu, & Ranzato, 2014). While these can capture some longer-term patterns (20 to 50 words), their structural complexity makes their internal workings difficult to interpret. One exception is the SCRN architecture, which is simple to understand by design. Its analysis shows that the memory acquired by complex LSTM models on language tasks correlates strongly with simple weighted bags of words. This demystifies the abilities of the LSTM model to a degree: while some authors have suggested that the LSTM understands the language and even the thoughts being expressed in sentences (Choudhury, 2015), it is arguable whether this could be said about a model that performs equally well and is based on representations that are essentially equivalent to a bag of words.

One property of recurrent architectures that allows for the formation of longer-term memory is the self-connectedness of the basic units. This is most explicitly shown in the SCRN model, where one hidden layer contains neurons that do not have other recurrent connections except to themselves. Still, this architecture has several drawbacks; one has to choose the size of the fully connected and self-connected recurrent layers, and the model is not capable of modeling nonlinearities in the longer-term memory component.

In this work, we aim to increase representational efficiency: the ratio of performance to acquired parameters. We simplify the general neural model architecture further and develop several variants under the differential state framework, where the hidden layer state of the next time step is a function of its current state and the delta change computed by the model. We do not present the differential state framework as a model of human memory for language. However, we point out its conceptual origins in surprisal theory (Boston, Hale, Kliegl, Patil, & Vasishth, 2008; Hale, 2001; Levy, 2008), which posits that the human language processor develops complex expectations of future words, phrases, and syntactic choices and that these expectations, and deviations from them (surprisal), guide language processing (e.g., in reading comprehension). How complex the expectation-forming models in the human language processor are remains an open question. The cognitive literature has approached this question with existing parsing algorithms, probabilistic context-free grammars, and n-gram language models. We take a connectionist perspective. The differential state framework proposes not just to generatively develop expectations and compare them with the actual state changes caused by observing new input; it explicitly maintains gates as a form of high-level error correction and interpolation. An instantiation of this framework, the Delta-RNN, will be evaluated as a language model; we will not attempt to simulate human performance, such as in situations with garden-path sentences that must be reanalyzed because of a costly initial misanalysis.

## 2 The Differential State Framework and the Delta-RNN

In this section, we describe the proposed differential state framework (DSF), as well as several concrete implementations one can derive from it.

### 2.1 General Framework

The state of the model may be as simple as one layer of hidden units, or it could even include information such as decoupled memory, and in general it will be updated as symbols are iteratively processed. We define the inner function to be any, possibly complicated, function that maps the previous hidden state and the currently encountered data point (e.g., a word, subword, or character token) to a real-valued vector of fixed dimension using its own set of parameters. The outer function, on the other hand, is defined to use a separate set of parameters to integrate the fast state, as calculated by the inner function, with the slowly moving, currently untransformed state. In the sections that follow, we describe simple formulations of these two core functions; in section 3, we show how currently popular architectures, like the LSTM and various simplifications, are instantiations of this framework.

The specific structure of equation 2.1 was chosen because we hypothesize that the reason behind the success of gated neural architectures is largely that they treat next-step prediction tasks, like language modeling, as an interaction between two functions. One inner function focuses on integrating observed samples with a current filtration to create a new data-dependent hidden representation (or state “proposal”), while an outer function focuses on computing the difference, or “delta,” between the impression of the subsequence observed so far and the newly formed impression. For example, as a sentence is iteratively processed, there might not be much new information (or “surprisal”) in a token's mapped hidden representation (especially if it is a frequently encountered token), thus requiring less change to the iteratively inferred global representation of the sentence.^{2} However, encountering a new or rare token (especially an unexpected one) might bias the outer function to allow the newly formed hidden impression to more strongly influence the overall impression of the sentence, which will be useful when predicting what token or symbol will come next. In section 5, we present a small demonstration using one of the trained word models to illustrate this intuition.
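To make the two-function interaction concrete, here is a minimal NumPy sketch of a differential-state step. The function names, the Elman-style inner proposal, and the purely bias-driven gate are illustrative assumptions rather than the paper's exact equations:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def delta_rnn_step(h_prev, x, W, U, b, b_r):
    """One step of a minimal differential-state cell.

    Inner function: a simple Elman-style proposal computed from the
    current input and the previous state.  Outer function: a gate r
    interpolates between the slow state h_prev and the fast proposal;
    near 1 it preserves the old state, near 0 it accepts the new one.
    """
    z = np.tanh(W @ x + U @ h_prev + b)   # fast, data-driven proposal
    r = sigmoid(b_r)                      # interpolation gate
    return r * h_prev + (1.0 - r) * z

rng = np.random.default_rng(0)
H, V = 4, 6
W = rng.normal(0, 0.1, (H, V)); U = rng.normal(0, 0.1, (H, H))
b = np.zeros(H); b_r = np.zeros(H)       # gate = 0.5 everywhere at init

h = np.zeros(H)
for _ in range(3):                        # process a toy sequence
    x = rng.normal(size=V)
    h = delta_rnn_step(h, x, W, U, b, b_r)
```

With a strongly positive gate bias, the state barely changes even when new input arrives, which is the mechanism the framework relies on for maintaining longer-term memory.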

In the subsections that follow, we describe the ways we chose to formulate the inner and outer functions in the experiments of our study. The process we followed for developing their concrete implementations involved starting from the simplest possible form, using the fewest (if any) possible parameters to compose each function, and testing it in preliminary experiments to verify its usefulness.

It is important to note that equation 2.1 is still general enough to allow for the future design of more clever or efficient functions that might improve the performance and long-term memory capabilities of the framework. More important, one might view the parameters that the inner function uses as possibly encapsulating structures that can store explicit memory vectors, as is the case in stack-based RNNs (Das, Giles, & Sun, 1992; Joulin & Mikolov, 2015) or linked-list-based RNNs (Joulin & Mikolov, 2015).

### 2.2 Forms of the Outer Function

Keeping the inner function as general as possible, here we describe several ways one could design the outer function, which decides how new and old hidden representations will be combined at each time step. We strive to introduce as few additional parameters as possible, and experimental results will confirm the effectiveness of our simple designs.

Note that we define ⊙ to be the Hadamard (elementwise) product. Incorporating this interpolation mechanism can be interpreted as giving the DSF model a flexible way of mixing various dimensions of its longer-term memory with its more localized memory. Interpolation, especially through a simple gating mechanism, can be an effective way to allow the model to learn how to turn latent dimensions on or off, potentially yielding improved generalization performance, as was empirically shown by Serban, Ororbia, Pineau, and Courville (2016).

Figure 1 depicts the architecture using the simple late-integration mechanism.

### 2.3 Forms of the Inner Function: Instantiating the Delta-RNN

When a concrete form of the inner function is chosen, we can fully specify an instantiation of the DSF. We will also show, in section 3, how many other commonly used RNN architectures can, in fact, be treated as special cases of this general framework defined under equation 2.1.

A simple choice for the outer function yields the late-integration state model, in which the output activation could be any choice of function, including the identity. This form of interpolation allows for a more direct error propagation pathway, since gradient information, once transmitted through the interpolation gate, has two routes: one through the nonlinearity of the local state and one composed of implicit identity connections.

When the inner function is a simple Elman RNN, we have essentially described a first-order Delta-RNN. However, historically, second-order recurrent neural architectures have been shown to be powerful models in tasks such as grammatical inference (Giles et al., 1991) and noisy time series prediction (Giles, Lawrence, & Tsoi, 2001), as well as incredibly useful in rule extraction when treated as finite-state automata (Giles et al., 1992; Goudreau, Giles, Chakradhar, & Chen, 1994). Very recently, Wu, Zhang, Zhang, Bengio, and Salakhutdinov (2016) showed that a gating effect between the state-driven and data-driven components of a layer's preactivations facilitates better propagation of gradient signals than the usual linear combination. A second-order version of the inner function is therefore highly desirable, not only because it further mitigates the vanishing gradient problem that plagues backpropagation through time (used in calculating the parameter gradients of neural architectures), but also because this form introduces negligibly few additional parameters. We do note that the second-order form we use, as in Wu et al. (2016), is a rank-1 approximation of the actual tensor used in Giles et al. (1992) and Goudreau et al. (1994).
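As a minimal sketch of such a rank-1 second-order preactivation (the names, shapes, and gain vectors are our own illustrative choices, not the paper's notation):

```python
import numpy as np

def second_order_proposal(h_prev, x, W, U, alpha, beta1, beta2, b):
    """Rank-1 second-order (multiplicative) preactivation in the style
    of Wu et al. (2016): the data-driven term W x and the state-driven
    term U h gate each other elementwise rather than merely summing.
    The extra parameters are only a few H-sized vectors."""
    wx = W @ x
    uh = U @ h_prev
    return np.tanh(alpha * wx * uh + beta1 * uh + beta2 * wx + b)

rng = np.random.default_rng(1)
H, V = 4, 6
W = rng.normal(size=(H, V)); U = rng.normal(size=(H, H))
x = rng.normal(size=V); h = rng.normal(size=H)
z = second_order_proposal(h, x, W, U, np.ones(H), np.ones(H), np.ones(H), np.zeros(H))
```

Setting the multiplicative gain `alpha` to zero and the additive gains to one recovers the ordinary first-order Elman preactivation, which makes the "negligibly few additional parameters" claim easy to see.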

Assuming a single hidden layer language model with H hidden units and V input units (where V corresponds to the cardinality of the symbol dictionary), a full late-integration Delta-RNN that employs a second-order inner function (see equation 2.11) requires only marginally more parameters^{4} than a classical RNN. This stands in stark contrast to the sheer number of parameters required to train commonly used complex architectures such as the LSTM (with peephole connections) and the GRU.
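For intuition, the rough recurrent-layer parameter counts can be tallied as below. The Elman, GRU, and peephole-LSTM formulas are standard; the Delta-RNN count is our own estimate (the paper's exact expression is not reproduced here), assuming a second-order inner function with a handful of extra H-sized gain and bias vectors:

```python
def recurrent_param_counts(H, V):
    """Approximate recurrent-layer parameter counts for hidden size H
    and vocabulary size V (embedding and softmax layers excluded)."""
    elman = H * V + H * H + H               # W, U, bias
    delta = H * V + H * H + 6 * H           # W, U + a few H-sized vectors (assumed)
    gru = 3 * (H * V + H * H + H)           # update, reset, candidate blocks
    lstm = 4 * (H * V + H * H + H) + 3 * H  # four gate blocks + peepholes
    return {"Elman": elman, "Delta-RNN": delta, "GRU": gru, "LSTM": lstm}

counts = recurrent_param_counts(H=100, V=10000)
```

At these sizes the Delta-RNN sits within a few hundred parameters of the plain Elman network, while the GRU and LSTM require roughly three to four times as many.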

### 2.4 Regularizing the Delta-RNN

Regularization is often important when training large, overparameterized models. To control for overfitting, approaches range from structural modifications to the imposition of priors over parameters (Neal, 2012). Commonly employed modern approaches include dropout (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014) and its variations (Gal & Ghahramani, 2016), as well as mechanisms to control internal covariate shift, such as batch normalization (Ioffe & Szegedy, 2015) for large feedforward architectures. In this letter, we investigate the effect that dropout has on the Delta-RNN's performance.^{5}

### 2.5 Learning under the Delta-RNN

To learn parameters for any of our models, we optimize with respect to the sequence negative log likelihood. Model parameters of the Delta-RNN are learned under an empirical risk minimization framework. We employ backpropagation of errors (or rather, reverse-mode automatic differentiation with respect to this negative log-likelihood objective) to calculate gradients and update the parameters using the method of steepest gradient descent. For all experiments conducted in this letter, we found that the ADAM adaptive learning rate scheme (Kingma & Ba, 2014), followed by a Polyak average (Polyak & Juditsky, 1992) for the subword experiments, yielded the most consistent and near-optimal performance. We therefore use this setup for the optimization of parameters for all models (including baselines) unless otherwise mentioned. For all experiments, we unroll computation graphs a fixed number of steps in time (a length that varies across experiments and tasks) and, in order to approximate full backpropagation through time, we carry over the last hidden state from the previous mini-batch (within a full sequence). More important, we found that by using the derivative of the loss with respect to the last hidden state, we can improve this approximation and thus perform one step of iterative inference to update the last hidden state carried over.
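The one-step refinement of the carried-over state can be sketched as follows. This is a toy illustration with an analytic gradient; the step size `eta` and the quadratic loss are our own illustrative choices, not the paper's settings:

```python
import numpy as np

def refine_carried_state(h_last, grad_h_last, eta=0.1):
    """One step of iterative inference on the state carried across
    truncated-BPTT windows: nudge the last hidden state against the
    gradient of the loss taken with respect to it, before using it
    to seed the next window."""
    return h_last - eta * grad_h_last

# Toy check: for L(h) = 0.5 * ||h - t||^2 the gradient is h - t,
# so the refined state moves toward the target t.
t = np.array([1.0, -1.0])
h_last = np.array([0.0, 0.0])
h_refined = refine_carried_state(h_last, h_last - t, eta=0.1)
```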

We ultimately used this improved approximation for the subword models (since, in those experiments, we could directly train all baseline and proposed models in a controlled, identical fashion to ensure a fair comparison).

For all Delta-RNNs experimented with in this letter, the output activation of the inner function was chosen to be the hyperbolic tangent. The output activation of the outer function was set to be the identity for the word and character benchmark experiments and the hyperbolic tangent for the subword experiments (these decisions were made based on preliminary experimentation on subsets of the final training data). The exact configuration used in this letter employs the late-integration form, either the unregularized (see equation 2.9) or the dropout-regularized (see equation 2.12) variant, for the outer function, together with the second-order inner function of equation 2.11.

We compare our proposed models against a wide variety of unregularized baselines, as well as several state-of-the-art regularized baselines, in the benchmark experiments. These baselines include the LSTM, GRU, and SCRN, as well as computationally more efficient formulations such as the minimal gated unit (MGU). The goal is to see whether our proposed Delta-RNN is a suitable replacement for complex gated architectures and can capture longer-term patterns in sequential text data.

## 3 Related Work: Recovering Previous Models

A contribution of this work is that our general framework, presented in section 2.1, offers a way to unify previous proposals for gated neural architectures (especially for use in next-step prediction tasks like language modeling) and to explore directions for improvement. Since we will ultimately compare our proposed Delta-RNN of section 2.3 to these architectures, we next show how to derive several key architectures from our general form, such as the gated recurrent unit and the long short-term memory. More important, we introduce them in the same notation and design as the Delta-RNN and highlight the differences between previous work and our own through the lens of the inner and outer functions.

Simple models, largely based on the original Elman RNN (Elman, 1990), have often been shown to perform quite well in language modeling tasks (Mikolov, Karafiát, Burget, Černocký, & Khudanpur, 2010; Mikolov, Kombrink, Burget, Černocký, & Khudanpur, 2011). The structurally constrained recurrent network (SCRN: Mikolov et al., 2014), an important predecessor and inspiration for this work, showed that one fruitful path to learning longer-term dependencies was to impose a hard constraint on how quickly the values of hidden units could change, yielding more “stable” long-term memory. The SCRN itself is very similar to a combination of the RNN architectures of Jordan (1990) and Mozer (1993). The key element of its design is the constraint that part of the recurrent weight matrix must stay close to the identity, a constraint that is also satisfied by the Delta-RNN. These identity connections (and corresponding context units that use them) allow for improved information travel over many time steps and can even be viewed as an exponential trace memory (Mozer, 1993). Residual networks, though feedforward in nature, also share a similar motivation (He, Zhang, Ren, & Sun, 2016). Unlike the SCRN, the proposed Delta-RNN does not require a separation of the slow- and fast-moving units, but instead models this slower timescale through implicitly stable states.

A final related, but important, strand of work uses depth (i.e., number of processing layers) to directly model various timescales, as emulated in models such as the hierarchical/multiresolutional recurrent neural network (HM-RNN) (Chung, Ahn, & Bengio, 2016). Since the Delta-RNN is designed to allow its interpolation gate to be driven by the data, it is possible that the model might already be learning how to make use of boundary information (word boundaries at the character or subword level; sentence boundaries as marked by punctuation at the word level). The HM-RNN, however, more directly attacks this problem by modifying an LSTM to learn how to manipulate its states when certain types of symbols are encountered. (This is different from models like the clockwork RNN that require explicit boundary information; Koutnik, Greff, Gomez, & Schmidhuber, 2014.) One way to take advantage of the ideas behind the HM-RNN would be to manipulate the DSF to incorporate the explicit modeling of timescales through layer depth (each layer is responsible for modeling a different timescale). Furthermore, it would be worth investigating how the HM-RNN's performance would change when built from modifying a Delta-RNN instead of an LSTM.

## 4 Experimental Results

Language modeling is an important next-step prediction task, with downstream applications in speech recognition, parsing, and information retrieval. As such, we focus this letter on experiments in this task domain to gauge the efficacy of our Delta-RNN framework, noting that the framework might also prove useful in, for instance, machine translation (Bahdanau, Cho, & Bengio, 2014) or light chunking (Turian, Bergstra, & Bengio, 2009). Beyond improving language modeling performance, the sentence (and document) representations iteratively inferred by our architectures might also prove useful in composing higher-level representations of text corpora, a subject we will investigate in future work.

### 4.1 Data Sets

#### 4.1.1 Penn Treebank Corpus

The Penn Treebank corpus (Marcus, Marcinkiewicz, & Santorini, 1993) is often used to benchmark both word- and character-level models via perplexity or bits per character, and thus we start here.^{8} The corpus contains 42,068 sentences (971,657 tokens; average token length of about 4.727 characters) of varying length (the range is from 3 to 84 tokens at the word level).

#### 4.1.2 IMDB Corpus

The large IMDB sentiment analysis corpus (Maas et al., 2011) is often used to benchmark algorithms for predicting the positive or negative tonality of documents. However, we opt to use this large corpus (the training split consists of 149,714 documents; 1,875,523 sentences; 40,765,697 tokens with an average token length of about 3.43 characters) to evaluate our proposed Delta-RNN as a (subword) language model. The IMDB data set serves as a case in which the context extends beyond the sentence level, in the form of actual documents.

### 4.2 Word- and Character-Level Benchmarks

The first set of experiments compares our proposed Delta-RNN models against reported state-of-the-art results on traditional word- and character-level language modeling tasks, where we measure the per-symbol perplexity of models. For the word-level models, we calculate the per-word perplexity (PPL) as $\mathrm{PPL} = \exp\big(-\frac{1}{N}\sum_{t=1}^{N} \ln p(w_t \mid w_{<t})\big)$. For the character-level models, we report the standard bits per character (BPC), which can be calculated from the log likelihood as $\mathrm{BPC} = -\frac{1}{N}\sum_{t=1}^{N} \log_2 p(c_t \mid c_{<t})$.
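As a sketch (assuming the summed negative log likelihood is accumulated in nats), the two metrics are straightforward to compute:

```python
import math

def perplexity(total_nll_nats, num_tokens):
    """Per-word perplexity from a summed negative log likelihood in nats."""
    return math.exp(total_nll_nats / num_tokens)

def bits_per_character(total_nll_nats, num_chars):
    """Bits per character: the same NLL, rescaled to base 2 and per symbol."""
    return total_nll_nats / (num_chars * math.log(2))
```

For example, a uniform model over a 10,000-word vocabulary yields a perplexity of exactly 10,000, and a uniform model over two characters yields 1 bit per character.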

Word-level models were trained over 100 epochs with mini-batches of 64 (padded) sequences. (Early stopping with a look-ahead of 10 was used.) Gradients were clipped using a simple magnitude-based scheme (Pascanu, Mikolov, & Bengio, 2013), with the magnitude threshold set to 5. A simple grid search was performed to tune the learning rate as well as the size of the hidden layer. Parameters (excluding biases) were initialized from zero-mean gaussian distributions with tuned variance.^{9} The character-level models were updated using mini-batches of 64 samples over 100 epochs. (Early stopping with a look-ahead of 10 was used.) The parameter initializations and the grid search for the learning rate and hidden layer size were the same as for the word models, with the exception of the hidden layer size, which was searched over its own range.^{10}

A simple learning rate decay schedule was employed: if the validation loss did not decrease after a single epoch, the learning rate was halved (unless a lower bound on the value had been reached). When dropout was applied to the Delta-RNN (*Delta-RNN-drop*), the probability of dropping a unit was tuned separately for the character-level and word-level models. We present results for both the unregularized and regularized versions of the models. For all of the Delta-RNNs, we furthermore experiment with two variations of dynamic evaluation, which facilitates fair comparison to compression algorithms, inspired by the improvements observed in Mikolov (2012). *Delta-RNN-drop, dynamic #1* refers to simply updating the model sample-by-sample after each evaluation, where, in this case, we update parameters using simple stochastic gradient descent (Mikolov, 2012) with a small, fixed step size. We develop a second variation of dynamic evaluation, *Delta-RNN-drop, dynamic #2*, where we allow the model to first iterate (and update) once over the validation set and then finally the test set, completely allowing the model to compress the Penn Treebank corpus. These two schemes are used for both the word- and character-level benchmarks. It is important to stress that the BPC and PPL measures reported for the dynamic models follow a strict “test-then-train” online paradigm, meaning that each next-step prediction is made before model parameters are updated.
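The test-then-train protocol above can be sketched with a toy stand-in model; the `ToyUnigram` class and its add-one smoothing are illustrative assumptions, not the paper's actual Delta-RNN update:

```python
import math

class ToyUnigram:
    """Hypothetical stand-in for a language model: an add-one-smoothed
    unigram over a small alphabet that can be adapted online."""
    def __init__(self, alphabet):
        self.counts = {a: 1 for a in alphabet}
    def score(self, symbol):
        total = sum(self.counts.values())
        return -math.log(self.counts[symbol] / total)  # NLL in nats
    def update(self, symbol):
        self.counts[symbol] += 1

def dynamic_evaluation(model, stream):
    """Strict test-then-train: each symbol is scored under the current
    parameters *before* the model is updated on it, so the reported
    loss never peeks at the answer."""
    total = 0.0
    for s in stream:
        total += model.score(s)  # predict first
        model.update(s)          # then adapt
    return total / max(len(stream), 1)

avg_nll = dynamic_evaluation(ToyUnigram("ab"), "a" * 20)
```

On a repetitive stream, the dynamically updated model achieves a lower average NLL than the same model held fixed, which is exactly the benefit dynamic evaluation is meant to capture.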

The standard vocabulary for the word-level models contains 10,000 unique words (including an unknown token for out-of-vocabulary symbols and an end-of-sequence token),^{11} and the standard vocabulary for the character-level models includes 49 unique characters (including a symbol for spaces). Results for the word-level and character-level models are reported in Table 1.

| Penn Treebank: Word Models | PPL |
|---|---|
| N-Gram (Mikolov et al., 2014) | 141 |
| NNLM (Mikolov, 2012) | 140.2 |
| N-Gram+cache (Mikolov et al., 2014) | 125 |
| RNN (Gulcehre et al., 2016) | 129 |
| RNN (Mikolov, 2012) | 124.7 |
| LSTM (Mikolov et al., 2014) | 115 |
| SCRN (Mikolov et al., 2014) | 115 |
| LSTM (Sundermeyer, 2016) | 107 |
| MI-RNN (Wu et al., 2016, our implementation) | 109.2 |
| Delta-RNN (present work) | 100.324 |
| Delta-RNN, dynamic #1 (present work) | 93.296 |
| Delta-RNN, dynamic #2 (present work) | 90.301 |
| LSTM-recurrent drop (Krueger et al., 2016) | 87.0 |
| NR-dropout (Zaremba et al., 2014) | 78.4 |
| V-dropout (Gal & Ghahramani, 2016) | 73.4 |
| Delta-RNN-drop, static (present work) | 84.088 |
| Delta-RNN-drop, dynamic #1 (present work) | 79.527 |
| Delta-RNN-drop, dynamic #2 (present work) | 78.029 |

| Penn Treebank: Character Models | BPC |
|---|---|
| N-discount N-gram (Mikolov et al., 2012) | 1.48 |
| RNN+stabilization (Krueger et al., 2016) | 1.48 |
| linear MI-RNN (Wu et al., 2016) | 1.48 |
| Clockwork RNN (Koutnik et al., 2014) | 1.46 |
| RNN (Mikolov et al., 2012) | 1.42 |
| GRU (Jernite, Grave, Joulin, & Mikolov, 2016) | 1.42 |
| HF-MRNN (Mikolov et al., 2012) | 1.41 |
| MI-RNN (Wu et al., 2016) | 1.39 |
| Max-Ent N-gram (Mikolov et al., 2012) | 1.37 |
| LSTM (Krueger et al., 2016) | 1.356 |
| Delta-RNN (present work) | 1.347 |
| Delta-RNN, dynamic #1 (present work) | 1.331 |
| Delta-RNN, dynamic #2 (present work) | 1.326 |
| LSTM-norm stabilizer (Krueger et al., 2016) | 1.352 |
| LSTM-weight noise (Krueger et al., 2016) | 1.344 |
| LSTM-stochastic depth (Krueger et al., 2016) | 1.343 |
| LSTM-recurrent drop (Krueger et al., 2016) | 1.286 |
| RBN (Cooijmans, Ballas, Laurent, Gülçehre, & Courville, 2016) | 1.32 |
| LSTM-zone out (Krueger et al., 2016) | 1.252 |
| H-LSTM + LN (Ha, Dai, & Le, 2016) | 1.25 |
| TARDIS (Gulcehre et al., 2017) | 1.25 |
| 3-HM-LSTM + LN (Chung et al., 2016) | 1.24 |
| Delta-RNN-drop, static (present work) | 1.251 |
| Delta-RNN-drop, dynamic #1 (present work) | 1.247 |
| Delta-RNN-drop, dynamic #2 (present work) | 1.245 |


### 4.3 Subword Language Modeling

We chose to measure the negative log likelihood of the various architectures in the task of subword modeling. Subwords are particularly appealing not only in that the input distribution is of lower dimensionality but, as evidenced by the positive results of Mikolov et al. (2012), subword/character hybrid language models improve over the performance of pure character-level models. Subword models also enjoy the advantage held by character-level models when it comes to handling out-of-vocabulary words, avoiding the need for an “unknown” token. Research in psycholinguistics has long suggested that even human infants are sensitive to word boundaries at an early stage (e.g., Aslin, Saffran, & Newport, 1998), and that morphologically complex words enjoy dedicated processing mechanisms (Baayen & Schreuder, 2006). Subword-level language models may approximate such an architecture. Consistency in subword formation is critical in order to obtain meaningful results (Mikolov et al., 2012). Thus, we design our subword algorithm to partition a word according to the following scheme:

1. Split on vowels (using a predefined list).

2. Link or merge each vowel with a consonant to the immediate right, if applicable.

3. Merge straggling single characters to the subword on the immediate right unless a subword of shorter character length is to the left.

This simple partitioning scheme was designed to ensure that no subword is shorter than two characters. Future work will entail designing a more realistic subword partitioning algorithm. Subwords below a certain frequency were discarded and combined with 26 single characters to create the final dictionary. For Penn Treebank, this yields a vocabulary of 2,405 symbols (2,378 subwords + 26 characters + 1 end token). For the IMDB corpus, after replacing all emoticons and special nonword symbols with special tokens, we obtain a dictionary of 1,926 symbols (1,899 subwords + 26 single characters + 1 end token). Results for all subword models are reported in Table 2.

| PTB-SW | Number of Parameters | NLL | IMDB-SW | Number of Parameters | NLL |
|---|---|---|---|---|---|
| RNN | 1,272,464 | 1.8939 | RNN | 499,176 | 2.1691 |
| SCRN | 1,268,604 | 1.8420 | SCRN | 496,196 | 2.2370 |
| MGU | 1,278,692 | 1.8694 | MGU | 495,444 | 2.1312 |
| MI-RNN | 1,267,904 | 1.8441 | MI-RNN | 495,446 | 2.1741 |
| GRU | 1,272,404 | 1.8251 | GRU | 499,374 | 2.1551 |
| LSTM | 1,274,804 | 1.8412 | LSTM | 503,664 | 2.2080 |
| Delta-RNN | 1,268,154 | 1.8260 | Delta-RNN | 495,570 | 2.1333 |


Note: Subword modeling tasks on Penn Treebank and IMDB.

Specifically, we test our implementations of the LSTM^{12} (with peephole connections as described in Graves, 2013), the GRU, the MGU, the SCRN, and a classical Elman network of both first and second order (Giles et al., 1991; Wu et al., 2016).^{13} Subword models were trained in a similar fashion to the character-level models, updated (every 50 steps) using mini-batches of 20 samples but over 30 epochs. Learning rates were tuned in the same fashion as for the word-level models, and the same parameter initialization schemes were explored. The notable difference between this experiment and the previous ones is that we fix the number of parameters for each model to be equivalent to that of an LSTM with 100 hidden units for PTB and 50 hidden units for IMDB. This ensures a controlled, fair comparison across models and allows us to evaluate whether the Delta-RNN can learn similarly to models with more complicated processing elements (an LSTM cell versus a GRU cell versus a Delta-RNN unit). Furthermore, this allows us to measure parameter efficiency, where we can focus on the value of specific cell types (e.g., comparing a much more complex LSTM memory unit with a simple Delta-RNN cell) when the number of parameters is held roughly constant. We are currently running larger versions of the models depicted in Table 2 to determine whether the results hold at scale.

## 5 Discussion

With respect to the word- and character-level benchmarks, we see that the Delta-RNN outperforms all previous unregularized models and performs comparably to the regularized state of the art. We further trained a second-order, word-level RNN (MI-RNN; see Table 1) to complete the comparison and note that the second-order connections appear to be quite useful in general, outperforming the SCRN and coming close to the LSTM. This extends the results of Wu et al. (2016) to the word level. However, the Delta-RNN, which also makes use of second-order units within its inner function, ultimately offers the best performance and performs better than the LSTM in all experiments. In both the Penn Treebank and IMDB subword language modeling experiments, the Delta-RNN is competitive with complex architectures such as the GRU and the MGU. In both cases, the Delta-RNN nearly reaches the performance of the best baseline model on either data set (the GRU on Penn Treebank and the MGU on IMDB). Surprisingly, on IMDB, a simple Elman network performs quite well, even outperforming the MI-RNN. We argue that this might be the result of constraining all neural architectures to only a small number of parameters for such a large data set, a constraint we intend to relax in future work.

The Delta-RNN is far more efficient than a complex LSTM, and certainly more so than a memory-augmented network like TARDIS (Gulcehre, Chandar, & Bengio, 2017). Moreover, it appears to learn how to make appropriate use of its interpolation mechanism to decide how and when to update its hidden state in the presence of new data.^{14} Given our derivations in section 3, one could argue that nearly all previously proposed gated neural architectures are essentially trying to do the same thing under the DSF. The key advantage offered by the Delta-RNN is that this functionality is offered directly and cheaply (in terms of required parameters).

It is important to contrast these (unregularized) results with those that use some form of regularization. Zaremba, Sutskever, and Vinyals (2014) reported that a single LSTM (for word-level Penn Treebank) can reach a PPL of about 80, but this was achieved via dropout regularization (Srivastava et al., 2014). There is a strong relationship between using dropout and training an ensemble of models; thus, one can argue that a single model trained with dropout is not actually a single model but an implicit ensemble (see also Srivastava et al., 2014). An ensemble of 20 simple RNNs and cache models previously reached a PPL as low as 72, while a single RNN model gives only 124 (Mikolov, 2012). Zaremba et al. (2014) trained an ensemble of 38 LSTMs regularized with dropout, each with 100 times more parameters than the RNNs used by Mikolov (2012), achieving a PPL of 68. This is arguably a small improvement over 72 and seems to strengthen our claim that dropout is an implicit model ensemble and thus should not be used when one wants to report the performance of a single model. However, the Delta-RNN is amenable to regularization, including dropout. As our results show, when simple dropout is applied, the Delta-RNN can reach much lower perplexities, approaching the state of the art achieved by much larger models, especially when dynamic evaluation is permitted. This even extends to very complex architectures, such as the recently proposed TARDIS, a memory-augmented network (when dynamic evaluation is used, the simple Delta-RNN can outperform this complex model). Though we investigate only simple dropout in this letter, our comparative results suggest that more sophisticated variants, such as variational dropout (Gal & Ghahramani, 2016), could yield further improvements in performance.

What is the lesson to be learned from the DSF? First and foremost, we can obtain strong performance in language modeling with a simpler, more parameter-efficient, and thus faster architecture. Second, the Delta-RNN is designed around the interpretation that the computation of the next hidden state is the composition of two functions: an inner function that "proposes" a new hidden state, and an outer function that decides how to use this proposal in updating the previously calculated state. The data-driven interpolation mechanism lets the model decide how much impact the newly proposed state has in updating what is likely to be a slowly changing representation. The SCRN, which could be viewed as the predecessor of the Delta-RNN framework, was designed with the idea that some constrained units could serve as a sort of cache meant to capture longer-term dependencies. Like the SCRN, the Delta-RNN is designed to help mitigate the problem of vanishing gradients and, through the interpolation mechanism, has multiple pathways through which the gradient can be carried, boosting the error signal's longevity down the propagation path through time. However, the SCRN combines its slow-moving and fast-changing hidden states through a simple summation; it therefore cannot model nonlinear interactions between its shorter- and longer-term memories, and it requires tuning the sizes of these separate layers. The Delta-RNN, in contrast, requires no additional hidden layer to tune and can nonlinearly combine the two types of state in a data-dependent fashion, possibly allowing the model to exploit boundary information from text, which is quite powerful in the case of documents. The key intuition is that the gating mechanism allows the state proposal to affect the maintained memory state only if the currently observed data point carries useful information. This warrants an (albeit indirect) comparison to surprisal theory: the surprisal of each new observation proves useful in iteratively forming a sentence impression that helps to better predict the words that come later.
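The two-function view above can be sketched numerically. The exact parameterization of the Delta-RNN (its second-order inner function and the precise form of the gate) follows equations not reproduced in this excerpt, so the cell below, with a data-driven sigmoid gate interpolating between the previous state and a tanh proposal, is an illustrative assumption rather than the letter's exact model; all variable names are ours.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DeltaRNNCell:
    """Sketch of a Delta-RNN-style cell: an inner function proposes a new
    state; an outer function interpolates it with the previous state."""

    def __init__(self, n_in, n_hid, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0.0, 0.1, (n_hid, n_in))   # input-to-hidden
        self.V = rng.normal(0.0, 0.1, (n_hid, n_hid))  # hidden-to-hidden
        self.b = np.zeros(n_hid)                       # proposal bias
        self.br = np.zeros(n_hid)                      # gate bias

    def step(self, x, h_prev):
        # Inner function: propose a new state from the data and prior state.
        z = np.tanh(self.W @ x + self.V @ h_prev + self.b)
        # Outer function: data-driven interpolation gate.
        r = sigmoid(self.W @ x + self.br)
        # Differential update: move away from h_prev only as much as r allows.
        return (1.0 - r) * h_prev + r * z
```

When the gate saturates near zero, the previous state is carried over essentially unchanged, which is the mechanism the discussion credits with preserving longer-term memory and easing gradient flow.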

With respect to the last point, we briefly examine the evolution of a trained Delta-RNN's hidden state across several sample sentences. The first two sentences are hand-crafted (constrained to use only the vocabulary of Penn Treebank), and the last one is sampled from the Penn Treebank training split. Since the Delta-RNN iteratively processes the symbols of an ordered sequence, we measure the L1 norm between consecutive pairs of hidden states. We report the (min-max) normalized L1 scores^{15} in Figure 2 and observe that, in accordance with our intuition, the L1 norm is lower (indicating a smaller delta) for high-frequency words such as *the*, *of*, or *is*, which are generally less informative about the subject of a sentence or document. As this qualitative demonstration illustrates, the Delta-RNN appears to learn what to do with its internal state in the presence of symbols of variable information content.
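The state-change measurement described above (and in note 15) can be sketched as follows; the function name is ours.

```python
import numpy as np

def normalized_state_deltas(states):
    """Given hidden states h_0..h_T (h_0 computed for the null/start token),
    return the min-max-normalized L1 distances between consecutive states."""
    states = np.asarray(states, dtype=float)
    # L1 (Manhattan) distance between each contiguous pair of state vectors;
    # the start token itself contributes no score.
    d = np.abs(np.diff(states, axis=0)).sum(axis=1)
    # Min-max normalization to [0, 1].
    return (d - d.min()) / (d.max() - d.min())
```

A flat stretch of the resulting curve marks tokens that barely perturbed the hidden state, which is how the low scores for function words like *the* or *of* show up in Figure 2.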

## 6 Conclusion

We present the differential state framework, which affords a useful perspective on computation in recurrent neural networks. Instead of recomputing the whole state from scratch at every time step, the Delta-RNN learns only how to update its current state. This seems better suited to many types of problems, especially those involving longer-term patterns in which part of the recurrent network's state should remain constant most of the time. Comparison with the currently popular LSTM and GRU architectures shows that the Delta-RNN can achieve similar or better performance on language modeling tasks while being conceptually much simpler and requiring far fewer parameters. Comparison with the structurally constrained recurrent network (SCRN), which shares many of the same ideas and motivations, shows better accuracy and a simpler model architecture (the SCRN requires tuning the sizes of two separate hidden layers and cannot learn nonlinear interactions within its longer-term memory).

Future work includes larger-scale language modeling experiments to test the efficacy of the Delta-RNN, as well as architectural variants that employ decoupled memory. Since the Delta-RNN can be stacked just like any other neural architecture, we intend to investigate whether depth (in terms of hidden layers) proves useful on larger-scale data sets. In addition, we intend to explore how useful the Delta-RNN might be in other tasks in which architectures such as the LSTM currently hold state-of-the-art performance. Finally, it would be useful to explore whether the Delta-RNN's simpler, faster design can speed up grander architectures, such as the differentiable neural computer (Graves et al., 2016), which is largely composed of multiple LSTM modules.

## Appendix: Layer Normalized Delta-RNNs

In this appendix, we describe how layer normalization would be applied to a Delta-RNN. Our preliminary experiments did not find that layer normalization gave much improvement over dropout, though this was observed only on the Penn Treebank benchmark. Future work will investigate the benefits of layer normalization over dropout (as well as model ensembling) on larger-scale benchmarks.
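The appendix's exact formulation is not reproduced in this excerpt; as a generic sketch, layer normalization (Ba et al., 2016) rescales a pre-activation vector to zero mean and unit variance before applying a learned gain and bias, and it would be inserted ahead of the nonlinearities in the Delta-RNN's inner and outer functions. The names below are ours.

```python
import numpy as np

def layer_norm(a, gain, bias, eps=1e-5):
    # Normalize a pre-activation vector to zero mean and unit variance,
    # then rescale with learned gain and bias (Ba, Kiros, & Hinton, 2016).
    mu = a.mean()
    sigma = a.std()
    return gain * (a - mu) / (sigma + eps) + bias
```

In a recurrent setting, the statistics are computed per time step over the hidden dimension, so the normalization does not mix information across the sequence.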

## Notes

^{1}

refers to the “cell state” as in Hochreiter and Schmidhuber (1997b).

^{2}

One way to extract a sentence representation from a temporal neural language model would be simply to take the last hidden state calculated upon reaching a symbol such as punctuation (e.g., a period or exclamation point). This is sometimes referred to as encoding variable-length sentences or paragraphs to a real-valued vector of fixed dimensionality.

^{3}

Late-integration might remind readers of the phrase “late fusion,” as in the context of Wang and Cho (2015). However, they focused on merging the information from an external bag-of-words context vector with the standard cell state of the LSTM.

^{4}

counts the hidden bias, the full interpolation mechanism (see equation 2.5), and the second-order biases, .

^{5}

In preliminary experiments, we also investigated incorporating layer normalization (Ba, Kiros, & Hinton, 2016) into the Delta-RNN architecture; the details are in the appendix. We did not observe noticeable gains using layer normalization over dropout and thus report only the results of dropout in this letter.

^{6}

Note that the bias term has been omitted for clarity.

^{7}

We searched the step-size over the values for all experiments in this letter.

^{8}

To be directly comparable with previously reported results, we make use of the specific preprocessed train/valid/test splits found at http://www.fit.vutbr.cz/imikolov/rnnlm/.

^{9}

We also experimented with other initializations, most notably the identity matrix for the recurrent weight parameters as in Le, Jaitly, and Hinton (2015). We found that this initialization often worsened performance. For the activation functions of the first-order models, we experimented with the linear rectifier, the parameterized linear rectifier, and even our own proposed parameterized smoothened linear rectifier, but we found that such activations lead to less-than-satisfactory results. The results of this inquiry are documented in the code that will accompany this letter.

^{10}

Note that would yield nearly 4 million parameters, our upper bound on total number of parameters allowed for experiments in order to be commensurable with the work of Wu et al. (2016), who actually used for all Penn Treebank models.

^{11}

We use a special null token (or zero vector) to mark the start of a sequence.

^{12}

We experimented with initializing the forget gate biases of all LSTMs with values searched over since previous work has shown this can improve model performance.

^{13}

Code to build and train the architectures in this study can be found online at http://github.com/ago109/Delta-RNN-Theano.git.

^{15}

If we calculate the L1 norm, or Manhattan distance, for every contiguous pair of state vectors across a sequence of length T, with h_0 denoting the state calculated for the start or null token, we obtain a sequence of L1 measures d_1, ..., d_T (the L1 for the start token itself is simply excluded). Calculating the score for any d_t is then as simple as performing min-max normalization, or (d_t - min_k d_k) / (max_k d_k - min_k d_k).

## Acknowledgments

We thank C. Lee Giles and Prasenjit Mitra for their advice. We thank NVIDIA for providing GPU hardware that supported this letter. A.O. was funded by a NACME-Sloan scholarship; D.R. acknowledges funding from NSF IIS-1459300.