Abstract
The aim of this letter is to propose a theory of deep restricted kernel machines offering new foundations for deep learning with kernel machines. From the viewpoint of deep learning, it is partially related to restricted Boltzmann machines, which are characterized by visible and hidden units in a bipartite graph without hidden-to-hidden connections, and to deep learning extensions such as deep belief networks and deep Boltzmann machines. From the viewpoint of kernel machines, it includes least squares support vector machines for classification and regression, kernel principal component analysis (PCA), matrix singular value decomposition, and Parzen-type models. A key element is to first characterize these kernel machines in terms of so-called conjugate feature duality, yielding a representation with visible and hidden units. It is shown how this is related to the energy form in restricted Boltzmann machines, with continuous variables in a nonprobabilistic setting. In this new framework of so-called restricted kernel machine (RKM) representations, the dual variables correspond to hidden features. Deep RKMs are obtained by coupling the RKMs. The method is illustrated for a deep RKM consisting of three levels: a least squares support vector machine regression level and two kernel PCA levels. In its primal form, deep feedforward neural networks can also be trained within this framework.
1 Introduction
Deep learning has become an important method of choice in several research areas including computer vision, speech recognition, and language processing (LeCun, Bengio, & Hinton, 2015). Among the existing techniques in deep learning are deep belief networks, deep Boltzmann machines, convolutional neural networks, stacked autoencoders with pretraining and fine-tuning, and others (Bengio, 2009; Goodfellow, Bengio, & Courville, 2016; Hinton, 2005; Hinton, Osindero, & Teh, 2006; LeCun et al., 2015; Lee, Grosse, Ranganath, & Ng, 2009; Salakhutdinov, 2015; Schmidhuber, 2015; Srivastava & Salakhutdinov, 2014; Chen, Schwing, Yuille, & Urtasun, 2015; Jaderberg, Simonyan, Vedaldi, & Zisserman, 2014; Schwing & Urtasun, 2015; Zheng et al., 2015). Support vector machines (SVM) and kernel-based methods have made a large impact on a wide range of application fields, together with finding strong foundations in optimization and learning theory (Boser, Guyon, & Vapnik, 1992; Cortes & Vapnik, 1995; Rasmussen & Williams, 2006; Schölkopf & Smola, 2002; Suykens, Van Gestel, De Brabanter, De Moor, & Vandewalle, 2002; Vapnik, 1998; Wahba, 1990). Therefore, one can pose the question: Which synergies or common foundations could be developed between these different directions? There has already been exploration of such synergies— for example, in kernel methods for deep learning (Cho & Saul, 2009), deep gaussian processes (Damianou & Lawrence, 2013; Salakhutdinov & Hinton, 2007), convolutional kernel networks (Mairal, Koniusz, Harchaoui, & Schmid, 2014), multilayer support vector machines (Wiering & Schomaker, 2014), and mathematics of the neural response (Smale, Rosasco, Bouvrie, Caponnetto, & Poggio, 2010), among others.
In this letter, we present a new theory of deep restricted kernel machines (deep RKM), offering foundations for deep learning with kernel machines. It partially relates to restricted Boltzmann machines (RBMs), which are used within deep belief networks (Hinton, 2005; Hinton et al., 2006). In RBMs, one considers a specific type of Markov random field, characterized by a bipartite graph consisting of a layer of visible units and another layer of hidden units (Bengio, 2009; Fisher & Igel, 2014; Hinton et al., 2006; Salakhutdinov, 2015). In RBMs, which are related to harmoniums (Smolensky, 1986; Welling, Rosen-Zvi, & Hinton, 2004), there are no connections between the hidden units (Hinton, 2005), and often also no visible-to-visible connections. In deep belief networks, the hidden units of a layer are mapped to a next layer in order to create a deep architecture. In RBM, one considers stochastic binary variables (Ackley, Hinton, & Sejnowski, 1985; Hertz, Krogh, & Palmer, 1991), and extensions have been made to gaussian-Bernoulli variants (Salakhutdinov, 2015). Hopfield networks (Hopfield, 1982) take continuous values, and a class of Hamiltonian neural networks has been studied in DeWilde (1993). Also, discriminative RBMs have been studied where the class labels are considered at the level of visible units (Fisher & Igel, 2014; Larochelle & Bengio, 2008). In all of these methods the energy function plays an important role, as it also does in energy-based learning methods (LeCun, Chopra, Hadsell, Ranzato, & Huang, 2006).
Representation learning issues are considered to be important in deep learning (Bengio, Courville, & Vincent, 2013). The method proposed in this letter makes a link to restricted Boltzmann machines by characterizing several kernel machines by means of so-called conjugate feature duality. Duality is important in the context of support vector machines (Boser et al., 1992; Cortes & Vapnik, 1995; Vapnik, 1998; Suykens et al., 2002; Suykens, Alzate, & Pelckmans, 2010), optimization (Boyd & Vandenberghe, 2004; Rockafellar, 1987), and in mathematics and physics in general. Here we consider hidden features conjugated to part of the unknown variables. This part of the formulation is linked to a restricted Boltzmann machine energy expression, though with continuous variables in a nonprobabilistic setting. In this way, a model can be expressed in both its primal representation and its dual representation and give an interpretation in terms of visible and hidden units, in analogy with RBM. The primal representation contains the feature map, while the dual model representation is expressed in terms of the kernel function and the conjugated features.
The class of kernel machines discussed in this letter includes least squares support vector machines (LS-SVM) for classification and regression, kernel principal component analysis (kernel PCA), matrix singular value decomposition (matrix SVD), and Parzen-type models. These have been previously conceived within a primal and Lagrange dual setting in Suykens and Vandewalle (1999b), Suykens et al. (2002), Suykens, Van Gestel, Vandewalle, and De Moor (2003), and Suykens (2013, 2016). Other examples are kernel spectral clustering (Alzate & Suykens, 2010; Mall, Langone, & Suykens, 2014), kernel canonical correlation analysis (Suykens et al., 2002), and several others, which will not be addressed in this letter, but can be the subject of future work. In this letter, we give a different characterization for these models, based on a property of quadratic forms, which can be verified through the Schur complement form. The property relates to a specific case of Legendre-Fenchel duality (Rockafellar, 1987). Also note that in classical mechanics, converting a Lagrangian into Hamiltonian formulation is by Legendre transformation (Goldstein, Poole, & Safko, 2002).
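To make the quadratic-form property concrete, the following is a minimal sketch in illustrative notation (the symbols e, h, and λ are assumptions of this sketch, not necessarily the letter's notation): for any vectors e and h of equal dimension and any λ > 0, one has a Fenchel-type bound whose gap is a positive-semidefinite quadratic form, verifiable through a Schur complement.

```latex
% Minimal sketch (illustrative notation): a Fenchel-type bound for a quadratic form.
\frac{1}{2\lambda}\, e^{\top} e \;\geq\; e^{\top} h \;-\; \frac{\lambda}{2}\, h^{\top} h
\qquad \text{for all } h,\ \lambda > 0, \quad \text{with equality iff } h = e/\lambda .
% The gap between the two sides is a positive-semidefinite quadratic form,
\frac{1}{2}
\begin{bmatrix} e \\ h \end{bmatrix}^{\!\top}
\begin{bmatrix} \frac{1}{\lambda} I & -I \\[2pt] -I & \lambda I \end{bmatrix}
\begin{bmatrix} e \\ h \end{bmatrix} \;\geq\; 0 ,
% which follows from the Schur complement of the upper-left block being zero:
\lambda I - (-I)\Big(\tfrac{1}{\lambda} I\Big)^{-1}(-I) \;=\; 0 \;\succeq\; 0 .
```

Maximizing the right-hand side of the inequality over h recovers the left-hand side, which is the mechanism by which hidden features conjugated to part of the unknown variables enter the model representation.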
The kernel machines with conjugate feature representations are then used as building blocks to obtain the deep RKM by coupling the RKMs. The deep RKM becomes unrestricted after coupling the RKMs. The approach is explained for a model with three levels, consisting of two kernel PCA levels and a level with LS-SVM classification or regression. The conjugate features of level 1 are taken as input of level 2 and, subsequently, the features of level 2 as input for level 3. The objective of the deep RKM is the sum of the objectives of the RKMs in the different levels. The characterization of the stationary points leads to solving a set of nonlinear equations in the unknowns, which is computationally expensive. However, for the case of linear kernels in part of the levels, it reveals how kernel fusion takes place over the different levels. For this case, a heuristic algorithm with level-wise solving is obtained. For the general nonlinear case, a reduced-set algorithm with estimation in the primal is proposed.
In this letter, we make a distinction between levels and layers. We use the terminology of levels to indicate the depth of the model. The terminology of layers is used here in connection with the feature map. Suykens and Vandewalle (1999a) showed how a multilayer perceptron can be trained by a support vector machine method. It is done by defining the hidden layer to be equal to the feature map. In this way, the hidden layer is treated at the level of the feature map and the kernel parameters. Suykens et al. (2002) explained that in SVM and LS-SVM models, one can have a neural network interpretation in both the primal and the dual. The number of hidden units in the primal equals the dimension of the feature space, while in the dual representation, it equals the number of support vectors. In this way, it provides a setting to work with parametric models in the primal and kernel-based models in the dual. Therefore, we also illustrate in this letter how deep multilayer feedforward neural networks can be trained within the deep RKM framework. While in classical backpropagation (Rumelhart, Hinton, & Williams, 1986) one typically learns the model by specifying a single objective (unless, e.g., imposing additional stability constraints to obtain stable multilayer recurrent networks with dynamic backpropagation; Suykens, Vandewalle, & De Moor, 1995), in the deep RKM the objective function consists of the different objectives related to the different levels.
In summary, we aim at contributing to the following challenging questions in this letter:
- Can we find new synergies and foundations between SVM and kernel methods and deep learning architectures?
- Can we extend primal and dual model representations, as occurring in SVM and LS-SVM models, from shallow to deep architectures?
- Can we handle deep feedforward neural networks and deep kernel machines within a common setting?
In order to address these questions, this letter is organized as follows. Section 2 outlines the context of this letter with a brief introductory part on restricted Boltzmann machines, SVMs, LS-SVMs, kernel PCA, and SVD. In section 3 we explain how these kernel machines can be characterized by conjugate feature duality with visible and hidden units. In section 4 deep restricted kernel machines are explained for three levels: an LS-SVM regression level and two additional kernel PCA levels. In section 5, different algorithms are proposed for solving in either the primal or the dual, where the former will be related to deep feedforward neural networks and the latter to kernel-based models. Illustrations with numerical examples are given in section 6. Section 7 concludes the letter.
2 Preliminaries and Context
In this section, we explain basic principles of restricted Boltzmann machines, SVMs, LS-SVMs, and related formulations for kernel PCA and SVD. These are basic ingredients needed before introducing restricted kernel machines in section 3.
2.1 Restricted Boltzmann Machines
2.2 Least Squares Support Vector Machines and Related Kernel Machines
2.2.1 SVM and LS-SVM
2.2.2 Kernel PCA and Matrix SVD
3 Restricted Kernel Machines and Conjugate Feature Duality
3.1 LS-SVM Regression as a Restricted Kernel Machine: Linear Case
A training data set is assumed to be given with input data and output data (now with p outputs), where the data are assumed to be identically and independently distributed, drawn from an unknown but fixed underlying distribution P(x,y), a common assumption in statistical learning theory (Vapnik, 1998).
3.2 Nonlinear Case
As Suykens et al. (2002) explained, one can also give a neural network interpretation to both the primal and the dual representation, with a number of hidden units equal to the dimension of the feature space for the primal representation and the number of support vectors in the dual representation, respectively. For the case of a gaussian RBF kernel, one has a one-hidden-layer interpretation with an infinite number of hidden units in the primal, while in the dual, the number of hidden units equals the number of support vectors.
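As a concrete reference point for this primal/dual counting of hidden units, the following is a minimal sketch of classical (shallow) LS-SVM regression solved in the dual with an RBF kernel; the hyperparameter names gamma and sigma are illustrative. In the dual model, each support vector contributes one term, matching the "number of hidden units equals the number of support vectors" reading; the RKM representation of the same model is obtained through conjugate feature duality rather than Lagrange duality.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma):
    """Gaussian RBF kernel matrix with entries exp(-||x - z||^2 / (2 sigma^2))."""
    d2 = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-d2 / (2 * sigma**2))

def lssvm_regression_fit(X, y, gamma, sigma):
    """Solve the classical LS-SVM regression dual linear system for (alpha, b)."""
    N = X.shape[0]
    K = rbf_kernel(X, X, sigma)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0                       # [0  1^T; 1  K + I/gamma] [b; alpha] = [0; y]
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(N) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[1:], sol[0]               # alpha, b

def lssvm_predict(Xtest, X, alpha, b, sigma):
    """Dual model representation: y_hat(x) = sum_i alpha_i K(x, x_i) + b."""
    return rbf_kernel(Xtest, X, sigma) @ alpha + b
```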
3.3 Classifier Formulation
3.4 Kernel PCA
Here the number of hidden units equals s with and the visible units with , and .
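For orientation, a minimal sketch of the classical kernel PCA eigenvalue problem is given below (function and variable names are illustrative). In the RKM characterization of kernel PCA, the hidden features are likewise obtained from an eigenvalue problem involving the kernel matrix, up to regularization and scaling conventions.

```python
import numpy as np

def kernel_pca_features(K, s):
    """Classical kernel PCA: top-s eigenvectors of the centered kernel matrix.

    K : (N, N) kernel (Gram) matrix; s : number of components ("hidden units").
    Returns an (N, s) matrix whose columns are the selected eigenvectors,
    together with the corresponding eigenvalues.
    """
    N = K.shape[0]
    C = np.eye(N) - np.ones((N, N)) / N      # centering matrix
    Kc = C @ K @ C                           # centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(Kc)    # ascending eigenvalues
    idx = np.argsort(eigvals)[::-1][:s]
    return eigvecs[:, idx], eigvals[idx]
```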
3.5 Singular Value Decomposition
3.6 Kernel pmf
4 Deep Restricted Kernel Machines
In this section we couple different restricted kernel machines within a deep architecture. Several coupling configurations are possible at this point. We illustrate deep restricted kernel machines here for an architecture consisting of three levels. We discuss two configurations:
- Two kernel PCA levels followed by an LS-SVM regression level
- An LS-SVM regression level followed by two kernel PCA levels
In the first architecture, the first two levels extract features that are used within the last level for classification or regression. Related types of architectures are stacked autoencoders (Bengio, 2009), where a pretraining phase provides a good initialization for training the deep neural network in the fine-tuning phase. The deep RKM will consider an objective function jointly related to the kernel PCA feature extractions and the classification or regression. We explain how the insights of the RKM kernel PCA representations can be employed for combined supervised training and feature selection. A difference with other methods is also that conjugated features are used within the layered architecture.
In the second architecture, one starts with regression and then lets two kernel PCA levels further act on the residuals. In this case connections will be shown with deep Boltzmann machines (Salakhutdinov, 2015; Salakhutdinov & Hinton, 2009) when considering the special case of linear feature maps, though for the RKMs in a nonprobabilistic setting.
4.1 Two Kernel PCA Levels Followed by Regression Level
We focus here on a deep RKM architecture consisting of three levels:
- Level 1 consists of kernel PCA with given input data xi and is characterized by conjugated features.
- Level 2 consists of kernel PCA, taking the conjugated features of level 1 as input, and is characterized by its own conjugated features.
- Level 3 consists of LS-SVM regression on the conjugated features of level 2, with output data yi, and is characterized by conjugated features (a minimal sketch of this three-level data flow is given below).
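Under simplifying assumptions (greedy level-by-level computation, RBF kernels in levels 1 and 2, ridge regression at level 3, and illustrative names and hyperparameter values), the data flow can be sketched as follows. This only illustrates how features are passed between levels; the deep RKM itself is trained from one objective that couples all three levels, as described above.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma):
    d2 = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-d2 / (2 * sigma**2))

def kpca_features(X, s, sigma):
    """Top-s kernel PCA components of X (columns of the returned N x s matrix)."""
    N = X.shape[0]
    C = np.eye(N) - np.ones((N, N)) / N
    Kc = C @ rbf_kernel(X, X, sigma) @ C
    w, V = np.linalg.eigh(Kc)
    return V[:, np.argsort(w)[::-1][:s]]

# toy data standing in for (xi, yi)
rng = np.random.default_rng(0)
X, y = rng.standard_normal((100, 5)), rng.standard_normal((100, 1))

H1 = kpca_features(X, s=10, sigma=1.0)    # level 1: kernel PCA on the inputs
H2 = kpca_features(H1, s=5, sigma=1.0)    # level 2: kernel PCA on the level 1 features
# level 3: ridge (LS-SVM style) regression from the level 2 features to the outputs
lam = 1e-2
W3 = np.linalg.solve(H2.T @ H2 + lam * np.eye(H2.shape[1]), H2.T @ y)
y_hat = H2 @ W3
```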
4.2 Regression Level Followed by Two Kernel PCA Levels
In this case, we consider a deep RKM architecture with the following three levels:
- Level 1 consists of LS-SVM regression with given input data xi and output data yi and is characterized by conjugated features.
- Level 2 consists of kernel PCA, taking the conjugated features of level 1 as input, and is characterized by its own conjugated features.
- Level 3 consists of kernel PCA, taking the conjugated features of level 2 as input, and is characterized by conjugated features.
5 Algorithms for Deep RKM
The characterization of the stationary points for the objective functions in the different deep RKM models typically leads to solving large sets of nonlinear equations in the unknown variables, especially for large given data sets. Therefore, in this section, we outline a number of approaches and algorithms for working with the kernel-based models (in either the primal or the dual). We also outline algorithms for training deep feedforward neural networks in a parametric way in the primal within the deep RKM setting. The algorithms proposed in sections 5.2 and 5.3 are applicable also to large data sets.
5.1 Levelwise Solving for Kernel-Based Models
For the case of linear kernels in levels 2 and 3 in equations 4.11 and 4.18, we propose a heuristic algorithm that consists of level-wise solving of linear systems and eigenvalue decompositions by alternately fixing different unknown variables.
For equation 4.18, in order to solve level 1 as a linear system, one needs the input/output data but also knowledge of the features of the next level. Therefore, an initialization phase is required. One can initialize these as zero or at random at level 1, obtain H1, and propagate it to level 2. At level 2, after initializing H3, one finds H2, which is then propagated to level 3, where one computes H3. After this forward phase, one can go from level 3 back to level 1 in a backward phase.
Schematically this gives the following heuristic algorithm:
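A minimal sketch of this forward-backward schedule is given below. The per-level solvers are simplified stand-ins (a plain LS-SVM linear system for level 1 and linear-kernel PCA steps for levels 2 and 3 that ignore the cross-level coupling terms of equations 4.11 and 4.18), so the sketch illustrates the alternation scheme rather than the exact stationarity conditions.

```python
import numpy as np

def solve_level1(X, Y, H2, gamma=1.0):
    """Simplified stand-in for level 1: an LS-SVM regression linear system with a
    linear kernel. In the deep RKM this system also involves the level 2 features
    H2; that coupling is omitted here for readability."""
    N = X.shape[0]
    K = X @ X.T
    A = np.block([[np.zeros((1, 1)), np.ones((1, N))],
                  [np.ones((N, 1)), K + np.eye(N) / gamma]])
    sol = np.linalg.solve(A, np.vstack([np.zeros((1, Y.shape[1])), Y]))
    return sol[1:]                          # dual variables play the role of H1

def solve_pca_level(H_in, s):
    """Simplified stand-in for a kernel PCA level with a linear kernel: top-s
    eigenvectors of the centered Gram matrix of the incoming features."""
    Hc = H_in - H_in.mean(axis=0)
    w, V = np.linalg.eigh(Hc @ Hc.T)
    return V[:, np.argsort(w)[::-1][:s]]

def levelwise_heuristic(X, Y, s2=5, s3=3, n_sweeps=2, seed=0):
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    H2 = 0.01 * rng.standard_normal((N, s2))   # initialization phase
    H3 = 0.01 * rng.standard_normal((N, s3))
    for _ in range(n_sweeps):
        # forward phase: level 1 -> level 2 -> level 3
        H1 = solve_level1(X, Y, H2)
        H2 = solve_pca_level(H1, s2)
        H3 = solve_pca_level(H2, s3)
        # backward phase: level 3 -> level 2 -> level 1 (with these simplified
        # stand-ins the backward pass repeats the forward computations; in the
        # deep RKM each solve depends on the neighboring levels' features)
        H2 = solve_pca_level(H1, s2)
        H1 = solve_level1(X, Y, H2)
    return H1, H2, H3
```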
One can repeat the forward and backward phases a number of times, without repeating the initialization step. Alternatively, one could apply a forward-only variant of the algorithm, running the forward phase several times in succession.
5.2 Deep Reduced Set Kernel-Based Models with Estimation in Primal
One can also maximize by adding a term to the objective, equation 5.3, with c0 a positive constant. Note that the components of in levels 1 and 2 do not possess an orthogonality property unless this is imposed as additional constraints to the objective function.
5.3 Training Deep Feedforward Neural Networks within the Deep RKM Framework
In order to further reduce the number of unknowns, and partially inspired by convolutional operations in convolutional neural networks (LeCun, Bottou, Bengio, & Haffner, 1998), we also consider the case where U1 and U2 are Toeplitz matrices. For an m × n matrix, the number of unknowns is then reduced from mn to m + n − 1.
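A minimal sketch of such a Toeplitz parameterization is given below (matrix and function names are illustrative). Because every diagonal of a Toeplitz matrix is constant, the m × n matrix is completely specified by its first column and first row, that is, by m + n − 1 numbers.

```python
import numpy as np
from scipy.linalg import toeplitz

def toeplitz_weight(params, m, n):
    """Build an m x n Toeplitz matrix from a parameter vector of length m + n - 1."""
    assert params.shape[0] == m + n - 1
    first_col = params[:m]                                   # U[0,0], U[1,0], ..., U[m-1,0]
    first_row = np.concatenate((params[:1], params[m:]))     # U[0,0], U[0,1], ..., U[0,n-1]
    return toeplitz(first_col, first_row)

params = np.random.randn(8 + 5 - 1)
U = toeplitz_weight(params, 8, 5)
print(U.shape)   # (8, 5), built from 12 free parameters instead of 40
```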
6 Numerical Examples
6.1 Two Kernel PCA Levels Followed by Regression Level: Examples
We define the following models and methods for comparison:
- Deep reduced set kernel-based models (with RBF kernel) with estimation in the primal according to equation 5.3, with the following choices:
  - with the additional term and regularization terms
  - without the additional term
  - with only the level 3 regression objective
- Deep feedforward neural networks with estimation in the primal according to equation 5.5, with the same three choices as above. In one of these models, Toeplitz matrices are taken for the U matrices in all levels except the last level.
We test and compare the proposed algorithms on the following UCI data sets: Pima Indians diabetes, Bupa liver disorder, Johns Hopkins University ionosphere, and adult, where the number of inputs (d), outputs (p), training (N), validation (Nval), and test data (Ntest) correspond to the previous benchmarking study of Van Gestel et al. (2004). In Table 1, bestbmark indicates the best result obtained in the benchmarking study of Van Gestel et al. (2004) from different classifiers, including SVM and LS-SVM classifiers with linear, polynomial, and RBF kernels; linear and quadratic discriminant analysis; the decision tree algorithm C4.5; logistic regression; a one-rule classifier; instance-based learners; and naive Bayes.
The tuning parameters of each model were selected at the validation level, separately for each data set.
The remaining tuning parameters were kept at fixed values unless specified differently. In both model families, the matrices and interconnection matrices were initialized at random according to a normal distribution with zero mean and standard deviation 0.1 (100, 20, 10, or 3 initializations), and the diagonal matrices were initialized by the identity matrix. For the training, a quasi-Newton method was used in Matlab.
The following general observations from the experiments are shown in Table 1:
- Having the additional terms with kernel PCA objectives in levels 1 and 2, as opposed to the level 3 objective only, gives improved results on all data sets tried.
- The best selected value for cstab varies among the data sets. When this value is large, the value of the objective function terms related to the kernel PCA parts is close to zero.
- The use of Toeplitz matrices for the U matrices in the deep feedforward neural networks leads to competitive performance results and greatly reduces the number of unknowns.
Figure 4 illustrates the evolution of the objective function (in logarithmic scale) during training on the data set, for different values of cstab and in comparison with a level 3 objective function only.
Table 1: Test errors on the UCI data sets.

| Model | Pima diabetes | Bupa liver | Ionosphere | Adult |
| --- | --- | --- | --- | --- |
|  | 19.53 [20.02(1.53)] | 26.09 [30.96(3.34)] | 0 [0.68(1.60)] | 16.99 [17.46(0.65)] |
|  | 18.75 [19.39(0.89)] | 25.22 [31.48(4.11)] | 0 [5.38(12.0)] | 17.08 [17.48(0.56)] |
|  | 21.88 [24.73(5.91)] | 28.69 [32.39(3.48)] | 0 [8.21(6.07)] | 17.83 [21.21(4.78)] |
|  | 21.09 [20.20(1.51)] | 27.83 [28.86(2.83)] | 1.71 [5.68(2.22)] | 15.07 [15.15(0.15)] |
|  | 18.75 [20.33(2.75)] | 28.69 [28.38(2.80)] | 10.23 [6.92(3.69)] | 14.91 [15.08(0.15)] |
|  | 19.03 [19.16(1.10)] | 26.08 [27.74(9.40)] | 6.83 [6.50(8.31)] | 15.71 [15.97(0.07)] |
|  | 24.61 [22.34(1.95)] | 32.17 [27.61(3.69)] | 3.42 [9.66(6.74)] | 15.21 [15.19(0.08)] |
| bestbmark | 22.7(2.2) | 29.6(3.7) | 4.0(2.1) | 14.4(0.3) |
Notes: Shown first is the test error corresponding to the selected model with minimal validation error from the different random initializations. Between brackets, the mean and standard deviation of the test errors related to all initializations are shown. The lowest test error is in bold.
6.2 Regression Level Followed by Two Kernel PCA Levels: Examples
6.2.1 Regression Example on Synthetic Data Set
6.2.2 Multiclass Example: USPS
In this example, the USPS handwritten digits data set is taken from http://www.cs.nyu.edu/∼roweis/data.html. It contains 8-bit grayscale images of digits 0 through 9, with 1100 examples of each class. These data are used without additional scaling or preprocessing. We compare a basic LS-SVM model (with primal representation, one output per class, and an RBF kernel) with a deep RKM consisting of LS-SVM + KPCA + KPCA, with an RBF kernel in level 1 and linear kernels in levels 2 and 3 (with a selected number of components in levels 2 and 3). In level 1 of the deep RKM, the same type of model is taken as in the basic LS-SVM model. In this way, we intend to study the effect of the two additional KPCA levels. The dimensionality of the input data is 256. Two training set sizes were taken (2000 and 4000 data points, that is, 200 and 400 examples per class), with 2000 data points (200 per class) for validation and 5000 data points (500 per class) for testing. The tuning parameters are selected based on the validation set, both for the RBF kernel in the basic LS-SVM model and for the deep RKM. The number of forward-backward passes in the deep RKM is chosen equal to 2. The results for the case of 2000 training data points are shown in Figure 5, which gives the results on training, validation, and test data with the predicted class labels and the predicted output values for the different classes. For each of the two training set sizes, the misclassification error on the test data set of the deep RKM is compared with that of the basic LS-SVM. The selected tuning parameters give levels 2 and 3 of the deep RKM a high relative importance through large values of the corresponding constants.
6.2.3 Multiclass Example: MNIST
The data set, which is used without additional scaling or preprocessing, is taken from http://www.cs.nyu.edu/∼roweis/data.html. The dimensionality of the input data is 784 (images of size 28 × 28 for each of the 10 classes). In this case, we take an ensemble approach where the training set (with 10 classes) has been partitioned into small nonoverlapping subsets of size 50 (5 data points per class). The choice for this subset size resulted from taking the last 10,000 points of this data set as validation data, with the use of 40,000 data points for training in that case. Other tuning parameters were selected in a similar way. The 1000 resulting submodels have been linearly combined after applying a function to their outputs. The linear combination is determined by solving an overdetermined linear system with ridge regression, following an approach similar to the one discussed in section 6.4 of Suykens et al. (2002). For the submodels, deep RKMs consisting of LS-SVM + KPCA + KPCA are taken, with an RBF kernel in level 1 and linear kernels in levels 2 and 3. The number of forward-backward passes in the deep RKM is chosen equal to 2. The training data set has been extended with another 50,000 training data points consisting of the same data points but corrupted with noise (random perturbations with zero mean and standard deviation 0.5, truncated to the range [0,1]), which is related to the method with random perturbations in Kurakin, Goodfellow, and Bengio (2016). The misclassification error on the test data set (10,000 data points) is comparable in performance to deep belief networks and lies between the reported test performances of deep Boltzmann machines and SVM with a gaussian kernel (Salakhutdinov, 2015) (see http://yann.lecun.com/exdb/mnist/ for an overview and comparison of performances obtained by different methods).
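A minimal sketch of the ridge regression combination of the submodel outputs is given below; the function applied to the outputs is not specified above, so tanh is used purely as an illustrative placeholder, and all names and the regularization constant mu are assumptions of this sketch.

```python
import numpy as np

def combine_submodels(F_train, y_train, mu=1e-3, squash=np.tanh):
    """F_train: (N, M) outputs of the M submodels for one output/class on N points,
    y_train: (N,) targets. Returns ridge regression combination weights (M,)."""
    Z = squash(F_train)                                   # placeholder squashing function
    return np.linalg.solve(Z.T @ Z + mu * np.eye(Z.shape[1]), Z.T @ y_train)

# Usage sketch: given test-set submodel outputs F_test of shape (N_test, M),
# w = combine_submodels(F_train, y_train)
# y_hat_test = np.tanh(F_test) @ w
```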
7 Conclusion
In this letter, a theory of deep restricted kernel machines has been proposed. It is obtained by introducing a notion of conjugate feature duality where the conjugate features correspond to hidden features. Existing kernel machines such as least squares support vector machines for classification and regression, kernel PCA, matrix SVD, and Parzen-type models are considered as building blocks within a deep RKM and are characterized through the conjugate feature duality. By means of the inner pairing, one achieves a link with the energy expression of restricted Boltzmann machines, though with continuous variables in a nonprobabilistic setting. It also provides an interpretation of visible and hidden units. Therefore, this letter connects, on the one hand, to deep learning methods and, on the other hand, to least squares support vector machines and kernel methods. In this way, the insights and foundations achieved in these different research areas could possibly mutually reinforce each other in the future. Much future work is possible in different directions, including efficient methods and implementations for big data, the extension to other loss functions and regularization schemes, treating multimodal data, different coupling schemes, and models for clustering and semisupervised learning.
Appendix: Stabilization Term for Kernel PCA
Notes
Note that also the term appears. This would in a Boltzmann machine energy correspond to matrix G equal to the identity matrix. The term is an additional regularization term.
This states that for a symmetric block matrix M = [A, B; Bᵀ, C], one has M ≻ 0 if and only if A ≻ 0 and the Schur complement C − Bᵀ A⁻¹ B ≻ 0 (Boyd & Vandenberghe, 2004).
The following properties are used throughout this letter: for matrices and vectors (Petersen & Pedersen, 2012).
Acknowledgments
The research leading to these results has received funding from the European Research Council (FP7/2007-2013) / ERC AdG A-DATADRIVE-B (290923) under the European Union’s Seventh Framework Programme. This letter reflects only my views; the EU is not liable for any use that may be made of the contained information; Research Council KUL: GOA/10/09 MaNet, CoE PFV/10/002 (OPTEC), BIL12/11T; PhD/Postdoc grants; Flemish government: FWO: PhD/postdoc grants, projects: G0A4917N (deep restricted kernel machines), G.0377.12 (Structured systems), G.088114N (tensor-based data similarity); IWT: PhD/postdoc grants, projects: SBO POM (100031); iMinds Medical Information Technologies SBO 2014; Belgian Federal Science Policy Office: IUAP P7/19 (DYSCO, dynamical systems, control and optimization, 2012–2017).