## Abstract

Stable concurrent learning and control of dynamical systems is the subject of adaptive control. Despite being an established field with many practical applications and a rich theory, much of the development in adaptive control for nonlinear systems revolves around a few key algorithms. By exploiting strong connections between classical adaptive nonlinear control techniques and recent progress in optimization and machine learning, we show that there exists considerable untapped potential in algorithm development for both adaptive nonlinear control and adaptive dynamics prediction. We begin by introducing first-order adaptation laws inspired by natural gradient descent and mirror descent. We prove that when there are multiple dynamics consistent with the data, these non-Euclidean adaptation laws implicitly regularize the learned model. Local geometry imposed during learning thus may be used to select parameter vectors—out of the many that will achieve perfect tracking or prediction—for desired properties such as sparsity. We apply this result to regularized dynamics predictor and observer design, and as concrete examples, we consider Hamiltonian systems, Lagrangian systems, and recurrent neural networks. We subsequently develop a variational formalism based on the Bregman Lagrangian. We show that its Euler Lagrange equations lead to natural gradient and mirror descent-like adaptation laws with momentum, and we recover their first-order analogues in the infinite friction limit. We illustrate our analyses with simulations demonstrating our theoretical results.

## 1 Introduction

Adaptation is an online learning problem concerned with control or prediction of the dynamics of an unknown nonlinear system. This task is accomplished by constructing an approximation to the true dynamics through the online adjustment of a vector of parameter estimates under the assumption that there exists a fixed vector of parameters that globally fits the dynamics. The overarching goal is provably safe, stable, and concurrent learning and control of nonlinear dynamical systems.

Adaptive control theory is a mature field, and many results exist tailored to specific system structures (Ioannou & Sun, 2012; Narendra & Annaswamy, 2005; Slotine & Li, 1991). An adaptive control algorithm typically consists of a parameter estimator coupled in feedback to the controlled system, and the estimator is often strongly inspired by gradient-based optimization algorithms. A significant difference between standard optimization algorithms and adaptive control algorithms is that the parameter estimator must not only converge to a set of parameters that leads to perfect tracking of the desired trajectory, but the system must remain stable throughout the process. The additional requirement of stability prevents the immediate application of optimization algorithms as adaptive control algorithms, and stability must be proved by jointly analyzing the closed-loop system and estimator.

Significant progress has been made in adaptive control even for nonlinear systems in the *linearly parameterized* setting, where the dynamics approximation is of the form $\hat{f} = Y(x,t)\hat{a}$ for some known regressor matrix $Y(x,t)$ and vector of parameter estimates $\hat{a}(t)$. Examples include the adaptive robot trajectory controller of Slotine and Li (1987) and the neural network-based controller of Sanner and Slotine (1992), which employs a mathematical expansion in physical nonlinear basis functions to uniformly approximate the unknown dynamics.

Unlike its linear counterpart, solutions to the adaptive control problem in the general nonlinearly parameterized setting $\hat{f} = f(x, \hat{a}, t)$ have remained elusive. Intuitively, this is unsurprising: guarantees for gradient-based optimization algorithms typically rely on convexity, with a few notable exceptions such as the Polyak-Lojasiewicz condition (Polyak, 1963). In the linearly parameterized setting, the underlying optimization problem will be convex. When the parameters appear nonlinearly, the problem is in general nonconvex and difficult to provide guarantees for.

In this work, we provide new provably globally convergent algorithms for both the linearly and nonlinearly parameterized adaptive control problems, along with new insight into existing adaptive control algorithms for the linearly parameterized setting. Our results for nonlinearly parameterized systems are valid under the monotonicity assumptions of Tyukin, Prokhorov, and van Leeuwen (2007) and the convexity assumptions of Fradkov (1980). These monotonicity assumptions are equivalent to those commonly satisfied by generalized linear models in statistics (Kakade, Kalai, Kanade, & Shamir, 2011).

### 1.1 Description of Primary Contributions

Our contributions can be categorized into two main advances.

We further develop a class of natural gradient and mirror descent-like algorithms that have recently appeared in the literature in the context of physically consistent inertial parameter learning in robotics (Lee, Kwon, & Park, 2018) and geodesically convex optimization (Wensing & Slotine, 2020). We prove that these algorithms implicitly regularize the learned system model in both the linearly parameterized and nonlinearly parameterized settings.

We construct a general class of higher-order in-time adaptive control algorithms that incorporate momentum into existing adaptation laws. We prove that our new momentum algorithms are stable and globally convergent for both linearly parameterized and nonlinearly parameterized systems.

Unlike standard problems in optimization and machine learning, explicit regularization terms cannot be naively added to adaptive control algorithms without affecting stability and performance. Our approach enables a provably stable and globally convergent implementation of regularization in adaptive control. We demonstrate the utility of these results through examples in the context of dynamics prediction, such as sparse estimation of a physical system's Hamiltonian or Lagrangian function, and estimating the weights of a continuous-time recurrent neural network model.

It is well known in adaptive control that the true parameters are only recovered when the desired trajectory satisfies a strong condition known as *persistent excitation* (Narendra & Annaswamy, 2005; Slotine & Li, 1991). In general, an adaptation law need only find parameters that enable perfect tracking, and very little is known about which parameters are found when the estimator converges without persistent excitation. Our proof of implicit regularization provides an answer and shows that standard Euclidean adaptation laws lead to parameters of minimum $\ell_2$ norm.

For the second contribution, we utilize the Bregman Lagrangian (Betancourt, Jordan, & Wilson, 2018; Wibisono, Wilson, & Jordan, 2016; Wilson, Recht, & Jordan, 2016) in tandem with the velocity gradient methodology (Andrievskii, Stotskii, & Fradkov, 1988; Fradkov, 1980, 1986; Fradkov, Miroshnik, & Nikiforov, 1999) to define a general formalism that generates higher-order in-time (Morse, 1992) velocity gradient algorithms. Our key insight is that the velocity gradient formalism provides an optimization-like framework that encompasses many well-known adaptive control algorithms and that the velocity gradient “loss function” can be placed directly in the Bregman Lagrangian.

### 1.2 Summary of Related Work

Our work continues in a long-standing tradition that utilizes a continuous-time view to analyze optimization algorithms; here we consider a nonexhaustive list. Diakonikolas and Jordan (2019) develop momentum algorithms from the perspective of Hamiltonian dynamics, while Maddison, Paulin, Teh, O'Donoghue, and Doucet (2018) use Hamiltonian dynamics to prove linear convergence of new optimization algorithms without strong convexity. Muehlebach and Jordan (2019, 2020) study momentum algorithms from the viewpoint of dynamical systems and control. Boffi and Slotine (2020) analyze distributed stochastic gradient descent algorithms via dynamical systems and nonlinear contraction theory. Su, Boyd, and Candès (2016) provide an intuitive justification for Nesterov's accelerated gradient method (Nesterov, 1983) through a limiting differential equation. Continuous-time differential equations were used as early as 1964 by Polyak to derive the classical momentum or “heavy ball” optimization method (Polyak, 1964). In all cases, continuous time affords simpler proofs and enables the application of physical intuition when reasoning about optimization algorithms. Given the gradient-based nature of many adaptive control algorithms, the continuous-time view of optimization provides a natural bridge from modern optimization to modern adaptive control.

Despite the simplicity of proofs in continuous time, finding a discretization that provably retains the convergence rates of a given differential equation is challenging. In a significant advance, Wibisono et al. (2016) showed that many accelerated methods in optimization can be derived via a variational point of view from a single mathematical object known as the Bregman Lagrangian. The Bregman Lagrangian leads to second-order mass-spring-damper-like dynamics, and careful discretization provides discrete-time algorithms such as Nesterov's celebrated accelerated gradient method (Nesterov, 1983). We similarly use the Bregman Lagrangian to generate our new adaptive control algorithms, which generalize and extend a recently developed algorithm due to Gaudio, Gibson, Annaswamy, and Bolender (2019).

Progress has been made in nonlinearly parameterized adaptive control in a number of specific cases. Annaswamy, Skantze, and Loh (1998), Ai-Poh Loh, Annaswamy, and Skantze (1999), and Kojić and Annaswamy (2002) develop stable adaptive control laws for convex and concave parameterizations, though they may be overly conservative and require solving optimization problems at each time step. Astolfi and Ortega (2003) and Liu, Ortega, Su, and Chu (2010) develop the immersion and invariance (I&I) approach, and prove global convergence if a certain monotone function can be constructed. Ortega, Gromov, Nuño, Pyrkin, and Romero (2019) use a similar approach for system identification. Tyukin et al. (2007) consider dynamical systems satisfying a monotonicity assumption that is essentially identical to conditions required for learning generalized linear models in machine learning and statistics (Goel & Klivans, 2017; Goel, Klivans, & Meka, 2018; Kakade et al., 2011), and develop provably stable adaptive control algorithms for nonlinearly parameterized systems in this setting. Fradkov (1980), Andrievskii et al. (1988), Fradkov (1986), and Fradkov et al. (1999) develop the velocity gradient methodology, an optimization-like framework for adaptive control that allows for provably global convergence under a convexity assumption. As mentioned in section 1.1, this framework, in tandem with the Bregman Lagrangian, is central to our development of momentum algorithms.

Our work is strongly related to and inspired by a line of recent work that analyzes the implicit bias of optimization algorithms in machine learning. Soudry, Hoffer, Nacson, Gunasekar, and Srebro (2018) and Gunasekar, Lee, Soudry, and Srebro (2018b, 2018a) characterize implicit regularization of common gradient-based optimization algorithms such as gradient descent with and without momentum, as well as natural gradient descent and mirror descent in the settings of regression and classification. Azizan, Lale, and Hassibi (2019) and Azizan and Hassibi (2019) arrive at similar results via a different derivation based on results from $H_\infty$ control. Similarly, Belkin, Hsu, Ma, and Mandal (2019) consider the importance of implicit regularization in the context of the successes of deep learning. Our results are the adaptive control analogues of those presented in these papers.

### 1.3 Paper Outline

The paper is organized as follows. In section 2, we present some required mathematical background on direct adaptive control in the linearly and nonlinearly parameterized settings. In section 3 we analyze the implicit bias of adaptive control algorithms, while in section 4 we consider general observer and dynamics predictor design, Hamiltonian dynamics prediction, control of Lagrangian systems, and estimation of recurrent neural networks. In section 5 we provide background for our development of momentum algorithms, including a review of the velocity gradient formalism (section 5.1) and the Bregman Lagrangian (section 5.2). In section 6 we present adaptive control algorithms with momentum, and we extend them to the non-Euclidean setting in section 7. We illustrate our results via simulation in section 8, and we conclude with some closing remarks and future directions in section 9.

## 2 Direct Adaptive Control

In this section, we provide an introduction to direct adaptive control for both linearly parameterized and nonlinearly parameterized systems, along with a description of some natural gradient-like adaptive laws that have appeared in the recent literature.

### 2.1 Linearly Parameterized Dynamics

Consider the $n$th-order nonlinear dynamics

$$x^{(n)} = f(x, a, t) + u, \tag{2.1}$$

where $x = \left(x, \dot{x}, \ldots, x^{(n-1)}\right)^T$ is the state, $u$ is the control input, and $a$ is a vector of unknown parameters. With $\tilde{x} = x - x_d$ denoting the tracking error relative to a desired trajectory $x_d(t)$, define the *sliding variable* (Slotine & Li, 1991)

$$s = \left(\frac{d}{dt} + \lambda\right)^{n-1}\tilde{x}, \qquad \lambda > 0,$$

so that, under a certainty-equivalence control law that cancels the known terms and provides feedback in $s$, the sliding variable obeys the first-order error *dynamics*

$$\dot{s} = -\eta s + \tilde{f}(x, \hat{a}, a, t), \qquad \eta > 0, \tag{2.6}$$

where $\tilde{f}(x, \hat{a}, a, t) = \hat{f}(x, \hat{a}, t) - f(x, a, t)$ denotes the function approximation error.

The fundamental utility of defining the variable $s$ is its conversion of the adaptive control problem for the $n$th-order system, equation 2.1, to an adaptive control problem for the first-order system, equation 2.6. Our results may be simply extended to other error models (Ai-Poh Loh et al., 1999; Narendra & Annaswamy, 2005) of the form 2.6, or to error models with similar input-output guarantees.

We will use $f$ to denote the equivalent first-order system to equation 2.1, $\dot{x} = f(x, a, t) + u$, where $f = \left(x_2, x_3, \ldots, f(x, a, t)\right)^T$ and $u = \left(0, 0, \ldots, u\right)^T$.
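As a minimal scalar instance of this first-order error model (our construction; the plant, regressor, trajectory, and gains are all illustrative, not from the article), consider $\dot{x} = a\sin(x) + u$ with unknown $a$ and regressor $Y(x) = \sin(x)$. With $s = x_d - x$ and a certainty-equivalence control law, the standard adaptation law $\dot{\hat{a}} = -\gamma Y s$ makes $V = \frac{1}{2}s^2 + \frac{1}{2\gamma}\tilde{a}^2$ nonincreasing along trajectories, so the tracking error converges:

```python
import numpy as np

# Illustrative sketch: scalar linearly parameterized plant x' = a*sin(x) + u
# with unknown a, regressor Y(x) = sin(x), tracking error s = x_d - x, and
# the standard adaptation law a_hat' = -gamma*Y*s.

def simulate(a_true=2.0, eta=5.0, gamma=10.0, dt=1e-3, T=20.0):
    x, a_hat = 0.5, 0.0
    for i in range(int(T / dt)):
        t = i * dt
        xd, xd_dot = np.sin(t), np.cos(t)     # desired trajectory
        s = xd - x                            # first-order tracking error
        Y = np.sin(x)                         # regressor
        u = xd_dot - a_hat * Y + eta * s      # certainty-equivalence control
        x += dt * (a_true * np.sin(x) + u)    # plant (explicit Euler)
        a_hat += dt * (-gamma * Y * s)        # standard adaptation law
    return x, np.sin(T), a_hat

x_final, xd_final, a_hat = simulate()
print(abs(x_final - xd_final))  # tracking error -> small
```

Note that the Lyapunov argument only guarantees $s \to 0$; the estimate $\hat{a}$ need not converge to $a$ without persistent excitation.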

In this work, we make a mild additional assumption that simplifies some of the proofs.

The dynamics $\hat{f}(x, \hat{a}, t)$ is locally bounded in $x$ and $\hat{a}$, uniformly in $t$. That is, if $\|x\| < \infty$ and $\|\hat{a}\| < \infty$, then $|\hat{f}(x, \hat{a}, t)| < \infty$ for all $t \geq 0$.

### 2.2 Nonlinearly Parameterized Dynamics

While a difficult problem in general, significant progress has been made for the nonlinearly parameterized adaptive control problem under the assumption of *monotonicity*, and several notions of monotonicity have appeared in the literature (Astolfi & Ortega, 2003; Liu et al., 2010; Ortega et al., 2019; Tyukin, 2011; Tyukin et al., 2007). We consider one such notion as presented by Tyukin et al. (2007), which is captured in the following assumption.

^{1} Proportional-integral adaptive laws of this type are known as algorithms in finite form (Fradkov et al., 1999; Tyukin, 2003) and appear in the well-known I&I framework (Astolfi & Ortega, 2003; Liu et al., 2010). Following this prescription, equation 2.12 may be implemented in a proportional-integral form.

Under this assumption, the quantity $\tilde{f}(x, \hat{a}, a, t)\,\alpha(x, t)$ serves as a *pseudogradient* for adaptation despite nonconvexity of the square loss in this setting. Similarly, if $f$ is nonincreasing, the same conclusion holds with the sign of $\alpha(x, t)$ reversed.

### 2.3 The Bregman Divergence and Natural Adaptation Laws

Given a strictly convex function $\psi$, the Bregman divergence between $x$ and $y$ is defined as $d_\psi(x \,\|\, y) = \psi(x) - \psi(y) - \nabla\psi(y)^T(x - y)$, and its local geometry is captured by the *Hessian metric* $\frac{1}{2}\|x\|^2_{\nabla^2\psi} = \frac{1}{2}x^T\nabla^2\psi(x)x$. As two simple examples, for $\psi(x) = \frac{1}{2}\|x\|^2$, $d_\psi(x \,\|\, y) = \frac{1}{2}\|x - y\|^2$, and for $\psi(x) = \frac{1}{2}x^TQx$ with $Q > 0$ a positive-definite matrix, $d_\psi(x \,\|\, y) = \frac{1}{2}(x - y)^TQ(x - y)$. For general convex functions, $d_\psi(\cdot \,\|\, \cdot)$ can always be written via Taylor's formula with an integral remainder for multivariate functions as

$$d_\psi(x \,\|\, y) = \int_0^1 (1 - t)\,(x - y)^T\nabla^2\psi\big(y + t(x - y)\big)(x - y)\,dt.$$
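The quadratic example can be checked numerically from the definition $d_\psi(x\,\|\,y) = \psi(x) - \psi(y) - \nabla\psi(y)^T(x-y)$; the short sketch below (our construction, with an arbitrary positive-definite $Q$) verifies that it reduces to the stated closed form:

```python
import numpy as np

# Sketch: the Bregman divergence d_psi(x||y) = psi(x) - psi(y) - grad_psi(y).(x-y),
# checked against the closed form d_psi(x||y) = 1/2 (x-y)^T Q (x-y) for the
# quadratic potential psi(x) = 1/2 x^T Q x.

def bregman(psi, grad_psi, x, y):
    return psi(x) - psi(y) - grad_psi(y) @ (x - y)

Q = np.array([[2.0, 0.5], [0.5, 1.0]])           # positive definite
psi = lambda x: 0.5 * x @ Q @ x
grad_psi = lambda x: Q @ x

x = np.array([1.0, -2.0])
y = np.array([0.3, 0.7])
d = bregman(psi, grad_psi, x, y)
closed_form = 0.5 * (x - y) @ Q @ (x - y)
print(np.isclose(d, closed_form))  # True
```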

The natural gradient-like adaptation law $\dot{\hat{a}} = -\nabla^2\psi(\hat{a})^{-1}Y^Ts$ premultiplies the usual gradient direction by the *inverse Hessian* $\nabla^2\psi(\hat{a})^{-1}$ of the strongly convex function $\psi$. In essence, this amounts to the adaptive control equivalent of the *natural gradient* algorithm of Amari (1998), so that the resulting adaptation law respects the underlying Riemannian geometry captured by the Hessian metric $\nabla^2\psi(\hat{a})$. The standard adaptation law $\dot{\hat{a}} = -PY^Ts$ uses the constant metric $P^{-1}$, which in turn explains the appearance of $P$ in the natural gradient-like law.

The choice of $\psi$ enables the design of adaptation algorithms that respect physical Riemannian constraints (Lee, Wensing, & Park, 2020) obeyed by the true parameters, as in the estimation of mass properties in robotics (Wensing, Kim, & Slotine, 2018). Similarly, it allows one to introduce a priori bounds on parameter estimates without resorting to parameter projection techniques by choosing $\psi$ to be a $\log$-barrier function (Wensing & Slotine, 2020). In section 3.1, we further prove that the choice of $\psi$ implicitly regularizes the learned system model.

The adaptation law $\dot{\hat{a}} = -\gamma\nabla^2\psi(\hat{a})^{-1}\tilde{f}(x, \hat{a}, a, t)\,\alpha(x, t)$ with $\psi$ strongly convex and $\gamma > 0$ ensures that $\tilde{f} \in L_2$ over the maximal interval of existence of $x(t)$ for nonlinearly parameterized systems satisfying assumption 2. The proof is identical to that of Tyukin et al. (2007). The implementation of this algorithm in PI form will be described in remark 13 and is based on a correspondence between mirror descent and natural gradient descent in continuous time. This algorithm can be seen as the adaptive control equivalent of a mirror descent or natural gradient extension of the GLMTron of Kakade et al. (2011), and this correspondence will be considered in greater detail in section 6.

In the linearly parameterized setting, rather than the Lyapunov-like function $V = \frac{1}{2}s^2 + d_\psi(a \,\|\, \hat{a})$, the Lyapunov-like function $V = \frac{1}{2}s^2 + d_\psi(Pa \,\|\, P\hat{a})$ may be used for any positive-definite matrix $P$. This shows stability of the adaptation law $\dot{\hat{a}} = -P^{-1}\nabla^2\psi(P\hat{a})^{-1}P^{-1}Y^Ts$, where the choice of the matrix $P$ offers an additional design flexibility.

In some practical applications, as in adaptive robot control, the estimated parameters $\hat{a}$ may correspond to physical constants. In this case, the weighted parameter estimation error term $\frac{1}{2}\tilde{a}^TP^{-1}\tilde{a}$ not only provides additional design flexibility through the elements of $P$ in the adaptation law, but is necessary for physical consistency of units. Indeed, the usual Lyapunov-like function $V = \frac{1}{2}s^2 + \frac{1}{2}\tilde{a}^TP^{-1}\tilde{a}$ shows that $P^{-1}$ must be chosen so that the parameter estimation error term $\frac{1}{2}\tilde{a}^TP^{-1}\tilde{a}$ has the same units as the tracking error term $\frac{1}{2}s^2$. Similar considerations apply when replacing this standard parameter estimation error term with the Bregman divergence $d_\psi(a \,\|\, \hat{a})$, which has the same units as $\psi(\hat{a})$. In this case, $\psi(\hat{a})$ must be chosen to have the same units as the tracking error term, for example, by introducing a diagonal matrix of constants to ensure consistent dimensions.

## 3 Natural Gradient Adaptation and Implicit Regularization

In this section, we show that the “natural” adaptation algorithms of the previous section implicitly regularize the learned system model.

### 3.1 Implicit Regularization and Adaptive Control

With deep networks as the predominant example, modern machine learning often considers highly overparameterized models that are capable of interpolating the training data (achieving zero error on the training set) while still generalizing well to unseen examples. The classical principles of statistical learning theory emphasize a trade-off between generalization performance and model capacity, and predict that in the highly overparameterized regime, generalization performance should be poor due to a tendency of the model to fit noise in the training data. Nevertheless, empirical evidence indicates that deep networks and other modern machine learning models do not obey classical statistical learning wisdom (Belkin et al., 2019) and can even generalize with significant label noise (Zhang, Bengio, Hardt, Recht, & Vinyals, 2016).

More surprisingly, the ability to simultaneously fit label noise in the training data yet generalize to new examples has been observed in overparameterized linear models (Bartlett, Long, Lugosi, & Tsigler, 2020; Muthukumar, Vodrahalli, & Sahai, 2019). A possible explanation for the ability of highly overparameterized models to generalize when optimized using simple first-order algorithms is their implicit bias—that is, the tendency of an algorithm to converge to a particular (e.g., minimum norm) solution when there are many that interpolate the training data (Azizan & Hassibi, 2019; Azizan et al., 2019; Gunasekar et al., 2018a, 2018b; Soudry et al., 2018).

In adaptive control, the possibility of there being many parameter vectors $\hat{a}$ that lead to zero tracking error is not unique to the overparameterized case. Unless the trajectory is *persistently exciting*^{2} (Narendra & Annaswamy, 2005; Slotine & Li, 1991), it is well known that $\hat{a}$ will not converge to the true parameters $a$ in general. Depending on the complexity of the trajectory, there may even be many solutions in the *underparameterized* case where $\dim(\hat{a}) < \dim(a)$. To achieve perfect tracking, the adaptation algorithm need only fit the unknown dynamics $f(x(t), a, t)$ along the trajectory rather than over the whole state space, so that the effective number of parameters may be less than $\dim(a)$.

The wealth of possible solutions in the linearly parameterized case is captured by the time-dependent null space of $Y(x(t), t)$: when $x \to x_d$, we can conclude that $Y(x_d(t), t)\tilde{a}(t) = 0$, and hence that $\hat{a}(t) = a + \hat{n}(t)$, where $Y(x_d(t), t)\hat{n}(t) = 0$ for all $t$. This observation also highlights that any element $\hat{n}(t)$ of the null space may be added to the parameter estimates $\hat{a}$ without affecting the value of $\hat{f}$.^{3} In the overparameterized case when $\dim(\hat{a}) > \dim(a)$, the set of parameters that achieve zero tracking error is not unique regardless of the complexity of the desired trajectory. By deriving a continuous-time extension of a recent proof of the implicit bias of mirror descent algorithms (Azizan & Hassibi, 2019; Azizan et al., 2019), we now show that the natural adaptive laws of the previous section implicitly regularize $\hat{a}$. This proof of implicit regularization provides an answer to the question: with infinitely many parameter vectors that achieve zero tracking error, which does adaptation choose?

Equation 3.3 captures the implicit regularization imposed by the adaptation algorithm, equation 3.2: out of all possible interpolating parameters, it chooses the $\hat{a}$ that achieves the minimum value of $\psi$.
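The minimum-norm phenomenon can be checked in the simplest static analogue of this result. The sketch below (our construction, not from the article) runs plain gradient descent, corresponding to the Euclidean choice $\psi = \frac{1}{2}\|\cdot\|^2$ with zero initialization, on an underdetermined least-squares problem with many interpolating solutions, and compares the limit to the minimum $\ell_2$-norm interpolant:

```python
import numpy as np

# Sketch of implicit bias in a static analogue: for an underdetermined
# least-squares problem, gradient descent started from zero converges to the
# minimum l2-norm interpolating solution, mirroring the conclusion for the
# standard Euclidean adaptation law initialized at a_hat(0) = 0.

rng = np.random.default_rng(0)
Y = rng.standard_normal((5, 20))        # 5 equations, 20 unknowns: many interpolants
b = rng.standard_normal(5)

a = np.zeros(20)
lr = 0.01
for _ in range(50_000):                 # gradient descent on 1/2 ||Y a - b||^2
    a -= lr * Y.T @ (Y @ a - b)

a_min_norm = np.linalg.pinv(Y) @ b      # minimum l2-norm interpolant
print(np.allclose(a, a_min_norm, atol=1e-6))  # True
```

The iterates remain in the row space of $Y$, which is what forces convergence to the minimum-norm solution; a non-Euclidean $\psi$ would instead select the interpolant minimizing $\psi$.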

The assumptions of proposition 1 provide a setting where theoretical insight may be gained into the implicit regularization of adaptive control algorithms, but they are stronger than needed. In general, the parameters $\hat{a}(t)$ found by an adaptive controller need not converge to a constant despite the fact that $\dot{\hat{a}} \to 0$.^{4} Similarly, even in the case that the parameters converge, it is not strictly required that $Y(x(t), t)\hat{a}_\infty = f(x(t), a, t)$ along the entire trajectory, as this condition need only be satisfied asymptotically. Numerical simulations in section 8 will demonstrate the implicit regularization of parameters $\hat{a}(t)$ found by adaptive control along the entire trajectory.

We may make a similar claim in the nonlinearly parameterized setting captured by assumption 2. To do so, we require an additional assumption.

For any vector of parameters $\theta$ and the true parameters $a$, $f(x(t), \theta, t) = f(x(t), a, t)$ implies that $\alpha(x(t), t)^T\theta = \alpha(x(t), t)^Ta$.

For the class of systems 2.11, a sufficient condition for assumption 3 is that $\lambda(x(t), t) \neq 0$ and that the map $\varphi(x, t)^Ta \mapsto f_m(x(t), \varphi(x, t)^Ta)$ is invertible at every $t$. We may now state our implicit regularization result for nonlinearly parameterized systems.

Algorithm 3.4 must be implemented in PI form due to the appearance of $\tilde{f}$. The use of the PI form in $\hat{a}$, equations 2.13 to 2.16, is complicated by the presence of the inverse Hessian of $\psi$. Following the prescription of equation 2.12, the algorithm may instead be implemented through the usual PI form for an auxiliary variable $\dot{\hat{v}} = -\tilde{f}(x(t), \hat{a}(t), a, t)\,\alpha(x(t), t)$, and the controller parameters may then be computed by inverting the gradient of $\psi$, $\hat{a}(t) = \nabla\psi^{-1}(\hat{v}(t))$. This follows from the equivalence of mirror descent and natural gradient descent in continuous time. Concretely, the identity $\frac{d}{dt}\nabla\psi(\hat{a}) = \nabla^2\psi(\hat{a})\dot{\hat{a}}$ shows that $\dot{\hat{a}} = -\nabla^2\psi(\hat{a})^{-1}\tilde{f}(x, \hat{a}, a, t)\,\alpha(x, t)$ is equivalent to $\frac{d}{dt}\nabla\psi(\hat{a}) = -\tilde{f}(x, \hat{a}, a, t)\,\alpha(x, t)$. The auxiliary variable $\hat{v}$ can then be identified with $\nabla\psi(\hat{a})$.
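The equivalence of mirror descent and natural gradient descent in continuous time can be checked numerically. In the sketch below (our construction), we take $\psi(a) = \sum_i a_i \log a_i$ on the positive orthant, so that $\nabla\psi(a) = 1 + \log a$, $\nabla^2\psi(a) = \mathrm{diag}(1/a)$, and $\nabla\psi^{-1}(v) = e^{v-1}$, and integrate both flows under the same illustrative forcing $g(t)$, which stands in for $\tilde{f}\alpha$:

```python
import numpy as np

# Sketch of the continuous-time equivalence: with psi(a) = sum_i a_i log a_i
# (grad psi = 1 + log a, hessian = diag(1/a), inverse gradient exp(v - 1)),
# the natural-gradient flow a' = -hess(psi)(a)^{-1} g(t) and the mirror flow
# d/dt grad psi(a) = -g(t) trace the same trajectory.

dt, T = 1e-4, 1.0
g = lambda t: np.array([np.cos(t), 0.5 * np.sin(2 * t)])  # illustrative forcing

a_ng = np.ones(2)                        # natural-gradient state
v = 1.0 + np.log(np.ones(2))             # mirror state v = grad psi(a)
t = 0.0
while t < T:
    a_ng += dt * (-a_ng * g(t))          # hess(psi)(a)^{-1} = diag(a)
    v += dt * (-g(t))                    # mirror descent on the dual variable
    t += dt
a_md = np.exp(v - 1.0)                   # invert the gradient map

print(np.allclose(a_ng, a_md, atol=1e-3))  # True (up to Euler discretization)
```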

If the inverse gradient of $\psi$ is unknown but $\psi$ is chosen to be strongly convex, the contracting (Lohmiller & Slotine, 1998) dynamics $\dot{w} = -\frac{1}{\tau}\left(\nabla\psi(w) - \hat{v}\right)$ with $\tau > 0$ will converge to a ball around $\nabla\psi^{-1}(\hat{v})$ with radius set by $\left\|\frac{d}{dt}\nabla\psi^{-1}(\hat{v})\right\| \times \frac{\tau}{l}$, where $l$ is the strong convexity parameter of $\psi$. By choosing $\tau$ so that this contracting dynamics is fast on the timescale of adaptation, $w$ will represent a good approximation of the instantaneous $\nabla\psi^{-1}(\hat{v})$.
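A gradient-inversion flow of this kind can be sketched numerically. In the example below (our choice of $\psi$, with a static $\hat{v}$ for simplicity), $\psi(w) = \frac{1}{2}\|w\|^2 + \frac{1}{4}\sum_i w_i^4$ is strongly convex with parameter $l = 1$ and has no simple closed-form inverse gradient; the flow settles where $\nabla\psi(w) = \hat{v}$:

```python
import numpy as np

# Sketch of gradient inversion by a contracting flow, with an illustrative
# strongly convex psi(w) = 1/2 ||w||^2 + 1/4 sum_i w_i^4, so that
# grad psi(w) = w + w^3 elementwise. The flow w' = -(1/tau)(grad psi(w) - v_hat)
# settles at the (hard-to-invert analytically) point w = grad psi^{-1}(v_hat).

grad_psi = lambda w: w + w**3

w_true = np.array([0.5, -1.2])
v_hat = grad_psi(w_true)          # the mirror variable we want to invert

tau, dt = 0.05, 1e-3
w = np.zeros(2)
for _ in range(5000):             # 5 time units, fast relative to tau
    w += dt * (-(grad_psi(w) - v_hat) / tau)

print(np.allclose(w, w_true, atol=1e-6))  # True
```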

Our results highlight, through the equivalence of their continuous-time limits, that both mirror descent–like and natural gradient-like adaptive laws impose implicit regularization. This observation extends recent results on the implicit regularization of mirror descent (Azizan & Hassibi, 2019; Azizan et al., 2019) to natural gradient descent and, furthermore, applies to linearly parameterized and generalized linearly parameterized models in machine learning, not just in the context of adaptive control. This has previously been noted in Gunasekar et al. (2018a), where it was discussed that in discrete time, natural gradient descent only approximately imposes implicit regularization due to discretization errors.

Propositions 1 and 2 demonstrate for the first time the implicit bias of adaptive control algorithms. In doing so, they identify an additional design choice that may be exploited for the application of interest. Proposition 1 implies that the Slotine and Li controller, when initialized with the parameters at $\hat{a}(0) = 0$, finds the interpolating parameter vector of minimum $\ell_2$ norm. Other norms, such as the $\ell_1$, $\ell_\infty$, $\ell_p$ for arbitrary $p$, or group norms, will find alternative parameter vectors that may have desirable properties such as sparsity.^{5} The usual Euclidean geometry-based adaptive laws can be seen as a form of ridge regression, while imposing $\ell_1$, simultaneous $\ell_1$ and $\ell_2$, or $\ell_p$ regularization through the choice of $\psi$ can be seen as the adaptive control equivalent of LASSO (Tibshirani, 1996) or compressed sensing, the elastic net, and bridge regression, respectively. In the context of adaptive control, this notion of implicit regularization is particularly interesting, as typical regularization terms such as $\ell_1$ and $\ell_2$ penalties cannot in general be added to the adaptation law directly without affecting the stability and performance of the algorithm.

Following remark 8, if $\psi(\cdot)$ is chosen as the $l$th power of a $p$-norm, in practical applications it is necessary to include a matrix $\Gamma$ to ensure that $\psi(\hat{a}) = \|\Gamma\hat{a}\|_p^l$ has the same units as the tracking error component of the Lyapunov function. For example, if $l = 2$, then $\Gamma$ may be chosen as $\Gamma = P^{-1/2}$ for consistency of units, where $P$ is a gain matrix tuned for the standard adaptive law $\dot{\hat{a}} = -PY^Ts$. In addition, $l = 2$ admits a simple inversion formula for $\nabla\psi$ for any $p$, as will be utilized in the simulations in section 8, although the corresponding inverse Hessian $\nabla^2\psi(\cdot)^{-1}$ is nondiagonal for $p \neq 2$. For $l = p$, the inverse Hessian is diagonal, but $\Gamma$ must then be calibrated independently from the $P$ tuned for the standard $\ell_2$ law. Note that choosing $\psi$ to be an $\ell_1$-norm approximation will impose sparsity on $\Gamma\hat{a}$, so that $\Gamma$ should be taken to be diagonal to ensure sparsity in $\hat{a}$ itself.
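For $l = 2$, the inversion formula in question is, under our reading, the standard dual-norm pair familiar from the mirror descent literature: with $\psi(a) = \frac{1}{2}\|a\|_p^2$ and conjugate exponent $q$ satisfying $1/p + 1/q = 1$, the gradient map and its inverse can be checked by composition (a sketch; the $\Gamma$ scaling is omitted here):

```python
import numpy as np

# Sketch of the l = 2 inversion formula for psi(a) = 1/2 ||a||_p^2:
#   grad psi(a)      = ||a||_p^(2-p) * |a|^(p-1) * sign(a)
#   grad psi^{-1}(v) = ||v||_q^(2-q) * |v|^(q-1) * sign(v),  1/p + 1/q = 1.

def grad_psi(a, p):
    return np.linalg.norm(a, p) ** (2 - p) * np.abs(a) ** (p - 1) * np.sign(a)

def grad_psi_inv(v, p):
    q = p / (p - 1.0)
    return np.linalg.norm(v, q) ** (2 - q) * np.abs(v) ** (q - 1) * np.sign(v)

a = np.array([0.8, -0.1, 2.0, 0.0])
for p in (1.5, 2.0, 3.0, 4.0):
    v = grad_psi(a, p)
    print(np.allclose(grad_psi_inv(v, p), a))  # True for each p
```

For $p = 2$ this reduces to the identity map, recovering the Euclidean law.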

### 3.2 Non-Euclidean Measure of the Tracking Error

## 4 Adaptive Dynamics Prediction, Control, and Observer Design

In this section, we demonstrate how the new non-Euclidean adaptation laws of section 3.1 may be used for regularized dynamics prediction, regularized adaptive control, and regularized observer design.

### 4.1 Regularized Adaptive Dynamics Prediction

The above discussion demonstrates a separation theorem for adaptive dynamics prediction. If a dynamics predictor can be designed under the assumption that the true system dynamics is known (e.g., if bounds on $\frac{\partial f}{\partial x}$ are available), then the same dynamics predictor can be made adaptive by incorporating the skew-symmetric law, equation 4.1. Convergence properties then depend only on the nominal system with control feedback and are independent of the parameter estimator, as shown by the conditions for contraction.

In principle, these simple results could be made more general using the techniques developed in Lopez and Slotine (2021), or could be performed in a latent space computed via a nonlinear dimensionality-reduction technique such as an autoencoder (Champion, Lusch, Kutz, & Brunton, 2019) or more generally a hierarchical expansion (Chen, Paiton, & Olshausen, 2018). This could also extend to adaptive control, for example, in robot control applications where an adaptive controller could be designed in a latent space computed from raw pixels via a neural network.

### 4.2 Regularized Dynamics Prediction for Hamiltonian Systems

### 4.3 Regularized Adaptive Control for Lagrangian Systems

The proof follows by consideration of a Lyapunov-like function of the form used throughout this article, using skew-symmetry of $\dot{H} - 2C$ to eliminate the term $\frac{1}{2}s^T\dot{H}s$. In physical applications, dimensions or relative scaling of the components of $\hat{a}^{(K)}$ and $\hat{a}^{(P)}$ can be handled as described in remarks 8 and 16.

As in section 4.2, by using an $\ell_1$ approximation for $\psi$, this approach may find sparse, interpretable models of the kinetic and potential energies. Estimating the potential energy directly may in some cases lead to simpler parameterizations than estimating the resulting forces.

If more structure in the inertia matrix is known, for example, that it depends only on a few unknown parameters, it may still be approximated using the usual Slotine and Li controller. The external forces can then be estimated by directly estimating the corresponding potential that generates them.

### 4.4 Regularized Adaptive Observer Design

As in section 4.1, this discussion demonstrates a separation theorem for adaptive observer design. If an observer can be designed for the true system with unknown parameters, then the same observer can be made adaptive by incorporating the adaptation law presented in this section. Convergence properties then depend only on the true system with feedback and are independent of the parameter estimator. The results of section 3.1 show that the choice of $\psi $ can be used to regularize the observer model while maintaining provable reconstruction of the full system state.
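As a minimal Euclidean instance of this separation idea (our construction; no regularizing $\psi$, and the plant, gains, and injection term are illustrative), consider a scalar plant $\dot{x} = a\sin(x)$ with unknown $a$, an observer with output injection, and a Lyapunov-derived adaptation law driven by the output error $e = \hat{x} - x$:

```python
import numpy as np

# Illustrative sketch: observer x_hat' = a_hat*sin(x) - k*e copies the model
# structure with output injection gain k, and a_hat' = -gamma*sin(x)*e adapts
# the parameter; V = e^2/2 + (a_hat - a)^2/(2*gamma) is nonincreasing.

def simulate(a_true=1.5, k=5.0, gamma=5.0, dt=1e-3, T=30.0):
    x, x_hat, a_hat = 1.0, 0.0, 0.0
    for _ in range(int(T / dt)):
        e = x_hat - x                             # output (reconstruction) error
        x_dot = a_true * np.sin(x)                # true plant
        x_hat_dot = a_hat * np.sin(x) - k * e     # adaptive observer
        a_hat_dot = -gamma * np.sin(x) * e        # adaptation law
        x += dt * x_dot
        x_hat += dt * x_hat_dot
        a_hat += dt * a_hat_dot
    return x, x_hat, a_hat

x, x_hat, a_hat = simulate()
print(abs(x_hat - x))  # reconstruction error -> small
```

As the theory predicts, the state is reconstructed ($e \to 0$) even though $\hat{a}$ need not converge to $a$ without persistent excitation.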

### 4.5 Regularized Dynamics Prediction for Recurrent Neural Networks

A Lyapunov-like function analogous to those used throughout this article shows that $e \to 0$. This approach could be used, for example, for identifying regularized low-dimensional models in computational neuroscience. Our results are similar to those of Foster, Rakhlin, and Sarkar (2020) but handle a mirror descent or natural gradient extension valid in the continuous-time deterministic setting.

## 5 Velocity Gradient Algorithms and the Bregman Lagrangian

### 5.1 Velocity Gradient Algorithms

The adaptation law, equation 5.1, ensures that $\hat{a}$ moves in a direction that instantaneously decreases $\dot{Q}(x, \hat{a}, t)$. Under the conditions specified by assumptions 4 to 6, this causes $\dot{Q}(x, \hat{a}, t)$ to remain negative for long enough to accomplish the control goal (Fradkov et al., 1999).

$Q(x, t)$ is nonnegative and radially unbounded, so that $Q(x, t) \geq 0$ for all $x$ and $t$, and $Q(x, t) \to \infty$ when $\|x\| \to \infty$. $Q(x, t)$ is uniformly continuous in $t$ whenever $x$ is bounded.

There exists an ideal set of control parameters $a$ such that the origin of the system 2.1 is globally asymptotically stable when the control is evaluated at $a$. Furthermore, $Q(x,t)$ is a Lyapunov function for the system when the control is evaluated at $a$. That is, there exists a strictly increasing function $\rho$ with $\rho(0)=0$ such that $\dot{Q}(x,a,t)\le-\rho(Q)$.

The proof follows by consideration of the Lyapunov-like function $V=Q+\frac{1}{2}\tilde{a}^{\mathsf{T}}P^{-1}\tilde{a}$.

If $Q(x,t)$ is chosen so that $\dot{Q}(x,\hat{a},t)$ depends on $\hat{a}$ only through $\hat{f}(x,\hat{a},t)$, and if $f(x,a,t)$ is linearly parameterized, then assumption 6 is immediately satisfied by convexity of affine functions. Indeed, consider defining the goal functional $Q(x,t)=\frac{1}{2}s(x,t)^2$ for system 2.1, where $s$ depends on $t$ through $x_d(t)$. This goal functional clearly satisfies assumptions 4 and 5 for bounded $x_d(t)$. Then $\dot{Q}(x,\hat{a},t)=-\eta s(x,t)^2+s\tilde{f}(x,\hat{a},a,t)$, and equation 5.1 exactly recovers the Slotine and Li controller, equation 2.8.

An alternative perspective on velocity gradient algorithms can be found by using the expression $\dot{Q}(x,\hat{a},t)=\nabla_x Q(x,t)^{\mathsf{T}}\dot{x}+\frac{\partial Q(x,t)}{\partial t}$. Assume that $\dot{x}=u-Y(x,t)a$, and set $u=Y(x,t)\hat{a}+u_d$, where $u_d$ ensures that $x(t)\to x_d(t)$ for $\hat{a}=a$. Then $\nabla_{\hat{a}}\dot{Q}(x,\hat{a},t)=Y^{\mathsf{T}}\nabla_x Q(x,t)$. This shows that the adaptation law $\dot{\hat{a}}=-P\nabla_{\hat{a}}\dot{Q}(x,\hat{a},t)=-PY(x,t)^{\mathsf{T}}\nabla_x Q(x,t)$ transforms the gradient of $Q(x,t)$ with respect to $x$ by premultiplication by the transposed regressor $Y(x,t)^{\mathsf{T}}$. This interpretation applies to the observers and dynamics predictors designed in section 4, as well as to the adaptation law for contracting systems developed in Lopez and Slotine (2021). Conversely, this perspective shows that if a Lyapunov function $V(x,t)$ is known for a nominal system $\dot{x}=f(x,t)$, then the control input $u=Y(x,t)\hat{a}$ with adaptation law $\dot{\hat{a}}=-PY^{\mathsf{T}}\nabla_x V(x,t)$ will return the perturbed system $\dot{x}=f(x,t)+u-Y(x,t)a$ back to its nominal behavior.
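As a concrete numerical illustration of this interpretation, the following minimal sketch simulates the law $\dot{\hat{a}}=-PY^{\mathsf{T}}\nabla_x Q$ with $Q=\frac{1}{2}(x-x_d)^2$. The scalar system, regressor, desired trajectory, and gains are illustrative choices for the example, not taken from the article.

```python
import numpy as np

# Velocity gradient adaptation for the scalar system x' = u - Y(x) a with
# u = Y(x) a_hat + u_d and goal functional Q = (1/2)(x - x_d)^2, so that
# a_hat' = -P Y^T grad_x Q = -P Y (x - x_d).

def simulate(T=20.0, dt=1e-3, eta=2.0, P=5.0):
    a = np.array([1.5, -0.7])                  # true (unknown) parameters
    a_hat = np.zeros(2)                        # parameter estimates
    x = 1.0
    errs = []
    for k in range(int(T / dt)):
        t = k * dt
        x_d, xd_dot = np.sin(t), np.cos(t)     # desired trajectory
        Y = np.array([x, np.tanh(x)])          # regressor (illustrative)
        u_d = xd_dot - eta * (x - x_d)         # nominal tracking control
        u = Y @ a_hat + u_d
        grad_Q = x - x_d                       # grad_x Q
        a_hat = a_hat - dt * P * Y * grad_Q    # a_hat' = -P Y^T grad_x Q
        x = x + dt * (u - Y @ a)               # Euler step of the true dynamics
        errs.append(abs(x - x_d))
    return errs

errs = simulate()
```

With these gains the tracking error decays as the Lyapunov argument predicts, even though the parameter estimates need not converge to the true values.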

$R$ is a nonnegative function, and $R(x(t),\hat{a}(t),t)$ is uniformly continuous in $t$ for bounded $x$ and $\hat{a}$. Furthermore, $\nabla_{\hat{a}}R(x,\hat{a},t)$ is locally bounded in $x$ and $\hat{a}$ uniformly in $t$.

There exists an ideal set of controller parameters $a$ and a scalar function $\mu$ such that $\int_0^\infty\mu(t')\,dt'<\infty$, $\lim_{t\to\infty}\mu(t)=0$, and $R(x(t),a,t)\le\mu(t)$ for all $t$.

The proof follows by consideration of the Lyapunov-like function $V=\int_0^t R(x(t'),\hat{a}(t'),t')\,dt'+\frac{1}{2}\tilde{a}^{\mathsf{T}}P^{-1}\tilde{a}+\int_t^\infty\mu(t')\,dt'$.

Integral functionals allow the specification of a control goal that depends on all past data. $R(x,\hat{a},t)$ is chosen so that it does not necessarily depend on the structure of the dynamics but depends explicitly on $\hat{a}$. Local functionals, on the other hand, result in adaptation laws that *do* have an explicit dependence on the dynamics through the appearance of the term $\frac{\partial Q}{\partial x}^{\mathsf{T}}\dot{x}$ in $\dot{Q}(x,\hat{a},t)$.

Integral functionals can be particularly useful if $R(x,\hat{a},t)\to 0$ implies the desired control goal. In this work, we focus on the choice $R(x,\hat{a},t)=\frac{1}{2}\tilde{f}(x,\hat{a},a,t)^2$, which will require a PI form as described in section 2 in the context of Tyukin's algorithm.^{7} In particular, note that for this choice of $R$, the result of proposition 4 implies that $\tilde{f}\in L_2$ over the maximal interval of existence of $x$. For some error models, this is enough to ensure that $x\in L_\infty$, and hence that $\tilde{f}(x,\hat{a},a,t)\to 0$ and $x\to x_d$.^{8}

Goal functionals can also be written as a sum of local and integral functionals with similar guarantees, and these approaches will lead to composite algorithms in the subsequent sections. (See Fradkov et al., 1999, chap. 3 for more detail.)

### 5.2 The Bregman Lagrangian and Accelerated Optimization Algorithms

The quantities $\bar{\alpha}(t):\mathbb{R}_+\to\mathbb{R}$, $\bar{\beta}(t):\mathbb{R}_+\to\mathbb{R}$, and $\bar{\gamma}(t):\mathbb{R}_+\to\mathbb{R}$ in equation 5.4 are arbitrary time-dependent functions that will ultimately set the damping and learning rates in the second-order Euler-Lagrange dynamics. To generate accelerated optimization algorithms, Wibisono, Wilson, and Jordan (2016) required two ideal scaling conditions: $\dot{\bar{\beta}}\le e^{\bar{\alpha}}$ and $\dot{\bar{\gamma}}=e^{\bar{\alpha}}$. These conditions originate from the Euler-Lagrange equations, where the second is used to eliminate an unwanted term, and from a Lyapunov argument, where the first is used to ensure decrease of a chosen Lyapunov function.
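For reference, the Bregman Lagrangian of Wibisono et al. (2016), of which equation 5.4 is an adaptive control variant (via the modification of Gaudio et al., 2019), takes the form

```latex
\mathcal{L}\left(x,\dot{x},t\right)
  = e^{\bar{\alpha}(t)+\bar{\gamma}(t)}
    \left( D_{\psi}\left( x + e^{-\bar{\alpha}(t)}\dot{x},\, x \right)
           - e^{\bar{\beta}(t)} f(x) \right),
\qquad
D_{\psi}(y,x) = \psi(y) - \psi(x) - \nabla\psi(x)^{\mathsf{T}}(y - x).
```

Here $f$ is the loss to be minimized and $D_\psi$ is the Bregman divergence generated by the convex function $\psi$.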

Here, $\gamma\ge 0$ and $\beta\ge 0$ are scalar hyperparameters, and $N=N(t)$ is a signal chosen based on the system. With these definitions, choosing the squared Euclidean norm $\psi(\cdot)=\frac{1}{2}\|\cdot\|^2$ and modifying the Bregman Lagrangian presented in Gaudio et al. (2019) to the adaptive control framework defined in section 2, equation 5.4 becomes

As noted in section 5.1, this is precisely the $\dot{Q}$ velocity gradient functional that gives rise to the Slotine and Li controller. For equation 5.5, the Euler-Lagrange equations lead to the adaptation law,

Consider the higher-order adaptation algorithm, equation 5.6, with $N=1+\mu\|Y\|^2$ and $\mu>\frac{\gamma}{\eta\beta}$. Then all trajectories $(x,\hat{v},\hat{a})$ remain bounded, $s\in L_\infty\cap L_2$, $\hat{a}-\hat{v}\in L_2$, $s\to 0$, and $x\to x_d$.

The proof follows by consideration of the Lyapunov-like function $V=\frac{1}{2}s^2+\frac{1}{\gamma}\|\tilde{v}\|^2+\frac{1}{\gamma}\|\hat{v}-\hat{a}\|^2$.
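To make the structure of such higher-order laws concrete, here is a minimal numerical sketch. The two-phase form below, $\dot{\hat{v}}=-\gamma Y^{\mathsf{T}}s$ with $\dot{\hat{a}}=\beta N(\hat{v}-\hat{a})$, is an assumption for illustration, motivated by the relation $\hat{v}=\hat{a}+\dot{\hat{a}}/(\beta N)$ noted in section 7; the scalar system, regressor, and gains are likewise illustrative.

```python
import numpy as np

# Momentum ("higher-order") adaptation sketch with state-dependent damping
# signal N = 1 + mu * ||Y||^2. Assumed form (see lead-in):
#   v_hat' = -gamma * Y^T * s,    a_hat' = beta * N * (v_hat - a_hat).

def simulate(T=20.0, dt=2e-4, eta=2.0, gamma=2.0, beta=1.0, mu=2.0):
    a = np.array([1.5, -0.7])                  # true (unknown) parameters
    a_hat, v_hat = np.zeros(2), np.zeros(2)
    x = 1.0
    errs = []
    for k in range(int(T / dt)):
        t = k * dt
        x_d, xd_dot = np.sin(t), np.cos(t)
        Y = np.array([x, np.tanh(x)])
        s = x - x_d                            # tracking error
        u = Y @ a_hat + xd_dot - eta * s
        N = 1.0 + mu * (Y @ Y)                 # state-dependent damping signal
        v_hat = v_hat - dt * gamma * Y * s
        a_hat = a_hat + dt * beta * N * (v_hat - a_hat)
        x = x + dt * (u - Y @ a)
        errs.append(abs(s))
    return errs

errs = simulate()
```

Note that $\mu=2$ satisfies $\mu>\gamma/(\eta\beta)=1$ here, mirroring the flavor of the condition in the proposition.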

It is well known, for example from a passivity interpretation of the Lyapunov-like analysis (see, e.g., Slotine & Li, 1991), that the pure integrator in the standard Slotine and Li adaptation law, equation 2.8, can be replaced by any linear positive real transfer function containing a pure integrator. The higher-order algorithms presented in this work are distinct from this approach, as most clearly seen by the state-dependent damping term in equation 5.6.

In Wibisono et al. (2016), the suggested Lyapunov function in the Euclidean setting is $V=\|x+e^{-\bar{\alpha}}\dot{x}-x^*\|^2+e^{\bar{\beta}}f(x)$, where $x^*$ is the global optimum and $f(x)$ is the loss function. Noting that $\hat{v}$ is the equivalent of $x+e^{-\bar{\alpha}}\dot{x}$ in the adaptive control context (see remark 29), we see that the Lyapunov-like function used to prove stability of the adaptive law, equation 5.6, is similar to that used to prove convergence in the optimization context. The loss function term $f(x)$ is replaced by $\frac{1}{2}s^2$, and it is necessary to add the term $\frac{1}{\gamma}\|\hat{v}-\hat{a}\|^2$.

## 6 Adaptation Laws with Momentum

In this section, we develop several new adaptation laws for both linearly and nonlinearly parameterized systems. We begin by noting that the Bregman Lagrangian generates velocity gradient algorithms with momentum. We prove some general conditions under which these momentum algorithms will achieve tracking. By analogy with integral velocity gradient functionals, we then derive a proportional-integral scheme to implement a first-order composite adaptation law (Slotine & Li, 1991) driven directly by the function approximation error rather than its filtered version. We subsequently fuse the generating functional for the composite law with the Bregman Lagrangian to construct a composite algorithm with momentum.

We then employ a connection between recent developments in isotonic regression—the GLMTron of Kakade et al. (2011), along with extensions due to Goel and Klivans (2017) and Goel et al. (2018)—and Tyukin's algorithm, equation 2.12, to derive momentum algorithms for nonlinearly parameterized systems. These momentum algorithms can be seen as the adaptive control equivalent of the GLMTron with momentum.

We follow this development by discussing a new form of high-order algorithm inspired by the elastic averaging stochastic gradient descent (EASGD) algorithm (Boffi & Slotine, 2020; Zhang, Choromanska, & LeCun, 2014). We subsequently demonstrate the capability of using time-varying learning rates with our presented algorithms (Slotine & Li, 1991).

### 6.1 Velocity Gradient Algorithms with Momentum

Assumption 9 is a formal statement that we may “complete the square” on the left-hand side of 6.4. For example, for $\nabla_{\hat{a}}\dot{Q}(x,\hat{a},t)=Y^{\mathsf{T}}s$ and for $\dot{Q}(x,a,t)=-\eta s^2$, we may choose $N=\|Y^{\mathsf{T}}\|^2$.

With assumption 9 in hand, we can state the following proposition.

By radial unboundedness of $Q(x,t)$ in $x$, equations 6.5 and 6.6 show that $x$ remains bounded. Similarly, radial unboundedness of $V$ in $\tilde{v}$ and $\hat{a}-\hat{v}$ shows that $\hat{v}$ and $\hat{a}$ remain bounded. Integrating equation 6.6 shows that $\frac{\beta}{\gamma}\int_0^\infty\|\hat{a}-\hat{v}\|^2\,dt\le V(0)-V(\infty)<\infty$, so that $\hat{a}-\hat{v}\in L_2$. An identical argument shows that $\int_0^\infty\rho(Q)\,dt<\infty$. Now, because $x$ and $\hat{a}$ are bounded and because $\tilde{f}(x,\hat{a},a,t)$ is locally bounded in $x$ and $\hat{a}$ uniformly in $t$ by assumption, writing $x(t)-x(s)=\int_s^t\left[f(x(t'),a,t')+u(\hat{a}(t'),t')\right]dt'$ shows that $x(t)$ is uniformly continuous in $t$. Because $Q(x,t)$ is uniformly continuous in $t$ when $x$ is bounded, because $Q$ is bounded, and because $\rho$ is continuous in $Q$, we conclude that $\rho$ is uniformly continuous in $t$ and that $\lim_{t\to\infty}\rho(t)=\lim_{t\to\infty}\rho(Q(x(t),t))=0$ by Barbalat's lemma (lemma 71). This shows that $\lim_{t\to\infty}Q(x(t),t)=0$.

and now require a modified version of assumption 9.

for some constant $k>0$.

Similar to assumption 9, assumption 10 is a formal requirement that we may “complete the square.” Consider the case when $R(x,\hat{a},t)=\frac{1}{2}\tilde{f}(x,\hat{a},a,t)^2$. Then $R(x,a,t)=0$ and $\nabla_{\hat{a}}R(x,\hat{a},t)=\tilde{f}(x,\hat{a},a,t)\nabla_{\hat{a}}\hat{f}(x,\hat{a},t)$, and we may choose $N(t)=\|\nabla_{\hat{a}}\hat{f}(x,\hat{a},t)\|^2$.

With assumption 10, we can state the following proposition.

Consider algorithm 6.7 along with assumptions 6 and 10. Let $T_x$ denote the maximal interval of existence of $x(t)$. Then $\hat{v}$ and $\hat{a}$ remain bounded for $t\in[0,T_x]$, $\hat{a}-\hat{v}\in L_2$ over this interval, and $\int_0^{T_x}R(x(t'),\hat{a}(t'),t')\,dt'<\infty$. Furthermore, for any bounded solution $x$, these conclusions hold for all $t$, and $R(x(t),\hat{a}(t),t)\to 0$.

The proof again concludes with an application of Barbalat's lemma (lemma 71).

As mentioned in section 5.1, we will be particularly interested in this proposition when $R=\frac{1}{2}\tilde{f}^2$, which will generate composite adaptation algorithms and algorithms applicable to nonlinearly parameterized systems. The proposition then shows that $\tilde{f}\in L_2$ over the interval of existence of $x(t)$. As shown by lemma 18, with our error model this is enough to show that $x(t)$ always remains bounded, and hence $\tilde{f}\to 0$.

Classically, Lyapunov-like functions used in adaptive control consist of a sum of tracking and parameter estimation error terms, with $\dot{\hat{a}}$ chosen to cancel a term of unknown sign. Several Lyapunov functions in this work consist only of parameter estimation error terms, such as equation 6.10. From a mathematical point of view, all that matters is that $\dot{V}$ is negative semidefinite and contains signals related to the tracking error. Integrating $\dot{V}$ allows the application of tools from functional analysis to ensure that the control goal is accomplished. The lack of a tracking error term in $V$ is the origin of the additional complication that $x(t)$ must be shown to be bounded even after it is known that $\dot{V}\le 0$.

### 6.2 First- and Second-Order Composite Adaptation Laws

Here we consider the linearly parameterized setting $f(x,a,t)=Y(x,t)a$ and derive new first- and second-order composite adaptation laws. Composite adaptation laws are driven by two sources of error: the tracking error itself, as summarized by $s$ in the Slotine and Li controller, and a prediction error. The prediction error term is generally obtained from an algebraic relation constructed by filtering the dynamics (Slotine & Li, 1991). We present a composite algorithm that does not require any explicit filtering of the dynamics but is instead driven simultaneously by $s$ and $\tilde{f}$.
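A minimal numerical sketch of such a composite law follows. The form $\dot{\hat{a}}=-PY^{\mathsf{T}}(\gamma s+\kappa\tilde{f})$ is a simplification for illustration, and $\tilde{f}=Y\tilde{a}$ is evaluated here with oracle access to the true parameters; the actual algorithm produces this signal through the PI form without knowledge of $a$.

```python
import numpy as np

# Composite adaptation sketch driven by both the tracking error s and the
# function approximation error f_tilde = Y (a_hat - a). The f_tilde term
# below uses the true parameters (oracle, for illustration only).

def simulate(T=20.0, dt=1e-3, eta=2.0, P=2.0, gamma=1.0, kappa=1.0):
    a = np.array([1.5, -0.7])
    a_hat = np.zeros(2)
    x = 1.0
    f_errs = []
    for k in range(int(T / dt)):
        t = k * dt
        x_d, xd_dot = np.sin(t), np.cos(t)
        Y = np.array([x, np.tanh(x)])
        s = x - x_d
        f_tilde = Y @ (a_hat - a)              # oracle value, illustration only
        a_hat = a_hat - dt * P * Y * (gamma * s + kappa * f_tilde)
        x = x + dt * (Y @ a_hat + xd_dot - eta * s - Y @ a)
        f_errs.append(abs(f_tilde))
    return f_errs

f_errs = simulate()
```

The extra $\kappa$-term directly damps the approximation error along the regressor direction, which is the qualitative benefit of composite adaptation.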

The $Y\tilde{a}$ term may also be obtained by following the I&I formalism (Astolfi & Ortega, 2003; Liu et al., 2010). To our knowledge, this discussion is the first to demonstrate the possibility of using a PI law in combination with a standard Lyapunov stability-motivated adaptation law to obtain a composite law.

More error signals may be used for additional terms in the adaptation law. For example, a prediction error obtained by filtering the dynamics may also be employed, leading to a three-term composite algorithm.

As mentioned in section 2.1, for clarity of presentation we have restricted our discussion to the $n$th-order system 2.1. In general, the PI form 6.16 leads to undesired unknown terms contained in $\frac{\partial\xi(x,x_d)}{\partial x}^{\mathsf{T}}\dot{x}$ in addition to the desired unknown term. In this case, the desired unknown term is $-\kappa PY^{\mathsf{T}}Y\tilde{a}$, while the undesired unknown term is $-\kappa P\frac{\partial Y}{\partial x_n}\dot{x}_n s$. Indeed, the purpose of introducing the additional proportional term $\rho(x,x_d)$ in equation 6.14 is to cancel this undesired unknown term. In general, cancellation of the undesired terms can be obtained by choosing $\rho$ to solve a PDE, and solutions to this PDE will exist only if the undesired term is the gradient of an auxiliary function; $\rho$ is then chosen to be exactly this auxiliary function. In some cases, the PDE can be avoided, such as through dynamic scaling techniques (Karagiannis, Sassano, & Astolfi, 2009) or the similar embedding technique of Tyukin (2011).

The properties of the adaptive law, equation 6.13, may be summarized with the following proposition.

Consider the adaptation algorithm 6.13 with a linearly parameterized unknown, $f(x,a,t)=Y(x,t)a$. Then all trajectories $(x,\hat{a})$ remain bounded, $s\in L_2\cap L_\infty$, $\tilde{f}\in L_2$, $s\to 0$, and $x\to x_d$.

The proof is given in section A.1.

Consider the higher-order composite adaptation algorithm, equation 6.19, for a linearly parameterized unknown, $f(x,a,t)=Y(x,t)a$. Set $N=1+\mu\|Y\|^2$ and $\mu>\frac{\gamma}{\beta}\left(\frac{1}{\eta}+\frac{\kappa}{\gamma}\right)$. Then all trajectories $(x,\hat{v},\hat{a})$ remain bounded, $\|\hat{v}-\hat{a}\|\in L_2$, $s\in L_\infty\cap L_2$, $\tilde{f}\in L_\infty\cap L_2$, $s\to 0$, and $x\to x_d$.

The proof is given in section A.2.

By following the proof, the signal $N$ may alternatively be chosen to be matrix-valued as $N=I+\mu Y^{\mathsf{T}}Y$.

The $Y\tilde{a}$ term may be used in isolation, by considering the Lyapunov function $V=\frac{1}{2}\|\tilde{v}\|^2+\frac{1}{2}\|\hat{a}-\hat{v}\|^2$.

A gain matrix $P=P^{\mathsf{T}}>0$ of appropriate dimension may be placed in front of $Y^{\mathsf{T}}$ in $\dot{\hat{v}}$. The quadratic parameter estimation error terms in the Lyapunov function should then be replaced by the weighted terms $\frac{1}{2}\tilde{v}^{\mathsf{T}}P^{-1}\tilde{v}+\frac{1}{2}(\hat{a}-\hat{v})^{\mathsf{T}}P^{-1}(\hat{a}-\hat{v})$, and bounds on $\mu$ will be given in terms of $\|P\|$.

### 6.3 A Momentum Algorithm for Nonlinearly Parameterized Adaptive Control

We now use the development in section 6.2 to present a new momentum algorithm applicable when the unknown parameters appear nonlinearly in the dynamics. We begin with an analogy to statistics.

Generalized linear model (GLM) regression is an extension of linear regression where the data are assumed to be generated by a function of the form $f(x)=u(w^{\mathsf{T}}x)$ for a known “link function” $u$ and unknown parameters $w$. The first computationally and statistically efficient algorithm for this problem, the GLMTron of Kakade et al. (2011), assumes that $u$ is Lipschitz and monotonic, much like assumption 2.

The GLMTron algorithm was recently extended to the setting of kernel methods and was subsequently used to provably learn two hidden-layer neural networks by Goel and Klivans (2017); this extension is known as the Alphatron. In the kernel GLM setting handled by the Alphatron, the function to be approximated is assumed to be of the form $f(x)=u\left(\sum_{i=1}^m w_i K(x,x_i)\right)$, where $K$ is the kernel function for a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$. $K$ is thus given by the RKHS inner product of a feature map $\varphi$, $K(x,y)=\langle\varphi(x),\varphi(y)\rangle_{\mathcal{H}}$.
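For intuition, the GLMTron iteration itself is short. The following sketch implements the update of Kakade et al. (2011) on synthetic data; the data distribution and link function below are illustrative choices.

```python
import numpy as np

# GLMTron: for data y_i = u(w . x_i) with u Lipschitz and monotonic,
# iterate the gradient-like update
#   w <- w + (1/m) * sum_i (y_i - u(w . x_i)) * x_i.

def glmtron(X, y, u, iters=500):
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(iters):
        w = w + (1.0 / m) * X.T @ (y - u(X @ w))
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) / np.sqrt(5)   # rows with norm roughly 1
w_true = rng.normal(size=5)
u = np.tanh                                  # 1-Lipschitz, monotonic link
y = u(X @ w_true)                            # noiseless GLM data
w_hat = glmtron(X, y, u)
err = np.mean((u(X @ w_hat) - y) ** 2)       # squared prediction error
```

Note the update never differentiates the link function, which is exactly the structural feature shared with Tyukin's algorithm under assumption 2.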

The proof is given in section A.3.

Equation 6.23 may be obtained via the Bregman Lagrangian, equation 6.18, for velocity gradient laws with momentum by choosing only the integral term. It is then necessary to modify the resulting Euler-Lagrange equations by setting $f_m'$ to $\pm 1$ based on the monotonicity of $\tilde{f}$, as described in section 2.2. The following proposition summarizes the properties of equations 6.24 and 6.25.

Consider algorithm 6.23, or its equivalent form, equations 6.24 and 6.25, under assumption 2 with $N=1+\mu\|\alpha(x,t)\|^2$ and $\mu>\frac{\gamma D_1}{\beta}$. Then all trajectories $(x,\hat{a},\hat{v})$ remain bounded, $\tilde{f}\in L_2$, $\hat{a}-\hat{v}\in L_2$, $s\in L_2\cap L_\infty$, $s\to 0$, and $x\to x_d$.

The proof is given in section A.4.

As noted in remark 43, by following the proof of proposition 11, one may also take $N$ to be matrix-valued as $N=I+\mu\alpha(x,t)\alpha(x,t)^{\mathsf{T}}$.

As in remark 45, a gain matrix $P=P^{\mathsf{T}}>0$ of appropriate dimension may be placed in front of $\alpha(x,t)$ in $\dot{\hat{v}}$.

Predominantly inspired by deep learning, there has recently been strong interest in nonconvex models that are nevertheless amenable to gradient-based or gradient-inspired optimization. The development in this section suggests that machine learning models that can be provably optimized using gradient techniques represent a promising class of nonlinear parameterizations for adaptive control development.

### 6.4 The Elastic Modification

Boffi and Slotine (2020) observed that in the nondistributed $(p=1)$ case, equations 6.26 and 6.27 do not reduce to standard stochastic gradient descent, and that application of EASGD in this setting has different generalization properties from those of standard SGD when used to train deep neural networks. In a similar spirit, by construction of suitable Lyapunov functions, we now show that adding a center-like variable to the adaptive laws considered in previous sections maintains their stability. This immediately gives rise to a new class of higher-order adaptive control algorithms. Interestingly, these algorithms do not seem to admit an equivalent representation in terms of a single second-, third-, or fourth-order system for $a^$, but must be written as a system of first-order equations.

The algorithms considered in this subsection immediately extend to the case of cloud-based adaptation for networked robotic systems (Wensing & Slotine, 2018), where the center variable is allowed to have its own dynamics as in equation 6.27 rather than simply representing the instantaneous spatial average of the distributed parameters.

The proof is given in section A.5.

The proof is given in section A.6.

Consider the algorithm, equations 6.32 to 6.34, under assumption 2. Set $\frac{1}{3}\le k<1$, $N=1+\mu\|\alpha(x,t)\|^2$, and $\mu>\frac{2D_1\gamma}{\beta(1-k)}$. Then all trajectories $(x,\hat{a},\hat{v},\bar{a})$ remain bounded, $\tilde{f}\in L_2\cap L_\infty$, $s\in L_2\cap L_\infty$, $\hat{a}-\hat{v}\in L_2$, $\hat{a}-\bar{a}\in L_2$, $s\to 0$, and $x\to x_d$.

Consider the algorithm, equations 6.35 to 6.37, under assumption 2. Set $\rho<2\beta$, $N=1+\mu\|\alpha(x,t)\|^2$, and $\mu>\frac{\gamma D_1}{\beta}$. Then all trajectories $(x,\hat{a},\hat{v},\bar{v})$ remain bounded, $\tilde{f}\in L_2\cap L_\infty$, $s\in L_2\cap L_\infty$, $\hat{v}-\bar{v}\in L_2$, $\hat{v}-\hat{a}\in L_2$, $s\to 0$, and $x\to x_d$.

Consider the algorithm, equations 6.38 to 6.41, under assumption 2. Set $\rho<\beta(1-k)$, $\frac{1}{3}\le k<1$, $N=1+\mu\|\alpha(x,t)\|^2$, and $\mu>\frac{2\gamma D_1}{\beta(1-k)}$. Then all trajectories $(x,\hat{v},\bar{v},\hat{a},\bar{a})$ remain bounded, $\tilde{f}\in L_2\cap L_\infty$, $s\in L_2\cap L_\infty$, $\hat{v}-\bar{v}\in L_2$, $\hat{a}-\bar{a}\in L_2$, $\hat{v}-\hat{a}\in L_2$, $s\to 0$, and $x\to x_d$.

The proof is given in section A.9.

We have thus shown that all Euclidean adaptive control algorithms presented in this article,^{10} as well as the classic algorithm of Slotine and Li, can be modified to include feedback coupling to a low-pass filtered version of the adaptation variables. It is well known that iterate averaging for stochastic optimization algorithms such as stochastic gradient descent can improve convergence rates via variance reduction (Polyak & Juditsky, 1992). The elastic modification is similar in spirit but employs feedback rather than series coupling. This suggests that adding the elastic term may improve robustness of adaptation algorithms, and we leave a theoretical investigation of this conjecture for future work.
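To convey the flavor of the elastic coupling in the simplest possible setting, the following sketch applies an EASGD-style feedback coupling to plain gradient descent on a quadratic cost. The cost, gains, and step size are illustrative; the adaptive laws above couple $\hat{a}$ or $\hat{v}$ to a low-pass filtered center variable in an analogous way.

```python
import numpy as np

# Elastic (EASGD-style) coupling in the nondistributed case: gradient
# descent on (1/2) a^T A a with feedback to a filtered center variable.

def elastic_gd(steps=2000, lr=0.05, rho=0.5):
    A = np.array([[3.0, 0.5], [0.5, 1.0]])       # SPD quadratic cost
    a = np.array([2.0, -1.5])                    # worker variable
    a_bar = np.zeros(2)                          # center (filtered) variable
    for _ in range(steps):
        grad = A @ a
        a = a - lr * (grad + rho * (a - a_bar))  # gradient plus elastic force
        a_bar = a_bar + lr * rho * (a - a_bar)   # center tracks the worker
    return a, a_bar

a, a_bar = elastic_gd()
```

Both variables converge to the minimizer, with the center acting as a feedback-coupled running average rather than a series-coupled (post hoc) iterate average.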

### 6.5 Exponential Forgetting Least Squares and Bounded Gain Forgetting

We focus on algorithms without the elastic modification of section 6.4; extension to the elastic modification is simple. We also focus on the bounded gain forgetting technique: proofs for the exponential forgetting least squares technique are identical, with the addition of an appropriate case statement in the time derivative of the Lyapunov function. For simplicity, we include only the time-dependent gain $P(t)$ and set the scalar gains $\kappa =\gamma =1$ where applicable.

We begin with the first-order non-filtered composite, equation 6.13, with $P$ given by equation 6.42. In this case, the composite algorithm may be implemented via the PI form equations 6.14 to 6.17, where now $P=P(t)$.
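For concreteness, the bounded gain forgetting update of Slotine and Li (1991) evolves the gain as $\dot{P}=\lambda(t)P-PYY^{\mathsf{T}}P$ with $\lambda(t)=\lambda_0(1-\|P\|/k_0)$, which keeps $\|P(t)\|$ below roughly $k_0$. A simulation sketch, with an arbitrary excitation signal chosen for the example, is:

```python
import numpy as np

# Bounded gain forgetting: the forgetting factor lambda(t) shuts off as
# ||P|| approaches the ceiling k0, while the quadratic term -P Y Y^T P
# shrinks the gain along excited directions.

def run(T=20.0, dt=1e-3, lambda0=1.0, k0=10.0):
    P = np.eye(2)
    norms = []
    for k in range(int(T / dt)):
        t = k * dt
        Y = np.array([np.sin(t), np.cos(2.0 * t)])      # excitation signal
        lam = lambda0 * (1.0 - np.linalg.norm(P, 2) / k0)
        P = P + dt * (lam * P - P @ np.outer(Y, Y) @ P)  # Euler step of P'
        norms.append(np.linalg.norm(P, 2))
    return norms

norms = run()
```

Under persistent excitation, the gain settles at a bounded, nondegenerate value, which is what makes the time-varying learning rate safe to combine with the adaptation laws above.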

Consider the adaptation algorithm, equation 6.13, with $P(t)$ given by equation 6.42, $\lambda(t)$ given by equation 6.44, and $\kappa=\gamma=1$. Then all trajectories $(x,\hat{a})$ remain bounded, $\tilde{f}\in L_2\cap L_\infty$, $s\in L_2\cap L_\infty$, $s\to 0$, and $x\to x_d$.

The proof is given in section A.10.

Consider the adaptation algorithm, equation 6.45, with $P(t)$ given by equation 6.42, $\lambda(t)$ given by equation 6.44, $N(t)=1+\mu\|Y\|^2$, and $\mu>\frac{3\eta+2}{2\eta\beta}$. Then all trajectories $(x,\hat{v},\hat{a})$ remain bounded, $\tilde{f}\in L_2\cap L_\infty$, $s\in L_2\cap L_\infty$, $s\to 0$, and $x\to x_d$.

The proof is given in section A.11.

We now consider Tyukin's first-order algorithm for nonlinearly parameterized systems, equation 2.12, with $P=P(t)$ given by equation 6.42. To do so, we require an additional assumption:

Together with assumption 2, assumption 11 (equation 6.48) states that $\tilde{f}$ lies between two linear functions. Given that the update, equation 6.42, is derived from recursive linear least squares considerations, it is unsurprising that such an assumption is required in the nonlinearly parameterized setting. We are now in a position to state the following proposition.

Consider the adaptation algorithm, equation 2.12, with $P(t)$ given by equation 6.42 and $\lambda(t)$ given by equation 6.44 for $\tilde{f}$ satisfying assumptions 2 and 11. Further assume that $D_1<2D_2^2$ or that $D_2>\frac{1}{2}$. Then all trajectories $(x,\hat{a})$ remain bounded, $\tilde{f}\in L_2$, $s\in L_2\cap L_\infty$, $s\to 0$, and $x\to x_d$.

The proof is given in section A.12.

Consider the adaptation algorithm, equation 6.49, with $P(t)$ given by equation 6.42, $\lambda(t)$ given by equation 6.44, $N=1+\mu\|\alpha\|^2$, and $\mu>\frac{4D_2-2+2D_1+1}{2\beta(4D_2-1)}$. Suppose $\tilde{f}$ satisfies assumptions 2 and 11, and further assume that $D_2>\frac{1}{2}$. Then all trajectories $(x,\hat{v},\hat{a})$ remain bounded, $\tilde{f}\in L_2$, $s\in L_2\cap L_\infty$, $s\to 0$, and $x\to x_d$.

The proof is given in section A.13.

As in remark 58, because $P(t)$ is uniformly bounded in $t$, it is not necessary to include $P(t)$ in equation 6.51, and it is simple to show by modification of the proof that the corresponding conclusions are retained.

## 7 Natural Momentum Algorithms

Consider the higher-order “natural” adaptation law, equation 7.2. Assume that $\psi$ is $l$-strongly convex, so that $\nabla^2\psi(\cdot)\ge lI$ globally. Take $N=1+\mu\|Y\|^2$ and $\mu>\frac{\gamma(1+l^{-1})^2}{4\beta\eta}$. Then all trajectories $(x,\hat{v},\hat{a})$ remain bounded, $s\in L_2\cap L_\infty$, $s\to 0$, and $x\to x_d$.

Consider the adaptation algorithm, equations 7.5 and 7.6, or the adaptation algorithm, equations 7.7 and 7.8. Assume that $\psi$ is $l$-strongly convex and $L$-smooth, so that $lI\le\nabla^2\psi(\cdot)\le LI$. Take $N=1+\mu\|Y\|^2$, and choose $\mu>\frac{\gamma(l+\gamma L)^2}{4\beta\eta l^3}$ in the former case and $\mu>\frac{\gamma(l+\gamma L)^2}{4\beta\eta l^2}$ in the latter. Then all trajectories $(x,\hat{v},\hat{a})$ remain bounded, $s\in L_2\cap L_\infty$, $s\to 0$, and $x\to x_d$.

The proof is presented in section A.15.

For efficient implementation of the proposed natural momentum algorithms, as well as their first-order equivalents, $\psi$ should be chosen so that $\nabla^2\psi(\cdot)^{-1}$ is efficiently computable and ideally sparse. Alternatively, if the inverse function of the gradient, $\nabla\psi^{-1}(\cdot)$, is efficiently computable, $\nabla\psi(\hat{a})$ or $\nabla\psi(\hat{v})$ may be updated directly and subsequently inverted to arrive at the parameter values. Discretization of these algorithms is a subtle issue: discretization of the $\dot{\hat{a}}$ and $\dot{\hat{v}}$ dynamics directly results in a natural gradient-like update (Amari, 1998), while discretization of the $\frac{d}{dt}\nabla\psi(\hat{a})$ and $\frac{d}{dt}\nabla\psi(\hat{v})$ dynamics leads to a mirror descent-like update (Beck & Teboulle, 2003; Nemirovski & Yudin, 1983); these discrete-time algorithms have the same continuous-time limit (Krichene, Bayen, & Bartlett, 2015).
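As an example of the second route, for $\psi(\cdot)=\frac{1}{2}\|\cdot\|_p^2$ (the choice used in our simulations, there with $p=1.05$), both $\nabla\psi$ and its inverse are available in closed form, since $(\nabla\psi)^{-1}=\nabla\psi^*$ with $\psi^*(\cdot)=\frac{1}{2}\|\cdot\|_q^2$ and $1/p+1/q=1$. A sketch of the resulting mirror descent-like step, with an illustrative step direction:

```python
import numpy as np

# For psi(a) = (1/2)||a||_p^2 the gradient map is
#   grad psi(a) = ||a||_p^(2-p) * sign(a) * |a|^(p-1),
# and its inverse is grad psi* with the dual exponent q = p/(p-1).

def grad_psi(a, p):
    norm = np.linalg.norm(a, p)
    if norm == 0.0:
        return np.zeros_like(a)
    return norm ** (2 - p) * np.sign(a) * np.abs(a) ** (p - 1)

def mirror_step(a_hat, update, p):
    q = p / (p - 1)                       # dual exponent, 1/p + 1/q = 1
    b = grad_psi(a_hat, p) - update       # step in the dual (mirror) space
    return grad_psi(b, q)                 # map back: (grad psi)^(-1) = grad psi*

a = np.array([0.5, -1.2, 2.0])
p = 1.5
a_roundtrip = grad_psi(grad_psi(a, p), p / (p - 1))   # inverts exactly
a_next = mirror_step(a, 0.05 * np.ones(3), p)         # one discretized step
```

For $p$ close to 1, repeated steps of this form bias the parameter estimates toward sparse vectors, consistent with the implicit regularization results of section 3.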

The above natural adaptation laws with momentum may be generalized to composite algorithms, as well as to algorithms for nonlinearly parameterized adaptive control, by replacing Euclidean norms by Bregman divergences where appropriate in the proofs of the corresponding Euclidean algorithms (see, e.g., the proofs of propositions 21 and 22). Rather than derive this for each algorithm, we now show how the general results on velocity gradient algorithms with momentum (see propositions 6 and 7) can be extended to the non-Euclidean setting. We start with the case of a local functional, which requires the modification of assumption 9 to an equivalent non-Euclidean version.

With assumption 12 in hand, we can state the following non-Euclidean equivalent of proposition 6. We focus on the variant equation 7.2, as the other possibilities are similar.

For the integral variant, we require a non-Euclidean version of assumption 10.

With assumption 13, we can state the following proposition.

The proof follows similarly to that of proposition 7.

The general methodology captured by the proofs of propositions 23 and 24, in combination with the results of section 6.3, may be exploited to derive non-Euclidean variants of our nonfiltered composite algorithm and our momentum algorithm for nonlinearly parameterized adaptive control. Note that the strong convexity and smoothness requirements of propositions 21 and 22, in combination with a suitable choice of $N(t)$, are one way to satisfy the requirements of assumptions 12 and 13.

Our implicit regularization results in section 3 also extend to the higher-order setting captured by algorithm 7.2. The assumption that $\hat{a}\to\hat{a}_\infty$ implies $\dot{\hat{a}}\to 0$. As noted in section 5.2, $\hat{v}=\hat{a}+\frac{\dot{\hat{a}}}{\beta N}$, and we thus conclude that under this assumption, $\hat{v}\to\hat{a}_\infty$. Because $\dot{\hat{v}}$ in equation 7.3 is identical to algorithm 3.2, the result follows. A formal statement of this fact is provided in section B.1.

## 8 Simulations

In this section, we perform several numerical experiments demonstrating the validity of our theory and consider a number of applications of our non-Euclidean adaptive laws.

### 8.1 Convergence and Implicit Regularization of a Momentum Algorithm for Nonlinearly Parameterized Systems

The PI form of algorithm 6.23 is given by

In all cases, $\beta=1$ and $\mu=\frac{3\gamma}{2\eta\beta}$. We set $\dim(a)=\dim(\hat{a})=500$ and randomly initialize $\hat{a}$ and $\hat{v}$ around zero from a normal distribution with standard deviation $10^{-3}$. The true parameter vector $a$ is drawn from a normal distribution with mean zero and standard deviation 7.5. The matrix $V$ is set to have normally distributed elements with standard deviation $\frac{1}{\dim(\hat{a})}$. The state vector is initialized such that $x(0)=x_d(0)$. The desired trajectory is taken to be

Figure 3A shows the function approximation error $\tilde{f}^2(x(t),\hat{a}(t),a)$ for each algorithm, along with a reference value for fixed $\hat{a}(t)=\hat{a}(0)$. Each algorithm, as expected from our theory and seen in the low tracking error in Figure 1A, pushes $\tilde{f}^2$ to zero despite the different forms of regularization imposed on the parameter vectors. Figures 3B to 3F show the control input as a function of time, along with the unique “ideal” control law $u(t)=\ddot{x}_d+f(x_d(t),a)$ valid when $x(0)=x_d(0)$. All control inputs can be seen to converge to the ideal law, though the rate of convergence depends on the choice of algorithm. The control input is of reasonable magnitude throughout adaptation for each algorithm.

### 8.2 Learning to Control with Primitives

We now demonstrate that the mirror descent-like laws of section 3 can be used to learn convex combinations of control primitives. Our approach is analogous to the use of multiplicative weight updates in machine learning and respects the natural $\ell_1$ geometry of the probability simplex.

In both cases, we initialize $\hat{\beta}_i=\frac{1}{N}$ and set $\gamma=0.25$.
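Discretized, the entropic mirror descent law takes the familiar multiplicative-weights form: each weight is multiplied by an exponentiated negative gradient and the vector is renormalized, so $\hat{\beta}$ remains on the simplex. The sketch below uses a synthetic gradient signal standing in for the true error-driven update.

```python
import numpy as np

# Multiplicative-weights step: exponentiate the negative gradient,
# multiply elementwise, and renormalize to stay on the simplex.

def mw_step(beta, g, gamma, dt):
    w = beta * np.exp(-gamma * dt * g)
    return w / w.sum()

N = 4
beta = np.full(N, 1.0 / N)                 # uniform initialization, beta_i = 1/N
rng = np.random.default_rng(1)
for _ in range(200):
    # noisy gradients; component 2 has the smallest mean loss gradient
    g = rng.normal(size=N) + np.array([1.0, 0.0, -1.0, 0.5])
    beta = mw_step(beta, g, gamma=0.25, dt=0.1)
```

The weights remain a valid convex combination at every step, and mass concentrates on the primitive with the smallest average gradient.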

### 8.3 Dynamics Prediction for Hamiltonian Systems

We set $k=5$, $\gamma=3.5$, and choose $\psi(\cdot)=\frac{1}{2}\|\cdot\|_{1.05}^2$ to identify basis functions relevant to the observed trajectory. We fix $m_i=1$ for all $i$ and initialize $q$ and $p$ to lock the system in an oscillatory mode. Past $t=10$, we set $k=\gamma=0$ and run the predictor open-loop, as well as perform shrinkage, setting all coefficients with magnitude below $10^{-3}$ formally equal to zero and leaving 13 remaining terms.
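The shrinkage step referred to above is a simple hard threshold on the learned coefficients (the example vector is illustrative):

```python
import numpy as np

# Hard-threshold shrinkage: coefficients with magnitude below tol are set
# exactly to zero before running the learned predictor open-loop.

def shrink(a_hat, tol=1e-3):
    a = np.array(a_hat, dtype=float)
    a[np.abs(a) < tol] = 0.0
    return a

a_hat = np.array([2.1e-1, -4.0e-4, 9.0e-4, 1.3])
a_sparse = shrink(a_hat)
```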

Results are shown in Figure 5. Figure 5A displays convergence of $x\u02dc$ to zero with adaptation (solid) and demonstrates that adaptation is necessary for convergence (dashed). When switching to the open-loop predictor past $t=10$, the system without adaptation sustains large errors, while the learned predictor maintains good performance. The inset displays a slow drift of the predictor trajectory $x^(t)$ from the true trajectory $x(t)$ when run open-loop. Figure 5B displays the state trajectory $x$ (dotted), convergence of $x^$ with adaptation (solid) to $x$, and the incorrect behavior of $x^$ without adaptation (dashed). The open-loop unlearned predictor tends to a fixed point, while the open-loop learned predictor maintains the correct oscillatory behavior. Figures 5C and 5D show parameter trajectories and asymptotically converged parameters, respectively. Together, the two panes demonstrate that the implicit bias of the algorithm ensures convergence to a sparse estimate of the system Hamiltonian.

### 8.4 Sparse Identification of Chemical Reaction Networks

Results are shown in Figure 6. In Figure 6A, we show convergence of the observer error to zero with adaptation (solid) and divergence away from zero without adaptation (dashed), demonstrating that adaptation is necessary for effective prediction. The inset displays a closer look at the asymptotic behavior of the open-loop dynamics predictor after shrinkage, which shows that the fixed point of the system is correctly learned. Figure 6B shows convergence of $x^$ (solid) to $x$ (dashed). Figure 6C displays parameter trajectories as a function of time. Many parameters stay at or near zero as predicted by proposition 1. The inset displays a finer-grained view around zero of the parameter trajectories. Figure 6D shows a histogram of the final parameter values learned by the adaptive dynamics predictor, demonstrating that only a few relevant terms are identified.

## 9 Conclusion and Future Directions

It is somewhat unusual in nonlinear control to have a choice between a large variety of algorithms that can all be proven to globally converge. Nevertheless, in this article, we have presented a suite of new globally convergent adaptive control algorithms. The algorithms combine the velocity gradient methodology (Fradkov, 1980; Fradkov et al., 1999) with the Bregman Lagrangian (Betancourt et al., 2018; Wibisono et al., 2016) to systematically generate velocity gradient algorithms with momentum. Based on analogies between isotonic regression (Goel & Klivans, 2017; Goel et al., 2018; Kakade et al., 2011) and algorithms for nonlinearly parameterized adaptive control (Tyukin, 2011; Tyukin et al., 2007), we extended our higher-order velocity gradient algorithms to the nonlinearly parameterized setting of generalized linear models. Using a similar parallel to distributed stochastic gradient descent algorithms (Boffi & Slotine, 2020; Zhang et al., 2014), we developed a stable modification of all of our algorithms. We subsequently fused our developments with time-dependent learning rates based on the bounded gain forgetting formalism (Slotine & Li, 1991).

By consideration of the non-Euclidean Bregman Lagrangian, we derived natural gradient (Amari, 1998) and mirror descent (Beck & Teboulle, 2003; Nemirovski & Yudin, 1983)–like algorithms with momentum. Taking the infinite friction limit of these algorithms recovers a recent algorithm for adaptive robot control (Lee et al., 2018) that respects physical Riemannian constraints on the parameters throughout adaptation. By extending recent results on the implicit bias of optimization algorithms in machine learning (Azizan & Hassibi, 2019; Azizan et al., 2019) to the continuous-time setting, we proved that these mirror descent-like algorithms in the first-order, second-order, and nonlinearly parameterized settings impose implicit regularization on the parameter vectors found by adaptive control.
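For the reader's reference, the Bregman Lagrangian of Wibisono et al. (2016) that generates these dynamics takes the following form in the optimization setting (a sketch in their notation, with $h$ a strictly convex distance-generating function, $D_h$ its Bregman divergence, and $\alpha_t$, $\beta_t$, $\gamma_t$ scaling functions satisfying their ideal-scaling conditions):

$$
\mathcal{L}\left(X, \dot{X}, t\right) = e^{\alpha_t + \gamma_t}\left( D_h\left(X + e^{-\alpha_t}\dot{X},\, X\right) - e^{\beta_t} f(X) \right).
$$

Its Euler-Lagrange equations yield accelerated mirror descent-like dynamics, and letting the effective friction grow without bound collapses them to a first-order mirror-descent flow of the form $\frac{d}{dt}\nabla h(X_t) \propto -\nabla f(X_t)$. In the adaptive laws discussed above, the loss gradient is replaced by a measurable tracking-error signal, such as a regressor multiplied by the sliding variable.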

Throughout the article, for simplicity of exposition, we focused on the $n$th-order system, equation 2.1. As discussed in remark 2, our results extend to more general systems that admit an error model similar to equation 2.6, in the sense that the proof technique summarized by lemma 72 is roughly preserved. The $n$th-order system structure makes the employed proportional-integral forms simple, as they can be written down explicitly as in equations 6.14 to 6.17. As summarized in remark 40, a PDE needs to be solved in the general case, and solutions to this PDE may not exist. Solution of the PDE can be avoided by the dynamic scaling technique of Karagiannis et al. (2009) or a similar embedding technique of Tyukin (2011).

A significant outstanding question is whether there is an empirical advantage to using our proposed momentum algorithms. In optimization, accelerated algorithms generated by the Bregman Lagrangian provide faster convergence when properly discretized, and it is thus likely that a careful discretization is necessary to obtain optimal performance of our momentum algorithms. However, we are not aware of any available convergence rates in adaptive control, and it would be necessary to prove such rates to understand analytically whether there is an advantage. Similar higher-order algorithms have appeared in the literature for linear systems of relative degree greater than one (Fradkov et al., 1999; Morse, 1992), where first-order algorithms cannot control the system. Here we have focused on feedback linearizable systems, and perhaps there exist classes of nonlinear systems that cannot be adaptively controlled with a first-order algorithm but can with a momentum algorithm. We leave the investigation of these interesting and important questions to future work.
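To make the discretization question concrete, the following is a minimal, hypothetical sketch (not from the article) of a forward-Euler discretization of a first-order mirror descent-like adaptation law on a scalar toy system $\dot{x} = a^\top Y(x) + u$ with unknown parameters $a$. The certainty-equivalence control $u = -kx - \hat{a}^\top Y(x)$ tracks $x_d = 0$, so the sliding variable is $s = x$; the regressor $Y$, the potential $\psi$, and all gains are illustrative choices, not tuned recommendations.

```python
import numpy as np

# Forward-Euler discretization of a first-order mirror-descent adaptation
# law. With the error convention s_dot = -k*s - (a_hat - a).T @ Y(x), the
# stabilizing update on the dual (mirror) variable z = grad_psi(a_hat) is
# z_dot = gamma * s * Y(x). The potential psi(a) = (1/p) * ||a||_p^p with
# p near 1 biases learned parameters toward sparsity, in the spirit of the
# implicit-regularization results discussed in the text.

def Y(x):
    # Regressor of basis functions, assumed known (hypothetical choice).
    return np.array([x, np.sin(x), x**3, np.cos(2.0 * x)])

def grad_psi(a, p):
    return np.sign(a) * np.abs(a) ** (p - 1.0)

def grad_psi_inv(z, p):
    return np.sign(z) * np.abs(z) ** (1.0 / (p - 1.0))

def simulate(p=1.2, k=2.0, gamma=2.0, dt=1e-3, T=60.0):
    a_true = np.array([0.0, 2.0, 0.0, 0.0])  # sparse ground truth
    x = 1.0
    z = grad_psi(np.zeros(4), p)             # dual variable, starts at zero
    for _ in range(int(round(T / dt))):
        a_hat = grad_psi_inv(z, p)
        y = Y(x)
        u = -k * x - a_hat @ y               # certainty-equivalence control
        x += dt * (a_true @ y + u)           # plant step (forward Euler)
        z += dt * (gamma * x * y)            # adaptation in mirror coordinates
    return x, grad_psi_inv(z, p)
```

With these illustrative constants the tracking error decays toward zero; whether a more careful (for example, symplectic or rate-matching) discretization of the momentum variants preserves their continuous-time guarantees is exactly the open question raised above.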

## Appendix A: Omitted Proofs and Required Results

Barbalat's lemma is a classical technique in adaptive control theory, used in conjunction with a Lyapunov-like analysis to prove convergence of a given signal.

Lemma 71 (Barbalat's Lemma; Slotine & Li, 1991). Assume that $\lim_{t \to \infty} \int_0^t |x(\tau)| \, d\tau < \infty$. If $x(t)$ is uniformly continuous, then $\lim_{t \to \infty} x(t) = 0$.

Note that a sufficient condition for uniform continuity of $x(t)$ is that $\dot{x}(t)$ be bounded. Hence, for any signal $x(t) \in L_2 \cap L_\infty$ with $\dot{x}(t) \in L_\infty$, we can apply lemma 71 to the signal $x^2(t)$ and conclude that $x(t) \to 0$.
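The role of uniform continuity in this corollary can be illustrated numerically. The sketch below (constructed for this note, not from the article) compares a smooth square-integrable signal with bounded derivative, which decays to zero, against a square-integrable train of ever-narrower spikes whose slopes grow without bound, which keeps returning to height one.

```python
import numpy as np

# x1(t) = exp(-t) * sin(t): in L2 and L_inf with bounded derivative,
# so by the corollary to Barbalat's lemma it must converge to zero.
t = np.linspace(0.0, 50.0, 500_001)
dt = t[1] - t[0]
x1 = np.exp(-t) * np.sin(t)

# x2: unit-height triangular spikes centered at t = n with half-width
# 1/n^2. The spike areas are summable, so the L2 norm is finite, but the
# slopes grow like n^2: x2 is not uniformly continuous and does not decay.
x2 = np.zeros_like(t)
for n in range(2, 50):
    w = 1.0 / n**2
    mask = np.abs(t - n) < w
    x2[mask] = 1.0 - np.abs(t[mask] - n) / w

l2_x1 = np.sum(x1**2) * dt   # finite Riemann-sum estimate of the L2 norm
l2_x2 = np.sum(x2**2) * dt   # also finite, despite the recurring spikes
```

Both signals have finite $L_2$ norm, yet only the uniformly continuous one converges to zero, matching the hypotheses of lemma 71.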

Lemma 72. Assume that $\int_0^t \tilde{f}^2(x(t'), \hat{a}(t'), a, t') \, dt' < \infty$ for all $t \in [0, T)$, where $[0, T)$ is the maximal interval of existence of $x(t)$. Further assume that $\hat{a}(t)$ is bounded over $[0, T)$, that both bounds are independent of $T$, and that $\tilde{f}$ is locally bounded in $x$ and $\hat{a}$ uniformly in $t$. Then $\hat{a} \in L_\infty$, $\tilde{f} \in L_2$, $s \in L_2 \cap L_\infty$, $s \to 0$, and $x \to x_d$.

By Barbalat's lemma (lemma 71), $s \to 0$. By definition of $s$, we then conclude that $x \to x_d$.

### A.1 Proof of Proposition 8

This immediately shows $s \in L_\infty$ and $\hat{a} \in L_\infty$. Because $s \in L_\infty$, $x \in L_\infty$ by definition of the sliding variable (Slotine & Li, 1991). Integrating $\dot{V}$ shows that $s \in L_2$ and $\tilde{f} \in L_2$. The result follows by application of lemma 72 or directly by Barbalat's lemma (lemma 71).

### A.2 Proof of Proposition 9

By Barbalat's lemma (lemma 71), $s \to 0$ and $x \to x_d$.

### A.3 Proof of Proposition 10

### A.4 Proof of Proposition 11

An application of lemma 72 completes the proof.

### A.5 Proof of Proposition 12

The result follows by Barbalat's lemma (lemma 71).

### A.6 Proof of Proposition 13

An application of lemma 72 completes the proof.

### A.7 Proof of Proposition 14

An application of lemma 72 completes the proof.

### A.8 Proof of Proposition 15

An application of lemma 72 completes the proof.

### A.9 Proof of Proposition 16

An application of lemma 72 completes the proof.

### A.10 Proof of Proposition 17

The result follows by application of lemma 72 or directly by Barbalat's lemma (lemma 71).

### A.11 Proof of Proposition 18

By Barbalat's lemma (lemma 71), $s \to 0$ and $x \to x_d$.

### A.12 Proof of Proposition 19

By an application of lemma 72, the proposition is proved.