Abstract

Stable concurrent learning and control of dynamical systems is the subject of adaptive control. Despite being an established field with many practical applications and a rich theory, much of the development in adaptive control for nonlinear systems revolves around a few key algorithms. By exploiting strong connections between classical adaptive nonlinear control techniques and recent progress in optimization and machine learning, we show that there exists considerable untapped potential in algorithm development for both adaptive nonlinear control and adaptive dynamics prediction. We begin by introducing first-order adaptation laws inspired by natural gradient descent and mirror descent. We prove that when there are multiple dynamics consistent with the data, these non-Euclidean adaptation laws implicitly regularize the learned model. Local geometry imposed during learning thus may be used to select parameter vectors—out of the many that will achieve perfect tracking or prediction—for desired properties such as sparsity. We apply this result to regularized dynamics predictor and observer design, and as concrete examples, we consider Hamiltonian systems, Lagrangian systems, and recurrent neural networks. We subsequently develop a variational formalism based on the Bregman Lagrangian. We show that its Euler Lagrange equations lead to natural gradient and mirror descent-like adaptation laws with momentum, and we recover their first-order analogues in the infinite friction limit. We illustrate our analyses with simulations demonstrating our theoretical results.

1  Introduction

Adaptation is an online learning problem concerned with control or prediction of the dynamics of an unknown nonlinear system. This task is accomplished by constructing an approximation to the true dynamics through the online adjustment of a vector of parameter estimates under the assumption that there exists a fixed vector of parameters that globally fits the dynamics. The overarching goal is provably safe, stable, and concurrent learning and control of nonlinear dynamical systems.

Adaptive control theory is a mature field, and many results exist tailored to specific system structures (Ioannou & Sun, 2012; Narendra & Annaswamy, 2005; Slotine & Li, 1991). An adaptive control algorithm typically consists of a parameter estimator coupled in feedback to the controlled system, and the estimator is often strongly inspired by gradient-based optimization algorithms. A significant difference between standard optimization algorithms and adaptive control algorithms is that the parameter estimator must not only converge to a set of parameters that leads to perfect tracking of the desired trajectory, but the system must remain stable throughout the process. The additional requirement of stability prevents the immediate application of optimization algorithms as adaptive control algorithms, and stability must be proved by jointly analyzing the closed-loop system and estimator.

Significant progress has been made in adaptive control even for nonlinear systems in the linearly parameterized setting, where the dynamics approximation is of the form f^=Y(x,t)a^ for some known regressor matrix Y(x,t) and vector of parameter estimates a^(t). Examples include the adaptive robot trajectory controller of Slotine and Li (1987) and the neural network-based controller of Sanner and Slotine (1992), which employs a mathematical expansion in physical nonlinear basis functions to uniformly approximate the unknown dynamics.

Unlike its linear counterpart, solutions to the adaptive control problem in the general nonlinearly parameterized setting f^=f(x,a^,t) have remained elusive. Intuitively, this is unsurprising: guarantees for gradient-based optimization algorithms typically rely on convexity, with a few notable exceptions such as the Polyak-Lojasiewicz condition (Polyak, 1963). In the linearly parameterized setting, the underlying optimization problem will be convex. When the parameters appear nonlinearly, the problem is in general nonconvex and difficult to provide guarantees for.

In this work, we provide new provably globally convergent algorithms for both the linearly and nonlinearly parameterized adaptive control problems, along with new insight into existing adaptive control algorithms for the linearly parameterized setting. Our results for nonlinearly parameterized systems are valid under the monotonicity assumptions of Tyukin, Prokhorov, and van Leeuwen (2007) and the convexity assumptions of Fradkov (1980). These monotonicity assumptions are equivalent to those commonly satisfied by generalized linear models in statistics (Kakade, Kalai, Kanade, & Shamir, 2011).

1.1  Description of Primary Contributions

Our contributions can be categorized into two main advances.

  1. We further develop a class of natural gradient and mirror descent-like algorithms that have recently appeared in the literature in the context of physically consistent inertial parameter learning in robotics (Lee, Kwon, & Park, 2018) and geodesically convex optimization (Wensing & Slotine, 2020). We prove that these algorithms implicitly regularize the learned system model in both the linearly parameterized and nonlinearly parameterized settings.

  2. We construct a general class of higher-order in-time adaptive control algorithms that incorporate momentum into existing adaptation laws. We prove that our new momentum algorithms are stable and globally convergent for both linearly parameterized and nonlinearly parameterized systems.

Unlike standard problems in optimization and machine learning, explicit regularization terms cannot be naively added to adaptive control algorithms without affecting stability and performance. Our approach enables a provably stable and globally convergent implementation of regularization in adaptive control. We demonstrate the utility of these results through examples in the context of dynamics prediction, such as sparse estimation of a physical system's Hamiltonian or Lagrangian function, and estimating the weights of a continuous-time recurrent neural network model.

It is well known in adaptive control that the true parameters are only recovered when the desired trajectory satisfies a strong condition known as persistent excitation (Narendra & Annaswamy, 2005; Slotine & Li, 1991). In general, an adaptation law need only find parameters that enable perfect tracking, and very little is known about what parameters are found when the estimator converges without persistent excitation. Our proof of implicit regularization provides an answer and shows that standard Euclidean adaptation laws lead to parameters of minimum l2 norm.

For the second contribution, we utilize the Bregman Lagrangian (Betancourt, Jordan, & Wilson, 2018; Wibisono, Wilson, & Jordan, 2016; Wilson, Recht, & Jordan, 2016) in tandem with the velocity gradient methodology (Andrievskii, Stotskii, & Fradkov, 1988; Fradkov, 1980, 1986; Fradkov, Miroshnik, & Nikiforov, 1999) to define a general formalism that generates higher-order in-time (Morse, 1992) velocity gradient algorithms. Our key insight is that the velocity gradient formalism provides an optimization-like framework that encompasses many well-known adaptive control algorithms and that the velocity gradient “loss function” can be placed directly in the Bregman Lagrangian.

1.2  Summary of Related Work

Our work continues in a long-standing tradition that utilizes a continuous-time view to analyze optimization algorithms, and here we consider a nonexhaustive list. Diakonikolas and Jordan (2019) develop momentum algorithms from the perspective of Hamiltonian dynamics, while Maddison, Paulin, Teh, O'Donoghue, and Doucet (2018) use Hamiltonian dynamics to prove linear convergence of new optimization algorithms without strong convexity. Muehlebach and Jordan (2019, 2020) study momentum algorithms from the viewpoint of dynamical systems and control. Boffi and Slotine (2020) analyze distributed stochastic gradient descent algorithms via dynamical systems and nonlinear contraction theory. Su, Boyd, and Candès (2016) provide an intuitive justification for Nesterov's accelerated gradient method (Nesterov, 1983) through a limiting differential equation. Continuous-time differential equations were used as early as 1964 by Polyak to derive the classical momentum or “heavy ball” optimization method (Polyak, 1964). In all cases, continuous time often affords simpler proofs, and it enables the application of physical intuition when reasoning about optimization algorithms. Given the gradient-based nature of many adaptive control algorithms, the continuous-time view of optimization provides a natural bridge from modern optimization to modern adaptive control.

Despite the simplicity of proofs in continuous time, finding a discretization that provably retains the convergence rates of a given differential equation is challenging. In a significant advance, Wibisono et al. (2016) showed that many accelerated methods in optimization can be derived via a variational point of view from a single mathematical object known as the Bregman Lagrangian. The Bregman Lagrangian leads to second-order mass-spring-damper-like dynamics, and careful discretization provides discrete-time algorithms such as Nesterov's celebrated accelerated gradient method (Nesterov, 1983). We similarly use the Bregman Lagrangian to generate our new adaptive control algorithms, which generalize and extend a recently developed algorithm due to Gaudio, Gibson, Annaswamy, and Bolender (2019).

Progress has been made in nonlinearly parameterized adaptive control in a number of specific cases. Annaswamy, Skantze, and Loh (1998), Ai-Poh Loh, Annaswamy, and Skantze (1999), and Kojić and Annaswamy (2002) develop stable adaptive control laws for convex and concave parameterizations, though they may be overly conservative and require solving optimization problems at each time step. Astolfi and Ortega (2003) and Liu, Ortega, Su, and Chu (2010) develop the immersion and invariance (I&I) approach, and prove global convergence if a certain monotone function can be constructed. Ortega, Gromov, Nuño, Pyrkin, and Romero (2019) use a similar approach for system identification. Tyukin et al. (2007) consider dynamical systems satisfying a monotonicity assumption that is essentially identical to conditions required for learning generalized linear models in machine learning and statistics (Goel & Klivans, 2017; Goel, Klivans, & Meka, 2018; Kakade et al., 2011), and develop provably stable adaptive control algorithms for nonlinearly parameterized systems in this setting. Fradkov (1980), Andrievskii et al. (1988), Fradkov (1986), and Fradkov et al. (1999) develop the velocity gradient methodology, an optimization-like framework for adaptive control that allows for provably global convergence under a convexity assumption. As mentioned in section 1.1, this framework, in tandem with the Bregman Lagrangian, is central to our development of momentum algorithms.

Our work is strongly related to and inspired by a line of recent work that analyzes the implicit bias of optimization algorithms in machine learning. Soudry, Hoffer, Nacson, Gunasekar, and Srebro (2018) and Gunasekar, Lee, Soudry, and Srebro (2018b, 2018a) characterize implicit regularization of common gradient-based optimization algorithms such as gradient descent with and without momentum, as well as natural gradient descent and mirror descent in the settings of regression and classification. Azizan, Lale, and Hassibi (2019) and Azizan and Hassibi (2019) arrive at similar results via a different derivation based on results from $H_\infty$ control. Similarly, Belkin, Hsu, Ma, and Mandal (2019) consider the importance of implicit regularization in the context of the successes of deep learning. Our results are the adaptive control analogues of those presented in these papers.

1.3  Paper Outline

The paper is organized as follows. In section 2, we present some required mathematical background on direct adaptive control in the linearly and nonlinearly parameterized settings. In section 3 we analyze the implicit bias of adaptive control algorithms, while in section 4 we consider general observer and dynamics predictor design, Hamiltonian dynamics prediction, control of Lagrangian systems, and estimation of recurrent neural networks. In section 5 we provide background for our development of momentum algorithms, including a review of the velocity gradient formalism (section 5.1) and the Bregman Lagrangian (section 5.2). In section 6 we present adaptive control algorithms with momentum, and we extend them to the non-Euclidean setting in section 7. We illustrate our results via simulation in section 8, and we conclude with some closing remarks and future directions in section 9.

2  Direct Adaptive Control

In this section, we provide an introduction to direct adaptive control for both linearly parameterized and nonlinearly parameterized systems, along with a description of some natural gradient-like adaptive laws that have appeared in the recent literature.

2.1  Linearly Parameterized Dynamics

For simplicity, we restrict ourselves to the class of $n$th-order nonlinear systems,
$$x^{(n)} + f(\mathbf{x},a,t) = u,$$
(2.1)
where $x^{(i)}\in\mathbb{R}$ denotes the $i$th derivative of $x$, $\mathbf{x} = \left(x, x^{(1)}, \ldots, x^{(n-1)}\right)^{\mathsf T}\in\mathbb{R}^n$ is the system state, $a\in\mathbb{R}^p$ is a vector of unknown parameters, $f : \mathbb{R}^n\times\mathbb{R}^p\times\mathbb{R}_+\to\mathbb{R}$ is of known functional form but is unknown due to its dependence on $a$, and $u\in\mathbb{R}$ is the control input. We seek to design a feedback control law $u = u(\mathbf{x},\hat{a})$ that depends on a set of adjustable parameters $\hat{a}\in\mathbb{R}^p$ and ensures that $\mathbf{x}(t)\to\mathbf{x}_d(t)$, where $\mathbf{x}_d(t)\in\mathbb{R}^n$ is a known desired trajectory. Along the way, we require that all system signals remain bounded. The estimated parameters $\hat{a}$ are updated according to a learning rule or adaptation law,
$$\dot{\hat{a}} = g(a,\hat{a},\mathbf{x}),$$
(2.2)
where $g : \mathbb{R}^p\times\mathbb{R}^p\times\mathbb{R}^n\to\mathbb{R}^p$ must be implementable solely in terms of known system signals despite its potential dependence on $a$. For $n$th-order systems as considered in equation 2.1, a common approach is to define the sliding variable (Slotine & Li, 1991),
$$s = \left(\frac{d}{dt} + \lambda\right)^{n-1}\tilde{x} = \tilde{x}^{(n-1)} - \tilde{x}_r^{(n-1)},$$
(2.3)
where $\lambda > 0$ is a constant and $\tilde{x}(t) = x(t) - x_d(t)$. We have defined $\tilde{x}^{(i)}(t) = x^{(i)}(t) - x_d^{(i)}(t)$, and $\tilde{x}_r^{(n-1)}$ collects the remaining lower-order terms implied by the definition of $s$. According to the definition, equation 2.3, $s$ obeys the differential equation,
$$\dot{s} = u - f(\mathbf{x},a,t) - \tilde{x}_r^{(n)}.$$
(2.4)
Hence, from equation 2.4, we may choose
$$u = f(\mathbf{x},\hat{a},t) + \tilde{x}_r^{(n)} - \eta s$$
(2.5)
to obtain the stable first-order linear filter:
$$\dot{s} = -\eta s + f(\mathbf{x},\hat{a},t) - f(\mathbf{x},a,t).$$
(2.6)
For future convenience, we define $\tilde{f}(\mathbf{x},\hat{a},a,t) = f(\mathbf{x},\hat{a},t) - f(\mathbf{x},a,t)$ and will omit its arguments when clear from the context. From the definition of $s$ in equation 2.3, $s = 0$ defines the dynamics
$$\left(\frac{d}{dt} + \lambda\right)^{n-1}\tilde{x} = 0.$$
(2.7)
Equation 2.7 is a stable $(n-1)$th-order filter, which ensures that $\tilde{x}\to 0$ exponentially. For systems of the form in equation 2.1, it is thus sufficient to consider the two first-order dynamics, equations 2.2 and 2.6. The adaptive control problem has thus been reduced to finding a learning algorithm that ensures $s\to 0$.
Remark 1.
Systems in the matched uncertainty form
$$\dot{\mathbf{x}} = A\mathbf{x} + \mathbf{b}\left(u - f(\mathbf{x},a,t)\right),$$
where the constant pair $(A,\mathbf{b})$ is controllable and the constant parameter vector $a$ in the nonlinear function $f(\mathbf{x},a,t)$ is unknown, can always be put in the form of equation 2.1 by using a state transformation to the second controllability canonical form (see Luenberger, 1979, chap. 8.8). After such a transformation, the new state variables $\mathbf{z}$ satisfy $\dot{z}_i = z_{i+1}$ for $i < n$ and $\dot{z}_n = -\sum_{i=1}^{n-1} c_i z_i + u - f(\mathbf{x},a,t)$ for some fixed constants $c_i$. Defining $s$ as in equation 2.3 and choosing $u$ accordingly leads to equation 2.6. Hence, all results in this article extend immediately to such systems.
Remark 2.

The fundamental utility of defining the variable $s$ is its conversion of the adaptive control problem for the $n$th-order system, equation 2.1, to an adaptive control problem for the first-order system, equation 2.6. Our results may be simply extended to other error models (Ai-Poh Loh et al., 1999; Narendra & Annaswamy, 2005) of the form 2.6, or error models with similar input-output guarantees, as summarized by lemma A.2.

Remark 3.

We will use $\mathbf{f}$ to denote the equivalent first-order system to equation 2.1, $\dot{\mathbf{x}} = \mathbf{f}(\mathbf{x},a,t) + \mathbf{u}$, where $\mathbf{f} = \left(x_2, x_3, \ldots, x_n, -f(\mathbf{x},a,t)\right)^{\mathsf T}$ and $\mathbf{u} = \left(0, 0, \ldots, u\right)^{\mathsf T}$.

The classic setting for adaptive control assumes that the unknown nonlinear dynamics depends linearly on the set of unknown parameters, that is,
$$f(\mathbf{x},a,t) = Y(\mathbf{x},t)\,a,$$
with $Y : \mathbb{R}^n\times\mathbb{R}_+\to\mathbb{R}^{1\times p}$ a known function. In this setting, a well-known algorithm is the adaptive controller of Slotine and Coetsee (1986), given by
$$\dot{\hat{a}} = -PY^{\mathsf T}s,$$
(2.8)
and its extension to multi-input adaptive robot control (Slotine & Li, 1987), where $P = P^{\mathsf T} > 0$, $P\in\mathbb{R}^{p\times p}$, is a constant positive-definite matrix of learning rates. Consideration of the Lyapunov-like function $V = \frac{1}{2}s^2 + \frac{1}{2}\tilde{a}^{\mathsf T}P^{-1}\tilde{a}$ shows stability of the feedback interconnection of equations 2.6 and 2.8 and convergence to the desired trajectory via an application of Barbalat's lemma (see lemma A.1). We will refer to equation 2.8 as the Slotine and Li controller.
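To make the structure of the feedback interconnection of equations 2.6 and 2.8 concrete, the following is a minimal simulation sketch for the case $n = 1$ (so that $s = x - x_d$). The regressor, gains, and desired trajectory are illustrative choices, not quantities fixed by the article.

```python
import numpy as np

# Minimal sketch of the Euclidean adaptation law (2.8) for a first-order plant
# x' = -Y(x) a + u (n = 1, so s = x - x_d).  Regressor, gains, and x_d are illustrative.

def Y(x):                                   # known 1 x p regressor
    return np.array([x, np.sin(x), x ** 3])

a_true = np.array([1.0, -2.0, 0.5])         # unknown parameters, used only to simulate the plant
P = 5.0 * np.eye(3)                         # learning-rate matrix P = P^T > 0
eta, dt, T = 2.0, 1e-3, 20.0
x, a_hat = 0.0, np.zeros(3)

for k in range(int(T / dt)):
    t = k * dt
    x_d, xd_dot = np.sin(t), np.cos(t)      # desired trajectory and its derivative
    s = x - x_d                             # sliding variable for n = 1
    Yx = Y(x)
    u = Yx @ a_hat + xd_dot - eta * s       # control law (2.5) with f-hat = Y(x) a-hat
    x_dot = u - Yx @ a_true                 # plant: x' = u - f(x, a)
    a_hat_dot = -P @ Yx * s                 # adaptation law (2.8): a-hat' = -P Y^T s
    x += dt * x_dot
    a_hat += dt * a_hat_dot

print("final tracking error:", x - np.sin(T))
```

The tracking error converges even though $\hat{a}$ need not converge to $a$ without persistent excitation, which is exactly the phenomenon studied in section 3.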

In this work, we make a mild additional assumption that simplifies some of the proofs.

Assumption 1.

The dynamics approximation $\hat{f}(\mathbf{x},\hat{a},t)$ is locally bounded in $\mathbf{x}$ and $\hat{a}$ uniformly in $t$. That is, if $\|\mathbf{x}\| < \infty$ and $\|\hat{a}\| < \infty$, then $\forall t\geq 0$, $|\hat{f}(\mathbf{x},\hat{a},t)| < \infty$.

2.2  Nonlinearly Parameterized Dynamics

While a difficult problem in general, significant progress has been made for the nonlinearly parameterized adaptive control problem under the assumption of monotonicity, and several notions of monotonicity have appeared in the literature (Astolfi & Ortega, 2003; Liu et al., 2010; Ortega et al., 2019; Tyukin, 2011; Tyukin et al., 2007). We consider one such notion as presented by Tyukin et al. (2007), which is captured in the following assumption.

Assumption 2.
There exists a known time- and state-dependent function $\alpha : \mathbb{R}^n\times\mathbb{R}_{\geq 0}\to\mathbb{R}^p$ such that
$$\tilde{a}^{\mathsf T}\alpha(\mathbf{x},t)\left(f(\mathbf{x},\hat{a},t) - f(\mathbf{x},a,t)\right) \geq 0,$$
(2.9)
$$\left|\alpha(\mathbf{x},t)^{\mathsf T}\tilde{a}\right| \geq \frac{1}{D_1}\left|f(\mathbf{x},\hat{a},t) - f(\mathbf{x},a,t)\right|,$$
(2.10)
where $D_1 > 0$ is a positive scalar.
This assumption is satisfied, for example, by all functions of the form
$$f(\mathbf{x},a,t) = \lambda(\mathbf{x},t)\,f_m\!\left(\mathbf{x},\phi(\mathbf{x},t)^{\mathsf T}a,t\right),$$
(2.11)
where $\lambda : \mathbb{R}^n\times\mathbb{R}_{\geq 0}\to\mathbb{R}$, $\phi : \mathbb{R}^n\times\mathbb{R}_{\geq 0}\to\mathbb{R}^p$, $f_m : \mathbb{R}^n\times\mathbb{R}\times\mathbb{R}_{\geq 0}\to\mathbb{R}$, and where $f_m$ is monotonic and Lipschitz in $\phi(\mathbf{x},t)^{\mathsf T}a$. In this setting, $\alpha(\mathbf{x},t)$ may be taken as $\alpha(\mathbf{x},t) = (-1)^{p}D_1\lambda(\mathbf{x},t)\phi(\mathbf{x},t)$, where $p = 0$ if $f_m$ is nondecreasing in $\phi^{\mathsf T}a$ and $p = 1$ if $f_m$ is nonincreasing in $\phi^{\mathsf T}a$ (Tyukin, 2011; Tyukin et al., 2007).
Under assumption 2, Tyukin et al. (2007) showed that the adaptation law,
$$\dot{\hat{a}} = -\tilde{f}(\mathbf{x},\hat{a},a,t)\,P\,\alpha(\mathbf{x},t),$$
(2.12)
with $P = P^{\mathsf T} > 0$ a positive-definite matrix of learning rates of appropriate dimensions, ensures that $\tilde{f}\in L_2$ over the maximal interval of existence of $\mathbf{x}$. Under suitable conditions on the error model, this then ensures that $\tilde{f}\in L_2\cap L_\infty$, that $\mathbf{x}(t)$ and $\hat{a}(t)$ both remain bounded for all $t$, and that $\mathbf{x}\to\mathbf{x}_d$. The proof follows by consideration of the Lyapunov-like function $V = \frac{1}{2}\tilde{a}^{\mathsf T}P^{-1}\tilde{a}$.
While $\tilde{f}$ itself is unknown, and hence equation 2.12 is not directly implementable, it is contained in $\dot{s}$. Intuitively, unknown quantities contained in $\dot{s}$ can be obtained in the adaptation dynamics through a proportional term in $\hat{a}$ that contains $s$. This idea of gaining a “free” derivative is the basis of the reduced-order Luenberger observer for linear systems (Luenberger, 1979).1 Proportional-integral adaptive laws of this type have been known as algorithms in finite form (Fradkov et al., 1999; Tyukin, 2003) and appear in the well-known I&I framework (Astolfi & Ortega, 2003; Liu et al., 2010). Following this prescription, equation 2.12 may be implemented in a proportional-integral form,
$$\xi(\mathbf{x},t) = -P\,s(\mathbf{x},t)\,\alpha(\mathbf{x},t),$$
(2.13)
$$\rho(\mathbf{x},t) = P\int_{x_n(t_0)}^{x_n(t)} s(\mathbf{x},t)\,\frac{\partial\alpha(\mathbf{x},t)}{\partial x_n}\,dx_n,$$
(2.14)
$$\hat{a} = \bar{a} + \xi(\mathbf{x},t) + \rho(\mathbf{x},t),$$
(2.15)
$$\dot{\bar{a}} = -\eta sP\alpha + Ps\sum_{i=1}^{n-1}\frac{\partial\alpha}{\partial x_i}x_{i+1} - \sum_{i=1}^{n-1}\frac{\partial\rho}{\partial x_i}x_{i+1} - \frac{\partial\rho}{\partial\mathbf{x}_d}\dot{\mathbf{x}}_d - \frac{\partial\xi}{\partial t} - \frac{\partial\rho}{\partial t}.$$
(2.16)
Algorithm 2.12 is similar to a gradient flow algorithm. If $f(\mathbf{x},a,t)$ has the form in equation 2.11 and is nondecreasing, gradient flow on the loss function $\mathcal{L}(\mathbf{x},\hat{a},a,t) = \frac{1}{2}\tilde{f}^2(\mathbf{x},\hat{a},a,t)$ with a gain matrix $D_1P$ leads to
$$\dot{\hat{a}} = -\tilde{f}(\mathbf{x},\hat{a},a,t)\,f_m'\!\left(\mathbf{x},\phi^{\mathsf T}\hat{a},t\right)P\,\alpha(\mathbf{x},t),$$
where $'$ denotes differentiation with respect to the second argument. $f_m'(\mathbf{x},\phi^{\mathsf T}\hat{a},t)$ is of known sign due to the monotonicity assumption but of unknown magnitude. It is sufficient to remove this quantity from the adaptation law and instead to follow the pseudogradient $\tilde{f}(\mathbf{x},\hat{a},a,t)\,\alpha(\mathbf{x},t)$ despite nonconvexity of the square loss in this setting. Similarly, if $f_m$ is nonincreasing, we find
$$\dot{\hat{a}} = \tilde{f}(\mathbf{x},\hat{a},a,t)\,f_m'\!\left(\mathbf{x},\phi^{\mathsf T}\hat{a},t\right)P\,\alpha(\mathbf{x},t),$$
and it is sufficient to set $f_m'$ to negative one.
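As a concrete illustration of the pseudogradient law, the following sketch simulates equation 2.12 for a nonlinearly parameterized $f$ of the form 2.11. The parameterization $f = \tanh(\phi^{\mathsf T}a)$, the regressor $\phi$, and all gains are illustrative; note that in a simulation $\tilde{f}$ can be evaluated directly because the simulator knows the true parameters, whereas an online implementation would use the PI form, equations 2.13 to 2.16.

```python
import numpy as np

# Simulation-only sketch of the nonlinearly parameterized law (2.12).  f(x, a) =
# tanh(phi(x)^T a) is nondecreasing in phi^T a, so assumption 2 holds with
# alpha(x) = D1 * phi(x).  All numerical values are illustrative.

def phi(x):
    return np.array([x, x ** 2, 1.0])

def f(x, a):
    return np.tanh(phi(x) @ a)

a_true = np.array([0.8, -0.4, 0.3])
P, D1, eta = 2.0 * np.eye(3), 1.0, 2.0
dt, T = 1e-3, 30.0
x, a_hat = 0.0, np.zeros(3)

for k in range(int(T / dt)):
    t = k * dt
    x_d, xd_dot = np.sin(t), np.cos(t)
    s = x - x_d                               # n = 1, so s is the tracking error
    u = f(x, a_hat) + xd_dot - eta * s        # certainty-equivalent control (2.5)
    f_tilde = f(x, a_hat) - f(x, a_true)      # available here only because we simulate the plant
    alpha = D1 * phi(x)
    x += dt * (u - f(x, a_true))              # plant: x' = -f(x, a) + u
    a_hat += dt * (-f_tilde * (P @ alpha))    # pseudogradient law (2.12)

print("tracking error at T:", x - np.sin(T))
```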

2.3  The Bregman Divergence and Natural Adaptation Laws

Lee et al. (2018) introduced an elegant modification of the Slotine and Li adaptive robot controller, later generalized by Wensing and Slotine (2020). It consists of replacing the usual parameter estimation error term $\frac{1}{2}\tilde{a}^{\mathsf T}P^{-1}\tilde{a}$ in the Lyapunov-like function $V = \frac{1}{2}s^2 + \frac{1}{2}\tilde{a}^{\mathsf T}P^{-1}\tilde{a}$ with the Bregman divergence (Bregman, 1967),
$$d_\psi(y \,\|\, x) = \psi(y) - \psi(x) - \left(y - x\right)^{\mathsf T}\nabla\psi(x),$$
to obtain the new “non-Euclidean” Lyapunov-like function,
$$V = \frac{1}{2}s^2 + d_\psi(a \,\|\, \hat{a}),$$
(2.17)
for an arbitrary strongly convex function ψ.
The Bregman divergence may be understood as the error made when approximating $\psi(y)$ by a first-order Taylor expansion around $x$. It is guaranteed to be nonnegative for strongly convex functions by the first-order characterization of convexity. While it is not a norm in general, it defines a distance-like function for $\psi$ strongly convex, related to the Hessian metric $\frac{1}{2}\|x\|^2_{\nabla^2\psi} = \frac{1}{2}x^{\mathsf T}\nabla^2\psi(x)\,x$. As two simple examples, for $\psi(x) = \frac{1}{2}\|x\|^2$, $d_\psi(x \,\|\, y) = \frac{1}{2}\|x - y\|^2$. For $\psi(x) = \frac{1}{2}x^{\mathsf T}Qx$ with $Q > 0$ a positive-definite matrix, $d_\psi(x \,\|\, y) = \frac{1}{2}\left(x - y\right)^{\mathsf T}Q\left(x - y\right)$. For general convex functions, $d_\psi(\cdot \,\|\, \cdot)$ can always be written via Taylor's formula with an integral remainder for multivariate functions as
$$d_\psi(y \,\|\, x) = \left(y - x\right)^{\mathsf T}\left[\int_0^1\nabla^2\psi\left(x + s\left(y - x\right)\right)(1 - s)\,ds\right]\left(y - x\right).$$
Indeed, a quick calculation shows that the time derivative of the Bregman divergence is simply
$$\frac{d}{dt}d_\psi(a \,\|\, \hat{a}) = \tilde{a}^{\mathsf T}\nabla^2\psi(\hat{a})\,\dot{\hat{a}},$$
(2.18)
which can be directly used to show stability of the adaptation law
$$\dot{\hat{a}} = -\left[\nabla^2\psi(\hat{a})\right]^{-1}Y^{\mathsf T}s.$$
This procedure replaces the gain matrix $P$ in the adaptation law by the $\hat{a}$-dependent inverse Hessian $\left[\nabla^2\psi(\hat{a})\right]^{-1}$ of the strongly convex function $\psi$. In essence, this amounts to the adaptive control equivalent of the natural gradient algorithm of Amari (1998), so that the resulting adaptation law respects the underlying Riemannian geometry captured by the Hessian metric $\nabla^2\psi(\hat{a})$. The standard adaptation law $\dot{\hat{a}} = -PY^{\mathsf T}s$ uses the constant metric $P^{-1}$, which in turn explains the appearance of $P$ in the natural gradient-like system.
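The sketch below modifies the earlier Euclidean simulation by replacing $P$ with the inverse Hessian of a log-barrier $\psi$, one of the choices discussed next, which confines each estimate to a prescribed interval. The barrier width, regressor, and gains are illustrative assumptions.

```python
import numpy as np

# Sketch of the natural gradient-like law a-hat' = -[∇²ψ(a-hat)]^{-1} Y^T s with a
# log-barrier ψ(a) = Σ_i [-log(b - a_i) - log(b + a_i)], which keeps each estimate
# inside (-b, b) without projection.  All numerical values are illustrative.

b = 2.0                                              # a priori bound on each parameter

def hess_psi_inv(a_hat):
    h = 1.0 / (b - a_hat) ** 2 + 1.0 / (b + a_hat) ** 2   # diagonal Hessian of the barrier
    return np.diag(1.0 / h)

def Y(x):
    return np.array([x, np.sin(x), np.cos(2 * x)])

a_true = np.array([1.0, -0.5, 0.25])
eta, dt, T = 2.0, 1e-3, 20.0
x, a_hat = 0.0, np.zeros(3)

for k in range(int(T / dt)):
    t = k * dt
    x_d, xd_dot = np.sin(t), np.cos(t)
    s = x - x_d
    Yx = Y(x)
    u = Yx @ a_hat + xd_dot - eta * s
    x += dt * (u - Yx @ a_true)
    a_hat += dt * (-hess_psi_inv(a_hat) @ Yx * s)    # inverse Hessian replaces the gain P

print("estimates (remain in (-b, b)):", a_hat)
```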

The choice of ψ enables the design of adaptation algorithms that respect physical Riemannian constraints (Lee, Wensing, & Park, 2020) obeyed by the true parameters, as in the estimation of mass properties in robotics (Wensing, Kim, & Slotine, 2018). Similarly, it allows one to introduce a priori bounds on parameter estimates without resorting to parameter projection techniques by choosing ψ to be a log-barrier function (Wensing & Slotine, 2020). In section 3.1, we further prove that the choice of ψ implicitly regularizes the learned system model.

Remark 4.
The relation 2.18 shows that Tyukin's algorithm, equation 2.12, can be generalized to have a parameter estimate-dependent gain matrix. Indeed, consideration of the Lyapunov-like function $V = \frac{1}{\gamma}d_\psi(a \,\|\, \hat{a})$ shows that the algorithm
$$\dot{\hat{a}} = -\gamma\,\tilde{f}(\mathbf{x},\hat{a},a,t)\left[\nabla^2\psi(\hat{a})\right]^{-1}\alpha(\mathbf{x},t),$$

with $\psi$ strongly convex and $\gamma > 0$ ensures that $\tilde{f}\in L_2$ over the maximal interval of existence of $\mathbf{x}(t)$ for nonlinearly parameterized systems satisfying assumption 2. The proof is identical to that of Tyukin et al. (2007). The implementation of this algorithm in PI form will be described in remark 8 and is based on a correspondence between mirror descent and natural gradient descent in continuous time. This algorithm can be seen as the adaptive control equivalent of a mirror descent or natural gradient extension of the GLMTron of Kakade et al. (2011), and this correspondence will be considered in greater detail in section 6.

Remark 5.

In the linearly parameterized setting, rather than the Lyapunov-like function $V = \frac{1}{2}s^2 + d_\psi(a \,\|\, \hat{a})$, the Lyapunov-like function $V = \frac{1}{2}s^2 + d_\psi(Pa \,\|\, P\hat{a})$ may be used for any positive-definite matrix $P$. This shows stability of the adaptation law $\dot{\hat{a}} = -P^{-1}\left[\nabla^2\psi(P\hat{a})\right]^{-1}P^{-1}Y^{\mathsf T}s$, where the choice of the matrix $P$ offers an additional design flexibility.

Remark 6.

In some practical applications, as in adaptive robot control, the estimated parameters $\hat{a}$ may correspond to physical constants. In this case, the weighted parameter estimation error term $\frac{1}{2}\tilde{a}^{\mathsf T}P^{-1}\tilde{a}$ not only provides additional design flexibility through the elements of $P$ in the adaptation law, but is necessary for physical consistency of units. Indeed, the usual Lyapunov-like function $V = \frac{1}{2}s^2 + \frac{1}{2}\tilde{a}^{\mathsf T}P^{-1}\tilde{a}$ shows that $P^{-1}$ must be chosen so that the parameter estimation error term $\frac{1}{2}\tilde{a}^{\mathsf T}P^{-1}\tilde{a}$ has the same units as the tracking error term $\frac{1}{2}s^2$. Similar considerations apply when replacing this standard parameter estimation error term with the Bregman divergence $d_\psi(a \,\|\, \hat{a})$, which has the same units as $\psi(\hat{a})$. In this case, $\psi(\hat{a})$ must be chosen to have the same units as the tracking error term, for example, by introducing a diagonal matrix of constants to ensure consistent dimensions.

3  Natural Gradient Adaptation and Implicit Regularization

In this section, we show that the “natural” adaptation algorithms of the previous section implicitly regularize the learned system model.

3.1  Implicit Regularization and Adaptive Control

With deep networks as the predominant example, modern machine learning often considers highly overparameterized models that are capable of interpolating the training data (achieving zero error on the training set) while still generalizing well to unseen examples. The classical principles of statistical learning theory emphasize a trade-off between generalization performance and model capacity, and predict that in the highly overparameterized regime, generalization performance should be poor due to a tendency of the model to fit noise in the training data. Nevertheless, empirical evidence indicates that deep networks and other modern machine learning models do not obey classical statistical learning wisdom (Belkin et al., 2019) and can even generalize with significant label noise (Zhang, Bengio, Hardt, Recht, & Vinyals, 2016).

More surprisingly, the ability to simultaneously fit label noise in the training data yet generalize to new examples has been observed in overparameterized linear models (Bartlett, Long, Lugosi, & Tsigler, 2020; Muthukumar, Vodrahalli, & Sahai, 2019). A possible explanation for the ability of highly overparameterized models to generalize when optimized using simple first-order algorithms is their implicit bias—that is, the tendency of an algorithm to converge to a particular (e.g., minimum norm) solution when there are many that interpolate the training data (Azizan & Hassibi, 2019; Azizan et al., 2019; Gunasekar et al., 2018a, 2018b; Soudry et al., 2018).

In adaptive control, the possibility of there being many possible parameter vectors $\hat{a}$ that lead to zero tracking error is not unique to the overparameterized case. Unless the trajectory is persistently exciting2 (Narendra & Annaswamy, 2005; Slotine & Li, 1991), it is well known that $\hat{a}$ will not converge to the true parameters $a$ in general. Depending on the complexity of the trajectory, there may even be many solutions in the underparameterized case where $\dim(\hat{a}) < \dim(a)$. To achieve perfect tracking, the adaptation algorithm need only fit the unknown dynamics $f(\mathbf{x}(t),a,t)$ along the trajectory rather than the whole state space, so that the effective number of parameters may be less than $\dim(a)$.

The wealth of possible solutions in the linearly parameterized case is captured by the time-dependent null space of $Y(\mathbf{x}(t),t)$: when $\mathbf{x}\to\mathbf{x}_d$, we can conclude that $Y(\mathbf{x}_d(t),t)\tilde{a}(t) = 0$, and hence that $\hat{a}(t) = a + \hat{n}(t)$ where $Y(\mathbf{x}_d(t),t)\hat{n}(t) = 0$ for all $t$. This observation also highlights that any element $\hat{n}(t)$ of the null space may be added to the parameter estimates $\hat{a}$ without affecting the value of $\hat{f}$.3 In the overparameterized case when $\dim(\hat{a}) > \dim(a)$, the set of parameters that achieve zero tracking error is not unique regardless of the complexity of the desired trajectory. By deriving a continuous-time extension of a recent proof of the implicit bias of mirror descent algorithms (Azizan & Hassibi, 2019; Azizan et al., 2019), we now show that the natural adaptive laws of the previous section implicitly regularize $\hat{a}$. This proof of implicit regularization provides an answer to the question, with infinitely many parameter vectors that achieve zero tracking error, which does adaptation choose?

Define the set
$$\mathcal{A} = \left\{\theta \,\middle|\, f(\mathbf{x}(t),\theta,t) = f(\mathbf{x}(t),a,t)\ \forall t\right\},$$
(3.1)
that is, the set in equation 3.1 contains only parameters that interpolate the dynamics $f(\mathbf{x}(t),a,t)$ along the entire trajectory. We are now in a position to state the following proposition.
Proposition 1.
Consider the natural gradient-like adaptation law for a linearly parameterized dynamics,
$$\dot{\hat{a}} = -\left[\nabla^2\psi(\hat{a})\right]^{-1}Y^{\mathsf T}s,$$
(3.2)
where $\psi(\cdot)$ is a strongly convex function. Assume that $\hat{a}(t)\to\hat{a}_\infty\in\mathcal{A}$. Then
$$\hat{a}_\infty = \arg\min_{\theta\in\mathcal{A}}d_\psi\left(\theta \,\|\, \hat{a}(0)\right).$$
In particular, if $\hat{a}(0) = \arg\min_{\theta\in\mathbb{R}^p}\psi(\theta)$, then
$$\hat{a}_\infty = \arg\min_{\theta\in\mathcal{A}}\psi(\theta).$$
(3.3)
Proof.
Let $\theta$ be any constant vector of parameters. The Bregman divergence $d_\psi(\theta \,\|\, \hat{a}) = \psi(\theta) - \psi(\hat{a}) - \nabla\psi(\hat{a})^{\mathsf T}\left(\theta - \hat{a}\right)$ has the time derivative
$$\frac{d}{dt}d_\psi(\theta \,\|\, \hat{a}) = -\left(\frac{d}{dt}\nabla\psi(\hat{a})\right)^{\mathsf T}\left(\theta - \hat{a}\right).$$
From equation 3.2, $\frac{d}{dt}\nabla\psi(\hat{a}) = -Y^{\mathsf T}s$, so that
$$\frac{d}{dt}d_\psi(\theta \,\|\, \hat{a}) = sY\left(\theta - \hat{a}\right).$$
Integrating both sides of the above shows that
$$d_\psi\left(\theta \,\|\, \hat{a}(0)\right) = d_\psi\left(\theta \,\|\, \hat{a}(t)\right) + \int_0^t s(\tau)Y(\mathbf{x}(\tau),\tau)\left(\hat{a}(\tau) - \theta\right)d\tau.$$
If we now take $\theta\in\mathcal{A}$, then $Y(\mathbf{x}(\tau),\tau)\theta = f(\mathbf{x}(\tau),a,\tau)$, and the integral term is independent of $\theta$. Assuming that $\hat{a}\to\hat{a}_\infty\in\mathcal{A}$, we can take the limit as $t\to\infty$ and write, for any $\theta\in\mathcal{A}$,
$$d_\psi\left(\theta \,\|\, \hat{a}(0)\right) = d_\psi\left(\theta \,\|\, \hat{a}_\infty\right) + \int_0^\infty s(\tau)\left(Y(\mathbf{x}(\tau),\tau)\hat{a}(\tau) - f(\mathbf{x}(\tau),a,\tau)\right)d\tau.$$
Because the only dependence of the right-hand side on $\theta$ is in the first term and because this relation holds for any $\theta$, the argmin of the two Bregman divergences must be identical. The minimum of the right-hand side over $\theta\in\mathcal{A}$ is clearly obtained at $\hat{a}_\infty$, while the minimum of the left-hand side is by definition obtained at $\arg\min_{\theta\in\mathcal{A}}d_\psi(\theta \,\|\, \hat{a}(0))$. From this, we conclude that
$$\hat{a}_\infty = \arg\min_{\theta\in\mathcal{A}}d_\psi\left(\theta \,\|\, \hat{a}(0)\right),$$
which completes the proof.

Equation 3.3 captures the implicit regularization imposed by the adaptation algorithm, equation 3.2: out of all possible interpolating parameters, it selects the parameter vector $\hat{a}_\infty$ that achieves the minimum value of $\psi$.

Remark 7.

The assumptions of proposition 1 provide a setting where theoretical insight may be gained into the implicit regularization of adaptive control algorithms, but they are stronger than needed. In general, the parameters $\hat{a}(t)$ found by an adaptive controller need not converge to a constant despite the fact that $\dot{\hat{a}}\to 0$.4 Similarly, even in the case that the parameters converge, it is not strictly required that $Y(\mathbf{x}(t),t)\hat{a}_\infty = f(\mathbf{x}(t),a,t)$ along the entire trajectory, as this condition is satisfied asymptotically. Numerical simulations in section 8 will demonstrate the implicit regularization of parameters $\hat{a}(t)$ found by adaptive control along the entire trajectory.
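In that spirit, a minimal numerical check of proposition 1 in the Euclidean case is sketched below. The regressor excites only a two-dimensional subspace, so many parameter vectors interpolate the dynamics; with $\hat{a}(0) = 0$ the Euclidean law should converge to the minimum-$\ell_2$-norm interpolant rather than to the true parameters. The regressor, gains, and true parameters are illustrative.

```python
import numpy as np

# With Y(t) = [sin t, sin t, cos t] and a = [1, 0, 0], the interpolating set is
# {theta : theta_1 + theta_2 = 1, theta_3 = 0}; the minimum-l2-norm element is
# [0.5, 0.5, 0].  The error model s' = -eta*s + Y(t)(a_hat - a) is simulated directly.

a_true = np.array([1.0, 0.0, 0.0])
eta, dt, T = 5.0, 1e-3, 200.0
s, a_hat = 0.0, np.zeros(3)

for k in range(int(T / dt)):
    t = k * dt
    Yt = np.array([np.sin(t), np.sin(t), np.cos(t)])
    s_dot = -eta * s + Yt @ (a_hat - a_true)
    a_hat_dot = -Yt * s                       # Euclidean law (2.8) with P = I
    s += dt * s_dot
    a_hat += dt * a_hat_dot

print("a_hat ->", np.round(a_hat, 3))         # expected to approach [0.5, 0.5, 0.0]
```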

We may make a similar claim in the nonlinearly parameterized setting captured by assumption 2. To do so, we require an additional assumption.

Assumption 3.

For any vector of parameters $\theta$ and the true parameters $a$, $f(\mathbf{x}(t),\theta,t) = f(\mathbf{x}(t),a,t)$ implies that $\alpha(\mathbf{x}(t),t)^{\mathsf T}\theta = \alpha(\mathbf{x}(t),t)^{\mathsf T}a$.

For the class of systems 2.11, a sufficient condition for assumption 3 is that $\lambda(\mathbf{x}(t),t)\neq 0$ and that the map $\phi(\mathbf{x},t)^{\mathsf T}a\mapsto f_m\!\left(\mathbf{x}(t),\phi(\mathbf{x},t)^{\mathsf T}a,t\right)$ is invertible at every $t$. We may now state our implicit regularization result for nonlinearly parameterized systems.

Proposition 2.
Consider the adaptation algorithm
$$\dot{\hat{a}} = -\left[\nabla^2\psi(\hat{a})\right]^{-1}\tilde{f}(\mathbf{x}(t),\hat{a}(t),a,t)\,\alpha(\mathbf{x}(t),t)$$
(3.4)
under assumptions 2 and 3. Assume $\hat{a}(t)\to\hat{a}_\infty\in\mathcal{A}$. Then
$$\hat{a}_\infty = \arg\min_{\theta\in\mathcal{A}}d_\psi\left(\theta \,\|\, \hat{a}(0)\right).$$
Proof.
The proof is much the same as that of proposition 1. The Bregman divergence $d_\psi(\theta \,\|\, \hat{a})$ for any fixed vector of parameters $\theta$ verifies
$$\frac{d}{dt}d_\psi(\theta \,\|\, \hat{a}) = \tilde{f}(\mathbf{x}(t),\hat{a}(t),a,t)\,\alpha(\mathbf{x}(t),t)^{\mathsf T}\left(\theta - \hat{a}\right),$$
so that, integrating both sides,
$$d_\psi\left(\theta \,\|\, \hat{a}(0)\right) = d_\psi\left(\theta \,\|\, \hat{a}(t)\right) - \int_0^t\tilde{f}(\mathbf{x}(\tau),\hat{a}(\tau),a,\tau)\,\alpha(\mathbf{x}(\tau),\tau)^{\mathsf T}\left(\theta - \hat{a}(\tau)\right)d\tau.$$
Now take $\theta\in\mathcal{A}$. By the assumptions of the proposition, $\alpha(\mathbf{x}(\tau),\tau)^{\mathsf T}\theta = \alpha(\mathbf{x}(\tau),\tau)^{\mathsf T}a$ is independent of $\theta$. Hence, using that $\hat{a}(t)\to\hat{a}_\infty\in\mathcal{A}$, we can write
$$d_\psi\left(\theta \,\|\, \hat{a}(0)\right) = d_\psi\left(\theta \,\|\, \hat{a}_\infty\right) - \int_0^\infty\tilde{f}(\mathbf{x}(\tau),\hat{a}(\tau),a,\tau)\,\alpha(\mathbf{x}(\tau),\tau)^{\mathsf T}\left(a - \hat{a}(\tau)\right)d\tau.$$
Optimizing both sides over $\theta\in\mathcal{A}$ as in proposition 1 yields the result.
Remark 8.

Algorithm 3.4 must be implemented in PI form due to the appearance of $\tilde{f}$. The use of the PI form in $\hat{a}$, equations 2.13 to 2.16, is complicated by the presence of the inverse Hessian of $\psi$. Instead, the Euclidean variant, equation 2.12, may be implemented through the usual PI form for an auxiliary variable $\hat{v}$ with dynamics $\dot{\hat{v}} = -\tilde{f}(\mathbf{x}(t),\hat{a}(t),a,t)\,\alpha(\mathbf{x}(t),t)$, and the controller parameters may then be computed by inverting the gradient of $\psi$: $\hat{a}(t) = \nabla\psi^{-1}(\hat{v}(t))$. This follows by the equivalence of mirror descent and natural gradient descent in continuous time. Concretely, the identity $\frac{d}{dt}\nabla\psi(\hat{a}) = \nabla^2\psi(\hat{a})\dot{\hat{a}}$ shows that $\dot{\hat{a}} = -\left[\nabla^2\psi(\hat{a})\right]^{-1}\tilde{f}(\mathbf{x},\hat{a},a,t)\,\alpha(\mathbf{x},t)$ is equivalent to $\frac{d}{dt}\nabla\psi(\hat{a}) = -\tilde{f}(\mathbf{x},\hat{a},a,t)\,\alpha(\mathbf{x},t)$. The auxiliary variable $\hat{v}$ can then be identified with $\nabla\psi(\hat{a})$.

Remark 9.

If the inverse gradient of $\psi$ is unknown but $\psi$ is chosen to be strongly convex, the contracting (Lohmiller & Slotine, 1998) dynamics $\dot{w} = -\frac{1}{\tau}\left(\nabla\psi(w) - \hat{v}\right)$ with $\tau > 0$ will converge to a ball around $\nabla\psi^{-1}(\hat{v})$ with radius set by $\left\|\frac{d}{dt}\nabla\psi^{-1}(\hat{v})\right\|\times\frac{\tau}{l}$, where $l$ is the strong convexity parameter. By choosing $\tau$ so that this contracting dynamics is fast on the timescale of adaptation, $w$ will represent a good approximation of the instantaneous $\nabla\psi^{-1}(\hat{v})$.

Remark 10.

Our results highlight, through the equivalence of their continuous-time limits, that both mirror descent–like and natural gradient-like adaptive laws impose implicit regularization. This observation extends recent results on the implicit regularization of mirror descent (Azizan & Hassibi, 2019; Azizan et al., 2019) to natural gradient descent and, furthermore, applies to linearly parameterized and generalized linearly parameterized models in machine learning, not just in the context of adaptive control. This has previously been noted in Gunasekar et al. (2018a), where it was discussed that in discrete time, natural gradient descent only approximately imposes implicit regularization due to discretization errors.

Propositions 1 and 2 demonstrate for the first time the implicit bias of adaptive control algorithms. In doing so, they identify an additional design choice that may be exploited for the application of interest. Proposition 1 implies that the Slotine and Li controller, when initialized with the parameters at $\hat{a}(0) = 0$, finds the interpolating parameter vector of minimum $\ell_2$ norm. Other norms, such as the $\ell_1$, $\ell_\infty$, $\ell_p$ for arbitrary $p$, or group norms, will find alternative parameter vectors that may have desirable properties such as sparsity.5 The usual Euclidean geometry-based adaptive laws can be seen as a form of ridge regression, while imposing $\ell_1$, $\ell_1$ and $\ell_2$ simultaneously, or $\ell_p$ regularization through the choice of $\psi$ can be seen as the adaptive control equivalents of LASSO (Tibshirani, 1996) or compressed sensing, elastic net, and bridge regression, respectively. In the context of adaptive control, this notion of implicit regularization is particularly interesting, as typical regularization terms such as $\ell_1$ and $\ell_2$ penalties cannot in general be added to the adaptation law directly without affecting the stability and performance of the algorithm.

Remark 11.

Following remark 6, if $\psi(\cdot)$ is chosen as the $l$th power of a $p$ norm, in practical applications it is necessary to include a matrix $\Gamma$ to ensure that $\psi(\hat{a}) = \|\Gamma\hat{a}\|_p^l$ has the same units as the tracking error component of the Lyapunov function. For example, if $l = 2$, then $\Gamma$ may be chosen as $\Gamma = P^{-1/2}$ for consistency of units, where $P$ is a gain matrix tuned for the standard adaptive law $\dot{\hat{a}} = -PY^{\mathsf T}s$. In addition, $l = 2$ admits a simple inversion formula for $\nabla\psi$ for any $p$, as will be utilized in the simulations in section 8, although the corresponding inverse Hessian $\left[\nabla^2\psi(\cdot)\right]^{-1}$ is nondiagonal for $p\neq 2$. For $l = p$, the inverse Hessian is diagonal, but $\Gamma$ must then be calibrated independently from $P$ tuned for the standard $\ell_2$ law. Note that choosing $\psi$ to be an $\ell_1$ norm will impose sparsity on $\Gamma\hat{a}$, so that $\Gamma$ should be taken to be diagonal to ensure sparsity in $\hat{a}$ itself.
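Combining remarks 8 and 11, the sketch below runs the adaptation law on the mirror variable $\hat{v} = \nabla\psi(\hat{a})$ for $\psi(a) = \frac{1}{2}\|a\|_p^2$ ($l = 2$, $\Gamma = I$), using the closed-form gradient inverse given by the gradient of the conjugate $\frac{1}{2}\|\cdot\|_q^2$ with $1/p + 1/q = 1$. The value $p = 1.2$, the regressor, and the gains are illustrative; $p$ near 1 biases the converged estimate toward sparse interpolants.

```python
import numpy as np

# Mirror-descent implementation of the natural gradient law for ψ(a) = ½||a||_p²:
# integrate v-hat' = -Y^T s and recover a-hat = (∇ψ)^{-1}(v-hat) in closed form.

p = 1.2
q = p / (p - 1.0)

def grad_psi_inv(v):
    if not np.any(v):
        return np.zeros_like(v)
    return np.linalg.norm(v, q) ** (2 - q) * np.sign(v) * np.abs(v) ** (q - 1)

def Y(x):
    return np.array([x, np.sin(x), np.cos(2 * x), x ** 2])

a_true = np.array([0.0, 1.5, 0.0, 0.0])      # sparse ground truth (illustrative)
eta, dt, T = 2.0, 1e-3, 40.0
x, v_hat = 0.0, np.zeros(4)

for k in range(int(T / dt)):
    t = k * dt
    x_d, xd_dot = np.sin(t), np.cos(t)
    a_hat = grad_psi_inv(v_hat)              # a-hat = (∇ψ)^{-1}(v-hat)
    s = x - x_d
    Yx = Y(x)
    u = Yx @ a_hat + xd_dot - eta * s
    x += dt * (u - Yx @ a_true)
    v_hat += dt * (-Yx * s)                  # d/dt ∇ψ(a-hat) = -Y^T s

print("a_hat ->", np.round(grad_psi_inv(v_hat), 3))
```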

3.2  Non-Euclidean Measure of the Tracking Error

The usual Lyapunov function incorporates a Euclidean tracking error term given by $\frac{1}{2}s^2$. In a similar vein to the derivation of the “natural” adaptive laws, for any strictly convex function $\phi : \mathbb{R}\to\mathbb{R}$, we may instead replace this tracking error term by the Bregman divergence $d_\phi(0 \,\|\, s)$. This quantity has time derivative
$$\frac{d}{dt}d_\phi(0 \,\|\, s) = -\eta s^2\phi''(s) + s\,\phi''(s)\,Y\tilde{a}$$
in the linearly parameterized case. Because $\phi''(s)\geq 0$ for strictly convex $\phi$, it is simple to see that this modification to the usual Lyapunov function in combination with a non-Euclidean measure of the parameter estimation error leads to a family of stable adaptation laws parameterized by $\phi$ and $\psi$ of the form $\dot{\hat{a}} = -\left[\nabla^2\psi(\hat{a})\right]^{-1}Y^{\mathsf T}\phi''(s)\,s$. This shows, for example, that any odd power of $s$ may be stably employed in the adaptation law by taking $\phi = s^p$ for some even power $p$. Surprisingly, more exotic adaptation laws such as $\dot{\hat{a}} = -\left[\nabla^2\psi(\hat{a})\right]^{-1}Y^{\mathsf T}e^{\lambda|s|}s$ for $\lambda > 0$ may also be used.
In the single-input case, these laws could be more simply obtained by replacing the $\frac{1}{2}s^2$ term in the Lyapunov-like function with a term of the form $g(s)$ where $g'(s)s\geq 0$ and $g'(s)$ is known. In the multi-input case, these two approaches differ. Taking $g$ to be a strongly convex function with minimum attained at $\mathbf{s} = 0$ and a known gradient, the Lyapunov-like function
$$V = g(\mathbf{s}) - \inf_{\mathbf{s}}g(\mathbf{s}) + d_\psi(a \,\|\, \hat{a})$$
shows that the adaptation law
$$\dot{\hat{a}} = -\left[\nabla^2\psi(\hat{a})\right]^{-1}Y^{\mathsf T}\nabla g(\mathbf{s})$$
is globally convergent. On the other hand, the Lyapunov-like function
$$V = d_\phi(0 \,\|\, \mathbf{s}) + d_\psi(a \,\|\, \hat{a})$$
shows that the distinct adaptation law
$$\dot{\hat{a}} = -\left[\nabla^2\psi(\hat{a})\right]^{-1}Y^{\mathsf T}\nabla^2\phi(\mathbf{s})\,\mathbf{s}$$
is also globally convergent.

4  Adaptive Dynamics Prediction, Control, and Observer Design

In this section, we demonstrate how the new non-Euclidean adaptation laws of section 3.1 may be used for regularized dynamics prediction, regularized adaptive control, and regularized observer design.

4.1  Regularized Adaptive Dynamics Prediction

Similar to direct adaptive control, online parameter estimation may also be used within an observer-like framework for dynamics prediction. This enables, for instance, the design of provably stable online learning rules for the weights of a recurrent neural network in the dynamics approximation context (Alemi, Machens, Deneve, & Slotine, 2018; Gilra & Gerstner, 2017; Sussillo & Abbott, 2009). Consider a nonlinear system dynamics
$$\dot{\mathbf{x}} = \mathbf{f}(\mathbf{x}) + \mathbf{c}(t),$$
where $\mathbf{x}\in\mathbb{R}^n$ is the system state, $\mathbf{f} : \mathbb{R}^n\to\mathbb{R}^n$ is the system dynamics, and $\mathbf{c} : \mathbb{R}_+\to\mathbb{R}^n$ is a system input. Define the observer-like system
$$\dot{\hat{\mathbf{x}}} = -k\left(\hat{\mathbf{x}} - \mathbf{x}\right) + Y(\hat{\mathbf{x}})\hat{a} + \mathbf{c}(t),$$
where $Y : \mathbb{R}^n\to\mathbb{R}^{n\times p}$, $\hat{a}\in\mathbb{R}^p$, and $k > 0$ is a scalar gain. Assume that there exists a fixed parameter vector $a$ such that for all $\mathbf{x}\in\mathbb{R}^n$, $Y(\mathbf{x})a = \mathbf{f}(\mathbf{x})$. By adding and subtracting $\mathbf{f}(\hat{\mathbf{x}}) = Y(\hat{\mathbf{x}})a$, the error $\mathbf{e} = \hat{\mathbf{x}} - \mathbf{x}$ has dynamics
$$\dot{\mathbf{e}} = -k\mathbf{e} + Y(\hat{\mathbf{x}})\tilde{a} + \mathbf{f}(\hat{\mathbf{x}}) - \mathbf{f}(\mathbf{x}).$$
Consider the parameter estimator
$$\dot{\hat{a}} = -\gamma\left[\nabla^2\psi(\hat{a})\right]^{-1}Y^{\mathsf T}(\hat{\mathbf{x}})\,\Gamma\,\mathbf{e},$$
(4.1)
where $\gamma > 0$ is a constant learning rate, $\psi$ is a strongly convex potential function, and $\Gamma\in\mathbb{R}^{n\times n}$, $\Gamma > 0$, is a constant symmetric positive-definite matrix. Now consider the Lyapunov-like function
$$V = \frac{1}{2}\mathbf{e}^{\mathsf T}\Gamma\mathbf{e} + \frac{1}{\gamma}d_\psi(a \,\|\, \hat{a}),$$
which has time derivative
$$\begin{aligned}\dot{V} &= \mathbf{e}^{\mathsf T}\Gamma\left(-k\mathbf{e} + Y(\hat{\mathbf{x}})\tilde{a} + \mathbf{f}(\hat{\mathbf{x}}) - \mathbf{f}(\mathbf{x})\right) - \tilde{a}^{\mathsf T}Y^{\mathsf T}(\hat{\mathbf{x}})\Gamma\mathbf{e},\\ &= \mathbf{e}^{\mathsf T}\Gamma\left(-k\mathbf{e} + \mathbf{f}(\hat{\mathbf{x}}) - \mathbf{f}(\mathbf{x})\right),\\ &= \mathbf{e}^{\mathsf T}\left[\int_0^1\left(\Gamma\frac{\partial\mathbf{f}}{\partial\mathbf{x}}(\mathbf{x} + s\mathbf{e}) - k\Gamma\right)ds\right]\mathbf{e}.\end{aligned}$$
(4.2)
Equation 4.2 shows that $\mathbf{e}\to 0$ as long as $\mathbf{f}(\mathbf{x}) - k\mathbf{x}$ is contracting in the metric $\Gamma$ (Lohmiller & Slotine, 1998; Slotine, 2003), that is, if
$$\left(\frac{\partial\mathbf{f}(\mathbf{x})}{\partial\mathbf{x}}\right)^{\mathsf T}\Gamma + \Gamma\frac{\partial\mathbf{f}(\mathbf{x})}{\partial\mathbf{x}} \preceq \left(2k - \lambda\right)\Gamma$$
uniformly over $\mathbf{x}$ for some contraction rate $\lambda > 0$. It is simple to check that the metric $\Gamma$ may also be time dependent, $\Gamma = \Gamma(t)$. More generally, rather than the proportional term $-k\mathbf{e}$, any term of the form $\mathbf{g}(\hat{\mathbf{x}}) - \mathbf{g}(\mathbf{x})$ may be used in $\dot{\hat{\mathbf{x}}}$, leading to the condition
$$\left(\frac{\partial\mathbf{f}(\mathbf{x})}{\partial\mathbf{x}} + \frac{\partial\mathbf{g}(\mathbf{x})}{\partial\mathbf{x}}\right)^{\mathsf T}\Gamma + \Gamma\left(\frac{\partial\mathbf{f}(\mathbf{x})}{\partial\mathbf{x}} + \frac{\partial\mathbf{g}(\mathbf{x})}{\partial\mathbf{x}}\right) \preceq -2\lambda\Gamma$$
uniformly over $\mathbf{x}$ for some contraction rate $\lambda > 0$. The implicit regularization results of section 3.1 show that this framework provides a technique for provably regularizing learned predictive dynamics models without negatively affecting stability or convergence of the combined error and parameter estimation systems.

The above discussion demonstrates a separation theorem for adaptive dynamics prediction. If a dynamics predictor can be designed under the assumption that the true system dynamics is known (e.g., if bounds on $\frac{\partial\mathbf{f}(\mathbf{x})}{\partial\mathbf{x}}$ are available), then the same dynamics predictor can be made adaptive by incorporating the skew-symmetric law, equation 4.1. Convergence properties then only depend on the nominal system with control feedback and are independent of the parameter estimator, as shown by the conditions for contraction.
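The following sketch instantiates the predictor of this section with Euclidean $\psi$ (so the inverse Hessian is the identity; a non-Euclidean $\psi$ would simply replace it) and $\Gamma = I$. The regressor is chosen so that $\mathbf{f}(\mathbf{x}) - k\mathbf{x}$ is clearly contracting; all numerical values are illustrative.

```python
import numpy as np

# Adaptive dynamics predictor (4.1) with Euclidean ψ and Γ = I.
# True dynamics f(x) = Y(x) a is linear in the unknown parameters.

def Y(x):                                              # regressor matrix, R^2 -> R^{2x3}
    return np.array([[x[0], np.tanh(x[1]), 0.0],
                     [0.0, np.sin(x[0]), x[1]]])

a_true = np.array([-1.0, 0.5, -1.0])                   # f(x) = [-x0 + 0.5 tanh(x1), 0.5 sin(x0) - x1]
c = lambda t: np.array([np.sin(t), np.cos(0.5 * t)])   # system input
k, gamma, dt, T = 4.0, 10.0, 1e-3, 30.0

x, x_hat, a_hat = np.zeros(2), np.ones(2), np.zeros(3)

for i in range(int(T / dt)):
    t = i * dt
    e = x_hat - x
    x_dot = Y(x) @ a_true + c(t)                       # true system
    x_hat_dot = -k * e + Y(x_hat) @ a_hat + c(t)       # observer-like predictor
    a_hat_dot = -gamma * Y(x_hat).T @ e                # adaptation law (4.1)
    x += dt * x_dot
    x_hat += dt * x_hat_dot
    a_hat += dt * a_hat_dot

print("prediction error:", np.linalg.norm(x_hat - x))
```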

Remark 12.

In principle, these simple results could be made more general using the techniques developed in Lopez and Slotine (2021), or could be performed in a latent space computed via a nonlinear dimensionality-reduction technique such as an autoencoder (Champion, Lusch, Kutz, & Brunton, 2019) or more generally a hierarchical expansion (Chen, Paiton, & Olshausen, 2018). This could also extend to adaptive control, for example, in robot control applications where an adaptive controller could be designed in a latent space computed from raw pixels via a neural network.

4.2  Regularized Dynamics Prediction for Hamiltonian Systems

If the underlying system is known to have a specific structure, this structure may be leveraged in a principled way to adaptively compute models for dynamics prediction (Sanner & Slotine, 1995). For example, large classes of physical systems are described by Hamiltonian dynamics,
$$H = H(\mathbf{p},\mathbf{q}),\qquad \dot{\mathbf{p}} = -\nabla_{\mathbf{q}}H(\mathbf{p},\mathbf{q}),\qquad \dot{\mathbf{q}} = \nabla_{\mathbf{p}}H(\mathbf{p},\mathbf{q}),$$
where $H(\mathbf{p},\mathbf{q})$ is the system Hamiltonian, $\mathbf{p}$ is the generalized momentum, and $\mathbf{q}$ is the generalized coordinate conjugate to $\mathbf{p}$. This structure was exploited in recent work by Chen, Zhang, Arjovsky, and Bottou (2020) via direct estimation of the system Hamiltonian with a deep feedforward network in combination with symplectic integration of the resulting dynamics. In a similar spirit, rather than parameterizing the system dynamics as in section 4.1, consider estimating the scalar Hamiltonian itself as a linear expansion in a set of known nonlinear basis functions $\{Y_k\}$,
$$\hat{H}(\hat{a},\mathbf{p},\mathbf{q}) = \sum_k\hat{a}_kY_k(\mathbf{p},\mathbf{q}) = Y(\mathbf{p},\mathbf{q})\hat{a},$$
where $Y(\mathbf{p},\mathbf{q})\in\mathbb{R}^{1\times p}$ is a row vector of basis functions. Assume that there exists some true parameter vector $a$ that exactly approximates the Hamiltonian globally, and consider the dynamics prediction model for $k_p > 0$, $k_q > 0$:
$$\dot{\hat{\mathbf{p}}} = -\nabla_{\hat{\mathbf{q}}}\left(Y(\hat{\mathbf{p}},\hat{\mathbf{q}})\hat{a}\right) + k_p\left(\mathbf{p} - \hat{\mathbf{p}}\right),$$
(4.3)
$$\dot{\hat{\mathbf{q}}} = \nabla_{\hat{\mathbf{p}}}\left(Y(\hat{\mathbf{p}},\hat{\mathbf{q}})\hat{a}\right) + k_q\left(\mathbf{q} - \hat{\mathbf{q}}\right).$$
(4.4)
The above predictor employs parameter sharing between both dynamics due to the direct estimation of the system Hamiltonian. The basis functions for the individual dynamics reflect the symplectic structure, as they are given by partial derivatives of the basis functions for the Hamiltonian.
After subtracting the true dynamics p˙ and q˙ from above, consider the decomposition of the error dynamics,
$$\begin{aligned}\dot{\tilde{\mathbf{p}}} &= -\nabla_{\hat{\mathbf{q}}}Y(\hat{\mathbf{p}},\hat{\mathbf{q}})\,\tilde{a} - k_p\tilde{\mathbf{p}} - \left(\nabla_{\hat{\mathbf{q}}}H(\hat{\mathbf{p}},\hat{\mathbf{q}}) - \nabla_{\hat{\mathbf{q}}}H(\mathbf{p},\hat{\mathbf{q}})\right) - \left(\nabla_{\hat{\mathbf{q}}}H(\mathbf{p},\hat{\mathbf{q}}) - \nabla_{\mathbf{q}}H(\mathbf{p},\mathbf{q})\right),\\ \dot{\tilde{\mathbf{q}}} &= \nabla_{\hat{\mathbf{p}}}Y(\hat{\mathbf{p}},\hat{\mathbf{q}})\,\tilde{a} - k_q\tilde{\mathbf{q}} + \left(\nabla_{\hat{\mathbf{p}}}H(\hat{\mathbf{p}},\hat{\mathbf{q}}) - \nabla_{\hat{\mathbf{p}}}H(\hat{\mathbf{p}},\mathbf{q})\right) + \left(\nabla_{\hat{\mathbf{p}}}H(\hat{\mathbf{p}},\mathbf{q}) - \nabla_{\mathbf{p}}H(\mathbf{p},\mathbf{q})\right),\end{aligned}$$
along with the adaptation law,
$$\dot{\hat{a}} = \gamma\left[\nabla^2\psi(\hat{a})\right]^{-1}\left(\left(\nabla_{\hat{\mathbf{q}}}Y(\hat{\mathbf{p}},\hat{\mathbf{q}})\right)^{\mathsf T}\tilde{\mathbf{p}} - \left(\nabla_{\hat{\mathbf{p}}}Y(\hat{\mathbf{p}},\hat{\mathbf{q}})\right)^{\mathsf T}\tilde{\mathbf{q}}\right),$$
with γ>0 a positive learning rate. The Lyapunov-like function
$$V = \frac{1}{2}\tilde{\mathbf{p}}^{\mathsf T}\tilde{\mathbf{p}} + \frac{1}{2}\tilde{\mathbf{q}}^{\mathsf T}\tilde{\mathbf{q}} + \frac{1}{\gamma}d_\psi(a \,\|\, \hat{a})$$
has time derivative
$$\begin{aligned}
\dot{V} ={}& \tilde{\mathbf{p}}^{\mathsf T}\!\left[-\nabla_{\hat{\mathbf{q}}}Y(\hat{\mathbf{p}},\hat{\mathbf{q}})\,\tilde{a} - k_p\tilde{\mathbf{p}} - \left(\nabla_{\hat{\mathbf{q}}}H(\hat{\mathbf{p}},\hat{\mathbf{q}}) - \nabla_{\hat{\mathbf{q}}}H(\mathbf{p},\hat{\mathbf{q}})\right) - \left(\nabla_{\hat{\mathbf{q}}}H(\mathbf{p},\hat{\mathbf{q}}) - \nabla_{\mathbf{q}}H(\mathbf{p},\mathbf{q})\right)\right]\\
&+ \tilde{\mathbf{q}}^{\mathsf T}\!\left[\nabla_{\hat{\mathbf{p}}}Y(\hat{\mathbf{p}},\hat{\mathbf{q}})\,\tilde{a} - k_q\tilde{\mathbf{q}} + \left(\nabla_{\hat{\mathbf{p}}}H(\hat{\mathbf{p}},\hat{\mathbf{q}}) - \nabla_{\hat{\mathbf{p}}}H(\hat{\mathbf{p}},\mathbf{q})\right) + \left(\nabla_{\hat{\mathbf{p}}}H(\hat{\mathbf{p}},\mathbf{q}) - \nabla_{\mathbf{p}}H(\mathbf{p},\mathbf{q})\right)\right]\\
&+ \tilde{a}^{\mathsf T}\left(\left(\nabla_{\hat{\mathbf{q}}}Y(\hat{\mathbf{p}},\hat{\mathbf{q}})\right)^{\mathsf T}\tilde{\mathbf{p}} - \left(\nabla_{\hat{\mathbf{p}}}Y(\hat{\mathbf{p}},\hat{\mathbf{q}})\right)^{\mathsf T}\tilde{\mathbf{q}}\right)\\
={}& \tilde{\mathbf{p}}^{\mathsf T}\!\left[- k_p\tilde{\mathbf{p}} - \left(\nabla_{\hat{\mathbf{q}}}H(\hat{\mathbf{p}},\hat{\mathbf{q}}) - \nabla_{\hat{\mathbf{q}}}H(\mathbf{p},\hat{\mathbf{q}})\right) - \left(\nabla_{\hat{\mathbf{q}}}H(\mathbf{p},\hat{\mathbf{q}}) - \nabla_{\mathbf{q}}H(\mathbf{p},\mathbf{q})\right)\right]\\
&+ \tilde{\mathbf{q}}^{\mathsf T}\!\left[- k_q\tilde{\mathbf{q}} + \left(\nabla_{\hat{\mathbf{p}}}H(\hat{\mathbf{p}},\hat{\mathbf{q}}) - \nabla_{\hat{\mathbf{p}}}H(\hat{\mathbf{p}},\mathbf{q})\right) + \left(\nabla_{\hat{\mathbf{p}}}H(\hat{\mathbf{p}},\mathbf{q}) - \nabla_{\mathbf{p}}H(\mathbf{p},\mathbf{q})\right)\right]\\
={}& \begin{pmatrix}\tilde{\mathbf{p}}\\ \tilde{\mathbf{q}}\end{pmatrix}^{\mathsf T}
\begin{pmatrix}
-k_pI - \int_0^1\nabla_{\mathbf{p}}\nabla_{\hat{\mathbf{q}}}H(\mathbf{p}+s\tilde{\mathbf{p}},\hat{\mathbf{q}})\,ds & -\int_0^1\nabla^2_{\mathbf{q}}H(\mathbf{p},\mathbf{q}+s\tilde{\mathbf{q}})\,ds\\
\int_0^1\nabla^2_{\mathbf{p}}H(\mathbf{p}+s\tilde{\mathbf{p}},\mathbf{q})\,ds & -k_qI + \int_0^1\nabla_{\mathbf{q}}\nabla_{\hat{\mathbf{p}}}H(\hat{\mathbf{p}},\mathbf{q}+s\tilde{\mathbf{q}})\,ds
\end{pmatrix}
\begin{pmatrix}\tilde{\mathbf{p}}\\ \tilde{\mathbf{q}}\end{pmatrix}.
\end{aligned}$$
A sufficient condition for convergence of p˜0 and q˜0 is uniform negative definiteness of the Jacobian matrix
$$J = \begin{pmatrix}-k_pI - \nabla_{\mathbf{p}}\nabla_{\mathbf{q}}H(\mathbf{p},\mathbf{q}) & -\nabla^2_{\mathbf{q}}H(\mathbf{p},\mathbf{q})\\ \nabla^2_{\mathbf{p}}H(\mathbf{p},\mathbf{q}) & -k_qI + \nabla_{\mathbf{q}}\nabla_{\mathbf{p}}H(\mathbf{p},\mathbf{q})\end{pmatrix},$$
in p and q, that is, contraction of the nominal p^˙ and q^˙ system in the Euclidean metric. Sufficient conditions for this are
$$\begin{aligned}k_p &> -\tfrac{1}{2}\lambda_{\min}\!\left(\nabla_{\mathbf{p}}\nabla_{\mathbf{q}}H(\mathbf{p},\mathbf{q}) + \nabla_{\mathbf{q}}\nabla_{\mathbf{p}}H(\mathbf{p},\mathbf{q})\right),\\ k_q &> \tfrac{1}{2}\lambda_{\max}\!\left(\nabla_{\mathbf{p}}\nabla_{\mathbf{q}}H(\mathbf{p},\mathbf{q}) + \nabla_{\mathbf{q}}\nabla_{\mathbf{p}}H(\mathbf{p},\mathbf{q})\right),\\ \lambda_p\lambda_q &> \tfrac{1}{4}\lambda_{\max}^2\!\left(\nabla^2_{\mathbf{p}}H(\mathbf{p},\mathbf{q}) - \nabla^2_{\mathbf{q}}H(\mathbf{p},\mathbf{q})\right),\end{aligned}$$
where $\lambda_p$ and $\lambda_q$ are the contraction rates of the $\hat{\mathbf{p}}$ and $\hat{\mathbf{q}}$ systems, respectively, given by the difference of the left- and right-hand sides of the first two inequalities above. More general conditions can be obtained by utilizing a nonidentity metric, that is, replacing the $\frac{1}{2}\tilde{\mathbf{p}}^{\mathsf T}\tilde{\mathbf{p}}$ and $\frac{1}{2}\tilde{\mathbf{q}}^{\mathsf T}\tilde{\mathbf{q}}$ terms in $V$ by the Mahalanobis distances $\frac{1}{2}\tilde{\mathbf{p}}^{\mathsf T}\Gamma_p\tilde{\mathbf{p}}$ and $\frac{1}{2}\tilde{\mathbf{q}}^{\mathsf T}\Gamma_q\tilde{\mathbf{q}}$, where $\Gamma_p$ and $\Gamma_q$ are symmetric positive-definite matrices. The adaptation law will need to be modified accordingly.
Rather than a general Hamiltonian H=H(p,q), it is common to have a separable Hamiltonian structure,
$$H(\mathbf{p},\mathbf{q}) = T(\mathbf{p}) + U(\mathbf{q}).$$
Above, T(·) is the kinetic energy and U(·) is the potential energy. Following an identical proof, the Jacobian matrix then reduces to
$$J = \begin{pmatrix}-k_pI & -\nabla^2_{\hat{\mathbf{q}}}U(\hat{\mathbf{q}})\\ \nabla^2_{\hat{\mathbf{p}}}T(\hat{\mathbf{p}}) & -k_qI\end{pmatrix},$$
so that the conditions for contraction in the Euclidean metric are simplified to
$$k_qk_p > \tfrac{1}{4}\lambda_{\max}^2\!\left(\nabla^2_{\hat{\mathbf{p}}}T(\hat{\mathbf{p}}) - \nabla^2_{\hat{\mathbf{q}}}U(\hat{\mathbf{q}})\right).$$
(4.5)
The results of section 3.1 show that the choice of $\psi$ may be used to regularize the estimate of the Hamiltonian and, in turn, the dynamics. This may be used, for instance, for parsimonious Hamiltonian estimation through the combination of a rich set of physically motivated scalar basis functions and a sparse representation obtained via $\ell_1$ regularization, similar to Champion et al. (2019). Further results that exploit the structure of separable Hamiltonians through independent estimation of the kinetic and potential energies are presented in appendix B.
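To make the construction concrete, the sketch below runs the predictor, equations 4.3 and 4.4, for a pendulum with $H = \frac{1}{2}p^2 - \cos q$, estimated over the illustrative basis $\{p^2, \cos q, q^2, pq\}$ with Euclidean $\psi$; the gains satisfy the separable contraction condition 4.5. All numerical values are illustrative.

```python
import numpy as np

# Hamiltonian dynamics predictor (4.3)-(4.4) for a 1-DOF pendulum, Euclidean ψ.

def Y(p, q):                                       # basis functions for H-hat
    return np.array([p ** 2, np.cos(q), q ** 2, p * q])

def dY_dp(p, q):
    return np.array([2 * p, 0.0, 0.0, q])

def dY_dq(p, q):
    return np.array([0.0, -np.sin(q), 2 * q, p])

a_true = np.array([0.5, -1.0, 0.0, 0.0])           # H = p²/2 - cos q
kp, kq, gamma, dt, T = 2.0, 2.0, 5.0, 1e-3, 50.0

p, q = 0.0, 1.0                                    # true state
p_hat, q_hat, a_hat = 0.5, 0.0, np.zeros(4)

for i in range(int(T / dt)):
    p_dot, q_dot = -np.sin(q), p                   # true pendulum dynamics
    p_hat_dot = -dY_dq(p_hat, q_hat) @ a_hat + kp * (p - p_hat)
    q_hat_dot = dY_dp(p_hat, q_hat) @ a_hat + kq * (q - q_hat)
    p_tilde, q_tilde = p_hat - p, q_hat - q
    a_hat_dot = gamma * (dY_dq(p_hat, q_hat) * p_tilde - dY_dp(p_hat, q_hat) * q_tilde)
    p, q = p + dt * p_dot, q + dt * q_dot
    p_hat, q_hat = p_hat + dt * p_hat_dot, q_hat + dt * q_hat_dot
    a_hat += dt * a_hat_dot

print("prediction errors:", abs(p_hat - p), abs(q_hat - q))
```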

4.3  Regularized Adaptive Control for Lagrangian Systems

A similar methodology can be applied to parameterize a scalar Lagrangian rather than Hamiltonian, leading to a second-order differential equation with inertia matrix, centripetal and Coriolis forces, and potential energy parameterized by a shared set of weights. As we now show, generalizing the derivation of the Slotine and Li robot controller (Slotine & Li, 1987) to this setting allows for stable adaptive control of Lagrangian systems by direct estimation of the Lagrangian itself. Consider the Lagrangian
$$\mathcal{L} = \frac{1}{2}\dot{\mathbf{q}}^{\mathsf T}H(\mathbf{q})\dot{\mathbf{q}} - U(\mathbf{q}),$$
with $H(\mathbf{q})$ an unknown inertia matrix and $U(\mathbf{q})$ an unknown potential. Assume that the inertia matrix and scalar potential are given exactly by an expansion in physically motivated basis functions. That is, for a set of positive-definite matrices $M^l > 0$ and scalar functions $\phi^l$,
$$H(\mathbf{q}) = \sum_l a_l^{(K)}M^l(\mathbf{q}),\qquad U(\mathbf{q}) = \sum_l a_l^{(P)}\phi^l(\mathbf{q}),$$
where superscripts $(K)$ and $(P)$ denote kinetic and potential, respectively, and the vectors $a^{(K)}$ and $a^{(P)}$ are unknown. The Euler-Lagrange equations of motion $\frac{d}{dt}\frac{\partial\mathcal{L}}{\partial\dot{\mathbf{q}}} - \frac{\partial\mathcal{L}}{\partial\mathbf{q}} = \mathbf{u}$ with $\mathbf{u}$ a control input then give the dynamics
$$\sum_l\sum_j a_l^{(K)}M^l_{ij}(\mathbf{q})\ddot{q}_j + \sum_l\sum_k\sum_j a_l^{(K)}\dot{q}_k\dot{q}_j\left(\frac{\partial M^l_{ij}(\mathbf{q})}{\partial q_k} - \frac{1}{2}\frac{\partial M^l_{kj}(\mathbf{q})}{\partial q_i}\right) + \sum_l a_l^{(P)}\frac{\partial\phi^l(\mathbf{q})}{\partial q_i} = u_i.$$
Above, the second term,
$$\sum_l\sum_k\sum_j a_l^{(K)}\dot{q}_k\dot{q}_j\left(\frac{\partial M^l_{ij}(\mathbf{q})}{\partial q_k} - \frac{1}{2}\frac{\partial M^l_{kj}(\mathbf{q})}{\partial q_i}\right),$$
uniquely defines the centripetal and Coriolis forces (traditionally written as $C(\mathbf{q},\dot{\mathbf{q}})\dot{\mathbf{q}}$ with $C$ the Coriolis matrix), but does not uniquely define the Coriolis matrix (Slotine & Li, 1991). Choosing
$$C_{ij}(\mathbf{q},\dot{\mathbf{q}}) = \sum_k\sum_l a_l^{(K)}\,\frac{1}{2}\left(\frac{\partial M^l_{ij}(\mathbf{q})}{\partial q_k} - \frac{\partial M^l_{kj}(\mathbf{q})}{\partial q_i} + \frac{\partial M^l_{ki}(\mathbf{q})}{\partial q_j}\right)\dot{q}_k$$
preserves the Coriolis force $C(\mathbf{q},\dot{\mathbf{q}})\dot{\mathbf{q}}$ and ensures that $\dot{H}(\mathbf{q}) - 2C(\mathbf{q},\dot{\mathbf{q}})$ is a skew-symmetric matrix. In matrix notation, the dynamics are then given by
$$H(\mathbf{q})\ddot{\mathbf{q}} + C(\mathbf{q},\dot{\mathbf{q}})\dot{\mathbf{q}} + \mathbf{g}(\mathbf{q}) = \mathbf{u},$$
with the potential force $\mathbf{g}(\mathbf{q}) = \sum_l a_l^{(P)}\nabla_{\mathbf{q}}\phi^l(\mathbf{q})$. Defining $\mathbf{s}$ and $\dot{\mathbf{q}}_r$ via $\mathbf{s} = \left(\frac{d}{dt} + \lambda\right)\tilde{\mathbf{q}} = \dot{\mathbf{q}} - \dot{\mathbf{q}}_r$, these dynamics can be equivalently rewritten as
$$H(\mathbf{q})\dot{\mathbf{s}} + C(\mathbf{q},\dot{\mathbf{q}})\mathbf{s} = \mathbf{u} - \left(H(\mathbf{q})\ddot{\mathbf{q}}_r + C(\mathbf{q},\dot{\mathbf{q}})\dot{\mathbf{q}}_r + \mathbf{g}(\mathbf{q})\right).$$
(4.6)
Observe that because the Lagrangian was linearly parameterized, the resulting dynamics are also linearly parameterized. Defining the known basis functions,
$$Y^{(P)}_{il} = \frac{\partial\phi^l(\mathbf{q})}{\partial q_i},\qquad Y^{(K)}_{il} = \sum_k\sum_j\frac{1}{2}\left(\frac{\partial M^l_{ij}(\mathbf{q})}{\partial q_k} - \frac{\partial M^l_{kj}(\mathbf{q})}{\partial q_i} + \frac{\partial M^l_{ki}(\mathbf{q})}{\partial q_j}\right)\dot{q}_k\dot{q}_{r,j} + \sum_j M^l_{ij}(\mathbf{q})\ddot{q}_{r,j},$$
we can write equation 4.6 as
$$H(\mathbf{q})\dot{\mathbf{s}} + C(\mathbf{q},\dot{\mathbf{q}})\mathbf{s} = \mathbf{u} - Y^{(P)}a^{(P)} - Y^{(K)}a^{(K)}.$$
For $K > 0$ a positive-definite matrix and for parameter estimates $\hat{a}^{(P)}$ and $\hat{a}^{(K)}$, taking $\mathbf{u} = -K\mathbf{s} + Y^{(P)}\hat{a}^{(P)} + Y^{(K)}\hat{a}^{(K)}$ leads to
$$H(\mathbf{q})\dot{\mathbf{s}} + C(\mathbf{q},\dot{\mathbf{q}})\mathbf{s} = -K\mathbf{s} + Y^{(P)}\tilde{a}^{(P)} + Y^{(K)}\tilde{a}^{(K)}.$$
The proof in Slotine and Li (1987) can now be directly extended. For $\psi^{(K)}$, $\psi^{(P)}$ strictly convex functions and $\gamma_K > 0$, $\gamma_P > 0$ positive gains, the Lyapunov-like function
$$V = \frac{1}{2}\mathbf{s}^{\mathsf T}H(\mathbf{q})\mathbf{s} + \frac{1}{\gamma_K}d_{\psi^{(K)}}\!\left(a^{(K)} \,\|\, \hat{a}^{(K)}\right) + \frac{1}{\gamma_P}d_{\psi^{(P)}}\!\left(a^{(P)} \,\|\, \hat{a}^{(P)}\right),$$
shows stability of the adaptation laws
$$\dot{\hat{a}}^{(K)} = -\gamma_K\left[\nabla^2\psi^{(K)}\!\left(\hat{a}^{(K)}\right)\right]^{-1}\left(Y^{(K)}\right)^{\mathsf T}\mathbf{s},\qquad \dot{\hat{a}}^{(P)} = -\gamma_P\left[\nabla^2\psi^{(P)}\!\left(\hat{a}^{(P)}\right)\right]^{-1}\left(Y^{(P)}\right)^{\mathsf T}\mathbf{s},$$
after an application of Barbalat's lemma (lemma A.1 in this article) and using skew-symmetry of $\dot{H} - 2C$ to eliminate $\frac{1}{2}\mathbf{s}^{\mathsf T}\dot{H}\mathbf{s}$. In physical applications, dimensions or relative scaling of the components of $\hat{a}^{(K)}$ and $\hat{a}^{(P)}$ can be handled as described in remarks 6 and 11.

As in section 4.2, by using an $\ell_1$ approximation for $\psi$, this approach may find sparse, interpretable models of the kinetic and potential energies. Estimating the potential energy directly may in some cases lead to simpler parameterizations than estimating the resulting forces.

If more structure in the inertia matrix is known, for example, that it depends only on a few unknown parameters, it may still be approximated using the usual Slotine and Li controller. The external forces can then be estimated by directly estimating the corresponding potential that generates them.

4.4  Regularized Adaptive Observer Design

In many physical and engineering systems, only a low-dimensional output of the system $\mathbf{y}(\mathbf{x})\in\mathbb{R}^m$ is available for measurement. Assuming that $\mathbf{y}(\mathbf{x}) = C\mathbf{x}$ is a linear readout for some known matrix $C\in\mathbb{R}^{m\times n}$, we now show that the tools of sections 4.2 and 4.3 can be used to design regularized adaptive observers for the full system state. Assume that the true system dynamics satisfies
$$\dot{\mathbf{x}} = \mathbf{f}(\mathbf{x}) + \mathbf{c}(t) = Y(\mathbf{y}(\mathbf{x}))a + \mathbf{c}(t),$$
with $a\in\mathbb{R}^p$ a vector of unknown parameters and where the known regressor matrix $Y\in\mathbb{R}^{n\times p}$ only depends on the system output $\mathbf{y}(\mathbf{x})$. Consider the adaptive observer,
$$\dot{\hat{\mathbf{x}}} = Y(\hat{\mathbf{y}})\hat{a} + \mathbf{c}(t) + \mathbf{g}(\hat{\mathbf{y}}) - \mathbf{g}(\mathbf{y}),\qquad \dot{\hat{a}} = -\gamma\left[\nabla^2\psi(\hat{a})\right]^{-1}Y^{\mathsf T}(\hat{\mathbf{y}})\,C^{\mathsf T}\Gamma\,\tilde{\mathbf{y}},$$
with $\gamma > 0$ a positive learning rate, $\hat{\mathbf{y}} = \mathbf{y}(\hat{\mathbf{x}})$, $\psi$ a strongly convex potential function, and $\Gamma$ a positive-definite matrix. The Lyapunov-like function
$$V = \frac{1}{2}\tilde{\mathbf{y}}^{\mathsf T}\Gamma\tilde{\mathbf{y}} + \frac{1}{\gamma}d_\psi(a \,\|\, \hat{a}),$$
has time derivative
$$\begin{aligned}\dot{V} &= \tilde{\mathbf{y}}^{\mathsf T}\Gamma C\left(Y(\hat{\mathbf{y}})\tilde{a} + \left(Y(\hat{\mathbf{y}}) - Y(\mathbf{y})\right)a + \mathbf{g}(\hat{\mathbf{y}}) - \mathbf{g}(\mathbf{y})\right) - \tilde{\mathbf{y}}^{\mathsf T}\Gamma CY(\hat{\mathbf{y}})\tilde{a},\\ &= \tilde{\mathbf{y}}^{\mathsf T}\Gamma C\left(\left(Y(\hat{\mathbf{y}}) - Y(\mathbf{y})\right)a + \mathbf{g}(\hat{\mathbf{y}}) - \mathbf{g}(\mathbf{y})\right),\\ &= \tilde{\mathbf{y}}^{\mathsf T}\Gamma C\left[\int_0^1\left(\frac{\partial\left(Y(\mathbf{y} + s\tilde{\mathbf{y}})a\right)}{\partial\mathbf{y}} + \frac{\partial\mathbf{g}(\mathbf{y} + s\tilde{\mathbf{y}})}{\partial\mathbf{y}}\right)ds\right]\tilde{\mathbf{y}},\end{aligned}$$
which shows that a sufficient condition for convergence of $\tilde{\mathbf{y}}\to 0$ is
$$\Gamma C\left(\frac{\partial\left(Y(\mathbf{y})a\right)}{\partial\mathbf{y}} + \frac{\partial\mathbf{g}(\mathbf{y})}{\partial\mathbf{y}}\right) + \left(\frac{\partial\left(Y(\mathbf{y})a\right)}{\partial\mathbf{y}} + \frac{\partial\mathbf{g}(\mathbf{y})}{\partial\mathbf{y}}\right)^{\mathsf T}C^{\mathsf T}\Gamma \preceq -\lambda\Gamma$$
uniformly in $\mathbf{y}$ for some contraction rate $\lambda > 0$. A natural choice of $\mathbf{g}(\mathbf{y})$ to satisfy this condition with $\Gamma = I$ is $\mathbf{g}(\mathbf{y}) = -kC^{\mathsf T}\mathbf{y}$ for some $k > 0$ if $CC^{\mathsf T}$ is full rank. The requirement is equivalent to contraction of the unknown output dynamics,
$$\dot{\mathbf{y}} = CY(\mathbf{y})a + C\mathbf{g}(\mathbf{y}) + C\mathbf{c}(t),$$
in the metric $\Gamma$. Under suitable observability assumptions on the system, convergence of $\hat{\mathbf{y}}$ to $\mathbf{y}$ ensures that $\hat{\mathbf{x}}$ converges to $\mathbf{x}$, and hence that the full system state can be observed (Luenberger, 1979).

As in section 4.1, this discussion demonstrates a separation theorem for adaptive observer design. If an observer can be designed for the true system with unknown parameters, then the same observer can be made adaptive by incorporating the adaptation law presented in this section. Convergence properties then depend only on the true system with feedback and are independent of the parameter estimator. The results of section 3.1 show that the choice of ψ can be used to regularize the observer model while maintaining provable reconstruction of the full system state.
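A minimal sketch of the observer of this section is given below for a two-state system with scalar output $y = x_1$, Euclidean $\psi$, $\Gamma = I$, and $\mathbf{g}(y) = -kC^{\mathsf T}y$. The regressor, the true parameters, and all gains are illustrative, and the regressor is chosen so that the output dynamics is contracting.

```python
import numpy as np

# Regularized adaptive observer of section 4.4 with y = C x = x_1 and Euclidean ψ.

C = np.array([[1.0, 0.0]])
k, gamma, dt, T = 2.0, 5.0, 1e-3, 30.0

def Y(y):                                       # regressor depends only on the output
    return np.array([[np.sin(y), np.tanh(y)],
                     [y, 1.0]])

def g(y):
    return -k * C.T @ np.atleast_1d(y)

a_true = np.array([0.5, -0.3])
c = lambda t: np.array([np.sin(t), 0.2 * np.cos(t)])

x, x_hat, a_hat = np.array([1.0, -1.0]), np.zeros(2), np.zeros(2)

for i in range(int(T / dt)):
    t = i * dt
    y, y_hat = (C @ x)[0], (C @ x_hat)[0]
    x_dot = Y(y) @ a_true + c(t)                               # true system
    x_hat_dot = Y(y_hat) @ a_hat + c(t) + g(y_hat) - g(y)      # adaptive observer
    a_hat_dot = -gamma * Y(y_hat).T @ C.T @ (C @ (x_hat - x))  # adaptation law
    x += dt * x_dot
    x_hat += dt * x_hat_dot
    a_hat += dt * a_hat_dot

print("output error:", abs((C @ (x_hat - x))[0]))
```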

4.5  Regularized Dynamics Prediction for Recurrent Neural Networks

Consider a recurrent neural network model,
$$\tau\dot{\mathbf{x}} = -\mathbf{x} + \sigma\left(\Theta\mathbf{x}\right),$$
(4.7)
with $\mathbf{x}\in\mathbb{R}^n$ a vector of neuron firing rates, $\Theta\in\mathbb{R}^{n\times n}$ the synaptic weights, $\sigma(\Theta\mathbf{x})$ the postsynaptic potentials, and $\tau > 0$ a relaxation timescale. Let $\sigma(\cdot)$ be an elementwise Lipschitz and monotonic activation function:
$$\sigma(\mathbf{x})_i = \sigma_i(x_i),\qquad \left|\sigma_i(x) - \sigma_i(y)\right| \leq L_i\left|x - y\right|,\qquad \left(x - y\right)\left(\sigma_i(x) - \sigma_i(y)\right) \geq 0.$$
These requirements are satisfied by common activation functions such as the ReLU, softplus, tanh, and sigmoid. For $\psi$ a strongly convex function on $n\times n$ matrices or vectors in $\mathbb{R}^{n^2}$ and $\gamma > 0$ a positive gain, consider the regularized adaptive dynamics predictor for equation 4.7,
$$\tau\dot{\hat{\mathbf{x}}} = -\hat{\mathbf{x}} + \sigma\left(\hat{\Theta}\mathbf{x}\right) + k\left(\mathbf{x} - \hat{\mathbf{x}}\right),$$
(4.8)
$$\dot{\hat{\Theta}} = -\gamma\left[\nabla^2\psi\!\left(\hat{\Theta}\right)\right]^{-1}\left(\sigma\left(\hat{\Theta}\mathbf{x}\right) - \sigma\left(\Theta\mathbf{x}\right)\right)\mathbf{x}^{\mathsf T}.$$
(4.9)
In equation 4.8, the true vector of firing rates $\mathbf{x}$ is used underneath the application of $\sigma(\cdot)$ in the $\dot{\hat{\mathbf{x}}}$ dynamics. The update law, equation 4.9, can be seen as the vector-valued generalization of the algorithm considered in remark 4. The Lyapunov-like function
$$V = \frac{1}{\gamma}d_\psi\!\left(\Theta \,\|\, \hat{\Theta}\right),$$
has time derivative
$$\begin{aligned}\dot{V} &= -\sum_{ij}\tilde{\Theta}_{ij}\left(\sigma\left(\hat{\Theta}\mathbf{x}\right) - \sigma\left(\Theta\mathbf{x}\right)\right)_i x_j,\\ &= -\sum_{ij}\tilde{\Theta}_{ij}\left(\sigma_i\!\left(\textstyle\sum_k\hat{\Theta}_{ik}x_k\right) - \sigma_i\!\left(\textstyle\sum_k\Theta_{ik}x_k\right)\right)x_j,\\ &= -\sum_i\left(\textstyle\sum_k\tilde{\Theta}_{ik}x_k\right)\left(\sigma_i\!\left(\textstyle\sum_k\hat{\Theta}_{ik}x_k\right) - \sigma_i\!\left(\textstyle\sum_k\Theta_{ik}x_k\right)\right),\\ &\leq -\sum_i\frac{1}{L_i}\left(\sigma_i\!\left(\textstyle\sum_k\hat{\Theta}_{ik}x_k\right) - \sigma_i\!\left(\textstyle\sum_k\Theta_{ik}x_k\right)\right)^2,\\ &\leq -\frac{1}{\max_kL_k}\left\|\sigma\left(\hat{\Theta}\mathbf{x}\right) - \sigma\left(\Theta\mathbf{x}\right)\right\|_2^2 \leq 0.\end{aligned}$$
Integrating the above inequality shows that $\sigma(\hat{\Theta}\mathbf{x}) - \sigma(\Theta\mathbf{x})$ is an $L_2$ signal and, hence, that each component $\sigma_i(\hat{\Theta}\mathbf{x}) - \sigma_i(\Theta\mathbf{x})$ is also an $L_2$ signal. The error dynamics
$$\dot{\mathbf{e}} = -(k+1)\mathbf{e} + \sigma\left(\hat{\Theta}\mathbf{x}\right) - \sigma\left(\Theta\mathbf{x}\right),$$
shows that each component $e_i$ is a low-pass filter of each component of the function approximation error $\sigma_i(\hat{\Theta}\mathbf{x}) - \sigma_i(\Theta\mathbf{x})$. Applying lemma A.2 shows that $\mathbf{e}\to 0$. This approach could be used, for example, for identifying regularized low-dimensional models in computational neuroscience. Our results are similar to those of Foster, Rakhlin, and Sarkar (2020), but handle a mirror descent or natural gradient extension valid in the continuous-time deterministic setting.
The adaptation law, equation 4.9, cannot be implemented directly through a PI form. However, it can be well approximated, for example, by the PI construction
$$\dot{\bar{\mathbf{x}}} = \lambda\left(\mathbf{x} - \bar{\mathbf{x}}\right),\qquad \nabla\psi\!\left(\hat{\Theta}\right) = \gamma\left(\bar{\Theta} - \mathbf{e}\bar{\mathbf{x}}^{\mathsf T}\right),\qquad \dot{\bar{\Theta}} = -\left(k+1\right)\mathbf{e}\bar{\mathbf{x}}^{\mathsf T} + \lambda\mathbf{e}\left(\mathbf{x} - \bar{\mathbf{x}}\right)^{\mathsf T},$$
for $\lambda > 0$ a positive gain ensuring $\bar{\mathbf{x}}\to\mathbf{x}$.
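The following is a minimal simulation sketch of the predictor, equations 4.8 and 4.9, with Euclidean $\psi$ and tanh activations. In simulation $\sigma(\Theta\mathbf{x})$ is available directly; an online implementation would use the approximate PI construction above. The network size, gains, and random weights are illustrative.

```python
import numpy as np

# Recurrent network predictor (4.8)-(4.9) with Euclidean ψ and σ = tanh.

n = 5
rng = np.random.default_rng(0)
Theta = 0.9 * rng.standard_normal((n, n)) / np.sqrt(n)   # true synaptic weights
tau, k, gamma, dt, T = 1.0, 2.0, 5.0, 1e-3, 60.0

x = rng.standard_normal(n)
x_hat = np.zeros(n)
Theta_hat = np.zeros((n, n))

for i in range(int(T / dt)):
    x_dot = (-x + np.tanh(Theta @ x)) / tau
    sigma_err = np.tanh(Theta_hat @ x) - np.tanh(Theta @ x)
    x_hat_dot = (-x_hat + np.tanh(Theta_hat @ x) + k * (x - x_hat)) / tau
    Theta_hat_dot = -gamma * np.outer(sigma_err, x)       # update law (4.9)
    x += dt * x_dot
    x_hat += dt * x_hat_dot
    Theta_hat += dt * Theta_hat_dot

print("state prediction error:", np.linalg.norm(x_hat - x))
```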

5  Velocity Gradient Algorithms and the Bregman Lagrangian

In this section, we provide background material on the velocity gradient formalism (Andrievskii et al., 1988; Fradkov, 1980, 1986; Fradkov et al., 1999) and the Bregman Lagrangian (Betancourt et al., 2018; Wibisono et al., 2016; Wilson et al., 2016).

5.1  Velocity Gradient Algorithms

We now provide a brief introduction to a class of adaptive control methods known as velocity gradient algorithms (Andrievskii et al., 1988; Fradkov, 1980, 1986; Fradkov et al., 1999). In their most basic form, they are specified by a “local” goal functional $Q(\mathbf{x},t) : \mathbb{R}^n\times\mathbb{R}_+\to\mathbb{R}$ we would like to drive to zero. The adaptation law is defined as
$$\dot{\hat{a}} = -P\nabla_{\hat{a}}\dot{Q}(\mathbf{x},\hat{a},t),$$
(5.1)
where $P = P^{\mathsf T} > 0$ is a positive-definite matrix of learning rates of appropriate dimension and $\dot{Q}(\mathbf{x},\hat{a},t) = \nabla_{\mathbf{x}}Q(\mathbf{x},t)^{\mathsf T}\dot{\mathbf{x}} + \frac{\partial Q(\mathbf{x},t)}{\partial t}$. Intuitively, while the goal functional $Q(\mathbf{x},t)$ may only depend on the control parameters $\hat{a}$ indirectly through $\mathbf{x}$, its time derivative will depend explicitly on $\hat{a}$ through $\dot{\mathbf{x}}$.6 The adaptation law, equation 5.1, ensures that $\hat{a}$ moves in a direction that instantaneously decreases $\dot{Q}(\mathbf{x},\hat{a},t)$. Under the conditions specified by assumptions 4 to 6, this causes $\dot{Q}(\mathbf{x},\hat{a},t)$ to be negative for long enough to accomplish the control goal (Fradkov et al., 1999).
Assumption 4.

$Q(\mathbf{x},t)$ is nonnegative and radially unbounded, so that $Q(\mathbf{x},t)\geq 0$ for all $\mathbf{x}$, $t$ and $Q(\mathbf{x},t)\to\infty$ when $\|\mathbf{x}\|\to\infty$. $Q(\mathbf{x},t)$ is uniformly continuous in $t$ whenever $\mathbf{x}$ is bounded.

Assumption 5.

There exists an ideal set of control parameters $a$ such that the origin of the system 2.1 is globally asymptotically stable when the control is evaluated at $a$. Furthermore, $Q(\mathbf{x},t)$ is a Lyapunov function for the system when the control is evaluated at $a$. That is, there exists a strictly increasing function $\rho$ such that $\rho(0) = 0$ with $\dot{Q}(\mathbf{x},a,t) \leq -\rho(Q)$.

Assumption 6.
The time derivative of $Q$ is convex in the control parameters $\hat{a}$, that is,
$$\dot{Q}(\mathbf{x},a_1,t) \geq \dot{Q}(\mathbf{x},a_2,t) + \left(a_1 - a_2\right)^{\mathsf T}\nabla_{a_2}\dot{Q}(\mathbf{x},a_2,t),$$
(5.2)
is satisfied for all a1 and a2.

The properties of equation 5.1 are summarized in the following proposition (Fradkov et al., 1999).

Proposition 3.
Consider the local velocity gradient algorithm, equation 5.1, under assumptions 4 to 6. Then all solutions $(\mathbf{x}(t),\hat{a}(t))$ of equations 2.1 and 5.1 remain bounded, and
$$\lim_{t\to\infty}Q(\mathbf{x}(t),t) = 0$$
for all $\mathbf{x}(0)\in\mathbb{R}^n$.

The proof follows by consideration of the Lyapunov-like function $V = Q + \frac{1}{2}\tilde{a}^{\mathsf T}P^{-1}\tilde{a}$.

Remark 13.

If $Q(\mathbf{x},t)$ is chosen so that $\dot{Q}(\mathbf{x},\hat{a},t)$ depends on $\hat{a}$ only through $\hat{f}(\mathbf{x},\hat{a},t)$ and $f(\mathbf{x},a,t)$ is linearly parameterized, then assumption 6 will immediately be satisfied by convexity of affine functions. Indeed, consider defining the goal functional $Q(\mathbf{x},t) = \frac{1}{2}s(\mathbf{x},t)^2$ for system 2.1, where $s$ depends on $t$ through $\mathbf{x}_d(t)$. It is clear that this proposed goal functional satisfies assumptions 4 and 5 for bounded $\mathbf{x}_d(t)$. Then $\dot{Q}(\mathbf{x},\hat{a},t) = -\eta s(\mathbf{x},t)^2 + s\tilde{f}(\mathbf{x},\hat{a},a,t)$, and equation 5.1 exactly recovers the Slotine and Li controller, equation 2.8.

Remark 14.

An alternative perspective on velocity gradient algorithms can be found by using the expression $\dot{Q}(\mathbf{x},\hat{a},t) = \nabla_{\mathbf{x}}Q(\mathbf{x},t)^{\mathsf T}\dot{\mathbf{x}} + \frac{\partial Q(\mathbf{x},t)}{\partial t}$. Assume that $\dot{\mathbf{x}} = \mathbf{u} - Y(\mathbf{x},t)a$, and set $\mathbf{u} = Y(\mathbf{x},t)\hat{a} + \mathbf{u}_d$ where $\mathbf{u}_d$ ensures that $\mathbf{x}(t)\to\mathbf{x}_d(t)$ for $\hat{a} = a$. Then $\nabla_{\hat{a}}\dot{Q}(\mathbf{x},\hat{a},t) = Y^{\mathsf T}\nabla_{\mathbf{x}}Q(\mathbf{x},t)$. This shows that the adaptation law $\dot{\hat{a}} = -P\nabla_{\hat{a}}\dot{Q}(\mathbf{x},\hat{a},t) = -PY(\mathbf{x},t)^{\mathsf T}\nabla_{\mathbf{x}}Q(\mathbf{x},t)$ transforms the gradient of $Q(\mathbf{x},t)$ with respect to $\mathbf{x}$ by premultiplication by the regressor $Y(\mathbf{x},t)^{\mathsf T}$. This interpretation applies to the observers and dynamics predictors designed in section 4, as well as the adaptation law for contracting systems developed in Lopez and Slotine (2021). Conversely, this perspective shows that if a Lyapunov function $V(\mathbf{x},t)$ is known for a nominal system $\dot{\mathbf{x}} = \mathbf{f}(\mathbf{x},t)$, then the control input $\mathbf{u} = Y(\mathbf{x},t)\hat{a}$ with adaptation law $\dot{\hat{a}} = -PY^{\mathsf T}\nabla_{\mathbf{x}}V(\mathbf{x},t)$ will return the perturbed system $\dot{\mathbf{x}} = \mathbf{f}(\mathbf{x},t) + \mathbf{u} - Y(\mathbf{x},t)a$ back to its nominal behavior.

Rather than a local functional, one may instead specify an integral goal functional of the form $Q(\mathbf{x},\hat{a},t) = \int_0^tR(\mathbf{x}(t'),\hat{a}(t'),t')\,dt'$. In this case, equation 5.1 takes the form
$$\dot{\hat{a}} = -P\nabla_{\hat{a}}R(\mathbf{x},\hat{a},t).$$
(5.3)
Equation 5.3 is a gradient flow algorithm on the loss function R(x,a^,t). We now replace assumptions 4 and 5 by a slightly modified setting.
Assumption 7.

$R$ is a nonnegative function and $R(\mathbf{x}(t),\hat{a}(t),t)$ is uniformly continuous in $t$ for bounded $\mathbf{x}$ and $\hat{a}$. Furthermore, $\nabla_{\hat{a}}R(\mathbf{x},\hat{a},t)$ is locally bounded in $\mathbf{x}$ and $\hat{a}$ uniformly in $t$.

Assumption 8.

There exists an ideal set of controller parameters a and a scalar function μ such that $\int_0^\infty\mu(t')dt' < \infty$, $\lim_{t\to\infty}\mu(t)=0$, and $R(x(t),a,t)\le\mu(t)$ for all t.

The properties of algorithm 5.3 are summarized in the following proposition (Fradkov et al., 1999).

Proposition 4.
Consider the integral velocity gradient algorithm 5.3, where the goal functional Q satisfies assumptions 6 to 8. Then $Q(x(t),t)\le\alpha$, where
$\alpha = \frac12\tilde a(0)^\top P^{-1}\tilde a(0) + \int_0^\infty\mu(t')dt',$
and $\int R(x(t'),\hat a(t'),t')dt' < \infty$ over the maximal interval of existence of x. Furthermore, $R(x,\hat a,t)\to0$ for any bounded solution x(t).

The proof follows by consideration of the Lyapunov-like function $V = \int_0^t R(x(t'),\hat a(t'),t')dt' + \frac12\tilde a^\top P^{-1}\tilde a + \int_t^\infty\mu(t')dt'$.

Integral functionals allow the specification of a control goal that depends on all past data. R(x,a^,t) is chosen so that it does not necessarily depend on the structure of the dynamics but depends explicitly on a^. Local functionals, on the other hand, result in adaptation laws that do have an explicit dependence on the dynamics through the appearance of the term QxTx˙ in Q˙(x,a^,t).

Integral functionals can be particularly useful if $R(x,\hat a,t)\to0$ implies the desired control goal. In this work, we focus on the choice $R(x,\hat a,t)=\frac12\tilde f(x,\hat a,a,t)^2$, which will require a PI form as described in section 2 in the context of Tyukin's algorithm. In particular, note that for this choice of R, the result of proposition 4 implies that $\tilde f\in\mathcal L_2$ over the maximal interval of existence of x. For some error models, this is enough to ensure that $x\in\mathcal L_\infty$, and hence that $\tilde f(x,\hat a,a,t)\to0$ and $x\to x_d$.

Goal functionals can also be written as a sum of local and integral functionals with similar guarantees, and these approaches will lead to composite algorithms in the subsequent sections. (See Fradkov et al., 1999, chap. 3 for more detail.)

Remark 15.
Following the developments of section 3, we can immediately prove analogous results for natural gradient or mirror descent–like velocity gradient algorithms. For local functionals, the adaptation law
$\dot{\hat a} = -\gamma\left[\nabla^2\psi(\hat a)\right]^{-1}\nabla_{\hat a}\dot Q(x,\hat a,t)$
with γ>0 a positive learning rate and ψ a strongly convex function will lead to the same conclusions as proposition 3 under the same conditions. The proof follows by consideration of the Lyapunov-like function,
$V = Q(x,t) + d_\psi(a\,\|\,\hat a).$
Similarly, the Lyapunov-like function
$V = \int_0^t R(x(t'),\hat a(t'),t')dt' + d_\psi(a\,\|\,\hat a) + \int_t^\infty\mu(t')dt'$
shows that the same conclusions as in proposition 4 hold under the same conditions for the integral natural velocity gradient algorithm,
$\dot{\hat a} = -\gamma\left[\nabla^2\psi(\hat a)\right]^{-1}\nabla_{\hat a}R(x,\hat a,t).$
In both cases, the choice of ψ offers a principled way to regularize velocity gradient algorithms.

5.2  The Bregman Lagrangian and Accelerated Optimization Algorithms

In Wibisono et al. (2016), the Bregman Lagrangian was shown to generate a suite of accelerated optimization algorithms in continuous time by appealing to the Euler Lagrange equations through the principle of least action. In its original form, the Bregman Lagrangian is given by
$\mathcal L(x,\dot x,t) = e^{\bar\alpha+\bar\gamma}\left(d_\psi\!\left(x + e^{-\bar\alpha}\dot x\,\big\|\,x\right) - e^{\bar\beta}f(x)\right).$
(5.4)
In equation 5.4, f(x) is the loss function to be optimized, and ψ(x) is a strongly convex function. We take $\psi(\cdot)=\frac12\|\cdot\|_2^2$ in section 6 and consider extensions to arbitrary ψ in section 7. Allowing for arbitrary ψ extends the algorithms presented in section 6 to the natural gradient-like setting of section 2.3.

The quantities $\bar\alpha(t):\mathbb R_+\to\mathbb R$, $\bar\beta(t):\mathbb R_+\to\mathbb R$, and $\bar\gamma(t):\mathbb R_+\to\mathbb R$ in equation 5.4 are arbitrary time-dependent functions that will ultimately set the damping and learning rates in the second-order Euler-Lagrange dynamics. To generate accelerated optimization algorithms, Wibisono, Wilson, and Jordan (2016) required two ideal scaling conditions: $\dot{\bar\beta}\le e^{\bar\alpha}$ and $\dot{\bar\gamma}=e^{\bar\alpha}$. These conditions originate from the Euler-Lagrange equations, where the second is used to eliminate an unwanted term, and a Lyapunov argument, where the first is used to ensure decrease of a chosen Lyapunov function.

Gaudio et al. (2019) recently utilized the Bregman Lagrangian to derive a momentum-like adaptive control algorithm. To do so, they defined $\bar\alpha = \log(\beta N)$, $\bar\beta = \log\!\left(\frac{\gamma}{\beta N}\right)$, and $\bar\gamma = \int e^{\bar\alpha}dt$. Here, $\gamma\ge0$ and $\beta\ge0$ are nonnegative scalar hyperparameters, and N=N(t) is a signal chosen based on the system. With these definitions, choosing the Euclidean norm $\psi(\cdot)=\frac12\|\cdot\|^2$, and modifying the Bregman Lagrangian presented in Gaudio et al. (2019) to the adaptive control framework defined in section 2, equation 5.4 becomes
$\mathcal L(\hat a,\dot{\hat a},t) = e^{\int_0^t\beta N(t')dt'}\frac{1}{\beta N}\left(\frac12\dot{\hat a}^\top\dot{\hat a} - \gamma\beta N\frac{d}{dt}\left(\frac12 s^2\right)\right).$
(5.5)
Comparing equations 5.4 and 5.5, it is clear that the loss function f(x) in equation 5.4 has been replaced by $\frac{d}{dt}\left(\frac12 s^2\right)$ in equation 5.5. Following remark 13, this is precisely the $\dot Q$ velocity gradient functional that gives rise to the Slotine and Li controller. For equation 5.5, the Euler-Lagrange equations lead to the adaptation law,
$\ddot{\hat a} + \left(\beta N - \frac{\dot N}{N}\right)\dot{\hat a} = -\gamma\beta N Y^\top s.$
(5.6)
Equation 5.6 may be understood as a modification of the Slotine and Li adaptive controller to incorporate momentum and time-dependent damping. This equation may also be rewritten as two first-order systems,
$\dot{\hat v} = -\gamma Y^\top s,$
(5.7)
$\dot{\hat a} = \beta N(\hat v - \hat a),$
(5.8)
which are useful for proving stability. The properties of equation 5.6 are summarized in the following proposition.
Proposition 5.

Consider the higher-order adaptation algorithm, equation 5.6, with $N = 1+\mu\|Y\|^2$ and $\mu > \frac{\gamma}{\eta\beta}$. Then all trajectories $(x,\hat v,\hat a)$ remain bounded, $s\in\mathcal L_\infty\cap\mathcal L_2$, $\hat a-\hat v\in\mathcal L_2$, $s\to0$ and $x\to x_d$.

The proof follows by consideration of the Lyapunov-like function $V = \frac12 s^2 + \frac1\gamma\|\tilde v\|^2 + \frac1\gamma\|\hat v-\hat a\|^2$.
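As an illustration, the following sketch (ours, not the article's code) integrates equations 5.7 and 5.8 with explicit Euler steps for the same kind of linearly parameterized example used above; the plant, regressor, and gains are again placeholders.

```python
import numpy as np

# Sketch of the momentum law (5.7)-(5.8): v_hat_dot = -gamma Y^T s,
# a_hat_dot = beta * N * (v_hat - a_hat), with N = 1 + mu * ||Y||^2.

def Y(x):
    return np.array([x[0], x[1], np.sin(x[0])])

a_true = np.array([1.0, -2.0, 0.5])
a_hat, v_hat = np.zeros(3), np.zeros(3)
gamma, beta, eta, lam, dt = 2.0, 1.0, 1.0, 1.0, 1e-3
mu = 2.0 * gamma / (eta * beta)              # illustrative choice larger than gamma/(eta*beta)

x = np.array([1.0, 0.0])
for step in range(int(20.0 / dt)):
    t = step * dt
    xd, xd_dot, xd_ddot = np.sin(t), np.cos(t), -np.sin(t)
    s = (x[1] - xd_dot) + lam * (x[0] - xd)
    u = xd_ddot - lam * (x[1] - xd_dot) - eta * s + Y(x) @ a_hat
    N = 1.0 + mu * np.dot(Y(x), Y(x))        # state-dependent learning / damping signal
    v_hat = v_hat - dt * gamma * Y(x) * s
    a_hat = a_hat + dt * beta * N * (v_hat - a_hat)
    x = x + dt * np.array([x[1], u - Y(x) @ a_true])
```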

Remark 16.
The transformation to a system of two first-order equations may seem somewhat ad hoc, but it follows immediately by consideration of the non-Euclidean Bregman Lagrangian, equation 5.4. Indeed, it is easy to check that $\hat v = \hat a + \frac{\dot{\hat a}}{\beta N}$, which is precisely the adaptive control equivalent of $x + e^{-\bar\alpha}\dot x$ in the first argument of $d_\psi(\cdot\,\|\,\cdot)$ in equation 5.4. The transformation is also readily apparent by use of the Bregman Hamiltonian
$\mathcal H(\hat a,p) = \frac12\beta N e^{-\bar\gamma}\|p\|^2 + \gamma e^{\bar\gamma}\frac{d}{dt}\left(\frac12 s^2\right),$
(5.9)
which, via Hamilton's equations, leads to
$\dot p = -\frac{\partial\mathcal H}{\partial\hat a} = -\gamma e^{\bar\gamma}Y^\top s,\qquad \dot{\hat a} = \frac{\partial\mathcal H}{\partial p} = \beta N e^{-\bar\gamma}p.$
Defining $\hat v = e^{-\bar\gamma}p + \hat a$ immediately leads to equations 5.7 and 5.8. This line of reasoning was recently investigated further by Gaudio, Annaswamy, Bolender, Lavretsky, and Gibson (2021). As is typical in classical mechanics, the Bregman Hamiltonian may be obtained from a Legendre transform of the Bregman Lagrangian. The Hamiltonian dynamics may be useful for discrete-time algorithm development through application of symplectic discretization techniques (Betancourt et al., 2018; França, Sulam, Robinson, & Vidal, 2019; Shi, Du, Su, & Jordan, 2019).
Remark 17.

It is well known, for example from a passivity interpretation of the Lyapunov-like analysis (see, e.g., Slotine & Li, 1991), that the pure integrator in the standard Slotine and Li adaptation law, equation 2.8, can be replaced by any linear positive real transfer function containing a pure integrator. The higher-order algorithms presented in this work are distinct from this approach, as most clearly seen by the state-dependent damping term in equation 5.6.

Remark 18.

In Wibisono et al. (2016), the suggested Lyapunov function in the Euclidean setting is $V = \|x + e^{-\bar\alpha}\dot x - x^*\|^2 + e^{\bar\beta}f(x)$, where $x^*$ is the global optimum and f(x) is the loss function. Noting that $\hat v$ is the equivalent of $x + e^{-\bar\alpha}\dot x$ in the adaptive control context (see remark 16), we see that the Lyapunov-like function used to prove stability of the adaptive law, equation 5.6, is similar to that used to prove convergence in the optimization context. The loss function term f(x) is replaced by $\frac12 s^2$, and it is necessary to add the term $\frac1\gamma\|\hat v-\hat a\|^2$.

6  Adaptation Laws with Momentum

In this section, we develop several new adaptation laws for both linearly and nonlinearly parameterized systems. We begin by noting that the Bregman Lagrangian generates velocity gradient algorithms with momentum. We prove some general conditions under which these momentum algorithms will achieve tracking. By analogy with integral velocity gradient functionals, we then derive a proportional-integral scheme to implement a first-order composite adaptation law (Slotine & Li, 1991) driven directly by the function approximation error rather than its filtered version. We subsequently fuse the generating functional for the composite law with the Bregman Lagrangian to construct a composite algorithm with momentum.

We then employ a connection between recent developments in isotonic regression (the GLM-Tron of Kakade et al., 2011, along with extensions due to Goel and Klivans, 2017, and Goel et al., 2018) and Tyukin's algorithm, equation 2.12, to derive momentum algorithms for nonlinearly parameterized systems. These momentum algorithms can be seen as the adaptive control equivalent of the GLM-Tron with momentum.

We follow this development by discussing a new form of high-order algorithm inspired by the elastic averaging stochastic gradient descent (EASGD) algorithm (Boffi & Slotine, 2020; Zhang, Choromanska, & LeCun, 2014). We subsequently demonstrate the capability of using time-varying learning rates with our presented algorithms (Slotine & Li, 1991).

6.1  Velocity Gradient Algorithms with Momentum

As noted in section 5.2, the Bregman Lagrangian (equation 5.5) that generates the higher-order algorithm in equation 5.6 contains the local velocity gradient functional $Q(x,t)=\frac12 s(x,t)^2$ that gives rise to the Slotine and Li controller (equation 2.8). Based on this observation, we define local and integral higher-order velocity gradient algorithms via the Euclidean Bregman Lagrangian. We begin with the local functional
$\mathcal L(\hat a,\dot{\hat a},t) = e^{\int_0^t\beta N(t')dt'}\frac{1}{\beta N(t)}\left(\frac12\dot{\hat a}^\top\dot{\hat a} - \gamma\beta N(t)\frac{d}{dt}Q(x,t)\right),$
which generates the higher-order law
$\ddot{\hat a} + \left(\beta N - \frac{\dot N}{N}\right)\dot{\hat a} = -\gamma\beta N\nabla_{\hat a}\dot Q(x,\hat a,t).$
(6.1)
Algorithm 6.1 can be rewritten as two first-order systems:
$\dot{\hat v} = -\gamma\nabla_{\hat a}\dot Q(x,\hat a,t),$
(6.2)
$\dot{\hat a} = \beta N(\hat v - \hat a).$
(6.3)
To achieve the control goal, we require the following technical assumption in addition to assumptions 4 and 6. This assumption replaces assumption 5 for first-order velocity gradient algorithms.
Assumption 9.
There exists a time-dependent signal $\mathcal N(t)$ and nonnegative scalar values $\beta\ge0$, $\mu\ge0$ such that the time derivative of the goal functional evaluated at the true parameters, $\dot Q(x,a,t)$, satisfies the following inequality:
$\dot Q(x,a,t) - \frac{\beta\mu}{\gamma}\mathcal N(t)\|\hat a-\hat v\|^2 + 2(\hat a-\hat v)^\top\nabla_{\hat a}\dot Q(x,\hat a,t) \le -\rho(Q).$
(6.4)
In equation 6.4, ρ(·) is positive definite, continuous in Q, and satisfies ρ(0)=0.

Assumption 9 is a formal statement that we may “complete the square” on the left-hand side of 6.4. For example, for $\nabla_{\hat a}\dot Q(x,\hat a,t) = Y^\top s$ and for $\dot Q(x,a,t) = -\eta s^2$, we may choose $\mathcal N = \|Y\|^2$.

With assumption 9 in hand, we can state the following proposition.

Proposition 6.

Consider algorithm 6.1 or its equivalent form, 6.2 and 6.3, and assume Q satisfies assumptions 4, 6, and 9. Then, all solutions $(x(t),\hat v(t),\hat a(t))$ remain bounded, $\hat a-\hat v\in\mathcal L_2$, and $\lim_{t\to\infty}Q(x(t),t)=0$.

Proof.
Consider the Lyapunov-like function,
$V = Q(x,t) + \frac{1}{2\gamma}\tilde v^\top\tilde v + \frac{1}{2\gamma}(\hat a-\hat v)^\top(\hat a-\hat v).$
(6.5)
Equation 6.5 implies that, with $N(t) = 1 + \mu\mathcal N(t)$,
$\dot V = \dot Q(x,\hat a,t) - \tilde a^\top\nabla_{\hat a}\dot Q(x,\hat a,t) - \frac\beta\gamma\|\hat a-\hat v\|^2 - \frac{\beta\mu}{\gamma}\mathcal N(t)\|\hat a-\hat v\|^2 + 2(\hat a-\hat v)^\top\nabla_{\hat a}\dot Q(x,\hat a,t) \le \dot Q(x,a,t) - \frac\beta\gamma\|\hat a-\hat v\|^2 - \frac{\beta\mu}{\gamma}\mathcal N(t)\|\hat a-\hat v\|^2 + 2(\hat a-\hat v)^\top\nabla_{\hat a}\dot Q(x,\hat a,t) \le -\rho(Q) - \frac\beta\gamma\|\hat a-\hat v\|^2.$
(6.6)

By radial unboundedness of Q(x,t) in x, equations 6.5 and 6.6 show that x remains bounded. Similarly, radial unboundedness of V in $\tilde v$ and $\hat a-\hat v$ shows that $\hat v$ and $\hat a$ remain bounded. Integrating equation 6.6 shows that $\frac\beta\gamma\int_0^\infty\|\hat a-\hat v\|^2dt \le V(0)-V(\infty) < \infty$, so that $\hat a-\hat v\in\mathcal L_2$. An identical argument shows that $\int_0^\infty\rho(Q)dt < \infty$. Now, because x and $\hat a$ are bounded and because $\tilde f(x,\hat a,a,t)$ is locally bounded in x and $\hat a$ uniformly in t by assumption, writing $x(t)-x(s) = \int_s^t\left[f(x(t'),a,t') + u(\hat a(t'),t')\right]dt'$ shows that x(t) is uniformly continuous in t. Because Q(x,t) is uniformly continuous in t when x is bounded, because Q is bounded, and because ρ is continuous in Q, we conclude ρ is uniformly continuous in t and $\lim_{t\to\infty}\rho(t)=\lim_{t\to\infty}\rho(Q(x(t),t))=0$ by Barbalat's lemma. This shows that $\lim_{t\to\infty}Q(x(t),t)=0$.

By taking Q=12s2 in proposition 6, we immediately recover proposition 5.

We now consider the integral functional,
$\mathcal L(\hat a,\dot{\hat a},t) = e^{\int_0^t\beta N(t')dt'}\frac{1}{\beta N(t)}\left(\frac12\dot{\hat a}^\top\dot{\hat a} - \gamma\beta N(t)\frac{d}{dt}\int_0^t R(x(t'),\hat a(t'),t')dt'\right),$
which generates the higher-order law
$\ddot{\hat a} + \left(\beta N - \frac{\dot N}{N}\right)\dot{\hat a} = -\gamma\beta N\nabla_{\hat a}R(x,\hat a,t).$
(6.7)
We again rewrite equation 6.7 as two first-order systems,
$\dot{\hat v} = -\gamma\nabla_{\hat a}R(x,\hat a,t),$
(6.8)
$\dot{\hat a} = \beta N(\hat v - \hat a),$
(6.9)

and now require a modified version of assumption 9.

Assumption 10.
$R(x,\hat a,t)\ge0$ for all x, $\hat a$, and t, and is uniformly continuous in t for bounded x and $\hat a$. $\nabla_{\hat a}R(x,\hat a,t)$ is locally bounded in x and $\hat a$ uniformly in t. Furthermore, there exists a time-dependent signal $\mathcal N(t)$ and nonnegative scalar values $\beta\ge0$, $\mu\ge0$ such that
$R(x,a,t) - R(x,\hat a,t) - \frac{\beta\mu}{\gamma}\mathcal N(t)\|\hat a-\hat v\|^2 + 2(\hat a-\hat v)^\top\nabla_{\hat a}R(x,\hat a,t) \le -kR(x,\hat a,t)$

for some constant k>0.

Similar to assumption 9, assumption 10 is a formal requirement that we may “complete the square.” Consider the case when $R(x,\hat a,t)=\frac12\tilde f(x,\hat a,a,t)^2$. Then $R(x,a,t)=0$, $\nabla_{\hat a}R(x,\hat a,t)=\tilde f(x,\hat a,a,t)\nabla_{\hat a}\hat f(x,\hat a,t)$, and we may choose $\mathcal N(t)=\|\nabla_{\hat a}\hat f(x,\hat a,t)\|^2$.

With assumption 10, we can state the following proposition.

Proposition 7.

Consider algorithm 6.7 along with assumptions 6 and 10. Let $T_x$ denote the maximal interval of existence of x(t). Then $\hat v$ and $\hat a$ remain bounded for $t\in[0,T_x]$, $\hat a-\hat v\in\mathcal L_2$ over this interval, and $\int_0^{T_x}R(x(t'),\hat a(t'),t')dt' < \infty$. Furthermore, for any bounded solution x, these conclusions hold for all t and $R(x(t),\hat a(t),t)\to0$.

Proof.
Consider the Lyapunov-like function,
$V = \frac{1}{2\gamma}\tilde v^\top\tilde v + \frac{1}{2\gamma}(\hat a-\hat v)^\top(\hat a-\hat v).$
(6.10)
Equation 6.10 implies that, with $N(t) = 1+\mu\mathcal N(t)$,
$\dot V = -\tilde a^\top\nabla_{\hat a}R(x,\hat a,t) - \frac\beta\gamma\|\hat a-\hat v\|^2 - \frac{\beta\mu}{\gamma}\mathcal N(t)\|\hat a-\hat v\|^2 + 2(\hat a-\hat v)^\top\nabla_{\hat a}R(x,\hat a,t) \le R(x,a,t) - R(x,\hat a,t) - \frac\beta\gamma\|\hat a-\hat v\|^2 - \frac{\beta\mu}{\gamma}\mathcal N(t)\|\hat a-\hat v\|^2 + 2(\hat a-\hat v)^\top\nabla_{\hat a}R(x,\hat a,t) \le -kR(x,\hat a,t) - \frac\beta\gamma\|\hat a-\hat v\|^2.$
(6.11)
Equations 6.10 and 6.11 show boundedness of $\hat v$ and $\hat a$ over $[0,T_x]$. Furthermore, integrating equation 6.11 shows that $\int_0^{T_x}\|\hat a-\hat v\|^2dt' < \infty$ and $\int_0^{T_x}R(x(t'),\hat a(t'),t')dt' < \infty$. For any bounded solution x, these integrals may be extended to infinity, and we conclude that $\hat a-\hat v\in\mathcal L_2$, $\hat a\in\mathcal L_\infty$, and $\hat v\in\mathcal L_\infty$. Writing x(t)-x(s) in integral form as in the proof of proposition 6 shows that x(t) is uniformly continuous in t, and in light of the local boundedness assumption on $\nabla_{\hat a}R$, the same procedure can be applied to $\hat v$ and $\hat a$. Because $R(x(t),\hat a(t),t)$ is uniformly continuous in t for bounded x and $\hat a$, and because x(t) and $\hat a(t)$ are both uniformly continuous in t, we conclude that $R(x(t),\hat a(t),t)$ is uniformly continuous in t and $R\to0$ by Barbalat's lemma.

As mentioned in section 5.1, we will be particularly interested in proposition 7 when $R = \frac12\tilde f^2$, which will generate composite adaptation algorithms and algorithms applicable to nonlinearly parameterized systems. Proposition 7 then shows that $\tilde f\in\mathcal L_2$ over the interval of existence of x(t). As shown by lemma 18, with our error model this is enough to show that x(t) always remains bounded and hence $\tilde f\to0$.

Remark 19.

Classically, Lyapunov-like functions used in adaptive control consist of a sum of tracking and parameter estimation error terms, with a^˙ chosen to cancel a term of unknown sign. Several Lyapunov functions in this work consist only of parameter estimation error terms, such as equation 6.10. From a mathematical point of view, all that matters is that V˙ is negative semidefinite and contains signals related to the tracking error. Integrating V˙ allows the application of tools from functional analysis to ensure that the control goal is accomplished. The lack of tracking error term in V is the origin of the additional complication that x(t) must be shown to be bounded even after it is known that V˙0.

6.2  First- and Second-Order Composite Adaptation Laws

Here we consider the linearly parameterized setting f(x,a,t)=Y(x,t)a, and derive new first- and second-order composite adaptation laws. Composite adaptation laws are driven by two sources of error: the tracking error itself, as summarized by s in the Slotine and Li controller, and a prediction error. The prediction error term is generally obtained from an algebraic relation constructed by filtering the dynamics (Slotine & Li, 1991). We present a composite algorithm that does not require any explicit filtering of the dynamics but is instead driven simultaneously by s and f˜.

A starting point for our first proposed algorithm is to consider a hybrid local and integral velocity gradient functional,
$Q(x,t) = \frac\gamma2 s(x,t)^2 + \frac\kappa2\int_0^t\tilde f^2(x(t'),\hat a(t'),a,t')dt',$
(6.12)
where κ>0 and γ>0 are positive learning rates weighting the contributions of each term. As discussed in section 5.1, the first term leads to the Slotine and Li controller. The second can be clearly seen to satisfy assumptions 7 and 8 with μ(t)=0. It also satisfies assumption 6, as $\tilde f^2$ is a quadratic function of $\tilde a$ for linear $\tilde f$. Following the velocity gradient formalism, the resulting adaptation law is given by
$\dot{\hat a} = -PY^\top\left(\gamma s + \kappa Y\tilde a\right),$
(6.13)
which is a composite adaptation law simultaneously driven by s and the instantaneous function approximation error $Y\tilde a = \tilde f$. Equation 6.13 depends on the function approximation error $\tilde f$, which is not measured and hence cannot be used directly in an adaptation law. Nevertheless, it can be obtained through a proportional-integral form for $\hat a$ in an identical manner to section 2.2. To do so, we define
ξ(x,t)=-κPs(x,t)Y(x,t)T,
(6.14)
ρ(x,t)=κPxn(t0)xn(t)s(x,t)Y(x,t)Txndxn,
(6.15)
a^=a¯+ξ(x,t)+ρ(x,t),
(6.16)
a¯˙=-κη+γsPYT+κsi=1n-1PYxix˙i-i=1n-1ρxiTx˙i,-ρxdTx˙d-ξt-ρt.
(6.17)
Computing a^˙ demonstrates that equation 6.13 is obtained through only the known signals contained in equations 6.14 to 6.17 despite its dependence on Ya˜. A few remarks concerning algorithms 6.13 to 6.17 are in order.
Remark 20.

The Ya˜ term may also be obtained by following the I&I formalism (Astolfi & Ortega, 2003; Liu et al., 2010). To our knowledge, this discussion is the first that demonstrates the possibility of using a PI law in combination with a standard Lyapunov-stability motivated adaptation law to obtain a composite law.

Remark 21.

More error signals may be used for additional terms in the adaptation law. For example, a prediction error obtained by filtering the dynamics may also be employed, leading to a three-term composite algorithm.

Remark 22.

Much like the standard composite law obtained by filtering the dynamics, rearranging equation 6.13 shows that a˜˙+PYTYa˜=-PYTs, so that the additional term can be seen to add a damping term that smooths adaptation (Slotine & Li, 1991).

Remark 23.

As mentioned in section 2.1, for clarity of presentation we have restricted our discussion to the nth-order system 2.1. In general, the PI form 6.16 leads to undesired unknown terms contained in $\left(\frac{\partial\xi(x,x_d)}{\partial x}\right)^\top\dot x$ in addition to the desired unknown term. In this case, the desired unknown term is $-\kappa PY^\top Y\tilde a$, while the undesired unknown term is $-\kappa P\frac{\partial Y^\top}{\partial x_n}\dot x_n s$. Indeed, the purpose of introducing the additional proportional term $\rho(x,x_d)$ in equation 6.15 is to cancel this undesired unknown term. In general, cancellation of the undesired terms can be obtained by choosing ρ to solve a PDE, and solutions to this PDE will only exist if the undesired term is the gradient of an auxiliary function. ρ is then chosen to be exactly this auxiliary function. In some cases, the PDE can be avoided, such as through dynamic scaling techniques (Karagiannis, Sassano, & Astolfi, 2009) or the similar embedding technique of Tyukin (2011).

The properties of the adaptive law, equation 6.13, may be summarized with the following proposition.

Proposition 8.

Consider the adaptation algorithm 6.13 with a linearly parameterized unknown, f(x,a,t)=Y(x,t)a. Then all trajectories $(x,\hat a)$ remain bounded, $s\in\mathcal L_2\cap\mathcal L_\infty$, $\tilde f\in\mathcal L_2$, $s\to0$, and $x\to x_d$.

The proof is given in section A.1.
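The following simulation-only sketch illustrates equation 6.13. Because $\tilde f$ is not measurable, an implementation must generate the $Y\tilde a$ term through the PI form, equations 6.14 to 6.17; the sketch below sidesteps this by letting the simulator, which knows the true parameters, evaluate $\tilde f$ directly. All numerical details are illustrative.

```python
import numpy as np

# Simulation-only sketch of the composite law (6.13):
# a_hat_dot = -P Y^T (gamma * s + kappa * Y a_tilde).

def Y(x):
    return np.array([x[0], x[1], np.sin(x[0])])

a_true = np.array([1.0, -2.0, 0.5])
a_hat = np.zeros(3)
P = 2.0 * np.eye(3)
gamma, kappa, lam, eta, dt = 1.0, 1.0, 1.0, 1.0, 1e-3

x = np.array([1.0, 0.0])
for step in range(int(20.0 / dt)):
    t = step * dt
    xd, xd_dot, xd_ddot = np.sin(t), np.cos(t), -np.sin(t)
    s = (x[1] - xd_dot) + lam * (x[0] - xd)
    u = xd_ddot - lam * (x[1] - xd_dot) - eta * s + Y(x) @ a_hat
    f_tilde = Y(x) @ (a_hat - a_true)         # available to the simulator only
    a_hat = a_hat - dt * P @ (Y(x) * (gamma * s + kappa * f_tilde))
    x = x + dt * np.array([x[1], u - Y(x) @ a_true])
```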

Following the velocity gradient with momentum approach of section 6.1, we now obtain a higher-order composite algorithm and give a PI implementation. We again consider a hybrid local and integral velocity gradient functional, so that equation 5.4 takes the form
$\mathcal L(\hat a,\dot{\hat a},t) = e^{\int_0^t\beta N(t')dt'}\frac{1}{\beta N(t)}\left(\frac12\dot{\hat a}^\top\dot{\hat a} - \beta N(t)\frac{d}{dt}\left[\frac\gamma2 s^2 + \frac\kappa2\int_0^t\tilde f^2(x(t'),\hat a(t'),a,t')dt'\right]\right),$
(6.18)
where γ>0 and κ>0 are positive constants weighting the two error terms. The Euler-Lagrange equations then lead to the higher-order composite system:
$\ddot{\hat a} + \left(\beta N - \frac{\dot N}{N}\right)\dot{\hat a} = -\beta N Y^\top\left(\gamma s + \kappa Y\tilde a\right).$
(6.19)
As in section 5.2, equation 6.19 may be implemented as two first-order systems:
$\dot{\hat v} = -Y^\top\left(\gamma s + \kappa Y\tilde a\right),$
(6.20)
$\dot{\hat a} = \beta N(\hat v - \hat a).$
(6.21)
In an implementation, equation 6.20 is obtained through the PI form v^=v¯+ξ(x,t)+ρ(x,t) with ξ, ρ, and v¯˙ given by equations 6.14, 6.15, and 6.17, respectively, with P=I. The properties of the higher-order composite adaptation law, equation 6.19, are stated in the following proposition.
Proposition 9.

Consider the higher-order composite adaptation algorithm, equation 6.19, for a linearly parameterized unknown, f(x,a,t)=Y(x,t)a. Set $N = 1+\mu\|Y\|^2$ and $\mu > \frac\gamma\beta\left(\frac1\eta + \frac\kappa\gamma\right)$. Then all trajectories $(x,\hat v,\hat a)$ remain bounded, $\hat v-\hat a\in\mathcal L_2$, $s\in\mathcal L_\infty\cap\mathcal L_2$, $\tilde f\in\mathcal L_\infty\cap\mathcal L_2$, $s\to0$, and $x\to x_d$.

The proof is given in section A.2.

Remark 24.

By following the proof, the signal N may be chosen alternatively to be matrix-valued as N=I+μYTY.

Remark 25.

The Ya˜ term may be used in isolation, by considering the Lyapunov function V=12v˜2+12a^-v^2.

Remark 26.

A gain matrix P=PT>0 of appropriate dimension may be placed in front of YT in v^˙. The quadratic parameter estimation error terms in the Lyapunov function should then be replaced by the weighted terms 12v˜TP-1v˜+12a^-v^TP-1a^-v^, and bounds on μ will be given in terms of P.

6.3  A Momentum Algorithm for Nonlinearly Parameterized Adaptive Control

We now use the development in section 6.2 to present a new momentum algorithm applicable when the unknown parameters appear nonlinearly in the dynamics. We begin with an analogy to statistics.

Generalized linear model (GLM) regression is an extension of linear regression where the data are assumed to be generated by a function of the form f(x)=u(wTx) for a known “link function” u and unknown parameters w. The first computationally and statistically efficient algorithm for this problem, the GLM-Tron of Kakade et al. (2011), assumes that u is Lipschitz and monotonic, much like assumption 2.

The GLM-Tron algorithm was recently extended to the setting of kernel methods and was subsequently used to provably learn two hidden-layer neural networks by Goel and Klivans (2017); this extension is known as the Alphatron. In the kernel GLM setting handled by the Alphatron, the function to be approximated is assumed to be of the form $f(x) = u\!\left(\sum_{i=1}^m w_i K(x,x_i)\right)$, where K is the kernel function for a reproducing kernel Hilbert space (RKHS) $\mathcal H$. K is thus given by the RKHS inner product of a feature map ϕ, $K(x,y) = \langle\phi(x),\phi(y)\rangle_{\mathcal H}$.

The Alphatron initializes all weights to zero and, given a batch of labeled training data $\{(x_i, f(x_i))\}_{i=1}^m$, updates them with a learning rate λ>0 according to the iteration
$\hat w_i^{t+1} = \hat w_i^t - \frac\lambda m\left(\hat f(\hat w^t,x_i) - f(x_i)\right).$
(6.22)
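As a concrete picture of the supervised-learning side of this correspondence, the sketch below runs an Alphatron-style iteration, equation 6.22, on synthetic data; the RBF kernel, tanh link, and constants are our own illustrative choices.

```python
import numpy as np

# Sketch of the Alphatron-style update (6.22) for kernel GLM regression with a known
# monotone, Lipschitz link u: predict f_hat(x) = u(sum_i w_i K(x, x_i)).

rng = np.random.default_rng(0)
m, d, lam, T = 200, 3, 0.5, 500

def kernel(X1, X2):                       # RBF kernel, an illustrative choice
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2)

def link(z):                              # monotone, Lipschitz link function
    return np.tanh(z)

X = rng.normal(size=(m, d))
w_true = rng.normal(size=d)
y = link(X @ w_true)                      # noiseless labels for illustration

K = kernel(X, X)
w = np.zeros(m)                           # weights initialized to zero
for _ in range(T):
    f_hat = link(K @ w)                   # predictions at the training points
    w = w - (lam / m) * (f_hat - y)       # update (6.22)

print(np.mean((link(K @ w) - y) ** 2))    # training error decreases over iterations
```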
We now demonstrate an equivalence between Tyukin's adaptation law equation 2.12, and the Alphatron weight update, equation 6.22 in the following proposition.
Proposition 10.

The adaptation law, equation 2.12, is an application of the Alphatron algorithm, equation 6.22, to adaptive control.

The proof is given in section A.3.

Proposition 10 shows a convergence of techniques in nonlinearly parameterized adaptive control and nonconvex learning. This correspondence suggests the momentum-like variant of equation 2.12,
$\ddot{\hat a} + \left(\beta N - \frac{\dot N}{N}\right)\dot{\hat a} = -\gamma\beta N\tilde f(x,\hat a,a,t)\alpha(x,t),$
(6.23)
which, as before, admits an equivalent representation in terms of two first-order systems,
$\dot{\hat v} = -\gamma\tilde f(x,\hat a,a,t)\alpha(x,t),$
(6.24)
$\dot{\hat a} = \beta N(\hat v - \hat a).$
(6.25)
Equation 6.23 may be implemented through equations 6.24 and 6.25 via the PI form, equations 2.13 to 2.16, applied to the v^ variable.

Equation 6.23 may be obtained via the Bregman Lagrangian, equation 6.18, for velocity gradient laws with momentum by choosing only the integral term. It is then necessary to modify the resulting Euler-Lagrange equations by setting fm' to ±1 based on the monotonicity of f˜ as described in section 2.2. The following proposition summarizes the properties of equations 6.24 and 6.25.

Proposition 11.

Consider the algorithm 6.23 or its equivalent form, 6.24 and 6.25, under assumption 2 with $N = 1+\mu\|\alpha(x,t)\|^2$ and $\mu > \frac{\gamma D_1}{\beta}$. Then all trajectories $(x,\hat a,\hat v)$ remain bounded, $\tilde f\in\mathcal L_2$, $\hat a-\hat v\in\mathcal L_2$, $s\in\mathcal L_2\cap\mathcal L_\infty$, $s\to0$ and $x\to x_d$.

The proof is given in section A.4.
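A simulation-only sketch of equations 6.24 and 6.25 is given below for a monotone nonlinear parameterization satisfying assumption 2. As in the composite sketch above, the simulator evaluates $\tilde f$ directly because it knows the true parameters, whereas an implementation would obtain it through the PI form, equations 2.13 to 2.16; the model and gains are illustrative.

```python
import numpy as np

# Sketch of the momentum law (6.24)-(6.25) for f(x, a) = exp(0.1 * alpha(x)^T a),
# which is monotone in alpha(x)^T a. N = 1 + mu * ||alpha(x)||^2.

rng = np.random.default_rng(0)
d = 10
V = rng.normal(size=(d, 2)) / d
a_true = rng.normal(size=d)

def alpha(x):
    return np.tanh(V @ x)

def f(x, a):
    return np.exp(0.1 * alpha(x) @ a)

a_hat, v_hat = np.zeros(d), np.zeros(d)
gamma, beta, mu, eta, lam, dt = 1.0, 1.0, 5.0, 0.5, 0.5, 1e-3

x = np.array([0.0, 0.0])
for step in range(int(30.0 / dt)):
    t = step * dt
    xd, xd_dot, xd_ddot = np.sin(t), np.cos(t), -np.sin(t)
    s = (x[1] - xd_dot) + lam * (x[0] - xd)
    u = xd_ddot - lam * (x[1] - xd_dot) - eta * s + f(x, a_hat)
    f_tilde = f(x, a_hat) - f(x, a_true)      # simulator-only function approximation error
    N = 1.0 + mu * np.dot(alpha(x), alpha(x))
    v_hat = v_hat - dt * gamma * f_tilde * alpha(x)
    a_hat = a_hat + dt * beta * N * (v_hat - a_hat)
    x = x + dt * np.array([x[1], u - f(x, a_true)])
```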

Remark 27.

As noted in remark 24, by following the proof of proposition 11, one may also take N to be matrix-valued as $N = I + \mu\alpha(x,t)\alpha(x,t)^\top$.

Remark 28.

As in remark 26, a gain matrix $P = P^\top > 0$ of appropriate dimension may be placed in front of α(x,t) in $\dot{\hat v}$.

Predominantly inspired by deep learning, there has recently been strong interest in nonconvex models that are nevertheless amenable to gradient-based or gradient-inspired optimization. The development in this section suggests that machine learning models that can be provably optimized using gradient techniques represent a promising class of nonlinear parameterizations for adaptive control development.

6.4  The Elastic Modification

We now consider a modification to the previously discussed adaptive control laws inspired by the elastic averaging SGD (EASGD) algorithm (Boffi & Slotine, 2020; Zhang et al., 2014). EASGD is an algorithm intended for distributed training of deep neural networks across p graphics processing units (GPUs). Each GPU is used to train a local copy of the deep network model, and each local copy maintains its own set of parameters a^(i). These parameters are updated according to the iteration
$\hat a_{t+1}^{(i)} = \hat a_t^{(i)} - \lambda g_t^{(i)} + \lambda k\left(\bar a_t - \hat a_t^{(i)}\right),$
(6.26)
$\bar a_{t+1} = \bar a_t + \lambda k\frac1p\sum_{i=1}^p\left(\hat a_t^{(i)} - \bar a_t\right),$
(6.27)
where λ is the learning rate, gt(i) is the stochastic gradient approximation computed by the ith agent at time step t, k is the coupling strength, and a¯ is the center variable. Equation 6.27 takes the form of a low-pass filter of the instantaneous average of the set of local parameters.
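For concreteness, the following sketch runs the EASGD iteration, equations 6.26 and 6.27, for a toy quadratic loss with p = 4 workers; the loss, noise model, and constants are illustrative and not taken from the references.

```python
import numpy as np

# Sketch of EASGD (6.26)-(6.27): each worker holds a local copy a_hat[i]; the center
# variable a_bar low-pass filters the instantaneous average of the local copies.

rng = np.random.default_rng(0)
p, d, lam, k, T = 4, 5, 0.05, 0.5, 2000
a_star = rng.normal(size=d)                   # minimizer of the toy quadratic loss

a_hat = rng.normal(size=(p, d))
a_bar = np.zeros(d)
for _ in range(T):
    g = (a_hat - a_star) + 0.1 * rng.normal(size=(p, d))      # noisy gradients g_t^(i)
    a_hat_new = a_hat - lam * g + lam * k * (a_bar - a_hat)   # local update (6.26)
    a_bar = a_bar + lam * k * (a_hat.mean(axis=0) - a_bar)    # center update (6.27)
    a_hat = a_hat_new
```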

Boffi and Slotine (2020) observed that in the nondistributed (p=1) case, equations 6.26 and 6.27 do not reduce to standard stochastic gradient descent, and that application of EASGD in this setting has different generalization properties from those of standard SGD when used to train deep neural networks. In a similar spirit, by construction of suitable Lyapunov functions, we now show that adding a center-like variable to the adaptive laws considered in previous sections maintains their stability. This immediately gives rise to a new class of higher-order adaptive control algorithms. Interestingly, these algorithms do not seem to admit an equivalent representation in terms of a single second-, third-, or fourth-order system for a^, but must be written as a system of first-order equations.

Remark 29.

The algorithms considered in this subsection immediately extend to the case of cloud-based adaptation for networked robotic systems (Wensing & Slotine, 2018), where the center variable is allowed to have its own dynamics as in equation 6.27 rather than simply representing the instantaneous spatial average of the distributed parameters.

We first apply the elastic modification to the Slotine and Li adaptive controller, equation 2.8, for linearly parameterized unknown dynamics f˜=Ya˜. These results extend trivially to the nonfiltered composite algorithm of section 6.2. To this end, we define the adaptation law
$\dot{\hat a} = -PY^\top s + k(\bar a - \hat a),$
(6.28)
$\dot{\bar a} = k(\hat a - \bar a),$
(6.29)
whose basic stability properties are summarized in the following proposition.
Proposition 12.

Consider the adaptation law, equations 6.28 and 6.29. Then all trajectories $(x,\hat a,\bar a)$ remain bounded, $s\in\mathcal L_2\cap\mathcal L_\infty$, $\hat a-\bar a\in\mathcal L_2$, $s\to0$ and $x\to x_d$.

The proof is given in section A.5.

We now apply the elastic modification to algorithm 2.12 for nonlinearly parameterized unknown dynamics satisfying assumption 2. As in equations 6.28 and 6.29, we define
$\dot{\hat a} = -\tilde fP\alpha + k(\bar a - \hat a),$
(6.30)
$\dot{\bar a} = k(\hat a - \bar a).$
(6.31)
Proposition 13.

Consider the adaptation law, equations 6.30 and 6.31. Then all trajectories $(x,\hat a,\bar a)$ remain bounded, $\tilde f\in\mathcal L_2\cap\mathcal L_\infty$, $\hat a-\bar a\in\mathcal L_2$, $s\in\mathcal L_\infty\cap\mathcal L_2$, $s\to0$ and $x\to x_d$.

The proof is given in section A.6.

We now consider the higher-order algorithms presented in sections 6.2 and 6.3. In the higher-order setting, there are three clear possibilities for the elastic modification: coupling to a center variable for the a^ variable, coupling to a center variable for the v^ variable, or coupling to center variables in both a^ and v^. We prove stability for all three possibilities only in the nonlinearly parameterized setting described by assumption 2. The results extend naturally to the higher-order composite algorithm for linearly parameterized systems presented in section 6.2. We begin with the first possibility,
$\dot{\hat v} = -\gamma\tilde f\alpha,$
(6.32)
$\dot{\hat a} = \beta N(\hat v - \hat a) + k\beta N(\bar a - \hat a),$
(6.33)
$\dot{\bar a} = k\beta N(\hat a - \bar a).$
(6.34)
The basic stability properties of the algorithm, equations 6.32 to 6.34, are summarized in the following proposition.
Proposition 14.

Consider the algorithm, equations 6.32 to 6.34, under assumption 2. Set $\frac13\le k<1$, $N = 1+\mu\|\alpha(x,t)\|^2$, and $\mu > \frac{2D_1\gamma}{\beta(1-k)}$. Then all trajectories $(x,\hat a,\hat v,\bar a)$ remain bounded, $\tilde f\in\mathcal L_2\cap\mathcal L_\infty$, $s\in\mathcal L_2\cap\mathcal L_\infty$, $\hat a-\hat v\in\mathcal L_2$, $\hat a-\bar a\in\mathcal L_2$, $s\to0$ and $x\to x_d$.

The proof is given in section A.7. We now consider the second possibility of adding a center variable in the v^ variable,
$\dot{\hat v} = -\gamma\tilde f\alpha + \rho(\bar v - \hat v),$
(6.35)
$\dot{\bar v} = \rho(\hat v - \bar v),$
(6.36)
$\dot{\hat a} = \beta N(\hat v - \hat a).$
(6.37)
The basic stability properties of equations 6.35 to 6.37 are summarized in the following proposition.
Proposition 15.

Consider the algorithm, equations 6.35 to 6.37, under assumption 2. Set $\rho<2\beta$, $N = 1+\mu\|\alpha(x,t)\|^2$, and $\mu > \frac{\gamma D_1}{\beta}$. Then all trajectories $(x,\hat a,\hat v,\bar v)$ remain bounded, $\tilde f\in\mathcal L_2\cap\mathcal L_\infty$, $s\in\mathcal L_2\cap\mathcal L_\infty$, $\hat v-\bar v\in\mathcal L_2$, $\hat v-\hat a\in\mathcal L_2$, $s\to0$ and $x\to x_d$.

The proof is given in section A.8. Finally, we consider adding coupling to center variables in both a^ and v^,
$\dot{\hat v} = -\gamma\tilde f\alpha + \rho(\bar v - \hat v),$
(6.38)
$\dot{\bar v} = \rho(\hat v - \bar v),$
(6.39)
$\dot{\hat a} = \beta N(\hat v - \hat a) + k\beta N(\bar a - \hat a),$
(6.40)
$\dot{\bar a} = k\beta N(\hat a - \bar a).$
(6.41)
The basic stability properties of equations 6.38 to 6.41 are summarized in the following proposition.
Proposition 16.

Consider the algorithm, equations 6.38 to 6.41, under assumption 2. Set $\rho<\beta(1-k)$, $\frac13\le k<1$, $N = 1+\mu\|\alpha(x,t)\|^2$, and $\mu > \frac{2\gamma D_1}{\beta(1-k)}$. Then all trajectories $(x,\hat v,\bar v,\hat a,\bar a)$ remain bounded, $\tilde f\in\mathcal L_2\cap\mathcal L_\infty$, $s\in\mathcal L_2\cap\mathcal L_\infty$, $\hat v-\bar v\in\mathcal L_2$, $\hat a-\bar a\in\mathcal L_2$, $\hat v-\hat a\in\mathcal L_2$, $s\to0$, and $x\to x_d$.

The proof is given in section A.9.

We have thus shown that all Euclidean adaptive control algorithms presented in this article, as well as the classic algorithm of Slotine and Li, can be modified to include feedback coupling to a low-pass filtered version of the adaptation variables. It is well known that iterate averaging for stochastic optimization algorithms such as stochastic gradient descent can improve convergence rates via variance reduction (Polyak & Juditsky, 1992). The elastic modification is similar in spirit but employs feedback rather than series coupling. This suggests that adding the elastic term may improve robustness of adaptation algorithms, and we leave a theoretical investigation of this conjecture for future work.

6.5  Exponential Forgetting Least Squares and Bounded Gain Forgetting

We now demonstrate how to apply the techniques of exponential forgetting and bounded gain forgetting least squares (Slotine & Li, 1991) to the adaptation algorithms we have developed. These techniques are useful for estimation of time-varying parameters, as they rapidly discard previous information used for parameter estimation. Exponential forgetting least squares is described by a time-dependent learning rate matrix P(t), which, in the linearly parameterized case $\tilde f = Y\tilde a$, takes the form
$\dot P = \begin{cases}\lambda P - PY^\top YP & \text{if } \|P\|\le P_0\\ 0 & \text{else}\end{cases}$
(6.42)
where λ>0 is a constant forgetting factor, $P_0$ is a maximum bound on the norm, and $\|P\|$ is a matrix norm such as the operator norm. Equation 6.42 implies for the inverse matrix,
$\frac{d}{dt}P^{-1} = \begin{cases}-\lambda P^{-1} + Y^\top Y & \text{if } \|P\|\le P_0\\ 0 & \text{else}\end{cases}$
(6.43)
In the nonlinearly parameterized case described by assumption 2, we will replace $Y^\top$ in equations 6.42 and 6.43 by α(x,t). In the bounded gain forgetting technique, λ is a time-dependent function,
$\lambda(t) = \lambda_0\left(1 - \frac{\|P\|}{P_0}\right),$
(6.44)
where $\lambda_0>0$ sets the forgetting factor when the norm of P is small. It can be shown that this choice of λ(t) ensures that $\|P\|\le P_0$, and thus we may drop the case statement in equations 6.42 and 6.43 (Slotine & Li, 1991). The choice of λ(t) in bounded gain forgetting and the case statement used in equations 6.42 and 6.43 are both employed to prevent unboundedness of the learning rate matrix.
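A minimal sketch of the bounded gain forgetting update is given below, integrating equation 6.42 with the time-varying forgetting factor of equation 6.44 by explicit Euler steps; the regressor signal and constants are illustrative.

```python
import numpy as np

# Sketch of bounded-gain forgetting: P_dot = lambda(t) P - P Y^T Y P with
# lambda(t) = lambda0 * (1 - ||P|| / P0). Y(t) is an illustrative 1 x 3 regressor.

def Y(t):
    return np.array([[np.sin(t), np.cos(2 * t), 1.0]])    # shape (1, 3)

lambda0, P0, dt = 1.0, 10.0, 1e-3
P = 0.5 * np.eye(3)

for step in range(int(20.0 / dt)):
    t = step * dt
    lam_t = lambda0 * (1.0 - np.linalg.norm(P, 2) / P0)    # forgetting factor (6.44)
    Yt = Y(t)
    P = P + dt * (lam_t * P - P @ Yt.T @ Yt @ P)           # learning-rate update (6.42)
    # with this choice of lambda(t), ||P|| stays below P0 (Slotine & Li, 1991)
```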

We focus on algorithms without the elastic modification of section 6.4; extension to the elastic modification is simple. We also focus on the bounded gain forgetting technique: proofs for the exponential forgetting least squares technique are identical, with the addition of an appropriate case statement in the time derivative of the Lyapunov function. For simplicity, we include only the time-dependent gain P(t) and set the scalar gains κ=γ=1 where applicable.

We begin with the first-order non-filtered composite, equation 6.13, with P given by equation 6.42. In this case, the composite algorithm may be implemented via the PI form equations 6.14 to 6.17, where now P=P(t).

Proposition 17.

Consider the adaptation algorithm, equation 6.13, with P(t) given by equation 6.42, λ(t) given by equation 6.44, and κ=γ=1. Then all trajectories $(x,\hat a)$ remain bounded, $\tilde f\in\mathcal L_2\cap\mathcal L_\infty$, $s\in\mathcal L_2\cap\mathcal L_\infty$, $s\to0$, and $x\to x_d$.

The proof is given in section A.10.

We can state a similar result for the higher-order nonfiltered composite with time-dependent P(t) given by equation 6.42,
$\ddot{\hat a} + \left(\beta N - \frac{\dot N}{N} - \dot PP^{-1}\right)\dot{\hat a} = -\beta NP(t)Y^\top\left(s + \tilde f\right),$
(6.45)
which admits a representation as two first-order equations,
$\dot{\hat v} = -P(t)Y^\top\left(s + \tilde f\right),$
(6.46)
$\dot{\hat a} = \beta NP(t)(\hat v - \hat a).$
(6.47)
Equation 6.46 can be implemented via the PI form v^=v¯+ξ(x,t)+ρ(x,t) where ξ,ρ, and v¯˙ are given by equations 6.14, 6.15, and 6.17, respectively, with γ=κ=1.
Proposition 18.

Consider the adaptation algorithm, equation 6.45, with P(t) given by equation 6.42, λ(t) given by equation 6.44, $N(t) = 1+\mu\|Y\|^2$, and $\mu > \frac{3\eta+2}{2\eta\beta}$. Then all trajectories $(x,\hat v,\hat a)$ remain bounded, $\tilde f\in\mathcal L_2\cap\mathcal L_\infty$, $s\in\mathcal L_2\cap\mathcal L_\infty$, $s\to0$ and $x\to x_d$.

The proof is given in section A.11.

Remark 30.
Because P(t) is uniformly bounded in t, it is not necessary to include P(t) in equation 6.47; by a slight modification of the proof, it is easy to show that the modified higher-order law,
$\ddot{\hat a} + \left(\beta N - \frac{\dot N}{N}\right)\dot{\hat a} = -\beta NP(t)Y^\top\left(s + \tilde f\right),$
is also a stable adaptive law with a suitable choice of gains.

We now consider Tyukin's first-order algorithm for nonlinearly parameterized systems, equation 2.12, with P=P(t) given by equation 6.42. To do so, we require an additional assumption:

Assumption 11.
For the same function α(x,t) as in assumption 2, there exists a constant $D_2$ such that
$|\tilde f(x,\hat a,a,t)| \ge D_2|\alpha(x,t)^\top\tilde a|.$
(6.48)

Together with assumption 2, assumption 11 states that $\tilde f$ lies between two linear functions. Given that the update, equation 6.42, is derived based on recursive linear least squares considerations, it is unsurprising that such an assumption is required in the nonlinearly parameterized setting. We are now in a position to state the following proposition.

Proposition 19.

Consider the adaptation algorithm equation 2.12 with P(t) given by equation 6.42 and λ(t) given by equation 6.44 for $\tilde f$ satisfying assumptions 2 and 11. Further assume that $D_1 < 2D_2^2$ or that $D_2 > \frac12$. Then, all trajectories $(x,\hat a)$ remain bounded, $\tilde f\in\mathcal L_2$, $s\in\mathcal L_2\cap\mathcal L_\infty$, $s\to0$ and $x\to x_d$.

The proof is given in section A.12.

Last, we consider the momentum algorithm for nonlinearly parameterized systems,
$\ddot{\hat a} + \left(\beta N - \frac{\dot N}{N} - \dot PP^{-1}\right)\dot{\hat a} = -\beta N\tilde fP(t)\alpha(x,t),$
(6.49)
which admits a representation as two first-order equations,
$\dot{\hat v} = -\tilde fP(t)\alpha(x,t),$
(6.50)
$\dot{\hat a} = \beta NP(t)(\hat v - \hat a).$
(6.51)
Proposition 20.

Consider the adaptation algorithm, equation 6.49, with P(t) given by 6.42, λ(t) given by (6.44), N=1+μα2, and μ>4D2-2+2D1+12β4D2-1. Suppose f˜ satisfies assumptions 2 and 11. Further assume that D2>12. Then all trajectories (x,v^,a^) remain bounded, f˜L2, sL2L, s0 and xxd.

The proof is given in section A.13.

Remark 31.
As in remark 30, because P(t) is uniformly bounded in t, it is not necessary to include P(t) in equation 6.51. It is simple to show by modification of the proof that
$\ddot{\hat a} + \left(\beta N - \frac{\dot N}{N}\right)\dot{\hat a} = -\beta N\tilde fP(t)\alpha(x,t)$
is also a stable adaptive law with a suitable choice of gains.

7  Natural Momentum Algorithms

The Bregman Lagrangian allows for the introduction of non-Euclidean metrics. In section 5.2, we took the potential function ψ to be the Euclidean norm, $\psi(x)=\frac12\|x\|^2$. We now show that taking ψ to be an arbitrary strictly convex function leads to a more general class of algorithms that can be seen as the higher-order variants of those discussed in sections 2.3 and 3. With the same definitions of $\bar\alpha$, $\bar\gamma$, and $\bar\beta$ as in section 5.2, but now taking the general Bregman divergence $d_\psi(\cdot\,\|\,\cdot)$, the Bregman Lagrangian, equation 5.5, takes the form
$\mathcal L = e^{\int_0^t\beta N(t')dt'}\left(\beta N\,d_\psi\!\left(\hat a + \frac{\dot{\hat a}}{\beta N}\,\Big\|\,\hat a\right) - \gamma\frac{d}{dt}\left(\frac12 s^2\right)\right).$
(7.1)
The Euler-Lagrange equations for equation 7.1 lead to the natural adaptation law with momentum
$\ddot{\hat a} + \left(\beta N - \frac{\dot N}{N}\right)\dot{\hat a} + \gamma\beta N\left[\nabla^2\psi\!\left(\hat a + \frac{\dot{\hat a}}{\beta N}\right)\right]^{-1}Y^\top s = 0.$
(7.2)
Above, the Euclidean adaptive law has been modified so that $Y^\top s$ is now premultiplied by the inverse Hessian of ψ evaluated at $\hat a + \frac{\dot{\hat a}}{\beta N}$. As discussed in section 5.2, this quantity is precisely $\hat v$. The resulting adaptation law can thus be written in the equivalent form:
$\dot{\hat v} = -\gamma\left[\nabla^2\psi(\hat v)\right]^{-1}Y^\top s,$
(7.3)
$\dot{\hat a} = \beta N(\hat v - \hat a).$
(7.4)
Equations 7.3 and 7.4 demonstrate that using the Bregman divergence in the Bregman Lagrangian leads to momentum variants of the natural algorithms of section 2.3. Taking the $\beta\to\infty$ limit immediately recovers the first-order laws discussed in section 2.3. The stability of the above laws for strongly convex ψ is stated in the following proposition.
Proposition 21.

Consider the higher-order “natural” adaptation law, equation 7.2. Assume that ψ is l-strongly convex so that $\nabla^2\psi(\cdot)\succeq lI$ globally. Take $N = 1+\mu\|Y\|^2$ and $\mu > \frac{\gamma(1+l^{-1})^2}{4\beta\eta}$. Then all trajectories $(x,\hat v,\hat a)$ remain bounded, $s\in\mathcal L_2\cap\mathcal L_\infty$, $s\to0$, and $x\to x_d$.

The proof is given in section A.14. A second, related variant is given by
$\dot{\hat v} = -\gamma\left[\nabla^2\psi(\hat v)\right]^{-1}Y^\top s,$
(7.5)
$\dot{\hat a} = \beta N\left[\nabla^2\psi(\hat a)\right]^{-1}\left(\nabla\psi(\hat v) - \nabla\psi(\hat a)\right).$
(7.6)
The algorithm defined by equations 7.5 and 7.6 is algorithm 5.6 run entirely in the mirrored domain. Indeed, it may be rewritten as
$\frac{d}{dt}\nabla\psi(\hat v) = -\gamma Y^\top s,\qquad \frac{d}{dt}\nabla\psi(\hat a) = \beta N\left(\nabla\psi(\hat v) - \nabla\psi(\hat a)\right),$
which shows that $\nabla\psi(\hat a)$ obtains the same values over time as $\hat a$ computed via algorithm 5.6.
which shows that ψ(a^) obtains the same values over time as a^ computed via algorithm 5.6. The stability of this adaptive law (shown in proposition 22) implies that the parameters obtained by the momentum algorithm, equation 5.6, may be transformed via the inverse of the gradient of an l-strongly convex and L-smooth function, and the resulting transformed parameters will still ensure stability and tracking for the closed-loop system.
A modification of equation 7.6 that is driven by $\hat v$ rather than $\nabla\psi(\hat v)$ is given by
$\dot{\hat v} = -\gamma\left[\nabla^2\psi(\hat v)\right]^{-1}Y^\top s,$
(7.7)
$\dot{\hat a} = \beta N\left[\nabla^2\psi(\hat a)\right]^{-1}(\hat v - \hat a).$
(7.8)
The properties of these two possible adaptation laws are given in the following proposition.
Proposition 22.

Consider the adaptation algorithm, equations 7.5 and 7.6, or the adaptation algorithm, equations 7.7 and 7.8. Assume that ψ is l-strongly convex and L-smooth, so that $lI\preceq\nabla^2\psi(\cdot)\preceq LI$. Take $N = 1+\mu\|Y\|^2$, and choose $\mu > \frac{\gamma(l+\gamma L)^2}{4\beta\eta l^3}$ in the former case and $\mu > \frac{\gamma(l+\gamma L)^2}{4\beta\eta l^2}$ in the latter. Then all trajectories $(x,\hat v,\hat a)$ remain bounded, $s\in\mathcal L_2\cap\mathcal L_\infty$, $s\to0$ and $x\to x_d$.

The proof is presented in section A.15.

Remark 32.

For efficient implementation of the proposed natural momentum algorithms, as well as for their first-order equivalents, ψ should be chosen so that $\left[\nabla^2\psi(\cdot)\right]^{-1}$ is efficiently computable and ideally sparse. Alternatively, if the inverse function of the gradient, $\left(\nabla\psi\right)^{-1}(\cdot)$, is efficiently computable, $\nabla\psi(\hat a)$ or $\nabla\psi(\hat v)$ may be updated directly and subsequently inverted to arrive at the parameter values. Discretization of these algorithms is a subtle issue, and discretization of the $\dot{\hat a}$ and $\dot{\hat v}$ dynamics directly results in a natural gradient-like update (Amari, 1998), while discretization of the $\frac{d}{dt}\nabla\psi(\hat a)$ and $\frac{d}{dt}\nabla\psi(\hat v)$ dynamics leads to a mirror descent-like update (Beck & Teboulle, 2003; Nemirovski & Yudin, 1983); these discrete-time algorithms have the same continuous-time limit (Krichene, Bayen, & Bartlett, 2015).
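Following remark 32, the sketch below (ours, not the article's code) integrates equations 7.3 and 7.4 entirely through the mirrored variable $\nabla\psi(\hat v)$, recovering $\hat v$ with the inverse gradient map. We use $\psi(\cdot)=\frac12\|\cdot\|_p^2$, whose gradient and inverse gradient are available in closed form (Gentile, 2003; see equation 8.7); the plant and gains are illustrative.

```python
import numpy as np

# Mirrored-domain sketch of (7.3)-(7.4): d/dt grad_psi(v_hat) = -gamma * Y^T s and
# a_hat_dot = beta * N * (v_hat - a_hat), with psi = (1/2)||.||_p^2 and 1/p + 1/q = 1.

p = 1.1
q = p / (p - 1.0)

def grad_psi(w, r):                          # gradient of (1/2)||.||_r^2, elementwise formula
    nrm = np.linalg.norm(w, r)
    if nrm == 0.0:
        return np.zeros_like(w)
    return nrm ** (2 - r) * np.abs(w) ** (r - 1) * np.sign(w)

def Y(x):
    return np.array([x[0], x[1], np.sin(x[0])])

a_true = np.array([1.0, -2.0, 0.5])
a_hat, z_v = np.zeros(3), np.zeros(3)        # z_v stores grad_psi(v_hat)
gamma, beta, mu, eta, lam, dt = 1.0, 1.0, 50.0, 1.0, 1.0, 1e-3

x = np.array([1.0, 0.0])
for step in range(int(20.0 / dt)):
    t = step * dt
    xd, xd_dot, xd_ddot = np.sin(t), np.cos(t), -np.sin(t)
    s = (x[1] - xd_dot) + lam * (x[0] - xd)
    u = xd_ddot - lam * (x[1] - xd_dot) - eta * s + Y(x) @ a_hat
    N = 1.0 + mu * np.dot(Y(x), Y(x))
    z_v = z_v - dt * gamma * Y(x) * s        # update in the mirrored domain
    v_hat = grad_psi(z_v, q)                 # invert grad_psi via the dual exponent q
    a_hat = a_hat + dt * beta * N * (v_hat - a_hat)
    x = x + dt * np.array([x[1], u - Y(x) @ a_true])
```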

The above natural adaptation laws with momentum may be generalized to composite algorithms, as well as to algorithms for nonlinearly parameterized adaptive control, by replacing Euclidean norms by Bregman divergences where appropriate in the proofs of the corresponding Euclidean algorithms (see, e.g., the proofs of propositions 21 and 22). Rather than derive this for each algorithm, we now show how the general results on velocity gradient algorithms with momentum (see propositions 6 and 7) can be extended to the non-Euclidean setting. We start with the case of a local functional, which requires the modification of assumption 9 to an equivalent non-Euclidean version.

Assumption 12.
There exists a time-dependent signal $\mathcal N(t)$ and nonnegative scalar values $\beta\ge0$, $\mu\ge0$ such that the time derivative of the goal functional evaluated at the true parameters, $\dot Q(x,a,t)$, satisfies the following inequality:
$\dot Q(x,a,t) - \frac{\beta\mu}{\gamma}\mathcal N(t)\|\hat a-\hat v\|^2 + (\hat a-\hat v)^\top\left(I + \left[\nabla^2\psi(\hat v)\right]^{-1}\right)\nabla_{\hat a}\dot Q(x,\hat a,t) \le -\rho(Q).$
(7.9)
In equation 7.9, ρ(·) is positive definite, continuous in Q, and satisfies ρ(0)=0.

With assumption 12 in hand, we can state the following non-Euclidean equivalent of proposition 6. We focus on the variant equation 7.2, as the other possibilities are similar.

Proposition 23.
Consider the algorithm
$\ddot{\hat a} + \left(\beta N - \frac{\dot N}{N}\right)\dot{\hat a} + \gamma\beta N\left[\nabla^2\psi\!\left(\hat a + \frac{\dot{\hat a}}{\beta N}\right)\right]^{-1}\nabla_{\hat a}\dot Q(x,\hat a,t) = 0$
or its equivalent first-order form,
$\dot{\hat v} = -\gamma\left[\nabla^2\psi(\hat v)\right]^{-1}\nabla_{\hat a}\dot Q(x,\hat a,t),\qquad \dot{\hat a} = \beta N(\hat v - \hat a),$
and assume Q satisfies assumptions 4, 6, and 12. Then all solutions $(x(t),\hat v(t),\hat a(t))$ remain bounded, $\hat a-\hat v\in\mathcal L_2$, and $\lim_{t\to\infty}Q(x(t),t)=0$.
Proof.
Consider the Lyapunov-like function,
$V = Q(x,t) + \frac1\gamma d_\psi(a\,\|\,\hat v) + \frac{1}{2\gamma}(\hat a-\hat v)^\top(\hat a-\hat v).$
(7.10)
Equation 7.10 implies that, with $N(t) = 1+\mu\mathcal N(t)$,
$\dot V = \dot Q(x,\hat a,t) - \tilde a^\top\nabla_{\hat a}\dot Q(x,\hat a,t) - \frac\beta\gamma\|\hat a-\hat v\|^2 - \frac{\beta\mu}{\gamma}\mathcal N(t)\|\hat a-\hat v\|^2 + (\hat a-\hat v)^\top\left(I + \left[\nabla^2\psi(\hat v)\right]^{-1}\right)\nabla_{\hat a}\dot Q(x,\hat a,t) \le \dot Q(x,a,t) - \frac\beta\gamma\|\hat a-\hat v\|^2 - \frac{\beta\mu}{\gamma}\mathcal N(t)\|\hat a-\hat v\|^2 + (\hat a-\hat v)^\top\left(I + \left[\nabla^2\psi(\hat v)\right]^{-1}\right)\nabla_{\hat a}\dot Q(x,\hat a,t) \le -\rho(Q) - \frac\beta\gamma\|\hat a-\hat v\|^2.$
(7.11)
The first line to the second follows by convexity of Q˙(x,a^,t) in its second argument, while the second line to the third follows by assumption 12. The remainder of the proof is identical to Proposition 6.

For the integral variant, we require a non-Euclidean version of assumption 10.

Assumption 13.
$R(x,\hat a,t)\ge0$ for all x, $\hat a$, and t, and is uniformly continuous in t for bounded x and $\hat a$. $\nabla_{\hat a}R(x,\hat a,t)$ is locally bounded in x and $\hat a$ uniformly in t. Furthermore, there exists a time-dependent signal $\mathcal N(t)$ and nonnegative scalar values $\beta\ge0$, $\mu\ge0$ such that
$R(x,a,t) - R(x,\hat a,t) - \frac{\beta\mu}{\gamma}\mathcal N(t)\|\hat a-\hat v\|^2 + (\hat a-\hat v)^\top\left(I + \left[\nabla^2\psi(\hat v)\right]^{-1}\right)\nabla_{\hat a}R(x,\hat a,t) \le -kR(x,\hat a,t)$
for some constant k>0.

With assumption 13, we can state the following proposition.

Proposition 24.
Consider the algorithm
$\ddot{\hat a} + \left(\beta N - \frac{\dot N}{N}\right)\dot{\hat a} + \gamma\beta N\left[\nabla^2\psi\!\left(\hat a + \frac{\dot{\hat a}}{\beta N}\right)\right]^{-1}\nabla_{\hat a}R(x,\hat a,t) = 0,$
or its equivalent first-order form,
$\dot{\hat v} = -\gamma\left[\nabla^2\psi(\hat v)\right]^{-1}\nabla_{\hat a}R(x,\hat a,t),\qquad \dot{\hat a} = \beta N(\hat v - \hat a),$
along with assumptions 6 and 13. Let $T_x$ denote the maximal interval of existence of x(t). Then $\hat v$ and $\hat a$ remain bounded for $t\in[0,T_x)$, $\hat a-\hat v\in\mathcal L_2$ over this interval, and $\int_0^{T_x}R(x(t'),\hat a(t'),t')dt' < \infty$. Furthermore, for any bounded solution x, these conclusions hold for all t and $R(x(t),\hat a(t),t)\to0$.
Proof.
Consider the Lyapunov-like function,
$V = \frac1\gamma d_\psi(a\,\|\,\hat v) + \frac{1}{2\gamma}(\hat a-\hat v)^\top(\hat a-\hat v).$
(7.12)
Equation 7.12 implies that, with $N(t) = 1+\mu\mathcal N(t)$,
$\dot V = -\tilde a^\top\nabla_{\hat a}R(x,\hat a,t) - \frac\beta\gamma\|\hat a-\hat v\|^2 - \frac{\beta\mu}{\gamma}\mathcal N(t)\|\hat a-\hat v\|^2 + (\hat a-\hat v)^\top\left(I + \left[\nabla^2\psi(\hat v)\right]^{-1}\right)\nabla_{\hat a}R(x,\hat a,t) \le R(x,a,t) - R(x,\hat a,t) - \frac\beta\gamma\|\hat a-\hat v\|^2 - \frac{\beta\mu}{\gamma}\mathcal N(t)\|\hat a-\hat v\|^2 + (\hat a-\hat v)^\top\left(I + \left[\nabla^2\psi(\hat v)\right]^{-1}\right)\nabla_{\hat a}R(x,\hat a,t) \le -kR(x,\hat a,t) - \frac\beta\gamma\|\hat a-\hat v\|^2.$
(7.13)
The first line to the second follows by convexity of $R(x,\hat a,t)$ in its second argument, while the second to the third follows by assumption 13. The remainder of the proof is identical to that of proposition 7.

The general methodology captured by the proofs of propositions 23 and 24, in combination with the results of section 6.3, may be exploited to derive non-Euclidean variants of our nonfiltered composite algorithm and our momentum algorithm for nonlinearly parameterized adaptive control. Note that the strong convexity and smoothness requirements of propositions 21 and 22, in combination with a suitable choice of N(t), are one way to satisfy the requirements of assumptions 12 and 13.

Remark 33.

Our implicit regularization results in section 3 also extend to the higher-order setting captured by algorithm 7.2. The assumption that $\hat a\to\hat a_\infty$ implies $\dot{\hat a}\to0$. As noted in section 5.2, $\hat v = \hat a + \frac{\dot{\hat a}}{\beta N}$, and we thus conclude that under this assumption $\hat v\to\hat a_\infty$. Because $\dot{\hat v}$ in equation 7.3 is identical to algorithm 3.2, the result follows. A formal statement of this fact is provided in section B.1.

8  Simulations

In this section, we perform several numerical experiments demonstrating the validity of our theory and consider a number of applications of our non-Euclidean adaptive laws.

8.1  Convergence and Implicit Regularization of a Momentum Algorithm for Nonlinearly Parameterized Systems

We first empirically verify the global convergence and implicit regularization of our momentum algorithm for nonlinearly parameterized systems, equation 6.23. In particular, we consider a second-order system,
$\dot x_1 = x_2,\qquad \dot x_2 = u - f(x,a,t),$
with an unknown system dynamics of the form
$f(x,a,t) = \sigma\!\left(\tanh(Vx)^\top a\right).$
(8.1)
Equation 8.1 represents a three-layer neural network with input layer x, hidden-layer weights V, hidden-layer nonlinearity tanh(·), output-layer weights a, and output nonlinearity $\sigma(x)=e^{0.1x}$. The system model, equation 8.1, can clearly be seen to satisfy assumption 2 with $\alpha(x)=\tanh(Vx)$. The PI form of algorithm 6.23 is given by
$\nabla\psi(\hat v) = \bar v + \xi(x,t) + \rho(x,t),$
(8.2)
$\xi(x,t) = -\gamma s(x,t)\tanh(Vx),$
(8.3)
$\rho(x,t) = \gamma\left(\tanh(Vx)x_2 - \log\!\left(\cosh(Vx)\right)\oslash V_2 + \left(\lambda\tilde x - x_{2,d}(t)\right)\tanh(Vx)\right),$
(8.4)
$\dot{\bar v} = \gamma\left(\dot x_{2,d}(t) - \lambda\left(x_2 - x_{2,d}(t)\right) - \eta s\right)\tanh(Vx) + \gamma x_2\tanh(Vx)\odot V_1\oslash V_2,$
(8.5)
$\dot{\hat a} = \beta\left(1 + \mu\|\tanh(Vx)\|^2\right)\left(\hat v - \hat a\right),$
(8.6)
where ⊙ and ⊘ denote elementwise multiplication and division, respectively, where $V_i$ is the ith column of V and $\hat v$ is obtained from $\nabla\psi(\hat v)$ by inverting $\nabla\psi$. For the squared p norm $\psi(\cdot)=\frac12\|\cdot\|_p^2$, the inverse function can be analytically computed as
$\left(\nabla\psi\right)^{-1}(y) = \|y\|_q^{2-q}\,|y|^{q-1}\odot\operatorname{sign}(y),$
(8.7)
where $\frac1q + \frac1p = 1$, $|\cdot|$ denotes elementwise absolute value and sign(·) denotes elementwise sign (Gentile, 2003). We consider the $l_1$, $l_2$, $l_4$, $l_6$, and $l_{10}$ norms for ψ. To approximate the $l_1$ norm, equation 8.7 is used with p=1.1. All other p norms can be used directly.
In all simulations we take λ=0.5 in the definition of s (see equation 2.4) and η=0.5 in the control input (see equation 2.5). For the adaptation hyperparameters, we choose γ=1.5 for the $l_2$, $l_4$, and $l_6$ norms. We take γ=50 for the $l_1$ norm and γ=0.5 for the $l_{10}$ norm. In all cases, β=1 and $\mu = \frac{3\gamma}{2\eta\beta}$. We set $\dim(a)=\dim(\hat a)=500$ and randomly initialize $\hat a$ and $\hat v$ around zero from a normal distribution with standard deviation $10^{-3}$. The true parameter vector a is drawn from a normal distribution with mean zero and standard deviation 7.5. The matrix V is set to have normally distributed elements with standard deviation $1/\dim(\hat a)$. The state vector is initialized such that $x(0)=x_d(0)$. The desired trajectory is taken to be
$x_d(t) = \sin\!\left(\frac{2\pi}{12}t\right) + \cos\!\left(\frac{3\pi}{12}t\right).$
The tracking error for each choice of ψ along with a baseline comparison to fixed a^(t)=a^(0) is shown in Figure 1A. Figures 1B–1F show trajectories for 100 out of the 500 parameters. The timescale on each axis is set to show the trajectories approximately until the parameters converge for the given algorithm. Each case results in remarkably different dynamics and resulting converged parameter vectors a^ψ. The tracking performance is good for each algorithm.
Figure 1:

Tracking error and parameter trajectories. (A) Trajectory tracking error. All algorithms result in convergence $x\to x_d$, though transient performance differs between the algorithms. (B–F) Parameter trajectories for 100/500 of the total parameters. Each algorithm results in remarkably different parameter trajectories and final values $\hat a_\psi$.


Figure 2:

Parameter histograms. (A) True parameters a. (B) Parameter vector found by the algorithm with $\psi(\cdot)=\frac12\|\cdot\|_{1.1}^2$. The resulting solution is extremely sparse and has a few parameters with large magnitude, indicative of implicit $l_1$ regularization. (C) Parameter vector found by the standard Euclidean algorithm with $\psi(\cdot)=\frac12\|\cdot\|_2^2$. The resulting parameter vector looks approximately gaussian distributed, indicating $l_2$ regularization. (D–F) Parameter vectors found by $\psi(\cdot)=\frac12\|\cdot\|_p^2$ with p=4, 6, and 10, respectively. The transition clearly indicates a trend toward $l_\infty$-norm regularization, with two bimodal peaks forming around ±1. The $l_\infty$ norm of the parameter vector decreases with increasing p.


Further insight can be gained into the structure of the parameter vector $\hat a_\psi$ found by each adaptation algorithm by consideration of the histograms (rug plots shown on the x-axis) for $\hat a$ at the end of the simulation in Figures 2A to 2F. Figure 2A shows the true parameter vector. The choice of $\psi(\cdot)=\frac12\|\cdot\|_{1.1}^2$ in Figure 2B leads to a sparse solution with most of the weight placed on a few parameters. This is consistent with $l_1$-regularized solutions found by the LASSO algorithm (Tibshirani, 1996). The inset displays a closer view around zero. The choice of $\psi(\cdot)=\frac12\|\cdot\|_2^2$ in Figure 2C (Euclidean adaptation law) leads to a parameter vector that is roughly gaussian distributed. This distribution highlights the implicit $l_2$ regularization of standard adaptation laws. The progression from $\psi(\cdot)=\frac12\|\cdot\|_4^2$ to $\psi(\cdot)=\frac12\|\cdot\|_{10}^2$ displays a trend toward approximate $l_\infty$-norm regularization: the distribution of parameters is pushed to be bimodal and peaked around ±1, and the $l_\infty$ norm of $\hat a$ decreases as p is increased.
Figure 3:

Function approximation error and control inputs. (A) The function approximation error f˜2(x(t),a^(t)). All algorithms drive the error to zero. (B–F) Comparison of the control input uψ(t) to the “ideal” control u(t)=x¨d(t)+f(xd(t),a). All algorithms converge to the ideal control, though at a different rate. The control magnitude is kept to a reasonable level in every case.


Figure 3A shows the function approximation error $\tilde f^2(x(t),\hat a(t),a)$ for each algorithm along with a reference value for fixed $\hat a(t)=\hat a(0)$. Each algorithm, as expected by our theory and seen by the low tracking error in Figure 1A, pushes $\tilde f^2$ to zero despite the different forms of regularization imposed on the parameter vectors. Figures 3B to 3F show the control input as a function of time along with the unique “ideal” control law $u(t)=\ddot x_d + f(x_d(t),a)$ valid when $x(0)=x_d(0)$. All control inputs can be seen to converge to the ideal law, though the rate of convergence depends on the choice of algorithm. The control input is of reasonable magnitude throughout adaptation for each algorithm.

8.2  Learning to Control with Primitives

We now demonstrate that the mirror descent-like laws of section 3 can be used to learn convex combinations of control primitives. Our approach is analogous to the use of multiplicative weight updates in machine learning and respects the natural l1 geometry over the probability simplex.

As a model problem for this setting, we consider the second-order system
$\dot x_1 = x_2,\qquad \dot x_2 = u - \tanh(Vx)^\top a,$
with $a\in\mathbb R^p$ a fixed vector of unknown parameters and $V\in\mathbb R^{p\times2}$ a random matrix with $V_{ij}\sim\mathcal N\!\left(0,\frac{1}{p^2}\right)$. To define our control primitives, we consider a distribution over tasks specified by random desired trajectories
xd(i)(t)=Msin(Ait+Bicos(Cit))+Di,
with $D_i = 2i(-1)^i\times M$, $A_i\sim\mathrm{Unif}(0,5\pi)$, $B_i\sim\mathrm{Unif}(0,3)$, and $C_i\sim\mathrm{Unif}(0,5\pi)$. The shift $D_i$ ensures that the desired trajectories occupy nonoverlapping regions of state space. We then learn primitives $\{u_i\}_{i=1}^N$ to track $\{x_d^{(i)}(t)\}_{i=1}^N$, where each $u_i$ is given by equation 2.5 with parameter estimates $\hat a^{(i)}$. The parameter estimates are found via the Slotine and Li adaptation law,
a^˙(i)=-γtanh(Vx)Ts,
which is allowed to run until the parameter estimates converge. We set p=15, N=300, M=0.1, γ=5, and η=λ=0.5. Each vector of parameter estimates $\hat a^{(i)}$ is initialized randomly, $\hat a^{(i)}(0)\sim\mathcal N(0,\sigma_{\hat a}^2)$ with $\sigma_{\hat a}=10^{-3}$. The state is initialized randomly for each task, $x_0\sim\mathcal N(0,\sigma_x^2)$ with $\sigma_x=5$. The true parameters a are drawn randomly, $a\sim\mathcal N(0,\sigma_a^2)$ with $\sigma_a=2$.
With control primitives ui capable of tracking trajectories xd(i) in hand, we consider tracking a desired trajectory xd(t) given piecewise by the previously drawn random trajectories. Concretely, we fix a time horizon T and a number of tasks k, and set
$x_d(t) = x_d^{(i_l)}(t)\quad\text{if }t_{l-1}\le t<t_l,$
with $l=1,\ldots,k$, $i_l$ drawn uniformly from $\{1,\ldots,N\}$, $t_0=0$, and $t_l=\frac{lT}{k}$. To leverage the learned control primitives, we use the input
$u = \sum_{i=1}^N\hat\beta_i u_i = u\hat\beta.$
Above, $u\in\mathbb R^{1\times N}$ is a row vector with components $u_i$. We require that $\hat\beta_i>0$ for all i and that $\sum_{i=1}^N\hat\beta_i=1$. In our experiments, we fix T=1000 and set k=5.
It is well known in the online convex optimization community that mirror descent with respect to the entropy $\psi(\hat\beta)=\sum_i\hat\beta_i\log\hat\beta_i$ can improve the dimension dependence of convergence rates in comparison to projected gradient descent when optimizing over the simplex (Hazan, 2016). Here we demonstrate that the same phenomenon appears in adaptive control. We consider two adaptation laws,
$\dot{\hat\beta} = -\gamma u^\top s,\qquad \frac{d}{dt}\nabla\psi(\hat\beta) = -\gamma u^\top s,$
with projection of $\hat\beta$ onto the simplex. In both cases, we initialize $\hat\beta_i=\frac1N$ and set γ=0.25.
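A discrete-time sketch of the two laws is given below: the Euclidean law followed by Euclidean projection onto the simplex, and the entropic mirror descent law, whose Bregman projection onto the simplex amounts to a renormalization and yields a multiplicative update. The gradient signal stands in for $u^\top s$ and is illustrative.

```python
import numpy as np

# Sketch of the two simplex adaptation laws of section 8.2 in discrete time.

def project_simplex(v):
    """Euclidean projection onto the probability simplex (sort-based algorithm)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / (np.arange(len(v)) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def euclidean_step(beta, grad_loss, lr):
    return project_simplex(beta - lr * grad_loss)

def entropy_mirror_step(beta, grad_loss, lr):
    # d/dt grad psi(beta) = -lr * grad  ==>  beta_i <- beta_i * exp(-lr * g_i), renormalized
    w = beta * np.exp(-lr * grad_loss)
    return w / w.sum()

N = 300
beta = np.ones(N) / N                          # initialize beta_i = 1/N
g = np.random.default_rng(0).normal(size=N)    # placeholder for u^T s at one instant
print(euclidean_step(beta, g, 0.25)[:3], entropy_mirror_step(beta, g, 0.25)[:3])
```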
Our results are shown in Figure 4. In Figure 4A, we show convergence of s for both adaptive laws. s jumps every 200 units of time as the task changes discretely. While both converge, adaptation with respect to the entropy converges significantly faster and minimizes s to lower values. This effect is more prominently displayed in Figure 4B, which shows the convergence of s on a logarithmic scale. Figures 4C and 4D show the parameter trajectories for the mirror descent and Euclidean laws, respectively. The mirror descent law displays trajectories in which fewer parameters stray from zero. Those that do stray move an order of magnitude farther from zero than under the Euclidean law. The discrete changes of the desired trajectory are more visible in the parameter trajectories for the mirror descent law.
Figure 4:

Learning to control with primitives. (A) s on a linear scale for the Euclidean and mirror descent–like adaptation laws. Mirror descent converges faster for all tasks. (B) s on a logarithmic scale for the Euclidean and mirror descent–like adaptation laws. Mirror descent converges faster and minimizes s further for all tasks. (C, D) Parameter trajectories for the mirror descent and Euclidean adaptation laws. Mirror descent leads to smoother trajectories, with fewer parameters straying from 0.


8.3  Dynamics Prediction for Hamiltonian Systems

We now experimentally demonstrate the predictions of the theoretical calculations performed in section 4.2. Similar to Chen et al. (2020), consider the Hamiltonian for three point masses interacting in d=2 dimensions via Newtonian gravitation (in units such that the gravitational constant G=1),
$H = \frac{1}{2m_1}\|p_1\|^2 + \frac{1}{2m_2}\|p_2\|^2 + \frac{1}{2m_3}\|p_3\|^2 - \frac{m_1m_2}{\|q_1-q_2\|} - \frac{m_1m_3}{\|q_1-q_3\|} - \frac{m_2m_3}{\|q_2-q_3\|},$
(8.8)
with mi the mass of body i, pi the momentum of body i, and qi the position of body i. Denote by q the vector q1T,q2T,q3TT with similar notation for p.
It is well known that physical systems can often be described by a few common mechanisms (see, e.g., Feynman, Leighton, & Sands, 1977, sec. 12.7). As such, we consider estimating the Hamiltonian, equation 8.8, directly with a physically motivated overparameterized basis,
H^(a^)=Y(q,p)a^,
to form the dynamics predictor equations 4.3 and 4.4. We define Y(q,p) to be a row vector of basis functions consisting of quadratics and quartics in $p_i$ and $q_i$, as well as $1/r_{ij}$, $1/r_{ij}^2$, and $1/r_{ij}^3$ potentials with $r_{ij}=\|q_i-q_j\|$ for $i\ne j$, comprising 21 total basis functions. These choices represent standard expressions for kinetic energy, spring potentials, central force potentials, and higher-order terms; any basis functions can be chosen motivated by knowledge of the physical system at hand.
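One possible realization of such a basis is sketched below; the exact list of terms is our own illustrative reading of the description above (it happens to contain 21 functions) and is not the article's code.

```python
import numpy as np

# Sketch of an overparameterized basis Y(q, p) for the three-body Hamiltonian of
# section 8.3: quadratics and quartics in momenta and positions plus 1/r, 1/r^2,
# and 1/r^3 pairwise potentials.

def hamiltonian_basis(q, p):
    """q, p: arrays of shape (3, 2) holding positions and momenta of the three bodies."""
    feats = []
    for i in range(3):
        feats += [p[i] @ p[i], (p[i] @ p[i]) ** 2]          # kinetic-like terms
        feats += [q[i] @ q[i], (q[i] @ q[i]) ** 2]          # spring-like terms
    for i in range(3):
        for j in range(i + 1, 3):
            r = np.linalg.norm(q[i] - q[j])
            feats += [1.0 / r, 1.0 / r ** 2, 1.0 / r ** 3]  # central-force potentials
    return np.array(feats)

# The Hamiltonian estimate is then H_hat(a_hat) = hamiltonian_basis(q, p) @ a_hat.
```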

We set $k=5$, $\gamma=3.5$, and choose $\psi(\cdot) = \frac{1}{2}\|\cdot\|_{1.05}^2$ to identify basis functions relevant to the observed trajectory. We fix $m_i = 1$ for all $i$ and initialize $q$ and $p$ to lock the system in an oscillatory mode. Past $t=10$, we set $k=\gamma=0$ and run the predictor open-loop, as well as perform shrinkage and set all coefficients with magnitude below $10^{-3}$ formally equal to zero, leaving 13 remaining terms.
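A minimal sketch of the corresponding mirror map and the post-hoc shrinkage step is given below, assuming $\psi(a) = \frac{1}{2}\|a\|_p^2$ with $p = 1.05$ and an explicit Euler discretization of the dual dynamics; here `grad` stands for the gradient signal prescribed by the adaptation law, equation 4.4, and the helper names are illustrative rather than from the original code.

```python
import numpy as np

def grad_psi(a, p):
    """Gradient of psi(a) = 0.5 * ||a||_p^2 for p > 1."""
    n = np.linalg.norm(a, p)
    if n == 0.0:
        return np.zeros_like(a)
    return np.sign(a) * np.abs(a) ** (p - 1) * n ** (2 - p)

def grad_psi_inv(z, p):
    """Inverse mirror map: gradient of the conjugate 0.5 * ||z||_q^2 with q = p / (p - 1)."""
    return grad_psi(z, p / (p - 1))

def mirror_step(a_hat, grad, gamma, dt, p=1.05):
    """Euler step of d/dt grad(psi)(a_hat) = -gamma * grad, mapped back to the primal."""
    z = grad_psi(a_hat, p) - dt * gamma * grad
    return grad_psi_inv(z, p)

def shrink(a_hat, tol=1e-3):
    """Post-hoc shrinkage: coefficients with magnitude below tol are set to zero."""
    return np.where(np.abs(a_hat) < tol, 0.0, a_hat)
```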

Results are shown in Figure 5. Figure 5A displays convergence of x˜ to zero with adaptation (solid) and demonstrates that adaptation is necessary for convergence (dashed). When switching to the open-loop predictor past t=10, the system without adaptation sustains large errors, while the learned predictor maintains good performance. The inset displays a slow drift of the predictor trajectory x^(t) from the true trajectory x(t) when run open-loop. Figure 5B displays the state trajectory x (dotted), convergence of x^ with adaptation (solid) to x, and the incorrect behavior of x^ without adaptation (dashed). The open-loop unlearned predictor tends to a fixed point, while the open-loop learned predictor maintains the correct oscillatory behavior. Figures 5C and 5D show parameter trajectories and asymptotically converged parameters, respectively. Together, the two panels demonstrate that the implicit bias of the algorithm ensures convergence to a sparse estimate of the system Hamiltonian.

8.4  Sparse Identification of Chemical Reaction Networks

We now demonstrate an example of regularized adaptive dynamics prediction for an unknown chemical reaction network. Consider a set of chemical reactions with N distinct chemical species. Under the continuum hypothesis and the well-mixed assumption, mass-action kinetics dictates that the system dynamics can be described exactly in a monomial basis (Liu, Slotine, & Barabási, 2013),
$$\nu_j(x) = k_j\prod_{i=1}^{N} x_i^{a_{ji}}, \qquad \dot{x} = \Gamma\,\nu(x),$$
where xi is the concentration of chemical species i, Γ is the stoichiometric matrix, and the aji are stoichiometric coefficients. Under the assumption that the full state of the network is measured, consider the adaptive dynamics predictor,
$$\dot{\hat{x}} = \hat{\Gamma}\,\hat{\nu}(\hat{x}) + k\big(x - \hat{x}\big),$$
(8.9)
$$\frac{d}{dt}\nabla\psi\big(\hat{\Gamma}\big) = -\gamma\big(\hat{x} - x\big)\hat{\nu}(\hat{x})^{\mathsf T},$$
(8.10)
with γ>0 a positive learning rate, k>0 an observer gain, ψ a strongly convex function, Γ^ an estimate of the stoichiometric matrix, and ν^(x^) a vector of monomial basis functions representing available knowledge of the system. Here we consider a four-species chemical reaction network (see Liu et al., 2013, supplementary information),
$$\dot{x}_1 = -k_1 x_1 x_2, \qquad \dot{x}_2 = -k_1 x_1 x_2 - k_2 x_2 x_3^2, \qquad \dot{x}_3 = -k_1 x_1 x_2 - 2k_2 x_2 x_3^2, \qquad \dot{x}_4 = k_2 x_2 x_3^2,$$
with corresponding adaptive dynamics predictor, equations 8.9 and 8.10. We set $\hat{\nu}$ to be a vector of all monomials up to degree 3 in the four species (35 monomials, including the constant), so that $\hat{\Gamma}$ contains a total of 140 candidate parameters, and we set $\psi(\hat{\Gamma}) = \frac{1}{2}\|\mathrm{vec}(\hat{\Gamma})\|_{1.01}^2$ to identify a sparse, parsimonious model consistent with the data. Searching over sparse models ensures that our learned predictor selects only a few relevant terms in the approximate system dynamics. We fix $k=1.5$ and $\gamma=0.25$ for $t<10$. As in section 8.3, past $t=10$, we set $k=\gamma=0$ and run the predictor open-loop. We also perform shrinkage and set all coefficients with magnitude below $10^{-3}$ formally equal to zero, leaving 19 remaining parameters.
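For concreteness, a minimal sketch of the monomial basis and one explicit Euler step of equations 8.9 and 8.10 is given below; it reuses the $\ell_p$ mirror map of the previous sketch (here with $p = 1.01$, applied to $\mathrm{vec}(\hat{\Gamma})$), and all names and discretization choices are illustrative assumptions rather than the original simulation code.

```python
import numpy as np
from itertools import combinations_with_replacement

def grad_psi(a, p):
    """Gradient of 0.5 * ||a||_p^2 (as in the sketch of section 8.3)."""
    n = np.linalg.norm(a, p)
    return np.zeros_like(a) if n == 0.0 else np.sign(a) * np.abs(a) ** (p - 1) * n ** (2 - p)

def grad_psi_inv(z, p):
    """Inverse mirror map: gradient of the conjugate norm-squared with q = p / (p - 1)."""
    return grad_psi(z, p / (p - 1))

def monomial_exponents(n_species=4, max_degree=3):
    """Exponent vectors of all monomials of degree 0..max_degree in n_species variables."""
    exps = [np.zeros(n_species, dtype=int)]
    for d in range(1, max_degree + 1):
        for combo in combinations_with_replacement(range(n_species), d):
            e = np.zeros(n_species, dtype=int)
            for i in combo:
                e[i] += 1
            exps.append(e)
    return np.array(exps)          # shape (35, 4) for the defaults

EXPS = monomial_exponents()

def nu_hat(x):
    """Monomial feature vector nu_hat evaluated at the predictor state."""
    return np.prod(x[None, :] ** EXPS, axis=1)

def predictor_step(x_hat, Gamma_hat, x, k=1.5, gamma=0.25, dt=1e-3, p=1.01):
    """One Euler step of the predictor (8.9) and the mirror-descent law (8.10)."""
    nu = nu_hat(x_hat)
    x_hat = x_hat + dt * (Gamma_hat @ nu + k * (x - x_hat))
    z = grad_psi(Gamma_hat.ravel(), p) - dt * gamma * np.outer(x_hat - x, nu).ravel()
    return x_hat, grad_psi_inv(z, p).reshape(Gamma_hat.shape)
```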
Figure 5:

Three-body system. (A) Observer error x˜=x^-x for the adaptive dynamics predictor, equations 4.3 and 4.4, with adaptation (solid) and without adaptation (dashed). Inset shows the asymptotic behavior of the open-loop predictor after learning. (B) Convergence of x^ with adaptation (solid) to x (dotted) for the adaptive dynamics predictor. x^ without adaptation (dashed) does not converge to the true behavior. When run open-loop, the learned predictor maintains the correct oscillatory behavior, while the unlearned open-loop predictor incorrectly tends to a fixed point. (C) Parameter trajectories for the adaptive dynamics predictor. Many parameters stay at or near zero, as predicted by proposition 1. (D) Histogram of final parameter values learned by the adaptive dynamics predictor.

Results are shown in Figure 6. In Figure 6A, we show convergence of the observer error to zero with adaptation (solid) and divergence away from zero without adaptation (dashed), demonstrating that adaptation is necessary for effective prediction. The inset displays a closer look at the asymptotic behavior of the open-loop dynamics predictor after shrinkage, which shows that the fixed point of the system is correctly learned. Figure 6B shows convergence of x^ (solid) to x (dashed). Figure 6C displays parameter trajectories as a function of time. Many parameters stay at or near zero as predicted by proposition 1. The inset displays a finer-grained view around zero of the parameter trajectories. Figure 6D shows a histogram of the final parameter values learned by the adaptive dynamics predictor, demonstrating that only a few relevant terms are identified.

9  Conclusion and Future Directions

It is somewhat unusual in nonlinear control to have a choice between a large variety of algorithms that can all be proven to globally converge. Nevertheless, in this article, we have presented a suite of new globally convergent adaptive control algorithms. The algorithms combine the velocity gradient methodology (Fradkov, 1980; Fradkov et al., 1999) with the Bregman Lagrangian (Betancourt et al., 2018; Wibisono et al., 2016) to systematically generate velocity gradient algorithms with momentum. Based on analogies between isotonic regression (Goel & Klivans, 2017; Goel et al., 2018; Kakade et al., 2011) and algorithms for nonlinearly parameterized adaptive control (Tyukin, 2011; Tyukin et al., 2007), we extended our higher-order velocity gradient algorithms to the nonlinearly parameterized setting of generalized linear models. Using a similar parallel to distributed stochastic gradient descent algorithms (Boffi & Slotine, 2020; Zhang et al., 2014), we developed a stable modification of all of our algorithms. We subsequently fused our developments with time-dependent learning rates based on the bounded gain forgetting formalism (Slotine & Li, 1991).

Figure 6:

Chemical reaction network. (A) Observer error x˜=x^-x for the adaptive dynamics predictor, equations 8.9 and 8.10, with adaptation (solid) and without adaptation (dashed). The predictor without adaptation diverges immediately. Past t=10, shrinkage is performed, all coefficients with magnitude below 10-3 are set to zero, and the predictor is run open-loop with k=γ=0. (B) Convergence of x^ to x for the adaptive dynamics predictor. The predictor accurately learns the fixed point of the system and stays stationary when run open-loop. (C) Parameter trajectories for the adaptive dynamics predictor. Many parameters stay at or near zero, as predicted by proposition 1. (D) Histogram of final parameter values learned by the adaptive dynamics predictor. Only a few relevant terms are identified.

By consideration of the non-Euclidean Bregman Lagrangian, we derived natural gradient (Amari, 1998) and mirror descent (Beck & Teboulle, 2003; Nemirovski & Yudin, 1983)–like algorithms with momentum. Taking the infinite friction limit of these algorithms recovers a recent algorithm for adaptive robot control (Lee et al., 2018) that respects physical Riemannian constraints on the parameters throughout adaptation. By extending recent results on the implicit bias of optimization algorithms in machine learning (Azizan & Hassibi, 2019; Azizan et al., 2019) to the continuous-time setting, we proved that these mirror descent-like algorithms in the first-order, second-order, and nonlinearly parameterized settings impose implicit regularization on the parameter vectors found by adaptive control.

Throughout the article, for simplicity of exposition, we focused on the nth-order system, equation 2.1. As discussed in remark 2, our results extend to more general systems that admit an error model similar to equation 2.6, in the sense that the proof technique summarized by lemma 2 is roughly preserved. The nth-order system structure keeps the employed proportional-integral forms simple, as they can be written down explicitly as in equations 6.14 to 6.17. As summarized in remark 40, a PDE needs to be solved in the general case, and solutions to this PDE may not exist. Solution of the PDE can be avoided by the dynamic scaling technique of Karagiannis et al. (2009) or a similar embedding technique of Tyukin (2011).

A significant outstanding question is whether there is an empirical advantage to using our proposed momentum algorithms. In optimization, accelerated algorithms generated by the Bregman Lagrangian provide faster convergence when properly discretized, and it is thus likely that a careful discretization is necessary to obtain optimal performance of our momentum algorithms. However, we are not aware of any available convergence rates in adaptive control, and proving such rates would be necessary to understand analytically whether there is an advantage. Similar higher-order algorithms have appeared in the literature for linear systems of relative degree greater than one (Fradkov et al., 1999; Morse, 1992), where first-order algorithms cannot control the system. Here we have focused on feedback linearizable systems, and perhaps there exist classes of nonlinear systems that cannot be adaptively controlled with a first-order algorithm but can be with a momentum algorithm. We leave the investigation of these interesting and important questions to future work.

Appendix A: Omitted Proofs and Required Results

Barbalat's lemma is a classical tool in adaptive control theory, used in conjunction with a Lyapunov-like analysis to prove convergence of a given signal.

Lemma 1.

(Barbalat's Lemma (Slotine & Li, 1991)). Assume that $\lim_{t\to\infty}\int_0^t |x(\tau)|\,d\tau < \infty$. If $x(t)$ is uniformly continuous, then $\lim_{t\to\infty} x(t) = 0$.

Note that a sufficient condition for uniform continuity of $x(t)$ is that $\dot{x}(t)$ be bounded. Hence, for any signal $x(t) \in \mathcal{L}_2 \cap \mathcal{L}_\infty$ with $\dot{x}(t) \in \mathcal{L}_\infty$, we can apply lemma 1 to the signal $x^2(t)$ and conclude that $x(t) \to 0$.

Lemma 2.

Assume that $\int_0^t \tilde{f}^2\big(x(t'), \hat{a}(t'), a, t'\big)\,dt' < \infty$ for all $t \in [0, T)$, where $[0, T)$ is the maximal interval of existence of $x(t)$. Further assume that $\hat{a}(t)$ is bounded over $[0, T)$, that both bounds are independent of $T$, and that $\tilde{f}$ is locally bounded in $x$ and $\hat{a}$, uniformly in $t$. Then $\hat{a} \in \mathcal{L}_\infty$, $\tilde{f} \in \mathcal{L}_2$, $s \in \mathcal{L}_2 \cap \mathcal{L}_\infty$, $s \to 0$, and $x \to x_d$.

Proof.
By equation 2.6, we can write explicitly
$$s(t) = \int_0^t e^{-\eta(t-\tau)}\,\tilde{f}\big(x(\tau), \hat{a}(\tau), a, \tau\big)\,d\tau.$$
(A.1)
By the Cauchy-Schwarz inequality,
$$s^2(T) \leq \left(\int_0^T e^{-2\eta(T-\tau)}\,d\tau\right)\left(\int_0^T \tilde{f}^2(\tau)\,d\tau\right) = \frac{1 - e^{-2\eta T}}{2\eta}\int_0^T \tilde{f}^2(\tau)\,d\tau \leq \frac{1}{2\eta}\int_0^T \tilde{f}^2(\tau)\,d\tau,$$
so that $\sup_{t \in [0, T)} |s(t)| < \infty$. Observe that this bound is independent of $T$. It immediately follows that $\sup_{t \in [0, T)} \|x(t)\| < \infty$ and that this bound is independent of $T$. This observation contradicts that $[0, T)$ is the maximal interval of existence of $x(t)$ for any finite $T$, and thus $x(t)$ must exist for all $t$. This shows that $x \in \mathcal{L}_\infty$, $s \in \mathcal{L}_\infty$, and that the bounds on $\tilde{f}$ and $\hat{a}$ can be extended for all $t$. From this we conclude $\tilde{f} \in \mathcal{L}_2$ and $\hat{a} \in \mathcal{L}_\infty$. Similarly, Parseval's theorem applied to the low-pass filter, equation A.1, shows that $s \in \mathcal{L}_2$. Because $x \in \mathcal{L}_\infty$ and $\hat{a} \in \mathcal{L}_\infty$, and because $\tilde{f}$ is locally bounded in $x$ and $\hat{a}$ uniformly in $t$, $\tilde{f} \in \mathcal{L}_\infty$. By equation 2.6, $\dot{s} \in \mathcal{L}_\infty$, and hence by Barbalat's lemma (lemma 1), $s \to 0$. By definition of $s$, we then conclude that $x \to x_d$.

A.1  Proof of Proposition 8

Proof.
Consider the Lyapunov-like function,
$$V = \frac{1}{2}s^2 + \frac{1}{2\gamma}\tilde{a}^{\mathsf T}P^{-1}\tilde{a},$$
which has time derivative
$$\dot{V} = -\eta s^2 - \frac{\kappa}{\gamma}\tilde{f}^2.$$

This immediately shows that $s \in \mathcal{L}_\infty$ and $\hat{a} \in \mathcal{L}_\infty$. Because $s \in \mathcal{L}_\infty$, $x \in \mathcal{L}_\infty$ by definition of the sliding variable (Slotine & Li, 1991). Integrating $\dot{V}$ shows that $s \in \mathcal{L}_2$ and $\tilde{f} \in \mathcal{L}_2$. The result follows by application of lemma 2 or directly by Barbalat's lemma (lemma 1).

A.2  Proof of Proposition 9

Proof.
Consider the Lyapunov function,
$$V = \frac{1}{2}s^2 + \frac{1}{2\gamma}\left(\|\tilde{v}\|^2 + \|\hat{a}-\hat{v}\|^2\right),$$
which has time derivative
$$\begin{aligned}
\dot{V} &= -\eta s^2 + s\tilde{f} + \frac{1}{\gamma}\Big[\tilde{v}^{\mathsf T}\big(-\kappa\tilde{f} - \gamma s\big)Y^{\mathsf T} + \big(\hat{a}-\hat{v}\big)^{\mathsf T}\Big(\beta N\big(\hat{v}-\hat{a}\big) + \big(\gamma s + \kappa\tilde{f}\big)Y^{\mathsf T}\Big)\Big]\\
&= -\eta s^2 - \frac{\kappa}{\gamma}\tilde{f}^2 - \frac{\beta}{\gamma}\|\hat{a}-\hat{v}\|^2 - \frac{\beta\mu}{\gamma}\|\hat{a}-\hat{v}\|^2\|Y\|^2 + 2s\big(\hat{a}-\hat{v}\big)^{\mathsf T}Y^{\mathsf T} + \frac{2\kappa}{\gamma}\tilde{f}\big(\hat{a}-\hat{v}\big)^{\mathsf T}Y^{\mathsf T}\\
&\leq -\eta s^2 - \frac{\kappa}{\gamma}\tilde{f}^2 - \frac{\beta}{\gamma}\|\hat{a}-\hat{v}\|^2 - \frac{\beta\mu}{\gamma}\|\hat{a}-\hat{v}\|^2\|Y\|^2 + 2|s|\,\|\hat{a}-\hat{v}\|\,\|Y\| + \frac{2\kappa}{\gamma}|\tilde{f}|\,\|\hat{a}-\hat{v}\|\,\|Y\|\\
&\leq -\varepsilon_1\eta s^2 - \varepsilon_2\frac{\kappa}{\gamma}\tilde{f}^2 - (1-\varepsilon_1)\eta\left(|s| - \frac{1}{(1-\varepsilon_1)\eta}\|\hat{a}-\hat{v}\|\,\|Y\|\right)^2\\
&\quad - (1-\varepsilon_2)\frac{\kappa}{\gamma}\left(|\tilde{f}| - \frac{1}{1-\varepsilon_2}\|\hat{a}-\hat{v}\|\,\|Y\|\right)^2 - \frac{\beta}{\gamma}\|\hat{a}-\hat{v}\|^2,
\end{aligned}$$
where $0 < \varepsilon_1 < 1$ and $0 < \varepsilon_2 < 1$ are arbitrary and where we have taken $\mu = \frac{\gamma}{\beta}\left(\frac{1}{(1-\varepsilon_1)\eta} + \frac{\kappa}{(1-\varepsilon_2)\gamma}\right)$. Because $\varepsilon_1$ and $\varepsilon_2$ are arbitrary, this shows that $\dot{V}$ is negative semidefinite for $\mu > \frac{\gamma}{\beta}\left(\frac{1}{\eta} + \frac{\kappa}{\gamma}\right)$. Hence, $\hat{v} \in \mathcal{L}_\infty$, $\hat{a} \in \mathcal{L}_\infty$, and $s \in \mathcal{L}_\infty$. Because $s \in \mathcal{L}_\infty$, we automatically have $x \in \mathcal{L}_\infty$, which shows that $\dot{s} \in \mathcal{L}_\infty$ by local boundedness of $\tilde{f}$ in $x$ and $\hat{a}$, uniformly in $t$. Integrating $\dot{V}$ shows that $s \in \mathcal{L}_2$, and hence by Barbalat's lemma (lemma 1), $s \to 0$ and $x \to x_d$.

A.3  Proof of Proposition 10

Proof.
Defining the vector $\hat{v}_t = \sum_{i=1}^{m}\hat{w}_i^t\,\phi(x_i)$, equation 6.22 implies the iteration on $\hat{v}$,
$$\hat{v}_{t+1} = \hat{v}_t - \frac{\lambda}{m}\sum_{i=1}^{m}\Big(\hat{f}\big(\hat{w}^t, x_i\big) - f(x_i)\Big)\phi(x_i).$$
(A.2)
Equation A.2 shows that at time t,
$$\hat{v}_t = -\frac{\lambda}{m}\sum_{i=1}^{m}\sum_{j=1}^{t-1}\tilde{f}_i^j\,\phi(x_i),$$
(A.3)
where $\tilde{f}_i^j$ in equation A.3 is the function approximation error on the $i$th input example at iteration $j$, $\tilde{f}_i^j = \hat{f}(\hat{w}^j, x_i) - f(x_i)$.
Now, assuming that for the adaptive control problem $f(x, a, t) = u\big(\alpha^{\mathsf T}(x, t)\,a\big)$, setting $P = \lambda I$ and $\hat{a}(0) = 0$, and integrating both sides of equation 2.12, we see that at time $t$,
$$\hat{a}(t) = -\lambda\int_0^t \tilde{f}\big(x(t'), \hat{a}(t'), a, t'\big)\,\alpha\big(x(t'), t'\big)\,dt'.$$
(A.4)
The current function approximation f^ at time t for the parameters in equation A.4 can then be written as
$$\hat{f}(t) = u\Big(\alpha^{\mathsf T}(x, t)\,\hat{a}(t)\Big) = u\left(\int_0^t -\lambda\,\tilde{f}\big(x(t'), \hat{a}(t'), a, t'\big)\,\alpha^{\mathsf T}\big(x(t), t\big)\,\alpha\big(x(t'), t'\big)\,dt'\right) = u\left(\int_0^t c(t')\,K(t, t')\,dt'\right),$$
(A.5)
where we have defined $c(t') = -\lambda\,\tilde{f}\big(x(t'), \hat{a}(t'), a, t'\big)$ and $K(t, t') = \alpha^{\mathsf T}\big(x(t), t\big)\,\alpha\big(x(t'), t'\big)$. Similarly, in the case of the Alphatron, the current approximation at iteration $t$ is given by
$$\hat{f}\big(\hat{w}^t, x\big) = u\Big(\big\langle\hat{v}_t, \phi(x)\big\rangle_{\mathcal{H}}\Big) = u\left(\sum_{i=1}^{m}\sum_{j=1}^{t-1}-\frac{\lambda}{m}\tilde{f}_i^j\,\big\langle\phi(x), \phi(x_i)\big\rangle\right) = u\left(\sum_{i=1}^{m}\hat{w}_i^t\,K(x, x_i)\right),$$
(A.6)
where we have noted that with $\hat{w}_i^0 = 0$ for all $i$, $\hat{w}_i^t = -\frac{\lambda}{m}\sum_{j=1}^{t-1}\tilde{f}_i^j$.
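To make the correspondence in equations A.2 to A.6 concrete, the following is a minimal numerical sketch of the kernel form of the iteration, together with a check against the explicit feature-space parameter for a linear kernel; the link function, data, and all names are illustrative assumptions rather than the original algorithmic details.

```python
import numpy as np

def kernel_iteration(X, y, u, n_iters, lam, kernel):
    """Maintain one weight per training example; each iteration adds
    -(lam/m) * (f_hat(x_i) - y_i) to w_i, matching equation A.6."""
    m = len(y)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])   # Gram matrix
    w = np.zeros(m)
    for _ in range(n_iters):
        f_hat = u(K @ w)               # f_hat(x_i) = u(sum_j w_j K(x_i, x_j))
        w = w - (lam / m) * (f_hat - y)
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, a_true = rng.standard_normal((20, 5)), rng.standard_normal(5)
    u = np.tanh                        # an illustrative monotone link function
    y = u(X @ a_true)
    w = kernel_iteration(X, y, u, n_iters=2000, lam=0.1, kernel=lambda x, xp: x @ xp)
    v = X.T @ w                        # for a linear kernel, v_t = sum_i w_i^t phi(x_i) = X^T w
    print(float(np.max(np.abs(u(X @ v) - y))))   # training residual of the fitted predictor
```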

A.4  Proof of Proposition 11

Proof.
Consider the Lyapunov function candidate,
$$V = \frac{1}{2\gamma}\|\tilde{v}\|^2 + \frac{1}{2\gamma}\|\hat{a}-\hat{v}\|^2,$$
which has time derivative
$$\begin{aligned}
\dot{V} &= \frac{1}{\gamma}\tilde{v}^{\mathsf T}\big(-\gamma\tilde{f}\alpha\big) + \frac{1}{\gamma}\big(\hat{a}-\hat{v}\big)^{\mathsf T}\Big(\beta N\big(\hat{v}-\hat{a}\big) + \gamma\tilde{f}\alpha\Big)\\
&= -\tilde{a}^{\mathsf T}\alpha\tilde{f} - \frac{\beta N}{\gamma}\|\hat{a}-\hat{v}\|^2 + 2\big(\hat{a}-\hat{v}\big)^{\mathsf T}\alpha\tilde{f}\\
&\leq -\frac{\tilde{f}^2}{D_1} - \frac{\beta}{\gamma}\|\hat{a}-\hat{v}\|^2 - \frac{\beta\mu}{\gamma}\|\hat{a}-\hat{v}\|^2\|\alpha\|^2 + 2\|\alpha\|\,\|\hat{a}-\hat{v}\|\,|\tilde{f}|\\
&\leq -\frac{\varepsilon}{D_1}\tilde{f}^2 - \frac{\beta}{\gamma}\|\hat{a}-\hat{v}\|^2 - \frac{1-\varepsilon}{D_1}\left(|\tilde{f}| - \frac{D_1}{1-\varepsilon}\|\alpha\|\,\|\hat{a}-\hat{v}\|\right)^2,
\end{aligned}$$
where $0 < \varepsilon < 1$ is arbitrary and we have chosen $\mu = \frac{\gamma D_1}{(1-\varepsilon)\beta}$. Because $\varepsilon$ is arbitrary, this shows that $\hat{v}$ and $\hat{a}$ remain bounded for $\mu > \frac{\gamma D_1}{\beta}$ over the maximal interval of existence of $x(t)$. By integrating $\dot{V}$, we see that $\tilde{f} \in \mathcal{L}_2$ over this same interval. Note that the bounds are independent of the length of the interval. Application of lemma 2 completes the proof.

A.5  Proof of Proposition 12

Proof.
The Lyapunov-like function,
$$V = \frac{1}{2}\left(\tilde{a}^{\mathsf T}P^{-1}\tilde{a} + \tilde{\bar{a}}^{\mathsf T}P^{-1}\tilde{\bar{a}} + s^2\right),$$
has time derivative
$$\dot{V} = -\eta s^2 - k\big(\bar{a}-\hat{a}\big)^{\mathsf T}P^{-1}\big(\bar{a}-\hat{a}\big).$$
This shows that $s$, $\hat{a}$, and $\bar{a}$ remain bounded. The remaining conclusions of the proposition follow immediately by integrating $\dot{V}$ and applying Barbalat's lemma (lemma 1).

A.6  Proof of Proposition 13

Proof.
The Lyapunov-like function,
$$V = \frac{1}{2}\left(\tilde{a}^{\mathsf T}P^{-1}\tilde{a} + \tilde{\bar{a}}^{\mathsf T}P^{-1}\tilde{\bar{a}}\right),$$
has time derivative
$$\dot{V} \leq -\frac{1}{D_1}\tilde{f}^2 - k\big(\hat{a}-\bar{a}\big)^{\mathsf T}P^{-1}\big(\hat{a}-\bar{a}\big).$$
This shows that $\hat{a}$ and $\bar{a}$ remain bounded over the maximal interval of existence of $x(t)$. Integration of $\dot{V}$ shows that $\tilde{f} \in \mathcal{L}_2$ and $\hat{a}-\bar{a} \in \mathcal{L}_2$ over the same interval. Note that the bounds are independent of the length of the interval. Application of lemma 2 completes the proof.

A.7  Proof of Proposition 14

Proof.
The Lyapunov-like function,
$$V = \frac{1}{2\gamma}\left(\|\tilde{v}\|^2 + \|\hat{a}-\hat{v}\|^2 + \|\bar{a}-\hat{a}\|^2\right),$$
has time derivative
$$\begin{aligned}
\dot{V} &= -\tilde{a}^{\mathsf T}\alpha\tilde{f} - \frac{\beta N}{\gamma}\|\hat{a}-\hat{v}\|^2 + \frac{k\beta N}{\gamma}\big(\hat{a}-\hat{v}\big)^{\mathsf T}\big(\bar{a}-\hat{a}\big) + 2\tilde{f}\big(\hat{a}-\hat{v}\big)^{\mathsf T}\alpha\\
&\quad - \frac{2k\beta N}{\gamma}\|\hat{a}-\bar{a}\|^2 + \frac{\beta N}{\gamma}\big(\hat{a}-\bar{a}\big)^{\mathsf T}\big(\hat{v}-\hat{a}\big)\\
&\leq -\frac{\tilde{f}^2}{D_1} - \frac{\beta N}{2\gamma}(1-k)\|\hat{a}-\hat{v}\|^2 - \frac{\beta N}{2\gamma}(3k-1)\|\hat{a}-\bar{a}\|^2 + 2|\tilde{f}|\,\|\hat{a}-\hat{v}\|\,\|\alpha\|\\
&= -\frac{\tilde{f}^2}{D_1} - \frac{\beta}{2\gamma}(1-k)\|\hat{a}-\hat{v}\|^2 - \frac{\beta\mu}{2\gamma}(1-k)\|\hat{a}-\hat{v}\|^2\|\alpha\|^2 - \frac{\beta N}{2\gamma}(3k-1)\|\hat{a}-\bar{a}\|^2 + 2|\tilde{f}|\,\|\hat{a}-\hat{v}\|\,\|\alpha\|\\
&\leq -\frac{\varepsilon}{D_1}\tilde{f}^2 - \frac{1-\varepsilon}{D_1}\left(|\tilde{f}| - \frac{D_1}{1-\varepsilon}\|\hat{a}-\hat{v}\|\,\|\alpha\|\right)^2 - \frac{\beta}{2\gamma}(1-k)\|\hat{a}-\hat{v}\|^2 - \frac{\beta N}{2\gamma}(3k-1)\|\hat{a}-\bar{a}\|^2,
\end{aligned}$$
where $0 < \varepsilon < 1$ is arbitrary and we have chosen $\mu = \frac{2\gamma D_1}{\beta(1-\varepsilon)(1-k)}$. From the above, we conclude that $\hat{v}$, $\hat{a}$, and $\bar{a}$ remain bounded over the maximal interval of existence of $x(t)$ for $\frac{1}{3} < k < 1$. By integrating $\dot{V}$, we see that $\tilde{f} \in \mathcal{L}_2$, $\hat{a}-\bar{a} \in \mathcal{L}_2$, and $\hat{a}-\hat{v} \in \mathcal{L}_2$ over the same interval. Note that the bounds are independent of the length of the interval. Application of lemma 2 completes the proof.

A.8  Proof of Proposition 15

Proof.
The Lyapunov-like function,
$$V = \frac{1}{\gamma}\left(\|\tilde{v}\|^2 + \|\tilde{\bar{v}}\|^2 + \|\hat{a}-\hat{v}\|^2\right),$$
has time derivative
$$\begin{aligned}
\dot{V} &= -\tilde{a}^{\mathsf T}\alpha\tilde{f} + 2\tilde{f}\big(\hat{a}-\hat{v}\big)^{\mathsf T}\alpha - \frac{\beta N}{\gamma}\|\hat{a}-\hat{v}\|^2 - \frac{\rho}{\gamma}\|\hat{v}-\bar{v}\|^2 - \frac{\rho}{\gamma}\big(\hat{a}-\hat{v}\big)^{\mathsf T}\big(\bar{v}-\hat{v}\big)\\
&\leq -\frac{\tilde{f}^2}{D_1} + 2|\tilde{f}|\,\|\hat{a}-\hat{v}\|\,\|\alpha\| - \left(\frac{\beta}{\gamma} - \frac{\rho}{2\gamma}\right)\|\hat{a}-\hat{v}\|^2 - \frac{\beta\mu}{\gamma}\|\hat{a}-\hat{v}\|^2\|\alpha\|^2 - \frac{\rho}{2\gamma}\|\hat{v}-\bar{v}\|^2\\
&\leq -\frac{\varepsilon}{D_1}\tilde{f}^2 - \frac{1-\varepsilon}{D_1}\left(|\tilde{f}| - \frac{D_1}{1-\varepsilon}\|\hat{v}-\hat{a}\|\,\|\alpha\|\right)^2 - \frac{\rho}{2\gamma}\|\bar{v}-\hat{v}\|^2 - \frac{1}{2\gamma}\big(2\beta-\rho\big)\|\hat{v}-\hat{a}\|^2,
\end{aligned}$$
where $0 < \varepsilon < 1$ is arbitrary and we have chosen $\mu = \frac{\gamma D_1}{\beta(1-\varepsilon)}$. From the above, we conclude that $\hat{v}$, $\bar{v}$, and $\hat{a}$ remain bounded over the maximal interval of existence of $x(t)$ for $\rho < 2\beta$. Integrating $\dot{V}$ shows that $\tilde{f} \in \mathcal{L}_2$, $\hat{v}-\bar{v} \in \mathcal{L}_2$, and $\hat{v}-\hat{a} \in \mathcal{L}_2$ over the same interval. Note that the bounds are independent of the length of the interval. Application of lemma 2 completes the proof.

A.9  Proof of Proposition 16

Proof.
The Lyapunov-like function,
$$V = \frac{1}{2\gamma}\left(\|\hat{a}-\hat{v}\|^2 + \|\hat{a}-\bar{a}\|^2 + \|\tilde{v}\|^2 + \|\tilde{\bar{v}}\|^2\right),$$
has time derivative
$$\begin{aligned}
\dot{V} &= -\tilde{a}^{\mathsf T}\alpha\tilde{f} - \frac{\beta N}{\gamma}\|\hat{a}-\hat{v}\|^2 - \frac{2k\beta N}{\gamma}\|\bar{a}-\hat{a}\|^2 + 2\tilde{f}\big(\hat{a}-\hat{v}\big)^{\mathsf T}\alpha + \frac{\beta N}{\gamma}\big(\hat{a}-\bar{a}\big)^{\mathsf T}\big(\hat{v}-\hat{a}\big)\\
&\quad + \frac{k\beta N}{\gamma}\big(\hat{a}-\hat{v}\big)^{\mathsf T}\big(\bar{a}-\hat{a}\big) - \frac{\rho}{\gamma}\|\hat{v}-\bar{v}\|^2 - \frac{\rho}{\gamma}\big(\hat{a}-\hat{v}\big)^{\mathsf T}\big(\bar{v}-\hat{v}\big)\\
&\leq -\frac{1}{D_1}\tilde{f}^2 - \frac{1}{2\gamma}\big(\beta(1-k)-\rho\big)\|\hat{a}-\hat{v}\|^2 - \frac{\beta\mu}{2\gamma}(1-k)\|\hat{a}-\hat{v}\|^2\|\alpha\|^2\\
&\quad - \frac{\beta N}{2\gamma}(3k-1)\|\bar{a}-\hat{a}\|^2 - \frac{\rho}{2\gamma}\|\hat{v}-\bar{v}\|^2 + 2|\tilde{f}|\,\|\hat{a}-\hat{v}\|\,\|\alpha\|\\
&\leq -\frac{\varepsilon}{D_1}\tilde{f}^2 - \frac{1-\varepsilon}{D_1}\left(|\tilde{f}| - \frac{D_1}{1-\varepsilon}\|\hat{a}-\hat{v}\|\,\|\alpha\|\right)^2 - \frac{\rho}{2\gamma}\|\bar{v}-\hat{v}\|^2\\
&\quad - \frac{\beta N}{2\gamma}(3k-1)\|\hat{a}-\bar{a}\|^2 - \frac{1}{2\gamma}\big((1-k)\beta-\rho\big)\|\hat{a}-\hat{v}\|^2,
\end{aligned}$$
where $0 < \varepsilon < 1$ is arbitrary and we have chosen $\mu = \frac{2\gamma D_1}{\beta(1-k)(1-\varepsilon)}$. This immediately shows that $\hat{a}$, $\hat{v}$, $\bar{a}$, and $\bar{v}$ remain bounded over the maximal interval of existence of $x(t)$ for $\frac{1}{3} < k < 1$ and $\rho < \beta(1-k)$. Integrating $\dot{V}$ shows that $\tilde{f} \in \mathcal{L}_2$, $\bar{v}-\hat{v} \in \mathcal{L}_2$, $\hat{a}-\bar{a} \in \mathcal{L}_2$, and $\hat{a}-\hat{v} \in \mathcal{L}_2$ over the same interval. Note that the bounds are independent of the length of the interval. Application of lemma 2 completes the proof.

A.10  Proof of Proposition 17

Proof.
The Lyapunov-like function,
$$V = \frac{1}{2}s^2 + \frac{1}{2}\tilde{a}^{\mathsf T}P^{-1}\tilde{a},$$
has time derivative
$$\dot{V} = -\eta s^2 - \frac{1}{2}\tilde{f}^2 - \frac{\lambda}{2}\tilde{a}^{\mathsf T}P^{-1}\tilde{a},$$
which shows that $s$ and $\hat{a}$ remain bounded. Because $s$ remains bounded, $x$ remains bounded. Integrating $\dot{V}$ shows that $s \in \mathcal{L}_2$ and $\tilde{f} \in \mathcal{L}_2$. The proof is completed by application of lemma 2 or directly by Barbalat's lemma (lemma 1).

A.11  Proof of Proposition 18

Proof.
Consider the Lyapunov-like function,
$$V = \frac{1}{2}s^2 + \frac{1}{2}\tilde{v}^{\mathsf T}P^{-1}\tilde{v} + \frac{1}{2}\big(\hat{v}-\hat{a}\big)^{\mathsf T}P^{-1}\big(\hat{v}-\hat{a}\big),$$
which has time derivative
$$\begin{aligned}
\dot{V} &= -\eta s^2 + s\tilde{f} - \big(\hat{v}-\hat{a}+\tilde{a}\big)^{\mathsf T}\big(s+\tilde{f}\big)Y^{\mathsf T} + \frac{1}{2}\big(\tilde{v}^{\mathsf T}Y^{\mathsf T}\big)^2 - \frac{\lambda(t)}{2}\tilde{v}^{\mathsf T}P^{-1}\tilde{v}\\
&\quad + \big(\hat{v}-\hat{a}\big)^{\mathsf T}\Big(-\beta N\big(\hat{v}-\hat{a}\big) - \big(s+\tilde{f}\big)Y^{\mathsf T}\Big) + \frac{1}{2}\Big(\big(\hat{v}-\hat{a}\big)^{\mathsf T}Y^{\mathsf T}\Big)^2 - \frac{\lambda(t)}{2}\big(\hat{v}-\hat{a}\big)^{\mathsf T}P^{-1}\big(\hat{v}-\hat{a}\big)\\
&= -\eta s^2 - \tilde{f}^2 - 2\big(\hat{v}-\hat{a}\big)^{\mathsf T}\big(s+\tilde{f}\big)Y^{\mathsf T} - \beta N\|\hat{v}-\hat{a}\|^2 + \frac{1}{2}\big(\tilde{v}^{\mathsf T}Y^{\mathsf T}\big)^2 + \frac{1}{2}\Big(\big(\hat{v}-\hat{a}\big)^{\mathsf T}Y^{\mathsf T}\Big)^2\\
&\quad - \frac{\lambda(t)}{2}\Big(\tilde{v}^{\mathsf T}P^{-1}\tilde{v} + \big(\hat{v}-\hat{a}\big)^{\mathsf T}P^{-1}\big(\hat{v}-\hat{a}\big)\Big).
\end{aligned}$$
Now we use that $\tilde{v}^{\mathsf T}Y^{\mathsf T} = \big(\hat{v}-\hat{a}\big)^{\mathsf T}Y^{\mathsf T} + \tilde{f}$ to write $\frac{1}{2}\big(\tilde{v}^{\mathsf T}Y^{\mathsf T}\big)^2 = \frac{1}{2}\Big(\big(\hat{v}-\hat{a}\big)^{\mathsf T}Y^{\mathsf T}\Big)^2 + \big(\hat{v}-\hat{a}\big)^{\mathsf T}Y^{\mathsf T}\tilde{f} + \frac{1}{2}\tilde{f}^2$. Hence,
$$\begin{aligned}
\dot{V} &= -\eta s^2 - \frac{1}{2}\tilde{f}^2 - 2s\big(\hat{v}-\hat{a}\big)^{\mathsf T}Y^{\mathsf T} - \tilde{f}\big(\hat{v}-\hat{a}\big)^{\mathsf T}Y^{\mathsf T} - \beta N\|\hat{v}-\hat{a}\|^2 + \Big(\big(\hat{v}-\hat{a}\big)^{\mathsf T}Y^{\mathsf T}\Big)^2\\
&\quad - \frac{\lambda(t)}{2}\Big(\tilde{v}^{\mathsf T}P^{-1}\tilde{v} + \big(\hat{v}-\hat{a}\big)^{\mathsf T}P^{-1}\big(\hat{v}-\hat{a}\big)\Big)\\
&= -\eta s^2 - \frac{1}{2}\tilde{f}^2 - 2s\big(\hat{v}-\hat{a}\big)^{\mathsf T}Y^{\mathsf T} - \tilde{f}\big(\hat{v}-\hat{a}\big)^{\mathsf T}Y^{\mathsf T} - \beta\|\hat{v}-\hat{a}\|^2 - \beta\mu\|Y\|^2\|\hat{v}-\hat{a}\|^2\\
&\quad + \Big(\big(\hat{v}-\hat{a}\big)^{\mathsf T}Y^{\mathsf T}\Big)^2 - \frac{\lambda(t)}{2}\Big(\tilde{v}^{\mathsf T}P^{-1}\tilde{v} + \big(\hat{v}-\hat{a}\big)^{\mathsf T}P^{-1}\big(\hat{v}-\hat{a}\big)\Big)\\
&\leq -\eta s^2 - \frac{1}{2}\tilde{f}^2 + 2|s|\,\|\hat{v}-\hat{a}\|\,\|Y\| + |\tilde{f}|\,\|\hat{v}-\hat{a}\|\,\|Y\| - \beta\|\hat{v}-\hat{a}\|^2\\
&\quad - \big(\beta\mu - 1\big)\|Y\|^2\|\hat{v}-\hat{a}\|^2 - \frac{\lambda(t)}{2}\Big(\tilde{v}^{\mathsf T}P^{-1}\tilde{v} + \big(\hat{v}-\hat{a}\big)^{\mathsf T}P^{-1}\big(\hat{v}-\hat{a}\big)\Big)\\
&\leq -\varepsilon_1\eta s^2 - \frac{\varepsilon_2}{2}\tilde{f}^2 - \big(1-\varepsilon_1\big)\eta\left(|s| - \frac{1}{(1-\varepsilon_1)\eta}\|\hat{v}-\hat{a}\|\,\|Y\|\right)^2 - \frac{1-\varepsilon_2}{2}\left(|\tilde{f}| - \frac{1}{1-\varepsilon_2}\|\hat{v}-\hat{a}\|\,\|Y\|\right)^2\\
&\quad - \beta\|\hat{v}-\hat{a}\|^2 - \frac{\lambda(t)}{2}\Big(\tilde{v}^{\mathsf T}P^{-1}\tilde{v} + \big(\hat{v}-\hat{a}\big)^{\mathsf T}P^{-1}\big(\hat{v}-\hat{a}\big)\Big),
\end{aligned}$$
where $0 < \varepsilon_1 < 1$ and $0 < \varepsilon_2 < 1$ are both arbitrary and where we have chosen $\mu = \frac{1}{\beta}\left(1 + \frac{1}{\eta(1-\varepsilon_1)} + \frac{1}{2(1-\varepsilon_2)}\right)$. This shows that $s$, $\hat{v}$, and $\hat{a}$ remain bounded. Because $s$ remains bounded, $x$ remains bounded. Integrating $\dot{V}$ shows that $s \in \mathcal{L}_2$ and $\tilde{f} \in \mathcal{L}_2$. By local boundedness of $\tilde{f}$ in $x$ and $\hat{a}$, uniformly in $t$, $\tilde{f}$ remains bounded, and hence $\dot{s}$ remains bounded. By Barbalat's lemma (lemma 1), $s \to 0$ and $x \to x_d$.

A.12  Proof of Proposition 19

Proof.
Consider the Lyapunov-like function,
$$V = \frac{1}{2}\tilde{a}^{\mathsf T}P^{-1}\tilde{a},$$
(A.7)
which has time derivative
$$\dot{V} = -\tilde{f}\alpha^{\mathsf T}\tilde{a} + \frac{1}{2}\big(\tilde{a}^{\mathsf T}\alpha\big)^2 - \frac{\lambda}{2}\tilde{a}^{\mathsf T}P^{-1}\tilde{a} \leq -\frac{1}{D_1}\tilde{f}^2 + \frac{1}{2D_2^2}\tilde{f}^2 = -\left(\frac{1}{D_1} - \frac{1}{2D_2^2}\right)\tilde{f}^2.$$
For $D_1 < 2D_2^2$, $\dot{V} \leq 0$ and $\tilde{f} \in \mathcal{L}_2$ over the maximal interval of existence of $x(t)$. Alternatively, using the same Lyapunov function,
$$\dot{V} \leq -\big(\alpha^{\mathsf T}\tilde{a}\big)^2\left(D_2 - \frac{1}{2}\right).$$
For $D_2 > \frac{1}{2}$, $\dot{V} \leq 0$ and $\alpha^{\mathsf T}\tilde{a} \in \mathcal{L}_2$ over the maximal interval of existence of $x(t)$. By assumption 2, this implies that $\tilde{f} \in \mathcal{L}_2$ over the same interval. Hence, both approaches demonstrate that $\hat{a}$ remains bounded over the maximal interval of existence of $x(t)$ and that $\tilde{f} \in \mathcal{L}_2$ over the same interval. Furthermore, these bounds are independent of the length of the interval. By lemma 2, the proposition is proved.

A.13  Proof of Proposition 20

Proof.
Consider the Lyapunov-like function,
$$V = \frac{1}{2}\left(\tilde{v}^{\mathsf T}P^{-1}\tilde{v} + \big(\hat{a}-\hat{v}\big)^{\mathsf T}P^{-1}\big(\hat{a}-\hat{v}\big)\right),$$
which has time derivative
$$\dot{V} = -\tilde{v}^{\mathsf T}\alpha\tilde{f} + \frac{1}{2}\tilde{v}^{\mathsf T}\big(-\lambda P^{-1} + \alpha\alpha^{\mathsf T}\big)\tilde{v} + \big(\hat{a}-\hat{v}\big)^{\mathsf T}\beta N\big(\hat{v}-\hat{a}\big)$$