## Abstract

We introduce frequency propagation, a learning algorithm for nonlinear physical networks. In a resistive electrical circuit with variable resistors, an activation current is applied at a set of input nodes at one frequency and an error current is applied at a set of output nodes at another frequency. The voltage response of the circuit to these boundary currents is the superposition of an activation signal and an error signal whose coefficients can be read in different frequencies of the frequency domain. Each conductance is updated proportionally to the product of the two coefficients. The learning rule is local and proved to perform gradient descent on a loss function. We argue that frequency propagation is an instance of a multimechanism learning strategy for physical networks, be it resistive, elastic, or flow networks. Multimechanism learning strategies incorporate at least two physical quantities, potentially governed by independent physical mechanisms, to act as activation and error signals in the training process. Locally available information about these two signals is then used to update the trainable parameters to perform gradient descent. We demonstrate how earlier work implementing learning via chemical signaling in flow networks (Anisetti, Scellier, et al., 2023) also falls under the rubric of multimechanism learning.

## 1 Introduction

Advancements in artificial neural networks (ANN; Goodfellow et al., 2016) have inspired a search for adaptive physical networks that can be optimized to achieve desired functionality (Anisetti, Scellier et al., 2023; Kendall et al., 2020; Stern et al., 2020, 2021; Lopez-Pastor & Marquardt, 2021; Dillavou et al., 2021; Scellier et al., 2022; Stern & Murugan, 2022). Similar to ANNs, adaptive physical networks modify their learning degrees of freedom to approximate a desired input-to-output function, but unlike ANNs, they achieve this using physical laws. In a physical network, the input is typically an externally applied boundary condition, and the output is the network’s response to this input or a statistic of this response. For instance, in a resistive network, an input signal can be fed in the form of applied currents or voltages, and the output may be the vector of voltages across a subset of nodes of the network. The learning degrees of freedom of the network are, for example, the conductances of the resistors (assuming variable resistors). Ideally, these learning parameters must be updated using only locally available information. Otherwise, the network would require additional channels to transmit the gradient information.

Moreover, these parameter updates should preferably follow the direction of gradient descent in the loss function landscape, as is the case for ANNs.

Existing learning algorithms for adaptive physical networks include equilibrium propagation (Scellier & Bengio, 2017; Kendall et al., 2020) and coupled learning (Stern et al., 2021). These algorithms are based on the idea of contrastive learning (Baldi & Pineda, 1991) and proceed as follows. In a first phase, an input is presented to the network in the form of boundary currents or voltages, and the network is allowed to settle to equilibrium (the free state), where a supervisor checks the output of the system. Then the supervisor nudges the output toward the desired output. This perturbation causes the system to settle to a new (perturbed) equilibrium state, which is a slightly more accurate approximation of the function that one wants to learn. The supervisor then compares the perturbed state with the free state to make changes in the learning degrees of freedom in such a way that the network spontaneously produces an output that is slightly closer to the desired output. In the limit of infinitesimal nudging, this procedure performs gradient descent on the squared prediction error (Scellier & Bengio, 2017).

The described procedure presents practical challenges when applied to physical systems. Specifically, it requires either the retention of the network’s free state for comparison with the perturbed state or the utilization of an additional, duplicate network for transmitting supplementary signals. For example, in their experimental work, Dillavou et al. (2021) use two copies of the same network to compute the free and perturbed states. Alternatively, Yi et al. (2023) use a single network, but they make use of additional external memory to store the two states before performing the weight updates. Another idea, proposed by Kendall et al. (2020), is to use a capacitor (sample-and-hold amplifier) at each node to store the free state values, but this idea has not been verified experimentally. In this work, we propose an alternative multimechanism learning approach to overcome this hurdle. Our approach incorporates two physical quantities, each driven by its own respective mechanism: one quantity acting as an activation signal and the other acting as an error signal. This concept is motivated by biological systems implementing functionality via multiple biophysical routes or mechanisms. Such functionality can be chemical, electrical, or even mechanical in nature, with potential interactions between such mechanisms. For instance, in the brain, activity can propagate from one cell to another via electrical and chemical synapses, as opposed to just one mechanism (Pereda, 2014). Given this modular functionality in biology, it would be remiss not to explore such richness in how adaptive physical networks learn. As we shall soon see, however, this modularity need not be framed in terms of mechanical versus chemical versus electrical signals, but rather in terms of distinguishable signals.

We introduce frequency propagation (Freq-prop), a physical learning algorithm falling under the umbrella concept of multimechanism learning. In Freq-prop, the activation and error signals are both sent through a single channel but are encoded in different frequencies of the frequency domain; we can thus obtain the respective responses of the system through frequency (Fourier) decomposition. This algorithm, which we show to perform gradient descent, can be used to train adaptive nonlinear networks such as resistor networks, elastic networks, and flow networks. Freq-prop thus has the potential to be an all-encompassing approach. (See Figure 1 for a graphical summary of Freq-prop). In the next section, we present this idea of frequency propagation in the context of resistor networks, and in section 3, we show that frequency propagation is an example of multimechanism learning and can be generalized to train various physical systems like flow and mechanical networks. In section 4, we demonstrate this idea by training linear resistor networks to classify the Iris dataset (Fisher, 1988).

## 2 Nonlinear Resistive Networks

A resistive network is an electrical circuit of nodes interconnected by resistive devices, which include linear resistors and diodes. Let $N$ be the number of nodes in the network, and denote $v_j$ the electric potential of node $j$. A subset of the nodes are input nodes, where we can set input currents: we denote $x_j$ the input current at input node $j$. For each pair of nodes $j$ and $k$, we denote $\theta_{jk}$ the conductance of the resistor between these nodes (provided that the corresponding branch contains a linear resistor). We further denote $\theta = \{\theta_{jk} : \text{linear branch } (j,k)\}$ the vector of conductances and $x = (x_1, x_2, \ldots, x_N)$ the vector of input currents, where by convention, $x_j = 0$ if node $j$ is not an input node. Finally, we denote $v = (v_1, v_2, \ldots, v_N)$ the configuration of the nodes’ electric potentials and $v(\theta, x)$ the equilibrium value of $v$ as a function of the branch conductances ($\theta$) and the input currents ($x$).

The following result, known since the work of Millar (1951), provides a characterization of the equilibrium state---see also Kendall et al. (2020) for a proof of this result with notations closer to ours.

**Theorem 1.** The equilibrium state $v(\theta, x)$ is a minimizer of the energy function
$$E(\theta, x, v) = E_{\text{input}}(x, v) + \sum_{\text{linear branch } (j,k)} \theta_{jk} (v_j - v_k)^2 + E_{\text{nonlinear}}(v), \tag{2.2}$$
where $E_{\text{input}}(x, v)$ is a function of $x$ and $v$ and $E_{\text{nonlinear}}(v)$ is a function of $v$ only.

$E(\theta, x, v)$ is the energy function of the system, also called the co-content (Millar, 1951), and the equilibrium state $v(\theta, x)$ is a minimizer of the energy function. The energy function contains an energy term $E_{\text{input}}(x, v)$ associated with the boundary input currents $x$. It also contains energy terms of the form $\theta_{jk} (v_j - v_k)^2$ representing the power dissipated in branch $(j,k)$. The term $E_{\text{nonlinear}}(v)$ contains all nonlinearities of the system. In a linear resistance network (i.e., when $E_{\text{nonlinear}}(v) = 0$), it is well known that the equilibrium configuration of node electric potentials minimizes the power dissipation. Theorem 1 generalizes this result to nonlinear networks. Below we explain how the different terms of $E(\theta, x, v)$ are constructed.
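For the linear special case ($E_{\text{nonlinear}} = 0$), finding the equilibrium reduces to solving a linear system. A minimal numerical sketch, with hypothetical conductances and input currents rather than a circuit from the paper:

```python
import numpy as np

# Hypothetical 4-node linear resistor network: theta[j, k] is the
# conductance of branch (j, k), zero where no resistor exists.
theta = np.zeros((4, 4))
for j, k, g in [(0, 1, 1.0), (1, 2, 0.5), (2, 3, 2.0), (0, 3, 1.0)]:
    theta[j, k] = theta[k, j] = g

x = np.array([1.0, 0.0, 0.0, -1.0])  # input currents (must sum to zero)

# Weighted graph Laplacian: equilibrium potentials satisfy L v = x,
# which is equivalent to minimizing the dissipated power.
L = np.diag(theta.sum(axis=1)) - theta

# Potentials are defined up to an additive constant: ground node 0.
v = np.zeros(4)
v[1:] = np.linalg.solve(L[1:, 1:], x[1:])

assert np.allclose(L @ v, x)  # Kirchhoff's current law holds at every node
```

Minimizing the dissipated power and solving Kirchhoff's current law are the same computation here; the nonlinear case replaces the linear solve with a minimization of the full co-content.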

### 2.1 Constructing the Energy Function

For such linear branches, the energy term is
$$E_{jk}(v_j, v_k) = \theta_{jk} (v_j - v_k)^2,$$
which is the power dissipated in branch $(j,k)$.

Putting all the energy terms together and denoting by $E_{\text{input}}(x, v)$ the energy term of the input signals (either $E_{\text{input}}^{\text{current}}(x, v)$ or $E_{\text{input}}^{\text{voltage}}(x, v)$, depending on the case), we get the energy function of equation 2.2.

## 3 Multimechanism Learning via Frequency Propagation

Learning in a resistive network consists in adjusting the branch conductances ($\theta$) so that the network exhibits a desired behavior: an input-output function $x \mapsto v(\theta, x)$. In machine learning, this problem is formalized by introducing a cost function $C$. Given an input-output pair $(x, y)$, the quantity $C(v(\theta, x), y)$ measures the discrepancy between the model prediction $v(\theta, x)$ and the desired output $y$. The learning objective is to find the parameters $\theta$ that minimize the expected cost $\mathbb{E}_{(x,y)}\left[ C(v(\theta, x), y) \right]$ over input-output pairs $(x, y)$ coming from the data distribution for the task that the system must solve.

This augments the energy function of the system by a time-dependent sinusoidal energy term $\gamma \sin(\omega t) \, C(v, y)$. Due to this perturbation, the system’s response $v(t)$ minimizing the energy at time $t$ is
$$v_j(t) = a_j + b_j \sin(\omega t) + O(\gamma^3),$$
where $a_j$ is the coefficient of the activation signal and $b_j$ the coefficient of the error signal at node $j$.

Namely, the learning rule 3.5 approximates one step of gradient descent with respect to the loss, with learning rate $\alpha \gamma$. Note that this learning rule is local: it requires solely locally available information for each parameter $\theta_{jk}$.
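To make the procedure concrete, the following self-contained sketch trains a small linear resistor network with Freq-prop. The circuit, target, and hyperparameters are invented for illustration, and we assume the update takes the form $\Delta\theta_{jk} = -(\alpha/\gamma)(a_j - a_k)(b_j - b_k)$, a product of activation and error coefficients; the exact prefactor convention of equation 3.5 may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 4-node linear network (hypothetical, not the paper's circuit),
# trained so that the output-node potential matches a realizable target
# produced by a randomly drawn "teacher" network.
edges = [(0, 1), (1, 2), (2, 3), (0, 3), (1, 3)]
x = np.array([1.0, 0.0, 0.0, -1.0])       # DC input currents (sum to zero)
out = 2                                    # output node
gamma, omega, alpha = 0.05, 2 * np.pi, 0.05

def laplacian(theta):
    L = np.zeros((4, 4))
    for (j, k), g in theta.items():
        L[j, j] += g; L[k, k] += g
        L[j, k] -= g; L[k, j] -= g
    return L

def solve(L, rhs):
    v = np.zeros(4)                        # ground node 0, solve KCL for rest
    v[1:] = np.linalg.solve(L[1:, 1:], rhs[1:])
    return v

teacher = {e: rng.uniform(0.5, 1.5) for e in edges}
y = solve(laplacian(teacher), x)[out]      # realizable target voltage
theta = {e: rng.uniform(0.5, 1.5) for e in edges}

losses = []
for step in range(100):
    L = laplacian(theta)
    ts = np.linspace(0.0, 1.0, 200, endpoint=False)  # one period of sin
    vs = []
    for t in ts:
        s = gamma * np.sin(omega * t)
        # Nudging energy s*C with C = (v_out - y)^2 / 2 injects a current
        # s*(y - v_out) at the output node: (L + s P) v = x + s y e_out.
        Ln = L.copy(); Ln[out, out] += s
        rhs = x.copy(); rhs[out] += s * y
        vs.append(solve(Ln, rhs))
    vs = np.array(vs)
    a = vs.mean(axis=0)                                    # activation (DC)
    b = 2 * (vs * np.sin(omega * ts)[:, None]).mean(axis=0)  # error (AC)
    for (j, k) in edges:
        # Local rule: product of activation and error coefficients,
        # with conductances kept positive.
        theta[(j, k)] = max(
            theta[(j, k)] - (alpha / gamma) * (a[j] - a[k]) * (b[j] - b[k]),
            1e-5)
    losses.append((solve(laplacian(theta), x)[out] - y) ** 2)

assert losses[-1] < losses[0]              # squared error decreases
```

Because the network is linear, each instantaneous equilibrium can be solved exactly at every $t$; in a physical circuit this role is played by fast relaxation relative to the nudging period.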

For simplicity, we have omitted the time of relaxation to equilibrium in our analysis. However, a practical circuit has an effective capacitance $C_{\text{eff}}$ and therefore will equilibrate in a time $\tau_{\text{relax}} \sim R_{\text{eff}} C_{\text{eff}}$, where $R_{\text{eff}}$ is the effective resistance of the circuit. Our learning algorithm will work as long as the circuit equilibrates much faster than the timescale of oscillation ($\tau_{\text{relax}} \ll 1/\omega$). Our analysis thus requires that $C_{\text{eff}}$ be small enough for the assumption $\tau_{\text{relax}} \ll 1/\omega$ to hold.

If this is not the case, there will be a trade-off between how fast one can train the network with Freq-prop and how accurately it approximates gradient descent. We leave the study of the regime where $C_{\text{eff}}$ is nonnegligible for future work.

We note however that the effective capacitance of the circuit is expected to grow linearly with the size of the network (the total amount of wire), so that inference time grows linearly with the size of the network too. We also note that the same is true for deep neural networks: in a feedforward network, both inference (the forward pass) and training (backpropagation) grow linearly with the size of the network.

While our nudging method, equation 3.3, is inspired by that of equilibrium propagation (Scellier & Bengio, 2017; Kendall et al., 2020), it is also possible to apply the nudging variant of coupled learning (Stern et al., 2021), which might be easier to implement in practice (Dillavou et al., 2021). To do this, we denote by $v_O^F$ the “free” equilibrium value of the output nodes of the network (where the prediction is read), without nudging. Then, at time $t$, we clamp the output nodes to $v_O^C(t) = v_O^F + \gamma \sin(\omega t) (y - v_O^F)$. This nudging method can be achieved via AC voltage sources at the output nodes. We note, however, that theorem 2 does not hold with this alternative nudging method.

Measuring $b_j$ for every node $j$ as per equation 3.4 requires that we use the same reference time $t = 0$ for all nodes; that is, it requires global synchronization of the measurements across all nodes.

However, in practice, there may be a time delay $t_j$ between nudging and measurement, leading to a measured response $v_j(t) = a_j + b_j \sin(\omega (t + t_j)) + O(\gamma^3)$ at node $j$. Without any information about $t_j$, we can only obtain the absolute value of the coefficient $b_j$, not its sign.

We propose a solution to this issue in section 5.
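Both the coefficient extraction and the sign ambiguity can be illustrated numerically. A short sketch with toy coefficients, assuming the standard Fourier-coefficient convention for equation 3.4 ($a_j$ is the time average, $b_j$ is twice the average against $\sin(\omega t)$):

```python
import numpy as np

omega = 2 * np.pi
ts = np.linspace(0.0, 1.0, 1000, endpoint=False)  # one full period

def extract(v, ts, omega):
    """Recover (a, b) from v(t) = a + b*sin(omega t): a is the time
    average, b is twice the average of v(t)*sin(omega t)."""
    a = v.mean()
    b = 2 * (v * np.sin(omega * ts)).mean()
    return a, b

a_true, b_true = 1.3, -0.4                # arbitrary example coefficients
v = a_true + b_true * np.sin(omega * ts)
a, b = extract(v, ts, omega)
assert np.allclose([a, b], [a_true, b_true])

# An unknown delay t_j of half a period flips the measured sign of b,
# as discussed above: sin(omega(t + pi/omega)) = -sin(omega t).
v_delayed = a_true + b_true * np.sin(omega * (ts + 0.5))
_, b_meas = extract(v_delayed, ts, omega)
assert np.allclose(b_meas, -b_true)
```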

## 4 Demonstration of Freq-Prop in Linear Networks

We trained a network, comprising 40 nodes and 239 edges, to classify the Iris data set (see Figure 2). Each edge in the network is assigned a conductance value drawn from a uniform distribution over $[10^{-5}, 1]$. Throughout the training process, the weights are constrained to the range $[10^{-5}, 20]$. We set the nudge amplitude, $\eta$, to 1 and the learning rate, $\alpha$, to 0.01. Our training methodology predominantly aligns with the procedures outlined in our previous work (Anisetti, Scellier et al., 2023), albeit with a few modifications:

Input to the network is applied as a DC voltage at input nodes. The response to this input is calculated at the output nodes, and error is determined.

For each discrete time step $t$, with a time interval $\Delta t = 0.05$ between two consecutive steps, the error signal is modulated by multiplying it by $\sin(\omega t)$ (here, $\omega = 2\pi$, period 1). This AC voltage is applied over a duration of five units of time, encompassing five complete oscillation periods. The error signal is applied along with the DC signal.

The response of each node is recorded over time, resulting in a time series of voltage values for each node in the network. This time series contains both the steady-state (DC) component and the oscillating (AC) component of the response. We then decompose the response at each node into DC and AC components using the Fourier-series method (see equation 3.4).

The DC component represents the average or steady-state response of the node. It is calculated by taking the mean of the time series of the response at each node.

The AC component is extracted by multiplying the time series of the response by $\sin(\omega t)$ and then averaging the result.

Using these AC and DC components, the weights are updated (see equation 3.5).

We have detailed the entire training procedure in appendix A.

## 5 Choice of the Nudging Signal

We have seen in section 3 that when a sinusoidal nudging signal $\gamma \sin(\omega t) \, C(v, y)$ is used, the measured response at node $j$ will be of the form $v_j(t) = a_j + b_j \sin(\omega (t + t_j)) + O(\gamma^3)$, where $t_j$ is the time delay between nudging and measurement. Unfortunately, it is not possible to recover the sign of $b_j$ without any knowledge of $t_j$. This problem can be overcome by using a different nudging signal.

In general, if we nudge the system by an energy term $\gamma f(t) \, C(v, y)$, where $f(t)$ is an arbitrary function such that $\sup_t |f(t)| < \infty$, then the system’s response at node $j$ will be of the form $v_j(t) = a_j + b_j f(t + t_j) + O(\gamma^2)$.

Our goal is to choose an $f$ so that we can obtain, for every node $j$, the values of $a_j$ and $b_j$ by measuring only $v_j(t)$, without knowing $t_j$.

Clearly this is not possible for all functions $f$. For example, if $f(\cdot)$ is a constant, then $v_j(\cdot)$ is also a constant, and we cannot recover the values of $a_j$ and $b_j$ from $v_j(\cdot)$ alone. Another example for which this is not possible is $f(t) = \sin(\omega t)$. This is because a time delay $t_j = \pi/\omega$ will change the sign of the signal, $\sin(\omega (t + t_j)) = -\sin(\omega t)$; therefore, the sign of $b_j$ cannot be recovered without any knowledge of $t_j$.

More generally, we show in appendix B that, in principle, it is possible to recover the coefficients $a_j$ and $b_j$ if and only if the function $f$ has the property that there is no $\tau$ such that $f(t) = \sup f + \inf f - f(t + \tau)$ for every $t$. In other words, no amount of time delay converts the signal’s upright form to its inverted form or vice versa.
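This criterion can be checked numerically for a sampled signal. A sketch with two example signals (the grid resolution and tolerance are arbitrary choices): a pure sinusoid fails the criterion, while adding a second harmonic breaks the inverting symmetry.

```python
import numpy as np

ts = np.linspace(0, 2 * np.pi, 400, endpoint=False)

def admits_inverting_delay(f, taus, tol=1e-6):
    """Check whether some delay tau maps the signal to its inverted form,
    i.e. f(t) = sup f + inf f - f(t + tau) for all t (the bad case)."""
    vals = f(ts)
    target = vals.max() + vals.min() - vals
    return any(np.abs(f(ts + tau) - target).max() < tol for tau in taus)

taus = np.linspace(0, 2 * np.pi, 400, endpoint=False)

# Pure sinusoid: tau = pi inverts it, so the sign of b is not recoverable.
assert admits_inverting_delay(np.sin, taus)

# With a second harmonic, no delay inverts the signal, so (a, b) are
# recoverable in principle.
f = lambda t: np.sin(t) + 0.5 * np.sin(2 * t)
assert not admits_inverting_delay(f, taus)
```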

## 6 General Applicability of Frequency Propagation

Freq-prop applies to arbitrary physical networks: not only resistive networks, but also flow networks, capacitive networks, and inductive networks, among others. In these networks, the notion of current-voltage characteristics is replaced by current-pressure characteristics, charge-voltage characteristics, and current-flux characteristics, respectively. The mathematical framework for nonlinear elements (see section 2) also applies to these networks, where the energy functions minimized at equilibrium are the co-content, the capacitive co-energy, and the inductive energy, respectively (Millar, 1951; Cherry, 1951).

To emphasize the generality of Freq-prop, we present it here in the context of central force spring networks (or elastic networks; Stern et al., 2021), as well as Hopfield networks (the Ising model).

### 6.1 Central Force Spring Networks

### 6.2 Continuous Hopfield Networks

Freq-prop also applies to Hopfield networks (the Ising model; Hopfield, 1984; Baldi & Pineda, 1991). In a Hopfield network of multiple units interconnected by synapses, the energy term between unit $i$ and unit $j$ is $E_{ij} = w_{ij} h_i h_j$, where $w_{ij}$ is the synaptic weight and $h_i$ is the state of unit $i$.

## 7 Related Work

Frequency propagation builds on learning via chemical signaling (Anisetti, Scellier et al., 2023), another example of multimechanism learning (MmL) in physical networks. Whereas MmL via frequency propagation uses two different frequencies to play the role of the activation and error signals during training, MmL via chemical signaling uses two different chemical concentrations for these signals. Anisetti, Scellier et al. (2023) present learning via chemical signaling in the setting of linear flow networks, which we extend here to the nonlinear setting (see appendix C).

The idea of using an oscillating nudging signal was also proposed by Baldi and Pineda (1991) and, more recently (concurrent with our work), in holomorphic EP (Laborieux & Zenke, 2022). Our work differs from these two other works in several ways. First, our learning rule can be decomposed as activation signal times error signal ($a \times b$), whereas the learning rule of Baldi and Pineda (1991) takes the form $\Delta \theta_{jk} = \frac{\alpha}{2} \int \sin(\omega t) \, v_{jk}(t)^2 \, dt$, and similarly for holomorphic EP. Second, our learning rule is proved to approximate the gradient of the cost function, up to $O(\beta^3)$, unlike in Baldi and Pineda (1991). Laborieux and Zenke (2022) exploit the Cauchy formula of complex calculus to prove that their algorithm computes the exact gradient of the cost function, independent of the strength of the nudging signal. To achieve this, the authors allow the nudging factor to take complex values, $\beta = \gamma e^{i \omega t} \in \mathbb{C}$, and the domain of definition of the energy function $v \mapsto E(\theta, x, v)$ is extended to complex configurations $v \in \mathbb{C}^N$. However, it is not yet straightforward to see how this mathematical formalism can be directly mapped to physical systems such as resistive networks or spring networks, the motivation of our work.

Another recent work (McCaughan et al., 2023) presents a technique for gradient computation in analog deep neural networks. The authors modulate the parameters, $\theta_i$, using periodic functions with distinct frequencies, $\omega_i$. This results in the cost function oscillating as a superposition of these frequencies. They demonstrate that the oscillation amplitude of the cost function carries gradient information: the gradient with respect to an individual weight, $\theta_i$, can be discerned by isolating the Fourier component of the cost function oscillating at frequency $\omega_i$. While this approach shares the concept of signal decomposition with our frequency propagation method, there are key differences. In our approach, we modulate the output error using a periodic function, leading to an AC feedback signal and a DC feedforward signal. We then extract the individual signals through Fourier decomposition.
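A toy sketch of this parameter-modulation scheme (with an invented quadratic cost, not the authors' hardware): each parameter oscillates at its own frequency, and lock-in detection of the cost at that frequency recovers the corresponding partial derivative.

```python
import numpy as np

# Hypothetical scalar cost with a known gradient, standing in for the
# loss of an analog network.
def loss(theta):
    return (theta[0] - 1.0) ** 2 + 3.0 * theta[1] ** 2

theta = np.array([0.5, -0.2])
eps = 1e-3                              # small dither amplitude
freqs = np.array([3.0, 5.0])            # one distinct frequency per parameter
ts = np.linspace(0, 2 * np.pi, 2000, endpoint=False)

# Record the scalar cost while all parameters oscillate simultaneously.
signal = np.array([loss(theta + eps * np.sin(freqs * t)) for t in ts])

# Lock-in detection: the sin(w_i t) Fourier component of the cost,
# divided by eps, estimates dL/dtheta_i.
grad_est = np.array([2 * (signal * np.sin(w * ts)).mean() / eps
                     for w in freqs])

grad_true = np.array([2 * (theta[0] - 1.0), 6.0 * theta[1]])
assert np.allclose(grad_est, grad_true, atol=1e-2)
```

Because the dither frequencies (and their sums and differences) do not collide, each parameter's gradient component can be read out independently from a single scalar measurement stream.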

Another recent work proposes an alternative approach to train physical systems by gradient descent, agnostic equilibrium propagation (Scellier et al., 2022). However, this method imposes constraints on the nature of the parameters ($\theta $), which must minimize the system’s energy ($E$), just like the state variables ($v$) do. This assumption does not allow us to view the conductances of resistors (resp. the spring constants) as trainable parameters in a resistive network (resp. in a spring network). The method also requires control knobs with the ability to perform homeostatic control over the parameters. Our work can also be seen as a physical implementation of implicit differentiation in physical networks. We refer to Zucchet and Sacramento (2022) for a description of implicit differentiation where the authors use a mathematical formalism close to ours.

Finally, other physical learning algorithms that make explicit use of time are being developed. For instance, recent work proposes a way to train physical systems with time-reversible Hamiltonians (Lopez-Pastor & Marquardt, 2021). In this method, called *Hamiltonian echo backpropagation* (HEB), the error signal is a time-reversed version (an echo) of the activation signal, with the cost function acting as a perturbation on this signal.

However, HEB requires a feasible way to time-reverse the activation signal.

## 8 Discussion

We have introduced frequency propagation (Freq-prop), a physical learning algorithm that falls in the category of multimechanism learning (MmL). In MmL, separate and distinguishable activation and error signals contribute to a local learning rule, such that trainable parameters (e.g., conductances of variable resistors) perform gradient descent on a loss function.

In Freq-prop, the activation and error signals are implemented using different frequencies of a single physical quantity (e.g., voltages or currents) and are thus distinguishable. We note however that the distinguishability of the signals does not mean that they are mathematically independent. In Freq-prop, the error signal depends on the activation signal via the Hessian of the network.

Other potential MmL algorithms may involve independent physical mechanisms, such as an electrical activation signal and a chemical error signal, or vice versa. Multimechanism learning algorithms, such as Freq-prop, may have implications for designing fast and low-power, or high-efficiency, hardware for AI, as they are rooted in physical principles. For the time being, inroads are being made by using backpropagation to train controllable physical systems in a hybrid in silico–in situ approach (Wright et al., 2022). As we work toward a fully in situ approach, Freq-prop is a natural candidate. And while the in situ realization of a nonlinear resistor network is an obvious starting point, there are potential limitations, particularly in terms of timescales. Consider the time of relaxation to equilibrium ($\tau_{\text{relax}}$), the timescale of the sinusoidal nudging signal ($T = 2\pi/\omega$), and the timescale of learning ($\tau_{\text{learning}}$). Our learning methodology requires that $\tau_{\text{relax}} \ll T < \tau_{\text{learning}}$. More specifically:

Once input is applied, the network reaches equilibrium in time $\tau_{\text{relax}}$.

Based on the network’s output, a sinusoidal nudging signal of frequency $\omega$ is applied at the output nodes. The timescale of evolution of this sinusoidal nudging wave is $T = 2\pi/\omega$. Assuming that $\tau_{\text{relax}} \ll T$, the network is at equilibrium at every instant $t$.

We observe the network’s response $v(t)$ for a time $\tau_{\text{obs}} > T$ to extract the coefficients $a$ and $b$ of equation 3.4. Updating the conductances of the resistors, using the values of $a$ and $b$ to determine the magnitude and sign of these updates, takes a time $\tau_{\text{learning}} \sim \tau_{\text{obs}}$.

In our discussion of physical learning mechanisms, it is important to clarify the nature of the physical processes we envision for self-updating weights. In seeking an ideal physical mechanism for neuromorphic systems, we prioritize simplicity and intrinsic capability in each network component; the goal is to avoid complex, artificially induced modifications for self-updating capabilities. Consider, for instance, the contrast between a modified sliding rheostat, where weight adjustment is motor-controlled, and a memristor: the latter represents a more straightforward approach to achieving self-updating resistance and exemplifies the kind of simplicity we are advocating. This simplicity is not just a matter of ease of implementation; it is also crucial for scalability and robustness in neuromorphic computing. Simplified components are more conducive to building larger and more efficient neural networks, and they tend to be less prone to errors than more complex devices like motor-controlled rheostats. Our research, as highlighted in Anisetti, Kandala et al. (2023), is an ongoing effort to explore and develop such inherent physical mechanisms.

There are many examples of biological systems implementing functionality via multiple biophysical routes, since brains developed rather recently if one looks at the evolutionary tree dating back to the origins of life. For instance, cancer cells learn to move through mazes more efficiently using self-generated chemotaxis (Tweedy et al., 2020). Slime mold, a single-cell organism, stores memory and can learn through a combination of chemical and mechanical signals (Boussard et al., 2021). Even at the subcellular scale, chromatin can be viewed as a mechanical computer, similar to the analytical machine.

Finally, could something like Freq-prop occur in the brain? Earlier work analyzing local field potentials recorded simultaneously from different regions in the cortex suggested that feedforward signaling is carried by gamma-band (30–80 Hz) activity, whereas feedback signaling is mediated by alpha-(5–15 Hz) or beta- (14–18 Hz) band activity, though local field potentials are not actively relayed between regions (Bastos et al., 2015). More recent work in the visual cortex argues that feedforward and feedback signaling rely on separate “channels” since correlations in neuronal population activity patterns, which are actively relayed between regions, are distinct during feedforward- and feedback-dominated periods (Semedo et al., 2022). Freq-prop is also related in spirit to the idea of frequency multiplexing in biological neural networks (Naud & Sprekeler, 2018; Payeur et al., 2021; Akam & Kullmann, 2014), which uses the simultaneous encoding of two or more signals.

While Freq-prop here uses only two separate signals—an activation signal and an error signal—one can envision multiple activation and error signals being encoded to accommodate vector inputs and outputs and to accommodate multiple, competing cost functions. With multiple activation and error signals, one can also envision coupling learning via chemical signaling (see appendix C) with Freq-prop, for example, to begin to capture the full computational creativity of the brain.

## Appendix A: Details of the Training Process

We train the network on a standard machine learning task: classifying Iris flowers. The Iris data set (Fisher, 1988) contains 150 examples of Iris flowers belonging to three species (Setosa, Virginica, and Versicolor), and therefore 50 examples of each category. Each example is of the form $X = (X_1, X_2, X_3, X_4)$, composed of four features of the flower (petal width, petal length, sepal width, and sepal length, all measured in centimeters), and comes with its assigned Iris category, denoted $Y$. So an example would look something like $X = (5.1, 3.5, 1.4, 0.2)$ and $Y = \text{Setosa}$. Given the four features of the flower as input, the trained network should be able to tell which species it belongs to.

**Network generation:** A network consisting of $N$ nodes, with a varying number of edges $M$, is created. For this, we first create a Barabási-Albert network with connection parameter 1. This graph-generation algorithm connects each new node to one existing node, chosen such that nodes with higher degree are more likely to be selected. This creates a network with $N$ nodes and $N - 1$ edges. To create a network with $M$ edges, we then add $M - (N - 1)$ unique edges. Four pairs of input nodes are chosen from this flow network, where the input data (the normalized features of the Iris) are given as external currents across these four pairs of boundary nodes. Three pairs of nodes are chosen as the output nodes. Once the network is trained, for a given input, the set of potential drops across these output node pairs should tell which category of Iris the input data correspond to.
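A sketch of this generation procedure in pure Python; the preferential-attachment implementation is our own minimal version, assuming connection parameter 1 (one attachment per new node):

```python
import random

def barabasi_albert_edges(n_nodes, rng):
    """Preferential attachment with connection parameter 1: each new node
    attaches to one existing node chosen proportionally to its degree."""
    edges = [(0, 1)]
    targets = [0, 1]                 # each node repeated once per unit degree
    for new in range(2, n_nodes):
        old = rng.choice(targets)
        edges.append((old, new))     # old < new, so edges are canonical
        targets += [old, new]
    return edges

def make_network(n_nodes, n_edges, seed=0):
    """Barabasi-Albert tree (n_nodes - 1 edges) plus unique random extra
    edges until the network has n_edges edges."""
    rng = random.Random(seed)
    edges = set(barabasi_albert_edges(n_nodes, rng))
    while len(edges) < n_edges:
        j, k = rng.sample(range(n_nodes), 2)
        edges.add((min(j, k), max(j, k)))
    return sorted(edges)

edges = make_network(40, 239)        # the 40-node, 239-edge network above
assert len(edges) == 239
assert {u for e in edges for u in e} == set(range(40))
```

Because the Barabási-Albert stage produces a spanning tree, the final network is guaranteed to be connected.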

The network architecture remains fixed throughout the training-testing process; only the conductances of the edges (the weights) are modified. As for how the flow network interfaces with the Iris data set:

The data set is divided into two subsets: one training set (used for training) and one test set (used for testing). Each of these sets has 75 examples of Irises, 25 from each category.

The data set is normalized. That is, for each example $X$ in the data set and for each feature $X_i$ of $X$, we set $X_i^{\text{norm}} = \lambda \cdot \frac{X_i - X_i^{\min}}{X_i^{\max} - X_i^{\min}}$, where $X_i^{\max}$ and $X_i^{\min}$ are the maximum and minimum values of that feature $X_i$ in the training set. We choose $\lambda = 5$ for all simulations.
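This normalization can be sketched as follows (with a toy stand-in for the Iris feature rows):

```python
import numpy as np

lam = 5.0                                   # lambda = 5 in all simulations
X_train = np.array([[5.1, 3.5, 1.4, 0.2],
                    [6.2, 2.9, 4.3, 1.3],
                    [7.3, 3.0, 6.3, 1.8]])  # toy stand-in for Iris rows

# Per-feature min and max, computed on the training set only.
x_min = X_train.min(axis=0)
x_max = X_train.max(axis=0)

def normalize(X):
    return lam * (X - x_min) / (x_max - x_min)

X_norm = normalize(X_train)
assert np.allclose(X_norm.min(axis=0), 0.0)   # each feature spans [0, lam]
assert np.allclose(X_norm.max(axis=0), lam)
```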

While choosing the desired outputs, we must keep in mind that this linear system may not admit a set of weights producing an arbitrary desired output. In other words, we must choose desired outputs that are physically attainable. We therefore implement the same technique as described by Dillavou et al. (2021) to choose the desired output voltages for each of the three Iris categories. For each category, the desired voltage is derived from the averaged, normalized input data. That is, each Iris category has 25 training examples of four input features, and each input feature is averaged over these 25 examples, yielding a four-tuple of averaged input features for each Iris category. When this four-tuple is given as input to the initial network, the resulting output voltage corresponds to the average behavior of the input; this is the desired voltage.

To conduct the training process, first the input data are given to the network as a DC voltage and the output is observed. If the output is not equal to the desired output for that Iris category, feedback AC currents are applied at the output nodes. The weights of the network are modified using the learning rule mentioned above. This process is repeated consecutively for all examples.

Once the entire data set is exhausted, we say that one epoch has passed. We train the network for multiple epochs. At the beginning of each epoch, because the network has changed significantly, new desired voltages are calculated. Therefore, each epoch has its own set of desired voltages. To conduct the testing process, after the network is trained for multiple epochs, we record how well it classifies unseen data from the test set. To be specific, the test set has 25 examples per Iris category. The Iris features from these test examples are given as input, and the output voltage is compared with the desired voltage. The desired voltage is calculated using the testing data set. An example is then classified into that Iris category, for which the output voltage is closest to the desired output.

## Appendix B: Further Details on the Nudging Signal

Let $f(t)$ denote the nudging signal. Assuming that $f$ is bounded, recall that for every $j$, the measured response $v_j(t)$ at node $j$ is of the form $v_j(t) = a_j + b_j f(t + t_j) + O(\gamma^2)$, where $a_j$ and $b_j$ are the numbers that we wish to recover (up to $O(\gamma^2)$) to implement the parameter update, and $t_j$ is an unknown time delay. Our goal is to obtain, for every node $j$, the values of $a_j$ and $b_j$ by measuring only $v_j(t)$, without any knowledge of $t_j$.
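As a numerical illustration of this recovery problem, assume $f$ is a pure sinusoid of known angular frequency $\omega$ (an assumption made here for concreteness). Then $a_j$ is the time average of $v_j$ over a period and $|b_j|$ is twice the magnitude of the Fourier coefficient of $v_j$ at $\omega$, both obtainable without knowing $t_j$; for a sinusoid, however, the sign of $b_j$ is not recoverable, which is the ambiguity addressed by the condition below.

```python
import numpy as np

# Recover a and |b| from v(t) = a + b*f(t + t_j) with f(t) = sin(w*t)
# and unknown delay t_j: a is the mean over a period, |b| is twice the
# magnitude of the Fourier coefficient at frequency w.
w = 3.0
a_true, b_true, t_j = 0.7, -0.4, 1.234
t = np.linspace(0.0, 2 * np.pi / w, 4096, endpoint=False)
v = a_true + b_true * np.sin(w * (t + t_j))

a_est = v.mean()                                   # time average
b_mag = 2.0 * np.abs(np.mean(v * np.exp(-1j * w * t)))  # Fourier amplitude
```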

We now establish a necessary and sufficient condition on the nudging signal $f(t)$ so that one can, at least in principle, uniquely obtain the values of $aj$ and $bj$ for every node $j$. We are concerned with quantities that depend only on a single node and hence will drop the node index with the understanding that all of the analysis applies to any arbitrary node.

Let $F$ denote the set of all real-valued, bounded functions, and let $f$ be an element of $F$. Let $C_f : \mathbb{R}^3 \to F$ be the function that maps the parameters $(a, b, t_0)$ to the function $v(\cdot) = a + b f(\cdot + t_0)$. We define the following equivalence relation on $F$: two functions $g, h \in F$ are equivalent if they differ by a time translation, that is, $g \sim h$ if and only if there exists a $t_0 \in \mathbb{R}$ such that $g(t) = h(t + t_0)$ for all $t \in \mathbb{R}$. Let $\tilde{F} = F/\!\sim$ be the quotient of $F$ under this equivalence relation, and let $[g]$ be the equivalence class that contains the function $g$. The map $C_f$ can be lifted to yield $\tilde{C}_f : \mathbb{R}^2 \to \tilde{F}$ such that $\tilde{C}_f(a, b) = [a + b f]$. In order to be able to uniquely extract $a$ and $b$ from any equivalence class of the form $[a + b f]$, the function $\tilde{C}_f$ has to be injective. This can be reexpressed as a direct condition on the nudging signal $f$.

The following statements are equivalent:

**P1:** The function $\tilde{C}_f : \mathbb{R}^2 \to \tilde{F}$ defined by $\tilde{C}_f(a, b) = [a + b f]$ is injective.

**P2:** There exists no $\tau \in \mathbb{R}$ such that for all $t \in \mathbb{R}$,
$$f(t) = \sup f + \inf f - f(t + \tau).$$

We establish this by proving that the negations of the two statements are equivalent, that is, the following statements are equivalent:

**N1:** There exist two distinct pairs of real numbers $(a_1, b_1)$ and $(a_2, b_2)$ such that
$$[a_1 + b_1 f] = [a_2 + b_2 f].$$

**N2:** There exists a $\tau \in \mathbb{R}$ such that for all $t \in \mathbb{R}$,
$$f(t) = \sup f + \inf f - f(t + \tau).$$

Suppose that N2 is true: there is a $\tau \in \mathbb{R}$ such that for all $t \in \mathbb{R}$, $f(t) = \sup f + \inf f - f(t + \tau)$. This means that $f$ and $\sup f + \inf f - f$ are related by a time translation, that is, $[f] = [\sup f + \inf f - f]$. Therefore, N1 is true, with $(a_1, b_1) = (0, 1)$ and $(a_2, b_2) = (\sup f + \inf f, -1)$.
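A pure sinusoid makes this failure mode concrete: for $f(t) = \sin t$ we have $\sup f + \inf f = 0$ and $\sin t = -\sin(t + \pi)$, so N2 holds with $\tau = \pi$, and the sign of $b$ cannot be read off the equivalence class $[a + b f]$. The following check (illustrative only) confirms that the two parameter pairs from the proof produce time-translates of each other.

```python
import numpy as np

# For f(t) = sin(t): sup f + inf f = 0 and sin(t) = -sin(t + pi), so
# condition N2 holds with tau = pi. Hence (a1, b1) = (0, 1) and
# (a2, b2) = (sup f + inf f, -1) = (0, -1) yield the same function up
# to a time translation, and b is not uniquely recoverable.
t = np.linspace(0.0, 2.0 * np.pi, 1000, endpoint=False)
f = np.sin

v1 = 0.0 + 1.0 * f(t)             # (a1, b1) = (0, 1)
v2 = 0.0 - 1.0 * f(t + np.pi)     # (a2, b2) = (0, -1), shifted by tau = pi

assert np.allclose(v1, v2)        # same element of the quotient space
```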

## Appendix C: Multimechanism Learning via Chemical Signaling

In this appendix, we generalize the learning algorithm via chemical signaling (Anisetti, Scellier et al., 2023) to nonlinear networks. Learning via chemical signaling is another example of multimechanism learning in physical networks. It uses pressures and chemical concentrations to implement a local learning rule. This way of using multiple independent “mechanisms” is the central idea behind multimechanism learning.

Consider a flow network: a set of nodes interconnected by tubes. A flow network is formally equivalent to the resistive network of section 2, with $v$ being the configuration of node pressures and $\theta_{jk}$ being the conductance of the branch between nodes $j$ and $k$.
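This equivalence can be made concrete: in the linear case, node pressures solve the same graph-Laplacian system as node voltages in a resistive network. The small network below (edges and conductances are illustrative, not from the paper) verifies Kirchhoff-type balance at every node.

```python
import numpy as np

# Node pressures in a linear flow network solve a graph-Laplacian
# system, exactly as node voltages do in a resistive network.
edges = [(0, 1), (1, 2), (0, 2)]
theta = np.array([1.0, 2.0, 0.5])          # tube conductances theta_jk

L = np.zeros((3, 3))                       # weighted graph Laplacian
for (j, k), g in zip(edges, theta):
    L[j, j] += g; L[k, k] += g
    L[j, k] -= g; L[k, j] -= g

s = np.array([1.0, 0.0, -1.0])             # injected flows (sum to zero)
v = np.linalg.lstsq(L, s, rcond=None)[0]   # pressures, up to a constant

# Balance: net flow out of each node equals the injection at that node.
assert np.allclose(L @ v, s)
```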

Indeed, diffusion along a tube follows the same equation as flow along that tube, up to a constant factor (with node pressures and flow conductivity replaced by chemical concentration and diffusion constant, respectively). When there is no ambiguity from the context, we write $v = v(\theta, x)$ for simplicity. We note that although $v$ is not affected by the chemical, $u$ depends on $v$; in particular, $u$ also depends on $\theta$ and $x$ through $v$.

Namely, the learning rule of equation C.3 performs one step of gradient descent with respect to the loss, with step size $\alpha \beta $. We note that learning via chemical signaling comes in two variants, either with $\beta >0$ and $\alpha >0$ or with $\beta <0$ and $\alpha <0$. The procedure performs one step of gradient descent as long as the product $\alpha \beta $ is positive.
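The sign observation can be checked on a toy problem. This is not the flow-network dynamics, only a minimal sketch: a gradient step of size $\alpha\beta$ decreases a toy quadratic loss whenever the product $\alpha\beta$ is positive, whichever signs $\alpha$ and $\beta$ take individually.

```python
# Toy quadratic loss and its gradient (illustrative, not eq. C.3).
def loss(theta):
    return (theta - 3.0) ** 2

def grad(theta):
    return 2.0 * (theta - 3.0)

def step(theta, alpha, beta):
    # Only the product alpha*beta acts as the effective step size,
    # so (alpha > 0, beta > 0) and (alpha < 0, beta < 0) both descend.
    return theta - alpha * beta * grad(theta)

theta0 = 0.0
for alpha, beta in [(0.1, 0.5), (-0.1, -0.5)]:  # both variants
    assert loss(step(theta0, alpha, beta)) < loss(theta0)
```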

## Acknowledgments

We thank Sam Dillavou, Nachi Stern, Andrea Liu, Doug Durian, and Jack Kendall for discussions. J.M.S. acknowledges financial support from NSF-DMR-1832002 and NSF-DMR-2204312.

^{1}

To avoid any confusion, we stress that $\theta_{jk}$ is a scalar, whereas $\hat{i}_{jk}(\cdot)$ is a real-valued function. Thus, $\theta_{jk}(v_j - v_k)$ denotes the product of $\theta_{jk}$ and $v_j - v_k$, whereas $\hat{i}_{jk}(v_j - v_k)$ denotes the function $\hat{i}_{jk}$ applied to the voltage $v_j - v_k$.

^{2}

In practical situations, such as squared-error prediction, the cost function $C$ depends only on the state of the output nodes; therefore, nudging requires injecting currents at the output nodes only.