We construct a thermodynamic potential that can guide training of a generative model defined on a set of binary degrees of freedom. We argue that upon reduction in description, so as to make the generative model computationally manageable, the potential develops multiple minima. This is mirrored by the emergence of multiple minima in the free energy proper of the generative model itself. The variety of training samples that employ N binary degrees of freedom is ordinarily much lower than the size 2^N of the full phase space. The nonrepresented configurations, we argue, should be thought of as comprising a high-temperature phase separated by an extensive energy gap from the configurations composing the training set. Thus, training amounts to sampling a free energy surface in the form of a library of distinct bound states, each of which breaks ergodicity. The ergodicity breaking prevents escape into the near continuum of states comprising the high-temperature phase; thus, it is necessary for proper functionality. It may, however, have the side effect of limiting access to patterns that were underrepresented in the training set. At the same time, the ergodicity breaking within the library complicates both learning and retrieval. As a remedy, one may concurrently employ multiple generative models—up to one model per free energy minimum.

Training sets and empirical data alike are often processed using representations that do not have an obvious physical meaning or are not optimized for the specific application computation-wise. Of particular interest are binary representations of information as would be pertinent to digital computation. Not only do the values of the binary variables depend on the detailed digitization recipe, but the number of training samples will usually be vastly smaller than the size 2^N of the full phase space available, in principle, to N binary variables. Given this, one is justified in asking whether a reduced description exists that uses a relatively small number of variables and parameters to efficiently document the empirically relevant configurations. At the same time, it is desirable for the reduced description to be robust with respect to the choice of a discretization procedure that is used to present the original data set.

The problem of finding reduced descriptions is relevant for all fields of knowledge, of course, and is quite difficult in general. For example, the state of an equilibrated collection of particles is unambiguously specified by the expectation value of local density in a broad range of temperatures and pressures (Evans, 1979). Particles exchange places on timescales comparable to or shorter than typical vibrational times, implying it is unnecessary to keep track of the myriad coordinates of individual particles. The equilibrium density profile is a unique, slowly varying function of just three spatial coordinates. Yet under certain conditions, the translational symmetry becomes broken: One may no longer speak of an equilibrium density profile that is unique or smooth. Instead, one must keep track of a large collection of distinct, rapidly varying density profiles, each of which corresponds to a metastable solid; these profiles can be regarded as equilibrated with respect to particles’ vibrations but not translations. For instance, in a glassy melt (Lubchenko, 2015; Lubchenko & Wolynes, 2007), the number of alternative, metastable structures scales exponentially with the system size while the free energy surface becomes a vastly degenerate landscape that breaks ergodicity.

Here we address the problem of finding reduced descriptions in the context of machine learning. The complete description in the present setup is realized, by construction, through a generative model that is a universal approximator to an arbitrary digital data set. That is, one can always choose such values for the model’s parameters that the model will eventually have generated any given ensemble of 2^N distinct binary sequences of length N. The generative model is in the form of an Ising spin-based energy function, each spin representing a binary number. Ising spin-based generative models have been employed for decades (Hopfield, 1982; Laydevant et al., 2024; Mohseni et al., 2022), of course. The present energy function has the functional form of the higher-order Boltzmann machine (Sejnowski, 1986) and generally contains every possible combination of the spins. The learning rules are, however, different in that the coupling constants are deterministically expressed through the log weights of individual sequences in the ensemble we want to reproduce. The retrieval is performed by Gibbs-sampling the Boltzmann distribution of the resulting energy function at a nonvanishing temperature.

This study begins by constructing effective thermodynamic potentials whose arguments are the parameters of the complete generative model. Each of these potentials is uniquely minimized by the optimal values of the coupling constants—in the complete description—and can be generalized so as to reflect correlations among distinct subensembles. We note that effective thermodynamic potentials for a variety of generative models have been considered in the past. These are exemplified by the Helmholtz machine (Dayan et al., 1995) or the loss function for the restricted Boltzmann machine (Montúfar, 2018), among others.

Specific inquiries during retrieval in the present generative model are made by imposing a constraint of the user’s choice. Of particular interest are constraints in the form of an additive contribution to the energy function that can stabilize a particular combination of the spins. This is analogous to how, in a particle system, one can use an external—or “source”—field to stabilize a desired density profile (Lubchenko, 2015; Evans, 1979). If the system is ergodic, one may then use a Legendre transform to obtain a free energy as a function of the density profile. The latter procedure is a way to obtain a description in terms of variables of interest. At the same time, it represents a type of coarse graining. Likewise, here we employ appropriate source fields to produce a description in terms of variables of interest. The new degrees of freedom reflect weighted averages of the original spin degrees of freedom; thus, their energetics are governed by a free energy. We show that when the source fields are turned off, this free energy becomes the thermodynamic potential for the coupling constants. Thus, learning and retrieval, respectively, can be thought of as minimizations on a conjoint free energy surface.

The total number of coupling constants in the complete description, 2^N, becomes impractically large already for trivial applications, which then prompts one to ask whether the description can be reduced in some controlled way. The most direct way to reduce the description is to simply omit some terms from the energy function; the number of such terms increases combinatorially with the order of the interaction. We show that following a reduction in description, however, the free energy will increase nonuniformly over the phase space so as to develop multiple minima of comparable depth. By co-opting known results from statistical mechanics, we argue that depending on the application, the number of minima could be so large as to scale exponentially with the number of variables.

The multiplicity of minima makes the choice of an optimal description ambiguous. Conversely, a successful reduction in description (Merchan & Nemenman, 2016; Davtyan et al., 2012) implies that the underlying interactions are low rank.

Using the conjoint property of the free energy of learning and retrieval, respectively, we provide an explicit, rather general illustration of how the appearance of multiple minima in the learning potential also signals a breaking of ergodicity in the phase space spanned by the actual degrees of freedom of the model. Hereby the phase space of the spin system becomes fragmented into regions separated by free energy barriers (Goldenfeld, 1992). Consequently, escape rates from any given free energy minimum can become very low because they scale exponentially with the temperature and the parameters of the model (Chan & Lubchenko, 2015). Ergodicity breaking has been observed in restricted Boltzmann machines (Béreux et al., 2023; Decelle & Furtlehner, 2021), thus suggesting the latter machines represent reduced descriptions.

The ergodicity breaking implies that quantities such as the energy and entropy, among others, are no longer state functions; instead, they could at best be thought of as multivalued functions. Hence the notions of free energy and entropy—as well as the associated probability distribution—all become ill defined. Indeed, already the presence of two alternative free energy minima in a physical system corresponds to a coexistence of two distinct physical phases. Consider, for example, water near liquid-vapor coexistence at the standard pressure. The entropy of the system can vary by about 10 kB per particle, depending on the phase, according to the venerable Trouton’s rule for the entropy of boiling (Berry et al., 1980). Thus, the entropy and the free energy, as well as many other thermodynamic quantities, are poorly defined. The uncertainty in the density of the system is particularly dramatic, about three orders of magnitude (Lubchenko, 2020).

As a potential remedy for the fragmentation of the space of the coupling constants caused by the reduction in description, one may employ a separate generative model for an individual minimum, or phase. Such individual models would each be ergodic and can be thought of as a Helmholtz machine (Dayan et al., 1995), since free energies can be defined unambiguously for a single-phase system. Running the machine would then be much like consulting multiple experts at the same time. Such multiexpert inquiries may in fact be unavoidable when distinct models employ sufficiently dissimilar interactions. Consider liquid-to-solid transitions as an example. Not only are such transitions intrinsically discontinuous (Lubchenko, 2015), but practical descriptions of the two respective phases involve altogether different variables (Lubchenko, 2017), as alluded to above.

Since the number 2^N of possible configurations of N bits becomes huge even for modest values of N, the vast majority of all possible configurations of the system are automatically missing from the training set. Because of their thermodynamically large entropy, we argue, these nonrepresented configurations can overwhelm the output of the machine. To avoid instabilities of this sort, the nonrepresented states must be placed at higher energies than the represented configurations; furthermore, the two sets of configurations should be separated by an extensive energy gap. Thus, the missing configurations comprise a higher-temperature phase; this also implies a breaking of ergodicity. The robustness of the generative model, then, comes down to preventing a transition toward this high-temperature phase. At a fixed temperature, such stability is achieved by parameterizing the spectrum of the nonrepresented states so as to make the energy gap sufficiently large. Conversely, at a fixed value of the gap, the retrieval temperature must be set below the transition temperature between the low-entropy and high-entropy phases, respectively. This will, however, limit one’s ability to retrieve those configurations that were relatively underrepresented in the training set.

The article is organized as follows. In section 2, we construct a conjoint free energy surface whose arguments are the parameters of the generative model and, at the same time, coarse-grained values of the original degrees of freedom the complete description operates on. The formalism allows one to standardize weights of individual configurations in a data set as a means to implement calibration of sensors as well as manage bias or duplication in the data, if any. In section 3, we find ergodicity breaking in the simplest possible realization of a reduced description. We make a connection with the ergodicity breaking during the physical phenomenon of phase coexistence and discuss implications for machine learning. Section 4 provides a thermodynamically consistent treatment of incomplete data sets and argues that knowledge can be thought of as a library of bound states whose free energy is lower than the free energy of the nonrepresented states. Section 5 provides a summary and some perspective.

2.1  Setup of the Generative Model

By construction, we consider a machine that operates on N binary variables in the form of Ising spins σα = ±1, α = 1, 2, ..., N. The “positive” (“negative”) polarization state of each binary variable may be referred to as the “up” (“down”) state. These names are meant to specify the state of a binary register or a Boolean variable, as in this list:
(2.1)
The 2^N possible configurations of N Ising spins σα, α = 1, ..., N, cover the corners of a hypercube in an N-dimensional Hamming space. A configuration i will be denoted as σi:
(2.2)
We will use Greek indices to distinguish the N directions in the Hamming space—these directions pertain to the individual bits (spins) themselves and correspond to the subscripts on the l.h.s. of equation 2.2.
We define a data set by assigning a number Zi to configuration i of the N binary variables. We further define the normalized weight of configuration i according to
(2.3)
where
(2.4)

the summation being over the 2^N points comprising the Hamming space. We will consistently use Latin indices to label the latter points in the Hamming space, as well as any other quantities assigned to those points.

By construction, the weight xi is intended to specify what a chemist would call the “mole fraction” of configuration i. The weight xi may or may not be associated with a probability, depending on the context; the weights are assigned according to a user-defined convention, as would be mole fractions in chemistry, where one must explicitly specify what is meant by a “species,” “particle,” and so on. We set aside until section 4 the obvious issue that even for a very modestly sized system, obtaining or storing 2^N-worth of quantities Zi is impractical. For now, we simply assume that for configurations not represented in the data set, the respective weights are assigned some values of one’s choice. In any event, the quantities Zi do not have to be integer. For instance, pretend we are teaching a machine, by example, the behavior of the inverter gate. Four distinct configurations, labeled (1) through (4), are possible in principle, each a pair of arrows where one arrow stands for the input bit and the other arrow for the output bit. For concreteness, let us set Zi at the number of times configuration i was presented in the set. Suppose the training set is Z1=32, Z2=37, Z3=2, Z4=0. It will be useful to regard Z4 as an adjustable parameter, even if we eventually adopt for it a fixed value that is very small relative to the rest of the Zi’s.
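As a minimal numerical sketch of the normalization in equations 2.3 and 2.4, the inverter-gate counts above can be converted into mole-fraction-like weights; the value 1e-3 used for the adjustable Z4 below is an arbitrary placeholder, not a choice made in the text:

```python
import numpy as np

# Training counts for the inverter-gate example; Z4 is kept as a small
# adjustable parameter (placeholder value) rather than exactly zero.
Z = np.array([32.0, 37.0, 2.0, 1e-3])

# Normalized weights ("mole fractions"), equation 2.3: x_i = Z_i / Z,
# where Z is the sum over all configurations (equation 2.4).
x = Z / Z.sum()
```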

We will sometimes refer to expressions of the type
(2.5)

as “weighted sums,” or “averages,” or “expectation values.”

We define a generative model in the form of an energy-like function acting on N binary degrees of freedom. Because σα^(2n+1) = σα for any nonnegative integer n, the most general function of the spin variables σα can be written as a linear combination of all possible products of the variables, where in each product, a particular component σα is present at most once. Thus, one may define the following function:
(2.6)
A parameter Jα1α2...αn is the coupling constant for the interaction that couples the n spins σα1, σα2, ..., σαn. The last line in equation 2.6 purveys a useful shorthand whereby we present the energy function as the inner product of two 2^N-dimensional vectors J and σ. By construction, vector σ is the Kronecker product of all pairs (1, σα):
(2.7)
We will consistently order all Kronecker products so that the spin label α increases right to left. For instance, for three spins, one has σ = (1,σ3) ⊗ (1,σ2) ⊗ (1,σ1) = (1, σ1, σ2, σ1σ2, σ3, σ1σ3, σ2σ3, σ1σ2σ3). The components of the 2^N-dimensional vector J are labeled so as to match the combination of spins the component in question multiplies, per equation 2.6 (J0 multiplies the 1). And so for three spins, one has J = (J0, J1, J2, J12, J3, J13, J23, J123), and so on.
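The ordering convention can be checked in a few lines of NumPy; this is an illustrative sketch, and the helper name spin_vector is ours:

```python
import numpy as np

def spin_vector(spins):
    """Kronecker product of the pairs (1, sigma_alpha), ordered so that
    the spin label increases right to left, as described in the text."""
    v = np.array([1.0])
    for s in spins:                    # spins = (sigma_1, ..., sigma_N)
        v = np.kron(np.array([1.0, s]), v)
    return v

s1, s2, s3 = 1, -1, 1
v = spin_vector((s1, s2, s3))
# Components come out ordered as (1, s1, s2, s1*s2, s3, s1*s3, s2*s3, s1*s2*s3).
```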
Owing to the mixed-product property of the Kronecker product, two vectors σi and σj are orthogonal if σα(i) ≠ σα(j) for at least one α. Thus,
(2.8)
This orthogonality relation, in turn, implies the following completeness relation:
(2.9)
where |a⟩⟨b| denotes the outer product of vectors a and b and 1 is the unit matrix of size 2^N.
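For small N, the orthogonality and completeness relations 2.8 and 2.9 can be verified numerically; the sketch below assumes the Kronecker ordering described above:

```python
import numpy as np
from itertools import product

N = 3

def spin_vector(spins):
    v = np.array([1.0])
    for s in spins:
        v = np.kron(np.array([1.0, s]), v)
    return v

# All 2^N corners of the hypercube in Hamming space.
configs = list(product([-1, 1], repeat=N))
V = np.array([spin_vector(c) for c in configs])   # row i is sigma_i

gram = V @ V.T          # orthogonality, eq. 2.8: 2^N * delta_ij
outer_sum = V.T @ V     # completeness, eq. 2.9: 2^N * identity
```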
Because the total number Σn=0..N N!/[n!(N−n)!] = 2^N of the coupling constants matches the total number of configurations available to our spin system, one may inquire if there is a set of coupling constants such that the set of the 2^N values of the function E(σi), i = 1, ..., 2^N, can match exactly an arbitrary set of energies Ei, i = 1, ..., 2^N: E(σi) = Ei. Equation 2.6 then implies that the coupling constants would have to solve the following system of 2^N linear equations:
(2.10)
The solution indeed always exists, is unique, and is straightforwardly obtained by multiplying the above equation by ⟨σi| on the right, summing over i, and using equation 2.9 (see also Gresele & Marsili, 2017). This yields
(2.11)
or, more explicitly,
(2.11′)
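As an illustration of the inversion leading to equation 2.11, the sketch below recovers an arbitrary target spectrum for N = 3. We assume the sign convention E(σ) = −⟨J|σ⟩; the overall sign depends on the convention adopted in equation 2.6, which is not reproduced here, and the helper spin_vector is ours:

```python
import numpy as np
from itertools import product

N = 3
rng = np.random.default_rng(0)

def spin_vector(spins):
    v = np.array([1.0])
    for s in spins:
        v = np.kron(np.array([1.0, s]), v)
    return v

configs = list(product([-1, 1], repeat=N))
V = np.array([spin_vector(c) for c in configs])   # row i is sigma_i

E = rng.normal(size=2**N)      # an arbitrary target spectrum E_i

# Couplings from the spectrum (assumed convention E(sigma) = -<J|sigma>),
# using the completeness relation: J = -(1/2^N) sum_i E_i sigma_i.
J = -(V.T @ E) / 2**N
E_model = -(V @ J)             # energies recovered from the couplings
```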

It is not immediately obvious how much the coefficients J from equation 2.11—and hence the generative model itself—would vary among data sets originating from distinct sources or experiments that one may nonetheless deem qualitatively similar or even equivalent. In the happy event that the variation is indeed small, such robustness could be thought of, by analogy with thermodynamics, as the couplings J being subject to a smoothly varying free-energy surface. With the aim of constructing such a free-energy surface, we next discuss a variety of ways to connect our generative model with a data set and, conversely, how to retrieve patterns learned by the model.

2.2  Data Retrieval and Calibration: Learning Rules

We consider a candidate ensemble in which the numbers Zi are driven by source fields in the form of the energies Ei themselves. The couplings J—which are ultimately of interest—can then be determined using the relations 2.11. We are specifically interested in the ability to drive the distribution Zi with respect to the energy Ei relative to some preset reference value Ei∘:
(2.12)
We will call the reference values “standard” and consistently label them with the superscript “∘”. By construction, Zi∘ and Ei∘ are independent of temperature.

Retrieval in the setup above is performed by Gibbs sampling, at temperature T, the energy function obtained by computing Ei = E(σi) with the help of equation 2.6 and then subtracting from it the quantity (Ei∘ + T ln Zi∘), so that the sampled weights reduce to the xi of equation 2.12. If parallel tempering is used, the quantity T in the last expression is fixed at the target temperature, of course.
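A minimal heat-bath (Gibbs) sampler of the kind described above might look as follows. The two-body energy function, temperature, and sweep count are purely illustrative, and the standard-value shift involving Ei∘ and Zi∘ is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 3, 1.0

def energy(s):
    # Illustrative toy energy: ferromagnetic coupling between all spin pairs.
    return -sum(s[a] * s[b] for a in range(N) for b in range(a + 1, N))

def gibbs_sweep(s):
    """One heat-bath sweep: resample each spin in turn from its
    conditional Boltzmann distribution at temperature T."""
    for a in range(N):
        up, dn = list(s), list(s)
        up[a], dn[a] = 1, -1
        p_up = 1.0 / (1.0 + np.exp((energy(up) - energy(dn)) / T))
        s[a] = 1 if rng.random() < p_up else -1
    return s

s = [int(rng.choice([-1, 1])) for _ in range(N)]
for _ in range(100):
    s = gibbs_sweep(s)
```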

The standard value Zi∘ serves to calibrate the weight of configuration i, if presented, in the data set; the quantities Zi∘ are nonvanishing, by construction. Let us provide several examples of where such calibration is useful or even necessary: (1) If distinct configurations are detected using separate sensors, the outputs of the latter sensors must be calibrated against each other. To drive this point home, imagine that two or more detectors operate using different physical phenomena. For instance, one may determine temperature by using a known equation of state for a material or by spectroscopic means, among many others. Outputs of distinct detectors can be and must be mutually calibrated where the respective ranges of detection overlap. (2) Calibration may be necessary to manage duplication in the data set or a bias, if any, in the acquisition or publication of the data (see Beker et al., 2022, for a discussion of such a bias in the context of using machine learning to predict optimal reaction conditions). (3) One can think of the quantity ln Zi∘ as the intrinsic entropy of state i, according to the second equality in equation 2.12. The latter entropy could, for instance, reflect the log-number of states of a hidden degree of freedom, when the visible variables happen to be in configuration i. (4) Variation with respect to the “local” energy difference (Ei − Ei∘) is analogous to the material derivative, whereby the local reference energy Ei∘ can be thought of as specifying context. This is useful if the inputs exhibit correlations and/or one wishes to parameterize differences among distinct inputs (see appendix A). There we also compare the present use of standard values to that in chemistry. Finally, one may choose to regard the standard values Zi∘ and Ei∘ as inherently distributed, in a Bayesian spirit.

According to equation 2.12, the “local” energy deviations (Ei − Ei∘) reflect how much the coupling constants J would have to be perturbed from their standard values to shift the distribution {Zi} from its standard value. Suppose, for the sake of argument, the latter shift is due to incoming additional data and that we set Zi at the number of instances of configuration i. Then equation 2.11 effectively prescribes a set of learning rules. For instance, a two-body coupling constant will be modified according to Jαβ = Jαβ∘ − (1/2^N) Σi σα(i)σβ(i)(Ei − Ei∘) = Jαβ∘ + (T/2^N) Σi σα(i)σβ(i) ln(Zi/Zi∘). We see this learning rule is similar to, but distinct from, the venerable Hebbian rule, in which instances would instead add up cumulatively. Here, in contrast, the contributions to the coupling constants are weighted by the log-number of instances of reinforcement.
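The two-body update above can be sketched numerically. The standard counts Z0 and updated counts Z below are hypothetical, chosen only to exercise the formula:

```python
import numpy as np
from itertools import product

N, T = 2, 1.0
configs = list(product([-1, 1], repeat=N))

# Hypothetical counts: Z0 are the standard weights; Z are the counts
# after additional data have arrived.
Z0 = np.array([10.0, 40.0, 45.0, 5.0])
Z = np.array([12.0, 60.0, 70.0, 6.0])

# Increment of the two-body coupling, per the learning rule in the text:
# delta J_ab = (T/2^N) sum_i s_a(i) s_b(i) ln(Z_i / Z_i_standard)
a, b = 0, 1
dJ = (T / 2**N) * sum(c[a] * c[b] * np.log(zi / z0i)
                      for c, zi, z0i in zip(configs, Z, Z0))
```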

2.3  Free Energy of Learning

Consider the following object:
(2.13)
where we treat each quantity Zi = Zi∘ e^(−(Ei−Ei∘)/T) and the quantity A itself as functions of the energies Ei and temperature T. We next consider the following quantity:
(2.14)
(2.15)
In equation 2.14 we reflected that the sum vanishes for the standard weights and used the definition 2.3 of the weights xi. The inequality in equation 2.15 holds because the sum in the above equation is nonnegative. (The latter sum also happens to be the Kullback-Leibler divergence between the distributions defined by xi and xi∘, respectively.) The divergence is minimized—and hence the quantity S˜ is maximized—along the line Zi/Z = Zi∘/Z∘, that is, at xi = xi∘. If one imposes an additional constraint, for instance, by fixing the energy-like quantity
(2.16)
the quantity S˜ is now maximized by a unique configuration of {Zi} or, equivalently, of the energies {Ei}. Thus, the quantity S˜ is an entropy.
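The Kullback-Leibler divergence mentioned above is straightforward to evaluate directly; the weight vectors below are illustrative:

```python
import numpy as np

# Illustrative standard weights and a perturbed distribution.
x0 = np.array([0.4, 0.3, 0.2, 0.1])
x1 = np.array([0.35, 0.35, 0.2, 0.1])

def kl(p, q):
    """Kullback-Leibler divergence D(p||q): nonnegative, zero iff p == q."""
    return float(np.sum(p * np.log(p / q)))
```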
Clearly, E˜∘ = 0; consequently
(2.17)
This implies the standard entropy S˜∘ is temperature independent since, according to equation 2.13, the standard value A∘
(2.18)
is strictly proportional to temperature. Also, equations 2.17 and 2.13 can be used to show that in the gauge Zi∘ = e^(−Ei∘/T), where T is a positive constant, the standard entropy is given by the following expression:
(2.19)
The entropy above has the form of the Shannon entropy of the data set if one sets Σi xi∘Ei∘ = 0. The latter can be done without loss of generality but will have the effect of fixing the energy reference.
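In that gauge, the standard entropy reduces to the Shannon form, which can be evaluated for any illustrative set of standard weights:

```python
import numpy as np

x0 = np.array([0.4, 0.3, 0.2, 0.1])    # illustrative standard weights
S0 = -float(np.sum(x0 * np.log(x0)))   # Shannon entropy of the data set
```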
One may also readily define a heat capacity-like quantity:
(2.20)
The quantities E˜ and C˜ can be regarded as analogs of the conventional energy and heat capacity, respectively, but generalized for a local gauge in the form of the state-specific energy references Ei∘. Conversely, if we adopt a global gauge, by adopting a uniform Ei∘ = const, we obtain the conventional forms E˜ = ⟨E⟩ − const and C˜ = (⟨E²⟩ − ⟨E⟩²)/T², respectively, with ⟨·⟩ denoting the weighted average of equation 2.5.
According to equations 2.13 and 2.3,
(2.21)
Thus, we obtain that the function A is a free energy:
(2.22)
Consistent with this identification, A({Ei},T=const) is a convex-up function of the variables {Ei}. The curvature vanishes in exactly one direction, Zi/Z=const; movement in the latter direction leaves the weights xi invariant.
A more useful description is afforded by a free energy that operates on the weights xi—which are the actual “observable” quantities; we accomplish this using Legendre transforms. Choose any number M of configurations of interest, and assign to them labels 1 through M. One may then define the following Legendre transform, whereby one removes the energetic contribution to the shift of the free energy off its standard value, while retaining exclusively the entropic contribution, for configurations 1 through M:
(2.23)
so that
(2.24)
Thus,
(2.25)
and we have defined
(2.26)
We note that
(2.27)
Also, Σi≤M Zi = Z Σi≤M xi; thus Z = Z′/(1 − Σi≤M xi) and, subsequently,
(2.28)
which can be used to express A˜(M) explicitly in terms of xi:
(2.29)
where
(2.30)
and Σi xi = 1, of course. Throughout, summation over configurations runs from 1 through either the explicitly indicated upper limit or, otherwise, through 2^N. We will use the following shorthand:
(2.31)
Further, equations 2.26 and 2.29 imply
(2.32)
Consequently, dA˜(M) = −T dS˜(M) − S˜(M) dT. Combining this with equation 2.24 yields
(2.33)
Thus, the entropy S˜(M) is uniquely maximized at Ei = Ei∘ and, hence, at xi = xi∘. Consequently, the function A˜(M) is uniquely minimized at xi = xi∘, by equation 2.32:
(2.34)
for any M. These notions can also be established directly by differentiating equation 2.29 with respect to the xi’s. Thus, the function A˜ is analogous to a thermodynamic potential governing the relaxation of a single-phase physical system. Note that equations 2.29 and 2.34 are consistent with equation 2.6 of Dayan et al. (1995).
The bound in equation 2.34 is connected to the familiar Gibbs inequality. Indeed, in view of equation 2.23, inequality 2.34 becomes
(2.35)
At M = 2^N, the above equation is the traditional way to express the Gibbs inequality (Girardeau & Mazo, 1973):
(2.36)
In other words, the free energy of a system of interest—represented by the standard model in this case—is bounded from above by the free energy for a trial energy function plus the expectation value of the correct energy, relative to the trial energy, where the weights are computed using the trial energy function.

According to equations 2.11 and 2.12, there is a one-to-one correspondence between the quantities Zi and the coupling constants J. Thus, equation 2.11 can be viewed as the solution of a minimization problem in which one minimizes the thermodynamic potential A˜ with respect to the coupling constants J while keeping exactly one coupling constant fixed. (It is most convenient to fix J0, which simply specifies an overall multiplicative factor for the numbers Zi and, hence, does not affect the weights xi.) In this sense, one can think of A˜ as being able to guide a learning process in principle.

What is the meaning of the distribution e^(−A˜({xi})/T), which one may nominally associate with the potential A˜? Analogous distributions arise in thermodynamics as a result of the canonical construction. Hereby one effectively considers an infinite ensemble of distinct but physically equivalent replicas of the system (McQuarrie, 1973). The width of the distribution e^(−A˜/T) should then be formally thought of as reflecting variations of the weights xi among distinct replicas. To generate such ensembles in practice, one can, for instance, break up a very large data set into smaller—but still large—partial data sets. In some cases, data may exhibit correlations due to implicit or hidden variables, such as the time and place of collection. If so, breaking data sets into pertinent subsets may reveal correlations among fluctuations of the weights xi—from subset to subset—that are not necessarily captured by the ideal mixture-like expression, equation 2.29. Adopting a particular calibration scheme for the inputs of the sensors may also introduce correlations among weight fluctuations. We outline, in appendix A, a possible way to modify the free energy so as to account for the latter correlations.

2.4  Coarse-Graining, Choice of Description, and Inquiries

Because spins are intrinsically discrete degrees of freedom, equivalent yet distinct data sets may result in spin arrangements that are similar in some coarse-grained sense yet may be rather dissimilar at the level of individual spins owing, for instance, to noise. It may then be advantageous to specify the state of the system at a coarse-grained level, that is, using not the discrete variables σα—themselves or combinations thereof—but, instead, their averaged values that retain essential features of the pattern while not being overly sensitive to detailed variations in the polarization pattern. Here we use the coarse-graining recipe underlying the canonical construction whereby one applies a source field h to the spins, which yields the following, equilibrium Gibbs energy:
(2.37)
The source field h itself is independent of spin configuration and can be used to stabilize spin arrangements of interest. Throughout, we will limit ourselves to source fields in the form of a sum of one-spin, on-site fields:
(2.38)
In a straightforward manner, one can determine the typical magnetization of spin α by varying the free energy G with respect to the source field hα:
(2.39)
where the weights xi now reflect the additional bias due to the external field:
(2.40)
(cf. equation 2.3).
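For a flat toy spectrum, the biased weights of equation 2.40 and the magnetizations of equation 2.39 can be evaluated exactly. We assume here that the source term of equation 2.38 enters the energy as −Σα hασα, so that driving only spin 1 should give m1 = tanh(h1/T) with the other magnetizations vanishing:

```python
import numpy as np
from itertools import product

N, T = 3, 1.0
configs = np.array(list(product([-1, 1], repeat=N)), dtype=float)

E = np.zeros(len(configs))        # flat spectrum, purely for illustration
h = np.array([0.5, 0.0, 0.0])     # on-site source fields, cf. eq. 2.38

# Field-biased weights (cf. eq. 2.40), with the assumed sign convention:
logw = (-E + configs @ h) / T
w = np.exp(logw - logw.max())     # subtract max for numerical stability
x = w / w.sum()

# Magnetizations, eq. 2.39: m_alpha = sum_i x_i sigma_alpha(i)
m = configs.T @ x
```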
The simple type of source field on the right-hand side of equation 2.38 is just one particular—albeit instructive—functional form for a source field. An arbitrary linear combination of spin products σα, σασβ, σασβσγ, and so on, can be stabilized by the same token. From a formal viewpoint, such complicated forms can be used to probe for multispin correlations. More significant in the present context, however, is that the corresponding expectation values constitute a description in terms of some new variables of choice. These variables are coarse-grained combinations of the original spin degrees of freedom. Thus, the appropriate energy quantity that determines the statistics of these coarse-grained variables is a free energy. Indeed, according to equation 2.39, one may introduce the Helmholtz energy in the usual way:
(2.41)
where the source fields hα can be thought of as external fields that would be necessary to induce a magnetization pattern of interest:
(2.42)
At the same time,
(2.43)
(cf. equation 2.21) and so the quantity
(2.44)
should be recognized as the Gibbs counterpart of A˜.
We make a connection with the thermodynamic potential A˜ from equation 2.29 by repeating the steps that led to equation 2.29 while having replaced Ei by the quantity Ei(h),
(2.45)
in 2.23 and A by G. One thus obtains
(2.46)
At the same time, equations 2.38, 2.39, and 2.45 imply that
(2.47)
Combined with equation 2.46, this yields
(2.48)
since hα∘ = 0 by construction (see equation 2.13). In view of equation 2.41, the above equation implies
(2.49)
which generalizes equation 2.23 at M=2N, to a broader range of polarization patterns. By combining equations 2.48 and 2.44, one obtains
(2.50)
(see equations 2.41). Finally, we take differentials of equations 2.44 and 2.49, respectively, and use equations 2.39 and 2.43 to see that
(2.51)
(2.52)

Equations 2.48 through 2.52 thus complete the formal task of building a conjoint thermodynamic potential for the coupling constants, on the one hand, and for the expectation values of the degrees of freedom, on the other hand. The thermodynamic potentials A˜ and G˜ can be thought of as constrained free energies of the system, where the constraint is due to deviations from the standard model. We see that the constrained versions of the Helmholtz and Gibbs energies are in the same relation to each other as their unconstrained counterparts.

Relation 2.42 can be profitably thought of as specifying stationary points of a tilted free energy surface A(m) − Σα hαmα as a function of the magnetizations mα, where the fields hα are treated as constants representing externally imposed fields. If there is only one stationary point, it corresponds to a unique minimum; the depth of the minimum is then equal to the equilibrium Gibbs energy G = A − Σα hαmα. If, however, a portion of the surface A(m) happens to exhibit a negative curvature, then there is a way to tilt the latter surface so that it will now exhibit more than one minimum. If such additional minima are in fact present—implying equation 2.42 has multiple solutions—one is no longer able to unambiguously define the Gibbs energy or other thermodynamic quantities including the entropy and energy. Consequently, the coarse-graining scheme itself becomes ambiguous. These notions apply equally to the constrained free energies A˜ and G˜, per equations 2.50 to 2.52, which is central to the discussion of reduced descriptions in section 3.

Equations 2.40 and 2.37 also embody a particular way to implement specific inquiries in the present formalism. Suppose, for the sake of argument, that we wish to recover a pattern from a cue in the form of a subset of spins, each fixed in a certain direction, similar to how one would retrieve a pattern in the Hopfield network (Hopfield, 1982). This would formally correspond to setting hα = ±∞, the two options corresponding to spin α polarized up and down, respectively. More generally, the setup in equation 2.37 allows one to impose a broad variety of constraints, whose rigidity can be tuned by varying the magnitude of the pertinent source field.

The number of coupling constants one can employ in practice is much less than the number 2^N of parameters constituting the full description from equation 2.11. This implies that practical descriptions are reduced essentially by construction. Here we argue that the task of finding such reduced, practical descriptions is, however, intrinsically ambiguous. The section is organized as follows. First, we attempt to emulate a very simple data set using a trial generative model that is missing a key interaction from the actual model underlying the data set. We will observe that in contrast with the complete description, the reduced free energy surface A˜(J) is no longer single-minimum but instead develops competing minima. We next show that the Helmholtz free energy of the reduced model also develops competing minima; these minima correspond exactly to the minima of the potential A˜(J). This connection is then used to show that reduced descriptions transiently break ergodicity, which will act to stymie both training and retrieval. A potential remedy will be proposed.

The simplest system where reduction in description can be performed is a set of two interacting spins. In the full description, there are three independent weights xi; the corresponding generative model will generally employ all three coupling constants, namely, the one-particle fields J1 and J2, and the two-particle, spin-spin coupling, J12. To this end, let us consider the function A˜ for the following standard model:
(3.1)
where note we use the plus sign on the right-hand side, in contrast with equation 2.6. We will assign the distribution Zi according to the Boltzmann prescription for all descriptions:
(3.2)
The positive constant T̄ can be thought of as some normal temperature. A positive J corresponds to a fuzzy-logic inverter gate, a negative J to a fuzzy-logic identity gate. For concreteness, we will consider J>0, without loss of generality. By construction, we set up the standard values as follows:
(3.3)
The resulting thermodynamic potential A˜, as a function of the couplings J, is illustrated in Figures 1 and 2 in the form of select cross-sections J12=const. Note that the J12=-J cross-section contains the global minimum of the potential A˜.
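As a point of reference, the standard weights of this gate can be tabulated directly. The following sketch assumes the plus-sign energy function E(σ1, σ2) = Jσ1σ2 of equation 3.1 and the Boltzmann prescription of equation 3.2; the numerical values are illustrative.

```python
import math

# Standard weights of the two-spin fuzzy-logic inverter gate: a sketch
# assuming the plus-sign energy function E(s1, s2) = J * s1 * s2 of
# equation 3.1 and the Boltzmann prescription of equation 3.2.
J, T = 1.5, 1.0  # standard coupling and temperature (illustrative values)

configs = [(s1, s2) for s1 in (-1, 1) for s2 in (-1, 1)]
weights = {c: math.exp(-J * c[0] * c[1] / T) for c in configs}
total = sum(weights.values())
probs = {c: w / total for c, w in weights.items()}

# For J > 0, the two antiparallel ("correct" inverter) configurations dominate.
for c in configs:
    print(c, round(probs[c], 4))
```

For J > 0, the two antiparallel configurations carry nearly all the weight, as befits an inverter gate.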
Figure 1:

The free energy surface A˜(J1,J2,J12) from equation 2.29, as a function of the coupling constants, for the standard model from equation 3.3; M = 2^N = 4. Two select cross-sections J12 = const are shown. The J12 = −J cross-section contains the global minimum of A˜, while the J12 = 0 cross-section corresponds to a reduced description in which the two-body interaction is missing. T = T̄ = 1, J = 1.5.

Figure 2:

(a, c) Contour plots corresponding to the J12 = 0 and J12 = −J = −1.5 slices, respectively, from Figure 1. (b) A special intermediate situation, where the surface A˜(J1,J2,J12) just begins to develop two distinct minima. T = T̄ = 1, J = 1.5.


Because of the combinatorial multiplicity of high-rank couplings, we are particularly interested in trial generative models that employ the smallest possible number of high-rank couplings. Thus, for the two spins, we focus on the cross-section J12=0 of the overall free energy surface. As can be seen in Figure 1, the function A˜(J1,J2;J12=0) is minimized not at a unique set of the two remaining variables J1 and J2 but instead is bistable. The two minima are exactly degenerate. This means the optimal choice of the coupling constants—J1 and J2 in this case—is no longer obvious. Indeed, already an infinitesimal change in the coupling constants of the standard model from equation 3.3 can result in a discrete change in the position of the lowest minimum.

Suppose, for the sake of argument, that we simplemindedly accept the J1>0, J2<0 minimum as the result of training and ignore the other minimum. This will stabilize one of the two correct configurations of the inverter gate, but it will also make the remaining correct configuration the least likely of the four available configurations during retrieval. As yet another candidate strategy, suppose we accept not the location of an individual minimum of A˜(J1,J2,J12=0) as the result of training but, instead, use a Boltzmann-weighted average over the minima. For the generative model, equation 3.3, this would result in accepting J1=J2=0 as the result of training, a nonsensical outcome.

Similar ambiguities arise in statistical descriptions of phase coexistence, when the free energy surface of the system exhibits more than one minimum. We can see this explicitly by expressing the free energy A˜ of the reduced description as a function of coarse-grained variables. For the reduced description, we will use the following trial energy function,
(3.4)
which corresponds to a widely used mean-field ansatz introduced early on by van der Waals (Landau & Lifshitz, 1980). Hereby one replaces instantaneous values of the force acting on a given degree of freedom by an effective force that depends only on the configuration of the latter degree of freedom. Thus, expectation values of products of quantities pertaining to distinct degrees of freedom factorize as
(3.5)
According to equation 3.5, the effective force acting on a given spin is determined by using the typical values of the magnetization of the other spins, not the actual values.

If one sets all the coupling constants of order two and higher to zero in the generative model, equation 2.6, whether in order to reduce the description or otherwise, the latter model becomes equivalent to the mean-field ansatz, equation 3.4. This circumstance allows us to use the same notation A˜ for the thermodynamic potential as a function of either the coupling constants Ji, on the one hand, or the magnetizations mi, on the other hand.

Let us rewrite equation 2.23 at M = 2^N to yield the following expression:
(3.6)
where S is the conventional canonical entropy of the trial model. Indeed, A = ∑ixiEi − TS. According to equation 3.6, A˜ for a trial generative model can be computed by first writing down the expression for the free energy of the trial energy function and then simply replacing the energy part of the expression by the expectation value of the actual expression for the energy, however evaluated using the Boltzmann weights of the trial ansatz.
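This recipe is easy to exercise numerically for the two-spin example. The sketch below (an illustration under assumed conventions, not the authors' code) takes a trial model with J12 = 0, for which the spins are independent with magnetizations mi = tanh(Ji/T), averages the actual energy E = Jσ1σ2 over the trial weights, and subtracts T times the trial entropy:

```python
import math

# A sketch of the recipe above for the two-spin example: trial model with
# J12 = 0 (independent spins, trial magnetizations m_i = tanh(J_i / T)),
# actual energy E = J * s1 * s2 averaged over the trial weights, minus
# T times the trial entropy. Conventions are assumptions for illustration.
J, T = 1.5, 1.0

def s_mix(m):
    # Mixing entropy of one spin with magnetization m.
    p, q = (1 + m) / 2, (1 - m) / 2
    return -(p * math.log(p) + q * math.log(q))

def A_tilde(J1, J2):
    m1, m2 = math.tanh(J1 / T), math.tanh(J2 / T)
    return J * m1 * m2 - T * (s_mix(m1) + s_mix(m2))

# Scan the (J1, J2) plane and locate a global minimum.
grid = [i * 0.05 for i in range(-60, 61)]
vals = {(a, b): A_tilde(a, b) for a in grid for b in grid}
J1_min, J2_min = min(vals, key=vals.get)
print("one minimum near", (round(J1_min, 2), round(J2_min, 2)))
```

On such a grid scan, the surface exhibits two exactly degenerate minima related by (J1, J2) → (−J1, −J2), consistent with Figure 1.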
The entropy for a stand-alone spin σ=±1 is simply the mixing entropy of a binary mixture with mole fractions (1+m)/2 and (1-m)/2, respectively. In view of equation 3.6, then, the Helmholtz free energy for the most general energy function equation 2.6, computed under the mean-field constraint, equation 3.5, reads:
(3.7)
where the expression E(σ)|σ=m refers to the energy function 2.6 evaluated using the standard values of the coupling constants and setting σα → mα. For example, for two spins, the above expression becomes
(3.8)
Let us examine the consequences of using the mean-field expression, equation 3.8, as a trial description, specifically for the two-spin system from equation 3.1. Thus, we set J0=J1=J2=0 and J12=−J:
(3.9)
The interaction term Jm1m2 is that of an antiferromagnet. The free energy surface is readily seen to become bistable for J/T>1, implying that the mean-field magnet, equation 3.9, exhibits a criticality in the form of a paramagnet-to-antiferromagnet transition with a Néel temperature TC = J. In contrast, the exact free energy has a single minimum for all values of J, implying that the mean-field solution becomes qualitatively incorrect for J/T>1. In Figure 3, we show the slice of the free energy surface along the m ≡ m1 = −m2 line, for three select values of J. For reference, we also show the corresponding values of the exact free energy, which can be readily evaluated numerically using equations 2.37, 2.39, and 2.41.
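The onset of bistability can be checked directly on the m ≡ m1 = −m2 slice. A minimal sketch, assuming the slice of equation 3.9 takes the form A˜(m) = −Jm² + T[(1+m)ln((1+m)/2) + (1−m)ln((1−m)/2)]:

```python
import math

# The m = m1 = -m2 slice of the mean-field free energy, eq. 3.9 (a sketch,
# assuming A~(m) = -J*m**2 + T*[(1+m)*ln((1+m)/2) + (1-m)*ln((1-m)/2)]):
# single-welled for J/T < 1, bistable for J/T > 1, with TC = J.
T = 1.0

def A_slice(m, J):
    mix = (1 + m) * math.log((1 + m) / 2) + (1 - m) * math.log((1 - m) / 2)
    return -J * m * m + T * mix

ms = [i / 1000 for i in range(-999, 1000)]
minima = {J: min(ms, key=lambda m: A_slice(m, J)) for J in (0.5, 1.0, 1.5)}
for J, m_star in minima.items():
    print(f"J = {J}: minimum at m = {m_star:+.3f}")
```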
Figure 3:

Dashed lines: the m ≡ m1 = −m2 slice of the free energy surface A˜(m1,m2) from equation 3.9 for three select values of J. Solid lines: the respective values of the exact Helmholtz energy A for the energy function, equation 3.1. T = T̄ = 1.


The two minima on the mean-field curve for J=1.5 in Figure 3 are precisely equivalent to the two minima of A˜(J1,J2,J12) in the J12=0 plane in Figure 1, because the generative model, equation 2.6, at J12=0 (at N=2) is equivalent to the mean-field ansatz, equation 3.4, as already alluded to.

According to the discussion following equation 2.52, we do not expect the thermodynamics of the setup 3.9 to be well defined below the critical point, J/T>1, where the surface 3.9 has more than one minimum. In physical terms, the nominal expectation value of the magnetization—which remains vanishing even below the critical point—now becomes decoupled from the set of its typical values and, thus, is no longer descriptive of the microscopic behavior of the system. In formal terms, the corresponding Gibbs energy, the entropy, and the energy all become poorly defined and can be thought of, at best, as multivalued under limited circumstances. We flesh out these notions in what follows; the detailed calculations are relegated to appendix B.

Having two distinct minima on the A˜(m) curve implies the Gibbs energy is no longer well defined. Indeed, on the one hand, the typical values mα are nonvanishing below the critical point already at hα=0. On the other hand, the average magnetization should be zero, at hα=0, by symmetry. This seeming paradox is resolved by noting that symmetry-breaking fields acting on individual spins do not have to be externally imposed but could emerge internally (Goldenfeld, 1992; Lubchenko, 2015); such fields originate from the surrounding spins and come about self-consistently. One may then define a multivalued Gibbs energy, the number of branches determined by the degree of degeneracy of the Helmholtz energy. The two branches of the Gibbs energy—one for each minimum of A˜(m)—are shown in Figure 4 and labeled “G˜+” and “G˜-” as pertinent to the right-hand side and left-hand side minimum of A˜, respectively. Each of the two branches thus corresponds to a restricted Gibbs energy computed for a single phase. Indeed, the differential relation in equation 2.52 can sample the Helmholtz energy only within an individual minimum. Sampling the phase space within an individual minimum can be thought of as vibrational relaxation, a relatively fast process (He & Lubchenko, 2023). In contrast, crossing over to the other minimum requires a discontinuous transition across the mechanically unstable region delineated by the spinodals (see Figures 5 and 9). In terms of kinetics, such discontinuous transitions are slow because they require activation (He & Lubchenko, 2023; see also below).
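The two restricted branches can be sketched by minimizing the tilted surface separately within each basin, which mimics the fast, intra-minimum relaxation just described. A sketch, assuming the m ≡ m1 = −m2, h ≡ h1 = −h2 slice of equation 3.9, where the tilt ∑αhαmα reduces to 2hm:

```python
import math

# Restricted Gibbs branches: a sketch along the slice m = m1 = -m2,
# h = h1 = -h2 (tilt term 2*h*m), assuming the slice form of eq. 3.9.
# Each branch minimizes A~(m) - 2*h*m within its own basin (m < 0 or
# m > 0), mimicking fast intra-minimum relaxation.
J, T = 1.5, 1.0

def A_slice(m):
    mix = (1 + m) * math.log((1 + m) / 2) + (1 - m) * math.log((1 - m) / 2)
    return -J * m * m + T * mix

def G_branch(h, basin):
    # basin = +1 restricts to m > 0, basin = -1 to m < 0
    ms = [basin * i / 1000 for i in range(1, 1000)]
    return min(A_slice(m) - 2 * h * m for m in ms)

for h in (0.0, 0.05, 0.1):
    print(h, round(G_branch(h, -1), 4), round(G_branch(h, +1), 4))
```

At h = 0 the two branches are exactly degenerate; an arbitrarily small field selects one of them.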

Figure 4:

The two branches of the restricted Gibbs energy G˜ are labeled G˜- and G˜+. The equilibrium Gibbs energy G˜eq from equation 3.11 is shown for select values of Nr. The Nr = 64 curve is masked by the mean-field curves. J = 1.5, T = T̄ = 1.

Figure 5:

The magnetization, per spin, as a function of the source field along the slice m = m1 = −m2, h = h1 = −h2, below the critical point. The mean-field curves correspond to the m1>0, m2<0 minimum (red), the m1<0, m2>0 minimum (blue), and the mechanically unstable region separating the spinodals (pink); the portions of this curve where mh<0 are either metastable or unstable. The equilibrium values for the energy function, equation 3.10, are given for several system sizes. T = T̄ = 1, J = 1.5.

Figure 6:

The equilibrium Helmholtz free energy A˜eq for the energy function from equation 3.10, compared with the actual free energy cost, −TlnN(m), to have magnetization m, and the mean-field Helmholtz free energy A˜. The horizontal axis refers to the magnetization along the (1,−1) direction, as in Figure 3. T = T̄ = 1, J = 1.5.

Figure 7:

(a) The internal energy and (b) the entropy. The mean-field energy and entropy corresponding to the right-hand-side and left-hand-side minimum of A˜(m) in Figure 6 are labeled with superscripts “+” and “−,” respectively. J = 1.5. The equilibrium energy and entropy are labeled using the value of Nr they were computed for. (The Nr = 64 curves are masked by the mean-field curves.)

Figure 8:

Parametric graph of the entropy as a function of energy from Figure 11. The dots show the model, discrete energy distributions ΩT(ET) and ΩF(EF) that were used to compute the restricted and the equilibrium quantities, respectively. The blue and orange dots show the locations corresponding to temperature T = T̄. The location, energy-wise, of the false states was parameterized so that they are in equilibrium with the true states at T = T̄. N = 64.

Figure 9:

(a) Signs of the exact polarizations as functions of the source field for the energy function, equation 3.1, are shown using colors. Red and blue areas: m1m2<0. Purple area: m1m2>0. The counterparts of the region boundaries pertaining to the likeliest mean-field magnetizations are shown using dashed black lines. The line connecting the two yellow dots corresponds to a discontinuity in the most probable value of the magnetization and, as such, is a phase boundary. (b) The magnetization m1 as a function of h ≡ h1 = −h2. The solid and dashed lines correspond to the stable and metastable minimum, respectively, of the mean-field free energy surface A˜, at J = 1.5, T = T̄ = 1.

On the other hand, the equilibrium Gibbs energy from equation 2.37 corresponds, by construction, to statistics that are gathered on times longer than any relaxation time in the system, including, in particular, the times of activated escape from individual free energy minima. To compute the equilibrium Gibbs energy corresponding to the mean-field ansatz, we need to implement the constraint 3.5 in the canonical sum over the configurations in equation 2.37, where the energy function is specifically from equation 3.1. To do so, we first recall that the canonical construction implies we are considering a very large number Nr of equivalent noninteracting replicas of the system so as to determine the likeliest mixture of the states of the individual replicas (McQuarrie, 1973). Denote the replicas of spin 1 as ξq, q=1,...,Nr, and the corresponding replicas of spin 2 as ζq, q=1,...,Nr. Each pair (ξq,ζq) can be thought of as the configuration of the spin pair (σ1,σ2) in sample q. The energy per replica is, then, E/Nr = J∑qξqζq/Nr. To implement the constraint in equation 3.5, we need to decouple the fluctuations of spin 2 from the fluctuations of spin 1. To accomplish this, we replace, in the correct sum ∑qξqζq, the actual value of ξq by its average value: ∑qξqζq → ∑q(∑sξs/Nr)ζq. As a by-product of the latter replacement, spin 2 now enters as an average over the replicas and so no further averaging is needed. Thus, we arrive at the following energy function:
(3.10)
We see the mean-field constraint effectively means the replicas in the canonical ensemble now interact with each other. The corresponding equilibrium Gibbs energy reads as follows:
(3.11)
At face value, the energy function in equation 3.10 corresponds to a set of N = 2Nr spins that interact via a very long-range interaction, whereby an individual spin interacts with half of the spins. Still, the interaction depends on the system size N in such a way that the total energy scales linearly with N; thus, it is a proper, extensive energy. Notwithstanding the somewhat artificial nature of the interaction, the energy function in equation 3.10 should actually be a decent approximation for the energetics of a small magnetic domain. Such small domains happen to break ergodicity on sufficiently long timescales to be useful for information storage; thus, their free energy must exhibit at least two minima already for modest values of N. We show that this is the case for the energy function, equation 3.10, in what follows; we will compute the equilibrium values of the basic thermodynamic quantities alongside. Everywhere below, the extensive quantities are shown per spin. We continue focusing on the regime below the critical point, specifically at J = 1.5. We show equilibrium values of the magnetization m(h) in Figure 5 and the corresponding equilibrium Gibbs energy G˜eq(h) in Figure 4, for select values of N = 2Nr. This allows us to compute, via the Legendre transform, the equilibrium Helmholtz free energy A˜eq(m), shown with the dashed line in Figure 6. Also in Figure 6, we show the actual free energy cost −TlnN(m) to have magnetization m = m1 = −m2, where
(3.12)
Here, δ is the Kronecker delta and the quantity m admits by construction a discrete set of equally spaced values, the number of distinct values equal to (Nr+1).

We see that the equilibrium free energies each provide a lower bound on the restricted free energies—convex-up for Gibbs and convex-down for Helmholtz—as they should (Landau & Lifshitz, 1980). Thus, when the distribution of the magnetization becomes bimodal, the equilibrium Helmholtz energy no longer reflects the actual distribution. Instead, the central portion of the equilibrium Helmholtz energy A˜eq becomes a shallow bottom that largely interpolates between the two minima of the restricted Helmholtz energy A˜(m). The mean-field free energy A˜, on the other hand, has a qualitatively correct form and, we see, tends asymptotically to the actual free energy cost −TlnN(m) for large N, as it should. Transitions between the two free energy minima of A˜(m) are accompanied by an extensive change of the total magnetization, ∑sξs − ∑qζq ∼ N, and thus can be induced by varying the external field by a small amount that scales as 1/N; this notion is consistent with Figure 5. Consequently, the variation of the equilibrium Helmholtz energy, per particle, along its shallow bottom scales as 1/N with the system size N and vanishes in the thermodynamic limit N → ∞. The equilibrium susceptibility thus becomes poorly defined even though the susceptibilities of the individual phases each remain well defined and finite.
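The asymptotic agreement between the mean-field A˜ and the actual cost −TlnN(m) can be checked at finite size, assuming N(m) is the Boltzmann-weighted count of replica configurations of equation 3.10 at fixed m1 = m, m2 = −m, so that the degeneracy is a product of two binomial coefficients:

```python
import math

# Finite-size check of the asymptotics above, assuming N(m) in eq. 3.12 is
# the Boltzmann-weighted count of replica configurations of eq. 3.10 with
# fixed magnetizations m1 = m, m2 = -m. The degeneracy at fixed m is a
# product of two binomials, and the energy is E = -J * m**2 * Nr, so the
# per-spin cost -T * ln N(m) / (2 * Nr) should approach the mean-field
# A~(m) per spin as Nr grows.
J, T = 1.5, 1.0

def mf_per_spin(m):
    mix = (1 + m) / 2 * math.log((1 + m) / 2) + (1 - m) / 2 * math.log((1 - m) / 2)
    return -J * m * m / 2 + T * mix

def cost_per_spin(m, Nr):
    k = round((1 + m) * Nr / 2)                    # up xi-spins at m1 = m
    degeneracy = math.comb(Nr, k) * math.comb(Nr, Nr - k)
    E = -J * m * m * Nr
    return (E - T * math.log(degeneracy)) / (2 * Nr)

m = 0.5
for Nr in (16, 64, 256, 1024):
    print(Nr, round(cost_per_spin(m, Nr), 4), "->", round(mf_per_spin(m), 4))
```

The per-spin discrepancy decays roughly as ln Nr/Nr, that is, as a subextensive correction to the free energy.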

Thus, when ergodicity is broken, the notions of free energy and other thermodynamic quantities become poorly defined (see the discussion following equation 2.52). We show the restricted and equilibrium values of the internal energy and the entropy in Figures 7a and 7b, respectively. These graphs directly convey that already for a generative model as simple as the one in equation 3.1, a reduced description does not allow for a Helmholtz machine (Dayan et al., 1995), since the basic thermodynamic quantities, such as the energy and entropy, cannot be properly defined.

In a finite system, the barriers separating distinct free energy minima are finite, even if tall, implying that ergodicity is broken only transiently. Ergodicity is restored on timescales that scale exponentially with the parameters of the problem and can become very long already for modestly sized systems and/or following small temperature variations. We illustrate these notions in appendix C. There we also demonstrate that transitions among distinct free energy minima are rare events that occur via nongeneric sequences of individual spin flips.

Below the critical point, the surface A˜ exhibits two minima separated by a finite barrier. Thus, the machine will be able to reproduce only the patterns belonging to the now-reduced phase space. Which minimum will be found during training is generally a matter of chance; the odds are system specific. For instance, setting J1 at a small, positive value in the generative model, equation 3.3, will create a bias toward the J1>0, J2<0 minimum of the A˜(J1,J2) surface. One can use symmetry considerations (see appendix D) to make a general case that the reduced free energy surface will consist of isolated minima whose curvature is nonvanishing in all directions. This means that the bound states corresponding to the latter minima are compact, shape-wise, in the phase space.

The two minima on the surface A˜(m1,m2), for the model 3.1, correspond to the two “correct” states of the inverter gate. Both minima must be sampled in order for the machine to work adequately. Owing to the one-to-one correspondence between the minima of A˜(J1,J2) and A˜(m1,m2), respectively, the two minima on the A˜(J1,J2) surface must be sampled for the machine to work properly. Sampling multimodal distributions can become computationally unwieldy already for a modest number of variables. Thus, to efficiently sample multiple free energy minima that are separated by barriers, it may be necessary to employ a separate generative model—call it an “expert”—for each individual minimum. For the underlying model 3.1, one such expert, in the form of fields J1>0 and J2<0 in equation 3.4, can be used to retrieve one of the two correct states of the gate, as already mentioned. The other expert, in the form of fields J1<0 and J2>0, will retrieve the remaining correct state.

The requisite number of experts is thus bounded from above by how many minima free energy landscapes will acquire as a result of reduction in description. Actual physical systems provide insight as to how large this number can be. Note that lowering the temperature in a physical system raises the free energy and thus corresponds to a reduction in description. A variety of physical scenarios can unfold as a set of interacting degrees of freedom is cooled. For instance, translationally invariant Ising-spin models with ferromagnetic, short-range coupling will exhibit two minima below the ergodicity breaking transition, irrespective of the system size. The worst-case scenario is arguably represented by some disordered spin systems, such as the mean-field Potts glasses (Kirkpatrick & Wolynes, 1987), in which the number of minima can scale exponentially with the system size in the broken-ergodicity regime. The disorder can also be self-generated—as in actual finite-dimensional glassy liquids—whereby the free energy surface becomes exponentially degenerate below the dynamical crossover (Lubchenko, 2015; Lubchenko & Wolynes, 2007). In the latter case, a reduced description has in fact been achieved (Singh et al., 1985; Baus & Colot, 1986; Rabochiy & Lubchenko, 2012; Lubchenko, 2015) using a trial density profile of a glassy liquid in the form of a sum of narrow gaussian peaks, each centered at a distinct corner of an aperiodic lattice. The number of distinct, comparably stable aperiodic structures self-consistently turns out to scale exponentially with the system size, consistent with experiment (Xia & Wolynes, 2000; Lubchenko, 2015; Lubchenko & Wolynes, 2007). Each such lattice is mechanically stable and corresponds to a distinct expert.

We provide an elementary illustration in appendix D that the greater the reduction in description, the more experts are needed for proper functionality. Conversely, one may say that an individual expert operating on an increasingly naive model will retrieve a smaller subset of the correct configurations. In any event, we have described elsewhere (He, 2022) a protocol to train sets of experts specifically for binary data sets; these findings are also to be presented in a forthcoming submission.

The number of spins N in a description depends on the resolution. Even at relatively low resolutions and, hence, modest values of N, the number of configurations represented in a realistic data set will be much less than the size 2^N of the phase space available to N binary variables. Consequently, the vast majority of the energies in the full description (see equation 2.11) are undetermined. Here we argue that the energies of nonrepresented configurations must be explicitly parameterized and that such parameterizations, as well as retrieval protocols, must satisfy rigid constraints in order to avoid instabilities toward learning or retrieval of nonrepresented configurations. One may think of nonrepresented configurations as false positives.

Denote the number of configurations that are in fact represented in the data set with M0 < 2^N and number them using i=1,...,M0. For the sake of argument, we will sometimes call the represented configurations “true,” while referring to the nonrepresented configurations as “false”:
(4.1)
(4.2)
We stipulate, for concreteness, that the model performs satisfactorily if the total weight of the false states, relative to the total weight of the true states, is less than one:
(4.3)
Here,
(4.4)
Inequality 4.3 should be thought of as imposing an upper bound on the probability of producing a false positive. The actual numerical value of the bound on the right-hand side of equation 4.3 is not essential for what follows.

The contributions of the nonrepresented configurations to the sums in equation 2.11 are indeterminate, as already alluded to. Simply omitting the latter contributions from the sums in equation 2.11 would be formally equivalent to assigning a fixed, vanishing value to their energies, Ei=0, irrespective of the detailed parameterization of the represented states. Instead, we take a general approach whereby we treat the energies of the nonrepresented states as adjustable parameters; their values, then, will be chosen so as to satisfy the constraint in equation 4.3.

For concreteness, we choose a calibration where Zi is set to the number of times configuration i was presented in the data set. (For those many configurations that were not presented in the data set, we treat the corresponding Zi as adjustable parameters.) Also for concreteness, we connect the standard weights and energies, respectively, using a Boltzmann weight-like relation, as in section 3:
(4.5)
where T̄ is a positive constant. It is understood that some of the “true” states can be false positives, while some of the “false” states can be false negatives. For this reason, the two respective sets of energies will generally overlap. For the same reason, we will employ a functional relation Ei=Ei({Zj}) that is not specific to configurations being represented or not, so as not to prejudge the veracity of a pattern on the basis of its being present or absent in any given data set. In any event, the parameterized energy set, equation 4.5, unambiguously defines a generative model of the type in equation 2.6, according to equation 2.11.
Owing to the huge number of configurations, the criterion, equation 4.3, can be profitably recast in free-energetic terms, our next step. We have for the restricted partition functions of the true and false states, respectively:
(4.6)
(4.7)
The individual Boltzmann weights correspond to the weights of respective configurations during Gibbs sampling of the generative model at temperature T; thus, the quantity T may be designated as the “retrieval temperature.” To avoid confusion, we note that the ensemble in equations 4.6 and 4.7 is the conventional canonical ensemble, not the ensemble from equation 2.12. The energies Ei are kept strictly constant, while the temperature T does not set energy units but instead is allowed to take any positive value by construction.
To the partition functions from equations 4.6 and 4.7 there correspond restricted Helmholtz free energies:
(4.8)
(4.9)
The condition, equation 4.3, is then equivalent to stipulating that the true states are thermodynamically stable relative to the false states:
(4.10)
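The equivalence between equations 4.3 and 4.10 is just the identity A = −T lnZ applied to the two restricted partition functions; it can be spot-checked with made-up spectra (the energies below are hypothetical, chosen only for illustration):

```python
import math
import random

# A toy spot-check, with made-up spectra, that the weight criterion of
# equation 4.3 and the free-energy condition of equation 4.10 are the same
# statement: A = -T * ln Z maps the ratio of restricted partition functions
# onto a free energy difference. The energies below are hypothetical.
random.seed(0)
T = 1.0
E_true = [random.uniform(0.0, 1.0) for _ in range(100)]      # few, low-lying
E_false = [random.uniform(5.0, 7.0) for _ in range(10000)]   # many, gapped

def Z(energies):
    return sum(math.exp(-e / T) for e in energies)

A_T = -T * math.log(Z(E_true))
A_F = -T * math.log(Z(E_false))
weight_ratio = Z(E_false) / Z(E_true)

print("weight ratio:", round(weight_ratio, 3), " A_T < A_F:", A_T < A_F)
```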
To satisfy this constraint, two conditions must be met. First, there should be a temperature interval, in which the Helmholtz energy of the true states is lower than the Helmholtz energy of the false states. Second, retrieval must be performed at a temperature from the latter interval. To establish the implications of these constraints for the parameterization of the nonrepresented states, we first introduce the canonical energy and entropy:
(4.11)
(4.12)
where the partition function Z can be either one of the restricted partition functions from equations 4.6 and 4.7, or the full, equilibrium partition function,
(4.13)
In a usual way (Landau & Lifshitz, 1980), equations 4.11 and 4.12 uniquely specify a coarse-grained log-spectrum S(E), with the variable T being the parameter, for any system with a positive heat capacity. Conversely, given a temperature T, the canonical energy E of a single-phase system is determined by the location of the tangent to the curve S(E) whose slope is equal to 1/T:
∂S/∂E = 1/T.  (4.14)
We assume for now that the individual spectra of the true and false states, respectively, each exhibit a positive heat capacity throughout. The energies {Ei} of the model from equation 4.5 are defined in terms of the positive constant T. This automatically fixes the ratio E(T)/T for the restricted and equilibrium partition functions alike, given a specific form of the distribution of the dimensionless quantities {Ei/T}; in other words, the constant T simply specifies the energy units. Consequently, equations 4.10 and 4.12 imply
(4.15)
One can also infer the above inequality directly from equations 4.3 and 4.5 by noting that E/T = −∑_i Z_i ln Z_i / ∑_j Z_j and S = −∑_i Z_i ln(Z_i/∑_j Z_j) / ∑_k Z_k, where the summations are consistently over either the true or the false set of states, respectively.
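The parametric construction of S(E) through equations 4.11, 4.12, and 4.14 can be illustrated numerically. The sketch below uses an ad hoc discrete spectrum (the energies and degeneracies are purely illustrative, not taken from the text) to compute the canonical energy and entropy and to check the tangent condition ∂S/∂E = 1/T by finite differences:

```python
import numpy as np

# Ad hoc discrete spectrum: energy levels E_i with log-degeneracies ln Omega_i.
energies = np.linspace(0.0, 10.0, 50)
log_omega = 0.8 * energies            # hypothetical, monotone log-spectrum

def canonical_E_S(T):
    """Canonical energy and entropy at temperature T (k_B = 1)."""
    log_w = log_omega - energies / T  # ln of Boltzmann weights
    log_Z = np.logaddexp.reduce(log_w)
    p = np.exp(log_w - log_Z)         # normalized canonical weights
    E = np.dot(p, energies)           # canonical energy, cf. eq. 4.11
    S = (E + T * log_Z) / T           # S = (E - F)/T with F = -T ln Z, cf. eq. 4.12
    return E, S

# Check the tangent condition dS/dE = 1/T (eq. 4.14) by finite differences.
T = 2.0
E1, S1 = canonical_E_S(T * 0.999)
E2, S2 = canonical_E_S(T * 1.001)
slope = (S2 - S1) / (E2 - E1)
print(slope, 1.0 / T)                 # the two numbers nearly coincide
```

The identity dS = dE/T holds for any spectrum with a positive heat capacity, which is why the finite-difference slope tracks 1/T regardless of the ad hoc choice of Ω(E).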

Equation 4.15 dictates that the canonical energy of the false states in a robust description must be separated by a nonvanishing, extensive gap from the canonical energy of the true states. Indeed, for M_0 ≪ 2^N, the right-hand side of equation 4.15 is numerically close to N ln 2. We note that for data-set parameterizations obeying condition 4.15, the standard entropy, equation 2.19, is dominated by the represented states, implying the present treatment is internally consistent.

The requisite presence of a nonvanishing, extensive energy gap means that the set of the true states and the set of the false states, respectively, can be thought of as comprising two separate phases; the energy gap corresponds to the latent heat of the transition. Condition 4.10 for the generative model to be able to retrieve the represented configuration is thus equivalent to requiring that the phase comprising the true states be stable relative to the phase composed of the false states. Specifically, the true states will be the stable phase at temperatures T such that
T < T_0,  (4.16)
where T0 is the (phase-transition) temperature at which the two phases are in mutual equilibrium:
F_T(T_0) = F_F(T_0).  (4.17)
Note that T_0 ≥ T, in view of equation 4.15. We illustrate the borderline case T_0 = T in Figure 8, using ad hoc energy distributions for the true and false states, Ω_T(E_T) and Ω_F(E_F), respectively. The latter distributions are displayed in the same figure using dots, while the corresponding dependences of the canonical entropy on the canonical energy are shown with solid lines. (See appendix E for a detailed discussion.) There we also explicitly illustrate that each of the two phases corresponds to a free energy minimum.
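The determination of the coexistence temperature T_0 from equation 4.17 can be sketched numerically. The two-block spectrum below is ad hoc, merely mimicking the situation of Figure 8: a sparse low-energy "true" block and a vastly more numerous high-energy "false" block totaling ~2^N states. A bisection locates the temperature at which the restricted Helmholtz energies cross:

```python
import numpy as np

# Hypothetical two-block spectrum, for illustration only.
N = 64
E_true, lnO_true = np.linspace(0.0, 5.0, 20), np.linspace(0.0, 8.0, 20)
E_false = np.linspace(30.0, 40.0, 20)
lnO_false = np.linspace(30.0, N * np.log(2), 20)   # ~2^N false states in total

def free_energy(T, E, lnO):
    """Restricted Helmholtz energy F = -T ln Z (cf. eqs. 4.6-4.9)."""
    return -T * np.logaddexp.reduce(lnO - E / T)

# Bisection on the sign of F_T - F_F to find T0 of eq. 4.17.
lo, hi = 0.1, 10.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if free_energy(mid, E_true, lnO_true) < free_energy(mid, E_false, lnO_false):
        lo = mid          # true states stable here: T0 lies above
    else:
        hi = mid
T0 = 0.5 * (lo + hi)
print("T0 =", T0)
# Below T0 the true block is the stable phase; above T0, the false block wins
# on entropy, consistent with the discussion surrounding eq. 4.16.
```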

The thermodynamic singularity at T0 defines a dichotomic distinction between the true and false states, even if their respective spectra overlap. Because of its high entropy, the high-T phase can be thought of as corresponding to a generic mixture of 2N components. The low-T phase, on the other hand, is by construction a mixture with a data set-specific composition, whereby some components are represented much more than others. Thus, at the temperature T0, the composition of the mixture undergoes an intrinsically discrete change, consistent with the transition exhibiting a latent heat.

The states within the energy gap [E_T(T_0), E_F(T_0)] are strictly inaccessible in equilibrium because at any temperature, there is at least one state whose free energy is lower. This implies that a subset of the true states, those at E > E_T(T_0), could not be observed in equilibrium. This high-energy flank of the true states corresponds, in the calibration 4.5, to low-frequency, relatively underrepresented configurations from the data set. Although the latter states can be observed, in principle, at temperatures above T_0—as we illustrate in Figure 12b in appendix E—the true states, as a whole, are only metastable at such temperatures. If kept at T > T_0, the machine will spontaneously transition into the false states—and thus will begin to produce nonrepresented configurations. The machine is thermodynamically unlikely to retrieve the represented configurations again unless the temperature is lowered below T_0. Thus, we view the robustness of retrieval as conditional on the machine being confined to the portion of the phase space pertaining to the true states, which implies a breaking of ergodicity.

Figure 10:

Relaxation profiles for equilibrium Monte Carlo simulations. (a) Single-spin autocorrelation function C_σσ(t) ≡ ⟨ξ_s(t + t_0) ξ_s(t_0)⟩, averaged over individual spins and over t_0. The horizontal dashed line indicates the squared location of an individual minimum of Ã(m), the limiting height of the plateau when the escape time from an individual minimum diverges. (b) Multispin autocorrelation function C_mm(t) ≡ ⟨∑_s ξ_s(t + t_0) ∑_q ξ_q(t_0)⟩/N_r². J = 1.5.

Figure 11:

Canonical entropy and energy as functions of temperature for the discrete energy distribution Ω(E) shown as dots in Figure 8. The restricted, single-phase averages over the lower-energy block are labeled T, and those over the higher-energy block are labeled F. The full equilibrium quantities are labeled TF. N = 64.

Figure 12:

(a) Thermodynamic potentials F̃(E) ≡ E − T S(E), where T is an externally fixed temperature, for the true and false states. The restricted entropies are the same as in Figure 8. (b) An expanded view of the low-T phase, illustrating that in order to sample “true” states above E_T(T_0), one must raise the temperature above T_0, and vice versa for the states with E < E_T(T_0). The main graph shows that at T > T_0, the “true” states are, however, only metastable.


Thus, the totality of the true states can be thought of as a collection—or library (Lubchenko & Wolynes, 2004)—of bound states centered at patterns from the data set. The bounding potential is a free energy, since it has an entropic component. The spectrum of the nonrepresented states must be properly parameterized to prevent escape toward those states; such an escape may well occur in an abrupt, avalanche-like fashion, owing to the transition being discontinuous. This notion is consistent with an analysis of US Supreme Court data, due to Gresele and Marsili (2017), who used a spin-based generative model of the type considered here. The latter study associates frequencies of configurations with Boltzmann weights. The nonrepresented states—called “unobserved” in Gresele and Marsili (2017)—are omitted from the sums in the expressions for the coupling constants. According to the discussion in the beginning of this section, omitting unobserved states amounts to pinning their energies at E_i = 0; the latter value happens to be greater than the energies assigned to the observed states. Gresele and Marsili (2017) find that including in the data set configurations whose weight is lower than a certain threshold value causes an instability toward faulty retrieval. Our study suggests that those low-weight states may well be sufficiently close, energy-wise, to the nonrepresented states so as to substantially stabilize the latter.

The stability criterion, equation 4.16, must be satisfied by descriptions irrespective of whether they employ full or reduced sets of coupling constants. This amounts to an additional constraint when optimizing with respect to the values of the coupling constants and/or the number of experts in a reduced description. These conclusions are consistent with earlier studies of associative memory Hamiltonians (AMH) for protein folding. AMH-based generative models have been used to predict three-dimensional structures of native folds of proteins since the 1980s (Friedrichs & Wolynes, 1989; Davtyan et al., 2012). The native structures are extracted from protein-structure databases, while the nonnative states are emulated by placing residues, but generically, within correct structures. The coupling constants are then determined by maximizing the energy gap separating the native and nonnative states, respectively, relative to the width of the spectrum of the nonnative states.

A discussion of practical strategies for finding reasonable reduced sets of coupling constants is beyond the scope of this article. Still, the stability criterion, equation 4.10, appears to provide some insight. We note that a single spin σ, or any product of distinct spins ∏_{i=1}^n σ_{α_i} (n ≥ 1), vanishes if averaged over the full set of 2^N states—this is a simple consequence of equation 2.11. For energy gaps ΔE significantly greater than the respective widths of the spectra of the true and false states, one may approximately write, according to equation 2.11,
(4.18)
up to a multiplicative constant, where the averaging is over the represented states. Thus, one may use Hebbian learning rules for initial screening of important interactions.
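As an illustration of such Hebbian screening, the sketch below builds a synthetic binary data set around two hypothetical prototypes (all parameters are ad hoc, purely for illustration) and estimates pair couplings from the correlations over the represented states, in the spirit of equation 4.18:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 8, 500
# Two hypothetical prototype patterns; spins take values +1/-1.
prototypes = np.array([[1] * N, [1] * 4 + [-1] * 4])
data = prototypes[rng.integers(0, 2, size=M)].copy()
flip = rng.random((M, N)) < 0.1        # 10% independent noise per spin
data[flip] *= -1

# Hebbian estimate of pair couplings: J_ab ~ <sigma_a sigma_b> over the
# represented states, up to a multiplicative constant (cf. eq. 4.18).
J = data.T @ data / M
np.fill_diagonal(J, 0.0)

# Spins 0 and 1 agree in both prototypes (strong coupling expected);
# spins 0 and 7 agree in one prototype only (coupling averages out).
print(J[0, 1], J[0, 7])
```

Large-magnitude entries of J flag candidate interactions worth retaining in a reduced description; the screening is only a starting point, since Hebbian estimates ignore the false-state spectrum altogether.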

Finally, we briefly comment on data sets and/or parameterizations thereof where the distribution {Zi} is multimodal. In such cases, the true states themselves may be best thought of as a collection of distinct phases whose stability relative to each other and to the false states will depend on temperature. When distributed, the numbers Zi can be profitably thought of as numerical labels. See Marsili et al. (2013), Haimovici and Marsili (2015), and Cubero et al. (2019) for an in-depth discussion of such labeling schemes.

We have considered acquisition of knowledge in the form of a generative model that operates on binary variables. The generative model assigns an energy value to each of the 2^N configurations of N such binary variables. The expression for the energy has the functional form of a high-order Boltzmann machine (Sejnowski, 1986) but employs a non-Hebbian training protocol. We explicitly calibrate the weights of individual configurations within the data set and assign separate energy references to individual configurations of the machine. Thus, we explicitly treat learning as contextual.

We have built a conjoint free energy surface that can guide, in principle, both training and retrieval. The free energy is a function of the coupling constants; it is uniquely minimized by some complete set of coupling constants whose size can be as large as 2^N. In practice, the set of the coupling constants must be reduced in size from the value of 2^N, with high-order couplings likelier to be removed because of their multiplicity. The resulting free energy can be thought of as a cross-section of the original surface along a submanifold of the original space of the coupling constants. We find that when evaluated within such submanifolds, the free energy has not one but multiple minima. The latter degeneracy is connected to the degeneracy of the free energy as a function of coarse-grained degrees of freedom. This apparent connection between the free energies for learning and retrieval, respectively, can be traced to bounds on the free energy derived early on by Gibbs. We have seen that reduction in description can be thought of as a mean-field approximation.

We have argued that in a consistent treatment, one must explicitly parameterize the energies of configurations not represented in a data set; they cannot simply be regarded as indeterminate. Because of their huge number, the nonrepresented configurations must be destabilized, energy-wise, by an extensive amount relative to the represented states. Thus, the nonrepresented configurations comprise a distinct, high-temperature phase.

Reduction in description leads to ergodicity breaking, which plays a dual role in acquisition of knowledge. Learned patterns can be thought of as a library of bound states centered at configurations from the data set. All other patterns comprise what is essentially a high-entropy continuum. Ergodicity breaking prevents escape into the continuum and, hence, is essential to discriminating correct patterns. At the same time, kinetic barriers separating distinct free-energy minima in the low-temperature phase will result in kinetic bottlenecks for both learning and retrieval. The latter aspect of the ergodicity breaking is detrimental; it also seems to be an inherent feature of contextual learning. To mitigate these detrimental effects, one may have to resort to using separate generative models for distinct free energy minima. The number of minima—and hence the demand for more experts—will typically increase with the degree of reduction in description. It is conceivable that the thermodynamic potentials considered here can be employed to rate the veracity of an individual expert by using the depth of the corresponding free energy minimum as a metric.

The above notions appear to be consistent, for instance, with the limited success machine learning–based force fields have had in predicting structures of inorganic solids. Indeed, the bonding preferences for most elements and the bond orders tend to switch among several, discretely different patterns, in a fashion similar to that shown in Figure 5. An example important in applications is the competition between the tetrahedral and octahedral bonding patterns in intermetallic alloys (Zhugayevych & Lubchenko, 2010). This near discreteness results from cooperative processes (Golden et al., 2017) that are similar to the processes causing the ergodicity breaking we discussed here. Thus, we anticipate that each distinct bonding pattern will require a separate generative model. For instance, for three atoms and two generative models per atom, there would be eight different potential outcomes for the ground state. Furthermore, the analysis in section 4 suggests that successful generative models would have to be trained on at least two sets of structures: One set corresponds to low-energy ordered structures, the other set to high-energy liquid structures. We do not, however, have access to liquid structures at present, which one might view as a fundamentally difficult aspect of the problem of predicting the structure of inorganic solids. This important problem (Maddox, 1988) remains unsolved.

Our work has focused on thermodynamic aspects of learning and retrieval. The corresponding kinetics are highly sensitive to the detailed form of the generative model and the sampling moves. A generic sampling move, during retrieval, will place the system into one of the false states essentially always, because of their multiplicity. The probability of sampling out of a false state into a true state is very low, roughly M_0/2^N, corresponding to an entropic free energy barrier of order N T ln 2. This entropic bottleneck is analogous to Levinthal’s paradox of protein folding (Levinthal, 1969; Wolynes, 1997a), according to which an unfolded protein should never find the native state, even if the latter is thermodynamically favorable. Levinthal’s paradox is, however, resolved by noting that the landscape of the nonnative states of actual proteins is not flat. Instead, its overall shape is funnel-like and minimally frustrated, notwithstanding some amount of roughness (Bryngelson et al., 1995; Onuchic et al., 1997; Wolynes, 1997b), a notion at the heart of spectrum parameterization in associative-memory Hamiltonians (Davtyan et al., 2012). Likewise, it appears that machine learning will work well only if the data set itself allows for a funneled free energy landscape or a combination of a modest number of such funnels.

The notions of calibration put forth in section 2 can be compared with how one counts states in thermochemistry. In chemical contexts, one may use the semiclassical result for the density of states to normalize the partition function for a set of classical particles (Landau & Lifshitz, 1980). Thereby, one counts configurations in the convention that there is one state per the volume formed by the thermal de Broglie wavelength of the particle. Since the latter wavelength is inversely proportional to the square root of the mass, one obtains that per every state of the ²⁰Ne atom, there are (40/20)^{3/2} ≈ 2.83 states of the ⁴⁰Ar atom. But this becomes an entirely moot point in the classical regime, in which all thermodynamic properties of the system are strictly independent of mass! Consistent with this, the entropy of a classical system can be defined only up to an additive constant, and so only entropy differences are meaningful. The relevant length scale for counting states in a classical gas is the typical spacing between like particles since a particle identity can be established only if no other particles of the same species are nearby (Lubchenko, 2015, 2020). Thus, the chemist ties standard states to densities at the onset of the calculation; equation 2.12 has a similar purpose.
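The quoted ratio follows directly from the m^{3/2} scaling of the state count (the inverse cube of the thermal de Broglie wavelength):

```python
# States per unit volume scale as lambda^{-3} ~ m^{3/2}, since the thermal
# de Broglie wavelength scales as 1/sqrt(m). Hence the 40Ar-to-20Ne ratio:
ratio = (40 / 20) ** 1.5
print(ratio)   # ~2.83
```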

When correlations, if any, among fluctuations of the weights xi are neglected, one may determine the susceptibilities pertaining to the thermodynamic potential A˜(M) from equation 2.29 by Taylor-expanding the latter potential near its minimum:
(A.1)
Apart from the soft constraint stemming from the normalization condition ∑_i x_i = 1, entering as the second sum on the right, we see that the susceptibilities for the individual weights x_i have the form characteristic of number fluctuations in an ideal gas, that is, δx_i ∝ x_i^{1/2}. Equations 2.24 and A.1 yield the following constitutive relations for weak deviations from the standard model:
(A.2)
We see that the energies E_i are similar to the negative of the chemical potential. To avoid ambiguity, we note that unlike here, the standard state in chemical contexts usually corresponds to pure substances, x_i = 1.
One can supplement the generic thermodynamic potential, equation 2.29, with additional terms so as to encode correlations among the variations of the weights xi, if any. At the quadratic level, one thus obtains
(A.3)
where we use the label “(c)” to distinguish the potential modified for correlations from the basic form in equation 2.29. Consequently, the constitutive relations become
(A.4)
which supersedes the simpler relation in equation A.2. We have presented the correction as a multiplicative factor under the logarithm, in deference to the empirical laws due to Henry and Raoult, respectively (Berry et al., 1980; Silbey et al., 2004).
Now suppose, for the sake of argument, that after we have added the correction, a quadratic expansion of the potential A˜(c) around its minimum defines the associated probability distribution adequately. The original free energy A, then, must be corrected according to
(A.5)
(A.6)
where the quantities
(A.7)
comprise, by construction, the coefficients of the combined quadratic forms from equations A.1 and A.3; δ_ij is the Kronecker delta. The matrix α_ij and its inverse thus determine the correlations among fluctuations of the weights x_i at the gaussian level (Landau & Lifshitz, 1980),
(A.8)
and of the energies, according to
(A.9)
We reflected, in the first equality of equation A.8, that the standard values xi of the weights are fixed by construction. It is understood that the matrix α is positive definite.
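The gaussian-level relation of equation A.8 can be checked numerically: for a quadratic potential (1/2) ∑_ij α_ij δx_i δx_j with positive-definite α, the fluctuations obey ⟨δx_i δx_j⟩ = (α^{-1})_ij. The matrix below is an arbitrary positive-definite example, not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((3, 3))
alpha = B @ B.T + 3 * np.eye(3)    # positive definite by construction

# Fluctuations at the gaussian level: covariance equals alpha^{-1}.
cov = np.linalg.inv(alpha)
samples = rng.multivariate_normal(np.zeros(3), cov, size=200_000)
emp_cov = samples.T @ samples / len(samples)
print(np.max(np.abs(emp_cov - cov)))   # small sampling error
```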

The coefficients γij in the quadratic expansion in equation A.3 imply nontrivial correlations among fluctuations of the weights. One may imagine how such correlations can arise owing to intrinsic uncertainties in calibrating detectors. Consider, for instance, Shockley’s setup of a self-guiding missile or face-recognition device (Brock, 2021), in which the image collected by the device’s camera is passed through a film containing the image of the intended target. The accuracy of the aim is assessed by measuring the intensity of the light that has passed through the film. One must set a separate intensity standard for each individual target or even the very same target depending on the lighting conditions. A universal device, capable of processing multiple images and/or lighting conditions, would then require a floating calibration scheme for the input. Ideally the intensity standard should vary smoothly with variations in the image. Incidentally, “elastic” algorithms to align or match images have been discussed since the early 1980s (Burr, 1981; Moshfeghi, 1991). When alignment of two images requires deletions or insertions, the latter may be thought of as “lattice defects,” by analogy with continuum mechanics.

The standard values x_i are set by the calibration convention for the outputs. The choice of the source fields E_i remains flexible and, hence, can be used to implement a particular floating-calibration scheme for the inputs. The reference values E_i should be equal to each other within an error that, ideally, increases smoothly with the difference between two supposedly similar images, according to an adopted similarity criterion. (One hopes that the cost of the defects, if any, does not overwhelm the cost stemming from purely elastic distortion of defectless portions of the image.) Thus, roughly, ⟨(E_i − E_j)²⟩ ∝ (σ_i − σ_j)², consistently for all pairs (i, j) that are neighbors in the Hamming space of configurations, (σ_i − σ_j)² < L, where L is some judiciously chosen cutoff distance. The latter convention is analogous to the setup of a scalar field theory defined on a discrete lattice (Itzykson & Zuber, 2012); hereby the standard E_i is the field itself, while the lattice points comprise the (N-dimensional) Hamming space of the configurations represented in the data set.

The standard potentials E_i and E_j of two configurations that are farther apart than the cutoff distance L are still correlated, but indirectly, through chains of neighbors (in the Hamming space). The degree of correlation is problem specific. In continuum mechanics contexts, the average in equation A.9 tends to a steady value in dimensions three and higher but diverges logarithmically and linearly with the distance in two and one spatial dimensions, respectively, or in any dimension when the shear modulus vanishes (Landau & Lifshitz, 1980). In any case, we conclude that owing to intrinsic uncertainties in input calibration, the coupling constants γ_ij will be nonvanishing. Finally, we note that the local nature of the standard state E_i is formally analogous to the locality of gauge fields in field theory (Itzykson & Zuber, 2012) or, for instance, of the Berry phase in quantum mechanics (Berry, 1984). Variations in such gauge fields amount to long-range interactions among local degrees of freedom. Specifically, in elastic continua, whether degenerate or not, such interactions are of the dipole-dipole variety (Bevzenko & Lubchenko, 2009, 2014) but can be screened in the presence of fluidity (Lemaître et al., 2021).

There are two alternative methods to calculate the Gibbs energy, which are equivalent in an ergodic system. In one method, one evaluates the Gibbs energy starting from the Helmholtz energy, equation 3.9, and then using the Legendre-transform prescription from equations 2.50 and 2.52. One thus obtains
(B.1)
The quantities (−J m_α) inside the brackets, when m_α ≠ 0, can be thought of as internal fields that emerge self-consistently. Such internal fields do emerge in non-mean-field settings as well, of course, but require cooperative effects (Goldenfeld, 1992).

When the Helmholtz energy Ã(m_1, m_2) has two minima, equation B.1 has not one but three solutions for a vanishing external field h_α = 0: two solutions correspond to the minima themselves and one solution to the saddle point separating the minima. The two minima and the saddle point all lie within the slice m_1 = −m_2 from Figure 3. It will suffice for our purposes—and will simplify the prose a great deal—to work along the latter slice. We use m ≡ m_1 = −m_2 and h ≡ h_1 = −h_2 as our variables. Below the critical point, the Ã(m) curve exhibits two inflection points m_sp, whereby (∂²Ã/∂m²)_{m_sp} = 0. These points—conventionally called the “spinodals”—delineate the stability limits, since the susceptibility is negative between the spinodals: ∂m/∂h = (∂h/∂m)^{−1} = (∂²Ã/∂m²)^{−1} < 0. We denote the locations of the spinodals pertaining to the positive and negative minimum as m_sp^+ and m_sp^−, respectively. When m_sp^+ = −m_sp^− > 0, equation B.1 will have three solutions within an h interval of nonvanishing width, whose left-hand-side and right-hand-side boundaries are determined by solving equation B.1 with m set at m_sp^+ and m_sp^−, respectively. This is illustrated in Figure 5 with the m_MF curves. Of the three solutions of equation B.1, we will consider the two stable solutions pertaining to the minima. Hereby the magnetization is subject to a bimodal distribution, the more likely mode corresponding to the deeper minimum of Ã.
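The three-solution structure and the spinodals can be sketched with the standard Curie-Weiss self-consistency condition m = tanh[(Jm + h)/T], used here as a stand-in for equation B.1 along the chosen slice (this specific functional form is an assumption for illustration, not quoted from the text):

```python
import numpy as np

J, T = 1.5, 1.0   # ad hoc values, below the critical point (T < J)

def roots(h):
    """All solutions of m = tanh((J*m + h)/T), located via sign changes."""
    m = np.linspace(-0.9999, 0.9999, 200000)   # even count: avoids m = 0 exactly
    f = m - np.tanh((J * m + h) / T)
    sign_flip = np.where(np.diff(np.sign(f)) != 0)[0]
    return [0.5 * (m[i] + m[i + 1]) for i in sign_flip]

# Spinodal magnetization: curvature of the free energy vanishes where
# (J/T)(1 - m^2) = 1, i.e., m_sp = (1 - T/J)^(1/2).
m_sp = np.sqrt(1 - T / J)
print(len(roots(0.0)))   # three roots at h = 0: two minima and a saddle
print(len(roots(1.0)))   # a single root once the field tilts the surface enough
print(m_sp)
```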

To elucidate the nature of the mean-field constraint—which causes ergodicity breaking at sufficiently low temperatures—we first juxtapose, in Figure 9a, the signs of the magnetizations mα as functions of the source fields hα for the exact and mean-field solution, respectively, of the generative model, equation 3.1. For the exact solution, we color-code the regions of positive and negative mα with red and blue, respectively. The purple areas, then, show where the exact values of the two typical polarizations m1 and m2, respectively, have the same sign. For the mean-field solution, possible values of the magnetizations are determined by the positions, in the (m1,m2) plane, of the minima of the tilted free energy surface A˜(m1,m2)-m1h1-m2h2, where the fields hα are treated as constants (see the discussion following equation 2.52). We show, in Figure 9a, the signs for the likelier magnetization pattern—the one corresponding to the deeper minimum of the tilted surface. The corresponding boundaries are shown using dashed lines; we see they lie rather close to their exact counterparts. Unlike the signs, the magnitudes of the exact and mean-field magnetizations, respectively, show a qualitatively different behavior except when the source fields are large enough for the (tilted) mean-field free energy surface to exhibit just one minimum. (We reiterate that the exact free energy always has just one minimum.) Consequently, the likeliest values of mean-field magnetizations experience a discontinuity when the two competing minima of the free energy surface are exactly degenerate. Conditions for such degeneracy are met along a substantial segment of the h2=h1 line shown in Figure 9a as the straight dashed line connecting the two yellow dots. The latter segment thus represents a phase boundary. 
An elementary calculation shows that the ends of the latter phase boundary are located at (±h_c, ±h_c), where h_c = J m_0 + T atanh(m_0) and m_0 ≡ (1 − T/J)^{1/2}, while the mean-field m_α = 0 lines are given by the functions h_α/T = atanh(h_β/J), |h_α| ≥ h_c.

The distinct difference between the likeliest magnetizations on the two opposite sides of the phase boundary—the m_1 component is shown in Figure 9b—is an instance of hysteresis, a classic signature of broken ergodicity. The discontinuity across the phase boundary implies that a substantial region of phase space around the origin m = 0 is strictly avoided for sufficiently large values of the coupling J, as a result of the mean-field constraint.

An alternative way to compute the Gibbs energy is directly through the equilibrium partition function with an added source field, equation 3.11. The results of this calculation for the exact and mean-field case, respectively, are shown in the main text. Here we only verify that the setup in equation 3.11 yields the mean-field description, equation 3.9, in the N_r → ∞ limit. First we note a Hubbard–Stratonovich formula, equation 27.55 from Schulman (2012), that can be used to uncouple the two factors in the product in equation 3.10:
(B.2)
Here, d²η = d(Re η) d(Im η) corresponds to integration in the complex plane, and the integration variables η and η* are treated as independent. One may now straightforwardly sum over the spin states to obtain
(B.3)
The N_r → ∞ asymptotics can be readily evaluated by saddle-point integration. The locations of the stationary points (η_0, (η*)_0) are solutions of the following system of equations:
(B.4)
To avoid confusion, we note that generally (η*)_0 ≠ (η_0)*. To each saddle point there corresponds a Gibbs energy whose value, in the leading order in N_r, equals
(B.5)
Comparing the derivatives of equations 3.11 and B.3, respectively, with respect to the source fields allows one to identify (η*)_0 as m_1 and (−η_0) as m_2, in the N_r → ∞ limit. Thus, equation B.4 is identical to equation B.1, while the branches G̃_+ and G̃_−, respectively, of the restricted reduced Gibbs energy must be identified with the two stationary points of the integrand in equation B.3. Consistent with this identification, the formula ln 2cosh x = −{(1 + y) ln[(1 + y)/2] + (1 − y) ln[(1 − y)/2]}/2 + xy, where y ≡ tanh x, allows one to see that the Gibbs energy per replica, G̃/N_r from equation B.5, when computed for an individual branch, equals precisely the restricted Gibbs energy G̃ from equation 2.50. (The equilibrium value G̃_eq, in the N_r → ∞ limit, is equal to the lowest of the set of G̃’s from equation B.5, as computed for the full set of the stationary points.)
In a finite system, the ergodicity is eventually restored. To this end, we show in Figure 10 autocorrelation functions for two select observables; the time is measured in steps of a Monte Carlo simulation. In panel a, we display the single-spin autocorrelation function C_σσ(t) ≡ ⟨ξ_s(t + t_0) ξ_s(t_0)⟩, where the averaging is over the location t_0 of the sampling window and over the spins. Past a certain, modest value of the system size, we clearly observe two distinct, time-separated processes. The short-time process corresponds to the vibrational relaxation within an individual free energy minimum; call the corresponding relaxation time t_vib. For t > t_vib, the correlation function temporarily settles at a plateau value ⟨ξ_s(t + t_0) ξ_s(t_0)⟩ ≈ ⟨ξ⟩_vib², where the average ⟨ξ⟩_vib ≠ 0 pertains to an individual minimum and t is less than the typical escape time t_c from a minimum. (⟨ξ_s(t + t_0) ξ_s(t_0)⟩ → ⟨ξ⟩_vib² for t_c → ∞ and long times t, t_vib < t < t_c.) In the long term, t > t_c, the system is, however, able to transition between the two minima; as a result, the average magnetization eventually attains its equilibrium value of zero, ⟨ξ⟩ = 0. This causes the correlation function to decay to zero, too. The interminimum dynamics is sometimes called configurational relaxation (He & Lubchenko, 2023; Lubchenko & Wolynes, 2007; Lubchenko, 2015). The appearance of timescale separation between the vibrational and configurational relaxation, respectively, implies ergodicity is broken, even if transiently. The length of the plateau then reflects the temporal extent of the ergodicity breaking. The latter extent scales exponentially with the parameters of the problem and can become very long even for modestly sized systems and/or following small temperature variations. We confirm the cooperative nature of ergodicity-restoring transitions in Figure 10b, where we display the multispin autocorrelation function C_mm(t) defined in the caption of Figure 10. The vibrational component of the relaxation is largely averaged out at the onset, implying the overall relaxation is mostly due to interminimum transitions.
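The separation between intraminimum and interminimum relaxation can be illustrated with a minimal Metropolis simulation. The model below is not the replicated two-spin system of the text but a small Curie-Weiss analog with the same two-minimum free energy structure; all parameters are ad hoc:

```python
import numpy as np

# Curie-Weiss (mean-field Ising) sketch: E = -(J/2N) (sum_s sigma_s)^2.
rng = np.random.default_rng(2)
N, J, T = 20, 1.5, 1.0
spins = np.ones(N)
m_trace = []
for step in range(200_000):
    s = rng.integers(N)
    M = spins.sum()
    # Energy change for flipping spin s, from E(M) = -(J/2N) M^2:
    dE = -(J / (2 * N)) * ((M - 2 * spins[s]) ** 2 - M ** 2)
    if dE <= 0 or rng.random() < np.exp(-dE / T):
        spins[s] *= -1
    m_trace.append(spins.mean())

m_trace = np.array(m_trace)
short = np.abs(m_trace[:2_000]).mean()   # early window: confined near one minimum
long_ = np.abs(m_trace.mean())           # long run: interminimum flips average out
print(short, long_)
```

Over a short window the magnetization stays near one minimum of the free energy, while the long-run average relaxes toward the ergodic value of zero, mirroring the plateau-then-decay shape of C_σσ(t) discussed above.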

Below the critical point, the minima of the surface Ã(m_1, m_2) from equation 3.9 are strictly degenerate when J_1 and J_2 vanish. The latter degeneracy is dictated by the invariance of the product σ_1σ_2 with respect to flipping the two spins at the same time. Yet the latter symmetry has another consequence: that the average magnetizations m_α must all vanish at all temperatures. At the same time, the operation (σ_1, σ_2) → (−σ_1, −σ_2) is intrinsically discrete. Thus, the symmetry breaking signaled by the emergence of the two minima in Ã(m_1, m_2)—each of which corresponds to nonvanishing magnetizations m_α ≠ 0—must also be of the discrete kind. Consequently, no Goldstone modes (Goldstone et al., 1962) appear as a result of the symmetry breaking; the newly emerged minima of the free energy must be separated by a barrier. The barrier can be made small near criticality, if present, but criticality is ordinarily observed within a manifold of vanishing volume in the phase space and thus is rare.

One may generalize the above discussion to higher-order interactions by replacing the object σ2 in model 3.1 with the object σ2σ3, while keeping the overall coupling constant positive. The resulting standard generative model,
E = -J σ1σ2σ3,   J > 0,   (D.1)
now corresponds to the logic operation XOR:
          σ3 = +1   σ3 = -1
σ2 = +1     +1        -1
σ2 = -1     -1        +1      (D.2)
where the labels of the rows and columns correspond to the respective states of any two spins while the entries in the body of the table correspond to the states of the remaining spin.
The product σ2σ3 can be presented as a direct sum Z₂ ⊕ Z₂. This amounts to the three-body energy function, equation D.1, being a sum of two equivalent, noninteracting replicas of the two-body energy function from equation 3.1. Thus, one may consider a trial description with two couplings, J1 and J23,
E = -J σ1(σ2σ3) - J1 σ1 - J23 σ2σ3,   (D.3)
where the compound object σ2σ3 can be treated as a single spin, while the overall free energy is lowered by T ln 2 relative to a two-spin system proper. Fluctuations of spins 2 and 3 are correlated with each other, but not so with fluctuations of spin 1:
⟨σ1 σ2σ3⟩ ≈ ⟨σ1⟩⟨σ2σ3⟩.   (D.4)
The preceding discussion can be repeated to see that the factorization scheme, equation D.4, will result in a pair of identical, doubly degenerate free energy surfaces. Interactions of order four and higher can be treated analogously. In effect, correlations of the respective rank will be approximated as products of lower-rank correlation functions, a common mean-field approximation (Zubarev, 1960).
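The three-spin construction is small enough to check by brute-force enumeration. The sketch below assumes the three-body energy takes the reconstructed form E = -J σ1σ2σ3; it verifies that the ground states realize the XOR pattern σ1 = σ2σ3 and that the symmetry-averaged factorization ⟨σ1⟩⟨σ2σ3⟩ vanishes even though ⟨σ1σ2σ3⟩ = tanh(J/T) does not, so the factorization is meaningful only within a broken-symmetry minimum.

```python
import itertools
import math

J, T = 1.0, 0.5
states = list(itertools.product([-1, 1], repeat=3))

def E(s):
    # Assumed three-body energy (eq. D.1 with sigma2 -> sigma2*sigma3)
    return -J * s[0] * s[1] * s[2]

E_min = min(E(s) for s in states)
ground = [s for s in states if E(s) == E_min]   # XOR pattern: s1 = s2 * s3
print("ground states:", ground)

Z = sum(math.exp(-E(s) / T) for s in states)
avg = lambda f: sum(f(s) * math.exp(-E(s) / T) for s in states) / Z

c123 = avg(lambda s: s[0] * s[1] * s[2])        # equals tanh(J/T) by direct summation
c1, c23 = avg(lambda s: s[0]), avg(lambda s: s[1] * s[2])
print(c123, c1 * c23)   # the unrestricted factorization misses the 3-body correlation
```

The vanishing of ⟨σ1⟩ and ⟨σ2σ3⟩ follows from the simultaneous-flip symmetries of the three-body energy, mirroring the two-spin case; restricting the averages to one free energy minimum restores a nonzero product.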

When present, Goldstone modes imply that the free energy minima below the symmetry-breaking transition have a vanishing curvature along one or more directions in the order-parameter space. We have seen that such noncompact free energy minima would be atypical for binary data sets, indicating that the ergodicity breaking is of the harshest type possible.

For many problems, treating digitized data sets as binary is arguably gratuitous, especially when the underlying problem is continuous. Even in such cases, however, Goldstone modes would be absent. This lack of Goldstone modes can be viewed, rather generally, as a consequence of the symmetry of the contextual ensemble 2.12 with respect to the gauge transformation E_i/T → E_i/T + δ_i, ln Z_i → ln Z_i - δ_i. When gauged symmetries are broken, candidate Goldstone excitations become gapped (Anderson, 1984; Itzykson & Zuber, 2012). Now, a distinct variety of low-frequency modes can arise in models with translationally invariant, short-range forces when two or more phases coexist and are in near equilibrium with each other. Here the system breaks up into regions each occupied by an individual phase, a phenomenon called "spinodal decomposition" (Goldenfeld, 1992; Bray, 1994). Interfaces separating these regions can often deform and move about with relative ease. We view such situations as coincidental because the conditions for phase equilibrium can be fulfilled only within a manifold of vanishing volume in the phase space.

The restricted entropy S_F(E_F), corresponding to equations 4.7, 4.11, and 4.12, can be parameterized to be a strictly convex-up function by construction. Assume for now that the entropy S_T(E_T) of the true states, computed using equations 4.6, 4.11, and 4.12, is also a strictly convex-up function. (We use distinct variables, E_T and E_F, for the energies of the true and false states because their respective spectra generally overlap.) When the energy gap is at its lowest allowed value, corresponding to the equality in equation 4.15, the two phases are in mutual equilibrium. Indeed, (∂S_T/∂E_T) = (∂S_F/∂E_F) = 1/T when evaluated at E_T(T) and E_F(T), respectively. Note that equation 4.15, at equality, expresses the familiar double-tangent construction for phase equilibrium in the microcanonical ensemble (Lubchenko, 2008, 2020).

The two circles in Figure 8 correspond to the temperature of mutual equilibrium for the borderline gap. The mutual arrangement of the true and false states in Figure 8 represents the borderline case: moving the false states by any amount toward lower energies, while keeping Ω_T(E_T) fixed, would make them more stable than the true states. Conversely, one is allowed to use a gap that is greater than the one we used in Figure 8; the borderline value thus bounds the equilibrium temperature from below. Now suppose that one has settled on a specific spectrum for the false states that satisfies the constraint 4.15. The two phases will be at equilibrium at a temperature T_0 from equation 4.17. Because of the discrete change of the entropy at the transition, ΔS ≡ S_F(T_0) - S_T(T_0) > 0, the transition is discontinuous and, furthermore, exhibits a latent heat equal to T_0 ΔS.

Criterion 4.16 can be lucidly visualized by plotting, on the same graph, the thermodynamic potential F̃(E) ≡ E - T S(E) for each of the individual phases, where the temperature T is now regarded as a fixed, externally imposed parameter. For a single-phase system, the potential F̃(E), as a function of E, is uniquely minimized at the energy E such that ∂F̃/∂E = 1 - T ∂S/∂E = 0, whereby the internal temperature (∂S/∂E)⁻¹ becomes equal to T (Lubchenko, 2020). The depth of the minimum is equal to the equilibrium Helmholtz energy A(T) = E(T) - T S(T). During phase coexistence, there is a separate F̃(E) minimum for each phase; the deepest minimum corresponds to the stable phase. We observe directly in Figure 12a that at T < T_0, the true states are more stable than the false states, and vice versa at T > T_0.
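The F̃(E) construction is straightforward to reproduce numerically. In the sketch below, the two phases are given toy concave entropies, parabolic in E, with all parameters assumed rather than taken from equations 4.6 and 4.7; they are chosen so that the coexistence temperature is T_0 = 1. The deeper minimum of F̃(E) = E - T S(E) then switches from the true to the false phase as T crosses T_0.

```python
import numpy as np

# Toy concave (convex-up) entropies, S_i(E) = s_i - (E - a_i)^2 / (2 w);
# all numbers are illustrative stand-ins, chosen so that T0 = 1.
w = 1.0
phases = {"true":  dict(a=0.0, s=1.0, E=np.linspace(-3.0, -0.01, 2000)),
          "false": dict(a=2.0, s=3.0, E=np.linspace(-1.0,  1.99, 2000))}

def F_tilde_min(p, T):
    """Depth of the minimum of F~(E) = E - T S(E) for one phase at fixed T."""
    S = p["s"] - (p["E"] - p["a"]) ** 2 / (2 * w)
    return np.min(p["E"] - T * S)

for T in (0.8, 1.2):   # temperatures straddling the coexistence point T0 = 1
    depths = {name: F_tilde_min(p, T) for name, p in phases.items()}
    stable = min(depths, key=depths.get)
    print(f"T = {T}: stable phase is '{stable}'")
```

For these parabolic entropies the minimum condition ∂S/∂E = 1/T can also be solved by hand, giving depth a_i - w/(2T) - T s_i per phase; equating the two depths recovers T_0 = 1, which the grid search reproduces.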

The equilibrium energy E_eq(T) and entropy S_eq(T) correspond to the full partition function Z_TF from equation 4.13; they are shown in Figures 8 and 11 with dashed lines. We observe that E_eq(T) and S_eq(T) each undergo an abrupt variation within a narrow temperature interval already at the modest system size N = 64. It is straightforward to show that in the thermodynamic limit, N → ∞, the equilibrium energy and entropy develop a strict discontinuity at T = T_0, consistent with Figure 11, while the parametric equilibrium entropy S_eq(E_eq) asymptotically tends, within the gap, to the common tangent of the respective entropies of the pure phases, consistent with Figure 8. In any event, the equilibrium S_eq(E_eq), which is a convex-up envelope of the two restricted entropies, does not pertain to either one of the individual phases when E ∈ [E_T(T_0), E_F(T_0)]. In summary, neither the entropy nor the energy can be regarded as a one-valued state function during phase coexistence, analogously to the discussion of ergodicity breaking in section 3.
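The sharpening of E_eq(T) with system size can be illustrated with a toy two-phase partition function of the assumed form Z_TF = exp[N(s_T - e_T/T)] + exp[N(s_F - e_F/T)], with illustrative per-spin parameters chosen so that coexistence occurs at T_0 = (e_F - e_T)/(s_F - s_T) = 1; this is a stand-in for equation 4.13, not the model itself. The crossover in the energy per spin narrows rapidly with N, tending to a strict discontinuity as N → ∞.

```python
import numpy as np

# Assumed per-spin energies and entropies of the two phases; coexistence at T0 = 1
eT, sT, eF, sF = 0.0, 0.1, 1.0, 1.1

def e_eq(T, N):
    """Equilibrium energy per spin from the full Z_TF, via the two phase weights."""
    xT, xF = N * (sT - eT / T), N * (sF - eF / T)   # log statistical weights
    x = max(xT, xF)                                  # avoid overflow in the exponentials
    wT, wF = np.exp(xT - x), np.exp(xF - x)
    return (wT * eT + wF * eF) / (wT + wF)

for N in (64, 1024):
    print(N, e_eq(0.95, N), e_eq(1.05, N))   # crossover narrows as N grows
```

Exactly at T = T_0 the two weights balance and the energy per spin sits midway between the phases; a few percent away from T_0 the larger system is already essentially pure.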

V.L. thanks Vitali Khvatkov, Rolf M. Olsen, and Michael Tuvim for inspiring conversations. We gratefully acknowledge the support of NSF grants CHE-1465125 and CHE-1956389, the Welch Foundation grant E-1765, and a grant from the Texas Center for Superconductivity at the University of Houston. We gratefully acknowledge the use of the Carya/Opuntia/Sabine Cluster and the advanced support from the Research Computing Data Core at the University of Houston acquired through NSF Award Number ACI-1531814.

Anderson, P. W. (1984). Basic notions of condensed matter physics. Benjamin-Cummings.
Baus, M., & Colot, J.-L. (1986). The hard-sphere glass: Metastability versus density of random close packing. Journal of Physics C: Solid State Physics, 19, L135–L139.
Beker, W., Roszak, R., Wolos, A., Angello, N. H., Rathore, V., Burke, M. D., & Grzybowski, B. A. (2022). Machine learning may sometimes simply capture literature popularity trends: A case study of heterocyclic Suzuki–Miyaura coupling. Journal of the American Chemical Society, 144, 4819–4827.
Béreux, N., Decelle, A., Furtlehner, C., & Seoane, B. (2023). Learning a restricted Boltzmann machine using biased Monte Carlo sampling. SciPost Physics, 14, 032.
Berry, M. V. (1984). Quantal phase factors accompanying adiabatic changes. Proceedings of the Royal Society of London. A: Mathematical and Physical Sciences, 392, 45–57.
Berry, R. S., Rice, S. A., & Ross, J. (1980). Physical chemistry. Wiley.
Bevzenko, D., & Lubchenko, V. (2009). Stress distribution and the fragility of supercooled melts. Journal of Physical Chemistry B, 113, 16337–16345.
Bevzenko, D., & Lubchenko, V. (2014). Self-consistent elastic continuum theory of degenerate, equilibrium aperiodic solids. Journal of Chemical Physics, 141, 174502.
Bray, A. J. (1994). Theory of phase-ordering kinetics. Advances in Physics, 43, 357–459.
Brock, D. C. (2021, June). How William Shockley's robot dream helped launch Silicon Valley. IEEE Spectrum. https://spectrum.ieee.org/how-william-shockleys-robot-dream-helped-launch-silicon-valley
Bryngelson, J. D., Onuchic, J. N., Socci, N. D., & Wolynes, P. G. (1995). Funnels, pathways, and the energy landscape of protein folding: A synthesis. Proteins: Structure, Function, and Bioinformatics, 21, 167–195.
Burr, D. (1981). A dynamic model for image registration. Computer Graphics and Image Processing, 15, 102–112.
Chan, H. Y., & Lubchenko, V. (2015). Pressure in the Landau-Ginzburg functional: Pascal's law, nucleation in fluid mixtures, a meanfield theory of amphiphilic action, and interface wetting in glassy liquids. Journal of Chemical Physics, 143, 124502.
Cubero, R. J., Jo, J., Marsili, M., Roudi, Y., & Song, J. (2019). Statistical criticality arises in most informative representations. Journal of Statistical Mechanics: Theory and Experiment, 2019, 063402.
Davtyan, A., Schafer, N. P., Zheng, W., Clementi, C., Wolynes, P. G., & Papoian, G. A. (2012). AWSEM-MD: Protein structure prediction using coarse-grained physical potentials and bioinformatically based local structure biasing. Journal of Physical Chemistry B, 116, 8494–8503.
Dayan, P., Hinton, G. E., Neal, R. M., & Zemel, R. S. (1995). The Helmholtz machine. Neural Computation, 7(5), 889–904.
Decelle, A., & Furtlehner, C. (2021). Exact training of restricted Boltzmann machines on intrinsically low dimensional data. Physical Review Letters, 127, 158303.
Evans, R. (1979). The nature of the liquid-vapour interface and other topics in the statistical mechanics of non-uniform, classical fluids. Advances in Physics, 28, 143–200.
Friedrichs, M. S., & Wolynes, P. G. (1989). Toward protein tertiary structure recognition by means of associative memory Hamiltonians. Science, 246, 371–373.
Girardeau, M. D., & Mazo, R. M. (1973). Variational methods in statistical mechanics. In I. Prigogine & S. Rice (Eds.), Advances in chemical physics (pp. 187–255). Wiley.
Golden, J. C., Ho, V., & Lubchenko, V. (2017). The chemical bond as an emergent phenomenon. Journal of Chemical Physics, 146, 174502.
Goldenfeld, N. (1992). Lectures on phase transitions and the renormalization group. Addison-Wesley.
Goldstone, J., Salam, A., & Weinberg, S. (1962). Broken symmetries. Physical Review, 127, 965–970.
Gresele, L., & Marsili, M. (2017). On maximum entropy and inference. Entropy, 19, 642.
Haimovici, A., & Marsili, M. (2015). Criticality of mostly informative samples: A Bayesian model selection approach. Journal of Statistical Mechanics: Theory and Experiment, 2015, P10013.
He, Y. (2022). The emergence of knowledge and the benefits of quorum. PhD diss., University of Houston.
He, Y., & Lubchenko, V. (2023). Emergence of pseudo-time during optimal Monte Carlo sampling and temporal aspects of symmetry breaking and restoration. Journal of Chemical Physics, 158, 124119.
Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. PNAS, 79, 2554–2558.
Itzykson, C., & Zuber, J. (2012). Quantum field theory. Dover.
Kirkpatrick, T. R., & Wolynes, P. G. (1987). Stable and metastable states of mean-field Potts and structural glasses. Physical Review B, 36, 8552–8564.
Landau, L. D., & Lifshitz, E. M. (1980). Statistical mechanics. Pergamon Press.
Laydevant, J., Marković, D., & Grollier, J. (2024). Training an Ising machine with equilibrium propagation. Nature Communications, 15.
Lemaître, A., Mondal, C., Moshe, M., Procaccia, I., Roy, S., & Screiber-Réem, K. (2021). Anomalous elasticity and plastic screening in amorphous solids. Physical Review E, 104, 024904.
Levinthal, C. (1969). How to fold graciously. In P. Debrunner & J. Tsibris (Eds.), Mossbauer spectroscopy in biological systems: Proceedings, 67 (pp. 22–26). University of Illinois.
Lubchenko, V. (2008). Competing interactions create functionality through frustration. PNAS, 105, 10635–10636.
Lubchenko, V. (2015). Theory of the structural glass transition: A pedagogical review. Advances in Physics, 64, 283–443.
Lubchenko, V. (2017). Glass transition imminent, resistance is futile. PNAS, 114(13), 3289–3291.
Lubchenko, V. (2020). Basic notions of thermodynamics and quantum mechanics for natural sciences. University of Houston. https://uhlibraries.pressbooks.pub/lubchenkosurveypchem/
Lubchenko, V., & Wolynes, P. G. (2004). Theory of aging in structural glasses. Journal of Chemical Physics, 121, 2852–2865.
Lubchenko, V., & Wolynes, P. G. (2007). Theory of structural glasses and supercooled liquids. Annual Review of Physical Chemistry, 58, 235–266.
Maddox, J. (1988). Crystals from first principles. Nature, 335, 201.
Marsili, M., Mastromatteo, I., & Roudi, Y. (2013). On sampling and modeling complex systems. Journal of Statistical Mechanics: Theory and Experiment, 2013, P09003.
McQuarrie, D. A. (1973). Statistical mechanics. Harper-Collins.
Merchan, L., & Nemenman, I. (2016). On the sufficiency of pairwise interactions in maximum entropy models of networks. Journal of Statistical Physics, 162, 1294–1308.
Mohseni, N., McMahon, P. L., & Byrnes, T. (2022). Ising machines as hardware solvers of combinatorial optimization problems. Nature Reviews Physics, 4(6), 363–379.
Montúfar, G. (2018). Restricted Boltzmann machines: Introduction and review. In Information geometry and its applications: On the occasion of Shun-ichi Amari's 80th birthday, June 2016 (pp. 75–115).
Moshfeghi, M. (1991). Elastic matching of multimodality medical images. CVGIP: Graphical Models and Image Processing, 53, 271–282.
Onuchic, J. N., Luthey-Schulten, Z., & Wolynes, P. G. (1997). Theory of protein folding: The energy landscape perspective. Annual Review of Physical Chemistry, 48, 545.
Rabochiy, P., & Lubchenko, V. (2012). Universality of the onset of activated transport in Lennard-Jones liquids with tunable coordination: Implications for the effects of pressure and directional bonding on the crossover to activated transport, configurational entropy and fragility of glassforming liquids. Journal of Chemical Physics, 136, 084504.
Schulman, L. (2012). Techniques and applications of path integration. Dover.
Sejnowski, T. J. (1986). Higher-order Boltzmann machines. In AIP Conference Proceedings (Vol. 151, pp. 398–403).
Silbey, R., Alberty, R., & Bawendi, M. (2004). Physical chemistry. Wiley.
Singh, Y., Stoessel, J. P., & Wolynes, P. G. (1985). The hard sphere glass and the density functional theory of aperiodic crystals. Physical Review Letters, 54, 1059–1062.
Wolynes, P. G. (1997a). Entropy crises in glasses and random heteropolymers. Journal of Research of NIST, 102, 187–194.
Wolynes, P. G. (1997b). Folding funnels and energy landscapes of larger proteins within the capillarity approximation. PNAS, 94, 6170–6174.
Xia, X., & Wolynes, P. G. (2000). Fragilities of liquids predicted from the random first order transition theory of glasses. PNAS, 97, 2990–2994.
Zhugayevych, A., & Lubchenko, V. (2010). Electronic structure and the glass transition in pnictide and chalcogenide semiconductor alloys. I: The formation of the ppσ-network. Journal of Chemical Physics, 133, 234503.
Zubarev, D. N. (1960). Double-time Green functions in statistical physics. Soviet Physics Uspekhi, 3, 320.