## Abstract

A restricted Boltzmann machine (RBM) is an unsupervised machine learning bipartite graphical model that jointly learns a probability distribution over data and extracts their relevant statistical features. RBMs were recently proposed for characterizing the patterns of coevolution between amino acids in protein sequences and for designing new sequences. Here, we study how the nature of the features learned by RBM changes with its defining parameters, such as the dimensionality of the representations (size of the hidden layer) and the sparsity of the features. We show that for adequate values of these parameters, RBMs operate in a so-called compositional phase in which visible configurations sampled from the RBM are obtained by recombining these features. We then compare the performance of RBM with other standard representation learning algorithms, including principal or independent component analysis (PCA, ICA), autoencoders (AE), variational autoencoders (VAE), and their sparse variants. We show that RBMs, due to the stochastic mapping between data configurations and representations, better capture the underlying interactions in the system and are significantly more robust with respect to sample size than deterministic methods such as PCA or ICA. In addition, this stochastic mapping is not prescribed a priori as in VAE, but learned from data, which allows RBMs to show good performance even with shallow architectures. All numerical results are illustrated on synthetic lattice protein data that share similar statistical features with real protein sequences and for which ground-truth interactions are known.

## 1 Introduction

Many complex, interacting systems have collective behaviors that cannot be understood based solely on a top-down approach. This is either because the underlying microscopic interactions between the constituents of the system are unknown—as in biological neural networks, where the set of synaptic connections is unique to each network—or because the complete description is so complicated that analytical or numerical resolution is intractable—as for proteins, for which physical interactions between amino acids can in principle be characterized, but accurate simulations of protein structures or functions are computationally prohibitive. In the past two decades, the increasing availability of large amounts of data collected by high-throughput experiments, such as large-scale functional recordings in neuroscience (EEG, fluorescence imaging) (Schwarz et al., 2014; Wolf et al., 2015), fast sequencing technologies, including single-cell RNA sequencing (Finn et al., 2015; Kolodziejczyk, Kim, Svensson, Marioni, & Teichmann, 2015), and deep mutational scans (Fowler et al., 2010), has shed new light on these systems.

Given such high-dimensional data, one fundamental task is to establish a descriptive phenomenology of the system. For instance, given a recording of spontaneous neural activity in a brain region or in the whole brain (e.g., in larval zebrafish), we would like to identify stereotypes of neural activity patterns (e.g., activity bursts, synfire chains, cell-assembly activations) describing the dynamics of the system. This representation is in turn useful to link the behavior of the animal to its neural state and to understand the network architecture. Similarly, given a multiple sequence alignment (MSA) of protein sequences (i.e., a collection of protein sequences from various genes and organisms that share common evolutionary ancestry), we would like to identify amino acid motifs controlling the protein functionalities and structural features and identify, in turn, subfamilies of proteins with common functions. Important tools for this purpose are unsupervised representation-learning algorithms. For instance, principal component analysis can be used for dimensionality reduction, that is, for projecting system configurations into a low-dimensional representation, where similarities between states are better highlighted and the system evolution is tractable. Another important example is clustering, which partitions the observed data into different prototypes. Though these two approaches are very popular, they are not always appropriate: some data are intrinsically multidimensional and cannot be reduced to a low-dimensional or categorical representation. Indeed, configurations can mix multiple, weakly related features, such that using a single global distance metric would be too reductive. For instance, neural activity states are characterized by the clusters of neurons that are activated, which are themselves related to a variety of distinct sensory, motor, or cognitive tasks. 
Similarly, proteins have a variety of biochemical properties such as binding affinity and specificity, thermodynamic stability, or allostery, which are controlled by distinct amino acid motifs within their sequences. In such situations, other approaches such as independent component analysis or sparse dictionaries, which aim at representing the data by a (larger) set of independent latent factors, appear to be more appropriate (McKeown et al., 1998; Rivoire, Reynolds, & Ranganathan, 2016).

A second goal is to infer the set of interactions underlying the system's collective behavior. In the case of neural recordings, we would look for functional connections that reflect the structure of the relevant synaptic connections in a given brain state. In the case of proteins, we would like to know which interactions between amino acids shape the protein structure and functions. For instance, a van der Waals repulsion between two amino acids is irrelevant if they are far from one another in the tridimensional structure; the set of relevant interactions is therefore linked to the structure. One popular approach for inferring interactions from observed statistics relies on graphical (e.g., Ising or Potts) models. It consists in first defining a quadratic log-probability function and then inferring the associated statistical fields (site potentials) and pairwise couplings by matching the first- and second-order moments of the data. This can be done efficiently through various approaches (see Nguyen, Zecchina, & Berg, 2017, and Cocco, Feinauer, Figliuzzi, Monasson, & Weigt, 2018, for recent reviews). The inverse Potts approach, called direct coupling analysis (Weigt, White, Szurmant, Hoch, & Hwa, 2009; Morcos et al., 2011) in the context of protein sequence analysis, helps predict structural contact maps for a protein family from sequence data only. Moreover, such an approach defines a probability distribution that can be used for artificial sample generation and, more broadly, quantitative modeling—for example, of mutational fitness landscapes in proteins (Figliuzzi, Jacquier, Schug, Tenaillon, & Weigt, 2016; Hopf et al., 2017), neural state information content (Tkacik, Prentice, Balasubramanian, & Schneidman, 2010), or brain states (Posani, Cocco, Ježek, & Monasson, 2017). The main drawback of graphical models is that, unlike representation learning algorithms, they do not provide any direct, interpretable insight into the data distribution.
Indeed, the relationship between the inferred parameters (fields and couplings) and the typical configurations associated with the probability distribution of the model is mathematically well defined but intricate in practice. For instance, it is difficult to deduce the existence of data clusters or global collective modes from the knowledge of the fields and couplings.

An interesting alternative for overcoming this issue is the restricted Boltzmann machine (RBM). An RBM is a graphical model that can learn both a representation and a distribution of the configuration space, naturally combining both approaches. RBMs are bipartite graphical models, consisting of a visible layer carrying the data configurations and a hidden layer, where their representation is formed (see Figure 1a). Unlike Boltzmann machines, there are no couplings within the visible layer or between hidden units. RBMs were introduced in Ackley, Hinton, and Sejnowski (1987) and popularized by Hinton et al. (Hinton, 2002; Hinton, Osindero, & Teh, 2006) for feature extraction and pretraining of deep networks. More recently, RBMs were shown to be efficient for modeling coevolution in protein sequences (Tubiana, Cocco, & Monasson, 2018). The features inferred are sequence motifs related to the structure, function, and phylogeny of the protein, and they can be recombined to generate artificial sequences with putative properties. An important question is to understand the nature and statistics of the representations inferred by RBMs and how they depend on the choice of model parameters (nature and number of hidden units, sparse regularization). Indeed, unlike in PCA, independent component analysis (ICA), or sparse dictionaries, no constraints on the statistics or the nature of the representations, such as decorrelation or independence, are explicitly enforced in RBMs. To answer this question, we here apply RBMs to synthetic alignments of lattice protein sequences, for which ground-truth interactions and fitness functions are available. We analyze and interpret the nature of the representations learned by RBMs as a function of their defining parameters, in connection with theoretical results on RBMs drawn from random statistical ensembles, obtained with statistical physics tools (Tubiana & Monasson, 2017). We then compare our results to other feature extraction methods.

The letter is organized as follows. In section 2, we formally define RBMs and present the main steps of the learning procedure. In section 3, we discuss the different types of representations that RBMs can learn, with a reminder of recent results on the different operation regimes, in particular the so-called compositional phase, theoretically found in random RBM ensembles. In section 4, we introduce lattice proteins (LP) and present the main results of the application of RBMs to LP sequence data. Section 5 is dedicated to the comparison of RBMs with other representation learning algorithms, including principal component analysis (PCA), independent component analysis (ICA), sparse principal component analysis (sPCA), sparse dictionaries, sparse autoencoders (sAE), and sparse variational autoencoders (sVAE).

## 2 Restricted Boltzmann Machines

### 2.1 Definition

where the fields $g_i(v)$ and potentials $U_\mu$ control the conditional probabilities of the $v_i$, $h_\mu$, and the weights $w_{i\mu}(v)$ couple the visible and hidden variables. The partition function $Z=\sum_v \int dh \, e^{-E(v,h)}$ is defined such that $P$ is normalized to unity. Some choices for the hidden-unit potentials are:

- The Bernoulli potential: $U(0)=0$, $U(1)=u_1$, and $U(h)=+\infty$ if $h\neq 0,1$.

- The quadratic potential,
  $$U(h)=\frac{1}{2}\gamma h^2+\theta h, \qquad (2.2)$$
  with $h$ real-valued.

- The rectified linear unit (ReLU) potential, $U(h)=\frac{1}{2}\gamma h^2+\theta h$, with $h$ real-valued and positive, and $U(h)=+\infty$ for negative $h$.

- The double rectified linear unit (dReLU) potential,
  $$U(h)=\frac{1}{2}\gamma_+ (h_+)^2+\frac{1}{2}\gamma_- (h_-)^2+\theta_+ h_+ +\theta_- h_-, \qquad (2.3)$$
  where $h_+=\max(h,0)$, $h_-=\min(h,0)$, and $h$ is real-valued.

Bernoulli and quadratic potentials are standard choices in the literature. RBMs with ReLU-like average activity were originally proposed in Nair and Hinton (2010), and the above ReLU potential, which can be sampled from exactly, was introduced in Tubiana and Monasson (2017). The so-called double ReLU potential, introduced in Tubiana et al. (2018), is a more general form, with an associated distribution that can be asymmetric and can interpolate between bimodal, gaussian, and Laplace-like sparse distributions, depending on the choice of the parameters (see Figure 2a).
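For concreteness, these potentials are straightforward to implement; the following is a minimal sketch (function and parameter names are ours, and the parameter values are illustrative):

```python
import numpy as np

def U_quadratic(h, gamma=1.0, theta=0.0):
    """Quadratic potential of equation 2.2."""
    return 0.5 * gamma * h**2 + theta * h

def U_relu(h, gamma=1.0, theta=0.0):
    """ReLU potential: quadratic for h >= 0, infinite for h < 0."""
    return np.where(h >= 0, 0.5 * gamma * h**2 + theta * h, np.inf)

def U_drelu(h, gamma_p=1.0, gamma_m=1.0, theta_p=0.0, theta_m=0.0):
    """Double ReLU (dReLU) potential of equation 2.3."""
    hp = np.maximum(h, 0.0)  # h_+
    hm = np.minimum(h, 0.0)  # h_-
    return (0.5 * gamma_p * hp**2 + 0.5 * gamma_m * hm**2
            + theta_p * hp + theta_m * hm)
```

With $\gamma_+=\gamma_-$ and $\theta_+=\theta_-$, the dReLU potential reduces to the quadratic one, consistent with its role as a generalization of the previous potentials.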

### 2.2 Sampling

For $\theta_+>\theta_-$ (the blue curve in Figure 2), small inputs $I$ barely activate the hidden unit, and the average activity is a soft ReLU, whereas for $\theta_+<\theta_-$, the average activity rises abruptly at $I=\frac{\theta_+\sqrt{\gamma_-}+\theta_-\sqrt{\gamma_+}}{\sqrt{\gamma_+}+\sqrt{\gamma_-}}$ (the red curve in Figure 2), similar to a soft thresholding function.
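This behavior can be checked numerically: in an RBM, the conditional distribution of a hidden unit given its input $I$ is $P(h|I)\propto e^{hI-U(h)}$, so the average activity $\langle h\rangle(I)$ can be evaluated by direct numerical integration. A sketch (the dReLU parameters below are illustrative, not values from the text):

```python
import numpy as np

def U(h, gp=1.0, gm=1.0, tp=1.0, tm=-1.0):
    """dReLU potential, equation 2.3, with illustrative parameters."""
    hp, hm = np.maximum(h, 0.0), np.minimum(h, 0.0)
    return 0.5 * gp * hp**2 + 0.5 * gm * hm**2 + tp * hp + tm * hm

def mean_activity(I, U, h=np.linspace(-20, 20, 20001)):
    """<h>(I) = ∫ h e^{hI-U(h)} dh / ∫ e^{hI-U(h)} dh on a uniform grid."""
    logw = h * I - U(h)
    w = np.exp(logw - logw.max())  # shift to avoid overflow
    return float((h * w).sum() / w.sum())
```

With these parameters ($\theta_+>\theta_-$), $\langle h\rangle(I)$ stays close to zero for small $|I|$ and approaches $(I-\theta_+)/\gamma_+$ for large $I$: the soft-ReLU regime described above.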

### 2.3 Marginal Distribution

### 2.4 Learning Algorithm

## 3 Nature of the Representation Learned and Weights

### 3.1 Nature of Representations and Interpretation of the Weights

Taken together, the hidden units define a probability distribution over the visible layer via equation 2.7 that can be used for artificial sample generation, scoring, or Bayesian reconstruction, as well as a representation that can be used for supervised learning. However, one fundamental question is how to interpret the hidden units and their associated weights when taken individually. Moreover, what does the model tell us about the data studied?

To this end, we first recall that a data point $v$ can be approximately reconstructed from its representation $h$ via a so-called linear-nonlinear relationship, that is, a linear transformation $I_i(v,h)=\sum_\mu w_{i\mu}(v)h_\mu$ of the hidden-layer activity followed by an element-wise (stochastic) nonlinearity $P(v_i|I_i)$ (see the conditional distribution $P(v_i|h)$ of equation 2.6). A similar reconstruction relationship also exists in other feature extraction approaches; for instance, it is linear and deterministic for PCA/ICA and nonlinear deterministic for autoencoders (see section 5). Owing to this relationship, hidden units may be interpreted as the underlying degrees of freedom controlling the visible configurations. However, such an interpretation relies on the constraints imposed on the distribution $P(h)$, such as decorrelation, independence, or sparse activity. In contrast, in RBMs, the marginal $P(h)$ is obtained by integrating over the visible layer (similarly to $P(v)$) and has no explicit constraints. Therefore, hidden units may be interpreted differently depending on the statistics of $P(h)$. We sketch in Figure 3 several possible scenarios for $P(h)$:

- *Prototypical representations* (see Figure 3b). For a given visible layer configuration, about one hidden unit is strongly activated, while the others are weakly activated or silent; each hidden unit responds to a localized region of the configuration space. Owing to the reconstruction relation, the weights attached to a hidden unit are prototypes of data configurations, similar to the centroids obtained with a clustering algorithm.

- *Intricate representations* (see Figure 3c). A visible layer configuration weakly activates many (of order $M$) hidden units. Each hidden unit is sensitive to a wide region of the configuration space, and different configurations correspond to slightly different, correlated levels of activation of the hidden units. Taken altogether, the hidden units can be very informative about the data, but the individual weights do not have a simple interpretation.^{1}

- *Compositional representations* (see Figure 3d). A substantial number of hidden units (large compared to one but small compared to the size of the hidden layer) are activated by any visible configuration, while the other units are irrelevant. The same hidden unit can be activated in many distinct parts of the configuration space, and different configurations are obtained through combinatorial choices of the strongly activated hidden units. Conversely, the weights correspond to constitutive parts that, once combined, produce a typical configuration. Such compositional representations share similar properties with the ones obtained with sparse dictionaries (Olshausen & Field, 1996; Mairal, Bach, Ponce, & Sapiro, 2009).

Compositional representations offer several advantages with respect to the other types. First, they allow RBM to capture invariances in the underlying distribution from vastly distinct data configurations (contrary to the prototypical regime). Second, representations are sparse (as in the prototypical regime), which makes it possible to understand the relationship between weights and configurations (see the above discussion). Third, activating hidden units of interest (once their weights have been interpreted) allows one to generate configurations with desired properties (see Figure 3d). A key question is whether we can force the RBM to generate such compositional representations.
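To make the linear-nonlinear reconstruction relation of section 3.1 concrete, here is a minimal sketch of sampling $v\sim P(v|h)$ for Potts visible units (array shapes and parameter values are illustrative assumptions, not the trained model of section 4):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_visible(h, w, g):
    """Stochastic reconstruction v ~ P(v|h) for Potts visible units.
    h: (M,) hidden activities; w: (N, q, M) weights w_iμ(v); g: (N, q) fields."""
    I = g + np.einsum('ivm,m->iv', w, h)  # inputs I_i(v) to each site/state
    I -= I.max(axis=1, keepdims=True)     # stabilize the softmax
    P = np.exp(I)
    P /= P.sum(axis=1, keepdims=True)     # P(v_i = a | h): element-wise nonlinearity
    return np.array([rng.choice(len(p), p=p) for p in P])

N, q, M = 27, 20, 100                     # sites, Potts states, hidden units
w = rng.normal(0.0, 0.1, (N, q, M))       # placeholder weights
g = rng.normal(0.0, 1.0, (N, q))          # placeholder fields
v = sample_visible(rng.normal(0.0, 1.0, M), w, g)  # one reconstructed sequence
```

The stochasticity of this mapping, in contrast to the deterministic decoders of PCA/ICA or autoencoders, is what the comparison in section 5 turns on.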

### 3.2 Link between Model Parameters and Nature of the Representations

- Large $h$ values arise via self-excitation, with a coefficient $\sum_i w_{i\mu}^2 \sim pW^2$.

- The dReLU threshold acts either as a self-inhibition that can suppress small activities ($\theta_\mu>0$) or as a self-excitation that enhances them ($\theta_\mu<0$).

- Hidden-unit interactions (hence, correlations) arise via their overlaps $\sum_i w_{i\mu}w_{i\mu'} \sim p^2W^2$.

- The nongaussian nature of the visible units induces an effective inhibition term between each pair of hidden units, whose magnitude $\sum_i w_{i\mu}^2 w_{i\mu'}^2 \sim \frac{p^2W^4}{N}$ essentially depends on the overlap between the supports of the hidden units. We deduce that the larger the hidden layer, the stronger this high-order inhibition, and that RBMs with sparse weights (order $p$ nonzero coefficients) have significantly weaker overlaps (of order $p^2$) and therefore much less cross-inhibition.

Depending on the parameter values, several behaviors can therefore be observed. Intuitively, the prototype, intricate, and compositional representations correspond, respectively, to the cases where (1) strong self-excitation and strong cross-inhibition result in a winner-take-all situation, in which one hidden unit is strongly activated and inhibits the others; (2) cross-inhibition dominates, and all hidden units are weakly activated; and (3) strong self-excitation and weak cross-inhibition (e.g., due to sparse weights) allow several hidden units to be strongly activated simultaneously.

Though informative, the expansion is valid only for small $W$/small $h$, whereas hidden units may take large values in practice. A more elaborate mathematical analysis is therefore required and is presented next.

### 3.3 Statistical Mechanics of Random-Weights RBM

The different representational regimes shown in Figure 3 can be observed and characterized analytically in simple statistical ensembles of RBMs controlled by a few structural parameters (Tubiana & Monasson, 2017):

- The aspect ratio $\alpha=M/N$ of the RBM.

- The threshold $\theta_\mu=\theta$ of the ReLU potential acting on the hidden units.

- The local field $g_i=g$ acting on the visible units.

- The sparsity $p$ of the weights $w_{i\mu}$, independently and identically drawn at random from the distribution (Agliari, Barra, Galluzzi, Guerra, & Moauro, 2012):^{2}
  $$w_{i\mu}=\begin{cases}0 & \text{with probability } 1-p,\\ +W/\sqrt{N} & \text{with probability } p/2,\\ -W/\sqrt{N} & \text{with probability } p/2.\end{cases} \qquad (3.2)$$
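The weight statistics of equation 3.2 make the scalings discussed in section 3.2 easy to verify numerically; a quick Monte Carlo sketch (reading the zero-weight probability as $1-p$ and the nonzero entries as $\pm W/\sqrt{N}$):

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, W = 100_000, 0.05, 1.0

def sample_weights():
    """One weight vector w_μ from equation 3.2: 0 w.p. 1-p, ±W/√N w.p. p/2 each."""
    return rng.choice([0.0, W / np.sqrt(N), -W / np.sqrt(N)],
                      size=N, p=[1 - p, p / 2, p / 2])

w1, w2 = sample_weights(), sample_weights()
print(np.sum(w1**2))            # self-excitation scale, ≈ p W²
print(np.sum(np.abs(w1 * w2)))  # support-overlap scale, ≈ p² W²
print(np.sum(w1**2 * w2**2))    # cross-inhibition scale, ≈ p² W⁴ / N
```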

In the *ferromagnetic phase*, one hidden unit, say $\mu=1$, receives a strong input and is strongly magnetized, while the other hidden units remain weakly activated; the visible configuration is aligned with the weights $w_{i1}$, as in the prototypical regime.

The RBM enters the *spin-glass phase* if the aspect ratio $\alpha$ of the machine is too large and the threshold $\theta$ and the visible field $g$ are too small. All inputs $I_\mu$ to the hidden units are then random, of order $\pm 1$. Visible units are also subject to random inputs $I_i$ of order unity and are hence magnetized, as expected in a spin glass.

The third phase is the *compositional phase*. At large enough sparsity, that is, for small enough values of $p$ (see equation 3.2), a number $L\sim \ell/p$ of hidden units have strong magnetizations, of order $m\sim\sqrt{p}\times\sqrt{N}$, the other units being shut down by choices of the threshold $\theta\sim\sqrt{p}$. Notice that $L$ is large compared to 1 but small compared to the hidden-layer size, $M$. The value of $\ell$ is determined through minimization of the free energy (Tubiana & Monasson, 2017). The input $I_i$ onto visible units is of order $m\times w\times L\sim\sqrt{p}$. The mean activity in the visible layer is fixed by the choice of the visible field $g\sim\sqrt{p}$. This phase requires low temperatures, that is, weight-squared amplitudes $W^2\sim 1/p$ at least. Mathematically speaking, these scalings are obtained when the limit $p\to 0$ is taken after the thermodynamic limit $N,M\to\infty$ (at fixed ratio $\alpha=M/N$).

Although RBMs learned from data do not follow the same simplistic weight statistics and deviations are expected, the general idea that each RBM learns a decomposition of samples into building blocks was observed on MNIST (Hinton, 2002; Fischer & Igel, 2012), a celebrated data set of handwritten digits, and is presented hereafter on in silico protein families.

### 3.4 Summary

To summarize, RBMs with nonquadratic potentials and sparse weights, as enforced by regularization, learn compositional representations of data. In other words, enforcing sparsity in the weights results in enforcing sparsity in the activity, in a fashion similar to that of sparse dictionaries. The main advantages of RBMs with respect to the latter are that RBMs define a probability distribution—therefore allowing sampling, scoring, and Bayesian inference—and that the representation is a simple linear-nonlinear transformation rather than the outcome of an optimization.

Importantly, we note that the random-RBM ensemble analysis shows only that sparsity is a sufficient condition for building an energy landscape with a diversity of gradually related attractors; such a landscape could also be achieved in other parameter regions—for example, when the weights are correlated. Therefore, there is no guarantee that training an RBM without regularization on a compositional data set (e.g., generated by a random-RBM in the compositional phase) will result in sparse weights and a compositional representation.

Since we can enforce a compositional representation via regularization, what have we learned about the data? The answer is that enforcing sparsity does not come at the same cost in likelihood for all data sets. In some cases, for example, for data constituted by samples clustered around prototypes, weight sparsity yields a significant misfit with respect to the data. In other cases, such as lattice proteins, presented below, enforcing sparsity comes at a very weak cost, suggesting that these data are intrinsically compositional. More generally, RBM trained with fixed regularization strength on complex data sets may exhibit heterogeneous behaviors, with hidden units behaving as compositional and others as prototypes (see some examples in Tubiana et al., 2018).

## 4 Results

### 4.1 Lattice Proteins: Model and Data

Lattice protein (LP) models were introduced in the 1990s to investigate the properties of proteins (Shakhnovich & Gutin, 1990), in particular how their structure depends on their sequence. They were recently used to benchmark graphical models inferred from sequence data (Jacquin, Gilson, Shakhnovich, Cocco, & Monasson, 2016). In the version considered here, LP consist of 27 amino acids and fold on a $3\times 3\times 3$ lattice cube (Shakhnovich & Gutin, 1990). There are 103,406 possible folds (up to global symmetries), that is, self-avoiding conformations of the 27 amino acid–long chains on the cube.

A collection of 36,000 sequences $v$ that specifically fold into structure $S_A$, such that $P_{nat}(v;S_A)>0.995$, was generated by Monte Carlo simulations as described in Jacquin et al. (2016). As in real MSAs, lattice-protein data feature nonconserved and partly conserved sites, short- and long-range correlations between amino acids on different sites, as well as high-order correlations that arise from competition between folds (Jacquin et al., 2016; see Figure 4). The correlations reflect the main modalities of amino acid interactions, such as electrostatic attraction/repulsion, hydrophobic attraction, or cysteine-cysteine disulfide bonds (see Figure 4f). Moreover, as for natural proteins, LP sequences can be partitioned into subfamilies (see Figure 4g).
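Although the full lattice-protein energy function is beyond our scope here, the definition of $P_{nat}$ (equation 4.2) is simple to sketch: it is the Boltzmann weight of the target fold against all competing folds. The energies below are made up for illustration:

```python
import numpy as np

def p_nat(energies, target):
    """P_nat(v; S) = exp(-E(v, S_target)) / Σ_S' exp(-E(v, S')),
    the probability that sequence v folds into the target structure.
    energies: E(v, S') over all candidate folds (103,406 in the full model)."""
    z = np.exp(-(energies - energies.min()))  # shift for numerical stability
    return z[target] / z.sum()

# toy example: the target fold lies well below its best competitor
E = np.array([-15.0, -9.0, -8.5, -7.0])
print(p_nat(E, target=0))   # close to 1: a good folder
```

A sequence is counted as folding specifically into $S_A$ when this probability exceeds the 0.995 cutoff used to build the data set.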

### 4.2 Representations of Lattice Proteins by RBM: Interpretation of Weights and Generative Power

We now learn RBMs from the LP sequence data, with the training procedure presented in section 2.4. The visible layer includes $N=27$ sites, each carrying Potts variables $v_i$ taking 20 possible values. Here, we show results with $M=100$ dReLU hidden units, with a regularization $\lambda_1^2=0.025$, trained on a fairly large alignment of $B=36,000$ sequences. We present in Figure 5 a selection of structural LP features inferred by the model (see Tubiana et al., 2018, for more features). For each hidden unit $\mu$, we show in panel a the weight logo of $w_{i\mu}(v)$ and in panel b the distribution of its input $I_\mu$, as well as the conditional mean activity $\langle h_\mu\rangle(I_\mu)$. Weights have significant values on a limited number of sites only, which makes their interpretation easier.

As seen from Figure 5a, weight 1 focuses mostly on sites 3 and 26, which are in contact in the structure (see Figure 4a) and are not very conserved (see the sequence logo in Figure 4d). Positively charged residues (H, R, K) have a large, positive (resp. negative) component on site 3 (resp. 26), and negatively charged residues (E, D) have a large, negative (resp. positive) component on the same sites. The histogram of the input distribution in Figure 5b shows three main peaks in the data. Since $I_1(v)=\sum_i w_{i1}(v_i)$, the peaks (1) $I_1\sim 3$, (2) $I_1\sim -3$, and (3) $I_1\sim 0$ correspond to sequences having, respectively, (1) positively charged amino acids on site 3 and negatively charged amino acids on site 26, (2) conversely, negatively charged amino acids on site 3 and positively charged ones on site 26, and (3) identical charges or noncharged amino acids. Weight 2 also focuses on sites 3 and 26. Its positive and negative components correspond, respectively, to charged (D, E, K, R, H) or hydrophobic (I, L, V, A, P, W, F) amino acids. The bulk of the input distribution around $I_2\sim -2$ therefore identifies sequences having hydrophobic amino acids at both sites, whereas the positive peak corresponds to electrostatic contacts such as the ones captured by weight 1. The presence of hidden units 1 and 2 signals an excess of sequences having significantly high $|I_1|$ or $|I_2|$ compared to an independent-site model. Indeed, the contribution of hidden unit $\mu=1,2$ to the log probability is $\Gamma(I_\mu)\sim I_\mu^2$, since the conditional mean $\langle h_\mu\rangle(I_\mu)=\Gamma'(I_\mu)$ is roughly linear (see Figure 5b). In other words, sequences folding into $S_A$ are more likely to have complementary residues on sites 3 and 26 than would be predicted from an independent-site model (with the same site frequencies of amino acids).
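The inputs analyzed above are immediate to compute from a trained model; a sketch for integer-encoded sequences (shapes are illustrative, and `w` stands in for the learned weights):

```python
import numpy as np

def hidden_inputs(v, w):
    """I_μ(v) = Σ_i w_iμ(v_i): input received by each hidden unit.
    v: (N,) integer-encoded sequence; w: (N, q, M) weights w_iμ(a)."""
    return w[np.arange(len(v)), v, :].sum(axis=0)

rng = np.random.default_rng(2)
w = rng.normal(0.0, 0.1, (27, 20, 100))   # placeholder for learned weights
v = rng.integers(0, 20, size=27)          # one sequence
I = hidden_inputs(v, w)                   # one input per hidden unit
```

Histogramming `I[0]` over an alignment would reproduce input distributions like the ones in Figure 5b.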

Interestingly, RBMs can extract features involving more than two sites. Weight 3 is located mainly on sites 5 and 22, with weaker weights on sites 6, 9, and 11. It codes for a cysteine-cysteine disulfide bridge located at the bottom of the structure and present in about a third of the sequences ($I_3\sim 3$). The weak components and the small peak at $I_3\sim 4$ also highlight sequences with a triangle of cysteines 5-11-22 (see $S_A$). We note, however, that this is an artifact of the MJ-based interactions in lattice proteins (see equation 4.1), as a real cysteine amino acid may form only one disulfide bridge.

Weight 4 is an extended electrostatic mode. It has strong components on sites 23, 2, 25, 16, and 18, corresponding to the upper side of the protein (see Figure 4a). These five sites are contiguous on the structure, and the weight logo indicates a pattern of alternating charges present in many sequences ($I_4\gg 0$ and $I_4\ll 0$).

The collective modes defined by RBM may not be contiguous on the native fold. Weight 5 codes for an electrostatic triangle 20-1-18 and for the electrostatic contact 3-26, which is far away from the former. This indicates that despite being far apart, sites 1 and 26 often carry the same charge. The latter constraint is not due to the native structure $S_A$ but impedes folding into the competing structure $S_G$, in which sites 1 and 26 are neighbors (see Figure 4b). Such negative design was also reported in analyses with pairwise models (Jacquin et al., 2016).

### 4.3 Lattice Protein Sequence Design

The interpretability of the hidden units allows us to design new sequences with the right fold. We show in Figure 6a an example of conditional sampling, where the activity of one hidden unit (here, $h_5$) is fixed. This allows us to design sequences having either positive or negative $I_5$, depending on the sign of $h_5$—hence having given charges on a subset of five residues. Such biasing can be useful, for example, to find sequences with a prescribed charge distribution. More broadly, Tubiana et al. (2018) reported that hidden units can have a structural or functional role, for example, in terms of loop size, binding specificity, or allosteric interaction. In principle, conditional sampling allows one to tune these properties at will. In Figure 6b, the generated sequences have both low similarity to the sequences used in training, with about 40% sequence difference to the closest sequence in the data, and high probability $P_{nat}$ (see equation 4.2) of folding into $S_A$. Interestingly, low-temperature sampling, for example, sampling from $P'(v)\propto P(v)^2$, can produce sequences with higher $P_{nat}$ than all the sequences used for training. In real protein design setups, this could correspond to higher binding affinity or better compromises between different target functions.
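Low-temperature sampling amounts to raising the distribution to a power $\beta>1$ ($\beta=2$ for $P'(v)\propto P(v)^2$), which a Metropolis sampler implements by simply rescaling log-probability differences. A self-contained sketch on a toy log-probability (the RBM's actual $\log P(v)$ would take its place):

```python
import numpy as np

rng = np.random.default_rng(3)

def metropolis(logp, x0, n_steps, beta=2.0, q=20):
    """Metropolis sampling from P(v)^beta over q-state sequences.
    beta=2 gives the low-temperature distribution P'(v) ∝ P(v)²."""
    x = x0.copy()
    for _ in range(n_steps):
        i = rng.integers(len(x))
        prop = x.copy()
        prop[i] = rng.integers(q)          # mutate one site
        if np.log(rng.random()) < beta * (logp(prop) - logp(x)):
            x = prop
    return x

# toy target strongly favoring the all-zero sequence
logp = lambda v: -8.0 * np.count_nonzero(v)
v = metropolis(logp, rng.integers(0, 20, size=27), n_steps=10_000)
```

Conditional sampling (clamping $h_5$, as in Figure 6a) would instead alternate Gibbs updates of the free hidden units and of the visible layer while keeping the clamped unit fixed.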

### 4.4 Representations of Lattice Proteins by RBM: Effect of Sparse Regularization

We now repeat the previous experiment with varying regularization strength, as well as for various potentials. To see the effect of the regularization on the weights, we compute a proxy $p$ for the weight sparsity (see the definition in appendix B). We also evaluate the likelihood $L$ of the model on the train and a held-out test set. To do so, we estimate the partition function via the annealed importance sampling algorithm (Neal, 2001; Salakhutdinov & Murray, 2008). We show in Figure 7 the sparsity-likelihood scatter plot.

Without sparse regularization, the weights are not sparse and the representation is intricate. As expected, increasing the regularization strength results in fewer nonzero weights and a lower likelihood on the training set. Somewhat surprisingly, the test set likelihood decays only mildly over a wide range of regularization strengths, even though the data set is large enough that no overfitting is expected. This suggests a large degeneracy of maximum likelihood solutions that perform equally well on test data; sparse regularization allows us to select one for which the representation is compositional and the weights can be easily related to the underlying interactions of the model. Beyond some point, the likelihood decreases abruptly because the hidden units can no longer encode all key interactions. Choosing a regularization strength such that the model lies at the elbow of the curve, as was done for the RBM in Figure 5, is a principled way to obtain a model that is both accurate and interpretable.

### 4.5 Representations of Lattice Proteins by RBM: Effects of the Size of the Hidden Layer

We now study how the value of $M$ affects the representations of the protein sequences. We repeat the training of dReLU RBMs at fixed regularization $\lambda_1^2=0.025$ for $M$ varying between 1 and 400 and show in Figure 8 one typical weight $w_\mu$ learned for each value of $M$. We observe different behaviors as $M$ increases.

For very low $M$, the weight vectors are extended over the whole visible layer. For $M=1$, the unique weight vector captures electrostatic features—its entries are strong on charged amino acids, such as K, R (positive charges) and E, D (negative charges)—and is very similar to the top component of the correlation matrix of the data (compare the top panel in Figure 8 and Figure 4f). Additional hidden units (see the panels on the second line of Figure 8 corresponding to $M=2$) capture other collective modes—here patterns of correlated cysteine-cysteine bridges across the protein. Hence, RBM can be seen as a nonlinear implementation of PCA. A thorough comparison with PCA will be presented in section 5.

As $M$ increases, the resolution of the representations gets finer and finer: units focus on smaller and smaller portions of the visible layer (see Figure 8). Nonzero entries are restricted to a few sites, often in contact in the protein structure (see Figure 4, fold $S_A$). We introduce a proxy $p$ for the sparsity of the weights based on the inverse participation ratio of the entries $w_{i\mu}(v)$ (see appendix B). The behavior of $p$ as a function of $M$ is shown in Figure 9a. We observe that $p$ decreases (weights get sparser) until $M$ reaches 50 to 100.

For values of $M>100$, the modes of covariation cannot be decomposed into thinner ones anymore, and $p$ saturates at the minimal value $\sim 2/27$, corresponding to a hidden unit linked to two visible units (see Figure 9a). Additional features are then essentially duplicates of the previous ones. This can be seen from the distribution of maximum weight overlaps $O_\mu^{\max}$ shown in Figure 9a (see appendix B for a definition).

In addition, the number $L$ of simultaneously active hidden units per sequence grows, undergoing a continuous transition from a ferromagnetic-like regime ($L \sim 1$) to a compositional one ($L \gg 1$). Crucially, $L$ does not scale linearly with $M$ (see Figures 9a and 9b). Once duplicated hidden units are accounted for, $L$ saturates at about 12, a number set by the data distribution rather than by $M$. In contrast, for an unregularized training with a quadratic hidden-unit potential, $L$ grows to much larger values ($L \sim 50$) as $M$ increases (see Figure 13). Finally, Figure 9c shows that the theoretical scaling $L \sim 1/p$ (Tubiana & Monasson, 2017) is qualitatively correct.

## 5 Comparison with Other Representation Learning Algorithms

### 5.1 Models and Definitions

We now compare these results with those obtained using standard representation learning approaches. Thanks to popular packages such as Scikit-learn (Buitinck et al., 2013) or Keras (Chollet, 2015), most of these approaches are easy to implement in practice.

Principal component analysis (PCA) is arguably the most commonly used tool for finding the main modes of covariation of data. It is routinely applied in the context of protein sequence analysis for finding protein subfamilies, identifying specificity-determining positions (Casari, Sander, & Valencia, 1995; Rausell, Juan, Pazos, & Valencia, 2010; De Juan, Pazos, & Valencia, 2013), and defining subsets of sites (called sectors) within proteins that control their various properties (Halabi, Rivoire, Leibler, & Ranganathan, 2009; McLaughlin, Poelwijk, Raman, Gosal, & Ranganathan, 2012). A major drawback of PCA is its assumption that the weights are orthogonal, which is in general not true and often results in extended modes that cannot be interpreted easily. Independent component analysis (ICA; Bell & Sejnowski, 1995; Hyvärinen & Oja, 2000) aims to alleviate this issue by incorporating higher-order moment statistics into the model. ICA has been applied to identifying neural areas from fMRI data (McKeown et al., 1998) and to protein sequence analysis (Rivoire et al., 2016). Another way to break the rotational invariance is to impose sparsity constraints on the weights or the representation via regularization. Sparse PCA (Zou, Hastie, & Tibshirani, 2006) is one popular such approach and has been considered in both neuroscience (Baden et al., 2016) and protein sequence analysis (Quadeer, Morales-Jimenez, & McKay, 2018). We will also study sparse (in weights) single-layer noisy autoencoders, which can be seen as a nonlinear variant of sparse PCA, and sparse dictionaries (Olshausen & Field, 1996; Mairal et al., 2009). Finally, we will consider variational autoencoders (Kingma & Welling, 2013) with a linear-nonlinear decoder, which can be seen as both a regularized autoencoder and a nonlinear, generative PCA. VAEs were recently considered for generative purposes in protein modeling (Sinai, Kelsic, Church, & Novak, 2017; Riesselman, Ingraham, & Marks, 2017; Greener, Moffat, & Jones, 2018), and their encoder defines a representation of the data.

All the above-mentioned models belong to the same family: linear-nonlinear latent variable graphical models (see Figure 10). In these generative models, latent factors $z_\mu$ are drawn from an independent distribution $P(z) = \prod_\mu P_\mu(z_\mu)$, and the data are obtained, as in RBM, by a linear transformation followed by an element-wise nonlinear stochastic or deterministic transformation of the form $P(v|z) = \prod_i P(v_i \mid I_i^v(z))$. Unlike in RBM, a single pass, rather than an extensive back-and-forth process, suffices to sample configurations. For a general choice of $P_\mu(z_\mu)$ and $P(v|z)$, inference of this model by maximum likelihood is intractable because the marginal $P(v)$ has no closed form. The different models correspond to different hypotheses on $P_\mu(z_\mu)$ and $P(v_i \mid I_i^v(z))$, and to learning principles that simplify the inference problem (see Table 1).
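To make the single-pass sampling concrete, here is a minimal NumPy sketch of this family of models; the sizes, the gaussian prior, and the randomly drawn weights `W` are illustrative placeholders, not parameters from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, q = 27, 100, 20                     # sites, latent factors, alphabet (illustrative)
W = rng.normal(0.0, 0.1, size=(M, N, q))  # hypothetical decoding weights w_{i mu}(v)
g = np.zeros((N, q))                      # visible fields

def sample_v_given_z(z):
    """Ancestral sampling: z -> linear inputs I_i^v(z) -> independent site-wise softmax."""
    I = g + np.einsum('m,mnq->nq', z, W)           # linear transformation
    P = np.exp(I - I.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)              # categorical P(v_i | I_i^v)
    return np.array([rng.choice(q, p=P[i]) for i in range(N)])

z = rng.normal(size=M)       # latent factors drawn from an independent prior P(z)
v = sample_v_given_z(z)      # one pass suffices; no back-and-forth Gibbs sampling
```

The deterministic variants (PCA, sparse dictionaries) simply replace the softmax sampling step by the linear reconstruction itself.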

| Algorithm | PCA | ICA (Infomax) | Sparse PCA | Sparse Noisy Autoencoders | Sparse Dictionaries | Variational Autoencoders | RBM |
|---|---|---|---|---|---|---|---|
| $P_\mu(z_\mu)$ | Gaussian | Nongaussian | / | / | Sparse | Gaussian | Nongaussian; correlated |
| $M$ | $\le N$ | $=N$ | $\lessgtr N$ | $\lessgtr N$ | $\lessgtr N$ | $\lessgtr N$ | $\lessgtr N$ |
| $P(v_i \mid I_i^v)$ | Deterministic | Deterministic | Deterministic | Deterministic | Deterministic | Stochastic | Stochastic |
| $P(z \mid v)$ | Deterministic | Deterministic | Deterministic | Stochastic | Deterministic | Stochastic | Stochastic |
| $v_i(I_i^v)$ | Linear | Linear | Linear | Softmax | Linear | Softmax | Softmax |
| $w_{i\mu}(v)$ | Orthonormal | Normalized | Sparse | Sparse | Normalized | Sparse | Sparse |
| Learning method | Max. likelihood $\Leftrightarrow$ min. reconstruction error (MSE) | Max. likelihood | Min. reconstruction error (MSE) | Min. noisy reconstruction error (CCE) | Min. reconstruction error (MSE) | Variational max. likelihood | Max. likelihood |

Note: MSE = mean square error; CCE = categorical cross-entropy.

### 5.2 Lattice Proteins: Features Inferred

We use each approach to infer features from the lattice-protein MSA. For each model, we use the same number $M=100$ of latent factors. For PCA, ICA, and sparse dictionaries, we used one-hot encoding for each site (i.e., each site is converted into a 20-dimensional binary vector) and the standard implementation from Scikit-learn with default parameters. For sparse dictionaries, we adjusted the penalty such that the number $L$ of simultaneously active latent features matches the one found in RBM—approximately 10. Sparse PCA and sparse autoencoders are special cases of autoencoders with, respectively, a mean-square and a categorical cross-entropy reconstruction error, and were implemented in Keras (Chollet, 2015). In both cases, we used the same weights for encoding and decoding. For the noisy variant, we replace 20% of the amino acids with a random one before attempting to reconstruct the original sequence. For variational autoencoders, we used the same parametric form of $P(v|z)$ as for RBM—a linear transformation followed by a categorical distribution. The posterior probability $P(z|v)$ is approximated by a gaussian encoder $\mathcal{N}(\mu(v), \sigma^2(v))$, where $\mu(v)$ and $\log \sigma^2(v)$ are computed from $v$ either by a linear transformation or by a single-hidden-layer neural network. The parameters of both $P(v|z)$ and the encoder are learned by maximizing the variational lower bound on the likelihood of the model. The sparse regularization is enforced on the decoding weights of $P(v|z)$ only. For all models with a sparse weight regularization, we selected the regularization strength so as to obtain median sparsity values $p$ similar to those of the RBM shown in Figure 5. We show for each method six selected features in Figure 11 (PCA, sPCA, sNAE, sVAE with linear encoder) and in Figure 14 (ICA, sparse dictionaries, sVAE with nonlinear encoder). For PCA, ICA, and sparse dictionaries, we show the most important features in terms of explained variance. For sPCA, sNAE, and sVAE, we show features with sparsity $p_\mu$ close to the median sparsity.
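A minimal sketch of this pipeline with Scikit-learn, on stand-in data (the random `msa`, the toy sizes, and the `alpha` value are illustrative; `FastICA` stands in for the Infomax ICA discussed in the text):

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA, SparsePCA

rng = np.random.default_rng(1)
N, q, B, M = 10, 20, 200, 5        # sites, alphabet, sequences, components (toy sizes)
msa = rng.integers(0, q, size=(B, N))   # stand-in for the lattice-protein MSA

# one-hot encoding: each site becomes a q-dimensional binary vector
X = np.eye(q)[msa].reshape(B, N * q)

pca = PCA(n_components=M).fit(X)
ica = FastICA(n_components=M, max_iter=1000).fit(X)  # FastICA as a stand-in for Infomax
spca = SparsePCA(n_components=M, alpha=1.0).fit(X)   # alpha tunes weight sparsity

# weights reshaped to (component, site, amino acid) for inspection, as in Figure 11
w_pca = pca.components_.reshape(M, N, q)
w_spca = spca.components_.reshape(M, N, q)
```

On random data the components are meaningless; the point is only the encoding and the calling convention.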

Most of the high-importance features inferred by PCA and ICA are completely delocalized and encode the main collective modes of the data, similar to unregularized RBM. Clearly, there is no simple way to relate these features to the underlying interactions of the system. Even if sparsity could help, the main issue is the enforced constraint of decorrelation or independence, because it prevents the model from inferring smaller coevolutionary modes such as pairs of amino acids in contact. Indeed, lattice proteins exhibit a hierarchy of correlations, with sites that are tightly correlated to some sites and more weakly correlated to others. For instance, hidden units 4 and 5 of Figure 5 are strongly but not completely correlated, with a small fraction of sequences having $I_4 > 0$ and $I_5 < 0$ and conversely (see Figure 16a). Therefore, neither PCA nor ICA can resolve the two modes; instead, the first principal component is roughly the superposition of both (see Figure 4e). Finally, both PCA and ICA also infer a feature that focuses only on site 24 (PC4 and IC4). As seen from the conservation profile (see Figure 4d), this site is strongly conserved, with two possible amino acids. Since its mutations are uncorrelated from the remainder of the sequence and the associated variance of the mode is fairly high, both PCA and ICA encode this site. In contrast, we never find a hidden unit focusing only on a single site with RBM because its contribution $\Gamma_\mu(I_\mu(v))$ to $P(v)$ would be equivalent to a field term $g_i(v_i)$. Similar features are found with sparse dictionaries as well (see Figure 14).

As expected, all the models with sparse weight penalties (sPCA, sNAE, and sVAE) can infer localized features, provided the regularization is adjusted accordingly. However, unlike in RBM, a significant fraction of these features does not focus on sites in contact. For instance, in sparse PCA, features 2 and 5 focus respectively on pairs 6-17 and 17-21, and neither pair is in contact. To be more systematic, we identify for each method the features focusing on two and three sites (via the criterion $\sum_v |w_{i\mu}(v)| > 0.5 \max_i \sum_v |w_{i\mu}(v)|$), count them, and compare them to the original structure. Results are shown in Table 2. For RBM, 49 hidden units focus on two sites, of which 47 are in contact, and 18 focus on three sites, of which 16 are in contact (e.g., 8-16-27). For the other methods, both the number of features and the true-positive rate are significantly lower. In sparse PCA, only 15/25 pair and 3/24 triplet features correspond to legitimate contacts or triplets in the structure.
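The two- and three-site criterion above can be sketched as follows; the `contacts` set standing in for the ground-truth structure is hypothetical, and requiring every pair within a triplet to be a contact is our simplifying assumption.

```python
import numpy as np

def feature_sites(w_mu, frac=0.5):
    """Sites i with sum_v |w_{i mu}(v)| above frac times the largest site norm
    (the criterion quoted in the text). w_mu has shape (N_sites, q)."""
    norms = np.abs(w_mu).sum(axis=1)
    return np.where(norms > frac * norms.max())[0]

def count_pair_triplet_features(W, contacts):
    """Count features localized on 2 or 3 sites and check them against a contact
    set of (i, j) pairs with i < j."""
    pairs = pairs_ok = triplets = triplets_ok = 0
    for w_mu in W:
        sites = feature_sites(w_mu)
        if len(sites) == 2:
            pairs += 1
            pairs_ok += tuple(sites) in contacts
        elif len(sites) == 3:
            triplets += 1
            triplets_ok += all((a, b) in contacts
                               for a in sites for b in sites if a < b)
    return pairs_ok, pairs, triplets_ok, triplets
```

Running this over each method's weight tensor yields the counts reported in Table 2.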

| Sparse Model | Pair Features | Triplet Features |
|---|---|---|
| RBM | 47/49 | 16/18 |
| sPCA | 15/25 | 3/24 |
| Noisy sAE | 20/29 | 7/19 |
| sVAE (linear encoder) | 8/8 | 5/6 |
| sVAE (nonlinear encoder) | 6/6 | 2/4 |

The main reason for this failure is that in sparse PCA (as in PCA), the emphasis is put on the complete reconstruction of the sequence from the representation because the mapping is assumed to be deterministic rather than stochastic. The sequence must be compressed entirely into the latent factors, of smaller size $M = 100 < 27 \times 20 = 540$, and this is achieved by grouping sites in a fashion that may respect some correlations but not necessarily the underlying interactions. Therefore, even changing the reconstruction error to cross-entropy, to properly take into account the categorical nature of the data, does not significantly improve the results. However, we found that corrupting the sequence with noise before attempting to reconstruct it (i.e., introducing a stochastic mapping) slightly improves the performance, though not to the same level as RBM.

The simplifying assumption that $P(v|z)$ is deterministic can be lifted by adopting a variational approach to learning the model. In the case of a gaussian distribution for the $z_\mu$, we obtain the variational autoencoder. The features obtained are quite similar to those obtained with RBM, featuring contacts, triplets (with very few false positives), and some extended modes. Owing to this stochastic mapping, the representation need not encode individual site variability and focuses instead on the collective behavior of the system.

The major drawback of VAE is that only a small number of latent factors (approximately 20) are effectively connected to the visible layer. Increasing the number of latent factors or changing the training algorithm (SGD versus ADAM, batch normalization) did not increase this number. This is likely because the independent gaussian assumption is not compatible with the linear-nonlinear decoder and the sparse-weights assumption. Indeed, the posterior distribution of the latent factors (from $P(z|v)$) shows significant correlations and deviations from gaussianity (see Figures 16b and 16c). The KL regularization term of VAE training therefore encourages them to be disconnected. Another consequence of this deviation from i.i.d. gaussian statistics is that sequences generated with the VAE have significantly lower fitness and diversity than those generated with RBM (see Figures 17 and 18). Though deep mappings between the representation and the sequence, as in Riesselman et al. (2017), as well as more elaborate priors, as in Mescheder, Nowozin, and Geiger (2017), can be considered for quantitative modeling, the resulting models are significantly less interpretable. In contrast, undirected graphical models such as RBM have a more flexible latent-variable distribution and are therefore more expressive for a fixed linear-nonlinear architecture.

In summary, the key ingredients for learning RBM-like representations are sparse weights, stochastic mapping between configurations and representations, and flexible latent variable distributions.

### 5.3 Lattice Proteins: Robustness of Features

Somewhat surprisingly, we also find that allowing a stochastic mapping between representation and configurations results in features that are significantly more robust with respect to finite sampling. For all the models and parameters used above, we repeat the training five times, on either the original data set (with varying seed) or a shuffled data set, in which each column is shuffled independently of the others so as to preserve the first-order moments while suppressing correlations. Since the data size is very large, models trained on the shuffled data should have latent factors that are either disconnected or localized on a single site, with small feature importance. We show in Figures 12 and 19 the distribution of feature importance for the original and shuffled data for each method. For PCA, ICA, sparse PCA, and sparse autoencoders, the weights are normalized, and feature importance is measured by the variance of the corresponding latent factor. For RBM and VAE, the variance of the latent factors is normalized to 1, and importance is measured by the norm of the weight vector. In PCA and ICA, only a handful of features emerge from the bulk; sparse regularization increases this number only slightly. In contrast, there is a clean scale separation between feature importances for regular and shuffled data, for both regularized and unregularized RBM and VAE. This notably explains why only a few principal components can be used for protein sequence modeling, whereas many features can be extracted with RBM. The number of modules that can be distinguished within protein sequences is therefore grossly underestimated by principal component analysis.

## 6 Conclusion

Understanding the collective behavior of a physical system from observations requires learning a phenomenology of data, as well as inferring the main interactions that drive its collective behavior. Representation learning approaches, such as principal component analysis or independent component analysis, have been frequently considered for this purpose. Comparatively, restricted Boltzmann machines, a machine learning tool originally popularized for unsupervised pretraining of deep neural networks, have received little attention, notably due to a limited understanding of the properties of the learned representations.

In this work, we have studied in detail the performance of RBM on a notoriously difficult problem: the prediction of the properties of proteins from their sequence (the so-called genotype-to-phenotype relationship). We have shown that, provided appropriate nonquadratic potentials as well as sparsity regularization over the weights are used, RBM can learn compositional representations of data, in which the hidden units code for constitutive parts of the data configurations. Each constitutive part, due to sparsity, focuses on a small region of the data configurations (a subset of visible sites) and is therefore easier to interpret than extended features. In turn, constitutive parts may be stochastically recombined to generate new data with good properties (here, the probability of folding into the desired 3D structure). Regularized RBM therefore offers an appealing compromise between model interpretability and generative performance.

We stress that the behavior of RBM described in this letter is in full agreement with theoretical studies of RBM with weights drawn from random statistical ensembles, controlled by a few parameters, including the sparsity, the aspect ratio of the machine, and the thresholds of rectified linear units (Tubiana & Monasson, 2017). It is, however, a nontrivial result, due to the very nature of the lattice-protein sequence distribution, that forcing RBM trained on such data to operate in the compositional regime can be done at essentially no cost in log likelihood (see Figure 7b). This property also holds for real proteins, as shown in Tubiana et al. (2018).

In addition, RBM enjoys several useful properties with respect to the other representation learning approaches studied in this letter. First, RBM representations focus on the underlying interactions between components rather than on all the variability of the data, which is taken into account by site-dependent potentials acting on the visible units. The inferred weights are therefore fully informative about the key interactions within the system. Second, the distribution of latent factors (see Figure 10) is not imposed in RBM, contrary to VAE, where it is arbitrarily assumed to be gaussian. The inference of the hidden-unit potential parameters (thresholds $\theta^\pm$ and curvatures $\gamma^\pm$ for dReLU units) gives RBM considerable adaptability to fit the data distribution as closely as possible.

Altogether, beyond the protein sequence analysis application presented here, our work suggests that RBMs shed a different light on data and could be useful for modeling other data in an accurate and interpretable way. While we have focused on single-layer representations, a major challenge is to understand how our results could be leveraged to learn interpretable multilayer representations. Indeed, many data sets, such as databases of images of objects, exhibit hierarchies of features, entailing that the low-level features extracted themselves have complex patterns of covariation. Though it is not clear whether such a hierarchy exists in protein sequences, it is likely to exist in other biological data, such as large-scale recordings of biological neural networks. In that case, an RBM may not be the optimal choice because the correlations between hidden units are not explicitly fitted from the data: the RBM model is trained to reproduce the moments $\langle v \rangle_{\rm data}$, $\langle f(h) \rangle_{\rm data}$, and $\langle vh \rangle_{\rm data}$ only. Instead, correlations between hidden units arise only through the overlaps between their attached weights (see equation 3.1). Therefore, if two nonoverlapping hidden units are correlated in the data, the only way to account for this property is to introduce an additional hidden unit whose weight overlaps with both. In such a situation, introducing couplings between hidden units or additional layers of hidden units, as in deep Boltzmann machines or deep belief networks, would result in more parameter-efficient and easier-to-interpret models. However, deep architectures raise numerous additional questions. Besides training, which is more challenging, it is not clear how to interpret and visualize the hidden-unit receptive fields, which are no longer linear-nonlinear; how to optimally choose the potentials, architecture, and regularization as functions of the data considered; whether and when compositional representations can arise in some layer; how they interact with other types of layers, such as convolutional or pooling layers; and how the nature of the representations varies from layer to layer depending on the structural parameters of the model. Another open issue is sampling. RBMs, when driven into the compositional phase, empirically show efficient mixing properties. Characterizing how fast the data distribution is dynamically sampled would be very interesting, with potential payoffs for training, where efficient sampling is crucial.

## Appendix A: Additional Details for the Learning Algorithm

Here we provide additional details for the learning algorithm of RBMs. Codes for training RBM, as well as the lattice protein multiple sequence alignment, are available at https://github.com/jertubiana/ProteinMotifRBM.

### A.1 Gauge Choice

Since the conditional probability, equation 2.6, is normalized, the transformations $g_i(v) \to g_i(v) + \lambda_i$ and $w_{i\mu}(v) \to w_{i\mu}(v) + K_{i\mu}$ leave it invariant. We choose the zero-sum gauge, defined by $\sum_v g_i(v) = 0$ and $\sum_v w_{i\mu}(v) = 0$. Since the regularization penalties over the fields and weights depend on the gauge choice, the gauge must be enforced throughout training, not only at the end. The updates on the fields leave the gauge invariant, so the transformation $g_i(v) \to g_i(v) - \frac{1}{q} \sum_{v'} g_i(v')$ need be applied only once, after initialization. This is not the case for the updates on the weights, so the transformation $w_{i\mu}(v) \to w_{i\mu}(v) - \frac{1}{q} \sum_{v'} w_{i\mu}(v')$ must be applied after each gradient update.
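The gauge transformations above amount to subtracting per-site means; a sketch in NumPy (the array shapes are assumptions):

```python
import numpy as np

def zero_sum_gauge_weights(w):
    """w_{i mu}(v) -> w_{i mu}(v) - (1/q) sum_{v'} w_{i mu}(v'); applied after
    every gradient update. Assumed shape (M, N, q), last axis = amino acid."""
    return w - w.mean(axis=-1, keepdims=True)

def zero_sum_gauge_fields(g):
    """g_i(v) -> g_i(v) - (1/q) sum_{v'} g_i(v'); needed once, after initialization,
    since the field updates preserve the gauge. Assumed shape (N, q)."""
    return g - g.mean(axis=-1, keepdims=True)
```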

### A.2 Dynamic Reparameterization

For quadratic and dReLU potentials, there is a redundancy between the slope of the hidden-unit average activity and the global amplitude of the weight vector. Indeed, for the gaussian potential, the model distribution is invariant under the rescaling transformations $\gamma_\mu \to \lambda^2 \gamma_\mu$, $w_{i\mu} \to \lambda w_{i\mu}$, $\theta_\mu \to \lambda \theta_\mu$ and the offset transformation $\theta_\mu \to \theta_\mu + K_\mu$, $g_i \to g_i - \sum_\mu \frac{w_{i\mu} K_\mu}{\gamma_\mu}$. Though we can set $\gamma_\mu = 1$, $\theta_\mu = 0 \;\forall \mu$ without loss of generality, doing so can lead to either numerical instability (at a high learning rate) or slow learning (at a low learning rate). A significantly better choice is to dynamically adjust the slope and offset so that $\langle h_\mu \rangle_{\rm data} \sim 0$ and $\mathrm{Var}(h_\mu) \sim 1$ at all times. This approach, reminiscent of batch normalization for deep networks, is implemented in the training algorithm released with this work and is benchmarked in Tubiana, Cocco, and Monasson (2019).
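A heuristic sketch of this dynamic reparameterization for a single hidden unit with a quadratic potential; the exact update used in the released code may differ.

```python
import numpy as np

def rescale(w, theta, gamma, lam):
    """Invariance of the gaussian potential: gamma -> lam^2 gamma, w -> lam w,
    theta -> lam theta leaves the model distribution unchanged while dividing
    the hidden-unit scale h = (I - theta)/gamma by lam."""
    return lam * w, lam * theta, lam ** 2 * gamma

def dynamic_reparameterize(w, theta, gamma, h_batch):
    """Heuristic sketch (single unit, scalar theta/gamma): choose lam so that
    Var(h) ~ 1 after rescaling, then use the offset transformation to bring
    <h> ~ 0 on the current batch of hidden activities h_batch."""
    lam = np.sqrt(h_batch.var())
    w, theta, gamma = rescale(w, theta, gamma, lam)
    theta = theta + gamma * (h_batch.mean() / lam)  # recentres the rescaled mean
    return w, theta, gamma
```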

### A.3 Initialization and Stochastic Gradient Descent

The optimization is carried out by stochastic gradient descent. At each step, the gradient is evaluated using a mini-batch of the data, as well as a small number of Markov chain Monte Carlo configurations. We used the same batch size ($=100$) for both sets. The model is initialized as follows:

The fields $g_i(v)$ are initialized to those of the optimal independent model, $g_i(v) = \log \langle \delta_{v_i,v} \rangle_{\rm data}$, where $\delta$ denotes the Kronecker function. This allows starting from a distribution that is already close to the data distribution and significantly speeds up learning.

The parameters of the dReLU potential are initialized to $\gamma_+ = \gamma_- = 1$, $\theta_+ = \theta_- = 0$. This way, the average activity is initially linear, with a slope of 1. This choice is the most neutral one, as it does not favor any kind of nonlinearity (symmetric, asymmetric, sigmoid-like, or ReLU-like).

The initial weights are drawn from a gaussian ensemble with variance $\sigma^2/N$, such that the hidden-unit inputs $I_\mu$ initially have zero mean and variance $\sigma^2$. Indeed, under the above initialization for the fields and dReLU potential, the initial gradient for $W$ is $\nabla_W \mathcal{L}|_{t=0} = W \odot C_{\rm data}$, where $C_{\rm data}$ is the data covariance matrix (of size $N \times 20 \times N \times 20$) and $\odot$ denotes the tensor product over the first two axes (Decelle, Fissore, & Furtlehner, 2017). Therefore, zero weights correspond to a saddle point of the optimization; very small weights result in a small gradient and slow initial learning; and very large weights ($\sigma > 1$) result in a spin-glass phase in which samples mix poorly, which can cause divergence. We found that setting $\sigma = 0.1$ yields a good compromise between a fast initial learning speed and a fast mixing rate. We also briefly experimented with orthogonal ensembles of weights instead of gaussian ones, as in Pennington, Schoenholz, and Ganguli (2017), but found no significant speed-up for the lattice-protein data set.
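The initialization steps above can be sketched as follows (the small pseudocount added to the frequencies is our assumption, to avoid $\log 0$):

```python
import numpy as np

def init_rbm(msa_onehot, M, sigma=0.1, seed=0):
    """Initialization sketch. msa_onehot: (B, N, q) one-hot data. Fields are set
    to the log frequencies of the independent model (with a small pseudocount,
    our addition) and put in the zero-sum gauge; dReLU parameters to gamma = 1,
    theta = 0; weights to gaussian entries of variance sigma^2 / N."""
    B, N, q = msa_onehot.shape
    freqs = msa_onehot.mean(axis=0)            # <delta_{v_i, v}>_data
    g = np.log(freqs + 1e-6)
    g -= g.mean(axis=1, keepdims=True)         # zero-sum gauge (appendix A.1)
    gamma = np.ones((2, M))                    # gamma_+, gamma_-
    theta = np.zeros((2, M))                   # theta_+, theta_-
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, sigma / np.sqrt(N), size=(M, N, q))
    return g, W, gamma, theta
```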

The learning rate is initially set to 0.1 and decays exponentially after a fraction of the total training time (e.g., 50%) until it reaches a final, small value (e.g., $10^{-4}$).

### A.4 Explicit Expressions for Training and Sampling from RBM

Training, sampling, and computing the probability of sequences with RBM require (1) sampling from $P(v|h)$, (2) sampling from $P(h|v)$, and (3) evaluating the effective energy $E_{\rm eff}(v)$ and its derivatives. This is done as follows:

Each sequence site $i$ is encoded as a categorical variable taking integer values $v_i \in \{0, \ldots, 19\}$, with each integer corresponding to one of the 20 amino acids. Similarly, the fields and weights are encoded as, respectively, an $N \times 20$ matrix and an $M \times N \times 20$ tensor.

Given a hidden layer configuration, each visible unit is sampled from a categorical distribution with 20 states.

Given a visible layer configuration, each hidden unit is sampled from $P(h_\mu|v)$. For a quadratic potential, this conditional distribution is gaussian. For the dReLU potential, let
$$\Phi(x) = \exp\left(\frac{x^2}{2}\right) \sqrt{\frac{\pi}{2}} \left[1 - \mathrm{erf}\left(\frac{x}{\sqrt{2}}\right)\right].$$
$\Phi$ satisfies the following properties:
$$\Phi(x) \underset{x \to -\infty}{\sim} \sqrt{2\pi}\, \exp\left(\frac{x^2}{2}\right),$$
$$\Phi(x) \underset{x \to +\infty}{\sim} \frac{1}{x} - \frac{1}{x^3} + \frac{3}{x^5} + O\left(\frac{1}{x^7}\right),$$
$$\Phi'(x) = x\,\Phi(x) - 1.$$

To avoid numerical issues, $\Phi$ is computed in practice from its definition for $x < 5$ and from its asymptotic expansion otherwise. We also write $\mathcal{TN}(\mu, \sigma^2, a, b)$ for the truncated gaussian distribution of mode $\mu$, width $\sigma$, and support $[a, b]$. Then $P(h|I)$ is a mixture of two truncated gaussians,
$$P(h|I) = p_+ \,\mathcal{TN}\!\left(\frac{I - \theta_+}{\gamma_+}, \frac{1}{\gamma_+}, 0, +\infty\right) + p_- \,\mathcal{TN}\!\left(\frac{I - \theta_-}{\gamma_-}, \frac{1}{\gamma_-}, -\infty, 0\right),$$
where $Z_\pm = \Phi\!\left(\mp \frac{I - \theta_\pm}{\sqrt{\gamma_\pm}}\right) \frac{1}{\sqrt{\gamma_\pm}}$ and $p_\pm = \frac{Z_\pm}{Z_+ + Z_-}$. Evaluating $E_{\rm eff}$ and its derivatives requires an explicit expression for the cumulant-generating function $\Gamma(I)$. For quadratic potentials, $\Gamma(I)$ is given in the main text. For dReLU potentials, $\Gamma(I) = \log(Z_+ + Z_-)$, with $Z_\pm$ as defined above.
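A sketch of $\Phi$ and of sampling from $P(h|I)$ for a single dReLU unit; the branch threshold at $x = 5$ follows the text, while the rejection sampling of the truncated gaussians is our simple (unoptimized) choice.

```python
import numpy as np
from math import erfc, exp, pi, sqrt

def phi(x):
    """Phi(x) = exp(x^2/2) sqrt(pi/2) [1 - erf(x/sqrt(2))]: definition for x < 5,
    asymptotic series 1/x - 1/x^3 + 3/x^5 beyond. A full implementation would
    also guard against overflow of exp(x^2/2) for strongly negative x."""
    if x < 5:
        return exp(x * x / 2) * sqrt(pi / 2) * erfc(x / sqrt(2))
    return 1 / x - 1 / x ** 3 + 3 / x ** 5

def z_pm(I, theta_p, theta_m, gamma_p, gamma_m):
    """Mixture weights Z_+ and Z_- of the two truncated gaussians."""
    z_p = phi(-(I - theta_p) / sqrt(gamma_p)) / sqrt(gamma_p)
    z_m = phi(+(I - theta_m) / sqrt(gamma_m)) / sqrt(gamma_m)
    return z_p, z_m

def sample_drelu(I, theta_p, theta_m, gamma_p, gamma_m, rng):
    """Sample h ~ P(h|I): pick a branch with probability p_+/p_-, then draw from
    the corresponding truncated gaussian by simple rejection."""
    z_p, z_m = z_pm(I, theta_p, theta_m, gamma_p, gamma_m)
    if rng.random() < z_p / (z_p + z_m):   # positive branch, support [0, +inf)
        mu, sig, keep = (I - theta_p) / gamma_p, 1 / sqrt(gamma_p), lambda h: h >= 0
    else:                                  # negative branch, support (-inf, 0]
        mu, sig, keep = (I - theta_m) / gamma_m, 1 / sqrt(gamma_m), lambda h: h <= 0
    while True:
        h = rng.normal(mu, sig)
        if keep(h):
            return h

def gamma_cgf(I, theta_p, theta_m, gamma_p, gamma_m):
    """Cumulant-generating function Gamma(I) = log(Z_+ + Z_-)."""
    return np.log(sum(z_pm(I, theta_p, theta_m, gamma_p, gamma_m)))
```

The identity $\Phi'(x) = x\Phi(x) - 1$ provides a cheap numerical check of the implementation.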

## Appendix B: Proxies for Weight Sparsity, Number of Strongly Activated Hidden Units

Since neither the hidden layer activity nor the weights are exactly zero, proxies are required for evaluating sparsity. To avoid arbitrary thresholds, which may not be adapted to every case, we use participation ratios.
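One standard participation-ratio proxy, consistent with the saturation value $\sim 2/27$ quoted in section 4.5 (the exact exponents used in the original appendix may differ):

```python
import numpy as np

def weight_sparsity(W):
    """Participation-ratio proxy for the sparsity p of each hidden unit's weights.
    With x_i = sum_v w_{i mu}(v)^2, PR = (sum_i x_i)^2 / (N sum_i x_i^2) equals 1
    for uniform weights and k/N when only k sites carry weight; two coupled sites
    on N = 27 give the minimal value 2/27."""
    x = (W ** 2).sum(axis=-1)        # W: (M, N, q) -> per-site squared norms (M, N)
    N = x.shape[1]
    return x.sum(axis=1) ** 2 / (N * (x ** 2).sum(axis=1))
```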

## Appendix C: Supplementary Figures

## Notes

^{1}

One extreme example of such an intricate representation is the random gaussian projection used in compressed sensing (Donoho, 2006). Provided that the configuration is sparse in some known basis, it can be reconstructed from a small set of linear projections onto random independent and identically distributed (i.i.d.) gaussian weights $w_{i\mu}$. Although such a representation carries all the information necessary to reconstruct the signal, it is by construction unrelated to the main statistical features of the data.

^{2}

Here $p$ is assumed to be finite, such that each hidden unit receives input from a large number $pN$ of visible units. Alternatively, one could impose finite connectivity $c=pN$, that is, $p$ of the order of $1/N$ (Agliari, Annibale, Barra, Coolen, & Tantari, 2013). We evaluated $p$ and $pN$ for a set of 38 RBM trained on various protein families with different lengths and found that $p$ was more stable (see Figure S8 in).

## Acknowledgments

We acknowledge Clément Roussel for helpful comments. This work was partly funded by the ANR project RBMPro CE30-0021-01. JT acknowledges funding from the Safra Center for Bioinformatics, Tel Aviv University.