Motivated by data-rich experiments in transcriptional regulation and sensory neuroscience, we consider the following general problem in statistical inference: when exposed to a high-dimensional signal S, a system of interest computes a representation R of that signal, which is then observed through a noisy measurement M. From a large number of signals and measurements, we wish to infer the “filter” that maps S to R. However, the standard method for solving such problems, likelihood-based inference, requires perfect a priori knowledge of the “noise function” mapping R to M. In practice such noise functions are usually known only approximately, if at all, and using an incorrect noise function will typically bias the inferred filter. Here we show that in the large data limit, this need for a precharacterized noise function can be circumvented by searching for filters that instead maximize the mutual information I[M; R] between observed measurements and predicted representations. Moreover, if the correct filter lies within the space of filters being explored, maximizing mutual information becomes equivalent to simultaneously maximizing every dependence measure that satisfies the data processing inequality. It is important to note that maximizing mutual information will typically leave a small number of directions in parameter space unconstrained. We term these directions diffeomorphic modes and present an equation that allows these modes to be derived systematically. The presence of diffeomorphic modes reflects a fundamental and nontrivial substructure within parameter space, one that is obscured by standard likelihood-based inference.
All statistical regression problems have this signal-representation-measurement (SRM) form (Bishop, 2006), but we will focus on two biological applications for which this problem is particularly relevant. In neuroscience, SRM experiments are commonly used to characterize the response of neurons to stimuli (Schwartz, Pillow, Rust, & Simoncelli, 2006). For instance, S may be an image to which a retina is exposed, while M is a binary variable (spike or no spike) indicating the response of a single retinal ganglion cell. It is often assumed that the spiking probability depends on a linear projection R of S. The specific probability of a spike given R is determined by the noise function π(M|R).
More recently, analogous experiments have been used to characterize the biophysical mechanisms of transcriptional regulation. In the context of work by Kinney, Murugan, Callan, and Cox (2010), S is the DNA sequence of a transcriptional regulatory region, R is the rate of mRNA transcription produced by this sequence, and M is a (noisy) measurement of the resulting level of gene expression. The filter θ is a function of DNA sequence that reflects the underlying molecular mechanisms of transcript initiation. The noise function π accounts for both biological noise and instrument noise.
Although the correct filter does indeed maximize likelihood when the correct noise function π is used, full a priori knowledge of this noise function is rare in practice. Often π is chosen primarily for computational convenience, as is standard with least-squares regression (which implicitly assumes homogeneous Gaussian noise). This can be problematic because using an incorrect π will typically produce bias in the inferred filter θ, bias that does not disappear in the N → ∞ limit. The reason for this is illustrated in Figure 1.
Sometimes this problem can be partially alleviated by performing a separate calibration experiment in which the noise function is measured directly. For instance, one might be able to make repeated measurements M for a select number of known representations R. However, there will always be residual measurement error in the calibrated noise function, error that will propagate to the inferred filter in a manner that is not properly accounted for by simply plugging this estimate of π into likelihood calculations via equation 1.2.
We begin by pointing out that in the N → ∞ limit, maximizing mutual information over the filter θ alone is equivalent to maximizing likelihood over both θ and the noise function π. We then prove that when the correct filter lies within the class of filters being considered, maximizing mutual information is also equivalent to simultaneously maximizing every dependence measure that satisfies the data processing inequality (DPI). However, in the absence of a known noise function π, SRM experiments are fundamentally incapable of constraining certain directions in the parameter space Θ of the filter; we call these directions diffeomorphic modes. An equation for diffeomorphic modes is described and then applied to filters having various functional forms. In particular, our analysis of a linear-nonlinear filter that Kinney et al. (2010) used to model transcriptional regulation demonstrates how model nonlinearities can eliminate diffeomorphic modes in useful and nonobvious ways. This has important consequences for biophysical studies of transcriptional regulation that use recently developed DNA-sequencing-based assays (Kinney et al., 2010; Melnikov et al., 2012).
Throughout this article, we use R to implicitly denote the representation predicted by the filter θ for signal S; that is, R = θ(S). D is used to denote any DPI-satisfying dependence measure. Representations R are assumed to be multidimensional with components R_μ and μ = 1, 2, …, dim R. θ is used to denote both a filter and the parameters governing that filter. Θ represents both an abstract space of filters and the space of parameters for filters assumed to have a specific functional form. In the latter case, θ_i denotes coordinates in parameter space and i = 1, 2, …, dim Θ.
2. Mutual Information and Likelihood
The key point is that finding maximally informative filters is equivalent to solving the maximum likelihood problem over both filters θ and noise functions π. This is because if θ maximizes I[R; M], simply choosing a noise function that matches the empirical noise function, that is, setting π(M|R) = p(M|R), will minimize the divergence between the empirical and assumed noise functions (driving it to zero) and thus maximize likelihood.
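This argument rests on an exact identity that is easy to verify numerically for discrete R and M (the toy data and variable names below are ours, used only for illustration): with the noise function set to the empirical one, the per-datum log likelihood equals I[R; M] − H[M], so maximizing mutual information over filters matches maximizing likelihood over filters and noise functions jointly.

```python
# Sketch: for discrete R and M, the per-datum log likelihood under the
# empirical noise function p_emp(M|R) equals I[R;M] - H[M] exactly.
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
N = 5000
R = rng.integers(0, 4, size=N)                # predicted representations (discrete toy values)
M = (R + rng.integers(0, 2, size=N)) % 4      # noisy measurements of R

joint = Counter(zip(R.tolist(), M.tolist()))  # counts of (r, m) pairs
pr = Counter(R.tolist())                      # counts of r
pm = Counter(M.tolist())                      # counts of m

# Per-datum log likelihood with the empirical noise function p_emp(m|r) = c(r,m)/c(r)
logL = sum(c * np.log(c / pr[r]) for (r, m), c in joint.items()) / N

# Empirical mutual information I[R;M] and entropy H[M]
I = sum(c / N * np.log((c / N) / (pr[r] / N * pm[m] / N)) for (r, m), c in joint.items())
H_M = -sum(c / N * np.log(c / N) for c in pm.values())

assert np.isclose(logL, I - H_M)  # logL = I[R;M] - H[M], exactly
```

The identity holds exactly for the empirical distributions, not just asymptotically; the large N limit enters only when relating empirical to true distributions.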
3. DPI-Optimal Filters
Mutual information is just one measure among many that satisfy DPI (see appendix B). In this section, we discuss the importance of DPI for the SRM inference problem and introduce the notion of DPI-optimal filters.
4. Diffeomorphic Modes
Whether or not two filters θ1 and θ2 satisfy the above equivalence relation (equation 3.6) can depend on the true filter θ* and the specific noise function π of the SRM experiment. However, certain pairs of filters will satisfy this relation under all SRM experiments. We will refer to such pairs of filters as being information equivalent. In appendix D, we prove that two filters are information equivalent if and only if their predicted representations are related by an invertible transformation.
As an objective function, mutual information is inherently incapable of distinguishing between information equivalent filters. In practice, this means that selecting maximally informative filters from a parameterized set of filters can leave some directions in parameter space unconstrained. Here we term these directions diffeomorphic modes.
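This indistinguishability can be seen directly in a small numerical example (the toy filters and the rank-based estimator are our own choices, not part of any cited analysis): a plug-in estimate of I[R; M] built on equal-count bins of R depends only on the ranks of R, and therefore returns identical values for two filters whose predictions are related by a strictly increasing, hence invertible, transformation.

```python
# Sketch: mutual information cannot distinguish information-equivalent filters.
# R2 = exp(2*R1 + 1) is an invertible transform of R1; a rank-based
# (equal-count-bin) plug-in estimate of I[R;M] is identical for both.
import numpy as np

rng = np.random.default_rng(1)
N = 20000
R1 = rng.normal(size=N)                          # predictions of filter 1
M = (R1 + rng.normal(size=N) > 0).astype(int)    # binary noisy measurement
R2 = np.exp(2.0 * R1 + 1.0)                      # invertible transform of R1

def mi_quantile(R, M, bins=10):
    """Plug-in I[R;M] (nats) using equal-count bins of R. The binning is
    rank-based, so the estimate is invariant under any strictly
    increasing transformation of R."""
    q = np.quantile(R, np.linspace(0, 1, bins + 1))
    b = np.clip(np.searchsorted(q, R, side="right") - 1, 0, bins - 1)
    I = 0.0
    for rb in range(bins):
        for m in (0, 1):
            p = np.mean((b == rb) & (M == m))
            if p > 0:
                I += p * np.log(p / (np.mean(b == rb) * np.mean(M == m)))
    return I

assert np.isclose(mi_quantile(R1, M), mi_quantile(R2, M))
```

Equal-count bins were chosen precisely because they depend only on ranks; a fixed-width binning would give slightly different estimates for R1 and R2 even though the exact mutual information is identical.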
The diffeomorphic modes of linear filters have an important and well-recognized consequence in neuroscience: the technique of maximally informative dimensions can identify only the relevant subspace of signal space, not a specific basis within that subspace (Sharpee et al., 2004; Paninski, 2003; Pillow & Simoncelli, 2006). However, an interesting twist occurs in applications to transcriptional regulation. Here, linear filters are often used to model the sequence-dependent binding energies of proteins to DNA (Stormo, 2013). Any mechanistic hypothesis about how DNA-bound proteins interact with one another predicts that the transcription rate will depend on these binding energies in a specific nonlinear manner (Bintu et al., 2005; Stormo, 2013). Such upfront knowledge about the nonlinearities of linear-nonlinear filters can eliminate diffeomorphic modes of the underlying linear filters in useful and nonobvious ways (Kinney, 2008; Kinney et al., 2010).
4.1. An Equation for Diffeomorphic Modes.
Consider a filter θ, representing a point in Θ, whose parameters are infinitesimally transported along a vector field g having components g_i(θ). This yields a new filter θ′ with components θ′_i = θ_i + ε g_i for infinitesimal ε. If the representation R predicted by θ for a specified signal S has components R_μ in representation space, these will be transformed to R′_μ = R_μ + ε Σ_i g_i (∂R_μ/∂θ_i). The vector field g is a diffeomorphic mode if and only if Σ_i g_i (∂R_μ/∂θ_i) = h_μ(R) for some function h of R alone, that is, if the induced change in the predicted representation does not otherwise depend on S.
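As a concrete illustration of this condition (the offset parametrization below is an assumption chosen for the example), consider a scalar linear filter with an additive offset. Affine vector fields induce a change in R that is a function of R alone:

```latex
\[
  R = \theta_0 + \sum_k \theta_k S_k, \qquad
  g_0 = a + b\,\theta_0, \qquad g_k = b\,\theta_k ,
\]
\[
  \sum_i g_i \frac{\partial R}{\partial \theta_i}
  = (a + b\,\theta_0)\cdot 1 + \sum_k b\,\theta_k S_k
  = a + bR \equiv h(R).
\]
```

Here g is diffeomorphic because the induced change is h(R) = a + bR. A generic vector field instead yields Σ_k g_k S_k, which cannot be written as a function of R alone.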
4.2. General Linear Filters.
The number of diffeomorphic modes is bounded above by the number of independent parameters on which h depends (at each R). For a general linear filter, we see that there can be no more than dim R (dim R + 1) diffeomorphic modes, which is the number of parameters a_μ and b_μν in equation 4.3. This bound is independent of the number of signal features, that is, the dimensionality of S. In particular, if R is a scalar, then h = a + bR. In this case we observe two diffeomorphic modes, corresponding to additive and multiplicative transformations of R.
4.3. A Linear-Nonlinear Filter.
Measurements M were taken for mutant lac promoters S. These data were then used to fit a model Q for the DNA-sequence-dependent binding energy of CRP. This was done by maximizing I[Q; M]. Because of the diffeomorphic modes of Q, the parameters of this energy model were inferred only up to an unknown scale, and the additive constant was left undetermined. This is shown in Figure 3B. Analogous results were obtained for the RNAP binding energy model P (see Figure 3C).
When the parameters of the linear filters P and Q were simultaneously fit to data by maximizing I[T; M] (or, equivalently, maximizing I[R; M]), three of the four diffeomorphic modes described above were eliminated (see Figure 3D). Specifically, the overall scale of the parameters of both P and Q was fixed, allowing binding energy predictions for CRP and RNAP in physically meaningful units of k_B T. The parameter corresponding to the intracellular concentration of CRP was also fixed by the data. The only diffeomorphic mode left unbroken was the one corresponding to the intracellular concentration of RNAP.
Likelihood-based inference masks the fundamentally different ways in which data constrain the parameters that lie along diffeomorphic modes versus those that lie along nondiffeomorphic modes. Standard likelihood inference constrains all model parameters, including both diffeomorphic and nondiffeomorphic modes, with error bars that scale as N^{-1/2}. These constraints will be consistent with the correct underlying filter θ* when the correct noise function is used (see Figure 4A). However, use of an incorrect noise function will typically cause θ* to fall outside the error bars inferred along both diffeomorphic and nondiffeomorphic modes (see Figure 4B).
This problem is rectified if we use a prior p(π) that reflects our uncertainty about what the true noise function is. From equation 2.5, it can be seen that using the resulting marginal likelihood to compute a posterior distribution on θ will constrain diffeomorphic and nondiffeomorphic modes in fundamentally different ways (see Figure 4C). Nondiffeomorphic modes will be constrained by the mutual information I[R; M], which remains finite in the large N limit. This produces error bars on nondiffeomorphic modes comparable to those produced by likelihood when the correct noise function is used. However, constraints along diffeomorphic modes will come only from the subleading correction to the per-datum marginal log likelihood. Because this correction vanishes as N^{-1}, its contribution to the total log likelihood is O(1), and diffeomorphic constraints become independent of N once N is sufficiently large.
Fortunately, one does not need to posit a specific prior probability over all possible noise functions in order to confidently infer filters from SRM data. Using mutual information as an objective function instead of likelihood, that is, sampling filters according to p(θ) ∝ e^{N I[R; M]}, will constrain nondiffeomorphic modes the same way that marginal likelihood does while putting no constraints along diffeomorphic modes (see Figure 4D).
One might worry that a large fraction of filter parameters will be diffeomorphic and that the analysis of SRM experiments will therefore require an assumed noise function in order to obtain useful results, even if doing so yields unreliable error bars. Such situations are conceivable, but in practice this is often not the case. We have shown that for linear filters, the number of diffeomorphic modes will typically not exceed dim R (dim R + 1), regardless of how large dim S is. Some of these diffeomorphic modes may also be eliminated if these linear filters are combined using a nonlinearity of known functional form. Indeed, of the 204 independent parameters comprising the biophysical model of transcriptional regulation inferred by Kinney et al. (2010), only one was diffeomorphic.
A bigger concern perhaps is the practical difficulty of using mutual information as an objective function. Specifically, it remains unclear how to compute I[R; M] rapidly and reliably enough to confidently sample filters from p(θ) ∝ e^{N I[R; M]}. Still, various methods for estimating mutual information are available (Khan et al., 2007; Panzeri, Senatore, Montemurro, & Petersen, 2007), and the information optimization problem has been successfully implemented using a variety of techniques (Sharpee et al., 2004, 2006; Kinney et al., 2007, 2010; Melnikov et al., 2012). We believe the exciting applications of mutual-information-based inference provide compelling motivation for making progress on these practical issues.
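To make the optimization concrete, here is a minimal sketch on simulated data (the two-dimensional toy problem, the steepness of the spiking nonlinearity, and the binned estimator are all our assumptions): candidate linear filters are scanned by direction only, since the overall scale of a linear filter is diffeomorphic, and the plug-in estimate of I[R; M] peaks near the true filter without any model of the noise function.

```python
# Sketch: recover the direction of a linear filter by brute-force maximizing
# a plug-in estimate of I[R;M], with no model of the noise function pi(M|R).
import numpy as np

rng = np.random.default_rng(2)
N = 20000
S = rng.normal(size=(N, 2))                        # 2D toy signals
theta_true = np.array([np.cos(0.6), np.sin(0.6)])  # true filter direction (0.6 rad)
R_true = S @ theta_true
# Binary measurements from a noise function unknown to the inference below
M = (rng.random(N) < 1.0 / (1.0 + np.exp(-3.0 * R_true))).astype(int)

def mi(R, M, bins=12):
    """Plug-in I[R;M] (nats) with equal-count bins of R."""
    q = np.quantile(R, np.linspace(0, 1, bins + 1))
    b = np.clip(np.searchsorted(q, R, side="right") - 1, 0, bins - 1)
    I = 0.0
    for rb in range(bins):
        for m in (0, 1):
            p = np.mean((b == rb) & (M == m))
            if p > 0:
                I += p * np.log(p / (np.mean(b == rb) * np.mean(M == m)))
    return I

# Scale is a diffeomorphic mode, so only the direction is scanned.
angles = np.linspace(0, np.pi, 91)
scores = [mi(S @ np.array([np.cos(a), np.sin(a)]), M) for a in angles]
best = angles[int(np.argmax(scores))]
assert abs(best - 0.6) < 0.12  # recovered direction is close to the true one
```

In realistic settings the brute-force scan would be replaced by Metropolis sampling or gradient-based optimization over high-dimensional filter parameters; the point here is only that the objective itself requires no noise function.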
Appendix A: Marginal Likelihood
In certain cases, the marginal likelihood can be computed explicitly and its θ-dependent correction thereby shown to vanish (Kinney et al., 2007). More generally, when the space of noise functions is taken to be finite dimensional, a saddle-point computation (valid for large N) gives (1/N) log ∫ dπ p(π) e^{N L(θ, π)} = L(θ, π*) + O(N^{-1} log N), where L(θ, π) is the per-datum log likelihood and π* is the noise function maximizing it at fixed θ. The subleading terms involve H, the π-space Hessian of L(θ, π) evaluated at π*. If p(π) and its derivatives are bounded, then the θ-dependent part of this correction decays as N^{-1}. If the space of noise functions is infinite dimensional, this saddle-point computation becomes a semiclassical computation in field theory akin to the density estimation problem studied by Bialek, Callan, and Strong (1996). If this field theory is properly formulated through an appropriate choice of p(π), then the correction may exhibit different decay behavior, but will still vanish as N → ∞. See also Rajan et al. (2013).
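For reference, a sketch of the saddle-point (Laplace) approximation behind this claim, assuming a K-dimensional parametrization of noise functions with per-datum log likelihood L(θ, π) (the 2π in the prefactor is the numerical constant, not the noise function):

```latex
\[
  \int d\pi \, p(\pi)\, e^{N L(\theta,\pi)}
  \;\approx\;
  p(\pi^*)\, e^{N L(\theta,\pi^*)}
  \left(\frac{2\pi}{N}\right)^{K/2}
  \det(-H)^{-1/2},
\]
\[
  \frac{1}{N}\log \int d\pi\, p(\pi)\, e^{N L(\theta,\pi)}
  = L(\theta,\pi^*) + O\!\left(\frac{\log N}{N}\right).
\]
```

The O(log N / N) term comes from the N-dependent prefactor; the θ-dependent pieces, p(π*) and det(−H), contribute only at order N^{-1}.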
Appendix B: DPI-Satisfying Measures
Appendix C: DPI Optimality
Assume that I[R; M] = I[R*; M], as in equation 3.5, but that R* → R → M is not a Markov chain. Because R → R* → M is a Markov chain, the difference in mutual informations can be decomposed as the averaged KL divergence I[R*; M] − I[R; M] = ⟨D_KL(p(M|R*) ‖ p(M|R))⟩, where the average is over p(R, R*). If this quantity were zero, then R* → R → M would also be a Markov chain, contradicting our assumption. This KL divergence must therefore be positive, that is, I[R; M] < I[R*; M], contradicting the assumed equality. So if I[R; M] = I[R*; M], then R* → R → M is a Markov chain; DPI applied to this chain gives D[R*; M] ≤ D[R; M], while DPI applied to R → R* → M gives D[R; M] ≤ D[R*; M], so D[R; M] = D[R*; M] for every DPI-satisfying measure D as well. This proves equation 3.5.
Appendix D: Information Equivalence
First, we observe that if θ1 and θ2 make predictions related by an invertible transformation, then they are information equivalent. This is readily shown from the fact that mutual information is invariant under arbitrary invertible transformations of R (Kinney & Atwal, 2013). Next, we show the converse: if θ1 and θ2 are information equivalent, the predictions R1 and R2 must be related by an invertible transformation. Here is the proof. If θ1 and θ2 are information equivalent, then D[R1; M] = D[R2; M] for all DPI-satisfying measures D and all SRM experiments, and in particular I[R1; M] = I[R2; M]. In appendix C, we showed that I[R; M] = I[R*; M] implies that R* → R → M is a Markov chain. Imagining an SRM experiment in which R* = R1 and M = R1, we find that R1 → R2 → R1 is a Markov chain. This implies that the mapping R1 → R2 is one-to-one. Similarly, R2 → R1 is one-to-one. R1 and R2 are therefore related by an invertible transformation.
We thank William Bialek, Curtis Callan, Bud Mishra, Swagatam Mukhopadhyay, Anand Murugan, Michael Schatz, Bruce Stillman, and Gašper Tkačik for helpful conversations. Support for this project was provided by the Simons Center for Quantitative Biology at Cold Spring Harbor Laboratory.
Such as stochastic gene expression (Elowitz, Levine, Siggia, & Swain, 2002).
The notation I[M; R] and I[R; M] will be used interchangeably.
For example, at finite N the divergence between the empirical noise function p(M|R) and π does not vanish at the true noise function.
The subscripts 1 and 2 label two different filters, not two parameters of a single filter.
For example, if the various features exhibit complicated interdependencies, either because of their functional form or because signals S are restricted to a particular subspace. We ignore such possibilities here.
Technically the number of diffeomorphic modes is the number of independent vector fields g_i that correspond to such transformations. However, here we consider only proper diffeomorphic modes, not gauge transformations; as in physics, we define gauge transformations to be vector fields g_i along which transformation of θ leaves all predicted representations invariant.
To fix the gauge freedoms of these filters, Kinney et al. (2010) adopted the convention that the energy matrix parameters sum to zero at each position l.
This assumes a nonzero CRP-RNAP interaction, that is, that CRP actually interacts with RNAP (which it does).
In this discussion, we ignore gauge parameters, which do not alter model predictions and are therefore nonidentifiable.
More precisely, given any direction i in filter space, the posterior uncertainty along that direction scales as N^{-1/2} for N large enough.