Lossy compression and clustering fundamentally involve a decision about which features are relevant and which are not. The information bottleneck method (IB) by Tishby, Pereira, and Bialek (1999) formalized this notion as an information-theoretic optimization problem and proposed an optimal trade-off between throwing away as many bits as possible and selectively keeping those that are most important. In the IB, compression is measured by mutual information. Here, we introduce an alternative formulation that replaces mutual information with entropy, which we call the deterministic information bottleneck (DIB), and argue that it better captures this notion of compression. As suggested by its name, the solution to the DIB problem turns out to be a deterministic encoder, or hard clustering, as opposed to the stochastic encoder, or soft clustering, that is optimal under the IB. We compare the IB and DIB on synthetic data, showing that the IB and DIB perform similarly in terms of the IB cost function, but that the DIB significantly outperforms the IB in terms of the DIB cost function. We also empirically find that the DIB offers a considerable gain in computational efficiency over the IB, over a range of convergence parameters. Our derivation of the DIB also suggests a method for continuously interpolating between the soft clustering of the IB and the hard clustering of the DIB.
Compression is a ubiquitous task for humans and machines alike (Cover & Thomas, 2006; MacKay, 2002). For example, machines must turn the large pixel grids of color that form pictures into small files capable of being shared quickly on the web (Wallace, 1991), humans must compress the vast stream of ongoing sensory information they receive into small changes in the brain that form memories (Kandel, Schwartz, Jessell, Siegelbaum, & Hudspeth, 2013), and data scientists must turn large amounts of high-dimensional and messy data into more manageable and interpretable clusters (MacKay, 2002).
Lossy compression involves an implicit decision about what is relevant and what is not (Cover & Thomas, 2006; MacKay, 2002). In the example of image compression, the algorithms we use deem some features essential to representing the subject matter well, and others are thrown away. In the example of human memory, our brains deem some details important enough to warrant attention, and others are forgotten. And in the example of data clustering, information about some features is preserved in the mapping from data point to cluster ID, while information about others is discarded.
In many cases, the criterion for “relevance” can be described as information about some other variable(s) of interest. We call X the signal we are compressing, T the compressed version, Y the other variable of interest, and I(T;Y) the “information” that T has about Y (we formally define this later). For human memory, X is past sensory input, T the brain’s internal representation (e.g., the activity of a neural population, or the strengths of a set of synapses), and Y the features of the future environment that the brain is interested in predicting, such as extrapolating the position of a moving object. Thus, I(T;Y) represents the predictive power of the memories formed (Palmer, Marre, Berry, & Bialek, 2015). For data clustering, X is the original data, T is the cluster ID, and Y is the target for prediction (e.g., purchasing or ad-clicking behavior in a user segmentation problem). In summary, a good compression algorithm can be described as a trade-off between the compression of a signal and the selective maintenance of the relevant bits that help predict another signal.
This problem was formalized as the information bottleneck (IB) by Tishby, Pereira, and Bialek (1999). Their formulation involved an information-theoretic optimization problem and resulted in an iterative soft clustering algorithm guaranteed to converge to a local (though not necessarily global) optimum. In their cost functional, compression was measured by the mutual information I(X;T). This compression metric has its origins in rate-distortion theory and channel coding, where I(X;T) represents the maximal information transfer rate, or capacity, of the communication channel between X and T (Cover & Thomas, 2006).1 While this approach has its applications, often one is more interested in directly restricting the amount of resources required to represent T, quantified by the entropy H(T). This latter notion comes from the source coding literature and implies a restriction on the representational cost of T (Cover & Thomas, 2006). In the case of human memory, for example, H(T) would roughly correspond to the number of neurons or synapses required to represent or store a sensory signal X. In the case of data clustering, H(T) is related to the number of clusters.
In this letter, we introduce an alternative formulation of the IB, the deterministic information bottleneck (DIB), replacing the compression measure I(X;T) with H(T), thus emphasizing constraints on representation rather than communication. Using a generalization of both cost functionals, we derive an iterative solution to the DIB, which turns out to provide a hard clustering, or deterministic mapping from X to T, as opposed to the soft clustering, or probabilistic mapping, that the IB provides. Finally, we compare the IB and DIB solutions on synthetic data to help illustrate their differences.
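For concreteness, the two objectives can be written side by side. This is a restatement in the notation defined above (minimization is over the encoder q(t|x), subject to the Markov constraint T − X − Y), not new material:

```latex
L_{\mathrm{IB}}\bigl[q(t\mid x)\bigr]  = I(X;T) - \beta\, I(T;Y),
\qquad
L_{\mathrm{DIB}}\bigl[q(t\mid x)\bigr] = H(T) - \beta\, I(T;Y).
```

Since I(X;T) = H(T) − H(T|X), the DIB objective differs from the IB objective only in dropping the conditional (noise) entropy H(T|X).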
2 The Original Information Bottleneck
This solution is formal because the first equation depends on the second and vice versa. However, Tishby et al. (1999) showed that an iterative approach can be built on the above equations, which provably converges to a local optimum of the IB cost function (see equation 2.1).
Note that the first equation, 2.4, is the only “meaty” one; the other two, 2.5 and 2.6, are just there to enforce consistency with the laws of probability (e.g., that marginals are related to joints as they should be). In principle, with no proof of convergence to a global optimum, it might be possible for the solution obtained to vary with the initialization, but in practice, the cost function is smooth enough that this does not seem to happen. This algorithm is summarized in algorithm 1. Note that while the general solution is iterative, there is at least one known case in which an analytic solution is possible: when and are jointly gaussian (Chechik, Globerson, Tishby, & Weiss, 2005).
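The three update equations can be sketched in code. This is our own minimal implementation under the discrete-variable assumption, with distributions stored as numpy arrays; the function name and array layout are our conventions, not the paper's reference implementation:

```python
import numpy as np

def ib_step(q_t_x, p_x, p_y_x, beta):
    """One IB iteration over the three self-consistent equations.

    q_t_x : (|T|, |X|) array, current encoder q(t|x)
    p_x   : (|X|,) array, marginal p(x)
    p_y_x : (|Y|, |X|) array, conditional p(y|x)
    """
    # Marginal over clusters: q(t) = sum_x p(x) q(t|x)
    q_t = q_t_x @ p_x
    # Decoder: q(y|t) = (1 / q(t)) sum_x p(y|x) p(x) q(t|x)
    q_y_t = (p_y_x * p_x) @ q_t_x.T / np.maximum(q_t, 1e-300)
    # Encoder: q(t|x) proportional to q(t) exp(-beta * D_KL[p(y|x) || q(y|t)])
    ratio = np.where(p_y_x[:, None, :] > 0,
                     np.log(p_y_x[:, None, :] /
                            np.maximum(q_y_t[:, :, None], 1e-300)),
                     0.0)
    kl = np.sum(p_y_x[:, None, :] * ratio, axis=0)  # shape (|T|, |X|)
    logits = np.log(np.maximum(q_t, 1e-300))[:, None] - beta * kl
    q_new = np.exp(logits - logits.max(axis=0))      # stabilize before normalizing
    return q_new / q_new.sum(axis=0)
```

Iterating `ib_step` to convergence from a normalized initial encoder implements the fixed-point scheme; each step returns a properly normalized soft clustering.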
In summary, given the joint distribution p(x, y), the IB method extracts a compressive encoder q(t|x) that selectively maintains the bits from X that are informative about Y. As the encoder is a function of the free parameter β, we can visualize the entire family of solutions on a curve (see Figure 1), showing the trade-off between compression (I(X;T), on the x-axis) and relevance (I(T;Y), on the y-axis), with β as an implicitly varying parameter. For small β, compression is more important than prediction, and we find ourselves at the bottom left of the curve in the high-compression, low-prediction regime. As β increases, prediction becomes more important relative to compression, and we see that both I(X;T) and I(T;Y) increase. At some point, I(T;Y) saturates, because there is no more information about Y that can be extracted from X (either because I(T;Y) has reached I(X;Y) or because T has too small cardinality). In this regime, the encoder will approach the trivially deterministic solution of mapping each x to its own cluster. At any point on the curve, the slope is equal to β⁻¹, which we can read off directly from the cost functional. Note that the region below the curve is shaded because this area is feasible; for suboptimal q(t|x), solutions will lie in this region. Optimal solutions will, of course, lie on the curve, and no solutions will lie above the curve.
Additional work on the IB has highlighted its relationship with maximum likelihood on a multinomial mixture model (Slonim & Weiss, 2002) and canonical correlation analysis (Creutzig, Globerson, & Tishby, 2009)—and therefore linear gaussian models (Bach & Jordan, 2006) and slow feature analysis (Turner & Sahani, 2007). Applications have included speech recognition (Hecht & Tishby, 2005; Hecht, Noor, & Tishby, 2009), topic modeling (Slonim & Tishby, 2000b, 2001; Bekkerman, El-Yaniv, Tishby, & Winter, 2001, 2003), and neural coding (Schneidman, Slonim, Tishby, de Ruyter van Steveninck, & Bialek, 2001; Palmer, Marre, Berry, & Bialek, 2015). Most recently, the IB has even been proposed as a method for benchmarking the performance of deep neural networks (Tishby & Zaslavsky, 2015).
3 The Deterministic Information Bottleneck
Our motivation for introducing an alternative formulation of the information bottleneck is rooted in the compression term of the IB cost function; there, the minimization of the mutual information I(X;T) represents compression. As discussed above, this measure of compression comes from the channel coding literature and implies a restriction on the communication cost between X and T. Here, we are interested in the source coding notion of compression, which implies a restriction on the representational cost of T. For example, in neuroscience, there is a long history of work on redundancy reduction in the brain in the form of minimizing H(T) (Barlow, 1981, 2001a, 2001b).
Note that the last equation is just equation 2.3, since this follows from the Markov constraint. With α = 1, we can see that the first equation becomes the IB solution from equation 2.2, as should be the case.
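As a compact summary of the generalized family, consistent with the α = 1 and α → 0 limits discussed in the text (Z denotes the normalization factor; treat this as a restatement of the surrounding equations rather than a new result):

```latex
L_\alpha\bigl[q(t\mid x)\bigr] = H(T) - \alpha\, H(T\mid X) - \beta\, I(T;Y),
\qquad
q(t\mid x) = \frac{1}{Z}\,
\exp\!\left[\frac{1}{\alpha}\Bigl(\log q(t)
  - \beta\, D_{\mathrm{KL}}\bigl[p(y\mid x)\,\big\|\,q(y\mid t)\bigr]\Bigr)\right].
```

At α = 1 the exponent reduces to the IB solution q(t|x) ∝ q(t) exp(−β D_KL), while as α → 0 the 1/α scaling concentrates all probability mass on the argmax over t, yielding the deterministic DIB encoder.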
This generalized algorithm can be used in its own right; however, we will not discuss that option further here.
So, as anticipated, the solution to the DIB problem is a deterministic encoding distribution. The log q(t) term above encourages us to use as few values of t as possible, via a “rich-get-richer” scheme that assigns each x preferentially to a t with many xs already assigned to it. The KL divergence term, as in the original IB problem, just makes sure we pick ts that retain as much information from X about Y as possible. The parameter β, as in the original problem, controls the trade-off between how much we value compression and prediction.
This process is summarized in algorithm 2.
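A minimal sketch of one DIB iteration, assigning each x deterministically to t*(x) = argmax_t [log q(t) − β D_KL(p(y|x) ‖ q(y|t))]; the function name and array layout are our own conventions:

```python
import numpy as np

def dib_step(f, p_x, p_y_x, beta, n_t):
    """One DIB iteration; f[x] is the current hard cluster assignment of x."""
    # q(t): total probability mass currently assigned to each cluster
    q_t = np.bincount(f, weights=p_x, minlength=n_t)
    # q(y|t): mass-weighted average of p(y|x) over the xs in cluster t
    q_y_t = np.zeros((p_y_x.shape[0], n_t))
    np.add.at(q_y_t.T, f, (p_y_x * p_x).T)
    q_y_t = q_y_t / np.maximum(q_t, 1e-300)
    # Empty clusters get log q(t) = -inf, so they are never revived.
    with np.errstate(divide="ignore"):
        log_q_t = np.log(q_t)
    ratio = np.where(p_y_x[:, None, :] > 0,
                     np.log(p_y_x[:, None, :] /
                            np.maximum(q_y_t[:, :, None], 1e-300)),
                     0.0)
    kl = np.sum(p_y_x[:, None, :] * ratio, axis=0)  # shape (n_t, |X|)
    # Hard assignment: argmax of log q(t) - beta * D_KL for each x
    return np.argmax(log_q_t[:, None] - beta * kl, axis=0)
```

Starting from each x in its own cluster and iterating to a fixed point gives the hard clustering; note the “rich-get-richer” effect enters through the log q(t) term.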
Note that the DIB algorithm also corresponds to “clamping” the IB at every step by assigning each x to its highest-probability cluster t*(x). We can see this by taking the argmax of the logarithm of q(t|x) in equation 2.2, noting that the argmax of a positive function is equivalent to the argmax of its logarithm, discarding the log Z(x, β) term since it does not depend on t, and seeing that the result corresponds to equation 3.12. We emphasize, however, that this is not the same as simply running the IB algorithm to convergence and then clamping the resulting encoder; that would, in most cases, produce a suboptimal, “unconverged” deterministic solution.
As with the IB, the DIB solutions can be plotted as a function of β. However, in this case, it is more natural to plot I(T;Y) as a function of H(T) rather than I(X;T). That said, in order to compare the IB and DIB, they need to be plotted in the same plane. When plotting in the IB plane, the DIB curve will of course lie below the IB curve, since in this plane the IB curve is optimal; the opposite will be true when plotting in the DIB plane. Comparisons with experimental data can be performed in either plane. A Python implementation of algorithms 1 and 2, as well as tools for generating synthetic data and the analysis of results, is available at https://github.com/djstrouse/information-bottleneck.
4 Comparison of IB and DIB
To get an idea of how the IB and DIB solutions differ in practice, we generated a series of random joint distributions p(x, y), solved for the IB and DIB solutions for each, and compared them in both the IB and DIB planes. To generate the p(x, y), we first sampled the marginal p(x) from a symmetric Dirichlet distribution with concentration parameter αx and then sampled each row of the conditional p(y|x) from another symmetric Dirichlet distribution with concentration parameter αy. In the experiments shown here, we set αx to 1000, so that each x was approximately equally likely, and we chose the αy to be equally spaced logarithmically in order to provide a range of informativeness in the conditionals. We set the cardinality of X much larger than that of Y for two reasons. First, this encourages overlap between the conditionals p(y|x), which leads to a more interesting clustering problem. Second, in typical applications, this will be the case, such as in document clustering, where there are often many more documents than vocabulary words. Since the number of clusters in use for both the IB and DIB can only decrease from iteration to iteration (see footnote 10), we always initialized the cardinality of T to that of X.11 For the DIB, we initialized the cluster assignments to be as even across the clusters as possible; that is, each data point belonged to its own cluster. For the IB, we initialized using a soft version of the same procedure, with 75% of each conditional’s probability mass assigned to a unique cluster and the remaining 25% spread as a normalized uniform random vector over the remaining clusters.
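The generation procedure can be sketched as follows; the cardinalities and concentration parameters passed in are illustrative placeholders, not the values used in the experiments:

```python
import numpy as np

def make_joint(n_x, n_y, alpha_x, alpha_y, seed=0):
    """Sample a random discrete joint p(x, y) = p(x) p(y|x) via Dirichlets."""
    rng = np.random.default_rng(seed)
    # Marginal p(x) from a symmetric Dirichlet with concentration alpha_x
    p_x = rng.dirichlet(alpha_x * np.ones(n_x))
    # Each conditional p(y|x) from a symmetric Dirichlet with concentration alpha_y
    p_y_x = rng.dirichlet(alpha_y * np.ones(n_y), size=n_x).T  # shape (n_y, n_x)
    return p_x, p_y_x
```

Large alpha_x makes p(x) nearly uniform, while sweeping alpha_y from small to large moves the conditionals p(y|x) from highly peaked (informative) to nearly uniform (uninformative).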
An illustrative pair of solutions is shown in Figure 2. The key feature to note is that while performance of the IB and DIB solutions is very similar in the IB plane, their behavior differs drastically in the DIB plane.
Perhaps most unintuitive is the behavior of the IB solution in the DIB plane, where from an entropy perspective, the IB actually “decompresses” the data (i.e., H(T) > H(X)). To understand this behavior, recall that the IB’s compression term is the mutual information I(X;T). This term is minimized by any q(t|x) that maps ts independently of xs. Consider two extremes of such mappings. One is to map all values of x to a single value of t; this leads to H(T) = 0. The other is to map each value of x uniformly across all values of t; this leads to H(T) = log |T| and H(T|X) = log |T|. In our initial studies, the IB consistently took the latter approach.12 Since the DIB cost function favors the former approach (and indeed the DIB solution follows this approach), the IB consistently performs poorly by the DIB’s standards. This difference is especially apparent at small β, where the compression term matters most; as β increases, the DIB and IB solutions converge in the DIB plane.
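The two extremes can be checked numerically: both encoders achieve I(X;T) = 0, but their entropies H(T) differ maximally. A small self-contained demonstration (entropies in bits; names are ours):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, with 0 log 0 := 0."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mi_and_ht(q_t_x, p_x):
    """I(X;T) and H(T) for an encoder q(t|x) (shape |T| x |X|)."""
    q_t = q_t_x @ p_x
    h_t = entropy(q_t)
    h_t_x = float(np.sum(p_x * [entropy(q_t_x[:, i]) for i in range(p_x.size)]))
    return h_t - h_t_x, h_t  # I(X;T) = H(T) - H(T|X)

p_x = np.ones(4) / 4
one_cluster = np.zeros((4, 4)); one_cluster[0, :] = 1.0  # all x -> a single t
uniform = np.ones((4, 4)) / 4                            # each x spread over all t
print(mi_and_ht(one_cluster, p_x))  # I(X;T) = 0 and H(T) = 0
print(mi_and_ht(uniform, p_x))      # I(X;T) = 0 and H(T) = log2(4) = 2
```

The DIB cost function prefers the first encoder (minimal H(T)); the IB, which only penalizes I(X;T), is indifferent between the two.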
To encourage the IB to perform closer to DIB optimality at small β, we next altered our initialization scheme for q(t|x) to one biased toward single-cluster solutions; instead of each x having most of its probability mass on a unique cluster, we placed most of the probability mass for each x on the same cluster t*. The intended effect was to start the IB closer to solutions in which all data points were mapped to a single cluster. Results are shown in Figure 3. Here, p is the amount of probability mass placed on the shared cluster t*, that is, q(t*|x) = p; the probability mass for the remaining clusters was again initialized to a normalized uniform random vector. “Random” refers to an initialization that skips placing the mass p and just initializes each q(t|x) to a normalized uniform random vector.
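The biased initialization just described can be sketched as follows; the function name, the choice of cluster index, and the seed handling are our own illustrative conventions:

```python
import numpy as np

def single_cluster_init(n_t, n_x, mass, t_star=0, seed=0):
    """Encoder init with q(t*|x) = mass for every x; the remaining 1 - mass
    is spread over the other clusters as a normalized uniform random vector."""
    rng = np.random.default_rng(seed)
    q = rng.random((n_t, n_x))
    q[t_star, :] = 0.0
    q = (1.0 - mass) * q / q.sum(axis=0)  # remaining mass, random but normalized
    q[t_star, :] = mass                   # shared cluster gets the bulk of the mass
    return q
```

As mass approaches 1, this initialization starts the IB near the all-points-in-one-cluster solution; mass = 0 with a full random fill recovers the “random” scheme mentioned above.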
We note several features. First, although we can see a gradual shift of the IB toward DIB-like behavior in the DIB plane as p → 1, the IB solutions never quite reach the performance of the DIB. Second, as p → 1, the single-cluster initializations exhibit a phase transition in which, regardless of β, they “skip” over a sizable fraction of the low-I(T;Y) solutions that are picked up by the DIB. Third, even for higher-β solutions, the single-cluster initializations seem to perform suboptimally, not quite extracting all of the available relevant information I(X;Y), as the DIB and the multicluster initialization from the previous section do. This can be seen in both the IB and DIB planes.
To summarize, the IB and DIB perform similarly by the IB standards, but the DIB tends to outperform the IB dramatically by the DIB’s standards. Careful initialization of the IB can make up some of the difference, but not all.
It is also worth noting that across all the data sets we tested, the DIB also tended to converge faster, as illustrated in Figure 4. The DIB speedup over the IB varied depending on the convergence conditions. In our experiments, we defined convergence as the point at which the relative step-to-step change in the cost functional L fell below some threshold ε, that is, when (L(n−1) − L(n)) / L(n−1) < ε at step n. In the results above, ε was held fixed. In Figure 4, we vary ε, with the IB initialization scheme fixed to the original multicluster version, to show the effect on the relative speedup of the DIB over the IB. While the DIB remained approximately two to five times faster than the IB in all cases tested, the speedup tended to be more pronounced for smaller ε. Since the ideal convergence conditions will probably vary with data set size and complexity, it is difficult to draw general conclusions, though our experiments do at least suggest that the DIB offers a computational advantage over the IB.
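The convergence test just described can be written as a one-line predicate; the default threshold and the division-by-zero guard here are our choices, not values from the experiments:

```python
def converged(L_prev, L_curr, eps=1e-6):
    """True when the relative step-to-step change in the cost functional
    |L_prev - L_curr| / |L_prev| falls below the threshold eps."""
    return abs(L_prev - L_curr) / max(abs(L_prev), 1e-300) < eps
```

The main loop of either algorithm then simply iterates the update step until this predicate holds between consecutive cost values.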
5 Related Work
The DIB is not the first hard clustering version of IB.13 Indeed, the agglomerative information bottleneck (AIB) (Slonim & Tishby, 2000a) also produces a hard clustering and was introduced soon after the IB. Thus, it is important to distinguish between the two approaches. AIB is a bottom-up, greedy method that starts with all data points belonging to their own clusters and iteratively merges clusters in a way that maximizes the gain in relevant information. It was explicitly designed to produce a hard clustering. DIB is a top-down method derived from a cost function that was not designed to produce a hard clustering. Our starting point was to alter the IB cost function to match the source coding notion of compression. The emergence of hard clustering in DIB is itself a result. Thus, while AIB does provide a hard clustering version of IB, DIB contributes the following in addition. First, our study emphasizes why a stochastic encoder is optimal for IB, namely, due to the noise entropy term. Second, our study provides a principled, top-down derivation of a hard clustering version of IB, based on an intuitive change to the cost function. Third, our nontrivial derivation also provides a cost function and solution that interpolate between DIB and IB by adding back the noise entropy continuously, that is, with 0 < α < 1. This interpolation may be viewed as adding a regularization term to DIB, one that may perhaps be useful in dealing with finitely sampled data. Another interpretation of the cost function with intermediate α is as a penalty on both the mutual information I(X;T) and the entropy of the compression, H(T).
The original IB also provides a deterministic encoding upon taking the limit β → ∞, which corresponds to the causal-state partition of histories (Still, Crutchfield, & Ellison, 2010). However, this is the limit of no compression, whereas our approach allows for an entire family of deterministic encoders with varying degrees of compression.
Here we have introduced the deterministic information bottleneck (DIB) as an alternative to the information bottleneck (IB) for compression and clustering. We have argued that the DIB cost function better embodies the goal of lossy compression of relevant information and shown that it leads to a nontrivial deterministic version of the IB. We have compared the DIB and IB solutions on synthetic data and found that in our experiments, the DIB performs nearly identically to the IB in terms of the IB cost function, but it is far superior in terms of its own cost function. We also noted that the DIB achieved this performance at a computational efficiency two to five times better than the IB.
Of course, in addition to the studies with synthetic data here, it is important to compare the DIB and IB on real-world data sets as well to see whether the DIB’s apparent advantages hold—for example, with data sets that have a more explicit hierarchical structure for both algorithms to exploit, such as in topic modeling (Blei, Griffiths, Jordan, & Tenenbaum, 2003; Slonim & Weiss, 2002).
One particular application of interest is maximally informative clustering, where it would be interesting to know how IB and DIB relate to classic clustering algorithms such as k-means (Strouse & Schwab, 2017). Previous work has, for example, offered a principled way of choosing the number of clusters based on the finiteness of the data (Still & Bialek, 2004), and similarly interesting results may exist for the DIB. More generally, there are learning theory results showing generalization bounds on IB for which an analog on DIB would be interesting as well (Shamir, Sabato, & Tishby, 2010).
Another potential area of application is modeling the extraction of predictive information in the brain (which is one particular example in a long line of work on the exploitation of environmental statistics by the brain (Barlow, 1981, 2001a, 2001b; Atick & Redlich, 1992; Olshausen & Field, 1996, 1997, 2004; Simoncelli & Olshausen, 2001)). There, X would be the stimulus at time t, Y the stimulus a short time in the future, and T the activity of a population of sensory neurons. One could even consider neurons deeper in the brain by allowing X and Y to correspond not to an external stimulus but to the activity of upstream neurons. An analysis of this nature using retinal data was recently performed with the IB (Palmer et al., 2015). It would be interesting to see if the same data correspond better to the behavior of the DIB, particularly in the DIB plane, where the IB and DIB differ dramatically.
We close by noting that DIB is an imperfect name for the algorithm introduced here for a couple of reasons. First, there exist other deterministic limits and approximations to the IB (see, e.g., the discussion of the AIB in section 5), and so we hesitate to use the phrase “the” deterministic IB. Second, our motivation here was not to create a deterministic version of IB but instead to alter the cost function in a way that better encapsulates the goals of certain problems in data analysis. Thus, the deterministic nature of the solution was a result, not a goal. For this reason, “entropic bottleneck” might also be an appropriate name.
Appendix: Derivation of Generalized IB Solution
For insightful discussions, we thank Richard Turner, Máté Lengyel, Bill Bialek, Stephanie Palmer, Gordon Berman, Zack Nichols, and Spotify NYC’s Paradox Squad. We also acknowledge financial support from NIH K25 GM098875 (Schwab), the Hertz Foundation (Strouse), and the Department of Energy Computational Sciences Graduate Fellowship (Strouse).
Note that the IB problem setting differs significantly from that of channel coding, however. In channel coding, the channel is fixed, and we are free to vary the input distribution, while in IB, the input distribution is fixed, and we are free to vary the channel.
Implicit in the summation here, we have assumed that X, T, and Y are discrete. We keep this assumption throughout for notational convenience, but note that the IB generalizes naturally to continuous X, T, and Y by simply replacing the sums with integrals (see, e.g., Chechik, Globerson, Tishby, & Weiss, 2005).
For those unfamiliar with it, mutual information is a very general measure of how related two variables are. Classic correlation measures typically assume a certain form of the relationship between two variables, say, linear, whereas mutual information is agnostic as to the details of the relationship. One intuitive picture comes from the entropy decomposition: I(X;Y) = H(X) − H(X|Y) = H(Y) − H(Y|X). Since entropy measures uncertainty, mutual information measures the reduction in uncertainty in one variable when observing the other. Moreover, it is symmetric (I(X;Y) = I(Y;X)), so the information is mutual. Another intuitive picture comes from the KL divergence form: I(X;Y) = DKL[p(x, y) ‖ p(x)p(y)]. Since the KL divergence measures the distance between two probability distributions, the mutual information quantifies how far the relationship between X and Y is from a probabilistically independent one, that is, how far the joint p(x, y) is from the factorized p(x)p(y). A very nice summary of mutual information as a dependence measure is included in Kinney and Atwal (2014).
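The two pictures agree numerically: the entropy decomposition H(X) − H(X|Y) equals the KL divergence between the joint and the product of marginals. A small check (our own helper names, entropies in bits):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits, with 0 log 0 := 0."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(p_xy):
    """I(X;Y) as D_KL[p(x,y) || p(x)p(y)], in bits."""
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] *
                        np.log2(p_xy[mask] / np.outer(p_x, p_y)[mask])))

p_xy = np.array([[0.3, 0.1],
                 [0.1, 0.5]])
h_x = entropy(p_xy.sum(axis=1))
# H(X|Y) = sum_y p(y) H(X | Y = y)
h_x_given_y = sum(p_xy[:, j].sum() * entropy(p_xy[:, j] / p_xy[:, j].sum())
                  for j in range(p_xy.shape[1]))
assert abs(mutual_information(p_xy) - (h_x - h_x_given_y)) < 1e-12
```

The symmetry property also follows immediately, since transposing the joint leaves the KL form unchanged.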
For readers familiar with rate-distortion theory, equation 2.2 can be viewed as the solution to a rate-distortion problem with the distortion measure given by the KL divergence term in the exponent.
More explicitly, our cost function also implicitly includes a Lagrange multiplier term enforcing the normalization of q(t|x), and this is where the partition function Z(x, β) comes into the equation. See the appendix for details.
Note that if at step n, no xs are assigned to a particular t (i.e., q(t) = 0), then q(t|x) = 0 for all x at step n + 1. That is, no xs will ever again be assigned to that t (due to the q(t) factor in the update for q(t|x)). In other words, the number of ts “in use” can only decrease during the iterative algorithm (or remain constant). Thus, it seems plausible that the solution will not depend on the cardinality of T, provided it is chosen to be large enough.
When you take the variational derivative of the DIB cost functional with respect to q(t|x) and set it to zero, you get no explicit q(t|x) term, and it is therefore not obvious how to solve these equations. We cannot rule out that such an approach is possible, of course; we have just taken a different route here.
Note that for α < 1, we cannot allow T to be continuous, since H(T) can become infinitely negative, and the optimal solution in that case will trivially be a delta function over a single value of t for all x, across all values of β. This is in contrast to the IB, which can handle continuous T. In any case, we continue to assume discrete X, T, and Y for convenience.
Note a subtlety here: we cannot claim that the α → 0 limit of the generalized solution is the solution to the α = 0 problem, for although the generalized cost function and solution both approach their DIB counterparts in that limit, the solution of the limit need not be equal to the limit of the solution. It would, however, be surprising if that were not the case.
As with the IB, the DIB has the property that once a cluster goes unused, it will not be brought back into use in future steps. That is, if q(t) = 0 at step n, then no x is assigned to t at step n + 1, and thus q(t) = 0 remains. So once again, one should conservatively choose the cardinality of T to be “large enough”; for both the IB and DIB, we chose to set it equal to the cardinality of X.
An even more efficient setting might be to set the cardinality of T based on the entropy of X, say, |T| ≈ 2^H(X), but we did not experiment with this.
Intuitively, this approach is more random and is therefore easier to stumble on during optimization.
In fact, even the IB itself produces a hard clustering in the large β limit. However, it trivially assigns all data points to their own clusters.