## Abstract

Lossy compression and clustering fundamentally involve a decision about which features are relevant and which are not. The information bottleneck method (IB) by Tishby, Pereira, and Bialek (1999) formalized this notion as an information-theoretic optimization problem and proposed an optimal trade-off between throwing away as many bits as possible and selectively keeping those that are most important. In the IB, compression is measured by mutual information. Here, we introduce an alternative formulation that replaces mutual information with entropy, which we call the deterministic information bottleneck (DIB) and argue better captures this notion of compression. As suggested by its name, the solution to the DIB problem turns out to be a deterministic encoder, or hard clustering, as opposed to the stochastic encoder, or soft clustering, that is optimal under the IB. We compare the IB and DIB on synthetic data, showing that the IB and DIB perform similarly in terms of the IB cost function, but that the DIB significantly outperforms the IB in terms of the DIB cost function. We also empirically find that the DIB offers a considerable gain in computational efficiency over the IB, over a range of convergence parameters. Our derivation of the DIB also suggests a method for continuously interpolating between the soft clustering of the IB and the hard clustering of the DIB.

## 1  Introduction

Compression is a ubiquitous task for humans and machines alike (Cover & Thomas, 2006; MacKay, 2002). For example, machines must turn the large pixel grids of color that form pictures into small files capable of being shared quickly on the web (Wallace, 1991), humans must compress the vast stream of ongoing sensory information they receive into small changes in the brain that form memories (Kandel, Schwartz, Jessell, Siegelbaum, & Hudspeth, 2013), and data scientists must turn large amounts of high-dimensional and messy data into more manageable and interpretable clusters (MacKay, 2002).

Lossy compression involves an implicit decision about what is relevant and what is not (Cover & Thomas, 2006; MacKay, 2002). In the example of image compression, the algorithms we use deem some features essential to representing the subject matter well, and others are thrown away. In the example of human memory, our brains deem some details important enough to warrant attention, and others are forgotten. And in the example of data clustering, information about some features is preserved in the mapping from data point to cluster ID, while information about others is discarded.

In many cases, the criterion for “relevance” can be described as information about some other variable(s) of interest. Let $X$ denote the signal we are compressing, $T$ the compressed version, $Y$ the other variable of interest, and $I(T;Y)$ the “information” that $T$ has about $Y$ (we formally define this later). For human memory, $X$ is past sensory input, $T$ the brain’s internal representation (e.g., the activity of a neural population, or the strengths of a set of synapses), and $Y$ the features of the future environment that the brain is interested in predicting, such as extrapolating the position of a moving object. Thus, $I(T;Y)$ represents the predictive power of the memories formed (Palmer, Marre, Berry, & Bialek, 2015). For data clustering, $X$ is the original data, $T$ is the cluster ID, and $Y$ is the target for prediction (e.g., purchasing or ad-clicking behavior in a user segmentation problem). In summary, a good compression algorithm can be described as a trade-off between the compression of a signal and the selective maintenance of the relevant bits that help predict another signal.

This problem was formalized as the information bottleneck (IB) by Tishby, Pereira, and Bialek (1999). Their formulation involved an information-theoretic optimization problem and resulted in an iterative soft clustering algorithm guaranteed to converge to a local (though not necessarily global) optimum. In their cost functional, compression was measured by the mutual information $I(X;T)$. This compression metric has its origins in rate-distortion theory and channel coding, where $I(X;T)$ represents the maximal information transfer rate, or capacity, of the communication channel between $X$ and $T$ (Cover & Thomas, 2006).<sup>1</sup> While this approach has its applications, often one is more interested in directly restricting the amount of resources required to represent $T$, quantified by the entropy $H(T)$. This latter notion comes from the source coding literature and implies a restriction on the representational cost of $T$ (Cover & Thomas, 2006). In the case of human memory, for example, $H(T)$ would roughly correspond to the number of neurons or synapses required to represent or store a sensory signal $X$. In the case of data clustering, $H(T)$ is related to the number of clusters.

In this letter, we introduce an alternative formulation of the IB, the deterministic information bottleneck (DIB), replacing the compression measure $I(X;T)$ with $H(T)$, thus emphasizing constraints on representation rather than communication. Using a clever generalization of both cost functionals, we derive an iterative solution to the DIB, which turns out to provide a hard clustering, or deterministic mapping from $X$ to $T$, as opposed to the soft clustering, or probabilistic mapping, that IB provides. Finally, we compare the IB and DIB solutions on synthetic data to help illustrate their differences.

## 2  The Original Information Bottleneck

Given the joint distribution $p(x,y)$, the encoding distribution $q(t|x)$ is obtained through the following information bottleneck (IB) optimization problem:

$$\min_{q(t|x)} L[q(t|x)] = I(X;T) - \beta I(T;Y), \tag{2.1}$$

subject to the Markov constraint $T \leftrightarrow X \leftrightarrow Y$. Here $I(X;T)$ denotes the mutual information between $X$ and $T$, that is, $I(X;T) \equiv \sum_{x,t} p(x,t) \log \frac{p(x,t)}{p(x)\,p(t)} = D_{\mathrm{KL}}[p(x,t) \,\|\, p(x)\,p(t)]$,<sup>2</sup> where $D_{\mathrm{KL}}$ denotes the Kullback-Leibler divergence.<sup>3</sup> The first term in the cost function is meant to encourage compression and the second, relevance. $\beta$ is a nonnegative free parameter representing the relative importance of compression and relevance, and our solution will be a function of it. The Markov constraint simply enforces the probabilistic graphical structure of the task; the compressed representation $T$ is a (possibly stochastic) function of $X$ and can get information about $Y$ only through $X$. Note that we are using $p$ to denote distributions that are given and fixed, and $q$ to denote distributions that we are free to change and are being updated throughout the optimization process.

Through a standard application of variational calculus (see the appendix for a detailed derivation of the solution to a more general problem introduced below), one finds the formal solution<sup>4</sup>

$$q(t|x) = \frac{q(t)}{Z(x,\beta)} \exp\!\left[-\beta\, D_{\mathrm{KL}}[p(y|x) \,\|\, q(y|t)]\right], \tag{2.2}$$

$$q(y|t) = \frac{1}{q(t)} \sum_x q(t|x)\, p(x,y), \tag{2.3}$$

where $Z(x,\beta) \equiv \exp\!\left[-\frac{\lambda(x)}{p(x)} - \beta \sum_y p(y|x) \log \frac{p(y|x)}{p(y)}\right]$ is a normalization factor and $\lambda(x)$ is a Lagrange multiplier that enters enforcing normalization of $q(t|x)$.<sup>5</sup> To get an intuition for this solution, it is useful to take a clustering perspective. Since we are compressing $X$ into $T$, many $x$ will be mapped to the same $t$, and so we can think of the IB as clustering $x$s into their cluster labels $t$. The solution $q(t|x)$ is then likely to map $x$ to $t$ when $D_{\mathrm{KL}}[p(y|x) \,\|\, q(y|t)]$ is small or, in other words, when the distributions $p(y|x)$ and $q(y|t)$ are similar. These distributions are similar to the extent that $x$ and $t$ provide similar information about $y$. In summary, inputs $x$ get mapped to clusters $t$ that maintain information about $y$, as was desired.

This solution is formal because the first equation depends on the second and vice versa. However, Tishby et al. (1999) showed that an iterative approach can be built on the above equations, which provably converges to a local optimum of the IB cost function (see equation 2.1).

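Starting with some initial distributions $q^{(0)}(t|x)$, $q^{(0)}(t)$, and $q^{(0)}(y|t)$, the $n$th update is given by<sup>6</sup>

$$q^{(n)}(t|x) = \frac{q^{(n-1)}(t)}{Z^{(n)}(x,\beta)} \exp\!\left[-\beta\, D_{\mathrm{KL}}[p(y|x) \,\|\, q^{(n-1)}(y|t)]\right], \tag{2.4}$$

$$q^{(n)}(t) = \sum_x p(x)\, q^{(n)}(t|x), \tag{2.5}$$

$$q^{(n)}(y|t) = \frac{1}{q^{(n)}(t)} \sum_x q^{(n)}(t|x)\, p(x,y). \tag{2.6}$$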

Note that the first equation, 2.4, is the only “meaty” one; the other two, 2.5 and 2.6, are just there to enforce consistency with the laws of probability (e.g., that marginals are related to joints as they should be). In principle, with no proof of convergence to a global optimum, it might be possible for the solution obtained to vary with the initialization, but in practice, the cost function is smooth enough that this does not seem to happen. This algorithm is summarized in algorithm 1. Note that while the general solution is iterative, there is at least one known case in which an analytic solution is possible: when $X$ and $Y$ are jointly gaussian (Chechik, Globerson, Tishby, & Weiss, 2005).
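The update equations 2.4 to 2.6 can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names `kl` and `ib_step`, the nested-list representation, and the assumption that every entry of the joint distribution is strictly positive (so that all KL divergences are finite) are ours.

```python
import math

def kl(p, q):
    """D_KL[p || q] in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def ib_step(pxy, qt_x, beta):
    """One IB update. pxy[x][y] is the fixed joint (assumed strictly
    positive); qt_x[x][t] is the current encoder q(t|x)."""
    nx, ny, nt = len(pxy), len(pxy[0]), len(qt_x[0])
    px = [sum(row) for row in pxy]
    py_x = [[pxy[x][y] / px[x] for y in range(ny)] for x in range(nx)]
    # Marginal q(t) and decoder q(y|t) implied by the current encoder
    # (equations 2.5 and 2.6).
    qt = [sum(px[x] * qt_x[x][t] for x in range(nx)) for t in range(nt)]
    qy_t = [[sum(qt_x[x][t] * pxy[x][y] for x in range(nx)) / qt[t]
             if qt[t] > 0 else 0.0 for y in range(ny)] for t in range(nt)]
    new_encoder = []
    for x in range(nx):
        # Log of the unnormalized weight q(t) * exp(-beta * KL) from
        # equation 2.4; clusters with q(t) = 0 stay empty.
        logw = [math.log(qt[t]) - beta * kl(py_x[x], qy_t[t])
                if qt[t] > 0 else -math.inf for t in range(nt)]
        m = max(logw)
        w = [math.exp(v - m) for v in logw]  # subtract the max for stability
        z = sum(w)                           # normalizes over t for this x
        new_encoder.append([wi / z for wi in w])
    return new_encoder
```

Repeating `qt_x = ib_step(pxy, qt_x, beta)` until the encoder stops changing yields one point on the IB curve for the chosen `beta`.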

In summary, given the joint distribution $p(x,y)$, the IB method extracts a compressive encoder $q(t|x)$ that selectively maintains the bits from $X$ that are informative about $Y$. As the encoder is a function of the free parameter $\beta$, we can visualize the entire family of solutions on a curve (see Figure 1), showing the trade-off between compression (on the $x$-axis) and relevance (on the $y$-axis), with $\beta$ as an implicitly varying parameter. For small $\beta$, compression is more important than prediction, and we find ourselves at the bottom left of the curve in the high-compression, low-prediction regime. As $\beta$ increases, prediction becomes more important relative to compression, and we see that both $I(X;T)$ and $I(T;Y)$ increase. At some point, $I(T;Y)$ saturates, because there is no more information about $Y$ that can be extracted from $X$ (either because $I(T;Y)$ has reached $I(X;Y)$ or because $T$ has too small cardinality). In this regime, the encoder will approach the trivially deterministic solution of mapping each $x$ to its own cluster. At any point on the curve, the slope is equal to $\beta^{-1}$, which we can read off directly from the cost functional. Note that the region below the curve is shaded because this area is feasible; for suboptimal $q(t|x)$, solutions will lie in this region. Optimal solutions will, of course, lie on the curve, and no solutions will lie above the curve.

Figure 1:

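An illustrative IB curve. $I(T;Y)$ is the relevance term from equation 2.1; $I(X;T)$ is the compression term. $I(X;Y)$ is an upper bound on $I(T;Y)$ since $T$ gets its information about $Y$ only via $X$. $\log(|T|)$, where $|T|$ is the cardinality of the compression variable, is a bound on $I(X;T)$ since $I(X;T) = H(T) - H(T|X) \le H(T) \le \log(|T|)$.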

Additional work on the IB has highlighted its relationship with maximum likelihood on a multinomial mixture model (Slonim & Weiss, 2002) and canonical correlation analysis (Creutzig, Globerson, & Tishby, 2009)—and therefore linear gaussian models (Bach & Jordan, 2006) and slow feature analysis (Turner & Sahani, 2007). Applications have included speech recognition (Hecht & Tishby, 2005; Hecht, Noor, & Tishby, 2009), topic modeling (Slonim & Tishby, 2000b, 2001; Bekkerman, El-Yaniv, Tishby, & Winter, 2001, 2003), and neural coding (Schneidman, Slonim, Tishby, de Ruyter van Steveninck, & Bialek, 2001; Palmer, Marre, Berry, & Bialek, 2015). Most recently, the IB has even been proposed as a method for benchmarking the performance of deep neural networks (Tishby & Zaslavsky, 2015).

## 3  The Deterministic Information Bottleneck

Our motivation for introducing an alternative formulation of the information bottleneck is rooted in the compression term of the IB cost function; there, the minimization of the mutual information $I(X;T)$ represents compression. As discussed above, this measure of compression comes from the channel coding literature and implies a restriction on the communication cost between $X$ and $T$. Here, we are interested in the source coding notion of compression, which implies a restriction on the representational cost of $T$. For example, in neuroscience, there is a long history of work on redundancy reduction in the brain in the form of minimizing $H(T)$ (Barlow, 1981, 2001a, 2001b).

Let us call the original IB cost function $L_{\mathrm{IB}}$, that is, $L_{\mathrm{IB}} \equiv I(X;T) - \beta I(T;Y)$. We now introduce the deterministic information bottleneck (DIB) cost function,

$$L_{\mathrm{DIB}}[q(t|x)] \equiv H(T) - \beta I(T;Y), \tag{3.1}$$

which is to be minimized over $q(t|x)$, and subject to the same Markov constraint as the original formulation, equation 2.1. The motivation for the “deterministic” in its name will become clear in a moment.
To see the distinction between the two cost functions, note that

$$L_{\mathrm{IB}} - L_{\mathrm{DIB}} = I(X;T) - H(T) \tag{3.2}$$

$$\phantom{L_{\mathrm{IB}} - L_{\mathrm{DIB}}} = -H(T|X), \tag{3.3}$$

where we have used the decomposition of the mutual information $I(X;T) = H(T) - H(T|X)$. $H(T|X)$ is sometimes called the noise entropy and measures the stochasticity in the mapping from $X$ to $T$. Since we are minimizing these cost functions, this means that the IB cost function encourages stochasticity in the encoding distribution $q(t|x)$ relative to the DIB cost function. In fact, we will see that by removing this encouragement of stochasticity, the DIB problem actually produces a deterministic encoding distribution, that is, an encoding function, hence, the “deterministic” in its name.
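Naively taking the same variational calculus approach as for the IB problem, one cannot solve the DIB problem.<sup>7</sup> To make this problem tractable, we are going to define a family of cost functions of which the IB and DIB cost functions are limiting cases. That family, indexed by $\alpha$, is defined as<sup>8</sup>

$$L_\alpha \equiv H(T) - \alpha H(T|X) - \beta I(T;Y). \tag{3.4}$$

Clearly, $L_{\mathrm{IB}} = L_1$. However, instead of looking at $L_{\mathrm{DIB}}$ as the $\alpha = 0$ case $L_0$, we define the DIB solution $q_{\mathrm{DIB}}(t|x)$ as the $\alpha \to 0$ limit of the solution to the generalized problem $q_\alpha(t|x)$:<sup>9</sup>

$$q_{\mathrm{DIB}}(t|x) \equiv \lim_{\alpha \to 0} q_\alpha(t|x). \tag{3.5}$$

Taking the variational calculus approach to minimizing $L_\alpha$ (under the Markov constraint), we get the following solution for the encoding distribution (see the appendix for the derivation and explicit form of the normalization factor $Z(x,\alpha,\beta)$):

$$q_\alpha(t|x) = \frac{1}{Z(x,\alpha,\beta)} \exp\!\left[\frac{1}{\alpha}\Big(\log q_\alpha(t) - \beta\, D_{\mathrm{KL}}[p(y|x) \,\|\, q_\alpha(y|t)]\Big)\right], \tag{3.6}$$

$$q_\alpha(y|t) = \frac{1}{q_\alpha(t)} \sum_x q_\alpha(t|x)\, p(x,y). \tag{3.7}$$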

Note that the last equation is just equation 2.3, since this follows from the Markov constraint. With $\alpha = 1$, we can see that the first equation becomes the IB solution from equation 2.2, as should be the case.

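Before we take the $\alpha \to 0$ limit, note that we can now write a generalized IB iterative algorithm (indexed by $\alpha$) that includes the original as a special case ($\alpha = 1$):

$$q_\alpha^{(n)}(t|x) = \frac{1}{Z(x,\alpha,\beta)} \exp\!\left[\frac{1}{\alpha}\Big(\log q_\alpha^{(n-1)}(t) - \beta\, D_{\mathrm{KL}}[p(y|x) \,\|\, q_\alpha^{(n-1)}(y|t)]\Big)\right], \tag{3.8}$$

$$q_\alpha^{(n)}(t) = \sum_x p(x)\, q_\alpha^{(n)}(t|x), \tag{3.9}$$

$$q_\alpha^{(n)}(y|t) = \frac{1}{q_\alpha^{(n)}(t)} \sum_x q_\alpha^{(n)}(t|x)\, p(x,y). \tag{3.10}$$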

This generalized algorithm can be used in its own right; however, we will not discuss that option further here.
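To make the role of $\alpha$ concrete, the encoder update of equation 3.8 for a single input can be read as a softmax with temperature $\alpha$. The sketch below is illustrative only: the function name `generalized_encoder_row` and its arguments (a list of $\log q(t)$ values and a list of precomputed KL divergences for one fixed $x$) are hypothetical choices, not part of the original algorithm listings.

```python
import math

def generalized_encoder_row(log_qt, d, alpha, beta):
    """q_alpha(t|x) for one fixed x.

    log_qt[t] is log q(t); d[t] is D_KL[p(y|x) || q(y|t)] (both assumed
    precomputed). Returns the softmax over t of
    (log q(t) - beta * d[t]) / alpha.
    """
    scores = [(lq - beta * dt) / alpha for lq, dt in zip(log_qt, d)]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]  # subtract the max for stability
    z = sum(w)                             # normalizes over t for this x
    return [wi / z for wi in w]
```

With `alpha=1` the row is proportional to $q(t)\exp[-\beta d(t)]$, the IB form; as `alpha` shrinks, the row concentrates on the cluster with the largest score.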

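For now, we take the limit $\alpha \to 0$ and see that something interesting happens with $q_\alpha(t|x)$: the argument of the exponential begins to blow up. For a fixed $x$, this means that $q(t|x)$ will collapse into a delta function at the value of $t$ which maximizes $\log q(t) - \beta\, D_{\mathrm{KL}}[p(y|x) \,\|\, q(y|t)]$, that is,

$$\lim_{\alpha \to 0} q_\alpha(t|x) = \delta\big(t - f(x)\big), \tag{3.11}$$

where

$$f(x) = t^* \equiv \arg\max_t \Big(\log q(t) - \beta\, D_{\mathrm{KL}}[p(y|x) \,\|\, q(y|t)]\Big). \tag{3.12}$$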

So, as anticipated, the solution to the DIB problem is a deterministic encoding distribution. The $\log q(t)$ term above encourages that we use as few values of $t$ as possible, via a “rich-get-richer” scheme that assigns each $x$ preferentially to a $t$ with many $x$s already assigned to it. The KL divergence term, as in the original IB problem, just makes sure we pick $t$s which retain as much information from $x$ about $y$ as possible. The parameter $\beta$, as in the original problem, controls the trade-off between how much we value compression and prediction.

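Also as in the original problem, the solution above is only a formal solution, since equation 3.6 depends on equation 3.7 and vice versa. We will again need to take an iterative approach; in analogy to the IB case, we repeat the following updates to convergence (from some initialization):<sup>10</sup>

$$d^{(n-1)}(x,t) \equiv D_{\mathrm{KL}}[p(y|x) \,\|\, q^{(n-1)}(y|t)], \tag{3.13}$$

$$\ell_\beta^{(n-1)}(x,t) \equiv \log q^{(n-1)}(t) - \beta\, d^{(n-1)}(x,t), \tag{3.14}$$

$$f^{(n)}(x) = \arg\max_t\, \ell_\beta^{(n-1)}(x,t), \tag{3.15}$$

$$q^{(n)}(t|x) = \delta\big(t - f^{(n)}(x)\big), \tag{3.16}$$

$$q^{(n)}(t) = \sum_x q^{(n)}(t|x)\, p(x) = \sum_{x:\, f^{(n)}(x) = t} p(x), \tag{3.17}$$

$$q^{(n)}(y|t) = \frac{1}{q^{(n)}(t)} \sum_x q^{(n)}(t|x)\, p(x,y) = \frac{\sum_{x:\, f^{(n)}(x) = t} p(x,y)}{\sum_{x:\, f^{(n)}(x) = t} p(x)}. \tag{3.18}$$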

This process is summarized in algorithm 2.
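The hard-assignment updates above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `kl` and `dib_step`, the representation of the encoder as a list `f` of cluster indices, the tie-breaking rule (lowest index wins), and the requirement that the joint distribution be strictly positive are all ours.

```python
import math

def kl(p, q):
    """D_KL[p || q] in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def dib_step(pxy, f, n_clusters, beta):
    """One DIB update. pxy[x][y] is the fixed joint (assumed strictly
    positive); f[x] is the current cluster index assigned to x."""
    nx, ny = len(pxy), len(pxy[0])
    px = [sum(row) for row in pxy]
    py_x = [[pxy[x][y] / px[x] for y in range(ny)] for x in range(nx)]
    # q(t) and the unnormalized q(y, t) under the hard assignment f.
    qt = [0.0] * n_clusters
    qty = [[0.0] * ny for _ in range(n_clusters)]
    for x in range(nx):
        qt[f[x]] += px[x]
        for y in range(ny):
            qty[f[x]][y] += pxy[x][y]
    new_f = []
    for x in range(nx):
        best_t, best_score = None, -math.inf
        for t in range(n_clusters):
            if qt[t] == 0:
                continue  # empty clusters have log q(t) = -inf
            qy_t = [v / qt[t] for v in qty[t]]
            # Score: log q(t) - beta * KL[p(y|x) || q(y|t)].
            score = math.log(qt[t]) - beta * kl(py_x[x], qy_t)
            if score > best_score:  # ties go to the lowest cluster index
                best_t, best_score = t, score
        new_f.append(best_t)
    return new_f
```

Because empty clusters are skipped, a cluster that loses all of its members is never reused, so the number of clusters in use can only stay the same or shrink from one call to the next.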

Note that the DIB algorithm also corresponds to “clamping” IB at every step by assigning each $x$ to its highest-probability cluster $t^*$. We can see this by taking the $\arg\max$ of the logarithm of $q(t|x)$ in equation 2.2, noting that the $\arg\max$ of a positive function is equivalent to the $\arg\max$ of its logarithm, discarding the $\log Z(x,\beta)$ term since it does not depend on $t$, and seeing that the result corresponds to equation 3.12. We emphasize, however, that this is not the same as simply running the IB algorithm to convergence and then clamping the resulting encoder; that would, in most cases, produce a suboptimal, “unconverged” deterministic solution.

As with the IB, the DIB solutions can be plotted as a function of $\beta$. However, in this case, it is more natural to plot $I(T;Y)$ as a function of $H(T)$ rather than $I(X;T)$. That said, in order to compare the IB and DIB, they need to be plotted in the same plane. When plotting in the $I(X;T)$ plane, the DIB curve will of course lie below the IB curve, since in this plane, the IB curve is optimal; the opposite will be true when plotting in the $H(T)$ plane. Comparisons with experimental data can be performed in either plane. A Python implementation of algorithms 1 and 2, as well as tools for generating synthetic data and the analysis of results, is available at https://github.com/djstrouse/information-bottleneck.

## 4  Comparison of IB and DIB

To get an idea of how the IB and DIB solutions differ in practice, we generated a series of random joint distributions $p(x,y)$, solved for the IB and DIB solutions for each, and compared them in both the IB and DIB plane. To generate the $p(x,y)$, we first sampled $p(x)$ from a symmetric Dirichlet distribution with concentration parameter $\alpha_x$ (so $p(x) \sim \mathrm{Dir}[\alpha_x]$) and then sampled each row of $p(y|x)$ from another symmetric Dirichlet distribution with concentration parameter $\alpha_y$ (so $p(y|x) \sim \mathrm{Dir}[\alpha_y]$). In the experiments shown here, we set $\alpha_x$ to 1000, so that each $x$ was approximately equally likely, and we set $\alpha_y$ to be equally spaced logarithmically between and in order to provide a range of informativeness in the conditionals. We set the cardinalities of $X$ and $Y$ to and , with $|X| > |Y|$ for two reasons. First, this encourages overlap between the conditionals $p(y|x)$, which leads to a more interesting clustering problem. Second, in typical applications, this will be the case, such as in document clustering, where there are often many more documents than vocabulary words. Since the number of clusters in use for both IB and DIB can only decrease from iteration to iteration (see footnote 10), we always initialized $|T| = |X|$.<sup>11</sup> For the DIB, we initialized the cluster assignments to be as even across the clusters as possible; that is, each data point belonged to its own cluster. For IB, we initialized using a soft version of the same procedure, with 75% of each conditional’s probability mass assigned to a unique cluster and the remaining 25% a normalized uniform random vector over the remaining clusters.
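The sampling procedure described above can be sketched with the Python standard library alone. This is an illustrative reconstruction, not the authors' data generator: the function names `sample_dirichlet` and `sample_joint` are hypothetical, and any cardinalities or second concentration parameter passed in are placeholders rather than the settings used in the experiments. It relies on the standard fact that normalizing independent Gamma($\alpha$, 1) draws yields a symmetric Dirichlet sample.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Symmetric Dirichlet sample of length k, obtained by normalizing
    k independent Gamma(alpha, 1) draws."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(g)
    return [v / total for v in g]

def sample_joint(nx, ny, alpha_x, alpha_y, seed=0):
    """Random joint p(x, y) as a nested list pxy[x][y].

    p(x) is drawn from a symmetric Dirichlet with concentration alpha_x,
    and each row p(y|x) from one with concentration alpha_y.
    """
    rng = random.Random(seed)
    px = sample_dirichlet(alpha_x, nx, rng)
    return [[px[x] * v for v in sample_dirichlet(alpha_y, ny, rng)]
            for x in range(nx)]
```

A large `alpha_x` makes $p(x)$ close to uniform, while a small `alpha_y` produces peaked, more informative conditionals and a large `alpha_y` produces flatter ones.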

An illustrative pair of solutions is shown in Figure 2. The key feature to note is that while performance of the IB and DIB solutions is very similar in the IB plane, their behavior differs drastically in the DIB plane.

Figure 2:

Example IB and DIB solutions. (Left) IB plane. (Right) DIB plane. The upper limit of the $y$-axes is $I(X;Y)$, since this is the maximal possible value of $I(T;Y)$. The upper limit of the $x$-axes is $\log(|T|)$, since this is the maximal possible value of $H(T)$ and $I(X;T)$ (the latter being true since $I(X;T)$ is bounded above by both $H(T)$ and $H(X)$, and $H(T) \le \log(|T|)$). The dashed vertical lines mark $H(X)$, which is both an upper bound for $I(X;T)$ and a natural comparison for $H(T)$ (since to place each data point in its own cluster, we need $H(T) = H(X)$).

Perhaps most unintuitive is the behavior of the IB solution in the DIB plane, where from an entropy perspective, the IB actually “decompresses” the data (i.e., $H(T) > H(X)$). To understand this behavior, recall that the IB’s compression term is the mutual information $I(X;T)$. This term is minimized by any $q(t|x)$ that maps $t$s independent of $x$s. Consider two extremes of such mappings. One is to map all values of $X$ to a single value of $T$; this leads to $H(T) = H(T|X) = I(X;T) = 0$. The other is to map each value of $X$ uniformly across all values of $T$; this leads to $H(T) = H(T|X) = \log|T|$ and $I(X;T) = 0$. In our initial studies, the IB consistently took the latter approach.<sup>12</sup> Since the DIB cost function favors the former approach (and indeed the DIB solution follows this approach), the IB consistently performs poorly by the DIB’s standards. This difference is especially apparent at small $\beta$, where the compression term matters most, and as $\beta$ increases, the DIB and IB solutions converge in the DIB plane.
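The two extreme encoders just described can be checked numerically. The sketch below is a small illustration with assumed names (`entropy`, `encoder_stats`) and an assumed toy setting of four equally likely inputs and four clusters; it computes $H(T)$, $H(T|X)$, and their difference $I(X;T)$ in nats.

```python
import math

def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute 0."""
    return -sum(v * math.log(v) for v in p if v > 0)

def encoder_stats(px, qt_x):
    """Return (H(T), H(T|X), I(X;T)) for input marginal px[x] and
    encoder qt_x[x][t]."""
    nx, nt = len(px), len(qt_x[0])
    qt = [sum(px[x] * qt_x[x][t] for x in range(nx)) for t in range(nt)]
    h_t = entropy(qt)
    h_t_given_x = sum(px[x] * entropy(qt_x[x]) for x in range(nx))
    return h_t, h_t_given_x, h_t - h_t_given_x

# Toy setting (assumed): four equally likely inputs, four clusters.
px = [0.25] * 4
single = [[1.0, 0.0, 0.0, 0.0] for _ in px]  # every x sent to one cluster
uniform = [[0.25] * 4 for _ in px]           # every x spread over all clusters

stats_single = encoder_stats(px, single)
stats_uniform = encoder_stats(px, uniform)
```

Both encoders give $I(X;T) = 0$, so the IB compression term cannot tell them apart, whereas $H(T)$ is zero for `single` and $\log 4$ for `uniform`, which is why the DIB cost prefers the former.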

To encourage the IB to perform closer to DIB optimality at small $\beta$, we next altered our initialization scheme of $q(t|x)$ to one biased toward single-cluster solutions; instead of each $x$ having most of its probability mass on a unique cluster $t$, we placed most of the probability mass for each $x$ on the same cluster $t^*$. The intended effect was to start the IB closer to solutions in which all data points were mapped to a single cluster. Results are shown in Figure 3. Here, $p_0$ is the amount of probability mass placed on the cluster $t^*$, that is, $q(t^*|x) = p_0$ for all $x$; the probability mass for the remaining clusters was again initialized to a normalized uniform random vector. “Random” refers to an initialization that skips placing the $p_0$ mass and just initializes each $q(t|x)$ to a normalized uniform random vector.

Figure 3:

Example IB and DIB solutions across different IB initializations. Details identical to Figure 2, except colors represent different initializations for the IB, as described in the text.

We note several features. First, although we can see a gradual shift of the IB toward DIB-like behavior in the DIB plane as $p_0 \to 1$, the IB solutions never quite reach the performance of DIB. Second, as $p_0 \to 1$, the single-cluster initializations exhibit a phase transition in which, regardless of $\beta$, they “skip” over a sizable fraction of lower-$I(T;Y)$ solutions that are picked up by DIB. Third, even for higher-$I(T;Y)$ solutions, the single-cluster initializations seem to perform suboptimally, not quite extracting all of the information $I(X;Y)$, as DIB and the multicluster initialization from the previous section do. This can be seen in both the IB and DIB plane.

To summarize, the IB and DIB perform similarly by the IB standards, but the DIB tends to outperform the IB dramatically by the DIB’s standards. Careful initialization of the IB can make up some of the difference, but not all.

It is also worth noting that across all the data sets we tested, the DIB also tended to converge faster, as illustrated in Figure 4. The DIB speedup over IB varied depending on the convergence conditions. In our experiments, we defined convergence as when the relative step-to-step change in the cost functional $L$ was smaller than some threshold $\mathrm{ctol}$, that is, when $\left(L^{(n-1)} - L^{(n)}\right)/L^{(n-1)} < \mathrm{ctol}$ at step $n$. In the results above, we used . In Figure 4, we vary $\mathrm{ctol}$, with the IB initialization scheme fixed to the original multicluster version, to show the effect on the relative speedup of DIB over IB. While DIB remained approximately two to five times faster than IB in all cases tested, that speedup tended to be more pronounced with lower $\mathrm{ctol}$. Since the ideal convergence conditions would probably vary by data set size and complexity, it is difficult to make any general conclusions, though our experiments do at least suggest that DIB offers a computational advantage over IB.
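A stopping rule of this kind can be sketched as a small driver loop. The code below is a generic illustration, not the authors' implementation: the names `has_converged` and `run_until_converged`, the use of absolute values in the relative change, the handling of a zero cost, and the `max_iter` cap are all assumptions made for the example.

```python
def has_converged(prev_cost, cost, tol):
    """True when the relative step-to-step change in the cost is below tol.
    Absolute values are used here so the test also works for negative costs."""
    if prev_cost == 0:
        return cost == 0
    return abs(prev_cost - cost) / abs(prev_cost) < tol

def run_until_converged(step, cost, state, tol, max_iter=1000):
    """Apply `step` repeatedly until the relative change in `cost(state)`
    drops below `tol`. Returns the final state and the number of steps taken."""
    prev = cost(state)
    for n in range(1, max_iter + 1):
        state = step(state)
        cur = cost(state)
        if has_converged(prev, cur, tol):
            return state, n
        prev = cur
    return state, max_iter
```

Here `step` would be one IB or DIB update and `cost` the corresponding cost functional; a looser tolerance stops both algorithms earlier, so timing comparisons depend on the value chosen.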

Figure 4:

Fit times for IB and DIB. Cumulative distribution function of fit times across $\beta$ for a variety of settings of the convergence tolerance. Note that absolute numbers here depend on hardware, so we emphasize only relative comparisons of IB versus DIB. Note also that across the range of tolerance values we tested here, the (D)IB curves vary by less than the width of the data points, so we omit them.

## 5  Related Work

The DIB is not the first hard clustering version of IB.<sup>13</sup> Indeed, the agglomerative information bottleneck (AIB) (Slonim & Tishby, 2000a) also produces a hard clustering and was introduced soon after the IB. Thus, it is important to distinguish between the two approaches. AIB is a bottom-up, greedy method that starts with all data points belonging to their own clusters and iteratively merges clusters in a way that maximizes the gain in relevant information. It was explicitly designed to produce a hard clustering. DIB is a top-down method derived from a cost function that was not designed to produce a hard clustering. Our starting point was to alter the IB cost function to match the source coding notion of compression. The emergence of hard clustering in DIB is itself a result. Thus, while AIB does provide a hard clustering version of IB, DIB contributes the following in addition. First, our study emphasizes why a stochastic encoder is optimal for IB, namely, due to the noise entropy term. Second, our study provides a principled, top-down derivation of a hard clustering version of IB, based on an intuitive change to the cost function. Third, our nontrivial derivation also provides a cost function and solution that interpolates between DIB and IB by adding back the noise entropy continuously, that is, with $0 < \alpha < 1$. This interpolation may be viewed as adding a regularization term to DIB, one that may perhaps be useful in dealing with finitely sampled data. Another interpretation of the cost function with intermediate $\alpha$ is as a penalty on both the mutual information between $X$ and $T$ and the entropy of the compression, $H(T)$.

The original IB also provides a deterministic encoding upon taking the limit $\beta \to \infty$ that corresponds to the causal-state partition of histories (Still, Crutchfield, & Ellison, 2010). However, this is the limit of no compression, whereas our approach allows for an entire family of deterministic encoders with varying degrees of compression.

## 6  Discussion

Here we have introduced the deterministic information bottleneck (DIB) as an alternative to the information bottleneck (IB) for compression and clustering. We have argued that the DIB cost function better embodies the goal of lossy compression of relevant information and shown that it leads to a nontrivial deterministic version of the IB. We have compared the DIB and IB solutions on synthetic data and found that in our experiments, the DIB performs nearly identically to the IB in terms of the IB cost function, but it is far superior in terms of its own cost function. We also noted that the DIB achieved this performance at a computational efficiency two to five times better than the IB.

Of course, in addition to the studies with synthetic data here, it is important to compare the DIB and IB on real-world data sets as well to see whether the DIB’s apparent advantages hold, for example, with data sets that have a more explicit hierarchical structure for both algorithms to exploit, such as in topic modeling (Blei, Griffiths, Jordan, & Tenenbaum, 2003; Slonim & Weiss, 2002).

One particular application of interest is maximally informative clustering, where it would be interesting to know how IB and DIB relate to classic clustering algorithms such as $k$-means (Strouse & Schwab, 2017). Previous work has, for example, offered a principled way of choosing the number of clusters based on the finiteness of the data (Still & Bialek, 2004), and similarly interesting results may exist for the DIB. More generally, there are learning theory results showing generalization bounds on IB for which an analog on DIB would be interesting as well (Shamir, Sabato, & Tishby, 2010).

Another potential area of application is modeling the extraction of predictive information in the brain (which is one particular example in a long line of work on the exploitation of environmental statistics by the brain (Barlow, 1981, 2001a, 2001b; Atick & Redlich, 1992; Olshausen & Field, 1996, 1997, 2004; Simoncelli & Olshausen, 2001)). There, $X$ would be the stimulus at time $t$, $Y$ the stimulus a short time in the future $t + \tau$, and $T$ the activity of a population of sensory neurons. One could even consider neurons deeper in the brain by allowing $X$ and $Y$ to correspond not to an external stimulus but to the activity of upstream neurons. An analysis of this nature using retinal data was recently performed with the IB (Palmer et al., 2015). It would be interesting to see if the same data correspond better to the behavior of the DIB, particularly in the DIB plane, where the IB and DIB differ dramatically.

We close by noting that DIB is an imperfect name for the algorithm introduced here for a couple of reasons. First, there exist other deterministic limits and approximations to the IB (see, e.g., the discussion of the AIB in section 5), and so we hesitate to use the phrase “the” deterministic IB. Second, our motivation here was not to create a deterministic version of IB but instead to alter the cost function in a way that better encapsulates the goals of certain problems in data analysis. Thus, the deterministic nature of the solution was a result, not a goal. For this reason, “entropic bottleneck” might also be an appropriate name.

### Appendix:  Derivation of Generalized IB Solution

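Given $p(x,y)$ and subject to the Markov constraint $T \leftrightarrow X \leftrightarrow Y$, the generalized IB problem is

$$\min_{q(t|x)} L[q(t|x)] = H(T) - \alpha H(T|X) - \beta I(T;Y) - \sum_{x,t} \lambda(x)\, q(t|x), \tag{A.1}$$

where we have now included the Lagrange multiplier term (which enforces normalization of $q(t|x)$) explicitly. The Markov constraint implies the following factorizations,

$$q(t|y) = \sum_x q(t|x)\, p(x|y), \tag{A.2}$$

$$q(t) = \sum_x q(t|x)\, p(x), \tag{A.3}$$

which give us the following useful derivatives:

$$\frac{\delta q(t|y)}{\delta q(t|x)} = p(x|y), \tag{A.4}$$

$$\frac{\delta q(t)}{\delta q(t|x)} = p(x). \tag{A.5}$$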
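Now, taking the derivative of the cost function with respect to the encoding distribution, we get:

$$\begin{aligned}
\frac{\delta L}{\delta q(t|x)} &= -\frac{\delta}{\delta q(t|x)} \sum_t q(t) \log q(t) + \alpha \frac{\delta}{\delta q(t|x)} \sum_{x,t} p(x)\, q(t|x) \log q(t|x) \\
&\quad - \beta \frac{\delta}{\delta q(t|x)} \sum_{y,t} p(y)\, q(t|y) \log \frac{q(t|y)}{q(t)} - \lambda(x) \\
&= -p(x)\left[\log q(t) + 1\right] + \alpha\, p(x)\left[\log q(t|x) + 1\right] \\
&\quad - \beta\, p(x) \sum_y p(y|x) \log \frac{q(t|y)}{q(t)} - \lambda(x) \\
&= p(x)\left[-\log q(t) - 1 + \alpha \log q(t|x) + \alpha - \beta \sum_y p(y|x) \log \frac{q(t|y)}{q(t)} - \frac{\lambda(x)}{p(x)}\right].
\end{aligned} \tag{A.6–A.17}$$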
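Setting this to zero implies that

$$\alpha \log q(t|x) = \log q(t) + 1 - \alpha + \frac{\lambda(x)}{p(x)} + \beta \sum_y p(y|x) \log \frac{q(t|y)}{q(t)}. \tag{A.18}$$

We want to rewrite the $\beta$ term as a KL divergence. First, we need that $\log \frac{q(t|y)}{q(t)} = \log \frac{q(y|t)}{p(y)}$. Second, we add and subtract $\beta \sum_y p(y|x) \log \frac{p(y|x)}{p(y)}$. This gives us

$$\alpha \log q(t|x) = \log q(t) + 1 - \alpha + \frac{\lambda(x)}{p(x)} + \beta \sum_y p(y|x) \log \frac{p(y|x)}{p(y)} \tag{A.19}$$

$$\phantom{\alpha \log q(t|x) =} - \beta \sum_y p(y|x) \log \frac{p(y|x)}{q(y|t)}. \tag{A.20}$$

The second $\beta$ term is now just $D_{\mathrm{KL}}[p(y|x) \,\|\, q(y|t)]$. Dividing both sides by $\alpha$, this leaves us with

$$\log q(t|x) = z(x,\alpha,\beta) + \frac{1}{\alpha} \log q(t) - \frac{\beta}{\alpha} D_{\mathrm{KL}}[p(y|x) \,\|\, q(y|t)], \tag{A.21}$$

where we have absorbed all of the terms that do not depend on $t$ into a single factor:

$$z(x,\alpha,\beta) \equiv \frac{1}{\alpha} - 1 + \frac{\lambda(x)}{\alpha\, p(x)} + \frac{\beta}{\alpha} \sum_y p(y|x) \log \frac{p(y|x)}{p(y)}. \tag{A.22}$$

Solving for $q(t|x)$, we get

$$q(t|x) = \exp[z] \exp\!\left[\frac{1}{\alpha}\Big(\log q(t) - \beta\, D_{\mathrm{KL}}[p(y|x) \,\|\, q(y|t)]\Big)\right] \tag{A.23}$$

$$\phantom{q(t|x)} = \frac{1}{Z} \exp\!\left[\frac{1}{\alpha}\Big(\log q(t) - \beta\, D_{\mathrm{KL}}[p(y|x) \,\|\, q(y|t)]\Big)\right], \tag{A.24}$$

where

$$Z(x,\alpha,\beta) \equiv \exp[-z] \tag{A.25}$$

is just a normalization factor. Now that we are done with the general derivation, we add a subscript to the solution to distinguish it from the special cases of the IB and DIB:

$$q_\alpha(t|x) = \frac{1}{Z(x,\alpha,\beta)} \exp\!\left[\frac{1}{\alpha}\Big(\log q_\alpha(t) - \beta\, D_{\mathrm{KL}}[p(y|x) \,\|\, q_\alpha(y|t)]\Big)\right]. \tag{A.26}$$

The IB solution is then

$$q_{\mathrm{IB}}(t|x) = q_{\alpha=1}(t|x) = \frac{q(t)}{Z} \exp\!\left[-\beta\, D_{\mathrm{KL}}[p(y|x) \,\|\, q(y|t)]\right], \tag{A.27}$$

while the DIB solution is

$$q_{\mathrm{DIB}}(t|x) = \lim_{\alpha \to 0} q_\alpha(t|x) = \delta\big(t - t^*(x)\big), \tag{A.28}$$

with

$$t^*(x) = \arg\max_t \Big(\log q(t) - \beta\, D_{\mathrm{KL}}[p(y|x) \,\|\, q(y|t)]\Big). \tag{A.29}$$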

## Acknowledgments

For insightful discussions, we thank Richard Turner, Máté Lengyel, Bill Bialek, Stephanie Palmer, Gordon Berman, Zack Nichols, and Spotify NYC’s Paradox Squad. We also acknowledge financial support from NIH K25 GM098875 (Schwab), the Hertz Foundation (Strouse), and the Department of Energy Computational Sciences Graduate Fellowship (Strouse).

## References

Atick
,
J. J.
, &
Redlich
,
A. N.
(
1992
).
What does the retina know about natural scenes?
Neural Computation
,
4
(
2
),
196
210
.
Bach
,
F. R.
, &
Jordan
,
M. I.
(
2006
).
A probabilistic interpretation of canonical correlation analysis (Technical Report)
.
Berkeley
:
University of California, Berkeley
.
Barlow
,
H. B.
(
1981
).
Critical limiting factors in the design of the eye and visual cortex
.
Proceedings of the Royal Society of London B: Biological Sciences
,
212
(
1186
),
1
34
.
Barlow
,
H. B.
(
2001a
).
Redundancy reduction revisited
.
Network Computation in Neural Systems
,
12
(
3
),
241
253
.
Barlow
,
H. B.
(
2001b
).
The exploitation of regularities in the environment by the brain
.
Behavioral and Brain Sciences
,
24
(
4
),
602
607
.
Bekkerman
,
R.
,
El-Yaniv
,
R.
,
Tishby
,
N.
, &
Winter
,
Y.
(
2001
).
On feature distributional clustering for text categorization
. In
Proceedings of the 24th Annual ACM SIGR Conference on Research and Development in Information Retrieval
(pp.
146
153
).
New York
:
ACM
.
Bekkerman
,
R.
,
El-Yaniv
,
R.
,
Tishby
,
N.
, &
Winter
,
Y.
(
2003
).
Distributional word clusters vs. words for text categorization
.
Journal of Machine Learning Research
,
3
,
1183
1208
.
Blei
,
D. M.
,
Griffiths
,
T. L.
,
Jordan
,
M. I.
, &
Tenenbaum
,
J. B.
(
2003
).
Hierarchical topic models and the nested Chinese restaurant process
. In
S.
Thrun
,
L. K.
Saul
, &
B.
Schölkopf
(Eds.),
Advances in neural information processing systems, 16
(pp.
17
24
).
Cambridge, MA
:
MIT Press
.
Chechik
,
G.
,
Globerson
,
A.
,
Tishby
,
N.
, &
Weiss
,
Y.
(
2005
).
Information bottleneck for gaussian variables
.
Journal of Machine Learning Research
,
6
(
1532–4435
),
165
188
.
Cover, T. M., & Thomas, J. A. (2006). Elements of information theory. New York: Wiley-Interscience.

Creutzig, F., Globerson, A., & Tishby, N. (2009). Past-future information bottleneck in dynamical systems. Physical Review E, 79(4), 041925–5.

Hecht, R. M., Noor, E., & Tishby, N. (2009). Speaker recognition by Gaussian information bottleneck. In Proceedings of the Annual Conference of the International Speech Communication Association.

Hecht, R. M., & Tishby, N. (2005). Extraction of relevant speech features using the information bottleneck method. In Proceedings of the Annual Conference of the International Speech Communication Association.

Kandel, E. R., Schwartz, J. H., Jessell, T. M., Siegelbaum, S. A., & Hudspeth, A. J. (2013). Principles of neural science (5th ed.). New York: McGraw-Hill.

Kinney, J. B., & Atwal, G. S. (2014). Equitability, mutual information, and maximal information coefficient. Proceedings of the National Academy of Sciences, 111, 3354–3359.

MacKay, D. J. C. (2002). Information theory, inference and learning algorithms. Cambridge: Cambridge University Press.

Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583), 607–609.

Olshausen, B. A., & Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23), 3311–3325.

Olshausen, B. A., & Field, D. J. (2004). Sparse coding of sensory inputs. Current Opinion in Neurobiology, 14(4), 481–487.

Palmer, S. E., Marre, O., Berry II, M. J., & Bialek, W. (2015). Predictive information in a sensory population. Proceedings of the National Academy of Sciences, 112(22), 6908–6913.

Schneidman, E., Slonim, N., Tishby, N., de Ruyter van Steveninck, R. R., & Bialek, W. (2001). Analyzing neural codes using the information bottleneck method. In S. Becker, S. Thrun, & K.-R. Müller (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.

Shamir, O., Sabato, S., & Tishby, N. (2010). Learning and generalization with the information bottleneck. Theoretical Computer Science, 411(29–30), 2696–2711.

Simoncelli, E. P., & Olshausen, B. A. (2001). Natural image statistics and neural representation. Annual Review of Neuroscience, 24(1), 1193–1216.

Slonim, N., & Tishby, N. (2000a). Agglomerative information bottleneck. In S. A. Solla, T. K. Leen, & K.-R. Müller (Eds.), Advances in neural information processing systems (pp. 617–623). Cambridge, MA: MIT Press.

Slonim, N., & Tishby, N. (2000b). Document clustering using word clusters via the information bottleneck method. In Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 208–215). New York: ACM.

Slonim, N., & Tishby, N. (2001). The power of word clusters for text classification. In Proceedings of the European Colloquium on IR Research.

Slonim, N., & Weiss, Y. (2002). Maximum likelihood and the information bottleneck. In S. Becker, S. Thrun, & K. Obermayer (Eds.), Advances in neural information processing systems, 15. Cambridge, MA: MIT Press.

Still, S., & Bialek, W. (2004). How many clusters? An information-theoretic perspective. Neural Computation, 16(12), 2483–2506.

Still, S., Crutchfield, J. P., & Ellison, C. J. (2010). Optimal causal inference: Estimating stored information and approximating causal architecture. Chaos, 20.

Strouse, D., & Schwab, D. J. (2017). On the relationship between distributional and geometric clustering. Unpublished manuscript.

Tishby, N., Pereira, F. C., & Bialek, W. (1999). The information bottleneck method. In Proceedings of the 37th Allerton Conference on Communication, Control, and Computing (pp. 368–377).

Tishby, N., & Zaslavsky, N. (2015). Deep learning and the information bottleneck principle. arXiv.org.

Turner, R. E., & Sahani, M. (2007). A maximum-likelihood interpretation for slow feature analysis. Neural Computation, 19(4), 1022–1038.

Wallace, G. K. (1991). The JPEG still picture compression standard. Communications of the ACM, 34(4), 30–44.

## Notes

1

Note that the IB problem setting differs significantly from that of channel coding, however. In channel coding, the channel is fixed, and we are free to vary the input distribution, while in IB, the input distribution is fixed, and we are free to vary the channel.

2

Implicit in the summation here, we have assumed that $X$, $Y$, and $T$ are discrete. We keep this assumption throughout for convenience of notation, but note that the IB generalizes naturally to continuous $X$, $Y$, and $T$ by simply replacing the sums with integrals (see, e.g., Chechik, Globerson, Tishby, & Weiss, 2005).

3

For those unfamiliar with it, mutual information is a very general measure of how related two variables are. Classic correlation measures typically assume a certain form of the relationship between two variables, say, linear, whereas mutual information is agnostic as to the details of the relationship. One intuitive picture comes from the entropy decomposition: $I(X;Y) = H(X) - H(X \mid Y)$. Since entropy measures uncertainty, mutual information measures the reduction in uncertainty in one variable when observing the other. Moreover, it is symmetric ($I(X;Y) = I(Y;X)$), so the information is mutual. Another intuitive picture comes from the $D_{\mathrm{KL}}$ form: $I(X;Y) = D_{\mathrm{KL}}\left[p(x,y) \,\|\, p(x)\,p(y)\right]$. Since $D_{\mathrm{KL}}$ measures the distance between two probability distributions, the mutual information quantifies how far the relationship between $X$ and $Y$ is from a probabilistically independent one, that is, how far the joint $p(x,y)$ is from the factorized $p(x)\,p(y)$. A very nice summary of mutual information as a dependence measure is included in Kinney and Atwal (2014).
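
The two pictures of mutual information in this note can be checked numerically. The following is a minimal sketch in Python with NumPy, not code from the paper: the function names and the toy joint distribution are illustrative assumptions, and logarithms are taken base 2 so that the result is in bits. The entropy form is computed as $H(X) + H(Y) - H(X,Y)$, which equals $H(X) - H(X \mid Y)$.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability array (zero entries are skipped)."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

def mi_entropy_form(pxy):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), which equals H(X) - H(X|Y)."""
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1)  # marginal p(x)
    py = pxy.sum(axis=0)  # marginal p(y)
    return entropy(px) + entropy(py) - entropy(pxy)

def mi_kl_form(pxy):
    """I(X;Y) = KL[p(x,y) || p(x) p(y)], summed over entries with p(x,y) > 0."""
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    factorized = px * py  # the independent model p(x) p(y)
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log2(pxy[mask] / factorized[mask])))

# Hypothetical toy joint distribution over two binary variables (rows: x, columns: y).
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
mi_entropy = mi_entropy_form(pxy)
mi_kl = mi_kl_form(pxy)
mi_swapped = mi_kl_form(pxy.T)  # roles of X and Y exchanged

# A joint that factorizes exactly should carry no mutual information.
independent = np.outer([0.5, 0.5], [0.3, 0.7])
mi_independent = mi_kl_form(independent)
```

In this sketch, `mi_entropy` and `mi_kl` agree up to floating-point error, `mi_swapped` matches them (symmetry), and `mi_independent` is numerically zero.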

4

For readers familiar with rate-distortion theory, equation 2.2 can be viewed as the solution to a rate-distortion problem with the distortion measure given by the KL divergence term in the exponent.

5

More explicitly, our cost function $L$ also implicitly includes a term $\lambda(x)\left[1 - \sum_t q(t \mid x)\right]$, and this is where $\lambda(x)$ comes into the equation. See section 8 for details.

6

Note that if at step , no s are assigned to a particular (i.e., ), then . That is, no s will ever again be assigned to (due to the factor in ). In other words, the number of s “in use” can only decrease during the iterative algorithm (or remain constant). Thus, it seems plausible that the solution will not depend on the cardinality of , provided it is chosen to be large enough.
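
The claim in this note, that a cluster which goes unused stays unused, can be illustrated with a small numerical sketch. It assumes the standard self-consistent IB update of Tishby, Pereira, and Bialek (1999), $q(t \mid x) \propto q(t)\,\exp\{-\beta\, D_{\mathrm{KL}}[p(y \mid x) \,\|\, q(y \mid t)]\}$. The function name `ib_update`, the toy distributions, and the choice $\beta = 5$ are illustrative assumptions, and $p(y \mid x)$ is taken strictly positive so that every KL term is finite.

```python
import numpy as np

def ib_update(q_t_given_x, p_x, p_y_given_x, beta):
    """One step of q(t|x) ∝ q(t) exp(-beta * KL[p(y|x) || q(y|t)]).

    Assumes every entry of p_y_given_x is strictly positive.
    """
    n_t = q_t_given_x.shape[1]
    n_y = p_y_given_x.shape[1]
    q_t = p_x @ q_t_given_x  # q(t) = sum_x p(x) q(t|x)
    # Joint q(t, y) = sum_x q(t|x) p(x) p(y|x), shape (n_t, n_y).
    q_ty = (q_t_given_x * p_x[:, None]).T @ p_y_given_x
    # q(y|t) for clusters in use; unused clusters get a uniform placeholder,
    # which never matters because their weight q(t) is exactly zero.
    q_y_given_t = np.full((n_t, n_y), 1.0 / n_y)
    used = q_t > 0
    q_y_given_t[used] = q_ty[used] / q_t[used][:, None]
    # kl[x, t] = sum_y p(y|x) log( p(y|x) / q(y|t) )
    log_ratio = np.log(p_y_given_x[:, None, :] / q_y_given_t[None, :, :])
    kl = np.sum(p_y_given_x[:, None, :] * log_ratio, axis=2)
    unnormalized = q_t[None, :] * np.exp(-beta * kl)
    return unnormalized / unnormalized.sum(axis=1, keepdims=True)

# Hypothetical toy problem: 4 values of x, 2 values of y, 3 candidate clusters.
p_x = np.full(4, 0.25)
p_y_given_x = np.array([[0.9, 0.1],
                        [0.8, 0.2],
                        [0.2, 0.8],
                        [0.1, 0.9]])
# The third cluster starts with no mass from any x.
q = np.array([[0.7, 0.3, 0.0],
              [0.6, 0.4, 0.0],
              [0.2, 0.8, 0.0],
              [0.5, 0.5, 0.0]])
for _ in range(20):
    q = ib_update(q, p_x, p_y_given_x, beta=5.0)

q_final = q
clusters_in_use = int(np.sum(p_x @ q_final > 0))
```

Because the unnormalized weight of the empty cluster is $q(t) = 0$ times a finite exponential, its column in `q_final` remains exactly zero after every iteration, so `clusters_in_use` cannot exceed the two clusters that started with mass.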

7

When you take the variational derivative of $L_{\mathrm{DIB}}$ with respect to $q(t \mid x)$ and set it to zero, you get no explicit $q(t \mid x)$ term, and it is therefore not obvious how to solve these equations. We cannot rule out that this approach is possible, of course; we have just taken a different route here.

8

Note that for $\alpha < 1$, we cannot allow $T$ to be continuous since $H(T)$ can become infinitely negative, and the optimal solution in that case will trivially be a delta function over a single value of $T$ for all $X$, across all values of $\beta$. This is in contrast to the IB, which can handle continuous $T$. In any case, we continue to assume discrete $X$, $Y$, and $T$ for convenience.

9

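Note a subtlety here: we cannot claim that $q_{\mathrm{DIB}}$ is the solution to $L_{\mathrm{DIB}}$, for although $L_{\mathrm{DIB}} = \lim_{\alpha \to 0} L_\alpha$ and $q_{\mathrm{DIB}} = \lim_{\alpha \to 0} q_\alpha$, the solution of the limit need not be equal to the limit of the solution. It would, however, be surprising if that were not the case.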

10

As with the IB, the DIB has the property that once a cluster goes unused, it will not be brought back into use in future steps. That is, if , then and . So once again, one should conservatively choose the cardinality of to be “large enough”; for both the IB and DIB, we chose to set it equal to the cardinality of .

11

An even more efficient setting might be to set the cardinality of based on the entropy of , say, , but we did not experiment with this.

12

Intuitively, this approach is more random and is therefore easier to stumble on during optimization.

13

In fact, even the IB itself produces a hard clustering in the large $\beta$ limit. However, it trivially assigns all data points to their own clusters.