Abstract

Nonnegative matrix factorization (NMF) is primarily a linear dimensionality-reduction technique that factorizes a nonnegative data matrix into two smaller nonnegative matrices: one representing the basis of the new subspace and the other holding the coefficients of all the data points in that new space. In principle, the nonnegativity constraint forces the representation to be sparse and parts based. Instead of extracting holistic features from the data, real parts are extracted that should be significantly easier to interpret and analyze. The size of the new subspace determines how many features will be extracted from the data. An effective choice should minimize the noise while extracting the key features. We propose a mechanism for selecting the subspace size using a minimum description length technique. We demonstrate that our technique provides plausible estimates for real data as well as accurately predicting the known size of synthetic data. We provide a Matlab implementation of our method.

1  Introduction

1.1  Nonnegative Matrix Factorization

Consider a data matrix $V$ with $d$ dimensions and $n$ data points that has only nonnegative elements. If we define two matrices, also with only nonnegative elements, $W$ (of size $d \times K$) and $H$ (of size $K \times n$), then nonnegative matrix factorization (NMF) can reduce the dimensionality of $V$ through the approximation
$$V \approx WH, \tag{1.1}$$
where, generally, $K < d$ and $K < n$.

The columns of $W$ make up the new basis directions of the $K$ dimensions we are projecting onto. Each column of $H$ represents the coefficients of the corresponding data point in this new subspace.

There has been a considerable increase in interest in NMF since the publication of the seminal work by Lee and Seung (1999), in part because NMF naturally tends to produce a sparse, parts-based representation of the data. This representation is in contrast to other dimensionality-reduction techniques, such as principal component analysis, which tend to produce a holistic representation. The parts should represent features of the data; NMF can therefore represent the data as an additive combination of extracted features. This representation may be considerably more interpretable than more holistic approaches.

There is a range of algorithms for conducting NMF, most of them involving the minimization of an objective function such as
$$\min_{W, H \geq 0} \; \|V - WH\|_F^2 .$$
Other objective functions, such as the Kullback-Leibler divergence, are also options. All the popular algorithms rely on a prechosen value of $K$, the size of the new subspace. In NMF, the size of the subspace has real meaning: it selects the number of features extracted. If $K$ is chosen too low, we are likely to miss features, and if $K$ is chosen too large, then we will probably model noise. A good choice of $K$ therefore reduces the noise in the data while effectively modeling the key features.
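As a concrete illustration of this objective and the role of $K$, the following minimal Matlab sketch implements the classic multiplicative updates for the Frobenius objective; the function name, random initialization, and fixed iteration count are our own illustrative choices rather than the authors' released code.

    function [W, H] = nmf_frobenius(V, K, maxIter)
    % Minimal sketch of NMF via multiplicative updates, minimizing
    % ||V - W*H||_F^2. V is a nonnegative d x n data matrix; K is the
    % subspace size (number of extracted features).
        [d, n] = size(V);
        W = rand(d, K);                                    % random nonnegative start
        H = rand(K, n);
        for it = 1:maxIter
            H = H .* (W' * V) ./ max(W' * (W * H), eps);   % update coefficients
            W = W .* (V * H') ./ max((W * H) * H', eps);   % update basis vectors
        end
    end

For example, [W, H] = nmf_frobenius(V, 49, 500) would extract 49 features from V.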

1.2  Rank Selection in Nonnegative Matrix Factorization

In a recent review, Gillis (2014) put forward three main methods for selecting $K$: use of expert insight, trial and error, and use of singular value decomposition. Expert insights are invaluable but suffer for three main reasons: (1) there may be no expert capable of selecting a good choice of $K$; (2) the expert may select incorrectly; and (3) even if an expert is able to select $K$ effectively, independent confirmation is useful to add weight to the expert opinion.

Trial and error in this context means trying different values of $K$ and then manually selecting the one that best fits the aim of the researcher for that particular application. This method suffers because it is hard to know what a "good" solution looks like. Trial and error can be dangerous in that it allows researchers to tune their results in a manner that produces the solution best for their work, so that a "good" solution becomes the solution that confirms their hypothesis.

Singular value decomposition is applied by selecting $K$ at the point where the singular values become "small." The challenge is that unless there is a clear fall toward zero, the choice of where the values become small is very difficult to make.
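As a sketch of this heuristic (the variable names are ours), one might simply inspect the singular-value spectrum:

    % Sketch of the SVD heuristic: inspect the singular values of the d x n
    % data matrix V and look for the point at which they become "small."
    s = svd(V, 'econ');            % singular values in decreasing order
    figure; semilogy(s, 'o-');     % a log scale makes any sharp fall easier to see
    xlabel('Index'); ylabel('Singular value');
    % Unless there is a clear fall toward zero, choosing the cutoff by eye is difficult.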

Several more involved methods have been proposed for the selection of the rank. Examples include the use of cross-validation (Kanagal & Sindhwani, 2010; Owen & Perry, 2009) and the use of Stein's unbiased risk estimator (Ulfarsson & Solo, 2013). Cross-validation, in particular, is a common technique across supervised learning for assessing the quality of a model. In NMF, an unsupervised model, cross-validation essentially requires the imputation of missing data. There are different techniques for achieving this, and Kanagal and Sindhwani (2010) showed that these different techniques can produce significantly different estimates of $K$ or sometimes no estimate at all.

There is also an approach to NMF by Blei, Cook, and Hoffman (2010) using a Bayesian formulation that offers the benefit of selecting $K$ while finding $W$ and $H$. They impose a prior belief that the rank should be small and from there find a solution that fulfills this prior, but this requires domain expertise to determine the choice of a good prior. Our aim is to offer our approach as an additional method to help guide a choice of $K$ for researchers using NMF, in particular when little or no domain knowledge is available.

1.3  Approach and Contribution

Our approach is to use a minimum description length (MDL) technique to find the best trade-off between a low $K$, which misses key features, and a high $K$, which models noise. We suggest a pair of methods for applying MDL to NMF to assess the best choice of $K$. Our algorithms, which are available in Matlab format, allow the estimation of the best value of $K$ and can produce a range of graphs that can be used to analyze the quality of the estimation.

In the next section, we introduce the background and theory behind MDL. We then propose our method for finding the MDL. In section 3, we apply our MDL technique to real and synthetic data, demonstrating the validity of the technique. Finally, in section 4, we discuss the results we obtained and explain why we believe our technique is a useful addition to the NMF toolbox.

2  Minimum Description Length

2.1  Background and Theory

Minimum description length (MDL) is a method for selecting between models of varying complexity. At its core is the idea that the best model is the one that compresses the data most effectively. Because the best way of compressing the data would also involve the smallest transmission cost when sending an encoded message, compressing the data and transmitting the shortest message are essentially equivalent.

In the NMF case, the message is the matrix $V$, which is approximated using $WH$. The model is simple when $K$ is small and, consequently, $W$ and $H$ have few elements that are cheap to encode. However, with a small $K$, the approximation is likely to be poor, requiring an addition to the message to correct the poor approximation. The MDL principle is to choose the model that minimizes the total message length (Wallace & Boulton, 1968). By trading off the complexity and accuracy of the model, we hope to find the level of complexity that minimizes the transmission of noise while maximizing the transmission of real features.

There needs to be preagreement between the message transmitter and receiver about the level of precision, $\epsilon$, to which the data, $V$, should be set. The message must be communicated to this agreed precision. This means the message will consist of the model, $WH$, and corrections to the model to reproduce the original data matrix exactly. Therefore the message length, $L$, consists of two parts (MacKay, 2003),
$$L = L(\mathcal{H}) + L(D \mid \mathcal{H}),$$
where $L(\mathcal{H})$ is the length of the hypothesis, or the complexity of the model, and $L(D \mid \mathcal{H})$ encodes the accuracy of the model. More complex models tend to have a larger $L(\mathcal{H})$ and a smaller $L(D \mid \mathcal{H})$.

Two important points should be made here. First, we are not interested in how to optimally encode the message; we are interested only in the message length itself. Second, for model selection, we are interested only in the relative length of each message. Any additional pieces of information required in the message that are consistent across all the different values of $K$ are irrelevant in MDL because they will increase the total description length by a constant amount and make no difference to the location of the minimum. The only terms that matter are those that differ across different values of $K$. In other words, we are not interested in the message itself or the absolute cost of encoding the message, but in the relative cost of sending the message at different values of $K$.

2.2  Proposed MDL Algorithm

To perform MDL to assess an appropriate subspace size for NMF, we first must specify the components of $L(\mathcal{H})$ and $L(D \mid \mathcal{H})$. The encoded length of the hypothesis, or the complexity of the model, is $L(\mathcal{H}) = L(W) + L(H)$, where $L(W)$ and $L(H)$ are the lengths of the messages required to encode the matrices $W$ and $H$, respectively. The term $L(D \mid \mathcal{H})$ is the length of the correction required to ensure that $V$ can be reproduced exactly (to prespecified precision) and is the encoded length of the matrix of errors, $E$, where $E = V - WH$. Implementation of MDL then requires the estimation of the minimum length of code that would allow the three matrices to be encoded into a message. When $K$ is small, the matrices $W$ and $H$ are small and so cost relatively little to encode, but the error matrix, $E$, is large and therefore expensive to encode. As we increase $K$, the errors reduce, so the cost of transmitting $E$ falls, but the cost of transmitting the model, $W$ and $H$, increases. At some point, there should be a $K$ value at which the total length is minimized. This is then the MDL and gives us a choice for $K$.
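In code, the procedure amounts to a simple search over candidate values of $K$. The Matlab sketch below assumes a hypothetical helper, mdl_length (one possible implementation is sketched at the end of this section), together with the illustrative NMF solver from section 1.1; none of these names come from the authors' released implementation.

    % Sketch of the MDL rank-selection loop. V and epsilon (the data precision)
    % are assumed to be in the workspace; mdl_length and nmf_frobenius are the
    % illustrative helpers sketched elsewhere in this letter.
    Ks = 1:200;                                   % candidate subspace sizes
    L  = zeros(size(Ks));
    for i = 1:numel(Ks)
        [W, H] = nmf_frobenius(V, Ks(i), 500);    % any NMF solver could be used here
        L(i)   = mdl_length(V, W, H, epsilon);    % total description length at this K
    end
    [~, best] = min(L);
    Kbest = Ks(best);                             % K minimizing the description length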

The principle of MDL relies on the use of the best possible encoding of the data, that is, the encoding with the lowest cost. An upper bound on potential encodings can be estimated by considering the information content of each element. In general, any value that occurs multiple times is cheap to encode. To understand why, assume that the values in the error matrix are gaussian distributed; then many elements will fall into a range that is close to the mean, which can be assigned a short code. Any element far from the mean will require a longer code and is therefore more expensive to send. The Shannon information content allows us to estimate this cost using probabilities and is defined as (MacKay, 2003)
$$h(x) = \log_2 \frac{1}{P(x)},$$
where $x$ is the value of an element and $P(x)$ is the probability of that value occurring. The aim is to find the probability of each value occurring in the $W$, $H$, and $E$ matrices and then convert that probability to a cost using the Shannon information content.

To estimate the probabilities, we separate the data into bins of width $\epsilon$, which is the precision of the data. This value should be assessed from the data. We then apply two methods to estimate the probability of a term occurring in a given bin. The first is to use the frequency of terms in that bin, $n_i$, compared to the total number of terms, $N$, so that $P_i = n_i / N$, where $P_i$ is the probability of an element falling in the $i$th bin. There is a considerable problem with this method: while we can estimate the probability of each element in each bin, and hence a bound on the cost of sending the data, we would also need to send a specification of the histograms themselves, essentially the start and end points of the histograms, along with the code used for each bin. The bin width could be assumed to be the precision. The encoding of the start and end points of the histogram is likely to be fairly similar across $K$, and there should be fairly inexpensive methods of encoding which bins are assigned to which codes, but it is not a trivial task to complete. It is, however, likely that the cost of these histogram-model parameters will be dwarfed by the cost of encoding the data. In the rest of this letter, we refer to this technique as the histogram method.
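A minimal sketch of the histogram method, assuming the values have already been quantized to the precision $\epsilon$ (the function name is our own):

    function L = histogram_length(x, epsilon)
    % Sketch of the histogram method: bin the values in x with width epsilon,
    % estimate each bin probability from its frequency, and sum the Shannon
    % information content -log2(P_i) over all elements.
        x      = x(:);
        edges  = min(x):epsilon:(max(x) + epsilon);
        counts = histcounts(x, edges);      % frequency n_i of each bin
        P      = counts / numel(x);         % P_i = n_i / N
        P      = P(counts > 0);             % empty bins contribute nothing
        counts = counts(counts > 0);
        L      = -sum(counts .* log2(P));   % description length in bits
    end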

The second method of estimating probabilities is more consistent with MDL principles but also suffers from a potential problem. Instead of using the frequencies of the histograms themselves, probability distributions are fitted to the binned data, which allows us to find the probability density, $f(x_i)$, at each bin, $i$. The probability for an element, $x$, in the $i$th bin is then $P_i = \epsilon f(x_i)$. The advantage over the previous method is that the technique required to send the message is quite straightforward. As long as we use fairly simple distributions, we must simply encode the parameters of the model and send them. The receiver can then recreate the distributions and will therefore be able to recreate the message. The only change in the model as $K$ changes will be in the few parameters of the distribution (the mean and standard deviation for a gaussian distribution, for example), which is highly unlikely to have any noticeable effect on the description lengths for any reasonably large data matrix. The potential problem with this method is that if the distributions do not fit the data well, the estimates of the probabilities will not be accurate. This will then overestimate the description length.
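A sketch of this second method is given below, fitting a gamma density to nonzero factor entries and a gaussian density to the errors; it assumes the Statistics and Machine Learning Toolbox functions gamfit, gampdf, normfit, and normpdf, and the function name is again ours.

    function L = parametric_length(x, epsilon, family)
    % Sketch of the distribution method: fit a simple parametric density to x,
    % approximate each element's bin probability by epsilon * f(x), and sum the
    % Shannon information content.
        x = x(:);
        switch family
            case 'gamma'                        % nonzero entries of W or H (x > 0)
                p = gamfit(x);                  % [shape, scale]
                f = gampdf(x, p(1), p(2));
            case 'normal'                       % entries of the error matrix E
                [mu, sigma] = normfit(x);
                f = normpdf(x, mu, sigma);
        end
        P = max(epsilon * f, realmin);          % bin probability, guarded against log2(0)
        L = -sum(log2(P));                      % description length in bits
    end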

The probability distribution fitted to the nonzero terms of $W$ and $H$ should possess some features: its support must be nonnegative, its probability density should tend toward either zero or infinity at zero, and its probability density should tend toward zero as the value becomes large. A simple choice is the gamma distribution, which has a probability density function (PDF) of
$$f(x; \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha - 1} e^{-\beta x}, \qquad x > 0,$$
where $\Gamma(\alpha)$ is the gamma function and $\alpha$ and $\beta$ are the shape and rate parameters. This is quite a flexible family of distributions that is often able to approximate real-world distributions quite accurately.
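One practical caveat if Matlab's gamma routines are used here (an assumption about tooling, not part of the method itself): gamfit and gampdf work with a shape/scale parameterization, whereas the PDF above uses a rate parameter, so the fitted scale must be inverted to recover $\beta$.

    % Relating Matlab's shape/scale parameterization to the PDF above.
    p     = gamfit(x);      % x: vector of nonzero entries of W or H; p = [shape, scale]
    alpha = p(1);           % shape parameter
    beta  = 1 / p(2);       % rate parameter beta in the PDF above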

At this point, there is both a challenge and an opportunity. Part of the value of NMF is that it naturally tends to result in sparse matrices with a relatively high proportion of zero terms. The opportunity is that these zero terms could be sent very cheaply as separate matrices; the challenge is that these zero terms may result in highly inaccurate distributions being fitted to the $W$ and $H$ terms. The PDF of the gamma distribution either falls to zero or tends to infinity at zero, depending on the shape parameter $\alpha$. If the nonzero data are best fit by a distribution that tends to zero, the estimates of the probabilities will be very poor. Most seriously, they might well significantly overestimate the cost of sending the zero terms. It may be better to split the data into zero terms and nonzero terms. The separation of zero and nonzero terms requires some threshold to be set. Above the threshold, the data are modeled using a gamma distribution, and below the threshold, the data are separately encoded, as described below. The Matlab code we provide allows for the manual choice of the threshold but also for an automatic choice made by applying MDL techniques themselves.

The automatic threshold is selected by systematically searching through the space of zero thresholds for both $W$ and $H$, from zero up to the edge of the first bin. The total description length is then calculated, and the threshold giving the lowest value is selected. This can result in different thresholds for $W$ and $H$ and also across different values of $K$.
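A sketch of this automatic threshold search for a single factor matrix is given below; the grid of 20 candidate thresholds and the helper names zero_length and parametric_length (both sketched nearby) are our own illustrative choices.

    % Sketch of the automatic zero-threshold search for one factor matrix A
    % (either W or H): try thresholds from 0 up to the edge of the first bin
    % and keep the one giving the smallest combined description length.
    thresholds = linspace(0, epsilon, 20);
    bestL = inf;
    for t = thresholds
        z  = A(:) <= t;                                   % entries treated as zero
        L0 = zero_length(nnz(z), numel(A));               % cost of the zero/nonzero pattern
        Lp = parametric_length(A(~z), epsilon, 'gamma');  % cost of the remaining entries
        if L0 + Lp < bestL
            bestL = L0 + Lp;
            bestT = t;                                    % selected zero threshold
        end
    end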

The matrices containing the zero terms of $W$ and $H$ are encoded via the probability of an element being zero, which is given by $P_0 = n_0 / N$, where $n_0$ is the number of zero values and $N$ is the total number of elements in $W$ or $H$. This leads to
$$L^{0}_{A} = -\,n_0 \log_2 P_0 \;-\; (N - n_0)\log_2\!\left(1 - P_0\right),$$
where $A$ represents either $W$ or $H$ and the terms $n_0$ and $N$ are the counts for $W$ or $H$, respectively. The second term encodes the cost of specifying the terms that are nonzero. This can be viewed as sending a code specifying a matrix of zeros and ones, followed by the distribution and the codes for the nonzero terms.
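In code, this cost for the zero/nonzero pattern might be computed as follows (zero_length is the illustrative helper name used in the threshold-search sketch above):

    function L0 = zero_length(n0, N)
    % Sketch of the cost of encoding which of the N entries of a factor matrix
    % are zero, given n0 zeros, using the Shannon information content of each entry.
        P0 = n0 / N;                          % probability of an entry being zero
        if n0 == 0 || n0 == N                 % a degenerate pattern costs nothing extra
            L0 = 0;
            return;
        end
        L0 = -n0 * log2(P0) - (N - n0) * log2(1 - P0);
    end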
This separation of the $W$ and $H$ matrices results in a total description length of
$$L = L^{0}_{W} + L^{0}_{H} + L^{+}_{W} + L^{+}_{H} + L_{E}, \tag{2.1}$$
where $L^{0}_{W}$ and $L^{0}_{H}$ are the description lengths required to encode the zeros in the $W$ and $H$ matrices, respectively; $L^{+}_{W}$ and $L^{+}_{H}$ are the description lengths required to encode the nonzero terms in the $W$ and $H$ matrices, respectively; and $L_{E}$ is the description length required to encode the error terms.

The nonzero data are assigned to bins of width $\epsilon$, and a gamma distribution is separately fitted to the $W$ and $H$ data. The probabilities, followed by the Shannon information content and hence the description lengths, are then calculated. The $W$ and $H$ matrices that would be recovered from the message are calculated, followed by the error matrix $E$. A gaussian probability distribution is fitted to the error matrix to enable the extraction of the probabilities and description lengths. The five terms that make up the description length are summed to give the total description length, as in equation 2.1. Our technique is explained in algorithm 1.
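Putting the pieces together, a hypothetical mdl_length helper (the one assumed by the rank-selection loop sketched earlier) could be assembled as below. The simple quantization of the errors and the exact-zero test are our own reading of the procedure, intended only as a sketch of algorithm 1, not as the authors' implementation.

    function L = mdl_length(V, W, H, epsilon)
    % Sketch of the total description length of equation 2.1: zeros and nonzeros
    % of W and H (gamma model) plus the quantized error matrix E (gaussian model).
    % zero_length and parametric_length are the illustrative helpers sketched above.
        wz  = W(:) == 0;
        hz  = H(:) == 0;
        L0w = zero_length(nnz(wz), numel(W));
        L0h = zero_length(nnz(hz), numel(H));
        Lw  = parametric_length(W(~wz), epsilon, 'gamma');
        Lh  = parametric_length(H(~hz), epsilon, 'gamma');
        E   = round((V - W * H) / epsilon) * epsilon;   % errors to the agreed precision
        Le  = parametric_length(E(:), epsilon, 'normal');
        L   = L0w + L0h + Lw + Lh + Le;                 % total length, equation 2.1
    end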

3  Application of Minimum Description Length

To demonstrate the application of our MDL technique, we have applied it to real and synthetic data sets. The results shown in this letter use the NMF method of Hoyer (2004) without additional sparseness constraints added (we also tested other methods taken from Gillis, 2014, and saw no notable differences). It is important to emphasize that there is no real ground truth in the real data we assess, so we cannot demonstrate beyond reasonable doubt that our technique works effectively on real data. However, there are several criteria we would expect our method to meet if it is capable of selecting an appropriate $K$:

  1. That the MDL technique performs in the manner we anticipate—that $L(D \mid \mathcal{H})$ would fall and $L(\mathcal{H})$ should rise as $K$ increases, and there should be a turning point in the total description length

  2. That the MDL technique picks a plausible value of $K$ for real data, especially if this is similar to choices made using other methods, such as the use of external knowledge

  3. That the MDL technique can reasonably estimate $K$ values from synthetic data with a known $K$

  4. That MDL shows clear estimates of $K$ for different types of data

  5. That the choice of $K$ is robust to some variation in the data

In Figure 1 we demonstrate the success of MDL in achieving the first and second points and part of the fifth. The left plot shows real data: a set of 2429 images of faces (see Table 1) with 361 dimensions (pixels) used by Lee and Seung (1999). The description lengths change exactly as we would expect: the length of the errors falls with increasing $K$ at the same time that the $W$ and $H$ terms grow larger. The MDL algorithm produces exactly the pattern that we would expect, fulfilling our first criterion. The same plot also demonstrates that MDL can meet the second criterion; the vertical line marks the $K$ value of the minimum description length, but the turning point is fairly flat, and a reasonable choice could lie anywhere within a range around this minimum. Here we should note that when Lee and Seung (1999) used this data set, they chose $K = 49$ for their subspace size. They do not specify how they chose this value, but it may well have been by a trial-and-error approach, choosing the value that gave good plots for their paper. Using MDL, we have found a result in a similar range with no parameter tuning or assumptions beyond the choice of precision. This estimated value of $K$ is certainly a sensible value, and the turning point is clear. We have also included results from rerunning the NMF algorithm on the data, which show no difference in the choice of $K$ and produce virtually identical lengths; in fact, the differences in the results are difficult to spot in the figure. This consistency across reruns is significant, as NMF does not necessarily produce one unique solution; the reruns of the algorithm are likely to have produced somewhat different $W$ and $H$ matrices. Our estimation of $K$ does not change at all, implying a level of consistency across different NMF solutions. A final point to be noted from this plot is that the solid line shows results from the fitted distributions, while the dashed line shows the results from using the histograms alone. There is no difference in the estimation of $K$ and only small differences in description length values between the two methods. As the two methods have potential, but complementary, flaws, the similarity in output implies that these flaws do not adversely affect the conclusion.

Figure 1:

(Left) The description lengths for the Faces data set, showing a minimum of the total description length with a fairly flat turning point, so that a reasonable choice of $K$ lies within a range around the minimum. The solid line shows the description lengths found using the fitted distributions and the dashed line those from the histograms; the choice of $K$ is the same for both. (Right) The description lengths for synthetic data with a true $K$ of 150. MDL shows a clear minimum at $K = 150$, perfectly estimating the correct value.


Table 1:
Data Set Names, Type of Data, Number of Dimensions, $d$, and Number of Data Points, $n$.

Name      Type        $d$    $n$    Source
Faces     Image       361    2429   http://cbcl.mit.edu/software-datasets/FaceData2.html
Genes     Biological  5000   38     http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi
FTSE 100  Financial   1305   94     University of Southampton Bloomberg information terminal

The right plot in Figure 1 shows MDL applied to synthetic data with a known subspace size. The simple synthetic data are created by generating two matrices, $W$ and $H$, with random nonzero terms in random locations. These are multiplied together, and noise is added. The size of the subspace is 150 and is estimated correctly by the MDL approach. Clearly these data are simple, and the selection of an appropriate $K$ here does not prove our approach is effective, but it does show that MDL can find the appropriate value of $K$ for some data sets. Again, there is no difference in the conclusions drawn from the histograms and the distributions. We thus claim that our MDL algorithms can fulfill the third criterion we set out.
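A sketch of how such synthetic data might be generated is given below; the dimensions, sparsity level, and noise scale are our own illustrative choices, not those used to produce the figure.

    % Sketch of the synthetic data: sparse nonnegative factors with random
    % nonzero locations, multiplied together, with a little nonnegative noise added.
    d = 500; n = 1000; K = 150;               % illustrative sizes only
    W = rand(d, K) .* (rand(d, K) < 0.3);     % random nonzero terms in random locations
    H = rand(K, n) .* (rand(K, n) < 0.3);
    V = W * H + 0.01 * rand(d, n);            % added noise keeps V nonnegative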

The left plot in Figure 2 shows the total description lengths for several different data types (see Table 1) and allows us to address our second and fourth criteria. There is no real ground truth for these data sets, so it is not possible to confirm that MDL is picking a good choice of $K$. It is, though, selecting a $K$ that seems to be reasonable for each of the plots and that differs between data sets. If we were seeing all the turning points at similar values, we might suspect that this was a feature of the algorithm rather than of the data; the different locations of the turning points suggest that the method is extracting information from the data. The genes data set has been used extensively, often with an implied $K$ of 2 or 3 (Devarajan, 2008), which is similar to our estimate of between 2 and 5. For the FTSE 100 data set, the value of $K$ could be considered as the number of economic sectors, such as energy, telecommunications, and information technology. An estimate of around 10 such sectors would be reasonable and is close to our evaluation.

Figure 2:

(Left) The total description length for a range of data sets, showing where turning points occur and thus potentially signaling an effective choice of $K$. While no ground truth is known, the plots show minima at sensible locations, which correspond reasonably well with estimates from other sources. (Right) The total description length for a range of synthetic data sets. MDL correctly identifies the true $K$ for each data set except one, for which the estimate (the red vertical line) is still very close to the correct value (the black line).


The right plot of Figure 2 shows a range of results for synthetic data created as discussed earlier but with $K$ values of 25, 50, 80, 120, and 150. The black vertical lines show the location of the true $K$ value. For all but one of these values, the MDL estimate is identical to the true value, and the remaining estimate is very close. Our algorithm is correctly estimating the real value of $K$.

The final aspect of our technique that we consider in this letter is its robustness to certain changes. We have already demonstrated that our technique is robust to reruns of the NMF algorithm, which can produce significantly different $W$ and $H$ matrices. To further gain an impression of the uncertainty in our technique, we applied bootstrapping to the faces data set to produce five different variations of the original, in addition to the non-bootstrapped variant. NMF was then used to find the $W$ and $H$ matrices, and our MDL technique was applied. In the left plot of Figure 3, we see the results of applying the MDL techniques to this data set. The solid line shows the results of applying MDL using the fitted distributions for the images of faces data set. The dotted line shows the same but for the bootstrapped data; it is hard to see, as the results are almost identical. The dashed line shows the same but for MDL applied using the histograms, and the dash-dot line shows the equivalent bootstrapped results. The differences are very marginal, and the choice of $K$ is similar for both. There is almost no difference in the results when bootstrapping the data. We therefore consider the method to be reasonably robust to this type of alteration of the data.
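A sketch of the bootstrapping step, resampling the data points (columns of $V$) with replacement before refactorizing, is given below for a single candidate $K$; in practice this would sit inside the loop over $K$ sketched in section 2.2, and the helper names and number of replicates are our own.

    % Sketch of the bootstrap robustness check: resample the columns of V with
    % replacement, rerun NMF, and recompute the description length for each replicate.
    nBoot = 5;
    Lb = zeros(1, nBoot);
    for b = 1:nBoot
        idx = randi(n, 1, n);                      % sample n columns with replacement
        Vb  = V(:, idx);
        [Wb, Hb] = nmf_frobenius(Vb, K, 500);      % illustrative NMF solver from section 1.1
        Lb(b) = mdl_length(Vb, Wb, Hb, epsilon);   % description length for this replicate
    end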

Figure 3:

(Left) The solid line shows the results of applying MDL using the fitted distributions for the images of faces data set, and the dotted line shows the same for the bootstrapped data; the two are hard to distinguish because the results are almost identical. The dashed line shows the results for MDL applied using the histograms, and the dash-dot line the equivalent bootstrapped results. Again, there is almost no difference: our results show no significant variation under bootstrapping. (Right) The results of varying the number of samples, $n$, drawn from the images of faces data set. We see a clear reduction in the optimal $K$ value as $n$ is reduced. The dashed lines are for the histogram method and the solid lines for the fitted distributions.


The right-hand plot of Figure 3 shows how the location of the MDL selection of $K$ changes with the number of samples, from small subsamples up to $n = 2429$ (the full data set). The vertical lines record the value of the minimum for each sample size. It is apparent that the choice of $K$ decreases with a smaller sample size. The results for the two largest sample sizes are similar, but there is a considerable fall in the selected $K$ value for smaller $n$. We offer an explanation of why we see such a fall in the best choice of $K$ that is consistent with these results. If we consider the data to be made up of features with a range of importance, then in the case of a set of images of faces, important features might be eyes, noses, ears, or mouths. These features, and variants of them, will be required for almost all faces. Features such as moustaches, however, are far less common. With lower numbers of samples, it may be better to assume that a moustache feature, which may be used by only a small number of images, is not worth considering as a feature; instead, a moustache can be considered as noise and absorbed into the corrections made by the error matrix $E$. As the number of samples increases, it becomes possible to recognize that the extra feature is not noise, so the number of features expands and the capacity of the model increases with more data. We can potentially see the features that appear at low $n$ as the more important features; as $n$ increases, we gain the features that are either less important to many of the data points or important to only a small subset of them. In reality, this analysis may be overly simplistic in that the smaller set of features is likely to partially include the less important features; we may well then see combined features rather than the less relevant features being completely absent. Either way, as the number of data points increases, so does the capacity of the model.

4  Conclusion

Our novel contribution is to apply an MDL technique to the selection of $K$. Before considering the results, there are several attractive features of MDL. First, all the data are used in MDL; there is no need to keep holdout folds, so no need to average the results from different folds or to consider the variance in the results when drawing conclusions, as there is in techniques such as cross-validation. Second, MDL is an elegant technique with intuitive appeal that gives a natural trade-off between errors and model size. Third, the only potentially arbitrary parameter is the precision, $\epsilon$, but this has only a minor influence on the relative description length of different models and, in any case, may not be arbitrary if the precision of the data themselves can be used.

We have applied our MDL technique to a range of real and synthetic data. MDL is able to accurately estimate $K$ in synthetic data as well as provide reasonable estimates of the best $K$ in real data. Our technique is robust to reruns of the NMF algorithm and to bootstrapping the data, producing the same predictions of the best $K$.

Our technique has been tested on a range of data and is likely to work better on some data than others. If the distributions of the matrices $W$, $H$, and $E$ match our chosen distributions well, then we would expect to make good predictions. Conversely, if the distributions do not match the data well, our estimates may be inaccurate. In particular, if $V$ is highly sparse, most of the errors will probably be zero, and our algorithm may require some alteration. An advantage of our algorithms is that such problems should show up as differences between the results of the histogram and gamma distribution methods. Our algorithms also allow for the production of a range of graphs to compare the fitted distributions with the actual data, which should highlight potential problems. An extension to this work would be to investigate whether other distributions do a better job than our gaussian and gamma choices.

There is a range of techniques in the literature for assessing an appropriate value of $K$. The best approach is likely to use several techniques together to select an appropriate $K$. We suggest that our technique adds to the potential toolbox that NMF researchers use to form judgments about the choice of $K$.

References

Blei, D. M., Cook, P. R., & Hoffman, M. (2010). Bayesian nonparametric matrix factorization for recorded music. In Proceedings of the 27th International Conference on Machine Learning (pp. 439-446). Madison, WI: Omnipress.

Devarajan, K. (2008). Nonnegative matrix factorization: An analytical and interpretive tool in computational biology. PLoS Comput. Biol., 4.

Gillis, N. (2014). The why and how of nonnegative matrix factorization. Regularization, Optimization, Kernels, and Support Vector Machines, 12, 257-291.

Hoyer, P. O. (2004). Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5, 1457-1469.

Kanagal, B., & Sindhwani, V. (2010). Rank selection in low-rank matrix approximations: A study of cross-validation for NMFs. NIPS Workshop 2010.

Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by nonnegative matrix factorization. Nature, 401, 788-791.

MacKay, D. J. C. (2003). Information theory, inference and learning algorithms. Cambridge: Cambridge University Press.

Owen, A. B., & Perry, P. O. (2009). Bi-cross-validation of the SVD and the nonnegative matrix factorization. Annals of Applied Statistics, 3, 564-594.

Ulfarsson, M. O., & Solo, V. (2013). Tuning parameter selection for nonnegative matrix factorization. In Proceedings of the IEEE Conference on Acoustics, Speech, and Signal Processing. Piscataway, NJ: IEEE.

Wallace, C. S., & Boulton, D. M. (1968). An information measure for classification. Computer Journal, 11, 185-194.