Abstract
We study the problem of the unsupervised learning of graphical models in mixed discrete-continuous domains. The problem of unsupervised learning of such models in discrete domains alone is notoriously challenging, compounded by the fact that inference is computationally demanding. The situation is generally believed to be significantly worse in discrete-continuous domains: estimating the unknown probability distribution of given samples is often limited in practice to a handful of parametric forms, and in addition to that, computing conditional queries needs to carefully handle low-probability regions in safety-critical applications. In recent years, the regime of tractable learning has emerged, which attempts to learn a graphical model that permits efficient inference. Most of the results in this regime are based on arithmetic circuits, for which inference is linear in the size of the obtained circuit. In this work, we show how, with minimal modifications, such regimes can be generalized by leveraging efficient density estimation schemes based on piecewise polynomial approximations. Our framework is realized on a recent computational abstraction that permits efficient inference for a range of queries in the underlying language. Our empirical results show that our approach is effective, and allows a study of the trade-off between the granularity of the learned model and its predictive power.
1. INTRODUCTION
Probabilistic representations, such as Bayesian and Markov networks, are fundamental to statistical machine learning and have applications in a range of domains, including biology, physics and robotics [1]. The attractiveness of such networks is that they can express probabilistic dependencies in a compact manner. Unfortunately, exact inference in these representations is intractable [2, 3]. Naturally, in the era of big data, there is also enormous interest in learning such structures directly from data. However, owing to the intractability of inference, learning also becomes challenging, since learning typically uses inference as a sub-routine [4], and moreover, even if such a representation is learned, the prediction will suffer because inference has to be approximated.
Tractable learning is a powerful new paradigm that attempts to learn representations that support efficient probabilistic querying. Much of the initial work focused on low tree-width models [5], but later, building on properties such as local structure [6], data structures such as arithmetic circuits (ACs) emerged. These circuit learners can also represent high tree-width models and enable exact inference for a range of queries in time polynomial in the circuit size. Sum-product networks (SPNs) [7] are instances of ACs with an elegant recursive structure – essentially, an SPN is a weighted sum of products of SPNs, and the base case is a leaf node denoting a tractable probability distribution (e.g., a univariate Bernoulli distribution). Insofar as deep learning models can be understood as graphical models with multiple hidden variables, SPNs can be seen as a tractable deep architecture. Standard deep architectures rely on many layers of hidden variables for greater representational power, but inference with even a single layer is generally intractable, with additional layers exacerbating the problem. Tractable architectures, such as SPNs, have the advantage over standard deep learning methods that they can be learned in an unsupervised fashion and that they have full probabilistic semantics in which inference is guaranteed to be tractable [8]. Of course, learning the architecture of standard deep models is very challenging [9]; in contrast, SPNs, by their very design, offer a reliable structure learning paradigm. While it is possible to specify SPNs by hand, weight learning is additionally required to obtain a probability distribution, and the specification of SPNs has to obey conditions of completeness and decomposability, all of which makes structure learning an obvious choice. Since SPNs were introduced, a number of structure learning frameworks have been developed for them and related data structures, e.g., [10, 11, 12].
In this work, we study the problem of the unsupervised learning of tractable graphical models in mixed discrete-continuous domains. In general, tractability and learning issues aside, even in the inference context, the situation is generally believed to be significantly harder in discrete-continuous domains: estimating the unknown probability distribution of given samples is often limited in practice to a handful of parametric forms, and in addition to that, computing conditional queries needs to carefully handle low-probability regions in safety-critical applications, such as autonomous vehicles and health monitoring. Techniques based on Gaussian, conditional linear Gaussian [13] and mixing exponential families [14] are most common, for example.
To address the problem, then, we would first need a framework where it becomes possible to reason about continuous and mixed discrete-continuous event spaces in a general manner. With discrete models, weighted model counting (WMC) has emerged as an assembly language for state-of-the-art reasoning in Bayesian networks, factor graphs, probabilistic programs, and probabilistic databases [6, 15]. Exact WMC solvers are able to handle low-probability observations, and the query language can support arbitrary propositional formulas over logical connectives, including disjunctions, which makes them particularly useful in the presence of logical soft and hard constraints. However, WMC is limited to discrete finite-outcome distributions only, and until recently little was understood about whether a suitable extension exists for continuous and discrete-continuous random variables. The framework of weighted model integration (WMI) [16] extended the usual WMC setting by allowing real-valued variables with symbolic weight functions, as opposed to the purely numeric weights of WMC. These symbolic weights can take the form of piecewise polynomials, for example, and thus WMI appeals to the emerging interest in density estimation by piecewise polynomial approximations [17, 18, 19, 20]. Essentially, these approximations can be made arbitrarily close to exponential families, log-concave distributions, mixtures of Gaussians and Poisson Binomial distributions, among others [21]. WMI additionally enables efficient integration [22], and recent WMI solvers [23, 24] are maturing quickly in the sense of attempting to leverage the powerful strategies already used in WMC. In essence, WMI is a strict generalization of WMC [16]; for example, just as inference in discrete graphical models reduces to WMC [2, 6], obtained from the sum of products of atoms' weights in a propositional model, inference in hybrid graphical models reduces to WMI, obtained by integrating the product of atoms' densities in a first-order model. For example, a Gaussian distribution N(5,1) for a random variable x can be represented using (truncated) models as seen in Figure 1.
Figure 1. Comparison of piecewise constant vs. polynomial approximations using a 7-piece function.
As a computational strategy, what makes WMI particularly effective is that: (a) the intervals can be understood as (and mapped to) propositions [16], known in the literature as propositional abstraction [23], allowing us to piggyback on any propositional solver, and (b) WMI admits complex interval queries, such as Pr(x > 1.5 | y < 2) where x, y are random variables, yielding a powerful query interface. Notice here that the intervals in the queries do not have to correspond to the intervals used for specifying the distribution, and can overlap and subsume them arbitrarily.
To leverage WMI and piecewise polynomials more generally with structure learning, consider that the base case for SPNs is a tractable distribution, and so, SPNs are essentially a schema for composing tractable distributions. Although much of the literature focuses on univariate Bernoulli distributions [10, 12], continuous models can be considered too, which makes SPNs very appealing because real-world data are often hybrid, with discrete, categorical and continuous entities. Nonetheless, this rests on the assumption of having a tractable query interface to the underlying continuous and hybrid distributions. In [11], for example, the leaf nodes are assumed to be drawn from a Gaussian, and similarly, [25, 26] assume the data are drawn from other families with a known parametric form.
In this work, we show how, with minimal modifications, structure learning can be generalized by leveraging efficient density estimation schemes based on piecewise polynomial approximations. Although SPN integration with parametric and non-parametric models is receiving increased attention, to the best of our knowledge, we are the first to achieve full integration of WMI and its powerful query mechanism, together with tractability guarantees regarding exact integration, whilst achieving a clean separation of circuit learning from polynomial weight learning. In particular, owing to the tractability of integration over polynomial densities [27], inference becomes tractable, and can be computed by doing a few passes over the network. Moreover, among other things, the learning component allows us to induce both parametric and non-parametric models and their mixtures, including models such as those in Figure 1 and Figure 2, but with no prior knowledge of the true distribution, all of which are further composed over hidden layers in a deep probabilistic architecture. In other words, leveraging SPNs enables a fully unsupervised learning regime for hybrid data, with WMI acting as the underlying inference framework. From a representational viewpoint, our framework allows the modeller to study the granularity of the continuous distributions (in terms of the number of piecewise components and the degree of the polynomial fit), and determine the empirical trade-off between accuracy and model size/interpretability. By lifting the propositional abstraction strategy of WMI, the framework is also instantiated in a manner that allows us to use any SPN learner in principle. To reiterate, to the best of our knowledge, this is a first attempt to combine WMI and circuit learning in their full generality, with an interface for complex interval queries. We will define this precisely in Section 3, but briefly, the idea is that we can handle conjunctions and disjunctions of probabilistic queries, in the sense that these queries involve interval formulas, but these intervals are not required to be mentioned in the specification of the distributions in an SPN. As a contrasting example, consider the learning problem of credit risk. A deep neural network (DNN) would learn to classify an input customer as either BAD, for a customer potentially reneging on paying back a bank loan, or GOOD, if the customer is likely to pay back a loan. With standard DNNs, inference would target the classification, as in Pr(class = BAD | X), where X is a vector of customer features (age, job, creditamount, savingsaccount, etc.). Inference requires X to have instantiated values but does not allow for more nuanced questions. In contrast, with SPNs, and specifically with our integration of WMI, we can now perform inference such as Pr(class = BAD | job = Unemployed ∧ 7,500 < creditamount < 9,000), which allows for inference on discrete and continuous features with specified ranges over the continuous features. As we will see, handling such queries requires multiple passes over the SPN. The code for the framework is available on FigShare①.
Figure 2. The piecewise polynomial approximation of a mixture of a Gaussian and a Beta distribution.
This article is organized as follows. We discuss related work followed by the formal foundations for our work and then turn to our approach that integrates WMI and SPNs. We then discuss our interface for complex interval queries. We turn to empirical evaluations after that and finally conclude.
2. RELATED WORK
In addition to the related efforts we mentioned above, we remark that in a recent and independent effort②, [28] introduces so-called Mixed SPNs (MSPNs). Like our work, they are motivated by piecewise approximations for learning from hybrid data. They propose a decomposition and conditioning approach for such SPNs employing the Hirschfeld-Gebelein-Rényi Maximum Correlation Coefficient [29]. While close in spirit, there are several significant differences to our work. Firstly, although their proposal is motivated by piecewise approximations, they focus on the learning of piecewise constant or linear models. From a representational viewpoint, this seems like a shortcoming, because in principle these weight functions can be effectively mapped to propositional SPNs with numeric weights, which removes much of the complexity, and hence the appeal, when finer polynomial density approximations are needed③. Secondly, the realization of an appropriate query interface that dynamically handles arbitrarily complex interval queries is not a focus of that paper; this is a critical feature of our model, as we take advantage of WMI. From a computational viewpoint, such a realization requires, as we show, subtle decompositions and multiple traversals to compute the resulting probability. To reiterate, what we strive for here is a full integration of SPNs and WMI. Thirdly, and perhaps most significantly, our approach neatly separates the propositional layer from the continuous aspects (including the integration) through pre-processing the data, which makes our framework generic, and applicable to any SPN learner. Finally, in our empirical evaluations, we show that our approach significantly outperforms [28] on almost all data sets in terms of the average test log-likelihood measure. On the other hand, it is argued in [28] that the decomposition technique naturally handles random variables of unknown type, whereas we treat all real-valued variables via intervalization and piecewise polynomial approximations. So it may be fruitful to combine the strengths of the Rényi-based decomposition for variables of unknown type with our handling of continuous variables of unknown distribution, which we leave for future research.
Regarding inference and MSPNs, a later paper [30] expands MSPNs to include inference specifically geared towards Approximate Query Processing (AQP). AQP aims to deliver an estimate of a query within a fixed time bound. Kulessa et al. proposed two strategies for querying: one based on likelihood inference on the model and the other based on sampling. The differences between [30] and our approach are similar to the differences between our approach and [28], that is, treating linear piecewise approximations as constants and not allowing user-defined subinterval queries. Nonetheless, given the focus on SQL queries, the nature of their querying is far broader than ours. We also note that the inclusion of sampling is a distinctive approach which holds merit given the broad set of user-defined SQL queries allowed, and we observe that the inclusion of our methods of complex querying would be a constructive addition to their SQL syntax. We elaborate further in Section 5.
Another body of work released after our initial findings also addresses inference concerning MSPNs [31]. Both our work and [31] use a similar separation of structure learning with distribution modeling at the leaf nodes. [31] relies on MSPNs to derive the structure of the data and then use Gibbs sampling in conjunction with a likelihood dictionary (e.g., Gaussian, Gamma, Exponential distributions) to approximate variable distributions at the leaf nodes (parameters for parametric models). This approach replaces the linear piecewise polynomials from [28] and uses a Bayesian inference approach with a Gibbs sampling scheme. They can perform interval querying on continuous variables that are similar to our approach (complex querying), as well as learn on missing data which we do not handle. Our approach, however, is not reliant on a likelihood dictionary of parametric distributions nor sampling for parametric parameters. As will be explained in Section 4, our model is more aligned with the general tractable inference mechanism of SPNs with distribution modeling handled by piecewise polynomial learning on the data. It would be interesting to combine our piecewise polynomial learning modeling method with the likelihood dictionary and explore a sampling approach in performing inference for future endeavors. Nonetheless, we note that we perform exact inference in our current framework, which we think fits well with the tractable learning paradigm of learning structures that permit efficient exact inference. Leveraging a sampling-based method would have to be approached carefully, and the merits and demerits evaluated thoroughly.
Outside of the above lines of research, research in learning in hybrid domains has been primarily limited to well-known parametric families. For example, [13] focuses on conditional linear Gaussian models, and [14] focuses on mixing exponential families. This is perhaps not surprising, because inference in hybrid domains has been primarily limited to approximate computations, e.g., [32], and/or Gaussian models [33]. In a similar spirit, approaches such as [34, 35] consider learning of complex symbolic constraints by assuming base distributions as being either Gaussian or a softmax equality. Another inference direction for hybrid domains was pursued by [36], where SPNs interchange integrals with sum nodes to evaluate continuous features provided with arbitrary input distributions. While that is a novel approach, we maintain an integration scheme at the leaf nodes to handle arbitrary interval queries.
A particular body of work which utilizes parametric families for modelling hybrid distributions is Automatic Bayesian Density Analysis (ABDA) [31], which uses the SPN architecture to cluster feature distributions for mixture modelling, and SPN inference to impute missing values and handle outlying values in the data. ABDA forgoes the piecewise linear approach of MSPNs and our piecewise polynomial method; as such, we include comparisons with ABDA on hybrid data sets in the experiments in Section 6. We also include experimental comparisons with Bayesian Sum-Product networks (BSPNs), which apply a Bayesian approach to structure learning by focusing on deriving so-called scope functions to formulate joint Bayesian functions over the structure and parameters of an SPN [37]. From our empirical evaluation we find that our approach outperforms both ABDA and BSPNs.
While our research focuses on structure learning, we mention an interesting framework which does away with SPN structure learning by constructing random region graphs populated with SPN nodes, so-called random and tensorized SPNs (RAT-SPNs) [38]. A key aspect of RAT-SPNs is that by training the model discriminatively, it yields a classifier on par with deep architectures. As the learning objective of a RAT-SPN can be supervised as opposed to unsupervised, a future direction of research could be the reframing of our learning method to be supervised by incorporating the discriminative training framework of RAT-SPNs. Another future research direction would be an investigation into Einsum Networks (EiNets), which compute SPN sum and product nodes on the same topological layer via a single monolithic einsum operation [39]. Incorporating EiNet operations could help scale our method to modelling large data sets.
From an inference viewpoint, WMI serves as a computational abstraction for exact and approximate inference [16, 40] based on piecewise polynomial approximations. It is thus different in spirit to variational methods and analytic closed-form solutions for parametric families. We refer readers to [16] for discussions. WMI has enjoyed considerable advances in recent years [41, 42], and in that sense, our framework might be the starting point for closing the gap between inference and learning.
3. INFERENCE FRAMEWORK AND GRAPH ARCHITECTURE
This section presents the core components of our methodology by explaining WMI and SPNs. Emphasis is given to WMI's relationship with complex queries, including a formal definition of complex queries. We also provide an illustration of the probabilistic modelling capabilities of SPNs, as well as basic notation used throughout.
3.1 Weighted Model Integration
Given a propositional formula Δ, the task of model counting is to compute the total number of satisfying assignments for Δ. Given an assignment (model or world) M, we write M ⊧ Δ to denote satisfaction. For example, p ∨ q has 3 models, and a satisfying ratio of 3/4. Weighted model counting (WMC) is its extension that additionally accords weights to the models [6]. Models are usually accorded weights by computing the product of weighted literals: that is, assuming w maps literals in Δ to positive reals, we define:

WMC(Δ, w) = Σ_{M ⊧ Δ} Π_{l ∈ M} w(l),

where the sum ranges over propositional models, and l ∈ M denotes the literals true at the model M. For example, suppose p and q are accorded a weight of .6 and .3, respectively, with the understanding that the weight of a negated atom is w(¬a) = 1 – w(a). Then, the weight accorded to [p = 1, q = 1] would be .18, and WMC(p ∨ q, w) = .18 + .42 + .12 = .72.
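To make the computation concrete, here is a minimal Python sketch that reproduces the 0.72 figure above by brute-force enumeration; it is an illustration only, not how an actual WMC solver operates.

```python
from itertools import product

def wmc(formula, weights):
    """Brute-force WMC: sum, over satisfying assignments, the product of
    literal weights, with w(not a) = 1 - w(a)."""
    atoms = sorted(weights)
    total = 0.0
    for values in product([False, True], repeat=len(atoms)):
        model = dict(zip(atoms, values))
        if not formula(model):
            continue
        weight = 1.0
        for atom, value in model.items():
            weight *= weights[atom] if value else 1.0 - weights[atom]
        total += weight
    return total

# p or q with w(p) = 0.6, w(q) = 0.3  ->  0.18 + 0.42 + 0.12 = 0.72
print(wmc(lambda m: m["p"] or m["q"], {"p": 0.6, "q": 0.3}))
```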
WMC is a state-of-the-art exact inference scheme, and owing to its generality, inference in a number of formalisms, such as relational Bayesian networks and probabilistic programs [15], is encoded as a WMC task. Given a formula Δ, a weight function w, a query q and evidence e, probabilities are computed by means of the expression: Pr(q | e, Δ) = WMC(Δ ∧ q ∧ e, w)/WMC(Δ ∧ e, w).
Due to the propositional setting, however, WMC is limited to finite-domain discrete random variables, which has spurred considerable interest in generalizing WMC to continuous and hybrid distributions [16, 18, 43]. Weighted model integration (WMI) is a computational abstraction for computing probabilities with continuous and mixed discrete-continuous distributions [16]. The key idea is to additionally consider inequality atoms, such as 0 ≤ x ≤ 10, along with possibly non-numeric weights, such as x². The intuition is to let the inequality atom denote an interval of the domain of a density function, and let the weight define the function for that interval④. Formally, WMC is based on weighted propositional knowledge bases and leverages SAT (propositional satisfiability) technology; by extension, WMI is based on weighted linear arithmetic theories and leverages satisfiability modulo theory (SMT) technology [44].
Given such a knowledge base, the weighted model integration (WMI) is obtained from the model count of the theory together with a volume computation for the intervals. Formally:

WMI(Δ, w) = Σ_{M ⊧ Δ⁻} ∫_{M⁺} Π_{l ∈ M} w(l) dx,

where Δ⁻ denotes a propositional abstraction, based on a mapping from linear arithmetic expressions to propositional atoms, and Δ⁺ (respectively M⁺) denotes a refinement, where the atoms are mapped back to their linear arithmetic expressions. For example, (0 ≤ x ≤ 10) ∨ q on abstraction yields p ∨ q, where p is a fresh atom chosen so as not to conflict with any of the existing atoms in the theory, which has a model count of 3.
Example 1 [16]
Suppose Δ is the following formula: (0 ≤ x ≤ 10) ∨ q, where p is the propositional abstraction for (0 ≤ x ≤ 10). We also suppose that the weights are given by w(p) = x² and w(q) = 0.3, and the weights for the negated atoms are w(¬p) = 1 and w(¬q) = 0, respectively. There are three models of Δ⁻:
For example, for M = {p, ¬q}: since w(¬q) = 0, by definition we have ∫_{0}^{10} x² · 0 dx = 0.
Thus WMI(Δ,w) = 100 + 3 + 0 = 103.
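As a small numeric check, the model {p, q} contributes ∫_{0}^{10} x² · 0.3 dx = 100, the first summand above; the sketch below verifies that integral with numpy. (This is an illustration only; the remaining contributions depend on details of the example not reproduced here.)

```python
import numpy as np

# Model M = {p, q}: the weight is w(p) * w(q) = x**2 * 0.3 on 0 <= x <= 10.
poly = np.poly1d([0.3, 0.0, 0.0])     # 0.3 * x**2, coefficients highest degree first
antideriv = np.polyint(poly)          # 0.1 * x**3
print(antideriv(10) - antideriv(0))   # 100.0
```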
To understand the intuition here, consider that in the one-variable setting, we are assuming a density function f: ℝ → ℝ of the following form [45]:

f(x) = a_{0i} + a_{1i}x + … + a_{ni}xⁿ  if x ∈ A_i (for some i = 1, …, k), and f(x) = 0 otherwise,

where A_1, …, A_k are disjoint intervals in ℝ that do not depend on x, and a_{0i}, …, a_{ni} are constants for all i. We will say that f is a k-piece, n-degree piecewise polynomial density approximation.
This is then encoded as a WMI atom A_i (e.g., x ∈ [α, β]) with weight w(A_i) = a_{0i} + a_{1i}x + … + a_{ni}xⁿ. For example, suppose we also have the query atom α ≤ x ≤ β and e = true. (We treat the weights of query atoms and their negations as being 1.) On abstraction, we would use a proposition p to stand for the interval A_i. Clearly, calculating the probability of the interval simply amounts to calculating the area under the curve given by the polynomial w(A_i).
This can also be approached as:

Pr(α ≤ x ≤ β) = WMI(Δ ∧ (α ≤ x ≤ β), w) / WMI(Δ, w).
Naturally, if the query refers to intervals not appearing in the specification, we have to decompose the problem. We will return to this point in more detail later (Section 5), but the rough idea is that given two pieces of the density function f, say A1 = α1 ≤ x ≤ β1 and A2 = β1 ≤ x ≤ β2 with weights P1(x) and P2(x), clearly the probability of x ∈ [α∗, β∗], where α1 < α∗ < β1 and β1 < β∗ < β2, would be obtained as:

∫_{α∗}^{β1} P1(x) dx + ∫_{β1}^{β∗} P2(x) dx.
While the mathematical treatment is immediate, computing these probabilities dynamically requires, as we will see, multiple passes over the SPN and so we will refer to such queries as complex queries.
Definition 1 (Complex Queries): Given a k-piece n-degree piecewise polynomial density function f for r.v. x, defined over intervals A1, …, Ak, we define a complex query as an interval query q: α∗ ≤ x ≤ β∗ where:
α∗ or β∗ is not one of the extreme points of A1, …, Ak.
Moreover, given two (not necessarily complex) queries q and q', we are also interested in the probability of q ° q', where ° ∈ {∧, ∨}, and in the probability of ¬q.
In general, the WMI apparatus is very flexible and on top of complex querying with continuous variables, we can also combine complex querying with Boolean variables.
3.1.1 Weighted Model Learning
Weight learning in the past [16] has focused on piecewise-constant density approximations where features are linear arithmetic sentences and weights are constants. This was addressed as follows: given a formula fi, we wish to ensure that E_D[fi] = E_w[fi] by maximum likelihood estimation (MLE) with weights w = (w1, …, wk) and data D. This states that the empirical expectation of fi in D (for example, the count) equals its expectation according to the current parameterization.
More precisely, given an SMT theory Δ, the empirical expectation of a formula α ∈ Δ can be obtained by calculating:

E_D[α] = (1/|D|) Σ_{d ∈ D} 1[d ⊧ α].

This simply counts the items in D which are true for α. Given a weight function w, the expectation of α ∈ Δ w.r.t. w is given by:

E_w[α] = WMI(Δ ∧ α, w) / Z.
Here Z is a normalization constant, also referred to as the partition function. In our case, the partition function equals the integral of the weights of all models of Δ, thus Z = WMI(Δ, w). Based on this formulation, standard iterative methods for convex optimization can be used for weight learning [16]. In our setting, the integration with circuit learning removes the burden of parameter learning from WMI, and moreover, we are able to handle piecewise polynomial density approximations.
3.2 Sum-Product Networks
SPNs are rooted acyclic graphs whose internal nodes are sums and products, and leaf nodes are tractable distributions, such as Bernoulli and Gaussian distributions [7, 10]. More precisely, letting the scope of an SPN be the set of variables appearing in it, a recursive definition is given as follows:
(1) a tractable univariate distribution is an SPN (the base case); (2) a product of SPNs with disjoint scopes is an SPN (denoting a factorisation over independent distributions); and (3) a weighted sum of SPNs with the same scope, where the weights are positive, is an SPN (denoting a mixture of distributions).
Formally, SPNs encode a function for every node that maps random variables X1, …, Xn to a (numeric) output. For a node j, the corresponding function fj maps a world X1 = x1, …, Xn = xn to its probability if j is a leaf node, to the weighted sum fj = Σ_{i ∈ ch(j)} wji · fi for sum nodes, where ch(j) are the children of node j, and to the product fj = Π_{i ∈ ch(j)} fi for product nodes. Conditional probabilities are expressed using:

Pr(X1 = x1 | X2 = x2, …, Xn = xn) = f0(x1, x2, …, xn) / f0(x2, …, xn),

where 0 is the root node, and the notation f0 with only a subset of the arguments in the denominator denotes marginalisation over the remaining arguments. The benefit of SPNs here is that a bottom-up pass allows us to compute conditional queries in time polynomial in the circuit size. (See, for example, [46] on other data structures that allow a more expressive class of queries.)
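The following minimal sketch illustrates this bottom-up evaluation and the two-pass conditional query; the node classes and the toy network are hypothetical, not part of any particular SPN library.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Leaf:
    var: str          # variable this indicator leaf refers to
    prob: float       # probability the leaf assigns to "var is true"

    def value(self, evidence: Dict[str, bool]) -> float:
        if self.var not in evidence:      # marginalised variable
            return 1.0
        return self.prob if evidence[self.var] else 1.0 - self.prob

@dataclass
class Product:
    children: List

    def value(self, evidence):
        out = 1.0
        for child in self.children:
            out *= child.value(evidence)
        return out

@dataclass
class Sum:
    children: List[Tuple[float, object]]  # (weight, child) pairs

    def value(self, evidence):
        return sum(w * child.value(evidence) for w, child in self.children)

# Pr(X1 = 1 | X2 = 0) = f0(x1, not x2) / f0(not x2): two bottom-up passes.
root = Sum([(0.4, Product([Leaf("X1", 0.9), Leaf("X2", 0.2)])),
            (0.6, Product([Leaf("X1", 0.3), Leaf("X2", 0.7)]))])
numerator = root.value({"X1": True, "X2": False})
denominator = root.value({"X2": False})
print(numerator / denominator)
```

The two evaluations, one with the query and evidence instantiated and one with the evidence alone, realize the conditional ratio described above.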
3.3 Notation
During our discussion, we use the following notation. Random variables are denoted using X or Y, as well as Xi. The set of possible values for a variable (or feature) Xi is implicitly understood to be binned as intervals {x1i, x2i, …, xki}. Abusing notation, we treat xai = 1 and xai = 0 as the true and false values accorded to the Boolean variable serving as a propositional abstraction for the interval xai. We periodically generalize xai to xi for a random variable Xi, where xi represents any one of its intervals (i.e., xai for some a ≤ k). We denote a data set as D, with T being the set of instances, and V being the set of variables. D' represents a binarized data set, that is, after binning the points in D as intervals and interpreting those intervals as Boolean variables. We refer to queries as q and to the log-likelihood function as ℒ. For WMISPNs and SPNs, we use fj to refer to a general node j, with f0 referring to the root node and fj,i referring to a node i with parent node j.
4. TRACTABLE MODEL STRUCTURE LEARNING
In this section we present our proposed algorithm, which derives the structure and parameters for SPNs with piecewise polynomial leaves, equipped with WMI (WMISPNs). The method LearnWMISPN is capable of deriving such an SPN structure from hybrid data and, further, it allows for the retention of the distribution parameters in the continuous case for use in advanced querying.
The structure learning approach for LearnWMISPN is derived from LearnSPN [10]. LearnSPN is a recursive top-down learning method which is capable of learning the structure and weights for SPNs by identifying mutually independent variables and clustering similar instances. LearnSPN takes a simple approach to structure learning by recursively splitting data into a product of SPNs over independent variables, or a sum of SPNs comprising subsets of instances. More generally, LearnSPN represents an algorithm schema which allows for a modular approach in structure learning from differing domains, which is why our approach for learning changes only the base case.
The primary extension with LearnWMISPN is the addition of generated polynomial leaf nodes. Given a data set with discrete, categorical and continuous attributes, LearnWMISPN learns polynomial weight functions as density approximations for the continuous cases, mapped to leaf nodes. By doing so, LearnWMISPN provides a model which is capable of complex querying in the continuous case, while also conditioning over other variable types for hybrid domains.
Prior to structure learning, a preprocessing step is performed which first transforms discrete, categorical, and continuous features into a binary representation. Much like the algorithm schema of LearnSPN, the preprocessing step can implement any number of unsupervised binning methods including using a mean split, equal frequency binning, or equal width binning [47]. This is followed by a polynomial learning function which identifies the polynomial function coefficients and probability densities for the continuous features in the hybrid domain without prior knowledge of the true density function.
Algorithm 1 describes the recursive structure learning algorithm for LearnWMISPN and Figure 3 illustrates the recursive pipeline. LearnWMISPN takes as input the preprocessed binary data set D' with T instances and V' variables and returns a WMISPN representing the distributions of the hybrid domains. LearnWMISPN recurses on subsets of V' until a vector of unit length is found, at which point a corresponding univariate distribution is returned.
Algorithm 1. LearnWMISPN(T, V'): a recursive algorithm for learning in hybrid domains.
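The sketch below gives a rough skeleton of this recursion; the helper callables for the base case, the independence split and the clustering step are placeholders of our own naming, so treat it as an illustration of the control flow, not the released implementation.

```python
def learn_wmispn(instances, variables, leaf_fn, split_fn, cluster_fn):
    """LearnSPN-style recursion used by LearnWMISPN (schematic)."""
    if len(variables) == 1:
        # Base case: univariate leaf (smoothed indicator for discrete
        # features, piecewise polynomial density for continuous ones).
        return leaf_fn(instances, variables[0])
    groups = split_fn(instances, variables)          # mutually independent?
    if len(groups) > 1:
        return ("product",
                [learn_wmispn(instances, vs, leaf_fn, split_fn, cluster_fn)
                 for vs in groups])
    clusters = cluster_fn(instances, variables)      # hard incremental EM
    total = sum(len(c) for c in clusters)
    return ("sum",
            [(len(c) / total,
              learn_wmispn(c, variables, leaf_fn, split_fn, cluster_fn))
             for c in clusters])

# Toy demo with stub helpers: split the two remaining variables apart.
demo = learn_wmispn(
    list(range(10)), ["X1", "X2"],
    leaf_fn=lambda inst, v: ("leaf", v, len(inst)),
    split_fn=lambda inst, vs: [vs[:1], vs[1:]] if len(vs) > 1 else [vs],
    cluster_fn=lambda inst, vs: [inst[:5], inst[5:]])
print(demo)
```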
The main novelty in LearnWMISPN is the handling of the base case. Without any prior knowledge, LearnWMISPN will return a smoothed univariate distribution for discrete and categorical variables, or piecewise polynomial distribution for continuous variables. These two outcomes set LearnWMISPN apart from LearnSPN, and this is represented with polynomial leaf nodes corresponding to continuous distributions.
If the base case has not been reached, then LearnWMISPN continues the recursive structure learning of LearnSPN. For the decomposition step, a variable split is identified which results in mutually independent subsets, generating a product node for the resulting subset. Mutual independence for the variable splits is determined by a G-test for pairwise independence:

G(x1, x2) = 2 Σ_{x1} Σ_{x2} c(x1, x2) · log( c(x1, x2) · T / (c(x1) · c(x2)) ),
where the summations range over the elements of each variable interval x1 and x2, c(·) refers to the occurrence count for a setting of a variable pair or singleton, and T refers to the number of instances [10].
As an example, say we are checking the independence of two continuous features X1 = x1 and X2 = x2 (we ignore binning to focus on the test itself). As there are only four possible combinations for the count function c(x1, x2), we derive a 2×2 contingency matrix which lists the counts for c(x1 = 0, x2 = 0), c(x1 = 0, x2 = 1), c(x1 = 1, x2 = 0) and c(x1 = 1, x2 = 1). We also calculate a count vector for x1 and x2, respectively, that counts the number of activations (1) and negations (0) in each feature separately (represented by c(x1) and c(x2) in the equation). The empty sets Ø in the count vectors are then counted (it is only ever possible to have 1 empty set or 0 empty sets). From there we sum over the possible activation/negation cases using the G-test formula, and the final G-test value must be less than a threshold value calculated from features x1 and x2 for them to be classified as independent. The threshold value is calculated as 2 × DOF × gfactor + 0.001. We calculate the degrees of freedom (DOF) based on the number of states a feature can be represented by (2 in our case, as it is either 0 or 1) and the empty set count for each feature. The gfactor is chosen as 0.001 and we add 0.001 to prevent a threshold of zero.
It is also noted that we perform the G-test on two features, say xai and xbi, derived from a continuous feature Xi, where a and b refer to bin intervals such that 1 ≤ a, b ≤ k and {xai, xbi} ⊆ Xi. If xai has a large number of activations and xbi has a small number of activations (mostly 0's with, say, a single activation), the G-test will still be able to determine that the features are dependent. We remark that the G-test is not a perfect metric for determining independence, but generally, we found it very successful when provided with binarized data.
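A rough Python sketch of the test as described above follows; the exact degrees-of-freedom bookkeeping and thresholding in the released code may differ, so treat the helper logic as an assumption.

```python
import numpy as np

def g_test(x1, x2, g_factor=0.001):
    """Pairwise G-test on two binary columns; returns (G, threshold).
    Features are deemed independent when G < threshold."""
    x1, x2 = np.asarray(x1), np.asarray(x2)
    n = len(x1)
    g = 0.0
    for a in (0, 1):
        for b in (0, 1):
            c_ab = np.sum((x1 == a) & (x2 == b))
            c_a, c_b = np.sum(x1 == a), np.sum(x2 == b)
            if c_ab > 0:                       # empty cells contribute 0
                g += 2.0 * c_ab * np.log(c_ab * n / (c_a * c_b))
    def non_empty_states(x):                   # 2 minus the empty-set count
        return sum(1 for v in (0, 1) if np.sum(x == v) > 0)
    dof = max(1, (non_empty_states(x1) - 1) * (non_empty_states(x2) - 1))
    threshold = 2 * dof * g_factor + 0.001
    return g, threshold

rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=500)
print(g_test(a, a))                            # identical columns: large G
print(g_test(a, rng.integers(0, 2, size=500))) # unrelated columns: much smaller G
```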
For the conditioning step, similar instances are clustered using hard incremental expectation-maximization (EM), generating a sum node. The ratio split on the clustered subset of instances determines the corresponding weight, and what is returned is the sum of the resulting weighted clusters. We condition initially on a slice S ⊆ D, which is a data structure composed of a subset of instances (TS ⊆ T) and a subset of features (VS ⊆ V) from the initial data set D. Clusters ℂ are derived from slice S via 10 restarts, passing through the subset of instances TS four times in random order. A new cluster C is added to ℂ if the log-likelihood gain exceeds the cluster penalty term, where σ is the cluster penalty and F refers to the attribute count for each feature (2 in our model). In order for a set of clusters to remain in the WMISPN structure and for learning to continue, the best set of clusters ℂ is greedily chosen based on log-likelihood ℒ improvements:
where Cprior refers to the ratio of local instances TC to slice instances TS, St refers to the partition of the data based on instance t, and C~ refers to the previous cluster in the recursive formula g(·,·). The function g(·,·) is used to compute maxima in the log domain.
LearnSPN assumes a naive Bayes mixture model, where all variables are independent conditioned on the cluster; LearnWMISPN additionally applies a prior Pr(Xj) for the jth variable Xj within the ith cluster Ci, but only when the feature is continuous.
Overall, we upgrade both learning and querying over the models to handle piecewise polynomial density approximations. For the learning part, since only the base case is changed for our method, it suffices to show that the learning of the univariate distributions can be effective, which we discuss below. The querying is treated in the subsequent section.
Formally, let Xi = (x1i, …, xki), where k is the number of intervals for every continuous feature Xi. The polynomial learner provides to LearnWMISPN a probability constant representing the prior for each continuous interval xai, a being an arbitrary interval index ≤ k. In the base case of LearnWMISPN, a univariate distribution is returned, but the calculation is contingent on whether the feature is continuous or discrete. Given a continuous feature interval xai, we wish to calculate the unnormalized probability fp,l(xai) of a WMISPN, where p represents the parent node of the child leaf node l which maps to the interval xai.
The leaf likelihood for an interval xai is calculated based on activations and negations of instances for the interval. For activations, we find the count of activations c(xai = 1), and for negations, c(xai = 0). The likelihood for xai is then calculated as:

Pr(xai) = c(xai = 1) / (c(xai = 1) + c(xai = 0)).
This calculation considers the one-hot encoded representation of xai, as negations correspond to the probability of one of (x1i, …, x(a–1)i, x(a+1)i, …, xki) being activated instead.
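A minimal sketch of this count-based leaf likelihood, under the reading given above, is:

```python
import numpy as np

def leaf_likelihood(column):
    """Empirical likelihood of a one-hot interval column xai: the fraction
    of instances in which the interval is activated."""
    column = np.asarray(column)
    activations = np.sum(column == 1)
    negations = np.sum(column == 0)
    return activations / (activations + negations)

print(leaf_likelihood([1, 0, 0, 1, 1, 0]))   # 0.5
```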
Preprocessing Step: A preprocessing step is required to first transform the discrete, categorical, and continuous variables into a binary representation. The method for converting a data set into a binary representation is contingent on the variable type in the hybrid data set. The preprocessing pipeline is summarized in Algorithm 2.
Algorithm 2. Preprocessing step(D, f, k).
The first sub-task converts the real-valued data set D into a binary representation D'. In our model, equal-width binning was performed on continuous features, with the number of equal-width bins specified as a parameter. (A meta-learning step can be designed to choose the optimal number by studying log-likelihood or polynomial improvements.) Higher bin counts would increase the representative power and improve accuracy for complex queries on continuous distributions (see Section 5). Each new bin generated from a feature represents a discrete range on that continuous distribution. A one-hot encoding transformation was performed on the expanded feature set to get a binary representation. For discrete and categorical variables, the binary representation is achieved with a one-hot encoded transformation alone, without equal-width binning.
As an example of one-hot encoding in the discrete case, consider a random variable Y which has a domain of boolean values. Booleans result in two classification labels, so Y is expanded to (y1, y2), which correspond to the two boolean instances. Formally:

y1 = 1 if Y = true (and 0 otherwise), and y2 = 1 if Y = false (and 0 otherwise).

The two cases for Y are then represented as [Y = true] → [1, 0] and [Y = false] → [0, 1], respectively. Now consider the continuous case, where equal-width binning is implemented: we say that random variable X has range [a, b], i.e., x ∈ ℝ with a ≤ x ≤ b. If we split X into, say, three bins (x1, x2, x3), with the split point defined as s = (b – a)/3, then we again have a one-hot encoded representation. Formally:

x1 = 1 if a ≤ x < a + s, x2 = 1 if a + s ≤ x < a + 2s, x3 = 1 if a + 2s ≤ x ≤ b (each 0 otherwise).

So for the minimum and maximum values in X we get the one-hot encoded representations [a] → [1, 0, 0] and [b] → [0, 0, 1]. This means the binary transformation of an instance Di,: = [b, Y = true] is D'i,: = [0, 0, 1, 1, 0].
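A small sketch reproducing this transformation follows; the helper names are ours, and the bin-boundary handling is one plausible choice.

```python
import numpy as np

def one_hot_continuous(value, lo, hi, k):
    """Equal-width binning of a continuous value into k one-hot bits."""
    edges = np.linspace(lo, hi, k + 1)
    # Clip so that the maximum value `hi` falls into the last (k-th) bin.
    idx = min(int(np.digitize(value, edges[1:-1])), k - 1)
    out = np.zeros(k, dtype=int)
    out[idx] = 1
    return out

def one_hot_boolean(value):
    return np.array([1, 0]) if value else np.array([0, 1])

# Instance [b, Y = true] with X over [a, b] split into 3 bins:
a, b = 0.0, 9.0
row = np.concatenate([one_hot_continuous(b, a, b, k=3), one_hot_boolean(True)])
print(row)   # [0 0 1 1 0], as in the example above
```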
It is worth noting that increasing the number of bin intervals can lead to overfitting of the data and spurious likelihood values. Consider a data set D with M instances and an N-sized feature space, and consider the effect of increasing the interval count k (where k > M) for a given continuous random variable Xi. Binning results in an increasing feature space N, which leads to continuous intervals xi frequently being 0. This implies that prior likelihoods of activations decrease, while prior likelihoods of deactivations increase, which can also be seen from the one-hot encoded representation that leads to additional features with no values. Such problems are to be expected with overfitting; however, as we will see in the following section, we use an information criterion to regularize bin counts, choose an optimal number of bins and temper the degree of the piecewise polynomial function in the model. This is further evidenced by our empirical evaluations (cf. Tables 2 and 3), which show that our results naturally converge to ≤ 5 bins per continuous feature.
Polynomial Learner: The second task learns a piecewise polynomial (PP) approximation for any univariate probability density function (PDF) without prior knowledge or simplifying assumptions. The original data set D, the indices for the continuous variables, and the number of bins k are taken as input, and as output, we receive a vector of probability densities for each continuous feature bin. This bin information is retained in its respective polynomial leaf node in the SPN for use in complex querying.
The piecewise polynomial functions are calculated through a linear combination of b-splines as described in [48]. B-splines are polynomial curves that are right-side continuous and differentiable, and they form a basis in the space of piecewise polynomials. It follows that linear combinations of b-splines can represent any piecewise polynomial function⑤.
B-spline interpolation can then be used to find an approximation of the density function as a linear combination of Z = k + r – 1 b-splines spanning the whole domain, where r is the order. Thus, for learning a given b-spline, only the order r (for a polynomial of degree n, r = n + 1), the number of intervals k defined by a binning method, and the knot sequence δ = (a0, …, ak), which specifies where the pieces of the polynomial join, are required. We define the jth b-spline, for j = 1, …, Z, in terms of the knot sequence and the Heaviside step function H(x), which is 0 for negative arguments and 1 otherwise.
Our objective now is to learn piecewise polynomial (PP) weights for each of the intervals. So suppose g: I → ℝ+ is the density of an unknown distribution over I, where I is an interval. We would like to find a candidate PP hypothesis f. We say f is a k-piecewise, degree-n polynomial (also called a mixture of polynomials (MOP)) if there is a partition of I into k disjoint intervals (I1, …, Ik) such that f(x) = fj(x) for all x ∈ Ij, and each of f1, …, fk is a polynomial of degree ≤ n. Let Pk,n(I) be the class of k-piecewise degree-n polynomials over I. We would now like to draw m ≤ |D| samples, find m candidate hypotheses f1, …, fm ∈ Pk,n(I), and output the f∗ ∈ {f1, …, fm} that maximizes the likelihood of the observed samples. Note that k is fixed by the intervalisation step, and so we only need to iterate over the maximum degree of the polynomial; recall also that we are only learning univariate distributions for the leaf nodes.
The order, and, if not previously specified, the number of bins are chosen during training time by means of the Bayesian Information Criterion (BIC) [49]:

BIC = −2ℒ + ν log m,

where ℒ denotes the log-likelihood, ν the number of free parameters of the piecewise polynomial, and m the number of samples. The use of BIC scores and b-spline interpolation respects the usual properties of probability density functions, guaranteeing that MOP approximations remain continuous, integrate to one, are non-negative, and are free of prior assumptions [50].
The BIC scoring function has been shown to be robust and was chosen to avoid overfitting the model parameters. In addition, the chosen method favours smaller polynomial ranks [50] which improves the efficiency of the integration without losing accuracy. We reiterate that we do not make any assumptions about the true density⑥.
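To illustrate degree selection by BIC, the sketch below fits per-piece polynomials by least squares to a fine histogram and scores candidates with the BIC expression above; it is a simplified stand-in for the b-spline procedure of [48], not the actual learner.

```python
import numpy as np

def fit_piecewise_density(samples, k=2, max_degree=3, fine_bins=20):
    """Fit a k-piece polynomial density to 1-D samples and pick the degree
    by BIC = -2*loglik + params*log(m). Returns (edges, coefficients, degree)."""
    samples = np.asarray(samples, dtype=float)
    m = len(samples)
    edges = np.linspace(samples.min(), samples.max(), k + 1)
    best = None
    for degree in range(1, max_degree + 1):
        coeffs, loglik = [], 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            piece = samples[(samples >= lo) & (samples <= hi)]
            hist, sub = np.histogram(piece, bins=fine_bins, range=(lo, hi))
            dens = hist / (m * (sub[1] - sub[0]))        # empirical density
            centers = 0.5 * (sub[:-1] + sub[1:])
            p = np.polyfit(centers, dens, degree)        # per-piece fit
            coeffs.append(p)
            vals = np.clip(np.polyval(p, piece), 1e-12, None)
            loglik += np.log(vals).sum()
        bic = -2.0 * loglik + k * (degree + 1) * np.log(m)
        if best is None or bic < best[0]:
            best = (bic, degree, coeffs)
    return edges, best[2], best[1]

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0, 1, 2000), rng.beta(2, 5, 2000)])
edges, coeffs, degree = fit_piecewise_density(data, k=4)
print(degree, [np.round(c, 4) for c in coeffs])
```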
5. PROBABILISTIC MODEL QUERYING
This section is devoted to expanding our definition of querying with respect to WMISPNs. We first detail the means by which SPNs perform tractable inference, then we illustrate our integration of complex querying into the WMISPN framework.
SPNs provide tractable querying on univariate distributions and can calculate the marginal and conditional probabilities of learned features. Other schemas for SPNs limit the scope of querying to binary activations Pr(Xi = 1) at the leaf nodes [7]. In WMISPNs, we are able to expand our queries to complex cases where leaf nodes map Xi to new probability ranges (a ≤ x ≤ b), a, b ∈ ℝ.
5.1 SPN Inference
General SPN inference is initiated at the leaf nodes where queries are presented in the form of selected activations of the random variables q(X1 = 1, X2 = 0, …, XN = 1). The mapped probabilities xi propagate upward from the leaf nodes where values from children nodes are either multiplied at product nodes, or the weighted sum is taken at sum nodes. The final likelihood of the queries is returned at the root node 0.
To normalize the returned likelihood, a partition function is required to calculate the normalization constant. Formally:

Z = Σ_{x1, …, xn} f0(x1, …, xn),

where the sum ranges over all joint states of the variables. Inference is deemed tractable because the partition function can be computed in time linear in the number of edges of the learned SPN, which in turn means queries can be calculated in linear time, e.g., Pr(q) = f0(x1, 1 – x2, …, xn)/Z. The same is true for conditional likelihood estimates.
5.2 Complex Querying
We extend the SPN inference model to allow complex queries where inference can be performed over continuous as well as discrete features, and combinations thereof: for example, queries on discrete features such as male(X) conditioned on continuous ranges such as 40 ≤ weight(X) ≤ 50. The advantage of complex queries, with their capability of performing inference on conjunctions of discrete and continuous feature ranges, is that they give modellers added ability to assess the relationship between probability distributions in the data. Taking our example of credit risk, complex querying provides a means to probe the decision boundary, given a trained model and a mixed discrete-continuous data set. An exhaustive querying of a model with complex queries can determine which feature values would lead to a classification change. Taking the credit risk example again, complex queries of the form Pr(class = BAD | 0 < creditamount < k) and Pr(class = GOOD | k ≤ creditamount < 10,000), where k indicates a real-valued boundary for the feature creditamount, can inform the modeller where the likelihood of classification changes. In this example we focus on one feature, but complex queries allow for multiple conjunctions, as will be explained further.
First, observe that LearnWMISPN is given probability constants as well as polynomial representations for continuous intervals based on the BIC score; thus standard univariate boolean queries do not require any new integration. In particular, suppose L(SPN) refers to the propositional abstraction of the SPN (that is, think of intervals as propositions and only refer to the probability constants), and let PP(SPN) refer to the continuous representation. We can then describe a query q ∈ L(SPN) to mean a Boolean query of the form (X1 = b1 ∧ … ∧ Xn = bn), where bi ∈ {0, 1}.
Proposition 1. Suppose T is a learned WMISPN. Then the partition function of T and, for any q ∈ L(T), the probability of q can be computed in time linear in the number of edges in T.
Our system augments the above-described machinery for querying to allow for complex queries. This adds the ability to ask queries over arbitrary continuous ranges which in turn enables a mixed querying setup between continuous and discrete features in the same formulas.
Recall that standard SPN inference is initiated at the leaf nodes, where queries are presented as selected activations of the random variables (i.e., a state); the mapped probabilities propagate upward, values from children being multiplied at product nodes and combined as weighted sums at sum nodes, with the final likelihood returned at the root node.
Inference for continuous attributes is based on WMI. Let us explain the intuition using a univariate distribution. Our structure learning algorithm above assigns ranges ri = [αi, βi] and their piecewise polynomial density approximation pi(x) to a continuous SPN leaf node i. With that, we can perform inference at the leaf node itself through WMI. For example, assume a continuous leaf node for the attribute "weight" has the range 34 ≤ weight(X) ≤ 55 and suppose its calculated polynomial is −0.051 + 0.0016x. The probability for the query 40 ≤ weight(X) ≤ 50 can then be calculated as ∫_{40}^{50} (−0.051 + 0.0016x) dx. Once the leaf has processed the query, its value is passed to the SPN, where it is combined with the other variables of the query. We reiterate that inference using the polynomials is not performed in the SPN itself; only the value is passed on to be interpreted. This demonstrates the built-in modularity of our approach.
Naturally, posing a query such as 40 ≤ weight(X) ≤ 60, spanning multiple intervals, is less trivial. In this case, assume a second interval for weight(X) is specified, 55 ≤ weight(X) ≤ 77, with a polynomial density of 0.1469 − 0.0019x. Then, the probability of the query is calculated as ∫_{40}^{55} (−0.051 + 0.0016x) dx + ∫_{55}^{60} (0.1469 − 0.0019x) dx. In general, as the ranges were defined to be mutually exclusive, the calculation of the probability boils down to:
dividing the range into m-k subranges where each corresponds to a leaf node i, i ∊ {1, …, n}, 1 ≤ k ≤ m ≤ n;
performing inference through WMI in each leaf with the polynomial density approximation pi(x); and then
adding the calculated probability values to obtain the final value.
That is,

Pr(a ≤ x ≤ b) = Σ_{i=k}^{m} ∫_{max(a, αi)}^{min(b, βi)} pi(x) dx.
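A sketch of this computation, using the two weight(X) leaves from the example above (with the sign convention used there), is:

```python
import numpy as np

# Leaf ranges and their piecewise polynomial densities (numpy coefficient
# order, highest degree first).
leaves = [((34.0, 55.0), np.poly1d([0.0016, -0.051])),
          ((55.0, 77.0), np.poly1d([-0.0019, 0.1469]))]

def interval_probability(query_lo, query_hi, leaves):
    """Pr(query_lo <= x <= query_hi): integrate each leaf polynomial over its
    overlap with the query and sum the pieces."""
    total = 0.0
    for (lo, hi), poly in leaves:
        a, b = max(lo, query_lo), min(hi, query_hi)
        if a < b:
            antideriv = np.polyint(poly)
            total += antideriv(b) - antideriv(a)
    return total

print(interval_probability(40.0, 50.0, leaves))   # single-leaf query
print(interval_probability(40.0, 60.0, leaves))   # spans two leaves
```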
By extension, if a query consists of multiple continuous features X1, …, XN, we require a joint density approximation over the variables in order to infer the likelihood. We recall that given multiple dependent variables, the joint likelihood is calculated via Pr(X1, …, XN) = Pr(X1, …, XN–1) Pr(XN | X1, …, XN–1), as per the chain rule. We also recall Equation (9) for the SPN conditional likelihood and apply it to WMISPNs, where the likelihood over dependent features is presented as:
This can be reduced to f0(X1, …, XN)/Z. Effectively, we are able to represent the joint density function P1…N(X1, …, XN) with a WMISPN model over the variables, and perform inference on the continuous features using the leaf nodes to integrate over the ranges.
When integrating the model into the SPN some SPN properties need to be considered. The root node will always calculate the likelihood of a query by taking the product of its children. Given the mutual exclusivity of a continuous feature's multiple ranges, two or more passes are required for an SPN to calculate the likelihood of a query atom over multiple intervals for the same feature.
For any univariate query q(x) of the form (a ≤ x ≤ b), we define the query length to be the number of propositions among (x1, …, xk) whose refinements intersect [a, b], where the refinement of a boolean activation retrieves the interval it corresponds to. As an example, suppose leaves in the WMISPN mention α ≤ x ≤ β and β ≤ x ≤ γ. Then a query of length 1 is of the form α∗ ≤ x ≤ β∗, where α ≤ α∗ and β∗ ≤ β. A query of length 2 is of the form α∗ ≤ x ≤ γ∗, where α ≤ α∗ as before but β < γ∗ ≤ γ. In essence, complex querying finds the number of intervals in the WMISPN that q(x) intersects with.
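A minimal sketch of the query-length computation, assuming the leaf intervals are available as pairs, follows; this count is the number of passes the WMISPN needs (see the proposition below).

```python
def query_length(query_lo, query_hi, leaf_intervals):
    """Number of leaf intervals that the query [query_lo, query_hi] intersects."""
    return sum(1 for lo, hi in leaf_intervals
               if max(lo, query_lo) < min(hi, query_hi))

intervals = [(34.0, 55.0), (55.0, 77.0)]
print(query_length(40.0, 50.0, intervals))   # 1
print(query_length(40.0, 60.0, intervals))   # 2
```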
Proposition 2. Suppose T is a learned WMISPN and q ∈ PP(T). Computing the probability of a univariate interval query q of length k requires k passes through the WMISPN.
If q ∈ L(SPN), then the query length is 1, and we immediately get Proposition 1 as a special case. For simplicity, we assume that queries are limited to the following syntax:
(1) Every query atom is either an interval x ∈ [a, b] or a discrete feature A ∈ [a1, …, an].
(2) A query consists of a number of conjunctions.
(3) Each conjunction has to be distinct, e.g., no conjunction shares the same query atoms.
Here, (2) and (3) are not fundamental restrictions: it is easy to extend (2) by applying the rule Pr(A ∨ B) = Pr(A) + Pr(B) − Pr(A ∧ B), and in the case of (3), linear arithmetic solvers can be used to reason about multiple constraints on the same variable: for example, Pr(30 ≤ weight(X) ∧ 50 ≤ weight(X) < 60) can be resolved to Pr(50 ≤ weight(X) < 60). So, (1) means that computing the probability of expressions such as x > y and x + y > 2z, where x, y, z are continuous variables, as well as cardinality queries [8], among others, is not currently handled.
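For instance, a disjunctive or negated query can be composed from conjunctive query results returned by the WMISPN as follows; a trivial sketch of the rules just mentioned, with illustrative probabilities.

```python
def prob_or(prob_a, prob_b, prob_a_and_b):
    """Pr(A or B) from three conjunctive queries via inclusion-exclusion."""
    return prob_a + prob_b - prob_a_and_b

def prob_not(prob_a):
    """Pr(not A) from a single conjunctive query."""
    return 1.0 - prob_a

print(prob_or(0.30, 0.25, 0.10))   # 0.45
print(prob_not(0.30))              # 0.70
```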
We remark that in [30], the tractable inference advantages of MSPNs were utilized for methods akin to our complex querying. The proposed Model-based Approximate Query Processing (AQP) takes advantage of MSPN inference, which while similar to our method, is specifically concerned with SQL style queries and speed of inference. As the queries posed are SQL in nature (user-defined functions, arithmetic expressions, etc.) the scope of inference queries is larger than that found with complex queries as defined here. [30] utilizes both probabilistic and sampling approaches for inference, specifically with sampling serving to aid in approximating complex aggregation queries. Our account is focused on exact inference currently.
There are some shared techniques, such as readjusting leaf weights for inference. [30] uses a sampling method with conditioned constraints, while our method uses the interval range on a given continuous feature to determine the weight adjustment at the leaf node. As mentioned, our approach is exact and does not use sampling methods, and as we are not investigating inference speed-ups, we find pruning of our WMISPN model unnecessary, whereas Kulessa et al. actively do this. As [30] uses MSPNs, continuous variables are handled with linear interpolations of histograms; as such, their inequality queries are limited to linear approximations on continuous intervals (see Section 2), where they are treated as constants, preventing user-specified inequality-type queries (see Equation (6)). With our framework, the piecewise polynomial approximations allow for user-defined complex querying. On the other hand, the SQL mechanism has advantages such as biased sampling of rare sub-populations for ad-hoc querying, as well as providing general SQL aggregation queries such as count, average, and sum. As such, we find it a useful research direction to combine complex querying with SQL-style queries and aggregation.
6. EXPERIMENTAL EVALUATION
The goal of this section is to evaluate the merits of LearnWMISPN and the complex query interface. Specifically, we attempt to address the following questions:
Q1 How effective (in terms of log-likelihood) is LearnWMISPN on complex mixed discrete-continuous data?
Q2 How reasonable is the learned distribution: that is, what is the order of the learned polynomials, and what is the spread of the probabilities for the underlying intervals?
Q3 How effective is the query interface in the presence of increasing query lengths?
We measured the performance of the learned SPNs on preprocessed hybrid domain data sets discussed in the independent effort [28] that introduced Mixed Sum-Product Networks (MSPNs), and we compare against MSPNs, Automatic Bayesian Density Analysis (ABDA) [31], and Bayesian SPNs (BSPNs) [37], as discussed in Section 2.
Our performance evaluation of SPNs in conjunction with WMI was performed using three different regimes. We measured the performance of the SPNs learned by LearnWMISPN on preprocessed hybrid domain data sets. Doing so allowed us to directly compare the performance of LearnWMISPN to the learners of [28, 31, 37], and to test whether our approach with continuous distributions was effective. We then investigated whether further increasing the number of leaves per continuous feature resulted in improvements in the model. Next, the complex querying capacity of WMI with SPNs was measured by recording computation times as a function of query length. We then demonstrate complex querying accuracy by measuring error rates using synthetic data. As we mentioned before, WMI is a very flexible framework for inference in hybrid domains, and as our empirical results show, our integration with SPNs has yielded a scalable unsupervised learning architecture that is able to handle benchmarks and many (preprocessed) UCI data sets involving hundreds of variables.
Q1 Hybrid Benchmarks: We evaluated LearnWMISPN on 11 hybrid domain data sets taken from the UCI machine learning repository [51]. Each data set was composed of differing proportions of categorical, discrete, and continuous features. The diverse set of domains comprised data from financial, medical, automotive and other sectors. The hybrid data sets were selected based on previous research in learning on hybrid domains [28, 31, 37]. As we wanted the capacity to make direct comparisons to previous research, we primarily focused on the data sets used in other works. The most important factor was that the data sets remain mixed with discrete and continuous features, to take advantage of our method's integration of piecewise polynomial approximations. MSPNs, ABDA, and BSPNs were used as comparative baselines in our evaluation. MSPNs proved to be a successful mixed probabilistic model [28], and given the similar nature of our objectives, a comparison was appropriate. ABDA and BSPNs relied on MSPNs as a comparative baseline in their respective works, thus they were both included as comparative models as well. Structure learning with LearnMSPN was performed with different variants. MSPNs were trained using the Gower distance, which is a metric over hybrid domains, with Gaussian distributional assumptions for continuous variables. MSPNs were also trained using the randomized dependency coefficient (RDC), which does not make parametric assumptions. For each model, histogram representations were compared against isotonic regression models.
The dimensions of the data sets are shown in Table 1. Since the original data sets contain real values, our preprocessing produces an augmented binary matrix with a larger variable count (|V'| versus |V|); a sketch of this step follows Table 1. The original works used additional data sets to measure performance, but we omitted those without continuous features. We used the original LearnSPN training, validation, and test splits of 75%, 10%, and 15%, respectively.
Table 1. Data set dimensions (original variables |V| and binarised variables |V'|) and train/validation/test split sizes.

Data set | |V| | |V'| | Train | Valid | Test
---|---|---|---|---|---
anneal-U | 38 | 95 | 673 | 90 | 134
australian | 15 | 50 | 517 | 69 | 103
auto | 26 | 85 | 119 | 16 | 23
car | 9 | 50 | 294 | 39 | 58
cleave | 14 | 35 | 222 | 29 | 44
crx | 15 | 54 | 488 | 65 | 97
diabetes | 9 | 33 | 576 | 76 | 115
german | 21 | 76 | 750 | 99 | 150
german-org | 25 | 70 | 750 | 99 | 150
heart | 14 | 35 | 202 | 27 | 40
iris | 5 | 11 | 112 | 15 | 22
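For illustration, the following sketch shows a simplified version of the preprocessing that turns real-valued columns into the augmented binary matrix behind the |V'| counts above: each continuous feature is split into equal-width bins and bin membership is one-hot encoded. The helper is an assumption made for exposition, not the exact preprocessing code.

```python
import numpy as np

def binarise_continuous(X, n_bins=2):
    """Equal-width binning followed by one-hot encoding.

    X is an (instances x features) array of continuous values; the
    result is an augmented binary matrix with n_bins columns per
    original feature (a simplified sketch of the preprocessing).
    """
    columns = []
    for j in range(X.shape[1]):
        col = X[:, j]
        edges = np.linspace(col.min(), col.max(), n_bins + 1)
        # np.digitize with the interior edges maps each value to 0..n_bins-1
        idx = np.clip(np.digitize(col, edges[1:-1]), 0, n_bins - 1)
        onehot = np.eye(n_bins, dtype=int)[idx]
        columns.append(onehot)
    return np.hstack(columns)

# A 4-feature continuous data set becomes an 8-column binary matrix
# when each feature is split into 2 bins.
X = np.random.rand(100, 4)
B = binarise_continuous(X, n_bins=2)
assert B.shape == (100, 8)
```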
The log-likelihood results for the Learn methods and LearnWMISPN are shown in Table 2. Our model outperforms the others by a wide margin, yielding significantly better likelihoods on the majority of data sets. In most cases, our implementation required only two bins per continuous feature; one instance (the iris data set) required a three-bin split to produce a likelihood competitive with the other models.
Table 2. Test log-likelihoods for LearnWMISPN against the Gower and RDC variants of MSPN (histogram and isotonic leaves), ABDA, and BSPN.

Data set | LearnWMISPN (bins) | Gower-hist | Gower-iso | RDC-hist | RDC-iso | ABDA | BSPN
---|---|---|---|---|---|---|---
anneal-U | −14.54 (2) | −63.55 | −38.83 | −60.31 | −38.31 | −2.65 | −16.44
australian | −10.47 (2) | −18.51 | −30.37 | 17.89 | −31.02 | −17.70 | −21.51
auto | −27.12 (2) | −72.99 | −69.40 | −73.37 | −70.06 | −70.62 | −58.45
car | −9.11 (2) | −30.46 | −31.08 | −29.13 | −30.51 | −35.04 | −28.73
cleave | −13.82 (2) | −26.13 | −25.86 | −29.13 | −25.44 | −22.60 | −22.95
crx | −10.52 (2) | −22.42 | −31.62 | −24.03 | −31.72 | −15.53 | −11.54
diabetes | −6.29 (2) | −15.28 | −26.96 | −15.93 | −27.24 | −17.48 | −21.21
german | −20.42 (2) | −40.82 | −26.85 | −38.82 | −32.36 | −32.10 | −26.76
german-org | −21.14 (2) | −43.61 | −26.85 | −37.45 | −27.29 | −26.29 | −21.32
heart | −14.87 (2) | −20.69 | −26.99 | −20.37 | −25.90 | −23.39 | −22.93
iris | −3.56 (3) | −3.61 | −2.89 | −3.44 | −2.84 | −2.96 | −3.54
Note: The bin size is given in parentheses for the LearnWMISPN log-likelihoods.
The effectiveness of the framework can be attributed to several factors. We observed that unsupervised binning interfaced with LearnSPN's existing structure learning performed well on the hybrid data sets; this can be thought of as learning piecewise constant WMISPNs. The approach is simple and therefore fast, adding negligible complexity to the original LearnSPN machinery: the effectiveness in learning binary representations and the simple counting of Boolean variable activations are inherited from LearnSPN. In addition, we support the learning of polynomial weights, not just constant or linear ones, and we use the BIC criterion to choose the representation, which is biased towards smaller bin counts and polynomial exponents.
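As a rough sketch of how the BIC-driven selection might look, the routine below scores every candidate (bin count, polynomial order) pair by fitting a polynomial per equal-width bin to a fine histogram, weighting it by the empirical bin mass, and computing BIC = k ln n − 2 log L with k the number of coefficients. The fitting details here are a hedged approximation of the actual learner, not its exact code.

```python
import numpy as np

def bic_select(x, bin_counts=(2, 3, 4, 5), orders=(1, 2, 3, 4, 5, 6)):
    """Choose (number of bins, polynomial order) for one feature by BIC.

    A hedged sketch of the selection step, not the routine used in
    LearnWMISPN: each equal-width bin gets a polynomial fitted to a
    fine histogram of the points falling in it, the piecewise density
    is weighted by the empirical bin mass, and BIC = k*ln(n) - 2*logL
    with k = bins * (order + 1) coefficients.
    """
    n = len(x)
    best, best_bic = None, np.inf
    for n_bins in bin_counts:
        edges = np.linspace(x.min(), x.max(), n_bins + 1)
        for order in orders:
            dens = np.full(n, 1e-12)
            for b in range(n_bins):
                lo, hi = edges[b], edges[b + 1]
                mask = (x >= lo) & ((x <= hi) if b == n_bins - 1 else (x < hi))
                if mask.sum() <= order + 1:      # too few points to fit
                    continue
                h, e = np.histogram(x[mask], bins=order + 2,
                                    range=(lo, hi), density=True)
                centres = 0.5 * (e[:-1] + e[1:])
                coeffs = np.polyfit(centres, h, order)
                fitted = np.maximum(np.polyval(coeffs, x[mask]), 1e-12)
                dens[mask] = fitted * (mask.sum() / n)   # weight by bin mass
            logl = float(np.log(dens).sum())
            k = n_bins * (order + 1)
            bic = k * np.log(n) - 2.0 * logl
            if bic < best_bic:
                best, best_bic = (n_bins, order), bic
    return best

# Example: select a representation for an exponential-looking feature.
x = np.random.exponential(scale=1.0, size=1000)
print(bic_select(x))
```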
Q2 Model Complexity: To investigate the correlation between model accuracy and model complexity, we used data sets in which the majority of variables are continuous. We also used data sets with significantly more instances (from 1,024 to 17,389) in order to demonstrate the scalability of WMISPNs. The data sets investigated were the Cloud data set, the Statlog (Shuttle) data set, and a subset of the MiniBooNE particle identification data set, all from the UCI machine learning repository [51]. The inquiries below use equal-width binning to discretise the continuous features.
The most immediate question is what the learned polynomials look like. The BIC measure, as mentioned, determines the number of bins and the polynomial order, so by relaxing the criterion for the number of bins we can study the polynomials directly. Table 3 gives statistics on the order of the learned polynomials for 2 versus 5 bins. The order never exceeds 6, and in the majority of cases is at most 4, confirming that the learned orders stay manageable. Increasing the number of bins favours low-order polynomials: only Australia and Cloud retain order-6 polynomials with 5 bins.
Table 3. Distribution (in %) of learned polynomial orders per data set, for 2 versus 5 bins per feature.

Data set | Bins | 2nd order | 3rd order | 4th order | 5th order | 6th order
---|---|---|---|---|---|---
Australia | 2 | 0 | 16.667 | 16.667 | 33.3 | 33.3
Australia | 5 | 16.667 | 33.333 | 16.667 | 16.667 | 16.667
Auto | 2 | 15.8 | 36.842 | 31.579 | 12.789 | 0
Auto | 5 | 73.684 | 15.789 | 10.526 | 0 | 0
German-org | 2 | 0 | 33.333 | 0 | 0 | 66.666
German-org | 5 | 33.333 | 33.333 | 0 | 33.333 | 0
Heart | 2 | 0 | 20 | 20 | 60 | 0
Heart | 5 | 0 | 60 | 40 | 0 | 0
Iris | 2 | 50 | 0 | 50 | 0 | 0
Iris | 5 | 75 | 25 | 0 | 0 | 0
Statlog | 2 | 42.857 | 0 | 0 | 0 | 57.143
Statlog | 5 | 28.571 | 57.143 | 0 | 14.289 | 0
Cloud | 2 | 10 | 10 | 40 | 10 | 30
Cloud | 5 | 30 | 50 | 10 | 0 | 10
Figure 4 shows that increased binning results in piecewise polynomial functions that better approximate the spread of the data. For the diabetes data set, with 5 bins the learned polynomial captures the distribution of the continuous feature more closely than the 2-bin approximation. The improved approximation also yields more accurate bin probabilities for SPN inference. Notably, the polynomial learned for the diabetes feature (plasma glucose concentration) had order 4 with 2 bins, but only order 2 with 5 bins.
Figure 4. Comparison of learned piecewise polynomial functions for feature 2 (plasma glucose concentration) of the diabetes data set.
Figure 5. Instances of expanded Cloud data sets and resulting normalized log-likelihoods.
The second question, then, is how the probabilities are spread across the learned representation. (That is, suppose we learn p1(x) for 0 ≤ x ≤ 5 and p2(x) for 5 ≤ x ≤ 10; we consider the probabilities obtained by computing the volumes and normalising, as sketched after Figure 6.) Naturally, this spread depends very much on the data, but by considering multiple data sets one can empirically study the effectiveness of the learning regime. We see in Figure 6 that the representations match the characteristics of the data (e.g., sparseness for some attributes, missing values), diverse as they are.
Figure 6. The spread of probabilities, with 2 bins per attribute.
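The probabilities in Figure 6 follow the recipe just described: integrate each bin's polynomial over its interval and normalise the resulting volumes. A minimal sketch, assuming the per-bin polynomials are available as coefficient vectors:

```python
import numpy as np

def bin_probabilities(pieces):
    """Normalised probability mass per bin of a piecewise polynomial.

    pieces -- list of (coeffs, (lo, hi)) pairs, one per bin, with
              coefficients given in increasing degree order.
    """
    volumes = []
    for coeffs, (lo, hi) in pieces:
        F = np.polynomial.Polynomial(coeffs).integ()   # antiderivative
        volumes.append(F(hi) - F(lo))
    volumes = np.array(volumes)
    return volumes / volumes.sum()

# Two bins as in the running example: p1 on [0, 5] and p2 on [5, 10].
pieces = [([0.02, 0.01], (0.0, 5.0)), ([0.12, -0.01], (5.0, 10.0))]
print(bin_probabilities(pieces))
```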
The third natural question is whether learning polynomials is beneficial at all. We mentioned earlier that unsupervised binning along with a simple binarisation scheme can be seen as a simplistic hybrid model, an instance of piecewise constant WMISPNs, for which LearnSPN suffices. Clearly, such an approach comes at a significant loss of expressiveness, e.g., no interval querying. So, by letting BIC determine the polynomial order but explicitly setting the bin parameter, we can contrast LearnSPN and LearnWMISPN. Note that, since each bin range corresponds to a distribution represented by a given polynomial, further increasing the bins of a feature results in different polynomial representations. (Analogously, classical SPNs redistribute the spread of discrete probabilities.) In Figure 5, the plotted average log-likelihoods over bin complexity are normalized by the number of new variables generated from equal-width binning. We see that accuracy does increase as the number of bins per feature increases, and thus LearnWMISPN yields a more accurate representation.
Comparisons are between LearnWMISPN (red) and LearnSPN (green) [10]. Note that LearnWMISPN yields a highly granular and intricate representation: for example, at 5 bins per feature, the first bin of the first feature (the pixel mean), defined over the range 3.0 ≤ x ≤ 20.1, receives a density approximation that is a 3rd-degree polynomial, whereas at 15 bins per feature the density approximation over the range 3.0 ≤ x ≤ 8.7 is a 4th-degree polynomial.
Q3 Query Interface: The capacity to query SPNs trained on continuous domains represents a significant improvement in terms of interpreting data. To assess complex queries posed to WMISPNs, we studied the computation time for inference. The three data sets used were Cloud, Statlog (Shuttle), and MiniBooNE; all consist of continuous features, with Statlog containing a single discrete feature. To measure complexity, we use query length (see Section 5) and denote continuous atoms by ci and discrete atoms by di, with i the atom count. We generated 10 random queries based on the continuous ranges of each data set (including discrete outcomes for Statlog) and averaged the inference time, fixing the number of bins per continuous feature to two.
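The timing protocol can be sketched as follows: for each query length, draw random sub-intervals inside the observed feature ranges, pose the conjunction to the model, and average the wall-clock time over the generated queries. The model.query call below is a hypothetical stand-in for whichever inference entry point the implementation exposes.

```python
import time
import numpy as np

def random_interval_query(feature_ranges, length, rng):
    """Draw a conjunction of `length` random interval atoms c_i."""
    atoms = []
    chosen = rng.choice(len(feature_ranges), size=length, replace=False)
    for f in chosen:
        lo, hi = feature_ranges[f]
        a, b = np.sort(rng.uniform(lo, hi, size=2))
        atoms.append((f, a, b))          # atom: a <= x_f <= b
    return atoms

def average_query_time(model, feature_ranges, length, n_queries=10, seed=0):
    """Average inference time (ns/query) over randomly generated queries."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_queries):
        q = random_interval_query(feature_ranges, length, rng)
        start = time.perf_counter_ns()
        model.query(q)                   # hypothetical WMISPN interface
        total += time.perf_counter_ns() - start
    return total / n_queries
```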
Table 4 reports the inference times for the complex queries. Overall, complex queries do not slow down the model: for all query lengths, the time per query remains within a few thousand nanoseconds. As the query length increases, more time is required, but the increase is roughly linear. The model also handles queries mixing continuous and discrete atoms, further demonstrating the capability of WMISPNs.
Table 4. Average inference times for complex queries of increasing length; ci denotes a query with i continuous atoms and di one discrete atom.

Data set | q1: c1 | q2: c2 | q2: c1, d1 | q3: c3 | q3: c2, d1 | q4: c4 | q4: c3, d1 | q5: c5 | q5: c4, d1
---|---|---|---|---|---|---|---|---|---
Cloud | 1401 | 2154 | n/a | 2563 | n/a | 3659 | n/a | 4025 | n/a
Statlog | 1299 | n/a | 1878 | n/a | 2365 | n/a | 2749 | n/a | 2982
MiniBooNE | 1168 | 2071 | n/a | 2219 | n/a | 2290 | n/a | 3048 | n/a
Note: Time per query is measured in (ns/query).
Finally, to measure the accuracy of the learned models, we calculated the mean squared error (MSE) of the inference likelihoods on three synthetic data sets. Each data set comprised 1,000 instances sampled from, respectively, a Gaussian N(0.7, 0.2), an exponential distribution (λ = 1), and a mixture of a Gaussian N(0.5, 0.1) and a Beta Be(2, 20). We generated 150 complex queries, each of length 1 (a sketch of this evaluation follows Figure 7). From the results in Figure 7, we observe the overall accuracy of the model in handling complex queries. The MSE tends to decrease as the bin count increases, with the exception of the exponential distribution, where the MSE was lowest with 3 bins but otherwise remained steady. The learned WMISPN is also robust when performing inference on mixtures of distributions, again reflected in the decreasing error rate. In effect, our model performs tractable inference and returns accurate likelihoods for complex queries on a variety of distributions.
Figure 7. The inference likelihood mean squared error for sampled data taken from Gaussian, exponential, and a mixture of Gaussian and Beta distributions.
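A reproduction of this check could proceed as in the sketch below: compare the model's likelihood for random length-1 interval queries against the ground-truth mass obtained from the analytic CDFs of the sampled distributions, and average the squared error. The model.query call, the exact distribution parameters, and the equal mixture weights are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

def query_mse(model, truth_cdf, lo, hi, n_queries=150, seed=0):
    """Mean squared error between model interval likelihoods and the
    ground-truth probability mass from an analytic CDF."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(n_queries):
        a, b = np.sort(rng.uniform(lo, hi, size=2))
        truth = truth_cdf(b) - truth_cdf(a)
        pred = model.query([(0, a, b)])   # hypothetical single-atom query
        errs.append((pred - truth) ** 2)
    return float(np.mean(errs))

# Ground-truth CDFs for the three synthetic data sets (parameters as
# reported in the text, with scale interpreted as standard deviation).
gaussian_cdf = stats.norm(loc=0.7, scale=0.2).cdf
expon_cdf = stats.expon(scale=1.0).cdf            # lambda = 1

def mixture_cdf(x):
    # equal mixture weights assumed for illustration
    return 0.5 * stats.norm(0.5, 0.1).cdf(x) + 0.5 * stats.beta(2, 20).cdf(x)
```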
It is worth summarising the comparisons to MSPNs [28] and related works [30, 31]. Regarding MSPNs, the differences in the general framework include our use of the G-test versus their Rényi decomposition; since we abstract continuous features into discretised bins, the G-test suffices. We also reiterate our use of piecewise polynomials rather than piecewise constant or linear functions, which allows our model to perform interval comparisons. Moreover, MSPN inference operates over the mixed representation, whereas our model is WMI-based, which lets us perform complex querying with respect to arbitrary intervals. Contrast this with the MSPN SQL approach [30], which uses piecewise constant and linear functions for inference on continuous random variables and is generally dissimilar to our method (see Section 5), as SQL queries are more expansive. [31] also explores inference with MSPNs, replacing the piecewise linear approximations at the leaf nodes with a likelihood dictionary whose parameters are learned via Gibbs sampling. Again, our model is agnostic to the distribution of the data and learns piecewise polynomial approximations, which augment our inference mechanism with complex querying. Future research would benefit from integrating these methods, and we are also well placed to go beyond intervals and consider oblique supports (see Section 7).
7. CONCLUSION
Deep architectures are powerful learning paradigms that capture latent structure and have proven very successful in machine learning. Guessing the right architecture for complex data is challenging, and so paradigms such as SPNs are attractive alternatives, providing an unsupervised learning regime in addition to robust inference computations.
Here, we pushed the envelope further and considered a systematic integration of SPNs and WMI, allowing us to learn tractable, non-parametric distributions, and convex combinations thereof, for hybrid data, also in an unsupervised fashion. The integration was achieved by minimally adapting the base case of an SPN structure learning module, which makes our approach generic to a large extent. Differently from our predecessors, we show for the first time how tractable distributions of arbitrary granularity can be learned and, more importantly, how to query these distributions over a rich interval syntax. Our empirical results show that our implemented system is effective and scalable, and incurs very little cost for handling continuous features, all of which is desirable for learning from big uncertain data. One interesting direction is to extend the underlying constraint language to handle oblique supports (e.g., x > y and x + y > 2) rather than only intervals. A second challenge for the future is extending the query language with features such as counting operators, which would allow us to reason about the cardinality of sets of objects in an image, thus enabling an interface for commonsense reasoning within deep architectures.
AUTHOR CONTRIBUTIONS
A. Bueff ([email protected]), S. Speichert ([email protected]) and V. Belle ([email protected]) were all responsible for the design of the research, the implementation of the approach, and the analysis of the results. All authors contributed to the writing of the manuscript and edited and reviewed the final version of the article.
ACKNOWLEDGEMENTS
Andreas Bueff was partly supported by EPSRC Platform Grant EP/N014758/1. Vaishak Belle was partly supported by the EPSRC grant Towards Explainable and Robust Statistical AI: A Symbolic Approach, and partly supported by a Royal Society University Research Fellowship.
A preliminary version of this work has been online since July 2018: https://arxiv.org/abs/1807.05464.
It is possible that learning polynomials of arbitrary degree over intervals of arbitrary granularity is compatible with their approach, but without a discussion of the trade-off between accuracy and representation, and an algorithm for learning such polynomials, it is hard to assess such potential extensions.
Oblique supports such as (x > y) or (x − y > 5) may also be considered [45], but since we are interested in weighted mixtures of univariate distributions, we direct our attention to intervals here. Intervals can also be used to define many classes of multivariate distributions via hypercubes [45], which are intervals extended to multiple dimensions.
Note that we may place additional constraints on the learning regime. For example, [17] show that one can find a candidate hypothesis f ∈ Pk,n(I) such that ‖f − g‖, with g the unknown true density, can be made arbitrarily small. We chose the basis-spline technique, but none of our technical results hinge on opting for either of these methods.
It is interesting to observe that B-splines have been influential in a number of disciplines, including physics, engineering, and computer vision, and our algorithm contributes, for the first time, "deep B-splines". In the future, it would be interesting to study how LearnWMISPN could be applied in these domains, for example for modelling latent dependencies in computer graphics.