Abstract

The mixture-of-experts (MoE) model is a popular neural network architecture for nonlinear regression and classification. The class of MoE mean functions is known to be uniformly convergent to any unknown target function, assuming that the target function is from a Sobolev space that is sufficiently differentiable and that the domain of estimation is a compact unit hypercube. We provide an alternative result, which shows that the class of MoE mean functions is dense in the class of all continuous functions over arbitrary compact domains of estimation. Our result can be viewed as a universal approximation theorem for MoE models. The theorem we present allows MoE users to be confident in applying such models for estimation when data arise from nonlinear and nondifferentiable generative processes.

1  Introduction

The mixture-of-experts (MoE) model is a neural network architecture for nonlinear regression and classification. The model was introduced in Jacobs, Jordan, Nowlan, and Hinton (1991) and Jordan and Jacobs (1994); reviews can be found in McLachlan and Peel (2000) and Yuksel, Wilson, and Gader (2012). Recent research includes Chamroukhi, Glotin, and Same (2013) and Nguyen and McLachlan (2016), where MoE models are used for curve classification and robust estimation, respectively.

Let be a random variable and be a vector. Let the conditional probability density function of given be
1.1
where
is a realization of , and is a univariate component probability density function (PDF) (in ) with mean and nuisance parameter . Here , , and for each , and . We say that equation 1.1 is a MoE with mean function
1.2
where is the function’s parameter vector. The superscript indicates matrix transposition.
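For orientation, one common way to write a model of this kind is sketched below. The notation is assumed rather than taken from equations 1.1 and 1.2: the softmax gates $\pi_i$ with parameters $c_i$ and $d_i$, the component PDF $h$ with linear mean $a_i + b_i^\top x$, and the nuisance parameters $\nu_i$ are illustrative choices, and $g$ denotes the number of components.

```latex
% Assumed notation: g components, gate parameters (c_i, d_i), expert parameters (a_i, b_i).
\begin{align*}
f(y \mid x) &= \sum_{i=1}^{g} \pi_i(x)\, h\!\left(y;\, a_i + b_i^\top x,\, \nu_i\right),
&
\pi_i(x) &= \frac{\exp\!\left(c_i + d_i^\top x\right)}
                 {\sum_{j=1}^{g} \exp\!\left(c_j + d_j^\top x\right)}, \\
m(x) &= \sum_{i=1}^{g} \pi_i(x)\left(a_i + b_i^\top x\right).
\end{align*}
```

Under this reading, the mean function is a gate-weighted average of the component means, which is why it depends only on the gate and expert parameters and not on the nuisance parameters.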

Zeevi, Meir, and Maiorov (1998) showed that there exists a sequence of MoE mean functions that converges uniformly to any target function, assuming that the target belongs to a Sobolev class of functions and that the domain is a closed unit hypercube. Their result was generalized to nonlinear mappings of the mean expression in the component PDFs of the MoE in Jiang and Tanner (1999b). The result from Jiang and Tanner (1999b) was expanded on in Jiang and Tanner (1999a), where it was shown that there exists a sequence of conditional PDFs that converges in Kullback-Leibler divergence to any target conditional PDF, assuming that the target belongs to the one-parameter exponential family of density functions; extensions to multivariate conditional density estimation are obtained in Norets (2010). Convergence results for MoE models with polynomial mean functions were obtained in Mendes and Jiang (2012). We note that the target mean function is assumed to belong to a Sobolev class of functions in each of Jiang and Tanner (1999a, 1999b) and Mendes and Jiang (2012), as it is in Zeevi et al. (1998).

Consider the class of all mean functions of form 1.2 and the class of continuous functions on the domain of estimation. In this note, we prove that the former is dense in the latter under the assumption that the domain is compact. Our result is obtained via the Stone-Weierstrass theorem (Stone, 1948; see also Cotter, 1990, for a discussion in the context of neural networks). Our result is a universal approximation theorem, similar in spirit to Cybenko (1989, theorem 2), where linear combinations of sigmoidal functions are proved dense in the class of continuous functions on the unit hypercube. We show denseness of approximations to conditional mean functions, whereas Cybenko (1989) targets a marginal multivariate mean function. Our results allow MoE users to be confident in applying such models for estimation when data arise from nonlinear and nondifferentiable generative processes, as they improve on the guarantees of Zeevi et al. (1998).

2  Main Result

Define to be a vector of zeros of an appropriate dimensionality. In order to facilitate the proofs, let
where
and is the function’s parameter vector. Note that
if ; thus .
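
Theorem 1.

The class introduced in this section is dense in the class of continuous functions on the compact domain. Furthermore, the class of all mean functions of form 1.2 is dense in the same class, since it contains the former.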
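To make the denseness claim concrete, the sketch below builds a softmax-gated mean function with constant experts by hand and measures its worst-case error against a continuous but nondifferentiable target on [0, 1]. Everything here is an illustrative assumption: the softmax gate form, the function names `moe_mean` and `grid_moe`, the target |x - 0.5|, and the grid-based parameter choice, which is a convenient construction and not the argument used in the proof.

```python
import numpy as np


def moe_mean(x, c, d, a):
    """Softmax-gated mean function with constant experts (assumed form).

    x: (n,) inputs; c, d: (g,) gate intercepts and slopes; a: (g,) expert values.
    """
    logits = c[None, :] + np.outer(x, d)           # shape (n, g)
    logits -= logits.max(axis=1, keepdims=True)    # stabilise the softmax
    gates = np.exp(logits)
    gates /= gates.sum(axis=1, keepdims=True)      # each row sums to one
    return gates @ a


def grid_moe(target, g, sharpness):
    """Hypothetical hand-built parameters: one expert per grid point t_i on [0, 1].

    The logit s*t_i*x - s*t_i**2/2 equals -s*(x - t_i)**2/2 up to a term that
    does not depend on i, so each gate concentrates near its own grid point.
    """
    t = np.linspace(0.0, 1.0, g)
    return -0.5 * sharpness * t**2, sharpness * t, target(t)


def target(x):
    return np.abs(x - 0.5)      # continuous, not differentiable at 0.5


x = np.linspace(0.0, 1.0, 2001)
errors = {}
for g in (6, 20, 80):
    c, d, a = grid_moe(target, g, sharpness=4.0 * g**2)
    # Largest absolute error over the evaluation grid (a proxy for the sup norm).
    errors[g] = float(np.max(np.abs(moe_mean(x, c, d, a) - target(x))))
    print(g, errors[g])
```

The printed errors shrink as the number of components grows, which is the behaviour that a denseness result leads one to expect; the sketch says nothing about how fast such errors decay in general.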

3  Comparisons to Zeevi et al. (1998)

Zeevi et al. (1998, theorem 1) proved the class of MoE mean functions dense within a Sobolev class over the closed unit hypercube domain (see Zeevi et al., 1998, for definitions). First, unlike Zeevi et al. (1998), we make no assumptions on the domain other than compactness. Second, our target space of continuous functions places no restrictions on differentiability, whereas the Sobolev class requires the partial derivatives to exist. Finally, we do not require the target function or its partial derivatives to be measurable or bounded, whereas Zeevi et al. (1998) require partial derivatives up to a given order to be measurable and to possess finite norms bounded by a constant.

Unfortunately, by operating in the class of continuous functions rather than a Sobolev class, we are unable to obtain convergence rates for MoE mean functions to continuous target functions. The convergence rates obtained in Zeevi et al. (1998) are conditional on the order of differentiability and the norm order.

4  Proof of Main Result

The Stone-Weierstrass theorem can be phrased as follows (cf. Cotter, 1990):

Theorem 2.

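Let a compact set be given, together with a set of continuous real-valued functions on it. Assume that

(i) The constant function with value 1 is in the set.

(ii) For any two distinct points of the compact set, there exists a function in the set that takes different values at the two points.

(iii) Every real scalar multiple of a function in the set is also in the set.

(iv) The pointwise product of any two functions in the set is also in the set.

(v) The pointwise sum of any two functions in the set is also in the set.

If assumptions i to v are true, then the set is dense in the class of continuous real-valued functions on the compact set. In other words, for any continuous target function and any positive tolerance, there exists a function in the set whose uniform distance from the target is smaller than the tolerance.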

We note that, in Euclidean spaces, a set is compact if and only if it is bounded and closed (see Dudley, 2004, chap. 2). We proceed to prove the denseness claimed in theorem 1 by verifying assumptions i to v in turn.

Lemma 1.

The constant function is in .

Proof.

Let and . Set . For any choice of , . We obtain the result by noting that .

Lemma 2.

For any two points such that , there exists a function such that .

Proof.
Let and , where . Set , and assume that for , such that . Let ; this is equivalent to
by substitution and reduces to
4.1

Equation 4.1 is violated if either , which causes a contradiction, or if is such that whenever , for . To avoid violation of equation 4.1, we can set for all .

Thus, let and , where and . If , then . We obtain the result by noting that .
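One way to see why separation is achievable is the two-component sketch below. The notation is assumed: softmax gates with zero intercepts and slopes $d_1, d_2$, and constant experts $a_1 = 1$ and $a_2 = 0$. It is an illustrative construction and may differ from the parameter choice used in the proof above.

```latex
% Two components, assumed softmax gating, constant experts a_1 = 1 and a_2 = 0.
\begin{align*}
m(x) &= \frac{\exp(d_1^\top x)}{\exp(d_1^\top x) + \exp(d_2^\top x)}
      = \sigma\!\left((d_1 - d_2)^\top x\right),
\qquad \sigma(t) = \frac{1}{1 + e^{-t}}. \\
% Choose d_1 = x_1 - x_2 and d_2 = 0 for the two distinct points x_1 and x_2.
(d_1 - d_2)^\top x_1 - (d_1 - d_2)^\top x_2 &= \lVert x_1 - x_2 \rVert^2 > 0
\quad \text{whenever } x_1 \neq x_2 .
\end{align*}
% Because \sigma is strictly increasing, it follows that m(x_1) > m(x_2).
```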

Lemma 3.

If and , then .

Proof.
Let and . We can write
where for . Thus, , where . We obtain the result by noting that .
Lemma 4.

If , then .

Proof.
Let ,
and
and set and . Here, the superscripts and denote the parameter components belonging to the functions and , respectively. We can write
4.2
To simplify equation 4.2, for each and , we can write
4.3
On performing the mapping from Table 1A, we can write the final line of equation 4.3 as , where for . Furthermore, via the mapping from Table 1A, equation 4.2 can be simplified to
where . We obtain the result by noting that .
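The key algebraic fact behind merging two sets of gates into one can be written out as follows, under the assumption of softmax gating. The symbols $c$ and $d$ for the gate parameters, the bracketed superscripts, and the pair index $(i, j)$ are illustrative and need not match the notation of Table 1.

```latex
\begin{align*}
\pi_i^{[f]}(x)\,\pi_j^{[g]}(x)
&= \frac{\exp\!\left(c_i^{[f]} + d_i^{[f]\top} x\right)}
        {\sum_{k} \exp\!\left(c_k^{[f]} + d_k^{[f]\top} x\right)}
   \cdot
   \frac{\exp\!\left(c_j^{[g]} + d_j^{[g]\top} x\right)}
        {\sum_{l} \exp\!\left(c_l^{[g]} + d_l^{[g]\top} x\right)} \\
&= \frac{\exp\!\left\{\left(c_i^{[f]} + c_j^{[g]}\right)
         + \left(d_i^{[f]} + d_j^{[g]}\right)^{\top} x\right\}}
        {\sum_{k}\sum_{l} \exp\!\left\{\left(c_k^{[f]} + c_l^{[g]}\right)
         + \left(d_k^{[f]} + d_l^{[g]}\right)^{\top} x\right\}} .
\end{align*}
```

Read over all pairs $(i, j)$, the products therefore form a single softmax gate whose number of components is the product of the two component counts and whose parameters are the pairwise sums.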
Lemma 5.

If , then .

Table 1: Mapping of Parameter Components for Lemmas 4 and 5 (A: Lemma 4; B: Lemma 5).

Proof.
Let ,
and
and set and . Here, the superscripts and denote the parameter components belonging to the functions and , respectively. We can write
4.4
On performing the mapping from Table 1B, we can write equation 4.4 as
where and for . We obtain the result by noting that .
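The closure properties established in the last two lemmas can be checked numerically. The sketch below assumes softmax gating with constant experts and a scalar input; the names `moe_mean` and `combine` and the random parameter values are hypothetical. It merges two mean functions into a single one over all index pairs and checks whether the merged function reproduces their pointwise product and their pointwise sum.

```python
import numpy as np


def moe_mean(x, c, d, a):
    """Softmax-gated mean function with constant experts (assumed form)."""
    logits = c[None, :] + np.outer(x, d)           # shape (n, g)
    logits -= logits.max(axis=1, keepdims=True)    # stabilise the softmax
    gates = np.exp(logits)
    gates /= gates.sum(axis=1, keepdims=True)
    return gates @ a


def combine(params_f, params_g, op):
    """Merge two parameter sets over all (i, j) pairs.

    Gate intercepts and slopes add; expert values are combined with `op`
    (np.multiply for the product, np.add for the sum).
    """
    (cf, df, af), (cg, dg, ag) = params_f, params_g
    c = np.add.outer(cf, cg).ravel()
    d = np.add.outer(df, dg).ravel()
    a = op.outer(af, ag).ravel()
    return c, d, a


rng = np.random.default_rng(0)
params_f = tuple(rng.normal(size=3) for _ in range(3))   # (c, d, a), 3 components
params_g = tuple(rng.normal(size=4) for _ in range(3))   # (c, d, a), 4 components
x = np.linspace(-2.0, 2.0, 101)

f = moe_mean(x, *params_f)
g = moe_mean(x, *params_g)
prod_ok = np.allclose(moe_mean(x, *combine(params_f, params_g, np.multiply)), f * g)
sum_ok = np.allclose(moe_mean(x, *combine(params_f, params_g, np.add)), f + g)
print(prod_ok, sum_ok)
```

The sum case works because each set of gates sums to one, so a single-index sum can be rewritten as a sum over all pairs without changing its value.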

Lemmas 1 to 5 imply that the class satisfies assumptions i to v of theorem 2; thus theorem 1 is proved.

5  Conclusion

In this note, we utilized the Stone-Weierstrass theorem to prove that the class of MoE mean functions is dense in the class of continuous functions on a compact domain.

Unlike in Zeevi et al. (1998), Jiang and Tanner (1999a, 1999b), and Mendes and Jiang (2012), we do not obtain convergence rates. Furthermore, our result does not guarantee statistical estimability of the MoE mean functions. Maximum likelihood (ML) estimation can obtain consistent estimates of mean functions when the number of components is known (see Zeevi et al., 1998; Jiang & Tanner, 2000; and Nguyen & McLachlan, 2016). Results regarding regularized ML estimation of MoE models were obtained in Khalili (2010). In Grun and Leisch (2007) and Nguyen and McLachlan (2016), the Bayesian information criterion (BIC; Schwarz, 1978) is shown to be effective for determining an unknown number of components (see Olteanu & Rynkiewicz, 2011, for theoretical justification of the BIC).

References

Chamroukhi, F., Glotin, H., & Same, A. (2013). Model-based functional mixture discriminant analysis with hidden process regression for curve classification. Neurocomputing, 112, 153–163.

Cotter, N. E. (1990). The Stone-Weierstrass theorem and its application to neural networks. IEEE Transactions on Neural Networks, 1, 290–295.

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2, 303–314.

Dudley, R. M. (2004). Real analysis and probability. Cambridge: Cambridge University Press.

Grun, B., & Leisch, F. (2007). Fitting finite mixtures of generalized linear regressions in R. Computational Statistics and Data Analysis, 51, 5247–5252.

Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3, 79–87.

Jiang, W., & Tanner, M. A. (1999a). Hierarchical mixtures-of-experts for exponential family regression models: approximation and maximum likelihood estimation. Annals of Statistics, 27, 987–1011.

Jiang, W., & Tanner, M. A. (1999b). On the approximation rate of hierarchical mixtures-of-experts for generalized linear models. Neural Computation, 11, 1183–1198.

Jiang, W., & Tanner, M. A. (2000). On the asymptotic normality of hierarchical mixtures-of-experts for generalized linear models. IEEE Transactions on Information Theory, 46, 1005–1013.

Jordan, M. I., & Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6, 181–214.

Khalili, A. (2010). New estimation and feature selection methods in mixture-of-experts models. Canadian Journal of Statistics, 38, 519–539.

McLachlan, G. J., & Peel, D. (2000). Finite mixture models. New York: Wiley.

Mendes, E. F., & Jiang, W. (2012). On convergence rates of mixture of polynomial experts. Neural Computation, 24, 3025–3051.

Nguyen, H. D., & McLachlan, G. J. (2016). Laplace mixture of linear experts. Computational Statistics and Data Analysis, 93, 177–191.

Norets, A. (2010). Approximation of conditional densities by smooth mixtures of regressions. Annals of Statistics, 38, 1733–1766.

Olteanu, M., & Rynkiewicz, J. (2011). Asymptotic properties of mixture-of-experts models. Neurocomputing, 74, 1444–1449.

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.

Stone, M. H. (1948). The generalized Weierstrass approximation theorem. Mathematics Magazine, 21, 237–254.

Yuksel, S. E., Wilson, J. N., & Gader, P. D. (2012). Twenty years of mixture of experts. IEEE Transactions on Neural Networks and Learning Systems, 23, 1177–1193.

Zeevi, A. J., Meir, R., & Maiorov, V. (1998). Error bounds for functional approximation and estimation using mixtures of experts. IEEE Transactions on Information Theory, 44, 1010–1025.