## Abstract

The mixture-of-experts (MoE) model is a popular neural network architecture for nonlinear regression and classification. The class of MoE mean functions is known to be uniformly convergent to any unknown target function, assuming that the target function is from a Sobolev space that is sufficiently differentiable and that the domain of estimation is a compact unit hypercube. We provide an alternative result, which shows that the class of MoE mean functions is dense in the class of all continuous functions over arbitrary compact domains of estimation. Our result can be viewed as a universal approximation theorem for MoE models. The theorem we present allows MoE users to be confident in applying such models for estimation when data arise from nonlinear and nondifferentiable generative processes.

## 1 Introduction

The mixture-of-experts (MoE) model is a neural network architecture for nonlinear regression and classification. The model was introduced in Jacobs, Jordan, Nowlan, and Hinton (1991) and Jordan and Jacobs (1994); reviews can be found in McLachlan and Peel (2000) and Yuksel, Wilson, and Gader (2012). Recent research includes Chamroukhi, Glotin, and Samé (2013) and Nguyen and McLachlan (2016), where MoE models are used for curve classification and robust estimation, respectively.

Zeevi, Meir, and Maiorov (1998) showed that there exists a sequence of MoE mean functions $\{m_K\}$ that converges uniformly to any target function $f$, in the index $K$, assuming that $f$ belongs to a Sobolev class of functions and that the domain of estimation is a closed unit hypercube. Their result was generalized to nonlinear mappings of the linear expression in the component PDFs of the MoE in Jiang and Tanner (1999b). The result from Jiang and Tanner (1999b) was expanded on in Jiang and Tanner (1999a), where it was shown that there exists a sequence of MoE conditional PDFs that converges in Kullback-Leibler divergence to any target conditional PDF, in the index $K$, assuming that the target belongs to the one-parameter exponential family of density functions; extensions to multivariate conditional density estimation are obtained in Norets (2010). Convergence results for MoE models with polynomial mean functions were obtained in Mendes and Jiang (2012). We note that the target mean function is assumed to belong to a Sobolev class of functions in each of Jiang and Tanner (1999a, 1999b) and Mendes and Jiang (2012), as it is in Zeevi et al. (1998).

Our result is analogous to the classical result of Cybenko (1989, theorem 2), where linear combinations of sigmoidal functions are proved dense in $C([0,1]^{d})$. We show denseness of approximations to conditional mean functions, whereas Cybenko (1989) targets a marginal multivariate mean function. Our results allow MoE users to be confident in applying such models for estimation when data arise from nonlinear and nondifferentiable generative processes, as they improve on the guarantees of Zeevi et al. (1998).

## 2 Main Result

Let $\mathbb{X} \subset \mathbb{R}^{d}$ be a compact domain of estimation, and let $C(\mathbb{X})$ denote the class of continuous functions on $\mathbb{X}$, equipped with the uniform norm $\lVert f \rVert_{\infty} = \sup_{\boldsymbol{x} \in \mathbb{X}} \lvert f(\boldsymbol{x}) \rvert$. For $K \in \mathbb{N}$, define the softmax gating functions

$$g_k(\boldsymbol{x}) = \frac{\exp\left(\boldsymbol{a}_k^{T}\boldsymbol{x} + b_k\right)}{\sum_{l=1}^{K} \exp\left(\boldsymbol{a}_l^{T}\boldsymbol{x} + b_l\right)}, \quad k = 1, \dots, K,$$

where $\boldsymbol{a}_k \in \mathbb{R}^{d}$ and $b_k \in \mathbb{R}$, and let

$$\mathcal{M} = \left\{ m : m(\boldsymbol{x}) = \sum_{k=1}^{K} g_k(\boldsymbol{x})\, c_k, \ K \in \mathbb{N}, \ c_k \in \mathbb{R} \right\}$$

denote the class of MoE mean functions with constant experts. Let $\mathcal{M}^{*}$ denote the corresponding class with linear experts, obtained by replacing each constant $c_k$ with $\boldsymbol{c}_k^{T}\boldsymbol{x} + c_{0k}$.

**Theorem 1.** The class $\mathcal{M}$ is dense in $C(\mathbb{X})$. Furthermore, the class $\mathcal{M}^{*}$ is dense in $C(\mathbb{X})$, since $\mathcal{M} \subset \mathcal{M}^{*}$ (take $\boldsymbol{c}_k = \boldsymbol{0}$ and $c_{0k} = c_k$).
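As an informal illustration of theorem 1, the following Python sketch (ours, not part of the formal development) constructs an explicit member of $\mathcal{M}$ that approximates the nondifferentiable target $f(x) = \lvert x - 1/2 \rvert$ on the compact domain $[0, 1]$. The grid points $z_k$ and the sharpness parameter $\beta$ are our own illustrative choices: with $\boldsymbol{a}_k = \beta z_k$ and $b_k = -\beta z_k^{2}/2$, each gate concentrates around $z_k$, and the constant experts are set to $c_k = f(z_k)$.

```python
import numpy as np

# Numerical sketch of theorem 1 (illustrative only): an explicit member of
# the class M approximating the nondifferentiable target f(x) = |x - 1/2|
# on the compact domain [0, 1].
f = lambda x: np.abs(x - 0.5)

x = np.linspace(0.0, 1.0, 2001)                  # evaluation grid over [0, 1]
for K in (4, 16, 64, 256):
    z = (np.arange(K) + 0.5) / K                 # gate centres z_1, ..., z_K (our choice)
    beta = 4.0 * K**2                            # gate sharpness (our choice)
    # Softmax gates with a_k = beta * z_k and b_k = -beta * z_k^2 / 2;
    # algebraically, g_k(x) is then a normalised Gaussian bump around z_k.
    logits = beta * np.outer(x, z) - 0.5 * beta * z**2
    logits -= logits.max(axis=1, keepdims=True)  # guard against overflow
    g = np.exp(logits)
    g /= g.sum(axis=1, keepdims=True)
    m = g @ f(z)                                 # constant experts c_k = f(z_k)
    print(f"K = {K:4d}: sup-norm error = {np.abs(m - f(x)).max():.4f}")
```

The reported sup-norm error shrinks as $K$ grows, in line with the denseness claim, even though the target is not differentiable at $x = 1/2$.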

## 3 Comparisons to Zeevi et al. (1998)

Zeevi et al. (1998, theorem 1) proved that the class of MoE mean functions is dense within the Sobolev class $\mathcal{W}_{p}^{r}$ over the closed-unit hypercube domain $[0,1]^{d}$ (see Zeevi et al., 1998, for definitions). First, unlike Zeevi et al. (1998), we make no assumptions on the domain $\mathbb{X}$ other than compactness. Second, the target space $C(\mathbb{X})$ makes no restrictions on differentiability, whereas $\mathcal{W}_{p}^{r}$ requires the partial derivatives up to order $r$ to exist. Finally, we do not require the target function or its partial derivatives to be measurable or bounded, whereas Zeevi et al. (1998) require partial derivatives up to order $r$ to be measurable and possess finite norms bounded by a common constant.

Unfortunately, by operating in $C(\mathbb{X})$ rather than $\mathcal{W}_{p}^{r}$, we are unable to obtain convergence rates for functions from $\mathcal{M}$ to target functions in $C(\mathbb{X})$. The convergence rates obtained in Zeevi et al. (1998) are conditional on the differentiability order $r$ and the norm order $p$.

## 4 Proof of Main Result

The Stone-Weierstrass theorem can be phrased as follows (cf. Cotter, 1990):

**Theorem 2.** Let $\mathbb{X} \subset \mathbb{R}^{d}$ be a compact set, and let $\mathcal{A}$ be a set of continuous real-valued functions on $\mathbb{X}$. Assume the following:

i. The constant function $f(\boldsymbol{x}) = 1$ is in $\mathcal{A}$.

ii. For any two points $\boldsymbol{x}, \boldsymbol{y} \in \mathbb{X}$ such that $\boldsymbol{x} \neq \boldsymbol{y}$, there exists a function $f \in \mathcal{A}$ such that $f(\boldsymbol{x}) \neq f(\boldsymbol{y})$.

iii. If $f \in \mathcal{A}$ and $c \in \mathbb{R}$, then $cf \in \mathcal{A}$.

iv. If $f, g \in \mathcal{A}$, then $f + g \in \mathcal{A}$.

v. If $f, g \in \mathcal{A}$, then $fg \in \mathcal{A}$.

If assumptions i to v are true, then $\mathcal{A}$ is dense in $C(\mathbb{X})$. In other words, for any $f \in C(\mathbb{X})$ and any $\epsilon > 0$, there exists a $g \in \mathcal{A}$ such that $\sup_{\boldsymbol{x} \in \mathbb{X}} \lvert f(\boldsymbol{x}) - g(\boldsymbol{x}) \rvert < \epsilon$.

We note that $\mathbb{X} \subset \mathbb{R}^{d}$ is compact if and only if it is bounded and closed in Euclidean spaces (see Dudley, 2004, chap. 2). We proceed to prove that $\mathcal{M}$ is dense in $C(\mathbb{X})$.

**Lemma 3.** The constant function $m(\boldsymbol{x}) = 1$ is in $\mathcal{M}$.

*Proof.* Let $K = 1$ and $c_1 = 1$. Set $m(\boldsymbol{x}) = g_1(\boldsymbol{x})\, c_1$. For any choice of $\boldsymbol{a}_1$ and $b_1$, $g_1(\boldsymbol{x}) = \exp(\boldsymbol{a}_1^{T}\boldsymbol{x} + b_1)/\exp(\boldsymbol{a}_1^{T}\boldsymbol{x} + b_1) = 1$, so $m(\boldsymbol{x}) = 1$. We obtain the result by noting that $m \in \mathcal{M}$.

**Lemma 4.** For any two points $\boldsymbol{x}, \boldsymbol{y} \in \mathbb{X}$ such that $\boldsymbol{x} \neq \boldsymbol{y}$, there exists a function $m \in \mathcal{M}$ such that $m(\boldsymbol{x}) \neq m(\boldsymbol{y})$.

*Proof.* Let $K = 2$, and consider the two-component mean function with $\boldsymbol{a}_2 = \boldsymbol{0}$ and $b_1 = b_2 = 0$, namely

$$m(\boldsymbol{z}) = \frac{\exp\left(\boldsymbol{a}_1^{T}\boldsymbol{z}\right) c_1 + c_2}{\exp\left(\boldsymbol{a}_1^{T}\boldsymbol{z}\right) + 1}.$$

Upon cross-multiplication, the requirement $m(\boldsymbol{x}) \neq m(\boldsymbol{y})$ is equivalent to

$$(c_1 - c_2)\left[\exp\left(\boldsymbol{a}_1^{T}\boldsymbol{x}\right) - \exp\left(\boldsymbol{a}_1^{T}\boldsymbol{y}\right)\right] \neq 0. \tag{4.1}$$

Equation 4.1 is violated if either $c_1 = c_2$, or if $\boldsymbol{a}_1$ is such that $\boldsymbol{a}_1^{T}(\boldsymbol{x} - \boldsymbol{y}) = 0$, which can occur for particular choices of $\boldsymbol{a}_1$ even when $\boldsymbol{x} \neq \boldsymbol{y}$. To avoid violation of equation 4.1, we can set $a_{1j} = x_j - y_j$ for all $j = 1, \dots, d$, together with $c_1 \neq c_2$.

Thus, let $c_1 = 1$ and $c_2 = 0$, where $\boldsymbol{a}_1 = \boldsymbol{x} - \boldsymbol{y}$ and $b_1 = b_2 = 0$. If $\boldsymbol{x} \neq \boldsymbol{y}$, then $\boldsymbol{a}_1^{T}(\boldsymbol{x} - \boldsymbol{y}) = \lVert \boldsymbol{x} - \boldsymbol{y} \rVert^{2} > 0$, so $m(\boldsymbol{x}) \neq m(\boldsymbol{y})$. We obtain the result by noting that $m \in \mathcal{M}$.
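For intuition (this reformulation is ours and is not needed for the argument), the separating function above is simply a logistic sigmoid,

$$m(\boldsymbol{z}) = c_2 + (c_1 - c_2)\,\sigma\!\left(\boldsymbol{a}_1^{T}\boldsymbol{z}\right), \qquad \sigma(t) = \frac{1}{1 + e^{-t}},$$

which makes the connection to the sigmoidal units of Cybenko (1989) explicit.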

**Lemma 5.** If $m \in \mathcal{M}$ and $c \in \mathbb{R}$, then $cm \in \mathcal{M}$.

*Proof.* If $m(\boldsymbol{x}) = \sum_{k=1}^{K} g_k(\boldsymbol{x})\, c_k$, then $cm(\boldsymbol{x}) = \sum_{k=1}^{K} g_k(\boldsymbol{x})\, (c\, c_k)$, which is again an element of $\mathcal{M}$.

**Lemma 6.** If $m_1, m_2 \in \mathcal{M}$, then $m_1 + m_2 \in \mathcal{M}$.

*Proof.* Write $m_1 = \sum_{k=1}^{K_1} g_k\, c_k$ and $m_2 = \sum_{l=1}^{K_2} g'_l\, c'_l$. Since each set of gating functions sums to one, $m_1 + m_2 = \sum_{k=1}^{K_1} \sum_{l=1}^{K_2} g_k\, g'_l\, (c_k + c'_l)$, which belongs to $\mathcal{M}$ by the gate-product identity displayed after lemma 7.

**Lemma 7.** If $m_1, m_2 \in \mathcal{M}$, then $m_1 m_2 \in \mathcal{M}$.

*Proof.* With the same notation, $m_1 m_2 = \sum_{k=1}^{K_1} \sum_{l=1}^{K_2} g_k\, g'_l\, (c_k\, c'_l)$, which belongs to $\mathcal{M}$ by the gate-product identity displayed below.
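The proofs of lemmas 6 and 7 both rest on the following elementary identity, stated here in our notation: the pointwise product of two softmax gating functions is again a softmax gating function, over the product index set,

$$g_k(\boldsymbol{x})\, g'_l(\boldsymbol{x}) = \frac{\exp\!\left\{(\boldsymbol{a}_k + \boldsymbol{a}'_l)^{T}\boldsymbol{x} + (b_k + b'_l)\right\}}{\sum_{j=1}^{K_1} \sum_{m=1}^{K_2} \exp\!\left\{(\boldsymbol{a}_j + \boldsymbol{a}'_m)^{T}\boldsymbol{x} + (b_j + b'_m)\right\}},$$

which follows because the product of the two normalizing sums factorizes into the double sum in the denominator. Consequently, $\sum_{k=1}^{K_1} \sum_{l=1}^{K_2} g_k\, g'_l\, d_{kl}$ is a $K_1 K_2$-component MoE mean function for any constants $d_{kl}$.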

Lemmas 3 to 7 imply that the class $\mathcal{M}$ satisfies assumptions i to v of theorem 2; thus, theorem 1 is proved.

## 5 Conclusion

In this note, we utilized the Stone-Weierstrass theorem to prove that the class of MoE mean functions $\mathcal{M}$ is dense in the class of continuous functions $C(\mathbb{X})$ on an arbitrary compact domain $\mathbb{X} \subset \mathbb{R}^{d}$.

Unlike in Zeevi et al. (1998), Jiang and Tanner (1999a, 1999b), and Mendes and Jiang (2012), we do not obtain convergence rates. Furthermore, our result does not guarantee statistical estimability of the MoE mean functions. Maximum likelihood (ML) estimation can obtain consistent estimates of MoE mean functions when the number of experts $K$ is known (see Zeevi et al., 1998; Jiang & Tanner, 2000; and Nguyen & McLachlan, 2016). Results regarding regularized ML estimation of MoE models were obtained in Khalili (2010). In Grün and Leisch (2007) and Nguyen and McLachlan (2016), the Bayesian information criterion (BIC; Schwarz, 1978) is shown to be effective for the determination of an unknown $K$ (see Olteanu & Rynkiewicz, 2011, for theoretical justification of the BIC).
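As a reminder (this display is standard and is not restated from the works cited above, whose notation may differ), BIC-based selection chooses the number of experts $K$ to minimize

$$\mathrm{BIC}(K) = -2\,\ell_n\!\left(\hat{\boldsymbol{\theta}}_K\right) + p_K \log n,$$

where $\ell_n(\hat{\boldsymbol{\theta}}_K)$ is the maximized log likelihood of the $K$-expert model, $p_K$ is the number of free parameters of that model, and $n$ is the sample size.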