## Abstract

In this letter, we consider a mixture-of-experts structure where *m* experts are mixed, with each expert being related to a polynomial regression model of order *k*. We study the convergence rate of the maximum likelihood estimator in terms of how fast the Hellinger distance of the estimated density converges to the true density, when the sample size *n* increases. The convergence rate is found to be dependent on both *m* and *k*, while certain choices of *m* and *k* are found to produce near-optimal convergence rates.

## 1. Introduction

Mixture-of-experts models (ME) (Jacobs, Jordan, Nowlan, & Hinton, 1991) and hierarchical mixture-of-experts models (HME) (Jordan & Jacobs, 1994) are powerful tools for estimating the density of a random variable *Y* conditional on a known set of covariates *X*. The idea is to “divide and conquer”: we split the space of covariates and approximate the conditional density within each subspace. The ME model can also be seen as a generalization of classical mixture models, whose weights are constant across the covariate space. Mixtures of experts have been widely used in a variety of fields, including image recognition and classification, medicine, audio classification, and finance. Such flexibility has also inspired a series of distinct models, including those of Wood, Jiang, and Tanner (2002), Carvalho and Tanner (2005a), Geweke and Keane (2007), Wood, Kohn, Cottet, Jiang, and Tanner (2008), Villani, Kohn, and Giordani (2009), Young and Hunter (2010), and Wood, Rosen, and Kohn (2011), among many others.

We consider a framework similar to Jiang and Tanner (1999a) and others. Assume each expert is a member of a one-parameter exponential family with mean ϕ(*h*_{k}), where *h*_{k} is a *k*th-degree polynomial on the conditioning variables *X* (hence, a linear function of the parameters) and ϕ(·) is the inverse link function. In other words, each expert is a generalized linear model on a one-dimensional exponential family (GLM1). We allow the target density to be in the same family of distributions, but with conditional mean ϕ(*h*), where *h* belongs to a Sobolev class with α derivatives and bounding constant *K*_{0} (see section 2.1). Some examples of target densities include the Poisson, binomial, Bernoulli, and exponential distributions with unknown mean. Normal, gamma, and beta distributions also fall in this class if the dispersion parameter is known.
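To fix ideas, a single GLM1 expert can be sketched numerically. The snippet below is our illustration, not code from the letter: it builds a Poisson expert whose mean is ϕ(*h*_{k}) with the exponential inverse link (all function names are ours).

```python
import math

def h_k(x, theta):
    """kth-degree polynomial h_k(x; theta) = theta_0 + theta_1*x + ... + theta_k*x^k (s = 1)."""
    return sum(t * x ** j for j, t in enumerate(theta))

def poisson_expert(y, x, theta):
    """GLM1 Poisson expert with log link: mean phi(h_k) = exp(h_k(x; theta))."""
    mean = math.exp(h_k(x, theta))
    return mean ** y * math.exp(-mean) / math.factorial(y)

# for any x, the expert is a proper density in y
total_mass = sum(poisson_expert(y, 0.5, [0.1, 0.3]) for y in range(60))
```

Here the parameter vector `theta` plays the role of θ_{j} and its length *k* + 1 sets the polynomial degree.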

One might be skeptical about using (H)ME models with polynomial experts since it leads to more complex models as the degree *k* of the polynomials increases. We justify the use of such models through the approximation and estimation errors. We show that some choices of *k* and *m* lead to better convergence rates. This discussion about whether it is better to mix many simple models or fewer complex models is not new in the literature of mixture of experts. Earlier in the literature, Jacobs et al. (1991) and Peng, Jacobs, and Tanner (1996) proposed mixing many simple models; more recently, Wood et al. (2002) and Villani et al. (2009) considered using only a few complex models. Celeux, Hurn, and Robert (2000) and Geweke (2007) advocate for mixing fewer complex models, claiming that mixture models can be very difficult to estimate and interpret.

This work extends Jiang and Tanner (1999a) in some directions. We show that by including polynomial terms, one is able to improve the approximation rate on sufficiently smooth classes. This rate is sharp for the piecewise polynomial approximation with a fixed degree *k* and increasing number of “pieces,” as shown in Windlund (1977). Moreover, we contribute to the literature by providing rates of convergence of the maximum likelihood estimator to the true density. We emphasize that such rates have never been developed for this class of models and the method used can be easily generalized to more general classes of mixture of experts. Convergence of the estimated density function to the true density and parametric consistency of the maximum likelihood estimator are also obtained.

Zeevi, Meir, and Maiorov (1998) show approximation in the *L*^{p} norm and estimation error for the conditional expectation of the ME with generalized linear experts. Jiang and Tanner (1999a) show consistency and approximation rates for the HME with generalized linear models as experts and a general specification for the gating functions. They consider the target density to belong to the exponential family with one parameter. Their approximation rate of the Kullback-Leibler divergence between the target density and the model is *O*(1/*m*^{4/s}), where *m* is the number of experts and *s* is the number of covariates or independent variables. Norets (2010) shows the approximation rate for the mixture of gaussian experts, where both the variance and the mean can be nonlinear and the weights are given by multinomial logistic functions. The target density is considered to be a smooth, continuous function, and the dependent variable *Y* is assumed to be continuous and to satisfy some moment conditions. The approximation rate is *O*(1/*m*^{s+2+1/(q−2)+ϵ}), where *Y* is assumed to have at least *q* moments and ϵ is a sufficiently small number. Despite these findings, there are no convergence rates for the maximum likelihood estimator of the mixture-of-experts class of models in the literature.

We show that under conditions similar to those of Jiang and Tanner (1999a), the approximation rate in Kullback-Leibler divergence is uniformly bounded by *O*((*cs*/(*k***m*^{1/s}))^{2k*}), where *c* is some positive constant not depending on *k* or *m*, and *k** = (*k* + 1) ∧ α. This is a generalization of the rate found in Jiang and Tanner (1999a), who assume α = 2 and *k* = 1. In squared Hellinger distance, the convergence rate of the maximum likelihood estimator to the true density is *O*_{p}((*cs*/(*k***m*^{1/s}))^{2k*} + *d*_{m,k}*n*^{−1}log(*d*_{m,k}*n*)), where *d*_{m,k} is the total number of parameters in the model. To show the previous results, we do not assume identifiability of the model, as it is natural for mixtures of experts to be nonidentifiable under permutation of the experts (Jiang & Tanner, 1999c). Near-optimal nonparametric rates of convergence can be attained for some choices of *k* and *m*.

Throughout the letter, we use the notation ‖*h*(·)‖_{p,λ(S)} = [∫_{S}|*h*(*z*)|^{p}*d*λ(*z*)]^{1/p} for all 0 < *p* < ∞, and ‖*h*(·)‖_{∞,λ(S)} for the essential supremum of |*h*| over *S*. The former is the *L*^{p}(λ) function norm of *h*(·) with respect to the measure λ over the set *S*, and the latter is the *L*^{∞}(λ) function norm of *h*(·) with respect to the measure λ over the set *S*. In case the set is not specified, we consider the entire support of the measure. Similarly, if the measure is omitted, ‖ · ‖_{p} is the *L*^{p} vector norm, for 0 < *p* ⩽ ∞.

The remainder of the letter is organized as follows. In the next section, we introduce the target density and mixture-of-experts models. We also demonstrate that the maximum likelihood estimator is consistent. Section 3 establishes the main results of the letter: approximation rate, convergence rate, and nonparametric consistency. Section 4 discusses model specification and the trade-off that we unveil between the number of experts and the degree of the polynomials. In the concluding remarks, we compare our results with Jiang and Tanner (1999a) and provide some directions for future research. The appendix differs from the main body of the letter in being more technical. In appendix A, we present the main steps in showing the convergence rate in Hellinger distance. Appendix B has a set of useful lemmas required for proving the main results of the letter in appendix C.

## 2. Preliminaries

In this section we introduce the target class of density functions and the mixture-of-experts model with GLM1 experts.

### 2.1. Target Family of Densities.

Consider a sequence of random vectors {(*X*′_{i}, *Y*_{i})′}^{n}_{i=1} defined on a common probability space, equipped with the Borel σ-algebra generated by the support set *S*. We assume that *P*_{XY} has a density *p*_{xy} = *p*_{y|x}*p*_{x} with respect to some measure λ. More precisely, we assume that *p*_{x} is known and *p*_{y|x} belongs to a one-dimensional exponential family; the target density has the GLM1 form

*p*_{y|x}(*y* | *x*) = exp{*a*(*h*(*x*))*y* + *b*(*h*(*x*)) + *c*(*y*)},

where *a*(·) and *b*(·) are known functions, three times continuously differentiable, with first derivative bounded away from zero, and *a*(·) has a nonnegative second derivative; *c*(·) is a known measurable function of *Y*. The function *h*(·) is an element of a Sobolev class of order α.^{1} Throughout the letter, we refer to the resulting class of target density functions *p*_{xy} = *p*_{y|x}*p*_{x} as the target class.

The one-parameter exponential family of distributions includes the Bernoulli, exponential, Poisson, and binomial distributions. It also includes the gaussian, gamma, and Weibull distributions if the dispersion parameter is known. In this work, we focus only on the one-parameter case, but we conjecture that the results still hold in the case where the dispersion parameter has to be estimated from the data.
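The exponential-family forms of two of these examples can be checked numerically. In the sketch below (our own; the identifications of *a*, *b*, and *c* use the standard GLM1 parameterization and are not stated in the letter), the Poisson corresponds to *a*(*h*) = *h*, *b*(*h*) = −*e*^{h}, *c*(*y*) = −log *y*!, and the Bernoulli to *a*(*h*) = *h*, *b*(*h*) = −log(1 + *e*^{h}), *c*(*y*) = 0.

```python
import math

def poisson_pmf(y, lam):
    """Standard Poisson pmf, for comparison."""
    return lam ** y * math.exp(-lam) / math.factorial(y)

def poisson_expfam(y, h):
    """Poisson written as exp{a(h)y + b(h) + c(y)}: a(h)=h, b(h)=-e^h, c(y)=-log(y!)."""
    return math.exp(h * y - math.exp(h) - math.lgamma(y + 1))

def bernoulli_expfam(y, h):
    """Bernoulli written as exp{a(h)y + b(h)}: a(h)=h, b(h)=-log(1+e^h); mean is logistic(h)."""
    return math.exp(h * y - math.log1p(math.exp(h)))

h = 0.4
p1 = poisson_expfam(3, h)    # should equal the Poisson pmf with lam = e^h
p2 = bernoulli_expfam(1, h)  # should equal logistic(h)
```

In both cases the conditional mean is ϕ(*h*) for the appropriate inverse link ϕ (exponential and logistic, respectively).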

### 2.2. Mixture-of-Experts Model.

The mixture-of-experts model combines *m* GLM1 experts through gating (weight) functions,

*f*_{m,k}(*x*, *y*; ζ) = ∑^{m}_{j=1} *g*_{j}(*x*; ν)π(*h*_{k}(*x*; θ_{j}), *y*),

where π(·, *y*) denotes the GLM1 expert density, *g*_{j}(·; ν) is a positive function of *x* indexed by a *v*_{m}-dimensional parameter vector ν,^{2} and ∑^{m}_{j=1}*g*_{j}(·; ·) = 1. The function *h*_{k}(*x*; θ_{j}) is a *k*th-degree polynomial on *x* indexed by a *J*_{k}-dimensional parameter vector θ_{j}. The parameter vector of the model is ζ = (ν′, θ′_{1}, …, θ′_{m})′ and is defined on *Z*_{m,k}, with *d*_{m,k} = *v*_{m} + *mJ*_{k}. Throughout the letter, we refer to the class of (approximant) densities *f*_{m,k} as the model class.

To derive consistency and convergence rates, one needs to impose some restrictions on the functions π and *g _{j}* to avoid abnormal cases. This condition is not restrictive and is satisfied by the multinomial logistic weight functions and the Bernoulli, binomial, Poisson, and exponential experts, among many other classes of distributions and weight functions.
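A minimal numerical sketch of such a mixture, assuming multinomial logistic gating and Poisson experts (our choices for illustration; all names are ours), confirms that the mixture is itself a proper conditional density:

```python
import math

def softmax_gates(x, nu):
    """Multinomial logistic gating g_j(x; nu), nu = [(alpha_j, beta_j)]; weights sum to 1."""
    z = [a + b * x for (a, b) in nu]
    zmax = max(z)                      # subtract the max for numerical stability
    w = [math.exp(v - zmax) for v in z]
    s = sum(w)
    return [v / s for v in w]

def poisson_pmf(y, mean):
    return mean ** y * math.exp(-mean) / math.factorial(y)

def f_mk(y, x, nu, thetas):
    """Mixture-of-experts density: sum_j g_j(x; nu) * Poisson(y; exp(h_k(x; theta_j)))."""
    gates = softmax_gates(x, nu)
    out = 0.0
    for g, theta in zip(gates, thetas):
        mean = math.exp(sum(t * x ** j for j, t in enumerate(theta)))
        out += g * poisson_pmf(y, mean)
    return out

nu = [(0.0, 1.0), (0.5, -1.0)]         # m = 2 gates (s = 1)
thetas = [[0.2, 0.1], [-0.3, 0.4]]     # two degree-1 experts (k = 1)
total = sum(f_mk(y, 0.7, nu, thetas) for y in range(80))
```

Because each expert integrates to one in *y* and the gates sum to one in *x*, the mixture integrates to one for every *x*.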

*The next conditions hold jointly:*

- *The parameter space V_{m} is contained inside a hypercube of sufficiently large side l_{1}, possibly depending polynomially on m.*
- *Each parameter space Θ_{k} is contained in a hypercube of sufficiently large side l_{2}.*
- *For all m′ > m, the model class with m experts is contained in the model class with m′ experts.*

We present two examples of mixtures of experts satisfying the previous assumption. In the examples, we assume the gating functions are multinomial logistic functions and two distinct distributions: Bernoulli and Poisson. For simplicity, we take the number of covariates to be one (i.e., *s* = 1) and .

Here ν = (α_{1}, β_{1}, …, α_{m}, β_{m})′, and φ(·) is the logistic function. Condition i is satisfied by choosing an appropriate parameter space; condition ii is satisfied by the multinomial logistic functions if the parameter space for the α’s increases logarithmically with *m* (see Ge & Jiang, 2006). The final condition can be shown by taking the derivatives of log *f*_{m,k}, which leads to a choice of *F*(*x*, *y*) = |*x*|.

### 2.3. Maximum Likelihood Estimation.

Define *f*_{0}(*X*, *Y*) = exp(*c*(*Y*))*p*_{x}(*X*). The maximum likelihood estimator is not necessarily unique. In general, mixture-of-experts models are not identifiable under permutation of the experts. To circumvent this issue, one must impose restrictions on the experts and the weighting functions (or on the parameter vector of the model), as shown in Jiang and Tanner (1999c).

Define the Kullback-Leibler divergence between *p*_{xy} and *f*_{m,k} as KL(*p*_{xy} ‖ *f*_{m,k}) = ∫ *p*_{xy} log(*p*_{xy}/*f*_{m,k}) *d*λ. The log-likelihood function in equation 2.3 converges to its expectation with probability one as the number of observations increases. Therefore, in the limit, the maximizer of equation 2.3 also minimizes the Kullback-Leibler divergence between the true density and the estimated density.
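The almost-sure convergence of the average log-likelihood to its expectation can be illustrated with a toy simulation (ours, not from the letter): a single Bernoulli expert with a linear *h*, and *X* supported on a finite grid so that the expectation can be computed exactly.

```python
import math, random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def model_loglik(x, y):
    """log f(y | x) for a fixed one-expert Bernoulli model with linear h (illustrative)."""
    p = sigmoid(0.5 + 1.5 * x)
    return y * math.log(p) + (1 - y) * math.log(1 - p)

xs = [i / 10 for i in range(10)]        # X uniform on a 10-point grid
true_p = [sigmoid(1.0 * x) for x in xs]  # true conditional Bernoulli mean

# exact expectation E[log f(Y | X)] under the true density
expected = sum((tp * model_loglik(x, 1) + (1 - tp) * model_loglik(x, 0)) / len(xs)
               for x, tp in zip(xs, true_p))

# empirical average over n i.i.d. draws (law of large numbers)
rng = random.Random(0)
n = 50000
emp = 0.0
for _ in range(n):
    x = rng.choice(xs)
    y = 1 if rng.random() < sigmoid(1.0 * x) else 0
    emp += model_loglik(x, y)
emp /= n
```

The empirical average `emp` is close to `expected`, which is the sense in which maximizing equation 2.3 asymptotically minimizes the Kullback-Leibler divergence.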

As stated earlier, this work considers only independent and identically distributed (i.i.d.) observations, but it is straightforward to extend the results to more general data-generating processes (e.g., martingales). The next assumption formalizes it.

*The sequence (X _{i}, Y_{i})^{n}_{i=1}, n = 1, 2, … is an i.i.d. sequence of random vectors with common distribution P_{XY}.*

The next result ensures the existence of such estimator:

*For a sequence {(Z_{m,k})_{n}} of compact subsets of Z_{m,k}, n = 1, 2, …, there exists a measurable function satisfying equation 2.4, P_{XY}-almost surely*.

The maximum likelihood estimator consistently estimates ζ*, where ζ* ∈ *Z*_{m,k} is a minimizer of equation 2.5. We require the classic conditions: parametric identifiability and the existence of a unique minimizer of equation 2.5. If the i.i.d. condition fails, it can be shown that ergodicity of (log *f*_{m,k}(*X*_{i}, *Y*_{i}; ζ))^{n}_{i=1} is a sufficient condition for consistency. However, conditions ensuring ergodicity of the log-likelihood function are outside the scope of this letter.

*For any distinct ζ _{1} and ζ_{2} in Z_{m,k}, the two corresponding densities f_{m,k}(·, · ; ζ_{1}) and f_{m,k}(·, · ; ζ_{2}) are not almost everywhere equal*.

Jiang and Tanner (1999c) find sufficient conditions for identifiability of the parameter vector for the HME with one layer and Mendes, Veiga, and Medeiros (2006) for a binary tree structure. Both cases can be adapted to more general specifications. Although one can show consistency to a set, we adopt a more traditional approach requiring identifiability of the parameter vector.

This assumption follows from a second-order Taylor expansion of the expected likelihood around the parameter vector that minimizes equation 2.5, denoted by ζ*. We require the Hessian matrix in equation 2.6 to be invertible at ζ*. The requirement for an identifiable unique maximizer is technical only in the sense that the objective function is not allowed to become too flat around the maximum. (For more discussion on this topic, see Bates & White, 1985, and White, 1996.) A similar assumption was made in the series of papers from Carvalho and Tanner (2005a, 2005b, 2006, 2007) and Zeevi et al. (1998) and is a usual assumption in the estimation of misspecified models.

*Under assumptions 1, 2, 3, and 4, the maximum likelihood estimator converges to ζ* as n → ∞, P_{XY}-almost surely*.

## 3. Main Results

### 3.1. Approximation Rate.

We follow Jiang and Tanner (1999a) to bound the approximation error. Before presenting the main conditions, we introduce some key concepts.

*For m = 1, 2, …, let Q^{m} be a partition of Ω. If m → ∞ and if for all x_{1}, x_{2} ∈ Q^{m}_{j}, ‖x_{1} − x_{2}‖_{∞} ⩽ c_{0}/r^{1/s}_{m} for some constant c_{0} independent of x_{1}, x_{2}, m, or j, then {Q^{m}, m = 1, 2, …} is called a sequence of fine partitions with cardinality r_{m} and bounding constant c_{0}.*
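A regular hypercube partition of [0, 1]^{s} with *r*_{m} = ⌊*m*^{1/s}⌋^{s} cells satisfies this definition with bounding constant *c*_{0} = 1; a small sketch (ours):

```python
import math

def fine_partition(m, s):
    """Regular hypercube partition of [0,1]^s with r_m = floor(m^(1/s))^s cells."""
    per_axis = int(math.floor(m ** (1.0 / s)))
    r_m = per_axis ** s
    width = 1.0 / per_axis  # sup-norm diameter of each cell
    return r_m, width

# each cell has sup-norm diameter exactly 1 / r_m^(1/s), so c_0 = 1 works
r_m, width = fine_partition(m=10, s=2)
```

For *m* = 10 and *s* = 2 this gives *r*_{m} = 9 cells of side 1/3.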

The key idea behind the approximation rate is to control the approximation error inside each cell of the fine partition; more precisely, to bound the error inside the “worst” (i.e., most difficult to approximate) cell. We need the following assumption:

This assumption is similar to the one employed in Jiang and Tanner (1999a) and requires that the vector approximate the vector of indicator functions at a rate not slower than *O*(*r _{m}*).

The cardinality *r*_{m} is similar to Jiang and Tanner (1999b), who essentially used *r*_{m} = ⌊*m*^{1/s}⌋^{s} so as to form a regular hypercube partition of an *s*-dimensional domain of *x*.

Let *p* and *q* denote densities. The χ^{2} divergence (see, e.g., Wong & Shen, 1995) between *p* and *q* is χ^{2}(*p*, *q*) = ∫ (*p* − *q*)^{2}/*q* *d*λ = ∫ *p*^{2}/*q* *d*λ − 1, where the equivalence is obtained by expanding the square and integrating the densities to 1.
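The algebraic equivalence of the two forms of the χ^{2} divergence is easy to check numerically on a discrete example (ours):

```python
# two discrete densities on three points
p = [0.2, 0.5, 0.3]
q = [0.25, 0.25, 0.5]

# chi-square divergence, direct form: sum (p - q)^2 / q
chi2_direct = sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))

# expanded form: sum p^2 / q - 1 (cross terms integrate out because p and q sum to 1)
chi2_expanded = sum(pi ** 2 / qi for pi, qi in zip(p, q)) - 1.0
```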

This result is a generalization of Jiang and Tanner (1999a) in three directions. First, we allow the target function *h*(·) to be in a Sobolev class with α derivatives; second, we consider a polynomial approximation to the target function in each expert (in fact, their result is a special case when α = 2 and *k* = 1); finally, we consider a divergence measure stronger than the Kullback-Leibler divergence. The result also holds under more general specifications of densities and experts. If a dispersion parameter has to be estimated from the data, we have to modify lemma 4 accordingly, and the same result holds.

The approximation rate also agrees with the optimal one for approximating functions in a Sobolev class by piecewise polynomials of a fixed degree (Windlund, 1977). Under assumption 5, this is exactly what we are doing, and therefore the approximation rate is sharp, meaning one cannot derive faster rates for fixed *k*.
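The piecewise polynomial rate can be visualized numerically: fitting degree-*k* interpolating polynomials on *m* equal subintervals of [0, 1] to a smooth function shows the sup-norm error shrinking at roughly *m*^{−(k+1)}. The sketch below is ours (Lagrange interpolation at equally spaced nodes) and doubles the number of pieces for *k* = 2:

```python
import math

def piecewise_poly_sup_error(h, k, pieces, grid=200):
    """Sup-norm error of degree-k interpolation of h on each of `pieces` equal subintervals of [0,1]."""
    err = 0.0
    for j in range(pieces):
        a, b = j / pieces, (j + 1) / pieces
        # k+1 equally spaced interpolation nodes in [a, b]
        nodes = [a + (b - a) * i / k if k > 0 else (a + b) / 2 for i in range(k + 1)]

        def poly(x):
            # Lagrange form of the degree-k interpolant of h at `nodes`
            total = 0.0
            for i, xi in enumerate(nodes):
                li = 1.0
                for t, xt in enumerate(nodes):
                    if t != i:
                        li *= (x - xt) / (xi - xt)
                total += h(xi) * li
            return total

        for g in range(grid + 1):
            x = a + (b - a) * g / grid
            err = max(err, abs(h(x) - poly(x)))
    return err

e4 = piecewise_poly_sup_error(math.sin, k=2, pieces=4)
e8 = piecewise_poly_sup_error(math.sin, k=2, pieces=8)
```

For *k* = 2 the predicted factor when doubling the pieces is 2^{k+1} = 8, and the measured ratio `e4 / e8` is close to that.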

### 3.2. Convergence Rate.

Given densities *p* and *q* with respect to λ, the squared Hellinger distance is *d*^{2}_{h}(*p*, *q*) = ∫ (√*p* − √*q*)^{2} *d*λ. The estimation error, measured in the squared Hellinger distance, is *O*_{p}(*d*_{m,k}*n*^{−1}log(*d*_{m,k}*n*)). We also show that some choices of *k* and *m* achieve near-optimal convergence rates.
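The ordering of the three discrepancy measures used throughout the letter, *d*^{2}_{h} ⩽ KL ⩽ χ^{2} (after Wong & Shen, 1995), can be verified on a discrete example. The sketch below is ours and takes the squared Hellinger distance without a 1/2 normalization:

```python
import math

p = [0.2, 0.5, 0.3]
q = [0.25, 0.25, 0.5]

# squared Hellinger distance: sum (sqrt(p) - sqrt(q))^2
hellinger_sq = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

# Kullback-Leibler divergence: sum p * log(p / q)
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# chi-square divergence: sum (p - q)^2 / q
chi2 = sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))
```

These inequalities are why an approximation bound in χ^{2} divergence transfers to Kullback-Leibler and Hellinger bounds.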

The next theorem summarizes the convergence rate of the maximum likelihood estimator with respect to the squared Hellinger distance between the true density *p _{xy}* and the estimated density.

Although the previous result is derived for the i.i.d. case, the result also holds for a more general data-generating process. This convergence rate is close to the optimal rate found in the sieves literature if *k** = α (see, e.g., Stone, 1980, and Barron & Sheu, 1991). To derive the convergence rate, we do not assume *f**_{m,k} is a unique, identifiable maximizer of the log-likelihood function 2.3. Here, *f**_{m,k} is allowed to be any of such maximizers. The price to pay for such generality is the inclusion of the “log *n*” term in the convergence rates.

### 3.3. Consistency.

We apply the previous result to show that the maximum likelihood estimator is consistent, that is, the Hellinger distance between the true density and the estimated model approaches zero as the sample size *n* and the index of the approximation class *m* go to infinity.

*Let and denote its maximum likelihood estimator on . Allow m = m_{n} and m(n^{−1}log n) → 0 as n increases. Under assumptions 1, 2, and 5, as n increases*.

## 4. Effects of *m* and *k*

Two important problems in the area of ME are: (1) What number of experts *m* should be chosen, given the size *n* of the training data, and (2) Given the total number of parameters, whether it is better to use a few complex experts or combine many simple experts. Our results will not be able to answer these questions completely, but they can provide some qualitative insights. We provide a related theoretical result.

Let *d* = *m*(*k* + 1)^{s}, an upper bound on the number of parameters *d*_{m,k}. The convergence rate of *d*^{2}_{h} in equation 3.5 can be upper-bounded by a simpler expression,

*U* = (*cs*/(*k***m*^{1/s}))^{2k*} + *d* log(*dn*)/*n*,

where *k*_{1} = *k* + 1, *k** = *k*_{1} ∧ α, and *c* is a positive constant. This assumes that *v*(*m*) = *O*(*m*) and uses the fact that the number of parameters needed in *s*-dimensional polynomials of order *k* is bounded by *J*_{k} ⩽ (*k* + 1)^{s}. We have also used a lower bound on the factorial based on Stirling's formula. We now study the upper bound *U*.

*Let c be a positive constant that does not depend on n but can take different values at different places. Let k_{1} = k + 1, k* = k_{1} ∧ α, and U = (cs/(k*m^{1/s}))^{2k*} + d log(dn)/n, which is an upper bound for the d^{2}_{h} convergence rate derived in theorem 4. Let d = mk^{s}_{1}, which is a bound for the approximate order of the total number of parameters. Then the following statements are true:*

I. *Consider the case where α is finite:*

   a. *As n → ∞, we have U → 0 if m → ∞ and d = o(n/log n).*

   b. *U achieves a near-optimal rate O(n^{−2α/(s+2α)}(log n)^{c}) for some c > 0 under the following choices: k_{1} ⩾ α and k_{1} = O((log n)^{c}) for some c > 0; and m of order n^{s/(s+2α)}(log n)^{c′} for any constant c′.*

II. *Consider the case where α = ∞ (or α ⩾ k_{1}):*

   a. *As n → ∞, we have U → 0 if d → ∞ and d = o(n/log n).*

   b. *The following choices make U have a “near-parametric rate” U = O((log n)^{c}/n) for some c > 0:^{3} m ⩾ 1 and m = O((log n)^{c}) for some c > 0; k_{1} ⩾ c log n for any constant c > 0, and k_{1} = O((log n)^{c}) for some c > 0.*

(a) The results above do not completely answer the earlier questions on how to choose *m* and *k* in practice. For example, the results on *m* and *k* are known only up to some order in *n*. In addition, the convergence rates may depend on the smoothness parameter α, which may be unknown in practice. A practical method for choosing *m* and *k* may involve a complexity penalty or cross-validation and is outside the scope of this letter. On the other hand, some qualitative insights can be drawn from our convergence rate analysis.

(b) For the very smooth situation, result IIa suggests that for the purpose of consistency (which means the convergence of *d*^{2}_{h} to 0 in probability), question ii about the ratio between (*m*, *k* + 1) is not relevant as long as *d* = *m*(*k* + 1)^{s} grows to infinity at a rate slower than *n*/log *n*. However, consistency is not enough to guarantee good performance. For example, equation 4.1 suggests that for *s* = 1, (*m*, *k* + 1) = (log *n*, 1) will lead to a very slow rate *O*((log *n*)^{−2}), while (*m*, *k* + 1) = (1, log *n*) will lead to a very fast rate *O*((log *n*)^{2}/*n*), even though in both cases the total number of parameters *d* = *m*(*k* + 1) is the same. It is therefore important to look into the convergence rates.
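This comparison can be reproduced numerically. The sketch below (ours) evaluates the upper bound *U* with *c* = 1 and *s* = 1 — the constant *c* = 1 is an arbitrary choice for illustration — for the two configurations (*m*, *k* + 1) = (log *n*, 1) and (1, log *n*):

```python
import math

def U(m, k1, n, s=1, alpha=float("inf"), c=1.0):
    """Upper bound U = (c*s / (m^(1/s) * kstar))^(2*kstar) + d*log(d*n)/n, with d = m*k1^s."""
    kstar = min(k1, alpha)
    d = m * k1 ** s
    approx = (c * s / (m ** (1.0 / s) * kstar)) ** (2 * kstar)  # approximation term
    estim = d * math.log(d * n) / n                             # estimation term
    return approx + estim

n = 22026                    # roughly e^10, so log n is about 10
L = int(round(math.log(n)))  # = 10

u_many_simple = U(m=L, k1=1, n=n)  # (m, k+1) = (log n, 1): many simple experts
u_one_complex = U(m=1, k1=L, n=n)  # (m, k+1) = (1, log n): one complex expert
```

Both configurations use the same number of parameters *d* = 10, yet the single complex expert gives a much smaller bound, matching the discussion above.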

(c) Results Ib and IIb imply that smoother target functions (with large α) and lower dimensions (*s*) generally encourage using fewer experts. For finite α, the near-optimal rates described in result Ib are achieved when *m* ≫ *k* in order. For the very smooth situation α = ∞, even *m* = 1(≪*k*) can lead to near-optimal performances.

(d) We note that *near*-optimal convergence rates can always be achieved with *k*_{1} not being too large compared to the sample size *n*. This is summarized in the two situations in results Ib and IIb, where we see that even in the case α = ∞, we only need about *k*_{1} ∼ log *n* for us to achieve a near-parametric convergence rate.

(e) Although in result Ib (with finite α) we have used *m* ≫ *k* to achieve near-optimal rates, we conjecture that even with *m* = 1, a good (but perhaps suboptimal) convergence rate can be attained. For example, for *s* = 1, using the Legendre approximation technique 7.5 of Barron and Sheu (1991), we conjecture that a convergence rate is of the form , where *d* = *mk*_{1} and *c* is a positive constant. Therefore (denoting α* = α − 1), even when *m* = 1, we can still take *k*_{1} to be of order and get , which is suboptimal compared to result Ib but is still converging to 0 if α>1. [Similarly, we conjecture that *m* → ∞ is not necessary for the consistency result Ia; we need only *d* → ∞ and *d* = *o*(*n*/log *n*).]

## 5. Conclusion

In this letter, we study the mixture-of-experts model with *m* experts in a one-parameter exponential family with conditional mean ϕ(*h*_{k}), where *h*_{k} is a *k*th-order polynomial and ϕ(·) is the inverse link function. We derive the approximation rate and the convergence rate of the maximum likelihood estimator to densities in a one-parameter exponential family with mean ϕ(*h*), where *h* lies in a Sobolev class with α derivatives and bounding constant *K*_{0}. We find that the convergence rate of the maximum likelihood estimator to the true density in squared Hellinger distance is *O*_{p}((*cs*/(*k***m*^{1/s}))^{2k*} + *d*_{m,k}*n*^{−1}log(*d*_{m,k}*n*)), for *k** = (*k* + 1) ∧ α and *c* some positive constant.

We discuss choices of *k* and *m* for achieving good convergence rates. The results of this letter can be generalized to more complex target densities (from, e.g., a Besov or a piecewise Besov class) and models (e.g., mixture of trigonometric polynomials or wavelets) with simple modifications to the proofs.

We generalize Jiang and Tanner (1999a) in several directions: (1) we allow polynomial terms of the variables in the GLM1 experts; (2) we assume the target density is in a Sobolev class of order α, for α > 0, instead of order 2; (3) we show consistency of the maximum likelihood estimator for a fixed number of experts; (4) we calculate convergence rates of the maximum likelihood estimator in squared Hellinger distance; (5) we show consistency when the number of experts and the sample size increase; and, finally, (6) we find that by using polynomials in the experts, one can obtain better approximation and estimation error bounds. These developments shed light on the important question of how the number of experts and the complexity of the experts jointly affect the convergence rate.

## Appendix A: Showing the Convergence Rate

In this appendix we explain and justify the main steps in proving the convergence rate.

One of the drawbacks of working with the Kullback-Leibler divergence is that it is not bounded. We will therefore work with the Hellinger distance.

The next lemma summarizes basic inequalities well known in the literature (e.g., Wong & Shen, 1995) relating the Hellinger distance, the Kullback-Leibler divergence, and the χ^{2} divergence.

In order to bound the estimation error, we use results from the theory of empirical processes. The convergence rate theorem presented below is derived for the i.i.d. case; however, the same result holds for martingales (see van de Geer, 2000).

The control of the estimation rate inside a class of functions requires knowledge of the complexity of the functional class. Denote by *N*_{[ ]}(ϵ, ‖ · ‖) the number of ϵ-brackets, with respect to the distance ‖ · ‖, needed to cover the class, and by *H*_{[ ]} = log *N*_{[ ]} the respective bracketing entropy.^{4} The use of a bracketing entropy to assess the complexity of a class of mixtures of regressions is not new. Genovese and Wasserman (2000) and Viele and Tong (2002) use the entropy with bracketing to measure the complexity of a class of mixture models and mixtures of regressions, respectively. Applying their method in our setting gives a bracketing entropy of the same order as the method used in this letter. We use the latter for ease of exposition.
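For a hypercube parameter space, the *L*_{∞} covering number is explicit — roughly (*l*/ϵ)^{d} — so the entropy grows linearly in the dimension, which is where the *d*_{m,k} log(·) factors in the rates come from. A small sketch (ours; ⌈*l*/(2ϵ)⌉ balls per axis):

```python
import math

def linf_covering_number(side, eps, dim):
    """Number of L-infinity eps-balls needed to cover a hypercube of given side in R^dim."""
    per_axis = math.ceil(side / (2 * eps))  # each ball covers an interval of length 2*eps per axis
    return per_axis ** dim

def entropy(side, eps, dim):
    """log of the covering number: grows linearly in dim."""
    return math.log(linf_covering_number(side, eps, dim))

e1 = entropy(side=2.0, eps=0.1, dim=5)
e2 = entropy(side=2.0, eps=0.1, dim=10)  # doubling the dimension doubles the entropy
```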

Let *d*_{m,k} = *v*_{m} + *mJ*_{k}, *Z*_{m,k} = *V*_{m} × Θ_{mk}, and use *c* for any arbitrary positive constant that may change its value every time it appears.

Consider two densities *g*_{i}, *i* = 1, 2, in the model class. Each function *g*_{i} can be written in terms of a parameter vector ζ_{i} ∈ *Z*_{m,k}, on which it depends only through the gating and expert functions. Then, for each (*x*, *y*), the difference of the log densities is controlled by a derivative bound, which is in turn bounded by assumption 1. The number of ϵ-balls with respect to *L*_{∞} needed to cover *Z*_{m} follows because, by assumption 1, *Z*_{m} is a hypercube with side *l*; the corresponding bracketing number then follows from lemma 6. Since *l* is polynomial in *d*_{m,k}, we can take the bracketing bound with some *a* ⩾ 1 and *c* > 0. Taking the log, we obtain equation A.3. Then we apply lemma 5.

We use a modified version of theorem 10.13 in van de Geer (2000) to show the rate of convergence of the Hellinger distance between the maximum likelihood estimator and the true density. This modification allows for unbounded likelihood ratios; that is, it relaxes the assumption that ‖*p*_{xy}/*f**_{m,k}‖_{∞,λ} is bounded.

The original proof requires ‖*p*_{xy}/*f**_{m,k}‖_{∞,λ} ⩽ *c* (van de Geer, 2000, eq. 10.69). We modify the proof so as to proceed with the χ^{2} divergence without needing to bound the densities as in equation 10.69 of van de Geer (2000).

## Appendix B: Auxiliary Results

In the next lemma, we use the notation ∂_{θ} = ∂/∂θ, *a*_{j} = *a*(*h*_{k}(*x*; θ_{j})), and so on.

*Let . Under assumption 1*,

- •
- •
- •
- •
*if we further assume 3 and 4, then and is nonsingular at ζ*.*

Both *a*_{j} and *b*_{j} are continuously differentiable functions of *h*_{k}(*x*; θ_{j}). Since the parameter space is bounded for any fixed *k*, both *a*_{j} and *b*_{j} are also bounded. The same reasoning applies to their derivatives. Also, by definition, the relevant moments are finite for any *p* ⩾ 0. Then the stated bounds follow.

Since ζ* is a maximizer of over , has to be nonnegative definite. Assumption 4 tells us it is also invertible; therefore, is positive definite.

We can use the upper divergence to bound the χ^{2} divergence.

It follows from the definition of the χ^{2} divergence and the concavity of the logarithm that the divergence can be bounded expert by expert, where *a* = *a*(*h*(*x*)), *a*_{j} = *a*(*h*_{k}(*x*; θ_{j})), and *b**(*a*) = *b*(*h*(*x*)). Consider the identity relating the two densities; the resulting ratio does not depend on *y*. A second-order Taylor expansion of *b**(2*a* − *a*_{j}) and of *b**(*a*) gives the corresponding remainder terms, evaluated at intermediate points on the line connecting *a* and *a*_{j}. Subtracting *b**(*a*) gives the desired bound. Finally, use the inequality *e*^{|x|} − 1 ⩽ |*x*|*e*^{|x|} and the mean value theorem to conclude.

The next lemma provides a bound on the bracketing number of functional classes that are Lipschitz in a parameter:

## Appendix C: Proof of the Main Results

The data-generating process of (*x*, *y*) and the structure of the model are enough to satisfy the measurability assumptions (i.e., it is a weighted sum of measurable functions).

The approximating density *f*_{m}(*x*, *y*; ζ) is a continuous function of the parameter vector ζ, *P*_{XY}-almost everywhere. We verify this claim by choosing (*x*, *y*) ∈ Ω × *A* from a set with positive probability and noting that (1) π(*h*_{k}(*x*; θ), *y*) is a continuous function of θ and (2) (*g*_{1}(*x*; ν), …, *g*_{m}(*x*; ν)) is a vector of continuous functions of ν; both imply that *f*_{m}(*x*, *y*; ζ) = ∑_{i} *g*_{i}(*x*; ν)π(*h*_{k}(*x*; θ_{i}), *y*) is also a continuous function of the parameter vector ζ = (ν′, θ′_{1}, …, θ′_{m})′.

The result follows from theorem 2.12 in White (1996).

There are different approaches to showing the consistency of the estimate. We verify the conditions of theorem 3.5 in White (1996).

*Z*_{m,k} is a compact set. Measurability and continuity conditions are already satisfied; thus, it remains to show that log *f*_{m,k} is bounded by an integrable function. We bound the log-likelihood function and define the bounding function accordingly. The bounding function is finite because max_{i}*x*_{i} = 1 and ∑_{i}|θ_{i}| < ∞; then both *a*(*h*_{k}) and *b*(*h*_{k}) are finite. Thus, it is straightforward to show integrability, given the moment condition, which is satisfied by the assumption about *p*_{y|x}. As a conclusion, the uniform law of large numbers holds as *n* → ∞.

It follows from theorem 3.5 in White (1996) that the estimator converges *P*_{XY}-a.s. as *n* → ∞.

*P*_{X} has a continuous density function with respect to the finite measure λ on Ω. Consider the sets *A*_{1} and *A*_{2}. The second term can be rewritten, where the equality follows from the integrability of the densities involved.

Here **k** = (*k*_{1}, …, *k*_{s}) is a vector of positive integers satisfying |**k**| = *k* + 1 and **k**! = *k*_{1}! ⋅ ⋅ ⋅ *k*_{s}!. The claim follows from a Taylor expansion of *h*(*x*) around fixed points *x*_{j} ∈ *Q*^{m}_{j}. Similarly, if *k* ⩾ α, we use the expansion only up to α terms. Applying assumption 5 then gives the bound, where *c*_{2} = *c*_{0}*K*^{1/s}_{0} and *k** = α ∧ (*k* + 1).

Let *f** = *f**_{m,k}. By lemma 2, we can choose a function Ψ such that Ψ(δ)/δ^{2} is nonincreasing, and we can take δ_{n} accordingly. To see that this choice of δ_{n} is valid, note that the required inequality holds for all *n* sufficiently large and some positive constant *c*. The stated rate then follows.

(Ia) Write *U* = (*cs*/(*k***m*^{1/s}))^{2k*} + *d*log(*dn*)/*n*, where *d* = *m*(*k* + 1)^{s} and *k** = (*k* + 1) ∧ α. Note that *k** ⩾ 1 since α is a positive integer for the Sobolev space introduced in section 2.1. Therefore, the first term of *U* converges to 0 as *m* → ∞. For the second term, apply the condition *d* = *o*(*n*/log *n*), and we have *d*log(*dn*)/*n* = *o*(*n*/log *n*)log(*o*(*n*/log *n*)·*n*)/*n* = *o*(1). This shows *U* → 0.

(Ib) In our notation, *k*_{1} = *k* + 1. When (*k*_{1}=)*k* + 1 ⩾ α, *k** = (*k* + 1) ∧ α = α. Then *U* = (*cs*/(α*m*^{1/s}))^{2α} + *m*(*k* + 1)^{s}log(*m*(*k* + 1)^{s}*n*)/*n*. We plug in the choice *k* + 1 = *O*((log *n*)^{c}) for some positive constant *c*, and the choice of *m* of order *n*^{s/(s+2α)} up to a factor (log *n*)^{c′} for some constant power *c*′; then both terms in *U* are at most of order *O*(*n*^{−2α/(s+2α)}(log *n*)^{c}) for some positive power *c*.

(IIa) When α = ∞ (or at least *k* + 1, where *k* ⩾ 0 is the degree of the polynomial model), we have *k** = (*k* + 1) ∧ α = *k* + 1. Then we can write *U* = (*cs*/(*m*^{1/s}(*k* + 1)))^{2(k+1)} + *d*log(*dn*)/*n* = (*cs*/*d*^{1/s})^{2(k+1)} + *d*log(*dn*)/*n*. The first term converges to 0 as *d* → ∞. The second term converges to 0 due to *d* = *o*(*n*/log *n*) (the same as in the proof of Ia).

(IIb) Consider the expression in the proof of IIa: *U* = (*cs*/(*m*^{1/s}(*k* + 1)))^{2(k+1)} + *d*log(*dn*)/*n*, where *d* = *m*(*k* + 1)^{s}. The second term in *U* is at most *O*(*n*^{−1}(log *n*)^{c}) for some *c*>0, when *m* and *k* + 1 are both at most some powers of log *n* in order. When *m* ⩾ 1 and (*k* + 1) ⩾ *c*log *n* for some positive constant *c*, the first term in *U* is at most for large *n*, for some positive constants *c*_{1} and *c*_{2}, which is negligible for large *n* compared to the order *O*(*n*^{−1}(log *n*)^{c}) of the second term of *U*.

## Acknowledgments

We are grateful to the referees for their useful comments that have substantially improved the overall presentation of our letter. Also, we thank Martin Tanner, Thomas Severini, Robert Kohn, Marcelo Fernandes, and Marcelo Medeiros for insightful discussions about mixture of experts and/or comments on previous versions of this letter.

## References

## Notes

^{1}

Suppose 1 ⩽ *p* ⩽ ∞ and α > 0 is an integer. We define the Sobolev class as the collection of measurable functions *h* with all partial derivatives *D*^{r}*h*, |*r*| ⩽ α, in *L*^{p}(*P*_{X}), satisfying the norm bound. Here |*r*| = *r*_{1} + ⋅ ⋅ ⋅ + *r*_{s} for *r* = (*r*_{1}, …, *r*_{s}).

^{2}

We denote the parameter space as a hypercube with finite side *l*.

^{3}

“Near-parametric rate” stands for “close to the parametric rate *O*(1/*n*).”

^{4}

For a formal definition of bracketing numbers, see van der Vaart and Wellner (1996).

Estimation and asymptotic theory for a new class of mixture models. Unpublished manuscript, Pontifical Catholic University of Rio de Janeiro.