## Abstract

In this letter, we consider a mixture-of-experts structure where m experts are mixed, with each expert being related to a polynomial regression model of order k. We study the convergence rate of the maximum likelihood estimator in terms of how fast the Hellinger distance between the estimated density and the true density converges to zero as the sample size n increases. The convergence rate is found to depend on both m and k, and certain choices of m and k are found to produce near-optimal convergence rates.

## 1.  Introduction

Mixture-of-experts models (ME) (Jacobs, Jordan, Nowlan, & Hinton, 1991) and hierarchical mixture-of-experts models (HME) (Jordan & Jacobs, 1994) are powerful tools for estimating the density of a random variable Y conditional on a known set of covariates X. The idea is to “divide and conquer”: we split the space of covariates and approximate the conditional density within each subspace. The model can also be seen as a generalization of the classical mixture models, whose weights are constant across the covariate space. Mixtures of experts have been widely used in a variety of fields, including image recognition and classification, medicine, audio classification, and finance. Such flexibility has also inspired a series of distinct models, including those of Wood, Jiang, and Tanner (2002), Carvalho and Tanner (2005a), Geweke and Keane (2007), Wood, Kohn, Cottet, Jiang, and Tanner (2008), Villani, Kohn, and Giordani (2009), Young and Hunter (2010), and Wood, Rosen, and Kohn (2011), among many others.

We consider a framework similar to Jiang and Tanner (1999a) and others. Assume each expert is a member of a one-parameter exponential family with mean ϕ(hk), where hk is a kth-degree polynomial on the conditioning variables X (hence, a linear function of the parameters) and ϕ(·) is the inverse link function. In other words, each expert is a generalized linear model on a one-dimensional exponential family (GLM1). We allow the target density to be in the same family of distributions, but with conditional mean ϕ(h), where h belongs to a Sobolev class with α derivatives and bounding constant K0 (see section 2.1). Some examples of target densities include the Poisson, binomial, Bernoulli, and exponential distributions with unknown mean. Normal, gamma, and beta distributions also fall in this class if the dispersion parameter is known.

One might be skeptical about using (H)ME models with polynomial experts since it leads to more complex models as the degree k of the polynomials increases. We justify the use of such models through the approximation and estimation errors. We show that some choices of k and m lead to better convergence rates. This discussion about whether it is better to mix many simple models or fewer complex models is not new in the literature of mixture of experts. Earlier in the literature, Jacobs et al. (1991) and Peng, Jacobs, and Tanner (1996) proposed mixing many simple models; more recently, Wood et al. (2002) and Villani et al. (2009) considered using only a few complex models. Celeux, Hurn, and Robert (2000) and Geweke (2007) advocate for mixing fewer complex models, claiming that mixture models can be very difficult to estimate and interpret.

This work extends Jiang and Tanner (1999a) in some directions. We show that by including polynomial terms, one is able to improve the approximation rate on sufficiently smooth classes. This rate is sharp for the piecewise polynomial approximation with a fixed degree k and increasing number of “pieces,” as shown in Windlund (1977). Moreover, we contribute to the literature by providing rates of convergence of the maximum likelihood estimator to the true density. We emphasize that such rates have never been developed for this class of models and the method used can be easily generalized to more general classes of mixture of experts. Convergence of the estimated density function to the true density and parametric consistency of the maximum likelihood estimator are also obtained.

Zeevi, Meir, and Maiorov (1998) show approximation in the Lp norm and estimation error for the conditional expectation of the ME with generalized linear experts. Jiang and Tanner (1999a) show consistency and approximation rates for the HME with generalized linear models as experts and a general specification for the gating functions. They consider the target density to belong to the exponential family with one parameter. Their approximation rate of the Kullback-Leibler divergence between the target density and the model is $O(1/m^{4/s})$, where m is the number of experts and s is the number of covariates or independent variables. Norets (2010) shows the approximation rate for the mixture of gaussian experts where both the variance and the mean can be nonlinear and the weights are given by multinomial logistic functions. The target density is considered to be a smooth, continuous function and the dependent variable Y to be continuous and satisfy some moment conditions. The approximation rate is $O(1/m^{s+2+1/(q-2)+\epsilon})$, where Y is assumed to have at least q moments and ϵ is a sufficiently small number. Despite these findings, there are no convergence rates for the maximum likelihood estimator of the mixture-of-experts class of models in the literature.

We show that under conditions similar to those of Jiang and Tanner (1999a), the approximation rate in Kullback-Leibler divergence is uniformly bounded by , where c is some positive constant not depending on k or m, k* = (k + 1) ∧ α. This is a generalization of the rate found in Jiang and Tanner (1999a), who assume α = 2 and k = 1. In squared Hellinger distance, the convergence rate of the maximum likelihood estimator to the true density is , where dm,k is the total number of parameters in the model. To show the previous results, we do not assume identifiability of the model as it is natural for mixture of experts to be nonidentifiable under permutation of the experts (Jiang & Tanner, 1999c). Near-optimal nonparametric rates of convergence can be attained for some choices of k and m.

Throughout the letter, we use the notation $\|h(\cdot)\|_{p,\lambda(S)} = \left[\int_S |h(z)|^p \, d\lambda(z)\right]^{1/p}$ for all $0 < p < \infty$ and $\|h(\cdot)\|_{\infty,\lambda(S)} = \operatorname{ess\,sup}_{z \in S} |h(z)|$. The former is the $L^p(\lambda)$ function norm of h(·) with respect to the measure λ over the set S, and the latter is the $L^\infty(\lambda)$ function norm of h(·) with respect to the measure λ over the set S. In the case that the set is not specified, we consider the entire support of the measure. Similarly, if the measure is omitted, $\|\cdot\|_p$ is the $L^p$ vector norm, for $0 < p \le \infty$.

The remainder of the letter is organized as follows. In the next section, we introduce the target density and mixture-of-experts models. We also demonstrate that the maximum likelihood estimator is consistent. Section 3 establishes the main results of the letter: approximation rate, convergence rate, and nonparametric consistency. Section 4 discusses model specification and the trade-off that we unveil between the number of experts and the degree of the polynomials. In the concluding remarks, we compare our results with Jiang and Tanner (1999a) and provide some directions for future research. The appendixes are more technical than the main body of the letter. In appendix A, we present the main steps in showing the convergence rate in Hellinger distance. Appendix B contains a set of useful lemmas required for proving the main results of the letter in appendix C.

## 2.  Preliminaries

In this section we introduce the target class of density functions and the mixture-of-experts model with GLM1 experts.

### 2.1.  Target Family of Densities.

Consider a sequence of random vectors {(Xi, Yi)′}ni=1 defined on where , and is the Borel σ-algebra generated by the set S. We assume that PXY has a density pxy = py|xpx with respect to some measure λ. More precisely, we assume that px is known and py|x belongs to a one-dimensional exponential family; the target density is such that
$$p_{y|x}(y \mid x) = \exp\{a(h(x))\, y + b(h(x)) + c(y)\}, \tag{2.1}$$
where a(·) and b(·) are known functions, three times continuously differentiable, with first derivative bounded away from zero, and a(·) has a nonnegative second derivative; c(·) is a known measurable function of Y. The function h(·) is an element of a Sobolev class of order α. Throughout the letter, we denote the class of target density functions pxy = py|xpx as .

The one-parameter exponential family of distributions includes the Bernoulli, exponential, Poisson, and binomial distributions. It also includes the gaussian, gamma, and Weibull distributions if the dispersion parameter is known. In this work, we focus only on the one-parameter case, but we conjecture that the results still hold in the case where the dispersion parameter has to be estimated from the data.

### 2.2.  Mixture-of-Experts Model.

The mixture-of-experts model with GLM1 experts is defined as
2.2
Each gj(·; ν) is a positive function of x indexed by a vm-dimensional parameter vector ν, and $\sum_{j=1}^{m} g_j(\cdot\,; \cdot) = 1$. The function hk(x; θj) is a kth-degree polynomial on x indexed by a Jk-dimensional parameter vector θj. The parameter vector of the model is ζ = (ν′, θ′1, …, θ′m)′ and is defined on $Z_{m,k} = V_m \times \Theta_k^m$, with dm,k = vm + mJk. Throughout the letter, we denote by the class of (approximant) densities fm,k.

To derive consistency and convergence rates, one needs to impose some restrictions on the functions π and gj to avoid abnormal cases. These restrictions are mild and are satisfied by the multinomial logistic weight functions and the Bernoulli, binomial, Poisson, and exponential experts, among many other classes of distributions and weight functions.

Assumption 1.

The next conditions hold jointly:

• The parameter space Vm is contained inside a hypercube of sufficiently large side l1, possibly depending polynomially on m. Each parameter space Θk is contained in a hypercube of sufficiently large side l2.

• For all m′>m, .

• For each , there exists a square integrable function F(x, y), such that for each (x, y),
and
where and C(l) is at most a polynomial function of l, where l = l1l2.

Remark 1.
Note that if ‖fm,k/pxy < c for some finite c,

We present two examples of mixtures of experts satisfying the previous assumption. In the examples, we assume the gating functions are multinomial logistic functions and two distinct distributions: Bernoulli and Poisson. For simplicity, we take the number of covariates to be one (i.e., s = 1) and .

Example 1 (mixture of Bernoulli experts).
A mixture of Bernoulli experts with multinomial logistic gating functions is given by:
where ν = (α1, β1, …, αm, βm)′, and φ(·) is the logistic function. Condition i is satisfied by choosing an appropriate parameter space; condition ii is satisfied by the multinomial logistic functions if the parameter space for the α’s increases logarithmically with m (see Ge & Jiang, 2006). The final condition can be shown by taking the derivatives of log fm,k, which leads to a choice of F(x, y) = |x|.
Example 2 (mixture of Poisson experts).
A mixture of Poisson experts with multinomial logistic gating functions is given by
where ν = (α1, β1, …, αm, βm)′. Conditions i and ii have already been discussed. By taking the derivatives of log fm,k, one finds that it is sufficient to take .
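The structure shared by examples 1 and 2 can be sketched numerically. The following is a minimal illustration, assuming s = 1, multinomial logistic gates that are linear in x, and Poisson experts whose mean is exp(hk(x; θj)) (a log link, which is our assumption for the sketch). The function names and parameter values are hypothetical and are not taken from the displayed formulas.

```python
import math

def softmax_gates(x, alphas, betas):
    """Multinomial logistic gating weights g_j(x) for a scalar covariate x."""
    scores = [a + b * x for a, b in zip(alphas, betas)]
    top = max(scores)  # subtract the maximum for numerical stability
    expd = [math.exp(score - top) for score in scores]
    total = sum(expd)
    return [e / total for e in expd]

def poly(x, theta):
    """k-th degree polynomial h_k(x; theta) = sum_i theta[i] * x**i."""
    return sum(t * x ** i for i, t in enumerate(theta))

def poisson_pmf(y, mean):
    """Poisson probability mass function, computed on the log scale."""
    return math.exp(-mean + y * math.log(mean) - math.lgamma(y + 1))

def mixture_density(y, x, alphas, betas, thetas):
    """Conditional density sum_j g_j(x) * Poisson(y; exp(h_k(x; theta_j)))."""
    gates = softmax_gates(x, alphas, betas)
    return sum(g * poisson_pmf(y, math.exp(poly(x, theta)))
               for g, theta in zip(gates, thetas))

# Hypothetical parameters: m = 2 experts, each with a quadratic predictor (k = 2).
ALPHAS, BETAS = [0.0, 0.5], [1.0, -1.0]
THETAS = [[0.1, 0.3, -0.2], [0.5, -0.4, 0.1]]
```

Because the gates are positive and sum to one for every x, the mixture is itself a proper conditional density in y, which is what the normalization condition on the gj requires.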

### 2.3.  Maximum Likelihood Estimation.

We want to find the parameter vector that maximizes the log-likelihood function of the data,
2.3
where ϕ0(X, Y) = exp(c(Y))px(X), that is,
2.4
The maximum likelihood estimator is not necessarily unique. In general, mixture-of-experts models are not identifiable under permutation of the experts. To circumvent this issue, one must impose restrictions on the experts and the weighting (or the parameter vector of the model), as shown in Jiang and Tanner (1999c).
Define the Kullback-Leibler (KL) divergence between pxy and fm,k as
$$KL(p_{xy}, f_{m,k}) = \int p_{xy} \log \frac{p_{xy}}{f_{m,k}} \, d\lambda. \tag{2.5}$$
The log-likelihood function in equation 2.3 converges to its expectation with probability one as the number of observations increases. Therefore, in the limit, the maximizer of equation 2.3 (indexed by ) also minimizes the Kullback-Leibler divergence between the true density and the estimated density.
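The link between the expected log-likelihood and the Kullback-Leibler divergence can be checked on a toy problem. The sketch below assumes a Bernoulli target with mean logistic(sin x), a covariate uniformly distributed on a finite grid, and two single-expert candidates (one linear, one cubic); all of these choices are hypothetical and serve only to show that ranking models by expected log-likelihood is the same as ranking them by KL divergence.

```python
import math

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

# Support of X with uniform weights (an assumption made for the sketch).
GRID = [i / 10 for i in range(-10, 11)]

def true_mean(x):
    """Hypothetical smooth target: P(Y = 1 | X = x)."""
    return logistic(math.sin(x))

def expected_loglik(model_mean):
    """E[log f(Y | X)] under the true Bernoulli law, X uniform on GRID."""
    total = 0.0
    for x in GRID:
        p, q = true_mean(x), model_mean(x)
        total += p * math.log(q) + (1.0 - p) * math.log(1.0 - q)
    return total / len(GRID)

def kl_divergence(model_mean):
    """KL(true, model) = E[log p] - E[log f]; the first term is model-free."""
    return expected_loglik(true_mean) - expected_loglik(model_mean)

def linear_expert(x):
    return logistic(0.8 * x)            # polynomial predictor with k = 1

def cubic_expert(x):
    return logistic(x - x ** 3 / 6.0)   # polynomial predictor with k = 3
```

Since E[log p] does not depend on the candidate model, the candidate with the larger expected log-likelihood is exactly the one with the smaller KL divergence; here the cubic predictor tracks sin x more closely than the linear one.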

As stated earlier, this work considers only independent and identically distributed (i.i.d.) observations, but it is straightforward to extend the results to more general data-generating processes (e.g., martingales). The next assumption formalizes it.

Assumption 2 (data-generating process).

The sequence (Xi, Yi)ni=1, n = 1, 2, … is an i.i.d. sequence of random vectors with common distribution PXY.

The next result ensures the existence of such an estimator:

Theorem 1 (existence).

For a sequence {(Zm,k)n} of compact subsets of Zm,k, n = 1, 2, …, there exists a measurable function (the maximum likelihood estimator) satisfying equation 2.4 PXY-almost surely.

The maximum likelihood estimator consistently estimates ζ*, where ζ* ∈ Zm,k is a minimizer of equation 2.5. We require the classic conditions: parametric identifiability and the existence of a unique minimizer of equation 2.5. If the i.i.d. condition fails, it can be shown that ergodicity of (log fm,k(Xi, Yi; ζ))ni=1 is a sufficient condition for consistency. However, conditions to ensure ergodicity of the log-likelihood function are out of the scope of this letter.

Assumption 3 (identifiability).

For any distinct ζ1 and ζ2 in Zm,k, the two corresponding densities fm,k(·, · ; ζ1) and fm,k(·, · ; ζ2) are not almost everywhere equal.

Jiang and Tanner (1999c) find sufficient conditions for identifiability of the parameter vector for the HME with one layer and Mendes, Veiga, and Medeiros (2006) for a binary tree structure. Both cases can be adapted to more general specifications. Although one can show consistency to a set, we adopt a more traditional approach requiring identifiability of the parameter vector.

Assumption 4 (unique maximizer).
Let ζ = (ν′, θ′)′ and let ζ* be the argument that minimizes KL(pxy, fm,k) over ζ ∈ Zm,k. Then
2.6

This assumption follows from a second-order Taylor expansion of the expected likelihood around the parameter vector that minimizes equation 2.5, denoted by ζ*. We require the Hessian matrix in equation 2.6 to be invertible at ζ*. The requirement for an identifiable unique maximizer is technical only in the sense that the objective function is not allowed to become too flat around the maximum. (For more discussion on this topic, see Bates & White, 1985, and White, 1996.) A similar assumption was made in the series of papers from Carvalho and Tanner (2005a, 2005b, 2006, 2007) and Zeevi et al. (1998) and is a usual assumption in the estimation of misspecified models.

Theorem 2 (parametric consistency of misspecified models).

Under assumptions 1, 2, 3, and 4, the maximum likelihood estimate converges to ζ* as n → ∞, PXY-almost surely.

Huerta, Jiang, and Tanner (2003) and the series of papers by Carvalho and Tanner (2005a, 2005b, 2006, 2007) derive similar results for time series processes.

## 3.  Main Results

### 3.1.  Approximation Rate.

We follow Jiang and Tanner (1999a) to bound the approximation error. Before presenting the main conditions, we introduce some key concepts.

Definition 1 (fine partition).

For m = 1, 2, …, let $Q^m = \{Q_j^m\}_{j=1}^{r_m}$ be a partition of Ω. If m → ∞ and, for all $x_1, x_2 \in Q_j^m$, $\|x_1 - x_2\|_\infty \le c_0 / r_m^{1/s}$ for some constant c0 independent of x1, x2, m, or j, then {Qm, m = 1, 2, …} is called a sequence of fine partitions with cardinality rm and bounding constant c0.

The key idea behind the approximation rate is to control the approximation rate inside each fine partition of the space. More precisely, bound the approximation inside the “worst” (i.e., most difficult to approximate) partition. We need the following assumption:

Assumption 5.
For a fine partition Qm of Ω, with bounding constant c0 and cardinality sequence $r_m = \lfloor m^{1/s} \rfloor^s$, m = 1, 2, …, there exists a constant c1>0, and a parameter vector such that
3.1
where IQ(x) = 1 if x ∈ Q and 0 otherwise.

This assumption is similar to the one employed in Jiang and Tanner (1999a) and requires that the vector of gating functions approximate the vector of indicator functions at a rate not slower than O(rm).

The cardinality rm is similar to that of Jiang and Tanner (1999b), who essentially used $r_m = \lfloor m^{1/s} \rfloor^s$ so as to form a regular hypercube partition of an s-dimensional domain of x.
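To make the regular hypercube partition concrete, the sketch below builds it for the unit cube and measures the largest cell diameter. It assumes Ω = [0, 1]^s and the sup-norm, both of which are choices made for the illustration; the helper names are hypothetical.

```python
import itertools

def grid_side(m, s):
    """Largest integer q with q**s <= m, so that r_m = q**s."""
    side = round(m ** (1.0 / s))
    while side ** s > m:          # correct floating-point overshoot
        side -= 1
    while (side + 1) ** s <= m:   # correct floating-point undershoot
        side += 1
    return side

def hypercube_partition(m, s):
    """Cells (lower corner, upper corner) of a regular partition of [0, 1]^s."""
    side = grid_side(m, s)
    width = 1.0 / side
    cells = []
    for index in itertools.product(range(side), repeat=s):
        lower = tuple(i * width for i in index)
        upper = tuple((i + 1) * width for i in index)
        cells.append((lower, upper))
    return cells

def max_sup_diameter(cells):
    """Largest sup-norm distance between two points of the same cell."""
    return max(max(hi - lo for lo, hi in zip(lower, upper))
               for lower, upper in cells)
```

For this partition every cell has sup-norm diameter 1/q = 1/r_m^{1/s}, so the fine-partition bound holds with bounding constant c0 = 1 under the stated assumptions.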

We show the approximation rate in a divergence measure stronger than the Kullback-Leibler divergence. Let p and q denote densities. The χ2 divergence (see, e.g., Wong & Shen, 1995) between p and q is
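$$\chi^2(p, q) = \int \frac{(p - q)^2}{q} \, d\lambda = \int \frac{p^2}{q} \, d\lambda - 1, \tag{3.2}$$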
where the equivalence is obtained by expanding the squares and integrating the densities to 1.
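The two forms of the χ2 divergence, and the sense in which it is stronger than the Kullback-Leibler divergence, can be verified numerically. The sketch below takes χ2(p, q) as the sum of (p − q)²/q and uses two Poisson laws truncated to a finite support as a hypothetical example; the truncation point and the means are assumptions.

```python
import math

def poisson_pmf(y, mean):
    return math.exp(-mean + y * math.log(mean) - math.lgamma(y + 1))

def chi2_direct(p, q):
    """Sum of (p - q)**2 / q over the support."""
    return sum((a - b) ** 2 / b for a, b in zip(p, q))

def chi2_expanded(p, q):
    """Sum of p**2 / q minus one, valid when p and q both sum to one."""
    return sum(a * a / b for a, b in zip(p, q)) - 1.0

def kl(p, q):
    """Kullback-Leibler divergence sum p * log(p / q)."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

SUPPORT = range(60)  # the mass beyond 60 is negligible for these means
P = [poisson_pmf(y, 2.0) for y in SUPPORT]
Q = [poisson_pmf(y, 2.5) for y in SUPPORT]
```

For Poisson means λ and μ the χ2 divergence has the closed form exp((λ − μ)²/μ) − 1, and the KL divergence is (μ − λ) + λ log(λ/μ), which gives an independent check of both functions.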
Theorem 3 (approximation rate).
Let and . If assumptions 1 and 5 hold, then
3.3
where k* = α ∧ (k + 1), and some positive constant c not depending on m or k. It also follows that
3.4

This result is a generalization of Jiang and Tanner (1999a) in three directions. First, we allow the target function h(·) to be in a Sobolev class with α derivatives; second, we consider a polynomial approximation to the target function in each expert (in fact, their result is a special case when α = 2 and k = 1); finally, we consider a divergence measure stronger than the Kullback-Leibler divergence. The result also holds under more general specifications of densities and experts. If a dispersion parameter has to be estimated from the data, we have to modify lemma 4 accordingly, and the same result holds.

The approximation rate also agrees with the optimal one for approximating functions in the Sobolev class by piecewise polynomials of a fixed degree (Windlund, 1977). Under assumption 5, this is exactly what we are doing, and therefore this approximation rate is sharp, meaning that one cannot derive faster rates for fixed k.
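The role of k in piecewise polynomial approximation can be illustrated for s = 1. The sketch below interpolates a smooth function at Chebyshev nodes on each of m equal pieces of [0, 1] and estimates the exponent r in a sup-norm error of order m^(−r). Chebyshev interpolation is only a convenient stand-in for the best piecewise polynomial approximant and is not the construction used in the proofs; the target function is hypothetical.

```python
import math

def chebyshev_nodes(a, b, count):
    """`count` Chebyshev nodes on the interval [a, b]."""
    mid, half = (a + b) / 2.0, (b - a) / 2.0
    return [mid + half * math.cos((2 * i + 1) * math.pi / (2 * count))
            for i in range(count)]

def lagrange_eval(nodes, values, x):
    """Evaluate the polynomial interpolating (nodes, values) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(nodes, values)):
        weight = 1.0
        for j, xj in enumerate(nodes):
            if j != i:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total

def sup_error(func, m, k, points_per_piece=200):
    """Sup-norm error of a degree-k piecewise interpolant with m pieces."""
    worst = 0.0
    for piece in range(m):
        a, b = piece / m, (piece + 1) / m
        nodes = chebyshev_nodes(a, b, k + 1)
        values = [func(t) for t in nodes]
        for i in range(points_per_piece + 1):
            x = a + (b - a) * i / points_per_piece
            worst = max(worst, abs(func(x) - lagrange_eval(nodes, values, x)))
    return worst

def empirical_order(func, m, k):
    """Estimate r in error ~ m**(-r) by doubling the number of pieces."""
    return math.log2(sup_error(func, m, k) / sup_error(func, 2 * m, k))

def target(x):
    return math.sin(2.0 * math.pi * x)  # hypothetical infinitely smooth target
```

For an infinitely smooth target, the estimated exponent is close to k + 1, so each additional polynomial degree buys one more power of 1/m in the sup-norm error. This is in line with k* = α ∧ (k + 1) when α is large.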

### 3.2.  Convergence Rate.

In the previous section, we found a bound for the approximation error. In this section, we will find the estimation error and combine it with the approximation error to derive the rate of convergence. The estimation error describes how far the estimated function is from the best approximant in the class. In this work, the convergence rates are derived with respect to the squared Hellinger distance, defined below. For any two densities p and q with respect to λ, the squared Hellinger distance is
The estimation error, measured in the squared Hellinger distance, is $O_p(d_{m,k}\, n^{-1} \log(d_{m,k}\, n))$. We also show that some choices of k and m achieve near-optimal convergence rates.
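A small numerical sketch may help fix ideas about the squared Hellinger distance. It assumes the convention without the factor 1/2, that is, the sum of (√p − √q)² over the support, which may differ from the normalization used in this letter by a constant. The Poisson example and the truncation point are hypothetical.

```python
import math

def poisson_pmf(y, mean):
    return math.exp(-mean + y * math.log(mean) - math.lgamma(y + 1))

def squared_hellinger(p, q):
    """Sum of (sqrt(p) - sqrt(q))**2 over the support (no factor 1/2)."""
    return sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))

def squared_hellinger_poisson(mean_p, mean_q):
    """Closed form for two Poisson laws under the same convention."""
    gap = math.sqrt(mean_p) - math.sqrt(mean_q)
    return 2.0 * (1.0 - math.exp(-0.5 * gap ** 2))

SUPPORT = range(80)  # the mass beyond 80 is negligible for these means
P = [poisson_pmf(y, 2.0) for y in SUPPORT]
Q = [poisson_pmf(y, 2.5) for y in SUPPORT]
```

Under this convention the squared distance never exceeds 2, however far apart the two laws are, whereas the Kullback-Leibler divergence between the same laws can be arbitrarily large.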

The next theorem summarizes the convergence rate of the maximum likelihood estimator with respect to the squared Hellinger distance between the true density pxy and the estimated density.

Theorem 4 (convergence rate).
Let and denote its maximum likelihood estimator on . Let m = mn be such that m(n−1log n) → 0 as n increases. Under assumptions 1, 2, and 5,
3.5
where k* = α ∧ (k + 1) and c is a positive constant. In particular, if we assume vm = O(m) and let m be proportional to , then
3.6
Remark 2.

Although the previous result is derived for the i.i.d. case, the result also holds for a more general data-generating process. This convergence rate is close to the optimal rate found in the sieves literature if k* = α (see, e.g., Stone, 1980, and Barron & Sheu, 1991). To derive the convergence rate, we do not assume f*m,k is a unique, identifiable maximizer of the log-likelihood function 2.3. Here, f*m,k is allowed to be any such maximizer. The price to pay for such generality is the inclusion of the “log n” term in the convergence rates.

### 3.3.  Consistency.

We apply the previous result to show that the maximum likelihood estimator is consistent, that is, the Hellinger distance between the true density and the estimated model approaches zero as the sample size n and the index of the approximation class m go to infinity.

Corollary 1 (consistency)

Let and denote its maximum likelihood estimator on . Allow m = mn and m(n−1log n) → 0 as n increases. Under assumptions 1, 2, and 5, as n increases.

## 4.  Effects of m and k

Two important problems in the area of ME are: (1) What number of experts m should be chosen, given the size n of the training data, and (2) Given the total number of parameters, whether it is better to use a few complex experts or combine many simple experts. Our results will not be able to answer these questions completely, but they can provide some qualitative insights. We provide a related theoretical result.

Start by setting and $d = m(k+1)^s$, an upper bound on the number of parameters dm,k. The convergence rate of $d_h^2$ in equation 3.5 can be upper-bounded by a simpler expression
such that k1 = k + 1, k* = k1 ∧ α, and c is a positive constant. This assumes that vm = O(m) and uses the fact that the number of parameters needed in s-dimensional polynomials of order k is bounded by $J_k \le (k+1)^s$. We have also used a lower bound of the factorial based on Stirling's formula. We now study the upper bound U.
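The bound on the number of polynomial coefficients can be checked by direct enumeration. The sketch below assumes that hk collects every monomial of total degree at most k in s variables, in which case the count is the binomial coefficient C(s + k, k); this reading of Jk and the helper names are assumptions made for illustration.

```python
import math
from itertools import product

def num_monomials(s, k):
    """Number of monomials of total degree <= k in s variables: C(s + k, k)."""
    return math.comb(s + k, k)

def enumerate_monomials(s, k):
    """Exponent tuples (e_1, ..., e_s) with e_1 + ... + e_s <= k."""
    return [e for e in product(range(k + 1), repeat=s) if sum(e) <= k]

def parameter_bound(m, s, k):
    """d = m * (k + 1)**s, the bound on the expert parameters used in the text."""
    return m * (k + 1) ** s
```

For example, with s = 2 and k = 2 there are 6 coefficients per expert against a bound of 9, so d = m(k + 1)^s is conservative but has the right order in k for fixed s.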
Proposition 1.
Let c be a positive constant that does not depend on n but can take different values at different places. Let k1 = k + 1, k* = k1 ∧ α, and
4.1
(which is an upper bound for the $d_h^2$ convergence rate derived in theorem 4). Let $d = m k_1^s$, which is a bound for the approximate order of the total number of parameters.

Then the following statements are true:

I. Consider the case where α is finite:

(a) As n → ∞, we have U → 0 if m → ∞ and d = o(n/log n).

(b) U achieves a near-optimal rate $O(n^{-2\alpha/(s+2\alpha)} (\log n)^c)$ for some c > 0, under the following choices:

• k1 ⩾ α and $k_1 = O((\log n)^c)$ for some c > 0.

• m is of order for any constant .

II. Consider the case where α = ∞ (or α ⩾ k1):

(a) As n → ∞, we have U → 0 if d → ∞ and d = o(n/log n).

(b) The following choices make U attain a “near-parametric rate” $U = O((\log n)^c / n)$ for some c > 0:

• m ⩾ 1 and $m = O((\log n)^c)$ for some c > 0.

• k1 ⩾ c log n for any constant c > 0, and $k_1 = O((\log n)^c)$ for some c > 0.

Remark 3.

(a) The results above do not completely answer the earlier questions on how to choose m and k in practice. For example, the results on m and k are known only up to some order in n. In addition, the convergence rates may depend on the smoothness parameter α, which may be unknown in practice. A practical method for choosing m and k may involve a complexity penalty or cross-validation and is outside the scope of this letter. On the other hand, our convergence rate analysis provides some useful qualitative insights.

(b) For the very smooth situation, result IIa suggests that for the purpose of consistency (which means the convergence of $d_h^2$ to 0 in probability), question 2 about the ratio between (m, k + 1) is not relevant as long as $d = m(k+1)^s$ grows to infinity at a rate slower than n/log n. However, consistency is not enough to guarantee a good performance. For example, equation 4.1 suggests that for s = 1, (m, k + 1) = (log n, 1) will lead to a very slow rate $O((\log n)^{-2})$, and (m, k + 1) = (1, log n) will lead to a very fast rate $O((\log n)^2 / n)$, and in both cases the total number of parameters d = m(k + 1) is the same. It is therefore important to look into the convergence rates.

(c) Results Ib and IIb imply that smoother target functions (with large α) and lower dimensions (s) generally encourage using fewer experts. For finite α, the near-optimal rates described in result Ib are achieved when m ≫ k in order. For the very smooth situation α = ∞, even m = 1 (≪ k) can lead to near-optimal performance.

(d) We note that near-optimal convergence rates can always be achieved with k1 not being too large compared to the sample size n. This is summarized in the two situations in results Ib and IIb, where we see that even in the case α = ∞, we only need about k1 ∼ log n for us to achieve a near-parametric convergence rate.

(e) Although in result Ib (with finite α) we have used m ≫ k to achieve near-optimal rates, we conjecture that even with m = 1, a good (but perhaps suboptimal) convergence rate can be attained. For example, for s = 1, using the Legendre approximation technique 7.5 of Barron and Sheu (1991), we conjecture that a convergence rate is of the form , where d = mk1 and c is a positive constant. Therefore (denoting α* = α − 1), even when m = 1, we can still take k1 to be of order and get , which is suboptimal compared to result Ib but is still converging to 0 if α>1. [Similarly, we conjecture that m → ∞ is not necessary for the consistency result Ia; we need only d → ∞ and d = o(n/log n).]
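The two rates quoted in remark 3(b) for s = 1 can be compared numerically. The sketch below evaluates the orders (log n)^(−2) and (log n)²/n with all constants set to one, which is an assumption; the comparison is therefore meaningful only for the growth in n, not for any particular finite sample.

```python
import math

def many_simple_experts_rate(n):
    """Order of the rate for (m, k + 1) = (log n, 1): (log n)**-2."""
    return math.log(n) ** -2

def one_complex_expert_rate(n):
    """Order of the rate for (m, k + 1) = (1, log n): (log n)**2 / n."""
    return math.log(n) ** 2 / n

def rate_ratio(n):
    """Ratio of the two orders, equal to n / (log n)**4."""
    return many_simple_experts_rate(n) / one_complex_expert_rate(n)
```

Both configurations use d = log n parameters, yet the ratio n/(log n)⁴ diverges, so in the very smooth case the single high-degree expert is favored by an ever larger margin as n grows.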

## 5.  Conclusion

In this letter, we study the mixture-of-experts model with m experts in a one-parameter exponential family with conditional mean ϕ(hk), where hk is a kth-order polynomial and ϕ(·) is the inverse link function. We derive the approximation rate and the convergence rate of the maximum likelihood estimator to densities in a one-parameter exponential family with mean ϕ(h), where h belongs to a Sobolev class with α derivatives and bounding constant K0. We found that the convergence rate of the maximum likelihood estimator to the true density in squared Hellinger distance is , for k* = (k + 1) ∧ α and c some positive constant.

We discuss choices of k and m for achieving good convergence rates. The results of this letter can be generalized to more complex target densities (from, e.g., a Besov or a piecewise Besov class) and models (e.g., mixture of trigonometric polynomials or wavelets) with simple modifications to the proofs.

We generalize Jiang and Tanner (1999a) in several directions: (1) we assume one can include polynomial terms of the variables on the GLM1 experts; (2) we assume the target density is in a class, for α>0, instead of ; (3) we show consistency of the maximum likelihood estimator for a fixed number of experts; (4) we calculate convergence rates of the maximum likelihood estimator in squared Hellinger distance; (5) we show consistency when the number of experts and the sample size increase; and, finally, (6) we find that using polynomials in the experts, one can yield better estimation and error bounds. These developments have shed light on the important question of how the number of experts and the complexity of the experts jointly affect the convergence rate.

## Appendix A:  Showing the Convergence Rate

In this appendix we explain and justify the main steps in proving the convergence rate.

One of the drawbacks of working with the Kullback-Leibler divergence is that it is not bounded. We will use the Hellinger distance:

Definition 2 (Hellinger distance).
Let P and Q denote two probability measures absolutely continuous with respect to some measure λ. The Hellinger distance between P and Q is given by
A.1
Alternatively, the Hellinger distance between two densities p and q with respect to λ is given by
A.2

The next lemma summarizes basic inequalities well known in the literature (e.g., Wong & Shen, 1995) relating the Hellinger distance, the Kullback-Leibler divergence, and the χ2 divergence.

Lemma 1.
Let pxy = dP/dλ. For we have

In order to bound the estimation error, we use results from the theory of empirical processes. The convergence rate theorem presented below is derived for the i.i.d. case; however, the same result holds for martingales (see van der Geer, 2000).

The control of estimation rate inside a class of functions requires the knowledge of the complexity of the functional class. Denote by the number of ϵ-brackets, with respect to the distance ‖ · ‖, needed to cover the set and —the respective bracketing entropy.4 The use of a bracketing entropy to assess the complexity of a class of mixture of regressions is not new. Genovese and Wasserman (2000) and Viele and Tong (2002) use the entropy with bracketing to measure the complexity in a class of mixture models and mixture of regressions, respectively. Applying their method in our setting gives a bracketing entropy of the same order as if one employs the method used in this letter. We use the latter for the ease of exposition.

We are particularly interested in the class of functions
for some fixed , and 0 < δ ⩽ 1.

Let $d_{m,k} = v_m + m J_k$, $Z_{m,k} = V_m \times \Theta_k^m$, and use c for any arbitrary positive constant that may change its value every time it appears.

Lemma 2 (bracketing entropy).
Under assumption 1, for any and some a ⩾ 1,
A.3
where
A.4
Proof.
The proof of equation A.3 makes use of lemma 6. Set , i = 1, 2. Each function gi can be written as , for ζiZm,k, which depends on ζ only through . Then for each (x, y),
The derivative on the right-hand side can be bounded by
By applying lemma 6 with ‖ · ‖ = ‖ · ‖2 and , we have that
which is bounded by by assumption 1. The number of ϵ-balls with respect to L needed to cover Zm is
because by assumption 1, Zm is a hypercube with side l. It follows from lemma 6 that
Since l is polynomial in dm,k, we can take , for some a ⩾ 1 and c>0. Taking the log, we obtain equation A.3. Then we apply lemma 5.
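The counting step in the proof relies on how many sup-norm ϵ-balls are needed to cover a hypercube. The sketch below uses the standard grid count, ⌈l/(2ϵ)⌉ raised to the dimension, for a hypercube of side l; it is offered as an illustration of why the entropy grows linearly in the number of parameters and does not reproduce the exact display in the proof.

```python
import math

def covering_number(side, eps, dim):
    """Sup-norm eps-balls (cubes of side 2 * eps) covering [0, side]^dim."""
    return math.ceil(side / (2.0 * eps)) ** dim

def log_covering_number(side, eps, dim):
    """Logarithm of the grid covering number, computed without overflow."""
    return dim * math.log(math.ceil(side / (2.0 * eps)))
```

The logarithm is of order dim · log(side/ϵ), which is the shape of bound one expects for a parametric class with dm,k parameters confined to a hypercube whose side is polynomial in dm,k.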

We use a modified version of theorem 10.13 in van der Geer (2000) to show the rate of convergence of the Hellinger distance between the maximum likelihood estimator and the true density. This modification allows for unbounded likelihood ratios, that is, it relaxes the assumption that $\|p_{xy}/f^*_{m,k}\|_{\infty,\lambda}$ is bounded.

Theorem 5 (modified version of theorem 10.13 in van der Geer, 2000).
Let denote the maximum likelihood estimator of pxy over . Set
for some fixed satisfying f*>0 λ-a.e. Choose
in such a way that Ψ(δ)/δ2 is a nonincreasing function of δ. Then, for , we have
Sketch Proof.
The proof is parallel to the one of theorem 10.13 in van der Geer (2000). At line 3 of the displayed equations on p. 191, the second term,
which is lower-bounded by, using the Cauchy-Schwarz inequality,
using the uniform bound condition on the likelihood ratio, that is, $\|p_{xy}/f^*_{m,k}\|_{\infty,\lambda} \le c$ (van der Geer, 2000, eq. 10.69). We modify the proof as follows:
This allows us to proceed with the χ2-divergence without needing to bound the densities as in equation 10.69 of van der Geer (2000).

## Appendix B:  Auxiliary Results

In the next lemma, we use the notation ∂θ = ∂/∂θ, , aj = a(hk(x; θj)), , and so on.

Lemma 3.

Let . Under assumption 1,

• •

• •

• •

• •

if we further assume 3 and 4, then and is nonsingular at ζ*.

Proof.
This theorem is proved by calculating the derivatives and bounding it. First, note that aj and bj are continuous differentiable functions of hk(x; θj). Since for any fixed k, then both aj and bj are also bounded. The same reasoning can be applied to , , , and . Also, by definition, for any p ⩾ 0. Then
Let and c* = maxj‖∂νlog gj∞,Ω. Then
The same follows for , and . Let , and choose any vector α with appropriate dimensions satisfying α′α = 1. Then

Since ζ* is a maximizer of over , has to be nonnegative definite. Assumption 4 tells us it is also invertible; therefore, is positive definite.

We use the next lemma to bound uniformly the approximation rate of the family of functions with respect to the χ2 divergence. Define the upper divergence between and as
B.1

We can use the upper divergence to bound the χ2 divergence.

Lemma 4.
Let and , then, under condition 1(i),

where M is a finite, positive constant (see the proof of the lemma for a closed-form expression).

Proof.
It follows from the definition of χ2 divergence and concavity of the logarithm that for any and ,
where a = a(h(x)), aj = a(hk(x; θj)), and b*(a) = b(h(x)). Consider the identity
hence,
which does not depend on y. A second-order Taylor expansion of b*(2aaj) and b*(aj) gives us, respectively,
for and on the line connecting a and aj.
Adding up these equations and subtracting from 2b*(a) gives
Call . Use the inequality e|x| − 1 ⩽ |x|e|x| and the mean value theorem to show that for some ,
Choose to conclude that
Lemma 5.
For any and a positive constant C,
B.2
Proof.
For any 0 < a < bC,
The bound on the gamma function follows from the definition of the incomplete gamma function and the Mills ratio:

The result follows from the fact that x>1/2.

The next lemma provides a bound on the bracketing number of functional classes that are Lipschitz in a parameter:

Lemma 6 (theorem 2.7.11 in van der Vaart & Wellner, 1996).
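Let $\mathcal{F} = \{f_t : t \in T\}$ be a class of functions satisfying
$$|f_s(x) - f_t(x)| \le d(s, t)\,F(x)$$
for some metric d on T, function F on the sample space, and every x. Then for any norm ‖ · ‖,
$$N_{[\,]}(2\epsilon\|F\|, \mathcal{F}, \|\cdot\|) \le N(\epsilon, T, d), \tag{B.3}$$
where $N(\epsilon, T, d)$ is the ϵ-covering number of T with respect to the metric d.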
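The construction behind this bound can be illustrated on a toy class. The sketch below assumes the hypothetical class $f_t(x) = tx$ on $x \in [0, 1]$ with $T = [0, 1]$, $d(s, t) = |s - t|$, and envelope $F(x) = x$; none of these choices come from the letter. Each center $t_i$ of an ϵ-cover of T yields the bracket $[f_{t_i} - \epsilon F,\ f_{t_i} + \epsilon F]$, so the number of brackets equals the number of centers.

```python
import math

def f(t, x):
    """Toy class member f_t(x) = t * x, Lipschitz in t with envelope F(x) = x."""
    return t * x

def envelope(x):
    """Envelope F(x) with |f_s(x) - f_t(x)| <= |s - t| * F(x)."""
    return x

def epsilon_net(eps):
    """Centers of an eps-cover of T = [0, 1] under d(s, t) = |s - t|."""
    n_centers = math.ceil(1.0 / (2.0 * eps))
    return [min(1.0, (2 * i + 1) * eps) for i in range(n_centers)]

def brackets(eps):
    """One bracket (lower, upper) per center: f_c -/+ eps * F."""
    return [
        (lambda x, c=c: f(c, x) - eps * envelope(x),
         lambda x, c=c: f(c, x) + eps * envelope(x))
        for c in epsilon_net(eps)
    ]

def is_bracketed(t, eps, grid, tol=1e-12):
    """True if some bracket contains f_t at every grid point."""
    for lower, upper in brackets(eps):
        if all(lower(x) - tol <= f(t, x) <= upper(x) + tol for x in grid):
            return True
    return False

eps = 0.125
grid = [i / 50 for i in range(51)]
n_brackets = len(brackets(eps))
all_covered = all(is_bracketed(i / 100, eps, grid) for i in range(101))
```

In the supremum norm each bracket has width $2\epsilon\|F\|_\infty = 2\epsilon$, which matches the bracket size in the lemma.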

## Appendix C:  Proof of the Main Results

Proof of Theorem 1.

The data-generating process of (x, y) and the structure of the model are enough to satisfy the measurability assumptions (i.e., it is a weighted sum of measurable functions).

The approximating density $f_m(x, y; \zeta)$ is a continuous function of the parameter vector ζ, $P_{XY}$-almost everywhere. We verify this claim by choosing (x, y) ∈ Ω × A from a set with positive probability and noting that (1) $\pi(h_k(x; \theta), y)$ is a continuous function of θ and (2) $(g_1(x; \nu), \ldots, g_m(x; \nu))$ is a vector of continuous functions of ν. Together these imply that $f_m(x, y; \zeta) = \sum_i g_i(x; \nu)\,\pi(h_k(x; \theta_i), y)$ is also a continuous function of the parameter vector $\zeta = (\nu', \theta_1', \ldots, \theta_m')'$.
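The structure of the approximating density can be made concrete with a small sketch. It rests on assumptions that are not taken from the letter: a scalar covariate, Poisson experts with log link (so the mean is $\exp(h_k)$), and softmax gates that are linear in x. All function and parameter names are hypothetical.

```python
import math

def poly(x, theta):
    """Evaluate h_k(x; theta) = theta[0] + theta[1]*x + ... by Horner's rule."""
    result = 0.0
    for coef in reversed(theta):
        result = result * x + coef
    return result

def gates(x, nu):
    """Softmax gating weights g_i(x; nu); nu is a list of (intercept, slope)."""
    scores = [b0 + b1 * x for b0, b1 in nu]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(score - top) for score in scores]
    total = sum(weights)
    return [w / total for w in weights]

def poisson_pmf(y, mean):
    """Poisson probability mass at the integer y >= 0."""
    return math.exp(y * math.log(mean) - mean - math.lgamma(y + 1))

def mixture_density(x, y, nu, thetas):
    """f_m(x, y; zeta) = sum_i g_i(x; nu) * pi(h_k(x; theta_i), y)."""
    weights = gates(x, nu)
    return sum(
        g * poisson_pmf(y, math.exp(poly(x, theta)))
        for g, theta in zip(weights, thetas)
    )

# Two experts (m = 2), quadratic polynomials (k = 2); values are illustrative.
nu = [(0.0, 0.0), (1.0, -2.0)]
thetas = [[0.1, 0.5, -0.2], [1.0, -0.3, 0.4]]
total_mass = sum(mixture_density(0.3, y, nu, thetas) for y in range(100))
```

Each ingredient is a composition of continuous functions of the parameters, which is the point made in items (1) and (2) of the proof.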

The result follows from theorem 2.12 in White (1996).

Proof of Theorem 2.

There are different approaches to showing the consistency of the estimate. Here we verify the conditions of theorem 3.5 in White (1996).

The first assumptions, regarding the existence of the estimate, were already shown to be satisfied in the proof of theorem 1. Assumption 3.2 in White (1996), regarding identifiability, is satisfied by assumptions 3 and 4. It remains to verify assumption 3.1 in White (1996), regarding boundedness and uniform convergence of the log-likelihood function. We can show continuity of by interchanging integration with limits and using a first-order Taylor expansion:
which is bounded by lemma 3 and by the fact that ϵ is arbitrary.
To show uniform convergence of the likelihood function, we verify the conditions of theorem 2 in Jennrich (1969). By assumption, $Z_{m,k}$ is a compact subset of . Measurability and continuity conditions are already satisfied; thus, it remains to show that $\log f_{m,k}$ is bounded by an integrable function. We can bound the log-likelihood function by

Define the bounding function . The function because $\max_i x_i = 1$ and $\sum_i |\theta_i| < \infty$; then both $a(h_k)$ and $b(h_k)$ are finite. Thus, it is straightforward to show that , given that , which is satisfied by the assumption on $p_{y|x}$. In conclusion, as n → ∞.

It follows from theorem 3.5 in White (1996) that PXY-a.s. as n → ∞.

Proof of Theorem 3.
It follows from lemma 4 that it is enough to bound the upper divergence defined as
Assumption 5 ensures the existence of a such that , where is finite because $P_X$ has a continuous density function with respect to the finite measure λ on Ω. Consider
C.1
It remains to bound both terms on the right-hand side of equation C.1, denoted (A1) and (A2). The second term can be written as

where the equality follows from the fact that and .

If k < α, one can choose θj such that
where $\mathbf{k} = (k_1, \ldots, k_s)$ is a vector of positive integers satisfying $|\mathbf{k}| = k + 1$, $\mathbf{k}! = k_1! \cdots k_s!$ and . This claim follows from a Taylor expansion of h(x) around fixed points $x_j \in Q_j^m$ and the fact that . Similarly, if k ⩾ α, we can use the expansion only up to α terms. By assumption 5, . Then
C.2

where c2 = c0K1/s0, and k* = α ∧ (k + 1).

Therefore, . Note that
where the last inequality is due to equation C.2 and assumption 5.
Combining the results for (A1) and (A2),
C.3
It follows from lemma 4 that
C.4
We choose $m^{1/s} \ge r_m^{1/s} = \lfloor m^{1/s} \rfloor \ge m^{1/s}/2$. By assumption 1(ii), . Hence,
where $c_3 = M(c_1 + 1)$ does not depend on f. Therefore,
proving the first result. The second result follows from lemma 1.
Proof of Theorem 4.
We use theorem 5 in appendix A, setting $f^* = f^*_{m,k}$. By lemma 2, we can choose . This choice of function satisfies the requirement that $\Psi(\delta)/\delta^2$ is nonincreasing, and we can take . To see that this choice of $\delta_n$ is valid, note that for all n sufficiently large and some positive constant c,
Then, .
We use theorems 5 and 3 to arrive at our result, equation 3.5:
Proof of Proposition 1.

(Ia) Write $U = (c_s/(m^{1/s}k^*))^{2k^*} + d\log(dn)/n$, where $d = m(k + 1)^s$ and $k^* = (k + 1) \wedge \alpha$. Note that $k^* \ge 1$ since α is a positive integer for the Sobolev space introduced in section 2.1. Therefore, the first term of U converges to 0 as m → ∞. For the second term, apply the condition d = o(n/log n): once d ⩽ n, we have $\log(dn) \le 2\log n$, so $d\log(dn)/n \le 2d\log n/n = o(1)$. This shows U → 0.

(Ib) In our notation, $k_1 = k + 1$. When $k_1 = k + 1 \ge \alpha$, $k^* = (k + 1) \wedge \alpha = \alpha$. Then $U = (c_s/(\alpha m^{1/s}))^{2\alpha} + m(k + 1)^s\log(m(k + 1)^s n)/n$. We plug in the choice $k + 1 = O((\log n)^c)$ for some positive constant c, and the choice m being of order for some constant power c′; then both terms in U are at most of order $O(n^{-2\alpha/(s+2\alpha)}(\log n)^c)$ for some positive power c.

(IIa) When α = ∞ (or α is at least k + 1, where k ⩾ 0 is the degree of the polynomial model), we have $k^* = (k + 1) \wedge \alpha = k + 1$. Then we can write $U = (c_s/(m^{1/s}(k + 1)))^{2(k+1)} + d\log(dn)/n = (c_s/d^{1/s})^{2(k+1)} + d\log(dn)/n$. The first term converges to 0 as d → ∞. The second term converges to 0 because d = o(n/log n) (as in the proof of Ia).

(IIb) Consider the expression in the proof of IIa: $U = (c_s/(m^{1/s}(k + 1)))^{2(k+1)} + d\log(dn)/n$, where $d = m(k + 1)^s$. The second term in U is at most $O(n^{-1}(\log n)^c)$ for some c > 0, when m and k + 1 are both at most of the order of some power of log n. When m ⩾ 1 and (k + 1) ⩾ c log n for some positive constant c, the first term in U is at most for large n, for some positive constants $c_1$ and $c_2$, which is negligible for large n compared to the order $O(n^{-1}(\log n)^c)$ of the second term of U.
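The trade-off between the two terms of U can be explored numerically. The sketch below assumes the form $U = (c_s/(m^{1/s}k^*))^{2k^*} + d\log(dn)/n$ read off from the expressions in parts (Ib) and (IIa). The constant $c_s$ is set to 1 as a placeholder, and the choice $m \approx n^{s/(s+2\alpha)}$ is an assumption suggested by the rate stated in (Ib), not a value taken from the letter. Function names are hypothetical.

```python
import math

def rate_bound(m, k, n, alpha, s, c_s=1.0):
    """Evaluate U for m experts, polynomial degree k, and sample size n."""
    k_star = min(k + 1, alpha)          # k* = (k + 1) ^ alpha
    d = m * (k + 1) ** s                # d = m (k + 1)^s
    approximation = (c_s / (m ** (1.0 / s) * k_star)) ** (2 * k_star)
    estimation = d * math.log(d * n) / n
    return approximation + estimation

def balanced_m(n, alpha, s):
    """Number of experts of order n^{s / (s + 2 alpha)} (assumed choice)."""
    return math.ceil(n ** (s / (s + 2.0 * alpha)))

# Regime (Ib): alpha = 2, s = 1, and k + 1 = alpha, so that k* = alpha.
sample_sizes = [10**3, 10**5, 10**7]
bounds = [rate_bound(balanced_m(n, 2, 1), 1, n, 2, 1) for n in sample_sizes]
```

Passing `alpha=math.inf` gives $k^* = k + 1$, which corresponds to the setting of parts (IIa) and (IIb).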

## Acknowledgments

We are grateful to the referees for their useful comments that have substantially improved the overall presentation of our letter. Also, we thank Martin Tanner, Thomas Severini, Robert Kohn, Marcelo Fernandes, and Marcelo Medeiros for insightful discussions about mixture of experts and/or comments on previous versions of this letter.

## References

Barron, A., & Sheu, C. (1991). Approximation of density functions by sequences of exponential families. Annals of Statistics, 19(3), 1347–1369.

Bates, C., & White, H. (1985). A unified theory of consistent estimation for parametric models. Econometric Theory, 1(2), 151–178.

Carvalho, A., & Tanner, M. (2005a). Modeling nonlinear time series with local mixtures of generalized linear models. 33(1), 97–113.

Carvalho, A., & Tanner, M. (2005b). Mixtures-of-experts of autoregressive time series: Asymptotic normality and model specification. IEEE Transactions on Neural Networks, 16(1), 39–56.

Carvalho, A., & Tanner, M. (2006). Modeling nonlinearities with mixtures-of-experts of time series models. International Journal of Mathematics and Mathematical Sciences, 9, 1–22.

Carvalho, A., & Tanner, M. (2007). Modelling nonlinear count time series with local mixtures of Poisson autoregressions. Computational Statistics and Data Analysis, 51(11), 5266–5294.

Celeux, G., Hurn, M., & Robert, C. (2000). Computation and inferential difficulties with mixture distributions. Journal of the American Statistical Association, 99, 957–970.

Ge, Y., & Jiang, W. (2006). On consistency of Bayesian inference with mixtures of logistic regression. Neural Computation, 18(1), 224–243.

Genovese, C., & Wasserman, L. (2000). Rates of convergence for the gaussian mixture sieve. Annals of Statistics, 128, 1105–1127.

Geweke, J. (2007). Interpretation and inference in mixture models: Simple MCMC works. Computational Statistics and Data Analysis, 51, 3529–3550.

Geweke, J., & Keane, M. (2007). Smoothly mixing regressions. Journal of Econometrics, 138(1), 252–290.

Huerta, G., Jiang, W., & Tanner, M. (2003). Time series modeling via hierarchical mixtures. Statistica Sinica, 13(4), 1097–1118.

Jacobs, R., Jordan, M., Nowlan, S., & Hinton, G. (1991). Neural Computation, 3(1), 79–87.

Jennrich, R. (1969). Asymptotic properties of non-linear least squares estimators. Annals of Mathematical Statistics, 40(2), 633–643.

Jiang, W., & Tanner, M. (1999a). Hierarchical mixtures-of-experts for exponential family regression models: Approximation and maximum likelihood estimation. Annals of Statistics, 27, 987–1011.

Jiang, W., & Tanner, M. (1999b). On the approximation rate of hierarchical mixtures-of-experts for generalized linear models. Neural Computation, 11(5), 1183–1198.

Jiang, W., & Tanner, M. (1999c). On the identifiability of mixtures-of-experts. Neural Networks, 12(9), 1253–1258.

Jordan, M., & Jacobs, R. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2), 181–214.

Mendes, E., Veiga, A., & Medeiros, M. (2006). Estimation and asymptotic theory for a new class of mixture models. Unpublished manuscript, Pontifical Catholic University of Rio de Janeiro.

Norets, A. (2010). Approximation of conditional densities by smooth mixtures of regressions. Annals of Statistics, 38(3), 1733–1766.

Peng, F., Jacobs, R., & Tanner, M. (1996). Bayesian inference in mixtures-of-experts and hierarchical mixtures-of-experts models with an application to speech recognition. Journal of the American Statistical Association, 91, 953–960.

Stone, C. (1980). Optimal rates of convergence for nonparametric estimators. Annals of Statistics, 8(6), 1348–1360.

van der Geer, S. (2000). Empirical processes in M-estimation. Cambridge, UK: Cambridge University Press.

van der Vaart, A., & Wellner, J. (1996). Weak convergence and empirical processes. New York: Springer-Verlag.

Viele, K., & Tong, B. (2002). Modeling with mixture of linear regressions. Statistics and Computing, 12, 315–330.

Villani, M., Kohn, R., & Giordani, P. (2009). Regression density estimation using smooth adaptive gaussian mixtures. Journal of Econometrics, 153(2), 155–173.

White, H. (1996). Estimation, inference and specification analysis. Cambridge, UK: Cambridge University Press.

Windlund, O. (1977). On best error bounds for approximation by piecewise polynomial functions. Numerische Mathematik, 27, 327–338.

Wong, W., & Shen, X. (1995). Probability inequalities for likelihood ratios and convergence rates of sieves MLEs. Annals of Statistics, 23(2), 339–362.

Wood, S., Jiang, W., & Tanner, M. (2002). Bayesian mixture of splines for spatially adaptive nonparametric regression. Biometrika, 89(3), 513–528.

Wood, S., Kohn, R., Cottet, R., Jiang, W., & Tanner, M. (2008). Journal of Computational and Graphical Statistics, 17(2), 352–372.

Wood, S., Rosen, O., & Kohn, R. (2011). Bayesian mixtures of autoregressive models. Journal of Computational and Graphical Statistics, 20(1), 174–195.

Young, D., & Hunter, D. (2010). Mixtures of regressions with predictor-dependent mixing proportions. Computational Statistics and Data Analysis, 54(10), 2253–2266.

Zeevi, A., Meir, R., & Maiorov, V. (1998). Error bounds for functional approximation and estimation using mixtures of experts. IEEE Transactions on Information Theory, 44(3), 1010–1025.

## Notes

1

Suppose 1 ⩽ p ⩽ ∞ and α > 0 is an integer. We define as the collection of measurable functions h with all partial derivatives $D^r h$, $|r| \le \alpha$, in $L^p(P_X)$, satisfying . Here and $|r| = r_1 + \cdots + r_s$ for $r = (r_1, \ldots, r_s)$.

2

We denote , for some finite l.

3

“Near-parametric rate” stands for “close to the parametric rate O(1/n).”

4

For a formal definition of bracketing numbers, see van der Vaart and Wellner (1996).