Wenxin Jiang
1–8 of 8 results
Neural Computation (2012) 24 (11): 3025–3051.
Published: 01 November 2012
In this letter, we consider a mixture-of-experts structure in which m experts are mixed, each expert corresponding to a polynomial regression model of order k. We study the convergence rate of the maximum likelihood estimator, measured by how fast the estimated density approaches the true density in Hellinger distance as the sample size n increases. The convergence rate is found to depend on both m and k, and certain choices of m and k are found to produce near-optimal convergence rates.
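For concreteness, here is a hedged sketch of one way such a model is often written; the softmax gating weights and Gaussian experts below are assumptions for illustration, not taken from the abstract. The conditional density of a response y given a scalar input x is

    f(y | x) = Σ_{j=1}^{m} g_j(x) · φ(y; β_{j0} + β_{j1} x + … + β_{jk} x^k, σ_j²),

where the gating weights g_j(x) are nonnegative and sum to one, and φ(·; μ, σ²) denotes a normal density with mean μ and variance σ². The maximum likelihood estimator fits all gating and expert parameters jointly from the n observations, which is why the achievable rate depends on both m and k.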
Neural Computation (2011) 23 (10): 2683–2712.
Published: 01 October 2011
This letter considers Bayesian binary classification where the data are assumed to consist of multiple time series (panel data) with binary class labels (binary choice). The observed data can be represented as {y_{it}, x_{it}}_{t=1}^{T}, i = 1, …, n. Here y_{it} ∈ {0, 1} represents binary choices, and x_{it} represents the exogenous variables. We consider prediction of y_{it} by its own lags as well as by the exogenous components. The prediction is based on a Bayesian treatment using a Gibbs posterior that is constructed directly from the empirical classification error. This approach is therefore less sensitive to misspecification of the probability model than the usual likelihood-based posterior, which is confirmed by Monte Carlo simulations. We also study the effects of various choices of n and T, both numerically (by simulations) and theoretically (by considering two alternative asymptotic situations: large n and large T). We find that increasing T reduces the prediction error more effectively than increasing n. We also illustrate the method in a real data application to the brand choice of yogurt purchases.
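As a hedged sketch of the construction (the temperature parameter and the notation below are illustrative, not taken from the abstract): writing R_{n,T}(θ) for the empirical classification error of a decision rule with parameter θ over the nT observed choices, a Gibbs posterior takes the form

    π(θ | data) ∝ exp{ −λ · nT · R_{n,T}(θ) } · π(θ),

where π(θ) is a prior and λ > 0 is a temperature-like constant. Because only the empirical error enters the exponent, no likelihood for y_{it} needs to be specified, which is the source of the robustness to model misspecification noted above.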
Neural Computation (2006) 18 (11): 2762–2776.
Published: 01 November 2006
Modern data mining and bioinformatics have presented an important playground for statistical learning techniques, where the number of input variables may be much larger than the sample size of the training data. In supervised learning, logistic regression or probit regression can be used to model a binary output and to form perceptron classification rules based on Bayesian inference. We use a prior to select a limited number of candidate variables to enter the model, applying a popular method with selection indicators. We show that this approach can induce posterior estimates of the regression functions that consistently estimate the truth, provided the true regression model is sparse in the sense that the aggregated size of the regression coefficients is bounded. The estimated regression functions can therefore also produce consistent classifiers that are asymptotically optimal for predicting future binary outputs. These results provide theoretical justification for some recent empirical successes in microarray data analysis.
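A sketch of the kind of selection-indicator prior the abstract refers to, with the specific distributional choices here being illustrative assumptions: for each candidate variable j = 1, …, p, introduce an indicator γ_j ∈ {0, 1} and set

    β_j = γ_j · b_j,   γ_j ~ Bernoulli(π_j),   b_j ~ N(0, τ²),

so that only the variables with γ_j = 1 enter the linear predictor x^T β of the logistic or probit regression. Choosing small inclusion probabilities π_j (or restricting how many γ_j may be nonzero) limits the number of candidate variables that enter the model.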
Neural Computation (2006) 18 (1): 224–243.
Published: 01 January 2006
This is a theoretical study of the consistency properties of Bayesian inference using mixtures of logistic regression models. When standard logistic regression models are combined in a mixtures-of-experts setup, a flexible model is formed to model the relationship between a binary (yes-no) response y and a vector of predictors x. Bayesian inference conditional on the observed data can then be used for regression and classification. This letter gives conditions on choosing the number of experts (i.e., the number of mixing components) k, or on choosing a prior distribution for k, so that Bayesian inference is consistent, in the sense of often approximating the underlying true relationship between y and x. The resulting classification rule is also consistent, in the sense of having near-optimal performance in classification. We show these desirable consistency properties with a nonstochastic k growing slowly with the sample size n of the observed data, or with a random k that takes large values with nonzero but small probabilities.
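A minimal sketch of the kind of mixture-of-logistic-experts classifier being analyzed, assuming softmax gates; the parameter values are arbitrary and purely illustrative:

import numpy as np

def me_prob(x, gate_w, gate_b, exp_w, exp_b):
    """P(y = 1 | x) for a mixture of k logistic-regression experts.

    gate_w, exp_w: (k, s) arrays of gating / expert coefficients.
    gate_b, exp_b: (k,) arrays of intercepts. Softmax gates are assumed.
    """
    scores = gate_w @ x + gate_b
    gates = np.exp(scores - scores.max())
    gates /= gates.sum()                                   # softmax gating weights
    experts = 1.0 / (1.0 + np.exp(-(exp_w @ x + exp_b)))   # logistic experts
    return float(gates @ experts)

# Illustrative parameters for k = 2 experts and s = 3 predictors.
rng = np.random.default_rng(0)
gate_w, gate_b = rng.normal(size=(2, 3)), rng.normal(size=2)
exp_w, exp_b = rng.normal(size=(2, 3)), rng.normal(size=2)
x = np.array([0.5, -1.0, 2.0])

p = me_prob(x, gate_w, gate_b, exp_w, exp_b)
print(p, "-> classify as", int(p > 0.5))                   # plug-in classification rule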
Neural Computation (2004) 16 (4): 789–810.
Published: 01 April 2004
This letter is a comprehensive account of some recent findings about AdaBoost in the presence of noisy data, approached from the perspective of statistical theory. We start from the basic assumption of weak hypotheses used in AdaBoost and study its validity and its implications for the generalization error. We recommend studying the generalization error and comparing it to the optimal Bayes error when the data are noisy. Analytic examples are provided to show that running unmodified AdaBoost forever will lead to overfitting. On the other hand, there exist regularized versions of AdaBoost that are consistent, in the sense that the resulting prediction approximately attains the optimal performance in the limit of large training samples.
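A small experiment sketch, assuming scikit-learn is available, that only illustrates the overfitting phenomenon by comparing a short and a very long boosting run on label-noisy data; capping the number of rounds is a crude stand-in for regularization and is not one of the regularized variants analyzed in the letter:

# Hypothetical illustration with scikit-learn: unmodified AdaBoost run for many
# rounds on noisy labels versus the same learner stopped after a few rounds.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)    # 20% label noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for rounds in (10, 2000):       # few rounds vs. "running it (almost) forever"
    clf = AdaBoostClassifier(n_estimators=rounds, random_state=0).fit(X_tr, y_tr)
    print(rounds, "rounds: train err = %.3f, test err = %.3f"
          % (1 - clf.score(X_tr, y_tr), 1 - clf.score(X_te, y_te)))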
Neural Computation (2002) 14 (10): 2415–2437.
Published: 01 October 2002
Previous researchers developed new learning architectures for sequential data by extending conventional hidden Markov models through the use of distributed state representations. Although exact inference and parameter estimation in these architectures are computationally intractable, Ghahramani and Jordan (1997) showed that approximate inference and parameter estimation in one such architecture, factorial hidden Markov models (FHMMs), are feasible in certain circumstances. However, the learning algorithm proposed by these investigators, based on variational techniques, is difficult to understand and implement and is limited to the study of real-valued data sets. This letter proposes an alternative method for approximate inference and parameter estimation in FHMMs, based on the perspective that FHMMs are a generalization of a well-studied class of statistical models known as generalized additive models (GAMs; Hastie & Tibshirani, 1990). Using existing statistical techniques for GAMs as a guide, we have developed the generalized backfitting algorithm. This algorithm computes customized error signals for each hidden Markov chain of an FHMM and then trains each chain, one at a time, using conventional techniques from the hidden Markov models literature. Relative to previous perspectives on FHMMs, we believe that the viewpoint taken here has a number of advantages. First, it places FHMMs on firm statistical foundations by relating them to a class of models that is well studied in the statistics community, yet it generalizes this class of models in an interesting way. Second, it leads to an understanding of how FHMMs can be applied to many different types of time-series data, including Bernoulli and multinomial data, not just data that are real valued. Finally, it leads to an effective learning procedure for FHMMs that is easier to understand and easier to implement than existing learning procedures. Simulation results suggest that FHMMs trained with the generalized backfitting algorithm are a practical and powerful tool for analyzing sequential data.
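For intuition about the GAM side of this analogy, here is a minimal sketch of classical backfitting for a two-component additive model, with a simple polynomial fit standing in for a generic 1-D smoother; this shows the standard GAM procedure that the letter generalizes, not the generalized backfitting algorithm for FHMMs itself:

# Classical backfitting for an additive model y = f1(x1) + f2(x2) + noise.
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.uniform(-1, 1, n)
x2 = rng.uniform(-1, 1, n)
y = np.sin(3 * x1) + x2**2 + 0.1 * rng.standard_normal(n)

def smooth(x, r, degree=5):
    """A simple 1-D smoother: polynomial least-squares fit to residuals r."""
    coefs = np.polyfit(x, r, degree)
    return np.polyval(coefs, x)

f1 = np.zeros(n)
f2 = np.zeros(n)
for _ in range(20):                        # backfitting sweeps
    f1 = smooth(x1, y - y.mean() - f2)     # fit component 1 to partial residuals
    f1 -= f1.mean()                        # center for identifiability
    f2 = smooth(x2, y - y.mean() - f1)     # fit component 2 to partial residuals
    f2 -= f2.mean()

fitted = y.mean() + f1 + f2
print("residual std:", np.std(y - fitted))

Each component is repeatedly refit to the partial residuals left by the other components; the generalized backfitting algorithm plays the analogous game with customized error signals for each hidden Markov chain of an FHMM.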
Neural Computation (2000) 12 (6): 1293–1301.
Published: 01 June 2000
The mixtures-of-experts (ME) methodology provides a tool for classification when experts given by logistic regression models or Bernoulli models are mixed according to a set of local weights. We show that the Vapnik-Chervonenkis (VC) dimension of the ME architecture is bounded below by the number of experts m and bounded above by O(m^4 s^2), where s is the dimension of the input. For mixtures of Bernoulli experts with a scalar input, we show that the lower bound m is attained, in which case we obtain the exact result that the VC dimension is equal to the number of experts.
Neural Computation (1999) 11 (5): 1183–1198.
Published: 01 July 1999
We investigate a class of hierarchical mixtures-of-experts (HME) models in which generalized linear models with nonlinear mean functions of the form ψ(α + x^T β) are mixed. Here ψ(·) is the inverse link function. It is shown that mixtures of such mean functions can approximate a class of smooth functions of the form ψ(h(x)), where h(·) ∈ W^∞_{2;k} (a Sobolev class over [0, 1]^s), as the number of experts m in the network increases. An upper bound on the approximation rate is given as O(m^{-2/s}) in the L_p norm. This rate can be achieved within the family of HME structures with no more than s layers, where s is the dimension of the predictor x.
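For orientation, a hedged schematic of the approximant class (the exact gating parameterization is not specified in the abstract): an HME with m experts produces a convex combination of GLM mean functions,

    f_m(x) = Σ_{j=1}^{m} g_j(x) · ψ(α_j + x^T β_j),   with g_j(x) ≥ 0 and Σ_{j=1}^{m} g_j(x) = 1,

where in a hierarchical architecture each weight g_j(x) is a product of gating functions along the path from the root to expert j. The rate result then says that, for targets ψ(h(x)) with h in the stated Sobolev class, some such f_m lies within O(m^{-2/s}) of the target in the L_p norm.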