Skip Nav Destination
*PDF*
*PDF*
*PDF*

Update search

### NARROW

Format

Journal

Date

Availability

Journal Articles

Publisher: Journals Gateway

*Neural Computation*(2007) 19 (4): 1112–1153.

Published: 01 April 2007

Abstract

View article
It is well known that in unidentifiable models, the Bayes estimation provides much better generalization performance than the maximum likelihood (ML) estimation. However, its accurate approximation by Markov chain Monte Carlo methods requires huge computational costs. As an alternative, a tractable approximation method, called the variational Bayes (VB) approach, has recently been proposed and has been attracting attention. Its advantage over the expectation maximization (EM) algorithm, often used for realizing the ML estimation, has been experimentally shown in many applications; nevertheless, it has not yet been theoretically shown. In this letter, through analysis of the simplest unidentifiable models, we theoretically show some properties of the VB approach. We first prove that in three-layer linear neural networks, the VB approach is asymptotically equivalent to a positive-part James-Stein type shrinkage estimation. Then we theoretically clarify its free energy, generalization error, and training error. Comparing them with those of the ML estimation and the Bayes estimation, we discuss the advantage of the VB approach. We also show that unlike in the Bayes estimation, the free energy and the generalization error are less simply related with each other and that in typical cases, the VB free energy well approximates the Bayes one, while the VB generalization error significantly differs from the Bayes one.

Journal Articles

Publisher: Journals Gateway

*Neural Computation*(2003) 15 (5): 1013–1033.

Published: 01 May 2003

Abstract

View article
Hierarchical learning machines such as layered neural networks have singularities in their parameter spaces. At singularities, the Fisher information matrix becomes degenerate, with the result that the conventional learning theory of regular statistical models does not hold. Recently, it was proved that if the parameter of the true distribution is contained in the singularities of the learning machine, the generalization error in Bayes estimation is asymptotically equal to λ/ n , where 2 λ is smaller than the dimension of the parameter and n is the number of training samples. However, the constant λ strongly depends on the local geometrical structure of singularities; hence, the generalization error is not yet clarified when the true distribution is almost but not completely contained in the singularities. In this article, in order to analyze such cases, we study the Bayes generalization error under the condition that the Kullback distance of the true distribution from the distribution represented by singularities is in proportion to 1 / n and show two results. First, if the dimension of the parameter from inputs to hidden units is not larger than three, then there exists a region of true parameters such that the generalization error is larger than that of the corresponding regular model. Second, if the dimension from inputs to hidden units is larger than three, then for arbitrary true distribution, the generalization error is smaller than that of the corresponding regular model.

Journal Articles

Publisher: Journals Gateway

*Neural Computation*(2001) 13 (4): 899–933.

Published: 01 April 2001

Abstract

View article
This article clarifies the relation between the learning curve and the algebraic geometrical structure of a nonidentifiable learning machine such as a multilayer neural network whose true parameter set is an analytic set with singular points. By using a concept in algebraic analysis, we rigorously prove that the Bayesian stochastic complexity or the free energy is asymptotically equal to λ 1 log n − ( m 1 − 1) loglog n + constant, where n is the number of training samples and λ 1 and m 1 are the rational number and the natural number, which are determined as the birational invariant values of the singularities in the parameter space. Also we show an algorithm to calculate λ 1 and m 1 based on the resolution of singularities in algebraic geometry. In regular statistical models, 2λ 1 is equal to the number of parameters and m 1 = 1, whereas in nonregular models, such as multilayer networks, 2λ 1 is not larger than the number of parameters and m 1 ≥ 1. Since the increase of the stochastic complexity is equal to the learning curve or the generalization error, the nonidentifiable learning machines are better models than the regular ones if Bayesian ensemble learning is applied.