Noboru Murata: 1-14 of 14 journal articles
Neural Computation (2017) 29 (7): 1838–1878.
Published: 01 July 2017
Abstract
We propose a method for intrinsic dimension estimation. By fitting a regression model to the relationship between the power of the distance from an inspection point and the number of samples contained in a ball whose radius equals that distance, we evaluate the goodness of fit. Then, using the maximum likelihood method, we estimate the local intrinsic dimension around the inspection point. The proposed method is shown to be comparable to conventional methods in global intrinsic dimension estimation experiments. Furthermore, we experimentally show that the proposed method outperforms a conventional local dimension estimation method.
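As a rough, self-contained illustration of the ball-counting idea behind this kind of estimator (not the maximum likelihood procedure of the paper): around a point on a d-dimensional structure, the number of samples inside a ball of radius r grows roughly like r^d, so the slope of a log-log regression of counts against radii gives a crude local dimension estimate. The function and toy data below are my own.

```python
import numpy as np

def local_dimension_estimate(data, x, radii):
    """Rough local intrinsic dimension around inspection point x.

    Counts samples inside balls of increasing radius and fits the slope of
    log(count) versus log(radius); for a d-dimensional structure the count
    grows roughly like r**d, so the slope approximates d.
    """
    dists = np.linalg.norm(data - x, axis=1)
    counts = np.array([np.sum(dists <= r) for r in radii])
    mask = counts > 0
    slope, _ = np.polyfit(np.log(radii[mask]), np.log(counts[mask]), 1)
    return slope

# Toy example: 3-D points lying on a 2-D plane embedded in R^3.
rng = np.random.default_rng(0)
plane = rng.uniform(-1, 1, size=(2000, 2))
data = np.column_stack([plane, np.zeros(len(plane))])
radii = np.linspace(0.05, 0.5, 20)
print(local_dimension_estimate(data, data[0], radii))  # roughly 2
```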
Neural Computation (2016) 28 (12): 2687–2725.
Published: 01 December 2016
Abstract
This study considers the common situation in data analysis when there are few observations from the distribution of interest, the target distribution, while abundant observations are available from auxiliary distributions. In this situation, it is natural to compensate for the lack of data from the target distribution by using data sets from these auxiliary distributions; in other words, the target distribution is approximated in a subspace spanned by a set of auxiliary distributions. Mixture modeling is one of the simplest ways to integrate information from the target and auxiliary distributions in order to express the target distribution as accurately as possible. There are two typical mixtures in the context of information geometry: the m- and e-mixtures. The m-mixture is applied in a variety of research fields because of the well-known expectation-maximization algorithm for parameter estimation, whereas the e-mixture is rarely used because of the difficulty of its estimation, particularly for nonparametric models. The e-mixture, however, is a well-tempered distribution that satisfies the principle of maximum entropy. To model a target distribution with scarce observations accurately, this letter proposes a novel framework for nonparametric modeling of the e-mixture and a geometrically inspired estimation algorithm. As numerical examples of the proposed framework, a transfer learning setup is considered. The experimental results show that this framework works well for three types of synthetic data sets, as well as a real-world EEG data set.
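To make the distinction concrete, here is a minimal sketch (my own toy illustration on a finite alphabet, not the letter's nonparametric framework) of the two mixtures: the m-mixture is a convex combination of the component distributions, while the e-mixture is log-linear, i.e. a normalized weighted geometric mean.

```python
import numpy as np

def m_mixture(dists, weights):
    """Ordinary (m-)mixture: convex combination of distributions."""
    return np.tensordot(weights, dists, axes=1)

def e_mixture(dists, weights):
    """e-mixture on a finite alphabet: normalized weighted geometric mean,
    i.e. log p = sum_i w_i log p_i - log Z."""
    log_p = np.tensordot(weights, np.log(dists), axes=1)
    p = np.exp(log_p - log_p.max())
    return p / p.sum()

# Two distributions over four symbols, mixed with equal weights.
p1 = np.array([0.7, 0.1, 0.1, 0.1])
p2 = np.array([0.1, 0.1, 0.1, 0.7])
w = np.array([0.5, 0.5])
print(m_mixture([p1, p2], w))
print(e_mixture([p1, p2], w))
```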
Neural Computation (2014) 26 (9): 2074–2101.
Published: 01 September 2014
FIGURES
| View All (7)
Abstract
View article
PDF
Clustering is a representative task of unsupervised learning and one of the important approaches in exploratory data analysis. By its very nature, clustering without strong assumptions on the data distribution is desirable. Information-theoretic clustering is a class of clustering methods that optimize information-theoretic quantities such as entropy and mutual information. These quantities can be estimated in a nonparametric manner, and information-theoretic clustering algorithms are capable of capturing various intrinsic data structures. It is also possible to estimate information-theoretic quantities using a data set with a sampling weight for each datum. By assuming that the data set is sampled from a certain cluster and assigning different sampling weights depending on the clusters, the cluster-conditional information-theoretic quantities are estimated. In this letter, a simple iterative clustering algorithm is proposed based on a nonparametric estimator of the log-likelihood for weighted data sets. The clustering algorithm is also derived from the principle of conditional entropy minimization with maximum entropy regularization. The proposed algorithm does not contain a tuning parameter. The algorithm is experimentally shown to be comparable to or better than conventional nonparametric clustering methods.
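A minimal sketch of the weighted nonparametric idea, under several simplifying assumptions of my own: cluster memberships act as sampling weights in a Gaussian kernel density estimator, each point is reassigned to the cluster under which its estimated log-likelihood is highest, and the process is iterated. Unlike the letter's algorithm, this toy version does have a tuning parameter (the bandwidth h).

```python
import numpy as np

def weighted_kde_loglik(x, data, weights, h):
    """Log-likelihood of point x under a Gaussian KDE with sampling weights."""
    d2 = np.sum((data - x) ** 2, axis=1)
    k = np.exp(-d2 / (2 * h ** 2))
    return np.log(np.dot(weights, k) / weights.sum() + 1e-12)

def iterative_clustering(data, n_clusters, h=0.5, n_iter=20, seed=0):
    """Toy iterative clustering: reassign each point to the cluster whose
    weighted KDE gives it the highest log-likelihood."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(n_clusters, size=len(data))
    for _ in range(n_iter):
        scores = np.full((len(data), n_clusters), -np.inf)
        for c in range(n_clusters):
            w = (labels == c).astype(float)
            if w.sum() > 0:
                scores[:, c] = [weighted_kde_loglik(x, data, w, h) for x in data]
        labels = scores.argmax(axis=1)
    return labels

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
print(iterative_clustering(data, n_clusters=2))
```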
Neural Computation (2014) 26 (7): 1455–1483.
Published: 01 July 2014
Abstract
A graph is a mathematical representation of a set of variables in which some pairs of the variables are connected by edges. Common examples of graphs are railroads, the Internet, and neural networks. It is both theoretically and practically important to estimate the intensity of direct connections between variables. In this study, the problem of estimating the intrinsic graph structure from observed data is considered. The observed data in this study are a matrix whose elements represent dependency between nodes in the graph. The dependency represents more than direct connections because it includes the influences of various paths. For example, each element of the observed matrix represents a co-occurrence of events at two nodes or a correlation of variables corresponding to two nodes. In this setting, spurious correlations make the estimation of direct connections difficult. To alleviate this difficulty, a digraph Laplacian is used for characterizing a graph. A generative model of the observed matrix is proposed, and a parameter estimation algorithm for the model is also introduced. The notable advantage of the proposed method is its ability to deal with directed graphs, while conventional graph structure estimation methods such as covariance selection are applicable only to undirected graphs. The algorithm is experimentally shown to be able to identify the intrinsic graph structure.
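The following toy example (my own, not the letter's digraph-Laplacian generative model) only illustrates why the problem is hard and what "intrinsic structure" means here: an observed dependency matrix accumulates influence over all directed paths, so it contains spurious indirect links, and recovering the direct connections amounts to inverting that accumulation.

```python
import numpy as np

# Direct (directed) connection strengths among four nodes.
A = np.array([[0.0, 0.6, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.0],
              [0.0, 0.0, 0.0, 0.4],
              [0.0, 0.0, 0.0, 0.0]])

# Observed dependency accumulates influence over all paths:
# D = A + A^2 + A^3 + ... = (I - A)^{-1} - I  (valid for spectral radius < 1).
I = np.eye(4)
D = np.linalg.inv(I - A) - I
print(np.round(D, 3))      # nonzero entries appear for indirect (spurious) links

# Naive recovery of the direct structure by inverting the path series.
A_hat = I - np.linalg.inv(I + D)
print(np.round(A_hat, 3))  # matches A up to numerical error
```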
Neural Computation (2012) 24 (7): 1853–1881.
Published: 01 July 2012
Abstract
Kernel methods are known to be effective for nonlinear multivariate analysis. One of the main issues in the practical use of kernel methods is the selection of the kernel, and there have been many studies on kernel selection and kernel learning. Multiple kernel learning (MKL) is one of the promising kernel optimization approaches. Kernel methods are applied to various classifiers, including Fisher discriminant analysis (FDA). FDA gives the Bayes-optimal classification axis if the data distribution of each class in the feature space is a gaussian with a shared covariance structure. Based on this fact, an MKL framework based on the notion of gaussianity is proposed. As a concrete implementation, an empirical characteristic function is adopted to measure gaussianity in the feature space associated with a convex combination of kernel functions, and two MKL algorithms are derived. Experimental results on several data sets show that the proposed kernel learning followed by FDA offers strong classification power.
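As a small illustration of measuring gaussianity with an empirical characteristic function (the measurement device named in the abstract; the kernel combination and MKL optimization are omitted, and the score below is my own simplified variant): compare the empirical characteristic function of a standardized sample with that of a standard gaussian over a grid of frequencies.

```python
import numpy as np

def gaussianity_score(x, ts):
    """Deviation of the empirical characteristic function of a 1-D sample
    from that of a gaussian with the same mean and variance (smaller
    means 'more gaussian')."""
    x = (x - x.mean()) / x.std()
    emp = np.array([np.mean(np.exp(1j * t * x)) for t in ts])
    gauss = np.exp(-0.5 * ts ** 2)          # characteristic function of N(0, 1)
    return np.mean(np.abs(emp - gauss) ** 2)

rng = np.random.default_rng(0)
ts = np.linspace(0.1, 3.0, 30)
print(gaussianity_score(rng.normal(size=5000), ts))       # close to 0
print(gaussianity_score(rng.exponential(size=5000), ts))  # clearly larger
```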
Neural Computation (2011) 23 (6): 1623–1659.
Published: 01 June 2011
Abstract
The Bradley-Terry model is a statistical representation of preference or ranking data based on pairwise comparisons of items. For estimation of the model, several methods based on the sum of weighted Kullback-Leibler divergences have been proposed in various contexts. The purpose of this letter is to interpret an estimation mechanism of the Bradley-Terry model from the viewpoint of flatness, a fundamental notion used in information geometry. Based on this point of view, a new estimation method is proposed within the framework of the em algorithm. The proposed method differs in its objective function from conventional methods, especially in the treatment of unobserved comparisons, and it is consistently interpreted in a probability simplex. An estimation method with weight adaptation is also proposed from the viewpoint of sensitivity. Experimental results show that the proposed method works appropriately and that weight adaptation improves the accuracy of the estimates.
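For reference, a minimal sketch of fitting the Bradley-Terry model by the classical minorization-maximization (MM) iteration; this is the standard maximum likelihood route, not the em-algorithm-based estimator proposed in the letter, and the win matrix below is made up.

```python
import numpy as np

def bradley_terry_mm(wins, n_iter=200):
    """Fit Bradley-Terry strengths by the classical MM iteration.

    wins[i, j] = number of times item i beat item j, and the model is
    P(i beats j) = p[i] / (p[i] + p[j]).
    """
    n = wins.shape[0]
    p = np.ones(n)
    games = wins + wins.T                       # total comparisons per pair
    for _ in range(n_iter):
        for i in range(n):
            denom = np.sum(games[i] / (p[i] + p))   # i-vs-i term is zero
            p[i] = wins[i].sum() / denom
        p /= p.sum()                            # fix the scale
    return p

wins = np.array([[0, 3, 5],
                 [1, 0, 4],
                 [0, 2, 0]])
print(bradley_terry_mm(wins))
```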
Neural Computation (2010) 22 (11): 2887–2923.
Published: 01 November 2010
Abstract
Reducing the dimensionality of high-dimensional data without losing its essential information is an important task in information processing. When class labels of training data are available, Fisher discriminant analysis (FDA) has been widely used. However, the optimality of FDA is guaranteed only in a very restricted ideal circumstance, and it is often observed that FDA does not provide a good classification surface for many real problems. This letter treats the problem of supervised dimensionality reduction from the viewpoint of information theory and proposes a framework of dimensionality reduction based on class-conditional entropy minimization. The proposed linear dimensionality-reduction technique is validated both theoretically and experimentally. Then, through kernel Fisher discriminant analysis (KFDA), the multiple kernel learning problem is treated in the proposed framework, and a novel algorithm, which iteratively optimizes the parameters of the classification function and the kernel combination coefficients, is proposed. The algorithm is experimentally shown to be comparable to or better than KFDA for large-scale benchmark data sets and comparable to other multiple kernel learning techniques on the yeast protein function annotation task.
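A crude, self-contained sketch of the class-conditional entropy minimization idea in the linear case (the kernel/KFDA and multiple kernel learning parts are omitted): search over 1-D projection directions in the plane and keep the one whose projected class-conditional entropies, estimated nonparametrically, are smallest. The entropy estimator and the brute-force angle search are my own simplifications.

```python
import numpy as np

def knn_entropy(x, k=5):
    """Crude nearest-neighbor-spacing entropy estimate for a 1-D sample
    (up to additive constants)."""
    x = np.sort(x)
    n = len(x)
    eps = np.array([x[min(i + k, n - 1)] - x[max(i - k, 0)] for i in range(n)])
    return np.mean(np.log(eps + 1e-12))

def best_projection(X, y, n_angles=180):
    """Pick the 1-D direction minimizing the sum of class-conditional
    entropies, by brute-force search over angles in the plane."""
    best, best_theta = np.inf, 0.0
    for theta in np.linspace(0, np.pi, n_angles, endpoint=False):
        w = np.array([np.cos(theta), np.sin(theta)])
        h = sum(knn_entropy(X[y == c] @ w) for c in np.unique(y))
        if h < best:
            best, best_theta = h, theta
    return np.array([np.cos(best_theta), np.sin(best_theta)])

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], [0.2, 2.0], (100, 2)),
               rng.normal([1, 0], [0.2, 2.0], (100, 2))])
y = np.repeat([0, 1], 100)
print(best_projection(X, y))   # roughly aligned with the x-axis (up to sign)
```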
Neural Computation (2010) 22 (9): 2417–2451.
Published: 01 September 2010
Abstract
Given a set of rating data for a set of items, determining the preference levels of the items is a matter of importance. Various probability models have been proposed for this task. One such model is the Plackett-Luce model, which parameterizes the preference level of each item by a real value. In this letter, the Plackett-Luce model is generalized to cope with grouped ranking observations such as movie or restaurant ratings. Since it is difficult to maximize the likelihood of the proposed model directly, a feasible approximation is derived, and the em algorithm is adopted to find the model parameters by maximizing the approximate likelihood, which is easily evaluated. The proposed model is extended to a mixture model, and two applications are proposed. To show the effectiveness of the proposed model, numerical experiments with real-world data are carried out.
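For concreteness, a minimal sketch of the base Plackett-Luce likelihood for a single full ranking (the grouped-rating generalization, the approximation, and the em fitting of the letter are not shown; the worth values below are made up): items are chosen one after another, each with probability proportional to its worth among the items still remaining.

```python
import numpy as np

def plackett_luce_logprob(ranking, worth):
    """Log-probability of a full ranking (best to worst) under the
    Plackett-Luce model with positive worth parameters."""
    worth = np.asarray(worth, dtype=float)
    remaining = list(ranking)
    logp = 0.0
    for item in ranking:
        logp += np.log(worth[item]) - np.log(worth[remaining].sum())
        remaining.remove(item)
    return logp

worth = [2.0, 1.0, 0.5]                                  # hypothetical preference levels
print(np.exp(plackett_luce_logprob([0, 1, 2], worth)))   # most likely order
print(np.exp(plackett_luce_logprob([2, 1, 0], worth)))   # least likely order
```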
Neural Computation (2008) 20 (6): 1596–1630.
Published: 01 June 2008
Abstract
We discuss robustness against mislabeling in multiclass labels for classification problems and propose two boosting algorithms, the normalized Eta-Boost.M and Eta-Boost.M, based on the Eta-divergence. These two boosting algorithms are closely related to models of mislabeling in which the label is erroneously exchanged for others. For the two boosting algorithms, theoretical aspects supporting their robustness to mislabeling are explored. We apply the two proposed boosting methods to synthetic and real data sets to investigate their performance, focusing on robustness, and confirm the validity of the proposed methods.
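The code below only illustrates the mislabeling model the abstract refers to, in its simplest symmetric form (a label is exchanged for another class uniformly at random with some probability); it is not the Eta-Boost.M algorithm itself, and the corruption rate is arbitrary.

```python
import numpy as np

def corrupt_labels(y, n_classes, eta, rng):
    """Symmetric mislabeling: with probability eta, replace a label with one
    of the other classes chosen uniformly at random."""
    y = y.copy()
    flip = rng.random(len(y)) < eta
    offsets = rng.integers(1, n_classes, size=int(flip.sum()))
    y[flip] = (y[flip] + offsets) % n_classes
    return y

rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=10)
print(y)
print(corrupt_labels(y, n_classes=3, eta=0.3, rng=rng))
```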
Neural Computation (2007) 19 (8): 2183–2244.
Published: 01 August 2007
Abstract
Boosting is known as a gradient descent algorithm over loss functions. It is often pointed out that the typical boosting algorithm, AdaBoost, is highly affected by outliers. In this letter, loss functions for robust boosting are studied. Based on the concept of robust statistics, we propose a transformation of loss functions that makes boosting algorithms robust against extreme outliers. Next, the truncation of loss functions is applied to contamination models that describe the occurrence of mislabels near decision boundaries. Numerical experiments illustrate that the proposed loss functions derived from the contamination models are useful for handling highly noisy data in comparison with other loss functions.
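A small illustration of the kind of transformation meant here (my own simplified example, not the loss derived in the letter): AdaBoost's exponential loss grows without bound as the margin becomes very negative, so a single extreme outlier can dominate the fit; truncating the loss below some margin caps that influence.

```python
import numpy as np

def exp_loss(margin):
    """AdaBoost's exponential loss: unbounded for large negative margins."""
    return np.exp(-np.asarray(margin, dtype=float))

def truncated_exp_loss(margin, m0=-2.0):
    """Exponential loss truncated at margin m0: examples with margin below m0
    contribute a constant, so extreme outliers have bounded influence."""
    return np.minimum(exp_loss(margin), np.exp(-m0))

margins = np.array([2.0, 0.5, -1.0, -10.0])  # -10.0 plays the extreme outlier
print(exp_loss(margins))                     # the outlier term is ~2.2e4
print(truncated_exp_loss(margins))           # the outlier term is capped at e^2
```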
Neural Computation (2005) 17 (11): 2508–2529.
Published: 01 November 2005
Abstract
By employing the L1 or L∞ norms in maximizing margins, support vector machines (SVMs) result in a linear programming problem that requires a lower computational load compared to SVMs with the L2 norm. However, how the change of norm affects the generalization ability of SVMs has not been clarified so far except for numerical experiments. In this letter, the geometrical meaning of SVMs with the Lp norm is investigated, and the SVM solutions are shown to have rather little dependency on p.
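To make the linear programming claim concrete, here is a minimal sketch (under my own formulation choices, using scipy.optimize.linprog) of the L1-norm soft-margin SVM written as an LP: splitting w into nonnegative parts u and v turns ||w||_1 into a linear objective.

```python
import numpy as np
from scipy.optimize import linprog

def l1_svm(X, y, C=1.0):
    """Soft-margin SVM with an L1 norm on w, solved as a linear program:
    minimize ||w||_1 + C * sum(xi)  s.t.  y_i (w.x_i + b) >= 1 - xi_i, xi >= 0.
    LP variables are [u, v, b, xi] with w = u - v and u, v >= 0."""
    n, d = X.shape
    c = np.concatenate([np.ones(2 * d), [0.0], C * np.ones(n)])
    A_ub = np.hstack([-y[:, None] * X, y[:, None] * X,
                      -y[:, None], -np.eye(n)])
    b_ub = -np.ones(n)
    bounds = [(0, None)] * (2 * d) + [(None, None)] + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    u, v = res.x[:d], res.x[d:2 * d]
    return u - v, res.x[2 * d]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.5, (20, 2)), rng.normal(1, 0.5, (20, 2))])
y = np.repeat([-1.0, 1.0], 20)
w, b = l1_svm(X, y, C=1.0)
print(w, b)
```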
Neural Computation (2004) 16 (7): 1437–1481.
Published: 01 July 2004
Abstract
We aim at an extension of AdaBoost to U-Boost, in the paradigm to build a stronger classification machine from a set of weak learning machines. A geometric understanding of the Bregman divergence defined by a generic convex function U leads to the U-Boost method in the framework of information geometry extended to the space of the finite measures over a label set. We propose two versions of U-Boost learning algorithms by taking account of whether the domain is restricted to the space of probability functions. In the sequential step, we observe that the two adjacent and the initial classifiers are associated with a right triangle in the scale via the Bregman divergence, called the Pythagorean relation. This leads to a mild convergence property of the U-Boost algorithm as seen in the expectation-maximization algorithm. Statistical discussions for consistency and robustness elucidate the properties of the U-Boost methods based on a stochastic assumption for training data.
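As a pointer to the central object, a minimal sketch of a Bregman divergence generated by a convex function U (the boosting machinery itself is not reproduced); with U(x) = sum(x log x - x), it reduces to the generalized Kullback-Leibler divergence.

```python
import numpy as np

def bregman_divergence(U, grad_U, p, q):
    """Bregman divergence generated by a convex function U:
    D_U(p, q) = U(p) - U(q) - <grad U(q), p - q>."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return U(p) - U(q) - np.dot(grad_U(q), p - q)

# U(x) = sum(x log x - x) recovers the (generalized) KL divergence.
U = lambda x: np.sum(x * np.log(x) - x)
grad_U = np.log
p = np.array([0.2, 0.3, 0.5])
q = np.array([0.3, 0.3, 0.4])
print(bregman_divergence(U, grad_U, p, q))   # equals KL(p || q) here
```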
Neural Computation (2004) 16 (2): 355–382.
Published: 01 February 2004
Abstract
Natural gradient learning is known to be efficient in escaping plateaus, which are a main cause of the slow learning speed of neural networks. An adaptive natural gradient learning method for practical implementation has also been developed, and its advantage in real-world problems has been confirmed. In this letter, we deal with the generalization performance of the natural gradient method. Since natural gradient learning makes parameters fit the training data quickly, overfitting may easily occur, which results in poor generalization performance. To solve this problem, we introduce a regularization term into natural gradient learning and propose an efficient method for optimizing the scale of regularization by using a generalized Akaike information criterion (the network information criterion, NIC). We discuss the properties of the optimized regularization strength by NIC through theoretical analysis as well as computer simulations. We confirm the computational efficiency and generalization performance of the proposed method in real-world applications through computational experiments on benchmark problems.
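For orientation, a minimal sketch of a single natural-gradient update using an empirical Fisher matrix with damping; the adaptive Fisher estimation and the NIC-based choice of regularization strength discussed in the letter are not modeled here, and the damping constant is arbitrary.

```python
import numpy as np

def natural_gradient_step(theta, per_sample_grads, lr=0.1, damping=1e-3):
    """One natural-gradient update using the empirical Fisher matrix
    (average outer product of per-sample gradients), damped for stability."""
    G = np.asarray(per_sample_grads)                    # (n_samples, n_params)
    fisher = G.T @ G / len(G) + damping * np.eye(G.shape[1])
    grad = G.mean(axis=0)
    return theta - lr * np.linalg.solve(fisher, grad)

rng = np.random.default_rng(0)
theta = np.zeros(3)
per_sample_grads = rng.normal(size=(100, 3)) + np.array([1.0, -0.5, 0.2])
print(natural_gradient_step(theta, per_sample_grads))
```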
Neural Computation (1993) 5 (1): 140–153.
Published: 01 January 1993
Abstract
The present paper elucidates a universal property of learning curves, which shows how the generalization error, training error, and the complexity of the underlying stochastic machine are related and how the behavior of a stochastic machine is improved as the number of training examples increases. The error is measured by the entropic loss. It is proved that the generalization error converges to H0, the entropy of the conditional distribution of the true machine, as H0 + m*/(2t), while the training error converges as H0 - m*/(2t), where t is the number of examples and m* shows the complexity of the network. When the model is faithful, implying that the true machine is in the model, m* is reduced to m, the number of modifiable parameters. This is a universal law because it holds for any regular machine irrespective of its structure under the maximum likelihood estimator. Similar relations are obtained for the Bayes and Gibbs learning algorithms. These learning curves show the relation among the accuracy of learning, the complexity of a model, and the number of training examples.
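In the notation of the abstract, the stated asymptotics can be written compactly as follows (a restatement, not a new result):

```latex
\mathbb{E}\!\left[\varepsilon_{\mathrm{gen}}(t)\right] \simeq H_0 + \frac{m^{*}}{2t},
\qquad
\mathbb{E}\!\left[\varepsilon_{\mathrm{train}}(t)\right] \simeq H_0 - \frac{m^{*}}{2t}.
```

The gap between generalization and training error therefore decays as m*/t, which is how the complexity m* of the machine enters the learning curve.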