## Abstract

In pattern recognition, data integration is an important issue, and when properly done, it can lead to improved performance. Also, data integration can be used to help model and understand multimodal processing in the brain. Amari proposed α-integration as a principled way of blending multiple positive measures (e.g., stochastic models in the form of probability distributions), enabling an optimal integration in the sense of minimizing the α-divergence. It also encompasses existing integration methods as its special cases, for example, a weighted average and an exponential mixture. The parameter α determines the integration characteristics, and the weight vector w assigns the degree of importance to each measure. In most work, however, α and w are given in advance rather than learned. In this letter, we present an algorithm for learning α and w from data when multiple integrated target values are available. Numerical experiments on synthetic as well as real-world data demonstrate the effectiveness of the proposed method.

## 1. Introduction

When we make an educated guess in a recognition task, we use sensory data from multiple modalities (visual, audio, taste, smell, and touch) and integrate them. Several hypotheses have been proposed to account for the multimodal integration mechanism in the brain at many different levels (Wolpert & Kawato, 1998; Amari, 2007). Also, in pattern recognition, data integration has become an important issue, since aggregating information from multiple sources helps with disambiguation and generally leads to higher performance. Several data integration algorithms have been proposed to address these issues (Hall & Llinas, 1997), such as Bayesian inference (Pearl, 1988), evidence theory (Dempster, 1967; Shafer, 1976), clustering algorithms (Duda, Hart, & Stork, 2001), and neural networks (Bishop, 1995). Kernel-based integration methods are also commonly used (Lanckriet, Deng, Cristianini, Jordan, & Noble, 2004; Choi, Choi, & Choe, 2008).

Recently, a general framework, α-integration, was proposed by Amari (2007) for stochastic model integration of multiple positive measures. α-integration is a one-parameter family of integration, where the parameter α determines the characteristics of integration. Given a number of stochastic models in the form of probability distributions, it finds the optimal integration of the sources in the sense of minimizing the α-divergence. Many artificial neural models for stochastic models, such as the mixture (or product) of experts model (Jacobs, Jordan, Nowlan, & Hinton, 1991; Hinton, 2002), can be considered special cases of α-integration. Some psychophysical laws, such as Weber's law and Stevens's law, suggest that our brain could use something like α-representation when proper α values are used (see Kandel, Schwartz, & Jessell, 2000, and Amari, 2007).

However, there is an unresolved critical issue in α-integration. In most existing work on α-integration (Minka, 2005; Amari, 2007; Cichocki, Lee, Kim, & Choi, 2008; Kim, Cichocki, & Choi, 2008; Choi, Katake, Choi, & Choe, 2010), the value of α as well as the weight vector w are given in advance rather than learned. These existing α-integration approaches cannot effectively handle cases where only a number of integrated (mixed) results are available while the α value and w are unknown, as in Figure 1. When we have a very sparse sampling of integrated observations plus the original measurements, we should be able to deduce α and the weight vector w so that we can obtain the full integration result. So the α value and w need to be learned adaptively from a small number of integrated observations and the corresponding measurements (the raw data). The full integration result can then be estimated using the deduced α and w.

We can also choose a specific α value if we have an underlying stochastic model or assumption. For example, we can set α to 1 so that the resulting integration is a geometric mean. In this case, the underlying stochastic model is the exponential family, a specific case of the α-family. However, if the α value is fixed like that, the benefit of generalizing to an arbitrary stochastic model is lost. So we should automatically infer the α value, which amounts to a search through a space of stochastic models and leads to a more accurate integration that optimizes the α-divergence.

In this letter, we propose a new algorithm to learn the α-integration parameters α and w from the data sources and a small number of integrated target values. We first define an objective function with respect to α and w and then derive two update rules to learn the parameters based on gradient descent. The procedure consists of two steps: (1) an α-integration step and (2) a parameter update step (our main contribution). These steps are executed iteratively.

The rest of this letter is organized as follows. First, we briefly review α-integration and α-divergence in section 2. Then in section 3, we propose a new parameter learning algorithm. Section 4 includes experimental results and analysis with toy and real-world data sets. In section 5, we discuss the proposed algorithm in terms of stochastic models and geometry, followed by the conclusion in section 6.

## 2. Previous Work

In this section, we provide a brief overview of the α-mean, α-integration, and α-divergence. More details can be found in Amari (2007).

### 2.1. α-Mean.

The *α-mean* is a one-parameter family of means defined by

$$
\tilde{m}(x) = f_\alpha^{-1}\left(\frac{1}{2} f_\alpha(m_1(x)) + \frac{1}{2} f_\alpha(m_2(x))\right),
\tag{2.1}
$$

where $f_\alpha$ is a differentiable monotone function given by

$$
f_\alpha(z) =
\begin{cases}
z^{\frac{1-\alpha}{2}}, & \alpha \neq 1, \\
\log z, & \alpha = 1.
\end{cases}
\tag{2.2}
$$

The function $f_\alpha$ in equation 2.2 is the only function that makes the α-mean linear scale free: for any $c > 0$, the α-mean of $c\,m_1(x)$ and $c\,m_2(x)$ is $c\,\tilde{m}(x)$ (Hardy, Littlewood, & Polya, 1994; Amari & Nagaoka, 2000).

The α-mean includes various means as its special cases. For $\alpha = -1$, $\alpha = 1$, $\alpha = 3$, $\alpha \to \infty$, or $\alpha \to -\infty$, the α-mean becomes the arithmetic mean, geometric mean, harmonic mean, minimum, or maximum, respectively. Figure 1 shows an example of the α-mean with two source measures. The value of the parameter α (which is usually specified in advance) reflects the characteristics of the integration. As α increases, the α-mean resorts more to the smaller of *m*_{1}(*x*) and *m*_{2}(*x*), while as α decreases, the larger of the two is weighed more heavily, as shown in Figure 2 (Amari, 2007).
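As a concrete illustration (a minimal sketch of equations 2.1 and 2.2, not the authors' code; the function names are ours), the α-mean and its special cases can be computed as follows:

```python
import numpy as np

def f_alpha(z, alpha):
    """Amari's monotone function f_alpha (equation 2.2)."""
    if alpha == 1.0:
        return np.log(z)
    return z ** ((1.0 - alpha) / 2.0)

def f_alpha_inv(y, alpha):
    """Inverse of f_alpha."""
    if alpha == 1.0:
        return np.exp(y)
    return y ** (2.0 / (1.0 - alpha))

def alpha_mean(m1, m2, alpha):
    """alpha-mean of two positive measures (equation 2.1)."""
    return f_alpha_inv(0.5 * f_alpha(m1, alpha) + 0.5 * f_alpha(m2, alpha), alpha)

print(alpha_mean(2.0, 8.0, -1.0))  # arithmetic mean: 5.0
print(alpha_mean(2.0, 8.0, 1.0))   # ≈ 4.0 (geometric mean)
print(alpha_mean(2.0, 8.0, 3.0))   # ≈ 3.2 (harmonic mean)
```

Each special case falls out of the single formula by choosing α, which is exactly what makes learning α a search over integration strategies.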

### 2.2. α-Integration.

*α-integration* is a generalization of the α-mean to multiple positive sources with different weights $w = (w_1, \ldots, w_M)$ (Amari, 2007). The α-integration of $m_i(x)$, $i = 1, \ldots, M$, with weights $w$ is defined by

$$
\tilde{m}(x) = f_\alpha^{-1}\left(\sum_{i=1}^{M} w_i f_\alpha(m_i(x))\right),
$$

where $w_i > 0$ for $i = 1, \ldots, M$ and $\sum_{i=1}^{M} w_i = 1$.

Given $M$ positive measures, the goal of integration is to seek their weighted average that is as close to each of the measures as possible, where the closeness of two positive measures is evaluated by a divergence. Amari (2007) showed that α-integration is optimal in the sense that the weighted sum of α-divergences, $\sum_{i=1}^{M} w_i D_\alpha[m_i \,\|\, \tilde{m}]$, is minimized, where $D_\alpha[m_i \,\|\, \tilde{m}]$ is the α-divergence of $\tilde{m}(x)$ from the measure $m_i(x)$.
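The weighted α-integration of M sources can be sketched in a few lines (our own illustration; `alpha_integrate` is a hypothetical helper name):

```python
import numpy as np

def f_alpha(z, alpha):
    return np.log(z) if alpha == 1.0 else z ** ((1.0 - alpha) / 2.0)

def f_alpha_inv(y, alpha):
    return np.exp(y) if alpha == 1.0 else y ** (2.0 / (1.0 - alpha))

def alpha_integrate(ms, w, alpha):
    """alpha-integration of M positive measures.

    ms: array of shape (M, N), one row per measure m_i(x);
    w:  M positive weights summing to 1.
    """
    return f_alpha_inv(w @ f_alpha(ms, alpha), alpha)

ms = np.array([[2.0, 1.0, 4.0],
               [8.0, 3.0, 4.0]])
w = np.array([0.7, 0.3])
print(alpha_integrate(ms, w, -1.0))  # weighted arithmetic mean of the rows
```

With $\alpha = -1$ this reduces to a weighted average, and with $\alpha = 1$ to an exponential (geometric) mixture, matching the special cases mentioned in section 1.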

### 2.3. α-Divergence.
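The α-divergence of a positive measure $q$ from $p$ is defined in Amari (2007) as

```latex
D_\alpha[p \,\|\, q] \;=\; \frac{4}{1-\alpha^2}\int \left(
    \frac{1-\alpha}{2}\, p(x) \;+\; \frac{1+\alpha}{2}\, q(x)
    \;-\; p(x)^{\frac{1-\alpha}{2}}\, q(x)^{\frac{1+\alpha}{2}}
\right) dx, \qquad \alpha \neq \pm 1,
```

with the (generalized) Kullback-Leibler divergences $\mathrm{KL}[p\|q]$ and $\mathrm{KL}[q\|p]$ recovered in the limits $\alpha \to -1$ and $\alpha \to 1$, respectively.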

For applications where α-integration is applied to real problems, see Kim et al. (2008), Choi, Choi, Katake, Kang, and Choe (2010), and Choi, Katake et al. (2010), where α and w are given and fixed (i.e., specified by the user). However, these values are not known a priori unless we have a specific data set and understand it well. In the next section, we propose a new objective function and update rules for learning α and w.

## 3. Learning the Parameters for α-Integration

The problem that we consider in this letter is as follows. Given *M* positive measurements, $m_i(x)$, our task is to determine an α-integration when target values for the integration are sparsely observed. In other words, we learn the parameters α and w such that the optimal α-integration is as close as possible to the observed target values.

Let $m_i(x_k)$ be the $i$th measurement for $x_k$, where $i = 1, \ldots, M$ and $k = 1, \ldots, N$. Suppose true target values $t_j$ (the integrated values), $j = 1, \ldots, S$, are given for $S$ of the $N$ points ($S \leq N$). Our objective function to be minimized with respect to α and w is simply defined as

$$
J(\alpha, w) = \frac{1}{2} \sum_{j=1}^{S} \left( \tilde{m}(x_j) - t_j \right)^2.
$$

One example of the error surface of this objective function is shown in Figure 3. The error surface was generated with 30 randomly selected target values from the synthetic data set in the experiment section. Here, we can see that the optimal parameters, $(\alpha^*, w^*)$, are found at the global minimum of the error surface.
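The objective is cheap to evaluate directly; a minimal sketch (helper names are ours, not the authors'):

```python
import numpy as np

def f_alpha(z, alpha):
    return np.log(z) if alpha == 1.0 else z ** ((1.0 - alpha) / 2.0)

def f_alpha_inv(y, alpha):
    return np.exp(y) if alpha == 1.0 else y ** (2.0 / (1.0 - alpha))

def alpha_integrate(ms, w, alpha):
    return f_alpha_inv(w @ f_alpha(ms, alpha), alpha)

def objective(alpha, w, ms_labeled, t):
    """J(alpha, w): half the squared error between the alpha-integration
    of the S labeled columns of the measurements and the S targets t."""
    residual = alpha_integrate(ms_labeled, w, alpha) - t
    return 0.5 * np.sum(residual ** 2)
```

Sweeping `objective` over a grid of (α, w) values is how an error surface like the one in Figure 3 can be visualized.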

### 3.1. Update Rule for α.

### 3.2. Update Rule for *w*

When $\alpha = -1$, the function $f_\alpha$ is the identity and can be taken off, since $f_\alpha$ then has nothing to do with the optimization; in that case we can use the least-squares method for optimization. In this letter, however, we use equation 3.9 as the update rule for $w$ to keep the method working with arbitrary α values, including $\alpha = -1$.
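Since the explicit forms of equations 3.7 and 3.9 are not reproduced here, the following sketch approximates the two alternating update steps with numerical gradients (our own simplification, not the authors' derivation; the learning rates match those used in section 4.1, and the clip-and-renormalize projection of w onto the simplex is our assumption):

```python
import numpy as np

def f_alpha(z, a):
    return np.log(z) if a == 1.0 else z ** ((1.0 - a) / 2.0)

def f_inv(y, a):
    return np.exp(y) if a == 1.0 else y ** (2.0 / (1.0 - a))

def integrate(ms, w, a):
    return f_inv(w @ f_alpha(ms, a), a)

def loss(a, w, ms, t):
    return 0.5 * np.sum((integrate(ms, w, a) - t) ** 2)

def learn(ms, t, a=0.0, w=None, eta_a=0.5, eta_w=5e-5, iters=2000, eps=1e-5):
    """Alternate gradient-descent steps on alpha and w."""
    M = ms.shape[0]
    w = np.full(M, 1.0 / M) if w is None else w.astype(float)
    for _ in range(iters):
        # alpha step (stands in for equation 3.7)
        g_a = (loss(a + eps, w, ms, t) - loss(a - eps, w, ms, t)) / (2 * eps)
        a -= eta_a * g_a
        # w step (stands in for equation 3.9), then project back onto the simplex
        g_w = np.zeros(M)
        for i in range(M):
            d = np.zeros(M)
            d[i] = eps
            g_w[i] = (loss(a, w + d, ms, t) - loss(a, w - d, ms, t)) / (2 * eps)
        w = np.clip(w - eta_w * g_w, 1e-6, None)
        w /= w.sum()
    return a, w
```

In practice the paper's analytic gradients would replace the finite differences, but the alternating structure of the procedure is the same.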

### 3.3. Learning *α* and *w* at the Same Time

In this section, we show how we can learn both α and w at the same time and discuss some related issues. As in Figure 4, α and w usually play different roles during integration. For example, in the figure, the two diagonal lines are sources, and the horizontal line is an integration of the two sources with α = -1 and equal weights, which is the arithmetic mean. The dotted line and the dotted curve are integrations generated by changing only w and only α, respectively, relative to the arithmetic-mean case, to show the different roles the two parameters play. Obviously, the dotted line is a linearly weighted sum of the two sources, which does not depend on the magnitudes of the sources. The dotted curve, in contrast, follows the magnitudes of the sources rather than weighting them linearly. Since the two parameters play different roles during integration, we can estimate them by applying the two update rules simultaneously, starting from a random initial point. Equations 3.7 and 3.9 can be applied sequentially in each iteration of the learning procedure.

Note that if we have only one labeled data point, given $m_1(x_1)$ and $m_2(x_1)$ and the integration target $t_1$, we have only one equation with two parameters, α and w. While adjusting either of the two parameters with the other one fixed can solve the equation, we may not be able to optimize the two parameters at the same time. So, in order to avoid such cases, we need enough labeled data points to learn both parameters. That is, if we have more data points, we can obtain more stable solutions for both parameters, even in noisy cases.

Another issue is to determine the appropriate learning rates, $\eta_\alpha$ and $\eta_w$. Suppose there are two positive numbers and we want to find the maximum of them. Then there are two ways to get the maximum: (1) take $\alpha \to -\infty$, or (2) set $w = [1, 0]$ if the first number is bigger, or $w = [0, 1]$ otherwise. That is, to obtain the maximum, both $\alpha \to -\infty$ and $w = [1, 0]$ (or $[0, 1]$) could be the solution. Likewise, to find the minimum, both $\alpha \to \infty$ and $w = [1, 0]$ (or $[0, 1]$) could be the solution. Here, to get the same effect, the range of α is $(-\infty, \infty)$, while for each weight it is $[0, 1]$. This difference in scale should be reflected in the learning rates: the learning rate for α, $\eta_\alpha$, should be much greater than the one for w, $\eta_w$. We will see some examples in section 4.
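The max/min behavior of α described above can be checked numerically (a quick sketch of our own; a large finite |α| stands in for the limits):

```python
import numpy as np

def alpha_mean(m1, m2, a):
    """Unweighted alpha-mean of two positive numbers."""
    f = (lambda z: np.log(z)) if a == 1.0 else (lambda z: z ** ((1.0 - a) / 2.0))
    y = 0.5 * f(m1) + 0.5 * f(m2)
    return np.exp(y) if a == 1.0 else y ** (2.0 / (1.0 - a))

# As alpha -> -inf the alpha-mean approaches max(2, 8) = 8;
# as alpha -> +inf it approaches min(2, 8) = 2.
print(alpha_mean(2.0, 8.0, -60.0))  # close to 8
print(alpha_mean(2.0, 8.0, 60.0))   # close to 2
```

This is why a gradient step of the same size has a far smaller effect on α than on w, and hence why $\eta_\alpha \gg \eta_w$ is needed.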

## 4. Experiments

In order to show the effectiveness of our proposed algorithm, we carried out experiments with three data sets: (1) a synthetic data set with two curves and a few true integrated values, as in Figure 5; (2) monthly average temperatures of multiple cities from www.cityrating.com, as in Table 1; and (3) a color image (see Figure 12) converted to grayscale images in different ways.

### 4.1. Synthetic Data.

In this section, we learn the parameters from a synthetic data set where we know the true parameter values (hidden from the algorithm) so that we can measure the accuracy of the learning algorithm. We show the results of learning the parameters α and w simultaneously. Based on repeated experiments with randomly generated target values, we confirmed that there is no significant difference in the rate of convergence when different parameters were used.

Figure 5 shows the two data sources, the true integrated curve, and the target values. We generated 30 labeled data points. As discussed above, the learning rates for α and w should have different scales; we used 0.5 and 0.00005, respectively. The true values of α and w are 3 and [0.7, 0.3], both of which are hidden from the learning algorithm.

Figure 6 shows the evolution of the parameter values over time. The parameters start from α = 0 and w = [0.5, 0.5]. As the learning algorithm updates the parameters, they converge to the true values, 3 and [0.7, 0.3], respectively. Note that α increases up to 6 and then comes back to converge to 3. This is because of the interaction between α and w, which may share the same role in certain situations. The error, however, decreases monotonically and converges to zero (see Figure 7).

To check the interaction between α and w, we decreased the number of labeled data points. In Figure 8, the error surface is obtained from five randomly selected target values. In Figure 8b, the tangent plane at the bottom of the surface seems to meet the surface along a straight line that includes the error value at the true parameters, which means that the learning procedure could stop at any parameter set on that line. This line is a function of α and w, indicating the interaction between them. If we increase the number of labeled data points, this line becomes curved, so that the algorithm can learn the true parameters. Compare this error surface to the one in Figure 3, where 30 randomly selected target values are given. Note that these surfaces were generated from randomly selected target values, so sometimes the parameters can be learned almost perfectly even with just five target values.

### 4.2. Temperature Data.

As a second data set, we used monthly average temperature data from several cities in the United States (see Table 1). We used three cities (New York, New York; Chicago, Illinois; and Houston, Texas) as sources and estimated the temperature of Atlanta, Georgia. We used temperatures from 10 randomly selected months in Atlanta to learn the parameters and tested with the 2 remaining months' temperatures. The temperature scale is Fahrenheit (°F).

| City | Jan. | Feb. | Mar. | Apr. | May | Jun. | Jul. | Aug. | Sep. | Oct. | Nov. | Dec. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Chicago | 21.0 | 25.4 | 37.2 | 48.6 | 58.9 | 68.6 | 73.2 | 71.7 | 64.4 | 52.8 | 40.0 | 26.6 |
| Boston | 28.6 | 30.3 | 38.6 | 48.1 | 58.2 | 67.7 | 73.5 | 71.9 | 64.8 | 54.8 | 45.3 | 33.6 |
| New York | 31.5 | 33.6 | 42.4 | 52.5 | 62.7 | 71.6 | 76.8 | 75.5 | 68.2 | 57.5 | 47.6 | 36.6 |
| Atlanta | 41.0 | 44.8 | 53.5 | 61.5 | 69.2 | 76.0 | 78.8 | 78.1 | 72.7 | 62.3 | 53.1 | 44.5 |
| Houston | 50.4 | 53.9 | 60.6 | 68.3 | 74.5 | 80.4 | 82.6 | 82.3 | 78.2 | 69.6 | 61.0 | 53.5 |
| San Antonio | 49.3 | 53.5 | 61.7 | 69.3 | 75.5 | 82.2 | 85.0 | 84.9 | 79.3 | 70.2 | 60.4 | 52.2 |


Source: www.cityrating.com.

The parameters α and w were initialized to 0 and [1/3, 1/3, 1/3], respectively. We expected α to decrease far enough that the integration approaches the higher values, and we also expected Houston to have more weight than the other cities, since Atlanta's temperature is more similar to Houston's. In the experiments, as the learning procedure proceeded, the parameters converged to α = −6.86 and w = [0.27, 0.46, 0.27], as shown in Figure 9. In this example, the learned parameter values take on a real-world meaning, based on temperature-based geometry. This was confirmed in similar experiments with other cities from Table 1.

The error decreases to 0.604, as shown in Figure 10. Figure 11 shows the true temperature of Atlanta, along with the estimated temperature using the optimal parameters. Many factors affect the temperature of a city; each month of each city might have unique factors affecting its temperature, which can be a source of error that we cannot overcome simply by averaging, even with accurately learned α and w values. However, our proposed method is better than the simple arithmetic (or geometric or harmonic) average: it theoretically achieves the minimum error among all linear-scale-free averaging methods.

### 4.3. Color-to-Grayscale Image Conversion.

Our method can be applied to approximately discover a proper color-to-grayscale image conversion strategy, given a color image and an example grayscale version of the image.

A color image consists of three different sources: red, green, and blue (RGB). To convert the color image to grayscale, simple strategies like the luminosity method compute a weighted sum of the RGB sources. Among many more complex strategies, we took five popular ones: Rasche05, Grundland07, Smith08, Color2gray, and CIE_Y. (For details of these strategies, see Čadík, 2008.)

The color image used in the experiment is shown in Figure 12, and the five converted grayscale images are shown in Figure 13a. Given the color image and a target gray image, we learned the parameters with 500 randomly selected pixel samples from the pair of images. Then we applied α-integration with the learned parameters to the color image to convert it to grayscale. The target grayscale images are not exactly α-integrations of RGB, which means our learning is an approximation of the original conversion strategies. Figure 13b shows the five α-integrated gray images, which are qualitatively similar to the original grayscale images in Figure 13a.
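To make the conversion concrete, here is a minimal sketch of our own (the parameter values below are made up in the style of Table 2, not learned) of α-integrating the RGB channels of an image:

```python
import numpy as np

def f_alpha(z, a):
    return np.log(z) if a == 1.0 else z ** ((1.0 - a) / 2.0)

def f_inv(y, a):
    return np.exp(y) if a == 1.0 else y ** (2.0 / (1.0 - a))

def to_gray(rgb, w, a, eps=1e-6):
    """alpha-integrate the R, G, B channels of an H x W x 3 image in [0, 1]."""
    chans = np.clip(rgb, eps, None)  # alpha-integration needs positive measures
    return f_inv(np.tensordot(f_alpha(chans, a), w, axes=([2], [0])), a)

rng = np.random.default_rng(0)
img = rng.uniform(0.1, 1.0, size=(4, 4, 3))
gray = to_gray(img, np.array([0.27, 0.28, 0.45]), -0.2)  # hypothetical parameters
```

With equal weights and α = -1, `to_gray` reduces to averaging the three channels; learning (α, w) against sample pixels of a target grayscale image selects among all such conversions.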

The learned parameters are summarized in Table 2. As with the real-world meanings of the learned parameter values in the previous section, we can understand some characteristics of the conversion strategies from the parameters learned for α-integration. For example, the learned α value for the Smith08 strategy is much lower than for the other strategies, which could mean that the grayscale result of Smith08 is brighter than those of the other strategies. Also, from the weight vectors, we can tell that the blue region in the color image is mapped to a brighter gray by the first two strategies (Rasche05 and Grundland07) than by the others (*w*_{3} is 0.4520 and 0.4046 for the first two, and 0.0417, 0.0000, and 0.0683 for the rest). These interpretations are confirmed in Figure 13. Also note that we can generate infinitely many new strategies by varying the parameters.

| Algorithm | α | w |
|---|---|---|
| Rasche05 | −0.1973 | [0.2722, 0.2758, 0.4520] |
| Grundland07 | −3.5940 | [0.0188, 0.5766, 0.4046] |
| Smith08 | −7.9890 | [0.0966, 0.8617, 0.0417] |
| Color2gray | −0.5416 | [0.3490, 0.6510, 0.0000] |
| CIE_Y | −3.6599 | [0.1970, 0.7347, 0.0683] |


An earlier version of the work reported here has also been used in a real-world application. Based on our previous conference paper (Choi, Choi, Katake, & Choe, 2010), Wang et al. (2011) successfully applied our method to colonic polyp detection in CT colonography, where they integrated decisions from the CAD system (machine intelligence) and the MTurk workers (human intelligence) based on learning only α.

## 5. Discussion

In this letter, we proposed a learning algorithm for estimating the α-integration parameters α and w from data; previous work required manually determined, fixed values for these parameters. In this section, we discuss the proposed algorithm in terms of stochastic models and geometry.

In terms of stochastic models, the α-family includes stochastic models such as the exponential family and the mixture family, so learning α can be seen as finding the best family out of all the stochastic families in the α-family. In that sense, our proposed algorithm can be seen as trying to find both the best stochastic family and the best distribution within that family. That is, when we learn the parameters, they identify a better model (a set of distributions) given the current integration samples and the data sources. As a consequence, α-integration with learning of α and w identifies, arguably, the best average out of all possible distributions in the α-family. Since we have not theoretically proved the objective function to be convex, there is the possibility that the learning finds locally optimal parameters.

From a geometrical point of view, defining the distance between any two points in a data set leads to a single corresponding metric, which can be used to determine the manifold on which the data points lie. Thus, learning the parameters corresponds to defining the manifold of the probability distributions (or nonnegative measurements). When we initialize the parameters, we assume one manifold, and when the parameters are updated, the shape of the assumed manifold changes. The α-integration and the manifold shape are updated iteratively. In the end, α-integration with learning of α and w gives us the best integration along with a metric for the manifold. These two interpretations are strongly related, since α-integration originated from information geometry (Amari & Nagaoka, 2000).

Another advantage of our approach is that the learned parameter values can take on a concrete real-world meaning, depending on the data set. In the temperature experiments, the parameters have a temperature-based geometrical meaning, and in the color-to-grayscale conversion strategies, the parameter values indicate the relative brightness and the bias in color composition. Likewise, with an arbitrary data set, we can try to interpret the optimized parameter values after learning is complete.

In addition to engineering data sets, our approach can be used in neuroscientific models based on specific parameter values. When a part of the brain receives multiple inputs and generates one output by some kind of mixture, we can apply α-integration with parameter learning and estimate the functional role of the input sources based on the optimized parameter values.

## 6. Conclusion

In this letter, we proposed a new method for learning the α-integration parameters, the characteristic value α and the weight vector w, from sparse integration samples (target values) in order to optimize data integration. The update rules were rigorously derived, and the performance was validated in experiments with several synthetic and real-world data sets. Given only a few target values, our method found the parameters that achieve the best integration in terms of α-divergence. The estimated parameter values have domain-specific meaning (semantics); thus the resulting parameters can be used to infer the functional nature of the mixture.

We expect our approach to help automate the α-integration framework. Furthermore, α-integration parameter optimization may be possible in different domains, such as reinforcement learning and unsupervised learning.

## Acknowledgments

This work is partly based on our previous conference presentation (Choi, Choi, Katake, & Choe, 2010). This publication is based in part on work supported by Award KUS-C1-016-04, made by King Abdullah University of Science and Technology (KAUST). S.C. was supported by National Research Foundation (NRF) of Korea (2012-0005785 and 2012-0005786), the Converging Research Center Program funded by the Ministry of Education, Science, and Technology (2012K001343), Korea MKE and NIPA IT Consilience Creative Program (C1515-1121-0003), and NRF World Class University Program (R31-10100).