## Abstract

Recent advances in weakly supervised classification allow us to train a classifier from only positive and unlabeled (PU) data. However, existing PU classification methods typically require an accurate estimate of the class-prior probability, a critical bottleneck particularly for high-dimensional data. This problem has been commonly addressed by applying principal component analysis in advance, but such unsupervised dimension reduction can collapse the underlying class structure. In this letter, we propose a novel representation learning method from PU data based on the information-maximization principle. Our method does not require class-prior estimation and thus can be used as a preprocessing method for PU classification. Through experiments, we demonstrate that our method, combined with deep neural networks, highly improves the accuracy of PU class-prior estimation, leading to state-of-the-art PU classification performance.

## 1 Introduction

In real-world applications, it is conceivable that only positive and unlabeled (PU) data are available for training a classifier. For instance, in land-cover image classification, images of urban regions can be easily labeled, while images of nonurban regions are difficult to annotate due to the high diversity of nonurban regions containing, for example, forest, seas, grasses, and soil (Li, Guo, & Elkan, 2011). To cope with such situations, PU classification has been actively studied (Letouzey, Denis, & Gilleron, 2000; Elkan & Noto, 2008; du Plessis, Niu, & Sugiyama, 2015), and the state-of-the-art method allows us to systematically train deep neural networks only from PU data (Kiryo, Niu, du Plessis, & Sugiyama, 2017).

However, existing PU classification methods typically require an estimate of the class-prior probability, and their performance is sensitive to the quality of class-prior estimation (Kiryo et al., 2017). Although various class-prior estimation methods from PU data have been proposed so far (du Plessis & Sugiyama, 2014; Ramaswamy, Scott, & Tewari, 2016; Jain, White, & Radivojac, 2016; du Plessis, Niu, & Sugiyama, 2017; Northcutt, Wu, & Chuang, 2017), accurate estimation of the class-prior is still highly challenging, particularly for high-dimensional data.

In practice, principal component analysis is commonly used to reduce the data dimensionality in advance (Ramaswamy et al., 2016; du Plessis et al., 2017). However, such unsupervised dimension reduction completely abandons label information, and thus the underlying class structure may be smashed. As a result, class-prior estimation often becomes even more difficult after dimension reduction.

The goal of this letter is to cope with this problem by proposing a representation learning method that can be executed only from PU data. Our method is developed within the framework of information maximization (Linsker, 1988).

Mutual information (MI) (Cover & Thomas, 2006) is a statistical dependency measure between random variables that is popularly used in information-theoretic machine learning (Torkkola, 2003; Krause, Perona, & Gomes, 2010). However, empirically approximating MI from continuous-valued training data is not straightforward (Moon, Rajagopalan, & Lall, 1995; Kraskov, Stögbauer, & Grassberger, 2004; Khan, Bandyopadhyay, Ganguly, & Saigal, 2007; Van Hulle, 2005; Suzuki, Sugiyama, Sese, & Kanamori, 2008) and is often sensitive to outliers (Basu, Harris, Hjort, & Jones, 1998; Sugiyama, Suzuki, & Kanamori, 2012b). For this reason, we employ a squared-loss variant of mutual information (SMI) (Suzuki, Sugiyama, Kanamori, & Sese, 2009; Sugiyama, 2013), whose empirical estimator is known to be robust to outliers and possess superior numerical properties (Kanamori, Suzuki, & Sugiyama, 2012).

Our contributions are summarized as follows:

In section 3, we develop a novel estimator of SMI that can be computed only from PU data and prove that it converges to the optimal estimate of SMI at the optimal parametric rate when the linear-in-parameter model is used.

Based on this PU-SMI estimator, in section 4, we propose a representation learning method that can be executed without estimating the class-prior probabilities of unlabeled data.

Finally, in section 5, we experimentally demonstrate that our PU representation learning method combined with deep neural networks highly improves the accuracy of PU class-prior estimation; consequently, the accuracy of PU classification can also be boosted significantly.

## 2 SMI

In this section, we review the definition of ordinary MI and its variant, SMI.

Let $x \in \mathbb{R}^d$ be an input pattern, $y \in \{\pm 1\}$ be the corresponding class label, and $p(x, y)$ be the underlying joint density, where $d$ is a positive integer.

*Mutual information* (MI) (Cover & Thomas, 2006) is a statistical dependency measure defined as
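the display equations defining MI and SMI did not survive conversion. The standard definitions (Cover & Thomas, 2006; Suzuki et al., 2009) in the notation of this section are:

```latex
\mathrm{MI} := \sum_{y = \pm 1} \int p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \,\mathrm{d}x ,
\qquad
\mathrm{SMI} := \frac{1}{2} \sum_{y = \pm 1} \int p(x)\, p(y)
  \left( \frac{p(x, y)}{p(x)\, p(y)} - 1 \right)^{\!2} \mathrm{d}x .
```

SMI is the Pearson $\chi^2$ divergence from $p(x, y)$ to $p(x)\,p(y)$; it is nonnegative and equals zero if and only if $x$ and $y$ are statistically independent.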

So far, methods for estimating SMI from positive and negative samples and SMI-based machine learning algorithms have been explored extensively, and their effectiveness has been demonstrated (Sugiyama, 2013).

## 3 SMI Estimation from PU Data

The goal of this letter is to develop a representation learning method from PU data. To this end, in this section we propose an estimator of SMI that can be computed only from PU data.

### 3.1 SMI with PU Data

First, we express SMI in equation 2.1 in terms of only the densities of PU data, without negative data (see appendix A for its proof):

If the PU densities $p(x \mid y = +1)$ and $p(x)$ are estimated from PU data, the above PU-SMI allows us to approximate SMI only from PU data. However, such a naive approach performs poorly because density estimation is hard, and taking the ratio of estimated densities further magnifies the estimation error (Sugiyama, Suzuki, & Kanamori, 2012a).

### 3.2 PU-SMI Estimation

Here, we propose a more sophisticated approach to estimating PU-SMI from PU data.

First, the following theorem gives a lower bound of PU-SMI (see appendix B for its proof):
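The statement of the theorem was lost in conversion. As a sketch of the standard derivation that such bounds follow (not the paper's exact PU-specific statement), SMI can be written in terms of the density ratio $r(x, y) := p(x, y) / (p(x)\,p(y))$ as

```latex
\mathrm{SMI} = \frac{1}{2}\, \mathbb{E}_{p(x)p(y)}\!\left[ r(x, y)^2 \right] - \frac{1}{2},
```

and since $r^2 \ge 2 r w - w^2$ pointwise for any function $w$, taking expectations yields

```latex
\mathrm{SMI} \;\ge\; \mathbb{E}_{p(x, y)}\!\left[ w(x, y) \right]
 - \frac{1}{2}\, \mathbb{E}_{p(x)p(y)}\!\left[ w(x, y)^2 \right] - \frac{1}{2},
```

with equality if and only if $w = r$. The paper's theorem expresses the analogous bound for PU-SMI using only expectations over $p(x \mid y = +1)$ and $p(x)$, together with the class-prior ratio.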

While PU-SMI itself contains $p(x \mid y = +1)$ and $p(x)$ in a complicated way, the lower bound consists only of the expectations over $p(x \mid y = +1)$ and $p(x)$. Thus, the lower bound can be immediately approximated empirically.

Note that if $\mathcal{W}$ contains the true density-ratio function $p(x \mid y = +1)/p(x)$, then $\widehat{\text{PU-SMI}} \to \text{PU-SMI}$ as $n_{\mathrm{P}}, n_{\mathrm{U}} \to \infty$ under mild regularity conditions. On the other hand, if the function class does not contain the true density-ratio function, a gap between $\text{PU-SMI}$ and $\widehat{\text{PU-SMI}}$ remains even as $n_{\mathrm{P}}, n_{\mathrm{U}} \to \infty$. Such a gap often arises in real-world applications because a function class does not always include the true density-ratio function. In practice, however, the gap is not a critical issue as long as a reasonably flexible function class is chosen: as the experiments in section 5 demonstrate, our representation learning method improves classification performance even when the gap exists.

### 3.3 Analytic Solution for Linear-in-Parameter Models

Our SMI estimator is applicable to any density-ratio model $w$.

If a neural network is used as $w$, the solution may be obtained by a stochastic gradient method (Goodfellow, Bengio, & Courville, 2016; Abadi et al., 2015; Jia et al., 2014).

Note that all hyperparameters, such as the regularization parameter, can be tuned based on the value of $J_{\mathrm{PU}}$ approximated with (cross-)validation samples.
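The analytic solution itself is not reproduced in this excerpt. The following is a hedged sketch using the standard least-squares density-ratio fit for the linear-in-parameter model $w(x) = \boldsymbol{\beta}^\top \boldsymbol{\varphi}(x)$, whose closed form $\hat{\boldsymbol{\beta}} = (\hat{H}_{\mathrm{U}} + \lambda I)^{-1} \hat{h}_{\mathrm{P}}$ is consistent with the $\hat{H}_{\mathrm{U}}$ notation of section 3.4; the toy data and all variable names are our own assumptions, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy PU data: positives from N(+1, 1); unlabeled data are an equal mixture
# of the positive component N(+1, 1) and a negative component N(-1, 1).
n_p, n_u = 200, 1000
x_p = rng.normal(+1.0, 1.0, n_p)
x_u = np.concatenate([rng.normal(+1.0, 1.0, n_u // 2),
                      rng.normal(-1.0, 1.0, n_u // 2)])

# Gaussian basis functions phi_l(x) = exp(-||x - x_l||^2 / (2 sigma^2)),
# with centers randomly sampled from the unlabeled data (as in section 5.1).
b, sigma, lam = 50, 1.0, 1e-3
centers = rng.choice(x_u, size=b, replace=False)

def basis(x):
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))

Phi_p, Phi_u = basis(x_p), basis(x_u)      # shapes (n_p, b), (n_u, b)

# Least-squares fit of w(x) = beta^T phi(x) to the ratio p(x | y=+1) / p(x):
# minimizing (1/2) E_U[w^2] - E_P[w] + (lam/2) ||beta||^2 over beta gives the
# closed form beta_hat = (H_u + lam I)^{-1} h_p.
H_u = Phi_u.T @ Phi_u / n_u                # empirical E_U[phi phi^T]
h_p = Phi_p.mean(axis=0)                   # empirical E_P[phi]
beta = np.linalg.solve(H_u + lam * np.eye(b), h_p)

w_u = Phi_u @ beta                         # ratio estimates at unlabeled points
print(float(w_u.mean()))                   # E_U[w] should be close to 1
```

The sanity check at the end relies on the identity $\mathbb{E}_{p(x)}[p(x \mid y=+1)/p(x)] = 1$, so the mean of the fitted ratio over unlabeled points should be near 1 when the fit is reasonable.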

### 3.4 Convergence Analysis

Here we analyze the convergence rate of learned parameters of the density-ratio model and the PU-SMI approximator based on the perturbation analysis of optimization problems (Bonnans & Cominetti, 1996; Bonnans & Shapiro, 1998).

In our theoretical analysis, we focus on the linear-in-parameter model in equation 3.5. We first define $\boldsymbol{\beta}^{*\top} \boldsymbol{\varphi}(x)$ with the minimizer of the expected error, $\boldsymbol{\beta}^* := \operatorname*{argmin}_{\boldsymbol{\beta} \in \mathbb{R}^b} J_{\mathrm{PU}}(\boldsymbol{\beta})$, and denote its estimator by $\hat{\boldsymbol{\beta}} := \operatorname*{argmin}_{\boldsymbol{\beta} \in \mathbb{R}^b} \hat{J}_{\mathrm{PU}}(\boldsymbol{\beta})$ in this analysis. Note that the linear-in-parameter model is assumed as a simple baseline for theoretical analysis.

For the linear-in-parameter model, we assume that the basis functions satisfy $0 \le \varphi_\ell(x) \le 1$ for all $\ell = 1, \ldots, b$, and that $\hat{H}_{\mathrm{U}}$ and $H_{\mathrm{U}}$ are positive-definite matrices.

Theorem 3 guarantees the convergence of the density-ratio estimator and the PU-SMI approximator. In our setting, since $n_{\mathrm{P}}$ and $n_{\mathrm{U}}$ can increase independently, this is the optimal convergence rate without any additional assumptions (Kanamori, Hido, & Sugiyama, 2009; Kanamori et al., 2012).

Theorem 3 shows that both positive and unlabeled data contribute to convergence. This implies that unlabeled data are used directly in the estimation, rather than merely for extracting structural information about the data, such as the cluster structure frequently assumed in semisupervised learning (Chapelle, Schölkopf, & Zien, 2006). The theorem also shows that the convergence rate of our method is dominated by the smaller of the positive and unlabeled sample sizes.
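The precise statement of theorem 3 did not survive extraction. Consistent with the discussion above (both sample sizes contribute, and the rate is dominated by the smaller one), the claimed rates have the standard parametric form below; the exact norms and constants are our assumptions:

```latex
\left\| \hat{\boldsymbol{\beta}} - \boldsymbol{\beta}^{*} \right\|_{2}
  = O_p\!\left( \frac{1}{\sqrt{n_{\mathrm{P}}}} + \frac{1}{\sqrt{n_{\mathrm{U}}}} \right),
\qquad
\left| \widehat{\text{PU-SMI}} - \text{PU-SMI}^{*} \right|
  = O_p\!\left( \frac{1}{\sqrt{n_{\mathrm{P}}}} + \frac{1}{\sqrt{n_{\mathrm{U}}}} \right).
```

Since $1/\sqrt{n_{\mathrm{P}}} + 1/\sqrt{n_{\mathrm{U}}} = \Theta(1/\sqrt{\min(n_{\mathrm{P}}, n_{\mathrm{U}})})$, the smaller sample size dominates the rate.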

Note that since this analysis focuses on the linear-in-parameter model, there might be a gap between $\text{PU-SMI}$ and $\text{PU-SMI}^*$, that is, $\text{PU-SMI}^* \le \text{PU-SMI}$. The convergence analysis guarantees that $\widehat{\text{PU-SMI}}$ with the linear-in-parameter model converges to $\text{PU-SMI}^*$, but an approximation error may remain (Mohri, Rostamizadeh, & Talwalkar, 2012), as discussed in section 3.2.

## 4 PU Representation Learning

In this section, we propose a representation learning method based on PU-SMI maximization. We extend the existing SMI-based dimension reduction (Suzuki & Sugiyama, 2013), called *least-squares dimension reduction* (LSDR), to PU representation learning. While LSDR considers only linear dimension reduction, we extend it to nonlinear dimension reduction by neural networks.

This formulation is closely related to *sufficient dimension reduction* (Li, 1991). Let $\widetilde{\text{SMI}}$ be the SMI between $v(x)$ and $y$. Suzuki and Sugiyama (2013) proved that $\text{SMI} \ge \widetilde{\text{SMI}}$, with equality when condition 4.1 is satisfied. That is, maximizing $\widetilde{\text{SMI}}$ amounts to finding a representation sufficient for the output $y$.

Following the information-maximization principle (Linsker, 1988), we maximize PU-SMI with respect to the mapping $v$ to find a low-dimensional representation that maximally preserves the dependency between input and output.

We refer to our representation learning method for PU data as *positive-unlabeled representation learning* (PURL).^{1}

Note again that in the above optimization process, the unknown class-prior ratio $\theta_{\mathrm{P}}/\theta_{\mathrm{N}}$ does not need to be estimated in advance, which is a significant advantage of the proposed method.
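The optimization procedure itself is not spelled out in this excerpt. The following is a minimal sketch under our own assumptions: a linear one-dimensional mapping $v(x) = u^\top x$, the closed-form density-ratio fit from section 3.3 plugged into the variational lower bound, and a grid search over directions in place of gradient updates on a neural network. The class-prior ratio $\theta_{\mathrm{P}}/\theta_{\mathrm{N}}$ only scales the objective, so it does not affect the maximizing direction and is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)

# 2-D toy data mimicking the illustration in section 5.2.1: the classes differ
# along axis 0, while axis 1 carries large class-independent variance.
n_p, n_u = 300, 600
def sample(mean0, n):
    return np.column_stack([rng.normal(mean0, 1.0, n),
                            rng.normal(0.0, 5.0, n)])
x_p = sample(+2.0, n_p)
x_u = np.vstack([sample(+2.0, n_u // 2), sample(-2.0, n_u // 2)])

def lower_bound_proxy(z_p, z_u, b=30, sigma=1.0, lam=1e-3):
    """Plug the analytic density-ratio fit into the variational lower bound:
    max_beta E_P[w] - (1/2) E_U[w^2] ~= (1/2) h_p^T (H_u + lam I)^{-1} h_p."""
    centers = z_u[rng.choice(len(z_u), size=b, replace=False)]
    phi = lambda z: np.exp(-(z[:, None] - centers[None, :]) ** 2
                           / (2 * sigma ** 2))
    H_u = phi(z_u).T @ phi(z_u) / len(z_u)
    h_p = phi(z_p).mean(axis=0)
    return 0.5 * h_p @ np.linalg.solve(H_u + lam * np.eye(b), h_p)

# Maximize the bound over one-dimensional projections v(x) = u^T x by a grid
# search over directions (a stand-in for gradient updates on v).
angles = np.linspace(0.0, np.pi, 60, endpoint=False)
scores = [lower_bound_proxy(x_p @ np.array([np.cos(a), np.sin(a)]),
                            x_u @ np.array([np.cos(a), np.sin(a)]))
          for a in angles]
a_best = angles[int(np.argmax(scores))]
best = np.array([np.cos(a_best), np.sin(a_best)])
print(best)  # expected to be nearly parallel to the class-informative axis
```

Projecting onto the noise axis makes the ratio $p(z \mid y=+1)/p(z)$ constant (proxy value near $1/2$), while projecting onto the class-informative axis makes the ratio highly variable (larger proxy value), so the search recovers the direction along which the P and U distributions differ most.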

## 5 Experiments

In this section, we experimentally investigate the behavior of the proposed PU-SMI estimator and evaluate the performance of the proposed representation learning method on various benchmark data sets.

### 5.1 Accuracy of PU-SMI Estimation

First, we investigate the estimation accuracy of the proposed PU-SMI estimator on data sets obtained from the *LIBSVM* web page (Chang & Lin, 2011).

As the model $w$, we use the linear-in-parameter model with the gaussian basis functions $\varphi_\ell(x) := \exp(-\|x - x_\ell\|^2 / (2\sigma^2))$ for $\ell = 1, \ldots, b$, where $\sigma > 0$ is the bandwidth and $\{x_\ell\}_{\ell=1}^{b}$ are the centers of the gaussian functions, randomly sampled from $\{x_k^{\mathrm{U}}\}_{k=1}^{n_{\mathrm{U}}}$. The gaussian bandwidth and the $\ell_2$-regularization parameter are determined by five-fold cross-validation. We vary the number of positive/unlabeled samples from 10 to 200, with the number of unlabeled/positive samples fixed. The class-prior is assumed to be known in this illustrative experiment and is set to $\theta_{\mathrm{P}} = 0.5$.

The results show that the mean squared error^{2} decreases as either the number of positive samples or the number of unlabeled samples increases. Therefore, both positive and unlabeled data contribute to improving the estimation accuracy of SMI, which agrees well with our theoretical analysis in section 3.4.

### 5.2 Representation Learning

Next, we evaluate the performance of the proposed representation learning method, PURL.

#### 5.2.1 Illustration

We first illustrate PURL on a two-dimensional artificial data set, applying PCA, Fisher's discriminant analysis (FDA), and PNRL^{3} (the supervised counterpart of PURL) to the data. As the label information for FDA and PNRL, U data are simply regarded as N data, even though U data are a mixture of P and N data. Since PCA and FDA are linear transformations, we also use a linear transformation in PNRL and PURL for this numerical illustration. Specifically, we use a two-layer perceptron for $w$. The first fully connected layer is used as a linear transformation to obtain a one-dimensional representation. The rectified linear unit (ReLU) (Glorot, Bordes, & Bengio, 2011) is used as the activation function for the outputs of the first layer, which can be seen as the feature mapping in the linear-in-parameter model. The second layer is a single connection that weighs the outputs of the first layer.

We plot the subspaces obtained by PCA, FDA, PNRL, and our proposed method in Figure 2a. Since the data are distributed vertically, the subspace obtained by PCA is almost parallel to the vertical axis (the dashed line). FDA and PNRL return diagonal lines (the dash-dotted and dotted lines), showing that regarding U data as N data is not appropriate. On the other hand, the subspace obtained by our method is almost parallel to the horizontal axis (the solid line). Figure 2b plots the labeled data projected onto those subspaces. The labels of the data projected by PCA, FDA, and PNRL are barely distinguishable due to significant overlap, which makes class-prior estimation very hard. In contrast, we can easily separate the classes of samples projected by the proposed method, which eases class-prior estimation.

#### 5.2.2 Benchmark Data

Next, we apply the PURL method to benchmark data sets. To obtain low-dimensional representations, we set $m = 20$ and use a fully connected neural network with four layers ($d$-60-20-1; $v$ is $d$-60-20 and $g$ is 20-1) for $w$, except for the text classification data set. For the text classification data set, we set $m = 10$ and use another fully connected neural network with four layers ($d$-30-10-1) for $w$. ReLU is used as the activation function for the hidden layers, and batch normalization (Ioffe & Szegedy, 2015) is applied to all hidden layers. Stochastic gradient descent with learning rate 0.001 is used for optimization, together with weight decay of 0.0005 and gradient noise of 0.01. We iteratively update $w$ with four mini-batches and $v$ with one mini-batch.

We compare the accuracy of class-prior estimation with and without dimension reduction. For comparison, we also consider PCA, FDA, and PNRL. For PCA, we vary the number of components as $\lfloor d/4 \rfloor$, $\lfloor d/2 \rfloor$, and $\lfloor 3d/4 \rfloor$, where $\lfloor \cdot \rfloor$ is the floor function. For FDA, the reduced dimension is 1 because of the property of FDA (Hastie, Tibshirani, & Friedman, 2009) that the reduced dimension is the minimum of $m$ and (the number of classes $-$ 1). The neural network for PNRL is the same as that for the proposed method.

As the class-prior estimation method, we use the kernel mean embedding (KM) method proposed by Ramaswamy et al. (2016). With the estimated class-prior, we then train a fully connected neural network with five layers ($m$-300-300-300-1). ReLU is used as the activation function for the hidden layers, and batch normalization is applied to all hidden layers. Except for the text classification data set, we train the neural networks by Adam (Kingma & Ba, 2015) for 200 epochs. For the text classification data set, we use AdaGrad (Duchi, Hazan, & Singer, 2011) and set the number of epochs to 300. For nonnegative PU learning (Kiryo et al., 2017), we use the sigmoid loss function and set $\beta$ and $\gamma$ in their paper to 0 and 1, respectively.

We use the ijcnn1, phishing, mushrooms, and a9a data sets taken from the LIBSVM web page (Chang & Lin, 2011). Also, we use the MNIST (LeCun, Bottou, Bengio, & Haffner, 1998), Fashion-MNIST (F-MNIST) (Xiao, Rasul, & Vollgraf, 2017), and 20 Newsgroups (Lang, 1995) data sets. For the MNIST and F-MNIST data sets, we divide the whole set of classes into two groups to make binary classification tasks. For the 20 Newsgroups data set, we use the “com” topic as the positive class and the “sci” topic as the negative class,^{4} and construct 2000-dimensional *tf-idf* vectors. From each data set, we draw $n_{\mathrm{P}} = 1000$ positive and $n_{\mathrm{U}} = 2000$ unlabeled samples. For validation, we use $n_{\mathrm{P}} = 50$ and $n_{\mathrm{U}} = 200$ samples.

Table 1 lists the average absolute error between the estimated class-prior and the true value. Overall, our proposed dimension-reduction method tends to outperform the other methods, meaning that our method provides useful low-dimensional representations. Except for the ijcnn1 data set, the error of FDA tends to be larger than that of the other methods, implying that regarding U data as N data does not help in class-prior estimation. For the mushrooms and a9a data sets, applying the unsupervised dimension-reduction method, PCA, does not improve the estimation accuracy, while our method reduces the error of class-prior estimation. In particular, for the 20 Newsgroups data set, the existing approaches (PCA, FDA, and PNRL) perform poorly. In contrast, applying our method significantly reduces the error of class-prior estimation.

Table 1: Average absolute error between the estimated class-prior and the true value.

| Data Set | $\theta_P$ | None | PCA $\lfloor d/4\rfloor$ | PCA $\lfloor d/2\rfloor$ | PCA $\lfloor 3d/4\rfloor$ | FDA | PNRL | PURL |
|---|---|---|---|---|---|---|---|---|
| ijcnn1 | 0.3 | 0.23 (0.02) | 0.26 (0.11) | 0.26 (0.11) | 0.28 (0.04) | 0.03 (0.01) | 0.26 (0.08) | 0.21 (0.07) |
| | 0.5 | 0.18 (0.05) | 0.14 (0.09) | 0.14 (0.09) | 0.17 (0.06) | 0.04 (0.01) | 0.21 (0.08) | 0.19 (0.07) |
| | 0.7 | 0.08 (0.01) | 0.11 (0.05) | 0.11 (0.05) | 0.10 (0.04) | 0.07 (0.01) | 0.11 (0.05) | 0.10 (0.01) |
| phishing | 0.3 | 0.02 (0.00) | 0.02 (0.00) | 0.02 (0.00) | 0.02 (0.00) | 0.04 (0.02) | 0.03 (0.01) | 0.02 (0.00) |
| | 0.5 | 0.01 (0.00) | 0.01 (0.00) | 0.01 (0.00) | 0.01 (0.00) | 0.07 (0.03) | 0.04 (0.02) | 0.03 (0.02) |
| | 0.7 | 0.02 (0.00) | 0.02 (0.00) | 0.02 (0.00) | 0.02 (0.00) | 0.11 (0.04) | 0.05 (0.03) | 0.02 (0.00) |
| mushrooms | 0.3 | 0.05 (0.01) | 0.05 (0.01) | 0.05 (0.01) | 0.05 (0.01) | 0.09 (0.03) | 0.03 (0.00) | 0.03 (0.00) |
| | 0.5 | 0.05 (0.01) | 0.05 (0.01) | 0.05 (0.01) | 0.05 (0.01) | 0.16 (0.03) | 0.04 (0.01) | 0.04 (0.00) |
| | 0.7 | 0.03 (0.01) | 0.03 (0.00) | 0.03 (0.00) | 0.03 (0.01) | 0.20 (0.06) | 0.03 (0.00) | 0.04 (0.03) |
| a9a | 0.3 | 0.11 (0.02) | 0.11 (0.02) | 0.11 (0.02) | 0.11 (0.02) | 0.05 (0.01) | 0.08 (0.03) | 0.04 (0.00) |
| | 0.5 | 0.10 (0.02) | 0.10 (0.02) | 0.10 (0.02) | 0.10 (0.02) | 0.09 (0.04) | 0.09 (0.03) | 0.04 (0.01) |
| | 0.7 | 0.08 (0.03) | 0.08 (0.03) | 0.08 (0.03) | 0.08 (0.03) | 0.18 (0.06) | 0.08 (0.03) | 0.04 (0.01) |
| MNIST | 0.3 | 0.09 (0.02) | 0.09 (0.02) | 0.09 (0.02) | 0.09 (0.02) | 0.27 (0.01) | 0.01 (0.00) | 0.05 (0.02) |
| | 0.5 | 0.15 (0.11) | 0.15 (0.11) | 0.15 (0.11) | 0.15 (0.11) | 0.46 (0.01) | 0.03 (0.00) | 0.06 (0.03) |
| | 0.7 | 0.60 (0.21) | 0.60 (0.21) | 0.60 (0.21) | 0.60 (0.21) | 0.65 (0.02) | 0.06 (0.01) | 0.07 (0.01) |
| F-MNIST | 0.3 | 0.02 (0.00) | 0.02 (0.00) | 0.02 (0.00) | 0.02 (0.00) | 0.25 (0.01) | 0.03 (0.00) | 0.03 (0.00) |
| | 0.5 | 0.03 (0.00) | 0.03 (0.00) | 0.03 (0.00) | 0.03 (0.00) | 0.45 (0.01) | 0.02 (0.00) | 0.04 (0.03) |
| | 0.7 | 0.03 (0.00) | 0.03 (0.00) | 0.03 (0.00) | 0.03 (0.00) | 0.66 (0.02) | 0.02 (0.00) | 0.07 (0.03) |
| 20 News | 0.3 | 0.04 (0.00) | 0.04 (0.01) | 0.04 (0.00) | 0.04 (0.00) | 0.29 (0.00) | 0.29 (0.09) | 0.03 (0.01) |
| | 0.5 | 0.08 (0.03) | 0.06 (0.01) | 0.07 (0.01) | 0.08 (0.03) | 0.49 (0.00) | 0.25 (0.07) | 0.05 (0.01) |
| | 0.7 | 0.69 (0.00) | 0.69 (0.00) | 0.69 (0.00) | 0.69 (0.00) | 0.69 (0.00) | 0.13 (0.03) | 0.07 (0.01) |


Notes: “None” means that the class-prior is estimated without dimension-reduction methods, PCA is the principal component analysis, FDA is Fisher's discriminant analysis, and PNRL is the supervised counterpart of the proposed method. The class-prior is estimated by the method based on kernel mean embedding. The boldface denotes the best and comparable approaches in terms of the average absolute error according to the *t*-test at the significance level 5%.

We then summarize the average misclassification rates in Table 2. Since the accuracy of class-prior estimation is improved on the mushrooms and a9a data sets, the classification accuracy is also improved. In particular, the classification results on the 20 Newsgroups data set with $\theta_P = 0.7$ are improved substantially. Overall, our proposed method tends to give lower or comparable misclassification rates compared with the other methods.

Table 2: Average misclassification rates.

| Data Set | $\theta_P$ | None | PCA $\lfloor d/4\rfloor$ | PCA $\lfloor d/2\rfloor$ | PCA $\lfloor 3d/4\rfloor$ | FDA | PNRL | PURL |
|---|---|---|---|---|---|---|---|---|
| ijcnn1 | 0.3 | 25.32 (1.23) | 27.79 (2.05) | 27.79 (2.05) | 29.21 (1.30) | 7.00 (0.59) | 28.57 (2.52) | 25.92 (2.64) |
| | 0.5 | 21.43 (1.22) | 17.88 (1.72) | 17.88 (1.72) | 20.75 (1.15) | 8.25 (1.33) | 26.52 (2.26) | 21.52 (1.69) |
| | 0.7 | 12.07 (0.53) | 14.94 (1.02) | 14.94 (1.02) | 14.61 (1.15) | 11.23 (1.34) | 17.34 (1.58) | 13.70 (1.04) |
| phishing | 0.3 | 7.41 (0.46) | 7.46 (0.48) | 7.46 (0.48) | 7.57 (0.46) | 10.30 (2.42) | 11.09 (0.98) | 7.62 (0.45) |
| | 0.5 | 12.85 (2.11) | 9.75 (0.46) | 9.75 (0.46) | 9.82 (0.40) | 24.43 (3.09) | 32.02 (3.05) | 10.05 (0.46) |
| | 0.7 | 8.07 (0.44) | 8.85 (1.08) | 8.85 (1.08) | 7.63 (0.37) | 25.62 (1.40) | 29.04 (0.73) | 8.02 (0.37) |
| mushrooms | 0.3 | 0.73 (0.20) | 1.15 (0.58) | 0.57 (0.14) | 0.49 (0.14) | 1.52 (0.36) | 0.24 (0.06) | 0.43 (0.09) |
| | 0.5 | 0.57 (0.11) | 0.57 (0.11) | 0.78 (0.16) | 0.57 (0.11) | 3.40 (0.47) | 1.10 (0.24) | 3.40 (2.39) |
| | 0.7 | 1.42 (0.28) | 1.42 (0.28) | 1.50 (0.27) | 1.42 (0.28) | 6.38 (0.66) | 1.40 (0.27) | 1.61 (0.48) |
| a9a | 0.3 | 24.93 (1.19) | 26.49 (1.89) | 26.49 (1.89) | 26.20 (1.73) | 21.09 (0.59) | 26.31 (2.36) | 22.32 (0.65) |
| | 0.5 | 30.35 (1.55) | 26.07 (1.01) | 26.07 (1.01) | 29.52 (1.81) | 22.70 (0.77) | 27.48 (1.47) | 23.70 (0.67) |
| | 0.7 | 20.35 (0.80) | 20.54 (0.61) | 20.54 (0.61) | 19.94 (0.78) | 19.70 (0.97) | 20.59 (0.60) | 19.39 (0.66) |
| MNIST | 0.3 | 24.58 (2.82) | 17.99 (1.44) | 17.99 (1.44) | 22.18 (2.75) | 20.92 (0.74) | 12.74 (0.63) | 11.76 (0.78) |
| | 0.5 | 23.00 (1.60) | 22.35 (1.10) | 22.35 (1.10) | 23.55 (1.80) | 42.10 (1.85) | 15.35 (0.75) | 18.18 (2.43) |
| | 0.7 | 53.34 (3.78) | 52.19 (4.41) | 54.42 (3.99) | 53.39 (3.74) | 60.86 (1.25) | 16.38 (0.84) | 18.64 (2.83) |
| F-MNIST | 0.3 | 14.88 (1.30) | 18.02 (2.86) | 18.02 (2.86) | 15.12 (1.18) | 19.24 (0.91) | 14.54 (1.14) | 13.54 (0.75) |
| | 0.5 | 13.40 (0.69) | 12.05 (0.96) | 12.05 (0.96) | 13.22 (0.62) | 37.73 (1.56) | 12.15 (0.48) | 14.10 (1.16) |
| | 0.7 | 9.94 (1.30) | 8.89 (0.84) | 8.89 (0.84) | 8.54 (0.84) | 55.65 (2.14) | 8.65 (0.83) | 9.29 (0.47) |
| 20 News | 0.3 | 38.89 (3.00) | 40.30 (3.64) | 42.48 (3.54) | 38.70 (3.81) | 18.66 (0.47) | 66.62 (1.59) | 36.31 (4.13) |
| | 0.5 | 44.48 (1.82) | 43.85 (2.03) | 46.67 (1.15) | 47.77 (0.87) | 34.73 (0.80) | 50.00 (0.00) | 45.88 (1.64) |
| | 0.7 | 50.69 (0.95) | 53.61 (0.73) | 51.77 (1.05) | 50.36 (0.83) | 50.69 (0.95) | 30.61 (0.59) | 29.85 (0.13) |


Note: The boldface denotes the best and comparable approaches in terms of the average misclassification rate according to the $t$-test at the significance level 5%.

## 6 Conclusion

In this letter, we have proposed an information-theoretic representation learning method from positive and unlabeled (PU) data. Our method is based on the information-maximization principle and finds a low-dimensional representation that maximally preserves a squared-loss variant of mutual information (SMI) between inputs and labels. Unlike existing PU learning methods, our representation learning method can be executed without an estimate of the class-prior, so it can also be used as preprocessing for class-prior estimation methods. Through numerical experiments, we demonstrated the effectiveness of our method.

## Appendix A: Proof of Theorem 1


## Appendix B: Proof of Theorem 2

## Appendix C: Proof of Theorem 3

First we have the following lemma:

$J_{\mathrm{PU}}(\cdot, u) - J_{\mathrm{PU}}(\cdot)$ is Lipschitz continuous with modulus $\omega(u) = O(\|U_{\mathrm{U}}\|_{\mathrm{Fro}} + \|u_{\mathrm{P}}\|_2)$, where $\|\cdot\|_{\mathrm{Fro}}$ is the Frobenius norm.

Finally, we prove theorem 3. Since the conditions required for the perturbation analysis hold (see lemma 4) and $J_{\mathrm{PU}}(\cdot, u) - J_{\mathrm{PU}}(\cdot)$ is Lipschitz continuous with modulus $\omega(u)$ (see lemma 5), we can use proposition 6.1 in Bonnans and Shapiro (1998) to obtain the first half of theorem 3. For the squared errors, we have

## Appendix D: Effect of Dimension Reduction

In this appendix, we illustrate the effect of dimension reduction and how the number of samples affects class-prior estimation.

We use the artificial data set used in section 5.2 and vary both $n_{\mathrm{P}}$ and $n_{\mathrm{U}}$ from 500 to 5000. We set the true class-prior $\theta_{\mathrm{P}}$ to 0.5. The class-prior is estimated by the kernel mean embedding (KM) method (Ramaswamy et al., 2016). To evaluate the performance with and without dimension reduction, we use the one-dimensional samples obtained by $b^\top x$, where $b = (1, 0)^\top$, and the original two-dimensional samples.

## Appendix E: Supervised Counterpart of the Proposed Method

In this appendix, we review the SMI estimation method (Suzuki et al., 2009; Sugiyama, 2013; Sakai & Sugiyama, 2014).

Similar to the proposed PURL method, we maximize $\widehat{\text{PN-SMI}}$ to learn low-dimensional representations.

## Notes

^{1} We also tried to optimize $w$ and $v$ simultaneously; that is, $J_{\mathrm{PU}}$ is minimized with respect to $g$ without decomposing $g$ into $w$ and $v$, but this did not work well in our preliminary experiments.

^{2} We compute the squared error between the estimated PU-SMI and the supervised SMI estimator (Suzuki et al., 2009) with a sufficiently large number of positive and negative samples.

^{3} The details of PNRL are described in appendix E.

^{4} See http://qwone.com/~jason/20Newsgroups/ for the details of the topics.

## Acknowledgments

T.S. was supported by KAKENHI 15J09111. G.N. was supported by the JST CREST JPMJCR1403. M.S. was supported by KAKENHI 17H01760. We thank Ikko Yamane, Ryuichi Kiryo, and Takeshi Teshima for their comments.


## Author notes

T.S. is currently affiliated with NEC Corporation.