## Abstract

Recent advances in weakly supervised classification allow us to train a classifier from only positive and unlabeled (PU) data. However, existing PU classification methods typically require an accurate estimate of the class-prior probability, a critical bottleneck particularly for high-dimensional data. This problem has been commonly addressed by applying principal component analysis in advance, but such unsupervised dimension reduction can collapse the underlying class structure. In this letter, we propose a novel representation learning method from PU data based on the information-maximization principle. Our method does not require class-prior estimation and thus can be used as a preprocessing method for PU classification. Through experiments, we demonstrate that our method, combined with deep neural networks, substantially improves the accuracy of PU class-prior estimation, leading to state-of-the-art PU classification performance.

## 1  Introduction

In real-world applications, it is conceivable that only positive and unlabeled (PU) data are available for training a classifier. For instance, in land-cover image classification, images of urban regions can be easily labeled, while images of nonurban regions are difficult to annotate due to the high diversity of nonurban regions containing, for example, forest, seas, grasses, and soil (Li, Guo, & Elkan, 2011). To cope with such situations, PU classification has been actively studied (Letouzey, Denis, & Gilleron, 2000; Elkan & Noto, 2008; du Plessis, Niu, & Sugiyama, 2015), and the state-of-the-art method allows us to systematically train deep neural networks only from PU data (Kiryo, Niu, du Plessis, & Sugiyama, 2017).

However, existing PU classification methods typically require an estimate of the class-prior probability, and their performance is sensitive to the quality of class-prior estimation (Kiryo et al., 2017). Although various class-prior estimation methods from PU data have been proposed so far (du Plessis & Sugiyama, 2014; Ramaswamy, Scott, & Tewari, 2016; Jain, White, & Radivojac, 2016; du Plessis, Niu, & Sugiyama, 2017; Northcutt, Wu, & Chuang, 2017), accurate estimation of the class-prior is still highly challenging, particularly for high-dimensional data.

In practice, principal component analysis is commonly used to reduce the data dimensionality in advance (Ramaswamy et al., 2016; du Plessis et al., 2017). However, such unsupervised dimension reduction completely abandons label information, and thus the underlying class structure may be smashed. As a result, class-prior estimation often becomes even more difficult after dimension reduction.

The goal of this letter is to cope with this problem by proposing a representation learning method that can be executed only from PU data. Our method is developed within the framework of information maximization (Linsker, 1988).

Mutual information (MI) (Cover & Thomas, 2006) is a statistical dependency measure between random variables that is popularly used in information-theoretic machine learning (Torkkola, 2003; Krause, Perona, & Gomes, 2010). However, empirically approximating MI from continuous-valued training data is not straightforward (Moon, Rajagopalan, & Lall, 1995; Kraskov, Stögbauer, & Grassberger, 2004; Khan, Bandyopadhyay, Ganguly, & Saigal, 2007; Van Hulle, 2005; Suzuki, Sugiyama, Sese, & Kanamori, 2008) and is often sensitive to outliers (Basu, Harris, Hjort, & Jones, 1998; Sugiyama, Suzuki, & Kanamori, 2012b). For this reason, we employ a squared-loss variant of mutual information (SMI) (Suzuki, Sugiyama, Kanamori, & Sese, 2009; Sugiyama, 2013), whose empirical estimator is known to be robust to outliers and possess superior numerical properties (Kanamori, Suzuki, & Sugiyama, 2012).

Our contributions are summarized as follows:

• In section 3, we develop a novel estimator of SMI that can be computed only from PU data and prove that it converges to the optimal estimate of SMI at the optimal parametric rate when a linear-in-parameter model is used.

• Based on this PU-SMI estimator, in section 4, we propose a representation learning method that can be executed without estimating the class-prior probabilities of unlabeled data.

• Finally, in section 5, we experimentally demonstrate that our PU representation learning method combined with deep neural networks substantially improves the accuracy of PU class-prior estimation; consequently, the accuracy of PU classification can also be boosted significantly.

## 2  SMI

In this section, we review the definition of ordinary MI and its variant, SMI.

Let $x \in \mathbb{R}^d$ be an input pattern, $y \in \{\pm 1\}$ be a corresponding class label, and $p(x,y)$ be the underlying joint density, where $d$ is a positive integer.

Mutual information (MI) (Cover & Thomas, 2006) is a statistical dependency measure defined as
$\mathrm{MI} := \sum_{y=\pm 1} \int p(x,y) \log \frac{p(x,y)}{p(x)p(y)} \, \mathrm{d}x,$
where $p(x)$ is the marginal density of $x$ and $p(y)$ is the probability mass of $y$. MI can be regarded as the Kullback-Leibler divergence from $p(x,y)$ to $p(x)p(y)$, and therefore MI is nonnegative and takes zero if and only if $p(x,y)=p(x)p(y)$, that is, $x$ and $y$ are statistically independent. This property allows us to evaluate the dependency between $x$ and $y$. However, empirically approximating MI from continuous data is not straightforward (Moon et al., 1995; Kraskov et al., 2004; Khan et al., 2007; Van Hulle, 2005; Suzuki et al., 2008) and is often sensitive to outliers (Basu et al., 1998; Sugiyama et al., 2012b).
To cope with this problem, squared-loss MI (SMI) has been proposed (Suzuki et al., 2009). It is a squared-loss variant of MI defined as
$\mathrm{SMI} := \sum_{y=\pm 1} \frac{p(y)}{2} \int \left( \frac{p(x,y)}{p(x)p(y)} - 1 \right)^2 p(x) \, \mathrm{d}x.$
(2.1)
SMI can be regarded as the Pearson divergence (Pearson, 1900) from $p(x,y)$ to $p(x)p(y)$. SMI is also nonnegative and takes zero if and only if $x$ and $y$ are independent.

So far, methods for estimating SMI from positive and negative samples and SMI-based machine learning algorithms have been explored extensively, and their effectiveness has been demonstrated (Sugiyama, 2013).

## 3  SMI Estimation from PU Data

The goal of this letter is to develop a representation learning method from PU data. To this end, in this section we propose an estimator of SMI that can be computed only from PU data.

### 3.1  SMI with PU Data

Suppose that we are given PU data (Ward, Hastie, Barry, Elith, & Leathwick, 2009):
$\{x_i^P\}_{i=1}^{n_P} \overset{\text{i.i.d.}}{\sim} p(x \mid y=+1), \qquad \{x_k^U\}_{k=1}^{n_U} \overset{\text{i.i.d.}}{\sim} p(x) = \theta_P\, p(x \mid y=+1) + \theta_N\, p(x \mid y=-1),$
where $\theta_P := p(y=+1)$ and $\theta_N := p(y=-1)$ are the class-prior probabilities.

First, we express SMI in equation 2.1 in terms of only the densities of PU data, without negative data (see appendix A for its proof):

Theorem 1.
Let
$\text{PU-SMI} := \frac{\theta_P}{2\theta_N} \int \left( \frac{p(x \mid y=+1)}{p(x)} - 1 \right)^2 p(x) \, \mathrm{d}x.$
(3.1)
Then we have $\text{PU-SMI} = \mathrm{SMI}$.

If the PU densities $p(x \mid y=+1)$ and $p(x)$ are estimated from PU data, $\text{PU-SMI}$ above allows us to approximate SMI only from PU data. However, such a naive approach works poorly due to the hardness of density estimation, and taking the ratio of estimated densities further magnifies the estimation error (Sugiyama, Suzuki, & Kanamori, 2012a).

### 3.2  PU-SMI Estimation

Here, we propose a more sophisticated approach to estimating $PU-SMI$ from PU data.

First, the following theorem gives a lower bound of $\text{PU-SMI}$ (see appendix B for its proof):

Theorem 2.
For any function $f(x)$,
$\text{PU-SMI} \ge \frac{\theta_P}{\theta_N} \left( -J_{\mathrm{PU}}(f) - \frac{1}{2} \right),$
(3.2)
where
$J_{\mathrm{PU}}(f) := \frac{1}{2} \int f^2(x)\, p(x)\, \mathrm{d}x - \int f(x)\, p(x \mid y=+1)\, \mathrm{d}x,$
and the equality holds if and only if
$f(x) = \frac{p(x \mid y=+1)}{p(x)}.$

While $\text{PU-SMI}$ itself contains $p(x \mid y=+1)$ and $p(x)$ in a complicated way, the lower bound consists only of expectations over $p(x \mid y=+1)$ and $p(x)$. Thus, the lower bound can be immediately approximated empirically.
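To see why the bound is tight exactly at the density ratio, write $r(x) := p(x \mid y=+1)/p(x)$ and substitute $f = r$ into $J_{\mathrm{PU}}$; since $p(x \mid y=+1) = r(x)\,p(x)$, a short calculation (a restatement of the equality case, following the definitions above) gives

```latex
J_{\mathrm{PU}}(r)
  = \frac{1}{2}\int r^{2}(x)\,p(x)\,\mathrm{d}x
    - \int r(x)\,\underbrace{p(x \mid y=+1)}_{=\,r(x)\,p(x)}\,\mathrm{d}x
  = -\frac{1}{2}\int r^{2}(x)\,p(x)\,\mathrm{d}x,
\qquad\text{so}\qquad
\frac{\theta_P}{\theta_N}\left(-J_{\mathrm{PU}}(r) - \frac{1}{2}\right)
  = \frac{\theta_P}{2\theta_N}\int \left(r^{2}(x) - 1\right)p(x)\,\mathrm{d}x
  = \frac{\theta_P}{2\theta_N}\int \left(r(x) - 1\right)^{2} p(x)\,\mathrm{d}x
  = \text{PU-SMI},
```

where the last equality expands $(r-1)^2 = r^2 - 2r + 1$ and uses $\int r(x)\,p(x)\,\mathrm{d}x = \int p(x \mid y=+1)\,\mathrm{d}x = 1$ to cancel the cross term against the constant.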

Based on this theorem, we maximize an empirical approximation to the lower bound, equation 3.2, which is expressed as
$\hat{w} := \operatorname*{argmin}_{w \in \mathcal{W}} \hat{J}_{\mathrm{PU}}(w),$
where
$\hat{J}_{\mathrm{PU}}(w) := \frac{1}{2 n_U} \sum_{k=1}^{n_U} w^2(x_k^U) - \frac{1}{n_P} \sum_{i=1}^{n_P} w(x_i^P)$
(3.3)
and $\mathcal{W}$ is a (user-defined) function class such as linear-in-parameter models, kernel models, or neural networks. In this optimization, we can drop the unknown class-prior ratio $\theta_P/\theta_N$, which is difficult to estimate accurately (du Plessis & Sugiyama, 2014; Ramaswamy et al., 2016; Jain et al., 2016; du Plessis et al., 2017).
Finally, our PU-SMI estimator is given as
$\widehat{\text{PU-SMI}} = \frac{\theta_P}{\theta_N} \left( -\hat{J}_{\mathrm{PU}}(\hat{w}) - \frac{1}{2} \right).$
(3.4)
$\widehat{\text{PU-SMI}}$ includes the class-prior ratio $\theta_P/\theta_N$ only as a proportional constant. Therefore, class-prior estimation is not needed when we just want to maximize or minimize PU-SMI. We exploit this useful property in section 4 when we develop a representation learning method.
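Since the empirical objective in equation 3.3 and the estimator in equation 3.4 are just sample averages, they take only a few lines to compute. The following sketch evaluates them for an arbitrary callable density-ratio model; the data-generating mixture and the hand-picked model `w` are our own illustrative assumptions, not part of the letter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1D PU data (illustrative): positives ~ N(1, 1); the unlabeled pool
# mixes the two classes with class-prior theta_P.
theta_P, theta_N = 0.5, 0.5
x_p = rng.normal(1.0, 1.0, size=200)                  # positive samples
from_p = rng.random(400) < theta_P
x_u = np.where(from_p, rng.normal(1.0, 1.0, 400),     # unlabeled samples
               rng.normal(-1.0, 1.0, 400))

def j_pu(w, x_p, x_u):
    """Empirical objective of equation 3.3: (1/2) E_U[w^2] - E_P[w]."""
    return 0.5 * np.mean(w(x_u) ** 2) - np.mean(w(x_p))

def pu_smi_hat(w, x_p, x_u, theta_P, theta_N):
    """PU-SMI estimate of equation 3.4 for a fitted density-ratio model w."""
    return (theta_P / theta_N) * (-j_pu(w, x_p, x_u) - 0.5)

# Any model w can be plugged in; this crude hand-picked w is only a stand-in
# for a fitted density-ratio model.
w = lambda x: np.clip(1.0 + 0.5 * x, 0.0, None)
print(pu_smi_hat(w, x_p, x_u, theta_P, theta_N))
```

Note that a constant model $w \equiv 1$ (the ratio when $x$ and $y$ are independent) gives $\hat{J}_{\mathrm{PU}} = -1/2$ and hence a PU-SMI estimate of exactly zero, matching the property that SMI vanishes under independence.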

Note that if $\mathcal{W}$ contains the true density-ratio function $p(x \mid y=+1)/p(x)$, then $\widehat{\text{PU-SMI}} \to \text{PU-SMI}$ as $n_P, n_U \to \infty$ under mild regularity conditions. On the other hand, if the function class does not contain the true density-ratio function, a gap remains between $\text{PU-SMI}$ and $\widehat{\text{PU-SMI}}$ even as $n_P, n_U \to \infty$. Such a gap often arises in real-world applications because a function class does not always include the true density-ratio function. However, the gap is not necessarily a critical issue in practice as long as a reasonably flexible function class is chosen: as demonstrated by the experiments in section 5, classification performance can still be improved by our proposed representation learning method despite the gap.

### 3.3  Analytic Solution for Linear-in-Parameter Models

Our SMI estimator is applicable to any density-ratio model $w$.

If a neural network is used as $w$, the solution may be obtained by a stochastic gradient method (Goodfellow, Bengio, & Courville, 2016; Abadi et al., 2015; Jia et al., 2014).

Another candidate of the density-ratio model is a linear-in-parameter model,
$w(x) = \sum_{\ell=1}^{b} \beta_\ell \phi_\ell(x) = \beta^\top \phi(x),$
(3.5)
where $\beta := (\beta_1, \ldots, \beta_b)^\top \in \mathbb{R}^b$ is a vector of parameters, $\top$ denotes the transpose, $b$ is the number of parameters, and $\phi(x) := (\phi_1(x), \ldots, \phi_b(x))^\top \in \mathbb{R}^b$ is a vector of basis functions. This model allows us to obtain an analytic-form PU-SMI estimator. Furthermore, the optimal convergence rate is theoretically guaranteed, as shown in section 3.4.
When the $\ell_2$-regularizer is included, the optimization problem becomes
$\hat{\beta} := \operatorname*{argmin}_{\beta} \frac{1}{2} \beta^\top \hat{H}^U \beta - \beta^\top \hat{h}^P + \frac{\lambda_{\mathrm{PU}}}{2} \|\beta\|_2^2,$
where $\lambda_{\mathrm{PU}} \ge 0$ is the regularization parameter, $\|\cdot\|_2$ denotes the $\ell_2$-norm, and
$\hat{H}^U_{\ell,\ell'} := \frac{1}{n_U} \sum_{k=1}^{n_U} \phi_\ell(x_k^U)\, \phi_{\ell'}(x_k^U), \qquad \hat{h}^P_\ell := \frac{1}{n_P} \sum_{i=1}^{n_P} \phi_\ell(x_i^P).$
Note that $\hat{H}^U_{\ell,\ell'}$ is the $(\ell,\ell')$th element of $\hat{H}^U$ and $\hat{h}^P_\ell$ is the $\ell$th element of $\hat{h}^P$. The solution can be obtained analytically by differentiating the objective function with respect to $\beta$ and setting the derivative to zero: $\hat{\beta} = (\hat{H}^U + \lambda_{\mathrm{PU}} I_b)^{-1} \hat{h}^P$, where $I_b$ is the $b \times b$ identity matrix. Finally, with the obtained estimator, we can compute an SMI approximator only from positive and unlabeled data:
$\widehat{\text{PU-SMI}} = \frac{\theta_P}{\theta_N} \left( \hat{\beta}^\top \hat{h}^P - \frac{1}{2} \hat{\beta}^\top \hat{H}^U \hat{\beta} - \frac{1}{2} \right).$

Note that all hyperparameters, such as the regularization parameter, can be tuned by the value of $\hat{J}_{\mathrm{PU}}$ approximated on (cross-)validation samples.
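Under the linear-in-parameter model, the entire estimator reduces to one ridge-regression-style linear solve. The sketch below is a toy 1D setup of our own: the sample sizes, gaussian bandwidth, and regularization value are illustrative assumptions, and $\theta_P/\theta_N$ is taken as known only to report an absolute value (for maximization or minimization it can be dropped, as noted above).

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1D PU data (illustrative): positives ~ N(1, 1); the unlabeled pool is
# an equal mixture of N(1, 1) and N(-1, 1).
x_p = rng.normal(1.0, 1.0, size=200)
x_u = np.concatenate([rng.normal(1.0, 1.0, 200), rng.normal(-1.0, 1.0, 200)])

def gaussian_design(x, centers, sigma):
    """Gaussian basis phi_l(x) = exp(-(x - x_l)^2 / (2 sigma^2))."""
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * sigma**2))

b, sigma, lam = 50, 1.0, 1e-3     # basis size, bandwidth, l2-regularization
centers = rng.choice(x_u, size=b, replace=False)  # centers from unlabeled data

Phi_u = gaussian_design(x_u, centers, sigma)      # n_U x b design matrix
Phi_p = gaussian_design(x_p, centers, sigma)      # n_P x b design matrix

H_u = Phi_u.T @ Phi_u / len(x_u)                  # empirical H^U
h_p = Phi_p.mean(axis=0)                          # empirical h^P

# Analytic solution: beta = (H^U + lam * I_b)^{-1} h^P
beta = np.linalg.solve(H_u + lam * np.eye(b), h_p)

theta_P, theta_N = 0.5, 0.5  # assumed known here only to report the value
pu_smi = (theta_P / theta_N) * (beta @ h_p - 0.5 * beta @ H_u @ beta - 0.5)
print(pu_smi)
```

In practice, `sigma` and `lam` would be chosen by cross-validation on the value of $\hat{J}_{\mathrm{PU}}$ computed on held-out samples, as described in the text.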

### 3.4  Convergence Analysis

Here we analyze the convergence rate of learned parameters of the density-ratio model and the PU-SMI approximator based on the perturbation analysis of optimization problems (Bonnans & Cominetti, 1996; Bonnans & Shapiro, 1998).

In our theoretical analysis, we focus on the linear-in-parameter model in equation 3.5. We first define $\beta^{*\top}\phi(x)$ through the minimizer of the expected error, $\beta^* := \operatorname*{argmin}_{\beta \in \mathbb{R}^b} J_{\mathrm{PU}}(\beta)$, and denote its estimator by $\hat{\beta} := \operatorname*{argmin}_{\beta \in \mathbb{R}^b} \hat{J}_{\mathrm{PU}}(\beta)$ in this analysis. Note that the linear-in-parameter model is assumed as a simple baseline for theoretical analysis.

For the linear-in-parameter model, we assume that the basis functions satisfy $0 \le \phi_\ell(x) \le 1$ for all $\ell = 1, \ldots, b$, and that $\hat{H}^U$ and $H^U$ are positive-definite matrices.

Let
$\text{PU-SMI}^* := \frac{\theta_P}{\theta_N} \left( -J_{\mathrm{PU}}(\beta^*) - \frac{1}{2} \right)$
be the PU-SMI with $\beta^*$. Similarly, $\widehat{\text{PU-SMI}}$ is the estimate of the PU-SMI with $\hat{\beta}$. Let $O_p$ denote the order in probability. Then we have the following convergence results (the proof is given in appendix C):
Theorem 3.
As $nP,nU→∞$, we have
$\|\hat{\beta} - \beta^*\|_2 = O_p\!\left( 1/\sqrt{n_P} + 1/\sqrt{n_U} \right), \qquad \bigl| \text{PU-SMI}^* - \widehat{\text{PU-SMI}} \bigr| = O_p\!\left( 1/\sqrt{n_P} + 1/\sqrt{n_U} \right).$

Theorem 3 guarantees the convergence of the density-ratio estimator and the PU-SMI approximator. In our setting, since $n_P$ and $n_U$ can increase independently, this is the optimal convergence rate without any additional assumption (Kanamori, Hido, & Sugiyama, 2009; Kanamori et al., 2012).

Theorem 3 shows that both positive and unlabeled data contribute to convergence. This implies that unlabeled data are used directly in the estimation, rather than merely for extracting structural information such as the cluster structure frequently assumed in semisupervised learning (Chapelle, Schölkopf, & Zien, 2006). The theorem also shows that the convergence rate of our method is dominated by the smaller of the positive and unlabeled sample sizes.

Note that since this analysis focuses on the linear-in-parameter model, there might be a gap between $\text{PU-SMI}$ and $\text{PU-SMI}^*$, that is, $\text{PU-SMI}^* \le \text{PU-SMI}$. The convergence analysis guarantees that $\widehat{\text{PU-SMI}}$ with the linear-in-parameter model converges to $\text{PU-SMI}^*$, but there might be an approximation error (Mohri, Rostamizadeh, & Talwalkar, 2012), as discussed in section 3.2.

## 4  PU Representation Learning

In this section, we propose a representation learning method based on PU-SMI maximization. We extend the existing SMI-based dimension reduction (Suzuki & Sugiyama, 2013), called least-squares dimension reduction (LSDR), to PU representation learning. While LSDR considers only linear dimension reduction, we extend it to nonlinear dimension reduction by neural networks.

Let $v: \mathbb{R}^d \to \mathbb{R}^m$, where $m < d$, be a mapping from an input vector to its low-dimensional representation. If the mapping function satisfies
$p(y \mid x) = p(y \mid v(x)),$
(4.1)
the obtained low-dimensional representation can be used as the new input instead of the original input vector. Finding a mapping function satisfying condition 4.1 is known as sufficient dimension reduction (Li, 1991). Let $\widetilde{\mathrm{SMI}}$ be the SMI between $v(x)$ and $y$. Suzuki and Sugiyama (2013) proved that $\mathrm{SMI} \ge \widetilde{\mathrm{SMI}}$, where equality holds when condition 4.1 is satisfied. That is, maximizing SMI amounts to finding a sufficient representation for the output $y$.

Following the information-maximization principle (Linsker, 1988), we maximize PU-SMI with respect to the mapping to find low-dimensional representation that maximally preserves dependency between input and output.

More specifically, since $\widehat{\text{PU-SMI}} = -(\theta_P/\theta_N) \min_{w \in \mathcal{W}} \hat{J}_{\mathrm{PU}}(w) - \theta_P/(2\theta_N)$, we minimize $\hat{J}_{\mathrm{PU}}(w)$ with respect to $w$. Furthermore, inspired by the alternating optimization algorithm for the SMI-based dimension-reduction method (Suzuki & Sugiyama, 2013), we decompose $w$ into $g$ and $v$ such that $w = g \circ v$ and minimize $\hat{J}_{\mathrm{PU}}(g \circ v)$ by optimizing $g$ and $v$ alternately, where $g: \mathbb{R}^m \to \mathbb{R}$ and “$\circ$” denotes function composition, $(g \circ v)(x) = g(v(x))$. In this decomposition, $v$ can be regarded as a mapping function extracting features from input patterns and $g$ as a model of the density ratio $p(x \mid y=+1)/p(x)$. First, we approximate SMI by minimizing equation 3.3 with respect to the density-ratio model $g$ with the current mapping $\hat{v}$ fixed:
$\hat{g} = \operatorname*{argmin}_{g} \hat{J}_{\mathrm{PU}}(g \circ \hat{v}).$
Then we update the mapping $\hat{v}$ to increase the estimated PU-SMI with the current density-ratio model $\hat{g}$ fixed:
$\hat{v} \leftarrow \hat{v} - \varepsilon \nabla_v \hat{J}_{\mathrm{PU}}(\hat{g} \circ \hat{v}),$
where $\varepsilon$ is the step size. This process is repeated until convergence. In practice, we may alternately optimize $g$ and $v$ as described in algorithm 1 to simplify the implementation.1 We refer to our representation learning method for PU data as positive-unlabeled representation learning (PURL).

Note again that in the above optimization process, the unknown class-prior ratio $\theta_P/\theta_N$ does not need to be estimated in advance, which is a significant advantage of the proposed method.
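The alternation above can be sketched with deliberately simple choices of our own: a linear mapping $v$, a gaussian-basis model for $g$ solved in closed form as in section 3.3, a fixed step size, and the dependence of the basis centers on $v$ ignored during the $v$-step. This is an illustration of the alternating scheme under those assumptions, not the letter's algorithm 1.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy 2D PU data in the spirit of section 5.2.1: the classes differ only
# along the first axis, so a good 1D mapping v should align with that axis.
n_p, n_u, theta_P = 400, 1000, 0.7
X_p = np.column_stack([rng.normal(0, 0.5, n_p), rng.normal(0, 2.0, n_p)])
from_p = rng.random(n_u) < theta_P
shift = np.where(rng.random(n_u) < 0.5, 3.0, -3.0)
x1 = np.where(from_p, rng.normal(0, 0.5, n_u),
              shift + rng.normal(0, 0.5, n_u))
X_u = np.column_stack([x1, rng.normal(0, 2.0, n_u)])

def gbasis(z, c, s):
    """Gaussian basis values phi_l(z) for 1D projected points z."""
    return np.exp(-((z[:, None] - c[None, :]) ** 2) / (2 * s**2))

b, s, lam, lr = 20, 1.0, 1e-1, 0.1             # illustrative hyperparameters
idx = rng.choice(n_u, size=b, replace=False)   # fixed indices for basis centers
v = np.array([0.3, 1.0]) / np.hypot(0.3, 1.0)  # initial projection, off-axis

for _ in range(100):
    z_u, z_p = X_u @ v, X_p @ v
    c = z_u[idx]                               # centers move along with v
    Phi_u, Phi_p = gbasis(z_u, c, s), gbasis(z_p, c, s)
    # g-step: closed-form ridge solution for the density-ratio model (sec. 3.3)
    H = Phi_u.T @ Phi_u / n_u
    h = Phi_p.mean(axis=0)
    beta = np.linalg.solve(H + lam * np.eye(b), h)
    j = 0.5 * np.mean((Phi_u @ beta) ** 2) - np.mean(Phi_p @ beta)
    # v-step: one gradient step on J_PU w.r.t. v with g fixed (the dependence
    # of the centers on v is ignored here for simplicity)
    dwdz_u = (Phi_u * (-(z_u[:, None] - c[None, :]) / s**2)) @ beta
    dwdz_p = (Phi_p * (-(z_p[:, None] - c[None, :]) / s**2)) @ beta
    grad = (X_u * ((Phi_u @ beta) * dwdz_u)[:, None]).mean(axis=0) \
         - (X_p * dwdz_p[:, None]).mean(axis=0)
    v = v - lr * grad
    v = v / np.linalg.norm(v)                  # keep the projection unit-norm

print(v, j)   # learned projection and last value of the empirical objective
```

In runs like this, the first coordinate of `v` tends to grow in magnitude relative to the second as $\hat{J}_{\mathrm{PU}}$ decreases, mirroring the behavior illustrated in section 5.2.1; note that no class-prior estimate appears anywhere in the loop.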

## 5  Experiments

In this section, we experimentally investigate the behavior of the proposed PU-SMI estimator and evaluate the performance of the proposed representation learning method on various benchmark data sets.

### 5.1  Accuracy of PU-SMI Estimation

First, we investigate the estimation accuracy of the proposed PU-SMI estimator on data sets obtained from the LIBSVM web page (Chang & Lin, 2011).

As the model $w$, we use the linear-in-parameter model with gaussian basis functions $\phi_\ell(x) := \exp(-\|x - x_\ell\|^2/(2\sigma^2))$ for $\ell = 1, \ldots, b$, where $\sigma > 0$ is the bandwidth and $\{x_\ell\}_{\ell=1}^{b}$ are the centers of the gaussian functions, randomly sampled from $\{x_k^U\}_{k=1}^{n_U}$. The gaussian bandwidth and the $\ell_2$-regularization parameter are determined by five-fold cross-validation. We vary the number of positive (unlabeled) samples from 10 to 200 while the number of unlabeled (positive) samples is fixed. The class-prior was assumed to be known in this illustrative experiment and set at $\theta_P = 0.5$.

Figure 1 summarizes the average and standard error of the squared estimation error of PU-SMI over 50 trials.2 The mean squared error decreases as either the number of positive samples or the number of unlabeled samples increases. Therefore, both positive and unlabeled data contribute to improving the estimation accuracy of SMI, which agrees well with our theoretical analysis in section 3.4.
Figure 1:

Average and standard error of the squared estimation error of PU-SMI over 50 trials. (a) $n_P$ is increased while $n_U = 400$ is fixed. (b) $n_U$ is increased while $n_P = 200$ is fixed. The results show that both positive and unlabeled samples contribute to improving the estimation accuracy of SMI.

### 5.2  Representation Learning

Next, we evaluate the performance of the proposed representation learning method, PURL.

#### 5.2.1  Illustration

We first illustrate how our proposed method works on an artificial data set. We generate samples from the following densities:
$p(x \mid y=+1) = N\!\left( x; \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 0.25 & 0 \\ 0 & 4 \end{pmatrix} \right), \quad p(x \mid y=-1) = \frac{1}{2} N\!\left( x; \begin{pmatrix} 3 \\ 0 \end{pmatrix}, \begin{pmatrix} 0.25 & 0 \\ 0 & 4 \end{pmatrix} \right) + \frac{1}{2} N\!\left( x; \begin{pmatrix} -3 \\ 0 \end{pmatrix}, \begin{pmatrix} 0.25 & 0 \\ 0 & 4 \end{pmatrix} \right),$
where $N(x; \mu, \Sigma)$ denotes the normal density with mean vector $\mu$ and covariance matrix $\Sigma$. The class-prior is set at $\theta_P = 0.7$. From these densities, we draw $n_P = 400$ positive and $n_U = 1000$ unlabeled samples. For comparison, we apply PCA, Fisher's discriminant analysis (FDA), and PNRL3 (the supervised counterpart of PURL) to the data. As the label information for FDA and PNRL, U data are simply regarded as N data, even though U data are a mixture of P and N data. Since PCA and FDA are linear transformations, we also use a linear transformation in PNRL and PURL for this numerical illustration. Specifically, we use a two-layer perceptron for $w$. The first fully connected layer serves as the linear transformation to obtain a one-dimensional representation. The rectified linear unit (ReLU) (Glorot, Bordes, & Bengio, 2011) is used as the activation function for the output of the first layer, which can be seen as the feature mapping of a linear-in-parameter model. The second layer is a single connection that weighs the output of the first layer.
Figure 2:

(a) Positive and unlabeled data. The estimated subspaces obtained by PCA, FDA, PNRL, and our proposed method. (b) Unlabeled data with the true labels projected onto the subspaces obtained by PCA, FDA, and our method, respectively. The results indicate that PCA, FDA, and PNRL smash underlying class structure, while the positive and negative labels are visibly separated in the subspace obtained by our method.

We plot the subspaces obtained by PCA, FDA, PNRL, and our proposed method in Figure 2a. Since the data are distributed vertically, the subspace obtained by PCA is almost parallel to the vertical axis (the dashed line). FDA and PNRL return diagonal lines (the dash-dotted and dotted lines), showing that regarding U data as N data is not appropriate. On the other hand, the subspace obtained by our method is almost parallel to the horizontal axis (the solid line). Figure 2b plots the labeled data projected onto those subspaces. The labels of the data projected by PCA, FDA, and PNRL are barely distinguishable due to significant overlap, which makes class-prior estimation very hard. In contrast, the classes of samples projected by the proposed method are easily separated, which eases class-prior estimation.

#### 5.2.2  Benchmark Data

Next we apply the PURL method to benchmark data sets. To obtain low-dimensional representations, we set $m = 20$ and use a fully connected neural network with four layers ($d$-60-20-1; $v$ is $d$-60-20 and $g$ is 20-1) for $w$, except for the text classification data sets, for which we use another fully connected neural network with four layers ($d$-30-10-1) for $w$, that is, $m = 10$. ReLU is used as the activation function for hidden layers, and batch normalization (Ioffe & Szegedy, 2015) is applied to all hidden layers. Stochastic gradient descent is used for optimization with learning rate 0.001. Also, weight decay of 0.0005 and gradient noise of 0.01 are applied. We iteratively update $w$ with four mini-batches and $v$ with one mini-batch.

We compare the accuracy of class-prior estimation with and without dimension reduction. For comparison, we also consider PCA, FDA, and PNRL. For PCA, we vary the number of components as follows: $\lfloor d/4 \rfloor$, $\lfloor d/2 \rfloor$, and $\lfloor 3d/4 \rfloor$, where $\lfloor \cdot \rfloor$ is the floor function. For FDA, the reduced dimension is 1 due to the property of FDA (Hastie, Tibshirani, & Friedman, 2009) in which the reduced dimension becomes the minimum of $m$ and the number of classes minus 1. The neural network for PNRL is the same as the one for the proposed method.

As a class-prior estimation method, we use the kernel mean embedding (KM) method proposed by Ramaswamy et al. (2016). With the estimated class-prior, we then train a fully connected neural network with five layers ($m$-300-300-300-1). ReLU is used as the activation function for hidden layers, and batch normalization is applied to all hidden layers. Except for the text classification data sets, we train the neural networks by Adam (Kingma & Ba, 2015) for 200 epochs. For the text classification data sets, we use AdaGrad (Duchi, Hazan, & Singer, 2011) and set the number of epochs to 300. For nonnegative PU learning (Kiryo et al., 2017), we use the sigmoid loss function and set $\beta$ and $\gamma$ in that paper to 0 and 1, respectively.

We use the ijcnn1, phishing, mushrooms, and a9a data sets taken from the LIBSVM web page (Chang & Lin, 2011). Also, we use the MNIST (LeCun, Bottou, Bengio, & Haffner, 1998), Fashion-MNIST (F-MNIST) (Xiao, Rasul, & Vollgraf, 2017), and 20 Newsgroups (Lang, 1995) data sets. For the MNIST and F-MNIST data sets, we divide the whole set of classes into two groups to make binary classification tasks. For the 20 Newsgroups data set, we use the “comp” topic as the positive class and the “sci” topic as the negative class,4 and construct 2000-dimensional tf-idf vectors. From the data sets, we draw $n_P = 1000$ positive and $n_U = 2000$ unlabeled samples. For validation, we use $n_P = 50$ and $n_U = 200$ samples.

Table 1 lists the average absolute error between the estimated class-prior and the true value. Overall, our proposed dimension-reduction method tends to outperform the other methods, meaning that it provides useful low-dimensional representations. Except for the ijcnn1 data set, the error of FDA tends to be larger than that of the other methods, implying that regarding U data as N data does not help class-prior estimation. For the mushrooms and a9a data sets, applying the unsupervised dimension-reduction method, PCA, does not improve the estimation accuracy, while our method reduces the error of class-prior estimation. In particular, for the 20 Newsgroups data set, the existing approaches (PCA, FDA, and PNRL) perform poorly, whereas applying our method significantly reduces the error of class-prior estimation.

Table 1:
Average Absolute Error (with Standard Error) between the Estimated Class-Prior and the True Value on Benchmark Data Sets over 20 Trials.
| Data Set | $\theta_P$ | None | PCA $\lfloor d/4 \rfloor$ | PCA $\lfloor d/2 \rfloor$ | PCA $\lfloor 3d/4 \rfloor$ | FDA | PNRL | PURL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ijcnn1 | 0.3 | 0.23 (0.02) | 0.26 (0.11) | 0.26 (0.11) | 0.28 (0.04) | **0.03 (0.01)** | 0.26 (0.08) | 0.21 (0.07) |
| | 0.5 | 0.18 (0.05) | 0.14 (0.09) | 0.14 (0.09) | 0.17 (0.06) | **0.04 (0.01)** | 0.21 (0.08) | 0.19 (0.07) |
| | 0.7 | **0.08 (0.01)** | 0.11 (0.05) | 0.11 (0.05) | 0.10 (0.04) | **0.07 (0.01)** | 0.11 (0.05) | **0.10 (0.01)** |
| phishing | 0.3 | **0.02 (0.00)** | **0.02 (0.00)** | **0.02 (0.00)** | **0.02 (0.00)** | 0.04 (0.02) | 0.03 (0.01) | **0.02 (0.00)** |
| | 0.5 | **0.01 (0.00)** | **0.01 (0.00)** | **0.01 (0.00)** | **0.01 (0.00)** | 0.07 (0.03) | 0.04 (0.02) | 0.03 (0.02) |
| | 0.7 | **0.02 (0.00)** | **0.02 (0.00)** | **0.02 (0.00)** | **0.02 (0.00)** | 0.11 (0.04) | 0.05 (0.03) | **0.02 (0.00)** |
| mushrooms | 0.3 | 0.05 (0.01) | 0.05 (0.01) | 0.05 (0.01) | 0.05 (0.01) | 0.09 (0.03) | **0.03 (0.00)** | **0.03 (0.00)** |
| | 0.5 | **0.05 (0.01)** | **0.05 (0.01)** | **0.05 (0.01)** | **0.05 (0.01)** | 0.16 (0.03) | **0.04 (0.01)** | **0.04 (0.00)** |
| | 0.7 | **0.03 (0.01)** | **0.03 (0.00)** | **0.03 (0.00)** | **0.03 (0.01)** | 0.20 (0.06) | **0.03 (0.00)** | 0.04 (0.03) |
| a9a | 0.3 | 0.11 (0.02) | 0.11 (0.02) | 0.11 (0.02) | 0.11 (0.02) | **0.05 (0.01)** | 0.08 (0.03) | **0.04 (0.00)** |
| | 0.5 | 0.10 (0.02) | 0.10 (0.02) | 0.10 (0.02) | 0.10 (0.02) | 0.09 (0.04) | 0.09 (0.03) | **0.04 (0.01)** |
| | 0.7 | 0.08 (0.03) | 0.08 (0.03) | 0.08 (0.03) | 0.08 (0.03) | 0.18 (0.06) | 0.08 (0.03) | **0.04 (0.01)** |
| MNIST | 0.3 | 0.09 (0.02) | 0.09 (0.02) | 0.09 (0.02) | 0.09 (0.02) | 0.27 (0.01) | **0.01 (0.00)** | 0.05 (0.02) |
| | 0.5 | 0.15 (0.11) | 0.15 (0.11) | 0.15 (0.11) | 0.15 (0.11) | 0.46 (0.01) | **0.03 (0.00)** | 0.06 (0.03) |
| | 0.7 | 0.60 (0.21) | 0.60 (0.21) | 0.60 (0.21) | 0.60 (0.21) | 0.65 (0.02) | **0.06 (0.01)** | **0.07 (0.01)** |
| F-MNIST | 0.3 | **0.02 (0.00)** | **0.02 (0.00)** | **0.02 (0.00)** | **0.02 (0.00)** | 0.25 (0.01) | **0.03 (0.00)** | **0.03 (0.00)** |
| | 0.5 | **0.03 (0.00)** | **0.03 (0.00)** | **0.03 (0.00)** | **0.03 (0.00)** | 0.45 (0.01) | **0.02 (0.00)** | 0.04 (0.03) |
| | 0.7 | **0.03 (0.00)** | **0.03 (0.00)** | **0.03 (0.00)** | **0.03 (0.00)** | 0.66 (0.02) | **0.02 (0.00)** | 0.07 (0.03) |
| 20 News | 0.3 | **0.04 (0.00)** | **0.04 (0.01)** | **0.04 (0.00)** | **0.04 (0.00)** | 0.29 (0.00) | 0.29 (0.09) | **0.03 (0.01)** |
| | 0.5 | 0.08 (0.03) | **0.06 (0.01)** | **0.07 (0.01)** | 0.08 (0.03) | 0.49 (0.00) | 0.25 (0.07) | **0.05 (0.01)** |
| | 0.7 | 0.69 (0.00) | 0.69 (0.00) | 0.69 (0.00) | 0.69 (0.00) | 0.69 (0.00) | **0.13 (0.03)** | **0.07 (0.01)** |

Notes: “None” means that the class-prior is estimated without dimension-reduction methods, PCA is the principal component analysis, FDA is Fisher's discriminant analysis, and PNRL is the supervised counterpart of the proposed method. The class-prior is estimated by the method based on kernel mean embedding. The boldface denotes the best and comparable approaches in terms of the average absolute error according to the t-test at the significance level 5%.

Next, we summarize the average misclassification rates in Table 2. Since the accuracy of class-prior estimation is improved on the mushrooms and a9a data sets, the classification accuracy on them is also improved. In particular, the classification results on the 20 Newsgroups data set with $\theta_P = 0.7$ are improved substantially. Overall, our proposed method tends to give lower or comparable misclassification rates compared with the other methods.

Table 2:
Average Misclassification Rates (with Standard Error) on Benchmark Data Sets over 20 Trials.
| Data Set | $\theta_P$ | None | PCA $\lfloor d/4 \rfloor$ | PCA $\lfloor d/2 \rfloor$ | PCA $\lfloor 3d/4 \rfloor$ | FDA | PNRL | PURL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ijcnn1 | 0.3 | 25.32 (1.23) | 27.79 (2.05) | 27.79 (2.05) | 29.21 (1.30) | **7.00 (0.59)** | 28.57 (2.52) | 25.92 (2.64) |
| | 0.5 | 21.43 (1.22) | 17.88 (1.72) | 17.88 (1.72) | 20.75 (1.15) | **8.25 (1.33)** | 26.52 (2.26) | 21.52 (1.69) |
| | 0.7 | **12.07 (0.53)** | 14.94 (1.02) | 14.94 (1.02) | **14.61 (1.15)** | **11.23 (1.34)** | 17.34 (1.58) | **13.70 (1.04)** |
| phishing | 0.3 | **7.41 (0.46)** | **7.46 (0.48)** | **7.46 (0.48)** | **7.57 (0.46)** | **10.30 (2.42)** | 11.09 (0.98) | **7.62 (0.45)** |
| | 0.5 | **12.85 (2.11)** | **9.75 (0.46)** | **9.75 (0.46)** | **9.82 (0.40)** | 24.43 (3.09) | 32.02 (3.05) | **10.05 (0.46)** |
| | 0.7 | **8.07 (0.44)** | **8.85 (1.08)** | **8.85 (1.08)** | **7.63 (0.37)** | 25.62 (1.40) | 29.04 (0.73) | **8.02 (0.37)** |
| mushrooms | 0.3 | 0.73 (0.20) | **1.15 (0.58)** | 0.57 (0.14) | **0.49 (0.14)** | 1.52 (0.36) | **0.24 (0.06)** | **0.43 (0.09)** |
| | 0.5 | **0.57 (0.11)** | **0.57 (0.11)** | **0.78 (0.16)** | **0.57 (0.11)** | 3.40 (0.47) | **1.10 (0.24)** | **3.40 (2.39)** |
| | 0.7 | **1.42 (0.28)** | **1.42 (0.28)** | **1.50 (0.27)** | **1.42 (0.28)** | 6.38 (0.66) | **1.40 (0.27)** | **1.61 (0.48)** |
| a9a | 0.3 | 24.93 (1.19) | 26.49 (1.89) | 26.49 (1.89) | 26.20 (1.73) | **21.09 (0.59)** | 26.31 (2.36) | **22.32 (0.65)** |
| | 0.5 | 30.35 (1.55) | 26.07 (1.01) | 26.07 (1.01) | 29.52 (1.81) | **22.70 (0.77)** | 27.48 (1.47) | **23.70 (0.67)** |
| | 0.7 | **20.35 (0.80)** | **20.54 (0.61)** | **20.54 (0.61)** | **19.94 (0.78)** | **19.70 (0.97)** | **20.59 (0.60)** | **19.39 (0.66)** |
| MNIST | 0.3 | 24.58 (2.82) | 17.99 (1.44) | 17.99 (1.44) | 22.18 (2.75) | 20.92 (0.74) | **12.74 (0.63)** | **11.76 (0.78)** |
| | 0.5 | 23.00 (1.60) | 22.35 (1.10) | 22.35 (1.10) | 23.55 (1.80) | 42.10 (1.85) | **15.35 (0.75)** | **18.18 (2.43)** |
| | 0.7 | 53.34 (3.78) | 52.19 (4.41) | 54.42 (3.99) | 53.39 (3.74) | 60.86 (1.25) | **16.38 (0.84)** | **18.64 (2.83)** |
| F-MNIST | 0.3 | **14.88 (1.30)** | **18.02 (2.86)** | **18.02 (2.86)** | **15.12 (1.18)** | 19.24 (0.91) | **14.54 (1.14)** | **13.54 (0.75)** |
| | 0.5 | **13.40 (0.69)** | **12.05 (0.96)** | **12.05 (0.96)** | **13.22 (0.62)** | 37.73 (1.56) | **12.15 (0.48)** | **14.10 (1.16)** |
| | 0.7 | **9.94 (1.30)** | **8.89 (0.84)** | **8.89 (0.84)** | **8.54 (0.84)** | 55.65 (2.14) | **8.65 (0.83)** | **9.29 (0.47)** |
| 20 News | 0.3 | 38.89 (3.00) | 40.30 (3.64) | 42.48 (3.54) | 38.70 (3.81) | **18.66 (0.47)** | 66.62 (1.59) | 36.31 (4.13) |
| | 0.5 | 44.48 (1.82) | 43.85 (2.03) | 46.67 (1.15) | 47.77 (0.87) | **34.73 (0.80)** | 50.00 (0.00) | 45.88 (1.64) |
| | 0.7 | 50.69 (0.95) | 53.61 (0.73) | 51.77 (1.05) | 50.36 (0.83) | 50.69 (0.95) | **30.61 (0.59)** | **29.85 (0.13)** |

Note: The boldface denotes the best and comparable approaches in terms of the average absolute error according to the $t$-test at the significance level 5%.

## 6  Conclusion

In this letter, we have proposed an information-theoretic representation learning method for positive and unlabeled (PU) data. Our method is based on the information-maximization principle: it finds a low-dimensional representation that maximally preserves a squared-loss variant of mutual information (SMI) between inputs and labels. Unlike existing PU learning methods, our representation learning method can be executed without an estimate of the class-prior, so it can also be used as preprocessing for class-prior estimation methods. Through numerical experiments, we demonstrated the effectiveness of our method.

## Appendix A: Proof of Theorem 1

We express SMI in equation 2.1 as
$\mathrm{SMI}=\frac{\theta_P}{2}\int\left(\frac{p(x\mid y=+1)}{p(x)}-1\right)^2 p(x)\,dx+\frac{\theta_N}{2}\int\left(\frac{p(x\mid y=-1)}{p(x)}-1\right)^2 p(x)\,dx.$
(A.1)
From the marginal density $p(x)=\theta_P\,p(x\mid y=+1)+\theta_N\,p(x\mid y=-1)$, we have
$\theta_N\frac{p(x\mid y=-1)}{p(x)}=1-\theta_P\frac{p(x\mid y=+1)}{p(x)},$
$\theta_N\left(\frac{p(x\mid y=-1)}{p(x)}-1\right)=\theta_P\left(1-\frac{p(x\mid y=+1)}{p(x)}\right),$
$\left(\frac{p(x\mid y=-1)}{p(x)}-1\right)^2=\frac{\theta_P^2}{\theta_N^2}\left(\frac{p(x\mid y=+1)}{p(x)}-1\right)^2,$
where the equality between the first and second equations can be confirmed by using $\theta_P+\theta_N=1$. Plugging the last equation into the second term of equation A.1, we obtain an expression of SMI only with positive and unlabeled data (PU-SMI):
$\mathrm{SMI}=\frac{\theta_P}{2\theta_N}\int\left(\frac{p(x\mid y=+1)}{p(x)}-1\right)^2 p(x)\,dx=:\text{PU-SMI}.$

$\Box$
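The identity above can be checked numerically. The sketch below (our own illustration, not from the letter) discretizes a two-Gaussian example on a grid and confirms that equation A.1 and the PU-SMI expression coincide:

```python
import numpy as np

# Numerical check of Theorem 1 on a 1-D two-Gaussian mixture,
# approximating all integrals by Riemann sums on a fine grid.
theta_P, theta_N = 0.3, 0.7
x = np.linspace(-10, 10, 20001)
dx = x[1] - x[0]

def gauss(x, mu, sigma=1.0):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

p_pos = gauss(x, +2.0)                  # p(x | y = +1)
p_neg = gauss(x, -2.0)                  # p(x | y = -1)
p = theta_P * p_pos + theta_N * p_neg   # marginal p(x)

# SMI via equation A.1 (needs both class-conditional densities)
smi = (theta_P / 2) * np.sum((p_pos / p - 1) ** 2 * p) * dx \
    + (theta_N / 2) * np.sum((p_neg / p - 1) ** 2 * p) * dx

# PU-SMI: the same quantity from positive and unlabeled densities only
pu_smi = theta_P / (2 * theta_N) * np.sum((p_pos / p - 1) ** 2 * p) * dx

assert abs(smi - pu_smi) < 1e-6
```

The two quantities agree to machine precision, as the derivation predicts, even though PU-SMI never touches $p(x\mid y=-1)$.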

## Appendix B: Proof of Theorem 2

Let
$s(x):=\frac{p(x\mid y=+1)}{p(x)}$
be the density ratio. Then PU-SMI can be expressed as
$\text{PU-SMI}=\frac{\theta_P}{2\theta_N}\int(s(x)-1)^2 p(x)\,dx=\frac{\theta_P}{\theta_N}\left(\frac{1}{2}\int s^2(x)p(x)\,dx-\frac{1}{2}\right),$
where $s(x)p(x)=p(x\mid y=+1)$ is used. Based on the Fenchel inequality (Boyd & Vandenberghe, 2004), for any function $f(x)$ in a function class $\mathcal{F}$, we have
$\frac{1}{2}s^2(x)\ge f(x)s(x)-\frac{1}{2}f^2(x).$
Then we obtain the lower bound of PU-SMI as
$\text{PU-SMI}\ge\frac{\theta_P}{\theta_N}\left(\int f(x)p(x\mid y=+1)\,dx-\frac{1}{2}\int f^2(x)p(x)\,dx-\frac{1}{2}\right)=\frac{\theta_P}{\theta_N}\left(-J_{\mathrm{PU}}(f)-\frac{1}{2}\right),$
where
$J_{\mathrm{PU}}(f):=\frac{1}{2}\int f^2(x)p(x)\,dx-\int f(x)p(x\mid y=+1)\,dx.$
Thus, from the Fenchel duality (Keziou, 2003; Nguyen, Wainwright, & Jordan, 2007), we have
$\text{PU-SMI}=\sup_{f\in\mathcal{F}}\frac{\theta_P}{\theta_N}\left(-J_{\mathrm{PU}}(f)-\frac{1}{2}\right),$
where equality in the supremum is attained when $f(x)=s(x)=p(x\mid y=+1)/p(x)$, provided that $s\in\mathcal{F}$. $\Box$
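The variational lower bound can also be verified on a discretized example. The sketch below (our own, with an arbitrarily chosen suboptimal $f$) confirms that $\frac{\theta_P}{\theta_N}(-J_{\mathrm{PU}}(f)-\frac{1}{2})$ never exceeds PU-SMI and that equality holds at $f=s$:

```python
import numpy as np

# Check the Fenchel lower bound of Theorem 2 on a discretized 1-D mixture.
theta_P, theta_N = 0.4, 0.6
x = np.linspace(-10, 10, 20001)
dx = x[1] - x[0]
gauss = lambda t, mu: np.exp(-(t - mu) ** 2 / 2) / np.sqrt(2 * np.pi)
p_pos = gauss(x, +1.5)                     # p(x | y = +1)
p = theta_P * p_pos + theta_N * gauss(x, -1.5)
s = p_pos / p                              # true density ratio

def J_PU(f):
    # J_PU(f) = (1/2) ∫ f^2(x) p(x) dx − ∫ f(x) p(x|y=+1) dx
    return 0.5 * np.sum(f ** 2 * p) * dx - np.sum(f * p_pos) * dx

pu_smi = theta_P / (2 * theta_N) * np.sum((s - 1) ** 2 * p) * dx
bound_at_s = theta_P / theta_N * (-J_PU(s) - 0.5)          # optimal f = s
bound_off = theta_P / theta_N * (-J_PU(0.9 * s + 0.1) - 0.5)  # suboptimal f

assert abs(pu_smi - bound_at_s) < 1e-6   # equality at f = s
assert bound_off < bound_at_s            # strict lower bound otherwise
```

Note that evaluating the bound only requires the positive and unlabeled densities, which is what makes the objective usable without the class-prior.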

## Appendix C: Proof of Theorem 3

The idea of the proof is to view the approximated squared error as a perturbed optimization of the expected one. In the analysis, we focus on the linear-in-parameter model $w(x)=\sum_{\ell=1}^{b}\beta_\ell\phi_\ell(x)=\beta^\top\phi(x)$. We assume that $0\le\phi_\ell(x)\le 1$ for all $\ell=1,\dots,b$ and $x\in\mathbb{R}^d$, and that $\hat{H}_U$ and $H_U$ are positive-definite matrices. Recall
$\beta^*=\mathop{\mathrm{argmin}}_{\beta\in\mathbb{R}^b}J_{\mathrm{PU}}(\beta),$
where
$J_{\mathrm{PU}}(\beta)=\frac{1}{2}\beta^\top H_U\beta-\beta^\top h_P,\quad H_U=\int\phi(x)\phi(x)^\top p(x)\,dx,\quad h_P=\int\phi(x)p(x\mid y=+1)\,dx.$
Similarly,
$\hat{\beta}=\mathop{\mathrm{argmin}}_{\beta\in\mathbb{R}^b}\hat{J}_{\mathrm{PU}}(\beta),$
where
$\hat{J}_{\mathrm{PU}}(\beta)=\frac{1}{2}\beta^\top\hat{H}_U\beta-\beta^\top\hat{h}_P,\quad \hat{H}_U=\frac{1}{n_U}\sum_{k=1}^{n_U}\phi(x_k^U)\phi(x_k^U)^\top,\quad \hat{h}_P=\frac{1}{n_P}\sum_{i=1}^{n_P}\phi(x_i^P).$
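Since $\hat{J}_{\mathrm{PU}}$ is an unconstrained quadratic, its minimizer has the closed form $\hat{\beta}=\hat{H}_U^{-1}\hat{h}_P$. A minimal sketch of computing this estimator from synthetic PU samples (the Gaussian basis functions and the small ridge term added for numerical stability are our own choices, not prescribed by the letter):

```python
import numpy as np

# Closed-form minimizer of the empirical objective J^_PU on synthetic 1-D PU data.
rng = np.random.default_rng(0)
theta_P = 0.5
n_P, n_U, b = 500, 1000, 20

x_P = rng.normal(+1.5, 1.0, n_P)                       # positive samples
is_pos = rng.random(n_U) < theta_P
x_U = np.where(is_pos, rng.normal(+1.5, 1.0, n_U),
               rng.normal(-1.5, 1.0, n_U))             # unlabeled samples

centers = x_U[:b]                                      # basis centers (hypothetical choice)
phi = lambda t: np.exp(-(t[:, None] - centers[None, :]) ** 2 / 2)

Phi_U, Phi_P = phi(x_U), phi(x_P)
H_U = Phi_U.T @ Phi_U / n_U            # H^_U = (1/n_U) Σ φ(x) φ(x)^T
h_P = Phi_P.mean(axis=0)               # h^_P = (1/n_P) Σ φ(x)
reg = 1e-3 * np.eye(b)                 # small ridge for numerical stability
beta_hat = np.linalg.solve(H_U + reg, h_P)

J_hat = 0.5 * beta_hat @ H_U @ beta_hat - beta_hat @ h_P
# -J_hat - 1/2 estimates PU-SMI up to the unknown factor theta_P / theta_N,
# so it can be maximized without knowing the class-prior.
print(-J_hat - 0.5)
```

The key design point, mirroring the letter's argument, is that neither $\hat{H}_U$ nor $\hat{h}_P$ involves $\theta_P$.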

First we have the following lemma:

Lemma 1.
Let $\varepsilon$ be the smallest eigenvalue of $H_U$. We have
$J_{\mathrm{PU}}(\beta)\ge J_{\mathrm{PU}}(\beta^*)+\varepsilon\|\beta-\beta^*\|_2^2.$
That is, $J_{\mathrm{PU}}$ satisfies the second-order growth condition (Bonnans & Shapiro, 1998).
Proof.
Since $H_U$ is positive definite, $J_{\mathrm{PU}}(\beta)$ is strongly convex with parameter at least $\varepsilon$. Then we have
$J_{\mathrm{PU}}(\beta)\ge J_{\mathrm{PU}}(\beta^*)+\nabla J_{\mathrm{PU}}(\beta^*)^\top(\beta-\beta^*)+\varepsilon\|\beta-\beta^*\|_2^2=J_{\mathrm{PU}}(\beta^*)+\varepsilon\|\beta-\beta^*\|_2^2,$
where the optimality condition $\nabla J_{\mathrm{PU}}(\beta^*)=0$ is used.
Let us define a set of perturbation parameters as
$\mathcal{U}:=\{(U_U,u_P)\mid U_U\in\mathcal{S}^b,\;u_P\in\mathbb{R}^b\},$
where $\mathcal{S}^b$ is the set of symmetric $b\times b$ matrices. With these perturbation parameters, we express $\hat{H}_U$ and $\hat{h}_P$ as $U_U=\hat{H}_U-H_U$ and $u_P=\hat{h}_P-h_P$, respectively. Let $u\in\mathcal{U}$. The perturbed objective function and its solution are given by
$J_{\mathrm{PU}}(\beta,u):=\frac{1}{2}\beta^\top(H_U+U_U)\beta-\beta^\top(h_P+u_P),\quad\beta(u):=\mathop{\mathrm{argmin}}_{\beta\in\mathbb{R}^b}J_{\mathrm{PU}}(\beta,u).$
Clearly, $J_{\mathrm{PU}}(\beta)=J_{\mathrm{PU}}(\beta,0)$. Also, $\hat{J}_{\mathrm{PU}}(\beta)=J_{\mathrm{PU}}(\beta,u)$ and $\hat{\beta}=\beta(u)$ for $u\ne 0$. We then have the following lemma:
Lemma 2.

$J_{\mathrm{PU}}(\cdot,u)-J_{\mathrm{PU}}(\cdot)$ is Lipschitz continuous with modulus $\omega(u)=O(\|U_U\|_{\mathrm{Fro}}+\|u_P\|_2)$, where $\|\cdot\|_{\mathrm{Fro}}$ is the Frobenius norm.

Proof.
First, we have
$J_{\mathrm{PU}}(\beta,u)-J_{\mathrm{PU}}(\beta)=\frac{1}{2}\beta^\top U_U\beta-\beta^\top u_P.$
The partial gradient is given by
$\frac{\partial}{\partial\beta}\left(J_{\mathrm{PU}}(\beta,u)-J_{\mathrm{PU}}(\beta)\right)=U_U\beta-u_P.$
Let us define the $\delta$-ball of $\beta^*$ as $B_\delta(\beta^*):=\{\beta\mid\|\beta-\beta^*\|_2\le\delta\}$ and let $M=\|\beta^*\|_2$. For any $\beta\in B_\delta(\beta^*)$, the triangle inequality gives
$\|\beta\|_2\le\|\beta-\beta^*\|_2+\|\beta^*\|_2\le\delta+M.$
Thus,
$\left\|\frac{\partial}{\partial\beta}\left(J_{\mathrm{PU}}(\beta,u)-J_{\mathrm{PU}}(\beta)\right)\right\|_2\le(\delta+M)\|U_U\|_{\mathrm{Fro}}+\|u_P\|_2.$
This means that $J_{\mathrm{PU}}(\cdot,u)-J_{\mathrm{PU}}(\cdot)$ is Lipschitz continuous on $B_\delta(\beta^*)$ with a Lipschitz constant of order $O(\|U_U\|_{\mathrm{Fro}}+\|u_P\|_2)$.

Finally, we prove theorem 3.

Proof.
According to the central limit theorem, we have
$\|U_U\|_{\mathrm{Fro}}=O_p(1/\sqrt{n_U}),\quad\|u_P\|_2=O_p(1/\sqrt{n_P})$
as $n_P,n_U\to\infty$. Since $J_{\mathrm{PU}}$ satisfies the second-order growth condition (see lemma 1) and $J_{\mathrm{PU}}(\cdot,u)-J_{\mathrm{PU}}(\cdot)$ is Lipschitz continuous with modulus $\omega(u)$ (see lemma 2), we can use proposition 6.1 in Bonnans and Shapiro (1998) to obtain the first half of theorem 3:
$\|\hat{\beta}-\beta^*\|_2\le\varepsilon^{-1}\omega(u)=O(\|U_U\|_{\mathrm{Fro}}+\|u_P\|_2)=O_p(1/\sqrt{n_P}+1/\sqrt{n_U}).$
Next, we prove the latter half of theorem 3. For the squared errors, we have
$|\hat{J}_{\mathrm{PU}}(\hat{\beta})-J_{\mathrm{PU}}(\beta^*)|\le|\hat{J}_{\mathrm{PU}}(\hat{\beta})-\hat{J}_{\mathrm{PU}}(\beta^*)|+|\hat{J}_{\mathrm{PU}}(\beta^*)-J_{\mathrm{PU}}(\beta^*)|.$
Here, we have
$\hat{J}_{\mathrm{PU}}(\hat{\beta})-\hat{J}_{\mathrm{PU}}(\beta^*)=\frac{1}{2}(\hat{\beta}+\beta^*)^\top\hat{H}_U(\hat{\beta}-\beta^*)-(\hat{\beta}-\beta^*)^\top\hat{h}_P,$
$\hat{J}_{\mathrm{PU}}(\beta^*)-J_{\mathrm{PU}}(\beta^*)=\frac{1}{2}\beta^{*\top}U_U\beta^*-u_P^\top\beta^*.$
Since $0\le\phi_\ell(x)\le 1$ and $M=\|\beta^*\|_2$, this leads to
$|\hat{J}_{\mathrm{PU}}(\hat{\beta})-J_{\mathrm{PU}}(\beta^*)|\le O_p(\|\hat{\beta}-\beta^*\|_2)+O_p(\|U_U\|_{\mathrm{Fro}}+\|u_P\|_2)=O_p(1/\sqrt{n_P}+1/\sqrt{n_U}).$
Recall
$\text{PU-SMI}^*=\frac{\theta_P}{\theta_N}\left(-J_{\mathrm{PU}}(\beta^*)-\frac{1}{2}\right),\quad\widehat{\text{PU-SMI}}=\frac{\theta_P}{\theta_N}\left(-\hat{J}_{\mathrm{PU}}(\hat{\beta})-\frac{1}{2}\right).$
We thus have
$|\text{PU-SMI}^*-\widehat{\text{PU-SMI}}|=\frac{\theta_P}{\theta_N}|\hat{J}_{\mathrm{PU}}(\hat{\beta})-J_{\mathrm{PU}}(\beta^*)|=O_p(1/\sqrt{n_P}+1/\sqrt{n_U}).\;\Box$

## Appendix D: Effect of Dimension Reduction

In this appendix, we illustrate the effect of dimension reduction and how the number of samples affects class-prior estimation.

We use the artificial data set used in section 5.2 and vary both $nP$ and $nU$ from 500 to 5000. We set the true class-prior $θP$ as 0.5. The class-prior is estimated by the method based on kernel mean embedding (KM) (Ramaswamy et al., 2016). To evaluate the performance with and without dimension reduction, we use the one-dimensional samples obtained by $b⊤x$, where $b=(1,0)⊤$, and the original two-dimensional samples.

Figure 3a shows the mean absolute error (with its standard error) between the true and estimated class-priors over 10 trials. The error of KM without dimension reduction decreases as the number of samples increases, up to around $n_P=n_U=2000$, but it does not decrease further even when we increase the sample size to $n_P=n_U=5000$. In contrast, already at $n_P=n_U=500$, the error of KM with dimension reduction is smaller than that without dimension reduction. Figure 3b shows the mean computation time (with its standard error) over 10 trials. The computation time grows with the number of samples. Since a short computation time and a low absolute error are both desirable, these results demonstrate the effectiveness of dimension reduction.
Figure 3:

Although the error of KM without dimension reduction decreases as the number of samples increases up to around $n_P=n_U=2000$, the error of KM with dimension reduction is consistently smaller. The computation time grows with the number of samples. Since a short computation time and a low absolute error are desirable, these results demonstrate the effectiveness of dimension reduction.


## Appendix E: Supervised Counterpart of the Proposed Method

In this appendix, we review the SMI estimation method (Suzuki et al., 2009; Sugiyama, 2013; Sakai & Sugiyama, 2014).

According to Suzuki and Sugiyama (2013), SMI can be expressed as
$\mathrm{SMI}=\frac{1}{2}\sum_{y\in\{\pm1\}}\int r^2(x,y)p(x)p(y)\,dx-\frac{1}{2},$
(E.1)
where
$r(x,y):=\frac{p(x,y)}{p(x)p(y)}.$
Based on the Fenchel inequality (Boyd & Vandenberghe, 2004), for any function $h:\mathbb{R}^d\times\{\pm1\}\to\mathbb{R}$, we have
$\frac{1}{2}r^2(x,y)\ge h(x,y)r(x,y)-\frac{1}{2}h^2(x,y),$
(E.2)
where the equality condition is $h(x,y)=r(x,y)$. We thus obtain the lower bound of SMI as
$\mathrm{SMI}\ge L(h)-\frac{1}{2},$
(E.3)
where
$L(h):=\sum_{y\in\{\pm1\}}\int h(x,y)p(x,y)\,dx-\frac{1}{2}\sum_{y\in\{\pm1\}}\int h^2(x,y)p(x)p(y)\,dx.$
(E.4)
To obtain an SMI estimate, we first train $h$ with PN data by solving the following optimization problem:
$\mathop{\mathrm{maximize}}_{h\in\mathcal{H}}\;\hat{L}(h),$
(E.5)
where $\mathcal{H}$ is a user-specified function class and $\hat{L}$ is the sample approximation of $L$:
$\hat{L}(h):=\frac{1}{n}\sum_{i=1}^{n}h(x_i,y_i)-\sum_{y\in\{\pm1\}}\frac{\hat{p}(y)}{2n}\sum_{i=1}^{n}h^2(x_i,y).$
(E.6)
A simple approach to approximating $p(y)$ is to use the label counts; that is, $\hat{p}(y=+1)$ is the number of positive samples divided by the total number of labeled samples. The SMI approximator from PN data is then given by
$\widehat{\text{PN-SMI}}:=\hat{L}(\hat{h})-\frac{1}{2},$
where $\hat{h}:=\mathop{\mathrm{argmax}}_{h\in\mathcal{H}}\hat{L}(h)$.

Similar to the proposed PURL method, we maximize $\widehat{\text{PN-SMI}}$ to learn a low-dimensional representation.
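For a linear-in-parameter model, $\hat{L}$ in equation E.6 is a quadratic in the coefficients and its maximizer is available in closed form. The sketch below (our own parameterization, $h(x,y)=\alpha_y^\top\phi(x)$ with one coefficient vector per class, Gaussian basis functions, and a small ridge term for stability, none of which are prescribed by the cited papers) computes $\widehat{\text{PN-SMI}}$ on synthetic PN data:

```python
import numpy as np

# Closed-form maximization of L^(h) in equation E.6 for h(x, y) = alpha_y^T phi(x).
rng = np.random.default_rng(1)
n, b = 1000, 20
y = np.where(rng.random(n) < 0.5, 1, -1)          # labels
x = rng.normal(1.5 * y, 1.0)                      # class-conditional Gaussians

centers = x[:b]                                   # basis centers (hypothetical choice)
phi = lambda t: np.exp(-(t[:, None] - centers[None, :]) ** 2 / 2)
Phi = phi(x)                                      # (n, b) design matrix
G = Phi.T @ Phi / n                               # (1/n) Σ phi(x_i) phi(x_i)^T
reg = 1e-3 * np.eye(b)                            # ridge for numerical stability

L_hat = 0.0
for cls in (+1, -1):
    p_hat = np.mean(y == cls)                     # p^(y) from label counts
    g = Phi[y == cls].sum(axis=0) / n             # (1/n) Σ_{i: y_i = cls} phi(x_i)
    alpha = np.linalg.solve(p_hat * G + reg, g)   # per-class maximizer of E.6
    L_hat += alpha @ g - 0.5 * p_hat * alpha @ G @ alpha

pn_smi = L_hat - 0.5                              # PN-SMI^ = L^(h^) - 1/2
print(pn_smi)
```

This supervised counterpart needs fully labeled data; the contrast with the PU estimator in appendix C is that here $\hat{p}(y)$ enters the objective explicitly.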

## Notes

1. We also tried to optimize $w$ and $v$ simultaneously; that is, $J_{\mathrm{PU}}$ is minimized with respect to $g$ without decomposing $g$ into $w$ and $v$, but it did not work well in our preliminary experiments.

2. We compute the squared error between the estimated PU-SMI and the supervised SMI estimator (Suzuki et al., 2009) with a sufficiently large number of positive and negative samples.

3. The details of PNRL are described in appendix E.

4. See http://qwone.com/~jason/20Newsgroups/ for the details of the topics.

## Acknowledgments

T.S. was supported by KAKENHI 15J09111. G.N. was supported by the JST CREST JPMJCR1403. M.S. was supported by KAKENHI 17H01760. We thank Ikko Yamane, Ryuichi Kiryo, and Takeshi Teshima for their comments.

## References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., … Zheng, X. (2015). TensorFlow: Large-scale machine learning on heterogeneous systems. https://www.tensorflow.org/.

Basu, A., Harris, I. R., Hjort, N. L., & Jones, M. C. (1998). Robust and efficient estimation by minimising a density power divergence. Biometrika, 85(3), 549–559.

Bonnans, J. F., & Cominetti, R. (1996). Perturbed optimization in Banach spaces I: A general theory based on a weak directional constraint qualification; II: A theory based on a strong directional qualification condition; III: Semi-infinite optimization. SIAM Journal on Control and Optimization, 34(4), 1151–1171, 1172–1189, 1555–1567.

Bonnans, J. F., & Shapiro, A. (1998). Optimization problems with perturbations: A guided tour. SIAM Review, 40(2), 228–264.

Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.

Chang, C.-C., & Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2, 1–27. http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Chapelle, O., Schölkopf, B., & Zien, A. (Eds.). (2006). Semi-supervised learning. Cambridge, MA: MIT Press.

Cover, T. M., & Thomas, J. A. (2006). Elements of information theory (2nd ed.). Hoboken, NJ: Wiley.

du Plessis, M. C., Niu, G., & Sugiyama, M. (2015). Convex formulation for learning from positive and unlabeled data. In Proceedings of the 32nd International Conference on Machine Learning, 37 (pp. 1386–1394).

du Plessis, M. C., Niu, G., & Sugiyama, M. (2017). Class-prior estimation for learning from positive and unlabeled data. Machine Learning, 106(4), 463–492.

du Plessis, M. C., & Sugiyama, M. (2014). Class prior estimation from positive and unlabeled data. IEICE Transactions on Information and Systems, E97-D(5), 1358–1362.

Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12, 2121–2159.

Elkan, C., & Noto, K. (2008). Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 213–220). New York: ACM.

Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics (pp. 315–323).

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Cambridge, MA: MIT Press.

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning (2nd ed.). New York: Springer-Verlag.

Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (pp. 448–456).

Jain, S., White, M., & Radivojac, P. (2016). Estimating the class prior and posterior from noisy positives and unlabeled data. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems, 29. Red Hook, NY: Curran.

Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., … Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia (pp. 675–678). New York: ACM.

Kanamori, T., Hido, S., & Sugiyama, M. (2009). A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10, 1391–1445.

Kanamori, T., Suzuki, T., & Sugiyama, M. (2012). Statistical analysis of kernel-based least-squares density-ratio estimation. Machine Learning, 86(3), 335–367.

Keziou, A. (2003). Dual representation of $\phi$-divergences and applications. Comptes Rendus Mathématique, 336(10), 857–862.

Khan, S., Bandyopadhyay, S., Ganguly, A., & Saigal, S. (2007). Relative performance of mutual information estimation methods for quantifying the dependence among short and noisy data. Physical Review E, 76, 026209.

Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In Proceedings of the Third International Conference on Learning Representations. OpenReview.net.

Kiryo, R., Niu, G., du Plessis, M. C., & Sugiyama, M. (2017). Positive-unlabeled learning with non-negative risk estimator. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems, 30 (pp. 1674–1684). Red Hook, NY: Curran.

Kraskov, A., Stögbauer, H., & Grassberger, P. (2004). Estimating mutual information. Physical Review E, 69(6), 066138.

Krause, A., Perona, P., & Gomes, R. G. (2010). Discriminative clustering by regularized information maximization. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, & A. Culotta (Eds.), Advances in neural information processing systems, 23 (pp. 775–783). Red Hook, NY: Curran.

Lang, K. (1995). Newsweeder: Learning to filter netnews. In Proceedings of the Twelfth International Conference on Machine Learning (pp. 331–339). San Mateo, CA: Morgan Kaufmann.

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.

Letouzey, F., Denis, F., & Gilleron, R. (2000). Learning from positive and unlabeled examples. In Proceedings of the 11th International Conference on Algorithmic Learning Theory (pp. 71–85). Berlin: Springer.

Li, K.-C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414), 316–327.

Li, W., Guo, Q., & Elkan, C. (2011). A positive and unlabeled learning algorithm for one-class classification of remote-sensing data. IEEE Transactions on Geoscience and Remote Sensing, 49(2), 717–725.

Linsker, R. (1988). Self-organization in a perceptual network. Computer, 21(3), 105–117.

Mohri, M., Rostamizadeh, A., & Talwalkar, A. (2012). Foundations of machine learning. Cambridge, MA: MIT Press.

Moon, Y.-I., Rajagopalan, B., & Lall, U. (1995). Estimation of mutual information using kernel density estimators. Physical Review E, 52(3), 2318–2321.

Nguyen, X. L., Wainwright, M. J., & Jordan, M. I. (2007). Nonparametric estimation of the likelihood ratio and divergence functionals. In Proceedings of the IEEE International Symposium on Information Theory (pp. 2016–2020). Piscataway, NJ: IEEE.

Northcutt, C. G., Wu, T., & Chuang, I. L. (2017). Learning with confident examples: Rank pruning for robust classification with noisy labels. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence. AUAI Press.

Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine Series 5, 50(302), 157–175.

Ramaswamy, H. G., Scott, C., & Tewari, A. (2016). Mixture proportion estimation via kernel embedding of distributions. In Proceedings of the 33rd International Conference on Machine Learning.

Sakai, T., & Sugiyama, M. (2014). Computationally efficient estimation of squared-loss mutual information with multiplicative kernel models. IEICE Transactions on Information and Systems, E97-D(4), 968–971.

Sugiyama, M. (2013). Machine learning with squared-loss mutual information. Entropy, 15, 80–112.

Sugiyama, M., Suzuki, T., & Kanamori, T. (2012a). Density ratio estimation in machine learning. Cambridge: Cambridge University Press.

Sugiyama, M., Suzuki, T., & Kanamori, T. (2012b). Density ratio matching under the Bregman divergence: A unified framework of density ratio estimation. Annals of the Institute of Statistical Mathematics, 64(5), 1009–1044.

Suzuki, T., & Sugiyama, M. (2013). Sufficient dimension reduction via squared-loss mutual information. Neural Computation, 25(3), 725–758.

Suzuki, T., Sugiyama, M., Kanamori, T., & Sese, J. (2009). Mutual information estimation reveals global associations between stimuli and biological processes. BMC Bioinformatics, 10, S52, 1–12.

Suzuki, T., Sugiyama, M., Sese, J., & Kanamori, T. (2008). Approximating mutual information by maximum likelihood density ratio estimation. In Proceedings of ECML-PKDD 2008 Workshop on New Challenges for Feature Selection in Data Mining and Knowledge Discovery, vol. 4 (pp. 5–20).

Torkkola, K. (2003). Feature extraction by non-parametric mutual information maximization. Journal of Machine Learning Research, 3, 1415–1438.

Van Hulle, M. M. (2005). Edgeworth approximation of multivariate differential entropy. Neural Computation, 17(9), 1903–1910.

Ward, G., Hastie, T., Barry, S., Elith, J., & Leathwick, J. R. (2009). Presence-only data and the EM algorithm. Biometrics, 65(2), 554–563.

Xiao, H., Rasul, K., & Vollgraf, R. (2017). Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747.

## Author notes

T.S. is currently affiliated with NEC Corporation.