## Abstract

Many machine learning problems can be formulated as predicting labels for a pair of objects. Problems of that kind are often referred to as pairwise learning, dyadic prediction, or network inference problems. During the past decade, kernel methods have played a dominant role in pairwise learning. They still deliver state-of-the-art predictive performance, yet a theoretical analysis of their behavior has been underexplored in the machine learning literature. In this work we review and unify kernel-based algorithms that are commonly used in different pairwise learning settings, ranging from matrix filtering to zero-shot learning. To this end, we focus on closed-form efficient instantiations of Kronecker kernel ridge regression. We show that independent task kernel ridge regression, two-step kernel ridge regression, and a linear matrix filter arise naturally as special cases of Kronecker kernel ridge regression, implying that all these methods implicitly minimize a squared loss. In addition, we analyze universality, consistency, and spectral filtering properties. Our theoretical results provide valuable insights into assessing the advantages and limitations of existing pairwise learning methods.

## 1 Introduction to Pairwise Learning

### 1.1 Settings in Pairwise Learning

Many real-world machine learning problems can naturally be represented as pairwise learning or dyadic prediction problems. In contrast to more traditional learning settings, the goal here consists of making predictions for pairs of objects $u \in U$ and $v \in V$, as elements of two universes $U$ and $V$. Such an ordered pair $(u,v)$ is often referred to as a dyad, and both elements in the dyad are usually equipped with a feature representation. In contrast to many statistical settings, these dyads are not independently and identically distributed, as the same objects tend to appear many times as part of different pairs.

Applications of pairwise learning often arise in the life sciences, such as predicting various types of interactions in all sorts of biological networks (e.g., drug-target networks, gene regulatory networks, and species interaction networks). Similarly, pairwise learning methods are also used to extract novel relationships in social networks, such as author-citation networks. Other popular applications include recommender systems (predicting interactions between users and items) and information retrieval (predicting interactions between search queries and search results).

Formally speaking, in pairwise learning, one attempts to learn a function of the form $f(u,v)$, that is, a function to predict properties of two objects. Such functions are fitted using a set of $n$ labeled examples: the training set $S = \{(u_h, v_h, y_h) \mid h = 1, \ldots, n\}$. Further on, $U = \{u_i \mid i = 1, \ldots, m\}$ and $V = \{v_j \mid j = 1, \ldots, q\}$ will denote the sets of distinct objects of both types, later referred to as instances and tasks, respectively, in the training set, with $m = |U|$ and $q = |V|$.

Pairwise learning holds strong connections with many other machine learning settings. Especially a link with multitask learning can be advocated by calling the first object of a dyad an “instance” and the second object a “task.” The underlying idea for making the distinction between instances and tasks is that the feature description of the instances is often considered as more informative, while the feature description of the tasks is mainly used to steer learning in the right direction. Albeit less common in traditional multitask learning formulations, feature representations for tasks play a crucial role in recent paradigms such as zero-shot learning (see, e.g., Palatucci, Hinton, Pomerleau, & Mitchell, 2009; Lampert, Nickisch, & Harmeling, 2014).

The connection between pairwise learning and multitask learning allows one to distinguish different prediction settings that are crucial in the context of this letter. Formally, four settings for predicting the label of the dyad $(u,v)$ can be distinguished in pairwise learning, based on whether testing objects are in-sample (appear in the training data) or out-of-sample (do not appear in the training data):

- *Setting A*: Both $u$ and $v$ are observed during training, as parts of different dyads, but the label of the dyad $(u,v)$ must be predicted.
- *Setting B*: Only $v$ is known during training, while $u$ is not observed in any training dyad, and the label of the dyad $(u,v)$ must be predicted.
- *Setting C*: Only $u$ is known during training, while $v$ is not observed in any training dyad, and the label of the dyad $(u,v)$ must be predicted.
- *Setting D*: Neither $u$ nor $v$ occurs in any training dyad, and the label of the dyad $(u,v)$ must be predicted.

Figure 1 shows data of the four settings graphically in four matrix representations. Setting A resembles a matrix completion or matrix filtering scenario, as typically encountered in collaborative filtering problems. In principle, feature representations are not needed if the structure of the matrix is exploited to generate predictions, but additional information might be helpful. Setting B resembles a classical multitask learning scenario, where the columns represent instances and the rows tasks. For a predefined set of tasks, one aims to predict the labels of novel instances. Setting C then considers the converse setting, where the instances are all known during training and some tasks are unobserved. This setting is in essence identical to setting B if one interchanges the notions of task and instance. Setting D is the most difficult prediction setting of the four. In the multitask learning literature, this setting is known as zero-shot learning, as one aims to predict the labels of tasks with zero training data.

In pairwise learning, it is extremely important to distinguish these four prediction scenarios. Without bearing them in mind, one might select the wrong model for the given scenario or obtain an under- or overestimation of the generalization error. For example, a pairwise recommender system that can generalize well to new users might perform poorly for new items. In a large-scale metastudy about biological network identification, it was found that these concepts are vital to correctly evaluate pairwise learning models (Park & Marcotte, 2012). Certain properties of different models discussed in this work hold only for certain settings.

### 1.2 Kernel Methods for Pairwise Learning with Complete Data Sets

During the past decade, various types of methods for pairwise learning have been proposed in the literature. Kernel methods in particular have been extensively used (see, e.g., Vert & Yamanishi, 2005; Zaki, Lazarova-Molnar, El-Hajj, & Campbell, 2009; Huynh-Thu, Irrthum, Wehenkel, & Geurts, 2010; van Laarhoven, Nabuurs, & Marchiori, 2011; Cao et al., 2012; Liu & Yang, 2015). Especially in bioinformatics applications, they have been popular because biological entities are often easier to represent in terms of similarity scores than feature representations (Ben-Hur & Noble, 2005; Shen et al., 2007; Vert, Qiu, & Noble, 2007).^{1}

In this work, we focus on kernel methods for pairwise learning. We believe that kernel methods have a number of appealing properties:

First, the methods that we analyze in this letter are general-purpose methods. They can be applied to a wide range of settings, including settings A to D and a wide range of application domains. More recent methods might outperform kernel methods in specific situations, but they are usually not applicable to settings A to D at the same time, or they are mainly developed for specific application domains with very specific types of data sets (e.g., computer vision and text mining data sets).

Second, the methods that we analyze often form an essential building block of more recent (and more complicated) methods. This is, for example, the case for zero-shot learning methods in computer vision. It is therefore important to provide a theoretical analysis of older methods in order to gain a better understanding of more recent methods, which are often black-box engineering approaches. More details on this aspect are given in section 4.

Third, the methods that we analyze in this paper are still clear winners for specific scenarios. One of those scenarios is cross-validation in pairwise learning, for which kernel methods outperform other methods substantially with regard to computational scalability. Furthermore, scalable and exact algorithms can be derived to learn a model online or when the data set is not complete (see definition 1). For more information on these aspects, we refer readers to our complementary work (Stock, De Baets, & Waegeman, 2017; Stock, 2017; Stock, Pahikkala, Airola, Waegeman, & De Baets, 2018).

These three reasons are the key motivations for studying kernel-based pairwise learning methods from a theoretical perspective. The key idea of extending kernel methods to pairwise learning is to construct so-called pairwise kernels, which measure the similarity between two dyads $(u,v)$ and $(\bar{u},\bar{v})$. Kernels of that kind can be used in tandem with any conventional kernelized learning algorithm such as support vector machines, kernel ridge regression (KRR), and kernel Fisher discriminant analysis. In this letter, we particularly focus on pairwise learning methods that are inspired by kernel ridge regression. Due to the algebraic properties of such methods, they are especially useful when analyzing so-called complete data sets in pairwise learning.

**Definition 1** (complete data set). A training set is called complete if it contains exactly one labeled example for every dyad $(u,v) \in U \times V$.

If the label matrix contains only a few missing labels, matrix imputation methods can be applied to render the matrix complete (Mazumder, Hastie, & Tibshirani, 2010; Stekhoven & Bühlmann, 2012; Zachariah & Sundin, 2012). Complete data sets, however, occur frequently, for example, in biological networks such as drug-protein interactions or species interactions. Here, screenings or field studies generate a set of observed interactions, while interactions that are not observed are either truly absent or false negatives (Schrynemackers, Küffner, & Geurts, 2013; Jordano, 2016). In such cases, the positive instances are labeled 1 and the negatives are labeled 0. Theoretical work by Elkan and Noto (2008) has shown that models can still be learned from such data sets. Outside of biological network inference, complete data sets occur in recommender systems with implicit feedback; for example, buying a book can be seen as a proxy for liking it (Isinkaye, Folajimi, & Ojokoh, 2015). Setting A (reestimating labels) is still relevant for such data sets if the labels are noisy or contain false positives or false negatives. A pairwise learning model can be used to detect and curate such errors.

For a complete training set, we introduce a further notation for the matrix of labels $Y \in \mathbb{R}^{m \times q}$, so that its rows are indexed by the objects in $U$ and the columns by the objects in $V$. Furthermore, we use $Y_{i\cdot}$ (resp. $Y_{\cdot j}$) to denote the $i$th row (resp. $j$th column) of $Y$. The vectorization of the matrix $Y$, obtained by stacking its columns in one long vector, will be denoted $y$.
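As a small illustration of this vectorization convention, using numpy's column-major (`order="F"`) flattening:

```python
import numpy as np

# m = 2 instances (rows of Y), q = 3 tasks (columns of Y)
Y = np.array([[1, 2, 3],
              [4, 5, 6]])
y = Y.flatten(order="F")  # vec(Y): stack the columns into one long vector
```

Here `y` equals `[1, 4, 2, 5, 3, 6]`.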

### 1.3 Scope and Objectives of This Letter

The goal of this letter is to provide theoretical insight into the working of existing pairwise learning methods that are based on kernel ridge regression. To this end, we focus on scenarios with complete training data sets while analyzing the behavior for settings A to D. More specifically, we provide an in-depth discussion of the following four methods:

- *Kronecker kernel ridge regression*: Adopting a least-squares formulation, this method is representative of many existing systems based on pairwise kernels.
- *Two-step kernel ridge regression*: This recent method has some interesting properties, such as simplicity and computational efficiency. The method was independently proposed in Pahikkala et al. (2014) and Romera-Paredes and Torr (2015). In a variant of it, tree-based methods replace kernel ridge regression as base learners (Schrynemackers, Wehenkel, Babu, & Geurts, 2015). In a statistical context, similar models have been developed for structural equation modeling (Bollen, 1996; Bollen & Bauer, 2004; Jung, 2013).
- *Linear matrix filtering*: This recently proposed method provides predictions in setting A without the need for object features, similar to collaborative filtering methods. Though simple, this linear filter was found to perform very well in predicting interactions in a variety of species-species and protein-ligand interaction data sets (Stock, Poisot, Waegeman, & De Baets, 2017; Stock, 2017). On these data sets, it outperforms standard matrix factorization methods and is very tolerant of a large number of false negatives in the label matrices.
- *Independent-task kernel ridge regression*: This method serves as a baseline and a building block for some of the other methods. This approach resembles the traditional kernel ridge regression method, applied to each task (i.e., each column of $Y$) separately. When the method is applied to a single task, we speak of single-task kernel ridge regression.

The learning properties of the four methods are theoretically analyzed in section 3. In a first series of results, we establish equivalences using special kernels and algebraic operations. We discuss several links that are specific for settings A, B, C, or D. Figure 2 gives an overview of what readers might expect to learn. In a second series of results, we prove the universality of Kronecker product pairwise kernels, and we analyze the consistency of the algorithms that can be derived from such kernels. To this end, we provide a spectral interpretation of Kronecker and two-step kernel ridge regression. This will give further insight into the behavior of these methods.

## 2 Pairwise Learning with Methods Based on Kernel Ridge Regression

In this section we formally review the four methods outlined in section 1. We start by explaining a baseline multitask learning formulation that will be needed to understand more complicated methods. We call this method independent-task kernel ridge regression, since it constructs independent models for the different tasks, that is, the different columns of $Y$. Subsequently, we elaborate on Kronecker kernel ridge regression as an instantiation of a method that employs pairwise kernels. In sections 2.3 and 2.4, we review two-step kernel ridge regression and the linear matrix filter. In what follows we adopt a multitask learning formulation in which the objects of $U$ and $V$ are referred to as instances and tasks, respectively.

### 2.1 Independent-Task Kernel Ridge Regression
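As a minimal numpy sketch (with hypothetical function names), independent-task KRR solves one ordinary kernel ridge regression per column of $Y$; since all tasks share the same instance kernel matrix, a single linear solve handles every column at once:

```python
import numpy as np

def independent_task_krr_fit(K, Y, lam):
    # One ridge regression per task (column of Y). All tasks share the
    # same instance kernel matrix K, so one solve covers all columns.
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), Y)

def independent_task_krr_predict(A, k_new):
    # k_new: kernel evaluations between new instances and training instances.
    return k_new @ A
```

Note that no information is shared between the columns: each task is fitted in complete isolation.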

### 2.2 Pairwise and Kronecker Kernel Ridge Regression

Several authors have pointed out that while the size of the system in equation 2.5 is considerable, its solution for the Kronecker product kernel can be found efficiently via tensor algebraic optimization (Van Loan, 2000; Martin & Van Loan, 2006; Kashima, Kato, Yamanishi, Sugiyama, & Tsuda, 2009; Raymond & Kashima, 2010; Pahikkala et al., 2013; Álvarez et al., 2012). This is because the eigenvalue decomposition of a Kronecker product of two matrices can easily be computed from the eigenvalue decompositions of the individual matrices. The time complexity scales roughly as $O(m^3 + q^3)$, which is required for computing the singular value decompositions of $K$ and $G$ (see property 2 in the appendix), but the complexity can be reduced even further by using sparse kernel matrix approximations (Mahoney, 2011; Gittens & Mahoney, 2013).

However, these computational shortcuts concern only the case in which the training set is complete. If some of the instance-task pairs in the training set are missing or if there are several occurrences of certain pairs, one has to resort, for example, to gradient-descent-based training approaches (Park & Chu, 2009; Pahikkala et al., 2013; Kashima et al., 2009; Airola & Pahikkala, 2018). While the training can be accelerated via tensor algebraic optimization, such techniques still remain considerably slower than the approach based on eigenvalue decomposition.

### 2.3 Two-Step Kernel Ridge Regression

Clearly, independent-task ridge regression can generalize to new instances, but not to new tasks, as no dependence between the tasks is encoded in the model. Kronecker KRR, on the other hand, can be used for all four prediction settings depicted in Figure 1. But since our definition of *instances* and *tasks* is purely conventional, nothing prevents us from building a model using the kernel function $g(\cdot,\cdot)$ to generalize to new tasks for the same instances. By combining two ordinary KRRs, one for generalizing to new instances and one that generalizes to new tasks, one can indirectly predict for new dyads.
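This two-step procedure can be sketched as follows (hypothetical function names; `lam_u` and `lam_v` denote the instance- and task-level regularization parameters):

```python
import numpy as np

def two_step_krr_fit(K, G, Y, lam_u, lam_v):
    m, q = Y.shape
    # Step 1: ordinary KRR over instances, one model per task (column of Y).
    A = np.linalg.solve(K + lam_u * np.eye(m), Y)
    # Step 2: KRR over tasks, applied to the outputs of the first step.
    A = np.linalg.solve(G + lam_v * np.eye(q), A.T).T
    return A  # dual parameters (m x q)

def two_step_krr_predict(A, k_new, g_new):
    # k_new: kernels between new and training instances
    # g_new: kernels between new and training tasks
    return k_new @ A @ g_new.T
```

Both steps are closed-form linear solves, which is what makes the method so simple and computationally attractive.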

### 2.4 Linear Filter for Matrices

Single-task KRR uses a feature description only for the objects $u$, while Kronecker and two-step KRR incorporate feature descriptions of both objects $u$ and $v$. Is it possible to make predictions without any features at all? Obviously this would be possible only for setting A, where both objects are known during training. The structure of the label matrix $Y$ (e.g., being low rank) often contains enough information to successfully make predictions for this setting. In recommender systems, methods that do not take side features into account are often categorized as collaborative filtering methods (Su & Khoshgoftaar, 2009).

In order to use our framework, we have to construct some feature description in the form of a kernel function. An object $u$ (resp. $v$) can be described by the observed labels of the dyads that contain the object. In the context of item recommendation, this seems reasonable: users are described by the ratings they have given to items, and items are described by users' ratings. For example, Basilico and Hofmann (2004) use a kernel based on the Pearson correlation of rating vectors of users to obtain a kernel description of users for collaborative filtering. In bioinformatics, van Laarhoven and colleagues (2011) predict drug-target interactions using so-called gaussian interaction profile kernels, that is, the classical radial basis kernel applied to the corresponding row or column of the label matrix. There is nothing inherently wrong with using the labels to construct feature descriptions or kernels for the object. One should be cautious only when taking a holdout set for model selection or model evaluation; the omitted labels should also be removed from the feature description to prevent overfitting.
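A gaussian interaction profile kernel of this kind can be sketched as follows (hypothetical function name; each row of the label matrix serves as the feature vector of the corresponding object):

```python
import numpy as np

def gip_kernel(Y, gamma=1.0):
    # Gaussian interaction profile kernel: an RBF kernel between rows of
    # the label matrix, so each object is described by its observed labels.
    sq = np.sum(Y ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T  # squared distances
    return np.exp(-gamma * np.maximum(d2, 0.0))
```

Applying the same function to `Y.T` yields the corresponding kernel between columns; when evaluating on a holdout set, the held-out labels should first be removed from `Y`, as noted above.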

Kernels that take observed labels into account, such as the gaussian interaction profile kernel, are in theory quite powerful. Because they can be used to learn nonlinear associations, they lead to more expressive models than matrix factorization. The advantage of using these kernels compared to other collaborative filtering techniques such as matrix factorization, $k$-nearest neighbors, or restricted Boltzmann machines is that side features can elegantly be incorporated into the model. To this end, one only has to combine the collaborative and content-based kernel matrices, for example, by computing a weighted sum or element-wise multiplication.

In section 3.2 we will show that this linear filter is a special instance of Kronecker KRR. This filter can hence be written in the form of equation 1.1 with the parameters obtained by solving a system of the form 1.2. In practice, however, one would always prefer to work directly using equation 2.15. The parameters $\alpha_1, \alpha_2, \alpha_3$, and $\alpha_4$ can be set by means of leave-one-pair-out cross validation using equation 2.16.
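A sketch of a linear filter of this form, assuming predictions are a weighted combination of the observed label and its row, column, and overall averages (the exact parameterization of equation 2.15 may differ; the function name is hypothetical):

```python
import numpy as np

def linear_filter(Y, a1, a2, a3, a4):
    # Weighted combination of the observed label, its row average,
    # its column average, and the overall average of the label matrix.
    row = Y.mean(axis=1, keepdims=True)   # one average per instance
    col = Y.mean(axis=0, keepdims=True)   # one average per task
    tot = Y.mean()                        # grand average
    return a1 * Y + a2 * row + a3 * col + a4 * tot
```

With $\alpha_1 = 1$ and the remaining weights zero, the filter simply reproduces the observed labels; the other terms smooth each prediction toward the row, column, and global means.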

## 3 Theoretical Considerations

In sections 3.1 and 3.2 we show how the four methods of section 2 are related via special kernels and algebraic equivalences. We establish several links that are specific for settings A, B, C, or D. Therefore, each result is formulated as a theorem that indicates the setting to which it applies in its header. In section 3.3, the universality of the Kronecker product pairwise kernels is proven. This result provides a theoretical justification for the observation that Kronecker-based systems often obtain very satisfactory performance in empirical studies. The universality is also used to prove the consistency of the methods that we analyze. This is done in section 3.4 via a spectral interpretation. In addition, this interpretation also allows us to illustrate that two-step kernel ridge regression adopts a special decomposable filter.

### 3.1 Equivalence between Two-Step and Other Kernel Ridge Regression Methods

The relation between two-step kernel ridge regression and independent-task ridge regression is given in the following theorem.

When $G$ is singular, the $q$ outputs for the different tasks are projected onto a lower-dimensional subspace by two-step KRR. This means that a dependence between the tasks is enforced even when $\lambda_v = 0$.

The connection between two-step and Kronecker KRR is established by the following results.

An important consequence of theorem 3 is that it formulates two-step KRR as an empirical risk minimization problem for setting A (see equation 3.1). It is important to note that the pairwise kernel matrix $\Xi$ appears only in the regularization term of this variational problem. The loss function depends only on the predicted values $F$ and the label matrix $Y$. Using two-step KRR for setting A when dealing with incomplete data is thus well defined. The empirical risk minimization problem of equation 3.1 can be modified so that the squared loss takes only the observed dyads into account:

Two-step and Kronecker KRR also coincide in an interesting way for setting D (i.e., the case in which no labeled data are available for the target task). This in turn will allow us to show the consistency of two-step KRR via its universal approximation and spectral regularization properties. Theorem 4 shows the relation between two-step KRR and ordinary Kronecker KRR for setting D.

### 3.2 Smoother Kernels Lead to the Linear Filter

Using these kernels in the Kronecker-based models has an interesting interpretation: the predictions can be written as a weighted sum of averages.

The smoother kernel is thus quite restrictive in the type of models that can be learned. It can only exploit the fact that some rows or columns have a larger average value (e.g., in item recommendation, some items have a high average rating, independent of the user). Nevertheless, it can lead to good baseline predictions for setting A and is particularly useful for small data sets with no side features, such as species interaction networks.

### 3.3 Universality of the Kronecker Product Pairwise Kernel

Here we consider the universal approximation properties of Kronecker KRR and, by theorems 3 and 4, of two-step KRR. This is a necessary step in showing the consistency of the latter method. First, recall the concept of universal kernel functions.

**Definition** (Steinwart, 2002). A continuous kernel $k(\cdot,\cdot)$ on a compact metric space $X$ (i.e., $X$ is closed and bounded) is called universal if the reproducing kernel Hilbert space (RKHS) induced by $k(\cdot,\cdot)$ is dense in $C(X)$, where $C(X)$ is the space of all continuous functions $f: X \to \mathbb{R}$.

The universality property indicates that the hypothesis space induced by a universal kernel can approximate any continuous function on the input space $X$ arbitrarily well, given that the available set of training data is large and representative enough and the learning algorithm can efficiently find this approximation in the hypothesis space (Steinwart, 2002). In other words, the learning algorithm is consistent in the sense that, informally put, the hypothesis learned by it gets closer to the function to be learned as the size of the training set grows. The consistency properties of two-step KRR are considered in more detail in section 3.4.

Next, we consider the universality of the Kronecker product pairwise kernel. The following result is a straightforward modification of existing results in the literature (e.g., Waegeman et al., 2012), but it is presented here to keep this letter self-contained. This theorem mainly concerns setting D but covers the other settings as special cases.

**Theorem.** The Kronecker product pairwise kernel $\Gamma((\cdot,\cdot),(\cdot,\cdot))$ on $U \times V$ defined in equation 2.6 is universal if the instance kernel $k(\cdot,\cdot)$ on $U$ and the task kernel $g(\cdot,\cdot)$ on $V$ are both universal.

**Proof.** The space $U \times V$ is compact if both $U$ and $V$ are compact, by Tikhonov's theorem. It is straightforward to see that $C(U) \otimes C(V)$ is a subalgebra of $C(U \times V)$ that separates points in $U \times V$ and vanishes at no point of $U \times V$; it is therefore dense in $C(U \times V)$ by the Stone-Weierstraß theorem. Thus, $H(k) \otimes H(g)$ is also dense in $C(U \times V)$, and $\Gamma$ is a universal kernel on $U \times V$. $\square$

### 3.4 Spectral Interpretation and Consistency

In this section we study the difference between independent-task, Kronecker, and two-step KRR from the point of view of spectral regularization. The universal approximation properties of the Kronecker product pairwise kernel, established above, are also connected to the consistency properties of two-step KRR, as we elaborate in more detail in this section.

In the literature, the constant $\bar{\nu}$ is called the qualification of the regularizer, and it is related to the consistency properties of the learning method, as we will describe in more detail.


This can be verified by direct computation. $\square$



The following result is assembled from the existing literature concerning spectral filtering-based regularization methods, and we present it here only in a rather abstract form. (For the details and further elaboration, we refer to Baldassarre et al., 2012; Lo Gerfo et al., 2008; and Bauer et al., 2007.)

Assume the filter function is admissible in the sense of definition 11. Furthermore, if the regularization parameter is set as $\lambda = n^{-\frac{1}{2\bar{\nu}+1}}$, where $n$ denotes the number of independent and identically drawn training examples, the following holds with high probability:

Intuitively put, the universality of the kernel ensures that the regression function belongs to the hypothesis space of the learning algorithm, and the admissibility of the regularizer ensures that $R(f^{\lambda})$ converges to it as the size of the training set approaches infinity, at a reasonable rate.

The result follows from the admissibility of the pairwise filter function, the universality of the pairwise Kronecker product kernel, and the fact that the training set consists of at least $\min(m,q)$ independent and identically drawn training examples. $\square$

Hence, two-step KRR is not only a universal method (it can approximate any continuous pairwise prediction function) but also converges to the prediction function that generated the data when provided with enough training examples.

## 4 Related Work

In section 1, we argued that it remains important to study the theoretical properties of kernel methods for three reasons: (1) kernel methods are general-purpose instruments, (2) they often serve as building blocks for more complicated methods, and (3) they clearly outperform other methods for specific scenarios such as cross-validation. As such observations have been reported in other papers, including quantitative results on real-world data sets, we see no merit in providing additional experimental evidence. We refer to other work that compares the kernel methods discussed in this letter with other machine learning methods (e.g., Ding, Takigawa, Mamitsuka, & Zhu, 2013; Romera-Paredes & Torr, 2015; Schrynemackers, Wehenkel, Babu, & Geurts, 2015; Stock, Poisot, Waegeman, & De Baets, 2017). However, it remains important to outline the commonalities and differences with other methods. In what follows, we subdivide these methods according to their applicability to settings A, B, C, or D.

### 4.1 Methods That Are Applicable to Setting A

In this section, we review methods for setting A—matrix completion methods. In section 2, such methods were claimed to be useful for a pairwise learning setting with partially observed matrices $Y$. Both $u$ and $v$ are observed, but not for all instance-target combinations. In setting A, side information about instances or targets is not required per se. We hence distinguish between methods that ignore side information and methods that also exploit such information, in addition to analyzing the matrix $Y$.

Inspired by the Netflix challenge in 2006, the former type of methods has been mainly popular in the area of recommender systems. Those methods often impute missing values by computing a low-rank approximation of the sparsely filled matrix $Y$, and many variants exist in the literature, including algorithms based on nuclear norm minimization (Candes & Recht, 2008), gaussian processes (Lawrence & Urtasun, 2009), probabilistic methods (Shan & Banerjee, 2010), spectral regularization (Mazumder et al., 2010), nonnegative matrix factorization (Gaujoux & Seoighe, 2010), graph-regularized nonnegative matrix factorization (Cai, He, Han, & Huang, 2011), and alternating least-squares minimization (Jain, Netrapalli, & Sanghavi, 2013). In addition to recommender systems, matrix factorization methods are commonly applied to social network analysis (Menon & Elkan, 2010), biological network inference (Gönen, 2012; Liu, Sun, Guan, Zheng, & Zhou, 2015), and travel time estimation in car navigation systems (Dembczyński, Kotłowski, Gawel, Szarecki, & Jaszkiewicz, 2013).
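As an illustration of the low-rank approach, a minimal alternating least-squares sketch for a complete matrix (hypothetical function name; practical systems additionally handle missing entries and bias terms):

```python
import numpy as np

def als_factorize(Y, rank=5, lam=0.1, iters=50, seed=0):
    # Alternating least squares for Y ≈ U @ V.T (complete-matrix case);
    # each step solves a ridge-regularized least squares problem in closed form.
    rng = np.random.default_rng(seed)
    m, q = Y.shape
    U = rng.normal(size=(m, rank))
    V = rng.normal(size=(q, rank))
    R = lam * np.eye(rank)
    for _ in range(iters):
        U = Y @ V @ np.linalg.inv(V.T @ V + R)
        V = Y.T @ U @ np.linalg.inv(U.T @ U + R)
    return U, V
```

Each update is convex given the other factor fixed, so the objective decreases monotonically, although the joint problem is nonconvex.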

In addition to matrix factorization, a few other methods exist for setting A. Historically, memory-based collaborative filtering has been popular, and corresponding methods are very easy to implement. They make predictions for the unknown cells of the matrix by modeling a similarity measure between either rows or columns (see, e.g., Takács, Pilászy, Németh, & Tikk, 2008). For example, when rows and columns correspond to users and items, respectively, one can predict novel items for a particular user by searching for other users with similar interests. To this end, different similarity measures are commonly used, including cosine similarity, Tanimoto similarity, and statistical similarity measures (Basnou, Vicente, Espelta, & Pino, 2015).
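A minimal user-based sketch of this idea with cosine similarity (hypothetical function name and neighborhood size; real systems center ratings and handle missing entries):

```python
import numpy as np

def user_based_scores(Y, k=5):
    # Cosine similarity between users (rows), then predict each user's
    # scores as the similarity-weighted average of the k most similar users.
    norms = np.linalg.norm(Y, axis=1, keepdims=True)
    Yn = Y / np.maximum(norms, 1e-12)
    S = Yn @ Yn.T
    np.fill_diagonal(S, 0.0)          # exclude the user itself
    preds = np.zeros_like(Y, dtype=float)
    for i in range(Y.shape[0]):
        nb = np.argsort(S[i])[-k:]    # indices of the top-k neighbors
        w = S[i, nb]
        preds[i] = w @ Y[nb] / max(w.sum(), 1e-12)
    return preds
```

Swapping `Y` for `Y.T` gives the item-based variant of the same scheme.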

Many variants of matrix factorization and other collaborative methods have been presented in which side information of rows and columns is considered during learning, in addition to exploiting the structure of the matrix $Y$ (see, e.g., Basilico and Hofmann, 2004; Abernethy, Bach, Evgeniou, & Vert, 2008; Adams, Dahl, & Murray, 2010; Fang & Si, 2011; Zhou, Chen, & Ye, 2011; Menon & Elkan, 2011; Zhou, Shan, Banerjee, & Sapiro, 2012; Gönen, 2012; Liu & Yang, 2015; Ezzat, Zhao, Wu, Li, & Kwoh, 2017). One simple but effective method is to extract latent feature representations for instances and targets in a first step, and combine those latent features with explicit features in a second step (Volkovs & Zemel, 2012). To this end, the methods that have been described in this letter could be used, as well as other pairwise learning methods that depart from explicit feature representations.

### 4.2 Methods That Are Applicable to Settings B and C

When side information is available for the objects $u$ and $v$, it would be pointless to ignore this information. The hybrid filtering methods from the previous paragraph seek to combine the best of both worlds by simultaneously modeling side information and the structure of $Y$. In addition to setting A, they can often be applied to settings B and C, which coincide, respectively, with a novel user and a novel item in recommender systems. In that context, one often speaks of cold-start recommendations.

When focusing on settings B and C only, however, a large body of machine learning methods is closely connected to pairwise learning. In fact, many multitarget prediction problems can be interpreted as specific pairwise learning problems. All multitask learning problems, with multilabel classification and multivariate regression problems as special cases, can be seen as pairwise learning problems by calling $u$ an instance and $v$ a label (target/output/task). We refer readers to Waegeman, Dembczynski, and Hüllermeier (2018) for a recent review of the connections between multitarget prediction problems and pairwise learning.

Multitask learning, multilabel classification, and multivariate regression are huge research fields, so it is beyond the scope of this letter to give an in-depth review of all methods developed in those fields. Moreover, not all multitarget prediction methods are relevant for the discussion we intend to provide. Roughly speaking, simple multitarget prediction methods consider side information for only one type of object, let's say the objects $u$, which represent the instances. No side information is available for the targets, which could then be denoted $v$. Since no side information is available for the targets, simple multitarget prediction methods can be applied to only settings B and C. We note that $u$ and $v$ are interchangeable, so settings B and C are identical settings from a theoretical point of view.

The situation changes when side information in the form of relations or feature representations becomes available for both instances and targets. In such a scenario, multitarget prediction methods that process side information about targets are more closely related to the pairwise learning methods that are analyzed in this letter. We will therefore provide a thorough review of such methods in the next section. Furthermore, the availability of side information on both instance and target level implies that setting D can now be covered, in addition to settings B and C. Exploiting side information about targets has two main purposes: it might boost the predictive performance in settings B and C, and it is essential for generalizing to novel targets in setting D.

### 4.3 Methods That Are Applicable to Settings B, C, and D

In setting D, side information for both $u$ and $v$ is essential for generalizing to zero-shot problems, such as a novel target molecule in drug discovery, a novel tag in document tagging, or a novel person in image classification. In this area, kernel methods have played a prominent role in the past, but tree-based methods are also commonly used (Geurts, Touleimat, Dutreix, & D'Alché-Buc, 2007; Schrynemackers et al., 2015). In bioinformatics, a subdivision is usually made between global methods, which construct one predictive model, and local methods, which separate the problem into several subproblems (Vert, 2008; Bleakley & Yamanishi, 2009; Schrynemackers et al., 2013).

Factorization machines (Rendle, 2010, 2012) deserve special mention here, as they can be seen as an extension of matrix factorization methods toward settings B, C, and D. They work by simultaneously learning a lower-dimensional feature embedding and a polynomial (usually of degree two) predictive model. Factorization machines can effectively cope with large, sparse data sets frequently encountered in collaborative and content-based filtering. For such problems, they are expected to outperform kernel methods. Their main drawback, however, is that training them is a nonconvex problem that requires relatively large data sets. The relation between factorization machines, polynomial networks, and kernel machines was recently explored by Blondel, Ishihata, Fujino, and Ueda (2016).
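A degree-two factorization machine models the response as a bias, a linear term, and factorized pairwise interactions, $\hat{y}(x) = w_0 + \sum_i w_i x_i + \sum_{i<j} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j$. A minimal sketch of the prediction step (the function name and shapes are our own assumptions; Rendle's identity reduces the pairwise sum from $O(d^2 k)$ to $O(dk)$):

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Degree-2 factorization machine prediction.

    x : (d,) feature vector, w0 : bias, w : (d,) linear weights,
    V : (d, k) latent factors; the pairwise weight of features i and j
    is the inner product <V[i], V[j]>.
    """
    # Rendle's identity: sum_{i<j} <v_i, v_j> x_i x_j
    #   = 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ]
    pairwise = 0.5 * np.sum((x @ V) ** 2 - (x ** 2) @ (V ** 2))
    return w0 + w @ x + pairwise
```

Because the pairwise weights are inner products of latent factors, the model generalizes to feature pairs never observed together, which is what enables settings B, C, and D.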

In recent years, specific zero-shot learning methods based on deep learning have become extremely popular in image classification applications. The central idea in all those methods is to construct semantic feature representations for class labels, for which various techniques might work. One class of methods constructs binary vectors of visual attributes (Lampert, Nickisch, & Harmeling, 2009; Palatucci et al., 2009; Liu, Kuipers, & Savarese, 2011; Fu, Hospedales, Xiang, & Gong, 2013). Another class of methods rather considers continuous word vectors that describe the linguistic context of images (Mikolov, Chen, Corrado, & Dean, 2013; Frome et al., 2013; Socher et al., 2013).

Some of the zero-shot learning methods from computer vision also turn out to be useful for the related field of text classification. For documents, it is natural to model a latent representation for both the (document) instances and class labels in a joint space (Nam, Loza Mencía, & Fürnkranz, 2016). Nonetheless, many of those approaches are tailor-made for particular application domains. In contrast to kernel methods, they do not provide general-purpose tools for analyzing general data types.

## 5 Conclusion

In this work we have studied several models derived from kernel ridge regression. First, we independently derived single-task kernel ridge regression, Kronecker kernel ridge regression, two-step kernel ridge regression, and the linear filter. Subsequently, we have shown that they are all related: two-step kernel ridge regression and the linear filter are special cases of Kronecker kernel ridge regression, which is itself merely kernel ridge regression with a specific pairwise kernel. From this, universality and consistency results could be derived, motivating the general use of these methods.

Pairwise learning is a broadly applicable machine learning paradigm. It can be applied to problems as diverse as multitask learning, content and collaborative filtering, transfer learning, network inference, and zero-shot learning. This work offers a general tool kit to tackle such problems. Despite being easy to implement and computationally efficient, kernel methods have been found to attain an excellent performance on a wide variety of problems. We believe that the intriguing algebraic properties of the Kronecker product will serve as a basis for developing novel learning algorithms, and we hope that the results of this work will be helpful in that regard.

## Appendix: Matrix Properties

A central trick in pairwise learning is transforming a matrix into a vector. This can be done with the vectorization operation.

**Definition** (vectorization). The vectorization operator $\mathrm{vec}(\cdot)$ is a linear operator that transforms an $n \times m$ matrix $A$ into a column vector of length $nm$ by stacking the columns of $A$ on top of each other.
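Column stacking corresponds to NumPy's Fortran (`"F"`) memory order:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4],
              [5, 6]])             # a 3 x 2 matrix

vec_A = A.flatten(order="F")      # stack the columns of A
# vec_A is [1, 3, 5, 2, 4, 6]
```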

Further, the Kronecker product is defined as follows:

**Definition** (Kronecker product). For matrices $M \in \mathbb{R}^{p \times q}$ and $N \in \mathbb{R}^{r \times s}$, the Kronecker product $M \otimes N$ is the $pr \times qs$ block matrix

$$
M \otimes N = \begin{pmatrix} m_{11}N & \cdots & m_{1q}N \\ \vdots & \ddots & \vdots \\ m_{p1}N & \cdots & m_{pq}N \end{pmatrix}.
$$

**Property 1.** For any conformable matrices $N$, $M$, and $X$, it holds that

$$
\mathrm{vec}(NXM) = (M^\top \otimes N)\,\mathrm{vec}(X).
$$

Computing the Kronecker product of two reasonably large matrices results in a huge matrix, often too large to fit in computer memory. If the Kronecker product is needed only in an intermediate step, the above identity can be used to dramatically reduce computation time and memory requirements.
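The identity $\mathrm{vec}(NXM) = (M^\top \otimes N)\,\mathrm{vec}(X)$ is easy to verify numerically; the sizes below are arbitrary:

```python
import numpy as np

def vec(A):
    return A.flatten(order="F")   # column-stacking vectorization

rng = np.random.default_rng(0)
N = rng.standard_normal((3, 4))
X = rng.standard_normal((4, 5))
M = rng.standard_normal((5, 2))

# Naive evaluation materializes the (6 x 20) Kronecker product ...
naive = np.kron(M.T, N) @ vec(X)
# ... whereas the identity needs only two small matrix products.
fast = vec(N @ X @ M)
assert np.allclose(naive, fast)
```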

Using the eigenvalue decomposition of matrices, a large system of equations using the Kronecker product can be solved efficiently.

**Property 2** (Pahikkala et al., 2013). Let $A, B \in \mathbb{R}^{n \times n}$ be diagonalizable matrices, that is, matrices that can be eigendecomposed as

$$
A = U \Lambda U^{-1} \quad \text{and} \quad B = V \Sigma V^{-1}.
$$

Then the Kronecker product $A \otimes B$ can be eigendecomposed as

$$
A \otimes B = (U \otimes V)(\Lambda \otimes \Sigma)(U \otimes V)^{-1}.
$$

**Property 3.** For two matrices that share the same eigenvectors, $A = U \Lambda U^{-1}$ and $B = U \Sigma U^{-1}$, the product $C = AB$ is given by

$$
C = U \Lambda \Sigma U^{-1}.
$$

**Property 4.** For a diagonalizable matrix $A = U \Lambda U^{-1}$ with no zero eigenvalues, the inverse $D = A^{-1}$ is given by

$$
D = U \Lambda^{-1} U^{-1}.
$$

## Notes

^{1} Recent advances in convolutional neural networks, however, have resulted in intriguing ways to generate representations for molecules (Duvenaud et al., 2015), proteins (Jo, Hou, Eickholt, & Cheng, 2015), and nucleic acids (Alipanahi, Delong, Weirauch, & Frey, 2015). Such feature representations, obtained by pretraining on large data sets, will likely replace kernel methods in the future, at least to some extent.

## Acknowledgments

M. S. is supported by the Research Foundation–Flanders (FWO17/PDO/067). This work was supported by the Academy of Finland (grants 311273 and 313266 to T.P. and grant 289903 to A.A.).