Multi-view Feature Learning for the Over-penalty in Adversarial Domain Adaptation

ABSTRACT Domain adaptation aims to transfer knowledge from a labeled source domain to an unlabeled target domain that follows a similar but different distribution. Recently, adversarial-based methods have achieved remarkable success owing to their excellent performance in learning domain-invariant feature representations. However, adversarial methods learn transferability at the expense of discriminability in the feature representation, leading to poor generalization to the target domain. To this end, we propose a Multi-view Feature Learning method for the Over-penalty in Adversarial Domain Adaptation. Specifically, multi-view representation learning is used to enrich the discriminative information contained in the domain-invariant feature representation, which counters the over-penalty on discriminability in adversarial training. In addition, the intra-domain class distribution replaces the inter-domain class distribution in order to capture more discriminative information when learning transferable features. Extensive experiments show that our method improves discriminability while maintaining transferability and outperforms state-of-the-art methods on domain adaptation benchmark datasets.


INTRODUCTION
Deep learning has achieved great success in a variety of computer vision tasks [1] [2], but such success mainly depends on a large amount of labeled training data and on the i.i.d. assumption. These conditions are often difficult to meet in real-world applications. For example, an image classification model trained on simulated images cannot be directly applied to real image classification because of the distribution divergence. Even the same object may display diverse visual features, and hence a different distribution, owing to the acquisition equipment, lighting, angle, and other factors. To address these challenges, Domain Adaptation (DA) has been proposed, which tries to mitigate the distribution differences between domains and uses the knowledge learned from a related label-rich source domain to assist the learning task in an unlabeled target domain.
Deep learning models have dominated this field owing to their outstanding performance in learning transferable features. These methods fall broadly into two categories: discrepancy-based methods [1], [4], [5] and adversarial-based methods [6], [7], [8], [9], [10], [11], [12], [13]. The former mitigate the distribution discrepancy between the source and target domains by minimizing a discrepancy metric, such as the maximum mean discrepancy (MMD) [4]. The latter are inspired by generative adversarial networks [14] and introduce a new component, the domain discriminator, to realize domain confusion. Adversarial learning is an effective mechanism for learning invariant features in domain adaptation and has become increasingly popular.
There are two key factors in domain adaptation: transferability and discriminability. Transferability depends on the similarity between the two domains and ensures that the model trained on the source domain can be used on the target domain. Discriminability indicates the ability of the learned features to separate different classes. However, existing adversarial methods focus more on transferability, while little attention has been paid to discriminability in the feature representations. This leads to performance degradation in the target domain. There are two main reasons for the lower discriminability of the representation.
1) Recent studies [7] point out that adversarial methods improve transferability at the expense of discriminability, so there is a tension between the two. In particular, when learning domain-invariant features, the eigenvector with the largest singular value tends to carry more transferable knowledge, while the other eigenvectors with smaller singular values may embody domain variations and are therefore weakened. This leads to an over-penalty of these other eigenvectors, which may be crucial for discriminability.
2) Most previous methods measure the inter-class and intra-class distances between the two domains, and alignment based on inaccurate pseudo labels biases these distances.
To address the issues above, we propose a Multi-view Feature Learning method for the Over-penalty in Adversarial Domain Adaptation, which uses multi-view representations to learn more discriminative and transferable information and thereby counters the over-penalty on discriminative information in adversarial learning. In addition, to learn more discriminative features, the class distribution in the source domain replaces the inter-domain class distribution so that the class distance is measured more accurately.
The contributions of our work can be summarized as follows.
- A multi-view learning framework is proposed to enlarge both the transferable and the discriminative feature representation. Multi-view representations contain diverse and complementary information that resists the over-penalizing of discriminative information in adversarial learning.
- To further improve discriminability, the intra-domain class distribution is used to modify the discriminative loss, which measures the class distance more accurately and thus learns more discriminative features.

RELATED WORK
In this section, we will introduce the related work in two aspects: adversarial-based domain adaptation and multi-view learning.

Adversarial-based Domain Adaptation
In recent years, adversarial-based methods have become popular. These methods introduce a new component, the domain discriminator, and a two-player minimax game to realize domain confusion. DANN [9] first introduces adversarial learning into domain adaptation: domain-invariant features are obtained in an adversarial manner such that the domain discriminator cannot distinguish the domain label of the features extracted by the feature extractor. Although DANN [9] achieves impressive results, it only aligns the global distribution without considering multi-mode structure information. Following it, MADA [12] uses multiple domain discriminators to achieve fine-grained alignment. To account for the importance of the marginal and conditional distributions, DAAN [8] proposes a dynamic adversarial factor that evaluates their relative importance dynamically in adversarial domain adaptation. Moreover, JADA [11] matches domain-level and class-level distributions at the same time, achieving better results than matching only one of them. In addition, some approaches have begun to introduce semantic information into domain adaptation. MSTN [13] learns semantic representations for unlabeled target samples by aligning the centers of the labeled source samples and the pseudo-labeled target samples. DSR [10] uses a variational auto-encoder and a dual adversarial network to learn disentangled semantic representations in which the semantic latent variables are independent of the domain latent variables, so that the labels can be classified more easily. To improve the representation, MCD [15] uses task-specific decision boundaries, and MSTN [13] uses pseudo labels to align the class distribution across domains.
Although the adversarial-based approaches mentioned above have achieved impressive results, recent studies [7] have shown that adversarial-based methods can cause a loss of discriminability to some extent while improving transferability, both of which are key factors in domain adaptation. There have been several attempts to solve this problem. BSP [7] finds that the eigenvector with the largest singular value determines the transferability of the features, and that transferability is enhanced at the expense of over-penalizing the other eigenvectors, which contain rich structures and are critical for discriminability. To solve this problem, it penalizes the eigenvector with the largest singular value so that the other eigenvectors are relatively enhanced, which improves discriminability. AADA [6] proposes a new asymmetric adversarial scheme in which the traditional domain discriminator is replaced by an autoencoder and only the target samples take part in adversarial training, thus avoiding the loss of discriminability caused by traditional domain adversarial training.

Multi-View Learning
Multi-view learning is a way to learn different descriptions and characterizations of the same object. Different data-gathering processes yield different feature representations of the same object. These representations contain complementary information, which can provide a more comprehensive description of the object. Many domain adaptation approaches use multi-view learning to address cross-language text classification problems, where documents in different languages can be viewed as different views of the original document. MVTL-LM [16] proposes a multi-view transfer learning framework to achieve consistency between multiple views. However, the application of multi-view learning to cross-domain image classification has not been fully explored. MRAN [17] proposes a multi-representation adaptation network that learns multiple different feature representations via an Inception Adaptation Module (IAM) and aligns their distributions. Because the features contain more information, it achieves better performance than single-representation adaptation.

OUR METHOD
In this section, we cover the details of our method for unsupervised domain adaptation. We are given a source domain with $n_s$ labeled samples and a target domain with $n_t$ unlabeled samples. The goal of our work is to predict the target labels by transferring the knowledge learned from the source domain.
The framework of our work is shown in Fig. 1. The first component is the multi-view feature extractor. A general convolutional neural network (CNN) is used to obtain a low-resolution image, and then different networks are used to extract multi-view feature spaces that enrich and enlarge the feature representations. The second component is multi-view discriminative adversarial training, in which the modified discriminative loss is combined with the discriminator loss to learn transferable and discriminative features for DA.

Multi-View Feature Extractor
Although adversarial-based methods can improve transferability effectively, they also cause a loss of discriminability to some extent. To solve this problem, we borrow the idea of multi-view learning. Descriptions of different views of the same object contain different information, which increases the diversity of the features and thus the discriminative information contained in the feature representations. In our method, we learn multiple different representations to enrich the discriminative information contained in the final feature representation.
Multiple different feature representations are usually obtained by training multiple convolutional networks, but this is very time-consuming. We therefore introduce an Inception Adaptation Module (IAM) [17] containing four different substructures, S1, S2, S3, and S4, to obtain four feature representations with different dimensions. Firstly, we use a general convolutional neural network to preprocess the raw data and obtain a low-resolution image. Secondly, the IAM is used to extract different features. Because features with different dimensions contain more complementary information, this yields the different feature representations.
Different networks are expected to learn different representations of the same sample. In our method, four different convolutional neural networks are used to learn four different representations of each image. Compared with single-view learning, the combination of four representations contains more information, some of it duplicated, which helps resist the over-penalty of the adversarial model. On the one hand, the four representations contain different information, so together they carry more information about the object. On the other hand, there is inevitably redundant or duplicate information among them. Learning multi-view representations with IAM does increase the time cost to some extent, in exchange for improved performance of adversarial-based DA. However, the time cost of the multi-view representations is less than four times that of a single view, because the four subnetworks share the convolutional layers and differ only in the representation layer.

Multi-view Discriminative Adversarial Training
To make the multi-view representations more transferable and discriminative, our method combines a discriminative loss with the discriminator loss. Traditional domain adaptation methods focus only on the discriminator loss, which tries to reduce the domain discrepancy. In contrast, we additionally consider a discriminative loss, which is calculated by maximizing the inter-class distance and minimizing the intra-class distance among source samples at the same time. The discriminative loss is shown in Fig. 2.
The discriminative domain adaptation loss $L_{DDA}$ (Formula 1) consists of two parts. The first part is the discriminator loss $L_D$, the traditional domain confusion loss calculated in an adversarial manner. Unlike previous works, we need to align multiple pairs of representations separately rather than a single representation. Thus $L_D$ (Formula 2) is calculated over all representations, where $L$ is the cross-entropy loss function and $n_r$ is the number of representations. $d_j$ denotes the domain label of a sample, 0 for the source domain and 1 for the target domain. $G_f$ and $G_d$ are the feature extractor and the domain discriminator, respectively.
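As an illustration of the multi-view discriminator loss, the following minimal sketch averages a binary cross-entropy domain loss over the $n_r$ views. It is written in plain Python and is not the implementation used in our experiments: the function names `binary_cross_entropy` and `multi_view_domain_loss`, the uniform averaging over views and samples, and the convention that the discriminator outputs the probability of a sample coming from the target domain are assumptions made for illustration only.

```python
import math


def binary_cross_entropy(prob_target, domain_label, eps=1e-12):
    """Cross-entropy for one sample; domain_label is 0 (source) or 1 (target)."""
    p = min(max(prob_target, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(domain_label * math.log(p) + (1 - domain_label) * math.log(1.0 - p))


def multi_view_domain_loss(view_probs, domain_labels):
    """Average the domain confusion loss over views and over samples.

    view_probs[j][i] is the (hypothetical) discriminator output for sample i
    under view j, i.e. the predicted probability that the sample comes from
    the target domain. domain_labels[i] is 0 for source and 1 for target.
    """
    n_r = len(view_probs)
    total = 0.0
    for probs in view_probs:
        per_sample = [binary_cross_entropy(p, d) for p, d in zip(probs, domain_labels)]
        total += sum(per_sample) / len(per_sample)
    return total / n_r
```

Under these assumptions, a discriminator that outputs 0.5 for every sample, meaning it cannot tell the two domains apart, yields a loss of log 2 for each view.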
The second part of Formula 1 is the discriminative loss $L_{dis}$ (Formula 3), where $x_i^{s,c}$ denotes the $i$-th sample of the source domain that belongs to class $c$, and $\mu$ is a trade-off parameter between the intra-class distance and the inter-class distance.
We attempt to maximize the inter-class distance and minimize the intra-class distance among source domain samples to improve discriminability. The source domain and the target domain are close to each other in the shared feature space. Therefore, by pulling together source samples of the same class and separating samples of different classes, the target domain aligned with the source domain also becomes more discriminative.
It is worth noting the difference between our discriminative loss and others [18]. Because there are no labels in the target domain, those methods have to use pseudo labels of the target domain to calculate the class-wise distance between source and target samples [18], so the accuracy of the pseudo labels has a great influence on the experimental results. In this paper, we therefore do not rely on pseudo labels of the target domain and calculate the discriminative loss only on source samples. Applying the discriminative loss makes the sample distribution of the same class more compact and the boundaries between different classes clearer, which makes the features more discriminative. By combining the domain confusion loss and the discriminative loss, we align the feature distributions of the source and target domains while preserving the discriminative ability of the features.
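To make the source-only computation concrete, the sketch below evaluates one possible discriminative loss on labeled source features. Formula 3 is not reproduced in this sketch, so the specific form used here (mean squared intra-class distance minus $\mu$ times mean squared inter-class distance over all sample pairs) is an assumption chosen only to match the description above; the function names are hypothetical as well.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def source_discriminative_loss(features, labels, mu=0.1):
    """Assumed form: mean intra-class distance minus mu * mean inter-class distance.

    Only labeled source samples are used, so no target pseudo labels are needed.
    Minimizing the value pulls same-class samples together and pushes
    different-class samples apart.
    """
    intra, inter = [], []
    n = len(features)
    for i in range(n):
        for j in range(i + 1, n):
            d = squared_distance(features[i], features[j])
            if labels[i] == labels[j]:
                intra.append(d)
            else:
                inter.append(d)
    mean_intra = sum(intra) / len(intra) if intra else 0.0
    mean_inter = sum(inter) / len(inter) if inter else 0.0
    return mean_intra - mu * mean_inter
```

With this assumed form, collapsing each class onto a single point while keeping the class centers fixed lowers the loss, which is the behavior the discriminative term is meant to encourage.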
For each view, we obtain a representation trained with $L_{DDA}$. The multi-view feature spaces thus yield the different feature representations $f_1$, $f_2$, $f_3$, and $f_4$, and the final feature representation is $f = f_1 \oplus f_2 \oplus f_3 \oplus f_4$, where $\oplus$ denotes feature concatenation. By combining these different features, we obtain a better domain-invariant representation that includes more discriminative information. In this way, the representations are enriched.

The Overall Training Objective
In addition to the loss functions presented above, we also use the labeled source domain data to train an effective classifier $G_y$ with a supervised classification loss (Formula 6). Previous studies have shown that entropy minimization can improve the discrimination of the model on target data, so we add an entropy regularization term (Formula 7), where $\hat{y}_k$ represents the probability of classifying the sample $x$ to label $k$, that is, the softmax output of the classifier. As in prior work, we only use entropy minimization to update the feature extractor.
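The entropy regularization can be illustrated with the short sketch below, which computes the mean Shannon entropy of the softmax predictions on a batch of target samples. This is a generic formulation of entropy minimization rather than the exact code of our experiments; the function names, the small constant `eps`, and the averaging over the batch are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def prediction_entropy(probs, eps=1e-12):
    """Shannon entropy -sum_k p_k log p_k of one predicted distribution."""
    return -sum(p * math.log(p + eps) for p in probs)


def mean_target_entropy(batch_logits):
    """Average prediction entropy over a batch of target-domain logits."""
    entropies = [prediction_entropy(softmax(logits)) for logits in batch_logits]
    return sum(entropies) / len(entropies)
```

A uniform prediction over $K$ classes has the maximum entropy $\log K$, while a confident prediction has an entropy close to zero, so minimizing this term pushes target predictions away from the decision boundaries.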
The overall objective function and the optimization procedure are formulated as a min-max problem (Formula 8), where $\lambda \in [0, 1]$ is a trade-off parameter.
With the above, the pseudocode of our method is shown in Algorithm 1.
1: for i = 1 to I do
2:   Use a general CNN to preprocess the raw data and obtain a low-resolution image.
3:   Use IAM to extract different features.
4:   for j = 1 to $n_r$ do
5:     Use $X^j_{src}$ and $X^j_{tar}$ to compute the traditional domain confusion loss as in Formula (2).
6:     Use $X^j_{src}$ to compute the discriminative loss as in Formula (3).
7:     Use the domain confusion loss and the discriminative loss to compute the discriminative domain adaptation loss as in Formula (1).
8:   end for
9:   Join the multiple features.
10:  Use the joint feature to compute the classifier loss as in Formula (6).
11:  Compute the regularization term as in Formula (7).
12:  Update the feature extractor $G_f$, the domain discriminator $G_d$, and the classifier $G_y$ according to the total loss in Formula (8).
13: end for
14: return result

EXPERIMENT
We conduct experiments on three benchmark datasets to evaluate the effectiveness of our method.

Setup
Office-31 is a popular benchmark dataset for domain adaptation. It contains 4110 images from 31 classes and consists of three domains: Amazon (A), DSLR (D), and Webcam (W). We evaluate our method on all six adaptation tasks with the standard evaluation protocol.
ImageCLEF-DA is a dataset for the ImageCLEF 2014 domain adaptation challenge, which includes three domains: Caltech-256 (C), ImageNet ILSVRC 2012 (I), and Pascal VOC 2012 (P). The domains are of the same size, each containing 12 classes with 50 images per class. We consider all six adaptation tasks.
Office-Home is a more challenging dataset than Office-31 and ImageCLEF-DA because it includes 15500 images from 65 classes in four extremely distinct domains: Artistic images (Ar), Clipart images (Cl), Product images (Pr), and Real-World images (Rw). We consider all twelve adaptation tasks.
We implement our method in PyTorch. For all datasets, we use ResNet-50 pre-trained on the ImageNet dataset as the backbone network, and we follow the standard evaluation protocols for domain adaptation as in DANN [9]. We use mini-batch stochastic gradient descent with a momentum of 0.9 to update the parameters. The base learning rate for the feature extractor is 0.001, and the learning rate of the classifier is 10 times that of the feature extractor. The learning rate is annealed by $\eta_p = \eta_0 / (1 + \alpha p)^{\beta}$ [9], where $\eta_0 = 0.01$, $\alpha = 10$, $\beta = 0.75$, and $p$ denotes the training progress changing linearly from 0 to 1. To reduce the influence of noise at the early stage, $\lambda$ is gradually increased from 0 to 1 by the schedule $\lambda_p = 2 / (1 + \exp(-\gamma p)) - 1$ [22], where $\gamma$ is fixed to 10. We set $\mu$ in the discriminative loss to 0.1. The structures used to extract different features are S1(conv1×1, conv5×5), S2(conv1×1, conv3×3, conv3×3), S3(conv1×1), and S4(pool, conv1×1), which are borrowed from IAM [17], so the number of representations $n_r$ is 4 in our method. IAM is inspired by GoogLeNet [23] and uses the inception module to fuse multiple representations.
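The two schedules can be written as small helper functions. The sketch below assumes the standard forms used in DANN [9], $\eta_p = \eta_0 / (1 + \alpha p)^{\beta}$ for the learning rate and $\lambda_p = 2 / (1 + \exp(-\gamma p)) - 1$ for the adaptation weight, with the hyperparameter values stated above as defaults; the function names are hypothetical.

```python
import math


def annealed_learning_rate(p, eta0=0.01, alpha=10.0, beta=0.75):
    """Learning rate at training progress p in [0, 1]."""
    return eta0 / (1.0 + alpha * p) ** beta


def adaptation_weight(p, gamma=10.0):
    """Trade-off weight lambda at training progress p, rising from 0 toward 1."""
    return 2.0 / (1.0 + math.exp(-gamma * p)) - 1.0


if __name__ == "__main__":
    # Print both schedules at a few points of training progress.
    for step in range(0, 11, 2):
        p = step / 10.0
        print(f"p={p:.1f}  lr={annealed_learning_rate(p):.5f}  lambda={adaptation_weight(p):.4f}")
```

The learning rate starts at $\eta_0$ and decays monotonically, while $\lambda$ starts at exactly 0 and saturates close to 1, which keeps the noisy adversarial signal small in the early iterations.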

Results
The classification results on the OfficeHome, ImageCLEF-DA, and Office-31 datasets are shown in Table I, Table II, and Table III, respectively. From the results, we can see that:
• Our method outperforms all baselines in most domain adaptation tasks on the three datasets. This illustrates the superiority of our proposed method.
• Methods that consider both transferability and discriminability perform better than those that consider only one of them. Compared with DANN, MADA, CDAN, and other adversarial-based methods, BSP, AADA, and our method perform better because they consider both transferability and discriminability and focus on the over-penalty in adversarial learning. In particular, BSP+DANN and BSP+CDAN perform better than DANN and CDAN.
• Among all methods that consider discriminability, our method achieves state-of-the-art results, which shows that multi-view learning is effective in enriching the discriminative information contained in the domain-invariant features. BSP improves discriminability by enhancing the other eigenvectors, and AADA is a recent method that proposes a new asymmetric adversarial scheme to avoid the loss of discriminability. Different from them, our method uses multiple representations to enrich the discriminative information contained in the domain-invariant features and also uses a discriminative loss to improve discriminability, achieving better results. This indicates that our method can improve transferability and discriminability simultaneously.
• Both MRAN and our method learn multiple feature representations. MRAN uses the traditional MMD distance measurement, while our method uses an adversarial manner. Compared with DAN and DANN, the multi-representation adaptation methods perform better than the single-representation methods, which illustrates that multi-view representations can enrich the information contained in the domain-invariant features. In addition, our method further uses a discriminative loss and achieves better results than MRAN.

Analysis
Spectral Analysis
In this section, we further show that our method can resist the over-penalty in adversarial models while maintaining transferability. Previous studies [5] have proposed that the Singular Values (SV) and Corresponding Angles (CA) of the eigenvectors obtained by Singular Value Decomposition (SVD) of the feature representation can be used to compare discriminability. Inspired by this, we conduct an experiment on the more difficult task D→A. Firstly, we apply SVD to the source feature matrix $F_s = [f_1^s, \ldots, f_b^s]$ and the target feature matrix $F_t = [f_1^t, \ldots, f_b^t]$ to compute the singular values and eigenvectors:
$$F_s = U_s \Sigma_s V_s^{\top}, \qquad F_t = U_t \Sigma_t V_t^{\top}$$
where $b$ is the batch size, $U_s$ and $U_t$ denote the eigenvectors, $\Sigma_s$ and $\Sigma_t$ are the diagonal matrices of singular values, and $V_s$ and $V_t$ are unitary matrices.
In Fig. 3(a), we plot the normalized singular values of three models: ResNet, DANN, and ours. It can be observed that the maximum singular value of the DANN feature matrix is significantly larger than the other singular values, which impairs the information carried by the eigenvectors with smaller singular values. In comparison, our method effectively reduces the large gap between the maximum and the other singular values, which preserves more discriminability in feature learning.
In Fig. 3(b), we plot the normalized corresponding angles of the singular values. The corresponding angle is defined as the angle between the two eigenvectors, one from each domain, that correspond to the same singular value index, and it reflects the transferability of the features. For DANN, the sharp decay indicates that the eigenvector with the largest singular value dominates the transferability of the feature representation, so transferability is enhanced at the expense of over-penalizing the other eigenvectors, which embody rich structures crucial for discriminability. In contrast, our method gives the subsequent eigenvectors a more prominent role during the transfer process.
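The two spectral quantities used in this analysis can be sketched as follows, assuming that the singular values and the eigenvectors of the source and target feature matrices have already been obtained from an SVD routine. The function names are hypothetical, and taking the absolute value of the inner product (to remove the sign ambiguity of singular vectors) is an assumption of this sketch.

```python
import math


def normalize_singular_values(singular_values):
    """Scale singular values so that the largest one equals 1."""
    largest = max(singular_values)
    return [s / largest for s in singular_values]


def corresponding_angle_cosine(u_source, u_target):
    """Cosine of the angle between a source and a target eigenvector.

    Both vectors are assumed to share the same singular value index. The
    absolute value makes the result independent of the arbitrary sign of
    each singular vector.
    """
    dot = sum(a * b for a, b in zip(u_source, u_target))
    norm_s = math.sqrt(sum(a * a for a in u_source))
    norm_t = math.sqrt(sum(b * b for b in u_target))
    return abs(dot) / (norm_s * norm_t)
```

A cosine close to 1 means that the two domains share the corresponding direction, whereas a cosine close to 0 means that the direction is domain-specific.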
In Fig. 3(c), we plot the A-distance, a measure of domain discrepancy that reflects the transferability of feature representations. It is defined as $d_A = 2(1 - 2\epsilon)$, where $\epsilon$ is the error rate of a domain classifier. The A-distance computed on the features of our method is smaller than that of DANN, which indicates that our method not only enhances discriminability but also retains good transferability.
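The A-distance follows directly from the error rate of the domain classifier, as the minimal sketch below shows. The function name and the input validation are illustrative additions, and how the domain classifier is trained is outside the scope of this sketch.

```python
def proxy_a_distance(domain_classifier_error):
    """A-distance d_A = 2 * (1 - 2 * epsilon) for an error rate epsilon in [0, 1]."""
    if not 0.0 <= domain_classifier_error <= 1.0:
        raise ValueError("error rate must lie in [0, 1]")
    return 2.0 * (1.0 - 2.0 * domain_classifier_error)
```

An error rate of 0.5 corresponds to a classifier that cannot separate the two domains and gives $d_A = 0$, while a perfect domain classifier gives $d_A = 2$.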

Ablation Study
In this section, we study the contribution of the different components of our model on the Office-31 dataset. The study consists of two parts: the influence of the discriminative loss and the influence of the number of representations. The results are shown in Table IV. The first part shows the importance of the discriminative loss. The variant "w/o discriminative loss" removes the discriminative loss from our model, and the variant "MMD for discriminability" replaces it with the MMD-based discriminability term of JPDA [18]. The variant without the discriminative loss performs better than the MMD variant, which shows that the way the discriminative loss is calculated in our method is effective. We then discuss the effect of the number of representations. The "single feature" variant removes the multiple representations and uses only a single representation to perform domain adaptation. The results show that accuracy improves as the number of representations increases. As with MRAN [17], we consider at most four representations. When either part is removed, the accuracy declines, so every component of our framework is necessary.

Sensitivity to Parameters
μ is a trade-off parameter between the intra-class distance and the inter-class distance. Overall, the performance of our method is not sensitive to μ: the curve in Fig. 5 does not fluctuate much as μ changes. The best performance is achieved when μ falls in [0.08, 0.13].

Feature Visualization
To show the adaptation effect of the method more intuitively, we use t-SNE [28] to embed the features learned by DANN [9] and by our method on the adaptation task A→W of the Office-31 dataset and Art→Product of the OfficeHome dataset. The visualization result is shown in Fig. 4. Compared with traditional adversarial-based methods that only consider transferability, our method better aligns the corresponding classes of the source and target domains. Source samples of the same class are more closely distributed, samples of different classes have clear boundaries, and the target domain aligned with the source domain can also be well distinguished. The method works well on the Office-31 dataset, whose domains differ slightly, as well as on the OfficeHome dataset, whose domains differ greatly.

CONCLUSION
In this paper, we propose an unsupervised domain adaptation method that improves discriminability while maintaining transferability by learning multiple different feature representations to enrich the discriminative information contained in domain-invariant features. In addition, we introduce a discriminative loss that maximizes the inter-class distance and minimizes the intra-class distance, making it easier to distinguish between classes.
In future work, we will further explore the relationship between the number of representations and the transfer performance, as well as the relationship among the representations themselves, to design a better multi-view model for domain adaptation while trying to achieve fine-grained alignment.

Figure 1. The framework of our proposed method. Blue and orange represent source and target data, respectively. Our framework includes two components. The first component is the multi-view feature extractor, in which four different network structures are used to extract multi-view representations. The second component is multi-view discriminative adversarial training, in which ⊕ represents the concatenation of multi-view feature representations.

Figure 2. The calculation process of the discriminative domain adaptation loss.

Algorithm 1. Multi-view Discriminative Feature Learning. Output: trained model $G_f$, $G_y$, $G_d$. The steps of the algorithm are repeated for i = 1 to I iterations.

Figure 4. The t-SNE visualization of the feature representations learned by DANN and our method for tasks A→W and Art→Pr. Blue points are samples from the source domain and red points are samples from the target domain.

Figure 5. The performance of our method on the Office-31 dataset as μ in Formula 3 varies.
Table I, Table II, and Table III. For a fair comparison, all baseline methods use ResNet-50 as the backbone network. The first row of each table reports the accuracy obtained by directly applying the classifier trained on the source domain to the target domain. The improvement of the domain adaptation methods over this row shows that they can mitigate the domain shift problem effectively.