Abstract
Due to the difficulty of collecting labeled images for hundreds of thousands of visual categories, zero-shot learning, where unseen categories do not have any labeled images in training stage, has attracted more attention. In the past, many studies focused on transferring knowledge from seen to unseen categories by projecting all category labels into a semantic space. However, the label embeddings could not adequately express the semantics of categories. Furthermore, the common semantics of seen and unseen instances cannot be captured accurately because the distribution of these instances may be quite different. For these issues, we propose a novel deep semisupervised method by jointly considering the heterogeneity gap between different modalities and the correlation among unimodal instances. This method replaces the original labels with the corresponding textual descriptions to better capture the category semantics. This method also overcomes the problem of distribution difference by minimizing the maximum mean discrepancy between seen and unseen instance distributions. Extensive experimental results on two benchmark data sets, CU200-Birds and Oxford Flowers-102, indicate that our method achieves significant improvements over previous methods.