## Abstract

Our work focuses on unsupervised and generative methods that address the following goals: (1) learning unsupervised generative representations that discover latent factors controlling image semantic attributes, (2) studying how this ability to control attributes formally relates to the issue of latent factor disentanglement, clarifying related but dissimilar concepts that had been confounded in the past, and (3) developing anomaly detection methods that leverage representations learned in the first goal. For goal 1, we propose a network architecture that exploits the combination of multiscale generative models with mutual information (MI) maximization. For goal 2, we derive an analytical result, lemma 1, that brings clarity to two related but distinct concepts: the ability of generative networks to control semantic attributes of images they generate, resulting from MI maximization, and the ability to disentangle latent space representations, obtained via total correlation minimization. More specifically, we demonstrate that maximizing semantic attribute control encourages disentanglement of latent factors. Using lemma 1 and adopting MI in our loss function, we then show empirically that for image generation tasks, the proposed approach exhibits superior performance as measured in the quality and disentanglement of the generated images when compared to other state-of-the-art methods, with quality assessed via the Fréchet inception distance (FID) and disentanglement via mutual information gap. For goal 3, we design several systems for anomaly detection exploiting representations learned in goal 1 and demonstrate their performance benefits when compared to state-of-the-art generative and discriminative algorithms. Our contributions in representation learning have potential applications in addressing other important problems in computer vision, such as bias and privacy in AI.

## 1  Introduction

### 1.1  Motivations and Goals

There has been unsurpassed success in the application of deep learning (DL) in several areas of visual analysis, computer vision, natural language processing, and medical imaging (Litjens et al., 2017). The transformative contribution of DL to artificial intelligence (AI), principally in discriminative models for supervised learning, has mostly hinged on the availability of large training data sets such as ImageNet (Russakovsky et al., 2015). Open problems still remain in DL, especially in unsupervised learning, inference on out-of-training-distribution test samples, domain shift, and discriminative tasks where data are not easily obtained or when manual labeling is impractical or prohibitively onerous. Generative models, with their ability to generate data similar to a given data set and efficient representations of these data, may assist in addressing some of these challenges.

Considering these challenges, the ability to generate high-resolution images with good quality and diversity, and to allow the unsupervised discovery and control of individual semantic image attributes via latent space factors, is of paramount importance; these are the main aims motivating our study. These goals are also extended here to applying the learned representations to the practical task of anomaly detection.

Generative methods, broadly speaking, learn to sample from the underlying training data distribution so as to generate novel, fake samples that are visually or statistically indistinguishable from the underlying training data. A simple taxonomy of generative models includes generative adversarial networks (GANs) (see Goodfellow et al., 2014; Salimans et al., 2016), autoencoders/variational autoencoders (VAEs) (Kingma & Welling, 2013), and, used to a lesser extent in comparison to the aforementioned ones, generative autoregressive models in Oord et al. (2016), invertible flow-based latent vector models in Kingma and Dhariwal (2018), or hybrids of the above models as in Grover, Dhar, and Ermon (2018).

### 1.2  Prior Work

Prior and recent research in generative models that addresses some of our aims and inspires our work is organized along the following areas.

#### 1.2.1  High-Resolution GAN Approaches

A goal of GANs has been to achieve good-quality image generation for high-resolution images. Despite sustained research in GANs, only relatively recently have GANs been able to generate images at somewhat high resolutions (greater than 256 $×$ 256 pixels). Examples of methods that achieve such results include ProGAN (Karras, Aila, Laine, & Lehtinen, 2018), BigGAN (Brock, Donahue, & Simonyan, 2019), and COCO-GAN (Lin et al., 2019). For example, BigGAN relied on SAGAN (Zhang, Goodfellow, Metaxas, & Odena, 2019) as a baseline (self-attention GANs) and used large batch sizes to improve performance (multiplying batch size by a factor of 8 leads to more than a 40% increase in inception score over other state-of-the-art algorithms). That study also noted that larger networks had a comparable positive effect, and so did the use of the truncation trick (i.e., for the generator, sampling from a standard normal distribution in training while sampling instead from a truncated normal distribution in inference, where samples that are above a certain threshold are resampled). Truncating with a lower threshold allowed control of the trade-off between higher fidelity and lower diversity.
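
The truncation trick described above can be illustrated with a minimal sketch. It assumes that "above a certain threshold" refers to the magnitude of each latent entry and that resampling is done entry by entry; the function name `truncated_normal` and the threshold value are hypothetical and are not taken from BigGAN's code.

```python
import random

def truncated_normal(dim, threshold, rng):
    """Draw `dim` standard normal entries, resampling any entry whose
    magnitude exceeds `threshold` (assumed per-entry rejection)."""
    z = []
    for _ in range(dim):
        value = rng.gauss(0.0, 1.0)
        while abs(value) > threshold:
            value = rng.gauss(0.0, 1.0)
        z.append(value)
    return z

rng = random.Random(0)
# Training: plain standard normal latent vector.
z_train = [rng.gauss(0.0, 1.0) for _ in range(512)]
# Inference: truncated latent vector (threshold chosen for illustration only).
z_infer = truncated_normal(512, threshold=0.5, rng=rng)
```

A smaller threshold concentrates the latent samples near the mode of the prior, which is the mechanism behind the fidelity versus diversity trade-off mentioned above.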

Unfortunately, while many best-of-breed generative approaches made progress in terms of visual quality and high resolutions, these methods cannot be directly used for semantic attribute control, one of our goals in this letter.

#### 1.2.2  Style Transfer

StyleGAN (Karras, Laine, & Aila, 2019) has been very successful at addressing the generation of high-dimensional images (1024 $×$ 1024) and is an extension of ProGAN, which was based on progressively growing the encoder and decoder/discriminator in GAN networks. Some of the specific novel features in Karras et al. (2019) consisted of injecting noise at every scale resolution of the decoder and using a fully connected (FC) network that mapped a latent vector $Z$ into a 512-length intermediate latent vector $W$. This so-called style vector influenced the generation of images across multiple scales and was used to control some attributes of the image (at the lower scales, coarse attributes like skin tone, and at the higher scales, fine attributes like hair). An updated version of StyleGAN (Karras et al., 2020), StyleGAN2, introduced architectural improvements, along with an improved backprojection method for mapping images to latent spaces.

While both methods successfully allowed for using multiple style vectors at different scales as a natural extension, the resulting images may have included undesirable attributes from the original base images. These methods therefore did not allow for directly controlling specific semantic image attributes and, consequently, unsupervised discovery of consistent semantic attributes. In contrast, the approach we seek strives for the discovery, control, and isolation of desirable attributes from spurious attributes for complex imagery.

#### 1.2.3  Information-Theoretic Approaches

In Chen et al. (2016), InfoGAN made use, for the first time, of the principle of maximizing the mutual information $I(C;\hat{X})$ between a semantic latent vector $C$ and the generated image $\hat{X}$, where $C$ is a semantic component of the latent vector representation $Z=(Z',C)$, with $Z'$ being a noise vector. This process was originally set up to achieve disentanglement in latent factors between different scales. Experiments exemplified various degrees of agreement between such factors and semantic attributes. However, as our study demonstrates, the mutual information principle actually promotes the control of attributes in images, but true disentanglement is not always achieved via this maximization of mutual information. Indeed, the results in Chen et al. (2016) suggested varying degrees of success and consistency in the ability of the network to actually control and disentangle.

Instead, in our view, the concept of disentanglement was appropriately defined in Chen, Li, Grosse, and Duvenaud (2018) as the process of finding latent factors that satisfy minimum total correlation, defined by the Kullback-Leibler divergence between the vector's joint distribution and the product of its marginal components. Despite this contribution, the concepts of control and disentanglement have continued to be used interchangeably and are often confused with each other in recent literature. In this study, we adopt this definition and help bring order by formally demonstrating how this concept relates to the concept of control and maximization of mutual information.

#### 1.2.4  Generative Methods for Anomaly Detection

Our goal 3 strives to demonstrate the utility of disentangled representations for an important machine learning task, specifically, anomaly detection. Related to this is past work that has used DL discriminative and generative representation learning for anomaly detection. While DL-based anomaly detection schemes initially used discriminative representations (Erfani, Rajasegarar, Karunasekera, & Leckie, 2016), where deep belief networks were combined with statistical methods, more recent methods have made use of generative representation learning (Abay, Gehly, Balage, Brown, & Boyce, 2018; Akçay, Atapour-Abarghouei, & Breckon, 2018; Bergmann, Löwe, Fauser, Sattlegger, & Steger, 2018; Deecke, Vandermeulen, Ruff, Mandt, & Kloft, 2018; Gray, Smolyak, Badirli, & Mohler, 2020; Jain, Manikonda, Hernandez, Sengupta, & Kambhampati, 2018; Kimura & Yanagihara, 2018; Lai, Hu, Tsai, & Chiu, 2018; Liu et al., 2018; Naphade et al., 2018; Schlegl, Seeböck, Waldstein, Schmidt-Erfurth, & Langs, 2017; Zenati, Foo, Lecouat, Manek, & Chandrasekhar, 2018). Generally these exploit two main strategies: using cyclic reconstruction error in the image space as an anomaly detection metric or directly using the GAN discriminator. These two techniques were employed for pixel-based anomaly detection in Schlegl et al. (2017) and for image-based anomaly detection in Zenati et al. (2018) and Deecke et al. (2018). Another approach, which uses metrics of reconstruction in latent/code space, is embodied in the work of Akçay et al. (2018); it was later extended to include skip connections (Akçay, Atapour-Abarghouei, & Breckon, 2019). However, those methods did not probe the use of generative models that are capable of large-resolution image generation and that allow disentanglement, as is done here. Recently, Burlina, Joshi, and Wang (2019) found that most generative approaches fell short of approaches using discriminative embeddings. We consequently focus on comparing to these discriminative models as the state of the art.

### 1.3  Novel Contributions

When compared to prior work, the novel contributions of this work are as follows:

1. We consider the disentanglement of image attributes and the unsupervised discovery of such attributes, proposing a novel approach relying on the combination of mutual information maximization with multiscale GANs.

2. We bring clarity to the concepts of semantic control and disentanglement, demonstrating that there is a connection between mutual information maximization and total correlation minimization, that is, between the concepts of attribute control and disentanglement. We demonstrate in lemma 1 that maximizing semantic attribute control encourages the minimizing of entanglement for latent factors.

3. We show empirically that the proposed approach results in high-resolution generation with the ability for unsupervised discovery of latent codes that help control specific semantic image attributes.

4. We develop several methods with the proposed generative architectures, used for representation learning, for the end task of anomaly detection. We then demonstrate empirically the resulting performance benefits when compared to other state-of-the-art discriminative methods.

## 2  Approach

This section presents our approach, including details on representation learning via our proposed architecture (InfoStyleGAN) in section 2.1 and the loss function design and analysis in section 2.2.

### 2.1  Definitions of Attribute Control, Disentanglement, Architecture, and Loss Functions

We now discuss the concepts of attribute control and disentanglement. Consider first the architecture used here, which is depicted in Figure 1 and borrows some of the GAN components in StyleGAN (Karras et al., 2019), including a multiscale generator $\hat{X}=G(Z)$ and a discriminator $D$.

Figure 1: Our proposed multiscale generator and discriminator architecture. The latent code in $Z$ is split into noise terms $Z'$ and semantically relevant variables $C$. Sample $(z',c)$ is fed into a mapping network, and a mutual information-maximizing loss is used between the latent and output-generated image, combined with a conditional loss and traditional generative adversarial network (GAN) adversarial loss. The pathways for the information $Q$ auxiliary network are shown in dotted green, while the pathways for anomaly detection via representation and one class support vector machine (OCSVM) and local outlier factor (LOF) are shown in dashed yellow.


Since we seek to learn latent space representations that relate to generated image attributes, for attribute discovery and control, the latent vector $Z=(Z',C)$ is decomposed into a standard gaussian noise vector $Z'$ and a latent vector component $C$ (henceforth called the latent factors) with distribution $p(C)$, where $C$ and $Z'$ are independent of each other.

We define the concept of disentanglement as the minimization of the total correlation between the different latent factors $C$, while the faculty for discovery and control of semantic image attributes is defined via the maximization of mutual information between the latent factors and the generated image. The last concept is explicitly used in our considered loss function. We demonstrate analytically that the former concept of total correlation minimization is implied, under certain conditions, from the latter principle of mutual information maximization.

#### 2.1.1  Control via Maximization of Mutual Information (MI)

The maximization of the mutual information $I(C;G(Z',C))$ between the semantic vector $C$ and the observation $\hat{X}=G(Z',C)$ is used as a means of discovering the factors of variation in images and of forcing the coupling of the vector $C$ to the different factors of variation in the images $X$ in the data set. Since MI computation is complicated by the fact that it entails knowing the posterior $p(C|\hat{X})$, one can instead, as in InfoGAN (Chen et al., 2016), employ an auxiliary distribution $Q(C|\hat{X})$ that approximates this posterior and that can be selected to maximize the resulting MI measure.

#### 2.1.2  Disentanglement via Minimization of Total Correlation

Although Chen et al. (2016) introduced an auxiliary loss to maximize the MI between $C=(C_1,C_2,\ldots,C_L)$, where each $C_i$ can be governed by a different distribution, and $\hat{X}=G(Z',C)$, where $Z'$ is an additional noise vector, there is no explicit objective for controlling a diverse set of image attributes. For example, even with the independence implementation on the prior $p(C)=\prod_i p(C_i)$, in Chen et al. (2016), all semantically relevant variables could only seemingly affect the skin in faces, with different variables focusing on skin tone, skin texture, glare on the skin, and other attributes. In effect, despite being sampled as independent, the effects between individual variables on the image are highly correlated. Consequently, a desirable end goal for disentanglement is that knowledge of one latent factor from the image does not affect the knowledge of other latent factors, that is, having $p(C|\hat{X}=\hat{x})=\prod_i p(C_i|\hat{X}=\hat{x})$ or conditional independence of the true posterior of the latent factors given a realization $\hat{X}=\hat{x}$ of the generated image. This is equivalent to having $TC(C|\hat{X}=\hat{x})=0$, where $TC(\cdot|\hat{X}=\hat{x})$ is the total correlation (Chen et al., 2018) given $\hat{X}=\hat{x}$, defined as the Kullback-Leibler (KL) divergence, denoted by $D_{KL}(\cdot\,\|\,\cdot)$, between the conditional joint distribution given $\hat{X}=\hat{x}$ and the product of the conditional marginal distributions given $\hat{X}=\hat{x}$:
$$TC(C|\hat{X}=\hat{x})=D_{KL}\Big(p(C|\hat{X}=\hat{x})\,\Big\|\,\prod_i p(C_i|\hat{X}=\hat{x})\Big).$$

We will argue that the previous two concepts are connected to each other as shown in lemma 1. Indeed, under certain assumptions, MI maximization constrains the total correlation. Therefore, only one such constraint, namely, the MI constraint, will be considered henceforth.
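
To make the definition of total correlation concrete, the following minimal sketch evaluates it for a small discrete joint distribution. The function name `total_correlation` and the two toy distributions are illustrative assumptions; the distributions stand in for a posterior over two binary latent factors given one fixed image.

```python
import math
from itertools import product

def total_correlation(joint):
    """KL divergence between a joint pmf over tuples (c_1, ..., c_L)
    and the product of its marginals."""
    num_factors = len(next(iter(joint)))
    marginals = [dict() for _ in range(num_factors)]
    for values, prob in joint.items():
        for i, v in enumerate(values):
            marginals[i][v] = marginals[i].get(v, 0.0) + prob
    tc = 0.0
    for values, prob in joint.items():
        if prob > 0.0:
            indep = math.prod(marginals[i][v] for i, v in enumerate(values))
            tc += prob * math.log(prob / indep)
    return tc

# Two independent fair bits: the posterior factorizes, so TC = 0.
independent = {c: 0.25 for c in product((0, 1), repeat=2)}
# Two perfectly coupled bits: TC = log 2 (in nats).
coupled = {(0, 0): 0.5, (1, 1): 0.5}

tc_independent = total_correlation(independent)
tc_coupled = total_correlation(coupled)
```

The coupled case mirrors the situation described above, where several latent factors end up describing the same attribute.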

### 2.2  Loss Function

The loss function comprises two parts: the adversarial loss $V(D,G)$ for the generator $G$ and discriminator $D$, as well as the lower bound $L_{\text{info}}(G,Q)$ on the mutual information:
$$V(D,G)=\mathbb{E}_{C\sim p(C),\,\hat{X}\sim G(Z',C)}\big(\log(1-D(\hat{X}))\big)+\mathbb{E}_{X}\big(\log(D(X))\big),\tag{2.1}$$
$$L_{\text{info}}(G,Q)=\mathbb{E}_{C\sim p(C),\,\hat{X}\sim G(Z',C)}\big(\log(Q(C|\hat{X}))-\log p(C)\big),\tag{2.2}$$
where $\mathbb{E}(\cdot)$ denotes the expectation operator, $Q$ denotes the auxiliary network referenced earlier, and $0\le D(\cdot)\le 1$. Note that we abused the notation in $\hat{X}\sim G(Z',C)$ to denote the conditional distribution $p(\hat{X}|C)$ given $C$ (this conditional distribution is intrinsically determined by the distribution of $Z'$ and the gaussian noise maps). The optimization problem consists then of determining the triplet $(D,G,Q)$ that achieves
$$\min_{G,Q}\max_{D}\;V(D,G)-\beta L_{\text{info}}(G,Q),\tag{2.3}$$
where coefficient $\beta\ge 0$ is a hyperparameter.
Monte Carlo estimates of $V(D,G)$ and $L_{\text{info}}(G,Q)$, denoted $\tilde{V}(D,G)$ and $\tilde{L}_{\text{info}}(G,Q)$, respectively, are used to make the optimization tractable. Using a batch size $B$, $\{z'^{(l)},c'^{(l)}\}_{l=1}^{B}$ is sampled from $Z$, as $Z$ is explicitly defined beforehand, and then fed into the generator to produce the fake images $\hat{x}^{(l)}=G(z'^{(l)},c'^{(l)})$, $l=1,\ldots,B$; the real images $x^{(l)}$ are likewise sampled $B$ times. The estimates for the losses are thus given by
$$\tilde{V}(D,G)=\frac{1}{B}\sum_{l=1}^{B}\Big(\log(1-D(\hat{x}^{(l)}))+\log(D(x^{(l)}))\Big),\tag{2.4}$$
$$\tilde{L}_{\text{info}}(G,Q)=\frac{1}{B}\sum_{l=1}^{B}\Big(\log(Q(c'^{(l)}|\hat{x}^{(l)}))-\log p(c'^{(l)})\Big).\tag{2.5}$$
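
The batch estimates in equations 2.4 and 2.5 can be sketched as follows. This is only an illustration under stated assumptions: the discriminator outputs and log-probabilities are passed in as plain lists rather than produced by networks, and the function names and the toy batch of size $B=2$ are hypothetical.

```python
import math

def adversarial_estimate(d_fake, d_real):
    """Equation 2.4: batch average of log(1 - D(x_hat)) + log(D(x))."""
    assert len(d_fake) == len(d_real)
    total = sum(math.log(1.0 - f) + math.log(r) for f, r in zip(d_fake, d_real))
    return total / len(d_fake)

def info_estimate(log_q, log_prior):
    """Equation 2.5: batch average of log Q(c | x_hat) - log p(c)."""
    assert len(log_q) == len(log_prior)
    return sum(q - p for q, p in zip(log_q, log_prior)) / len(log_q)

def combined_objective(d_fake, d_real, log_q, log_prior, beta=1.0):
    """Batch version of the objective in equation 2.3:
    minimized over (G, Q) and maximized over D."""
    return adversarial_estimate(d_fake, d_real) - beta * info_estimate(log_q, log_prior)

# Toy batch: an undecided discriminator, and an auxiliary network that
# assigns probability 0.9 to the sampled value of one Bernoulli(0.5) code.
d_fake, d_real = [0.5, 0.5], [0.5, 0.5]
log_q = [math.log(0.9)] * 2
log_prior = [math.log(0.5)] * 2

v_tilde = adversarial_estimate(d_fake, d_real)
l_info_tilde = info_estimate(log_q, log_prior)
objective = combined_objective(d_fake, d_real, log_q, log_prior, beta=1.0)
```

In this toy batch the information term is positive because the auxiliary network recovers the code better than the prior would, which lowers the objective that $G$ and $Q$ minimize.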

For nomenclature, we refer to the model with styles disabled and $β=1$ as InfoGAN, the model with styles and $β=0$ as StyleGAN, the model with styles and $β=1$ as InfoStyleGAN, and InfoStyleGAN with $C$ containing only discrete variables as InfoStyleGAN-Discrete.

## 3  Relating Attribute Control and Disentanglement

We next show that the choice of the mean field encoder, that is, using $Q(C|\hat{X})=\prod_i Q(C_i|\hat{X})$, for optimizing the mutual information (as used in Chen et al., 2016) contributes to forcing the total correlation to zero.

Lemma 1.
Assume that each $C_i$ is discrete for $i=1,\ldots,L$ (hence, $I(C;\hat{X})$ is bounded from above) and that $Q(C|\hat{X})=\prod_i Q(C_i|\hat{X})$. When $L_{\text{info}}\to I(C;\hat{X})$, then
$$TC(C|\hat{X})\to 0,$$
almost everywhere in $\hat{X}$, where $TC(C|\hat{X})$ is the total correlation in $C$ given $\hat{X}$.
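Proof.
First, recall the derivation of the InfoGAN objective from Chen et al. (2016):
$$I(C;\hat{X})=-H(C|\hat{X})+H(C)\tag{3.1}$$
$$=\mathbb{E}_{\hat{X}}\big[\mathbb{E}_{C'\sim p(C|\hat{X})}(\log p(C'|\hat{X}))\big]+H(C)\tag{3.2}$$
$$=\mathbb{E}_{\hat{X}}\big[\underbrace{D_{KL}(p(C|\hat{X})\,\|\,Q(C|\hat{X}))}_{\ge 0}+\mathbb{E}_{C'\sim p(C|\hat{X})}(\log Q(C'|\hat{X}))\big]+H(C)\tag{3.3}$$
$$\ge\mathbb{E}_{\hat{X}}\big[\mathbb{E}_{C'\sim p(C|\hat{X})}(\log Q(C'|\hat{X}))\big]+H(C)\tag{3.4}$$
$$=\mathbb{E}_{C\sim p(C),\,\hat{X}\sim G(Z',C)}(\log Q(C|\hat{X}))+H(C)=L_{\text{info}},\tag{3.5}$$
where $H(\cdot)$ and $H(\cdot|\cdot)$ denote entropy and conditional entropy, respectively. Note that $H(C)=\sum_i H(C_i)$ as we assume the prior to the generative model factorizes independently. Starting from equation 3.2, we can decompose the logarithmic term for each individual $C_i$ to get
$$I(C;\hat{X})=\mathbb{E}_{\hat{X}}\Big[\mathbb{E}_{C'\sim p(C|\hat{X})}\Big(\log\frac{p(C'|\hat{X})}{\prod_i p(C_i'|\hat{X})}\Big)\Big]+\sum_i\Big(\mathbb{E}_{\hat{X}}\big[\mathbb{E}_{C_i'\sim p(C_i|\hat{X})}(\log p(C_i'|\hat{X}))\big]+H(C_i)\Big)\tag{3.6}$$
$$=\mathbb{E}_{C\sim p(C),\,\hat{X}\sim G(Z',C)}(TC(C|\hat{X}))+\sum_i I(C_i;\hat{X}).\tag{3.7}$$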
Now, purely maximizing the mutual information with respect to all variables could also increase the total correlation of the posterior $p(C|\hat{X})$ of a fake image $\hat{X}$, implying that the factors given the image are more entangled, which is undesirable. Thus, we desire a low total correlation $TC$ and high mutual information $I$ between $\hat{X}$, the generated image, and each variable $C_i$ individually, which we argue that the original InfoGAN objective implicitly satisfies for discrete (finite-valued) variables.
Indeed, we can lower-bound each individual $I(C_i;\hat{X})$ term in equation 3.7 via the same method as in equation 3.4: we have
$$H(C)\ge I(C;\hat{X})\tag{3.8}$$
$$\ge I(C;\hat{X})-\mathbb{E}_{C\sim p(C),\,\hat{X}\sim G(Z',C)}(TC(C|\hat{X}))\tag{3.9}$$
$$=\sum_i I(C_i;\hat{X})\tag{3.10}$$
$$\ge\sum_i\Big(\mathbb{E}_{\hat{X}}\big[\mathbb{E}_{C_i\sim p(C_i|\hat{X})}(\log Q(C_i|\hat{X}))\big]+H(C_i)\Big)\tag{3.11}$$
$$=\mathbb{E}_{C\sim p(C),\,\hat{X}\sim G(Z',C)}(\log Q(C|\hat{X}))+H(C)=L_{\text{info}},\tag{3.12}$$
where equation 3.10 holds by equation 3.7 and equation 3.11 follows from equation 3.4 applied to each $I(C_i;\hat{X})$ term. The next-to-last equality (the first in equation 3.12) is due to our assumption of a mean field encoder and the fact that $p(C)=\prod_i p(C_i)$.

Thus, when $L_{\text{info}}\to I(C;\hat{X})$, the two inequalities in equations 3.9 and 3.11 become tight, in turn implying that $Q(C|\hat{X})\to p(C|\hat{X})$ and that $TC(C|\hat{X})\to 0$ almost everywhere in $\hat{X}$.

As explained earlier, while this lemma only applies when $C$ is a discrete random vector, it nevertheless provides useful insights regarding the connection between MI maximization and total correlation minimization and between the concepts of attribute control and disentanglement, which had been conflated in past literature, by demonstrating that maximizing semantic attribute control encourages minimizing entanglement of latent factors. It shows, however, that while connected, these two concepts are formally not equivalent. Furthermore, the direction of this relationship, namely, that control encourages disentanglement and not the other way around, motivates our choice of using MI rather than $TC$ in our loss function for the rest of the study.
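
The decomposition in equation 3.7 can be checked numerically on a toy discrete example. The sketch below is only an illustration under assumed conditions: two independent fair bits play the role of the latent factors, and a hypothetical "generator" outputs their exclusive OR, so the helper names and the joint distribution are not part of our model.

```python
import math
from collections import defaultdict

def entropy(pmf):
    """Shannon entropy (in nats) of a pmf given as {outcome: probability}."""
    return -sum(p * math.log(p) for p in pmf.values() if p > 0.0)

def marginal(joint, keep):
    """Marginalize a pmf over tuples onto the positions listed in `keep`."""
    out = defaultdict(float)
    for values, prob in joint.items():
        out[tuple(values[i] for i in keep)] += prob
    return dict(out)

def mutual_information(joint, a, b):
    """I(A;B) = H(A) + H(B) - H(A,B) for position lists a and b."""
    return (entropy(marginal(joint, a)) + entropy(marginal(joint, b))
            - entropy(marginal(joint, a + b)))

def expected_conditional_tc(joint, factors, obs):
    """E_x[TC(C | x)] = sum_i H(C_i | X) - H(C | X)."""
    h_x = entropy(marginal(joint, obs))
    h_joint_given_x = entropy(marginal(joint, factors + obs)) - h_x
    h_each_given_x = sum(entropy(marginal(joint, [i] + obs)) - h_x for i in factors)
    return h_each_given_x - h_joint_given_x

# Toy joint over (c1, c2, x): independent fair bits, x = c1 XOR c2.
joint = {(c1, c2, c1 ^ c2): 0.25 for c1 in (0, 1) for c2 in (0, 1)}
factors, obs = [0, 1], [2]

mi_total = mutual_information(joint, factors, obs)
mi_parts = sum(mutual_information(joint, [i], obs) for i in factors)
tc_term = expected_conditional_tc(joint, factors, obs)
```

In this example all of $I(C;\hat{X})=\log 2$ is accounted for by the total correlation term, while each individual $I(C_i;\hat{X})$ is zero. It is a small instance of the case discussed in the proof, where the joint mutual information is high although the factors are fully entangled given the output.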

## 4  Application to Anomaly Detection

Given our lemma and architecture above, the discriminator may learn a representation of the image not only in the auxiliary network but in layers used for discriminating real versus fake. This is due to using shared weights for both the auxiliary network and the discriminator, encouraging the discriminator to learn a mapping that includes potentially semantic information.

Using InfoStyleGAN trained on a subset of each data set, we then extract the raw vector output by the discriminator at the end of (1) the auxiliary network Q, with dimensions depending on how C is implemented; (2) the last convolutional layer in the discriminator network of dimensions 512 $×$ 4 $×$ 4, and (3) the last dense layer in the discriminator network, typically of dimension 512.

For nomenclature, the acronyms for these different choices of representation and discriminative use of GANs for anomaly detection are described in Table 4. We whiten these embeddings using principal component analysis (PCA) and keep all components for the best performance. We then use these representations as embeddings for two anomaly detection methods: one-class support vector machine (OCSVM) and local outlier factor (LOF). One-class support vector machines learn a hypersphere around the given data such that the radius containing most of the data is minimized. LOF compares a given point to its nearest neighbors; if the density around the given point is lower than the densities around its neighbors, the point is categorized as an outlier.
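
The whitening and outlier-scoring steps can be sketched with scikit-learn as follows. This is a minimal illustration, not our experimental code: the embeddings are synthetic gaussian stand-ins for discriminator features, and the hyperparameters (`nu`, `gamma`, `n_neighbors`) are assumptions chosen only for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Synthetic stand-ins for embeddings (hypothetical data, 8 dimensions).
train_inliers = rng.normal(0.0, 1.0, size=(200, 8))
test_inliers = rng.normal(0.0, 1.0, size=(50, 8))
test_outliers = rng.normal(6.0, 1.0, size=(50, 8))

# Whiten with PCA fitted on inliers only, keeping all components.
pca = PCA(whiten=True).fit(train_inliers)
train = pca.transform(train_inliers)

# Fit both detectors on the whitened inlier embeddings.
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(train)
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(train)

def mean_scores(model):
    """Average decision value (higher means more normal) on each test group."""
    inl = model.decision_function(pca.transform(test_inliers)).mean()
    out = model.decision_function(pca.transform(test_outliers)).mean()
    return inl, out

ocsvm_inlier, ocsvm_outlier = mean_scores(ocsvm)
lof_inlier, lof_outlier = mean_scores(lof)
```

`LocalOutlierFactor` is used with `novelty=True` so that it can score points not seen during fitting, which matches the setting of training on inliers and testing on a mix of inliers and outliers.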

As an alternative, we also test using embeddings output from two networks: the average pooling layer from an Inception network pretrained to classify 1000 categories of objects from ImageNet and the global representation from the Deep InfoMax (DIM) network. Deep InfoMax (Bachman, Hjelm, & Buchwalter, 2019) is trained via a contrastive loss to maximize the similarity between two different views of the same image, while minimizing the similarity with all other representations from different images. Consequently, it should learn as unique a representation as possible with respect to factors invariant to the differences in the view. Note that as both of these networks were pretrained on the entirety of ImageNet, they can leverage the extra data to output more unique representations when compared to our methods.

## 5  Experiments

### 5.1  Data

The data sets we use are the public domain CelebA (Liu, Luo, Wang, & Tang, 2015) and Stanford Cars (Krause, Stark, Deng, & Fei-Fei, 2013). CelebA consists of over 200,000 celebrity faces with various attributes such as gender and age. These images are 218 $×$ 178 pixels, and to preprocess them, we take a 128 $×$ 128 crop with the center (121, 89). Stanford Cars contains 16,151 images of vehicles of different makes, models, and years. For preprocessing, we use the same preprocessing that Karras et al. (2019) used for LSUN Cars on this data set, targeting a resolution of 512 $×$ 384 pixels. First, we crop height to match the correct aspect ratio, ignoring any images that require upsampling. We resize to 512 $×$ 384 pixels, and then pad this image to a 512 $×$ 512 image.
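
The crop and padding arithmetic above can be written out as a small sketch. Two assumptions are made here that the text does not state: the center (121, 89) is read as (row, column), and the padding from 384 to 512 rows is split evenly between top and bottom. The function names are hypothetical.

```python
def center_crop_box(center, size):
    """Return (top, left, bottom, right) of a square crop around (row, col)."""
    row, col = center
    half = size // 2
    return (row - half, col - half, row + half, col + half)

def vertical_padding(height, target):
    """Rows to add above and below so that `height` becomes `target`
    (assumed symmetric padding)."""
    extra = target - height
    top = extra // 2
    return (top, extra - top)

# CelebA: 218 x 178 images, 128 x 128 crop centered at (121, 89).
celeba_box = center_crop_box((121, 89), 128)
# Stanford Cars: 512 x 384 image (width x height) padded to 512 x 512.
cars_pad = vertical_padding(384, 512)
```

Under these assumptions the CelebA crop spans rows 57 to 185 and columns 25 to 153, which lies inside the 218 $×$ 178 image.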

For disentanglement experiments, we use the full data set in all cases to cover every potential semantic attribute of the data set. For anomaly detection experiments, we use the following classes as inliers and outliers: male celebrities as inliers and female celebrities as outliers for CelebA, and small vehicles (compacts/sedans) as inliers and large vehicles (trucks/vans) as outliers for Stanford Cars.

We adopt the basic settings of the implementation from Karras et al. (2019) for all data sets. However, mixing is turned off, as it was found that mixing the styles reduces disentanglement. Moreover, as we are focused on reconstructing the initial vector used to create the style vector, introducing additional style vectors during generation would cause the information loss term to become ambiguous. For computational efficiency, we append $Q$ to $D$ as in Figure 1 similar to Chen et al. (2016). For CelebA, we use a 512-dimension latent code for $(Z',C)$, with $C$ consisting of 7 discrete Bernoulli variables, 1 discrete categorical variable of dimension 3, and 10 continuous uniform variables. For Stanford Cars, we use a 512-dimension latent code for $(Z',C)$, with $C$ consisting of 20 continuous uniform variables. For implementing $Q(C|\hat{X})$, we use logits for each of the discrete variables and treat the posterior distribution of the continuous variables as gaussian $\mathcal{N}(\mu(\hat{X}),\sigma^2(\hat{X}))$ with mean $\mu(\hat{X})$ and variance $\sigma^2(\hat{X})$.
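
As an illustration of how a mean field $Q$ with this parameterization assigns a log-probability to a sampled code, consider the sketch below. The dictionary layout, the function names, and the example outputs of the auxiliary head are assumptions made for the example; only the variable counts for CelebA (7 Bernoulli, 1 categorical of dimension 3, 10 continuous) come from the text.

```python
import math

def log_bernoulli(c, logit):
    """log Q(c) for c in {0, 1}, with Q(c = 1) = sigmoid(logit)."""
    signed = logit if c == 1 else -logit
    # log sigmoid(s) = -softplus(-s), written in a numerically stable form.
    return -(max(-signed, 0.0) + math.log1p(math.exp(-abs(signed))))

def log_categorical(c, logits):
    """log-softmax of `logits` evaluated at class index c."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return logits[c] - log_norm

def log_gaussian(c, mean, var):
    """log density of N(mean, var) at c."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (c - mean) ** 2 / var)

def log_q(code, head):
    """Mean field log Q(c | x_hat): sum of per-variable log-probabilities."""
    total = sum(log_bernoulli(c, l)
                for c, l in zip(code["bernoulli"], head["bernoulli_logits"]))
    total += log_categorical(code["categorical"], head["categorical_logits"])
    total += sum(log_gaussian(c, m, v)
                 for c, m, v in zip(code["continuous"], head["means"], head["variances"]))
    return total

# Hypothetical sampled code with the CelebA layout.
code = {"bernoulli": [1, 0, 1, 1, 0, 0, 1], "categorical": 2,
        "continuous": [0.1 * i for i in range(10)]}
# Hypothetical head output: uninformative logits, exact means, unit variances.
head = {"bernoulli_logits": [0.0] * 7, "categorical_logits": [0.0, 0.0, 0.0],
        "means": list(code["continuous"]), "variances": [1.0] * 10}

value = log_q(code, head)
```

Because the encoder is mean field, the log-probability is a plain sum over the variables, which is the factorization assumed in lemma 1.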

Specifically for anomaly detection, we train each GAN on a data set consisting solely of images in the inlier class and take a subsampling for training OCSVM or LOF. We then test on a balanced test set for both inliers and outliers for each data set. Each set of representations was normalized, as well as fed through PCA using whitened components. All generative models and Deep InfoMax were found to perform best when all components are used, whereas Inception V3 performs best when 1024 components were used. These standardized representations were then used to train the OCSVM/LOF model.

### 5.3  Disentanglement, Control, and Generative Quality: Quantitative Assessment

Although several metrics have emerged to characterize quality and diversity in generative models, we use FID (Heusel, Ramsauer, Unterthiner, Nessler, & Hochreiter, 2017) as our primary measure of distance between the two distributions, which is computed as
$$\mathrm{FID}=\|\mu_r-\mu_g\|_2^2+\mathrm{Tr}\big(\Sigma_r+\Sigma_g-2(\Sigma_r\Sigma_g)^{0.5}\big),\tag{5.1}$$
where $\|\cdot\|_2$ and $\mathrm{Tr}(\cdot)$ denote the $L_2$ norm and the trace, respectively, and the assumption is made that $X_r\sim\mathcal{N}(\mu_r,\Sigma_r)$, that is, a normal distribution (with mean-vector $\mu_r$ and covariance matrix $\Sigma_r$) for the activations from the Inception v3 pool layer for real examples, and likewise $X_g\sim\mathcal{N}(\mu_g,\Sigma_g)$ for generated examples.
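
A minimal NumPy sketch of the FID computation is given below. It assumes the squared $L_2$ distance between the means, as in Heusel et al. (2017), and evaluates the trace of the matrix square root through the eigenvalues of $\Sigma_r\Sigma_g$; the function names are hypothetical, and reference implementations typically use a dedicated matrix square root routine instead.

```python
import numpy as np

def feature_statistics(features):
    """Mean vector and covariance matrix of an (n_samples, n_dims) array."""
    features = np.asarray(features, dtype=float)
    return features.mean(axis=0), np.cov(features, rowvar=False)

def frechet_distance(mu_r, sigma_r, mu_g, sigma_g):
    """Frechet distance between N(mu_r, sigma_r) and N(mu_g, sigma_g)."""
    diff = np.asarray(mu_r, dtype=float) - np.asarray(mu_g, dtype=float)
    sigma_r = np.asarray(sigma_r, dtype=float)
    sigma_g = np.asarray(sigma_g, dtype=float)
    # Tr((Sigma_r Sigma_g)^0.5) is the sum of the square roots of the
    # eigenvalues of Sigma_r Sigma_g, which are real and non-negative
    # for positive semidefinite covariances (up to round-off).
    eigvals = np.linalg.eigvals(sigma_r @ sigma_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * trace_sqrt)

identity = np.eye(2)
fid_same = frechet_distance([0.0, 0.0], identity, [0.0, 0.0], identity)
fid_shift = frechet_distance([0.0, 0.0], identity, [3.0, 4.0], identity)
fid_scale = frechet_distance([0.0, 0.0], identity, [0.0, 0.0], 4.0 * identity)
```

The three toy cases separate the two terms: a pure mean shift contributes only the squared distance, while a pure change of covariance contributes only the trace term.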

FID is reported for each of the architectural variants and both data sets, as shown in Table 1. Additional comparisons are shown in the last rows, and all but COCO-GAN are taken from Lucic et al. (2018), where the training scheme is different; that study used a smaller architecture and a $64×64$ resolution version of CelebA. For COCO-GAN, Lin et al. (2019) used a larger model along with a different resolution of CelebA. InfoGAN and InfoStyleGAN perform slightly worse than StyleGAN in terms of FID, and using only discrete variables in $C$ in InfoStyleGAN performs significantly worse.

As the FID comprises two terms that can roughly be taken to mean the fidelity and diversity of the fake distribution, we use two additional metrics, precision and recall (Kynkäänniemi et al., 2019), to better characterize the fidelity (precision) and diversity (recall) separately for CelebA. Kynkäänniemi et al. (2019) use the image features of a specific network, Inception V3, chosen in our case to match the feature space of FID, to construct an approximate manifold of the real images and of the fake images separately. Each manifold comprises hyperspheres around each individual point in the data set, where each hypersphere has the smallest radius that contains the nearest $k$ neighbors. The precision is then defined as the proportion of fake images whose features lie in the constructed manifold of real images, and the recall is defined as the proportion of real images whose features lie in the constructed manifold of fake images.
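
The manifold construction just described can be sketched in a few lines. This is a brute-force illustration on hypothetical two-dimensional points rather than Inception V3 features, and the function names are assumptions; it is not the reference implementation of Kynkäänniemi et al. (2019).

```python
import math

def knn_radii(points, k):
    """Distance from each point to its k-th nearest neighbor in the same set."""
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[k - 1])
    return radii

def coverage(queries, support, k):
    """Fraction of `queries` inside at least one k-NN hypersphere of `support`."""
    radii = knn_radii(support, k)
    inside = sum(
        any(math.dist(q, s) <= r for s, r in zip(support, radii)) for q in queries
    )
    return inside / len(queries)

def precision_recall(real, fake, k=3):
    """Precision: fake points in the real manifold. Recall: the converse."""
    return coverage(fake, real, k), coverage(real, fake, k)

# Toy features: real points on a unit square, fake points collapsed near its center.
real = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
fake = [(0.5, 0.5), (0.4, 0.6), (0.6, 0.4)]

precision, recall = precision_recall(real, fake, k=1)
```

The toy configuration mimics a collapsed generator: every fake point falls inside the real manifold, so precision is high, but the tight fake manifold covers none of the real points, so recall is zero.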

Table 1: Quality of Generated Images.

| Models | CelebA | Stanford Cars |
| --- | --- | --- |
| InfoGAN | 9.91 ($±$0.06) | $†$ |
| StyleGAN | 9.08 ($±$0.10) | 21.35 ($±$2.01) |
| InfoStyleGAN | 9.90 ($±$0.07) | 23.52 ($±$1.97) |
| InfoStyleGAN-Discrete | 14.3 ($±$0.05) | $†$ |
| WGAN GP* | 30.0 | $†$ |
| BEGAN* | 38.9 | $†$ |
| DRAGAN* | 42.3 | $†$ |
| COCO-GAN* | 4.2 | $†$ |

Notes: This table demonstrates that adding disentanglement properties does not have a negative impact on visual performance (FID). Best FID found during training for various architectures including InfoStyleGAN, with standard deviations taken over five estimates of the metric. We also list comparisons to other models on this data set, with the first three taken from Lucic, Kurach, Michalski, Gelly, and Bousquet (2018). As we were conducting a large-scale study where models were not fully trained in order to explore hyperparameters, the results for these models could possibly be better. Daggers denote experiments that were not done or available, and the asterisks denote experiments that had different training schemes.

Table 2: Separate Metrics for Precision and Recall.

| Metrics | InfoGAN | StyleGAN | InfoStyleGAN | InfoStyleGAN-Discrete |
| --- | --- | --- | --- | --- |
| Precision $(k=3)$ | 0.654 | 0.720 | 0.629 | 0.635 |
| Recall $(k=3)$ | 0.291 | 0.306 | 0.319 | 0.226 |
| Precision $(k=10)$ | 0.855 | 0.893 | 0.848 | 0.850 |
| Recall $(k=10)$ | 0.498 | 0.561 | 0.562 | 0.447 |

Notes: We use the precision and recall metrics from Kynkäänniemi et al. (2019) on our trained CelebA models. We use Inception V3 as our feature extractor and either $k=3$ or $k=10$ neighbors.

We show the results in Table 2, where we see that StyleGAN has the highest precision at 0.720, while InfoStyleGAN has the highest recall at 0.319, for $k=3$. Comparing InfoGAN and InfoStyleGAN, we see that the results appear to match both models having nearly equal FID, with InfoGAN having marginally higher precision and lower recall than InfoStyleGAN, exhibiting some trade-off between the quality and diversity. For $k=10$, we see that the metrics overall follow a similar trend as what is presented in Kynkäänniemi et al. (2019) for the FFHQ data set, which can be construed as a higher-quality CelebA.

In order to characterize the ability of the algorithms to control individual known image attributes of generated images, we also utilize the mutual information gap (MIG) (Chen et al., 2018) for CelebA. Stanford Cars did not have ground-truth attributes associated with it.

#### 5.3.1  Mutual Information Gap

As we have ground-truth attribute labels for CelebA describing various semantic attributes $\{V_k\}_{k=1}^{K}$, where $K$ is the number of ground-truth attributes, we can use those directly in a supervised fashion to estimate the mutual information between each $V_k$ and each $C_i$, $i=1,\ldots,L$, as described by the auxiliary network $Q$. Consequently, we estimate
$I(V_k;C_i)=\mathbb{E}_{V_k,\,C_i'\sim Q(C_i|V_k)}\left[\log\sum_{\tilde{x}\in X_{V_k}}Q(C_i'|\tilde{x})\,p(\tilde{x}|V_k)\right]+H(C_i)$
(5.2)
over $k$ and $i$, where $X_{v_k}$ is the set of images that correspond to having the label $v_k$. As the overall conditional probability $p(\tilde{x}|v_k)$ is unknown, we assume a uniform distribution over all $\tilde{x}$ that have $v_k$ as a label. The MIG is then
$\mathrm{MIG}=\frac{1}{K}\sum_{k=1}^{K}\frac{1}{H(V_k)}\left(I(V_k;C_{i(k)})-\max_{i\neq i(k)}I(V_k;C_i)\right)$
(5.3)
where $i(k)=\arg\max_{i}I(V_k;C_i)$ is the index over the latent factors $C_i$ that selects the $C_i$ with the maximum mutual information with respect to the given ground-truth attribute. Consequently, the MIG is the normalized difference between the maximum and second-largest mutual information. For CelebA, the set of attributes $\{V_k\}$ consists of 40 attributes, including binary variables for smiling, attractiveness, and other characteristics.
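As an illustration of equation 5.3, here is a minimal sketch that computes the MIG from a table of mutual information estimates. It assumes the estimates $I(V_k;C_i)$ and the entropies $H(V_k)$ are already available; the function names and the toy numbers are hypothetical and are not results from this work.

```python
from typing import Sequence


def mutual_information_gap(mi: Sequence[Sequence[float]],
                           attr_entropy: Sequence[float]) -> float:
    """MIG from a K x L table where mi[k][i] estimates I(V_k; C_i)
    and attr_entropy[k] holds H(V_k)."""
    if len(mi) != len(attr_entropy):
        raise ValueError("need one entropy per ground-truth attribute")
    gaps = []
    for row, h in zip(mi, attr_entropy):
        if len(row) < 2:
            raise ValueError("MIG needs at least two latent factors")
        ranked = sorted(row, reverse=True)
        # Largest minus second-largest mutual information, normalized by H(V_k).
        gaps.append((ranked[0] - ranked[1]) / h)
    return sum(gaps) / len(gaps)


def max_mutual_information(mi: Sequence[Sequence[float]]) -> float:
    """Largest I(V_k; C_i) over all attribute and latent factor pairs."""
    return max(max(row) for row in mi)


# Toy table: 2 attributes, 3 latent factors (made-up values, in nats).
mi_table = [
    [0.30, 0.05, 0.01],  # attribute 1 is captured mostly by factor 1
    [0.02, 0.10, 0.08],  # attribute 2 is split between factors 2 and 3
]
entropies = [0.69, 0.50]
mig = mutual_information_gap(mi_table, entropies)
```

In the toy table, the first attribute contributes a large gap because one factor dominates, while the second contributes almost nothing because two factors share its information.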

There are two consequences of using this measure: (1) a larger value indicates that the information about a ground-truth attribute is concentrated in a single latent factor, which aligns with the goals of disentanglement, and (2) a small value does not necessarily indicate that our model fails to disentangle; the semantic attributes it discovers may simply not align with any of the ground-truth semantic attributes. Consequently, we also include the maximum mutual information to indicate whether this is occurring. In Table 3, we see that InfoStyleGAN for CelebA does improve on disentanglement compared to InfoGAN.

### 5.4  Disentanglement, Control, and Generative Quality: Qualitative Assessment

Figures 2a and 2b display example attribute control for the proposed architecture for CelebA. Similar to the original results for InfoGAN, we see variables corresponding to emotion and head position; however, these correspond here to continuous rather than categorical variables and exhibit a greater magnitude of control. Surprisingly, in the image fourth from the right in Figure 2a, the variable also appears to control the glare of the glasses as the head is tilted up. Some entanglement is still seen, such as in the example on the left side of Figure 2a, where the faces become more masculine as the orientation increases, or in the third image from the right in Figure 2b, where glasses appear as the smile increases. However, across all source images, the effect of the attribute control is consistent in nature.

Table 3:
Faculty for Attribute Control and Disentanglement.

| Models | MIG | $\max_{k,i}I(V_k;C_i)$ |
| --- | --- | --- |
| InfoGAN | 2.4e-2 | 4.9e-2 |
| InfoStyleGAN | 3.4e-2 | 5.9e-2 |
| InfoStyleGAN-Discrete | 1.2e-4 | 3.0e-4 |

Notes: For CelebA, we use mutual information gap (MIG, equation 5.3) on the ground-truth binary labels of CelebA using the auxiliary network. Unlike the various autoencoder architectures, we do not necessarily expect that the semantic attributes learned will be all-inclusive, so we include the maximum mutual information (third column) over each ground-truth attribute and $C_i$ pair to see if any are actually significant. We do see an improvement in disentanglement using architectures with mutual information loss, whereas InfoStyleGAN-Discrete discovers semantic attributes that are not aligned with the ground-truth attributes. Consequently, for the discrete-only architecture, the MIG measure is inconclusive.

In contrast, for InfoGAN, we see comparatively more entanglement across the various source images and less control. In Figure 2c, we see that InfoGAN exhibits a significant degree of control over the horizontal orientation of the head, but the starting and final positions are less consistent than in Figure 2a. In Figure 2d, we also see control over the gender, which is entangled somewhat with smiling, and the degree of this smiling is not as significant as in Figure 2b. Consequently, these controls on the tuned baseline do not show a consistent effect like those on InfoStyleGAN. The continuous variables in both cases exhibited the most interesting factors, whereas the Bernoulli or categorical variables primarily affected the pose.

### 5.5  Anomaly Detection: Quantitative Assessment

Tables 4 and 5 report comparisons between the variants of our anomaly detection pipeline on CelebA and Stanford Cars, respectively. We weigh the area under the receiver operating characteristic curve (ROC AUC) most heavily, though accuracy and F1 score are closely correlated with it.
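For readers less familiar with the metric, ROC AUC can be read as the probability that a randomly chosen outlier receives a higher anomaly score than a randomly chosen inlier. The sketch below computes it directly from that definition. It assumes outliers are the positive class and that higher scores mean more anomalous; both conventions, the function name, and the toy scores are illustrative assumptions rather than details taken from the experiments.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], is_outlier: Sequence[bool]) -> float:
    """Pairwise (Mann-Whitney) form of ROC AUC; ties count as one half.

    This is quadratic in the number of samples, which is fine for a small
    illustration but too slow for a full test set.
    """
    pos = [s for s, y in zip(scores, is_outlier) if y]
    neg = [s for s, y in zip(scores, is_outlier) if not y]
    if not pos or not neg:
        raise ValueError("need at least one inlier and one outlier")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy anomaly scores (made-up values): higher means "more anomalous".
toy_scores = [0.9, 0.8, 0.4, 0.35, 0.1]
toy_labels = [True, False, True, False, False]
auc = roc_auc(toy_scores, toy_labels)
```

A value of 0.5 corresponds to random ranking, which is a useful reference point when reading the CelebA results in Table 4.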

Figure 2:

Effects of control and disentanglement by InfoGAN and InfoStyleGAN on CelebA.

For CelebA in Table 4, we see that the Q network representations perform the worst, followed by the convolutional representations, and that the dense representations are the best.

However, the generative models perform on par overall with the discriminative methods, performing better than Deep InfoMax and worse than Inception V3.

Table 4:
Table Describing ROC AUC, Overall Accuracy, and F1 Score for Each Method Tested on CelebA Data Set.

| Method | Abbreviation | ROC AUC | Accuracy | F1 Score |
| --- | --- | --- | --- | --- |
| InfoStyleGAN $\to$ Q $\to$ OCSVM | IQO | 0.567 [0.559, 0.576] | 54.47% [53.74%, 55.19%] | 0.524 |
| InfoStyleGAN $\to$ Q $\to$ LOF | IQL | 0.567 [0.559, 0.576] | 52.47% [51.74%, 53.2%] | 0.244 |
| InfoStyleGAN $\to$ Conv $\to$ OCSVM | ICO | 0.593 [0.585, 0.601] | 58.09% [57.37%, 58.81%] | 0.632 |
| InfoStyleGAN $\to$ Conv $\to$ LOF | ICL | 0.6 [0.592, 0.608] | 52.12% [51.39%, 52.85%] | 0.329 |
| InfoStyleGAN $\to$ Dense $\to$ OCSVM | IDO | 0.607 [0.599, 0.615] | 58.83% [58.11%, 59.55%] | 0.643 |
| InfoStyleGAN $\to$ Dense $\to$ LOF | IDL | 0.608 [0.6, 0.617] | 54.24% [53.52%, 54.97%] | 0.392 |
| Inception V3 $\to$ OCSVM | IO | 0.629 [0.621, 0.638] | 59.24% [58.52%, 59.96%] | 0.704 |
| Inception V3 $\to$ LOF | IL | 0.629 [0.621, 0.637] | 60.96% [60.25%, 61.67%] | 0.661 |
| Deep InfoMax $\to$ OCSVM | DO | 0.604 [0.595, 0.612] | 51.78% [51.05%, 52.51%] | 0.675 |
| Deep InfoMax $\to$ LOF | DL | 0.603 [0.595, 0.611] | 57.17% [56.44%, 57.89%] | 0.696 |

Notes: Confidence intervals are in brackets. The first entry in each method is the network used to get the representation, the second entry for generative methods is the representation used, and the last is the anomaly detection method. The second column is the abbreviation for each method, taking the first letter from each field. Inliers are male celebrities, and outliers are female celebrities. The numbers in bold denote the best approaches for methods using InfoStyleGAN.

Table 5:
Table Describing ROC AUC, Overall Accuracy, and F1 Score for Each Method Tested on the Stanford Cars Data Set.

| Method | Abbreviation | ROC AUC | Accuracy | F1 Score |
| --- | --- | --- | --- | --- |
| InfoStyleGAN $\to$ Q $\to$ OCSVM | IQO | 0.326 [0.293, 0.359] | 39.10% [36.08%, 42.12%] | 0.206 |
| InfoStyleGAN $\to$ Q $\to$ LOF | IQL | 0.346 [0.312, 0.38] | 44.50% [41.42%, 47.58%] | 0.031 |
| InfoStyleGAN $\to$ Conv $\to$ OCSVM | ICO | 0.379 [0.345, 0.414] | 42.60% [39.54%, 45.66%] | 0.305 |
| InfoStyleGAN $\to$ Conv $\to$ LOF | ICL | 0.460 [0.424, 0.496] | 48.70% [45.60%, 51.80%] | 0.111 |
| InfoStyleGAN $\to$ Dense $\to$ OCSVM | IDO | 0.809 [0.782, 0.836] | 71.70% [68.91%, 74.49%] | 0.748 |
| InfoStyleGAN $\to$ Dense $\to$ LOF | IDL | 0.867 [0.845, 0.89] | 77.50% [74.91%, 80.09%] | 0.742 |
| Inception V3 $\to$ OCSVM | IO | 0.807 [0.78, 0.834] | 74.40% [71.70%, 77.10%] | 0.767 |
| Inception V3 $\to$ LOF | IL | 0.707 [0.675, 0.739] | 52.70% [49.61%, 55.79%] | 0.218 |
| Deep InfoMax $\to$ OCSVM | DO | 0.577 [0.542, 0.613] | 55.80% [52.72%, 58.88%] | 0.546 |
| Deep InfoMax $\to$ LOF | DL | 0.569 [0.533, 0.604] | 52.40% [49.30%, 55.50%] | 0.244 |

Notes: Confidence intervals are in brackets. Inliers are small vehicles such as sedans and compacts, and outliers are large vehicles such as trucks and vans. The numbers in bold denote the best approaches for methods using InfoStyleGAN.
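The section does not state how the bracketed intervals were obtained. As a hedged illustration only, the accuracy intervals in Table 5 are numerically consistent with a 95% normal-approximation (Wald) binomial interval on roughly 1,000 test images. Both the interval method and the sample size are assumptions inferred from the reported numbers, and the function name below is hypothetical.

```python
import math
from typing import Tuple


def wald_interval(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation confidence interval for a proportion.

    p_hat is the observed proportion (e.g., accuracy), n the number of
    test samples (assumed here), and z the normal quantile (1.96 for 95%).
    """
    if n <= 0 or not 0.0 <= p_hat <= 1.0:
        raise ValueError("need n > 0 and 0 <= p_hat <= 1")
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


# Accuracy of the IQO row in Table 5, with an ASSUMED test set size of 1,000.
low, high = wald_interval(0.391, 1000)
```

Under these assumptions the interval rounds to [36.08%, 42.12%], matching the IQO row; the same calculation for the IDL row (77.50%) gives [74.91%, 80.09%].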

For Stanford Cars in Table 5, we also note that the dense representations perform best, while all other representations perform significantly worse. Given the failure of the Q network to optimize for its lower bound, the performance of the Q representations is somewhat expected, though that of the other two representations is not. One possible reason for this disparity can be deduced from Figure 4b (discussed below): large and small vehicles appear similar locally, and the global structure of the dense representation helps to discern the two classes. Interestingly, the generative approach actually does overtake both discriminative approaches here, which also uses a more global representation of the image.
Figure 3:

Plots of ROC curves for each data set used. Dashed lines are methods using OCSVM, while solid lines use LOF.

Finally, we plot the ROC curves for each data set in Figures 3a and 3b. For Stanford Cars, we see that improvements in AUC are distributed evenly throughout the curve. However, for CelebA, we see that we do not get any actual improvement in the true positive rate between different methods without increasing the false-positive rate significantly. This may be due to how gender is characterized by finer details, such as makeup or facial structure, which may be treated as invariants by only training on the inliers.

### 5.6  Anomaly Detection: Qualitative Assessment

Finally, we show in Figures 4a and 4b a confusion matrix containing example images for our anomaly detector for CelebA and Stanford Cars, respectively. Each figure is split into four quadrants, with the columns denoting the predictions by the model and rows denoting the true status of the image.

## 6  Discussion

### 6.1  Control

Visual inspection of our results demonstrates control of semantic variables as well as a good degree of disentanglement for our proposed full-fledged model for CelebA. In particular, we found on-par FID (i.e., 9.90 for InfoStyleGAN) when compared to the ablated baselines (i.e., 9.91 for InfoGAN), while InfoGAN did not perform as well with regard to disentanglement (i.e., MIG $=$ 3.4e-2 for InfoStyleGAN, compared to MIG $=$ 2.4e-2 for InfoGAN). FID was worse for other state-of-the-art algorithms that did not attempt to disentangle, except for COCO-GAN, which uses a much bigger model, so the comparison is not really apples-to-apples.

Figure 4:

A confusion matrix, with each column denoting the predictions and each row the ground truth of the anomaly detector using the dense representation of InfoStyleGAN with OCSVM on each data set.

In general, comparing InfoGAN to our method InfoStyleGAN, we see that InfoGAN does not have the same level of control as InfoStyleGAN, likely due to architectural differences, despite having a comparable distance to the real distribution. Given the purported benefits of StyleGAN, this disentanglement on the semantic variables is consistent with increased disentanglement of the overall latent space. Also interesting is that the overall distance from the real distribution is similar for InfoGAN and InfoStyleGAN, with StyleGAN performing slightly better. The FID comprises two terms, one measuring the distance between the two means and the other measuring the distance between the covariances. As the mutual information maximization increases the entropy of the generated images, presumably the info-methods are closer in terms of diversity rather than closer in terms of the average, especially given that the generators still have to generate all combinations of semantic attributes and still maintain the image's realism. These generators may not be able to infer from the training data set how to combine these factors in a realistic way, which may be reflected in the precision and recall metrics in Table 2 for $k=3$.
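For reference, the two terms of the FID can be written out in the metric's standard form. The symbols $\mu_r,\Sigma_r$ (mean and covariance of Inception features for real images) and $\mu_g,\Sigma_g$ (the same statistics for generated images) are notation introduced here for illustration and do not appear elsewhere in this section:

$\mathrm{FID}=\lVert\mu_r-\mu_g\rVert_2^2+\operatorname{Tr}\left(\Sigma_r+\Sigma_g-2\left(\Sigma_r\Sigma_g\right)^{1/2}\right)$

The first term compares the feature means, and the trace term compares the covariances, which is where differences in sample diversity would be expected to appear.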

Of note, during our experiments we observed that CelebA did not have $L_{\text{info}}$ fully maximized, which explains why some entanglement is still observable in some of the images. Consequently, an avenue of future work is to address this issue by trying to explicitly minimize total correlation in the loss function along with attempting different methods for the maximization of mutual information such as those found in Poole, Ozair, van den Oord, Alemi, and Tucker (2019).

In terms of Stanford Cars, the best lower bound of the mutual information it could achieve was roughly 0.1. One possibility for this performance may be that due to the data set having significantly fewer data compared to CelebA, the generator does not know how to combine potential attributes in a realistic way, so it focuses more on maintaining its realism rather than optimizing for the mutual information, which may also relate to its performance in anomaly detection.

In aggregate, we find that the proposed approach gave promising results for control and discovery of semantic attributes. Our finding that one latent factor sometimes appeared to jointly influence two different attributes is consistent with our lemma, which explains that it is the control objective (via mutual information maximization) that encourages disentanglement (via total correlation minimization). Indeed, since mutual information maximization is achieved via maximization of a lower bound on mutual information, if this bound is not tight, then minimization of total correlation down to zero may not be achieved (in the case of discrete variables), which explains residual correlation between factors.

We believe that the discovery of latent factors that control semantic attributes has other potential applications for AI tasks such as debiasing in health care. It allows the discovery of factors of variation that can be the basis for a sensitivity analysis that can help address issues of bias in AI. These can be of importance for health care applications where phenotype discovery and sensitivity analysis of AI with regard to these phenotypes are of interest. Fundamentally, it is complementary to methods that address direct classification of image semantic attributes when those attributes are known a priori. When attributes and phenotypes are not known, the problem is much more arduous, and our method can help address this issue.

### 6.2  Anomaly Detection

From the results in section 5, we note that anomaly detection based on representation learning from these generative models provides encouraging results. We view ROC AUC as our primary metric for comparison, though accuracy and F1 score are both correlated with ROC AUC.

In general, CelebA was a difficult data set for all methods, with the best overall, Inception V3, achieving an AUC of only 0.629, a performance close to that of InfoStyleGAN with LOF, which had an AUC of 0.608. Also, as can be seen in Figure 3a, for regimes with low false-positive rates, InfoStyleGAN-based methods somewhat outperform the discriminative methods. From the off-diagonals of Figure 4a, one possible reason for this could be that the most apparent features of gender could actually apply to both inliers and outliers. As an example, take the top right-most image in the predicted outlier/actual inlier quadrant and the bottom left-most image in the predicted inlier/actual outlier quadrant. Ignoring the hat, both have thick eyebrows and darkened eyes, with the woman having slightly redder lips and a slightly different facial structure.

For Stanford Cars, the generative models actually improve on the discriminative methods, which is fairly surprising given that large vehicles and small vehicles are contained in several separate categories in ImageNet. One possible reason is domain shift, given that some of the images in Figure 4b are heavily zoomed in, making it difficult to distinguish between small and large vehicles purely from size and forcing features to focus on areas such as the grille and body shape. Similar reasons possibly explain why the dense/global representation performs so much better than the other representations.

## 7  Conclusion

This study is concerned with generative models that address the joint objectives of discovery, disentanglement, and control of image semantic attributes, as well as the goal of performing anomaly detection using representations from this model. We describe a method that uses multiscale generative models and maximizes mutual information (so-called InfoStyleGAN) to achieve those joint goals, as evaluated both quantitatively and qualitatively. Results show that our method is competitive on two data sets in terms of anomaly detection compared with models trained on significantly larger data sets with multiple diverse classes, giving promising new directions to continue research into generative anomaly detection methods.

## Notes

1

Concurrent work that we became aware of after first submitting this work to arXiv includes Harkonen, Hertzmann, Lehtinen, and Paris (2020), Nie et al. (2020), Shen and Zhou (2020), and Tewari et al. (2020).

## References

Abay, R., Gehly, S., Balage, S., Brown, M., & Boyce, R. (2018). Maneuver detection of space objects using generative adversarial networks. Paper presented at the Advanced Maui Optical and Space Surveillance Technologies Conference.

Akçay, S., Atapour-Abarghouei, A., & Breckon, T. P. (2018). GANomaly: Semi-supervised anomaly detection via adversarial training. In C. Jawahar, H. Li, G. Mori, & K. Schindler (Eds.), Lecture Notes in Computer Science: Vol. 11363. Computer Vision – ACCV 2018. Berlin: Springer. doi.org/10.1007/978-3-030-20893-6_39.

Akçay, S., Atapour-Abarghouei, A., & Breckon, T. P. (2019). Skip-GANomaly: Skip connected and adversarially trained encoder-decoder anomaly detection. In Proceedings of the 2019 IEEE International Joint Conference on Neural Networks (pp. 1–8). Piscataway, NJ: IEEE.

Bachman, P., Hjelm, R. D., & Buchwalter, W. (2019). Learning representations by maximizing mutual information across views. CoRR, abs/1906.00910.

Bergmann, P., Löwe, S., Fauser, M., Sattlegger, D., & Steger, C. (2018). Improving unsupervised defect segmentation by applying structural similarity to autoencoders. CoRR, abs/1807.02011.

Brock, A., Donahue, J., & Simonyan, K. (2019). Large scale GAN training for high fidelity natural image synthesis. In Proceedings of the International Conference on Learning Representations. OpenReview.

Burlina, P., Joshi, N., & Wang, I. (2019). Where's Wally now? Deep generative and discriminative embeddings for novelty detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 11507–11516). Piscataway, NJ: IEEE.

Chen, T. Q., Li, X., Grosse, R. B., & Duvenaud, D. K. (2018). Isolating sources of disentanglement in variational autoencoders. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing systems, 31 (pp. 2610–2620). Red Hook, NY: Curran.

Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., & Abbeel, P. (2016). InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems, 29 (pp. 2172–2180). Red Hook, NY: Curran.

Deecke, L., Vandermeulen, R., Ruff, L., Mandt, S., & Kloft, M. (2018). Image anomaly detection with generative adversarial networks. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Cham: Springer.

Erfani, S. M., Rajasegarar, S., Karunasekera, S., & Leckie, C. (2016). High-dimensional and large-scale anomaly detection using a linear one-class SVM with deep learning. Pattern Recognition, 58, 121–134.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … Bengio, Y. (2014). Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 27 (pp. 2672–2680). Red Hook, NY: Curran.

Gray, K., Smolyak, D., Badirli, S., & Mohler, G. (2020). Coupled IGMM-GANs for deep multimodal anomaly detection in human mobility data. ACM Transactions on Spatial Algorithms and Systems, 6(4), article 24.

Grover, A., Dhar, M., & Ermon, S. (2018). Flow-GAN: Combining maximum likelihood and adversarial learning in generative models. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI.

Harkonen, E., Hertzmann, A., Lehtinen, J., & Paris, S. (2020). GANSpace: Discovering interpretable GAN controls. CoRR, abs/2004.02546.

Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., & Hochreiter, S. (2017). GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In I. Guyon, Y. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems, 30 (pp. 6626–6637). Red Hook, NY: Curran.

Jain, N., Manikonda, L., Hernandez, A. O., Sengupta, S., & Kambhampati, S. (2018). Imagining an engineer: On GAN-based data augmentation perpetuating biases. CoRR, abs/1811.03751.

Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2018). Progressive growing of GANs for improved quality, stability, and variation. In Proceedings of the International Conference on Learning Representations. OpenReview.

Karras, T., Laine, S., & Aila, T. (2019). A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE.

Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., & Aila, T. (2020). Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE.

Kimura, M., & Yanagihara, T. (2018). Semi-supervised anomaly detection using GANs for visual inspection in noisy training data. CoRR, abs/1807.01136.

Kingma, D. P., & Dhariwal, P. (2018). Glow: Generative flow with invertible 1 × 1 convolutions. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing systems, 31 (pp. 10215–10224). Red Hook, NY: Curran.

Kingma, D. P., & Welling, M. (2013). Auto-encoding variational Bayes. CoRR, abs/1312.6114.

Krause, J., Stark, M., Deng, J., & Fei-Fei, L. (2013). 3D object representations for fine-grained categorization. In Proceedings of the 4th International IEEE Workshop on 3D Representation and Recognition. Piscataway, NJ: IEEE.

Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J., & Aila, T. (2019). Improved precision and recall metric for assessing generative models. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in neural information processing systems, 32 (pp. 3927–3936). Red Hook, NY: Curran.

Lai, Y., Hu, J., Tsai, Y., & Chiu, W. (2018). Industrial anomaly detection and one-class classification using generative adversarial networks. In Proceedings of the 2018 IEEE/ASME International Conference on Advanced Intelligent Mechatronics (pp. 1444–1449). Piscataway, NJ: IEEE.

Lin, C. H., Chang, C., Chen, Y., Juan, D., Wei, W., & Chen, H. (2019). COCO-GAN: Generation by parts via conditional coordinating. In Proceedings of the IEEE International Conference on Computer Vision. Piscataway, NJ: IEEE.

Litjens, G., Kooi, T., Bejnordi, B. E., Setio, A. A. A., Ciompi, F., Ghafoorian, M., … Sánchez, C. I. (2017). A survey on deep learning in medical image analysis. Medical Image Analysis, 42, 60–88.

Liu, Y., Li, Z., Zhou, C., Jiang, Y., Sun, J., Wang, M., & He, X. (2018). Generative adversarial active learning for unsupervised outlier detection. CoRR, abs/1809.10816.

Liu, Z., Luo, P., Wang, X., & Tang, X. (2015). Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision. Piscataway, NJ: IEEE.

Lucic, M., Kurach, K., Michalski, M., Gelly, S., & Bousquet, O. (2018). Are GANs created equal? A large-scale study. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing systems, 31. Red Hook, NY: Curran.

Naphade, M., Chang, M.-C., Sharma, A., Anastasiu, D. C., Jagarlamudi, V., Chakraborty, P., … Siwei, L. (2018). The 2018 NVIDIA AI city challenge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshop (pp. 53–60). Piscataway, NJ: IEEE.

Nie, W., Karras, T., Garg, A., Debnath, S., Patney, A., Patel, A. B., & Anandkumar, A. (2020). Semi-supervised StyleGAN for disentanglement learning. In Proceedings of the International Conference of Machine Learning.

Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., … Kavukcuoglu, K. (2016). Wavenet: A generative model for raw audio. CoRR, arXiv:1609.03499.

Poole, B., Ozair, S., van den Oord, A., Alemi, A., & Tucker, G. (2019). On variational bounds of mutual information. In Proceedings of the 36th International Conference on Machine Learning.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., … Fei-Fei, L. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252. doi:10.1007/s11263-015-0816-y.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., & Chen, X. (2016). Improved techniques for training GANs. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems, 29 (pp. 2234–2242). Red Hook, NY: Curran.

Schlegl, T., Seeböck, P., Waldstein, S. M., Schmidt-Erfurth, U., & Langs, G. (2017). Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In Proceedings of the International Conference on Information Processing in Medical Imaging (pp. 146–157). Cham: Springer.

Shen, Y., & Zhou, B. (2020). Closed-form factorization of latent semantics in GANs. arXiv:2007.06600.

Tewari, A., Elgharib, M., Bharaj, G., Bernard, F., Seidel, H.-P., Pérez, P., … Theobalt, C. (2020). StyleRig: Rigging StyleGAN for 3D control over portrait images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ: IEEE.

Zenati, H., Foo, C. S., Lecouat, B., Manek, G., & Chandrasekhar, V. R. (2018). Efficient GAN-based anomaly detection. CoRR, abs/1802.06222.

Zhang, H., Goodfellow, I., Metaxas, D., & Odena, A. (2019).