Abstract
While the link between color and emotion has been widely studied, how context-based changes in color impact the intensity of perceived emotions is not well understood. In this work, we present a new multimodal dataset for exploring the emotional connotation of color as mediated by line, stroke, texture, shape, and language. Our dataset, FeelingBlue, is a collection of 19,788 4-tuples of abstract art ranked by annotators according to their evoked emotions and paired with rationales for those annotations. Using this corpus, we present a baseline for a new task: Justified Affect Transformation. Given an image I, the task is to 1) recolor I to enhance a specified emotion e and 2) provide a textual justification for the change in e. Our model is an ensemble of deep neural networks which takes I, generates an emotionally transformed color palette p conditioned on I, applies p to I, and then justifies the color transformation in text via a visual-linguistic model. Experimental results shed light on the emotional connotation of color in context, demonstrating both the promise of our approach on this challenging task and the considerable potential for future investigations enabled by our corpus.1
1 Introduction
Color is a powerful tool for conveying emotion across cultures, a connection apparent in both language and art (Mohr and Jonauskaite, 2022; Mohammad and Kiritchenko, 2018). Metaphoric language frequently uses color as a vehicle for emotion: Familiar English metaphors include “feeling blue” or “green with envy”. Similarly, artists often pick specific colors in order to convey particular emotions in their work, while viewer perceptions of a piece of art are affected by its color palette (Sartori et al., 2015). Previous studies have mostly been categorical, focusing on confirming known links between individual colors and emotions like blue & sadness or yellow & happiness (Machajdik and Hanbury, 2010; Sartori et al., 2015; Zhang et al., 2011). However, in the wild, the emotional connotation of color is often mediated by line, stroke, texture, shape, and language. Very little work has examined these associations. Does the mere presence of blue make an image feel sad? If it is made bluer, does it feel sadder? Is it dependent on its associated form or its surrounding color context? And, if the change is reflected in an accompanying textual rationale, is it more effective?
Our work is the first to explore these questions. We present FeelingBlue, a new corpus of relative emotion labels for abstract art paired with English rationales for the emotion labels (see Figure 1 and Section 3). A challenge with such annotations is the extreme subjectivity inherent to emotion. In contrast to existing Likert-based corpora, we employ a Best-Worst Scaling (BWS) annotation scheme that is more consistent and replicable (Mohammad and Bravo-Marquez, 2017). Moreover, as our focus is color in context (colors and their form), we restrict our corpus to abstract art, a genre where color is often the focus of the experience, mitigating the effect of confounding factors like facial expressions and recognizable objects on perceived emotions (as observed in Mohammad, 2011; Sartori et al., 2015; Zhang et al., 2011; Alameda-Pineda et al., 2016).
To demonstrate FeelingBlue’s usefulness in explorations of the emotional connotation of color in context, we introduce a novel task, Justified Affect Transformation—conditional on an input image Io and an emotion e, the task is 1) to recolor Io to produce an image Ie that evokes e more intensely than Io and 2) to provide justifications for why Io evokes e less intensely and why Ie evokes e more intensely. Using FeelingBlue, we build a baseline system for two subtasks: image recoloring and rationale retrieval (Section 4).
We conduct a thorough human evaluation that confirms both the promise of our approach on this challenging new task and the opportunity for future investigations enabled by FeelingBlue. Our results reveal regularities between context-based changes in color and emotion while also demonstrating the potential of linguistic framing to mold this subjective human experience (Section 5).
Our dataset, code, and models are available at https://github.com/amith-ananthram/feelingblue.
2 Related Work
While the body of work studying the relationship between color and emotion is quite large, almost all of it has focused on identifying categorical relationships in text to produce association lexicons (notable works include Mohammad, 2011; Sutton and Altarriba, 2016; Mikellides, 2012).
In the domain of affective image analysis, previous work has mostly explored classifying the emotional content of images. Machajdik and Hanbury (2010), Sartori et al. (2015), Zhang et al. (2011) and Alameda-Pineda et al. (2016) focus on identifying low-level image features correlated with each perceived emotion, with the latter two examining abstract art specifically. Rao et al. (2019) employ mid- and high-level features for classification. Some work has investigated the distribution of emotions in images: Zhao et al. (2017, 2018) create probability distributions over multiple emotions and Kim et al. (2018) look at gradient values of arousal and valence, though none of the works correlate emotion values with image colors. Similarly, Xu et al. (2018) learn dense emotion representations through a text-based multi-task model but they do not explore its association with color.
Image recoloring is a very small subset of work in the field of image transformation and has mostly focused on palettes. PaletteNet recolors a source image given a target image palette (Cho et al., 2017) while Bahng et al. (2018) semantically generate palettes for coloring gray-scaled images. No previous work has examined recoloring images to change their perceived emotional content. Similarly, the field of text-to-image synthesis, which has seen major progress recently with models like DALL-E 2 (Ramesh et al., 2022), has centered on generating images de novo or on in-painting. There has been no work that recolors an image while preserving its original structure.
3 Dataset
Our dataset, FeelingBlue, is a collection of abstract art ranked by the emotions they most evoke with accompanying textual justifications (see Figure 1). It contains 5 overlapping subsets of images (one each for anger, disgust, fear, happiness, and sadness) with continuous-value scores that measure the intensity of the respective emotion in each image compared to the other images in the subset.
While WikiArt (Mohammad and Kiritchenko, 2018), DeviantArt2 (Sartori et al., 2015), and other emotion corpora contain images with multi-label continuous emotion scores, these scores were collected for each image in isolation without accompanying rationales. They reflect how often annotators believed that a particular emotion fit an image, resulting in a measure of the presence of the emotion rather than its intensity. As such, their scores are not a suitable way to order these images by the strength of the emotion they evoke. In contrast, our annotations were collected by asking annotators to 1) rank groups of images according to a specified emotion and 2) justify their choice.
Below, we detail how we compiled this corpus.
3.1 Image Compilation
The images3 in our dataset are drawn from both WikiArt and DeviantArt.4 As we are most interested in the emotional connotation of color as constrained by its form, we manually removed images of photographs, statues, people, or recognizable objects. This eliminated many confounding factors like facial expressions, flowers, or skulls that might affect a person’s emotional response to an image, leaving primarily color and its visual context. Our final corpus contains 2,756 images.
3.2 Annotation Collection
We began by partitioning our images into overlapping emotion subsets where each image appears twice, in both subsets of its top 2 emotions according to its original corpus (WikiArt or DeviantArt) scores. As we want meaningful continuous value scores of emotional intensity, restricting images to their top 2 emotions ensures that the scored emotion is present. Within each subset, we randomly generated 4-tuples of images such that each image appears in at least 2 4-tuples. With these 4-tuples in hand, we collected annotations via Best-Worst Scaling (BWS) (Flynn and Marley, 2014), a technique for obtaining continuous scores previously used to construct sentiment lexicons (Kiritchenko and Mohammad, 2016) and to label the emotional intensity of tweets (Mohammad and Bravo-Marquez, 2017). In fact, Mohammad and Bravo-Marquez (2017) found that BWS is a reliable method for generating consistent scaled label values. It produces more replicable results than the typical Likert scale where annotators often do not agree with their own original assessments when shown an item they have already labeled.
In BWS, annotators are presented with n options (where often n = 4), and asked to pick the ‘best’ and ‘worst’ option according to a given criterion. For our task, we present each annotator with a 4-tuple (i.e., 4 images) of abstract art and an emotion, and the ‘best’ and ‘worst’ options are the images that ‘most’ and ‘least’ evoke the emotion relative to the other images. In addition, we also asked each annotator to provide rationales describing the salient features of their chosen ‘most’ and ‘least’ emotional images. As is common practice with BWS, for each subset corresponding to an emotion e, we calculate continuous value scores for each image I by subtracting the number of times I was selected as ‘least’ evoking e from the number of times I was selected as ‘most’ evoking e and then dividing by the number of times I appeared in an annotated 4-tuple. We collected 3 annotations per 4-tuple task from Master Workers on Amazon Mechanical Turk (AMT) via the BWS procedure of Flynn and Marley (2014). Workers were paid consistent with the minimum wage.
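Concretely, the resulting score for an image I under emotion e can be written as follows (the notation is introduced here only for clarity):

$$\mathrm{score}(I, e) = \frac{\#\mathrm{most}(I, e) - \#\mathrm{least}(I, e)}{\#\mathrm{tuples}(I, e)} \in [-1, 1]$$

where #most(I, e) and #least(I, e) count how often I was chosen as 'most' and 'least' evoking e, and #tuples(I, e) is the number of annotated 4-tuples containing I.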
We did not use all of the emotion classes from the original datasets in our study. Mohammad and Kiritchenko (2018) found that emotions such as arrogance, shame, love, gratitude, and trust were interpreted mostly through facial expression. We also excluded anticipation and trust as they were assumed to be related to the structural composition of the artwork rather than color choice. For each remaining emotion, 50 images from their corresponding subsets were sampled for a pilot study of the 4-tuple data collection. Surprise was removed as pilot participants exhibited poor agreement and did not reference color or terms that evoke color in its corresponding rationales. This left 5 emotions: anger, disgust, fear, happiness, and sadness.
We manually filtered the collected annotations to remove uninformative rationales (such as “it makes me feel sad”). This filtering resulted in splitting each collected BWS annotation into a ‘best’ 4-tuple (which contains the 4 images and the most emotional choice among them, accompanied by a rationale) and a ‘worst’ 4-tuple (which contains the 4 images and the least emotional choice among them, accompanied by a rationale). Our final corpus contains 19,788 annotations, nearly balanced with 9,912 ‘best’ and 9,876 ‘worst’ (as in some 4-tuples, either the ‘best’ or the ‘worst’ rationale was retained but not both).
Table 1 contains summary and inter-annotator agreement statistics for our corpus, broken down by emotion subset. We rely on 3 different measures to gauge the consistency of these annotations. The first captures the degree to which differences among annotations result in changes to the ranking of the images by BWS score. For each emotion, we randomly split the 3 annotations for each of its 4-tuples and then calculate BWS scores for both of the resulting random partitions. We do this 30 times, calculating the Spearman rank correlation between the pairs of scores for each partition, and present the mean and standard deviation of the resulting coefficients. The second measures the percentage of annotations that agree with the majority annotation for a particular 4-tuple (cases where all annotators disagree have no majority annotation). The third measures the number of distinct choices made by annotators for each 4-tuple. Given the considerable subjectivity of the annotation task, these inter-annotator agreement numbers are reasonable and consistent with those reported by Mohammad and Kiritchenko (2018) for the relatively abstract genres from which our corpus is drawn. Happiness, the only emotion with a positive valence, exhibits the worst agreement.
| Emotion | # Images | # Best | # Worst | Total | Spearman-R | Maj. Agree % | # Labels |
|---|---|---|---|---|---|---|---|
| Anger | 187 | 1,078 | 1,087 | 2,165 | 0.685 (0.034) | 64, 68, 66 | 2.03, 1.92 |
| Disgust | 525 | 3,072 | 3,078 | 6,150 | 0.676 (0.019) | 65, 66, 66 | 2.00, 1.98 |
| Fear | 396 | 2,352 | 2,315 | 4,667 | 0.691 (0.017) | 65, 66, 66 | 1.99, 1.94 |
| Happiness | 399 | 2,326 | 2,346 | 4,672 | 0.573 (0.029) | 65, 61, 63 | 2.00, 2.13 |
| Sadness | 183 | 1,084 | 1,050 | 2,134 | 0.672 (0.040) | 68, 64, 66 | 1.97, 2.00 |
| All | 1,688 | 9,912 | 9,876 | 19,788 | | | |
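The Spearman-R column is the split-half reliability described above. A minimal sketch of that computation is shown below; the annotation data structure (dicts with "images", "most", and "least" keys) and the 2/1 split of the 3 annotations are assumptions made for illustration.

```python
import random
import numpy as np
from collections import defaultdict
from scipy.stats import spearmanr

def bws_scores(annotations):
    """BWS score per image: (#chosen 'most' - #chosen 'least') / #appearances."""
    most, least, appear = defaultdict(int), defaultdict(int), defaultdict(int)
    for ann in annotations:
        for img in ann["images"]:
            appear[img] += 1
        most[ann["most"]] += 1
        least[ann["least"]] += 1
    return {img: (most[img] - least[img]) / appear[img] for img in appear}

def split_half_reliability(annotations_by_tuple, trials=30, seed=0):
    """Randomly split the 3 annotations per 4-tuple, score both halves, and
    report the mean/std Spearman correlation over `trials` repetitions."""
    rng, rhos = random.Random(seed), []
    for _ in range(trials):
        half_a, half_b = [], []
        for anns in annotations_by_tuple.values():
            anns = list(anns)
            rng.shuffle(anns)
            half_a.extend(anns[:2])          # 2 of the 3 annotations (split size assumed)
            half_b.extend(anns[2:])          # the remaining 1
        scores_a, scores_b = bws_scores(half_a), bws_scores(half_b)
        shared = sorted(set(scores_a) & set(scores_b))
        rho, _ = spearmanr([scores_a[i] for i in shared],
                           [scores_b[i] for i in shared])
        rhos.append(rho)
    return float(np.mean(rhos)), float(np.std(rhos))
```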
We present images, rationales, and scores from our corpus in Figure 1.
3.3 Corpus Analysis
We explore a number of linguistic features (color, shape, texture, concreteness, emotion, and simile) in FeelingBlue’s rationales to better understand how they change with image emotion and color.
To measure color, we count both explicit color terms (e.g., ‘red’, ‘green’) and implicit references (e.g., ‘milky’, ‘grass’). As the artwork is abstract and can only convey meaning through “line, stroke, color, texture, form, and shape” (IdeelArt, 2016), the use of adjectives and nouns with strong color correlation is a likely reference to those colors in the image. For explicit color terms, we use the base colors from the XKCD dataset (Monroe, 2010). For implicit color references, we use the NRC Color Lexicon (NRC) (Mohammad, 2011), taking the words with associated colors where at least 75% of the responders agreed with the color association. In order to compare color references with the main color of each image, the primary colors of each image were binned to the 11 color terms used in the NRC according to nearest Delta-E distance. This results in an uneven number of rationales per color bin. The implicit color terms were mapped directly to the color bins; each explicit color term was mapped to its nearest.
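A minimal sketch of this binning step is shown below; the representative RGB anchor chosen for each NRC color term is illustrative rather than the exact set used, and Delta-E is computed as CIE76 (Euclidean distance in LAB space).

```python
import numpy as np
from skimage.color import rgb2lab

# Illustrative RGB anchors for the 11 NRC color terms (the exact anchors
# used for binning are an assumption).
NRC_COLORS = {
    "black": (0, 0, 0), "white": (255, 255, 255), "grey": (128, 128, 128),
    "red": (255, 0, 0), "green": (0, 128, 0), "blue": (0, 0, 255),
    "yellow": (255, 255, 0), "orange": (255, 165, 0), "purple": (128, 0, 128),
    "pink": (255, 192, 203), "brown": (150, 75, 0),
}

def _to_lab(rgb):
    """Convert a single 0-255 RGB triple to LAB."""
    return rgb2lab(np.array(rgb, dtype=float).reshape(1, 1, 3) / 255.0).reshape(3)

_NRC_LAB = {name: _to_lab(rgb) for name, rgb in NRC_COLORS.items()}

def nearest_nrc_term(rgb):
    """Bin an RGB color to the nearest NRC term by Delta-E (CIE76)."""
    lab = _to_lab(rgb)
    return min(_NRC_LAB, key=lambda name: np.linalg.norm(lab - _NRC_LAB[name]))
```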
Shape words were collected by aggregating lists of 2D shapes online and corroborating them with student volunteers. For texture words, we used the 98 textures from Bhushan et al. (1997). Concreteness was measured as a proxy for how often the rationales ground the contents of an image in actual objects and scenery, like referring to a gradient of color as a ‘sunset’ or a silver streak as a ‘knife’. To calculate concreteness, we used the lexicon collected by Brysbaert et al. (2014), which rates 40,000 English lemmas on a scale from 0 to 5. After empirical examination, we threshold concreteness at 4 and ignore all of the explicit color terms, shapes, and textures. Rationales were labeled as concrete if at least one word in the rationale was above the concreteness threshold. Similes were identified by the presence of ‘like’ (but not ‘I like’), as in “The blue looks like a monster”.
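These two heuristics can be sketched as follows; the tokenization and the simple handling of ‘I like’ are simplifications for illustration.

```python
import re

def is_concrete(rationale, concreteness, ignore_terms, threshold=4.0):
    """A rationale is 'concrete' if any word outside the ignored color/shape/
    texture terms meets the concreteness threshold."""
    words = re.findall(r"[a-z']+", rationale.lower())
    return any(concreteness.get(w, 0.0) >= threshold
               for w in words if w not in ignore_terms)

def has_simile(rationale):
    """Flag rationales that use 'like' as a comparator, ignoring 'I like'."""
    text = rationale.lower()
    return bool(re.search(r"\blike\b", text)) and "i like" not in text
```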
In Figure 2 we present heatmaps which break down this exploration. In total, 69.3% of the rationales refer to color, highlighting the central role color plays in FeelingBlue. We see that 51.2% contain explicit color references and 36.1% contain implicit references (with a 26% overlap). Surprisingly, the majority of these references were not to the primary color of each image. This suggests that viewers were drawn to colors which were either more central in the canvas or contrasted against the primary color. A notable exception is ‘red’ for ‘anger’: when the image is primarily red, it is mentioned 80% of the time in the rationale for the angriest images. Interestingly, when the image is primarily green the rationale explicitly mentions ‘anger’, as if the image evokes ‘anger’ despite the coloring. This is a surprising contrast to the sad images, where ‘blue’, deeply tied to ‘sadness’ in English, is hardly mentioned when the main color is blue, but for those same images, the rationales explicitly call the image ‘sad’. It may be that blue is so intrinsically tied to ‘sadness’ that responders felt sad without consciously linking the two.
The happiest images are described as ‘bright’ and ‘graceful’ while the saddest are ‘dark’ and ‘muddy’. Though the least happy are ‘dark’ as well, they are also ‘simple’ and ‘dull’, while the least sad are ‘simple’, ‘light’, and ‘empty’. As the language varies across these valence pairs (e.g., least happy/most sad), this suggests that the rationales reflect the full continuum of each emotion.
Shapes are referred to much less frequently, a mere 11.4%, and texture is mentioned in only 4.9% of the rationales. However, despite the images being of abstract art, 52.4% of the rationales were ‘concrete’ (with 17.9% containing simile), revealing the importance of grounding in rationalizations of the emotional connotation of color.
4 Justified Affect Transformation
We define a new task to serve as a vehicle for exploring the emotional connotation of color in context enabled by our corpus: Justified Affect Transformation. Given an image Io and a target emotion e ∈ E, the task is:
1) change the color palette of Io to produce an image Ie that evokes e more intensely than Io

2) provide textual justifications, one explaining why Io evokes e less intensely and another explaining why Ie evokes e more intensely
By focusing on changes in color (conditional on form), we can understand the affect of different palettes in different contexts. And by producing justifications for those changes, we can explore the degree to which the emotional connotation can be accurately verbalized in English.
To solve this task, we propose a two step approach: 1) an image recoloring component that takes as input an image Io and a target emotion e ∈ E and outputs Ie, a version of Io recolored to better evoke e (Section 4.1); and 2) a rationale retrieval component that takes as input two images, Io and Ie, an emotion e ∈ E, and a large set of candidate rationales R, and outputs a ranked list of rationales Rless that justify why Io evokes e less than Ie and a ranked list of rationales Rmore that justify why Ie evokes e more than Io (Section 4.2).
4.1 Image Recoloring
Our image recoloring model takes an image Io and an emotion e as inputs and outputs a recolored image Ie,∀e ∈ E that better evokes the given emotion. In an ideal scenario, this model would be trained on a large corpus that directly reflects the task of emotional recoloring: differently colored versions of the same image, ranked according to their emotion. Such a corpus is difficult to construct. Instead, we use our corpus, FeelingBlue, which contains 3-tuples, (Iless,Imore,e), where Iless and Imore are entirely different images and Iless evokes e less intensely than Imore.
Our image recoloring model is an ensemble of neural networks designed to accommodate this challenging training regime. It consists of two subnetworks, an emotion-guided image selector and a palette applier, each trained independently. The emotion-guided image selector takes two images and an emotion and identifies which of the two better evokes the emotion. The palette applier (PaletteNet, Cho et al. (2017)) takes an image and a c-color palette and applies the palette to the image in a context-aware manner.
To produce Ie from Io for a specific emotion e, we begin with a randomly initialized palette pe, apply it to Io with the frozen palette applier to produce Ie and rank Io against Ie with the frozen emotion-guided image selector. We update pe via backpropagation so that the recolored Ie more intensely evokes e according to the emotion-guided image selector (see Figure 3). We avoid generating adversarial transformations by restricting the trainable parameters to the colors in the image’s palette (instead of the image itself), forcing the backpropagation through the emotion-guided image selector to find a solution on the manifold of recolorizations of Io. Additionally, we avoid local minima by optimizing 100 randomly initialized palettes for each (Io,e) pair, allowing us to select a palette from the resulting set that balances improved expression of e against other criteria.
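This palette-optimization loop can be sketched as follows; the tensor shapes, the applier/selector call signatures, the tanh squashing of the palette, and the cross-entropy objective over the selector's two-way output are assumptions made for illustration (the components themselves are detailed in the subsections below).

```python
import torch
import torch.nn.functional as F

def optimize_palette(image, emotion_id, palette_applier, emotion_selector,
                     num_colors=6, steps=2000, lr=0.01):
    """Freeze both subnetworks and train only a c-color palette (plus an
    L-channel shift) so the recolored image better evokes the target emotion.
    `image` is assumed to be a (1, 3, H, W) LAB tensor scaled to [-1, 1]."""
    palette_applier.eval()
    emotion_selector.eval()
    for p in list(palette_applier.parameters()) + list(emotion_selector.parameters()):
        p.requires_grad_(False)

    palette = torch.randn(1, num_colors * 3, requires_grad=True)   # random initialization
    l_shift = torch.zeros(1, requires_grad=True)
    optimizer = torch.optim.Adam([palette, l_shift], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        ab = palette_applier(image, torch.tanh(palette))            # predicted A/B channels
        l_channel = (image[:, :1] + l_shift).clamp(-1, 1)           # reuse (shifted) L channel
        recolored = torch.cat([l_channel, ab], dim=1)
        logits = emotion_selector(image, recolored, emotion_id)     # (original, recolored) logits
        # push the frozen selector to prefer the recolored image for this emotion
        loss = F.cross_entropy(logits, torch.tensor([1]))
        loss.backward()
        optimizer.step()

    return torch.tanh(palette).detach(), l_shift.detach(), loss.item()
```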
Our emotion-guided image selector, palette applier, and palette training objectives are detailed in Sections 4.1.1, 4.1.2, and 4.1.3, respectively.
4.1.1 Emotion-Guided Image Selector
We begin by training our emotion-guided image selector. This model takes two images (I1,I2) and an emotion e as input and predicts which of the two images more intensely evokes e. The architecture produces dense representations from the final pooling layer of a pretrained instance of ResNet (He et al., 2016), concatenates those representations and a 1-hot encoding of e and passes this through lES fully connected (FC) layers, re-concatenating the encoding of e after every layer. We apply a dropout rate of dES to the output of the first lES/2 FC layers and employ leaky ReLU activations to facilitate backpropagation through this network later. The model’s prediction is the result of a final softmax-activated layer.
To encourage the model to be order agnostic, we expand our corpus by simply inverting each pair (producing one “left” example with the more intense image first and another “right” example with it second). We optimize a standard cross entropy loss and calculate accuracies by split (“train” or “valid”), side (“left” or “right”), and emotion. We choose the checkpoint with the best, most balanced performance across these axes.
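A sketch of this architecture is given below; the hidden width and the exact placement of dropout are assumptions (the hyperparameters used are listed in Section 4.3.3).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50, ResNet50_Weights

class EmotionGuidedImageSelector(nn.Module):
    """Sketch of the selector: ResNet features for both images, a 1-hot
    emotion encoding re-concatenated after every FC layer, leaky ReLUs,
    and a final 2-way output."""
    def __init__(self, num_emotions=5, num_fc=6, hidden=512, dropout=0.1):
        super().__init__()
        backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # through final pooling
        feat_dim = backbone.fc.in_features                             # 2048
        self.num_emotions = num_emotions
        in_dims = [2 * feat_dim + num_emotions] + [hidden + num_emotions] * (num_fc - 1)
        self.fcs = nn.ModuleList(nn.Linear(d, hidden) for d in in_dims)
        self.dropout = nn.Dropout(dropout)
        self.num_dropout_layers = num_fc // 2
        self.classifier = nn.Linear(hidden + num_emotions, 2)

    def forward(self, image1, image2, emotion_id):
        # emotion_id: LongTensor of shape (batch,)
        one_hot = F.one_hot(emotion_id, self.num_emotions).float()
        h = torch.cat([self.encoder(image1).flatten(1),
                       self.encoder(image2).flatten(1), one_hot], dim=1)
        for i, fc in enumerate(self.fcs):
            h = F.leaky_relu(fc(h))
            if i < self.num_dropout_layers:         # dropout on the first half of the FC stack
                h = self.dropout(h)
            h = torch.cat([h, one_hot], dim=1)      # re-concatenate the emotion encoding
        return self.classifier(h)                   # logits: which image more evokes e
```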
4.1.2 Palette Applier
Our palette applier takes an image Io and c-color palette p as input and outputs a recolored version of the image Ir using the given palette p. Its architecture is that of PaletteNet (Cho et al., 2017), a convolutional encoder-decoder model. The encoder is built with residual blocks from a randomly initialized instance of ResNet, while the decoder relies on blocks of convolutional layers, instance norm, leaky ReLU activations, and upsampling to produce the recolored image Ir. The outputs of each convolutional layer in the encoder and the palette p are concatenated to the inputs of the corresponding layer of the decoder, ensuring that the model has the signal necessary to apply the palette properly. The palette applier outputs the A and B channels of Ir. As in Cho et al. (2017), Io’s original L channel is reused (see Figure 3).
We train this model on a corpus of recolored tuples (Io,Ir,p) generated from our images as in Cho et al. (2017). For each image Io, we convert it to HSV space and generate rPA recolored variants Ir by shifting its H channel by fixed amounts, converting them back to LAB space and replacing their L channels with the original from Io. We extract a c-color palette p for each of these recolored variants Ir using “colorgram.”5 We augment this corpus via flipping and rotation. We optimize a pixel-wise L2 loss in LAB space and choose the checkpoint with the best “train” and “valid” losses.
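A sketch of this corpus-generation step is shown below; the number of hue shifts and the k-means palette extractor (standing in for colorgram) are assumptions.

```python
import numpy as np
from skimage.color import rgb2hsv, hsv2rgb, rgb2lab
from sklearn.cluster import KMeans

def recolored_variants(rgb_image, num_variants=8):
    """Hue-shifted training variants: shift H by fixed amounts, convert back
    to RGB, then to LAB, and reuse the original L channel."""
    original_lab = rgb2lab(rgb_image)
    base_hsv = rgb2hsv(rgb_image)
    variants = []
    for k in range(1, num_variants + 1):
        hsv = base_hsv.copy()
        hsv[..., 0] = (hsv[..., 0] + k / (num_variants + 1)) % 1.0   # wrap the hue shift
        lab = rgb2lab(hsv2rgb(hsv))
        lab[..., 0] = original_lab[..., 0]                           # keep the original L channel
        variants.append(lab)
    return variants

def extract_palette(lab_image, num_colors=6):
    """c-color palette via k-means over pixels (a stand-in for colorgram)."""
    pixels = lab_image.reshape(-1, 3)
    return KMeans(n_clusters=num_colors, n_init=10).fit(pixels).cluster_centers_
```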
4.1.3 Palette Generation
We use frozen versions of our emotion-guided image selector ES and palette applier PA to generate a set of c-color palettes pe,∀e ∈ E and L-channel shifts be,∀e ∈ E which, when applied to our original image Io, produce recolored variants Ie,∀e ∈ E, each evoking e more intensely than Io.
To avoid getting stuck in local minima, we optimize 100 randomly initialized palettes for each emotion e. Choosing the palette with the smallest loss produces similar transformations for certain emotions (such as fear and disgust). One desirable property for Ie,∀e is color diversity. To prioritize this, we consider the top 50 palettes according to their loss for each emotion e and select one palette for each e such that the pairwise L2 distance among the resulting Ie is maximal.6
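One simple approximation of this selection step is a greedy pass over the emotions, sketched below; the greedy strategy (rather than exhaustive search) and the data layout (each emotion mapped to a list of (loss, recolored image) pairs) are assumptions.

```python
import numpy as np

def select_diverse_recolorings(candidates, top_k=50):
    """For each emotion, keep the top_k lowest-loss candidates, then pick one
    recoloring per emotion so the chosen images are far apart in pixel space."""
    shortlists = {e: sorted(pairs, key=lambda p: p[0])[:top_k]
                  for e, pairs in candidates.items()}
    chosen = {}
    for e, pairs in shortlists.items():
        if not chosen:                      # seed with the lowest-loss recoloring
            chosen[e] = pairs[0][1]
            continue
        # maximize total L2 distance to the recolorings already chosen
        chosen[e] = max((img for _, img in pairs),
                        key=lambda img: sum(np.linalg.norm(img - other)
                                            for other in chosen.values()))
    return chosen
```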
4.2 Rationale Retrieval
Our rationale ranking model takes as input two images, Iless and Imore, an emotion e ∈ E, and a set of candidate rationales R drawn from FeelingBlue. It then outputs 1) Rless, a ranking of rationales from R explaining why Iless evokes e less intensely and 2) Rmore, a ranking of rationales from R explaining why Imore evokes e more intensely.
The architecture embeds Iless and Imore with CLIP (Radford et al., 2021), a state-of-the-art multimodal model trained to collocate web-scale images with their natural language descriptions via a contrastive loss. We concatenate these CLIP embeddings with an equally sized embedding of e and pass this through a ReLU-activated layer producing a shared representation t. We apply dropout dRR before separate linear heads project t into CLIP’s multimodal embedding space, resulting in tless and tmore.
Given (Iless,Imore,rless,rmore), with rless and rmore ∈ R, we optimize CLIP’s contrastive loss, encouraging the logit scaled cosine similarities between tless|more and CLIP embeddings of R to be close to 1 for rless|more and near 0 for the rest. We weight this loss by the frequency of rationales in our corpus and reuse CLIP’s logit scaling factor.
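The retrieval head and its loss can be sketched as follows, operating on precomputed CLIP embeddings (so the CLIP encoders themselves are omitted); the embedding width, the fixed logit scale, and the omission of the per-rationale frequency weighting are simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RationaleRetriever(nn.Module):
    """Sketch of the retrieval head over precomputed CLIP image embeddings."""
    def __init__(self, clip_dim=512, num_emotions=5, dropout=0.4):
        super().__init__()
        self.emotion_embedding = nn.Embedding(num_emotions, clip_dim)  # equally sized emotion embedding
        self.shared = nn.Linear(3 * clip_dim, clip_dim)
        self.dropout = nn.Dropout(dropout)
        self.less_head = nn.Linear(clip_dim, clip_dim)
        self.more_head = nn.Linear(clip_dim, clip_dim)

    def forward(self, less_emb, more_emb, emotion_id):
        e = self.emotion_embedding(emotion_id)
        t = F.relu(self.shared(torch.cat([less_emb, more_emb, e], dim=-1)))
        t = self.dropout(t)
        return self.less_head(t), self.more_head(t)      # t_less, t_more in CLIP space

def contrastive_loss(query, rationale_embs, target_idx, logit_scale=100.0):
    """CLIP-style loss: scaled cosine similarities between the query and all
    candidate rationale embeddings, with the gold rationale as the target."""
    sims = F.cosine_similarity(query.unsqueeze(1), rationale_embs.unsqueeze(0), dim=-1)
    return F.cross_entropy(logit_scale * sims, target_idx)
```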
4.3 Training Details
4.3.1 Corpus
We extract pairs of images ordered by the emotion they evoke from FeelingBlue. Each 4-tuple is ranked according to a particular emotion e resulting in a ‘Least’, ℓ, ‘Most’, m, and two unordered middle images, u1 and u2. This provides us with 5 ordered image pairs of (less of emotion e, more of emotion e): (ℓ, u1), (ℓ, u2), (ℓ, m), (u1, m), and (u2, m) which we use to train both our image recoloring and our rationale retrieval models. Note that while FeelingBlue restricts us to the 5 emotions for which it contains annotations, both the task and our approach could be extended to other emotions with access to similarly labeled data.
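For reference, the five ordered pairs implied by a single ranked 4-tuple, using the notation above:

```python
def ordered_pairs(least, middle1, middle2, most):
    """The 5 (less of e, more of e) pairs extracted from a ranked 4-tuple."""
    return [(least, middle1), (least, middle2), (least, most),
            (middle1, most), (middle2, most)]
```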
4.3.2 Preprocessing
We preprocess each image by first resizing it to 224 × 224, zero-padding the margins to maintain aspect ratios, converting it from RGB to LAB space, and then normalizing it to a value between − 1 and 1. As LAB space attempts to represent human perception of color, it allows our model to better associate differences in perceived color with differences in perceived emotion. We note here that our emotion-guided image selector relies on fine-tuning a version of ResNet pretrained on images in RGB space, not LAB space. Thus, we incur an additional domain shift cost in our fine-tuning. While this cost could be avoided by training ResNet from scratch in LAB space (perhaps on a corpus of abstract art), our experimental results show that it appears to have been more than offset by the closer alignment between input representation and human visual perception.
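A sketch of this preprocessing pipeline is shown below; the exact per-channel scaling into [-1, 1] is an assumption.

```python
import numpy as np
from PIL import Image
from skimage.color import rgb2lab

def preprocess(path, size=224):
    """Resize (preserving aspect ratio), zero-pad to size x size, convert
    RGB -> LAB, and scale each channel to roughly [-1, 1]."""
    img = Image.open(path).convert("RGB")
    scale = size / max(img.size)
    img = img.resize((max(1, round(img.width * scale)), max(1, round(img.height * scale))))
    canvas = Image.new("RGB", (size, size))                       # zero-padded margins
    canvas.paste(img, ((size - img.width) // 2, (size - img.height) // 2))
    lab = rgb2lab(np.asarray(canvas) / 255.0)
    lab[..., 0] = lab[..., 0] / 50.0 - 1.0                        # L: [0, 100] -> [-1, 1]
    lab[..., 1:] = lab[..., 1:] / 128.0                           # A/B: ~[-128, 127] -> ~[-1, 1]
    return lab.transpose(2, 0, 1).astype(np.float32)              # (3, H, W)
```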
4.3.3 Hyperparameters
For all of our models, we extract c = 6 color palettes. Our Emotion-Guided Image Selector uses a pre-trained ResNet-50 backbone and lES = 6 fully connected layers. It was trained with dropout dES = 0.1 (Srivastava et al., 2014), learning rate lrES = 5e-5, and a batch size of 96 for 30 epochs using Adam (Kingma and Ba, 2015). Our Palette Applier was trained for 200 epochs with a batch size of 128 using the same hyperparameters as Cho et al. (2017). To generate our palettes, we use learning rate lrPG = 0.01 and iterate up to T = 2000 steps. Finally, our Rationale Retrieval model was trained with dropout dRR = 0.4, learning rate lrRR = 1e-4, and a batch size of 256 for 100 epochs using Adam.
5 Results and Discussion
To understand our approach’s strengths, we evaluate our image recoloring model and rationale retrieval model both separately and together.
Evaluation Data for Image Recoloring.
The evaluation set for our image recoloring model contains 100 images—30 are randomly selected from our validation set and the remaining 70 are unseen images from WikiArt. We generate 5 recolorized versions Ie of each image Io corresponding to each of our 5 emotions, resulting in 1000 recolored variants for evaluation (see Section 5.1).
Evaluation Data for Rationale Retrieval.
To evaluate our rationale retrieval model as a standalone module we choose an evaluation set consisting of 1000 image pairs, 200 for each emotion. For this dataset, each pair of images contains different images. Again, 30% are randomly selected from images in our FeelingBlue validation set and the remaining 70% consist of unseen images from WikiArt. To identify and order these (Iless,Imore) image pairs, we use the continuous labels produced by our BWS annotations for the former and WikiArt’s agreement labels as a proxy for emotional content in the latter. We retrieve and evaluate the top 5 Rless and Rmore rationales for each (Iless,Imore) (see Section 5.2).
Evaluation Data for Image Recoloring+ Rationale Retrieval.
Finally, to evaluate our models together, we retrieve and evaluate the top 5 Rless and Rmore rationales for all 1000 recolored (Io,Ie) pairs, which we refer to as “recolored image rationales” (Section 5.2).
As our domain (art recolorings) and class set (emotions) are both non-standard, automatic image generation metrics like Fréchet Inception Distance (FID) (Heusel et al., 2017) that are trained on ImageNet (Deng et al., 2009) are ill-suited to its evaluation (Kynkäänniemi et al., 2022). Thus, given the novel nature of this task, we rely more heavily on human annotation. Each evaluation task is annotated by 3 Master Workers on Amazon Mechanical Turk (AMT). Their compensation was in line with the minimum wage. We ensure the quality of these evaluations via a control task which asks annotators to identify the colors in a separate test image. We did not restrict these evaluation tasks to native English speakers. As associations between specific colors and emotions are not universal (Philip, 2006), this may have had a negative effect on both our agreement and scores.
In total, we collected 9000 evaluation annotations which we release as FeelingGreen, an additional complementary corpus that could be instructive to researchers working on this task.
5.1 Image Recoloring Results
To evaluate the quality of our image recoloring, for each pair of images (Io,Ie), we asked annotators whether the recolored image Ie, when compared to Io, evoked less, an equal amount, or more of each of our 5 emotions. Given that our image recoloring model is only designed to increase the specified emotion e, a task that only measures e would be trivial as the desired transformation is always more. Conversely, asking annotators to identify e would enforce a single-label constraint for a problem that is inherently multi-label.
As is clear from the agreement scores reported in Figure 5, emotion identification is very subjective, a fact corroborated by Mohammad and Kiritchenko (2018) for the abstract genres from which our images are compiled. Therefore, in addition to reporting the percentage of tasks for each emotion with a specific majority label, we include 1) cases where at least 1 annotator selected a given label and 2) the performance of our system according to our top 7 annotators when considered individually.
The scores in Figure 5 demonstrate the difficulty of this task. More often than not, the majority label indicates that our system left the targeted emotion unchanged. In the case of happiness, we successfully enhanced its expression in 33.5% of tasks, the sole emotion for which more beats both less and equal. However, the opposite holds for anger where we reduced its expression in 33% of tasks while increasing it in just 16.5%. As less angry and more happy are similar in terms of valence, this suggests a bias in our approach reflected in the confusion matrix in Figure 5. Perhaps the random initialization of palettes and our preference for a diverse set of recolorings for a given Io result in multi-colored transformations which, while satisfactory to our emotion-guided image selector, appear to most annotators as happy.
Another possibility is that distinct shape (e.g., an unambiguous circle) constrains the emotional potential of color. To test this, we calculate the max CLIP (Radford et al., 2021) similarity between each work of art and terms in our shape lexicon7 and consider the difference in scores between the 40 works in the bottom quintile and the 40 works in the top quintile (with the least distinct and most distinct shape according to CLIP). We find that on average an additional 1.5%, 4.5% (for ≥ 2 annotators and ≥ 1 annotator labeling) of our recolorings were effective (i.e., labeled as more) when comparing the bottom shape quintile to the top shape quintile, while less and equal fell by 0%, 2% and 3%, 1.5%, respectively. This lends some credence to the notion that dominant shapes restrict the breadth of emotions color can connote.
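The shape-distinctness score can be computed as sketched below; the specific CLIP checkpoint shown here is illustrative rather than the one actually used.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# The checkpoint below is an illustrative choice, not a claim about the exact model used.
_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def shape_distinctness(image_path, shape_lexicon):
    """Max image-text similarity between the artwork and prompts of the form
    'an image of [SHAPE]', as a proxy for how distinct the dominant shape is."""
    prompts = [f"an image of {shape}" for shape in shape_lexicon]
    inputs = _processor(text=prompts, images=Image.open(image_path).convert("RGB"),
                        return_tensors="pt", padding=True)
    with torch.no_grad():
        sims = _model(**inputs).logits_per_image[0]   # scaled cosine similarities, one per prompt
    return sims.max().item()
```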
When we consider our annotators individually, our recolorings were effective for at least one annotator in more than half of the tasks for each emotion. In fact, annotators 2 and 6 indicated that, when not equal, our system enhanced the intended emotion. This suggests an opening for emotional recolorings conditioned on the subjectivity of a particular viewer. We leave this to future work.
We display an example recoloring in Figure 4. Additionally, in Figure 6, we present a visualization of the recolorings produced by our system. When read from top (each Io’s top 2 colors) to bottom (its corresponding Ie’s top 2 colors, ∀e ∈ E), some interesting properties emerge. It is clear that our diverse image selection heuristic described in Section 4.1.3 is effective, resulting in few overlapping color bands for the same image Io across all 5 emotions. As expected, recoloring for happiness results in brighter palettes, but surprisingly, when the original image begins with a light palette, our system prefers dark primary colors and bright secondary colors, that is, extreme visual contrast. While a few trends for other emotions are also identifiable, the lack of a simple relationship between emotion and generated palette (or even among original image color, emotion, and generated palette) suggests that the model is using other, deeper contextual features (less prominent colors and the image’s composition) to produce its recoloring.
5.2 Rationale Retrieval Results
We evaluate the rationales from Rless and Rmore against two criteria: ‘Descriptive’ and ‘Justifying’. ‘Descriptive’ indicates that the rationale refers to content present in the specified image (for Rless this is Io and for Rmore this is Ie) and allows us to measure how well our rationale retrieval model correlates image features with textual content, for example, by retrieving rationales with appropriate color words. ‘Justifying’ means that the rationale is a reasonable justification for why the specified image evokes more or less of the target emotion than the other image in its pair. This allows us to measure whether the model 1) picks rationales that identify a difference between the two images and 2) more generally picks rationales that describe patterns of image differences that correspond to perceived emotional differences.
For every image and its more emotional counterpart (either the paired image or its recolored variant), we asked annotators to evaluate the top five rationales from the pair’s Rless and Rmore according to both criteria. As a strong baseline, we include 2 class-sampled rationales (C), randomly sampled from the subset of rationales in FeelingBlue justifying image choices for the same emotion and direction (e.g., more angry). Thus, these rationales exhibit language that is directionally correct but perhaps specific to another image. All 7 rationales were randomly ordered so annotators would not be able to identify them by position.
Table 2 reports agreement and two different metrics for each of our criteria across both the “distinct image” and “recolored image” sets: precision@k and precision-within-k (the percentage of top-k rationale groups in which at least one rationale satisfied the criterion). Because the validation and unseen splits had similar scores, we present only the union of both. As with our image recoloring evaluation (and emotion annotations more generally), agreement scores are again quite low (though better for ‘Descriptive’ than ‘Justifying’). The 2 class-sampled rationales (C) are a very strong baseline for our model to beat: our model retrieves rationales by comparing combined image representations to the full set of rationales (across all emotions and for both directions), instead of drawing them from the specified emotion and direction subset as is the case for the class-sampled rationales. That precision for the class-sampled rationales is relatively high shows that people tended to gravitate towards similar features as salient to the emotional content of different images. Still, our model regularly outperforms this baseline.
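These two retrieval metrics can be computed as sketched below; treating precision@k as the mean fraction of the top-k rationales that satisfy the criterion is an interpretation made explicit here for illustration.

```python
def precision_at_k(judgments, k):
    """Mean fraction of the top-k retrieved rationales judged to satisfy the
    criterion; `judgments` is a list of per-example boolean lists, ordered by rank."""
    return sum(sum(j[:k]) / k for j in judgments) / len(judgments)

def precision_within_k(judgments, k):
    """Fraction of examples where at least one of the top-k rationales satisfies it."""
    return sum(any(j[:k]) for j in judgments) / len(judgments)
```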
| | Distinct Image | | | Recolored Image | | |
|---|---|---|---|---|---|---|
| | Descriptive | Justifying | Both | Descriptive | Justifying | Both |
| α | 0.021 | 0.006 | – | 0.012 | −0.073 | – |
| k | @k, wi-k | @k, wi-k | @k, wi-k | @k, wi-k | @k, wi-k | @k, wi-k |
| 1 | 0.716, 0.716 | 0.577, 0.577 | 0.469, 0.469 | 0.845, 0.845 | 0.761, 0.761 | 0.692, 0.692 |
| 2 | 0.717, 0.897 | 0.587, 0.801 | 0.475, 0.683 | 0.842, 0.963 | 0.753, 0.908 | 0.682, 0.851 |
| 5 | 0.719, 0.989 | 0.596, 0.968 | 0.489, 0.908 | 0.845, 0.999 | 0.749, 0.995 | 0.683, 0.971 |
| C | 0.683, 0.904 | 0.555, 0.779 | 0.441, 0.655 | 0.826, 0.952 | 0.734, 0.905 | 0.655, 0.839 |
| 1 | 0.729, 0.729 | 0.669, 0.669 | 0.545, 0.545 | 0.796, 0.796 | 0.694, 0.694 | 0.614, 0.614 |
| 2 | 0.726, 0.912 | 0.650, 0.852 | 0.527, 0.738 | 0.789, 0.935 | 0.703, 0.877 | 0.613, 0.792 |
| 5 | 0.733, 0.990 | 0.644, 0.979 | 0.524, 0.920 | 0.794, 0.994 | 0.698, 0.979 | 0.613, 0.946 |
| C | 0.662, 0.866 | 0.580, 0.798 | 0.448, 0.660 | 0.816, 0.954 | 0.704, 0.884 | 0.630, 0.818 |
| | Our Model (k = 2) | | | | Class-Sampled (C) | | | |
|---|---|---|---|---|---|---|---|---|
| Feature | % | Descriptive | Justifying | Both | % | Descriptive | Justifying | Both |
| has color | 60.3 | 0.765 | 0.665 | 0.564 | 54.9 | 0.724 | 0.626 | 0.523 |
| no color | 39.7 | 0.773 | 0.686 | 0.590 | 45.1 | 0.774 | 0.664 | 0.569 |
| is concrete | 72.7 | 0.760 | 0.665 | 0.565 | 64.2 | 0.732 | 0.630 | 0.529 |
| not concrete | 27.3 | 0.792 | 0.695 | 0.599 | 35.8 | 0.773 | 0.667 | 0.569 |
| simile | 27.8 | 0.764 | 0.655 | 0.566 | 23.1 | 0.727 | 0.637 | 0.537 |
| no simile | 72.2 | 0.770 | 0.680 | 0.578 | 76.9 | 0.752 | 0.645 | 0.546 |
One explanation for the surprising strength of the “class-sampled” rationales is that broader, more generally applicable rationales are over-represented in FeelingBlue relative to specific rationales that only apply to certain images. To explore this, in Table 2 we also present the prevalence and scores of rationales from our model and the “class-sampled” baseline along three different axes of specificity: color, concrete language and simile (as identified in Section 3.3). The results show that not only was our model more likely to prefer specific rationales, it also used them more effectively. Because specificity is more easily falsifiable than non-specificity, our model’s preference for specificity depresses its aggregate scores relative to the baseline (Simpson’s paradox).
Finally, it is interesting that annotators regularly found rationales for our recolored image pairs (Io,Ie) to be ‘Justifying’ despite the relatively worse agreement with the intended emotion. As we ask annotators to consider a rationale ‘Justifying’ assuming the intended emotional difference is true, we cannot conclude that the rationales change the annotators’ opinion about the recoloring. But it does show that people can recognize how others might respond emotionally to an image even if they might not agree. We include example retrievals for both variants in Figure 4.
6 Conclusion
We introduce FeelingBlue, a new corpus of abstract art with relative emotion labels and English rationales. Enabled by this dataset, we present a baseline system for Justified Affect Transformation, the novel task of 1) recoloring an image to enhance a specific emotion and 2) providing a textual rationale for the recoloring.
Our results reveal insights into the emotional connotation of color in context: its potential is constrained by its form and effective justifications of its effects can range from the general to the specific. They also suggest an interesting direction for future work—how much is our emotional response to color affected by linguistic framing? We hope that FeelingBlue will enable such future inquiries.
Acknowledgments
We would like to express our gratitude to our annotators for their contributions and the artists whose work they annotated for their wonderful art. Additionally, we would like to thank our reviewers and Action Editor for their thoughtful feedback.
Notes
Our dataset, code, and models are available at https://github.com/amith-ananthram/feelingblue.
These images are a mix of copyright protected and public domain art. We do not distribute these images. Instead, we provide URLs to where they may be downloaded.
From the 283/500 images that remain available.
As this is NP-complete, we use an approximation.
We embed “an image of [SHAPE]” for each SHAPE.
References
Action Editor: Yulan He