Current work on image-based story generation suffers from the fact that existing image sequence collections lack coherent plots. We improve visual story generation by producing a new image-grounded dataset, Visual Writing Prompts (VWP). VWP contains almost 2K selected sequences of movie shots, each including 5-10 images. The image sequences are aligned with a total of 12K stories, collected via crowdsourcing: workers were shown an image sequence together with a set of characters grounded in that sequence. Our image sequence collection and filtering process allows us to obtain stories that are more coherent, diverse, and visually grounded than those in previous work. We also propose a character-based story generation model driven by coherence as a strong baseline. Evaluations show that our generated stories are more coherent, visually grounded, and diverse than stories generated with the current state-of-the-art model. Our code, image features, annotations, and collected stories are available at https://vwprompt.github.io/.

In this work, we improve the quality of text stories generated by neural models from image sequences. We do so by improving the curation of the image sequences that form the basis for collecting the story/image pairs used to train the models: We build a dataset in which the images lend themselves better to telling a story. To show the usefulness of our dataset, we train a coherence-driven model where we design a coherence component inspired by entity grid models. Experiments show that our model produces more coherent, visually grounded and diverse stories than previous models.

Stories are essential in natural language understanding and generation because they are the key mechanism for humans to understand the world (Piper et al., 2021). Automatically generating good stories is a challenging task that requires various capabilities in language processing (Peng et al., 2018), event understanding (Martin et al., 2018; Hong et al., 2020), and world knowledge (Guan et al., 2020; Hsu et al., 2020) to come together. Previous approaches to story generation have used different kinds of input to guide the story: some use a textual prompt to start the story (Fan et al., 2018), while others use a sequence of images to direct the story (Huang et al., 2016). We work within the latter family of approaches in order to exploit the rich information contained in image sequences and to avoid the symbol grounding problem (Harnad, 1990).

Research on visual narratives shows how it would be possible to construct the sort of dataset we propose: Image sequences should consist of a series of coherent events centered around one or more main characters (Cohn, 2020). In fact, even Aristotle points out in Poetics that event and character are the most important elements for a good story.

To date, several datasets of image sequences for narrative generation exist, such as the Visual Storytelling (VIST; Huang et al., 2016) dataset, which includes sets of images extracted from Flickr albums. However, image sequences generated this way may not lend themselves well to storytelling. Consider for instance the image sequence shown in the first column of Figure 1: the people featured across the images are all different, and there is no real development of an event or a plot. As a result, the stories humans write for such image sequences are often narratively poor, yielding low-quality training data for story generation algorithms, which in turn, unsurprisingly, generate poor stories.

Figure 1: 

Comparison of one story in Visual Writing Prompts with one story in Visual Storytelling and five stories in Travel Blogs. Our dataset has recurring characters across all five images and sub-stories. Each occurrence of a character in a sub-story has a bounding box in the corresponding image, which grounds the textual appearance to the visual input.


We thus argue that image sequences serving as writing prompts should be comprehensible as visual narratives by humans. Humans (with reasonable writing proficiency) can then “translate” such visual narratives into textual narratives. For an image sequence to qualify as a visual narrative, events and characters must have two properties: coherence, meaning that the events are semantically related and centered around recurring characters; and diversity, meaning that several different events jointly construct a plot. Psycholinguistic experiments show that missing either of these properties impedes human comprehension of image sequences as visual narratives (Cohn et al., 2012). In addition, the characters should be easily recognized in the image sequences and can be straightforwardly linked to the stories (visual groundedness). Image sequences without these properties are hence not effective writing prompts.

In this work, we define the term visual tellability to mean the tellability (Hühn et al., 2014) of image sequences, that is, how likely it is that humans can write a story given an image sequence; this reflects whether the sequence has the three properties described above. We propose a new dataset, Visual Writing Prompts (VWP), containing curated image sequences and matching user-generated stories. Our image selection process allows us to choose image sequences with high visual tellability, and to encourage our crowdsourced storytellers to produce coherent and visually grounded stories with high diversity.

To obtain coherent and visually grounded stories, we provide cropped images of characters explicitly with image sequences for storytellers. To improve diversity, we select images from a data source that is already likely to have a plot: image sequences selected from movie scenes with aligned synopses. To further show the importance of coherence and visual groundedness, we propose a story generation model with a representation of visual coherence focused principally on character continuity as a strong baseline. Experiments show that our model outperforms the current state-of-the-art model TAPM (Yu et al., 2021) and generates stories that are more coherent, visually grounded, and diverse.

We summarize our contributions in this work as follows: (a) We propose a pipeline to extract image sequences automatically from annotated movies as story writing prompts, which leads to image sequences with higher visual tellability. (b) We collect a new dataset of stories based on curated image sequences with grounded characters, which is more coherent and has better diversity than previous datasets. (c) We propose a character-grounded story generation model driven by visual coherence as a strong baseline for image-based story generation, which generates more coherent, diverse, and visually grounded stories than the current state-of-the-art model.

Story Generation.

There are several existing datasets for generating a story conditioned on a prompt such as a title (Fan et al., 2018), keyword (Yao et al., 2019), cue phrase (Xu et al., 2020), script (Pu et al., 2022), or story plot (Rashkin et al., 2020). The ROCStories corpus (Mostafazadeh et al., 2016) is a collection of short stories with rich causal and temporal relations. In subsequent work, new datasets have also been formed by gathering annotations on subsets of ROCStories for specialized story generation tasks such as modeling character psychology (Rashkin et al., 2018), counterfactual reasoning (Qin et al., 2019), and so forth. The STORIUM dataset (Akoury et al., 2020) of collaboratively written long stories contains rich annotations such as narrator prompts, character goals, and other attributes to guide story generation. However, all these datasets rely on textual prompts and therefore suffer from the symbol grounding problem (Harnad, 1990): the meaning of a textual story is grounded only in other textual symbols. In contrast, our dataset contains stories grounded in non-symbolic prompts from visual perception, namely, characters in image sequences.

Visually Grounded Stories.

Early work on the VIST dataset (Huang et al., 2016) identified that language in visually grounded stories is much more diverse than in image captions. However, most of the previous datasets of visually grounded stories have several limitations: characters are not explicitly annotated (Chandu et al., 2019), the dataset is limited in scale (Xiong et al., 2019), or there is no sequence of events behind the images (Park and Kim, 2015; Huang et al., 2016). Our dataset is the first large-scale dataset that is focused on overcoming these limitations. Unlike the VIST images, images in our VWP dataset do not feature people posing for the camera in limited contexts. Instead, they depict a rich range of situations, interactions, and emotions. Furthermore, providing character annotations in VWP ensures that the entities in the narrative are grounded to the image sequence and can be easily tracked across the sequence even when some visual attributes change. We hypothesize that these features will result in more coherent and visually grounded stories while maintaining a high level of diversity.

In this section, we describe how we obtain image sequences and design a pipeline to filter and sample images. Our objective is to construct image sequences that are visually tellable, that is, are coherent, diverse, and visually grounded. Our pipeline for image sequence construction is shown in Figure 2.

Figure 2: 

Image processing pipeline. Black squares are input or output. Circles are processing steps.


Movie Scene Extraction.

To achieve high coherence and diversity, we select images from movie scenes that have a plot consisting of a series of events around several main characters. We extract movie scenes from the MovieNet dataset (Huang et al., 2020), which contains movie synopses, annotated movie scenes with extracted movie shots, and identified main characters. The paragraphs in each movie synopsis describe sub-plots of the movie plot and are aligned with one or more movie scenes.

Changing from one paragraph to another in the synopsis indicates scene changes (Xiong et al., 2019). Moreover, events and characters in one movie scene are semantically coherent. We can make use of these properties to achieve high diversity by sampling image sequences from movie scenes aligned with only one paragraph, so that image sequences are from one sub-plot with a series of different events.

Filtering Movies.

Since we want to constrain the range of commonsense inferences of storytellers to the real world and help them to produce coherent stories, we first filter out all fantasy, science fiction, and horror movies. We also filter out all animated movies because their image characteristics differ too much from those of the other movies.

Filtering Images.1

To help storytellers to write stories that are visually grounded on characters or objects around them, we discard blurry images and images without any COCO “objects”.2 We measure the amount of image blur by calculating the variance of the Laplacian (Pech-Pacheco et al., 2000) and remove images with a variance lower than 30. We further apply a MaskRCNN-based object detector (He et al., 2020) and filter out images without any detected objects—this will help us generate stories with interesting grounded objects in the image.
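The blur criterion above can be sketched as follows. This is a hypothetical numpy-only re-implementation of the variance-of-the-Laplacian measure (the helper names are ours; in practice a library such as OpenCV would typically compute the Laplacian):

```python
import numpy as np

def laplacian_variance(gray: np.ndarray) -> float:
    """Variance of the Laplacian of a grayscale image.

    Applies the standard 3x3 Laplacian stencil (-4 at the center, +1 at
    the four neighbors) via array slicing; low variance indicates blur.
    """
    g = gray.astype(np.float64)
    lap = (-4.0 * g[1:-1, 1:-1] + g[:-2, 1:-1] + g[2:, 1:-1]
           + g[1:-1, :-2] + g[1:-1, 2:])
    return float(lap.var())

def is_sharp(gray: np.ndarray, threshold: float = 30.0) -> bool:
    # Images below the threshold (30 in the paper) are discarded as blurry.
    return laplacian_variance(gray) >= threshold
```

A perfectly uniform image has zero Laplacian variance everywhere and is rejected, while images with fine detail score far above the threshold.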

To increase the diversity of image sequences, we need to avoid including shots that are very similar to one another (as can happen, for example, when a person speaks in a long monologue). To detect the degree of similarity, we first feed the images to a ResNet-50 pre-trained on ImageNet and extract image features after the fc7 layer. We then compute pairwise cosine similarities of the image features within each image sequence and discard an image if its cosine similarity with any of the other images is larger than 0.89.
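One way this near-duplicate filter could be realized is a greedy pass that keeps an image only if its cosine similarity to every previously kept image stays at or below the threshold. This is a sketch with hypothetical names (feature extraction itself is omitted), not necessarily the paper's exact procedure:

```python
import numpy as np

def filter_near_duplicates(features: np.ndarray, threshold: float = 0.89) -> list[int]:
    """Return indices of images to keep within one sequence.

    features: (num_images, d) array of image feature vectors.
    An image is dropped if its cosine similarity to any already-kept
    image exceeds the threshold (0.89 in the paper).
    """
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    kept: list[int] = []
    for i in range(len(normed)):
        if all(float(normed[i] @ normed[j]) <= threshold for j in kept):
            kept.append(i)
    return kept
```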

Additionally, we detect adult content by applying a pre-trained classifier3 and exclude images that trigger the classifier. We also remove the first or the last image sequence in a movie to avoid images with credits.

Image Sampling.

The most intuitive way to collect stories is to use the extracted movie scenes directly as writing prompts. However, since these movie scenes contain a large number of movie shots, we control the workload by constraining the number of images in each writing task to a smaller number K, which is determined in the second pilot study (Section 4.1). From each selected movie scene, we sample images consecutively in non-overlapping sliding windows of size K and use each set of K images as one writing prompt.
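The windowing step can be illustrated as follows (hypothetical helper name; trailing shots that do not fill a complete window of K images are dropped, which is one possible reading of the procedure):

```python
def sample_prompts(scene_images: list, k: int) -> list:
    """Split a scene's shot list into consecutive, non-overlapping
    windows of size k; each full window becomes one writing prompt."""
    return [scene_images[i:i + k]
            for i in range(0, len(scene_images) - k + 1, k)]
```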

In this section, we design a crowdsourcing experiment to collect stories using our collected image sequences as writing prompts. Our objective is to obtain coherent stories that have high diversity from crowdsourced storytellers.

We design and run all our studies on Amazon Mechanical Turk (AMT).4 The worker user interface is shown in Figure 3. In each assignment, we ask the worker to select a subset of images from the image sequence and write a short story (50 to 300 words) that fits the image sequence. To ensure that the human-written stories are grounded on main characters, we provide names and cropped images of at most five major characters. We retrieve the bounding boxes for each character from the MovieNet annotations and choose the least blurry appearance of each character in the image sequence. We pose three questions to the workers. The first two questions are used to identify workers who have watched the movie from which the image sequence is taken, as they might exhibit different behaviors during story-writing. The third question is to measure visual tellability on a 5-point Likert scale, which is used to show the effectiveness of our image sequence construction pipeline.

Figure 3: 

Worker interface on Amazon Mechanical Turk. We first show the instructions and the requirements. The main characters are provided on the left side. On the right side, each image is accompanied by a textarea. The full story is presented under the input area. We also show the word count and the number of images used for workers’ convenience. The questionnaire is at the bottom.


We also design a review form for story reviewers to judge the quality of collected stories. We ask the reviewers: 1) whether they approve the story; 2) if not, which requirement it breaks; and 3) if yes, to judge the statement “This is a good story.” on a 5-point Likert scale. The first two questions ensure that the collected stories fulfill the following requirements: the story is grammatical, the story is diverse, and the story is visually grounded. The third question collects judgments of the quality of the approved stories.

4.1 Pilot Studies

We identify the following design questions of the crowdsourcing experiment for data collection:

1. Does the image filtering process improve the tellability of the image sequences?

2. What is the optimal number of images to provide to workers to achieve high visual tellability at a reasonable workload in one writing prompt?

We conducted two pilot studies to investigate these questions. In both, we collect at most 5 stories per image sequence, each from a different writer.

Pilot Study 1: Effectiveness of Image Filtering.

The first study tests whether our image-filtering steps (see Section 3) increase the visual tellability of the extracted image sequences. We extract 180 movie scenes containing 10 images each from selected movies; on half of these, we apply our image filters, while we leave the others as is. All resulting image sequences have 5 to 10 images.

Results show that the average visual tellability score of image sequences with filtering is 3.7, which is significantly higher (unpaired t-test, t = 4.89, p < 0.001) than that of image sequences without filtering (3.29). This shows that our image filtering process leads to higher visual tellability, so we apply it in our data collection.

Pilot Study 2: Number of Images to Display.

The second study explores the effect of the number of images K in a writing prompt on workload and visual tellability. We randomly sample 150 movie scenes with 20 images each, from which writers can choose 5 to 20 images for their stories. We set the minimum number of images to 5 because the most common narrative structure contains five components (Cohn, 2013); in addition, using five images makes our dataset comparable to VIST. We set the maximum number to 20 because a preliminary experiment showed that the workload of writing prompts with more than 20 images is too high for our budget. We then run our study on these scenes.

We find a negative correlation between the number of images actually used by storytellers and the visual tellability scores, r(500) = −0.17, p < 0.001. This result indicates that showing fewer images can both improve visual tellability and reduce workload. However, we also want to obtain longer stories. Since 89% of the human-written stories use 5 to 10 of the 20 images and achieve a reasonably high average visual tellability (3.75), we set the maximum number of images we display to 10.

In this section, we describe how we collect and process the stories in the VWP dataset. Our goal is to obtain narratives given the curated image sequences as writing prompts.

Worker Qualification.

In order to improve story quality, we apply a qualification process to workers. We first collect 4.5K stories together with visual tellability judgments, obtaining 556 candidate workers. Each story is reviewed by one of five graduate students from Saarland University who are proficient in English. To ensure that the reviewers mutually understand the purpose of the task, we let them judge 100 stories and then discuss the reviews together to agree on judgment standards. We then select 58 qualified workers with an acceptance rate ≥ 90%, average story quality > 3.1, and at least 5 accepted assignments. We assign a qualification to these workers and invite them to the bulk collection.

Bulk Collection.

We collect 7.5K stories with the qualified workers in bulk collection. We group about 300 image sequences into a batch and collect 1.5K stories per batch. For each batch, we sample s stories from each worker and review the stories to update the assessment of the worker,
s = 10 if n_w < 10, and s = 10 log n_w otherwise,
where nw is the number of stories that worker w wrote in this batch. We run the bulk collection batch by batch and revoke the qualification if the worker does not satisfy the selection criteria anymore.

Text Processing.

We process the raw text to make it easier to train story generation models. We tokenize all stories with the spaCy English tokenizer (Virtanen et al., 2020). We then recognize all entities using a Named Entity Recognition model (Peters et al., 2017). We change all location names to placeholders and replace all named characters in each story with [male0], …, [maleM], [female0], …, [femaleN]. We obtain the gender of each named person based on name statistics, following Huang et al. (2016). Finally, to mark the alignment between images and story sections, we add a special separator token [sent]. We randomly sample 849 stories as the validation split and 586 stories as the test split.
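A minimal sketch of the character-anonymization step (illustrative only: the actual pipeline derives names and genders via NER and name statistics rather than an explicit dictionary, and the function name is ours):

```python
import re

def anonymize(story: str, genders: dict) -> str:
    """Replace named characters with gendered placeholders such as
    [male0] or [female0], numbered in order of first mention.

    genders maps each character name to "male" or "female" (assumed
    to be ordered by first mention in the story).
    """
    counters = {"male": 0, "female": 0}
    for name, gender in genders.items():
        placeholder = f"[{gender}{counters[gender]}]"
        counters[gender] += 1
        # Whole-word match so that substrings of other words are untouched.
        story = re.sub(rf"\b{re.escape(name)}\b", placeholder, story)
    return story
```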

5.1 Statistics of the Dataset

We present statistics and automatic measures of coherence and diversity for our dataset to show that our collected stories are more coherent and diverse than those in previous datasets.

Statistics.

We compare the properties of our dataset to similar previous datasets, Travel Blogs (Park and Kim, 2015)5 and VIST (Huang et al., 2016), in Table 1. Our VWP dataset has 1,965 image sequences with 20,763 unique images from 122 movies. Each image sequence has 5 to 10 images. Our stories have 45% more tokens, 103% more events, and 285% more characters per text compared to the VIST dataset. While the Travel Blogs dataset has longer stories, it has only one image per story.

Table 1: 

Comparison of statistics of VWP against previous datasets. Numbers with ‡ are obtained from a small sample of the Disney split of the Travel Blogs dataset that is available in their repository.

| Name | Image Genre | # Text | # Image per Text | # Token per Text | # Event per Text | # Char. per Text |
|---|---|---|---|---|---|---|
| VIST | photos | 50K | 5 | 57.6 | 6.3 | 3.4 |
| Travel Blogs | photos | 10K | 1 | 222.3‡ | 3.8‡ | 2.3‡ |
| VWP (Ours) | movie shots | 12K | [5, 10] | 83.7 | 12.8 | 13.1 |

Coherence.

We first analyze the coherence of the stories, focusing on the characters and their appearances. According to Centering theory (Grosz et al., 1995), coherent narratives are typically structured such that salient entities appear in strong grammatical roles like subject or object. We therefore apply a model based on this theory, the Entity Grid (Lapata and Barzilay, 2005), to measure the local coherence of our dataset. We use the generative Entity Grid model implemented in the Cohere toolkit (Smith et al., 2016) on VIST and VWP and calculate the log-likelihood of entity transitions as story coherence. The results in Table 2 show that our dataset is significantly more coherent than the VIST dataset (unpaired t-test, t = −5, p < 0.001).

Table 2: 

Coherence by log-likelihood (LL) and average log-likelihood (Avg. LL) on validation split of VIST versus a sample split from our VWP dataset with the same number of image sequences. The stories are more coherent if the number is larger.

| Dataset | # stories | LL | Avg. LL |
|---|---|---|---|
| VIST | 4987 | −4017 | −0.8055 |
| VWP (Ours) | 4680 | −3722 | −0.7953 |

To further check whether event elements are semantically related given the same image sequence, we also compute the average Jaccard similarities between event elements of the stories for each image sequence by main characters, predicates (without auxiliary verbs), and arguments in different semantic roles. We identify the main characters in the raw text using coreference clusters (Lee et al., 2018). To ensure that characters mentioned only once in the story can be detected by the coreference resolution model, we prepend one introductory sentence per character to each story. For example, to identify the character Jack in Figure 1, we add “This is Jack.” before the story. The Jaccard similarity between stories A and B is defined as J(A, B) = |A ∩ B| / |A ∪ B|, where A and B are the token sets of predicates/arguments in stories A and B. The results in Table 3 show that the event elements of stories conditioned on the same image sequence are more semantically related to each other: our dataset has higher semantic cohesion than the VIST dataset.
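The similarity measure itself is straightforward to implement (a sketch assuming the token sets have already been extracted; the function name is ours):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```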

Table 3: 

Semantic similarity between stories of each image sequence. For all results, the higher the number, the better except the first column which is the number of image sequences. PRD refers to predicate.

| Dataset | # | PRD | Characters | Arguments | Arg0 | Arg1 | Arg2 | ArgM-LOC |
|---|---|---|---|---|---|---|---|---|
| VIST | 998 | 0.063 | 0.184 | 0.055 | 0.041 | 0.018 | 0.018 | 0.013 |
| VWP (Ours) | 1000 | 0.068 | 0.21 | 0.057 | 0.101 | 0.048 | 0.025 | 0.017 |

Diversity.

We then measure diversity of the stories from two perspectives: 1) If a story has a plot with a series of different events, it must have diverse events instead of just repeating one event; 2) If these events are combined into different n-grams in the plot, then the story must have diverse predicate n-grams. For example, in the last column in Figure 1, the character Will has a predicate trigram (tell, convince, work), which is different from the next trigram (convince, work, call).

For event diversity, we follow Goldfarb-Tarrant et al. (2020) to obtain the unique number of verbs, the verb-vocabulary ratio, verb-token ratio, and the percentage of diverse verbs (not in the top 5 most frequent verbs). The results in Table 4 show that our dataset has higher event diversity than VIST across all measures. To measure predicate n-gram diversity, we extract and lemmatize verbs obtained from a Semantic Role Labeling model (Shi and Lin, 2019) and calculate the unique:total ratios of predicate unigram, bigram, and trigram (Table 4). We observe that the event sequences in VWP are more diverse than those in VIST, because VWP has higher bigram and trigram ratios.
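The unique:total n-gram ratio can be sketched as follows (hypothetical helper name; predicate extraction and lemmatization via the SRL model are omitted):

```python
def ngram_diversity(predicates: list, n: int) -> float:
    """Ratio of unique n-grams to total n-grams over a sequence of
    lemmatized predicates; higher means more diverse event sequences."""
    ngrams = [tuple(predicates[i:i + n]) for i in range(len(predicates) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0
```

For instance, repeating a single predicate yields a low unigram ratio, while the trigram sequence from Figure 1 ((tell, convince, work), (convince, work, call)) contains no repeats and scores 1.0.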

Table 4: 

Comparison of diversity. The first five columns show event diversity for the validation split of VIST versus a comparable sample of VWP. We report measures including the vocabulary size (Voc), unique number of verbs (Verb), verb-vocabulary ratio (Verb: Voc %), verb-token ratio (Verb: Tok %), and percentage of diverse verbs (Diverse Verb %). The last three columns show predicate n-gram diversity for VIST versus VWP. We measure diversity using unique:total ratios of predicate unigram, bigram, and trigram. For all results the higher the number, the better.

| Dataset | Voc | Verb | Verb:Voc % | Verb:Tok % | Diverse Verb % | unigram | bigram | trigram |
|---|---|---|---|---|---|---|---|---|
| VIST | 12627 | 3447 | 27.3 | 1.2 | 73.6 | 3.39 | 33.48 | 75.22 |
| VWP (Ours) | 13637 | 4811 | 35.28 | 1.23 | 79 | 2.71 | 34.87 | 79.10 |

Visual Groundedness.

To check the visual groundedness of the stories, we first apply the same semantic role labeler to 25 human-written stories each from VWP and VIST. We obtain 299 events and 715 arguments from the VWP samples, and 84 events and 196 arguments from the VIST samples. We then manually annotate these events and arguments with three labels: 1) Grounded means the event or argument is in the corresponding image; 2) Inferred means it is not in the image but can be inferred; 3) Hallucinated means it is not in the image and cannot be inferred.

The results in Table 5 show that about 55% of the events and 63% of the arguments in the VWP stories appear in the images, which is higher than the 45% of events and 54% of arguments in the VIST stories. The numbers of events and arguments that are not in the images but can be inferred are similar between the two datasets. Only 2% of the arguments in VWP stories are not in the images and cannot be inferred (i.e., not visually grounded), whereas in VIST 8% of the events and 14% of the arguments are not visually grounded. These results show that stories in VWP are more visually grounded than stories in VIST.

Table 5: 

Visual groundedness of stories. We report counts and percentages of each label in each dataset. E means event and A means argument.

| Label | VWP # | VWP % | VIST # | VIST % |
|---|---|---|---|---|
| E Grounded | 164 | 54.9 | 38 | 45.2 |
| E Inferred | 134 | 44.8 | 39 | 46.4 |
| E Hallucinated | 1 | 0.3 | 7 | 8.3 |
| A Grounded | 447 | 62.5 | 105 | 53.6 |
| A Inferred | 254 | 35.5 | 64 | 32.7 |
| A Hallucinated | 14 | 2.0 | 27 | 13.8 |

In this section, we propose a strong baseline model for character-grounded story generation. We then experiment on our VWP dataset and show the results. Our goal is to demonstrate the usefulness of our dataset.

We extract features for all images with the Swin Transformer (Liu et al., 2021), a state-of-the-art computer vision backbone, keeping all its parameters fixed. We use the official model checkpoint, pre-trained on the ImageNet-21K dataset, to increase domain generality. We extract three different visual features:

1. Global features (global) are most commonly used in image-based language generation. We extract global features from the output of the last feedforward layer.

2. Object features (obj) are widely used in image-based language generation. Since person is also a label in object detection (Lin et al., 2014), using object features is a proper baseline for character features. We obtain object features using a Cascade Mask R-CNN object detector (Cai and Vasconcelos, 2021) with the same Swin Transformer backbone. We crop the bounding boxes of the top 20 objects that the detector predicts for each image and extract the features the same way as global features.

3. Character features (char) are extracted by cropping out the least blurry instance of each character using bounding boxes from our dataset. We feed the bounding boxes to the same Swin Transformer backbone and get the features from the last feedforward layer.

We use the following models for visual story generation as baselines:

GPT-2. GPT-2 (Radford et al., 2019) is a Transformer-based language model pre-trained on large-scale text. We use the small version, which is widely used in previous work on story generation.

TAPM. TAPM (Yu et al., 2021) is a Transformer-based model that adapts visual features to a pre-trained GPT-2. It is the current state-of-the-art model for visual story generation.

For each baseline, we consider four different variants with different inputs: 1) only global image features; 2) global features and object features; 3) global features and character features; and 4) all three available features.

6.1 Character-based Visual Story Generation

We propose the character-grid transformer model (CharGrid) as a strong baseline to show the importance of modeling coherence and visual groundedness. We hypothesize that characters, and their different instances in image sequences, matter for visual story generation models in two ways: first, explicit character representations can improve the quality of generated stories, as has been observed in textual story generation (Clark et al., 2018); second, representations that track different instances of characters across images are beneficial to image-based story generation models.

Character Grid.

To represent the coherence of image sequences, we propose a novel visual representation, the character grid. As mentioned in Section 5.1, one of the most effective methods to measure text coherence is the Entity Grid (Lapata and Barzilay, 2005), a matrix of sentences by entities whose cells contain the grammatical roles of the entities in the sentence context. The contribution of an entity’s mention to a sentence’s coherence is defined by its within-sentence grammatical role.

Inspired by this, we measure the narrative importance of a character in an image by the similarity between the global image features and the character’s features. We thus model the coherence of an image sequence using a matrix C of images by character instances, shown in Figure 4. We obtain the narrative importance of each character instance by computing the dot product of the character’s features and the corresponding global image features. Each element of the character grid C is computed as c_ab = i_a · l_b, where i_a is the global feature vector of image a and l_b is the feature vector of character b.
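Under these definitions, the character grid is a single matrix product (a sketch assuming both feature matrices share the same dimensionality d; the helper name is ours):

```python
import numpy as np

def character_grid(image_feats: np.ndarray, char_feats: np.ndarray) -> np.ndarray:
    """Compute C with C[a, b] = i_a · l_b.

    image_feats: (num_images, d) global image features.
    char_feats:  (num_chars, d) character instance features.
    Rows of C index images, columns index character instances.
    """
    return image_feats @ char_feats.T
```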

Figure 4: 

Example of character grid representations. Each row represents an image and each column represents a character. Shades of the cells indicate the similarities between the character features and the image features. The darker color represents higher similarity. The green square shows a pattern that indicates high coherence and the red square represents low coherence.


Model Architecture.

As shown in Figure 5, the architecture is based on the Transformer model. The input to the Transformer is a sequence of tokenized features: global image features, character features, the character grid, and text features. The global image features and character features are the same as those used by the baseline models described above; they are first fed to trainable image and character encoders, each consisting of a feedforward layer. Text features are tokenized representations of the generated context, presented to the model incrementally. The character grid is flattened and fed to a feedforward layer. The four inputs then pass through the Transformer module. The output at each time step is a probability distribution over all possible output tokens from a pre-trained GPT-2 tokenizer (Wolf et al., 2020).
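The input assembly above (per-modality feedforward encoders projecting into a shared hidden size, then concatenation into one token sequence) can be sketched at the shape level. The dimensions, single-layer encoders, and random weights are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def encode(features, W, b):
    # A trainable feedforward encoder: one affine layer projecting each
    # feature vector into the Transformer's hidden size.
    return features @ W + b

d_img, d_char, d_model = 6, 6, 8
rng = np.random.default_rng(1)
img = rng.normal(size=(3, d_img))      # 3 global image feature vectors
char = rng.normal(size=(2, d_char))    # 2 character feature vectors
grid = (img @ char.T).reshape(1, -1)   # flattened character grid (1, 6)
text = rng.normal(size=(5, d_model))   # embedded story tokens so far

W_i, b_i = rng.normal(size=(d_img, d_model)), np.zeros(d_model)
W_c, b_c = rng.normal(size=(d_char, d_model)), np.zeros(d_model)
W_g, b_g = rng.normal(size=(grid.shape[1], d_model)), np.zeros(d_model)

# The Transformer consumes one sequence of (3 + 2 + 1 + 5) tokens.
tokens = np.concatenate([
    encode(img, W_i, b_i),
    encode(char, W_c, b_c),
    encode(grid, W_g, b_g),
    text,
])
```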

Figure 5: 

Architecture of character-grid transformer. The blue circles are pre-trained components where the parameters are fixed.


We also construct two variants of our model to inspect the contribution of each design decision. Replacing the character features with object features yields the object-grid transformer model (ObjGrid); using both character and object features yields the entity-grid transformer model (EntiGrid).

Model Training.

We randomly initialize the model parameters except for the vision backbone. We optimize the model by maximizing the likelihood of the image sequence-story pairs in the training set, updating parameters via backpropagation. We use nucleus sampling (Holtzman et al., 2020) to generate full output sequences for validation and compute the METEOR score (Banerjee and Lavie, 2005) on the validation set after each training epoch. If the current epoch achieves a higher METEOR score than all previous epochs, we treat it as the best epoch and run the automatic metrics on the test set. We choose the METEOR score following previous work in visual story generation (see Section 2); in addition, Huang et al. (2016) found that METEOR correlates better with human judgment than BLEU and Skip-Thoughts similarity on the VIST dataset.
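The epoch-selection rule can be stated as a small helper; the validation scores below are hypothetical:

```python
def select_best_epoch(val_meteor_scores):
    """Model selection by validation METEOR: return the index and score
    of the epoch with the highest METEOR, whose checkpoint is then
    evaluated on the test set."""
    best_epoch, best_meteor = None, float("-inf")
    for epoch, meteor in enumerate(val_meteor_scores):
        if meteor > best_meteor:  # strictly better than all previous epochs
            best_epoch, best_meteor = epoch, meteor
    return best_epoch, best_meteor

# Hypothetical validation METEOR per epoch: epoch 1 is selected.
best = select_best_epoch([30.1, 31.8, 31.2])
```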

6.2 Reference-based Metrics

Our goal is to show the effectiveness of character grid representations. Although reference-based metrics have been shown to correlate poorly with human judgments in open-ended language generation tasks (Guan and Huang, 2020; Gehrmann et al., 2021), they remain an efficient way to compare many different models. Furthermore, we want to make our results comparable to those of the state-of-the-art model TAPM (Yu et al., 2021), which applied greedy search to generate test stories and reported reference-based metrics. We thus follow the same setting and compare our proposed CharGrid model against several baselines.

We train all models for at most 15 epochs with 3 different random seeds. We apply the reference-based metrics used in the visual storytelling shared task (Mitchell et al., 2018): unigram (B-1), bigram (B-2), trigram (B-3), and 4-gram (B-4) BLEU scores (Papineni et al., 2002), METEOR (M; Banerjee and Lavie, 2005), ROUGE-L (R-L; Lin, 2004), and CIDEr (C; Vedantam et al., 2015). We report the mean and standard deviation over the 3 runs.
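As a sketch of how the scores in Table 6 are aggregated and annotated: means and standard deviations come from the three seeded runs, and, on our reading of the table caption, a baseline score receives a +, *, or ** marker when it lies at least one, two, or three standard deviations from the CharGrid mean. All numbers below are hypothetical:

```python
import statistics

def sigma_marker(score, ref_mean, ref_std):
    """Marker for Table 6: how many standard deviations (of the CharGrid
    runs) a baseline score lies from the CharGrid mean. Assumes a
    nonzero standard deviation."""
    k = abs(score - ref_mean) / ref_std
    if k >= 3:
        return "**"
    if k >= 2:
        return "*"
    if k >= 1:
        return "+"
    return ""

# Hypothetical CharGrid BLEU-1 scores across three seeds.
runs = [47.5, 47.9, 47.7]
mean, std = statistics.mean(runs), statistics.stdev(runs)
marker = sigma_marker(38.65, mean, std)  # a far-away baseline score
```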

Results in Table 6 show that the character-grid transformer (CharGrid), driven by visual coherence, significantly outperforms TAPM with character features (TAPM + char) on BLEU-1/2/3 and CIDEr. CharGrid also significantly outperforms GPT-2 with character features (GPT-2 + char) on most metrics, with only marginal gains on BLEU-4 and METEOR. The object-grid transformer (ObjGrid) significantly outperforms TAPM with object features (TAPM + obj) on BLEU-1/2/3 and CIDEr, and significantly outperforms GPT-2 with object features (GPT-2 + obj) on most metrics, except marginally on BLEU-4. The entity-grid transformer (EntiGrid) significantly outperforms TAPM with all features (TAPM + obj, char) on most metrics, except marginally on METEOR and ROUGE-L, and outperforms GPT-2 with all features (GPT-2 + obj, char) on most metrics except BLEU-4. These results show the effectiveness of character/object/entity grid representations for modeling the coherence of image sequences.

Table 6: 

Results of all models using different input features on the test set of VWP, using the reference-based metrics BLEU (B), METEOR (M), ROUGE-L (R-L), and CIDEr (C). All numbers are averages of three runs with different random seeds. +, *, and ** indicate that the number is one, two, or three standard deviations away from the mean of the CharGrid model.

Model            | Features          | B-1     | B-2     | B-3     | B-4   | M       | R-L    | C
GPT-2            | global            | 38.65** | 20.28** | 9.78**  | 4.68* | 31.64** | 24.24+ | 1.66**
GPT-2 + obj      | global, obj       | 40.65** | 21.35** | 10.2**  | 4.87* | 31.69** | 24.05+ | 1.85**
GPT-2 + char     | global, char      | 39.95** | 21.04** | 10.11** | 4.92+ | 31.85*  | 24.19+ | 1.57**
GPT-2 + obj,char | global, obj, char | 40.41** | 21.44** | 10.56** | 5.06  | 32.03*  | 24.38  | 1.87**
TAPM             | global            | 39.85** | 21.7**  | 10.72** | 5.19  | 32.38+  | 25.09  | 1.48**
TAPM + obj       | global, obj       | 40.86** | 22.13** | 10.83** | 5.25  | 32.34+  | 24.91  | 1.82**
TAPM + char      | global, char      | 40.03** | 21.68** | 10.66** | 5.18  | 32.42+  | 24.88  | 1.4**
TAPM + obj,char  | global, obj, char | 40.87** | 21.99** | 10.72** | 5.06+ | 32.48+  | 24.87  | 1.59**
Ours:
ObjGrid          | global, obj       | 47.66   | 25.26   | 11.95   | 5.42  | 32.83   | 24.42  | 4.68
EntiGrid         | global, obj, char | 45.83   | 24.85   | 12.11   | 5.7   | 32.68   | 24.89  | 3.53+
CharGrid         | global, char      | 47.71   | 25.33   | 11.95   | 5.42  | 33.03   | 25.01  | 4.83

6.3 Human Evaluation

Because story generation is an open-domain task, reference-based metrics can only show how output stories match with the references. To measure the quality of generated stories directly, we conduct a crowdsourcing experiment to obtain human binary judgments between two systems. We design the first question for Grammaticality, which measures whether the textual outputs are at least grammatical and sets a foundation for other metrics. We then design questions for two properties that we identified for good textual stories: Coherence and Diversity. Finally, we ask a question to compare the Visual Groundedness in order to make sure that the stories are relevant to the input image sequence.

We conduct the experiment with 28 crowd workers over 50 pairs of stories and report the percentage of judgments in favor of each system. To make the stories more readable, we change the generated character placeholders to randomly sampled names. The results in Table 7, assessed with two-sided binomial tests, show that TAPM with character features (TAPM + char) significantly outperforms TAPM on Visual Groundedness, and that CharGrid significantly outperforms TAPM + char on all metrics. This indicates that our character grid representation yields better stories, confirming the findings of the evaluation with reference-based metrics.
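The significance claims above rest on two-sided binomial tests over the binary judgments. A self-contained sketch of such an exact test at the chance level p = 0.5 (the counts below are hypothetical; the paper's exact judgment counts are not reproduced here):

```python
from math import comb

def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial test: the probability, under the null
    of no preference (p = 0.5), of an outcome at least as extreme as
    observing k successes out of n paired judgments."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    # Sum the probability of every outcome no more likely than k.
    return sum(x for x in pmf if x <= pmf[k] + 1e-12)

# Hypothetical: 62 of 100 pairwise judgments favour one system.
p_value = binom_two_sided_p(62, 100)
```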

Table 7: 

Human binary judgments (in percentages) of generated stories, comparing TAPM against TAPM with character features (TAPM + char), and TAPM + char against our model (CharGrid), on the test set of VWP across four criteria: Grammaticality, Coherence, Visual Groundedness, and Diversity. * p-value < 0.05. ** p-value < 0.01.

Comparison               | Grammaticality | Coherence | Visual Groundedness | Diversity
TAPM + char vs. TAPM     | +2.45          | +1.99     | +3.99*              | +1.69
CharGrid vs. TAPM + char | +6.49**        | +8.41**   | +6.25*              | +11.06**

6.4 Qualitative Evaluation

We also conduct a qualitative evaluation to show that stories generated by TAPM with character features are more visually grounded than those generated without them, and that the character grid representation further improves coherence and visual groundedness. To obtain more diverse text, we use nucleus sampling (Holtzman et al., 2020) with p = 0.1 for all models. As shown in Figure 6, TAPM generates an implausible noun phrase (the train). With character features, TAPM + char can exploit character-object interactions and reason that there is no train in the image, so it generates a more plausible phrase (a street).
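Nucleus sampling with a small p truncates the distribution sharply, which keeps generation focused while still allowing variation. A minimal sketch with illustrative token probabilities:

```python
import numpy as np

def nucleus_sample(probs, p=0.1, rng=None):
    """Nucleus (top-p) sampling: keep the smallest set of tokens whose
    cumulative probability reaches p, renormalize, and sample from it.
    With a low p such as 0.1, only the most likely tokens survive."""
    rng = rng or np.random.default_rng()
    order = np.argsort(probs)[::-1]          # tokens by descending prob
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1     # smallest nucleus covering p
    nucleus = order[:cutoff]
    renorm = probs[nucleus] / probs[nucleus].sum()
    return rng.choice(nucleus, p=renorm)

# Illustrative next-token distribution over 4 tokens.
probs = np.array([0.5, 0.3, 0.15, 0.05])
# With p = 0.1 the nucleus is just the top token, so sampling is greedy.
token = nucleus_sample(probs, p=0.1)
```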

Figure 6: 

Qualitative results of generated and human-written stories. The red color represents errors made by models and the green color indicates better output.


However, the TAPM + char model fails to represent the relations between characters: it generates the pronoun they without first introducing the characters in the second image. In contrast, CharGrid correctly introduces two new characters.

We have shown that curated image sequences with characters are effective writing prompts for visual story generation, in both data collection and model design. Filtering out images in which the object detector recognizes no objects, and removing highly similar images to boost diversity, improves the visual tellability of image sequences. Presenting selected characters during story-writing yields stories with characters grounded in the images, which are more coherent and diverse. Correspondingly, using character features as input to the story generation model improves the quality of generated stories, and adding the character grid representation brings further improvements in coherence, grammaticality, visual groundedness, and diversity.

Future Work.

One important property of visual narratives not covered in this work is narrativity (Piper et al., 2021), that is, whether an image sequence contains the necessary narrative structures to make a good story. A narrative structure can be achieved by events following a typical order with roles like Establisher, Initial, Initial Peak, and Release (Cohn, 2013). We observe that these roles of events emerge in our collected stories. Our annotations of different instances of the same character across a story allow us to construct event chains for each character. Future work should investigate how to annotate the roles of these events, measure narrativity, and build a model to generate stories with higher narrativity.

A major assumption of all previous work in storytelling is that all humans are equally and reasonably proficient in story-writing and can translate visual narratives into textual narratives. However, individual differences in writing proficiency likely affect story quality. Exploring this from the perspective of both data selection and model design would be an interesting future direction.

Acknowledgments.

Xudong Hong is supported by the International Max Planck Research School for Computer Science (IMPRS-CS) of the Max Planck Institute for Informatics (MPI-INF). This research was funded in part by a Swedish Research Council (VR) grant (2014-39) for the Centre for Linguistic Theory and Studies in Probability (CLASP). This research was also funded in part by the Chair of Computer Science and Computational Linguistics at Saarland University. We thank three anonymous reviewers for their detailed and insightful reviews that helped us improve this paper. We sincerely thank our action editor and the editorial team at Transactions of the Association for Computational Linguistics. We also thank our student assistants: Andrew Johnson, AriaRay Brown, Danielle Gregg, and Teresa Martín Soeder. Last but not least, we thank all the anonymous story writers for their hard work and creative stories.

Notes

1. Hyper-parameters in this section are determined by a preliminary experiment that optimizes the filtering process manually on 50 image sequences.

2. A human character is also labeled as an "object" in the COCO dataset (Lin et al., 2014).

References

Nader Akoury, Shufan Wang, Josh Whiting, Stephen Hood, Nanyun Peng, and Mohit Iyyer. 2020. STORIUM: A dataset and evaluation platform for machine-in-the-loop story generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6470-6484, Online. Association for Computational Linguistics.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65-72, Ann Arbor, Michigan. Association for Computational Linguistics.

Zhaowei Cai and Nuno Vasconcelos. 2021. Cascade R-CNN: High quality object detection and instance segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(5):1483-1498.

Khyathi Chandu, Eric Nyberg, and Alan W. Black. 2019. Storyboarding of recipes: Grounded contextual generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 6040-6046, Florence, Italy. Association for Computational Linguistics.

Elizabeth Clark, Yangfeng Ji, and Noah A. Smith. 2018. Neural text generation in stories using entity representations as context. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2250-2260, New Orleans, Louisiana. Association for Computational Linguistics.

Neil Cohn. 2013. Visual narrative structure. Cognitive Science, 37(3):413-452.

Neil Cohn. 2020. Visual narrative comprehension: Universal or not? Psychonomic Bulletin & Review, 27(2):266-285.

Neil Cohn, Martin Paczynski, Ray Jackendoff, Phillip J. Holcomb, and Gina R. Kuperberg. 2012. (Pea)nuts and bolts of visual narrative: Structure and meaning in sequential image comprehension. Cognitive Psychology, 65(1):1-38.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889-898, Melbourne, Australia. Association for Computational Linguistics.

Sebastian Gehrmann, Tosin Adewumi, Karmanya Aggarwal, Pawan Sasanka Ammanamanchi, Anuoluwapo Aremu, Antoine Bosselut, Khyathi Raghavi Chandu, Miruna-Adriana Clinciu, Dipanjan Das, Kaustubh Dhole, Wanyu Du, Esin Durmus, Ondřej Dušek, Chris Chinenye Emezue, Varun Gangal, Cristina Garbacea, Tatsunori Hashimoto, Yufang Hou, Yacine Jernite, Harsh Jhamtani, Yangfeng Ji, Shailza Jolly, Mihir Kale, Dhruv Kumar, Faisal Ladhak, Aman Madaan, Mounica Maddela, Khyati Mahajan, Saad Mahamood, Bodhisattwa Prasad Majumder, Pedro Henrique Martins, Angelina McMillan-Major, Simon Mille, Emiel van Miltenburg, Moin Nadeem, Shashi Narayan, Vitaly Nikolaev, Andre Niyongabo Rubungo, Salomey Osei, Ankur Parikh, Laura Perez-Beltrachini, Niranjan Ramesh Rao, Vikas Raunak, Juan Diego Rodriguez, Sashank Santhanam, João Sedoc, Thibault Sellam, Samira Shaikh, Anastasia Shimorina, Marco Antonio Sobrevilla Cabezudo, Hendrik Strobelt, Nishant Subramani, Wei Xu, Diyi Yang, Akhila Yerukola, and Jiawei Zhou. 2021. The GEM benchmark: Natural language generation, its evaluation and metrics. In Proceedings of the 1st Workshop on Natural Language Generation, Evaluation, and Metrics (GEM 2021), pages 96-120, Online. Association for Computational Linguistics.

Seraphina Goldfarb-Tarrant, Tuhin Chakrabarty, Ralph Weischedel, and Nanyun Peng. 2020. Content planning for neural story generation with Aristotelian rescoring. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4319-4338, Online. Association for Computational Linguistics.

Barbara J. Grosz, Aravind K. Joshi, and Scott Weinstein. 1995. Centering: A framework for modeling the local coherence of discourse. Computational Linguistics, 21(2):203-225.

Jian Guan, Fei Huang, Zhihao Zhao, Xiaoyan Zhu, and Minlie Huang. 2020. A knowledge-enhanced pretraining model for commonsense story generation. Transactions of the Association for Computational Linguistics, 8:93-108.

Jian Guan and Minlie Huang. 2020. UNION: An unreferenced metric for evaluating open-ended story generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9157-9166, Online. Association for Computational Linguistics.

Stevan Harnad. 1990. The symbol grounding problem. Physica D: Nonlinear Phenomena, 42(1):335-346.

Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2020. Mask R-CNN. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(2):386-397.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In International Conference on Learning Representations.

Xudong Hong, Rakshith Shetty, Asad Sayeed, Khushboo Mehra, Vera Demberg, and Bernt Schiele. 2020. Diverse and relevant visual storytelling with scene graph embeddings. In Proceedings of the 24th Conference on Computational Natural Language Learning, pages 420-430, Online. Association for Computational Linguistics.

Chao-Chun Hsu, Zi-Yuan Chen, Chi-Yang Hsu, Chih-Chia Li, Tzu-Yuan Lin, Ting-Hao Huang, and Lun-Wei Ku. 2020. Knowledge-enriched visual storytelling. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):7952-7960.

Qingqiu Huang, Yu Xiong, Anyi Rao, Jiaze Wang, and Dahua Lin. 2020. MovieNet: A holistic dataset for movie understanding. In Computer Vision - ECCV 2020, pages 709-727, Cham. Springer International Publishing.

Ting-Hao Kenneth Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh, Lucy Vanderwende, Michel Galley, and Margaret Mitchell. 2016. Visual storytelling. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1233-1239, San Diego, California. Association for Computational Linguistics.

Peter Hühn, Jan Christoph Meister, John Pier, and Wolf Schmid, editors. 2014. Handbook of Narratology. De Gruyter, Berlin, München, Boston.

Mirella Lapata and Regina Barzilay. 2005. Automatic evaluation of text coherence: Models and representations. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI'05), pages 1085-1090. Morgan Kaufmann Publishers Inc.

Kenton Lee, Luheng He, and Luke Zettlemoyer. 2018. Higher-order coreference resolution with coarse-to-fine inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 687-692, New Orleans, Louisiana. Association for Computational Linguistics.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74-81, Barcelona, Spain. Association for Computational Linguistics.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In Computer Vision - ECCV 2014, pages 740-755, Cham. Springer International Publishing.

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin Transformer: Hierarchical vision transformer using shifted windows. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9992-10002.

Lara Martin, Prithviraj Ammanabrolu, Xinyu Wang, William Hancock, Shruti Singh, Brent Harrison, and Mark Riedl. 2018. Event representations for automated story generation with deep neural nets. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1).

Margaret Mitchell, Ting-Hao 'Kenneth' Huang, Francis Ferraro, and Ishan Misra, editors. 2018. Proceedings of the First Workshop on Storytelling. Association for Computational Linguistics, New Orleans, Louisiana.

Nasrin Mostafazadeh, Nathanael Chambers, Xiaodong He, Devi Parikh, Dhruv Batra, Lucy Vanderwende, Pushmeet Kohli, and James Allen. 2016. A corpus and cloze evaluation for deeper understanding of commonsense stories. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 839-849, San Diego, California. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311-318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Cesc C. Park and Gunhee Kim. 2015. Expressing an image stream with a sequence of natural sentences. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.

José Luis Pech-Pacheco, Gabriel Cristóbal, Jesús Chamorro-Martinez, and Joaquín Fernández-Valdivia. 2000. Diatom autofocusing in brightfield microscopy: A comparative study. In Proceedings of the 15th International Conference on Pattern Recognition (ICPR-2000), volume 3, pages 314-317.

Nanyun Peng, Marjan Ghazvininejad, Jonathan May, and Kevin Knight. 2018. Towards controllable story generation. In Proceedings of the First Workshop on Storytelling, pages 43-49. Association for Computational Linguistics.

Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1756-1765, Vancouver, Canada. Association for Computational Linguistics.

Andrew Piper, Richard Jean So, and David Bamman. 2021. Narrative theory for computational narrative understanding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 298-311, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Dongqi Pu, Xudong Hong, Pin-Jie Lin, Ernie Chang, and Vera Demberg. 2022. Two-stage movie script summarization: An efficient method for low-resource long document summarization. In Proceedings of the Workshop on Automatic Summarization for Creative Writing, pages 57-66, Gyeongju, Republic of Korea. Association for Computational Linguistics.

Lianhui Qin, Antoine Bosselut, Ari Holtzman, Chandra Bhagavatula, Elizabeth Clark, and Yejin Choi. 2019. Counterfactual story reasoning and generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5043-5053, Hong Kong, China. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9.

Hannah Rashkin, Antoine Bosselut, Maarten Sap, Kevin Knight, and Yejin Choi. 2018. Modeling naive psychology of characters in simple commonsense stories. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2289-2299, Melbourne, Australia. Association for Computational Linguistics.

Hannah Rashkin, Asli Celikyilmaz, Yejin Choi, and Jianfeng Gao. 2020. PlotMachines: Outline-conditioned generation with dynamic plot state tracking. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4274-4295, Online. Association for Computational Linguistics.

Peng Shi and Jimmy Lin. 2019. Simple BERT models for relation extraction and semantic role labeling. CoRR, abs/1904.05255.

Karin Sim Smith, Wilker Aziz, and Lucia Specia. 2016. Cohere: A toolkit for local coherence. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 4111-4114, Portorož, Slovenia. European Language Resources Association (ELRA).

R. Vedantam, C. Zitnick, and D. Parikh. 2015. CIDEr: Consensus-based image description evaluation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4566-4575, Los Alamitos, CA, USA. IEEE Computer Society.

Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. 2020. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17:261-272.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38-45, Online. Association for Computational Linguistics.

Yu Xiong, Qingqiu Huang, Lingfeng Guo, Hang Zhou, Bolei Zhou, and Dahua Lin. 2019. A graph-based framework to bridge movies and synopses. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4591-4600. IEEE.

Peng Xu, Mostofa Patwary, Mohammad Shoeybi, Raul Puri, Pascale Fung, Anima Anandkumar, and Bryan Catanzaro. 2020. MEGATRON-CNTRL: Controllable story generation with external knowledge using large-scale language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2831-2845, Online. Association for Computational Linguistics.

Lili Yao, Nanyun Peng, Ralph Weischedel, Kevin Knight, Dongyan Zhao, and Rui Yan. 2019. Plan-and-write: Towards better automatic storytelling. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):7378-7385.

Youngjae Yu, Jiwan Chung, Heeseung Yun, Jongseok Kim, and Gunhee Kim. 2021. Transitional adaptation of pretrained models for visual storytelling. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12653-12663.

Author notes

Action Editor: Marco Baroni

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.