Abstract
Personalized Headline Generation aims to generate unique headlines tailored to users’ browsing history. In this task, understanding user preferences from click history and incorporating them into headline generation pose challenges. Existing approaches typically rely on predefined styles as control codes, but personal style lacks explicit definition or enumeration, making it difficult to leverage traditional techniques. To tackle these challenges, we propose General Then Personal (GTP), a novel framework comprising user modeling, headline generation, and customization. We train the framework using tailored designs that emphasize two central ideas: (a) task decoupling and (b) model pre-training. With the decoupling mechanism separating the task into generation and customization, two mechanisms, i.e., information self-boosting and masked user modeling, are further introduced to facilitate the training and text control. Additionally, we introduce a new evaluation metric to address existing limitations. Extensive experiments conducted on the PENS dataset, considering both zero-shot and few-shot scenarios, demonstrate that GTP outperforms state-of-the-art methods. Furthermore, ablation studies and analysis emphasize the significance of decoupling and pre-training. Finally, the human evaluation validates the effectiveness of our approaches.
1 Introduction
The task of headline generation aims to produce a concise sentence for expressing the salient information of the document (Liu et al., 2020). In addition to preserving the content information, recent studies have further proposed generating appealing headlines. For instance, some researchers propose to inject specific styles, such as humor and romance, into news headlines or generate interrogative ones to attract readers’ attention (Shu et al., 2018; Jin et al., 2020; Zhan et al., 2022; Zhang et al., 2018b). However, these approaches only consider one type of style to catch the attention, neglecting that individuals may possess different preferences. Therefore, Personalized Headline Generation (PHG) (Ao et al., 2021) has emerged as a new research direction, which customizes news headlines by considering user information. Specifically, given body-headline pairs along with users’ click history, the goal is to infer the implicit user preference and further incorporate it into generating personalized headlines. Besides, the task is usually formulated as zero-shot learning due to the high cost of collecting large-scale personalized headlines (Zhang et al., 2022).
Nevertheless, the task formulation brings the first critical challenge, i.e., generating personalized headlines without ground-truth annotation. In such a scenario, models are required to incorporate the implicit user preference without explicit supervision for personal text styles. Previous work (Ao et al., 2021) tackles the task with reinforcement learning, taking the users’ click history as a learning signal to construct the style supervision. However, their learning framework does not leverage the news headlines that can contribute to improving headline quality. Therefore, we make the first attempt to facilitate personalization learning with limited or no ground-truth annotations while enhancing generation quality by leveraging the news headlines.
In this paper, we propose a framework named General Then Personal (GTP) to tackle this challenge. Specifically, we propose to decouple the generation process into headline generation and headline customization. The goal of headline generation is to produce headlines targeting a general audience, while headline customization aims to further customize them based on the control code of a specific user. With the task decoupling, we can pre-train the headline generator using body-headline pairs from a diverse range of news articles, thus improving the quality of generation. Afterward, the control code and the generated headlines are jointly utilized by the headline customizer.
However, constructing the control code poses several challenges. In previous work, the control code typically refers to specific and discrete attributes of the target headlines, such as topics, sentiments, keywords, or descriptive prompts (Keskar et al., 2019; Chan et al., 2021a; He et al., 2022; Carlsson et al., 2022). For the PHG task, however, the target attributes are the user preferences encapsulated in the click histories, which cannot be directly defined. As such, we follow an approach similar to Ao et al. (2021) and pre-train recommendation models on impression logs to extract the features of click histories as the control codes.
To train the headline customizer utilizing the control codes, we could form the training samples with news articles and corresponding user click histories. However, since a news article can attract multiple users, each with potentially distinct preferences, the pairing between news and user click history is not one-to-one. This limitation impedes the learning of preference control. Thus, we construct a new dataset to alleviate the issue, which assumes that users who click on the same news have similar preferences. Specifically, we integrate the click histories of these users to synthesize a pseudo click history, which helps build a new user profile with more specific interests in the target news.
Moreover, because control codes occupy a latent space distinct from regular text features, models may inadvertently disregard them during learning. Thus, we design two mechanisms, Masked User Modeling (MUM) and Information Self-Boosting (ISB), to alleviate this issue. MUM serves as a pre-training objective to make the control code recognizable to the model, while ISB leverages the generated headline to recall information from the article, reducing information loss in the two-stage generation process.
Finally, the last issue is the lack of evaluation metrics. Previous works only depend on lexical similarity for evaluation, constraining the models to generate one type of headline. However, given an article, a user could be attracted by various headlines beyond the ground-truth one. This argument is akin to the multi-target summarization problem (Cachola et al., 2020), suggesting that multiple valid summaries could exist for an article. To benchmark the degree of personalization, we propose the Anomaly-based Personalization Evaluation (APE) metric, inspired by the anomaly detection task and the evaluation metrics used in the vision domain. To implement APE, we train auto-encoder models to assess whether an input headline adheres to the same distribution as user-written ones. The detection model is expected to learn text style to distinguish the inputs, enabling us to consider the hidden states as style features. We quantify the results by measuring the feature distance between generated and user-written headlines. However, relying solely on distance measurements may not provide a comprehensive quality assessment. To offer a more intuitive metric, we introduce the editor headlines as reference points for comparison and employ relative values to convey the results. Unlike ROUGE scores, which focus on lexical similarity, APE enables a distribution-wise evaluation, providing a more flexible reference for assessment.
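To make the relative scoring concrete, the sketch below shows one way to turn style features into an APE-style number; the function name, the centroid-based distance, and the normalization against editor headlines are our assumptions rather than the exact implementation.

```python
import numpy as np

def ape_score(gen_feats, user_feats, editor_feats):
    """Hypothetical sketch of APE: each *_feats is an (n, d) array of
    style features taken from the hidden states of an auto-encoder
    trained only on user-written headlines (anomaly-detection style).
    The score is the distance of generated headlines to the user style,
    relative to the distance of editor headlines (lower is better)."""
    user_centroid = user_feats.mean(axis=0)
    # mean feature distance of generated headlines to the user style
    d_gen = np.linalg.norm(gen_feats - user_centroid, axis=1).mean()
    # editor headlines serve as the reference point for a relative value
    d_editor = np.linalg.norm(editor_feats - user_centroid, axis=1).mean()
    return d_gen / d_editor
```

A score near the editor baseline (ratio close to 1) indicates the generated headlines are about as far from the user's style distribution as editor-written ones; smaller values indicate closer adherence to the user's style.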
The contributions are summarized as follows:
We propose to decouple the personal headline generation task into generation and customization for incorporating the user preference in a late fusion style. We propose two mechanisms, MUM and ISB, to leverage user and content information better.
We propose a novel formulation for constructing the control code with one-to-one mappings between click histories and news headlines for better modeling text styles.
We introduce a new evaluation metric, APE, from the perspective of anomaly detection to provide a more flexible reference and validate the metric with human evaluation.
Extensive experiments, analysis, and user study demonstrate that the proposed GTP outperforms state-of-the-art approaches significantly under both zero-shot and few-shot settings.
2 Related Work
2.1 Control over Text Generation
Style-controlled Text Generation.
The advancements in pre-training techniques have enabled modern language models (Chowdhery et al., 2022; Brown et al., 2020) to generate text that is nearly indistinguishable from human-written text (Clark et al., 2021). This has sparked increased interest among researchers in modeling and controlling text attributes, giving rise to the field of controllable text generation (CTG) (Prabhumoye et al., 2020). Previous work has explored various attributes, such as keywords (He, 2021), specified entities (Dong et al., 2021), document diction (Dathathri et al., 2020), topics (Keskar et al., 2019), sentiments (Chan et al., 2021a), humor (Amin and Burghardt, 2020), authorship (Syed et al., 2020), and social bias (Barikeri et al., 2021). These examples show attributes can be approached from different perspectives, including grammatical, artistic, or cognitive aspects. Certain perspectives, notably sentiment and humor, are closely associated with general stylistic aspects, while others are more related to intrinsic text qualities, such as keywords and diction. In addition to controlling text by unconditional language models (Subramani et al., 2019), the techniques of CTG lead to the emergence of controllable text summarization and controllable headline generation tasks. For instance, He et al. (2022) prepend descriptive prompts to articles to enable controllability. Chan et al. (2021b) propose an RL framework based on a constrained Markov decision process. Yamada et al. (2021) propose a Transformer-based framework to generate summaries with specified phrases. Jin et al. (2020) apply multi-tasking to learn headline generation and denoising autoencoding for specific style corpora.
Style Transfer.
Besides controlling text generation conditionally or unconditionally, another research line focuses on text attribute transfer (Hu and Li, 2021), aiming to edit the existing text to possess desired attributes without considering contextual information. Similar to the CTG, the attributes could be style-related or intrinsic text qualities. The approaches involve disentangling text into content and attributes in the latent space for manipulation (Yi et al., 2020; Li et al., 2020), editing based on sentence templates (Madaan et al., 2020; Li et al., 2018), and creating pseudo-parallel data (Jin et al., 2019; Nikolov and Hahnloser, 2019). These methods often focus on transferring attributes in short sentences, making them naturally suited for tasks like headline generation. However, many previous works consider well-defined properties. In this paper, we focus on injecting personal preferences into headlines, posing a great challenge since preferences can be vague and difficult to capture and utilize effectively.
2.2 Realization of Control Codes
The main component of text control is injecting target attribute information, i.e., control codes (Keskar et al., 2019), into models. For verbalizable control codes such as keywords, topics, or entities, one approach is to use the corresponding tokens as hard prompts to the inputs during inference (Keskar et al., 2019; Fan et al., 2018). Alternatively, some works learn continuous representations of target attributes, known as soft prompts, to enable control over general attributes (Li and Liang, 2021; Yu et al., 2021). Another research line treats the control codes as learning targets. Much research has attempted to train scorers for target attributes and utilize them as reward functions within the RL framework (Song et al., 2020; Stiennon et al., 2020) or as the sampling bias during the decoding process (Krause et al., 2021; Mireshghallah et al., 2022). However, for PHG, control codes are the user preferences encapsulated in click histories. Unlike categorical attributes such as topics or dictions, user preferences involve attributes from different perspectives, including grammatical, artistic, and cognitive. Previous work applies recommendation models to extract user representations and designs a reward to match the representations of generated headlines (Ao et al., 2021). Although straightforward, such a scheme does not explore the benefit of news headlines. In this paper, we construct our control code in a fine-grained manner and take click-through rate and news headlines into consideration to better capture user interests.
3 Problem Formulation
We denote the database of news articles as 𝒟 = {n1, …, nN}, where xi and yi represent the body and headline of article ni, respectively. The personalized headline generation task aims to generate user-specific headlines for a given user τ, taking into account the user’s preferences and the content of the news articles. To achieve this, the models need to understand the implicit user preference, which acts as the control code, from the user’s click history, denoted as ℋτ = {n1, …, nk}. Additionally, we include the user’s impression logs, which contain information about the displayed news and click-through behaviors over a period of time, in the dataset to learn the user’s preferences. The clicked news articles are considered positive samples 𝒫τ, while the unclicked news articles are considered negative samples 𝒩τ. For evaluation, we utilize the PENS dataset introduced by Ao et al. (2021), which provides click histories and a series of news articles with user-written headlines. Our work addresses both zero-shot and few-shot settings for personalized headline generation. In the zero-shot setting, the model learns to generate personalized headlines without ground-truth annotations. Furthermore, we explore the few-shot setting, where a limited number of user annotations are available during the learning phase. These two settings require different capabilities from the models (Yin et al., 2020), enabling us to investigate the proposed methods from diverse perspectives.
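For readers who prefer code over notation, the formulation above can be laid out as simple records; the class and field names here are our own illustration, not part of the PENS dataset schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Article:
    body: str      # x_i, the article body
    headline: str  # y_i, the editor-written headline

@dataclass
class User:
    # click history H_tau: ids of previously clicked articles
    history: List[str] = field(default_factory=list)
    # impression log, split into clicked (P_tau) and unclicked (N_tau)
    positives: List[str] = field(default_factory=list)
    negatives: List[str] = field(default_factory=list)
```

The zero-shot setting provides only `history`, `positives`, and `negatives` at training time; user-written headlines appear only in evaluation (and, in the few-shot setting, as a handful of finetuning targets).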
4 Methodology
In this section, we elaborate on how we tackle the PHG task as shown in Fig. 1. The following sections are organized as follows. Sec. 4.1 describes how to establish the control code from users’ click histories. Sec. 4.2 introduces the generation framework to incorporate the control code effectively. Sec. 4.3 presents the training strategy and the process of pre-training dataset construction. Finally, Sec. 4.4 introduces an evaluation metric to quantify the degree of personalization.
4.1 Control Code Construction
Interested-News Only Filter.
4.2 Late Fusion Generation Model
Late Fusion Framework.
Information Self-boosting.
To address concerns about potential information loss arising from the two-stage generation process (Song et al., 2022), we propose a mechanism to incorporate supporting information from the news article into the generated headlines. Specifically, we extract information from the news body x based on the outputs generated by HG. We employ a greedy selection algorithm to retrieve relevant sentences, using lexical overlap to measure text similarity. This extracted information, denoted as s, is then concatenated with the outputs of HG and fed into HC. The customization process can be described as ŷτ = HC([ŷ; s], cτ), where ŷ is the headline produced by HG, [· ; ·] denotes concatenation, and cτ is the control code. By incorporating relevant information, we enhance the customization stage and mitigate potential information loss during the two-stage generation.
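The greedy retrieval step can be sketched as follows; the function name, the unigram-overlap similarity, and the sentence budget `k` are our assumptions about one plausible instantiation, not the paper's exact implementation.

```python
def greedy_select(headline, body_sentences, k=2):
    """Greedily pick up to k body sentences whose *added* unigram
    overlap with the generated headline is largest (a simple lexical
    proxy for relevance, as used in oracle-extraction heuristics)."""
    head = set(headline.lower().split())
    selected, covered = [], set()
    for _ in range(k):
        best, best_gain = None, 0
        for sent in body_sentences:
            if sent in selected:
                continue
            # overlap tokens this sentence would newly cover
            gain = len((set(sent.lower().split()) & head) - covered)
            if gain > best_gain:
                best, best_gain = sent, gain
        if best is None:  # no sentence adds new overlapping tokens
            break
        selected.append(best)
        covered |= set(best.lower().split()) & head
    return " ".join(selected)
```

The returned string plays the role of s: it is concatenated with the HG output before being fed to the customizer.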
Masked User Modeling.
4.3 Training Strategy
PENS-SH Dataset Construction.
To train a personalized headline generation model without explicit annotations, previous research (Ao et al., 2021) encodes and aggregates the generated results into condensed features, which are subsequently matched with positive news items from the click history. However, we identify significant room for improving text quality over such reinforcement-learning-based methods. Specifically, we propose a two-stage approach that improves text quality while facilitating collaboration with preference information. In this methodology, editor headlines are treated as pseudo targets, and corresponding click histories are simulated based on users who have clicked on the same pseudo target. However, a single news article can attract multiple users, each with potentially distinct preferences, resulting in multiple click histories for a pseudo target. Learning from these potentially conflicting examples can confound the preference-aware generation process. Hence, we introduce PENS-SH to alleviate this issue, as shown in Fig. 5, which assumes that users who click on the same news article have similar preferences. Specifically, we integrate the click histories of users who have clicked on a particular news article into a news pool. From this pool, we select news articles with higher occurrence frequencies to synthesize a pseudo click history. By leveraging the information shared among these users, the pseudo click history helps establish a new user profile with clearer and more specific interests related to the target in a one-to-one manner.
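The pooling-and-frequency step above can be sketched in a few lines; the function name and the `size` cutoff are our assumptions, and real construction would operate on news ids from the PENS impression logs.

```python
from collections import Counter

def pseudo_click_history(user_histories, size=50):
    """Given the click histories of all users who clicked the same
    target news (a list of lists of news ids), pool them and keep the
    most frequently occurring ids as the synthesized pseudo history."""
    pool = Counter()
    for history in user_histories:
        pool.update(history)
    # higher-frequency news first; ties broken by insertion order
    return [news_id for news_id, _ in pool.most_common(size)]
```

Because every target headline is now paired with exactly one synthesized history, the control code and the pseudo target form the one-to-one mapping the customizer is trained on.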
Headline Generator Training.
Headline Customizer Pre-training.
Headline Customizer Finetuning.
We further consider the few-shot setting for PHG to explore model behaviors from different perspectives. Specifically, we finetune the headline customizer for each user with a few annotations. The HC is further finetuned by Eq. 7, where the generation target y is replaced by the user-written headline yτ.
4.4 Anomaly-based Personalization Evaluation
5 Experiments
5.1 Setups
Implementation Details.
We conduct experiments on the PENS dataset (Ao et al., 2021), which includes 113,762 news articles and 500,000 impressions from online users. The testing set comprises data from 103 users; each user’s click history and 200 headlines written by the user are available, yielding a total of 103 × 200 = 20,600 personalized headlines. Models are evaluated on all 20,600 testing samples in the zero-shot setting. For few-shot learning, we divide the 200 headlines of each user into 80/20/100 splits for training, validation, and testing, respectively. That is, we train a model for a user with the 80 training examples and evaluate on the 100 testing examples. This setup can be understood as the intra-user setting, as the objective is to evaluate the model’s generalization given a few examples specific to each user. Additionally, we consider the inter-user setting, using 40/13/50 users for training, validation, and testing: the model is trained on 40 × 200 = 8,000 examples and tested on the 50 testing users, i.e., 50 × 200 = 10,000 testing examples. This setting helps us evaluate the model’s generalization across different users. The proposed PENS-SH dataset includes 14,505 pairs, of which we take 12,505 for training and 2,000 for validating the HC. Both HG and HC are initialized with BART (Lewis et al., 2020). For the APE metric, the fold number P is 2, and the fold size S is 100.
Baselines.
Our baselines are PENS (Ao et al., 2021) and EUI-PENS (Zhang et al., 2022). EUI-PENS builds upon PENS using entity words from news and input-dependent user representations. We then compare with ChatGPT via the OpenAI API to explore the benefits of using a large language model on the personalized generation task. To facilitate the task with ChatGPT, we formulate the prompt as follows:
I want you to act as a personalized headline writer. I will provide you with some headlines clicked by a user and a target document. You will answer the personalized headline of the target document for the user.
The clicked headlines are: <clicked>. The target document is: <document>.
where <clicked> comprises 50 concatenated headlines extracted from the user’s click history. As prior works primarily focus on the zero-shot setting, we adopt the ablations of GTP as our baselines for the few-shot setting.
Evaluation Metrics.
ROUGE-n (Lin, 2004) evaluates the lexical similarity between the generation results and references. BLEURT (Sellam et al., 2020) involves pre-training BERT using millions of synthetic examples to enhance generalization and robustness in evaluation. BARTScore (Yuan et al., 2021), built upon BART, evaluates text by considering its probability of being generated from or generating other textual inputs and outputs. Both BLEURT and BARTScore utilize references, i.e., ground-truth, for evaluation. On the other hand, G-Eval (Liu et al., 2023) employs large language models with customized prompts to execute reference-free evaluation for different aspects of texts. The proposed APE metric evaluates the degree of personalization specifically.
5.2 Main Results
Table 1 summarizes the results obtained in the zero- and few-shot settings. Notably, we observe that the APE metric is more sensitive to out-of-distribution samples than ROUGE. This sensitivity can be attributed to the deliberate inclusion of user-written headlines in the APE learning process. APE models operate on the principle of anomaly detection, so only in-domain data (i.e., user-written headlines) is available during the learning phase. Consequently, when confronted with out-of-domain data during inference, the model’s outputs degrade, making the metric sensitive to such samples. This situation resembles the challenges neural models face in domain generalization (Wang et al., 2022). We leverage this characteristic to emphasize the discrepancies between the generated results. Moreover, APE evaluates similarity based on the collective knowledge acquired by the model, which provides an evaluation from a different angle, complementing one-to-one comparison metrics.
| Setting | Methods | RM | Pre-train | ROUGE-1 / 2 / L ↑ | BLEURT ↑ | BARTScore ↑ | APE ↓ |
|---|---|---|---|---|---|---|---|
| Zero-Shot | Editor | – | – | 47.81 / 26.67 / 36.74 | 51.47 | 3.71 | 0.71 |
| | Pointer-Gen | – | – | 19.86 / 7.76 / 18.83 | – | – | – |
| | PG+RL-ROUGE | – | – | 20.56 / 8.84 / 20.03 | – | – | – |
| | PENS | EBRN | – | 25.49 / 9.14 / 20.82 | – | – | – |
| | PENS | DKN | – | 27.48 / 10.07 / 21.81 | – | – | – |
| | PENS | NPA | – | 26.11 / 9.58 / 21.40 | – | – | – |
| | PENS | NRMS | – | 26.15 / 9.37 / 21.03 | – | – | – |
| | PENS | LSTUR | – | 24.10 / 8.82 / 20.73 | – | – | – |
| | PENS | NAML | – | 28.01 / 10.72 / 22.24 | – | – | – |
| | EUI-PENS | Ent-CNN | – | 32.34 / 13.93 / 26.90 | – | – | – |
| | ChatGPT | – | – | 29.80 / 11.04 / 24.15 | 40.97 | 2.32 | 13.04 |
| | One-Stage | – | – | 33.68±0.007 / 14.09±0.004 / 27.70±0.010 | 42.22±0.002 | 2.95±0.001 | 1.59±0.001 |
| | One-Stage† | TrRMIo | PENS-SH | 33.45±0.008 / 13.97±0.004 / 27.60±0.009 | 41.77±0.003 | 2.92±0.001 | 3.69±0.002 |
| | GTP | TrRMIo | PENS | 33.50±0.008 / 14.03±0.009 / 27.65±0.002 | 41.85±0.002 | 2.96±0.001 | 1.92±0.077 |
| | GTP | TrRMIo | PENS-SH | 33.84*±0.002 / 14.23*±0.000 / 27.85*±0.001 | 42.26*±0.002 | 3.01*±0.001 | 0.76*±0.003 |
| Intra Few-Shot | One-Stage | – | – | 33.87±0.16 / 14.18±0.11 / 27.83±0.10 | 41.68±0.002 | 2.92±0.002 | 1.46±0.002 |
| | Two-Stage | – | ✗ | 34.12±0.17 / 14.46±0.09 / 28.32±0.06 | 41.76±0.12 | 3.02±0.004 | 1.20±0.009 |
| | Two-Stage‡ | TrRMIo | ✗ | 33.65±0.11 / 14.26±0.07 / 28.30±0.06 | 41.39±0.06 | 3.04±0.001 | 2.24±0.014 |
| | GTP | TrRMIo | PENS-SH | 34.93*±0.16 / 15.23*±0.12 / 29.21*±0.08 | 42.54*±0.06 | 3.28*±0.002 | 0.62*±0.004 |
| Inter Few-Shot | One-Stage | – | – | 34.10±0.08 / 14.37±0.06 / 28.05±0.04 | 42.08±0.281 | 3.01±0.012 | 1.44±0.005 |
| | Two-Stage | – | ✗ | 34.13±0.37 / 14.55±0.27 / 28.52±0.08 | 42.09±0.282 | 2.92±0.019 | 2.40±0.386 |
| | Two-Stage‡ | TrRMIo | ✗ | 33.48±0.11 / 14.06±0.07 / 28.19±0.06 | 41.64±0.083 | 3.05±0.006 | 4.37±0.941 |
| | GTP | TrRMIo | PENS-SH | / 14.74±0.03 / 28.55±0.07 | 42.05±0.192 | 3.17*±0.007 | |
Zero-shot Results.
The One-Stage approach is our first-stage model, HG, which achieves strong performance, indicating the importance of leveraging general news headlines. The improvements can be attributed to the partial commonality between personalized and non-personalized headlines. In contrast to PENS, which employs reinforcement learning and utilizes the similarity with the news body as a learning reward, our decoupling scheme enables HG to exploit all news headlines and optimize specifically for headline generation. To generate personalized headlines, we incorporate user representations from a recommendation model. However, we encounter challenges in integrating user information without style annotations. Specifically, the One-Stage (Early Fusion) variant, which introduces the user representation alongside the news body, performs worse than the simple HG in terms of both lexical and style similarity metrics. Therefore, it is crucial to design pre-training objectives that facilitate model adaptation and the incorporation of control codes. By decoupling the generation process and introducing the two proposed pre-training mechanisms, ISB and MUM, GTP achieves significant improvements on all metrics. The decoupling and the ISB mechanism allow the model to focus on transforming a general headline into a personalized one, while the MUM objective guides the model in utilizing user information effectively. Moreover, we validate the effectiveness of pre-training GTP with PENS-SH by replacing it with the original PENS dataset. The results show that pre-training directly with PENS yields inferior results compared to the tailored PENS-SH, demonstrating the benefit of establishing a one-to-one mapping between the control code and the pseudo target. Finally, GTP significantly outperforms ChatGPT. We hypothesize that this discrepancy arises from the intricate and implicit nature of personal preferences, which are difficult to utilize effectively without appropriate design.
Our method leverages a recommender model to encapsulate this nuanced information, which is then employed by our tailored mechanisms, making the generated headlines more cognizant of the underlying preferences.
Intra Few-shot Results.
We begin by presenting the performance of the first-stage outputs as the baseline, One-Stage (w/o finetuning), since the testing data in the few-shot setting differ from the zero-shot setting. Subsequently, we investigate the benefits of finetuning using a small number of user-written samples within the two-stage framework, reflected in the results of Two-Stage. These results indicate that few-shot finetuning can enhance performance on both metrics, suggesting a distribution discrepancy between news and user-written headlines. This discrepancy further emphasizes the challenge of generating personalized headlines without any user annotations. Additionally, we explore the advantages of incorporating user representations extracted from the click history, shown as Two-Stage (Late Fusion), where we employ the decoupling network and introduce the user representation in the second stage to facilitate few-shot finetuning. Similar to our observations in the zero-shot setting, directly adding the user representation and finetuning the model leads to inferior performance. This finding underscores the difficulty of simultaneously utilizing out-of-distribution information while learning the user style from a limited number of samples. Consequently, we propose two mechanisms to pre-train the decoupled network, enabling better utilization of the user representation. The results of GTP significantly outperform the baselines, underscoring the importance of making the control code recognizable to the model prior to few-shot finetuning.
Notably, Table 2 shows that baseline models slightly outperform GTP on aspects such as coherence and consistency. To delve into this result, we also assess the G-Eval scores for editor-written and user-written headlines, shown in the first and second rows. The results suggest that user-written headlines exhibit relatively weaker performance on these aspects. This observation is expected, as users tend to prioritize their preferences over the superior text quality exhibited by well-trained editors. Consequently, we note that the performance of GTP closely aligns with that of user-written headlines. Overall, the evaluation across various metrics demonstrates that GTP not only achieves effective personalization but also maintains textual quality comparable to the baselines.
| Setting | Methods | RM | Pre-train | Coherence (1∼5) ↑ | Consistency (1∼5) ↑ | Fluency (1∼3) ↑ | Relevance (1∼5) ↑ |
|---|---|---|---|---|---|---|---|
| Intra Few-Shot | Editor | – | – | 3.89 | 4.21 | 2.73 | 4.16 |
| | User | – | – | 3.79 | 4.08 | 2.41 | 3.91 |
| | ChatGPT | – | – | 3.94 | 4.17 | 2.76 | 4.23 |
| | One-Stage | – | – | 3.88 | 4.23 | 2.74 | 4.18 |
| | GTP | TrRMIo | PENS-SH | 3.86 | 4.17 | 2.65 | 4.09 |
Inter Few-shot Results.
The results in Table 1 indicate that GTP enhances performance on the ROUGE, BARTScore, and APE metrics even in the inter-user scenario. For BLEURT, we observe large variance, with similar performance levels across methods. Overall, GTP still offers advantages under the inter-user setting, particularly benefiting applications where headlines for new users are unavailable.
5.3 Ablation Study
Table 3 provides detailed ablation studies on GTP. The experiments are performed in the few-shot setting using three different data splits. First, we replace the user encoder TrRMIo with a language model (Lewis et al., 2020) to evaluate the importance of using a recommendation model for obtaining the control code. From row 2, we observe that the APE score degrades, indicating that the recommendation model is better equipped to capture user preferences beyond textual and content information. Row 3 presents the results without the MUM pre-training objective, where the APE score drops more than the ROUGE scores. We attribute this to MUM assisting the model in utilizing the text style encoded in the control code, which is better reflected in the APE metric than in ROUGE. In turn, ISB aims to mitigate the information loss in the two-stage framework and contributes substantially to both metrics, as shown in row 4, suggesting that ISB provides valuable information to enhance the customization process. Row 5 demonstrates the results without pre-training, where the model is instead initialized from a general language model. The results highlight the necessity of enabling the model to recognize the input formulation before few-shot learning. Lastly, row 6 presents the model’s performance without late fusion, where personalized headlines are generated in one stage and the control code is injected along with the input article. The results indicate that such a scheme fails to effectively leverage the control codes, leading to significant performance drops on both metrics. Overall, these ablation studies provide an insightful analysis of the components and mechanisms in GTP, highlighting the contributions of the proposed mechanisms, framework, and training scheme.
Method | ROUGE-1 / ROUGE-2 / ROUGE-L ↑ | BLEURT ↑ | BARTScore ↑ | APE ↓
---|---|---|---|---
(1) GTP | 34.93±0.16 / 15.23±0.12 / 29.21±0.08 | 42.54±0.06 | 3.28±0.002 | 0.62±0.004
(2) w/o TrRMIo | 34.79±0.12 / 15.21±0.08 / 29.19±0.07 (−0.06) | 42.41±0.04 (−0.13)* | 3.18±0.005 (−0.10)* | 0.85±0.016 (↑ 0.23)*
(3) w/o MUM | 34.85±0.18 / 15.18±0.09 / 29.18±0.13 (−0.05) | 42.26±0.04 (−0.28)* | 3.26±0.001 (−0.02) | 1.08±0.028 (↑ 0.45)*
(4) w/o ISB | 34.26±0.15 / 14.51*±0.12 / 28.51*±0.08 (−0.70) | 42.17±0.05 (−0.37)* | 3.07±0.001 (−0.21)* | 2.14±0.001 (↑ 1.51)*
(5) w/o Pre-training | 33.65*±0.11 / 14.26*±0.07 / 28.30*±0.06 (−1.06) | 41.39±0.06 (−1.15)* | 3.04±0.001 (−0.24)* | 2.24±0.014 (↑ 1.62)*
(6) w/o Late Fusion | 33.57*±0.14 / 14.08*±0.08 / 27.90*±0.07 (−1.27) | 41.27±0.06 (−1.27)* | 2.90±0.001 (−0.38)* | 3.13±0.191 (↑ 2.51)*
5.4 Analysis of Few-shot Sample
This section further analyzes the influence of sample selection on few-shot learning. We explore three strategies for selecting samples: 1) random sampling, where samples are randomly chosen from the news pool; 2) diversity sampling, which applies k-means clustering to identify distinctive data and selects the samples closest to the cluster centroids; and 3) similarity sampling, where samples with higher similarity between the user-written and generated news headlines are chosen based on the cosine similarity of sentence embeddings (Gao et al., 2021). The results presented in Table 4 reveal that random sampling achieves slightly better performance than the other two strategies, especially on the APE metric. As a result, we adopt random sampling as the default setting for few-shot fine-tuning. These findings also emphasize the challenge of effectively capturing user style in personalized news headline generation. Furthermore, we analyze the influence of sample size on performance. Fig. 6 provides an overview of the results, indicating that performance improves as the sample size increases. This suggests that learning personal style from a limited number of samples is challenging.
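The three selection strategies can be sketched as follows. This is an illustrative implementation, not the released code: `emb`, `user_emb`, and `gen_emb` stand in for precomputed sentence embeddings, the function names are hypothetical, and the k-means step is a minimal Lloyd's iteration to keep the sketch self-contained.

```python
import numpy as np

def random_sampling(n_pool, k, seed=0):
    """Strategy 1: uniformly sample k indices from the news pool."""
    rng = np.random.default_rng(seed)
    return rng.choice(n_pool, size=k, replace=False)

def diversity_sampling(emb, k, iters=20, seed=0):
    """Strategy 2: cluster sentence embeddings with k-means and pick
    the sample closest to each cluster centroid."""
    rng = np.random.default_rng(seed)
    centroids = emb[rng.choice(len(emb), size=k, replace=False)].copy()
    for _ in range(iters):
        # assign each point to its nearest centroid, then update centroids
        d = np.linalg.norm(emb[:, None] - centroids[None], axis=-1)
        assign = d.argmin(axis=1)
        for c in range(k):
            if (assign == c).any():
                centroids[c] = emb[assign == c].mean(axis=0)
    d = np.linalg.norm(emb[:, None] - centroids[None], axis=-1)
    return np.unique(d.argmin(axis=0))  # nearest sample per centroid

def similarity_sampling(user_emb, gen_emb, k):
    """Strategy 3: pick the samples whose user-written and generated
    headline embeddings have the highest cosine similarity."""
    a = user_emb / np.linalg.norm(user_emb, axis=1, keepdims=True)
    b = gen_emb / np.linalg.norm(gen_emb, axis=1, keepdims=True)
    cos = (a * b).sum(axis=1)
    return np.argsort(-cos)[:k]
```

Under this sketch, the three strategies differ only in the scoring signal (none, centroid distance, headline similarity), which makes the observation that plain random sampling wins all the more notable.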
5.5 Analysis of User Representation
This section analyzes the construction and pre-training of user representations. In addition to GTP, we consider the models without MUM (w/o MUM) and without training the user encoder on the recommendation task (w/o TrRM). Fig. 7 shows the performance with different quantile thresholds of the Interested-only Filter under the three configurations. The results demonstrate that constructing user representations from news with lower CTR performs better, especially when the user encoder is trained with the recommendation task. These observations also support our hypothesis that a click history contains both genuinely interesting news and merely popular news, which can be distinguished by CTR. In addition, the results show that models achieve better performance with MUM and TrRM across various threshold settings.
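As a concrete illustration, the Interested-only Filter can be viewed as a quantile cut on CTR over a user's click history. The helper below is a hypothetical sketch, assuming per-item CTR values are available; it is not the paper's implementation.

```python
import numpy as np

def interested_only_filter(history, ctr, q=0.5):
    """Hypothetical sketch of an Interested-only Filter: keep clicked
    news whose CTR falls at or below the q-quantile of this user's
    history, dropping high-CTR (merely popular) items."""
    ctr = np.asarray(ctr, dtype=float)
    thresh = np.quantile(ctr, q)
    return [news for news, c in zip(history, ctr) if c <= thresh]
```

The quantile threshold `q` corresponds to the x-axis of Fig. 7; lower thresholds retain only the least popular (and, per the hypothesis, most preference-revealing) clicks.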
5.6 Results of Personalized Recommendation
In the news recommendation literature, most methods rely on news titles to model news items due to their significant impact on users’ clicking behaviors (Wu et al., 2023). Several studies have expanded this approach by incorporating supplementary features, such as keywords (Zhang et al., 2018a), entities (Qi et al., 2021), categories (Wu et al., 2019a), topics (Wu et al., 2019c), location (Xun et al., 2021), and popularity (Tavakolifard et al., 2013). Although integrating more textual features is feasible for the proposed TrRMIo, we opt to employ news titles exclusively, since the additional information may be unavailable; this ensures the generalizability of our method and keeps the focus on the proposed methodologies. Nevertheless, to provide a more comprehensive discussion of TrRMIo, we conduct additional experiments that incorporate title keywords as auxiliary inputs. The results are presented in the last three rows of Table 5. They indicate that performance can be slightly enhanced by incorporating additional textual elements (TrRMIo(title+keyword)). However, excluding titles and considering only keywords (TrRMIo(keyword)) significantly degrades performance, underscoring the need to incorporate titles. We emphasize that TrRMIo is designed as a general-purpose model whose objective is to serve the proposed GTP framework. Table 5 demonstrates the superiority of TrRMIo over previous approaches.
Model | AUC / MRR / NDCG@5 / NDCG@10
---|---
EBRN (Okura et al., 2017) | 63.97 / 22.52 / 26.45 / 32.81
DKN (Wang et al., 2018) | 65.25 / 24.07 / 26.97 / 32.24
NPA (Wu et al., 2019b) | 64.97 / 23.65 / 26.72 / 33.96
NRMS (Wu et al., 2019d) | 64.27 / 23.28 / 26.60 / 33.58
LSTUR (An et al., 2019) | 62.49 / 22.69 / 24.71 / 32.28
NAML (Wu et al., 2019a) | 66.18 / 25.51 / 27.56 / 35.17
Entity-CNN (Zhang et al., 2022) | 66.28 / 25.34 / 27.58 / 35.53
TrRMIo (title) | 68.88 / 27.27 / 30.98 / 38.81
TrRMIo (title + keyword) | 69.01 / 27.05 / 30.72 / 38.61
TrRMIo (keyword) | 65.51 / 25.21 / 28.12 / 35.79
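For reference, the ranking metrics reported in Table 5 (AUC, MRR, NDCG@k) can be computed per impression as sketched below. This is a generic implementation of the standard definitions, not the evaluation code used in the paper; per-impression values would then be averaged over the test set.

```python
import numpy as np

def auc(scores, labels):
    """Probability that a clicked item is scored above an unclicked one
    (ties count half), computed over all positive/negative pairs."""
    pos = scores[labels == 1][:, None]
    neg = scores[labels == 0][None, :]
    n_pairs = pos.size * neg.size
    return ((pos > neg).sum() + 0.5 * (pos == neg).sum()) / n_pairs

def mrr(scores, labels):
    """Mean reciprocal rank of the clicked items."""
    order = np.argsort(-scores)
    ranks = np.arange(1, len(scores) + 1)
    return (labels[order] / ranks).sum() / labels.sum()

def ndcg_at_k(scores, labels, k):
    """Normalized discounted cumulative gain at cutoff k."""
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    gains = labels[np.argsort(-scores)][:k]
    dcg = (gains * discounts[: len(gains)]).sum()
    ideal = np.sort(labels)[::-1][:k]
    idcg = (ideal * discounts[: len(ideal)]).sum()
    return dcg / idcg if idcg > 0 else 0.0
```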
5.7 Human Evaluation
In addition to automated evaluation, the proposed methods are assessed by soliciting human judgments. Our human evaluation requires participants to answer a set of binary-choice questions. Each question comprises a target headline authored by a user from the PENS corpus and two test headlines generated by two distinct models. Participants are asked to select the test headline that more closely resembles the target headline in terms of text style, encompassing factors such as length, vocabulary, structure, and tone. Participants are instructed not to base their choices on personal preferences but to assume that the target headline represents their preferred option when answering the questions. Furthermore, an additional “tie” option is provided if participants cannot decide after careful consideration. The evaluation is conducted separately for the zero-shot and few-shot settings. In the zero-shot setting, 26 randomly sampled questions are presented, with test headlines generated by GTP and a one-stage model with early fusion. In the few-shot setting, a similar approach is adopted with 40 questions, where 20 compare GTP with One-Stage and the remaining 20 compare GTP with editor-written headlines. The order of the questions is randomly permuted to mitigate potential recognition of the underlying generation methods.5 The win rates of GTP under various settings are presented in Table 6. The findings show that the headlines generated by GTP more closely resemble the desired headlines than those of various baselines, including editor-written ones, suggesting that GTP better utilizes user information.
Furthermore, it is noteworthy that a certain proportion of tie options were selected, indicating that the manifestation of preference may be subtle or inconspicuous in some instances, possibly owing to the intrinsic content and topic of the news. These observations motivate future work to investigate the varying levels of difficulty associated with customizing distinct headlines. No personally identifiable information was collected, and participants were not exposed to offensive content.
Comparison | GTP | Baseline | Tie
---|---|---|---
*zero-shot* | | |
GTP vs One-Stage | 60.88% | 22.30% | 16.82%
*few-shot* | | |
GTP vs One-Stage | 59.41% | 26.92% | 13.68%
GTP vs Editor | 58.66% | 29.97% | 11.37%
5.8 APE Validation
We leverage the human evaluation outcomes from Sec. 5.7 to validate that the APE metric aligns with human judgments, as shown in Table 7. First, for each participant, we separate the chosen and unchosen headlines into two groups. Next, we compute the APE scores for both groups for each participant individually and then average the scores across all participants. If the APE metric agrees with human judgments, we expect lower APE scores for the chosen headline group, reflecting human perspectives. In addition to the averaged APE scores, we report the win rate for the chosen and unchosen groups: for each participant, the group with the lower APE score is designated the winner, and we calculate the corresponding win rate. The averaged score provides an aggregate view of the agreement between the metric and human tendencies, while the win rate offers a per-participant perspective. The results consistently show that the chosen groups outperform the unchosen ones on both measures, confirming APE’s alignment with human judgments.
Comparison | Avg. APE (chosen) | Avg. APE (unchosen) | Win Rate (chosen) | Win Rate (unchosen)
---|---|---|---|---
*zero-shot* | | | |
GTP vs One-Stage | 0.827±0.007 | 0.881±0.006 | 69.05% | 30.95%
*few-shot* | | | |
GTP vs One-Stage | 1.094±0.002 | 1.152±0.004 | 80.49% | 19.51%
GTP vs Editor | 0.972±0.006 | 0.987±0.008 | 61.90% | 38.10%
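The validation procedure described above amounts to a per-participant comparison of averaged APE scores. A minimal sketch follows, with a hypothetical data layout of one (chosen, unchosen) pair of APE score lists per participant; the function name is illustrative.

```python
from statistics import mean

def ape_agreement(per_participant):
    """per_participant: list of (chosen_scores, unchosen_scores) pairs,
    one pair of APE score lists per participant.  Returns group-level
    averaged APE and the rate at which a participant's chosen group
    attains the lower (better) APE score."""
    chosen_avgs = [mean(chosen) for chosen, _ in per_participant]
    unchosen_avgs = [mean(unchosen) for _, unchosen in per_participant]
    wins = sum(c < u for c, u in zip(chosen_avgs, unchosen_avgs))
    return {
        "chosen_ape": mean(chosen_avgs),        # aggregate view
        "unchosen_ape": mean(unchosen_avgs),
        "chosen_win_rate": wins / len(per_participant),  # per-participant view
    }
```

Agreement with human judgment then corresponds to `chosen_ape < unchosen_ape` and a `chosen_win_rate` above 0.5, which is the pattern Table 7 reports.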
6 Conclusion
In this paper, we propose a novel framework named General Then Personal (GTP) to tackle the challenges of constructing and incorporating control codes for personalized headline generation. Specifically, we propose a late-fusion model that decouples the generation process, and we further introduce two mechanisms to facilitate the framework and enable text control. Additionally, we construct a pre-training dataset, PENS-SH, to build an effective control code that strengthens the connection between click history and target news. Moreover, we introduce a novel evaluation metric, APE, to quantify the degree of personalization. Extensive experiments and human evaluation demonstrate the necessity of all designs and show that GTP significantly outperforms state-of-the-art methods under both zero-shot and few-shot settings.
Acknowledgments
The authors would like to thank the anonymous reviewers and the action editor (Xiaojun Wan) for their valuable discussions and feedback. Lu Wang is supported by the National Science Foundation through grant IIS-2046016. This work was supported in part by the National Science and Technology Council of Taiwan under Grants NSTC-109-2221-E-009-114-MY3, NSTC-112-2221-E-A49-059-MY3, NSTC-111-2221-E-001-021, and NSTC-112-2221-E-A49-094-MY3.
Notes
Our source code is available at https://github.com/yunzhusong/TACL-GTP.
The control code refers to the information for controlling the generation process toward target headlines (Keskar et al., 2019).
We use “gpt-3.5-turbo” for this experiment. To conform to the constraints of the employed OpenAI model, we truncate the prompt to 4000 tokens, and the output length is confined to 64 tokens.
The evaluation process requires approximately 20 minutes. We recruited 50 participants to evaluate both segments. Before engaging in the tasks, all participants provide informed consent and are duly compensated for their time, which is set at $5 per participant.
References
Author notes
Action Editor: Xiaojun Wan