Pre-train, Prompt and Recommendation: A Comprehensive Survey of Language Modelling Paradigm Adaptations in Recommender Systems

The emergence of Pre-trained Language Models (PLMs) has achieved tremendous success in the field of Natural Language Processing (NLP) by learning universal representations on large corpora in a self-supervised manner. The pre-trained models and the learned representations can be beneficial to a series of downstream NLP tasks. This training paradigm has recently been adapted to the recommendation domain and is considered a promising approach by both academia and industry. In this paper, we systematically investigate how to extract and transfer knowledge from pre-trained models learned by different PLM-related training paradigms to improve recommendation performance from various perspectives, such as generality, sparsity, efficiency and effectiveness. Specifically, we propose a comprehensive taxonomy to divide existing PLM-based recommender systems w.r.t. their training strategies and objectives. Then, we analyze and summarize the connection between PLM-based training paradigms and different input data types for recommender systems. Finally, we elaborate on open issues and future research directions in this vibrant field.


Introduction
As an important part of the online environment, Recommender Systems (RSs) play a key role in discovering users' interests and alleviating information overload in their decision-making process. Recent years have witnessed tremendous success in recommender systems empowered by deep neural architectures and increasingly powerful computing infrastructure. However, deep recommendation models are inherently data-hungry, with an enormous number of parameters to learn; they are likely to overfit and fail to generalize well in practice when their training data (i.e., user-item interactions) are insufficient. Such scenarios are common in practical RSs, where large numbers of new users join with few interactions. Consequently, the data sparsity issue has become a major performance bottleneck for current deep recommendation models.
With the thriving of pre-training in NLP (Qiu et al., 2020), many language models have been pre-trained on large-scale unsupervised corpora and then fine-tuned on various downstream supervised tasks to achieve state-of-the-art results, such as GPT (Brown et al., 2020) and BERT (Devlin et al., 2019). One advantage of this pre-training and fine-tuning paradigm is that it can extract informative and transferable knowledge from abundant unlabelled data through self-supervision tasks such as masked language modelling (Devlin et al., 2019), which benefits downstream tasks whose labelled data is insufficient and avoids training a new model from scratch. A recently proposed paradigm, prompt learning (Liu et al., 2023b), further unifies the use of pre-trained language models (PLMs) on different tasks in a simple yet flexible manner. In general, prompt learning relies on a suite of appropriate prompts, either hard text templates (Brown et al., 2020) or soft continuous embeddings (Qin and Eisner, 2021), to reformulate downstream tasks as the pre-training task. The advantage of this paradigm lies in two aspects: (1) it bridges the gap between pre-training and downstream objectives, allowing better utilization of the rich knowledge in pre-trained models; this advantage is multiplied when very little downstream data is available. (2) Only a small set of parameters needs to be tuned for prompt engineering, which is more efficient.
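As a toy illustration of a hard text template, a next-item recommendation query can be recast as a cloze-style sentence for a masked LM to complete. The template wording and function name below are our own illustrative choices, not taken from any cited system:

```python
def cloze_prompt(user_history, mask_token="[MASK]"):
    """Reformulate next-item recommendation as a cloze query.

    The filled template can be scored by a masked language model; the
    token the model predicts for the mask slot is taken as the
    recommended item. Template wording is purely illustrative.
    """
    items = ", ".join(user_history)
    return f"A user who liked {items} would also like {mask_token}."

print(cloze_prompt(["The Matrix", "Inception"]))
# A user who liked The Matrix, Inception would also like [MASK].
```

Because the query now has the same shape as the masked-LM pre-training task, no task-specific head needs to be trained to obtain a (zero-shot) prediction.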
Motivated by the remarkable effectiveness of the aforementioned paradigms in solving data sparsity and efficiency issues, adapting language modelling paradigms for recommendation is seen as a promising direction in both academia and industry, and has greatly advanced the state of the art in RSs. Although there have been several surveys on pre-training paradigms in the fields of CV (Long et al., 2022), NLP (Liu et al., 2023b) and graph learning (Liu et al., 2023d), only a handful of literature reviews are relevant to RSs. Zeng et al. (2021) summarize some research on the pre-training of recommendation models and discuss knowledge transfer methods between different domains, but they cover only a small number of BERT-like works and do not go deep into the training details of pre-trained recommendation models. Yu et al. (2023) give a brief overview of the advances of self-supervised learning in RSs. However, their focus is on a purely self-supervised recommendation setting, meaning the supervision signals used to train the model are semi-automatically generated from the raw data itself. Our work, in contrast, is not strictly limited to self-supervised training strategies but also covers the adaptation and exploration of supervised signals and data augmentation techniques in the pre-training, fine-tuning and prompting process for various recommendation purposes. Furthermore, none of these surveys systematically analyze the relationship between different data types and training paradigm choices in RSs. To the best of our knowledge, our survey is the first work that presents an up-to-date and comprehensive review of Language Modelling Paradigm Adaptations for Recommender Systems (LMRS)¹. The main contributions of this paper are summarized as follows: • We survey the current state of PLM-based recommendation from the perspectives of training strategy, learning objective and related data types, and provide the first systematic survey, to the best of our knowledge, of this nascent and rapidly developing field.
• We comprehensively review existing research works on adapting language modelling paradigms to recommendation tasks by systematically categorizing them from two perspectives: training strategies and learning objectives.

• We shed light on limitations and possible future research directions to help beginners and practitioners interested in this field learn more effectively with the shared integrated resources.

¹ It is worth noting that most existing literature reviews on pre-trained models focus on the architecture of large-scale language models (such as BERT, T5, UniLMv2, etc.), while our survey mainly discusses training paradigms, which are not limited to pre-trained language model architectures: the backbone can also be other neural networks, such as CNNs (Chen et al., 2023) and GCNs (Liu et al., 2023c).
Generic Architecture of LMRS

LMRS provides a new way to conquer the data sparsity problem via knowledge transfer from Pre-trained Models (PTMs). Figure 1 shows a high-level overview of LMRS, highlighting the data input, pre-training, fine-tuning/prompting and inference stages for various recommendation tasks.
In general, the types of input data objects can be relevant w.r.t. both the training and inference stages. After preprocessing the input into desired forms such as graphs, ordered sequences, or aligned text-image pairs, the training process takes in the preprocessed data and performs either a "pre-train, fine-tune" or a "pre-train, prompt" flow. If the inference is based solely upon the pre-trained model, it can be seen as an end-to-end approach leveraging LM-based learning objectives. The trained model can then be used to perform inference for different recommendation tasks.

Data Types
Encoding input data as embeddings is usually the first step in recommendation. However, the input for recommender systems is more diverse than for most NLP tasks, and therefore encoding techniques and processes may need to be adjusted to align with different input types. Textual data, as a powerful medium for spreading and transmitting knowledge, is commonly used as input for modelling user preferences. Examples of textual data include reviews, comments, summaries, news, conversations and code. Note that we also consider item metadata and user profiles a kind of textual data for simplicity. Sequential data, such as user-item interactions arranged strictly chronologically or in a specific order, is used as input for sequential and session-based recommender systems. Graphs, which usually contain different semantic information from other types of data input, such as user-user social graphs or heterogeneous knowledge graphs, are also commonly used to extract structural knowledge to improve recommendation performance. The diversity of online environments promotes the generation of massive multimedia content, which has been shown to improve recommendation performance in numerous research works. Therefore, multi-modal data such as images, videos and audio can also be important sources for LMRS. However, the utilization of multi-modal data in LMRS papers is scarce, possibly due to the absence of accessible datasets. A few scholars have gathered their own datasets to facilitate text-video-audio tri-modal music recommendations (Long et al., 2023) or to establish benchmarks for shopping scenarios (Long et al., 2023).

Training Strategies of LMRS
Given the significant impact that PLMs have had on NLP tasks under the pre-train and fine-tune paradigm, there has recently been a surge in adapting such paradigms to multiple recommendation tasks.

Pre-train, fine-tune paradigm for RS

The "pre-train, fine-tune" paradigm attracts increasing attention from researchers in the recommendation field due to several advantages: 1) pre-training provides a better model initialization, which usually leads to better generalization on different downstream recommendation tasks, improves recommendation performance from various perspectives, and speeds up convergence in the fine-tuning stage; 2) pre-training on a huge source corpus can capture universal knowledge which can benefit downstream recommenders; 3) pre-training can be regarded as a kind of regularization that avoids overfitting on low-resource, small datasets (Erhan et al., 2010; Hou et al., 2022; Yu et al., 2022; Wu et al., 2022a). For instance, to deal with the domain bias problem, that BERT induces a non-smooth anisotropic semantic space for general texts, resulting in a large language gap for texts from different domains of items, Hou et al.

Prompting paradigm for RSs
Instead of adapting PLMs to different downstream recommendation tasks by designing specific objective functions, a rising trend in recent years is to use the "pre-train, prompt, and inference" paradigm to reformulate downstream recommendations through hard/soft prompts. In this paradigm, fine-tuning can be avoided, and the pre-trained model itself can be directly employed to predict item ratings, generate top-k item ranking lists, make conversations, recommend similar libraries to programmers while coding, or even produce subtasks related to recommendation targets such as explanations (Li et al., 2023b). Prompt learning alleviates the problem of data constraints and bridges the gap in objective forms between pre-training and fine-tuning.
Fixed-PTM prompt tuning

Prompt tuning only requires tuning a small set of parameters for the prompts and labels, which is especially efficient for few-shot recommendation tasks. Despite the promising results achieved by constructing prompt information without significantly changing the structure and parameters of PTMs, this approach also requires choosing the most appropriate prompt template and verbalizer, which can greatly impact recommendation performance. Prompts can take the form of discrete textual templates (Penha and Hauff, 2020), which are more human-readable, or soft continuous vectors (Wang et al., 2022c; Wu et al., 2022b).
For instance, Penha and Hauff (2020) manually designed several prompt templates to test the performance of movie/book recommendations on a pre-trained BERT model with a similarity measure. Wu et al. (2022b) proposed a personalized prompt generator that is tuned to generate a soft prompt as a prefix to the user behaviour sequence for sequential recommendation.
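The soft-prompt idea can be sketched in a few lines: a small matrix of trainable prefix vectors is concatenated in front of the (frozen) embeddings of the behaviour sequence, and only the prefix receives gradient updates. The shapes and variable names below are illustrative assumptions, not taken from Wu et al. (2022b):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

d_model, prompt_len, seq_len = 64, 8, 20

# Trainable soft prompt: the only parameters updated during prompt tuning.
soft_prompt = rng.normal(size=(prompt_len, d_model))

# Frozen embeddings of the user's behaviour sequence (from the PTM's item table).
item_embeddings = rng.normal(size=(seq_len, d_model))

# The soft prompt is prepended as a prefix before the behaviour sequence.
model_input = np.concatenate([soft_prompt, item_embeddings], axis=0)
print(model_input.shape)  # (28, 64)
```

During training, gradients would flow back only into `soft_prompt`, which is why the number of tuned parameters stays tiny relative to the PTM itself.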
Fixed-prompt PTM tuning

Fixed-prompt PTM tuning tunes the parameters of PTMs similarly to the "pre-train, fine-tune" strategy, but additionally uses prompts with fixed parameters to steer the recommendation task.

Learning Objectives of LMRS
This section overviews several typical learning tasks and objectives of language models and their adaptations for different recommendation tasks.

Language modelling objectives to recommendation
The expensive manual effort required for annotated datasets has led many language learning objectives to adopt self-supervised labels, converting them into classic probabilistic density estimation problems. Among language modelling objectives, autoregressive, reconstruction, and auxiliary objectives are three commonly used categories (Liu et al., 2023b). Here, we only introduce the language modelling objectives used for RSs.
Partial/Auto-regressive Modelling (P/AM)

Given a text sequence $X = (x_1, \dots, x_T)$, the training objective of AM can be summarized as the joint negative log-likelihood of each variable given all previous variables:

$\mathcal{L}_{AM} = -\sum_{t=1}^{T} \log P(x_t \mid x_{<t})$

Modern LMRS typically utilize popular pre-trained left-to-right LMs such as GPT-2 (Hada and Shevade, 2021) and DialoGPT (Wang et al., 2022a,c) as the backbone for explainable and conversational recommendations, respectively, to avoid the laborious task of pre-training from scratch. While auto-regressive objectives can effectively model context dependency, the context can only be accessed from one direction, primarily left-to-right. To address this limitation, PAM is introduced, which extends AM by enabling the factorization step to be a span. For each input X, one factorization order M is sampled. One popular PTM that includes PAM as an objective is UniLMv2 (Bao et al., 2020). The pre-trained UniLMv2 model can be utilized to initialize the news embedding model for news recommendation (Yu et al., 2022).
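In code, the AM objective is just a sum of negative log-probabilities of each token given its prefix. The toy bigram conditional table below is fabricated purely to make the computation concrete:

```python
import math

# Toy conditional distribution P(next | previous); the values are made up.
bigram = {
    ("<s>", "a"): 0.5,
    ("a", "b"): 0.25,
    ("b", "</s>"): 0.8,
}

def am_loss(sequence):
    """Joint negative log-likelihood  -sum_t log P(x_t | x_{<t})
    under a left-to-right (here: bigram) model."""
    nll = 0.0
    prev = "<s>"
    for tok in sequence:
        nll -= math.log(bigram[(prev, tok)])
        prev = tok
    return nll

print(round(am_loss(["a", "b", "</s>"]), 4))
# -(log 0.5 + log 0.25 + log 0.8) ≈ 2.3026
```

A real LM replaces the lookup table with a neural network conditioned on the full prefix, but the loss accumulation is identical.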
Besides directly leveraging PTMs trained on textual inputs, some researchers apply this objective to inputs with sequential patterns, such as graphs (Geng et al., 2022b) and user-item interactions (Zheng et al., 2022). These patterns serve either as scoring functions to select suitable paths from the start node/user to the end node/item, or as detectors to explore novel user-item pairs.

Masked Language Modelling (MLM)

Taking a sequence of textual sentences as input, MLM first masks one or more tokens with a special token such as [MASK]. The model is then trained to predict the masked tokens, taking the rest of the tokens as context. The objective is as follows:

$\mathcal{L}_{MLM} = -\sum_{x' \in M(X)} \log P(x' \mid X_{\setminus M(X)})$

where $M(X)$ and $X_{\setminus M(X)}$ represent the masked tokens in the input sequence X and the rest of the tokens in X, respectively. A typical example of the MLM training strategy can be found in BERT, which is leveraged as the backbone in (Zhang et al., 2021a).

Next Sentence Prediction (NSP)

The NSP objective is a binary classification loss $-\log P(c \mid x, y)$, where x and y represent two segments from the input corpus, and c = 1 if x and y are consecutive, otherwise c = 0. NSP involves reasoning about the relationships between pairs of sentences and can be utilized for better representation learning of textual items such as news articles, item descriptions, and conversational data for recommendation purposes. Moreover, it can be employed to model the close relationship between two components. Malkiel et al.
(2020) used NSP to capture the relationship between the title and description of an item for next-item prediction. Furthermore, models pre-trained with NSP (such as BERT) can be leveraged to probe the learned knowledge with prompts, which is then infused in the fine-tuning stage to improve model training on adversarial data for conversational recommendation (Penha and Hauff, 2020).

Sentence Order Prediction (SOP)

SOP, a variation of NSP, takes two consecutive segments from the same document as positive examples, which are then swapped in order to form negative examples. SOP has been used to learn the inner coherence of title, description, and code for tag recommendation on StackOverflow (He et al., 2022).
Replaced Token Detection (RTD)

RTD trains the model to distinguish original tokens from plausibly replaced ones:

$\mathcal{L}_{RTD} = -\sum_{t=1}^{T} \log P(y_t \mid \tilde{X})$

where $y_t = \mathbb{1}(\tilde{x}_t = x_t)$, and $\tilde{X}$ is corrupted from the input sequence X. de Souza Pereira Moreira et al. (2021) trained a Transformer-based model with the RTD objective for session-based recommendations, which achieved the best performance among MLM and AM objectives. This is probably because RTD takes the whole user-item interaction sequence as input and models the context bidirectionally.
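Applied to item-ID sequences, the MLM and RTD recipes reduce to a few lines. The sketch below is our own minimal rendition (not code from any cited paper): it builds a masked MLM input and the binary RTD labels y_t = 1(x̃_t = x_t):

```python
MASK = "[MASK]"

def mask_items(sequence, positions):
    """MLM-style corruption: hide the items at the given positions."""
    return [MASK if i in positions else x for i, x in enumerate(sequence)]

def rtd_labels(original, corrupted):
    """RTD targets: 1 where the corrupted token still equals the original."""
    return [int(a == b) for a, b in zip(original, corrupted)]

history = ["item_3", "item_7", "item_1", "item_9"]

print(mask_items(history, positions={1, 3}))
# ['item_3', '[MASK]', 'item_1', '[MASK]']

# In RTD, a small generator would propose the replacements; here we hard-code one.
corrupted = ["item_3", "item_2", "item_1", "item_9"]
print(rtd_labels(history, corrupted))
# [1, 0, 1, 1]
```

The MLM model predicts the hidden items from the surrounding context, while the RTD discriminator classifies every position, which is one reason RTD yields a denser training signal per sequence.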

Adaptive objectives to recommendation
Numerous pre-training or fine-tuning objectives draw inspiration from LM objectives and have been effectively applied to specific downstream tasks depending on the input data types and recommendation goals. In sequential recommendation, there is a common interest in modelling an ordered input sequence in an auto-regressive manner from left to right.
Analogous to text sentences, Zheng et al. (2022) and Xiao et al. (2022) treated the user's clicked-news history as input text and proposed to model user behaviour in an auto-regressive manner for next-click prediction. However, as the sequential dependency may not always hold strictly in terms of user preference (Yuan et al., 2020a), MLM objectives can be modified accordingly. Yuan et al. (2020b) randomly masked a certain percentage of historical user records and predicted the masked items during training. Auto-regressive learning tasks can also be adapted to other types of data. Geng et al. (2022b) modelled a series of paths sampled from a knowledge graph in an auto-regressive manner for recommendation by generating the end node from the pre-trained model. Zhao (2022) proposed pre-training a Rearrange Sequence Prediction task to learn sequence-level information from the user's entire interaction history by predicting whether the interaction history has been rearranged, which is similar to Permuted Language Modelling (PerLM) (Yang et al., 2019).
MLM, also known as Cloze prediction, can be adapted to learn graph representations for different recommendation purposes. Wang et al. (2023a) proposed pre-training a transformer model on a reconstructed subgraph from a user-item-attribute heterogeneous graph, using Masked Node Prediction (MNP), Masked Edge Prediction (MEP), and meta-path type prediction as objectives. Specifically, MNP was performed by randomly masking a proportion of nodes in a heterogeneous subgraph and then predicting the masked nodes based on the remaining context while maximizing the distance between the masked node and irrelevant nodes. Similarly, MEP was used to recover the masked edge between two adjacent nodes based on the surrounding context. Apart from that, MLM can also be adapted to multi-modal data as Masked Multimodal Modelling (MMM) (Wu et al., 2022a). MMM was performed by predicting the semantics of masked news and news image regions given the unmasked inputs, and by indicating whether a news image and a news content segment correspond to each other, for news recommendation purposes.
The NSP/SOP objectives can be adapted for CTR prediction as Next K Behaviors Prediction (NBP). NBP was proposed to learn user representations in the pre-training stage by inferring whether a candidate behaviour is the next i-th behaviour of the target user based on their past N behaviours. NBP can also capture the relatedness between past and multiple future behaviours.
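A minimal sketch of how NBP-style training examples could be assembled from a behaviour log; the function name and the exact windowing scheme are our own assumptions, and the sampling used in the original work may differ:

```python
def nbp_examples(behaviors, n_past, k_future):
    """Build (past, candidate, i) tuples from a behaviour log: label i
    means the candidate is the i-th behaviour after the past window
    (1-indexed), mirroring the 'next i-th behaviour' question in NBP."""
    past = behaviors[:n_past]
    future = behaviors[n_past:n_past + k_future]
    return [(past, cand, i + 1) for i, cand in enumerate(future)]

log = ["view_a", "click_b", "buy_c", "view_d", "click_e"]
for past, cand, i in nbp_examples(log, n_past=3, k_future=2):
    print(cand, i)
# view_d 1
# click_e 2
```

Negative examples (candidates that are not among the next K behaviours) would be sampled analogously to NSP's non-consecutive segment pairs.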

Formulating Training with Data Types
To associate training strategies and learning objectives with different input data types, we summarize representative works in this domain in Table 1. The listed training strategies and objectives are carefully selected and are typical of existing work. Due to the page limit, we only selected part of the recent research on LMRS. For more research progress and related resources, please refer to https://github.com/SmartmediaAI/LMRS.
Considering that datasets are another important factor in the empirical analysis of LMRS approaches, in Table 2 we also list several representative publicly available datasets, taking into account the popularity of data usage and the diversity of data types, as well as their corresponding recommendation tasks, training strategies, and adopted data types. From Table 2, we draw several observations. First, datasets can be converted into different data types, which can then be analyzed from various perspectives to enhance downstream recommendations. The integration of different data types can also serve different recommendation goals more effectively (Geng et al., 2022c; Liu et al., 2021). For instance, Liu et al. (2021) transformed user-item interactions and multi-modal item side information into a homogeneous item graph. A sampling approach was introduced to select and prioritize neighbouring nodes around a central node, effectively translating the graph data structure into a sequential format. The subsequent training employed a self-supervised signal within a transformer framework, utilizing an objective for reconstructing masked node features. The resultant pre-trained node embeddings could be readily applied for recommendation purposes or, alternatively, fine-tuned to cater to specific downstream objectives. Second, some training strategies can be applied to multiple downstream tasks by fine-tuning a few parameters of the pre-trained model, adding an extra component, or using different prompts. Geng et al. (2022c) designed different prompt templates for five different tasks to train a transformer-based model with a single objective, and achieved improvements on multiple tasks with zero-shot prompting. Deng et al. (2023) unified the multiple goals of conversational recommenders into a single sequence-to-sequence task with textual input, and designed various prompts to switch among different tasks. We further observe that prompting methods are primarily used in LMRS with textual and sequential data types, while there has been a lack of exploration for multi-modal or graph data. This suggests that investigating additional data types may be a future direction for research in prompting-based LMRS.

Evaluation metrics
As an essential aspect of recommendation design, evaluation can provide insights into recommendation quality from multiple dimensions. Apart from well-known offline metrics such as RMSE, MAP, AUC, MAE, Recall, Precision, MRR, NDCG, F1-score and HitRate, some works define Group AUC (Zhang et al., 2022) or User Group AUC (Zheng et al., 2022) to evaluate the utility of group recommendations. Jiang et al. (2022) and Liu et al. (2022) conducted A/B testing to evaluate performance with online users using conversion rate or CTR.
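For reference, the most common ranking metrics above have compact definitions. The sketch below scores a single ranked list against a set of relevant items, using binary relevance:

```python
import math

def hit_rate_at_k(ranked, relevant, k):
    """1 if any relevant item appears in the top-k, else 0."""
    return int(any(item in relevant for item in ranked[:k]))

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the list divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0

def mrr(ranked, relevant):
    """Reciprocal rank of the first relevant item (0 if none appears)."""
    for i, item in enumerate(ranked):
        if item in relevant:
            return 1.0 / (i + 1)
    return 0.0

ranked, relevant = ["b", "a", "d", "c"], {"a"}
print(hit_rate_at_k(ranked, relevant, 2))        # 1
print(round(ndcg_at_k(ranked, relevant, 2), 4))  # 0.6309
print(mrr(ranked, relevant))                     # 0.5
```

In benchmark reporting, these per-user scores are averaged over the test set; papers vary in candidate sampling and cutoff k, which is one source of the cross-paper fluctuations discussed later.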
The integration of generative modules such as GPT and T5 into existing recommender systems offers additional possibilities, such as generating free-form textual explanations for recommendation results or simulating more realistic real-life dialogue scenarios during conversational recommendation to enhance users' experience. In such cases, BLEU and ROUGE are commonly adopted to automatically evaluate the relevance of generated text based on lexical overlap. Besides, Perplexity (PPL), Distinct-n, and Unique Sentence Ratio (USR) are also widely used metrics to measure the fluency, diversity, and informativeness of generated texts. Other evaluation metrics are leveraged for special requirements in LMRSs. For instance, Xie et al. (2023) adopted Entailment Ratio and MAUVE to measure whether the generated explanations are factually correct and how close the generated contents are to the ground-truth corpus, respectively. Geng et al. (2022a) adopted Feature Diversity (DIV) and CLIPScore (CS) to measure the generated explanations and text-image alignment. Besides, to assess the system's capability to provide item recommendations during conversations, Wang et al. (2022a) computed the Item Ratio within the final generated responses. They evaluated the recommendation performance in an end-to-end manner to prevent the inappropriate insertion of recommended items into dialogues.
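Distinct-n, for instance, is simply the ratio of unique n-grams to total n-grams across the generated texts; a minimal sketch (tokenizing by whitespace, which real evaluations may refine):

```python
def distinct_n(texts, n):
    """Fraction of n-grams (over whitespace tokens) that are unique across
    all generated texts; higher values indicate more diverse output."""
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

replies = ["the hotel is very nice", "the hotel is very clean"]
print(distinct_n(replies, 1))  # 6 unique / 10 total = 0.6
print(distinct_n(replies, 2))  # 5 unique / 8 total = 0.625
```

Low Distinct-n scores are exactly how the repetitive "safe" responses discussed in the future-directions section show up in automatic evaluation.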
Human evaluation complements objective evaluation, as automatic metrics may not match subjective feedback from users. Liu et al. (2023a) pointed out that human subjective and automatic objective evaluation measurements may yield opposite results, which underscores the limitations of existing automatic metrics for evaluating generated explanations and dialogues in LMRSs. Figure 3 displays usage frequency statistics for different evaluation metrics in their respective tasks.

Discussion on evaluation across datasets
In this section, we compare the results obtained from various models on commonly used datasets. Specifically, based on the results reported in each paper, we measured the improvement achieved by different models over a shared baseline, evaluated with the same metrics on the same dataset. The comparisons are presented in Tables 3∼6. The best improvements are highlighted in bold; N@k denotes NDCG@k and H@k denotes HitRate@k. It is important to recognize that a comprehensive and precise assessment cannot be achieved without a carefully designed platform and thoughtful experimental settings. Various factors, such as diverse training platforms, parameter settings, and data split strategies, can lead to fluctuations in the results. Hence, this analysis should be considered for reference purposes only. From the tables, we can observe the following. First, among the four conversational recommender systems assessed on the ReDial dataset, the fixed-prompt PTM tuning paradigm of Yang et al. (2022a) demonstrates the most significant improvements over the shared baselines. Second, on the Amazon dataset, zero-shot and few-shot learning with ChatGPT underperformed the supervised recommendation baselines (Liu et al., 2023a). This could be due to language models' strength in capturing language patterns rather than effectively collaborating to suggest similar items based on user preferences (Zhang et al., 2021b). Besides, Liu et al. (2023a) pointed out that the position of candidate items in the item pool can also affect direct recommendation performance. Another prompting-based model, P5, showed the largest improvements on both the Amazon and Yelp datasets (Geng et al., 2022c), which verifies the need for more guidance when using large pre-trained language models for recommendations. Finally, for news recommendation on the MIND dataset, Xiao et al.
(2022)

Language bias and fact-consistency in language generation tasks of recommendation. While generating free-form responses in conversational recommender systems or explanations of recommended results, the generative components of existing LMRSs tend to predict generic tokens to ensure sentence fluency, or to repeat certain universally applicable "safe" sentences (e.g. "the hotel is very nice", generated by PETER (Li et al., 2021)). Therefore, one future research direction is to enhance the diversity and pertinence of generated explanations and replies while maintaining language fluency, rather than resorting to evasive "Tai Chi" responses. Additionally, generating factually consistent sentences is an urgent research problem that needs to be addressed but has not received sufficient attention (Xie et al., 2023).
Knowledge transmission and injection for downstream recommendations. Improper training strategies may cause varying degrees of problems when transferring knowledge from pre-trained models. Zhang et al. (2022) have pointed out the catastrophic forgetting problem in continuously trained industrial recommender systems. The degree of domain knowledge that pre-trained models possess, and effective ways to transfer and inject it for recommendation purposes, are both open questions. For example, Zhang et al. (2021b) experimented with a simple approach to injecting knowledge through domain-adaptive pre-training, resulting in only limited improvements. Furthermore, questions about maximizing knowledge transfer to different recommendation tasks, quantifying the degree of transferred knowledge, and whether an upper bound for knowledge transfer exists are all valuable issues that need to be studied and explored by the AI community.
Scalability of the pre-training mechanism in recommendation. As model parameters grow larger and larger, the knowledge stored in models is also increasing. Despite the great success of pre-trained models in multiple recommendation tasks, how to maintain and update such complex and large-scale models without affecting the efficiency and accuracy of recommendations in practice needs more attention. Some works have proposed improving model-updating efficiency by fine-tuning part of a pre-trained model or an extra component with far fewer parameters than the model itself. However, Yuan et al. (2020b) empirically found that fine-tuning only the output layer often resulted in poor performance in recommendation scenarios. While properly fine-tuning the last few layers sometimes offered promising performance, the improvements were quite unstable and depended on the pre-trained model and tasks. Yu et al. (2022) proposed compressing large pre-trained language models into student models to improve recommendation efficiency, while Yang et al. (2022b) focused on accelerating the fine-tuning of pre-trained language models and reducing GPU memory footprint for news recommendation by accumulating the gradients of redundant item encodings. Despite all these achievements, further efforts are still needed in this rapidly developing field.

Balancing multiple objectives in pre-training. Many research works use multi-task learning objectives to better apply the knowledge learned in the pre-training phase to downstream tasks (Geng et al., 2022c; Wang et al., 2023a). The primary objective of multi-task learning for recommendation is to enhance recommendation accuracy and/or other related aspects by promoting interactions among related tasks. The learning optimization process requires trade-offs among different objectives. For instance, Wang et al. (2023b) fine-tuned parameters to optimize and balance the overarching goals of topic-level recommendation, semantic-level recommendation, and a specific aspect of topic learning. Similarly, in (Wang et al., 2022b), the authors employed a learned parameter to achieve a balance between the conversation generation objective and the quotation recommendation objective. Yang et al. (2022a) proposed a conversational recommendation framework that contains a generation module and a recommendation module; the overall objectives were designed to balance these two modules with a parameter learned through a fine-tuning process. However, improper optimization can lead to other problems: as pointed out by Deng et al. (2023), "error propagation" may occur when solving multiple tasks in sequential order, leading to a decrease in performance with the sequential completion of each task. Although some potential solutions to this issue (Deng et al., 2023; Li et al., 2022; Geng et al., 2022a) have been suggested, further verification is still needed.

Multiple Choices of PLM as Recommendation Bases. With the advances in various PLMs, including ChatGPT, and their success in various downstream tasks, researchers have started exploring the potential of ChatGPT in conversational recommendation tasks. For example, Liu et al. (2023a) and Gao et al.
(2023) have investigated the ability of GPT-3/GPT-3.5-based ChatGPT in zero-shot scenarios, using human-designed prompts to assess its performance in rating prediction, sequential recommendation, direct recommendation, and explanation generation. However, these studies are only initial explorations, and more extensive research is required on different recommendation tasks based on various pre-trained language models, including prompt design and performance evaluation in diverse domains. Moreover, recent LMRS studies have yet to explore instruction tuning, which could be a promising direction for future research.

Privacy issue. The study conducted by Yuan et al. (2020b) revealed that pre-trained models can infer user profiles (such as gender, age, and marital status) from learned user representations, which raises concerns about privacy protection. The pre-training process is often performed on large-scale web-crawled corpora without fine-grained filtering, which may expose users' sensitive information. Therefore, developing LMRSs that strike a balance between privacy and high-performance recommendation remains an open issue.

Figure 1: A generic architecture of the language modelling paradigm for recommendation purposes.
Regarding training paradigms, there are mainly two classes: the pre-train, fine-tune paradigm and the prompt learning paradigm. Each class is further divided into subclasses according to the training effort spent on different parts of the recommendation model. This section goes through the various training strategies w.r.t. specific recommendation purposes. Figure 2(a) presents statistics of recent LMRS publications grouped by training strategy, together with the total number of published research works each year. Figure 2(b) shows the taxonomy and some corresponding representative LMRSs.

Figure 2: LMRS structure with representatives, and statistics on different training strategies and the total number of publications per year.
(2022) applied a linear transformation layer to transform BERT representations of items from different domains, followed by an adaptive combination strategy to derive a universal item representation. Meanwhile, considering the seesaw phenomenon, i.e., that learning from multiple domain-specific behavioural patterns can conflict, they proposed sequence-item and sequence-sequence contrastive tasks for multi-task learning during the pre-training stage. They found that fine-tuning only a small proportion of model parameters could quickly adapt the model to unseen domains with cold-start or new items.
Pre-train, fine-tune extra part of the model. With the increase in the depth of PTMs, the representations they capture make downstream recommendation easier. Apart from the two aforementioned fine-tuning strategies, some works place a task-specific layer on top of the PTM for the recommendation task; fine-tuning then only updates this extra part by optimizing the parameters of the task-specific layer. Shang et al. (2019) pre-trained a GPT and a BERT model to learn patient visit embeddings, which were then used as input to fine-tune an extra prediction layer for medication recommendation. Another approach is to use the PTM to initialize a new model with a similar architecture in the fine-tuning stage, and the fine-tuned model is used for recommendation. In Zhou et al. (2020), a bidirectional Transformer-based model was first pre-trained on four self-supervised learning objectives (associated attribute prediction, masked item prediction, masked attribute prediction and segment prediction) to learn item embeddings. The learned parameters were then adopted to initialize a unidirectional Transformer-based model, which was fine-tuned with a pairwise ranking loss for recommendation. In (McKee et al., 2023), the authors leveraged the pre-trained BLOOM-176B to generate natural language descriptions of music given a set of music tags. Subsequently, two distinct pre-trained models, namely CLIP and the D2T pipeline, were employed to initialize textual, video, and audio representations of the given music content. Finally, a transformer-based architecture was fine-tuned for multimodal music recommendation.
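A minimal sketch of the "fine-tune only an extra part" strategy described above: a frozen encoder produces fixed embeddings, and only a small prediction head on top of it is updated. The toy encoder, the head, and all hyperparameters are illustrative stand-ins under stated assumptions, not the actual models from Shang et al. (2019).

```python
import math
import random

EMB_DIM = 4

def frozen_encoder(tokens):
    """Stand-in for a frozen pre-trained encoder (e.g. BERT over patient
    visits): its parameters are never updated during fine-tuning; it only
    produces a fixed embedding for the input tokens."""
    emb = [0.0] * EMB_DIM
    for tok in tokens:
        emb[sum(ord(c) for c in tok) % EMB_DIM] += 1.0  # toy deterministic hashing
    return emb

class PredictionHead:
    """The only trainable component: a logistic-regression layer stacked
    on top of the frozen encoder's output."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.w = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
        self.b = 0.0

    def forward(self, x):
        z = sum(wi * xi for wi, xi in zip(self.w, x)) + self.b
        return 1.0 / (1.0 + math.exp(-z))  # probability of recommending

    def sgd_step(self, x, y, lr=0.5):
        # Binary cross-entropy gradient; only the head's parameters move.
        g = self.forward(x) - y
        self.w = [wi - lr * g * xi for wi, xi in zip(self.w, x)]
        self.b -= lr * g

# Fine-tune the head only: the encoder output for a visit stays fixed.
visit = frozen_encoder(["metformin", "insulin"])  # hypothetical visit tokens
head = PredictionHead(EMB_DIM)
before = head.forward(visit)
for _ in range(20):
    head.sgd_step(visit, 1.0)  # positive label: medication is relevant
print(before, head.forward(visit))
```

Because gradients never flow into `frozen_encoder`, training cost scales with the tiny head rather than the full PTM, which is the appeal of this strategy.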

Figure 3: The statistics of evaluation metrics on recommendation utility and generated text quality in LMRSs.
Yang et al. (2022a) designed prompts to shift/lead the conversations across various tasks. Deng et al. (2023) concatenated input sequences with specially designed prompts, such as [goal], [topic], [item], and [system], to indicate different tasks: goal planning, topic prediction, item recommendation, and response generation in conversations. The model is trained with a multi-task learning scheme, and the parameters of the PTM are optimized under the same objective. Yang et al. (2022a) designed a [REC] token as a prompt to indicate the start of the recommendation process and to summarize the dialogue context for conversational recommendation.
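The prompt-concatenation scheme above can be illustrated with a small helper that joins task-indicator tokens with the dialogue context. The exact token layout and field names here are assumptions for illustration, not the authors' actual input format.

```python
def build_prompt(goal, topic, item, history):
    """Concatenate task-indicator tokens with their contents, mirroring the
    [goal]/[topic]/[item]/[system] scheme described above. The resulting
    string would be tokenized and fed to the PTM, whose parameters are
    optimized under a shared multi-task objective."""
    parts = [
        "[goal]", goal,        # goal planning
        "[topic]", topic,      # topic prediction
        "[item]", item,        # item recommendation
        "[system]", history,   # response generation context
    ]
    return " ".join(parts)

prompt = build_prompt("recommendation", "sci-fi movies",
                      "Interstellar", "User: any movie tips?")
print(prompt)
```

In practice the indicator tokens would be registered as special tokens in the PTM's vocabulary so they are never split by the subword tokenizer.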
Several works (Xin et al., 2022; Geng et al., 2022c) designed discrete prompts to specify recommendation items. For instance, Zhang et al. (2021b) designed the prompt "A user watched item A, item B, and item C. Now the user may want to watch ()" to reformulate recommendation as a multi-token cloze task during fine-tuning of the LM-based PTM. The prompts can also be one or several tokens.
Tuning-free prompting. This training strategy can be referred to as zero-shot recommendation, which directly generates recommendations and/or related subtasks based only on the input prompts, without changing the parameters of the PTMs. Zero-shot recommendation has been shown to be effective in dealing with new users/items in single-domain or cross-domain settings (Sileo et al., 2022; Geng et al., 2022c), compared to state-of-the-art baselines. Specifically, Geng et al. (2022c) learned multiple tasks, such as sequential recommendation, rating prediction, and explanation generation. In particular, they utilized the pre-trained CLIP component to convert images into image tokens, which were added to the textual tokens of an item to create a personalized multimodal soft prompt. This combination led to improvements in the readability and fluency of generated explanations using the proposed prompts. Note that the Prompt+PTM tuning stage is not necessarily the fine-tuning stage, but can be any stage at which parameters on both sides are tuned for a specific data input. Xin et al. (2022) adapted a reinforcement learning framework as a Prompt+PTM tuning strategy by learning reward-state pairs as soft prompt encodings w.r.t. observed actions during training. At the inference stage, the trained prompt generator can directly generate soft prompt embeddings for the recommendation model to generate actions (items).
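The cloze reformulation quoted above can be built programmatically from a user's interaction history; this helper is a hypothetical illustration of the template, not Zhang et al.'s implementation.

```python
def cloze_prompt(watched_items):
    """Reformulate next-item recommendation as a cloze task, following the
    template quoted above. The LM fills the () blank with the tokens of the
    predicted item, turning recommendation into multi-token mask filling."""
    listed = ", ".join(watched_items[:-1]) + ", and " + watched_items[-1]
    return f"A user watched {listed}. Now the user may want to watch ()"

p = cloze_prompt(["item A", "item B", "item C"])
print(p)
```

The same template can serve tuning-free (zero-shot) prompting: the prompt is simply handed to a frozen PTM, and the generated tokens are matched back to the item catalogue.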

Table 1: A list of representative LMRS methods with open-source code.

Table 3: LMRS performance comparison using common benchmarks on the ReDial dataset.

Table 4: LMRS performance comparison using common benchmarks on the Amazon Beauty dataset.

Table 5: LMRS performance comparison using common benchmarks on the Yelp dataset.

Table 6: LMRS performance comparison using common benchmarks on the MIND dataset.