In-Context Retrieval-Augmented Language Models

Abstract Retrieval-Augmented Language Modeling (RALM) methods, which condition a language model (LM) on relevant documents from a grounding corpus during generation, were shown to significantly improve language modeling performance. In addition, they can mitigate the problem of factually inaccurate text generation and provide a natural source attribution mechanism. Existing RALM approaches focus on modifying the LM architecture in order to facilitate the incorporation of external information, significantly complicating deployment. This paper considers a simple alternative, which we dub In-Context RALM: leaving the LM architecture unchanged and prepending grounding documents to the input, without any further training of the LM. We show that In-Context RALM that builds on off-the-shelf general purpose retrievers provides surprisingly large LM gains across model sizes and diverse corpora. We also demonstrate that the document retrieval and ranking mechanism can be specialized to the RALM setting to further boost performance. We conclude that In-Context RALM has considerable potential to increase the prevalence of LM grounding, particularly in settings where a pretrained LM must be used without modification or even via API access.


Introduction
Recent advances in language modeling (LM) have dramatically increased the usefulness of machine-generated text across a wide range of use-cases and domains (Brown et al., 2020). However, the mainstream paradigm of generating text with LMs bears inherent limitations in access to external knowledge. First, LMs are not coupled with any source attribution, and must be trained in order to incorporate up-to-date information that was not seen during training. More importantly, they tend to produce factual inaccuracies and errors (Lin et al., 2022; Maynez et al., 2020; Huang et al., 2020). This problem is present in any LM generation scenario, and is exacerbated when generation is made in uncommon domains or private data. A promising approach for addressing the above is Retrieval-Augmented Language Modeling (RALM), grounding the LM during generation by conditioning on relevant documents retrieved from an external knowledge source. RALM systems include two high-level components: (i) document selection, selecting the set of documents upon which to condition; and (ii) document reading, determining how to incorporate the selected documents into the LM generation process.

Figure 1: Our framework, dubbed In-Context RALM, provides large language modeling gains on the test set of WikiText-103, without modifying the LM. Adapting the use of a BM25 retriever (Robertson and Zaragoza, 2009) to the LM task (§5) yields significant gains, and choosing the grounding documents via our new class of Predictive Rerankers (§6) provides a further boost. See Table 1 for the full results on five diverse corpora.

[Figure 2 illustration: the retriever returns the document "FIFA World Cup 2026 will expand to 48 teams.", which is placed before the prefix "World Cup 2022 was the last with 32 teams, before the increase to"; the language model completes it with "48 in the 2026 tournament."]

Existing RALM approaches tend to be focused on altering the language model architecture (Khandelwal et al., 2020; Borgeaud et al., 2022; Zhong et al., 2022; Levine et al., 2022c; Li et al., 2022). Notably, Borgeaud et al. (2022) introduced RETRO, featuring document reading via nontrivial modifications to the LM architecture that require further training, while using an off-the-shelf frozen BERT retriever for document selection. Although the paper's experimental findings showed impressive performance gains, the need for changes in architecture and dedicated retraining has hindered the wide adoption of such models.
In this paper, we show that a very simple document reading mechanism can have a large impact, and that substantial gains can also be made by adapting the document selection mechanism to the task of language modeling. Thus, we show that many of the benefits of RALM can be achieved while working with off-the-shelf LMs, even via API access. Specifically, we consider a simple but powerful RALM framework, dubbed In-Context RALM (presented in Section 3), which employs a zero-effort document reading mechanism: we simply prepend the selected documents to the LM's input text (Figure 2).
Section 4 describes our experimental setup. To show the wide applicability of our framework, we performed LM experiments on a suite of five diverse corpora: WikiText-103 (Merity et al., 2016), RealNews (Zellers et al., 2019), and three datasets from The Pile (Gao et al., 2021): ArXiv, Stack Exchange and FreeLaw. We use open-source LMs ranging from 110M to 66B parameters (from the GPT-2, GPT-Neo, OPT and LLaMA model families).
In Section 5 we evaluate the application of off-the-shelf retrievers to our framework. In this minimal-effort setting, we found that In-Context RALM led to LM performance gains equivalent to increasing the LM's number of parameters by 2-3× across all of the text corpora we examined. In Section 6 we investigate methods for adapting document ranking to the LM task, a relatively under-explored RALM degree of freedom. Our adaptation methods range from using a small LM to perform zero-shot ranking of the retrieved documents, up to training a dedicated bidirectional reranker by employing self-supervision from the LM signal. These methods lead to further gains in the LM task corresponding to an additional size increase of 2× in the LM architecture. As a concrete example of the gains, a 345M parameter GPT-2 enhanced by In-Context RALM outperforms a 762M parameter GPT-2 when employing an off-the-shelf BM25 retriever (Robertson and Zaragoza, 2009), and outperforms a 1.5B parameter GPT-2 when employing our trained LM-oriented reranker (see Figure 1). For large model sizes, our method is even more effective: In-Context RALM with an off-the-shelf retriever improved the performance of a 6.7B parameter OPT model to match that of a 66B parameter OPT model (see Figure 4).
In Section 7 we demonstrate the applicability of In-Context RALM to downstream open-domain question answering (ODQA) tasks.
In a concurrent work, Shi et al. (2023) also suggest augmenting off-the-shelf LMs with retrieved texts by prepending them to the input. Their results are based on training a dedicated retriever for language modeling. In contrast, we focus on the gains achievable in using off-the-shelf retrievers for this task. We show strong gains of this simpler setting by investigating: (1) which off-the-shelf retriever is best suited for language modeling, (2) the frequency of retrieval operations, and (3) the optimal query length. In addition, we boost the off-the-shelf retrieval performance by introducing two reranking methods that demonstrate further gains in perplexity.
We believe that In-Context RALM can play two important roles in making RALM systems more powerful and more prevalent. First, given its simple reading mechanism, In-Context RALM can serve as a clean probe for developing document retrieval methods that are specialized for the LM task. These in turn can be used to improve both In-Context RALM and other more elaborate RALM methods that currently leverage general purpose retrievers. Second, due to its compatibility with off-the-shelf LMs, In-Context RALM can help drive wider deployment of RALM systems.

Related Work
RALM approaches can be roughly divided into two families of models: (i) nearest-neighbor language models (also called kNN-LM), and (ii) retrieve and read models. Our work belongs to the second family, but is distinct in that it involves no further training of the LM.
Nearest Neighbor Language Models The kNN-LM approach was first introduced in Khandelwal et al. (2020). The authors suggest a simple inference-time model that interpolates between two next-token distributions: one induced by the LM itself, and one induced by the k neighbors from the retrieval corpus that are closest to the query token in the LM embedding space. Zhong et al. (2022) suggest a framework for training these models. While they showed significant gains from kNN-LM, the approach requires storing the representations for each token in the corpus, an expensive requirement even for a small corpus like Wikipedia. Although numerous approaches have been suggested for alleviating this issue (He et al., 2021; Alon et al., 2022), scaling any of them to large corpora remains an open challenge.
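To make the interpolation step concrete, the following is a minimal sketch rather than the kNN-LM implementation. It assumes both next-token distributions are already available as dictionaries, and the function name `interpolate` and the mixing weight `lam` are hypothetical names introduced here for illustration.

```python
def interpolate(p_lm, p_knn, lam):
    """Mix two next-token distributions: lam * p_knn + (1 - lam) * p_lm.

    p_lm and p_knn map tokens to probabilities; lam is a hypothetical
    mixing weight in [0, 1] (an assumption of this sketch).
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    vocab = set(p_lm) | set(p_knn)
    return {
        tok: lam * p_knn.get(tok, 0.0) + (1.0 - lam) * p_lm.get(tok, 0.0)
        for tok in vocab
    }


# Toy distributions: the neighbors put all their mass on "teams".
p_lm = {"teams": 0.6, "players": 0.4}
p_knn = {"teams": 1.0}
mixed = interpolate(p_lm, p_knn, lam=0.25)
```

Because both inputs sum to one, the mixture does as well; the storage cost discussed above comes from building the neighbor distribution, which this sketch takes as given.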
Retrieve and Read Models This family of RALMs creates a clear division between document selection and document reading components. All prior work involves training the LM. We begin by describing works that use this approach for tackling downstream tasks, and then mention works oriented towards RALM. Lewis et al. (2020) and Izacard and Grave (2021) fine-tuned encoder-decoder architectures for downstream knowledge-intensive tasks. Izacard et al. (2022b) explored different ways of pretraining such models, while Levine et al. (2022c) pretrained an autoregressive LM on clusters of nearest neighbors in sentence embedding space. Levine et al. (2022a) showed competitive open-domain question answering performance by prompt-tuning a frozen LM as a reader. Guu et al. (2020) pretrained REALM, a retrieval-augmented bidirectional, masked LM, later fine-tuned for open-domain question answering. The work closest to this paper, with a focus on the language modeling task, is RETRO (Borgeaud et al., 2022), which modifies an autoregressive LM to attend to relevant documents via chunked cross-attention, thus introducing new parameters to the model. Our In-Context RALM differs from prior work in this family of models in two key aspects:
• We use off-the-shelf LMs for document reading without any further training of the LM.
• We focus on how to choose documents for improved LM performance.
3 Our Framework

In-Context RALM
Language models define probability distributions over sequences of tokens. Given such a sequence $x_1, \ldots, x_n$, the standard way to model its probability is via next-token prediction: $p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_{<i})$, where $x_{<i} := x_1, \ldots, x_{i-1}$ is the sequence of tokens preceding $x_i$, also referred to as its prefix. This autoregressive model is usually implemented via a learned transformer network (Vaswani et al., 2017) parameterized by the set of parameters θ:

$$p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p_\theta(x_i \mid x_{<i}), \tag{1}$$

where the conditional probabilities are modeled by employing a causal self-attention mask (Radford et al., 2018). Notably, leading LMs such as GPT-2 (Radford et al., 2019), GPT-3 (Brown et al., 2020), OPT (Zhang et al., 2022) or Jurassic-1 (Lieber et al., 2021) follow this simple parameterization.
Retrieval augmented language models (RALMs) add an operation that retrieves one or more documents from an external corpus C, and condition the above LM predictions on these documents. Specifically, for predicting $x_i$, the retrieval operation from C depends on its prefix: $R_C(x_{<i})$, so the most general RALM decomposition is:

$$p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p\left(x_i \mid x_{<i}, R_C(x_{<i})\right).$$

In order to condition the LM generation on the retrieved document, previous RALM approaches used specialized architectures or algorithms (see §2). Inspired by the success of In-Context Learning (Brown et al., 2020; Dong et al., 2023), In-Context RALM refers to the following specific, simple method of concatenating the retrieved documents within the Transformer's input prior to the prefix (see Figure 2), which does not involve altering the LM weights θ:

$$p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p_\theta\left(x_i \mid \left[R_C(x_{<i});\, x_{<i}\right]\right), \tag{2}$$

where $[a; b]$ denotes the concatenation of strings a and b.
Since common Transformer-based LM implementations support limited-length input sequences, when the concatenation of the document and the input sequence exceeds this limit we remove tokens from the beginning of x until the overall input length equals that allowed by the model. Because our retrieved documents are passages of limited length, we always have enough context left from x (see §4.3).
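The prepend-and-truncate rule can be sketched in a few lines. This is an illustrative sketch over lists of token ids, not the released code; the helper name `build_input` and its arguments are hypothetical.

```python
def build_input(doc_tokens, prefix_tokens, max_len):
    """Return [doc; prefix], dropping the earliest prefix tokens on overflow.

    doc_tokens and prefix_tokens are lists of token ids; max_len is the
    maximum input length supported by the LM. All names are illustrative.
    """
    if len(doc_tokens) >= max_len:
        raise ValueError("document leaves no room for the prefix")
    overflow = len(doc_tokens) + len(prefix_tokens) - max_len
    if overflow > 0:
        # Remove tokens from the beginning of x, keeping the most recent ones.
        prefix_tokens = prefix_tokens[overflow:]
    return list(doc_tokens) + list(prefix_tokens)


# A 2-token document and a 5-token prefix with room for only 5 tokens.
example = build_input([100, 101], [1, 2, 3, 4, 5], max_len=5)
```

The document is kept whole and only the oldest part of the prefix is dropped, so the tokens closest to the prediction point always survive.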

RALM Design Choices
We detail below two practical design choices often made in RALM systems. In §5, we investigate the effect of these in the setting of In-Context RALM.
Retrieval Stride While in the above formulation a retrieval operation can occur at each generation step, we might want to perform retrieval only once every s > 1 tokens due to the cost of calling the retriever, and the need to replace the documents in the LM prefix during generation. We refer to s as the retrieval stride. This gives rise to the following In-Context RALM formulation (which reduces back to Eq. (2) for s = 1):

$$p(x_1, \ldots, x_n) = \prod_{j=0}^{n_s - 1} \prod_{i=1}^{s} p_\theta\left(x_{s \cdot j + i} \mid \left[R_C(x_{\le s \cdot j});\, x_{<(s \cdot j + i)}\right]\right), \tag{3}$$

where $n_s = n/s$ is the number of retrieval strides.
Notably, in this framework the runtime cost of each retrieval operation is composed of (a) applying the retriever itself, and (b) recomputing the embeddings of the prefix. In §5.2 we show that using smaller retrieval strides, i.e., retrieving as often as possible, is superior to using larger ones (though In-Context RALM with larger strides already provides large gains over vanilla LM). Thus, choosing the retrieval stride is ultimately a tradeoff between runtime and performance.
Retrieval Query Length While the retrieval query above in principle depends on all prefix tokens $x_{\le s \cdot j}$, the information at the very end of the prefix is typically the most relevant to the generated tokens. If the retrieval query is too long then this information can be diluted. To avoid this, we restrict the retrieval query at stride j to the last ℓ tokens of the prefix, i.e., we use $q_j^{s,\ell} := x_{s \cdot j - \ell + 1}, \ldots, x_{s \cdot j}$. We refer to ℓ as the retrieval query length. Note that prior RALM work couples the retrieval stride s and the retrieval query length ℓ (Borgeaud et al., 2022). In §5, we show that enforcing s = ℓ degrades LM performance. Integrating these hyper-parameters into the In-Context RALM formulation gives

$$p(x_1, \ldots, x_n) = \prod_{j=0}^{n_s - 1} \prod_{i=1}^{s} p_\theta\left(x_{s \cdot j + i} \mid \left[R_C(q_j^{s,\ell});\, x_{<(s \cdot j + i)}\right]\right). \tag{4}$$
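The interaction between the stride s and the query length ℓ is easiest to see as a schedule of (query, target) pairs. The sketch below is illustrative only: the generator name `strided_queries` is hypothetical, it works on a plain Python list, and it handles a final partial stride by simply yielding the remaining tokens, which is an assumption of this sketch.

```python
def strided_queries(tokens, s, ell):
    """Yield (query, targets) pairs for retrieval stride s and query length ell.

    For stride index j, the query is the last ell tokens of the prefix
    tokens[:s*j], and the targets are the next s tokens to be predicted.
    """
    if s < 1 or ell < 1:
        raise ValueError("s and ell must be positive")
    for start in range(0, len(tokens), s):
        prefix = tokens[:start]
        query = prefix[-ell:]      # empty at the first stride (no prefix yet)
        targets = tokens[start:start + s]
        yield query, targets


# x_1..x_10 with s = 4 and ell = 3: the query length is decoupled from the stride.
schedule = list(strided_queries(list(range(1, 11)), s=4, ell=3))
```

In this toy run the second stride retrieves with the query (x_2, x_3, x_4) and then predicts x_5 through x_8, which mirrors the indexing of the query and target spans described above.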

Experimental Details
We now describe our experimental setup, including all models we use and their implementation details.

Datasets
We evaluated the effectiveness of In-Context RALM across five diverse language modeling datasets and two common open-domain question answering datasets.

Language Modeling
The first LM dataset is WikiText-103 (Merity et al., 2016), which has been extensively used to evaluate RALMs (Khandelwal et al., 2020; He et al., 2021; Borgeaud et al., 2022; Alon et al., 2022; Zhong et al., 2022). Second, we chose three datasets spanning diverse subjects from The Pile (Gao et al., 2021): ArXiv, Stack Exchange and FreeLaw. Finally, we also investigated RealNews (Zellers et al., 2019), since The Pile lacks a corpus focused only on news (which is by nature a knowledge-intensive domain).
Open-Domain Question Answering In order to evaluate In-Context RALM on downstream tasks as well, we use the Natural Questions (NQ; Kwiatkowski et al. 2019) and TriviaQA (Joshi et al., 2017) open-domain question answering datasets.

Models
Language Models We performed our experiments using the four models of GPT-2 (110M-1.5B; Radford et al., 2019), as well as models from the GPT-Neo, OPT and LLaMA families. We elected to study these particular models for the following reasons. The first four (GPT-2) models were trained on WebText (Radford et al., 2019), with Wikipedia documents excluded from their training datasets. We were thus able to evaluate our method's "zero-shot" performance when retrieving from a novel corpus (for WikiText-103). The rest of the models brought two further benefits. First, they allowed us to investigate how our methods scale to models larger than GPT-2. Second, the fact that Wikipedia was part of their training data allowed us to investigate the usefulness of In-Context RALM for corpora seen during training. The helpfulness of such retrieval has been demonstrated for previous RALM methods (Khandelwal et al., 2020) and has also been justified theoretically by Levine et al. (2022c).
We ran all models with a maximum sequence length of 1,024, even though GPT-Neo, OPT and LLaMA models support a sequence length of 2,048 (see footnote 4).
Retrievers We experimented with both sparse (word-based) and dense (neural) retrievers. We used BM25 (Robertson and Zaragoza, 2009) as our sparse model. For dense models, we experimented with (i) a frozen BERT-base (Devlin et al., 2019) followed by mean pooling, similar to Borgeaud et al. (2022); and (ii) the Contriever (Izacard et al., 2022a) and Spider (Ram et al., 2022) models, which are dense retrievers that were trained in an unsupervised manner.

Implementation Details
We implemented our code base using the Transformers library (Wolf et al., 2020). We based our dense retrieval code on the DPR repository (Karpukhin et al., 2020).

huggingface.co/
4 In preliminary experiments, we observed similar improvements from In-Context RALM when using a sequence length of 2,048. We used a sequence length of 1,024 in order to facilitate a direct comparison between all models.

[...] 2020), our retrieval corpora consist of non-overlapping passages of 100 words (which translate to fewer than 150 tokens for the vast majority of passages). Thus, we truncate our retrieved passages at 256 tokens when input to the models, but they are usually much shorter.
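As an illustration of the passage construction described here, the sketch below splits text into non-overlapping 100-word passages and caps a tokenized passage at 256 tokens. It is a simplified sketch: whitespace splitting stands in for whatever word segmentation was actually used, and the helper names `split_into_passages` and `truncate_tokens` are hypothetical.

```python
def split_into_passages(text, words_per_passage=100):
    """Split text into consecutive, non-overlapping passages of whole words."""
    words = text.split()  # whitespace tokenization, an assumption of this sketch
    return [
        " ".join(words[i:i + words_per_passage])
        for i in range(0, len(words), words_per_passage)
    ]


def truncate_tokens(token_ids, max_tokens=256):
    """Keep at most max_tokens token ids from the start of a passage."""
    return token_ids[:max_tokens]


# A 250-word toy document yields passages of 100, 100 and 50 words.
passages = split_into_passages(" ".join(f"w{i}" for i in range(250)))
passage_lengths = [len(p.split()) for p in passages]
```

Since a 100-word passage rarely exceeds 150 tokens, the 256-token cap acts as a safety bound rather than a routine truncation.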

The Effectiveness of In-Context RALM with Off-the-Shelf Retrievers
We now empirically show that despite its simple document reading mechanism, In-Context RALM leads to substantial LM gains across our diverse evaluation suite. We begin in this section by investigating the effectiveness of off-the-shelf retrievers for In-Context RALM; we go on in §6 to show that further LM gains can be made by tailoring document ranking functions to the LM task. The experiments in this section provided us with a recommended configuration for applying In-Context RALM: applying a sparse BM25 retriever that receives ℓ = 32 query tokens and is applied as frequently as possible. Practically, we retrieve every s = 4 tokens (ℓ and s are defined in §3).

Table 1 caption (continued): For each LM, we report: (a) its performance without retrieval, (b) its performance when fed the top-scored passage by BM25 (§5), and (c) its performance when applied on the top-scored passage of each of our two suggested rerankers (§6). All models share the same vocabulary, thus token-level perplexity (token ppl) numbers are comparable. For WikiText we follow prior work and report word-level perplexity (word ppl).
Table 1 shows for the GPT-2 models that across all the examined corpora, employing In-Context RALM with an off-the-shelf retriever improved LM perplexity to a sufficient extent that it matched that of a 2-3× larger model. Figure 4 and Tables 2 and 5 show that this trend holds across model sizes up to 66B parameters, for both WikiText-103 and RealNews.
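Since results are reported both as token-level and as word-level perplexity, a short sketch of the two normalizations may help. It assumes the usual definition, in which the summed token log-probabilities are divided by the number of tokens or by the number of words respectively; the function name `perplexity` is hypothetical.

```python
import math


def perplexity(token_log_probs, num_units):
    """exp of the negative total log-probability divided by num_units.

    Pass the number of tokens for token-level perplexity, or the number
    of words in the same text for word-level perplexity.
    """
    if num_units <= 0:
        raise ValueError("num_units must be positive")
    return math.exp(-sum(token_log_probs) / num_units)


# Four tokens, each predicted with probability 0.5, that spell two words.
log_probs = [math.log(0.5)] * 4
token_ppl = perplexity(log_probs, num_units=4)
word_ppl = perplexity(log_probs, num_units=2)
```

The same model output gives a higher word-level number whenever words are split into several tokens, which is why token-level figures are only comparable across models that share a vocabulary.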

BM25 Outperforms Off-the-Shelf Neural Retrievers in Language Modeling
We experimented with different off-the-shelf general purpose retrievers, and found that the sparse (lexical) BM25 retriever (Robertson and Zaragoza, 2009) outperformed three popular dense (neural) retrievers: the self-supervised retrievers Contriever (Izacard et al., 2022a) and Spider (Ram et al., 2022), as well as a retriever based on the average pooling of BERT embeddings that was used in the RETRO system (Borgeaud et al., 2022). We conducted a minimal hyper-parameter search on the query length ℓ for each of the retrievers, and found that ℓ = 32 was optimal for BM25 (Figure 6), and ℓ = 64 worked best for dense retrievers (Figures 9, 10). Figure 3 compares the performance gains of In-Context RALM with these four general-purpose retrievers. The BM25 retriever clearly outperformed all dense retrievers. This outcome is consistent with prior work showing that BM25 outperforms neural retrievers across a wide array of tasks, when applied in zero-shot settings (Thakur et al., 2021). This result renders In-Context RALM even more appealing since applying a BM25 retriever is significantly cheaper than the neural alternatives.
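For readers unfamiliar with BM25, the sketch below scores tokenized documents against a tokenized query using one common variant of the Okapi BM25 formula. It is not the retriever implementation used in the experiments: the function name `bm25_scores`, the parameter defaults k1 = 1.5 and b = 0.75, and the smoothed IDF form are all assumptions of this sketch.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in docs against the tokenized query."""
    if not docs:
        raise ValueError("docs must be non-empty")
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs or 1.0
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        length_norm = k1 * (1.0 - b + b * len(d) / avg_len)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    ["world", "cup", "2026", "expands", "to", "48", "teams"],
    ["the", "cat", "sat", "on", "the", "mat"],
    ["world", "economy", "grew"],
]
scores = bm25_scores(["world", "cup", "teams"], docs)
```

The score depends only on term overlap, term rarity and document length, which is what makes the retriever cheap and also what motivates the reranking discussed later.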

Frequent Retrieval Improves Language Modeling
We investigated the effect of varying the retrieval stride s (i.e., the number of tokens between consecutive retrieval operations). Figure 5 shows that LM performance improved as the retrieval operation became more frequent. This supports the intuition that retrieved documents become more relevant the closer the retrieval query becomes to the generated tokens. Of course, each retrieval operation imposes a runtime cost. To balance performance and runtime, we used s = 4 in our experiments.
For comparison, RETRO employed a retrieval stride of s = 64 (Borgeaud et al., 2022), which leads to large degradation in perplexity. Intuitively, retrieving with high frequency (low retrieval stride) allows grounding the LM at a higher resolution.

A Contextualization vs. Recency Tradeoff in Query Length
We also investigated the effect of varying ℓ, the length of the retrieval query for BM25. Figure 6 reveals an interesting tradeoff and a sweet spot around a query length of 32 tokens. Similar experiments for dense retrievers are given in App. A. We conjecture that when the retriever query is too short, it does not include enough of the input context, decreasing the retrieved document's relevance. Conversely, excessively growing the retriever query deemphasizes the tokens at the very end of the prefix, diluting the query's relevance to the LM task.

Improving In-Context RALM with LM-Oriented Reranking
Since In-Context RALM uses a fixed document reading component by definition, it is natural to ask whether performance can be improved by specializing its document retrieval mechanism to the LM task. Indeed, there is considerable scope for improvement: the previous section considered conditioning the model only on the first document retrieved by the BM25 retriever. This permits very limited semantic understanding of the query, since BM25 is based only on the bag of words signal. Moreover, it offers no way to accord different degrees of importance to different retrieval query tokens, such as recognizing that later query tokens are more relevant to the generated text.
In this section, we focus on choosing which document to present to the model, by reranking the top-k documents returned by the BM25 retriever. We use Figure 7 as motivation: it shows the large potential for improvement among the top-16 documents returned by the BM25 retriever. We act upon this motivation by using two rerankers. Specifically, in §6.1 we show performance gains across our evaluation suite obtained by using an LM to perform zero-shot reranking of the top-k BM25 retrieved documents (results in third row for each of the models in Table 1). Then, in §6.2 we show that training a specialized bidirectional reranker of the top-k BM25 retrieved documents in a self-supervised manner via the LM signal can provide further LM gains (results in fourth row for each of the models in Table 1).

LMs as Zero-Shot Rerankers
First, we used off-the-shelf language models as document rerankers for the In-Context RALM setting. Formally, for a query q consisting of the last ℓ tokens in the prefix of the LM input x, let $\{d_1, \ldots, d_k\}$ be the top-k documents returned by BM25. For retrieval iteration j, let the text for generation be $y := x_{s \cdot j + 1}, \ldots, x_{s \cdot j + s}$. Ideally, we would like to find the document $d_{i^*}$ that maximizes the probability of the text for generation, i.e.,

$$i^* = \arg\max_{i \in [k]} p_\theta\left(y \mid \left[d_i;\, x_{\le s \cdot j}\right]\right). \tag{5}$$

However, at test time we do not have access to the tokens of y. Instead, we used the last prefix tokens (which are available at test time), denoted by y′, for reranking. Formally, let s′ be a hyper-parameter that determines the number of the prefix tokens by which to rerank. We define $y' := x_{s \cdot j - s' + 1}, \ldots, x_{s \cdot j}$ (i.e., the stride of length s′ that precedes y) and choose the document $d_{\hat{i}}$ such that

$$\hat{i} = \arg\max_{i \in [k]} p_\phi\left(y' \mid \left[d_i;\, x_{\le (s \cdot j - s')}\right]\right). \tag{6}$$

The main motivation is that since BM25 is a lexical retriever, we want to incorporate a semantic signal induced by the LM. Also, this reranking shares conceptual similarities with the reranking framework of Sachan et al. (2022) for open-domain question answering, where y′ (i.e., the last prefix tokens) can be thought of as their "question". Note that our zero-shot reranking does not require that the LM used for reranking is the same model as the LM used for generation (i.e., the LM in Eq. (6), parameterized by ϕ, does not need to be the LM in Eq. (2), parameterized by θ). This observation unlocks the possibility of reranking with smaller (and thus faster) models, which is important for two main reasons: (i) reranking k documents requires k forward passes; and (ii) it allows our methods to be used in cases where the actual LM's log probabilities are not available (for example, when the LM is accessed through an API).

Table 3: Perplexity for zero-shot reranking (§6.1) where the reranking model is smaller than the LM, or the LM itself. Reranking is performed on the top 16 documents retrieved by BM25. Using a GPT-2 110M (S) instead of a larger language model as a reranker leads to only a minor degradation.
LM                 Reranking model    Perplexity
GPT-2 1.5B (XL)    GPT-2 110M (S)     16.2    9.9
GPT-2 1.5B (XL)    GPT-2 1.5B (XL)    16.1    9.8

Results A minimal hyper-parameter search on the development set of WikiText-103 revealed that the optimal value is s′ = 16, so we proceed with this value going forward. Table 1 shows the results of letting the LM perform zero-shot reranking on the top-16 documents retrieved by BM25 (third row for each of the models). It is evident that reranking yielded consistently better results than simply taking the first result returned by the retriever.
Table 3 shows that a small LM (GPT-2 110M) can be used to rerank the documents for all larger GPT-2 models, with roughly the same performance as having each LM perform reranking for itself, supporting the applicability of this method for LMs that are only accessible via an API.
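The zero-shot selection rule can be sketched independently of any particular LM. In the sketch below the scoring function is passed in as a callable, so a small reranking LM could be plugged in; the names `zero_shot_rerank` and `toy_log_prob` are hypothetical, and the toy scorer is only a stand-in that rewards token overlap, not a language model.

```python
def zero_shot_rerank(docs, prefix, s_prime, log_prob):
    """Return the index of the document that best explains the last s_prime prefix tokens.

    log_prob(target, context) should return the log-probability of target
    given context under the reranking LM (supplied by the caller).
    """
    if not docs:
        raise ValueError("docs must be non-empty")
    if not 0 < s_prime <= len(prefix):
        raise ValueError("s_prime must be in (0, len(prefix)]")
    context, y_prime = prefix[:-s_prime], prefix[-s_prime:]
    scores = [log_prob(y_prime, list(doc) + list(context)) for doc in docs]
    # Ties go to the earlier document, i.e. the one ranked higher by the retriever.
    return max(range(len(docs)), key=scores.__getitem__)


def toy_log_prob(target, context):
    """Stand-in scorer: penalize target tokens that do not appear in the context."""
    seen = set(context)
    return sum(0.0 if tok in seen else -1.0 for tok in target)


docs = [["cats", "sleep"], ["2026", "48", "teams"]]
prefix = ["the", "cup", "will", "have", "48", "teams"]
best = zero_shot_rerank(docs, prefix, s_prime=2, log_prob=toy_log_prob)
```

Each candidate requires one scoring call, so the cost grows linearly with k; this is the reason a small reranking model is attractive.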

Training LM-dedicated Rerankers
Next, we trained a reranker to choose one of the top-k documents retrieved by BM25. We refer to this approach as Predictive Reranking, since the reranker learns to choose which document will help in "predicting" the upcoming text. For this process, we assume availability of training data from the target corpus. Our reranker is a classifier that gets a prefix $x_{\le s \cdot j}$ and a document $d_i$ (for $i \in [k]$), and produces a scalar $f(x_{\le s \cdot j}, d_i)$ that should reflect the relevance of $d_i$ for the continuation of $x_{\le s \cdot j}$.
We then normalize these relevance scores:

$$p_{\text{rank}}(d_i \mid x_{\le s \cdot j}) = \frac{\exp\left(f(x_{\le s \cdot j}, d_i)\right)}{\sum_{i'=1}^{k} \exp\left(f(x_{\le s \cdot j}, d_{i'})\right)},$$

and choose the document $d_{\hat{i}}$ such that

$$\hat{i} = \arg\max_{i \in [k]} p_{\text{rank}}(d_i \mid x_{\le s \cdot j}).$$

Collecting Training Examples To train our predictive reranker, we collected training examples as follows. Let $x_{\le s \cdot j}$ be a prefix we sample from the training data, and $y := x_{s \cdot j + 1}, \ldots, x_{s \cdot j + s}$ be the text for generation upcoming in its next stride. We run BM25 on the query $q_j^{s,\ell}$ derived from $x_{\le s \cdot j}$ (see §3.2) and get k documents $\{d_1, \ldots, d_k\}$. For each document $d_i$, we then run the LM to compute $p_\theta(y \mid [d_i; x_{\le s \cdot j}])$ similar to Eq. (4).
Training Our reranker was a fine-tuned RoBERTa-base (Liu et al., 2019) that was trained for 10,000 steps with a peak learning rate of $10^{-5}$ and a batch size of 32. Overall, we created 300,000 examples from the training set of WikiText-103 as explained above. The loss function we use to train the reranker follows previous work (Guu et al., 2020; Lewis et al., 2020):

$$-\log \sum_{i=1}^{k} p_{\text{rank}}(d_i \mid x_{\le s \cdot j}) \cdot p_\theta\left(y \mid \left[d_i;\, x_{\le s \cdot j}\right]\right).$$

Note that unlike those works, we train only the reranker ($p_{\text{rank}}$), keeping the LM weights θ frozen.
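A small numerical sketch may clarify how the reranker's scores and the frozen LM's probabilities combine during training. It assumes that the relevance scores are normalized with a softmax over the k candidates and that the loss is the negative log of the reranker-weighted LM likelihood, in the spirit of the cited prior work; the function names `log_softmax` and `reranker_loss` are hypothetical.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of relevance scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(v - m) for v in scores))
    return [v - log_z for v in scores]


def reranker_loss(relevance_scores, lm_log_probs):
    """-log sum_i p_rank(d_i) * p_lm(y | d_i), computed in log space.

    relevance_scores are the reranker outputs f(x, d_i); lm_log_probs are
    log p(y | [d_i; x]) from the frozen LM, treated here as constants.
    """
    terms = [a + b for a, b in zip(log_softmax(relevance_scores), lm_log_probs)]
    m = max(terms)
    return -(m + math.log(sum(math.exp(t - m) for t in terms)))


# Two candidates; the second makes the upcoming text twice as likely.
lm_log_probs = [math.log(0.2), math.log(0.4)]
uniform_loss = reranker_loss([0.0, 0.0], lm_log_probs)
better_loss = reranker_loss([0.0, 2.0], lm_log_probs)
```

Raising the score of the more helpful document lowers the loss, which is the self-supervised signal: no relevance labels are needed beyond the LM's own likelihoods.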
Results Table 1 shows the result of our predictive reranker, trained on WikiText-103. Specifically, we trained it with data produced by GPT-2 110M (S), and tested its effectiveness for all GPT-2 models. We observed significant gains obtained from Predictive Reranking. For example, the perplexity of GPT-2 110M (S) improved from 29.6 to 26.8, and that of GPT-2 1.5B (XL) improved from 16.6 to 15.4. This trend held for the other two models as well. Overall, these results demonstrate that training a reranker with domain-specific data was more effective than zero-shot reranking (Section 6.1). Note that these results, while impressive, still leave room for further improvements, compared to the top-16 BM25 oracle results (see Figure 7). Moreover, the oracle results themselves can be improved by retrieving k > 16 documents via a BM25 retriever, or by training stronger retrievers dedicated to the RALM task. We leave this direction for future work.

[...] 2022b), our "reader" (i.e., the model that gets the question along with its corresponding retrieved documents, and returns the answer) is simply a frozen large LM: not pretrained, fine-tuned or prompted to be retrieval-augmented. For the closed-book setting, we utilize the prompt of Touvron et al. (2023).
For the open-book setting, we extend this prompt to include retrieved documents (see App. C). We use DPR (Karpukhin et al., 2020) as our retriever.

Varying the Number of Documents
To investigate the effect of the number of documents shown to the model, we performed a minimal analysis on the development set of NQ and TriviaQA. Figure 8 demonstrates that showing documents in-context significantly improves the model's performance. In addition, most of the gain can be obtained by using only two documents (or even a single one in some cases).
Results Table 4 gives the results of In-Context RALM on the test set of Natural Questions and TriviaQA.

This paper presented the framework of In-Context RALM, enabling frozen, off-the-shelf LMs to benefit from retrieval. We demonstrated that substantial performance gains can be achieved by using general purpose retrievers, and showed that additional gains can be achieved by tailoring the document selection to the LM setting. A recent work by Muhlgay et al. (2023) demonstrates that In-Context RALM is indeed able to improve the factuality of large LMs.
Several directions for further improvement remain for future work. First, this paper considers only the case of prepending a single external document to the context; adding more documents could drive further gains (for example, using the framework of Ratner et al. 2022). Second, we retrieved documents at a fixed interval of s tokens, but see potential for large latency and cost gains by retrieving more sparsely, such as only when a specialized model predicts that retrieval is needed.
We release the code used in this work, for the community to use and improve upon. We hope it will drive further research of RALM, which will enable its wider adoption.

A Query Length Ablations
Figure 9 and Figure 10 show ablations on the optimal query length ℓ for off-the-shelf dense retrievers (BERT and Contriever respectively).We omit the results of Spider as they are almost identical to those of Contriever.Consistently, using ℓ = 64 (tokens) is optimal.This is in contrast to similar experiments we conducted for BM25 (cf. Figure 6), where ℓ = 32 is optimal.
B GPT-Neo Results

Figure 2: An example of In-Context RALM: we simply prepend the retrieved document before the input prefix.

Figure 3: The performance of four off-the-shelf retrievers used for In-Context RALM on the development set of WikiText-103. All RALMs are run with s = 4 (i.e., retrieval is applied every four tokens). For each RALM, we report the result of the best query length ℓ (see Figures 6, 9, 10).

Figure 4: Results of OPT models (Zhang et al., 2022) on the test set of WikiText-103 (word-level perplexity) and the development set of RealNews (token-level perplexity). In-Context RALM models use a BM25 retriever with s = 4 (i.e., the retriever is called every four tokens) and ℓ = 32 (i.e., the retriever query is comprised of the last 32 tokens of the prefix). In-Context RALM with an off-the-shelf retriever improved the performance of a 6.7B parameter OPT model to match that of a 66B parameter OPT model.

Figure 5: An analysis of perplexity as a function of s, the retrieval stride, i.e., the number of tokens between consecutive retrieval operations, on the development set of WikiText-103. Throughout the paper, we use s = 4 to balance perplexity and runtime.

Figure 7: Potential for gains from reranking: perplexity improvement (on the development set of WikiText-103) from an oracle that takes the best of the top-16 documents retrieved by BM25 rather than the first.

Figure 8: Zero-shot performance of In-Context RALM on the development set of Natural Questions and TriviaQA, when varying the number of documents (retrieved by DPR) shown in-context.

Figure 9: An analysis of perplexity as a function of the number of tokens in the query for an off-the-shelf BERT retriever on the development set of WikiText-103.

Table 1: Perplexity on the test set of WikiText-103, RealNews and three datasets from the Pile.

Table 4: Zero-shot results of In-Context RALM on the test set of Natural Questions and TriviaQA measured by exact match. In the open-book setting, we include the top two documents returned by DPR.

For the results in Table 4, motivated by our previous findings, we used two retrieved documents. It is evident that showing the model relevant documents significantly boosted its performance.
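For reference, exact match is commonly computed after light answer normalization. The sketch below follows the widely used convention of lowercasing and stripping punctuation and English articles; the text does not specify which normalization was used for Table 4, so this is an assumption, and the function names `normalize_answer` and `exact_match` are hypothetical.

```python
import re
import string


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold_answers):
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


hit = exact_match("The Eiffel Tower.", ["Eiffel Tower"])
miss = exact_match("Paris", ["London", "Berlin"])
```

A dataset-level score is then the fraction of questions for which the function returns True.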

Table 5: The performance of models from the GPT-Neo family, measured by word-level perplexity on the test set of WikiText-103 and token-level perplexity on the development set of RealNews.

Table 5 gives the results of applying In-Context RALM to the models from the GPT-Neo model family on WikiText-103 and RealNews.

Closed-Book Setting For the closed-book setting, we adopt the prompt of Touvron et al. (2023): [...]

Open-Book Setting For the open-book setting, we extend the above prompt as follows: [...]