Interactive Text Ranking with Bayesian Optimisation: A Case Study on Community QA and Summarisation

For many NLP applications, such as question answering and summarisation, the goal is to select the best solution from a large space of candidates to meet a particular user's needs. To address the lack of user-specific training data, we propose an interactive text ranking approach that actively selects pairs of candidates, from which the user selects the best. Unlike previous strategies, which attempt to learn a ranking across the whole candidate space, our method employs Bayesian optimisation to focus the user's labelling effort on high quality candidates and integrates prior knowledge in a Bayesian manner to cope better with small data scenarios. We apply our method to community question answering (cQA) and extractive summarisation, finding that it significantly outperforms existing interactive approaches. We also show that the ranking function learned by our method is an effective reward function for reinforcement learning, which improves the state of the art for interactive summarisation.


Introduction
Many text ranking tasks are highly subjective or context-dependent, yet information about the user's needs is often not available to the NLP system. Consider ranking summaries or answers to non-factoid questions: the answer the user wants depends on what they already know and their current information needs (Liu and Agichtein, 2008; López et al., 1999). We address this problem by proposing an interactive text ranking approach that efficiently gathers user feedback and combines it with predictions from pre-trained, generic models.
To minimise the amount of effort the user must expend to train a ranker, we learn from pairwise preference labels, in which the user compares two candidates and labels the better one. Pairwise preference labels can often be provided faster than ratings or class labels (Yang and Chen, 2011; Kingsley and Brown, 2010; Kendall, 1948). They can be used to rank candidates using learning-to-rank (Joachims, 2002), preference learning (Thurstone, 1927) or best-worst scaling (Flynn and Marley, 2014), or to train a reinforcement learning (RL) agent to find the optimal solution (Wirth et al., 2017).
To reduce the number of labels a user must provide, a common solution is active learning (AL). AL learns a model by iteratively acquiring labels: at each iteration, it trains a model on the labels collected so far, then uses an acquisition function to find the most appropriate candidates to query the user about. The acquisition function implements one of many different strategies to minimise the number of interaction rounds, such as reducing uncertainty (Settles, 2012). The steps of the active learning process are shown in Algorithm 1. Many active learning strategies, such as the pairwise preference learning method of Gao et al. (2018), aim to learn a good ranking for all candidates, e.g., by querying the annotator about candidates whose rank is most uncertain. However, we often need only to find and rank a small set of good candidates to present to the user. For instance, in question answering, if we can determine that an answer is probably irrelevant, it does not need to be shown to the user. The ordering of such irrelevant answers is unimportant and users should not waste time ranking them. Therefore, uncertainty-based AL strategies may waste labels on sorting poor candidates.
Figure 1: Example from the Stack Exchange Cooking topic. Candidate answer A1 selected without user interaction by COALA (Rücklé et al., 2019); A2 chosen by our system (GPPL with IMP) after 10 user interactions. A2 answers the question (boldfaced texts) but A1 fails.
Q: Does whiskey go bad by freezing?
A1: It is to cool it down without dilluting it - ice cubes would melt. And yes, you could simply cool the entire bottle, but it wouldn't look that fancy. Note that some purists would wrinkle their noses and insist that whisky is best enjoyed at room temperature and perhaps with a small dash of spring water. And I'm soooo not going into a whisky vs. whiskey debate here.
A2: Putting strong spirits in the freezer should not harm them. The solubility of air gases increases at low temperature, which is why you see bubbles as it warms up. Drinks with a lower alcohol content will be affected in the freezer. There is potential to freeze water out of anything with an alcohol content of 28% or lower. Many people use the freezer to increase the alcohol content of their home brew in UK, by freezing water out of it - the alcohol stays in the liquid portion.
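The iterative procedure described above (train on the labels so far, pick a pair with an acquisition function, query the user, repeat) can be sketched as follows. This is only an illustrative outline under our own assumptions, not a reproduction of Algorithm 1: the names `interactive_loop`, `fit`, `acquire` and `ask_user` are hypothetical, and the toy "model" simply counts wins.

```python
import random
from typing import Callable, List, Tuple

# (a, b, y): indices of the two candidates and y = 1 if the user prefers a.
Label = Tuple[int, int, int]


def interactive_loop(
    fit: Callable[[List[Label]], object],
    acquire: Callable[[object, List[Label]], Tuple[int, int]],
    ask_user: Callable[[int, int], int],
    n_rounds: int,
) -> Tuple[object, List[Label]]:
    """Generic pool-based active learning with pairwise queries (hypothetical sketch)."""
    labels: List[Label] = []
    model = fit(labels)  # with no labels, the model can only reflect prior knowledge
    for _ in range(n_rounds):
        a, b = acquire(model, labels)  # acquisition function picks the next pair
        y = ask_user(a, b)             # user (or simulated oracle) states a preference
        labels.append((a, b, y))
        model = fit(labels)            # retrain on all labels collected so far
    return model, labels


# Toy usage: hypothetical hidden utilities, a win-count "model", random acquisition.
true_utility = [0.1, 0.9, 0.4, 0.7]
rng = random.Random(0)


def count_wins(labels: List[Label]) -> List[int]:
    wins = [0] * len(true_utility)
    for a, b, y in labels:
        wins[a if y == 1 else b] += 1
    return wins


def random_pair(model: object, labels: List[Label]) -> Tuple[int, int]:
    a, b = rng.sample(range(len(true_utility)), 2)
    return a, b


def perfect_user(a: int, b: int) -> int:
    return 1 if true_utility[a] > true_utility[b] else 0


wins, collected = interactive_loop(count_wins, random_pair, perfect_user, n_rounds=10)
```

Swapping `random_pair` for a smarter acquisition function is the only change needed to move from random sampling to an active strategy, which is why the acquisition function is passed in as a parameter.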
Here, we propose an interactive method for ranking texts using Bayesian optimisation (BO) (Močkus, 1975; Brochu et al., 2010), which minimises the number of labels needed to learn the best candidate, rather than trying to learn the entire ranking function. We define two BO acquisition functions for the active learning process in Algorithm 1. These functions optimise a Bayesian model that integrates prior predictions from a generic model that is not tailored to the user. Previous interactive text ranking methods either do not exploit prior information (Baldridge and Osborne, 2004; P.V.S and Meyer, 2017; Lin and Parikh, 2017; Siddhant and Lipton, 2018), combine heuristics with user feedback after active learning is complete (Gao et al., 2018), or require expensive re-training of a non-Bayesian method (Peris and Casacuberta, 2018). Here, we show how BO can use prior information to expedite interactive text ranking. Examples of our system outputs are shown in Figures 1 and 2.
Our contributions are (1) a Bayesian optimisation methodology for interactive text ranking that integrates prior predictions with user feedback, (2) acquisition functions for Bayesian optimisation with pairwise labels, and (3) empirical evaluations on community question answering (cQA) and extractive multi-document summarisation, which show that our method brings substantial improvements in ranking and summarisation performance (e.g. for cQA, an average 25% increase in answer selection accuracy over the next-best method with 10 rounds of user interaction). We release the complete experimental software for future research.
Figure 2:
(a): A third leading advocate of the China Democracy Party who has been in custody for a month, Wang Youcai, was accused of "inciting the overthrow of the government," the Hong Kong-based Information Center of Human Rights and Democratic Movement in China reported. China's central government ordered the arrest of a prominent democracy campaigner and may use his contacts with exiled Chinese dissidents to charge him with harming national security, a colleague said Wednesday. One leader of a suppressed new political party will be tried on Dec. 17 on a charge of colluding with foreign enemies of China "to incite the subversion of state power," according to court documents given to his wife on Monday.
(b): The arrests of Xu and Qin at their homes Monday night and the accusations against them and Wang were the sharpest action Chinese leaders have taken since dissidents began pushing to set up and legally register the China Democracy Party in June. Hours before China was expected to sign a key U.N. human rights treaty and host British Prime Minister Tony Blair, police hauled a prominent human rights campaigner in for questioning Monday. With attorneys locked up, harassed or plain scared, two prominent dissidents will defend themselves against charges of subversion Thursday in China's highest-profile dissident trials in two years. Wang was a student leader in the 1989 Tiananmen Square democracy demonstrations.
(c): On the eve of China's signing the International Covenant of Civil and Political Rights (ICCPR) in October 1998, police detained Chinese human rights advocate Qin Yongmin for questioning. Eight weeks after signing the ICCPR, Chinese police arrested Qin and an associate in the China Democracy Party (CDP), Xu Wenli, without stating charges. Another CDP leader already in custody, Wang Youcai, was accused of "inciting the overthrow of the government". Qin and Wang went to trial in December for inciting subversion. Police pressure on potential defense attorneys forced the accused to mount their own defenses. Xu Wenli had not yet been charged.

Related Work
Interactive Learning in NLP. Previous work has applied active learning to tasks involving ranking or optimising generated text, including summarisation (P.V.S and Meyer, 2017), visual question answering (Lin and Parikh, 2017), and translation (Peris and Casacuberta, 2018). For summarisation, Gao et al. (2018) use active learning to learn a reward function for RL, an approach proposed by Lopes et al. (2009). Learning the reward function in this manner reduces the number of user interactions required to train a reinforcement learner from $O(10^5)$ to $O(100)$ compared to reinforcement learners querying the user directly, as in (Sokolov et al., 2016; Lawrence and Riezler, 2018; Singh et al., 2019). These previous works use uncertainty sampling strategies, which query the user about the candidates with the most uncertain rankings to try to learn all candidates' rankings with a high degree of confidence. We instead propose to find good candidates using Bayesian optimisation. Recently, Siddhant and Lipton (2018) carried out a large empirical study of uncertainty sampling for sentence classification, semantic role labelling and named entity recognition, finding that exploiting model uncertainty estimates provided by Bayesian neural networks improved performance. Our approach also exploits uncertainty estimates provided by a Bayesian method.

Bayesian Optimisation for Preference Learning
Bayesian approaches using Gaussian processes (GPs) have previously been used to reduce errors in NLP tasks involving sparse or noisy labels (Cohn and Specia, 2013;Beck et al., 2014), making them well suited to learning from user feedback. Gaussian process preference learning (GPPL) (Chu and Ghahramani, 2005) enables GP inference with pairwise preference labels. Simpson and Gurevych (2018) introduced scalable inference for GPPL using stochastic variational inference (SVI) (Hoffman et al., 2013), which outperformed SVM and LSTM methods at ranking arguments by convincingness. They included a study on active learning with pairwise labels, but tested GPPL only with uncertainty sampling, not BO. Here, we adapt GPPL to summarisation and cQA, show how to integrate prior predictions, and propose a BO framework for GPPL that facilitates interactive text ranking. Brochu et al. (2008) proposed a BO approach for pairwise comparisons but applied the approach only to a material design use case with a simple feature space. González et al. (2017) proposed alternative BO strategies for pairwise preferences, but their approach requires an expensive sampling method to estimate the utilities, which is too slow for an interactive setting. Recent work by Yang and Klabjan (2018) also proposes BO with pairwise preferences, but again, inference is expensive, the method is only tested on data with fewer than ten features, and it uses an inferior probability of improvement strategy (see Snoek et al. (2012)). Our GPPL-based framework permits much faster inference even when the input vector has more than 200 features, and hence allows rapid selection of new pairs when querying users. Ruder and Plank (2017) use BO to select training data for transfer learning in NLP tasks such as sentiment analysis, POS tagging, and parsing. 
However, unlike our interactive text ranking approach, their work does not involve pairwise comparisons and is not interactive, as the optimiser learns by training and evaluating a model on the selected data. In summary, previous work has not yet devised BO strategies for GPPL or suitable alternatives for interactive text ranking.

Background on Preference Learning
Popular preference learning models assume that users choose a candidate from a pair with probability p, where p is a function of the candidates' utilities (Thurstone, 1927). When candidates have similar utilities, the user's choice is close to random, while pairs with very different utilities are labelled consistently. Such models include the Bradley-Terry model (BT) (Bradley and Terry, 1952; Luce, 1959; Plackett, 1975), and the Thurstone-Mosteller model (Thurstone, 1927; Mosteller, 1951).
BT defines the probability that candidate a is preferred to candidate b as follows:
$$p(y_{a,b}) = \left(1 + \exp\left(-\left(w^T \phi(a) - w^T \phi(b)\right)\right)\right)^{-1},$$
where $y_{a,b} = a \succ b$ is a binary preference label, $\phi(a)$ is the feature vector of a and $w$ is a weight vector that must be learned. To learn the weights, we treat each pairwise label as two data points: the first point has input features $x = \phi(a) - \phi(b)$ and label $y$, and the second point is the reverse pair, with $x = \phi(b) - \phi(a)$ and label $1 - y$. Then, we use standard techniques for logistic regression to find the weights $w$ that minimise the L2-regularised cross entropy loss. The resulting linear model can be used to predict labels for any unseen pairs, or to estimate candidate utilities, $f_a = w^T \phi(a)$, which can be used for ranking.
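The training recipe above (two data points per pairwise label, then L2-regularised logistic regression on feature differences) can be illustrated with a small NumPy sketch. The function names, the plain gradient-descent optimiser and the hyperparameter values are our own assumptions for illustration; any standard logistic regression solver could be substituted.

```python
import numpy as np


def fit_bradley_terry(features, pairs, labels, l2=1.0, lr=0.1, n_steps=2000):
    """Fit BT weights by logistic regression on feature differences.

    features: (n_candidates, d) array of phi(.) vectors.
    pairs:    list of (a, b) candidate indices.
    labels:   y = 1 if a was preferred to b, else 0.
    The optimiser and its settings are illustrative assumptions.
    """
    features = np.asarray(features, dtype=float)
    a_idx = np.array([a for a, _ in pairs])
    b_idx = np.array([b for _, b in pairs])
    y = np.asarray(labels, dtype=float)

    # Each label yields two data points: (phi(a) - phi(b), y) and (phi(b) - phi(a), 1 - y).
    diff = features[a_idx] - features[b_idx]
    x = np.vstack([diff, -diff])
    t = np.concatenate([y, 1.0 - y])

    w = np.zeros(features.shape[1])
    n = len(t)
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w)))       # predicted preference probabilities
        grad = (x.T @ (p - t) + l2 * w) / n      # gradient of mean cross entropy + L2 penalty
        w -= lr * grad
    return w


def utilities(features, w):
    """Candidate utilities f_a = w^T phi(a), usable for ranking."""
    return np.asarray(features, dtype=float) @ w


# Toy usage: one feature, and the candidate with the larger feature always wins.
toy_features = [[0.0], [1.0], [2.0]]
toy_w = fit_bradley_terry(toy_features, [(2, 1), (1, 0), (2, 0)], [1, 1, 1])
toy_ranking = np.argsort(utilities(toy_features, toy_w))  # ascending utility
```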
Uncertainty (UNC). At each active learning iteration, the learner requests training labels for candidates that maximise the acquisition function. P.V.S and Meyer (2017) proposed an uncertainty sampling acquisition function for interactive document summarisation, which defines the uncertainty about a single summary's utility as follows:
$$u(a \mid D) = \begin{cases} p(a \mid D) & \text{if } p(a \mid D) \le 0.5, \\ 1 - p(a \mid D) & \text{otherwise,} \end{cases}$$
where $p(a \mid D) = (1 + \exp(-f_a))^{-1}$ is the probability that a is a good candidate and $w$ is the set of BT model weights trained on the data collected so far, $D$, which consists of pairs of candidate texts and pairwise preference labels. For pairwise labels, Gao et al. (2018) define an acquisition function, which we refer to here as UNC, which selects the pair of summaries (a, b) with the two highest values of $u(a \mid D)$ and $u(b \mid D)$. UNC is intended to focus labelling effort on summaries whose utilities are uncertain. If the learner is uncertain about whether summary a is a good summary or not, $p(a \mid D)$ will be close to 0.5, so a will have a higher chance of being selected. Unfortunately, it is also possible for $p(a \mid D)$ to be close to 0.5 even if a has been labelled many times, if a is a summary of intermediate utility. Therefore, when using UNC, labelling effort may be wasted re-labelling mediocre summaries. The problem is that BT cannot distinguish two types of uncertainty: epistemic uncertainty, which can be reduced by acquiring more training data, and aleatoric uncertainty, which will remain even if there are infinite observations. In fact, BT does not actually quantify the epistemic uncertainty in a candidate's utility, so to rectify this shortcoming, we replace BT with a model that both estimates the utility of a candidate and quantifies the uncertainty in that estimate.
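To make the behaviour of UNC concrete, the sketch below scores each candidate by how close $p(a \mid D)$ is to 0.5 and returns the two highest-scoring candidates. It assumes the uncertainty score is $\min(p, 1 - p)$, which matches the description above (highest when $p(a \mid D) = 0.5$); the function names are hypothetical.

```python
import numpy as np


def unc_scores(f):
    """Uncertainty score per candidate, assumed here to be min(p, 1 - p)."""
    p = 1.0 / (1.0 + np.exp(-np.asarray(f, dtype=float)))  # p(a|D) from utilities f_a
    return np.where(p <= 0.5, p, 1.0 - p)


def select_unc_pair(f):
    """Return the indices of the two candidates with the highest uncertainty scores."""
    order = np.argsort(-unc_scores(f))
    return int(order[0]), int(order[1])


# Toy usage: candidates 1 and 3 have utilities near zero, so p is near 0.5.
toy_pair = select_unc_pair([-3.0, 0.1, 2.0, -0.2])
```

Note that the score depends only on the point estimate $f_a$, so a candidate of middling utility keeps a high score however often it has been labelled, which is the weakness discussed above.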
Gaussian Process Preference Learning. Since BT does not quantify epistemic uncertainty in the utilities, we turn to a Bayesian approach, GPPL. GPPL uses a Gaussian process (GP) to provide a nonlinear mapping from document feature vectors to utilities, and assumes a Thurstone-Mosteller model for the pairwise preference labels. Whereas BT simply estimates a scalar value of $f_a$ for each candidate, a, GPPL outputs a posterior distribution over the utilities, $f$, of all candidate texts, $x$:
$$p(f \mid \phi(x), D) = \mathcal{N}(\hat{f}, C),$$
where $\hat{f}$ is a vector of posterior mean utilities and $C$ is the posterior covariance matrix of the utilities.
The entries of $\hat{f}$ are predictions of $f_a$ for each candidate given $D$, and the diagonal entries of $C$ represent posterior variance, which can be used to quantify uncertainty in the predictions. Thus, GPPL provides a way to separate candidates with uncertain utility from those with middling utility but many pairwise labels. In this paper, we infer the posterior distribution over the utilities using the SVI method of Simpson and Gurevych (2018).

Interactive Learning with GPPL
We now define four new acquisition functions for GPPL that take advantage of the posterior covariance, C, to account for uncertainty in the utilities. Table 1 summarises these acquisition functions.
Pairwise Uncertainty (UNPA). Rather than evaluating each candidate individually, as in UNC, we select the pair whose label is most uncertain. UNPA selects the pair with label probability $p(y_{a,b})$ closest to 0.5, where, for GPPL:
$$p(y_{a,b} \mid D) = \Phi\left(\frac{\hat{f}_a - \hat{f}_b}{\sqrt{1 + v}}\right), \quad v = C_{a,a} + C_{b,b} - 2C_{a,b},$$
where $\Phi$ is the probit likelihood, $\hat{f}_a$ is the posterior mean utility for candidate a, and $v$ is the posterior variance of $f_a - f_b$. Through $C$, this function accounts for correlations between candidates' utilities and epistemic uncertainty in the utilities. However, for two items with similar expected utilities, $\hat{f}_a$ and $\hat{f}_b$, $p(y_{a,b})$ is close to 0.5, i.e., it has high aleatoric uncertainty. Therefore, while UNPA will favour candidates with uncertain utilities, it may still waste labelling effort on pairs with similar utilities but low uncertainty.
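As an illustration of UNPA, the following sketch computes the predictive label probability for every pair from a posterior mean vector and covariance matrix, then returns the pair whose probability is closest to 0.5. It assumes a probit likelihood with unit noise variance, so that the predictive probability is $\Phi((\hat{f}_a - \hat{f}_b) / \sqrt{1 + v})$ with $v = C_{a,a} + C_{b,b} - 2C_{a,b}$; the function names and the toy posterior are hypothetical.

```python
import math

import numpy as np


def std_normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def pair_probability(f_mean, C, a, b):
    """Predictive probability that a is preferred to b (unit probit noise assumed)."""
    v = C[a, a] + C[b, b] - 2.0 * C[a, b]  # posterior variance of f_a - f_b
    return std_normal_cdf((f_mean[a] - f_mean[b]) / math.sqrt(1.0 + v))


def select_unpa_pair(f_mean, C):
    """Return the pair (a, b), a < b, whose label probability is closest to 0.5."""
    f_mean = np.asarray(f_mean, dtype=float)
    C = np.asarray(C, dtype=float)
    best_pair, best_gap = None, float("inf")
    for a in range(len(f_mean)):
        for b in range(a + 1, len(f_mean)):
            gap = abs(pair_probability(f_mean, C, a, b) - 0.5)
            if gap < best_gap:
                best_pair, best_gap = (a, b), gap
    return best_pair


# Toy usage: candidates 0 and 1 have nearly equal means, so their label is most uncertain.
toy_mean = np.array([0.0, 0.05, 2.0])
toy_cov = 0.1 * np.eye(3)
toy_pair = select_unpa_pair(toy_mean, toy_cov)
```

In the toy example the selected pair has low posterior variance and similar means, which is exactly the aleatoric case that UNPA cannot tell apart from genuine epistemic uncertainty.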
Expected Information Gain (EIG). We now propose a second acquisition function for active learning with GPPL, which greedily reduces the epistemic uncertainty in the GPPL model. EIG chooses pairs that maximise information gain, which quantifies the information a pairwise label provides about the utilities, $f$. Unlike UNPA, this function avoids pairs that have high aleatoric uncertainty only. The information gain for a pairwise label, $y_{a,b}$, is the reduction in entropy of the distribution over the utilities, $f$, given $y_{a,b}$. Houlsby et al. (2011) note that this can be more easily computed if it is reversed using a method known as Bayesian active learning by disagreement (BALD), which computes the reduction in entropy of the label's distribution given $f$. Since we do not know the value of $f$, we take the expected information gain $I$ with respect to $f$:
$$I(y_{a,b}, f; D) = H(y_{a,b} \mid D) - \mathbb{E}_{f \mid D}\left[H(y_{a,b} \mid f)\right],$$
where $H$ is Shannon entropy. Unlike the related approach of González et al. (2017), this can be computed in closed form given the GPPL posterior, so does not need expensive sampling.
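The BALD-style quantity described above, the entropy of the predictive label distribution minus the expected entropy of the label given the utilities, can be approximated numerically to build intuition. The sketch below is a Monte Carlo approximation, not the closed-form computation referred to in the text. It assumes a probit likelihood with unit noise variance and takes as inputs the posterior mean difference $\hat{f}_a - \hat{f}_b$ and the posterior variance $v$ of $f_a - f_b$; all names are hypothetical.

```python
from math import erf, sqrt

import numpy as np

_erf = np.vectorize(erf)


def norm_cdf(z):
    """Standard normal CDF, applied elementwise."""
    return 0.5 * (1.0 + _erf(np.asarray(z, dtype=float) / sqrt(2.0)))


def binary_entropy(p):
    """Shannon entropy (in nats) of a Bernoulli variable with parameter p."""
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))


def expected_information_gain(mean_diff, v, n_samples=20000, seed=0):
    """Monte Carlo estimate of H[y | D] - E_f[H[y | f]] for one pair.

    mean_diff: posterior mean of f_a - f_b.
    v:         posterior variance of f_a - f_b.
    """
    rng = np.random.default_rng(seed)
    diffs = mean_diff + sqrt(v) * rng.standard_normal(n_samples)  # samples of f_a - f_b
    predictive = norm_cdf(mean_diff / sqrt(1.0 + v))              # p(y_ab | D)
    expected_conditional = binary_entropy(norm_cdf(diffs)).mean() # E_f[H[y | f]]
    return float(binary_entropy(predictive) - expected_conditional)


# Toy usage: equal means in all cases, but different posterior variance.
gain_known = expected_information_gain(0.0, 0.0)     # utilities known: aleatoric only
gain_slightly = expected_information_gain(0.0, 0.01)
gain_unknown = expected_information_gain(0.0, 4.0)   # utilities very uncertain
```

All three toy pairs have a predictive probability of 0.5, so UNPA would treat them alike, whereas the information gain is (approximately) zero when the utilities are already known and grows with the posterior variance.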
Expected Improvement (IMP). The previous acquisition functions for AL are uncertainty-based, and spread labelling effort across all items whose utilities are uncertain. However, for tasks such as summarisation or cQA, the goal is to find the best candidates. While it is important to distinguish between good and optimal candidates, it is wasted effort to compare candidates that we are already confident are not the best, even if their utilities are still uncertain. We propose to address this challenge using an acquisition function for Bayesian optimisation (BO) that estimates the expected improvement (Močkus, 1975) of a candidate, a, over our current estimated best solution, b, given current pairwise labels, D. We define improvement as the quantity max{0, f a − f b }, where b is our current best item and a is our new candidate.
Since the values of $f_a$ and $f_b$ are uncertain, we compute the expected improvement as follows. First, we estimate the posterior distribution over the candidates' utilities, $\mathcal{N}(\hat{f}, C)$, then find the current best utility: $\hat{f}_b = \max_i \{\hat{f}_i\}$. For any candidate a, the difference $f_a - f_b$ is Gaussian-distributed, as it is a difference of jointly Gaussian variables. The probability that this is larger than zero is given by the standard normal cumulative distribution function, $\Phi(z)$, where $z = \frac{\hat{f}_a - \hat{f}_b}{\sqrt{v}}$ and $v$ is the posterior variance of $f_a - f_b$. We use this to derive expected improvement, which results in the following closed form equation:
$$\mathrm{Imp}(a; D) = \sqrt{v}\, z\, \Phi(z) + \sqrt{v}\, \mathcal{N}(z; 0, 1).$$
This weights the probability of finding a better solution, $\Phi(z)$, by the amount of improvement, $\sqrt{v} z$. Both terms account for how close $\hat{f}_a$ is to $\hat{f}_b$, through $z$, as a larger distance causes $z$ to be more negative, which decreases both the probability $\Phi(z)$ and the density $\mathcal{N}(z; 0, 1)$. Expected improvement also accounts for the uncertainty in both utilities through the posterior standard deviation, $\sqrt{v}$, which scales both terms.
To select pairs of items, the IMP strategy greedily chooses the current best item and the item with the greatest expected improvement. Through the consideration of posterior uncertainty, IMP trades off exploration of unknown candidates with exploitation of promising candidates. In contrast, uncertainty-based strategies are pure exploration.
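The IMP selection rule can be sketched directly from a posterior mean vector and covariance matrix. The code below assumes the standard closed form for expected improvement between two jointly Gaussian utilities, $\sqrt{v}\,(z\,\Phi(z) + \mathcal{N}(z; 0, 1))$ with $z = (\hat{f}_a - \hat{f}_b)/\sqrt{v}$ and $v = C_{a,a} + C_{b,b} - 2C_{a,b}$; the function names and toy posteriors are hypothetical.

```python
import math

import numpy as np


def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def expected_improvement(f_mean, C, a, b):
    """Expected value of max(0, f_a - f_b) under the joint Gaussian posterior."""
    v = C[a, a] + C[b, b] - 2.0 * C[a, b]  # posterior variance of f_a - f_b
    if v <= 0.0:                           # no uncertainty left in the difference
        return max(0.0, float(f_mean[a] - f_mean[b]))
    s = math.sqrt(v)
    z = (f_mean[a] - f_mean[b]) / s
    return s * z * norm_cdf(z) + s * norm_pdf(z)


def select_imp_pair(f_mean, C):
    """Return (current best item, challenger with the greatest expected improvement)."""
    f_mean = np.asarray(f_mean, dtype=float)
    C = np.asarray(C, dtype=float)
    best = int(np.argmax(f_mean))
    challengers = [a for a in range(len(f_mean)) if a != best]
    scores = [expected_improvement(f_mean, C, a, best) for a in challengers]
    return best, challengers[int(np.argmax(scores))]


# Toy usage. Item 0 is the current best; item 1 is a close runner-up; item 2 has a
# low mean. When item 2 is very uncertain it is worth exploring; when it is not,
# the close runner-up is compared instead.
toy_mean = np.array([1.0, 0.9, 0.0])
pair_explore = select_imp_pair(toy_mean, np.diag([0.01, 0.01, 4.0]))
pair_exploit = select_imp_pair(toy_mean, np.diag([0.01, 0.01, 0.01]))
```

The two toy calls differ only in the posterior variance of item 2, which shows how the same rule switches between exploration and exploitation.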
Thompson Sampling with Pairwise Labels (TP). Expected improvement is known to over-exploit in some cases (Qin et al., 2017). We therefore propose a strategy that uses Thompson sampling (Thompson, 1933) to balance exploitation and exploration. We select an item using Thompson sampling as follows: first draw a sample of candidate utilities from their posterior distribution, $f_{thom} \sim \mathcal{N}(\hat{f}, C)$, then choose the item b with the highest score in the sample. Note that this sampling step depends on a Bayesian approach to provide a posterior distribution from which to sample. Sampling means that while candidates with high expected utilities have higher values of $f_{thom}$ in most samples, other candidates may also have the highest score in some samples. As the number of samples → ∞, the number of times each candidate is chosen is proportional to the probability that it is the best candidate.
To create a pair of items for preference learning, we compute the expected information gain for all pairs that include candidate b, and choose the pair with the maximum. This strategy is less greedy than IMP as it allows for more learning about uncertain items through both the Thompson sampling step and the information gain step. However, compared to EIG, the first step focuses effort more on items with potentially high scores.
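A rough sketch of the two-step TP procedure follows: one Thompson sample picks item b, then the partner is the candidate whose pairing with b has the largest estimated information gain. For simplicity the information gain is estimated here by Monte Carlo under an assumed probit likelihood with unit noise variance, rather than in closed form, and all function names and the toy posterior are hypothetical.

```python
from math import erf, sqrt

import numpy as np

_erf = np.vectorize(erf)


def norm_cdf(z):
    return 0.5 * (1.0 + _erf(np.asarray(z, dtype=float) / sqrt(2.0)))


def binary_entropy(p):
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))


def info_gain(mean_diff, v, rng, n_samples=5000):
    """Monte Carlo estimate of the information gain for one pair (illustrative only)."""
    diffs = mean_diff + sqrt(v) * rng.standard_normal(n_samples)
    predictive = norm_cdf(mean_diff / sqrt(1.0 + v))
    return float(binary_entropy(predictive) - binary_entropy(norm_cdf(diffs)).mean())


def select_tp_pair(f_mean, C, rng):
    """Thompson-sample item b, then pick the partner a with maximal information gain."""
    f_mean = np.asarray(f_mean, dtype=float)
    C = np.asarray(C, dtype=float)
    f_thom = rng.multivariate_normal(f_mean, C)  # one draw from the posterior
    b = int(np.argmax(f_thom))                   # best item in this sample
    best_a, best_gain = None, -float("inf")
    for a in range(len(f_mean)):
        if a == b:
            continue
        v = max(C[a, a] + C[b, b] - 2.0 * C[a, b], 0.0)
        gain = info_gain(f_mean[a] - f_mean[b], v, rng)
        if gain > best_gain:
            best_a, best_gain = a, gain
    return b, best_a


# Toy usage: item 0 is almost certainly the best; item 2 is the most uncertain.
toy_rng = np.random.default_rng(0)
toy_picks = [
    select_tp_pair([3.0, 0.0, 0.0], np.diag([0.01, 0.01, 1.0]), toy_rng)
    for _ in range(30)
]
```

Because b is resampled on every call, repeated calls occasionally choose an item other than the posterior-mean best, which is the source of the extra exploration relative to IMP.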
Using Priors to Address Cold Start. In previous work on summarisation (Gao et al., 2018), the BT model was trained from a cold start, i.e., with no prior knowledge or pre-training. Then, after active learning was complete, the predictions from the trained model were combined with prior predictions based on heuristics by taking an average of the normalised scores from both methods. We propose to use such prior predictions to determine an informative prior for GPPL, enabling the active learner to make more informed choices of candidates to label at the start of the active learning process, thereby alleviating the cold-start problem.
We integrate pre-computed predictions as follows. Given a set of prior predictions, µ, from a heuristic or pre-trained model, we set the prior mean of the Gaussian process to µ before collecting any data, so that the candidate utilities have the prior p(f |φ(x)) = N (µ, K), where K is a hyperparameter. Given this setup, AL can now take the prior predictions into account when choosing pairs of candidates for labelling.

Experiments
We perform experiments on three tasks to test our interactive text ranking approach: (1) community question answering (cQA), where the goal is to identify the best answer to a given question from a pool of candidate answers; (2) rating document summaries, which aims to learn ratings that reflect a user's preferences over summaries; and (3) generating the optimal extractive summary by training a reinforcement learner with the ranking function from (2) as a reward function. Using interactive learning to learn the reward function rather than the policy reduces the number of user interactions from many thousands to tens. Both the cQA and summarisation datasets were chosen because the answers and candidate summaries in these datasets are multi-sentence documents that take longer for users to read compared to tasks such as factoid question-answering. We expect our methods to have the greatest impact in this type of long-answer scenario by minimising user interaction time.
For cQA, we use datasets consisting of questions posted on StackExchange in the communities Apple, Cooking and Travel, along with their accepted answers and candidate answers taken from related questions (Rücklé et al., 2019). Statistics for the datasets are shown in Table 2. As inputs for the BT and GPPL models, we use the same feature vectors as COALA: for each question, COALA extracts aspects from the question and its candidate answers; each dimension of an answer's feature vector encodes how well the answer covers one of the extracted aspects. The vector dimension therefore depends on the number of aspects identified for a particular question.
For summarisation, we use the DUC datasets, which contain model summaries written by experts for collections of documents related to a topic. We obtain feature vectors for candidate summaries by taking bag-of-bigram embeddings with additional features proposed by Rioux et al. (2014). The first 200 dimensions of the feature vector have binary values to indicate the presence of each of the 200 most common bigrams in each topic after tokenising, stemming and applying a stop-list. The last 5 dimensions contain the following: the fraction of the 200 most common bigrams that are present in the document (coverage ratio); the fraction of the 200 most common bigrams that occur more than once in the document (redundancy ratio); document length divided by 100 (length ratio); the sum over all extracted sentences of the reciprocal of the position of the extracted sentence in its source document (extracted sentence position feature); and a single bit to indicate if document length exceeds the length limit of 100 tokens. The same features are used for both tasks (2) learning summary ratings and (3) reinforcement learning.
Methods. As baselines, we test BT as our preference learner with random selection and the UNC active learning strategy, and GPPL as the learner with random selection. We also combine GPPL with the acquisition functions described in Section 4, namely UNPA, EIG, IMP and TP. For random sampling, we repeat each experiment ten times.
Simulated Users. In tasks (1) and (2), we simulate a user's preferences with a noisy oracle based on the user-response models of Viappiani and Boutilier (2010). Given gold standard scores for two documents, $g_a$ and $g_b$, the noisy oracle prefers document a with probability $p(y_{a,b} \mid g_a, g_b) = \left(1 + \exp\left(\frac{g_b - g_a}{t}\right)\right)^{-1}$, where $t$ is a parameter that controls the noise level. In both datasets, we are provided with model summaries or gold answers, but no gold standard scores. We therefore estimate gold scores by computing a ROUGE score of the candidate summary or answer, a, against the model summary or gold answer, m. For cQA, we take the ROUGE-L score as a gold score, as it is a well-established metric for evaluating question answering systems (e.g. Nguyen et al. (2016); Bauer et al. (2018); Indurthi et al. (2018)) and set $t = 0.1$, which results in annotation accuracy of 83% (the fraction of times the pairwise label corresponds to the gold ranking). For summarisation, we use $t = 1$, which gives noisier annotations with 58% accuracy, reflecting the greater difficulty of choosing between two summaries. This corresponds to the accuracy found by Gao et al. (2019) for human annotators when comparing summaries from the same datasets. As gold for summarisation, we combine ROUGE scores into a single score, $R_{comb}$ (Equation (8)), using a formula that was previously found to correlate well with human preferences (P.V.S and Meyer, 2017).
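The noisy oracle described in this paragraph is simple to simulate. The sketch below implements the stated preference probability, $(1 + \exp((g_b - g_a)/t))^{-1}$, and draws a label from it; the function names and the example gold scores are hypothetical, and the code assumes gold scores of moderate magnitude (such as ROUGE scores in [0, 1]) so that the exponential does not overflow.

```python
import math
import random


def preference_probability(g_a, g_b, t):
    """Probability that the simulated user prefers a to b, given gold scores and noise t."""
    return 1.0 / (1.0 + math.exp((g_b - g_a) / t))


def noisy_oracle(g_a, g_b, t, rng):
    """Draw a pairwise label: 1 if a is preferred to b, else 0."""
    return 1 if rng.random() < preference_probability(g_a, g_b, t) else 0


# Toy usage with hypothetical gold scores: a lower t gives more consistent labels.
p_low_noise = preference_probability(0.6, 0.4, t=0.1)
p_high_noise = preference_probability(0.6, 0.4, t=1.0)

toy_rng = random.Random(0)
toy_labels = [noisy_oracle(0.6, 0.4, 0.1, toy_rng) for _ in range(10000)]
toy_rate = sum(toy_labels) / len(toy_labels)  # empirical rate of preferring a
```

With the same pair of gold scores, raising $t$ from 0.1 to 1 moves the preference probability towards 0.5, which is how the two settings in the text produce different annotation accuracies.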

Warm-start using Prior Information
We compare two approaches to integrate prior predictions of the utilities computed before acquiring user feedback. As a baseline, sum applies a weighted mean to combine the prior predictions with posterior predictions learned using GPPL or BT. Based on preliminary experiments, we weight the prior and posterior predictions equally. Prior sets the prior mean of GPPL to the value of the prior predictions, as described in Section 4. Our hypothesis is that prior will provide more information at the start of the interactive learning process and help the learner to select more informative pairs. For cQA, we obtain prior predictions using COALA (Rücklé et al., 2019), which estimates the relevance of answers to a question by extracting aspects (e.g., n-grams or syntactic structures) from the question and answer texts using CNNs, then matching and aggregating the aspects. For each topic, we train COALA on the training set splits given by Rücklé et al. (2019), then predict on the test sets, i.e., the datasets in Table 2. For summarisation, we use the heuristic evaluation function described by Ryang and Abekawa (2012) as a prior. Table 3 presents results of a comparison on a subset of strategies, showing that prior results in substantially higher performance in most cases. Based on the results of these experiments, we apply prior to all further uses of GPPL.

Community Question Answering
We hypothesise that the performance of COALA can be improved by obtaining a small amount of user feedback for each question, and using it to re-rank candidate answers. To evaluate the top-ranked answers from each method, we compute accuracy as the percentage of top answers that match the gold answers. We also compare the top five highest-ranked solutions to the gold answers by computing normalised discounted cumulative gain (NDCG@5) using ROUGE-L as the relevance score. NDCG@k evaluates the relevance of the top k ranked items, putting more weight on higher-ranked items (Järvelin and Kekäläinen, 2002).
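For readers unfamiliar with NDCG@k, the sketch below shows one common formulation: discounted cumulative gain with a $\log_2(\text{rank} + 1)$ discount and linear gain, normalised by the DCG of the ideal ordering. The exact variant used in the experiments is not specified in this section, so the discount and gain choices here are assumptions, as are the function names and toy scores.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items (log2 discount, linear gain)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(predicted_scores, relevances, k):
    """NDCG@k of the ranking induced by predicted_scores, given true relevances."""
    order = sorted(
        range(len(predicted_scores)),
        key=lambda i: predicted_scores[i],
        reverse=True,
    )
    ranked_relevances = [relevances[i] for i in order]
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy usage: relevance could be, e.g., a ROUGE-L score against the gold answer.
toy_relevance = [0.9, 0.5, 0.1]
ndcg_perfect = ndcg_at_k([3.0, 2.0, 1.0], toy_relevance, k=3)   # ideal ordering
ndcg_reversed = ndcg_at_k([1.0, 2.0, 3.0], toy_relevance, k=3)  # worst ordering
```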
The results in Table 4 show that with only 10 user interactions, most methods are unable to improve performance over pre-trained COALA. UNC, UNPA, EIG and TP are out-performed by random selection and IMP (p < .01 using a two-tailed Wilcoxon signed-rank test). To see whether other methods improve given more feedback, Figure 3 plots NDCG@5 against number of interactions. While IMP performance increases substantially, random selection improves only very slowly. The early interactions cause a performance drop with UNPA, EIG, and TP, but they begin to increase after five iterations. Initially, pairs of candidates with similar, middling utilities according to the prior are most uncertain, so that even EIG and TP sample heavily from mediocre candidates. This can affect the distributions over the utility of neighbouring strong candidates, causing a drop in performance. Performance rises again once the uncertainty of the mediocre candidates has been reduced and stronger candidates are selected.
Both BT methods start from a much worse initial position, but improve consistently, as their initial samples are not biased by the prior predictions. Note that UNC is worse than random with the BT model. Referring back to Table 3, we can see that even when using sum rather than prior, GPPL with IMP or UNPA outperforms BT random. Since GPPL-sum underperforms BT with random selection, these performance gains may be due to the UNPA and IMP active learning strategies.

Interactive Summary Rating
We apply interactive learning to refine a ranking over candidate summaries given prior information.
For each topic, we create 10,000 summaries with fewer than 100 words each, which are constructed by uniformly selecting sentences at random from the input documents. To determine whether some strategies benefit from more samples, we test each active learning method with between 10 and 100 user interactions with noisy simulated users. The method is fast enough for interactive scenarios: on a standard Intel desktop workstation with a quad-core CPU and no GPU, updates to GPPL at each iteration require around one second.
Table 4: Interactive text ranking for cQA with 10 rounds of interaction. "N5" is NDCG@5, "acc" is accuracy.
We evaluate the quality of the 100 highest-ranked summaries using NDCG@1%, and compute the Pearson correlation, r, between the predicted utilities for all candidates with the combined ROUGE scores (Equation (8)). Unlike NDCG@1%, the Pearson correlation does not focus on higher-ranked candidates but considers the utilities for all candidates. Hence, we expect that IMP and TP, which optimise the highest-ranked candidates, may not give the highest values of r.
The results in Table 5 show a clear advantage to IMP with both 10 and 100 interactions in terms of NDCG@1%. IMP outperforms the previous state of the art, BT with UNC (significant with p < .01 on DUC'01 and DUC'02 with 10 interactions and all datasets with 100 interactions). In terms of r, IMP is often out-performed by TP, while TP gives the second best NDCG@1%, so TP appears more balanced between finding the best candidate and learning the ratings for all candidates. Despite the focus of IMP on top candidates, it substantially improves r over the baseline, BT with UNC (significant with p < .01 on DUC'02 with 10 interactions and all datasets with 100 interactions). UNPA only improves over random sampling with 100 interactions, while EIG and TP achieve gains with both 10 and 100 interactions compared to UNPA or random due to a better focus on epistemic uncertainty. Figure 4 shows the progress of each method with increasing numbers of interactions on DUC'01. The slow progress of the BT baselines is clear, illustrating the advantage the Bayesian methods have as a basis for active learning by incorporating uncertainty estimates and prior predictions.

RL for Summarisation
We now investigate whether our approach also improves performance when the ranking function is used to provide rewards for a reinforcement learner. Our hypothesis is that it does not matter whether the ranking or rewards assigned to bad candidates are correct, as long as they are distinguished from good candidates, as this will prevent the bad candidates from being chosen.
To test the hypothesis, we simulate a flat-bottomed reward function for summarisation on the DUC'01 corpus: first, for each topic, we set the rewards for the 10,000 sampled summaries (see Section 5.3) to the gold standard score, R_comb, defined in Eq. (8). Then, we normalise the rewards to [0, 10] and set the rewards for a varying percentage of the lowest-ranked summaries to 1.0 (the flat bottom). We train the reinforcement learner on the flat-bottomed rewards and plot ROUGE scores for the proposed summaries in Figure 5. The performance of the learner actually increases as candidate values are flattened until around 90% of the summaries have the same value. This supports our hypothesis that the user's labelling effort should be spent on the top candidates.
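The construction of the flat-bottomed rewards can be sketched as follows. This is a minimal illustration that follows the description above literally; the function name `flat_bottomed_rewards`, its parameters, and the use of min-max scaling for the normalisation step are assumptions rather than details taken from the experiments.

```python
def flat_bottomed_rewards(gold_scores, flat_fraction,
                          low=0.0, high=10.0, floor=1.0):
    """Rescale gold scores to [low, high], then flatten the lowest-ranked ones.

    Assumption: normalisation is min-max scaling; `flat_fraction` is the
    proportion of lowest-ranked candidates whose reward is set to `floor`.
    """
    smallest, largest = min(gold_scores), max(gold_scores)
    span = largest - smallest
    if span > 0:
        rewards = [low + (high - low) * (s - smallest) / span
                   for s in gold_scores]
    else:
        # All candidates have the same gold score.
        rewards = [low for _ in gold_scores]

    # Indices sorted from lowest to highest reward.
    order = sorted(range(len(rewards)), key=lambda i: rewards[i])
    n_flat = int(round(flat_fraction * len(rewards)))
    for i in order[:n_flat]:
        rewards[i] = floor  # the flat bottom
    return rewards


# Toy example with five candidates; flattening 40% affects the two worst.
example = flat_bottomed_rewards([0.0, 0.25, 0.5, 0.75, 1.0], flat_fraction=0.4)
```

Sweeping `flat_fraction` from 0 towards 1 reproduces the kind of varying percentage described above, with the relative order of the remaining top candidates left unchanged.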
We now use the ranking functions learned in the previous summarisation task as rewards for reinforcement learning. We replicate the RL setup of Gao et al. (2018) for interactive multi-document summarisation, which previously achieved state-of-the-art performance using the BT learner with UNC. The RL agent models the summarisation process as follows: there is a current state, represented by the current draft summary; the agent uses a policy to select a sentence to be concatenated to the current draft summary or to terminate the summary construction. During the learning process, the agent receives a reward after terminating, which it uses to update its policy to maximise these rewards. The model is trained for 5,000 episodes (i.e. generating 5,000 summaries and receiving their rewards), then the policy is used to produce a summary. We compare the produced summary to a human-generated model summary using ROUGE. By improving the reward function, we hypothesise that the quality of the resulting summary will also improve.

Table 6 shows that the best-performing method from the previous tasks, IMP, again produces a strong improvement over the previous state of the art, BT with UNC (significant with p < 0.01 in all cases with 100 interactions and on DUC'01 with 10 interactions), in all cases except for DUC'04 with 10 interactions, where the Pearson correlation in task (2) was lower. This suggests that while the ranking was strong on DUC'04, the predicted values of the utilities may have been less accurate. EIG and TP also appear to consistently outperform BT with UNC for 100 interactions. The results confirm that gains made by Bayesian optimisation when learning the utilities translate to better summaries produced by reinforcement learning.
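The episode structure of the RL agent described in this section (a draft summary as state, actions that either append a sentence or terminate, and a single reward on termination) can be sketched as below. This shows only one rollout and not the policy update; `run_episode`, `TERMINATE`, the policy signature, and the toy reward are hypothetical names introduced for illustration and are not taken from Gao et al. (2018).

```python
import random

TERMINATE = None  # hypothetical marker for the "stop building the summary" action


def run_episode(sentences, policy, reward_fn):
    """Roll out one episode and return (chosen sentence indices, terminal reward).

    `policy(draft, actions)` picks one element of `actions`;
    `reward_fn(summary_text)` stands in for the learned ranking function.
    """
    draft = []  # state: indices of the sentences in the current draft summary
    while True:
        remaining = [i for i in range(len(sentences)) if i not in draft]
        action = policy(draft, remaining + [TERMINATE])
        if action is TERMINATE:
            break
        draft.append(action)  # concatenate the chosen sentence to the draft
    summary = " ".join(sentences[i] for i in draft)
    # The reward is only observed once the summary is complete.
    return draft, reward_fn(summary)


def make_random_policy(seed=0):
    """An untrained policy that chooses uniformly among the available actions."""
    rng = random.Random(seed)
    return lambda draft, actions: rng.choice(actions)


# Toy usage: the reward is simply the word count of the finished summary.
docs = ["a b", "c d", "e f"]
chosen, reward = run_episode(docs, make_random_policy(),
                             lambda text: len(text.split()))
```

A learner would repeat such rollouts for many episodes (5,000 in the setup above) and adjust the policy using the terminal rewards.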

Conclusions
We proposed a novel approach to interactive text ranking that uses Bayesian optimisation (BO) to identify top-ranked texts by acquiring pairwise feedback from a user and applying Gaussian process preference learning (GPPL). Our experiments showed that our BO approach significantly improves the accuracy of answers chosen in a cQA task with small amounts of feedback, and leads to summaries that better match human-generated model summaries when used to learn a reward function for reinforcement learning.
Of two proposed Bayesian optimisation strategies, we found that expected improvement (IMP) outperforms Thompson sampling (TP) if the goal is to optimise the proposed best solution. TP may require a larger number of interactions due to its random sampling step. IMP is effective in both cQA and summarisation tasks, but has the strongest impact on cQA with only 10 interactions. This may be due to the greater sparsity of candidates in cQA (100 versus 10,000 for summarisation), which allows them to be more easily discriminated by the model, given good training examples.
We found that performance improves when including prior predictions as the GPPL prior mean. However, it is not clear how to include estimates of confidence in the prior predictions: here we assume equal confidence in all prior predictions. Future work will therefore address this by adapting the GPPL prior covariance matrix, which may help to kick-start Bayesian optimisation. Furthermore, the method is currently limited to a single set of prior predictions: in future we intend to integrate multiple sets of predictions from a selection of models. Further evaluation with real users is also necessary to gauge the quantity of feedback needed in a particular domain.