Abstract
For many NLP applications, such as question answering and summarization, the goal is to select the best solution from a large space of candidates to meet a particular user’s needs. To address the lack of user or task-specific training data, we propose an interactive text ranking approach that actively selects pairs of candidates, from which the user selects the best. Unlike previous strategies, which attempt to learn a ranking across the whole candidate space, our method uses Bayesian optimization to focus the user’s labeling effort on high quality candidates and integrate prior knowledge to cope better with small data scenarios. We apply our method to community question answering (cQA) and extractive multidocument summarization, finding that it significantly outperforms existing interactive approaches. We also show that the ranking function learned by our method is an effective reward function for reinforcement learning, which improves the state of the art for interactive summarization.
1 Introduction
Many text ranking tasks are highly specific to an individual user’s topic of interest, which presents a challenge for NLP systems that have not been trained to solve that user’s problem. Consider ranking summaries or answers to non-factoid questions: A good solution requires understanding the topic and the user’s information needs (Liu and Agichtein, 2008; López et al., 1999). We address this by proposing an interactive text ranking approach that efficiently gathers user feedback and combines it with predictions from pretrained, generic models.
To minimize the amount of effort the user must expend to train a ranker, we learn from pairwise preference labels, in which the user compares two candidates and labels the best one. Pairwise preference labels can often be provided faster than ratings or class labels (Yang and Chen, 2011; Kingsley and Brown, 2010; Kendall, 1948), can be used to rank candidates using learning-to-rank (Joachims, 2002), preference learning (Thurstone, 1927), or best–worst scaling (Flynn and Marley, 2014), or to train a reinforcement learning (RL) agent to find the optimal solution (Wirth et al., 2017).
To reduce the number of labels a user must provide, a common solution is active learning (AL). AL learns a model by iteratively acquiring labels: At each iteration, it trains a model on the labels collected so far, then uses an acquisition function to quantify the value of querying the user about a particular pair of candidates. The system then chooses the pairs with the highest values and asks the user to label them. The acquisition function implements one of many strategies for minimizing the number of interaction rounds, such as reducing uncertainty (Settles, 2012) by choosing informative labels that help learn the model more quickly.
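The loop below sketches this process in Python. All names are illustrative: the acquisition function, preference model, and oracle are placeholders for the components discussed in this paper, not the released implementation.

```python
import itertools

def interactive_ranking(candidates, oracle, acquisition, model, n_rounds=10):
    """Generic pairwise active-learning loop (illustrative sketch).

    candidates: list of candidate feature vectors
    oracle: callable (i, j) -> 0 if candidate i is preferred, else 1 (the user)
    acquisition: callable ((i, j), model) -> value of querying this pair
    model: preference learner exposing fit(pairs, prefs) and rank()
    """
    pairs, prefs = [], []
    for _ in range(n_rounds):
        model.fit(pairs, prefs)                       # retrain on feedback so far
        candidate_pairs = itertools.combinations(range(len(candidates)), 2)
        i, j = max(candidate_pairs, key=lambda p: acquisition(p, model))
        prefs.append(oracle(i, j))                    # ask the user to compare i and j
        pairs.append((i, j))
    model.fit(pairs, prefs)
    return model.rank()                               # final ranking over all candidates
```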
Many active learning strategies, such as the pairwise preference learning method of Gao et al. (2018), aim to learn a good ranking model for all candidates, for example, by querying the annotator about candidates whose rank is most uncertain. However, we often need to find and rank only a small set of good candidates to present to the user. For instance, in question answering, irrelevant answers should not be shown to the user, so their precise ordering is unimportant and users should not waste time ranking them. Therefore, by reducing uncertainty for all candidates, uncertainty-based AL strategies may waste labels on sorting poor candidates.
Here, we propose an interactive method for ranking texts that replaces the standard uncertainty-based acquisition functions with acquisition functions for Bayesian optimization (BO) (Močkus, 1975; Brochu et al., 2010). In general, BO aims to find the maximum of a function while minimizing the number of queries to an oracle. Here, we use BO to maximize a ranking function that maps text documents to scores, treating the user as a noisy oracle. Our BO active learning strategy minimizes the number of labels needed to find the best candidate, in contrast to uncertainty-based strategies that attempt to learn the entire ranking function. This makes BO better suited to tasks such as question answering, summarization, or translation, where the aim is to find the best candidate and those with low quality can simply be disregarded rather than ranked precisely. In this paper, we define two BO acquisition functions for interactive text ranking.
While our approach is designed to adapt a model to a highly specialized task, generic models can provide hints to help us avoid low-quality candidates. Therefore, we learn the ranking function itself using a Bayesian approach, which integrates prior predictions from a generic model that is not tailored to the user. Previous interactive text ranking methods either do not exploit prior information (Baldridge and Osborne, 2004; P.V.S and Meyer, 2017; Lin and Parikh, 2017; Siddhant and Lipton, 2018), combine heuristics with user feedback after active learning is complete (Gao et al., 2018), or require expensive re-training of a non-Bayesian method (Peris and Casacuberta, 2018). Here, we show how BO can use prior information to expedite interactive text ranking. The interactive learning process is shown in Algorithm 1 and examples of our system outputs are shown in Figures 1 and 2.
Figure 1: Example from the Stack Exchange Cooking topic. Candidate answer A1 is selected without user interaction by COALA (Rücklé et al., 2019); A2 is chosen by our system (GPPL with IMP) after 10 user interactions. A2 answers the question (boldfaced text) but A1 does not.
Figure 2: Example summaries for DUC’04 produced by RL (see Section 5.4) with a reward function learned from 100 user interactions using (a) the BT, UNC method of Gao et al. (2018) and (b) our GPPL, IMP method. (c) is a model summary written by an expert. Each color indicates a particular news event or topic, showing where it occurs in each summary. Compared to (a), summary (b) covers more of the events discussed in the reference, (c).
Our contributions are (1) a Bayesian optimization methodology for interactive text ranking that integrates prior predictions with user feedback, (2) acquisition functions for Bayesian optimization with pairwise labels, and (3) empirical evaluations on community question answering (cQA) and extractive multidocument summarization, which show that our method brings substantial improvements in ranking and summarization performance (e.g., for cQA, an average 25% increase in answer selection accuracy over the next-best method with 10 rounds of user interaction). We release the complete experimental software for future work.1
2 Related Work
Interactive Learning in NLP.
Previous work has applied active learning to tasks involving ranking or optimizing generated text, including summarization (P.V.S and Meyer, 2017), visual question answering (Lin and Parikh, 2017), and translation (Peris and Casacuberta, 2018). For summarization, Sokolov et al. (2016), Lawrence and Riezler (2018), and Singh et al. (2019) train reinforcement learners by querying the user directly for rewards, which requires on the order of 10^5 interactions. Gao et al. (2018) dramatically reduce the number of user interactions to the order of 10^2 by using active learning to learn a reward function for RL, an approach proposed by Lopes et al. (2009). These previous works use uncertainty sampling strategies, which query the user about the candidates with the most uncertain rankings to try to learn all candidates’ rankings with a high degree of confidence. We instead propose to find good candidates using an optimization strategy. Siddhant and Lipton (2018) carried out a large empirical study of uncertainty sampling for sentence classification, semantic role labeling, and named entity recognition, finding that exploiting model uncertainty estimates provided by Bayesian neural networks improved performance. Our approach also exploits Bayesian uncertainty estimates.
BO for Preference Learning.
Bayesian approaches using Gaussian processes (GPs) have previously been used to reduce errors in NLP tasks involving sparse or noisy labels (Cohn and Specia, 2013; Beck et al., 2014), making them well-suited to learning from user feedback. Gaussian process preference learning (GPPL) (Chu and Ghahramani, 2005) enables GP inference with pairwise preference labels. Simpson and Gurevych (2018) introduced scalable inference for GPPL using stochastic variational inference (SVI) (Hoffman et al., 2013), which outperformed SVM and LSTM methods at ranking arguments by convincingness. They included a study on active learning with pairwise labels, but tested GPPL only with uncertainty sampling, not BO. Here, we adapt GPPL to summarization and cQA, show how to integrate prior predictions, and propose a BO framework for GPPL that facilitates interactive text ranking.
Brochu et al. (2008) proposed a BO approach for pairwise comparisons but applied the approach only to a material design use case with a very simple feature space. González et al. (2017) proposed alternative BO strategies for pairwise preferences, but their approach requires expensive sampling to estimate the utilities, which is too slow for an interactive setting. Yang and Klabjan (2018) also propose BO with pairwise preferences, but again, inference is expensive, the method is only tested with fewer than ten features, and it uses an inferior probability of improvement strategy (see Snoek et al., 2012). Our GPPL-based framework permits much faster inference even when the input vector has more than 200 features, and hence allows rapid selection of new pairs when querying users.
Ruder and Plank (2017) use BO to select training data for transfer learning in NLP tasks such as sentiment analysis, POS tagging, and parsing. However, unlike our interactive text ranking approach, their work does not involve pairwise comparisons and is not interactive, as the optimizer learns by training and evaluating a model on the selected data. In summary, previous work has not yet devised BO strategies for GPPL or suitable alternatives for interactive text ranking.
3 Background on Preference Learning
Popular preference learning models assume that users choose a candidate from a pair with probability p, where p is a function of the candidates’ utilities (Thurstone, 1927). Utility is defined as the value of a candidate to the user, that is, it quantifies how well that instance meets their needs. When candidates have similar utilities, the user’s choice is close to random, while pairs with very different utilities are labeled consistently. Such models include the Bradley–Terry model (BT) (Bradley and Terry, 1952; Luce, 1959; Plackett, 1975), and the Thurstone–Mosteller model (Thurstone, 1927; Mosteller, 1951).
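For instance, under the Bradley–Terry model, the probability that the user prefers candidate a over candidate b is a logistic function of the difference between their utilities:

```latex
% Bradley-Terry preference probability for candidates a and b with utilities f_a, f_b
p(a \succ b) = \frac{\exp(f_a)}{\exp(f_a) + \exp(f_b)} = \frac{1}{1 + \exp(f_b - f_a)}
```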
Uncertainty (UNC).
UNC is intended to focus labeling effort on candidates whose utilities are uncertain. If the learner is uncertain about whether candidate a is a good candidate, p(a|D) will be close to 0.5, so a will have a higher chance of being selected. Unfortunately, it is also possible for p(a|D) to be close to 0.5 even if a has been labeled many times if a is a candidate of intermediate utility. Therefore, when using UNC, labeling effort may be wasted re-labeling mediocre candidates.
The problem is that BT cannot distinguish two types of uncertainty. The first is aleatoric uncertainty due to the inherent unpredictability of the phenomenon we wish to model (Senge et al., 2014). For example, when predicting the outcome of a coin toss, we model the outcome as random. Similarly, given two equally preferable items, we assume that the user assigns a preference label randomly. It does not matter how much training data we observe: if the items are equally good, we are uncertain which one the user will choose.
The second type is epistemic uncertainty due to our lack of knowledge, which can be reduced by acquiring more training data, as this helps us to learn the model’s parameters with higher confidence. BT does not quantify aleatoric and epistemic uncertainty separately, unlike Bayesian approaches (Jaynes, 2003), so we may repeatedly select items with similar utilities that do not require more labels. To rectify this shortcoming, we replace BT with a Bayesian model that both estimates the utility of a candidate and quantifies the epistemic uncertainty in that estimate.
Gaussian Process Preference Learning
4 Interactive Learning with GPPL
We now define four acquisition functions for GPPL that take advantage of the posterior covariance, C, to account for uncertainty in the utilities. Table 1 summarizes these acquisition functions.
Table 1: Properties of the compared learners and acquisition strategies.

| Learner, Strategy | BT, random | BT, UNC | GPPL, random | GPPL, UNPA | GPPL, EIG | GPPL, IMP | GPPL, TP |
|---|---|---|---|---|---|---|---|
| Considers epistemic uncertainty | N | Y | N | Y | Y | Y | Y |
| Ignores aleatoric uncertainty | N | N | N | N | Y | Y | Y |
| Supports warm-start | N | N | Y | Y | Y | Y | Y |
| Focus on finding best candidate | N | N | N | N | N | Y (greedy) | Y (balanced) |
Pairwise Uncertainty (UNPA).
Expected Information Gain (EIG).
Expected Improvement (IMP).
The previous acquisition functions for AL are uncertainty-based, and spread labeling effort across all items whose utilities are uncertain. However, for tasks such as summarization or cQA, the goal is to find the best candidates. While it is important to distinguish between good and optimal candidates, it is wasted effort to compare candidates that we are already confident are not the best, even if their utilities are still uncertain. We propose to address this using an acquisition function for BO that estimates the expected improvement (Močkus, 1975) of a candidate, a, over our current estimated best solution, b, given current pairwise labels, D. Here, we provide the first closed-form acquisition function that uses expected improvement for pairwise preference learning.
To select pairs of items, the IMP strategy greedily chooses the current best item and the item with the greatest expected improvement. Through the consideration of posterior uncertainty, IMP trades off exploration of unknown candidates with exploitation of promising candidates. In contrast, uncertainty-based strategies are pure exploration.
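A minimal sketch of the expected-improvement computation for a Gaussian posterior over candidate utilities is shown below. This is the textbook closed form (Močkus, 1975), not the paper's exact pairwise variant, and it ignores the covariance between each candidate and the current best, so it should be read as an approximation.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, var, best_idx):
    """Textbook expected improvement of each candidate over the current best.

    mu: posterior mean utility of each candidate, shape (n,)
    var: posterior variance of each candidate, shape (n,)
    best_idx: index of the current estimated best candidate
    """
    std = np.sqrt(np.maximum(var, 1e-12))
    gain = mu - mu[best_idx]                 # expected gain over the current best
    z = gain / std
    return gain * norm.cdf(z) + std * norm.pdf(z)

# IMP then queries the user with the pair (best_idx, argmax of the EI values).
```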
Thompson Sampling with Pairwise Labels (TP).
Expected improvement is known to overexploit in some cases (Qin et al., 2017): It chooses where to sample based on the current distribution, so if this distribution underestimates the mean and variance of the optimum, it may never be sampled. To increase exploration, we propose a strategy that uses Thompson sampling (Thompson, 1933). Unlike IMP, which is deterministic, TP introduces random exploration through sampling. TP is similar to dueling-Thompson sampling for continuous input domains (González et al., 2017), but uses an information gain step (described below) and samples from a pool of discrete candidates.
We select an item using Thompson sampling as follows: First, draw a sample of candidate utilities from their posterior distribution, $f_{\mathrm{thom}} \sim p(f \mid D)$, then choose the item b with the highest score in the sample. This sampling step depends on a Bayesian approach to provide a posterior distribution from which to sample. Sampling means that while candidates with high expected utilities have higher values of $f_{\mathrm{thom}}$ in most samples, other candidates may also have the highest score in some samples. As the number of samples tends to infinity, the number of times each candidate is chosen becomes proportional to the probability that it is the best candidate.
To create a pair of items for preference learning, the TP strategy computes the expected information gain for all pairs that include candidate b, and chooses the pair with the maximum. This strategy is less greedy than IMP as it allows more learning about uncertain items through both the Thompson sampling step and the information gain step. However, compared to EIG, the first step focuses effort more on items with potentially high scores.
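A sketch of the Thompson sampling step is given below; mu and cov stand for the GPPL posterior mean and covariance, and the information-gain step that selects the second item of the pair is omitted.

```python
import numpy as np

def thompson_select(mu, cov, rng=None):
    """Draw one sample of the candidate utilities from the posterior N(mu, cov)
    and return the index of the highest-scoring candidate in that sample."""
    rng = rng or np.random.default_rng()
    f_sample = rng.multivariate_normal(mu, cov)
    return int(np.argmax(f_sample))

# TP pairs the selected candidate b with the item that maximizes the expected
# information gain among all pairs containing b.
```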
Using Priors to Address Cold Start.
In previous work on summarization (Gao et al., 2018), the BT model was trained from a cold start, namely, with no prior knowledge or pretraining. Then, after active learning was complete, the predictions from the trained model were combined with prior predictions based on heuristics by taking an average of the normalized scores from both methods. We propose to use such prior predictions to determine an informative prior for GPPL, enabling the active learner to make more informed choices of candidates to label at the start of the active learning process, thereby alleviating the cold-start problem.
We integrate pre-computed predictions as follows. Given a set of prior predictions, μ, from a heuristic or pre-trained model, we set the prior mean of the Gaussian process to μ before collecting any data, so that the candidate utilities have the prior $f \sim \mathcal{N}(\mu, K)$, where the prior covariance $K$ is a hyper-parameter. Given this setup, AL can now take the prior predictions into account when choosing pairs of candidates for labeling.
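A sketch of this warm start is shown below, assuming a generic RBF kernel over the candidates' feature vectors; the actual GPPL implementation constructs these quantities internally with its own kernel and hyper-parameters.

```python
import numpy as np
from sklearn.gaussian_process.kernels import RBF

def warm_start_prior(prior_predictions, features, output_scale=1.0):
    """Prior mean = pre-computed predictions; prior covariance = scaled kernel."""
    mu = np.asarray(prior_predictions, dtype=float)      # prior mean over utilities
    K = output_scale * RBF(length_scale=1.0)(features)   # prior covariance matrix
    return mu, K
```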
5 Experiments
We perform experiments on three tasks to test our interactive text ranking approach: (1) community question answering (cQA): identify the best answer to a given question from a pool of candidate answers; (2) rating extractive multidocument summaries according to a user’s preferences; (3) generating an extractive multidocument summary by training a reinforcement learner with the ranking function from task (2) as a reward function. Using interactive learning to learn the reward function rather than the policy reduces the number of user interactions from many thousands to 100 or fewer. These tasks involve highly specialized questions or topics where generic models could be improved with user feedback. For the first two tasks, we simulate the interactive process in Algorithm 1. The final task uses the results of this process.
Datasets.
Both the cQA and multidocument summarization datasets were chosen because the answers and candidate summaries in these datasets are multisentence documents that take longer for users to read compared to tasks such as factoid question-answering. We expect our methods to have the greatest impact in this type of long-answer scenario by minimizing user interaction time.
For cQA, we use datasets consisting of questions posted on StackExchange in the communities Apple, Cooking, and Travel, along with their accepted answers and candidate answers taken from related questions (Rücklé et al., 2019). Each accepted answer was marked by the user who posted the question, so reflects that user’s own opinion. Dataset statistics are shown in Table 2.
Table 2: Dataset statistics.

| cQA Topics | #questions | #accepted answers | #candidate answers |
|---|---|---|---|
| Apple | 1,250 | 1,250 | 125,000 |
| Cooking | 792 | 792 | 79,200 |
| Travel | 766 | 766 | 76,600 |

| Summarization Datasets | #topics | #model summaries | #docs |
|---|---|---|---|
| DUC 2001 | 30 | 90 | 300 |
| DUC 2002 | 59 | 177 | 567 |
| DUC 2004 | 50 | 150 | 500 |
For summarization, we use the DUC datasets,2 which contain model summaries written by experts for collections of documents related to a narrow topic. Each topic has three model summaries, each written by a different expert and therefore reflecting different opinions about what constitutes a good summary. Compared with single-document summarization, the challenging multidocument case is an ideal testbed for interactive approaches, because the diversity of themes within a collection of documents makes it difficult to identify a single, concise summary suitable for all users.
Priors and Input Vectors.
We use our interactive approach to improve a set of prior predictions provided by a pretrained method. For cQA, we first choose the previous state-of-the-art for long answer selection, COALA (Rücklé et al., 2019), which estimates the relevance of answers to a question by extracting aspects (e.g., n-grams or syntactic structures) from the question and answer texts using CNNs, then matching and aggregating the aspects. For each topic, we train an instance of COALA on the training split given by Rücklé et al. (2019), then run the interactive process on the test set, that is, the dataset in Table 2, to simulate a user interactively refining the answers selected for their question. As inputs for the BT and GPPL models, we use the COALA feature vectors: For each question, COALA extracts aspects from the question and its candidate answers; each dimension of an answer’s 50-dimensional feature vector encodes how well the answer covers one of the extracted aspects.
Next, we apply our interactive approach to refine predictions from the current state of the art (Xu et al., 2019), which we refer to as BERT-cQA. This method places two dense layers with 100 and 10 hidden units on top of BERT (Devlin et al., 2019). As inputs to BERT, we concatenate the question and candidate answer and pad sequences to 512 tokens (4% of QA pairs are over-length and are truncated). The whole model is fine-tuned on the same StackExchange training sets as COALA. In our simulations, we use the fine-tuned, final-layer [CLS] embeddings with 768 dimensions as inputs to BT and GPPL for each question-answer pair.
As prior predictions for summary ratings, we first evaluate REAPER, a heuristic evaluation function described by Ryang and Abekawa (2012). We obtain bigram+ feature vectors for candidate summaries by augmenting bag-of-bigram embeddings with additional features proposed by Rioux et al. (2014). The first 200 dimensions of the feature vector have binary values to indicate the presence of each of the 200 most common bigrams in each topic after tokenizing, stemming, and applying a stop-list. The last 5 dimensions contain the following: the fraction of the 200 most common bigrams that are present in the document (coverage ratio); the fraction of the 200 most common bigrams that occur more than once in the document (redundancy ratio); document length divided by 100 (length ratio); the sum over all extracted sentences of the reciprocal of the position of the extracted sentence in its source document (extracted sentence position feature); and a single bit indicating whether the document length exceeds the limit of 100 tokens. The same features are used for both task (2), learning summary ratings, and task (3), reinforcement learning. A sketch of this representation is shown below.
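The sketch simplifies the description above: bigrams are counted over the concatenated token stream, and tokenization, stemming, and stop-word removal are assumed to have been applied already; sentence positions are assumed to be 1-based.

```python
from collections import Counter

def bigram_plus_features(tokens, sent_positions, top_bigrams, length_limit=100):
    """205-dimensional bigram+ vector: 200 binary bigram indicators + 5 extras.

    tokens: stemmed, stop-filtered tokens of the candidate summary
    sent_positions: position of each extracted sentence in its source document
    top_bigrams: the 200 most common bigrams for this topic
    """
    counts = Counter(zip(tokens, tokens[1:]))
    binary = [1.0 if counts[bg] > 0 else 0.0 for bg in top_bigrams]    # 200 dims
    coverage = sum(binary) / len(top_bigrams)                          # coverage ratio
    redundancy = sum(counts[bg] > 1 for bg in top_bigrams) / len(top_bigrams)
    length_ratio = len(tokens) / 100.0
    position = sum(1.0 / p for p in sent_positions)                    # position feature
    over_length = 1.0 if len(tokens) > length_limit else 0.0
    return binary + [coverage, redundancy, length_ratio, position, over_length]
```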
We also test prior predictions from a state-of-the-art summary scoring method, SUPERT (Gao et al., 2020), which uses a variant of BERT that has been fine-tuned on news articles to obtain 1024-dimensional contextualized embeddings of a summary. To score a summary, SUPERT extracts a pseudo-reference summary from the source documents, then compares its embedding with that of the test summary. With the SUPERT priors we compare bigram+ feature vectors and the SUPERT embeddings as input to BT and GPPL for task (2).
Interactive Methods.
As baselines, we test BT as our preference learner with random selection and the UNC active learning strategy, and GPPL as the learner with random selection. We also combine GPPL with the acquisition functions described in Section 4, UNPA, EIG, IMP, and TP. For random sampling, we repeat each experiment ten times.
Simulated Users.
In tasks (1) and (2), we simulate a user’s preferences with a noisy oracle based on the user-response models of Viappiani and Boutilier (2010). Given gold standard scores for two documents, $g_a$ and $g_b$, the noisy oracle prefers document a with probability $p(a \succ b) = 1 / (1 + \exp((g_b - g_a)/t))$, where t is a parameter that controls the noise level. Both datasets contain model summaries or gold answers, but no gold standard scores. We therefore estimate gold scores by computing a ROUGE score of the candidate summary or answer, a, against the model summary or gold answer, m. For cQA, we take the ROUGE-L score as the gold score, as it is a well-established metric for evaluating question answering systems (e.g., Nguyen et al., 2016; Bauer et al., 2018; Indurthi et al., 2018), and set t = 0.3, which results in an annotation accuracy of 83% (the fraction of times the pairwise label corresponds to the gold ranking).
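A sketch of this simulated user, using the logistic response form given above:

```python
import numpy as np

def noisy_oracle(gold_a, gold_b, t=0.3, rng=None):
    """Prefer candidate a with a probability that grows with the gap between
    the gold scores; t controls the noise level (t=0.3 gives ~83% accuracy
    in our cQA simulations). Returns 0 if a is preferred, 1 otherwise."""
    rng = rng or np.random.default_rng()
    p_a = 1.0 / (1.0 + np.exp(-(gold_a - gold_b) / t))
    return 0 if rng.random() < p_a else 1
```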
5.1 Warm-start Using Prior Information
We compare two approaches to integrate the prior predictions of utilities computed before acquiring user feedback. As a baseline, sum applies a weighted mean to combine the prior predictions with posterior predictions learned using GPPL or BT. Based on preliminary experiments, we weight the prior and posterior predictions equally. Prior sets the prior mean of GPPL to the value of the prior predictions, as described in Section 4. Our hypothesis is that prior will provide more information at the start of the interactive learning process and help the learner to select more informative pairs.
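A sketch of the two combination schemes, assuming scores are min-max normalized before averaging (the normalization scheme is an assumption):

```python
import numpy as np

def normalize(scores):
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.min()) / (scores.max() - scores.min() + 1e-12)

def sum_combination(prior_scores, learned_scores):
    """'sum' baseline: equally weighted mean of normalized prior and learned
    scores, applied only after interactive learning is complete."""
    return 0.5 * normalize(prior_scores) + 0.5 * normalize(learned_scores)

# 'prior' instead passes prior_scores as the GPPL prior mean (see Section 4),
# so the prior knowledge influences which pairs are selected during learning.
```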
Table 3 presents results of a comparison on a subset of strategies, showing that prior results in higher performance in many cases. Based on the results of these experiments, we apply prior to all further uses of GPPL.
Table 3: Comparison of the sum and prior approaches for integrating prior predictions.

Accuracy for cQA with COALA priors (#interactions = 10):

| Strategy | Prior | Apple | Cooking | Travel |
|---|---|---|---|---|
| random | sum | .245 | .341 | .393 |
| random | prior | .352 | .489 | .556 |
| UNPA | sum | .293 | .451 | .423 |
| UNPA | prior | .290 | .392 | .476 |
| IMP | sum | .373 | .469 | .466 |
| IMP | prior | .615 | .750 | .784 |

NDCG@1% for summarization with REAPER priors (#interactions = 20):

| Strategy | Prior | DUC’01 | DUC’02 | DUC’04 |
|---|---|---|---|---|
| random | sum | .595 | .623 | .652 |
| random | prior | .562 | .590 | .600 |
| UNPA | sum | .590 | .628 | .650 |
| UNPA | prior | .592 | .635 | .648 |
| IMP | sum | .618 | .648 | .683 |
| IMP | prior | .654 | .694 | .702 |
5.2 Community Question Answering
We hypothesize that the prior ranking given by COALA can be improved by incorporating a small amount of user feedback for each question. Our interactive process aims to find the best answer to a specific question, rather than learning a model that transfers to new questions, hence preferences are sampled for questions in the test splits.
To evaluate the top-ranked answers from each method, we compute accuracy as the fraction of top answers that match the gold answers. We also compare the five highest-ranked solutions to the gold answers using normalized discounted cumulative gain (NDCG@5) with ROUGE-L as the relevance score. NDCG@k evaluates the relevance of the top k ranked items, putting more weight on higher-ranked items (Järvelin and Kekäläinen, 2002).
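NDCG@k can be computed as follows, with ROUGE-L scores against the gold answer serving as relevance values:

```python
import numpy as np

def ndcg_at_k(ranking, relevance, k=5):
    """ranking: candidate indices ordered from best to worst by the model;
    relevance: gold relevance score (here, ROUGE-L) of each candidate."""
    def dcg(scores):
        return sum(s / np.log2(i + 2) for i, s in enumerate(scores[:k]))
    predicted = [relevance[i] for i in ranking]
    ideal = sorted(relevance, reverse=True)
    return dcg(predicted) / dcg(ideal)
```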
The results in the top half of Table 4 show that with only 10 user interactions, most methods are unable to improve performance over pretrained COALA. UNC, UNPA, EIG, and TP are outperformed by random selection and IMP (p ≪ .01 using a two-tailed Wilcoxon signed-rank test).
Table 4: cQA results (acc = accuracy of the top-ranked answer, N5 = NDCG@5).

| Learner | Strategy | Apple acc | Apple N5 | Cooking acc | Cooking N5 | Travel acc | Travel N5 |
|---|---|---|---|---|---|---|---|
| COALA | | .318 | .631 | .478 | .696 | .533 | .717 |
| COALA prior, #interactions = 10 | | | | | | | |
| BT | random | .272 | .589 | .368 | .614 | .410 | .644 |
| BT | UNC | .233 | .573 | .308 | .597 | .347 | .619 |
| GPPL | random | .352 | .642 | .489 | .699 | .556 | .722 |
| GPPL | UNPA | .290 | .591 | .392 | .631 | .476 | .656 |
| GPPL | EIG | .302 | .628 | .372 | .671 | .469 | .692 |
| GPPL | TP | .274 | .592 | .353 | .636 | .414 | .675 |
| GPPL | IMP | .615 | .714 | .750 | .753 | .784 | .774 |
| BERT-cQA | | .401 | .580 | .503 | .625 | .620 | .689 |
| BERT-cQA prior, #interactions = 10 | | | | | | | |
| BT | random | .170 | .626 | .228 | .637 | .315 | .676 |
| BT | UNC | .129 | .580 | .181 | .583 | .326 | .618 |
| GPPL | random | .407 | .593 | .510 | .594 | .631 | .657 |
| GPPL | EIG | .080 | .559 | .140 | .552 | .095 | .526 |
| GPPL | IMP | .614 | .715 | .722 | .731 | .792 | .744 |
To see whether the methods improve given more feedback, Figure 3 plots NDCG@5 against the number of interactions. Whereas IMP performance increases substantially, random selection improves only very slowly. Early interactions cause a performance drop with UNPA, EIG, and TP. This is unlikely to be caused by noise in the cQA data, because preference labels are generated using ROUGE-L scores computed against the gold answer. Rather, the drop occurs because uncertainty-based methods initially sample many low-quality candidates with high uncertainty. Each preference label increases the predicted utility of the preferred candidate in the pair, sometimes above that of better candidates ranked higher by the prior, pushing them out of the top five. Performance rises once the uncertainty of mediocre candidates has been reduced and stronger candidates are selected. Both BT methods start from a worse initial position but improve consistently, as their initial samples are not biased by the prior predictions, although UNC remains worse than random.
Figure 3: NDCG@5 with increasing number of interactions, COALA prior, mean across 3 cQA topics.
The bottom half of Table 4 and Figure 4 show results for key methods with BERT-cQA priors and embeddings. The initial predictions by BERT-cQA have higher accuracy than COALA but lower NDCG@5. BERT-based models better account for question and answer semantics, leading to higher accuracy, but place less emphasis on lexical similarity, which reduces the ROUGE-L scores of top-ranked answers and consequently, NDCG@5. While IMP remains the most successful method, the end result is not a clear improvement over COALA, with a collapse in accuracy for the uncertainty-based EIG and both BT methods. As with COALA, these uncertainty-based methods focus initially on middling candidates, but due to the sparsity of the data with high-dimensional BERT-cQA embeddings, more samples are required to reduce their uncertainty before these methods start to sample strong candidates. The flexibility of the GP model means that it is particularly affected by data sparsity, hence the poor performance of EIG.
Figure 4: NDCG@5 with increasing number of interactions, BERT-cQA prior, mean across 3 cQA topics.
5.3 Interactive Summary Rating
We apply interactive learning to refine a ranking over candidate summaries given prior information. For each topic, we create 10,000 candidate summaries with fewer than 100 words each, which are constructed by uniformly selecting sentences at random from the input documents. To determine whether some strategies benefit from more samples, we test each active learning method with between 10 and 100 user interactions with noisy simulated users. The method is fast enough for interactive scenarios: on a standard Intel desktop workstation with a quad-core CPU and no GPU, updates to GPPL at each iteration require around one second.
We evaluate the quality of the 100 highest-ranked summaries using NDCG@1%, and compute the Pearson correlation, r, between the predicted utilities for all candidates and the combined ROUGE scores (Eq. (8)). Unlike NDCG@1%, r does not focus on higher-ranked candidates but considers the utilities for all candidates. Hence we do not expect that IMP or TP, which optimize the highest-ranked candidates, will have the highest r.
With REAPER priors, bigram+ features, and 20 interactions, the top part of Table 5 shows a clear advantage to IMP in terms of NDCG@1%, which outperforms the previous state of the art, BT-UNC (significant with p ≪ .01 on all datasets). In terms of r, IMP is outperformed by TP (significant with p ≪ .01 on all datasets), which appears more balanced between finding the best candidate and learning the ratings for all candidates. UNPA improves slightly over random sampling on both metrics, while EIG is stronger due to a better focus on epistemic uncertainty. Unlike IMP, TP does not always outperform EIG on NDCG@1%.
Table 5: Interactive summary rating results (N1 = NDCG@1%, r = Pearson correlation).

| Learner | Strategy | DUC’01 N1 | DUC’01 r | DUC’02 N1 | DUC’02 r | DUC’04 N1 | DUC’04 r |
|---|---|---|---|---|---|---|---|
| REAPER | | .539 | .262 | .573 | .278 | .597 | .322 |
| REAPER prior, bigram+ features, #interactions = 20 | | | | | | | |
| BT | rand. | .596 | .335 | .626 | .358 | .659 | .408 |
| BT | UNC | .609 | .340 | .641 | .365 | .674 | .415 |
| GPPL | rand. | .558 | .248 | .590 | .266 | .603 | .289 |
| GPPL | UNPA | .592 | .307 | .635 | .370 | .648 | .397 |
| GPPL | EIG | .634 | .327 | .665 | .383 | .675 | .404 |
| GPPL | TP | .629 | .378 | .665 | .403 | .690 | .453 |
| GPPL | IMP | .654 | .303 | .694 | .345 | .702 | .364 |
| SUPERT | | .602 | .382 | .624 | .400 | .657 | .438 |
| SUPERT prior, bigram+ features, #interactions = 20 | | | | | | | |
| BT | rand. | .633 | .415 | .654 | .438 | .684 | .483 |
| BT | UNC | .550 | .277 | .561 | .287 | .588 | .334 |
| GPPL | rand. | .601 | .351 | .630 | .377 | .657 | .419 |
| GPPL | EIG | .633 | .365 | .662 | .399 | .671 | .435 |
| GPPL | TP | .649 | .417 | .668 | .437 | .698 | .479 |
| GPPL | IMP | .653 | .322 | .696 | .374 | .717 | .407 |
| SUPERT prior, SUPERT embeddings, #interactions = 20 | | | | | | | |
| GPPL | IMP | .624 | .297 | .630 | .284 | .653 | .339 |
| SUPERT prior, bigram+ features, #interactions = 100 | | | | | | | |
| GPPL | IMP | .668 | .308 | .788 | .466 | .815 | .521 |
| SUPERT prior, SUPERT embeddings, #interactions = 100 | | | | | | | |
| BT | rand. | .661 | .466 | .696 | .504 | .727 | .543 |
| BT | UNC | .634 | .420 | .656 | .453 | .678 | .495 |
| GPPL | rand. | .594 | .354 | .617 | .387 | .643 | .415 |
| GPPL | EIG | .611 | .372 | .647 | .415 | .682 | .471 |
| GPPL | IMP | .728 | .376 | .752 | .407 | .769 | .447 |
Figure 5 shows the progress of each method with increasing numbers of interactions on DUC’01. The slow progress of the BT baselines is clear, illustrating the advantage the Bayesian methods have as a basis for active learning by incorporating uncertainty estimates and prior predictions.
Figure 5: DUC’01, REAPER prior, bigram+ features, changes in NDCG@1% with increasing number of interactions.
The lower part of Table 5 and Figure 6 confirm the superior NDCG@1% scores of IMP with the stronger SUPERT priors. However, while pretrained SUPERT outperforms REAPER, the results after 20 rounds of interaction with bigram+ features are almost identical, suggesting that user feedback helps mitigate the weaker pretrained model. With only 20 interactions, bigram+ features work better than SUPERT embeddings as input to our interactive learners, even with the best-performing method, IMP, since there are fewer features and the model can cope better with limited labeled data. With 100 interactions, SUPERT embeddings provide superior performance as there are sufficient labels to leverage the richer input embeddings.
Figure 6: DUC’01, SUPERT prior, changes in NDCG@1% with increasing number of interactions.
5.4 RL for Summarization
We now investigate whether our approach also improves performance when the ranking function is used to provide rewards for a reinforcement learner. Our hypothesis is that it does not matter whether the rewards assigned to bad candidates are correct, as long as they are distinguished from good candidates, as this will prevent the policy from choosing bad candidates to present to the user.
To test the hypothesis, we simulate a flat-bottomed reward function for summarization on DUC’01: First, for each topic, we set the rewards for the 10,000 sampled summaries (see § 5.3) to the gold standard, Rcomb (Eq. (8), normalized to [0,10]). Then, we set the rewards for a varying percentage of the lowest-ranked summaries to 1.0 (the flat bottom). We train the reinforcement learner on the flat-bottomed rewards and plot ROUGE scores for the proposed summaries in Figure 7. The performance of the learner actually increases as candidate values are flattened until around 90% of the summaries have the same value. This supports our hypothesis that the user’s labeling effort should be spent on the top candidates.
Figure 7: Performance of RL on DUC’01 when the rewards for the bottom x% of summaries are flattened to one. Dashed line = ROUGE-2, solid line = ROUGE-L.
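A sketch of the flattening step used in this simulation, where percent_flattened corresponds to the x-axis of Figure 7:

```python
import numpy as np

def flatten_bottom(gold_rewards, percent_flattened):
    """Set the rewards of the lowest-ranked x% of candidate summaries to 1.0,
    keeping the gold rewards (normalized to [0, 10]) for the remainder."""
    rewards = np.asarray(gold_rewards, dtype=float).copy()
    threshold = np.percentile(rewards, percent_flattened)
    rewards[rewards <= threshold] = 1.0
    return rewards
```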
We now use the ranking functions learned in the previous summary rating task as rewards for reinforcement learning. As examples, we take the rankers learned using SUPERT priors with bigram+ features and 20 interactions and with SUPERT embeddings and 100 interactions. We replicate the RL setup of Gao et al. (2018) for interactive multidocument summarization, which previously achieved state-of-the-art performance using the BT learner with UNC. The RL agent models the summarization process as follows: there is a current state, represented by the current draft summary; the agent uses a policy to select a sentence to be concatenated to the current draft summary or to terminate the summary construction. During the learning process, the agent receives a reward after terminating, which it uses to update its policy to maximize these rewards. The model is trained for 5,000 episodes (i.e., generating 5,000 summaries and receiving their rewards), then the policy is used to produce a summary. We compare the produced summary to a human-generated model summary using ROUGE. By improving the reward function, we hypothesize that the quality of the resulting summary will also improve.
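The episodic structure of this RL setup can be sketched as follows; the policy interface is illustrative, and the learned GPPL or BT utility plays the role of reward_fn.

```python
def run_episode(policy, sentences, reward_fn, max_words=100):
    """One episode: the state is the current draft summary; the policy appends
    sentences until it terminates, then receives a single reward for the draft."""
    draft = []
    while True:
        action = policy.select(draft, sentences)   # sentence index, or None to stop
        if action is None:
            break
        draft.append(sentences[action])
        if sum(len(s.split()) for s in draft) >= max_words:
            break
    reward = reward_fn(draft)                       # learned utility of the summary
    policy.update(draft, reward)                    # e.g., a policy-gradient step
    return draft, reward

# Training runs 5,000 such episodes; the final policy then produces the summary.
```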
Table 6 shows that the best-performing method from the previous tasks, IMP, again produces a strong improvement over the previous state of the art, BT with UNC (significant with p ≪ 0.01 in all cases), as well as over GPPL with EIG. With 20 interactions and bigram+ features, EIG also outperforms BT-UNC, indicating the benefits of the Bayesian approach, but this is less clear with SUPERT embeddings, where the high-dimensional embedding space may lead to sparsity problems for the GP. The standard deviation in performance over multiple runs of RL is <0.004 for all metrics, datasets, and methods, suggesting that the advantage gained by using IMP is robust to randomness in the RL algorithm. The results confirm that the gains in NDCG@1% made by BO over uncertainty-based strategies when learning the utilities translate to better summaries produced by reinforcement learning in a downstream task.
Table 6: ROUGE scores of summaries produced by RL with different reward functions (R1 = ROUGE-1, R2 = ROUGE-2, RL = ROUGE-L, RSU4 = ROUGE-SU4).

| #interactions | Learner | Features | Strategy | DUC’01 R1 | R2 | RL | RSU4 | DUC’02 R1 | R2 | RL | RSU4 | DUC’04 R1 | R2 | RL | RSU4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SUPERT | N/A | none | .324 | .061 | .252 | .097 | .345 | .070 | .270 | .107 | .375 | .086 | .293 | .128 |
| 20 | BT | bigram+ | UNC | .335 | .072 | .265 | .104 | .364 | .086 | .286 | .120 | .390 | .101 | .307 | .136 |
| 20 | GPPL | bigram+ | rand. | .324 | .064 | .252 | .097 | .358 | .081 | .281 | .115 | .383 | .095 | .302 | .131 |
| 20 | GPPL | bigram+ | EIG | .346 | .073 | .269 | .110 | .377 | .095 | .295 | .126 | .394 | .106 | .310 | .137 |
| 20 | GPPL | bigram+ | IMP | .355 | .086 | .277 | .114 | .385 | .103 | .300 | .130 | .419 | .122 | .331 | .154 |
| 100 | BT | SUPERT | UNC | .337 | .072 | .264 | .104 | .366 | .086 | .284 | .118 | .377 | .090 | .297 | .128 |
| 100 | GPPL | SUPERT | rand. | .317 | .057 | .247 | .092 | .344 | .071 | .270 | .107 | .372 | .087 | .292 | .124 |
| 100 | GPPL | SUPERT | EIG | .331 | .070 | .259 | .101 | .367 | .088 | .287 | .120 | .394 | .103 | .309 | .136 |
| 100 | GPPL | SUPERT | IMP | .370 | .100 | .293 | .123 | .406 | .118 | .316 | .140 | .422 | .130 | .337 | .155 |
5.5 Limitations of User Simulations
By testing our interactive process with simulated users, we were able to compare numerous methods with a fixed labeling error rate. The user labels were simulated using data from real individuals: the gold answers for cQA were chosen by the user who posed the question, and the three model summaries for each topic in the DUC datasets were each authored by a different individual. While this work shows the promise of BO, further work is needed to test specific NLP applications with real end users. Our experiments illustrate plausible applications where users compare texts of up to 100 words and gain substantial performance advantages. Other applications require a broader study of reading and labeling time versus performance benefits and user satisfaction. It may also be possible to select chunks of longer documents for the user to compare, rather than requiring whole documents to be read.
Another dimension to consider is that real users may make systematic, rather than random errors. However, in the applications we foresee, it is accepted that their preference labels will often diverge from any established gold standard, as users adapt models to their own information needs. Future work may therefore apply interactive approaches to more subjective NLP tasks, such as adapting a summary to more personal information needs.
6 Conclusions
We proposed a novel approach to interactive text ranking that uses Bayesian optimization (BO) to identify top-ranked texts by acquiring pairwise feedback from a user and applying Gaussian process preference learning (GPPL). Our experiments showed that BO significantly improves the accuracy of answers chosen in a cQA task with small amounts of feedback, and leads to summaries that better match human-generated model summaries when used to learn a reward function for reinforcement learning.
Of two proposed Bayesian optimization strategies, we found that expected improvement (IMP) outperforms Thompson sampling (TP) if the goal is to optimize the proposed best solution. TP may require a larger number of interactions due to its random sampling step. IMP is effective in both cQA and summarization tasks, but has the strongest impact on cQA with only 10 interactions. This may be due to the greater sparsity of candidates in cQA (100 versus 10,000 for summarization), which allows them to be more easily discriminated by the model, given good training examples. Further evaluation with real users is required to gauge the quantity of feedback needed in a particular domain.
When using high-dimensional BERT embeddings as inputs, GPPL requires more labels to achieve substantial improvements. Future work may therefore investigate recent dimensionality reduction methods (Raunak et al., 2019). We found that performance improves when including prior predictions as the GPPL prior mean but it is unclear how best to estimate confidence in the prior predictions—here we assume equal confidence in all prior predictions. Future work could address this by adapting the GPPL prior covariance matrix to kick-start BO. The method is also currently limited to a single set of prior predictions: In future we intend to integrate predictions from several models.
Acknowledgments
This work was supported by the German Research Foundation (DFG) as part of the EVIDENCE project (grant GU 798/27-1) and the PEER project (grant GU 798/28-1). We would like to thank the reviewers and journal editors for their extremely helpful feedback.