We consider a task assignment problem in crowdsourcing, which is aimed at collecting as many reliable labels as possible within a limited budget. A challenge in this scenario is how to cope with the diversity of tasks and the task-dependent reliability of workers; for example, a worker may be good at recognizing the names of sports teams but not be familiar with cosmetics brands. We refer to this practical setting as heterogeneous crowdsourcing. In this letter, we propose a contextual bandit formulation for task assignment in heterogeneous crowdsourcing that is able to deal with the exploration-exploitation trade-off in worker selection. We also theoretically investigate the regret bounds for the proposed method and demonstrate its practical usefulness experimentally.
The quantity and quality of labeled data significantly affect the performance of machine learning algorithms. However, collecting reliable labels from experts is usually expensive and time-consuming. The recent emergence of crowdsourcing services such as Amazon Mechanical Turk (MTurk, https://www.mturk.com/mturk/) enables us to cheaply collect huge quantities of labeled data from crowds of workers for many machine learning tasks, for example, natural language processing (Snow, Connor, Jurafsky, & Ng, 2008) and computer vision (Welinder & Perona, 2010). In a crowdsourcing system, a requester asks workers to complete a set of labeling tasks by paying them a tiny amount of money for each label.
The primary interest of crowdsourcing research has been how to cope with the different reliability of workers and aggregate the collected noisy labels (Dawid & Skene, 1979; Smyth, Fayyad, Burl, Perona, & Baldi, 1994; Ipeirotis, Provost, & Wang, 2010; Raykar et al., 2010; Welinder, Branson, Belongie, & Perona, 2010; Yan et al., 2010; Kajino, Tsuboi, & Kashima, 2012; Liu, Peng, & Ihler, 2012; Zhou, Basu, Mao, & Platt, 2012). Usually, a weighted voting mechanism is implicitly or explicitly used for label aggregation, with workers’ reliability as weights. Many existing methods use expectation-maximization (EM) (Dempster, Laird, & Rubin, 1977) on static data sets of the collected labels to jointly estimate workers’ reliability and true labels. However, how to adaptively collect these labels is often neglected. Since the total budget for a requester to pay the workers is usually limited, it is necessary to consider how to intelligently use the budget to assign tasks to workers.
This leads to another important line of crowdsourcing research, the task routing or task assignment problem. There are two classes of task assignment methods: push and pull. In pull methods, the system takes a passive role and only sets up the environment for workers to find tasks themselves; in push methods, the system takes complete control over which tasks are assigned to whom (Law & von Ahn, 2011). In this letter, we focus on push methods and refer to them henceforth as task assignment methods. Most of the existing task assignment methods run in an online mode, simultaneously learning workers’ reliability and collecting labels (Donmez, Carbonell, & Schneider, 2009; Chen, Lin, & Zhou, 2013; Ertekin, Rudin, & Hirsh, 2014). To deal with the exploration (i.e., learning which workers are reliable) and exploitation (i.e., selecting the workers considered to be reliable) trade-off in worker selection, IEThresh (Donmez et al., 2009) and CrowdSense (Ertekin et al., 2014) dynamically sample worker subsets according to workers’ labeling performances. However, this is not enough in recent heterogeneous crowdsourcing where a worker may be reliable at only a subset of tasks with a certain type. For example, the workers considered to be good at previous tasks in IEThresh and CrowdSense may be bad at next ones. Therefore, it is more reasonable to model task-dependent reliability for workers in heterogeneous crowdsourcing. Another issue in IEThresh and CrowdSense is that the budget is not prefixed. That is, the requester will not know the total budget until the task assignment process ends. This makes those two methods not so practical for crowdsourcing. OptKG (Chen et al., 2013) runs within a prefixed budget and formulates the task assignment problem as a Markov decision process (MDP). However, it is difficult to give theoretical guarantees for OptKG when heterogeneous workers are involved.
In recent crowdsourcing markets, as the heterogeneity of tasks is increasing, many researchers have started to focus on heterogeneous crowdsourcing. Goel, Nikzad, and Singla (2014) studied the problem of incentive-compatible mechanism design for heterogeneous markets. The goal is to properly price the tasks for worker truthfulness and maximize the requester utility under the financial constraint. Ho and Vaughan (2012) and Ho, Jabbari, and Vaughan (2013) studied task assignment in heterogeneous crowdsourcing. However, they considered a variant of the problem setting in which workers arrive online and the requester must assign a task (or sequence of tasks) to each new worker as he or she arrives (Slivkins & Vaughan, 2014). In our problem setting, the requester completely controls which task to pick and which worker to select at each step.
From a technical perspective, the problem setting most similar to ours is that of OptKG, where we can determine a task-worker pair at each step. For the purpose of extensive comparison, we also include two heuristic methods, IEThresh and CrowdSense, as well as OptKG in our experiments. These three task assignment methods are detailed in section 4.
We propose a contextual bandit formulation for task assignment in heterogeneous crowdsourcing. Our method models task-dependent reliability for workers by using a weight that depends on the context of a given task. Here, context can be interpreted as the type or required skill of a task. For label aggregation, we adopt weighted voting, a common solution used for aggregating noisy labels in crowdsourcing. Our method consists of two phases: the pure exploration phase and the adaptive assignment phase. In the pure exploration phase, we explore workers’ reliability in a batch mode and initialize their weights as the input of the adaptive assignment phase. The adaptive assignment phase includes a bandit-based strategy, where we sequentially select a worker for a given labeling task with the help of the exponential weighting scheme (Cesa-Bianchi & Lugosi, 2006; Arora, Hazan, & Kale, 2012), a standard tool for bandit problems. The whole method runs within a limited budget. Moreover, we also investigate the regret bounds of our strategy theoretically and demonstrate its practical usefulness experimentally.
The rest of this letter is organized as follows. In section 2, we describe our proposed bandit-based task assignment method for heterogeneous crowdsourcing. Then we theoretically investigate its regret bounds in section 3. For comparison, we look into the details of the existing task assignment methods for crowdsourcing in section 4 and experimentally evaluate them together with the proposed method in section 5. We present our conclusions in section 6.
2 Bandit-Based Task Assignment
In this section, we describe our proposed bandit-based task assignment (BBTA) method.
2.1 Problem Formulation
Suppose we have N unlabeled tasks, indexed 1, ..., N, each characterized by a context s from a given set of S contexts. Every task has an unknown true label. Each time we are given a task, we ask one worker from a pool of K workers for a (possibly noisy) label, consuming one unit of the total budget T. Our goal is to find a suitable task-worker assignment that collects as many reliable labels as possible within the limited budget T. Finally, we aggregate the collected labels to estimate the true labels.
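The aggregation step can be illustrated with a minimal sketch of weighted voting. It assumes binary labels encoded as -1 and +1 and a nonnegative reliability weight per worker; the function name `weighted_vote`, the encoding, and the tie-breaking rule are illustrative choices, not details taken from the method itself.

```python
def weighted_vote(labels, weights):
    """Aggregate the labels collected for one task by weighted voting.

    labels  -- dict: worker id -> label in {-1, +1} (assumed encoding)
    weights -- dict: worker id -> nonnegative reliability weight
    Returns (estimated label, signed score).
    """
    score = sum(weights[w] * y for w, y in labels.items())
    # A tie (score == 0) is broken toward +1; this is an arbitrary choice.
    estimate = 1 if score >= 0 else -1
    return estimate, score


# Three workers label one task; worker "c" carries the largest weight.
labels = {"a": +1, "b": +1, "c": -1}
weights = {"a": 0.2, "b": 0.3, "c": 0.9}
estimate, score = weighted_vote(labels, weights)
```

In this example the single highly weighted worker outvotes the two weakly weighted ones, which is the behavior that distinguishes weighted voting from a plain majority vote.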
Our proposed method consists of the pure exploration phase and the adaptive assignment phase. The pseudocode is given in algorithm 1, and the details are explained below.
2.2 Pure Exploration Phase
Pure exploration is performed in batch mode, and its purpose is to learn which workers are reliable at which labeling tasks. To this end, we pick tasks for each of the S distinct contexts and let all K workers label them (lines 11–12 in algorithm 1). For each context s, we keep the index set of the tasks picked in this phase.
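One way to turn the batch of exploration labels into initial per-context reliability estimates is sketched below. The sketch assumes labels in {-1, +1} and scores each worker by the number of disagreements with the unweighted majority vote on the explored tasks; this scoring rule and the name `pure_exploration` are assumptions made for illustration, since algorithm 1 is not reproduced here.

```python
from collections import defaultdict


def pure_exploration(explore_labels, task_context, num_workers):
    """Initialize per-context cumulative losses from the exploration batch.

    explore_labels -- dict: task id -> list of K labels in {-1, +1},
                      one label per worker (all workers label every task)
    task_context   -- dict: task id -> context of that task
    Returns dict: context -> list of K cumulative losses, where a loss of 1
    is charged whenever a worker disagrees with the majority vote (assumed rule).
    """
    cum_loss = defaultdict(lambda: [0.0] * num_workers)
    for task, labels in explore_labels.items():
        majority = 1 if sum(labels) >= 0 else -1
        context = task_context[task]
        for worker, y in enumerate(labels):
            if y != majority:
                cum_loss[context][worker] += 1.0
    return dict(cum_loss)


# Two contexts, two explored tasks per context, three workers.
explore_labels = {0: [1, 1, -1], 1: [-1, -1, 1], 2: [1, -1, 1], 3: [-1, 1, -1]}
task_context = {0: "sports", 1: "sports", 2: "makeup", 3: "makeup"}
init_loss = pure_exploration(explore_labels, task_context, num_workers=3)
```

Here the third worker accumulates losses only on the sports tasks and the second worker only on the makeup tasks, so the initial estimates already differ by context.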
2.3 Adaptive Assignment Phase
In the adaptive assignment phase, task-worker assignment is determined for the remaining tasks in an online mode within the remaining budget T2. At each step t of this phase, to determine a task-worker pair, we need to further consider which task to pick and which worker to select for this task.
Given the picked task, selecting a worker reliable at this task is always favored. On the other hand, workers’ reliability is what we are dynamically learning in the method. There exists a trade-off between exploration (i.e., learning which worker is reliable) and exploitation (i.e., selecting the worker considered to be reliable) in worker selection. To address this trade-off, we formulate our task assignment problem as a multiarmed bandit problem, more specifically, a contextual bandit problem (Bubeck & Cesa-Bianchi, 2012).
Multiarmed bandit problems (Auer, Cesa-Bianchi, & Fischer, 2002; Auer, Cesa-Bianchi, Freund, & Schapire, 2002) are basic examples of sequential decision making with limited feedback and can naturally handle the exploration-exploitation trade-off. At each step, we allocate one unit of budget to one of a set of actions and obtain some observable reward (loss) given by the environment. The goal is to maximize (minimize) the cumulative reward (loss) over the whole allocation sequence. To achieve this goal, we must balance the exploitation of actions that were good in the past and the exploration of actions that may be better in the future. A practical extension of the basic bandit problem is the contextual bandit problem, where each step is marked by a context from a given set. Then the interest is in finding good mappings from contexts to actions rather than identifying good actions.
In most contextual bandits, the context (side information) is provided as a feature vector that influences the reward or loss at each step, whereas in the heterogeneous crowdsourcing setting, we consider the simple situation of contextual bandits, where the context is marked by some discrete value, corresponding to the task type. Then our interest is to find a good mapping from task types to appropriate workers rather than the best worker for all tasks. Note that the task types are observable and provided by the environment.
The adaptive assignment phase includes our strategy of worker selection, where selecting a worker corresponds to taking an action. The objective is to collect reliable labels via the strategy of adaptively selecting workers from the worker pool.
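A minimal sketch of context-wise worker selection with exponential weighting follows. It keeps one cumulative-loss vector per context and samples a worker with probability proportional to exp(-eta × loss). The class name, the learning rate `eta`, and the importance-weighted (Exp3-style) update are assumptions for illustration; the exact update used by BBTA is given in algorithm 1, which is not reproduced here.

```python
import math
import random


class ContextualExpWeights:
    """One exponential-weighting learner per context (illustrative sketch)."""

    def __init__(self, contexts, num_workers, eta, init_loss=None):
        self.eta = eta  # learning rate (assumed hyperparameter)
        self.cum_loss = {
            s: list(init_loss[s]) if init_loss else [0.0] * num_workers
            for s in contexts
        }

    def probabilities(self, context):
        losses = self.cum_loss[context]
        low = min(losses)  # shift by the minimum for numerical stability
        weights = [math.exp(-self.eta * (v - low)) for v in losses]
        total = sum(weights)
        return [w / total for w in weights]

    def select(self, context, rng=random):
        probs = self.probabilities(context)
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]

    def update(self, context, worker, loss):
        # Only the selected worker's loss is observed, so it is divided by
        # the selection probability to keep the estimate unbiased.
        prob = self.probabilities(context)[worker]
        self.cum_loss[context][worker] += loss / prob


bandit = ContextualExpWeights(
    ["sports", "makeup"], num_workers=3, eta=0.5,
    init_loss={"sports": [0.0, 0.0, 2.0], "makeup": [0.0, 2.0, 0.0]},
)
before = bandit.probabilities("sports")
bandit.update("sports", worker=0, loss=1.0)
after = bandit.probabilities("sports")
makeup_probs = bandit.probabilities("makeup")
chosen = bandit.select("sports", rng=random.Random(0))
```

Because the losses are stored per context, a loss observed on a sports task lowers a worker's selection probability for sports tasks only and leaves the makeup distribution unchanged.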
We further treat the scheme that each worker adopts to assign labels as a black box. The labeled status of all tasks can then change abruptly over time. Because we calculate confidence scores for all tasks based on their labeled status, the exact sequence of tasks in the adaptive assignment phase is unpredictable. Thus, we consider the task sequence as external information in this phase.
The above assignment step is repeated T2 times until the budget is used up.
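This section mentions confidence scores computed from the labeled status of tasks. One plausible way to use such scores, shown here purely as an assumption, is to take the absolute value of a task's weighted-vote score as its confidence and to pick the least confident task at each step. The function name `pick_least_confident` and the tie-breaking by task index are hypothetical.

```python
def pick_least_confident(task_scores):
    """Pick the task whose aggregated label is currently least certain.

    task_scores -- dict: task id -> signed weighted-vote score
                   (0.0 for a task that has no labels yet)
    Ties are broken by the smaller task id (arbitrary choice).
    """
    return min(task_scores, key=lambda i: (abs(task_scores[i]), i))


scores = {0: 1.2, 1: -0.1, 2: 0.8}
next_task = pick_least_confident(scores)
```

Under this rule, unlabeled tasks (score 0) are always served first, and tasks whose votes nearly cancel are revisited before tasks with a clear margin.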
3 Theoretical Analysis
In this section, we theoretically analyze the proposed bandit-based task assignment (BBTA) method.
The behavior of a bandit strategy is studied by means of regret analysis. Usually the performance of a bandit strategy is compared with that of the optimal one to show the “regret” for not following the optimal strategy. In our task assignment problem, we use the notion of regret to investigate how well the proposed strategy can select better workers from the entire worker pool by comparing our strategy with the optimal one.
The reason that we use regret as the evaluation measure is that the whole task assignment process is working in an online mode. From the perspective of a requester, the objective is to maximize the average accuracy of the estimated true labels with the constraint of the budget. Ideally this is possible when we have complete knowledge of the whole process (e.g., the reliability of each worker and the budget for each context). However, in the setting of task assignment, since we cannot know beforehand any information about the worker behaviors and the coming contexts in the future, it is not meaningful to try to maximize the average accuracy. Instead, the notion of regret can be used as an evaluation measure for the strategy of worker selection in the proposed method, which evaluates a relative performance loss compared with the optimal strategy. As a general objective for online problems (Hazan, 2011; Shalev-Shwartz, 2012), minimizing the regret is a common and reasonable approach to guaranteeing the performance.
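To make the notion concrete, the following sketch computes a regret value in a simulation where the loss of every worker at every step is known. It takes the comparator to be the best fixed worker for each context in hindsight; this choice of "optimal strategy" and the function name `contextual_regret` are assumptions for illustration and may differ from the formal definition used in the analysis.

```python
def contextual_regret(contexts, chosen, losses):
    """Cumulative loss of a strategy minus that of the best per-context worker.

    contexts -- list: context observed at each step
    chosen   -- list: worker index selected at each step
    losses   -- list: per-step list with the loss of every worker
                (full information, available only in simulation)
    """
    strategy_loss = sum(step[j] for step, j in zip(losses, chosen))
    per_context = {}
    for s, step in zip(contexts, losses):
        totals = per_context.setdefault(s, [0.0] * len(step))
        for j, value in enumerate(step):
            totals[j] += value
    best_loss = sum(min(totals) for totals in per_context.values())
    return strategy_loss - best_loss


# Worker 0 is perfect on context "a", worker 1 is perfect on context "b".
contexts = ["a", "a", "b", "b"]
losses = [[0, 1], [0, 1], [1, 0], [1, 0]]
alternating = contextual_regret(contexts, [0, 1, 0, 1], losses)
matched = contextual_regret(contexts, [0, 0, 1, 1], losses)
```

A strategy that ignores contexts and alternates workers pays a regret of 2 here, while the strategy that matches each context to its best worker pays none.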
We further denote as the set of indices for steps with context s and as the total appearance count of context s in the adaptive assignment phase. The following theorem shows the regret bound of BBTA:
The proof of lemma 2 is provided in the appendix.
The pure exploration phase is necessary for obtaining some knowledge of workers’ reliability before adaptively assigning tasks, and the number of tasks picked per context controls the length of this phase. In theorem 1, it is easy to see that a longer pure exploration phase makes the term corresponding to the adaptive assignment phase smaller but makes the term related to the pure exploration phase larger. Moreover, a longer pure exploration phase also consumes a larger proportion of the total budget. The pure exploration phase is effective (as we will see in section 5), but how effective it can be depends not only on the choice of its length (even if we could obtain the optimal length that minimizes the bound) but also on many other factors, such as the true accuracy of workers and how different their labeling performances are from each other, which are, unfortunately, unknown in advance. A practical choice, used in our experiments in section 5, is to set a very small exploration length.
4 Review of Existing Task Assignment Methods for Crowdsourcing
In this section, we review the existing task assignment methods for crowdsourcing and discuss similarities to and differences from the proposed method.
Similar to IEThresh, CrowdSense (Ertekin et al., 2014) also dynamically samples worker subsets in an online mode. The criterion that CrowdSense uses for handling exploration-exploitation in worker selection is described as follows.
4.3 Optimistic Knowledge Gradient
Optimistic knowledge gradient (OptKG) (Chen et al., 2013) uses an N-coin-tossing model, formulates the task assignment problem as a Bayesian Markov decision process (MDP), and obtains the optimal allocation sequence for any finite budget T via a computationally efficient approximate policy.
Although IEThresh and CrowdSense employ different criteria for dealing with the exploration-exploitation trade-off in worker selection, they share a similar mechanism of dynamically sampling worker subsets. In particular, at each step, a new task comes, workers’ reliability is learned according to their labeling performances on previous tasks, and a subset of workers considered to be reliable is sampled for labeling the new task. However, in heterogeneous crowdsourcing, a worker in the subset who is reliable at previous tasks may be bad at new ones, and the worker who is good at new tasks may have already been eliminated from the subset. Therefore, it is reasonable to model the task-dependent reliability for workers and then match each task to workers who can do it best.
Another issue in IEThresh and CrowdSense is that the exact amount of total budget is not prefixed. In these two methods, a threshold parameter is used for controlling the size of worker subsets; that is, the threshold determines the total budget. However, the number of workers in each subset is unknown beforehand. Therefore, we will not know the exact amount of total budget until the whole task assignment process ends. This is not so practical in crowdsourcing, since in the task assignment problem, we pursue not only collecting reliable labels but also intelligently using the prefixed budget. Moreover, both IEThresh and CrowdSense lack theoretical analyses of the relation between the budget and the performance.
OptKG and the proposed method (BBTA) attempt to intelligently use a prefixed budget and ask one worker for a label of the current task at each step. OptKG formulates the task assignment problem as a Bayesian MDP and is proved to produce a consistent policy in a homogeneous worker setting (i.e., the policy will achieve 100% accuracy almost surely when the total budget goes to infinity). However, when the heterogeneous reliability of workers is introduced, a more sophisticated approximation is also involved in OptKG, making it difficult to give a theoretical analysis for OptKG in a heterogeneous worker setting. On the other hand, BBTA is a contextual bandit formulation designed for heterogeneous crowdsourcing, and the regret analysis demonstrates that the performance of our task assignment strategy will converge to that of the optimal one when the total budget goes to infinity.
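The convergence statement can be read in the usual sense of sublinear regret. As a sketch, writing $R_T$ for the regret accumulated after spending a budget $T$ (notation assumed here for illustration, not copied from theorem 1), sublinear growth of the regret implies that the average per-step gap to the optimal strategy vanishes:

```latex
R_T = o(T) \quad\Longrightarrow\quad \lim_{T \to \infty} \frac{R_T}{T} = 0 .
```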
5 Experiments
In this section, we experimentally evaluate the usefulness of the proposed BBTA method. To compare BBTA with existing methods, we first conduct experiments on benchmark data with simulated workers and then use real data for further comparison. All of the experimental results are averaged over 30 runs.
5.1 Benchmark Data
We perform experiments on three popular UCI benchmark data sets (http://archive.ics.uci.edu/ml/): ionosphere (), breast (), and pima (). We consider instances in these data sets as labeling tasks in crowdsourcing. True labels of all tasks in these data sets are available. To simulate various heterogeneous cases in the real world, we first use k-means to cluster these three data sets into subsets, respectively (corresponding to different contexts). Since there are no crowd workers in these data sets, we simulate workers (, respectively) by using the following worker models in a heterogeneous setting:
Spammer-Hammer model. A hammer gives true labels, while a spammer gives random labels (Karger, Oh, & Shah, 2011). We introduce this model into a heterogeneous setting: each worker is a hammer on one subset of tasks but a spammer on others.
One-coin model. Each worker gives true labels with a given probability (i.e., accuracy). This model is widely used in the crowdsourcing literature (e.g., Raykar et al., 2010; Chen et al., 2013) for simulating workers. We use this model in a heterogeneous setting: each worker gives true labels with higher accuracy (we set it to 0.9) on one subset of tasks but with lower accuracy (we set it to 0.6) on others.
One-coin model (malicious). This model is based on the previous one, except that we add more malicious labels: each worker is good at one subset of tasks (accuracy: 0.9), malicious or bad at another one (accuracy: 0.3), and normal at the rest (accuracy: 0.6).
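The three worker models above can be sketched as context-dependent accuracy functions. The sketch assumes binary labels in {-1, +1}, so a spammer who answers uniformly at random is correct with probability 0.5; the function names `make_worker` and `draw_label` and the model identifiers are illustrative.

```python
import random


def make_worker(model, good_context, bad_context=None):
    """Return a function mapping a context to the probability of a correct label."""
    def accuracy(context):
        if model == "spammer-hammer":
            # Hammer on one subset of tasks, spammer (random labels) elsewhere.
            return 1.0 if context == good_context else 0.5
        if model == "one-coin":
            return 0.9 if context == good_context else 0.6
        if model == "one-coin-malicious":
            if context == good_context:
                return 0.9
            if context == bad_context:
                return 0.3
            return 0.6
        raise ValueError(f"unknown worker model: {model}")
    return accuracy


def draw_label(accuracy, context, true_label, rng):
    """Sample one label in {-1, +1}: correct with probability accuracy(context)."""
    return true_label if rng.random() < accuracy(context) else -true_label


rng = random.Random(0)
hammer = make_worker("spammer-hammer", good_context=0)
malicious = make_worker("one-coin-malicious", good_context=0, bad_context=1)
hammer_labels = [draw_label(hammer, 0, +1, rng) for _ in range(50)]
```

A worker built this way always answers correctly on its own context under the spammer-hammer model, while the malicious one-coin worker is worse than chance on its bad context.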
With the generated labels from simulated workers, we can calculate the true accuracy for each worker by checking the consistency with the true labels. Figure 1 illustrates the counts of simulated workers with the true accuracy falling in the associated interval (e.g., 0.65 represents that the true accuracy is between 60% and 65%). It is easy to see that the spammer-hammer model and the one-coin model (malicious) create more adversarial environments than the one-coin model.
We compare BBTA with three state-of-the-art task assignment methods, IEThresh, CrowdSense, and OptKG, in terms of accuracy. Accuracy is calculated as the proportion of correct estimates for true labels. Also, since the budget is limited, we expect that a task assignment method can achieve its highest accuracy as fast as possible as the budget increases (high convergence speed). We set for BBTA to see the effectiveness of the pure exploration phase. We also implement a naive baseline for comparison: we randomly select a task-worker pair at each step and use a majority voting mechanism for label aggregation. Accuracy of all methods is compared at different levels of budget. For BBTA, OptKG, and the naive baseline, we set the maximum amount of budget at . Since the budget is not prefixed in IEThresh and CrowdSense, we carefully select the threshold parameters for them, which affect the consumed budgets. Additionally, we try to introduce state-of-the-art methods (designed for the homogeneous setting) into the heterogeneous setting. Specifically, we split the total budget and allocate a subbudget to a context in proportion to the number of tasks with this context. In particular, for context s, we allocate the subbudget . Then we can run an instance of a homogeneous method for each context within the associated subbudget. Since OptKG has the most similar problem setting to that of the proposed method, it is straightforward to run multiple instances of OptKG with a prefixed subbudget for each context. However, for the two heuristic methods IEThresh and CrowdSense, it is difficult to figure out how to use them in this way, since the budget could not be predetermined in their settings.
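The proportional split of the total budget across contexts can be sketched as follows. Each context receives a share of the budget equal to its fraction of the tasks; the rounding rule (largest fractional remainder first) and the name `split_budget` are assumptions added so that the subbudgets are integers that sum exactly to the total.

```python
def split_budget(total_budget, tasks_per_context):
    """Allocate an integer subbudget to each context in proportion to its task count.

    tasks_per_context -- dict: context -> number of tasks with that context
    """
    num_tasks = sum(tasks_per_context.values())
    shares = {s: total_budget * c / num_tasks for s, c in tasks_per_context.items()}
    sub = {s: int(v) for s, v in shares.items()}  # round down first
    leftover = total_budget - sum(sub.values())
    # Hand the remaining units to the contexts with the largest remainders.
    by_remainder = sorted(shares, key=lambda s: shares[s] - sub[s], reverse=True)
    for s in by_remainder[:leftover]:
        sub[s] += 1
    return sub


even = split_budget(100, {"a": 50, "b": 30, "c": 20})
uneven = split_budget(10, {"a": 1, "b": 1, "c": 1})
```

Each resulting subbudget can then be handed to a separate instance of a homogeneous method, one per context.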
Figure 2 shows the averages and standard errors of accuracy as functions of budgets for all methods in nine cases (i.e., three data sets with three worker models). As we can see, BBTA with works better than that with , indicating that the pure exploration phase helps in improving the performance. It is also shown that BBTA () outperforms other methods in all six cases with the spammer-hammer model and the one-coin model (malicious). This demonstrates that BBTA can handle spamming or malicious labels better than others in more adversarial heterogeneous environments. For the three cases with the one-coin model where there are more reliable labels, almost all methods have good performance. Nevertheless, IEThresh performs poorly in all cases, even worse than the naive baseline. The reason is that IEThresh samples a subset of reliable workers at each step by calculating the upper confidence intervals of workers based on their labeling performances on previous tasks. In heterogeneous settings, however, a worker reliable at previous tasks may be poor at next ones. This makes IEThresh learn workers’ reliability incorrectly, resulting in poor sampling of worker subsets. Although CrowdSense also adopts the mechanism of dynamically sampling worker subsets, its exploration-exploitation criterion gives it a chance of randomly selecting some workers who may be reliable at the next tasks. For OptKG, not surprisingly, OptKG (Multi.), which is aware of contexts, outperforms the original OptKG. The tendency of either of them implies that they may achieve the best accuracy as the budget goes to infinity, but the convergence speed is shown to be slower than those of BBTA and CrowdSense. In crowdsourcing, it is important to achieve the best accuracy as fast as possible, especially when the budget is limited.
We then run BBTA with changing to see how affects the performance. We set , and the results are shown in Figure 3. It can be seen that without the pure exploration phase (i.e., ), the performance is the worst in all nine cases. When we add the pure exploration phase (), performance improves. However, we are unable to conclude that the larger is, the better the performance is (e.g., does not always make the method achieve its highest accuracy fastest in all nine cases). Indeed, a larger means a longer pure exploration phase, which consumes a larger proportion of the total budget. For example, when , the performance usually starts from a lower accuracy level than that when we choose other exploration lengths. Although its start level is lower, as the budget increases, the performance when can outperform all the others in most cases of the spammer-hammer and one-coin (malicious) models. However, it can achieve the same level as the performance only when in all cases of the one-coin model, but with a lower convergence speed. In BBTA, we can choose only to affect the performance, and there are also some other factors such as the true reliability of workers and how different their labeling performances are from the others, of which we usually do not have prior knowledge in real-world crowdsourcing. If we could somehow know beforehand that the worker pool is complex (in terms of the difference of workers’ reliability) as in the spammer-hammer and one-coin (malicious) models, setting a larger may help; otherwise a practical choice would be to set a small .
5.2 Real Data
Next, we compare BBTA with the existing task assignment methods on two real-world data sets.
5.2.1 Recognizing Textual Entailment
We first use a real data set from recognizing textual entailment (RTE) tasks in natural language processing. This data set is collected by Snow et al. (2008) using Amazon Mechanical Turk (MTurk). For each RTE task in this data set, the worker is presented with two sentences and given a binary choice of whether the second sentence can be inferred from the first one. The true labels of all tasks are available and used for evaluating the performances of all task assignment methods in our experiments.
In this data set, there is no context information available, or we can consider all tasks as having the same context. That is, this is a homogeneous data set (). The numbers of tasks and workers in this data set are and , respectively. Since the originally collected label set is not complete (i.e., not every worker gives a label for each task), we decided to use a matrix completion algorithm to fill the incomplete label matrix to make sure that we can collect a label when any task is assigned to any worker in the experiments. Then we calculate the true accuracy of workers for this data set, as illustrated in Figure 4a.
Figure 5a depicts the comparison results on the RTE data, showing that all methods work very well. The reason is that there is a significant proportion of reliable workers in this data set, as we can see in Figure 4a, and finding them out is not a difficult mission for all methods. It is also shown in Figure 5a that BBTA with converges to the highest accuracy slightly faster than others. This is important in practice, especially in the budget-sensitive setting, because achieving higher accuracy within a lower budget is always favorable from the perspective of requesters in crowdsourcing.
5.2.2 Gender Hobby Dataset
The second real data set we use is Gender Hobby (GH) collected from MTurk by Mo, Zhong, and Yang (2013). Tasks in this data set are binary questions that are explicitly divided into two contexts (): sports and makeup-cooking. This is a typical heterogeneous data set, with tasks ( per context) and workers. Since the label matrix in the original GH data is also incomplete, we use the matrix completion algorithm again to fill the missing entries. Figure 4b illustrates the distribution of the true accuracy of workers in this data set. It is easy to see that the labels given by the workers in this data set are more malicious than those in the RTE data (see Figure 4a) due to the increased diversity of tasks.
Figure 5b plots the experimental results, showing that BBTA with and outperform others on this typical heterogeneous dataset.
6 Conclusion and Future Work
In this letter, we proposed a contextual bandit formulation to address the problem of task assignment in heterogeneous crowdsourcing. In the proposed method, bandit-based task assignment (BBTA), we first explored workers’ reliability and then attempted to adaptively assign tasks to appropriate workers.
We used the exponential weighting scheme to handle the exploration-exploitation trade-off in worker selection and utilized the weighted voting mechanism to aggregate the collected labels. Thanks to the contextual formulation, BBTA models the task-dependent reliability for workers and thus is able to intelligently match workers to tasks they can do best. This is a significant advantage over the state-of-the-art task assignment methods. We experimentally showed the usability of BBTA in heterogeneous crowdsourcing tasks.
We also theoretically investigated the regret bounds for BBTA. In the regret analysis, we showed that the performance of our strategy converges to that of the optimal one as the budget goes to infinity.
Heterogeneity is practical and important in recent real-world crowdsourcing systems. There is still a lot of room for further work in the heterogeneous setting. In particular, we consider four possible directions:
The similarities among tasks with different types (corresponding to the information sharing among different contexts) can also be considered, since real-world workers may have similar behaviors on different types of tasks.
To further improve the reliability of the crowdsourcing system, we can involve the supervision of domain experts and adaptively collect labels from both workers and experts. In such a scenario, we need to cope with the balance between system reliability and the total budget, since expert labels (considered as the ground truth) are much more expensive than worker labels.
From a theoretical point of view, the context and loss could be considered to be dependent on the history of actions, although reasonably modeling this dependence relation is challenging in the setting of heterogeneous crowdsourcing. The current method considered the sequence of contexts and losses as external feedback and thus adopted a standard bandit formulation. If we could somehow appropriately capture the dependence relation mentioned above, it would be possible to further improve the current theoretical results.
In our future work, we will further investigate the challenging problems in heterogeneous crowdsourcing.
Appendix: Proof of Lemma 2
This proof follows the framework of regret analysis for adversarial bandits by Bubeck (2010).
H.Z. and Y.M. were supported by the MEXT scholarship and the CREST program. M.S. was supported by MEXT KAKENHI 25700022.