Yao Ma
Journal Articles
An Online Policy Gradient Algorithm for Markov Decision Processes with Continuous States and Actions
Publisher: Journals Gateway
Neural Computation (2016) 28 (3): 563–593.
Published: 01 March 2016
Abstract
We consider the learning problem under an online Markov decision process (MDP) aimed at learning the time-dependent decision-making policy of an agent that minimizes the regret, that is, the difference from the best fixed policy. The difficulty of online MDP learning is that the reward function changes over time. In this letter, we show that a simple online policy gradient algorithm achieves O(√T) regret for T steps under a certain concavity assumption and O(log T) regret under a strong concavity assumption. To the best of our knowledge, this is the first work to present an online MDP algorithm that can handle continuous state, action, and parameter spaces with a regret guarantee. We also illustrate the behavior of the proposed online policy gradient method through experiments.
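The abstract does not spell out the update rule, but a minimal sketch of one generic flavor of online policy gradient over continuous state, action, and parameter spaces might look as follows, assuming a Gaussian policy whose mean is linear in the state and a toy drifting-reward environment. ToyMDP, log_policy_grad, and the step-size schedule are all illustrative assumptions, not the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyMDP:
    """Toy 1-D continuous-action environment with a time-varying reward."""
    def __init__(self):
        self.state = np.array([1.0])

    def step(self, action, t):
        target = np.sin(0.01 * t)          # the reward function drifts over time
        return -(action[0] - target) ** 2  # negative squared distance to target

def log_policy_grad(theta, state, action, sigma):
    """Gradient w.r.t. theta of log N(action; theta @ state, sigma^2 I)."""
    return np.outer((action - theta @ state) / sigma ** 2, state)

def online_policy_gradient(env, theta, T, sigma=0.5):
    """One REINFORCE-style ascent step per interaction with the environment."""
    for t in range(T):
        state = env.state
        # Sample a continuous action from the Gaussian policy.
        action = theta @ state + sigma * rng.standard_normal(theta.shape[0])
        reward = env.step(action, t)
        eta = 1.0 / np.sqrt(t + 1)  # O(1/sqrt(t)) step size, the usual choice
                                    # behind O(sqrt(T))-type regret analyses
        theta = theta + eta * reward * log_policy_grad(theta, state, action, sigma)
    return theta

theta = online_policy_gradient(ToyMDP(), np.zeros((1, 1)), T=5000)
```

The decaying step size is the standard device in online-learning analyses: a fixed step cannot track a changing reward without accumulating linear regret, while an O(1/√t) schedule trades adaptation against stability.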
Journal Articles
Bandit-Based Task Assignment for Heterogeneous Crowdsourcing
Publisher: Journals Gateway
Neural Computation (2015) 27 (11): 2447–2475.
Published: 01 November 2015
Abstract
We consider a task assignment problem in crowdsourcing, which is aimed at collecting as many reliable labels as possible within a limited budget. A challenge in this scenario is how to cope with the diversity of tasks and the task-dependent reliability of workers; for example, a worker may be good at recognizing the names of sports teams but not be familiar with cosmetics brands. We refer to this practical setting as heterogeneous crowdsourcing. In this letter, we propose a contextual bandit formulation for task assignment in heterogeneous crowdsourcing that is able to deal with the exploration-exploitation trade-off in worker selection. We also theoretically investigate the regret bounds for the proposed method and demonstrate its practical usefulness experimentally.
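As a rough illustration of the contextual bandit view of worker selection described above (not the paper's specific algorithm), here is a LinUCB-style sketch in which each worker is an arm, a task's feature vector is the context, and the reward encodes observed label reliability. The class and parameter names are hypothetical:

```python
import numpy as np

class LinUCBWorkerSelector:
    """One ridge-regression model per worker; select by upper confidence bound."""
    def __init__(self, n_workers, dim, alpha=1.0):
        self.alpha = alpha                                   # exploration strength
        self.A = [np.eye(dim) for _ in range(n_workers)]     # per-worker design matrices
        self.b = [np.zeros(dim) for _ in range(n_workers)]   # per-worker reward vectors

    def select(self, context):
        """Assign the task to the worker with the highest UCB score."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                                # estimated reliability model
            bonus = self.alpha * np.sqrt(context @ A_inv @ context)
            scores.append(theta @ context + bonus)           # exploit + explore
        return int(np.argmax(scores))

    def update(self, worker, context, reward):
        """reward could be 1 if the worker's label agreed with the estimated truth."""
        self.A[worker] += np.outer(context, context)
        self.b[worker] += reward * context

# Usage: task features might be topic indicators (e.g., sports vs. cosmetics),
# so the model can learn that a worker is reliable on some topics but not others.
selector = LinUCBWorkerSelector(n_workers=5, dim=3)
features = np.array([1.0, 0.0, 1.0])
w = selector.select(features)
selector.update(w, features, reward=1.0)
```

The confidence bonus shrinks for workers who have already been assigned many similar tasks, which is one way to realize the exploration-exploitation trade-off the abstract refers to.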