Online machine learning framework for budgeted bandits with an option of giving up
Date
2021-12-14
Authors
Pon Kumar, Sharoff
Abstract
We study an online learning problem where the game proceeds in epochs and an
agent takes an action in each epoch. Depending on the action, the agent receives
a stochastic reward, and the time taken to complete the epoch is governed by a
stochastic delay. The agent can take a new action only once the previous action is
completed. The game ends once the total allotted time budget runs out. The goal
of the agent is to maximize its cumulative reward over a fixed budget. However, the
agent can also “give up” on an action to conserve the time budget, which prevents
the agent from collecting the reward associated with that action; it can then choose
another action. We model this problem as a variant of the multi-armed bandit problem
with stochastic rewards and stochastic resource consumption under a fixed global
budget. For this problem, we first establish that the optimal arm is the one that
maximizes the ratio of its expected reward to the expected waiting time before the
agent observes that reward. We then propose an upper confidence bound-based
algorithm, Wait-UCB, which uses a novel upper confidence bound on this ratio and
attains a logarithmic, problem-dependent regret bound with an improved dependence
on problem-dependent parameters compared to previous work. We evaluate the
proposed algorithm in simulations across various problem configurations,
comparing Wait-UCB against state-of-the-art algorithms and verifying the
effectiveness of our proposed algorithm. We then study this problem with additional
feedback, richer than mere bandit feedback, in which the agent observes the rewards
of the actions with shorter waiting times; we call this type of feedback “leftward
chain feedback”. For this problem with additional feedback, we develop a novel upper
confidence bound-based algorithm, Wait-2 Learn UCB, which guarantees a logarithmic,
problem-dependent regret bound. However, this regret bound does not yet show an
improvement over the regret bound for Wait-UCB.
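In symbols (this notation is ours, not taken from the record): if \mu_a denotes the expected reward of arm a and \lambda_a the expected waiting time until that reward is observed, the optimality claim above reads

    a^{\star} \in \arg\max_{a} \frac{\mu_a}{\lambda_a},

i.e., the optimal arm yields the most expected reward per unit of time budget consumed.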
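To make the index concrete, the following is a minimal Python sketch of a ratio-style UCB rule playing against a fixed time budget. Everything here is an assumption for illustration: the function names are ours, the confidence radius is a generic Hoeffding-style bonus rather than the actual Wait-UCB bound derived in the thesis, and the give-up option and leftward chain feedback are omitted.

import math
import random

def ratio_ucb_index(reward_sum, wait_sum, pulls, t):
    # Optimistic estimate of reward-per-unit-wait for one arm.
    # NOTE: generic sqrt(log t / n) bonus; the true Wait-UCB
    # confidence bound from the thesis is not reproduced here.
    if pulls == 0:
        return float("inf")
    bonus = math.sqrt(2.0 * math.log(max(t, 2)) / pulls)
    mean_reward = reward_sum / pulls
    mean_wait = wait_sum / pulls
    # Inflate the reward estimate and deflate the waiting-time
    # estimate, so the index is optimistic in the ratio.
    return (mean_reward + bonus) / max(mean_wait - bonus, 1e-9)

def run_ratio_ucb(arms, budget):
    # arms: list of (sample_reward, sample_delay) callables.
    k = len(arms)
    reward_sum, wait_sum, pulls = [0.0] * k, [0.0] * k, [0] * k
    total_reward, elapsed, t = 0.0, 0.0, 0
    while elapsed < budget:
        t += 1
        if t <= k:
            a = t - 1  # play each arm once to initialize
        else:
            a = max(range(k), key=lambda i: ratio_ucb_index(
                reward_sum[i], wait_sum[i], pulls[i], t))
        sample_reward, sample_delay = arms[a]
        r, d = sample_reward(), sample_delay()
        elapsed += d
        if elapsed > budget:
            break  # budget expired mid-epoch; reward is forfeited
        pulls[a] += 1
        reward_sum[a] += r
        wait_sum[a] += d
        total_reward += r
    return total_reward

if __name__ == "__main__":
    random.seed(0)
    arms = [
        (lambda: random.random(), lambda: 1.0 + random.random()),              # ratio ~ 0.33
        (lambda: 0.4 * random.random(), lambda: 0.2 + 0.2 * random.random()),  # ratio ~ 0.67
    ]
    print(run_ratio_ucb(arms, budget=1000.0))

The give-up option from the abstract would additionally let the agent abandon an in-progress action once it has waited too long, forfeiting that reward but freeing the budget; that requires tracking elapsed waiting time within an epoch, which this sketch deliberately does not model.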
Keywords
Machine Learning, Online Machine Learning, Online Learning, Multi-armed bandit, Reinforcement Learning, Machine Learning Theory, Artificial Intelligence, Algorithms, Sequential Decision Making, Statistics