Online machine learning framework for budgeted bandits with an option of giving up

dc.contributor.authorPon Kumar, Sharoff
dc.contributor.supervisorMehta, Nishant
dc.date.accessioned2021-12-14T23:18:49Z
dc.date.available2021-12-14T23:18:49Z
dc.date.copyright2021en_US
dc.date.issued2021-12-14
dc.degree.departmentDepartment of Computer Scienceen_US
dc.degree.levelMaster of Science M.Sc.en_US
dc.description.abstractWe study an online learning problem where the game proceeds in epochs and an agent takes an action in each epoch. Depending on the action, the agent receives a stochastic reward, and the time taken to complete an epoch is governed by a stochastic delay. The agent can only take a new action once the previous action is completed. The game ends once the total allotted time budget runs out. The goal of the agent is to maximize its cumulative reward over a fixed budget. However, the agent can also “give up” on an action to conserve its time budget, which prevents the agent from collecting the reward associated with that action; it can then choose another action. We model this problem as a variant of the multi-armed bandit problem with stochastic rewards and stochastic resource consumption under a fixed global budget. For this problem, we first establish that the optimal arm is the one that maximizes the ratio of the arm's expected reward to the expected waiting time before the agent sees the reward from pulling that arm. We then propose an upper confidence bound-based algorithm, Wait-UCB, built on a novel upper confidence bound for this ratio, which attains a logarithmic, problem-dependent regret bound with an improved dependence on the problem-dependent parameters compared to previous works. We conduct simulations of the proposed algorithm in various problem configurations, comparing Wait-UCB against state-of-the-art algorithms and verifying its effectiveness. We then study this problem with additional feedback beyond mere bandit feedback, where the agent also observes the rewards of actions with shorter waiting times; we call this type of feedback “leftward chain feedback”. For this problem with additional feedback, we develop a novel upper confidence bound-based algorithm, Wait-2 Learn UCB, which guarantees a logarithmic, problem-dependent regret bound.
However, our regret bound does not yet show any improvement over the regret bound for Wait-UCB.en_US
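The abstract's key idea, selecting the arm that maximizes the ratio of expected reward to expected waiting time via optimistic estimates, can be sketched in a small simulation. This is an illustrative ratio-based UCB under assumed Bernoulli rewards and exponential delays with generic Hoeffding-style confidence widths; it is not the exact Wait-UCB index or analysis from the thesis.

```python
import math
import random

def ratio_ucb_simulation(reward_means, delay_means, budget, seed=0):
    """Illustrative ratio-based UCB for a budgeted bandit with stochastic delays.

    Each pull chooses the arm maximizing an optimistic estimate of
    (expected reward) / (expected waiting time): an upper confidence
    bound on the reward divided by a lower confidence bound on the delay.
    """
    rng = random.Random(seed)
    k = len(reward_means)
    pulls = [0] * k
    reward_sums = [0.0] * k
    delay_sums = [0.0] * k
    state = {"total_reward": 0.0, "elapsed": 0.0, "t": 0}

    def pull(arm):
        r = 1.0 if rng.random() < reward_means[arm] else 0.0  # Bernoulli reward
        d = rng.expovariate(1.0 / delay_means[arm])           # stochastic delay
        pulls[arm] += 1
        reward_sums[arm] += r
        delay_sums[arm] += d
        state["total_reward"] += r
        state["elapsed"] += d
        state["t"] += 1

    # Pull each arm once so all estimates are defined.
    for arm in range(k):
        pull(arm)

    # Keep pulling until the global time budget runs out.
    while state["elapsed"] < budget:
        best, best_index = 0, -math.inf
        for arm in range(k):
            n = pulls[arm]
            conf = math.sqrt(2.0 * math.log(state["t"] + 1) / n)
            opt_reward = reward_sums[arm] / n + conf               # optimistic reward
            pes_delay = max(delay_sums[arm] / n - conf, 1e-9)      # pessimistic delay
            index = opt_reward / pes_delay
            if index > best_index:
                best, best_index = arm, index
        pull(best)

    return state["total_reward"], pulls

# Arm 1 has the better reward-to-delay ratio (0.8/1.0 vs 0.2/2.0),
# so it should attract most pulls as the budget is consumed.
total, pulls = ratio_ucb_simulation([0.2, 0.8], [2.0, 1.0], budget=200.0)
```

The "give up" option and the leftward chain feedback studied in the thesis are not modeled here; this sketch only shows why the reward-to-waiting-time ratio is the natural index for the budgeted setting.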
dc.description.scholarlevelGraduateen_US
dc.identifier.bibliographicCitationSharoff, P., Mehta, N., and Ganti, R. (2020). A farewell to arms: Sequential reward maximization on a budget with a giving up option. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 3707–3716. PMLR.en_US
dc.identifier.urihttp://hdl.handle.net/1828/13590
dc.languageEnglisheng
dc.language.isoenen_US
dc.rightsAvailable to the World Wide Weben_US
dc.subjectMachine Learningen_US
dc.subjectOnline Machine Learningen_US
dc.subjectOnline Learningen_US
dc.subjectMulti-armed banditen_US
dc.subjectReinforcement Learningen_US
dc.subjectMachine Learning Theoryen_US
dc.subjectArtificial Intelligenceen_US
dc.subjectAlgorithmsen_US
dc.subjectSequential Decision Makingen_US
dc.subjectStatisticsen_US
dc.titleOnline machine learning framework for budgeted bandits with an option of giving upen_US
dc.typeThesisen_US

Files

Original bundle
Name: PonKumar_Sharoff_MSc_2021.pdf
Size: 623.4 KB
Format: Adobe Portable Document Format
Description: Main article

License bundle
Name: license.txt
Size: 2 KB
Description: Item-specific license agreed upon to submission