Online machine learning framework for budgeted bandits with an option of giving up

dc.contributor.authorPon Kumar, Sharoff
dc.contributor.supervisorMehta, Nishant
dc.date.accessioned2021-12-14T23:18:49Z
dc.date.available2021-12-14T23:18:49Z
dc.date.copyright2021en_US
dc.date.issued2021-12-14
dc.degree.departmentDepartment of Computer Scienceen_US
dc.degree.levelMaster of Science M.Sc.en_US
dc.description.abstractWe study an online learning problem where the game proceeds in epochs and an agent takes an action in each epoch. Depending on the action, the agent receives a stochastic reward, and the time taken to complete an epoch is governed by a stochastic delay. The agent can only take a new action once the previous action is completed. The game ends once the total allotted time budget runs out. The goal of the agent is to maximize its cumulative reward over a fixed budget. However, the agent can also “give up” on an action to conserve its time budget, which prevents the agent from collecting the reward associated with that action; it can then choose another action. We model this problem as a variant of the multi-armed bandit problem with stochastic rewards and stochastic resource consumption under a fixed global budget. For this problem, we first establish that the optimal arm is the one that maximizes the ratio of the arm's expected reward to the expected waiting time before the agent sees the reward from pulling that arm. We then propose an upper confidence bound-based algorithm, Wait-UCB, built on a novel upper confidence bound for this ratio, which attains a logarithmic, problem-dependent regret bound with an improved dependence on the problem-dependent parameters compared to previous works. We conduct simulations of the proposed algorithm in various problem configurations, comparing Wait-UCB against state-of-the-art algorithms and verifying its effectiveness. We then study this problem with additional feedback beyond mere bandit feedback, where the agent also observes the rewards of actions with shorter waiting times; we call this type of feedback “leftward chain feedback”. For this problem with additional feedback, we develop a novel upper confidence bound-based algorithm, Wait-2 Learn UCB, which guarantees a logarithmic, problem-dependent regret bound.
However, our regret bound does not yet show any improvement over the regret bound for Wait-UCB.en_US
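The abstract's key idea, selecting the arm that maximizes the ratio of expected reward to expected waiting time via optimistic estimates, can be sketched in a small simulation. This is an illustrative ratio-based UCB under assumed Bernoulli rewards and exponential delays with generic Hoeffding-style confidence widths; it is not the exact Wait-UCB index or analysis from the thesis.

```python
import math
import random

def ratio_ucb_simulation(reward_means, delay_means, budget, seed=0):
    """Illustrative ratio-based UCB for a budgeted bandit with stochastic delays.

    Each pull chooses the arm maximizing an optimistic estimate of
    (expected reward) / (expected waiting time): an upper confidence
    bound on the reward divided by a lower confidence bound on the delay.
    """
    rng = random.Random(seed)
    k = len(reward_means)
    pulls = [0] * k
    reward_sums = [0.0] * k
    delay_sums = [0.0] * k
    state = {"total_reward": 0.0, "elapsed": 0.0, "t": 0}

    def pull(arm):
        r = 1.0 if rng.random() < reward_means[arm] else 0.0  # Bernoulli reward
        d = rng.expovariate(1.0 / delay_means[arm])           # stochastic delay
        pulls[arm] += 1
        reward_sums[arm] += r
        delay_sums[arm] += d
        state["total_reward"] += r
        state["elapsed"] += d
        state["t"] += 1

    # Pull each arm once so all estimates are defined.
    for arm in range(k):
        pull(arm)

    # Keep pulling until the global time budget runs out.
    while state["elapsed"] < budget:
        best, best_index = 0, -math.inf
        for arm in range(k):
            n = pulls[arm]
            conf = math.sqrt(2.0 * math.log(state["t"] + 1) / n)
            opt_reward = reward_sums[arm] / n + conf               # optimistic reward
            pes_delay = max(delay_sums[arm] / n - conf, 1e-9)      # pessimistic delay
            index = opt_reward / pes_delay
            if index > best_index:
                best, best_index = arm, index
        pull(best)

    return state["total_reward"], pulls

# Arm 1 has the better reward-to-delay ratio (0.8/1.0 vs 0.2/2.0),
# so it should attract most pulls as the budget is consumed.
total, pulls = ratio_ucb_simulation([0.2, 0.8], [2.0, 1.0], budget=200.0)
```

The "give up" option and the leftward chain feedback studied in the thesis are not modeled here; this sketch only shows why the reward-to-waiting-time ratio is the natural index for the budgeted setting.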
dc.description.scholarlevelGraduateen_US
dc.identifier.bibliographicCitationSharoff, P., Mehta, N., and Ganti, R. (2020). A farewell to arms: Sequential reward maximization on a budget with a giving up option. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 3707–3716. PMLR.en_US
dc.identifier.urihttp://hdl.handle.net/1828/13590
dc.languageEnglisheng
dc.language.isoenen_US
dc.rightsAvailable to the World Wide Weben_US
dc.subjectMachine Learningen_US
dc.subjectOnline Machine Learningen_US
dc.subjectOnline Learningen_US
dc.subjectMulti-armed banditen_US
dc.subjectReinforcement Learningen_US
dc.subjectMachine Learning Theoryen_US
dc.subjectArtificial Intelligenceen_US
dc.subjectAlgorithmsen_US
dc.subjectSequential Decision Makingen_US
dc.subjectStatisticsen_US
dc.titleOnline machine learning framework for budgeted bandits with an option of giving upen_US
dc.typeThesisen_US

Files

Original bundle
Name: PonKumar_Sharoff_MSc_2021.pdf
Size: 623.4 KB
Format: Adobe Portable Document Format
Description: Main article

License bundle
Name: license.txt
Size: 2 KB
Description: Item-specific license agreed upon to submission