Online machine learning framework for budgeted bandits with an option of giving up
dc.contributor.author | Pon Kumar, Sharoff | |
dc.contributor.supervisor | Mehta, Nishant | |
dc.date.accessioned | 2021-12-14T23:18:49Z | |
dc.date.available | 2021-12-14T23:18:49Z | |
dc.date.copyright | 2021 | en_US |
dc.date.issued | 2021-12-14 | |
dc.degree.department | Department of Computer Science | en_US |
dc.degree.level | Master of Science (M.Sc.) | en_US |
dc.description.abstract | We study an online learning problem where the game proceeds in epochs and an agent takes an action in each epoch. Depending upon the action, the agent receives a stochastic reward, and the time taken to complete an epoch is governed by a stochastic delay. The agent can take a new action only once the previous action is completed. The game ends once the total allotted time budget runs out. The goal of the agent is to maximize its cumulative reward over this fixed budget. However, the agent can also “give up” on an action to conserve the time budget, which prevents the agent from collecting the reward associated with that action; it can then choose another action. We model this problem as a variant of the multi-armed bandit problem with stochastic rewards and stochastic resource consumption under a fixed global budget. For this problem, we first establish that the optimal arm is the one maximizing the ratio of the expected reward of the arm to the expected waiting time before the agent sees the reward due to pulling that arm. We then propose an upper confidence bound-based algorithm, Wait-UCB, which uses a novel upper confidence bound on this ratio and attains a logarithmic, problem-dependent regret bound with an improved dependence on the problem-dependent parameters compared to previous works. We conduct simulations of the proposed algorithm in various problem configurations, comparing Wait-UCB against state-of-the-art algorithms and verifying its effectiveness. We then study this problem with additional feedback beyond mere bandit feedback, where the agent also observes the rewards of the actions with shorter waiting times; we call this type of feedback “leftward chain feedback”. For this setting with additional feedback, we develop a novel upper confidence bound-based algorithm, Wait-2 Learn UCB, which guarantees a logarithmic, problem-dependent regret bound. However, our regret bound does not yet show any improvement over the regret bound for Wait-UCB. | en_US |
dc.description.scholarlevel | Graduate | en_US |
dc.identifier.bibliographicCitation | Sharoff, P., Mehta, N., and Ganti, R. (2020). A farewell to arms: Sequential reward maximization on a budget with a giving up option. In Proceedings of the Twenty-Third International Conference on Artificial Intelligence and Statistics, volume 108 of Proceedings of Machine Learning Research, pages 3707–3716. PMLR. | en_US |
dc.identifier.uri | http://hdl.handle.net/1828/13590 | |
dc.language | English | eng |
dc.language.iso | en | en_US |
dc.rights | Available to the World Wide Web | en_US |
dc.subject | Machine Learning | en_US |
dc.subject | Online Machine Learning | en_US |
dc.subject | Online Learning | en_US |
dc.subject | Multi-armed bandit | en_US |
dc.subject | Reinforcement Learning | en_US |
dc.subject | Machine Learning Theory | en_US |
dc.subject | Artificial Intelligence | en_US |
dc.subject | Algorithms | en_US |
dc.subject | Sequential Decision Making | en_US |
dc.subject | Statistics | en_US |
dc.title | Online machine learning framework for budgeted bandits with an option of giving up | en_US |
dc.type | Thesis | en_US |
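The abstract above describes Wait-UCB's core idea: pull the arm with the highest optimistic estimate of the ratio of expected reward to expected waiting time, under a global time budget with a give-up option. As a rough illustration of that setting, the following is a minimal Python sketch, not the thesis's algorithm. The Bernoulli rewards, exponential delays, fixed give-up threshold tau, the specific confidence radius, and the name wait_ucb_sketch are all illustrative assumptions; the exact confidence bound on the ratio and the treatment of the give-up decision are developed in the thesis itself.

import numpy as np

def wait_ucb_sketch(mean_rewards, mean_delays, budget, tau=5.0, alpha=2.0, seed=0):
    """Illustrative budgeted-bandit simulation with a give-up threshold.

    Arm k yields a Bernoulli(mean_rewards[k]) reward after an exponential
    delay with mean mean_delays[k]; a pull is abandoned (reward forfeited)
    once tau time units have passed. The agent pulls the arm whose
    optimistic reward/waiting-time ratio is largest. The index below is a
    generic UCB-on-ratio heuristic, not the bound derived in the thesis.
    """
    rng = np.random.default_rng(seed)
    K = len(mean_rewards)
    pulls = np.zeros(K)       # completed pulls per arm (given-up ones included)
    rewards = np.zeros(K)     # total reward collected from each arm
    waits = np.zeros(K)       # total waiting time spent on each arm
    t, total = 0.0, 0.0
    init = list(range(K))     # pull every arm once before trusting the index
    while t < budget:
        if init:
            k = init.pop()
        else:
            # Optimistic ratio: inflate the reward estimate, deflate the
            # waiting-time estimate, and clamp the denominator positive.
            conf = np.sqrt(alpha * np.log(pulls.sum()) / pulls)
            index = (rewards / pulls + conf) / np.maximum(waits / pulls - conf, 1e-9)
            k = int(np.argmax(index))
        delay = rng.exponential(mean_delays[k])
        wait = min(delay, tau)            # give up once tau time has elapsed
        if t + wait > budget:             # global time budget exhausted
            break
        r = rng.binomial(1, mean_rewards[k]) if delay <= tau else 0
        pulls[k] += 1
        rewards[k] += r
        waits[k] += wait
        t += wait
        total += r
    return total

# Example: a fast low-reward arm versus a slow high-reward arm whose delay
# often exceeds the give-up threshold.
print(wait_ucb_sketch([0.3, 0.8], [1.0, 6.0], budget=500.0))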
Files
Original bundle
- Name: PonKumar_Sharoff_MSc_2021.pdf
- Size: 623.4 KB
- Format: Adobe Portable Document Format
- Description: Main article
License bundle
- Name: license.txt
- Size: 2 KB
- Format: Item-specific license agreed upon to submission