Policy-value concordance for deep actor-critic reinforcement learning algorithms

dc.contributor.authorBuro, Jonas
dc.contributor.supervisorHaworth, Brandon
dc.date.accessioned2024-12-18T22:54:03Z
dc.date.available2024-12-18T22:54:03Z
dc.date.issued2024
dc.degree.departmentDepartment of Computer Science
dc.degree.levelMaster of Science MSc
dc.description.abstractDesigning general agents that optimize sequential decision-making under uncertainty has long been central to artificial intelligence research. Recent advances in deep reinforcement learning (RL) have made progress in this pursuit, achieving superhuman performance in a collection of challenging and visually complex domains in a tabula rasa fashion, without embedding human domain knowledge. Although these methods make progress towards general problem-solving agents, they require significantly more data than humans to learn effective decision-making policies, preventing their application to most real-world problems for which no simulator exists. The question of how best to learn models intended for downstream purposes such as planning in this context remains unresolved. Motivated by this gap in the literature, we propose a novel learning objective for RL algorithms with deep actor-critic architectures, with the goal of further investigating the efficacy of such methods as autonomous general problem solvers. These algorithms employ artificial neural networks as parameterized policy and value functions, which guide their decision-making processes. Our approach introduces a learning signal that explicitly captures desirable properties of the policy function in terms of the value function, from the perspective of a downstream reward-maximizing agent. Specifically, the signal encourages the policy to favour actions in a manner concordant with the relative ordering of value function estimates during training. We hypothesize that, when correctly balanced with other learning objectives, RL algorithms incorporating our method will converge to policies of comparable strength using less real-world data than their original instantiations. To empirically investigate this hypothesis, we combine our technique with state-of-the-art RL algorithms, ranging from simple policy-gradient actor-critic methods to more complex model-based architectures, deploy them on standard deep RL benchmark tasks, and perform statistical analysis on the resulting performance data.
dc.description.embargo2025-12-06
dc.description.scholarlevelGraduate
dc.identifier.urihttps://hdl.handle.net/1828/20864
dc.languageEnglish
dc.language.isoen
dc.rightsAvailable to the World Wide Web
dc.subjectSequential decision making
dc.subjectArtificial intelligence
dc.subjectReinforcement learning
dc.subjectMachine learning
dc.titlePolicy-value concordance for deep actor-critic reinforcement learning algorithms
dc.typeThesis
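
The abstract above describes a learning signal that aligns the policy's action preferences with the relative ordering of the critic's value estimates. As a rough illustration only, the sketch below shows one way such a pairwise concordance term could be written in PyTorch for a discrete-action actor-critic; the function name, the margin parameter, and the hinge formulation are illustrative assumptions and are not taken from the thesis itself.

import torch
import torch.nn.functional as F

def concordance_loss(policy_logits, action_values, margin=0.0):
    # Hypothetical sketch, not the thesis's objective.
    # Both inputs have shape (batch, num_actions).
    # Pairwise differences: entry (b, i, j) compares action i to action j.
    logit_diff = policy_logits.unsqueeze(2) - policy_logits.unsqueeze(1)
    value_diff = action_values.unsqueeze(2) - action_values.unsqueeze(1)
    # Critic's preferred ordering; detached so this term only trains the actor.
    order = torch.sign(value_diff).detach()
    # Hinge penalty wherever the policy's logit ordering disagrees with the critic's.
    return F.relu(margin - order * logit_diff).mean()

Such a term would presumably be added to the usual actor and critic losses with a tunable weight, e.g. total_loss = policy_loss + value_loss + lam * concordance_loss(logits, q_values); the weighting and its balance against the other objectives are exactly what the thesis's hypothesis concerns.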

Files

Original bundle
Name: Buro_Jonas_MSc_2024.pdf
Size: 1.48 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.62 KB
Format: Item-specific license agreed to upon submission