no code implementations • 4 Apr 2025 • Kamil Ciosek, Nicolò Felicioni, Sina Ghiassian
First, because we take a Bayesian approach, we achieve much higher-quality semantic entropy estimates for a given budget of samples from the LLM.
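A minimal sketch of the distinction in Python, assuming the LLM samples have already been clustered by meaning and using a Dirichlet prior over cluster probabilities (an illustrative prior choice, not necessarily the paper's exact model):

```python
import numpy as np

def plugin_semantic_entropy(cluster_counts):
    """Plug-in estimate: entropy of the empirical cluster frequencies."""
    p = np.asarray(cluster_counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def bayesian_semantic_entropy(cluster_counts, alpha=1.0, n_draws=10_000, seed=0):
    """Posterior-mean entropy under a Dirichlet(alpha) prior on the cluster
    probabilities (illustrative; the paper's Bayesian model may differ)."""
    rng = np.random.default_rng(seed)
    posterior = np.asarray(cluster_counts, dtype=float) + alpha
    draws = np.clip(rng.dirichlet(posterior, size=n_draws), 1e-12, 1.0)
    entropies = -(draws * np.log(draws)).sum(axis=1)
    return float(entropies.mean())

# Five LLM samples clustered by meaning into three semantic classes.
counts = [3, 1, 1]
print(plugin_semantic_entropy(counts))    # point estimate from raw frequencies
print(bayesian_semantic_entropy(counts))  # averages over posterior uncertainty
```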
no code implementations • 8 Oct 2024 • Arash Tavakoli, Sina Ghiassian, Nemanja Rakićević
This raises the question of why their computational applicability and performance diverge as the complexity of the action space increases.
no code implementations • 30 Apr 2024 • Arsalan SharifNassab, Saber Salehkaleybar, Sina Ghiassian, Surya Kanoria, Dale Schuurmans
We propose Soft Preference Optimization (SPO), a method for aligning generative models, such as Large Language Models (LLMs), with human preferences, without the need for a reward model.
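For intuition only, here is a generic reward-model-free pairwise preference loss (a logistic loss on the model's log-likelihood margin, in the same family as DPO-style objectives); the actual SPO objective differs and is specified in the paper:

```python
import torch
import torch.nn.functional as F

def pairwise_preference_loss(logp_chosen, logp_rejected, beta=1.0):
    """Illustrative reward-model-free preference loss: logistic loss on the
    log-likelihood margin between preferred and dispreferred completions.
    Not the SPO objective itself; a sketch to fix ideas.

    logp_chosen / logp_rejected: summed token log-probs of each completion
    under the model being aligned, shape (batch,).
    """
    margin = beta * (logp_chosen - logp_rejected)
    return -F.logsigmoid(margin).mean()

# Toy usage with made-up sequence log-likelihoods.
logp_c = torch.tensor([-12.3, -8.1])
logp_r = torch.tensor([-14.0, -9.5])
print(pairwise_preference_loss(logp_c, logp_r))
```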
no code implementations • 3 Apr 2024 • Nicolò Felicioni, Lucas Maystre, Sina Ghiassian, Kamil Ciosek
We compare this baseline to LLM bandits that make active use of uncertainty estimation by integrating the uncertainty into a Thompson Sampling policy.
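A minimal sketch of the Thompson Sampling step, assuming per-arm Gaussian posteriors stand in for the LLM's uncertainty estimates (all numbers here are illustrative):

```python
import numpy as np

def thompson_step(means, stds, rng):
    """Pick an arm by drawing one plausible reward per arm from its
    posterior and acting greedily on the draws."""
    draws = rng.normal(means, stds)
    return int(np.argmax(draws))

rng = np.random.default_rng(0)
# Posterior mean/std of each arm's reward; in the paper's setting these
# would come from the LLM's uncertainty estimates.
means = np.array([0.40, 0.55, 0.50])
stds  = np.array([0.05, 0.20, 0.02])
arm = thompson_step(means, stds, rng)  # uncertain arms get explored more
```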
no code implementations • 11 Mar 2024 • Zhenwen Dai, Federico Tomasi, Sina Ghiassian
In-context learning is a promising approach to online policy learning for offline reinforcement learning (RL) methods, since it can be achieved at inference time without gradient optimization.
no code implementations • 25 Oct 2022 • Banafsheh Rafiee, Sina Ghiassian, Jun Jin, Richard Sutton, Jun Luo, Adam White
In this paper, we explore an approach to auxiliary task discovery in reinforcement learning based on ideas from representation learning.
no code implementations • 18 Mar 2022 • Eric Graves, Sina Ghiassian
A central challenge to applying many off-policy reinforcement learning algorithms to real world problems is the variance introduced by importance sampling.
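A toy simulation makes the variance problem concrete: the trajectory-level importance-sampling estimator below is unbiased for the target-policy expectation, but the products of per-step ratios concentrate all the weight on a few trajectories (policies and rewards here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two actions per step; behavior policy uniform, target near-deterministic.
pi = np.array([0.9, 0.1])     # target policy (illustrative)
b  = np.array([0.5, 0.5])     # behavior policy
T, n = 10, 100_000            # horizon, number of trajectories

actions = rng.choice(2, size=(n, T), p=b)                 # data gathered under b
returns = rng.normal(loc=actions.sum(axis=1), scale=1.0)  # toy returns
rho = (pi[actions] / b[actions]).prod(axis=1)             # per-step ratio products

est = rho * returns                # ordinary importance-sampling estimator
print(rho.max())                   # products reach 1.8**10, roughly 357
print(est.mean(), est.std())       # unbiased mean, but huge spread
```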
no code implementations • 10 Sep 2021 • Sina Ghiassian, Richard S. Sutton
In the Rooms task, the product of importance sampling ratios can be as large as $2^{14}$ and can sometimes be two.
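For scale: a per-step ratio of two, compounded over a 14-step trajectory, already reaches that value (the actual per-step ratios are determined by the two policies in the task; this is illustrative arithmetic only):

```python
rho_per_step = 2.0   # assumed per-step ratio, for illustration
product = 1.0
for t in range(14):
    product *= rho_per_step
print(product)       # 16384.0 == 2**14
```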
2 code implementations • 2 Jun 2021 • Sina Ghiassian, Richard S. Sutton
In the middle tier, the five Gradient-TD algorithms and Off-policy TD($\lambda$) were more sensitive to the bootstrapping parameter.
1 code implementation • 15 Feb 2021 • Dylan R. Ashley, Sina Ghiassian, Richard S. Sutton
Catastrophic forgetting remains a severe hindrance to the broad application of artificial neural networks (ANNs); however, it continues to be a poorly understood phenomenon.
1 code implementation • 9 Nov 2020 • Banafsheh Rafiee, Zaheer Abbas, Sina Ghiassian, Raksha Kumaraswamy, Richard Sutton, Elliot Ludvig, Adam White
We present three new diagnostic prediction problems inspired by classical-conditioning experiments to facilitate research in online prediction learning.
1 code implementation • ICML 2020 • Sina Ghiassian, Andrew Patterson, Shivam Garg, Dhawal Gupta, Adam White, Martha White
It is still common to use Q-learning and temporal difference (TD) learning, even though they have divergence issues and sound Gradient TD alternatives exist, because divergence seems rare and they typically perform well.
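For reference, one sound Gradient-TD alternative is TDC (TD with gradient correction); a minimal sketch of its update with linear features, following the standard formulation from the Gradient-TD literature:

```python
import numpy as np

def tdc_update(theta, w, phi, phi_next, reward, rho, gamma=0.99,
               alpha=0.01, beta=0.05):
    """One TDC (Gradient-TD) update with linear features.
    theta: value weights; w: auxiliary weights tracking the expected
    TD-error direction; rho: importance sampling ratio for the action."""
    delta = reward + gamma * theta @ phi_next - theta @ phi   # TD error
    theta = theta + alpha * rho * (delta * phi
                                   - gamma * (phi @ w) * phi_next)
    w = w + beta * rho * (delta - phi @ w) * phi
    return theta, w
```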
no code implementations • 16 Mar 2020 • Sina Ghiassian, Banafsheh Rafiee, Yat Long Lo, Adam White
Unfortunately, the performance of deep reinforcement learning systems is sensitive to hyper-parameter settings and architecture choices.
no code implementations • 29 Oct 2019 • Yat Long Lo, Sina Ghiassian
Yet, neural networks tend to forget what they learned in the past, especially when they learn online and fully incrementally, a setting in which the weights are updated after each sample is received and the sample is then discarded.
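A minimal sketch of that setting with a linear model: one update per incoming sample, after which the sample is discarded (the learner and data stream here are illustrative):

```python
import numpy as np

def online_sgd(stream, dim, alpha=0.01):
    """Fully incremental supervised learning: one update per sample, then
    the sample is discarded -- no replay buffer, no second pass."""
    w = np.zeros(dim)
    for x, y in stream:          # samples arrive one at a time
        err = y - w @ x          # prediction error on this sample only
        w += alpha * err * x     # single SGD step
        # x, y are not stored; their only trace is the change to w
    return w
```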
1 code implementation • 1 Mar 2019 • Xiang Gu, Sina Ghiassian, Richard S. Sutton
ETD was proposed mainly to address convergence issues of conventional Temporal Difference (TD) learning under off-policy training, but it differs from conventional TD learning even under on-policy training.
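For reference, a minimal sketch of ETD(0) with linear features, constant interest I=1, and lambda=0, following the standard emphatic-TD formulation; the followon trace F is what distinguishes it from conventional TD even on-policy:

```python
import numpy as np

def etd0_episode(transitions, dim, alpha=0.01, gamma=0.99):
    """Emphatic TD(0) with linear features (constant interest, lambda=0).

    transitions: iterable of (phi, reward, phi_next, rho), where rho is the
    importance sampling ratio pi(a|s)/b(a|s) of the taken action."""
    w = np.zeros(dim)
    F = 0.0                      # followon trace
    rho_prev = 1.0
    for phi, reward, phi_next, rho in transitions:
        F = gamma * rho_prev * F + 1.0               # accumulate emphasis
        delta = reward + gamma * w @ phi_next - w @ phi
        w += alpha * F * rho * delta * phi           # emphasis-weighted step
        rho_prev = rho
    return w
```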
no code implementations • 6 Nov 2018 • Sina Ghiassian, Andrew Patterson, Martha White, Richard S. Sutton, Adam White
The ability to learn behavior-contingent predictions online and off-policy has long been advocated as a key capability of predictive-knowledge learning systems, but has remained an open algorithmic challenge for decades.
no code implementations • 18 May 2018 • Sina Ghiassian, Huizhen Yu, Banafsheh Rafiee, Richard S. Sutton
We apply neural nets with ReLU gates in online reinforcement learning.
no code implementations • 11 May 2017 • Sina Ghiassian, Banafsheh Rafiee, Richard S. Sutton
In this paper we present the first empirical study of the emphatic temporal-difference learning algorithm (ETD), comparing it with conventional temporal-difference learning, in particular linear TD(0), on on-policy and off-policy variations of the Mountain Car problem.