no code implementations • 8 May 2025 • Joel Z. Leibo, Alexander Sasha Vezhnevets, William A. Cunningham, Sébastien Krier, Manfred Diaz, Simon Osindero
Artificial Intelligence (AI) systems are increasingly placed in positions where their decisions have real consequences, e.g., moderating online spaces, conducting research, and advising on policy.
no code implementations • 26 Dec 2024 • Joel Z. Leibo, Alexander Sasha Vezhnevets, Manfred Diaz, John P. Agapiou, William A. Cunningham, Peter Sunehag, Julia Haas, Raphael Koster, Edgar A. Duéñez-Guzmán, William S. Isaac, Georgios Piliouras, Stanley M. Bileschi, Iyad Rahwan, Simon Osindero
Humans navigate a multi-scale mosaic of interlocking notions of what is appropriate for different situations.
no code implementations • 31 Oct 2024 • Marc Lanctot, Kate Larson, Michael Kaisers, Quentin Berthet, Ian Gemp, Manfred Diaz, Roberto-Rafael Maura-Rivero, Yoram Bachrach, Anna Koop, Doina Precup
This optimal ranking is the maximum likelihood estimate when evaluation data (which we view as votes) are interpreted as noisy samples from a ground-truth ranking, yielding a solution that satisfies Condorcet's original voting-system criteria.
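The abstract frames ranking as maximum-likelihood estimation from noisy votes. As a toy illustration of that view (not the paper's algorithm, which scales to many agents), the brute-force Kemeny ranking below minimizes pairwise disagreements with the votes; it is the maximum likelihood estimate under a Mallows noise model and satisfies the Condorcet criterion.

```python
from itertools import combinations, permutations

# Toy illustration (not the paper's method): treat each evaluation as a
# vote, i.e., an ordered list of agents, and recover the Kemeny ranking,
# the permutation minimizing pairwise disagreements with the votes.
# Brute force is exponential, so this is only viable for a handful of agents.

def kemeny_ranking(votes):
    agents = sorted({a for vote in votes for a in vote})
    # prefers[(a, b)] = number of votes ranking a above b
    prefers = {(a, b): 0 for a in agents for b in agents if a != b}
    for vote in votes:
        for a, b in combinations(vote, 2):  # a appears above b in this vote
            prefers[(a, b)] += 1

    def disagreements(ranking):
        # for each pair placed (a above b), count the votes that preferred b
        return sum(prefers[(b, a)]
                   for i, a in enumerate(ranking)
                   for b in ranking[i + 1:])

    return min(permutations(agents), key=disagreements)

votes = [("A", "B", "C"), ("A", "C", "B"), ("B", "A", "C")]
print(kemeny_ranking(votes))  # ('A', 'B', 'C')
```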
no code implementations • 3 Apr 2024 • Manfred Diaz, Liam Paull, Andrea Tacchetti
Teacher-Student Curriculum Learning (TSCL) is a curriculum learning framework that draws inspiration from human cultural transmission and learning.
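For context, a minimal sketch of the classic teacher-as-bandit formulation of TSCL: the teacher preferentially samples the task on which the student's score is changing fastest (largest absolute learning progress), so it revisits tasks the student is actively learning or forgetting. The `evaluate` callback and the constants below are hypothetical placeholders.

```python
import random

# Sketch of the teacher-as-bandit idea behind TSCL. `evaluate(task)` is a
# hypothetical callback that trains the student on `task` for one round
# and returns its current score on that task.

def tscl_teacher(tasks, evaluate, steps=1000, lr=0.1, eps=0.1):
    progress = {t: 0.0 for t in tasks}              # smoothed score slope per task
    last_score = {t: evaluate(t) for t in tasks}
    for _ in range(steps):
        if random.random() < eps:                   # epsilon-greedy exploration
            task = random.choice(tasks)
        else:                                       # exploit: largest |learning progress|
            task = max(tasks, key=lambda t: abs(progress[t]))
        score = evaluate(task)
        progress[task] += lr * ((score - last_score[task]) - progress[task])
        last_score[task] = score
    return progress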
no code implementations • 10 Oct 2021 • Shixiang Shane Gu, Manfred Diaz, Daniel C. Freeman, Hiroki Furuta, Seyed Kamyar Seyed Ghasemipour, Anton Raichuk, Byron David, Erik Frey, Erwin Coumans, Olivier Bachem
While reward maximization is at the core of RL, reward engineering is not the only -- and sometimes not the easiest -- way to specify complex behaviors.
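One well-known alternative to hand-engineered rewards is mutual-information-based skill discovery, where the desired behavior is specified by an information objective instead of a task reward. The sketch below shows a DIAYN-style intrinsic reward with a hypothetical learned discriminator; it illustrates the general idea, not necessarily this paper's specific toolkit.

```python
import numpy as np

# Sketch of one reward-free way to specify behavior: the agent is rewarded
# for visiting states from which a learned discriminator can identify the
# active skill z, so distinct behaviors emerge without an engineered task
# reward. `discriminator_log_prob` is a hypothetical learned model.

def skill_reward(state, z, num_skills, discriminator_log_prob):
    # r(s, z) = log q(z | s) - log p(z), with p(z) uniform over skills
    return discriminator_log_prob(state, z) - np.log(1.0 / num_skills)
```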
no code implementations • ICLR Workshop SSL-RL 2021 • Manfred Diaz, Liam Paull, Pablo Samuel Castro
We offer a novel approach to balance exploration and exploitation in reinforcement learning (RL).
2 code implementations • 9 Apr 2019 • Bhairav Mehta, Manfred Diaz, Florian Golemo, Christopher J. Pal, Liam Paull
Our experiments show that domain randomization may lead to suboptimal, high-variance policies, which we attribute to the uniform sampling of environment parameters.
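The uniform-sampling baseline these experiments critique is straightforward to sketch: every training episode draws environment parameters independently and uniformly from fixed ranges, regardless of how informative those settings are for the policy. The parameter names, ranges, and `make_env` below are hypothetical.

```python
import random

# Sketch of the uniform domain randomization baseline: each episode samples
# physics parameters uniformly from fixed ranges. Active Domain Randomization
# instead learns to propose the most informative parameter settings.

PARAM_RANGES = {"friction": (0.5, 1.5), "torso_mass": (1.0, 10.0)}

def sample_env_params():
    return {name: random.uniform(lo, hi)
            for name, (lo, hi) in PARAM_RANGES.items()}

def train(policy, make_env, episodes=1000):
    for _ in range(episodes):
        env = make_env(**sample_env_params())  # new randomized physics each episode
        policy.update(env)                     # hypothetical policy training step
```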