Search Results for author: John Schulman

Found 40 papers, 28 papers with code

Let's Verify Step by Step

3 code implementations Preprint 2023 Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe

We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset.

 Ranked #1 on Math Word Problem Solving on MATH minival (using extra training data)

Active Learning Math +2

GPT-4 Technical Report

10 code implementations Preprint 2023 OpenAI, :, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, Rachel Lim, Molly Lin, Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O'Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael, Pokorny, Michelle Pokrass, Vitchyr H. Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, Barret Zoph

We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs.

1 Image, 2*2 Stitching Arithmetic Reasoning +14

Scaling laws for single-agent reinforcement learning

no code implementations31 Jan 2023 Jacob Hilton, Jie Tang, John Schulman

Recent work has shown that, in generative modeling, cross-entropy loss improves smoothly with model size and training compute, following a power law plus constant scaling law.

reinforcement-learning Reinforcement Learning (RL)

Scaling Laws for Reward Model Overoptimization

no code implementations19 Oct 2022 Leo Gao, John Schulman, Jacob Hilton

In reinforcement learning from human feedback, it is common to optimize against a reward model trained to predict human preferences.

reinforcement-learning Reinforcement Learning (RL)

Efficient Training of Language Models to Fill in the Middle

3 code implementations28 Jul 2022 Mohammad Bavarian, Heewoo Jun, Nikolas Tezak, John Schulman, Christine McLeavey, Jerry Tworek, Mark Chen

To this end, we run a series of ablations on key hyperparameters, such as the data transformation frequency, the structure of the transformation, and the method of selecting the infill span.

Data Augmentation

Training Verifiers to Solve Math Word Problems

3 code implementations27 Oct 2021 Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman

State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning.

GSM8K Math +1

Batch size-invariance for policy optimization

1 code implementation1 Oct 2021 Jacob Hilton, Karl Cobbe, John Schulman

We say an algorithm is batch size-invariant if changes to the batch size can largely be compensated for by changes to other hyperparameters.

Unsolved Problems in ML Safety

no code implementations28 Sep 2021 Dan Hendrycks, Nicholas Carlini, John Schulman, Jacob Steinhardt

Machine learning (ML) systems are rapidly increasing in size, are acquiring new capabilities, and are increasingly deployed in high-stakes settings.

The MineRL 2020 Competition on Sample Efficient Reinforcement Learning using Human Priors

no code implementations26 Jan 2021 William H. Guss, Mario Ynocente Castro, Sam Devlin, Brandon Houghton, Noboru Sean Kuno, Crissman Loomis, Stephanie Milani, Sharada Mohanty, Keisuke Nakata, Ruslan Salakhutdinov, John Schulman, Shinya Shiroshita, Nicholay Topin, Avinash Ummadisingu, Oriol Vinyals

Although deep reinforcement learning has led to breakthroughs in many difficult domains, these successes have required an ever-increasing number of samples, affording only a shrinking segment of the AI community access to their development.

Decision Making Efficient Exploration +2

Phasic Policy Gradient

3 code implementations9 Sep 2020 Karl Cobbe, Jacob Hilton, Oleg Klimov, John Schulman

We introduce Phasic Policy Gradient (PPG), a reinforcement learning framework which modifies traditional on-policy actor-critic methods by separating policy and value function training into distinct phases.

 Ranked #1 on Reinforcement Learning (RL) on ProcGen (using extra training data)

Reinforcement Learning (RL)

Leveraging Procedural Generation to Benchmark Reinforcement Learning

6 code implementations ICML 2020 Karl Cobbe, Christopher Hesse, Jacob Hilton, John Schulman

We introduce Procgen Benchmark, a suite of 16 procedurally generated game-like environments designed to benchmark both sample efficiency and generalization in reinforcement learning.

Procgen Hard (100M) reinforcement-learning +1

Semi-Supervised Learning by Label Gradient Alignment

no code implementations6 Feb 2019 Jacob Jackson, John Schulman

We then formulate an optimization problem whose objective is to minimize the distance between the labeled and the unlabeled data in this space, and we solve it by gradient descent on the imputed labels.

General Classification

Model-Based Reinforcement Learning via Meta-Policy Optimization

1 code implementation14 Sep 2018 Ignasi Clavera, Jonas Rothfuss, John Schulman, Yasuhiro Fujita, Tamim Asfour, Pieter Abbeel

Finally, we demonstrate that our approach is able to match the asymptotic performance of model-free methods while requiring significantly less experience.

Model-based Reinforcement Learning reinforcement-learning +1

Gotta Learn Fast: A New Benchmark for Generalization in RL

3 code implementations10 Apr 2018 Alex Nichol, Vicki Pfau, Christopher Hesse, Oleg Klimov, John Schulman

In this report, we present a new reinforcement learning (RL) benchmark based on the Sonic the Hedgehog (TM) video game franchise.

Few-Shot Learning reinforcement-learning +2

On First-Order Meta-Learning Algorithms

13 code implementations8 Mar 2018 Alex Nichol, Joshua Achiam, John Schulman

This paper considers meta-learning problems, where there is a distribution of tasks, and we would like to obtain an agent that performs well (i. e., learns quickly) when presented with a previously unseen task sampled from this distribution.

Few-Shot Image Classification Few-Shot Learning

Meta Learning Shared Hierarchies

3 code implementations ICLR 2018 Kevin Frans, Jonathan Ho, Xi Chen, Pieter Abbeel, John Schulman

We develop a metalearning approach for learning hierarchically structured policies, improving sample efficiency on unseen tasks through the use of shared primitives---policies that are executed for large numbers of timesteps.

Meta-Learning

Proximal Policy Optimization Algorithms

173 code implementations20 Jul 2017 John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov

We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment, and optimizing a "surrogate" objective function using stochastic gradient ascent.

Continuous Control Dota 2 +4

Teacher-Student Curriculum Learning

3 code implementations1 Jul 2017 Tambet Matiisen, Avital Oliver, Taco Cohen, John Schulman

We propose Teacher-Student Curriculum Learning (TSCL), a framework for automatic curriculum learning, where the Student tries to learn a complex task and the Teacher automatically chooses subtasks from a given set for the Student to train on.

UCB Exploration via Q-Ensembles

no code implementations ICLR 2018 Richard Y. Chen, Szymon Sidor, Pieter Abbeel, John Schulman

We show how an ensemble of $Q^*$-functions can be leveraged for more effective exploration in deep reinforcement learning.

Q-Learning reinforcement-learning +1

Equivalence Between Policy Gradients and Soft Q-Learning

no code implementations21 Apr 2017 John Schulman, Xi Chen, Pieter Abbeel

A partial explanation may be that $Q$-learning methods are secretly implementing policy gradient updates: we show that there is a precise equivalence between $Q$-learning and policy gradient methods in the setting of entropy-regularized reinforcement learning, that "soft" (entropy-regularized) $Q$-learning is exactly equivalent to a policy gradient method.

Policy Gradient Methods Q-Learning +2

#Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning

3 code implementations NeurIPS 2017 Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip De Turck, Pieter Abbeel

In this work, we describe a surprising finding: a simple generalization of the classic count-based approach can reach near state-of-the-art performance on various high-dimensional and/or continuous deep RL benchmarks.

Atari Games Continuous Control +2

Variational Lossy Autoencoder

no code implementations8 Nov 2016 Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, Pieter Abbeel

Representation learning seeks to expose certain aspects of observed data in a learned representation that's amenable to downstream tasks like classification.

Density Estimation Image Generation +1

Concrete Problems in AI Safety

1 code implementation21 Jun 2016 Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, Dan Mané

Rapid progress in machine learning and artificial intelligence (AI) has brought increasing attention to the potential impacts of AI technologies on society.

BIG-bench Machine Learning Safe Exploration

InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets

37 code implementations NeurIPS 2016 Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, Pieter Abbeel

This paper describes InfoGAN, an information-theoretic extension to the Generative Adversarial Network that is able to learn disentangled representations in a completely unsupervised manner.

Generative Adversarial Network Image Generation +3

VIME: Variational Information Maximizing Exploration

2 code implementations NeurIPS 2016 Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, Pieter Abbeel

While there are methods with optimality guarantees in the setting of discrete state and action spaces, these methods cannot be applied in high-dimensional deep RL scenarios.

Continuous Control Reinforcement Learning (RL) +1

Theano: A Python framework for fast computation of mathematical expressions

1 code implementation9 May 2016 The Theano Development Team, Rami Al-Rfou, Guillaume Alain, Amjad Almahairi, Christof Angermueller, Dzmitry Bahdanau, Nicolas Ballas, Frédéric Bastien, Justin Bayer, Anatoly Belikov, Alexander Belopolsky, Yoshua Bengio, Arnaud Bergeron, James Bergstra, Valentin Bisson, Josh Bleecher Snyder, Nicolas Bouchard, Nicolas Boulanger-Lewandowski, Xavier Bouthillier, Alexandre de Brébisson, Olivier Breuleux, Pierre-Luc Carrier, Kyunghyun Cho, Jan Chorowski, Paul Christiano, Tim Cooijmans, Marc-Alexandre Côté, Myriam Côté, Aaron Courville, Yann N. Dauphin, Olivier Delalleau, Julien Demouth, Guillaume Desjardins, Sander Dieleman, Laurent Dinh, Mélanie Ducoffe, Vincent Dumoulin, Samira Ebrahimi Kahou, Dumitru Erhan, Ziye Fan, Orhan Firat, Mathieu Germain, Xavier Glorot, Ian Goodfellow, Matt Graham, Caglar Gulcehre, Philippe Hamel, Iban Harlouchet, Jean-Philippe Heng, Balázs Hidasi, Sina Honari, Arjun Jain, Sébastien Jean, Kai Jia, Mikhail Korobov, Vivek Kulkarni, Alex Lamb, Pascal Lamblin, Eric Larsen, César Laurent, Sean Lee, Simon Lefrancois, Simon Lemieux, Nicholas Léonard, Zhouhan Lin, Jesse A. Livezey, Cory Lorenz, Jeremiah Lowin, Qianli Ma, Pierre-Antoine Manzagol, Olivier Mastropietro, Robert T. McGibbon, Roland Memisevic, Bart van Merriënboer, Vincent Michalski, Mehdi Mirza, Alberto Orlandi, Christopher Pal, Razvan Pascanu, Mohammad Pezeshki, Colin Raffel, Daniel Renshaw, Matthew Rocklin, Adriana Romero, Markus Roth, Peter Sadowski, John Salvatier, François Savard, Jan Schlüter, John Schulman, Gabriel Schwartz, Iulian Vlad Serban, Dmitriy Serdyuk, Samira Shabanian, Étienne Simon, Sigurd Spieckermann, S. Ramana Subramanyam, Jakub Sygnowski, Jérémie Tanguay, Gijs van Tulder, Joseph Turian, Sebastian Urban, Pascal Vincent, Francesco Visin, Harm de Vries, David Warde-Farley, Dustin J. Webb, Matthew Willson, Kelvin Xu, Lijun Xue, Li Yao, Saizheng Zhang, Ying Zhang

Since its introduction, it has been one of the most used CPU and GPU mathematical compilers - especially in the machine learning community - and has shown steady performance improvements.

BIG-bench Machine Learning Clustering +2

Benchmarking Deep Reinforcement Learning for Continuous Control

15 code implementations22 Apr 2016 Yan Duan, Xi Chen, Rein Houthooft, John Schulman, Pieter Abbeel

Recently, researchers have made significant progress combining the advances in deep learning for learning feature representations with reinforcement learning.

Action Triplet Recognition Atari Games +4

Gradient Estimation Using Stochastic Computation Graphs

1 code implementation NeurIPS 2015 John Schulman, Nicolas Heess, Theophane Weber, Pieter Abbeel

In a variety of problems originating in supervised, unsupervised, and reinforcement learning, the loss function is defined by an expectation over a collection of random variables, which might be part of a probabilistic model or the external world.

High-Dimensional Continuous Control Using Generalized Advantage Estimation

17 code implementations8 Jun 2015 John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, Pieter Abbeel

Policy gradient methods are an appealing approach in reinforcement learning because they directly optimize the cumulative reward and can straightforwardly be used with nonlinear function approximators such as neural networks.

Continuous Control Policy Gradient Methods +1

Trust Region Policy Optimization

21 code implementations19 Feb 2015 John Schulman, Sergey Levine, Philipp Moritz, Michael. I. Jordan, Pieter Abbeel

We describe an iterative procedure for optimizing policies, with guaranteed monotonic improvement.

Atari Games Policy Gradient Methods

Cannot find the paper you are looking for? You can Submit a new open access paper.