no code implementations • ICML 2020 • Nathan Kallus, Masatoshi Uehara
Off-policy evaluation (OPE) in reinforcement learning allows one to evaluate novel decision policies without needing to conduct exploration, which is often costly or otherwise infeasible.
1 code implementation • 17 Oct 2024 • Chenyu Wang, Masatoshi Uehara, Yichun He, Amy Wang, Tommaso Biancalani, Avantika Lal, Tommi Jaakkola, Sergey Levine, Hanchen Wang, Aviv Regev
Finally, we demonstrate the effectiveness of DRAKES in generating DNA and protein sequences that optimize enhancer activity and protein stability, respectively, which are important tasks for gene therapies and protein-based therapeutics.
1 code implementation • 15 Aug 2024 • Xiner Li, Yulai Zhao, Chenyu Wang, Gabriele Scalia, Gokcen Eraslan, Surag Nair, Tommaso Biancalani, Shuiwang Ji, Aviv Regev, Sergey Levine, Masatoshi Uehara
Diffusion models excel at capturing the natural design spaces of images, molecules, DNA, RNA, and protein sequences.
1 code implementation • 18 Jul 2024 • Masatoshi Uehara, Yulai Zhao, Tommaso Biancalani, Sergey Levine
We explain the application of various RL algorithms, including PPO, differentiable optimization, reward-weighted MLE, value-weighted sampling, and path consistency learning, tailored specifically for fine-tuning diffusion models.
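As an illustration of one of these approaches, below is a minimal, hypothetical sketch of reward-weighted MLE fine-tuning for a diffusion model; `model`, `reward_fn`, `sample_from_model`, `model.q_sample`, and `model.num_timesteps` are placeholder names and assumed interfaces, not the paper's API.

```python
# Hypothetical sketch of reward-weighted MLE for fine-tuning a diffusion model.
# All names below are placeholders, not the paper's API.
import torch

def reward_weighted_mle_step(model, reward_fn, sample_from_model, optimizer,
                             batch_size=64, beta=1.0):
    # 1) Draw candidate designs from the current diffusion model.
    x0 = sample_from_model(model, batch_size)
    # 2) Weight each sample by its exponentiated, normalized reward.
    with torch.no_grad():
        weights = torch.softmax(beta * reward_fn(x0), dim=0)
    # 3) Take a gradient step on the reward-weighted denoising loss.
    t = torch.randint(0, model.num_timesteps, (batch_size,))
    noise = torch.randn_like(x0)
    x_t = model.q_sample(x0, t, noise)          # forward-noised samples
    per_sample = ((model(x_t, t) - noise) ** 2).flatten(1).mean(dim=1)
    loss = (weights * per_sample).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The point of the sketch is only that the standard denoising loss is reweighted so that high-reward samples contribute more to the gradient; the other algorithms listed above pursue the same goal with different objectives.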
no code implementations • 17 Jun 2024 • Yulai Zhao, Masatoshi Uehara, Gabriele Scalia, Tommaso Biancalani, Sergey Levine, Ehsan Hajiramezanali
This work presents a novel method based on reinforcement learning (RL) to incorporate additional controls, leveraging an offline dataset comprising inputs and corresponding labels.
1 code implementation • 30 May 2024 • Masatoshi Uehara, Yulai Zhao, Ehsan Hajiramezanali, Gabriele Scalia, Gökcen Eraslan, Avantika Lal, Sergey Levine, Tommaso Biancalani
To combine the strengths of both approaches, we adopt a hybrid method that fine-tunes cutting-edge diffusion models by optimizing reward models through RL.
no code implementations • 7 Mar 2024 • Zihao Li, Hui Lan, Vasilis Syrgkanis, Mengdi Wang, Masatoshi Uehara
In this paper, we study nonparametric estimation of instrumental variable (IV) regressions.
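For context, the nonparametric IV regression problem (written here in generic notation, not necessarily the paper's) seeks a function $g$ satisfying the conditional moment restriction

```latex
\mathbb{E}\left[\, Y - g(X) \mid Z \,\right] = 0 ,
```

where $Y$ is the outcome, $X$ the endogenous regressor, and $Z$ the instrument; the statistical difficulty comes from this being an ill-posed inverse problem.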
1 code implementation • 26 Feb 2024 • Masatoshi Uehara, Yulai Zhao, Kevin Black, Ehsan Hajiramezanali, Gabriele Scalia, Nathaniel Lee Diamant, Alex M Tseng, Sergey Levine, Tommaso Biancalani
It is natural to frame this as a reinforcement learning (RL) problem, in which the objective is to fine-tune a diffusion model to maximize a reward function that corresponds to some property.
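A common way to write such a fine-tuning objective (illustrative notation; the KL regularizer and its weight are generic choices, not necessarily the paper's exact formulation) is

```latex
\max_{\theta} \;\; \mathbb{E}_{x \sim p_{\theta}}\big[\, r(x) \,\big]
  \;-\; \alpha \, \mathrm{KL}\big( p_{\theta} \,\|\, p_{\mathrm{pre}} \big),
```

where $p_{\mathrm{pre}}$ is the pretrained diffusion model, $r$ the reward, and $\alpha$ controls how far fine-tuning may drift from the pretrained distribution.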
no code implementations • 23 Feb 2024 • Masatoshi Uehara, Yulai Zhao, Kevin Black, Ehsan Hajiramezanali, Gabriele Scalia, Nathaniel Lee Diamant, Alex M Tseng, Tommaso Biancalani, Sergey Levine
Diffusion models excel at capturing complex data distributions, such as those of natural images and proteins.
no code implementations • 8 Jan 2024 • Jakub Grudzien Kuba, Masatoshi Uehara, Pieter Abbeel, Sergey Levine
This kind of data-driven optimization (DDO) presents a range of challenges beyond those in standard prediction problems, since we need models that successfully predict the performance of new designs that are better than the best designs seen in the training set.
no code implementations • 25 Jul 2023 • Andrew Bennett, Nathan Kallus, Xiaojie Mao, Whitney Newey, Vasilis Syrgkanis, Masatoshi Uehara
We consider estimation of parameters defined as linear functionals of solutions to linear inverse problems.
1 code implementation • 26 Jun 2023 • Haruka Kiyohara, Masatoshi Uehara, Yusuke Narita, Nobuyuki Shimizu, Yasuo Yamamoto, Yuta Saito
We show that the resulting estimator, which we call Adaptive IPS (AIPS), can be unbiased under any complex user behavior.
no code implementations • 29 May 2023 • Wenhao Zhan, Masatoshi Uehara, Wen Sun, Jason D. Lee
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pairwise preference-based feedback over trajectories, rather than explicit reward signals.
no code implementations • 24 May 2023 • Wenhao Zhan, Masatoshi Uehara, Nathan Kallus, Jason D. Lee, Wen Sun
Our proposed algorithm consists of two main steps: (1) estimate the implicit reward using Maximum Likelihood Estimation (MLE) with general function approximation from offline data and (2) solve a distributionally robust planning problem over a confidence set around the MLE.
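Step (1) is typically instantiated with a Bradley-Terry-style preference model; in generic notation (an assumption here, not a quotation of the paper), the MLE reads

```latex
\hat{r} \in \arg\max_{r \in \mathcal{R}} \;
  \sum_{(\tau^{0}, \tau^{1}, y)}
    y \,\log \sigma\!\big( r(\tau^{1}) - r(\tau^{0}) \big)
  + (1 - y)\,\log \sigma\!\big( r(\tau^{0}) - r(\tau^{1}) \big),
```

where $\sigma$ is the logistic function, $\mathcal{R}$ the reward function class, and $y \in \{0,1\}$ records which trajectory was preferred.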
1 code implementation • 19 Feb 2023 • Runzhe Wu, Masatoshi Uehara, Wen Sun
Our theoretical results show that for both finite-horizon and infinite-horizon discounted settings, FLE can learn distributions that are close to the ground truth under total variation distance and Wasserstein distance, respectively.
no code implementations • 10 Feb 2023 • Andrew Bennett, Nathan Kallus, Xiaojie Mao, Whitney Newey, Vasilis Syrgkanis, Masatoshi Uehara
In this paper, we study nonparametric estimation of instrumental variable (IV) regressions.
no code implementations • 13 Dec 2022 • Masatoshi Uehara, Chengchun Shi, Nathan Kallus
Reinforcement learning (RL) is one of the most vibrant research frontiers in machine learning and has been recently applied to solve a number of challenging problems.
no code implementations • 17 Aug 2022 • Andrew Bennett, Nathan Kallus, Xiaojie Mao, Whitney Newey, Vasilis Syrgkanis, Masatoshi Uehara
In a variety of applications, including nonparametric instrumental variable (NPIV) analysis, proximal causal inference under unmeasured confounding, and missing-not-at-random data with shadow variables, we are interested in inference on a continuous linear functional (e.g., average causal effects) of a nuisance function (e.g., an NPIV regression) defined by conditional moment restrictions.
1 code implementation • NeurIPS 2023 • Masatoshi Uehara, Haruka Kiyohara, Andrew Bennett, Victor Chernozhukov, Nan Jiang, Nathan Kallus, Chengchun Shi, Wen Sun
Finally, we extend our methods to the learning of dynamics and establish the connection between our approach and well-known spectral learning methods for POMDPs.
no code implementations • 12 Jul 2022 • Wenhao Zhan, Masatoshi Uehara, Wen Sun, Jason D. Lee
We show that given a realizable model class, the sample complexity of learning the near optimal policy only scales polynomially with respect to the statistical complexity of the model class, without any explicit polynomial dependence on the size of the state and observation spaces.
no code implementations • 24 Jun 2022 • Masatoshi Uehara, Ayush Sekhari, Jason D. Lee, Nathan Kallus, Wen Sun
We study Reinforcement Learning for partially observable dynamical systems using function approximation.
no code implementations • 24 Jun 2022 • Masatoshi Uehara, Ayush Sekhari, Jason D. Lee, Nathan Kallus, Wen Sun
We show that our algorithm's computational and statistical complexities scale polynomially with respect to the horizon and the intrinsic dimension of the features of the observation space.
1 code implementation • 31 Jan 2022 • Xuezhou Zhang, Yuda Song, Masatoshi Uehara, Mengdi Wang, Alekh Agarwal, Wen Sun
We present BRIEE (Block-structured Representation learning with Interleaved Explore Exploit), an algorithm for efficient reinforcement learning in Markov Decision Processes with block-structured dynamics (i.e., Block MDPs), where rich observations are generated from a set of unknown latent states.
1 code implementation • 12 Nov 2021 • Chengchun Shi, Masatoshi Uehara, Jiawei Huang, Nan Jiang
In this work, we first propose novel identification methods for OPE in POMDPs with latent confounders, by introducing bridge functions that link the target policy's value and the observed data distribution.
no code implementations • ICLR 2022 • Masatoshi Uehara, Xuezhou Zhang, Wen Sun
This work studies representation learning in RL: how can we learn a compact, low-dimensional representation on top of which we can perform RL procedures, such as exploration and exploitation, in a sample-efficient manner?
no code implementations • ICLR 2022 • Masatoshi Uehara, Wen Sun
Under the assumption that the ground-truth model belongs to our function class (i.e., realizability in the function class), CPPO has a PAC guarantee with offline data that provides only partial coverage, i.e., it can learn a policy that competes against any policy covered by the offline data.
1 code implementation • NeurIPS 2021 • Jonathan D. Chang, Masatoshi Uehara, Dhruv Sreenivas, Rahul Kidambi, Wen Sun
Instead, the learner is presented with a static offline dataset of state-action-next-state transition triples from a potentially less proficient behavior policy.
1 code implementation • NeurIPS 2021 • Jonathan Daniel Chang, Masatoshi Uehara, Dhruv Sreenivas, Rahul Kidambi, Wen Sun
Instead, the learner is presented with a static offline dataset of state-action-next-state triples from a potentially less proficient behavior policy.
no code implementations • 25 Mar 2021 • Nathan Kallus, Xiaojie Mao, Masatoshi Uehara
Previous work has relied on completeness conditions on these functions to identify the causal parameters, required uniqueness assumptions for estimation, and focused on parametric estimation of the bridge functions.
no code implementations • 5 Feb 2021 • Masatoshi Uehara, Masaaki Imaizumi, Nan Jiang, Nathan Kallus, Wen Sun, Tengyang Xie
We offer a theoretical characterization of off-policy evaluation (OPE) in reinforcement learning using function approximation for marginal importance weights and $q$-functions when these are estimated using recent minimax methods.
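For orientation, one standard estimator built from these two nuisances (written in generic discounted-MDP notation, as a sketch rather than the paper's exact estimator) is the marginalized doubly robust form

```latex
\hat{J} \;=\; \mathbb{E}_{s_{0} \sim d_{0}}\big[\, \hat{q}(s_{0}, \pi) \,\big]
  \;+\; \frac{1}{n} \sum_{i=1}^{n}
    \hat{w}(s_{i}, a_{i})
    \Big( r_{i} + \gamma\, \hat{q}(s_{i}', \pi) - \hat{q}(s_{i}, a_{i}) \Big),
```

where $\hat{w}$ is the estimated marginal density ratio between the target and behavior state-action distributions, $\hat{q}$ the estimated q-function, and $\hat{q}(s, \pi) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}[\hat{q}(s, a)]$.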
no code implementations • 31 Jan 2021 • Yichun Hu, Nathan Kallus, Masatoshi Uehara
Second, we provide new analyses of FQI and Bellman residual minimization to establish the correct pointwise convergence guarantees.
1 code implementation • 21 Oct 2020 • Nathan Kallus, Yuta Saito, Masatoshi Uehara
We study off-policy evaluation (OPE) from multiple logging policies, each generating a dataset of fixed size, i.e., stratified sampling.
1 code implementation • NeurIPS 2020 • Nathan Kallus, Masatoshi Uehara
Targeting deterministic policies, for which the action is a deterministic function of the state, is crucial since optimal policies are always deterministic (up to ties).
no code implementations • 6 Jun 2020 • Nathan Kallus, Masatoshi Uehara
Compared with the classic case of a pre-specified evaluation policy, when evaluating natural stochastic policies, the efficiency bound, which measures the best-achievable estimation error, is inflated since the evaluation policy itself is unknown.
1 code implementation • NeurIPS 2020 • Masahiro Kato, Masatoshi Uehara, Shota Yasui
Then, we propose doubly robust and efficient estimators for OPE and OPL under a covariate shift by using a nonparametric estimator of the density ratio between the historical and evaluation data distributions.
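One way such an estimator can be assembled (generic contextual-bandit notation; a sketch under the stated setting rather than the paper's exact construction) is to multiply the usual doubly robust score by the estimated covariate density ratio:

```latex
\hat{J} \;=\; \frac{1}{n} \sum_{i=1}^{n}
  \hat{\rho}(x_{i})
  \left[
    \frac{\pi_{e}(a_{i} \mid x_{i})}{\hat{\pi}_{b}(a_{i} \mid x_{i})}
      \big( r_{i} - \hat{q}(x_{i}, a_{i}) \big)
    + \mathbb{E}_{a \sim \pi_{e}(\cdot \mid x_{i})}\big[ \hat{q}(x_{i}, a) \big]
  \right],
```

where $\hat{\rho}$ estimates the density ratio between the evaluation and historical covariate distributions, $\pi_{e}$ and $\hat{\pi}_{b}$ are the evaluation and (estimated) behavior policies, and $\hat{q}$ is an outcome model.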
no code implementations • ICML 2020 • Nathan Kallus, Masatoshi Uehara
Policy gradient methods in reinforcement learning update policy parameters by taking steps in the direction of an estimated gradient of policy value.
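Concretely, the policy gradient theorem gives the gradient being estimated as

```latex
\nabla_{\theta} J(\pi_{\theta})
  \;=\; \mathbb{E}_{\pi_{\theta}}\!\left[
    \sum_{t} \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t}) \,
    q^{\pi_{\theta}}(s_{t}, a_{t})
  \right],
```

so the statistical question is how efficiently this expectation can be estimated, here from off-policy data.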
1 code implementation • 30 Dec 2019 • Nathan Kallus, Xiaojie Mao, Masatoshi Uehara
A central example is the efficient estimating equation for the (local) quantile treatment effect ((L)QTE) in causal inference, which involves as a nuisance the covariate-conditional cumulative distribution function evaluated at the quantile to be estimated.
no code implementations • ICML 2020 • Masatoshi Uehara, Jiawei Huang, Nan Jiang
We provide theoretical investigations into off-policy evaluation in reinforcement learning using function approximators for (marginalized) importance weights and value functions.
no code implementations • 12 Sep 2019 • Nathan Kallus, Masatoshi Uehara
This precisely characterizes the curse of horizon: in time-variant processes, OPE is only feasible in the near-on-policy setting, where behavior and target policies are sufficiently similar.
1 code implementation • 22 Aug 2019 • Nathan Kallus, Masatoshi Uehara
Off-policy evaluation (OPE) in reinforcement learning allows one to evaluate novel decision policies without needing to conduct exploration, which is often costly or otherwise infeasible.
1 code implementation • NeurIPS 2019 • Nathan Kallus, Masatoshi Uehara
We propose new estimators for OPE based on empirical likelihood that are always more efficient than IS, SNIS, and DR and satisfy the same stability and boundedness properties as SNIS.
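For reference, the importance sampling baselines mentioned here are, in generic contextual-bandit notation,

```latex
\hat{J}_{\mathrm{IS}} = \frac{1}{n} \sum_{i=1}^{n} \rho_{i} r_{i},
\qquad
\hat{J}_{\mathrm{SNIS}} = \frac{\sum_{i=1}^{n} \rho_{i} r_{i}}{\sum_{i=1}^{n} \rho_{i}},
\qquad
\rho_{i} = \frac{\pi_{e}(a_{i} \mid x_{i})}{\pi_{b}(a_{i} \mid x_{i})},
```

with SNIS trading a small bias for boundedness of the normalized weights.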
no code implementations • 15 May 2019 • Takeru Matsuda, Masatoshi Uehara, Aapo Hyvarinen
However, model selection methods for general non-normalized models have not been proposed so far.
no code implementations • 8 Mar 2019 • Masatoshi Uehara, Takeru Matsuda, Jae Kwang Kim
We propose estimation methods for such unnormalized models with missing data.
no code implementations • 23 Jan 2019 • Masatoshi Uehara, Takafumi Kanamori, Takashi Takenouchi, Takeru Matsuda
The parameter estimation of unnormalized models is a challenging problem.
no code implementations • 24 Aug 2018 • Masatoshi Uehara, Takeru Matsuda, Fumiyasu Komaki
First, we propose a method for reducing asymptotic variance by estimating the parameters of the auxiliary distribution.
no code implementations • 10 Oct 2016 • Masatoshi Uehara, Issei Sato, Masahiro Suzuki, Kotaro Nakayama, Yutaka Matsuo
Generative adversarial networks (GANs) are successful deep generative models.