We show that large language models (LLMs) can be adapted to be generalizable policies for embodied visual tasks.
A fairly reliable trend in deep reinforcement learning is that performance scales with the number of parameters, provided there is a complementary scaling in the amount of training data.
Intuitively, one would expect the accuracy of a trained neural network's predictions on test samples to correlate with how densely those samples are surrounded by seen training samples in representation space.
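The intuition above can be sketched numerically: proxy the "density of surrounding training samples" by the mean distance from each test point to its k nearest training points, then compare accuracy on the densest versus sparsest halves of the test set. The 1-nearest-neighbour classifier, the Gaussian blobs, and k = 10 below are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated classes in a toy 2-D "representation space".
# (Illustrative stand-in for learned representations.)
n = 500
X_train = np.vstack([rng.normal(-2, 1.0, (n, 2)), rng.normal(2, 1.0, (n, 2))])
y_train = np.array([0] * n + [1] * n)
X_test = np.vstack([rng.normal(-2, 1.5, (n, 2)), rng.normal(2, 1.5, (n, 2))])
y_test = np.array([0] * n + [1] * n)

# All pairwise test-to-train distances.
dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=-1)

# 1-NN prediction, and a density proxy: mean distance to the 10 nearest
# training points (smaller value = denser neighbourhood).
pred = y_train[dists.argmin(axis=1)]
density_proxy = np.sort(dists, axis=1)[:, :10].mean(axis=1)

# Split the test set into the denser and sparser halves.
order = density_proxy.argsort()
dense_half, sparse_half = order[:n], order[n:]
acc_dense = (pred[dense_half] == y_test[dense_half]).mean()
acc_sparse = (pred[sparse_half] == y_test[sparse_half]).mean()
```

On this toy data the densely surrounded test points are classified more reliably than the sparsely surrounded ones, matching the stated intuition.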
Data efficiency is a key challenge for deep reinforcement learning.
Ranked #2 on Atari Games 100k on Atari 100k (using extra training data)
In hospitals, data are siloed in specific information systems that make the same information available under different modalities, such as the various medical imaging exams a patient undergoes (CT scans, MRI, PET, ultrasound, etc.).
In this work we study locality and compositionality in the context of learning representations for Zero Shot Learning (ZSL).
Conditional text-to-image generation is an active area of research, with many possible applications.
Ranked #2 on Text-to-Image Generation on GeNeVA (i-CLEVR)
We argue that the estimation of mutual information between high dimensional continuous random variables can be achieved by gradient descent over neural networks.
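As a minimal sketch of this idea, one can do gradient ascent on the Donsker-Vardan lower bound of mutual information. The deliberately tiny critic T(x, y) = theta * x * y below is an illustrative assumption (in practice the critic is a neural network trained the same way), applied to a correlated Gaussian pair whose true MI is -0.5 * log(1 - rho^2) ≈ 0.83 nats.

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated Gaussian pair (x, y); illustrative data, not the paper's.
n, rho = 10_000, 0.9
x = rng.normal(size=n)
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)
y_shuf = rng.permutation(y)  # shuffling approximates the product of marginals

# Donsker-Varadhan bound: E_joint[T] - log E_marginal[exp(T)],
# with the toy critic T(x, y) = theta * x * y (assumption).
theta, lr = 0.0, 0.05
for _ in range(500):
    t_marg = theta * x * y_shuf
    w = np.exp(t_marg - t_marg.max())  # stabilised softmax weights
    # Analytic gradient of the bound with respect to theta.
    grad = np.mean(x * y) - np.sum(w * x * y_shuf) / np.sum(w)
    theta += lr * grad

mi_est = np.mean(theta * x * y) - np.log(np.mean(np.exp(theta * x * y_shuf)))
```

Because the critic family is restricted, the estimate converges to a value below the true 0.83 nats, but it is still a valid lower bound obtained purely by gradient-based optimisation.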
One of the most successful techniques in generative models has been decomposing a complicated generation task into a series of simpler generation tasks.
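One classic instance of this decomposition is the autoregressive factorization p(x) = prod_i p(x_i | x_{<i}), where each factor is a far simpler generation task than producing x at once. The character bigram model below is a toy assumption standing in for each simpler conditional generator.

```python
import random

# Toy corpus; each p(next char | previous char) is one "simple task".
corpus = "the cat sat on the mat and the cat ate the rat"

# Step 1: fit the conditionals by counting bigrams.
counts = {}
for prev, nxt in zip(corpus, corpus[1:]):
    counts.setdefault(prev, []).append(nxt)

def generate(n_chars, seed_char="t", seed=0):
    # Step 2: generate one character at a time; each draw solves only
    # the simple conditional task p(x_i | x_{i-1}).
    rng = random.Random(seed)
    out = [seed_char]
    for _ in range(n_chars - 1):
        out.append(rng.choice(counts[out[-1]]))
    return "".join(out)

sample = generate(20)
```

The same "chain of simpler tasks" view underlies autoregressive pixel and token models as well as multi-stage coarse-to-fine generators.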
Directed latent variable models that formulate the joint distribution as $p(x, z) = p(z) p(x \mid z)$ have the advantage of fast and exact sampling.
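The fast, exact sampling follows from ancestral sampling: draw z from the prior p(z), then draw x from the decoder p(x | z) in a single forward pass, with no iterative inference. The Gaussian prior and the linear-Gaussian decoder below are illustrative stand-ins for a learned model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_x = 2, 4

# Decoder parameters for p(x | z); assumed already learned.
W = rng.normal(size=(d_x, d_z))
b = rng.normal(size=d_x)

def sample(n):
    # Ancestral sampling: first the prior, then the conditional.
    z = rng.normal(size=(n, d_z))                       # z ~ p(z) = N(0, I)
    x = z @ W.T + b + 0.1 * rng.normal(size=(n, d_x))   # x ~ p(x | z)
    return x, z

x, z = sample(1000)
```

Each sample costs one prior draw plus one decoder evaluation, which is what makes this factorization attractive compared with models that require Markov chains or iterative refinement to sample.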