Our Dynamic Scene Transformer (DyST) model leverages recent work in neural scene representation to learn a latent decomposition of monocular real-world videos into scene content, per-view scene dynamics, and camera pose.
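A minimal sketch of what such a three-way latent factorization might look like in code is given below; the backbone, head names, and dimensions are illustrative assumptions, not DyST's actual architecture.

```python
# Hypothetical sketch of a three-way latent decomposition of a monocular video,
# loosely following the description above. Layer sizes and module names are
# illustrative assumptions, not the actual DyST architecture.
import torch
import torch.nn as nn

class LatentDecomposer(nn.Module):
    def __init__(self, latent_dim=256):
        super().__init__()
        # Shared convolutional backbone applied to individual frames.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Separate heads for scene content, per-view dynamics, and camera pose.
        self.content_head = nn.Linear(128, latent_dim)
        self.dynamics_head = nn.Linear(128, latent_dim)
        self.pose_head = nn.Linear(128, 6)  # e.g. a 6-dim latent pose embedding

    def forward(self, frames):  # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).view(b, t, -1)
        content = self.content_head(feats).mean(dim=1)  # shared across the video
        dynamics = self.dynamics_head(feats)            # one latent per view
        pose = self.pose_head(feats)                    # one pose latent per view
        return content, dynamics, pose
```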
In this paper, we leverage recent progress in diffusion models to equip 3D scene representation learning models with the ability to render high-fidelity novel views, while largely retaining benefits such as object-level scene editing.
Self-supervised methods for learning object-centric representations have recently been applied successfully to various datasets.
2 code implementations • 6 Mar 2023 • Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor Mordatch, Pete Florence
Large language models excel at a wide range of complex tasks.
Ranked #1 on Visual Question Answering (VQA) on OK-VQA
Automatically discovering composable abstractions from raw perceptual data is a long-standing challenge in machine learning.
Our main insight is that one can train a Pose Encoder that peeks at the target image and learns a latent pose embedding which is used by the decoder for view synthesis.
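The idea lends itself to a compact sketch: an encoder maps the target image to a low-dimensional latent pose that conditions the decoder. The module names, layer sizes, and pixel-wise decoding below are assumptions for illustration, not the paper's actual implementation.

```python
# Illustrative sketch of the "latent pose" idea described above: an encoder
# looks at the target view and produces a low-dimensional pose embedding that
# conditions the decoder. Shapes and layers are assumptions.
import torch
import torch.nn as nn

class PoseEncoder(nn.Module):
    def __init__(self, pose_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, pose_dim),  # low-dimensional latent pose
        )

    def forward(self, target_image):  # "peek" at the target view
        return self.net(target_image)

class ConditionedDecoder(nn.Module):
    def __init__(self, scene_dim=256, pose_dim=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(scene_dim + pose_dim + 2, 128), nn.ReLU(),
            nn.Linear(128, 3),  # RGB for one queried pixel
        )

    def forward(self, scene_latent, latent_pose, pixel_xy):
        # Render a pixel of the target view from the scene representation,
        # conditioned on the latent pose inferred from the target image itself.
        return self.mlp(torch.cat([scene_latent, latent_pose, pixel_xy], dim=-1))
```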
A compositional understanding of the world in terms of objects and their geometry in 3D space is considered a cornerstone of human cognition.
In our work, we find evidence that these losses are insufficient for the task of scene decomposition unless architectural inductive biases are also taken into account.
1 code implementation • Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J. Fleet, Dan Gnanapragasam, Florian Golemo, Charles Herrmann, Thomas Kipf, Abhijit Kundu, Dmitry Lagun, Issam Laradji, Hsueh-Ti Liu, Henning Meyer, Yishu Miao, Derek Nowrouzezahrai, Cengiz Oztireli, Etienne Pot, Noha Radwan, Daniel Rebain, Sara Sabour, Mehdi S. M. Sajjadi, Matan Sela, Vincent Sitzmann, Austin Stone, Deqing Sun, Suhani Vora, Ziyu Wang, Tianhao Wu, Kwang Moo Yi, Fangcheng Zhong, Andrea Tagliasacchi
Data is the driving force of machine learning, with the amount and quality of training data often being more important for the performance of a system than architecture and training details.
We observe that the majority of artifacts in sparse input scenarios are caused by errors in the estimated scene geometry, and by divergent behavior at the start of training.
1 code implementation • Mehdi S. M. Sajjadi, Henning Meyer, Etienne Pot, Urs Bergmann, Klaus Greff, Noha Radwan, Suhani Vora, Mario Lucic, Daniel Duckworth, Alexey Dosovitskiy, Jakob Uszkoreit, Thomas Funkhouser, Andrea Tagliasacchi
In this work, we propose the Scene Representation Transformer (SRT), a method which processes posed or unposed RGB images of a new area, infers a "set-latent scene representation", and synthesises novel views, all in a single feed-forward pass.
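A rough sketch of this encode-once, decode-anywhere pattern is shown below; the patch embedding, token counts, and attention dimensions are simplifying assumptions rather than the published SRT details.

```python
# Rough sketch of a set-latent scene representation in the spirit of the
# description above: input views are encoded into a set of latent tokens, and
# novel-view pixels are decoded by attending into that set. Dimensions and the
# patch embedding are simplifying assumptions, not the published SRT details.
import torch
import torch.nn as nn

class SetLatentSceneModel(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.patch_embed = nn.Linear(3 * 16 * 16, d_model)  # flattened 16x16 RGB patches
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.query_embed = nn.Linear(6, d_model)  # ray origin + direction of a query pixel
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.to_rgb = nn.Linear(d_model, 3)

    def forward(self, patches, query_rays):
        # patches: (batch, n_patches, 3*16*16) flattened from all input views
        # query_rays: (batch, n_pixels, 6) rays of the novel view to render
        scene_set = self.encoder(self.patch_embed(patches))    # set-latent scene rep.
        q = self.query_embed(query_rays)
        decoded, _ = self.cross_attn(q, scene_set, scene_set)  # attend into the set
        return self.to_rgb(decoded)                            # per-ray RGB, feed-forward
```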
We present NeSF, a method for producing 3D semantic fields from posed RGB images alone.
We present a learning-based method for synthesizing novel views of complex scenes using only unstructured collections of in-the-wild photographs.
Variational Autoencoders (VAEs) provide a theoretically-backed and popular framework for deep generative models.
State-of-the-art video restoration methods integrate optical flow estimation networks to utilize temporal information.
Together with a video discriminator, we also propose additional loss functions to further reinforce temporal consistency in the generated sequences.
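One simple form such a temporal-consistency term can take is sketched below: the previous generated frame is flow-warped to the current one and the difference is penalized, discouraging flicker. This is an illustrative assumption, not necessarily the paper's exact loss.

```python
# Illustrative temporal-consistency penalty (an assumption, not necessarily the
# paper's exact loss): warp the previous generated frame to the current one
# with dense optical flow and penalize the remaining difference.
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Backward-warp `frame` (B, C, H, W) with dense `flow` (B, 2, H, W)."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=frame.device, dtype=frame.dtype),
        torch.arange(w, device=frame.device, dtype=frame.dtype),
        indexing="ij",
    )
    # Normalize sampling coordinates to [-1, 1] for grid_sample.
    grid_x = 2.0 * (xs + flow[:, 0]) / (w - 1) - 1.0
    grid_y = 2.0 * (ys + flow[:, 1]) / (h - 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1)  # (B, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)

def temporal_consistency_loss(prev_gen, cur_gen, flow_prev_to_cur):
    """L2 between the current generated frame and the flow-warped previous one."""
    return F.mse_loss(cur_gen, warp(prev_gen, flow_prev_to_cur))
```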
Recent advances in generative modeling have led to an increased interest in the study of statistical divergences as means of model comparison.
A possible explanation for training instabilities is the inherent imbalance between the networks: while the discriminator is trained directly on both real and fake samples, the generator only has control over the fake samples it produces, since the real data distribution is fixed by the choice of dataset.
Recent advances in video super-resolution have shown that convolutional neural networks combined with motion compensation are able to merge information from multiple low-resolution (LR) frames to generate high-quality images.
Ranked #6 on Video Super-Resolution on MSU Video Upscalers: Quality Enhancement (VMAF metric)
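A minimal sketch of this general recipe (motion-compensated neighbor frames fused by a CNN upsampler) follows; the alignment is assumed to come from an external flow estimator, and the layer sizes are illustrative rather than any specific published architecture.

```python
# Minimal sketch of motion-compensated multi-frame fusion followed by a CNN
# upsampler, in the spirit of the description above. The alignment step is
# assumed to be handled by an external flow estimator; layer sizes are
# illustrative, not a specific published architecture.
import torch
import torch.nn as nn

class FusionSR(nn.Module):
    def __init__(self, n_frames=3, scale=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3 * n_frames, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),  # rearrange channels into a higher-resolution image
        )

    def forward(self, aligned_lr_frames):
        # aligned_lr_frames: (batch, n_frames, 3, H, W), already motion-compensated
        # so that every frame is registered to the reference frame.
        stacked = aligned_lr_frames.flatten(1, 2)  # concatenate frames along channels
        return self.body(stacked)
```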
Single image super-resolution is the task of inferring a high-resolution image from a single low-resolution input.
Light field photography captures rich structural information that may facilitate a number of traditional image processing and computer vision tasks.
Peer grading is the process of students reviewing each other's work, such as homework submissions, and has lately become a popular mechanism used in massive open online courses (MOOCs).