Once the subject is embedded in the output domain of the model, the unique identifier can be used to synthesize fully novel photorealistic images of the subject contextualized in different scenes.
We explore a data-driven approach for learning to optimize neural networks.
We study the capabilities of speech processing systems trained simply to predict transcripts for large amounts of audio from the internet.
Ranked #1 on Speech Recognition on CHiME6
We test both task-specific and general baselines, evaluating downstream performance in addition to the ability of the models to provide provenance.
Ranked #3 on Entity Linking on KILT: WNED-CWEB
In this work, we present the Textless Vision-Language Transformer (TVLT), where homogeneous transformer blocks take raw visual and audio inputs for vision-and-language representation learning with minimal modality-specific design, and do not use text-specific modules such as tokenization or automatic speech recognition (ASR).
Our method, Dream Fields, can generate the geometry and color of a wide range of objects without 3D supervision.