UDOP leverages the spatial correlation between textual content and the document image to model the image, text, and layout modalities with one uniform representation.
Coupled with a lightweight segmentation head, we achieve the best trade-off between segmentation accuracy and latency on ARM-based mobile devices on the ADE20K and Cityscapes datasets.
PhyCV is the first computer vision library that utilizes algorithms derived directly from the equations of physics governing physical phenomena.
Given an input image or video and a target text prompt, our goal is to edit the appearance of existing objects (e.g., an object's texture) or augment the scene with visual effects (e.g., smoke, fire) in a semantically meaningful manner.
Most work on reward learning has used simulated environments, but complex information about values is often expressed in natural language; we believe reward learning for language is key to making RL practical and safe for real-world tasks.
Attention-based models trained on protein sequences have demonstrated remarkable success at classification and generation tasks relevant to artificial intelligence-driven protein design.
We introduce a state-of-the-art real-time, high-fidelity audio codec that leverages neural networks.
In this work, we present a conceptually simple yet effective method for training a strong bilingual/multilingual multimodal representation model.
First, we use synthetic language modeling tasks to understand the gap between state space models (SSMs) and attention.
We introduce k-planes, a white-box model for radiance fields in arbitrary dimensions.
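The planar factorization behind k-planes can be illustrated in a few lines: a d-dimensional point is projected onto each of the C(d, 2) axis-aligned coordinate planes, each plane's feature grid is bilinearly interpolated at the projected location, and the per-plane features are combined by elementwise product. The sketch below is a minimal NumPy illustration of that idea under stated assumptions (the function name, grid layout, and normalized-coordinate convention are ours, not the library's API):

```python
import itertools
import numpy as np

def kplanes_features(point, planes, resolution):
    """Feature lookup for one d-dimensional point in [0, 1]^d.

    `planes` holds one (resolution, resolution, F) feature grid per
    axis-aligned coordinate plane, ordered as itertools.combinations
    of the d axes. Each grid is bilinearly interpolated at the point's
    projection, and the results are fused by elementwise product.
    """
    d = point.shape[-1]
    feats = None
    for (i, j), grid in zip(itertools.combinations(range(d), 2), planes):
        # Project onto the (i, j) plane and scale to grid coordinates.
        x = point[i] * (resolution - 1)
        y = point[j] * (resolution - 1)
        x0, y0 = int(np.floor(x)), int(np.floor(y))
        x1 = min(x0 + 1, resolution - 1)
        y1 = min(y0 + 1, resolution - 1)
        wx, wy = x - x0, y - y0
        # Bilinear interpolation of the plane's feature grid.
        f = (grid[x0, y0] * (1 - wx) * (1 - wy)
             + grid[x1, y0] * wx * (1 - wy)
             + grid[x0, y1] * (1 - wx) * wy
             + grid[x1, y1] * wx * wy)
        # Hadamard product fuses features across planes.
        feats = f if feats is None else feats * f
    return feats
```

For d = 3 (a static scene) this uses three planes; for d = 4 (space plus time) it uses six, which is what makes the representation compact relative to a dense d-dimensional grid.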
Ranked #1 on Novel View Synthesis on LLFF.