We propose a fine-tuning method that can erase a visual concept from a pre-trained diffusion model, given only the name of the style and using negative guidance as a teacher.
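The snippet does not spell out how negative guidance acts as a teacher. One common way to write it, assumed here rather than taken from the text above, is to push the frozen model's unconditional noise prediction away from its concept-conditioned prediction and use the result as the regression target for the fine-tuned model. The function name, the argument names, and the scale `eta` are hypothetical.

```python
import numpy as np

def negative_guidance_target(eps_uncond, eps_cond, eta=1.0):
    """Teacher target that points away from the concept direction.

    eps_uncond: frozen model's noise prediction without the concept prompt.
    eps_cond:   frozen model's noise prediction with the concept prompt.
    eta:        guidance strength (assumed hyperparameter).
    """
    concept_direction = eps_cond - eps_uncond
    return eps_uncond - eta * concept_direction

# Toy arrays standing in for noise predictions of a frozen diffusion model.
eps_uncond = np.zeros((2, 3))
eps_cond = np.ones((2, 3))
target = negative_guidance_target(eps_uncond, eps_cond, eta=1.0)
```

In this sketch the fine-tuned model would be trained so that its concept-conditioned prediction matches `target`. The training loop itself is omitted.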
We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.
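A minimal NumPy sketch of the idea, under stated assumptions: the frozen weight `W` is left untouched, and a low-rank update `B @ A` is added to its output. The class name, the `alpha / rank` scaling, and the initialization (small random `A`, zero `B`) are illustrative choices, not details given in the snippet above.

```python
import numpy as np

class LoRALinear:
    """Linear layer with a frozen weight plus a trainable low-rank update."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                                # frozen, never updated
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))   # trainable
        self.B = np.zeros((d_out, rank))                    # trainable, starts at zero
        self.scale = alpha / rank                           # assumed scaling convention

    def __call__(self, x):
        # Frozen path plus the low-rank path x -> A -> B.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def num_trainable(self):
        return self.A.size + self.B.size

W = np.random.default_rng(1).normal(size=(64, 64))
layer = LoRALinear(W, rank=4)
x = np.ones((2, 64))
y = layer(x)
```

With a 64 x 64 weight and rank 4, the sketch trains 512 parameters instead of 4096. Because `B` starts at zero, the layer initially reproduces the frozen layer's output exactly.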
Knowledge-based visual question answering (VQA) requires external knowledge beyond the image to answer the question.
Ranked #1 on Visual Question Answering on A-OKVQA
Scenic: A JAX Library for Computer Vision Research and Beyond
Once the subject is embedded in the output domain of the model, the unique identifier can be used to synthesize novel photorealistic images of the subject contextualized in different scenes.
In order to reap the benefits and avoid the drawbacks of CBFT and CFFT, we propose a novel framework with a Hybrid Feature Transformation module (HFT).
This is the first use of sparse convolution for 2D masked modeling.
Ranked #1 on Instance Segmentation on COCO 2017 val
Neural Radiance Fields (NeRF) are a rapidly growing area of research with wide-ranging applications in computer vision, graphics, robotics, and more.
Context clusters (CoCs) view an image as a set of unorganized points and extract features via a simplified clustering algorithm.
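A rough sketch of what clustering a set of points can look like, assuming each point is a nonzero feature vector, each point is assigned to its most similar center by cosine similarity, and a cluster's feature is the mean of its members. The function name and these specific choices are illustrative assumptions. The snippet above does not specify the algorithm.

```python
import numpy as np

def cluster_points(points, centers):
    """Assign each point to its most similar center and average each cluster.

    points:  (N, D) array of nonzero feature vectors.
    centers: (K, D) array of nonzero center vectors.
    """
    p = points / np.linalg.norm(points, axis=1, keepdims=True)
    c = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    similarity = p @ c.T                      # (N, K) cosine similarities
    assignment = similarity.argmax(axis=1)    # index of the closest center per point

    aggregated = np.zeros(centers.shape, dtype=float)
    for k in range(centers.shape[0]):
        members = points[assignment == k]
        if len(members) > 0:
            aggregated[k] = members.mean(axis=0)
    return assignment, aggregated

points = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
centers = np.array([[1.0, 0.0], [0.0, 1.0]])
assignment, aggregated = cluster_points(points, centers)
```

Here the first two points fall into the first cluster and the third into the second. Clusters with no members keep a zero feature in this sketch.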
We thus propose a new gradient-disentangled embedding sharing method that avoids the tug-of-war dynamics, improving both training efficiency and the quality of the pre-trained model.
Ranked #1 on Question Answering on SWAG
Natural Language Inference, Natural Language Understanding