Masked Diffusion Transformer is a Strong Image Synthesizer

sail-sg/mdt 25 Mar 2023

Despite their success in image synthesis, we observe that diffusion probabilistic models (DPMs) often lack the contextual reasoning ability to learn the relations among object parts in an image, leading to a slow learning process.

Image Generation

ADAPT: Action-aware Driving Caption Transformer

jxbbb/adapt 1 Feb 2023

To bridge the gap, we propose an end-to-end transformer-based architecture, ADAPT (Action-aware Driving cAPtion Transformer), which provides user-friendly natural language narrations and reasoning for each decision-making step of autonomous vehicular control and action.

Autonomous Driving Decision Making

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

cloneofsimo/lora 2 Aug 2022

Yet, it is unclear how such freedom can be exercised to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes.

Text-to-Image Generation

Unmasked Teacher: Towards Training-Efficient Video Foundation Models

opengvlab/unmasked_teacher 28 Mar 2023

Previous VFMs rely on Image Foundation Models (IFMs), which face challenges in transferring to the video domain.

Ranked #1 on Video Retrieval on SSv2-template retrieval (using extra training data)

Action Classification Action Recognition +5

Mask-Free Video Instance Segmentation

syscv/maskfreevis 28 Mar 2023

A consistency loss is then enforced on the found matches.

Instance Segmentation Optical Flow Estimation +3

Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

microsoft/visual-chatgpt 8 Mar 2023

To this end, we build a system called Visual ChatGPT, which incorporates different Visual Foundation Models and enables the user to interact with ChatGPT by 1) sending and receiving not only language but also images and 2) providing complex visual questions or visual editing instructions that require multiple AI models to collaborate over multiple steps.

Ref-NPR: Reference-Based Non-Photorealistic Radiance Fields for Controllable Scene Stylization

dvlab-research/ref-npr 6 Dec 2022

We propose a ray registration process based on the stylized reference view to obtain pseudo-ray supervision in novel views.

Semantic correspondence

VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation

modelscope/modelscope 15 Mar 2023

A diffusion probabilistic model (DPM), which constructs a forward diffusion process by gradually adding noise to data points and learns the reverse denoising process to generate new samples, has been shown to handle complex data distributions.
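The forward diffusion process mentioned above can be illustrated with a minimal sketch. It assumes the standard DDPM-style closed form q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I) and a linear beta schedule from 1e-4 to 0.02 over 1000 steps. These values and the function names (linear_beta_schedule, cumulative_alphas, q_sample) are illustrative assumptions. They are not VideoFusion's decomposed-noise formulation or any ModelScope API.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed DDPM-style schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Return alpha_bar_t = prod_{s<=t} (1 - beta_s) for every step t."""
    alpha_bars = []
    prod = 1.0
    for b in betas:
        prod *= 1.0 - b
        alpha_bars.append(prod)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng=random):
    """Jump straight to step t of the forward process in closed form.

    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, with eps ~ N(0, I).
    Returns (x_t, eps); a denoiser would be trained to predict eps from (x_t, t),
    which is how the reverse (denoising) process is learned.
    """
    a = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps


if __name__ == "__main__":
    betas = linear_beta_schedule(1000)
    alpha_bars = cumulative_alphas(betas)
    x0 = [1.0, -0.5, 0.25]  # toy "data point"
    for t in (0, 499, 999):
        xt, _ = q_sample(x0, t, alpha_bars, random.Random(0))
        print(t, round(alpha_bars[t], 6), [round(v, 3) for v in xt])
```

Because alpha_bar_t shrinks toward zero as t grows, early steps barely perturb x_0, while the last step is almost pure Gaussian noise. The reverse model has to learn to undo that gradual corruption.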

Denoising Image Generation +1

ReVersion: Diffusion-Based Relation Inversion from Images

ziqihuangg/reversion 23 Mar 2023

Specifically, we propose a novel relation-steering contrastive learning scheme to impose two critical properties of the relation prompt: 1) The relation prompt should capture the interaction between objects, enforced by the preposition prior.

Contrastive Learning

