Search Results for author: Baining Guo

Found 44 papers, 27 papers with code

VASA-1: Lifelike Audio-Driven Talking Faces Generated in Real Time

no code implementations • 16 Apr 2024 • Sicheng Xu, Guojun Chen, Yu-Xiao Guo, Jiaolong Yang, Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, Baining Guo

We introduce VASA, a framework for generating lifelike talking faces with appealing visual affective skills (VAS) given a single static image and a speech audio clip.

Paper
Add Code

GaussianCube: Structuring Gaussian Splatting using Optimal Transport for 3D Generative Modeling

no code implementations • 28 Mar 2024 • BoWen Zhang, Yiji Cheng, Jiaolong Yang, Chunyu Wang, Feng Zhao, Yansong Tang, Dong Chen, Baining Guo

To address the problem, we introduce GaussianCube, a structured GS representation that is both powerful and efficient for generative modeling.

Paper
Add Code

Simplified Diffusion Schrödinger Bridge

1 code implementation • 21 Mar 2024 • Zhicong Tang, Tiankai Hang, Shuyang Gu, Dong Chen, Baining Guo

This paper introduces a novel theoretical simplification of the Diffusion Schr\"odinger Bridge (DSB) that facilitates its unification with Score-based Generative Models (SGMs), addressing the limitations of DSB in complex data generation and enabling faster convergence and enhanced performance.

Paper
Code

VisualCritic: Making LMMs Perceive Visual Quality Like Humans

no code implementations • 19 Mar 2024 • Zhipeng Huang, Zhizheng Zhang, Yiting Lu, Zheng-Jun Zha, Zhibo Chen, Baining Guo

In this paper, we explore this question and provide the answer "Yes!".

Instruction Following

Paper
Add Code

RelationVLM: Making Large Vision-Language Models Understand Visual Relations

no code implementations • 19 Mar 2024 • Zhipeng Huang, Zhizheng Zhang, Zheng-Jun Zha, Yan Lu, Baining Guo

The development of Large Vision-Language Models (LVLMs) is striving to catch up with the success of Large Language Models (LLMs), yet it faces more challenges to be resolved.

Language Modelling

Paper
Add Code

CCA: Collaborative Competitive Agents for Image Editing

1 code implementation • 23 Jan 2024 • Tiankai Hang, Shuyang Gu, Dong Chen, Xin Geng, Baining Guo

This paper presents a novel generative model, Collaborative Competitive Agents (CCA), which leverages the capabilities of multiple Large Language Models (LLMs) based agents to execute complex tasks.

Paper
Code

VolumeDiffusion: Flexible Text-to-3D Generation with Efficient Volumetric Encoder

no code implementations • 18 Dec 2023 • Zhicong Tang, Shuyang Gu, Chunyu Wang, Ting Zhang, Jianmin Bao, Dong Chen, Baining Guo

The 3D volumes are then trained on a diffusion model for text-to-3D generation using a 3D U-Net.

3D Generation Object +1

Paper
Add Code

MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation

no code implementations • 30 Nov 2023 • Yanhui Wang, Jianmin Bao, Wenming Weng, Ruoyu Feng, Dacheng Yin, Tao Yang, Jingxu Zhang, Qi Dai Zhiyuan Zhao, Chunyu Wang, Kai Qiu, Yuhui Yuan, Chuanxin Tang, Xiaoyan Sun, Chong Luo, Baining Guo

We present MicroCinema, a straightforward yet effective framework for high-quality and coherent text-to-video generation.

Text-to-Image Generation Text-to-Video Generation +1

Paper
Add Code

COLE: A Hierarchical Generation Framework for Multi-Layered and Editable Graphic Design

no code implementations • 28 Nov 2023 • Peidong Jia, Chenxuan Li, Yuhui Yuan, Zeyu Liu, Yichao Shen, Bohan Chen, Xingru Chen, Yinglin Zheng, Dong Chen, Ji Li, Xiaodong Xie, Shanghang Zhang, Baining Guo

Our COLE system comprises multiple fine-tuned Large Language Models (LLMs), Large Multimodal Models (LMMs), and Diffusion Models (DMs), each specifically tailored for design-aware layer-wise captioning, layout planning, reasoning, and the task of generating images and text.

Image Generation

Paper
Add Code

CCEdit: Creative and Controllable Video Editing via Diffusion Models

no code implementations • 28 Sep 2023 • Ruoyu Feng, Wenming Weng, Yanhui Wang, Yuhui Yuan, Jianmin Bao, Chong Luo, Zhibo Chen, Baining Guo

The versatility of our framework is demonstrated through a diverse range of choices in both structure representations and personalized T2I models, as well as the option to provide the edited key frame.

Text-to-Image Generation Video Editing

Paper
Add Code

InstructDiffusion: A Generalist Modeling Interface for Vision Tasks

1 code implementation • 7 Sep 2023 • Zigang Geng, Binxin Yang, Tiankai Hang, Chen Li, Shuyang Gu, Ting Zhang, Jianmin Bao, Zheng Zhang, Han Hu, Dong Chen, Baining Guo

We present InstructDiffusion, a unifying and generic framework for aligning computer vision tasks with human instructions.

Keypoint Detection

332

Paper
Code

V-DETR: DETR with Vertex Relative Position Encoding for 3D Object Detection

1 code implementation • 8 Aug 2023 • Yichao Shen, Zigang Geng, Yuhui Yuan, Yutong Lin, Ze Liu, Chunyu Wang, Han Hu, Nanning Zheng, Baining Guo

We introduce a highly performant 3D object detector for point clouds using the DETR framework.

Ranked #2 on 3D Object Detection on ScanNetV2

3D Object Detection object-detection +1

Paper
Code

Adaptive Frequency Filters As Efficient Global Token Mixers

2 code implementations • ICCV 2023 • Zhipeng Huang, Zhizheng Zhang, Cuiling Lan, Zheng-Jun Zha, Yan Lu, Baining Guo

With this insight, we propose Adaptive Frequency Filtering (AFF) token mixer.

118

Paper
Code

Swin3D: A Pretrained Transformer Backbone for 3D Indoor Scene Understanding

2 code implementations • 14 Apr 2023 • Yu-Qi Yang, Yu-Xiao Guo, Jian-Yu Xiong, Yang Liu, Hao Pan, Peng-Shuai Wang, Xin Tong, Baining Guo

We pretrained a large {\SST} model on a synthetic Structured3D dataset, which is an order of magnitude larger than the ScanNet dataset.

Ranked #2 on 3D Object Detection on S3DIS (using extra training data)

3D Object Detection Scene Understanding +1

1,124

Paper
Code

IRGen: Generative Modeling for Image Retrieval

1 code implementation • 17 Mar 2023 • Yidan Zhang, Ting Zhang, Dong Chen, Yujing Wang, Qi Chen, Xing Xie, Hao Sun, Weiwei Deng, Qi Zhang, Fan Yang, Mao Yang, Qingmin Liao, Baining Guo

While generative modeling has been ubiquitous in natural language processing and computer vision, its application to image retrieval remains unexplored.

Image Retrieval Retrieval

Paper
Code

Efficient Diffusion Training via Min-SNR Weighting Strategy

2 code implementations • ICCV 2023 • Tiankai Hang, Shuyang Gu, Chen Li, Jianmin Bao, Dong Chen, Han Hu, Xin Geng, Baining Guo

Denoising diffusion models have been a mainstream approach for image generation, however, training these models often suffers from slow convergence.

Ranked #1 on Image Generation on ImageNet 256x256

Denoising Image Generation +2

161

Paper
Code

RemoteTouch: Enhancing Immersive 3D Video Communication with Hand Touch

no code implementations • 28 Feb 2023 • Yizhong Zhang, Zhiqi Li, Sicheng Xu, Chong Li, Jiaolong Yang, Xin Tong, Baining Guo

A key challenge in emulating the remote hand touch is the realistic rendering of the participant's hand and arm as the hand touches the screen.

Paper
Add Code

Improving CLIP Fine-tuning Performance

1 code implementation • ICCV 2023 • Yixuan Wei, Han Hu, Zhenda Xie, Ze Liu, Zheng Zhang, Yue Cao, Jianmin Bao, Dong Chen, Baining Guo

Experiments suggest that the feature map distillation approach significantly boosts the fine-tuning performance of CLIP models on several typical downstream vision tasks.

object-detection Object Detection +1

217

Paper
Code

iCLIP: Bridging Image Classification and Contrastive Language-Image Pre-Training for Visual Recognition

no code implementations • CVPR 2023 • Yixuan Wei, Yue Cao, Zheng Zhang, Houwen Peng, Zhuliang Yao, Zhenda Xie, Han Hu, Baining Guo

This paper presents a method that effectively combines two prevalent visual recognition methods, i. e., image classification and contrastive language-image pre-training, dubbed iCLIP.

Classification Image Classification +2

Paper
Add Code

MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation

1 code implementation • CVPR 2023 • Ludan Ruan, Yiyang Ma, Huan Yang, Huiguo He, Bei Liu, Jianlong Fu, Nicholas Jing Yuan, Qin Jin, Baining Guo

To generate joint audio-video pairs, we propose a novel Multi-Modal Diffusion model (i. e., MM-Diffusion), with two-coupled denoising autoencoders.

Denoising FAD +1

331

Paper
Code

Rodin: A Generative Model for Sculpting 3D Digital Avatars Using Diffusion

no code implementations • CVPR 2023 • Tengfei Wang, Bo Zhang, Ting Zhang, Shuyang Gu, Jianmin Bao, Tadas Baltrusaitis, Jingjing Shen, Dong Chen, Fang Wen, Qifeng Chen, Baining Guo

This paper presents a 3D generative model that uses diffusion models to automatically generate 3D digital avatars represented as neural radiance fields.

Computational Efficiency

Paper
Add Code

Fine-Grained Image Style Transfer with Visual Transformers

1 code implementation • 11 Oct 2022 • Jianbo Wang, Huan Yang, Jianlong Fu, Toshihiko Yamasaki, Baining Guo

Such a design usually destroys the spatial information of the input images and fails to transfer fine-grained style patterns into style transfer results.

Style Transfer

Paper
Code

Implicit Conversion of Manifold B-Rep Solids by Neural Halfspace Representation

1 code implementation • 21 Sep 2022 • Hao-Xiang Guo, Yang Liu, Hao Pan, Baining Guo

We present a novel implicit representation -- neural halfspace representation (NH-Rep), to convert manifold B-Rep solids to implicit representations.

Surface Reconstruction

Paper
Code

Language-Guided Face Animation by Recurrent StyleGAN-based Generator

1 code implementation • 11 Aug 2022 • Tiankai Hang, Huan Yang, Bei Liu, Jianlong Fu, Xin Geng, Baining Guo

Specifically, we propose a recurrent motion generator to extract a series of semantic and motion information from the language and feed it along with visual information to a pre-trained StyleGAN to generate high-quality frames.

Image Manipulation

Paper
Code

ComplexGen: CAD Reconstruction by B-Rep Chain Complex Generation

1 code implementation • 29 May 2022 • Haoxiang Guo, Shilin Liu, Hao Pan, Yang Liu, Xin Tong, Baining Guo

We view the reconstruction of CAD models in the boundary representation (B-Rep) as the detection of geometric primitives of different orders, i. e. vertices, edges and surface patches, and the correspondence of primitives, which are holistically modeled as a chain complex, and show that by modeling such comprehensive structures more complete and regularized reconstructions can be achieved.

CAD Reconstruction

Paper
Code

Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation

1 code implementation • 27 May 2022 • Yixuan Wei, Han Hu, Zhenda Xie, Zheng Zhang, Yue Cao, Jianmin Bao, Dong Chen, Baining Guo

These properties, which we aggregately refer to as optimization friendliness, are identified and analyzed by a set of attention- and optimization-related diagnosis tools.

Ranked #2 on Instance Segmentation on COCO test-dev (using extra training data)

Contrastive Learning Image Classification +5

217

Paper
Code

iCAR: Bridging Image Classification and Image-text Alignment for Visual Recognition

no code implementations • 22 Apr 2022 • Yixuan Wei, Yue Cao, Zheng Zhang, Zhuliang Yao, Zhenda Xie, Han Hu, Baining Guo

Second, we convert the image classification problem from learning parametric category classifier weights to learning a text encoder as a meta network to generate category classifier weights.

Action Recognition Classification +7

Paper
Add Code

Protecting Celebrities from DeepFake with Identity Consistency Transformer

1 code implementation • CVPR 2022 • Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Ting Zhang, Weiming Zhang, Nenghai Yu, Dong Chen, Fang Wen, Baining Guo

In this work we propose Identity Consistency Transformer, a novel face forgery detection method that focuses on high-level semantics, specifically identity information, and detecting a suspect face by finding identity inconsistency in inner and outer face regions.

Face Swapping

Paper
Code

StyleSwin: Transformer-based GAN for High-resolution Image Generation

1 code implementation • CVPR 2022 • BoWen Zhang, Shuyang Gu, Bo Zhang, Jianmin Bao, Dong Chen, Fang Wen, Yong Wang, Baining Guo

To this end, we believe that local attention is crucial to strike the balance between computational efficiency and modeling capacity.

Ranked #1 on Image Generation on CelebA 256x256 (FID metric)

Blocking Computational Efficiency +3

476

Paper
Code

VirtualCube: An Immersive 3D Video Communication System

no code implementations • 13 Dec 2021 • Yizhong Zhang, Jiaolong Yang, Zhen Liu, Ruicheng Wang, Guojun Chen, Xin Tong, Baining Guo

The VirtualCube system is a 3D video conference system that attempts to overcome some limitations of conventional technologies.

Depth Estimation

Paper
Add Code

Vector Quantized Diffusion Model for Text-to-Image Synthesis

2 code implementations • CVPR 2022 • Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, Baining Guo

Our experiments indicate that the VQ-Diffusion model with the reparameterization is fifteen times faster than traditional AR methods while achieving a better image quality.

Ranked #1 on Text-to-Image Generation on Oxford 102 Flowers (using extra training data)

Denoising Text-to-Image Generation

836

Paper
Code

Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions

1 code implementation • CVPR 2022 • Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, Baining Guo

To enable VL pre-training, we jointly optimize the HD-VILA model by a hybrid Transformer that learns rich spatiotemporal features, and a multimodal Transformer that enforces interactions of the learned video features with diversified texts.

Ranked #16 on Video Retrieval on MSR-VTT

Retrieval Super-Resolution +4

436

Paper
Code

Swin Transformer V2: Scaling Up Capacity and Resolution

19 code implementations • CVPR 2022 • Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, Baining Guo

Three main techniques are proposed: 1) a residual-post-norm method combined with cosine attention to improve training stability; 2) A log-spaced continuous position bias method to effectively transfer models pre-trained using low-resolution images to downstream tasks with high-resolution inputs; 3) A self-supervised pre-training method, SimMIM, to reduce the needs of vast labeled images.

Ranked #4 on Image Classification on ImageNet V2 (using extra training data)

Action Classification Image Classification +3

29,758

Paper
Code

CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

6 code implementations • CVPR 2022 • Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Lu Yuan, Dong Chen, Baining Guo

By further pretraining on the larger dataset ImageNet-21K, we achieve 87. 5% Top-1 accuracy on ImageNet-1K and high segmentation performance on ADE20K with 55. 7 mIoU.

Ranked #25 on Semantic Segmentation on ADE20K val

Image Classification Semantic Segmentation

5,254

Paper
Code

Aggregated Contextual Transformations for High-Resolution Image Inpainting

2 code implementations • 3 Apr 2021 • Yanhong Zeng, Jianlong Fu, Hongyang Chao, Baining Guo

For improving texture synthesis, we enhance the discriminator of AOT-GAN by training it with a tailored mask-prediction task.

Ranked #9 on Image Inpainting on Places2

Image Inpainting Texture Synthesis +1

4,198

Paper
Code

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

70 code implementations • ICCV 2021 • Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo

This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision.

Ranked #2 on Image Classification on OmniBenchmark

Image Classification Instance Segmentation +3

124,984

Paper
Code

Identity-Driven DeepFake Detection

no code implementations • 7 Dec 2020 • Xiaoyi Dong, Jianmin Bao, Dongdong Chen, Weiming Zhang, Nenghai Yu, Dong Chen, Fang Wen, Baining Guo

Our approach takes as input the suspect image/video as well as the target identity information (a reference image or video).

DeepFake Detection Face Swapping

Paper
Add Code

Learning Texture Transformer Network for Image Super-Resolution

1 code implementation • CVPR 2020 • Fuzhi Yang, Huan Yang, Jianlong Fu, Hongtao Lu, Baining Guo

In this paper, we propose a novel Texture Transformer Network for Image Super-Resolution (TTSR), in which the LR and Ref images are formulated as queries and keys in a transformer, respectively.

Hard Attention Image Generation +2

747

Paper
Code

Face X-ray for More General Face Forgery Detection

4 code implementations • CVPR 2020 • Lingzhi Li, Jianmin Bao, Ting Zhang, Hao Yang, Dong Chen, Fang Wen, Baining Guo

For this reason, face X-ray provides an effective way for detecting forgery generated by most existing face manipulation algorithms.

DeepFake Detection Face Swapping

Paper
Code

Learning Pyramid-Context Encoder Network for High-Quality Image Inpainting

2 code implementations • CVPR 2019 • Yanhong Zeng, Jianlong Fu, Hongyang Chao, Baining Guo

As the missing content can be filled by attention transfer from deep to shallow in a pyramid fashion, both visual and semantic coherence for image inpainting can be ensured.

Image Inpainting Vocal Bursts Intensity Prediction

349

Paper
Code

Compressing Neural Networks using the Variational Information Bottelneck

1 code implementation • ICML 2018 • Bin Dai, Chen Zhu, Baining Guo, David Wipf

Neural networks can be compressed to reduce memory and computational requirements, or to increase accuracy by facilitating the use of a larger base architecture.

Paper
Code

Unsupervised Extraction of Video Highlights Via Robust Recurrent Auto-encoders

no code implementations • ICCV 2015 • Huan Yang, Baoyuan Wang, Stephen Lin, David Wipf, Minyi Guo, Baining Guo

With the growing popularity of short-form video sharing platforms such as \em{Instagram} and \em{Vine}, there has been an increasing need for techniques that automatically extract highlights from video.

Paper
Add Code

Orientational Pyramid Matching for Recognizing Indoor Scenes

no code implementations • CVPR 2014 • Lingxi Xie, Jingdong Wang, Baining Guo, Bo Zhang, Qi Tian

The novelty lies in that OPM uses the 3D orientations to form the pyramid and produce the pooling regions, which is unlike SPM that uses the spatial positions to form the pyramid.

General Classification Scene Classification +1

Paper
Add Code

Fast Neighborhood Graph Search using Cartesian Concatenation

no code implementations • 11 Dec 2013 • Jingdong Wang, Jing Wang, Gang Zeng, Rui Gan, Shipeng Li, Baining Guo

This structure augments the neighborhood graph with a bridge graph.

Paper
Add Code

Cannot find the paper you are looking for? You can Submit a new open access paper.