Search Results for author: Ying Shan

Found 214 papers, 126 papers with code

StereoCrafter: Diffusion-based Generation of Long and High-fidelity Stereoscopic 3D from Monocular Videos

no code implementations 11 Sep 2024 Sijie Zhao, WenBo Hu, Xiaodong Cun, Yong Zhang, Xiaoyu Li, Zhe Kong, Xiangjun Gao, Muyao Niu, Ying Shan

This paper presents a novel framework for converting 2D videos to immersive stereoscopic 3D, addressing the growing demand for 3D content in immersive experiences.

Video Inpainting

ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

no code implementations 3 Sep 2024 Wangbo Yu, Jinbo Xing, Li Yuan, WenBo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, Yonghong Tian

Our method takes advantage of the powerful generation capabilities of a video diffusion model and the coarse 3D clues offered by a point-based representation to generate high-quality video frames with precise camera pose control.

3D Generation 3D Reconstruction +3

DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos

no code implementations 3 Sep 2024 WenBo Hu, Xiangjun Gao, Xiaoyu Li, Sijie Zhao, Xiaodong Cun, Yong Zhang, Long Quan, Ying Shan

Despite significant advancements in monocular depth estimation for static images, estimating video depth in the open world remains challenging, since open-world videos are extremely diverse in content, motion, camera movement, and length.

Monocular Depth Estimation Optical Flow Estimation +1

CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities

no code implementations 23 Aug 2024 Tao Wu, Yong Zhang, Xintao Wang, Xianpan Zhou, Guangcong Zheng, Zhongang Qi, Ying Shan, Xi Li

However, since it is only trained on static images, the fine-tuning process of subject learning disrupts the ability of video diffusion models (VDMs) to combine concepts and generate motions.

Denoising Video Generation

Story3D-Agent: Exploring 3D Storytelling Visualization with Large Language Models

no code implementations 21 Aug 2024 Yuzhou Huang, Yiran Qin, Shunlin Lu, Xintao Wang, Rui Huang, Ying Shan, Ruimao Zhang

Traditional visual storytelling is complex, requiring specialized knowledge and substantial resources, yet often constrained by human creativity and creation precision.

Logical Reasoning Motion Synthesis +1

SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses

no code implementations 3 Aug 2024 Chaolei Tan, Zihang Lin, Junfu Pu, Zhongang Qi, Wei-Yi Pei, Zhi Qu, Yexin Wang, Ying Shan, Wei-Shi Zheng, Jian-Fang Hu

Based on the dataset, we further introduce a more complex setting of video grounding dubbed Multi-Paragraph Video Grounding (MPVG), which takes as input multiple paragraphs and a long video for grounding each paragraph query to its temporal interval.

Natural Language Queries Video Grounding

Image Inpainting Models are Effective Tools for Instruction-guided Image Editing

no code implementations 18 Jul 2024 Xuan Ju, Junhao Zhuang, Zhaoyang Zhang, Yuxuan Bian, Qiang Xu, Ying Shan

The most advanced methods, such as SmartEdit and MGIE, usually combine large language models with diffusion models through joint training, where the former provides text understanding ability, and the latter provides image generation ability.

Image Inpainting

Noise Calibration: Plug-and-play Content-Preserving Video Enhancement using Pre-trained Video Diffusion Models

1 code implementation 14 Jul 2024 Qinyu Yang, Haoxin Chen, Yong Zhang, Menghan Xia, Xiaodong Cun, Zhixun Su, Ying Shan

To improve the quality of synthesized videos, one currently predominant method involves retraining an expert diffusion model and then applying a noising-denoising process for refinement.

Denoising Video Enhancement
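
For readers unfamiliar with the noising-denoising refinement this excerpt refers to, a minimal SDEdit-style sketch follows; `denoise_step` and the toy noise schedule are hypothetical stand-ins for a pre-trained video diffusion model's sampler, not the paper's actual procedure.

```python
import torch

def noising_denoising_refinement(video_latent, denoise_step, num_steps=1000,
                                 strength=0.5, generator=None):
    """SDEdit-style refinement: partially noise a latent, then denoise it back.

    `denoise_step(z, t)` is a hypothetical stand-in for one reverse step of a
    pre-trained video diffusion model; the real interface depends on the model.
    """
    start = int(num_steps * strength)           # how far to re-noise
    noise = torch.randn(video_latent.shape, generator=generator)
    alpha = 1.0 - start / num_steps             # crude schedule, sketch only
    z = alpha ** 0.5 * video_latent + (1 - alpha) ** 0.5 * noise
    for t in reversed(range(start)):            # denoise back to t = 0
        z = denoise_step(z, t)
    return z

# Toy usage with an identity "denoiser" just to show the call pattern.
refined = noising_denoising_refinement(torch.randn(1, 4, 16, 32, 32),
                                       denoise_step=lambda z, t: z)
```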

SEED-Story: Multimodal Long Story Generation with Large Language Model

1 code implementation 11 Jul 2024 Shuai Yang, Yuying Ge, Yang Li, Yukang Chen, Yixiao Ge, Ying Shan, Yingcong Chen

We further propose a multimodal attention sink mechanism to enable the generation of stories with up to 25 sequences (only 10 during training) in a highly efficient autoregressive manner.

Image Generation Language Modelling +3
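
The attention sink idea, known from StreamingLLM, keeps the key/value states of the earliest tokens while sliding a window over the rest, so generation stays bounded in memory for long stories. A minimal sketch of that cache policy (the class, sizes, and eviction rule are illustrative assumptions, not SEED-Story's actual code):

```python
import torch

class SinkKVCache:
    """Keep the first `n_sink` positions plus a sliding window of recent ones."""
    def __init__(self, n_sink=4, window=1024):
        self.n_sink, self.window = n_sink, window
        self.k = self.v = None  # (batch, heads, seq, dim)

    def append(self, k_new, v_new):
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=2)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=2)
        if self.k.size(2) > self.n_sink + self.window:
            # Evict the middle: sink tokens and the recent window survive.
            self.k = torch.cat([self.k[:, :, :self.n_sink],
                                self.k[:, :, -self.window:]], dim=2)
            self.v = torch.cat([self.v[:, :, :self.n_sink],
                                self.v[:, :, -self.window:]], dim=2)
        return self.k, self.v

cache = SinkKVCache(n_sink=4, window=8)
for _ in range(20):  # stream tokens; cache length stays bounded at 12
    k, v = cache.append(torch.randn(1, 2, 1, 16), torch.randn(1, 2, 1, 16))
assert k.size(2) <= 12
```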

How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?

no code implementations CVPR 2024 Yuxin Chen, Zongyang Ma, Ziqi Zhang, Zhongang Qi, Chunfeng Yuan, Bing Li, Junfu Pu, Ying Shan, Xiaojuan Qi, Weiming Hu

Dominant dual-encoder models enable efficient image-text retrieval but suffer from limited accuracy, while cross-encoder models offer higher accuracy at the expense of efficiency.

Contrastive Learning Image-text Retrieval +3

EA-VTR: Event-Aware Video-Text Retrieval

no code implementations 10 Jul 2024 Zongyang Ma, Ziqi Zhang, Yuxin Chen, Zhongang Qi, Chunfeng Yuan, Bing Li, Yingmin Luo, Xu Li, Xiaojuan Qi, Ying Shan, Weiming Hu

EA-VTR can efficiently encode frame-level and video-level visual representations simultaneously, enabling detailed event content and complex event temporal cross-modal alignment, ultimately enhancing the comprehensive understanding of video events.

Action Recognition Contrastive Learning +6

MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions

1 code implementation 8 Jul 2024 Xuan Ju, Yiming Gao, Zhaoyang Zhang, Ziyang Yuan, Xintao Wang, Ailing Zeng, Yu Xiong, Qiang Xu, Ying Shan

Sora's high motion intensity and long, consistent videos have significantly impacted the field of video generation, attracting unprecedented attention.

Video Alignment Video Generation

Image Conductor: Precision Control for Interactive Video Synthesis

no code implementations 21 Jun 2024 Yaowei Li, Xintao Wang, Zhaoyang Zhang, Zhouxia Wang, Ziyang Yuan, Liangbin Xie, Yuexian Zou, Ying Shan

To this end, we propose Image Conductor, a method for precise control of camera transitions and object movements to generate video assets from a single image.

Object

VoCo-LLaMA: Towards Vision Compression with Large Language Models

1 code implementation 18 Jun 2024 Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Ying Shan, Yansong Tang

Previous approaches compress vision tokens with external modules and force LLMs to understand the compressed ones, leading to visual information loss.

Computational Efficiency Question Answering +1

PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM

1 code implementation 5 Jun 2024 Tao Yang, Yingmin Luo, Zhongang Qi, Yang Wu, Ying Shan, Chang Wen Chen

Layout generation is the keystone of automated graphic design, requiring the position and size of various multi-modal design elements to be arranged in a visually pleasing and constraint-following manner.

Language Modelling Large Language Model

GrootVL: Tree Topology is All You Need in State Space Model

1 code implementation 4 Jun 2024 Yicheng Xiao, Lin Song, Shaoli Huang, Jiangshan Wang, Siyu Song, Yixiao Ge, Xiu Li, Ying Shan

State space models, which employ recursively propagated features, demonstrate representation capabilities comparable to Transformer models with superior efficiency.

Image Classification object-detection +1

ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation

1 code implementation 3 Jun 2024 Shaoshu Yang, Yong Zhang, Xiaodong Cun, Ying Shan, Ran He

Previous methods promote the frame rate by either training a video interpolation model in pixel space as a postprocessing stage or training an interpolation model in latent space for a specific base video model.

Video Generation

CV-VAE: A Compatible Video VAE for Latent Generative Video Models

1 code implementation 30 May 2024 Sijie Zhao, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Muyao Niu, Xiaoyu Li, WenBo Hu, Ying Shan

Moreover, since current diffusion-based approaches are often built on pre-trained text-to-image (T2I) models, directly training a video VAE without considering compatibility with existing T2I models creates a latent space gap between the two, and bridging this gap requires enormous computational resources even when the T2I models are used as initialization.

Quantization

MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model

1 code implementation 30 May 2024 Muyao Niu, Xiaodong Cun, Xintao Wang, Yong Zhang, Ying Shan, Yinqiang Zheng

We present MOFA-Video, an advanced controllable image animation method that generates video from a given image using various additional controllable signals (such as human landmark references, manual trajectories, or even another provided video) or their combinations.

Image Animation Video Generation

Mani-GS: Gaussian Splatting Manipulation with Triangular Mesh

no code implementations 28 May 2024 Xiangjun Gao, Xiaoyu Li, Yiyu Zhuang, Qi Zhang, WenBo Hu, Chaopeng Zhang, Yao Yao, Ying Shan, Long Quan

This approach reduces the need to design various algorithms for different types of Gaussian manipulation.

Novel View Synthesis

ToonCrafter: Generative Cartoon Interpolation

no code implementations 28 May 2024 Jinbo Xing, Hanyuan Liu, Menghan Xia, Yong Zhang, Xintao Wang, Ying Shan, Tien-Tsin Wong

We introduce ToonCrafter, a novel approach that transcends traditional correspondence-based cartoon video interpolation, paving the way for generative interpolation.

Decoder

ReVideo: Remake a Video with Motion and Content Control

no code implementations 22 May 2024 Chong Mou, Mingdeng Cao, Xintao Wang, Zhaoyang Zhang, Ying Shan, Jian Zhang

In this paper, we present a novel attempt to Remake a Video (ReVideo) which stands out from existing methods by allowing precise video editing in specific areas through the specification of both content and motion.

Video Editing Video Generation

Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots

no code implementations 13 May 2024 Chengyue Wu, Yixiao Ge, Qiushan Guo, Jiahao Wang, Zhixuan Liang, Zeyu Lu, Ying Shan, Ping Luo

Furthermore, we propose three automatic evaluation metrics, including code pass rate, text-match ratio, and GPT-4V overall rating, for a fine-grained assessment of the output code and rendered images.

Code Generation Descriptive
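
Of the three metrics, code pass rate is the most mechanical: execute each generated snippet and count the fraction that runs without error. A minimal sketch of how such a metric could be computed (the temporary-file harness and forced Agg backend are assumptions, not the benchmark's exact implementation):

```python
import os
import subprocess
import sys
import tempfile

def code_pass_rate(snippets, timeout=30):
    """Fraction of generated plotting snippets that execute without error."""
    passed = 0
    for code in snippets:
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            # Force a non-interactive matplotlib backend so scripts don't block.
            f.write("import matplotlib\nmatplotlib.use('Agg')\n" + code)
            path = f.name
        try:
            result = subprocess.run([sys.executable, path],
                                    capture_output=True, timeout=timeout)
            passed += result.returncode == 0
        except subprocess.TimeoutExpired:
            pass
        finally:
            os.unlink(path)
    return passed / len(snippets) if snippets else 0.0

print(code_pass_rate(["import matplotlib.pyplot as plt\nplt.plot([1, 2, 3])\n"]))
```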

SEED-Data-Edit Technical Report: A Hybrid Dataset for Instructional Image Editing

1 code implementation 7 May 2024 Yuying Ge, Sijie Zhao, Chen Li, Yixiao Ge, Ying Shan

In this technical report, we introduce SEED-Data-Edit: a unique hybrid dataset for instruction-guided image editing, which aims to facilitate image manipulation using open-form language.

Image Manipulation Language Modelling +2

SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension

2 code implementations 25 Apr 2024 Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, Ying Shan

We hope that our work can serve as a valuable addition to existing MLLM benchmarks, providing insightful observations and inspiring further research in the area of text-rich visual comprehension with MLLMs.

Benchmarking Multiple-choice

SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

1 code implementation 22 Apr 2024 Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, Ying Shan

We hope that our work will inspire future research into what can be achieved by versatile multimodal foundation models in real-world applications.

Image Generation

InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models

1 code implementation 10 Apr 2024 Jiale Xu, Weihao Cheng, Yiming Gao, Xintao Wang, Shenghua Gao, Ying Shan

We present InstantMesh, a feed-forward framework for instant 3D mesh generation from a single image, featuring state-of-the-art generation quality and significant training scalability.

Image to 3D

ST-LLM: Large Language Models Are Effective Temporal Learners

1 code implementation 30 Mar 2024 Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, Ge Li

In this paper, we investigate a straightforward yet unexplored question: Can we feed all spatial-temporal tokens into the LLM, thus delegating the task of video sequence modeling to the LLM?

Reading Comprehension Video-based Generative Performance Benchmarking +1

UV Gaussians: Joint Learning of Mesh Deformation and Gaussian Textures for Human Avatar Modeling

no code implementations 18 Mar 2024 Yujiao Jiang, Qingmin Liao, Xiaoyu Li, Li Ma, Qi Zhang, Chaopeng Zhang, Zongqing Lu, Ying Shan

Therefore, we propose UV Gaussians, which models the 3D human body by jointly learning mesh deformations and 2D UV-space Gaussian textures.

Texture-GS: Disentangling the Geometry and Texture for 3D Gaussian Splatting Editing

no code implementations 15 Mar 2024 Tian-Xing Xu, WenBo Hu, Yu-Kun Lai, Ying Shan, Song-Hai Zhang

3D Gaussian splatting, emerging as a groundbreaking approach, has drawn increasing attention for its capabilities of high-fidelity reconstruction and real-time rendering.

Disentanglement

SphereDiffusion: Spherical Geometry-Aware Distortion Resilient Diffusion Model

no code implementations 15 Mar 2024 Tao Wu, XueWei Li, Zhongang Qi, Di Hu, Xintao Wang, Ying Shan, Xi Li

Controllable spherical panoramic image generation holds substantial applicative potential across a variety of domains. However, it remains a challenging task due to inherent spherical distortion and geometry characteristics, which often result in low-quality content generation.

In this paper, we introduce SphereDiffusion, a novel framework that addresses these unique challenges to generate high-quality and precisely controllable spherical panoramic images. For the spherical distortion characteristic, we embed the semantics of the distorted object with text encoding, then explicitly construct the relationship with text-object correspondence to better use the pre-trained knowledge of planar images. Meanwhile, we employ a deformable technique to mitigate the semantic deviation in latent space caused by spherical distortion. For the spherical geometry characteristic, we exploit spherical rotation invariance to improve the data diversity and optimization objectives during training, enabling the model to better learn the spherical geometry. Furthermore, we enhance the denoising process of the diffusion model so that it can effectively use the learned geometric characteristics to ensure the boundary continuity of the generated images.

With these techniques, experiments on the Structured3D dataset show that SphereDiffusion significantly improves the quality of controllable spherical image generation, yielding a relative FID reduction of around 35% on average.

Denoising Diversity +1
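
One concrete consequence of the spherical rotation invariance exploited above: rotating the sphere about its vertical axis corresponds exactly to a horizontal circular shift of the equirectangular image, making it an essentially free augmentation. A sketch of that operation (a standard ERP identity; the paper's actual augmentation pipeline may differ):

```python
import numpy as np

def rotate_erp_yaw(image, degrees):
    """Rotate an equirectangular panorama about the vertical (yaw) axis.

    For ERP images, a yaw rotation is an exact horizontal circular shift,
    so no interpolation or distortion is introduced.
    """
    h, w = image.shape[:2]
    shift = int(round(degrees / 360.0 * w))
    return np.roll(image, shift, axis=1)

pano = np.random.rand(256, 512, 3)      # toy H x W x 3 panorama
augmented = rotate_erp_yaw(pano, 90.0)  # quarter-turn about the vertical axis
assert augmented.shape == pano.shape
```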

HRLAIF: Improvements in Helpfulness and Harmlessness in Open-domain Reinforcement Learning From AI Feedback

no code implementations 13 Mar 2024 Ang Li, Qiugen Xiao, Peng Cao, Jian Tang, Yi Yuan, Zijie Zhao, Xiaoyuan Chen, Liang Zhang, Xiangyang Li, Kaitong Yang, Weidong Guo, Yukang Gan, Xu Yu, Daniell Wang, Ying Shan

Using ChatGPT as a labeler to provide feedback on open-domain prompts in RLAIF training, we observe an increase in human evaluators' preference win ratio for model responses, but a decrease in evaluators' satisfaction rate.

Language Modelling Large Language Model +2

BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion

1 code implementation 11 Mar 2024 Xuan Ju, Xian Liu, Xintao Wang, Yuxuan Bian, Ying Shan, Qiang Xu

Image inpainting, the process of restoring corrupted images, has seen significant advancements with the advent of diffusion models (DMs).

Image Inpainting

DO3D: Self-supervised Learning of Decomposed Object-aware 3D Motion and Depth from Monocular Videos

no code implementations 9 Mar 2024 Xiuzhe Wu, Xiaoyang Lyu, Qihao Huang, Yong Liu, Yang Wu, Ying Shan, Xiaojuan Qi

Our system contains a depth estimation module to predict depth, and a new decomposed object-wise 3D motion (DO3D) estimation module to predict ego-motion and 3D object motion.

Depth Estimation Disentanglement +5

Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

1 code implementation 16 Feb 2024 Lanqing Guo, Yingqing He, Haoxin Chen, Menghan Xia, Xiaodong Cun, YuFei Wang, Siyu Huang, Yong Zhang, Xintao Wang, Qifeng Chen, Ying Shan, Bihan Wen

Diffusion models have proven to be highly effective in image and video generation; however, they still face composition challenges when generating images of varying sizes due to single-scale training data.

Video Generation

DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing

1 code implementation CVPR 2024 Chong Mou, Xintao Wang, Jiechong Song, Ying Shan, Jian Zhang

Large-scale Text-to-Image (T2I) diffusion models have revolutionized image generation over the last few years.

Image Generation

Advances in 3D Generation: A Survey

no code implementations 31 Jan 2024 Xiaoyu Li, Qi Zhang, Di Kang, Weihao Cheng, Yiming Gao, Jingbo Zhang, Zhihao Liang, Jing Liao, Yan-Pei Cao, Ying Shan

In this survey, we aim to introduce the fundamental methodologies of 3D generation methods and establish a structured roadmap, encompassing 3D representation, generation methods, datasets, and corresponding applications.

3D Generation Novel View Synthesis

YOLO-World: Real-Time Open-Vocabulary Object Detection

2 code implementations CVPR 2024 Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, Ying Shan

The You Only Look Once (YOLO) series of detectors have established themselves as efficient and practical tools.

Ranked #5 on Zero-Shot Object Detection on MSCOCO (using extra training data)

Instance Segmentation Language Modelling +5

TIP-Editor: An Accurate 3D Editor Following Both Text-Prompts And Image-Prompts

no code implementations 26 Jan 2024 Jingyu Zhuang, Di Kang, Yan-Pei Cao, Guanbin Li, Liang Lin, Ying Shan

To this end, we propose a 3D scene editing framework, TIP-Editor, that accepts both text and image prompts and a 3D bounding box to specify the editing region.

3D scene Editing

Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities

1 code implementation CVPR 2024 Yiyuan Zhang, Xiaohan Ding, Kaixiong Gong, Yixiao Ge, Ying Shan, Xiangyu Yue

We propose to improve transformers of a specific modality with irrelevant data from other modalities, e.g., improve an ImageNet model with audio or point cloud datasets.

Supervised Fine-tuning in turn Improves Visual Foundation Models

1 code implementation 18 Jan 2024 Xiaohu Jiang, Yixiao Ge, Yuying Ge, Dachuan Shi, Chun Yuan, Ying Shan

Image-text training like CLIP has dominated the pretraining of vision foundation models in recent years.

VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

2 code implementations CVPR 2024 Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, Ying Shan

Based on this stronger coupling, we shift the distribution to higher quality without motion degradation by finetuning spatial modules with high-quality images, resulting in a generic high-quality video model.

Text-to-Video Generation Video Generation

LLaMA Pro: Progressive LLaMA with Block Expansion

1 code implementation 4 Jan 2024 Chengyue Wu, Yukang Gan, Yixiao Ge, Zeyu Lu, Jiahao Wang, Ye Feng, Ying Shan, Ping Luo

Humans generally acquire new skills without compromising the old; however, the opposite holds for Large Language Models (LLMs), e.g., from LLaMA to CodeLLaMA.

Instruction Following Math

Low-Rank Approximation for Sparse Attention in Multi-Modal LLMs

no code implementations CVPR 2024 Lin Song, Yukang Chen, Shuai Yang, Xiaohan Ding, Yixiao Ge, Ying-Cong Chen, Ying Shan

We empirically show that sparse attention not only reduces computational demands but also enhances model performance in both NLP and multi-modal tasks.

SEED-Bench: Benchmarking Multimodal Large Language Models

1 code implementation CVPR 2024 Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, Ying Shan

Multimodal large language models (MLLMs) building upon the foundation of powerful large language models (LLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs (acting like a combination of GPT-4V and DALL-E 3).

Benchmarking Image Generation +1

UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio Video Point Cloud Time-Series and Image Recognition

1 code implementation CVPR 2024 Xiaohan Ding, Yiyuan Zhang, Yixiao Ge, Sijie Zhao, Lin Song, Xiangyu Yue, Ying Shan

1) We propose four architectural guidelines for designing large-kernel ConvNets, the core of which is to exploit the essential characteristics of large kernels that distinguish them from small kernels - they can see wide without going deep.

Time Series Time Series Forecasting
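
The "see wide without going deep" property is easiest to appreciate in code: a depthwise convolution with a very large kernel gives a single layer a large receptive field at a cost that scales with k²·C rather than k²·C². A minimal block in this spirit (illustrative only, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class LargeKernelBlock(nn.Module):
    """Depthwise large-kernel conv + pointwise mixing, roughly ConvNeXt-style."""
    def __init__(self, dim, kernel_size=13):
        super().__init__()
        # groups=dim makes the conv depthwise: one 13x13 filter per channel,
        # so cost grows with k^2 * C rather than k^2 * C^2.
        self.dw = nn.Conv2d(dim, dim, kernel_size,
                            padding=kernel_size // 2, groups=dim)
        self.norm = nn.BatchNorm2d(dim)
        self.pw = nn.Sequential(nn.Conv2d(dim, 4 * dim, 1), nn.GELU(),
                                nn.Conv2d(4 * dim, dim, 1))

    def forward(self, x):
        return x + self.pw(self.norm(self.dw(x)))  # residual connection

block = LargeKernelBlock(64)
out = block(torch.randn(1, 64, 56, 56))  # 13x13 receptive field in one layer
```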

VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation

1 code implementation 14 Dec 2023 Jinguo Zhu, Xiaohan Ding, Yixiao Ge, Yuying Ge, Sijie Zhao, Hengshuang Zhao, Xiaohua Wang, Ying Shan

In combination with the existing text tokenizer and detokenizer, this framework allows for the encoding of interleaved image-text data into a multimodal sequence, which can subsequently be fed into the transformer model.

Image Captioning In-Context Learning +4

EgoPlan-Bench: Benchmarking Multimodal Large Language Models for Human-Level Planning

1 code implementation 11 Dec 2023 Yi Chen, Yuying Ge, Yixiao Ge, Mingyu Ding, Bohao Li, Rui Wang, Ruifeng Xu, Ying Shan, Xihui Liu

The pursuit of artificial general intelligence (AGI) has been accelerated by Multimodal Large Language Models (MLLMs), which exhibit superior reasoning, generalization capabilities, and proficiency in processing multimodal inputs.

Benchmarking Human-Object Interaction Detection

SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models

1 code implementation CVPR 2024 Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, Ying Shan

Both quantitative and qualitative results on this evaluation dataset indicate that our SmartEdit surpasses previous methods, paving the way for the practical application of complex instruction-based image editing.

MotionCtrl: A Unified and Flexible Motion Controller for Video Generation

1 code implementation 6 Dec 2023 Zhouxia Wang, Ziyang Yuan, Xintao Wang, Tianshui Chen, Menghan Xia, Ping Luo, Ying Shan

Motions in a video primarily consist of camera motion, induced by camera movement, and object motion, resulting from object movement.

Object Video Generation

AnimateZero: Video Diffusion Models are Zero-Shot Image Animators

1 code implementation 6 Dec 2023 Jiwen Yu, Xiaodong Cun, Chenyang Qi, Yong Zhang, Xintao Wang, Ying Shan, Jian Zhang

For appearance control, we borrow intermediate latents and their features from the text-to-image (T2I) generation to ensure that the generated first frame matches the given generated image.

Image Animation Video Generation

MagicStick: Controllable Video Editing via Control Handle Transformations

1 code implementation 5 Dec 2023 Yue Ma, Xiaodong Cun, Yingqing He, Chenyang Qi, Xintao Wang, Ying Shan, Xiu Li, Qifeng Chen

Though succinct, our method is the first to demonstrate video property editing with a pre-trained text-to-image model.

Video Editing Video Generation

StyleCrafter: Enhancing Stylized Text-to-Video Generation with Style Adapter

2 code implementations 1 Dec 2023 Gongye Liu, Menghan Xia, Yong Zhang, Haoxin Chen, Jinbo Xing, Yibo Wang, Xintao Wang, Yujiu Yang, Ying Shan

To address these challenges, we introduce StyleCrafter, a generic method that enhances pre-trained T2V models with a style control adapter, enabling video generation in any style by providing a reference image.

Disentanglement Text-to-Video Generation +1

SEED-Bench-2: Benchmarking Multimodal Large Language Models

2 code implementations 28 Nov 2023 Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, Ying Shan

Multimodal large language models (MLLMs), building upon the foundation of powerful large language models (LLMs), have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs (acting like a combination of GPT-4V and DALL-E 3).

Benchmarking Image Generation +1

HumanRef: Single Image to 3D Human Generation via Reference-Guided Diffusion

no code implementations CVPR 2024 Jingbo Zhang, Xiaoyu Li, Qi Zhang, YanPei Cao, Ying Shan, Jing Liao

Optimization-based methods that lift text-to-image diffusion models to 3D generation often fail to preserve the texture details of the reference image, resulting in inconsistent appearances in different views.

3D Generation Image to 3D

ConTex-Human: Free-View Rendering of Human from a Single Image with Texture-Consistent Synthesis

no code implementations CVPR 2024 Xiangjun Gao, Xiaoyu Li, Chaopeng Zhang, Qi Zhang, YanPei Cao, Ying Shan, Long Quan

In this work, we propose a method to address the challenge of rendering a 3D human from a single image in a free-view manner.

HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting

no code implementations CVPR 2024 Xian Liu, Xiaohang Zhan, Jiaxiang Tang, Ying Shan, Gang Zeng, Dahua Lin, Xihui Liu, Ziwei Liu

In this paper, we propose an efficient yet effective framework, HumanGaussian, that generates high-quality 3D humans with fine-grained geometry and realistic appearance.

UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition

3 code implementations 27 Nov 2023 Xiaohan Ding, Yiyuan Zhang, Yixiao Ge, Sijie Zhao, Lin Song, Xiangyu Yue, Ying Shan

1) We propose four architectural guidelines for designing large-kernel ConvNets, the core of which is to exploit the essential characteristics of large kernels that distinguish them from small kernels - they can see wide without going deep.

 Ranked #1 on Object Detection on COCO 2017 (mAP metric)

Image Classification Object Detection +3

ViT-Lens: Towards Omni-modal Representations

1 code implementation CVPR 2024 Weixian Lei, Yixiao Ge, Kun Yi, Jianfeng Zhang, Difei Gao, Dylan Sun, Yuying Ge, Ying Shan, Mike Zheng Shou

In this paper, we present ViT-Lens-2 that facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space.

EEG Image Generation +2

GS-IR: 3D Gaussian Splatting for Inverse Rendering

1 code implementation CVPR 2024 Zhihao Liang, Qi Zhang, Ying Feng, Ying Shan, Kui Jia

We propose GS-IR, a novel inverse rendering approach based on 3D Gaussian Splatting (GS) that leverages forward mapping volume rendering to achieve photorealistic novel view synthesis and relighting results.

Inverse Rendering Novel View Synthesis

Vision-Language Instruction Tuning: A Review and Analysis

1 code implementation 14 Nov 2023 Chen Li, Yixiao Ge, Dian Li, Ying Shan

Instruction tuning is a crucial supervised training phase in Large Language Models (LLMs), aiming to enhance the LLM's ability to generalize instruction execution and adapt to user preferences.

Meta-Adapter: An Online Few-shot Learner for Vision-Language Model

1 code implementation NeurIPS 2023 Cheng Cheng, Lin Song, Ruoyi Xue, Hang Wang, Hongbin Sun, Yixiao Ge, Ying Shan

Without bells and whistles, our approach outperforms the state-of-the-art online few-shot learning method by an average of 3.6% on eight image classification datasets with higher inference speed.

Few-Shot Learning Image Classification +3

SemanticBoost: Elevating Motion Generation with Augmented Textual Cues

no code implementations 31 Oct 2023 Xin He, Shaoli Huang, Xiaohang Zhan, Chao Weng, Ying Shan

Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD).

VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

3 code implementations 30 Oct 2023 Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, Ying Shan

The I2V model is designed to produce videos that strictly adhere to the content of the provided reference image, preserving its content, structure, and style.

Text-to-Video Generation Video Generation

FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling

3 code implementations 23 Oct 2023 Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, Ziwei Liu

With the availability of large-scale video datasets and the advances of diffusion models, text-driven video generation has achieved substantial progress.

Video Generation

TapMo: Shape-aware Motion Generation of Skeleton-free Characters

no code implementations 19 Oct 2023 Jiaxu Zhang, Shaoli Huang, Zhigang Tu, Xin Chen, Xiaohang Zhan, Gang Yu, Ying Shan

In this work, we present TapMo, a Text-driven Animation Pipeline for synthesizing Motion in a broad spectrum of skeleton-free 3D characters.

DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing

no code implementations CVPR 2024 Jia-Wei Liu, Yan-Pei Cao, Jay Zhangjie Wu, Weijia Mao, YuChao Gu, Rui Zhao, Jussi Keppo, Ying Shan, Mike Zheng Shou

To overcome this, we propose to introduce the dynamic Neural Radiance Fields (NeRF) as the innovative video representation, where the editing can be performed in the 3D spaces and propagated to the entire video via the deformation field.

Style Transfer Super-Resolution +1

ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models

1 code implementation 11 Oct 2023 Yingqing He, Shaoshu Yang, Haoxin Chen, Xiaodong Cun, Menghan Xia, Yong Zhang, Xintao Wang, Ran He, Qifeng Chen, Ying Shan

Our work also suggests that a pre-trained diffusion model trained on low-resolution images can be directly used for high-resolution visual generation without further tuning, which may provide insights for future research on ultra-high-resolution image and video synthesis.

Image Generation

HiFi-123: Towards High-fidelity One Image to 3D Content Generation

no code implementations 10 Oct 2023 Wangbo Yu, Li Yuan, Yan-Pei Cao, Xiangjun Gao, Xiaoyu Li, WenBo Hu, Long Quan, Ying Shan, Yonghong Tian

Our contributions are twofold: First, we propose a Reference-Guided Novel View Enhancement (RGNV) technique that significantly improves the fidelity of diffusion-based zero-shot novel view synthesis methods.

3D Generation Image to 3D +1

Making LLaMA SEE and Draw with SEED Tokenizer

1 code implementation 2 Oct 2023 Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, Ying Shan

We identify two crucial design principles: (1) Image tokens should be independent of 2D physical patch positions and instead be produced with a 1D causal dependency, exhibiting intrinsic interdependence that aligns with the left-to-right autoregressive prediction mechanism in LLMs.

multimodal generation

Anti-Aliased Neural Implicit Surfaces with Encoding Level of Detail

no code implementations 19 Sep 2023 Yiyu Zhuang, Qi Zhang, Ying Feng, Hao Zhu, Yao Yao, Xiaoyu Li, Yan-Pei Cao, Ying Shan, Xun Cao

Drawing inspiration from voxel-based representations with the level of detail (LoD), we introduce a multi-scale tri-plane-based scene representation that is capable of capturing the LoD of the signed distance function (SDF) and the space radiance.

Surface Reconstruction
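
Tri-plane representations like the one described above store features on three axis-aligned planes and query them by projecting each 3D point onto those planes. A minimal single-scale sampling sketch (the axis convention and channel sizes are assumptions; the paper's multi-scale LoD variant keeps several plane sets at different resolutions):

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes, xyz):
    """Sample features for 3D points from three axis-aligned feature planes.

    planes: dict with 'xy', 'xz', 'yz' tensors of shape (1, C, H, W)
    xyz:    (P, 3) points in [-1, 1]^3
    Returns (P, 3C) concatenated features.
    """
    feats = []
    for name, idx in (("xy", [0, 1]), ("xz", [0, 2]), ("yz", [1, 2])):
        grid = xyz[:, idx].view(1, 1, -1, 2)           # (1, 1, P, 2)
        f = F.grid_sample(planes[name], grid, align_corners=True)
        feats.append(f.view(f.size(1), -1).t())        # (P, C)
    return torch.cat(feats, dim=-1)

planes = {k: torch.randn(1, 8, 64, 64) for k in ("xy", "xz", "yz")}
features = sample_triplane(planes, torch.rand(100, 3) * 2 - 1)  # (100, 24)
```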

Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video

1 code implementation ICCV 2023 Xiuzhe Wu, Pengfei Hu, Yang Wu, Xiaoyang Lyu, Yan-Pei Cao, Ying Shan, Wenming Yang, Zhongqian Sun, Xiaojuan Qi

Therefore, directly learning a mapping function from speech to the entire head image is prone to ambiguity, particularly when using a short video for training.

Image Generation

StyleAdapter: A Single-Pass LoRA-Free Model for Stylized Image Generation

no code implementations 4 Sep 2023 Zhouxia Wang, Xintao Wang, Liangbin Xie, Zhongang Qi, Ying Shan, Wenping Wang, Ping Luo

StyleAdapter can generate high-quality images that match the content of the prompts and adopt the style of the references (even for unseen styles) in a single pass, which is more flexible and efficient than previous methods.

Image Generation

Enhancing the vocal range of single-speaker singing voice synthesis with melody-unsupervised pre-training

no code implementations 1 Sep 2023 Shaohuan Zhou, Xu Li, Zhiyong Wu, Ying Shan, Helen Meng

Specifically, in the pre-training step, we design a phoneme predictor to produce the frame-level phoneme probability vectors as the phonemic timing information and a speaker encoder to model the timbre variations of different singers, and directly estimate the frame-level f0 values from the audio to provide the pitch information.

Singing Voice Synthesis Unsupervised Pre-training
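
The frame-level f0 values mentioned above can be obtained with standard tools; the excerpt does not name the extractor, so librosa's pYIN tracker is just one plausible choice for a sketch, not the paper's estimator:

```python
import numpy as np
import librosa

# Synthesize one second of a 220 Hz tone as a stand-in for real singing audio.
sr = 22050
t = np.linspace(0.0, 1.0, sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 220.0 * t)

# pYIN returns frame-level f0 plus per-frame voicing decisions/probabilities.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
print(np.nanmedian(f0))  # ~220 Hz for the synthetic tone
```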

Exploring Model Transferability through the Lens of Potential Energy

1 code implementation ICCV 2023 Xiaotong Li, Zixuan Hu, Yixiao Ge, Ying Shan, Ling-Yu Duan

The experimental results on 10 downstream tasks and 12 self-supervised models demonstrate that our approach can seamlessly integrate into existing ranking techniques and enhance their performances, revealing its effectiveness for the model selection task and its potential for understanding the mechanism in transfer learning.

Model Selection Transfer Learning

Sparse3D: Distilling Multiview-Consistent Diffusion for Object Reconstruction from Sparse Views

no code implementations 27 Aug 2023 Zi-Xin Zou, Weihao Cheng, Yan-Pei Cao, Shi-Sheng Huang, Ying Shan, Song-Hai Zhang

While recent techniques employ image diffusion models for generating plausible images at novel viewpoints or for distilling pre-trained diffusion priors into 3D representations using score distillation sampling (SDS), these methods often struggle to simultaneously achieve high-quality, consistent, and detailed results for both novel-view synthesis (NVS) and geometry.

3D Reconstruction Novel View Synthesis +1

Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning

2 code implementations 22 Aug 2023 Shansong Liu, Atin Sakkeer Hussain, Chenshuo Sun, Ying Shan

To fill this gap, we present a methodology for generating question-answer pairs from existing audio captioning datasets and introduce the MusicQA Dataset designed for answering open-ended music-related questions.

Caption Generation Large Language Model +3

ViT-Lens: Initiating Omni-Modal Exploration through 3D Insights

1 code implementation 20 Aug 2023 Weixian Lei, Yixiao Ge, Jianfeng Zhang, Dylan Sun, Kun Yi, Ying Shan, Mike Zheng Shou

A well-trained lens with a ViT backbone has the potential to serve as one of these foundation models, supervising the learning of subsequent modalities.

3D Classification Question Answering +4

Guide3D: Create 3D Avatars from Text and Image Guidance

no code implementations 18 Aug 2023 Yukang Cao, Yan-Pei Cao, Kai Han, Ying Shan, Kwan-Yee K. Wong

To this end, we introduce Guide3D, a zero-shot text-and-image-guided generative model for 3D avatar generation based on diffusion models.

3D Generation Text to 3D +1

OmniZoomer: Learning to Move and Zoom in on Sphere at High-Resolution

no code implementations ICCV 2023 Zidong Cao, Hao Ai, Yan-Pei Cao, Ying Shan, XiaoHu Qie, Lin Wang

The M\"obius transformation is typically employed to further provide the opportunity for movement and zoom on ODIs, but applying it to the image level often results in blurry effect and aliasing problem.

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

2 code implementations 30 Jul 2023 Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, Ying Shan

Based on powerful Large Language Models (LLMs), recent generative Multimodal Large Language Models (MLLMs) have gained prominence as a pivotal research area, exhibiting remarkable capability for both comprehension and generation.

Benchmarking Multiple-choice

GET3D--: Learning GET3D from Unconstrained Image Collections

no code implementations 27 Jul 2023 Fanghua Yu, Xintao Wang, Zheyuan Li, Yan-Pei Cao, Ying Shan, Chao Dong

While generative models have shown potential in creating 3D textured shapes from 2D images, their applicability in 3D industries is limited due to the lack of a well-defined camera distribution in real-world scenarios, resulting in low-quality shapes.

Planting a SEED of Vision in Large Language Model

1 code implementation 16 Jul 2023 Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, Ying Shan

Research on image tokenizers has previously reached an impasse, as frameworks employing quantized visual tokens have lost prominence due to subpar performance and convergence in multimodal comprehension (compared to BLIP-2, etc.)

Language Modelling Large Language Model +1

Neural Point-based Volumetric Avatar: Surface-guided Neural Points for Efficient and Photorealistic Volumetric Head Avatar

no code implementations 11 Jul 2023 Cong Wang, Di Kang, Yan-Pei Cao, Linchao Bao, Ying Shan, Song-Hai Zhang

Rendering photorealistic and dynamically moving human heads is crucial for ensuring a pleasant and immersive experience in AR/VR and video conferencing applications.

NOFA: NeRF-based One-shot Facial Avatar Reconstruction

no code implementations 7 Jul 2023 Wangbo Yu, Yanbo Fan, Yong Zhang, Xuan Wang, Fei Yin, Yunpeng Bai, Yan-Pei Cao, Ying Shan, Yang Wu, Zhongqian Sun, Baoyuan Wu

In this work, we propose a one-shot 3D facial avatar reconstruction framework that only requires a single source image to reconstruct a high-fidelity 3D facial avatar.

Decoder

DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models

1 code implementation 5 Jul 2023 Chong Mou, Xintao Wang, Jiechong Song, Ying Shan, Jian Zhang

Specifically, we construct classifier guidance based on the strong correspondence of intermediate features in the diffusion model.

Object
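
Classifier guidance from feature correspondence amounts to differentiating a feature-similarity loss with respect to the latent and stepping along the gradient. A schematic sketch (the linear feature extractor and step rule are stand-ins; DragonDiffusion uses the diffusion UNet's own intermediate features):

```python
import torch
import torch.nn.functional as F

def correspondence_guidance_step(z, feat_fn, target_feat, scale=1.0):
    """One guidance step: pull latent features toward target features.

    feat_fn:     differentiable map from latent to features (a stand-in for
                 the diffusion UNet's intermediate activations)
    target_feat: features describing where/what content should end up
    """
    z = z.detach().requires_grad_(True)
    loss = 1.0 - F.cosine_similarity(feat_fn(z), target_feat, dim=-1).mean()
    grad = torch.autograd.grad(loss, z)[0]
    return (z - scale * grad).detach()  # move latent to reduce the mismatch

# Toy usage: a linear "feature extractor" and a random target.
proj = torch.nn.Linear(16, 8)
z = torch.randn(4, 16)
target = torch.randn(4, 8)
z = correspondence_guidance_step(z, lambda x: proj(x), target, scale=0.1)
```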

DeSRA: Detect and Delete the Artifacts of GAN-based Real-World Super-Resolution Models

1 code implementation 5 Jul 2023 Liangbin Xie, Xintao Wang, Xiangyu Chen, Gen Li, Ying Shan, Jiantao Zhou, Chao Dong

After detecting the artifact regions, we develop a fine-tuning procedure to improve GAN-based SR models with a few samples, so that they can handle similar types of artifacts in more unseen real data.

Image Super-Resolution

DreamDiffusion: Generating High-Quality Images from Brain EEG Signals

1 code implementation 29 Jun 2023 Yunpeng Bai, Xintao Wang, Yan-Pei Cao, Yixiao Ge, Chun Yuan, Ying Shan

This paper introduces DreamDiffusion, a novel method for generating high-quality images directly from brain electroencephalogram (EEG) signals, without the need to translate thoughts into text.

EEG Image Generation

ID-Pose: Sparse-view Camera Pose Estimation by Inverting Diffusion Models

no code implementations 29 Jun 2023 Weihao Cheng, Yan-Pei Cao, Ying Shan

ID-Pose adds noise to one image and predicts the noise conditioned on the other image and a hypothesis of the relative pose.

Camera Pose Estimation Denoising +1
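
The scoring rule in the excerpt can be written down directly: noise the target image, let a view-conditioned model predict that noise given the source image and a pose hypothesis, and prefer hypotheses with lower prediction error. A schematic sketch with a stub model (the toy schedule and discrete hypothesis search are simplifications; ID-Pose itself optimizes the pose):

```python
import torch

def score_pose(model, x_src, x_tgt, pose, t, noise):
    """Lower denoising error => the pose explains the view change better."""
    alpha = 1.0 - t  # toy schedule; a real sampler uses the model's own
    x_noisy = alpha ** 0.5 * x_tgt + (1 - alpha) ** 0.5 * noise
    pred = model(x_noisy, x_src, pose, t)       # predicted noise
    return torch.mean((pred - noise) ** 2).item()

def estimate_pose(model, x_src, x_tgt, hypotheses, t=0.5):
    noise = torch.randn_like(x_tgt)             # shared noise for fairness
    errs = [score_pose(model, x_src, x_tgt, p, t, noise) for p in hypotheses]
    return hypotheses[min(range(len(errs)), key=errs.__getitem__)]

# Stub "model" and toy data just to show the search loop.
stub = lambda x_noisy, x_src, pose, t: torch.zeros_like(x_noisy)
poses = [torch.tensor([0.0, yaw]) for yaw in (0.0, 90.0, 180.0)]
best = estimate_pose(stub, torch.randn(1, 3, 32, 32),
                     torch.randn(1, 3, 32, 32), poses)
```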

DSRM: Boost Textual Adversarial Training with Distribution Shift Risk Minimization

1 code implementation 27 Jun 2023 Songyang Gao, Shihan Dou, Yan Liu, Xiao Wang, Qi Zhang, Zhongyu Wei, Jin Ma, Ying Shan

Adversarial training is one of the best-performing methods in improving the robustness of deep language models.

On the Universal Adversarial Perturbations for Efficient Data-free Adversarial Detection

1 code implementation 27 Jun 2023 Songyang Gao, Shihan Dou, Qi Zhang, Xuanjing Huang, Jin Ma, Ying Shan

Detecting adversarial samples that are carefully crafted to fool the model is a critical step to socially-secure applications.

text-classification Text Classification

PTVD: A Large-Scale Plot-Oriented Multimodal Dataset Based on Television Dramas

1 code implementation 26 Jun 2023 Chen Li, Xutan Peng, Teng Wang, Yixiao Ge, Mengyang Liu, Xuyuan Xu, Yexin Wang, Ying Shan

Art forms such as movies and television (TV) dramas are reflections of the real world, which have attracted much attention from the multimodal learning community recently.

Genre classification Retrieval +1

Towards Unseen Triples: Effective Text-Image-joint Learning for Scene Graph Generation

no code implementations 23 Jun 2023 Qianji Di, Wenxi Ma, Zhongang Qi, Tianxiang Hou, Ying Shan, Hanzi Wang

In this work, we propose a Text-Image-joint Scene Graph Generation (TISGG) model to resolve the unseen triples and improve the generalisation capability of the SGG models.

Graph Generation Scene Graph Generation +1

TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter

no code implementations 22 Jun 2023 Binjie Zhang, Yixiao Ge, Xuyuan Xu, Ying Shan, Mike Zheng Shou

In situations involving system upgrades that require updating the upstream foundation model, it becomes essential to re-train all downstream modules to adapt to the new foundation model, which is inflexible and inefficient.

Question Answering Text Retrieval +4

SGAT4PASS: Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation

1 code implementation 6 Jun 2023 XueWei Li, Tao Wu, Zhongang Qi, Gaoang Wang, Ying Shan, Xi Li

Experimental results on Stanford2D3D Panoramic datasets show that SGAT4PASS significantly improves performance and robustness, with approximately a 2% increase in mIoU, and when small 3D disturbances occur in the data, the stability of our performance is improved by an order of magnitude.

Semantic Segmentation

PanoGRF: Generalizable Spherical Radiance Fields for Wide-baseline Panoramas

no code implementations NeurIPS 2023 Zheng Chen, Yan-Pei Cao, Yuan-Chen Guo, Chen Wang, Ying Shan, Song-Hai Zhang

Unlike generalizable radiance fields trained on perspective images, PanoGRF avoids the information loss from panorama-to-perspective conversion and directly aggregates geometry and appearance features of 3D sample points from each panoramic view based on spherical projection.

Depth Estimation
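
The spherical projection PanoGRF aggregates features with rests on the standard mapping from equirectangular pixels to unit ray directions, which is worth writing out (a generic ERP conversion under one common axis convention, not PanoGRF-specific code):

```python
import numpy as np

def erp_pixel_to_direction(u, v, width, height):
    """Map equirectangular pixel coords (u, v) to unit ray directions.

    u in [0, W) spans longitude [-pi, pi); v in [0, H) spans latitude
    [pi/2, -pi/2] from top to bottom (one common ERP convention).
    """
    lon = (u + 0.5) / width * 2.0 * np.pi - np.pi
    lat = np.pi / 2.0 - (v + 0.5) / height * np.pi
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    return np.stack([x, y, z], axis=-1)  # unit-norm by construction

dirs = erp_pixel_to_direction(np.arange(512), np.full(512, 128), 512, 256)
assert np.allclose(np.linalg.norm(dirs, axis=-1), 1.0)
```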

Inserting Anybody in Diffusion Models via Celeb Basis

1 code implementation NeurIPS 2023 Ge Yuan, Xiaodong Cun, Yong Zhang, Maomao Li, Chenyang Qi, Xintao Wang, Ying Shan, Huicheng Zheng

Empowered by the proposed celeb basis, the new identity in our customized model showcases a better concept combination ability than previous personalization methods.

Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance

no code implementations 1 Jun 2023 Jinbo Xing, Menghan Xia, Yuxin Liu, Yuechen Zhang, Yong Zhang, Yingqing He, Hanyuan Liu, Haoxin Chen, Xiaodong Cun, Xintao Wang, Ying Shan, Tien-Tsin Wong

Our method, dubbed Make-Your-Video, involves joint-conditional video generation using a Latent Diffusion Model that is pre-trained for still image synthesis and then promoted for video generation with the introduction of temporal modules.

Image Generation Video Generation

TaleCrafter: Interactive Story Visualization with Multiple Characters

1 code implementation 29 May 2023 Yuan Gong, Youxin Pang, Xiaodong Cun, Menghan Xia, Yingqing He, Haoxin Chen, Longyue Wang, Yong Zhang, Xintao Wang, Ying Shan, Yujiu Yang

Accurate story visualization requires several key elements, such as identity consistency across frames, alignment between plain text and visual content, and a reasonable layout of objects in images.

Story Visualization Text-to-Image Generation

TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale

1 code implementation 23 May 2023 Ziyun Zeng, Yixiao Ge, Zhan Tong, Xihui Liu, Shu-Tao Xia, Ying Shan

We argue that tuning a text encoder end-to-end, as done in previous work, is suboptimal since it may overfit in terms of styles, thereby losing its original generalization ability to capture the semantics of various language registers.

Representation Learning

A Confidence-based Partial Label Learning Model for Crowd-Annotated Named Entity Recognition

1 code implementation 21 May 2023 Limao Xiong, Jie Zhou, Qunxi Zhu, Xiao Wang, Yuanbin Wu, Qi Zhang, Tao Gui, Xuanjing Huang, Jin Ma, Ying Shan

Particularly, we propose a Confidence-based Partial Label Learning (CPLL) method to integrate the prior confidence (given by annotators) and posterior confidences (learned by models) for crowd-annotated NER.

named-entity-recognition Named Entity Recognition +2

What Makes for Good Visual Tokenizers for Large Language Models?

1 code implementation 20 May 2023 Guangzhi Wang, Yixiao Ge, Xiaohan Ding, Mohan Kankanhalli, Ying Shan

In our benchmark, which is curated to evaluate MLLMs' visual semantic understanding and fine-grained perception capabilities, we discuss different visual tokenizers pre-trained with dominant methods (i.e., DeiT, CLIP, MAE, DINO), and observe that: i) fully/weakly supervised models capture more semantics than self-supervised models, but the gap is narrowed by scaling up the pre-training dataset.

Image Captioning Object Counting +2

π-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation

1 code implementation 27 Apr 2023 Chengyue Wu, Teng Wang, Yixiao Ge, Zeyu Lu, Ruisong Zhou, Ying Shan, Ping Luo

Foundation models have achieved great advances in multi-task learning with a unified interface of unimodal and multimodal tasks.

Multi-Task Learning

NeAI: A Pre-convoluted Representation for Plug-and-Play Neural Ambient Illumination

no code implementations 18 Apr 2023 Yiyu Zhuang, Qi Zhang, Xuan Wang, Hao Zhu, Ying Feng, Xiaoyu Li, Ying Shan, Xun Cao

Recent advances in implicit neural representation have demonstrated the ability to recover detailed geometry and material from multi-view images.

MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing

3 code implementations ICCV 2023 Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, XiaoHu Qie, Yinqiang Zheng

Despite the success in large-scale text-to-image generation and text-conditioned image editing, existing methods still struggle to produce consistent generation and editing results.

Text-based Image Editing

Improved Test-Time Adaptation for Domain Generalization

1 code implementation CVPR 2023 Liang Chen, Yong Zhang, Yibing Song, Ying Shan, Lingqiao Liu

Generally, a TTT strategy hinges its performance on two main factors: selecting an appropriate auxiliary TTT task for updating and identifying reliable parameters to update during the test phase.

Image to sketch recognition Single-Source Domain Generalization +1

TagGPT: Large Language Models are Zero-shot Multimodal Taggers

1 code implementation 6 Apr 2023 Chen Li, Yixiao Ge, Jiayong Mao, Dian Li, Ying Shan

Given a new entity that needs tagging for distribution, TagGPT introduces two alternative options for zero-shot tagging, i.e., a generative method with late semantic matching with the tag set, and another selective method with early matching in prompts.

Optical Character Recognition (OCR) Prompt Engineering +5

Learning Anchor Transformations for 3D Garment Animation

no code implementations CVPR 2023 Fang Zhao, Zekun Li, Shaoli Huang, Junwu Weng, Tianfei Zhou, Guo-Sen Xie, Jue Wang, Ying Shan

Once the anchor transformations are found, per-vertex nonlinear displacements of the garment template can be regressed in a canonical space, which reduces the complexity of deformation space learning.

Position

DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models

1 code implementation CVPR 2024 Yukang Cao, Yan-Pei Cao, Kai Han, Ying Shan, Kwan-Yee K. Wong

We present DreamAvatar, a text-and-shape guided framework for generating high-quality 3D human avatars with controllable poses.

DropMAE: Masked Autoencoders with Spatial-Attention Dropout for Tracking Tasks

1 code implementation CVPR 2023 Qiangqiang Wu, Tianyu Yang, Ziquan Liu, Baoyuan Wu, Ying Shan, Antoni B. Chan

However, we find that this simple baseline heavily relies on spatial cues while ignoring temporal relations for frame reconstruction, thus leading to sub-optimal temporal matching representations for VOT and VOS.

 Ranked #1 on Visual Object Tracking on TrackingNet (AUC metric)

Diversity Semantic Segmentation +3

LayoutDiffusion: Controllable Diffusion Model for Layout-to-image Generation

2 code implementations CVPR 2023 Guangcong Zheng, Xianpan Zhou, XueWei Li, Zhongang Qi, Ying Shan, Xi Li

To overcome the difficult multimodal fusion of image and layout, we propose to construct a structural image patch with region information and transform the patched image into a special layout to fuse with the normal layout in a unified form.

Layout-to-Image Generation Object

VMesh: Hybrid Volume-Mesh Representation for Efficient View Synthesis

no code implementations 28 Mar 2023 Yuan-Chen Guo, Yan-Pei Cao, Chen Wang, Yu He, Ying Shan, XiaoHu Qie, Song-Hai Zhang

With the emergence of neural radiance fields (NeRFs), view synthesis quality has reached an unprecedented level.

2k

Accelerating Vision-Language Pretraining with Free Language Modeling

1 code implementation CVPR 2023 Teng Wang, Yixiao Ge, Feng Zheng, Ran Cheng, Ying Shan, XiaoHu Qie, Ping Luo

FLM decouples the prediction rate from the corruption rate while allowing the corruption spans to be customized for each token to be predicted.

Language Modelling Masked Language Modeling

HRDFuse: Monocular 360° Depth Estimation by Collaboratively Learning Holistic-with-Regional Depth Distributions

no code implementations 21 Mar 2023 Hao Ai, Zidong Cao, Yan-Pei Cao, Ying Shan, Lin Wang

Depth estimation from a monocular 360° image is a burgeoning problem owing to its holistic sensing of a scene.

Depth Estimation ERP

BoPR: Body-aware Part Regressor for Human Shape and Pose Estimation

1 code implementation 21 Mar 2023 Yongkang Cheng, Shaoli Huang, Jifeng Ning, Ying Shan

This paper presents a novel approach for estimating human body shape and pose from monocular images that effectively addresses the challenges of occlusions and depth ambiguity.

3D Human Pose Estimation Occlusion Handling

Skinned Motion Retargeting with Residual Perception of Motion Semantics & Geometry

1 code implementation CVPR 2023 Jiaxu Zhang, Junwu Weng, Di Kang, Fang Zhao, Shaoli Huang, Xuefei Zhe, Linchao Bao, Ying Shan, Jue Wang, Zhigang Tu

Driven by our explored distance-based losses that explicitly model the motion semantics and geometry, these two modules can learn residual motion modifications on the source motion to generate plausible retargeted motion in a single inference without post-processing.

motion retargeting

Binary Embedding-based Retrieval at Tencent

1 code implementation 17 Feb 2023 Yukang Gan, Yixiao Ge, Chang Zhou, Shupeng Su, Zhouchuan Xu, Xuyuan Xu, Quanchao Hui, Xiang Chen, Yexin Wang, Ying Shan

To tackle the challenge, we propose a binary embedding-based retrieval (BEBR) engine equipped with a recurrent binarization algorithm that enables customized bits per dimension.

Binarization Retrieval
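
The core of binary embedding-based retrieval: binarize float embeddings, pack them into bits, and rank candidates by Hamming distance computed with XOR and popcount. A minimal sign-based sketch (BEBR's recurrent binarization learns the bits and supports customized bits per dimension; this shows only the retrieval mechanics):

```python
import numpy as np

def binarize(embeddings):
    """Sign-binarize float embeddings and pack 8 dims per byte."""
    return np.packbits(embeddings > 0, axis=-1)

def hamming_search(query_bits, db_bits, topk=5):
    """Rank database items by Hamming distance (XOR + popcount)."""
    xor = np.bitwise_xor(db_bits, query_bits)            # (N, D/8)
    dists = np.unpackbits(xor, axis=-1).sum(axis=-1)     # popcount per item
    return np.argsort(dists)[:topk]

rng = np.random.default_rng(0)
db = binarize(rng.standard_normal((10000, 256)))         # 256 dims -> 32 bytes
q = binarize(rng.standard_normal((1, 256)))
print(hamming_search(q, db))
```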

T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models

2 code implementations 16 Feb 2023 Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, XiaoHu Qie

In this paper, we aim to "dig out" the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly.

Image Generation Style Transfer
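
Mechanically, a T2I adapter is a small network that maps a condition image (sketch, depth, pose, etc.) to multi-scale features that are added to the frozen UNet encoder's activations. A minimal sketch of that shape contract (channel widths and layer counts here are illustrative, not those of the released adapters):

```python
import torch
import torch.nn as nn

class TinyAdapter(nn.Module):
    """Map a condition map to multi-scale features for a frozen UNet encoder."""
    def __init__(self, cond_channels=1, widths=(320, 640, 1280)):
        super().__init__()
        self.stages, c_in = nn.ModuleList(), cond_channels
        for w in widths:
            # Each stage halves resolution and matches one UNet encoder width.
            self.stages.append(nn.Sequential(
                nn.Conv2d(c_in, w, 3, stride=2, padding=1), nn.SiLU(),
                nn.Conv2d(w, w, 3, padding=1)))
            c_in = w

    def forward(self, cond):
        feats = []
        for stage in self.stages:
            cond = stage(cond)
            feats.append(cond)  # added to the same-scale UNet encoder feature
        return feats

adapter = TinyAdapter()
features = adapter(torch.randn(1, 1, 64, 64))
print([f.shape for f in features])  # [.., 320, 32, 32], [.., 640, 16, 16], ...
```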

RILS: Masked Visual Reconstruction in Language Semantic Space

1 code implementation CVPR 2023 Shusheng Yang, Yixiao Ge, Kun Yi, Dian Li, Ying Shan, XiaoHu Qie, Xinggang Wang

Both masked image modeling (MIM) and natural language supervision have facilitated the progress of transferable visual pre-training.