no code implementations • 11 Sep 2024 • Sijie Zhao, WenBo Hu, Xiaodong Cun, Yong Zhang, Xiaoyu Li, Zhe Kong, Xiangjun Gao, Muyao Niu, Ying Shan
This paper presents a novel framework for converting 2D videos to immersive stereoscopic 3D, addressing the growing demand for 3D content in immersive experiences.
no code implementations • 6 Sep 2024 • Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, LiMin Wang, Ying Shan
We present Open-MAGVIT2, a family of auto-regressive image generation models ranging from 300M to 1.5B parameters.
Ranked #19 on Image Generation on ImageNet 256x256
no code implementations • 3 Sep 2024 • Wangbo Yu, Jinbo Xing, Li Yuan, WenBo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, Yonghong Tian
Our method takes advantage of the powerful generation capabilities of video diffusion models and the coarse 3D clues offered by point-based representations to generate high-quality video frames with precise camera pose control.
no code implementations • 3 Sep 2024 • WenBo Hu, Xiangjun Gao, Xiaoyu Li, Sijie Zhao, Xiaodong Cun, Yong Zhang, Long Quan, Ying Shan
Despite significant advancements in monocular depth estimation for static images, estimating video depth in the open world remains challenging, since open-world videos are extremely diverse in content, motion, camera movement, and length.
no code implementations • 23 Aug 2024 • Tao Wu, Yong Zhang, Xintao Wang, Xianpan Zhou, Guangcong Zheng, Zhongang Qi, Ying Shan, Xi Li
However, since it is only trained on static images, the fine-tuning process of subject learning disrupts the ability of video diffusion models (VDMs) to combine concepts and generate motion.
no code implementations • 21 Aug 2024 • Yuzhou Huang, Yiran Qin, Shunlin Lu, Xintao Wang, Rui Huang, Ying Shan, Ruimao Zhang
Traditional visual storytelling is complex, requiring specialized knowledge and substantial resources, yet it is often constrained by the limits of human creativity and precision.
no code implementations • 3 Aug 2024 • Chaolei Tan, Zihang Lin, Junfu Pu, Zhongang Qi, Wei-Yi Pei, Zhi Qu, Yexin Wang, Ying Shan, Wei-Shi Zheng, Jian-Fang Hu
Based on the dataset, we further introduce a more complex setting of video grounding dubbed Multi-Paragraph Video Grounding (MPVG), which takes as input multiple paragraphs and a long video for grounding each paragraph query to its temporal interval.
no code implementations • 18 Jul 2024 • Xuan Ju, Junhao Zhuang, Zhaoyang Zhang, Yuxuan Bian, Qiang Xu, Ying Shan
The most advanced methods, such as SmartEdit and MGIE, usually combine large language models with diffusion models through joint training, where the former provides text understanding ability, and the latter provides image generation ability.
1 code implementation • 14 Jul 2024 • Qinyu Yang, Haoxin Chen, Yong Zhang, Menghan Xia, Xiaodong Cun, Zhixun Su, Ying Shan
To improve the quality of synthesized videos, one currently predominant method retrains an expert diffusion model and then applies a noising-denoising process for refinement.
1 code implementation • 11 Jul 2024 • Shuai Yang, Yuying Ge, Yang Li, Yukang Chen, Yixiao Ge, Ying Shan, Yingcong Chen
We further propose a multimodal attention sink mechanism to enable the generation of stories with up to 25 sequences (only 10 seen at training time) in a highly efficient autoregressive manner.
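The snippet names the mechanism without detail, so the following is a minimal sketch of the general attention-sink idea (in the spirit of StreamingLLM): keep the earliest key/value entries permanently plus a sliding window of recent ones, so autoregressive generation can extend well beyond the trained sequence count. The class and parameter names (`SinkKVCache`, `n_sink`, `window`) are hypothetical, not the paper's API.

```python
from collections import deque

class SinkKVCache:
    """Retain the first `n_sink` key/value entries forever plus a sliding
    window of the most recent `window` entries, evicting everything else."""

    def __init__(self, n_sink: int = 4, window: int = 1024):
        self.n_sink = n_sink
        self.sink = []                      # earliest entries, never evicted
        self.recent = deque(maxlen=window)  # deque drops the oldest itself

    def append(self, kv):
        if len(self.sink) < self.n_sink:
            self.sink.append(kv)
        else:
            self.recent.append(kv)

    def view(self):
        # Keys/values actually attended to at the current decoding step.
        return self.sink + list(self.recent)
```

Pinning the earliest entries preserves the positions that attention scores tend to concentrate on, which is what keeps very long generations stable.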
no code implementations • CVPR 2024 • Yuxin Chen, Zongyang Ma, Ziqi Zhang, Zhongang Qi, Chunfeng Yuan, Bing Li, Junfu Pu, Ying Shan, Xiaojuan Qi, Weiming Hu
Dominant dual-encoder models enable efficient image-text retrieval but suffer from limited accuracy, while cross-encoder models offer higher accuracy at the expense of efficiency.
no code implementations • 10 Jul 2024 • Zongyang Ma, Ziqi Zhang, Yuxin Chen, Zhongang Qi, Chunfeng Yuan, Bing Li, Yingmin Luo, Xu Li, Xiaojuan Qi, Ying Shan, Weiming Hu
EA-VTR can efficiently encode frame-level and video-level visual representations simultaneously, enabling cross-modal alignment of detailed event content and complex event temporal relations, ultimately enhancing the comprehensive understanding of video events.
1 code implementation • 8 Jul 2024 • Xuan Ju, Yiming Gao, Zhaoyang Zhang, Ziyang Yuan, Xintao Wang, Ailing Zeng, Yu Xiong, Qiang Xu, Ying Shan
Sora's high motion intensity and long, consistent videos have significantly impacted the field of video generation, attracting unprecedented attention.
no code implementations • 21 Jun 2024 • Yaowei Li, Xintao Wang, Zhaoyang Zhang, Zhouxia Wang, Ziyang Yuan, Liangbin Xie, Yuexian Zou, Ying Shan
To this end, we propose Image Conductor, a method for precise control of camera transitions and object movements to generate video assets from a single image.
1 code implementation • 18 Jun 2024 • Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Ying Shan, Yansong Tang
Previous approaches compress vision tokens with external modules and force LLMs to understand the compressed ones, leading to visual information loss.
1 code implementation • 5 Jun 2024 • Tao Yang, Yingmin Luo, Zhongang Qi, Yang Wu, Ying Shan, Chang Wen Chen
Layout generation is the keystone of automated graphic design, requiring the arrangement of the position and size of various multi-modal design elements in a visually pleasing, constraint-following manner.
1 code implementation • 4 Jun 2024 • Yicheng Xiao, Lin Song, Shaoli Huang, Jiangshan Wang, Siyu Song, Yixiao Ge, Xiu Li, Ying Shan
State space models, employing recursively propagated features, demonstrate representation capabilities comparable to Transformer models with superior efficiency.
1 code implementation • 3 Jun 2024 • Shaoshu Yang, Yong Zhang, Xiaodong Cun, Ying Shan, Ran He
Previous methods increase the frame rate by either training a video interpolation model in pixel space as a post-processing stage or training an interpolation model in latent space for a specific base video model.
1 code implementation • 30 May 2024 • Sijie Zhao, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Muyao Niu, Xiaoyu Li, WenBo Hu, Ying Shan
Moreover, since current diffusion-based approaches are often built on pre-trained text-to-image (T2I) models, directly training a video VAE without considering compatibility with existing T2I models results in a latent-space gap between the two, and bridging this gap demands huge computational resources even when the T2I models are used as initialization.
1 code implementation • 30 May 2024 • Muyao Niu, Xiaodong Cun, Xintao Wang, Yong Zhang, Ying Shan, Yinqiang Zheng
We present MOFA-Video, an advanced controllable image animation method that generates video from a given image using various additional control signals (such as human landmark references, manual trajectories, or even another provided video) or their combinations.
no code implementations • CVPR 2024 • Hanchao Liu, Xiaohang Zhan, Shaoli Huang, Tai-Jiang Mu, Ying Shan
This problem is characterized by an open and fully customizable set of motion control tasks.
no code implementations • 28 May 2024 • Xiangjun Gao, Xiaoyu Li, Yiyu Zhuang, Qi Zhang, WenBo Hu, Chaopeng Zhang, Yao Yao, Ying Shan, Long Quan
This approach reduces the need to design various algorithms for different types of Gaussian manipulation.
no code implementations • 28 May 2024 • Jinbo Xing, Hanyuan Liu, Menghan Xia, Yong Zhang, Xintao Wang, Ying Shan, Tien-Tsin Wong
We introduce ToonCrafter, a novel approach that transcends traditional correspondence-based cartoon video interpolation, paving the way for generative interpolation.
no code implementations • 22 May 2024 • Chong Mou, Mingdeng Cao, Xintao Wang, Zhaoyang Zhang, Ying Shan, Jian Zhang
In this paper, we present a novel attempt to Remake a Video (ReVideo), which stands out from existing methods by allowing precise video editing in specific areas through the specification of both content and motion.
no code implementations • 13 May 2024 • Chengyue Wu, Yixiao Ge, Qiushan Guo, Jiahao Wang, Zhixuan Liang, Zeyu Lu, Ying Shan, Ping Luo
Furthermore, we propose three automatic evaluation metrics, including code pass rate, text-match ratio, and GPT-4V overall rating, for a fine-grained assessment of the output code and rendered images.
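The first two metrics are mechanical enough to sketch; one plausible reading is below, assuming `run` is a sandboxed executor that returns True on success. The exact definitions in the paper may differ, and the GPT-4V overall rating is a model-based judgment with no short code equivalent.

```python
import re

def code_pass_rate(snippets, run):
    """Fraction of generated code snippets that execute without error."""
    return sum(bool(run(s)) for s in snippets) / max(len(snippets), 1)

def text_match_ratio(rendered_texts, reference_texts):
    """Fraction of reference strings found (whitespace-normalized) in the
    text rendered from the generated code."""
    norm = lambda s: re.sub(r"\s+", " ", s).strip()
    hits = sum(norm(ref) in norm(out)
               for ref, out in zip(reference_texts, rendered_texts))
    return hits / max(len(reference_texts), 1)
```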
1 code implementation • 7 May 2024 • Yuying Ge, Sijie Zhao, Chen Li, Yixiao Ge, Ying Shan
In this technical report, we introduce SEED-Data-Edit: a unique hybrid dataset for instruction-guided image editing, which aims to facilitate image manipulation using open-form language.
no code implementations • 1 May 2024 • Zidong Cao, Zhan Wang, Yexin Liu, Yan-Pei Cao, Ying Shan, Wei Zeng, Lin Wang
Our system enables users to effortlessly locate and zoom in on the objects of interest in VR.
2 code implementations • 25 Apr 2024 • Bohao Li, Yuying Ge, Yi Chen, Yixiao Ge, Ruimao Zhang, Ying Shan
We hope that our work can serve as a valuable addition to existing MLLM benchmarks, providing insightful observations and inspiring further research in the area of text-rich visual comprehension with MLLMs.
1 code implementation • 22 Apr 2024 • Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, Ying Shan
We hope that our work will inspire future research into what can be achieved by versatile multimodal foundation models in real-world applications.
1 code implementation • 10 Apr 2024 • Jiale Xu, Weihao Cheng, Yiming Gao, Xintao Wang, Shenghua Gao, Ying Shan
We present InstantMesh, a feed-forward framework for instant 3D mesh generation from a single image, featuring state-of-the-art generation quality and significant training scalability.
1 code implementation • 30 Mar 2024 • Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, Ge Li
In this paper, we investigate a straightforward yet unexplored question: Can we feed all spatial-temporal tokens into the LLM, thus delegating the task of video sequence modeling to the LLMs?
Ranked #1 on Video-based Generative Performance Benchmarking (Temporal Understanding) on VideoInstruct
no code implementations • 18 Mar 2024 • Yujiao Jiang, Qingmin Liao, Xiaoyu Li, Li Ma, Qi Zhang, Chaopeng Zhang, Zongqing Lu, Ying Shan
Therefore, we propose UV Gaussians, which model the 3D human body by jointly learning mesh deformations and 2D UV-space Gaussian textures.
no code implementations • 15 Mar 2024 • Tian-Xing Xu, WenBo Hu, Yu-Kun Lai, Ying Shan, Song-Hai Zhang
3D Gaussian splatting, emerging as a groundbreaking approach, has drawn increasing attention for its capabilities of high-fidelity reconstruction and real-time rendering.
no code implementations • 15 Mar 2024 • Tao Wu, XueWei Li, Zhongang Qi, Di Hu, Xintao Wang, Ying Shan, Xi Li
Controllable spherical panoramic image generation holds substantial applicative potential across a variety of domains, yet it remains a challenging task due to the inherent spherical distortion and geometry characteristics, which result in low-quality content generation. In this paper, we introduce SphereDiffusion, a novel framework that addresses these unique challenges to generate high-quality and precisely controllable spherical panoramic images.

For the spherical distortion characteristic, we embed the semantics of the distorted object with text encoding, then explicitly construct the relationship with text-object correspondence to better use the pre-trained knowledge of planar images. Meanwhile, we employ a deformable technique to mitigate the semantic deviation in latent space caused by spherical distortion. For the spherical geometry characteristic, by virtue of spherical rotation invariance, we improve the data diversity and optimization objectives in the training process, enabling the model to better learn the spherical geometry characteristic. Furthermore, we enhance the denoising process of the diffusion model, enabling it to effectively use the learned geometric characteristic to ensure the boundary continuity of the generated images. With these techniques, experiments on the Structured3D dataset show that SphereDiffusion significantly improves the quality of controllable spherical image generation, reducing FID by around 35% on average.
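Of the techniques above, the rotation-invariance one is the easiest to make concrete: on an equirectangular panorama, a yaw rotation of the sphere is exactly a horizontal circular shift of the image, which yields a cheap, lossless augmentation. A minimal sketch (the function name is illustrative, not from the paper):

```python
import numpy as np

def rotate_equirectangular(img: np.ndarray, degrees: float) -> np.ndarray:
    """Rotate a spherical panorama about the vertical axis by shifting an
    equirectangular image (H, W, C) horizontally with wrap-around."""
    w = img.shape[1]
    shift = int(round(w * degrees / 360.0)) % w
    return np.roll(img, shift, axis=1)
```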
no code implementations • 13 Mar 2024 • Ang Li, Qiugen Xiao, Peng Cao, Jian Tang, Yi Yuan, Zijie Zhao, Xiaoyuan Chen, Liang Zhang, Xiangyang Li, Kaitong Yang, Weidong Guo, Yukang Gan, Xu Yu, Daniell Wang, Ying Shan
Using ChatGPT as a labeler to provide feedback on open-domain prompts in RLAIF training, we observe an increase in human evaluators' preference win ratio for model responses, but a decrease in evaluators' satisfaction rate.
1 code implementation • 11 Mar 2024 • Xuan Ju, Xian Liu, Xintao Wang, Yuxuan Bian, Ying Shan, Qiang Xu
Image inpainting, the process of restoring corrupted images, has seen significant advancements with the advent of diffusion models (DMs).
no code implementations • 9 Mar 2024 • Xiuzhe Wu, Xiaoyang Lyu, Qihao Huang, Yong Liu, Yang Wu, Ying Shan, Xiaojuan Qi
Our system contains a depth estimation module to predict depth, and a new decomposed object-wise 3D motion (DO3D) estimation module to predict ego-motion and 3D object motion.
1 code implementation • 16 Feb 2024 • Lanqing Guo, Yingqing He, Haoxin Chen, Menghan Xia, Xiaodong Cun, YuFei Wang, Siyu Huang, Yong Zhang, Xintao Wang, Qifeng Chen, Ying Shan, Bihan Wen
Diffusion models have proven to be highly effective in image and video generation; however, they still face composition challenges when generating images of varying sizes due to single-scale training data.
1 code implementation • CVPR 2024 • Chong Mou, Xintao Wang, Jiechong Song, Ying Shan, Jian Zhang
Large-scale Text-to-Image (T2I) diffusion models have revolutionized image generation over the last few years.
no code implementations • 31 Jan 2024 • Xiaoyu Li, Qi Zhang, Di Kang, Weihao Cheng, Yiming Gao, Jingbo Zhang, Zhihao Liang, Jing Liao, Yan-Pei Cao, Ying Shan
In this survey, we aim to introduce the fundamental methodologies of 3D generation methods and establish a structured roadmap, encompassing 3D representation, generation methods, datasets, and corresponding applications.
2 code implementations • CVPR 2024 • Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, Ying Shan
The You Only Look Once (YOLO) series of detectors have established themselves as efficient and practical tools.
Ranked #5 on Zero-Shot Object Detection on MSCOCO (using extra training data)
1 code implementation • 28 Jan 2024 • Dan Zhang, Yangliao Geng, Wenwen Gong, Zhongang Qi, Zhiyu Chen, Xing Tang, Ying Shan, Yuxiao Dong, Jie Tang
In this work, we investigate how to employ both batch-wise CL (BCL) and feature-wise CL (FCL) for recommendation.
no code implementations • 26 Jan 2024 • Jingyu Zhuang, Di Kang, Yan-Pei Cao, Guanbin Li, Liang Lin, Ying Shan
To this end, we propose a 3D scene editing framework, TIP-Editor, that accepts both text and image prompts and a 3D bounding box to specify the editing region.
1 code implementation • CVPR 2024 • Yiyuan Zhang, Xiaohan Ding, Kaixiong Gong, Yixiao Ge, Ying Shan, Xiangyu Yue
We propose to improve transformers of a specific modality with irrelevant data from other modalities, e.g., improve an ImageNet model with audio or point cloud datasets.
1 code implementation • 18 Jan 2024 • Xiaohu Jiang, Yixiao Ge, Yuying Ge, Dachuan Shi, Chun Yuan, Ying Shan
Image-text training like CLIP has dominated the pretraining of vision foundation models in recent years.
2 code implementations • CVPR 2024 • Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, Ying Shan
Based on this stronger coupling, we shift the distribution to higher quality without motion degradation by finetuning spatial modules with high-quality images, resulting in a generic high-quality video model.
Ranked #1 on Text-to-Video Generation on EvalCrafter Text-to-Video (ECTV) Dataset (using extra training data)
no code implementations • 15 Jan 2024 • Jay Zhangjie Wu, Guian Fang, HaoNing Wu, Xintao Wang, Yixiao Ge, Xiaodong Cun, David Junhao Zhang, Jia-Wei Liu, YuChao Gu, Rui Zhao, Weisi Lin, Wynne Hsu, Ying Shan, Mike Zheng Shou
Experiments on the TVGE dataset demonstrate the superiority of the proposed T2VScore on offering a better metric for text-to-video generation.
1 code implementation • 4 Jan 2024 • Chengyue Wu, Yukang Gan, Yixiao Ge, Zeyu Lu, Jiahao Wang, Ye Feng, Ying Shan, Ping Luo
Humans generally acquire new skills without compromising the old; however, the opposite holds for Large Language Models (LLMs), e.g., from LLaMA to CodeLLaMA.
no code implementations • CVPR 2024 • Lin Song, Yukang Chen, Shuai Yang, Xiaohan Ding, Yixiao Ge, Ying-Cong Chen, Ying Shan
We empirically show that sparse attention not only reduces computational demands but also enhances model performance in both NLP and multi-modal tasks.
1 code implementation • CVPR 2024 • Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, Ying Shan
Multimodal large language models (MLLMs), building upon the foundation of powerful large language models (LLMs), have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs (acting like a combination of GPT-4V and DALL-E 3).
1 code implementation • CVPR 2024 • Xiaohan Ding, Yiyuan Zhang, Yixiao Ge, Sijie Zhao, Lin Song, Xiangyu Yue, Ying Shan
1) We propose four architectural guidelines for designing large-kernel ConvNets, the core of which is to exploit the essential characteristics of large kernels that distinguish them from small kernels: they can see wide without going deep.
1 code implementation • 14 Dec 2023 • Jinguo Zhu, Xiaohan Ding, Yixiao Ge, Yuying Ge, Sijie Zhao, Hengshuang Zhao, Xiaohua Wang, Ying Shan
In combination with the existing text tokenizer and detokenizer, this framework allows for the encoding of interleaved image-text data into a multimodal sequence, which can subsequently be fed into the transformer model.
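A minimal sketch of what encoding interleaved image-text data into one multimodal sequence can look like. The boundary-token ids and function names are placeholders, not the paper's actual vocabulary:

```python
def encode_interleaved(segments, text_tokenize, image_tokenize,
                       BOI=32000, EOI=32001):
    """Flatten alternating ("text", str) / ("image", data) segments into a
    single token list; image spans are wrapped in begin/end-of-image
    tokens so the transformer can tell the modalities apart."""
    seq = []
    for kind, payload in segments:
        if kind == "text":
            seq.extend(text_tokenize(payload))
        else:
            seq.append(BOI)
            seq.extend(image_tokenize(payload))
            seq.append(EOI)
    return seq
```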
1 code implementation • 11 Dec 2023 • Yi Chen, Yuying Ge, Yixiao Ge, Mingyu Ding, Bohao Li, Rui Wang, Ruifeng Xu, Ying Shan, Xihui Liu
The pursuit of artificial general intelligence (AGI) has been accelerated by Multimodal Large Language Models (MLLMs), which exhibit superior reasoning, generalization capabilities, and proficiency in processing multimodal inputs.
1 code implementation • CVPR 2024 • Yuzhou Huang, Liangbin Xie, Xintao Wang, Ziyang Yuan, Xiaodong Cun, Yixiao Ge, Jiantao Zhou, Chao Dong, Rui Huang, Ruimao Zhang, Ying Shan
Both quantitative and qualitative results on this evaluation dataset indicate that our SmartEdit surpasses previous methods, paving the way for the practical application of complex instruction-based image editing.
1 code implementation • CVPR 2024 • Zhen Li, Mingdeng Cao, Xintao Wang, Zhongang Qi, Ming-Ming Cheng, Ying Shan
Recent advances in text-to-image generation have made remarkable progress in synthesizing realistic human photos conditioned on given text prompts.
Ranked #6 on Diffusion Personalization Tuning Free on AgeDB
1 code implementation • 6 Dec 2023 • Zhouxia Wang, Ziyang Yuan, Xintao Wang, Tianshui Chen, Menghan Xia, Ping Luo, Ying Shan
Motions in a video primarily consist of camera motion, induced by camera movement, and object motion, resulting from object movement.
1 code implementation • 6 Dec 2023 • Jiwen Yu, Xiaodong Cun, Chenyang Qi, Yong Zhang, Xintao Wang, Ying Shan, Jian Zhang
For appearance control, we borrow intermediate latents and their features from the text-to-image (T2I) generation for ensuring the generated first frame is equal to the given generated image.
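A rough sketch of the latent-borrowing idea: store the T2I denoising trajectory for the reference image, then pin the first frame's latent to it at every video denoising step so the clip starts exactly at that image. The update rule and all names below are assumptions for illustration, not the paper's implementation:

```python
def denoise_video(video_latents, t2i_latents, denoiser, scheduler, timesteps):
    """video_latents: (frames, C, H, W); t2i_latents[t]: stored T2I latent
    at step t for the reference image; scheduler: assumed callable that
    applies one denoising update."""
    z = video_latents
    for t in timesteps:
        z[0] = t2i_latents[t]          # reuse the T2I intermediate latent
        eps = denoiser(z, t)           # joint spatio-temporal noise estimate
        z = scheduler(eps, t, z)       # one denoising update
    return z
```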
1 code implementation • 5 Dec 2023 • Yue Ma, Xiaodong Cun, Yingqing He, Chenyang Qi, Xintao Wang, Ying Shan, Xiu Li, Qifeng Chen
Though succinct, our method is the first to demonstrate video property editing with a pre-trained text-to-image model.
2 code implementations • 1 Dec 2023 • Gongye Liu, Menghan Xia, Yong Zhang, Haoxin Chen, Jinbo Xing, Yibo Wang, Xintao Wang, Yujiu Yang, Ying Shan
To address these challenges, we introduce StyleCrafter, a generic method that enhances pre-trained T2V models with a style control adapter, enabling video generation in any style by providing a reference image.
2 code implementations • 28 Nov 2023 • Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, Ying Shan
Multimodal large language models (MLLMs), building upon the foundation of powerful large language models (LLMs), have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs (acting like a combination of GPT-4V and DALL-E 3).
no code implementations • CVPR 2024 • Jingbo Zhang, Xiaoyu Li, Qi Zhang, Yan-Pei Cao, Ying Shan, Jing Liao
Optimization-based methods that lift text-to-image diffusion models to 3D generation often fail to preserve the texture details of the reference image, resulting in inconsistent appearances in different views.
no code implementations • CVPR 2024 • Xiangjun Gao, Xiaoyu Li, Chaopeng Zhang, Qi Zhang, Yan-Pei Cao, Ying Shan, Long Quan
In this work, we propose a method to address the challenge of rendering a 3D human from a single image in a free-view manner.
no code implementations • CVPR 2024 • Xian Liu, Xiaohang Zhan, Jiaxiang Tang, Ying Shan, Gang Zeng, Dahua Lin, Xihui Liu, Ziwei Liu
In this paper, we propose an efficient yet effective framework, HumanGaussian, that generates high-quality 3D humans with fine-grained geometry and realistic appearance.
3 code implementations • 27 Nov 2023 • Xiaohan Ding, Yiyuan Zhang, Yixiao Ge, Sijie Zhao, Lin Song, Xiangyu Yue, Ying Shan
1) We propose four architectural guidelines for designing large-kernel ConvNets, the core of which is to exploit the essential characteristics of large kernels that distinguish them from small kernels: they can see wide without going deep.
Ranked #1 on Object Detection on COCO 2017 (mAP metric)
1 code implementation • CVPR 2024 • Weixian Lei, Yixiao Ge, Kun Yi, Jianfeng Zhang, Difei Gao, Dylan Sun, Yuying Ge, Ying Shan, Mike Zheng Shou
In this paper, we present ViT-Lens-2 that facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space.
1 code implementation • CVPR 2024 • Zhihao Liang, Qi Zhang, Ying Feng, Ying Shan, Kui Jia
We propose GS-IR, a novel inverse rendering approach based on 3D Gaussian Splatting (GS) that leverages forward mapping volume rendering to achieve photorealistic novel view synthesis and relighting results.
1 code implementation • 14 Nov 2023 • Chen Li, Yixiao Ge, Dian Li, Ying Shan
Instruction tuning is a crucial supervised training phase in Large Language Models (LLMs), aiming to enhance the LLM's ability to generalize instruction execution and adapt to user preferences.
1 code implementation • NeurIPS 2023 • Cheng Cheng, Lin Song, Ruoyi Xue, Hang Wang, Hongbin Sun, Yixiao Ge, Ying Shan
Without bells and whistles, our approach outperforms the state-of-the-art online few-shot learning method by an average of 3.6% on eight image classification datasets with higher inference speed.
no code implementations • 31 Oct 2023 • Xin He, Shaoli Huang, Xiaohang Zhan, Chao Weng, Ying Shan
Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD).
3 code implementations • 30 Oct 2023 • Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, Ying Shan
The I2V model is designed to produce videos that strictly adhere to the content of the provided reference image, preserving its content, structure, and style.
Ranked #3 on Text-to-Video Generation on EvalCrafter Text-to-Video (ECTV) Dataset (using extra training data)
no code implementations • 30 Oct 2023 • Ziyang Yuan, Mingdeng Cao, Xintao Wang, Zhongang Qi, Chun Yuan, Ying Shan
As a result, our CustomNet ensures enhanced identity preservation and generates diverse, harmonious outputs.
3 code implementations • 23 Oct 2023 • Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, Ziwei Liu
With the availability of large-scale video datasets and the advances of diffusion models, text-driven video generation has achieved substantial progress.
no code implementations • 19 Oct 2023 • Jiaxu Zhang, Shaoli Huang, Zhigang Tu, Xin Chen, Xiaohang Zhan, Gang Yu, Ying Shan
In this work, we present TapMo, a Text-driven Animation Pipeline for synthesizing Motion in a broad spectrum of skeleton-free 3D characters.
1 code implementation • 18 Oct 2023 • Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Wangbo Yu, Hanyuan Liu, Xintao Wang, Tien-Tsin Wong, Ying Shan
Animating a still image offers an engaging visual experience.
1 code implementation • CVPR 2024 • Yaofang Liu, Xiaodong Cun, Xuebo Liu, Xintao Wang, Yong Zhang, Haoxin Chen, Yang Liu, Tieyong Zeng, Raymond Chan, Ying Shan
For video generation, various open-source models and publicly available services have been developed to generate high-quality videos.
no code implementations • CVPR 2024 • Jia-Wei Liu, Yan-Pei Cao, Jay Zhangjie Wu, Weijia Mao, YuChao Gu, Rui Zhao, Jussi Keppo, Ying Shan, Mike Zheng Shou
To overcome this, we propose to introduce the dynamic Neural Radiance Fields (NeRF) as the innovative video representation, where the editing can be performed in the 3D spaces and propagated to the entire video via the deformation field.
1 code implementation • 11 Oct 2023 • Yingqing He, Shaoshu Yang, Haoxin Chen, Xiaodong Cun, Menghan Xia, Yong Zhang, Xintao Wang, Ran He, Qifeng Chen, Ying Shan
Our work also suggests that a pre-trained diffusion model trained on low-resolution images can be directly used for high-resolution visual generation without further tuning, which may provide insights for future research on ultra-high-resolution image and video synthesis.
no code implementations • 10 Oct 2023 • Wangbo Yu, Li Yuan, Yan-Pei Cao, Xiangjun Gao, Xiaoyu Li, WenBo Hu, Long Quan, Ying Shan, Yonghong Tian
Our contributions are twofold: First, we propose a Reference-Guided Novel View Enhancement (RGNV) technique that significantly improves the fidelity of diffusion-based zero-shot novel view synthesis methods.
1 code implementation • 2 Oct 2023 • Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, Ying Shan
We identify two crucial design principles: (1) Image tokens should be independent of 2D physical patch positions and instead be produced with a 1D causal dependency, exhibiting intrinsic interdependence that aligns with the left-to-right autoregressive prediction mechanism in LLMs.
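Principle (1) boils down to producing image tokens under a left-to-right (causal) dependency instead of tying each token to a fixed 2D patch position. The mask that enforces such a dependency during token production is simple to sketch:

```python
import numpy as np

def causal_mask(n_tokens: int) -> np.ndarray:
    """Lower-triangular boolean mask: token i may attend only to tokens
    0..i, matching an LLM's left-to-right prediction order."""
    return np.tril(np.ones((n_tokens, n_tokens), dtype=bool))
```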
1 code implementation • CVPR 2024 • Ruyang Liu, Chen Li, Yixiao Ge, Ying Shan, Thomas H. Li, Ge Li
Without bells and whistles, BT-Adapter achieves (1) state-of-the-art zero-shot results on various video tasks using thousands fewer GPU hours.
Ranked #4 on VCGBench-Diverse on VideoInstruct
no code implementations • 19 Sep 2023 • Yiyu Zhuang, Qi Zhang, Ying Feng, Hao Zhu, Yao Yao, Xiaoyu Li, Yan-Pei Cao, Ying Shan, Xun Cao
Drawing inspiration from voxel-based representations with the level of detail (LoD), we introduce a multi-scale tri-plane-based scene representation that is capable of capturing the LoD of the signed distance function (SDF) and the space radiance.
1 code implementation • ICCV 2023 • Xiuzhe Wu, Pengfei Hu, Yang Wu, Xiaoyang Lyu, Yan-Pei Cao, Ying Shan, Wenming Yang, Zhongqian Sun, Xiaojuan Qi
Therefore, directly learning a mapping function from speech to the entire head image is prone to ambiguity, particularly when using a short video for training.
no code implementations • 4 Sep 2023 • Zhouxia Wang, Xintao Wang, Liangbin Xie, Zhongang Qi, Ying Shan, Wenping Wang, Ping Luo
StyleAdapter can generate high-quality images that match the content of the prompts and adopt the style of the references (even for unseen styles) in a single pass, which is more flexible and efficient than previous methods.
no code implementations • 1 Sep 2023 • Shaohuan Zhou, Xu Li, Zhiyong Wu, Ying Shan, Helen Meng
Specifically, in the pre-training step, we design a phoneme predictor to produce frame-level phoneme probability vectors as the phonemic timing information and a speaker encoder to model the timbre variations of different singers; we also directly estimate frame-level f0 values from the audio to provide pitch information.
1 code implementation • ICCV 2023 • Xiaotong Li, Zixuan Hu, Yixiao Ge, Ying Shan, Ling-Yu Duan
The experimental results on 10 downstream tasks and 12 self-supervised models demonstrate that our approach can seamlessly integrate into existing ranking techniques and enhance their performances, revealing its effectiveness for the model selection task and its potential for understanding the mechanism in transfer learning.
no code implementations • 27 Aug 2023 • Zi-Xin Zou, Weihao Cheng, Yan-Pei Cao, Shi-Sheng Huang, Ying Shan, Song-Hai Zhang
While recent techniques employ image diffusion models for generating plausible images at novel viewpoints or for distilling pre-trained diffusion priors into 3D representations using score distillation sampling (SDS), these methods often struggle to simultaneously achieve high-quality, consistent, and detailed results for both novel-view synthesis (NVS) and geometry.
2 code implementations • 22 Aug 2023 • Shansong Liu, Atin Sakkeer Hussain, Chenshuo Sun, Ying Shan
To fill this gap, we present a methodology for generating question-answer pairs from existing audio captioning datasets and introduce the MusicQA Dataset designed for answering open-ended music-related questions.
Ranked #1 on Music Question Answering on MusicQA
1 code implementation • 20 Aug 2023 • Weixian Lei, Yixiao Ge, Jianfeng Zhang, Dylan Sun, Kun Yi, Ying Shan, Mike Zheng Shou
A well-trained lens with a ViT backbone has the potential to serve as one of these foundation models, supervising the learning of subsequent modalities.
Ranked #2 on Zero-Shot Transfer 3D Point Cloud Classification on ModelNet40 (using extra training data)
no code implementations • 18 Aug 2023 • Yukang Cao, Yan-Pei Cao, Kai Han, Ying Shan, Kwan-Yee K. Wong
To this end, we introduce Guide3D, a zero-shot text-and-image-guided generative model for 3D avatar generation based on diffusion models.
no code implementations • ICCV 2023 • Zidong Cao, Hao Ai, Yan-Pei Cao, Ying Shan, XiaoHu Qie, Lin Wang
The M\"obius transformation is typically employed to further provide the opportunity for movement and zoom on ODIs, but applying it to the image level often results in blurry effect and aliasing problem.
2 code implementations • 30 Jul 2023 • Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, Ying Shan
Based on powerful Large Language Models (LLMs), recent generative Multimodal Large Language Models (MLLMs) have gained prominence as a pivotal research area, exhibiting remarkable capability for both comprehension and generation.
no code implementations • 27 Jul 2023 • Fanghua Yu, Xintao Wang, Zheyuan Li, Yan-Pei Cao, Ying Shan, Chao Dong
While generative models have shown potential in creating 3D textured shapes from 2D images, their applicability in 3D industries is limited due to the lack of a well-defined camera distribution in real-world scenarios, resulting in low-quality shapes.
1 code implementation • 16 Jul 2023 • Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, Ying Shan
Research on image tokenizers has previously reached an impasse, as frameworks employing quantized visual tokens have lost prominence due to subpar performance and convergence in multimodal comprehension (compared to BLIP-2, etc.)
1 code implementation • 13 Jul 2023 • Yingqing He, Menghan Xia, Haoxin Chen, Xiaodong Cun, Yuan Gong, Jinbo Xing, Yong Zhang, Xintao Wang, Chao Weng, Ying Shan, Qifeng Chen
For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure.
no code implementations • 11 Jul 2023 • Cong Wang, Di Kang, Yan-Pei Cao, Linchao Bao, Ying Shan, Song-Hai Zhang
Rendering photorealistic and dynamically moving human heads is crucial for ensuring a pleasant and immersive experience in AR/VR and video conferencing applications.
no code implementations • 7 Jul 2023 • Wangbo Yu, Yanbo Fan, Yong Zhang, Xuan Wang, Fei Yin, Yunpeng Bai, Yan-Pei Cao, Ying Shan, Yang Wu, Zhongqian Sun, Baoyuan Wu
In this work, we propose a one-shot 3D facial avatar reconstruction framework that only requires a single source image to reconstruct a high-fidelity 3D facial avatar.
1 code implementation • 5 Jul 2023 • Chong Mou, Xintao Wang, Jiechong Song, Ying Shan, Jian Zhang
Specifically, we construct classifier guidance based on the strong correspondence of intermediate features in the diffusion model.
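A minimal sketch of guidance driven by feature correspondence, assuming a hypothetical `extract_feats` that returns intermediate diffusion-model features; the cosine-similarity energy and the plain gradient step are illustrative stand-ins, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def correspondence_guidance(latent, extract_feats, ref_feats, scale=1.0):
    """One guidance step: pull the latent toward states whose intermediate
    features stay close to the reference features."""
    latent = latent.detach().requires_grad_(True)
    feats = extract_feats(latent)
    energy = 1.0 - F.cosine_similarity(
        feats.flatten(1), ref_feats.flatten(1), dim=-1).mean()
    grad, = torch.autograd.grad(energy, latent)
    return (latent - scale * grad).detach()
```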
1 code implementation • 5 Jul 2023 • Liangbin Xie, Xintao Wang, Xiangyu Chen, Gen Li, Ying Shan, Jiantao Zhou, Chao Dong
After detecting the artifact regions, we develop a finetune procedure to improve GAN-based SR models with a few samples, so that they can deal with similar types of artifacts in more unseen real data.
1 code implementation • 29 Jun 2023 • Yunpeng Bai, Xintao Wang, Yan-Pei Cao, Yixiao Ge, Chun Yuan, Ying Shan
This paper introduces DreamDiffusion, a novel method for generating high-quality images directly from brain electroencephalogram (EEG) signals, without the need to translate thoughts into text.
no code implementations • 29 Jun 2023 • Weihao Cheng, Yan-Pei Cao, Ying Shan
ID-Pose adds a noise to one image, and predicts the noise conditioned on the other image and a hypothesis of the relative pose.
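The described scoring rule lends itself to a hypothesis-and-test sketch: for each candidate relative pose, ask the pose-conditioned diffusion model to predict the known injected noise and keep the hypothesis with the lowest error. The grid scan below only illustrates the scoring; the paper's actual search strategy may differ, and `denoiser` and `pose_grid` are assumed interfaces:

```python
import torch

def estimate_relative_pose(x_noisy, true_noise, cond_image, denoiser, pose_grid):
    """Return the pose hypothesis whose conditioned noise prediction best
    matches the noise actually added to the first image."""
    best_pose, best_err = None, float("inf")
    for pose in pose_grid:
        eps = denoiser(x_noisy, cond_image, pose)
        err = torch.mean((eps - true_noise) ** 2).item()
        if err < best_err:
            best_pose, best_err = pose, err
    return best_pose
```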
1 code implementation • 27 Jun 2023 • Songyang Gao, Shihan Dou, Yan Liu, Xiao Wang, Qi Zhang, Zhongyu Wei, Jin Ma, Ying Shan
Adversarial training is one of the best-performing methods in improving the robustness of deep language models.
1 code implementation • 27 Jun 2023 • Songyang Gao, Shihan Dou, Qi Zhang, Xuanjing Huang, Jin Ma, Ying Shan
Detecting adversarial samples that are carefully crafted to fool the model is a critical step to socially-secure applications.
1 code implementation • 26 Jun 2023 • Chen Li, Xutan Peng, Teng Wang, Yixiao Ge, Mengyang Liu, Xuyuan Xu, Yexin Wang, Ying Shan
Art forms such as movies and television (TV) dramas are reflections of the real world, which have attracted much attention from the multimodal learning community recently.
no code implementations • 23 Jun 2023 • Qianji Di, Wenxi Ma, Zhongang Qi, Tianxiang Hou, Ying Shan, Hanzi Wang
In this work, we propose a Text-Image-joint Scene Graph Generation (TISGG) model to resolve the unseen triples and improve the generalisation capability of the SGG models.
no code implementations • 22 Jun 2023 • Binjie Zhang, Yixiao Ge, Xuyuan Xu, Ying Shan, Mike Zheng Shou
In situations involving system upgrades that require updating the upstream foundation model, it becomes essential to re-train all downstream modules to adapt to the new foundation model, which is inflexible and inefficient.
no code implementations • 12 Jun 2023 • Sijie Zhao, Yixiao Ge, Zhongang Qi, Lin Song, Xiaohan Ding, Zehua Xie, Ying Shan
Therefore, we propose StickerCLIP as a benchmark model on the Sticker820K dataset.
no code implementations • 12 Jun 2023 • Jiale Xu, Xintao Wang, Yan-Pei Cao, Weihao Cheng, Ying Shan, Shenghua Gao
Enhancing AI systems to perform tasks following human instructions can significantly boost productivity.
1 code implementation • 6 Jun 2023 • XueWei Li, Tao Wu, Zhongang Qi, Gaoang Wang, Ying Shan, Xi Li
Experimental results on the Stanford2D3D Panoramic dataset show that SGAT4PASS significantly improves performance and robustness, with an approximately 2% increase in mIoU; when small 3D disturbances occur in the data, the stability of its performance improves by an order of magnitude.
Ranked #4 on Semantic Segmentation on Stanford2D3D Panoramic
no code implementations • NeurIPS 2023 • Zheng Chen, Yan-Pei Cao, Yuan-Chen Guo, Chen Wang, Ying Shan, Song-Hai Zhang
Unlike generalizable radiance fields trained on perspective images, PanoGRF avoids the information loss from panorama-to-perspective conversion and directly aggregates geometry and appearance features of 3D sample points from each panoramic view based on spherical projection.
1 code implementation • NeurIPS 2023 • Ge Yuan, Xiaodong Cun, Yong Zhang, Maomao Li, Chenyang Qi, Xintao Wang, Ying Shan, Huicheng Zheng
Empowered by the proposed celeb basis, the new identity in our customized model showcases a better concept combination ability than previous personalization methods.
no code implementations • 1 Jun 2023 • Jinbo Xing, Menghan Xia, Yuxin Liu, Yuechen Zhang, Yong Zhang, Yingqing He, Hanyuan Liu, Haoxin Chen, Xiaodong Cun, Xintao Wang, Ying Shan, Tien-Tsin Wong
Our method, dubbed Make-Your-Video, involves joint-conditional video generation using a Latent Diffusion Model that is pre-trained for still image synthesis and then promoted for video generation with the introduction of temporal modules.
1 code implementation • NeurIPS 2023 • Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, Ying Shan
This paper aims to efficiently enable Large Language Models (LLMs) to use multimodal tools.
1 code implementation • 29 May 2023 • Yuan Gong, Youxin Pang, Xiaodong Cun, Menghan Xia, Yingqing He, Haoxin Chen, Longyue Wang, Yong Zhang, Xintao Wang, Ying Shan, Yujiu Yang
Accurate story visualization requires several essential elements, such as identity consistency across frames, alignment between plain text and visual content, and a reasonable layout of objects in images.
2 code implementations • NeurIPS 2023 • YuChao Gu, Xintao Wang, Jay Zhangjie Wu, Yujun Shi, Yunpeng Chen, Zihan Fan, Wuyou Xiao, Rui Zhao, Shuning Chang, Weijia Wu, Yixiao Ge, Ying Shan, Mike Zheng Shou
Public large-scale text-to-image diffusion models, such as Stable Diffusion, have gained significant attention from the community.
1 code implementation • 23 May 2023 • Ziyun Zeng, Yixiao Ge, Zhan Tong, Xihui Liu, Shu-Tao Xia, Ying Shan
We argue that tuning a text encoder end-to-end, as done in previous work, is suboptimal since it may overfit in terms of styles, thereby losing its original generalization ability to capture the semantics of various language registers.
1 code implementation • 21 May 2023 • Limao Xiong, Jie Zhou, Qunxi Zhu, Xiao Wang, Yuanbin Wu, Qi Zhang, Tao Gui, Xuanjing Huang, Jin Ma, Ying Shan
Particularly, we propose a Confidence-based Partial Label Learning (CPLL) method to integrate the prior confidence (given by annotators) and posterior confidences (learned by models) for crowd-annotated NER.
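How the two confidence sources are integrated is not specified in the snippet; the sketch below shows one generic way to blend per-label confidence vectors. The geometric interpolation and the `alpha` weight are assumptions, not the paper's formula:

```python
def combine_confidences(prior, posterior, alpha=0.5):
    """Blend annotator-given (prior) and model-learned (posterior)
    confidences per candidate label, then renormalize to a distribution."""
    scores = [(p ** alpha) * (q ** (1.0 - alpha))
              for p, q in zip(prior, posterior)]
    total = sum(scores) or 1.0
    return [s / total for s in scores]
```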
1 code implementation • 20 May 2023 • Guangzhi Wang, Yixiao Ge, Xiaohan Ding, Mohan Kankanhalli, Ying Shan
In our benchmark, which is curated to evaluate MLLMs' visual semantic understanding and fine-grained perception capabilities, we discuss different visual tokenizers pre-trained with dominant methods (i.e., DeiT, CLIP, MAE, DINO) and observe that: i) fully/weakly supervised models capture more semantics than self-supervised models, but the gap narrows as the pre-training dataset scales up.
1 code implementation • 11 May 2023 • Weihao Cheng, Yan-Pei Cao, Ying Shan
We study to generate novel views of indoor scenes given sparse input views.
1 code implementation • 27 Apr 2023 • Chengyue Wu, Teng Wang, Yixiao Ge, Zeyu Lu, Ruisong Zhou, Ying Shan, Ping Luo
Foundation models have achieved great advances in multi-task learning with a unified interface of unimodal and multimodal tasks.
no code implementations • ICCV 2023 • Jia-Wei Liu, Yan-Pei Cao, Tianyuan Yang, Eric Zhongcong Xu, Jussi Keppo, Ying Shan, XiaoHu Qie, Mike Zheng Shou
Our method enables pausing the video at any frame and rendering all scene details (dynamic humans, objects, and backgrounds) from arbitrary viewpoints.
no code implementations • 18 Apr 2023 • Yiyu Zhuang, Qi Zhang, Xuan Wang, Hao Zhu, Ying Feng, Xiaoyu Li, Ying Shan, Xun Cao
Recent advances in implicit neural representation have demonstrated the ability to recover detailed geometry and material from multi-view images.
no code implementations • CVPR 2023 • Yiming Gao, Yan-Pei Cao, Ying Shan
Online reconstructing and rendering of large-scale indoor scenes is a long-standing challenge.
3 code implementations • ICCV 2023 • Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, XiaoHu Qie, Yinqiang Zheng
Despite the success in large-scale text-to-image generation and text-conditioned image editing, existing methods still struggle to produce consistent generation and editing results.
Ranked #11 on Text-based Image Editing on PIE-Bench
1 code implementation • CVPR 2023 • Liang Chen, Yong Zhang, Yibing Song, Ying Shan, Lingqiao Liu
Generally, a TTT strategy hinges its performance on two main factors: selecting an appropriate auxiliary TTT task for updating and identifying reliable parameters to update during the test phase.
Ranked #2 on Image to sketch recognition on PACS
1 code implementation • 6 Apr 2023 • Chen Li, Yixiao Ge, Jiayong Mao, Dian Li, Ying Shan
Given a new entity that needs tagging for distribution, TagGPT introduces two alternative options for zero-shot tagging, i.e., a generative method with late semantic matching against the tag set, and a selective method with early matching in prompts.
no code implementations • CVPR 2023 • Fang Zhao, Zekun Li, Shaoli Huang, Junwu Weng, Tianfei Zhou, Guo-Sen Xie, Jue Wang, Ying Shan
Once the anchor transformations are found, per-vertex nonlinear displacements of the garment template can be regressed in a canonical space, which reduces the complexity of deformation space learning.
2 code implementations • 3 Apr 2023 • Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Siran Chen, Ying Shan, Xiu Li, Qifeng Chen
Generating text-editable and pose-controllable character videos is in urgent demand for creating various digital humans.
1 code implementation • CVPR 2024 • Yukang Cao, Yan-Pei Cao, Kai Han, Ying Shan, Kwan-Yee K. Wong
We present DreamAvatar, a text-and-shape guided framework for generating high-quality 3D human avatars with controllable poses.
1 code implementation • CVPR 2023 • Qiangqiang Wu, Tianyu Yang, Ziquan Liu, Baoyuan Wu, Ying Shan, Antoni B. Chan
However, we find that this simple baseline heavily relies on spatial cues while ignoring temporal relations for frame reconstruction, thus leading to sub-optimal temporal matching representations for VOT and VOS.
Ranked #1 on Visual Object Tracking on TrackingNet (AUC metric)
2 code implementations • CVPR 2023 • Guangcong Zheng, Xianpan Zhou, XueWei Li, Zhongang Qi, Ying Shan, Xi Li
To overcome the difficult multimodal fusion of image and layout, we propose to construct a structural image patch with region information and transform the patched image into a special layout to fuse with the normal layout in a unified form.
Ranked #1 on Layout-to-Image Generation on Visual Genome 128x128
no code implementations • 28 Mar 2023 • Yuan-Chen Guo, Yan-Pei Cao, Chen Wang, Yu He, Ying Shan, XiaoHu Qie, Song-Hai Zhang
With the emergence of neural radiance fields (NeRFs), view synthesis quality has reached an unprecedented level.
1 code implementation • CVPR 2023 • Teng Wang, Yixiao Ge, Feng Zheng, Ran Cheng, Ying Shan, XiaoHu Qie, Ping Luo
FLM successfully frees the prediction rate from the tie-up with the corruption rate while allowing the corruption spans to be customized for each token to be predicted.
no code implementations • 21 Mar 2023 • Hao Ai, Zidong Cao, Yan-Pei Cao, Ying Shan, Lin Wang
Depth estimation from a monocular 360° image is a burgeoning problem owing to its holistic sensing of a scene.
1 code implementation • 21 Mar 2023 • Yongkang Cheng, Shaoli Huang, Jifeng Ning, Ying Shan
This paper presents a novel approach for estimating human body shape and pose from monocular images that effectively addresses the challenges of occlusions and depth ambiguity.
Ranked #21 on 3D Human Pose Estimation on 3DPW
no code implementations • 20 Mar 2023 • Haoyu Wang, Shaoli Huang, Fang Zhao, Chun Yuan, Ying Shan
We present a simple yet effective method for skeleton-free motion retargeting.
1 code implementation • ICCV 2023 • Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, Qifeng Chen
We also have a better zero-shot shape-aware editing ability based on the text-to-video model.
1 code implementation • CVPR 2023 • Jiaxu Zhang, Junwu Weng, Di Kang, Fang Zhao, Shaoli Huang, Xuefei Zhe, Linchao Bao, Ying Shan, Jue Wang, Zhigang Tu
Driven by our explored distance-based losses that explicitly model the motion semantics and geometry, these two modules can learn residual motion modifications on the source motion to generate plausible retargeted motion in a single inference without post-processing.
1 code implementation • 17 Feb 2023 • Yukang Gan, Yixiao Ge, Chang Zhou, Shupeng Su, Zhouchuan Xu, Xuyuan Xu, Quanchao Hui, Xiang Chen, Yexin Wang, Ying Shan
To tackle the challenge, we propose a binary embedding-based retrieval (BEBR) engine equipped with a recurrent binarization algorithm that enables customized bits per dimension.
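The payoff of binary embeddings is easy to sketch: bit-pack the codes, then rank by Hamming distance via XOR and bit counting. Plain sign binarization below stands in for the paper's recurrent binarization (which it is not), and the function names are illustrative:

```python
import numpy as np

def to_binary(embeddings: np.ndarray) -> np.ndarray:
    """Sign-binarize float embeddings and pack 8 bits per byte."""
    return np.packbits(embeddings > 0, axis=-1)

def hamming_search(query_code: np.ndarray, corpus_codes: np.ndarray, k: int = 10):
    """Indices of the k corpus codes closest to the query in Hamming
    distance, computed with XOR + popcount over packed bytes."""
    dist = np.unpackbits(query_code ^ corpus_codes, axis=-1).sum(axis=-1)
    return np.argsort(dist)[:k]
```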
2 code implementations • 16 Feb 2023 • Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, XiaoHu Qie
In this paper, we aim to "dig out" the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly.
1 code implementation • CVPR 2023 • Fanghua Yu, Xintao Wang, Mingdeng Cao, Gen Li, Ying Shan, Chao Dong
Omnidirectional images (ODIs) have obtained lots of research interest for immersive experiences.
no code implementations • 30 Jan 2023 • Yizhen Chen, Jie Wang, Lijian Lin, Zhongang Qi, Jin Ma, Ying Shan
Vision-language alignment learning for video-text retrieval has attracted much attention in recent years.
1 code implementation • CVPR 2023 • Shusheng Yang, Yixiao Ge, Kun Yi, Dian Li, Ying Shan, XiaoHu Qie, Xinggang Wang
Both masked image modeling (MIM) and natural language supervision have facilitated the progress of transferable visual pre-training.