Search Results for author: Yingya Zhang

Found 43 papers, 21 papers with code

DreamRelation: Relation-Centric Video Customization

no code implementations 10 Mar 2025 Yujie Wei, Shiwei Zhang, Hangjie Yuan, Biao Gong, Longxiang Tang, Xiang Wang, Haonan Qiu, Hengjia Li, Shuai Tan, Yingya Zhang, Hongming Shan

First, in Relational Decoupling Learning, we disentangle relations from subject appearances using a relation LoRA triplet and a hybrid mask training strategy, ensuring better generalization across diverse relationships.

Relation Triplet +1

FreeScale: Unleashing the Resolution of Diffusion Models via Tuning-Free Scale Fusion

no code implementations 12 Dec 2024 Haonan Qiu, Shiwei Zhang, Yujie Wei, Ruihang Chu, Hangjie Yuan, Xiang Wang, Yingya Zhang, Ziwei Liu

Visual diffusion models achieve remarkable progress, yet they are typically trained at limited resolutions due to the lack of high-resolution data and constrained computation resources, hampering their ability to generate high-fidelity images or videos at higher resolutions.

8k

Timestep Embedding Tells: It's Time to Cache for Video Diffusion Model

no code implementations 28 Nov 2024 Feng Liu, Shiwei Zhang, XiaoFeng Wang, Yujie Wei, Haonan Qiu, Yuzhong Zhao, Yingya Zhang, Qixiang Ye, Fang Wan

As a fundamental backbone for video generation, diffusion models are challenged by low inference speed due to the sequential nature of denoising.

Denoising, Video Generation
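The abstract hints at the core idea: because the timestep embedding changes smoothly across denoising steps, expensive block outputs can be reused when consecutive embeddings are similar. A minimal, hypothetical sketch of that caching pattern (not the paper's actual implementation; `CachedBlock` and the distance threshold are illustrative assumptions):

```python
import numpy as np

def timestep_embedding(t, dim=8):
    # Standard sinusoidal embedding used by diffusion models.
    freqs = np.exp(-np.linspace(0, 4, dim // 2))
    return np.concatenate([np.sin(t * freqs), np.cos(t * freqs)])

class CachedBlock:
    """Reuse the last computed output while the current timestep
    embedding stays close to the one it was computed at."""
    def __init__(self, fn, threshold=0.05):
        self.fn = fn
        self.threshold = threshold
        self.prev_emb = None
        self.cached_out = None
        self.calls = 0  # how often fn actually ran

    def __call__(self, x, t):
        emb = timestep_embedding(t)
        if (self.prev_emb is not None
                and np.linalg.norm(emb - self.prev_emb) < self.threshold):
            return self.cached_out  # cache hit: skip the computation
        self.calls += 1
        self.prev_emb = emb
        self.cached_out = self.fn(x, t)
        return self.cached_out

# Toy "denoiser" that just scales its input; 50 sequential steps.
block = CachedBlock(lambda x, t: x * 0.99, threshold=0.05)
x = np.ones(4)
for t in np.linspace(1.0, 0.0, 50):
    x = block(x, t)
# block.calls ends up well below 50: most steps hit the cache.
```

The real method decides when to cache from the model's own timestep embeddings rather than a fixed threshold; the sketch only shows why reuse across nearby timesteps saves sequential denoising work.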

PersonalVideo: High ID-Fidelity Video Customization without Dynamic and Semantic Degradation

no code implementations 26 Nov 2024 Hengjia Li, Haonan Qiu, Shiwei Zhang, Xiang Wang, Yujie Wei, Zekun Li, Yingya Zhang, Boxi Wu, Deng Cai

The key challenge lies in consistently maintaining high ID fidelity while preserving the original motion dynamics and semantic following after the identity injection.

Video Generation

EgoVid-5M: A Large-Scale Video-Action Dataset for Egocentric Video Generation

no code implementations 13 Nov 2024 XiaoFeng Wang, Kang Zhao, Feng Liu, Jiayu Wang, Guosheng Zhao, Xiaoyi Bao, Zheng Zhu, Yingya Zhang, Xingang Wang

Video generation has emerged as a promising tool for world simulation, leveraging visual data to replicate real-world environments.

Video Generation

DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control

no code implementations 17 Oct 2024 Yujie Wei, Shiwei Zhang, Hangjie Yuan, Xiang Wang, Haonan Qiu, Rui Zhao, Yutong Feng, Feng Liu, Zhizhong Huang, Jiaxin Ye, Yingya Zhang, Hongming Shan

In this paper, we present DreamVideo-2, a zero-shot video customization framework capable of generating videos with a specific subject and motion trajectory, guided by a single image and a bounding box sequence, respectively, and without the need for test-time fine-tuning.

Video Generation

EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models

1 code implementation 9 Oct 2024 Rui Zhao, Hangjie Yuan, Yujie Wei, Shiwei Zhang, YuChao Gu, Lingmin Ran, Xiang Wang, Zhangjie Wu, Junhao Zhang, Yingya Zhang, Mike Zheng Shou

Our experiments with extensive data indicate that a model trained on data generated by the advanced model can approximate its generation capability.

Text-to-Image Generation

UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation

no code implementations 3 Jun 2024 Xiang Wang, Shiwei Zhang, Changxin Gao, Jiayu Wang, Xiaoqiang Zhou, Yingya Zhang, Luxin Yan, Nong Sang

First, to reduce the optimization difficulty and ensure temporal coherence, we map the reference image along with the posture guidance and noise video into a common feature space by incorporating a unified video diffusion model.

Image Animation, Video Generation

A Recipe for Scaling up Text-to-Video Generation with Text-free Videos

1 code implementation CVPR 2024 Xiang Wang, Shiwei Zhang, Hangjie Yuan, Zhiwu Qing, Biao Gong, Yingya Zhang, Yujun Shen, Changxin Gao, Nong Sang

Following such a pipeline, we study the effect of doubling the scale of the training set (i.e., video-only WebVid10M) with some randomly collected text-free videos and are encouraged to observe a performance improvement (FID from 9.67 to 8.19 and FVD from 484 to 441), demonstrating the scalability of our approach.

Text-to-Image Generation, Text-to-Video Generation +2

InstructVideo: Instructing Video Diffusion Models with Human Feedback

1 code implementation CVPR 2024 Hangjie Yuan, Shiwei Zhang, Xiang Wang, Yujie Wei, Tao Feng, Yining Pan, Yingya Zhang, Ziwei Liu, Samuel Albanie, Dong Ni

To tackle this problem, we propose InstructVideo to instruct text-to-video diffusion models with human feedback by reward fine-tuning.

Video Generation

AE-NeRF: Audio Enhanced Neural Radiance Field for Few Shot Talking Head Synthesis

no code implementations 18 Dec 2023 Dongze Li, Kang Zhao, Wei Wang, Bo Peng, Yingya Zhang, Jing Dong, Tieniu Tan

Audio-driven talking head synthesis is a promising topic with wide applications in digital humans, filmmaking, and virtual reality.

NeRF, Talking Head Generation

DreamTalk: When Emotional Talking Head Generation Meets Diffusion Probabilistic Models

1 code implementation 15 Dec 2023 Yifeng Ma, Shiwei Zhang, Jiayu Wang, Xiang Wang, Yingya Zhang, Zhidong Deng

To more conveniently specify personalized emotions, a diffusion-based style predictor is utilized to predict the personalized emotion directly from the audio, eliminating the need for extra emotion reference.

Denoising, Talking Head Generation

VideoLCM: Video Latent Consistency Model

2 code implementations 14 Dec 2023 Xiang Wang, Shiwei Zhang, Han Zhang, Yu Liu, Yingya Zhang, Changxin Gao, Nong Sang

Consistency models have demonstrated powerful capability in efficient image generation and allowed synthesis within a few sampling steps, alleviating the high computational cost in diffusion models.

Computational Efficiency, Image Generation +2

I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models

3 code implementations 7 Nov 2023 Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qing, Xiang Wang, Deli Zhao, Jingren Zhou

By this means, I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos.

Few-shot Action Recognition with Captioning Foundation Models

no code implementations 16 Oct 2023 Xiang Wang, Shiwei Zhang, Hangjie Yuan, Yingya Zhang, Changxin Gao, Deli Zhao, Nong Sang

In this paper, we develop an effective plug-and-play framework called CapFSAR to exploit the knowledge of multimodal models without manually annotating text.

Few-Shot Action Recognition

DeltaSpace: A Semantic-aligned Feature Space for Flexible Text-guided Image Editing

1 code implementation 12 Oct 2023 Yueming Lyu, Kang Zhao, Bo Peng, Yue Jiang, Yingya Zhang, Jing Dong

Based on DeltaSpace, we propose a novel framework called DeltaEdit, which maps the CLIP visual feature differences to the latent space directions of a generative model during the training phase, and predicts the latent space directions from the CLIP textual feature differences during the inference phase.

Text-Guided Image Editing

Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning

1 code implementation ICCV 2023 Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Yingya Zhang, Changxin Gao, Deli Zhao, Nong Sang

When pre-training on the large-scale Kinetics-710, we achieve 89.7% on Kinetics-400 with a frozen ViT-L model, which verifies the scalability of DiST.

Transfer Learning, Video Recognition

RLIPv2: Fast Scaling of Relational Language-Image Pre-training

3 code implementations ICCV 2023 Hangjie Yuan, Shiwei Zhang, Xiang Wang, Samuel Albanie, Yining Pan, Tao Feng, Jianwen Jiang, Dong Ni, Yingya Zhang, Deli Zhao

In this paper, we propose RLIPv2, a fast converging model that enables the scaling of relational pre-training to large-scale pseudo-labelled scene graph data.

 Ranked #1 on Zero-Shot Human-Object Interaction Detection on HICO-DET (using extra training data)

Graph Generation, Human-Object Interaction Detection +6

ModelScope Text-to-Video Technical Report

5 code implementations 12 Aug 2023 Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, Shiwei Zhang

This paper introduces ModelScopeT2V, a text-to-video synthesis model that evolves from a text-to-image synthesis model (i.e., Stable Diffusion).

Denoising, Image Generation +1

CMDFusion: Bidirectional Fusion Network with Cross-modality Knowledge Distillation for LIDAR Semantic Segmentation

1 code implementation 9 Jul 2023 Jun Cen, Shiwei Zhang, Yixuan Pei, Kun Li, Hang Zheng, Maochun Luo, Yingya Zhang, Qifeng Chen

In this way, RGB images are no longer required during inference, since the 2D knowledge branch provides 2D information from the 3D LIDAR input alone.

Autonomous Vehicles, Knowledge Distillation +2

Freestyle 3D-Aware Portrait Synthesis Based on Compositional Generative Priors

no code implementations 27 Jun 2023 Tianxiang Ma, Kang Zhao, Jianxin Sun, Yingya Zhang, Jing Dong

Efficiently generating a freestyle 3D portrait with high quality and 3D-consistency is a promising yet challenging task.

MoLo: Motion-augmented Long-short Contrastive Learning for Few-shot Action Recognition

1 code implementation CVPR 2023 Xiang Wang, Shiwei Zhang, Zhiwu Qing, Changxin Gao, Yingya Zhang, Deli Zhao, Nong Sang

To address these issues, we develop a Motion-augmented Long-short Contrastive Learning (MoLo) method that contains two crucial components, including a long-short contrastive objective and a motion autodecoder.

Contrastive Learning, Few-Shot Action Recognition +1

Enlarging Instance-specific and Class-specific Information for Open-set Action Recognition

1 code implementation CVPR 2023 Jun Cen, Shiwei Zhang, Xiang Wang, Yixuan Pei, Zhiwu Qing, Yingya Zhang, Qifeng Chen

In this paper, we begin with analyzing the feature representation behavior in the open-set action recognition (OSAR) problem based on the information bottleneck (IB) theory, and propose to enlarge the instance-specific (IS) and class-specific (CS) information contained in the feature for better performance.

Open Set Action Recognition

VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation

1 code implementation CVPR 2023 Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, Tieniu Tan

A diffusion probabilistic model (DPM), which constructs a forward diffusion process by gradually adding noise to data points and learns the reverse denoising process to generate new samples, has been shown to handle complex data distributions.

Code Generation, Denoising +4
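The forward process described in this snippet has a well-known closed form in DDPM-style models: x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. A small sketch using a generic DDPM noise schedule (this illustrates the standard forward process, not the paper's decomposed formulation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear beta schedule as in DDPM; alpha_bar_t = prod_{s<=t} (1 - beta_s).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def q_sample(x0, t):
    """Sample x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps

x0 = np.ones(16)             # a toy "data point"
x_early = q_sample(x0, 10)   # early step: mostly signal
x_late = q_sample(x0, 999)   # final step: almost pure Gaussian noise
```

The reverse (denoising) model is trained to invert this corruption step by step; VideoFusion's contribution is decomposing the noise across video frames into shared and residual parts, which the generic sketch above does not capture.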

CLIP-guided Prototype Modulating for Few-shot Action Recognition

1 code implementation 6 Mar 2023 Xiang Wang, Shiwei Zhang, Jun Cen, Changxin Gao, Yingya Zhang, Deli Zhao, Nong Sang

Learning from large-scale contrastive language-image pre-training like CLIP has shown remarkable success in a wide range of downstream tasks recently, but it is still under-explored on the challenging few-shot action recognition (FSAR) task.

Few-Shot Action Recognition

The Devil is in the Wrongly-classified Samples: Towards Unified Open-set Recognition

1 code implementation 8 Feb 2023 Jun Cen, Di Luan, Shiwei Zhang, Yixuan Pei, Yingya Zhang, Deli Zhao, Shaojie Shen, Qifeng Chen

Recently, Unified Open-set Recognition (UOSR) has been proposed to reject not only unknown samples but also known but wrongly classified samples, which tends to be more practical in real-world applications.

Open Set Learning

Space-time Prompting for Video Class-incremental Learning

no code implementations ICCV 2023 Yixuan Pei, Zhiwu Qing, Shiwei Zhang, Xiang Wang, Yingya Zhang, Deli Zhao, Xueming Qian

In this paper, we fill this gap by learning multiple prompts based on a powerful image-language pre-trained model, i.e., CLIP, making it fit for video class-incremental learning (VCIL).

Class-Incremental Learning +1

Revisiting Optimal Convergence Rate for Smooth and Non-convex Stochastic Decentralized Optimization

no code implementations 14 Oct 2022 Kun Yuan, Xinmeng Huang, Yiming Chen, Xiaohan Zhang, Yingya Zhang, Pan Pan

While (Lu and Sa, 2021) have recently provided an optimal rate for non-convex stochastic decentralized optimization with weight matrices defined over linear graphs, the optimal rate with general weight matrices remains unclear.

Communicate Then Adapt: An Effective Decentralized Adaptive Method for Deep Training

no code implementations 29 Sep 2021 Bicheng Ying, Kun Yuan, Yiming Chen, Hanbin Hu, Yingya Zhang, Pan Pan, Wotao Yin

Decentralized adaptive gradient methods, in which each node averages only with its neighbors, are critical to save communication and wall-clock training time in deep learning tasks.
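The "averages only with its neighbors" pattern can be sketched with a doubly stochastic mixing matrix. The example below shows plain decentralized SGD on a ring topology; the adaptive-gradient part of the paper's method is omitted, and the quadratic objective is an illustrative assumption:

```python
import numpy as np

def ring_weight_matrix(n):
    """Doubly stochastic mixing matrix for a ring: each node averages
    itself with its two immediate neighbors (weight 1/3 each)."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 1 / 3
        W[i, (i - 1) % n] = 1 / 3
        W[i, (i + 1) % n] = 1 / 3
    return W

def decentralized_sgd_step(params, grads, W, lr=0.1):
    # Each node mixes with neighbors, then applies its local gradient.
    return W @ params - lr * grads

n, d = 8, 4
rng = np.random.default_rng(0)
params = rng.standard_normal((n, d))   # one parameter row per node
W = ring_weight_matrix(n)
for _ in range(100):
    grads = params                     # gradient of 0.5 * ||x||^2 per node
    params = decentralized_sgd_step(params, grads, W)

# Nodes both converge toward the optimum (0) and reach consensus.
spread = np.abs(params - params.mean(axis=0)).max()
```

The communication saving comes from `W` being sparse: each row touches only a node's neighbors, so no global all-reduce is needed per step.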

Communication Efficient SGD via Gradient Sampling With Bayes Prior

no code implementations CVPR 2021 Liuyihan Song, Kang Zhao, Pan Pan, Yu Liu, Yingya Zhang, Yinghui Xu, Rong Jin

Different from all of them, we regard the selection of large and small gradients as the exploitation and exploration of gradient information, respectively.

Image Classification, Object Detection +2
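A hypothetical sketch of the exploitation/exploration split for communication-efficient SGD: transmit the largest-magnitude gradient entries (exploitation) plus a random sample of the remaining ones (exploration). The Bayes-prior sampling of the paper is not modeled here; `k_top` and `k_rand` are illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gradient(g, k_top=5, k_rand=5):
    """Keep the k_top largest-magnitude entries plus k_rand randomly
    sampled entries from the rest; drop everything else to reduce
    the communication volume."""
    idx_top = np.argsort(np.abs(g))[-k_top:]          # exploitation
    rest = np.setdiff1d(np.arange(g.size), idx_top)
    idx_rand = rng.choice(rest, size=k_rand, replace=False)  # exploration
    keep = np.union1d(idx_top, idx_rand)
    sparse = np.zeros_like(g)
    sparse[keep] = g[keep]
    return sparse

g = rng.standard_normal(100)
s = sample_gradient(g)   # only 10 of 100 entries survive
```

In a distributed run, only the surviving (index, value) pairs would be sent, with the random component keeping small-but-informative gradients from being starved.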

DecentLaM: Decentralized Momentum SGD for Large-batch Deep Training

1 code implementation ICCV 2021 Kun Yuan, Yiming Chen, Xinmeng Huang, Yingya Zhang, Pan Pan, Yinghui Xu, Wotao Yin

Experimental results on a variety of computer vision tasks and models demonstrate that DecentLaM promises both efficient and high-quality training.

Visual Search at Alibaba

no code implementations 9 Feb 2021 Yanhao Zhang, Pan Pan, Yun Zheng, Kang Zhao, Yingya Zhang, Xiaofeng Ren, Rong Jin

We hope visual search at Alibaba becomes more widely incorporated into today's commercial applications.

Image Retrieval

Large-Scale Visual Search with Binary Distributed Graph at Alibaba

no code implementations 9 Feb 2021 Kang Zhao, Pan Pan, Yun Zheng, Yanhao Zhang, Changxu Wang, Yingya Zhang, Yinghui Xu, Rong Jin

For a deployed visual search system with several billions of online images in total, building a billion-scale offline graph in hours is essential, which is almost unachievable by most existing methods.

Graph Construction

Distribution Adaptive INT8 Quantization for Training CNNs

no code implementations 9 Feb 2021 Kang Zhao, Sida Huang, Pan Pan, Yinghan Li, Yingya Zhang, Zhenyu Gu, Yinghui Xu

Research has demonstrated that low bit-width (e.g., INT8) quantization can be employed to accelerate the inference process.

Image Classification, Object Detection +3
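For context, a minimal symmetric per-tensor INT8 quantizer (a generic static scheme for illustration, not the distribution-adaptive training method the paper proposes):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization: map [-max|x|, max|x|]
    to the integer range [-127, 127] with a single scale factor."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

x = np.random.default_rng(0).standard_normal(1000).astype(np.float32)
q, s = quantize_int8(x)
# Per-element rounding error is bounded by half the scale factor.
err = np.abs(dequantize(q, s) - x).max()
```

The hard part in *training* (rather than inference) is that gradient distributions shift over time, which is why a fixed scale like the one above is insufficient and the paper adapts it to the observed distribution.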
