Search Results for author: Ying Shan

Found 63 papers, 34 papers with code

Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis

no code implementations 6 Dec 2022 YuChao Gu, Xintao Wang, Yixiao Ge, Ying Shan, XiaoHu Qie, Mike Zheng Shou

Vector-Quantized (VQ-based) generative models usually consist of two basic components, i.e., VQ tokenizers and generative transformers.

Image Generation
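The two-component structure described above can be illustrated on the tokenizer side: a VQ tokenizer discretizes continuous encoder features by nearest-neighbor lookup in a learned codebook. A minimal NumPy sketch of that lookup (illustrative names and shapes; not the paper's implementation):

```python
import numpy as np

def vq_tokenize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry.

    features: (N, D) array of encoder outputs.
    codebook: (K, D) array of learned code vectors.
    Returns (N,) integer token ids and the (N, D) quantized vectors.
    """
    # Squared Euclidean distance between every feature and every code.
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    tokens = d2.argmin(axis=1)        # discrete token ids
    quantized = codebook[tokens]      # nearest code vectors
    return tokens, quantized

rng = np.random.default_rng(0)
codebook = rng.normal(size=(16, 4))  # K=16 codes of dimension 4
# Features close to codes 3, 7, 7 should map back to those token ids.
features = codebook[[3, 7, 7]] + 0.01 * rng.normal(size=(3, 4))
tokens, quantized = vq_tokenize(features, codebook)
print(tokens)
```

The generative transformer then models the distribution over these discrete token sequences.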

3D GAN Inversion with Facial Symmetry Prior

no code implementations 30 Nov 2022 Fei Yin, Yong Zhang, Xuan Wang, Tengfei Wang, Xiaoyu Li, Yuan Gong, Yanbo Fan, Xiaodong Cun, Ying Shan, Cengiz Oztireli, Yujiu Yang

It is natural to associate 3D GANs with GAN inversion methods to project a real image into the generator's latent space, allowing free-view consistent synthesis and editing, referred to as 3D GAN inversion.

Image Reconstruction Neural Rendering

Latent Video Diffusion Models for High-Fidelity Video Generation with Arbitrary Lengths

1 code implementation 23 Nov 2022 Yingqing He, Tianyu Yang, Yong Zhang, Ying Shan, Qifeng Chen

Diffusion models (DMs) are another class of deep generative models and have recently achieved remarkable performance on various image synthesis tasks.

Denoising Image Generation +1

SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation

no code implementations 22 Nov 2022 Wenxuan Zhang, Xiaodong Cun, Xuan Wang, Yong Zhang, Xi Shen, Yu Guo, Ying Shan, Fei Wang

We present SadTalker, which generates 3D motion coefficients (head pose, expression) of the 3DMM from audio and implicitly modulates a novel 3D-aware face render for talking head generation.

Talking Head Generation

Local-to-Global Registration for Bundle-Adjusting Neural Radiance Fields

no code implementations 21 Nov 2022 Yue Chen, Xingyu Chen, Xuan Wang, Qi Zhang, Yu Guo, Ying Shan, Fei Wang

Neural Radiance Fields (NeRF) have achieved photorealistic novel view synthesis; however, the requirement of accurate camera poses limits their application.

Vis2Mus: Exploring Multimodal Representation Mapping for Controllable Music Generation

1 code implementation 10 Nov 2022 Runbang Zhang, Yixiao Zhang, Kai Shao, Ying Shan, Gus Xia

In this study, we explore the representation mapping from the domain of visual arts to the domain of music, with which we can use visual arts as an effective handle to control music generation.

Music Generation Representation Learning +1

Darwinian Model Upgrades: Model Evolving with Selective Compatibility

no code implementations 13 Oct 2022 Binjie Zhang, Shupeng Su, Yixiao Ge, Xuyuan Xu, Yexin Wang, Chun Yuan, Mike Zheng Shou, Ying Shan

The traditional model upgrading paradigm for retrieval requires recomputing all gallery embeddings before deploying the new model (dubbed "backfilling"), which is quite expensive and time-consuming considering the billions of instances in industrial applications.

Face Recognition Retrieval
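The backfilling cost motivating this line of work is easy to picture: a backfill-free upgrade keeps the old gallery embeddings untouched and only requires the new model's query features to be compatible with (searchable in) the old embedding space. A hedged sketch of that retrieval setup, with made-up shapes and names:

```python
import numpy as np

def retrieve(query_emb, gallery_embs):
    """Return gallery indices ranked by cosine similarity to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    return np.argsort(-(g @ q))

rng = np.random.default_rng(1)
# Old model's gallery embeddings, kept as-is: no recomputation (backfilling).
old_gallery = rng.normal(size=(1000, 32))

# A compatible new model must place queries near the old embeddings of the
# same instance, so new query features match old gallery features directly.
target = 42
new_query = old_gallery[target] + 0.05 * rng.normal(size=32)
ranking = retrieve(new_query, old_gallery)
print(ranking[0])  # the intended gallery item still ranks first
```

Without that compatibility constraint, all 1000 gallery vectors would have to be re-extracted with the new model before deployment.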

Robust Human Matting via Semantic Guidance

1 code implementation 11 Oct 2022 Xiangguang Chen, Ye Zhu, Yu Li, Bingtao Fu, Lei Sun, Ying Shan, Shan Liu

Unlike previous works, our framework is data efficient, which requires a small amount of matting ground-truth to learn to estimate high quality object mattes.

Image Matting

MonoNeuralFusion: Online Monocular Neural 3D Reconstruction with Geometric Priors

no code implementations 30 Sep 2022 Zi-Xin Zou, Shi-Sheng Huang, Yan-Pei Cao, Tai-Jiang Mu, Ying Shan, Hongbo Fu

This paper introduces a novel neural implicit scene representation with volume rendering for high-fidelity online 3D scene reconstruction from monocular videos.

3D Reconstruction 3D Scene Reconstruction

Music-driven Dance Regeneration with Controllable Key Pose Constraints

no code implementations 8 Jul 2022 Junfu Pu, Ying Shan

By introducing a local neighbor position embedding, the cross-modal transformer decoder synthesizes smooth dance motion sequences that stay consistent with the key poses at their corresponding positions.

Motion Synthesis

Self-Supervised Learning of Music-Dance Representation through Explicit-Implicit Rhythm Synchronization

no code implementations 7 Jul 2022 Jiashuo Yu, Junfu Pu, Ying Cheng, Rui Feng, Ying Shan

Although audio-visual representations have proven applicable to many downstream tasks, the representation of dancing videos, which are more specific and always accompanied by music with complex auditory content, remains challenging and uninvestigated.

Contrastive Learning Representation Learning +2

Not All Models Are Equal: Predicting Model Transferability in a Self-challenging Fisher Space

1 code implementation 7 Jul 2022 Wenqi Shao, Xun Zhao, Yixiao Ge, Zhaoyang Zhang, Lei Yang, Xiaogang Wang, Ying Shan, Ping Luo

It is challenging because the ground-truth model ranking for each task can only be generated by fine-tuning the pre-trained models on the target dataset, which is brute-force and computationally expensive.

A Hierarchical Speaker Representation Framework for One-shot Singing Voice Conversion

no code implementations 28 Jun 2022 Xu Li, Shansong Liu, Ying Shan

It is suspected that a single embedding vector may only capture averaged and coarse-grained speaker characteristics, which is insufficient for the SVC task.

Speaker Recognition Voice Conversion

AnimeSR: Learning Real-World Super-Resolution Models for Animation Videos

1 code implementation 14 Jun 2022 Yanze Wu, Xintao Wang, Gen Li, Ying Shan

This paper studies the problem of real-world video super-resolution (VSR) for animation videos, and reveals three key improvements for practical animation VSR.

Video Super-Resolution

Do we really need temporal convolutions in action segmentation?

no code implementations 26 May 2022 Dazhao Du, Bing Su, Yu Li, Zhongang Qi, Lingyu Si, Ying Shan

Most state-of-the-art methods focus on designing temporal convolution-based models, but the inflexibility of temporal convolutions and the difficulties in modeling long-term temporal dependencies restrict the potential of these models.

Action Classification Action Segmentation +1

Masked Image Modeling with Denoising Contrast

no code implementations 19 May 2022 Kun Yi, Yixiao Ge, Xiaotong Li, Shusheng Yang, Dian Li, Jianping Wu, Ying Shan, XiaoHu Qie

Throughout the development of self-supervised visual representation learning from contrastive learning to masked image modeling, the essence has not changed significantly: it comes down to designing proper pretext tasks for vision dictionary look-up.

Contrastive Learning Denoising +6

VQFR: Blind Face Restoration with Vector-Quantized Dictionary and Parallel Decoder

1 code implementation 13 May 2022 YuChao Gu, Xintao Wang, Liangbin Xie, Chao Dong, Gen Li, Ying Shan, Ming-Ming Cheng

Equipped with the VQ codebook as a facial detail dictionary and the parallel decoder design, the proposed VQFR can largely enhance the restored quality of facial details while keeping the fidelity to previous methods.

Blind Face Restoration Quantization

RepSR: Training Efficient VGG-style Super-Resolution Networks with Structural Re-Parameterization and Batch Normalization

no code implementations 11 May 2022 Xintao Wang, Chao Dong, Ying Shan

Extensive experiments demonstrate that our simple RepSR is capable of achieving superior performance to previous SR re-parameterization methods among different model sizes.


Accelerating the Training of Video Super-Resolution Models

no code implementations 10 May 2022 Lijian Lin, Xintao Wang, Zhongang Qi, Ying Shan

In this work, we show that it is possible to gradually train video models from small to large spatial/temporal sizes, i.e., in an easy-to-hard manner.

Video Super-Resolution

MM-RealSR: Metric Learning based Interactive Modulation for Real-World Super-Resolution

1 code implementation 10 May 2022 Chong Mou, Yanze Wu, Xintao Wang, Chao Dong, Jian Zhang, Ying Shan

Instead of using known degradation levels as explicit supervision to the interactive mechanism, we propose a metric learning strategy to map the unquantifiable degradation levels in real-world scenarios to a metric space, which is trained in an unsupervised manner.

Image Restoration Metric Learning +1

Privacy-Preserving Model Upgrades with Bidirectional Compatible Training in Image Retrieval

1 code implementation 29 Apr 2022 Shupeng Su, Binjie Zhang, Yixiao Ge, Xuyuan Xu, Yexin Wang, Chun Yuan, Ying Shan

The task of privacy-preserving model upgrades in image retrieval desires to reap the benefits of rapidly evolving new models without accessing the raw gallery images.

Image Retrieval Privacy Preserving +1

MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval

1 code implementation 26 Apr 2022 Yuying Ge, Yixiao Ge, Xihui Liu, Alex Jinpeng Wang, Jianping Wu, Ying Shan, XiaoHu Qie, Ping Luo

Dominant pre-training work for video-text retrieval mainly adopts "dual-encoder" architectures to enable efficient retrieval, where two separate encoders contrast global video and text representations but ignore detailed local semantics.

Action Recognition Retrieval +4

Temporally Efficient Vision Transformer for Video Instance Segmentation

2 code implementations CVPR 2022 Shusheng Yang, Xinggang Wang, Yu Li, Yuxin Fang, Jiemin Fang, Wenyu Liu, Xun Zhao, Ying Shan

To effectively and efficiently model the crucial temporal information within a video clip, we propose a Temporally Efficient Vision Transformer (TeViT) for video instance segmentation (VIS).

Instance Segmentation Semantic Segmentation +1

Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection

2 code implementations 6 Apr 2022 Yuxin Fang, Shusheng Yang, Shijie Wang, Yixiao Ge, Ying Shan, Xinggang Wang

We present an approach to efficiently and effectively adapt a masked image modeling (MIM) pre-trained vanilla Vision Transformer (ViT) for object detection, based on our two novel observations: (i) a MIM pre-trained vanilla ViT encoder can work surprisingly well in the challenging object-level recognition scenario even with randomly sampled partial observations, e.g., only 25%~50% of the input embeddings.

Instance Segmentation object-detection +1
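Observation (i) above, detection with only a random 25%~50% subset of the input embeddings, amounts to subsampling the patch tokens before the encoder. A minimal illustrative sketch (names and shapes are assumptions, not the paper's code):

```python
import numpy as np

def sample_partial_observations(patch_embeddings, keep_ratio, rng):
    """Keep a random subset of patch embeddings, mimicking a ViT that
    sees only a fraction (e.g. 25%-50%) of the input tokens."""
    n = patch_embeddings.shape[0]
    k = max(1, int(round(keep_ratio * n)))
    # Sample k distinct positions; keep them in original spatial order.
    idx = np.sort(rng.choice(n, size=k, replace=False))
    return patch_embeddings[idx], idx

rng = np.random.default_rng(0)
patches = np.arange(196 * 8, dtype=float).reshape(196, 8)  # 14x14 patch grid
kept, idx = sample_partial_observations(patches, keep_ratio=0.25, rng=rng)
print(kept.shape)  # only a quarter of the 196 embeddings are retained
```

The retained positions `idx` would normally be paired with their positional embeddings so the encoder still knows where each surviving patch came from.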

CREATE: A Benchmark for Chinese Short Video Retrieval and Title Generation

no code implementations 31 Mar 2022 Ziqi Zhang, Yuxin Chen, Zongyang Ma, Zhongang Qi, Chunfeng Yuan, Bing Li, Ying Shan, Weiming Hu

In this paper, we propose CREATE, the first large-scale Chinese shoRt vidEo retrievAl and Title gEneration benchmark, to facilitate research and application in video titling and video retrieval in Chinese.

Retrieval Video Captioning +1

mc-BEiT: Multi-choice Discretization for Image BERT Pre-training

1 code implementation 29 Mar 2022 Xiaotong Li, Yixiao Ge, Kun Yi, Zixuan Hu, Ying Shan, Ling-Yu Duan

Image BERT pre-training with masked image modeling (MIM) becomes a popular practice to cope with self-supervised representation learning.

Instance Segmentation object-detection +4

UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection

1 code implementation CVPR 2022 Ye Liu, Siyuan Li, Yang Wu, Chang Wen Chen, Ying Shan, XiaoHu Qie

Finding relevant moments and highlights in videos according to natural language queries is a natural and highly valuable need in the current era of exploding video content.

Highlight Detection Moment Retrieval +2

Revitalize Region Feature for Democratizing Video-Language Pre-training

2 code implementations 15 Mar 2022 Guanyu Cai, Yixiao Ge, Alex Jinpeng Wang, Rui Yan, Xudong Lin, Ying Shan, Lianghua He, XiaoHu Qie, Jianping Wu, Mike Zheng Shou

Recent dominant methods for video-language pre-training (VLP) learn transferable representations from the raw pixels in an end-to-end manner to achieve advanced performance on downstream video-language tasks.

Question Answering Retrieval +4

All in One: Exploring Unified Video-Language Pre-training

1 code implementation 14 Mar 2022 Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, XiaoHu Qie, Mike Zheng Shou

In this work, we introduce, for the first time, an end-to-end video-language model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations using a unified backbone architecture.

Language Modelling Multiple-choice +11

Towards Universal Backward-Compatible Representation Learning

1 code implementation 3 Mar 2022 Binjie Zhang, Yixiao Ge, Yantao Shen, Shupeng Su, Fanzi Wu, Chun Yuan, Xuyuan Xu, Yexin Wang, Ying Shan

The task of backward-compatible representation learning is therefore introduced to support backfill-free model upgrades, where the new query features are interoperable with the old gallery features.

Face Recognition Representation Learning

Uncertainty Modeling for Out-of-Distribution Generalization

1 code implementation ICLR 2022 Xiaotong Li, Yongxing Dai, Yixiao Ge, Jun Liu, Ying Shan, Ling-Yu Duan

In this paper, we improve the network generalization ability by modeling the uncertainty of domain shifts with synthesized feature statistics during training.

Image Classification Out-of-Distribution Generalization +2

Hot-Refresh Model Upgrades with Regression-Alleviating Compatible Training in Image Retrieval

1 code implementation 24 Jan 2022 Binjie Zhang, Yixiao Ge, Yantao Shen, Yu Li, Chun Yuan, Xuyuan Xu, Yexin Wang, Ying Shan

In contrast, hot-refresh model upgrades deploy the new model immediately and then gradually improve the retrieval accuracy by backfilling the gallery on-the-fly.

Image Retrieval regression +1

Bridging Video-text Retrieval with Multiple Choice Questions

2 code implementations CVPR 2022 Yuying Ge, Yixiao Ge, Xihui Liu, Dian Li, Ying Shan, XiaoHu Qie, Ping Luo

As an additional benefit, our method achieves competitive results with much shorter pre-training videos on single-modality downstream tasks, e.g., action recognition with linear evaluation.

Ranked #21 on Video Retrieval on MSR-VTT-1kA (using extra training data)

Action Recognition Multiple-choice +6

BTS: A Bi-Lingual Benchmark for Text Segmentation in the Wild

no code implementations CVPR 2022 Xixi Xu, Zhongang Qi, Jianqi Ma, Honglun Zhang, Ying Shan, XiaoHu Qie

Current research mainly focuses on English characters and digits, while few works study Chinese characters due to the lack of public large-scale, high-quality Chinese datasets, which limits the practical application scenarios of text segmentation.

Style Transfer Text Segmentation +1

Object-aware Video-language Pre-training for Retrieval

1 code implementation CVPR 2022 Alex Jinpeng Wang, Yixiao Ge, Guanyu Cai, Rui Yan, Xudong Lin, Ying Shan, XiaoHu Qie, Mike Zheng Shou

In this work, we present Object-aware Transformers, an object-centric approach that extends video-language transformer to incorporate object representations.

Retrieval Text Matching

Hot-Refresh Model Upgrades with Regression-Free Compatible Training in Image Retrieval

no code implementations ICLR 2022 Binjie Zhang, Yixiao Ge, Yantao Shen, Yu Li, Chun Yuan, Xuyuan Xu, Yexin Wang, Ying Shan

In contrast, hot-refresh model upgrades deploy the new model immediately and then gradually improve the retrieval accuracy by backfilling the gallery on-the-fly.

Image Retrieval regression +1

Finding Discriminative Filters for Specific Degradations in Blind Super-Resolution

1 code implementation NeurIPS 2021 Liangbin Xie, Xintao Wang, Chao Dong, Zhongang Qi, Ying Shan

Unlike previous integral gradient methods, our FAIG aims at finding the most discriminative filters instead of input pixels/features for degradation removal in blind SR networks.

Blind Super-Resolution Super-Resolution

Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization

1 code implementation 27 Jul 2021 Fa-Ting Hong, Jia-Chang Feng, Dan Xu, Ying Shan, Wei-Shi Zheng

In this work, we argue that the features extracted from a pretrained extractor, e.g., I3D, are not WS-TAL task-specific features, and thus feature re-calibration is needed to reduce task-irrelevant information redundancy.

Weakly Supervised Action Localization Weakly-supervised Temporal Action Localization +1

Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data

6 code implementations 22 Jul 2021 Xintao Wang, Liangbin Xie, Chao Dong, Ying Shan

Though many attempts have been made in blind super-resolution to restore low-resolution images with unknown and complex degradations, they are still far from addressing general real-world degraded images.

Blind Super-Resolution Video Super-Resolution

Tracking Instances as Queries

1 code implementation 22 Jun 2021 Shusheng Yang, Yuxin Fang, Xinggang Wang, Yu Li, Ying Shan, Bin Feng, Wenyu Liu

Recently, query-based deep networks have attracted much attention owing to their end-to-end pipelines and competitive results on several fundamental computer vision tasks, such as object detection, semantic segmentation, and instance segmentation.

Instance Segmentation object-detection +3

Instances as Queries

5 code implementations ICCV 2021 Yuxin Fang, Shusheng Yang, Xinggang Wang, Yu Li, Chen Fang, Ying Shan, Bin Feng, Wenyu Liu

The key insight of QueryInst is to leverage the intrinsic one-to-one correspondence in object queries across different stages, as well as one-to-one correspondence between mask RoI features and object queries in the same stage.

Instance Segmentation object-detection +2

Distilling Audio-Visual Knowledge by Compositional Contrastive Learning

1 code implementation CVPR 2021 Yanbei Chen, Yongqin Xian, A. Sophia Koepke, Ying Shan, Zeynep Akata

Having access to multi-modal cues (e.g., vision and audio) empowers some cognitive tasks to be done faster compared to learning from a single modality.

Audio Tagging audio-visual learning +5

Crossover Learning for Fast Online Video Instance Segmentation

1 code implementation ICCV 2021 Shusheng Yang, Yuxin Fang, Xinggang Wang, Yu Li, Chen Fang, Ying Shan, Bin Feng, Wenyu Liu

For temporal information modeling in VIS, we present a novel crossover learning scheme that uses the instance feature in the current frame to pixel-wisely localize the same instance in other frames.

Association Instance Segmentation +3

Open-book Video Captioning with Retrieve-Copy-Generate Network

no code implementations CVPR 2021 Ziqi Zhang, Zhongang Qi, Chunfeng Yuan, Ying Shan, Bing Li, Ying Deng, Weiming Hu

Due to the rapid emergence of short videos and the requirement for content understanding and creation, the video captioning task has received increasing attention in recent years.

Retrieval Video Captioning

Towards Real-World Blind Face Restoration with Generative Facial Prior

1 code implementation CVPR 2021 Xintao Wang, Yu Li, Honglun Zhang, Ying Shan

Blind face restoration usually relies on facial priors, such as facial geometry prior or reference prior, to restore realistic and faithful details.

Blind Face Restoration Video Super-Resolution

Non-Inherent Feature Compatible Learning

no code implementations 1 Jan 2021 Yantao Shen, Fanzi Wu, Ying Shan

In this work, we introduce an approach for feature compatible learning without inheriting the old classifier and training data, i.e., Non-Inherent Feature Compatible Learning.


Detecting Interactions from Neural Networks via Topological Analysis

no code implementations NeurIPS 2020 Zirui Liu, Qingquan Song, Kaixiong Zhou, Ting-Hsiang Wang, Ying Shan, Xia Hu

Motivated by the observation, in this paper, we propose to investigate the interaction detection problem from a novel topological perspective by analyzing the connectivity in neural networks.

Towards Interaction Detection Using Topological Analysis on Neural Networks

no code implementations 25 Oct 2020 Zirui Liu, Qingquan Song, Kaixiong Zhou, Ting Hsiang Wang, Ying Shan, Xia Hu

Detecting statistical interactions between input features is a crucial and challenging task.

A Simple Yet Effective Method for Video Temporal Grounding with Cross-Modality Attention

no code implementations 23 Sep 2020 Binjie Zhang, Yu Li, Chun Yuan, Dejing Xu, Pin Jiang, Ying Shan

The task of language-guided video temporal grounding is to localize the particular video clip corresponding to a query sentence in an untrimmed video.

Dual Semantic Fusion Network for Video Object Detection

no code implementations 16 Sep 2020 Lijian Lin, Haosheng Chen, Honglun Zhang, Jun Liang, Yu Li, Ying Shan, Hanzi Wang

Video object detection is a tough task due to the deteriorated quality of video sequences captured under complex environments.

object-detection Optical Flow Estimation +1

Fast Video Object Segmentation using the Global Context Module

no code implementations ECCV 2020 Yu Li, Zhuoran Shen, Ying Shan

Therefore, it uses constant memory regardless of the video length and costs substantially less memory and computation.

online learning Semantic Segmentation +3

Recurrent Binary Embedding for GPU-Enabled Exhaustive Retrieval from Billion-Scale Semantic Vectors

no code implementations 18 Feb 2018 Ying Shan, Jian Jiao, Jie Zhu, JC Mao

Building on top of the powerful concept of semantic learning, this paper proposes a Recurrent Binary Embedding (RBE) model that learns compact representations for real-time retrieval.

Information Retrieval Retrieval

Deep Embedding Forest: Forest-based Serving with Deep Embedding Features

no code implementations 15 Mar 2017 Jie Zhu, Ying Shan, JC Mao, Dong Yu, Holakou Rahmanian, Yi Zhang

Built on top of a representative DNN model called Deep Crossing, and two forest/tree-based models including XGBoost and LightGBM, a two-step Deep Embedding Forest algorithm is demonstrated to achieve on-par or slightly better performance as compared with the DNN counterpart, with only a fraction of serving time on conventional hardware.
