Search Results for author: Rui Qian

Found 32 papers, 24 papers with code

Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos

1 code implementation ICCV 2023 Rui Qian, Shuangrui Ding, Xian Liu, Dahua Lin

In the second stage, for each semantics, we randomly sample slots from the corresponding Gaussian distribution and perform masked feature aggregation within the semantic area to exploit temporal correspondence patterns for instance identification.

Object Discovery Semantic Segmentation

Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation

1 code implementation ICCV 2023 Shuangrui Ding, Peisen Zhao, Xiaopeng Zhang, Rui Qian, Hongkai Xiong, Qi Tian

Based on the STA score, we are able to progressively prune the tokens without introducing any additional parameters or requiring further re-training.

Video Recognition

Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation

1 code implementation CVPR 2023 Lingting Zhu, Xian Liu, Xuanyu Liu, Rui Qian, Ziwei Liu, Lequan Yu

In this work, we propose a novel diffusion-based framework, named Diffusion Co-Speech Gesture (DiffGesture), to effectively capture the cross-modal audio-to-gesture associations and preserve temporal coherence for high-fidelity audio-driven co-speech gesture generation.

Gesture Generation

Exploring Fine-Grained Audiovisual Categorization with the SSW60 Dataset

1 code implementation21 Jul 2022 Grant van Horn, Rui Qian, Kimberly Wilber, Hartwig Adam, Oisin Mac Aodha, Serge Belongie

We thoroughly benchmark audiovisual classification performance and modality fusion experiments through the use of state-of-the-art transformer methods.

Fine-Grained Visual Categorization Video Classification

Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models

no code implementations15 Jul 2022 Rui Qian, Yeqing Li, Zheng Xu, Ming-Hsuan Yang, Serge Belongie, Yin Cui

Utilizing vision and language models (VLMs) pre-trained on large-scale image-text pairs is becoming a promising paradigm for open-vocabulary visual recognition.

Optical Flow Estimation Video Classification +1

Dual Contrastive Learning for Spatio-temporal Representation

no code implementations12 Jul 2022 Shuangrui Ding, Rui Qian, Hongkai Xiong

In this way, the static scene and the dynamic motion are simultaneously encoded into the compact RGB representation.

Contrastive Learning Representation Learning

Controllable Augmentations for Video Representation Learning

no code implementations30 Mar 2022 Rui Qian, Weiyao Lin, John See, Dian Li

The major reason is that the positive pairs, i. e., different clips sampled from the same video, have limited temporal receptive field, and usually share similar background but differ in motions.

Action Recognition Contrastive Learning +3

Visual Sound Localization in the Wild by Cross-Modal Interference Erasing

1 code implementation13 Feb 2022 Xian Liu, Rui Qian, Hang Zhou, Di Hu, Weiyao Lin, Ziwei Liu, Bolei Zhou, Xiaowei Zhou

Specifically, we observe that the previous practice of learning only a single audio representation is insufficient due to the additive nature of audio signals.

Class-aware Sounding Objects Localization via Audiovisual Correspondence

1 code implementation22 Dec 2021 Di Hu, Yake Wei, Rui Qian, Weiyao Lin, Ruihua Song, Ji-Rong Wen

To address this problem, we propose a two-stage step-by-step learning framework to localize and recognize sounding objects in complex audiovisual scenarios using only the correspondence between audio and vision.

object-detection Object Detection +2

Exploring Temporal Granularity in Self-Supervised Video Representation Learning

no code implementations8 Dec 2021 Rui Qian, Yeqing Li, Liangzhe Yuan, Boqing Gong, Ting Liu, Matthew Brown, Serge Belongie, Ming-Hsuan Yang, Hartwig Adam, Yin Cui

The training objective consists of two parts: a fine-grained temporal learning objective to maximize the similarity between corresponding temporal embeddings in the short clip and the long clip, and a persistent temporal learning objective to pull together global embeddings of the two clips.

Representation Learning Self-Supervised Learning

Motion-aware Contrastive Video Representation Learning via Foreground-background Merging

1 code implementation CVPR 2022 Shuangrui Ding, Maomao Li, Tianyu Yang, Rui Qian, Haohang Xu, Qingyi Chen, Jue Wang, Hongkai Xiong

To alleviate such bias, we propose \textbf{F}oreground-b\textbf{a}ckground \textbf{Me}rging (FAME) to deliberately compose the moving foreground region of the selected video onto the static background of others.

Action Recognition Contrastive Learning +1

Revisiting 3D ResNets for Video Recognition

3 code implementations3 Sep 2021 Xianzhi Du, Yeqing Li, Yin Cui, Rui Qian, Jing Li, Irwan Bello

A recent work from Bello shows that training and scaling strategies may be more significant than model architectures for visual recognition.

Action Classification Contrastive Learning +1

TA2N: Two-Stage Action Alignment Network for Few-shot Action Recognition

1 code implementation10 Jul 2021 Shuyuan Li, Huabin Liu, Rui Qian, Yuxi Li, John See, Mengjuan Fei, Xiaoyuan Yu, Weiyao Lin

The first stage locates the action by learning a temporal affine transform, which warps each video feature to its action duration while dismissing the action-irrelevant feature (e. g. background).

Few-Shot action recognition Few Shot Action Recognition +2

3D Object Detection for Autonomous Driving: A Survey

1 code implementation21 Jun 2021 Rui Qian, Xin Lai, Xirong Li

Autonomous driving is regarded as one of the most promising remedies to shield human beings from severe crashes.

3D Object Detection Autonomous Driving +3

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

2 code implementations NeurIPS 2021 Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, Boqing Gong

We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval.

Ranked #9 on Action Classification on Moments in Time (using extra training data)

Action Classification Action Recognition In Videos +8

BADet: Boundary-Aware 3D Object Detection from Point Clouds

1 code implementation21 Apr 2021 Rui Qian, Xin Lai, Xirong Li

Specifically, instead of refining each proposal independently as previous works do, we represent each proposal as a node for graph construction within a given cut-off threshold, associating proposals in the form of local neighborhood graph, with boundary correlations of an object being explicitly exploited.

3D Object Detection graph construction +2

Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching

1 code implementation NeurIPS 2020 Di Hu, Rui Qian, Minyue Jiang, Xiao Tan, Shilei Wen, Errui Ding, Weiyao Lin, Dejing Dou

First, we propose to learn robust object representations by aggregating the candidate sound localization results in the single source scenes.

Object Localization

Finding Action Tubes with a Sparse-to-Dense Framework

no code implementations30 Aug 2020 Yuxi Li, Weiyao Lin, Tao Wang, John See, Rui Qian, Ning Xu, Li-Min Wang, Shugong Xu

The task of spatial-temporal action detection has attracted increasing attention among researchers.

Ranked #3 on Action Detection on UCF Sports (Video-mAP 0.2 metric)

Action Detection

Spatiotemporal Contrastive Video Representation Learning

4 code implementations CVPR 2021 Rui Qian, Tianjian Meng, Boqing Gong, Ming-Hsuan Yang, Huisheng Wang, Serge Belongie, Yin Cui

Our representations are learned using a contrastive loss, where two augmented clips from the same short video are pulled together in the embedding space, while clips from different videos are pushed away.

Contrastive Learning Data Augmentation +4

Multiple Sound Sources Localization from Coarse to Fine

1 code implementation ECCV 2020 Rui Qian, Di Hu, Heinrich Dinkel, Mengyue Wu, Ning Xu, Weiyao Lin

How to visually localize multiple sound sources in unconstrained videos is a formidable problem, especially when lack of the pairwise sound-object annotations.

Human in Events: A Large-Scale Benchmark for Human-centric Video Analysis in Complex Events

no code implementations9 May 2020 Weiyao Lin, Huabin Liu, Shizhan Liu, Yuxi Li, Rui Qian, Tao Wang, Ning Xu, Hongkai Xiong, Guo-Jun Qi, Nicu Sebe

To this end, we present a new large-scale dataset with comprehensive annotations, named Human-in-Events or HiEve (Human-centric video analysis in complex Events), for the understanding of human motions, poses, and actions in a variety of realistic events, especially in crowd & complex events.

Action Recognition Pose Estimation

ATRW: A Benchmark for Amur Tiger Re-identification in the Wild

1 code implementation13 Jun 2019 Shuyuan Li, Jianguo Li, Hanlin Tang, Rui Qian, Weiyao Lin

This paper tries to fill the gap by introducing a novel large-scale dataset, the Amur Tiger Re-identification in the Wild (ATRW) dataset.

Attentive Generative Adversarial Network for Raindrop Removal from a Single Image

3 code implementations CVPR 2018 Rui Qian, Robby T. Tan, Wenhan Yang, Jiajun Su, Jiaying Liu

This injection of visual attention to both generative and discriminative networks is the main contribution of this paper.

Rain Removal

Cannot find the paper you are looking for? You can Submit a new open access paper.