Search Results for author: Shilei Wen

Found 34 papers, 19 papers with code

RSPNet: Relative Speed Perception for Unsupervised Video Representation Learning

1 code implementation • 27 Oct 2020 • Peihao Chen, Deng Huang, Dongliang He, Xiang Long, Runhao Zeng, Shilei Wen, Mingkui Tan, Chuang Gan

We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only, which can be reused for downstream tasks such as action recognition.

Representation Learning Self-Supervised Action Recognition +1
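The relative-speed pretext task can be illustrated with a minimal data-side sketch (all function names here are hypothetical, not from the paper's code): sample two clips from the same unlabeled video at different playback speeds and label the pair by which clip plays faster, so a network can be trained on that label without any human annotation.

```python
import numpy as np

def sample_clip(video, speed, length=8, start=0):
    """Sample `length` frames from `video` at a given playback speed,
    modeled here simply as the temporal stride between frames."""
    idx = start + speed * np.arange(length)
    return video[idx]

def relative_speed_pair(video, rng):
    """Build one training pair for a relative-speed pretext task:
    two clips at different speeds, labeled by which is faster
    (0 -> first clip faster, 1 -> second clip faster)."""
    s1, s2 = rng.choice([1, 2, 4], size=2, replace=False)
    clip1 = sample_clip(video, int(s1))
    clip2 = sample_clip(video, int(s2))
    label = 0 if s1 > s2 else 1
    return clip1, clip2, label

video = np.arange(64)            # stand-in for a 64-frame video
rng = np.random.default_rng(0)
c1, c2, y = relative_speed_pair(video, rng)
```

In a real pipeline the clips would be frame tensors fed to a video backbone; the point of the sketch is that the supervision signal (the speed comparison) comes for free from how the clips are sampled.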

Coherent Loss: A Generic Framework for Stable Video Segmentation

no code implementations • 25 Oct 2020 • Mingyang Qian, Yi Fu, Xiao Tan, YingYing Li, Jinqing Qi, Huchuan Lu, Shilei Wen, Errui Ding

Video segmentation approaches are of great importance for numerous vision tasks, especially in video manipulation for entertainment.

Semantic Segmentation Video Segmentation +1

Discriminative Sounding Objects Localization via Self-supervised Audiovisual Matching

1 code implementation • NeurIPS 2020 • Di Hu, Rui Qian, Minyue Jiang, Xiao Tan, Shilei Wen, Errui Ding, Weiyao Lin, Dejing Dou

First, we propose to learn robust object representations by aggregating the candidate sound localization results in the single source scenes.

Object Localization

PP-YOLO: An Effective and Efficient Implementation of Object Detector

5 code implementations • 23 Jul 2020 • Xiang Long, Kaipeng Deng, Guanzhong Wang, Yang Zhang, Qingqing Dang, Yuan Gao, Hui Shen, Jianguo Ren, Shumin Han, Errui Ding, Shilei Wen

We mainly try to combine various existing tricks that add almost no model parameters or FLOPs, with the goal of improving the detector's accuracy as much as possible while keeping its speed almost unchanged.

Object Detection

Graph-PCNN: Two Stage Human Pose Estimation with Graph Pose Refinement

no code implementations • ECCV 2020 • Jian Wang, Xiang Long, Yuan Gao, Errui Ding, Shilei Wen

In the first stage, a heatmap regression network is applied to obtain a rough localization result, and a set of proposal keypoints, called guided points, is sampled.

Pose Estimation
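The first-stage step described above, rough localization from a heatmap plus sampling of proposal keypoints, can be sketched as follows (function names and the top-k sampling strategy are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np

def rough_keypoint(heatmap):
    """First-stage rough localization: the argmax of a keypoint heatmap."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return y, x

def sample_guided_points(heatmap, k=5):
    """Sample the k highest-scoring heatmap locations as proposal
    keypoints ('guided points') to hand to a second refinement stage."""
    flat = np.argsort(heatmap, axis=None)[::-1][:k]
    ys, xs = np.unravel_index(flat, heatmap.shape)
    return list(zip(ys.tolist(), xs.tolist()))

h = np.zeros((16, 16))
h[5, 7] = 1.0                    # synthetic heatmap peak
h[5, 8] = 0.8                    # a nearby, weaker response
points = sample_guided_points(h, k=2)
```

A second-stage network (in Graph-PCNN, one with graph pose refinement) would then regress refined coordinates from features pooled at these guided points.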

PointTrack++ for Effective Online Multi-Object Tracking and Segmentation

1 code implementation • 3 Jul 2020 • Zhenbo Xu, Wei Zhang, Xiao Tan, Wei Yang, Xiangbo Su, Yuchen Yuan, Hongwu Zhang, Shilei Wen, Errui Ding, Liusheng Huang

In this work, we present PointTrack++, an effective on-line framework for MOTS, which remarkably extends our recently proposed PointTrack framework.

Data Augmentation Instance Segmentation +5

Segment as Points for Efficient Online Multi-Object Tracking and Segmentation

1 code implementation • ECCV 2020 • Zhenbo Xu, Wei Zhang, Xiao Tan, Wei Yang, Huan Huang, Shilei Wen, Errui Ding, Liusheng Huang

The resulting online MOTS framework, named PointTrack, surpasses all state-of-the-art methods, including 3D tracking methods, by large margins (5.4% higher MOTSA and 18 times faster than MOTSFusion) at near real-time speed (22 FPS).

Multi-Object Tracking Multi-Object Tracking and Segmentation +1

ZoomNet: Part-Aware Adaptive Zooming Neural Network for 3D Object Detection

1 code implementation • 1 Mar 2020 • Zhenbo Xu, Wei Zhang, Xiaoqing Ye, Xiao Tan, Wei Yang, Shilei Wen, Errui Ding, Ajin Meng, Liusheng Huang

The pipeline of ZoomNet begins with an ordinary 2D object detection model which is used to obtain pairs of left-right bounding boxes.

2D Object Detection 3D Object Detection +2

Dynamic Inference: A New Approach Toward Efficient Video Action Recognition

no code implementations • 9 Feb 2020 • Wenhao Wu, Dongliang He, Xiao Tan, Shifeng Chen, Yi Yang, Shilei Wen

In a nutshell, we treat the input frames and network depth of the computational graph as a 2-dimensional grid, and place several checkpoints on this grid in advance, each with a prediction module.

Action Recognition Action Recognition In Videos +1
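The checkpoint idea above amounts to an early-exit scheme: evaluate cheap checkpoints (few frames, shallow depth) first and stop as soon as one is confident enough. A minimal sketch, with mocked confidence scores standing in for the checkpoints' prediction modules (the name `dynamic_inference` and the threshold rule are illustrative assumptions):

```python
def dynamic_inference(checkpoint_confidences, threshold=0.9):
    """Walk checkpoints in order of increasing cost; exit at the first
    one whose confidence clears the threshold. Returns the index of
    the exit checkpoint and how many checkpoints were evaluated."""
    for cost, conf in enumerate(checkpoint_confidences, start=1):
        if conf >= threshold:
            return cost - 1, cost
    # No checkpoint was confident enough: fall through to the last one.
    return len(checkpoint_confidences) - 1, len(checkpoint_confidences)

# Confidence tends to rise as more frames and deeper layers are used;
# this sample exits at the third checkpoint instead of running all five.
exit_idx, cost = dynamic_inference([0.4, 0.7, 0.95, 0.99, 0.999])
```

Easy videos thus exit early with low cost, while hard ones traverse more of the (frames, depth) grid.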

Cross-Modality Attention with Semantic Graph Embedding for Multi-Label Classification

no code implementations • 17 Dec 2019 • Renchun You, Zhiyao Guo, Lei Cui, Xiang Long, Yingze Bao, Shilei Wen

In order to overcome these challenges, we propose to use cross-modality attention with semantic graph embedding for multi-label classification.

Classification General Classification +4

Multi-Label Classification with Label Graph Superimposing

2 code implementations • 21 Nov 2019 • Ya Wang, Dongliang He, Fu Li, Xiang Long, Zhichao Zhou, Jinwen Ma, Shilei Wen

In this paper, we propose a label graph superimposing framework to improve the conventional GCN+CNN framework developed for multi-label recognition in the following two aspects.

Classification General Classification +2

Dynamic Instance Normalization for Arbitrary Style Transfer

no code implementations • 16 Nov 2019 • Yongcheng Jing, Xiao Liu, Yukang Ding, Xinchao Wang, Errui Ding, Mingli Song, Shilei Wen

Prior normalization methods rely on affine transformations to produce arbitrary image style transfers, whose parameters are computed in a pre-defined way.

Style Transfer
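The pre-defined affine transfer that the paper contrasts itself with is typified by adaptive instance normalization (AdaIN): normalize content features per channel, then rescale them to the style features' channel statistics. A minimal sketch (this illustrates the prior approach, not the paper's dynamic variant):

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive instance normalization over (channels, H, W) features:
    the affine parameters are simply the style's per-channel mean/std,
    i.e. they are computed in a pre-defined way."""
    c_mu = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True)
    s_mu = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True)
    return s_std * (content - c_mu) / (c_std + eps) + s_mu

rng = np.random.default_rng(0)
c = rng.normal(size=(3, 8, 8))                       # content features
s = rng.normal(loc=2.0, scale=3.0, size=(3, 8, 8))   # style features
out = adain(c, s)
```

Dynamic instance normalization, by contrast, generates the transformation from the style input with a learned network rather than fixing it to these statistics.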

TruNet: Short Videos Generation from Long Videos via Story-Preserving Truncation

no code implementations • 14 Oct 2019 • Fan Yang, Xiao Liu, Dongliang He, Chuang Gan, Jian Wang, Chao Li, Fu Li, Shilei Wen

In this work, we introduce a new problem, named story-preserving long video truncation, which requires an algorithm to automatically truncate a long-duration video into multiple short and attractive sub-videos, each containing an unbroken story.

Video Summarization

Perspective-Guided Convolution Networks for Crowd Counting

1 code implementation • ICCV 2019 • Zhaoyi Yan, Yuchen Yuan, WangMeng Zuo, Xiao Tan, Yezhen Wang, Shilei Wen, Errui Ding

In this paper, we propose a novel perspective-guided convolution (PGC) for convolutional neural network (CNN) based crowd counting (i.e., PGCNet), which aims to overcome the dramatic intra-scene scale variations of people due to the perspective effect.

Crowd Counting

Image Inpainting with Learnable Bidirectional Attention Maps

1 code implementation • ICCV 2019 • Chaohao Xie, Shaohui Liu, Chao Li, Ming-Ming Cheng, WangMeng Zuo, Xiao Liu, Shilei Wen, Errui Ding

Most convolutional neural network (CNN)-based inpainting methods adopt standard convolution to treat valid pixels and holes indistinguishably, making them limited in handling irregular holes and more likely to generate inpainting results with color discrepancy and blurriness.

Image Inpainting

Deep Concept-wise Temporal Convolutional Networks for Action Localization

2 code implementations • 26 Aug 2019 • Xin Li, Tianwei Lin, Xiao Liu, Chuang Gan, WangMeng Zuo, Chao Li, Xiang Long, Dongliang He, Fu Li, Shilei Wen

In this paper, we empirically find that stacking more conventional temporal convolution layers actually deteriorates action classification performance, possibly because all channels of the 1D feature map, which are generally highly abstract and can be regarded as latent concepts, are excessively recombined in temporal convolution.

Action Classification Action Localization

BMN: Boundary-Matching Network for Temporal Action Proposal Generation

10 code implementations • ICCV 2019 • Tianwei Lin, Xiao Liu, Xin Li, Errui Ding, Shilei Wen

To address these difficulties, we introduce the Boundary-Matching (BM) mechanism to evaluate the confidence scores of densely distributed proposals: it denotes each proposal as a matching pair of starting and ending boundaries and combines all densely distributed BM pairs into the BM confidence map.

Action Detection Action Recognition +1
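The BM confidence map can be pictured as a 2D array indexed by (duration, start) that scores every densely enumerated proposal at once. A rough sketch of how such a map lines up with a ground-truth segment via temporal IoU, which is the kind of target the network's predicted map is trained against (function names and the exact indexing convention are illustrative assumptions):

```python
import numpy as np

def temporal_iou(s1, e1, s2, e2):
    """Intersection-over-union of two temporal segments [s, e)."""
    inter = max(0.0, min(e1, e2) - max(s1, s2))
    union = (e1 - s1) + (e2 - s2) - inter
    return inter / union if union > 0 else 0.0

def bm_label_map(gt_start, gt_end, T):
    """Build a (duration, start) map over densely enumerated proposals:
    entry [d, s] scores the proposal [s, s + d + 1) against the
    ground-truth segment. Entries with s + d + 1 > T stay zero."""
    m = np.zeros((T, T))
    for s in range(T):
        for d in range(T - s):
            m[d, s] = temporal_iou(s, s + d + 1, gt_start, gt_end)
    return m

m = bm_label_map(gt_start=3, gt_end=7, T=10)
best = np.unravel_index(np.argmax(m), m.shape)   # proposal matching the GT
```

Because every (start, duration) pair is scored in one map, confidence evaluation for all proposals becomes a single dense prediction rather than per-proposal scoring.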

Adapting Image Super-Resolution State-of-the-arts and Learning Multi-model Ensemble for Video Super-Resolution

no code implementations • 7 May 2019 • Chao Li, Dongliang He, Xiao Liu, Yukang Ding, Shilei Wen

Recently, image super-resolution has been widely studied and achieved significant progress by leveraging the power of deep convolutional neural networks.

Image Super-Resolution Video Super-Resolution

STGAN: A Unified Selective Transfer Network for Arbitrary Image Attribute Editing

3 code implementations • CVPR 2019 • Ming Liu, Yukang Ding, Min Xia, Xiao Liu, Errui Ding, WangMeng Zuo, Shilei Wen

Arbitrary attribute editing generally can be tackled by incorporating encoder-decoder and generative adversarial networks.

Translation

Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos

1 code implementation • 21 Jan 2019 • Dongliang He, Xiang Zhao, Jizhou Huang, Fu Li, Xiao Liu, Shilei Wen

The task of video grounding, which temporally localizes a natural language description in a video, plays an important role in understanding videos.

Decision Making Multi-Task Learning

StNet: Local and Global Spatial-Temporal Modeling for Action Recognition

3 code implementations • 5 Nov 2018 • Dongliang He, Zhichao Zhou, Chuang Gan, Fu Li, Xiao Liu, Yandong Li, Li-Min Wang, Shilei Wen

In this paper, in contrast to the existing CNN+RNN or pure 3D convolution based approaches, we explore a novel spatial-temporal network (StNet) architecture for both local and global spatial-temporal modeling in videos.

Action Recognition

Exploiting Spatial-Temporal Modelling and Multi-Modal Fusion for Human Action Recognition

no code implementations • 27 Jun 2018 • Dongliang He, Fu Li, Qijie Zhao, Xiang Long, Yi Fu, Shilei Wen

In this challenge, we propose a spatial-temporal network (StNet) for better joint spatial-temporal modelling and comprehensive video understanding.

Action Recognition Video Understanding

Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification

2 code implementations • CVPR 2018 • Xiang Long, Chuang Gan, Gerard de Melo, Jiajun Wu, Xiao Liu, Shilei Wen

In this paper, however, we show that temporal information, especially longer-term patterns, may not be necessary to achieve competitive results on common video classification datasets.

Classification General Classification +1
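Purely attention-based local feature integration of this kind can be sketched in a few lines: each attention unit ("cluster") scores every local feature with its own weight vector, softmax-normalizes the scores, and takes a weighted sum; the per-unit outputs are concatenated. A minimal sketch with random stand-ins (the function names and the single-weight-vector scoring are simplifying assumptions, not the paper's exact formulation):

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention_cluster(features, w_list):
    """Integrate a set of local feature vectors with several attention
    units: each unit produces one attention-weighted sum over the
    features, and the unit outputs are concatenated."""
    outputs = []
    for w in w_list:
        scores = softmax(features @ w)     # one weight per local feature
        outputs.append(scores @ features)  # weighted sum over features
    return np.concatenate(outputs)

rng = np.random.default_rng(0)
feats = rng.normal(size=(6, 4))            # 6 local features, dim 4
out = attention_cluster(feats, [rng.normal(size=4) for _ in range(3)])
```

Note that nothing in this aggregation depends on the temporal order of the features, which is exactly the point the abstract makes about longer-term temporal patterns.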

Revisiting the Effectiveness of Off-the-shelf Temporal Modeling Approaches for Large-scale Video Classification

no code implementations • 12 Aug 2017 • Yunlong Bian, Chuang Gan, Xiao Liu, Fu Li, Xiang Long, Yandong Li, Heng Qi, Jie Zhou, Shilei Wen, Yuanqing Lin

Experiment results on the challenging Kinetics dataset demonstrate that our proposed temporal modeling approaches can significantly improve existing approaches in the large-scale video recognition tasks.

Action Classification Fine-tuning +3

Deep Metric Learning with Angular Loss

1 code implementation • ICCV 2017 • Jian Wang, Feng Zhou, Shilei Wen, Xiao Liu, Yuanqing Lin

The modern image search system requires semantic understanding of images, and a key yet under-addressed problem is to learn a good metric for measuring the similarity between images.

Image Retrieval Metric Learning
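One common form of the angular loss constrains the angle at the negative point of the (anchor, positive, negative) triangle, penalizing triplets where the negative sits too close to the anchor-positive pair. A numpy sketch of that formulation, with illustrative example points:

```python
import numpy as np

def angular_loss(anchor, positive, negative, alpha_deg=45.0):
    """Angular loss for one triplet:
        [ ||a - p||^2 - 4 tan^2(alpha) ||n - c||^2 ]_+ ,
    where c = (a + p) / 2 is the midpoint of anchor and positive and
    alpha upper-bounds the angle at the negative point."""
    c = (anchor + positive) / 2.0
    t = np.tan(np.radians(alpha_deg)) ** 2
    val = np.sum((anchor - positive) ** 2) - 4.0 * t * np.sum((negative - c) ** 2)
    return max(0.0, val)

a = np.array([0.0, 0.0])
p = np.array([1.0, 0.0])
n_far = np.array([0.5, 10.0])    # negative far from the anchor-positive pair
n_near = np.array([0.5, 0.1])    # negative crowding the pair
```

Unlike a plain triplet loss with an additive distance margin, the angle-based constraint is invariant to the overall scale of the embedding, which is one motivation the paper gives for the formulation.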

Temporal Modeling Approaches for Large-scale Youtube-8M Video Understanding

1 code implementation • 14 Jul 2017 • Fu Li, Chuang Gan, Xiao Liu, Yunlong Bian, Xiang Long, Yandong Li, Zhichao Li, Jie Zhou, Shilei Wen

This paper describes our solution for the video recognition task of the Google Cloud and YouTube-8M Video Understanding Challenge, in which it ranked 3rd place.

Video Recognition Video Understanding

Dynamic Computational Time for Visual Attention

1 code implementation • 30 Mar 2017 • Zhichao Li, Yi Yang, Xiao Liu, Feng Zhou, Shilei Wen, Wei Xu

We propose a dynamic computational time model to reduce the average processing time of recurrent visual attention (RAM).

Localizing by Describing: Attribute-Guided Attention Localization for Fine-Grained Recognition

no code implementations • 20 May 2016 • Xiao Liu, Jiang Wang, Shilei Wen, Errui Ding, Yuanqing Lin

By designing a novel reward strategy, we are able to learn, with a reinforcement learning algorithm, to locate regions that are spatially and semantically distinctive.
