Search Results for author: LiMin Wang

Found 61 papers, 42 papers with code

Unmasked Teacher: Towards Training-Efficient Video Foundation Models

no code implementations • 28 Mar 2023 • Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, LiMin Wang, Yu Qiao

Previous VFMs rely on Image Foundation Models (IFMs), which face challenges in transferring to the video domain.

CycleACR: Cycle Modeling of Actor-Context Relations for Video Action Detection

no code implementations • 28 Mar 2023 • Lei Chen, Zhan Tong, Yibing Song, Gangshan Wu, LiMin Wang

Existing studies model each actor and scene relation to improve action recognition.

LinK: Linear Kernel for LiDAR-based 3D Perception

no code implementations • 28 Mar 2023 • Tao Lu, Xiang Ding, Haisong Liu, Gangshan Wu, LiMin Wang

Extending the success of 2D large kernels to 3D perception is challenging due to (1) the cubically increasing overhead of processing 3D data and (2) the optimization difficulties arising from data scarcity and sparsity.

PDPP: Projected Diffusion for Procedure Planning in Instructional Videos

no code implementations • 26 Mar 2023 • Hanlin Wang, Yilu Wu, Sheng Guo, LiMin Wang

In this sense, we model the whole intermediate action sequence distribution with a diffusion model (PDPP), and thus transform the planning problem to a sampling process from this distribution.
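The sampling idea above (drawing a whole action-sequence plan from a learned distribution) can be sketched with a toy DDPM-style reverse loop. This is an illustrative sketch only: `toy_denoiser` is a hypothetical placeholder, not the trained PDPP model.

```python
import math
import random

def toy_denoiser(x, t):
    # Hypothetical stand-in for a learned noise predictor; simply shrinks the sequence.
    return [xi * 0.1 for xi in x]

def reverse_sample(seq_len, steps=10, seed=0):
    rng = random.Random(seed)
    betas = [0.02] * steps
    # Start from pure Gaussian noise over the whole intermediate action sequence.
    x = [rng.gauss(0.0, 1.0) for _ in range(seq_len)]
    for t in reversed(range(steps)):
        beta = betas[t]
        alpha = 1.0 - beta
        eps = toy_denoiser(x, t)
        # Simplified DDPM mean update (the placeholder denoiser makes exact constants moot).
        x = [(xi - beta * ei) / math.sqrt(alpha) for xi, ei in zip(x, eps)]
        if t > 0:  # re-inject noise at all but the final step
            x = [xi + math.sqrt(beta) * rng.gauss(0.0, 1.0) for xi in x]
    return x

plan = reverse_sample(seq_len=4)
print(len(plan))  # one sampled 4-step plan
```

The point of the sketch is the framing: planning becomes sampling an entire sequence at once, rather than predicting actions autoregressively.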

Learning Optical Flow and Scene Flow with Bidirectional Camera-LiDAR Fusion

1 code implementation • 21 Mar 2023 • Haisong Liu, Tao Lu, Yihui Xu, Jia Liu, LiMin Wang

To fuse dense image features and sparse point features, we propose a learnable operator named bidirectional camera-LiDAR fusion module (Bi-CLFM).

Optical Flow Estimation

CoMAE: Single Model Hybrid Pre-training on Small-Scale RGB-D Datasets

1 code implementation • 13 Feb 2023 • Jiange Yang, Sheng Guo, Gangshan Wu, LiMin Wang

Our CoMAE presents a curriculum learning strategy to unify the two popular self-supervised representation learning algorithms: contrastive learning and masked image modeling.

Contrastive Learning · Representation Learning +1

MixFormer: End-to-End Tracking with Iterative Mixed Attention

1 code implementation • 6 Feb 2023 • Yutao Cui, Cheng Jiang, Gangshan Wu, LiMin Wang

Our core design is to utilize the flexibility of attention operations, and propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration.

Visual Object Tracking
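A minimal sketch of the mixed-attention idea, assuming plain dot-product attention over the concatenation of target and search tokens (the actual MAM is a learned module inside the tracker, with projections and multiple heads):

```python
import math

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def mixed_attention(target_tokens, search_tokens):
    # Concatenate target and search tokens and attend over the joint sequence, so
    # feature extraction and target-search interaction happen in one operation.
    tokens = target_tokens + search_tokens
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(len(q))
                  for k in tokens]
        w = softmax(scores)
        out.append([sum(wi * k[d] for wi, k in zip(w, tokens))
                    for d in range(len(q))])
    return out

mixed = mixed_attention([[1.0, 0.0]], [[0.0, 1.0], [1.0, 1.0]])
print(len(mixed))  # 3 updated tokens: 1 target + 2 search
```

Because every token attends to both streams, no separate correlation step between template and search features is needed.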

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

1 code implementation • 6 Dec 2022 • Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, LiMin Wang, Yu Qiao

Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications.

Action Classification · Contrastive Learning +7

VLG: General Video Recognition with Web Textual Knowledge

1 code implementation • 3 Dec 2022 • Jintao Lin, Zhaoyang Liu, Wenhai Wang, Wayne Wu, LiMin Wang

Our VLG is first pre-trained on video and language datasets to learn a shared feature space, and then devises a flexible bi-modal attention head to collaborate high-level semantic concepts under different settings.

Video Recognition

UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer

1 code implementation • 17 Nov 2022 • Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, LiMin Wang, Yu Qiao

UniFormer has successfully alleviated this issue by unifying convolution and self-attention as a relation aggregator in the transformer format.

Video Understanding

PointTAD: Multi-Label Temporal Action Detection with Learnable Query Points

1 code implementation • 20 Oct 2022 • Jing Tan, Xiaotong Zhao, Xintian Shi, Bin Kang, LiMin Wang

Traditional temporal action detection (TAD) usually handles untrimmed videos with a small number of action instances from a single label (e.g., ActivityNet, THUMOS).

Action Detection · Temporal Action Localization

Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding

no code implementations • 28 Sep 2022 • Fengyuan Shi, Ruopeng Gao, Weilin Huang, LiMin Wang

The sampling module aims to select these informative patches by predicting the offsets with respect to a reference point, while the decoding module works for extracting the grounded object information by performing cross attention between image features and text features.

Visual Grounding

Submission to Generic Event Boundary Detection Challenge@CVPR 2022: Local Context Modeling and Global Boundary Decoding Approach

no code implementations • 30 Jun 2022 • Jiaqi Tang, Zhaoyang Liu, Jing Tan, Chen Qian, Wayne Wu, LiMin Wang

Local context modeling sub-network is proposed to perceive diverse patterns of generic event boundaries, and it generates powerful video representations and reliable boundary confidence.

Boundary Detection · Video Understanding

Cross-Architecture Self-supervised Video Representation Learning

1 code implementation • CVPR 2022 • Sheng Guo, Zihua Xiong, Yujie Zhong, LiMin Wang, Xiaobo Guo, Bing Han, Weilin Huang

In this paper, we present a new cross-architecture contrastive learning (CACL) framework for self-supervised video representation learning.

Action Recognition · Contrastive Learning +4

BasicTAD: an Astounding RGB-Only Baseline for Temporal Action Detection

2 code implementations • 5 May 2022 • Min Yang, Guo Chen, Yin-Dong Zheng, Tong Lu, LiMin Wang

Empirical results demonstrate that our PlusTAD is very efficient and significantly outperforms previous methods on the THUMOS14 and FineAction datasets.

Action Detection · object-detection +3

APP-Net: Auxiliary-point-based Push and Pull Operations for Efficient Point Cloud Classification

no code implementations • 2 May 2022 • Tao Lu, Chunxu Liu, Youxin Chen, Gangshan Wu, LiMin Wang

In the existing work, each point in the cloud may inevitably be selected as the neighbors of multiple aggregation centers, as all centers will gather neighbor features from the whole point cloud independently.

3D Classification · 3D Point Cloud Classification +1

Joint-Modal Label Denoising for Weakly-Supervised Audio-Visual Video Parsing

1 code implementation • 25 Apr 2022 • Haoyue Cheng, Zhaoyang Liu, Hang Zhou, Chen Qian, Wayne Wu, LiMin Wang

This paper focuses on the weakly-supervised audio-visual video parsing task, which aims to recognize all events belonging to each modality and localize their temporal boundaries.


Logit Normalization for Long-tail Object Detection

no code implementations • 31 Mar 2022 • Liang Zhao, Yao Teng, LiMin Wang

Real-world data exhibiting skewed distributions pose a serious challenge to existing object detectors.

object-detection · Object Detection
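As a rough illustration of the general logit-normalization idea (not necessarily this paper's exact formulation), classification logits can be rescaled to a fixed norm so that sheer magnitude, which tends to favor head classes under skewed data, cannot dominate the prediction:

```python
import math

def normalize_logits(logits, tau=1.0):
    # Rescale the logit vector to unit L2 norm (divided by a temperature tau),
    # removing the magnitude bias that head classes accumulate during training.
    norm = math.sqrt(sum(z * z for z in logits)) or 1.0
    return [z / (tau * norm) for z in logits]

print(normalize_logits([3.0, 4.0]))  # [0.6, 0.8]
```

After normalization, only the direction of the logit vector matters, so tail-class scores are compared on an equal footing with head-class scores.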

AdaMixer: A Fast-Converging Query-Based Object Detector

2 code implementations • CVPR 2022 • Ziteng Gao, LiMin Wang, Bing Han, Sheng Guo

The recent query-based object detectors break this convention by decoding image features with a set of learnable queries.

Object Detection

Task-specific Inconsistency Alignment for Domain Adaptive Object Detection

1 code implementation • CVPR 2022 • Liang Zhao, LiMin Wang

To address this issue, in this paper, we propose Task-specific Inconsistency Alignment (TIA), by developing a new alignment mechanism in separate task spaces, improving the performance of the detector on both subtasks.

object-detection · Object Detection

MixFormer: End-to-End Tracking with Iterative Mixed Attention

1 code implementation • CVPR 2022 • Yutao Cui, Cheng Jiang, LiMin Wang, Gangshan Wu

Our core design is to utilize the flexibility of attention operations, and propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration.

Semi-Supervised Video Object Segmentation · Visual Object Tracking

Recovering 3D Human Mesh from Monocular Images: A Survey

1 code implementation • 3 Mar 2022 • Yating Tian, Hongwen Zhang, Yebin Liu, LiMin Wang

Since the release of statistical body models, 3D human mesh recovery has been drawing broader attention.

3D human pose and shape estimation · Human Mesh Recovery

Temporal Perceiver: A General Architecture for Arbitrary Boundary Detection

no code implementations • 1 Mar 2022 • Jing Tan, Yuhong Wang, Gangshan Wu, LiMin Wang

Instead, in this paper, we present Temporal Perceiver, a general architecture with Transformer, offering a unified solution to the detection of arbitrary generic boundaries, ranging from shot-level, event-level, to scene-level GBDs.

Avg Boundary Detection +1

OCSampler: Compressing Videos to One Clip with Single-step Sampling

1 code implementation • CVPR 2022 • Jintao Lin, Haodong Duan, Kai Chen, Dahua Lin, LiMin Wang

Recent works prefer to formulate frame sampling as a sequential decision task, selecting frames one by one according to their importance. In contrast, we present a new paradigm that learns instance-specific video condensation policies to select informative frames for representing the entire video in a single step.

Video Recognition
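The single-step condensation policy can be caricatured as scoring all frames once and keeping the top-k in temporal order; `frame_scores` here is a hypothetical stand-in for the learned per-frame informativeness, not the paper's policy network:

```python
def condense_video(frame_scores, clip_len):
    # Rank every frame by its (assumed precomputed) informativeness score and keep
    # the clip_len best, restored to temporal order, instead of picking frames one
    # by one as sequential-decision samplers do.
    ranked = sorted(range(len(frame_scores)),
                    key=lambda i: frame_scores[i], reverse=True)
    return sorted(ranked[:clip_len])

print(condense_video([0.1, 0.9, 0.3, 0.8, 0.2], clip_len=2))  # [1, 3]
```

The single pass is what makes this cheap: the selection cost does not grow with a step-by-step decision process.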

Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection

1 code implementation • CVPR 2022 • Jiaqi Tang, Zhaoyang Liu, Chen Qian, Wayne Wu, LiMin Wang

Generic event boundary detection is an important yet challenging task in video understanding, which aims at detecting the moments where humans naturally perceive event boundaries.

Boundary Detection · Video Understanding

DCAN: Improving Temporal Action Detection via Dual Context Aggregation

1 code implementation • 7 Dec 2021 • Guo Chen, Yin-Dong Zheng, LiMin Wang, Tong Lu

Specifically, we design the Multi-Path Temporal Context Aggregation (MTCA) to achieve smooth context aggregation on boundary level and precise evaluation of boundaries.

Action Detection · Temporal Action Localization

A Closer Look at Few-Shot Video Classification: A New Baseline and Benchmark

1 code implementation • 24 Oct 2021 • Zhenxi Zhu, LiMin Wang, Sheng Guo, Gangshan Wu

In this paper, we aim to present an in-depth study on few-shot video classification by making three contributions.

Classification · Meta-Learning +2

End-to-End Dense Video Grounding via Parallel Regression

no code implementations • 23 Sep 2021 • Fengyuan Shi, LiMin Wang, Weilin Huang

In this paper, we tackle a new problem of dense video grounding, by simultaneously localizing multiple moments with a paragraph as input.

regression · Video Grounding

Mutual Supervision for Dense Object Detection

no code implementations • ICCV 2021 • Ziteng Gao, LiMin Wang, Gangshan Wu

In this paper, we break the convention of the same training samples for these two heads in dense detectors and explore a novel supervisory paradigm, termed as Mutual Supervision (MuSu), to respectively and mutually assign training samples for the classification and regression head to ensure this consistency.

Classification · Dense Object Detection +2

Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding

1 code implementation • 10 Sep 2021 • Zhenzhi Wang, LiMin Wang, Tao Wu, TianHao Li, Gangshan Wu

Instead, from a perspective on temporal grounding as a metric-learning problem, we present a Mutual Matching Network (MMN), to directly model the similarity between language queries and video moments in a joint embedding space.

Metric Learning · Representation Learning +1
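The metric-learning view can be sketched as nearest-neighbor retrieval in the joint embedding space; the vectors below are toy values, not outputs of the trained network, and the other moments play the role of negative samples:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embeddings in the shared space.
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def ground_query(query_emb, moment_embs):
    # Pick the video moment most similar to the language query; during training,
    # the remaining (negative) moments are pushed away in the embedding space.
    sims = [cosine(query_emb, m) for m in moment_embs]
    return max(range(len(sims)), key=lambda i: sims[i])

print(ground_query([1.0, 0.0], [[0.0, 1.0], [0.9, 0.1]]))  # moment 1
```

Grounding then reduces to a similarity ranking, which is why negative moments matter: they shape the metric that the ranking relies on.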

Self Supervision to Distillation for Long-Tailed Visual Recognition

1 code implementation • ICCV 2021 • TianHao Li, LiMin Wang, Gangshan Wu

In this paper, we show that soft label can serve as a powerful solution to incorporate label correlation into a multi-stage training scheme for long-tailed recognition.

Long-tail Learning

Target Adaptive Context Aggregation for Video Scene Graph Generation

1 code implementation • ICCV 2021 • Yao Teng, LiMin Wang, Zhifeng Li, Gangshan Wu

Specifically, we design an efficient method for frame-level VidSGG, termed as Target Adaptive Context Aggregation Network (TRACE), with a focus on capturing spatio-temporal context information for relation recognition.

Association · Graph Generation +1

Structured Sparse R-CNN for Direct Scene Graph Generation

1 code implementation • CVPR 2022 • Yao Teng, LiMin Wang

The key to our method is a set of learnable triplet queries and a structured triplet detector which could be jointly optimized from the training set in an end-to-end manner.

graph construction · Graph Generation +4

CGA-Net: Category Guided Aggregation for Point Cloud Semantic Segmentation

1 code implementation • CVPR 2021 • Tao Lu, LiMin Wang, Gangshan Wu

Previous point cloud semantic segmentation networks use the same process to aggregate features from neighbors of the same category and different categories.

Semantic Segmentation

Joint Landmark and Structure Learning for Automatic Evaluation of Developmental Dysplasia of the Hip

no code implementations • 10 Jun 2021 • Xindi Hu, LiMin Wang, Xin Yang, Xu Zhou, Wufeng Xue, Yan Cao, Shengfeng Liu, Yuhao Huang, Shuangping Guo, Ning Shang, Dong Ni, Ning Gu

In this study, we propose a multi-task framework to learn the relationships among landmarks and structures jointly and automatically evaluate DDH.

SADRNet: Self-Aligned Dual Face Regression Networks for Robust 3D Dense Face Alignment and Reconstruction

1 code implementation • 6 Jun 2021 • Zeyu Ruan, Changqing Zou, Longhai Wu, Gangshan Wu, LiMin Wang

Three-dimensional face dense alignment and reconstruction in the wild is a challenging problem as partial facial information is commonly missing in occluded and large pose face images.

3D Face Alignment · 3D Face Reconstruction +3

MGSampler: An Explainable Sampling Strategy for Video Action Recognition

1 code implementation • ICCV 2021 • Yuan Zhi, Zhan Tong, LiMin Wang, Gangshan Wu

First, we present two different motion representations to enable us to efficiently distinguish the motion-salient frames from the background.

Action Recognition · Temporal Action Localization
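One way to read the motion-guided strategy is as inverse-transform sampling over the cumulative motion distribution: frames are picked at evenly spaced quantiles of accumulated motion, so motion-salient regions receive more frames. A sketch under that assumption, with `motion` standing in for the per-frame motion magnitudes:

```python
def motion_guided_sample(motion, n_frames):
    # Build the cumulative (normalized) motion distribution over frames.
    total = sum(motion)
    cum, s = [], 0.0
    for m in motion:
        s += m
        cum.append(s / total)
    # Pick the first frame whose cumulative motion crosses each evenly spaced quantile.
    picks = []
    for k in range(n_frames):
        target = (k + 0.5) / n_frames
        picks.append(next(i for i, c in enumerate(cum) if c >= target))
    return picks

print(motion_guided_sample([0.1, 0.1, 1.0, 1.0, 0.1], n_frames=2))  # [2, 3]
```

Because low-motion stretches contribute little cumulative mass, near-static background frames are skipped, which also makes the selection easy to explain.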

Target Transformed Regression for Accurate Tracking

1 code implementation • 1 Apr 2021 • Yutao Cui, Cheng Jiang, LiMin Wang, Gangshan Wu

Accurate tracking is still a challenging task due to appearance variations, pose and view changes, and geometric deformations of target in videos.

regression · Visual Object Tracking +1

Relaxed Transformer Decoders for Direct Action Proposal Generation

2 code implementations • ICCV 2021 • Jing Tan, Jiaqi Tang, LiMin Wang, Gangshan Wu

Extensive experiments on THUMOS14 and ActivityNet-1.3 benchmarks demonstrate the effectiveness of RTD-Net, on both tasks of temporal action proposal generation and temporal action detection.

Action Detection · Temporal Action Proposal Generation +1

Temporal Difference Networks for Action Recognition

no code implementations • 1 Jan 2021 • LiMin Wang, Bin Ji, Zhan Tong, Gangshan Wu

To mitigate this issue, this paper presents a new video architecture, termed as Temporal Difference Network (TDN), with a focus on capturing multi-scale temporal information for efficient action recognition.

Action Recognition In Videos

TDN: Temporal Difference Networks for Efficient Action Recognition

1 code implementation • CVPR 2021 • LiMin Wang, Zhan Tong, Bin Ji, Gangshan Wu

To mitigate this issue, this paper presents a new video architecture, termed as Temporal Difference Network (TDN), with a focus on capturing multi-scale temporal information for efficient action recognition.

Action Classification · Action Recognition In Videos
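The core temporal-difference operation can be sketched as explicit frame-to-frame subtraction of feature vectors; the paper builds multi-scale modules around this idea, so the snippet below is only the kernel of it:

```python
def temporal_difference(frames):
    # Subtract consecutive frame features to obtain an explicit, cheap motion cue
    # that is fed alongside the appearance features.
    return [[b - a for a, b in zip(f0, f1)]
            for f0, f1 in zip(frames, frames[1:])]

diffs = temporal_difference([[1.0, 2.0], [1.5, 2.0], [3.0, 1.0]])
print(diffs)  # [[0.5, 0.0], [1.5, -1.0]]
```

Unlike optical flow, this difference signal costs only a subtraction per feature, which is what makes the architecture efficient.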

Appearance-and-Relation Networks for Video Classification

1 code implementation • CVPR 2018 • Limin Wang, Wei Li, Wen Li, Luc van Gool

Specifically, SMART blocks decouple the spatiotemporal learning module into an appearance branch for spatial modeling and a relation branch for temporal modeling.

Action Classification · Action Recognition +4

Temporal Segment Networks for Action Recognition in Videos

9 code implementations • 8 May 2017 • Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, Luc van Gool

Furthermore, based on the temporal segment networks, we won the video classification track at the ActivityNet challenge 2016 among 24 teams, which demonstrates the effectiveness of TSN and the proposed good practices.

Ranked #18 on Action Classification on Moments in Time (Top 5 Accuracy metric)

Action Classification · Action Recognition In Videos +2

UntrimmedNets for Weakly Supervised Action Recognition and Detection

2 code implementations • CVPR 2017 • Limin Wang, Yuanjun Xiong, Dahua Lin, Luc van Gool

We exploit the learned models for action recognition (WSR) and detection (WSD) on the untrimmed video datasets of THUMOS14 and ActivityNet.

Weakly Supervised Action Localization · Weakly-Supervised Action Recognition

Knowledge Guided Disambiguation for Large-Scale Scene Classification with Multi-Resolution CNNs

2 code implementations • 4 Oct 2016 • Limin Wang, Sheng Guo, Weilin Huang, Yuanjun Xiong, Yu Qiao

Convolutional Neural Networks (CNNs) have made remarkable progress on scene recognition, partially due to these recent large-scale scene datasets, such as the Places and Places2.

General Classification · Scene Classification +1

Transferring Object-Scene Convolutional Neural Networks for Event Recognition in Still Images

no code implementations • 1 Sep 2016 • Limin Wang, Zhe Wang, Yu Qiao, Luc van Gool

These newly designed transferring techniques exploit multi-task learning frameworks to incorporate extra knowledge from other networks and additional datasets into the training procedure of event CNNs.

Multi-Task Learning

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

19 code implementations • 2 Aug 2016 • Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, Luc van Gool

The other contribution is our study on a series of good practices in learning ConvNets on video data with the help of temporal segment network.

Action Classification · Action Recognition In Videos +2
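The segment-based sparse sampling at the heart of the temporal segment network can be sketched as dividing the video into equal segments and drawing one snippet index from each, so the whole duration is covered at a fixed, low cost:

```python
import random

def segment_sample(num_frames, num_segments, seed=0):
    # TSN-style sparse sampling: split the video into num_segments equal parts
    # and draw one random snippet index from each part.
    rng = random.Random(seed)
    seg_len = num_frames // num_segments
    return [seg_len * k + rng.randrange(seg_len) for k in range(num_segments)]

idx = segment_sample(num_frames=90, num_segments=3)
print(idx)  # one frame index per segment: within [0,30), [30,60), [60,90)
```

The per-segment predictions are then aggregated (e.g., averaged) into a video-level prediction, which is what lets a 2D network model long-range temporal structure without densely processing every frame.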

Better Exploiting OS-CNNs for Better Event Recognition in Images

no code implementations • 14 Oct 2015 • Limin Wang, Zhe Wang, Sheng Guo, Yu Qiao

Event recognition from still images is one of the most important problems for image understanding.

Object Recognition · Scene Recognition

Places205-VGGNet Models for Scene Recognition

1 code implementation • 7 Aug 2015 • Limin Wang, Sheng Guo, Weilin Huang, Yu Qiao

We verify the performance of trained Places205-VGGNet models on three datasets: MIT67, SUN397, and Places205.

Object Recognition · Scene Recognition

Towards Good Practices for Very Deep Two-Stream ConvNets

5 code implementations • 8 Jul 2015 • Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao

However, for action recognition in videos, the improvement of deep convolutional networks is not so evident.

Action Recognition In Videos · Data Augmentation +1

Object-Scene Convolutional Neural Networks for Event Recognition in Images

no code implementations • 2 May 2015 • Limin Wang, Zhe Wang, Wenbin Du, Yu Qiao

Meanwhile, we investigate different network architectures for OS-CNN design, and adapt the deep (AlexNet) and very-deep (GoogLeNet) networks to the task of event recognition.
