no code implementations • 28 Mar 2023 • Tao Wu, Mengqi Cao, Ziteng Gao, Gangshan Wu, LiMin Wang
STMixer is based on two core designs.
no code implementations • 28 Mar 2023 • Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, LiMin Wang, Yu Qiao
Previous VFMs rely on Image Foundation Models (IFMs), which face challenges in transferring to the video domain.
no code implementations • 28 Mar 2023 • Lei Chen, Zhan Tong, Yibing Song, Gangshan Wu, LiMin Wang
Existing studies model each actor and scene relation to improve action recognition.
no code implementations • 28 Mar 2023 • Tao Lu, Xiang Ding, Haisong Liu, Gangshan Wu, LiMin Wang
Extending the success of 2D large kernels to 3D perception is challenging due to: (1) the cubically increasing overhead of processing 3D data; and (2) optimization difficulties arising from data scarcity and sparsity.
no code implementations • 26 Mar 2023 • Hanlin Wang, Yilu Wu, Sheng Guo, LiMin Wang
In this sense, we model the whole intermediate action sequence distribution with a diffusion model (PDPP), and thus transform the planning problem into a sampling process from this distribution.
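As a rough illustration of treating planning as sampling, the sketch below runs ancestral diffusion sampling over a discrete action sequence conditioned on task features. It is not the actual PDPP code; the network, horizon, action vocabulary, and noise schedule are all illustrative assumptions.

```python
# Minimal sketch (not the official PDPP implementation): procedure planning as sampling
# an action sequence from a learned diffusion model, conditioned on task features.
import torch
import torch.nn as nn

class PlanDenoiser(nn.Module):
    """Predicts the noise added to a (horizon x num_actions) plan, given a condition."""
    def __init__(self, horizon=3, num_actions=105, cond_dim=512, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(horizon * num_actions + cond_dim + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, horizon * num_actions),
        )
        self.horizon, self.num_actions = horizon, num_actions

    def forward(self, noisy_plan, t, cond):
        x = torch.cat([noisy_plan.flatten(1), cond, t.float().unsqueeze(1)], dim=1)
        return self.net(x).view(-1, self.horizon, self.num_actions)

@torch.no_grad()
def sample_plan(model, cond, steps=50):
    """Ancestral sampling: start from Gaussian noise and iteratively denoise."""
    plan = torch.randn(cond.size(0), model.horizon, model.num_actions)
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    for t in reversed(range(steps)):
        eps = model(plan, torch.full((cond.size(0),), t), cond)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        plan = (plan - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:
            plan = plan + torch.sqrt(betas[t]) * torch.randn_like(plan)
    return plan.argmax(dim=-1)  # discrete action index per plan step
```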
1 code implementation • 21 Mar 2023 • Haisong Liu, Tao Lu, Yihui Xu, Jia Liu, LiMin Wang
To fuse dense image features and sparse point features, we propose a learnable operator named bidirectional camera-LiDAR fusion module (Bi-CLFM).
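A generic sketch of such bidirectional camera-LiDAR fusion is shown below: image features are bilinearly sampled at each point's 2D projection for the point branch, and point features are splatted back onto the image grid for the image branch. This is an assumption-laden illustration, not the actual Bi-CLFM design; the projection inputs, splatting scheme, and dimensions are hypothetical.

```python
# Minimal sketch (not the official Bi-CLFM): bidirectional fusion between a dense image
# feature map and sparse point features, assuming known 2D projections of the points.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalFusion(nn.Module):
    def __init__(self, img_dim=64, pts_dim=64):
        super().__init__()
        self.img_to_pts = nn.Linear(img_dim, pts_dim)   # image -> point branch
        self.pts_to_img = nn.Linear(pts_dim, img_dim)   # point -> image branch

    def forward(self, img_feat, pts_feat, uv):
        """
        img_feat: (B, C_img, H, W) dense image features
        pts_feat: (B, N, C_pts)    sparse point features
        uv:       (B, N, 2)        point projections, normalized to [-1, 1]
        """
        # Image -> points: bilinearly sample image features at projected locations.
        sampled = F.grid_sample(img_feat, uv.unsqueeze(2), align_corners=False)
        sampled = sampled.squeeze(-1).transpose(1, 2)            # (B, N, C_img)
        pts_out = pts_feat + self.img_to_pts(sampled)

        # Points -> image: splat point features onto their nearest image cell.
        B, C_img, H, W = img_feat.shape
        canvas = torch.zeros(B, H * W, pts_feat.size(-1), device=img_feat.device)
        ix = ((uv[..., 0] + 1) / 2 * (W - 1)).round().long().clamp(0, W - 1)
        iy = ((uv[..., 1] + 1) / 2 * (H - 1)).round().long().clamp(0, H - 1)
        idx = (iy * W + ix).unsqueeze(-1).expand(-1, -1, pts_feat.size(-1))
        canvas.scatter_add_(1, idx, pts_feat)
        splat = canvas.view(B, H, W, -1)                          # (B, H, W, C_pts)
        img_out = img_feat + self.pts_to_img(splat).permute(0, 3, 1, 2)
        return img_out, pts_out
```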
1 code implementation • 1 Mar 2023 • Guozhen Zhang, Yuhan Zhu, Haonan Wang, Youxin Chen, Gangshan Wu, LiMin Wang
In this paper, we propose a novel module to explicitly extract motion and appearance information via a unifying operation.
Ranked #1 on Video Frame Interpolation on UCF101
1 code implementation • 13 Feb 2023 • Jiange Yang, Sheng Guo, Gangshan Wu, LiMin Wang
Our CoMAE presents a curriculum learning strategy to unify the two popular self-supervised representation learning algorithms: contrastive learning and masked image modeling.
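A minimal sketch of such a curriculum, under the assumption that the same encoder is first trained with a contrastive objective and then with a masked-reconstruction objective, might look like the following (loss forms and the stage switch are illustrative, not CoMAE's exact recipe):

```python
# Minimal sketch (not the official CoMAE code): a two-stage curriculum that applies a
# contrastive objective first and a masked-reconstruction objective afterwards.
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Contrastive loss between two augmented views (stage 1)."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

def masked_recon(pred_patches, target_patches, mask):
    """Reconstruction loss on masked patches only (stage 2)."""
    loss = (pred_patches - target_patches) ** 2
    return (loss.mean(dim=-1) * mask).sum() / mask.sum()

def curriculum_loss(epoch, switch_epoch, contrastive_args, mim_args):
    """Stage 1: contrastive pre-training; stage 2: masked image modeling."""
    if epoch < switch_epoch:
        return info_nce(*contrastive_args)
    return masked_recon(*mim_args)
```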
1 code implementation • 6 Feb 2023 • Yutao Cui, Cheng Jiang, Gangshan Wu, LiMin Wang
Our core design is to utilize the flexibility of attention operations, and propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration.
Ranked #1 on Visual Object Tracking on LaSOT
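A toy version of the Mixed Attention Module (MAM) idea described above is sketched below: one attention layer runs over the concatenation of template and search tokens, so feature extraction and target-search information mixing happen in a single operation. The layer layout and dimensions are assumptions, not the paper's exact module.

```python
# Minimal sketch (not the official MAM): joint attention over template and search tokens.
import torch
import torch.nn as nn

class MixedAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, template_tokens, search_tokens):
        # (B, Nt, C) and (B, Ns, C) -> joint sequence (B, Nt + Ns, C)
        tokens = torch.cat([template_tokens, search_tokens], dim=1)
        mixed, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + mixed)
        nt = template_tokens.size(1)
        return tokens[:, :nt], tokens[:, nt:]   # updated template / search features
```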
1 code implementation • 6 Dec 2022 • Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, LiMin Wang, Yu Qiao
Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications.
Ranked #1 on Video Retrieval on VATEX
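One simple way to realize the "learnable coordination" of the two InternVideo representations mentioned above is a learned mixing weight between the masked-modeling features and the contrastive features, sketched below. This is a hypothetical simplification, not InternVideo's actual coordination scheme.

```python
# Minimal sketch (not the official InternVideo code): learnable blending of features from
# a masked-video-modeling encoder and a video-language contrastive encoder.
import torch
import torch.nn as nn

class LearnableCoordination(nn.Module):
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))  # learned mixing coefficient

    def forward(self, feat_mvm, feat_contrastive):
        w = torch.sigmoid(self.alpha)              # starts at 0.5, learned end-to-end
        return w * feat_mvm + (1.0 - w) * feat_contrastive
```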
1 code implementation • 3 Dec 2022 • Jintao Lin, Zhaoyang Liu, Wenhai Wang, Wayne Wu, LiMin Wang
Our VLG is first pre-trained on video and language datasets to learn a shared feature space, and then uses a flexible bi-modal attention head to coordinate high-level semantic concepts under different settings.
1 code implementation • 17 Nov 2022 • Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, LiMin Wang, Yu Qiao
UniFormer has successfully alleviated this issue, by unifying convolution and self-attention as a relation aggregator in the transformer format.
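The sketch below illustrates the idea of one "relation aggregator" interface that is instantiated as local depth-wise convolution in shallow stages and as global self-attention in deep stages. It is a schematic reading of the UniFormer description above, not its actual block; shapes and hyper-parameters are assumptions.

```python
# Minimal sketch (not the official UniFormer block): a relation aggregator that is either
# a local depth-wise convolution (shallow stages) or global self-attention (deep stages).
import torch
import torch.nn as nn

class RelationAggregator(nn.Module):
    def __init__(self, dim, tokens_hw, use_attention):
        super().__init__()
        self.use_attention = use_attention
        self.hw = tokens_hw
        if use_attention:
            self.op = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        else:
            self.op = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x):                     # x: (B, N, C), N = hw * hw
        if self.use_attention:
            out, _ = self.op(x, x, x)         # global token relations
            return x + out
        B, N, C = x.shape
        grid = x.transpose(1, 2).reshape(B, C, self.hw, self.hw)
        out = self.op(grid).flatten(2).transpose(1, 2)   # local token relations
        return x + out
```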
1 code implementation • 17 Nov 2022 • Guo Chen, Sen Xing, Zhe Chen, Yi Wang, Kunchang Li, Yizhuo Li, Yi Liu, Jiahao Wang, Yin-Dong Zheng, Bingkun Huang, Zhiyu Zhao, Junting Pan, Yifei HUANG, Zun Wang, Jiashuo Yu, Yinan He, Hongjie Zhang, Tong Lu, Yali Wang, LiMin Wang, Yu Qiao
In this report, we present our champion solutions to five tracks at Ego4D challenge.
Ranked #1 on State Change Object Detection on Ego4D
no code implementations • 16 Nov 2022 • Yin-Dong Zheng, Guo Chen, Jiahao Wang, Tong Lu, LiMin Wang
Our method achieves an accuracy of 0.796 on OSCC while achieving an absolute temporal localization error of 0.516 on PNR.
Human-Object Interaction Detection, Object State Change Classification, +2
1 code implementation • 20 Oct 2022 • Jing Tan, Xiaotong Zhao, Xintian Shi, Bin Kang, LiMin Wang
Traditional temporal action detection (TAD) usually handles untrimmed videos with a small number of action instances from a single label (e.g., ActivityNet, THUMOS).
Ranked #2 on Temporal Action Localization on MultiTHUMOS
no code implementations • 28 Sep 2022 • Fengyuan Shi, Ruopeng Gao, Weilin Huang, LiMin Wang
The sampling module selects these informative patches by predicting their offsets with respect to a reference point, while the decoding module extracts the grounded object information by performing cross-attention between image features and text features.
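A rough sketch of this sample-then-decode pattern is given below: offsets are predicted relative to a reference point, image features are sampled at the resulting locations, and text tokens attend to the sampled features. Module names, pooling, and dimensions are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch (not the paper's modules): offset-based feature sampling around a
# reference point, followed by cross-attention from text tokens to sampled image features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SampleAndDecode(nn.Module):
    def __init__(self, dim=256, num_points=16):
        super().__init__()
        self.offset_head = nn.Linear(dim, num_points * 2)      # predicts (dx, dy) offsets
        self.cross_attn = nn.MultiheadAttention(dim, 8, batch_first=True)

    def forward(self, img_feat, text_feat, ref_point):
        """
        img_feat:  (B, C, H, W)  dense image features
        text_feat: (B, L, C)     text token features (queries)
        ref_point: (B, 2)        reference point in [-1, 1] coordinates
        """
        query = text_feat.mean(dim=1)                           # pooled text query
        offsets = self.offset_head(query).view(query.size(0), -1, 2)
        locations = (ref_point.unsqueeze(1) + offsets).clamp(-1, 1)
        sampled = F.grid_sample(img_feat, locations.unsqueeze(2), align_corners=False)
        sampled = sampled.squeeze(-1).transpose(1, 2)           # (B, num_points, C)
        decoded, _ = self.cross_attn(text_feat, sampled, sampled)
        return decoded                                          # grounded text features
```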
no code implementations • 30 Jun 2022 • Jiaqi Tang, Zhaoyang Liu, Jing Tan, Chen Qian, Wayne Wu, LiMin Wang
A local context modeling sub-network is proposed to perceive diverse patterns of generic event boundaries, and it generates powerful video representations and reliable boundary confidences.
1 code implementation • CVPR 2022 • Sheng Guo, Zihua Xiong, Yujie Zhong, LiMin Wang, Xiaobo Guo, Bing Han, Weilin Huang
In this paper, we present a new cross-architecture contrastive learning (CACL) framework for self-supervised video representation learning.
2 code implementations • 5 May 2022 • Min Yang, Guo Chen, Yin-Dong Zheng, Tong Lu, LiMin Wang
Empirical results demonstrate that our PlusTAD is very efficient and significantly outperforms the previous methods on the datasets of THUMOS14 and FineAction.
Ranked #1 on Temporal Action Localization on THUMOS14
no code implementations • 2 May 2022 • Tao Lu, Chunxu Liu, Youxin Chen, Gangshan Wu, LiMin Wang
In existing work, each point in the cloud may inevitably be selected as a neighbor of multiple aggregation centers, since all centers gather neighbor features from the whole point cloud independently.
Ranked #26 on 3D Point Cloud Classification on ScanObjectNN
1 code implementation • 25 Apr 2022 • Haoyue Cheng, Zhaoyang Liu, Hang Zhou, Chen Qian, Wayne Wu, LiMin Wang
This paper focuses on the weakly-supervised audio-visual video parsing task, which aims to recognize all events belonging to each modality and localize their temporal boundaries.
no code implementations • 31 Mar 2022 • Liang Zhao, Yao Teng, LiMin Wang
Real-world data exhibiting skewed distributions pose a serious challenge to existing object detectors.
2 code implementations • CVPR 2022 • Ziteng Gao, LiMin Wang, Bing Han, Sheng Guo
The recent query-based object detectors break this convention by decoding image features with a set of learnable queries.
1 code implementation • CVPR 2022 • Liang Zhao, LiMin Wang
To address this issue, in this paper, we propose Task-specific Inconsistency Alignment (TIA), by developing a new alignment mechanism in separate task spaces, improving the performance of the detector on both subtasks.
2 code implementations • 23 Mar 2022 • Zhan Tong, Yibing Song, Jue Wang, LiMin Wang
Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets.
Ranked #3 on Self-Supervised Action Recognition on UCF101
1 code implementation • CVPR 2022 • Yutao Cui, Cheng Jiang, LiMin Wang, Gangshan Wu
Our core design is to utilize the flexibility of attention operations, and propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration.
Ranked #4 on Visual Object Tracking on GOT-10k
Semi-Supervised Video Object Segmentation, Visual Object Tracking
1 code implementation • 3 Mar 2022 • Yating Tian, Hongwen Zhang, Yebin Liu, LiMin Wang
Since the release of statistical body models, 3D human mesh recovery has been drawing broader attention.
no code implementations • 1 Mar 2022 • Jing Tan, Yuhong Wang, Gangshan Wu, LiMin Wang
Instead, in this paper, we present Temporal Perceiver, a general architecture with Transformer, offering a unified solution to the detection of arbitrary generic boundaries, ranging from shot-level, event-level, to scene-level GBDs.
1 code implementation • CVPR 2022 • Jintao Lin, Haodong Duan, Kai Chen, Dahua Lin, LiMin Wang
Recent works prefer to formulate frame sampling as a sequential decision task, selecting frames one by one according to their importance. In contrast, we present a new paradigm that learns instance-specific video condensation policies to select informative frames for representing the entire video in a single step.
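The single-step selection idea above can be caricatured as follows: a lightweight scorer rates every frame at once and the top-k frames form the condensed clip. This is a hypothetical sketch, not the paper's actual policy network or training procedure.

```python
# Minimal sketch (not the paper's implementation): score all frames with a lightweight
# policy and keep the top-k most informative frames in one step.
import torch
import torch.nn as nn

class SingleStepFrameSelector(nn.Module):
    def __init__(self, feat_dim=512, num_keep=8):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, 1)
        self.num_keep = num_keep

    def forward(self, frame_feats):
        """frame_feats: (B, T, C) cheap per-frame features for the whole video."""
        scores = self.scorer(frame_feats).squeeze(-1)            # (B, T) importance scores
        keep = scores.topk(self.num_keep, dim=1).indices.sort(dim=1).values
        idx = keep.unsqueeze(-1).expand(-1, -1, frame_feats.size(-1))
        return frame_feats.gather(1, idx), keep                  # condensed clip + indices
```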
1 code implementation • CVPR 2022 • Jiaqi Tang, Zhaoyang Liu, Chen Qian, Wayne Wu, LiMin Wang
Generic event boundary detection is an important yet challenging task in video understanding, which aims at detecting the moments where humans naturally perceive event boundaries.
1 code implementation • 7 Dec 2021 • Guo Chen, Yin-Dong Zheng, LiMin Wang, Tong Lu
Specifically, we design the Multi-Path Temporal Context Aggregation (MTCA) to achieve smooth context aggregation on boundary level and precise evaluation of boundaries.
Ranked #12 on Temporal Action Localization on THUMOS’14
1 code implementation • 24 Oct 2021 • Zhenxi Zhu, LiMin Wang, Sheng Guo, Gangshan Wu
In this paper, we aim to present an in-depth study on few-shot video classification by making three contributions.
no code implementations • 23 Sep 2021 • Fengyuan Shi, LiMin Wang, Weilin Huang
In this paper, we tackle a new problem of dense video grounding, by simultaneously localizing multiple moments with a paragraph as input.
no code implementations • ICCV 2021 • Ziteng Gao, LiMin Wang, Gangshan Wu
In this paper, we break the convention of using the same training samples for the two heads in dense detectors and explore a novel supervisory paradigm, termed Mutual Supervision (MuSu), which respectively and mutually assigns training samples to the classification and regression heads to ensure this consistency.
1 code implementation • 10 Sep 2021 • Zhenzhi Wang, LiMin Wang, Tao Wu, TianHao Li, Gangshan Wu
Instead, viewing temporal grounding as a metric-learning problem, we present a Mutual Matching Network (MMN) to directly model the similarity between language queries and video moments in a joint embedding space.
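The joint-embedding matching described above boils down to projecting moments and queries into a shared space and scoring pairs by cosine similarity, as in the sketch below (projection layers and dimensions are illustrative, not the actual MMN architecture):

```python
# Minimal sketch (not the official MMN code): score language queries against video
# moments by cosine similarity in a joint embedding space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbeddingMatcher(nn.Module):
    def __init__(self, video_dim=512, text_dim=768, joint_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)

    def forward(self, moment_feats, query_feats):
        """moment_feats: (M, Dv) candidate moments; query_feats: (Q, Dt) queries."""
        v = F.normalize(self.video_proj(moment_feats), dim=-1)
        t = F.normalize(self.text_proj(query_feats), dim=-1)
        return t @ v.t()   # (Q, M) similarity matrix; the highest entry is the best moment
```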
1 code implementation • ICCV 2021 • TianHao Li, LiMin Wang, Gangshan Wu
In this paper, we show that soft label can serve as a powerful solution to incorporate label correlation into a multi-stage training scheme for long-tailed recognition.
Ranked #33 on Long-tail Learning on CIFAR-100-LT (ρ=100)
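One common way to inject soft labels into a multi-stage scheme, which the sketch below illustrates, is to supervise the second stage with a blend of the first-stage soft predictions and the hard ground truth. The temperature and mixing weight are assumptions, not the paper's exact recipe.

```python
# Minimal sketch (not the paper's exact training scheme): second-stage loss mixing
# KL-divergence to first-stage soft labels with standard cross-entropy.
import torch
import torch.nn.functional as F

def soft_label_loss(student_logits, teacher_logits, hard_targets,
                    temperature=2.0, alpha=0.5):
    """Blend distillation to first-stage soft labels with hard-label cross-entropy."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    kl = F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, hard_targets)
    return alpha * kl + (1.0 - alpha) * ce
```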
1 code implementation • ICCV 2021 • Yao Teng, LiMin Wang, Zhifeng Li, Gangshan Wu
Specifically, we design an efficient method for frame-level VidSGG, termed as Target Adaptive Context Aggregation Network (TRACE), with a focus on capturing spatio-temporal context information for relation recognition.
1 code implementation • CVPR 2022 • Yao Teng, LiMin Wang
The key to our method is a set of learnable triplet queries and a structured triplet detector which could be jointly optimized from the training set in an end-to-end manner.
1 code implementation • CVPR 2021 • Tao Lu, LiMin Wang, Gangshan Wu
Previous point cloud semantic segmentation networks use the same process to aggregate features from neighbors of the same category and different categories.
Ranked #1 on Semantic Segmentation on SYNTHIA
no code implementations • 10 Jun 2021 • Xindi Hu, LiMin Wang, Xin Yang, Xu Zhou, Wufeng Xue, Yan Cao, Shengfeng Liu, Yuhao Huang, Shuangping Guo, Ning Shang, Dong Ni, Ning Gu
In this study, we propose a multi-task framework to learn the relationships among landmarks and structures jointly and automatically evaluate DDH.
1 code implementation • 6 Jun 2021 • Zeyu Ruan, Changqing Zou, Longhai Wu, Gangshan Wu, LiMin Wang
Dense three-dimensional face alignment and reconstruction in the wild is a challenging problem, as partial facial information is commonly missing in occluded and large-pose face images.
Ranked #1 on 3D Face Reconstruction on AFLW2000-3D
1 code implementation • 24 May 2021 • Yi Liu, LiMin Wang, Yali Wang, Xiao Ma, Yu Qiao
Temporal action localization (TAL) is an important and challenging problem in video understanding.
Fine-Grained Action Detection, Temporal Action Localization, +2
1 code implementation • ICCV 2021 • Yixuan Li, Lei Chen, Runyu He, Zhenzhi Wang, Gangshan Wu, LiMin Wang
Spatio-temporal action detection is an important and challenging problem in video understanding.
1 code implementation • ICCV 2021 • Yuan Zhi, Zhan Tong, LiMin Wang, Gangshan Wu
First, we present two different motion representations to enable us to efficiently distinguish the motion-salient frames from the background.
1 code implementation • 1 Apr 2021 • Yutao Cui, Cheng Jiang, LiMin Wang, Gangshan Wu
Accurate tracking is still a challenging task due to appearance variations, pose and view changes, and geometric deformations of the target in videos.
Ranked #1 on Visual Object Tracking on VOT2019
2 code implementations • ICCV 2021 • Hongwen Zhang, Yating Tian, Xinchi Zhou, Wanli Ouyang, Yebin Liu, LiMin Wang, Zhenan Sun
Regression-based methods have recently shown promising results in reconstructing human meshes from monocular images.
Ranked #32 on 3D Human Pose Estimation on 3DPW (using extra training data)
3D human pose and shape estimation, 3D Human Reconstruction, +2
2 code implementations • ICCV 2021 • Jing Tan, Jiaqi Tang, LiMin Wang, Gangshan Wu
Extensive experiments on THUMOS14 and ActivityNet-1.3 benchmarks demonstrate the effectiveness of RTD-Net, on both tasks of temporal action proposal generation and temporal action detection.
no code implementations • 1 Jan 2021 • LiMin Wang, Bin Ji, Zhan Tong, Gangshan Wu
To mitigate this issue, this paper presents a new video architecture, termed as Temporal Difference Network (TDN), with a focus on capturing multi-scale temporal information for efficient action recognition.
1 code implementation • CVPR 2021 • LiMin Wang, Zhan Tong, Bin Ji, Gangshan Wu
To mitigate this issue, this paper presents a new video architecture, termed as Temporal Difference Network (TDN), with a focus on capturing multi-scale temporal information for efficient action recognition.
Ranked #11 on Action Recognition on Something-Something V1
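A toy version of the temporal-difference idea behind TDN is sketched below: frame-to-frame feature differences serve as a lightweight motion cue added to the appearance features. The padding and convolution are illustrative assumptions, not the actual TDN modules.

```python
# Minimal sketch (not the official TDN module): supplement per-frame appearance features
# with temporal-difference features computed from neighbouring frames.
import torch
import torch.nn as nn

class TemporalDifferenceModule(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.diff_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, frames):
        """frames: (B, T, C, H, W) per-frame feature maps."""
        diff = frames[:, 1:] - frames[:, :-1]                   # frame-to-frame differences
        diff = torch.cat([diff, diff[:, -1:]], dim=1)           # pad to keep T steps
        B, T, C, H, W = diff.shape
        motion = self.diff_conv(diff.reshape(B * T, C, H, W)).reshape(B, T, C, H, W)
        return frames + motion                                   # appearance + motion cue
```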
1 code implementation • CVPR 2018 • Limin Wang, Wei Li, Wen Li, Luc van Gool
Specifically, SMART blocks decouple the spatiotemporal learning module into an appearance branch for spatial modeling and a relation branch for temporal modeling.
Ranked #47 on Action Recognition on UCF101
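The decoupling of SMART blocks into an appearance branch and a relation branch can be caricatured as two parallel branches over a video feature volume, as below. This is a structural sketch only: the actual relation branch involves richer interactions than the plain temporal convolution assumed here.

```python
# Minimal sketch (not the paper's SMART block): decouple spatiotemporal learning into a
# per-frame appearance branch and a cross-frame relation branch.
import torch
import torch.nn as nn

class SmartStyleBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        # Appearance branch: spatial modeling within each frame.
        self.appearance = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # Relation branch: temporal modeling across neighbouring frames.
        self.relation = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x):
        """x: (B, C, T, H, W) video features."""
        return torch.relu(self.appearance(x) + self.relation(x))
```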
9 code implementations • 8 May 2017 • Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, Luc van Gool
Furthermore, based on the temporal segment networks, we won the video classification track at the ActivityNet challenge 2016 among 24 teams, which demonstrates the effectiveness of TSN and the proposed good practices.
Ranked #18 on Action Classification on Moments in Time (Top 5 Accuracy metric)
2 code implementations • CVPR 2017 • Limin Wang, Yuanjun Xiong, Dahua Lin, Luc van Gool
We exploit the learned models for action recognition (WSR) and detection (WSD) on the untrimmed video datasets of THUMOS14 and ActivityNet.
Ranked #3 on Action Classification on THUMOS’14
Weakly Supervised Action Localization, Weakly-Supervised Action Recognition
2 code implementations • 4 Oct 2016 • Limin Wang, Sheng Guo, Weilin Huang, Yuanjun Xiong, Yu Qiao
Convolutional Neural Networks (CNNs) have made remarkable progress on scene recognition, partially due to recent large-scale scene datasets such as Places and Places2.
no code implementations • 1 Sep 2016 • Limin Wang, Zhe Wang, Yu Qiao, Luc van Gool
These newly designed transferring techniques exploit multi-task learning frameworks to incorporate extra knowledge from other networks and additional datasets into the training procedure of event CNNs.
19 code implementations • 2 Aug 2016 • Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, Luc van Gool
The other contribution is our study on a series of good practices in learning ConvNets on video data with the help of temporal segment network.
Ranked #3 on Multimodal Activity Recognition on EV-Action
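The segment-sampling-and-consensus idea behind the temporal segment network mentioned above can be sketched as follows: split the video into K segments, evaluate one snippet per segment, and average the per-snippet predictions. This is a minimal sketch of that idea, not the official TSN code; the backbone and the center-frame sampling are assumptions.

```python
# Minimal sketch of the temporal-segment idea (not the official TSN code): sparse
# snippet sampling per segment followed by an averaging segmental consensus.
import torch
import torch.nn as nn

class TemporalSegmentClassifier(nn.Module):
    def __init__(self, backbone: nn.Module, num_segments=3):
        super().__init__()
        self.backbone = backbone          # any per-snippet image/clip classifier
        self.num_segments = num_segments

    def forward(self, video_frames):
        """video_frames: (B, T, C, H, W) full video; one snippet sampled per segment."""
        B, T = video_frames.shape[:2]
        bounds = torch.linspace(0, T, self.num_segments + 1).long()
        snippet_logits = []
        for k in range(self.num_segments):
            idx = (bounds[k] + bounds[k + 1]) // 2           # centre frame of segment k
            snippet_logits.append(self.backbone(video_frames[:, idx]))
        return torch.stack(snippet_logits, dim=1).mean(dim=1)   # segmental consensus
```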
no code implementations • CVPR 2016 • Limin Wang, Yu Qiao, Xiaoou Tang, Luc van Gool
Actionness was introduced to quantify the likelihood of containing a generic action instance at a specific location.
Ranked #7 on Temporal Action Localization on J-HMDB-21
no code implementations • 14 Oct 2015 • Limin Wang, Zhe Wang, Sheng Guo, Yu Qiao
Event recognition from still images is one of the most important problems for image understanding.
1 code implementation • 7 Aug 2015 • Limin Wang, Sheng Guo, Weilin Huang, Yu Qiao
We verify the performance of trained Places205-VGGNet models on three datasets: MIT67, SUN397, and Places205.
5 code implementations • 8 Jul 2015 • Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao
However, for action recognition in videos, the improvement of deep convolutional networks is not so evident.
Ranked #62 on Action Recognition on UCF101
1 code implementation • CVPR 2015 • Limin Wang, Yu Qiao, Xiaoou Tang
Visual features are of vital importance for human action understanding in videos.
Ranked #2 on Activity Recognition In Videos on DogCentric
no code implementations • 2 May 2015 • Limin Wang, Zhe Wang, Wenbin Du, Yu Qiao
Meanwhile, we investigate different network architectures for OS-CNN design, and adapt the deep (AlexNet) and very-deep (GoogLeNet) networks to the task of event recognition.