Search Results for author: LiMin Wang

Found 110 papers, 75 papers with code

Object-Scene Convolutional Neural Networks for Event Recognition in Images

no code implementations 2 May 2015 Limin Wang, Zhe Wang, Wenbin Du, Yu Qiao

Meanwhile, we investigate different network architectures for OS-CNN design, and adapt the deep (AlexNet) and very-deep (GoogLeNet) networks to the task of event recognition.

Towards Good Practices for Very Deep Two-Stream ConvNets

5 code implementations 8 Jul 2015 Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao

However, for action recognition in videos, the improvement brought by deep convolutional networks is not so evident.

Action Recognition In Videos Computational Efficiency +3

Places205-VGGNet Models for Scene Recognition

2 code implementations 7 Aug 2015 Limin Wang, Sheng Guo, Weilin Huang, Yu Qiao

We verify the performance of trained Places205-VGGNet models on three datasets: MIT67, SUN397, and Places205.

Computational Efficiency Object Recognition +1

Better Exploiting OS-CNNs for Better Event Recognition in Images

no code implementations 14 Oct 2015 Limin Wang, Zhe Wang, Sheng Guo, Yu Qiao

Event recognition from still images is one of the most important problems for image understanding.

Object Object Recognition +1

Actionness Estimation Using Hybrid Fully Convolutional Networks

no code implementations CVPR 2016 Limin Wang, Yu Qiao, Xiaoou Tang, Luc van Gool

Actionness was introduced to quantify the likelihood of containing a generic action instance at a specific location.

Action Detection Action Recognition +1

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

19 code implementations 2 Aug 2016 Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, Luc van Gool

The other contribution is our study of a series of good practices for learning ConvNets on video data with the help of the temporal segment network (a minimal sketch of the idea follows below).

Action Classification Action Recognition In Videos +2
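
A minimal PyTorch sketch of the temporal segment network idea behind this entry: split the video into segments, sample one snippet per segment, run a shared 2D backbone, and average the per-snippet predictions as the segmental consensus. Class and parameter names are illustrative assumptions, and average consensus is only one of the consensus functions the paper studies.

```python
import torch
import torch.nn as nn

class TSNSketch(nn.Module):
    """Sparse segment sampling + segmental consensus (toy version)."""

    def __init__(self, backbone: nn.Module, num_segments: int = 3):
        super().__init__()
        self.backbone = backbone        # any per-frame 2D ConvNet classifier
        self.num_segments = num_segments

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (batch, frames, channels, height, width)
        b, t, c, h, w = video.shape
        # Divide the video into equal segments; sample one random snippet each,
        # which gives long-range temporal coverage at a fixed, low cost.
        bounds = torch.linspace(0, t, self.num_segments + 1).long()
        idx = torch.stack([
            torch.randint(int(lo), max(int(hi), int(lo) + 1), (1,)).squeeze(0)
            for lo, hi in zip(bounds[:-1], bounds[1:])
        ])
        snippets = video[:, idx]                        # (b, segments, c, h, w)
        logits = self.backbone(snippets.flatten(0, 1))  # shared weights
        logits = logits.view(b, self.num_segments, -1)
        return logits.mean(dim=1)   # segmental consensus by averaging

# usage (hypothetical): TSNSketch(torchvision.models.resnet18(num_classes=101))(clip)
```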

Transferring Object-Scene Convolutional Neural Networks for Event Recognition in Still Images

no code implementations 1 Sep 2016 Limin Wang, Zhe Wang, Yu Qiao, Luc van Gool

These newly designed transferring techniques exploit multi-task learning frameworks to incorporate extra knowledge from other networks and additional datasets into the training procedure of event CNNs.

Multi-Task Learning

Knowledge Guided Disambiguation for Large-Scale Scene Classification with Multi-Resolution CNNs

2 code implementations 4 Oct 2016 Limin Wang, Sheng Guo, Weilin Huang, Yuanjun Xiong, Yu Qiao

Convolutional Neural Networks (CNNs) have made remarkable progress on scene recognition, partially due to recent large-scale scene datasets such as Places and Places2.

General Classification Scene Classification +1

Temporal Segment Networks for Action Recognition in Videos

11 code implementations 8 May 2017 Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, Luc van Gool

Furthermore, based on the temporal segment networks, we won the video classification track at the ActivityNet challenge 2016 among 24 teams, which demonstrates the effectiveness of TSN and the proposed good practices.

Action Classification Action Recognition In Videos +3

Appearance-and-Relation Networks for Video Classification

1 code implementation CVPR 2018 Limin Wang, Wei Li, Wen Li, Luc van Gool

Specifically, SMART blocks decouple the spatiotemporal learning module into an appearance branch for spatial modeling and a relation branch for temporal modeling.

Action Classification Action Recognition +6
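
A toy reading of the decoupling described in the entry above, assuming factorized 3D convolutions: a spatial-only (1x3x3) appearance branch and a temporal-only (3x1x1) relation branch whose outputs are concatenated. The actual SMART block uses richer multiplicative relation modeling; this sketch only illustrates the split.

```python
import torch
import torch.nn as nn

class SMARTBlockSketch(nn.Module):
    """Decoupled spatiotemporal block: appearance + relation branches (toy)."""

    def __init__(self, channels: int):
        super().__init__()
        # Appearance branch: per-frame spatial convolution (1x3x3).
        self.appearance = nn.Conv3d(channels, channels // 2,
                                    kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # Relation branch: temporal-only convolution across frames (3x1x1).
        self.relation = nn.Conv3d(channels, channels // 2,
                                  kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        return self.act(torch.cat([self.appearance(x), self.relation(x)], dim=1))
```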

TDN: Temporal Difference Networks for Efficient Action Recognition

1 code implementation CVPR 2021 LiMin Wang, Zhan Tong, Bin Ji, Gangshan Wu

To mitigate this issue, this paper presents a new video architecture, termed Temporal Difference Network (TDN), with a focus on capturing multi-scale temporal information for efficient action recognition (see the sketch below).

Action Classification Action Recognition In Videos
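
A hedged sketch of the temporal-difference idea referenced above: RGB differences between neighbouring frames act as a cheap, explicit motion cue that supplements per-frame appearance features. All module names are assumptions; the actual TDN applies difference modules at both short-term and long-term scales.

```python
import torch
import torch.nn as nn

class TemporalDifferenceSketch(nn.Module):
    """Supplement frame features with encoded frame-difference motion cues."""

    def __init__(self, in_channels: int = 3, feat_channels: int = 64):
        super().__init__()
        self.frame_conv = nn.Conv2d(in_channels, feat_channels, 3, padding=1)
        self.diff_conv = nn.Conv2d(in_channels, feat_channels, 3, padding=1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, channels, height, width)
        b, t, c, h, w = frames.shape
        # Difference of neighbouring frames as an explicit motion signal;
        # zero-pad the last step to keep the temporal length unchanged.
        diff = frames[:, 1:] - frames[:, :-1]
        diff = torch.cat([diff, torch.zeros_like(frames[:, :1])], dim=1)
        feat = self.frame_conv(frames.flatten(0, 1))
        motion = self.diff_conv(diff.flatten(0, 1))
        return (feat + motion).view(b, t, -1, h, w)
```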

Temporal Difference Networks for Action Recognition

no code implementations 1 Jan 2021 LiMin Wang, Bin Ji, Zhan Tong, Gangshan Wu

To mitigate this issue, this paper presents a new video architecture, termed Temporal Difference Network (TDN), with a focus on capturing multi-scale temporal information for efficient action recognition.

Action Recognition In Videos

Relaxed Transformer Decoders for Direct Action Proposal Generation

2 code implementations ICCV 2021 Jing Tan, Jiaqi Tang, LiMin Wang, Gangshan Wu

Extensive experiments on THUMOS14 and ActivityNet-1.3 benchmarks demonstrate the effectiveness of RTD-Net on both tasks of temporal action proposal generation and temporal action detection.

Action Detection Temporal Action Proposal Generation +1

Target Transformed Regression for Accurate Tracking

1 code implementation 1 Apr 2021 Yutao Cui, Cheng Jiang, LiMin Wang, Gangshan Wu

Accurate tracking is still a challenging task due to appearance variations, pose and view changes, and geometric deformations of the target in videos.

regression Visual Object Tracking +1

MGSampler: An Explainable Sampling Strategy for Video Action Recognition

1 code implementation ICCV 2021 Yuan Zhi, Zhan Tong, LiMin Wang, Gangshan Wu

First, we present two different motion representations to enable us to efficiently distinguish the motion-salient frames from the background.

Action Recognition Temporal Action Localization
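
A toy version of motion-guided sampling, assuming simple frame differencing stands in for the paper's motion representations: frames are picked uniformly in cumulative-motion space, so motion-salient stretches of the video receive denser sampling than static background.

```python
import torch

def motion_guided_sample(frames: torch.Tensor, num_out: int) -> torch.Tensor:
    """Pick num_out frames uniformly in cumulative-motion space (toy)."""
    # frames: (time, channels, height, width)
    diff = (frames[1:] - frames[:-1]).abs().flatten(1).sum(dim=1)
    motion = torch.cat([diff[:1], diff])              # pad the first frame
    cdf = motion.cumsum(0) / (motion.sum() + 1e-6)    # cumulative motion
    targets = (torch.arange(num_out) + 0.5) / num_out
    idx = torch.searchsorted(cdf, targets.to(cdf))
    return frames[idx.clamp(max=frames.shape[0] - 1)]
```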

SADRNet: Self-Aligned Dual Face Regression Networks for Robust 3D Dense Face Alignment and Reconstruction

1 code implementation 6 Jun 2021 Zeyu Ruan, Changqing Zou, Longhai Wu, Gangshan Wu, LiMin Wang

Three-dimensional face dense alignment and reconstruction in the wild is a challenging problem as partial facial information is commonly missing in occluded and large pose face images.

3D Face Alignment 3D Face Reconstruction +3

Joint Landmark and Structure Learning for Automatic Evaluation of Developmental Dysplasia of the Hip

no code implementations 10 Jun 2021 Xindi Hu, LiMin Wang, Xin Yang, Xu Zhou, Wufeng Xue, Yan Cao, Shengfeng Liu, Yuhao Huang, Shuangping Guo, Ning Shang, Dong Ni, Ning Gu

In this study, we propose a multi-task framework to learn the relationships among landmarks and structures jointly and automatically evaluate DDH.

CGA-Net: Category Guided Aggregation for Point Cloud Semantic Segmentation

1 code implementation CVPR 2021 Tao Lu, LiMin Wang, Gangshan Wu

Previous point cloud semantic segmentation networks use the same process to aggregate features from neighbors of the same category and different categories.

Segmentation Semantic Segmentation

Structured Sparse R-CNN for Direct Scene Graph Generation

4 code implementations CVPR 2022 Yao Teng, LiMin Wang

The key to our method is a set of learnable triplet queries and a structured triplet detector which could be jointly optimized from the training set in an end-to-end manner.

graph construction Graph Generation +4

Target Adaptive Context Aggregation for Video Scene Graph Generation

1 code implementation ICCV 2021 Yao Teng, LiMin Wang, Zhifeng Li, Gangshan Wu

Specifically, we design an efficient method for frame-level VidSGG, termed Target Adaptive Context Aggregation Network (TRACE), with a focus on capturing spatio-temporal context information for relation recognition.

Graph Generation Relation +2

Self Supervision to Distillation for Long-Tailed Visual Recognition

1 code implementation ICCV 2021 TianHao Li, LiMin Wang, Gangshan Wu

In this paper, we show that soft labels can serve as a powerful solution to incorporate label correlation into a multi-stage training scheme for long-tailed recognition.

Long-tail Learning

Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding

2 code implementations 10 Sep 2021 Zhenzhi Wang, LiMin Wang, Tao Wu, TianHao Li, Gangshan Wu

Instead, from a perspective on temporal grounding as a metric-learning problem, we present a Mutual Matching Network (MMN), to directly model the similarity between language queries and video moments in a joint embedding space.

Metric Learning Representation Learning +2
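
A minimal sketch of the metric-learning view described above, assuming an InfoNCE-style objective in which the other pairs in a batch serve as negatives; the paper's actual negative mining and architecture differ, so treat this purely as an illustration of matching in a joint embedding space.

```python
import torch
import torch.nn.functional as F

def mutual_matching_loss(moment_emb: torch.Tensor, query_emb: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """Bidirectional matching loss in a joint embedding space (toy)."""
    # moment_emb, query_emb: (batch, dim); row i of each is a matched pair
    m = F.normalize(moment_emb, dim=-1)
    q = F.normalize(query_emb, dim=-1)
    sim = m @ q.T / temperature                 # (batch, batch) similarities
    labels = torch.arange(sim.size(0), device=sim.device)
    # Match moments to queries and queries to moments (mutual matching).
    return (F.cross_entropy(sim, labels) + F.cross_entropy(sim.T, labels)) / 2
```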

Mutual Supervision for Dense Object Detection

no code implementations ICCV 2021 Ziteng Gao, LiMin Wang, Gangshan Wu

In this paper, we break the convention of using the same training samples for these two heads in dense detectors and explore a novel supervisory paradigm, termed Mutual Supervision (MuSu), which mutually assigns training samples to the classification and regression heads to ensure this consistency.

Classification Dense Object Detection +3

End-to-End Dense Video Grounding via Parallel Regression

no code implementations 23 Sep 2021 Fengyuan Shi, Weilin Huang, LiMin Wang

In this paper, we tackle a new problem of dense video grounding, by simultaneously localizing multiple moments with a paragraph as input.

regression Sentence +1

A Closer Look at Few-Shot Video Classification: A New Baseline and Benchmark

1 code implementation 24 Oct 2021 Zhenxi Zhu, LiMin Wang, Sheng Guo, Gangshan Wu

In this paper, we aim to present an in-depth study on few-shot video classification by making three contributions.

Classification Meta-Learning +2

DCAN: Improving Temporal Action Detection via Dual Context Aggregation

1 code implementation 7 Dec 2021 Guo Chen, Yin-Dong Zheng, LiMin Wang, Tong Lu

Specifically, we design the Multi-Path Temporal Context Aggregation (MTCA) to achieve smooth context aggregation at the boundary level and precise evaluation of boundaries.

Action Detection Temporal Action Localization

Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection

3 code implementations CVPR 2022 Jiaqi Tang, Zhaoyang Liu, Chen Qian, Wayne Wu, LiMin Wang

Generic event boundary detection is an important yet challenging task in video understanding, which aims at detecting the moments where humans naturally perceive event boundaries.

Boundary Detection Generic Event Boundary Detection +1

OCSampler: Compressing Videos to One Clip with Single-step Sampling

1 code implementation CVPR 2022 Jintao Lin, Haodong Duan, Kai Chen, Dahua Lin, LiMin Wang

Recent works prefer to formulate frame sampling as a sequential decision task, selecting frames one by one according to their importance. In contrast, we present a new paradigm that learns instance-specific video condensation policies to select informative frames for representing the entire video in a single step (a toy sketch follows below).

Video Recognition
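
A toy single-step condensation policy in the spirit of the paradigm above: a light scorer looks at cheap per-frame features and keeps the top-k frames in one shot, rather than making sequential per-frame decisions. How the policy is trained (the paper's actual contribution) is omitted, and hard top-k as written is non-differentiable.

```python
import torch
import torch.nn as nn

class OneStepSamplerSketch(nn.Module):
    """Score all frames at once and keep the k most informative (toy)."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, 1)

    def forward(self, frame_feats: torch.Tensor, k: int) -> torch.Tensor:
        # frame_feats: (batch, time, feat_dim) from a cheap "glance" network
        scores = self.scorer(frame_feats).squeeze(-1)        # (batch, time)
        idx = scores.topk(k, dim=1).indices.sort(dim=1).values
        # Gather the selected frames' features for the heavy recognizer.
        idx = idx.unsqueeze(-1).expand(-1, -1, frame_feats.size(-1))
        return frame_feats.gather(1, idx)
```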

Temporal Perceiver: A General Architecture for Arbitrary Boundary Detection

no code implementations 1 Mar 2022 Jing Tan, Yuhong Wang, Gangshan Wu, LiMin Wang

Instead, in this paper, we present Temporal Perceiver, a general Transformer-based architecture offering a unified solution to the detection of arbitrary generic boundaries, ranging from shot-level and event-level to scene-level GBDs.

Boundary Detection +1

Recovering 3D Human Mesh from Monocular Images: A Survey

1 code implementation 3 Mar 2022 Yating Tian, Hongwen Zhang, Yebin Liu, LiMin Wang

Since the release of statistical body models, 3D human mesh recovery has been drawing broader attention.

3D human pose and shape estimation Human Mesh Recovery

MixFormer: End-to-End Tracking with Iterative Mixed Attention

1 code implementation CVPR 2022 Yutao Cui, Cheng Jiang, LiMin Wang, Gangshan Wu

Our core design is to utilize the flexibility of attention operations, and propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration.

Semi-Supervised Video Object Segmentation Visual Object Tracking
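
A minimal sketch of the mixed-attention idea from the entry above: template and search-region tokens are concatenated and attend to each other in a single operation, so feature extraction and target-information integration happen simultaneously. The real MAM adds an asymmetric scheme for efficiency; dimensions and names here are assumptions.

```python
import torch
import torch.nn as nn

class MixedAttentionSketch(nn.Module):
    """Joint attention over concatenated template and search tokens (toy)."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, template: torch.Tensor, search: torch.Tensor):
        # template: (batch, n_t, dim), search: (batch, n_s, dim)
        tokens = torch.cat([template, search], dim=1)
        mixed, _ = self.attn(tokens, tokens, tokens)  # one pass mixes both
        return mixed[:, :template.size(1)], mixed[:, template.size(1):]
```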

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

4 code implementations 23 Mar 2022 Zhan Tong, Yibing Song, Jue Wang, LiMin Wang

Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets.

4k Action Classification +3
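
The snippet above is motivation; the mechanism named in the title is masked video autoencoding, whose key ingredients are an extremely high masking ratio and tube-style masking. A toy tube mask, under the simplifying assumption of one spatial mask shared by all frames (VideoMAE actually masks space-time cubes on the token grid):

```python
import torch

def tube_mask(batch: int, frames: int, patches: int,
              ratio: float = 0.9) -> torch.Tensor:
    """Random spatial mask repeated over time: True = masked token (toy)."""
    num_masked = int(patches * ratio)
    perm = torch.rand(batch, patches).argsort(dim=1)   # random permutation
    mask2d = perm < num_masked                         # exactly num_masked True
    # Share the same mask across frames so masked patches form tubes,
    # preventing the model from copying a patch from a neighbouring frame.
    return mask2d.unsqueeze(1).expand(-1, frames, -1)  # (batch, frames, patches)
```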

Task-specific Inconsistency Alignment for Domain Adaptive Object Detection

1 code implementation CVPR 2022 Liang Zhao, LiMin Wang

To address this issue, in this paper, we propose Task-specific Inconsistency Alignment (TIA), by developing a new alignment mechanism in separate task spaces, improving the performance of the detector on both subtasks.

Object object-detection +1

AdaMixer: A Fast-Converging Query-Based Object Detector

2 code implementations CVPR 2022 Ziteng Gao, LiMin Wang, Bing Han, Sheng Guo

The recent query-based object detectors break this convention by decoding image features with a set of learnable queries.

Object Object Detection

Logit Normalization for Long-tail Object Detection

1 code implementation 31 Mar 2022 Liang Zhao, Yao Teng, LiMin Wang

Real-world data exhibiting skewed distributions pose a serious challenge to existing object detectors.

Object object-detection +1

Joint-Modal Label Denoising for Weakly-Supervised Audio-Visual Video Parsing

2 code implementations 25 Apr 2022 Haoyue Cheng, Zhaoyang Liu, Hang Zhou, Chen Qian, Wayne Wu, LiMin Wang

This paper focuses on the weakly-supervised audio-visual video parsing task, which aims to recognize all events belonging to each modality and localize their temporal boundaries.

Denoising

APP-Net: Auxiliary-point-based Push and Pull Operations for Efficient Point Cloud Classification

1 code implementation 2 May 2022 Tao Lu, Chunxu Liu, Youxin Chen, Gangshan Wu, LiMin Wang

In existing work, each point in the cloud may inevitably be selected as a neighbor of multiple aggregation centers, as all centers gather neighbor features from the whole point cloud independently.

3D Classification 3D Point Cloud Classification +1

BasicTAD: an Astounding RGB-Only Baseline for Temporal Action Detection

2 code implementations 5 May 2022 Min Yang, Guo Chen, Yin-Dong Zheng, Tong Lu, LiMin Wang

Empirical results demonstrate that our PlusTAD is very efficient and significantly outperforms the previous methods on the datasets of THUMOS14 and FineAction.

Action Detection object-detection +3

Cross-Architecture Self-supervised Video Representation Learning

1 code implementation CVPR 2022 Sheng Guo, Zihua Xiong, Yujie Zhong, LiMin Wang, Xiaobo Guo, Bing Han, Weilin Huang

In this paper, we present a new cross-architecture contrastive learning (CACL) framework for self-supervised video representation learning.

Action Recognition Contrastive Learning +4

Submission to Generic Event Boundary Detection Challenge@CVPR 2022: Local Context Modeling and Global Boundary Decoding Approach

no code implementations 30 Jun 2022 Jiaqi Tang, Zhaoyang Liu, Jing Tan, Chen Qian, Wayne Wu, LiMin Wang

Local context modeling sub-network is proposed to perceive diverse patterns of generic event boundaries, and it generates powerful video representations and reliable boundary confidence.

Boundary Detection Generic Event Boundary Detection +1

Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding

no code implementations 28 Sep 2022 Fengyuan Shi, Ruopeng Gao, Weilin Huang, LiMin Wang

The sampling module aims to select these informative patches by predicting the offsets with respect to a reference point, while the decoding module works for extracting the grounded object information by performing cross attention between image features and text features.

Visual Grounding

PointTAD: Multi-Label Temporal Action Detection with Learnable Query Points

1 code implementation 20 Oct 2022 Jing Tan, Xiaotong Zhao, Xintian Shi, Bin Kang, LiMin Wang

Traditional temporal action detection (TAD) usually handles untrimmed videos with a small number of action instances from a single label (e.g., ActivityNet, THUMOS).

Action Detection Temporal Action Localization

UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer

3 code implementations 17 Nov 2022 Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, LiMin Wang, Yu Qiao

UniFormer has successfully alleviated this issue, by unifying convolution and self-attention as a relation aggregator in the transformer format.

Video Understanding

VLG: General Video Recognition with Web Textual Knowledge

1 code implementation 3 Dec 2022 Jintao Lin, Zhaoyang Liu, Wenhai Wang, Wayne Wu, LiMin Wang

Our VLG is first pre-trained on video and language datasets to learn a shared feature space, and then devises a flexible bi-modal attention head to collaborate high-level semantic concepts under different settings.

Video Recognition

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

1 code implementation 6 Dec 2022 Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, LiMin Wang, Yu Qiao

Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications.

 Ranked #1 on Action Recognition on Something-Something V1 (using extra training data)

Action Classification Contrastive Learning +8
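
A hedged sketch of "selectively coordinating" the two complementary representations in a learnable manner, assuming a simple sigmoid gate over the masked-modeling and contrastive embeddings; the paper's actual coordination module is more elaborate, and all names here are illustrative.

```python
import torch
import torch.nn as nn

class CoordinationSketch(nn.Module):
    """Learnable blend of masked-modeling and contrastive embeddings (toy)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, mvm_feat: torch.Tensor, clip_feat: torch.Tensor):
        # both inputs: (batch, dim) video representations from the two objectives
        g = self.gate(torch.cat([mvm_feat, clip_feat], dim=-1))
        return g * mvm_feat + (1 - g) * clip_feat      # gated combination
```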

UniFormerV2: Unlocking the Potential of Image ViTs for Video Understanding

no code implementations ICCV 2023 Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, LiMin Wang, Yu Qiao

The impressive performance of Vision Transformers (ViTs) on image tasks has prompted research into adapting image ViTs for video tasks.

Video Understanding

MixFormer: End-to-End Tracking with Iterative Mixed Attention

1 code implementation 6 Feb 2023 Yutao Cui, Cheng Jiang, Gangshan Wu, LiMin Wang

Our core design is to utilize the flexibility of attention operations, and propose a Mixed Attention Module (MAM) for simultaneous feature extraction and target information integration.

Visual Object Tracking

CoMAE: Single Model Hybrid Pre-training on Small-Scale RGB-D Datasets

1 code implementation 13 Feb 2023 Jiange Yang, Sheng Guo, Gangshan Wu, LiMin Wang

Our CoMAE presents a curriculum learning strategy to unify the two popular self-supervised representation learning algorithms: contrastive learning and masked image modeling.

Contrastive Learning Representation Learning +1

Learning Optical Flow and Scene Flow with Bidirectional Camera-LiDAR Fusion

1 code implementation 21 Mar 2023 Haisong Liu, Tao Lu, Yihui Xu, Jia Liu, LiMin Wang

To fuse dense image features and sparse point features, we propose a learnable operator named bidirectional camera-LiDAR fusion module (Bi-CLFM).

Optical Flow Estimation Scene Flow Estimation +1

PDPP: Projected Diffusion for Procedure Planning in Instructional Videos

1 code implementation CVPR 2023 Hanlin Wang, Yilu Wu, Sheng Guo, LiMin Wang

In this sense, we model the whole intermediate action sequence distribution with a diffusion model (PDPP), and thus transform the planning problem to a sampling process from this distribution.

LinK: Linear Kernel for LiDAR-based 3D Perception

1 code implementation CVPR 2023 Tao Lu, Xiang Ding, Haisong Liu, Gangshan Wu, LiMin Wang

Extending the success of 2D large kernels to 3D perception is challenging due to (1) the cubically increasing overhead of processing 3D data and (2) optimization difficulties arising from data scarcity and sparsity.

3D Object Detection 3D Semantic Segmentation +1

VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

1 code implementation CVPR 2023 LiMin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, Yu Qiao

Finally, we successfully train a video ViT model with a billion parameters, which achieves a new state-of-the-art performance on the datasets of Kinetics (90.0% on K400 and 89.9% on K600) and Something-Something (68.7% on V1 and 77.0% on V2).

 Ranked #1 on Self-Supervised Action Recognition on UCF101 (using extra training data)

Action Classification Action Recognition In Videos +3

SparseFormer: Sparse Visual Recognition via Limited Latent Tokens

1 code implementation 7 Apr 2023 Ziteng Gao, Zhan Tong, LiMin Wang, Mike Zheng Shou

In this paper, we challenge this dense paradigm and present a new method, coined SparseFormer, to imitate human sparse visual recognition in an end-to-end manner.

Sparse Representation-based Classification Video Classification

SportsMOT: A Large Multi-Object Tracking Dataset in Multiple Sports Scenes

1 code implementation ICCV 2023 Yutao Cui, Chenkai Zeng, Xiaoyu Zhao, Yichun Yang, Gangshan Wu, LiMin Wang

We expect SportsMOT to encourage MOT trackers to improve in both motion-based association and appearance-based association.

Ranked #3 on Multi-Object Tracking on SportsMOT (using extra training data)

Multi-Object Tracking Multiple Object Tracking +1

Progressive Visual Prompt Learning with Contrastive Feature Re-formation

no code implementations 17 Apr 2023 Chen Xu, Haocheng Shen, Fengyuan Shi, Boheng Chen, Yixuan Liao, Xiaoxin Chen, LiMin Wang

To the best of our knowledge, we are the first to demonstrate the superior performance of visual prompts in V-L models to previous prompt-based methods in downstream tasks.

InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language

2 code implementations 9 May 2023 Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Zeqiang Lai, Yang Yang, Qingyun Li, Jiashuo Yu, Kunchang Li, Zhe Chen, Xue Yang, Xizhou Zhu, Yali Wang, LiMin Wang, Ping Luo, Jifeng Dai, Yu Qiao

Different from existing interactive systems that rely on pure language, the proposed iGPT incorporates pointing instructions, which significantly improves the efficiency of communication between users and chatbots, as well as the accuracy of chatbots in vision-centric tasks, especially in complicated visual scenarios where the number of objects is greater than 2.

Language Modelling

VideoLLM: Modeling Video Sequence with Large Language Models

1 code implementation 22 May 2023 Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei HUANG, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, LiMin Wang

Building upon this insight, we propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs from natural language processing (NLP) for video sequence understanding.

Video Understanding

MixFormerV2: Efficient Fully Transformer Tracking

1 code implementation NeurIPS 2023 Yutao Cui, Tianhui Song, Gangshan Wu, LiMin Wang

Our key design is to introduce four special prediction tokens and concatenate them with the tokens from target template and search areas.

AlphaBlock: Embodied Finetuning for Vision-Language Reasoning in Robot Manipulation

no code implementations 30 May 2023 Chuhao Jin, Wenhui Tan, Jiange Yang, Bei Liu, Ruihua Song, LiMin Wang, Jianlong Fu

We propose a novel framework for learning high-level cognitive capabilities in robot manipulation tasks, such as making a smiley face using building blocks.

Robot Manipulation

Transferring Foundation Models for Generalizable Robotic Manipulation

no code implementations 9 Jun 2023 Jiange Yang, Wenhui Tan, Chuhao Jin, Keling Yao, Bei Liu, Jianlong Fu, Ruihua Song, Gangshan Wu, LiMin Wang

In this paper, we propose a novel paradigm that effectively leverages language-reasoning segmentation mask generated by internet-scale foundation models, to condition robot manipulation tasks.

Imitation Learning Object +1

MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking

1 code implementation ICCV 2023 Ruopeng Gao, LiMin Wang

Experimental results on DanceTrack show that MeMOTR impressively surpasses the state-of-the-art method by 7.9% and 13.0% on HOTA and AssA metrics, respectively.

Multi-Object Tracking Multiple Object Tracking +1

Memory-and-Anticipation Transformer for Online Action Understanding

1 code implementation ICCV 2023 Jiahao Wang, Guo Chen, Yifei HUANG, LiMin Wang, Tong Lu

Based on this idea, we present Memory-and-Anticipation Transformer (MAT), a memory-anticipation-based approach, to address the online action detection and anticipation tasks.

Action Understanding Online Action Detection

Is Self-Supervised Pretraining Good for Extrapolation in Molecular Property Prediction?

no code implementations 16 Aug 2023 Shun Takashige, Masatoshi Hanai, Toyotaro Suzumura, LiMin Wang, Kenjiro Taura

In material science, the prediction of unobserved values, commonly referred to as extrapolation, is particularly critical for property prediction as it enables researchers to gain insight into materials beyond the limits of available data.

Molecular Property Prediction Property Prediction

Deep Equilibrium Object Detection

1 code implementation ICCV 2023 Shuai Wang, Yao Teng, LiMin Wang

Specifically, for object decoding, we use a two-step unrolled equilibrium equation to explicitly capture the refinement of query vectors (a minimal sketch follows below).

Object object-detection +1
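
A minimal sketch of the unrolled-equilibrium view of query decoding: one weight-tied decoder layer f is applied repeatedly so that the queries approach a fixed point q* = f(q*, image features); two steps mirror the two-step unrolling mentioned above. The layer choice and sizes are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class EquilibriumDecoderSketch(nn.Module):
    """Weight-tied decoder layer iterated as a fixed-point refinement (toy)."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # A single shared layer plays the role of f in q* = f(q*, feats).
        self.layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)

    def forward(self, queries: torch.Tensor, feats: torch.Tensor,
                steps: int = 2) -> torch.Tensor:
        # queries: (batch, num_queries, dim); feats: (batch, tokens, dim)
        q = queries
        for _ in range(steps):       # two-step unrolled equilibrium iteration
            q = self.layer(q, feats)
        return q
```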

SparseBEV: High-Performance Sparse 3D Object Detection from Multi-Camera Videos

1 code implementation ICCV 2023 Haisong Liu, Yao Teng, Tao Lu, Haiguang Wang, LiMin Wang

Dense detectors typically follow a two-stage pipeline by first constructing a dense BEV feature and then performing object detection in BEV space, which suffers from complex view transformations and high computation cost.

3D Object Detection Object +1

DPL: Decoupled Prompt Learning for Vision-Language Models

no code implementations 19 Aug 2023 Chen Xu, Yuhan Zhu, Guozhen Zhang, Haocheng Shen, Yixuan Liao, Xiaoxin Chen, Gangshan Wu, LiMin Wang

Prompt learning has emerged as an efficient and effective approach for transferring foundational Vision-Language Models (e.g., CLIP) to downstream tasks.

MGMAE: Motion Guided Masking for Video Masked Autoencoding

1 code implementation ICCV 2023 Bingkun Huang, Zhiyu Zhao, Guozhen Zhang, Yu Qiao, LiMin Wang

Based on this masking volume, we can track the unmasked tokens in time and sample a set of temporal consistent cubes from videos.

Optical Flow Estimation Representation Learning

Joint Modeling of Feature, Correspondence, and a Compressed Memory for Video Object Segmentation

no code implementations 25 Aug 2023 Jiaming Zhang, Yutao Cui, Gangshan Wu, LiMin Wang

To overcome these issues, we propose a unified VOS framework, coined JointFormer, for jointly modeling the three elements of feature, correspondence, and a compressed memory.

Semantic Segmentation Video Object Segmentation +1

ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video

1 code implementation 2 Oct 2023 Xinhao Li, LiMin Wang

In this paper, our goal is to present a zero-cost adaptation paradigm (ZeroI2V) to transfer image transformers to video recognition tasks (i.e., introduce zero extra cost to the adapted models during inference).

Ranked #5 on Action Recognition on UCF101 (using extra training data)

Action Classification Action Recognition +1

Bridging The Gaps Between Token Pruning and Full Pre-training via Masked Fine-tuning

no code implementations 26 Oct 2023 Fengyuan Shi, LiMin Wang

Despite the success of transformers on various computer vision tasks, they suffer from excessive memory and computational cost.

Harvest Video Foundation Models via Efficient Post-Pretraining

1 code implementation 30 Oct 2023 Yizhuo Li, Kunchang Li, Yinan He, Yi Wang, Yali Wang, LiMin Wang, Yu Qiao, Ping Luo

Building video-language foundation models is costly and difficult due to the redundant nature of video data and the lack of high-quality video-language datasets.

Question Answering Text Retrieval +2

Asymmetric Masked Distillation for Pre-Training Small Foundation Models

no code implementations 6 Nov 2023 Zhiyu Zhao, Bingkun Huang, Sen Xing, Gangshan Wu, Yu Qiao, LiMin Wang

AMD achieves 73.3% classification accuracy using the ViT-B model on the Something-Something V2 dataset, a 3.7% improvement over the original ViT-B model from VideoMAE.

Action Classification Action Recognition +3

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

1 code implementation 28 Nov 2023 Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, LiMin Wang, Yu Qiao

With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models.

Fairness Multiple-choice +8

VBench: Comprehensive Benchmark Suite for Video Generative Models

1 code implementation 29 Nov 2023 Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, LiMin Wang, Dahua Lin, Yu Qiao, Ziwei Liu

We will open-source VBench, including all prompts, evaluation methods, generated videos, and human preference annotations, and also include more video generation models in VBench to drive forward the field of video generation.

Image Generation Video Generation

Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering

1 code implementation 30 Nov 2023 Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, LiMin Wang, Dahua Lin, Bo Dai

Neural rendering methods have significantly advanced photo-realistic 3D scene rendering in various academic and industrial applications.

Neural Rendering

Adapting Short-Term Transformers for Action Detection in Untrimmed Videos

no code implementations 4 Dec 2023 Min Yang, Huan Gao, Ping Guo, LiMin Wang

To this end, we design effective cross-snippet propagation modules to gradually exchange short-term video information among different snippets from two levels.

Action Detection Video Recognition

BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models

1 code implementation 5 Dec 2023 Fengyuan Shi, Jiaxi Gu, Hang Xu, Songcen Xu, Wei zhang, LiMin Wang

Text-to-image foundation models are now widely applied to various downstream image synthesis tasks, such as controllable image generation and image editing, while downstream video synthesis tasks remain less explored for several reasons.

Image Generation Model Selection +3

MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding

no code implementations 8 Dec 2023 Hongjie Zhang, Yi Liu, Lu Dong, Yifei HUANG, Zhen-Hua Ling, Yali Wang, LiMin Wang, Yu Qiao

While several long-form VideoQA datasets have been introduced, the length of both the videos used to curate questions and the sub-clips of clues leveraged to answer those questions has not yet reached the criteria for genuine long-form video understanding.

Question Answering Video Question Answering +1

Data-efficient Event Camera Pre-training via Disentangled Masked Modeling

no code implementations 1 Mar 2024 Zhenpeng Huang, Chao Li, Hao Chen, Yongjian Deng, Yifeng Geng, LiMin Wang

Our pre-training overcomes the limitations of previous methods, which either sacrifice temporal information by converting event sequences into 2D images to utilize pre-trained image models, or directly employ paired image data for knowledge distillation to enhance the learning of event streams.

Knowledge Distillation Self-Supervised Learning

StableDrag: Stable Dragging for Point-based Image Editing

no code implementations 7 Mar 2024 Yutao Cui, Xiaotong Zhao, Guozhen Zhang, Shengming Cao, Kai Ma, LiMin Wang

Point-based image editing has attracted remarkable attention since the emergence of DragGAN.

Point Tracking

VideoMamba: State Space Model for Efficient Video Understanding

3 code implementations 11 Mar 2024 Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, LiMin Wang, Yu Qiao

Addressing the dual challenges of local redundancy and global dependencies in video understanding, this work innovatively adapts the Mamba to the video domain.

Video Understanding

Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding

1 code implementation 14 Mar 2024 Guo Chen, Yifei HUANG, Jilan Xu, Baoqi Pei, Zhe Chen, Zhiqi Li, Jiahao Wang, Kunchang Li, Tong Lu, LiMin Wang

We categorize Mamba into four roles for modeling videos, deriving a Video Mamba Suite composed of 14 models/modules, and evaluating them on 12 video understanding tasks.

Moment Retrieval Temporal Action Localization +1

Contextual AD Narration with Interleaved Multimodal Sequence

no code implementations 19 Mar 2024 Hanlin Wang, Zhan Tong, Kecheng Zheng, Yujun Shen, LiMin Wang

With video features, text, a character bank, and context information as inputs, the generated ADs can refer to characters by name and provide reasonable, contextual descriptions that help the audience understand the movie's storyline.

InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding

2 code implementations 22 Mar 2024 Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei HUANG, Yu Qiao, Yali Wang, LiMin Wang

We introduce InternVideo2, a new video foundation model (ViFM) that achieves the state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.

 Ranked #1 on Audio Classification on ESC-50 (using extra training data)

Action Classification Action Recognition +12

EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World

1 code implementation 24 Mar 2024 Yifei HUANG, Guo Chen, Jilan Xu, Mingfang Zhang, Lijin Yang, Baoqi Pei, Hongjie Zhang, Lu Dong, Yali Wang, LiMin Wang, Yu Qiao

Along with the videos we record high-quality gaze data and provide detailed multimodal annotations, formulating a playground for modeling the human ability to bridge asynchronous procedural actions from different viewpoints.

Multiple Object Tracking as ID Prediction

1 code implementation 25 Mar 2024 Ruopeng Gao, Yijun Zhang, LiMin Wang

In Multiple Object Tracking (MOT), tracking-by-detection methods have stood the test for a long time, which split the process into two parts according to the definition: object detection and association.

 Ranked #1 on Multi-Object Tracking on DanceTrack (using extra training data)

Multi-Object Tracking Multiple Object Tracking +3

Dual DETRs for Multi-Label Temporal Action Detection

no code implementations 31 Mar 2024 Yuhan Zhu, Guozhen Zhang, Jing Tan, Gangshan Wu, LiMin Wang

To address this issue, we propose a new Dual-level query-based TAD framework, namely DualDETR, to detect actions from both instance-level and boundary-level.

Action Detection object-detection +1

SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos

no code implementations 6 Apr 2024 Tao Wu, Runyu He, Gangshan Wu, LiMin Wang

We hope that SportsHHI can stimulate research on human interaction understanding in videos and promote the development of spatio-temporal context modeling techniques in video visual relation detection.

Graph Generation Relation +4

STMixer: A One-Stage Sparse Action Detector

no code implementations 15 Apr 2024 Tao Wu, Mengqi Cao, Ziteng Gao, Gangshan Wu, LiMin Wang

First, we present a query-based adaptive feature sampling module, which endows the detector with the flexibility of mining a group of discriminative features from the entire spatio-temporal domain.

Action Detection
