VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

1 code implementation 12 Jun 2024 Jiannan Wu, Muyan Zhong, Sen Xing, Zeqiang Lai, Zhaoyang Liu, Wenhai Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu, Ping Luo, Yu Qiao, Jifeng Dai

It not only allows flexible transmission of task information and gradient feedback between the MLLM and multiple downstream decoders but also effectively resolves training conflicts in multi-tasking scenarios.

Image Generation Language Modelling +5

Video Mamba Suite: State Space Model as a Versatile Alternative for Video Understanding

1 code implementation 14 Mar 2024 Guo Chen, Yifei HUANG, Jilan Xu, Baoqi Pei, Zhe Chen, Zhiqi Li, Jiahao Wang, Kunchang Li, Tong Lu, LiMin Wang

We categorize Mamba into four roles for modeling videos, deriving a Video Mamba Suite composed of 14 models/modules, and evaluating them on 12 video understanding tasks.

Moment Retrieval Temporal Action Localization +1

Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures

1 code implementation 4 Mar 2024 Yuchen Duan, Weiyun Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Hongsheng Li, Jifeng Dai, Wenhai Wang

Our evaluations demonstrate that VRWKV surpasses ViT's performance in image classification and has significantly faster speeds and lower memory usage processing high-resolution inputs.

Image Classification

PromptRR: Diffusion Models as Prompt Generators for Single Image Reflection Removal

1 code implementation 4 Feb 2024 Tao Wang, Wanglong Lu, Kaihao Zhang, Wenhan Luo, Tae-Kyun Kim, Tong Lu, Hongdong Li, Ming-Hsuan Yang

For the prompt generation, we first propose a prompt pre-training strategy to train a frequency prompt encoder that encodes the ground-truth image into LF and HF prompts.

Reflection Removal

Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications

1 code implementation CVPR 2024 Yuwen Xiong, Zhiqi Li, Yuntao Chen, Feng Wang, Xizhou Zhu, Jiapeng Luo, Wenhai Wang, Tong Lu, Hongsheng Li, Yu Qiao, Lewei Lu, Jie zhou, Jifeng Dai

The advancements in speed and efficiency of DCNv4, combined with its robust performance across diverse vision tasks, show its potential as a foundational building block for future vision models.

Image Classification Image Generation +1

CRA-PCN: Point Cloud Completion with Intra- and Inter-level Cross-Resolution Transformers

1 code implementation 3 Jan 2024 Yi Rong, Haoran Zhou, Lixin Yuan, Cheng Mei, Jiahao Wang, Tong Lu

Point cloud completion is an indispensable task for recovering complete point clouds due to incompleteness caused by occlusion, limited sensor resolution, etc.

Point Cloud Completion

RepKPU: Point Cloud Upsampling with Kernel Point Representation and Deformation

no code implementations CVPR 2024 Yi Rong, Haoran Zhou, Kang Xia, Cheng Mei, Jiahao Wang, Tong Lu

Moreover, we propose a novel paradigm, namely Kernel-to-Displacement generation, for point generation, where point cloud upsampling is reformulated as the deformation of kernel points.

point cloud upsampling

Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?

1 code implementation CVPR 2024 Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, Jose M. Alvarez

We initially observed that the nuScenes dataset, characterized by relatively simple driving scenarios, leads to an under-utilization of perception information in end-to-end models incorporating ego status, such as the ego vehicle's velocity.

Autonomous Driving

Deep Video Restoration for Under-Display Camera

no code implementations 9 Sep 2023 Xuanxi Chen, Tao Wang, Ziqian Shao, Kaihao Zhang, Wenhan Luo, Tong Lu, Zikun Liu, Tae-Kyun Kim, Hongdong Li

With the pipeline, we build the first large-scale UDC video restoration dataset called PexelsUDC, which includes two subsets named PexelsUDC-T and PexelsUDC-P corresponding to different displays for UDC.

Video Restoration

Memory-and-Anticipation Transformer for Online Action Understanding

1 code implementation ICCV 2023 Jiahao Wang, Guo Chen, Yifei HUANG, LiMin Wang, Tong Lu

Based on this idea, we present Memory-and-Anticipation Transformer (MAT), a memory-anticipation-based approach, to address the online action detection and anticipation tasks.

Action Understanding Online Action Detection

AVSegFormer: Audio-Visual Segmentation with Transformer

1 code implementation 3 Jul 2023 Shengyi Gao, Zhe Chen, Guo Chen, Wenhai Wang, Tong Lu

In this paper, we propose AVSegFormer, a novel framework for AVS tasks that leverages the transformer architecture.

Decoder Scene Understanding +1

VideoLLM: Modeling Video Sequence with Large Language Models

1 code implementation 22 May 2023 Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei HUANG, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, LiMin Wang

Building upon this insight, we propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs from natural language processing (NLP) for video sequence understanding.

Decoder Video Understanding

Graph Propagation Transformer for Graph Representation Learning

1 code implementation 19 May 2023 Zhe Chen, Hao Tan, Tao Wang, Tianrun Shen, Tong Lu, Qiuying Peng, Cheng Cheng, Yue Qi

The core insight of our method is to fully consider the information propagation among nodes and edges in a graph when building the attention module in the transformer blocks.

Ranked #2 on Graph Regression on PCQM4M-LSC (Validation MAE metric)

Graph Learning Graph Property Prediction +3

MRSN: Multi-Relation Support Network for Video Action Detection

no code implementations 24 Apr 2023 Yin-Dong Zheng, Guo Chen, Minglei Yuan, Tong Lu

Action detection is a challenging video understanding task, requiring modeling spatio-temporal and interaction relations.

Action Detection Relation +1

Champion Solution for the WSDM2023 Toloka VQA Challenge

1 code implementation 22 Jan 2023 Shengyi Gao, Zhe Chen, Guo Chen, Wenhai Wang, Tong Lu

In this report, we present our champion solution to the WSDM2023 Toloka Visual Question Answering (VQA) Challenge.

Question Answering Visual Grounding +1

Ultra-High-Definition Low-Light Image Enhancement: A Benchmark and Transformer-Based Method

1 code implementation 22 Dec 2022 Tao Wang, Kaihao Zhang, Tianrun Shen, Wenhan Luo, Bjorn Stenger, Tong Lu

In this paper, we consider the task of low-light image enhancement (LLIE) and introduce a large-scale database consisting of images at 4K and 8K resolution.

4k 8k +3

InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

3 code implementations CVPR 2023 Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, Xiaogang Wang, Yu Qiao

Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still in an early state.

Ranked #1 on Instance Segmentation on COCO test-dev (AP50 metric, using extra training data)

Classification Image Classification +3

Incremental Few-Shot Semantic Segmentation via Embedding Adaptive-Update and Hyper-class Representation

no code implementations 26 Jul 2022 Guangchen Shi, Yirui Wu, Jun Liu, Shaohua Wan, Wenhai Wang, Tong Lu

Second, to resist overfitting issues caused by few training samples, a hyper-class embedding is learned by clustering all category embeddings for initialization and aligned with category embedding of the new class for enhancement, where learned knowledge assists to learn new knowledge, thus alleviating performance dependence on training data scale.

Few-Shot Semantic Segmentation Segmentation +1

SeedFormer: Patch Seeds based Point Cloud Completion with Upsample Transformer

1 code implementation 21 Jul 2022 Haoran Zhou, Yun Cao, Wenqing Chu, Junwei Zhu, Tong Lu, Ying Tai, Chengjie Wang

Point cloud completion has become increasingly popular among generation tasks of 3D point clouds, as it is a challenging yet indispensable problem to recover the complete shape of a 3D object from its partial observation.

Point Cloud Completion

Uncertainty-based Network for Few-shot Image Classification

no code implementations 17 May 2022 Minglei Yuan, Qian Xu, Chunhao Cai, Yin-Dong Zheng, Tao Wang, Tong Lu

Specifically, we first data augment and classify the query instance and calculate the mutual information of these classification scores.

Classification Few-Shot Image Classification +1

BasicTAD: an Astounding RGB-Only Baseline for Temporal Action Detection

2 code implementations 5 May 2022 Min Yang, Guo Chen, Yin-Dong Zheng, Tong Lu, LiMin Wang

Empirical results demonstrate that our PlusTAD is very efficient and significantly outperforms the previous methods on the datasets of THUMOS14 and FineAction.

Action Detection object-detection +3

Refine-Net: Normal Refinement Neural Network for Noisy Point Clouds

1 code implementation 23 Mar 2022 Haoran Zhou, Honghua Chen, Yingkui Zhang, Mingqiang Wei, Haoran Xie, Jun Wang, Tong Lu, Jing Qin, Xiao-Ping Zhang

Differently, our network is designed to refine the initial normal of each point by extracting additional information from multiple feature representations.

DCAN: Improving Temporal Action Detection via Dual Context Aggregation

1 code implementation 7 Dec 2021 Guo Chen, Yin-Dong Zheng, LiMin Wang, Tong Lu

Specifically, we design the Multi-Path Temporal Context Aggregation (MTCA) to achieve smooth context aggregation on boundary level and precise evaluation of boundaries.

Action Detection Temporal Action Localization

Spectrum-to-Kernel Translation for Accurate Blind Image Super-Resolution

no code implementations NeurIPS 2021 Guangpin Tao, Xiaozhong Ji, Wenzhuo Wang, Shuo Chen, Chuming Lin, Yun Cao, Tong Lu, Donghao Luo, Ying Tai

In this paper, we propose a novel blind SR framework to super-resolve LR images degraded by arbitrary blur kernel with accurate kernel estimation in frequency domain.

Image Super-Resolution Translation

Learning Class-level Prototypes for Few-shot Learning

no code implementations 25 Aug 2021 Minglei Yuan, Wenhai Wang, Tao Wang, Chunhao Cai, Qian Xu, Tong Lu

Few-shot learning aims to recognize new categories using very few labeled samples.

Few-Shot Learning

PAN++: Towards Efficient and Accurate End-to-End Spotting of Arbitrarily-Shaped Text

1 code implementation 2 May 2021 Wenhai Wang, Enze Xie, Xiang Li, Xuebo Liu, Ding Liang, Zhibo Yang, Tong Lu, Chunhua Shen

By systematically comparing with existing scene text representations, we show that our kernel representation can not only describe arbitrarily-shaped text but also well distinguish adjacent text.

Scene Text Detection Text Detection +1

An Introduction of mini-AlphaStar

1 code implementation 14 Apr 2021 Ruo-Ze Liu, Wenhai Wang, Yanjie Shen, Zhiqi Li, Yang Yu, Tong Lu

StarCraft II (SC2) is a real-time strategy game in which players produce and control multiple units to fight against opponent's units.

Starcraft Starcraft II

Towards Ultra-Resolution Neural Style Transfer via Thumbnail Instance Normalization

1 code implementation 22 Mar 2021 Zhe Chen, Wenhai Wang, Enze Xie, Tong Lu, Ping Luo

(1) We divide input image into small patches and adopt TIN, successfully transferring image style with arbitrary high-resolution.

Style Transfer

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

10 code implementations ICCV 2021 Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao

Unlike the recently-proposed Transformer model (e.g., ViT) that is specially designed for image classification, we propose Pyramid Vision Transformer (PVT), which overcomes the difficulties of porting Transformer to various dense prediction tasks.

Image Classification Instance Segmentation +3

Frequency Consistent Adaptation for Real World Super Resolution

no code implementations 18 Dec 2020 Xiaozhong Ji, Guangpin Tao, Yun Cao, Ying Tai, Tong Lu, Chengjie Wang, Jilin Li, Feiyue Huang

From this point of view, we design a novel Frequency Consistent Adaptation (FCA) that ensures the frequency domain consistency when applying existing SR methods to the real scene.


AE TextSpotter: Learning Visual and Linguistic Representation for Ambiguous Text Spotting

2 code implementations ECCV 2020 Wenhai Wang, Xuebo Liu, Xiaozhong Ji, Enze Xie, Ding Liang, Zhibo Yang, Tong Lu, Chunhua Shen, Ping Luo

Unlike previous works that merely employed visual features for text detection, this work proposes a novel text spotter, named Ambiguity Eliminating Text Spotter (AE TextSpotter), which learns both visual and linguistic features to significantly reduce ambiguity in text detection.

Language Modelling Sentence +2

Dynamic Sampling Networks for Efficient Action Recognition in Videos

no code implementations 28 Jun 2020 Yin-Dong Zheng, Zhao-Yang Liu, Tong Lu, Li-Min Wang

The existing action recognition methods are mainly based on clip-level classifiers such as two-stream CNNs or 3D CNNs, which are trained from the randomly selected clips and applied to densely sampled clips during testing.

Action Recognition In Videos

Channel Relationship Prediction with Forget-Update Module for Few-shot Classification

no code implementations 16 Jun 2020 Minglei Yuan, Cunhao Cai, Tong Lu

The proposed pipeline, which consists of channel vector sequence construction module and forget-update module, can infer the relationship between the query sample and support samples in few-shot classification scenario.

General Classification

A New Unified Method for Detecting Text from Marathon Runners and Sports Players in Video

no code implementations 26 May 2020 Sauradip Nag, Palaiahnakote Shivakumara, Umapada Pal, Tong Lu, Michael Blumenstein

The proposed method fuses gradient magnitude and direction coherence of text pixels in a new way for detecting candidate regions.

Clustering Text Detection

TAM: Temporal Adaptive Module for Video Recognition

2 code implementations ICCV 2021 Zhao-Yang Liu, Li-Min Wang, Wayne Wu, Chen Qian, Tong Lu

Video data exhibits complex temporal dynamics due to various factors such as camera motion, speed variation, and different activities.

Action Recognition Video Recognition

TEINet: Towards an Efficient Architecture for Video Recognition

no code implementations 21 Nov 2019 Zhao-Yang Liu, Donghao Luo, Yabiao Wang, Li-Min Wang, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, Tong Lu

To relieve this problem, we propose an efficient temporal module, termed as Temporal Enhancement-and-Interaction (TEI Module), which could be plugged into the existing 2D CNNs (denoted by TEINet).

Action Recognition Video Recognition

Efficient and Accurate Arbitrary-Shaped Text Detection with Pixel Aggregation Network

6 code implementations ICCV 2019 Wenhai Wang, Enze Xie, Xiaoge Song, Yuhang Zang, Wenjia Wang, Tong Lu, Gang Yu, Chunhua Shen

Recently, some methods have been proposed to tackle arbitrary-shaped text detection, but they rarely take the speed of the entire pipeline into consideration, which may fall short in practical applications. In this paper, we propose an efficient and accurate arbitrary-shaped text detector, termed Pixel Aggregation Network (PAN), which is equipped with a low computational-cost segmentation head and a learnable post-processing.

Scene Text Detection Segmentation +1

Shape Robust Text Detection with Progressive Scale Expansion Network

19 code implementations CVPR 2019 Wenhai Wang, Enze Xie, Xiang Li, Wenbo Hou, Tong Lu, Gang Yu, Shuai Shao

Due to the fact that there are large geometrical margins among the minimal scale kernels, our method is effective to split the close text instances, making it easier to use segmentation-based methods to detect arbitrary-shaped text instances.

Optical Character Recognition (OCR) Scene Text Detection +1

A New COLD Feature based Handwriting Analysis for Ethnicity/Nationality Identification

no code implementations 19 Jun 2018 Sauradip Nag, Palaiahnakote Shivakumara, Wu Yirui, Umapada Pal, Tong Lu

For each line segment, the proposed method estimates angle and length, which gives a point in polar domain.

Shape Robust Text Detection with Progressive Scale Expansion Network

9 code implementations 7 Jun 2018 Xiang Li, Wenhai Wang, Wenbo Hou, Ruo-Ze Liu, Tong Lu, Jian Yang

To address these problems, we propose a novel Progressive Scale Expansion Network (PSENet), designed as a segmentation-based detector with multiple predictions for each text instance.

Curved Text Detection Text Detection

Mixed Link Networks

1 code implementation 6 Feb 2018 Wenhai Wang, Xiang Li, Jian Yang, Tong Lu

Basing on the analysis by revealing the equivalence of modern networks, we find that both ResNet and DenseNet are essentially derived from the same "dense topology", yet they only differ in the form of connection -- addition (dubbed "inner link") vs. concatenation (dubbed "outer link").

Representation Learning

Temporal Action Localization by Structured Maximal Sums

no code implementations CVPR 2017 Zehuan Yuan, Jonathan C. Stroud, Tong Lu, Jia Deng

We pose action localization as a structured prediction over arbitrary-length temporal windows, where each window is scored as the sum of frame-wise classification scores.

Action Detection General Classification +2

