1 code implementation • 25 Apr 2024 • Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Botian Shi, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan, Min Dou, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao, Jifeng Dai, Wenhai Wang
Compared to both open-source and proprietary models, InternVL 1. 5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks.
Ranked #6 on Visual Question Answering on MM-Vet
1 code implementation • 14 Mar 2024 • Guo Chen, Yifei HUANG, Jilan Xu, Baoqi Pei, Zhe Chen, Zhiqi Li, Jiahao Wang, Kunchang Li, Tong Lu, LiMin Wang
We categorize Mamba into four roles for modeling videos, deriving a Video Mamba Suite composed of 14 models/modules, and evaluating them on 12 video understanding tasks.
Ranked #1 on Temporal Action Localization on FineAction
1 code implementation • 4 Mar 2024 • Yuchen Duan, Weiyun Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Hongsheng Li, Jifeng Dai, Wenhai Wang
Our evaluations demonstrate that VRWKV surpasses ViT's performance in image classification and has significantly faster speeds and lower memory usage processing high-resolution inputs.
1 code implementation • 4 Feb 2024 • Tao Wang, Wanglong Lu, Kaihao Zhang, Wenhan Luo, Tae-Kyun Kim, Tong Lu, Hongdong Li, Ming-Hsuan Yang
For the prompt generation, we first propose a prompt pre-training strategy to train a frequency prompt encoder that encodes the ground-truth image into LF and HF prompts.
1 code implementation • 18 Jan 2024 • Changyao Tian, Xizhou Zhu, Yuwen Xiong, Weiyun Wang, Zhe Chen, Wenhai Wang, Yuntao Chen, Lewei Lu, Tong Lu, Jie zhou, Hongsheng Li, Yu Qiao, Jifeng Dai
Developing generative models for interleaved image-text data has both research and practical value.
1 code implementation • 11 Jan 2024 • Yuwen Xiong, Zhiqi Li, Yuntao Chen, Feng Wang, Xizhou Zhu, Jiapeng Luo, Wenhai Wang, Tong Lu, Hongsheng Li, Yu Qiao, Lewei Lu, Jie zhou, Jifeng Dai
The advancements in speed and efficiency of DCNv4, combined with its robust performance across diverse vision tasks, show its potential as a foundational building block for future vision models.
1 code implementation • 3 Jan 2024 • Yi Rong, Haoran Zhou, Lixin Yuan, Cheng Mei, Jiahao Wang, Tong Lu
Point cloud completion is an indispensable task for recovering complete point clouds due to incompleteness caused by occlusion, limited sensor resolution, etc.
2 code implementations • 21 Dec 2023 • Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, Jifeng Dai
However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs.
Ranked #1 on Zero-Shot Video Retrieval on MSR-VTT-full (using extra training data)
1 code implementation • 5 Dec 2023 • Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, Jose M. Alvarez
We initially observed that the nuScenes dataset, characterized by relatively simple driving scenarios, leads to an under-utilization of perception information in end-to-end models incorporating ego status, such as the ego vehicle's velocity.
no code implementations • 9 Sep 2023 • Xuanxi Chen, Tao Wang, Ziqian Shao, Kaihao Zhang, Wenhan Luo, Tong Lu, Zikun Liu, Tae-Kyun Kim, Hongdong Li
With the pipeline, we build the first large-scale UDC video restoration dataset called PexelsUDC, which includes two subsets named PexelsUDC-T and PexelsUDC-P corresponding to different displays for UDC.
1 code implementation • ICCV 2023 • Jiahao Wang, Guo Chen, Yifei HUANG, LiMin Wang, Tong Lu
Based on this idea, we present Memory-and-Anticipation Transformer (MAT), a memory-anticipation-based approach, to address the online action detection and anticipation tasks.
Ranked #1 on Action Detection on THUMOS' 14
1 code implementation • ICCV 2023 • Zhiqi Li, Zhiding Yu, Wenhai Wang, Anima Anandkumar, Tong Lu, Jose M. Alvarez
Currently, the two most prominent VTM paradigms are forward projection and backward projection.
1 code implementation • 3 Aug 2023 • Weiyun Wang, Min Shi, Qingyun Li, Wenhai Wang, Zhenhang Huang, Linjie Xing, Zhe Chen, Hao Li, Xizhou Zhu, Zhiguo Cao, Yushi Chen, Tong Lu, Jifeng Dai, Yu Qiao
We present the All-Seeing (AS) project: a large-scale data and model for recognizing and understanding everything in the open world.
1 code implementation • 3 Jul 2023 • Shengyi Gao, Zhe Chen, Guo Chen, Wenhai Wang, Tong Lu
In this paper, we propose AVSegFormer, a novel framework for AVS tasks that leverages the transformer architecture.
no code implementations • 29 May 2023 • Tao Wang, Kaihao Zhang, Ziqian Shao, Wenhan Luo, Bjorn Stenger, Tong Lu, Tae-Kyun Kim, Wei Liu, Hongdong Li
Second, we introduce a residual dense transformer block (RDTB) as the final GridFormer layer.
1 code implementation • 22 May 2023 • Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei HUANG, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, LiMin Wang
Building upon this insight, we propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs from natural language processing (NLP) for video sequence understanding.
1 code implementation • 19 May 2023 • Zhe Chen, Hao Tan, Tao Wang, Tianrun Shen, Tong Lu, Qiuying Peng, Cheng Cheng, Yue Qi
The core insight of our method is to fully consider the information propagation among nodes and edges in a graph when building the attention module in the transformer blocks.
Ranked #2 on Graph Regression on PCQM4M-LSC (Validation MAE metric)
2 code implementations • NeurIPS 2023 • Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie zhou, Yu Qiao, Jifeng Dai
We hope this model can set a new baseline for generalist vision and language models.
no code implementations • 24 Apr 2023 • Yin-Dong Zheng, Guo Chen, Minglei Yuan, Tong Lu
Action detection is a challenging video understanding task, requiring modeling spatio-temporal and interaction relations.
1 code implementation • ICCV 2023 • Yuanfeng Ji, Zhe Chen, Enze Xie, Lanqing Hong, Xihui Liu, Zhaoqiang Liu, Tong Lu, Zhenguo Li, Ping Luo
We propose a simple, efficient, yet powerful framework for dense visual predictions based on the conditional diffusion pipeline.
Ranked #2 on Monocular Depth Estimation on SUN-RGBD
1 code implementation • 22 Jan 2023 • Shengyi Gao, Zhe Chen, Guo Chen, Wenhai Wang, Tong Lu
In this report, we present our champion solution to the WSDM2023 Toloka Visual Question Answering (VQA) Challenge.
1 code implementation • 22 Dec 2022 • Tao Wang, Kaihao Zhang, Tianrun Shen, Wenhan Luo, Bjorn Stenger, Tong Lu
In this paper, we consider the task of low-light image enhancement (LLIE) and introduce a large-scale database consisting of images at 4K and 8K resolution.
no code implementations • 22 Dec 2022 • Tao Wang, Guangpin Tao, Wanglong Lu, Kaihao Zhang, Wenhan Luo, Xiaoqin Zhang, Tong Lu
HCD consists of a hierarchical dehazing network (HDN) and a novel hierarchical contrastive loss (HCL).
2 code implementations • 17 Nov 2022 • Guo Chen, Sen Xing, Zhe Chen, Yi Wang, Kunchang Li, Yizhuo Li, Yi Liu, Jiahao Wang, Yin-Dong Zheng, Bingkun Huang, Zhiyu Zhao, Junting Pan, Yifei HUANG, Zun Wang, Jiashuo Yu, Yinan He, Hongjie Zhang, Tong Lu, Yali Wang, LiMin Wang, Yu Qiao
In this report, we present our champion solutions to five tracks at Ego4D challenge.
Ranked #1 on State Change Object Detection on Ego4D
no code implementations • 16 Nov 2022 • Yin-Dong Zheng, Guo Chen, Jiahao Wang, Tong Lu, LiMin Wang
Our method achieves an accuracy of 0. 796 on OSCC while achieving an absolute temporal localization error of 0. 516 on PNR.
2 code implementations • CVPR 2023 • Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, Xiaogang Wang, Yu Qiao
Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still in an early state.
Ranked #1 on Instance Segmentation on COCO test-dev (AP50 metric, using extra training data)
1 code implementation • 5 Nov 2022 • Tao Wang, Kaihao Zhang, Xuanxi Chen, Wenhan Luo, Jiankang Deng, Tong Lu, Xiaochun Cao, Wei Liu, Hongdong Li, Stefanos Zafeiriou
Second, we discuss the challenges of face restoration.
2 code implementations • 23 Sep 2022 • Ruo-Ze Liu, Zhen-Jia Pang, Zhou-Yu Meng, Wenhai Wang, Yang Yu, Tong Lu
In this work, we investigate a set of RL techniques for the full-length game of StarCraft II.
no code implementations • 26 Jul 2022 • Guangchen Shi, Yirui Wu, Jun Liu, Shaohua Wan, Wenhai Wang, Tong Lu
Second, to resist overfitting issues caused by few training samples, a hyper-class embedding is learned by clustering all category embeddings for initialization and aligned with category embedding of the new class for enhancement, where learned knowledge assists to learn new knowledge, thus alleviating performance dependence on training data scale.
1 code implementation • 21 Jul 2022 • Haoran Zhou, Yun Cao, Wenqing Chu, Junwei Zhu, Tong Lu, Ying Tai, Chengjie Wang
Point cloud completion has become increasingly popular among generation tasks of 3D point clouds, as it is a challenging yet indispensable problem to recover the complete shape of a 3D object from its partial observation.
Ranked #7 on Point Cloud Completion on Completion3D
1 code implementation • 17 May 2022 • Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, Yu Qiao
This work investigates a simple yet powerful dense prediction task adapter for Vision Transformer (ViT).
Ranked #4 on Semantic Segmentation on PASCAL Context
no code implementations • 17 May 2022 • Minglei Yuan, Qian Xu, Chunhao Cai, Yin-Dong Zheng, Tao Wang, Tong Lu
Specifically, we first data augment and classify the query instance and calculate the mutual information of these classification scores.
2 code implementations • 5 May 2022 • Min Yang, Guo Chen, Yin-Dong Zheng, Tong Lu, LiMin Wang
Empirical results demonstrate that our PlusTAD is very efficient and significantly outperforms the previous methods on the datasets of THUMOS14 and FineAction.
Ranked #1 on Temporal Action Localization on THUMOS14
3 code implementations • 31 Mar 2022 • Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, Jifeng Dai
In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with spatial and temporal space through predefined grid-shaped BEV queries.
1 code implementation • 23 Mar 2022 • Haoran Zhou, Honghua Chen, Yingkui Zhang, Mingqiang Wei, Haoran Xie, Jun Wang, Tong Lu, Jing Qin, Xiao-Ping Zhang
Differently, our network is designed to refine the initial normal of each point by extracting additional information from multiple feature representations.
1 code implementation • 7 Dec 2021 • Guo Chen, Yin-Dong Zheng, LiMin Wang, Tong Lu
Specifically, we design the Multi-Path Temporal Context Aggregation (MTCA) to achieve smooth context aggregation on boundary level and precise evaluation of boundaries.
Ranked #19 on Temporal Action Localization on ActivityNet-1.3
2 code implementations • 3 Nov 2021 • Zhe Chen, Jiahao Wang, Wenhai Wang, Guo Chen, Enze Xie, Ping Luo, Tong Lu
We propose an accurate and efficient scene text detection framework, termed FAST (i. e., faster arbitrarily-shaped text detector).
Ranked #2 on Scene Text Detection on MSRA-TD500
no code implementations • NeurIPS 2021 • Guangpin Tao, Xiaozhong Ji, Wenzhuo Wang, Shuo Chen, Chuming Lin, Yun Cao, Tong Lu, Donghao Luo, Ying Tai
In this paper, we propose a novel blind SR framework to super-resolve LR images degraded by arbitrary blur kernel with accurate kernel estimation in frequency domain.
no code implementations • 20 Oct 2021 • Humen Zhong, Jun Tang, Wenhai Wang, Zhibo Yang, Cong Yao, Tong Lu
Recent approaches for end-to-end text spotting have achieved promising results.
2 code implementations • CVPR 2022 • Zhiqi Li, Wenhai Wang, Enze Xie, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo, Tong Lu
Specifically, we supervise the attention modules in the mask decoder in a layer-wise manner.
Ranked #4 on Panoptic Segmentation on COCO test-dev
no code implementations • 25 Aug 2021 • Minglei Yuan, Wenhai Wang, Tao Wang, Chunhao Cai, Qian Xu, Tong Lu
Few-shot learning aims to recognize new categories using very few labeled samples.
1 code implementation • ICCV 2021 • Haoran Zhou, Yidan Feng, Mingsheng Fang, Mingqiang Wei, Jing Qin, Tong Lu
Convolution on 3D point clouds that generalized from 2D grid-like domains is widely researched yet far from perfect.
Ranked #10 on 3D Point Cloud Classification on IntrA
16 code implementations • 25 Jun 2021 • Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao
We hope this work will facilitate state-of-the-art Transformer researches in computer vision.
Ranked #23 on Object Detection on COCO-O
1 code implementation • 2 May 2021 • Wenhai Wang, Enze Xie, Xiang Li, Xuebo Liu, Ding Liang, Zhibo Yang, Tong Lu, Chunhua Shen
By systematically comparing with existing scene text representations, we show that our kernel representation can not only describe arbitrarily-shaped text but also well distinguish adjacent text.
1 code implementation • 14 Apr 2021 • Ruo-Ze Liu, Wenhai Wang, Yanjie Shen, Zhiqi Li, Yang Yu, Tong Lu
StarCraft II (SC2) is a real-time strategy game in which players produce and control multiple units to fight against opponent's units.
1 code implementation • 22 Mar 2021 • Zhe Chen, Wenhai Wang, Enze Xie, Tong Lu, Ping Luo
(1) We divide input image into small patches and adopt TIN, successfully transferring image style with arbitrary high-resolution.
9 code implementations • ICCV 2021 • Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao
Unlike the recently-proposed Transformer model (e. g., ViT) that is specially designed for image classification, we propose Pyramid Vision Transformer~(PVT), which overcomes the difficulties of porting Transformer to various dense prediction tasks.
Ranked #5 on Semantic Segmentation on SynPASS
no code implementations • 18 Dec 2020 • Xiaozhong Ji, Guangpin Tao, Yun Cao, Ying Tai, Tong Lu, Chengjie Wang, Jilin Li, Feiyue Huang
From this point of view, we design a novel Frequency Consistent Adaptation (FCA) that ensures the frequency domain consistency when applying existing SR methods to the real scene.
2 code implementations • ECCV 2020 • Wenhai Wang, Xuebo Liu, Xiaozhong Ji, Enze Xie, Ding Liang, Zhibo Yang, Tong Lu, Chunhua Shen, Ping Luo
Unlike previous works that merely employed visual features for text detection, this work proposes a novel text spotter, named Ambiguity Eliminating Text Spotter (AE TextSpotter), which learns both visual and linguistic features to significantly reduce ambiguity in text detection.
no code implementations • 28 Jun 2020 • Yin-Dong Zheng, Zhao-Yang Liu, Tong Lu, Li-Min Wang
The existing action recognition methods are mainly based on clip-level classifiers such as two-stream CNNs or 3D CNNs, which are trained from the randomly selected clips and applied to densely sampled clips during testing.
Ranked #9 on Action Recognition on ActivityNet
no code implementations • 16 Jun 2020 • Minglei Yuan, Cunhao Cai, Tong Lu
The proposed pipeline, which consists of channel vector sequence construction module and forget-update module, can infer the relationship between the query sample and support samples in few-shot classification scenario.
no code implementations • 26 May 2020 • Sauradip Nag, Palaiahnakote Shivakumara, Umapada Pal, Tong Lu, Michael Blumenstein
The proposed method fuses gradient magnitude and direction coherence of text pixels in a new way for detecting candidate regions.
2 code implementations • ICCV 2021 • Zhao-Yang Liu, Li-Min Wang, Wayne Wu, Chen Qian, Tong Lu
Video data is with complex temporal dynamics due to various factors such as camera motion, speed variation, and different activities.
no code implementations • 21 Nov 2019 • Zhao-Yang Liu, Donghao Luo, Yabiao Wang, Li-Min Wang, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, Tong Lu
To relieve this problem, we propose an efficient temporal module, termed as Temporal Enhancement-and-Interaction (TEI Module), which could be plugged into the existing 2D CNNs (denoted by TEINet).
6 code implementations • ICCV 2019 • Wenhai Wang, Enze Xie, Xiaoge Song, Yuhang Zang, Wenjia Wang, Tong Lu, Gang Yu, Chunhua Shen
Recently, some methods have been proposed to tackle arbitrary-shaped text detection, but they rarely take the speed of the entire pipeline into consideration, which may fall short in practical applications. In this paper, we propose an efficient and accurate arbitrary-shaped text detector, termed Pixel Aggregation Network (PAN), which is equipped with a low computational-cost segmentation head and a learnable post-processing.
Ranked #8 on Scene Text Detection on SCUT-CTW1500
19 code implementations • CVPR 2019 • Wenhai Wang, Enze Xie, Xiang Li, Wenbo Hou, Tong Lu, Gang Yu, Shuai Shao
Due to the fact that there are large geometrical margins among the minimal scale kernels, our method is effective to split the close text instances, making it easier to use segmentation-based methods to detect arbitrary-shaped text instances.
Ranked #12 on Scene Text Detection on SCUT-CTW1500
1 code implementation • 2 Mar 2019 • Ruo-Ze Liu, Haifeng Guo, Xiaozhong Ji, Yang Yu, Zhen-Jia Pang, Zitai Xiao, Yuzhou Wu, Tong Lu
Injecting human knowledge is an effective way to accelerate reinforcement learning (RL).
no code implementations • 23 Sep 2018 • Zhen-Jia Pang, Ruo-Ze Liu, Zhou-Yu Meng, Yi Zhang, Yang Yu, Tong Lu
The reinforcement training algorithm for this architecture is also investigated.
Hierarchical Reinforcement Learning reinforcement-learning +4
no code implementations • 19 Jun 2018 • Sauradip Nag, Palaiahnakote Shivakumara, Wu Yirui, Umapada Pal, Tong Lu
For each line segment, the proposed method estimates angle and length, which gives a point in polar domain.
9 code implementations • 7 Jun 2018 • Xiang Li, Wenhai Wang, Wenbo Hou, Ruo-Ze Liu, Tong Lu, Jian Yang
To address these problems, we propose a novel Progressive Scale Expansion Network (PSENet), designed as a segmentation-based detector with multiple predictions for each text instance.
Ranked #12 on Scene Text Detection on ICDAR 2017 MLT
1 code implementation • 6 Feb 2018 • Wenhai Wang, Xiang Li, Jian Yang, Tong Lu
Basing on the analysis by revealing the equivalence of modern networks, we find that both ResNet and DenseNet are essentially derived from the same "dense topology", yet they only differ in the form of connection -- addition (dubbed "inner link") vs. concatenation (dubbed "outer link").
no code implementations • CVPR 2017 • Zehuan Yuan, Jonathan C. Stroud, Tong Lu, Jia Deng
We pose action localization as a structured prediction over arbitrary-length temporal windows, where each window is scored as the sum of frame-wise classification scores.