In this work, we introduce InstructSeq, an instruction-conditioned multi-modal modeling framework that unifies diverse vision tasks through flexible natural language control and handling of both visual and textual data.
To the best of our knowledge, Point2RBox is the first end-to-end solution for point-supervised oriented object detection (OOD).
We present ControlLLM, a novel framework that enables large language models (LLMs) to utilize multi-modal tools for solving complex real-world tasks.
The revolution in artificial-intelligence content generation has been rapidly accelerated by the boom in text-to-image (T2I) diffusion models.
We present the All-Seeing (AS) project: a large-scale dataset and model for recognizing and understanding everything in the open world.
This paper introduces a novel transformer-based network architecture, FlowFormer, along with the Masked Cost Volume AutoEncoding (MCVA) for pretraining it to tackle the problem of optical flow estimation.
In each denoising step, our method first decodes pixels from previous VQ tokens, then generates new VQ tokens from the decoded pixels.
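The alternation described here is easy to picture as a loop. Below is a minimal, illustrative PyTorch sketch; decode_pixels-style and generate_tokens-style stand-ins replace the paper's actual VQ decoder and token generator, and all shapes are made up.

    import torch
    import torch.nn as nn

    VOCAB, N_TOKENS, PIX_DIM = 256, 64, 3 * 16 * 16

    # Stand-ins for the VQ decoder (tokens -> pixels) and token generator.
    embed = nn.Embedding(VOCAB, 64)
    to_pixels = nn.Linear(N_TOKENS * 64, PIX_DIM)
    to_logits = nn.Linear(PIX_DIM, N_TOKENS * VOCAB)

    tokens = torch.randint(VOCAB, (N_TOKENS,))        # tokens from the previous step
    for step in range(4):                             # a few denoising steps
        pixels = to_pixels(embed(tokens).flatten())   # 1) decode pixels from previous VQ tokens
        logits = to_logits(pixels).view(N_TOKENS, VOCAB)
        tokens = logits.argmax(-1)                    # 2) generate new VQ tokens from the pixels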
In this paper, we propose to ameliorate the semantic segmentation quality of existing discriminative approaches with a mask prior modeled by a recently-developed denoising diffusion generative model.
These agents, equipped with the logic and common sense capabilities of LLMs, can skillfully navigate complex, sparse-reward environments with text-based interactions.
In this work, we introduce EmbodiedGPT, an end-to-end multi-modal foundation model for embodied AI, empowering embodied agents with multi-modal understanding and execution capabilities.
We hope this model can set a new baseline for generalist vision and language models.
2 code implementations • 9 May 2023 • Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Zeqiang Lai, Yang Yang, Qingyun Li, Jiashuo Yu, Kunchang Li, Zhe Chen, Xue Yang, Xizhou Zhu, Yali Wang, LiMin Wang, Ping Luo, Jifeng Dai, Yu Qiao
Unlike existing interactive systems that rely on language alone, the proposed iGPT incorporates pointing instructions, which significantly improves both the efficiency of communication between users and chatbots and the accuracy of chatbots on vision-centric tasks, especially in complicated visual scenarios containing more than two objects.
We first propose a TRi-frame Optical Flow (TROF) module that estimates bi-directional optical flows for the center frame in a three-frame manner.
FlowFormer introduces a transformer architecture into optical flow estimation and achieves state-of-the-art performance.
The first method is One-to-many Matching via Data Augmentation (denoted as DataAug-DETR).
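A rough sketch of the one-to-many-via-augmentation idea (function names and the augmentation hook are illustrative, not the paper's code): each image is repeated several times with independent augmentations inside one batch, so every ground-truth object participates in bipartite matching once per copy.

    def expand_batch(images, targets, repeats=2, augment=lambda img: img):
        """Repeat each image with independent augmentations so each ground
        truth is matched once per augmented copy during one-to-one matching."""
        out_images, out_targets = [], []
        for img, tgt in zip(images, targets):
            for _ in range(repeats):
                out_images.append(augment(img))  # a different random augmentation per copy
                out_targets.append(tgt)          # same ground-truth boxes for every copy
        return out_images, out_targets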
Inspired by this observation, we design an efficient unified framework with a two-stage training strategy to explore the weather-general and weather-specific features.
1 code implementation • Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, Lewei Lu, Xiaosong Jia, Qiang Liu, Jifeng Dai, Yu Qiao, Hongyang Li
To this end, we revisit the key components within perception and prediction, and prioritize the tasks such that all of them contribute to planning.
The proposed method is verified with a wide spectrum of traditional and modern image backbones and achieves new SoTA results on the large-scale nuScenes dataset.
Ranked #4 on 3D Object Detection on Rope3D
In this paper, we propose Uni-Perceiver v2, which is the first generalist model capable of handling major large-scale vision and vision-language tasks with competitive performance.
It has been shown that combining multiple pre-training strategies and data from various modalities/sources can greatly boost the training of large-scale models.
Ranked #2 on Object Detection on LVIS v1.0 minival (using extra training data)
Although novel feature transformation designs are often claimed as the source of gains, some backbones may instead benefit from advanced engineering techniques, making it hard to attribute the real gain to the key feature transformation operators.
Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still in an early state.
Ranked #1 on Instance Segmentation on COCO test-dev (APS metric, using extra training data)
2 code implementations • 12 Sep 2022 • Hongyang Li, Chonghao Sima, Jifeng Dai, Wenhai Wang, Lewei Lu, Huijie Wang, Jia Zeng, Zhiqi Li, Jiazhi Yang, Hanming Deng, Hao Tian, Enze Xie, Jiangwei Xie, Li Chen, Tianyu Li, Yang Li, Yulu Gao, Xiaosong Jia, Si Liu, Jianping Shi, Dahua Lin, Yu Qiao
As sensor configurations get more complex, integrating multi-source information from different sensors and representing features in a unified view become vitally important.
Video recognition has been dominated by the end-to-end learning paradigm -- first initializing a video recognition model with weights of a pretrained image model and then conducting end-to-end training on videos.
Ranked #23 on Action Classification on Kinetics-400 (using extra training data)
On top of that, the performance of Tip-Adapter can be further boosted to be state-of-the-art on ImageNet by fine-tuning the cache model for 10$\times$ fewer epochs than existing methods, which is both effective and efficient.
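For intuition, the Tip-Adapter-style cache-model computation has roughly this form (a toy sketch with random tensors; alpha and beta are the usual blending and sharpness hyper-parameters, and all shapes are illustrative):

    import torch
    import torch.nn.functional as F

    N, C, D = 16, 4, 128                                   # shots, classes, feature dim
    F_keys = F.normalize(torch.randn(N, D), dim=-1)        # cached few-shot features (keys)
    L_vals = torch.eye(C)[torch.randint(C, (N,))]          # their one-hot labels (values)
    W_clip = F.normalize(torch.randn(C, D), dim=-1)        # CLIP zero-shot classifier
    alpha, beta = 1.0, 5.5

    f = F.normalize(torch.randn(1, D), dim=-1)             # test image feature
    affinity = torch.exp(-beta * (1.0 - f @ F_keys.t()))   # similarity to cached keys
    logits = f @ W_clip.t() + alpha * affinity @ L_vals    # zero-shot + cache logits
    # "Fine-tuning the cache model" amounts to making F_keys a learnable parameter.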
To mitigate such interference, we introduce the Conditional Mixture-of-Experts (Conditional MoEs) to generalist models.
Driven by these analyses, we propose Siamese Image Modeling (SiameseIM), which predicts the dense representations of an augmented view based on another masked view from the same image but with different augmentations.
This work investigates a simple yet powerful dense prediction task adapter for Vision Transformer (ViT).
Ranked #4 on Semantic Segmentation on PASCAL Context
Masked auto-encoding for feature pretraining and multi-scale hybrid convolution-transformer architectures can further unleash the potential of ViT, leading to state-of-the-art performance on image classification, detection, and semantic segmentation.
In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with spatial and temporal space through predefined grid-shaped BEV queries.
Ranked #3 on Robust Camera Only 3D Object Detection on nuScenes-C
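A toy sketch of the grid-shaped BEV queries mentioned above: one learnable embedding per BEV cell, refined by cross-attention over (here random) multi-camera image features. This illustrates only the query mechanism, not BEVFormer's actual deformable spatial/temporal attention.

    import torch
    import torch.nn as nn

    H, W, D = 20, 20, 64                                   # BEV grid and feature dim
    bev_queries = nn.Parameter(torch.randn(H * W, D))      # one query per BEV grid cell
    cross_attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)

    img_feats = torch.randn(1, 6 * 100, D)                 # 6 cameras, flattened tokens
    bev, _ = cross_attn(bev_queries.unsqueeze(0), img_feats, img_feats)
    bev = bev.view(1, H, W, D)                             # back to the spatial BEV grid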
We introduce the optical Flow transFormer, dubbed FlowFormer, a transformer-based neural network architecture for learning optical flow.
Ranked #1 on Optical Flow Estimation on Sintel-final
In this paper, we propose Parameterized AP Loss, where parameterized functions are introduced to substitute the non-differentiable components in the AP calculation.
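To make the idea concrete: AP is non-differentiable because its ranking comparisons use a Heaviside step. One hedged sketch is to substitute a parameterized sigmoid whose parameter is searched rather than hand-picked; the exact parameterization below is illustrative, not the paper's searched form.

    import torch

    def soft_greater(diff, theta):
        """Differentiable substitute for the step function 1[diff > 0]
        used in AP's ranking comparisons; theta is a searchable parameter."""
        return torch.sigmoid(theta * diff)

    scores = torch.tensor([0.9, 0.4, 0.7], requires_grad=True)
    theta = torch.tensor(10.0)
    # Soft count of other detections ranked above a given positive (index 0 here):
    rank_penalty = soft_greater(scores[1:] - scores[0], theta).sum()
    rank_penalty.backward()          # gradients now flow through the ranking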
These methods design quite different loss functions, stemming from various motivations.
The model is pre-trained on several uni-modal and multi-modal tasks, and evaluated on a variety of downstream tasks, including novel tasks that did not appear in the pre-training stage.
Deep learning-based models encounter challenges when processing long-tailed data in the real world.
Ranked #2 on Long-tail Learning on iNaturalist 2018 (using extra training data)
To further enhance CLIP's few-shot capability, CLIP-Adapter was proposed to fine-tune a lightweight residual feature adapter, significantly improving few-shot classification performance.
By contrast, the soft composition operates by stitching different patches into a whole feature map, where pixels in overlapping regions are summed.
Ranked #4 on Video Inpainting on DAVIS
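In PyTorch terms, the summed-overlap stitching described above is exactly what F.fold does; a minimal sketch with illustrative shapes:

    import torch
    import torch.nn.functional as F

    x = torch.randn(1, 8, 32, 32)                       # a feature map
    patches = F.unfold(x, kernel_size=8, stride=4)      # split into overlapping patches
    stitched = F.fold(patches, output_size=(32, 32),    # stitch back into a whole map;
                      kernel_size=8, stride=4)          # overlapping pixels are summed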
To obtain the influence of an unlabeled sample in the active learning scenario, we design Untrained Unlabeled sample Influence Calculation (UUIC), which estimates the unlabeled sample's expected gradient and uses it to calculate the sample's influence.
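A hedged sketch of computing such an expected gradient: average the loss gradient over the model's own predictive distribution. This is our illustrative reading of the idea, not the paper's exact UUIC implementation.

    import torch
    import torch.nn.functional as F

    def expected_gradient(model, x, n_classes):
        """Expected parameter gradient for an unlabeled sample x (batch of one),
        taking the expectation over the model's predicted label distribution."""
        logits = model(x)
        probs = F.softmax(logits, dim=-1).detach()
        total = None
        for c in range(n_classes):
            loss = F.cross_entropy(logits, torch.tensor([c]))
            grads = torch.autograd.grad(loss, list(model.parameters()), retain_graph=True)
            flat = torch.cat([g.flatten() for g in grads])
            total = probs[0, c] * flat if total is None else total + probs[0, c] * flat
        return total   # its norm or inner products can then serve as an influence score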
As a fundamental problem in artificial intelligence, multi-agent systems (MAS) are making rapid progress, mainly driven by multi-agent reinforcement learning (MARL) techniques.
In this paper, we propose novel Scalable Transformers, which naturally contain sub-Transformers of different scales with shared parameters.
The seamless combination of these two novel designs forms a better spatial-temporal attention scheme, and our proposed model achieves better performance than state-of-the-art video inpainting approaches with significantly boosted efficiency.
However, the automatic design of loss functions for generic tasks with various evaluation metrics remains under-investigated.
Inspired by the recent advance in unsupervised contrastive representation learning, we propose a pixel-wise contrastive framework for semantic segmentation in the fully supervised setting.
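A minimal sketch of such a supervised pixel-wise contrastive loss, where pixels with the same ground-truth label act as positives. Sampling, memory banks, and projection heads are omitted; this is an assumption-laden simplification, not the paper's full loss.

    import torch
    import torch.nn.functional as F

    def pixel_contrast_loss(emb, labels, tau=0.1):
        """emb: (P, D) pixel embeddings; labels: (P,) ground-truth classes."""
        emb = F.normalize(emb, dim=-1)
        sim = emb @ emb.t() / tau                              # pairwise similarities
        eye = torch.eye(len(labels), dtype=torch.bool)
        pos = (labels[:, None] == labels[None, :]) & ~eye      # same-class positives
        denom = torch.logsumexp(sim.masked_fill(eye, float("-inf")), dim=1, keepdim=True)
        log_prob = sim - denom
        return -(log_prob * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()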
The recently proposed Detection Transformer (DETR) model successfully applies the Transformer to object detection and achieves performance comparable to two-stage object detection frameworks such as Faster-RCNN.
We further identify another major issue, seldom noticed by the community, that the long-tailed and open-ended (sub-)category distribution should be accommodated.
In this paper, we propose to automate the design of metric-specific loss functions by searching differentiable surrogate losses for each metric.
DETR has been recently proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance.
Ranked #34 on Object Detection on COCO-O
This article introduces the solution of team lvisTraveler for the LVIS Challenge 2020.
Ranked #1 on Instance Segmentation on LVIS v1.0 test-dev
Moreover, our approach ranked 1st place in the Weakly-Supervised Semantic Segmentation Track of CVPR2020 Learning from Imperfect Data Challenge.
As human bodies are inherently hierarchically structured, how to model human structure is the central theme of this task.
This is typically done by augmenting static operators with learned free-form sampling grids in the image space, dynamically tuned to the data and task for adapting the receptive field.
Ranked #186 on Object Detection on COCO test-dev
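A tiny sketch of the learned free-form sampling idea mentioned above: a small conv predicts per-pixel offsets, and grid_sample reads features at the shifted locations. This is a simplification of deformable operators, with made-up shapes.

    import torch
    import torch.nn.functional as F

    x = torch.randn(1, 8, 16, 16)
    offset_net = torch.nn.Conv2d(8, 2, kernel_size=3, padding=1)   # per-pixel (dx, dy)

    ys, xs = torch.meshgrid(torch.linspace(-1, 1, 16),
                            torch.linspace(-1, 1, 16), indexing="ij")
    identity = torch.stack((xs, ys), dim=-1).unsqueeze(0)          # (1, H, W, 2) base grid
    offsets = offset_net(x).permute(0, 2, 3, 1) * 0.1              # small learned shifts
    warped = F.grid_sample(x, identity + offsets, align_corners=True)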
We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short).
Ranked #1 on Visual Question Answering (VQA) on VCR (Q-A) dev
143 code implementations • 17 Jun 2019 • Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, Dahua Lin
In this paper, we introduce the various features of this toolbox.
Attention mechanisms have become a popular component in deep neural networks, yet there has been little examination of how different influencing factors and methods for computing attention from these factors affect performance.
The superior performance of Deformable Convolutional Networks arises from its ability to adapt to the geometric variations of objects.
Ranked #130 on Object Detection on COCO test-dev
Accurate detection and tracking of objects is vital for effective video understanding.
Ranked #16 on Video Object Detection on ImageNet VID
In this paper, we present a lightweight network architecture for video object detection on mobiles.
While most steps in modern object detection methods are learnable, the region feature extraction step remains largely hand-crafted, exemplified by RoI pooling methods.
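For reference, the hand-crafted step in question, extracting a fixed-size feature per region, looks like this with torchvision's RoIAlign. This is the standard baseline operation, not the paper's learned alternative.

    import torch
    from torchvision.ops import roi_align

    feats = torch.randn(1, 16, 32, 32)                    # backbone feature map
    boxes = torch.tensor([[0., 4., 4., 20., 20.]])        # (batch_idx, x1, y1, x2, y2)
    pooled = roi_align(feats, boxes, output_size=(7, 7),  # fixed-size region feature
                       spatial_scale=1.0)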
Although it is well believed for years that modeling relations between objects would help object recognition, there has not been evidence that the idea is working in the deep learning era.
The accuracy of detection suffers from degraded object appearances in videos, e.g., motion blur, video defocus, rare poses, etc.
Ranked #22 on Video Object Detection on ImageNet VID
Convolutional neural networks (CNNs) are inherently limited in modeling geometric transformations due to the fixed geometric structures in their building modules.
Ranked #3 on Vessel Detection on Vessel Detection Dataset
It inherits all the merits of FCNs for semantic segmentation and instance mask proposal.
Ranked #95 on Instance Segmentation on COCO test-dev
Yet, it is non-trivial to transfer the state-of-the-art image recognition networks to videos as per-frame evaluation is too slow and unaffordable.
Ranked #9 on Video Semantic Segmentation on Cityscapes val
In contrast to previous region-based detectors such as Fast/Faster R-CNN that apply a costly per-region subnetwork hundreds of times, our region-based detector is fully convolutional with almost all computation shared on the entire image.
Ranked #4 on Real-Time Object Detection on PASCAL VOC 2007
Large-scale data is of crucial importance for learning semantic segmentation models, but annotating per-pixel masks is a tedious and inefficient procedure.
In contrast to the previous FCN that generates one score map, our FCN is designed to compute a small set of instance-sensitive score maps, each of which is the outcome of a pixel-wise classifier of a relative position to instances.
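A small sketch of assembling one proposal from k x k instance-sensitive score maps (k = 3 here): each map scores one relative position, and the proposal copies, from each map, only its own cell of the sliding window. Shapes and indices are illustrative.

    import torch

    k, win = 3, 21                                  # k x k maps; window divisible by k
    score_maps = torch.randn(k * k, 64, 64)         # one map per relative position
    y0, x0, cell = 10, 10, win // k                 # window top-left corner, cell size

    proposal = torch.zeros(win, win)
    for i in range(k):
        for j in range(k):
            m = score_maps[i * k + j]               # the map for relative position (i, j)
            proposal[i*cell:(i+1)*cell, j*cell:(j+1)*cell] = \
                m[y0 + i*cell : y0 + (i+1)*cell, x0 + j*cell : x0 + (j+1)*cell]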
We develop an algorithm for the nontrivial end-to-end training of this causal, cascaded structure.
Ranked #3 on Multi-Human Parsing on PASCAL-Part
Recent leading approaches to semantic segmentation rely on deep convolutional networks trained with human-annotated, pixel-level segmentation masks.
Ranked #46 on Semantic Segmentation on PASCAL VOC 2012 test
(2) We propose a generative gradient for pre-training CNNs by a non-parametric importance sampling scheme, which is fundamentally different from the commonly used discriminative gradient, and yet has the same computational architecture and cost as the latter.
The current leading approaches for semantic segmentation exploit shape information by extracting CNN features from masked image regions.
Ranked #61 on Semantic Segmentation on PASCAL Context
Given a set of unannotated training images, a dictionary of such hierarchical templates is learned so that each training image can be represented by a small number of spatially translated, rotated, and scaled versions of the templates in the learned dictionary.