Search Results for author: Jifeng Dai

Found 75 papers, 58 papers with code

InstructSeq: Unifying Vision Tasks with Instruction-conditioned Multi-modal Sequence Generation

1 code implementation 30 Nov 2023 Rongyao Fang, Shilin Yan, Zhaoyang Huang, Jingqiu Zhou, Hao Tian, Jifeng Dai, Hongsheng Li

In this work, we introduce InstructSeq, an instruction-conditioned multi-modal modeling framework that unifies diverse vision tasks through flexible natural language control and handling of both visual and textual data.

Image Captioning Referring Expression +2

ControlLLM: Augment Language Models with Tools by Searching on Graphs

1 code implementation 26 Oct 2023 Zhaoyang Liu, Zeqiang Lai, Zhangwei Gao, Erfei Cui, Zhiheng Li, Xizhou Zhu, Lewei Lu, Qifeng Chen, Yu Qiao, Jifeng Dai, Wenhai Wang

We present ControlLLM, a novel framework that enables large language models (LLMs) to utilize multi-modal tools for solving complex real-world tasks.


Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models

1 code implementation 11 Oct 2023 Zeqiang Lai, Xizhou Zhu, Jifeng Dai, Yu Qiao, Wenhai Wang

The revolution of artificial intelligence content generation has been rapidly accelerated with the booming text-to-image (T2I) diffusion models.

Code Generation Image Generation +2

FlowFormer: A Transformer Architecture and Its Masked Cost Volume Autoencoding for Optical Flow

no code implementations 8 Jun 2023 Zhaoyang Huang, Xiaoyu Shi, Chao Zhang, Qiang Wang, Yijin Li, Hongwei Qin, Jifeng Dai, Xiaogang Wang, Hongsheng Li

This paper introduces a novel transformer-based network architecture, FlowFormer, along with the Masked Cost Volume AutoEncoding (MCVA) for pretraining it to tackle the problem of optical flow estimation.

Optical Flow Estimation

Denoising Diffusion Semantic Segmentation with Mask Prior Modeling

no code implementations 2 Jun 2023 Zeqiang Lai, Yuchen Duan, Jifeng Dai, Ziheng Li, Ying Fu, Hongsheng Li, Yu Qiao, Wenhai Wang

In this paper, we propose to ameliorate the semantic segmentation quality of existing discriminative approaches with a mask prior modeled by a recently-developed denoising diffusion generative model.

Denoising Segmentation +1

EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

no code implementations NeurIPS 2023 Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, Yu Qiao, Ping Luo

In this work, we introduce EmbodiedGPT, an end-to-end multi-modal foundation model for embodied AI, empowering embodied agents with multi-modal understanding and execution capabilities.

Image Captioning Language Modelling +3

InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language

2 code implementations 9 May 2023 Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Zeqiang Lai, Yang Yang, Qingyun Li, Jiashuo Yu, Kunchang Li, Zhe Chen, Xue Yang, Xizhou Zhu, Yali Wang, LiMin Wang, Ping Luo, Jifeng Dai, Yu Qiao

Different from existing interactive systems that rely on pure language, by incorporating pointing instructions, the proposed iGPT significantly improves the efficiency of communication between users and chatbots, as well as the accuracy of chatbots in vision-centric tasks, especially in complicated visual scenarios where the number of objects is greater than 2.

Language Modelling

VideoFlow: Exploiting Temporal Cues for Multi-frame Optical Flow Estimation

1 code implementation ICCV 2023 Xiaoyu Shi, Zhaoyang Huang, Weikang Bian, Dasong Li, Manyuan Zhang, Ka Chun Cheung, Simon See, Hongwei Qin, Jifeng Dai, Hongsheng Li

We first propose a TRi-frame Optical Flow (TROF) module that estimates bi-directional optical flows for the center frame in a three-frame manner.

Optical Flow Estimation

Learning Weather-General and Weather-Specific Features for Image Restoration Under Multiple Adverse Weather Conditions

no code implementations CVPR 2023 Yurui Zhu, Tianyu Wang, Xueyang Fu, Xuanyu Yang, Xin Guo, Jifeng Dai, Yu Qiao, Xiaowei Hu

Inspired by this observation, we design an efficient unified framework with a two-stage training strategy to explore the weather-general and weather-specific features.

Image Restoration

Planning-oriented Autonomous Driving

1 code implementation CVPR 2023 Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, Lewei Lu, Xiaosong Jia, Qiang Liu, Jifeng Dai, Yu Qiao, Hongyang Li

Oriented at this, we revisit the key components within perception and prediction, and prioritize the tasks such that all these tasks contribute to planning.

Autonomous Driving Philosophy

Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

2 code implementations CVPR 2023 Hao Li, Jinguo Zhu, Xiaohu Jiang, Xizhou Zhu, Hongsheng Li, Chun Yuan, Xiaohua Wang, Yu Qiao, Xiaogang Wang, Wenhai Wang, Jifeng Dai

In this paper, we propose Uni-Perceiver v2, which is the first generalist model capable of handling major large-scale vision and vision-language tasks with competitive performance.

Language Modelling Multi-Task Learning

Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information

1 code implementation CVPR 2023 Weijie Su, Xizhou Zhu, Chenxin Tao, Lewei Lu, Bin Li, Gao Huang, Yu Qiao, Xiaogang Wang, Jie Zhou, Jifeng Dai

It has been proved that combining multiple pre-training strategies and data from various modalities/sources can greatly boost the training of large-scale models.

Ranked #2 on Object Detection on LVIS v1.0 minival (using extra training data)

Image Classification Long-tailed Object Detection +3

Demystify Transformers & Convolutions in Modern Image Deep Networks

1 code implementation 10 Nov 2022 Jifeng Dai, Min Shi, Weiyun Wang, Sitong Wu, Linjie Xing, Wenhai Wang, Xizhou Zhu, Lewei Lu, Jie Zhou, Xiaogang Wang, Yu Qiao, Xiaowei Hu

Although the novel feature transformation designs are often claimed as the source of gain, some backbones may benefit from advanced engineering techniques, which makes it hard to identify the real gain from the key feature transformation operators.

Image Deep Networks Spatial Token Mixer

InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

2 code implementations CVPR 2023 Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, Xiaogang Wang, Yu Qiao

Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still in an early state.

Ranked #1 on Instance Segmentation on COCO test-dev (APS metric, using extra training data)

Classification Image Classification +3

Delving into the Devils of Bird's-eye-view Perception: A Review, Evaluation and Recipe

2 code implementations 12 Sep 2022 Hongyang Li, Chonghao Sima, Jifeng Dai, Wenhai Wang, Lewei Lu, Huijie Wang, Jia Zeng, Zhiqi Li, Jiazhi Yang, Hanming Deng, Hao Tian, Enze Xie, Jiangwei Xie, Li Chen, Tianyu Li, Yang Li, Yulu Gao, Xiaosong Jia, Si Liu, Jianping Shi, Dahua Lin, Yu Qiao

As sensor configurations get more complex, integrating multi-source information from different sensors and representing features in a unified view become vitally important.

Autonomous Driving

Frozen CLIP Models are Efficient Video Learners

2 code implementations 6 Aug 2022 Ziyi Lin, Shijie Geng, Renrui Zhang, Peng Gao, Gerard de Melo, Xiaogang Wang, Jifeng Dai, Yu Qiao, Hongsheng Li

Video recognition has been dominated by the end-to-end learning paradigm -- first initializing a video recognition model with weights of a pretrained image model and then conducting end-to-end training on videos.

Ranked #23 on Action Classification on Kinetics-400 (using extra training data)

Action Classification Video Recognition

Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification

1 code implementation 19 Jul 2022 Renrui Zhang, Zhang Wei, Rongyao Fang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, Hongsheng Li

On top of that, the performance of Tip-Adapter can be further boosted to be state-of-the-art on ImageNet by fine-tuning the cache model for 10× fewer epochs than existing methods, which is both effective and efficient.

Retrieval Transfer Learning

Siamese Image Modeling for Self-Supervised Vision Representation Learning

2 code implementations CVPR 2023 Chenxin Tao, Xizhou Zhu, Weijie Su, Gao Huang, Bin Li, Jie Zhou, Yu Qiao, Xiaogang Wang, Jifeng Dai

Driven by these analyses, we propose Siamese Image Modeling (SiameseIM), which predicts the dense representations of an augmented view, based on another masked view from the same image but with different augmentations.

Representation Learning Self-Supervised Learning +1

ConvMAE: Masked Convolution Meets Masked Autoencoders

4 code implementations 8 May 2022 Peng Gao, Teli Ma, Hongsheng Li, Ziyi Lin, Jifeng Dai, Yu Qiao

Masked auto-encoding for feature pretraining and multi-scale hybrid convolution-transformer architectures can further unleash the potentials of ViT, leading to state-of-the-art performances on image classification, detection and semantic segmentation.

Image Classification Object Detection +1

BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

2 code implementations 31 Mar 2022 Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, Jifeng Dai

In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with spatial and temporal space through predefined grid-shaped BEV queries.

3D Object Detection Autonomous Driving +1

FlowFormer: A Transformer Architecture for Optical Flow

1 code implementation 30 Mar 2022 Zhaoyang Huang, Xiaoyu Shi, Chao Zhang, Qiang Wang, Ka Chun Cheung, Hongwei Qin, Jifeng Dai, Hongsheng Li

We introduce optical Flow transFormer, dubbed as FlowFormer, a transformer-based neural network architecture for learning optical flow.

Optical Flow Estimation

Searching Parameterized AP Loss for Object Detection

1 code implementation NeurIPS 2021 Chenxin Tao, Zizhang Li, Xizhou Zhu, Gao Huang, Yong Liu, Jifeng Dai

In this paper, we propose Parameterized AP Loss, where parameterized functions are introduced to substitute the non-differentiable components in the AP calculation.

Object Detection

Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks

1 code implementation CVPR 2022 Xizhou Zhu, Jinguo Zhu, Hao Li, Xiaoshi Wu, Xiaogang Wang, Hongsheng Li, Xiaohua Wang, Jifeng Dai

The model is pre-trained on several uni-modal and multi-modal tasks, and evaluated on a variety of downstream tasks, including novel tasks that did not appear in the pre-training stage.

Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling

1 code implementation 6 Nov 2021 Renrui Zhang, Rongyao Fang, Wei Zhang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, Hongsheng Li

To further enhance CLIP's few-shot capability, CLIP-Adapter proposes to fine-tune a lightweight residual feature adapter and significantly improves the performance for few-shot classification.

Language Modelling Transfer Learning

Influence Selection for Active Learning

1 code implementation ICCV 2021 Zhuoming Liu, Hao Ding, Huaping Zhong, Weijia Li, Jifeng Dai, Conghui He

To obtain the Influence of the unlabeled sample in the active learning scenario, we design the Untrained Unlabeled sample Influence Calculation (UUIC) to estimate the unlabeled sample's expected gradient, with which we calculate its Influence.

Active Learning

Collaborative Visual Navigation

1 code implementation 2 Jul 2021 Haiyang Wang, Wenguan Wang, Xizhou Zhu, Jifeng Dai, LiWei Wang

As a fundamental problem for Artificial Intelligence, multi-agent system (MAS) is making rapid progress, mainly driven by multi-agent reinforcement learning (MARL) techniques.

Multi-agent Reinforcement Learning Navigate +1

Scalable Transformers for Neural Machine Translation

no code implementations 4 Jun 2021 Peng Gao, Shijie Geng, Yu Qiao, Xiaogang Wang, Jifeng Dai, Hongsheng Li

In this paper, we propose novel Scalable Transformers, which naturally contain sub-Transformers of different scales and have shared parameters.

Machine Translation NMT +1

Decoupled Spatial-Temporal Transformer for Video Inpainting

1 code implementation 14 Apr 2021 Rui Liu, Hanming Deng, Yangyi Huang, Xiaoyu Shi, Lewei Lu, Wenxiu Sun, Xiaogang Wang, Jifeng Dai, Hongsheng Li

Seamless combination of these two novel designs forms a better spatial-temporal attention scheme, and our proposed model achieves better performance than state-of-the-art video inpainting approaches with significantly boosted efficiency.

Video Inpainting

AutoLoss-Zero: Searching Loss Functions from Scratch for Generic Tasks

no code implementations CVPR 2022 Hao Li, Tianwen Fu, Jifeng Dai, Hongsheng Li, Gao Huang, Xizhou Zhu

However, the automatic design of loss functions for generic tasks with various evaluation metrics remains under-investigated.

Exploring Cross-Image Pixel Contrast for Semantic Segmentation

5 code implementations ICCV 2021 Wenguan Wang, Tianfei Zhou, Fisher Yu, Jifeng Dai, Ender Konukoglu, Luc van Gool

Inspired by the recent advance in unsupervised contrastive representation learning, we propose a pixel-wise contrastive framework for semantic segmentation in the fully supervised setting.

Metric Learning Optical Character Recognition (OCR) +3

Fast Convergence of DETR with Spatially Modulated Co-Attention

2 code implementations 19 Jan 2021 Peng Gao, Minghang Zheng, Xiaogang Wang, Jifeng Dai, Hongsheng Li

The recently proposed Detection Transformer (DETR) model successfully applies Transformers to object detection and achieves performance comparable to two-stage object detection frameworks, such as Faster-RCNN.

Object Detection

Unsupervised Object Detection with LiDAR Clues

no code implementations CVPR 2021 Hao Tian, Yuntao Chen, Jifeng Dai, Zhaoxiang Zhang, Xizhou Zhu

We further identify another major issue, seldom noticed by the community, that the long-tailed and open-ended (sub-)category distribution should be accommodated.

Object Detection +1

Auto Seg-Loss: Searching Metric Surrogates for Semantic Segmentation

1 code implementation ICLR 2021 Hao Li, Chenxin Tao, Xizhou Zhu, Xiaogang Wang, Gao Huang, Jifeng Dai

In this paper, we propose to automate the design of metric-specific loss functions by searching differentiable surrogate losses for each metric.

Semantic Segmentation

Deformable DETR: Deformable Transformers for End-to-End Object Detection

17 code implementations ICLR 2021 Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, Jifeng Dai

DETR has been recently proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance.

Real-Time Object Detection

Resolution Adaptive Networks for Efficient Inference

2 code implementations CVPR 2020 Le Yang, Yizeng Han, Xi Chen, Shiji Song, Jifeng Dai, Gao Huang

Adaptive inference is an effective mechanism to achieve a dynamic tradeoff between accuracy and computational cost in deep networks.

Hierarchical Human Parsing with Typed Part-Relation Reasoning

1 code implementation CVPR 2020 Wenguan Wang, Hailong Zhu, Jifeng Dai, Yanwei Pang, Jianbing Shen, Ling Shao

As human bodies are inherently hierarchically structured, how to model human structures is the central theme in this task.

Human Parsing

Deformable Kernels: Adapting Effective Receptive Fields for Object Deformation

2 code implementations ICLR 2020 Hang Gao, Xizhou Zhu, Steve Lin, Jifeng Dai

This is typically done by augmenting static operators with learned free-form sampling grids in the image space, dynamically tuned to the data and task for adapting the receptive field.

Image Classification Object Detection

An Empirical Study of Spatial Attention Mechanisms in Deep Networks

1 code implementation ICCV 2019 Xizhou Zhu, Dazhi Cheng, Zheng Zhang, Stephen Lin, Jifeng Dai

Attention mechanisms have become a popular component in deep neural networks, yet there has been little examination of how different influencing factors and methods for computing attention from these factors affect performance.

Deformable ConvNets v2: More Deformable, Better Results

23 code implementations CVPR 2019 Xizhou Zhu, Han Hu, Stephen Lin, Jifeng Dai

The superior performance of Deformable Convolutional Networks arises from their ability to adapt to the geometric variations of objects.

Instance Segmentation Object Detection +1

Towards High Performance Video Object Detection for Mobiles

3 code implementations 16 Apr 2018 Xizhou Zhu, Jifeng Dai, Xingchi Zhu, Yichen Wei, Lu Yuan

In this paper, we present a lightweight network architecture for video object detection on mobile devices.

Video Object Detection +1

Learning Region Features for Object Detection

no code implementations ECCV 2018 Jiayuan Gu, Han Hu, Li-Wei Wang, Yichen Wei, Jifeng Dai

While most steps in the modern object detection methods are learnable, the region feature extraction step remains largely hand-crafted, featured by RoI pooling methods.

Object Detection

Relation Networks for Object Detection

6 code implementations CVPR 2018 Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, Yichen Wei

Although it has been widely believed for years that modeling relations between objects would help object recognition, there has been no evidence that the idea works in the deep learning era.

Object Detection +1

Flow-Guided Feature Aggregation for Video Object Detection

2 code implementations ICCV 2017 Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan, Yichen Wei

The accuracy of detection suffers from degenerated object appearances in videos, e.g., motion blur, video defocus, rare poses, etc.

Video Object Detection +1

Deformable Convolutional Networks

38 code implementations ICCV 2017 Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, Yichen Wei

Convolutional neural networks (CNNs) are inherently limited in modeling geometric transformations due to the fixed geometric structures in their building modules.

Object Detection Semantic Segmentation +1

Deep Feature Flow for Video Recognition

3 code implementations CVPR 2017 Xizhou Zhu, Yuwen Xiong, Jifeng Dai, Lu Yuan, Yichen Wei

Yet, it is non-trivial to transfer the state-of-the-art image recognition networks to videos as per-frame evaluation is too slow and unaffordable.

Video Recognition Video Semantic Segmentation

R-FCN: Object Detection via Region-based Fully Convolutional Networks

46 code implementations NeurIPS 2016 Jifeng Dai, Yi Li, Kaiming He, Jian Sun

In contrast to previous region-based detectors such as Fast/Faster R-CNN that apply a costly per-region subnetwork hundreds of times, our region-based detector is fully convolutional with almost all computation shared on the entire image.

Real-Time Object Detection Test +1

ScribbleSup: Scribble-Supervised Convolutional Networks for Semantic Segmentation

no code implementations CVPR 2016 Di Lin, Jifeng Dai, Jiaya Jia, Kaiming He, Jian Sun

Large-scale data is of crucial importance for learning semantic segmentation models, but annotating per-pixel masks is a tedious and inefficient procedure.

Image Segmentation Segmentation +1

Instance-sensitive Fully Convolutional Networks

no code implementations 29 Mar 2016 Jifeng Dai, Kaiming He, Yi Li, Shaoqing Ren, Jian Sun

In contrast to the previous FCN that generates one score map, our FCN is designed to compute a small set of instance-sensitive score maps, each of which is the outcome of a pixel-wise classifier of a relative position to instances.

Semantic Segmentation

BoxSup: Exploiting Bounding Boxes to Supervise Convolutional Networks for Semantic Segmentation

no code implementations ICCV 2015 Jifeng Dai, Kaiming He, Jian Sun

Recent leading approaches to semantic segmentation rely on deep convolutional networks trained with human-annotated, pixel-level segmentation masks.

Segmentation Semantic Segmentation

Generative Modeling of Convolutional Neural Networks

no code implementations 19 Dec 2014 Jifeng Dai, Yang Lu, Ying-Nian Wu

(2) We propose a generative gradient for pre-training CNNs by a non-parametric importance sampling scheme, which is fundamentally different from the commonly used discriminative gradient, and yet has the same computational architecture and cost as the latter.

Convolutional Feature Masking for Joint Object and Stuff Segmentation

1 code implementation CVPR 2015 Jifeng Dai, Kaiming He, Jian Sun

The current leading approaches for semantic segmentation exploit shape information by extracting CNN features from masked image regions.

Semantic Segmentation

Unsupervised Learning of Dictionaries of Hierarchical Compositional Models

no code implementations CVPR 2014 Jifeng Dai, Yi Hong, Wenze Hu, Song-Chun Zhu, Ying Nian Wu

Given a set of unannotated training images, a dictionary of such hierarchical templates is learned so that each training image can be represented by a small number of templates that are spatially translated, rotated and scaled versions of the templates in the learned dictionary.

Domain Adaptation Template Matching
