Search Results for author: Jifeng Dai

Found 124 papers, 93 papers with code

Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy

1 code implementation • 25 Mar 2025 Zhi Hou, Tianyi Zhang, Yuwen Xiong, Haonan Duan, Hengjun Pu, Ronglei Tong, Chengyang Zhao, Xizhou Zhu, Yu Qiao, Jifeng Dai, Yuntao Chen

While recent vision-language-action models trained on diverse robot datasets exhibit promising generalization capabilities with limited in-domain data, their reliance on compact action heads to predict discretized or continuous actions constrains adaptability to heterogeneous action spaces.

Ranked #3 on Robot Manipulation on SimplerEnv-Google Robot (using extra training data)

Denoising Robot Manipulation +1

LangBridge: Interpreting Image as a Combination of Language Embeddings

no code implementations • 25 Mar 2025 Jiaqi Liao, Yuwei Niu, Fanqing Meng, Hao Li, Changyao Tian, Yinuo Du, Yuwen Xiong, Dianqi Li, Xizhou Zhu, Li Yuan, Jifeng Dai, Yu Cheng

Based on this insight, we propose LangBridge, a novel adapter that explicitly maps visual tokens to linear combinations of LLM vocabulary embeddings.

cross-modal alignment
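The mapping described above, each visual token expressed as a weighted combination of the LLM's vocabulary embeddings, can be sketched in a few lines. The sizes, the random adapter matrix, and the softmax weighting below are illustrative assumptions, not details taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes -- illustrative only, not from the paper.
vocab_size, embed_dim, vis_dim, n_tokens = 100, 16, 32, 4

vocab_embeddings = rng.normal(size=(vocab_size, embed_dim))  # frozen LLM embedding table
visual_tokens = rng.normal(size=(n_tokens, vis_dim))         # vision-encoder outputs
adapter = rng.normal(size=(vis_dim, vocab_size)) * 0.1       # stand-in for learned weights

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Each visual token is turned into a distribution over the vocabulary,
# then read out as a convex combination of vocabulary embeddings.
weights = softmax(visual_tokens @ adapter)    # (n_tokens, vocab_size)
aligned_tokens = weights @ vocab_embeddings   # (n_tokens, embed_dim)
```

Because each output row lies in the span of the vocabulary embeddings, the weights themselves give an interpretable reading of which words a visual token most resembles.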

GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing

1 code implementation • 13 Mar 2025 Rongyao Fang, Chengqi Duan, Kun Wang, Linjiang Huang, Hao Li, Shilin Yan, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, Xihui Liu, Hongsheng Li

We present Generation Chain-of-Thought (GoT), a novel paradigm that enables generation and editing through an explicit language reasoning process before outputting images.

Language Modeling Language Modelling +3

VisualPRM: An Effective Process Reward Model for Multimodal Reasoning

no code implementations • 13 Mar 2025 Weiyun Wang, Zhangwei Gao, Lianjie Chen, Zhe Chen, Jinguo Zhu, Xiangyu Zhao, Yangzhou Liu, Yue Cao, Shenglong Ye, Xizhou Zhu, Lewei Lu, Haodong Duan, Yu Qiao, Jifeng Dai, Wenhai Wang

We introduce VisualPRM, an advanced multimodal Process Reward Model (PRM) with 8B parameters, which improves the reasoning abilities of existing Multimodal Large Language Models (MLLMs) across different model scales and families with Best-of-N (BoN) evaluation strategies.

Multimodal Reasoning
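Best-of-N evaluation with a process reward model can be sketched as below. The `prm_score` stand-in and the mean-over-steps aggregation are assumptions for illustration; VisualPRM itself is a learned 8B multimodal model:

```python
import numpy as np

def prm_score(steps):
    # Stand-in for a Process Reward Model: one score per reasoning step.
    # A real PRM would be a learned model judging each step's correctness.
    return np.array([step["quality"] for step in steps])

def best_of_n(candidates):
    # Best-of-N: score every candidate reasoning chain with the PRM
    # (aggregated here as the mean step score) and keep the best one.
    scores = [float(prm_score(c).mean()) for c in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

candidates = [
    [{"quality": 0.9}, {"quality": 0.2}],  # strong start, weak second step
    [{"quality": 0.8}, {"quality": 0.7}],  # consistently sound reasoning
]
best, score = best_of_n(candidates)
```

The point of scoring the *process* rather than only the final answer is visible here: the consistently sound chain wins even though the other chain has the single best step.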

MI-DETR: An Object Detection Model with Multi-time Inquiries Mechanism

1 code implementation • 3 Mar 2025 Zhixiong Nan, Xianghong Li, Jifeng Dai, Tao Xiang

Based on an analysis of the cascaded decoder architecture commonly adopted in existing DETR-like models, this paper proposes a new decoder architecture.

Object Detection

HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding

no code implementations • 20 Dec 2024 Chenxin Tao, Shiqian Su, Xizhou Zhu, Chenyu Zhang, Zhe Chen, Jiawen Liu, Wenhai Wang, Lewei Lu, Gao Huang, Yu Qiao, Jifeng Dai

The challenge for current monolithic VLMs actually lies in the lack of a holistic embedding module for both vision and language inputs.

V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding

1 code implementation • 12 Dec 2024 Junqi Ge, Ziyi Chen, Jintao Lin, Jinguo Zhu, Xihui Liu, Jifeng Dai, Xizhou Zhu

In our work, we first conduct an empirical analysis of the long-context capabilities of VLMs using our augmented long-context multimodal datasets.

Position

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

no code implementations • 12 Dec 2024 Hao Li, Changyao Tian, Jie Shao, Xizhou Zhu, Zhaokai Wang, Jinguo Zhu, Wenhan Dou, Xiaogang Wang, Hongsheng Li, Lewei Lu, Jifeng Dai

The remarkable success of Large Language Models (LLMs) has extended to the multimodal domain, achieving outstanding performance in image understanding and generation.

PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models

no code implementations • 12 Dec 2024 Chenyu Yang, Xuan Dong, Xizhou Zhu, Weijie Su, Jiahao Wang, Hao Tian, Zhe Chen, Wenhai Wang, Lewei Lu, Jifeng Dai

To this end, we extend each image into a "static" video and introduce a unified token compression strategy called Progressive Visual Token Compression (PVC), where the tokens of each frame are progressively encoded and adaptively compressed to supplement the information not extracted from previous frames.

Video Understanding

HoloDrive: Holistic 2D-3D Multi-Modal Street Scene Generation for Autonomous Driving

no code implementations • 2 Dec 2024 Zehuan Wu, Jingcheng Ni, Xiaodong Wang, Yuxin Guo, Rui Chen, Lewei Lu, Jifeng Dai, Yuwen Xiong

Generative models have significantly improved the generation and prediction quality on either camera images or LiDAR point clouds for autonomous driving.

Autonomous Driving Depth Estimation +2

DI-MaskDINO: A Joint Object Detection and Instance Segmentation Model

1 code implementation • 22 Oct 2024 Zhixiong Nan, Xianghong Li, Tao Xiang, Jifeng Dai

This paper is motivated by an interesting phenomenon: the performance of object detection lags behind that of instance segmentation (i.e., performance imbalance) when investigating the intermediate results from the beginning transformer decoder layer of MaskDINO (i.e., the SOTA model for joint detection and segmentation).

Decoder Instance Segmentation +5

Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance

1 code implementation • 21 Oct 2024 Zhangwei Gao, Zhe Chen, Erfei Cui, Yiming Ren, Weiyun Wang, Jinguo Zhu, Hao Tian, Shenglong Ye, Junjun He, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Jifeng Dai, Wenhai Wang

Multimodal large language models (MLLMs) have demonstrated impressive performance in vision-language tasks across a broad spectrum of domains.

Autonomous Driving

Diffusion Transformer Policy

1 code implementation • 21 Oct 2024 Zhi Hou, Tianyi Zhang, Yuwen Xiong, Hengjun Pu, Chengyang Zhao, Ronglei Tong, Yu Qiao, Jifeng Dai, Yuntao Chen

In contrast, we model the continuous action sequence with a large multi-modal diffusion transformer, dubbed Diffusion Transformer Policy, in which we directly denoise action chunks with a large transformer model rather than a small action head for action embedding.

Denoising Vision-Language-Action

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

1 code implementation • 17 Oct 2024 Rongyao Fang, Chengqi Duan, Kun Wang, Hao Li, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, Hongsheng Li, Xihui Liu

This work represents a significant step towards a truly unified MLLM capable of adapting to the granularity demands of various visual tasks.

Diversity Image Manipulation +1

big.LITTLE Vision Transformer for Efficient Visual Recognition

no code implementations • 14 Oct 2024 He Guo, Yulong Wang, Zixuan Ye, Jifeng Dai, Yuwen Xiong

In this paper, we introduce the big.LITTLE Vision Transformer, an innovative architecture aimed at achieving efficient visual recognition.

Image Classification Object Recognition

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

no code implementations • 10 Oct 2024 Gen Luo, Xue Yang, Wenhan Dou, Zhaokai Wang, Jiawen Liu, Jifeng Dai, Yu Qiao, Xizhou Zhu

In particular, EViP is designed as a progressive learning process for visual experts, which aims to fully exploit the visual knowledge from noisy data to high-quality data.

Mixture-of-Experts Visual Question Answering

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

1 code implementation • 5 Aug 2024 Fanqing Meng, Jin Wang, Chuanhao Li, Quanfeng Lu, Hao Tian, Jiaqi Liao, Xizhou Zhu, Jifeng Dai, Yu Qiao, Ping Luo, Kaipeng Zhang, Wenqi Shao

To fill this gap, we introduce the Multimodal Multi-image Understanding (MMIU) benchmark, a comprehensive evaluation suite designed to assess LVLMs across a wide range of multi-image tasks.

Image Comprehension Multiple-choice

Hierarchical Memory for Long Video QA

no code implementations • 30 Jun 2024 Yiqin Wang, Haoji Zhang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, Xiaojie Jin

This paper describes our champion solution to the LOVEU Challenge @ CVPR'24, Track 1 (Long Video VQA).

Question Answering Video Question Answering +1

CooHOI: Learning Cooperative Human-Object Interaction with Manipulated Object Dynamics

no code implementations • 20 Jun 2024 Jiawei Gao, Ziqin Wang, Zeqi Xiao, Jingbo Wang, Tai Wang, Jinkun Cao, Xiaolin Hu, Si Liu, Jifeng Dai, Jiangmiao Pang

Given the scarcity of motion capture data on multi-humanoid collaboration and the efficiency challenges associated with multi-agent learning, these tasks cannot be straightforwardly addressed using training paradigms designed for single-agent scenarios.

Human-Object Interaction Detection Humanoid Control +2

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

1 code implementation • 12 Jun 2024 Jiannan Wu, Muyan Zhong, Sen Xing, Zeqiang Lai, Zhaoyang Liu, Zhe Chen, Wenhai Wang, Xizhou Zhu, Lewei Lu, Tong Lu, Ping Luo, Yu Qiao, Jifeng Dai

It not only allows flexible transmission of task information and gradient feedback between the MLLM and multiple downstream decoders but also effectively resolves training conflicts in multi-tasking scenarios.

Image Generation Language Modeling +7

Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning

1 code implementation • 11 Jun 2024 Chenyu Yang, Xizhou Zhu, Jinguo Zhu, Weijie Su, Junjie Wang, Xuan Dong, Wenhai Wang, Lewei Lu, Bin Li, Jie Zhou, Yu Qiao, Jifeng Dai

Recently, vision model pre-training has evolved from relying on manually annotated datasets to leveraging large-scale, web-crawled image-text data.

Contrastive Learning

Needle In A Multimodal Haystack

1 code implementation • 11 Jun 2024 Weiyun Wang, Shuibo Zhang, Yiming Ren, Yuchen Duan, Tiantong Li, Shuo Liu, Mengkang Hu, Zhe Chen, Kaipeng Zhang, Lewei Lu, Xizhou Zhu, Ping Luo, Yu Qiao, Jifeng Dai, Wenqi Shao, Wenhai Wang

In this work, we present Needle In A Multimodal Haystack (MM-NIAH), the first benchmark specifically designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents.

Retrieval

Learning 1D Causal Visual Representation with De-focus Attention Networks

1 code implementation • 6 Jun 2024 Chenxin Tao, Xizhou Zhu, Shiqian Su, Lewei Lu, Changyao Tian, Xuan Luo, Gao Huang, Hongsheng Li, Yu Qiao, Jie Zhou, Jifeng Dai

The issue of "over-focus" hinders the model's ability to extract diverse visual features and to receive effective gradients for optimization.

Parameter-Inverted Image Pyramid Networks

1 code implementation • 6 Jun 2024 Xizhou Zhu, Xue Yang, Zhaokai Wang, Hao Li, Wenhan Dou, Junqi Ge, Lewei Lu, Yu Qiao, Jifeng Dai

Our core idea is to use models with different parameter sizes to process different resolution levels of the image pyramid, thereby balancing computational efficiency and performance.

Computational Efficiency Image Classification +3
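A back-of-the-envelope cost model shows why the parameter-inverted pairing balances compute: halving the resolution quarters the token count, while doubling the model width quadruples per-token cost. The resolutions, widths, patch size, and FLOPs proxy below are hypothetical, not the paper's configuration:

```python
PATCH = 16  # assumed patch size

def branch_cost(resolution: int, width: int) -> int:
    """Rough compute proxy: token count times width squared."""
    n_tokens = (resolution // PATCH) ** 2
    return n_tokens * width ** 2

# Parameter-inverted pairing: the smallest model sees the highest
# resolution, the largest model the lowest.
branches = [(448, 96), (224, 192), (112, 384)]  # (resolution, width)
costs = [branch_cost(r, w) for r, w in branches]
```

Under this proxy every branch costs the same, so the pyramid gains high-resolution detail and high-capacity semantics without any single branch dominating the budget.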

FLoRA: Low-Rank Core Space for N-dimension

1 code implementation • 23 May 2024 Chongjie Si, Xuehui Wang, Xue Yang, Zhengqin Xu, Qingyun Li, Jifeng Dai, Yu Qiao, Xiaokang Yang, Wei Shen

To tackle the diversity of dimensional spaces across different foundation models and provide a more precise representation of the changes within these spaces, this paper introduces FLoRA, a generalized parameter-efficient fine-tuning framework designed for parameter spaces of various dimensions.

parameter-efficient fine-tuning Tensor Decomposition

Bounding Box Stability against Feature Dropout Reflects Detector Generalization across Environments

1 code implementation • 20 Mar 2024 Yang Yang, Wenhai Wang, Zhe Chen, Jifeng Dai, Liang Zheng

However, in the real world, where test ground truths are not provided, it is non-trivial to find out whether bounding boxes are accurate, thus preventing us from assessing the detector generalization ability.

object-detection Object Detection +1

Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures

1 code implementation • 4 Mar 2024 Yuchen Duan, Weiyun Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Hongsheng Li, Jifeng Dai, Wenhai Wang

Our evaluations demonstrate that VRWKV surpasses ViT's performance in image classification, with significantly faster speeds and lower memory usage when processing high-resolution inputs.

Image Classification

The All-Seeing Project V2: Towards General Relation Comprehension of the Open World

1 code implementation • 29 Feb 2024 Weiyun Wang, Yiming Ren, Haowen Luo, Tiantong Li, Chenxiang Yan, Zhe Chen, Wenhai Wang, Qingyun Li, Lewei Lu, Xizhou Zhu, Yu Qiao, Jifeng Dai

In addition, we design a new benchmark, termed Circular-based Relation Probing Evaluation (CRPE) for comprehensively evaluating the relation comprehension capabilities of MLLMs.

All Hallucination +4

Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications

2 code implementations CVPR 2024 Yuwen Xiong, Zhiqi Li, Yuntao Chen, Feng Wang, Xizhou Zhu, Jiapeng Luo, Wenhai Wang, Tong Lu, Hongsheng Li, Yu Qiao, Lewei Lu, Jie Zhou, Jifeng Dai

The advancements in speed and efficiency of DCNv4, combined with its robust performance across diverse vision tasks, show its potential as a foundational building block for future vision models.

Image Classification Image Generation +1

InstructSeq: Unifying Vision Tasks with Instruction-conditioned Multi-modal Sequence Generation

no code implementations • 30 Nov 2023 Rongyao Fang, Shilin Yan, Zhaoyang Huang, Jingqiu Zhou, Hao Tian, Jifeng Dai, Hongsheng Li

In this work, we introduce InstructSeq, an instruction-conditioned multi-modal modeling framework that unifies diverse vision tasks through flexible natural language control and handling of both visual and textual data.

Image Captioning Referring Expression +2

ControlLLM: Augment Language Models with Tools by Searching on Graphs

1 code implementation • 26 Oct 2023 Zhaoyang Liu, Zeqiang Lai, Zhangwei Gao, Erfei Cui, Ziheng Li, Xizhou Zhu, Lewei Lu, Qifeng Chen, Yu Qiao, Jifeng Dai, Wenhai Wang

We present ControlLLM, a novel framework that enables large language models (LLMs) to utilize multi-modal tools for solving complex real-world tasks.

Scheduling

Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models

1 code implementation • 11 Oct 2023 Zeqiang Lai, Xizhou Zhu, Jifeng Dai, Yu Qiao, Wenhai Wang

The revolution of artificial intelligence content generation has been rapidly accelerated with the booming text-to-image (T2I) diffusion models.

Code Generation Image Generation +2

FlowFormer: A Transformer Architecture and Its Masked Cost Volume Autoencoding for Optical Flow

no code implementations • 8 Jun 2023 Zhaoyang Huang, Xiaoyu Shi, Chao Zhang, Qiang Wang, Yijin Li, Hongwei Qin, Jifeng Dai, Xiaogang Wang, Hongsheng Li

This paper introduces a novel transformer-based network architecture, FlowFormer, along with the Masked Cost Volume AutoEncoding (MCVA) for pretraining it to tackle the problem of optical flow estimation.

Decoder Optical Flow Estimation

Denoising Diffusion Semantic Segmentation with Mask Prior Modeling

no code implementations • 2 Jun 2023 Zeqiang Lai, Yuchen Duan, Jifeng Dai, Ziheng Li, Ying Fu, Hongsheng Li, Yu Qiao, Wenhai Wang

In this paper, we propose to ameliorate the semantic segmentation quality of existing discriminative approaches with a mask prior modeled by a recently-developed denoising diffusion generative model.

Denoising Segmentation +1

EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

no code implementations NeurIPS 2023 Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, Yu Qiao, Ping Luo

In this work, we introduce EmbodiedGPT, an end-to-end multi-modal foundation model for embodied AI, empowering embodied agents with multi-modal understanding and execution capabilities.

Image Captioning Language Modelling +3

InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language

2 code implementations • 9 May 2023 Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Zeqiang Lai, Yang Yang, Qingyun Li, Jiashuo Yu, Kunchang Li, Zhe Chen, Xue Yang, Xizhou Zhu, Yali Wang, Limin Wang, Ping Luo, Jifeng Dai, Yu Qiao

Different from existing interactive systems that rely on pure language, by incorporating pointing instructions, the proposed iGPT significantly improves the efficiency of communication between users and chatbots, as well as the accuracy of chatbots in vision-centric tasks, especially in complicated visual scenarios where the number of objects is greater than 2.

Language Modelling

Learning Weather-General and Weather-Specific Features for Image Restoration Under Multiple Adverse Weather Conditions

no code implementations CVPR 2023 Yurui Zhu, Tianyu Wang, Xueyang Fu, Xuanyu Yang, Xin Guo, Jifeng Dai, Yu Qiao, Xiaowei Hu

Inspired by this observation, we design an efficient unified framework with a two-stage training strategy to explore the weather-general and weather-specific features.

Image Restoration

Planning-oriented Autonomous Driving

1 code implementation CVPR 2023 Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, Lewei Lu, Xiaosong Jia, Qiang Liu, Jifeng Dai, Yu Qiao, Hongyang Li

Oriented at this, we revisit the key components within perception and prediction, and prioritize the tasks such that all these tasks contribute to planning.

Bench2Drive Philosophy +1

Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information

1 code implementation CVPR 2023 Weijie Su, Xizhou Zhu, Chenxin Tao, Lewei Lu, Bin Li, Gao Huang, Yu Qiao, Xiaogang Wang, Jie Zhou, Jifeng Dai

It has been proved that combining multiple pre-training strategies and data from various modalities/sources can greatly boost the training of large-scale models.

All Image Classification +4

Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

2 code implementations CVPR 2023 Hao Li, Jinguo Zhu, Xiaohu Jiang, Xizhou Zhu, Hongsheng Li, Chun Yuan, Xiaohua Wang, Yu Qiao, Xiaogang Wang, Wenhai Wang, Jifeng Dai

In this paper, we propose Uni-Perceiver v2, which is the first generalist model capable of handling major large-scale vision and vision-language tasks with competitive performance.

Decoder Language Modelling +1

Demystify Transformers & Convolutions in Modern Image Deep Networks

1 code implementation • 10 Nov 2022 Xiaowei Hu, Min Shi, Weiyun Wang, Sitong Wu, Linjie Xing, Wenhai Wang, Xizhou Zhu, Lewei Lu, Jie Zhou, Xiaogang Wang, Yu Qiao, Jifeng Dai

Vision transformers have gained popularity recently, leading to the development of new vision backbones with improved features and consistent performance gains.

Adversarial Robustness Image Deep Networks +1

InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

3 code implementations CVPR 2023 Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, Xiaogang Wang, Yu Qiao

Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still in an early state.

Ranked #1 on Instance Segmentation on COCO test-dev (AP50 metric, using extra training data)

Classification Image Classification +3

Delving into the Devils of Bird's-eye-view Perception: A Review, Evaluation and Recipe

2 code implementations • 12 Sep 2022 Hongyang Li, Chonghao Sima, Jifeng Dai, Wenhai Wang, Lewei Lu, Huijie Wang, Jia Zeng, Zhiqi Li, Jiazhi Yang, Hanming Deng, Hao Tian, Enze Xie, Jiangwei Xie, Li Chen, Tianyu Li, Yang Li, Yulu Gao, Xiaosong Jia, Si Liu, Jianping Shi, Dahua Lin, Yu Qiao

As sensor configurations get more complex, integrating multi-source information from different sensors and representing features in a unified view come of vital importance.

Autonomous Driving

Frozen CLIP Models are Efficient Video Learners

2 code implementations • 6 Aug 2022 Ziyi Lin, Shijie Geng, Renrui Zhang, Peng Gao, Gerard de Melo, Xiaogang Wang, Jifeng Dai, Yu Qiao, Hongsheng Li

Video recognition has been dominated by the end-to-end learning paradigm -- first initializing a video recognition model with weights of a pretrained image model and then conducting end-to-end training on videos.

Action Classification Decoder +1

Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification

3 code implementations • 19 Jul 2022 Renrui Zhang, Wei Zhang, Rongyao Fang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, Hongsheng Li

On top of that, the performance of Tip-Adapter can be further boosted to be state-of-the-art on ImageNet by fine-tuning the cache model for 10× fewer epochs than existing methods, which is both effective and efficient.

Retrieval Transfer Learning
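The training-free cache model behind Tip-Adapter can be sketched in a few lines: keys are the few-shot image features, values are their one-hot labels, and a test feature's affinity to the keys is blended with CLIP's zero-shot logits. Dimensions, hyper-parameter values, and the random stand-in features below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

num_classes, dim, shots = 5, 32, 4

# Cache model built without any training:
# keys = few-shot image features, values = their one-hot labels.
keys = l2norm(rng.normal(size=(num_classes * shots, dim)))
values = np.eye(num_classes)[np.repeat(np.arange(num_classes), shots)]

query = l2norm(rng.normal(size=(1, dim)))        # test image feature
clip_logits = rng.normal(size=(1, num_classes))  # zero-shot logits from text encoder

beta, alpha = 5.5, 1.0                 # sharpness / blending hyper-parameters
affinity = query @ keys.T              # cosine similarity to each cached key
cache_logits = np.exp(-beta * (1.0 - affinity)) @ values
logits = clip_logits + alpha * cache_logits      # final training-free prediction
```

The boosted variant mentioned in the snippet above keeps this read-out fixed and fine-tunes only the cached keys, which is why so few epochs suffice.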

Siamese Image Modeling for Self-Supervised Vision Representation Learning

2 code implementations CVPR 2023 Chenxin Tao, Xizhou Zhu, Weijie Su, Gao Huang, Bin Li, Jie Zhou, Yu Qiao, Xiaogang Wang, Jifeng Dai

Driven by these analyses, we propose Siamese Image Modeling (SiameseIM), which predicts the dense representations of an augmented view, based on another masked view from the same image but with different augmentations.

Representation Learning Self-Supervised Learning +1

ConvMAE: Masked Convolution Meets Masked Autoencoders

5 code implementations • 8 May 2022 Peng Gao, Teli Ma, Hongsheng Li, Ziyi Lin, Jifeng Dai, Yu Qiao

Masked auto-encoding for feature pretraining and multi-scale hybrid convolution-transformer architectures can further unleash the potential of ViT, leading to state-of-the-art performance on image classification, detection and semantic segmentation.

Computational Efficiency Image Classification +2

FlowFormer: A Transformer Architecture for Optical Flow

1 code implementation • 30 Mar 2022 Zhaoyang Huang, Xiaoyu Shi, Chao Zhang, Qiang Wang, Ka Chun Cheung, Hongwei Qin, Jifeng Dai, Hongsheng Li

We introduce the optical Flow transFormer, dubbed FlowFormer, a transformer-based neural network architecture for learning optical flow.

Decoder Optical Flow Estimation

Searching Parameterized AP Loss for Object Detection

1 code implementation NeurIPS 2021 Chenxin Tao, Zizhang Li, Xizhou Zhu, Gao Huang, Yong Liu, Jifeng Dai

In this paper, we propose Parameterized AP Loss, where parameterized functions are introduced to substitute the non-differentiable components in the AP calculation.

Object object-detection +1
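As a concrete instance of making AP differentiable, the sketch below replaces the Heaviside step in AP's ranking computation with a fixed sigmoid. The paper searches over parameterized substitution functions rather than committing to one, so treat this as an illustration of the idea, not the searched result:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def smooth_ap(scores, labels, tau=0.25):
    """AP surrogate: the non-differentiable step H(s_j - s_i) used when
    ranking detections is replaced by sigmoid((s_j - s_i) / tau)."""
    smooth_step = lambda d: sigmoid(d / tau)
    positives = np.flatnonzero(labels == 1)
    ap = 0.0
    for i in positives:
        diff = np.delete(scores - scores[i], i)   # margins to every other sample
        other = np.delete(labels, i)
        rank_all = 1.0 + smooth_step(diff).sum()              # soft rank among all
        rank_pos = 1.0 + smooth_step(diff[other == 1]).sum()  # soft rank among positives
        ap += rank_pos / rank_all
    return ap / len(positives)

labels = np.array([1, 1, 0, 0])
good = smooth_ap(np.array([4.0, 3.0, 2.0, 1.0]), labels, tau=0.05)  # positives ranked first
bad = smooth_ap(np.array([1.0, 2.0, 3.0, 4.0]), labels, tau=0.05)   # positives ranked last
```

With a small temperature the surrogate approaches true AP (near 1 for a perfect ranking, well below it for an inverted one), while staying differentiable in the scores.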

Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks

1 code implementation CVPR 2022 Xizhou Zhu, Jinguo Zhu, Hao Li, Xiaoshi Wu, Xiaogang Wang, Hongsheng Li, Xiaohua Wang, Jifeng Dai

The model is pre-trained on several uni-modal and multi-modal tasks, and evaluated on a variety of downstream tasks, including novel tasks that did not appear in the pre-training stage.

Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling

1 code implementation • 6 Nov 2021 Renrui Zhang, Rongyao Fang, Wei Zhang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, Hongsheng Li

To further enhance CLIP's few-shot capability, CLIP-Adapter proposed to fine-tune a lightweight residual feature adapter and significantly improves the performance for few-shot classification.

Language Modeling Language Modelling +1

Influence Selection for Active Learning

1 code implementation ICCV 2021 Zhuoming Liu, Hao Ding, Huaping Zhong, Weijia Li, Jifeng Dai, Conghui He

To obtain the Influence of the unlabeled sample in the active learning scenario, we design the Untrained Unlabeled sample Influence Calculation (UUIC) to estimate the unlabeled sample's expected gradient with which we calculate its Influence.

Active Learning Diversity

Collaborative Visual Navigation

1 code implementation • 2 Jul 2021 Haiyang Wang, Wenguan Wang, Xizhou Zhu, Jifeng Dai, Liwei Wang

As a fundamental problem for Artificial Intelligence, multi-agent system (MAS) is making rapid progress, mainly driven by multi-agent reinforcement learning (MARL) techniques.

Multi-agent Reinforcement Learning Navigate +1

Scalable Transformers for Neural Machine Translation

no code implementations • 4 Jun 2021 Peng Gao, Shijie Geng, Yu Qiao, Xiaogang Wang, Jifeng Dai, Hongsheng Li

In this paper, we propose novel Scalable Transformers, which naturally contain sub-Transformers of different scales with shared parameters.

Machine Translation NMT +1

Decoupled Spatial-Temporal Transformer for Video Inpainting

1 code implementation • 14 Apr 2021 Rui Liu, Hanming Deng, Yangyi Huang, Xiaoyu Shi, Lewei Lu, Wenxiu Sun, Xiaogang Wang, Jifeng Dai, Hongsheng Li

Seamless combination of these two novel designs forms a better spatial-temporal attention scheme and our proposed model achieves better performance than state-of-the-art video inpainting approaches with significant boosted efficiency.

Video Inpainting

AutoLoss-Zero: Searching Loss Functions from Scratch for Generic Tasks

no code implementations CVPR 2022 Hao Li, Tianwen Fu, Jifeng Dai, Hongsheng Li, Gao Huang, Xizhou Zhu

However, the automatic design of loss functions for generic tasks with various evaluation metrics remains under-investigated.

Exploring Cross-Image Pixel Contrast for Semantic Segmentation

5 code implementations ICCV 2021 Wenguan Wang, Tianfei Zhou, Fisher Yu, Jifeng Dai, Ender Konukoglu, Luc van Gool

Inspired by the recent advance in unsupervised contrastive representation learning, we propose a pixel-wise contrastive framework for semantic segmentation in the fully supervised setting.

Metric Learning Optical Character Recognition (OCR) +3

Fast Convergence of DETR with Spatially Modulated Co-Attention

2 code implementations • 19 Jan 2021 Peng Gao, Minghang Zheng, Xiaogang Wang, Jifeng Dai, Hongsheng Li

The recently proposed Detection Transformer (DETR) model successfully applies the Transformer to object detection and achieves comparable performance with two-stage object detection frameworks such as Faster-RCNN.

Decoder object-detection +1

Unsupervised Object Detection with LiDAR Clues

no code implementations CVPR 2021 Hao Tian, Yuntao Chen, Jifeng Dai, Zhaoxiang Zhang, Xizhou Zhu

We further identify another major issue, seldom noticed by the community, that the long-tailed and open-ended (sub-)category distribution should be accommodated.

Object object-detection +2

Auto Seg-Loss: Searching Metric Surrogates for Semantic Segmentation

1 code implementation ICLR 2021 Hao Li, Chenxin Tao, Xizhou Zhu, Xiaogang Wang, Gao Huang, Jifeng Dai

In this paper, we propose to automate the design of metric-specific loss functions by searching differentiable surrogate losses for each metric.

Semantic Segmentation

Deformable DETR: Deformable Transformers for End-to-End Object Detection

18 code implementations ICLR 2021 Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, Jifeng Dai

DETR has been recently proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance.

Real-Time Object Detection

Resolution Adaptive Networks for Efficient Inference

2 code implementations CVPR 2020 Le Yang, Yizeng Han, Xi Chen, Shiji Song, Jifeng Dai, Gao Huang

Adaptive inference is an effective mechanism to achieve a dynamic tradeoff between accuracy and computational cost in deep networks.

Hierarchical Human Parsing with Typed Part-Relation Reasoning

1 code implementation CVPR 2020 Wenguan Wang, Hailong Zhu, Jifeng Dai, Yanwei Pang, Jianbing Shen, Ling Shao

As human bodies are inherently hierarchically structured, how to model human structures is the central theme in this task.

Human Parsing Relation

Deformable Kernels: Adapting Effective Receptive Fields for Object Deformation

2 code implementations ICLR 2020 Hang Gao, Xizhou Zhu, Steve Lin, Jifeng Dai

This is typically done by augmenting static operators with learned free-form sampling grids in the image space, dynamically tuned to the data and task for adapting the receptive field.

Image Classification Object +1

An Empirical Study of Spatial Attention Mechanisms in Deep Networks

1 code implementation ICCV 2019 Xizhou Zhu, Dazhi Cheng, Zheng Zhang, Stephen Lin, Jifeng Dai

Attention mechanisms have become a popular component in deep neural networks, yet there has been little examination of how different influencing factors and methods for computing attention from these factors affect performance.

Decoder

Deformable ConvNets v2: More Deformable, Better Results

24 code implementations CVPR 2019 Xizhou Zhu, Han Hu, Stephen Lin, Jifeng Dai

The superior performance of Deformable Convolutional Networks arises from its ability to adapt to the geometric variations of objects.

Instance Segmentation Object +2
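At its core, a deformable convolution samples each kernel tap at its regular grid position plus a learned fractional offset, read via bilinear interpolation. The single-location, single-channel sketch below is illustrative only: real layers predict offsets with a parallel convolution branch, and DCNv2 additionally modulates each tap's amplitude, both omitted here:

```python
import numpy as np

def bilinear_sample(img, y, x):
    # Read img at a fractional location (y, x) by bilinear interpolation.
    h, w = img.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - np.floor(y), x - np.floor(x)
    y0, x0 = max(y0, 0), max(x0, 0)
    return ((1 - wy) * (1 - wx) * img[y0, x0] + (1 - wy) * wx * img[y0, x1]
            + wy * (1 - wx) * img[y1, x0] + wy * wx * img[y1, x1])

def deform_conv_point(img, weights, offsets, cy, cx):
    # 3x3 deformable convolution at one output location: tap k samples
    # at its regular grid position plus a learned offset (dy_k, dx_k).
    out, k = 0.0, 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            oy, ox = offsets[k]
            out += weights[k] * bilinear_sample(img, cy + dy + oy, cx + dx + ox)
            k += 1
    return out
```

With all offsets zero this reduces exactly to an ordinary 3x3 convolution, which is why the deformable operator can be dropped into existing networks and learn its sampling geometry from data.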

Towards High Performance Video Object Detection for Mobiles

3 code implementations • 16 Apr 2018 Xizhou Zhu, Jifeng Dai, Xingchi Zhu, Yichen Wei, Lu Yuan

In this paper, we present a lightweight network architecture for video object detection on mobiles.

Object object-detection +2

Learning Region Features for Object Detection

no code implementations ECCV 2018 Jiayuan Gu, Han Hu, Liwei Wang, Yichen Wei, Jifeng Dai

While most steps in the modern object detection methods are learnable, the region feature extraction step remains largely hand-crafted, featured by RoI pooling methods.

Object object-detection +1

Relation Networks for Object Detection

6 code implementations CVPR 2018 Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, Yichen Wei

Although it is well believed for years that modeling relations between objects would help object recognition, there has not been evidence that the idea is working in the deep learning era.

Object object-detection +3

Flow-Guided Feature Aggregation for Video Object Detection

2 code implementations ICCV 2017 Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan, Yichen Wei

The accuracy of detection suffers from degenerated object appearances in videos, e.g., motion blur, video defocus, rare poses, etc.

Object object-detection +2

Deformable Convolutional Networks

38 code implementations ICCV 2017 Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, Yichen Wei

Convolutional neural networks (CNNs) are inherently limited to model geometric transformations due to the fixed geometric structures in its building modules.

Object Detection Semantic Segmentation +1

Deep Feature Flow for Video Recognition

3 code implementations CVPR 2017 Xizhou Zhu, Yuwen Xiong, Jifeng Dai, Lu Yuan, Yichen Wei

Yet, it is non-trivial to transfer the state-of-the-art image recognition networks to videos as per-frame evaluation is too slow and unaffordable.

Video Recognition Video Semantic Segmentation

R-FCN: Object Detection via Region-based Fully Convolutional Networks

49 code implementations NeurIPS 2016 Jifeng Dai, Yi Li, Kaiming He, Jian Sun

In contrast to previous region-based detectors such as Fast/Faster R-CNN that apply a costly per-region subnetwork hundreds of times, our region-based detector is fully convolutional with almost all computation shared on the entire image.

Object Real-Time Object Detection +1

ScribbleSup: Scribble-Supervised Convolutional Networks for Semantic Segmentation

no code implementations CVPR 2016 Di Lin, Jifeng Dai, Jiaya Jia, Kaiming He, Jian Sun

Large-scale data is of crucial importance for learning semantic segmentation models, but annotating per-pixel masks is a tedious and inefficient procedure.

Image Segmentation Segmentation +1

Instance-sensitive Fully Convolutional Networks

no code implementations • 29 Mar 2016 Jifeng Dai, Kaiming He, Yi Li, Shaoqing Ren, Jian Sun

In contrast to the previous FCN that generates one score map, our FCN is designed to compute a small set of instance-sensitive score maps, each of which is the outcome of a pixel-wise classifier of a relative position to instances.

Position Semantic Segmentation

BoxSup: Exploiting Bounding Boxes to Supervise Convolutional Networks for Semantic Segmentation

no code implementations ICCV 2015 Jifeng Dai, Kaiming He, Jian Sun

Recent leading approaches to semantic segmentation rely on deep convolutional networks trained with human-annotated, pixel-level segmentation masks.

Segmentation Semantic Segmentation

Generative Modeling of Convolutional Neural Networks

no code implementations • 19 Dec 2014 Jifeng Dai, Yang Lu, Ying-Nian Wu

(2) We propose a generative gradient for pre-training CNNs by a non-parametric importance sampling scheme, which is fundamentally different from the commonly used discriminative gradient, and yet has the same computational architecture and cost as the latter.

Convolutional Feature Masking for Joint Object and Stuff Segmentation

1 code implementation CVPR 2015 Jifeng Dai, Kaiming He, Jian Sun

The current leading approaches for semantic segmentation exploit shape information by extracting CNN features from masked image regions.

Object Segmentation +1

Unsupervised Learning of Dictionaries of Hierarchical Compositional Models

no code implementations CVPR 2014 Jifeng Dai, Yi Hong, Wenze Hu, Song-Chun Zhu, Ying Nian Wu

Given a set of unannotated training images, a dictionary of such hierarchical templates is learned so that each training image can be represented by a small number of templates that are spatially translated, rotated and scaled versions of the templates in the learned dictionary.

Domain Adaptation Template Matching
