Search Results for author: Yu Qiao

Found 410 papers, 245 papers with code

ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks

45 code implementations • 1 Sep 2018 • Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Chen Change Loy, Yu Qiao, Xiaoou Tang

To further enhance the visual quality, we thoroughly study three key components of SRGAN - network architecture, adversarial loss and perceptual loss, and improve each of them to derive an Enhanced SRGAN (ESRGAN).

Ranked #2 on Face Hallucination on FFHQ 512 x 512 - 16x upscaling

Face Hallucination Generative Adversarial Network +2

15,711

Paper
Code

Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks

42 code implementations • 11 Apr 2016 • Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, Yu Qiao

Face detection and alignment in unconstrained environment are challenging due to various poses, illuminations and occlusions.

Ranked #27 on Face Detection on WIDER Face (Easy)

Face Alignment Face Detection

9,996

Paper
Code

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

4 code implementations • 10 Jul 2023 • Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, Bo Dai

Once trained, the motion module can be inserted into a personalized T2I model to form a personalized animation generator.

Image Animation

8,762

Paper
Code

LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

7 code implementations • 28 Mar 2023 • Renrui Zhang, Jiaming Han, Chris Liu, Peng Gao, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Yu Qiao

We present LLaMA-Adapter, a lightweight adaption method to efficiently fine-tune LLaMA into an instruction-following model.

Ranked #2 on Music Question Answering on MusicQA

Instruction Following Language Modelling +3

5,793

Paper
Code

LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model

3 code implementations • 28 Apr 2023 • Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei zhang, Pan Lu, Conghui He, Xiangyu Yue, Hongsheng Li, Yu Qiao

This strategy effectively alleviates the interference between the two tasks of image-text alignment and instruction following and achieves strong multi-modal reasoning with only a small-scale image-text and instruction dataset.

Ranked #6 on Visual Question Answering (VQA) on InfiMM-Eval

Instruction Following Optical Character Recognition (OCR) +7

5,502

Paper
Code

ImageBind-LLM: Multi-modality Instruction Tuning

2 code implementations • 7 Sep 2023 • Jiaming Han, Renrui Zhang, Wenqi Shao, Peng Gao, Peng Xu, Han Xiao, Kaipeng Zhang, Chris Liu, Song Wen, Ziyu Guo, Xudong Lu, Shuai Ren, Yafei Wen, Xiaoxin Chen, Xiangyu Yue, Hongsheng Li, Yu Qiao

During training, we adopt a learnable bind network to align the embedding space between LLaMA and ImageBind's image encoder.

Instruction Following Text Generation

5,502

Paper
Code

InternLM2 Technical Report

1 code implementation • 26 Mar 2024 • Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, Penglong Jiao, Zhenjiang Jin, Zhikai Lei, Jiaxing Li, Jingwen Li, Linyang Li, Shuaibin Li, Wei Li, Yining Li, Hongwei Liu, Jiangning Liu, Jiawei Hong, Kaiwen Liu, Kuikun Liu, Xiaoran Liu, Chengqi Lv, Haijun Lv, Kai Lv, Li Ma, Runyuan Ma, Zerun Ma, Wenchang Ning, Linke Ouyang, Jiantao Qiu, Yuan Qu, FuKai Shang, Yunfan Shao, Demin Song, Zifan Song, Zhihao Sui, Peng Sun, Yu Sun, Huanze Tang, Bin Wang, Guoteng Wang, Jiaqi Wang, Jiayu Wang, Rui Wang, Yudong Wang, Ziyi Wang, Xingjian Wei, Qizhen Weng, Fan Wu, Yingtong Xiong, Chao Xu, Ruiliang Xu, Hang Yan, Yirong Yan, Xiaogui Yang, Haochen Ye, Huaiyuan Ying, JIA YU, Jing Yu, Yuhang Zang, Chuyu Zhang, Li Zhang, Pan Zhang, Peng Zhang, Ruijie Zhang, Shuo Zhang, Songyang Zhang, Wenjian Zhang, Wenwei Zhang, Xingcheng Zhang, Xinyue Zhang, Hui Zhao, Qian Zhao, Xiaomeng Zhao, Fengzhe Zhou, Zaida Zhou, Jingming Zhuo, Yicheng Zou, Xipeng Qiu, Yu Qiao, Dahua Lin

The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI).

Ranked #5 on Long-Context Understanding on Ada-LEval (BestAnswer)

4k Long-Context Understanding

5,186

Paper
Code

Faster Segment Anything: Towards Lightweight SAM for Mobile Applications

2 code implementations • 25 Jun 2023 • Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, Choong Seon Hong

Concretely, we distill the knowledge from the heavy image encoder (ViT-H in the original SAM) to a lightweight image encoder, which can be automatically compatible with the mask decoder in the original SAM.

Image Segmentation Instance Segmentation +1

4,283

Paper
Code

Temporal Segment Networks for Action Recognition in Videos

11 code implementations • 8 May 2017 • Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, Luc van Gool

Furthermore, based on the temporal segment networks, we won the video classification track at the ActivityNet challenge 2016 among 24 teams, which demonstrates the effectiveness of TSN and the proposed good practices.

Ranked #5 on Video Classification on COIN

Action Classification Action Recognition In Videos +3

3,892

Paper
Code

Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

19 code implementations • 2 Aug 2016 • Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, Luc van Gool

The other contribution is our study on a series of good practices in learning ConvNets on video data with the help of temporal segment network.

Ranked #3 on Multimodal Activity Recognition on EV-Action

Action Classification Action Recognition In Videos +2

3,892

Paper
Code

UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer

3 code implementations • 17 Nov 2022 • Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, LiMin Wang, Yu Qiao

UniFormer has successfully alleviated this issue, by unifying convolution and self-attention as a relation aggregator in the transformer format.

Video Understanding

3,892

Paper
Code

Detecting Text in Natural Image with Connectionist Text Proposal Network

27 code implementations • 12 Sep 2016 • Zhi Tian, Weilin Huang, Tong He, Pan He, Yu Qiao

We propose a novel Connectionist Text Proposal Network (CTPN) that accurately localizes text lines in natural image.

Scene Text Detection

3,414

Paper
Code

Domain Generalization with MixStyle

3 code implementations • ICLR 2021 • Kaiyang Zhou, Yongxin Yang, Yu Qiao, Tao Xiang

Our method, termed MixStyle, is motivated by the observation that visual domain is closely related to image style (e. g., photo vs.~sketch images).

Ranked #57 on Domain Generalization on PACS

Domain Generalization Retrieval

3,146

Paper
Code

InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language

2 code implementations • 9 May 2023 • Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Zeqiang Lai, Yang Yang, Qingyun Li, Jiashuo Yu, Kunchang Li, Zhe Chen, Xue Yang, Xizhou Zhu, Yali Wang, LiMin Wang, Ping Luo, Jifeng Dai, Yu Qiao

Different from existing interactive systems that rely on pure language, by incorporating pointing instructions, the proposed iGPT significantly improves the efficiency of communication between users and chatbots, as well as the accuracy of chatbots in vision-centric tasks, especially in complicated visual scenarios where the number of objects is greater than 2.

Language Modelling

3,121

Paper
Code

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

2 code implementations • NeurIPS 2023 • Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie zhou, Yu Qiao, Jifeng Dai

We hope this model can set a new baseline for generalist vision and language models.

Language Modelling Large Language Model

3,121

Paper
Code

DiffBIR: Towards Blind Image Restoration with Generative Diffusion Prior

1 code implementation • 29 Aug 2023 • Xinqi Lin, Jingwen He, Ziyan Chen, Zhaoyang Lyu, Bo Dai, Fanghua Yu, Wanli Ouyang, Yu Qiao, Chao Dong

We present DiffBIR, a general restoration pipeline that could handle different blind image restoration tasks in a unified framework.

Ranked #1 on Blind Face Restoration on LFW

Blind Face Restoration Image Denoising +2

3,007

Paper
Code

UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning

3 code implementations • ICLR 2022 • Kunchang Li, Yali Wang, Gao Peng, Guanglu Song, Yu Liu, Hongsheng Li, Yu Qiao

For Something-Something V1 and V2, our UniFormer achieves new state-of-the-art performances of 60. 8% and 71. 4% top-1 accuracy respectively.

Ranked #8 on Action Recognition on Something-Something V1

Action Classification Action Recognition +1

2,991

Paper
Code

UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning

2 code implementations • 12 Jan 2022 • Kunchang Li, Yali Wang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, Yu Qiao

For Something-Something V1 and V2, our UniFormer achieves new state-of-the-art performances of 60. 9% and 71. 2% top-1 accuracy respectively.

Representation Learning

2,991

Paper
Code

Delving into the Devils of Bird's-eye-view Perception: A Review, Evaluation and Recipe

2 code implementations • 12 Sep 2022 • Hongyang Li, Chonghao Sima, Jifeng Dai, Wenhai Wang, Lewei Lu, Huijie Wang, Jia Zeng, Zhiqi Li, Jiazhi Yang, Hanming Deng, Hao Tian, Enze Xie, Jiangwei Xie, Li Chen, Tianyu Li, Yang Li, Yulu Gao, Xiaosong Jia, Si Liu, Jianping Shi, Dahua Lin, Yu Qiao

As sensor configurations get more complex, integrating multi-source information from different sensors and representing features in a unified view come of vital importance.

Autonomous Driving

2,870

Paper
Code

BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision

2 code implementations • CVPR 2023 • Chenyu Yang, Yuntao Chen, Hao Tian, Chenxin Tao, Xizhou Zhu, Zhaoxiang Zhang, Gao Huang, Hongyang Li, Yu Qiao, Lewei Lu, Jie zhou, Jifeng Dai

The proposed method is verified with a wide spectrum of traditional and modern image backbones and achieves new SoTA results on the large-scale nuScenes dataset.

Ranked #5 on 3D Object Detection on Rope3D

3D Object Detection

2,870

Paper
Code

Planning-oriented Autonomous Driving

1 code implementation • CVPR 2023 • Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, Lewei Lu, Xiaosong Jia, Qiang Liu, Jifeng Dai, Yu Qiao, Hongyang Li

Oriented at this, we revisit the key components within perception and prediction, and prioritize the tasks such that all these tasks contribute to planning.

Autonomous Driving Philosophy

2,816

Paper
Code

AIM 2020 Challenge on Efficient Super-Resolution: Methods and Results

3 code implementations • 15 Sep 2020 • Kai Zhang, Martin Danelljan, Yawei Li, Radu Timofte, Jie Liu, Jie Tang, Gangshan Wu, Yu Zhu, Xiangyu He, Wenjie Xu, Chenghua Li, Cong Leng, Jian Cheng, Guangyang Wu, Wenyi Wang, Xiaohong Liu, Hengyuan Zhao, Xiangtao Kong, Jingwen He, Yu Qiao, Chao Dong, Maitreya Suin, Kuldeep Purohit, A. N. Rajagopalan, Xiaochuan Li, Zhiqiang Lang, Jiangtao Nie, Wei Wei, Lei Zhang, Abdul Muqeet, Jiwon Hwang, Subin Yang, JungHeum Kang, Sung-Ho Bae, Yongwoo Kim, Geun-Woo Jeon, Jun-Ho Choi, Jun-Hyuk Kim, Jong-Seok Lee, Steven Marty, Eric Marty, Dongliang Xiong, Siang Chen, Lin Zha, Jiande Jiang, Xinbo Gao, Wen Lu, Haicheng Wang, Vineeth Bhaskara, Alex Levinshtein, Stavros Tsogkas, Allan Jepson, Xiangzhen Kong, Tongtong Zhao, Shanshan Zhao, Hrishikesh P. S, Densen Puthussery, Jiji C. V, Nan Nan, Shuai Liu, Jie Cai, Zibo Meng, Jiaming Ding, Chiu Man Ho, Xuehui Wang, Qiong Yan, Yuzhi Zhao, Long Chen, Jiangtao Zhang, Xiaotong Luo, Liang Chen, Yanyun Qu, Long Sun, Wenhao Wang, Zhenbing Liu, Rushi Lan, Rao Muhammad Umer, Christian Micheloni

This paper reviews the AIM 2020 challenge on efficient single image super-resolution with focus on the proposed solutions and results.

Image Super-Resolution

2,713

Paper
Code

VideoChat: Chat-Centric Video Understanding

1 code implementation • 10 May 2023 • Kunchang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, LiMin Wang, Yu Qiao

In this paper, we initiate an attempt of developing an end-to-end chat-centric video understanding system, coined as VideoChat.

Ranked #7 on Video Question Answering on MVBench

Video-based Generative Performance Benchmarking (Consistency) Video-based Generative Performance Benchmarking (Contextual Understanding) +5

2,667

Paper
Code

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

1 code implementation • 28 Nov 2023 • Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, LiMin Wang, Yu Qiao

With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models.

Ranked #1 on Zero-Shot Video Question Answer on STAR Benchmark

Fairness Multiple-choice +8

2,667

Paper
Code

SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

1 code implementation • 13 Nov 2023 • Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, Jiaming Han, Siyuan Huang, Yichi Zhang, Xuming He, Hongsheng Li, Yu Qiao

We present SPHINX, a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings.

Ranked #2 on Visual Question Answering on BenchLMM

Described Object Detection Language Modelling +4

2,496

Paper
Code

SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models

1 code implementation • 8 Feb 2024 • Peng Gao, Renrui Zhang, Chris Liu, Longtian Qiu, Siyuan Huang, Weifeng Lin, Shitian Zhao, Shijie Geng, Ziyi Lin, Peng Jin, Kaipeng Zhang, Wenqi Shao, Chao Xu, Conghui He, Junjun He, Hao Shao, Pan Lu, Hongsheng Li, Yu Qiao

We propose SPHINX-X, an extensive Multimodality Large Language Model (MLLM) series developed upon SPHINX.

Ranked #5 on Video Question Answering on MVBench

Benchmarking Language Modelling +4

2,496

Paper
Code

InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions

2 code implementations • CVPR 2023 • Wenhai Wang, Jifeng Dai, Zhe Chen, Zhenhang Huang, Zhiqi Li, Xizhou Zhu, Xiaowei Hu, Tong Lu, Lewei Lu, Hongsheng Li, Xiaogang Wang, Yu Qiao

Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still in an early state.

Ranked #1 on Instance Segmentation on COCO test-dev (AP50 metric, using extra training data)

Classification Image Classification +3

2,310

Paper
Code

Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

2 code implementations • CVPR 2023 • Hao Li, Jinguo Zhu, Xiaohu Jiang, Xizhou Zhu, Hongsheng Li, Chun Yuan, Xiaohua Wang, Yu Qiao, Xiaogang Wang, Wenhai Wang, Jifeng Dai

In this paper, we propose Uni-Perceiver v2, which is the first generalist model capable of handling major large-scale vision and vision-language tasks with competitive performance.

Language Modelling Multi-Task Learning

2,310

Paper
Code

Point Transformer V3: Simpler, Faster, Stronger

3 code implementations • 15 Dec 2023 • Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xihui Liu, Yu Qiao, Wanli Ouyang, Tong He, Hengshuang Zhao

This paper is not motivated to seek innovation within the attention mechanism.

Ranked #1 on Semantic Segmentation on S3DIS (using extra training data)

3D Semantic Segmentation LIDAR Semantic Segmentation +1

1,991

Paper
Code

Point2RBox: Combine Knowledge from Synthetic Visual Patterns for End-to-end Oriented Object Detection with Single Point Supervision

2 code implementations • 23 Nov 2023 • Yi Yu, Xue Yang, Qingyun Li, Feipeng Da, Jifeng Dai, Yu Qiao, Junchi Yan

To our best knowledge, Point2RBox is the first end-to-end solution for point-supervised OOD.

Object object-detection +2

1,724

Paper
Code

InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition

1 code implementation • 26 Sep 2023 • Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Haodong Duan, Songyang Zhang, Shuangrui Ding, Wenwei Zhang, Hang Yan, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang

We propose InternLM-XComposer, a vision-language large model that enables advanced image-text comprehension and composition.

Ranked #9 on Visual Question Answering (VQA) on InfiMM-Eval

Image Comprehension Reading Comprehension +1

1,636

Paper
Code

InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

1 code implementation • 29 Jan 2024 • Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang

We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-form text-image composition and comprehension.

Ranked #16 on Visual Question Answering on MM-Vet

Language Modelling Visual Question Answering

1,636

Paper
Code

InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD

2 code implementations • 9 Apr 2024 • Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Zhe Chen, Xinyue Zhang, Wei Li, Jingwen Li, Wenhai Wang, Kai Chen, Conghui He, Xingcheng Zhang, Jifeng Dai, Yu Qiao, Dahua Lin, Jiaqi Wang

The Large Vision-Language Model (LVLM) field has seen significant advancements, yet its progression has been hindered by challenges in comprehending fine-grained visual content due to limited resolution.

Ranked #11 on Visual Question Answering on MM-Vet

4k Language Modelling +1

1,636

Paper
Code

Meta-Transformer: A Unified Framework for Multimodal Learning

1 code implementation • 20 Jul 2023 • Yiyuan Zhang, Kaixiong Gong, Kaipeng Zhang, Hongsheng Li, Yu Qiao, Wanli Ouyang, Xiangyu Yue

Multimodal learning aims to build models that can process and relate information from multiple modalities.

Time Series

1,439

Paper
Code

Vision Transformer Adapter for Dense Predictions

1 code implementation • 17 May 2022 • Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, Yu Qiao

This work investigates a simple yet powerful dense prediction task adapter for Vision Transformer (ViT).

Ranked #4 on Semantic Segmentation on PASCAL Context

Instance Segmentation Panoptic Segmentation +1

1,118

Paper
Code

Domain Adaptive Ensemble Learning

1 code implementation • 16 Mar 2020 • Kaiyang Zhou, Yongxin Yang, Yu Qiao, Tao Xiang

Each such classifier is an expert to its own domain and a non-expert to others.

Domain Generalization Ensemble Learning +3

1,083

Paper
Code

Domain Generalization: A Survey

2 code implementations • 3 Mar 2021 • Kaiyang Zhou, Ziwei Liu, Yu Qiao, Tao Xiang, Chen Change Loy

Generalization to out-of-distribution (OOD) data is a capability natural to humans yet challenging for machines to reproduce.

Action Recognition Data Augmentation +8

1,083

Paper
Code

MixStyle Neural Networks for Domain Generalization and Adaptation

2 code implementations • 5 Jul 2021 • Kaiyang Zhou, Yongxin Yang, Yu Qiao, Tao Xiang

MixStyle is easy to implement with a few lines of code, does not require modification to training objectives, and can fit a variety of learning paradigms including supervised domain generalization, semi-supervised domain generalization, and unsupervised domain adaptation.

Data Augmentation Domain Generalization +6

1,083

Paper
Code

Activating More Pixels in Image Super-Resolution Transformer

2 code implementations • CVPR 2023 • Xiangyu Chen, Xintao Wang, Jiantao Zhou, Yu Qiao, Chao Dong

In the training stage, we additionally adopt a same-task pre-training strategy to exploit the potential of the model for further improvement.

Ranked #1 on Image Super-Resolution on Set5 - 2x upscaling

Image Super-Resolution

1,076

Paper
Code

HAT: Hybrid Attention Transformer for Image Restoration

2 code implementations • 11 Sep 2023 • Xiangyu Chen, Xintao Wang, Wenlong Zhang, Xiangtao Kong, Yu Qiao, Jiantao Zhou, Chao Dong

In the training stage, we additionally adopt a same-task pre-training strategy to further exploit the potential of the model for further improvement.

Image Compression Image Denoising +2

1,076

Paper
Code

Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection

1 code implementation • NeurIPS 2023 • Linyan Huang, Zhiqi Li, Chonghao Sima, Wenhai Wang, Jingdong Wang, Yu Qiao, Hongyang Li

Current research is primarily dedicated to advancing the accuracy of camera-only 3D object detectors (apprentice) through the knowledge transferred from LiDAR- or multi-modal-based counterparts (expert).

Ranked #6 on 3D Object Detection on nuScenes Camera Only

3D Object Detection object-detection

1,070

Paper
Code

A Discriminative Feature Learning Approach for Deep Face Recognition

1 code implementation • ECCV 2016 2016 • Yandong Wen, Kaipeng Zhang, Zhifeng Li, Yu Qiao

In most of the available CNNs, the softmax loss function is used as the supervision signal to train the deep model.

Face Recognition Face Verification

942

Paper
Code

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

1 code implementation • 6 Dec 2022 • Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, LiMin Wang, Yu Qiao

Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates video representations of these two complementary frameworks in a learnable manner to boost various video applications.

Ranked #1 on Action Recognition on Something-Something V1 (using extra training data)

Action Classification Contrastive Learning +8

921

Paper
Code

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

1 code implementation • 13 Jul 2023 • Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Conghui He, Ping Luo, Ziwei Liu, Yali Wang, LiMin Wang, Yu Qiao

Specifically, we utilize a multi-scale approach to generate video-related descriptions.

Action Recognition Contrastive Learning +7

921

Paper
Code

Harvest Video Foundation Models via Efficient Post-Pretraining

1 code implementation • 30 Oct 2023 • Yizhuo Li, Kunchang Li, Yinan He, Yi Wang, Yali Wang, LiMin Wang, Yu Qiao, Ping Luo

Building video-language foundation models is costly and difficult due to the redundant nature of video data and the lack of high-quality video-language datasets.

Question Answering Text Retrieval +2

921

Paper
Code

InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding

2 code implementations • 22 Mar 2024 • Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei HUANG, Yu Qiao, Yali Wang, LiMin Wang

We introduce InternVideo2, a new video foundation model (ViFM) that achieves the state-of-the-art performance in action recognition, video-text tasks, and video-centric dialogue.

Ranked #1 on Audio Classification on ESC-50 (using extra training data)

Action Classification Action Recognition +12

921

Paper
Code

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

2 code implementations • 21 Dec 2023 • Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, Jifeng Dai

However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs.

Ranked #1 on Zero-Shot Video Retrieval on MSR-VTT-full (using extra training data)

Image Retrieval Image-to-Text Retrieval +10

844

Paper
Code

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

1 code implementation • 25 Apr 2024 • Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, Ji Ma, Jiaqi Wang, Xiaoyi Dong, Hang Yan, Hewei Guo, Conghui He, Zhenjiang Jin, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Lewei Lu, Xizhou Zhu, Tong Lu, Dahua Lin, Yu Qiao

Compared to both open-source and proprietary models, InternVL 1. 5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks.

844

Paper
Code

UniFormer: Unifying Convolution and Self-attention for Visual Recognition

7 code implementations • 24 Jan 2022 • Kunchang Li, Yali Wang, Junhao Zhang, Peng Gao, Guanglu Song, Yu Liu, Hongsheng Li, Yu Qiao

Different from the typical transformer blocks, the relation aggregators in our UniFormer block are equipped with local and global token affinity respectively in shallow and deep layers, allowing to tackle both redundancy and dependency for efficient and effective representation learning.

Ranked #153 on Image Classification on ImageNet

Image Classification object-detection +5

778

Paper
Code

SAM-Med2D

3 code implementations • 30 Aug 2023 • Junlong Cheng, Jin Ye, Zhongying Deng, Jianpin Chen, Tianbin Li, Haoyu Wang, Yanzhou Su, Ziyan Huang, Jilong Chen, Lei Jiang, Hui Sun, Junjun He, Shaoting Zhang, Min Zhu, Yu Qiao

To bridge this gap, we introduce SAM-Med2D, the most comprehensive studies on applying SAM to medical 2D images.

Image Segmentation Interactive Segmentation +3

742

Paper
Code

SA-Med2D-20M Dataset: Segment Anything in 2D Medical Imaging with 20 Million masks

1 code implementation • 20 Nov 2023 • Jin Ye, Junlong Cheng, Jianpin Chen, Zhongying Deng, Tianbin Li, Haoyu Wang, Yanzhou Su, Ziyan Huang, Jilong Chen, Lei Jiang, Hui Sun, Min Zhu, Shaoting Zhang, Junjun He, Yu Qiao

Segment Anything Model (SAM) has achieved impressive results for natural image segmentation with input prompts such as points and bounding boxes.

Image Segmentation Medical Image Segmentation +2

742

Paper
Code

LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models

2 code implementations • 26 Sep 2023 • Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, Yuwei Guo, Tianxing Wu, Chenyang Si, Yuming Jiang, Cunjian Chen, Chen Change Loy, Bo Dai, Dahua Lin, Yu Qiao, Ziwei Liu

To this end, we propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models, comprising a base T2V model, a temporal interpolation model, and a video super-resolution model.

Ranked #4 on Text-to-Video Generation on EvalCrafter Text-to-Video (ECTV) Dataset (using extra training data)

Text-to-Video Generation Video Generation +1

724

Paper
Code

Vision-Centric BEV Perception: A Survey

1 code implementation • 4 Aug 2022 • Yuexin Ma, Tai Wang, Xuyang Bai, Huitong Yang, Yuenan Hou, Yaming Wang, Yu Qiao, Ruigang Yang, Dinesh Manocha, Xinge Zhu

In recent years, vision-centric Bird's Eye View (BEV) perception has garnered significant interest from both industry and academia due to its inherent advantages, such as providing an intuitive representation of the world and being conducive to data fusion.

636

Paper
Code

FOTS: Fast Oriented Text Spotting with a Unified Network

7 code implementations • CVPR 2018 • Xuebo Liu, Ding Liang, Shi Yan, Dagui Chen, Yu Qiao, Junjie Yan

Incidental scene text spotting is considered one of the most difficult and valuable challenges in the document analysis community.

Ranked #4 on Scene Text Detection on ICDAR 2017 MLT

Scene Text Detection Scene Text Recognition +2

635

Paper
Code

VideoMamba: State Space Model for Efficient Video Understanding

3 code implementations • 11 Mar 2024 • Kunchang Li, Xinhao Li, Yi Wang, Yinan He, Yali Wang, LiMin Wang, Yu Qiao

Addressing the dual challenges of local redundancy and global dependencies in video understanding, this work innovatively adapts the Mamba to the video domain.

Video Understanding

577

Paper
Code

Bi3D: Bi-domain Active Learning for Cross-domain 3D Object Detection

1 code implementation • CVPR 2023 • Jiakang Yuan, Bo Zhang, Xiangchao Yan, Tao Chen, Botian Shi, Yikang Li, Yu Qiao

Unsupervised Domain Adaptation (UDA) technique has been explored in 3D cross-domain tasks recently.

3D Object Detection Active Learning +2

567

Paper
Code

Uni3D: A Unified Baseline for Multi-dataset 3D Object Detection

1 code implementation • CVPR 2023 • Bo Zhang, Jiakang Yuan, Botian Shi, Tao Chen, Yikang Li, Yu Qiao

In this paper, we study the task of training a unified 3D detector from multiple datasets.

3D Object Detection object-detection

567

Paper
Code

Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory

1 code implementation • 25 May 2023 • Xizhou Zhu, Yuntao Chen, Hao Tian, Chenxin Tao, Weijie Su, Chenyu Yang, Gao Huang, Bin Li, Lewei Lu, Xiaogang Wang, Yu Qiao, Zhaoxiang Zhang, Jifeng Dai

These agents, equipped with the logic and common sense capabilities of LLMs, can skillfully navigate complex, sparse-reward environments with text-based interactions.

Common Sense Reasoning Navigate +1

567

Paper
Code

ReSimAD: Zero-Shot 3D Domain Transfer for Autonomous Driving with Source Reconstruction and Target Simulation

2 code implementations • 11 Sep 2023 • Bo Zhang, Xinyu Cai, Jiakang Yuan, Donglin Yang, Jianfei Guo, Xiangchao Yan, Renqiu Xia, Botian Shi, Min Dou, Tao Chen, Si Liu, Junchi Yan, Yu Qiao

Domain shifts such as sensor type changes and geographical situation variations are prevalent in Autonomous Driving (AD), which poses a challenge since AD model relying on the previous domain knowledge can be hardly directly deployed to a new domain without additional costs.

Autonomous Driving Domain Generalization

567

Paper
Code

SPOT: Scalable 3D Pre-training via Occupancy Prediction for Autonomous Driving

1 code implementation • 19 Sep 2023 • Xiangchao Yan, Runjian Chen, Bo Zhang, Jiakang Yuan, Xinyu Cai, Botian Shi, Wenqi Shao, Junchi Yan, Ping Luo, Yu Qiao

Our contributions are threefold: (1) Occupancy prediction is shown to be promising for learning general representations, which is demonstrated by extensive experiments on plenty of datasets and tasks.

3D Object Detection Autonomous Driving +3

567

Paper
Code

AD-PT: Autonomous Driving Pre-Training with Large-scale Point Cloud Dataset

1 code implementation • NeurIPS 2023 • Jiakang Yuan, Bo Zhang, Xiangchao Yan, Tao Chen, Botian Shi, Yikang Li, Yu Qiao

It is a long-term vision for Autonomous Driving (AD) community that the perception models can learn from a large-scale point cloud dataset, to obtain unified representations that can achieve promising results on different tasks or benchmarks.

Autonomous Driving Point Cloud Pre-training

567

Paper
Code

OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

2 code implementations • 25 Aug 2023 • Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, Ping Luo

LWC modulates the extreme values of weights by optimizing the clipping threshold.

Common Sense Reasoning Computational Efficiency +3

561

Paper
Code

Towards Good Practices for Very Deep Two-Stream ConvNets

5 code implementations • 8 Jul 2015 • Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao

However, for action recognition in videos, the improvement of deep convolutional networks is not so evident.

Ranked #66 on Action Recognition on UCF101

Action Recognition In Videos Computational Efficiency +3

554

Paper
Code

Knowledge Guided Disambiguation for Large-Scale Scene Classification with Multi-Resolution CNNs

2 code implementations • 4 Oct 2016 • Limin Wang, Sheng Guo, Weilin Huang, Yuanjun Xiong, Yu Qiao

Convolutional Neural Networks (CNNs) have made remarkable progress on scene recognition, partially due to these recent large-scale scene datasets, such as the Places and Places2.

General Classification Scene Classification +1

550

Paper
Code

Real-time Action Recognition with Enhanced Motion Vector CNNs

1 code implementation • CVPR 2016 • Bowen Zhang, Li-Min Wang, Zhe Wang, Yu Qiao, Hanli Wang

The deep two-stream architecture exhibited excellent performance on video based action recognition.

Ranked #74 on Action Recognition on UCF101

Action Recognition Optical Flow Estimation +1

550

Paper
Code

OpenICL: An Open-Source Framework for In-context Learning

3 code implementations • 6 Mar 2023 • Zhenyu Wu, Yaoxiang Wang, Jiacheng Ye, Jiangtao Feng, Jingjing Xu, Yu Qiao, Zhiyong Wu

However, the implementation of ICL is sophisticated due to the diverse retrieval and inference methods involved, as well as the varying pre-processing requirements for different models, datasets, and tasks.

In-Context Learning Language Modelling +4

502

Paper
Code

Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling

1 code implementation • 6 Nov 2021 • Renrui Zhang, Rongyao Fang, Wei zhang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, Hongsheng Li

To further enhance CLIP's few-shot capability, CLIP-Adapter proposed to fine-tune a lightweight residual feature adapter and significantly improves the performance for few-shot classification.

Language Modelling Transfer Learning

470

Paper
Code

Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification

3 code implementations • 19 Jul 2022 • Renrui Zhang, Zhang Wei, Rongyao Fang, Peng Gao, Kunchang Li, Jifeng Dai, Yu Qiao, Hongsheng Li

On top of that, the performance of Tip-Adapter can be further boosted to be state-of-the-art on ImageNet by fine-tuning the cache model for 10$\times$ fewer epochs than existing methods, which is both effective and efficient.

Retrieval Transfer Learning

470

Paper
Code

Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners

3 code implementations • CVPR 2023 • Renrui Zhang, Xiangfei Hu, Bohao Li, Siyuan Huang, Hanqiu Deng, Hongsheng Li, Yu Qiao, Peng Gao

Our CaFo incorporates CLIP's language-contrastive knowledge, DINO's vision-contrastive knowledge, DALL-E's vision-generative knowledge, and GPT-3's language-generative knowledge.

Few-Shot Learning Representation Learning

470

Paper
Code

PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark

2 code implementations • 21 Mar 2022 • Li Chen, Chonghao Sima, Yang Li, Zehan Zheng, Jiajie Xu, Xiangwei Geng, Hongyang Li, Conghui He, Jianping Shi, Yu Qiao, Junchi Yan

Methods for 3D lane detection have been recently proposed to address the issue of inaccurate lane layouts in many autonomous driving scenarios (uphill/downhill, bump, etc.).

Ranked #5 on 3D Lane Detection on Apollo Synthetic 3D Lane

3D Lane Detection Autonomous Driving +1

468

Paper
Code

OneLLM: One Framework to Align All Modalities with Language

1 code implementation • 6 Dec 2023 • Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, Xiangyu Yue

In detail, we first train an image projection module to connect a vision encoder with LLM.

Ranked #75 on Visual Question Answering on MM-Vet

Question Answering Visual Question Answering

456

Paper
Code

Deep Reinforcement Learning for Unsupervised Video Summarization with Diversity-Representativeness Reward

6 code implementations • 29 Dec 2017 • Kaiyang Zhou, Yu Qiao, Tao Xiang

Video summarization aims to facilitate large-scale video browsing by producing short, concise summaries that are diverse and representative of original videos.

Ranked #7 on Unsupervised Video Summarization on TvSum

Decision Making reinforcement-learning +3

455

Paper
Code

ConvMAE: Masked Convolution Meets Masked Autoencoders

4 code implementations • 8 May 2022 • Peng Gao, Teli Ma, Hongsheng Li, Ziyi Lin, Jifeng Dai, Yu Qiao

Masked auto-encoding for feature pretraining and multi-scale hybrid convolution-transformer architectures can further unleash the potentials of ViT, leading to state-of-the-art performances on image classification, detection and semantic segmentation.

Computational Efficiency Image Classification +2

455

Paper
Code

You Only Need 90K Parameters to Adapt Light: A Light Weight Transformer for Image Enhancement and Exposure Correction

1 code implementation • 30 May 2022 • Ziteng Cui, Kunchang Li, Lin Gu, Shenghan Su, Peng Gao, Zhengkai Jiang, Yu Qiao, Tatsuya Harada

Challenging illumination conditions (low-light, under-exposure and over-exposure) in the real world not only cast an unpleasant visual appearance but also taint the computer vision tasks.

Ranked #2 on Image Enhancement on Exposure-Errors

Low-Light Image Enhancement object-detection +2

421

Paper
Code

Suppressing Uncertainties for Large-Scale Facial Expression Recognition

2 code implementations • CVPR 2020 • Kai Wang, Xiaojiang Peng, Jianfei Yang, Shijian Lu, Yu Qiao

Annotating a qualitative large-scale facial expression dataset is extremely difficult due to the uncertainties caused by ambiguous facial expressions, low-quality facial images, and the subjectiveness of annotators.

Facial Expression Recognition Facial Expression Recognition (FER)

404

Paper
Code

VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

1 code implementation • CVPR 2023 • LiMin Wang, Bingkun Huang, Zhiyu Zhao, Zhan Tong, Yinan He, Yi Wang, Yali Wang, Yu Qiao

Finally, we successfully train a video ViT model with a billion parameters, which achieves a new state-of-the-art performance on the datasets of Kinetics (90. 0% on K400 and 89. 9% on K600) and Something-Something (68. 7% on V1 and 77. 0% on V2).

Ranked #1 on Self-Supervised Action Recognition on UCF101 (using extra training data)

Action Classification Action Recognition In Videos +3

396

Paper
Code

CLIP-Adapter: Better Vision-Language Models with Feature Adapters

2 code implementations • 9 Oct 2021 • Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, Yu Qiao

Large-scale contrastive vision-language pre-training has shown significant progress in visual representation learning.

Prompt Engineering Representation Learning

389

Paper
Code

The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World

1 code implementation • 3 Aug 2023 • Weiyun Wang, Min Shi, Qingyun Li, Wenhai Wang, Zhenhang Huang, Linjie Xing, Zhe Chen, Hao Li, Xizhou Zhu, Zhiguo Cao, Yushi Chen, Tong Lu, Jifeng Dai, Yu Qiao

We present the All-Seeing (AS) project: a large-scale data and model for recognizing and understanding everything in the open world.

Question Answering Retrieval +1

373

Paper
Code

The All-Seeing Project V2: Towards General Relation Comprehension of the Open World

1 code implementation • 29 Feb 2024 • Weiyun Wang, Yiming Ren, Haowen Luo, Tiantong Li, Chenxiang Yan, Zhe Chen, Wenhai Wang, Qingyun Li, Lewei Lu, Xizhou Zhu, Yu Qiao, Jifeng Dai

In addition, we design a new benchmark, termed Circular-based Relation Probing Evaluation (CRPE) for comprehensively evaluating the relation comprehension capabilities of MLLMs.

Ranked #30 on Visual Question Answering on MM-Vet

Hallucination Object Localization +3

373

Paper
Code

Open-sourced Data Ecosystem in Autonomous Driving: the Present and Future

2 code implementations • 6 Dec 2023 • Hongyang Li, Yang Li, Huijie Wang, Jia Zeng, Huilin Xu, Pinlong Cai, Li Chen, Junchi Yan, Feng Xu, Lu Xiong, Jingdong Wang, Futang Zhu, Chunjing Xu, Tiancai Wang, Fei Xia, Beipeng Mu, Zhihui Peng, Dahua Lin, Yu Qiao

With the continuous maturation and application of autonomous driving technology, a systematic examination of open-source autonomous driving datasets becomes instrumental in fostering the robust evolution of the industry ecosystem.

Autonomous Driving

365

Paper
Code

Generalized Predictive Model for Autonomous Driving

1 code implementation • 14 Mar 2024 • Jiazhi Yang, Shenyuan Gao, Yihang Qiu, Li Chen, Tianyu Li, Bo Dai, Kashyap Chitta, Penghao Wu, Jia Zeng, Ping Luo, Jun Zhang, Andreas Geiger, Yu Qiao, Hongyang Li

In this paper, we introduce the first large-scale video prediction model in the autonomous driving discipline.

Autonomous Driving Video Prediction

365

Paper
Code

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

1 code implementation • 15 Jun 2023 • Peng Xu, Wenqi Shao, Kaipeng Zhang, Peng Gao, Shuo Liu, Meng Lei, Fanqing Meng, Siyuan Huang, Yu Qiao, Ping Luo

Large Vision-Language Models (LVLMs) have recently played a dominant role in multimodal vision-language learning.

Hallucination Image Captioning +3

362

Paper
Code

Tiny LVLM-eHub: Early Multimodal Experiments with Bard

1 code implementation • 7 Aug 2023 • Wenqi Shao, Yutao Hu, Peng Gao, Meng Lei, Kaipeng Zhang, Fanqing Meng, Peng Xu, Siyuan Huang, Hongsheng Li, Yu Qiao, Ping Luo

Secondly, it conducts an in-depth analysis of LVLMs' predictions using the ChatGPT Ensemble Evaluation (CEE), which leads to a robust and accurate evaluation and exhibits improved alignment with human evaluation compared to the word matching approach.

Hallucination Visual Reasoning

362

Paper
Code

OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM

1 code implementation • 14 Feb 2024 • Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, Ping Luo

Importantly, all images in this benchmark are sourced from authentic medical scenarios, ensuring alignment with the requirements of the medical field and suitability for evaluating LVLMs.

Medical Visual Question Answering Question Answering +1

362

Paper
Code

A-Eval: A Benchmark for Cross-Dataset Evaluation of Abdominal Multi-Organ Segmentation

2 code implementations • 7 Sep 2023 • Ziyan Huang, Zhongying Deng, Jin Ye, Haoyu Wang, Yanzhou Su, Tianbin Li, Hui Sun, Junlong Cheng, Jianpin Chen, Junjun He, Yun Gu, Shaoting Zhang, Lixu Gu, Yu Qiao

To address these questions, we introduce A-Eval, a benchmark for the cross-dataset Evaluation ('Eval') of Abdominal ('A') multi-organ segmentation.

Organ Segmentation Segmentation

353

Paper
Code

SAM-Med3D

1 code implementation • 23 Oct 2023 • Haoyu Wang, Sizheng Guo, Jin Ye, Zhongying Deng, Junlong Cheng, Tianbin Li, Jianpin Chen, Yanzhou Su, Ziyan Huang, Yiqing Shen, Bin Fu, Shaoting Zhang, Junjun He, Yu Qiao

These issues can hardly be addressed by fine-tuning SAM on medical data because the original 2D structure of SAM neglects 3D spatial information.

3D Architecture Image Segmentation +1

353

Paper
Code

ClassSR: A General Framework to Accelerate Super-Resolution Networks by Data Characteristic

3 code implementations • CVPR 2021 • Xiangtao Kong, Hengyuan Zhao, Yu Qiao, Chao Dong

On this basis, we propose a new solution pipeline -- ClassSR that combines classification and SR in a unified framework.

2k 8k +3

350

Paper
Code

Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications

1 code implementation • 11 Jan 2024 • Yuwen Xiong, Zhiqi Li, Yuntao Chen, Feng Wang, Xizhou Zhu, Jiapeng Luo, Wenhai Wang, Tong Lu, Hongsheng Li, Yu Qiao, Lewei Lu, Jie zhou, Jifeng Dai

The advancements in speed and efficiency of DCNv4, combined with its robust performance across diverse vision tasks, show its potential as a foundational building block for future vision models.

Image Classification Image Generation +1

326

Paper
Code

An end-to-end TextSpotter with Explicit Alignment and Attention

2 code implementations • CVPR 2018 • Tong He, Zhi Tian, Weilin Huang, Chunhua Shen, Yu Qiao, Changming Sun

This allows the two tasks to work collaboratively by shar- ing convolutional features, which is critical to identify challenging text instances.

Text Detection

323

Paper
Code

Frame attention networks for facial expression recognition in videos

2 code implementations • 29 Jun 2019 • Debin Meng, Xiaojiang Peng, Kai Wang, Yu Qiao

The feature embedding module is a deep Convolutional Neural Network (CNN) which embeds face images into feature vectors.

Ranked #3 on Facial Expression Recognition (FER) on CK+ (Accuracy (7 emotion) metric)

Facial Expression Recognition Facial Expression Recognition (FER)

323

Paper
Code

LimSim: A Long-term Interactive Multi-scenario Traffic Simulator

1 code implementation • 13 Jul 2023 • Licheng Wen, Daocheng Fu, Song Mao, Pinlong Cai, Min Dou, Yikang Li, Yu Qiao

With the growing popularity of digital twin and autonomous driving in transportation, the demand for simulation systems capable of generating high-fidelity and reliable scenarios is increasing.

Autonomous Driving

315

Paper
Code

Drive Like a Human: Rethinking Autonomous Driving with Large Language Models

1 code implementation • 14 Jul 2023 • Daocheng Fu, Xin Li, Licheng Wen, Min Dou, Pinlong Cai, Botian Shi, Yu Qiao

In this paper, we explore the potential of using a large language model (LLM) to understand the driving environment in a human-like manner and analyze its ability to reason, interpret, and memorize when facing complex scenarios.

Autonomous Driving Common Sense Reasoning +3

312

Paper
Code

DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models

2 code implementations • 28 Sep 2023 • Licheng Wen, Daocheng Fu, Xin Li, Xinyu Cai, Tao Ma, Pinlong Cai, Min Dou, Botian Shi, Liang He, Yu Qiao

Recent advancements in autonomous driving have relied on data-driven approaches, which are widely adopted but face challenges including dataset bias, overfitting, and uninterpretability.

Autonomous Driving Common Sense Reasoning +1

312

Paper
Code

MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection

1 code implementation • ICCV 2023 • Renrui Zhang, Han Qiu, Tai Wang, Ziyu Guo, Xuanzhuo Xu, Ziteng Cui, Yu Qiao, Peng Gao, Hongsheng Li

In this paper, we introduce the first DETR framework for Monocular DEtection with a depth-guided TRansformer, named MonoDETR.

Ranked #9 on 3D Object Detection From Monocular Images on KITTI-360

3D Object Detection From Monocular Images Autonomous Driving +3

311

Paper
Code

Enhanced Quadratic Video Interpolation

2 code implementations • 10 Sep 2020 • Yihao Liu, Liangbin Xie, Li Si-Yao, Wenxiu Sun, Yu Qiao, Chao Dong

In this work, we further improve the performance of QVI from three facets and propose an enhanced quadratic video interpolation (EQVI) model.

Super-Resolution Video Frame Interpolation

309

Paper
Code

Efficient Image Super-Resolution Using Pixel Attention

1 code implementation • 2 Oct 2020 • Hengyuan Zhao, Xiangtao Kong, Jingwen He, Yu Qiao, Chao Dong

Pixel attention (PA) is similar as channel attention and spatial attention in formulation.

Image Super-Resolution

309

Paper
Code

Towards Knowledge-driven Autonomous Driving

1 code implementation • 7 Dec 2023 • Xin Li, Yeqi Bai, Pinlong Cai, Licheng Wen, Daocheng Fu, Bo Zhang, Xuemeng Yang, Xinyu Cai, Tao Ma, Jianfei Guo, Xing Gao, Min Dou, Yikang Li, Botian Shi, Yong liu, Liang He, Yu Qiao

This paper explores the emerging knowledge-driven autonomous driving technologies.

Autonomous Driving Neural Rendering

304

Paper
Code

PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm

1 code implementation • 12 Oct 2023 • Haoyi Zhu, Honghui Yang, Xiaoyang Wu, Di Huang, Sha Zhang, Xianglong He, Hengshuang Zhao, Chunhua Shen, Yu Qiao, Tong He, Wanli Ouyang

In this paper, we introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation, thereby establishing a pathway to 3D foundational models.

Ranked #1 on 3D Semantic Segmentation on ScanNet++ (using extra training data)

3D Object Detection 3D Reconstruction +5

298

Paper
Code

UniSeg: A Unified Multi-Modal LiDAR Segmentation Network and the OpenPCSeg Codebase

1 code implementation • ICCV 2023 • Youquan Liu, Runnan Chen, Xin Li, Lingdong Kong, Yuchen Yang, Zhaoyang Xia, Yeqi Bai, Xinge Zhu, Yuexin Ma, Yikang Li, Yu Qiao, Yuenan Hou

Besides, we construct the OpenPCSeg codebase, which is the largest and most comprehensive outdoor LiDAR segmentation codebase.

Ranked #2 on 3D Semantic Segmentation on SemanticKITTI (using extra training data)

3D Semantic Segmentation LIDAR Semantic Segmentation +2

296

Paper
Code

PointCLIP: Point Cloud Understanding by CLIP

2 code implementations • CVPR 2022 • Renrui Zhang, Ziyu Guo, Wei zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao, Hongsheng Li

On top of that, we design an inter-view adapter to better extract the global feature and adaptively fuse the few-shot knowledge learned from 3D into CLIP pre-trained in 2D.

Ranked #3 on 3D Open-Vocabulary Instance Segmentation on STPLS3D

3D Open-Vocabulary Instance Segmentation Few-Shot Learning +6

291

Paper
Code

Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models

1 code implementation • 11 Oct 2023 • Zeqiang Lai, Xizhou Zhu, Jifeng Dai, Yu Qiao, Wenhai Wang

The revolution of artificial intelligence content generation has been rapidly accelerated with the booming text-to-image (T2I) diffusion models.

Code Generation Image Generation +2

287

Paper
Code

VBench: Comprehensive Benchmark Suite for Video Generative Models

1 code implementation • 29 Nov 2023 • Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, LiMin Wang, Dahua Lin, Yu Qiao, Ziwei Liu

We will open-source VBench, including all prompts, evaluation methods, generated videos, and human preference annotations, and also include more video generation models in VBench to drive forward the field of video generation.

Image Generation Video Generation

277

Paper
Code

DetZero: Rethinking Offboard 3D Object Detection with Long-term Sequential Point Clouds

1 code implementation • ICCV 2023 • Tao Ma, Xuemeng Yang, Hongbin Zhou, Xin Li, Botian Shi, Junjie Liu, Yuchen Yang, Zhizheng Liu, Liang He, Yu Qiao, Yikang Li, Hongsheng Li

Extensive experiments on Waymo Open Dataset show our DetZero outperforms all state-of-the-art onboard and offboard 3D detection methods.

3D Object Detection Object +1

276

Paper
Code

RankSRGAN: Generative Adversarial Networks with Ranker for Image Super-Resolution

2 code implementations • ICCV 2019 • Wenlong Zhang, Yihao Liu, Chao Dong, Yu Qiao

To address the problem, we propose Super-Resolution Generative Adversarial Networks with Ranker (RankSRGAN) to optimize generator in the direction of perceptual metrics.

Ranked #1 on Image Super-Resolution on PIRM-test

Image Super-Resolution

269

Paper
Code

ChEF: A Comprehensive Evaluation Framework for Standardized Assessment of Multimodal Large Language Models

1 code implementation • 5 Nov 2023 • Zhelun Shi, Zhipin Wang, Hongxing Fan, Zhenfei Yin, Lu Sheng, Yu Qiao, Jing Shao

We will publicly release all the detailed implementations for further analysis, as well as an easy-to-use modular toolkit for the integration of new recipes and models, so that ChEF can be a growing evaluation framework for the MLLM community.

Hallucination In-Context Learning +2

264

Paper
Code

Octavius: Mitigating Task Interference in MLLMs via LoRA-MoE

1 code implementation • 5 Nov 2023 • Zeren Chen, Ziqin Wang, Zhen Wang, Huayang Liu, Zhenfei Yin, Si Liu, Lu Sheng, Wanli Ouyang, Yu Qiao, Jing Shao

While this phenomenon has been overlooked in previous work, we propose a novel and extensible framework, called Octavius, for comprehensive studies and experimentation on multimodal learning with Multimodal Large Language Models (MLLMs).

Zero-shot Generalization

264

Paper
Code

On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving

1 code implementation • 9 Nov 2023 • Licheng Wen, Xuemeng Yang, Daocheng Fu, XiaoFeng Wang, Pinlong Cai, Xin Li, Tao Ma, Yingxuan Li, Linran Xu, Dengke Shang, Zheng Zhu, Shaoyan Sun, Yeqi Bai, Xinyu Cai, Min Dou, Shuanglu Hu, Botian Shi, Yu Qiao

This has been a significant bottleneck, particularly in the development of common sense reasoning and nuanced scene understanding necessary for safe and reliable autonomous driving.

Autonomous Driving Common Sense Reasoning +4

264

Paper
Code

MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception

1 code implementation • 12 Dec 2023 • Yiran Qin, Enshen Zhou, Qichang Liu, Zhenfei Yin, Lu Sheng, Ruimao Zhang, Yu Qiao, Jing Shao

It is a long-lasting goal to design an embodied system that can solve long-horizon open-world tasks in human-like ways.

264

Paper
Code

Assessment of Multimodal Large Language Models in Alignment with Human Values

1 code implementation • 26 Mar 2024 • Zhelun Shi, Zhipin Wang, Hongxing Fan, Zaibin Zhang, Lijun Li, Yongting Zhang, Zhenfei Yin, Lu Sheng, Yu Qiao, Jing Shao

Large Language Models (LLMs) aim to serve as versatile assistants aligned with human values, as defined by the principles of being helpful, honest, and harmless (hhh).

264

Paper
Code

Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model

1 code implementation • 18 May 2023 • Siyuan Huang, Zhengkai Jiang, Hao Dong, Yu Qiao, Peng Gao, Hongsheng Li

This paper presents Instruct2Act, a framework that utilizes Large Language Models to map multi-modal instructions to sequential actions for robotic manipulation tasks.

Language Modelling Large Language Model +2

256

Paper
Code

CUHK & ETHZ & SIAT Submission to ActivityNet Challenge 2016

1 code implementation • 2 Aug 2016 • Yuanjun Xiong, Li-Min Wang, Zhe Wang, Bo-Wen Zhang, Hang Song, Wei Li, Dahua Lin, Yu Qiao, Luc van Gool, Xiaoou Tang

This paper presents the method that underlies our submission to the untrimmed video classification task of ActivityNet Challenge 2016.

General Classification Video Classification

251

Paper
Code

Region Attention Networks for Pose and Occlusion Robust Facial Expression Recognition

1 code implementation • 10 May 2019 • Kai Wang, Xiaojiang Peng, Jianfei Yang, Debin Meng, Yu Qiao

Extensive experiments show that our RAN and region biased loss largely improve the performance of FER with occlusion and variant pose.

Ranked #2 on Facial Expression Recognition (FER) on SFEW

Facial Expression Recognition Facial Expression Recognition (FER)

250

Paper
Code

MedFMC: A Real-world Dataset and Benchmark For Foundation Model Adaptation in Medical Image Classification

1 code implementation • 16 Jun 2023 • Dequan Wang, Xiaosong Wang, Lilong Wang, Mengzhang Li, Qian Da, Xiaoqiang Liu, Xiangyu Gao, Jun Shen, Junjun He, Tian Shen, Qi Duan, Jie Zhao, Kang Li, Yu Qiao, Shaoting Zhang

Foundation models, often pre-trained with large-scale data, have achieved paramount success in jump-starting various vision and language applications.

Diabetic Retinopathy Grading Image Classification +3

248

Paper
Code

Unmasked Teacher: Towards Training-Efficient Video Foundation Models

1 code implementation • ICCV 2023 • Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, LiMin Wang, Yu Qiao

Previous VFMs rely on Image Foundation Models (IFMs), which face challenges in transferring to the video domain.

Ranked #1 on Video Retrieval on SSv2-template retrieval (using extra training data)

Action Classification Action Recognition +5

242

Paper
Code

Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures

1 code implementation • 4 Mar 2024 • Yuchen Duan, Weiyun Wang, Zhe Chen, Xizhou Zhu, Lewei Lu, Tong Lu, Yu Qiao, Hongsheng Li, Jifeng Dai, Wenhai Wang

Our evaluations demonstrate that VRWKV surpasses ViT's performance in image classification and has significantly faster speeds and lower memory usage processing high-resolution inputs.

Image Classification

214

Paper
Code

Single Shot Text Detector with Regional Attention

1 code implementation • ICCV 2017 • Pan He, Weilin Huang, Tong He, Qile Zhu, Yu Qiao, Xiaolin Li

Our text detector achieves an F-measure of 77% on the ICDAR 2015 bench- mark, advancing the state-of-the-art results in [18, 28].

Ranked #4 on Scene Text Detection on COCO-Text

Scene Text Detection

212

Paper
Code

TransNormerLLM: A Faster and Better Large Language Model with Improved TransNormer

2 code implementations • 27 Jul 2023 • Zhen Qin, Dong Li, Weigao Sun, Weixuan Sun, Xuyang Shen, Xiaodong Han, Yunshen Wei, Baohong Lv, Xiao Luo, Yu Qiao, Yiran Zhong

TransNormerLLM evolves from the previous linear attention architecture TransNormer by making advanced modifications that include positional embedding, linear attention acceleration, gating mechanisms, tensor normalization, and inference acceleration and stabilization.

Language Modelling Large Language Model

210

Paper
Code

AdaCos: Adaptively Scaling Cosine Logits for Effectively Learning Deep Face Representations

3 code implementations • CVPR 2019 • Xiao Zhang, Rui Zhao, Yu Qiao, Xiaogang Wang, Hongsheng Li

Our results show that training deep neural networks with the AdaCos loss is stable and able to achieve high face recognition accuracy.

Ranked #6 on Face Verification on MegaFace

Face Recognition Face Verification

207

Paper
Code

Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training

3 code implementations • 28 May 2022 • Renrui Zhang, Ziyu Guo, Rongyao Fang, Bin Zhao, Dong Wang, Yu Qiao, Hongsheng Li, Peng Gao

By fine-tuning on downstream tasks, Point-M2AE achieves 86. 43% accuracy on ScanObjectNN, +3. 36% to the second-best, and largely benefits the few-shot classification, part segmentation and 3D object detection with the hierarchical pre-training scheme.

Ranked #4 on 3D Point Cloud Linear Classification on ModelNet40 (using extra training data)

3D Object Detection 3D Point Cloud Linear Classification +5

198

Paper
Code

Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders

2 code implementations • CVPR 2023 • Renrui Zhang, Liuhui Wang, Yu Qiao, Peng Gao, Hongsheng Li

Pre-training by numerous image data has become de-facto for robust 2D representations.

Ranked #2 on 3D Point Cloud Linear Classification on ModelNet40 (using extra training data)

3D Point Cloud Linear Classification Few-Shot 3D Point Cloud Classification

198

Paper
Code

Modulating Image Restoration with Continual Levels via Adaptive Feature Modification Layers

1 code implementation • CVPR 2019 • Jingwen He, Chao Dong, Yu Qiao

In image restoration tasks, like denoising and super resolution, continual modulation of restoration levels is of great importance for real-world applications, but has failed most of existing deep learning based image restoration methods.

Ranked #2 on Color Image Denoising on CBSD68 sigma75

Image Denoising Image Restoration +1

184

Paper
Code

MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning

2 code implementations • 24 Nov 2021 • David Junhao Zhang, Kunchang Li, Yali Wang, Yunpeng Chen, Shashwat Chandra, Yu Qiao, Luoqi Liu, Mike Zheng Shou

With such multi-dimension and multi-scale factorization, our MorphMLP block can achieve a great accuracy-computation balance.

Ranked #38 on Action Recognition on Something-Something V2 (using extra training data)

Action Recognition Image Classification +3

165

Paper
Code

Bamboo: Building Mega-Scale Vision Dataset Continually with Human-Machine Synergy

2 code implementations • 15 Mar 2022 • Yuanhan Zhang, Qinghong Sun, Yichun Zhou, Zexin He, Zhenfei Yin, Kun Wang, Lu Sheng, Yu Qiao, Jing Shao, Ziwei Liu

This work thus proposes a novel active learning framework for realistic dataset annotation.

Ranked #1 on Image Classification on Food-101 (using extra training data)

Active Learning Classification +3

161

Paper
Code

ControlLLM: Augment Language Models with Tools by Searching on Graphs

1 code implementation • 26 Oct 2023 • Zhaoyang Liu, Zeqiang Lai, Zhangwei Gao, Erfei Cui, Ziheng Li, Xizhou Zhu, Lewei Lu, Qifeng Chen, Yu Qiao, Jifeng Dai, Wenhai Wang

We present ControlLLM, a novel framework that enables large language models (LLMs) to utilize multi-modal tools for solving complex real-world tasks.

Scheduling

161

Paper
Code

MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer

1 code implementation • 18 Jan 2024 • Changyao Tian, Xizhou Zhu, Yuwen Xiong, Weiyun Wang, Zhe Chen, Wenhai Wang, Yuntao Chen, Lewei Lu, Tong Lu, Jie zhou, Hongsheng Li, Yu Qiao, Jifeng Dai

Developing generative models for interleaved image-text data has both research and practical value.

157

Paper
Code

Blueprint Separable Residual Network for Efficient Image Super-Resolution

1 code implementation • 12 May 2022 • Zheyuan Li, Yingqi Liu, Xiangyu Chen, Haoming Cai, Jinjin Gu, Yu Qiao, Chao Dong

One is the usage of blueprint separable convolution (BSConv), which takes place of the redundant convolution operation.

Image Super-Resolution

155

Paper
Code

Frozen CLIP Models are Efficient Video Learners

2 code implementations • 6 Aug 2022 • Ziyi Lin, Shijie Geng, Renrui Zhang, Peng Gao, Gerard de Melo, Xiaogang Wang, Jifeng Dai, Yu Qiao, Hongsheng Li

Video recognition has been dominated by the end-to-end learning paradigm -- first initializing a video recognition model with weights of a pretrained image model and then conducting end-to-end training on videos.

Ranked #26 on Action Classification on Kinetics-400 (using extra training data)

Action Classification Video Recognition

155

Paper
Code

Clearer Frames, Anytime: Resolving Velocity Ambiguity in Video Frame Interpolation

1 code implementation • 14 Nov 2023 • Zhihang Zhong, Gurunandan Krishnan, Xiao Sun, Yu Qiao, Sizhuo Ma, Jian Wang

Existing video frame interpolation (VFI) methods blindly predict where each object is at a specific timestep t ("time indexing"), which struggles to predict precise object movements.

Object Video Editing +1

155

Paper
Code

VideoLLM: Modeling Video Sequence with Large Language Models

1 code implementation • 22 May 2023 • Guo Chen, Yin-Dong Zheng, Jiahao Wang, Jilan Xu, Yifei HUANG, Junting Pan, Yi Wang, Yali Wang, Yu Qiao, Tong Lu, LiMin Wang

Building upon this insight, we propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs from natural language processing (NLP) for video sequence understanding.

Video Understanding

154

Paper
Code

Multi-View Partial (MVP) Point Cloud Challenge 2021 on Completion and Registration: Methods and Results

2 code implementations • 22 Dec 2021 • Liang Pan, Tong Wu, Zhongang Cai, Ziwei Liu, Xumin Yu, Yongming Rao, Jiwen Lu, Jie zhou, Mingye Xu, Xiaoyuan Luo, Kexue Fu, Peng Gao, Manning Wang, Yali Wang, Yu Qiao, Junsheng Zhou, Xin Wen, Peng Xiang, Yu-Shen Liu, Zhizhong Han, Yuanjie Yan, Junyi An, Lifa Zhu, Changwei Lin, Dongrui Liu, Xin Li, Francisco Gómez-Fernández, Qinlong Wang, Yang Yang

Based on the MVP dataset, this paper reports methods and results in the Multi-View Partial Point Cloud Challenge 2021 on Completion and Registration.

3D Reconstruction Point Cloud Completion +2

153

Paper
Code

Self-supervised Multi-view Stereo via Effective Co-Segmentation and Data-Augmentation

1 code implementation • 12 Apr 2021 • Hongbin Xu, Zhipeng Zhou, Yu Qiao, Wenxiong Kang, Qiuxia Wu

Recent studies have witnessed that self-supervised methods based on view synthesis obtain clear progress on multi-view stereo (MVS).

Data Augmentation

150

Paper
Code

HDRUNet: Single Image HDR Reconstruction with Denoising and Dequantization

1 code implementation • 27 May 2021 • Xiangyu Chen, Yihao Liu, Zhengwen Zhang, Yu Qiao, Chao Dong

In this work, we propose a novel learning-based approach using a spatially dynamic encoder-decoder network, HDRUNet, to learn an end-to-end mapping for single image HDR reconstruction with denoising and dequantization.

Ranked #2 on Inverse-Tone-Mapping on MSU HDR Video Reconstruction Benchmark

Denoising HDR Reconstruction +2

146

Paper
Code

Scaling Data Generation in Vision-and-Language Navigation

1 code implementation • ICCV 2023 • Zun Wang, Jialu Li, Yicong Hong, Yi Wang, Qi Wu, Mohit Bansal, Stephen Gould, Hao Tan, Yu Qiao

Recent research in language-guided visual navigation has demonstrated a significant demand for the diversity of traversable environments and the quantity of supervision for training generalizable agents.

Imitation Learning Vision and Language Navigation +1

136

Paper
Code

Latte: Latent Diffusion Transformer for Video Generation

2 code implementations • 5 Jan 2024 • Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, Yu Qiao

We propose a novel Latent Diffusion Transformer, namely Latte, for video generation.

Text-to-Video Generation Video Generation

136

Paper
Code

CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP

1 code implementation • CVPR 2023 • Runnan Chen, Youquan Liu, Lingdong Kong, Xinge Zhu, Yuexin Ma, Yikang Li, Yuenan Hou, Yu Qiao, Wenping Wang

For the first time, our pre-trained network achieves annotation-free 3D semantic segmentation with 20. 8% and 25. 08% mIoU on nuScenes and ScanNet, respectively.

3D Semantic Segmentation Contrastive Learning +4

132

Paper
Code

SinSR: Diffusion-Based Image Super-Resolution in a Single Step

1 code implementation • 23 Nov 2023 • YuFei Wang, Wenhan Yang, Xinyuan Chen, Yaohui Wang, Lanqing Guo, Lap-Pui Chau, Ziwei Liu, Yu Qiao, Alex C. Kot, Bihan Wen

Extensive experiments conducted on synthetic and real-world datasets demonstrate that the proposed method can achieve comparable or even superior performance compared to both previous SOTA methods and the teacher model, in just one sampling step, resulting in a remarkable up to x10 speedup for inference.

Image Super-Resolution

130

Paper
Code

Conditional Sequential Modulation for Efficient Global Image Retouching

1 code implementation • ECCV 2020 • Jingwen He, Yihao Liu, Yu Qiao, Chao Dong

The base network acts like an MLP that processes each pixel independently and the condition network extracts the global features of the input image to generate a condition vector.

Image Retouching Photo Retouching

128

Paper
Code

ChartX & ChartVLM: A Versatile Benchmark and Foundation Model for Complicated Chart Reasoning

1 code implementation • 19 Feb 2024 • Renqiu Xia, Bo Zhang, Hancheng Ye, Xiangchao Yan, Qi Liu, Hongbin Zhou, Zijun Chen, Min Dou, Botian Shi, Junchi Yan, Yu Qiao

Recently, many versatile Multi-modal Large Language Models (MLLMs) have emerged continuously.

127

Paper
Code

Learning Attentive Pairwise Interaction for Fine-Grained Classification

1 code implementation • 24 Feb 2020 • Peiqin Zhuang, Yali Wang, Yu Qiao

These distinct gate vectors inherit mutual context on semantic differences, which allow API-Net to attentively capture contrastive clues by pairwise interaction between two images.

Ranked #12 on Fine-Grained Image Classification on Stanford Dogs

Classification Fine-Grained Image Classification +1

124

Paper
Code

Attention-Driven Dynamic Graph Convolutional Network for Multi-Label Image Recognition

1 code implementation • ECCV 2020 • Jin Ye, Junjun He, Xiaojiang Peng, Wenhao Wu, Yu Qiao

To this end, we propose an Attention-Driven Dynamic Graph Convolutional Network (ADD-GCN) to dynamically generate a specific graph for each image.

Ranked #22 on Multi-Label Classification on MS-COCO

Multi-Label Classification

122

Paper
Code

A New Journey from SDRTV to HDRTV

1 code implementation • ICCV 2021 • Xiangyu Chen, Zhengwen Zhang, Jimmy S. Ren, Lynhoo Tian, Yu Qiao, Chao Dong

However, most available resources are still in standard dynamic range (SDR).

Ranked #1 on Inverse-Tone-Mapping on MSU HDR Video Reconstruction Benchmark

Inverse-Tone-Mapping

120

Paper
Code

Towards Efficient SDRTV-to-HDRTV by Learning from Image Formation

2 code implementations • 8 Sep 2023 • Xiangyu Chen, Zheyuan Li, Zhengwen Zhang, Jimmy S. Ren, Yihao Liu, Jingwen He, Yu Qiao, Jiantao Zhou, Chao Dong

However, the majority of available resources are still in standard dynamic range (SDR).

120

Paper
Code

DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving

1 code implementation • 14 Dec 2023 • Wenhai Wang, Jiangwei Xie, Chuanyang Hu, Haoming Zou, Jianan Fan, Wenwen Tong, Yang Wen, Silei Wu, Hanming Deng, Zhiqi Li, Hao Tian, Lewei Lu, Xizhou Zhu, Xiaogang Wang, Yu Qiao, Jifeng Dai

In this work, we delve into the potential of large language models (LLMs) in autonomous driving (AD).

Autonomous Driving Motion Planning

119

Paper
Code

NTIRE 2022 Challenge on Efficient Super-Resolution: Methods and Results

2 code implementations • 11 May 2022 • Yawei Li, Kai Zhang, Radu Timofte, Luc van Gool, Fangyuan Kong, Mingxi Li, Songwei Liu, Zongcai Du, Ding Liu, Chenhui Zhou, Jingyi Chen, Qingrui Han, Zheyuan Li, Yingqi Liu, Xiangyu Chen, Haoming Cai, Yu Qiao, Chao Dong, Long Sun, Jinshan Pan, Yi Zhu, Zhikai Zong, Xiaoxiao Liu, Zheng Hui, Tao Yang, Peiran Ren, Xuansong Xie, Xian-Sheng Hua, Yanbo Wang, Xiaozhong Ji, Chuming Lin, Donghao Luo, Ying Tai, Chengjie Wang, Zhizhong Zhang, Yuan Xie, Shen Cheng, Ziwei Luo, Lei Yu, Zhihong Wen, Qi Wu1, Youwei Li, Haoqiang Fan, Jian Sun, Shuaicheng Liu, Yuanfei Huang, Meiguang Jin, Hua Huang, Jing Liu, Xinjian Zhang, Yan Wang, Lingshun Long, Gen Li, Yuanfan Zhang, Zuowei Cao, Lei Sun, Panaetov Alexander, Yucong Wang, Minjie Cai, Li Wang, Lu Tian, Zheyuan Wang, Hongbing Ma, Jie Liu, Chao Chen, Yidong Cai, Jie Tang, Gangshan Wu, Weiran Wang, Shirui Huang, Honglei Lu, Huan Liu, Keyan Wang, Jun Chen, Shi Chen, Yuchun Miao, Zimo Huang, Lefei Zhang, Mustafa Ayazoğlu, Wei Xiong, Chengyi Xiong, Fei Wang, Hao Li, Ruimian Wen, Zhijing Yang, Wenbin Zou, Weixin Zheng, Tian Ye, Yuncheng Zhang, Xiangzhen Kong, Aditya Arora, Syed Waqas Zamir, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Dandan Gaoand Dengwen Zhouand Qian Ning, Jingzhu Tang, Han Huang, YuFei Wang, Zhangheng Peng, Haobo Li, Wenxue Guan, Shenghua Gong, Xin Li, Jun Liu, Wanjun Wang, Dengwen Zhou, Kun Zeng, Hanjiang Lin, Xinyu Chen, Jinsheng Fang

The aim was to design a network for single image super-resolution that achieved improvement of efficiency measured according to several metrics including runtime, parameters, FLOPs, activations, and memory consumption while at least maintaining the PSNR of 29. 00dB on DIV2K validation set.

Image Super-Resolution

117

Paper
Code

Policy Pre-training for Autonomous Driving via Self-supervised Geometric Modeling

1 code implementation • 3 Jan 2023 • Penghao Wu, Li Chen, Hongyang Li, Xiaosong Jia, Junchi Yan, Yu Qiao

Witnessing the impressive achievements of pre-training techniques on large-scale data in the field of computer vision and natural language processing, we wonder whether this idea could be adapted in a grab-and-go spirit, and mitigate the sample inefficiency problem for visuomotor driving.

Autonomous Driving Decision Making

107

Paper
Code

Context-Transformer: Tackling Object Confusion for Few-Shot Detection

1 code implementation • 16 Mar 2020 • Ze Yang, Yali Wang, Xianyu Chen, Jianzhuang Liu, Yu Qiao

Few-shot object detection is a challenging but realistic scenario, where only a few annotated training images are available for training detectors.

Few-Shot Learning Few-Shot Object Detection +3

102

Paper
Code

Interactive Multi-Dimension Modulation with Dynamic Controllable Residual Learning for Image Restoration

1 code implementation • ECCV 2020 • Jingwen He, Chao Dong, Yu Qiao

To make a step forward, this paper presents a new problem setup, called multi-dimension (MD) modulation, which aims at modulating output effects across multiple degradation types and levels.

Image Restoration

Paper
Code

Are We on the Right Way for Evaluating Large Vision-Language Models?

1 code implementation • 29 Mar 2024 • Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, Feng Zhao

We evaluate 16 leading LVLMs on MMStar to assess their multi-modal capabilities, and on 7 benchmarks with the proposed metrics to investigate their data leakage and actual multi-modal gain.

World Knowledge

Paper
Code

Video Dehazing via a Multi-Range Temporal Alignment Network with Physical Prior

1 code implementation • CVPR 2023 • Jiaqi Xu, Xiaowei Hu, Lei Zhu, Qi Dou, Jifeng Dai, Yu Qiao, Pheng-Ann Heng

Video dehazing aims to recover haze-free frames with high visibility and contrast.

Paper
Code

Embodied Understanding of Driving Scenarios

1 code implementation • 7 Mar 2024 • Yunsong Zhou, Linyan Huang, Qingwen Bu, Jia Zeng, Tianyu Li, Hang Qiu, Hongzi Zhu, Minyi Guo, Yu Qiao, Hongyang Li

Hereby, we introduce the Embodied Language Model (ELM), a comprehensive framework tailored for agents' understanding of driving scenes with large spatial and temporal spans.

Autonomous Driving Language Modelling +1

Paper
Code

Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information

1 code implementation • CVPR 2023 • Weijie Su, Xizhou Zhu, Chenxin Tao, Lewei Lu, Bin Li, Gao Huang, Yu Qiao, Xiaogang Wang, Jie zhou, Jifeng Dai

It has been proved that combining multiple pre-training strategies and data from various modalities/sources can greatly boost the training of large-scale models.

Ranked #2 on Semantic Segmentation on ADE20K (using extra training data)

Image Classification Long-tailed Object Detection +3

Paper
Code

Diff-Font: Diffusion Model for Robust One-Shot Font Generation

1 code implementation • 12 Dec 2022 • Haibin He, Xinyuan Chen, Chaoyue Wang, Juhua Liu, Bo Du, DaCheng Tao, Yu Qiao

Specifically, a large stroke-wise dataset is constructed, and a stroke-wise diffusion model is proposed to preserve the structure and the completion of each generated character.

Font Generation

Paper
Code

SCPNet: Semantic Scene Completion on Point Cloud

1 code implementation • CVPR 2023 • Zhaoyang Xia, Youquan Liu, Xin Li, Xinge Zhu, Yuexin Ma, Yikang Li, Yuenan Hou, Yu Qiao

We propose a simple yet effective label rectification strategy, which uses off-the-shelf panoptic segmentation labels to remove the traces of dynamic objects in completion labels, greatly improving the performance of deep models especially for those moving objects.

Ranked #1 on 3D Semantic Scene Completion on SemanticKITTI

3D Semantic Scene Completion Knowledge Distillation +3

Paper
Code

Tripartite Information Mining and Integration for Image Matting

1 code implementation • ICCV 2021 • Yuhao Liu, Jiake Xie, Xiao Shi, Yu Qiao, Yujie Huang, Yong Tang, Xin Yang

Regarding the nature of image matting, most researches have focused on solutions for transition regions.

2k Image Matting

Paper
Code

InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges

2 code implementations • 17 Nov 2022 • Guo Chen, Sen Xing, Zhe Chen, Yi Wang, Kunchang Li, Yizhuo Li, Yi Liu, Jiahao Wang, Yin-Dong Zheng, Bingkun Huang, Zhiyu Zhao, Junting Pan, Yifei HUANG, Zun Wang, Jiashuo Yu, Yinan He, Hongjie Zhang, Tong Lu, Yali Wang, LiMin Wang, Yu Qiao

In this report, we present our champion solutions to five tracks at Ego4D challenge.

Ranked #1 on State Change Object Detection on Ego4D

Future Hand Prediction Moment Queries +7

Paper
Code

Visual Compositional Learning for Human-Object Interaction Detection

4 code implementations • ECCV 2020 • Zhi Hou, Xiaojiang Peng, Yu Qiao, DaCheng Tao

The integration of decomposition and composition enables VCL to share object and verb features among different HOI samples and images, and to generate new interaction samples and new types of HOI, and thus largely alleviates the long-tail distribution problem and benefits low-shot or zero-shot HOI detection.

Ranked #3 on Affordance Recognition on HICO-DET(Unknown Concepts)

Affordance Recognition Object

Paper
Code

Detecting Human-Object Interaction via Fabricated Compositional Learning

1 code implementation • CVPR 2021 • Zhi Hou, Baosheng Yu, Yu Qiao, Xiaojiang Peng, DaCheng Tao

With the proposed object fabricator, we are able to generate large-scale HOI samples for rare and unseen categories to alleviate the open long-tailed issues in HOI detection.

Ranked #4 on Affordance Recognition on HICO-DET

Affordance Recognition Object +1

Paper
Code

Affordance Transfer Learning for Human-Object Interaction Detection

2 code implementations • CVPR 2021 • Zhi Hou, Baosheng Yu, Yu Qiao, Xiaojiang Peng, DaCheng Tao

The proposed method can thus be used to 1) improve the performance of HOI detection, especially for the HOIs with unseen objects; and 2) infer the affordances of novel objects.

Ranked #2 on Affordance Recognition on HICO-DET(Unknown Concepts)

Affordance Detection Affordance Recognition +4

Paper
Code

Self-slimmed Vision Transformer

1 code implementation • 24 Nov 2021 • Zhuofan Zong, Kunchang Li, Guanglu Song, Yali Wang, Yu Qiao, Biao Leng, Yu Liu

Specifically, we first design a novel Token Slimming Module (TSM), which can boost the inference efficiency of ViTs by dynamic token aggregation.

Knowledge Distillation

Paper
Code

LSTD: A Low-Shot Transfer Detector for Object Detection

1 code implementation • 5 Mar 2018 • Hao Chen, Yali Wang, Guoyou Wang, Yu Qiao

Second, we introduce a novel regularized transfer learning framework for low-shot detection, where the transfer knowledge (TK) and background depression (BD) regularizations are proposed to leverage object knowledge respectively from source and target domains, in order to further enhance fine-tuning with a few target images.

Ranked #22 on Few-Shot Object Detection on MS-COCO (30-shot)

Few-Shot Object Detection Object +2

Paper
Code

DiffRate : Differentiable Compression Rate for Efficient Vision Transformers

1 code implementation • ICCV 2023 • Mengzhao Chen, Wenqi Shao, Peng Xu, Mingbao Lin, Kaipeng Zhang, Fei Chao, Rongrong Ji, Yu Qiao, Ping Luo

Token compression aims to speed up large-scale vision transformers (e. g. ViTs) by pruning (dropping) or merging tokens.

Ranked #4 on Efficient ViTs on ImageNet-1K (with DeiT-S)

Efficient ViTs

Paper
Code

SpiderCNN: Deep Learning on Point Sets with Parameterized Convolutional Filters

1 code implementation • ECCV 2018 • Yifan Xu, Tianqi Fan, Mingye Xu, Long Zeng, Yu Qiao

Deep neural networks have enjoyed remarkable success for various vision tasks, however it remains challenging to apply CNNs to domains lacking a regular underlying structures such as 3D point clouds.

Ranked #6 on 3D Part Segmentation on IntrA

3D Part Segmentation 3D Point Cloud Classification

Paper
Code

Temporal Context Aggregation Network for Temporal Action Proposal Refinement

1 code implementation • CVPR 2021 • Zhiwu Qing, Haisheng Su, Weihao Gan, Dongliang Wang, Wei Wu, Xiang Wang, Yu Qiao, Junjie Yan, Changxin Gao, Nong Sang

In this paper, we propose Temporal Context Aggregation Network (TCANet) to generate high-quality action proposals through "local and global" temporal context aggregation and complementary as well as progressive boundary refinement.

Ranked #9 on Temporal Action Localization on ActivityNet-1.3

Action Detection Retrieval +2

Paper
Code

Demystify Transformers & Convolutions in Modern Image Deep Networks

1 code implementation • 10 Nov 2022 • Xiaowei Hu, Min Shi, Weiyun Wang, Sitong Wu, Linjie Xing, Wenhai Wang, Xizhou Zhu, Lewei Lu, Jie zhou, Xiaogang Wang, Yu Qiao, Jifeng Dai

Our experiments on various tasks and an analysis of inductive bias show a significant performance boost due to advanced network-level and block-level designs, but performance differences persist among different STMs.

Image Deep Networks Spatial Token Mixer

Paper
Code

Fine-grained Audible Video Description

1 code implementation • CVPR 2023 • Xuyang Shen, Dong Li, Jinxing Zhou, Zhen Qin, Bowen He, Xiaodong Han, Aixuan Li, Yuchao Dai, Lingpeng Kong, Meng Wang, Yu Qiao, Yiran Zhong

We explore a new task for audio-visual-language modeling called fine-grained audible video description (FAVD).

Language Modelling Masked Language Modeling +5

Paper
Code

Efficient Image Super-Resolution using Vast-Receptive-Field Attention

1 code implementation • 12 Oct 2022 • Lin Zhou, Haoming Cai, Jinjin Gu, Zheyuan Li, Yingqi Liu, Xiangyu Chen, Yu Qiao, Chao Dong

In this work, we design an efficient SR network by improving the attention mechanism.

Image Super-Resolution

Paper
Code

Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model

1 code implementation • 19 Dec 2023 • Lingjun Zhang, Xinyuan Chen, Yaohui Wang, Yue Lu, Yu Qiao

To tackle this problem, we propose Diff-Text, which is a training-free scene text generation framework for any language.

Text Generation Text-to-Image Generation

Paper
Code

Learning Geometry-Disentangled Representation for Complementary Understanding of 3D Object Point Cloud

3 code implementations • 20 Dec 2020 • Mutian Xu, Junhao Zhang, Zhipeng Zhou, Mingye Xu, Xiaojuan Qi, Yu Qiao

GDANet introduces Geometry-Disentangle Module to dynamically disentangle point clouds into the contour and flat part of 3D objects, respectively denoted by sharp and gentle variation components.

Ranked #1 on Point Cloud Segmentation on PointCloud-C

3D Object Classification 3D Part Segmentation +2

Paper
Code

VL-LTR: Learning Class-wise Visual-Linguistic Representation for Long-Tailed Visual Recognition

1 code implementation • 26 Nov 2021 • Changyao Tian, Wenhai Wang, Xizhou Zhu, Jifeng Dai, Yu Qiao

Deep learning-based models encounter challenges when processing long-tailed data in the real world.

Ranked #2 on Long-tail Learning on iNaturalist 2018 (using extra training data)

Image Classification Long-tail Learning +1

Paper
Code

Aleth-NeRF: Low-light Condition View Synthesis with Concealing Fields

1 code implementation • 10 Mar 2023 • Ziteng Cui, Lin Gu, Xiao Sun, Xianzheng Ma, Yu Qiao, Tatsuya Harada

Common capture low-light scenes are challenging for most computer vision techniques, including Neural Radiance Fields (NeRF).

Paper
Code

Aleth-NeRF: Illumination Adaptive NeRF with Concealing Field Assumption

1 code implementation • 14 Dec 2023 • Ziteng Cui, Lin Gu, Xiao Sun, Xianzheng Ma, Yu Qiao, Tatsuya Harada

The standard Neural Radiance Fields (NeRF) paradigm employs a viewer-centered methodology, entangling the aspects of illumination and material reflectance into emission solely from 3D points.

Paper
Code

ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning

1 code implementation • 4 Jan 2024 • Fanqing Meng, Wenqi Shao, Quanfeng Lu, Peng Gao, Kaipeng Zhang, Yu Qiao, Ping Luo

Charts play a vital role in data visualization, understanding data patterns, and informed decision-making.

Data Visualization Decision Making +2

Paper
Code

Places205-VGGNet Models for Scene Recognition

2 code implementations • 7 Aug 2015 • Limin Wang, Sheng Guo, Weilin Huang, Yu Qiao

We verify the performance of trained Places205-VGGNet models on three datasets: MIT67, SUN397, and Places205.

Computational Efficiency Object Recognition +1

Paper
Code

A Simple Long-Tailed Recognition Baseline via Vision-Language Model

1 code implementation • 29 Nov 2021 • Teli Ma, Shijie Geng, Mengmeng Wang, Jing Shao, Jiasen Lu, Hongsheng Li, Peng Gao, Yu Qiao

Recent advances in large-scale contrastive visual-language pretraining shed light on a new pathway for visual recognition.

Ranked #4 on Long-tail Learning on Places-LT (using extra training data)

Contrastive Learning Language Modelling +3

Paper
Code

Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation

1 code implementation • 25 May 2023 • Shilin Yan, Renrui Zhang, Ziyu Guo, Wenchao Chen, Wei zhang, Hongyang Li, Yu Qiao, Hao Dong, Zhongjiang He, Peng Gao

In this paper, we propose MUTR, a Multi-modal Unified Temporal transformer for Referring video object segmentation.

Ranked #1 on Referring Expression Segmentation on Referring Expressions for DAVIS 2016 & 2017

Object Referring Expression Segmentation +3

Paper
Code

Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision

1 code implementation • CVPR 2023 • Jilan Xu, Junlin Hou, Yuejie Zhang, Rui Feng, Yi Wang, Yu Qiao, Weidi Xie

The former aims to infer all masked entities in the caption given the group tokens, that enables the model to learn fine-grained alignment between visual groups and text entities.

Open Vocabulary Semantic Segmentation Semantic Segmentation

Paper
Code

Long-Term Rhythmic Video Soundtracker

1 code implementation • 2 May 2023 • Jiashuo Yu, Yaohui Wang, Xinyuan Chen, Xiao Sun, Yu Qiao

To this end, we present Long-Term Rhythmic Video Soundtracker (LORIS), a novel framework to synthesize long-term conditional waveforms.

Paper
Code

MineDreamer: Learning to Follow Instructions via Chain-of-Imagination for Simulated-World Control

1 code implementation • 18 Mar 2024 • Enshen Zhou, Yiran Qin, Zhenfei Yin, Yuzhou Huang, Ruimao Zhang, Lu Sheng, Yu Qiao, Jing Shao

It is a long-lasting goal to design a generalist-embodied agent that can follow diverse instructions in human-like ways.

Instruction Following

Paper
Code

Mining Inter-Video Proposal Relations for Video Object Detection

1 code implementation • ECCV 2020 • Mingfei Han, Yali Wang, Xiaojun Chang, Yu Qiao

Recent studies have shown that, context aggregating information from proposals in different frames can clearly enhance the performance of video object detection.

Ranked #11 on Video Object Detection on ImageNet VID

Object object-detection +3

Paper
Code

Linear Attention Sequence Parallelism

1 code implementation • 3 Apr 2024 • Weigao Sun, Zhen Qin, Dong Li, Xuyang Shen, Yu Qiao, Yiran Zhong

In this paper, we introduce Linear Attention Sequence Parallel (LASP), an efficient SP method tailored to linear attention-based language models.

Paper
Code

Text-guided Foundation Model Adaptation for Pathological Image Classification

2 code implementations • 27 Jul 2023 • Yunkun Zhang, Jin Gao, Mu Zhou, Xiaosong Wang, Yu Qiao, Shaoting Zhang, Dequan Wang

In this paper, we propose to Connect Image and Text Embeddings (CITE) to enhance pathological image classification.

Classification Image Classification +1

Paper
Code

Shrinking Class Space for Enhanced Certainty in Semi-Supervised Learning

1 code implementation • ICCV 2023 • Lihe Yang, Zhen Zhao, Lei Qi, Yu Qiao, Yinghuan Shi, Hengshuang Zhao

To mitigate potentially incorrect pseudo labels, recent frameworks mostly set a fixed confidence threshold to discard uncertain samples.

Ranked #1 on Semi-Supervised Image Classification on SVHN, 40 Labels

Semi-Supervised Image Classification

Paper
Code

SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models

1 code implementation • 7 Feb 2024 • Lijun Li, Bowen Dong, Ruohui Wang, Xuhao Hu, WangMeng Zuo, Dahua Lin, Yu Qiao, Jing Shao

In the rapidly evolving landscape of Large Language Models (LLMs), ensuring robust safety measures is paramount.

Multiple-choice

Paper
Code

Geometry Sharing Network for 3D Point Cloud Classification and Segmentation

1 code implementation • 23 Dec 2019 • Mingye Xu, Zhipeng Zhou, Yu Qiao

Specially, GS-Net consists of Geometry Similarity Connection (GSC) modules which exploit Eigen-Graph to group distant points with similar and relevant geometric information, and aggregate features from nearest neighbors in both Euclidean space and Eigenvalue space.

Ranked #7 on 3D Point Cloud Classification on IntrA

3D Point Cloud Classification Classification +3

Paper
Code

DegAE: A New Pretraining Paradigm for Low-Level Vision

1 code implementation • CVPR 2023 • Yihao Liu, Jingwen He, Jinjin Gu, Xiangtao Kong, Yu Qiao, Chao Dong

However, we argue that pretraining is more significant for high-cost tasks, where data acquisition is more challenging.

Philosophy

Paper
Code

ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation

1 code implementation • 11 Oct 2023 • Bo Peng, Xinyuan Chen, Yaohui Wang, Chaochao Lu, Yu Qiao

In this work, we introduce ConditionVideo, a training-free approach to text-to-video generation based on the provided condition, video, and input text, by leveraging the power of off-the-shelf text-to-image generation methods (e. g., Stable Diffusion).

Text-to-Image Generation Text-to-Video Generation +1

Paper
Code

Weakly Supervised PatchNets: Describing and Aggregating Local Patches for Scene Recognition

1 code implementation • 1 Sep 2016 • Zhe Wang, Li-Min Wang, Yali Wang, Bo-Wen Zhang, Yu Qiao

In this paper, we propose a hybrid representation, which leverages the discriminative capacity of CNNs and the simplicity of descriptor encoding schema for image recognition, with a focus on scene recognition.

Scene Recognition

Paper
Code

Digging into Uncertainty in Self-supervised Multi-view Stereo

1 code implementation • ICCV 2021 • Hongbin Xu, Zhipeng Zhou, Yali Wang, Wenxiong Kang, Baigui Sun, Hao Li, Yu Qiao

Specially, the limitations can be categorized into two types: ambiguious supervision in foreground and invalid supervision in background.

Image Reconstruction Self-Supervised Learning

Paper
Code

SmallBigNet: Integrating Core and Contextual Views for Video Classification

1 code implementation • CVPR 2020 • Xianhang Li, Yali Wang, Zhipeng Zhou, Yu Qiao

Our SmallBig network outperforms a number of recent state-of-the-art approaches, in terms of accuracy and/or efficiency.

Classification General Classification +1

Paper
Code

LEO: Generative Latent Image Animator for Human Video Synthesis

5 code implementations • 6 May 2023 • Yaohui Wang, Xin Ma, Xinyuan Chen, Antitza Dantcheva, Bo Dai, Yu Qiao

Our key idea is to represent motion as a sequence of flow maps in the generation process, which inherently isolate motion from appearance.

Disentanglement Video Editing

Paper
Code

SEAL: A Framework for Systematic Evaluation of Real-World Super-Resolution

1 code implementation • 6 Sep 2023 • Wenlong Zhang, Xiaohui Li, Xiangyu Chen, Yu Qiao, Xiao-Ming Wu, Chao Dong

In particular, we cluster the extensive degradation space to create a set of representative degradation cases, which serves as a comprehensive test set.

Super-Resolution

Paper
Code

Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors

1 code implementation • CVPR 2015 • Limin Wang, Yu Qiao, Xiaoou Tang

Visual features are of vital importance for human action understanding in videos.

Ranked #59 on Action Recognition on HMDB-51

Action Recognition Action Understanding +1

Paper
Code

RPAN: An End-to-End Recurrent Pose-Attention Network for Action Recognition in Videos

1 code implementation • 2017 IEEE International Conference on Computer Vision (ICCV) 2017 • Wenbin Du, Yali Wang, Yu Qiao

Firstly, unlike previous works on pose-related action recognition, our RPAN is an end-to-end recurrent network which can exploit important spatial-temporal evolutions of human pose to assist action recognition in a unified framework.

Ranked #5 on Skeleton Based Action Recognition on J-HMDB

Action Recognition In Videos Pose Estimation +1

Paper
Code

RBF-Softmax: Learning Deep Representative Prototypes with Radial Basis Function Softmax

1 code implementation • ECCV 2020 • Xiao Zhang, Rui Zhao, Yu Qiao, Hongsheng Li

To address this problem, this paper introduces a novel Radial Basis Function (RBF) distances to replace the commonly used inner products in the softmax loss function, such that it can adaptively assign losses to regularize the intra-class and inter-class distances by reshaping the relative differences, and thus creating more representative prototypes of classes to improve optimization.

Paper
Code

Temporally Consistent Video Colorization with Deep Feature Propagation and Self-regularization Learning

1 code implementation • 9 Oct 2021 • Yihao Liu, Hengyuan Zhao, Kelvin C. K. Chan, Xintao Wang, Chen Change Loy, Yu Qiao, Chao Dong

We address this problem from a new perspective, by jointly considering colorization and temporal consistency in a unified framework.

Colorization Image Colorization

Paper
Code

CT-Net: Channel Tensorization Network for Video Classification

1 code implementation • ICLR 2021 • Kunchang Li, Xianhang Li, Yali Wang, Jun Wang, Yu Qiao

It can learn to exploit spatial, temporal and channel attention in a high-dimensional manner, to improve the cooperative power of all the feature dimensions in our CT-Module.

Ranked #18 on Action Recognition on Something-Something V1

Action Classification Classification +1

Paper
Code

Safety of Multimodal Large Language Models on Images and Text

1 code implementation • 1 Feb 2024 • Xin Liu, Yichen Zhu, Yunshi Lan, Chao Yang, Yu Qiao

In this paper, we systematically survey current efforts on the evaluation, attack, and defense of MLLMs' safety on images and text.

Paper
Code

Cross Domain Object Detection by Target-Perceived Dual Branch Distillation

1 code implementation • CVPR 2022 • Mengzhe He, Yali Wang, Jiaxi Wu, Yiru Wang, Hanqing Li, Bo Li, Weihao Gan, Wei Wu, Yu Qiao

It can adaptively enhance source detector to perceive objects in a target image, by leveraging target proposal contexts from iterative cross-attention.

Object object-detection +1

Paper
Code

Siamese Image Modeling for Self-Supervised Vision Representation Learning

2 code implementations • CVPR 2023 • Chenxin Tao, Xizhou Zhu, Weijie Su, Gao Huang, Bin Li, Jie zhou, Yu Qiao, Xiaogang Wang, Jifeng Dai

Driven by these analysis, we propose Siamese Image Modeling (SiameseIM), which predicts the dense representations of an augmented view, based on another masked view from the same image but with different augmentations.

Representation Learning Self-Supervised Learning +1

Paper
Code

Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey

1 code implementation • 14 Feb 2024 • Zhichen Dong, Zhanhui Zhou, Chao Yang, Jing Shao, Yu Qiao

Large Language Models (LLMs) are now commonplace in conversation applications.

Paper
Code

Cannot find the paper you are looking for? You can Submit a new open access paper.