Search Results for author: Ping Luo

Found 270 papers, 155 papers with code

Adapting LLaMA Decoder to Vision Transformer

no code implementations10 Apr 2024 Jiahao Wang, Wenqi Shao, Mengzhao Chen, Chengyue Wu, Yong liu, Kaipeng Zhang, Songyang Zhang, Kai Chen, Ping Luo

We first "LLaMAfy" a standard ViT step-by-step to align with LLaMA's architecture, and find that directly applying a casual mask to the self-attention brings an attention collapse issue, resulting in the failure to the network training.

Computational Efficiency Quantization +1
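
For readers unfamiliar with the term, the "causal mask" above is the lower-triangular attention mask used in LLaMA-style decoders. Below is a minimal PyTorch sketch (not the paper's code; the shapes and the `causal` flag are illustrative assumptions) of what applying such a mask to ViT patch self-attention looks like.

```python
import torch

def patch_self_attention(x, num_heads=8, causal=False):
    """Single-layer multi-head self-attention over patch tokens.

    x: (batch, num_patches, dim). With causal=True each patch only attends to
    itself and earlier patches, mimicking a LLaMA-style decoder.
    """
    B, N, D = x.shape
    head_dim = D // num_heads
    # For brevity, reuse x as queries/keys/values (no learned projections here).
    q = k = v = x.view(B, N, num_heads, head_dim).transpose(1, 2)   # (B, H, N, d)
    attn = (q @ k.transpose(-2, -1)) / head_dim ** 0.5              # (B, H, N, N)
    if causal:
        mask = torch.triu(torch.ones(N, N, dtype=torch.bool), diagonal=1)
        attn = attn.masked_fill(mask, float("-inf"))                # block future patches
    attn = attn.softmax(dim=-1)
    return (attn @ v).transpose(1, 2).reshape(B, N, D)

tokens = torch.randn(2, 196, 768)                    # 14x14 patches of a 224x224 image
bidirectional = patch_self_attention(tokens)                  # standard ViT behaviour
masked = patch_self_attention(tokens, causal=True)            # LLaMA-style causal variant
```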

End-to-End Autonomous Driving through V2X Cooperation

2 code implementations31 Mar 2024 Haibao Yu, Wenxian Yang, Jiaru Zhong, Zhenwei Yang, Siqi Fan, Ping Luo, Zaiqing Nie

Cooperatively utilizing both ego-vehicle and infrastructure sensor data via V2X communication has emerged as a promising approach for advanced autonomous driving.

Autonomous Driving

DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model

1 code implementation31 Mar 2024 Lirui Zhao, Yue Yang, Kaipeng Zhang, Wenqi Shao, Yuxin Zhang, Yu Qiao, Ping Luo, Rongrong Ji

Text-to-image (T2I) generative models have attracted significant attention and found extensive applications within and beyond academic research.

Language Modelling Large Language Model

Accelerating Federated Learning by Selecting Beneficial Herd of Local Gradients

no code implementations25 Mar 2024 Ping Luo, Xiaoge Deng, Ziqing Wen, Tao Sun, Dongsheng Li

Federated Learning (FL) is a distributed machine learning framework in communication network systems.

Federated Learning

DriveCoT: Integrating Chain-of-Thought Reasoning with End-to-End Driving

no code implementations25 Mar 2024 Tianqi Wang, Enze Xie, Ruihang Chu, Zhenguo Li, Ping Luo

We utilize the challenging driving scenarios from the CARLA leaderboard 2.0, which involve high-speed driving and lane-changing, and propose a rule-based expert policy to control the vehicle and generate ground truth labels for its reasoning process across different driving aspects and the final decisions.

FlashFace: Human Image Personalization with High-fidelity Identity Preservation

no code implementations25 Mar 2024 Shilong Zhang, Lianghua Huang, Xi Chen, Yifei Zhang, Zhi-Fan Wu, Yutong Feng, Wei Wang, Yujun Shen, Yu Liu, Ping Luo

This work presents FlashFace, a practical tool with which users can easily personalize their own photos on the fly by providing one or a few reference face images and a text prompt.

Face Swapping Instruction Following +1

Zero-shot Generative Linguistic Steganography

1 code implementation16 Mar 2024 Ke Lin, Yiyang Luo, Zijian Zhang, Ping Luo

Generative linguistic steganography attempts to hide secret messages into covertext.

In-Context Learning Linguistic steganography

AVIBench: Towards Evaluating the Robustness of Large Vision-Language Model on Adversarial Visual-Instructions

no code implementations14 Mar 2024 Hao Zhang, Wenqi Shao, Hong Liu, Yongqiang Ma, Ping Luo, Yu Qiao, Kaipeng Zhang

To bridge this gap, we introduce AVIBench, a framework designed to analyze the robustness of LVLMs when facing various adversarial visual-instructions (AVIs), including four types of image-based AVIs, ten types of text-based AVIs, and nine types of content bias AVIs (such as gender, violence, cultural, and racial biases, among others).

Fairness Language Modelling

ACT-MNMT Auto-Constriction Turning for Multilingual Neural Machine Translation

no code implementations11 Mar 2024 Shaojie Dai, Xin Liu, Ping Luo, Yue Yu

Large language model (LLM) has achieved promising performance in multilingual machine translation tasks through zero/few-shot prompts or prompt-tuning.

Language Modelling Large Language Model +2

PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

no code implementations7 Mar 2024 Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, Zhenguo Li

In this paper, we introduce PixArt-Σ, a Diffusion Transformer (DiT) model capable of directly generating images at 4K resolution.

4k Image Captioning +1

RegionGPT: Towards Region Understanding Vision Language Model

no code implementations4 Mar 2024 Qiushan Guo, Shalini De Mello, Hongxu Yin, Wonmin Byeon, Ka Chun Cheung, Yizhou Yu, Ping Luo, Sifei Liu

Vision language models (VLMs) have experienced rapid advancements through the integration of large language models (LLMs) with image-text pairs, yet they struggle with detailed regional visual understanding due to limited spatial awareness of the vision encoder, and the use of coarse-grained training data that lacks detailed, region-specific captions.

Language Modelling

Towards Implicit Prompt For Text-To-Image Models

no code implementations4 Mar 2024 Yue Yang, Yuqi Lin, Hong Liu, Wenqi Shao, Runjian Chen, Hailong Shang, Yu Wang, Yu Qiao, Kaipeng Zhang, Ping Luo

We call for increased attention to the potential and risks of implicit prompts in the T2I community and further investigation into the capabilities and impacts of implicit prompts, advocating for a balanced approach that harnesses their benefits while mitigating their risks.

Position

RoboScript: Code Generation for Free-Form Manipulation Tasks across Real and Simulation

no code implementations22 Feb 2024 Junting Chen, Yao Mu, Qiaojun Yu, Tianming Wei, Silang Wu, Zhecheng Yuan, Zhixuan Liang, Chao Yang, Kaipeng Zhang, Wenqi Shao, Yu Qiao, Huazhe Xu, Mingyu Ding, Ping Luo

To bridge this "ideal-to-real" gap, this paper presents RobotScript, a platform for 1) a deployable robot manipulation pipeline powered by code generation; and 2) a code generation benchmark for robot manipulation tasks in free-form natural language.

Code Generation Common Sense Reasoning +2

BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation

1 code implementation18 Feb 2024 Peng Xu, Wenqi Shao, Mengzhao Chen, Shitao Tang, Kaipeng Zhang, Peng Gao, Fengwei An, Yu Qiao, Ping Luo

Large language models (LLMs) have demonstrated outstanding performance in various tasks, such as text summarization, text question-answering, etc.

Question Answering Text Summarization

OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM

no code implementations14 Feb 2024 Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, Ping Luo

A significant challenge arises from the scarcity of diverse medical images spanning various modalities and anatomical regions, which is essential in real-world medical applications.

Medical Visual Question Answering Question Answering +1

PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models

1 code implementation10 Jan 2024 Junsong Chen, Yue Wu, Simian Luo, Enze Xie, Sayak Paul, Ping Luo, Hang Zhao, Zhenguo Li

As a state-of-the-art, open-source image generation model, PIXART-δ offers a promising alternative to the Stable Diffusion family of models, contributing significantly to text-to-image synthesis.

Image Generation

LLaMA Pro: Progressive LLaMA with Block Expansion

1 code implementation4 Jan 2024 Chengyue Wu, Yukang Gan, Yixiao Ge, Zeyu Lu, Jiahao Wang, Ye Feng, Ping Luo, Ying Shan

Humans generally acquire new skills without compromising the old; however, the opposite holds for Large Language Models (LLMs), e.g., from LLaMA to CodeLLaMA.

Instruction Following Math

Video Understanding with Large Language Models: A Survey

1 code implementation29 Dec 2023 Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Feng Zheng, JianGuo Zhang, Ping Luo, Jiebo Luo, Chenliang Xu

With the burgeoning growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly.

Video Understanding

DriveLM: Driving with Graph Visual Question Answering

1 code implementation21 Dec 2023 Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Ping Luo, Andreas Geiger, Hongyang Li

The experiments demonstrate that Graph VQA provides a simple, principled framework for reasoning about a driving scene, and DriveLM-Data provides a challenging benchmark for this task.

Autonomous Driving Question Answering +1

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

2 code implementations21 Dec 2023 Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, Bin Li, Ping Luo, Tong Lu, Yu Qiao, Jifeng Dai

However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs.

 Ranked #1 on Zero-Shot Video Retrieval on MSR-VTT-full (using extra training data)

Image Retrieval Image-to-Text Retrieval +10

Cached Transformers: Improving Transformers with Differentiable Memory Cache

1 code implementation20 Dec 2023 Zhaoyang Zhang, Wenqi Shao, Yixiao Ge, Xiaogang Wang, Jinwei Gu, Ping Luo

This work introduces a new Transformer model called Cached Transformer, which uses Gated Recurrent Cached (GRC) attention to extend the self-attention mechanism with a differentiable memory cache of tokens.

Image Classification Instance Segmentation +6
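
The GRC mechanism is only named in the snippet above; the toy sketch below illustrates the general idea of a gated, differentiable token cache that is attended to alongside the current tokens. The module names, gating rule, and cache-update scheme are assumptions for illustration, not the paper's design.

```python
import torch
import torch.nn as nn

class GatedTokenCache(nn.Module):
    """Toy differentiable memory cache in the spirit of GRC attention.

    Keeps `cache_len` memory tokens that are updated with a learned gate and
    prepended to the current tokens before standard attention.
    (Illustrative only; the real GRC design differs in detail.)
    """
    def __init__(self, dim, cache_len=16, num_heads=8):
        super().__init__()
        self.cache_len = cache_len
        self.gate = nn.Linear(dim, dim)
        self.write = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, cache):
        # x: (B, N, D), cache: (B, cache_len, D)
        summary = self.write(x.mean(dim=1, keepdim=True))      # (B, 1, D) summary of new tokens
        g = torch.sigmoid(self.gate(cache))                    # per-slot gate
        new_cache = g * cache + (1 - g) * summary              # gated recurrent update
        ctx = torch.cat([new_cache, x], dim=1)                 # attend over cache + tokens
        out, _ = self.attn(query=x, key=ctx, value=ctx)
        return out, new_cache

layer = GatedTokenCache(dim=256)
x = torch.randn(4, 50, 256)
cache = torch.zeros(4, 16, 256)
y, cache = layer(x, cache)      # the updated cache carries over to the next batch of tokens
```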

SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution

no code implementations18 Dec 2023 Zhixuan Liang, Yao Mu, Hengbo Ma, Masayoshi Tomizuka, Mingyu Ding, Ping Luo

Experiments on multi-task robotic manipulation benchmarks like Meta-World and LOReL demonstrate state-of-the-art performance and human-interpretable skill representations from SkillDiffuser.

Trajectory Planning

You Only Learn One Query: Learning Unified Human Query for Single-Stage Multi-Person Multi-Task Human-Centric Perception

no code implementations9 Dec 2023 Sheng Jin, Shuhuai Li, Tong Li, Wentao Liu, Chen Qian, Ping Luo

Human-centric perception (e.g., pedestrian detection, segmentation, pose estimation, and attribute analysis) is a long-standing problem for computer vision.

Attribute Multi-Task Learning +1

MotionCtrl: A Unified and Flexible Motion Controller for Video Generation

1 code implementation6 Dec 2023 Zhouxia Wang, Ziyang Yuan, Xintao Wang, Tianshui Chen, Menghan Xia, Ping Luo, Ying Shan

Therefore, this paper presents MotionCtrl, a unified and flexible motion controller for video generation designed to effectively and independently control camera and object motion.

Object Video Generation

MLLMs-Augmented Visual-Language Representation Learning

1 code implementation30 Nov 2023 Yanqing Liu, Kai Wang, Wenqi Shao, Ping Luo, Yu Qiao, Mike Zheng Shou, Kaipeng Zhang, Yang You

Visual-language pre-training has achieved remarkable success in many multi-modal tasks, largely attributed to the availability of large-scale image-text datasets.

Representation Learning Retrieval +1

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

1 code implementation28 Nov 2023 Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, LiMin Wang, Yu Qiao

With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models.

Fairness Multiple-choice +8

Advancing Vision Transformers with Group-Mix Attention

1 code implementation26 Nov 2023 Chongjian Ge, Xiaohan Ding, Zhan Tong, Li Yuan, Jiangliu Wang, Yibing Song, Ping Luo

The attention map is computed based on the mixtures of tokens and group proxies and used to re-combine the tokens and groups in Value.

Image Classification object-detection +2

Large Language Models as Automated Aligners for benchmarking Vision-Language Models

no code implementations24 Nov 2023 Yuanfeng Ji, Chongjian Ge, Weikai Kong, Enze Xie, Zhengying Liu, Zhengguo Li, Ping Luo

In this work, we address the limitations via Auto-Bench, which delves into exploring LLMs as proficient aligners, measuring the alignment between VLMs and human intelligence and value through automatic data curation and assessment.

Benchmarking World Knowledge

DiffusionMat: Alpha Matting as Sequential Refinement Learning

no code implementations22 Nov 2023 Yangyang Xu, Shengfeng He, Wenqi Shao, Kwan-Yee K. Wong, Yu Qiao, Ping Luo

In this paper, we introduce DiffusionMat, a novel image matting framework that employs a diffusion model for the transition from coarse to refined alpha mattes.

Denoising Image Matting

Flow-Based Feature Fusion for Vehicle-Infrastructure Cooperative 3D Object Detection

1 code implementation NeurIPS 2023 Haibao Yu, Yingjuan Tang, Enze Xie, Jilei Mao, Ping Luo, Zaiqing Nie

To address these issues in vehicle-infrastructure cooperative 3D (VIC3D) object detection, we propose the Feature Flow Net (FFNet), a novel cooperative detection framework.

3D Object Detection Autonomous Driving +1

Harvest Video Foundation Models via Efficient Post-Pretraining

1 code implementation30 Oct 2023 Yizhuo Li, Kunchang Li, Yinan He, Yi Wang, Yali Wang, LiMin Wang, Yu Qiao, Ping Luo

Building video-language foundation models is costly and difficult due to the redundant nature of video data and the lack of high-quality video-language datasets.

Question Answering Text Retrieval +2

Tree-Planner: Efficient Close-loop Task Planning with Large Language Models

no code implementations12 Oct 2023 Mengkang Hu, Yao Mu, Xinmiao Yu, Mingyu Ding, Shiguang Wu, Wenqi Shao, Qiguang Chen, Bin Wang, Yu Qiao, Ping Luo

This paper studies close-loop task planning, which refers to the process of generating a sequence of skills (a plan) to accomplish a specific goal while adapting the plan based on real-time observations.

Decision Making

MeanAP-Guided Reinforced Active Learning for Object Detection

no code implementations12 Oct 2023 Zhixuan Liang, Xingyu Zeng, Rui Zhao, Ping Luo

Active learning presents a promising avenue for training high-performance models with minimal labeled data, achieved by judiciously selecting the most informative instances to label and incorporating them into the task learner.

Active Object Detection Object +2

Guideline Learning for In-context Information Extraction

no code implementations8 Oct 2023 Chaoxu Pang, Yixuan Cao, Qiang Ding, Ping Luo

In this paper, we propose a Guideline Learning (GL) framework for In-context IE which reflectively learns and follows guidelines.

Active Learning Event Extraction +2

Open-Vocabulary Animal Keypoint Detection with Semantic-feature Matching

no code implementations8 Oct 2023 Hao Zhang, Lumin Xu, Shenqi Lai, Wenqi Shao, Nanning Zheng, Ping Luo, Yu Qiao, Kaipeng Zhang

Current image-based keypoint detection methods for animal (including human) bodies and faces are generally divided into full-supervised and few-shot class-agnostic approaches.

Keypoint Detection

LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving

no code implementations4 Oct 2023 Hao Sha, Yao Mu, YuXuan Jiang, Li Chen, Chenfeng Xu, Ping Luo, Shengbo Eben Li, Masayoshi Tomizuka, Wei Zhan, Mingyu Ding

Existing learning-based autonomous driving (AD) systems face challenges in comprehending high-level information, generalizing to rare events, and providing interpretability.

Autonomous Driving Decision Making

PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

2 code implementations30 Sep 2023 Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, Zhenguo Li

We hope PIXART-α will provide new insights to the AIGC community and startups to accelerate building their own high-quality yet low-cost generative models from scratch.

Image Generation Language Modelling

SPOT: Scalable 3D Pre-training via Occupancy Prediction for Autonomous Driving

1 code implementation19 Sep 2023 Xiangchao Yan, Runjian Chen, Bo Zhang, Jiakang Yuan, Xinyu Cai, Botian Shi, Wenqi Shao, Junchi Yan, Ping Luo, Yu Qiao

Our contributions are threefold: (1) Occupancy prediction is shown to be promising for learning general representations, which is demonstrated by extensive experiments on plenty of datasets and tasks.

3D Object Detection Autonomous Driving +3

StyleAdapter: A Single-Pass LoRA-Free Model for Stylized Image Generation

no code implementations4 Sep 2023 Zhouxia Wang, Xintao Wang, Liangbin Xie, Zhongang Qi, Ying Shan, Wenping Wang, Ping Luo

StyleAdapter can generate high-quality images that match the content of the prompts and adopt the style of the references (even for unseen styles) in a single pass, which is more flexible and efficient than previous methods.

Image Generation

MedShapeNet -- A Large-Scale Dataset of 3D Medical Shapes for Computer Vision

1 code implementation30 Aug 2023 Jianning Li, Zongwei Zhou, Jiancheng Yang, Antonio Pepe, Christina Gsaxner, Gijs Luijten, Chongyu Qu, Tiezheng Zhang, Xiaoxi Chen, Wenxuan Li, Marek Wodzinski, Paul Friedrich, Kangxian Xie, Yuan Jin, Narmada Ambigapathy, Enrico Nasca, Naida Solak, Gian Marco Melito, Viet Duc Vu, Afaque R. Memon, Christopher Schlachta, Sandrine de Ribaupierre, Rajnikant Patel, Roy Eagleson, Xiaojun Chen, Heinrich Mächler, Jan Stefan Kirschke, Ezequiel de la Rosa, Patrick Ferdinand Christ, Hongwei Bran Li, David G. Ellis, Michele R. Aizenberg, Sergios Gatidis, Thomas Küstner, Nadya Shusharina, Nicholas Heller, Vincent Andrearczyk, Adrien Depeursinge, Mathieu Hatt, Anjany Sekuboyina, Maximilian Löffler, Hans Liebl, Reuben Dorent, Tom Vercauteren, Jonathan Shapey, Aaron Kujawa, Stefan Cornelissen, Patrick Langenhuizen, Achraf Ben-Hamadou, Ahmed Rekik, Sergi Pujades, Edmond Boyer, Federico Bolelli, Costantino Grana, Luca Lumetti, Hamidreza Salehi, Jun Ma, Yao Zhang, Ramtin Gharleghi, Susann Beier, Arcot Sowmya, Eduardo A. Garza-Villarreal, Thania Balducci, Diego Angeles-Valdez, Roberto Souza, Leticia Rittner, Richard Frayne, Yuanfeng Ji, Vincenzo Ferrari, Soumick Chatterjee, Florian Dubost, Stefanie Schreiber, Hendrik Mattern, Oliver Speck, Daniel Haehn, Christoph John, Andreas Nürnberger, João Pedrosa, Carlos Ferreira, Guilherme Aresta, António Cunha, Aurélio Campilho, Yannick Suter, Jose Garcia, Alain Lalande, Vicky Vandenbossche, Aline Van Oevelen, Kate Duquesne, Hamza Mekhzoum, Jef Vandemeulebroucke, Emmanuel Audenaert, Claudia Krebs, Timo Van Leeuwen, Evie Vereecke, Hauke Heidemeyer, Rainer Röhrig, Frank Hölzle, Vahid Badeli, Kathrin Krieger, Matthias Gunzer, Jianxu Chen, Timo van Meegdenburg, Amin Dada, Miriam Balzer, Jana Fragemann, Frederic Jonske, Moritz Rempe, Stanislav Malorodov, Fin H. Bahnsen, Constantin Seibold, Alexander Jaus, Zdravko Marinov, Paul F. Jaeger, Rainer Stiefelhagen, Ana Sofia Santos, Mariana Lindo, André Ferreira, Victor Alves, Michael Kamp, Amr Abourayya, Felix Nensa, Fabian Hörst, Alexander Brehmer, Lukas Heine, Yannik Hanusrichter, Martin Weßling, Marcel Dudda, Lars E. Podleska, Matthias A. Fink, Julius Keyl, Konstantinos Tserpes, Moon-Sung Kim, Shireen Elhabian, Hans Lamecker, Dženan Zukić, Beatriz Paniagua, Christian Wachinger, Martin Urschler, Luc Duong, Jakob Wasserthal, Peter F. Hoyer, Oliver Basu, Thomas Maal, Max J. H. Witjes, Gregor Schiele, Ti-chiun Chang, Seyed-Ahmad Ahmadi, Ping Luo, Bjoern Menze, Mauricio Reyes, Thomas M. Deserno, Christos Davatzikos, Behrus Puladi, Pascal Fua, Alan L. Yuille, Jens Kleesiek, Jan Egger

For the medical domain, we present a large collection of anatomical shapes (e.g., bones, organs, vessels) and 3D models of surgical instruments, called MedShapeNet, created to facilitate the translation of data-driven vision algorithms to medical applications and to adapt SOTA vision algorithms to medical problems.

Anatomy Mixed Reality

GKGNet: Group K-Nearest Neighbor based Graph Convolutional Network for Multi-Label Image Recognition

no code implementations28 Aug 2023 Ruijie Yao, Sheng Jin, Lumin Xu, Wang Zeng, Wentao Liu, Chen Qian, Ping Luo, Ji Wu

Multi-Label Image Recognition (MLIR) is a challenging task that aims to predict multiple object labels in a single image while modeling the complex relationships between labels and image regions.

graph construction

RestoreFormer++: Towards Real-World Blind Face Restoration from Undegraded Key-Value Pairs

1 code implementation14 Aug 2023 Zhouxia Wang, Jiawei Zhang, Tianshui Chen, Wenping Wang, Ping Luo

In this work, we propose RestoreFormer++, which on the one hand introduces fully-spatial attention mechanisms to model the contextual information and the interplay with the priors, and on the other hand, explores an extending degrading model to help generate more realistic degraded face images to alleviate the synthetic-to-real-world gap.

Blind Face Restoration

Foundation Model is Efficient Multimodal Multitask Model Selector

1 code implementation NeurIPS 2023 Fanqing Meng, Wenqi Shao, Zhanglin Peng, Chonghe Jiang, Kaipeng Zhang, Yu Qiao, Ping Luo

This paper investigates an under-explored but important problem: given a collection of pre-trained neural networks, predicting their performance on each multi-modal task without fine-tuning them, such as image recognition, referring, captioning, visual question answering, and text question answering.

Model Selection Question Answering +1

RIGID: Recurrent GAN Inversion and Editing of Real Face Videos

no code implementations ICCV 2023 Yangyang Xu, Shengfeng He, Kwan-Yee K. Wong, Ping Luo

In this paper, we propose a unified recurrent framework, named Recurrent vIdeo GAN Inversion and eDiting (RIGID), to explicitly and simultaneously enforce temporally coherent GAN inversion and facial editing of real videos.

Attribute Facial Editing +1

Exploring Transformers for Open-world Instance Segmentation

no code implementations ICCV 2023 Jiannan Wu, Yi Jiang, Bin Yan, Huchuan Lu, Zehuan Yuan, Ping Luo

Open-world instance segmentation is a rising task, which aims to segment all objects in the image by learning from a limited number of base-category objects.

Contrastive Learning Open-World Instance Segmentation +1

Tiny LVLM-eHub: Early Multimodal Experiments with Bard

1 code implementation7 Aug 2023 Wenqi Shao, Yutao Hu, Peng Gao, Meng Lei, Kaipeng Zhang, Fanqing Meng, Peng Xu, Siyuan Huang, Hongsheng Li, Yu Qiao, Ping Luo

Secondly, it conducts an in-depth analysis of LVLMs' predictions using the ChatGPT Ensemble Evaluation (CEE), which leads to a robust and accurate evaluation and exhibits improved alignment with human evaluation compared to the word matching approach.

Hallucination Visual Reasoning

ChiPFormer: Transferable Chip Placement via Offline Decision Transformer

no code implementations26 Jun 2023 Yao Lai, Jinxin Liu, Zhentao Tang, Bin Wang, Jianye Hao, Ping Luo

To resolve these challenges, we cast the chip placement as an offline RL formulation and present ChiPFormer that enables learning a transferable placement policy from fixed offline data.

Offline RL Reinforcement Learning (RL)

Align, Adapt and Inject: Sound-guided Unified Image Generation

no code implementations20 Jun 2023 Yue Yang, Kaipeng Zhang, Yuying Ge, Wenqi Shao, Zeyue Xue, Yu Qiao, Ping Luo

Then, we propose the audio adapter to adapt audio representation into an audio token enriched with specific semantics, which can be injected into a frozen T2I model flexibly.

Image Generation Retrieval +1

EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

no code implementations NeurIPS 2023 Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, Yu Qiao, Ping Luo

In this work, we introduce EmbodiedGPT, an end-to-end multi-modal foundation model for embodied AI, empowering embodied agents with multi-modal understanding and execution capabilities.

Image Captioning Language Modelling +3

SyNDock: N Rigid Protein Docking via Learnable Group Synchronization

no code implementations23 May 2023 Yuanfeng Ji, Yatao Bian, Guoji Fu, Peilin Zhao, Ping Luo

Firstly, SyNDock formulates multimeric protein docking as a problem of learning global transformations to holistically depict the placement of chain units of a complex, enabling a learning-centric solution.

VDT: General-purpose Video Diffusion Transformers via Mask Modeling

1 code implementation22 May 2023 Haoyu Lu, Guoxing Yang, Nanyi Fei, Yuqi Huo, Zhiwu Lu, Ping Luo, Mingyu Ding

We also propose a unified spatial-temporal mask modeling mechanism, seamlessly integrated with the model, to cater to diverse video generation scenarios.

Autonomous Driving Video Generation +1

Going Denser with Open-Vocabulary Part Segmentation

2 code implementations ICCV 2023 Peize Sun, Shoufa Chen, Chenchen Zhu, Fanyi Xiao, Ping Luo, Saining Xie, Zhicheng Yan

In this paper, we propose a detector with the ability to predict both open-vocabulary objects and their part segmentation.

Object object-detection +3

V2X-Seq: A Large-Scale Sequential Dataset for Vehicle-Infrastructure Cooperative Perception and Forecasting

1 code implementation CVPR 2023 Haibao Yu, Wenxian Yang, Hongzhi Ruan, Zhenwei Yang, Yingjuan Tang, Xu Gao, Xin Hao, Yifeng Shi, Yifeng Pan, Ning Sun, Juan Song, Jirui Yuan, Ping Luo, Zaiqing Nie

Utilizing infrastructure and vehicle-side information to track and forecast the behaviors of surrounding traffic participants can significantly improve decision-making and safety in autonomous driving.

Autonomous Driving Decision Making +1

InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language

2 code implementations9 May 2023 Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Zeqiang Lai, Yang Yang, Qingyun Li, Jiashuo Yu, Kunchang Li, Zhe Chen, Xue Yang, Xizhou Zhu, Yali Wang, LiMin Wang, Ping Luo, Jifeng Dai, Yu Qiao

Different from existing interactive systems that rely on pure language, by incorporating pointing instructions, the proposed iGPT significantly improves the efficiency of communication between users and chatbots, as well as the accuracy of chatbots in vision-centric tasks, especially in complicated visual scenarios where the number of objects is greater than 2.

Language Modelling

MultiModal-GPT: A Vision and Language Model for Dialogue with Humans

1 code implementation8 May 2023 Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, Kai Chen

To further enhance the ability to chat with humans of the MultiModal-GPT, we utilize language-only instruction-following data to train the MultiModal-GPT jointly.

Instruction Following Language Modelling

π-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation

1 code implementation27 Apr 2023 Chengyue Wu, Teng Wang, Yixiao Ge, Zeyu Lu, Ruisong Zhou, Ying Shan, Ping Luo

Foundation models have achieved great advances in multi-task learning with a unified interface of unimodal and multimodal tasks.

Multi-Task Learning

MetaBEV: Solving Sensor Failures for BEV Detection and Map Segmentation

1 code implementation19 Apr 2023 Chongjian Ge, Junsong Chen, Enze Xie, Zhongdao Wang, Lanqing Hong, Huchuan Lu, Zhenguo Li, Ping Luo

These queries are then processed iteratively by a BEV-Evolving decoder, which selectively aggregates deep features from either LiDAR, cameras, or both modalities.

3D Object Detection Autonomous Driving +3
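
As a rough illustration of how BEV queries can aggregate features from whichever sensors are available, here is a hedged PyTorch sketch; the block structure, dimensions, and the way a missing modality is handled are assumptions for illustration rather than the MetaBEV implementation.

```python
import torch
import torch.nn as nn

class BEVEvolvingBlock(nn.Module):
    """Illustrative cross-attention block: BEV queries aggregate whichever
    sensor features are present (LiDAR, camera, or both). Not the paper's code."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, bev_queries, lidar_feats=None, cam_feats=None):
        feats = [f for f in (lidar_feats, cam_feats) if f is not None]
        assert feats, "at least one modality must be present"
        kv = torch.cat(feats, dim=1)                        # pool the available modalities
        attn_out, _ = self.cross_attn(bev_queries, kv, kv)  # queries aggregate sensor features
        q = self.norm1(bev_queries + attn_out)
        return self.norm2(q + self.ffn(q))

block = BEVEvolvingBlock()
queries = torch.randn(2, 400, 256)          # coarse grid of BEV queries
lidar = torch.randn(2, 1000, 256)           # flattened LiDAR features
cams = torch.randn(2, 1800, 256)            # flattened multi-camera features
out_full = block(queries, lidar, cams)      # both sensors available
out_cam_only = block(queries, None, cams)   # simulated LiDAR failure
```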

EC^2: Emergent Communication for Embodied Control

no code implementations19 Apr 2023 Yao Mu, Shunyu Yao, Mingyu Ding, Ping Luo, Chuang Gan

We learn embodied representations of video trajectories, emergent language, and natural language using a language model, which is then used to finetune a lightweight policy network for downstream control.

Contrastive Learning Language Modelling

RIFormer: Keep Your Vision Backbone Effective While Removing Token Mixer

2 code implementations12 Apr 2023 Jiahao Wang, Songyang Zhang, Yong liu, Taiqiang Wu, Yujiu Yang, Xihui Liu, Kai Chen, Ping Luo, Dahua Lin

Extensive experiments and ablative analysis also demonstrate that the inductive bias of a network architecture can be incorporated into a simple network structure with an appropriate optimization strategy.

Inductive Bias

Embodied Concept Learner: Self-supervised Learning of Concepts and Mapping through Instruction Following

no code implementations7 Apr 2023 Mingyu Ding, Yan Xu, Zhenfang Chen, David Daniel Cox, Ping Luo, Joshua B. Tenenbaum, Chuang Gan

ECL consists of: (i) an instruction parser that translates natural language into executable programs; (ii) an embodied concept learner that grounds visual concepts based on language descriptions; (iii) a map constructor that estimates depth and constructs semantic maps by leveraging the learned concepts; and (iv) a program executor with deterministic policies to execute each program.

Instruction Following Self-Supervised Learning

Visual Dependency Transformers: Dependency Tree Emerges from Reversed Attention

1 code implementation CVPR 2023 Mingyu Ding, Yikang Shen, Lijie Fan, Zhenfang Chen, Zitian Chen, Ping Luo, Joshua B. Tenenbaum, Chuang Gan

When looking at an image, we can decompose the scene into entities and their parts as well as obtain the dependencies between them.

Multi-Level Contrastive Learning for Dense Prediction Task

1 code implementation4 Apr 2023 Qiushan Guo, Yizhou Yu, Yi Jiang, Jiannan Wu, Zehuan Yuan, Ping Luo

We extend our pretext task to supervised pre-training, which achieves a similar performance to self-supervised learning.

Contrastive Learning Self-Supervised Learning

DeepAccident: A Motion and Accident Prediction Benchmark for V2X Autonomous Driving

no code implementations3 Apr 2023 Tianqi Wang, Sukmin Kim, Wenxuan Ji, Enze Xie, Chongjian Ge, Junsong Chen, Zhenguo Li, Ping Luo

In addition, we propose a new task, end-to-end motion and accident prediction, which can be used to directly evaluate the accident prediction ability for different autonomous driving algorithms.

3D Object Detection Autonomous Driving +1

Soft Neighbors are Positive Supporters in Contrastive Visual Representation Learning

no code implementations30 Mar 2023 Chongjian Ge, Jiangliu Wang, Zhan Tong, Shoufa Chen, Yibing Song, Ping Luo

We evaluate our soft neighbor contrastive learning method (SNCLR) on standard visual recognition benchmarks, including image classification, object detection, and instance segmentation.

Contrastive Learning Image Classification +6

Real-time Controllable Denoising for Image and Video

1 code implementation CVPR 2023 Zhaoyang Zhang, Yitong Jiang, Wenqi Shao, Xiaogang Wang, Ping Luo, Kaimo Lin, Jinwei Gu

Controllable image denoising aims to generate clean samples with human perceptual priors and balance sharpness and smoothness.

Image Denoising Video Denoising

Accelerating Vision-Language Pretraining with Free Language Modeling

1 code implementation CVPR 2023 Teng Wang, Yixiao Ge, Feng Zheng, Ran Cheng, Ying Shan, XiaoHu Qie, Ping Luo

FLM successfully frees the prediction rate from the tie-up with the corruption rate while allowing the corruption spans to be customized for each token to be predicted.

Language Modelling Masked Language Modeling

Vehicle-Infrastructure Cooperative 3D Object Detection via Feature Flow Prediction

1 code implementation19 Mar 2023 Haibao Yu, Yingjuan Tang, Enze Xie, Jilei Mao, Jirui Yuan, Ping Luo, Zaiqing Nie

Cooperatively utilizing both ego-vehicle and infrastructure sensor data can significantly enhance autonomous driving perception abilities.

3D Object Detection Autonomous Driving +1

Universal Instance Perception as Object Discovery and Retrieval

1 code implementation CVPR 2023 Bin Yan, Yi Jiang, Jiannan Wu, Dong Wang, Ping Luo, Zehuan Yuan, Huchuan Lu

All instance perception tasks aim at finding certain objects specified by some queries such as category names, language expressions, and target annotations, but this complete field has been split into multiple independent subtasks.

 Ranked #1 on Referring Expression Segmentation on RefCoCo val (using extra training data)

Described Object Detection Generalized Referring Expression Comprehension +15

AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners

1 code implementation3 Feb 2023 Zhixuan Liang, Yao Mu, Mingyu Ding, Fei Ni, Masayoshi Tomizuka, Ping Luo

For example, AdaptDiffuser not only outperforms the previous art Diffuser by 20.8% on Maze2D and 7.5% on MuJoCo locomotion, but also adapts better to new tasks, e.g., KUKA pick-and-place, by 27.9% without requiring additional expert data.

Understanding Self-Supervised Pretraining with Part-Aware Representation Learning

1 code implementation27 Jan 2023 Jie Zhu, Jiyang Qi, Mingyu Ding, Xiaokang Chen, Ping Luo, Xinggang Wang, Wenyu Liu, Leye Wang, Jingdong Wang

The study is mainly motivated by that random views, used in contrastive learning, and random masked (visible) patches, used in masked image modeling, are often about object parts.

Contrastive Learning Object +1

Fast-BEV: Towards Real-time On-vehicle Bird's-Eye View Perception

1 code implementation19 Jan 2023 Bin Huang, Yangguang Li, Enze Xie, Feng Liang, Luya Wang, Mingzhu Shen, Fenggang Liu, Tianqi Wang, Ping Luo, Jing Shao

Recently, the pure camera-based Bird's-Eye-View (BEV) perception removes expensive Lidar sensors, making it a feasible solution for economical autonomous driving.

Autonomous Driving Data Augmentation

Segment Every Reference Object in Spatial and Temporal Spaces

no code implementations ICCV 2023 Jiannan Wu, Yi Jiang, Bin Yan, Huchuan Lu, Zehuan Yuan, Ping Luo

In this work, we end the current fragmented situation and propose UniRef to unify the three reference-based object segmentation tasks with a single architecture.

Image Segmentation Object +5

EC2: Emergent Communication for Embodied Control

no code implementations CVPR 2023 Yao Mu, Shunyu Yao, Mingyu Ding, Ping Luo, Chuang Gan

We learn embodied representations of video trajectories, emergent language, and natural language using a language model, which is then used to finetune a lightweight policy network for downstream control.

Contrastive Learning Language Modelling

MetaBEV: Solving Sensor Failures for 3D Detection and Map Segmentation

no code implementations ICCV 2023 Chongjian Ge, Junsong Chen, Enze Xie, Zhongdao Wang, Lanqing Hong, Huchuan Lu, Zhenguo Li, Ping Luo

These queries are then processed iteratively by a BEV-Evolving decoder, which selectively aggregates deep features from either LiDAR, cameras, or both modalities.

3D Object Detection Autonomous Driving +3

RIFormer: Keep Your Vision Backbone Effective but Removing Token Mixer

no code implementations CVPR 2023 Jiahao Wang, Songyang Zhang, Yong liu, Taiqiang Wu, Yujiu Yang, Xihui Liu, Kai Chen, Ping Luo, Dahua Lin

Extensive experiments and ablative analysis also demonstrate that the inductive bias of a network architecture can be incorporated into a simple network structure with an appropriate optimization strategy.

Inductive Bias

Policy Adaptation from Foundation Model Feedback

no code implementations CVPR 2023 Yuying Ge, Annabella Macaluso, Li Erran Li, Ping Luo, Xiaolong Wang

When deploying the trained policy to a new task or a new environment, we first let the policy play with randomly generated instructions to record the demonstrations.

Decision Making

Learning Object-Language Alignments for Open-Vocabulary Object Detection

1 code implementation27 Nov 2022 Chuang Lin, Peize Sun, Yi Jiang, Ping Luo, Lizhen Qu, Gholamreza Haffari, Zehuan Yuan, Jianfei Cai

In this paper, we propose a novel open-vocabulary object detection framework directly learning from image-text pair data.

Object object-detection +3

MaskPlace: Fast Chip Placement via Reinforced Visual Representation Learning

1 code implementation24 Nov 2022 Yao Lai, Yao Mu, Ping Luo

Firstly, MaskPlace recasts placement as a problem of learning pixel-level visual representation to comprehensively describe millions of modules on a chip, enabling placement in a high-resolution canvas and a large action space.

Layout Design Representation Learning +1

Prototypical context-aware dynamics generalization for high-dimensional model-based reinforcement learning

no code implementations23 Nov 2022 Junjie Wang, Yao Mu, Dong Li, Qichao Zhang, Dongbin Zhao, Yuzheng Zhuang, Ping Luo, Bin Wang, Jianye Hao

The latent world model provides a promising way to learn policies in a compact latent space for tasks with high-dimensional observations; however, its generalization across diverse environments with unseen dynamics remains challenging.

Model-based Reinforcement Learning reinforcement-learning +1

DiffusionDet: Diffusion Model for Object Detection

3 code implementations ICCV 2023 Shoufa Chen, Peize Sun, Yibing Song, Ping Luo

We propose DiffusionDet, a new framework that formulates object detection as a denoising diffusion process from noisy boxes to object boxes.

Denoising Object +2
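
A hedged sketch of the core training signal: ground-truth boxes are corrupted by a diffusion-style noising step, and a detection head (omitted here) would learn to recover the clean boxes. The linear-variance schedule and the normalized (cx, cy, w, h) box parameterization below are illustrative assumptions, not DiffusionDet's exact recipe.

```python
import torch

def diffuse_boxes(gt_boxes, t, T=1000):
    """Corrupt ground-truth boxes (cx, cy, w, h in [0, 1]) at diffusion step t,
    using a simple linear-variance schedule as an illustrative stand-in."""
    betas = torch.linspace(1e-4, 0.02, T)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t]
    noise = torch.randn_like(gt_boxes)
    noisy = alpha_bar.sqrt() * gt_boxes + (1 - alpha_bar).sqrt() * noise
    return noisy.clamp(0, 1)

# A detector head would take `noisy` (plus image features) and regress the
# clean boxes; here we only show the corruption step it learns to reverse.
gt = torch.tensor([[0.5, 0.5, 0.2, 0.3],
                   [0.3, 0.7, 0.1, 0.1]])
t = torch.randint(0, 1000, (1,))
noisy = diffuse_boxes(gt, t)
print(t.item(), noisy)
```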

Large-batch Optimization for Dense Visual Predictions

1 code implementation20 Oct 2022 Zeyue Xue, Jianming Liang, Guanglu Song, Zhuofan Zong, Liang Chen, Yu Liu, Ping Luo

To address this challenge, we propose a simple yet effective algorithm, named Adaptive Gradient Variance Modulator (AGVM), which can train dense visual predictors with very large batch size, enabling several benefits more appealing than prior arts.

Instance Segmentation object-detection +3

Decomposed Mutual Information Optimization for Generalized Context in Meta-Reinforcement Learning

1 code implementation9 Oct 2022 Yao Mu, Yuzheng Zhuang, Fei Ni, Bin Wang, Jianyu Chen, Jianye Hao, Ping Luo

This paper addresses such a challenge by Decomposed Mutual INformation Optimization (DOMINO) for context learning, which explicitly learns a disentangled context to maximize the mutual information between the context and historical trajectories, while minimizing the state transition prediction error.

Decision Making Meta Reinforcement Learning +2
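
To make the objective concrete, the toy sketch below combines an InfoNCE-style lower bound on the mutual information between contexts and their trajectory embeddings with a transition-prediction error. The specific estimator and weighting are assumptions, not DOMINO's exact formulation.

```python
import torch
import torch.nn.functional as F

def context_objective(context, traj_emb, pred_next_state, next_state, temperature=0.1):
    """Toy decomposition of a DOMINO-style objective: an InfoNCE term tying each
    context vector to its own trajectory embedding (a simple mutual-information
    lower bound) plus a state-transition prediction error. Illustrative only."""
    c = F.normalize(context, dim=-1)
    t = F.normalize(traj_emb, dim=-1)
    logits = c @ t.t() / temperature                     # match contexts to trajectories
    labels = torch.arange(c.size(0), device=c.device)
    mi_lower_bound = -F.cross_entropy(logits, labels)    # higher = more mutual information
    dynamics_error = F.mse_loss(pred_next_state, next_state)
    return -mi_lower_bound + dynamics_error              # maximize MI, minimize prediction error

ctx = torch.randn(16, 64)       # per-episode context estimates
traj = torch.randn(16, 64)      # embeddings of the corresponding trajectories
pred = torch.randn(16, 10)
target = torch.randn(16, 10)
loss = context_objective(ctx, traj, pred, target)
```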

Enhance Sample Efficiency and Robustness of End-to-end Urban Autonomous Driving via Semantic Masked World Model

no code implementations8 Oct 2022 Zeyu Gao, Yao Mu, Ruoyan Shen, Chen Chen, Yangang Ren, Jianyu Chen, Shengbo Eben Li, Ping Luo, YanFeng Lu

End-to-end autonomous driving provides a feasible way to automatically maximize overall driving system performance by directly mapping the raw pixels from a front-facing camera to control signals.

Autonomous Driving

Learning Transferable Spatiotemporal Representations from Natural Script Knowledge

1 code implementation CVPR 2023 Ziyun Zeng, Yuying Ge, Xihui Liu, Bin Chen, Ping Luo, Shu-Tao Xia, Yixiao Ge

Pre-training on large-scale video data has become a common recipe for learning transferable spatiotemporal representations in recent years.

Descriptive Representation Learning +1

FedVeca: Federated Vectorized Averaging on Non-IID Data with Adaptive Bi-directional Global Objective

no code implementations28 Sep 2022 Ping Luo, Jieren Cheng, Zhenhao Liu, N. Xiong, Jie Wu

However, the clients' Non-Independent and Identically Distributed (Non-IID) data negatively affect the trained model, and clients with different numbers of local updates may cause significant gaps to the local gradients in each communication round.

Federated Learning

Rethinking Resolution in the Context of Efficient Video Recognition

1 code implementation26 Sep 2022 Chuofan Ma, Qiushan Guo, Yi Jiang, Zehuan Yuan, Ping Luo, Xiaojuan Qi

Our key finding is that the major cause of degradation is not information loss in the down-sampling process, but rather the mismatch between network architecture and input scale.

Knowledge Distillation Video Recognition

ZoomNAS: Searching for Whole-body Human Pose Estimation in the Wild

1 code implementation23 Aug 2022 Lumin Xu, Sheng Jin, Wentao Liu, Chen Qian, Wanli Ouyang, Ping Luo, Xiaogang Wang

We propose a single-network approach, termed ZoomNet, to take into account the hierarchical structure of the full human body and solve the scale variation of different body parts.

2D Human Pose Estimation Neural Architecture Search +1

3D Interacting Hand Pose Estimation by Hand De-occlusion and Removal

1 code implementation22 Jul 2022 Hao Meng, Sheng Jin, Wentao Liu, Chen Qian, Mengxiang Lin, Wanli Ouyang, Ping Luo

Unlike most previous works that directly predict the 3D poses of two interacting hands simultaneously, we propose to decompose the challenging interacting hand pose estimation task and estimate the pose of each hand separately.

3D Interacting Hand Pose Estimation Hand Pose Estimation

Pose for Everything: Towards Category-Agnostic Pose Estimation

1 code implementation21 Jul 2022 Lumin Xu, Sheng Jin, Wang Zeng, Wentao Liu, Chen Qian, Wanli Ouyang, Ping Luo, Xiaogang Wang

In this paper, we introduce the task of Category-Agnostic Pose Estimation (CAPE), which aims to create a pose estimation model capable of detecting the pose of any class of object given only a few samples with keypoint definition.

Category-Agnostic Pose Estimation Pose Estimation

Real-time End-to-End Video Text Spotter with Contrastive Representation Learning

1 code implementation18 Jul 2022 Weijia Wu, Zhuang Li, Jiahong Li, Chunhua Shen, Hong Zhou, Size Li, Zhongyuan Wang, Ping Luo

Our contributions are three-fold: 1) CoText simultaneously addresses the three tasks (e.g., text detection, tracking, recognition) in a real-time end-to-end trainable framework.

Contrastive Learning Representation Learning +2

Towards Grand Unification of Object Tracking

1 code implementation14 Jul 2022 Bin Yan, Yi Jiang, Peize Sun, Dong Wang, Zehuan Yuan, Ping Luo, Huchuan Lu

We present a unified method, termed Unicorn, that can simultaneously solve four tracking problems (SOT, MOT, VOS, MOTS) with a single network using the same model parameters.

Multi-Object Tracking Multi-Object Tracking and Segmentation +3

Not All Models Are Equal: Predicting Model Transferability in a Self-challenging Fisher Space

1 code implementation7 Jul 2022 Wenqi Shao, Xun Zhao, Yixiao Ge, Zhaoyang Zhang, Lei Yang, Xiaogang Wang, Ying Shan, Ping Luo

It is challenging because the ground-truth model ranking for each task can only be generated by fine-tuning the pre-trained models on the target dataset, which is brute-force and computationally expensive.

Transferability

Exploiting Context Information for Generic Event Boundary Captioning

1 code implementation3 Jul 2022 Jinrui Zhang, Teng Wang, Feng Zheng, Ran Cheng, Ping Luo

Previous methods only process the information of a single boundary at a time, which lacks utilization of video context information.

Boundary Captioning

VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix

1 code implementation17 Jun 2022 Teng Wang, Wenhao Jiang, Zhichao Lu, Feng Zheng, Ran Cheng, Chengguo Yin, Ping Luo

Existing vision-language pre-training (VLP) methods primarily rely on paired image-text datasets, which are either annotated with enormous human labor, or crawled from the internet and then cleaned with elaborate data cleaning techniques.

Contrastive Learning Data Augmentation +2

CtrlFormer: Learning Transferable State Representation for Visual Control via Transformer

1 code implementation17 Jun 2022 Yao Mu, Shoufa Chen, Mingyu Ding, Jianyu Chen, Runjian Chen, Ping Luo

In visual control, learning transferable state representation that can transfer between different control tasks is important to reduce the training sample size.

Transfer Learning

AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation

1 code implementation16 Jun 2022 Yuanfeng Ji, Haotian Bai, Jie Yang, Chongjian Ge, Ye Zhu, Ruimao Zhang, Zhen Li, Lingyan Zhang, Wanling Ma, Xiang Wan, Ping Luo

Constrained by the high cost of collecting and labeling 3D medical data, most of the deep learning models to date are driven by datasets with a limited number of organs of interest or samples, which still limits the power of modern deep models and makes it difficult to provide a fully comprehensive and fair estimate of various methods.

Image Segmentation Medical Image Segmentation +3

CO^3: Cooperative Unsupervised 3D Representation Learning for Autonomous Driving

1 code implementation8 Jun 2022 Runjian Chen, Yao Mu, Runsen Xu, Wenqi Shao, Chenhan Jiang, Hang Xu, Zhenguo Li, Ping Luo

In this paper, we propose CO^3, namely Cooperative Contrastive Learning and Contextual Shape Prediction, to learn 3D representation for outdoor-scene point clouds in an unsupervised manner.

Autonomous Driving Contrastive Learning +1

AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition

2 code implementations26 May 2022 Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, Ping Luo

To address this challenge, we propose an effective adaptation approach for Transformer, namely AdaptFormer, which can adapt the pre-trained ViTs into many different image and video tasks efficiently.

Action Recognition Video Recognition
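
The snippet does not spell out the adaptation mechanism; the sketch below shows a generic bottleneck-adapter pattern (train a small parallel module while the pre-trained block stays frozen) in the spirit of AdaptFormer-style tuning. The module placement, bottleneck width, and scaling factor are assumptions.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Parallel bottleneck adapter: down-project, ReLU, up-project, then scale.
    Illustrative of adapter-style tuning; the paper's exact design may differ."""
    def __init__(self, dim, bottleneck=64, scale=0.1):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.scale = scale

    def forward(self, x):
        return self.scale * self.up(torch.relu(self.down(x)))

class AdaptedBlock(nn.Module):
    """Wrap a pre-trained (frozen) MLP block and add the trainable adapter in parallel."""
    def __init__(self, frozen_mlp, dim):
        super().__init__()
        self.mlp = frozen_mlp
        for p in self.mlp.parameters():
            p.requires_grad = False              # only the adapter is trained
        self.adapter = BottleneckAdapter(dim)

    def forward(self, x):
        return x + self.mlp(x) + self.adapter(x)

pretrained_mlp = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
block = AdaptedBlock(pretrained_mlp, dim=768)
y = block(torch.randn(2, 196, 768))              # patch tokens pass through frozen MLP + adapter
```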

Flow-based Recurrent Belief State Learning for POMDPs

no code implementations23 May 2022 Xiaoyu Chen, Yao Mu, Ping Luo, Shengbo Li, Jianyu Chen

Furthermore, we show that the learned belief states can be plugged into downstream RL algorithms to improve performance.

Decision Making Variational Inference

An Empirical Investigation of Representation Learning for Imitation

2 code implementations16 May 2022 Xin Chen, Sam Toyer, Cody Wild, Scott Emmons, Ian Fischer, Kuang-Huei Lee, Neel Alex, Steven H Wang, Ping Luo, Stuart Russell, Pieter Abbeel, Rohin Shah

We propose a modular framework for constructing representation learning algorithms, then use our framework to evaluate the utility of representation learning for imitation across several environment suites.

Image Classification Imitation Learning +1

MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval

1 code implementation26 Apr 2022 Yuying Ge, Yixiao Ge, Xihui Liu, Alex Jinpeng Wang, Jianping Wu, Ying Shan, XiaoHu Qie, Ping Luo

Dominant pre-training work for video-text retrieval mainly adopts the "dual-encoder" architecture to enable efficient retrieval, where two separate encoders are used to contrast global video and text representations, but ignores detailed local semantics.

Action Recognition Retrieval +6

Semantic-Aware Pretraining for Dense Video Captioning

no code implementations13 Apr 2022 Teng Wang, Zhu Liu, Feng Zheng, Zhichao Lu, Ran Cheng, Ping Luo

This report describes the details of our approach for the event dense-captioning task in ActivityNet Challenge 2021.

Dense Captioning Dense Video Captioning

M^2BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation

no code implementations11 Apr 2022 Enze Xie, Zhiding Yu, Daquan Zhou, Jonah Philion, Anima Anandkumar, Sanja Fidler, Ping Luo, Jose M. Alvarez

In this paper, we propose M^2BEV, a unified framework that jointly performs 3D object detection and map segmentation in the Birds Eye View (BEV) space with multi-camera image inputs.

3D Object Detection object-detection +1

DaViT: Dual Attention Vision Transformers

3 code implementations7 Apr 2022 Mingyu Ding, Bin Xiao, Noel Codella, Ping Luo, Jingdong Wang, Lu Yuan

We show that these two self-attentions complement each other: (i) since each channel token contains an abstract representation of the entire image, the channel attention naturally captures global interactions and representations by taking all spatial positions into account when computing attention scores between channels; (ii) the spatial attention refines the local representations by performing fine-grained interactions across spatial locations, which in turn helps the global information modeling in channel attention.

Computational Efficiency Image Classification +4
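
The channel attention described above can be pictured as ordinary self-attention with the token and channel axes swapped, so the (C x C) attention scores pool over all spatial positions. Below is a single-head sketch of that idea, with the scaling and channel grouping simplified relative to the paper.

```python
import torch

def channel_self_attention(x):
    """Channel-wise self-attention: treat each channel (taken over all spatial
    positions) as a token, so attention scores are computed between channels.
    A single-head sketch of the idea behind channel group attention."""
    B, N, C = x.shape                               # N spatial tokens, C channels
    q = k = v = x.transpose(1, 2)                   # (B, C, N): channels attend to channels
    attn = (q @ k.transpose(-2, -1)) / N ** 0.5     # (B, C, C), uses every spatial position
    attn = attn.softmax(dim=-1)
    out = attn @ v                                  # (B, C, N)
    return out.transpose(1, 2)                      # back to (B, N, C)

x = torch.randn(2, 14 * 14, 96)
y = channel_self_attention(x)
```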

Scale-Equivalent Distillation for Semi-Supervised Object Detection

no code implementations CVPR 2022 Qiushan Guo, Yao Mu, Jianyu Chen, Tianqi Wang, Yizhou Yu, Ping Luo

Further, we overcome these challenges by introducing a novel approach, Scale-Equivalent Distillation (SED), which is a simple yet effective end-to-end knowledge distillation framework robust to large object size variance and class imbalance.

Knowledge Distillation Object +3

Compression of Generative Pre-trained Language Models via Quantization

no code implementations ACL 2022 Chaofan Tao, Lu Hou, Wei zhang, Lifeng Shang, Xin Jiang, Qun Liu, Ping Luo, Ngai Wong

We find that previous quantization methods fail on generative tasks due to the homogeneous word embeddings caused by reduced capacity, and the varied distribution of weights.

Model Compression Quantization +1

End-to-End Video Text Spotting with Transformer

1 code implementation20 Mar 2022 Weijia Wu, Yuanqiang Cai, Chunhua Shen, Debing Zhang, Ying Fu, Hong Zhou, Ping Luo

Recent video text spotting methods usually require a three-stage pipeline, i.e., detecting text in individual images, recognizing localized text, and tracking text streams with post-processing to generate final results.

Text Detection Text Spotting

WegFormer: Transformers for Weakly Supervised Semantic Segmentation

no code implementations16 Mar 2022 Chunmeng Liu, Enze Xie, Wenjia Wang, Wenhai Wang, Guangyao Li, Ping Luo

Although convolutional neural networks (CNNs) have achieved remarkable progress in weakly supervised semantic segmentation (WSSS), the effective receptive field of CNN is insufficient to capture global context information, leading to sub-optimal results.

Segmentation Weakly supervised Semantic Segmentation +1

Context Autoencoder for Self-Supervised Representation Learning

6 code implementations7 Feb 2022 Xiaokang Chen, Mingyu Ding, Xiaodi Wang, Ying Xin, Shentong Mo, Yunhao Wang, Shumin Han, Ping Luo, Gang Zeng, Jingdong Wang

The pretraining comprises two tasks: masked representation prediction (predict the representations of the masked patches) and masked patch reconstruction (reconstruct the masked patches).

Instance Segmentation object-detection +5
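
A toy rendering of the two pretraining losses named above, using linear stand-ins for the encoder, regressor, and decoder; in the actual method the encoder sees only the visible patches and the architecture is transformer-based, so treat this purely as a shape-level sketch of the loss structure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins: an encoder produces patch representations, a regressor predicts
# the representations of masked patches, and a decoder reconstructs their pixels.
dim, patch_pixels = 192, 16 * 16 * 3
encoder = nn.Linear(patch_pixels, dim)
regressor = nn.Linear(dim, dim)
decoder = nn.Linear(dim, patch_pixels)

patches = torch.randn(2, 196, patch_pixels)    # flattened image patches
mask = torch.rand(2, 196) < 0.5                # True = masked patch

latent = encoder(patches)
pred_latent = regressor(latent)                                  # masked representation prediction ...
loss_repr = F.mse_loss(pred_latent[mask], latent.detach()[mask])
recon = decoder(pred_latent)                                     # ... and masked patch reconstruction
loss_pixel = F.mse_loss(recon[mask], patches[mask])
loss = loss_repr + loss_pixel
```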

Pseudo-Labeled Auto-Curriculum Learning for Semi-Supervised Keypoint Localization

no code implementations ICLR 2022 Can Wang, Sheng Jin, Yingda Guan, Wentao Liu, Chen Qian, Ping Luo, Wanli Ouyang

PL approaches apply pseudo-labels to unlabeled data, and then train the model with a combination of the labeled and pseudo-labeled data iteratively.

Bridging Video-text Retrieval with Multiple Choice Questions

2 code implementations CVPR 2022 Yuying Ge, Yixiao Ge, Xihui Liu, Dian Li, Ying Shan, XiaoHu Qie, Ping Luo

As an additional benefit, our method achieves competitive results with much shorter pre-training videos on single-modality downstream tasks, e.g., action recognition with linear evaluation.

Action Recognition Multiple-choice +8

MetaDance: Few-shot Dancing Video Retargeting via Temporal-aware Meta-learning

no code implementations13 Jan 2022 Yuying Ge, Yibing Song, Ruimao Zhang, Ping Luo

Dancing video retargeting aims to synthesize a video that transfers the dance movements from a source video to a target person.

Meta-Learning

Language as Queries for Referring Video Object Segmentation

1 code implementation CVPR 2022 Jiannan Wu, Yi Jiang, Peize Sun, Zehuan Yuan, Ping Luo

Referring video object segmentation (R-VOS) is an emerging cross-modal task that aims to segment the target object referred by a language expression in all video frames.

Ranked #3 on Referring Expression Segmentation on A2D Sentences (using extra training data)

Object Object Tracking +5

MetaCloth: Learning Unseen Tasks of Dense Fashion Landmark Detection from a Few Samples

no code implementations6 Dec 2021 Yuying Ge, Ruimao Zhang, Ping Luo

This work proposes a novel framework named MetaCloth via meta-learning, which is able to learn unseen tasks of dense fashion landmark detection with only a few annotated samples.

Meta-Learning

Revitalizing CNN Attention via Transformers in Self-Supervised Visual Representation Learning

1 code implementation NeurIPS 2021 Chongjian Ge, Youwei Liang, Yibing Song, Jianbo Jiao, Jue Wang, Ping Luo

Motivated by the transformers that explore visual attention effectively in recognition scenarios, we propose a CNN Attention REvitalization (CARE) framework to train attentive CNN encoders guided by transformers in SSL.

Image Classification object-detection +3

Compressed Video Contrastive Learning

no code implementations NeurIPS 2021 Yuqi Huo, Mingyu Ding, Haoyu Lu, Nanyi Fei, Zhiwu Lu, Ji-Rong Wen, Ping Luo

To enhance the representation ability of the motion vectors, hence the effectiveness of our method, we design a cross guidance contrastive learning algorithm based on multi-instance InfoNCE loss, where motion vectors can take supervision signals from RGB frames and vice versa.

Contrastive Learning Representation Learning
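
The cross-guidance idea, with each modality supervising the other, reduces in its simplest form to a symmetric InfoNCE loss between clip-level RGB and motion-vector embeddings. The single-instance version below is a hedged simplification of the paper's multi-instance objective.

```python
import torch
import torch.nn.functional as F

def cross_modal_infonce(rgb_emb, mv_emb, temperature=0.07):
    """Symmetric InfoNCE between RGB-frame and motion-vector embeddings of the
    same clips, so each modality supervises the other. (Single-instance version;
    the paper uses a multi-instance variant.)"""
    rgb = F.normalize(rgb_emb, dim=-1)
    mv = F.normalize(mv_emb, dim=-1)
    logits = rgb @ mv.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(rgb.size(0), device=rgb.device)
    loss_rgb_to_mv = F.cross_entropy(logits, targets)
    loss_mv_to_rgb = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_rgb_to_mv + loss_mv_to_rgb)

rgb = torch.randn(32, 128)      # clip-level RGB features
mv = torch.randn(32, 128)       # clip-level motion-vector features
loss = cross_modal_infonce(rgb, mv)
```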

DanceTrack: Multi-Object Tracking in Uniform Appearance and Diverse Motion

3 code implementations CVPR 2022 Peize Sun, Jinkun Cao, Yi Jiang, Zehuan Yuan, Song Bai, Kris Kitani, Ping Luo

A typical pipeline for multi-object tracking (MOT) is to use a detector for object localization, and following re-identification (re-ID) for object association.

Multi-Object Tracking Object +3

Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language

no code implementations NeurIPS 2021 Mingyu Ding, Zhenfang Chen, Tao Du, Ping Luo, Joshua B. Tenenbaum, Chuang Gan

This is achieved by seamlessly integrating three components: a visual perception module, a concept learner, and a differentiable physics engine.

counterfactual Visual Reasoning

Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning

1 code implementation11 Oct 2021 Chongjian Ge, Youwei Liang, Yibing Song, Jianbo Jiao, Jue Wang, Ping Luo

Motivated by the transformers that explore visual attention effectively in recognition scenarios, we propose a CNN Attention REvitalization (CARE) framework to train attentive CNN encoders guided by transformers in SSL.

Image Classification object-detection +3

Objects in Semantic Topology

no code implementations ICLR 2022 Shuo Yang, Peize Sun, Yi Jiang, Xiaobo Xia, Ruiheng Zhang, Zehuan Yuan, Changhu Wang, Ping Luo, Min Xu

A more realistic object detection paradigm, Open-World Object Detection, has attracted increasing research interest in the community recently.

Incremental Learning Language Modelling +3

Scale-Invariant Teaching for Semi-Supervised Object Detection

no code implementations29 Sep 2021 Qiushan Guo, Yizhou Yu, Ping Luo

Furthermore, the limited annotations in semi-supervised learning scale up the challenges: large variance of object sizes and class imbalance (i.e., the extreme ratio between background and object), hindering the performance of prior arts.

Object object-detection +1

Towards High-Quality Temporal Action Detection with Sparse Proposals

1 code implementation18 Sep 2021 Jiannan Wu, Peize Sun, Shoufa Chen, Jiewen Yang, Zihao Qi, Lan Ma, Ping Luo

Towards high-quality temporal action detection, we introduce Sparse Proposals to interact with the hierarchical features.

Action Detection Avg +2

Adversarial Robustness for Unsupervised Domain Adaptation

no code implementations ICCV 2021 Muhammad Awais, Fengwei Zhou, Hang Xu, Lanqing Hong, Ping Luo, Sung-Ho Bae, Zhenguo Li

Extensive Unsupervised Domain Adaptation (UDA) studies have shown great success in practice by learning transferable representations across a labeled source domain and an unlabeled target domain with deep models.

Adversarial Robustness Unsupervised Domain Adaptation

CycleMLP: A MLP-like Architecture for Dense Prediction

8 code implementations ICLR 2022 Shoufa Chen, Enze Xie, Chongjian Ge, Runjian Chen, Ding Liang, Ping Luo

We build a family of models which surpass existing MLPs and even state-of-the-art Transformer-based models, e.g., Swin Transformer, while using fewer parameters and FLOPs.

Image Classification Instance Segmentation +4

Multi-frame Collaboration for Effective Endoscopic Video Polyp Detection via Spatial-Temporal Feature Transformation

1 code implementation8 Jul 2021 Lingyun Wu, Zhiqiang Hu, Yuanfeng Ji, Ping Luo, Shaoting Zhang

For example, STFT improves the still image baseline FCOS by 10.6% and 20.6% on the comprehensive F1-score of the polyp localization task in the CVC-Clinic and ASUMayo datasets, respectively, and outperforms the state-of-the-art video-based method by 3.6% and 8.0%, respectively.

Multi-Compound Transformer for Accurate Biomedical Image Segmentation

1 code implementation28 Jun 2021 Yuanfeng Ji, Ruimao Zhang, Huijie Wang, Zhen Li, Lingyun Wu, Shaoting Zhang, Ping Luo

The recent vision transformer (i.e., for image classification) learns non-local attentive interaction of different patch tokens.

Image Classification Image Segmentation +2

HR-NAS: Searching Efficient High-Resolution Neural Architectures with Lightweight Transformers

1 code implementation CVPR 2021 Mingyu Ding, Xiaochen Lian, Linjie Yang, Peng Wang, Xiaojie Jin, Zhiwu Lu, Ping Luo

Last, we propose an efficient fine-grained search strategy to train HR-NAS, which effectively explores the search space and finds optimal architectures for various tasks and computation resources.

Image Classification Neural Architecture Search +3

SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers

23 code implementations NeurIPS 2021 Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M. Alvarez, Ping Luo

We present SegFormer, a simple, efficient yet powerful semantic segmentation framework which unifies Transformers with lightweight multilayer perceptron (MLP) decoders.
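A simplified sketch of such an all-MLP decoder (illustrative channel widths, not the reference implementation) projects each pyramid stage to a common width, upsamples, concatenates, and fuses before per-pixel classification:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Simplified all-MLP decoder sketch (assumed widths): 1x1 convolutions play the
    # role of per-location linear layers over multi-scale Transformer features.
    class MLPDecoder(nn.Module):
        def __init__(self, in_dims=(32, 64, 160, 256), embed=256, num_classes=19):
            super().__init__()
            self.proj = nn.ModuleList([nn.Conv2d(d, embed, 1) for d in in_dims])
            self.fuse = nn.Conv2d(embed * len(in_dims), embed, 1)
            self.cls = nn.Conv2d(embed, num_classes, 1)

        def forward(self, feats):                 # fine-to-coarse pyramid features
            target = feats[0].shape[2:]           # highest-resolution stage
            ups = [F.interpolate(p(f), size=target, mode='bilinear', align_corners=False)
                   for p, f in zip(self.proj, feats)]
            return self.cls(self.fuse(torch.cat(ups, dim=1)))

    feats = [torch.randn(1, c, 128 // s, 128 // s)
             for c, s in zip((32, 64, 160, 256), (4, 8, 16, 32))]
    print(MLPDecoder()(feats).shape)              # torch.Size([1, 19, 32, 32])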

C++ code Semantic Segmentation +1

Extracting Variable-Depth Logical Document Hierarchy from Long Documents: Method, Evaluation, and Application

no code implementations14 May 2021 Rongyu Cao, Yixuan Cao, Ganbin Zhou, Ping Luo

In this paper, we study the problem of extracting variable-depth "logical document hierarchy" from long documents, namely organizing the recognized "physical document objects" into hierarchical structures.
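A generic illustration of the target output (a plain stack-based builder, not the paper's method) turns a flat sequence of recognized headings with estimated depths into a nested hierarchy:

    # Illustrative only: assemble a nested hierarchy from (title, depth) pairs.
    def build_hierarchy(headings):
        root = {"title": "ROOT", "depth": 0, "children": []}
        stack = [root]
        for title, depth in headings:
            node = {"title": title, "depth": depth, "children": []}
            while stack[-1]["depth"] >= depth:    # pop until we reach the parent
                stack.pop()
            stack[-1]["children"].append(node)
            stack.append(node)
        return root

    doc = [("1 Introduction", 1), ("1.1 Background", 2),
           ("1.2 Scope", 2), ("2 Method", 1)]
    print(build_hierarchy(doc))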

Binary Classification Passage Retrieval +1

BWCP: Probabilistic Learning-to-Prune Channels for ConvNets via Batch Whitening

no code implementations13 May 2021 Wenqi Shao, Hang Yu, Zhaoyang Zhang, Hang Xu, Zhenguo Li, Ping Luo

To address this problem, we develop a probability-based pruning algorithm, called batch whitening channel pruning (BWCP), which can stochastically discard unimportant channels by modeling the probability of a channel being activated.
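One simple way to score such an activation probability, under a Gaussian assumption on BatchNorm outputs (an illustrative approximation, not necessarily the exact BWCP formulation), is:

    import torch
    from torch.distributions import Normal

    # Assumed approximation: after BN, a channel's pre-activation is roughly
    # N(beta, gamma^2), so the probability of passing ReLU is Phi(beta / |gamma|).
    # Channels with a low probability are candidates for (stochastic) pruning.
    def activation_probability(bn: torch.nn.BatchNorm2d, eps=1e-8):
        gamma, beta = bn.weight.detach(), bn.bias.detach()
        return Normal(0.0, 1.0).cdf(beta / (gamma.abs() + eps))

    bn = torch.nn.BatchNorm2d(8)
    bn.weight.data.uniform_(0.01, 1.0)
    bn.bias.data.normal_(0.0, 0.5)
    probs = activation_probability(bn)
    print(probs, probs > 0.1)                     # scores and a thresholded keep mask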

When Human Pose Estimation Meets Robustness: Adversarial Algorithms and Benchmarks

1 code implementation CVPR 2021 Jiahang Wang, Sheng Jin, Wentao Liu, Weizhong Liu, Chen Qian, Ping Luo

However, unlike human vision that is robust to various data corruptions such as blur and pixelation, current pose estimators are easily confused by these corruptions.

Knowledge Distillation Pose Estimation

PolarMask++: Enhanced Polar Representation for Single-Shot Instance Segmentation and Beyond

1 code implementation5 May 2021 Enze Xie, Wenhai Wang, Mingyu Ding, Ruimao Zhang, Ping Luo

Extensive experiments demonstrate the effectiveness of both PolarMask and PolarMask++, which achieve competitive results on instance segmentation on the challenging COCO dataset with single-model and single-scale training and testing, as well as new state-of-the-art results on rotated text detection and cell segmentation.
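At its core, the polar representation encodes an instance mask as a centre plus n ray lengths at uniform angles; decoding back to a contour (with made-up example values below) takes a few lines:

    import numpy as np

    # Decode a polar mask: centre plus per-angle ray lengths -> (x, y) contour points.
    # The ray count and centre below are made-up example values.
    def polar_to_contour(center, distances):
        angles = np.linspace(0.0, 2.0 * np.pi, num=len(distances), endpoint=False)
        xs = center[0] + distances * np.cos(angles)
        ys = center[1] + distances * np.sin(angles)
        return np.stack([xs, ys], axis=1)         # (n, 2) polygon vertices

    contour = polar_to_contour(center=(64.0, 48.0),
                               distances=np.full(36, 20.0))   # a 36-ray circle
    print(contour.shape)                          # (36, 2)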

Ranked #81 on Instance Segmentation on COCO test-dev (using extra training data)

Cell Segmentation Instance Segmentation +5

Going Deeper Into Face Detection: A Survey

no code implementations27 Mar 2021 Shervin Minaee, Ping Luo, Zhe Lin, Kevin Bowyer

In this work, we provide a detailed overview of some of the most representative deep learning based face detection methods by grouping them into a few major categories, and present their core architectural designs and accuracies on popular benchmarks.

Face Detection Image Classification

Learning Versatile Neural Architectures by Propagating Network Codes

1 code implementation ICLR 2022 Mingyu Ding, Yuqi Huo, Haoyu Lu, Linjie Yang, Zhe Wang, Zhiwu Lu, Jingdong Wang, Ping Luo

(4) Thorough studies of NCP on inter-, cross-, and intra-task settings highlight the importance of cross-task neural architecture design, i.e., multitask neural architectures and architecture transfer between different tasks.

Image Segmentation Neural Architecture Search +2

Towards Ultra-Resolution Neural Style Transfer via Thumbnail Instance Normalization

1 code implementation22 Mar 2021 Zhe Chen, Wenhai Wang, Enze Xie, Tong Lu, Ping Luo

(1) We divide the input image into small patches and adopt TIN, successfully transferring image style at arbitrarily high resolution.
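A minimal sketch of that normalization idea (an assumed formulation: every patch shares the per-channel statistics of one thumbnail of the whole image) looks like:

    import torch
    import torch.nn.functional as F

    # Assumed formulation for illustration: normalize each high-resolution patch with
    # the per-channel mean/std of a small thumbnail, so all patches share one set of
    # statistics and stitch together without visible seams.
    def thumbnail_instance_norm(patch, thumbnail, eps=1e-5):
        mean = thumbnail.mean(dim=(2, 3), keepdim=True)       # (B, C, 1, 1)
        std = thumbnail.std(dim=(2, 3), keepdim=True)
        return (patch - mean) / (std + eps)

    image = torch.randn(1, 3, 2048, 2048)                      # ultra-resolution input
    thumbnail = F.interpolate(image, size=(256, 256), mode='bilinear',
                              align_corners=False)
    patch = image[:, :, :512, :512]                            # one of many crops
    print(thumbnail_instance_norm(patch, thumbnail).shape)     # (1, 3, 512, 512)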

Style Transfer

Disentangled Cycle Consistency for Highly-realistic Virtual Try-On

1 code implementation CVPR 2021 Chongjian Ge, Yibing Song, Yuying Ge, Han Yang, Wei Liu, Ping Luo

To this end, DCTON can be naturally trained in a self-supervised manner following cycle consistency learning.

Virtual Try-on

Parser-Free Virtual Try-on via Distilling Appearance Flows

2 code implementations CVPR 2021 Yuying Ge, Yibing Song, Ruimao Zhang, Chongjian Ge, Wei Liu, Ping Luo

A recent pioneering work employed knowledge distillation to reduce the dependency on human parsing, where the try-on images produced by a parser-based method are used as supervision to train a "student" network without relying on segmentation, making the student mimic the try-on ability of the parser-based model.

Human Parsing Knowledge Distillation +1

Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

9 code implementations ICCV 2021 Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao

Unlike the recently proposed Transformer model (e.g., ViT) that is specially designed for image classification, we propose the Pyramid Vision Transformer (PVT), which overcomes the difficulties of porting Transformers to various dense prediction tasks.
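The ingredient that keeps such a pyramid affordable on dense-prediction resolutions is spatial-reduction attention; a condensed sketch with illustrative sizes (not the published configuration) downsamples keys and values before standard multi-head attention:

    import torch
    import torch.nn as nn

    # Condensed sketch (illustrative sizes): keys/values are spatially reduced by a
    # strided convolution, shrinking the attention cost on large token grids.
    class SpatialReductionAttention(nn.Module):
        def __init__(self, dim=64, heads=2, sr_ratio=4):
            super().__init__()
            self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, x, h, w):               # x: (B, H*W, C) token sequence
            b, n, c = x.shape
            kv = self.sr(x.transpose(1, 2).reshape(b, c, h, w))
            kv = kv.flatten(2).transpose(1, 2)    # (B, H*W / sr_ratio^2, C)
            out, _ = self.attn(x, kv, kv)
            return out

    tokens = torch.randn(1, 56 * 56, 64)
    print(SpatialReductionAttention()(tokens, 56, 56).shape)   # (1, 3136, 64)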

Image Classification Instance Segmentation +3

FAT: Learning Low-Bitwidth Parametric Representation via Frequency-Aware Transformation

1 code implementation15 Feb 2021 Chaofan Tao, Rui Lin, Quan Chen, Zhaoyang Zhang, Ping Luo, Ngai Wong

Prior arts often discretize the network weights by carefully tuning hyper-parameters of quantization (e.g., non-uniform stepsize and layer-wise bitwidths), which are complicated and sub-optimal because the full-precision and low-precision models have a large discrepancy.

Neural Network Compression Quantization

DetCo: Unsupervised Contrastive Learning for Object Detection

2 code implementations ICCV 2021 Enze Xie, Jian Ding, Wenhai Wang, Xiaohang Zhan, Hang Xu, Peize Sun, Zhenguo Li, Ping Luo

Unlike most recent methods that focused on improving the accuracy of image classification, we present a novel contrastive learning approach, named DetCo, which fully explores the contrasts between the global image and local image patches to learn discriminative representations for object detection.
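A stripped-down global-to-local contrastive term (not the full multi-level DetCo objective) can be written as an InfoNCE loss between whole-image embeddings and pooled patch embeddings:

    import torch
    import torch.nn.functional as F

    # Stripped-down sketch: each image's global embedding should match the pooled
    # embedding of its own local patches and mismatch those of other images.
    def global_local_infonce(global_emb, patch_emb, temperature=0.2):
        g = F.normalize(global_emb, dim=1)        # (B, D)
        p = F.normalize(patch_emb, dim=1)         # (B, D), e.g. pooled patch features
        logits = g @ p.t() / temperature          # (B, B) similarity matrix
        targets = torch.arange(g.size(0))         # positives on the diagonal
        return F.cross_entropy(logits, targets)

    loss = global_local_infonce(torch.randn(16, 128), torch.randn(16, 128))
    print(float(loss))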

Contrastive Learning Image Classification +2

Segmenting Transparent Object in the Wild with Transformer

2 code implementations21 Jan 2021 Enze Xie, Wenjia Wang, Wenhai Wang, Peize Sun, Hang Xu, Ding Liang, Ping Luo

This work presents a new fine-grained transparent object segmentation dataset, termed Trans10K-v2, extending Trans10K-v1, the first large-scale transparent object segmentation dataset.

Object Segmentation +2

Rethinking the Pruning Criteria for Convolutional Neural Network

no code implementations NeurIPS 2021 Zhongzhan Huang, Xinjiang Wang, Ping Luo

Channel pruning is a popular technique for compressing convolutional neural networks (CNNs), and various pruning criteria have been proposed to remove the redundant filters of CNNs.
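For reference, one widely used criterion that this line of work revisits ranks a layer's filters by the L1 norm of their weights and removes the smallest fraction (shown as background, not as the paper's proposal):

    import torch

    # Background example of a pruning criterion: score each filter of a Conv2d layer
    # by the L1 norm of its weights and return the indices of the smallest ones.
    def smallest_norm_filters(conv: torch.nn.Conv2d, prune_ratio=0.25):
        scores = conv.weight.detach().abs().sum(dim=(1, 2, 3))  # one score per filter
        k = int(prune_ratio * scores.numel())
        return torch.topk(scores, k, largest=False).indices     # filters to prune

    conv = torch.nn.Conv2d(64, 128, kernel_size=3)
    print(smallest_norm_filters(conv).shape)                     # 32 filter indices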

Self-Supervised Video Representation Learning with Constrained Spatiotemporal Jigsaw

no code implementations1 Jan 2021 Yuqi Huo, Mingyu Ding, Haoyu Lu, Zhiwu Lu, Tao Xiang, Ji-Rong Wen, Ziyuan Huang, Jianwen Jiang, Shiwei Zhang, Mingqian Tang, Songfang Huang, Ping Luo

With the constrained jigsaw puzzles, instead of solving them directly, which could still be extremely hard, we carefully design four surrogate tasks that are more solvable while still ensuring that the learned representation is sensitive to spatiotemporal continuity at both the local and global levels.

Representation Learning

Bringing Events Into Video Deblurring With Non-Consecutively Blurry Frames

1 code implementation ICCV 2021 Wei Shang, Dongwei Ren, Dongqing Zou, Jimmy S. Ren, Ping Luo, WangMeng Zuo

EFM can also be easily incorporated into existing deblurring networks, making the event-driven deblurring task benefit from state-of-the-art deblurring methods.

Deblurring

TransTrack: Multiple Object Tracking with Transformer

2 code implementations31 Dec 2020 Peize Sun, Jinkun Cao, Yi Jiang, Rufeng Zhang, Enze Xie, Zehuan Yuan, Changhu Wang, Ping Luo

In this work, we propose TransTrack, a simple but efficient scheme to solve the multiple object tracking problem.

Ranked #7 on Multi-Object Tracking on SportsMOT (using extra training data)

Multi-Object Tracking Multiple Object Tracking with Transformer +3

What Makes for End-to-End Object Detection?

1 code implementation10 Dec 2020 Peize Sun, Yi Jiang, Enze Xie, Wenqi Shao, Zehuan Yuan, Changhu Wang, Ping Luo

We identify that the classification cost in the matching cost is the main ingredient: (1) previous detectors only consider the location cost, and (2) by additionally introducing the classification cost, previous detectors immediately produce one-to-one predictions during inference.
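A minimal matching step reflecting this observation (illustrative cost weights and box format, using SciPy's Hungarian solver) assigns each ground-truth object to exactly one prediction by minimizing a combined classification-plus-location cost:

    import torch
    from scipy.optimize import linear_sum_assignment

    # Illustrative bipartite matching: the cost mixes classification probability and
    # L1 box distance (weights and box format are example choices).
    def match(pred_logits, pred_boxes, gt_labels, gt_boxes, w_cls=1.0, w_box=5.0):
        prob = pred_logits.softmax(-1)                     # (num_preds, num_classes)
        cls_cost = -prob[:, gt_labels]                     # (num_preds, num_gt)
        box_cost = torch.cdist(pred_boxes, gt_boxes, p=1)  # (num_preds, num_gt)
        cost = (w_cls * cls_cost + w_box * box_cost).numpy()
        return linear_sum_assignment(cost)                 # one-to-one assignment

    pred_idx, gt_idx = match(torch.randn(100, 80), torch.rand(100, 4),
                             torch.tensor([3, 17]), torch.rand(2, 4))
    print(pred_idx, gt_idx)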

General Classification Object +2

Polygon-free: Unconstrained Scene Text Detection with Box Annotations

1 code implementation26 Nov 2020 Weijia Wu, Enze Xie, Ruimao Zhang, Wenhai Wang, Hong Zhou, Ping Luo

For example, without using polygon annotations, PSENet achieves an 80.5% F-score on TotalText [3] (vs. 80.9% for the fully supervised counterpart), 31.1% better than training directly with upright bounding box annotations, and saves over 80% of the labeling costs.

Scene Text Detection Text Detection

Sparse R-CNN: End-to-End Object Detection with Learnable Proposals

6 code implementations CVPR 2021 Peize Sun, Rufeng Zhang, Yi Jiang, Tao Kong, Chenfeng Xu, Wei Zhan, Masayoshi Tomizuka, Lei LI, Zehuan Yuan, Changhu Wang, Ping Luo

In our method, however, a fixed sparse set of learned object proposals, with a total length of $N$, is provided to the object recognition head to perform classification and localization.
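Such a learned proposal set can be expressed as two small embedding tables (illustrative sizes, not the authors' configuration): N proposal boxes and N proposal features, trained end-to-end and handed to the recognition head:

    import torch
    import torch.nn as nn

    # Illustrative sketch: N learnable proposal boxes in normalized (cx, cy, w, h)
    # form plus N learnable proposal features, replacing dense anchor enumeration.
    class LearnedProposals(nn.Module):
        def __init__(self, num_proposals=100, feat_dim=256):
            super().__init__()
            self.boxes = nn.Embedding(num_proposals, 4)
            self.feats = nn.Embedding(num_proposals, feat_dim)
            nn.init.constant_(self.boxes.weight, 0.0)   # sigmoid(0) = 0.5: centred boxes

        def forward(self, batch_size):
            boxes = self.boxes.weight.sigmoid().unsqueeze(0).expand(batch_size, -1, -1)
            feats = self.feats.weight.unsqueeze(0).expand(batch_size, -1, -1)
            return boxes, feats

    boxes, feats = LearnedProposals()(batch_size=2)
    print(boxes.shape, feats.shape)                      # (2, 100, 4) (2, 100, 256)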

Object object-detection +2

Do 2D GANs Know 3D Shape? Unsupervised 3D shape reconstruction from 2D Image GANs

1 code implementation ICLR 2021 Xingang Pan, Bo Dai, Ziwei Liu, Chen Change Loy, Ping Luo

Through our investigation, we found that such a pre-trained GAN indeed contains rich 3D knowledge and thus can be used to recover 3D shape from a single 2D image in an unsupervised manner.

3D Shape Reconstruction Object

UXNet: Searching Multi-level Feature Aggregation for 3D Medical Image Segmentation

no code implementations16 Sep 2020 Yuanfeng Ji, Ruimao Zhang, Zhen Li, Jiamin Ren, Shaoting Zhang, Ping Luo

Unlike recent neural architecture search (NAS) methods that typically search for the optimal operators in each network layer but lack a good strategy for searching feature aggregations, this paper proposes a novel NAS method for 3D medical image segmentation, named UXNet, which searches both the scale-wise feature aggregation strategies and the block-wise operators in the encoder-decoder network.

Image Segmentation Neural Architecture Search +3

RelativeNAS: Relative Neural Architecture Search via Slow-Fast Learning

2 code implementations14 Sep 2020 Hao Tan, Ran Cheng, Shihua Huang, Cheng He, Changxiao Qiu, Fan Yang, Ping Luo

Despite the remarkable successes of Convolutional Neural Networks (CNNs) in computer vision, it is time-consuming and error-prone to manually design a CNN.

Keypoint Detection Neural Architecture Search +3

Compensation Tracker: Reprocessing Lost Object for Multi-Object Tracking

no code implementations27 Aug 2020 Zhibo Zou, Jun-Jie Huang, Ping Luo

Based on simple and traditional methods, we propose a compensation tracker to further alleviate the lost-tracking problem caused by missed detections.

Motion Compensation Multi-Object Tracking +1

Dynamic and Static Context-aware LSTM for Multi-agent Motion Prediction

no code implementations ECCV 2020 Chaofan Tao, Qinhong Jiang, Lixin Duan, Ping Luo

Existing work addressed this challenge either by learning social spatial interactions represented by the positions of a group of pedestrians, while ignoring their temporal coherence (i.e., dependencies between different long trajectories), or by understanding the complicated scene layout (e.g., scene segmentation) to ensure safe navigation.

motion prediction Trajectory Prediction

AE TextSpotter: Learning Visual and Linguistic Representation for Ambiguous Text Spotting

2 code implementations ECCV 2020 Wenhai Wang, Xuebo Liu, Xiaozhong Ji, Enze Xie, Ding Liang, Zhibo Yang, Tong Lu, Chunhua Shen, Ping Luo

Unlike previous works that merely employed visual features for text detection, this work proposes a novel text spotter, named Ambiguity Eliminating Text Spotter (AE TextSpotter), which learns both visual and linguistic features to significantly reduce ambiguity in text detection.

Language Modelling Sentence +2

Differentiable Hierarchical Graph Grouping for Multi-Person Pose Estimation

no code implementations ECCV 2020 Sheng Jin, Wentao Liu, Enze Xie, Wenhai Wang, Chen Qian, Wanli Ouyang, Ping Luo

The modules of HGG can be trained end-to-end with the keypoint detection network and are able to supervise the grouping process in a hierarchical manner.

2D Human Pose Estimation Clustering +4

Whole-Body Human Pose Estimation in the Wild

2 code implementations ECCV 2020 Sheng Jin, Lumin Xu, Jin Xu, Can Wang, Wentao Liu, Chen Qian, Wanli Ouyang, Ping Luo

This paper investigates the task of 2D human whole-body pose estimation, which aims to localize dense landmarks on the entire human body including face, hands, body, and feet.

2D Human Pose Estimation Facial Landmark Detection +2

3D Human Mesh Regression with Dense Correspondence

3 code implementations CVPR 2020 Wang Zeng, Wanli Ouyang, Ping Luo, Wentao Liu, Xiaogang Wang

This paper proposes a model-free 3D human mesh estimation framework, named DecoMR, which explicitly establishes the dense correspondence between the mesh and the local image features in the UV space (i.e., a 2D space used for texture mapping of a 3D mesh).

3D Human Pose Estimation 3D Human Reconstruction +1

Learning a Reinforced Agent for Flexible Exposure Bracketing Selection

1 code implementation CVPR 2020 Zhouxia Wang, Jiawei Zhang, Mude Lin, Jiong Wang, Ping Luo, Jimmy Ren

Automatically selecting exposure bracketing (images exposed differently) is important to obtain a high dynamic range image by using multi-exposure fusion.

Convolution-Weight-Distribution Assumption: Rethinking the Criteria of Channel Pruning

no code implementations24 Apr 2020 Zhongzhan Huang, Wenqi Shao, Xinjiang Wang, Liang Lin, Ping Luo

Channel pruning is a popular technique for compressing convolutional neural networks (CNNs), where various pruning criteria have been proposed to remove the redundant filters.

AdaX: Adaptive Gradient Descent with Exponential Long Term Memory

1 code implementation21 Apr 2020 Wenjie Li, Zhaoyang Zhang, Xinjiang Wang, Ping Luo

Although adaptive optimization algorithms such as Adam show fast convergence in many machine learning tasks, this paper identifies a problem with Adam by analyzing its performance on a simple non-convex synthetic problem, showing that Adam's fast convergence may lead the algorithm to local minima.
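For reference, the baseline update under analysis (plain Adam, not the proposed AdaX rule) keeps its second moment as an exponential moving average, so older gradient information decays quickly:

    import torch

    # Plain Adam step, written out to show the exponentially decaying second moment
    # (this is the baseline being analysed, not the AdaX update).
    def adam_step(p, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        state['t'] += 1
        state['m'] = beta1 * state['m'] + (1 - beta1) * grad
        state['v'] = beta2 * state['v'] + (1 - beta2) * grad ** 2
        m_hat = state['m'] / (1 - beta1 ** state['t'])
        v_hat = state['v'] / (1 - beta2 ** state['t'])
        return p - lr * m_hat / (v_hat.sqrt() + eps)

    p = torch.tensor([1.0])
    state = {'t': 0, 'm': torch.zeros_like(p), 'v': torch.zeros_like(p)}
    p = adam_step(p, grad=2.0 * p, state=state)          # one step on f(p) = p^2
    print(p)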
