Search Results for author: Ping Luo

Found 342 papers, 202 papers with code

DanceGRPO: Unleashing GRPO on Visual Generation

no code implementations • 12 May 2025 • Zeyue Xue, Jie Wu, Yu Gao, Fangyuan Kong, Lingting Zhu, Mengzhao Chen, Zhiheng Liu, Wei Liu, Qiushan Guo, Weilin Huang, Ping Luo

This paper introduces DanceGRPO, the first unified framework to adapt Group Relative Policy Optimization (GRPO) to visual generation paradigms, unleashing one unified RL algorithm across two generative paradigms (diffusion models and rectified flows), three tasks (text-to-image, text-to-video, image-to-video), four foundation models (Stable Diffusion, HunyuanVideo, FLUX, SkyReel-I2V), and five reward models (image/video aesthetics, text-image alignment, video motion quality, and binary reward).

Denoising Reinforcement Learning +3
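As a generic illustration (not the paper's implementation), the group-relative advantage at the core of GRPO can be sketched as follows: for each prompt, sample a group of outputs, score them with a reward model, and normalize each reward against its own group's statistics, so no learned value function is needed.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each sample's reward by the
    mean and standard deviation of its own group (one group per prompt).

    rewards: shape (num_prompts, group_size) -- scores from a reward
    model for a group of generations sampled for each prompt.
    """
    r = np.asarray(rewards, dtype=float)
    mean = r.mean(axis=1, keepdims=True)
    std = r.std(axis=1, keepdims=True)
    return (r - mean) / (std + eps)

# Each row's advantages average to ~0: above-average samples are
# reinforced, below-average ones penalized.
adv = group_relative_advantages([[0.2, 0.8, 0.5],
                                 [0.9, 0.1, 0.5]])
```

The same normalization applies whether the sampler is a diffusion model or a rectified flow; only the sampling and log-probability machinery around it changes.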

UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

1 code implementation • 9 May 2025 • Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, Hongyang Li

Learned from internet-scale videos, the generalist policy can be deployed to various robots through efficient latent action decoding.

Vision-Language-Action

RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins

1 code implementation • 17 Apr 2025 • Yao Mu, Tianxing Chen, Zanxin Chen, Shijia Peng, Zhiqian Lan, Zeyu Gao, Zhixuan Liang, Qiaojun Yu, Yude Zou, Mingkun Xu, Lunkai Lin, Zhiqiang Xie, Mingyu Ding, Ping Luo

In the rapidly advancing field of robotics, dual-arm coordination and complex object manipulation are essential capabilities for developing advanced autonomous systems.

Code Generation

PixelFlow: Pixel-Space Generative Models with Flow

1 code implementation • 10 Apr 2025 • Shoufa Chen, Chongjian Ge, Shilong Zhang, Peize Sun, Ping Luo

We present PixelFlow, a family of image generation models that operate directly in the raw pixel space, in contrast to the predominant latent-space models.

Conditional Image Generation

Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models

no code implementations • 19 Mar 2025 • Jin Wang, Chenghui Lv, Xian Li, Shichao Dong, Huadong Li, Kelu Yao, Chao Li, Wenqi Shao, Ping Luo

Recently, the rapid development of AIGC has significantly boosted the diversity of fake media spread on the Internet, posing unprecedented threats to social security, politics, and law.

Centaur: Robust End-to-End Autonomous Driving with Test-Time Training

no code implementations • 14 Mar 2025 • Chonghao Sima, Kashyap Chitta, Zhiding Yu, Shiyi Lan, Ping Luo, Andreas Geiger, Hongyang Li, Jose M. Alvarez

In this work, we propose Centaur (Cluster Entropy for Test-time trAining using Uncertainty) which updates a planner's behavior via test-time training, without relying on hand-engineered rules or cost functions.

Autonomous Driving

Text2World: Benchmarking Large Language Models for Symbolic World Model Generation

no code implementations • 18 Feb 2025 • Mengkang Hu, Tianxing Chen, Yude Zou, YuHeng Lei, Qiguang Chen, Ming Li, Hongyuan Zhang, Wenqi Shao, Ping Luo

Recently, there has been growing interest in leveraging large language models (LLMs) to generate symbolic world models from textual descriptions.

Benchmarking

SAMRefiner: Taming Segment Anything Model for Universal Mask Refinement

1 code implementation • 10 Feb 2025 • Yuqi Lin, Hengjia Li, Wenqi Shao, Zheng Yang, Jun Zhao, Xiaofei He, Ping Luo, Kaipeng Zhang

In contrast to prior refinement techniques that are tailored to specific models or tasks in a closed-world manner, we propose SAMRefiner, a universal and efficient approach that adapts SAM to the mask refinement task.

Semantic Segmentation

Goku: Flow Based Video Generative Foundation Models

no code implementations • 7 Feb 2025 • Shoufa Chen, Chongjian Ge, Yuqi Zhang, Yida Zhang, Fengda Zhu, Hao Yang, Hongxiang Hao, Hui Wu, Zhichao Lai, Yifei Hu, Ting-Che Lin, Shilong Zhang, Fu Li, Chuan Li, Xing Wang, Yanghua Peng, Peize Sun, Ping Luo, Yi Jiang, Zehuan Yuan, Bingyue Peng, Xiaobing Liu

This paper introduces Goku, a state-of-the-art family of joint image-and-video generation models leveraging rectified flow Transformers to achieve industry-leading performance.

Text-to-Image Generation Video Generation

LiT: Delving into a Simplified Linear Diffusion Transformer for Image Generation

no code implementations • 22 Jan 2025 • Jiahao Wang, Ning Kang, Lewei Yao, Mengzhao Chen, Chengyue Wu, Songyang Zhang, Shuchen Xue, Yong liu, Taiqiang Wu, Xihui Liu, Kaipeng Zhang, Shifeng Zhang, Wenqi Shao, Zhenguo Li, Ping Luo

(3) Hybrid knowledge distillation objective: using a pre-trained diffusion Transformer to help the training of the student linear Transformer, supervising not only the predicted noise but also the variance of the reverse diffusion process.

Knowledge Distillation Mamba +1
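A minimal sketch of what such a hybrid distillation objective could look like (illustrative names and weighting, not LiT's actual loss): a frozen teacher's predicted noise and predicted reverse-process variance both supervise the student.

```python
import numpy as np

def hybrid_distillation_loss(student_noise, student_var,
                             teacher_noise, teacher_var, w_var=1.0):
    """Hypothetical hybrid objective: the student matches both the
    teacher's predicted noise and its predicted variance of the
    reverse diffusion process, via two MSE terms."""
    noise_loss = np.mean((student_noise - teacher_noise) ** 2)
    var_loss = np.mean((student_var - teacher_var) ** 2)
    return noise_loss + w_var * var_loss

# Toy arrays standing in for model outputs at one diffusion step.
rng = np.random.default_rng(0)
t_noise = rng.normal(size=(4, 8))
t_var = rng.uniform(size=(4, 8))
s_noise = t_noise + rng.normal(scale=0.1, size=(4, 8))
s_var = t_var + rng.normal(scale=0.1, size=(4, 8))
loss = hybrid_distillation_loss(s_noise, s_var, t_noise, t_var)
```

Supervising the variance as well as the noise gives the student a signal about the teacher's uncertainty at each step, not just its point prediction.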

MangaNinja: Line Art Colorization with Precise Reference Following

no code implementations • 14 Jan 2025 • Zhiheng Liu, Ka Leong Cheng, Xi Chen, Jie Xiao, Hao Ouyang, Kai Zhu, Yu Liu, Yujun Shen, Qifeng Chen, Ping Luo

Derived from diffusion models, MangaNinja specializes in the task of reference-guided line art colorization.

Line Art Colorization

Breaking Memory Limits: Gradient Wavelet Transform Enhances LLMs Training

1 code implementation • 13 Jan 2025 • Ziqing Wen, Ping Luo, Jiahuan Wang, Xiaoge Deng, Jinping Zou, Kun Yuan, Tao Sun, Dongsheng Li

Large language models (LLMs) have shown impressive performance across a range of natural language processing tasks.

NADER: Neural Architecture Design via Multi-Agent Collaboration

no code implementations • 26 Dec 2024 • Zekang Yang, Wang Zeng, Sheng Jin, Chen Qian, Ping Luo, Wentao Liu

In this paper, we introduce NADER (Neural Architecture Design via multi-agEnt collaboRation), a novel framework that formulates neural architecture design (NAD) as an LLM-based multi-agent collaboration problem.

Neural Architecture Search

DepthLab: From Partial to Complete

no code implementations • 24 Dec 2024 • Zhiheng Liu, Ka Leong Cheng, Qiuyu Wang, Shuzhe Wang, Hao Ouyang, Bin Tan, Kai Zhu, Yujun Shen, Qifeng Chen, Ping Luo

Missing values remain a common challenge for depth data across its wide range of applications, stemming from various causes like incomplete data acquisition and perspective alteration.

Depth Completion Missing Values +2

Prompt-A-Video: Prompt Your Video Diffusion Model via Preference-Aligned LLM

1 code implementation • 19 Dec 2024 • Yatai Ji, Jiacheng Zhang, Jie Wu, Shilong Zhang, Shoufa Chen, Chongjian Ge, Peize Sun, Weifeng Chen, Wenqi Shao, Xuefeng Xiao, Weilin Huang, Ping Luo

Text-to-video models have made remarkable advancements through optimization on high-quality text-video pairs, where the textual prompts play a pivotal role in determining the quality of output videos.

Video Generation

Attention with Dependency Parsing Augmentation for Fine-Grained Attribution

no code implementations • 16 Dec 2024 • Qiang Ding, Lvzhou Luo, Yixuan Cao, Ping Luo

To assist humans in efficiently validating RAG-generated content, developing a fine-grained attribution mechanism that provides supporting evidence from retrieved documents for every answer span is essential.

Decoder Dependency Parsing +1

SpecFuse: Ensembling Large Language Models via Next-Segment Prediction

no code implementations • 10 Dec 2024 • Bo Lv, Chen Tang, Yanan Zhang, Xin Liu, Yue Yu, Ping Luo

In this paper, we propose SpecFuse, a novel ensemble framework that outputs the fused result by iteratively producing the next segment through collaboration among LLMs.

Prediction

DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models

no code implementations • 5 Dec 2024 • Yizhuo Li, Yuying Ge, Yixiao Ge, Ping Luo, Ying Shan

We introduce DiCoDe, a novel approach that leverages Diffusion-Compressed Deep Tokens to generate videos with a language model in an autoregressive manner.

Temporal Sequences Video Generation

TREND: Unsupervised 3D Representation Learning via Temporal Forecasting for LiDAR Perception

no code implementations • 4 Dec 2024 • Runjian Chen, Hyoungseob Park, Bo Zhang, Wenqi Shao, Ping Luo, Alex Wong

Labeling LiDAR point clouds is notoriously time- and energy-consuming, which spurs recent unsupervised 3D representation learning methods to alleviate the labeling burden in LiDAR perception via pretrained weights.

3D Object Detection Contrastive Learning +2

CLAP: Unsupervised 3D Representation Learning for Fusion 3D Perception via Curvature Sampling and Prototype Learning

no code implementations • 4 Dec 2024 • Runjian Chen, Hang Zhang, Avinash Ravichandran, Wenqi Shao, Alex Wong, Ping Luo

In this paper, we explore joint unsupervised pre-training for fusion 3D perception via differentiable rendering and propose CLAP, short for Curvature sampLing and swApping Prototype assignment prediction.

Representation Learning Unsupervised Pre-training

G3Flow: Generative 3D Semantic Flow for Pose-aware and Generalizable Object Manipulation

1 code implementation • 27 Nov 2024 • Tianxing Chen, Yao Mu, Zhixuan Liang, Zanxin Chen, Shijia Peng, Qiangyu Chen, Mingkun Xu, Ruizhen Hu, Hongyuan Zhang, Xuelong Li, Ping Luo

Our results demonstrate the effectiveness of G3Flow in enhancing real-time dynamic semantic feature understanding for robotic manipulation policies.

Imitation Learning Object +1

DexHandDiff: Interaction-aware Diffusion Planning for Adaptive Dexterous Manipulation

no code implementations • 27 Nov 2024 • Zhixuan Liang, Yao Mu, Yixiao Wang, Tianxing Chen, Wenqi Shao, Wei Zhan, Masayoshi Tomizuka, Ping Luo, Mingyu Ding

Our framework achieves an average success rate of 70.7% on goal-adaptive dexterous tasks, highlighting its robustness and flexibility in contact-rich manipulation.

Contact-rich Manipulation

MoLE: Enhancing Human-centric Text-to-image Diffusion via Mixture of Low-rank Experts

no code implementations • 30 Oct 2024 • Jie Zhu, Yixiong Chen, Mingyu Ding, Ping Luo, Leye Wang, Jingdong Wang

These datasets collectively provide a rich prior knowledge base to enhance the human-centric image generation capabilities of the diffusion model.

Text-to-Image Generation

CompGS: Unleashing 2D Compositionality for Compositional Text-to-3D via Dynamically Optimizing 3D Gaussians

no code implementations • 28 Oct 2024 • Chongjian Ge, Chenfeng Xu, Yuanfeng Ji, Chensheng Peng, Masayoshi Tomizuka, Ping Luo, Mingyu Ding, Varun Jampani, Wei Zhan

To achieve this goal, two core designs are proposed: (1) 3D Gaussians Initialization with 2D compositionality: We transfer the well-established 2D compositionality to initialize the Gaussian parameters on an entity-by-entity basis, ensuring both consistent 3D priors for each entity and reasonable interactions among multiple entities; (2) Dynamic Optimization: We propose a dynamic strategy to optimize 3D Gaussians using Score Distillation Sampling (SDS) loss.

3D Generation Scene Generation +1

Analysis and Benchmarking of Extending Blind Face Image Restoration to Videos

no code implementations • 15 Oct 2024 • Zhouxia Wang, Jiawei Zhang, Xintao Wang, Tianshui Chen, Ying Shan, Wenping Wang, Ping Luo

In this work, we present a fair evaluation benchmark, in which we introduce a Real-world Low-Quality Face Video benchmark (RFV-LQ), evaluate several leading image-based face restoration algorithms, and conduct a thorough, systematic analysis of the benefits and challenges associated with extending blind face image restoration algorithms to degraded face videos.

Benchmarking Blind Face Restoration +1

Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping

1 code implementation • 11 Oct 2024 • Yue Yang, Shuibai Zhang, Wenqi Shao, Kaipeng Zhang, Yi Bin, Yu Wang, Ping Luo

Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across multimodal tasks such as visual perception and reasoning, leading to good performance on various multimodal evaluation benchmarks.

MME Question Answering +1

DCP: Learning Accelerator Dataflow for Neural Network via Propagation

no code implementations • 9 Oct 2024 • Peng Xu, Wenqi Shao, Mingyu Ding, Ping Luo

Deep neural network (DNN) hardware (HW) accelerators have achieved great success in improving DNNs' performance and efficiency.

Few-Shot Learning Scheduling

PrefixQuant: Eliminating Outliers by Prefixed Tokens for Large Language Models Quantization

1 code implementation • 7 Oct 2024 • Mengzhao Chen, Yi Liu, Jiahao Wang, Yi Bin, Wenqi Shao, Ping Luo

In this work, we propose PrefixQuant, a novel quantization method that achieves state-of-the-art performance across various precision levels (W4A4KV4 and W4A8KV4) and granularities (dynamic and static quantization) by effectively isolating token-wise outliers.

Common Sense Reasoning Quantization
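To see why token-wise outliers matter for quantization in general (a standard uniform-quantization sketch, not PrefixQuant's method): a single outlier stretches the quantization range, so every other value lands on a coarser grid.

```python
import numpy as np

def quantize_dequantize(x, bits=4):
    """Uniform asymmetric quantization followed by dequantization.
    The step size (scale) grows with the value range, so one outlier
    enlarges the rounding error of every other value."""
    qmax = 2 ** bits - 1
    scale = (x.max() - x.min()) / qmax
    zero = np.round(-x.min() / scale)
    q = np.clip(np.round(x / scale + zero), 0, qmax)
    return (q - zero) * scale, scale

x_plain = np.array([0.10, -0.20, 0.30, 0.05])
x_outlier = np.array([0.10, -0.20, 0.30, 8.00])  # one outlier value
_, scale_plain = quantize_dequantize(x_plain)
_, scale_outlier = quantize_dequantize(x_outlier)
# scale_outlier is roughly 16x scale_plain here: the outlier coarsens
# the grid shared by all the well-behaved values.
```

Isolating such outliers (as PrefixQuant does via prefixed tokens) lets the remaining activations use a much tighter, more accurate quantization grid.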

HRVMamba: High-Resolution Visual State Space Model for Dense Prediction

no code implementations • 4 Oct 2024 • Hao Zhang, Yongqiang Ma, Wenqi Shao, Ping Luo, Nanning Zheng, Kaipeng Zhang

Recently, State Space Models (SSMs) with efficient hardware-aware designs, i.e., Mamba, have demonstrated significant potential in computer vision tasks due to their linear computational complexity with respect to token length and their global receptive field.

Inductive Bias Mamba +3

Articulated Object Manipulation using Online Axis Estimation with SAM2-Based Tracking

no code implementations • 24 Sep 2024 • Xi Wang, Tianxing Chen, Qiaojun Yu, Tianling Xu, Zanxin Chen, Yiting Fu, Ziqi He, Cewu Lu, Yao Mu, Ping Luo

To address this limitation, we present a closed-loop pipeline integrating interactive perception with online axis estimation from segmented 3D point clouds.

Object

Prior Knowledge Distillation Network for Face Super-Resolution

no code implementations • 22 Sep 2024 • Qiu Yang, Xiao Sun, Xin-yu Li, Feng-Qi Cui, Yu-Tong Guo, Shuang-Zhen Hu, Ping Luo, Si-Ying Li

This approach enables the network to learn priors during the training stage while relying solely on low-resolution facial images during the testing stage, thus mitigating the adverse effects of prior estimation inaccuracies.

Knowledge Distillation Super-Resolution

RoboTwin: Dual-Arm Robot Benchmark with Generative Digital Twins (early version)

1 code implementation • 4 Sep 2024 • Yao Mu, Tianxing Chen, Shijia Peng, Zanxin Chen, Zeyu Gao, Yude Zou, Lunkai Lin, Zhiqiang Xie, Ping Luo

To address this, we introduce RoboTwin, a generative digital twin framework that uses 3D generative foundation models and large language models to produce diverse expert datasets and provide a real-world-aligned evaluation platform for dual-arm robotic tasks.

Code Generation

Federated Prediction-Powered Inference from Decentralized Data

no code implementations • 3 Sep 2024 • Ping Luo, Xiaoge Deng, Ziqing Wen, Tao Sun, Dongsheng Li

The Fed-PPI framework involves training local models on private data, aggregating them through Federated Learning (FL), and deriving confidence intervals using PPI computation.

Federated Learning Prediction +1

Task-Oriented Diffusion Inversion for High-Fidelity Text-based Editing

no code implementations • 23 Aug 2024 • Yangyang Xu, Wenqi Shao, Yong Du, Haiming Zhu, Yang Zhou, Ping Luo, Shengfeng He

Recent advancements in text-guided diffusion models have unlocked powerful image manipulation capabilities, yet balancing reconstruction fidelity and editability for real images remains a significant challenge.

Image Manipulation

HiAgent: Hierarchical Working Memory Management for Solving Long-Horizon Agent Tasks with Large Language Model

1 code implementation • 18 Aug 2024 • Mengkang Hu, Tianxing Chen, Qiguang Chen, Yao Mu, Wenqi Shao, Ping Luo

Specifically, HiAgent prompts LLMs to formulate subgoals before generating executable actions and enables LLMs to decide proactively to replace previous subgoals with summarized observations, retaining only the action-observation pairs relevant to the current subgoal.

Language Modeling Language Modelling +2

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

1 code implementation • 5 Aug 2024 • Fanqing Meng, Jin Wang, Chuanhao Li, Quanfeng Lu, Hao Tian, Jiaqi Liao, Xizhou Zhu, Jifeng Dai, Yu Qiao, Ping Luo, Kaipeng Zhang, Wenqi Shao

To fill this gap, we introduce the Multimodal Multi-image Understanding (MMIU) benchmark, a comprehensive evaluation suite designed to assess LVLMs across a wide range of multi-image tasks.

Image Comprehension Multiple-choice

AgentGen: Enhancing Planning Abilities for Large Language Model based Agent via Environment and Task Generation

1 code implementation • 1 Aug 2024 • Mengkang Hu, Pu Zhao, Can Xu, Qingfeng Sun, JianGuang Lou, QIngwei Lin, Ping Luo, Saravan Rajmohan

Moreover, to increase the difficulty diversity of generated planning tasks, we propose a bidirectional evolution method, Bi-Evol, that evolves planning tasks from easier and harder directions to synthesize a task set with a smoother difficulty curve.

Diversity Language Modeling +2

Low-Latency Privacy-Preserving Deep Learning Design via Secure MPC

no code implementations • 24 Jul 2024 • Ke Lin, Yasir Glani, Ping Luo

Secure multi-party computation (MPC) facilitates privacy-preserving computation between multiple parties without leaking private information.

Deep Learning Privacy Preserving +1

Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model

1 code implementation • 24 Jul 2024 • Lirui Zhao, Tianshuo Yang, Wenqi Shao, Yuxin Zhang, Yu Qiao, Ping Luo, Kaipeng Zhang, Rongrong Ji

To tackle this challenge, we introduce Diffree, a Text-to-Image (T2I) model that facilitates text-guided object addition with only text control.

Image Inpainting Object

Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies

1 code implementation • 18 Jul 2024 • Chaofan Tao, Qian Liu, Longxu Dou, Niklas Muennighoff, Zhongwei Wan, Ping Luo, Min Lin, Ngai Wong

We investigate how vocabulary size impacts LLM scaling laws by training models ranging from 33M to 3B parameters on up to 500B characters with various vocabulary configurations.

ARC

Segment, Lift and Fit: Automatic 3D Shape Labeling from 2D Prompts

no code implementations • 16 Jul 2024 • Jianhao Li, Tianyu Sun, Zhongdao Wang, Enze Xie, Bailan Feng, Hongbo Zhang, Ze Yuan, Ke Xu, Jiaheng Liu, Ping Luo

Unlike previous arts, our auto-labeler predicts 3D shapes instead of bounding boxes and does not require training on a specific dataset.

Autonomous Driving

TCFormer: Visual Recognition via Token Clustering Transformer

1 code implementation • 16 Jul 2024 • Wang Zeng, Sheng Jin, Lumin Xu, Wentao Liu, Chen Qian, Wanli Ouyang, Ping Luo, Xiaogang Wang

Our dynamic tokens possess two crucial characteristics: (1) representing image regions with similar semantic meanings using the same vision token, even if those regions are not adjacent, and (2) concentrating on regions with valuable details and representing them using fine tokens.

Clustering Image Classification +4

When Pedestrian Detection Meets Multi-Modal Learning: Generalist Model and Benchmark Dataset

1 code implementation • 14 Jul 2024 • Yi Zhang, Wang Zeng, Sheng Jin, Chen Qian, Ping Luo, Wentao Liu

With multi-modal joint training, our model achieves state-of-the-art performance on a wide range of pedestrian detection benchmarks, surpassing leading models tailored for specific sensor modalities.

3D Object Detection Multispectral Object Detection +1

EfficientQAT: Efficient Quantization-Aware Training for Large Language Models

1 code implementation • 10 Jul 2024 • Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, Ping Luo

To the best of our knowledge, Block-AP is the first method to enable direct training of all parameters in a block-wise manner, reducing accuracy loss in low-bit scenarios by enhancing the solution space during optimization.

Quantization

PhyBench: A Physical Commonsense Benchmark for Evaluating Text-to-Image Models

no code implementations • 17 Jun 2024 • Fanqing Meng, Wenqi Shao, Lixin Luo, Yahong Wang, Yiran Chen, Quanfeng Lu, Yue Yang, Tianshuo Yang, Kaipeng Zhang, Yu Qiao, Ping Luo

Our findings reveal that: (1) even advanced models frequently err in various physical scenarios, except for optics; (2) GPT-4o, with item-specific scoring instructions, effectively evaluates the models' understanding of physical commonsense, closely aligning with human assessments; and (3) current T2I models are primarily focused on text-to-image translation, lacking profound reasoning regarding physical commonsense.

Image Generation

DAG-Plan: Generating Directed Acyclic Dependency Graphs for Dual-Arm Cooperative Planning

no code implementations • 14 Jun 2024 • Zeyu Gao, Yao Mu, Jinye Qu, Mengkang Hu, Shijia Peng, Chengkai Hou, Lingyue Guo, Ping Luo, Shanghang Zhang, YanFeng Lu

Extensive experiments demonstrate the superiority of DAG-Plan over directly using an LLM to generate a linear task sequence, achieving 52.8% higher efficiency compared to single-arm task planning and a 48% higher success rate in dual-arm task planning.

Task Planning

Rethinking Human Evaluation Protocol for Text-to-Video Models: Enhancing Reliability, Reproducibility, and Practicality

1 code implementation • 13 Jun 2024 • Tianle Zhang, Langtian Ma, Yuchen Yan, Yuchen Zhang, Kai Wang, Yue Yang, Ziyao Guo, Wenqi Shao, Yang You, Yu Qiao, Ping Luo, Kaipeng Zhang

To address these challenges, this paper introduces the Text-to-Video Human Evaluation (T2VHE) protocol, a comprehensive and standardized protocol for T2V models.

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

1 code implementation • 12 Jun 2024 • Jiannan Wu, Muyan Zhong, Sen Xing, Zeqiang Lai, Zhaoyang Liu, Zhe Chen, Wenhai Wang, Xizhou Zhu, Lewei Lu, Tong Lu, Ping Luo, Yu Qiao, Jifeng Dai

It not only allows flexible transmission of task information and gradient feedback between the MLLM and multiple downstream decoders but also effectively resolves training conflicts in multi-tasking scenarios.

Image Generation Language Modeling +7

GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices

1 code implementation • 12 Jun 2024 • Quanfeng Lu, Wenqi Shao, Zitao Liu, Fanqing Meng, Boxuan Li, Botong Chen, Siyuan Huang, Kaipeng Zhang, Yu Qiao, Ping Luo

Smartphone users often navigate across multiple applications (apps) to complete tasks such as sharing content between social media platforms.

Navigate

Needle In A Multimodal Haystack

1 code implementation • 11 Jun 2024 • Weiyun Wang, Shuibo Zhang, Yiming Ren, Yuchen Duan, Tiantong Li, Shuo Liu, Mengkang Hu, Zhe Chen, Kaipeng Zhang, Lewei Lu, Xizhou Zhu, Ping Luo, Yu Qiao, Jifeng Dai, Wenqi Shao, Wenhai Wang

In this work, we present Needle In A Multimodal Haystack (MM-NIAH), the first benchmark specifically designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents.

Retrieval

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

2 code implementations • 10 Jun 2024 • Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, Zehuan Yuan

(3) A text-conditional image generation model with 775M parameters, from two-stage training on LAION-COCO and high aesthetics quality images, demonstrating competitive performance of visual quality and text alignment.

Conditional Image Generation

Uncovering Limitations of Large Language Models in Information Seeking from Tables

1 code implementation • 6 Jun 2024 • Chaoxu Pang, Yixuan Cao, ChunHao Yang, Ping Luo

Seeking information from tables (TIS) is a crucial capability for Large Language Models (LLMs), serving as the foundation of knowledge-based Q&A systems.

Single Choice Question Text Generation +1

Learning Manipulation by Predicting Interaction

1 code implementation • 1 Jun 2024 • Jia Zeng, Qingwen Bu, Bangjun Wang, Wenke Xia, Li Chen, Hao Dong, Haoming Song, Dong Wang, Di Hu, Ping Luo, Heming Cui, Bin Zhao, Xuelong Li, Yu Qiao, Hongyang Li

To this end, we propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction (MPI) and enhances the visual representation. Given a pair of keyframes representing the initial and final states, along with language instructions, our algorithm predicts the transition frame and detects the interaction object, respectively.

Representation Learning

Part123: Part-aware 3D Reconstruction from a Single-view Image

no code implementations • 27 May 2024 • Anran Liu, Cheng Lin, YuAn Liu, Xiaoxiao Long, Zhiyang Dou, Hao-Xiang Guo, Ping Luo, Wenping Wang

However, all the existing methods represent the target object as a closed mesh devoid of any structural information, thus neglecting the part-based structure of the reconstructed shape, which is crucial for many downstream applications.

3D Part Segmentation 3D Reconstruction +3

Diagnosing the Compositional Knowledge of Vision Language Models from a Game-Theoretic View

no code implementations • 27 May 2024 • Jin Wang, Shichao Dong, Yapeng Zhu, Kelu Yao, Weidong Zhao, Chao Li, Ping Luo

Compositional reasoning capabilities are usually considered as fundamental skills to characterize human perception.

SearchLVLMs: A Plug-and-Play Framework for Augmenting Large Vision-Language Models by Searching Up-to-Date Internet Knowledge

no code implementations • 23 May 2024 • Chuanhao Li, Zhen Li, Chenchen Jing, Shuo Liu, Wenqi Shao, Yuwei Wu, Ping Luo, Yu Qiao, Kaipeng Zhang

In this paper, we propose a plug-and-play framework, for augmenting existing LVLMs in handling visual question answering (VQA) about up-to-date knowledge, dubbed SearchLVLMs.

Question Answering RAG +1

AnalogCoder: Analog Circuit Design via Training-Free Code Generation

1 code implementation • 23 May 2024 • Yao Lai, Sungyoung Lee, Guojin Chen, Souradip Poddar, Mengkang Hu, David Z. Pan, Ping Luo

Analog circuit design is a significant task in modern chip technology, focusing on the selection of component types, connectivity, and parameters to ensure proper circuit functionality.

Code Generation

Score-based Generative Models with Adaptive Momentum

no code implementations • 22 May 2024 • Ziqing Wen, Xiaoge Deng, Ping Luo, Tao Sun, Dongsheng Li

Score-based generative models have demonstrated significant practical success in data-generating tasks.

Denoising Graph Generation

KET-QA: A Dataset for Knowledge Enhanced Table Question Answering

no code implementations • 13 May 2024 • Mengkang Hu, Haoyu Dong, Ping Luo, Shi Han, Dongmei Zhang

In this paper, we propose to use a knowledge base (KB) as the external knowledge source for TableQA and construct a dataset KET-QA with fine-grained gold evidence annotation.

Question Answering

Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots

no code implementations • 13 May 2024 • Chengyue Wu, Yixiao Ge, Qiushan Guo, Jiahao Wang, Zhixuan Liang, Zeyu Lu, Ying Shan, Ping Luo

Furthermore, we propose three automatic evaluation metrics, including code pass rate, text-match ratio, and GPT-4V overall rating, for a fine-grained assessment of the output code and rendered images.

Code Generation Descriptive

Scalable and Effective Arithmetic Tree Generation for Adder and Multiplier Designs

1 code implementation • 10 May 2024 • Yao Lai, Jinxin Liu, David Z. Pan, Ping Luo

We believe our work will offer valuable insights into hardware design, further accelerating speed and reducing size through the refined search space and our tree generation methodologies.

Computational Efficiency Navigate

UniFS: Universal Few-shot Instance Perception with Point Representations

1 code implementation • 30 Apr 2024 • Sheng Jin, Ruijie Yao, Lumin Xu, Wentao Liu, Chen Qian, Ji Wu, Ping Luo

In this paper, we propose UniFS, a universal few-shot instance perception model that unifies a wide range of instance perception tasks by reformulating them into a dynamic point representation learning framework.

Few-Shot Learning Few-Shot Object Detection +4

Adapting LLaMA Decoder to Vision Transformer

1 code implementation • 10 Apr 2024 • Jiahao Wang, Wenqi Shao, Mengzhao Chen, Chengyue Wu, Yong liu, Taiqiang Wu, Kaipeng Zhang, Songyang Zhang, Kai Chen, Ping Luo

We first "LLaMAfy" a standard ViT step-by-step to align it with LLaMA's architecture, and find that directly applying a causal mask to the self-attention causes an attention collapse issue, resulting in the failure of network training.

Computational Efficiency Decoder +2

DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model

1 code implementation • CVPR 2024 • Lirui Zhao, Yue Yang, Kaipeng Zhang, Wenqi Shao, Yuxin Zhang, Yu Qiao, Ping Luo, Rongrong Ji

Text-to-image (T2I) generative models have attracted significant attention and found extensive applications within and beyond academic research.

Diversity Language Modeling +2

End-to-End Autonomous Driving through V2X Cooperation

2 code implementations • 31 Mar 2024 • Haibao Yu, Wenxian Yang, Jiaru Zhong, Zhenwei Yang, Siqi Fan, Ping Luo, Zaiqing Nie

Cooperatively utilizing both ego-vehicle and infrastructure sensor data via V2X communication has emerged as a promising approach for advanced autonomous driving.

Autonomous Driving

DriveCoT: Integrating Chain-of-Thought Reasoning with End-to-End Driving

no code implementations • 25 Mar 2024 • Tianqi Wang, Enze Xie, Ruihang Chu, Zhenguo Li, Ping Luo

We utilize the challenging driving scenarios from the CARLA Leaderboard 2.0, which involve high-speed driving and lane-changing, and propose a rule-based expert policy to control the vehicle and generate ground truth labels for its reasoning process across different driving aspects and the final decisions.

CARLA Leaderboard 2.0

FlashFace: Human Image Personalization with High-fidelity Identity Preservation

1 code implementation • 25 Mar 2024 • Shilong Zhang, Lianghua Huang, Xi Chen, Yifei Zhang, Zhi-Fan Wu, Yutong Feng, Wei Wang, Yujun Shen, Yu Liu, Ping Luo

This work presents FlashFace, a practical tool with which users can easily personalize their own photos on the fly by providing one or a few reference face images and a text prompt.

Face Swapping Instruction Following +1

Accelerating Federated Learning by Selecting Beneficial Herd of Local Gradients

no code implementations • 25 Mar 2024 • Ping Luo, Xiaoge Deng, Ziqing Wen, Tao Sun, Dongsheng Li

Federated Learning (FL) is a distributed machine learning framework in communication network systems.

Federated Learning

Zero-shot Generative Linguistic Steganography

1 code implementation • 16 Mar 2024 • Ke Lin, Yiyang Luo, Zijian Zhang, Ping Luo

Generative linguistic steganography attempts to hide secret messages into covertext.

In-Context Learning Linguistic steganography

Lost in Overlap: Exploring Logit-based Watermark Collision in LLMs

no code implementations • 15 Mar 2024 • Yiyang Luo, Ke Lin, Chao Gu, Jiahui Hou, Lijie Wen, Ping Luo

The proliferation of large language models (LLMs) in generating content raises concerns about text copyright.

Philosophy Question Answering

ACT-MNMT Auto-Constriction Turning for Multilingual Neural Machine Translation

no code implementations • 11 Mar 2024 • Shaojie Dai, Xin Liu, Ping Luo, Yue Yu

Large language models (LLMs) have achieved promising performance in multilingual machine translation tasks through zero/few-shot prompts or prompt-tuning.

Language Modelling Large Language Model +2

Position: Towards Implicit Prompt For Text-To-Image Models

no code implementations • 4 Mar 2024 • Yue Yang, Yuqi Lin, Hong Liu, Wenqi Shao, Runjian Chen, Hailong Shang, Yu Wang, Yu Qiao, Kaipeng Zhang, Ping Luo

We call for increased attention to the potential and risks of implicit prompts in the T2I community and further investigation into the capabilities and impacts of implicit prompts, advocating for a balanced approach that harnesses their benefits while mitigating their risks.

Position

RegionGPT: Towards Region Understanding Vision Language Model

no code implementations • CVPR 2024 • Qiushan Guo, Shalini De Mello, Hongxu Yin, Wonmin Byeon, Ka Chun Cheung, Yizhou Yu, Ping Luo, Sifei Liu

Vision language models (VLMs) have experienced rapid advancements through the integration of large language models (LLMs) with image-text pairs, yet they struggle with detailed regional visual understanding due to limited spatial awareness of the vision encoder, and the use of coarse-grained training data that lacks detailed, region-specific captions.

Language Modeling Language Modelling +1

AutoMMLab: Automatically Generating Deployable Models from Language Instructions for Computer Vision Tasks

1 code implementation23 Feb 2024 Zekang Yang, Wang Zeng, Sheng Jin, Chen Qian, Ping Luo, Wentao Liu

While traditional AutoML approaches have been successfully applied in several critical steps of model development (e.g., hyperparameter optimization), there is no AutoML system that automates the entire end-to-end model production workflow for computer vision.

Hyperparameter Optimization Keypoint Estimation

RoboScript: Code Generation for Free-Form Manipulation Tasks across Real and Simulation

no code implementations22 Feb 2024 Junting Chen, Yao Mu, Qiaojun Yu, Tianming Wei, Silang Wu, Zhecheng Yuan, Zhixuan Liang, Chao Yang, Kaipeng Zhang, Wenqi Shao, Yu Qiao, Huazhe Xu, Mingyu Ding, Ping Luo

To bridge this "ideal-to-real" gap, this paper presents RoboScript, a platform for 1) a deployable robot manipulation pipeline powered by code generation; and 2) a code generation benchmark for robot manipulation tasks in free-form natural language.

Code Generation Common Sense Reasoning +4

BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation

2 code implementations18 Feb 2024 Peng Xu, Wenqi Shao, Mengzhao Chen, Shitao Tang, Kaipeng Zhang, Peng Gao, Fengwei An, Yu Qiao, Ping Luo

Large language models (LLMs) have demonstrated outstanding performance in various tasks, such as text summarization and text question answering.

Question Answering Text Summarization

OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM

1 code implementation CVPR 2024 Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, Ping Luo

Importantly, all images in this benchmark are sourced from authentic medical scenarios, ensuring alignment with the requirements of the medical field and suitability for evaluating LVLMs.

Medical Visual Question Answering Question Answering +1

PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models

1 code implementation10 Jan 2024 Junsong Chen, Yue Wu, Simian Luo, Enze Xie, Sayak Paul, Ping Luo, Hang Zhao, Zhenguo Li

As a state-of-the-art, open-source image generation model, PIXART-δ offers a promising alternative to the Stable Diffusion family of models, contributing significantly to text-to-image synthesis.

Text-to-Image Generation

LLaMA Pro: Progressive LLaMA with Block Expansion

1 code implementation4 Jan 2024 Chengyue Wu, Yukang Gan, Yixiao Ge, Zeyu Lu, Jiahao Wang, Ye Feng, Ying Shan, Ping Luo

Humans generally acquire new skills without compromising the old; however, the opposite holds for Large Language Models (LLMs), e.g., from LLaMA to CodeLLaMA.

Instruction Following Math

Video Understanding with Large Language Models: A Survey

1 code implementation29 Dec 2023 Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Pinxin Liu, Mingqian Feng, Feng Zheng, JianGuo Zhang, Ping Luo, Jiebo Luo, Chenliang Xu

With the burgeoning growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly.

Survey Video Understanding

DriveLM: Driving with Graph Visual Question Answering

3 code implementations21 Dec 2023 Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, Hongyang Li

The experiments demonstrate that Graph VQA provides a simple, principled framework for reasoning about a driving scene, and DriveLM-Data provides a challenging benchmark for this task.

Autonomous Driving Question Answering +1

Cached Transformers: Improving Transformers with Differentiable Memory Cache

1 code implementation20 Dec 2023 Zhaoyang Zhang, Wenqi Shao, Yixiao Ge, Xiaogang Wang, Jinwei Gu, Ping Luo

This work introduces a new Transformer model called Cached Transformer, which uses Gated Recurrent Cached (GRC) attention to extend the self-attention mechanism with a differentiable memory cache of tokens.

Image Classification Instance Segmentation +7

SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution

1 code implementation CVPR 2024 Zhixuan Liang, Yao Mu, Hengbo Ma, Masayoshi Tomizuka, Mingyu Ding, Ping Luo

Experiments on multi-task robotic manipulation benchmarks like Meta-World and LOReL demonstrate state-of-the-art performance and human-interpretable skill representations from SkillDiffuser.

Trajectory Planning

MotionCtrl: A Unified and Flexible Motion Controller for Video Generation

1 code implementation6 Dec 2023 Zhouxia Wang, Ziyang Yuan, Xintao Wang, Tianshui Chen, Menghan Xia, Ping Luo, Ying Shan

Motions in a video primarily consist of camera motion, induced by camera movement, and object motion, resulting from object movement.

Object Video Generation

MLLMs-Augmented Visual-Language Representation Learning

1 code implementation30 Nov 2023 Yanqing Liu, Kai Wang, Wenqi Shao, Ping Luo, Yu Qiao, Mike Zheng Shou, Kaipeng Zhang, Yang You

Visual-language pre-training has achieved remarkable success in many multi-modal tasks, largely attributed to the availability of large-scale image-text datasets.

Image-text Retrieval Representation Learning +1

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

3 code implementations CVPR 2024 Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, LiMin Wang, Yu Qiao

With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models.

3D Question Answering (3D-QA) Diagnostic +12

Advancing Vision Transformers with Group-Mix Attention

2 code implementations26 Nov 2023 Chongjian Ge, Xiaohan Ding, Zhan Tong, Li Yuan, Jiangliu Wang, Yibing Song, Ping Luo

The attention map is computed based on the mixtures of tokens and group proxies and used to re-combine the tokens and groups in Value.

Image Classification object-detection +2

Large Language Models as Automated Aligners for benchmarking Vision-Language Models

no code implementations24 Nov 2023 Yuanfeng Ji, Chongjian Ge, Weikai Kong, Enze Xie, Zhengying Liu, Zhengguo Li, Ping Luo

In this work, we address the limitations via Auto-Bench, which delves into exploring LLMs as proficient aligners, measuring the alignment between VLMs and human intelligence and value through automatic data curation and assessment.

Benchmarking World Knowledge

DiffusionMat: Alpha Matting as Sequential Refinement Learning

no code implementations22 Nov 2023 Yangyang Xu, Shengfeng He, Wenqi Shao, Kwan-Yee K. Wong, Yu Qiao, Ping Luo

In this paper, we introduce DiffusionMat, a novel image matting framework that employs a diffusion model for the transition from coarse to refined alpha mattes.

Denoising Image Matting

Flow-Based Feature Fusion for Vehicle-Infrastructure Cooperative 3D Object Detection

1 code implementation NeurIPS 2023 Haibao Yu, Yingjuan Tang, Enze Xie, Jilei Mao, Ping Luo, Zaiqing Nie

To address these issues in vehicle-infrastructure cooperative 3D (VIC3D) object detection, we propose the Feature Flow Net (FFNet), a novel cooperative detection framework.

3D Object Detection Autonomous Driving +1

Harvest Video Foundation Models via Efficient Post-Pretraining

1 code implementation30 Oct 2023 Yizhuo Li, Kunchang Li, Yinan He, Yi Wang, Yali Wang, LiMin Wang, Yu Qiao, Ping Luo

Building video-language foundation models is costly and difficult due to the redundant nature of video data and the lack of high-quality video-language datasets.

Question Answering Text Retrieval +2

Tree-Planner: Efficient Close-loop Task Planning with Large Language Models

no code implementations12 Oct 2023 Mengkang Hu, Yao Mu, Xinmiao Yu, Mingyu Ding, Shiguang Wu, Wenqi Shao, Qiguang Chen, Bin Wang, Yu Qiao, Ping Luo

This paper studies close-loop task planning, which refers to the process of generating a sequence of skills (a plan) to accomplish a specific goal while adapting the plan based on real-time observations.

Decision Making Task Planning

Aligning Data Selection with Performance: Performance-driven Reinforcement Learning for Active Learning in Object Detection

no code implementations12 Oct 2023 Zhixuan Liang, Xingyu Zeng, Rui Zhao, Ping Luo

Active learning strategies aim to train high-performance models with minimal labeled data by selecting the most informative instances for labeling.

Active Object Detection Informativeness +6

Open-Vocabulary Animal Keypoint Detection with Semantic-feature Matching

1 code implementation8 Oct 2023 Hao Zhang, Lumin Xu, Shenqi Lai, Wenqi Shao, Nanning Zheng, Ping Luo, Yu Qiao, Kaipeng Zhang

Current image-based keypoint detection methods for animal (including human) bodies and faces are generally divided into fully-supervised and few-shot class-agnostic approaches.

Keypoint Detection Open Vocabulary Keypoint Detection

Guideline Learning for In-context Information Extraction

no code implementations8 Oct 2023 Chaoxu Pang, Yixuan Cao, Qiang Ding, Ping Luo

In this paper, we propose a Guideline Learning (GL) framework for In-context IE which reflectively learns and follows guidelines.

Active Learning Event Extraction +2

LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving

no code implementations4 Oct 2023 Hao Sha, Yao Mu, YuXuan Jiang, Li Chen, Chenfeng Xu, Ping Luo, Shengbo Eben Li, Masayoshi Tomizuka, Wei Zhan, Mingyu Ding

Existing learning-based autonomous driving (AD) systems face challenges in comprehending high-level information, generalizing to rare events, and providing interpretability.

Autonomous Driving Decision Making

PixArt-$α$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

3 code implementations30 Sep 2023 Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, Zhenguo Li

We hope PIXART-α will provide new insights to the AIGC community and startups to accelerate building their own high-quality yet low-cost generative models from scratch.

Language Modelling Text-to-Image Generation

SPOT: Scalable 3D Pre-training via Occupancy Prediction for Learning Transferable 3D Representations

1 code implementation19 Sep 2023 Xiangchao Yan, Runjian Chen, Bo Zhang, Hancheng Ye, Renqiu Xia, Jiakang Yuan, Hongbin Zhou, Xinyu Cai, Botian Shi, Wenqi Shao, Ping Luo, Yu Qiao, Tao Chen, Junchi Yan

Annotating 3D LiDAR point clouds for perception tasks is fundamental for many applications, e.g., autonomous driving, yet it still remains notoriously labor-intensive.

3D Object Detection Autonomous Driving +3

StyleAdapter: A Unified Stylized Image Generation Model

no code implementations4 Sep 2023 Zhouxia Wang, Xintao Wang, Liangbin Xie, Zhongang Qi, Ying Shan, Wenping Wang, Ping Luo

In this work, we propose StyleAdapter, a unified stylized image generation model capable of producing a variety of stylized images that match both the content of a given prompt and the style of reference images, without the need for per-style fine-tuning.

Image Generation model

MedShapeNet -- A Large-Scale Dataset of 3D Medical Shapes for Computer Vision

1 code implementation30 Aug 2023 Jianning Li, Zongwei Zhou, Jiancheng Yang, Antonio Pepe, Christina Gsaxner, Gijs Luijten, Chongyu Qu, Tiezheng Zhang, Xiaoxi Chen, Wenxuan Li, Marek Wodzinski, Paul Friedrich, Kangxian Xie, Yuan Jin, Narmada Ambigapathy, Enrico Nasca, Naida Solak, Gian Marco Melito, Viet Duc Vu, Afaque R. Memon, Christopher Schlachta, Sandrine de Ribaupierre, Rajnikant Patel, Roy Eagleson, Xiaojun Chen, Heinrich Mächler, Jan Stefan Kirschke, Ezequiel de la Rosa, Patrick Ferdinand Christ, Hongwei Bran Li, David G. Ellis, Michele R. Aizenberg, Sergios Gatidis, Thomas Küstner, Nadya Shusharina, Nicholas Heller, Vincent Andrearczyk, Adrien Depeursinge, Mathieu Hatt, Anjany Sekuboyina, Maximilian Löffler, Hans Liebl, Reuben Dorent, Tom Vercauteren, Jonathan Shapey, Aaron Kujawa, Stefan Cornelissen, Patrick Langenhuizen, Achraf Ben-Hamadou, Ahmed Rekik, Sergi Pujades, Edmond Boyer, Federico Bolelli, Costantino Grana, Luca Lumetti, Hamidreza Salehi, Jun Ma, Yao Zhang, Ramtin Gharleghi, Susann Beier, Arcot Sowmya, Eduardo A. Garza-Villarreal, Thania Balducci, Diego Angeles-Valdez, Roberto Souza, Leticia Rittner, Richard Frayne, Yuanfeng Ji, Vincenzo Ferrari, Soumick Chatterjee, Florian Dubost, Stefanie Schreiber, Hendrik Mattern, Oliver Speck, Daniel Haehn, Christoph John, Andreas Nürnberger, João Pedrosa, Carlos Ferreira, Guilherme Aresta, António Cunha, Aurélio Campilho, Yannick Suter, Jose Garcia, Alain Lalande, Vicky Vandenbossche, Aline Van Oevelen, Kate Duquesne, Hamza Mekhzoum, Jef Vandemeulebroucke, Emmanuel Audenaert, Claudia Krebs, Timo Van Leeuwen, Evie Vereecke, Hauke Heidemeyer, Rainer Röhrig, Frank Hölzle, Vahid Badeli, Kathrin Krieger, Matthias Gunzer, Jianxu Chen, Timo van Meegdenburg, Amin Dada, Miriam Balzer, Jana Fragemann, Frederic Jonske, Moritz Rempe, Stanislav Malorodov, Fin H. Bahnsen, Constantin Seibold, Alexander Jaus, Zdravko Marinov, Paul F. Jaeger, Rainer Stiefelhagen, Ana Sofia Santos, Mariana Lindo, André Ferreira, Victor Alves, Michael Kamp, Amr Abourayya, Felix Nensa, Fabian Hörst, Alexander Brehmer, Lukas Heine, Yannik Hanusrichter, Martin Weßling, Marcel Dudda, Lars E. Podleska, Matthias A. Fink, Julius Keyl, Konstantinos Tserpes, Moon-Sung Kim, Shireen Elhabian, Hans Lamecker, Dženan Zukić, Beatriz Paniagua, Christian Wachinger, Martin Urschler, Luc Duong, Jakob Wasserthal, Peter F. Hoyer, Oliver Basu, Thomas Maal, Max J. H. Witjes, Gregor Schiele, Ti-chiun Chang, Seyed-Ahmad Ahmadi, Ping Luo, Bjoern Menze, Mauricio Reyes, Thomas M. Deserno, Christos Davatzikos, Behrus Puladi, Pascal Fua, Alan L. Yuille, Jens Kleesiek, Jan Egger

For the medical domain, we present a large collection of anatomical shapes (e.g., bones, organs, vessels) and 3D models of surgical instruments, called MedShapeNet, created to facilitate the translation of data-driven vision algorithms to medical applications and to adapt SOTA vision algorithms to medical problems.

Anatomy Mixed Reality

GKGNet: Group K-Nearest Neighbor based Graph Convolutional Network for Multi-Label Image Recognition

1 code implementation28 Aug 2023 Ruijie Yao, Sheng Jin, Lumin Xu, Wang Zeng, Wentao Liu, Chen Qian, Ping Luo, Ji Wu

Multi-Label Image Recognition (MLIR) is a challenging task that aims to predict multiple object labels in a single image while modeling the complex relationships between labels and image regions.

graph construction Multi-Label Classification +1

RestoreFormer++: Towards Real-World Blind Face Restoration from Undegraded Key-Value Pairs

1 code implementation14 Aug 2023 Zhouxia Wang, Jiawei Zhang, Tianshui Chen, Wenping Wang, Ping Luo

In this work, we propose RestoreFormer++, which on the one hand introduces fully-spatial attention mechanisms to model the contextual information and the interplay with the priors, and on the other hand, explores an extending degrading model to help generate more realistic degraded face images to alleviate the synthetic-to-real-world gap.

Blind Face Restoration

RIGID: Recurrent GAN Inversion and Editing of Real Face Videos

no code implementations ICCV 2023 Yangyang Xu, Shengfeng He, Kwan-Yee K. Wong, Ping Luo

In this paper, we propose a unified recurrent framework, named Recurrent vIdeo GAN Inversion and eDiting (RIGID), to explicitly and simultaneously enforce temporally coherent GAN inversion and facial editing of real videos.

Attribute Facial Editing +1

Foundation Model is Efficient Multimodal Multitask Model Selector

1 code implementation NeurIPS 2023 Fanqing Meng, Wenqi Shao, Zhanglin Peng, Chonghe Jiang, Kaipeng Zhang, Yu Qiao, Ping Luo

This paper investigates an under-explored but important problem: given a collection of pre-trained neural networks, predicting their performance on each multi-modal task without fine-tuning them, such as image recognition, referring, captioning, visual question answering, and text question answering.

model Model Selection +2

Exploring Transformers for Open-world Instance Segmentation

no code implementations ICCV 2023 Jiannan Wu, Yi Jiang, Bin Yan, Huchuan Lu, Zehuan Yuan, Ping Luo

Open-world instance segmentation is a rising task, which aims to segment all objects in the image by learning from a limited number of base-category objects.

Contrastive Learning Open-World Instance Segmentation +1

TinyLVLM-eHub: Towards Comprehensive and Efficient Evaluation for Large Vision-Language Models

1 code implementation7 Aug 2023 Wenqi Shao, Meng Lei, Yutao Hu, Peng Gao, Kaipeng Zhang, Fanqing Meng, Peng Xu, Siyuan Huang, Hongsheng Li, Yu Qiao, Ping Luo

Secondly, it conducts an in-depth analysis of LVLMs' predictions using the ChatGPT Ensemble Evaluation (CEE), which leads to a robust and accurate evaluation and exhibits improved alignment with human evaluation compared to the word matching approach.

Hallucination Object Hallucination +1

ChiPFormer: Transferable Chip Placement via Offline Decision Transformer

no code implementations26 Jun 2023 Yao Lai, Jinxin Liu, Zhentao Tang, Bin Wang, Jianye Hao, Ping Luo

To resolve these challenges, we cast the chip placement as an offline RL formulation and present ChiPFormer that enables learning a transferable placement policy from fixed offline data.

Offline RL Reinforcement Learning (RL)

Align, Adapt and Inject: Sound-guided Unified Image Generation

no code implementations20 Jun 2023 Yue Yang, Kaipeng Zhang, Yuying Ge, Wenqi Shao, Zeyue Xue, Yu Qiao, Ping Luo

Then, we propose the audio adapter to adapt audio representation into an audio token enriched with specific semantics, which can be injected into a frozen T2I model flexibly.

Image Generation Text Retrieval

EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought

no code implementations NeurIPS 2023 Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, Yu Qiao, Ping Luo

In this work, we introduce EmbodiedGPT, an end-to-end multi-modal foundation model for embodied AI, empowering embodied agents with multi-modal understanding and execution capabilities.

Image Captioning Language Modelling +3

SyNDock: N Rigid Protein Docking via Learnable Group Synchronization

no code implementations23 May 2023 Yuanfeng Ji, Yatao Bian, Guoji Fu, Peilin Zhao, Ping Luo

Firstly, SyNDock formulates multimeric protein docking as a problem of learning global transformations to holistically depict the placement of chain units of a complex, enabling a learning-centric solution.

VDT: General-purpose Video Diffusion Transformers via Mask Modeling

1 code implementation22 May 2023 Haoyu Lu, Guoxing Yang, Nanyi Fei, Yuqi Huo, Zhiwu Lu, Ping Luo, Mingyu Ding

We also propose a unified spatial-temporal mask modeling mechanism, seamlessly integrated with the model, to cater to diverse video generation scenarios.

Autonomous Driving Video Generation +1

Going Denser with Open-Vocabulary Part Segmentation

2 code implementations ICCV 2023 Peize Sun, Shoufa Chen, Chenchen Zhu, Fanyi Xiao, Ping Luo, Saining Xie, Zhicheng Yan

In this paper, we propose a detector with the ability to predict both open-vocabulary objects and their part segmentation.

Object object-detection +3

V2X-Seq: A Large-Scale Sequential Dataset for Vehicle-Infrastructure Cooperative Perception and Forecasting

1 code implementation CVPR 2023 Haibao Yu, Wenxian Yang, Hongzhi Ruan, Zhenwei Yang, Yingjuan Tang, Xu Gao, Xin Hao, Yifeng Shi, Yifeng Pan, Ning Sun, Juan Song, Jirui Yuan, Ping Luo, Zaiqing Nie

Utilizing infrastructure and vehicle-side information to track and forecast the behaviors of surrounding traffic participants can significantly improve decision-making and safety in autonomous driving.

Autonomous Driving Decision Making +1

InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language

2 code implementations9 May 2023 Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Zeqiang Lai, Yang Yang, Qingyun Li, Jiashuo Yu, Kunchang Li, Zhe Chen, Xue Yang, Xizhou Zhu, Yali Wang, LiMin Wang, Ping Luo, Jifeng Dai, Yu Qiao

Different from existing interactive systems that rely on pure language, by incorporating pointing instructions, the proposed iGPT significantly improves the efficiency of communication between users and chatbots, as well as the accuracy of chatbots in vision-centric tasks, especially in complicated visual scenarios where the number of objects is greater than 2.

Language Modelling

MultiModal-GPT: A Vision and Language Model for Dialogue with Humans

1 code implementation8 May 2023 Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, Kai Chen

To further enhance the ability to chat with humans of the MultiModal-GPT, we utilize language-only instruction-following data to train the MultiModal-GPT jointly.

Instruction Following Language Modeling +1

$π$-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation

1 code implementation27 Apr 2023 Chengyue Wu, Teng Wang, Yixiao Ge, Zeyu Lu, Ruisong Zhou, Ying Shan, Ping Luo

Foundation models have achieved great advances in multi-task learning with a unified interface of unimodal and multimodal tasks.

Multi-Task Learning

EC^2: Emergent Communication for Embodied Control

no code implementations19 Apr 2023 Yao Mu, Shunyu Yao, Mingyu Ding, Ping Luo, Chuang Gan

We learn embodied representations of video trajectories, emergent language, and natural language using a language model, which is then used to finetune a lightweight policy network for downstream control.

Contrastive Learning Language Modelling

MetaBEV: Solving Sensor Failures for BEV Detection and Map Segmentation

1 code implementation19 Apr 2023 Chongjian Ge, Junsong Chen, Enze Xie, Zhongdao Wang, Lanqing Hong, Huchuan Lu, Zhenguo Li, Ping Luo

These queries are then processed iteratively by a BEV-Evolving decoder, which selectively aggregates deep features from either LiDAR, cameras, or both modalities.

3D Object Detection Autonomous Driving +3

RIFormer: Keep Your Vision Backbone Effective While Removing Token Mixer

2 code implementations12 Apr 2023 Jiahao Wang, Songyang Zhang, Yong liu, Taiqiang Wu, Yujiu Yang, Xihui Liu, Kai Chen, Ping Luo, Dahua Lin

Extensive experiments and ablative analysis also demonstrate that the inductive bias of network architecture, can be incorporated into simple network structure with appropriate optimization strategy.

Inductive Bias

Embodied Concept Learner: Self-supervised Learning of Concepts and Mapping through Instruction Following

no code implementations7 Apr 2023 Mingyu Ding, Yan Xu, Zhenfang Chen, David Daniel Cox, Ping Luo, Joshua B. Tenenbaum, Chuang Gan

ECL consists of: (i) an instruction parser that translates the natural languages into executable programs; (ii) an embodied concept learner that grounds visual concepts based on language descriptions; (iii) a map constructor that estimates depth and constructs semantic maps by leveraging the learned concepts; and (iv) a program executor with deterministic policies to execute each program.

Instruction Following Self-Supervised Learning

Visual Dependency Transformers: Dependency Tree Emerges from Reversed Attention

1 code implementation CVPR 2023 Mingyu Ding, Yikang Shen, Lijie Fan, Zhenfang Chen, Zitian Chen, Ping Luo, Joshua B. Tenenbaum, Chuang Gan

When looking at an image, we can decompose the scene into entities and their parts as well as obtain the dependencies between them.

Multi-Level Contrastive Learning for Dense Prediction Task

1 code implementation4 Apr 2023 Qiushan Guo, Yizhou Yu, Yi Jiang, Jiannan Wu, Zehuan Yuan, Ping Luo

We extend our pretext task to supervised pre-training, which achieves a similar performance to self-supervised learning.

Contrastive Learning Prediction +1

DeepAccident: A Motion and Accident Prediction Benchmark for V2X Autonomous Driving

no code implementations3 Apr 2023 Tianqi Wang, Sukmin Kim, Wenxuan Ji, Enze Xie, Chongjian Ge, Junsong Chen, Zhenguo Li, Ping Luo

In addition, we propose a new task, end-to-end motion and accident prediction, which can be used to directly evaluate the accident prediction ability for different autonomous driving algorithms.

3D Object Detection Autonomous Driving +2

Soft Neighbors are Positive Supporters in Contrastive Visual Representation Learning

no code implementations30 Mar 2023 Chongjian Ge, Jiangliu Wang, Zhan Tong, Shoufa Chen, Yibing Song, Ping Luo

We evaluate our soft neighbor contrastive learning method (SNCLR) on standard visual recognition benchmarks, including image classification, object detection, and instance segmentation.

Contrastive Learning Image Classification +6

DDP: Diffusion Model for Dense Visual Prediction

1 code implementation ICCV 2023 Yuanfeng Ji, Zhe Chen, Enze Xie, Lanqing Hong, Xihui Liu, Zhaoqiang Liu, Tong Lu, Zhenguo Li, Ping Luo

We propose a simple, efficient, yet powerful framework for dense visual predictions based on the conditional diffusion pipeline.

Denoising model +4

Real-time Controllable Denoising for Image and Video

1 code implementation CVPR 2023 Zhaoyang Zhang, Yitong Jiang, Wenqi Shao, Xiaogang Wang, Ping Luo, Kaimo Lin, Jinwei Gu

Controllable image denoising aims to generate clean samples with human perceptual priors and balance sharpness and smoothness.

Image Denoising Video Denoising

Accelerating Vision-Language Pretraining with Free Language Modeling

1 code implementation CVPR 2023 Teng Wang, Yixiao Ge, Feng Zheng, Ran Cheng, Ying Shan, XiaoHu Qie, Ping Luo

FLM successfully frees the prediction rate from the tie-up with the corruption rate while allowing the corruption spans to be customized for each token to be predicted.

Language Modeling Language Modelling +2

Vehicle-Infrastructure Cooperative 3D Object Detection via Feature Flow Prediction

1 code implementation19 Mar 2023 Haibao Yu, Yingjuan Tang, Enze Xie, Jilei Mao, Jirui Yuan, Ping Luo, Zaiqing Nie

Cooperatively utilizing both ego-vehicle and infrastructure sensor data can significantly enhance autonomous driving perception abilities.

3D Object Detection Autonomous Driving +1

Universal Instance Perception as Object Discovery and Retrieval

1 code implementation CVPR 2023 Bin Yan, Yi Jiang, Jiannan Wu, Dong Wang, Ping Luo, Zehuan Yuan, Huchuan Lu

All instance perception tasks aim at finding certain objects specified by some queries such as category names, language expressions, and target annotations, but this complete field has been split into multiple independent subtasks.

Described Object Detection Generalized Referring Expression Comprehension +15

AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners

1 code implementation3 Feb 2023 Zhixuan Liang, Yao Mu, Mingyu Ding, Fei Ni, Masayoshi Tomizuka, Ping Luo

For example, AdaptDiffuser not only outperforms the previous art Diffuser by 20.8% on Maze2D and 7.5% on MuJoCo locomotion, but also adapts better to new tasks, e.g., KUKA pick-and-place, by 27.9% without requiring additional expert data.

Diversity MuJoCo

Understanding Self-Supervised Pretraining with Part-Aware Representation Learning

1 code implementation27 Jan 2023 Jie Zhu, Jiyang Qi, Mingyu Ding, Xiaokang Chen, Ping Luo, Xinggang Wang, Wenyu Liu, Leye Wang, Jingdong Wang

The study is mainly motivated by that random views, used in contrastive learning, and random masked (visible) patches, used in masked image modeling, are often about object parts.

Contrastive Learning Object +1

Fast-BEV: Towards Real-time On-vehicle Bird's-Eye View Perception

1 code implementation19 Jan 2023 Bin Huang, Yangguang Li, Enze Xie, Feng Liang, Luya Wang, Mingzhu Shen, Fenggang Liu, Tianqi Wang, Ping Luo, Jing Shao

Recently, the pure camera-based Bird's-Eye-View (BEV) perception removes expensive Lidar sensors, making it a feasible solution for economical autonomous driving.

Autonomous Driving Data Augmentation

EC2: Emergent Communication for Embodied Control

no code implementations CVPR 2023 Yao Mu, Shunyu Yao, Mingyu Ding, Ping Luo, Chuang Gan

We learn embodied representations of video trajectories, emergent language, and natural language using a language model, which is then used to finetune a lightweight policy network for downstream control.

Contrastive Learning Language Modelling

RIFormer: Keep Your Vision Backbone Effective but Removing Token Mixer

no code implementations CVPR 2023 Jiahao Wang, Songyang Zhang, Yong liu, Taiqiang Wu, Yujiu Yang, Xihui Liu, Kai Chen, Ping Luo, Dahua Lin

Extensive experiments and ablative analysis also demonstrate that the inductive bias of network architecture, can be incorporated into simple network structure with appropriate optimization strategy.

Inductive Bias

Segment Every Reference Object in Spatial and Temporal Spaces

no code implementations ICCV 2023 Jiannan Wu, Yi Jiang, Bin Yan, Huchuan Lu, Zehuan Yuan, Ping Luo

In this work, we end the current fragmented situation and propose UniRef to unify the three reference-based object segmentation tasks with a single architecture.

Image Segmentation Object +5

MetaBEV: Solving Sensor Failures for 3D Detection and Map Segmentation

no code implementations ICCV 2023 Chongjian Ge, Junsong Chen, Enze Xie, Zhongdao Wang, Lanqing Hong, Huchuan Lu, Zhenguo Li, Ping Luo

These queries are then processed iteratively by a BEV-Evolving decoder, which selectively aggregates deep features from either LiDAR, cameras, or both modalities.

3D Object Detection Autonomous Driving +3

Policy Adaptation from Foundation Model Feedback

no code implementations CVPR 2023 Yuying Ge, Annabella Macaluso, Li Erran Li, Ping Luo, Xiaolong Wang

When deploying the trained policy to a new task or a new environment, we first let the policy play with randomly generated instructions to record the demonstrations.

Decision Making model

Learning Object-Language Alignments for Open-Vocabulary Object Detection

1 code implementation27 Nov 2022 Chuang Lin, Peize Sun, Yi Jiang, Ping Luo, Lizhen Qu, Gholamreza Haffari, Zehuan Yuan, Jianfei Cai

In this paper, we propose a novel open-vocabulary object detection framework directly learning from image-text pair data.

Object object-detection +4

MaskPlace: Fast Chip Placement via Reinforced Visual Representation Learning

2 code implementations24 Nov 2022 Yao Lai, Yao Mu, Ping Luo

Firstly, MaskPlace recasts placement as a problem of learning pixel-level visual representation to comprehensively describe millions of modules on a chip, enabling placement in a high-resolution canvas and a large action space.

Deep Reinforcement Learning Layout Design +2

Prototypical context-aware dynamics generalization for high-dimensional model-based reinforcement learning

no code implementations23 Nov 2022 Junjie Wang, Yao Mu, Dong Li, Qichao Zhang, Dongbin Zhao, Yuzheng Zhuang, Ping Luo, Bin Wang, Jianye Hao

The latent world model provides a promising way to learn policies in a compact latent space for tasks with high-dimensional observations, however, its generalization across diverse environments with unseen dynamics remains challenging.

Model-based Reinforcement Learning reinforcement-learning +1

DiffusionDet: Diffusion Model for Object Detection

3 code implementations ICCV 2023 Shoufa Chen, Peize Sun, Yibing Song, Ping Luo

We propose DiffusionDet, a new framework that formulates object detection as a denoising diffusion process from noisy boxes to object boxes.

Denoising model +3

Large-batch Optimization for Dense Visual Predictions

1 code implementation20 Oct 2022 Zeyue Xue, Jianming Liang, Guanglu Song, Zhuofan Zong, Liang Chen, Yu Liu, Ping Luo

To address this challenge, we propose a simple yet effective algorithm, named Adaptive Gradient Variance Modulator (AGVM), which can train dense visual predictors with very large batch sizes, offering several benefits over prior art.

Instance Segmentation object-detection +3
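The intuition behind balancing gradient statistics across network components at large batch sizes can be sketched as follows. This is a loose toy sketch, not the paper's AGVM algorithm: it simply rescales each module's gradient so its variance matches a reference module's, with all names hypothetical.

```python
import numpy as np

def modulate(grads, ref_key="backbone", eps=1e-8):
    """Toy gradient-variance modulation: rescale each module's gradient so
    its variance matches a reference module's, loosely sketching the idea of
    balancing update magnitudes across heads at large batch sizes."""
    ref_var = grads[ref_key].var()
    return {k: g * np.sqrt(ref_var / (g.var() + eps)) for k, g in grads.items()}

rng = np.random.default_rng(0)
grads = {"backbone": rng.normal(0, 1.0, 1000),
         "det_head": rng.normal(0, 5.0, 1000)}   # head gradients far noisier
out = modulate(grads)
print(round(float(out["det_head"].var() / out["backbone"].var()), 3))
```

Without some such modulation, components whose gradient variance explodes at large batch sizes can destabilize the rest of the network during joint training.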

Decomposed Mutual Information Optimization for Generalized Context in Meta-Reinforcement Learning

1 code implementation9 Oct 2022 Yao Mu, Yuzheng Zhuang, Fei Ni, Bin Wang, Jianyu Chen, Jianye Hao, Ping Luo

This paper addresses such a challenge by Decomposed Mutual INformation Optimization (DOMINO) for context learning, which explicitly learns a disentangled context to maximize the mutual information between the context and historical trajectories, while minimizing the state transition prediction error.

Decision Making Meta Reinforcement Learning +3

Enhance Sample Efficiency and Robustness of End-to-end Urban Autonomous Driving via Semantic Masked World Model

no code implementations8 Oct 2022 Zeyu Gao, Yao Mu, Chen Chen, Jingliang Duan, Shengbo Eben Li, Ping Luo, YanFeng Lu

End-to-end autonomous driving provides a feasible way to automatically maximize overall driving system performance by directly mapping the raw pixels from a front-facing camera to control signals.

Autonomous Driving

Learning Transferable Spatiotemporal Representations from Natural Script Knowledge

1 code implementation CVPR 2023 Ziyun Zeng, Yuying Ge, Xihui Liu, Bin Chen, Ping Luo, Shu-Tao Xia, Yixiao Ge

Pre-training on large-scale video data has become a common recipe for learning transferable spatiotemporal representations in recent years.

Descriptive Representation Learning +1

FedVeca: Federated Vectorized Averaging on Non-IID Data with Adaptive Bi-directional Global Objective

no code implementations28 Sep 2022 Ping Luo, Jieren Cheng, Zhenhao Liu, N. Xiong, Jie Wu

However, the clients' Non-Independent and Identically Distributed (Non-IID) data negatively affect the trained model, and clients performing different numbers of local updates may cause significant gaps between the local gradients in each communication round.

Federated Learning

Rethinking Resolution in the Context of Efficient Video Recognition

1 code implementation26 Sep 2022 Chuofan Ma, Qiushan Guo, Yi Jiang, Zehuan Yuan, Ping Luo, Xiaojuan Qi

Our key finding is that the major cause of degradation is not information loss in the down-sampling process, but rather the mismatch between network architecture and input scale.

Knowledge Distillation Video Recognition

ZoomNAS: Searching for Whole-body Human Pose Estimation in the Wild

1 code implementation23 Aug 2022 Lumin Xu, Sheng Jin, Wentao Liu, Chen Qian, Wanli Ouyang, Ping Luo, Xiaogang Wang

We propose a single-network approach, termed ZoomNet, to take into account the hierarchical structure of the full human body and solve the scale variation of different body parts.

2D Human Pose Estimation Neural Architecture Search +1

3D Interacting Hand Pose Estimation by Hand De-occlusion and Removal

1 code implementation22 Jul 2022 Hao Meng, Sheng Jin, Wentao Liu, Chen Qian, Mengxiang Lin, Wanli Ouyang, Ping Luo

Unlike most previous works that directly predict the 3D poses of two interacting hands simultaneously, we propose to decompose the challenging interacting hand pose estimation task and estimate the pose of each hand separately.

3D Interacting Hand Pose Estimation Hand Pose Estimation

Pose for Everything: Towards Category-Agnostic Pose Estimation

1 code implementation21 Jul 2022 Lumin Xu, Sheng Jin, Wang Zeng, Wentao Liu, Chen Qian, Wanli Ouyang, Ping Luo, Xiaogang Wang

In this paper, we introduce the task of Category-Agnostic Pose Estimation (CAPE), which aims to create a pose estimation model capable of detecting the pose of any class of object given only a few samples with keypoint definition.

Category-Agnostic Pose Estimation Pose Estimation

Real-time End-to-End Video Text Spotter with Contrastive Representation Learning

1 code implementation18 Jul 2022 Weijia Wu, Zhuang Li, Jiahong Li, Chunhua Shen, Hong Zhou, Size Li, Zhongyuan Wang, Ping Luo

Our contributions are three-fold: 1) CoText simultaneously addresses the three tasks (i.e., text detection, tracking, and recognition) in a real-time end-to-end trainable framework.

Contrastive Learning Representation Learning +2

Towards Grand Unification of Object Tracking

1 code implementation14 Jul 2022 Bin Yan, Yi Jiang, Peize Sun, Dong Wang, Zehuan Yuan, Ping Luo, Huchuan Lu

We present a unified method, termed Unicorn, that can simultaneously solve four tracking problems (SOT, MOT, VOS, MOTS) with a single network using the same model parameters.

Multi-Object Tracking Multi-Object Tracking and Segmentation +4

Not All Models Are Equal: Predicting Model Transferability in a Self-challenging Fisher Space

1 code implementation7 Jul 2022 Wenqi Shao, Xun Zhao, Yixiao Ge, Zhaoyang Zhang, Lei Yang, Xiaogang Wang, Ying Shan, Ping Luo

It is challenging because the ground-truth model ranking for each task can only be generated by fine-tuning the pre-trained models on the target dataset, which is brute-force and computationally expensive.

All Transferability

Exploiting Context Information for Generic Event Boundary Captioning

1 code implementation3 Jul 2022 Jinrui Zhang, Teng Wang, Feng Zheng, Ran Cheng, Ping Luo

Previous methods process only a single boundary at a time, failing to exploit video context information.

Boundary Captioning

VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix

no code implementations17 Jun 2022 Teng Wang, Wenhao Jiang, Zhichao Lu, Feng Zheng, Ran Cheng, Chengguo Yin, Ping Luo

Existing vision-language pre-training (VLP) methods primarily rely on paired image-text datasets, which are either annotated at enormous human cost or crawled from the internet and then subjected to elaborate data cleaning.

Contrastive Learning cross-modal alignment +3
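The cross-modal CutMix idea of building mixed multi-modal sentences can be illustrated with a toy token-level sketch. The token bank, names, and replacement rule here are hypothetical; the actual method operates on learned embeddings rather than strings.

```python
import random

def cross_modal_cutmix(text_tokens, patch_bank, p=0.5, seed=0):
    """Toy cross-modal CutMix: randomly replace text tokens that have a
    visually grounded counterpart with the corresponding image-patch token,
    yielding a mixed multi-modal sentence."""
    rng = random.Random(seed)
    return [patch_bank[t] if t in patch_bank and rng.random() < p else t
            for t in text_tokens]

tokens = ["a", "dog", "chases", "a", "ball"]
bank = {"dog": "<patch:dog>", "ball": "<patch:ball>"}
mixed = cross_modal_cutmix(tokens, bank, p=1.0)
print(mixed)
```

Training on such mixed sequences lets a model learn image-text alignment without ever seeing genuinely paired image-text data.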

CtrlFormer: Learning Transferable State Representation for Visual Control via Transformer

1 code implementation17 Jun 2022 Yao Mu, Shoufa Chen, Mingyu Ding, Jianyu Chen, Runjian Chen, Ping Luo

In visual control, learning state representations that transfer between different control tasks is important for reducing the training sample size.

Transfer Learning

AMOS: A Large-Scale Abdominal Multi-Organ Benchmark for Versatile Medical Image Segmentation

4 code implementations16 Jun 2022 Yuanfeng Ji, Haotian Bai, Jie Yang, Chongjian Ge, Ye Zhu, Ruimao Zhang, Zhen Li, Lingyan Zhang, Wanling Ma, Xiang Wan, Ping Luo

Constrained by the high cost of collecting and labeling 3D medical data, most deep learning models to date are driven by datasets with a limited number of organs of interest or samples, which still limits the power of modern deep models and makes it difficult to provide a fully comprehensive and fair estimate of various methods.

Image Segmentation Medical Image Segmentation +3

CO^3: Cooperative Unsupervised 3D Representation Learning for Autonomous Driving

1 code implementation8 Jun 2022 Runjian Chen, Yao Mu, Runsen Xu, Wenqi Shao, Chenhan Jiang, Hang Xu, Zhenguo Li, Ping Luo

In this paper, we propose CO^3, namely Cooperative Contrastive Learning and Contextual Shape Prediction, to learn 3D representation for outdoor-scene point clouds in an unsupervised manner.

Autonomous Driving Contrastive Learning +1

AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition

2 code implementations26 May 2022 Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, Ping Luo

To address this challenge, we propose an effective adaptation approach for Transformer, namely AdaptFormer, which can adapt the pre-trained ViTs into many different image and video tasks efficiently.

Action Recognition Video Recognition
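The adapter idea behind AdaptFormer, augmenting a frozen block with a lightweight parallel bottleneck branch, can be sketched in a toy forward pass. The shapes, the linear stand-in for the frozen MLP, and the omission of normalization are simplifications, not the paper's exact module.

```python
import numpy as np

def adaptmlp_forward(x, W_mlp, W_down, W_up, scale=0.1):
    """Toy AdaptFormer-style block: the frozen MLP output is augmented by a
    trainable parallel bottleneck branch (down-project, ReLU, up-project,
    scale)."""
    frozen = x @ W_mlp                                  # frozen pretrained path
    adapter = np.maximum(x @ W_down, 0) @ W_up          # trainable bottleneck
    return frozen + scale * adapter

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 16))                            # two tokens, dim 16
W_mlp = rng.normal(size=(16, 16))
W_down = rng.normal(size=(16, 4))
W_up = np.zeros((4, 16))                                # zero-init up-projection
out = adaptmlp_forward(x, W_mlp, W_down, W_up)
print(bool(np.allclose(out, x @ W_mlp)))
```

Zero-initializing the up-projection means the adapted model starts out exactly equal to the pretrained one, so fine-tuning the small branch cannot disrupt the frozen backbone at step zero.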

Flow-based Recurrent Belief State Learning for POMDPs

no code implementations23 May 2022 Xiaoyu Chen, Yao Mu, Ping Luo, Shengbo Li, Jianyu Chen

Furthermore, we show that the learned belief states can be plugged into downstream RL algorithms to improve performance.

Decision Making Sequential Decision Making +1

An Empirical Investigation of Representation Learning for Imitation

2 code implementations16 May 2022 Xin Chen, Sam Toyer, Cody Wild, Scott Emmons, Ian Fischer, Kuang-Huei Lee, Neel Alex, Steven H Wang, Ping Luo, Stuart Russell, Pieter Abbeel, Rohin Shah

We propose a modular framework for constructing representation learning algorithms, then use our framework to evaluate the utility of representation learning for imitation across several environment suites.

Image Classification Imitation Learning +1

MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval

1 code implementation26 Apr 2022 Yuying Ge, Yixiao Ge, Xihui Liu, Alex Jinpeng Wang, Jianping Wu, Ying Shan, XiaoHu Qie, Ping Luo

Dominant pre-training work for video-text retrieval mainly adopts "dual-encoder" architectures to enable efficient retrieval, where two separate encoders are used to contrast global video and text representations, but this ignores detailed local semantics.

Action Recognition Text Retrieval +5
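Why dual encoders enable efficient retrieval can be seen in a toy sketch: because the two modalities are embedded independently, all video embeddings can be pre-computed offline and a query reduces to one similarity lookup. The embeddings below are random stand-ins, not outputs of any real encoder.

```python
import numpy as np

def retrieve(text_emb, video_embs):
    """Toy dual-encoder retrieval: independently produced embeddings are
    L2-normalized and compared by dot product; the highest cosine
    similarity wins."""
    t = text_emb / np.linalg.norm(text_emb)
    v = video_embs / np.linalg.norm(video_embs, axis=1, keepdims=True)
    sims = v @ t
    return int(np.argmax(sims)), sims

rng = np.random.default_rng(2)
videos = rng.normal(size=(5, 32))                # pre-computed video embeddings
query = videos[3] + 0.05 * rng.normal(size=32)   # text query near video #3
idx, _ = retrieve(query, videos)
print(idx)
```

The trade-off the abstract points at is exactly this factorization: global-vector matching is fast, but discards the fine-grained local alignment that cross-attention models capture.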

Semantic-Aware Pretraining for Dense Video Captioning

no code implementations13 Apr 2022 Teng Wang, Zhu Liu, Feng Zheng, Zhichao Lu, Ran Cheng, Ping Luo

This report describes the details of our approach for the event dense-captioning task in ActivityNet Challenge 2021.

Dense Captioning Dense Video Captioning

M$^2$BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation

no code implementations11 Apr 2022 Enze Xie, Zhiding Yu, Daquan Zhou, Jonah Philion, Anima Anandkumar, Sanja Fidler, Ping Luo, Jose M. Alvarez

In this paper, we propose M$^2$BEV, a unified framework that jointly performs 3D object detection and map segmentation in the Birds Eye View~(BEV) space with multi-camera image inputs.

3D Object Detection BEV Segmentation +2
