Search Results for author: Jingdong Wang

Found 242 papers, 116 papers with code

Revisiting MLLMs: An In-Depth Analysis of Image Classification Abilities

no code implementations 21 Dec 2024 Huan Liu, Lingyu Xiao, JiangJiang Liu, Xiaofan Li, Ze Feng, Sen Yang, Jingdong Wang

To understand the factors driving this improvement, we conduct an in-depth analysis of the network architecture, data selection, and training recipe used in public MLLMs.

Attribute Classification +4

Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception

1 code implementation 18 Dec 2024 Yanpeng Sun, Jing Hao, Ke Zhu, Jiang-Jiang Liu, Yuxiang Zhao, Xiaofan Li, Gang Zhang, Zechao Li, Jingdong Wang

We propose to leverage off-the-shelf visual specialists, which were trained from annotated images initially not for image captioning, for enhancing the image caption.

Descriptive Human-Object Interaction Detection +2

Unbiased General Annotated Dataset Generation

no code implementations 14 Dec 2024 Dengyang Jiang, Haoyu Wang, Lei Zhang, Wei Wei, Guang Dai, Mengmeng Wang, Jingdong Wang, Yanning Zhang

Pre-training backbone networks on a general annotated dataset (e.g., ImageNet) that comprises numerous manually collected images with category annotations has proven to be indispensable for enhancing the generalization capacity of downstream visual tasks.

Dataset Generation Image Generation

Visual Object Tracking across Diverse Data Modalities: A Review

no code implementations 13 Dec 2024 Mengmeng Wang, Teli Ma, Shuo Xin, Xiaojun Hou, Jiazheng Xing, Guang Dai, Jingdong Wang, Yong Liu

Specifically, we first review three types of mainstream single-modal VOT, including RGB, thermal infrared and point cloud tracking.

Visual Object Tracking

ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts

no code implementations 11 Dec 2024 Sinan Du, Guosheng Zhang, Keyao Wang, Yuanrui Wang, Haixiao Yue, Gang Zhang, Errui Ding, Jingdong Wang, Zhengzhuo Xu, Chun Yuan

Parameter-efficient transfer learning (PETL) has become a promising paradigm for adapting large-scale vision foundation models to downstream tasks.

Image Classification Transfer Learning

Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks

1 code implementation 1 Dec 2024 Jiahao Cui, Hui Li, Yun Zhan, Hanlin Shang, Kaihui Cheng, Yuqi Ma, Shan Mu, Hang Zhou, Jingdong Wang, Siyu Zhu

Existing methodologies for animating portrait images face significant challenges, particularly in handling non-frontal perspectives, rendering dynamic objects around the portrait, and generating immersive, realistic backgrounds.

Image Animation Portrait Animation

OpenHumanVid: A Large-Scale High-Quality Dataset for Enhancing Human-Centric Video Generation

no code implementations 28 Nov 2024 Hui Li, Mingwang Xu, Yun Zhan, Shan Mu, Jiaye Li, Kaihui Cheng, Yuxuan Chen, Tan Chen, Mao Ye, Jingdong Wang, Siyu Zhu

Recent advancements in visual generation technologies have markedly increased the scale and availability of video datasets, which are crucial for training effective video generation models.

Video Generation

TopoSD: Topology-Enhanced Lane Segment Perception with SDMap Prior

no code implementations 22 Nov 2024 Sen Yang, Minyue Jiang, Ziwei Fan, Xiaolu Xie, Xiao Tan, YingYing Li, Errui Ding, Liang Wang, Jingdong Wang

Recent advances in autonomous driving systems have shifted towards reducing reliance on high-definition maps (HDMaps) due to the huge costs of annotation and maintenance.

Autonomous Driving

Continual SFT Matches Multimodal RLHF with Negative Supervision

no code implementations 22 Nov 2024 Ke Zhu, Yu Wang, Yanpeng Sun, Qiang Chen, JiangJiang Liu, Gang Zhang, Jingdong Wang

Our nSFT disentangles this negative supervision in RLHF paradigm, and continually aligns VLMs with a simple SFT loss.

DGTR: Distributed Gaussian Turbo-Reconstruction for Sparse-View Vast Scenes

no code implementations 19 Nov 2024 Hao Li, Yuanyuan Gao, Haosong Peng, Chenming Wu, Weicai Ye, Yufeng Zhan, Chen Zhao, Dingwen Zhang, Jingdong Wang, Junwei Han

This paper presents DGTR, a novel distributed framework for efficient Gaussian reconstruction for sparse-view vast scenes.

Novel View Synthesis

MoLE: Enhancing Human-centric Text-to-image Diffusion via Mixture of Low-rank Experts

no code implementations 30 Oct 2024 Jie Zhu, Yixiong Chen, Mingyu Ding, Ping Luo, Leye Wang, Jingdong Wang

These datasets collectively provide a rich prior knowledge base to enhance the human-centric image generation capabilities of the diffusion model.

Text-to-Image Generation

Schedule Your Edit: A Simple yet Effective Diffusion Noise Schedule for Image Editing

no code implementations 24 Oct 2024 Haonan Lin, Mengmeng Wang, Jiahao Wang, Wenbin An, Yan Chen, Yong Liu, Feng Tian, Guang Dai, Jingdong Wang, Qianying Wang

To resolve this, we introduce the Logistic Schedule, a novel noise schedule designed to eliminate singularities, improve inversion stability, and provide a better noise space for image editing.

Improving Multi-modal Large Language Model through Boosting Vision Capabilities

no code implementations 17 Oct 2024 Yanpeng Sun, Huaxin Zhang, Qiang Chen, Xinyu Zhang, Nong Sang, Gang Zhang, Jingdong Wang, Zechao Li

QLadder employs a learnable "ladder" structure to deeply aggregate the intermediate representations from the frozen pretrained visual encoder (e.g., the CLIP image encoder).

Decoder Language Modeling +3

Hallo2: Long-Duration and High-Resolution Audio-Driven Portrait Image Animation

1 code implementation 10 Oct 2024 Jiahao Cui, Hui Li, Yao Yao, Hao Zhu, Hanlin Shang, Kaihui Cheng, Hang Zhou, Siyu Zhu, Jingdong Wang

To the best of our knowledge, Hallo2, proposed in this paper, is the first method to achieve 4K resolution and generate hour-long, audio-driven portrait image animations enhanced with textual prompts.

4k Image Animation +2

MGMapNet: Multi-Granularity Representation Learning for End-to-End Vectorized HD Map Construction

no code implementations 10 Oct 2024 Jing Yang, Minyue Jiang, Sen Yang, Xiao Tan, YingYing Li, Errui Ding, Hanli Wang, Jingdong Wang

The construction of Vectorized High-Definition (HD) map typically requires capturing both category and geometry information of map elements.

Representation Learning

Flipped Classroom: Aligning Teacher Attention with Student in Generalized Category Discovery

no code implementations 29 Sep 2024 Haonan Lin, Wenbin An, Jiahao Wang, Yan Chen, Feng Tian, Mengmeng Wang, Guang Dai, Qianying Wang, Jingdong Wang

Recent advancements have shown promise in applying traditional Semi-Supervised Learning strategies to the task of Generalized Category Discovery (GCD).

Learning Multiple Probabilistic Decisions from Latent World Model in Autonomous Driving

1 code implementation 24 Sep 2024 Lingyu Xiao, Jiang-Jiang Liu, Sen Yang, Xiaofan Li, Xiaoqing Ye, Wankou Yang, Jingdong Wang

In this paper, we explore the feasibility of deriving decisions from an autoregressive world model by addressing these challenges through the formulation of multiple probabilistic hypotheses.

Autonomous Driving Imitation Learning +1

MonoFormer: One Transformer for Both Diffusion and Autoregression

1 code implementation 24 Sep 2024 Chuyang Zhao, Yuxing Song, Wenhao Wang, Haocheng Feng, Errui Ding, Yifan Sun, Xinyan Xiao, Jingdong Wang

Most existing multimodality methods use separate backbones for autoregression-based discrete text generation and diffusion-based continuous visual generation, or the same backbone by discretizing the visual data to use autoregression for both text and visual generation.

Image Generation Text Generation

FullAnno: A Data Engine for Enhancing Image Comprehension of MLLMs

no code implementations 20 Sep 2024 Jing Hao, Yuxiang Zhao, Song Chen, Yanpeng Sun, Qiang Chen, Gang Zhang, Kun Yao, Errui Ding, Jingdong Wang

To this end, we devised the FullAnno system, which is a data engine that can generate large-scale, high-quality, and fine-grained image annotations consisting of the category and position of objects, region descriptions, text information, as well as image dense captions.

Image Captioning Image Comprehension

SpotActor: Training-Free Layout-Controlled Consistent Image Generation

no code implementations 7 Sep 2024 Jiahao Wang, Caixia Yan, Weizhan Zhang, Haonan Lin, Mengmeng Wang, Guang Dai, Tieliang Gong, Hao Sun, Jingdong Wang

For these issues, we pioneer a novel task, Layout-to-Consistent-Image (L2CI) generation, which produces consistent and compositional images in accordance with the given layout conditions and text prompts.

Image Generation object-detection +1

Make Your ViT-based Multi-view 3D Detectors Faster via Token Compression

1 code implementation 1 Sep 2024 Dingyuan Zhang, Dingkang Liang, Zichang Tan, Xiaoqing Ye, Cheng Zhang, Jingdong Wang, Xiang Bai

Slow inference speed is one of the most crucial concerns for deploying multi-view 3D detectors to tasks with high real-time requirements like autonomous driving.

Autonomous Driving

Disentangled Noisy Correspondence Learning

no code implementations 10 Aug 2024 Zhuohang Dang, Minnan Luo, Jihong Wang, Chengyou Jia, Haochen Han, Herun Wan, Guang Dai, Xiaojun Chang, Jingdong Wang

Moreover, although intuitive, directly applying previous cross-modal disentanglement methods suffers from limited noise tolerance and disentanglement efficacy.

cross-modal alignment Cross-Modal Retrieval +2

Add-SD: Rational Generation without Manual Reference

1 code implementation 30 Jul 2024 Lingfeng Yang, Xinyu Zhang, Xiang Li, Jinwen Chen, Kun Yao, Gang Zhang, Errui Ding, Lingqiao Liu, Jingdong Wang, Jian Yang

Our work contributes in three aspects: proposing a dataset containing numerous instructed image pairs; fine-tuning a diffusion model for rational generation; and generating synthetic data to boost downstream tasks.

LION: Linear Group RNN for 3D Object Detection in Point Clouds

1 code implementation 25 Jul 2024 Zhe Liu, Jinghua Hou, Xinyu Wang, Xiaoqing Ye, Jingdong Wang, Hengshuang Zhao, Xiang Bai

To tackle this problem, we simply introduce a 3D spatial feature descriptor and integrate it into the linear group RNN operators to enhance their spatial features rather than blindly increasing the number of scanning orders for voxel features.

3D Object Detection Long-range modeling +2

Explore the LiDAR-Camera Dynamic Adjustment Fusion for 3D Object Detection

1 code implementation 22 Jul 2024 Yiran Yang, Xu Gao, Tong Wang, Xin Hao, Yifeng Shi, Xiao Tan, Xiaoqing Ye, Jingdong Wang

This module adjusts the feature distributions from both the camera and LiDAR, bringing them closer to the ground truth domain and minimizing differences.

3D Object Detection Autonomous Driving +1

Surfel-based Gaussian Inverse Rendering for Fast and Relightable Dynamic Human Reconstruction from Monocular Video

no code implementations 21 Jul 2024 Yiqun Zhao, Chenming Wu, Binbin Huang, YiHao Zhi, Chen Zhao, Jingdong Wang, Shenghua Gao

Efficient and accurate reconstruction of a relightable, dynamic clothed human avatar from a monocular video is crucial for the entertainment industry.

Disentanglement Inverse Rendering

LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction

1 code implementation 16 Jul 2024 Penghui Du, Yu Wang, Yifan Sun, Luting Wang, Yue Liao, Gang Zhang, Errui Ding, Yan Wang, Jingdong Wang, Si Liu

Existing methods enhance open-vocabulary object detection by leveraging the robust open-vocabulary recognition capabilities of Vision-Language Models (VLMs), such as CLIP. However, two main challenges emerge: (1) a deficiency in concept representation, where the category names in CLIP's text space lack textual and visual knowledge.

Language Modeling Language Modelling +3

SEED: A Simple and Effective 3D DETR in Point Clouds

1 code implementation 15 Jul 2024 Zhe Liu, Jinghua Hou, Xiaoqing Ye, Tong Wang, Jingdong Wang, Xiang Bai

We argue that the main challenges are twofold: 1) How to obtain the appropriate object queries is challenging due to the high sparsity and uneven distribution of point clouds; 2) How to implement an effective query interaction by exploiting the rich geometric structure of point clouds is not fully explored.

Timestep-Aware Correction for Quantized Diffusion Models

no code implementations 4 Jul 2024 Yuzhe Yao, Feng Tian, Jun Chen, Haonan Lin, Guang Dai, Yong Liu, Jingdong Wang

This accumulation of error becomes particularly problematic in low-precision scenarios, leading to significant distortions in the generated images.

Attribute Noise Estimation +1

Evaluation of Text-to-Video Generation Models: A Dynamics Perspective

1 code implementation 1 Jul 2024 Mingxiang Liao, Hannan Lu, Xinyu Zhang, Fang Wan, Tianyu Wang, Yuzhong Zhao, WangMeng Zuo, Qixiang Ye, Jingdong Wang

For this purpose, we establish a new benchmark comprising text prompts that fully reflect multiple dynamics grades, and define a set of dynamics scores corresponding to various temporal granularities to comprehensively evaluate the dynamics of each generated video.

Text-to-Video Generation Video Generation

XLD: A Cross-Lane Dataset for Benchmarking Novel Driving View Synthesis

no code implementations 26 Jun 2024 Hao Li, Ming Yuan, Yan Zhang, Chenming Wu, Chen Zhao, Chunyu Song, Haocheng Feng, Errui Ding, Dingwen Zhang, Jingdong Wang

To address this, this paper presents a novel driving view synthesis dataset and benchmark specifically designed for autonomous driving simulations.

Autonomous Driving Benchmarking

VDG: Vision-Only Dynamic Gaussian for Driving Simulation

no code implementations 26 Jun 2024 Hao Li, Jingfeng Li, Dingwen Zhang, Chenming Wu, Jieqi Shi, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang, Junwei Han

Dynamic Gaussian splatting has led to impressive scene reconstruction and image synthesis advances in novel views.

Image Generation

Assessing Model Generalization in Vicinity

1 code implementation 13 Jun 2024 Yuchi Liu, Yifan Sun, Jingdong Wang, Liang Zheng

This paper evaluates the generalization ability of classification models on out-of-distribution test sets without depending on ground truth labels.

model

BEVSpread: Spread Voxel Pooling for Bird's-Eye-View Representation in Vision-based Roadside 3D Object Detection

1 code implementation CVPR 2024 Wenjie Wang, Yehao Lu, Guangcong Zheng, Shuigen Zhan, Xiaoqing Ye, Zichang Tan, Jingdong Wang, Gaoang Wang, Xi Li

Vision-based roadside 3D object detection has attracted rising attention in autonomous driving domain, since it encompasses inherent advantages in reducing blind spots and expanding perception range.

3D Object Detection Autonomous Driving +1

Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation

no code implementations 13 Jun 2024 Mingwang Xu, Hui Li, Qingkun Su, Hanlin Shang, Liwei Zhang, Ce Liu, Jingdong Wang, Yao Yao, Siyu Zhu

Moving away from traditional paradigms that rely on parametric models for intermediate facial representations, our innovative approach embraces the end-to-end diffusion paradigm and introduces a hierarchical audio-driven visual synthesis module to enhance the precision of alignment between audio inputs and visual outputs, encompassing lip, expression, and pose motion.

Diversity Image Animation

OpenGaussian: Towards Point-Level 3D Gaussian-based Open Vocabulary Understanding

no code implementations 4 Jun 2024 Yanmin Wu, Jiarui Meng, Haijie Li, Chenming Wu, Yahao Shi, Xinhua Cheng, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang, Jian Zhang

To ensure robust feature presentation and 3D point-level understanding, we first employ SAM masks without cross-frame associations to train instance features with 3D consistency.

3DGS Object

Towards Unified Multi-granularity Text Detection with Interactive Attention

no code implementations 30 May 2024 Xingyu Wan, Chengquan Zhang, Pengyuan Lyu, Sen Fan, Zihan Ni, Kun Yao, Errui Ding, Jingdong Wang

Existing OCR engines or document image analysis systems typically rely on training separate models for text detection in varying scenarios and granularities, leading to significant computational complexity and resource demands.

Document Layout Analysis Optical Character Recognition (OCR) +3

Are Images Indistinguishable to Humans Also Indistinguishable to Classifiers?

no code implementations 28 May 2024 Zebin You, Xinyu Zhang, Hanzhong Guo, Jingdong Wang, Chongxuan Li

However, through distribution classification tasks, we reveal that, from the perspective of neural network-based classifiers, even advanced diffusion models are still far from this goal.

Image Generation

Dense Connector for MLLMs

1 code implementation 22 May 2024 Huanjin Yao, Wenhao Wu, Taojiannan Yang, Yuxin Song, Mengxi Zhang, Haocheng Feng, Yifan Sun, Zhiheng Li, Wanli Ouyang, Jingdong Wang

We witness the rise of larger and higher-quality instruction datasets, as well as the involvement of larger-sized LLMs.

Video Understanding

Unsupervised Pre-training with Language-Vision Prompts for Low-Data Instance Segmentation

1 code implementation 22 May 2024 Dingwen Zhang, Hao Li, Diqi He, Nian Liu, Lechao Cheng, Jingdong Wang, Junwei Han

Experimental evaluations conducted on MS COCO, Cityscapes, and CTW1500 datasets indicate that the QEIS models' performance can be significantly improved when pre-trained with our method.

Instance Segmentation Semantic Segmentation +1

RTG-SLAM: Real-time 3D Reconstruction at Scale using Gaussian Splatting

no code implementations 30 Apr 2024 Zhexi Peng, Tianjia Shao, Yong Liu, Jingke Zhou, Yin Yang, Jingdong Wang, Kun Zhou

We present Real-time Gaussian SLAM (RTG-SLAM), a real-time 3D reconstruction system with an RGBD camera for large-scale environments using Gaussian splatting.

3D Reconstruction NeRF +1

Training-Free Unsupervised Prompt for Vision-Language Models

1 code implementation 25 Apr 2024 Sifan Long, Linbin Wang, Zhen Zhao, Zichang Tan, Yiming Wu, Shengsheng Wang, Jingdong Wang

In light of this, we propose Training-Free Unsupervised Prompts (TFUP), which maximally preserves the inherent representation capabilities and enhances them with a residual connection to similarity-based prediction probabilities in a training-free and labeling-free manner.

CLIP-GS: CLIP-Informed Gaussian Splatting for Real-time and View-consistent 3D Semantic Understanding

1 code implementation 22 Apr 2024 Guibiao Liao, Jiankun Li, Zhenyu Bao, Xiaoqing Ye, Jingdong Wang, Qing Li, Kanglin Liu

Additionally, to address the semantic ambiguity caused by utilizing view-inconsistent 2D CLIP semantics to supervise Gaussians, we introduce a 3D Coherent Self-training (3DCS) strategy, resorting to the multi-view consistency originating from the 3D model.

Attribute

TryOn-Adapter: Efficient Fine-Grained Clothing Identity Adaptation for High-Fidelity Virtual Try-On

1 code implementation 1 Apr 2024 Jiazheng Xing, Chao Xu, Yijie Qian, Yang Liu, Guang Dai, Baigui Sun, Yong Liu, Jingdong Wang

However, existing diffusion-based methods suffer from uncontrollable clothing identity and training inefficiency, struggling to maintain the identity even with full-parameter training; these limitations hinder their widespread application.

Virtual Try-on

DreamSalon: A Staged Diffusion Framework for Preserving Identity-Context in Editable Face Generation

no code implementations CVPR 2024 Haonan Lin, Mengmeng Wang, Yan Chen, Wenbin An, Yuzhe Yao, Guang Dai, Qianying Wang, Yong Liu, Jingdong Wang

While large-scale pre-trained text-to-image models can synthesize diverse and high-quality human-centered images, novel challenges arise with a nuanced task of "identity fine editing": precisely modifying specific features of a subject while maintaining its inherent identity and context.

Denoising Face Generation

Decoupled Pseudo-labeling for Semi-Supervised Monocular 3D Object Detection

no code implementations CVPR 2024 Jiacheng Zhang, Jiaming Li, Xiangru Lin, Wei Zhang, Xiao Tan, Junyu Han, Errui Ding, Jingdong Wang, Guanbin Li

Additionally, we present a DepthGradient Projection (DGP) module to mitigate optimization conflicts caused by noisy depth supervision of pseudo-labels, effectively decoupling the depth gradient and removing conflicting gradients.

Monocular 3D Object Detection object-detection +1

TexRO: Generating Delicate Textures of 3D Models by Recursive Optimization

no code implementations 22 Mar 2024 Jinbo Wu, Xing Liu, Chenming Wu, Xiaobo Gao, Jialun Liu, Xinqi Liu, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang

We propose an optimal viewpoint selection strategy, that finds the most miniature set of viewpoints covering all the faces of a mesh.

Denoising Texture Synthesis

Gradient-based Sampling for Class Imbalanced Semi-supervised Object Detection

1 code implementation ICCV 2023 Jiaming Li, Xiangru Lin, Wei Zhang, Xiao Tan, YingYing Li, Junyu Han, Errui Ding, Jingdong Wang, Guanbin Li

To tackle the confirmation bias from incorrect pseudo labels of minority classes, the class-rebalancing sampling module resamples unlabeled data following the guidance of the gradient-based reweighting module.

object-detection Object Detection +1

GGRt: Towards Pose-free Generalizable 3D Gaussian Splatting in Real-time

no code implementations 15 Mar 2024 Hao Li, Yuanyuan Gao, Chenming Wu, Dingwen Zhang, Yalun Dai, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang, Junwei Han

Specifically, we design a novel joint learning framework that consists of an Iterative Pose Optimization Network (IPO-Net) and a Generalizable 3D-Gaussians (G-3DG) model.

Generalizable Novel View Synthesis NeRF +1

VRP-SAM: SAM with Visual Reference Prompt

1 code implementation CVPR 2024 Yanpeng Sun, Jiahui Chen, Shan Zhang, Xinyu Zhang, Qiang Chen, Gang Zhang, Errui Ding, Jingdong Wang, Zechao Li

In this paper, we propose a novel Visual Reference Prompt (VRP) encoder that empowers the Segment Anything Model (SAM) to utilize annotated reference images as prompts for segmentation, creating the VRP-SAM model.

Meta-Learning Segmentation

GVA: Reconstructing Vivid 3D Gaussian Avatars from Monocular Videos

no code implementations 26 Feb 2024 Xinqi Liu, Chenming Wu, Jialun Liu, Xing Liu, Jinbo Wu, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang

In this paper, we present a novel method that facilitates the creation of vivid 3D Gaussian avatars from monocular video inputs (GVA).

Novel View Synthesis Pose Estimation

M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition

no code implementations 22 Jan 2024 Mengmeng Wang, Jiazheng Xing, Boyuan Jiang, Jun Chen, Jianbiao Mei, Xingxing Zuo, Guang Dai, Jingdong Wang, Yong Liu

In this paper, we introduce a novel Multimodal, Multi-task CLIP adapting framework named M2-CLIP to address these challenges, preserving both high supervised performance and robust transferability.

Action Recognition Decoder +1

Collaborative Position Reasoning Network for Referring Image Segmentation

no code implementations 22 Jan 2024 JianJian Cao, Beiya Dai, Yulin Li, Xiameng Qin, Jingdong Wang

Holi integrates features of the two modalities by a cross-modal attention mechanism, which suppresses the irrelevant redundancy under the guide of positioning information from RoCo.

Image Segmentation Position +2

MS-DETR: Efficient DETR Training with Mixed Supervision

1 code implementation CVPR 2024 Chuyang Zhao, Yifan Sun, Wenhao Wang, Qiang Chen, Errui Ding, Yi Yang, Jingdong Wang

The traditional training procedure using one-to-one supervision in the original DETR lacks direct supervision for the object detection candidates.

Decoder Object +2

Forgery-aware Adaptive Transformer for Generalizable Synthetic Image Detection

1 code implementation CVPR 2024 Huan Liu, Zichang Tan, Chuangchuang Tan, Yunchao Wei, Yao Zhao, Jingdong Wang

In this paper, we study the problem of generalizable synthetic image detection, aiming to detect forgery images from diverse generative methods, e.g., GANs and diffusion models.

Attribute Synthetic Image Detection

GIR: 3D Gaussian Inverse Rendering for Relightable Scene Factorization

1 code implementation 8 Dec 2023 Yahao Shi, Yanmin Wu, Chenming Wu, Xing Liu, Chen Zhao, Haocheng Feng, Jian Zhang, Bin Zhou, Errui Ding, Jingdong Wang

Our method achieves state-of-the-art performance in both relighting and novel view synthesis tasks among the recently proposed inverse rendering methods while achieving real-time rendering.

Disentanglement Inverse Rendering +1

Open-sourced Data Ecosystem in Autonomous Driving: the Present and Future

2 code implementations 6 Dec 2023 Hongyang Li, Yang Li, Huijie Wang, Jia Zeng, Huilin Xu, Pinlong Cai, Li Chen, Junchi Yan, Feng Xu, Lu Xiong, Jingdong Wang, Futang Zhu, Chunjing Xu, Tiancai Wang, Fei Xia, Beipeng Mu, Zhihui Peng, Dahua Lin, Yu Qiao

With the continuous maturation and application of autonomous driving technology, a systematic examination of open-source autonomous driving datasets becomes instrumental in fostering the robust evolution of the industry ecosystem.

Autonomous Driving

GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?

2 code implementations 27 Nov 2023 Wenhao Wu, Huanjin Yao, Mengxi Zhang, Yuxin Song, Wanli Ouyang, Jingdong Wang

Our study centers on the evaluation of GPT-4's linguistic and visual capabilities in zero-shot visual recognition tasks: Firstly, we explore the potential of its generated rich textual descriptions across various categories to enhance recognition performance without any training.

Zero-Shot Learning

Disentangled Representation Learning with Transmitted Information Bottleneck

no code implementations 3 Nov 2023 Zhuohang Dang, Minnan Luo, Chengyou Jia, Guang Dai, Jihong Wang, Xiaojun Chang, Jingdong Wang

Encoding only the task-related information from the raw data, i.e., disentangled representation learning, can greatly contribute to the robustness and generalizability of models.

Disentanglement Variational Inference

HAP: Structure-Aware Masked Image Modeling for Human-Centric Perception

1 code implementation NeurIPS 2023 Junkun Yuan, Xinyu Zhang, Hao Zhou, Jian Wang, Zhongwei Qiu, Zhiyin Shao, Shaofeng Zhang, Sifan Long, Kun Kuang, Kun Yao, Junyu Han, Errui Ding, Lanfen Lin, Fei Wu, Jingdong Wang

To further capture human characteristics, we propose a structure-invariant alignment loss that enforces different masked views, guided by the human part prior, to be closely aligned for the same image.

2D Pose Estimation Attribute +3

Leveraging Vision-Centric Multi-Modal Expertise for 3D Object Detection

1 code implementation NeurIPS 2023 Linyan Huang, Zhiqi Li, Chonghao Sima, Wenhai Wang, Jingdong Wang, Yu Qiao, Hongyang Li

Current research is primarily dedicated to advancing the accuracy of camera-only 3D object detectors (apprentice) through the knowledge transferred from LiDAR- or multi-modal-based counterparts (expert).

3D Object Detection object-detection

Accelerating Vision Transformers Based on Heterogeneous Attention Patterns

no code implementations 11 Oct 2023 Deli Yu, Teng Xi, Jianwei Li, Baopu Li, Gang Zhang, Haocheng Feng, Junyu Han, Jingtuo Liu, Errui Ding, Jingdong Wang

On one hand, different images share more similar attention patterns in early layers than later layers, indicating that the dynamic query-by-key self-attention matrix may be replaced with a static self-attention matrix in early layers.

Dimensionality Reduction

GridFormer: Towards Accurate Table Structure Recognition via Grid Prediction

no code implementations 26 Sep 2023 Pengyuan Lyu, Weihong Ma, Hongyi Wang, Yuechen Yu, Chengquan Zhang, Kun Yao, Yang Xue, Jingdong Wang

In this representation, the vertexes and edges of the grid store the localization and adjacency information of the table.

Prediction

PSDiff: Diffusion Model for Person Search with Iterative and Collaborative Refinement

no code implementations 20 Sep 2023 Chengyou Jia, Minnan Luo, Zhuohang Dang, Guang Dai, Xiaojun Chang, Jingdong Wang

Dominant Person Search methods aim to localize and recognize query persons in a unified network, which jointly optimizes two sub-tasks, i.e., pedestrian detection and Re-IDentification (ReID).

Denoising Pedestrian Detection +2

Unified Frequency-Assisted Transformer Framework for Detecting and Grounding Multi-Modal Manipulation

no code implementations 18 Sep 2023 Huan Liu, Zichang Tan, Qiang Chen, Yunchao Wei, Yao Zhao, Jingdong Wang

Moreover, to address the semantic conflicts between image and frequency domains, the forgery-aware mutual module is developed to further enable the effective interaction of disparate image and frequency features, resulting in aligned and comprehensive visual forgery representations.

Decoder Misinformation

Unified Pre-training with Pseudo Texts for Text-To-Image Person Re-identification

1 code implementation ICCV 2023 Zhiyin Shao, Xinyu Zhang, Changxing Ding, Jian Wang, Jingdong Wang

In this way, the pre-training task and the T2I-ReID task are made consistent with each other on both data and training levels.

Person Re-Identification

VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation

no code implementations 1 Sep 2023 Xin Li, Wenqing Chu, Ye Wu, Weihang Yuan, Fanglong Liu, Qi Zhang, Fu Li, Haocheng Feng, Errui Ding, Jingdong Wang

In this paper, we present VideoGen, a text-to-video generation approach, which can generate a high-definition video with high frame fidelity and strong temporal consistency using reference-guided latent diffusion.

Decoder Text-to-Image Generation +2

SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-form Layout-to-Image Generation

no code implementations 20 Aug 2023 Chengyou Jia, Minnan Luo, Zhuohang Dang, Guang Dai, Xiaojun Chang, Mengmeng Wang, Jingdong Wang

Despite significant progress in Text-to-Image (T2I) generative models, even lengthy and complex text descriptions still struggle to convey detailed controls.

Diversity Layout-to-Image Generation

Group Pose: A Simple Baseline for End-to-End Multi-person Pose Estimation

2 code implementations ICCV 2023 Huan Liu, Qiang Chen, Zichang Tan, Jiang-Jiang Liu, Jian Wang, Xiangbo Su, Xiaolong Li, Kun Yao, Junyu Han, Errui Ding, Yao Zhao, Jingdong Wang

State-of-the-art solutions adopt the DETR-like framework, and mainly develop the complex decoder, e.g., regarding pose estimation as keypoint box detection and combining with human detection in ED-Pose, hierarchically predicting with pose decoder and joint (keypoint) decoder in PETR.

Decoder Human Detection +1

MA-FSAR: Multimodal Adaptation of CLIP for Few-Shot Action Recognition

no code implementations 3 Aug 2023 Jiazheng Xing, Chao Xu, Mengmeng Wang, Guang Dai, Baigui Sun, Yong Liu, Jingdong Wang, Jian Zhao

To tackle these issues, we introduce MA-FSAR, a framework that employs the Parameter-Efficient Fine-Tuning (PEFT) technique to enhance the CLIP visual encoder in terms of action-related temporal and semantic representations.

Few-Shot action recognition Few Shot Action Recognition +1

Enhancing Your Trained DETRs with Box Refinement

1 code implementation 21 Jul 2023 Yiqun Chen, Qiang Chen, Peize Sun, Shoufa Chen, Jingdong Wang, Jian Cheng

We hope our work will bring the attention of the detection community to the localization bottleneck of current DETR-like models and highlight the potential of the RefineBox framework.

CPCM: Contextual Point Cloud Modeling for Weakly-supervised Point Cloud Semantic Segmentation

1 code implementation ICCV 2023 Lizhao Liu, Zhuangwei Zhuang, Shangxin Huang, Xunlong Xiao, Tianhang Xiang, Cen Chen, Jingdong Wang, Mingkui Tan

CMT disentangles the learning of supervised segmentation and unsupervised masked context prediction for effectively learning the very limited labeled points and mass unlabeled points, respectively.

Representation Learning Scene Understanding +2

What Can Simple Arithmetic Operations Do for Temporal Modeling?

2 code implementations ICCV 2023 Wenhao Wu, Yuxin Song, Zhun Sun, Jingdong Wang, Chang Xu, Wanli Ouyang

We conduct comprehensive ablation studies on the instantiation of ATMs and demonstrate that this module provides powerful temporal modeling capability at a low computational cost.

Action Classification Action Recognition +1

Semi-DETR: Semi-Supervised Object Detection with Detection Transformers

3 code implementations CVPR 2023 Jiacheng Zhang, Xiangru Lin, Wei Zhang, Kuo Wang, Xiao Tan, Junyu Han, Errui Ding, Jingdong Wang, Guanbin Li

Specifically, we propose a Stage-wise Hybrid Matching strategy that combines the one-to-many assignment and one-to-one assignment strategies to improve the training efficiency of the first stage and thus provide high-quality pseudo labels for the training of the second stage.

Object object-detection +3
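The two-stage matching described above can be illustrated with a toy assignment routine. This is a minimal sketch, assuming a precomputed query-to-ground-truth cost matrix; the matrix values, `k`, the greedy stand-in for Hungarian matching, and the function names are all illustrative, not Semi-DETR's actual implementation:

```python
import numpy as np

def one_to_many_assign(cost, k=3):
    # stage one: each ground-truth box (column) collects its k cheapest
    # queries (rows), so many queries receive supervision early in training
    return {g: np.argsort(cost[:, g])[:k].tolist() for g in range(cost.shape[1])}

def one_to_one_assign(cost):
    # stage two: one query per ground-truth box (a greedy stand-in for the
    # Hungarian matching used by DETR-style detectors)
    assign, free = {}, set(range(cost.shape[0]))
    for g in range(cost.shape[1]):
        q = min(free, key=lambda i: cost[i, g])
        assign[g] = q
        free.remove(q)
    return assign

rng = np.random.default_rng(0)
cost = rng.random((10, 3))          # 10 queries x 3 ground-truth boxes
stage1 = one_to_many_assign(cost)   # each GT -> 3 candidate queries
stage2 = one_to_one_assign(cost)    # each GT -> exactly 1 query
```

Stage one densifies the supervision signal; stage two restores the one-to-one property that makes DETR-style detectors NMS-free.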

Multi-Modal 3D Object Detection by Box Matching

1 code implementation12 May 2023 Zhe Liu, Xiaoqing Ye, Zhikang Zou, Xinwei He, Xiao Tan, Errui Ding, Jingdong Wang, Xiang Bai

Extensive experiments on the nuScenes dataset demonstrate that our method is much more stable in dealing with challenging cases such as asynchronous sensors, misaligned sensor placement, and degenerated camera images than existing fusion methods.

3D Object Detection Autonomous Driving +2

StyleSync: High-Fidelity Generalized and Personalized Lip Sync in Style-based Generator

no code implementations CVPR 2023 Jiazhi Guan, Zhanwang Zhang, Hang Zhou, Tianshu Hu, Kaisiyuan Wang, Dongliang He, Haocheng Feng, Jingtuo Liu, Errui Ding, Ziwei Liu, Jingdong Wang

Despite recent advances in syncing lip movements with any audio waves, current methods still struggle to balance generation quality and the model's generalization ability.

Exploring Effective Factors for Improving Visual In-Context Learning

1 code implementation10 Apr 2023 Yanpeng Sun, Qiang Chen, Jian Wang, Jingdong Wang, Zechao Li

By doing this, the model can leverage the diverse knowledge stored in different parts of the model to improve its performance on new tasks.

In-Context Learning Meta-Learning +1

ByteTrackV2: 2D and 3D Multi-Object Tracking by Associating Every Detection Box

no code implementations27 Mar 2023 Yifu Zhang, Xinggang Wang, Xiaoqing Ye, Wei Zhang, Jincheng Lu, Xiao Tan, Errui Ding, Peize Sun, Jingdong Wang

We propose a hierarchical data association strategy to mine the true objects in low-score detection boxes, which alleviates the problems of object missing and fragmented trajectories.

3D Multi-Object Tracking motion prediction +1
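The hierarchical (BYTE-style) association can be sketched for a single frame as follows. This is a simplified illustration: the thresholds, the plain IoU metric, and the greedy matcher are stand-ins for the paper's motion-compensated matching, and all box values are invented:

```python
def iou(a, b):
    # intersection-over-union of two [x1, y1, x2, y2] boxes
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, dets, scores, high=0.6, iou_thr=0.3):
    # two-stage association: match high-score detections first, then try
    # low-score detections against the still-unmatched tracks, recovering
    # true objects that would otherwise be discarded by a score filter
    hi = [i for i, s in enumerate(scores) if s >= high]
    lo = [i for i, s in enumerate(scores) if s < high]
    matches, free_tracks = [], list(range(len(tracks)))
    for group in (hi, lo):
        for d in group:
            best, best_iou = None, iou_thr
            for t in free_tracks:
                v = iou(tracks[t], dets[d])
                if v > best_iou:
                    best, best_iou = t, v
            if best is not None:
                matches.append((best, d))
                free_tracks.remove(best)
    return matches

tracks = [[0, 0, 10, 10], [20, 20, 30, 30]]   # existing track boxes
dets = [[1, 1, 11, 11], [21, 21, 31, 31]]     # current-frame detections
scores = [0.9, 0.3]                            # detection 1 is low-confidence
matches = associate(tracks, dets, scores)      # [(0, 0), (1, 1)]
```

Here the low-score detection (0.3) still recovers track 1 in the second pass, which is exactly how fragmented trajectories are avoided.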

Ambiguity-Resistant Semi-Supervised Learning for Dense Object Detection

1 code implementation CVPR 2023 Chang Liu, Weiming Zhang, Xiangru Lin, Wei Zhang, Xiao Tan, Junyu Han, Xiaomao Li, Errui Ding, Jingdong Wang

It employs a "divide-and-conquer" strategy and separately exploits positives for the classification and localization task, which is more robust to the assignment ambiguity.

Dense Object Detection Object +3
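The "divide-and-conquer" selection above can be sketched in a few lines. This is a hypothetical simplification: the thresholds, inputs, and function name are invented for illustration, and the paper's actual criteria are more elaborate than simple thresholding:

```python
def split_positives(cls_scores, ious, cls_thr=0.5, loc_thr=0.6):
    # select positives independently per task: a candidate may supervise
    # classification, localization, both, or neither, so an ambiguous
    # pseudo label no longer has to count as positive for both tasks at once
    cls_pos = [i for i, s in enumerate(cls_scores) if s >= cls_thr]
    loc_pos = [i for i, u in enumerate(ious) if u >= loc_thr]
    return cls_pos, loc_pos

# candidate 0 is confident but poorly localized; candidate 1 is the reverse
cls_pos, loc_pos = split_positives([0.9, 0.4, 0.7], [0.5, 0.8, 0.7])
```

Decoupling the two positive sets is what makes the training robust to assignment ambiguity: a pseudo box can still teach localization even when its class score is unreliable, and vice versa.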

IRGen: Generative Modeling for Image Retrieval

1 code implementation17 Mar 2023 Yidan Zhang, Ting Zhang, Dong Chen, Yujing Wang, Qi Chen, Xing Xie, Hao Sun, Weiwei Deng, Qi Zhang, Fan Yang, Mao Yang, Qingmin Liao, Jingdong Wang, Baining Guo

While generative modeling has become prevalent across numerous research fields, its integration into the realm of image retrieval remains largely unexplored and underjustified.

Image Retrieval Retrieval

StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training

1 code implementation1 Mar 2023 Yuechen Yu, Yulin Li, Chengquan Zhang, Xiaoqiang Zhang, Zengyuan Guo, Xiameng Qin, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang

Compared to the masked multi-modal modeling methods for document image understanding that rely on both the image and text modalities, StrucTexTv2 models image-only input and potentially deals with more application scenarios free from OCR pre-processing.

Document Image Classification Language Modeling +4

Understanding Self-Supervised Pretraining with Part-Aware Representation Learning

1 code implementation27 Jan 2023 Jie Zhu, Jiyang Qi, Mingyu Ding, Xiaokang Chen, Ping Luo, Xinggang Wang, Wenyu Liu, Leye Wang, Jingdong Wang

The study is mainly motivated by that random views, used in contrastive learning, and random masked (visible) patches, used in masked image modeling, are often about object parts.

Contrastive Learning Object +1

Graph Contrastive Learning for Skeleton-based Action Recognition

1 code implementation26 Jan 2023 Xiaohu Huang, Hao Zhou, Jian Wang, Haocheng Feng, Junyu Han, Errui Ding, Jingdong Wang, Xinggang Wang, Wenyu Liu, Bin Feng

In this paper, we propose a graph contrastive learning framework for skeleton-based action recognition (SkeletonGCL) to explore the global context across all sequences.

Action Recognition Contrastive Learning +2

UATVR: Uncertainty-Adaptive Text-Video Retrieval

1 code implementation ICCV 2023 Bo Fang, Wenhao Wu, Chang Liu, Yu Zhou, Yuxin Song, Weiping Wang, Xiangbo Shu, Xiangyang Ji, Jingdong Wang

In the refined embedding space, we represent text-video pairs as probabilistic distributions where prototypes are sampled for matching evaluation.

Retrieval Semantic correspondence +1

σ-Adaptive Decoupled Prototype for Few-Shot Object Detection

no code implementations ICCV 2023 Jinhao Du, Shan Zhang, Qiang Chen, Haifeng Le, Yanpeng Sun, Yao Ni, Jian Wang, Bin He, Jingdong Wang

To provide precise information for the query image, the prototype is decoupled into task-specific ones, which provide tailored guidance for 'where to look' and 'what to look for', respectively.

Few-Shot Object Detection Meta-Learning +3

CFCG: Semi-Supervised Semantic Segmentation via Cross-Fusion and Contour Guidance Supervision

no code implementations ICCV 2023 Shuo Li, Yue He, Weiming Zhang, Wei Zhang, Xiao Tan, Junyu Han, Errui Ding, Jingdong Wang

Current state-of-the-art semi-supervised semantic segmentation (SSSS) methods typically adopt pseudo labeling and consistency regularization between multiple learners with different perturbations.

Semi-Supervised Semantic Segmentation

Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models

5 code implementations CVPR 2023 Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang

In this paper, we propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge: i) We introduce the Video Attribute Association mechanism, which leverages the Video-to-Text knowledge to generate textual auxiliary attributes for complementing video recognition.

Action Classification Action Recognition +3

Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?

4 code implementations CVPR 2023 Wenhao Wu, Haipeng Luo, Bo Fang, Jingdong Wang, Wanli Ouyang

Most existing text-video retrieval methods focus on cross-modal matching between the visual content of videos and textual query sentences.

Data Augmentation Retrieval +2

Augmentation Matters: A Simple-yet-Effective Approach to Semi-supervised Semantic Segmentation

1 code implementation CVPR 2023 Zhen Zhao, Lihe Yang, Sifan Long, Jimin Pi, Luping Zhou, Jingdong Wang

Differently, in this work, we follow a standard teacher-student framework and propose AugSeg, a simple and clean approach that focuses mainly on data perturbations to boost the SSS performance.

Semi-Supervised Semantic Segmentation

Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers

no code implementations9 Dec 2022 Yasheng Sun, Hang Zhou, Kaisiyuan Wang, Qianyi Wu, Zhibin Hong, Jingtuo Liu, Errui Ding, Jingdong Wang, Ziwei Liu, Hideki Koike

This requires masking a large percentage of the original image and seamlessly inpainting it with the aid of audio and reference frames.

Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition

2 code implementations22 Nov 2022 Jiaxiang Tang, Kaisiyuan Wang, Hang Zhou, Xiaokang Chen, Dongliang He, Tianshu Hu, Jingtuo Liu, Gang Zeng, Jingdong Wang

While dynamic Neural Radiance Fields (NeRF) have shown success in high-fidelity 3D modeling of talking portraits, the slow training and inference speed severely obstruct their potential usage.

NeRF Talking Face Generation

Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers

1 code implementation CVPR 2023 Sifan Long, Zhen Zhao, Jimin Pi, Shengsheng Wang, Jingdong Wang

In this paper, we emphasize the crucial role of diverse global semantics and propose an efficient token decoupling and merging method that can jointly consider the token importance and diversity for token pruning.

Computational Efficiency Diversity +1
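The importance-plus-diversity idea can be sketched roughly as follows. This is a toy version under assumed inputs: the importance scores, the running-average merge rule, and the function name are illustrative stand-ins for the paper's attention-based decoupling and merging:

```python
import numpy as np

def decouple_and_merge(tokens, importance, keep=4):
    # keep the `keep` most important tokens; fuse each remaining token into
    # its most similar kept token (running average) so that less attentive
    # but semantically distinct content is merged rather than discarded
    order = np.argsort(importance)[::-1]
    kept_idx, rest_idx = order[:keep], order[keep:]
    kept = tokens[kept_idx].copy()
    counts = np.ones(keep)

    def unit(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-9)

    for r in rest_idx:
        sims = unit(kept) @ unit(tokens[r])       # cosine similarity to kept set
        j = int(np.argmax(sims))
        kept[j] = (kept[j] * counts[j] + tokens[r]) / (counts[j] + 1)
        counts[j] += 1
    return kept

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))     # 16 tokens, embedding dim 8
importance = rng.random(16)           # e.g. attention-derived scores
pruned = decouple_and_merge(tokens, importance, keep=4)
```

Merging instead of dropping is the diversity-preserving part: the pruned sequence still summarizes regions the attention map ranked low.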

Instance-specific and Model-adaptive Supervision for Semi-supervised Semantic Segmentation

1 code implementation CVPR 2023 Zhen Zhao, Sifan Long, Jimin Pi, Jingdong Wang, Luping Zhou

Relying on the model's performance, iMAS employs a class-weighted symmetric intersection-over-union to evaluate quantitative hardness of each unlabeled instance and supervises the training on unlabeled data in a model-adaptive manner.

Segmentation Semi-Supervised Semantic Segmentation
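The class-weighted symmetric-IoU hardness measure can be sketched as below. This is a minimal sketch under assumptions: the inputs are label maps from a hypothetical student and teacher, and the exact weighting scheme in iMAS is more involved than this plain weighted average:

```python
import numpy as np

def instance_hardness(student_pred, teacher_pred, class_weights):
    # per-class IoU between student and teacher label maps, averaged with
    # class weights; low student-teacher agreement flags a hard instance
    ious, weights = [], []
    for c, w in class_weights.items():
        s, t = student_pred == c, teacher_pred == c
        union = np.logical_or(s, t).sum()
        if union == 0:
            continue  # class absent in both predictions: no evidence
        ious.append(np.logical_and(s, t).sum() / union)
        weights.append(w)
    agreement = float(np.average(ious, weights=weights)) if ious else 1.0
    return 1.0 - agreement  # higher value = harder instance

student = np.array([[0, 0], [1, 1]])
teacher = np.array([[0, 0], [1, 0]])
h = instance_hardness(student, teacher, {0: 1.0, 1: 1.0})
```

The hardness score can then scale the unsupervised loss per instance, which is the "model-adaptive" supervision the snippet refers to.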

Group DETR v2: Strong Object Detector with Encoder-Decoder Pretraining

no code implementations arXiv 2022 Qiang Chen, Jian Wang, Chuchu Han, Shan Zhang, Zexian Li, Xiaokang Chen, Jiahui Chen, Xiaodi Wang, Shuming Han, Gang Zhang, Haocheng Feng, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang

The training process consists of self-supervised pretraining and finetuning a ViT-Huge encoder on ImageNet-1K, pretraining the detector on Object365, and finally finetuning it on COCO.

Decoder Object +2

It Takes Two: Masked Appearance-Motion Modeling for Self-supervised Video Transformer Pre-training

no code implementations11 Oct 2022 Yuxin Song, Min Yang, Wenhao Wu, Dongliang He, Fu Li, Jingdong Wang

In order to guide the encoder to fully excavate spatial-temporal features, two separate decoders are used for two pretext tasks of disentangled appearance and motion prediction.

Decoder motion prediction +1

StyleSwap: Style-Based Generator Empowers Robust Face Swapping

no code implementations27 Sep 2022 Zhiliang Xu, Hang Zhou, Zhibin Hong, Ziwei Liu, Jiaming Liu, Zhizhi Guo, Junyu Han, Jingtuo Liu, Errui Ding, Jingdong Wang

Our core idea is to leverage a style-based generator to empower high-fidelity and robust face swapping, thus the generator's advantage can be adopted for optimizing identity similarity.

Face Swapping

NeRF-Loc: Transformer-Based Object Localization Within Neural Radiance Fields

no code implementations24 Sep 2022 Jiankai Sun, Yan Xu, Mingyu Ding, Hongwei Yi, Chen Wang, Jingdong Wang, Liangjun Zhang, Mac Schwager

Using current NeRF training tools, a robot can train a NeRF environment model in real-time and, using our algorithm, identify 3D bounding boxes of objects of interest within the NeRF for downstream navigation or manipulation tasks.

NeRF Object Localization +1

TRUST: An Accurate and End-to-End Table structure Recognizer Using Splitting-based Transformers

no code implementations31 Aug 2022 Zengyuan Guo, Yuechen Yu, Pengyuan Lv, Chengquan Zhang, Haojie Li, Zhihui Wang, Kun Yao, Jingtuo Liu, Jingdong Wang

The Vertex-based Merging Module is capable of aggregating local contextual information between adjacent basic grids, providing the ability to accurately merge basic grids that belong to the same spanning cell.

Table Recognition
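Merging adjacent basic grids into spanning cells is naturally expressed as growing connected components. The sketch below uses union-find with hypothetical merge links; the actual module predicts these links from local visual context, which is not modeled here:

```python
class GridMerger:
    # union-find over basic grid cells: each predicted merge link between
    # adjacent grids joins two components; components = spanning cells
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, a):
        while self.parent[a] != a:
            self.parent[a] = self.parent[self.parent[a]]  # path halving
            a = self.parent[a]
        return a

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# a 2x3 grid (indices 0..5); suppose the merging module predicts that
# grids 1 and 2 form one spanning cell, and grids 3 and 4 another
m = GridMerger(6)
for a, b in [(1, 2), (3, 4)]:
    m.union(a, b)

cells = {}
for g in range(6):
    cells.setdefault(m.find(g), []).append(g)
spanning_cells = sorted(cells.values())   # [[0], [1, 2], [3, 4], [5]]
```

Because union-find is order-independent, the same cells are recovered no matter in which order the pairwise merge decisions arrive.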