no code implementations • 3 Jan 2025 • Guosheng Zhang, Keyao Wang, Haixiao Yue, Ajian Liu, Gang Zhang, Kun Yao, Errui Ding, Jingdong Wang
Face Anti-Spoofing (FAS) is essential for ensuring the security and reliability of facial recognition systems.
no code implementations • 21 Dec 2024 • Huan Liu, Lingyu Xiao, JiangJiang Liu, Xiaofan Li, Ze Feng, Sen yang, Jingdong Wang
To understand the factors driving this improvement, we conduct an in-depth analysis of the network architecture, data selection, and training recipe used in public MLLMs.
1 code implementation • 18 Dec 2024 • Yanpeng Sun, Jing Hao, Ke Zhu, Jiang-Jiang Liu, Yuxiang Zhao, Xiaofan Li, Gang Zhang, Zechao Li, Jingdong Wang
We propose to leverage off-the-shelf visual specialists, which were trained from annotated images initially not for image captioning, for enhancing the image caption.
no code implementations • 14 Dec 2024 • Dengyang Jiang, Haoyu Wang, Lei Zhang, Wei Wei, Guang Dai, Mengmeng Wang, Jingdong Wang, Yanning Zhang
Pre-training backbone networks on a general annotated dataset (e. g., ImageNet) that comprises numerous manually collected images with category annotations has proven to be indispensable for enhancing the generalization capacity of downstream visual tasks.
no code implementations • 13 Dec 2024 • Mengmeng Wang, Teli Ma, Shuo Xin, Xiaojun Hou, Jiazheng Xing, Guang Dai, Jingdong Wang, Yong liu
Specifically, we first review three types of mainstream single-modal VOT, including RGB, thermal infrared and point cloud tracking.
no code implementations • 11 Dec 2024 • Sinan Du, Guosheng Zhang, Keyao Wang, Yuanrui Wang, Haixiao Yue, Gang Zhang, Errui Ding, Jingdong Wang, Zhengzhuo Xu, Chun Yuan
Parameter-efficient transfer learning (PETL) has become a promising paradigm for adapting large-scale vision foundation models to downstream tasks.
1 code implementation • 1 Dec 2024 • Jiahao Cui, Hui Li, Yun Zhan, Hanlin Shang, Kaihui Cheng, Yuqi Ma, Shan Mu, Hang Zhou, Jingdong Wang, Siyu Zhu
Existing methodologies for animating portrait images face significant challenges, particularly in handling non-frontal perspectives, rendering dynamic objects around the portrait, and generating immersive, realistic backgrounds.
no code implementations • 28 Nov 2024 • Hui Li, Mingwang Xu, Yun Zhan, Shan Mu, Jiaye Li, Kaihui Cheng, Yuxuan Chen, Tan Chen, Mao Ye, Jingdong Wang, Siyu Zhu
Recent advancements in visual generation technologies have markedly increased the scale and availability of video datasets, which are crucial for training effective video generation models.
no code implementations • 22 Nov 2024 • Sen yang, Minyue Jiang, Ziwei Fan, Xiaolu Xie, Xiao Tan, YingYing Li, Errui Ding, Liang Wang, Jingdong Wang
Recent advances in autonomous driving systems have shifted towards reducing reliance on high-definition maps (HDMaps) due to the huge costs of annotation and maintenance.
no code implementations • 22 Nov 2024 • Ke Zhu, Yu Wang, Yanpeng Sun, Qiang Chen, JiangJiang Liu, Gang Zhang, Jingdong Wang
Our nSFT disentangles this negative supervision in RLHF paradigm, and continually aligns VLMs with a simple SFT loss.
no code implementations • 19 Nov 2024 • Hao Li, Yuanyuan Gao, Haosong Peng, Chenming Wu, Weicai Ye, Yufeng Zhan, Chen Zhao, Dingwen Zhang, Jingdong Wang, Junwei Han
This paper presents DGTR, a novel distributed framework for efficient Gaussian reconstruction for sparse-view vast scenes.
no code implementations • 30 Oct 2024 • Jie Zhu, Yixiong Chen, Mingyu Ding, Ping Luo, Leye Wang, Jingdong Wang
These datasets collectively provide a rich prior knowledge base to enhance the human-centric image generation capabilities of the diffusion model.
no code implementations • 24 Oct 2024 • Haonan Lin, Mengmeng Wang, Jiahao Wang, Wenbin An, Yan Chen, Yong liu, Feng Tian, Guang Dai, Jingdong Wang, Qianying Wang
To resolve this, we introduce the Logistic Schedule, a novel noise schedule designed to eliminate singularities, improve inversion stability, and provide a better noise space for image editing.
no code implementations • 17 Oct 2024 • Yanpeng Sun, Huaxin Zhang, Qiang Chen, Xinyu Zhang, Nong Sang, Gang Zhang, Jingdong Wang, Zechao Li
QLadder employs a learnable ``\textit{ladder}'' structure to deeply aggregates the intermediate representations from the frozen pretrained visual encoder (e. g., CLIP image encoder).
Ranked #161 on
Visual Question Answering
on MM-Vet
no code implementations • 14 Oct 2024 • Jiazhi Guan, Quanwei Yang, Kaisiyuan Wang, Hang Zhou, Shengyi He, Zhiliang Xu, Haocheng Feng, Errui Ding, Jingdong Wang, Hongtao Xie, Youjian Zhao, Ziwei Liu
We propose a Motion-Enhanced Textural Alignment module to enhance the bond between driving and target signals.
1 code implementation • 10 Oct 2024 • Jiahao Cui, Hui Li, Yao Yao, Hao Zhu, Hanlin Shang, Kaihui Cheng, Hang Zhou, Siyu Zhu, Jingdong Wang
To the best of our knowledge, Hallo2, proposed in this paper, is the first method to achieve 4K resolution and generate hour-long, audio-driven portrait image animations enhanced with textual prompts.
no code implementations • 10 Oct 2024 • Jing Yang, Minyue Jiang, Sen yang, Xiao Tan, YingYing Li, Errui Ding, Hanli Wang, Jingdong Wang
The construction of Vectorized High-Definition (HD) map typically requires capturing both category and geometry information of map elements.
no code implementations • 29 Sep 2024 • Haonan Lin, Wenbin An, Jiahao Wang, Yan Chen, Feng Tian, Mengmeng Wang, Guang Dai, Qianying Wang, Jingdong Wang
Recent advancements have shown promise in applying traditional Semi-Supervised Learning strategies to the task of Generalized Category Discovery (GCD).
1 code implementation • 24 Sep 2024 • Lingyu Xiao, Jiang-Jiang Liu, Sen yang, Xiaofan Li, Xiaoqing Ye, Wankou Yang, Jingdong Wang
In this paper, we explore the feasibility of deriving decisions from an autoregressive world model by addressing these challenges through the formulation of multiple probabilistic hypotheses.
1 code implementation • 24 Sep 2024 • Chuyang Zhao, Yuxing Song, Wenhao Wang, Haocheng Feng, Errui Ding, Yifan Sun, Xinyan Xiao, Jingdong Wang
Most existing multimodality methods use separate backbones for autoregression-based discrete text generation and diffusion-based continuous visual generation, or the same backbone by discretizing the visual data to use autoregression for both text and visual generation.
no code implementations • 20 Sep 2024 • Jing Hao, Yuxiang Zhao, Song Chen, Yanpeng Sun, Qiang Chen, Gang Zhang, Kun Yao, Errui Ding, Jingdong Wang
To this end, we devised the FullAnno system, which is a data engine that can generate large-scale, high-quality, and fine-grained image annotations consisting of the category and position of objects, region descriptions, text information, as well as image dense captions.
no code implementations • 7 Sep 2024 • Jiahao Wang, Caixia Yan, Weizhan Zhang, Haonan Lin, Mengmeng Wang, Guang Dai, Tieliang Gong, Hao Sun, Jingdong Wang
For these issues, we pioneer a novel task, Layout-to-Consistent-Image (L2CI) generation, which produces consistent and compositional images in accordance with the given layout conditions and text prompts.
1 code implementation • 1 Sep 2024 • Dingyuan Zhang, Dingkang Liang, Zichang Tan, Xiaoqing Ye, Cheng Zhang, Jingdong Wang, Xiang Bai
Slow inference speed is one of the most crucial concerns for deploying multi-view 3D detectors to tasks with high real-time requirements like autonomous driving.
no code implementations • 10 Aug 2024 • Zhuohang Dang, Minnan Luo, Jihong Wang, Chengyou Jia, Haochen Han, Herun Wan, Guang Dai, Xiaojun Chang, Jingdong Wang
Moreover, although intuitive, directly applying previous cross-modal disentanglement methods suffers from limited noise tolerance and disentanglement efficacy.
no code implementations • 6 Aug 2024 • Jiazhi Guan, Zhiliang Xu, Hang Zhou, Kaisiyuan Wang, Shengyi He, Zhanwang Zhang, Borong Liang, Haocheng Feng, Errui Ding, Jingtuo Liu, Jingdong Wang, Youjian Zhao, Ziwei Liu
Lip-syncing videos with given audio is the foundation for various applications including the creation of virtual presenters or performers.
1 code implementation • 30 Jul 2024 • Lingfeng Yang, Xinyu Zhang, Xiang Li, Jinwen Chen, Kun Yao, Gang Zhang, Errui Ding, Lingqiao Liu, Jingdong Wang, Jian Yang
Our work contributes in three aspects: proposing a dataset containing numerous instructed image pairs; fine-tuning a diffusion model for rational generation; and generating synthetic data to boost downstream tasks.
1 code implementation • 25 Jul 2024 • Zhe Liu, Jinghua Hou, Xinyu Wang, Xiaoqing Ye, Jingdong Wang, Hengshuang Zhao, Xiang Bai
To tackle this problem, we simply introduce a 3D spatial feature descriptor and integrate it into the linear group RNN operators to enhance their spatial features rather than blindly increasing the number of scanning orders for voxel features.
Ranked #1 on
3D Object Detection
on Waymo Open Dataset
1 code implementation • 22 Jul 2024 • Yiran Yang, Xu Gao, Tong Wang, Xin Hao, Yifeng Shi, Xiao Tan, Xiaoqing Ye, Jingdong Wang
This module adjusts the feature distributions from both the camera and LiDAR, bringing them closer to the ground truth domain and minimizing differences.
no code implementations • 21 Jul 2024 • Yiqun Zhao, Chenming Wu, Binbin Huang, YiHao Zhi, Chen Zhao, Jingdong Wang, Shenghua Gao
Efficient and accurate reconstruction of a relightable, dynamic clothed human avatar from a monocular video is crucial for the entertainment industry.
1 code implementation • 16 Jul 2024 • Penghui Du, Yu Wang, Yifan Sun, Luting Wang, Yue Liao, Gang Zhang, Errui Ding, Yan Wang, Jingdong Wang, Si Liu
Existing methods enhance open-vocabulary object detection by leveraging the robust open-vocabulary recognition capabilities of Vision-Language Models (VLMs), such as CLIP. However, two main challenges emerge:(1) A deficiency in concept representation, where the category names in CLIP's text space lack textual and visual knowledge.
Ranked #1 on
Open Vocabulary Object Detection
on LVIS v1.0
1 code implementation • 15 Jul 2024 • Zhe Liu, Jinghua Hou, Xiaoqing Ye, Tong Wang, Jingdong Wang, Xiang Bai
We argue that the main challenges are twofold: 1) How to obtain the appropriate object queries is challenging due to the high sparsity and uneven distribution of point clouds; 2) How to implement an effective query interaction by exploiting the rich geometric structure of point clouds is not fully explored.
1 code implementation • 15 Jul 2024 • Jinghua Hou, Tong Wang, Xiaoqing Ye, Zhe Liu, Shi Gong, Xiao Tan, Errui Ding, Jingdong Wang, Xiang Bai
Accurate depth information is crucial for enhancing the performance of multi-view 3D object detection.
1 code implementation • 15 Jul 2024 • Yu Wang, Xiangbo Su, Qiang Chen, Xinyu Zhang, Teng Xi, Kun Yao, Errui Ding, Gang Zhang, Jingdong Wang
Open-vocabulary object detection focusing on detecting novel categories guided by natural language.
no code implementations • 4 Jul 2024 • Yuzhe Yao, Feng Tian, Jun Chen, Haonan Lin, Guang Dai, Yong liu, Jingdong Wang
This accumulation of error becomes particularly problematic in low-precision scenarios, leading to significant distortions in the generated images.
1 code implementation • 1 Jul 2024 • Mingxiang Liao, Hannan Lu, Xinyu Zhang, Fang Wan, Tianyu Wang, Yuzhong Zhao, WangMeng Zuo, Qixiang Ye, Jingdong Wang
For this purpose, we establish a new benchmark comprising text prompts that fully reflect multiple dynamics grades, and define a set of dynamics scores corresponding to various temporal granularities to comprehensively evaluate the dynamics of each generated video.
no code implementations • 26 Jun 2024 • Hao Li, Ming Yuan, Yan Zhang, Chenming Wu, Chen Zhao, Chunyu Song, Haocheng Feng, Errui Ding, Dingwen Zhang, Jingdong Wang
To address this, this paper presents a novel driving view synthesis dataset and benchmark specifically designed for autonomous driving simulations.
no code implementations • 26 Jun 2024 • Hao Li, Jingfeng Li, Dingwen Zhang, Chenming Wu, Jieqi Shi, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang, Junwei Han
Dynamic Gaussian splatting has led to impressive scene reconstruction and image synthesis advances in novel views.
1 code implementation • 13 Jun 2024 • Yuchi Liu, Yifan Sun, Jingdong Wang, Liang Zheng
This paper evaluates the generalization ability of classification models on out-of-distribution test sets without depending on ground truth labels.
1 code implementation • CVPR 2024 • Wenjie Wang, Yehao Lu, Guangcong Zheng, Shuigen Zhan, Xiaoqing Ye, Zichang Tan, Jingdong Wang, Gaoang Wang, Xi Li
Vision-based roadside 3D object detection has attracted rising attention in autonomous driving domain, since it encompasses inherent advantages in reducing blind spots and expanding perception range.
no code implementations • 13 Jun 2024 • Mingwang Xu, Hui Li, Qingkun Su, Hanlin Shang, Liwei Zhang, Ce Liu, Jingdong Wang, Yao Yao, Siyu Zhu
Moving away from traditional paradigms that rely on parametric models for intermediate facial representations, our innovative approach embraces the end-to-end diffusion paradigm and introduces a hierarchical audio-driven visual synthesis module to enhance the precision of alignment between audio inputs and visual outputs, encompassing lip, expression, and pose motion.
no code implementations • 13 Jun 2024 • Zhengqi Zhao, Xiaohu Huang, Hao Zhou, Kun Yao, Errui Ding, Jingdong Wang, Xinggang Wang, Wenyu Liu, Bin Feng
The key to action counting is accurately locating each video's repetitive actions.
1 code implementation • 5 Jun 2024 • Qiang Chen, Xiangbo Su, Xinyu Zhang, Jian Wang, Jiahui Chen, Yunpeng Shen, Chuchu Han, Ziliang Chen, Weixiang Xu, Fanrong Li, Shan Zhang, Kun Yao, Errui Ding, Gang Zhang, Jingdong Wang
In this paper, we present a light-weight detection transformer, LW-DETR, which outperforms YOLOs for real-time object detection.
no code implementations • 4 Jun 2024 • Yanmin Wu, Jiarui Meng, Haijie Li, Chenming Wu, Yahao Shi, Xinhua Cheng, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang, Jian Zhang
To ensure robust feature presentation and 3D point-level understanding, we first employ SAM masks without cross-frame associations to train instance features with 3D consistency.
no code implementations • 31 May 2024 • Pengyuan Lyu, Yulin Li, Hao Zhou, Weihong Ma, Xingyu Wan, Qunyi Xie, Liang Wu, Chengquan Zhang, Kun Yao, Errui Ding, Jingdong Wang
Text-rich images have significant and extensive value, deeply integrated into various aspects of human life.
no code implementations • 30 May 2024 • Xingyu Wan, Chengquan Zhang, Pengyuan Lyu, Sen Fan, Zihan Ni, Kun Yao, Errui Ding, Jingdong Wang
Existing OCR engines or document image analysis systems typically rely on training separate models for text detection in varying scenarios and granularities, leading to significant computational complexity and resource demands.
Document Layout Analysis
Optical Character Recognition (OCR)
+3
no code implementations • 28 May 2024 • Zebin You, Xinyu Zhang, Hanzhong Guo, Jingdong Wang, Chongxuan Li
However, through distribution classification tasks, we reveal that, from the perspective of neural network-based classifiers, even advanced diffusion models are still far from this goal.
1 code implementation • 22 May 2024 • Huanjin Yao, Wenhao Wu, Taojiannan Yang, Yuxin Song, Mengxi Zhang, Haocheng Feng, Yifan Sun, Zhiheng Li, Wanli Ouyang, Jingdong Wang
We witness the rise of larger and higher-quality instruction datasets, as well as the involvement of larger-sized LLMs.
1 code implementation • 22 May 2024 • Dingwen Zhang, Hao Li, Diqi He, Nian Liu, Lechao Cheng, Jingdong Wang, Junwei Han
Experimental evaluations conducted on MS COCO, Cityscapes, and CTW1500 datasets indicate that the QEIS models' performance can be significantly improved when pre-trained with our method.
1 code implementation • 18 May 2024 • Mengxi Zhang, Wenhao Wu, Yu Lu, Yuxin Song, Kang Rong, Huanjin Yao, Jianbo Zhao, Fanglong Liu, Yifan Sun, Haocheng Feng, Jingdong Wang
To verify our viewpoint, we present the Automated Multi-level Preference (AMP) framework for MLLMs.
no code implementations • 30 Apr 2024 • Zhexi Peng, Tianjia Shao, Yong liu, Jingke Zhou, Yin Yang, Jingdong Wang, Kun Zhou
We present Real-time Gaussian SLAM (RTG-SLAM), a real-time 3D reconstruction system with an RGBD camera for large-scale environments using Gaussian splatting.
1 code implementation • 25 Apr 2024 • Sifan Long, Linbin Wang, Zhen Zhao, Zichang Tan, Yiming Wu, Shengsheng Wang, Jingdong Wang
In light of this, we propose Training-Free Unsupervised Prompts (TFUP), which maximally preserves the inherent representation capabilities and enhances them with a residual connection to similarity-based prediction probabilities in a training-free and labeling-free manner.
1 code implementation • 22 Apr 2024 • Guibiao Liao, Jiankun Li, Zhenyu Bao, Xiaoqing Ye, Jingdong Wang, Qing Li, Kanglin Liu
Additionally, to address the semantic ambiguity, caused by utilizing view-inconsistent 2D CLIP semantics to supervise Gaussians, we introduce a 3D Coherent Self-training (3DCS) strategy, resorting to the multi-view consistency originated from the 3D model.
1 code implementation • 1 Apr 2024 • Jiazheng Xing, Chao Xu, Yijie Qian, Yang Liu, Guang Dai, Baigui Sun, Yong liu, Jingdong Wang
However, the clothing identity uncontrollability and training inefficiency of existing diffusion-based methods, which struggle to maintain the identity even with full parameter training, are significant limitations that hinder the widespread applications.
no code implementations • CVPR 2024 • Haonan Lin, Mengmeng Wang, Yan Chen, Wenbin An, Yuzhe Yao, Guang Dai, Qianying Wang, Yong liu, Jingdong Wang
While large-scale pre-trained text-to-image models can synthesize diverse and high-quality human-centered images, novel challenges arise with a nuanced task of "identity fine editing": precisely modifying specific features of a subject while maintaining its inherent identity and context.
no code implementations • CVPR 2024 • Jiacheng Zhang, Jiaming Li, Xiangru Lin, Wei zhang, Xiao Tan, Junyu Han, Errui Ding, Jingdong Wang, Guanbin Li
Additionally, we present a DepthGradient Projection (DGP) module to mitigate optimization conflicts caused by noisy depth supervision of pseudo-labels, effectively decoupling the depth gradient and removing conflicting gradients.
no code implementations • 22 Mar 2024 • Jinbo Wu, Xing Liu, Chenming Wu, Xiaobo Gao, Jialun Liu, Xinqi Liu, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang
We propose an optimal viewpoint selection strategy, that finds the most miniature set of viewpoints covering all the faces of a mesh.
1 code implementation • ICCV 2023 • Jiaming Li, Xiangru Lin, Wei zhang, Xiao Tan, YingYing Li, Junyu Han, Errui Ding, Jingdong Wang, Guanbin Li
To tackle the confirmation bias from incorrect pseudo labels of minority classes, the class-rebalancing sampling module resamples unlabeled data following the guidance of the gradient-based reweighting module.
no code implementations • 15 Mar 2024 • Hao Li, Yuanyuan Gao, Chenming Wu, Dingwen Zhang, Yalun Dai, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang, Junwei Han
Specifically, we design a novel joint learning framework that consists of an Iterative Pose Optimization Network (IPO-Net) and a Generalizable 3D-Gaussians (G-3DG) model.
1 code implementation • CVPR 2024 • Haochen Han, Qinghua Zheng, Guang Dai, Minnan Luo, Jingdong Wang
To achieve this, we propose L2RM, a general framework based on Optimal Transport (OT) that learns to rematch mismatched pairs.
Cross-modal retrieval with noisy correspondence
Retrieval
+2
1 code implementation • CVPR 2024 • Yanpeng Sun, Jiahui Chen, Shan Zhang, Xinyu Zhang, Qiang Chen, Gang Zhang, Errui Ding, Jingdong Wang, Zechao Li
In this paper, we propose a novel Visual Reference Prompt (VRP) encoder that empowers the Segment Anything Model (SAM) to utilize annotated reference images as prompts for segmentation, creating the VRP-SAM model.
no code implementations • 26 Feb 2024 • Xinqi Liu, Chenming Wu, Jialun Liu, Xing Liu, Jinbo Wu, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang
In this paper, we present a novel method that facilitates the creation of vivid 3D Gaussian avatars from monocular video inputs (GVA).
no code implementations • 22 Jan 2024 • Mengmeng Wang, Jiazheng Xing, Boyuan Jiang, Jun Chen, Jianbiao Mei, Xingxing Zuo, Guang Dai, Jingdong Wang, Yong liu
In this paper, we introduce a novel Multimodal, Multi-task CLIP adapting framework named \name to address these challenges, preserving both high supervised performance and robust transferability.
no code implementations • 22 Jan 2024 • JianJian Cao, Beiya Dai, Yulin Li, Xiameng Qin, Jingdong Wang
Holi integrates features of the two modalities by a cross-modal attention mechanism, which suppresses the irrelevant redundancy under the guide of positioning information from RoCo.
1 code implementation • CVPR 2024 • Chuyang Zhao, Yifan Sun, Wenhao Wang, Qiang Chen, Errui Ding, Yi Yang, Jingdong Wang
The traditional training procedure using one-to-one supervision in the original DETR lacks direct supervision for the object detection candidates.
1 code implementation • CVPR 2024 • Huan Liu, Zichang Tan, Chuangchuang Tan, Yunchao Wei, Yao Zhao, Jingdong Wang
In this paper, we study the problem of generalizable synthetic image detection, aiming to detect forgery images from diverse generative methods, e. g., GANs and diffusion models.
no code implementations • 27 Dec 2023 • Zhuohang Dang, Minnan Luo, Chengyou Jia, Guang Dai, Xiaojun Chang, Jingdong Wang
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious in practice.
Cross-modal retrieval with noisy correspondence
Memorization
+2
1 code implementation • 17 Dec 2023 • Jiankai Sun, Chuanyang Zheng, Enze Xie, Zhengying Liu, Ruihang Chu, Jianing Qiu, Jiaqi Xu, Mingyu Ding, Hongyang Li, Mengzhe Geng, Yue Wu, Wenhai Wang, Junsong Chen, Zhangyue Yin, Xiaozhe Ren, Jie Fu, Junxian He, Wu Yuan, Qi Liu, Xihui Liu, Yu Li, Hao Dong, Yu Cheng, Ming Zhang, Pheng Ann Heng, Jifeng Dai, Ping Luo, Jingdong Wang, Ji-Rong Wen, Xipeng Qiu, Yike Guo, Hui Xiong, Qun Liu, Zhenguo Li
Reasoning, a crucial ability for complex problem-solving, plays a pivotal role in various real-world settings such as negotiation, medical diagnosis, and criminal investigation.
1 code implementation • 8 Dec 2023 • Yahao Shi, Yanmin Wu, Chenming Wu, Xing Liu, Chen Zhao, Haocheng Feng, Jian Zhang, Bin Zhou, Errui Ding, Jingdong Wang
Our method achieves state-of-the-art performance in both relighting and novel view synthesis tasks among the recently proposed inverse rendering methods while achieving real-time rendering.
2 code implementations • 6 Dec 2023 • Hongyang Li, Yang Li, Huijie Wang, Jia Zeng, Huilin Xu, Pinlong Cai, Li Chen, Junchi Yan, Feng Xu, Lu Xiong, Jingdong Wang, Futang Zhu, Chunjing Xu, Tiancai Wang, Fei Xia, Beipeng Mu, Zhihui Peng, Dahua Lin, Yu Qiao
With the continuous maturation and application of autonomous driving technology, a systematic examination of open-source autonomous driving datasets becomes instrumental in fostering the robust evolution of the industry ecosystem.
no code implementations • 4 Dec 2023 • Chengyou Jia, Minnan Luo, Xiaojun Chang, Zhuohang Dang, Mingfei Han, Mengmeng Wang, Guang Dai, Sizhe Dang, Jingdong Wang
To realize this, we innovatively blend video models with Large Language Models (LLMs) to devise Action-conditioned Prompts.
2 code implementations • 27 Nov 2023 • Wenhao Wu, Huanjin Yao, Mengxi Zhang, Yuxin Song, Wanli Ouyang, Jingdong Wang
Our study centers on the evaluation of GPT-4's linguistic and visual capabilities in zero-shot visual recognition tasks: Firstly, we explore the potential of its generated rich textual descriptions across various categories to enhance recognition performance without any training.
no code implementations • CVPR 2024 • Hao Li, Dingwen Zhang, Yalun Dai, Nian Liu, Lechao Cheng, Jingfeng Li, Jingdong Wang, Junwei Han
Applying NeRF to downstream perception tasks for scene understanding and representation is becoming increasingly popular.
no code implementations • 3 Nov 2023 • Zhuohang Dang, Minnan Luo, Chengyou Jia, Guang Dai, Jihong Wang, Xiaojun Chang, Jingdong Wang
Encoding only the task-related information from the raw data, \ie, disentangled representation learning, can greatly contribute to the robustness and generalizability of models.
1 code implementation • NeurIPS 2023 • Junkun Yuan, Xinyu Zhang, Hao Zhou, Jian Wang, Zhongwei Qiu, Zhiyin Shao, Shaofeng Zhang, Sifan Long, Kun Kuang, Kun Yao, Junyu Han, Errui Ding, Lanfen Lin, Fei Wu, Jingdong Wang
To further capture human characteristics, we propose a structure-invariant alignment loss that enforces different masked views, guided by the human part prior, to be closely aligned for the same image.
1 code implementation • NeurIPS 2023 • Linyan Huang, Zhiqi Li, Chonghao Sima, Wenhai Wang, Jingdong Wang, Yu Qiao, Hongyang Li
Current research is primarily dedicated to advancing the accuracy of camera-only 3D object detectors (apprentice) through the knowledge transferred from LiDAR- or multi-modal-based counterparts (expert).
Ranked #6 on
3D Object Detection
on nuScenes Camera Only
no code implementations • 11 Oct 2023 • Deli Yu, Teng Xi, Jianwei Li, Baopu Li, Gang Zhang, Haocheng Feng, Junyu Han, Jingtuo Liu, Errui Ding, Jingdong Wang
On one hand, different images share more similar attention patterns in early layers than later layers, indicating that the dynamic query-by-key self-attention matrix may be replaced with a static self-attention matrix in early layers.
no code implementations • ICCV 2023 • Xiang Guo, Jiadai Sun, Yuchao Dai, GuanYing Chen, Xiaoqing Ye, Xiao Tan, Errui Ding, Yumeng Zhang, Jingdong Wang
This paper proposes a neural radiance field (NeRF) approach for novel view synthesis of dynamic scenes using forward warping.
no code implementations • 26 Sep 2023 • Pengyuan Lyu, Weihong Ma, Hongyi Wang, Yuechen Yu, Chengquan Zhang, Kun Yao, Yang Xue, Jingdong Wang
In this representation, the vertexes and edges of the grid store the localization and adjacency information of the table.
no code implementations • 20 Sep 2023 • Chengyou Jia, Minnan Luo, Zhuohang Dang, Guang Dai, Xiaojun Chang, Jingdong Wang
Dominant Person Search methods aim to localize and recognize query persons in a unified network, which jointly optimizes two sub-tasks, \ie, pedestrian detection and Re-IDentification (ReID).
no code implementations • 18 Sep 2023 • Huan Liu, Zichang Tan, Qiang Chen, Yunchao Wei, Yao Zhao, Jingdong Wang
Moreover, to address the semantic conflicts between image and frequency domains, the forgery-aware mutual module is developed to further enable the effective interaction of disparate image and frequency features, resulting in aligned and comprehensive visual forgery representations.
1 code implementation • ICCV 2023 • Zhiyin Shao, Xinyu Zhang, Changxing Ding, Jian Wang, Jingdong Wang
In this way, the pre-training task and the T2I-ReID task are made consistent with each other on both data and training levels.
no code implementations • 1 Sep 2023 • Xin Li, Wenqing Chu, Ye Wu, Weihang Yuan, Fanglong Liu, Qi Zhang, Fu Li, Haocheng Feng, Errui Ding, Jingdong Wang
In this paper, we present VideoGen, a text-to-video generation approach, which can generate a high-definition video with high frame fidelity and strong temporal consistency using reference-guided latent diffusion.
no code implementations • 20 Aug 2023 • Chengyou Jia, Minnan Luo, Zhuohang Dang, Guang Dai, Xiaojun Chang, Mengmeng Wang, Jingdong Wang
Despite significant progress in Text-to-Image (T2I) generative models, even lengthy and complex text descriptions still struggle to convey detailed controls.
1 code implementation • ICCV 2023 • Jiazheng Xing, Mengmeng Wang, Yudi Ruan, Bofan Chen, Yaowei Guo, Boyu Mu, Guang Dai, Jingdong Wang, Yong liu
Class prototype construction and matching are core aspects of few-shot action recognition.
2 code implementations • ICCV 2023 • Huan Liu, Qiang Chen, Zichang Tan, Jiang-Jiang Liu, Jian Wang, Xiangbo Su, Xiaolong Li, Kun Yao, Junyu Han, Errui Ding, Yao Zhao, Jingdong Wang
State-of-the-art solutions adopt the DETR-like framework, and mainly develop the complex decoder, e. g., regarding pose estimation as keypoint box detection and combining with human detection in ED-Pose, hierarchically predicting with pose decoder and joint (keypoint) decoder in PETR.
no code implementations • 3 Aug 2023 • Feng Chen, Jiajia Liu, Kaixiang Ji, Wang Ren, Jian Wang, Jingdong Wang
Our BGA-MNER consists of \texttt{image2text} and \texttt{text2image} generation with respect to entity-salient content in two modalities.
no code implementations • 3 Aug 2023 • Jiazheng Xing, Chao Xu, Mengmeng Wang, Guang Dai, Baigui Sun, Yong liu, Jingdong Wang, Jian Zhao
To tackle these issues, we introduce MA-FSAR, a framework that employs the Parameter-Efficient Fine-Tuning (PEFT) technique to enhance the CLIP visual encoder in terms of action-related temporal and semantic representations.
1 code implementation • 21 Jul 2023 • Yiqun Chen, Qiang Chen, Peize Sun, Shoufa Chen, Jingdong Wang, Jian Cheng
We hope our work will bring the attention of the detection community to the localization bottleneck of current DETR-like models and highlight the potential of the RefineBox framework.
1 code implementation • ICCV 2023 • Lizhao Liu, Zhuangwei Zhuang, Shangxin Huang, Xunlong Xiao, Tianhang Xiang, Cen Chen, Jingdong Wang, Mingkui Tan
CMT disentangles the learning of supervised segmentation and unsupervised masked context prediction for effectively learning the very limited labeled points and mass unlabeled points, respectively.
2 code implementations • ICCV 2023 • Wenhao Wu, Yuxin Song, Zhun Sun, Jingdong Wang, Chang Xu, Wanli Ouyang
We conduct comprehensive ablation studies on the instantiation of ATMs and demonstrate that this module provides powerful temporal modeling capability at a low computational cost.
Ranked #4 on
Action Recognition
on Something-Something V1
3 code implementations • CVPR 2023 • Jiacheng Zhang, Xiangru Lin, Wei zhang, Kuo Wang, Xiao Tan, Junyu Han, Errui Ding, Jingdong Wang, Guanbin Li
Specifically, we propose a Stage-wise Hybrid Matching strategy that combines the one-to-many assignment and one-to-one assignment strategies to improve the training efficiency of the first stage and thus provide high-quality pseudo labels for the training of the second stage.
no code implementations • 29 Jun 2023 • Zhongwei Qiu, Qiansheng Yang, Jian Wang, Xiyu Wang, Chang Xu, Dongmei Fu, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang
One of the mainstream schemes for 2D human pose estimation (HPE) is learning keypoints heatmaps by a neural network.
no code implementations • 19 Jun 2023 • Haiyang Xu, Zhichao Zhou, Dongliang He, Fu Li, Jingdong Wang
Vision Transformer(ViT) is now dominating many vision tasks.
no code implementations • 5 Jun 2023 • Wenwen Yu, Chengquan Zhang, Haoyu Cao, Wei Hua, Bohan Li, Huang Chen, MingYu Liu, Mingrui Chen, Jianfeng Kuang, Mengjun Cheng, Yuning Du, Shikun Feng, Xiaoguang Hu, Pengyuan Lyu, Kun Yao, Yuechen Yu, Yuliang Liu, Wanxiang Che, Errui Ding, Cheng-Lin Liu, Jiebo Luo, Shuicheng Yan, Min Zhang, Dimosthenis Karatzas, Xing Sun, Jingdong Wang, Xiang Bai
It is hoped that this competition will attract many researchers in the field of CV and NLP, and bring some new thoughts to the field of Document AI.
1 code implementation • 17 May 2023 • Jiang-Tian Zhai, Ze Feng, Jinhao Du, Yongqiang Mao, Jiang-Jiang Liu, Zichang Tan, Yifu Zhang, Xiaoqing Ye, Jingdong Wang
Modern autonomous driving systems are typically divided into three main tasks: perception, prediction, and planning.
Ranked #1 on
Trajectory Planning
on nuScenes
1 code implementation • 15 May 2023 • Jialong Zuo, Jiahao Hong, Feng Zhang, Changqian Yu, Hanyu Zhou, Changxin Gao, Nong Sang, Jingdong Wang
To address this issue, we propose a novel language-image pre-training framework for person representation learning, termed PLIP.
Ranked #7 on
Text based Person Retrieval
on ICFG-PEDES
1 code implementation • 12 May 2023 • Zhe Liu, Xiaoqing Ye, Zhikang Zou, Xinwei He, Xiao Tan, Errui Ding, Jingdong Wang, Xiang Bai
Extensive experiments on the nuScenes dataset demonstrate that our method is much more stable in dealing with challenging cases such as asynchronous sensors, misaligned sensor placement, and degenerated camera images than existing fusion methods.
Ranked #48 on
3D Object Detection
on nuScenes
no code implementations • CVPR 2023 • Jiazhi Guan, Zhanwang Zhang, Hang Zhou, Tianshu Hu, Kaisiyuan Wang, Dongliang He, Haocheng Feng, Jingtuo Liu, Errui Ding, Ziwei Liu, Jingdong Wang
Despite recent advances in syncing lip movements with any audio waves, current methods still struggle to balance generation quality and the model's generalization ability.
1 code implementation • 10 Apr 2023 • Yanpeng Sun, Qiang Chen, Jian Wang, Jingdong Wang, Zechao Li
By doing this, the model can leverage the diverse knowledge stored in different parts of the model to improve its performance on new tasks.
no code implementations • ICCV 2023 • Sifan Long, Zhen Zhao, Junkun Yuan, Zichang Tan, JiangJiang Liu, Luping Zhou, Shengsheng Wang, Jingdong Wang
A contrastive loss is employed to align such augmented text and image representations on downstream tasks.
no code implementations • 27 Mar 2023 • Yifu Zhang, Xinggang Wang, Xiaoqing Ye, Wei zhang, Jincheng Lu, Xiao Tan, Errui Ding, Peize Sun, Jingdong Wang
We propose a hierarchical data association strategy to mine the true objects in low-score detection boxes, which alleviates the problems of object missing and fragmented trajectories.
1 code implementation • CVPR 2023 • Chang Liu, Weiming Zhang, Xiangru Lin, Wei zhang, Xiao Tan, Junyu Han, Xiaomao Li, Errui Ding, Jingdong Wang
It employs a "divide-and-conquer" strategy and separately exploits positives for the classification and localization task, which is more robust to the assignment ambiguity.
Ranked #1 on
Semi-Supervised Object Detection
on COCO 10% labeled data
(detector metric)
2 code implementations • CVPR 2023 • Kaixin Xiong, Shi Gong, Xiaoqing Ye, Xiao Tan, Ji Wan, Errui Ding, Jingdong Wang, Xiang Bai
In this paper, we address the problem of detecting 3D objects from multi-view images.
Ranked #9 on
3D Object Detection
on nuScenes Camera Only
1 code implementation • 17 Mar 2023 • Yidan Zhang, Ting Zhang, Dong Chen, Yujing Wang, Qi Chen, Xing Xie, Hao Sun, Weiwei Deng, Qi Zhang, Fan Yang, Mao Yang, Qingmin Liao, Jingdong Wang, Baining Guo
While generative modeling has become prevalent across numerous research fields, its integration into the realm of image retrieval remains largely unexplored and underjustified.
no code implementations • CVPR 2023 • Zhongwei Qiu, Yang Qiansheng, Jian Wang, Haocheng Feng, Junyu Han, Errui Ding, Chang Xu, Dongmei Fu, Jingdong Wang
To handle the variances of objects as time proceeds, a novel scheme of progressive decoding is used to update pose and shape queries at each frame.
Ranked #32 on
3D Human Pose Estimation
on 3DPW
1 code implementation • ICCV 2023 • Jiaxiang Tang, Hang Zhou, Xiaokang Chen, Tianshu Hu, Errui Ding, Jingdong Wang, Gang Zeng
Neural Radiance Fields (NeRF) have constituted a remarkable breakthrough in image-based 3D reconstruction.
1 code implementation • 1 Mar 2023 • Yuechen Yu, Yulin Li, Chengquan Zhang, Xiaoqiang Zhang, Zengyuan Guo, Xiameng Qin, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang
Compared to the masked multi-modal modeling methods for document image understanding that rely on both the image and text modalities, StrucTexTv2 models image-only input and potentially deals with more application scenarios free from OCR pre-processing.
Ranked #1 on
Table Recognition
on WTW
1 code implementation • 27 Jan 2023 • Jie Zhu, Jiyang Qi, Mingyu Ding, Xiaokang Chen, Ping Luo, Xinggang Wang, Wenyu Liu, Leye Wang, Jingdong Wang
The study is mainly motivated by that random views, used in contrastive learning, and random masked (visible) patches, used in masked image modeling, are often about object parts.
1 code implementation • 26 Jan 2023 • Xiaohu Huang, Hao Zhou, Jian Wang, Haocheng Feng, Junyu Han, Errui Ding, Jingdong Wang, Xinggang Wang, Wenyu Liu, Bin Feng
In this paper, we propose a graph contrastive learning framework for skeleton-based action recognition (\textit{SkeletonGCL}) to explore the \textit{global} context across all sequences.
Ranked #16 on
Skeleton Based Action Recognition
on NTU RGB+D
1 code implementation • ICCV 2023 • Bo Fang, Wenhao Wu, Chang Liu, Yu Zhou, Yuxin Song, Weiping Wang, Xiangbo Shu, Xiangyang Ji, Jingdong Wang
In the refined embedding space, we represent text-video pairs as probabilistic distributions where prototypes are sampled for matching evaluation.
no code implementations • ICCV 2023 • Jinhao Du, Shan Zhang, Qiang Chen, Haifeng Le, Yanpeng Sun, Yao Ni, Jian Wang, Bin He, Jingdong Wang
To provide precise information for the query image, the prototype is decoupled into task-specific ones, which provide tailored guidance for 'where to look' and 'what to look for', respectively.
no code implementations • ICCV 2023 • Shuo Li, Yue He, Weiming Zhang , Wei zhang, Xiao Tan, Junyu Han, Errui Ding, Jingdong Wang
Current state-of-the-art semi-supervised semantic segmentation (SSSS) methods typically adopt pseudo labeling and consistency regularization between multiple learners with different perturbations.
5 code implementations • CVPR 2023 • Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang
In this paper, we propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge: i) We introduce the Video Attribute Association mechanism, which leverages the Video-to-Text knowledge to generate textual auxiliary attributes for complementing video recognition.
Ranked #1 on
Zero-Shot Action Recognition
on ActivityNet
4 code implementations • CVPR 2023 • Wenhao Wu, Haipeng Luo, Bo Fang, Jingdong Wang, Wanli Ouyang
Most existing text-video retrieval methods focus on cross-modal matching between the visual content of videos and textual query sentences.
Ranked #8 on
Video Retrieval
on VATEX
1 code implementation • CVPR 2023 • Zhen Zhao, Lihe Yang, Sifan Long, Jimin Pi, Luping Zhou, Jingdong Wang
Differently, in this work, we follow a standard teacher-student framework and propose AugSeg, a simple and clean approach that focuses mainly on data perturbations to boost the SSS performance.
no code implementations • 9 Dec 2022 • Yasheng Sun, Hang Zhou, Kaisiyuan Wang, Qianyi Wu, Zhibin Hong, Jingtuo Liu, Errui Ding, Jingdong Wang, Ziwei Liu, Hideki Koike
This requires masking a large percentage of the original image and seamlessly inpainting it with the aid of audio and reference frames.
1 code implementation • 7 Dec 2022 • Haixiao Yue, Keyao Wang, Guosheng Zhang, Haocheng Feng, Junyu Han, Errui Ding, Jingdong Wang
We further extend CDFTN for multi-target domain adaptation by leveraging data from more unlabeled target domains.
2 code implementations • 22 Nov 2022 • Jiaxiang Tang, Kaisiyuan Wang, Hang Zhou, Xiaokang Chen, Dongliang He, Tianshu Hu, Jingtuo Liu, Gang Zeng, Jingdong Wang
While dynamic Neural Radiance Fields (NeRF) have shown success in high-fidelity 3D modeling of talking portraits, the slow training and inference speed severely obstruct their potential usage.
1 code implementation • CVPR 2023 • Sifan Long, Zhen Zhao, Jimin Pi, Shengsheng Wang, Jingdong Wang
In this paper, we emphasize the cruciality of diverse global semantics and propose an efficient token decoupling and merging method that can jointly consider the token importance and diversity for token pruning.
Ranked #4 on
Efficient ViTs
on ImageNet-1K (with DeiT-T)
1 code implementation • CVPR 2023 • Zhen Zhao, Sifan Long, Jimin Pi, Jingdong Wang, Luping Zhou
Relying on the model's performance, iMAS employs a class-weighted symmetric intersection-over-union to evaluate quantitative hardness of each unlabeled instance and supervises the training on unlabeled data in a model-adaptive manner.
no code implementations • 17 Nov 2022 • Xinyu Zhang, Jiahui Chen, Junkun Yuan, Qiang Chen, Jian Wang, Xiaodi Wang, Shumin Han, Xiaokang Chen, Jimin Pi, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang
That is to say, the smaller the model, the lower the mask ratio needs to be.
no code implementations • arXiv 2022 • Qiang Chen, Jian Wang, Chuchu Han, Shan Zhang, Zexian Li, Xiaokang Chen, Jiahui Chen, Xiaodi Wang, Shuming Han, Gang Zhang, Haocheng Feng, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang
The training process consists of self-supervised pretraining and finetuning a ViT-Huge encoder on ImageNet-1K, pretraining the detector on Object365, and finally finetuning it on COCO.
Ranked #8 on
Object Detection
on COCO test-dev
1 code implementation • 13 Oct 2022 • Jian Wang, Chenhui Gou, Qiman Wu, Haocheng Feng, Junyu Han, Errui Ding, Jingdong Wang
Recently, transformer-based networks have shown impressive results in semantic segmentation.
Ranked #2 on
Real-Time Semantic Segmentation
on CamVid
no code implementations • 11 Oct 2022 • Yuxin Song, Min Yang, Wenhao Wu, Dongliang He, Fu Li, Jingdong Wang
In order to guide the encoder to fully excavate spatial-temporal features, two separate decoders are used for two pretext tasks of disentangled appearance and motion prediction.
no code implementations • 27 Sep 2022 • Zhiliang Xu, Hang Zhou, Zhibin Hong, Ziwei Liu, Jiaming Liu, Zhizhi Guo, Junyu Han, Jingtuo Liu, Errui Ding, Jingdong Wang
Our core idea is to leverage a style-based generator to empower high-fidelity and robust face swapping, thus the generator's advantage can be adopted for optimizing identity similarity.
no code implementations • 24 Sep 2022 • Jiankai Sun, Yan Xu, Mingyu Ding, Hongwei Yi, Chen Wang, Jingdong Wang, Liangjun Zhang, Mac Schwager
Using current NeRF training tools, a robot can train a NeRF environment model in real-time and, using our algorithm, identify 3D bounding boxes of objects of interest within the NeRF for downstream navigation or manipulation tasks.
no code implementations • 31 Aug 2022 • Zengyuan Guo, Yuechen Yu, Pengyuan Lv, Chengquan Zhang, Haojie Li, Zhihui Wang, Kun Yao, Jingtuo Liu, Jingdong Wang
The Vertex-based Merging Module is capable of aggregating local contextual information between adjacent basic grids, providing the ability to merge basic girds that belong to the same spanning cell accurately.
Ranked #6 on
Table Recognition
on PubTabNet