Layered Rendering Diffusion Model for Zero-Shot Guided Image Synthesis

no code implementations 30 Nov 2023 Zipeng Qi, Guoxi Huang, Zebin Huang, Qin Guo, Jinwen Chen, Junyu Han, Jian Wang, Gang Zhang, Lufei Liu, Errui Ding, Jingdong Wang

The LRDiff framework constructs an image-rendering process with multiple layers, each of which applies the vision guidance to instructively estimate the denoising direction for a single object.

Denoising Image Generation

GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?

1 code implementation 27 Nov 2023 Wenhao Wu, Huanjin Yao, Mengxi Zhang, Yuxin Song, Wanli Ouyang, Jingdong Wang

Our study centers on the evaluation of GPT-4's linguistic and visual capabilities in zero-shot visual recognition tasks.

Zero-Shot Learning

Disentangled Representation Learning with Transmitted Information Bottleneck

no code implementations 3 Nov 2023 Zhuohang Dang, Minnan Luo, Chengyou Jia, Guang Dai, Jihong Wang, Xiaojun Chang, Jingdong Wang, Qinghua Zheng

Encoding only the task-related information from the raw data, i.e., disentangled representation learning, can greatly contribute to the robustness and generalizability of models.

Disentanglement Variational Inference

Accelerating Vision Transformers Based on Heterogeneous Attention Patterns

no code implementations 11 Oct 2023 Deli Yu, Teng Xi, Jianwei Li, Baopu Li, Gang Zhang, Haocheng Feng, Junyu Han, Jingtuo Liu, Errui Ding, Jingdong Wang

On one hand, different images share more similar attention patterns in early layers than later layers, indicating that the dynamic query-by-key self-attention matrix may be replaced with a static self-attention matrix in early layers.

Dimensionality Reduction
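
The contrast drawn in the abstract excerpt above (a dynamic query-by-key attention matrix versus a static one) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy tokens, and the uniform static matrix are all assumptions made for the example.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    # Plain matrix product of two lists of rows.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def dynamic_attention(q, k, v):
    # Weights are recomputed from every input: softmax(Q K^T / sqrt(d)) V.
    d = len(q[0])
    scores = [[sum(qi * ki for qi, ki in zip(qr, kr)) / math.sqrt(d) for kr in k]
              for qr in q]
    weights = [softmax(r) for r in scores]
    return matmul(weights, v)

def static_attention(attn, v):
    # Weights are a fixed matrix shared by all inputs (hypothetical stand-in
    # for whatever static matrix a model would learn or precompute).
    return matmul(attn, v)

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
uniform = [[1.0 / 3.0] * 3 for _ in range(3)]  # assumed static matrix

out_dynamic = dynamic_attention(tokens, tokens, tokens)
out_static = static_attention(uniform, tokens)
```

The static variant skips the query-key product and the softmax entirely, which is where any saving would come from; how the static matrix is obtained is left open here.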

Forward Flow for Novel View Synthesis of Dynamic Scenes

no code implementations ICCV 2023 Xiang Guo, Jiadai Sun, Yuchao Dai, GuanYing Chen, Xiaoqing Ye, Xiao Tan, Errui Ding, Yumeng Zhang, Jingdong Wang

This paper proposes a neural radiance field (NeRF) approach for novel view synthesis of dynamic scenes using forward warping.

Novel View Synthesis

GridFormer: Towards Accurate Table Structure Recognition via Grid Prediction

no code implementations 26 Sep 2023 Pengyuan Lyu, Weihong Ma, Hongyi Wang, Yuechen Yu, Chengquan Zhang, Kun Yao, Yang Xue, Jingdong Wang

In this representation, the vertexes and edges of the grid store the localization and adjacency information of the table.

PSDiff: Diffusion Model for Person Search with Iterative and Collaborative Refinement

no code implementations 20 Sep 2023 Chengyou Jia, Minnan Luo, Zhuohang Dang, Guang Dai, Xiaojun Chang, Jingdong Wang, Qinghua Zheng

Dominant Person Search methods aim to localize and recognize query persons in a unified network, which jointly optimizes two sub-tasks, i.e., detection and Re-IDentification (ReID).

Denoising Person Search

Unified Frequency-Assisted Transformer Framework for Detecting and Grounding Multi-Modal Manipulation

no code implementations 18 Sep 2023 Huan Liu, Zichang Tan, Qiang Chen, Yunchao Wei, Yao Zhao, Jingdong Wang

Moreover, to address the semantic conflicts between image and frequency domains, the forgery-aware mutual module is developed to further enable the effective interaction of disparate image and frequency features, resulting in aligned and comprehensive visual forgery representations.


Unified Pre-training with Pseudo Texts for Text-To-Image Person Re-identification

1 code implementation ICCV 2023 Zhiyin Shao, Xinyu Zhang, Changxing Ding, Jian Wang, Jingdong Wang

In this way, the pre-training task and the T2I-ReID task are made consistent with each other on both data and training levels.

Person Re-Identification

VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation

no code implementations 1 Sep 2023 Xin Li, Wenqing Chu, Ye Wu, Weihang Yuan, Fanglong Liu, Qi Zhang, Fu Li, Haocheng Feng, Errui Ding, Jingdong Wang

In this paper, we present VideoGen, a text-to-video generation approach, which can generate a high-definition video with high frame fidelity and strong temporal consistency using reference-guided latent diffusion.

Text-to-Video Generation Video Generation

SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-form Layout-to-Image Generation

no code implementations 20 Aug 2023 Chengyou Jia, Minnan Luo, Zhuohang Dang, Guang Dai, Xiaojun Chang, Mengmeng Wang, Jingdong Wang

Despite significant progress in Text-to-Image (T2I) generative models, even lengthy and complex text descriptions still struggle to convey detailed controls.

Layout-to-Image Generation

Group Pose: A Simple Baseline for End-to-End Multi-person Pose Estimation

2 code implementations ICCV 2023 Huan Liu, Qiang Chen, Zichang Tan, Jiang-Jiang Liu, Jian Wang, Xiangbo Su, Xiaolong Li, Kun Yao, Junyu Han, Errui Ding, Yao Zhao, Jingdong Wang

State-of-the-art solutions adopt the DETR-like framework, and mainly develop the complex decoder, e.g., regarding pose estimation as keypoint box detection and combining with human detection in ED-Pose, hierarchically predicting with pose decoder and joint (keypoint) decoder in PETR.

Human Detection Multi-Person Pose Estimation

Multimodal Adaptation of CLIP for Few-Shot Action Recognition

no code implementations 3 Aug 2023 Jiazheng Xing, Mengmeng Wang, Xiaojun Hou, Guang Dai, Jingdong Wang, Yong Liu

The adapters we design can combine information from video-text multimodal sources for task-oriented spatiotemporal modeling, which is fast, efficient, and has low training costs.

Few-Shot action recognition Few Shot Action Recognition

Enhancing Your Trained DETRs with Box Refinement

1 code implementation 21 Jul 2023 Yiqun Chen, Qiang Chen, Peize Sun, Shoufa Chen, Jingdong Wang, Jian Cheng

We hope our work will bring the attention of the detection community to the localization bottleneck of current DETR-like models and highlight the potential of the RefineBox framework.

CPCM: Contextual Point Cloud Modeling for Weakly-supervised Point Cloud Semantic Segmentation

1 code implementation ICCV 2023 Lizhao Liu, Zhuangwei Zhuang, Shangxin Huang, Xunlong Xiao, Tianhang Xiang, Cen Chen, Jingdong Wang, Mingkui Tan

CMT disentangles the learning of supervised segmentation and unsupervised masked context prediction for effectively learning the very limited labeled points and mass unlabeled points, respectively.

Representation Learning Scene Understanding +2

What Can Simple Arithmetic Operations Do for Temporal Modeling?

2 code implementations ICCV 2023 Wenhao Wu, Yuxin Song, Zhun Sun, Jingdong Wang, Chang Xu, Wanli Ouyang

We conduct comprehensive ablation studies on the instantiation of ATMs and demonstrate that this module provides powerful temporal modeling capability at a low computational cost.

Action Classification Action Recognition +1

Semi-DETR: Semi-Supervised Object Detection with Detection Transformers

3 code implementations CVPR 2023 Jiacheng Zhang, Xiangru Lin, Wei Zhang, Kuo Wang, Xiao Tan, Junyu Han, Errui Ding, Jingdong Wang, Guanbin Li

Specifically, we propose a Stage-wise Hybrid Matching strategy that combines the one-to-many assignment and one-to-one assignment strategies to improve the training efficiency of the first stage and thus provide high-quality pseudo labels for the training of the second stage.

object-detection Object Detection +2

Multi-Modal 3D Object Detection by Box Matching

1 code implementation 12 May 2023 Zhe Liu, Xiaoqing Ye, Zhikang Zou, Xinwei He, Xiao Tan, Errui Ding, Jingdong Wang, Xiang Bai

Extensive experiments on the nuScenes dataset demonstrate that our method is much more stable in dealing with challenging cases such as asynchronous sensors, misaligned sensor placement, and degenerated camera images than existing fusion methods.

3D Object Detection Autonomous Driving +1

StyleSync: High-Fidelity Generalized and Personalized Lip Sync in Style-based Generator

no code implementations CVPR 2023 Jiazhi Guan, Zhanwang Zhang, Hang Zhou, Tianshu Hu, Kaisiyuan Wang, Dongliang He, Haocheng Feng, Jingtuo Liu, Errui Ding, Ziwei Liu, Jingdong Wang

Despite recent advances in syncing lip movements with any audio waves, current methods still struggle to balance generation quality and the model's generalization ability.

Exploring Effective Factors for Improving Visual In-Context Learning

1 code implementation 10 Apr 2023 Yanpeng Sun, Qiang Chen, Jian Wang, Jingdong Wang, Zechao Li

By doing this, the model can leverage the diverse knowledge stored in different parts of the model to improve its performance on new tasks.

Meta-Learning Semantic Segmentation

ByteTrackV2: 2D and 3D Multi-Object Tracking by Associating Every Detection Box

no code implementations 27 Mar 2023 Yifu Zhang, Xinggang Wang, Xiaoqing Ye, Wei Zhang, Jincheng Lu, Xiao Tan, Errui Ding, Peize Sun, Jingdong Wang

We propose a hierarchical data association strategy to mine the true objects in low-score detection boxes, which alleviates the problems of object missing and fragmented trajectories.

3D Multi-Object Tracking motion prediction

Ambiguity-Resistant Semi-Supervised Learning for Dense Object Detection

1 code implementation CVPR 2023 Chang Liu, Weiming Zhang, Xiangru Lin, Wei Zhang, Xiao Tan, Junyu Han, Xiaomao Li, Errui Ding, Jingdong Wang

It employs a "divide-and-conquer" strategy and separately exploits positives for the classification and localization task, which is more robust to the assignment ambiguity.

Dense Object Detection object-detection +2

StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training

1 code implementation 1 Mar 2023 Yuechen Yu, Yulin Li, Chengquan Zhang, Xiaoqiang Zhang, Zengyuan Guo, Xiameng Qin, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang

Compared to the masked multi-modal modeling methods for document image understanding that rely on both the image and text modalities, StrucTexTv2 models image-only input and potentially deals with more application scenarios free from OCR pre-processing.

Document Image Classification Language Modelling +3

Understanding Self-Supervised Pretraining with Part-Aware Representation Learning

no code implementations 27 Jan 2023 Jie Zhu, Jiyang Qi, Mingyu Ding, Xiaokang Chen, Ping Luo, Xinggang Wang, Wenyu Liu, Leye Wang, Jingdong Wang

The study is mainly motivated by the observation that random views, used in contrastive learning, and random masked (visible) patches, used in masked image modeling, are often about object parts.

Contrastive Learning Representation Learning

Graph Contrastive Learning for Skeleton-based Action Recognition

1 code implementation 26 Jan 2023 Xiaohu Huang, Hao Zhou, Jian Wang, Haocheng Feng, Junyu Han, Errui Ding, Jingdong Wang, Xinggang Wang, Wenyu Liu, Bin Feng

In this paper, we propose a graph contrastive learning framework for skeleton-based action recognition (SkeletonGCL) to explore the global context across all sequences.

Action Recognition Contrastive Learning +2

UATVR: Uncertainty-Adaptive Text-Video Retrieval

1 code implementation ICCV 2023 Bo Fang, Wenhao Wu, Chang Liu, Yu Zhou, Yuxin Song, Weiping Wang, Xiangbo Shu, Xiangyang Ji, Jingdong Wang

In the refined embedding space, we represent text-video pairs as probabilistic distributions where prototypes are sampled for matching evaluation.

Retrieval Semantic correspondence +1

CFCG: Semi-Supervised Semantic Segmentation via Cross-Fusion and Contour Guidance Supervision

no code implementations ICCV 2023 Shuo Li, Yue He, Weiming Zhang, Wei Zhang, Xiao Tan, Junyu Han, Errui Ding, Jingdong Wang

Current state-of-the-art semi-supervised semantic segmentation (SSSS) methods typically adopt pseudo labeling and consistency regularization between multiple learners with different perturbations.

Semi-Supervised Semantic Segmentation

Gradient-based Sampling for Class Imbalanced Semi-supervised Object Detection

1 code implementation ICCV 2023 Jiaming Li, Xiangru Lin, Wei Zhang, Xiao Tan, YingYing Li, Junyu Han, Errui Ding, Jingdong Wang, Guanbin Li

To tackle the confirmation bias from incorrect pseudo labels of minority classes, the class-rebalancing sampling module resamples unlabeled data following the guidance of the gradient-based reweighting module.

object-detection Object Detection +1

σ-Adaptive Decoupled Prototype for Few-Shot Object Detection

no code implementations ICCV 2023 Jinhao Du, Shan Zhang, Qiang Chen, Haifeng Le, Yanpeng Sun, Yao Ni, Jian Wang, Bin He, Jingdong Wang

To provide precise information for the query image, the prototype is decoupled into task-specific ones, which provide tailored guidance for 'where to look' and 'what to look for', respectively.

Few-Shot Object Detection Meta-Learning +2

Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models

5 code implementations CVPR 2023 Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang

In this paper, we propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge: i) We introduce the Video Attribute Association mechanism, which leverages the Video-to-Text knowledge to generate textual auxiliary attributes for complementing video recognition.

Action Classification Action Recognition +2

Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?

4 code implementations CVPR 2023 Wenhao Wu, Haipeng Luo, Bo Fang, Jingdong Wang, Wanli Ouyang

Most existing text-video retrieval methods focus on cross-modal matching between the visual content of videos and textual query sentences.

Data Augmentation Retrieval +2

Augmentation Matters: A Simple-yet-Effective Approach to Semi-supervised Semantic Segmentation

1 code implementation CVPR 2023 Zhen Zhao, Lihe Yang, Sifan Long, Jimin Pi, Luping Zhou, Jingdong Wang

Differently, in this work, we follow a standard teacher-student framework and propose AugSeg, a simple and clean approach that focuses mainly on data perturbations to boost the SSS performance.

Semi-Supervised Semantic Segmentation

Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers

no code implementations 9 Dec 2022 Yasheng Sun, Hang Zhou, Kaisiyuan Wang, Qianyi Wu, Zhibin Hong, Jingtuo Liu, Errui Ding, Jingdong Wang, Ziwei Liu, Hideki Koike

This requires masking a large percentage of the original image and seamlessly inpainting it with the aid of audio and reference frames.

Real-time Neural Radiance Talking Portrait Synthesis via Audio-spatial Decomposition

1 code implementation 22 Nov 2022 Jiaxiang Tang, Kaisiyuan Wang, Hang Zhou, Xiaokang Chen, Dongliang He, Tianshu Hu, Jingtuo Liu, Gang Zeng, Jingdong Wang

While dynamic Neural Radiance Fields (NeRF) have shown success in high-fidelity 3D modeling of talking portraits, the slow training and inference speed severely obstruct their potential usage.

Talking Face Generation

Instance-specific and Model-adaptive Supervision for Semi-supervised Semantic Segmentation

1 code implementation CVPR 2023 Zhen Zhao, Sifan Long, Jimin Pi, Jingdong Wang, Luping Zhou

Relying on the model's performance, iMAS employs a class-weighted symmetric intersection-over-union to evaluate quantitative hardness of each unlabeled instance and supervises the training on unlabeled data in a model-adaptive manner.

Segmentation Semi-Supervised Semantic Segmentation

Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers

no code implementations CVPR 2023 Sifan Long, Zhen Zhao, Jimin Pi, Shengsheng Wang, Jingdong Wang

In this paper, we emphasize the cruciality of diverse global semantics and propose an efficient token decoupling and merging method that can jointly consider the token importance and diversity for token pruning.

It Takes Two: Masked Appearance-Motion Modeling for Self-supervised Video Transformer Pre-training

no code implementations 11 Oct 2022 Yuxin Song, Min Yang, Wenhao Wu, Dongliang He, Fu Li, Jingdong Wang

In order to guide the encoder to fully excavate spatial-temporal features, two separate decoders are used for two pretext tasks of disentangled appearance and motion prediction.

motion prediction

StyleSwap: Style-Based Generator Empowers Robust Face Swapping

no code implementations 27 Sep 2022 Zhiliang Xu, Hang Zhou, Zhibin Hong, Ziwei Liu, Jiaming Liu, Zhizhi Guo, Junyu Han, Jingtuo Liu, Errui Ding, Jingdong Wang

Our core idea is to leverage a style-based generator to empower high-fidelity and robust face swapping, so that the generator's advantage can be adopted for optimizing identity similarity.

Face Swapping

NeRF-Loc: Transformer-Based Object Localization Within Neural Radiance Fields

no code implementations 24 Sep 2022 Jiankai Sun, Yan Xu, Mingyu Ding, Hongwei Yi, Chen Wang, Jingdong Wang, Liangjun Zhang, Mac Schwager

Using current NeRF training tools, a robot can train a NeRF environment model in real-time and, using our algorithm, identify 3D bounding boxes of objects of interest within the NeRF for downstream navigation or manipulation tasks.

Object Localization Robot Navigation

TRUST: An Accurate and End-to-End Table structure Recognizer Using Splitting-based Transformers

no code implementations 31 Aug 2022 Zengyuan Guo, Yuechen Yu, Pengyuan Lv, Chengquan Zhang, Haojie Li, Zhihui Wang, Kun Yao, Jingtuo Liu, Jingdong Wang

The Vertex-based Merging Module is capable of aggregating local contextual information between adjacent basic grids, providing the ability to accurately merge basic grids that belong to the same spanning cell.

Table Recognition

CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval

no code implementations 21 Aug 2022 Haoran Wang, Dongliang He, Wenhao Wu, Boyang Xia, Min Yang, Fu Li, Yunlong Yu, Zhong Ji, Errui Ding, Jingdong Wang

We introduce dynamic dictionaries for both modalities to enlarge the scale of image-text pairs, and diversity-sensitiveness is achieved by adaptive negative pair weighting.

Clustering Contrastive Learning +4

Automatic Classification of Bug Reports Based on Multiple Text Information and Reports' Intention

no code implementations 2 Aug 2022 Fanqi Meng, Xuesong Wang, Jingdong Wang, Peifang Wang

The innovation is that when categorizing bug reports, in addition to using the text information of the report, the intention of the report (i.e., suggestion or explanation) is also considered, thereby improving the performance of the classification.

Rating the Crisis of Online Public Opinion Using a Multi-Level Index System

no code implementations 29 Jul 2022 Fanqi Meng, Xixi Xiao, Jingdong Wang

We propose a method to rate the crisis of online public opinion based on a multi-level index system to evaluate the impact of events objectively.


Group DETR: Fast DETR Training with Group-Wise One-to-Many Assignment

2 code implementations ICCV 2023 Qiang Chen, Xiaokang Chen, Jian Wang, Shan Zhang, Kun Yao, Haocheng Feng, Junyu Han, Errui Ding, Gang Zeng, Jingdong Wang

Detection transformer (DETR) relies on one-to-one assignment, assigning one ground-truth object to one prediction, for end-to-end detection without NMS post-processing.

Data Augmentation object-detection +1
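
The one-to-one assignment described in the abstract excerpt above can be illustrated with a small sketch. It assumes a precomputed cost matrix between ground-truth objects and predictions and searches all injective assignments by brute force; the function name and the toy costs are hypothetical. Practical implementations typically use the Hungarian algorithm instead of enumeration.

```python
from itertools import permutations

def one_to_one_assign(cost):
    """cost[g][p] is the matching cost between ground-truth g and prediction p.

    Returns a list whose entry g is the index of the single prediction
    assigned to ground-truth g; no prediction is used twice.
    Assumes there are at least as many predictions as ground-truth objects.
    """
    num_gt, num_pred = len(cost), len(cost[0])
    best, best_total = None, float("inf")
    # Enumerate every way of picking num_gt distinct predictions, in order.
    for perm in permutations(range(num_pred), num_gt):
        total = sum(cost[g][p] for g, p in enumerate(perm))
        if total < best_total:
            best, best_total = list(perm), total
    return best

# Two ground-truth objects, four predictions (toy numbers).
cost = [
    [0.9, 0.1, 0.5, 0.7],
    [0.2, 0.8, 0.6, 0.3],
]
assignment = one_to_one_assign(cost)
```

Predictions left unassigned (here, indices 2 and 3) would be treated as background, which is why a one-to-one scheme needs no NMS to remove duplicates.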

Detecting Deepfake by Creating Spatio-Temporal Regularity Disruption

no code implementations 21 Jul 2022 Jiazhi Guan, Hang Zhou, Mingming Gong, Errui Ding, Jingdong Wang, Youjian Zhao

Specifically, by carefully examining the spatial and temporal properties, we propose to disrupt a real video through a Pseudo-fake Generator and create a wide range of pseudo-fake videos for training.

DeepFake Detection Face Swapping

Action Quality Assessment with Temporal Parsing Transformer

1 code implementation 19 Jul 2022 Yang Bai, Desen Zhou, Songyang Zhang, Jian Wang, Errui Ding, Yu Guan, Yang Long, Jingdong Wang

Action Quality Assessment (AQA) is important for action understanding, and resolving the task poses unique challenges due to subtle visual differences.

Action Quality Assessment Action Understanding +1

Conditional DETR V2: Efficient Detection Transformer with Box Queries

no code implementations 18 Jul 2022 Xiaokang Chen, Fangyun Wei, Gang Zeng, Jingdong Wang

Inspired by Conditional DETR, an improved DETR with fast training convergence that introduced box queries (originally called spatial queries) for internal decoder layers, we reformulate the object query into the format of the box query, which is a composition of the embeddings of the reference point and the transformation of the box with respect to the reference point.

object-detection Object Detection

Towards Lightweight Super-Resolution with Dual Regression Learning

1 code implementation 16 Jul 2022 Yong Guo, Jingdong Wang, Qi Chen, JieZhang Cao, Zeshuai Deng, Yanwu Xu, Jian Chen, Mingkui Tan

Nevertheless, it is hard for existing model compression methods to accurately identify the redundant components due to the extremely large SR mapping space.

Image Super-Resolution Model Compression +1

Paint and Distill: Boosting 3D Object Detection with Semantic Passing Network

no code implementations 12 Jul 2022 Bo Ju, Zhikang Zou, Xiaoqing Ye, Minyue Jiang, Xiao Tan, Errui Ding, Jingdong Wang

In this work, we propose a novel semantic passing framework, named SPNet, to boost the performance of existing lidar-based 3D detection models with the guidance of rich context painting, with no extra computation cost during inference.

3D Object Detection Autonomous Driving +1

Delving into Sequential Patches for Deepfake Detection

no code implementations 6 Jul 2022 Jiazhi Guan, Hang Zhou, Zhibin Hong, Errui Ding, Jingdong Wang, Chengbin Quan, Youjian Zhao

Recent advances in face forgery techniques produce nearly visually untraceable deepfake videos, which could be leveraged with malicious intentions.

DeepFake Detection Face Swapping

MaskOCR: Text Recognition with Masked Encoder-Decoder Pretraining

no code implementations 1 Jun 2022 Pengyuan Lyu, Chengquan Zhang, Shanshan Liu, Meina Qiao, Yangliu Xu, Liang Wu, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang

Specifically, we transform text data into synthesized text images to unify the data modalities of vision and language, and enhance the language modeling capability of the sequence decoder using a proposed masked image-language modeling scheme.

Language Modelling Optical Character Recognition (OCR) +1

Few-Shot Font Generation by Learning Fine-Grained Local Styles

2 code implementations CVPR 2022 Licheng Tang, Yiyang Cai, Jiaming Liu, Zhibin Hong, Mingming Gong, Minhu Fan, Junyu Han, Jingtuo Liu, Errui Ding, Jingdong Wang

Instead of explicitly disentangling global or component-wise modeling, the cross-attention mechanism can attend to the right local styles in the reference glyphs and aggregate the reference styles into a fine-grained style representation for the given content glyphs.

Font Generation

Few-Shot Head Swapping in the Wild

no code implementations CVPR 2022 Changyong Shu, Hemao Wu, Hang Zhou, Jiaming Liu, Zhibin Hong, Changxing Ding, Junyu Han, Jingtuo Liu, Errui Ding, Jingdong Wang

Particularly, seamless blending is achieved with the help of a Semantic-Guided Color Reference Creation procedure and a Blending UNet.

Face Swapping

Human-Object Interaction Detection via Disentangled Transformer

no code implementations CVPR 2022 Desen Zhou, Zhichao Liu, Jian Wang, Leshan Wang, Tao Hu, Errui Ding, Jingdong Wang

To associate the predictions of disentangled decoders, we first generate a unified representation for HOI triplets with a base decoder, and then utilize it as input feature of each disentangled decoder.

Human-Object Interaction Detection

GitNet: Geometric Prior-based Transformation for Birds-Eye-View Segmentation

no code implementations 16 Apr 2022 Shi Gong, Xiaoqing Ye, Xiao Tan, Jingdong Wang, Errui Ding, Yu Zhou, Xiang Bai

Birds-eye-view (BEV) semantic segmentation is critical for autonomous driving for its powerful spatial representation ability.

Autonomous Driving Image Segmentation +2

Implicit Sample Extension for Unsupervised Person Re-Identification

1 code implementation CVPR 2022 Xinyu Zhang, Dongdong Li, Zhigang Wang, Jian Wang, Errui Ding, Javen Qinfeng Shi, Zhaoxiang Zhang, Jingdong Wang

Specifically, we generate support samples from actual samples and their neighbouring clusters in the embedding space through a progressive linear interpolation (PLI) strategy.

Clustering Unsupervised Person Re-Identification
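
As a rough illustration of the interpolation idea in the abstract excerpt above, the sketch below builds a support sample as a convex combination of an actual sample and the centroid of a neighbouring cluster, with a mixing coefficient that grows over training. The linear schedule, the `lam_max` cap, and all names are assumptions for the example and are not taken from the paper.

```python
def interpolate(sample, neighbour_centroid, lam):
    # Linear interpolation in embedding space: lam = 0 returns the sample itself,
    # lam = 1 returns the neighbouring cluster centroid.
    return [(1.0 - lam) * s + lam * c for s, c in zip(sample, neighbour_centroid)]

def progressive_lambda(epoch, total_epochs, lam_max=0.5):
    # Hypothetical schedule: the coefficient grows linearly from 0 to lam_max.
    return lam_max * min(epoch / total_epochs, 1.0)

sample = [1.0, 0.0]
centroid = [0.0, 1.0]

# One support sample per epoch, moving gradually towards the neighbouring cluster.
supports = [
    interpolate(sample, centroid, progressive_lambda(epoch, 4))
    for epoch in range(5)
]
```

Capping the coefficient below 1 keeps the support sample closer to its source than to the neighbouring cluster; whether and where to cap is a design choice left open here.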

DaViT: Dual Attention Vision Transformers

3 code implementations 7 Apr 2022 Mingyu Ding, Bin Xiao, Noel Codella, Ping Luo, Jingdong Wang, Lu Yuan

We show that these two self-attentions complement each other: (i) since each channel token contains an abstract representation of the entire image, the channel attention naturally captures global interactions and representations by taking all spatial positions into account when computing attention scores between channels; (ii) the spatial attention refines the local representations by performing fine-grained interactions across spatial locations, which in turn helps the global information modeling in channel attention.

Image Classification Instance Segmentation +2

ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval

no code implementations CVPR 2022 Mengjun Cheng, Yipeng Sun, Longchao Wang, Xiongwei Zhu, Kun Yao, Jie Chen, Guoli Song, Junyu Han, Jingtuo Liu, Errui Ding, Jingdong Wang

Visual appearance is considered to be the most important cue to understand images for cross-modal retrieval, while sometimes the scene text appearing in images can provide valuable information to understand the visual semantics.

Ranked #10 on Cross-Modal Retrieval on Flickr30k (using extra training data)

Contrastive Learning Cross-Modal Retrieval +1

Context Autoencoder for Self-Supervised Representation Learning

6 code implementations 7 Feb 2022 Xiaokang Chen, Mingyu Ding, Xiaodi Wang, Ying Xin, Shentong Mo, Yunhao Wang, Shumin Han, Ping Luo, Gang Zeng, Jingdong Wang

Pretraining comprises two tasks: masked representation prediction (predicting the representations of the masked patches) and masked patch reconstruction (reconstructing the masked patches).

Instance Segmentation object-detection +5

HRFormer: High-Resolution Vision Transformer for Dense Predict

2 code implementations NeurIPS 2021 Yuhui Yuan, Rao Fu, Lang Huang, WeiHong Lin, Chao Zhang, Xilin Chen, Jingdong Wang

We present a High-Resolution Transformer (HRFormer) that learns high-resolution representations for dense prediction tasks, in contrast to the original Vision Transformer that produces low-resolution representations and has high memory and computational cost.

Pose Estimation Semantic Segmentation +1

Whole Brain Segmentation with Full Volume Neural Network

1 code implementation 29 Oct 2021 Yeshu Li, Jonathan Cui, Yilun Sheng, Xiao Liang, Jingdong Wang, Eric I-Chao Chang, Yan Xu

To address these issues, we propose to adopt a full volume framework, which feeds the full volume brain image into the segmentation network and directly outputs the segmentation result for the whole brain volume.

Brain Segmentation Representation Learning +1

HRFormer: High-Resolution Transformer for Dense Prediction

1 code implementation 18 Oct 2021 Yuhui Yuan, Rao Fu, Lang Huang, WeiHong Lin, Chao Zhang, Xilin Chen, Jingdong Wang

We present a High-Resolution Transformer (HRFormer) that learns high-resolution representations for dense prediction tasks, in contrast to the original Vision Transformer that produces low-resolution representations and has high memory and computational cost.

Image Classification Multi-Person Pose Estimation +2

Realistic Image Synthesis with Configurable 3D Scene Layouts

no code implementations 23 Aug 2021 Jaebong Jeong, Janghun Jo, Jingdong Wang, Sunghyun Cho, Jaesik Park

Our approach takes a 3D scene with semantic class labels as input and trains a 3D scene painting network that synthesizes color values for the input 3D scene.

Image Generation

Conditional DETR for Fast Training Convergence

3 code implementations ICCV 2021 Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang

Our approach, named conditional DETR, learns a conditional spatial query from the decoder embedding for decoder multi-head cross-attention.

object-detection Object Detection

Content-Aware Convolutional Neural Networks

1 code implementation 30 Jun 2021 Yong Guo, Yaofo Chen, Mingkui Tan, Kui Jia, Jian Chen, Jingdong Wang

In practice, the convolutional operation on some of the windows (e.g., smooth windows that contain very similar pixels) can be very redundant and may introduce noise into the computation.

On the Connection between Local Attention and Dynamic Depth-wise Convolution

1 code implementation ICLR 2022 Qi Han, Zejia Fan, Qi Dai, Lei Sun, Ming-Ming Cheng, Jiaying Liu, Jingdong Wang

Sparse connectivity: there is no connection across channels, and each position is connected to the positions within a small local window.

object-detection Object Detection +1
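
The sparse connectivity named in the abstract excerpt above (no connection across channels, and each position connected only to a small local window) can be sketched with a one-dimensional depth-wise convolution. This is an illustrative toy with fixed kernels and zero padding; the function name and data are assumptions, and the dynamic (input-dependent) weights discussed in the paper title are not modelled.

```python
def depthwise_conv1d(x, kernels):
    """x: list of channels, each a list of values over positions.
    kernels: one odd-length kernel per channel.

    Each output channel depends only on the same input channel, and each
    output position only on a local window around it (zero padding at edges).
    """
    out = []
    for channel, kernel in zip(x, kernels):
        radius = len(kernel) // 2
        n = len(channel)
        row = []
        for i in range(n):
            acc = 0.0
            for j, w in enumerate(kernel):
                idx = i + j - radius  # position inside the local window
                if 0 <= idx < n:
                    acc += w * channel[idx]
            row.append(acc)
        out.append(row)
    return out

x = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
kernels = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]  # window of size 3 per channel
y = depthwise_conv1d(x, kernels)
```

Changing the second channel of `x` leaves `y[0]` untouched, which is the "no connection across channels" property in concrete form.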

Bottom-Up Human Pose Estimation Via Disentangled Keypoint Regression

2 code implementations CVPR 2021 Zigang Geng, Ke Sun, Bin Xiao, Zhaoxiang Zhang, Jingdong Wang

Our motivation is that regressing keypoint positions accurately needs to learn representations that focus on the keypoint regions.

Keypoint Detection regression

Learning Versatile Neural Architectures by Propagating Network Codes

1 code implementation ICLR 2022 Mingyu Ding, Yuqi Huo, Haoyu Lu, Linjie Yang, Zhe Wang, Zhiwu Lu, Jingdong Wang, Ping Luo

(4) Thorough studies of NCP on inter-, cross-, and intra-tasks highlight the importance of cross-task neural architecture design, i.e., multitask neural architectures and architecture transferring between different tasks.

Image Segmentation Neural Architecture Search +2

Boosting Adversarial Transferability through Enhanced Momentum

no code implementations 19 Mar 2021 Xiaosen Wang, Jiadong Lin, Han Hu, Jingdong Wang, Kun He

Various momentum iterative gradient-based methods have been shown to be effective in improving adversarial transferability.

Adversarial Attack

Admix: Enhancing the Transferability of Adversarial Attacks

1 code implementation ICCV 2021 Xiaosen Wang, Xuanran He, Jingdong Wang, Kun He

We investigate in this direction and observe that existing transformations are all applied on a single image, which might limit the adversarial transferability.

Consistent Instance Classification for Unsupervised Representation Learning

no code implementations 1 Jan 2021 Depu Meng, Zigang Geng, Zhirong Wu, Bin Xiao, Houqiang Li, Jingdong Wang

The proposed consistent instance classification (ConIC) approach simultaneously optimizes the classification loss and an additional consistency loss explicitly penalizing the feature dissimilarity between the augmented views from the same instance.

Classification General Classification +1
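
The objective described in the abstract excerpt above (a classification loss plus a consistency loss penalizing feature dissimilarity between augmented views of the same instance) might be sketched as below. The squared Euclidean distance, the `weight` parameter, and all function names are assumptions for illustration; the paper's exact loss terms may differ.

```python
import math

def cross_entropy(logits, label):
    # Numerically stable -log softmax(logits)[label].
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[label]

def consistency(feat_a, feat_b):
    # Assumed dissimilarity measure: squared Euclidean distance between the
    # features of two augmented views of the same instance.
    return sum((a - b) ** 2 for a, b in zip(feat_a, feat_b))

def conic_style_loss(logits_a, logits_b, feat_a, feat_b, instance_id, weight=1.0):
    # Instance classification on both views, plus the weighted consistency term.
    cls = cross_entropy(logits_a, instance_id) + cross_entropy(logits_b, instance_id)
    return cls + weight * consistency(feat_a, feat_b)

# Identical view features: only the classification term contributes.
loss_same = conic_style_loss([0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [1.0, 0.0], 0)
# Dissimilar view features: the consistency term adds a penalty.
loss_diff = conic_style_loss([0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0], 0)
```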

Improving Person Re-identification with Iterative Impression Aggregation

no code implementations 21 Sep 2020 Dengpan Fu, Bo Xin, Jingdong Wang, Dong-Dong Chen, Jianmin Bao, Gang Hua, Houqiang Li

Not only does such a simple method improve the performance of the baseline models, it also achieves comparable performance with latest advanced re-ranking methods.

Person Re-Identification Re-Ranking

Informative Dropout for Robust Representation Learning: A Shape-bias Perspective

1 code implementation ICML 2020 Baifeng Shi, Dinghuai Zhang, Qi Dai, Zhanxing Zhu, Yadong Mu, Jingdong Wang

Specifically, we discriminate texture from shape based on local self-information in an image, and adopt a Dropout-like algorithm to decorrelate the model output from the local texture.

Domain Generalization Representation Learning

Distillation Guided Residual Learning for Binary Convolutional Neural Networks

1 code implementation 10 Jul 2020 Jianming Ye, Shiliang Zhang, Jingdong Wang

We observe that this performance gap leads to substantial residuals between intermediate feature maps of BCNN and FCNN.

SegFix: Model-Agnostic Boundary Refinement for Segmentation

4 code implementations ECCV 2020 Yuhui Yuan, Jingyi Xie, Xilin Chen, Jingdong Wang

We present a model-agnostic post-processing scheme to improve the boundary quality for the segmentation result that is generated by any existing segmentation model.


Point-Set Anchors for Object Detection, Instance Segmentation and Pose Estimation

1 code implementation ECCV 2020 Fangyun Wei, Xiao Sun, Hongyang Li, Jingdong Wang, Stephen Lin

A recent approach for object detection and human pose estimation is to regress bounding boxes or human keypoints from a central point on the object or person.

Instance Segmentation object-detection +4

Efficient Semantic Video Segmentation with Per-frame Inference

1 code implementation ECCV 2020 Yifan Liu, Chunhua Shen, Changqian Yu, Jingdong Wang

For semantic segmentation, most existing real-time deep models trained with each frame independently may produce inconsistent results for a video sequence.

Knowledge Distillation Optical Flow Estimation +4

Segmentation Transformer: Object-Contextual Representations for Semantic Segmentation

9 code implementations ECCV 2020 Yuhui Yuan, Xiaokang Chen, Xilin Chen, Jingdong Wang

We empirically demonstrate that the proposed approach achieves competitive performance on various challenging semantic segmentation benchmarks: Cityscapes, ADE20K, LIP, PASCAL-Context, and COCO-Stuff.

Segmentation Semantic Segmentation

Cross View Fusion for 3D Human Pose Estimation

1 code implementation ICCV 2019 Haibo Qiu, Chunyu Wang, Jingdong Wang, Naiyan Wang, Wen-Jun Zeng

It consists of two separate steps: (1) estimating the 2D poses in multi-view images and (2) recovering the 3D poses from the multi-view 2D poses.

2D Pose Estimation 3D Human Pose Estimation +1

Global-Local Temporal Representations For Video Person Re-Identification

no code implementations ICCV 2019 Jianing Li, Jingdong Wang, Qi Tian, Wen Gao, Shiliang Zhang

The long-term relations are captured by a temporal self-attention model to alleviate occlusion and noise in video sequences.

Metric Learning Re-Ranking +1

Deep High-Resolution Representation Learning for Visual Recognition

42 code implementations20 Aug 2019 Jingdong Wang, Ke Sun, Tianheng Cheng, Borui Jiang, Chaorui Deng, Yang Zhao, Dong Liu, Yadong Mu, Mingkui Tan, Xinggang Wang, Wenyu Liu, Bin Xiao

High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection.

 Ranked #1 on Object Detection on COCO test-dev (Hardware Burden metric)

Dichotomous Image Segmentation Face Alignment +7

Group Re-Identification with Multi-grained Matching and Integration

no code implementations17 May 2019 Weiyao Lin, Yuxi Li, Hao Xiao, John See, Junni Zou, Hongkai Xiong, Jingdong Wang, Tao Mei

The task of re-identifying groups of people under different camera views is an important yet less-studied problem. Group re-identification (Re-ID) is a very challenging task since it is not only adversely affected by common issues in traditional single object Re-ID problems such as viewpoint and human pose variations, but it also suffers from changes in group layout and group membership.

Structured Knowledge Distillation for Dense Prediction

1 code implementation CVPR 2019 Yifan Liu, Changyong Shun, Jingdong Wang, Chunhua Shen

Here we propose to distill structured knowledge from large networks to compact networks, taking into account the fact that dense prediction is a structured prediction problem.

Depth Estimation General Classification +7

Deep High-Resolution Representation Learning for Human Pose Estimation

39 code implementations CVPR 2019 Ke Sun, Bin Xiao, Dong Liu, Jingdong Wang

We start from a high-resolution subnetwork as the first stage, gradually add high-to-low resolution subnetworks one by one to form more stages, and connect the multi-resolution subnetworks in parallel.

2D Human Pose Estimation Instance Segmentation +6

Collaborative Quantization for Cross-Modal Similarity Search

no code implementations CVPR 2016 Ting Zhang, Jingdong Wang

Cross-modal similarity search is the problem of designing a search system that supports querying across content modalities, e.g., using an image to search for texts or using a text to search for images.


Deep Triplet Quantization

1 code implementation1 Feb 2019 Bin Liu, Yue Cao, Mingsheng Long, Jian-Min Wang, Jingdong Wang

We propose Deep Triplet Quantization (DTQ), a novel approach to learning deep quantization models from the similarity triplets.

Deep Hashing Image Retrieval +1

Disparity-preserved Deep Cross-platform Association for Cross-platform Video Recommendation

no code implementations1 Jan 2019 Shengze Yu, Xin Wang, Wenwu Zhu, Peng Cui, Jingdong Wang

However, there remain two unsolved challenges: i) there exist inconsistencies in cross-platform association due to platform-specific disparity, and ii) data from distinct platforms may have different semantic granularities.

Weakly Supervised Dense Event Captioning in Videos

no code implementations NeurIPS 2018 Xuguang Duan, Wenbing Huang, Chuang Gan, Jingdong Wang, Wenwu Zhu, Junzhou Huang

Dense event captioning aims to detect and describe all events of interest contained in a video.

Accelerating Deep Neural Networks with Spatial Bottleneck Modules

no code implementations7 Sep 2018 Junran Peng, Lingxi Xie, Zhao-Xiang Zhang, Tieniu Tan, Jingdong Wang

This paper presents an efficient module named spatial bottleneck for accelerating the convolutional layers in deep neural networks.

OCNet: Object Context Network for Scene Parsing

8 code implementations4 Sep 2018 Yuhui Yuan, Lang Huang, Jianyuan Guo, Chao Zhang, Xilin Chen, Jingdong Wang

To capture richer context information, we further combine our interlaced sparse self-attention scheme with the conventional multi-scale context schemes including pyramid pooling (Zhao et al., 2017) and atrous spatial pyramid pooling (Chen et al., 2018).

Scene Parsing Semantic Segmentation

Weakly-Supervised Semantic Segmentation Network With Deep Seeded Region Growing

1 code implementation CVPR 2018 Zilong Huang, Xinggang Wang, Jiasi Wang, Wenyu Liu, Jingdong Wang

Inspired by the traditional image segmentation method of seeded region growing, we propose to train a semantic segmentation network starting from the discriminative regions and progressively increase the pixel-level supervision using seeded region growing.

Ranked #30 on Weakly-Supervised Semantic Segmentation on COCO 2014 val (using extra training data)

Image Segmentation Segmentation +3

Interleaved Structured Sparse Convolutional Neural Networks

no code implementations CVPR 2018 Guotian Xie, Jingdong Wang, Ting Zhang, Jian-Huang Lai, Richang Hong, Guo-Jun Qi

In this paper, we study the problem of designing efficient convolutional neural network architectures with the interest in eliminating the redundancy in convolution kernels.

IGCV2: Interleaved Structured Sparse Convolutional Neural Networks

2 code implementations17 Apr 2018 Guotian Xie, Jingdong Wang, Ting Zhang, Jian-Huang Lai, Richang Hong, Guo-Jun Qi

In this paper, we study the problem of designing efficient convolutional neural network architectures with the interest in eliminating the redundancy in convolution kernels.

LVreID: Person Re-Identification with Long Sequence Videos

no code implementations20 Dec 2017 Jianing Li, Shiliang Zhang, Jingdong Wang, Wen Gao, Qi Tian

This paper mainly establishes a large-scale Long sequence Video database for person re-IDentification (LVreID).

Person Re-Identification

Composite Quantization

1 code implementation4 Dec 2017 Jingdong Wang, Ting Zhang

We introduce a composite quantization framework.


S4Net: Single Stage Salient-Instance Segmentation

1 code implementation CVPR 2019 Ruochen Fan, Ming-Ming Cheng, Qibin Hou, Tai-Jiang Mu, Jingdong Wang, Shi-Min Hu

Taking into account the category-independent property of each target, we design a single stage salient instance segmentation framework, with a novel segmentation branch.

Instance Segmentation Segmentation +1

Global versus Localized Generative Adversarial Nets

2 code implementations CVPR 2018 Guo-Jun Qi, Liheng Zhang, Hao Hu, Marzieh Edraki, Jingdong Wang, Xian-Sheng Hua

In this paper, we present a novel localized Generative Adversarial Net (GAN) to learn on the manifold of real data.

General Classification

Ensemble Diffusion for Retrieval

no code implementations ICCV 2017 Song Bai, Zhichao Zhou, Jingdong Wang, Xiang Bai, Longin Jan Latecki, Qi Tian

This stimulates great research interest in considering similarity fusion in the framework of the diffusion process (i.e., fusion with diffusion) for robust retrieval.

3D Shape Classification 3D Shape Retrieval +2

Interleaved Group Convolutions

no code implementations ICCV 2017 Ting Zhang, Guo-Jun Qi, Bin Xiao, Jingdong Wang

The main point lies in a novel building block, a pair of two successive interleaved group convolutions: primary group convolution and secondary group convolution.

Human Pose Estimation using Global and Local Normalization

no code implementations ICCV 2017 Ke Sun, Cuiling Lan, Junliang Xing, Wen-Jun Zeng, Dong Liu, Jingdong Wang

We present a two-stage normalization scheme, human body normalization and limb normalization, to make the distribution of the relative joint locations compact, resulting in easier learning of convolutional spatial models and more accurate pose estimation.

Pose Estimation

Rethink ReLU to Training Better CNNs

no code implementations19 Sep 2017 Gangming Zhao, Zhao-Xiang Zhang, He Guan, Peng Tang, Jingdong Wang

Most convolutional neural networks share the same characteristic: each convolutional layer is followed by a nonlinear activation layer, of which the Rectified Linear Unit (ReLU) is the most widely used.

Deeply-Learned Part-Aligned Representations for Person Re-Identification

1 code implementation ICCV 2017 Liming Zhao, Xi Li, Jingdong Wang, Yueting Zhuang

In this paper, we address the problem of person re-identification, which refers to associating the persons captured from different cameras.

Person Re-Identification

Orthogonal and Idempotent Transformations for Learning Deep Neural Networks

no code implementations19 Jul 2017 Jingdong Wang, Yajie Xing, Kexin Zhang, Cha Zhang

Identity transformations, used as skip-connections in residual networks, directly connect convolutional layers close to the input and those close to the output in deep neural networks, improving information flow and thus easing the training.

Interleaved Group Convolutions for Deep Neural Networks

2 code implementations10 Jul 2017 Ting Zhang, Guo-Jun Qi, Bin Xiao, Jingdong Wang

The main point lies in a novel building block, a pair of two successive interleaved group convolutions: primary group convolution and secondary group convolution.

Learning Correspondence Structures for Person Re-identification

no code implementations20 Mar 2017 Weiyao Lin, Yang Shen, Junchi Yan, Mingliang Xu, Jianxin Wu, Jingdong Wang, Ke Lu

We first introduce a boosting-based approach to learn a correspondence structure which indicates the patch-wise matching probabilities between images from a target camera pair.

Patch Matching Person Re-Identification

Deep Convolutional Neural Networks with Merge-and-Run Mappings

4 code implementations23 Nov 2016 Liming Zhao, Jingdong Wang, Xi Li, Zhuowen Tu, Wen-Jun Zeng

A deep residual network, built by stacking a sequence of residual blocks, is easy to train, because identity mappings skip residual branches and thus improve information flow.

Geometric Neural Phrase Pooling: Modeling the Spatial Co-occurrence of Neurons

no code implementations21 Jul 2016 Lingxi Xie, Qi Tian, John Flynn, Jingdong Wang, Alan Yuille

For this, we consider the neurons in the hidden layer as neural words, and construct a set of geometric neural phrases on top of them.

Image Classification

A Survey on Learning to Hash

no code implementations1 Jun 2016 Jingdong Wang, Ting Zhang, Jingkuan Song, Nicu Sebe, Heng Tao Shen

In this paper, we present a comprehensive survey of learning-to-hash algorithms, categorize them according to how they preserve similarities (pairwise similarity preserving, multiwise similarity preserving, implicit similarity preserving, and quantization), and discuss their relations.


Deeply-Fused Nets

2 code implementations25 May 2016 Jingdong Wang, Zhen Wei, Ting Zhang, Wen-Jun Zeng

Second, in our suggested fused net formed by one deep and one shallow base networks, the flows of the information from the earlier intermediate layer of the deep base network to the output and from the input to the later intermediate layer of the deep base network are both improved.

InterActive: Inter-Layer Activeness Propagation

no code implementations CVPR 2016 Lingxi Xie, Liang Zheng, Jingdong Wang, Alan Yuille, Qi Tian

An increasing number of computer vision tasks can be tackled with deep features, which are the intermediate outputs of a pre-trained Convolutional Neural Network.

Descriptive General Classification

DisturbLabel: Regularizing CNN on the Loss Layer

2 code implementations CVPR 2016 Lingxi Xie, Jingdong Wang, Zhen Wei, Meng Wang, Qi Tian

For a long time, we have been combating over-fitting in the CNN training process with model regularization, including weight decay, model averaging, data augmentation, etc.

Data Augmentation

Good Practice in CNN Feature Transfer

no code implementations1 Apr 2016 Liang Zheng, Yali Zhao, Shengjin Wang, Jingdong Wang, Qi Tian

The objective of this paper is the effective transfer of the Convolutional Neural Network (CNN) feature in image search and classification.

General Classification Image Retrieval

Scalable Person Re-Identification: A Benchmark

no code implementations ICCV 2015 Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, Qi Tian

As a minor contribution, inspired by recent advances in large-scale image search, this paper proposes an unsupervised Bag-of-Words descriptor.

Image Retrieval Person Re-Identification

RIDE: Reversal Invariant Descriptor Enhancement

no code implementations ICCV 2015 Lingxi Xie, Jingdong Wang, Weiyao Lin, Bo Zhang, Qi Tian

In many fine-grained object recognition datasets, image orientation (left/right) might vary from sample to sample.

Object Recognition

DeepSaliency: Multi-Task Deep Neural Network Model for Salient Object Detection

no code implementations19 Oct 2015 Xi Li, Liming Zhao, Lina Wei, Ming-Hsuan Yang, Fei Wu, Yueting Zhuang, Haibin Ling, Jingdong Wang

A key problem in salient object detection is how to effectively model the semantic properties of salient objects in a data-driven manner.

Image Segmentation