1 code implementation • 24 Oct 2024 • Zijia Zhao, Longteng Guo, Tongtian Yue, Erdong Hu, Shuai Shao, Zehuan Yuan, Hua Huang, Jing Liu
In this paper, we investigate the task of general conversational image retrieval on open-domain images.
1 code implementation • 19 Sep 2024 • Junyi Chen, Lu Chi, Bingyue Peng, Zehuan Yuan
Large Language Models (LLMs) have achieved remarkable success in various fields, prompting several studies to explore their potential in recommendation systems.
1 code implementation • 13 Jun 2024 • Junke Wang, Yi Jiang, Zehuan Yuan, Binyue Peng, Zuxuan Wu, Yu-Gang Jiang
To exploit the complementary nature of image and video data, we further propose a progressive training strategy, where OmniTokenizer is first trained on image data on a fixed resolution to develop the spatial encoding capacity and then jointly trained on image and video data on multiple resolutions to learn the temporal dynamics.
Ranked #10 on Video Prediction on Kinetics-600 12 frames, 64x64
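The two-phase curriculum described above can be sketched as a simple training schedule. This is a minimal illustration, not the paper's implementation: the phase lengths, resolutions, and deterministic alternation are all stand-ins for OmniTokenizer's actual data mixing.

```python
def progressive_schedule(image_steps, joint_steps, resolutions=(128, 256)):
    """Sketch of a progressive training schedule in the spirit of
    OmniTokenizer: phase 1 trains on images at one fixed resolution to
    build spatial encoding capacity; phase 2 mixes images and videos
    at multiple resolutions to learn temporal dynamics. All numbers
    and the alternation rule are illustrative, not from the paper."""
    for _ in range(image_steps):
        yield ("image", resolutions[0])                      # phase 1: fixed res
    for step in range(joint_steps):
        modality = "image" if step % 2 == 0 else "video"     # phase 2: joint data
        yield (modality, resolutions[step % len(resolutions)])
```

A real trainer would draw batches of the yielded modality at the yielded resolution at each step.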
2 code implementations • 10 Jun 2024 • Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, Zehuan Yuan
(3) A text-conditional image generation model with 775M parameters, from two-stage training on LAION-COCO and high aesthetics quality images, demonstrating competitive performance of visual quality and text alignment.
Ranked #27 on Image Generation on ImageNet 256x256
1 code implementation • 19 Apr 2024 • Chuofan Ma, Yi Jiang, Jiannan Wu, Zehuan Yuan, Xiaojuan Qi
We introduce Groma, a Multimodal Large Language Model (MLLM) with grounded and fine-grained visual perception ability.
1 code implementation • 3 Apr 2024 • Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, LiWei Wang
We present Visual AutoRegressive modeling (VAR), a new generation paradigm that redefines the autoregressive learning on images as coarse-to-fine "next-scale prediction" or "next-resolution prediction", diverging from the standard raster-scan "next-token prediction".
Ranked #15 on Image Generation on ImageNet 256x256
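The coarse-to-fine generation loop can be sketched as follows. A random sampler stands in for the learned model, and the scale sizes and vocabulary are illustrative; only the control flow (predict a whole token map per scale, conditioned on the upsampled coarser maps) mirrors the next-scale idea.

```python
import numpy as np

def next_scale_generation(scales=(1, 2, 4, 8), vocab=16, seed=0):
    """Sketch of VAR-style "next-scale prediction": rather than
    emitting tokens one at a time in raster order, each step predicts
    an entire token map at the next finer scale, conditioned on the
    nearest-neighbour upsampling of the coarser maps. The "model"
    here is a random stand-in."""
    rng = np.random.default_rng(seed)
    maps = []
    for s in scales:
        if maps:
            ratio = s // maps[-1].shape[0]
            cond = np.kron(maps[-1], np.ones((ratio, ratio), dtype=int))
        else:
            cond = np.zeros((s, s), dtype=int)
        # A real model would produce logits from `cond`; we sample randomly.
        tokens = (cond + rng.integers(0, vocab, size=(s, s))) % vocab
        maps.append(tokens)
    return maps
```

Each call returns one token map per scale, finest last.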
1 code implementation • CVPR 2024 • Chuang Lin, Yi Jiang, Lizhen Qu, Zehuan Yuan, Jianfei Cai
To address it, we formulate object detection as a generative problem and propose a simple framework named GenerateU, which can detect dense objects and generate their names in a free-form way.
2 code implementations • 25 Dec 2023 • Jiannan Wu, Yi Jiang, Bin Yan, Huchuan Lu, Zehuan Yuan, Ping Luo
We evaluate our unified models on various benchmarks.
1 code implementation • CVPR 2024 • Junfeng Wu, Yi Jiang, Qihao Liu, Zehuan Yuan, Xiang Bai, Song Bai
We present GLEE in this work, an object-level foundation model for locating and identifying objects in images and videos.
Ranked #1 on Referring Expression Segmentation on RefCOCO (using extra training data)
Long-tail Video Object Segmentation • Multi-Object Tracking • +8 more
1 code implementation • 2 Nov 2023 • Haosen Yang, Chuofan Ma, Bin Wen, Yi Jiang, Zehuan Yuan, Xiatian Zhu
Building on the success of powerful image-level vision-language (ViL) foundation models like CLIP, recent efforts have sought to harness their capabilities by either training a contrastive model from scratch with an extensive collection of region-label pairs or aligning the outputs of a detection model with image-level representations of region proposals.
1 code implementation • NeurIPS 2023 • Chuofan Ma, Yi Jiang, Xin Wen, Zehuan Yuan, Xiaojuan Qi
CoDet then leverages visual similarities to discover the co-occurring objects and align them with the shared concept.
Ranked #4 on Open Vocabulary Object Detection on LVIS v1.0 (using extra training data)
no code implementations • 23 Aug 2023 • Junyi Chen, Longteng Guo, Jia Sun, Shuai Shao, Zehuan Yuan, Liang Lin, Dongyu Zhang
Owing to the combination of the unified architecture and pre-training task, EVE is easy to scale up, enabling better downstream performance with fewer resources and faster training speed.
no code implementations • ICCV 2023 • Jiannan Wu, Yi Jiang, Bin Yan, Huchuan Lu, Zehuan Yuan, Ping Luo
Open-world instance segmentation is an emerging task that aims to segment all objects in an image by learning from a limited number of base-category objects.
1 code implementation • 25 May 2023 • Zijia Zhao, Longteng Guo, Tongtian Yue, Sihan Chen, Shuai Shao, Xinxin Zhu, Zehuan Yuan, Jing Liu
We show that only language-paired two-modality data is sufficient to connect all modalities.
no code implementations • CVPR 2023 • Li Xu, Mark He Huang, Xindi Shang, Zehuan Yuan, Ying Sun, Jun Liu
Then, following a novel meta-optimization scheme that optimizes the model for good testing performance on the virtual testing sets after training on the virtual training set, our framework effectively drives the model to better capture the semantics and visual representations of individual concepts, achieving robust generalization performance even when handling novel compositions.
no code implementations • CVPR 2023 • Tianjiao Li, Lin Geng Foo, Ping Hu, Xindi Shang, Hossein Rahmani, Zehuan Yuan, Jun Liu
Pre-training VTs on such corrupted data can be challenging, especially when we pre-train via the masked autoencoding approach, where both the inputs and the masked "ground truth" targets can potentially be unreliable in this case.
no code implementations • CVPR 2023 • Yang Jin, Yongzhi Li, Zehuan Yuan, Yadong Mu
Extensive experimental results show that, without further fine-tuning, ECLIP surpasses existing methods by a large margin on a broad range of downstream tasks, demonstrating the strong transferability to real-world E-commerce applications.
1 code implementation • 4 Apr 2023 • Qiushan Guo, Yizhou Yu, Yi Jiang, Jiannan Wu, Zehuan Yuan, Ping Luo
We extend our pretext task to supervised pre-training, which achieves a similar performance to self-supervised learning.
1 code implementation • ICCV 2023 • Qiushan Guo, Chuofan Ma, Yi Jiang, Zehuan Yuan, Yizhou Yu, Ping Luo
Learning image classification and image generation using the same set of network parameters is a challenging problem.
1 code implementation • CVPR 2023 • Bin Yan, Yi Jiang, Jiannan Wu, Dong Wang, Ping Luo, Zehuan Yuan, Huchuan Lu
All instance perception tasks aim at finding certain objects specified by some queries such as category names, language expressions, and target annotations, but this complete field has been split into multiple independent subtasks.
Described Object Detection • Generalized Referring Expression Comprehension • +15 more
1 code implementation • 9 Jan 2023 • Keyu Tian, Yi Jiang, Qishuai Diao, Chen Lin, LiWei Wang, Zehuan Yuan
This is the first use of sparse convolution for 2D masked modeling.
Ranked #1 on Instance Segmentation on COCO 2017 val
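The masked-modeling setup behind this result can be sketched with the masking and loss bookkeeping alone. Note the simplification: the actual method runs a sparse convolutional encoder that skips masked regions entirely, whereas this sketch only shows how patches are dropped and how the reconstruction loss is restricted to them; patch size, mask ratio, and function names are illustrative.

```python
import numpy as np

def masked_patch_loss(image, recon, patch=4, mask_ratio=0.6, seed=0):
    """Sketch of convolutional masked image modeling: split the image
    into patches, mask a random subset, and score the reconstruction
    only on the masked patches, as in masked autoencoding."""
    rng = np.random.default_rng(seed)
    h, w = image.shape
    gh, gw = h // patch, w // patch
    mask = rng.random((gh, gw)) < mask_ratio                 # True = masked patch
    # Expand the patch-level mask to pixel level.
    pix_mask = np.kron(mask, np.ones((patch, patch), dtype=int)).astype(bool)
    loss = float(((image - recon) ** 2)[pix_mask].mean())    # MSE on masked pixels
    return loss, mask
```

A perfect reconstruction yields zero loss regardless of which patches were masked.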
no code implementations • ICCV 2023 • Jiannan Wu, Yi Jiang, Bin Yan, Huchuan Lu, Zehuan Yuan, Ping Luo
In this work, we end the current fragmented situation and propose UniRef to unify the three reference-based object segmentation tasks with a single architecture.
3 code implementations • 15 Dec 2022 • Yabo Xiao, Kai Su, Xiaojuan Wang, Dongdong Yu, Lei Jin, Mingshu He, Zehuan Yuan
The existing end-to-end methods rely on dense representations to preserve the spatial detail and structure for precise keypoint localization.
1 code implementation • 27 Nov 2022 • Chuang Lin, Peize Sun, Yi Jiang, Ping Luo, Lizhen Qu, Gholamreza Haffari, Zehuan Yuan, Jianfei Cai
In this paper, we propose a novel open-vocabulary object detection framework directly learning from image-text pair data.
no code implementations • IEEE Transactions on Pattern Analysis and Machine Intelligence 2022 • Chuchu Han, Zhedong Zheng, Kai Su, Dongdong Yu, Zehuan Yuan, Changxin Gao, Nong Sang, Yi Yang
Person search aims at localizing and recognizing query persons from raw video frames, which is a combination of two sub-tasks, i.e., pedestrian detection and person re-identification.
Ranked #3 on Person Search on PRW
1 code implementation • 9 Oct 2022 • Haosen Yang, Deng Huang, Bin Wen, Jiannan Wu, Hongxun Yao, Yi Jiang, Xiatian Zhu, Zehuan Yuan
As a result, our model can extract effectively both static appearance and dynamic motion spontaneously, leading to superior spatiotemporal representation learning capability.
no code implementations • 9 Oct 2022 • Zijia Zhao, Longteng Guo, Xingjian He, Shuai Shao, Zehuan Yuan, Jing Liu
Our method performs joint masking on image-text input and integrates both implicit and explicit targets for the masked signals to recover.
1 code implementation • 27 Sep 2022 • Yang Jin, Yongzhi Li, Zehuan Yuan, Yadong Mu
Spatio-Temporal video grounding (STVG) focuses on retrieving the spatio-temporal tube of a specific object depicted by a free-form textual expression.
1 code implementation • 26 Sep 2022 • Chuofan Ma, Qiushan Guo, Yi Jiang, Zehuan Yuan, Ping Luo, Xiaojuan Qi
Our key finding is that the major cause of degradation is not information loss in the down-sampling process, but rather the mismatch between network architecture and input scale.
2 code implementations • 9 Sep 2022 • Zhenchao Jin, Dongdong Yu, Zehuan Yuan, Lequan Yu
To this end, we propose MCIBI++, a novel paradigm that softly mines contextual information beyond the image to further boost the pixel-level representations.
2 code implementations • 18 Aug 2022 • Xizhe Xue, Dongdong Yu, Lingqiao Liu, Yu Liu, Satoshi Tsutsui, Ying Li, Zehuan Yuan, Ping Song, Mike Zheng Shou
Based on the single-stage instance segmentation framework, we propose a regularization model to predict foreground pixels and use its relation to instance segmentation to construct a cross-task consistency loss.
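The cross-task consistency idea can be sketched with a simple stand-in loss. This is an assumption-laden illustration, not the paper's formulation: here the pixelwise union (max) of predicted instance masks is required to agree with the predicted foreground map via an MSE term.

```python
import numpy as np

def cross_task_consistency(fg_prob, instance_masks):
    """Hedged sketch of a cross-task consistency loss between a
    foreground-pixel prediction and instance segmentation: the union
    of the instance masks should match the foreground map. The exact
    loss in the paper may differ; MSE is used here for simplicity."""
    union = instance_masks.max(axis=0)          # pixelwise union of instances
    return float(((fg_prob - union) ** 2).mean())
```

The loss is zero exactly when the foreground map equals the instance-mask union.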
1 code implementation • 16 Jul 2022 • Zhenchao Jin, Dongdong Yu, Luchuan Song, Zehuan Yuan, Lequan Yu
Feature pyramid network (FPN) is one of the key components for object detectors.
1 code implementation • 14 Jul 2022 • Bin Yan, Yi Jiang, Peize Sun, Dong Wang, Zehuan Yuan, Ping Luo, Huchuan Lu
We present a unified method, termed Unicorn, that can simultaneously solve four tracking problems (SOT, MOT, VOS, MOTS) with a single network using the same model parameters.
Multi-Object Tracking • Multi-Object Tracking and Segmentation • +3 more
3 code implementations • 3 May 2022 • Zhendong Yang, Zhe Li, Mingqi Shao, Dachuan Shi, Zehuan Yuan, Chun Yuan
Current distillation algorithms usually improve a student's performance by having it imitate the output of the teacher.
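The baseline output-imitation scheme referred to here is the classic logit-matching loss of Hinton et al.: the student matches the teacher's temperature-softened class distribution under a KL divergence. The temperature value below is an illustrative choice.

```python
import numpy as np

def softmax(z, t=1.0):
    """Temperature-scaled softmax (numerically stabilized)."""
    z = np.asarray(z, dtype=float) / t
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, t=2.0):
    """Classic knowledge-distillation loss: KL(teacher || student)
    over temperature-softened distributions, scaled by t^2 so its
    gradient magnitude is comparable across temperatures."""
    p = softmax(teacher_logits, t)
    q = softmax(student_logits, t)
    return float(np.sum(p * (np.log(p) - np.log(q))) * t * t)
```

Feature-level and focal/global variants build on this same imitation objective.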
no code implementations • 5 Apr 2022 • Bo Yuan, Danpei Zhao, Shuai Shao, Zehuan Yuan, Changhu Wang
In two typical cross-domain semantic segmentation tasks, i.e., GTA5 to Cityscapes and SYNTHIA to Cityscapes, our method achieves state-of-the-art segmentation accuracy.
2 code implementations • 5 Mar 2022 • Qishuai Diao, Yi Jiang, Bin Wen, Jia Sun, Zehuan Yuan
Fine-Grained Visual Classification (FGVC) is the task of recognizing objects belonging to multiple subordinate categories of a super-category.
Ranked #1 on Fine-Grained Image Classification on CUB-200-2011
1 code implementation • 26 Feb 2022 • Guanghao Yin, Wei Wang, Zehuan Yuan, Chuchu Han, Wei Ji, Shouqian Sun, Changhu Wang
Comparing the distribution differences between HQ and LQ images helps our model better assess image quality.
1 code implementation • CVPR 2022 • Jiannan Wu, Yi Jiang, Peize Sun, Zehuan Yuan, Ping Luo
Referring video object segmentation (R-VOS) is an emerging cross-modal task that aims to segment the target object referred by a language expression in all video frames.
Ranked #3 on Referring Expression Segmentation on A2D Sentences (using extra training data)
no code implementations • NeurIPS 2021 • Haoyang Li, Xin Wang, Ziwei Zhang, Zehuan Yuan, Hang Li, Wenwu Zhu
Then we propose a novel factor-wise discrimination objective in a contrastive learning manner, which can force the factorized representations to independently reflect the expressive information from different latent factors.
1 code implementation • 1 Dec 2021 • Weihao Jiang, Dongdong Yu, Zhaozhi Xie, Yaoyi Li, Zehuan Yuan, Hongtao Lu
For emerging content-based feature fusion, most existing matting methods focus only on local features, which lack the guidance of a global feature with strong semantic information related to the object of interest.
Ranked #4 on Image Matting on Composition-1K
3 code implementations • CVPR 2022 • Peize Sun, Jinkun Cao, Yi Jiang, Zehuan Yuan, Song Bai, Kris Kitani, Ping Luo
A typical pipeline for multi-object tracking (MOT) is to use a detector for object localization, and following re-identification (re-ID) for object association.
1 code implementation • CVPR 2022 • Zhendong Yang, Zhe Li, Xiaohu Jiang, Yuan Gong, Zehuan Yuan, Danpei Zhao, Chun Yuan
Global distillation rebuilds the relation between different pixels and transfers it from teachers to students, compensating for missing global information in focal distillation.
Ranked #1 on Knowledge Distillation on MS COCO
1 code implementation • 10 Nov 2021 • Chuang Lin, Yi Jiang, Jianfei Cai, Lizhen Qu, Gholamreza Haffari, Zehuan Yuan
Vision-and-Language Navigation (VLN) is a task in which an agent must follow a language instruction to navigate to a goal position, relying on ongoing interactions with the environment while moving.
11 code implementations • arXiv 2021 • Yifu Zhang, Peize Sun, Yi Jiang, Dongdong Yu, Fucheng Weng, Zehuan Yuan, Ping Luo, Wenyu Liu, Xinggang Wang
ByteTrack also achieves state-of-the-art performance on MOT20, HiEve and BDD100K tracking benchmarks.
Ranked #1 on Multiple Object Tracking on BDD100K val
no code implementations • ICLR 2022 • Shuo Yang, Peize Sun, Yi Jiang, Xiaobo Xia, Ruiheng Zhang, Zehuan Yuan, Changhu Wang, Ping Luo, Min Xu
A more realistic object detection paradigm, Open-World Object Detection, has recently attracted increasing research interest in the community.
no code implementations • ICCV 2021 • Chuchu Han, Kai Su, Dongdong Yu, Zehuan Yuan, Changxin Gao, Nong Sang, Yi Yang, Changhu Wang
Large-scale labeled training data is often difficult to collect, especially for person identities.
no code implementations • 1 Sep 2021 • Zhenchao Jin, Dongdong Yu, Kai Su, Zehuan Yuan, Changhu Wang
Video scene parsing is a long-standing challenging task in computer vision, aiming to assign pre-defined semantic labels to pixels of all frames in a given video.
no code implementations • 30 Apr 2021 • Lu Yang, Yunlong Wang, Lingqiao Liu, Peng Wang, Lu Chi, Zehuan Yuan, Changhu Wang, Yanning Zhang
In this paper, we propose a new loss based on center predictivity: a sample must be positioned in the feature space such that the location of the center of same-class samples can be roughly predicted from it.
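The center-predictivity idea can be sketched as follows. This is a heavily hedged illustration: `predict_center` is a hypothetical stand-in for whatever learned head the paper uses, and the empirical class mean plus squared distance are simplifying assumptions.

```python
import numpy as np

def center_predictivity_loss(features, labels, predict_center):
    """Sketch of a center-predictivity loss: from each sample's
    feature, a (hypothetical) head predicts the center of that
    sample's class; the loss penalizes the squared distance between
    the prediction and the empirical class mean."""
    loss, n = 0.0, 0
    for f, y in zip(features, labels):
        center = features[labels == y].mean(axis=0)   # empirical class center
        loss += float(((predict_center(f) - center) ** 2).sum())
        n += 1
    return loss / n
```

With singleton classes and an identity head, each sample is its own center and the loss vanishes.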
1 code implementation • 8 Apr 2021 • Guanghao Yin, Wei Wang, Zehuan Yuan, Wei Ji, Dongdong Yu, Shouqian Sun, Tat-Seng Chua, Changhu Wang
We extract degradation prior at task-level with the proposed ConditionNet, which will be used to adapt the parameters of the basic SR network (BaseNet).
no code implementations • ICCV 2021 • Chuang Lin, Zehuan Yuan, Sicheng Zhao, Peize Sun, Changhu Wang, Jianfei Cai
By disentangling representations on both image and instance levels, DIDN is able to learn domain-invariant representations that are suitable for generalized object detection.
no code implementations • ICCV 2021 • Wei Wang, Haochen Zhang, Zehuan Yuan, Changhu Wang
A popular attempt to address this challenge is unpaired generative adversarial networks, which generate "real" LR counterparts from real HR images via image-to-image translation and then perform super-resolution on the "real" LR images.
no code implementations • ICLR 2021 • Bingyi Kang, Yu Li, Sa Xie, Zehuan Yuan, Jiashi Feng
Motivated by this question, we conduct a series of studies on the performance of self-supervised contrastive learning and supervised learning methods over multiple datasets where training instance distributions vary from a balanced one to a long-tailed one.
Ranked #40 on Long-tail Learning on CIFAR-10-LT (ρ=10)
2 code implementations • 31 Dec 2020 • Peize Sun, Jinkun Cao, Yi Jiang, Rufeng Zhang, Enze Xie, Zehuan Yuan, Changhu Wang, Ping Luo
In this work, we propose TransTrack, a simple but efficient scheme for solving the multiple object tracking problem.
Ranked #12 on Multi-Object Tracking on SportsMOT (using extra training data)
Multi-Object Tracking • Multiple Object Tracking with Transformer • +3 more
1 code implementation • 10 Dec 2020 • Peize Sun, Yi Jiang, Enze Xie, Wenqi Shao, Zehuan Yuan, Changhu Wang, Ping Luo
We identify the classification cost in the matching cost as the main ingredient: (1) previous detectors consider only the location cost; (2) by additionally introducing the classification cost, previous detectors immediately produce one-to-one predictions during inference.
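The matching cost being analysed can be sketched concretely: a one-to-one assignment between predictions and ground truths under a combined location term (L1 box distance) and classification term (negative probability of the GT class). The weights are illustrative, and exhaustive search over permutations stands in for the Hungarian algorithm, so this is only practical for tiny examples.

```python
import numpy as np
from itertools import permutations

def match(pred_boxes, pred_cls_prob, gt_boxes, gt_labels, w_cls=1.0, w_loc=1.0):
    """Sketch of one-to-one matching with a cost that mixes location
    and classification terms. Returns, for each GT index j, the index
    of the prediction assigned to it."""
    P, G = len(pred_boxes), len(gt_boxes)
    cost = np.zeros((P, G))
    for i in range(P):
        for j in range(G):
            loc = np.abs(pred_boxes[i] - gt_boxes[j]).sum()   # L1 box distance
            cls = -pred_cls_prob[i][gt_labels[j]]             # reward right class
            cost[i, j] = w_loc * loc + w_cls * cls
    best = min(permutations(range(P), G),
               key=lambda pi: sum(cost[pi[j], j] for j in range(G)))
    return list(best)
```

When two predictions have identical boxes, the classification term alone breaks the tie, which is exactly the effect the sentence above describes.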
1 code implementation • 10 Dec 2020 • Liang Hou, Zehuan Yuan, Lei Huang, HuaWei Shen, Xueqi Cheng, Changhu Wang
In particular, for real-time generation tasks, different devices require generators of different sizes due to varying computing power.
6 code implementations • CVPR 2021 • Peize Sun, Rufeng Zhang, Yi Jiang, Tao Kong, Chenfeng Xu, Wei Zhan, Masayoshi Tomizuka, Lei LI, Zehuan Yuan, Changhu Wang, Ping Luo
In our method, however, a fixed sparse set of learned object proposals, with a total length of $N$, is provided to the object recognition head to perform classification and localization.
Ranked #5 on 2D Object Detection on CeyMo
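The learned-proposal idea can be sketched in a few lines: the N proposal boxes are free parameters refined by training rather than outputs of a dense region proposal network. The normalized (cx, cy, w, h) parameterization, whole-image initialization, and jitter magnitude below are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def init_learned_proposals(n=100, seed=0):
    """Sketch of a fixed sparse set of N learned proposal boxes, in
    normalized (cx, cy, w, h) form, initialized near the whole image.
    In training these would be nn.Parameter-style weights updated by
    gradient descent together with the recognition head."""
    rng = np.random.default_rng(seed)
    boxes = np.tile([0.5, 0.5, 1.0, 1.0], (n, 1))       # start at whole image
    boxes += 0.01 * rng.standard_normal(boxes.shape)     # small symmetry-breaking jitter
    return boxes
```

The recognition head then iteratively refines these boxes instead of filtering hundreds of thousands of dense anchors.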
1 code implementation • CVPR 2020 • Lei Huang, Li Liu, Fan Zhu, Diwen Wan, Zehuan Yuan, Bo Li, Ling Shao
Orthogonality is widely used for training deep neural networks (DNNs) due to its ability to maintain all singular values of the Jacobian close to 1 and reduce redundancy in representation.
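A common soft surrogate for the hard orthogonality constraints this line of work studies is the penalty ||WᵀW − I||_F², which is zero exactly when the columns of W are orthonormal, i.e. when all singular values of W equal 1. Using it as a regularizer (rather than the paper's exact scheme) looks like:

```python
import numpy as np

def orth_penalty(w):
    """Soft orthogonality penalty ||W^T W - I||_F^2: zero iff the
    columns of W are orthonormal, which keeps W's singular values
    at 1 and reduces redundancy in the learned representation."""
    wtw = w.T @ w
    return float(np.linalg.norm(wtw - np.eye(wtw.shape[0]), "fro") ** 2)
```

Adding a small multiple of this term to the training loss nudges weight matrices toward orthogonality without a hard constraint.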
no code implementations • 28 Oct 2019 • Dongdong Yu, Zehuan Yuan, Jinlai Liu, Kun Yuan, Changhu Wang
Instance Segmentation is an interesting yet challenging task in computer vision.
no code implementations • 3 Jul 2019 • Wei Li, Zehuan Yuan, Dashan Guo, Lei Huang, Xiangzhong Fang, Changhu Wang
To perform action detection, we design a 3D convolution network with skip connections for tube classification and regression.
no code implementations • 9 Oct 2018 • Wei Li, Zehuan Yuan, Xiangzhong Fang, Changhu Wang
Attention mechanisms have been widely used in Visual Question Answering (VQA) solutions due to their capacity to model deep cross-domain interactions.
no code implementations • 16 Sep 2018 • Jinlai Liu, Zehuan Yuan, Changhu Wang
Leveraging both visual frames and audio has been experimentally proven effective for improving large-scale video classification.
no code implementations • CVPR 2017 • Zehuan Yuan, Jonathan C. Stroud, Tong Lu, Jia Deng
We pose action localization as a structured prediction over arbitrary-length temporal windows, where each window is scored as the sum of frame-wise classification scores.
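The scoring rule stated above, each window scored as the sum of its frame-wise classification scores, can be computed efficiently with prefix sums. This sketch scores all O(T²) windows and returns the best one; a real system would add further structure (e.g. length priors), which is omitted here.

```python
import numpy as np

def best_window(frame_scores):
    """Score every temporal window [s, e) as the sum of its
    frame-wise classification scores, using prefix sums so each
    window is evaluated in O(1). Returns (best score, (s, e))."""
    prefix = np.concatenate([[0.0], np.cumsum(frame_scores)])
    best, span = -np.inf, (0, 0)
    T = len(frame_scores)
    for s in range(T):
        for e in range(s + 1, T + 1):
            score = prefix[e] - prefix[s]
            if score > best:
                best, span = score, (s, e)
    return best, span
```

Maximizing this score over arbitrary-length windows is exactly the structured prediction the sentence describes.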