1 code implementation • 6 Apr 2025 • Yang Jiao, Haibo Qiu, Zequn Jie, Shaoxiang Chen, Jingjing Chen, Lin Ma, Yu-Gang Jiang
We introduce UniToken, an auto-regressive generation model that encodes visual inputs through a combination of discrete and continuous representations, enabling seamless integration of unified visual understanding and image generation tasks.
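As a hedged illustration of how discrete and continuous visual representations might be combined, the minimal PyTorch sketch below concatenates VQ-codebook embeddings with projected continuous features into a single token sequence; the module names, dimensions, and concatenation strategy are assumptions for illustration, not UniToken's actual implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: fusing discrete (VQ-codebook) and continuous
# (ViT-style) visual representations into one sequence for an
# autoregressive model. All names and dimensions are illustrative.
class DualVisualEncoder(nn.Module):
    def __init__(self, codebook_size=8192, embed_dim=1024, vit_dim=768):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, embed_dim)  # discrete branch
        self.proj_continuous = nn.Linear(vit_dim, embed_dim)    # continuous branch

    def forward(self, vq_indices, vit_features):
        # vq_indices: (B, N) int64; vit_features: (B, N, vit_dim)
        discrete = self.codebook(vq_indices)              # (B, N, D)
        continuous = self.proj_continuous(vit_features)   # (B, N, D)
        # Concatenate along the sequence axis; interleaving is another option.
        return torch.cat([discrete, continuous], dim=1)   # (B, 2N, D)
```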
1 code implementation • 27 Feb 2025 • Siyu Jiao, Gengwei Zhang, Yinlong Qian, Jiancheng Huang, Yao Zhao, Humphrey Shi, Lin Ma, Yunchao Wei, Zequn Jie
This work challenges the residual prediction paradigm in visual autoregressive modeling and presents FlexVAR, a new Flexible Visual AutoRegressive image generation paradigm.
1 code implementation • 3 Jan 2025 • Jiaming Li, Jiacheng Zhang, Zequn Jie, Lin Ma, Guanbin Li
In this method, we design a Cross-Modal Value-Enhanced Decoding (CMVED) module to alleviate hallucination via a novel contrastive decoding mechanism.
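CMVED's exact formulation is not given here; the sketch below shows the generic contrastive decoding recipe such modules typically build on, contrasting a full (visually grounded) branch against a distorted one. The function name and hyperparameters are hypothetical.

```python
import torch

def contrastive_decode_step(logits_full, logits_distorted, alpha=1.0, beta=0.1):
    # Generic contrastive decoding: boost tokens the visually grounded branch
    # prefers over a distorted branch, suppressing hallucinated continuations.
    probs_full = torch.softmax(logits_full, dim=-1)
    # Adaptive plausibility constraint: keep only tokens whose probability
    # under the full branch is at least beta times the top probability.
    keep = probs_full >= beta * probs_full.max(dim=-1, keepdim=True).values
    contrast = (1 + alpha) * logits_full - alpha * logits_distorted
    contrast = contrast.masked_fill(~keep, float("-inf"))
    return contrast.argmax(dim=-1)  # greedy choice from the contrasted scores
```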
no code implementations • 26 Dec 2024 • Siyu Jiao, Haoye Dong, Yuyang Yin, Zequn Jie, Yinlong Qian, Yao Zhao, Humphrey Shi, Yunchao Wei
CLIP-GS leverages contrastive loss between 3DGS and the visual-text embeddings of CLIP, and we introduce an image voting loss to guide the directionality and convergence of gradient optimization.
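A minimal sketch of the symmetric InfoNCE-style contrastive loss such a 3DGS-to-CLIP alignment could use; tensor shapes, the function name, and the temperature are assumptions rather than CLIP-GS's exact loss.

```python
import torch
import torch.nn.functional as F

def gs_clip_contrastive_loss(gs_emb, clip_emb, temperature=0.07):
    # gs_emb: (B, D) embeddings from a 3DGS encoder; clip_emb: (B, D)
    # matching CLIP visual/text embeddings. Symmetric InfoNCE as in CLIP.
    gs = F.normalize(gs_emb, dim=-1)
    ref = F.normalize(clip_emb, dim=-1)
    logits = gs @ ref.t() / temperature                 # (B, B) similarities
    targets = torch.arange(gs.size(0), device=gs.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```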
1 code implementation • 10 Dec 2024 • Zhijian Huang, Chengjian Feng, Feng Yan, Baihui Xiao, Zequn Jie, Yujie Zhong, Xiaodan Liang, Lin Ma
Large Multimodal Models (LMMs) have demonstrated exceptional comprehension and interpretation capabilities in Autonomous Driving (AD) by incorporating large language models.
1 code implementation • 27 Nov 2024 • Shimin Chen, Xiaohan Lan, Yitian Yuan, Zequn Jie, Lin Ma
The rapid development of large language models (LLMs) has significantly advanced multimodal large language models (MLLMs), particularly in vision-language tasks.
no code implementations • 15 Oct 2024 • Xiaohan Lan, Yitian Yuan, Zequn Jie, Lin Ma
Video-based multimodal large language models (Video-LLMs) possess significant potential for video understanding tasks.
no code implementations • 9 Sep 2024 • Jiancheng Huang, Yu Gao, Zequn Jie, Yujie Zhong, Xintong Han, Lin Ma
For text reference, we align the text feature of Stable Diffusion priors with the style feature of our IRStyle to perform text-guided color style transfer (TRStyle).
no code implementations • 25 Aug 2024 • Zhijian Huang, Tao Tang, Shaoxiang Chen, Sihao Lin, Zequn Jie, Lin Ma, Guangrun Wang, Xiaodan Liang
Inspired by the knowledge-driven nature of human driving, recent approaches explore the potential of large language models (LLMs) to improve understanding and decision-making in traffic scenarios.
1 code implementation • 13 Jul 2024 • Xiaoxu Xu, Yitian Yuan, Jinlong Li, Qiudan Zhang, Zequn Jie, Lin Ma, Hao Tang, Nicu Sebe, Xu Wang
In this paper, we propose 3DSS-VLG, a weakly supervised approach for 3D Semantic Segmentation with 2D Vision-Language Guidance, in which a 3D model predicts a dense embedding for each point that is co-embedded with both the aligned image and text spaces of the 2D vision-language model.
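A hedged sketch of the zero-shot labeling step such a co-embedding enables: each point embedding is matched against class-name text embeddings by cosine similarity. Shapes and names are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

def label_points(point_emb, text_emb):
    # point_emb: (P, D) per-point embeddings from the 3D model;
    # text_emb: (C, D) class-name embeddings from the 2D vision-language model.
    sim = F.normalize(point_emb, dim=-1) @ F.normalize(text_emb, dim=-1).t()
    return sim.argmax(dim=-1)  # (P,) predicted class index per point
```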
no code implementations • 11 Jul 2024 • Jinlong Li, Dong Zhao, Zequn Jie, Elisa Ricci, Lin Ma, Nicu Sebe
Previous works primarily focus on prompt learning to adapt CLIP to a variety of downstream tasks; however, they suffer from task overfitting when fine-tuned on a small dataset.
1 code implementation • 10 Jul 2024 • Hao Wang, Pengzhen Ren, Zequn Jie, Xiao Dong, Chengjian Feng, Yinlong Qian, Lin Ma, Dongmei Jiang, YaoWei Wang, Xiangyuan Lan, Xiaodan Liang
To address these challenges, we propose a novel unified open-vocabulary detection method called OV-DINO, which is pre-trained on diverse large-scale datasets with language-aware selective fusion in a unified framework.
Ranked #5 on Zero-Shot Object Detection on MSCOCO (AP metric, using extra training data)
no code implementations • 3 Jul 2024 • Lei Chen, Feng Yan, Yujie Zhong, Shaoxiang Chen, Zequn Jie, Lin Ma
Multimodal Large Language Models (MLLMs) have made significant progress in the field of document analysis.
no code implementations • 12 Jun 2024 • Shimin Chen, Yitian Yuan, Shaoxiang Chen, Zequn Jie, Lin Ma
Amidst the advancements in image-based Large Vision-Language Models (image-LVLM), the transition to video-based models (video-LVLM) is hindered by the limited availability of quality video data.
2 code implementations • CVPR 2024 • Duojun Huang, Xinyu Xiong, Jie Ma, Jichang Li, Zequn Jie, Lin Ma, Guanbin Li
In this paper, we propose a novel framework, termed AlignSAM, designed for automatic prompting for aligning SAM to an open context through reinforcement learning.
no code implementations • 5 May 2024 • Yu Gao, Jiancheng Huang, Xiaopeng Sun, Zequn Jie, Yujie Zhong, Lin Ma
In this paper, we introduce Matten, a cutting-edge latent diffusion model with Mamba-Attention architecture for video generation.
1 code implementation • 12 Mar 2024 • Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang
To address this issue, we propose a novel LMM architecture named Lumen, a Large multimodal model with versatile vision-centric capability enhancement.
no code implementations • CVPR 2024 • Chengjian Feng, Yujie Zhong, Zequn Jie, Weidi Xie, Lin Ma
The grounding head is trained to align the text embedding of category names with the regional visual feature of the diffusion model, using supervision from an off-the-shelf object detector, and a novel self-training scheme on (novel) categories not covered by the detector.
no code implementations • 29 Jan 2024 • Shaoxiang Chen, Zequn Jie, Lin Ma
To address this issue, we propose to apply an efficient Mixture of Experts (MoE) design, namely a sparse Mixture of LoRA Experts (MoLE), for instruction finetuning of MLLMs.
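A minimal sketch of a sparse Mixture of LoRA Experts layer, assuming per-token top-1 routing over low-rank adapters on a frozen base projection; the routing strategy, rank, and expert count are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MoLELinear(nn.Module):
    # Sketch of a sparse Mixture of LoRA Experts: a frozen base projection
    # plus per-token top-1 routing over several low-rank adapters.
    def __init__(self, d_in, d_out, num_experts=4, rank=8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)       # frozen pretrained weight
        self.router = nn.Linear(d_in, num_experts)
        self.lora_A = nn.Parameter(torch.randn(num_experts, d_in, rank) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(num_experts, rank, d_out))

    def forward(self, x):                            # x: (B, T, d_in)
        idx = self.router(x).argmax(-1)              # top-1 expert per token
        A, B = self.lora_A[idx], self.lora_B[idx]    # gathered per-token adapters
        delta = torch.einsum("btd,btdr,btro->bto", x, A, B)
        return self.base(x) + delta                  # base output + expert update
```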
1 code implementation • CVPR 2024 • Yunan Zeng, Yan Huang, Jinjin Zhang, Zequn Jie, Zhenhua Chai, Liang Wang
To demonstrate this, we propose the Attribute, Relation, and Priority grounding (ARPGrounding) benchmark to test VLMs' compositional reasoning ability on visual grounding tasks.
no code implementations • 15 Dec 2023 • Xiaoxu Xu, Yitian Yuan, Qiudan Zhang, Wenhui Wu, Zequn Jie, Lin Ma, Xu Wang
During the inference stage, the learned text-3D correspondence will help us ground the text queries to the 3D target objects even without 2D images.
no code implementations • 13 Dec 2023 • Yang Jiao, Zequn Jie, Shaoxiang Chen, Lechao Cheng, Jingjing Chen, Lin Ma, Yu-Gang Jiang
The camera-based bird's-eye-view (BEV) perception paradigm has made significant progress in the autonomous driving field.
no code implementations • 1 Jun 2023 • Xiao Dong, Runhui Huang, XiaoYong Wei, Zequn Jie, Jianxing Yu, Jian Yin, Xiaodan Liang
Recent advances in vision-language pre-training have enabled machines to perform better in multimodal object discrimination (e.g., image-text semantic alignment) and image synthesis (e.g., text-to-image generation).
1 code implementation • 5 Feb 2023 • Sifan Zhou, Zhi Tian, Xiangxiang Chu, Xinyu Zhang, Bo Zhang, Xiaobo Lu, Chengjian Feng, Zequn Jie, Patrick Yin Chiang, Lin Ma
The deployment of 3D detectors poses one of the major challenges in real-world self-driving scenarios.
1 code implementation • 7 Dec 2022 • Feng Yan, Zhiheng Li, Weixin Luo, Zequn Jie, Fan Liang, Xiaolin Wei, Lin Ma
This is a brief technical report of our proposed method for Multiple-Object Tracking (MOT) Challenge in Complex Environments.
Ranked #10 on Multi-Object Tracking on DanceTrack (using extra training data)
1 code implementation • CVPR 2023 • Chengjian Feng, Zequn Jie, Yujie Zhong, Xiangxiang Chu, Lin Ma
However, the typical convolution ignores the radial symmetry of the BEV features and increases the difficulty of the detector optimization.
1 code implementation • 16 Sep 2022 • Jinlong Li, Zequn Jie, Xu Wang, Xiaolin Wei, Lin Ma
To tackle this issue, this paper proposes an Expansion and Shrinkage scheme based on offset learning in the deformable convolution, to sequentially improve the recall and precision of the located object in the two respective stages.
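The offset-learning backbone such a scheme builds on can be sketched with torchvision's deformable convolution, where a small conv predicts per-tap sampling offsets; this is a generic deformable-conv sketch, not the paper's Expansion/Shrinkage modules.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class OffsetBranch(nn.Module):
    # Generic offset-learning sketch: a small conv predicts per-tap (dx, dy)
    # sampling offsets for a deformable convolution, letting the receptive
    # field expand over more of the object or shrink onto its core parts.
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        n_offsets = 2 * kernel_size * kernel_size    # (dx, dy) for each tap
        self.offset_pred = nn.Conv2d(channels, n_offsets, 3, padding=1)
        self.deform = DeformConv2d(channels, channels, kernel_size, padding=1)

    def forward(self, feat):                         # feat: (B, C, H, W)
        offsets = self.offset_pred(feat)             # learned sampling shifts
        return self.deform(feat, offsets)            # resampled feature map
```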
1 code implementation • 16 Sep 2022 • Jinlong Li, Zequn Jie, Xu Wang, Yu Zhou, Xiaolin Wei, Lin Ma
"Progressive Patch Learning" further extends the feature destruction and patch learning to multi-level granularities in a progressive manner.
1 code implementation • CVPR 2023 • Yang Jiao, Zequn Jie, Shaoxiang Chen, Jingjing Chen, Lin Ma, Yu-Gang Jiang
Recent approaches aim at exploring the semantic densities of camera features by lifting points in 2D camera images (referred to as seeds) into 3D space, and then incorporating 2D semantics via cross-modal interaction or fusion techniques.
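A hedged sketch of the seed-lifting step: back-projecting 2D pixel seeds into 3D camera space given per-seed depth and camera intrinsics. The function name and argument conventions are assumptions.

```python
import torch

def lift_seeds_to_3d(uv, depth, K):
    # uv: (N, 2) pixel coordinates of 2D seeds; depth: (N,) per-seed depth;
    # K: (3, 3) camera intrinsics. Back-projects seeds into 3D camera space.
    ones = torch.ones(uv.size(0), 1, device=uv.device)
    pix = torch.cat([uv, ones], dim=1)            # homogeneous pixels, (N, 3)
    rays = pix @ torch.linalg.inv(K).t()          # normalized camera rays
    return rays * depth.unsqueeze(1)              # (N, 3) lifted 3D points
```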
no code implementations • 11 Aug 2022 • Xujie Zhang, Yu Sha, Michael C. Kampffmeyer, Zhenyu Xie, Zequn Jie, Chengwen Huang, Jianqing Peng, Xiaodan Liang
ARMANI discretizes an image into uniform tokens based on a learned cross-modal codebook in its first stage and uses a Transformer to model the distribution of image tokens for a real image given the tokens of the control signals in its second stage.
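A compact sketch of such a two-stage scheme, assuming a learned codebook for stage one and a causal Transformer over concatenated control and image tokens for stage two; all sizes and module choices are illustrative, not ARMANI's actual architecture.

```python
import torch
import torch.nn as nn

class TwoStageSketch(nn.Module):
    # Stage 1: image patches are quantized to codebook indices (not shown).
    # Stage 2: a causal Transformer models image tokens conditioned on the
    # control-signal tokens prepended to the sequence.
    def __init__(self, vocab=1024, dim=512):
        super().__init__()
        self.codebook = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, vocab)

    def forward(self, control_tokens, image_tokens):
        seq = torch.cat([control_tokens, image_tokens], dim=1)  # (B, T) int64
        T = seq.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool,
                                       device=seq.device), diagonal=1)
        h = self.decoder(self.codebook(seq), mask=causal)
        return self.head(h)  # next-token logits over the image codebook
```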
no code implementations • 11 Jul 2022 • Shaoxiang Chen, Zequn Jie, Xiaolin Wei, Lin Ma
In this technical report, we introduce our submission to the Waymo 3D Detection leaderboard.
2 code implementations • 30 Mar 2022 • Chengjian Feng, Yujie Zhong, Zequn Jie, Xiangxiang Chu, Haibing Ren, Xiaolin Wei, Weidi Xie, Lin Ma
The goal of this work is to establish a scalable pipeline for expanding an object detector towards novel/unseen categories, using zero manual annotations.
no code implementations • 10 Mar 2022 • Yang Jiao, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang
Recently, one-stage visual grounders have attracted great attention due to their comparable accuracy but significantly higher efficiency than two-stage grounders.
1 code implementation • 10 Mar 2022 • Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang
3D dense captioning is a recently proposed task in which point clouds contain more geometric information than their 2D counterparts.
Ranked #7 on 3D dense captioning on ScanRefer Dataset
1 code implementation • 9 Oct 2021 • Yang Jiao, Zequn Jie, Weixin Luo, Jingjing Chen, Yu-Gang Jiang, Xiaolin Wei, Lin Ma
Referring Image Segmentation (RIS) aims at segmenting the target object from an image referred by one given natural language expression.
no code implementations • 23 May 2020 • Zheng Ge, Zequn Jie, Xin Huang, Chengzheng Li, Osamu Yoshie
The first imbalance lies in the large number of low-quality RPN proposals, which makes the R-CNN module (i.e., post-classification layers) highly biased towards negative proposals in the early training stage.
1 code implementation • CVPR 2020 • Yuan Gao, Haoping Bai, Zequn Jie, Jiayi Ma, Kui Jia, Wei Liu
We propose to incorporate neural architecture search (NAS) into general-purpose multi-task learning (GP-MTL).
no code implementations • CVPR 2020 • Xin Huang, Zheng Ge, Zequn Jie, Osamu Yoshie
To acquire the visible parts, a novel Paired-Box Model (PBM) is proposed to simultaneously predict the full and visible boxes of a pedestrian.
no code implementations • 16 Mar 2020 • Zheng Ge, Zequn Jie, Xin Huang, Rong Xu, Osamu Yoshie
PS-RCNN first detects slightly/non-occluded objects with an R-CNN module (referred to as P-RCNN), and then suppresses the detected instances with human-shaped masks so that the features of heavily occluded instances can stand out (see the sketch below).
Ranked #2 on Object Detection on WiderPerson
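A hedged sketch of the suppression step: zeroing out features inside first-pass detections so a second pass can focus on occluded instances. Rectangular masks are used here for brevity, whereas PS-RCNN uses human-shaped masks; the names and stride convention are assumptions.

```python
import torch

def suppress_detected(feat, boxes, stride=16):
    # feat: (B, C, H, W) backbone features; boxes: list of (N_i, 4) tensors of
    # first-pass detections in image coordinates. Zeroing the features inside
    # detected boxes lets a second R-CNN pass focus on occluded instances.
    B, C, H, W = feat.shape
    mask = torch.ones(B, 1, H, W, device=feat.device)
    for b, bxs in enumerate(boxes):
        for x1, y1, x2, y2 in (bxs / stride).long().tolist():
            mask[b, :, max(y1, 0):min(y2, H), max(x1, 0):min(x2, W)] = 0.0
    return feat * mask  # rectangular masks here; PS-RCNN uses human-shaped ones
```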
1 code implementation • CVPR 2020 • Li Yuan, Tao Wang, Xiaopeng Zhang, Francis EH Tay, Zequn Jie, Wei Liu, Jiashi Feng
In this work, we propose a new global similarity metric, termed central similarity, with which the hash codes of similar data pairs are encouraged to approach a common center and those of dissimilar pairs to converge to different centers, improving hash learning efficiency and retrieval accuracy.
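A minimal sketch of central similarity learning, assuming Hadamard-matrix rows as hash centers and a binary cross-entropy pull toward each sample's class center; this follows the common recipe for such methods and is not a verbatim reimplementation.

```python
import torch
import torch.nn.functional as F
from scipy.linalg import hadamard

def make_hash_centers(num_classes, n_bits):
    # Rows of a Hadamard matrix are mutually orthogonal {-1, +1} vectors,
    # a common choice of well-separated centers (n_bits must be a power of
    # two, with num_classes <= n_bits).
    H = torch.from_numpy(hadamard(n_bits)).float()
    return H[:num_classes]                        # (num_classes, n_bits)

def central_similarity_loss(codes, labels, centers):
    # codes: (B, n_bits) tanh outputs in [-1, 1]; labels: (B,) class ids.
    target = (centers[labels] + 1) / 2            # map {-1, +1} centers to {0, 1}
    probs = ((codes + 1) / 2).clamp(1e-6, 1 - 1e-6)
    return F.binary_cross_entropy(probs, target)  # pull codes toward centers
```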
no code implementations • 9 Dec 2018 • Xinpeng Chen, Lin Ma, Jingyuan Chen, Zequn Jie, Wei Liu, Jiebo Luo
Experiments on the RefCOCO, RefCOCO+, and RefCOCOg datasets demonstrate that our proposed SSG, without relying on any region proposals, achieves performance comparable to other advanced models.
no code implementations • CVPR 2019 • Fangyu Zou, Li Shen, Zequn Jie, Weizhong Zhang, Wei Liu
Adam and RMSProp are two of the most influential adaptive stochastic algorithms for training deep neural networks, yet they have been shown to diverge even in the convex setting via a few simple counterexamples.
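For reference, the textbook Adam update that such divergence analyses examine is written out below; this is the standard update, not the paper's proposed analysis or fix.

```python
import torch

def adam_step(p, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # One textbook Adam update (t starts at 1). Divergence analyses focus on
    # how the second-moment decay b2 interacts with the gradient sequence.
    m = b1 * m + (1 - b1) * grad                 # first-moment estimate
    v = b2 * v + (1 - b2) * grad * grad          # second-moment estimate
    m_hat = m / (1 - b1 ** t)                    # bias correction
    v_hat = v / (1 - b2 ** t)
    p = p - lr * m_hat / (v_hat.sqrt() + eps)    # parameter update
    return p, m, v
```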
no code implementations • ECCV 2018 • Yonggen Ling, Linchao Bao, Zequn Jie, Fengming Zhu, Ziyang Li, Shanmin Tang, Yongsheng Liu, Wei Liu, Tong Zhang
Our approach is able to handle the rolling-shutter effects and imperfect sensor synchronization in a unified way.
no code implementations • EMNLP 2018 • Jingyuan Chen, Xinpeng Chen, Lin Ma, Zequn Jie, Tat-Seng Chua
We introduce an effective and efficient method that grounds (i.e., localizes) natural sentences in long, untrimmed video sequences.
no code implementations • ECCV 2018 • Zhen-Yu Zhang, Zhen Cui, Chunyan Xu, Zequn Jie, Xiang Li, Jian Yang
In this paper, we propose a novel joint Task-Recursive Learning (TRL) framework for the closed-loop semantic segmentation and monocular depth estimation tasks.
Ranked #88 on Semantic Segmentation on NYU Depth v2
no code implementations • 10 Aug 2018 • Li Shen, Congliang Chen, Fangyu Zou, Zequn Jie, Ju Sun, Wei Liu
Integrating adaptive learning rate and momentum techniques into SGD leads to a large class of efficiently accelerated adaptive stochastic algorithms, such as AdaGrad, RMSProp, Adam, AccAdaGrad, etc.
no code implementations • ICML 2018 • Bingyi Kang, Zequn Jie, Jiashi Feng
Exploration remains a significant challenge to reinforcement learning methods, especially in environments where reward signals are sparse.
no code implementations • CVPR 2018 • Yunchao Wei, Huaxin Xiao, Honghui Shi, Zequn Jie, Jiashi Feng, Thomas S. Huang
Despite remarkable progress, weakly supervised segmentation methods are still inferior to their fully supervised counterparts.
no code implementations • CVPR 2018 • Yunchao Wei, Huaxin Xiao, Honghui Shi, Zequn Jie, Jiashi Feng, Thomas S. Huang
It can produce dense and reliable object localization maps and effectively benefit both weakly- and semi- supervised semantic segmentation.
2 code implementations • ECCV 2018 • Bo Zhao, Bo Chang, Zequn Jie, Leonid Sigal
Existing methods for multi-domain image-to-image translation (or generation) attempt to directly map an input image (or a random vector) to an image in one of the output domains.
no code implementations • CVPR 2018 • Zequn Jie, Pengfei Wang, Yonggen Ling, Bo Zhao, Yunchao Wei, Jiashi Feng, Wei Liu
Left-right consistency check is an effective way to enhance the disparity estimation by referring to the information from the opposite view.
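A minimal sketch of a left-right consistency check: each left-view disparity is compared with the right-view disparity sampled at its matched location, and inconsistent pixels (often occlusions) are flagged. Nearest-neighbor sampling and the threshold are assumptions.

```python
import torch

def left_right_consistency(disp_left, disp_right, threshold=1.0):
    # disp_*: (B, H, W) disparity maps. A consistent left-view estimate
    # satisfies disp_left(x) ~= disp_right(x - disp_left(x)).
    B, H, W = disp_left.shape
    xs = torch.arange(W, device=disp_left.device).float().expand(B, H, W)
    idx = (xs - disp_left).clamp(0, W - 1).round().long()
    warped = torch.gather(disp_right, 2, idx)      # right disparity at matches
    return (disp_left - warped).abs() < threshold  # True where consistent
```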
no code implementations • NeurIPS 2017 • Xiaojie Jin, Huaxin Xiao, Xiaohui Shen, Jimei Yang, Zhe Lin, Yunpeng Chen, Zequn Jie, Jiashi Feng, Shuicheng Yan
The ability to predict the future is important for intelligent systems, e.g., autonomous vehicles and robots, to plan early and make decisions accordingly.
no code implementations • 15 Aug 2017 • Xin Li, Zequn Jie, Jiashi Feng, Changsong Liu, Shuicheng Yan
However, most of the existing CNN models only learn features through a feedforward structure and no feedback information from top to bottom layers is exploited to enable the networks to refine themselves.
no code implementations • ICCV 2017 • Xin Li, Zequn Jie, Wei Wang, Changsong Liu, Jimei Yang, Xiaohui Shen, Zhe Lin, Qiang Chen, Shuicheng Yan, Jiashi Feng
Thus, they suffer from heterogeneous object scales caused by perspective projection of cameras on actual scenes and inevitably encounter parsing failures on distant objects as well as other boundary and recognition errors.
no code implementations • ICCV 2017 • Hao Liu, Jiashi Feng, Zequn Jie, Karlekar Jayashree, Bo Zhao, Meibin Qi, Jianguo Jiang, Shuicheng Yan
We investigate the problem of person search in the wild in this work.
Ranked #4 on Person Re-Identification on CUHK-SYSU
1 code implementation • 13 Jun 2017 • Hao Liu, Zequn Jie, Karlekar Jayashree, Meibin Qi, Jianguo Jiang, Shuicheng Yan, Jiashi Feng
Video based person re-identification plays a central role in realistic security and video surveillance.
no code implementations • CVPR 2017 • Zequn Jie, Yunchao Wei, Xiaojie Jin, Jiashi Feng, Wei Liu
To overcome this issue, we propose a deep self-taught learning approach, which makes the detector learn the object-level features reliable for acquiring tight positive samples and afterwards re-train itself based on them.
no code implementations • 17 Apr 2017 • Bo Zhao, Xiao Wu, Zhi-Qi Cheng, Hao Liu, Zequn Jie, Jiashi Feng
This paper addresses a challenging problem -- how to generate multi-view cloth images from only a single view input.
no code implementations • NeurIPS 2016 • Zequn Jie, Xiaodan Liang, Jiashi Feng, Xiaojie Jin, Wen Feng Lu, Shuicheng Yan
Therefore, Tree-RL can better cover different objects at various scales, which is quite appealing in the context of object proposal generation.
no code implementations • 1 Jan 2017 • Hao Liu, Zequn Jie, Karlekar Jayashree, Meibin Qi, Jianguo Jiang, Shuicheng Yan, Jiashi Feng
Video based person re-identification plays a central role in realistic security and video surveillance.
no code implementations • ICCV 2017 • Xiaojie Jin, Xin Li, Huaxin Xiao, Xiaohui Shen, Zhe Lin, Jimei Yang, Yunpeng Chen, Jian Dong, Luoqi Liu, Zequn Jie, Jiashi Feng, Shuicheng Yan
In this way, the network can effectively learn to capture video dynamics and temporal context, which are critical clues for video scene parsing, without requiring extra manual annotations.
no code implementations • 27 Aug 2016 • Xiaojie Jin, Yunpeng Chen, Jiashi Feng, Zequn Jie, Shuicheng Yan
In this paper, we consider the scene parsing problem and propose a novel Multi-Path Feedback recurrent neural network (MPF-RNN) for parsing scene images.
no code implementations • 19 Jan 2016 • Zequn Jie, Xiaodan Liang, Jiashi Feng, Wen Feng Lu, Eng Hock Francis Tay, Shuicheng Yan
In particular, in order to improve the localization accuracy, a fully convolutional network is employed which predicts locations of object proposals for each pixel.
no code implementations • CVPR 2016 • Xiaodan Liang, Yunchao Wei, Xiaohui Shen, Zequn Jie, Jiashi Feng, Liang Lin, Shuicheng Yan
By being reversible, the proposal refinement sub-network adaptively determines an optimal number of refinement iterations required for each proposal during both training and testing.