Search Results for author: Zequn Jie

Found 64 papers, 23 papers with code

UniToken: Harmonizing Multimodal Understanding and Generation through Unified Visual Encoding

1 code implementation • 6 Apr 2025 • Yang Jiao, Haibo Qiu, Zequn Jie, Shaoxiang Chen, Jingjing Chen, Lin Ma, Yu-Gang Jiang

We introduce UniToken, an auto-regressive generation model that encodes visual inputs through a combination of discrete and continuous representations, enabling seamless integration of unified visual understanding and image generation tasks.

Image Generation

FlexVAR: Flexible Visual Autoregressive Modeling without Residual Prediction

1 code implementation • 27 Feb 2025 • Siyu Jiao, Gengwei Zhang, Yinlong Qian, Jiancheng Huang, Yao Zhao, Humphrey Shi, Lin Ma, Yunchao Wei, Zequn Jie

This work challenges the residual prediction paradigm in visual autoregressive modeling and presents FlexVAR, a new Flexible Visual AutoRegressive image generation paradigm.

Image Generation • Prediction

Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding

1 code implementation • 3 Jan 2025 • Jiaming Li, Jiacheng Zhang, Zequn Jie, Lin Ma, Guanbin Li

In this method, we design a Cross-Modal Value-Enhanced Decoding (CMVED) module to alleviate hallucination through a novel contrastive decoding mechanism.
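The abstract does not spell out CMVED itself; as a loose, hypothetical illustration of contrastive decoding in general (not the paper's exact module), the sketch below contrasts next-token logits obtained with the intact visual input against logits from a distorted visual input, with `alpha` as a placeholder contrast strength.

```python
import torch
import torch.nn.functional as F

def contrastive_decode_step(logits_full, logits_distorted, alpha=1.0):
    """Generic contrastive decoding for one generation step (sketch only).

    logits_full:      (B, V) logits conditioned on the intact visual input
    logits_distorted: (B, V) logits from a distorted/ablated visual input
    alpha:            contrast strength (hypothetical default)
    """
    # Amplify the contribution of the intact visual evidence relative to the
    # language prior captured by the distorted pass.
    contrasted = (1.0 + alpha) * logits_full - alpha * logits_distorted
    return F.softmax(contrasted, dim=-1)

# Toy usage with random logits over a 32k-token vocabulary.
vocab_size = 32000
probs = contrastive_decode_step(torch.randn(1, vocab_size),
                                torch.randn(1, vocab_size))
next_token = probs.argmax(dim=-1)
```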

Hallucination • Language Modeling +2

CLIP-GS: Unifying Vision-Language Representation with 3D Gaussian Splatting

no code implementations • 26 Dec 2024 • Siyu Jiao, Haoye Dong, Yuyang Yin, Zequn Jie, Yinlong Qian, Yao Zhao, Humphrey Shi, Yunchao Wei

CLIP-GS leverages a contrastive loss between 3DGS embeddings and the visual-text embeddings of CLIP, and introduces an image voting loss to guide the directionality and convergence of gradient optimization.
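No implementation details are given here; the following is a minimal sketch, under my own assumptions, of a symmetric CLIP-style contrastive loss between embeddings of 3DGS inputs (from some hypothetical 3DGS encoder) and paired CLIP embeddings. The image voting loss and the actual encoders are not shown.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(gs_emb, clip_emb, temperature=0.07):
    """Symmetric InfoNCE between 3DGS embeddings and CLIP embeddings (sketch).

    gs_emb:   (B, D) embeddings of 3D Gaussian Splatting inputs (hypothetical encoder)
    clip_emb: (B, D) paired CLIP image or text embeddings
    """
    gs_emb = F.normalize(gs_emb, dim=-1)
    clip_emb = F.normalize(clip_emb, dim=-1)
    logits = gs_emb @ clip_emb.t() / temperature       # (B, B) cosine similarities
    targets = torch.arange(gs_emb.size(0), device=gs_emb.device)
    # Matching pairs sit on the diagonal; all other entries act as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random 512-d embeddings for a batch of 8 pairs.
loss = clip_style_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```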

3DGS • Representation Learning

DriveMM: All-in-One Large Multimodal Model for Autonomous Driving

1 code implementation • 10 Dec 2024 • Zhijian Huang, Chengjian Feng, Feng Yan, Baihui Xiao, Zequn Jie, Yujie Zhong, Xiaodan Liang, Lin Ma

Large Multimodal Models (LMMs) have demonstrated exceptional comprehension and interpretation capabilities in Autonomous Driving (AD) by incorporating large language models.

All Autonomous Driving

TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability

1 code implementation • 27 Nov 2024 • Shimin Chen, Xiaohan Lan, Yitian Yuan, Zequn Jie, Lin Ma

The rapid development of large language models (LLMs) has significantly advanced multimodal large language models (LMMs), particularly in vision-language tasks.

Temporal Localization • Video Understanding

VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models

no code implementations • 15 Oct 2024 • Xiaohan Lan, Yitian Yuan, Zequn Jie, Lin Ma

Video-based multimodal large language models (Video-LLMs) possess significant potential for video understanding tasks.

Video Understanding

MRStyle: A Unified Framework for Color Style Transfer with Multi-Modality Reference

no code implementations • 9 Sep 2024 • Jiancheng Huang, Yu Gao, Zequn Jie, Yujie Zhong, Xintong Han, Lin Ma

For text reference, we align the text feature of Stable Diffusion priors with the style feature of our IRStyle to perform text-guided color style transfer (TRStyle).

Style Transfer

Making Large Language Models Better Planners with Reasoning-Decision Alignment

no code implementations • 25 Aug 2024 • Zhijian Huang, Tao Tang, Shaoxiang Chen, Sihao Lin, Zequn Jie, Lin Ma, Guangrun Wang, Xiaodan Liang

Inspired by the knowledge-driven nature of human driving, recent approaches explore the potential of large language models (LLMs) to improve understanding and decision-making in traffic scenarios.

Autonomous Driving • Decision Making +1

3D Weakly Supervised Semantic Segmentation with 2D Vision-Language Guidance

1 code implementation • 13 Jul 2024 • Xiaoxu Xu, Yitian Yuan, Jinlong Li, Qiudan Zhang, Zequn Jie, Lin Ma, Hao Tang, Nicu Sebe, Xu Wang

In this paper, we propose 3DSS-VLG, a weakly supervised approach for 3D Semantic Segmentation with 2D Vision-Language Guidance, in which a 3D model predicts a dense embedding for each point that is co-embedded with both the aligned image and text spaces of the 2D vision-language model.

3D Semantic Segmentation • Language Modelling +3

Enhancing Robustness of Vision-Language Models through Orthogonality Learning and Self-Regularization

no code implementations • 11 Jul 2024 • Jinlong Li, Dong Zhao, Zequn Jie, Elisa Ricci, Lin Ma, Nicu Sebe

Previous works primarily focus on prompt learning to adapt CLIP to a variety of downstream tasks; however, they suffer from task overfitting when fine-tuned on a small dataset.

Data Augmentation • Domain Generalization +2

OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

1 code implementation • 10 Jul 2024 • Hao Wang, Pengzhen Ren, Zequn Jie, Xiao Dong, Chengjian Feng, Yinlong Qian, Lin Ma, Dongmei Jiang, YaoWei Wang, Xiangyuan Lan, Xiaodan Liang

To address these challenges, we propose a novel unified open-vocabulary detection method called OV-DINO, which is pre-trained on diverse large-scale datasets with language-aware selective fusion in a unified framework.

Ranked #5 on Zero-Shot Object Detection on MSCOCO (AP metric, using extra training data)

Zero-Shot Object Detection

Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models

no code implementations • 12 Jun 2024 • Shimin Chen, Yitian Yuan, Shaoxiang Chen, Zequn Jie, Lin Ma

Amidst the advancements in image-based Large Vision-Language Models (image-LVLM), the transition to video-based models (video-LVLM) is hindered by the limited availability of quality video data.

Video Understanding

AlignSAM: Aligning Segment Anything Model to Open Context via Reinforcement Learning

2 code implementations • CVPR 2024 • Duojun Huang, Xinyu Xiong, Jie Ma, Jichang Li, Zequn Jie, Lin Ma, Guanbin Li

In this paper, we propose a novel framework, termed AlignSAM, designed for automatic prompting for aligning SAM to an open context through reinforcement learning.

reinforcement-learning • Reinforcement Learning +1

Matten: Video Generation with Mamba-Attention

no code implementations • 5 May 2024 • Yu Gao, Jiancheng Huang, Xiaopeng Sun, Zequn Jie, Yujie Zhong, Lin Ma

In this paper, we introduce Matten, a cutting-edge latent diffusion model with Mamba-Attention architecture for video generation.

Mamba • Video Generation

Lumen: Unleashing Versatile Vision-Centric Capabilities of Large Multimodal Models

1 code implementation • 12 Mar 2024 • Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang

To address this issue, we propose a novel LMM architecture named Lumen, a Large multimodal model with versatile vision-centric capability enhancement.

Concept Alignment • Instruction Following +2

InstaGen: Enhancing Object Detection by Training on Synthetic Dataset

no code implementations • CVPR 2024 • Chengjian Feng, Yujie Zhong, Zequn Jie, Weidi Xie, Lin Ma

The grounding head is trained to align the text embedding of category names with the regional visual feature of the diffusion model, using supervision from an off-the-shelf object detector, and a novel self-training scheme on (novel) categories not covered by the detector.
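As a rough, hypothetical sketch of what aligning regional visual features with category-name text embeddings could look like (the detector-based pseudo-labelling and the self-training scheme are omitted, and all names below are placeholders):

```python
import torch
import torch.nn.functional as F

def grounding_alignment_loss(region_feats, text_embs, labels, temperature=0.1):
    """Cross-entropy over cosine similarities between regions and class names (sketch).

    region_feats: (R, D) features pooled from detector-proposed regions
    text_embs:    (C, D) text embeddings of the C category names
    labels:       (R,)   category index assigned to each region
    """
    region_feats = F.normalize(region_feats, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    logits = region_feats @ text_embs.t() / temperature   # (R, C)
    # Each region feature is pulled toward the embedding of its assigned class name.
    return F.cross_entropy(logits, labels)

# Toy usage: 6 regions, 4 categories, 256-d features.
loss = grounding_alignment_loss(torch.randn(6, 256), torch.randn(4, 256),
                                torch.randint(0, 4, (6,)))
```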

Object • object-detection +1

LLaVA-MoLE: Sparse Mixture of LoRA Experts for Mitigating Data Conflicts in Instruction Finetuning MLLMs

no code implementations • 29 Jan 2024 • Shaoxiang Chen, Zequn Jie, Lin Ma

To address this issue, we propose to apply an efficient Mixture of Experts (MoE) design, which is a sparse Mixture of LoRA Experts (MoLE) for instruction finetuning MLLMs.

Language Modelling • Large Language Model +2

Investigating Compositional Challenges in Vision-Language Models for Visual Grounding

1 code implementation • CVPR 2024 • Yunan Zeng, Yan Huang, Jinjin Zhang, Zequn Jie, Zhenhua Chai, Liang Wang

To demonstrate this, we propose the Attribute, Relation, and Priority grounding (ARPGrounding) benchmark to test VLMs' compositional reasoning ability on visual grounding tasks.

Attribute • Relation +1

Weakly-Supervised 3D Visual Grounding based on Visual Linguistic Alignment

no code implementations • 15 Dec 2023 • Xiaoxu Xu, Yitian Yuan, Qiudan Zhang, Wenhui Wu, Zequn Jie, Lin Ma, Xu Wang

During the inference stage, the learned text-3D correspondence will help us ground the text queries to the 3D target objects even without 2D images.

3D visual grounding • Natural Language Queries +1

UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning

no code implementations • 1 Jun 2023 • Xiao Dong, Runhui Huang, XiaoYong Wei, Zequn Jie, Jianxing Yu, Jian Yin, Xiaodan Liang

Recent advances in vision-language pre-training have enabled machines to perform better in multimodal object discrimination (e.g., image-text semantic alignment) and image synthesis (e.g., text-to-image generation).

Contrastive Learning • Retrieval +2

Multiple Object Tracking Challenge Technical Report for Team MT_IoT

1 code implementation • 7 Dec 2022 • Feng Yan, Zhiheng Li, Weixin Luo, Zequn Jie, Fan Liang, Xiaolin Wei, Lin Ma

This is a brief technical report on our proposed method for the Multiple-Object Tracking (MOT) Challenge in Complex Environments.

Ranked #10 on Multi-Object Tracking on DanceTrack (using extra training data)

Human Detection • Multi-Object Tracking +2

AeDet: Azimuth-invariant Multi-view 3D Object Detection

1 code implementation • CVPR 2023 • Chengjian Feng, Zequn Jie, Yujie Zhong, Xiangxiang Chu, Lin Ma

However, the typical convolution ignores the radial symmetry of the BEV features and increases the difficulty of the detector optimization.

3D Object Detection • Depth Estimation +3

Expansion and Shrinkage of Localization for Weakly-Supervised Semantic Segmentation

1 code implementation • 16 Sep 2022 • Jinlong Li, Zequn Jie, Xu Wang, Xiaolin Wei, Lin Ma

To tackle this issue, this paper proposes an Expansion and Shrinkage scheme based on offset learning in deformable convolution, to sequentially improve the recall and precision of the located object in the two respective stages.

Object • Weakly supervised Semantic Segmentation +1

Weakly Supervised Semantic Segmentation via Progressive Patch Learning

1 code implementation • 16 Sep 2022 • Jinlong Li, Zequn Jie, Xu Wang, Yu Zhou, Xiaolin Wei, Lin Ma

"Progressive Patch Learning" further extends the feature destruction and patch learning to multi-level granularities in a progressive manner.

Weakly supervised Semantic Segmentation • Weakly-Supervised Semantic Segmentation

MSMDFusion: Fusing LiDAR and Camera at Multiple Scales with Multi-Depth Seeds for 3D Object Detection

1 code implementation • CVPR 2023 • Yang Jiao, Zequn Jie, Shaoxiang Chen, Jingjing Chen, Lin Ma, Yu-Gang Jiang

Recent approaches aim at exploring the semantic densities of camera features through lifting points in 2D camera images (referred to as seeds) into 3D space, and then incorporate 2D semantics via cross-modal interaction or fusion techniques.

3D Object Detection • Autonomous Driving +1

ARMANI: Part-level Garment-Text Alignment for Unified Cross-Modal Fashion Design

no code implementations • 11 Aug 2022 • Xujie Zhang, Yu Sha, Michael C. Kampffmeyer, Zhenyu Xie, Zequn Jie, Chengwen Huang, Jianqing Peng, Xiaodan Liang

ARMANI discretizes an image into uniform tokens based on a learned cross-modal codebook in its first stage and uses a Transformer to model the distribution of image tokens for a real image given the tokens of the control signals in its second stage.

Image Generation

MT-Net Submission to the Waymo 3D Detection Leaderboard

no code implementations • 11 Jul 2022 • Shaoxiang Chen, Zequn Jie, Xiaolin Wei, Lin Ma

In this technical report, we introduce our submission to the Waymo 3D Detection leaderboard.

3D Object Detection

PromptDet: Towards Open-vocabulary Detection using Uncurated Images

2 code implementations • 30 Mar 2022 • Chengjian Feng, Yujie Zhong, Zequn Jie, Xiangxiang Chu, Haibing Ren, Xiaolin Wei, Weidi Xie, Lin Ma

The goal of this work is to establish a scalable pipeline for expanding an object detector towards novel/unseen categories, using zero manual annotations.

Language Modeling • Language Modelling +2

Suspected Object Matters: Rethinking Model's Prediction for One-stage Visual Grounding

no code implementations • 10 Mar 2022 • Yang Jiao, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang

Recently, one-stage visual grounders have attracted considerable attention because they match two-stage grounders in accuracy while being significantly more efficient.

Object • Visual Grounding

MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes

1 code implementation • 10 Mar 2022 • Yang Jiao, Shaoxiang Chen, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang

3D dense captioning is a recently proposed task in which point clouds contain more geometric information than their 2D counterpart.

3D dense captioning • Dense Captioning +4

Two-stage Visual Cues Enhancement Network for Referring Image Segmentation

1 code implementation • 9 Oct 2021 • Yang Jiao, Zequn Jie, Weixin Luo, Jingjing Chen, Yu-Gang Jiang, Xiaolin Wei, Lin Ma

Referring Image Segmentation (RIS) aims at segmenting the target object from an image referred to by a given natural language expression.

Image Segmentation • Retrieval +2

Delving into the Imbalance of Positive Proposals in Two-stage Object Detection

no code implementations • 23 May 2020 • Zheng Ge, Zequn Jie, Xin Huang, Chengzheng Li, Osamu Yoshie

The first imbalance lies in the large number of low-quality RPN proposals, which makes the R-CNN module (i.e., post-classification layers) highly biased towards the negative proposals in the early training stage.

object-detection • Object Detection

NMS by Representative Region: Towards Crowded Pedestrian Detection by Proposal Pairing

no code implementations • CVPR 2020 • Xin Huang, Zheng Ge, Zequn Jie, Osamu Yoshie

To acquire the visible parts, a novel Paired-Box Model (PBM) is proposed to simultaneously predict the full and visible boxes of a pedestrian.

Pedestrian Detection

PS-RCNN: Detecting Secondary Human Instances in a Crowd via Primary Object Suppression

no code implementations • 16 Mar 2020 • Zheng Ge, Zequn Jie, Xin Huang, Rong Xu, Osamu Yoshie

PS-RCNN first detects slightly occluded or non-occluded objects with an R-CNN module (referred to as P-RCNN), and then suppresses the detected instances with human-shaped masks so that the features of heavily occluded instances can stand out.

Human Detection • Object Detection

Central Similarity Quantization for Efficient Image and Video Retrieval

1 code implementation • CVPR 2020 • Li Yuan, Tao Wang, Xiaopeng Zhang, Francis EH Tay, Zequn Jie, Wei Liu, Jiashi Feng

In this work, we propose a new global similarity metric, termed central similarity, with which the hash codes of similar data pairs are encouraged to approach a common center and those of dissimilar pairs to converge to different centers, to improve hash learning efficiency and retrieval accuracy.
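As a loose illustration (not the paper's exact formulation), a central-similarity style hashing objective can be sketched as a binary cross-entropy pull of each continuous code toward its pre-assigned hash center, plus a quantization term; how the centers are generated is omitted and `lambda_q` is a placeholder.

```python
import torch
import torch.nn.functional as F

def central_similarity_loss(codes, centers, lambda_q=1e-4):
    """Minimal sketch of a central-similarity hashing objective.

    codes:   (B, K) continuous hash codes in (-1, 1), e.g. tanh outputs of a network
    centers: (B, K) hash center assigned to each sample, entries in {-1, +1}
    """
    # Pull each code toward its semantic hash center (BCE computed in [0, 1] space).
    central = F.binary_cross_entropy((codes + 1) / 2, (centers + 1) / 2)
    # Quantization term nudging codes toward exactly +/-1 so they binarize cleanly.
    quant = (codes.abs() - 1).pow(2).mean()
    return central + lambda_q * quant

# Toy usage: 16-bit codes for a batch of 4 samples.
codes = torch.tanh(torch.randn(4, 16))
centers = torch.sign(torch.randn(4, 16))
loss = central_similarity_loss(codes, centers)
```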

Quantization • Retrieval +2

Real-Time Referring Expression Comprehension by Single-Stage Grounding Network

no code implementations • 9 Dec 2018 • Xinpeng Chen, Lin Ma, Jingyuan Chen, Zequn Jie, Wei Liu, Jiebo Luo

Experiments on the RefCOCO, RefCOCO+, and RefCOCOg datasets demonstrate that our proposed SSG, without relying on any region proposals, achieves performance comparable to other advanced models.

Attribute • Referring Expression +1

A Sufficient Condition for Convergences of Adam and RMSProp

no code implementations • CVPR 2019 • Fangyu Zou, Li Shen, Zequn Jie, Weizhong Zhang, Wei Liu

Adam and RMSProp are two of the most influential adaptive stochastic algorithms for training deep neural networks, which have been pointed out to be divergent even in the convex setting via a few simple counterexamples.

Stochastic Optimization

Temporally Grounding Natural Sentence in Video

no code implementations • EMNLP 2018 • Jingyuan Chen, Xinpeng Chen, Lin Ma, Zequn Jie, Tat-Seng Chua

We introduce an effective and efficient method that grounds (i.e., localizes) natural sentences in long, untrimmed video sequences.

Sentence • Video Captioning

Joint Task-Recursive Learning for Semantic Segmentation and Depth Estimation

no code implementations • ECCV 2018 • Zhen-Yu Zhang, Zhen Cui, Chunyan Xu, Zequn Jie, Xiang Li, Jian Yang

In this paper, we propose a novel joint Task-Recursive Learning (TRL) framework for the closing-loop semantic segmentation and monocular depth estimation tasks.

Monocular Depth Estimation • Segmentation +1

A Unified Analysis of AdaGrad with Weighted Aggregation and Momentum Acceleration

no code implementations • 10 Aug 2018 • Li Shen, Congliang Chen, Fangyu Zou, Zequn Jie, Ju Sun, Wei Liu

Integrating adaptive learning rate and momentum techniques into SGD leads to a large class of efficiently accelerated adaptive stochastic algorithms, such as AdaGrad, RMSProp, Adam, AccAdaGrad, etc.

Stochastic Optimization

Policy Optimization with Demonstrations

no code implementations • ICML 2018 • Bingyi Kang, Zequn Jie, Jiashi Feng

Exploration remains a significant challenge to reinforcement learning methods, especially in environments where reward signals are sparse.

Policy Gradient Methods • Reinforcement Learning +1

Modular Generative Adversarial Networks

2 code implementations • ECCV 2018 • Bo Zhao, Bo Chang, Zequn Jie, Leonid Sigal

Existing methods for multi-domain image-to-image translation (or generation) attempt to directly map an input image (or a random vector) to an image in one of the output domains.

Attribute • Image-to-Image Translation +1

Left-Right Comparative Recurrent Model for Stereo Matching

no code implementations • CVPR 2018 • Zequn Jie, Pengfei Wang, Yonggen Ling, Bo Zhao, Yunchao Wei, Jiashi Feng, Wei Liu

Left-right consistency check is an effective way to enhance the disparity estimation by referring to the information from the opposite view.

Disparity Estimation • model +2

Predicting Scene Parsing and Motion Dynamics in the Future

no code implementations • NeurIPS 2017 • Xiaojie Jin, Huaxin Xiao, Xiaohui Shen, Jimei Yang, Zhe Lin, Yunpeng Chen, Zequn Jie, Jiashi Feng, Shuicheng Yan

The ability to predict the future is important for intelligent systems, e.g., autonomous vehicles and robots, to plan early and make decisions accordingly.

Autonomous Vehicles • motion prediction +2

Learning with Rethinking: Recurrently Improving Convolutional Neural Networks through Feedback

no code implementations • 15 Aug 2017 • Xin Li, Zequn Jie, Jiashi Feng, Changsong Liu, Shuicheng Yan

However, most of the existing CNN models only learn features through a feedforward structure and no feedback information from top to bottom layers is exploited to enable the networks to refine themselves.

FoveaNet: Perspective-aware Urban Scene Parsing

no code implementations • ICCV 2017 • Xin Li, Zequn Jie, Wei Wang, Changsong Liu, Jimei Yang, Xiaohui Shen, Zhe Lin, Qiang Chen, Shuicheng Yan, Jiashi Feng

Thus, they suffer from heterogeneous object scales caused by perspective projection of cameras on actual scenes and inevitably encounter parsing failures on distant objects as well as other boundary and recognition errors.

Scene Parsing

Deep Self-Taught Learning for Weakly Supervised Object Localization

no code implementations • CVPR 2017 • Zequn Jie, Yunchao Wei, Xiaojie Jin, Jiashi Feng, Wei Liu

To overcome this issue, we propose a deep self-taught learning approach, which makes the detector learn the object-level features reliable for acquiring tight positive samples and afterwards re-train itself based on them.

Object • Weakly Supervised Object Detection +1

Multi-View Image Generation from a Single-View

no code implementations • 17 Apr 2017 • Bo Zhao, Xiao Wu, Zhi-Qi Cheng, Hao liu, Zequn Jie, Jiashi Feng

This paper addresses a challenging problem: how to generate multi-view clothing images from only a single-view input.

Image Generation • Variational Inference

Tree-Structured Reinforcement Learning for Sequential Object Localization

no code implementations • NeurIPS 2016 • Zequn Jie, Xiaodan Liang, Jiashi Feng, Xiaojie Jin, Wen Feng Lu, Shuicheng Yan

Therefore, Tree-RL can better cover objects at various scales, which is quite appealing in the context of object proposal.

Diversity • Object +4

Video Scene Parsing with Predictive Feature Learning

no code implementations • ICCV 2017 • Xiaojie Jin, Xin Li, Huaxin Xiao, Xiaohui Shen, Zhe Lin, Jimei Yang, Yunpeng Chen, Jian Dong, Luoqi Liu, Zequn Jie, Jiashi Feng, Shuicheng Yan

In this way, the network can effectively learn to capture video dynamics and temporal context, which are critical clues for video scene parsing, without requiring extra manual annotations.

Representation Learning • Scene Parsing

Multi-Path Feedback Recurrent Neural Network for Scene Parsing

no code implementations • 27 Aug 2016 • Xiaojie Jin, Yunpeng Chen, Jiashi Feng, Zequn Jie, Shuicheng Yan

In this paper, we consider the scene parsing problem and propose a novel Multi-Path Feedback recurrent neural network (MPF-RNN) for parsing scene images.

Scene Parsing

Scale-aware Pixel-wise Object Proposal Networks

no code implementations • 19 Jan 2016 • Zequn Jie, Xiaodan Liang, Jiashi Feng, Wen Feng Lu, Eng Hock Francis Tay, Shuicheng Yan

In particular, in order to improve the localization accuracy, a fully convolutional network is employed that predicts locations of object proposals for each pixel.

Object • object-detection +2

Reversible Recursive Instance-level Object Segmentation

no code implementations • CVPR 2016 • Xiaodan Liang, Yunchao Wei, Xiaohui Shen, Zequn Jie, Jiashi Feng, Liang Lin, Shuicheng Yan

By being reversible, the proposal refinement sub-network adaptively determines an optimal number of refinement iterations required for each proposal during both training and testing.

Denoising • Object +2
