Search Results for author: Jinliang Zheng

Found 5 papers, 3 papers with code

Instruction-Guided Visual Masking

no code implementations • 30 May 2024 • Jinliang Zheng, Jianxiong Li, Sijie Cheng, Yinan Zheng, Jiaming Li, Jihao Liu, Yu Liu, Jingjing Liu, Xianyuan Zhan

To achieve more accurate and nuanced multimodal instruction following, we introduce Instruction-guided Visual Masking (IVM), a new versatile visual grounding model that is compatible with diverse multimodal models, such as LMMs and robot models.

Instruction Following Visual Grounding +1

Enhancing Vision-Language Model with Unmasked Token Alignment

1 code implementation • 29 May 2024 • Jihao Liu, Jinliang Zheng, Boxiao Liu, Yu Liu, Hongsheng Li

Contrastive pre-training on image-text pairs, exemplified by CLIP, has become a standard technique for learning multi-modal visual-language representations.
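The CLIP-style contrastive objective referenced above can be sketched as a symmetric InfoNCE loss over a batch of paired image/text embeddings. This is an illustrative NumPy sketch of the general technique, not code from this paper; all names and the temperature value are my own assumptions.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each (N, D) array is a matched pair."""
    # L2-normalize so dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by temperature.
    logits = image_emb @ text_emb.T / temperature
    n = logits.shape[0]

    def cross_entropy(l):
        # Cross-entropy with the diagonal (matched pairs) as targets.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the image->text and text->image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

A matched batch should score lower loss than a mismatched one, which is the signal that drives the representation learning.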

Language Modelling Self-Supervised Learning

GLID: Pre-training a Generalist Encoder-Decoder Vision Model

no code implementations • 11 Apr 2024 • Jihao Liu, Jinliang Zheng, Yu Liu, Hongsheng Li

This paper proposes a GeneraLIst encoder-Decoder (GLID) pre-training method for better handling various downstream computer vision tasks.

Decoder Depth Estimation +6

DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning

1 code implementation • 28 Feb 2024 • Jianxiong Li, Jinliang Zheng, Yinan Zheng, Liyuan Mao, Xiao Hu, Sijie Cheng, Haoyi Niu, Jihao Liu, Yu Liu, Jingjing Liu, Ya-Qin Zhang, Xianyuan Zhan

Multimodal pretraining is an effective strategy for the trinity of goals of representation learning in autonomous robots: 1) extracting both local and global task progressions; 2) enforcing temporal consistency of visual representation; 3) capturing trajectory-level language grounding.

Contrastive Learning Decision Making +1

MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers

1 code implementation • CVPR 2023 • Jihao Liu, Xin Huang, Jinliang Zheng, Yu Liu, Hongsheng Li

In this paper, we propose Mixed and Masked AutoEncoder (MixMAE), a simple but efficient pretraining method that is applicable to various hierarchical Vision Transformers.
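The core "mixed and masked" idea can be sketched as follows: instead of padding out masked patches of one image, visible patch tokens from a second image fill the masked positions, producing one mixed sequence. This is a minimal illustrative sketch of that mixing step only, not the paper's implementation; the function name, signature, and ratio are my own assumptions.

```python
import numpy as np

def mix_masked_patches(patches_a, patches_b, mask_ratio=0.5, rng=None):
    """MixMAE-style mixing sketch: build one input sequence from two images.

    patches_a, patches_b: (L, D) arrays of patch tokens for two images.
    Returns the mixed sequence and a boolean mask (True = token from image A).
    """
    rng = rng or np.random.default_rng()
    num_patches = patches_a.shape[0]
    n_from_a = int(num_patches * mask_ratio)

    # Randomly choose which positions keep image A's tokens;
    # the remaining positions take image B's tokens.
    take_a = np.zeros(num_patches, dtype=bool)
    take_a[rng.choice(num_patches, size=n_from_a, replace=False)] = True

    mixed = np.where(take_a[:, None], patches_a, patches_b)
    return mixed, take_a
```

The encoder then sees only real tokens (no mask placeholders), and the decoder reconstructs both images from the mixed sequence using the mask to unmix them.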

Image Classification Object Detection +2
