Search Results for author: Xingjian He

Found 15 papers, 6 papers with code

Calibration & Reconstruction: Deep Integrated Language for Referring Image Segmentation

no code implementations • 12 Apr 2024 • Yichen Yan, Xingjian He, Sihan Chen, Jing Liu

In this paper, we introduce CRFormer, a model that iteratively calibrates multi-modal features in the transformer decoder.

Image Segmentation Semantic Segmentation

Paper
Add Code

SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models

1 code implementation • 20 Mar 2024 • Tongtian Yue, Jie Cheng, Longteng Guo, Xingyuan Dai, Zijia Zhao, Xingjian He, Gang Xiong, Yisheng Lv, Jing Liu

In this paper, we present and delve into the self-consistency capability of LVLMs, a crucial aspect that reflects the models' ability to both generate informative captions for specific objects and subsequently utilize these captions to accurately re-identify the objects in a closed-loop process.

Paper
Code

Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with Human Intentions

1 code implementation • 17 Feb 2024 • Wenxuan Wang, Yisi Zhang, Xingjian He, Yichen Yan, Zijia Zhao, Xinlong Wang, Jing Liu

Previous datasets and methods for classic VG task mainly rely on the prior assumption that the given expression must literally refer to the target object, which greatly impedes the practical deployment of agents in real-world scenarios.

Visual Grounding

Paper
Code

Unveiling Parts Beyond Objects:Towards Finer-Granularity Referring Expression Segmentation

1 code implementation • 13 Dec 2023 • Wenxuan Wang, Tongtian Yue, Yisi Zhang, Longteng Guo, Xingjian He, Xinlong Wang, Jing Liu

To foster future research into fine-grained visual grounding, our benchmark RefCOCOm, the MRES-32M dataset and model UniRES will be publicly available at https://github. com/Rubics-Xuan/MRES

Descriptive Object +3

Paper
Code

EAVL: Explicitly Align Vision and Language for Referring Image Segmentation

no code implementations • 18 Aug 2023 • Yichen Yan, Xingjian He, Wenxuan Wang, Sihan Chen, Jing Liu

In previous approaches, fused vision-language features are directly fed into a decoder and pass through a convolution with a fixed kernel to obtain the result, which follows a similar pattern as traditional image segmentation.

Image Segmentation Referring Expression Segmentation +2

Paper
Add Code

COSA: Concatenated Sample Pretrained Vision-Language Foundation Model

1 code implementation • 15 Jun 2023 • Sihan Chen, Xingjian He, Handong Li, Xiaojie Jin, Jiashi Feng, Jing Liu

Due to the limited scale and quality of video-text training corpus, most vision-language foundation models employ image-text datasets for pretraining and primarily focus on modeling visually semantic representations while disregarding temporal semantic representations and correlations.

Ranked #1 on TGIF-Frame on TGIF-QA (using extra training data)

Question Answering Retrieval +6

Paper
Code

MMNet: Multi-Mask Network for Referring Image Segmentation

no code implementations • 24 May 2023 • Yichen Yan, Xingjian He, Wenxuan Wan, Jing Liu

However, this task is challenging due to the distinct data properties between text and image, and the randomness introduced by diverse objects and unrestricted language expression.

Image Segmentation Segmentation +1

Paper
Add Code

VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending

no code implementations • 22 May 2023 • Xingjian He, Sihan Chen, Fan Ma, Zhicheng Huang, Xiaojie Jin, Zikang Liu, Dongmei Fu, Yi Yang, Jing Liu, Jiashi Feng

Towards this goal, we propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending, which transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks.

Ranked #1 on Visual Question Answering (VQA) on MSVD-QA (using extra training data)

Question Answering Retrieval +6

Paper
Add Code

CM-MaskSD: Cross-Modality Masked Self-Distillation for Referring Image Segmentation

no code implementations • 19 May 2023 • Wenxuan Wang, Jing Liu, Xingjian He, Yisi Zhang, Chen Chen, Jiachen Shen, Yan Zhang, Jiangyun Li

Referring image segmentation (RIS) is a fundamental vision-language task that intends to segment a desired object from an image based on a given natural language expression.

Image Segmentation Segmentation +1

Paper
Add Code

Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner

1 code implementation • 19 May 2023 • Zikang Liu, Sihan Chen, Longteng Guo, Handong Li, Xingjian He, Jing Liu

In this paper, we propose a novel method called Joint QA and DC GEneration (JADE), which utilizes a pre-trained multimodal model and easily-crawled image-text pairs to automatically generate and filter large-scale VQA and dense captioning datasets.

Dense Captioning Image Captioning +4

Paper
Code

VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

1 code implementation • 17 Apr 2023 • Sihan Chen, Xingjian He, Longteng Guo, Xinxin Zhu, Weining Wang, Jinhui Tang, Jing Liu

Different from widely-studied vision-language pretraining models, VALOR jointly models relationships of vision, audio and language in an end-to-end manner.

Ranked #1 on Video Captioning on VATEX (using extra training data)

Audio captioning Audio-Video Question Answering (AVQA) +16

232

Paper
Code

MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning

no code implementations • 9 Oct 2022 • Zijia Zhao, Longteng Guo, Xingjian He, Shuai Shao, Zehuan Yuan, Jing Liu

Our method performs joint masking on image-text input and integrates both implicit and explicit targets for the masked signals to recover.

Question Answering Representation Learning +5

Paper
Add Code

Exploiting Spatial-Temporal Semantic Consistency for Video Scene Parsing

no code implementations • 6 Sep 2021 • Xingjian He, Weining Wang, Zhiyong Xu, Hao Wang, Jie Jiang, Jing Liu

Compared with image scene parsing, video scene parsing introduces temporal information, which can effectively improve the consistency and accuracy of prediction.

Scene Parsing

Paper
Add Code

Global-Local Propagation Network for RGB-D Semantic Segmentation

no code implementations • 26 Jan 2021 • Sihan Chen, Xinxin Zhu, Wei Liu, Xingjian He, Jing Liu

Depth information matters in RGB-D semantic segmentation task for providing additional geometric information to color images.

Scene Segmentation Segmentation

Paper
Add Code

Non-Autoregressive Image Captioning with Counterfactuals-Critical Multi-Agent Learning

no code implementations • 10 May 2020 • Longteng Guo, Jing Liu, Xinxin Zhu, Xingjian He, Jie Jiang, Hanqing Lu

In this paper, we propose a Non-Autoregressive Image Captioning (NAIC) model with a novel training paradigm: Counterfactuals-critical Multi-Agent Learning (CMAL).

Image Captioning Machine Translation +3

Paper
Add Code

Cannot find the paper you are looking for? You can Submit a new open access paper.