Search Results for author: Zijia Zhao

Found 7 papers, 5 papers with code

VL-Mamba: Exploring State Space Models for Multimodal Learning

no code implementations • 20 Mar 2024 • Yanyuan Qiao, Zheng Yu, Longteng Guo, Sihan Chen, Zijia Zhao, Mingzhen Sun, Qi Wu, Jing Liu

The extensive experiments on diverse multimodal benchmarks with competitive performance show the effectiveness of our proposed VL-Mamba and demonstrate the great potential of applying state space models for multimodal learning tasks.

Ranked #61 on Visual Question Answering on MM-Vet

Language Modelling, Large Language Model, +1

SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models

1 code implementation • 20 Mar 2024 • Tongtian Yue, Jie Cheng, Longteng Guo, Xingyuan Dai, Zijia Zhao, Xingjian He, Gang Xiong, Yisheng Lv, Jing Liu

In this paper, we present and delve into the self-consistency capability of LVLMs, a crucial aspect that reflects the models' ability to both generate informative captions for specific objects and subsequently utilize these captions to accurately re-identify the objects in a closed-loop process.

Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with Human Intentions

1 code implementation • 17 Feb 2024 • Wenxuan Wang, Yisi Zhang, Xingjian He, Yichen Yan, Zijia Zhao, Xinlong Wang, Jing Liu

Previous datasets and methods for classic VG task mainly rely on the prior assumption that the given expression must literally refer to the target object, which greatly impedes the practical deployment of agents in real-world scenarios.

Visual Grounding

VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

1 code implementation • NeurIPS 2023 • Sihan Chen, Handong Li, Qunbo Wang, Zijia Zhao, Mingzhen Sun, Xinxin Zhu, Jing Liu

Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundational model named VAST, which can perceive and process vision, audio, and subtitle modalities from video, and better support various tasks including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning and QA).

Ranked #1 on Image Captioning on COCO Captions (SPICE metric, using extra training data)

Audio Captioning, Audio-Visual Captioning, +14

ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst

1 code implementation • 25 May 2023 • Zijia Zhao, Longteng Guo, Tongtian Yue, Sihan Chen, Shuai Shao, Xinxin Zhu, Zehuan Yuan, Jing Liu

We show that only language-paired two-modality data is sufficient to connect all modalities.

Language Modelling, Large Language Model

MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning

no code implementations • 9 Oct 2022 • Zijia Zhao, Longteng Guo, Xingjian He, Shuai Shao, Zehuan Yuan, Jing Liu

Our method performs joint masking on image-text input and integrates both implicit and explicit targets for the masked signals to recover.

Question Answering, Representation Learning, +5

OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation

2 code implementations • 1 Jul 2021 • Jing Liu, Xinxin Zhu, Fei Liu, Longteng Guo, Zijia Zhao, Mingzhen Sun, Weining Wang, Hanqing Lu, Shiyu Zhou, Jiajun Zhang, Jinqiao Wang

In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation, by jointly modeling visual, text and audio resources.

Ranked #1 on Image Retrieval on Localized Narratives

Audio to Text Retrieval, Cross-Modal Retrieval, +3

