1 code implementation • 24 Oct 2024 • Zijia Zhao, Longteng Guo, Tongtian Yue, Erdong Hu, Shuai Shao, Zehuan Yuan, Hua Huang, Jing Liu
In this paper, we investigate the task of general conversational image retrieval on open-domain images.
1 code implementation • 21 Oct 2024 • Han Huang, Yuqi Huo, Zijia Zhao, Haoyu Lu, Shu Wu, Bingning Wang, Qiang Liu, WeiPeng Chen, Liang Wang
A critical factor in training MLLMs is the quality of image-text pairs within multimodal pretraining datasets.
1 code implementation • 17 Oct 2024 • Yifan Du, Yuqi Huo, Kun Zhou, Zijia Zhao, Haoyu Lu, Han Huang, Wayne Xin Zhao, Bingning Wang, WeiPeng Chen, Ji-Rong Wen
Then, we explore the scaling effects of frame selection and token selection, respectively, and fit the corresponding function curves through extensive empirical experiments.
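As a rough illustration of the curve-fitting step described above, the sketch below fits a saturating power-law scaling curve to hypothetical (budget, accuracy) measurements; the functional form, data points, and parameter names are assumptions for illustration, not the paper's results or code.

```python
# A minimal sketch (not the paper's code) of fitting a scaling curve to
# empirical results, assuming a saturating power-law form; the data points
# and functional form below are illustrative placeholders.
import numpy as np
from scipy.optimize import curve_fit

def scaling_curve(n, a, b, c):
    # Hypothetical form: performance saturates as the frame/token budget n grows.
    return a - b * np.power(n, -c)

# Placeholder (budget, accuracy) measurements from hypothetical experiments.
budgets = np.array([8, 16, 32, 64, 128, 256], dtype=float)
accuracy = np.array([41.2, 47.5, 51.8, 54.1, 55.3, 55.9])

params, _ = curve_fit(scaling_curve, budgets, accuracy, p0=[60.0, 50.0, 0.5], maxfev=10000)
a, b, c = params
print(f"fitted curve: acc(n) = {a:.2f} - {b:.2f} * n^(-{c:.2f})")
print("predicted accuracy at n=512:", scaling_curve(512.0, *params))
```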
no code implementations • 8 Jul 2024 • Erdong Hu, Longteng Guo, Tongtian Yue, Zijia Zhao, Shuning Xue, Jing Liu
This paper introduces the OneDiff model, a novel generalist approach built on a robust vision-language architecture that integrates a Siamese image encoder with a Visual Delta Module.
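The sketch below illustrates, under stated assumptions, how a shared (Siamese) image encoder can feed a difference-modeling module of the kind described above; the module names, dimensions, and fusion scheme are placeholders and not the OneDiff implementation.

```python
# A minimal PyTorch sketch of the idea described above, assuming a shared
# ("Siamese") image encoder and a simple difference-based delta module;
# module names and sizes are illustrative, not the OneDiff architecture.
import torch
import torch.nn as nn

class VisualDeltaSketch(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        # Tiny stand-in for the Siamese image encoder (shared for both images).
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, feat_dim))
        # Hypothetical delta module: fuses features of the two images.
        self.delta = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.GELU())

    def forward(self, img_a, img_b):
        fa = self.encoder(img_a)          # features of the "before" image
        fb = self.encoder(img_b)          # features of the "after" image
        # Concatenate both views so the delta module can model their difference.
        return self.delta(torch.cat([fa, fb], dim=-1))

model = VisualDeltaSketch()
delta_feats = model(torch.randn(2, 3, 32, 32), torch.randn(2, 3, 32, 32))
print(delta_feats.shape)  # torch.Size([2, 256])
```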
1 code implementation • 20 Jun 2024 • Yifan Du, Kun Zhou, Yuqi Huo, YiFan Li, Wayne Xin Zhao, Haoyu Lu, Zijia Zhao, Bingning Wang, WeiPeng Chen, Ji-Rong Wen
Leveraging an effective instruction synthesis method and an adaptive model architecture, VIM surpasses both state-of-the-art open-source models and GPT-4V on the Event-Bench.
1 code implementation • 13 Jun 2024 • Zijia Zhao, Haoyu Lu, Yuqi Huo, Yifan Du, Tongtian Yue, Longteng Guo, Bingning Wang, WeiPeng Chen, Jing Liu
In this paper, we propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework through synthetic video generation.
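As a minimal sketch of the needle-in-a-haystack construction idea, the function below splices query-relevant "needle" frames into an unrelated "haystack" frame sequence and records their ground-truth positions; the details are illustrative, not VideoNIAH's actual pipeline.

```python
# A minimal sketch of the needle-in-a-haystack construction idea, assuming
# the "needle" is a set of query-relevant frames spliced into an unrelated
# ("haystack") frame sequence; the details are illustrative placeholders.
import random

def insert_needles(haystack_frames, needle_frames, seed=0):
    """Splice needle frames into the haystack at random positions and
    return the synthetic video plus the ground-truth needle indices."""
    rng = random.Random(seed)
    frames = list(haystack_frames)
    positions = sorted(rng.sample(range(len(frames) + 1), len(needle_frames)))
    gt_indices = []
    for offset, (pos, needle) in enumerate(zip(positions, needle_frames)):
        frames.insert(pos + offset, needle)   # offset accounts for earlier inserts
        gt_indices.append(pos + offset)
    return frames, gt_indices

video, answer = insert_needles([f"bg_{i}" for i in range(20)], ["needle_a", "needle_b"])
print(answer, [video[i] for i in answer])
```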
no code implementations • 20 Mar 2024 • Yanyuan Qiao, Zheng Yu, Longteng Guo, Sihan Chen, Zijia Zhao, Mingzhen Sun, Qi Wu, Jing Liu
Extensive experiments on diverse multimodal benchmarks show the competitive performance and effectiveness of our proposed VL-Mamba, and demonstrate the great potential of applying state space models to multimodal learning tasks.
Ranked #177 on Visual Question Answering on MM-Vet
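For readers unfamiliar with state space models, the sketch below shows the simplest diagonal linear state space recurrence that Mamba-style blocks build on, applied to a sequence of hypothetical multimodal tokens; it is a conceptual illustration, not the VL-Mamba block.

```python
# A minimal sketch of a diagonal linear state space recurrence (the simplest,
# non-selective form underlying Mamba-style blocks); parameters and shapes are
# illustrative and this is not the VL-Mamba implementation.
import torch

def ssm_scan(x, log_a, b, c):
    """Run y_t = c * h_t with h_t = a * h_{t-1} + b * x_t along the sequence.

    x:      (batch, length, dim) input sequence (e.g. projected multimodal tokens)
    log_a:  (dim,) parameterizes the diagonal transition; a = exp(-exp(log_a)) in (0, 1)
    b, c:   (dim,) input and output projections of the diagonal SSM
    """
    a = torch.exp(-torch.exp(log_a))          # decay kept in (0, 1) for stability
    h = torch.zeros(x.shape[0], x.shape[2])
    ys = []
    for t in range(x.shape[1]):
        h = a * h + b * x[:, t]               # state update
        ys.append(c * h)                      # readout
    return torch.stack(ys, dim=1)

tokens = torch.randn(2, 10, 16)               # e.g. visual + text tokens after projection
out = ssm_scan(tokens, torch.zeros(16), torch.ones(16), torch.ones(16))
print(out.shape)  # torch.Size([2, 10, 16])
```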
1 code implementation • CVPR 2024 • Tongtian Yue, Jie Cheng, Longteng Guo, Xingyuan Dai, Zijia Zhao, Xingjian He, Gang Xiong, Yisheng Lv, Jing Liu
In this paper, we present and delve into the self-consistency capability of LVLMs: a crucial aspect that reflects a model's ability both to generate informative captions for specific objects and to subsequently use those captions to accurately re-identify the same objects in a closed-loop process.
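A closed-loop self-consistency check of this kind could be scored roughly as sketched below, assuming hypothetical `describe(image, box)` and `ground(image, caption)` calls on an LVLM wrapper; the protocol and threshold are illustrative, not the paper's evaluation.

```python
# A minimal sketch of a closed-loop self-consistency check, assuming a
# hypothetical LVLM wrapper exposing `describe(image, box)` and
# `ground(image, caption)`; the scoring protocol here is illustrative only.
def iou(a, b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def self_consistency_score(model, image, boxes, thr=0.5):
    """Caption each box, ask the model to re-locate the object from its own
    caption, and report how often the loop closes (IoU above threshold)."""
    hits = 0
    for box in boxes:
        caption = model.describe(image, box)      # step 1: describe the object
        pred_box = model.ground(image, caption)   # step 2: re-identify from the caption
        hits += iou(box, pred_box) >= thr
    return hits / max(len(boxes), 1)
```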
1 code implementation • 17 Feb 2024 • Wenxuan Wang, Yisi Zhang, Xingjian He, Yichen Yan, Zijia Zhao, Xinlong Wang, Jing Liu
To promote classic VG towards human intention interpretation, we propose a new intention-driven visual grounding (IVG) task and build a large-scale IVG dataset termed IntentionVG with free-form intention expressions.
2 code implementations • NeurIPS 2023 • Sihan Chen, Handong Li, Qunbo Wang, Zijia Zhao, Mingzhen Sun, Xinxin Zhu, Jing Liu
Based on the proposed VAST-27M dataset, we train VAST, an omni-modality video-text foundation model that can perceive and process vision, audio, and subtitle modalities from video, and better supports various tasks including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning, and QA).
Ranked #1 on Image Captioning on COCO Captions (SPICE metric, using extra training data)
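The sketch below illustrates the general omni-modality fusion idea (encode vision, audio, and subtitle streams, then process their concatenated tokens jointly) under assumed module names and dimensions; it is not the VAST architecture.

```python
# A minimal PyTorch sketch of the omni-modality idea described above: project
# vision, audio, and subtitle features to a shared width and fuse their token
# sequences jointly; all modules and dimensions are illustrative placeholders.
import torch
import torch.nn as nn

class OmniFusionSketch(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Stand-ins for the per-modality encoders (vision / audio / subtitle).
        self.proj_vision = nn.Linear(512, dim)
        self.proj_audio = nn.Linear(128, dim)
        self.proj_subtitle = nn.Linear(300, dim)
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )

    def forward(self, vision_feats, audio_feats, subtitle_feats):
        tokens = torch.cat(
            [self.proj_vision(vision_feats),
             self.proj_audio(audio_feats),
             self.proj_subtitle(subtitle_feats)],
            dim=1,  # concatenate along the sequence axis
        )
        return self.fusion(tokens)  # joint omni-modality representation

model = OmniFusionSketch()
out = model(torch.randn(2, 8, 512), torch.randn(2, 6, 128), torch.randn(2, 12, 300))
print(out.shape)  # torch.Size([2, 26, 256])
```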
1 code implementation • 25 May 2023 • Zijia Zhao, Longteng Guo, Tongtian Yue, Sihan Chen, Shuai Shao, Xinxin Zhu, Zehuan Yuan, Jing Liu
We show that language-paired two-modality data alone is sufficient to connect all modalities.
no code implementations • 9 Oct 2022 • Zijia Zhao, Longteng Guo, Xingjian He, Shuai Shao, Zehuan Yuan, Jing Liu
Our method performs joint masking on image-text inputs and integrates both implicit and explicit targets for recovering the masked signals.
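A joint masked-modeling objective of this general shape might combine an explicit target (predicting masked text tokens) with an implicit target (regressing masked image positions onto teacher features), as sketched below; the loss weighting, shapes, and teacher setup are assumptions, not the paper's recipe.

```python
# A minimal PyTorch sketch combining an explicit target (masked text token
# prediction) with an implicit target (feature regression at masked image
# positions against a teacher encoder); all shapes and weights are assumptions.
import torch
import torch.nn.functional as F

def joint_masked_loss(text_logits, text_labels, patch_pred, teacher_feats, patch_mask, w=1.0):
    # Explicit target: cross-entropy on masked text positions (labels == -100 are ignored).
    text_loss = F.cross_entropy(text_logits.transpose(1, 2), text_labels, ignore_index=-100)
    # Implicit target: match teacher features only at masked patch positions.
    mask = patch_mask.unsqueeze(-1).float()
    feat_loss = (F.mse_loss(patch_pred, teacher_feats, reduction="none") * mask).sum() / mask.sum().clamp(min=1)
    return text_loss + w * feat_loss

# Example shapes: batch=2, text length=16, vocab=1000, patches=49, dim=256.
loss = joint_masked_loss(
    torch.randn(2, 16, 1000), torch.randint(0, 1000, (2, 16)),
    torch.randn(2, 49, 256), torch.randn(2, 49, 256),
    torch.rand(2, 49) > 0.6,
)
print(loss.item())
```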
2 code implementations • 1 Jul 2021 • Jing Liu, Xinxin Zhu, Fei Liu, Longteng Guo, Zijia Zhao, Mingzhen Sun, Weining Wang, Hanqing Lu, Shiyu Zhou, Jiajun Zhang, Jinqiao Wang
In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation by jointly modeling visual, textual, and audio resources.
Ranked #1 on Image Retrieval on Localized Narratives