no code implementations • 22 Apr 2024 • Dongze Hao, Qunbo Wang, Longteng Guo, Jie Jiang, Jing Liu
Knowledge-based Visual Question Answering (VQA) requires models to incorporate external knowledge to respond to questions about visual content.
no code implementations • 20 Mar 2024 • Yanyuan Qiao, Zheng Yu, Longteng Guo, Sihan Chen, Zijia Zhao, Mingzhen Sun, Qi Wu, Jing Liu
Extensive experiments on diverse multimodal benchmarks show the effectiveness of the proposed VL-Mamba: its competitive performance demonstrates the great potential of applying state space models to multimodal learning tasks.
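For readers unfamiliar with state space models, the snippet below is only a generic illustration of the discrete linear recurrence that Mamba-style architectures extend; it is not taken from VL-Mamba, and the function name, shapes, and NumPy interface are assumptions made for the sketch.

```python
import numpy as np

def linear_ssm(u, A, B, C):
    """Discrete linear state space recurrence: x[t] = A x[t-1] + B u[t], y[t] = C x[t].
    Mamba-style models make these dynamics input-dependent; only the linear core is shown."""
    x = np.zeros(A.shape[0])
    ys = []
    for u_t in u:                               # sequential scan over the input sequence
        x = A @ x + B @ np.atleast_1d(u_t)
        ys.append(C @ x)
    return np.array(ys)
```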
Ranked #68 on Visual Question Answering on MM-Vet
1 code implementation • 20 Mar 2024 • Tongtian Yue, Jie Cheng, Longteng Guo, Xingyuan Dai, Zijia Zhao, Xingjian He, Gang Xiong, Yisheng Lv, Jing Liu
In this paper, we present and investigate the self-consistency capability of LVLMs, a crucial aspect that reflects the models' ability to generate informative captions for specific objects and then use those captions to accurately re-identify the same objects in a closed-loop process.
no code implementations • 15 Mar 2024 • Dongze Hao, Jian Jia, Longteng Guo, Qunbo Wang, Te Yang, Yan Li, Yanhua Cheng, Bo Wang, Quan Chen, Han Li, Jing Liu
We condense the retrieved knowledge passages from two perspectives.
1 code implementation • 13 Dec 2023 • Wenxuan Wang, Tongtian Yue, Yisi Zhang, Longteng Guo, Xingjian He, Xinlong Wang, Jing Liu
To foster future research into fine-grained visual grounding, our RefCOCOm benchmark, the MRES-32M dataset, and the UniRES model will be publicly available at https://github.com/Rubics-Xuan/MRES
no code implementations • 23 Aug 2023 • Junyi Chen, Longteng Guo, Jia Sun, Shuai Shao, Zehuan Yuan, Liang Lin, Dongyu Zhang
Owing to the combination of the unified architecture and the pre-training task, EVE is easy to scale up, enabling better downstream performance with fewer resources and faster training.
1 code implementation • 25 May 2023 • Zijia Zhao, Longteng Guo, Tongtian Yue, Sihan Chen, Shuai Shao, Xinxin Zhu, Zehuan Yuan, Jing Liu
We show that language-paired two-modality data alone is sufficient to connect all modalities.
1 code implementation • 19 May 2023 • Zikang Liu, Sihan Chen, Longteng Guo, Handong Li, Xingjian He, Jing Liu
In this paper, we propose a novel method called Joint QA and DC GEneration (JADE), which utilizes a pre-trained multimodal model and easily crawled image-text pairs to automatically generate and filter large-scale VQA and dense captioning datasets.
1 code implementation • 17 Apr 2023 • Sihan Chen, Xingjian He, Longteng Guo, Xinxin Zhu, Weining Wang, Jinhui Tang, Jing Liu
Unlike widely studied vision-language pretraining models, VALOR jointly models the relationships among vision, audio, and language in an end-to-end manner.
Ranked #1 on Video Captioning on VATEX (using extra training data)
no code implementations • 9 Oct 2022 • Zijia Zhao, Longteng Guo, Xingjian He, Shuai Shao, Zehuan Yuan, Jing Liu
Our method performs joint masking on image-text input and integrates both implicit and explicit targets for the masked signals to recover.
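As a rough illustration of joint masking over the two modalities (not the authors' implementation), the sketch below samples masks for image patches and text tokens together; the mask ratios, the zeroed-patch convention, and the mask token id are placeholder assumptions, and the implicit/explicit recovery targets are not shown.

```python
import torch

def joint_mask(image_patches, text_ids, patch_ratio=0.5, text_ratio=0.15, mask_token_id=0):
    """Jointly sample masks over image patches and text tokens (generic sketch;
    the ratios and the mask token id are placeholder choices)."""
    patch_mask = torch.rand(image_patches.size(0)) < patch_ratio   # True = masked patch
    masked_patches = image_patches.clone()
    masked_patches[patch_mask] = 0.0                               # blank out masked patches

    text_mask = torch.rand(text_ids.size(0)) < text_ratio          # True = masked token
    masked_ids = text_ids.clone()
    masked_ids[text_mask] = mask_token_id                          # replace with a [MASK] id

    return masked_patches, patch_mask, masked_ids, text_mask
```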
2 code implementations • 1 Jul 2021 • Jing Liu, Xinxin Zhu, Fei Liu, Longteng Guo, Zijia Zhao, Mingzhen Sun, Weining Wang, Hanqing Lu, Shiyu Zhou, Jiajun Zhang, Jinqiao Wang
In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation by jointly modeling visual, textual, and audio resources.
Ranked #1 on Image Retrieval on Localized Narratives
no code implementations • 26 Jan 2021 • Wei Liu, Sihan Chen, Longteng Guo, Xinxin Zhu, Jing Liu
In addition, thanks to the full Transformer architecture, we provide detailed visualizations of the self-attention between patches in the encoder and the "words-to-patches" attention in the decoder.
no code implementations • 24 Jan 2021 • Longteng Guo, Jing Liu, Xinxin Zhu, Hanqing Lu
These models are autoregressive in that they generate each word by conditioning on previously generated words, which leads to heavy latency during inference.
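The latency issue can be seen directly in a minimal greedy decoding loop, sketched below under an assumed `model(image_features, token_prefix)` interface: each word requires its own forward pass conditioned on the prefix, so the steps cannot run in parallel.

```python
import torch

def autoregressive_decode(model, image_features, bos_id=1, eos_id=2, max_len=20):
    """Greedy autoregressive captioning: one forward pass per generated word."""
    tokens = [bos_id]
    for _ in range(max_len):
        # The decoder re-reads the whole prefix at every step, so the steps
        # cannot be parallelized across output positions.
        logits = model(image_features, torch.tensor([tokens]))  # (1, len(tokens), vocab)
        next_id = int(logits[0, -1].argmax())
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```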
no code implementations • 16 Dec 2020 • Xinxin Zhu, Weining Wang, Longteng Guo, Jing Liu
The whole process involves a visual understanding module and a language generation module, which makes designing deep neural networks for this task more challenging than for other tasks.
no code implementations • 10 May 2020 • Longteng Guo, Jing Liu, Xinxin Zhu, Xingjian He, Jie Jiang, Hanqing Lu
In this paper, we propose a Non-Autoregressive Image Captioning (NAIC) model with a novel training paradigm: Counterfactuals-critical Multi-Agent Learning (CMAL).
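In contrast to the autoregressive loop sketched earlier, the snippet below shows the generic non-autoregressive decoding pattern that NAIC-style models follow, under an assumed `model(image_features, target_len)` interface: all word positions are predicted in a single forward pass. The CMAL training scheme itself is not reproduced here.

```python
def non_autoregressive_decode(model, image_features, target_len=20):
    """Non-autoregressive captioning: every word position is predicted in a
    single forward pass, so latency does not grow with caption length."""
    logits = model(image_features, target_len)  # hypothetical interface: (1, target_len, vocab)
    return logits.argmax(dim=-1)                # (1, target_len) predicted word ids
```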
no code implementations • CVPR 2020 • Longteng Guo, Jing Liu, Xinxin Zhu, Peng Yao, Shichen Lu, Hanqing Lu
First, we propose Normalized Self-Attention (NSA), a reparameterization of self-attention (SA) that brings the benefits of normalization inside SA.
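The paper's exact NSA reparameterization is not reproduced here; the sketch below only illustrates the general idea of applying a normalization step to activations inside scaled dot-product self-attention, with the choice of normalizing the queries being an assumption made for illustration.

```python
import torch
import torch.nn.functional as F

def self_attention_with_internal_norm(x, w_q, w_k, w_v, eps=1e-6):
    """Scaled dot-product self-attention with a normalization step applied to the
    internal query activations (an illustrative stand-in, not the paper's exact NSA)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Normalize the queries over the feature dimension before computing attention scores.
    q = (q - q.mean(dim=-1, keepdim=True)) / (q.std(dim=-1, keepdim=True) + eps)
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ v
```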
no code implementations • 17 Oct 2019 • Xinxin Zhu, Longteng Guo, Peng Yao, Shichen Lu, Wei Liu, Jing Liu
This report describes our solution to the VATEX Captioning Challenge 2020, which requires generating descriptions for videos in both English and Chinese.
1 code implementation • 6 Aug 2019 • Longteng Guo, Jing Liu, Jinhui Tang, Jiangwei Li, Wei Luo, Hanqing Lu
Image captioning aims to generate a sentence composed of linguistic words that describe the objects, attributes, and interactions in an image, which we denote as visual semantic units in this paper.
no code implementations • CVPR 2019 • Longteng Guo, Jing Liu, Peng Yao, Jiangwei Li, Hanqing Lu
The discriminator and the generator are trained in an adversarial manner to enable more natural and human-like captions.