Search Results for author: Longteng Guo

Found 30 papers, 13 papers with code

Breaking the Encoder Barrier for Seamless Video-Language Understanding

no code implementations24 Mar 2025 Handong Li, Yiyuan Zhang, Longteng Guo, Xiangyu Yue, Jing Liu

Most Video-Large Language Models (Video-LLMs) adopt an encoder-decoder framework, where a vision encoder extracts frame-wise features for processing by a language model.

Decoder, Language Modeling +3
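
For readers skimming this listing, the "encoder-decoder framework" mentioned above is the standard Video-LLM pipeline: a vision encoder produces frame-wise features that a projector maps into the language model's token space, where they are concatenated with text tokens. A minimal sketch follows; all module choices, names, and dimensions are illustrative placeholders, not the architecture of the paper.

```python
import torch
import torch.nn as nn

# Minimal sketch of the conventional encoder-decoder Video-LLM pipeline the
# abstract refers to: a vision encoder turns each frame into patch features,
# a projector maps them into the LLM embedding space, and the language model
# consumes the concatenated visual and text tokens. Everything below is a
# placeholder stand-in, not the paper's design.
class ToyVideoLLM(nn.Module):
    def __init__(self, vision_dim=768, llm_dim=1024, vocab_size=32000):
        super().__init__()
        self.vision_encoder = nn.Linear(3 * 16 * 16, vision_dim)      # stand-in for a ViT patch encoder
        self.projector = nn.Linear(vision_dim, llm_dim)                # vision-to-LLM adapter
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        layer = nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)          # stand-in for the LLM
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, frame_patches, text_ids):
        # frame_patches: (batch, frames * patches, 3*16*16), text_ids: (batch, seq)
        visual_tokens = self.projector(self.vision_encoder(frame_patches))
        text_tokens = self.text_embed(text_ids)
        tokens = torch.cat([visual_tokens, text_tokens], dim=1)       # frame-wise features prefixed to text
        return self.lm_head(self.llm(tokens))

logits = ToyVideoLLM()(torch.randn(1, 8 * 4, 3 * 16 * 16), torch.randint(0, 32000, (1, 16)))
print(logits.shape)  # (1, visual + text tokens, vocab)
```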

FlexVLN: Flexible Adaptation for Diverse Vision-and-Language Navigation Tasks

no code implementations18 Mar 2025 Siqi Zhang, Yanyuan Qiao, Qunbo Wang, Longteng Guo, Zhihua Wei, Jing Liu

In this paper, we propose FlexVLN, an innovative hierarchical approach to VLN that integrates the fundamental navigation ability of a supervised-learning-based Instruction Follower with the robust generalization ability of the LLM Planner, enabling effective generalization across diverse VLN datasets.

Vision and Language Navigation

VRoPE: Rotary Position Embedding for Video Large Language Models

1 code implementation17 Feb 2025 Zikang Liu, Longteng Guo, Yepeng Tang, Junxian Cai, Kai Ma, Xi Chen, Jing Liu

Rotary Position Embedding (RoPE) has shown strong performance in text-based Large Language Models (LLMs), but extending it to video remains a challenge due to the intricate spatiotemporal structure of video frames.

Position, Video Understanding
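
As background for the RoPE discussion above, here is a minimal sketch of vanilla 1-D rotary position embedding as used in text LLMs; it is only an illustration and does not reproduce the VRoPE variant for video proposed in the paper.

```python
import torch

def rope(x, positions, base=10000.0):
    """Vanilla 1-D rotary position embedding (background only; the paper's
    VRoPE extends this to the spatiotemporal indices of video tokens).
    x: (..., seq, dim) with dim even, positions: (seq,)."""
    dim = x.shape[-1]
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)  # (dim/2,)
    angles = positions[:, None].float() * inv_freq[None, :]                   # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # rotate each 2-D feature pair (x1, x2) by its position-dependent angle
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(1, 5, 8)                 # (batch, seq, dim)
q_rot = rope(q, torch.arange(5))
```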

Ada-K Routing: Boosting the Efficiency of MoE-based LLMs

no code implementations14 Oct 2024 Tongtian Yue, Longteng Guo, Jie Cheng, Xuange Gao, Jing Liu

In this paper, we propose a novel Ada-K routing strategy that dynamically adjusts the number of activated experts for each token, thereby improving the balance between computational efficiency and model performance.

Computational Efficiency
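
For context on what per-token expert allocation means, the sketch below shows a simple heuristic form of adaptive top-k MoE routing: each token activates only as many experts as needed to cover most of the router's probability mass. This is an illustrative stand-in; the paper's Ada-K strategy learns the allocation policy rather than using a fixed threshold.

```python
import torch

def adaptive_topk_route(router_logits, max_k=4, threshold=0.9):
    """Illustrative per-token routing: keep the fewest experts whose cumulative
    softmax probability reaches `threshold`, capped at `max_k`. A heuristic
    stand-in, not the learned allocator from the paper.
    router_logits: (tokens, num_experts)."""
    probs = router_logits.softmax(dim=-1)
    sorted_p, sorted_idx = probs.sort(dim=-1, descending=True)
    cum = sorted_p.cumsum(dim=-1)
    # number of experts each token needs to cover `threshold` of the mass
    k_per_token = ((cum < threshold).sum(dim=-1) + 1).clamp(max=max_k)
    weights = torch.zeros_like(probs)
    for t in range(probs.shape[0]):
        k = int(k_per_token[t])
        idx = sorted_idx[t, :k]
        weights[t, idx] = probs[t, idx] / probs[t, idx].sum()   # renormalize over chosen experts
    return weights, k_per_token

logits = torch.randn(6, 8)               # 6 tokens, 8 experts
weights, ks = adaptive_topk_route(logits)
print(ks)                                # experts activated per token
```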

EEGPT: Unleashing the Potential of EEG Generalist Foundation Model by Autoregressive Pre-training

no code implementations14 Oct 2024 Tongtian Yue, Shuning Xue, Xuange Gao, Yepeng Tang, Longteng Guo, Jie Jiang, Jing Liu

First, we propose an electrode-wise modeling strategy that treats each electrode as a fundamental unit, enabling the integration of diverse EEG datasets collected from up to 138 electrodes, amassing 37.5M pre-training samples.

EEG, Transfer Learning

MM-LDM: Multi-Modal Latent Diffusion Model for Sounding Video Generation

1 code implementation2 Oct 2024 Mingzhen Sun, Weining Wang, Yanyuan Qiao, Jiahui Sun, Zihan Qin, Longteng Guo, Xinxin Zhu, Jing Liu

Sounding Video Generation (SVG) is an audio-video joint generation task challenged by high-dimensional signal spaces, distinct data formats, and different patterns of content information.

Video Generation

OneDiff: A Generalist Model for Image Difference Captioning

no code implementations8 Jul 2024 Erdong Hu, Longteng Guo, Tongtian Yue, Zijia Zhao, Shuning Xue, Jing Liu

This paper introduces the OneDiff model, a novel generalist approach that utilizes a robust vision-language model architecture, integrating a siamese image encoder with a Visual Delta Module.

Language Modelling, model +1

Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs

1 code implementation13 Jun 2024 Zijia Zhao, Haoyu Lu, Yuqi Huo, Yifan Du, Tongtian Yue, Longteng Guo, Bingning Wang, WeiPeng Chen, Jing Liu

In this paper, we propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework through synthetic video generation.

Benchmarking, Video Generation +2

Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering

1 code implementation22 Apr 2024 Dongze Hao, Qunbo Wang, Longteng Guo, Jie Jiang, Jing Liu

Motivated by the research of retrieval-augmented generation in the field of natural language processing, we use Dense Passage Retrieval (DPR) to retrieve related knowledge to help the model answer questions.

Language Modeling, Language Modelling +6
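
Dense Passage Retrieval itself is a standard technique; the sketch below shows the retrieval step as it is typically used for knowledge selection, with placeholder embeddings standing in for the question and passage encoders (it is not the paper's specific pipeline).

```python
import torch

def dense_retrieve(query_vec, passage_vecs, top_k=3):
    """Dense Passage Retrieval at inference time: embed the question and the
    knowledge passages with separately trained encoders, score by inner
    product, and keep the top-k passages as context for the answer model.
    query_vec: (dim,), passage_vecs: (num_passages, dim)."""
    scores = passage_vecs @ query_vec            # inner-product relevance scores
    top = scores.topk(min(top_k, passage_vecs.shape[0]))
    return top.indices, top.values

# Placeholder embeddings; in practice these come from BERT-style question/passage encoders.
question = torch.randn(256)
passages = torch.randn(1000, 256)
idx, scores = dense_retrieve(question, passages)
print(idx.tolist())                              # indices of retrieved knowledge passages
```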

VL-Mamba: Exploring State Space Models for Multimodal Learning

no code implementations20 Mar 2024 Yanyuan Qiao, Zheng Yu, Longteng Guo, Sihan Chen, Zijia Zhao, Mingzhen Sun, Qi Wu, Jing Liu

Extensive experiments on diverse multimodal benchmarks show that the proposed VL-Mamba achieves competitive performance and demonstrate the great potential of applying state space models to multimodal learning tasks.

Language Modeling, Language Modelling +5

SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models

1 code implementation CVPR 2024 Tongtian Yue, Jie Cheng, Longteng Guo, Xingyuan Dai, Zijia Zhao, Xingjian He, Gang Xiong, Yisheng Lv, Jing Liu

In this paper, we present and delve into the self-consistency capability of LVLMs, a crucial aspect that reflects the models' ability to both generate informative captions for specific objects and subsequently utilize these captions to accurately re-identify the objects in a closed-loop process.

Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation

1 code implementation CVPR 2024 Wenxuan Wang, Tongtian Yue, Yisi Zhang, Longteng Guo, Xingjian He, Xinlong Wang, Jing Liu

To foster future research into fine-grained visual grounding, our benchmark RefCOCOm, the MRES-32M dataset, and the model UniRES will be publicly available at https://github.com/Rubics-Xuan/MRES.

Descriptive, Object +3

Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation

1 code implementation13 Dec 2023 Wenxuan Wang, Tongtian Yue, Yisi Zhang, Longteng Guo, Xingjian He, Xinlong Wang, Jing Liu

To foster future research into fine-grained visual grounding, our benchmark RefCOCOm, the MRES-32M dataset, and the model UniRES will be publicly available at https://github.com/Rubics-Xuan/MRES.

Descriptive, Object +3

EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE

no code implementations23 Aug 2023 Junyi Chen, Longteng Guo, Jia Sun, Shuai Shao, Zehuan Yuan, Liang Lin, Dongyu Zhang

Owing to the combination of the unified architecture and pre-training task, EVE is easy to scale up, enabling better downstream performance with fewer resources and faster training speed.

Image-text matching, Image-text Retrieval +5

Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner

1 code implementation19 May 2023 Zikang Liu, Sihan Chen, Longteng Guo, Handong Li, Xingjian He, Jing Liu

In this paper, we propose a novel method called Joint QA and DC GEneration (JADE), which utilizes a pre-trained multimodal model and easily crawled image-text pairs to automatically generate and filter large-scale VQA and dense captioning datasets.

Dense Captioning, Image Captioning +4

VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset

1 code implementation17 Apr 2023 Jing Liu, Sihan Chen, Xingjian He, Longteng Guo, Xinxin Zhu, Weining Wang, Jinhui Tang

Different from widely-studied vision-language pretraining models, VALOR jointly models relationships of vision, audio and language in an end-to-end manner.

 Ranked #1 on Video Captioning on VATEX (using extra training data)

Audio captioning, Audio-Video Question Answering (AVQA) +17

MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning

no code implementations9 Oct 2022 Zijia Zhao, Longteng Guo, Xingjian He, Shuai Shao, Zehuan Yuan, Jing Liu

Our method performs joint masking on image-text input and integrates both implicit and explicit targets for the masked signals to recover.

Image-text Retrieval, multimodal interaction +6

OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation

2 code implementations1 Jul 2021 Jing Liu, Xinxin Zhu, Fei Liu, Longteng Guo, Zijia Zhao, Mingzhen Sun, Weining Wang, Hanqing Lu, Shiyu Zhou, Jiajun Zhang, Jinqiao Wang

In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation, by jointly modeling visual, text and audio resources.

Audio to Text Retrieval, Cross-Modal Retrieval +4

CPTR: Full Transformer Network for Image Captioning

no code implementations26 Jan 2021 Wei Liu, Sihan Chen, Longteng Guo, Xinxin Zhu, Jing Liu

In addition, thanks to the full Transformer architecture, we provide detailed visualizations of the self-attention between patches in the encoder and the "words-to-patches" attention in the decoder.

Decoder, Image Captioning

Fast Sequence Generation with Multi-Agent Reinforcement Learning

no code implementations24 Jan 2021 Longteng Guo, Jing Liu, Xinxin Zhu, Hanqing Lu

These models are autoregressive in that they generate each word by conditioning on previously generated words, which leads to heavy latency during inference.

Image Captioning, Machine Translation +6
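
The latency issue described above follows directly from the shape of autoregressive decoding: every generated token requires a fresh forward pass conditioned on all tokens produced so far. A minimal greedy-decoding sketch is shown below with a toy stand-in model; the non-autoregressive approaches in these papers replace this sequential loop with parallel emission.

```python
import torch

def greedy_autoregressive_decode(model, bos_id, eos_id, max_len=20):
    """Standard autoregressive decoding loop: one model call per generated
    token, each conditioned on everything generated so far. This sequential
    dependency is what non-autoregressive generation removes.
    `model` maps a (1, seq) id tensor to (1, seq, vocab) logits."""
    ids = torch.tensor([[bos_id]])
    for _ in range(max_len):
        logits = model(ids)                               # full forward pass every step
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
        if next_id.item() == eos_id:
            break
    return ids

# Toy "model": random logits over a 100-word vocabulary, just to run the loop.
toy = lambda ids: torch.randn(1, ids.shape[1], 100)
print(greedy_autoregressive_decode(toy, bos_id=0, eos_id=1).shape)
```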

AutoCaption: Image Captioning with Neural Architecture Search

no code implementations16 Dec 2020 Xinxin Zhu, Weining Wang, Longteng Guo, Jing Liu

The whole process involves a visual understanding module and a language generation module, which brings more challenges to the design of deep neural networks than other tasks.

Decoder, Image Captioning +2

Non-Autoregressive Image Captioning with Counterfactuals-Critical Multi-Agent Learning

no code implementations10 May 2020 Longteng Guo, Jing Liu, Xinxin Zhu, Xingjian He, Jie Jiang, Hanqing Lu

In this paper, we propose a Non-Autoregressive Image Captioning (NAIC) model with a novel training paradigm: Counterfactuals-critical Multi-Agent Learning (CMAL).

Image Captioning, Machine Translation +3

Vatex Video Captioning Challenge 2020: Multi-View Features and Hybrid Reward Strategies for Video Captioning

no code implementations17 Oct 2019 Xinxin Zhu, Longteng Guo, Peng Yao, Shichen Lu, Wei Liu, Jing Liu

This report describes our solution for the VATEX Captioning Challenge 2020, which requires generating descriptions for the videos in both English and Chinese languages.

Video Captioning

Aligning Linguistic Words and Visual Semantic Units for Image Captioning

1 code implementation6 Aug 2019 Longteng Guo, Jing Liu, Jinhui Tang, Jiangwei Li, Wei Luo, Hanqing Lu

Image captioning attempts to generate a sentence composed of several linguistic words, which are used to describe objects, attributes, and interactions in an image, denoted as visual semantic units in this paper.

Attribute, Image Captioning +2

MSCap: Multi-Style Image Captioning With Unpaired Stylized Text

no code implementations CVPR 2019 Longteng Guo, Jing Liu, Peng Yao, Jiangwei Li, Hanqing Lu

The discriminator and the generator are trained in an adversarial manner to enable more natural and human-like captions.

Image Captioning, Sentence