Search Results for author: Yongfei Liu

Found 18 papers, 9 papers with code

Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model

no code implementations • 28 May 2024 • Haogeng Liu, Quanzeng You, Xiaotian Han, Yongfei Liu, Huaibo Huang, Ran He, Hongxia Yang

In the realm of Multimodal Large Language Models (MLLMs), the vision-language connector plays a crucial role in linking pre-trained vision encoders with Large Language Models (LLMs).

Language Modelling, Large Language Model +1

ViTAR: Vision Transformer with Any Resolution

no code implementations • 27 Mar 2024 • Qihang Fan, Quanzeng You, Xiaotian Han, Yongfei Liu, Yunzhe Tao, Huaibo Huang, Ran He, Hongxia Yang

Firstly, we propose a novel module for dynamic resolution adjustment, designed with a single Transformer block, specifically to achieve highly efficient incremental token integration.

Self-Supervised Learning, Semantic Segmentation

Exploring the Reasoning Abilities of Multimodal Large Language Models (MLLMs): A Comprehensive Survey on Emerging Trends in Multimodal Reasoning

no code implementations • 10 Jan 2024 • Yiqi Wang, Wentao Chen, Xiaotian Han, Xudong Lin, Haiteng Zhao, Yongfei Liu, Bohan Zhai, Jianbo Yuan, Quanzeng You, Hongxia Yang

In this survey, we comprehensively review the existing evaluation protocols of multimodal reasoning, categorize and illustrate the frontiers of MLLMs, introduce recent trends in applications of MLLMs on reasoning-intensive tasks, and finally discuss current practices and future directions.

Multimodal Reasoning

Improving In-Context Learning in Diffusion Models with Visual Context-Modulated Prompts

no code implementations • 3 Dec 2023 • Tianqi Chen, Yongfei Liu, Zhendong Wang, Jianbo Yuan, Quanzeng You, Hongxia Yang, Mingyuan Zhou

In light of the remarkable success of in-context learning in large language models, its potential extension to the vision domain, particularly with visual foundation models like Stable Diffusion, has sparked considerable interest.

In-Context Learning

Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis

no code implementations • 28 Nov 2023 • Xiaohui Chen, Yongfei Liu, Yingxiang Yang, Jianbo Yuan, Quanzeng You, Li-Ping Liu, Hongxia Yang

Recent advancements in text-to-image (T2I) generative models have shown remarkable capabilities in producing diverse and imaginative visuals based on text prompts.

Image Generation

InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models

no code implementations • 20 Nov 2023 • Xiaotian Han, Quanzeng You, Yongfei Liu, Wentao Chen, Huangjie Zheng, Khalil Mrini, Xudong Lin, Yiqi Wang, Bohan Zhai, Jianbo Yuan, Heng Wang, Hongxia Yang

To mitigate this issue, we manually curate a benchmark dataset specifically designed for MLLMs, with a focus on complex reasoning tasks.

Grounded Image Text Matching with Mismatched Relation Reasoning

no code implementations • ICCV 2023 • Yu Wu, Yana Wei, Haozhe Wang, Yongfei Liu, Sibei Yang, Xuming He

This paper introduces Grounded Image Text Matching with Mismatched Relation (GITM-MR), a novel visual-linguistic joint task that evaluates the relation understanding capabilities of transformer-based pre-trained models.

Image-text matching, Relation +2

HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models

1 code implementation • CVPR 2023 • Shan Ning, Longtian Qiu, Yongfei Liu, Xuming He

In detail, we first introduce a novel interaction decoder to extract informative regions in the visual feature map of CLIP via a cross-attention mechanism, which is then fused with the detection backbone by a knowledge integration block for more accurate human-object pair detection.

Decoder, Human-Object Interaction Detection +3

Weakly-supervised HOI Detection via Prior-guided Bi-level Representation Learning

no code implementations • 2 Mar 2023 • Bo Wan, Yongfei Liu, Desen Zhou, Tinne Tuytelaars, Xuming He

Human-object interaction (HOI) detection plays a crucial role in human-centric scene understanding and serves as a fundamental building block for many vision tasks.

Human-Object Interaction Detection, Knowledge Distillation +3

Cascaded Sparse Feature Propagation Network for Interactive Segmentation

1 code implementation • 10 Mar 2022 • Chuyu Zhang, Chuanyang Hu, Hui Ren, Yongfei Liu, Xuming He

We aim to tackle the problem of point-based interactive segmentation, in which the key challenge is to propagate the user-provided annotations to unlabeled regions efficiently.

Foreground Segmentation, Interactive Segmentation +2

KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation

1 code implementation • Findings (NAACL) 2022 • Yongfei Liu, Chenfei Wu, Shao-Yen Tseng, Vasudev Lal, Xuming He, Nan Duan

Self-supervised vision-and-language pretraining (VLP) aims to learn transferable multi-modal representations from large-scale image-text data and to achieve strong performance on a broad range of vision-language tasks after finetuning.

cross-modal alignment, Knowledge Distillation +2

GEM: A General Evaluation Benchmark for Multimodal Tasks

1 code implementation • Findings (ACL) 2021 • Lin Su, Nan Duan, Edward Cui, Lei Ji, Chenfei Wu, Huaishao Luo, Yongfei Liu, Ming Zhong, Taroon Bharti, Arun Sacheti

Compared with existing multimodal datasets such as MSCOCO and Flickr30K for image-language tasks, and YouCook2 and MSR-VTT for video-language tasks, GEM is not only the largest vision-language dataset covering image-language and video-language tasks at the same time, but is also labeled in multiple languages.

Relation-aware Instance Refinement for Weakly Supervised Visual Grounding

1 code implementation • CVPR 2021 • Yongfei Liu, Bo Wan, Lin Ma, Xuming He

Visual grounding, which aims to build a correspondence between visual objects and their language entities, plays a key role in cross-modal scene understanding.

Object, Relation +3

Learning Cross-modal Context Graph for Visual Grounding

2 code implementations • 20 Nov 2019 • Yongfei Liu, Bo Wan, Xiaodan Zhu, Xuming He

To address their limitations, this paper proposes a language-guided graph representation to capture the global context of grounding entities and their relations, and develops a cross-modal graph matching strategy for the multiple-phrase visual grounding task.

Graph Matching, Graph Neural Network +1

Pose-aware Multi-level Feature Network for Human Object Interaction Detection

1 code implementation • ICCV 2019 • Bo Wan, Desen Zhou, Yongfei Liu, Rongjie Li, Xuming He

Reasoning about human-object interactions is a core problem in human-centric scene understanding, and detecting such relations poses a unique challenge to vision systems due to large variations in human-object configurations, multiple co-occurring relation instances, and subtle visual differences between relation categories.

Human-Object Interaction Detection, Object +2
