Search Results for author: Yunlong Tang

Found 14 papers, 7 papers with code

VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?

no code implementations · 17 Nov 2024 · Yunlong Tang, Junjia Guo, Hang Hua, Susan Liang, Mingqian Feng, Xinyang Li, Rui Mao, Chao Huang, Jing Bi, Zeliang Zhang, Pooyan Fazli, Chenliang Xu

The advancement of Multimodal Large Language Models (MLLMs) has enabled significant progress in multimodal understanding, expanding their capacity to analyze video content.

Tasks: Multiple-choice

Scaling Concept With Text-Guided Diffusion Models

no code implementations · 31 Oct 2024 · Chao Huang, Susan Liang, Yunlong Tang, Yapeng Tian, Anurag Kumar, Chenliang Xu

Through an empirical study, we identify a trend where concepts can be decomposed in text-guided diffusion models.

EAGLE: Egocentric AGgregated Language-video Engine

no code implementations · 26 Sep 2024 · Jing Bi, Yunlong Tang, Luchuan Song, Ali Vosoughi, Nguyen Nguyen, Chenliang Xu

The rapid evolution of egocentric video analysis brings new insights into understanding human activities and intentions from a first-person perspective.

Tasks: Action Recognition, Language Modelling, +5 more

CaRDiff: Video Salient Object Ranking Chain of Thought Reasoning for Saliency Prediction with Diffusion

no code implementations · 21 Aug 2024 · Yunlong Tang, Gen Zhan, Li Yang, Yiting Liao, Chenliang Xu

In this paper, we propose CaRDiff (Caption, Rank, and generate with Diffusion), a framework that imitates the process by integrating a multimodal large language model (MLLM), a grounding module, and a diffusion model, to enhance video saliency prediction.
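A minimal sketch of how such a caption-rank-generate pipeline could be wired together. All three components and their interfaces below are hypothetical placeholders for illustration, not CaRDiff's actual API:

```python
import numpy as np

def rank_objects_with_mllm(frames):
    # Placeholder for the MLLM stage: caption the clip and rank the
    # described objects by salience via chain-of-thought reasoning.
    return [("person", 1), ("dog", 2)]

def ground_phrases(frames, ranked):
    # Placeholder for the grounding module: localize each ranked phrase.
    # Here we fabricate one soft region mask per phrase.
    h, w = frames.shape[1:3]
    return [(rank, np.random.rand(h, w)) for _, rank in ranked]

def diffuse_saliency(frames, regions):
    # Placeholder for the diffusion model: here, simply weight each
    # grounded mask by inverse rank and normalize into one saliency map.
    raw = sum(mask / rank for rank, mask in regions)
    return raw / raw.max()

frames = np.random.rand(8, 64, 64, 3)  # toy T x H x W x C clip
ranked = rank_objects_with_mllm(frames)
saliency = diffuse_saliency(frames, ground_phrases(frames, ranked))
print(saliency.shape)  # (64, 64)
```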

Tasks: Language Modelling, Large Language Model, +3 more

Do More Details Always Introduce More Hallucinations in LVLM-based Image Captioning?

no code implementations · 18 Jun 2024 · Mingqian Feng, Yunlong Tang, Zeliang Zhang, Chenliang Xu

This has sparked a debate on the question: Do more details always introduce more hallucinations in LVLM-based image captioning?

Tasks: Attribute, Hallucination, +2 more

V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning

no code implementations · 18 Apr 2024 · Hang Hua, Yunlong Tang, Chenliang Xu, Jiebo Luo

Recent efforts have been made to expand from unimodal to multimodal video summarization, categorizing the task into three sub-tasks based on the summary's modality: video-to-video (V2V), video-to-text (V2T), and a combination of video and text summarization (V2VT).
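For illustration, the three sub-tasks can be captured in a small enum; the task-prompt wording below is invented for the sketch and is not the paper's actual instruction format:

```python
from enum import Enum

class SummaryModality(Enum):
    V2V = "video-to-video"    # select keyframes/shots as the summary
    V2T = "video-to-text"     # write a textual summary
    V2VT = "video-and-text"   # produce an aligned video + text summary

def task_prompt(modality: SummaryModality) -> str:
    # Hypothetical temporal-prompt-style instruction letting one LLM
    # serve all three sub-tasks.
    return f"Summarize this video as {modality.value}, citing frame indices."

for m in SummaryModality:
    print(task_prompt(m))
```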

Tasks: Text Summarization, Video Summarization

DPStyler: Dynamic PromptStyler for Source-Free Domain Generalization

1 code implementation · 25 Mar 2024 · Yunlong Tang, Yuxuan Wan, Lei Qi, Xin Geng

Moreover, the Style Generation module, which produces style word vectors via random sampling or style mixing, makes the model sensitive to input text prompts, so we introduce a model ensemble method to mitigate this sensitivity.
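A toy sketch of the two ideas named above, with invented shapes and stand-in components (the real module operates on CLIP-style text embeddings): style word vectors drawn by random sampling or style mixing, and an ensemble that averages predictions to damp prompt sensitivity:

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 512

def random_style() -> np.ndarray:
    # Random sampling: draw a fresh style word vector.
    return rng.normal(size=EMBED_DIM)

def style_mix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Style mixing: interpolate between two existing style vectors.
    lam = rng.uniform()
    return lam * a + (1 - lam) * b

def ensemble_logits(models, features: np.ndarray) -> np.ndarray:
    # Model ensemble: average the predictions of classifiers trained
    # under different input text prompts to damp prompt sensitivity.
    return np.mean([m(features) for m in models], axis=0)

# Toy usage: two linear "classifiers" whose weights stand in for
# models trained with different prompts.
models = [lambda f, W=rng.normal(size=(EMBED_DIM, 10)): f @ W for _ in range(2)]
mixed = style_mix(random_style(), random_style())
print(ensemble_logits(models, mixed).shape)  # (10,)
```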

Tasks: Source-free Domain Generalization, Style Transfer

Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding

no code implementations · 24 Mar 2024 · Yunlong Tang, Daiki Shimada, Jing Bi, Mingqian Feng, Hang Hua, Chenliang Xu

This deficiency hinders LLMs from learning the alignment between time, audio-visual events, and text tokens, thus impairing their ability to temporally localize audio-visual events in videos.
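The remedy suggested by the paper's title is to construct pseudo-untrimmed videos. A minimal sketch of what such timestamped supervision could look like; the data format here is invented for illustration:

```python
def make_pseudo_untrimmed(clips):
    """clips: list of (duration_sec, event_label) trimmed clips.
    Concatenating them yields an untrimmed-style video whose event
    boundaries are known exactly, giving time/event/text alignment."""
    t, annotations = 0.0, []
    for duration, label in clips:
        annotations.append({"event": label, "start": t, "end": t + duration})
        t += duration
    return annotations

print(make_pseudo_untrimmed([(3.2, "dog barking"), (5.0, "man speaking")]))
```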

Tasks: Dense Video Captioning, Temporal Localization, +1 more

GaussianStyle: Gaussian Head Avatar via StyleGAN

1 code implementation · 1 Feb 2024 · Pinxin Liu, Luchuan Song, Daoan Zhang, Hang Hua, Yunlong Tang, Huaijin Tu, Jiebo Luo, Chenliang Xu

Existing methods like Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have made significant strides in facial attribute control such as facial animation and component editing, yet they struggle with fine-grained representation and scalability in dynamic head modeling.

Tasks: Attribute, Contrastive Learning, +2 more

Video Understanding with Large Language Models: A Survey

1 code implementation · 29 Dec 2023 · Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Pinxin Liu, Mingqian Feng, Feng Zheng, JianGuo Zhang, Ping Luo, Jiebo Luo, Chenliang Xu

With the burgeoning growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly.

Tasks: Survey, Video Understanding

LaunchpadGPT: Language Model as Music Visualization Designer on Launchpad

1 code implementation · 7 Jul 2023 · Siting Xu, Yunlong Tang, Feng Zheng

To assist and inspire the design of Launchpad light effects, and to offer beginners a more accessible way to create music visualizations with this instrument, we propose LaunchpadGPT, a model that automatically generates music visualization designs on the Launchpad.
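As a toy illustration of the output side only, suppose the language model emits (time, pad, color) events for the Launchpad's 8x8 grid; rendering a frame then reduces to replaying events up to a timestamp. The real LaunchpadGPT output format is not reproduced here:

```python
# Hypothetical model output: (time_sec, pad_index, color) triples
# for an 8x8 pad grid, pad_index in row-major order.
events = [(0.0, 0, "red"), (0.0, 9, "blue"), (0.5, 63, "green")]

def frame_at(t, events, grid=8):
    # Replay all events with timestamp <= t onto an initially dark board.
    board = [["off"] * grid for _ in range(grid)]
    for time, pad, color in events:
        if time <= t:
            board[pad // grid][pad % grid] = color
    return board

for row in frame_at(0.5, events):
    print(row)
```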

Tasks: Language Modelling

LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning

1 code implementation · 17 Jun 2023 · Yunlong Tang, Jinrui Zhang, Xiangchen Wang, Teng Wang, Feng Zheng

This paper proposes an effective model, LLMVA-GEBC (Large Language Model with Video Adapter for Generic Event Boundary Captioning): (1) we utilize a pretrained LLM to generate high-quality, human-like captions.
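A minimal PyTorch sketch of what a video adapter in this spirit might look like: learned queries cross-attend over frozen video features and are projected into the LLM's embedding space as soft prefix tokens. The dimensions and design below are assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

class VideoAdapter(nn.Module):
    """Hypothetical adapter: maps frozen video features to a few
    soft prefix tokens in the LLM's embedding space."""
    def __init__(self, vid_dim=768, llm_dim=4096, n_tokens=32):
        super().__init__()
        self.query = nn.Parameter(torch.randn(n_tokens, vid_dim))
        self.attn = nn.MultiheadAttention(vid_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(vid_dim, llm_dim)

    def forward(self, vid_feats):            # (B, T, vid_dim)
        q = self.query.expand(vid_feats.size(0), -1, -1)
        pooled, _ = self.attn(q, vid_feats, vid_feats)
        return self.proj(pooled)             # (B, n_tokens, llm_dim)

adapter = VideoAdapter()
prefix = adapter(torch.randn(2, 100, 768))   # 2 clips, 100 frames each
print(prefix.shape)                          # torch.Size([2, 32, 4096])
```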

Tasks: Boundary Captioning, Language Modelling, +1 more

Caption Anything: Interactive Image Description with Diverse Multimodal Controls

1 code implementation · 4 May 2023 · Teng Wang, Jinrui Zhang, Junjie Fei, Hao Zheng, Yunlong Tang, Zhe Li, Mingqi Gao, Shanshan Zhao

Controllable image captioning is an emerging multimodal topic that aims to describe an image in natural language according to human intent, e.g., looking at specified regions or describing the image in a particular text style.
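The controls decompose naturally into a visual part (where to look) and a language part (how to tell). A toy sketch with invented fields, not the project's actual interface:

```python
from dataclasses import dataclass
from typing import Literal, Optional, Tuple

@dataclass
class VisualControl:
    point: Optional[Tuple[int, int]] = None           # user click
    box: Optional[Tuple[int, int, int, int]] = None   # user box

@dataclass
class LanguageControl:
    sentiment: Literal["neutral", "positive", "negative"] = "neutral"
    length: Literal["short", "detailed"] = "short"

def caption(image, vis: VisualControl, lang: LanguageControl) -> str:
    # Toy stand-in: a real system would segment the controlled region,
    # caption the crop, then have an LLM rewrite under language controls.
    region = "full image" if vis.point is None and vis.box is None else "selected region"
    return f"A {lang.length}, {lang.sentiment} caption of the {region}."

print(caption(None, VisualControl(point=(120, 80)), LanguageControl(sentiment="positive")))
```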

Tasks: Controllable Image Captioning, Instruction Following

Multi-modal Segment Assemblage Network for Ad Video Editing with Importance-Coherence Reward

1 code implementation · 25 Sep 2022 · Yunlong Tang, Siting Xu, Teng Wang, Qin Lin, Qinglin Lu, Feng Zheng

The existing method performs well at the video segmentation stage but depends on extra, cumbersome models and performs poorly at the segment assemblage stage.
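A toy scalarization of an importance-coherence reward, trading off how important the selected segments are against how coherently adjacent segments connect. The weighting and exact formulation below are assumptions, not the paper's definition:

```python
def assemblage_reward(importance, coherence, alpha=0.5):
    """Hypothetical importance-coherence reward.

    importance: per-selected-segment scores in [0, 1]
    coherence:  scores in [0, 1] for each adjacent pair in the assembly
    alpha:      assumed trade-off weight between the two terms
    """
    imp = sum(importance) / len(importance)
    coh = sum(coherence) / len(coherence) if coherence else 1.0
    return alpha * imp + (1 - alpha) * coh

print(assemblage_reward([0.9, 0.7, 0.8], [0.6, 0.75]))
```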

Tasks: Decoder, Video Editing, +2 more
