Search Results for author: Shuhuai Ren

Found 30 papers, 24 papers with code

TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos

2 code implementations • 24 Apr 2025 Linli Yao, Yicheng Li, Yuancheng Wei, Lei Li, Shuhuai Ren, Yuanxin Liu, Kun Ouyang, Lean Wang, Shicheng Li, Sida Li, Lingpeng Kong, Qi Liu, Yuanxing Zhang, Xu Sun

Remarkably, our experiments demonstrate that DTD achieves an 82.8% reduction in video tokens while maintaining 98% performance on StreamingBench, revealing that over 80% of visual content in streaming videos is naturally redundant without requiring language guidance.
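The redundancy test behind this result can be pictured with a small sketch: a patch token whose features barely change from the previous frame carries little new information. This is only an illustration, not the paper's actual DTD procedure; `drop_redundant_tokens`, the cosine test, and the 0.9 threshold are assumptions for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def drop_redundant_tokens(frames, threshold=0.9):
    """Keep a patch token only when it differs enough from the
    spatially co-located token in the previous frame; near-identical
    tokens are treated as redundant and dropped."""
    kept = [list(frames[0])]  # the first frame is kept in full
    for prev, cur in zip(frames, frames[1:]):
        kept.append([tok for p, tok in zip(prev, cur)
                     if cosine(p, tok) < threshold])
    return kept
```

On a two-frame toy clip where one of two patches is unchanged, this drops one of the four tokens, mirroring the paper's observation that much streaming content is temporally redundant.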

MME Video MME +1

TEMPLE: Temporal Preference Learning of Video LLMs via Difficulty Scheduling and Pre-SFT Alignment

1 code implementation • 21 Mar 2025 Shicheng Li, Lei Li, Kun Ouyang, Shuhuai Ren, Yuanxin Liu, Yuanxing Zhang, Fuzheng Zhang, Lingpeng Kong, Qi Liu, Xu Sun

We further analyze the transferability of DPO data across architectures and the role of difficulty scheduling in optimization.

Scheduling

Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation

no code implementations • 20 Mar 2025 Yuqing Wang, Zhijie Lin, Yao Teng, Yuanzhi Zhu, Shuhuai Ren, Jiashi Feng, Xihui Liu

Autoregressive visual generation models typically rely on tokenizers to compress images into tokens that can be predicted sequentially.
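The discrete half of such a tokenizer reduces to nearest-neighbour lookup in a learned codebook. The sketch below is an illustration of that step only; the `quantize` helper and the tiny codebook are assumptions for the example, not the paper's tokenizer.

```python
def quantize(patch_vecs, codebook):
    """Map each continuous patch vector to the index of its nearest
    codebook entry (squared Euclidean distance); the resulting index
    sequence is what an autoregressive model predicts one by one."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [min(range(len(codebook)), key=lambda i: sqdist(v, codebook[i]))
            for v in patch_vecs]
```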

Quantization

UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?

1 code implementation • 13 Mar 2025 Yuanxin Liu, Rui Zhu, Shuhuai Ren, Jiacong Wang, Haoyuan Guo, Xu Sun, Lu Jiang

To evaluate the performance of automatic metrics in unified AIGV evaluation, we introduce a benchmark called UVE-Bench.

Next Block Prediction: Video Generation via Semi-Auto-Regressive Modeling

no code implementations • 11 Feb 2025 Shuhuai Ren, Shuming Ma, Xu Sun, Furu Wei

Our model achieves FVD scores of 103.3 on UCF101 and 25.5 on K600, outperforming the vanilla NTP model by an average of 4.4.
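The semi-auto-regressive idea in the title can be sketched as a decoding loop that emits one block of tokens per step instead of a single token; `predict_block` below is a hypothetical stand-in for the model, not the paper's implementation.

```python
def generate(predict_block, prompt, num_blocks, block_size):
    """Semi-auto-regressive decoding: each step predicts a whole block
    of tokens conditioned on everything generated so far, cutting the
    number of sequential steps by a factor of block_size versus
    token-by-token next-token prediction."""
    seq = list(prompt)
    for _ in range(num_blocks):
        block = predict_block(seq, block_size)  # block_size tokens at once
        seq.extend(block)
    return seq
```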

Video Generation

Parallelized Autoregressive Visual Generation

no code implementations • CVPR 2025 Yuqing Wang, Shuhuai Ren, Zhijie Lin, Yujin Han, Haoyuan Guo, Zhenheng Yang, Difan Zou, Jiashi Feng, Xihui Liu

Autoregressive models have emerged as a powerful approach for visual generation but suffer from slow inference speed due to their sequential token-by-token prediction process.

Video Generation

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

1 code implementation • 16 Dec 2024 Liang Chen, Zekun Wang, Shuhuai Ren, Lei Li, Haozhe Zhao, Yunshui Li, Zefan Cai, Hongcheng Guo, Lei Zhang, Yizhe Xiong, Yichi Zhang, Ruoyu Wu, Qingxiu Dong, Ge Zhang, Jian Yang, Lingwei Meng, Shujie Hu, Yulong Chen, Junyang Lin, Shuai Bai, Andreas Vlachos, Xu Tan, Minjia Zhang, Wen Xiao, Aaron Yee, Tianyu Liu, Baobao Chang

As Large Language Models (LLMs) have advanced to unify understanding and generation tasks within the textual modality, recent research has shown that tasks from different modalities can also be effectively encapsulated within the NTP framework, transforming multimodal information into tokens and predicting the next one given the context.

Language Modeling Language Modelling +2

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

1 code implementation • CVPR 2025 Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Rongrong Ji, Xing Sun

With Video-MME, we extensively evaluate various state-of-the-art MLLMs, including the GPT-4 series and Gemini 1.5 Pro, as well as open-source image models like InternVL-Chat-V1.5 and video models like LLaVA-NeXT-Video.

MME Video MME

DeCo: Decoupling Token Compression from Semantic Abstraction in Multimodal Large Language Models

1 code implementation • 31 May 2024 Linli Yao, Lei Li, Shuhuai Ren, Lean Wang, Yuanxin Liu, Xu Sun, Lu Hou

Specifically, we trace back the semantic relevance flow from generated language tokens to raw visual encoder patches and the intermediate outputs produced by projectors.

cross-modal alignment Visual Localization +1

Towards Multimodal Video Paragraph Captioning Models Robust to Missing Modality

1 code implementation • 28 Mar 2024 Sishuo Chen, Lei Li, Shuhuai Ren, Rundong Gao, Yuanxin Liu, Xiaohan Bi, Xu Sun, Lu Hou

Video paragraph captioning (VPC) involves generating detailed narratives for long videos, utilizing supportive modalities such as speech and event boundaries.

Data Augmentation Diversity +1

TempCompass: Do Video LLMs Really Understand Videos?

1 code implementation • 1 Mar 2024 Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, Lu Hou

Motivated by these two problems, we propose the \textbf{TempCompass} benchmark, which introduces a diversity of temporal aspects and task formats.

Diversity

PCA-Bench: Evaluating Multimodal Large Language Models in Perception-Cognition-Action Chain

1 code implementation • 21 Feb 2024 Liang Chen, Yichi Zhang, Shuhuai Ren, Haozhe Zhao, Zefan Cai, Yuchi Wang, Peiyi Wang, Xiangdi Meng, Tianyu Liu, Baobao Chang

To address this, we introduce Embodied-Instruction-Evolution (EIE), an automatic framework for synthesizing instruction tuning examples in multimodal embodied environments.

Autonomous Driving Decision Making

TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding

2 code implementations • CVPR 2024 Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, Lu Hou

This work proposes TimeChat, a time-sensitive multimodal large language model specifically designed for long video understanding.

Ranked #2 on Video-Text Retrieval on Test-of-Time (using extra training data)

Dense Captioning Highlight Detection +9

TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding

1 code implementation • 29 Oct 2023 Shuhuai Ren, Sishuo Chen, Shicheng Li, Xu Sun, Lu Hou

TESTA can reduce the number of visual tokens by 75% and thus accelerate video encoding.
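One way to picture token aggregation is a greedy merge of the most similar adjacent tokens until a budget is met. The ToMe-style sketch below is an illustration under that assumption, not TESTA's actual temporal-spatial aggregation module.

```python
def aggregate(tokens, keep):
    """Greedily average-merge the closest adjacent pair of token
    vectors until only `keep` tokens remain, shrinking the sequence
    while preserving its coarse structure."""
    toks = [list(t) for t in tokens]
    while len(toks) > keep:
        # index of the adjacent pair with the smallest squared distance
        i = min(range(len(toks) - 1),
                key=lambda j: sum((a - b) ** 2
                                  for a, b in zip(toks[j], toks[j + 1])))
        toks[i:i + 2] = [[(a + b) / 2 for a, b in zip(toks[i], toks[i + 1])]]
    return toks
```

Merging four tokens down to two here is a 50% reduction; the paper reports 75% on real video token sequences.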

Ranked #1 on Video Retrieval on Condensed Movies (using extra training data)

Form Language Modelling +3

M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning

no code implementations • 7 Jun 2023 Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi Wang, Shuhuai Ren, Mukai Li, Yazheng Yang, Jingjing Xu, Xu Sun, Lingpeng Kong, Qi Liu

To tackle this challenge and promote research in the vision-language field, we introduce the Multi-Modal, Multilingual Instruction Tuning (M$^3$IT) dataset, designed to optimize VLM alignment with human instructions.

World Knowledge

Delving into the Openness of CLIP

1 code implementation • 4 Jun 2022 Shuhuai Ren, Lei Li, Xuancheng Ren, Guangxiang Zhao, Xu Sun

However, evaluating the openness of CLIP-like models is challenging, as the models are open to arbitrary vocabulary in theory, but their accuracy varies in practice.

image-classification Image Classification +2

Dynamic Knowledge Distillation for Pre-trained Language Models

1 code implementation • EMNLP 2021 Lei Li, Yankai Lin, Shuhuai Ren, Peng Li, Jie Zhou, Xu Sun

Knowledge distillation (KD) has been proven effective for compressing large-scale pre-trained language models.
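The standard soft-label distillation objective that such compression builds on can be written down directly; the temperature T = 2 below is just a common illustrative choice, not a value from this paper.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-softened softmax over a list of logits."""
    exps = [math.exp(x / T) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 as in standard knowledge distillation."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

The loss is zero when the student matches the teacher exactly and grows as their softened distributions diverge.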

Knowledge Distillation

Text AutoAugment: Learning Compositional Augmentation Policy for Text Classification

1 code implementation • EMNLP 2021 Shuhuai Ren, Jinchao Zhang, Lei Li, Xu Sun, Jie Zhou

Data augmentation aims to enrich training samples for alleviating the overfitting issue in low-resource or class-imbalanced situations.
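A compositional augmentation policy of the kind searched over here is, at its core, an ordered list of (operation, probability) pairs applied to a training sample. The two toy operations and `apply_policy` below are illustrative assumptions, not the paper's operation set or search procedure.

```python
import random

def random_swap(words, rng):
    """Swap two randomly chosen word positions."""
    ws = list(words)
    if len(ws) > 1:
        i, j = rng.sample(range(len(ws)), 2)
        ws[i], ws[j] = ws[j], ws[i]
    return ws

def random_delete(words, rng, p=0.2):
    """Drop each word independently with probability p."""
    ws = [w for w in words if rng.random() > p]
    return ws or list(words)  # never return an empty sample

def apply_policy(words, policy, seed=0):
    """A policy is an ordered list of (op, prob) pairs; each op
    fires independently with its probability."""
    rng = random.Random(seed)
    ws = list(words)
    for op, prob in policy:
        if rng.random() < prob:
            ws = op(ws, rng)
    return ws
```

Searching over the operations, their probabilities, and their order is what turns this from a fixed recipe into a learned policy.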

Bayesian Optimization Data Augmentation +3

Learning Relation Alignment for Calibrated Cross-modal Retrieval

1 code implementation • ACL 2021 Shuhuai Ren, Junyang Lin, Guangxiang Zhao, Rui Men, An Yang, Jingren Zhou, Xu Sun, Hongxia Yang

To bridge the semantic gap between the two modalities, previous studies mainly focus on word-region alignment at the object level, lacking the matching between the linguistic relation among the words and the visual relation among the regions.

Cross-Modal Retrieval Image-text Retrieval +4

DCA: Diversified Co-Attention towards Informative Live Video Commenting

no code implementations • 7 Nov 2019 Zhihan Zhang, Zhiyi Yin, Shuhuai Ren, Xinhang Li, Shicheng Li

In this paper, we aim to collect diversified information from video and text for informative comment generation.

Comment Generation Metric Learning

Generating Natural Language Adversarial Examples through Probability Weighted Word Saliency

1 code implementation • ACL 2019 Shuhuai Ren, Yihe Deng, Kun He, Wanxiang Che

Experiments on three popular datasets using convolutional as well as LSTM models show that PWWS reduces the classification accuracy to the most extent, and keeps a very low word substitution rate.
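The word-saliency half of PWWS can be sketched simply: mask each word in turn and measure the drop in the true-class probability. Here `classify` is a hypothetical black-box classifier returning class probabilities; the full attack additionally weighs candidate synonym substitutions by their effect, which this sketch omits.

```python
def word_saliency(classify, words, label):
    """Saliency of each word = drop in the true-class probability
    when that word is replaced by an out-of-vocabulary token.
    High-saliency words are the first targets for substitution."""
    base = classify(words)[label]
    saliencies = []
    for i in range(len(words)):
        masked = words[:i] + ["<unk>"] + words[i + 1:]
        saliencies.append(base - classify(masked)[label])
    return saliencies
```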

Adversarial Attack General Classification +6
