1 code implementation • 4 Jun 2025 • Xiaomi LLM-Core Team, Zihao Yue, Zhenru Lin, YiFan Song, Weikun Wang, Shuhuai Ren, Shuhao Gu, Shicheng Li, Peidian Li, Liang Zhao, Lei Li, Kainan Bao, Hao Tian, Hailin Zhang, Gang Wang, Dawei Zhu, Cici, Chenhong He, Bowen Ye, Bowen Shen, Zihan Zhang, Zihan Jiang, Zhixian Zheng, Zhichao Song, Zhenbo Luo, Yue Yu, Yudong Wang, Yuanyuan Tian, Yu Tu, Yihan Yan, Yi Huang, Xu Wang, Xinzhe Xu, Xingchen Song, Xing Zhang, Xing Yong, Xin Zhang, Xiangwei Deng, Wenyu Yang, Wenhan Ma, Weiwei Lv, Weiji Zhuang, Wei Liu, Sirui Deng, Shuo Liu, Shimao Chen, Shihua Yu, Shaohui Liu, Shande Wang, Rui Ma, Qiantong Wang, Peng Wang, Nuo Chen, Menghang Zhu, Kangyang Zhou, Kang Zhou, Kai Fang, Jun Shi, Jinhao Dong, Jiebao Xiao, Jiaming Xu, Huaqiu Liu, Hongshen Xu, Heng Qu, Haochen Zhao, Hanglong Lv, Guoan Wang, Duo Zhang, Dong Zhang, Di Zhang, Chong Ma, Chang Liu, Can Cai, Bingquan Xia
We open-source MiMo-VL-7B-SFT and MiMo-VL-7B-RL, two powerful vision-language models delivering state-of-the-art performance in both general visual understanding and multimodal reasoning.
1 code implementation • 12 May 2025 • Xiaomi LLM-Core Team, Bingquan Xia, Bowen Shen, Cici, Dawei Zhu, Di Zhang, Gang Wang, Hailin Zhang, Huaqiu Liu, Jiebao Xiao, Jinhao Dong, Liang Zhao, Peidian Li, Peng Wang, Shihua Yu, Shimao Chen, Weikun Wang, Wenhan Ma, Xiangwei Deng, Yi Huang, YiFan Song, Zihan Jiang, Bowen Ye, Can Cai, Chenhong He, Dong Zhang, Duo Zhang, Guoan Wang, Hao Tian, Haochen Zhao, Heng Qu, Hongshen Xu, Jun Shi, Kainan Bao, Qingkai Fang, Kang Zhou, Kangyang Zhou, Lei Li, Menghang Zhu, Nuo Chen, Qiantong Wang, Shaohui Liu, Shicheng Li, Shuhao Gu, Shuhuai Ren, Shuo Liu, Sirui Deng, Weiji Zhuang, Weiwei Lv, Wenyu Yang, Xin Zhang, Xing Yong, Xing Zhang, Xingchen Song, Xinzhe Xu, Xu Wang, Yihan Yan, Yu Tu, Yuanyuan Tian, Yudong Wang, Yue Yu, Zhenru Lin, Zhichao Song, Zihao Yue
We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages.
2 code implementations • 24 Apr 2025 • Linli Yao, Yicheng Li, Yuancheng Wei, Lei Li, Shuhuai Ren, Yuanxin Liu, Kun Ouyang, Lean Wang, Shicheng Li, Sida Li, Lingpeng Kong, Qi Liu, Yuanxing Zhang, Xu Sun
Remarkably, our experiments demonstrate that DTD achieves an 82.8% reduction in video tokens while maintaining 98% performance on StreamingBench, revealing that over 80% of visual content in streaming videos is naturally redundant without requiring language guidance.
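The redundancy result above (dropping most video tokens without hurting performance) can be illustrated with a minimal sketch. The `drop_redundant_tokens` function, the cosine-similarity threshold, and the toy feature vectors are illustrative assumptions, not the paper's DTD method:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def drop_redundant_tokens(tokens, threshold=0.95):
    """Keep a token only if it differs enough from the last kept one,
    discarding near-duplicate tokens from consecutive video frames."""
    kept = [tokens[0]]
    for t in tokens[1:]:
        if cosine(t, kept[-1]) < threshold:
            kept.append(t)
    return kept
```

For highly redundant streams (near-identical consecutive frames), most tokens are filtered out while the first token of each distinct segment survives.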
1 code implementation • 21 Mar 2025 • Shicheng Li, Lei Li, Kun Ouyang, Shuhuai Ren, Yuanxin Liu, Yuanxing Zhang, Fuzheng Zhang, Lingpeng Kong, Qi Liu, Xu Sun
We further analyze the transferability of DPO data across architectures and the role of difficulty scheduling in optimization.
no code implementations • 20 Mar 2025 • Yuqing Wang, Zhijie Lin, Yao Teng, Yuanzhi Zhu, Shuhuai Ren, Jiashi Feng, Xihui Liu
Autoregressive visual generation models typically rely on tokenizers to compress images into tokens that can be predicted sequentially.
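The sequential prediction scheme described above can be sketched with a toy sampler. The bigram-style `next_token_probs` table stands in for a full autoregressive model, which would condition on the entire prefix; all names here are illustrative:

```python
import random

def generate_sequence(start_token, next_token_probs, length, seed=0):
    """Sequentially sample image tokens: each step conditions on the
    token just produced (a stand-in for a real autoregressive model,
    which would condition on the whole prefix)."""
    rng = random.Random(seed)
    seq = [start_token]
    for _ in range(length - 1):
        probs = next_token_probs[seq[-1]]  # conditional distribution
        tokens, weights = zip(*probs.items())
        seq.append(rng.choices(tokens, weights=weights)[0])
    return seq
```

The loop makes the cost structure explicit: generating `n` tokens takes `n` sequential model calls, which is why later work targets faster decoding.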
1 code implementation • 13 Mar 2025 • Yuanxin Liu, Rui Zhu, Shuhuai Ren, Jiacong Wang, Haoyuan Guo, Xu Sun, Lu Jiang
To evaluate the performance of automatic metrics in unified AIGV evaluation, we introduce a benchmark called UVE-Bench.
no code implementations • 11 Feb 2025 • Shuhuai Ren, Shuming Ma, Xu Sun, Furu Wei
Our model achieves FVD scores of 103.3 on UCF101 and 25.5 on K600, outperforming the vanilla NTP model by an average of 4.4.
no code implementations • CVPR 2025 • Yuqing Wang, Shuhuai Ren, Zhijie Lin, Yujin Han, Haoyuan Guo, Zhenheng Yang, Difan Zou, Jiashi Feng, Xihui Liu
Autoregressive models have emerged as a powerful approach for visual generation but suffer from slow inference speed due to their sequential token-by-token prediction process.
1 code implementation • 16 Dec 2024 • Liang Chen, Zekun Wang, Shuhuai Ren, Lei Li, Haozhe Zhao, Yunshui Li, Zefan Cai, Hongcheng Guo, Lei Zhang, Yizhe Xiong, Yichi Zhang, Ruoyu Wu, Qingxiu Dong, Ge Zhang, Jian Yang, Lingwei Meng, Shujie Hu, Yulong Chen, Junyang Lin, Shuai Bai, Andreas Vlachos, Xu Tan, Minjia Zhang, Wen Xiao, Aaron Yee, Tianyu Liu, Baobao Chang
As Large Language Models (LLMs) have advanced to unify understanding and generation tasks within the textual modality, recent research has shown that tasks from different modalities can also be effectively encapsulated within the NTP framework, transforming multimodal information into tokens and predicting the next one given the context.
1 code implementation • CVPR 2025 • Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, Peixian Chen, Yanwei Li, Shaohui Lin, Sirui Zhao, Ke Li, Tong Xu, Xiawu Zheng, Enhong Chen, Rongrong Ji, Xing Sun
With Video-MME, we extensively evaluate various state-of-the-art MLLMs, including the GPT-4 series and Gemini 1.5 Pro, as well as open-source image models like InternVL-Chat-V1.5 and video models like LLaVA-NeXT-Video.
1 code implementation • 31 May 2024 • Linli Yao, Lei Li, Shuhuai Ren, Lean Wang, Yuanxin Liu, Xu Sun, Lu Hou
Specifically, we trace back the semantic relevance flow from generated language tokens to raw visual encoder patches and the intermediate outputs produced by projectors.
1 code implementation • 16 Apr 2024 • Yuchi Wang, Shuhuai Ren, Rundong Gao, Linli Yao, Qingyan Guo, Kaikai An, Jianhong Bai, Xu Sun
Diffusion models have exhibited remarkable capabilities in text-to-image generation.
Ranked #8 on Image Captioning on COCO Captions (ROUGE-L metric)
1 code implementation • 28 Mar 2024 • Sishuo Chen, Lei Li, Shuhuai Ren, Rundong Gao, Yuanxin Liu, Xiaohan Bi, Xu Sun, Lu Hou
Video paragraph captioning (VPC) involves generating detailed narratives for long videos, utilizing supportive modalities such as speech and event boundaries.
1 code implementation • 1 Mar 2024 • Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, Lu Hou
Motivated by these two problems, we propose the \textbf{TempCompass} benchmark, which introduces a diversity of temporal aspects and task formats.
1 code implementation • 21 Feb 2024 • Liang Chen, Yichi Zhang, Shuhuai Ren, Haozhe Zhao, Zefan Cai, Yuchi Wang, Peiyi Wang, Xiangdi Meng, Tianyu Liu, Baobao Chang
To address this, we introduce Embodied-Instruction-Evolution (EIE), an automatic framework for synthesizing instruction tuning examples in multimodal embodied environments.
2 code implementations • CVPR 2024 • Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, Lu Hou
This work proposes TimeChat, a time-sensitive multimodal large language model specifically designed for long video understanding.
Ranked #2 on Video-Text Retrieval on Test-of-Time (using extra training data)
1 code implementation • 29 Nov 2023 • Shicheng Li, Lei Li, Shuhuai Ren, Yuanxin Liu, Yi Liu, Rundong Gao, Xu Sun, Lu Hou
The ability to perceive how objects change over time is a crucial ingredient in human intelligence.
1 code implementation • NeurIPS 2023 • Yuanxin Liu, Lei Li, Shuhuai Ren, Rundong Gao, Shicheng Li, Sishuo Chen, Xu Sun, Lu Hou
The multi-aspect categorization of FETV enables fine-grained analysis of the metrics' reliability in different scenarios.
1 code implementation • 29 Oct 2023 • Shuhuai Ren, Sishuo Chen, Shicheng Li, Xu Sun, Lu Hou
TESTA can reduce the number of visual tokens by 75% and thus accelerate video encoding.
Ranked #1 on Video Retrieval on Condensed Movies (using extra training data)
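The token-reduction idea behind TESTA (aggregating adjacent visual tokens so fewer need to be encoded) can be sketched as follows. The `aggregate_tokens` helper and the mean-pooling rule are illustrative assumptions, not the paper's exact aggregation module:

```python
def aggregate_tokens(tokens, group=4):
    """Merge every `group` consecutive tokens into their element-wise
    mean, reducing the token count by (1 - 1/group) — e.g. a 75%
    reduction for group=4, mirroring the reported compression rate."""
    merged = []
    for i in range(0, len(tokens), group):
        chunk = tokens[i:i + group]
        merged.append([sum(vals) / len(chunk) for vals in zip(*chunk)])
    return merged
```

Because downstream attention cost scales with token count, a 4x aggregation like this directly accelerates video encoding.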
1 code implementation • 3 Oct 2023 • Liang Chen, Yichi Zhang, Shuhuai Ren, Haozhe Zhao, Zefan Cai, Yuchi Wang, Peiyi Wang, Tianyu Liu, Baobao Chang
In this study, we explore the potential of Multimodal Large Language Models (MLLMs) in improving embodied decision-making processes for agents.
no code implementations • 7 Jun 2023 • Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi Wang, Shuhuai Ren, Mukai Li, Yazheng Yang, Jingjing Xu, Xu Sun, Lingpeng Kong, Qi Liu
To tackle this challenge and promote research in the vision-language field, we introduce the Multi-Modal, Multilingual Instruction Tuning (M$^3$IT) dataset, designed to optimize VLM alignment with human instructions.
1 code implementation • NeurIPS 2023 • Shuhuai Ren, Aston Zhang, Yi Zhu, Shuai Zhang, Shuai Zheng, Mu Li, Alex Smola, Xu Sun
This work proposes POMP, a prompt pre-training method for vision-language models.
1 code implementation • 4 Jun 2022 • Shuhuai Ren, Lei Li, Xuancheng Ren, Guangxiang Zhao, Xu Sun
However, evaluating the openness of CLIP-like models is challenging, as the models are open to arbitrary vocabulary in theory, but their accuracy varies in practice.
no code implementations • 27 Dec 2021 • Yuan Yao, Qingxiu Dong, Jian Guan, Boxi Cao, Zhengyan Zhang, Chaojun Xiao, Xiaozhi Wang, Fanchao Qi, Junwei Bao, Jinran Nie, Zheni Zeng, Yuxian Gu, Kun Zhou, Xuancheng Huang, Wenhao Li, Shuhuai Ren, Jinliang Lu, Chengqiang Xu, Huadong Wang, Guoyang Zeng, Zile Zhou, Jiajun Zhang, Juanzi Li, Minlie Huang, Rui Yan, Xiaodong He, Xiaojun Wan, Xin Zhao, Xu Sun, Yang Liu, Zhiyuan Liu, Xianpei Han, Erhong Yang, Zhifang Sui, Maosong Sun
We argue that for general-purpose language intelligence evaluation, the benchmark itself needs to be comprehensive and systematic.
1 code implementation • EMNLP 2021 • Lei Li, Yankai Lin, Shuhuai Ren, Peng Li, Jie Zhou, Xu Sun
Knowledge distillation (KD) has been proven effective for compressing large-scale pre-trained language models.
1 code implementation • EMNLP 2021 • Shuhuai Ren, Jinchao Zhang, Lei Li, Xu Sun, Jie Zhou
Data augmentation aims to enrich training samples for alleviating the overfitting issue in low-resource or class-imbalanced situations.
1 code implementation • ACL 2021 • Shuhuai Ren, Junyang Lin, Guangxiang Zhao, Rui Men, An Yang, Jingren Zhou, Xu Sun, Hongxia Yang
To bridge the semantic gap between the two modalities, previous studies mainly focus on word-region alignment at the object level, lacking the matching between the linguistic relation among the words and the visual relation among the regions.
1 code implementation • Findings (EMNLP) 2021 • Lei Li, Yankai Lin, Deli Chen, Shuhuai Ren, Peng Li, Jie Zhou, Xu Sun
On the other hand, the exiting decisions made by internal classifiers are unreliable, leading to wrongly emitted early predictions.
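The early-exit mechanism whose reliability is questioned above can be sketched minimally. The fixed confidence `threshold` rule and the precomputed `layer_outputs` are simplifying assumptions (a real model computes each layer on demand, and this paper argues such exit decisions can be unreliable):

```python
def early_exit_predict(layer_outputs, classifiers, threshold=0.9):
    """Run internal classifiers layer by layer; exit as soon as one is
    confident enough, instead of always using the final layer.
    Returns (predicted class, depth at which we exited)."""
    for depth, (h, clf) in enumerate(zip(layer_outputs, classifiers), 1):
        probs = clf(h)
        if max(probs) >= threshold:  # internal classifier is confident
            return probs.index(max(probs)), depth
    # No classifier was confident: fall back to the last prediction.
    return probs.index(max(probs)), depth
```

Inference cost then scales with the exit depth rather than the full network depth, which is the appeal of the approach, and why unreliable exit decisions matter.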
no code implementations • 7 Nov 2019 • Zhihan Zhang, Zhiyi Yin, Shuhuai Ren, Xinhang Li, Shicheng Li
In this paper, we aim to collect diversified information from video and text for informative comment generation.
1 code implementation • ACL 2019 • Shuhuai Ren, Yihe Deng, Kun He, Wanxiang Che
Experiments on three popular datasets using convolutional as well as LSTM models show that PWWS reduces classification accuracy to the greatest extent while keeping a very low word substitution rate.
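A greedy word-substitution loop of the kind PWWS builds on can be sketched as follows. The toy `score_fn` and synonym table are hypothetical; the real PWWS algorithm additionally weights substitutions by word saliency:

```python
def substitution_attack(words, synonyms, score_fn):
    """Greedily replace each word with a synonym whenever the
    substitution lowers the classifier's score for the true class
    (a toy word-substitution attack; PWWS further orders and weights
    candidate substitutions by word saliency)."""
    words = list(words)
    for i, w in enumerate(words):
        best, best_score = w, score_fn(words)
        for s in synonyms.get(w, []):
            trial = words[:i] + [s] + words[i + 1:]
            sc = score_fn(trial)
            if sc < best_score:  # substitution hurts the true class
                best, best_score = s, sc
        words[i] = best
    return words
```

Because each replacement is a synonym, the adversarial text stays semantically close to the original while the classifier's confidence drops.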