Search Results for author: Yuqi Huo

Found 18 papers, 12 papers with code

Virgo: A Preliminary Exploration on Reproducing o1-like MLLM

1 code implementation · 3 Jan 2025 · Yifan Du, Zikang Liu, YiFan Li, Wayne Xin Zhao, Yuqi Huo, Bingning Wang, WeiPeng Chen, Zheng Liu, Zhongyuan Wang, Ji-Rong Wen

Moreover, it seems that such textual reasoning data can be even more effective than visual reasoning data in eliciting the slow-thinking capacities of MLLMs.

Language Modeling · Language Modelling · +1

Beyond Filtering: Adaptive Image-Text Quality Enhancement for MLLM Pretraining

1 code implementation · 21 Oct 2024 · Han Huang, Yuqi Huo, Zijia Zhao, Haoyu Lu, Shu Wu, Bingning Wang, Qiang Liu, WeiPeng Chen, Liang Wang

A critical factor in training MLLMs is the quality of image-text pairs within multimodal pretraining datasets.

Exploring the Design Space of Visual Context Representation in Video MLLMs

1 code implementation · 17 Oct 2024 · Yifan Du, Yuqi Huo, Kun Zhou, Zijia Zhao, Haoyu Lu, Han Huang, Wayne Xin Zhao, Bingning Wang, WeiPeng Chen, Ji-Rong Wen

Then, we explore the scaling effects in frame selection and token selection respectively, and fit the corresponding function curve by conducting extensive empirical experiments.
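The snippet does not state the functional family used for the fitted curves; purely as an illustration with made-up numbers, fitting a saturating log-linear scaling curve to (frame count, score) points could look like:

```python
import numpy as np

# Hypothetical (illustrative, not from the paper) data points:
# benchmark score as a function of the number of sampled frames.
frames = np.array([1, 2, 4, 8, 16, 32], dtype=float)
score = np.array([41.0, 47.5, 52.0, 55.0, 57.0, 58.2])

# Fit score ~= a * log(frames) + b, a simple saturating form.
a, b = np.polyfit(np.log(frames), score, deg=1)

# Extrapolate the fitted curve to a larger frame budget.
predicted_64 = a * np.log(64.0) + b
```

The same procedure applies to token selection: collect empirical (budget, score) pairs, fit the curve, then use it to pick a budget with a good cost/quality trade-off.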

Language Modeling · Language Modelling

Baichuan-Omni Technical Report

2 code implementations · 11 Oct 2024 · Yadong Li, Haoze Sun, MingAn Lin, Tianpeng Li, Guosheng Dong, Bowen Ding, Wei Song, Zhenglin Cheng, Yuqi Huo, Song Chen, Xu Li, Da Pan, Shusen Zhang, Xin Wu, Zheng Liang, Jun Liu, Tao Zhang, Keer Lu, Yaqi Zhao, Yanjun Shen, Fan Yang, Kaicheng Yu, Tao Lin, Jianhua Xu, Zenan Zhou, WeiPeng Chen

The salient multimodal capabilities and interactive experience of GPT-4o highlight its critical role in practical applications, yet it lacks a high-performing open-source counterpart.

Language Modeling · Language Modelling · +3

Towards Event-oriented Long Video Understanding

1 code implementation · 20 Jun 2024 · Yifan Du, Kun Zhou, Yuqi Huo, YiFan Li, Wayne Xin Zhao, Haoyu Lu, Zijia Zhao, Bingning Wang, WeiPeng Chen, Ji-Rong Wen

Leveraging an effective instruction synthesis method and an adaptive model architecture, VIM surpasses both state-of-the-art open-source models and GPT-4V on the Event-Bench.

Video Understanding

Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs

1 code implementation · 13 Jun 2024 · Zijia Zhao, Haoyu Lu, Yuqi Huo, Yifan Du, Tongtian Yue, Longteng Guo, Bingning Wang, WeiPeng Chen, Jing Liu

In this paper, we propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework through synthetic video generation.
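As a toy sketch (not the actual VideoNIAH pipeline, whose details are not in this snippet), the "needle in a haystack" construction can be thought of as splicing a query-bearing frame into a distractor video and keeping its position as ground truth:

```python
import random

def build_niah_sample(haystack_frames, needle_frame, rng):
    """Illustrative sketch: insert one 'needle' frame carrying the query
    information into a distractor 'haystack' video at a random position,
    and record that position as the ground-truth answer."""
    pos = rng.randrange(len(haystack_frames) + 1)
    video = haystack_frames[:pos] + [needle_frame] + haystack_frames[pos:]
    return video, pos

haystack = [f"frame_{i}" for i in range(10)]
video, answer = build_niah_sample(haystack, "NEEDLE", random.Random(0))
```

Because the needle content and position are known by construction, evaluation reduces to checking whether the model can recover them, with no manual annotation required.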

Benchmarking · Video Generation · +1

VDT: General-purpose Video Diffusion Transformers via Mask Modeling

1 code implementation · 22 May 2023 · Haoyu Lu, Guoxing Yang, Nanyi Fei, Yuqi Huo, Zhiwu Lu, Ping Luo, Mingyu Ding

We also propose a unified spatial-temporal mask modeling mechanism, seamlessly integrated with the model, to cater to diverse video generation scenarios.

Autonomous Driving · Video Generation · +1

UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling

2 code implementations · 13 Feb 2023 · Haoyu Lu, Yuqi Huo, Guoxing Yang, Zhiwu Lu, Wei Zhan, Masayoshi Tomizuka, Mingyu Ding

Particularly, on the MSRVTT retrieval task, UniAdapter achieves 49.7% recall@1 with 2.2% model parameters, outperforming the latest competitors by 2.0%.

Image-text Retrieval · Text Retrieval · +3

LGDN: Language-Guided Denoising Network for Video-Language Modeling

no code implementations · 23 Sep 2022 · Haoyu Lu, Mingyu Ding, Nanyi Fei, Yuqi Huo, Zhiwu Lu

However, this hypothesis often fails for two reasons: (1) With the rich semantics of video contents, it is difficult to cover all frames with a single video-level description; (2) A raw video typically has noisy/meaningless information (e.g., scenery shot, transition, or teaser).

Denoising · Language Modeling · +1

COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval

no code implementations · CVPR 2022 · Haoyu Lu, Nanyi Fei, Yuqi Huo, Yizhao Gao, Zhiwu Lu, Ji-Rong Wen

Under a fair comparison setting, our COTS achieves the highest performance among all two-stream methods and comparable performance (but 10,800× faster in inference) w.r.t.

Contrastive Learning · Cross-Modal Retrieval · +6

Compressed Video Contrastive Learning

no code implementations · NeurIPS 2021 · Yuqi Huo, Mingyu Ding, Haoyu Lu, Nanyi Fei, Zhiwu Lu, Ji-Rong Wen, Ping Luo

To enhance the representation ability of the motion vectors, hence the effectiveness of our method, we design a cross guidance contrastive learning algorithm based on multi-instance InfoNCE loss, where motion vectors can take supervision signals from RGB frames and vice versa.
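For intuition, here is a minimal NumPy sketch of symmetric InfoNCE supervision between motion-vector and RGB embeddings; the paper's multi-instance variant and training details are not reproduced, and the embeddings below are synthetic stand-ins:

```python
import numpy as np

def info_nce(q, k, tau=0.07):
    """Standard (single-positive) InfoNCE: query q[i] should match key k[i];
    all other keys in the batch act as negatives."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    k = k / np.linalg.norm(k, axis=1, keepdims=True)
    logits = (q @ k.T) / tau                       # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))             # positives on the diagonal

def cross_guidance_loss(mv_emb, rgb_emb, tau=0.07):
    """Symmetric cross-modal supervision: motion-vector embeddings take
    signals from RGB embeddings and vice versa."""
    return info_nce(mv_emb, rgb_emb, tau) + info_nce(rgb_emb, mv_emb, tau)

rng = np.random.default_rng(0)
mv = rng.normal(size=(8, 128))              # synthetic motion-vector embeddings
rgb = mv + 0.1 * rng.normal(size=(8, 128))  # aligned RGB embeddings plus noise
loss = cross_guidance_loss(mv, rgb)
```

The symmetric two-term loss is what lets each modality act as a teacher for the other: aligned (motion-vector, RGB) pairs are pulled together while mismatched pairs in the batch are pushed apart.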

Contrastive Learning · Representation Learning

Towards artificial general intelligence via a multimodal foundation model

1 code implementation · 27 Oct 2021 · Nanyi Fei, Zhiwu Lu, Yizhao Gao, Guoxing Yang, Yuqi Huo, Jingyuan Wen, Haoyu Lu, Ruihua Song, Xin Gao, Tao Xiang, Hao Sun, Ji-Rong Wen

To overcome this limitation and take a solid step towards artificial general intelligence (AGI), we develop a foundation model pre-trained with huge multimodal data, which can be quickly adapted for various downstream cognitive tasks.

Image Classification · Reading Comprehension · +2

Learning Versatile Neural Architectures by Propagating Network Codes

1 code implementation · ICLR 2022 · Mingyu Ding, Yuqi Huo, Haoyu Lu, Linjie Yang, Zhe Wang, Zhiwu Lu, Jingdong Wang, Ping Luo

(4) Thorough studies of NCP on inter-, cross-, and intra-tasks highlight the importance of cross-task neural architecture design, i.e., multitask neural architectures and architecture transferring between different tasks.

Image Segmentation · Neural Architecture Search · +2

Self-Supervised Video Representation Learning with Constrained Spatiotemporal Jigsaw

no code implementations · 1 Jan 2021 · Yuqi Huo, Mingyu Ding, Haoyu Lu, Zhiwu Lu, Tao Xiang, Ji-Rong Wen, Ziyuan Huang, Jianwen Jiang, Shiwei Zhang, Mingqian Tang, Songfang Huang, Ping Luo

With the constrained jigsaw puzzles, instead of solving them directly, which could still be extremely hard, we carefully design four surrogate tasks that are more solvable but meanwhile still ensure that the learned representation is sensitive to spatiotemporal continuity at both the local and global levels.

Representation Learning

Mobile Video Action Recognition

no code implementations · 27 Aug 2019 · Yuqi Huo, Xiaoli Xu, Yao Lu, Yulei Niu, Zhiwu Lu, Ji-Rong Wen

In addition to motion vectors, we also provide a temporal fusion method to explicitly induce the temporal context.

Action Recognition · Temporal Action Localization
