Search Results for author: Mingzhen Sun

Found 5 papers, 4 papers with code

VL-Mamba: Exploring State Space Models for Multimodal Learning

no code implementations • 20 Mar 2024 • Yanyuan Qiao, Zheng Yu, Longteng Guo, Sihan Chen, Zijia Zhao, Mingzhen Sun, Qi Wu, Jing Liu

Extensive experiments on diverse multimodal benchmarks show that the proposed VL-Mamba achieves competitive performance and demonstrate the great potential of applying state space models to multimodal learning tasks.

Language Modelling · Large Language Model · +1
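
The VL-Mamba snippet above only states that state space models are applied to multimodal learning. As a rough illustration of the core building block, here is a minimal sketch of a discretized linear state-space recurrence, the kind of scan that Mamba-style layers build on; all names, shapes, and parameter values are illustrative assumptions, not the VL-Mamba implementation.

```python
import numpy as np

def ssm_scan(u, A, B, C):
    """Minimal discrete linear state-space recurrence:
        x[t] = A @ x[t-1] + B @ u[t]
        y[t] = C @ x[t]
    u: (T, d_in) input sequence (e.g. projected multimodal tokens).
    Returns y: (T, d_out).
    """
    T, _ = u.shape
    d_state = A.shape[0]
    x = np.zeros(d_state)
    ys = []
    for t in range(T):
        x = A @ x + B @ u[t]      # update hidden state from the previous state and current input
        ys.append(C @ x)          # read out the output for this step
    return np.stack(ys)

# Toy usage: 16 "tokens" with 8 features, 4-dim hidden state, 8-dim output.
rng = np.random.default_rng(0)
u = rng.normal(size=(16, 8))
A = 0.9 * np.eye(4)               # stable state transition
B = rng.normal(size=(4, 8)) * 0.1
C = rng.normal(size=(8, 4)) * 0.1
y = ssm_scan(u, A, B, C)
print(y.shape)                    # (16, 8)
```

A real selective SSM makes A, B, and C input-dependent and evaluates the scan in parallel on hardware; the explicit loop above only shows the recurrence itself.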

GLOBER: Coherent Non-autoregressive Video Generation via GLOBal Guided Video DecodER

1 code implementation • NeurIPS 2023 • Mingzhen Sun, Weining Wang, Zihan Qin, Jiahui Sun, Sihan Chen, Jing Liu

Specifically, we propose a video auto-encoder, where a video encoder encodes videos into global features, and a video decoder, built on a diffusion model, decodes the global features and synthesizes video frames in a non-autoregressive manner.

Video Generation
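
The GLOBER snippet above describes an auto-encoder whose decoder synthesizes all frames non-autoregressively from a global video feature. The sketch below illustrates that two-stage shape in PyTorch under heavy simplification: the module names, dimensions, and the plain MSE reconstruction are assumptions standing in for details the snippet does not give, and the actual GLOBER decoder is a diffusion model rather than a single feed-forward pass.

```python
import torch
import torch.nn as nn

class VideoAutoEncoder(nn.Module):
    """Toy encoder/decoder: clip -> global feature -> all frames at once."""
    def __init__(self, frame_dim=256, global_dim=128, num_frames=16):
        super().__init__()
        self.num_frames = num_frames
        # Encoder: pool per-frame features into one global video feature.
        self.encode_frame = nn.Linear(frame_dim, global_dim)
        # Decoder: global feature + frame-index embedding -> frame feature.
        self.frame_embed = nn.Embedding(num_frames, global_dim)
        self.decode = nn.Sequential(
            nn.Linear(global_dim, 512), nn.GELU(), nn.Linear(512, frame_dim)
        )

    def forward(self, frames):                             # frames: (B, T, frame_dim)
        g = self.encode_frame(frames).mean(dim=1)          # global feature (B, global_dim)
        idx = torch.arange(self.num_frames, device=frames.device)
        cond = g.unsqueeze(1) + self.frame_embed(idx)      # (B, T, global_dim)
        return self.decode(cond)                           # all frames decoded in parallel

model = VideoAutoEncoder()
clip = torch.randn(2, 16, 256)
recon = model(clip)
loss = nn.functional.mse_loss(recon, clip)   # stand-in for the diffusion objective
print(recon.shape, loss.item())
```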

VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset

1 code implementation • NeurIPS 2023 • Sihan Chen, Handong Li, Qunbo Wang, Zijia Zhao, Mingzhen Sun, Xinxin Zhu, Jing Liu

Based on the proposed VAST-27M dataset, we train an omni-modality video-text foundational model named VAST, which can perceive and process vision, audio, and subtitle modalities from video, and better support various tasks including vision-text, audio-text, and multi-modal video-text tasks (retrieval, captioning and QA).

Ranked #1 on Image Captioning on COCO Captions (SPICE metric, using extra training data)

Audio captioning · Audio-Visual Captioning · +14
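
The VAST snippet above describes a model that perceives vision, audio, and subtitle modalities from video and supports tasks such as retrieval. As a hedged illustration of how omni-modality features can be matched to text for retrieval, the sketch below fuses three modality projections and scores them against caption embeddings with an InfoNCE-style loss; every name, dimension, and the simple averaging fusion are assumptions, not the VAST architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OmniVideoTextMatcher(nn.Module):
    """Toy cross-modal matcher: fuse vision/audio/subtitle features and
    score them against text features with a contrastive loss."""
    def __init__(self, dim=256, joint_dim=128):
        super().__init__()
        self.vision_proj = nn.Linear(dim, joint_dim)
        self.audio_proj = nn.Linear(dim, joint_dim)
        self.subtitle_proj = nn.Linear(dim, joint_dim)
        self.text_proj = nn.Linear(dim, joint_dim)

    def forward(self, vision, audio, subtitle, text):
        # Fuse the three video-side modalities by averaging their projections.
        video = (self.vision_proj(vision) + self.audio_proj(audio)
                 + self.subtitle_proj(subtitle)) / 3
        video = F.normalize(video, dim=-1)
        caption = F.normalize(self.text_proj(text), dim=-1)
        logits = video @ caption.t() / 0.07               # clip-to-caption similarities
        targets = torch.arange(len(video), device=video.device)
        return F.cross_entropy(logits, targets)           # match i-th clip to i-th caption

model = OmniVideoTextMatcher()
b, d = 4, 256
loss = model(torch.randn(b, d), torch.randn(b, d), torch.randn(b, d), torch.randn(b, d))
print(loss.item())
```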

MOSO: Decomposing MOtion, Scene and Object for Video Prediction

2 code implementations • CVPR 2023 • Mingzhen Sun, Weining Wang, Xinxin Zhu, Jing Liu

Experimental results demonstrate that our method achieves new state-of-the-art performance on five challenging benchmarks for video prediction and unconditional video generation: BAIR, RoboNet, KTH, KITTI and UCF101.

Object · Unconditional Video Generation · +2

OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation

2 code implementations • 1 Jul 2021 • Jing Liu, Xinxin Zhu, Fei Liu, Longteng Guo, Zijia Zhao, Mingzhen Sun, Weining Wang, Hanqing Lu, Shiyu Zhou, Jiajun Zhang, Jinqiao Wang

In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation, by jointly modeling visual, text and audio resources.

Audio to Text Retrieval · Cross-Modal Retrieval · +3
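
The OPT snippet above mentions cross-modal generation from jointly modeled visual, text, and audio inputs. The sketch below shows one generic way such a generation head can be wired: a transformer decoder attends over concatenated visual and audio features while predicting text tokens. It is an assumption-laden toy, not the OPT model; the 256-dimensional input features, the vocabulary size, and the module layout are all invented for illustration.

```python
import torch
import torch.nn as nn

class ToyCrossModalCaptioner(nn.Module):
    """Toy generation head: decode text tokens conditioned on a fused
    visual+audio context (illustrative only, not the OPT architecture)."""
    def __init__(self, vocab=1000, dim=128):
        super().__init__()
        self.token_embed = nn.Embedding(vocab, dim)
        self.visual_proj = nn.Linear(256, dim)
        self.audio_proj = nn.Linear(256, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab)

    def forward(self, tokens, visual, audio):
        # Memory = concatenated projected visual and audio features.
        memory = torch.cat([self.visual_proj(visual), self.audio_proj(audio)], dim=1)
        hidden = self.decoder(self.token_embed(tokens), memory)
        return self.lm_head(hidden)                       # next-token logits over the vocab

model = ToyCrossModalCaptioner()
logits = model(torch.randint(0, 1000, (2, 12)),           # text tokens
               torch.randn(2, 10, 256),                   # visual features
               torch.randn(2, 6, 256))                    # audio features
print(logits.shape)                                       # (2, 12, 1000)
```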
