1 code implementation • 10 Oct 2024 • Jinbin Bai, Tian Ye, Wei Chow, Enxin Song, Qing-Guo Chen, Xiangtai Li, Zhen Dong, Lei Zhu, Shuicheng Yan
We present Meissonic, which elevates non-autoregressive masked image modeling (MIM) text-to-image to a level comparable with state-of-the-art diffusion models like SDXL.
no code implementations • 4 Oct 2024 • Wenhao Chai, Enxin Song, Yilun Du, Chenlin Meng, Vashisht Madhavan, Omer Bar-Tal, Jeng-Neng Hwang, Saining Xie, Christopher D. Manning
AuroraCap shows superior performance on various video and image captioning benchmarks, for example, obtaining a CIDEr of 88. 9 on Flickr30k, beating GPT-4V (55. 3) and Gemini-1. 5 Pro (82. 2).
1 code implementation • 26 Apr 2024 • Enxin Song, Wenhao Chai, Tian Ye, Jenq-Neng Hwang, Xi Li, Gaoang Wang
Recently, integrating video foundation models and large language models to build a video understanding system can overcome the limitations of specific pre-defined vision tasks.
Ranked #4 on Question Answering on NExT-QA (Open-ended VideoQA)
no code implementations • 24 Sep 2023 • Yichen Xu, Zihan Xu, Wenhao Chai, Zhonghan Zhao, Enxin Song, Gaoang Wang
In order to appropriately filter multi-modality data sets on a web-scale, it becomes crucial to employ suitable filtering methods to boost performance and reduce training costs.
1 code implementation • CVPR 2024 • Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, Yan Lu, Jenq-Neng Hwang, Gaoang Wang
Recently, integrating video foundation models and large language models to build a video understanding system can overcome the limitations of specific pre-defined vision tasks.
Video-based Generative Performance Benchmarking (Consistency) Video-based Generative Performance Benchmarking (Contextual Understanding) +10