no code implementations • 1 Dec 2024 • Weiming Ren, Huan Yang, Jie Min, Cong Wei, Wenhu Chen
Current large multimodal models (LMMs) face significant challenges in processing and comprehending long-duration or high-resolution videos, mainly due to the lack of high-quality datasets.
1 code implementation • 26 Nov 2024 • Cong Wei, Yujie Zhong, Haoxian Tan, Yong Liu, Zheng Zhao, Jie Hu, Yujiu Yang
This paper aims to address universal segmentation for image and video perception by leveraging the strong reasoning ability of Visual Large Language Models (VLLMs).
Ranked #1 on Referring Expression Segmentation on RefCOCO+ val (using extra training data)
Large Language Model • Open Vocabulary Semantic Segmentation • +8
no code implementations • 11 Nov 2024 • Cong Wei, Zheyang Xiong, Weiming Ren, Xinrun Du, Ge Zhang, Wenhu Chen
This is due to the application of simple filtering methods like CLIP-score.
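For context, CLIP-score filtering typically means keeping only image-text pairs whose CLIP embedding similarity exceeds a threshold. A minimal sketch of that kind of filter is below; the model checkpoint and threshold are illustrative choices, not details taken from the paper.

```python
# Minimal sketch of CLIP-score filtering: keep image-text pairs whose CLIP
# cosine similarity exceeds a threshold. The checkpoint and threshold are
# illustrative, not taken from the paper.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
    # Cosine similarity between normalized embeddings.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).item()

def filter_pairs(pairs, threshold=0.25):
    # Keep only pairs scoring above the (hypothetical) threshold.
    return [(img, cap) for img, cap in pairs if clip_score(img, cap) > threshold]
```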
1 code implementation • 2 May 2024 • Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, Wenhu Chen
We further evaluate Mantis on single-image benchmarks and demonstrate that Mantis also maintains a strong single-image performance on par with CogVLM and Emu2.
1 code implementation • 12 Apr 2024 • Cong Wei, Haoxian Tan, Yujie Zhong, Yujiu Yang, Lin Ma
Recent advancements have empowered Large Language Models for Vision (vLLMs) to generate detailed perceptual outcomes, including bounding boxes and masks.
1 code implementation • 21 Mar 2024 • Max Ku, Cong Wei, Weiming Ren, Harry Yang, Wenhu Chen
AnyV2V can leverage any existing image editing tool to support an extensive array of video editing tasks, including prompt-based editing, reference-based style transfer, subject-driven editing, and identity manipulation, which were unattainable by previous methods.
1 code implementation • 6 Feb 2024 • Weiming Ren, Huan Yang, Ge Zhang, Cong Wei, Xinrun Du, Wenhao Huang, Wenhu Chen
To verify the effectiveness of our method, we propose I2V-Bench, a comprehensive evaluation benchmark for image-to-video (I2V) generation.
no code implementations • 22 Dec 2023 • Max Ku, Dongfu Jiang, Cong Wei, Xiang Yue, Wenhu Chen
In the rapidly advancing field of conditional image generation, effectively evaluating the performance and capabilities of different models remains challenging, in part because existing evaluation methods offer limited explainability.
no code implementations • 28 Nov 2023 • Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, Wenhu Chen
Existing information retrieval (IR) models often assume a homogeneous format, limiting their applicability to diverse user needs, such as searching for images with text descriptions, searching for a news article with a headline image, or finding a similar photo with a query image.
4 code implementations • CVPR 2024 • Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, Wenhu Chen
We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning.
no code implementations • 22 Jun 2023 • Tianle Li, Max Ku, Cong Wei, Wenhu Chen
In this work, we aspire to fill the void and propose two novel subject-driven sub-tasks, i.e., Subject Replacement and Subject Addition.
1 code implementation • CVPR 2023 • Cong Wei, Brendan Duke, Ruowei Jiang, Parham Aarabi, Graham W. Taylor, Florian Shkurti
Equipped with the learned unstructured attention pattern, the sparse attention ViT (Sparsifiner) produces a superior Pareto-optimal trade-off between FLOPs and top-1 accuracy on ImageNet compared to token sparsity methods.
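As a rough illustration of the underlying idea (not the paper's actual architecture), attention restricted by a learned, unstructured per-token connectivity mask can be sketched as follows; the random mask stands in for a learned mask predictor, and all dimensions are placeholders.

```python
# Rough sketch of attention restricted by an unstructured connectivity mask
# (illustrative only; not the actual Sparsifiner architecture).
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, connectivity):
    # q, k, v: (batch, tokens, dim); connectivity: (batch, tokens, tokens) in {0, 1}
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    scores = scores.masked_fill(connectivity == 0, float("-inf"))
    attn = F.softmax(scores, dim=-1)
    return attn @ v  # each token attends only to its unmasked connections

# Toy usage with a random sparse mask standing in for a learned predictor.
B, N, D = 2, 16, 64
q, k, v = (torch.randn(B, N, D) for _ in range(3))
mask = (torch.rand(B, N, N) < 0.2).float()
mask[:, torch.arange(N), torch.arange(N)] = 1.0  # always attend to self
out = masked_attention(q, k, v, mask)
print(out.shape)  # torch.Size([2, 16, 64])
```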