1 code implementation • 4 Sep 2024 • Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, Graham Neubig
This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark.
1 code implementation • 24 Jun 2024 • Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Austin Wang, Rob Fergus, Yann LeCun, Saining Xie
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
no code implementations • 16 May 2024 • Yuexiang Zhai, Hao Bai, Zipeng Lin, Jiayi Pan, Shengbang Tong, Yifei Zhou, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, Sergey Levine
Finally, our framework uses these task rewards to fine-tune the entire VLM with RL.
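For intuition, here is a minimal policy-gradient sketch of that final step, with a toy linear head standing in for the VLM and a hypothetical scalar task reward; this illustrates the general recipe, not the paper's exact algorithm.

```python
import torch
import torch.nn as nn

# Minimal REINFORCE-style sketch (not the paper's exact algorithm): a toy
# policy head stands in for the VLM, and a scalar task reward weights the
# log-probability of the sampled action. All shapes and rewards are
# illustrative assumptions.
torch.manual_seed(0)
policy = nn.Linear(512, 8)                # stand-in for a VLM action head
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)

state = torch.randn(512)                  # stand-in for fused image+text features
dist = torch.distributions.Categorical(logits=policy(state))
action = dist.sample()

reward = 1.0 if action.item() == 3 else 0.0   # hypothetical task reward
loss = -dist.log_prob(action) * reward        # policy-gradient objective
optimizer.zero_grad()
loss.backward()
optimizer.step()
```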
no code implementations • 16 Mar 2024 • Hongxiang Zhao, Xili Dai, Jianan Wang, Shengbang Tong, Jingyuan Zhang, Weida Wang, Lei Zhang, Yi Ma
This consequently limits the performance of downstream tasks, such as image-to-multiview generation and 3D reconstruction.
1 code implementation • CVPR 2024 • Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, Saining Xie
To understand the roots of these errors, we explore the gap between the visual embedding space of CLIP and vision-only self-supervised learning.
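A minimal sketch of that probe: flag image pairs that are near-duplicates in CLIP's embedding space but far apart for a vision-only SSL encoder (e.g. DINOv2). Random vectors stand in for real embeddings, and the similarity thresholds are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Find candidate "CLIP-blind" pairs: high CLIP similarity, low similarity
# under a vision-only self-supervised encoder. Features here are random
# stand-ins; in practice they come from the two pre-trained models.
torch.manual_seed(0)
clip_feats = F.normalize(torch.randn(100, 512), dim=-1)
ssl_feats = F.normalize(torch.randn(100, 768), dim=-1)

clip_sim = clip_feats @ clip_feats.T
ssl_sim = ssl_feats @ ssl_feats.T

i, j = torch.triu_indices(100, 100, offset=1)      # all unordered pairs
blind = (clip_sim[i, j] > 0.95) & (ssl_sim[i, j] < 0.6)
print(f"candidate CLIP-blind pairs: {blind.sum().item()}")
```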
1 code implementation • 22 Nov 2023 • Yaodong Yu, Sam Buchanan, Druv Pai, Tianzhe Chu, Ziyang Wu, Shengbang Tong, Hao Bai, Yuexiang Zhai, Benjamin D. Haeffele, Yi Ma
This leads to a family of white-box transformer-like deep network architectures, named CRATE, which are mathematically fully interpretable.
no code implementations • 19 Sep 2023 • Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, Yi Ma
However, catastrophic forgetting, the notorious phenomenon in which a fine-tuned model fails to retain performance comparable to the pre-trained model, remains an inherent problem in multimodal LLMs (MLLMs).
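As a rough illustration, forgetting is commonly quantified as the accuracy drop on the original task after fine-tuning; the `evaluate` stub and the numbers below are hypothetical, not the paper's results.

```python
# Sketch of a common forgetting metric: evaluate the same held-out task
# before and after fine-tuning and report the accuracy drop.
def evaluate(model_name: str) -> float:
    scores = {"pretrained": 0.71, "finetuned": 0.54}  # hypothetical accuracies
    return scores[model_name]

forgetting = evaluate("pretrained") - evaluate("finetuned")
print(f"accuracy drop on the original task: {forgetting:.2f}")
```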
1 code implementation • 30 Aug 2023 • Yaodong Yu, Tianzhe Chu, Shengbang Tong, Ziyang Wu, Druv Pai, Sam Buchanan, Yi Ma
Transformer-like models for vision tasks have recently proven effective for a wide range of downstream applications such as segmentation and detection.
1 code implementation • NeurIPS 2023 • Shengbang Tong, Erik Jones, Jacob Steinhardt
Because CLIP is the backbone for most state-of-the-art multimodal systems, these inputs produce failures in Midjourney 5.1, DALL-E, VideoFusion, and others.
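A minimal sketch of the failure mode this relies on, as I read it: two captions with different meanings can map to nearly identical CLIP text embeddings, so any system built on CLIP confuses them. Random vectors stand in for real CLIP embeddings below.

```python
import torch
import torch.nn.functional as F

# Two "captions" whose embeddings nearly coincide despite different meanings.
# In practice emb_a and emb_b would come from a CLIP text encoder; here one
# is a small perturbation of the other, purely for illustration.
torch.manual_seed(0)
emb_a = F.normalize(torch.randn(512), dim=-1)      # e.g. "a dog chasing a person"
emb_b = F.normalize(emb_a + 0.01 * torch.randn(512), dim=-1)  # different caption

similarity = (emb_a @ emb_b).item()
print(f"cosine similarity of the two captions: {similarity:.3f}")  # near 1.0
```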
1 code implementation • 8 Jun 2023 • Tianzhe Chu, Shengbang Tong, Tianjiao Ding, Xili Dai, Benjamin David Haeffele, René Vidal, Yi Ma
In this paper, we propose a novel image clustering pipeline that leverages the powerful feature representations of large pre-trained models such as CLIP to cluster images effectively and efficiently at scale.
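A minimal sketch of the pipeline's backbone: embed images with a large pre-trained model, then cluster in feature space. Random vectors stand in for CLIP features, and the paper adds further refinement on top of this k-means baseline.

```python
import numpy as np
from sklearn.cluster import KMeans

# Cluster pre-trained features at scale. The features here are random
# stand-ins for CLIP image embeddings; cluster count is an assumption.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 512))
features /= np.linalg.norm(features, axis=1, keepdims=True)  # unit-normalize

labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(features)
print(np.bincount(labels))   # cluster sizes
```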
1 code implementation • NeurIPS 2023 • Yaodong Yu, Sam Buchanan, Druv Pai, Tianzhe Chu, Ziyang Wu, Shengbang Tong, Benjamin D. Haeffele, Yi Ma
Particularly, we show that the standard transformer block can be derived from alternating optimization on complementary parts of this objective: the multi-head self-attention operator can be viewed as a gradient descent step to compress the token sets by minimizing their lossy coding rate, and the subsequent multi-layer perceptron can be viewed as attempting to sparsify the representation of the tokens.
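A minimal sketch of the sparsification half of that claim: one ISTA (proximal gradient) step toward a sparse code, which the paper argues the MLP block approximates. Dimensions, step size, regularization weight, and the dictionary below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One ISTA step on tokens Z: a gradient step toward reconstructing Z with a
# learnable dictionary D, followed by a ReLU acting as the (non-negative)
# soft-thresholding operator.
torch.manual_seed(0)
d, eta, lam = 64, 0.1, 0.1
D = nn.Parameter(torch.randn(d, d) / d**0.5)   # learnable dictionary

def ista_step(z: torch.Tensor) -> torch.Tensor:
    grad = D.T @ (z - D @ z)                   # gradient of the reconstruction term
    return F.relu(z + eta * grad - eta * lam)  # proximal (thresholding) step

tokens = torch.randn(d, 16)                    # d-dim features, 16 tokens
print(ista_step(tokens).shape)
```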
3 code implementations • 8 Apr 2023 • Shengbang Tong, Yubei Chen, Yi Ma, Yann LeCun
Recently, self-supervised learning (SSL) has achieved tremendous success in learning image representations.
no code implementations • 18 Feb 2023 • Xili Dai, Ke Chen, Shengbang Tong, Jingyuan Zhang, Xingjian Gao, Mingyang Li, Druv Pai, Yuexiang Zhai, Xiaojun Yuan, Heung-Yeung Shum, Lionel M. Ni, Yi Ma
Our method is arguably the first to demonstrate that a concatenation of multiple convolutional sparse coding/decoding layers leads to an interpretable and effective autoencoder for modeling the distribution of large-scale natural image datasets.
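A minimal sketch of one convolutional sparse coding layer: infer a sparse feature map by unrolling a few ISTA iterations with a convolutional dictionary, then decode with the transposed convolution. All sizes and hyperparameters are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Convolutional sparse coding via unrolled ISTA: the same dictionary is used
# for the encoder (conv2d) and decoder (conv_transpose2d).
torch.manual_seed(0)
dictionary = nn.Parameter(torch.randn(32, 3, 3, 3) * 0.1)  # 32 atoms, 3x3
eta, lam, steps = 0.1, 0.05, 5

def encode(x: torch.Tensor) -> torch.Tensor:
    z = torch.zeros(x.shape[0], 32, x.shape[2], x.shape[3])
    for _ in range(steps):
        residual = x - F.conv_transpose2d(z, dictionary, padding=1)
        z = F.relu(z + eta * F.conv2d(residual, dictionary, padding=1) - eta * lam)
    return z

x = torch.randn(2, 3, 32, 32)
z = encode(x)                                          # sparse feature maps
x_hat = F.conv_transpose2d(z, dictionary, padding=1)   # reconstruction
print(z.shape, x_hat.shape)
```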
no code implementations • ICCV 2023 • Tianjiao Ding, Shengbang Tong, Kwan Ho Ryan Chan, Xili Dai, Yi Ma, Benjamin D. Haeffele
We consider the problem of simultaneously clustering and learning a linear representation of data lying close to a union of low-dimensional manifolds, a fundamental task in machine learning and computer vision.
1 code implementation • 30 Oct 2022 • Shengbang Tong, Xili Dai, Yubei Chen, Mingyang Li, Zengyi Li, Brent Yi, Yann LeCun, Yi Ma
This paper proposes an unsupervised method for learning a unified representation that serves both discriminative and generative purposes.
1 code implementation • 24 Oct 2022 • Xili Dai, Mingyang Li, Pengyuan Zhai, Shengbang Tong, Xingjian Gao, Shao-Lun Huang, Zhihui Zhu, Chong You, Yi Ma
We show that such models match the empirical performance of conventional neural networks on the CIFAR-10, CIFAR-100, and ImageNet datasets.
1 code implementation • 11 Feb 2022 • Shengbang Tong, Xili Dai, Ziyang Wu, Mingyang Li, Brent Yi, Yi Ma
Our method is simpler than existing approaches for incremental learning, and more efficient in terms of model size, storage, and computation: it requires only a single, fixed-capacity autoencoding network with a feature space that is used for both discriminative and generative purposes.
1 code implementation • 12 Nov 2021 • Xili Dai, Shengbang Tong, Mingyang Li, Ziyang Wu, Michael Psenka, Kwan Ho Ryan Chan, Pengyuan Zhai, Yaodong Yu, Xiaojun Yuan, Heung-Yeung Shum, Yi Ma
In particular, we propose to learn a closed-loop transcription between a multi-class multi-dimensional data distribution and a linear discriminative representation (LDR) in the feature space that consists of multiple independent multi-dimensional linear subspaces.
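For background, the LDR objective in this line of work is built from the lossy coding rate of the rate-reduction (MCR²) framework; a sketch in that framework's notation, assumed from the rate-reduction literature rather than quoted from this abstract:

```latex
% Lossy coding rate of features Z \in \mathbb{R}^{d \times n} at precision
% \epsilon, and the rate reduction over k classes with membership matrices \Pi_j:
R(Z, \epsilon) = \frac{1}{2} \log\det\!\Big( I + \frac{d}{n\epsilon^2} Z Z^{\top} \Big),
\qquad
\Delta R = R(Z, \epsilon)
  - \sum_{j=1}^{k} \frac{\operatorname{tr}(\Pi_j)}{2n}
    \log\det\!\Big( I + \frac{d}{\operatorname{tr}(\Pi_j)\,\epsilon^2}
      Z \Pi_j Z^{\top} \Big).
```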