1 code implementation • 9 Dec 2024 • Jieyu Zhang, Le Xue, Linxin Song, Jun Wang, Weikai Huang, Manli Shu, An Yan, Zixian Ma, Juan Carlos Niebles, Silvio Savarese, Caiming Xiong, Zeyuan Chen, Ranjay Krishna, ran Xu
Our multi-image instruction data leads to an 8% improvement on Mantis-Eval.
Ranked #105 on Visual Question Answering on MM-Vet
1 code implementation • 7 Dec 2024 • Zixian Ma, JianGuo Zhang, Zhiwei Liu, Jieyu Zhang, Juntao Tan, Manli Shu, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Caiming Xiong, Ranjay Krishna, Silvio Savarese
While open-source multi-modal language models perform well on simple question answering tasks, they often fail on complex questions that require multiple capabilities, such as fine-grained recognition, visual grounding, and reasoning, and that demand multi-step solutions.
Ranked #61 on Visual Question Answering on MM-Vet
no code implementations • 12 Nov 2024 • Anas Awadalla, Le Xue, Manli Shu, An Yan, Jun Wang, Senthil Purushwalkam, Sheng Shen, Hannah Lee, Oscar Lo, Jae Sung Park, Etash Guha, Silvio Savarese, Ludwig Schmidt, Yejin Choi, Caiming Xiong, ran Xu
We introduce BLIP3-KALE, a dataset of 218 million image-text pairs that bridges the gap between descriptive synthetic captions and factual web-scale alt-text.
no code implementations • 21 Oct 2024 • Michael S. Ryoo, Honglu Zhou, Shrikant Kendre, Can Qin, Le Xue, Manli Shu, Silvio Savarese, ran Xu, Caiming Xiong, Juan Carlos Niebles
We present xGen-MM-Vid (BLIP-3-Video): a multimodal language model for videos, particularly designed to efficiently capture temporal information over multiple frames.
no code implementations • 22 Aug 2024 • Can Qin, Congying Xia, Krithika Ramakrishnan, Michael Ryoo, Lifu Tu, Yihao Feng, Manli Shu, Honglu Zhou, Anas Awadalla, Jun Wang, Senthil Purushwalkam, Le Xue, Yingbo Zhou, Huan Wang, Silvio Savarese, Juan Carlos Niebles, Zeyuan Chen, ran Xu, Caiming Xiong
We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions.
1 code implementation • 16 Aug 2024 • Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, Shrikant Kendre, Jieyu Zhang, Can Qin, Shu Zhang, Chia-Chih Chen, Ning Yu, Juntao Tan, Tulika Manoj Awalgaonkar, Shelby Heinecke, Huan Wang, Yejin Choi, Ludwig Schmidt, Zeyuan Chen, Silvio Savarese, Juan Carlos Niebles, Caiming Xiong, ran Xu
The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs.
1 code implementation • 17 Jun 2024 • Anas Awadalla, Le Xue, Oscar Lo, Manli Shu, Hannah Lee, Etash Kumar Guha, Matt Jordan, Sheng Shen, Mohamed Awadalla, Silvio Savarese, Caiming Xiong, ran Xu, Yejin Choi, Ludwig Schmidt
Multimodal interleaved datasets featuring free-form interleaved sequences of images and text are crucial for training frontier large multimodal models (LMMs).
1 code implementation • 21 Feb 2024 • Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, Tom Goldstein
It has recently been shown that adversarial attacks on large language models (LLMs) can "jailbreak" the model into making harmful statements.
1 code implementation • 5 Feb 2024 • Yuancheng Xu, Jiarui Yao, Manli Shu, Yanchao Sun, Zichu Wu, Ning Yu, Tom Goldstein, Furong Huang
Vision-Language Models (VLMs) excel in generating textual responses from visual inputs, but their versatility raises security concerns.
2 code implementations • NeurIPS 2023 • Micah Goldblum, Hossein Souri, Renkun Ni, Manli Shu, Viraj Prabhu, Gowthami Somepalli, Prithvijit Chattopadhyay, Mark Ibrahim, Adrien Bardes, Judy Hoffman, Rama Chellappa, Andrew Gordon Wilson, Tom Goldstein
Battle of the Backbones (BoB) makes this choice easier by benchmarking a diverse suite of pretrained models, including vision-language models, those trained via self-supervised learning, and the Stable Diffusion backbone, across a diverse set of computer vision tasks ranging from classification to object detection to OOD generalization and more.
1 code implementation • NeurIPS 2023 • Manli Shu, Jiongxiao Wang, Chen Zhu, Jonas Geiping, Chaowei Xiao, Tom Goldstein
In this work, we investigate how an adversary can exploit instruction tuning by injecting specific instruction-following examples into the training data that intentionally changes the model's behavior.
1 code implementation • 23 Jun 2023 • Neel Jain, Khalid Saifullah, Yuxin Wen, John Kirchenbauer, Manli Shu, Aniruddha Saha, Micah Goldblum, Jonas Geiping, Tom Goldstein
With the rise of Large Language Models (LLMs) and their ubiquitous deployment in diverse domains, measuring language model behavior on realistic data is imperative.
1 code implementation • 7 Jun 2023 • John Kirchenbauer, Jonas Geiping, Yuxin Wen, Manli Shu, Khalid Saifullah, Kezhi Kong, Kasun Fernando, Aniruddha Saha, Micah Goldblum, Tom Goldstein
We also consider a range of new detection schemes that are sensitive to short spans of watermarked text embedded inside a large document, and we compare the robustness of watermarking to other kinds of detectors.
no code implementations • 6 Jan 2023 • Manli Shu, Le Xue, Ning Yu, Roberto Martín-Martín, Caiming Xiong, Tom Goldstein, Juan Carlos Niebles, ran Xu
By plugging our proposed modules into the state-of-the-art transformer-based 3D detectors, we improve the previous best results on both benchmarks, with more significant improvements on smaller objects.
1 code implementation • 13 Dec 2022 • Amin Ghiasi, Hamid Kazemi, Eitan Borgnia, Steven Reich, Manli Shu, Micah Goldblum, Andrew Gordon Wilson, Tom Goldstein
In addition, we show that ViTs maintain spatial information in all layers except the final layer.
1 code implementation • 15 Sep 2022 • Manli Shu, Weili Nie, De-An Huang, Zhiding Yu, Tom Goldstein, Anima Anandkumar, Chaowei Xiao
In evaluating cross-dataset generalization with unseen categories, TPT performs on par with the state-of-the-art approaches that use additional training data.
no code implementations • NeurIPS 2021 • Yu Shen, Laura Zheng, Manli Shu, Weizi Li, Tom Goldstein, Ming Lin
We introduce a simple yet effective framework for improving the robustness of learning algorithms against image corruptions for autonomous driving.
no code implementations • ICLR 2022 • Renkun Ni, Manli Shu, Hossein Souri, Micah Goldblum, Tom Goldstein
Contrastive learning has recently taken off as a paradigm for learning from unlabeled data.
1 code implementation • 3 Aug 2021 • Roman Levin, Manli Shu, Eitan Borgnia, Furong Huang, Micah Goldblum, Tom Goldstein
We find that samples which cause similar parameters to malfunction are semantically similar.
no code implementations • 26 Feb 2021 • Yu Shen, Laura Zheng, Manli Shu, Weizi Li, Tom Goldstein, Ming C. Lin
For safety of autonomous driving, vehicles need to be able to drive under various lighting, weather, and visibility conditions in different environments.
no code implementations • 1 Jan 2021 • Yu Shen, Laura Yu Zheng, Manli Shu, Weizi Li, Tom Goldstein, Ming Lin
To ensure the wide adoption and safety of autonomous driving, the vehicles need to be able to drive under various lighting, weather, and visibility conditions in different environments.
no code implementations • 14 Oct 2020 • Chen Zhu, Zheng Xu, Ali Shafahi, Manli Shu, Amin Ghiasi, Tom Goldstein
Further, we demonstrate that the compact structure and corresponding initialization from the Lottery Ticket Hypothesis can also help in data-free training.
no code implementations • 28 Sep 2020 • Manli Shu, Zuxuan Wu, Micah Goldblum, Tom Goldstein
Adversarial training is the industry standard for producing models that are robust to small adversarial perturbations.
1 code implementation • NeurIPS 2021 • Manli Shu, Zuxuan Wu, Micah Goldblum, Tom Goldstein
We adapt adversarial training by directly perturbing feature statistics, rather than image pixels, to produce models that are robust to various unseen distributional shifts.
no code implementations • 20 Apr 2020 • Ahmed Abdelkader, Michael J. Curry, Liam Fowl, Tom Goldstein, Avi Schwarzschild, Manli Shu, Christoph Studer, Chen Zhu
We first demonstrate successful transfer attacks against a victim network using \textit{only} its feature extractor.