no code implementations • 3 Apr 2025 • Anita Rau, Mark Endo, Josiah Aklilu, Jaewoo Heo, Khaled Saab, Alberto Paderno, Jeffrey Jopling, F. Christopher Holsinger, Serena Yeung-Levy
Large Vision-Language Models offer a new paradigm for AI-driven image understanding, enabling models to perform tasks without task-specific training.
no code implementations • 26 Mar 2025 • Alejandro Lozano, Min Woo Sun, James Burgess, Jeffrey J. Nirschl, Christopher Polzak, Yuhui Zhang, Liangyu Chen, Jeffrey Gu, Ivan Lopez, Josiah Aklilu, Anita Rau, Austin Wolfgang Katzer, Collin Chiu, Orr Zohar, Xiaohan Wang, Alfred Seunghoon Song, Chiang Chia-Chun, Robert Tibshirani, Serena Yeung-Levy
Despite the excitement behind biomedical artificial intelligence (AI), access to high-quality, diverse, and large-scale data - the foundation for modern AI systems - is still a bottleneck to unlocking its full potential.
1 code implementation • 17 Mar 2025 • James Burgess, Jeffrey J Nirschl, Laura Bravo-Sánchez, Alejandro Lozano, Sanket Rajan Gupte, Jesus G. Galaz-Montoya, Yuhui Zhang, Yuchang Su, Disha Bhowmik, Zachary Coman, Sarina M. Hasan, Alexandra Johannesson, William D. Leineweber, Malvika G Nair, Ridhi Yarlagadda, Connor Zuraski, Wah Chiu, Sarah Cohen, Jan N. Hansen, Manuel D Leonetti, Chad Liu, Emma Lundberg, Serena Yeung-Levy
Scientific research demands sophisticated reasoning over multimodal data, a challenge especially prevalent in biology.
no code implementations • 10 Mar 2025 • James Burgess, Xiaohan Wang, Yuhui Zhang, Anita Rau, Alejandro Lozano, Lisa Dunlap, Trevor Darrell, Serena Yeung-Levy
How do two individuals differ when performing the same action?
no code implementations • 5 Mar 2025 • Devanish N. Kamtam, Joseph B. Shrager, Satya Deepya Malla, Xiaohan Wang, Nicole Lin, Juan J. Cardona, Serena Yeung-Levy, Clarence Hu
Conclusion: SAM 2 achieves remarkable zero-shot and fine-tuned performance for surgical scene segmentation, surpassing prior SOTA models across several organ classes of diverse datasets.
no code implementations • 2 Mar 2025 • Jeffrey Gu, Serena Yeung-Levy
Large pre-trained models, or foundation models, have shown impressive performance when adapted to a variety of downstream tasks, often outperforming specialized models.
no code implementations • 13 Feb 2025 • Yuhui Zhang, Yuchang Su, Chenyu Wang, Tianhong Li, Zoe Wefers, Jeffrey Nirschl, James Burgess, Daisy Ding, Alejandro Lozano, Emma Lundberg, Serena Yeung-Levy
Building a virtual cell capable of accurately simulating cellular behaviors in silico has long been a dream in computational biology.
no code implementations • 23 Jan 2025 • Rui Li, Xiaohan Wang, Yuhui Zhang, Zeyu Wang, Serena Yeung-Levy
Despite significant advancements in video large multimodal models (video-LMMs), achieving effective temporal grounding in long-form videos remains a challenge for existing models.
2 code implementations • 13 Jan 2025 • Alejandro Lozano, Min Woo Sun, James Burgess, Liangyu Chen, Jeffrey J Nirschl, Jeffrey Gu, Ivan Lopez, Josiah Aklilu, Austin Wolfgang Katzer, Collin Chiu, Anita Rau, Xiaohan Wang, Yuhui Zhang, Alfred Seunghoon Song, Robert Tibshirani, Serena Yeung-Levy
The development of vision-language models (VLMs) is driven by large-scale and diverse multimodal datasets.
1 code implementation • 6 Jan 2025 • Yuhui Zhang, Yuchang Su, Yiming Liu, Xiaohan Wang, James Burgess, Elaine Sui, Chenyu Wang, Josiah Aklilu, Alejandro Lozano, Anjiang Wei, Ludwig Schmidt, Serena Yeung-Levy
The rapid development of vision language models (VLMs) demands rigorous and reliable evaluation.
no code implementations • 17 Dec 2024 • Mark Endo, Xiaohan Wang, Serena Yeung-Levy
In this work, we examine the popular acceleration approach of early pruning of visual tokens inside the language model and find that its strong performance across many tasks is not due to an exceptional ability to compress visual information, but rather the benchmarks' limited ability to assess fine-grained visual capabilities.
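The acceleration approach examined here can be illustrated with a minimal sketch (not the paper's code): after an early layer of the language model, each visual token is assigned an importance score (e.g., attention mass received from text tokens), and only the top-k visual tokens are kept for the remaining layers. The function name and toy data below are illustrative assumptions.

```python
# Illustrative sketch of early visual-token pruning inside a language model:
# rank visual tokens by an importance score and retain only the top k.

def prune_visual_tokens(visual_tokens, scores, k):
    """Keep the k visual tokens with the highest importance scores.

    visual_tokens: list of feature vectors (one per visual token)
    scores:        one importance score per visual token
    k:             number of tokens to retain
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])  # preserve the original token order
    return [visual_tokens[i] for i in keep]

tokens = [[0.1], [0.2], [0.3], [0.4]]
scores = [0.05, 0.80, 0.10, 0.40]
kept = prune_visual_tokens(tokens, scores, k=2)
# tokens 1 and 3 have the highest scores -> [[0.2], [0.4]]
```

Aggressive pruning like this can preserve benchmark scores while still discarding fine-grained visual detail, which is the failure mode the paper probes.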
no code implementations • 13 Dec 2024 • Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, Serena Yeung-Levy, Xide Xia
Apollo-7B is state-of-the-art among 7B LMMs, scoring 70.9 on MLVU and 63.3 on Video-MME.
no code implementations • 18 Nov 2024 • Jaewoo Heo, George Hu, Zeyu Wang, Serena Yeung-Levy
DeforHMR leverages a novel query-agnostic deformable cross-attention mechanism within the transformer decoder to effectively regress the visual features extracted from a frozen pretrained vision transformer (ViT) encoder.
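The core idea of deformable cross-attention can be sketched as follows (a hypothetical simplification, not DeforHMR's implementation: real models use learned offset/weight projections and bilinear sampling; the names and toy feature map here are assumptions):

```python
# Minimal sketch of deformable cross-attention over a 2D feature map:
# a query samples the map at a few learned offsets around a reference
# point and blends the sampled features with attention weights.

def deformable_cross_attention(feature_map, ref_point, offsets, weights):
    """Sample feature_map at ref_point + each offset and blend.

    feature_map: 2D grid of feature vectors, indexed feature_map[y][x]
    ref_point:   (y, x) reference location for this query
    offsets:     learned (dy, dx) sampling offsets
    weights:     attention weight per sampling point (sums to 1)
    """
    h, w = len(feature_map), len(feature_map[0])
    out = [0.0] * len(feature_map[0][0])
    for (dy, dx), wgt in zip(offsets, weights):
        y = min(max(ref_point[0] + dy, 0), h - 1)  # clamp to map bounds
        x = min(max(ref_point[1] + dx, 0), w - 1)
        out = [o + wgt * f for o, f in zip(out, feature_map[y][x])]
    return out

fmap = [[[1.0], [2.0]], [[3.0], [4.0]]]
out = deformable_cross_attention(fmap, (0, 0), [(0, 1), (1, 0)], [0.5, 0.5])
# 0.5 * [2.0] + 0.5 * [3.0] -> [2.5]
```

Because each query attends only at its predicted offsets rather than over the full feature map, the decoder can regress pose parameters from frozen ViT features efficiently.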
no code implementations • 15 Nov 2024 • Jaewoo Heo, Kuan-Chieh Wang, Karen Liu, Serena Yeung-Levy
Motion capture technologies have transformed numerous fields, from the film and gaming industries to sports science and healthcare, by providing a tool to capture and analyze human movement in great detail.
no code implementations • 18 Oct 2024 • Josiah Aklilu, Xiaohan Wang, Serena Yeung-Levy
Precise action localization in untrimmed video is vital for fields such as professional sports and minimally invasive surgery, where the delineation of particular motions in recordings can dramatically enhance analysis.
no code implementations • 1 Oct 2024 • Laura Bravo-Sánchez, Jaewoo Heo, Zhenzhen Weng, Kuan-Chieh Wang, Serena Yeung-Levy
Social dynamics in close human interactions pose significant challenges for Human Mesh Estimation (HME), particularly due to the complexity of physical contacts and the scarcity of training data.
no code implementations • 18 Sep 2024 • Charlotte Bunne, Yusuf Roohani, Yanay Rosen, Ankit Gupta, Xikun Zhang, Marcel Roed, Theo Alexandrov, Mohammed AlQuraishi, Patricia Brennan, Daniel B. Burkhardt, Andrea Califano, Jonah Cool, Abby F. Dernburg, Kirsty Ewing, Emily B. Fox, Matthias Haury, Amy E. Herr, Eric Horvitz, Patrick D. Hsu, Viren Jain, Gregory R. Johnson, Thomas Kalil, David R. Kelley, Shana O. Kelley, Anna Kreshuk, Tim Mitchison, Stephani Otte, Jay Shendure, Nicholas J. Sofroniew, Fabian Theis, Christina V. Theodoris, Srigokul Upadhyayula, Marc Valer, Bo Wang, Eric Xing, Serena Yeung-Levy, Marinka Zitnik, Theofanis Karaletsos, Aviv Regev, Emma Lundberg, Jure Leskovec, Stephen R. Quake
Accurate modeling of cells is important for this understanding as well as for determining the root causes of disease.
no code implementations • 15 Aug 2024 • Zeyu Wang, Zhenzhen Weng, Serena Yeung-Levy
Humans continuously perceive and process visual signals.
1 code implementation • 8 Jul 2024 • Orr Zohar, Xiaohan Wang, Yonatan Bitton, Idan Szpektor, Serena Yeung-Levy
The performance of Large Vision Language Models (LVLMs) is dependent on the size and quality of their training datasets.
1 code implementation • 1 Jul 2024 • Alejandro Lozano, Jeffrey Nirschl, James Burgess, Sanket Rajan Gupte, Yuhui Zhang, Alyssa Unell, Serena Yeung-Levy
Recent advances in microscopy have enabled the rapid generation of terabytes of image data in cell biology and biomedical research.
1 code implementation • 28 May 2024 • Yuhui Zhang, Alyssa Unell, Xiaohan Wang, Dhruba Ghosh, Yuchang Su, Ludwig Schmidt, Serena Yeung-Levy
Image classification is one of the most fundamental capabilities of machine vision intelligence.
1 code implementation • 19 Mar 2024 • Elaine Sui, Xiaohan Wang, Serena Yeung-Levy
Advancements in vision-language models (VLMs) have propelled the field of computer vision, particularly in the zero-shot learning setting.
no code implementations • 19 Mar 2024 • Anita Rau, Josiah Aklilu, F. Christopher Holsinger, Serena Yeung-Levy
This work proposes a novel approach to uncertainty in depth priors for NeRF supervision.
2 code implementations • 15 Mar 2024 • Xiaohan Wang, Yuhui Zhang, Orr Zohar, Serena Yeung-Levy
Long-form video understanding represents a significant challenge within computer vision, demanding a model capable of reasoning over long multi-modal sequences.
Ranked #6 on Zero-Shot Video Question Answer on NExT-QA
1 code implementation • 12 Mar 2024 • Juan Manuel Zambrano Chaves, Shih-Cheng Huang, Yanbo Xu, Hanwen Xu, Naoto Usuyama, Sheng Zhang, Fei Wang, Yujia Xie, Mahmoud Khademi, ZiYi Yang, Hany Awadalla, Julia Gong, Houdong Hu, Jianwei Yang, Chunyuan Li, Jianfeng Gao, Yu Gu, Cliff Wong, Mu Wei, Tristan Naumann, Muhao Chen, Matthew P. Lungren, Akshay Chaudhari, Serena Yeung-Levy, Curtis P. Langlotz, Sheng Wang, Hoifung Poon
Frontier general-domain models such as GPT-4V still have significant performance gaps in multimodal biomedical applications.
no code implementations • 26 Feb 2024 • Zeyu Wang, Zhenzhen Weng, Serena Yeung-Levy
Conventional approaches to human mesh recovery predominantly employ a region-based strategy.
1 code implementation • 25 Jan 2024 • Sanket Rajan Gupte, Josiah Aklilu, Jeffrey J. Nirschl, Serena Yeung-Levy
Foundation vision or vision-language models are trained on large unlabeled or noisy data and learn robust representations that can achieve impressive zero- or few-shot performance on diverse tasks.
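The zero-shot capability mentioned here can be illustrated with a CLIP-style sketch (illustrative only: the embeddings below are toy vectors, whereas in practice they come from the model's image and text encoders, and the function names are assumptions):

```python
# Illustrative sketch of zero-shot classification with a vision-language
# model: the predicted class is the text prompt whose embedding is most
# similar (by cosine similarity) to the image embedding.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def zero_shot_classify(image_emb, class_embs):
    """Return the label whose text embedding best matches the image."""
    return max(class_embs, key=lambda label: cosine(image_emb, class_embs[label]))

class_embs = {"mitosis": [0.9, 0.1], "interphase": [0.1, 0.9]}
pred = zero_shot_classify([0.8, 0.2], class_embs)
# -> "mitosis"
```

No task-specific training is involved: swapping in a new label set only requires embedding new text prompts, which is what makes such models attractive starting points for adaptation.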
no code implementations • 22 Jan 2024 • Zhenzhen Weng, Jingyuan Liu, Hao Tan, Zhan Xu, Yang Zhou, Serena Yeung-Levy, Jimei Yang
We present Human-LRM, a diffusion-guided feed-forward model that predicts the implicit field of a human from a single image.
1 code implementation • 16 Jan 2024 • Yuhui Zhang, Elaine Sui, Serena Yeung-Levy
This assumption is under-explored due to the poorly understood geometry of the multi-modal contrastive space, where a modality gap exists.
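The modality gap can be made concrete with a small sketch (a simplification of the geometry the paper studies, not its method; the function names and toy embeddings are assumptions): image and text embeddings occupy separate regions of the shared space, and one simple diagnostic is the vector between the two modality centroids, which can be used to shift one modality onto the other.

```python
# Sketch of the "modality gap" in a contrastive embedding space:
# compute the difference between modality centroids and shift the
# text embeddings by it so both modalities share a common center.

def mean_vec(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def close_modality_gap(image_embs, text_embs):
    """Shift text embeddings by the centroid difference between modalities."""
    gap = [i - t for i, t in zip(mean_vec(image_embs), mean_vec(text_embs))]
    return [[t + g for t, g in zip(vec, gap)] for vec in text_embs]

imgs = [[1.0, 0.0], [1.0, 2.0]]   # image centroid (1.0, 1.0)
txts = [[0.0, 0.0], [0.0, 2.0]]   # text centroid (0.0, 1.0); gap = (1.0, 0.0)
shifted = close_modality_gap(imgs, txts)
# -> [[1.0, 0.0], [1.0, 2.0]]
```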
1 code implementation • CVPR 2024 • Lisa Dunlap, Yuhui Zhang, Xiaohan Wang, Ruiqi Zhong, Trevor Darrell, Jacob Steinhardt, Joseph E. Gonzalez, Serena Yeung-Levy
To aid in this discovery process, we explore the task of automatically describing the differences between two $\textbf{sets}$ of images, which we term Set Difference Captioning.
1 code implementation • 14 Sep 2023 • James Burgess, Kuan-Chieh Wang, Serena Yeung-Levy
We conclude that since the view token controls the 3D `rendering' viewpoint, there is likely a scene representation embedded in frozen 2D diffusion models.
1 code implementation • 16 Mar 2023 • Zhenzhen Weng, Laura Bravo-Sánchez, Serena Yeung-Levy
Recent text-to-image generative models have exhibited remarkable abilities in generating high-fidelity and photo-realistic images.