Search Results for author: Serena Yeung-Levy

Found 32 papers, 14 papers with code

A Large-Scale Vision-Language Dataset Derived from Open Scientific Literature to Advance Biomedical Generalist AI

no code implementations · 26 Mar 2025 · Alejandro Lozano, Min Woo Sun, James Burgess, Jeffrey J. Nirschl, Christopher Polzak, Yuhui Zhang, Liangyu Chen, Jeffrey Gu, Ivan Lopez, Josiah Aklilu, Anita Rau, Austin Wolfgang Katzer, Collin Chiu, Orr Zohar, Xiaohan Wang, Alfred Seunghoon Song, Chiang Chia-Chun, Robert Tibshirani, Serena Yeung-Levy

Despite the excitement behind biomedical artificial intelligence (AI), access to high-quality, diverse, and large-scale data - the foundation for modern AI systems - is still a bottleneck to unlocking its full potential.

SurgiSAM2: Fine-tuning a foundational model for surgical video anatomy segmentation and detection

no code implementations · 5 Mar 2025 · Devanish N. Kamtam, Joseph B. Shrager, Satya Deepya Malla, Xiaohan Wang, Nicole Lin, Juan J. Cardona, Serena Yeung-Levy, Clarence Hu

Conclusion: SAM 2 achieves remarkable zero-shot and fine-tuned performance for surgical scene segmentation, surpassing prior SOTA models across several organ classes on diverse datasets.

Tasks: Anatomy, Scene Segmentation (+1 more)

Foundation Models Secretly Understand Neural Network Weights: Enhancing Hypernetwork Architectures with Foundation Models

no code implementations · 2 Mar 2025 · Jeffrey Gu, Serena Yeung-Levy

Large pre-trained models, or foundation models, have shown impressive performance when adapted to a variety of downstream tasks, often outperforming specialized models.

CellFlow: Simulating Cellular Morphology Changes via Flow Matching

no code implementations · 13 Feb 2025 · Yuhui Zhang, Yuchang Su, Chenyu Wang, Tianhong Li, Zoe Wefers, Jeffrey Nirschl, James Burgess, Daisy Ding, Alejandro Lozano, Emma Lundberg, Serena Yeung-Levy

Building a virtual cell capable of accurately simulating cellular behaviors in silico has long been a dream in computational biology.

Temporal Preference Optimization for Long-Form Video Understanding

no code implementations · 23 Jan 2025 · Rui Li, Xiaohan Wang, Yuhui Zhang, Zeyu Wang, Serena Yeung-Levy

Despite significant advancements in video large multimodal models (video-LMMs), achieving effective temporal grounding in long-form videos remains a challenge for existing models.

Tasks: Form, MME (+2 more)

Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration

no code implementations · 17 Dec 2024 · Mark Endo, Xiaohan Wang, Serena Yeung-Levy

In this work, we examine the popular acceleration approach of early pruning of visual tokens inside the language model and find that its strong performance across many tasks is not due to an exceptional ability to compress visual information, but rather the benchmarks' limited ability to assess fine-grained visual capabilities.

Tasks: Language Modelling

DeforHMR: Vision Transformer with Deformable Cross-Attention for 3D Human Mesh Recovery

no code implementations · 18 Nov 2024 · Jaewoo Heo, George Hu, Zeyu Wang, Serena Yeung-Levy

DeforHMR leverages a novel query-agnostic deformable cross-attention mechanism within the transformer decoder to effectively regress the visual features extracted from a frozen pretrained vision transformer (ViT) encoder.

Tasks: Decoder, Human Mesh Recovery (+1 more)

Motion Diffusion-Guided 3D Global HMR from a Dynamic Camera

no code implementations · 15 Nov 2024 · Jaewoo Heo, Kuan-Chieh Wang, Karen Liu, Serena Yeung-Levy

Motion capture technologies have transformed numerous fields, from the film and gaming industries to sports science and healthcare, by providing a tool to capture and analyze human movement in great detail.

Tasks: Motion Generation

Zero-shot Action Localization via the Confidence of Large Vision-Language Models

no code implementations · 18 Oct 2024 · Josiah Aklilu, Xiaohan Wang, Serena Yeung-Levy

Precise action localization in untrimmed video is vital for fields such as professional sports and minimally invasive surgery, where the delineation of particular motions in recordings can dramatically enhance analysis.

Tasks: Action Localization, Language Modelling (+4 more)

Ask, Pose, Unite: Scaling Data Acquisition for Close Interactions with Vision Language Models

no code implementations · 1 Oct 2024 · Laura Bravo-Sánchez, Jaewoo Heo, Zhenzhen Weng, Kuan-Chieh Wang, Serena Yeung-Levy

Social dynamics in close human interactions pose significant challenges for Human Mesh Estimation (HME), particularly due to the complexity of physical contacts and the scarcity of training data.

Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision

1 code implementation · 8 Jul 2024 · Orr Zohar, Xiaohan Wang, Yonatan Bitton, Idan Szpektor, Serena Yeung-Levy

The performance of Large Vision Language Models (LVLMs) is dependent on the size and quality of their training datasets.

Tasks: Action Quality Assessment, Descriptive (+1 more)

μ-Bench: A Vision-Language Benchmark for Microscopy Understanding

1 code implementation · 1 Jul 2024 · Alejandro Lozano, Jeffrey Nirschl, James Burgess, Sanket Rajan Gupte, Yuhui Zhang, Alyssa Unell, Serena Yeung-Levy

Recent advances in microscopy have enabled the rapid generation of terabytes of image data in cell biology and biomedical research.

Tasks: Cell Detection, Classification (+2 more)

Just Shift It: Test-Time Prototype Shifting for Zero-Shot Generalization with Vision-Language Models

1 code implementation · 19 Mar 2024 · Elaine Sui, Xiaohan Wang, Serena Yeung-Levy

Advancements in vision-language models (VLMs) have propelled the field of computer vision, particularly in the zero-shot learning setting.

Tasks: Image Classification, Prompt Engineering (+3 more)

Depth-guided NeRF Training via Earth Mover's Distance

no code implementations · 19 Mar 2024 · Anita Rau, Josiah Aklilu, F. Christopher Holsinger, Serena Yeung-Levy

This work proposes a novel approach to uncertainty in depth priors for NeRF supervision.

Tasks: Denoising, NeRF

VideoAgent: Long-form Video Understanding with Large Language Model as Agent

2 code implementations · 15 Mar 2024 · Xiaohan Wang, Yuhui Zhang, Orr Zohar, Serena Yeung-Levy

Long-form video understanding represents a significant challenge within computer vision, demanding a model capable of reasoning over long multi-modal sequences.

Tasks: EgoSchema, Form (+5 more)

Multi-Human Mesh Recovery with Transformers

no code implementations · 26 Feb 2024 · Zeyu Wang, Zhenzhen Weng, Serena Yeung-Levy

Conventional approaches to human mesh recovery predominantly employ a region-based strategy.

Tasks: Human Mesh Recovery

Revisiting Active Learning in the Era of Vision Foundation Models

1 code implementation · 25 Jan 2024 · Sanket Rajan Gupte, Josiah Aklilu, Jeffrey J. Nirschl, Serena Yeung-Levy

Foundation vision or vision-language models are trained on large unlabeled or noisy data and learn robust representations that can achieve impressive zero- or few-shot performance on diverse tasks.

Tasks: Active Learning, Diversity (+1 more)

Template-Free Single-View 3D Human Digitalization with Diffusion-Guided LRM

no code implementations · 22 Jan 2024 · Zhenzhen Weng, Jingyuan Liu, Hao Tan, Zhan Xu, Yang Zhou, Serena Yeung-Levy, Jimei Yang

We present Human-LRM, a diffusion-guided feed-forward model that predicts the implicit field of a human from a single image.

Tasks: Decoder, NeRF

Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data

1 code implementation · 16 Jan 2024 · Yuhui Zhang, Elaine Sui, Serena Yeung-Levy

However, this assumption is under-explored due to the poorly understood geometry of the multi-modal contrastive space, where a modality gap exists.

Tasks: Text-to-Image Generation, Video Captioning

Describing Differences in Image Sets with Natural Language

1 code implementation · CVPR 2024 · Lisa Dunlap, Yuhui Zhang, Xiaohan Wang, Ruiqi Zhong, Trevor Darrell, Jacob Steinhardt, Joseph E. Gonzalez, Serena Yeung-Levy

To aid in this discovery process, we explore the task of automatically describing the differences between two sets of images, which we term Set Difference Captioning.

Tasks: Language Modelling

Viewpoint Textual Inversion: Discovering Scene Representations and 3D View Control in 2D Diffusion Models

1 code implementation · 14 Sep 2023 · James Burgess, Kuan-Chieh Wang, Serena Yeung-Levy

We conclude that since the view token controls the 3D 'rendering' viewpoint, there is likely a scene representation embedded in frozen 2D diffusion models.

Tasks: Novel View Synthesis, Text-to-Image Generation

Diffusion-HPC: Synthetic Data Generation for Human Mesh Recovery in Challenging Domains

1 code implementation · 16 Mar 2023 · Zhenzhen Weng, Laura Bravo-Sánchez, Serena Yeung-Levy

Recent text-to-image generative models have exhibited remarkable abilities in generating high-fidelity and photo-realistic images.

Tasks: Human Mesh Recovery, Synthetic Data Generation
