Search Results for author: Meera Hahn

Found 12 papers, 3 papers with code

VideoPoet: A Large Language Model for Zero-Shot Video Generation

no code implementations • 21 Dec 2023 • Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Josh Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig Adam, Ming-Hsuan Yang, Irfan Essa, Huisheng Wang, David A. Ross, Bryan Seybold, Lu Jiang

We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals.

Ranked #7 on Text-to-Video Generation on UCF-101

Decoder, Language Modelling, +3

Photorealistic Video Generation with Diffusion Models

no code implementations • 11 Dec 2023 • Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, José Lezama

We present W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling.

Ranked #1 on Video Prediction on Kinetics-600 12 frames, 64x64

Text-to-Video Generation, Video Generation, +1

Which way is 'right'?: Uncovering limitations of Vision-and-Language Navigation model

no code implementations • 30 Nov 2023 • Meera Hahn, Amit Raj, James M. Rehg

The challenging task of Vision-and-Language Navigation (VLN) requires embodied agents to follow natural language instructions to reach a goal location or object (e.g., 'walk down the hallway and turn left at the piano').

Vision and Language Navigation

Text and Click inputs for unambiguous open vocabulary instance segmentation

1 code implementation • 24 Nov 2023 • Nikolai Warner, Meera Hahn, Jonathan Huang, Irfan Essa, Vighnesh Birodkar

We propose a new segmentation process, Text + Click segmentation, where a model takes as input an image, a text phrase describing a class to segment, and a single foreground click specifying the instance to segment.

Instance Segmentation, Segmentation, +1

Transformer-based Localization from Embodied Dialog with Large-scale Pre-training

no code implementations • 10 Oct 2022 • Meera Hahn, James M. Rehg

We address the challenging task of Localization via Embodied Dialog (LED).

Learning a Visually Grounded Memory Assistant

no code implementations • 7 Oct 2022 • Meera Hahn, Kevin Carlberg, Ruta Desai, James Hillis

We introduce a novel interface for large scale collection of human memory and assistance.

No RL, No Simulation: Learning to Navigate without Navigating

1 code implementation • NeurIPS 2021 • Meera Hahn, Devendra Chaplot, Shubham Tulsiani, Mustafa Mukadam, James M. Rehg, Abhinav Gupta

Most prior methods for learning navigation policies require access to simulation environments, as they need online policy interaction and rely on ground-truth maps for rewards.

Navigate, Reinforcement Learning (RL)

Where Are You? Localization from Embodied Dialog

2 code implementations • EMNLP 2020 • Meera Hahn, Jacob Krantz, Dhruv Batra, Devi Parikh, James M. Rehg, Stefan Lee, Peter Anderson

In this paper, we focus on the LED task -- providing a strong baseline model with detailed ablations characterizing both dataset biases and the importance of various modeling choices.

Navigate, Visual Dialog

Tripping through time: Efficient Localization of Activities in Videos

no code implementations • 22 Apr 2019 • Meera Hahn, Asim Kadav, James M. Rehg, Hans Peter Graf

Localizing moments in untrimmed videos via language queries is a new and interesting task that requires the ability to accurately ground language into video.

Action2Vec: A Crossmodal Embedding Approach to Action Learning

no code implementations • 2 Jan 2019 • Meera Hahn, Andrew Silva, James M. Rehg

We describe a novel cross-modal embedding space for actions, named Action2Vec, which combines linguistic cues from class labels with spatio-temporal features derived from video clips.

Action Recognition, General Classification, +2