1 code implementation • NAACL 2022 • Jae Sung Park, Sheng Shen, Ali Farhadi, Trevor Darrell, Yejin Choi, Anna Rohrbach
We test the robustness of recent methods on the proposed automatic contrast sets, and compare them to additionally collected human-generated counterparts, to assess their effectiveness.
no code implementations • 29 Nov 2023 • Dantong Niu, Amir Bar, Roei Herzig, Trevor Darrell, Anna Rohrbach
Existing video-based action recognition systems typically require dense annotation and struggle in environments where there is a significant distribution shift relative to the training data.
no code implementations • CVPR 2023 • Jun Chen, Ming Hu, Darren J. Coker, Michael L. Berumen, Blair Costelloe, Sara Beery, Anna Rohrbach, Mohamed Elhoseiny
Monitoring animal behavior can facilitate conservation efforts by providing key insights into wildlife health, population status, and ecosystem function.
no code implementations • 11 May 2023 • Suzanne Petryk, Spencer Whitehead, Joseph E. Gonzalez, Trevor Darrell, Anna Rohrbach, Marcus Rohrbach
The ability to judge whether a caption correctly describes an image is a critical part of vision-language understanding.
no code implementations • 1 Dec 2022 • Dong Huk Park, Grace Luo, Clayton Toste, Samaneh Azadi, Xihui Liu, Maka Karalashvili, Anna Rohrbach, Trevor Darrell
When manipulating an object, existing text-to-image diffusion models often ignore the shape of the object and generate content that is incorrectly scaled, cut off, or replaced with background content.
no code implementations • 1 Dec 2022 • Mingyang Zhou, Grace Luo, Anna Rohrbach, Zhou Yu
In our paper, we first demonstrate that by combining more fine-grained context that captures the key named entities (obtained via an oracle) and the global context that summarizes the news, we can dramatically improve the model's ability to generate accurate news captions.
1 code implementation • 28 Nov 2022 • Grace Luo, Giscard Biamby, Trevor Darrell, Daniel Fried, Anna Rohrbach
We propose the task of Geolocation via Guidebook Grounding that uses a dataset of StreetView images from a diverse set of locations and an associated textual guidebook for GeoGuessr, a popular interactive geolocation game.
no code implementations • 14 Aug 2022 • Medhini Narasimhan, Arsha Nagrani, Chen Sun, Michael Rubinstein, Trevor Darrell, Anna Rohrbach, Cordelia Schmid
In this work, we focus on summarizing instructional videos, an under-explored area of video summarization.
no code implementations • 15 Jun 2022 • Elad Ben-Avraham, Roei Herzig, Karttikeya Mangalam, Amir Bar, Anna Rohrbach, Leonid Karlinsky, Trevor Darrell, Amir Globerson
First, as both images and videos contain structured information, we enrich a transformer model with a set of object tokens that can be used across images and videos.
Point-of-no-return (PNR) temporal localization
Temporal Localization
no code implementations • 13 Jun 2022 • Elad Ben-Avraham, Roei Herzig, Karttikeya Mangalam, Amir Bar, Anna Rohrbach, Leonid Karlinsky, Trevor Darrell, Amir Globerson
We explore a particular instantiation of scene structure, namely a Hand-Object Graph, consisting of hands and objects with their locations as nodes, and physical relations of contact/no-contact as edges.
1 code implementation • 28 Apr 2022 • Spencer Whitehead, Suzanne Petryk, Vedaad Shakib, Joseph Gonzalez, Trevor Darrell, Anna Rohrbach, Marcus Rohrbach
We first enable abstention capabilities for several VQA models, and analyze both their coverage, the portion of questions answered, and risk, the error on that portion.
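As a rough illustration of the two quantities named above, the following sketch computes coverage and risk for a model that abstains below a confidence threshold. The function name, the threshold rule, and the binary notion of correctness are assumptions made for this example and are not taken from the paper.

```python
def coverage_and_risk(confidences, correct, threshold):
    """Return (coverage, risk) when the model answers only if confidence >= threshold.

    Hypothetical helper: `confidences` holds one score per question and
    `correct` holds booleans saying whether each answer would be right.
    """
    # Keep the correctness flags of the questions the model chooses to answer.
    answered = [ok for conf, ok in zip(confidences, correct) if conf >= threshold]
    # Coverage: portion of all questions that were answered.
    coverage = len(answered) / len(confidences) if confidences else 0.0
    # Risk: error rate on the answered portion only.
    risk = sum(1 for ok in answered if not ok) / len(answered) if answered else 0.0
    return coverage, risk


# Toy data, invented for illustration.
confidences = [0.9, 0.8, 0.4, 0.3]
correct = [True, False, True, False]
coverage, risk = coverage_and_risk(confidences, correct, threshold=0.5)
print(coverage, risk)
```

Raising the threshold lowers coverage and, if the confidence scores are informative, tends to lower risk as well.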
2 code implementations • 20 Apr 2022 • Sheng Shen, Chunyuan Li, Xiaowei Hu, Jianwei Yang, Yujia Xie, Pengchuan Zhang, Zhe Gan, Lijuan Wang, Lu Yuan, Ce Liu, Kurt Keutzer, Trevor Darrell, Anna Rohrbach, Jianfeng Gao
We propose K-LITE, a simple strategy to leverage external knowledge for building transferable visual systems: In training, it enriches entities in text with WordNet and Wiktionary knowledge, leading to an efficient and scalable approach to learning image representations that uses knowledge about the visual concepts.
1 code implementation • ACL 2022 • Sanjay Subramanian, William Merrill, Trevor Darrell, Matt Gardner, Sameer Singh, Anna Rohrbach
Training a referring expression comprehension (ReC) model for a new visual domain requires collecting referring expressions, and potentially corresponding bounding boxes, for images in the domain.
no code implementations • CVPR 2022 • Suzanne Petryk, Lisa Dunlap, Keyan Nasseri, Joseph Gonzalez, Trevor Darrell, Anna Rohrbach
To do this, we ground task-relevant words or phrases with attention maps from a pretrained large-scale model.
no code implementations • 10 Feb 2022 • Jack Hessel, Jena D. Hwang, Jae Sung Park, Rowan Zellers, Chandra Bhagavatula, Anna Rohrbach, Kate Saenko, Yejin Choi
We present Sherlock, an annotated corpus of 103K images for testing machine capacity for abductive reasoning beyond literal image contents.
1 code implementation • 21 Dec 2021 • Shruti Agarwal, Liwen Hu, Evonne Ng, Trevor Darrell, Hao Li, Anna Rohrbach
In today's era of digital misinformation, we are increasingly faced with new threats posed by video falsification techniques.
1 code implementation • NAACL 2022 • Giscard Biamby, Grace Luo, Trevor Darrell, Anna Rohrbach
Detecting out-of-context media, such as "mis-captioned" images on Twitter, is a relevant problem, especially in domains of high public significance.
no code implementations • 10 Dec 2021 • Xihui Liu, Dong Huk Park, Samaneh Azadi, Gong Zhang, Arman Chopikyan, Yuxiao Hu, Humphrey Shi, Anna Rohrbach, Trevor Darrell
We investigate fine-grained, continuous control of this model class, and introduce a novel unified framework for semantic diffusion guidance, which allows either language or image guidance, or both.
1 code implementation • CVPR 2022 • Roei Herzig, Elad Ben-Avraham, Karttikeya Mangalam, Amir Bar, Gal Chechik, Anna Rohrbach, Trevor Darrell, Amir Globerson
In this work, we present Object-Region Video Transformers (ORViT), an object-centric approach that extends video transformer layers with a block that directly incorporates object representations.
Ranked #3 on Action Recognition on Diving-48
4 code implementations • 13 Jul 2021 • Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, Kurt Keutzer
Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using a relatively small set of manually-annotated data (as compared to web-crawled data), to perceive the visual world.
Ranked #4 on Vision and Language Navigation on RxR (using extra training data)
no code implementations • NeurIPS 2021 • Medhini Narasimhan, Anna Rohrbach, Trevor Darrell
A generic video summary is an abridged version of a video that conveys the whole story and features the most important scenes.
1 code implementation • CVPR 2022 • Amir Bar, Xin Wang, Vadim Kantorov, Colorado J Reed, Roei Herzig, Gal Chechik, Anna Rohrbach, Trevor Darrell, Amir Globerson
Recent self-supervised pretraining methods for object detection largely focus on pretraining the backbone of the object detector, neglecting key parts of detection architecture.
Ranked #1 on Few-Shot Object Detection on COCO 2017
1 code implementation • EMNLP 2021 • Grace Luo, Trevor Darrell, Anna Rohrbach
Online misinformation is a prevalent societal issue, with adversaries relying on tools ranging from cheap fakes to sophisticated deep fakes.
1 code implementation • ECCV 2020 • Jae Sung Park, Trevor Darrell, Anna Rohrbach
This auxiliary task allows us to propose a two-stage approach to Identity-Aware Video Description.
1 code implementation • 27 Jun 2020 • Amir Bar, Roei Herzig, Xiaolong Wang, Anna Rohrbach, Gal Chechik, Trevor Darrell, Amir Globerson
Our generative model for this task (AG2Vid) disentangles motion and appearance features, and by incorporating a scheduling mechanism for actions facilitates a timely and coordinated video generation.
no code implementations • ACL 2019 • Ronghang Hu, Daniel Fried, Anna Rohrbach, Dan Klein, Trevor Darrell, Kate Saenko
The actual grounding can connect language to the environment through multiple modalities, e.g., "stop at the door" might ground into visual objects, while "turn right" might rely only on the geometric structure of a route.
1 code implementation • ICCV 2019 • Ronghang Hu, Anna Rohrbach, Trevor Darrell, Kate Saenko
E.g., conditioning on the "on" relationship to the plate, the object "mug" gathers messages from the object "plate" to update its representation to "mug on the plate", which can be easily consumed by a simple classifier for answer prediction.
Ranked #3 on Referring Expression Comprehension on CLEVR-Ref+
1 code implementation • ICCV 2019 • Dong Huk Park, Trevor Darrell, Anna Rohrbach
We present a novel Dual Dynamic Attention Model (DUDA) to perform robust Change Captioning.
1 code implementation • CVPR 2019 • Jae Sung Park, Marcus Rohrbach, Trevor Darrell, Anna Rohrbach
Among the main issues are the fluency and coherence of the generated descriptions, and their relevance to the video.
1 code implementation • EMNLP 2018 • Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, Kate Saenko
Despite continuously improving performance, contemporary image captioning models are prone to "hallucinating" objects that are not actually in a scene.
2 code implementations • ECCV 2018 • Jinkyu Kim, Anna Rohrbach, Trevor Darrell, John Canny, Zeynep Akata
Finally, we explore a version of our model that generates rationalizations, and compare with introspective explanations on the same video segments.
no code implementations • 2 Jul 2018 • Lisa Anne Hendricks, Kaylee Burns, Kate Saenko, Trevor Darrell, Anna Rohrbach
Most machine learning methods are known to capture and exploit biases of the training data.
1 code implementation • NeurIPS 2018 • Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, Trevor Darrell
We use this speaker model to (1) synthesize new instructions for data augmentation and to (2) implement pragmatic reasoning, which evaluates how well candidate action sequences explain an instruction.
2 code implementations • ECCV 2018 • Kaylee Burns, Lisa Anne Hendricks, Kate Saenko, Trevor Darrell, Anna Rohrbach
We introduce a new Equalizer model that ensures equal gender probability when gender evidence is occluded in a scene and confident predictions when gender evidence is present.
no code implementations • 21 Mar 2018 • Anna Khoreva, Anna Rohrbach, Bernt Schiele
We show that our language-supervised approach performs on par with the methods which have access to a pixel-level mask of the target object on DAVIS'16 and is competitive to methods using scribbles on the challenging DAVIS'17 dataset.
Ranked #1 on Video Object Segmentation on DAVIS 2017 (mIoU metric)
1 code implementation • CVPR 2018 • Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Anna Rohrbach, Bernt Schiele, Trevor Darrell, Marcus Rohrbach
We propose a multimodal approach to explanation, and argue that the two modalities provide complementary explanatory strengths.
no code implementations • 17 Nov 2017 • Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Anna Rohrbach, Bernt Schiele, Trevor Darrell, Marcus Rohrbach
We also introduce a multimodal methodology for generating visual and textual explanations simultaneously.
no code implementations • 16 Oct 2017 • Sayna Ebrahimi, Anna Rohrbach, Trevor Darrell
We develop a method for policy architecture search and adaptation via gradient-free optimization which can learn to perform autonomous driving tasks.
no code implementations • CVPR 2018 • Xiaojun Xu, Xinyun Chen, Chang Liu, Anna Rohrbach, Trevor Darrell, Dawn Song
Our work sheds new light on understanding adversarial attacks on vision systems which have a language component and shows that attention, bounding box localization, and compositional internal structures are vulnerable to adversarial attacks.
no code implementations • CVPR 2017 • Anna Rohrbach, Marcus Rohrbach, Siyu Tang, Seong Joon Oh, Bernt Schiele
At training time, we first learn how to localize characters by relating their visual appearance to mentions in the descriptions via a semi-supervised approach.
2 code implementations • CVPR 2017 • Tegan Maharaj, Nicolas Ballas, Anna Rohrbach, Aaron Courville, Christopher Pal
In addition to presenting statistics and a description of the dataset, we perform a detailed analysis of 5 different models' predictions, and compare these with human performance.
9 code implementations • EMNLP 2016 • Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, Marcus Rohrbach
Approaches to multimodal pooling include element-wise product or sum, as well as concatenation of the visual and textual representations.
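The three baseline pooling operations listed above can be sketched on plain Python lists. This is a minimal illustration with made-up feature vectors and hypothetical function names; it shows only the baselines named in the sentence, not the method the paper proposes.

```python
def elementwise_product(visual, textual):
    """Multiply matching dimensions; both vectors must have the same length."""
    assert len(visual) == len(textual)
    return [v * t for v, t in zip(visual, textual)]


def elementwise_sum(visual, textual):
    """Add matching dimensions; both vectors must have the same length."""
    assert len(visual) == len(textual)
    return [v + t for v, t in zip(visual, textual)]


def concatenate(visual, textual):
    """Stack the two vectors end to end; lengths may differ."""
    return list(visual) + list(textual)


# Toy feature vectors, invented for illustration.
visual = [1.0, 2.0, 3.0]
textual = [0.5, -1.0, 2.0]

product = elementwise_product(visual, textual)
summed = elementwise_sum(visual, textual)
joined = concatenate(visual, textual)
print(product, summed, joined)
```

The product and sum keep the original dimensionality, while concatenation doubles it and leaves any interaction between the two modalities to later layers.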
no code implementations • 12 May 2016 • Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, Bernt Schiele
In addition, we collected and aligned movie scripts used in prior work and compare the two sources of descriptions.
3 code implementations • 12 Nov 2015 • Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, Bernt Schiele
We propose a novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly.
Ranked #12 on Phrase Grounding on Flickr30k Entities Test
no code implementations • 4 Jun 2015 • Anna Rohrbach, Marcus Rohrbach, Bernt Schiele
Generating descriptions for videos has many applications including assisting blind people and human-robot interaction.
no code implementations • 23 Feb 2015 • Marcus Rohrbach, Anna Rohrbach, Michaela Regneri, Sikandar Amin, Mykhaylo Andriluka, Manfred Pinkal, Bernt Schiele
To attack the second challenge, recognizing composite activities, we leverage the fact that these activities are compositional and that the essential components of the activities can be obtained from textual descriptions or scripts.
no code implementations • CVPR 2015 • Anna Rohrbach, Marcus Rohrbach, Niket Tandon, Bernt Schiele
In this work, we propose a novel dataset that contains transcribed DVS, temporally aligned to full-length HD movies.