Search Results for author: Anna Rohrbach

Found 48 papers, 25 papers with code

How Much Can CLIP Benefit Vision-and-Language Tasks?

4 code implementations • 13 Jul 2021 • Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, Kurt Keutzer

Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using a relatively small set of manually-annotated data (as compared to web-crawled data), to perceive the visual world.

Ranked #4 on Vision and Language Navigation on RxR (using extra training data)

Question Answering · Vision and Language Navigation · +2

K-LITE: Learning Transferable Visual Models with External Knowledge

2 code implementations • 20 Apr 2022 • Sheng Shen, Chunyuan Li, Xiaowei Hu, Jianwei Yang, Yujia Xie, Pengchuan Zhang, Zhe Gan, Lijuan Wang, Lu Yuan, Ce Liu, Kurt Keutzer, Trevor Darrell, Anna Rohrbach, Jianfeng Gao

We propose K-LITE, a simple strategy to leverage external knowledge for building transferable visual systems: In training, it enriches entities in text with WordNet and Wiktionary knowledge, leading to an efficient and scalable approach to learning image representations that uses knowledge about the visual concepts.

Benchmarking · Descriptive · +4
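
As a rough illustration of the enrichment step described in the K-LITE abstract above, the sketch below appends a WordNet gloss to entity mentions in a caption using NLTK. The function name, the entity list, and the single-sense lookup are simplifications, and the Wiktionary component mentioned in the paper is omitted; this is not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of the knowledge-enrichment idea:
# append a WordNet gloss to entity words in a caption before feeding the text
# to an encoder. Requires `pip install nltk` and the WordNet corpus.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def enrich_caption(caption: str, entities: list[str]) -> str:
    """Append a short WordNet definition after each entity mention."""
    enriched = caption
    for entity in entities:
        synsets = wn.synsets(entity.replace(" ", "_"))
        if synsets:
            gloss = synsets[0].definition()  # take the first sense for simplicity
            enriched = enriched.replace(entity, f"{entity} ({gloss})", 1)
    return enriched

print(enrich_caption("a photo of a mandolin on a table", ["mandolin"]))
# -> "a photo of a mandolin (a stringed instrument ...) on a table"
```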

DETReg: Unsupervised Pretraining with Region Priors for Object Detection

1 code implementation CVPR 2022 Amir Bar, Xin Wang, Vadim Kantorov, Colorado J Reed, Roei Herzig, Gal Chechik, Anna Rohrbach, Trevor Darrell, Amir Globerson

Recent self-supervised pretraining methods for object detection largely focus on pretraining the backbone of the object detector, neglecting key parts of the detection architecture.

Few-Shot Learning · Few-Shot Object Detection · +6

Grounding of Textual Phrases in Images by Reconstruction

3 code implementations • 12 Nov 2015 • Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, Bernt Schiele

We propose a novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly.

Language Modelling · Natural Language Visual Grounding · +2
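
The sketch below, with assumed feature dimensions, illustrates the attend-then-reconstruct idea described above: a phrase-conditioned soft attention selects among region proposals, and a decoder tries to reconstruct the phrase from the attended visual feature, so the reconstruction loss supervises the otherwise latent attention. It is a schematic PyTorch module, not the authors' implementation.

```python
# Minimal sketch (illustrative dimensions, not the authors' code) of
# "grounding by reconstruction": attend over region proposals with a latent
# attention, then reconstruct the phrase from the attended visual feature.
import torch
import torch.nn as nn

class AttendAndReconstruct(nn.Module):
    def __init__(self, vis_dim=2048, txt_dim=512, hid_dim=512, vocab_size=10000):
        super().__init__()
        self.score = nn.Sequential(                 # phrase-conditioned region scores
            nn.Linear(vis_dim + txt_dim, hid_dim), nn.ReLU(), nn.Linear(hid_dim, 1))
        self.decoder = nn.LSTM(txt_dim, hid_dim, batch_first=True)
        self.init_h = nn.Linear(vis_dim, hid_dim)   # condition decoder on attended region
        self.vocab = nn.Linear(hid_dim, vocab_size)

    def forward(self, regions, phrase_emb, phrase_tokens_emb):
        # regions: (B, R, vis_dim); phrase_emb: (B, txt_dim);
        # phrase_tokens_emb: (B, T, txt_dim) embeddings of the phrase tokens
        q = phrase_emb.unsqueeze(1).expand(-1, regions.size(1), -1)
        attn = torch.softmax(self.score(torch.cat([regions, q], -1)).squeeze(-1), dim=-1)
        attended = (attn.unsqueeze(-1) * regions).sum(1)          # (B, vis_dim)
        h0 = self.init_h(attended).unsqueeze(0)
        out, _ = self.decoder(phrase_tokens_emb, (h0, torch.zeros_like(h0)))
        return self.vocab(out), attn   # logits to reconstruct the phrase; attn = grounding
```

Training would minimize cross-entropy between the reconstructed logits and the original phrase tokens; at test time the attention weights themselves give the grounding.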

Women also Snowboard: Overcoming Bias in Captioning Models

2 code implementations ECCV 2018 Kaylee Burns, Lisa Anne Hendricks, Kate Saenko, Trevor Darrell, Anna Rohrbach

We introduce a new Equalizer model that ensures equal gender probability when gender evidence is occluded in a scene and confident predictions when gender evidence is present.

Image Captioning
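
The sketch below is one hedged reading of the two objectives mentioned above: a confusion term that pushes the gender posterior toward uniform when person evidence is masked out, and a standard confident term on the unmasked image. The function and tensor names are illustrative assumptions, not the paper's exact losses.

```python
# Minimal sketch (illustrative, not the authors' exact losses) of the two terms
# described above: equal gender probability without evidence, confident
# prediction with evidence.
import torch
import torch.nn.functional as F

def equalizer_style_loss(logits_masked, logits_full, gender_labels):
    # logits_*: (B, 2) scores for {woman, man}; gender_labels: (B,) in {0, 1}
    log_p_masked = F.log_softmax(logits_masked, dim=-1)
    uniform = torch.full_like(logits_masked, 0.5)
    confusion = F.kl_div(log_p_masked, uniform, reduction="batchmean")  # uniform w/o evidence
    confident = F.cross_entropy(logits_full, gender_labels)             # confident w/ evidence
    return confusion + confident
```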

Speaker-Follower Models for Vision-and-Language Navigation

1 code implementation NeurIPS 2018 Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, Trevor Darrell

We use this speaker model to (1) synthesize new instructions for data augmentation and to (2) implement pragmatic reasoning, which evaluates how well candidate action sequences explain an instruction.

Data Augmentation · Vision and Language Navigation
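
A minimal sketch of the pragmatic-reasoning step described above: candidate routes proposed by a follower are rescored by how well a speaker model would explain the instruction from each route. The scoring callables and the mixing weight are hypothetical placeholders, not the released implementation.

```python
# Minimal sketch (hypothetical scoring functions) of pragmatic reranking:
# combine follower and speaker log-probabilities and pick the best route.
from typing import Callable, Sequence

def pragmatic_choice(
    instruction: str,
    candidate_routes: Sequence[list[str]],
    follower_logprob: Callable[[str, list[str]], float],  # log P(route | instruction)
    speaker_logprob: Callable[[str, list[str]], float],   # log P(instruction | route)
    speaker_weight: float = 0.95,
) -> list[str]:
    """Return the candidate route that best trades off follower and speaker scores."""
    def score(route: list[str]) -> float:
        return (1 - speaker_weight) * follower_logprob(instruction, route) \
               + speaker_weight * speaker_logprob(instruction, route)
    return max(candidate_routes, key=score)
```

The data-augmentation use of the speaker is the complementary direction: sampling new instructions from it for routes that have no human annotation.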

Language-Conditioned Graph Networks for Relational Reasoning

1 code implementation ICCV 2019 Ronghang Hu, Anna Rohrbach, Trevor Darrell, Kate Saenko

E.g., conditioning on the "on" relationship to the plate, the object "mug" gathers messages from the object "plate" to update its representation to "mug on the plate", which can be easily consumed by a simple classifier for answer prediction.

Object · Referring Expression Comprehension · +2
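
The sketch below illustrates, under assumed feature dimensions, the kind of language-conditioned message passing described above: pairwise edge weights are modulated by the text embedding, so an object node can gather messages from related nodes mentioned in the question. It is a schematic module, not the authors' code.

```python
# Minimal sketch (not the authors' code) of language-conditioned message passing:
# edge weights depend on the text embedding, so "mug" can aggregate a message
# from "plate" when the question mentions the "on" relationship.
import torch
import torch.nn as nn

class LangConditionedMessagePassing(nn.Module):
    def __init__(self, obj_dim=512, txt_dim=512):
        super().__init__()
        self.edge_feat = nn.Bilinear(obj_dim, obj_dim, txt_dim)  # pairwise edge features
        self.msg = nn.Linear(obj_dim, obj_dim)
        self.update = nn.Linear(2 * obj_dim, obj_dim)

    def forward(self, objects, text):
        # objects: (B, N, obj_dim); text: (B, txt_dim)
        B, N, D = objects.shape
        src = objects.unsqueeze(2).expand(B, N, N, D).reshape(B * N * N, D)
        dst = objects.unsqueeze(1).expand(B, N, N, D).reshape(B * N * N, D)
        pair = self.edge_feat(src, dst).view(B, N, N, -1)
        weights = torch.softmax((pair * text[:, None, None, :]).sum(-1), dim=-1)  # (B, N, N)
        messages = torch.einsum("bij,bjd->bid", weights, self.msg(objects))
        return self.update(torch.cat([objects, messages], dim=-1))  # refined object features
```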

ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension

2 code implementations ACL 2022 Sanjay Subramanian, William Merrill, Trevor Darrell, Matt Gardner, Sameer Singh, Anna Rohrbach

Training a referring expression comprehension (ReC) model for a new visual domain requires collecting referring expressions, and potentially corresponding bounding boxes, for images in the domain.

Image Classification · Referring Expression · +1

Textual Explanations for Self-Driving Vehicles

2 code implementations ECCV 2018 Jinkyu Kim, Anna Rohrbach, Trevor Darrell, John Canny, Zeynep Akata

Finally, we explore a version of our model that generates rationalizations, and compare with introspective explanations on the same video segments.

Object Hallucination in Image Captioning

1 code implementation EMNLP 2018 Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, Kate Saenko

Despite continuously improving performance, contemporary image captioning models are prone to "hallucinating" objects that are not actually in a scene.

Hallucination · Image Captioning · +2
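
In the spirit of the hallucination analysis above (the paper proposes CHAIR-style metrics), the sketch below computes a simplified per-caption hallucination rate: the fraction of mentioned object words that are absent from the image's ground-truth annotations. Synonym handling and the object vocabulary are deliberately simplified, so this is an illustration rather than the official metric.

```python
# Simplified sketch (not the official CHAIR implementation) of measuring object
# hallucination: the share of mentioned objects missing from the ground truth.
def hallucination_rate(caption: str, gt_objects: set[str], vocab: set[str]) -> float:
    words = caption.lower().split()
    mentioned = [w for w in words if w in vocab]          # object words in the caption
    if not mentioned:
        return 0.0
    hallucinated = [w for w in mentioned if w not in gt_objects]
    return len(hallucinated) / len(mentioned)

# Example: "dog" is mentioned but absent from the annotated objects.
print(hallucination_rate("a dog sitting on a bench",
                         gt_objects={"bench", "person"},
                         vocab={"dog", "bench", "person", "cat"}))  # -> 0.5
```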

Robust Change Captioning

1 code implementation ICCV 2019 Dong Huk Park, Trevor Darrell, Anna Rohrbach

We present a novel Dual Dynamic Attention Model (DUDA) to perform robust Change Captioning.

Natural Language Visual Grounding

Object-Region Video Transformers

1 code implementation CVPR 2022 Roei Herzig, Elad Ben-Avraham, Karttikeya Mangalam, Amir Bar, Gal Chechik, Anna Rohrbach, Trevor Darrell, Amir Globerson

In this work, we present Object-Region Video Transformers (ORViT), an \emph{object-centric} approach that extends video transformer layers with a block that directly incorporates object representations.

Action Detection · Few-Shot Action Recognition · +3

Adversarial Inference for Multi-Sentence Video Description

1 code implementation CVPR 2019 Jae Sung Park, Marcus Rohrbach, Trevor Darrell, Anna Rohrbach

Among the main issues are the fluency and coherence of the generated descriptions, and their relevance to the video.

Image Captioning · Sentence · +1

Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly

1 code implementation • 28 Apr 2022 • Spencer Whitehead, Suzanne Petryk, Vedaad Shakib, Joseph Gonzalez, Trevor Darrell, Anna Rohrbach, Marcus Rohrbach

We first enable abstention capabilities for several VQA models, and analyze both their coverage (the portion of questions answered) and risk (the error on that portion).

Question Answering · Visual Question Answering
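
A small sketch of the coverage and risk quantities mentioned above, computed under a simple confidence-threshold abstention policy; the threshold policy and variable names are illustrative assumptions, not the paper's selection functions.

```python
# Minimal sketch of coverage/risk under a confidence-threshold abstention policy
# (an illustration of the metrics described above, not the paper's exact setup).
from typing import Sequence

def coverage_and_risk(confidences: Sequence[float],
                      correct: Sequence[bool],
                      threshold: float) -> tuple[float, float]:
    answered = [c >= threshold for c in confidences]
    coverage = sum(answered) / len(answered)                 # fraction of questions answered
    errors = sum(1 for a, ok in zip(answered, correct) if a and not ok)
    risk = errors / max(sum(answered), 1)                    # error rate on answered questions
    return coverage, risk

cov, risk = coverage_and_risk([0.9, 0.4, 0.8, 0.2], [True, False, False, True], threshold=0.5)
print(cov, risk)  # 0.5 coverage, 0.5 risk: half answered, one of the two answers is wrong
```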

Compositional Video Synthesis with Action Graphs

1 code implementation • 27 Jun 2020 • Amir Bar, Roei Herzig, Xiaolong Wang, Anna Rohrbach, Gal Chechik, Trevor Darrell, Amir Globerson

Our generative model for this task (AG2Vid) disentangles motion and appearance features and, by incorporating a scheduling mechanism for actions, facilitates timely and coordinated video generation.

Scheduling · Video Generation · +2

NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media

1 code implementation EMNLP 2021 Grace Luo, Trevor Darrell, Anna Rohrbach

Online misinformation is a prevalent societal issue, with adversaries relying on tools ranging from cheap fakes to sophisticated deep fakes.

Misinformation

Identity-Aware Multi-Sentence Video Description

1 code implementation ECCV 2020 Jae Sung Park, Trevor Darrell, Anna Rohrbach

This auxiliary task allows us to propose a two-stage approach to Identity-Aware Video Description.

Gender Prediction · Sentence · +1

G^3: Geolocation via Guidebook Grounding

1 code implementation • 28 Nov 2022 • Grace Luo, Giscard Biamby, Trevor Darrell, Daniel Fried, Anna Rohrbach

We propose the task of Geolocation via Guidebook Grounding that uses a dataset of StreetView images from a diverse set of locations and an associated textual guidebook for GeoGuessr, a popular interactive geolocation game.

Twitter-COMMs: Detecting Climate, COVID, and Military Multimodal Misinformation

1 code implementation NAACL 2022 Giscard Biamby, Grace Luo, Trevor Darrell, Anna Rohrbach

Detecting out-of-context media, such as "mis-captioned" images on Twitter, is a relevant problem, especially in domains of high public significance.

Misinformation

A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering

2 code implementations CVPR 2017 Tegan Maharaj, Nicolas Ballas, Anna Rohrbach, Aaron Courville, Christopher Pal

In addition to presenting statistics and a description of the dataset, we perform a detailed analysis of 5 different models' predictions, and compare these with human performance.

Descriptive · Language Modelling · +3

Watch Those Words: Video Falsification Detection Using Word-Conditioned Facial Motion

1 code implementation • 21 Dec 2021 • Shruti Agarwal, Liwen Hu, Evonne Ng, Trevor Darrell, Hao Li, Anna Rohrbach

In today's era of digital misinformation, we are increasingly faced with new threats posed by video falsification techniques.

Misinformation

CLIP-It! Language-Guided Video Summarization

1 code implementation NeurIPS 2021 Medhini Narasimhan, Anna Rohrbach, Trevor Darrell

A generic video summary is an abridged version of a video that conveys the whole story and features the most important scenes.

Query-focused Summarization · Video Summarization

Video Object Segmentation with Language Referring Expressions

no code implementations • 21 Mar 2018 • Anna Khoreva, Anna Rohrbach, Bernt Schiele

We show that our language-supervised approach performs on par with the methods which have access to a pixel-level mask of the target object on DAVIS'16 and is competitive to methods using scribbles on the challenging DAVIS'17 dataset.

 Ranked #1 on Video Object Segmentation on DAVIS 2017 (mIoU metric)

Object · Referring Expression Segmentation · +4

Fooling Vision and Language Models Despite Localization and Attention Mechanism

no code implementations CVPR 2018 Xiaojun Xu, Xinyun Chen, Chang Liu, Anna Rohrbach, Trevor Darrell, Dawn Song

Our work sheds new light on understanding adversarial attacks on vision systems which have a language component and shows that attention, bounding box localization, and compositional internal structures are vulnerable to adversarial attacks.

Dense Captioning · Natural Language Understanding · +2

Gradient-free Policy Architecture Search and Adaptation

no code implementations • 16 Oct 2017 • Sayna Ebrahimi, Anna Rohrbach, Trevor Darrell

We develop a method for policy architecture search and adaptation via gradient-free optimization which can learn to perform autonomous driving tasks.

Autonomous Driving · Neural Architecture Search

Generating Descriptions with Grounded and Co-Referenced People

no code implementations CVPR 2017 Anna Rohrbach, Marcus Rohrbach, Siyu Tang, Seong Joon Oh, Bernt Schiele

At training time, we first learn how to localize characters by relating their visual appearance to mentions in the descriptions via a semi-supervised approach.

Movie Description

no code implementations • 12 May 2016 • Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, Bernt Schiele

In addition we also collected and aligned movie scripts used in prior work and compare the two sources of descriptions.

Benchmarking

Recognizing Fine-Grained and Composite Activities using Hand-Centric Features and Script Data

no code implementations • 23 Feb 2015 • Marcus Rohrbach, Anna Rohrbach, Michaela Regneri, Sikandar Amin, Mykhaylo Andriluka, Manfred Pinkal, Bernt Schiele

To attack the second challenge, recognizing composite activities, we leverage the fact that these activities are compositional and that the essential components of the activities can be obtained from textual descriptions or scripts.

Activity Recognition

The Long-Short Story of Movie Description

no code implementations • 4 Jun 2015 • Anna Rohrbach, Marcus Rohrbach, Bernt Schiele

Generating descriptions for videos has many applications, including assisting blind people and enabling human-robot interaction.

Image Captioning · Sentence

A Dataset for Movie Description

no code implementations CVPR 2015 Anna Rohrbach, Marcus Rohrbach, Niket Tandon, Bernt Schiele

In this work we propose a novel dataset containing transcribed DVS (Descriptive Video Service) that is temporally aligned to full-length HD movies.

Benchmarking · Descriptive

Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation

no code implementations ACL 2019 Ronghang Hu, Daniel Fried, Anna Rohrbach, Dan Klein, Trevor Darrell, Kate Saenko

The actual grounding can connect language to the environment through multiple modalities, e.g., "stop at the door" might ground into visual objects, while "turn right" might rely only on the geometric structure of a route.

Vision and Language Navigation

More Control for Free! Image Synthesis with Semantic Diffusion Guidance

no code implementations • 10 Dec 2021 • Xihui Liu, Dong Huk Park, Samaneh Azadi, Gong Zhang, Arman Chopikyan, Yuxiao Hu, Humphrey Shi, Anna Rohrbach, Trevor Darrell

We investigate fine-grained, continuous control of this model class, and introduce a novel unified framework for semantic diffusion guidance, which allows either language or image guidance, or both.

Continuous Control · Denoising · +1

The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning

no code implementations • 10 Feb 2022 • Jack Hessel, Jena D. Hwang, Jae Sung Park, Rowan Zellers, Chandra Bhagavatula, Anna Rohrbach, Kate Saenko, Yejin Choi

We present Sherlock, an annotated corpus of 103K images for testing machine capacity for abductive reasoning beyond literal image contents.

Visual Abductive Reasoning · Visual Reasoning

Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens

no code implementations • 13 Jun 2022 • Elad Ben-Avraham, Roei Herzig, Karttikeya Mangalam, Amir Bar, Anna Rohrbach, Leonid Karlinsky, Trevor Darrell, Amir Globerson

We explore a particular instantiation of scene structure, namely a \emph{Hand-Object Graph}, consisting of hands and objects with their locations as nodes, and physical relations of contact/no-contact as edges.

Action Recognition · Video Understanding

Structured Video Tokens @ Ego4D PNR Temporal Localization Challenge 2022

no code implementations • 15 Jun 2022 • Elad Ben-Avraham, Roei Herzig, Karttikeya Mangalam, Amir Bar, Anna Rohrbach, Leonid Karlinsky, Trevor Darrell, Amir Globerson

First, as both images and videos contain structured information, we enrich a transformer model with a set of \emph{object tokens} that can be used across images and videos.

Point-of-no-return (PNR) Temporal Localization · Temporal Localization

Exposing the Limits of Video-Text Models through Contrast Sets

1 code implementation NAACL 2022 Jae Sung Park, Sheng Shen, Ali Farhadi, Trevor Darrell, Yejin Choi, Anna Rohrbach

We test the robustness of recent methods on the proposed automatic contrast sets, and compare them to additionally collected human-generated counterparts, to assess their effectiveness.

Language Modelling · Multiple-choice · +2

Shape-Guided Diffusion with Inside-Outside Attention

no code implementations • 1 Dec 2022 • Dong Huk Park, Grace Luo, Clayton Toste, Samaneh Azadi, Xihui Liu, Maka Karalashvili, Anna Rohrbach, Trevor Darrell

When manipulating an object, existing text-to-image diffusion models often ignore the shape of the object and generate content that is incorrectly scaled, cut off, or replaced with background content.

Object

Focus! Relevant and Sufficient Context Selection for News Image Captioning

no code implementations • 1 Dec 2022 • Mingyang Zhou, Grace Luo, Anna Rohrbach, Zhou Yu

In our paper, we first demonstrate that by combining more fine-grained context that captures the key named entities (obtained via an oracle) and the global context that summarizes the news, we can dramatically improve the model's ability to generate accurate news captions.

Image Captioning · Relation Extraction · +1

MammalNet: A Large-scale Video Benchmark for Mammal Recognition and Behavior Understanding

no code implementations CVPR 2023 Jun Chen, Ming Hu, Darren J. Coker, Michael L. Berumen, Blair Costelloe, Sara Beery, Anna Rohrbach, Mohamed Elhoseiny

Monitoring animal behavior can facilitate conservation efforts by providing key insights into wildlife health, population status, and ecosystem function.

Object-based (yet Class-agnostic) Video Domain Adaptation

no code implementations • 29 Nov 2023 • Dantong Niu, Amir Bar, Roei Herzig, Trevor Darrell, Anna Rohrbach

Existing video-based action recognition systems typically require dense annotation and struggle in environments where there is a significant distribution shift relative to the training data.

Action Recognition · Domain Adaptation · +1
