Search Results for author: Anna Rohrbach

Found 41 papers, 21 papers with code

Exposing the Limits of Video-Text Models through Contrast Sets

1 code implementation NAACL 2022 Jae Sung Park, Sheng Shen, Ali Farhadi, Trevor Darrell, Yejin Choi, Anna Rohrbach

We test the robustness of recent methods on the proposed automatic contrast sets, and compare them to additionally collected human-generated counterparts, to assess their effectiveness.

Language Modelling Multiple-choice +1

Structured Video Tokens @ Ego4D PNR Temporal Localization Challenge 2022

no code implementations15 Jun 2022 Elad Ben-Avraham, Roei Herzig, Karttikeya Mangalam, Amir Bar, Anna Rohrbach, Leonid Karlinsky, Trevor Darrell, Amir Globerson

First, as both images and videos contain structured information, we enrich a transformer model with a set of \emph{object tokens} that can be used across images and videos.

Temporal Localization

Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens

no code implementations13 Jun 2022 Elad Ben-Avraham, Roei Herzig, Karttikeya Mangalam, Amir Bar, Anna Rohrbach, Leonid Karlinsky, Trevor Darrell, Amir Globerson

We explore a particular instantiation of scene structure, namely a \emph{Hand-Object Graph}, consisting of hands and objects with their locations as nodes, and physical relations of contact/no-contact as edges.

Action Recognition Video Understanding

Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly

1 code implementation28 Apr 2022 Spencer Whitehead, Suzanne Petryk, Vedaad Shakib, Joseph Gonzalez, Trevor Darrell, Anna Rohrbach, Marcus Rohrbach

We first enable abstention capabilities for several VQA models, and analyze both their coverage, the portion of questions answered, and risk, the error on that portion.

Question Answering Visual Question Answering +1

K-LITE: Learning Transferable Visual Models with External Knowledge

no code implementations20 Apr 2022 Sheng Shen, Chunyuan Li, Xiaowei Hu, Yujia Xie, Jianwei Yang, Pengchuan Zhang, Anna Rohrbach, Zhe Gan, Lijuan Wang, Lu Yuan, Ce Liu, Kurt Keutzer, Trevor Darrell, Jianfeng Gao

In this paper, we propose K-LITE (Knowledge-augmented Language-Image Training and Evaluation), a simple strategy to leverage external knowledge to build transferable visual systems: In training, it enriches entities in natural language with WordNet and Wiktionary knowledge, leading to an efficient and scalable approach to learning image representations that can understand both visual concepts and their knowledge; In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts (or describe new ones) to enable zero-shot and few-shot transfer of the pre-trained models.

Image Classification object-detection +2

ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension

1 code implementation ACL 2022 Sanjay Subramanian, William Merrill, Trevor Darrell, Matt Gardner, Sameer Singh, Anna Rohrbach

Training a referring expression comprehension (ReC) model for a new visual domain requires collecting referring expressions, and potentially corresponding bounding boxes, for images in the domain.

Image Classification Referring Expression +1

The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning

no code implementations10 Feb 2022 Jack Hessel, Jena D. Hwang, Jae Sung Park, Rowan Zellers, Chandra Bhagavatula, Anna Rohrbach, Kate Saenko, Yejin Choi

We present Sherlock, an annotated corpus of 103K images for testing machine capacity for abductive reasoning beyond literal image contents.

Visual Reasoning

Watch Those Words: Video Falsification Detection Using Word-Conditioned Facial Motion

1 code implementation21 Dec 2021 Shruti Agarwal, Liwen Hu, Evonne Ng, Trevor Darrell, Hao Li, Anna Rohrbach

In today's era of digital misinformation, we are increasingly faced with new threats posed by video falsification techniques.

Misinformation

Twitter-COMMs: Detecting Climate, COVID, and Military Multimodal Misinformation

1 code implementation NAACL 2022 Giscard Biamby, Grace Luo, Trevor Darrell, Anna Rohrbach

Detecting out-of-context media, such as "mis-captioned" images on Twitter, is a relevant problem, especially in domains of high public significance.

Misinformation

Object-Region Video Transformers

no code implementations CVPR 2022 Roei Herzig, Elad Ben-Avraham, Karttikeya Mangalam, Amir Bar, Gal Chechik, Anna Rohrbach, Trevor Darrell, Amir Globerson

In this work, we present Object-Region Video Transformers (ORViT), an \emph{object-centric} approach that extends video transformer layers with a block that directly incorporates object representations.

Action Detection Few Shot Action Recognition +1

How Much Can CLIP Benefit Vision-and-Language Tasks?

3 code implementations13 Jul 2021 Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, Kurt Keutzer

Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using a relatively small set of manually-annotated data (as compared to web-crawled data), to perceive the visual world.

Ranked #6 on Visual Entailment on SNLI-VE val (using extra training data)

Question Answering Visual Entailment +1

CLIP-It! Language-Guided Video Summarization

no code implementations NeurIPS 2021 Medhini Narasimhan, Anna Rohrbach, Trevor Darrell

A generic video summary is an abridged version of a video that conveys the whole story and features the most important scenes.

Video Summarization

DETReg: Unsupervised Pretraining with Region Priors for Object Detection

1 code implementation CVPR 2022 Amir Bar, Xin Wang, Vadim Kantorov, Colorado J Reed, Roei Herzig, Gal Chechik, Anna Rohrbach, Trevor Darrell, Amir Globerson

Recent self-supervised pretraining methods for object detection largely focus on pretraining the backbone of the object detector, neglecting key parts of detection architecture.

Few-Shot Learning Few-Shot Object Detection +4

NewsCLIPpings: Automatic Generation of Out-of-Context Multimodal Media

1 code implementation EMNLP 2021 Grace Luo, Trevor Darrell, Anna Rohrbach

Online misinformation is a prevalent societal issue, with adversaries relying on tools ranging from cheap fakes to sophisticated deep fakes.

Misinformation

Compositional Video Synthesis with Action Graphs

1 code implementation27 Jun 2020 Amir Bar, Roei Herzig, Xiaolong Wang, Anna Rohrbach, Gal Chechik, Trevor Darrell, Amir Globerson

Our generative model for this task (AG2Vid) disentangles motion and appearance features, and by incorporating a scheduling mechanism for actions facilitates a timely and coordinated video generation.

Video Generation Video Prediction +1

Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation

no code implementations ACL 2019 Ronghang Hu, Daniel Fried, Anna Rohrbach, Dan Klein, Trevor Darrell, Kate Saenko

The actual grounding can connect language to the environment through multiple modalities, e. g. "stop at the door" might ground into visual objects, while "turn right" might rely only on the geometric structure of a route.

Vision and Language Navigation

Language-Conditioned Graph Networks for Relational Reasoning

1 code implementation ICCV 2019 Ronghang Hu, Anna Rohrbach, Trevor Darrell, Kate Saenko

E. g., conditioning on the "on" relationship to the plate, the object "mug" gathers messages from the object "plate" to update its representation to "mug on the plate", which can be easily consumed by a simple classifier for answer prediction.

Referring Expression Comprehension Relational Reasoning +1

Robust Change Captioning

1 code implementation ICCV 2019 Dong Huk Park, Trevor Darrell, Anna Rohrbach

We present a novel Dual Dynamic Attention Model (DUDA) to perform robust Change Captioning.

Natural Language Visual Grounding

Adversarial Inference for Multi-Sentence Video Description

1 code implementation CVPR 2019 Jae Sung Park, Marcus Rohrbach, Trevor Darrell, Anna Rohrbach

Among the main issues are the fluency and coherence of the generated descriptions, and their relevance to the video.

Image Captioning Video Description

Object Hallucination in Image Captioning

1 code implementation EMNLP 2018 Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, Kate Saenko

Despite continuously improving performance, contemporary image captioning models are prone to "hallucinating" objects that are not actually in a scene.

Image Captioning

Textual Explanations for Self-Driving Vehicles

1 code implementation ECCV 2018 Jinkyu Kim, Anna Rohrbach, Trevor Darrell, John Canny, Zeynep Akata

Finally, we explore a version of our model that generates rationalizations, and compare with introspective explanations on the same video segments.

Speaker-Follower Models for Vision-and-Language Navigation

1 code implementation NeurIPS 2018 Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, Trevor Darrell

We use this speaker model to (1) synthesize new instructions for data augmentation and to (2) implement pragmatic reasoning, which evaluates how well candidate action sequences explain an instruction.

Data Augmentation Vision and Language Navigation

Women also Snowboard: Overcoming Bias in Captioning Models

1 code implementation ECCV 2018 Kaylee Burns, Lisa Anne Hendricks, Kate Saenko, Trevor Darrell, Anna Rohrbach

We introduce a new Equalizer model that ensures equal gender probability when gender evidence is occluded in a scene and confident predictions when gender evidence is present.

Image Captioning

Video Object Segmentation with Language Referring Expressions

no code implementations21 Mar 2018 Anna Khoreva, Anna Rohrbach, Bernt Schiele

We show that our language-supervised approach performs on par with the methods which have access to a pixel-level mask of the target object on DAVIS'16 and is competitive to methods using scribbles on the challenging DAVIS'17 dataset.

Referring Expression Segmentation Semantic Segmentation +2

Gradient-free Policy Architecture Search and Adaptation

no code implementations16 Oct 2017 Sayna Ebrahimi, Anna Rohrbach, Trevor Darrell

We develop a method for policy architecture search and adaptation via gradient-free optimization which can learn to perform autonomous driving tasks.

Autonomous Driving Neural Architecture Search

Fooling Vision and Language Models Despite Localization and Attention Mechanism

no code implementations CVPR 2018 Xiaojun Xu, Xinyun Chen, Chang Liu, Anna Rohrbach, Trevor Darrell, Dawn Song

Our work sheds new light on understanding adversarial attacks on vision systems which have a language component and shows that attention, bounding box localization, and compositional internal structures are vulnerable to adversarial attacks.

Natural Language Understanding Question Answering +2

Generating Descriptions with Grounded and Co-Referenced People

no code implementations CVPR 2017 Anna Rohrbach, Marcus Rohrbach, Siyu Tang, Seong Joon Oh, Bernt Schiele

At training time, we first learn how to localize characters by relating their visual appearance to mentions in the descriptions via a semi-supervised approach.

Natural Language Processing

A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering

2 code implementations CVPR 2017 Tegan Maharaj, Nicolas Ballas, Anna Rohrbach, Aaron Courville, Christopher Pal

In addition to presenting statistics and a description of the dataset, we perform a detailed analysis of 5 different models' predictions, and compare these with human performance.

Language Modelling object-detection +2

Movie Description

no code implementations12 May 2016 Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, Bernt Schiele

In addition we also collected and aligned movie scripts used in prior work and compare the two sources of descriptions.

Grounding of Textual Phrases in Images by Reconstruction

3 code implementations12 Nov 2015 Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, Bernt Schiele

We propose a novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly.

Language Modelling Natural Language Visual Grounding +2

The Long-Short Story of Movie Description

no code implementations4 Jun 2015 Anna Rohrbach, Marcus Rohrbach, Bernt Schiele

Generating descriptions for videos has many applications including assisting blind people and human-robot interaction.

Image Captioning

Recognizing Fine-Grained and Composite Activities using Hand-Centric Features and Script Data

no code implementations23 Feb 2015 Marcus Rohrbach, Anna Rohrbach, Michaela Regneri, Sikandar Amin, Mykhaylo Andriluka, Manfred Pinkal, Bernt Schiele

To attack the second challenge, recognizing composite activities, we leverage the fact that these activities are compositional and that the essential components of the activities can be obtained from textual descriptions or scripts.

Activity Recognition

A Dataset for Movie Description

no code implementations CVPR 2015 Anna Rohrbach, Marcus Rohrbach, Niket Tandon, Bernt Schiele

In this work we propose a novel dataset which contains transcribed DVS, which is temporally aligned to full length HD movies.

Cannot find the paper you are looking for? You can Submit a new open access paper.