Search Results for author: Marcus Rohrbach

Found 69 papers, 37 papers with code

Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly

no code implementations28 Apr 2022 Spencer Whitehead, Suzanne Petryk, Vedaad Shakib, Joseph Gonzalez, Trevor Darrell, Anna Rohrbach, Marcus Rohrbach

This new problem formulation, metric, and analysis for VQA provide the groundwork for building effective and reliable VQA models that have the self-awareness to abstain if and only if they don't know the answer.

Question Answering Visual Question Answering +1

Learning To Recognize Procedural Activities with Distant Supervision

no code implementations26 Jan 2022 Xudong Lin, Fabio Petroni, Gedas Bertasius, Marcus Rohrbach, Shih-Fu Chang, Lorenzo Torresani

In this paper we consider the problem of classifying fine-grained, multi-step activities (e.g., cooking different recipes, making disparate home improvements, creating various forms of arts and crafts) from long videos spanning up to several minutes.

Action Classification Language Modelling +1

FLAVA: A Foundational Language And Vision Alignment Model

no code implementations8 Dec 2021 Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, Douwe Kiela

State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks.

Zero-shot Image Retrieval Zero-shot Text Retrieval

A New Split for Evaluating True Zero-Shot Action Recognition

no code implementations27 Jul 2021 Shreyank N Gowda, Laura Sevilla-Lara, Kiyoon Kim, Frank Keller, Marcus Rohrbach

We benchmark several recent approaches on the proposed True Zero-Shot (TruZe) Split for UCF101 and HMDB51, with zero-shot and generalized zero-shot evaluation.

Few Shot Action Recognition Zero-Shot Action Recognition +1

CLASTER: Clustering with Reinforcement Learning for Zero-Shot Action Recognition

no code implementations18 Jan 2021 Shreyank N Gowda, Laura Sevilla-Lara, Frank Keller, Marcus Rohrbach

The problem can be seen as learning a function which generalizes well to instances of unseen classes without losing discrimination between classes.

Action Recognition Generalized Zero-Shot Learning +2

SMART Frame Selection for Action Recognition

no code implementations19 Dec 2020 Shreyank N Gowda, Marcus Rohrbach, Laura Sevilla-Lara

In this work, however, we focus on the more standard short, trimmed action recognition problem.

Action Recognition Frame

Adversarial Continual Learning

1 code implementation ECCV 2020 Sayna Ebrahimi, Franziska Meier, Roberto Calandra, Trevor Darrell, Marcus Rohrbach

We show that shared features are significantly less prone to forgetting and propose a novel hybrid continual learning framework that learns a disjoint representation for task-invariant and task-specific features required to solve a sequence of tasks.

Continual Learning Image Classification

In Defense of Grid Features for Visual Question Answering

2 code implementations CVPR 2020 Huaizu Jiang, Ishan Misra, Marcus Rohrbach, Erik Learned-Miller, Xinlei Chen

Popularized as 'bottom-up' attention, bounding box (or region) based visual features have recently surpassed vanilla grid-based convolutional features as the de facto standard for vision and language tasks like visual question answering (VQA).

Image Captioning Question Answering +2

12-in-1: Multi-Task Vision and Language Representation Learning

5 code implementations CVPR 2020 Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, Stefan Lee

Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets often studied in isolation; however, the visually-grounded language understanding skills required for success at these tasks overlap significantly.

Image Retrieval Question Answering +2

Learning to Generate Grounded Visual Captions without Localization Supervision

2 code implementations1 Jun 2019 Chih-Yao Ma, Yannis Kalantidis, Ghassan AlRegib, Peter Vajda, Marcus Rohrbach, Zsolt Kira

When automatically generating a sentence description for an image or video, it often remains unclear how well the generated caption is grounded, that is whether the model uses the correct image regions to output particular words, or if the model is hallucinating based on priors in the dataset and/or the language model.

Image Captioning Language Modelling +1

Cycle-Consistency for Robust Visual Question Answering

no code implementations CVPR 2019 Meet Shah, Xinlei Chen, Marcus Rohrbach, Devi Parikh

Despite significant progress in Visual Question Answering over the years, the robustness of today's VQA models leaves much to be desired.

Question Answering Question Generation +2

Exploring the Challenges towards Lifelong Fact Learning

no code implementations26 Dec 2018 Mohamed Elhoseiny, Francesca Babiloni, Rahaf Aljundi, Marcus Rohrbach, Manohar Paluri, Tinne Tuytelaars

So far, life-long learning (LLL) has been studied in relatively small-scale and artificial setups.

Grounded Video Description

2 code implementations CVPR 2019 Luowei Zhou, Yannis Kalantidis, Xinlei Chen, Jason J. Corso, Marcus Rohrbach

Our dataset, ActivityNet-Entities, augments the challenging ActivityNet Captions dataset with 158k bounding box annotations, each grounding a noun phrase.

Video Description

Adversarial Inference for Multi-Sentence Video Description

1 code implementation CVPR 2019 Jae Sung Park, Marcus Rohrbach, Trevor Darrell, Anna Rohrbach

Among the main issues are the fluency and coherence of the generated descriptions, and their relevance to the video.

Image Captioning Video Description

Efficient Lifelong Learning with A-GEM

2 code implementations ICLR 2019 Arslan Chaudhry, Marc'Aurelio Ranzato, Marcus Rohrbach, Mohamed Elhoseiny

In lifelong learning, the learner is presented with a sequence of tasks, incrementally building a data-driven prior which may be leveraged to speed up learning of a new task.

Continual Learning
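The core of A-GEM is a single gradient projection: the gradient of the current task is kept as-is when it agrees with the average gradient computed on an episodic memory of past tasks, and is otherwise projected so it no longer increases the memory loss. A minimal NumPy sketch of that projection (the function name is illustrative, not taken from the paper's code):

```python
import numpy as np

def agem_project(g, g_ref):
    """A-GEM gradient projection.

    g     -- flattened gradient of the current task loss
    g_ref -- flattened gradient of the loss on the episodic memory

    If g . g_ref >= 0, the update does not conflict with past tasks
    and g is used unchanged; otherwise g is projected onto the
    half-space where the memory loss does not increase.
    """
    dot = g @ g_ref
    if dot >= 0:
        return g
    # Remove the component of g that points against g_ref.
    return g - (dot / (g_ref @ g_ref)) * g_ref
```

After projection, the returned gradient always satisfies `g_proj @ g_ref >= 0`, which is the constraint A-GEM enforces at every step.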

Graph-Based Global Reasoning Networks

5 code implementations CVPR 2019 Yunpeng Chen, Marcus Rohrbach, Zhicheng Yan, Shuicheng Yan, Jiashi Feng, Yannis Kalantidis

In this work, we propose a new approach for reasoning globally in which a set of features are globally aggregated over the coordinate space and then projected to an interaction space where relational reasoning can be efficiently computed.

Action Classification Action Recognition +3

A Dataset for Telling the Stories of Social Media Videos

no code implementations EMNLP 2018 Spandana Gella, Mike Lewis, Marcus Rohrbach

Video content on social media platforms constitutes a major part of the communication between people, as it allows everyone to share their stories.

Video Captioning Video Description

Uncertainty-guided Lifelong Learning in Bayesian Networks

no code implementations27 Sep 2018 Sayna Ebrahimi, Mohamed Elhoseiny, Trevor Darrell, Marcus Rohrbach

Sequential learning of tasks arriving in a continuous stream is a complex problem and becomes more challenging when the model has a fixed capacity.

Continual Learning

Pythia v0.1: the Winning Entry to the VQA Challenge 2018

7 code implementations26 Jul 2018 Yu Jiang, Vivek Natarajan, Xinlei Chen, Marcus Rohrbach, Dhruv Batra, Devi Parikh

We demonstrate that by making subtle but important changes to the model architecture and the learning rate schedule, fine-tuning image features, and adding data augmentation, we can significantly improve the performance of the up-down model on the VQA v2.0 dataset -- from 65.67% to 70.22%.

Data Augmentation Visual Question Answering +1

Selfless Sequential Learning

1 code implementation ICLR 2019 Rahaf Aljundi, Marcus Rohrbach, Tinne Tuytelaars

In particular, we propose a novel regularizer, that encourages representation sparsity by means of neural inhibition.

Large-Scale Visual Relationship Understanding

2 code implementations27 Apr 2018 Ji Zhang, Yannis Kalantidis, Marcus Rohrbach, Manohar Paluri, Ahmed Elgammal, Mohamed Elhoseiny

Large scale visual understanding is challenging, as it requires a model to handle the widely-spread and imbalanced distribution of <subject, relation, object> triples.

Memory Aware Synapses: Learning what (not) to forget

2 code implementations ECCV 2018 Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, Tinne Tuytelaars

We show state-of-the-art performance and, for the first time, the ability to adapt the importance of the parameters based on unlabeled data towards what the network needs (not) to forget, which may vary depending on test conditions.

Object Recognition
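Memory Aware Synapses estimates, from unlabeled data alone, how important each parameter is by accumulating the absolute gradient of the squared output norm, and then penalizes changes to important parameters when learning new tasks. A toy sketch for a linear model with a scalar output (names and the linear model are illustrative assumptions, not the paper's code):

```python
import numpy as np

def mas_importance(theta, inputs):
    """Toy MAS importance for a linear model F(x; theta) = theta @ x.

    Omega_i is the mean absolute gradient of ||F(x)||^2 with respect
    to parameter theta_i, averaged over unlabeled inputs -- no labels
    are needed, which is what lets MAS adapt to test conditions.
    """
    omega = np.zeros_like(theta)
    for x in inputs:
        out = theta @ x            # scalar model output
        grad = 2.0 * out * x       # d(out^2)/d(theta) for this toy model
        omega += np.abs(grad)
    return omega / len(inputs)

def mas_penalty(theta, theta_star, omega, lam=1.0):
    """Regularizer added to the new-task loss:
    lam * sum_i Omega_i * (theta_i - theta*_i)^2,
    where theta_star are the parameters after the previous task."""
    return lam * np.sum(omega * (theta - theta_star) ** 2)
```

Parameters that barely affect the output get a small Omega and remain free to change; parameters the network relies on are anchored near their previous values.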

Learning to Reason: End-to-End Module Networks for Visual Question Answering

1 code implementation ICCV 2017 Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Kate Saenko

Natural language questions are inherently compositional, and many are most easily answered by reasoning about their decomposition into modular sub-problems.

Visual Dialog Visual Question Answering

Generating Descriptions with Grounded and Co-Referenced People

no code implementations CVPR 2017 Anna Rohrbach, Marcus Rohrbach, Siyu Tang, Seong Joon Oh, Bernt Schiele

At training time, we first learn how to localize characters by relating their visual appearance to mentions in the descriptions via a semi-supervised approach.

Attentive Explanations: Justifying Decisions and Pointing to the Evidence

no code implementations14 Dec 2016 Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Bernt Schiele, Trevor Darrell, Marcus Rohrbach

In contrast, humans can justify their decisions with natural language and point to the evidence in the visual world which led to their decisions.

Decision Making Question Answering +1

Modeling Relationships in Referential Expressions with Compositional Modular Networks

2 code implementations CVPR 2017 Ronghang Hu, Marcus Rohrbach, Jacob Andreas, Trevor Darrell, Kate Saenko

In this paper we instead present a modular deep architecture capable of analyzing referential expressions into their component parts, identifying entities and relationships mentioned in the input expression and grounding them all in the scene.

Visual Question Answering

Utilizing Large Scale Vision and Text Datasets for Image Segmentation from Referring Expressions

no code implementations30 Aug 2016 Ronghang Hu, Marcus Rohrbach, Subhashini Venugopalan, Trevor Darrell

Image segmentation from referring expressions is a joint vision and language modeling task, where the input is an image and a textual expression describing a particular region in the image; and the goal is to localize and segment the specific image region based on the given expression.

Image Captioning Language Modelling +1

Captioning Images with Diverse Objects

1 code implementation CVPR 2017 Subhashini Venugopalan, Lisa Anne Hendricks, Marcus Rohrbach, Raymond Mooney, Trevor Darrell, Kate Saenko

We propose minimizing a joint objective which can learn from these diverse data sources and leverage distributional semantic embeddings, enabling the model to generalize and describe novel objects outside of image-caption datasets.

Object Recognition

Movie Description

no code implementations12 May 2016 Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, Bernt Schiele

In addition we also collected and aligned movie scripts used in prior work and compare the two sources of descriptions.

Ask Your Neurons: A Deep Learning Approach to Visual Question Answering

1 code implementation9 May 2016 Mateusz Malinowski, Marcus Rohrbach, Mario Fritz

By combining latest advances in image representation and natural language processing, we propose Ask Your Neurons, a scalable, jointly trained, end-to-end formulation to this problem.

Question Answering Visual Question Answering +1

Attributes as Semantic Units between Natural Language and Visual Recognition

no code implementations12 Apr 2016 Marcus Rohrbach

Impressive progress has been made in the fields of computer vision and natural language processing.

Generating Visual Explanations

no code implementations28 Mar 2016 Lisa Anne Hendricks, Zeynep Akata, Marcus Rohrbach, Jeff Donahue, Bernt Schiele, Trevor Darrell

Clearly explaining a rationale for a classification decision to an end-user can be as important as the decision itself.

General Classification

Segmentation from Natural Language Expressions

3 code implementations20 Mar 2016 Ronghang Hu, Marcus Rohrbach, Trevor Darrell

To produce pixelwise segmentation for the language expression, we propose an end-to-end trainable recurrent and convolutional network model that jointly learns to process visual and linguistic information.

Referring Expression Segmentation Semantic Segmentation

Sequence to Sequence - Video to Text

no code implementations ICCV 2015 Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko

Our LSTM model is trained on video-sentence pairs and learns to associate a sequence of video frames to a sequence of words in order to generate a description of the event in the video clip.

Language Modelling

Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data

1 code implementation CVPR 2016 Lisa Anne Hendricks, Subhashini Venugopalan, Marcus Rohrbach, Raymond Mooney, Kate Saenko, Trevor Darrell

Current deep caption models can only describe objects contained in paired image-sentence corpora, despite the fact that they are pre-trained with large object recognition datasets, namely ImageNet.

Image Captioning Object Recognition +1

Natural Language Object Retrieval

1 code implementation CVPR 2016 Ronghang Hu, Huazhe Xu, Marcus Rohrbach, Jiashi Feng, Kate Saenko, Trevor Darrell

In this paper, we address the task of natural language object retrieval, to localize a target object within a given image based on a natural language query of the object.

Image Captioning Image Retrieval +2

Grounding of Textual Phrases in Images by Reconstruction

3 code implementations12 Nov 2015 Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, Bernt Schiele

We propose a novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly.

Language Modelling Natural Language Visual Grounding +2

Neural Module Networks

1 code implementation CVPR 2016 Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Dan Klein

Visual question answering is fundamentally compositional in nature---a question like "where is the dog?"

Visual Question Answering VQA

Spatial Semantic Regularisation for Large Scale Object Detection

no code implementations ICCV 2015 Damian Mrowca, Marcus Rohrbach, Judy Hoffman, Ronghang Hu, Kate Saenko, Trevor Darrell

Our approach proves to be especially useful in large scale settings with thousands of classes, where spatial and semantic interactions are very frequent and only weakly supervised detectors can be built due to a lack of bounding box annotations.

Object Detection

The Long-Short Story of Movie Description

no code implementations4 Jun 2015 Anna Rohrbach, Marcus Rohrbach, Bernt Schiele

Generating descriptions for videos has many applications including assisting blind people and human-robot interaction.

Image Captioning

A Multi-scale Multiple Instance Video Description Network

no code implementations21 May 2015 Huijuan Xu, Subhashini Venugopalan, Vasili Ramanishka, Marcus Rohrbach, Kate Saenko

Most state-of-the-art methods for solving this problem borrow existing deep convolutional neural network (CNN) architectures (AlexNet, GoogLeNet) to extract a visual representation of the input video.

Frame Multiple Instance Learning +2

Ask Your Neurons: A Neural-based Approach to Answering Questions about Images

no code implementations ICCV 2015 Mateusz Malinowski, Marcus Rohrbach, Mario Fritz

In contrast to previous efforts, we are facing a multi-modal problem where the language output (answer) is conditioned on visual and natural language input (image and question).

Question Answering

Sequence to Sequence -- Video to Text

4 code implementations3 May 2015 Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko

Our LSTM model is trained on video-sentence pairs and learns to associate a sequence of video frames to a sequence of words in order to generate a description of the event in the video clip.

Language Modelling

Recognizing Fine-Grained and Composite Activities using Hand-Centric Features and Script Data

no code implementations23 Feb 2015 Marcus Rohrbach, Anna Rohrbach, Michaela Regneri, Sikandar Amin, Mykhaylo Andriluka, Manfred Pinkal, Bernt Schiele

To attack the second challenge, recognizing composite activities, we leverage the fact that these activities are compositional and that the essential components of the activities can be obtained from textual descriptions or scripts.

Activity Recognition

A Dataset for Movie Description

no code implementations CVPR 2015 Anna Rohrbach, Marcus Rohrbach, Niket Tandon, Bernt Schiele

In this work we propose a novel dataset which contains transcribed DVS, temporally aligned to full-length HD movies.

Long-term Recurrent Convolutional Networks for Visual Recognition and Description

7 code implementations CVPR 2015 Jeff Donahue, Lisa Anne Hendricks, Marcus Rohrbach, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, Trevor Darrell

Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent, or "temporally deep", are effective for tasks involving sequences, visual and otherwise.

Video Recognition

Transfer Learning in a Transductive Setting

no code implementations NeurIPS 2013 Marcus Rohrbach, Sandra Ebert, Bernt Schiele

Our approach consistently outperforms state-of-the-art transfer and semi-supervised approaches on all datasets.

Activity Recognition Few-Shot Learning +2

Grounding Action Descriptions in Videos

no code implementations TACL 2013 Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, Manfred Pinkal

Recent work has shown that the integration of visual information into text-based models can substantially improve model predictions, but so far only visual information extracted from static images has been used.

Semantic Textual Similarity Video Understanding
