Search Results for author: Marcus Rohrbach

Found 73 papers, 43 papers with code

Pythia v0.1: the Winning Entry to the VQA Challenge 2018

9 code implementations 26 Jul 2018 Yu Jiang, Vivek Natarajan, Xinlei Chen, Marcus Rohrbach, Dhruv Batra, Devi Parikh

We demonstrate that by making subtle but important changes to the model architecture and the learning rate schedule, fine-tuning image features, and adding data augmentation, we can significantly improve the performance of the up-down model on the VQA v2.0 dataset -- from 65.67% to 70.22%.

Data Augmentation Visual Question Answering (VQA)
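
The entry credits part of the gain to the learning rate schedule. As a rough illustration of what a warmup-then-decay schedule looks like in code (the function name and all constants below are invented for illustration, not Pythia's actual values):

```python
def warmup_step_lr(step, base_lr=0.01, warmup_steps=1000,
                   decay_steps=(14000, 18000), gamma=0.1):
    """Hypothetical warmup-then-step schedule: ramp the learning rate
    up linearly, then cut it by `gamma` at each milestone."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps      # linear warmup from 0
    drops = sum(step >= s for s in decay_steps)   # milestones already passed
    return base_lr * (gamma ** drops)
```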

Memory Aware Synapses: Learning what (not) to forget

3 code implementations ECCV 2018 Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, Tinne Tuytelaars

We show state-of-the-art performance and, for the first time, the ability to adapt the importance of the parameters based on unlabeled data towards what the network needs (not) to forget, which may vary depending on test conditions.

Object Recognition
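
A minimal sketch of the core idea described above: parameter importance is estimated from unlabeled data as the gradient magnitude of the squared output norm (PyTorch-style; the function name and loader interface are assumptions, not the authors' code):

```python
import torch

def mas_importance(model, unlabeled_loader):
    """Accumulate |d ||f(x)||^2 / d theta| over unlabeled inputs; larger
    values mark parameters the network relies on and should not forget."""
    omega = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    count = 0
    for x in unlabeled_loader:
        model.zero_grad()
        model(x).pow(2).sum().backward()   # no labels needed
        for n, p in model.named_parameters():
            if p.grad is not None:
                omega[n] += p.grad.abs()
        count += 1
    return {n: w / max(count, 1) for n, w in omega.items()}
```

At a task switch, these weights would scale a quadratic penalty that anchors important parameters near their previous values.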

Efficient Lifelong Learning with A-GEM

2 code implementations ICLR 2019 Arslan Chaudhry, Marc'Aurelio Ranzato, Marcus Rohrbach, Mohamed Elhoseiny

In lifelong learning, the learner is presented with a sequence of tasks, incrementally building a data-driven prior which may be leveraged to speed up learning of a new task.

Class Incremental Learning
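
The A-GEM update itself is compact; below is a sketch with flattened NumPy gradients (the function name is mine): if the proposed gradient conflicts with the average gradient on the episodic memory, its conflicting component is removed.

```python
import numpy as np

def agem_update(g, g_ref):
    """Return the A-GEM-projected gradient: keep g when it does not
    increase memory loss (g . g_ref >= 0); otherwise subtract its
    component along the reference gradient g_ref."""
    dot = float(np.dot(g, g_ref))
    if dot >= 0:
        return g
    return g - (dot / float(np.dot(g_ref, g_ref))) * g_ref
```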

FLAVA: A Foundational Language And Vision Alignment Model

3 code implementations CVPR 2022 Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, Douwe Kiela

State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks.

Image Retrieval Image-to-Text Retrieval +3

12-in-1: Multi-Task Vision and Language Representation Learning

5 code implementations CVPR 2020 Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, Stefan Lee

Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets often studied in isolation; however, the visually-grounded language understanding skills required for success at these tasks overlap significantly.

Image Retrieval Question Answering +3

Modeling Relationships in Referential Expressions with Compositional Modular Networks

2 code implementations CVPR 2017 Ronghang Hu, Marcus Rohrbach, Jacob Andreas, Trevor Darrell, Kate Saenko

In this paper we instead present a modular deep architecture capable of analyzing referential expressions into their component parts, identifying entities and relationships mentioned in the input expression and grounding them all in the scene.

Visual Question Answering (VQA)

Neural Module Networks

1 code implementation CVPR 2016 Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Dan Klein

Visual question answering is fundamentally compositional in nature---a question like "where is the dog?"

Visual Question Answering

In Defense of Grid Features for Visual Question Answering

2 code implementations CVPR 2020 Huaizu Jiang, Ishan Misra, Marcus Rohrbach, Erik Learned-Miller, Xinlei Chen

Popularized as 'bottom-up' attention, bounding box (or region) based visual features have recently surpassed vanilla grid-based convolutional features as the de facto standard for vision and language tasks like visual question answering (VQA).

Image Captioning Question Answering +1

Graph-Based Global Reasoning Networks

9 code implementations CVPR 2019 Yunpeng Chen, Marcus Rohrbach, Zhicheng Yan, Shuicheng Yan, Jiashi Feng, Yannis Kalantidis

In this work, we propose a new approach for reasoning globally in which a set of features are globally aggregated over the coordinate space and then projected to an interaction space where relational reasoning can be efficiently computed.

Action Classification Action Recognition +4
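
As a loose sketch of that aggregate-reason-reproject pattern (PyTorch; the layer sizes, names, and residual fusion are my assumptions, and the paper's normalization details are omitted):

```python
import torch
import torch.nn as nn

class GlobalReasoningUnit(nn.Module):
    """Project coordinate-space features to a small interaction space,
    reason over a fully connected graph there, then project back."""
    def __init__(self, channels, nodes=16, node_dim=32):
        super().__init__()
        self.proj = nn.Conv2d(channels, nodes, 1)      # soft node assignment
        self.reduce = nn.Conv2d(channels, node_dim, 1)
        self.gcn = nn.Sequential(                      # reasoning across nodes
            nn.Conv1d(nodes, nodes, 1),
            nn.ReLU(),
            nn.Conv1d(nodes, nodes, 1),
        )
        self.expand = nn.Conv2d(node_dim, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        assign = self.proj(x).flatten(2)               # (B, N, HW)
        feats = self.reduce(x).flatten(2)              # (B, D, HW)
        nodes = assign @ feats.transpose(1, 2)         # (B, N, D): aggregate
        nodes = self.gcn(nodes)                        # relational reasoning
        back = assign.transpose(1, 2) @ nodes          # (B, HW, D): reproject
        back = back.transpose(1, 2).reshape(b, -1, h, w)
        return x + self.expand(back)                   # residual fusion
```

For example, `GlobalReasoningUnit(256)(torch.randn(2, 256, 14, 14))` returns a tensor of the same shape, so the unit can be dropped into an existing backbone.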

Grounded Video Description

2 code implementations CVPR 2019 Luowei Zhou, Yannis Kalantidis, Xinlei Chen, Jason J. Corso, Marcus Rohrbach

Our dataset, ActivityNet-Entities, augments the challenging ActivityNet Captions dataset with 158k bounding box annotations, each grounding a noun phrase.

Sentence Video Description

Adversarial Continual Learning

1 code implementation ECCV 2020 Sayna Ebrahimi, Franziska Meier, Roberto Calandra, Trevor Darrell, Marcus Rohrbach

We show that shared features are significantly less prone to forgetting and propose a novel hybrid continual learning framework that learns a disjoint representation for task-invariant and task-specific features required to solve a sequence of tasks.

Continual Learning Image Classification

Grounding of Textual Phrases in Images by Reconstruction

3 code implementations 12 Nov 2015 Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, Bernt Schiele

We propose a novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly.

Language Modelling Natural Language Visual Grounding +2
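
A toy rendering of the attend-then-reconstruct loop (PyTorch; the module name, sizes, and bilinear scorer are illustrative choices, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class AttendAndReconstruct(nn.Module):
    """Score region proposals against the encoded phrase, pool visual
    features with the (latent) attention weights, and decode the phrase
    back from the attended features as the training signal."""
    def __init__(self, vis_dim=2048, txt_dim=512, vocab=10000):
        super().__init__()
        self.score = nn.Bilinear(vis_dim, txt_dim, 1)
        self.decoder = nn.LSTM(vis_dim, txt_dim, batch_first=True)
        self.word_out = nn.Linear(txt_dim, vocab)

    def forward(self, region_feats, phrase_enc, phrase_len):
        # region_feats: (B, R, vis_dim); phrase_enc: (B, txt_dim)
        q = phrase_enc.unsqueeze(1).expand(-1, region_feats.size(1), -1)
        attn = self.score(region_feats, q).softmax(dim=1)    # (B, R, 1)
        attended = (attn * region_feats).sum(dim=1)          # (B, vis_dim)
        # Feed the attended feature to the decoder at every step.
        dec_in = attended.unsqueeze(1).repeat(1, phrase_len, 1)
        out, _ = self.decoder(dec_in)
        return self.word_out(out), attn    # phrase logits + grounding weights
```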

Learning to Generate Grounded Visual Captions without Localization Supervision

2 code implementations 1 Jun 2019 Chih-Yao Ma, Yannis Kalantidis, Ghassan AlRegib, Peter Vajda, Marcus Rohrbach, Zsolt Kira

When automatically generating a sentence description for an image or video, it often remains unclear how well the generated caption is grounded, that is, whether the model uses the correct image regions to output particular words or is hallucinating based on priors in the dataset and/or the language model.

Image Captioning Language Modelling +2

Ask Your Neurons: A Deep Learning Approach to Visual Question Answering

1 code implementation 9 May 2016 Mateusz Malinowski, Marcus Rohrbach, Mario Fritz

By combining latest advances in image representation and natural language processing, we propose Ask Your Neurons, a scalable, jointly trained, end-to-end formulation to this problem.

Question Answering Visual Question Answering

Large-Scale Visual Relationship Understanding

2 code implementations 27 Apr 2018 Ji Zhang, Yannis Kalantidis, Marcus Rohrbach, Manohar Paluri, Ahmed Elgammal, Mohamed Elhoseiny

Large scale visual understanding is challenging, as it requires a model to handle the widely-spread and imbalanced distribution of <subject, relation, object> triples.

Relationship Detection

Natural Language Object Retrieval

1 code implementation CVPR 2016 Ronghang Hu, Huazhe Xu, Marcus Rohrbach, Jiashi Feng, Kate Saenko, Trevor Darrell

In this paper, we address the task of natural language object retrieval, to localize a target object within a given image based on a natural language query of the object.

Image Captioning Image Retrieval +4

Long-term Recurrent Convolutional Networks for Visual Recognition and Description

7 code implementations CVPR 2015 Jeff Donahue, Lisa Anne Hendricks, Marcus Rohrbach, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, Trevor Darrell

Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent, or "temporally deep", are effective for tasks involving sequences, visual and otherwise.

Retrieval Video Recognition

Segmentation from Natural Language Expressions

4 code implementations 20 Mar 2016 Ronghang Hu, Marcus Rohrbach, Trevor Darrell

To produce pixelwise segmentation for the language expression, we propose an end-to-end trainable recurrent and convolutional network model that jointly learns to process visual and linguistic information.

Referring Expression Segmentation Segmentation +1
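
A compact sketch of such a jointly trained recurrent-plus-convolutional model (PyTorch; the tile-and-concatenate design follows the description above, but all names and sizes are mine):

```python
import torch
import torch.nn as nn

class PhraseSegmenter(nn.Module):
    """Encode the expression with an LSTM, tile the sentence vector over
    the spatial feature map, and classify each location as inside or
    outside the referred region."""
    def __init__(self, vocab=10000, txt_dim=256, vis_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, txt_dim)
        self.lstm = nn.LSTM(txt_dim, txt_dim, batch_first=True)
        self.classify = nn.Conv2d(vis_dim + txt_dim, 1, 1)

    def forward(self, conv_feats, tokens):
        # conv_feats: (B, vis_dim, H, W); tokens: (B, T) word ids
        _, (h, _) = self.lstm(self.embed(tokens))
        b, _, hh, ww = conv_feats.shape
        txt = h[-1][:, :, None, None].expand(b, -1, hh, ww)
        return self.classify(torch.cat([conv_feats, txt], dim=1))  # mask logits
```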

Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data

1 code implementation CVPR 2016 Lisa Anne Hendricks, Subhashini Venugopalan, Marcus Rohrbach, Raymond Mooney, Kate Saenko, Trevor Darrell

Current deep caption models can only describe objects contained in paired image-sentence corpora, despite the fact that they are pre-trained with large object recognition datasets, namely ImageNet.

Image Captioning Novel Concepts +3

Sequence to Sequence -- Video to Text

4 code implementations 3 May 2015 Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko

Our LSTM model is trained on video-sentence pairs and learns to associate a sequence of video frames to a sequence of words in order to generate a description of the event in the video clip.

Language Modelling Sentence
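
A minimal encoder-decoder sketch in the same spirit: one LSTM first reads the frame features, then emits words conditioned on that state (the actual model stacks two LSTMs and handles padding; sizes here are illustrative):

```python
import torch
import torch.nn as nn

class SeqToSeqVideoCaptioner(nn.Module):
    """Consume a frame-feature sequence, then decode a word sequence."""
    def __init__(self, feat_dim=4096, hidden=512, vocab=10000):
        super().__init__()
        self.frame_proj = nn.Linear(feat_dim, hidden)
        self.embed = nn.Embedding(vocab, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.word_out = nn.Linear(hidden, vocab)

    def forward(self, frame_feats, captions):
        # Encoding stage: read the frame sequence into the LSTM state.
        _, state = self.lstm(self.frame_proj(frame_feats))
        # Decoding stage: predict each next word given the previous ones.
        dec_out, _ = self.lstm(self.embed(captions), state)
        return self.word_out(dec_out)   # (B, T, vocab) logits
```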

Learning To Recognize Procedural Activities with Distant Supervision

1 code implementation CVPR 2022 Xudong Lin, Fabio Petroni, Gedas Bertasius, Marcus Rohrbach, Shih-Fu Chang, Lorenzo Torresani

In this paper we consider the problem of classifying fine-grained, multi-step activities (e.g., cooking different recipes, making disparate home improvements, creating various forms of arts and crafts) from long videos spanning up to several minutes.

Action Classification Language Modelling +1

Adversarial Inference for Multi-Sentence Video Description

1 code implementation CVPR 2019 Jae Sung Park, Marcus Rohrbach, Trevor Darrell, Anna Rohrbach

Among the main issues are the fluency and coherence of the generated descriptions, and their relevance to the video.

Image Captioning Sentence +1

Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly

1 code implementation 28 Apr 2022 Spencer Whitehead, Suzanne Petryk, Vedaad Shakib, Joseph Gonzalez, Trevor Darrell, Anna Rohrbach, Marcus Rohrbach

We first enable abstention capabilities for several VQA models, and analyze both their coverage (the portion of questions answered) and risk (the error on that portion).

Question Answering Visual Question Answering
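
Coverage and risk, as used above, reduce to two ratios; a small helper (names are mine) makes the definitions concrete:

```python
def coverage_and_risk(answered, correct):
    """Selective-prediction metrics: coverage is the fraction of
    questions the model chooses to answer; risk is the error rate on
    that answered subset. Inputs are parallel boolean lists."""
    n_answered = sum(answered)
    coverage = n_answered / len(answered)
    errors = sum(a and not c for a, c in zip(answered, correct))
    risk = errors / n_answered if n_answered else 0.0
    return coverage, risk
```

For example, answering 8 of 10 questions with 2 errors among the answered ones gives coverage 0.8 and risk 0.25.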

Selfless Sequential Learning

1 code implementation ICLR 2019 Rahaf Aljundi, Marcus Rohrbach, Tinne Tuytelaars

In particular, we propose a novel regularizer that encourages representation sparsity by means of neural inhibition.
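
As a heavily simplified sketch of sparsity-by-inhibition (this drops the paper's locality weighting and scaling; it is just a plain co-activation penalty on one layer's activations):

```python
import torch

def neural_inhibition_penalty(h):
    """Penalize neurons that fire together, so each input excites only
    a few units. h: (batch, units) pre-activation outputs."""
    h = torch.relu(h)                          # active units only
    gram = h.t() @ h / h.size(0)               # co-activation matrix
    off_diag = gram - torch.diag(torch.diag(gram))
    return off_diag.abs().sum()                # add to the task loss, scaled
```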

Captioning Images with Diverse Objects

1 code implementation CVPR 2017 Subhashini Venugopalan, Lisa Anne Hendricks, Marcus Rohrbach, Raymond Mooney, Trevor Darrell, Kate Saenko

We propose minimizing a joint objective which can learn from these diverse data sources and leverage distributional semantic embeddings, enabling the model to generalize and describe novel objects outside of image-caption datasets.

Object Object Recognition

A New Split for Evaluating True Zero-Shot Action Recognition

1 code implementation 27 Jul 2021 Shreyank N Gowda, Laura Sevilla-Lara, Kiyoon Kim, Frank Keller, Marcus Rohrbach

We benchmark several recent approaches on the proposed True Zero-Shot (TruZe) Split for UCF101 and HMDB51, with zero-shot and generalized zero-shot evaluation.

Few-Shot Action Recognition +2

Attentive Explanations: Justifying Decisions and Pointing to the Evidence

no code implementations 14 Dec 2016 Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Bernt Schiele, Trevor Darrell, Marcus Rohrbach

In contrast, humans can justify their decisions with natural language and point to the evidence in the visual world which led to their decisions.

Decision Making Question Answering +2

Generating Descriptions with Grounded and Co-Referenced People

no code implementations CVPR 2017 Anna Rohrbach, Marcus Rohrbach, Siyu Tang, Seong Joon Oh, Bernt Schiele

At training time, we first learn how to localize characters by relating their visual appearance to mentions in the descriptions via a semi-supervised approach.

Utilizing Large Scale Vision and Text Datasets for Image Segmentation from Referring Expressions

no code implementations 30 Aug 2016 Ronghang Hu, Marcus Rohrbach, Subhashini Venugopalan, Trevor Darrell

Image segmentation from referring expressions is a joint vision and language modeling task, where the input is an image and a textual expression describing a particular region in the image, and the goal is to localize and segment the specific image region based on the given expression.

Image Captioning Image Segmentation +3

Movie Description

no code implementations 12 May 2016 Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, Bernt Schiele

In addition, we collected and aligned movie scripts used in prior work and compared the two sources of descriptions.

Benchmarking

Attributes as Semantic Units between Natural Language and Visual Recognition

no code implementations 12 Apr 2016 Marcus Rohrbach

Impressive progress has been made in the fields of computer vision and natural language processing.

Sentence

Generating Visual Explanations

no code implementations 28 Mar 2016 Lisa Anne Hendricks, Zeynep Akata, Marcus Rohrbach, Jeff Donahue, Bernt Schiele, Trevor Darrell

Clearly explaining a rationale for a classification decision to an end-user can be as important as the decision itself.

General Classification Sentence +1

A Multi-scale Multiple Instance Video Description Network

no code implementations 21 May 2015 Huijuan Xu, Subhashini Venugopalan, Vasili Ramanishka, Marcus Rohrbach, Kate Saenko

Most state-of-the-art methods for solving this problem borrow existing deep convolutional neural network (CNN) architectures (AlexNet, GoogLeNet) to extract a visual representation of the input video.

Image Segmentation Multiple Instance Learning +3

Recognizing Fine-Grained and Composite Activities using Hand-Centric Features and Script Data

no code implementations 23 Feb 2015 Marcus Rohrbach, Anna Rohrbach, Michaela Regneri, Sikandar Amin, Mykhaylo Andriluka, Manfred Pinkal, Bernt Schiele

To attack the second challenge, recognizing composite activities, we leverage the fact that these activities are compositional and that the essential components of the activities can be obtained from textual descriptions or scripts.

Activity Recognition

Spatial Semantic Regularisation for Large Scale Object Detection

no code implementations ICCV 2015 Damian Mrowca, Marcus Rohrbach, Judy Hoffman, Ronghang Hu, Kate Saenko, Trevor Darrell

Our approach proves to be especially useful in large scale settings with thousands of classes, where spatial and semantic interactions are very frequent and only weakly supervised detectors can be built due to a lack of bounding box annotations.

Clustering Object +2

Ask Your Neurons: A Neural-based Approach to Answering Questions about Images

no code implementations ICCV 2015 Mateusz Malinowski, Marcus Rohrbach, Mario Fritz

In contrast to previous efforts, we are facing a multi-modal problem where the language output (answer) is conditioned on visual and natural language input (image and question).

Question Answering

The Long-Short Story of Movie Description

no code implementations 4 Jun 2015 Anna Rohrbach, Marcus Rohrbach, Bernt Schiele

Generating descriptions for videos has many applications including assisting blind people and human-robot interaction.

Image Captioning Sentence

A Dataset for Movie Description

no code implementations CVPR 2015 Anna Rohrbach, Marcus Rohrbach, Niket Tandon, Bernt Schiele

In this work we propose a novel dataset which contains transcribed DVS (descriptive video service), temporally aligned to full-length HD movies.

Benchmarking Descriptive

A Dataset for Telling the Stories of Social Media Videos

no code implementations EMNLP 2018 Spandana Gella, Mike Lewis, Marcus Rohrbach

Video content on social media platforms constitutes a major part of the communication between people, as it allows everyone to share their stories.

Sentence Video Captioning +1

Transfer Learning in a Transductive Setting

no code implementations NeurIPS 2013 Marcus Rohrbach, Sandra Ebert, Bernt Schiele

Our approach consistently outperforms state-of-the-art transfer and semi-supervised approaches on all datasets.

Activity Recognition Attribute +3

Exploring the Challenges towards Lifelong Fact Learning

no code implementations 26 Dec 2018 Mohamed Elhoseiny, Francesca Babiloni, Rahaf Aljundi, Marcus Rohrbach, Manohar Paluri, Tinne Tuytelaars

So far, life-long learning (LLL) has been studied in relatively small-scale and artificial setups.

Grounding Action Descriptions in Videos

no code implementations TACL 2013 Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, Manfred Pinkal

Recent work has shown that the integration of visual information into text-based models can substantially improve model predictions, but so far only visual information extracted from static images has been used.

Semantic Textual Similarity Video Understanding

Cycle-Consistency for Robust Visual Question Answering

no code implementations CVPR 2019 Meet Shah, Xinlei Chen, Marcus Rohrbach, Devi Parikh

Despite significant progress in Visual Question Answering over the years, the robustness of today's VQA models leaves much to be desired.

Question Answering Question Generation +2

SMART Frame Selection for Action Recognition

no code implementations 19 Dec 2020 Shreyank N Gowda, Marcus Rohrbach, Laura Sevilla-Lara

In this work, however, we focus on the more standard short, trimmed action recognition problem.

Action Recognition

CLASTER: Clustering with Reinforcement Learning for Zero-Shot Action Recognition

no code implementations 18 Jan 2021 Shreyank N Gowda, Laura Sevilla-Lara, Frank Keller, Marcus Rohrbach

The problem can be seen as learning a function which generalizes well to instances of unseen classes without losing discrimination between classes.

Action Recognition Clustering +4

Uncertainty-guided Lifelong Learning in Bayesian Networks

no code implementations 27 Sep 2018 Sayna Ebrahimi, Mohamed Elhoseiny, Trevor Darrell, Marcus Rohrbach

Sequential learning of tasks arriving in a continuous stream is a complex problem and becomes more challenging when the model has a fixed capacity.

Continual Learning

Efficient Pre-training for Localized Instruction Generation of Videos

no code implementations 27 Nov 2023 Anil Batra, Davide Moltisanti, Laura Sevilla-Lara, Marcus Rohrbach, Frank Keller

Understanding such videos is challenging, involving the precise localization of steps and the generation of textual instructions.
