1 code implementation • 13 Dec 2024 • Tobias Braun, Mark Rothermel, Marcus Rohrbach, Anna Rohrbach
The proliferation of disinformation presents a growing threat to societal trust and democracy, necessitating robust and scalable Fact-Checking systems.
no code implementations • 2024 • Mark Rothermel, Tobias Braun, Marcus Rohrbach, Anna Rohrbach
The spread of disinformation poses a global threat to democratic societies, necessitating robust and scalable Automated Fact-Checking (AFC) systems.
Ranked #3 on Fact Checking on AVeriTeC
1 code implementation • 26 Jun 2024 • Boris Meinardus, Anil Batra, Anna Rohrbach, Marcus Rohrbach
We achieve a new state-of-the-art in moment retrieval on the widely used benchmarks Charades-STA, QVHighlights, and ActivityNet Captions.
1 code implementation • 27 Nov 2023 • Anil Batra, Davide Moltisanti, Laura Sevilla-Lara, Marcus Rohrbach, Frank Keller
To mitigate these issues, we propose a novel technique, Sieve-&-Swap, to automatically generate high-quality training data for the recipe domain: (i) Sieve: filters irrelevant transcripts and (ii) Swap: acquires high-quality text by replacing transcripts with human-written instruction from a text-only recipe dataset.
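A minimal sketch of what a Sieve-&-Swap style curation step could look like in Python; the embedding function, similarity threshold, and data fields below are hypothetical placeholders and not the paper's actual pipeline.

```python
# Hypothetical sketch of a Sieve-&-Swap style data-curation step.
# `embed` and `sim_threshold` are placeholder assumptions, not the paper's implementation.
from dataclasses import dataclass

@dataclass
class Segment:
    transcript: str   # ASR transcript of a video segment
    start: float
    end: float

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return num / den if den else 0.0

def sieve_and_swap(segments, recipe_steps, embed, sim_threshold=0.5):
    """Sieve: drop transcripts that do not match any written recipe step.
    Swap: replace the remaining transcripts with the closest human-written step."""
    curated = []
    step_vecs = [embed(s) for s in recipe_steps]
    for seg in segments:
        sims = [cosine(embed(seg.transcript), sv) for sv in step_vecs]
        best = max(range(len(sims)), key=lambda i: sims[i])
        if sims[best] < sim_threshold:
            continue  # Sieve: irrelevant chatter is filtered out
        curated.append(Segment(recipe_steps[best], seg.start, seg.end))  # Swap
    return curated
```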
1 code implementation • CVPR 2023 • Corentin Dancette, Spencer Whitehead, Rishabh Maheshwary, Ramakrishna Vedantam, Stefan Scherer, Xinlei Chen, Matthieu Cord, Marcus Rohrbach
In this work, we explore Selective VQA in both in-distribution (ID) and OOD scenarios, where models are presented with mixtures of ID and OOD data.
no code implementations • 11 May 2023 • Suzanne Petryk, Spencer Whitehead, Joseph E. Gonzalez, Trevor Darrell, Anna Rohrbach, Marcus Rohrbach
The ability to judge whether a caption correctly describes an image is a critical part of vision-language understanding.
Ranked #62 on Visual Reasoning on Winoground
no code implementations • 9 Jun 2022 • Shreyank N Gowda, Marcus Rohrbach, Frank Keller, Laura Sevilla-Lara
We propose to learn what makes a good video for action recognition and select only high-quality samples for augmentation.
Ranked #2 on Few Shot Action Recognition on HMDB51
1 code implementation • 28 Apr 2022 • Spencer Whitehead, Suzanne Petryk, Vedaad Shakib, Joseph Gonzalez, Trevor Darrell, Anna Rohrbach, Marcus Rohrbach
We first enable abstention capabilities for several VQA models, and analyze both their coverage, the portion of questions answered, and risk, the error on that portion.
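Coverage and risk, as defined in the line above, can be computed in a few lines; this is a generic sketch of selective-prediction metrics, not the paper's evaluation code.

```python
# Generic sketch of coverage/risk for a selective (abstaining) VQA model.
# Each prediction carries a confidence score; the model abstains below `threshold`.
def coverage_and_risk(predictions, threshold):
    """predictions: list of (confidence, is_correct) tuples."""
    answered = [(c, ok) for c, ok in predictions if c >= threshold]
    if not answered:
        return 0.0, 0.0
    coverage = len(answered) / len(predictions)                      # portion of questions answered
    risk = sum(1 for _, ok in answered if not ok) / len(answered)    # error on that portion
    return coverage, risk

# Example: sweep thresholds to trace a risk-coverage curve.
preds = [(0.9, True), (0.8, True), (0.6, False), (0.3, False)]
for t in (0.0, 0.5, 0.85):
    print(t, coverage_and_risk(preds, t))
```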
1 code implementation • CVPR 2022 • Xudong Lin, Fabio Petroni, Gedas Bertasius, Marcus Rohrbach, Shih-Fu Chang, Lorenzo Torresani
In this paper we consider the problem of classifying fine-grained, multi-step activities (e.g., cooking different recipes, making disparate home improvements, creating various forms of arts and crafts) from long videos spanning up to several minutes.
Ranked #4 on Video Classification on COIN
4 code implementations • CVPR 2022 • Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, Douwe Kiela
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks.
Ranked #4 on Image Retrieval on MS COCO
1 code implementation • 27 Jul 2021 • Shreyank N Gowda, Laura Sevilla-Lara, Kiyoon Kim, Frank Keller, Marcus Rohrbach
We benchmark several recent approaches on the proposed True Zero-Shot (TruZe) Split for UCF101 and HMDB51, with zero-shot and generalized zero-shot evaluation.
no code implementations • 18 Jan 2021 • Shreyank N Gowda, Laura Sevilla-Lara, Frank Keller, Marcus Rohrbach
The problem can be seen as learning a function which generalizes well to instances of unseen classes without losing discrimination between classes.
Ranked #2 on Zero-Shot Action Recognition on Olympics
no code implementations • CVPR 2021 • Kenneth Marino, Xinlei Chen, Devi Parikh, Abhinav Gupta, Marcus Rohrbach
One of the most challenging question types in VQA is when answering the question requires outside knowledge not present in the image.
Ranked #8 on Visual Question Answering (VQA) on A-OKVQA
no code implementations • 19 Dec 2020 • Shreyank N Gowda, Marcus Rohrbach, Laura Sevilla-Lara
In this work, however, we focus on the more standard short, trimmed action recognition problem.
Ranked #6 on Action Recognition on UCF101
1 code implementation • ICLR 2021 • Sayna Ebrahimi, Suzanne Petryk, Akash Gokul, William Gan, Joseph E. Gonzalez, Marcus Rohrbach, Trevor Darrell
The goal of continual learning (CL) is to learn a sequence of tasks without suffering from the phenomenon of catastrophic forgetting.
no code implementations • ECCV 2020 • Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, Amanpreet Singh
Image descriptions can help visually impaired people to quickly understand the image content.
1 code implementation • ECCV 2020 • Sayna Ebrahimi, Franziska Meier, Roberto Calandra, Trevor Darrell, Marcus Rohrbach
We show that shared features are significantly less prone to forgetting and propose a novel hybrid continual learning framework that learns a disjoint representation for task-invariant and task-specific features required to solve a sequence of tasks.
2 code implementations • CVPR 2020 • Huaizu Jiang, Ishan Misra, Marcus Rohrbach, Erik Learned-Miller, Xinlei Chen
Popularized as 'bottom-up' attention, bounding box (or region) based visual features have recently surpassed vanilla grid-based convolutional features as the de facto standard for vision and language tasks like visual question answering (VQA).
Ranked #18 on Visual Question Answering (VQA) on VQA v2 test-std
5 code implementations • CVPR 2020 • Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, Stefan Lee
Much of vision-and-language research focuses on a small but diverse set of independent tasks and supporting datasets often studied in isolation; however, the visually-grounded language understanding skills required for success at these tasks overlap significantly.
Ranked #1 on 10-shot image generation on MEAD
1 code implementation • CVPR 2020 • Ronghang Hu, Amanpreet Singh, Trevor Darrell, Marcus Rohrbach
Recent work has explored the TextVQA task that requires reading and understanding text in images to answer a question.
4 code implementations • ICLR 2020 • Bingyi Kang, Saining Xie, Marcus Rohrbach, Zhicheng Yan, Albert Gordo, Jiashi Feng, Yannis Kalantidis
The long-tail distribution of the visual world poses great challenges for deep learning based classification models in handling the class imbalance problem.
Ranked #3 on Long-tail learning with class descriptors on CUB-LT
2 code implementations • ICLR 2020 • Sayna Ebrahimi, Mohamed Elhoseiny, Trevor Darrell, Marcus Rohrbach
Continual learning aims to learn new tasks without forgetting previously learned ones.
2 code implementations • 1 Jun 2019 • Chih-Yao Ma, Yannis Kalantidis, Ghassan AlRegib, Peter Vajda, Marcus Rohrbach, Zsolt Kira
When automatically generating a sentence description for an image or video, it often remains unclear how well the generated caption is grounded, that is, whether the model uses the correct image regions to output particular words, or whether the model is hallucinating based on priors in the dataset and/or the language model.
7 code implementations • CVPR 2019 • Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, Marcus Rohrbach
We show that LoRRA outperforms existing state-of-the-art VQA models on our TextVQA dataset.
Ranked #3 on Visual Question Answering (VQA) on VizWiz 2018
28 code implementations • ICCV 2019 • Yunpeng Chen, Haoqi Fan, Bing Xu, Zhicheng Yan, Yannis Kalantidis, Marcus Rohrbach, Shuicheng Yan, Jiashi Feng
Similarly, the output feature maps of a convolution layer can also be seen as a mixture of information at different frequencies.
Ranked #149 on Action Classification on Kinetics-400
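The frequency-split idea described above (octave convolution) can be illustrated with a minimal PyTorch-style sketch that keeps a high- and a low-frequency feature map, the latter at half resolution; the channel split and the sampling modes here are assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OctConv2d(nn.Module):
    """Sketch of an octave convolution: features are split into a high-frequency
    map at full resolution and a low-frequency map at half resolution, with four
    convolution paths exchanging information between the two."""
    def __init__(self, in_ch, out_ch, kernel_size=3, alpha=0.5):
        super().__init__()
        in_lo, out_lo = int(alpha * in_ch), int(alpha * out_ch)
        in_hi, out_hi = in_ch - in_lo, out_ch - out_lo
        pad = kernel_size // 2
        self.hh = nn.Conv2d(in_hi, out_hi, kernel_size, padding=pad)  # high -> high
        self.hl = nn.Conv2d(in_hi, out_lo, kernel_size, padding=pad)  # high -> low
        self.lh = nn.Conv2d(in_lo, out_hi, kernel_size, padding=pad)  # low  -> high
        self.ll = nn.Conv2d(in_lo, out_lo, kernel_size, padding=pad)  # low  -> low

    def forward(self, x_hi, x_lo):
        y_hi = self.hh(x_hi) + F.interpolate(self.lh(x_lo), scale_factor=2, mode="nearest")
        y_lo = self.ll(x_lo) + self.hl(F.avg_pool2d(x_hi, 2))
        return y_hi, y_lo

x_hi = torch.randn(1, 32, 64, 64)   # high-frequency features, full resolution
x_lo = torch.randn(1, 32, 32, 32)   # low-frequency features, half resolution
y_hi, y_lo = OctConv2d(64, 64)(x_hi, x_lo)
```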
1 code implementation • NAACL 2019 • Satwik Kottur, José M. F. Moura, Devi Parikh, Dhruv Batra, Marcus Rohrbach
Specifically, we construct a dialog grammar that is grounded in the scene graphs of the images from the CLEVR dataset.
6 code implementations • 27 Feb 2019 • Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K. Dokania, Philip H. S. Torr, Marc'Aurelio Ranzato
But for a successful knowledge transfer, the learner needs to remember how to perform previous tasks.
Ranked #7 on Class Incremental Learning on cifar100
no code implementations • ICLR 2019 • Ramakrishna Vedantam, Karan Desai, Stefan Lee, Marcus Rohrbach, Dhruv Batra, Devi Parikh
We propose a new class of probabilistic neural-symbolic models that have symbolic functional programs as a latent, stochastic variable.
no code implementations • CVPR 2019 • Meet Shah, Xinlei Chen, Marcus Rohrbach, Devi Parikh
Despite significant progress in Visual Question Answering over the years, the robustness of today's VQA models leaves much to be desired.
no code implementations • CVPR 2019 • Zheng Shou, Xudong Lin, Yannis Kalantidis, Laura Sevilla-Lara, Marcus Rohrbach, Shih-Fu Chang, Zhicheng Yan
Motion has been shown to be useful for video understanding, where motion is typically represented by optical flow.
Ranked #1 on Action Recognition on UCF-101
no code implementations • 26 Dec 2018 • Mohamed Elhoseiny, Francesca Babiloni, Rahaf Aljundi, Marcus Rohrbach, Manohar Paluri, Tinne Tuytelaars
So far, life-long learning (LLL) has been studied in relatively small-scale and artificial setups.
2 code implementations • CVPR 2019 • Luowei Zhou, Yannis Kalantidis, Xinlei Chen, Jason J. Corso, Marcus Rohrbach
Our dataset, ActivityNet-Entities, augments the challenging ActivityNet Captions dataset with 158k bounding box annotations, each grounding a noun phrase.
1 code implementation • CVPR 2019 • Jae Sung Park, Marcus Rohrbach, Trevor Darrell, Anna Rohrbach
Among the main issues are the fluency and coherence of the generated descriptions, and their relevance to the video.
3 code implementations • ICLR 2019 • Arslan Chaudhry, Marc'Aurelio Ranzato, Marcus Rohrbach, Mohamed Elhoseiny
In lifelong learning, the learner is presented with a sequence of tasks, incrementally building a data-driven prior which may be leveraged to speed up learning of a new task.
Ranked #6 on Continual Learning on ASC (19 tasks)
9 code implementations • CVPR 2019 • Yunpeng Chen, Marcus Rohrbach, Zhicheng Yan, Shuicheng Yan, Jiashi Feng, Yannis Kalantidis
In this work, we propose a new approach for reasoning globally in which a set of features are globally aggregated over the coordinate space and then projected to an interaction space where relational reasoning can be efficiently computed.
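A hedged PyTorch sketch of the mechanism described above: features are aggregated from the coordinate space into a small set of interaction-space nodes, related by a lightweight graph step, and projected back; the dimensions and the exact form of the graph operation are assumptions, not the paper's reference code.

```python
import torch
import torch.nn as nn

class GlobalReasoningUnit(nn.Module):
    """Sketch of reasoning in an interaction space: features are aggregated from
    the coordinate space into N nodes, related by a small graph convolution,
    and projected back to the feature map with a residual connection."""
    def __init__(self, channels, nodes=16, state=64):
        super().__init__()
        self.reduce = nn.Conv2d(channels, state, 1)     # per-location state
        self.project = nn.Conv2d(channels, nodes, 1)    # soft assignment to nodes
        self.gcn = nn.Conv1d(nodes, nodes, 1)           # relation step over nodes
        self.update = nn.Conv1d(state, state, 1)        # node state update
        self.expand = nn.Conv2d(state, channels, 1)     # back to coordinate space

    def forward(self, x):
        b, c, h, w = x.shape
        state = self.reduce(x).flatten(2)                 # B x S x HW
        assign = self.project(x).flatten(2).softmax(-1)   # B x N x HW
        nodes = torch.bmm(assign, state.transpose(1, 2))  # B x N x S, aggregate
        nodes = nodes + self.gcn(nodes)                   # reason over node relations
        nodes = self.update(nodes.transpose(1, 2)).transpose(1, 2)
        out = torch.bmm(assign.transpose(1, 2), nodes)    # B x HW x S, redistribute
        out = out.transpose(1, 2).reshape(b, -1, h, w)
        return x + self.expand(out)                       # residual connection

y = GlobalReasoningUnit(256)(torch.randn(2, 256, 14, 14))
```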
no code implementations • EMNLP 2018 • Spandana Gella, Mike Lewis, Marcus Rohrbach
Video content on social media platforms constitutes a major part of the communication between people, as it allows everyone to share their stories.
no code implementations • 27 Sep 2018 • Sayna Ebrahimi, Mohamed Elhoseiny, Trevor Darrell, Marcus Rohrbach
Sequential learning of tasks arriving in a continuous stream is a complex problem and becomes more challenging when the model has a fixed capacity.
1 code implementation • ECCV 2018 • Satwik Kottur, José M. F. Moura, Devi Parikh, Dhruv Batra, Marcus Rohrbach
Visual dialog entails answering a series of questions grounded in an image, using dialog history as context.
Ranked #1 on Common Sense Reasoning on Visual Dialog v0.9
9 code implementations • 26 Jul 2018 • Yu Jiang, Vivek Natarajan, Xinlei Chen, Marcus Rohrbach, Dhruv Batra, Devi Parikh
We demonstrate that by making subtle but important changes to the model architecture and the learning rate schedule, fine-tuning image features, and adding data augmentation, we can significantly improve the performance of the up-down model on the VQA v2.0 dataset -- from 65.67% to 70.22%.
Ranked #12 on Visual Question Answering (VQA) on A-OKVQA
1 code implementation • ICLR 2019 • Rahaf Aljundi, Marcus Rohrbach, Tinne Tuytelaars
In particular, we propose a novel regularizer, that encourages representation sparsity by means of neural inhibition.
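As an illustration of a sparsity-inducing term in the spirit of the line above: the sketch below is a generic L1 activation penalty, a hedged stand-in rather than the paper's actual neural-inhibition regularizer, with the coefficient and the hidden-activation hook as placeholder assumptions.

```python
import torch

def sparsity_penalty(activations, strength=1e-3):
    """Generic L1 penalty on hidden activations, encouraging representation
    sparsity so that later tasks find free capacity. This is an illustrative
    stand-in, not the paper's neural-inhibition regularizer."""
    return strength * activations.abs().mean()

# Usage inside a training step (hypothetical model/criterion names):
# logits, hidden = model(x)
# loss = criterion(logits, y) + sparsity_penalty(hidden)
```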
2 code implementations • 27 Apr 2018 • Ji Zhang, Yannis Kalantidis, Marcus Rohrbach, Manohar Paluri, Ahmed Elgammal, Mohamed Elhoseiny
Large scale visual understanding is challenging, as it requires a model to handle the widely-spread and imbalanced distribution of <subject, relation, object> triples.
1 code implementation • CVPR 2018 • Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Anna Rohrbach, Bernt Schiele, Trevor Darrell, Marcus Rohrbach
We propose a multimodal approach to explanation, and argue that the two modalities provide complementary explanatory strengths.
2 code implementations • ACL 2019 • Jin-Hwa Kim, Nikita Kitaev, Xinlei Chen, Marcus Rohrbach, Byoung-Tak Zhang, Yuandong Tian, Dhruv Batra, Devi Parikh
The game involves two players: a Teller and a Drawer.
3 code implementations • ECCV 2018 • Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, Tinne Tuytelaars
We show state-of-the-art performance and, for the first time, the ability to adapt the importance of the parameters, based on unlabeled data, towards what the network does (or does not) need to forget, which may vary depending on test conditions.
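A hedged sketch of how per-parameter importance could be estimated from unlabeled data in the spirit of the description above: the output sensitivity is accumulated over unlabeled samples, and important parameters are then penalized for changing on later tasks. The loader, averaging, and penalty names are assumptions.

```python
import torch

def estimate_importance(model, unlabeled_loader):
    """Accumulate the gradient magnitude of the squared L2 norm of the network
    output over unlabeled inputs, as a per-parameter importance estimate."""
    importance = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    n_samples = 0
    for x in unlabeled_loader:
        model.zero_grad()
        out = model(x)
        out.pow(2).sum().backward()        # sensitivity of the learned function
        for n, p in model.named_parameters():
            if p.grad is not None:
                importance[n] += p.grad.abs()
        n_samples += x.size(0)
    return {n: v / max(n_samples, 1) for n, v in importance.items()}

# When training a new task, add a quadratic penalty (hypothetical names):
# loss = task_loss + lam * sum((imp[n] * (p - old_params[n]).pow(2)).sum()
#                              for n, p in model.named_parameters())
```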
no code implementations • 17 Nov 2017 • Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Anna Rohrbach, Bernt Schiele, Trevor Darrell, Marcus Rohrbach
We also introduce a multimodal methodology for generating visual and textual explanations simultaneously.
1 code implementation • ICCV 2017 • Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Kate Saenko
Natural language questions are inherently compositional, and many are most easily answered by reasoning about their decomposition into modular sub-problems.
Ranked #44 on Visual Question Answering (VQA) on VQA v2 test-dev
no code implementations • CVPR 2017 • Anna Rohrbach, Marcus Rohrbach, Siyu Tang, Seong Joon Oh, Bernt Schiele
At training time, we first learn how to localize characters by relating their visual appearance to mentions in the descriptions via a semi-supervised approach.
3 code implementations • ICCV 2017 • Rakshith Shetty, Marcus Rohrbach, Lisa Anne Hendricks, Mario Fritz, Bernt Schiele
While strong progress has been made in image captioning over the last few years, machine and human captions are still quite distinct.
no code implementations • 14 Dec 2016 • Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Bernt Schiele, Trevor Darrell, Marcus Rohrbach
In contrast, humans can justify their decisions with natural language and point to the evidence in the visual world which led to their decisions.
2 code implementations • CVPR 2017 • Ronghang Hu, Marcus Rohrbach, Jacob Andreas, Trevor Darrell, Kate Saenko
In this paper we instead present a modular deep architecture capable of analyzing referential expressions into their component parts, identifying entities and relationships mentioned in the input expression and grounding them all in the scene.
Ranked #1 on Visual Question Answering (VQA) on Visual7W
no code implementations • 30 Aug 2016 • Ronghang Hu, Marcus Rohrbach, Subhashini Venugopalan, Trevor Darrell
Image segmentation from referring expressions is a joint vision and language modeling task, where the input is an image and a textual expression describing a particular region in the image; and the goal is to localize and segment the specific image region based on the given expression.
1 code implementation • CVPR 2017 • Subhashini Venugopalan, Lisa Anne Hendricks, Marcus Rohrbach, Raymond Mooney, Trevor Darrell, Kate Saenko
We propose minimizing a joint objective which can learn from these diverse data sources and leverage distributional semantic embeddings, enabling the model to generalize and describe novel objects outside of image-caption datasets.
10 code implementations • EMNLP 2016 • Akira Fukui, Dong Huk Park, Daylen Yang, Anna Rohrbach, Trevor Darrell, Marcus Rohrbach
Approaches to multimodal pooling include element-wise product or sum, as well as concatenation of the visual and textual representations.
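The paper's multimodal compact bilinear pooling approximates the outer product of the two modality vectors with Count Sketch projections combined in the Fourier domain; a NumPy sketch follows, with the projection dimension and feature sizes as illustrative assumptions.

```python
import numpy as np

def count_sketch(v, h, s, d):
    """Project vector v into d dimensions using hash indices h and random signs s."""
    y = np.zeros(d)
    np.add.at(y, h, s * v)
    return y

def mcb(v_img, v_txt, d=16000, seed=0):
    """Compact bilinear pooling: the count sketch of the outer product equals
    the circular convolution of the two count sketches, computed via FFT."""
    rng = np.random.default_rng(seed)          # fixed seed so hashes are shared across examples
    h1 = rng.integers(0, d, size=v_img.shape[0])
    h2 = rng.integers(0, d, size=v_txt.shape[0])
    s1 = rng.choice([-1.0, 1.0], size=v_img.shape[0])
    s2 = rng.choice([-1.0, 1.0], size=v_txt.shape[0])
    sk1 = np.fft.fft(count_sketch(v_img, h1, s1, d))
    sk2 = np.fft.fft(count_sketch(v_txt, h2, s2, d))
    return np.real(np.fft.ifft(sk1 * sk2))     # joint multimodal feature

joint = mcb(np.random.randn(2048), np.random.randn(300))
```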
no code implementations • 12 May 2016 • Anna Rohrbach, Atousa Torabi, Marcus Rohrbach, Niket Tandon, Christopher Pal, Hugo Larochelle, Aaron Courville, Bernt Schiele
In addition, we collected and aligned movie scripts used in prior work and compared the two sources of descriptions.
1 code implementation • 9 May 2016 • Mateusz Malinowski, Marcus Rohrbach, Mario Fritz
By combining the latest advances in image representation and natural language processing, we propose Ask Your Neurons, a scalable, jointly trained, end-to-end formulation to this problem.
no code implementations • 12 Apr 2016 • Marcus Rohrbach
Impressive progress has been made in the fields of computer vision and natural language processing.
no code implementations • 28 Mar 2016 • Lisa Anne Hendricks, Zeynep Akata, Marcus Rohrbach, Jeff Donahue, Bernt Schiele, Trevor Darrell
Clearly explaining a rationale for a classification decision to an end-user can be as important as the decision itself.
4 code implementations • 20 Mar 2016 • Ronghang Hu, Marcus Rohrbach, Trevor Darrell
To produce pixelwise segmentation for the language expression, we propose an end-to-end trainable recurrent and convolutional network model that jointly learns to process visual and linguistic information.
Ranked #16 on Referring Expression Segmentation on J-HMDB
3 code implementations • NAACL 2016 • Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Dan Klein
We describe a question answering model that applies to both images and structured knowledge bases.
no code implementations • ICCV 2015 • Subhashini Venugopalan, Marcus Rohrbach, Jeffrey Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko
Our LSTM model is trained on video-sentence pairs and learns to associate a sequence of video frames to a sequence of words in order to generate a description of the event in the video clip.
1 code implementation • CVPR 2016 • Lisa Anne Hendricks, Subhashini Venugopalan, Marcus Rohrbach, Raymond Mooney, Kate Saenko, Trevor Darrell
Current deep caption models can only describe objects contained in paired image-sentence corpora, despite the fact that they are pre-trained with large object recognition datasets, namely ImageNet.
1 code implementation • CVPR 2016 • Ronghang Hu, Huazhe Xu, Marcus Rohrbach, Jiashi Feng, Kate Saenko, Trevor Darrell
In this paper, we address the task of natural language object retrieval, to localize a target object within a given image based on a natural language query of the object.
Ranked #12 on Referring Expression Comprehension on Talk2Car
3 code implementations • 12 Nov 2015 • Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, Bernt Schiele
We propose a novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly.
Ranked #13 on Phrase Grounding on Flickr30k Entities Test
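A hedged PyTorch sketch of the reconstruction idea above: the phrase attends over candidate region features, and the attended visual feature must reconstruct the phrase encoding. The paper decodes the phrase word by word with an LSTM; this simplification uses an embedding-level reconstruction loss, and the dimensions are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroundByReconstruction(nn.Module):
    """Attend a phrase over candidate regions, then reconstruct the phrase from
    the attended visual feature. Without box supervision, the reconstruction
    loss alone drives the attention towards the right region."""
    def __init__(self, region_dim=2048, phrase_dim=300, hidden=512):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(region_dim + phrase_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))
        self.reconstruct = nn.Linear(region_dim, phrase_dim)

    def forward(self, regions, phrase):
        # regions: B x R x region_dim, phrase: B x phrase_dim
        p = phrase.unsqueeze(1).expand(-1, regions.size(1), -1)
        attn = self.score(torch.cat([regions, p], dim=-1)).squeeze(-1).softmax(-1)
        attended = torch.bmm(attn.unsqueeze(1), regions).squeeze(1)   # B x region_dim
        loss = F.mse_loss(self.reconstruct(attended), phrase)
        return attn, loss   # attn peaks on the predicted grounding region

regions = torch.randn(4, 10, 2048)   # 10 region proposals per image
phrase = torch.randn(4, 300)         # encoded noun phrase
attn, loss = GroundByReconstruction()(regions, phrase)
```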
2 code implementations • CVPR 2016 • Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Dan Klein
Visual question answering is fundamentally compositional in nature: a question like "where is the dog?"
Ranked #6 on Visual Question Answering (VQA) on VQA v1 test-std
no code implementations • ICCV 2015 • Damian Mrowca, Marcus Rohrbach, Judy Hoffman, Ronghang Hu, Kate Saenko, Trevor Darrell
Our approach proves to be especially useful in large scale settings with thousands of classes, where spatial and semantic interactions are very frequent and only weakly supervised detectors can be built due to a lack of bounding box annotations.
no code implementations • 4 Jun 2015 • Anna Rohrbach, Marcus Rohrbach, Bernt Schiele
Generating descriptions for videos has many applications including assisting blind people and human-robot interaction.
no code implementations • 21 May 2015 • Huijuan Xu, Subhashini Venugopalan, Vasili Ramanishka, Marcus Rohrbach, Kate Saenko
Most state-of-the-art methods for solving this problem borrow existing deep convolutional neural network (CNN) architectures (AlexNet, GoogLeNet) to extract a visual representation of the input video.
no code implementations • ICCV 2015 • Mateusz Malinowski, Marcus Rohrbach, Mario Fritz
In contrast to previous efforts, we are facing a multi-modal problem where the language output (answer) is conditioned on visual and natural language input (image and question).
4 code implementations • 3 May 2015 • Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, Kate Saenko
Our LSTM model is trained on video-sentence pairs and learns to associate a sequence of video frames to a sequence of words in order to generate a description of the event in the video clip.
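A minimal encoder-decoder sketch in the spirit of the description above: one LSTM encodes per-frame CNN features, a second LSTM decodes words conditioned on the encoder state. The vocabulary size, feature dimensions, and training details are placeholder assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    """Minimal sequence-to-sequence video description sketch: an LSTM encodes
    frame features, and a second LSTM decodes the sentence from the clip summary."""
    def __init__(self, feat_dim=2048, hidden=512, vocab=10000, emb=300):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab, emb)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab)

    def forward(self, frame_feats, captions):
        # frame_feats: B x T_frames x feat_dim, captions: B x T_words (token ids)
        _, state = self.encoder(frame_feats)           # summarize the clip
        dec_out, _ = self.decoder(self.embed(captions), state)
        return self.out(dec_out)                       # B x T_words x vocab logits

feats = torch.randn(2, 30, 2048)                       # 30 frames of CNN features
caps = torch.randint(0, 10000, (2, 12))
logits = VideoCaptioner()(feats, caps)
```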
no code implementations • 23 Feb 2015 • Marcus Rohrbach, Anna Rohrbach, Michaela Regneri, Sikandar Amin, Mykhaylo Andriluka, Manfred Pinkal, Bernt Schiele
To attack the second challenge, recognizing composite activities, we leverage the fact that these activities are compositional and that the essential components of the activities can be obtained from textual descriptions or scripts.
no code implementations • CVPR 2015 • Anna Rohrbach, Marcus Rohrbach, Niket Tandon, Bernt Schiele
In this work we propose a novel dataset containing transcribed DVS that is temporally aligned to full-length HD movies.
1 code implementation • HLT 2015 • Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, Kate Saenko
Solving the visual symbol grounding problem has long been a goal of artificial intelligence.
7 code implementations • CVPR 2015 • Jeff Donahue, Lisa Anne Hendricks, Marcus Rohrbach, Subhashini Venugopalan, Sergio Guadarrama, Kate Saenko, Trevor Darrell
Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent, or "temporally deep", are effective for tasks involving sequences, visual and otherwise.
Ranked #3 on Human Interaction Recognition on BIT
no code implementations • 24 Mar 2014 • Anna Senina, Marcus Rohrbach, Wei Qiu, Annemarie Friedrich, Sikandar Amin, Mykhaylo Andriluka, Manfred Pinkal, Bernt Schiele
Humans can easily describe what they see in a coherent way and at varying levels of detail.
no code implementations • NeurIPS 2013 • Marcus Rohrbach, Sandra Ebert, Bernt Schiele
Our approach consistently outperforms state-of-the-art transfer and semi-supervised approaches on all datasets.
no code implementations • TACL 2013 • Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, Manfred Pinkal
Recent work has shown that the integration of visual information into text-based models can substantially improve model predictions, but so far only visual information extracted from static images has been used.