Search Results for author: Makarand Tapaswi

Found 45 papers, 24 papers with code

The Sound of Water: Inferring Physical Properties from Pouring Liquids

1 code implementation • 18 Nov 2024 • Piyush Bagad, Makarand Tapaswi, Cees G. M. Snoek, Andrew Zisserman

We study the connection between audio-visual observations and the underlying physics of a mundane yet intriguing everyday activity: pouring liquids.

Ranked #1 on Physical Attribute Prediction on Sound of Water 50 (using extra training data)

Physical Attribute Prediction

IdentifyMe: A Challenging Long-Context Mention Resolution Benchmark for LLMs

1 code implementation • 12 Nov 2024 • Kawshik Manikantan, Makarand Tapaswi, Vineet Gandhi, Shubham Toshniwal

The benchmark also consists of a curated mixture of different mention types and corresponding entities, allowing for a fine-grained analysis of model performance.

coreference-resolution • Multiple-choice

Detect, Describe, Discriminate: Moving Beyond VQA for MLLM Evaluation

no code implementations • 23 Sep 2024 • Manu Gaur, Darshan Singh S, Makarand Tapaswi

Specifically, we assess the ability of MLLMs to capture specific points of visual difference using self-retrieval, i.e., by retrieving the target image using its generated caption against the other image in the pair, which serves as the distractor.

Multiple-choice • Question Answering +2

No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning

no code implementations • 4 Sep 2024 • Manu Gaur, Darshan Singh S, Makarand Tapaswi

We evaluate and compare several state-of-the-art open-source MLLMs on TrueMatch, and find that our SR approach outperforms them all by a significant margin (e.g., +4.8% to +7.1% over Cambrian) while having 1-2 orders of magnitude fewer parameters.

Image Captioning • Retrieval

Major Entity Identification: A Generalizable Alternative to Coreference Resolution

1 code implementation • 20 Jun 2024 • Kawshik Manikantan, Shubham Toshniwal, Makarand Tapaswi, Vineet Gandhi

Rather than relying on this additional annotation, we propose an alternative referential task, Major Entity Identification (MEI), where we: (a) assume the target entities to be specified in the input, and (b) limit the task to only the frequent entities.

coreference-resolution

VELOCITI: Benchmarking Video-Language Compositional Reasoning with Strict Entailment

no code implementations • 16 Jun 2024 • Darshana Saravanan, Varun Gupta, Darshan Singh, Zeeshan Khan, Vineet Gandhi, Makarand Tapaswi

To this end, we introduce VELOCITI, a benchmark to study Video-LLMs by disentangling and assessing the comprehension of agents, actions, and their associations across multiple events.

Action Understanding • Benchmarking +2

MICap: A Unified Model for Identity-aware Movie Descriptions

no code implementations • CVPR 2024 • Haran Raajesh, Naveen Reddy Desanur, Zeeshan Khan, Makarand Tapaswi

While previous work has largely ignored identity and generated captions with "someone" (anonymized names), recent work formulates id-aware captioning as a fill-in-the-blanks (FITB) task, where, given a caption with blanks, the goal is to predict person id labels.

Caption Generation • Decoder

"Previously on ..." From Recaps to Story Summarization

no code implementations • 19 May 2024 • Aditya Kumar Singh, Dhruv Srivastava, Makarand Tapaswi

We introduce multimodal story summarization by leveraging TV episode recaps - short video sequences interweaving key story moments from previous episodes to bring viewers up to speed.

Video Summarization

FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos

no code implementations • 15 Jan 2024 • Darshan Singh S, Zeeshan Khan, Makarand Tapaswi

We use the semantic role labeling (SRL) and verb information to create rule-based detailed captions, making sure they capture most of the visual concepts.

Seeing Eye to AI: Comparing Human Gaze and Model Attention in Video Memorability

no code implementations • 26 Nov 2023 • Prajneya Kumar, Eshika Khandelwal, Makarand Tapaswi, Vishnu Sreekumar

Understanding what makes a video memorable has important applications in advertising or education technology.

Panoptic Segmentation

Generalized Cross-domain Multi-label Few-shot Learning for Chest X-rays

no code implementations • 8 Sep 2023 • Aroof Aimen, Arsh Verma, Makarand Tapaswi, Narayanan C. Krishnan

Real-world application of chest X-ray abnormality classification requires dealing with several challenges: (i) limited training data; (ii) training and evaluation sets that are derived from different domains; and (iii) classes that appear during training may have partial overlap with classes of interest during evaluation.

Few-Shot Learning • Transfer Learning

How you feelin'? Learning Emotions and Mental States in Movie Scenes

1 code implementation • CVPR 2023 • Dhruv Srivastava, Aditya Kumar Singh, Makarand Tapaswi

Towards this goal, we formulate emotion understanding as predicting a diverse and multi-label set of emotions at the level of a movie scene and for each character.

Emotion Recognition • Multi-Label Classification

GrapeQA: GRaph Augmentation and Pruning to Enhance Question-Answering

no code implementations • 22 Mar 2023 • Dhaval Taunk, Lakshya Khanna, Pavan Kandru, Vasudeva Varma, Charu Sharma, Makarand Tapaswi

Commonsense question-answering (QA) methods combine the power of pre-trained Language Models (LM) with the reasoning provided by Knowledge Graphs (KG).

Common Sense Reasoning • Knowledge Graphs +2

Test of Time: Instilling Video-Language Models with a Sense of Time

1 code implementation • CVPR 2023 • Piyush Bagad, Makarand Tapaswi, Cees G. M. Snoek

Our work serves as a first step towards probing and instilling a sense of time in existing video-language models without the need for data- and compute-intensive training from scratch.

Ranked #3 on Video-Text Retrieval on Test-of-Time (using extra training data)

Video-Text Retrieval • Video Understanding

Sonus Texere! Automated Dense Soundtrack Construction for Books using Movie Adaptations

no code implementations • 2 Dec 2022 • Jaidev Shriram, Makarand Tapaswi, Vinoo Alluri

Reading, much like music listening, is an immersive experience that transports readers while taking them on an emotional journey.

Can we Adopt Self-supervised Pretraining for Chest X-Rays?

no code implementations • 23 Nov 2022 • Arsh Verma, Makarand Tapaswi

Chest radiograph (or Chest X-Ray, CXR) is a popular medical imaging modality that is used by radiologists across the world to diagnose heart or lung conditions.

Language Conditioned Spatial Relation Reasoning for 3D Object Grounding

1 code implementation • 17 Nov 2022 • ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

In this work, we propose a language-conditioned transformer model for grounding 3D objects and their spatial relations.

Object • Relation

Unsupervised Audio-Visual Lecture Segmentation

1 code implementation • 29 Oct 2022 • Darshan Singh S, Anchit Gupta, C. V. Jawahar, Makarand Tapaswi

We formulate lecture segmentation as an unsupervised task that leverages visual, textual, and OCR cues from the lecture, while clip representations are fine-tuned on a pretext self-supervised task of matching the narration with the temporally aligned visual content.

Navigate • Optical Character Recognition (OCR) +1

Grounded Video Situation Recognition

no code implementations • 19 Oct 2022 • Zeeshan Khan, C. V. Jawahar, Makarand Tapaswi

Recently, Video Situation Recognition (VidSitu) has been framed as a task for structured prediction of multiple events, their relationships, and the actions and verb-role pairs attached to descriptive entities.

Descriptive • Structured Prediction +1

Learning from Unlabeled 3D Environments for Vision-and-Language Navigation

1 code implementation • 24 Aug 2022 • ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

Our resulting HM3D-AutoVLN dataset is an order of magnitude larger than existing VLN datasets in terms of navigation environments and instructions.

Language Modeling • Language Modelling +4

Learning Object Manipulation Skills from Video via Approximate Differentiable Physics

2 code implementations • 3 Aug 2022 • Vladimir Petrik, Mohammad Nomaan Qureshi, Josef Sivic, Makarand Tapaswi

We evaluate our approach on a 3D reconstruction task that consists of 54 video demonstrations sourced from 9 actions, such as "pull something from right to left" or "put something in front of something".

3D Reconstruction • Friction +1

Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation

1 code implementation • CVPR 2022 • ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev

To balance the complexity of large action space reasoning and fine-grained language grounding, we dynamically combine a fine-scale encoding over local observations and a coarse-scale encoding on a global map via graph transformers.

Efficient Exploration • Navigate +2

Airbert: In-domain Pretraining for Vision-and-Language Navigation

2 code implementations • ICCV 2021 • Pierre-Louis Guhur, Makarand Tapaswi, ShiZhe Chen, Ivan Laptev, Cordelia Schmid

Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging.

Navigate • Referring Expression +1

Learning Object Manipulation Skills via Approximate State Estimation from Real Videos

1 code implementation • 13 Nov 2020 • Vladimír Petrík, Makarand Tapaswi, Ivan Laptev, Josef Sivic

We evaluate our method on simple single- and two-object actions from the Something-Something dataset.

Object

Deep Multimodal Feature Encoding for Video Ordering

1 code implementation • 5 Apr 2020 • Vivek Sharma, Makarand Tapaswi, Rainer Stiefelhagen

True understanding of a video comes from a joint analysis of all its modalities: the video frames, the audio track, and any accompanying text such as closed captions.

Action Recognition

Learning Interactions and Relationships between Movie Characters

1 code implementation • CVPR 2020 • Anna Kukleva, Makarand Tapaswi, Ivan Laptev

Localizing the pair of interacting characters in video is a time-consuming process; instead, we train our model to learn from clip-level weak labels.

The Shmoop Corpus: A Dataset of Stories with Loosely Aligned Summaries

1 code implementation • 30 Dec 2019 • Atef Chaudhury, Makarand Tapaswi, Seung Wook Kim, Sanja Fidler

Understanding stories is a challenging reading comprehension problem for machines as it requires reading a large volume of text and following long-range dependencies.

Abstractive Text Summarization • Form +2

Video Face Clustering with Unknown Number of Clusters

1 code implementation • ICCV 2019 • Makarand Tapaswi, Marc T. Law, Sanja Fidler

Understanding videos such as TV series and movies requires analyzing who the characters are and what they are doing.

Clustering • Face Clustering +1

Visual Reasoning by Progressive Module Networks

1 code implementation • ICLR 2019 • Seung Wook Kim, Makarand Tapaswi, Sanja Fidler

Thus, a module for a new task learns to query existing modules and composes their outputs in order to produce its own output.

Visual Reasoning

Now You Shake Me: Towards Automatic 4D Cinema

no code implementations • CVPR 2018 • Yuhao Zhou, Makarand Tapaswi, Sanja Fidler

We are interested in enabling automatic 4D cinema by parsing physical and special effects from untrimmed movies.

MovieGraphs: Towards Understanding Human-Centric Situations from Videos

no code implementations • CVPR 2018 • Paul Vicol, Makarand Tapaswi, Lluis Castrejon, Sanja Fidler

Towards this goal, we introduce a novel dataset called MovieGraphs which provides detailed, graph-based annotations of social situations depicted in movie clips.

Common Sense Reasoning

Book2Movie: Aligning Video Scenes With Book Chapters

no code implementations • CVPR 2015 • Makarand Tapaswi, Martin Bauml, Rainer Stiefelhagen

Such an alignment facilitates finding differences between the adaptation and the original source, and also acts as a basis for deriving rich descriptions from the novel for the video clips.

Video Alignment

StoryGraphs: Visualizing Character Interactions as a Timeline

1 code implementation • CVPR 2014 • Makarand Tapaswi, Martin Bauml, Rainer Stiefelhagen

We present a novel way to automatically summarize and represent the storyline of a TV episode by visualizing character interactions as a chart.

Person Identification
