1 code implementation • 18 Nov 2024 • Piyush Bagad, Makarand Tapaswi, Cees G. M. Snoek, Andrew Zisserman
We study the connection between audio-visual observations and the underlying physics of a mundane yet intriguing everyday activity: pouring liquids.
Ranked #1 on Physical Attribute Prediction on Sound of Water 50 (using extra training data)
1 code implementation • 12 Nov 2024 • Kawshik Manikantan, Makarand Tapaswi, Vineet Gandhi, Shubham Toshniwal
The benchmark also consists of a curated mixture of different mention types and corresponding entities, allowing for a fine-grained analysis of model performance.
no code implementations • 23 Sep 2024 • Manu Gaur, Darshan Singh S, Makarand Tapaswi
Specifically, we assess the ability of MLLMs to capture specific points of visual difference using self-retrieval, i.e., by retrieving the target image with its generated caption, while the other image in the pair serves as the distractor.
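To make the self-retrieval protocol concrete, here is a toy sketch. The function names are hypothetical, and a simple word-overlap score stands in for a real MLLM image-text scorer, which the actual evaluation would use; the point is only the retrieval check itself: the generated caption should score higher against the target image than against the distractor.

```python
def bow_score(caption, description):
    # Toy stand-in for an image-text scorer: bag-of-words overlap between
    # the caption and a textual description of the image.
    cap = set(caption.lower().split())
    desc = set(description.lower().split())
    return len(cap & desc) / max(len(cap), 1)

def self_retrieval_success(caption, target_desc, distractor_desc):
    # The caption "retrieves" the target if it scores strictly higher
    # against the target image than against the distractor in the pair.
    return bow_score(caption, target_desc) > bow_score(caption, distractor_desc)

print(self_retrieval_success(
    "a red mug on a wooden desk",
    "red mug wooden desk laptop",     # target
    "blue mug wooden desk laptop"))   # distractor  → True
```

A caption that omits the distinguishing detail (here, the mug's color) would tie or lose against the distractor, which is exactly the failure mode self-retrieval is designed to expose.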
no code implementations • 4 Sep 2024 • Manu Gaur, Darshan Singh S, Makarand Tapaswi
We evaluate and compare several state-of-the-art open-source MLLMs on TrueMatch, and find that our SR approach outperforms them all by a significant margin (e.g., +4.8%-7.1% over Cambrian) while having 1-2 orders of magnitude fewer parameters.
1 code implementation • 20 Jun 2024 • Kawshik Manikantan, Shubham Toshniwal, Makarand Tapaswi, Vineet Gandhi
Rather than relying on this additional annotation, we propose an alternative referential task, Major Entity Identification (MEI), where we: (a) assume the target entities to be specified in the input, and (b) limit the task to only the frequent entities.
no code implementations • 16 Jun 2024 • Darshana Saravanan, Varun Gupta, Darshan Singh, Zeeshan Khan, Vineet Gandhi, Makarand Tapaswi
To this end, we introduce VELOCITI, a benchmark to study Video-LLMs by disentangling and assessing the comprehension of agents, actions, and their associations across multiple events.
no code implementations • CVPR 2024 • Haran Raajesh, Naveen Reddy Desanur, Zeeshan Khan, Makarand Tapaswi
While previous work has largely ignored identity and generated captions with someone (anonymized names), recent work formulates ID-aware captioning as a fill-in-the-blanks (FITB) task, where, given a caption with blanks, the goal is to predict person ID labels.
no code implementations • 19 May 2024 • Aditya Kumar Singh, Dhruv Srivastava, Makarand Tapaswi
We introduce multimodal story summarization by leveraging TV episode recaps - short video sequences interweaving key story moments from previous episodes to bring viewers up to speed.
no code implementations • 9 May 2024 • Yash Khandelwal, Mayur Arvind, Sriram Kumar, Ashish Gupta, Sachin Kumar Danisetty, Piyush Bagad, Anish Madan, Mayank Lunayach, Aditya Annavajjala, Abhishek Maiti, Sansiddh Jain, Aman Dalmia, Namrata Deka, Jerome White, Jigar Doshi, Angjoo Kanazawa, Rahul Panicker, Alpan Raval, Srinivas Rana, Makarand Tapaswi
Our goal is to equip health workers and public health systems with a solution for contactless newborn anthropometry in the community.
no code implementations • 15 Jan 2024 • Darshan Singh S, Zeeshan Khan, Makarand Tapaswi
We use the SRL and verb information to create rule-based detailed captions, making sure they capture most of the visual concepts.
no code implementations • CVPR 2024 • Aditya Kumar Singh, Dhruv Srivastava, Makarand Tapaswi
We introduce multimodal story summarization by leveraging TV episode recaps - short video sequences interweaving key story moments from previous episodes to bring viewers up to speed.
no code implementations • 26 Nov 2023 • Prajneya Kumar, Eshika Khandelwal, Makarand Tapaswi, Vishnu Sreekumar
Understanding what makes a video memorable has important applications in advertising or education technology.
no code implementations • 8 Sep 2023 • Aroof Aimen, Arsh Verma, Makarand Tapaswi, Narayanan C. Krishnan
Real-world application of chest X-ray abnormality classification requires dealing with several challenges: (i) limited training data; (ii) training and evaluation sets that are derived from different domains; and (iii) classes that appear during training may have partial overlap with classes of interest during evaluation.
1 code implementation • CVPR 2023 • Dhruv Srivastava, Aditya Kumar Singh, Makarand Tapaswi
Towards this goal, we formulate emotion understanding as predicting a diverse and multi-label set of emotions at the level of a movie scene and for each character.
no code implementations • 22 Mar 2023 • Dhaval Taunk, Lakshya Khanna, Pavan Kandru, Vasudeva Varma, Charu Sharma, Makarand Tapaswi
Commonsense question-answering (QA) methods combine the power of pre-trained Language Models (LM) with the reasoning provided by Knowledge Graphs (KG).
Ranked #8 on Question Answering on OpenBookQA
1 code implementation • CVPR 2023 • Piyush Bagad, Makarand Tapaswi, Cees G. M. Snoek
Our work serves as a first step towards probing and instilling a sense of time in existing video-language models without the need for data and compute-intense training from scratch.
Ranked #3 on Video-Text Retrieval on Test-of-Time (using extra training data)
no code implementations • 2 Dec 2022 • Jaidev Shriram, Makarand Tapaswi, Vinoo Alluri
Reading, much like music listening, is an immersive experience that transports readers while taking them on an emotional journey.
no code implementations • 23 Nov 2022 • Arsh Verma, Makarand Tapaswi
The chest radiograph (chest X-ray, CXR) is a popular medical imaging modality used by radiologists across the world to diagnose heart or lung conditions.
1 code implementation • 17 Nov 2022 • ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev
In this work we propose a language-conditioned transformer model for grounding 3D objects and their spatial relations.
1 code implementation • 29 Oct 2022 • Darshan Singh S, Anchit Gupta, C. V. Jawahar, Makarand Tapaswi
We formulate lecture segmentation as an unsupervised task that leverages visual, textual, and OCR cues from the lecture, while clip representations are fine-tuned on a pretext self-supervised task of matching the narration with the temporally aligned visual content.
no code implementations • 19 Oct 2022 • Zeeshan Khan, C. V. Jawahar, Makarand Tapaswi
Recently, Video Situation Recognition (VidSitu) has been framed as a task for structured prediction of multiple events, their relationships, and the actions and verb-role pairs attached to descriptive entities.
2 code implementations • 11 Sep 2022 • Pierre-Louis Guhur, ShiZhe Chen, Ricardo Garcia, Makarand Tapaswi, Ivan Laptev, Cordelia Schmid
In human environments, robots are expected to accomplish a variety of manipulation tasks given simple natural language instructions.
Ranked #6 on Robot Manipulation Generalization on GEMBench
1 code implementation • 24 Aug 2022 • ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev
Our resulting HM3D-AutoVLN dataset is an order of magnitude larger than existing VLN datasets in terms of navigation environments and instructions.
Ranked #1 on Visual Navigation on SOON Test
2 code implementations • 3 Aug 2022 • Vladimir Petrik, Mohammad Nomaan Qureshi, Josef Sivic, Makarand Tapaswi
We evaluate our approach on a 3D reconstruction task that consists of 54 video demonstrations sourced from 9 actions, such as "pull something from right to left" or "put something in front of something".
1 code implementation • CVPR 2022 • ShiZhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, Ivan Laptev
To balance the complexity of large action space reasoning and fine-grained language grounding, we dynamically combine a fine-scale encoding over local observations and a coarse-scale encoding on a global map via graph transformers.
Ranked #5 on Visual Navigation on SOON Test
1 code implementation • 10 Nov 2021 • Rahul Vigneswaran, Marc T. Law, Vineeth N. Balasubramanian, Makarand Tapaswi
Oversampling instances of the tail classes attempts to solve this imbalance.
Ranked #1 on Long-tail Learning on mini-ImageNet-LT
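The oversampling idea mentioned above can be sketched in a few lines. This is a generic class-balanced sampler, not the paper's method: indices are drawn with probability inversely proportional to class frequency, so tail classes are seen about as often as head classes. The function name and seed are illustrative choices.

```python
import random
from collections import Counter

def oversample(labels, seed=0):
    # Draw len(labels) indices with replacement, weighting each example
    # by the inverse frequency of its class. Every class then contributes
    # equal total probability mass, so tail classes are oversampled.
    rng = random.Random(seed)
    counts = Counter(labels)
    weights = [1.0 / counts[y] for y in labels]
    return rng.choices(range(len(labels)), weights=weights, k=len(labels))

labels = ["head"] * 90 + ["tail"] * 10
sampled = [labels[i] for i in oversample(labels)]
# With 90/10 head/tail inputs, the resampled set is roughly 50/50.
print(sampled.count("tail"))
```

The trade-off this leaderboard entry alludes to is that naive oversampling repeats the same few tail examples, which can encourage overfitting on them.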
2 code implementations • ICCV 2021 • Pierre-Louis Guhur, Makarand Tapaswi, ShiZhe Chen, Ivan Laptev, Cordelia Schmid
Given the scarcity of domain-specific training data and the high diversity of image and language inputs, the generalization of VLN agents to unseen environments remains challenging.
Ranked #3 on Vision and Language Navigation on VLN Challenge
1 code implementation • 13 Nov 2020 • Vladimír Petrík, Makarand Tapaswi, Ivan Laptev, Josef Sivic
We evaluate our method on simple single- and two-object actions from the Something-Something dataset.
no code implementations • 5 Apr 2020 • Vivek Sharma, Makarand Tapaswi, M. Saquib Sarfraz, Rainer Stiefelhagen
We demonstrate our method on the challenging task of learning representations for video face clustering.
1 code implementation • 5 Apr 2020 • Vivek Sharma, Makarand Tapaswi, Rainer Stiefelhagen
True understanding of videos comes from a joint analysis of all its modalities: the video frames, the audio track, and any accompanying text such as closed captions.
1 code implementation • CVPR 2020 • Anna Kukleva, Makarand Tapaswi, Ivan Laptev
Localizing the pair of interacting characters in video is a time-consuming process; instead, we train our model to learn from clip-level weak labels.
1 code implementation • 30 Dec 2019 • Atef Chaudhury, Makarand Tapaswi, Seung Wook Kim, Sanja Fidler
Understanding stories is a challenging reading comprehension problem for machines as it requires reading a large volume of text and following long-range dependencies.
1 code implementation • ICCV 2019 • Makarand Tapaswi, Marc T. Law, Sanja Fidler
Understanding videos such as TV series and movies requires analyzing who the characters are and what they are doing.
4 code implementations • ICCV 2019 • Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic
In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations.
Ranked #4 on Temporal Action Localization on CrossTask
Tasks: Action Localization, Long Video Retrieval (Background Removed)
1 code implementation • 3 Mar 2019 • Vivek Sharma, Makarand Tapaswi, M. Saquib Sarfraz, Rainer Stiefelhagen
In this paper, we address video face clustering using unsupervised methods.
1 code implementation • ICLR 2019 • Seung Wook Kim, Makarand Tapaswi, Sanja Fidler
Thus, a module for a new task learns to query existing modules and composes their outputs in order to produce its own output.
no code implementations • CVPR 2018 • Yuhao Zhou, Makarand Tapaswi, Sanja Fidler
We are interested in enabling automatic 4D cinema by parsing physical and special effects from untrimmed movies.
no code implementations • CVPR 2018 • Paul Vicol, Makarand Tapaswi, Lluis Castrejon, Sanja Fidler
Towards this goal, we introduce a novel dataset called MovieGraphs which provides detailed, graph-based annotations of social situations depicted in movie clips.
1 code implementation • ICCV 2017 • Ruiyu Li, Makarand Tapaswi, Renjie Liao, Jiaya Jia, Raquel Urtasun, Sanja Fidler
We address the problem of recognizing situations in images.
Ranked #10 on Situation Recognition on imSitu
no code implementations • 22 Nov 2016 • Manuel Martinez, Monica Haurilet, Ziad Al-Halah, Makarand Tapaswi, Rainer Stiefelhagen
The Earth Mover's Distance (EMD) computes the optimal cost of transforming one distribution into another, given a known transport metric between them.
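For intuition, the EMD between two 1D histograms on the same equally spaced bins, with |i - j| as the transport metric, reduces to the L1 distance between their CDFs. A minimal sketch (the function name is illustrative; this assumes both histograms have equal total mass):

```python
def emd_1d(p, q):
    # Earth Mover's Distance between two 1D histograms on the same bins,
    # with unit cost per bin moved. In 1D the optimal transport cost is
    # the sum of absolute differences between the running CDFs.
    assert abs(sum(p) - sum(q)) < 1e-9, "histograms must have equal mass"
    cost, surplus = 0.0, 0.0
    for pi, qi in zip(p, q):
        surplus += pi - qi   # mass that still has to be moved rightwards
        cost += abs(surplus)
    return cost

# Moving all mass one bin over costs exactly 1.0.
print(emd_1d([1.0, 0.0], [0.0, 1.0]))  # → 1.0
```

For general ground metrics and higher dimensions, computing the EMD requires solving a linear transportation problem, which is why fast approximations are of interest.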
no code implementations • CVPR 2016 • Ziad Al-Halah, Makarand Tapaswi, Rainer Stiefelhagen
In this work, we aim to carry out attribute-based zero-shot classification in an unsupervised manner.
1 code implementation • CVPR 2016 • Makarand Tapaswi, Yukun Zhu, Rainer Stiefelhagen, Antonio Torralba, Raquel Urtasun, Sanja Fidler
We introduce the MovieQA dataset which aims to evaluate automatic story comprehension from both video and text.
no code implementations • CVPR 2015 • Makarand Tapaswi, Martin Bauml, Rainer Stiefelhagen
Such an alignment facilitates finding differences between the adaptation and the original source, and also acts as a basis for deriving rich descriptions from the novel for the video clips.
1 code implementation • CVPR 2014 • Makarand Tapaswi, Martin Bauml, Rainer Stiefelhagen
We present a novel way to automatically summarize and represent the storyline of a TV episode by visualizing character interactions as a chart.
no code implementations • CVPR 2013 • Martin Bauml, Makarand Tapaswi, Rainer Stiefelhagen
We address the problem of person identification in TV series.