no code implementations • 12 Mar 2025 • Md Mohaiminul Islam, Tushar Nagarajan, Huiyu Wang, Gedas Bertasius, Lorenzo Torresani
The self-attention mechanism provides a general solution for sequence modeling, but it has a prohibitive cost when applied to a massive number of spatiotemporal tokens in long videos.
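A rough back-of-the-envelope calculation of why this cost is prohibitive (the frame rate, patch grid, and hidden size below are illustrative assumptions, not settings from the paper):

```python
# Rough cost of full self-attention over spatiotemporal tokens.
# All numbers below (fps, patch grid, video length, hidden size) are
# illustrative assumptions, not the paper's settings.
def attention_flops(num_tokens: int, dim: int) -> float:
    # QK^T and the attention-weighted sum over V each cost ~N^2 * d multiply-adds.
    return 2.0 * num_tokens ** 2 * dim

fps, minutes = 1, 60                    # sample 1 frame per second for an hour
patches_per_frame = 16 * 16             # e.g., a 16x16 patch grid per frame
tokens = fps * minutes * 60 * patches_per_frame   # 921,600 tokens
print(f"tokens: {tokens:,}")
print(f"approx. attention FLOPs per layer: {attention_flops(tokens, 768):.2e}")
```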
no code implementations • 21 Feb 2025 • Yue Yang, Linfeng Zhao, Mingyu Ding, Gedas Bertasius, Daniel Szafir
However, even in simple long-horizon tasks like skill chaining, hierarchical approaches often struggle due to a problem we identify as Observation Space Shift (OSS), where the sequential execution of preceding skills causes shifts in the observation space, disrupting the performance of subsequent individually trained skill policies.
1 code implementation • 12 Dec 2024 • Xizi Wang, Feng Cheng, Ziyang Wang, Huiyu Wang, Md Mohaiminul Islam, Lorenzo Torresani, Mohit Bansal, Gedas Bertasius, David Crandall
Recent work has focused on enabling Video LLMs to perform video temporal grounding via next-token prediction of temporal timestamps.
1 code implementation • 22 Nov 2024 • Tanveer Hannan, Md Mohaiminul Islam, Jindong Gu, Thomas Seidl, Gedas Bertasius
We propose ReVisionLLM, a recursive vision-language model designed to locate events in hour-long videos.
no code implementations • 30 Sep 2024 • Md Mohaiminul Islam, Tushar Nagarajan, Huiyu Wang, Fu-Jen Chu, Kris Kitani, Gedas Bertasius, Xitong Yang
Goal-oriented planning, or anticipating a series of actions that transition an agent from its current state to a predefined objective, is crucial for developing intelligent assistants aiding users in daily procedural tasks.
no code implementations • 11 Sep 2024 • Yan-Bo Lin, Yu Tian, Linjie Yang, Gedas Bertasius, Heng Wang
We present a framework for learning to generate background music from video inputs.
1 code implementation • 29 May 2024 • Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin, Jaehong Yoon, Feng Cheng, Gedas Bertasius, Mohit Bansal
Specifically, we incorporate multigranularity information into a tree-based representation, allowing VideoTree to extract query-relevant details from long videos in a coarse-to-fine manner.
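A minimal coarse-to-fine selection sketch in that spirit (this is not the VideoTree code; the chunking, cluster count, and similarity measure are illustrative assumptions):

```python
# Coarse-to-fine frame selection sketch: group frames coarsely, then expand
# only the groups most relevant to the query. Not the VideoTree implementation.
import torch

def coarse_to_fine_select(frame_feats, query_feat, num_clusters=8, top_k=2):
    # Coarse level: split frames into contiguous chunks ("clusters").
    chunks = frame_feats.chunk(num_clusters, dim=0)
    centroids = torch.stack([c.mean(dim=0) for c in chunks])
    # Score each chunk by cosine similarity to the query embedding.
    scores = torch.nn.functional.cosine_similarity(centroids, query_feat[None], dim=-1)
    keep = scores.topk(min(top_k, len(chunks))).indices
    # Fine level: return every frame inside the most relevant chunks.
    return torch.cat([chunks[i] for i in keep.tolist()], dim=0)

frames = torch.randn(256, 512)   # 256 frame embeddings, dim 512 (dummy data)
query = torch.randn(512)
print(coarse_to_fine_select(frames, query).shape)
```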
1 code implementation • 28 Mar 2024 • Yan-Bo Lin, Gedas Bertasius
Our framework uses a single shared vision transformer backbone to process audio and visual inputs, improving its parameter efficiency, reducing the GPU memory footprint, and allowing us to scale our method to larger datasets and model sizes.
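A minimal sketch of the shared-backbone idea, assuming modality-specific patch projections feeding one transformer (layer sizes and patch settings are illustrative, not the paper's configuration):

```python
# One shared transformer for audio and visual tokens; only the patch
# embeddings are modality-specific. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class SharedAVBackbone(nn.Module):
    def __init__(self, dim=768, depth=4, num_heads=8):
        super().__init__()
        self.vis_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)   # RGB patches
        self.aud_embed = nn.Conv2d(1, dim, kernel_size=16, stride=16)   # spectrogram patches
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(layer, depth)       # one backbone for both

    def forward(self, image, spectrogram):
        v = self.vis_embed(image).flatten(2).transpose(1, 2)            # (B, Nv, dim)
        a = self.aud_embed(spectrogram).flatten(2).transpose(1, 2)      # (B, Na, dim)
        tokens = torch.cat([v, a], dim=1)                               # joint audio-visual tokens
        return self.shared_encoder(tokens)

model = SharedAVBackbone()
out = model(torch.randn(2, 3, 224, 224), torch.randn(2, 1, 128, 128))
print(out.shape)   # (2, 196 + 64, 768)
```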
no code implementations • 20 Mar 2024 • Yue Yang, Bryce Ikeda, Gedas Bertasius, Daniel Szafir
Our framework facilitates scalable and diverse demonstration collection for real-world tasks.
1 code implementation • 13 Mar 2024 • Feng Cheng, Ziyang Wang, Yi-Lin Sung, Yan-Bo Lin, Mohit Bansal, Gedas Bertasius
Our DAM model outperforms prior state-of-the-art continual learning approaches by 9.1% while exhibiting 1.9% less forgetting on 6 VidQA datasets spanning various domains.
2 code implementations • CVPR 2024 • Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, Gedas Bertasius
We utilize a curriculum learning training scheme to learn the hierarchical structure of videos, starting from clip-level captions describing atomic actions, then focusing on segment-level descriptions, and concluding with generating summaries for hour-long videos.
Ranked #15 on Zero-Shot Video Question Answer on EgoSchema (fullset)
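A toy sketch of such a curriculum schedule (the stage names, video lengths, and epoch counts are illustrative; this is not the paper's training recipe):

```python
# Toy curriculum: easier, shorter-granularity caption targets first,
# hour-long summaries last. All numbers are illustrative assumptions.
curriculum = [
    ("clip_captions",        {"max_video_len_sec": 10,   "epochs": 2}),
    ("segment_descriptions", {"max_video_len_sec": 180,  "epochs": 2}),
    ("video_summaries",      {"max_video_len_sec": 3600, "epochs": 2}),
]

def train_stage(stage_name, cfg):
    # Placeholder for a real training loop over captions at this granularity.
    print(f"training on {stage_name}: videos up to {cfg['max_video_len_sec']}s "
          f"for {cfg['epochs']} epochs")

for stage_name, cfg in curriculum:   # stages run in order of increasing difficulty
    train_stage(stage_name, cfg)
```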
1 code implementation • 19 Jan 2024 • Xiyao Wang, YuHang Zhou, Xiaoyu Liu, Hongjin Lu, Yuancheng Xu, Feihong He, Jaehong Yoon, Taixi Lu, Gedas Bertasius, Mohit Bansal, Huaxiu Yao, Furong Huang
However, current MLLM benchmarks are predominantly designed to evaluate reasoning based on static information about a single image, and the ability of modern MLLMs to extrapolate from image sequences, which is essential for understanding our ever-changing world, has been less investigated.
1 code implementation • 28 Dec 2023 • Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, Gedas Bertasius
Furthermore, we show that a specialized prompt that asks the LLM first to summarize the noisy short-term visual captions and then answer a given input question leads to a significant LVQA performance boost.
Ranked #2 on Zero-Shot Video Question Answer on NExT-GQA
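An illustrative two-step prompt in the spirit described above (the exact wording used in the paper may differ, and the captions here are dummy data):

```python
# Illustrative summarize-then-answer prompt; wording and captions are
# assumptions for demonstration only.
captions = [
    "0:00-0:10 a person opens a drawer",
    "0:10-0:20 a person takes out a knife",
    "0:20-0:30 a person slices a tomato",
]
question = "What is the person preparing?"

prompt = (
    "You are given noisy short-term captions of a long video.\n"
    "First, write a concise summary of what happens in the video.\n"
    "Then, using that summary, answer the question.\n\n"
    "Captions:\n" + "\n".join(captions) + "\n\n"
    f"Question: {question}\n"
    "Summary and answer:"
)
print(prompt)
```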
2 code implementations • 11 Dec 2023 • Tanveer Hannan, Md Mohaiminul Islam, Thomas Seidl, Gedas Bertasius
Adapting existing short video (5-30 seconds) grounding methods to this problem yields poor performance.
Ranked #2 on Natural Language Moment Retrieval on MAD
2 code implementations • CVPR 2024 • Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, Rawal Khirodkar, Devansh Kukreja, Kevin J Liang, Jia-Wei Liu, Sagnik Majumder, Yongsen Mao, Miguel Martin, Effrosyni Mavroudi, Tushar Nagarajan, Francesco Ragusa, Santhosh Kumar Ramakrishnan, Luigi Seminara, Arjun Somayazulu, Yale Song, Shan Su, Zihui Xue, Edward Zhang, Jinxu Zhang, Angela Castillo, Changan Chen, Xinzhu Fu, Ryosuke Furuta, Cristina Gonzalez, Prince Gupta, Jiabo Hu, Yifei HUANG, Yiming Huang, Weslie Khoo, Anush Kumar, Robert Kuo, Sach Lakhavani, Miao Liu, Mi Luo, Zhengyi Luo, Brighid Meredith, Austin Miller, Oluwatumininu Oguntola, Xiaqing Pan, Penny Peng, Shraman Pramanick, Merey Ramazanova, Fiona Ryan, Wei Shan, Kiran Somasundaram, Chenan Song, Audrey Southerland, Masatoshi Tateno, Huiyu Wang, Yuchen Wang, Takuma Yagi, Mingfei Yan, Xitong Yang, Zecheng Yu, Shengxin Cindy Zha, Chen Zhao, Ziwei Zhao, Zhifan Zhu, Jeff Zhuo, Pablo Arbelaez, Gedas Bertasius, David Crandall, Dima Damen, Jakob Engel, Giovanni Maria Farinella, Antonino Furnari, Bernard Ghanem, Judy Hoffman, C. V. Jawahar, Richard Newcombe, Hyun Soo Park, James M. Rehg, Yoichi Sato, Manolis Savva, Jianbo Shi, Mike Zheng Shou, Michael Wray
We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge.
1 code implementation • ICCV 2023 • Ziyang Wang, Yi-Lin Sung, Feng Cheng, Gedas Bertasius, Mohit Bansal
Specifically, our model captures the cross-modal similarity information at different granularity levels.
Ranked #12 on Video Retrieval on MSR-VTT
1 code implementation • CVPR 2024 • Xizi Wang, Feng Cheng, Gedas Bertasius, David Crandall
These two contexts are complementary and can help infer the active speaker.
1 code implementation • CVPR 2023 • Md Mohaiminul Islam, Mahmudul Hasan, Kishan Shamsundar Athrey, Tony Braskich, Gedas Bertasius
Given a sequence of frames divided into movie shots (uninterrupted periods where the camera position does not change), the S4A block first applies self-attention to capture short-range intra-shot dependencies.
Ranked #2 on
Scene Segmentation
on MovieNet
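A minimal sketch of shot-local self-attention, where attention is restricted to tokens within each shot (the shot boundaries, dimensions, and layer below are illustrative assumptions, not the S4A implementation):

```python
# Shot-local self-attention: attend only within each shot to capture
# short-range intra-shot dependencies. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

dim, heads = 256, 4
attn = nn.MultiheadAttention(dim, heads, batch_first=True)

frames = torch.randn(1, 12, dim)             # 12 frame tokens from one movie segment
shot_boundaries = [(0, 4), (4, 9), (9, 12)]  # three shots (dummy segmentation)

outputs = []
for start, end in shot_boundaries:
    shot = frames[:, start:end]              # tokens belonging to one shot
    out, _ = attn(shot, shot, shot)          # self-attention restricted to this shot
    outputs.append(out)
intra_shot = torch.cat(outputs, dim=1)       # (1, 12, dim), same order as input
print(intra_shot.shape)
```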
1 code implementation • CVPR 2023 • Yan-Bo Lin, Yi-Lin Sung, Jie Lei, Mohit Bansal, Gedas Bertasius
To do so, we propose a latent audio-visual hybrid (LAVISH) adapter that adapts pretrained ViTs to audio-visual tasks by injecting a small number of trainable parameters into every layer of a frozen ViT.
Ranked #4 on Audio-visual Question Answering on MUSIC-AVQA
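A minimal bottleneck-adapter sketch showing the general idea of training a small module alongside a frozen transformer layer (the bottleneck size and placement are assumptions; this is not the LAVISH implementation):

```python
# Small trainable adapter next to a frozen transformer layer; only the
# adapter parameters are updated. Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))   # residual update

frozen_layer = nn.TransformerEncoderLayer(768, 8, batch_first=True)
for p in frozen_layer.parameters():
    p.requires_grad = False                 # backbone stays frozen

adapter = Adapter()                         # only these parameters are trained
tokens = torch.randn(2, 196, 768)
out = adapter(frozen_layer(tokens))
print(sum(p.numel() for p in adapter.parameters() if p.requires_grad))
```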
1 code implementation • CVPR 2023 • Feng Cheng, Xizi Wang, Jie Lei, David Crandall, Mohit Bansal, Gedas Bertasius
Furthermore, our model also obtains state-of-the-art video question-answering results on ActivityNet-QA, MSRVTT-QA, MSRVTT-MC and TVQA.
Ranked #2 on Video Retrieval on Condensed Movies (using extra training data)
2 code implementations • ICCV 2023 • Qin Liu, Zhenlin Xu, Gedas Bertasius, Marc Niethammer
Although this design is simple and has been proven effective, it has not yet been explored for interactive image segmentation.
Ranked #2 on Interactive Segmentation on SBD
no code implementations • 24 Aug 2022 • Avinash Madasu, Estelle Aflalo, Gabriela Ben Melech Stan, Shachar Rosenman, Shao-Yen Tseng, Gedas Bertasius, Vasudev Lal
In this paper, we propose a framework, MuMUR, that utilizes knowledge transfer from a multilingual model to boost the performance of multi-modal (image and video) retrieval.
1 code implementation • 24 Jul 2022 • Md Mohaiminul Islam, Gedas Bertasius
This report describes our submission called "TarHeels" for the Ego4D: Object State Change Classification Challenge.
1 code implementation • 11 May 2022 • Avinash Madasu, Junier Oliva, Gedas Bertasius
To overcome this limitation, we propose a novel framework for Video Retrieval using Dialog (ViReD), which enables the user to interact with an AI agent via multiple rounds of dialog, where the user refines retrieved results by answering questions generated by an AI agent.
1 code implementation • 6 Apr 2022 • Yan-Bo Lin, Jie Lei, Mohit Bansal, Gedas Bertasius
We introduce an audiovisual method for long-range text-to-video retrieval.
1 code implementation • 4 Apr 2022 • Feng Cheng, Gedas Bertasius
To address these issues, we propose TALLFormer, a memory-efficient and end-to-end trainable Temporal Action Localization Transformer with Long-term memory.
1 code implementation • 4 Apr 2022 • Md Mohaiminul Islam, Gedas Bertasius
Most modern video recognition models are designed to operate on short video clips (e.g., 5-10s in length).
Ranked #6 on Video Classification on Breakfast
1 code implementation • CVPR 2022 • Xudong Lin, Fabio Petroni, Gedas Bertasius, Marcus Rohrbach, Shih-Fu Chang, Lorenzo Torresani
In this paper we consider the problem of classifying fine-grained, multi-step activities (e.g., cooking different recipes, making disparate home improvements, creating various forms of arts and crafts) from long videos spanning up to several minutes.
Ranked #4 on Video Classification on COIN
no code implementations • CVPR 2022 • Jue Wang, Gedas Bertasius, Du Tran, Lorenzo Torresani
Our approach, named Long-Short Temporal Contrastive Learning (LSTCL), enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
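A minimal sketch of a contrastive objective between a short clip and a longer clip from the same video (an InfoNCE loss over a batch; the temperature, dimensions, and pairing are illustrative assumptions, not the LSTCL implementation):

```python
# InfoNCE-style loss pairing each short-clip embedding with the long-clip
# embedding from the same video. Sizes and temperature are assumptions.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(short_emb, long_emb, temperature=0.07):
    short_emb = F.normalize(short_emb, dim=-1)
    long_emb = F.normalize(long_emb, dim=-1)
    logits = short_emb @ long_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(len(short_emb))            # matching pairs on the diagonal
    return F.cross_entropy(logits, targets)

short = torch.randn(8, 512)   # embeddings of short clips (dummy data)
long = torch.randn(8, 512)    # embeddings of longer clips from the same videos
print(clip_contrastive_loss(short, long))
```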
16 code implementations • 9 Feb 2021 • Gedas Bertasius, Heng Wang, Lorenzo Torresani
We present a convolution-free approach to video classification built exclusively on self-attention over space and time.
Ranked #1 on Video Question Answering on Howto100M-QA
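A minimal sketch of joint self-attention over space-time patch tokens; the paper studies several attention schemes, and the clip size, patch grid, and layers below are illustrative assumptions rather than the paper's configuration:

```python
# Joint space-time self-attention over patch tokens from a video clip.
# Sizes are illustrative assumptions, not the paper's settings.
import torch
import torch.nn as nn

B, T, C, H, W = 1, 8, 3, 224, 224
dim, patch = 768, 16

patch_embed = nn.Conv2d(C, dim, kernel_size=patch, stride=patch)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)

video = torch.randn(B, T, C, H, W)
x = patch_embed(video.flatten(0, 1))      # (B*T, dim, H/16, W/16)
x = x.flatten(2).transpose(1, 2)          # patch tokens per frame
x = x.reshape(B, T * x.shape[1], dim)     # all spatiotemporal tokens together
out = encoder(x)                          # attention across space and time
print(out.shape)                          # (1, 8*196, 768)
```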
no code implementations • CVPR 2021 • Xudong Lin, Gedas Bertasius, Jue Wang, Shih-Fu Chang, Devi Parikh, Lorenzo Torresani
We present Vx2Text, a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
no code implementations • NeurIPS 2020 • Gedas Bertasius, Lorenzo Torresani
A fully-supervised approach to recognizing object states and their contexts in the real-world is unfortunately marred by the long-tailed, open-ended distribution of the data, which would effectively require massive amounts of annotations to capture the appearance of objects in all their different forms.
no code implementations • CVPR 2020 • Gedas Bertasius, Lorenzo Torresani
We introduce a method for simultaneously classifying, segmenting and tracking object instances in a video sequence.
3 code implementations • NeurIPS 2019 • Gedas Bertasius, Christoph Feichtenhofer, Du Tran, Jianbo Shi, Lorenzo Torresani
To reduce the need for dense annotations, we propose a PoseWarper network that leverages training videos with sparse annotations (every k frames) to learn to perform dense temporal pose propagation and estimation.
Ranked #2 on Multi-Person Pose Estimation on PoseTrack2017 (using extra training data)
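A minimal sketch of warping features from a labeled frame toward an unlabeled one with learned deformable sampling (the channel sizes and offset predictor are illustrative assumptions; this is not the PoseWarper implementation):

```python
# Predict sampling offsets from frames A and B, then deformably sample B's
# features so they align with frame A. Sizes are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

C, K = 64, 3
offset_pred = nn.Conv2d(2 * C, 2 * K * K, kernel_size=3, padding=1)  # offsets from A & B
warp = DeformConv2d(C, C, kernel_size=K, padding=1)                  # samples from B

feat_a = torch.randn(1, C, 32, 32)   # features of the unlabeled frame A
feat_b = torch.randn(1, C, 32, 32)   # features of a nearby labeled frame B

offsets = offset_pred(torch.cat([feat_a, feat_b], dim=1))  # where to sample in B
warped_b = warp(feat_b, offsets)                            # B's features aligned to A
print(warped_b.shape)                                       # (1, 64, 32, 32)
```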
no code implementations • 10 Apr 2019 • Yang Wang, Vinh Tran, Gedas Bertasius, Lorenzo Torresani, Minh Hoai
This is a challenging task due to the subtlety of human actions in video and the co-occurrence of contextual elements.
no code implementations • 11 Dec 2018 • Gedas Bertasius, Christoph Feichtenhofer, Du Tran, Jianbo Shi, Lorenzo Torresani
Our network learns to spatially sample features from Frame B in order to maximize pose detection accuracy in Frame A.
no code implementations • ECCV 2018 • Gedas Bertasius, Lorenzo Torresani, Jianbo Shi
We propose a Spatiotemporal Sampling Network (STSN) that uses deformable convolutions across time for object detection in videos.
no code implementations • CVPR 2018 • Gedas Bertasius, Aaron Chan, Jianbo Shi
We present a model that uses a single first-person image to generate an egocentric basketball motion sequence in the form of a 12D camera configuration trajectory, which encodes a player's 3D location and 3D head orientation throughout the sequence.
no code implementations • 5 Sep 2017 • Gedas Bertasius, Jianbo Shi
We present a first-person method for cooperative basketball intention prediction: we predict with whom the camera wearer will cooperate in the near future from unlabeled first-person images.
no code implementations • ICCV 2017 • Gedas Bertasius, Hyun Soo Park, Stella X. Yu, Jianbo Shi
Finally, we use this feature to learn a basketball assessment model from pairs of labeled first-person basketball videos, for which a basketball expert indicates which of the two players is better.
1 code implementation • ICCV 2017 • Gedas Bertasius, Hyun Soo Park, Stella X. Yu, Jianbo Shi
In this work, we show that we can detect important objects in first-person images without supervision by the camera wearer or even third-person labelers.
no code implementations • 24 May 2016 • Gedas Bertasius, Qiang Liu, Lorenzo Torresani, Jianbo Shi
In this work, we present a new Local Perturb-and-MAP (locPMAP) framework that replaces the global optimization with a local optimization by exploiting our observed connection between locPMAP and the pseudolikelihood of the original CRF model.
no code implementations • CVPR 2017 • Gedas Bertasius, Lorenzo Torresani, Stella X. Yu, Jianbo Shi
It combines these two objectives via a novel random walk layer that enforces consistent spatial grouping in the deep layers of the network.
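A minimal sketch of a random-walk style refinement step, in which a row-normalized pixel affinity matrix propagates per-pixel predictions so that similar pixels receive similar labels (the sizes and the affinity measure are illustrative assumptions, not this paper's layer):

```python
# One random-walk propagation step over a row-stochastic pixel affinity
# matrix. Sizes and affinity measure are illustrative assumptions.
import torch
import torch.nn.functional as F

N, C, D = 64, 21, 32                       # pixels, classes, embedding dim
embeddings = torch.randn(N, D)             # per-pixel embeddings (dummy data)
predictions = torch.randn(N, C)            # per-pixel class scores (dummy data)

affinity = embeddings @ embeddings.t()     # pairwise pixel similarities
walk = F.softmax(affinity, dim=-1)         # row-stochastic transition matrix
refined = walk @ predictions               # propagate predictions along the walk
print(refined.shape)                       # (64, 21)
```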
no code implementations • 15 Mar 2016 • Gedas Bertasius, Hyun Soo Park, Stella X. Yu, Jianbo Shi
Unlike traditional third-person cameras mounted on robots, a first-person camera captures a person's visual sensorimotor object interactions from up close.
no code implementations • CVPR 2016 • Gedas Bertasius, Jianbo Shi, Lorenzo Torresani
To overcome these problems, we introduce a Boundary Neural Field (BNF), which is a global energy model integrating FCN predictions with boundary cues.
no code implementations • 9 Nov 2015 • Gedas Bertasius, Hyun Soo Park, Jianbo Shi
We empirically show that this representation can accurately characterize the egocentric object prior by testing it on an egocentric RGBD dataset for three tasks: 3D saliency detection, future saliency prediction, and interaction classification.
no code implementations • ICCV 2015 • Gedas Bertasius, Jianbo Shi, Lorenzo Torresani
We can view this process as a "Low-for-High" scheme, where low-level boundaries aid high-level vision tasks.
no code implementations • CVPR 2015 • Gedas Bertasius, Jianbo Shi, Lorenzo Torresani
This section of the network is applied to four different scales of the image input.