Search Results for author: Gedas Bertasius

Found 41 papers, 20 papers with code

Siamese Vision Transformers are Scalable Audio-visual Learners

1 code implementation • 28 Mar 2024 • Yan-Bo Lin, Gedas Bertasius

Our framework uses a single shared vision transformer backbone to process audio and visual inputs, improving its parameter efficiency, reducing the GPU memory footprint, and allowing us to scale our method to larger datasets and model sizes.
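
A minimal PyTorch sketch of the shared-backbone idea: one transformer trunk shared by both modalities, with modality-specific patch embeddings. The module names, dimensions, and single-frame video input are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SharedAVBackbone(nn.Module):
    """One transformer trunk shared by audio and visual inputs (illustrative;
    positional embeddings and the contrastive head are omitted for brevity)."""
    def __init__(self, dim=768, depth=12, heads=12):
        super().__init__()
        # Modality-specific patch embeddings; the trunk below is shared.
        self.video_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # RGB frames
        self.audio_embed = nn.Conv2d(1, dim, kernel_size=16, stride=16)  # log-mel spectrograms
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, depth)                 # shared parameters

    def encode(self, patches):
        tokens = patches.flatten(2).transpose(1, 2)  # (B, num_patches, dim)
        return self.trunk(tokens).mean(dim=1)        # pooled embedding

    def forward(self, frames, spectrogram):
        v = self.encode(self.video_embed(frames))
        a = self.encode(self.audio_embed(spectrogram))
        return v, a  # embeddings for an audio-visual contrastive loss

model = SharedAVBackbone()
v, a = model(torch.randn(2, 3, 224, 224), torch.randn(2, 1, 128, 208))
```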

Contrastive Learning Retrieval

Augmented Reality Demonstrations for Scalable Robot Imitation Learning

no code implementations • 20 Mar 2024 • Yue Yang, Bryce Ikeda, Gedas Bertasius, Daniel Szafir

Our framework facilitates scalable and diverse demonstration collection for real-world tasks.

Imitation Learning

DAM: Dynamic Adapter Merging for Continual Video QA Learning

1 code implementation • 13 Mar 2024 • Feng Cheng, Ziyang Wang, Yi-Lin Sung, Yan-Bo Lin, Mohit Bansal, Gedas Bertasius

Our DAM model outperforms prior state-of-the-art continual learning approaches by 9.1% while exhibiting 1.9% less forgetting on 6 VidQA datasets spanning various domains.

Continual Learning Image Classification +2

Video ReCap: Recursive Captioning of Hour-Long Videos

no code implementations • 20 Feb 2024 • Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, Gedas Bertasius

We utilize a curriculum learning training scheme to learn the hierarchical structure of videos, starting from clip-level captions describing atomic actions, then focusing on segment-level descriptions, and concluding with generating summaries for hour-long videos.
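
As a rough schematic of the recursive captioning hierarchy described above (the `caption` function and the grouping size are hypothetical placeholders, not the paper's components):

```python
def caption(inputs):
    """Placeholder captioner: maps a list of inputs to one caption string."""
    return " ; ".join(map(str, inputs))

def chunks(items, size):
    """Split a list into consecutive groups of `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def recursive_captions(clip_features, clips_per_segment=4):
    # Level 1: clip-level captions describing atomic actions.
    clip_caps = [caption([f]) for f in clip_features]
    # Level 2: segment-level descriptions built from groups of clip captions.
    seg_caps = [caption(group) for group in chunks(clip_caps, clips_per_segment)]
    # Level 3: a summary of the full (hour-long) video from segment captions.
    summary = caption(seg_caps)
    return clip_caps, seg_caps, summary
```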

Video Captioning Video Understanding

Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences

1 code implementation • 19 Jan 2024 • Xiyao Wang, YuHang Zhou, Xiaoyu Liu, Hongjin Lu, Yuancheng Xu, Feihong He, Jaehong Yoon, Taixi Lu, Gedas Bertasius, Mohit Bansal, Huaxiu Yao, Furong Huang

However, current MLLM benchmarks are predominantly designed to evaluate reasoning based on static information about a single image, and the ability of modern MLLMs to extrapolate from image sequences, which is essential for understanding our ever-changing world, has been less investigated.

Language Modelling Large Language Model

A Simple LLM Framework for Long-Range Video Question-Answering

1 code implementation • 28 Dec 2023 • Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, Gedas Bertasius

Furthermore, we show that a specialized prompt that asks the LLM first to summarize the noisy short-term visual captions and then answer a given input question leads to a significant LVQA performance boost.
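
A minimal sketch of that two-stage prompt, assuming a generic chat-LLM client; `ask_llm` and the prompt wording are placeholders, not the paper's exact prompts:

```python
def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for any chat-LLM call; plug in a real client here."""
    raise NotImplementedError

def answer_long_video_question(clip_captions: list[str], question: str) -> str:
    # Stage 1: ask the LLM to summarize the noisy short-term visual captions.
    summary = ask_llm(
        "The following are noisy captions of short clips from one long video:\n"
        + "\n".join(clip_captions)
        + "\n\nBriefly summarize what happens in the video."
    )
    # Stage 2: answer the question from the cleaned-up summary.
    return ask_llm(f"Video summary: {summary}\n\nQuestion: {question}\nAnswer:")
```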

Large Language Model Long-range modeling +2

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

no code implementations • 30 Nov 2023 • Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, Rawal Khirodkar, Devansh Kukreja, Kevin J Liang, Jia-Wei Liu, Sagnik Majumder, Yongsen Mao, Miguel Martin, Effrosyni Mavroudi, Tushar Nagarajan, Francesco Ragusa, Santhosh Kumar Ramakrishnan, Luigi Seminara, Arjun Somayazulu, Yale Song, Shan Su, Zihui Xue, Edward Zhang, Jinxu Zhang, Angela Castillo, Changan Chen, Xinzhu Fu, Ryosuke Furuta, Cristina Gonzalez, Prince Gupta, Jiabo Hu, Yifei HUANG, Yiming Huang, Weslie Khoo, Anush Kumar, Robert Kuo, Sach Lakhavani, Miao Liu, Mi Luo, Zhengyi Luo, Brighid Meredith, Austin Miller, Oluwatumininu Oguntola, Xiaqing Pan, Penny Peng, Shraman Pramanick, Merey Ramazanova, Fiona Ryan, Wei Shan, Kiran Somasundaram, Chenan Song, Audrey Southerland, Masatoshi Tateno, Huiyu Wang, Yuchen Wang, Takuma Yagi, Mingfei Yan, Xitong Yang, Zecheng Yu, Shengxin Cindy Zha, Chen Zhao, Ziwei Zhao, Zhifan Zhu, Jeff Zhuo, Pablo Arbelaez, Gedas Bertasius, David Crandall, Dima Damen, Jakob Engel, Giovanni Maria Farinella, Antonino Furnari, Bernard Ghanem, Judy Hoffman, C. V. Jawahar, Richard Newcombe, Hyun Soo Park, James M. Rehg, Yoichi Sato, Manolis Savva, Jianbo Shi, Mike Zheng Shou, Michael Wray

We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge.

Video Understanding

LoCoNet: Long-Short Context Network for Active Speaker Detection

1 code implementation • 19 Jan 2023 • Xizi Wang, Feng Cheng, Gedas Bertasius, David Crandall

These two contexts are complementary and together help infer the active speaker.

Efficient Movie Scene Detection using State-Space Transformers

1 code implementation • CVPR 2023 • Md Mohaiminul Islam, Mahmudul Hasan, Kishan Shamsundar Athrey, Tony Braskich, Gedas Bertasius

Given a sequence of frames divided into movie shots (uninterrupted periods where the camera position does not change), the S4A block first applies self-attention to capture short-range intra-shot dependencies.
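
A hedged sketch of that intra-shot attention step, with a plain GRU standing in for the long-range state-space layer the paper pairs it with; shapes and module names are assumptions:

```python
import torch
import torch.nn as nn

class IntraShotAttentionBlock(nn.Module):
    """Sketch of the short-range half of an S4A-style block: self-attention
    applied independently inside each movie shot. The long-range state-space
    half is stood in for by a GRU here, purely for illustration."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.long_range = nn.GRU(dim, dim, batch_first=True)  # stand-in for S4

    def forward(self, x):
        # x: (batch, num_shots, frames_per_shot, dim)
        b, s, f, d = x.shape
        frames = x.reshape(b * s, f, d)                 # each shot attends to itself
        attended, _ = self.attn(frames, frames, frames)
        shots = attended.mean(dim=1).reshape(b, s, d)   # one token per shot
        out, _ = self.long_range(shots)                 # long-range across shots
        return out

block = IntraShotAttentionBlock()
y = block(torch.randn(2, 10, 16, 256))  # 10 shots of 16 frames each
```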

Video Recognition

Vision Transformers are Parameter-Efficient Audio-Visual Learners

1 code implementation • CVPR 2023 • Yan-Bo Lin, Yi-Lin Sung, Jie Lei, Mohit Bansal, Gedas Bertasius

To do so, we propose a latent audio-visual hybrid (LAVISH) adapter that adapts pretrained ViTs to audio-visual tasks by injecting a small number of trainable parameters into every layer of a frozen ViT.
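
A generic bottleneck adapter of the kind this family of methods injects into a frozen backbone; LAVISH's latent audio-visual fusion is more involved, so treat this purely as a sketch of the parameter-efficiency mechanism:

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Small trainable residual module inserted into a frozen ViT layer
    (illustrative dimensions; not LAVISH's actual adapter)."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # zero-init so the adapter starts
        nn.init.zeros_(self.up.bias)    # as an identity mapping

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))  # residual update

adapter = BottleneckAdapter()
x = torch.randn(2, 197, 768)              # a ViT token sequence
assert torch.allclose(adapter(x), x)      # identity at initialization
```

In this scheme only the adapters (and typically a task head) receive gradients, while the pretrained ViT weights stay frozen, which is what keeps the trainable parameter count small.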

Audio-visual Question Answering

VindLU: A Recipe for Effective Video-and-Language Pretraining

1 code implementation • CVPR 2023 • Feng Cheng, Xizi Wang, Jie Lei, David Crandall, Mohit Bansal, Gedas Bertasius

Furthermore, our model also obtains state-of-the-art video question-answering results on ActivityNet-QA, MSRVTT-QA, MSRVTT-MC and TVQA.

Ranked #2 on Video Retrieval on Condensed Movies (using extra training data)

Question Answering Retrieval +3

MuMUR : Multilingual Multimodal Universal Retrieval

no code implementations • 24 Aug 2022 • Avinash Madasu, Estelle Aflalo, Gabriela Ben Melech Stan, Shachar Rosenman, Shao-Yen Tseng, Gedas Bertasius, Vasudev Lal

In this paper, we propose a framework, MuMUR, that utilizes knowledge transfer from a multilingual model to boost the performance of multi-modal (image and video) retrieval.

Image Retrieval Machine Translation +3

Learning to Retrieve Videos by Asking Questions

1 code implementation • 11 May 2022 • Avinash Madasu, Junier Oliva, Gedas Bertasius

To overcome this limitation, we propose a novel framework for Video Retrieval using Dialog (ViReD), which enables the user to interact with an AI agent via multiple rounds of dialog, refining the retrieved results by answering the questions the agent generates.
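
Schematically, such an interactive retrieval loop might look like this (all helper callables are hypothetical stand-ins, not ViReD's actual components):

```python
def interactive_retrieval(query, ask_question, answer_question, retrieve, rounds=3):
    """ViReD-style loop sketch: retrieve, ask, fold in the answer, re-retrieve."""
    history = [query]
    results = retrieve(" ".join(history))
    for _ in range(rounds):
        question = ask_question(history, results)         # agent generates a question
        history += [question, answer_question(question)]  # user answers it
        results = retrieve(" ".join(history))             # dialog refines retrieval
    return results
```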

Retrieval Text to Video Retrieval +1

Long Movie Clip Classification with State-Space Video Models

1 code implementation • 4 Apr 2022 • Md Mohaiminul Islam, Gedas Bertasius

Most modern video recognition models are designed to operate on short video clips (e.g., 5-10s in length).

Classification Video Classification +2

TALLFormer: Temporal Action Localization with a Long-memory Transformer

1 code implementation • 4 Apr 2022 • Feng Cheng, Gedas Bertasius

To address these issues, we propose TALLFormer, a memory-efficient and end-to-end trainable Temporal Action Localization Transformer with Long-term memory.

Action Recognition Temporal Action Localization

Learning To Recognize Procedural Activities with Distant Supervision

1 code implementation • CVPR 2022 • Xudong Lin, Fabio Petroni, Gedas Bertasius, Marcus Rohrbach, Shih-Fu Chang, Lorenzo Torresani

In this paper we consider the problem of classifying fine-grained, multi-step activities (e.g., cooking different recipes, making disparate home improvements, creating various forms of arts and crafts) from long videos spanning up to several minutes.

Action Classification Language Modelling +1

Long-Short Temporal Contrastive Learning of Video Transformers

no code implementations • CVPR 2022 • Jue Wang, Gedas Bertasius, Du Tran, Lorenzo Torresani

Our approach, named Long-Short Temporal Contrastive Learning (LSTCL), enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
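
A minimal sketch of a contrastive objective pairing a short clip with a longer clip from the same video; LSTCL's actual setup (e.g., its momentum encoder) is richer than this InfoNCE toy:

```python
import torch
import torch.nn.functional as F

def lstcl_style_loss(short_emb, long_emb, temperature=0.07):
    """InfoNCE between embeddings of a short clip and a longer clip sampled
    from the same video; matching pairs sit on the diagonal (illustrative)."""
    s = F.normalize(short_emb, dim=-1)
    l = F.normalize(long_emb, dim=-1)
    logits = s @ l.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(s.size(0))       # positives are same-video pairs
    return F.cross_entropy(logits, targets)

loss = lstcl_style_loss(torch.randn(8, 256), torch.randn(8, 256))
```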

Action Recognition Contrastive Learning +1

Is Space-Time Attention All You Need for Video Understanding?

13 code implementations • 9 Feb 2021 • Gedas Bertasius, Heng Wang, Lorenzo Torresani

We present a convolution-free approach to video classification built exclusively on self-attention over space and time.
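
One of the attention schemes studied in this paper is divided space-time attention: temporal attention over the same patch across frames, followed by spatial attention within each frame. A minimal sketch with illustrative shapes (not the released TimeSformer code):

```python
import torch
import torch.nn as nn

class DividedSpaceTimeAttention(nn.Module):
    """Divided space-time attention sketch: attend across time per patch,
    then across space per frame (residuals and norms omitted for brevity)."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (batch, frames, patches, dim)
        b, t, p, d = x.shape
        xt = x.permute(0, 2, 1, 3).reshape(b * p, t, d)  # same patch across time
        xt, _ = self.time_attn(xt, xt, xt)
        x = xt.reshape(b, p, t, d).permute(0, 2, 1, 3)
        xs = x.reshape(b * t, p, d)                      # all patches in a frame
        xs, _ = self.space_attn(xs, xs, xs)
        return xs.reshape(b, t, p, d)

attn = DividedSpaceTimeAttention()
y = attn(torch.randn(2, 8, 196, 768))  # 8 frames of 14x14 patches
```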

Action Classification Action Recognition +5

COBE: Contextualized Object Embeddings from Narrated Instructional Video

no code implementations • NeurIPS 2020 • Gedas Bertasius, Lorenzo Torresani

A fully-supervised approach to recognizing object states and their contexts in the real-world is unfortunately marred by the long-tailed, open-ended distribution of the data, which would effectively require massive amounts of annotations to capture the appearance of objects in all their different forms.

Human-Object Interaction Detection Object +3

Learning Temporal Pose Estimation from Sparsely-Labeled Videos

3 code implementations • NeurIPS 2019 • Gedas Bertasius, Christoph Feichtenhofer, Du Tran, Jianbo Shi, Lorenzo Torresani

To reduce the need for dense annotations, we propose a PoseWarper network that leverages training videos with sparse annotations (every k frames) to learn to perform dense temporal pose propagation and estimation.

Ranked #2 on Multi-Person Pose Estimation on PoseTrack2018 (using extra training data)

Multi-Person Pose Estimation Optical Flow Estimation

Attentive Action and Context Factorization

no code implementations • 10 Apr 2019 • Yang Wang, Vinh Tran, Gedas Bertasius, Lorenzo Torresani, Minh Hoai

This is a challenging task due to the subtlety of human actions in video and the co-occurrence of contextual elements.

Action Recognition Temporal Action Localization

Object Detection in Video with Spatiotemporal Sampling Networks

no code implementations • ECCV 2018 • Gedas Bertasius, Lorenzo Torresani, Jianbo Shi

We propose a Spatiotemporal Sampling Network (STSN) that uses deformable convolutions across time for object detection in videos.

Object object-detection +2

Egocentric Basketball Motion Planning from a Single First-Person Image

no code implementations • CVPR 2018 • Gedas Bertasius, Aaron Chan, Jianbo Shi

We present a model that uses a single first-person image to generate an egocentric basketball motion sequence in the form of a 12D camera configuration trajectory, which encodes a player's 3D location and 3D head orientation throughout the sequence.

Motion Planning

Using Cross-Model EgoSupervision to Learn Cooperative Basketball Intention

no code implementations • 5 Sep 2017 • Gedas Bertasius, Jianbo Shi

We present a first-person method for cooperative basketball intention prediction: we predict with whom the camera wearer will cooperate in the near future from unlabeled first-person images.

Pose Estimation

Am I a Baller? Basketball Performance Assessment from First-Person Videos

no code implementations • ICCV 2017 • Gedas Bertasius, Hyun Soo Park, Stella X. Yu, Jianbo Shi

Finally, we use this feature to learn a basketball assessment model from pairs of labeled first-person basketball videos, for which a basketball expert indicates which of the two players is better.

Unsupervised Learning of Important Objects from First-Person Videos

1 code implementation • ICCV 2017 • Gedas Bertasius, Hyun Soo Park, Stella X. Yu, Jianbo Shi

In this work, we show that we can detect important objects in first-person images without supervision from the camera wearer or even from third-person labelers.

Object Segmentation +1

Convolutional Random Walk Networks for Semantic Image Segmentation

no code implementations • CVPR 2017 • Gedas Bertasius, Lorenzo Torresani, Stella X. Yu, Jianbo Shi

It combines these two objectives via a novel random walk layer that enforces consistent spatial grouping in the deep layers of the network.

Image Segmentation Scene Labeling +2

Local Perturb-and-MAP for Structured Prediction

no code implementations • 24 May 2016 • Gedas Bertasius, Qiang Liu, Lorenzo Torresani, Jianbo Shi

In this work, we present a new Local Perturb-and-MAP (locPMAP) framework that replaces the global optimization with a local optimization by exploiting our observed connection between locPMAP and the pseudolikelihood of the original CRF model.
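
For readers unfamiliar with the term, the pseudolikelihood mentioned here replaces the intractable joint likelihood of a CRF with a product of per-variable conditionals; in standard textbook notation (not a formula taken from the paper):

```latex
\log \mathrm{PL}(\theta) \;=\; \sum_{i} \log p_\theta\!\bigl(x_i \,\big|\, x_{\mathcal{N}(i)}\bigr)
```

where $\mathcal{N}(i)$ denotes the graph neighbors (Markov blanket) of variable $i$ in the CRF.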

Combinatorial Optimization Structured Prediction

First Person Action-Object Detection with EgoNet

no code implementations • 15 Mar 2016 • Gedas Bertasius, Hyun Soo Park, Stella X. Yu, Jianbo Shi

Unlike traditional third-person cameras mounted on robots, a first-person camera captures a person's visual sensorimotor object interactions from up close.

Human-Object Interaction Detection Object +2

Semantic Segmentation with Boundary Neural Fields

no code implementations • CVPR 2016 • Gedas Bertasius, Jianbo Shi, Lorenzo Torresani

To overcome these problems, we introduce a Boundary Neural Field (BNF), which is a global energy model integrating FCN predictions with boundary cues.
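
As a rough illustration of what a global energy combining unary predictions with boundary-modulated pairwise terms can look like (a generic CRF-style form, not the paper's exact formulation):

```latex
E(\mathbf{y}) \;=\; \sum_{i} \phi_i(y_i) \;+\; \lambda \sum_{(i,j)} w_{ij}\,\mathbb{1}\bigl[y_i \neq y_j\bigr]
```

where the unary terms $\phi_i$ would come from FCN predictions and the pairwise weights $w_{ij}$ from predicted boundary strength between neighboring pixels.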

Boundary Detection Object Localization +2

Exploiting Egocentric Object Prior for 3D Saliency Detection

no code implementations • 9 Nov 2015 • Gedas Bertasius, Hyun Soo Park, Jianbo Shi

We empirically show that this representation can accurately characterize the egocentric object prior by testing it on an egocentric RGBD dataset for three tasks: 3D saliency detection, future saliency prediction, and interaction classification.

Object Saliency Prediction
