Search Results for author: Gedas Bertasius

Found 41 papers, 20 papers with code

Siamese Vision Transformers are Scalable Audio-visual Learners

1 code implementation • 28 Mar 2024 • Yan-Bo Lin, Gedas Bertasius

Our framework uses a single shared vision transformer backbone to process audio and visual inputs, improving its parameter efficiency, reducing the GPU memory footprint, and allowing us to scale our method to larger datasets and model sizes.
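
A minimal PyTorch sketch of the shared-backbone idea: one transformer trunk shared by both modalities, with modality-specific patch embeddings. The module names, dimensions, and single-frame video input are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SharedAVBackbone(nn.Module):
    """One transformer trunk shared by audio and visual inputs (illustrative;
    positional embeddings and the contrastive head are omitted for brevity)."""
    def __init__(self, dim=768, depth=12, heads=12):
        super().__init__()
        # Modality-specific patch embeddings; the trunk below is shared.
        self.video_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # RGB frames
        self.audio_embed = nn.Conv2d(1, dim, kernel_size=16, stride=16)  # log-mel spectrograms
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, depth)                 # shared parameters

    def encode(self, patches):
        tokens = patches.flatten(2).transpose(1, 2)  # (B, num_patches, dim)
        return self.trunk(tokens).mean(dim=1)        # pooled embedding

    def forward(self, frames, spectrogram):
        v = self.encode(self.video_embed(frames))
        a = self.encode(self.audio_embed(spectrogram))
        return v, a  # embeddings for an audio-visual contrastive loss

model = SharedAVBackbone()
v, a = model(torch.randn(2, 3, 224, 224), torch.randn(2, 1, 128, 208))
```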

Contrastive Learning Retrieval

Augmented Reality Demonstrations for Scalable Robot Imitation Learning

no code implementations • 20 Mar 2024 • Yue Yang, Bryce Ikeda, Gedas Bertasius, Daniel Szafir

Our framework facilitates scalable and diverse demonstration collection for real-world tasks.

Imitation Learning

DAM: Dynamic Adapter Merging for Continual Video QA Learning

1 code implementation • 13 Mar 2024 • Feng Cheng, Ziyang Wang, Yi-Lin Sung, Yan-Bo Lin, Mohit Bansal, Gedas Bertasius

Our DAM model outperforms prior state-of-the-art continual learning approaches by 9.1% while exhibiting 1.9% less forgetting on 6 VidQA datasets spanning various domains.

Continual Learning Image Classification +2

Video ReCap: Recursive Captioning of Hour-Long Videos

no code implementations • 20 Feb 2024 • Md Mohaiminul Islam, Ngan Ho, Xitong Yang, Tushar Nagarajan, Lorenzo Torresani, Gedas Bertasius

We utilize a curriculum learning training scheme to learn the hierarchical structure of videos, starting from clip-level captions describing atomic actions, then focusing on segment-level descriptions, and concluding with generating summaries for hour-long videos.
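
As a rough schematic of the recursive captioning hierarchy described above (the `caption` function and the grouping size are hypothetical placeholders, not the paper's components):

```python
def caption(inputs):
    """Placeholder captioner: maps a list of inputs to one caption string."""
    return " ; ".join(map(str, inputs))

def chunks(items, size):
    """Split a list into consecutive groups of `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def recursive_captions(clip_features, clips_per_segment=4):
    # Level 1: clip-level captions describing atomic actions.
    clip_caps = [caption([f]) for f in clip_features]
    # Level 2: segment-level descriptions built from groups of clip captions.
    seg_caps = [caption(group) for group in chunks(clip_caps, clips_per_segment)]
    # Level 3: a summary of the full (hour-long) video from segment captions.
    summary = caption(seg_caps)
    return clip_caps, seg_caps, summary
```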

Video Captioning Video Understanding

Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences

1 code implementation • 19 Jan 2024 • Xiyao Wang, YuHang Zhou, Xiaoyu Liu, Hongjin Lu, Yuancheng Xu, Feihong He, Jaehong Yoon, Taixi Lu, Gedas Bertasius, Mohit Bansal, Huaxiu Yao, Furong Huang

However, current MLLM benchmarks are predominantly designed to evaluate reasoning based on static information about a single image, and the ability of modern MLLMs to extrapolate from image sequences, which is essential for understanding our ever-changing world, has been less investigated.

Language Modelling Large Language Model

A Simple LLM Framework for Long-Range Video Question-Answering

1 code implementation • 28 Dec 2023 • Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, Gedas Bertasius

Furthermore, we show that a specialized prompt that asks the LLM first to summarize the noisy short-term visual captions and then answer a given input question leads to a significant LVQA performance boost.
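
A minimal sketch of that two-stage prompt, assuming a generic chat-LLM client; `ask_llm` and the prompt wording are placeholders, not the paper's exact prompts:

```python
def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for any chat-LLM call; plug in a real client here."""
    raise NotImplementedError

def answer_long_video_question(clip_captions: list[str], question: str) -> str:
    # Stage 1: ask the LLM to summarize the noisy short-term visual captions.
    summary = ask_llm(
        "The following are noisy captions of short clips from one long video:\n"
        + "\n".join(clip_captions)
        + "\n\nBriefly summarize what happens in the video."
    )
    # Stage 2: answer the question from the cleaned-up summary.
    return ask_llm(f"Video summary: {summary}\n\nQuestion: {question}\nAnswer:")
```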

Large Language Model Long-range modeling +2

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

no code implementations • 30 Nov 2023 • Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mohaiminul Islam, Suyog Jain, Rawal Khirodkar, Devansh Kukreja, Kevin J Liang, Jia-Wei Liu, Sagnik Majumder, Yongsen Mao, Miguel Martin, Effrosyni Mavroudi, Tushar Nagarajan, Francesco Ragusa, Santhosh Kumar Ramakrishnan, Luigi Seminara, Arjun Somayazulu, Yale Song, Shan Su, Zihui Xue, Edward Zhang, Jinxu Zhang, Angela Castillo, Changan Chen, Xinzhu Fu, Ryosuke Furuta, Cristina Gonzalez, Prince Gupta, Jiabo Hu, Yifei HUANG, Yiming Huang, Weslie Khoo, Anush Kumar, Robert Kuo, Sach Lakhavani, Miao Liu, Mi Luo, Zhengyi Luo, Brighid Meredith, Austin Miller, Oluwatumininu Oguntola, Xiaqing Pan, Penny Peng, Shraman Pramanick, Merey Ramazanova, Fiona Ryan, Wei Shan, Kiran Somasundaram, Chenan Song, Audrey Southerland, Masatoshi Tateno, Huiyu Wang, Yuchen Wang, Takuma Yagi, Mingfei Yan, Xitong Yang, Zecheng Yu, Shengxin Cindy Zha, Chen Zhao, Ziwei Zhao, Zhifan Zhu, Jeff Zhuo, Pablo Arbelaez, Gedas Bertasius, David Crandall, Dima Damen, Jakob Engel, Giovanni Maria Farinella, Antonino Furnari, Bernard Ghanem, Judy Hoffman, C. V. Jawahar, Richard Newcombe, Hyun Soo Park, James M. Rehg, Yoichi Sato, Manolis Savva, Jianbo Shi, Mike Zheng Shou, Michael Wray

We present Ego-Exo4D, a diverse, large-scale multimodal multiview video dataset and benchmark challenge.

Video Understanding

LoCoNet: Long-Short Context Network for Active Speaker Detection

1 code implementation • 19 Jan 2023 • Xizi Wang, Feng Cheng, Gedas Bertasius, David Crandall

These two contexts are complementary and together help infer the active speaker.

Efficient Movie Scene Detection using State-Space Transformers

1 code implementation • CVPR 2023 • Md Mohaiminul Islam, Mahmudul Hasan, Kishan Shamsundar Athrey, Tony Braskich, Gedas Bertasius

Given a sequence of frames divided into movie shots (uninterrupted periods where the camera position does not change), the S4A block first applies self-attention to capture short-range intra-shot dependencies.
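
A hedged sketch of that intra-shot attention step, with a plain GRU standing in for the long-range state-space layer the paper pairs it with; shapes and module names are assumptions:

```python
import torch
import torch.nn as nn

class IntraShotAttentionBlock(nn.Module):
    """Sketch of the short-range half of an S4A-style block: self-attention
    applied independently inside each movie shot. The long-range state-space
    half is stood in for by a GRU here, purely for illustration."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.long_range = nn.GRU(dim, dim, batch_first=True)  # stand-in for S4

    def forward(self, x):
        # x: (batch, num_shots, frames_per_shot, dim)
        b, s, f, d = x.shape
        frames = x.reshape(b * s, f, d)                 # each shot attends to itself
        attended, _ = self.attn(frames, frames, frames)
        shots = attended.mean(dim=1).reshape(b, s, d)   # one token per shot
        out, _ = self.long_range(shots)                 # long-range across shots
        return out

block = IntraShotAttentionBlock()
y = block(torch.randn(2, 10, 16, 256))  # 10 shots of 16 frames each
```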

Video Recognition

Vision Transformers are Parameter-Efficient Audio-Visual Learners

1 code implementation • CVPR 2023 • Yan-Bo Lin, Yi-Lin Sung, Jie Lei, Mohit Bansal, Gedas Bertasius

To do so, we propose a latent audio-visual hybrid (LAVISH) adapter that adapts pretrained ViTs to audio-visual tasks by injecting a small number of trainable parameters into every layer of a frozen ViT.
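
A generic bottleneck adapter of the kind this family of methods injects into a frozen backbone; LAVISH's latent audio-visual fusion is more involved, so treat this purely as a sketch of the parameter-efficiency mechanism:

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Small trainable residual module inserted into a frozen ViT layer
    (illustrative dimensions; not LAVISH's actual adapter)."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # zero-init so the adapter starts
        nn.init.zeros_(self.up.bias)    # as an identity mapping

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))  # residual update

adapter = BottleneckAdapter()
x = torch.randn(2, 197, 768)              # a ViT token sequence
assert torch.allclose(adapter(x), x)      # identity at initialization
```

In this scheme only the adapters (and typically a task head) receive gradients, while the pretrained ViT weights stay frozen, which is what keeps the trainable parameter count small.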

Audio-visual Question Answering

VindLU: A Recipe for Effective Video-and-Language Pretraining

1 code implementation • CVPR 2023 • Feng Cheng, Xizi Wang, Jie Lei, David Crandall, Mohit Bansal, Gedas Bertasius

Furthermore, our model also obtains state-of-the-art video question-answering results on ActivityNet-QA, MSRVTT-QA, MSRVTT-MC and TVQA.

Ranked #2 on Video Retrieval on Condensed Movies (using extra training data)

Question Answering Retrieval +3

MuMUR : Multilingual Multimodal Universal Retrieval

no code implementations • 24 Aug 2022 • Avinash Madasu, Estelle Aflalo, Gabriela Ben Melech Stan, Shachar Rosenman, Shao-Yen Tseng, Gedas Bertasius, Vasudev Lal

In this paper, we propose a framework, MuMUR, that utilizes knowledge transfer from a multilingual model to boost the performance of multi-modal (image and video) retrieval.

Image Retrieval Machine Translation +3

Learning to Retrieve Videos by Asking Questions

1 code implementation • 11 May 2022 • Avinash Madasu, Junier Oliva, Gedas Bertasius

To overcome this limitation, we propose a novel framework for Video Retrieval using Dialog (ViReD), which enables the user to interact with an AI agent via multiple rounds of dialog, refining the retrieved results by answering the questions the agent generates.
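
Schematically, such an interactive retrieval loop might look like this (all helper callables are hypothetical stand-ins, not ViReD's actual components):

```python
def interactive_retrieval(query, ask_question, answer_question, retrieve, rounds=3):
    """ViReD-style loop sketch: retrieve, ask, fold in the answer, re-retrieve."""
    history = [query]
    results = retrieve(" ".join(history))
    for _ in range(rounds):
        question = ask_question(history, results)         # agent generates a question
        history += [question, answer_question(question)]  # user answers it
        results = retrieve(" ".join(history))             # dialog refines retrieval
    return results
```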

Retrieval Text to Video Retrieval +1

Long Movie Clip Classification with State-Space Video Models

1 code implementation • 4 Apr 2022 • Md Mohaiminul Islam, Gedas Bertasius

Most modern video recognition models are designed to operate on short video clips (e.g., 5-10s in length).

Classification Video Classification +2

TALLFormer: Temporal Action Localization with a Long-memory Transformer

1 code implementation • 4 Apr 2022 • Feng Cheng, Gedas Bertasius

To address these issues, we propose TALLFormer, a memory-efficient and end-to-end trainable Temporal Action Localization Transformer with Long-term memory.

Action Recognition Temporal Action Localization

Learning To Recognize Procedural Activities with Distant Supervision

1 code implementation • CVPR 2022 • Xudong Lin, Fabio Petroni, Gedas Bertasius, Marcus Rohrbach, Shih-Fu Chang, Lorenzo Torresani

In this paper we consider the problem of classifying fine-grained, multi-step activities (e.g., cooking different recipes, making disparate home improvements, creating various forms of arts and crafts) from long videos spanning up to several minutes.

Action Classification Language Modelling +1

Long-Short Temporal Contrastive Learning of Video Transformers

no code implementations • CVPR 2022 • Jue Wang, Gedas Bertasius, Du Tran, Lorenzo Torresani

Our approach, named Long-Short Temporal Contrastive Learning (LSTCL), enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
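
A minimal sketch of a contrastive objective pairing a short clip with a longer clip from the same video; LSTCL's actual setup (e.g., its momentum encoder) is richer than this InfoNCE toy:

```python
import torch
import torch.nn.functional as F

def lstcl_style_loss(short_emb, long_emb, temperature=0.07):
    """InfoNCE between embeddings of a short clip and a longer clip sampled
    from the same video; matching pairs sit on the diagonal (illustrative)."""
    s = F.normalize(short_emb, dim=-1)
    l = F.normalize(long_emb, dim=-1)
    logits = s @ l.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(s.size(0))       # positives are same-video pairs
    return F.cross_entropy(logits, targets)

loss = lstcl_style_loss(torch.randn(8, 256), torch.randn(8, 256))
```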

Action Recognition Contrastive Learning +1

Is Space-Time Attention All You Need for Video Understanding?

13 code implementations • 9 Feb 2021 • Gedas Bertasius, Heng Wang, Lorenzo Torresani

We present a convolution-free approach to video classification built exclusively on self-attention over space and time.
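
One of the attention schemes studied in this paper is divided space-time attention: temporal attention over the same patch across frames, followed by spatial attention within each frame. A minimal sketch with illustrative shapes (not the released TimeSformer code):

```python
import torch
import torch.nn as nn

class DividedSpaceTimeAttention(nn.Module):
    """Divided space-time attention sketch: attend across time per patch,
    then across space per frame (residuals and norms omitted for brevity)."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (batch, frames, patches, dim)
        b, t, p, d = x.shape
        xt = x.permute(0, 2, 1, 3).reshape(b * p, t, d)  # same patch across time
        xt, _ = self.time_attn(xt, xt, xt)
        x = xt.reshape(b, p, t, d).permute(0, 2, 1, 3)
        xs = x.reshape(b * t, p, d)                      # all patches in a frame
        xs, _ = self.space_attn(xs, xs, xs)
        return xs.reshape(b, t, p, d)

attn = DividedSpaceTimeAttention()
y = attn(torch.randn(2, 8, 196, 768))  # 8 frames of 14x14 patches
```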

Action Classification Action Recognition +5

COBE: Contextualized Object Embeddings from Narrated Instructional Video

no code implementations • NeurIPS 2020 • Gedas Bertasius, Lorenzo Torresani

A fully-supervised approach to recognizing object states and their contexts in the real-world is unfortunately marred by the long-tailed, open-ended distribution of the data, which would effectively require massive amounts of annotations to capture the appearance of objects in all their different forms.

Human-Object Interaction Detection Object +3

Learning Temporal Pose Estimation from Sparsely-Labeled Videos

3 code implementations • NeurIPS 2019 • Gedas Bertasius, Christoph Feichtenhofer, Du Tran, Jianbo Shi, Lorenzo Torresani

To reduce the need for dense annotations, we propose a PoseWarper network that leverages training videos with sparse annotations (every k frames) to learn to perform dense temporal pose propagation and estimation.

Ranked #2 on Multi-Person Pose Estimation on PoseTrack2018 (using extra training data)

Multi-Person Pose Estimation Optical Flow Estimation

Attentive Action and Context Factorization

no code implementations • 10 Apr 2019 • Yang Wang, Vinh Tran, Gedas Bertasius, Lorenzo Torresani, Minh Hoai

This is a challenging task due to the subtlety of human actions in video and the co-occurrence of contextual elements.

Action Recognition Temporal Action Localization

Object Detection in Video with Spatiotemporal Sampling Networks

no code implementations • ECCV 2018 • Gedas Bertasius, Lorenzo Torresani, Jianbo Shi

We propose a Spatiotemporal Sampling Network (STSN) that uses deformable convolutions across time for object detection in videos.

Object object-detection +2

Egocentric Basketball Motion Planning from a Single First-Person Image

no code implementations • CVPR 2018 • Gedas Bertasius, Aaron Chan, Jianbo Shi

We present a model that uses a single first-person image to generate an egocentric basketball motion sequence in the form of a 12D camera configuration trajectory, which encodes a player's 3D location and 3D head orientation throughout the sequence.

Motion Planning

Using Cross-Model EgoSupervision to Learn Cooperative Basketball Intention

no code implementations • 5 Sep 2017 • Gedas Bertasius, Jianbo Shi

We present a first-person method for cooperative basketball intention prediction: we predict with whom the camera wearer will cooperate in the near future from unlabeled first-person images.

Pose Estimation

Am I a Baller? Basketball Performance Assessment from First-Person Videos

no code implementations • ICCV 2017 • Gedas Bertasius, Hyun Soo Park, Stella X. Yu, Jianbo Shi

Finally, we use this feature to learn a basketball assessment model from pairs of labeled first-person basketball videos, for which a basketball expert indicates which of the two players is better.

Unsupervised Learning of Important Objects from First-Person Videos

1 code implementation • ICCV 2017 • Gedas Bertasius, Hyun Soo Park, Stella X. Yu, Jianbo Shi

In this work, we show that we can detect important objects in first-person images without supervision from the camera wearer or even from third-person labelers.

Object Segmentation +1

Convolutional Random Walk Networks for Semantic Image Segmentation

no code implementations • CVPR 2017 • Gedas Bertasius, Lorenzo Torresani, Stella X. Yu, Jianbo Shi

It combines these two objectives via a novel random walk layer that enforces consistent spatial grouping in the deep layers of the network.

Image Segmentation Scene Labeling +2

Local Perturb-and-MAP for Structured Prediction

no code implementations • 24 May 2016 • Gedas Bertasius, Qiang Liu, Lorenzo Torresani, Jianbo Shi

In this work, we present a new Local Perturb-and-MAP (locPMAP) framework that replaces the global optimization with a local optimization by exploiting our observed connection between locPMAP and the pseudolikelihood of the original CRF model.
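
For readers unfamiliar with the term, the pseudolikelihood mentioned here replaces the intractable joint likelihood of a CRF with a product of per-variable conditionals; in standard textbook notation (not a formula taken from the paper):

```latex
\log \mathrm{PL}(\theta) \;=\; \sum_{i} \log p_\theta\!\bigl(x_i \,\big|\, x_{\mathcal{N}(i)}\bigr)
```

where $\mathcal{N}(i)$ denotes the graph neighbors (Markov blanket) of variable $i$ in the CRF.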

Combinatorial Optimization Structured Prediction

First Person Action-Object Detection with EgoNet

no code implementations • 15 Mar 2016 • Gedas Bertasius, Hyun Soo Park, Stella X. Yu, Jianbo Shi

Unlike traditional third-person cameras mounted on robots, a first-person camera captures a person's visual sensorimotor object interactions from up close.

Human-Object Interaction Detection Object +2

Semantic Segmentation with Boundary Neural Fields

no code implementations • CVPR 2016 • Gedas Bertasius, Jianbo Shi, Lorenzo Torresani

To overcome these problems, we introduce a Boundary Neural Field (BNF), which is a global energy model integrating FCN predictions with boundary cues.
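
As a rough illustration of what a global energy combining unary predictions with boundary-modulated pairwise terms can look like (a generic CRF-style form, not the paper's exact formulation):

```latex
E(\mathbf{y}) \;=\; \sum_{i} \phi_i(y_i) \;+\; \lambda \sum_{(i,j)} w_{ij}\,\mathbb{1}\bigl[y_i \neq y_j\bigr]
```

where the unary terms $\phi_i$ would come from FCN predictions and the pairwise weights $w_{ij}$ from predicted boundary strength between neighboring pixels.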

Boundary Detection Object Localization +2

Exploiting Egocentric Object Prior for 3D Saliency Detection

no code implementations • 9 Nov 2015 • Gedas Bertasius, Hyun Soo Park, Jianbo Shi

We empirically show that this representation can accurately characterize the egocentric object prior by testing it on an egocentric RGBD dataset for three tasks: 3D saliency detection, future saliency prediction, and interaction classification.

Object Saliency Prediction
