Search Results for author: AJ Piergiovanni

Found 42 papers, 19 papers with code

AssembleNet++: Assembling Modality Representations via Attention Connections - Supplementary Material -

no code implementations ECCV 2020 Michael S. Ryoo, AJ Piergiovanni, Juhana Kangaspunta, Anelia Angelova

We create a family of powerful video models which are able to: (i) learn interactions between semantic object information and raw appearance and motion features, and (ii) deploy attention in order to better learn the importance of features at each convolutional block of the network.

Activity Recognition

Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

no code implementations CVPR 2024 AJ Piergiovanni, Isaac Noble, Dahun Kim, Michael S. Ryoo, Victor Gomes, Anelia Angelova

We propose a multimodal model, called Mirasol3B, consisting of an autoregressive component for the time-synchronized modalities (audio and video), and an autoregressive component for the context modalities which are not necessarily aligned in time but are still sequential.

Action Classification Audio Classification +1

Diversifying Joint Vision-Language Tokenization Learning

no code implementations 6 Jun 2023 Vardaan Pahuja, AJ Piergiovanni, Anelia Angelova

Building joint representations across images and text is an essential step for tasks such as Visual Question Answering and Video Question Answering.

Question Answering Representation Learning +2

Joint Adaptive Representations for Image-Language Learning

no code implementations 31 May 2023 AJ Piergiovanni, Anelia Angelova

Here we propose a much simpler recipe for image-language learning, which produces effective models that outperform bigger and more expensive ones, often trained on datasets that are orders of magnitude larger.

MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks

1 code implementation 29 Mar 2023 Weicheng Kuo, AJ Piergiovanni, Dahun Kim, Xiyang Luo, Ben Caine, Wei Li, Abhijit Ogale, Luowei Zhou, Andrew Dai, Zhifeng Chen, Claire Cui, Anelia Angelova

We propose a novel paradigm of training with a decoder-only model for multimodal tasks, which is surprisingly effective in jointly learning these disparate vision-language tasks.

Cross-Modal Retrieval Decoder +8

Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning

1 code implementation CVPR 2023 AJ Piergiovanni, Weicheng Kuo, Anelia Angelova

We present a simple approach which can turn a ViT encoder into an efficient video model, which can seamlessly work with both image and video inputs.

Ranked #2 on Action Classification on Kinetics-600 (using extra training data)

Action Classification Action Recognition In Videos

F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models

1 code implementation 30 Sep 2022 Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, Anelia Angelova

We present F-VLM, a simple open-vocabulary object detection method built upon Frozen Vision and Language Models.

Knowledge Distillation object-detection +1

Pre-training image-language transformers for open-vocabulary tasks

no code implementations 9 Sep 2022 AJ Piergiovanni, Weicheng Kuo, Anelia Angelova

We present a pre-training approach for vision and language transformer models, which is based on a mixture of diverse tasks.

Question Answering Visual Entailment +1

Video Question Answering with Iterative Video-Text Co-Tokenization

no code implementations 1 Aug 2022 AJ Piergiovanni, Kairo Morton, Weicheng Kuo, Michael S. Ryoo, Anelia Angelova

Video question answering is a challenging task that requires jointly understanding the language input, the visual information in individual video frames, and the temporal information about the events occurring in the video.

Question Answering Video Question Answering +1

Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering

no code implementations 2 May 2022 AJ Piergiovanni, Wei Li, Weicheng Kuo, Mohammad Saffar, Fred Bertsch, Anelia Angelova

We present Answer-Me, a task-aware multi-task framework which unifies a variety of question answering tasks, such as visual question answering, visual entailment, and visual reasoning.

Decoder Image Captioning +5

FindIt: Generalized Localization with Natural Language Queries

no code implementations 31 Mar 2022 Weicheng Kuo, Fred Bertsch, Wei Li, AJ Piergiovanni, Mohammad Saffar, Anelia Angelova

We propose FindIt, a simple and versatile framework that unifies a variety of visual grounding and localization tasks including referring expression comprehension, text-based localization, and object detection.

Natural Language Queries Object +5

TokenLearner: Adaptive Space-Time Tokenization for Videos

1 code implementation NeurIPS 2021 Michael Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, Anelia Angelova

In this paper, we introduce a novel visual representation learning approach which relies on a handful of adaptively learned tokens, and which is applicable to both image and video understanding tasks.

Representation Learning Video Recognition +1

4D-Net for Learned Multi-Modal Alignment

1 code implementation ICCV 2021 AJ Piergiovanni, Vincent Casser, Michael S. Ryoo, Anelia Angelova

We present 4D-Net, a 3D object detection approach, which utilizes 3D Point Cloud and RGB sensing information, both in time.

3D Object Detection object-detection

Unsupervised Discovery of Actions in Instructional Videos

no code implementations 28 Jun 2021 AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo, Irfan Essa

In this paper we address the problem of automatically discovering atomic actions in an unsupervised manner from instructional videos.

TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?

4 code implementations 21 Jun 2021 Michael S. Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, Anelia Angelova

In this paper, we introduce a novel visual representation learning approach which relies on a handful of adaptively learned tokens, and which is applicable to both image and video understanding tasks.

Action Classification Image Classification +3

Unsupervised Action Segmentation for Instructional Videos

no code implementations 7 Jun 2021 AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo, Irfan Essa

In this paper we address the problem of automatically discovering atomic actions in an unsupervised manner from instructional videos, which are rarely annotated with atomic actions.

Action Segmentation Segmentation

Adaptive Intermediate Representations for Video Understanding

no code implementations 14 Apr 2021 Juhana Kangaspunta, AJ Piergiovanni, Rico Jonschkowski, Michael Ryoo, Anelia Angelova

A common strategy for video understanding is to incorporate spatial and motion information by fusing features derived from RGB frames and optical flow.

Action Classification Optical Flow Estimation +3

AssembleNet++: Assembling Modality Representations via Attention Connections

1 code implementation 18 Aug 2020 Michael S. Ryoo, AJ Piergiovanni, Juhana Kangaspunta, Anelia Angelova

We create a family of powerful video models which are able to: (i) learn interactions between semantic object information and raw appearance and motion features, and (ii) deploy attention in order to better learn the importance of features at each convolutional block of the network.

Action Classification Activity Recognition

AttentionNAS: Spatiotemporal Attention Cell Search for Video Classification

no code implementations ECCV 2020 Xiaofang Wang, Xuehan Xiong, Maxim Neumann, AJ Piergiovanni, Michael S. Ryoo, Anelia Angelova, Kris M. Kitani, Wei Hua

The discovered attention cells can be seamlessly inserted into existing backbone networks, e.g., I3D or S3D, and improve video classification accuracy by more than 2% on both Kinetics-600 and MiT datasets.

Classification General Classification +1

AViD Dataset: Anonymized Videos from Diverse Countries

1 code implementation NeurIPS 2020 AJ Piergiovanni, Michael S. Ryoo

We confirm that most of the existing video datasets are statistically biased to only capture action videos from a limited number of countries.

Action Classification Action Detection +1

Tiny Video Networks

2 code implementations 15 Oct 2019 AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo

Video understanding is a challenging problem with great impact on the abilities of autonomous agents working in the real world.

Video Understanding

Model-based Behavioral Cloning with Future Image Similarity Learning

1 code implementation 8 Oct 2019 Alan Wu, AJ Piergiovanni, Michael S. Ryoo

We present a visual imitation learning framework that enables learning of robot action policies solely based on expert samples without any robot trials.

Imitation Learning

Unseen Action Recognition with Unpaired Adversarial Multimodal Learning

no code implementations ICLR 2019 AJ Piergiovanni, Michael S. Ryoo

In this paper, we present a method to learn a joint multimodal representation space that allows for the recognition of unseen activities in videos.

Action Recognition General Classification +1

Differentiable Grammars for Videos

no code implementations 1 Feb 2019 AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo

This paper proposes a novel algorithm which learns a formal regular grammar from real-world continuous data, such as videos.

Representation Flow for Action Recognition

5 code implementations CVPR 2019 AJ Piergiovanni, Michael S. Ryoo

Our representation flow layer is a fully-differentiable layer designed to capture the 'flow' of any representation channel within a convolutional neural network for action recognition.

Action Classification Action Recognition In Videos +4

Learning Multimodal Representations for Unseen Activities

1 code implementation 21 Jun 2018 AJ Piergiovanni, Michael S. Ryoo

We present a method to learn a joint multimodal representation space that enables recognition of unseen activities in videos.

General Classification Temporal Action Localization

Fine-grained Activity Recognition in Baseball Videos

3 code implementations 9 Apr 2018 AJ Piergiovanni, Michael S. Ryoo

In this paper, we introduce a challenging new dataset, MLB-YouTube, designed for fine-grained activity detection.

Action Detection Activity Detection +3

Temporal Gaussian Mixture Layer for Videos

1 code implementation ICLR 2019 AJ Piergiovanni, Michael S. Ryoo

We introduce a new convolutional layer named the Temporal Gaussian Mixture (TGM) layer and present how it can be used to efficiently capture longer-term temporal information in continuous activity videos.

Action Detection Activity Detection

Learning Latent Super-Events to Detect Multiple Activities in Videos

2 code implementations CVPR 2018 AJ Piergiovanni, Michael S. Ryoo

In this paper, we introduce the concept of learning latent super-events from activity videos, and present how it benefits activity detection in continuous videos.

Action Detection Activity Detection

Learning Latent Sub-events in Activity Videos Using Temporal Attention Filters

1 code implementation 26 May 2016 AJ Piergiovanni, Chenyou Fan, Michael S. Ryoo

In this paper, we introduce the concept of temporal attention filters, and describe how they can be used for human activity recognition from videos.

Action Classification Action Recognition In Videos +2
