no code implementations • ECCV 2020 • Michael S. Ryoo, AJ Piergiovanni, Juhana Kangaspunta, Anelia Angelova
We create a family of powerful video models which are able to: (i) learn interactions between semantic object information and raw appearance and motion features, and (ii) deploy attention in order to better learn the importance of features at each convolutional block of the network.
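To make point (ii) concrete, here is a minimal, hypothetical sketch (not the AssembleNet++ code) of one way a semantic-object stream could gate an appearance/motion stream with channel-wise attention inside a convolutional block; the module names, shapes, and pooling choice are all assumptions.

```python
import torch
import torch.nn as nn

class PeerAttentionBlock(nn.Module):
    """Hypothetical sketch: one stream (e.g., semantic object features)
    produces channel-wise attention weights that modulate another stream
    (e.g., appearance/motion features) inside a convolutional block."""

    def __init__(self, in_channels: int, peer_channels: int):
        super().__init__()
        self.conv = nn.Conv3d(in_channels, in_channels, kernel_size=3, padding=1)
        # Attention weights are predicted from the peer stream's global context.
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),
            nn.Conv3d(peer_channels, in_channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor, peer: torch.Tensor) -> torch.Tensor:
        # x:    (B, C, T, H, W) appearance/motion features
        # peer: (B, C_peer, T, H, W) semantic object features
        weights = self.attn(peer)          # (B, C, 1, 1, 1)
        return self.conv(x) * weights      # channel-wise reweighting


if __name__ == "__main__":
    block = PeerAttentionBlock(in_channels=64, peer_channels=32)
    video_feat = torch.randn(2, 64, 8, 28, 28)
    object_feat = torch.randn(2, 32, 8, 28, 28)
    print(block(video_feat, object_feat).shape)  # torch.Size([2, 64, 8, 28, 28])
```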
no code implementations • CVPR 2024 • Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, Siamak Shakeri, Mostafa Dehghani, Daniel Salz, Mario Lucic, Michael Tschannen, Arsha Nagrani, Hexiang Hu, Mandar Joshi, Bo Pang, Ceslee Montgomery, Paulina Pietrzyk, Marvin Ritter, AJ Piergiovanni, Matthias Minderer, Filip Pavetic, Austin Waters, Gang Li, Ibrahim Alabdulmohsin, Lucas Beyer, Julien Amelot, Kenton Lee, Andreas Peter Steiner, Yang Li, Daniel Keysers, Anurag Arnab, Yuanzhong Xu, Keran Rong, Alexander Kolesnikov, Mojtaba Seyedhosseini, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut
We explore the boundaries of scaling up a multilingual vision and language model both in terms of size of the components and the breadth of its training task mixture.
no code implementations • CVPR 2024 • AJ Piergiovanni, Isaac Noble, Dahun Kim, Michael S. Ryoo, Victor Gomes, Anelia Angelova
We propose a multimodal model, called Mirasol3B, consisting of an autoregressive component for the time-synchronized modalities (audio and video) and an autoregressive component for the context modalities, which are not necessarily aligned in time but are still sequential.
Ranked #1 on Audio Classification on VGGSound
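A rough sketch of the split described above, under assumptions about layer sizes, chunking, and conditioning (this is not the Mirasol3B implementation): one causal transformer runs over time-synchronized audio/video latents, and a second causal decoder over the sequential context tokens cross-attends to its outputs.

```python
import torch
import torch.nn as nn

class TwoStreamAutoregressive(nn.Module):
    """Hypothetical sketch: an autoregressive module over time-aligned
    audio+video latents, plus a second autoregressive module over context
    tokens (e.g., text) that cross-attends to the first."""

    def __init__(self, dim: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.av_model = nn.TransformerEncoder(enc_layer, n_layers)      # causal over AV latents
        dec_layer = nn.TransformerDecoderLayer(dim, n_heads, batch_first=True)
        self.ctx_model = nn.TransformerDecoder(dec_layer, n_layers)     # causal + cross-attention

    @staticmethod
    def causal_mask(n: int) -> torch.Tensor:
        return torch.triu(torch.full((n, n), float("-inf")), diagonal=1)

    def forward(self, av_latents: torch.Tensor, ctx_tokens: torch.Tensor) -> torch.Tensor:
        # av_latents: (B, T_av, D) time-synchronized audio/video features
        # ctx_tokens: (B, T_ctx, D) sequential but unaligned context (e.g., text)
        av_out = self.av_model(av_latents, mask=self.causal_mask(av_latents.size(1)))
        return self.ctx_model(
            ctx_tokens, av_out, tgt_mask=self.causal_mask(ctx_tokens.size(1))
        )
```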
no code implementations • 6 Jun 2023 • Vardaan Pahuja, AJ Piergiovanni, Anelia Angelova
Building joint representations across images and text is an essential step for tasks such as Visual Question Answering and Video Question Answering.
no code implementations • 31 May 2023 • AJ Piergiovanni, Anelia Angelova
Here we propose a much easier recipe for image-language learning, which produces effective models that outperform bigger and more expensive ones, often trained on datasets that are orders of magnitude larger.
2 code implementations • 29 May 2023 • Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, Siamak Shakeri, Mostafa Dehghani, Daniel Salz, Mario Lucic, Michael Tschannen, Arsha Nagrani, Hexiang Hu, Mandar Joshi, Bo Pang, Ceslee Montgomery, Paulina Pietrzyk, Marvin Ritter, AJ Piergiovanni, Matthias Minderer, Filip Pavetic, Austin Waters, Gang Li, Ibrahim Alabdulmohsin, Lucas Beyer, Julien Amelot, Kenton Lee, Andreas Peter Steiner, Yang Li, Daniel Keysers, Anurag Arnab, Yuanzhong Xu, Keran Rong, Alexander Kolesnikov, Mojtaba Seyedhosseini, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture.
Ranked #1 on Fine-Grained Image Recognition on OVEN
1 code implementation • 29 Mar 2023 • Weicheng Kuo, AJ Piergiovanni, Dahun Kim, Xiyang Luo, Ben Caine, Wei Li, Abhijit Ogale, Luowei Zhou, Andrew Dai, Zhifeng Chen, Claire Cui, Anelia Angelova
We propose a novel paradigm of training with a decoder-only model for multimodal tasks, which is surprisingly effective at jointly learning these disparate vision-language tasks.
Ranked #1 on Video Captioning on MSVD
1 code implementation • CVPR 2023 • AJ Piergiovanni, Weicheng Kuo, Anelia Angelova
We present a simple approach that turns a ViT encoder into an efficient video model which can seamlessly work with both image and video inputs.
Ranked #2 on Action Classification on Kinetics-600 (using extra training data)
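As a loose illustration of the idea, the sketch below (an assumption-laden toy, not the released model) projects both single images and sparse spatio-temporal tubes into tokens of the same width, so one transformer encoder can consume images and videos alike.

```python
import torch
import torch.nn as nn

class SparseTubeTokenizer(nn.Module):
    """Illustrative sketch: image-style patches and spatio-temporal 'tubes'
    are both mapped to tokens of the same dimension, so the same encoder
    handles images (T=1) and videos."""

    def __init__(self, dim: int = 192):
        super().__init__()
        # 2D-style patches: 1 frame x 16 x 16 (also works for a single image).
        self.patch_proj = nn.Conv3d(3, dim, kernel_size=(1, 16, 16), stride=(1, 16, 16))
        # Sparse video tubes: 8 frames x 8 x 8, strided so only a few are sampled.
        self.tube_proj = nn.Conv3d(3, dim, kernel_size=(8, 8, 8), stride=(16, 32, 32))
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=3, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (B, 3, T, H, W); T=1 corresponds to a plain image.
        patches = self.patch_proj(video).flatten(2).transpose(1, 2)       # (B, N1, D)
        tokens = [patches]
        if video.size(2) >= 8:                                            # tubes only with enough frames
            tubes = self.tube_proj(video).flatten(2).transpose(1, 2)      # (B, N2, D)
            tokens.append(tubes)
        return self.encoder(torch.cat(tokens, dim=1))


if __name__ == "__main__":
    model = SparseTubeTokenizer()
    print(model(torch.randn(1, 3, 32, 64, 64)).shape)  # video input
    print(model(torch.randn(1, 3, 1, 64, 64)).shape)   # single-image input
```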
no code implementations • 2 Dec 2022 • Maxwell Mbabilla Aladago, AJ Piergiovanni
We concatenate all the compound tokens for further processing with a multimodal encoder.
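A minimal sketch of one plausible reading of this fusion (the cross-attention pairing and all names are assumptions, not the paper's exact recipe): each text token is paired with an image-conditioned token, the pair is concatenated channel-wise into a compound token, and the compound tokens are then processed by a standard multimodal encoder.

```python
import torch
import torch.nn as nn

class CompoundTokenFusion(nn.Module):
    """Hypothetical sketch: channel-wise concatenation of a text token with
    its image-conditioned counterpart forms a 'compound' token; all compound
    tokens are then processed jointly by a multimodal encoder."""

    def __init__(self, dim: int = 256, n_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        enc_layer = nn.TransformerEncoderLayer(2 * dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)

    def forward(self, text_tokens: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens: (B, N_t, D), image_tokens: (B, N_i, D)
        attended, _ = self.cross_attn(text_tokens, image_tokens, image_tokens)
        compound = torch.cat([text_tokens, attended], dim=-1)   # (B, N_t, 2D)
        return self.encoder(compound)
```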
1 code implementation • 30 Sep 2022 • Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, Anelia Angelova
We present F-VLM, a simple open-vocabulary object detection method built upon Frozen Vision and Language Models.
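The core open-vocabulary scoring step might look roughly like the following sketch, which matches pooled region features from a frozen backbone against text embeddings of arbitrary class names; the detector head, region pooling, and score-fusion details are omitted and the function is hypothetical.

```python
import torch
import torch.nn.functional as F

def open_vocab_region_scores(region_feats: torch.Tensor,
                             class_text_embeds: torch.Tensor,
                             temperature: float = 0.01) -> torch.Tensor:
    """Hypothetical sketch of open-vocabulary region classification:
    cosine similarity between frozen-backbone region features and
    text-encoder embeddings of arbitrary class names.

    region_feats:      (num_regions, D) pooled features for proposed boxes
    class_text_embeds: (num_classes, D) frozen text-encoder embeddings
    returns:           (num_regions, num_classes) class probabilities
    """
    regions = F.normalize(region_feats, dim=-1)
    classes = F.normalize(class_text_embeds, dim=-1)
    return (regions @ classes.t() / temperature).softmax(dim=-1)
```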
1 code implementation • 14 Sep 2022 • Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut
PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages.
no code implementations • 9 Sep 2022 • AJ Piergiovanni, Weicheng Kuo, Anelia Angelova
We present a pre-training approach for vision and language transformer models, which is based on a mixture of diverse tasks.
no code implementations • 1 Aug 2022 • AJ Piergiovanni, Kairo Morton, Weicheng Kuo, Michael S. Ryoo, Anelia Angelova
Video question answering is a challenging task that requires jointly understanding the language input, the visual information in individual video frames, and the temporal information about the events occurring in the video.
Ranked #4 on Video Question Answering on iVQA
no code implementations • 2 May 2022 • AJ Piergiovanni, Wei Li, Weicheng Kuo, Mohammad Saffar, Fred Bertsch, Anelia Angelova
We present Answer-Me, a task-aware multi-task framework which unifies a variety of question answering tasks, such as visual question answering, visual entailment, and visual reasoning.
no code implementations • 31 Mar 2022 • Weicheng Kuo, Fred Bertsch, Wei Li, AJ Piergiovanni, Mohammad Saffar, Anelia Angelova
We propose FindIt, a simple and versatile framework that unifies a variety of visual grounding and localization tasks including referring expression comprehension, text-based localization, and object detection.
1 code implementation • NeurIPS 2021 • Michael Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, Anelia Angelova
In this paper, we introduce a novel visual representation learning approach which relies on a handful of adaptively learned tokens, and which is applicable to both image and video understanding tasks.
1 code implementation • ICCV 2021 • AJ Piergiovanni, Vincent Casser, Michael S. Ryoo, Anelia Angelova
We present 4D-Net, a 3D object detection approach, which utilizes 3D Point Cloud and RGB sensing information, both in time.
no code implementations • 28 Jun 2021 • AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo, Irfan Essa
In this paper we address the problem of automatically discovering atomic actions in an unsupervised manner from instructional videos.
4 code implementations • 21 Jun 2021 • Michael S. Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, Anelia Angelova
In this paper, we introduce a novel visual representation learning approach which relies on a handful of adaptively learned tokens, and which is applicable to both image and video understanding tasks.
Ranked #1 on Action Classification on Charades
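A rough sketch of the token-learning idea (not the released TokenLearner code; the 1x1-convolution parameterization here is an assumption): predict K spatial attention maps and use them to pool the feature map into K adaptively learned tokens.

```python
import torch
import torch.nn as nn

class TokenLearnerSketch(nn.Module):
    """Rough sketch: predict K spatial attention maps from the feature map,
    then use each map to pool the features into one of K learned tokens."""

    def __init__(self, in_channels: int, num_tokens: int = 8):
        super().__init__()
        self.to_maps = nn.Conv2d(in_channels, num_tokens, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) -> tokens: (B, K, C)
        maps = self.to_maps(feats).flatten(2).softmax(dim=-1)   # (B, K, H*W)
        flat = feats.flatten(2).transpose(1, 2)                 # (B, H*W, C)
        return maps @ flat                                      # attention-weighted pooling


if __name__ == "__main__":
    tokens = TokenLearnerSketch(in_channels=64, num_tokens=8)(torch.randn(2, 64, 14, 14))
    print(tokens.shape)  # torch.Size([2, 8, 64])
```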
no code implementations • 7 Jun 2021 • AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo, Irfan Essa
In this paper we address the problem of automatically discovering atomic actions in an unsupervised manner from instructional videos, which are rarely annotated with atomic actions.
no code implementations • 14 Apr 2021 • Juhana Kangaspunta, AJ Piergiovanni, Rico Jonschkowski, Michael Ryoo, Anelia Angelova
A common strategy for video understanding is to incorporate spatial and motion information by fusing features derived from RGB frames and optical flow.
Ranked #5 on Action Classification on Toyota Smarthome dataset
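For context, here is a generic two-stream sketch of RGB/flow fusion (an illustrative baseline, not this paper's fusion method): each modality is processed by its own small 3D CNN and the clip-level features are concatenated before classification.

```python
import torch
import torch.nn as nn

class TwoStreamFusion(nn.Module):
    """Generic two-stream sketch: one network processes RGB frames, another
    processes optical flow, and their clip-level features are fused before
    classification. Sizes and the fusion point are assumptions."""

    def __init__(self, feat_dim: int = 128, num_classes: int = 31):
        super().__init__()

        def stream(in_ch: int) -> nn.Module:
            return nn.Sequential(
                nn.Conv3d(in_ch, feat_dim, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool3d(1),
                nn.Flatten(),
            )

        self.rgb_stream = stream(3)    # RGB frames: 3 channels
        self.flow_stream = stream(2)   # optical flow: x/y displacement
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        # rgb: (B, 3, T, H, W), flow: (B, 2, T, H, W)
        fused = torch.cat([self.rgb_stream(rgb), self.flow_stream(flow)], dim=-1)
        return self.classifier(fused)
```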
no code implementations • CVPR 2021 • AJ Piergiovanni, Michael S. Ryoo
Standard methods for video recognition use large CNNs designed to capture spatio-temporal data.
Ranked #3 on Action Classification on Toyota Smarthome dataset (CV1 metric)
1 code implementation • 18 Aug 2020 • Michael S. Ryoo, AJ Piergiovanni, Juhana Kangaspunta, Anelia Angelova
We create a family of powerful video models which are able to: (i) learn interactions between semantic object information and raw appearance and motion features, and (ii) deploy attention in order to better learn the importance of features at each convolutional block of the network.
Ranked #4 on Action Classification on Toyota Smarthome dataset
no code implementations • ECCV 2020 • AJ Piergiovanni, Anelia Angelova, Alexander Toshev, Michael S. Ryoo
In this paper we propose an adversarial generative grammar model for future prediction.
no code implementations • ECCV 2020 • Xiaofang Wang, Xuehan Xiong, Maxim Neumann, AJ Piergiovanni, Michael S. Ryoo, Anelia Angelova, Kris M. Kitani, Wei Hua
The discovered attention cells can be seamlessly inserted into existing backbone networks, e.g., I3D or S3D, and improve video classification accuracy by more than 2% on both the Kinetics-600 and MiT datasets.
1 code implementation • NeurIPS 2020 • AJ Piergiovanni, Michael S. Ryoo
We confirm that most of the existing video datasets are statistically biased to only capture action videos from a limited number of countries.
Ranked #2 on Action Classification on AViD
no code implementations • CVPR 2020 • AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo
We present a new method to learn video representations from large-scale unlabeled video data.
2 code implementations • 15 Oct 2019 • AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo
Video understanding is a challenging problem with great impact on the abilities of autonomous agents working in the real world.
1 code implementation • 8 Oct 2019 • Alan Wu, AJ Piergiovanni, Michael S. Ryoo
We present a visual imitation learning framework that enables learning of robot action policies solely based on expert samples without any robot trials.
no code implementations • 7 Jun 2019 • AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo
We present a new method to learn video representations from unlabeled data.
2 code implementations • ICLR 2020 • Michael S. Ryoo, AJ Piergiovanni, Mingxing Tan, Anelia Angelova
Learning to represent videos is a very challenging task both algorithmically and computationally.
no code implementations • ICLR 2019 • AJ Piergiovanni, Michael S. Ryoo
In this paper, we present a method to learn a joint multimodal representation space that allows for the recognition of unseen activities in videos.
no code implementations • 18 Apr 2019 • AJ Piergiovanni, Michael S. Ryoo
Injuries are a major cost in sports.
no code implementations • 1 Feb 2019 • AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo
This paper proposes a novel algorithm which learns a formal regular grammar from real-world continuous data, such as videos.
no code implementations • ICCV 2019 • AJ Piergiovanni, Anelia Angelova, Alexander Toshev, Michael S. Ryoo
We present a new method for finding video CNN architectures that capture rich spatio-temporal information in videos.
Ranked #20 on Action Classification on MiT
5 code implementations • CVPR 2019 • AJ Piergiovanni, Michael S. Ryoo
Our representation flow layer is a fully-differentiable layer designed to capture the 'flow' of any representation channel within a convolutional neural network for action recognition.
Ranked #16 on Action Recognition on HMDB-51
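A heavily simplified, assumption-laden sketch of one flow-refinement step applied to feature channels rather than raw pixels; the actual layer unrolls a TV-L1-style optimization with learnable parameters, which is omitted here.

```python
import torch
import torch.nn.functional as F

def representation_flow_step(feat_t: torch.Tensor, feat_t1: torch.Tensor,
                             flow: torch.Tensor, step: float = 0.1) -> torch.Tensor:
    """Hypothetical single refinement step: spatial gradients and the temporal
    difference of the feature maps drive a gradient-descent update of the flow.

    feat_t, feat_t1: (B, C, H, W) features at consecutive time steps
    flow:            (B, 2, H, W) current flow estimate
    """
    # Spatial gradients via finite differences (padded to keep the size).
    gx = F.pad(feat_t1[:, :, :, 1:] - feat_t1[:, :, :, :-1], (0, 1)).mean(1, keepdim=True)
    gy = F.pad(feat_t1[:, :, 1:, :] - feat_t1[:, :, :-1, :], (0, 0, 0, 1)).mean(1, keepdim=True)
    gt = (feat_t1 - feat_t).mean(1, keepdim=True)
    # Residual of a brightness-constancy-style constraint on the features.
    residual = gx * flow[:, :1] + gy * flow[:, 1:] + gt
    # Gradient-descent update of the flow field.
    update = torch.cat([gx, gy], dim=1) * residual
    return flow - step * update
```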
1 code implementation • 21 Jun 2018 • AJ Piergiovanni, Michael S. Ryoo
We present a method to learn a joint multimodal representation space that enables recognition of unseen activities in videos.
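One minimal, contrastive-style sketch of building such a shared space (an illustration of the general idea, not the paper's method): matching video and text embeddings are pulled together, and unseen activities can then be recognized by finding the nearest text embedding.

```python
import torch
import torch.nn.functional as F

def joint_embedding_loss(video_embeds: torch.Tensor,
                         text_embeds: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """Hypothetical sketch: symmetric InfoNCE-style loss that aligns video
    clips with their activity descriptions in a shared embedding space.

    video_embeds: (N, D) embeddings of N clips
    text_embeds:  (N, D) embeddings of the matching activity descriptions
    """
    v = F.normalize(video_embeds, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    logits = v @ t.t() / temperature
    targets = torch.arange(v.size(0), device=v.device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```

At test time, an unseen activity can be scored by embedding its textual description and taking the class whose text embedding is closest to the video embedding.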
no code implementations • 20 May 2018 • AJ Piergiovanni, Alan Wu, Michael S. Ryoo
Learning to control robots directly based on images is a primary challenge in robotics.
3 code implementations • 9 Apr 2018 • AJ Piergiovanni, Michael S. Ryoo
In this paper, we introduce a challenging new dataset, MLB-YouTube, designed for fine-grained activity detection.
1 code implementation • ICLR 2019 • AJ Piergiovanni, Michael S. Ryoo
We introduce a new convolutional layer named the Temporal Gaussian Mixture (TGM) layer and present how it can be used to efficiently capture longer-term temporal information in continuous activity videos.
Ranked #4 on Action Detection on Multi-THUMOS
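An illustrative sketch of one such parameterization (the shapes, normalization, and depthwise-convolution choice are assumptions): temporal kernels are built as mixtures of Gaussians with learnable centers and widths, giving long temporal support with very few parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalGaussianMixtureSketch(nn.Module):
    """Illustrative sketch: temporal convolution kernels are mixtures of
    Gaussians with learnable centers/widths, applied depthwise over time."""

    def __init__(self, channels: int, num_gaussians: int = 4, kernel_len: int = 31):
        super().__init__()
        self.channels = channels
        self.kernel_len = kernel_len
        self.centers = nn.Parameter(torch.rand(num_gaussians))         # in [0, 1]
        self.log_widths = nn.Parameter(torch.zeros(num_gaussians))
        self.mix = nn.Parameter(torch.zeros(channels, num_gaussians))  # per-channel mixture

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T) per-frame features over time
        t = torch.linspace(0, 1, self.kernel_len, device=x.device)     # (L,)
        centers = self.centers.unsqueeze(-1)                           # (G, 1)
        widths = self.log_widths.exp().unsqueeze(-1)                   # (G, 1)
        gaussians = torch.exp(-0.5 * ((t - centers) / widths) ** 2)    # (G, L)
        gaussians = gaussians / gaussians.sum(dim=-1, keepdim=True)    # normalize each kernel
        weights = self.mix.softmax(dim=-1) @ gaussians                 # (C, L)
        kernel = weights.unsqueeze(1)                                  # (C, 1, L) depthwise
        return F.conv1d(x, kernel, padding=self.kernel_len // 2, groups=self.channels)


if __name__ == "__main__":
    layer = TemporalGaussianMixtureSketch(channels=64)
    print(layer(torch.randn(2, 64, 100)).shape)  # torch.Size([2, 64, 100])
```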
2 code implementations • CVPR 2018 • AJ Piergiovanni, Michael S. Ryoo
In this paper, we introduce the concept of learning latent super-events from activity videos, and present how it benefits activity detection in continuous videos.
Ranked #6 on Action Detection on Multi-THUMOS
1 code implementation • 26 May 2016 • AJ Piergiovanni, Chenyou Fan, Michael S. Ryoo
In this paper, we introduce the new concept of temporal attention filters, and describe how they can be used for human activity recognition from videos.
Ranked #1 on Activity Recognition In Videos on DogCentric