no code implementations • ECCV 2020 • Michael S. Ryoo, AJ Piergiovanni, Juhana Kangaspunta, Anelia Angelova
We create a family of powerful video models which are able to: (i) learn interactions between semantic object information and raw appearance and motion features, and (ii) deploy attention in order to better learn the importance of features at each convolutional block of the network.
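One way to picture the attention idea is channel gating across streams: one stream's features decide how much each channel of another stream matters. Below is a toy sketch (hypothetical shapes and module name; not the paper's AssembleNet++ code, which learns such connectivity jointly across many blocks):

```python
import torch
import torch.nn as nn

class PeerGate(nn.Module):
    """Reweight one stream's channels using context from a peer stream."""
    def __init__(self, peer_channels: int, channels: int):
        super().__init__()
        self.fc = nn.Linear(peer_channels, channels)

    def forward(self, x, peer):              # x: (B,C,T,H,W), peer: (B,Cp,T,H,W)
        ctx = peer.mean(dim=(2, 3, 4))       # global spatio-temporal context
        gate = torch.sigmoid(self.fc(ctx))   # per-channel importance in [0, 1]
        return x * gate[:, :, None, None, None]

rgb = torch.randn(2, 64, 8, 28, 28)          # appearance stream
flow = torch.randn(2, 32, 8, 28, 28)         # motion stream
print(PeerGate(32, 64)(rgb, flow).shape)     # torch.Size([2, 64, 8, 28, 28])
```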
no code implementations • 22 Nov 2024 • AJ Piergiovanni, Dahun Kim, Michael S. Ryoo, Isaac Noble, Anelia Angelova
Instead, we propose an efficient, online approach which outputs frequent, detailed and temporally aligned captions, without access to future frames.
no code implementations • 4 Jan 2024 • Zihao Xiao, Longlong Jing, Shangxuan Wu, Alex Zihao Zhu, Jingwei Ji, Chiyu Max Jiang, Wei-Chih Hung, Thomas Funkhouser, Weicheng Kuo, Anelia Angelova, Yin Zhou, Shiwei Sheng
3D panoptic segmentation is a challenging perception task, especially in autonomous driving.
no code implementations • CVPR 2024 • Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, Siamak Shakeri, Mostafa Dehghani, Daniel Salz, Mario Lucic, Michael Tschannen, Arsha Nagrani, Hexiang Hu, Mandar Joshi, Bo Pang, Ceslee Montgomery, Paulina Pietrzyk, Marvin Ritter, AJ Piergiovanni, Matthias Minderer, Filip Pavetic, Austin Waters, Gang Li, Ibrahim Alabdulmohsin, Lucas Beyer, Julien Amelot, Kenton Lee, Andreas Peter Steiner, Yang Li, Daniel Keysers, Anurag Arnab, Yuanzhong Xu, Keran Rong, Alexander Kolesnikov, Mojtaba Seyedhosseini, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut
We explore the boundaries of scaling up a multilingual vision and language model both in terms of size of the components and the breadth of its training task mixture.
no code implementations • CVPR 2024 • AJ Piergiovanni, Isaac Noble, Dahun Kim, Michael S. Ryoo, Victor Gomes, Anelia Angelova
We propose a multimodal model, called Mirasol3B, consisting of an autoregressive component for the time-synchronized modalities (audio and video), and an autoregressive component for the context modalities which are not necessarily aligned in time but are still sequential.
Ranked #1 on Audio Classification on VGGSound
2 code implementations • 29 Sep 2023 • Dahun Kim, Anelia Angelova, Weicheng Kuo
We present a new open-vocabulary detection approach based on region-centric image-language pretraining to bridge the gap between image-level pretraining and open-vocabulary object detection.
Ranked #2 on Open Vocabulary Object Detection on LVIS v1.0
no code implementations • ICCV 2023 • Dahun Kim, Anelia Angelova, Weicheng Kuo
We present Contrastive Feature Masking Vision Transformer (CFM-ViT) - an image-text pretraining methodology that achieves simultaneous learning of image- and region-level representations for open-vocabulary object detection (OVD).
Ranked #8 on Open Vocabulary Object Detection on LVIS v1.0
no code implementations • 6 Jun 2023 • Vardaan Pahuja, AJ Piergiovanni, Anelia Angelova
Building joint representations across images and text is an essential step for tasks such as Visual Question Answering and Video Question Answering.
no code implementations • 31 May 2023 • AJ Piergiovanni, Anelia Angelova
Here we propose a much simpler recipe for image-language learning, which produces effective models that outperform bigger and more expensive ones, often trained on orders-of-magnitude larger datasets.
2 code implementations • 29 May 2023 • Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, Siamak Shakeri, Mostafa Dehghani, Daniel Salz, Mario Lucic, Michael Tschannen, Arsha Nagrani, Hexiang Hu, Mandar Joshi, Bo Pang, Ceslee Montgomery, Paulina Pietrzyk, Marvin Ritter, AJ Piergiovanni, Matthias Minderer, Filip Pavetic, Austin Waters, Gang Li, Ibrahim Alabdulmohsin, Lucas Beyer, Julien Amelot, Kenton Lee, Andreas Peter Steiner, Yang Li, Daniel Keysers, Anurag Arnab, Yuanzhong Xu, Keran Rong, Alexander Kolesnikov, Mojtaba Seyedhosseini, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture.
Ranked #1 on Fine-Grained Image Recognition on OVEN
2 code implementations • CVPR 2023 • Dahun Kim, Anelia Angelova, Weicheng Kuo
We present Region-aware Open-vocabulary Vision Transformers (RO-ViT) - a contrastive image-text pretraining recipe to bridge the gap between image-level pretraining and open-vocabulary object detection.
Ranked #6 on Zero-Shot Cross-Modal Retrieval on Flickr30k
1 code implementation • 29 Mar 2023 • Weicheng Kuo, AJ Piergiovanni, Dahun Kim, Xiyang Luo, Ben Caine, Wei Li, Abhijit Ogale, Luowei Zhou, Andrew Dai, Zhifeng Chen, Claire Cui, Anelia Angelova
We propose a novel paradigm of training with a decoder-only model for multimodal tasks, which is surprisingly effective at jointly learning these disparate vision-language tasks.
Ranked #1 on Video Captioning on MSVD
1 code implementation • CVPR 2023 • AJ Piergiovanni, Weicheng Kuo, Anelia Angelova
We present a simple approach that turns a ViT encoder into an efficient video model which can seamlessly work with both image and video inputs.
Ranked #2 on Action Classification on Kinetics-600 (using extra training data)
1 code implementation • 30 Sep 2022 • Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, Anelia Angelova
We present F-VLM, a simple open-vocabulary object detection method built upon Frozen Vision and Language Models.
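Schematically, the open-vocabulary scoring step reduces to comparing region embeddings against class-name text embeddings from the frozen model. A minimal sketch (hypothetical tensors and function name; the actual method additionally combines detector and VLM scores):

```python
import torch
import torch.nn.functional as F

def open_vocab_scores(region_feats, text_feats, temperature=0.01):
    """Cosine-similarity scores of each region against arbitrary class names."""
    r = F.normalize(region_feats, dim=-1)  # (num_regions, D), frozen image branch
    t = F.normalize(text_feats, dim=-1)    # (num_classes, D), from class prompts
    return (r @ t.T / temperature).softmax(dim=-1)

scores = open_vocab_scores(torch.randn(100, 512), torch.randn(20, 512))
print(scores.shape)  # torch.Size([100, 20]) -- per-region class probabilities
```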
1 code implementation • 14 Sep 2022 • Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut
PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages.
no code implementations • 9 Sep 2022 • AJ Piergiovanni, Weicheng Kuo, Anelia Angelova
We present a pre-training approach for vision and language transformer models, which is based on a mixture of diverse tasks.
no code implementations • 1 Aug 2022 • AJ Piergiovanni, Kairo Morton, Weicheng Kuo, Michael S. Ryoo, Anelia Angelova
Video question answering is a challenging task that requires jointly understanding the language input, the visual information in individual video frames, and the temporal information about the events occurring in the video.
Ranked #4 on Video Question Answering on iVQA
no code implementations • 2 May 2022 • AJ Piergiovanni, Wei Li, Weicheng Kuo, Mohammad Saffar, Fred Bertsch, Anelia Angelova
We present Answer-Me, a task-aware multi-task framework which unifies a variety of question answering tasks, such as visual question answering, visual entailment, and visual reasoning.
no code implementations • 31 Mar 2022 • Weicheng Kuo, Fred Bertsch, Wei Li, AJ Piergiovanni, Mohammad Saffar, Anelia Angelova
We propose FindIt, a simple and versatile framework that unifies a variety of visual grounding and localization tasks including referring expression comprehension, text-based localization, and object detection.
no code implementations • 22 Jan 2022 • Huang Huang, Michael Danielczuk, Chung Min Kim, Letian Fu, Zachary Tam, Jeffrey Ichnowski, Anelia Angelova, Brian Ichter, Ken Goldberg
Shelves are common in homes, warehouses, and commercial settings due to their storage efficiency.
1 code implementation • NeurIPS 2021 • Michael Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, Anelia Angelova
In this paper, we introduce a novel visual representation learning approach which relies on a handful of adaptively learned tokens and is applicable to both image and video understanding tasks.
1 code implementation • ICCV 2021 • AJ Piergiovanni, Vincent Casser, Michael S. Ryoo, Anelia Angelova
We present 4D-Net, a 3D object detection approach which utilizes 3D point cloud and RGB sensing information, both over time.
no code implementations • ICCV 2021 • Weicheng Kuo, Anelia Angelova, Tsung-Yi Lin, Angela Dai
3D perception of object shapes from RGB image input is fundamental to semantic scene understanding, grounding image-based perception in our spatially three-dimensional real-world environments.
5 code implementations • 15 Aug 2021 • Dahun Kim, Tsung-Yi Lin, Anelia Angelova, In So Kweon, Weicheng Kuo
In this paper, we identify the core problem: the binary classifiers in existing proposal methods tend to overfit to the training categories.
Ranked #2 on Open World Object Detection on COCO VOC to non-VOC
no code implementations • 28 Jun 2021 • AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo, Irfan Essa
In this paper we address the problem of automatically discovering atomic actions in an unsupervised manner from instructional videos.
6 code implementations • 21 Jun 2021 • Michael S. Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, Anelia Angelova
In this paper, we introduce a novel visual representation learning approach which relies on a handful of adaptively learned tokens and is applicable to both image and video understanding tasks.
Ranked #1 on Action Classification on Charades
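A minimal sketch of the adaptive-token idea (hypothetical module and shapes; not the released TokenLearner code): a small conv head predicts one spatial attention map per token, and each token is the attention-weighted average of the feature map.

```python
import torch
import torch.nn as nn

class AdaptiveTokenizer(nn.Module):
    """Reduce an H x W feature map to a handful of learned tokens."""
    def __init__(self, channels: int, num_tokens: int = 8):
        super().__init__()
        # One 1x1-conv output channel per token: a spatial attention map each.
        self.attn = nn.Conv2d(channels, num_tokens, kernel_size=1)

    def forward(self, x):                                         # x: (B, C, H, W)
        b, c, h, w = x.shape
        maps = self.attn(x).reshape(b, -1, h * w).softmax(dim=-1) # (B, K, HW)
        feats = x.reshape(b, c, h * w)                            # (B, C, HW)
        # Each token is an attention-weighted spatial average of the features.
        return torch.einsum("bkn,bcn->bkc", maps, feats)          # (B, K, C)

tokens = AdaptiveTokenizer(256, num_tokens=8)(torch.randn(2, 256, 14, 14))
print(tokens.shape)  # torch.Size([2, 8, 256])
```

Downstream layers then attend over these few tokens instead of the full spatial grid, which is where the efficiency gain comes from.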
no code implementations • 7 Jun 2021 • AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo, Irfan Essa
In this paper we address the problem of automatically discovering atomic actions in an unsupervised manner from instructional videos, which are rarely annotated with atomic actions.
2 code implementations • CVPR 2021 • Austin Stone, Daniel Maurer, Alper Ayvaci, Anelia Angelova, Rico Jonschkowski
We present SMURF, a method for unsupervised learning of optical flow that improves the state of the art on all benchmarks by 36% to 40% (over the prior best method, UFlow) and even outperforms several supervised approaches such as PWC-Net and FlowNet2.
no code implementations • 14 Apr 2021 • Juhana Kangaspunta, AJ Piergiovanni, Rico Jonschkowski, Michael Ryoo, Anelia Angelova
A common strategy to video understanding is to incorporate spatial and motion information by fusing features derived from RGB frames and optical flow.
Ranked #5 on Action Classification on Toyota Smarthome dataset
no code implementations • 26 Mar 2021 • Iretiayo Akinola, Anelia Angelova, Yao Lu, Yevgen Chebotar, Dmitry Kalashnikov, Jacob Varley, Julian Ibarz, Michael S. Ryoo
We propose a vision-based architecture search algorithm for robot manipulation learning, which discovers interactions between low-dimensional action inputs and high-dimensional visual inputs.
5 code implementations • 30 Oct 2020 • Hanhan Li, Ariel Gordon, Hang Zhao, Vincent Casser, Anelia Angelova
We present a method for jointly training the estimation of depth, ego-motion, and a dense 3D translation field of objects relative to the scene, with monocular photometric consistency being the sole source of supervision.
Ranked #10 on Unsupervised Monocular Depth Estimation on Cityscapes
1 code implementation • 18 Aug 2020 • Michael S. Ryoo, AJ Piergiovanni, Juhana Kangaspunta, Anelia Angelova
We create a family of powerful video models which are able to: (i) learn interactions between semantic object information and raw appearance and motion features, and (ii) deploy attention in order to better learn the importance of features at each convolutional block of the network.
Ranked #4 on Action Classification on Toyota Smarthome dataset
no code implementations • ECCV 2020 • AJ Piergiovanni, Anelia Angelova, Alexander Toshev, Michael S. Ryoo
In this paper we propose an adversarial generative grammar model for future prediction.
no code implementations • ECCV 2020 • Wei-cheng Kuo, Anelia Angelova, Tsung-Yi Lin, Angela Dai
We propose to leverage existing large-scale datasets of 3D models to understand the underlying 3D structure of objects seen in an image by constructing a CAD-based representation of the objects and their poses.
no code implementations • ECCV 2020 • Xiaofang Wang, Xuehan Xiong, Maxim Neumann, AJ Piergiovanni, Michael S. Ryoo, Anelia Angelova, Kris M. Kitani, Wei Hua
The discovered attention cells can be seamlessly inserted into existing backbone networks, e.g., I3D or S3D, and improve video classification accuracy by more than 2% on both Kinetics-600 and MiT datasets.
5 code implementations • ECCV 2020 • Rico Jonschkowski, Austin Stone, Jonathan T. Barron, Ariel Gordon, Kurt Konolige, Anelia Angelova
We systematically compare and analyze a set of key components in unsupervised optical flow to identify which photometric loss, occlusion handling, and smoothness regularization are most effective.
Ranked #5 on Optical Flow Estimation on Sintel Clean unsupervised
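For concreteness, here is a toy version of two of the loss families the paper compares (deliberately simplified; the paper evaluates several variants of each and also studies occlusion handling, which this sketch omits):

```python
import torch
import torch.nn.functional as F

def photometric_loss(img1, img2, flow):
    """L1 difference between frame 1 and frame 2 warped back by the flow.
    flow: (B, 2, H, W) with channels (dx, dy) in pixels."""
    b, _, h, w = flow.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(flow)            # (2, H, W)
    coords = grid.unsqueeze(0) + flow                               # (B, 2, H, W)
    coords = torch.stack((2 * coords[:, 0] / (w - 1) - 1,           # normalize
                          2 * coords[:, 1] / (h - 1) - 1), dim=-1)  # (B, H, W, 2)
    warped = F.grid_sample(img2, coords, align_corners=True)
    return (warped - img1).abs().mean()

def edge_aware_smoothness(flow, img):
    """Penalize flow gradients, downweighted at image edges."""
    fx = (flow[..., :, 1:] - flow[..., :, :-1]).abs()
    fy = (flow[..., 1:, :] - flow[..., :-1, :]).abs()
    ix = (img[..., :, 1:] - img[..., :, :-1]).abs().mean(1, keepdim=True)
    iy = (img[..., 1:, :] - img[..., :-1, :]).abs().mean(1, keepdim=True)
    return (fx * torch.exp(-ix)).mean() + (fy * torch.exp(-iy)).mean()
```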
no code implementations • 19 May 2020 • Peter Karkus, Anelia Angelova, Vincent Vanhoucke, Rico Jonschkowski
We address these tasks by combining spatial structure (differentiable mapping) and end-to-end learning in a novel neural network architecture: the Differentiable Mapping Network (DMN).
no code implementations • CVPR 2021 • Yao Lu, Sören Pirk, Jan Dlabal, Anthony Brohan, Ankita Pasad, Zhao Chen, Vincent Casser, Anelia Angelova, Ariel Gordon
Many computer vision tasks address the problem of scene understanding and are naturally interrelated, e.g., object classification, detection, scene segmentation, depth estimation, etc.
no code implementations • 20 Apr 2020 • Michael Danielczuk, Anelia Angelova, Vincent Vanhoucke, Ken Goldberg
For applications in e-commerce, warehouses, healthcare, and home service, robots are often required to search through heaps of objects to grasp a specific target object.
no code implementations • 11 Apr 2020 • Ankita Pasad, Ariel Gordon, Tsung-Yi Lin, Anelia Angelova
We leverage unsupervised learning of depth, egomotion, and camera intrinsics to improve the performance of single-image semantic segmentation, by enforcing 3D-geometric and temporal consistency of segmentation masks across video frames.
no code implementations • CVPR 2020 • AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo
We present a new method to learn video representations from large-scale unlabeled video data.
no code implementations • 13 Dec 2019 • Steven Schwarcz, Peng Xu, David D'Ambrosio, Juhana Kangaspunta, Anelia Angelova, Huong Phan, Navdeep Jaitly
The corpus consists of ping pong play with three main annotation streams that can be used to learn tracking and action recognition models: tracking of the ping pong ball, poses of the humans in the videos, and the spin of the ball as it is hit.
1 code implementation • CVPR 2020 • Xingyu Liu, Rico Jonschkowski, Anelia Angelova, Kurt Konolige
We address two problems: first, we establish an easy method for capturing and labeling 3D keypoints on desktop objects with an RGB camera; and second, we develop a deep neural network, called KeyPose, that learns to accurately predict object poses using 3D keypoints, from stereo input, and works even for transparent objects.
2 code implementations • 15 Oct 2019 • AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo
Video understanding is a challenging problem with great impact on the abilities of autonomous agents working in the real world.
no code implementations • 12 Jun 2019 • Vincent Casser, Soeren Pirk, Reza Mahjourian, Anelia Angelova
We present an approach which takes advantage of both structure and semantics for unsupervised monocular learning of depth and ego-motion.
no code implementations • 7 Jun 2019 • AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo
We present a new method to learn video representations from unlabeled data.
2 code implementations • ICLR 2020 • Michael S. Ryoo, AJ Piergiovanni, Mingxing Tan, Anelia Angelova
Learning to represent videos is a very challenging task both algorithmically and computationally.
4 code implementations • ICCV 2019 • Ariel Gordon, Hanhan Li, Rico Jonschkowski, Anelia Angelova
We present a novel method for simultaneous learning of depth, egomotion, object motion, and camera intrinsics from monocular videos, using only consistency across neighboring video frames as supervision signal.
Ranked #11 on Unsupervised Monocular Depth Estimation on Cityscapes
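The consistency signal can be sketched as follows (hypothetical helper; the actual method additionally models per-object motion and handles occlusion): back-project pixels with the predicted depth and intrinsics, move them by the predicted egomotion, and re-project. Sampling the neighboring frame at the resulting coordinates and comparing with the current frame yields the photometric loss that supervises all quantities jointly.

```python
import torch

def reproject(depth, K, pose):
    """Map frame-t pixels into frame t+1, given depth (H, W), intrinsics K (3, 3),
    and a 4x4 egomotion matrix taking frame-t coordinates to frame t+1."""
    h, w = depth.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float().reshape(3, -1)
    pts = torch.linalg.inv(K) @ pix * depth.reshape(1, -1)  # back-project to 3D
    pts = pose[:3, :3] @ pts + pose[:3, 3:]                  # apply egomotion
    uv = K @ pts                                             # project again
    return (uv[:2] / uv[2:].clamp(min=1e-6)).reshape(2, h, w)
```

Because K appears in both the back-projection and the re-projection, gradients of the photometric loss flow into the intrinsics as well, which is what allows them to be learned.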
1 code implementation • ICCV 2019 • Wei-cheng Kuo, Anelia Angelova, Jitendra Malik, Tsung-Yi Lin
However, it is difficult and costly to segment objects in novel categories because a large number of mask annotations is required.
no code implementations • 1 Feb 2019 • AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo
This paper proposes a novel algorithm which learns a formal regular grammar from real-world continuous data, such as videos.
no code implementations • 28 Nov 2018 • Suhani Vora, Reza Mahjourian, Soeren Pirk, Anelia Angelova
Predicting the future to anticipate the outcome of events and actions is a critical attribute of autonomous agents, particularly for agents that must rely heavily on real-time visual data for decision making.
1 code implementation • 27 Nov 2018 • David Hall, Feras Dayoub, John Skinner, Haoyang Zhang, Dimity Miller, Peter Corke, Gustavo Carneiro, Anelia Angelova, Niko Sünderhauf
We introduce Probabilistic Object Detection, the task of detecting objects in images and accurately quantifying the spatial and semantic uncertainties of the detections.
no code implementations • ICCV 2019 • AJ Piergiovanni, Anelia Angelova, Alexander Toshev, Michael S. Ryoo
We present a new method for finding video CNN architectures that capture rich spatio-temporal information in videos.
Ranked #20 on Action Classification on MiT
11 code implementations • 15 Nov 2018 • Vincent Casser, Soeren Pirk, Reza Mahjourian, Anelia Angelova
Ranked #12 on Unsupervised Monocular Depth Estimation on Cityscapes
2 code implementations • CVPR 2018 • Reza Mahjourian, Martin Wicke, Anelia Angelova
We present a novel approach for unsupervised learning of depth and ego-motion from monocular video.
1 code implementation • 26 Jan 2018 • Steven Hickson, Anelia Angelova, Irfan Essa, Rahul Sukthankar
We consider the problem of retrieving objects from image data and learning to classify them into meaningful semantic categories with minimal supervision.
1 code implementation • ICML 2017 • Michael Gygli, Mohammad Norouzi, Anelia Angelova
We approach structured output prediction by optimizing a deep value network (DVN) to precisely estimate the task loss on different output configurations for a given input.
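A minimal sketch of the idea (hypothetical architecture; the paper uses task-specific networks and particular schemes for generating training configurations): learn v(x, y) to approximate the quality of a candidate output y, then run inference as gradient ascent on y itself.

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    def __init__(self, x_dim: int, y_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + y_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid())  # value in [0, 1], e.g. IoU

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)

def infer(vnet, x, y_dim, steps=30, lr=0.5):
    """Inference = gradient ascent on the output y under the learned value."""
    y = torch.full((x.shape[0], y_dim), 0.5, requires_grad=True)
    for _ in range(steps):
        (g,) = torch.autograd.grad(vnet(x, y).sum(), y)
        y = (y + lr * g).clamp(0, 1).detach().requires_grad_(True)
    return y.detach()
```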
no code implementations • 8 Dec 2016 • Ben Poole, Alexander A. Alemi, Jascha Sohl-Dickstein, Anelia Angelova
We present a framework to understand GAN training as alternating density ratio estimation and approximate divergence minimization.
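The underlying identity is standard GAN analysis (not specific to this paper): for data density p and model density q, the Bayes-optimal discriminator under the logistic loss satisfies

```latex
D^*(x) = \frac{p(x)}{p(x) + q(x)}
\qquad\Longrightarrow\qquad
\frac{p(x)}{q(x)} = \frac{D^*(x)}{1 - D^*(x)},
```

so each discriminator update refines an implicit density-ratio estimate, and each generator update uses that estimate to reduce an approximate divergence between q and p.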
no code implementations • 20 Sep 2016 • Reza Mahjourian, Martin Wicke, Anelia Angelova
We consider the problem of next frame prediction from video input.
no code implementations • 4 Sep 2015 • Ivan Bogun, Anelia Angelova, Navdeep Jaitly
Videos, unlike still images, are temporally coherent, which makes the application of deep networks non-trivial.
3 code implementations • 9 Dec 2014 • Joseph Redmon, Anelia Angelova
We present an accurate, real-time approach to robotic grasp detection based on convolutional neural networks.
Ranked #5 on Robotic Grasping on Cornell Grasp Dataset
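In spirit, the single-grasp variant is a direct regression from the image to a grasp rectangle. A toy sketch (hypothetical layer sizes; the original uses a larger ImageNet-pretrained backbone):

```python
import torch
import torch.nn as nn

class GraspNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Grasp rectangle: center (x, y), orientation (sin, cos), height, width.
        self.head = nn.Linear(64, 6)

    def forward(self, img):
        return self.head(self.features(img))

pred = GraspNet()(torch.randn(1, 3, 224, 224))  # -> (1, 6)
```

Encoding the orientation as a sine/cosine pair (an illustrative choice here) avoids the wrap-around discontinuity of regressing the angle directly.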
no code implementations • CVPR 2013 • Anelia Angelova, Shenghuo Zhu
The algorithm first detects low-level regions that could potentially belong to the object and then performs a full-object segmentation through propagation.