Search Results for author: Anelia Angelova

Found 60 papers, 25 papers with code

F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models

1 code implementation 30 Sep 2022 Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, Anelia Angelova

We present F-VLM, a simple open-vocabulary object detection method built upon Frozen Vision and Language Models.

Knowledge Distillation object-detection +1

Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers

2 code implementations CVPR 2023 Dahun Kim, Anelia Angelova, Weicheng Kuo

We present Region-aware Open-vocabulary Vision Transformers (RO-ViT) - a contrastive image-text pretraining recipe to bridge the gap between image-level pretraining and open-vocabulary object detection.

Contrastive Learning object-detection +4

Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection

1 code implementation 29 Sep 2023 Dahun Kim, Anelia Angelova, Weicheng Kuo

We present a new open-vocabulary detection approach based on detection-oriented image-text pretraining to bridge the gap between image-level pretraining and open-vocabulary object detection.

Contrastive Learning Object +2

Depth from Videos in the Wild: Unsupervised Monocular Depth Learning from Unknown Cameras

4 code implementations ICCV 2019 Ariel Gordon, Hanhan Li, Rico Jonschkowski, Anelia Angelova

We present a novel method for simultaneous learning of depth, egomotion, object motion, and camera intrinsics from monocular videos, using only consistency across neighboring video frames as supervision signal.

Depth Prediction Monocular Depth Estimation +1

Unsupervised Monocular Depth Learning in Dynamic Scenes

5 code implementations 30 Oct 2020 Hanhan Li, Ariel Gordon, Hang Zhao, Vincent Casser, Anelia Angelova

We present a method for jointly training the estimation of depth, ego-motion, and a dense 3D translation field of objects relative to the scene, with monocular photometric consistency being the sole source of supervision.

Depth Prediction Monocular Depth Estimation +2

KeyPose: Multi-View 3D Labeling and Keypoint Estimation for Transparent Objects

1 code implementation CVPR 2020 Xingyu Liu, Rico Jonschkowski, Anelia Angelova, Kurt Konolige

We address two problems: first, we establish an easy method for capturing and labeling 3D keypoints on desktop objects with an RGB camera; and second, we develop a deep neural network, called KeyPose, that learns to accurately predict object poses using 3D keypoints, from stereo input, and works even for transparent objects.

3D Pose Estimation Keypoint Estimation +1

What Matters in Unsupervised Optical Flow

5 code implementations ECCV 2020 Rico Jonschkowski, Austin Stone, Jonathan T. Barron, Ariel Gordon, Kurt Konolige, Anelia Angelova

We systematically compare and analyze a set of key components in unsupervised optical flow to identify which photometric loss, occlusion handling, and smoothness regularization is most effective.

Occlusion Handling Optical Flow Estimation

AssembleNet++: Assembling Modality Representations via Attention Connections

1 code implementation 18 Aug 2020 Michael S. Ryoo, AJ Piergiovanni, Juhana Kangaspunta, Anelia Angelova

We create a family of powerful video models which are able to: (i) learn interactions between semantic object information and raw appearance and motion features, and (ii) deploy attention in order to better learn the importance of features at each convolutional block of the network.

Action Classification Activity Recognition

Tiny Video Networks

2 code implementations 15 Oct 2019 AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo

Video understanding is a challenging problem with great impact on the abilities of autonomous agents working in the real world.

Video Understanding

SMURF: Self-Teaching Multi-Frame Unsupervised RAFT with Full-Image Warping

2 code implementations CVPR 2021 Austin Stone, Daniel Maurer, Alper Ayvaci, Anelia Angelova, Rico Jonschkowski

We present SMURF, a method for unsupervised learning of optical flow that improves state of the art on all benchmarks by 36% to 40% (over the prior best method UFlow) and even outperforms several supervised approaches such as PWC-Net and FlowNet2.

Optical Flow Estimation

ShapeMask: Learning to Segment Novel Objects by Refining Shape Priors

1 code implementation ICCV 2019 Wei-cheng Kuo, Anelia Angelova, Jitendra Malik, Tsung-Yi Lin

However, it is difficult and costly to segment objects in novel categories because a large number of mask annotations is required.

Instance Segmentation Object +1

TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?

4 code implementations 21 Jun 2021 Michael S. Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, Anelia Angelova

In this paper, we introduce a novel visual representation learning which relies on a handful of adaptively learned tokens, and which is applicable to both image and video understanding tasks.

Action Classification Image Classification +3

Learning Open-World Object Proposals without Learning to Classify

3 code implementations 15 Aug 2021 Dahun Kim, Tsung-Yi Lin, Anelia Angelova, In So Kweon, Weicheng Kuo

In this paper, we identify that the problem is that the binary classifiers in existing proposal methods tend to overfit to the training categories.

Object object-detection +4

Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning

1 code implementation CVPR 2023 AJ Piergiovanni, Weicheng Kuo, Anelia Angelova

We present a simple approach which can turn a ViT encoder into an efficient video model, which can seamlessly work with both image and video inputs.

Ranked #2 on Action Classification on Kinetics-600 (using extra training data)

Action Classification Action Recognition In Videos

TokenLearner: Adaptive Space-Time Tokenization for Videos

1 code implementation NeurIPS 2021 Michael Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, Anelia Angelova

In this paper, we introduce a novel visual representation learning which relies on a handful of adaptively learned tokens, and which is applicable to both image and video understanding tasks.

Representation Learning Video Recognition +1

Deep Value Networks Learn to Evaluate and Iteratively Refine Structured Outputs

1 code implementation ICML 2017 Michael Gygli, Mohammad Norouzi, Anelia Angelova

We approach structured output prediction by optimizing a deep value network (DVN) to precisely estimate the task loss on different output configurations for a given input.

General Classification Image Segmentation +3

4D-Net for Learned Multi-Modal Alignment

1 code implementation ICCV 2021 AJ Piergiovanni, Vincent Casser, Michael S. Ryoo, Anelia Angelova

We present 4D-Net, a 3D object detection approach, which utilizes 3D Point Cloud and RGB sensing information, both in time.

3D Object Detection object-detection

Real-Time Grasp Detection Using Convolutional Neural Networks

3 code implementations 9 Dec 2014 Joseph Redmon, Anelia Angelova

We present an accurate, real-time approach to robotic grasp detection based on convolutional neural networks.

General Classification Object +3

Probabilistic Object Detection: Definition and Evaluation

1 code implementation 27 Nov 2018 David Hall, Feras Dayoub, John Skinner, Haoyang Zhang, Dimity Miller, Peter Corke, Gustavo Carneiro, Anelia Angelova, Niko Sünderhauf

We introduce Probabilistic Object Detection, the task of detecting objects in images and accurately quantifying the spatial and semantic uncertainties of the detections.

Object object-detection +1

Object category learning and retrieval with weak supervision

1 code implementation 26 Jan 2018 Steven Hickson, Anelia Angelova, Irfan Essa, Rahul Sukthankar

We consider the problem of retrieving objects from image data and learning to classify them into meaningful semantic categories with minimal supervision.

Clustering Deep Clustering +2

Improved generator objectives for GANs

no code implementations 8 Dec 2016 Ben Poole, Alexander A. Alemi, Jascha Sohl-Dickstein, Anelia Angelova

We present a framework to understand GAN training as alternating density ratio estimation and approximate divergence minimization.

Density Ratio Estimation

Object Recognition from Short Videos for Robotic Perception

no code implementations 4 Sep 2015 Ivan Bogun, Anelia Angelova, Navdeep Jaitly

Videos, unlike still images, are temporally coherent, which makes the application of deep networks non-trivial.

Object Object Recognition

Future Segmentation Using 3D Structure

no code implementations 28 Nov 2018 Suhani Vora, Reza Mahjourian, Soeren Pirk, Anelia Angelova

Predicting the future to anticipate the outcome of events and actions is a critical attribute of autonomous agents, particularly for agents that must rely heavily on real-time visual data for decision making.

Attribute Decision Making +2

Efficient Object Detection and Segmentation for Fine-Grained Recognition

no code implementations CVPR 2013 Anelia Angelova, Shenghuo Zhu

The algorithm first detects low-level regions that could potentially belong to the object and then performs a full-object segmentation through propagation.

Object object-detection +2

Differentiable Grammars for Videos

no code implementations 1 Feb 2019 AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo

This paper proposes a novel algorithm which learns a formal regular grammar from real-world continuous data, such as videos.

SPIN: A High Speed, High Resolution Vision Dataset for Tracking and Action Recognition in Ping Pong

no code implementations 13 Dec 2019 Steven Schwarcz, Peng Xu, David D'Ambrosio, Juhana Kangaspunta, Anelia Angelova, Huong Phan, Navdeep Jaitly

The corpus consists of ping pong play with three main annotation streams that can be used to learn tracking and action recognition models: tracking of the ping pong ball, poses of humans in the videos, and the spin of the ball being hit by humans.

Action Recognition Pose Estimation +1

Improving Semantic Segmentation through Spatio-Temporal Consistency Learned from Videos

no code implementations 11 Apr 2020 Ankita Pasad, Ariel Gordon, Tsung-Yi Lin, Anelia Angelova

We leverage unsupervised learning of depth, egomotion, and camera intrinsics to improve the performance of single-image semantic segmentation, by enforcing 3D-geometric and temporal consistency of segmentation masks across video frames.

Segmentation Semantic Segmentation

X-Ray: Mechanical Search for an Occluded Object by Minimizing Support of Learned Occupancy Distributions

no code implementations 20 Apr 2020 Michael Danielczuk, Anelia Angelova, Vincent Vanhoucke, Ken Goldberg

For applications in e-commerce, warehouses, healthcare, and home service, robots are often required to search through heaps of objects to grasp a specific target object.

Object

Taskology: Utilizing Task Relations at Scale

no code implementations CVPR 2021 Yao Lu, Sören Pirk, Jan Dlabal, Anthony Brohan, Ankita Pasad, Zhao Chen, Vincent Casser, Anelia Angelova, Ariel Gordon

Many computer vision tasks address the problem of scene understanding and are naturally interrelated, e.g., object classification, detection, scene segmentation, and depth estimation.

Depth Estimation Motion Estimation +4

Differentiable Mapping Networks: Learning Structured Map Representations for Sparse Visual Localization

no code implementations 19 May 2020 Peter Karkus, Anelia Angelova, Vincent Vanhoucke, Rico Jonschkowski

We address these tasks by combining spatial structure (differentiable mapping) and end-to-end learning in a novel neural network architecture: the Differentiable Mapping Network (DMN).

Visual Localization

AttentionNAS: Spatiotemporal Attention Cell Search for Video Classification

no code implementations ECCV 2020 Xiaofang Wang, Xuehan Xiong, Maxim Neumann, AJ Piergiovanni, Michael S. Ryoo, Anelia Angelova, Kris M. Kitani, Wei Hua

The discovered attention cells can be seamlessly inserted into existing backbone networks, e.g., I3D or S3D, and improve video classification accuracy by more than 2% on both Kinetics-600 and MiT datasets.

Classification General Classification +1

Mask2CAD: 3D Shape Prediction by Learning to Segment and Retrieve

no code implementations ECCV 2020 Wei-cheng Kuo, Anelia Angelova, Tsung-Yi Lin, Angela Dai

We propose to leverage existing large-scale datasets of 3D models to understand the underlying 3D structure of objects seen in an image by constructing a CAD-based representation of the objects and their poses.

Image to 3D Object +3

AssembleNet++: Assembling Modality Representations via Attention Connections - Supplementary Material -

no code implementations ECCV 2020 Michael S. Ryoo, AJ Piergiovanni, Juhana Kangaspunta, Anelia Angelova

We create a family of powerful video models which are able to: (i) learn interactions between semantic object information and raw appearance and motion features, and (ii) deploy attention in order to better learn the importance of features at each convolutional block of the network.

Activity Recognition

Visionary: Vision architecture discovery for robot learning

no code implementations 26 Mar 2021 Iretiayo Akinola, Anelia Angelova, Yao Lu, Yevgen Chebotar, Dmitry Kalashnikov, Jacob Varley, Julian Ibarz, Michael S. Ryoo

We propose a vision-based architecture search algorithm for robot manipulation learning, which discovers interactions between low-dimensional action inputs and high-dimensional visual inputs.

Neural Architecture Search Robot Manipulation

Adaptive Intermediate Representations for Video Understanding

no code implementations 14 Apr 2021 Juhana Kangaspunta, AJ Piergiovanni, Rico Jonschkowski, Michael Ryoo, Anelia Angelova

A common strategy for video understanding is to incorporate spatial and motion information by fusing features derived from RGB frames and optical flow.

Action Classification Optical Flow Estimation +3

Unsupervised Action Segmentation for Instructional Videos

no code implementations 7 Jun 2021 AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo, Irfan Essa

In this paper, we address the problem of automatically discovering atomic actions in an unsupervised manner from instructional videos, which are rarely annotated with atomic actions.

Action Segmentation Segmentation

Unsupervised Discovery of Actions in Instructional Videos

no code implementations 28 Jun 2021 AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo, Irfan Essa

In this paper, we address the problem of automatically discovering atomic actions in an unsupervised manner from instructional videos.

Patch2CAD: Patchwise Embedding Learning for In-the-Wild Shape Retrieval from a Single Image

no code implementations ICCV 2021 Weicheng Kuo, Anelia Angelova, Tsung-Yi Lin, Angela Dai

3D perception of object shapes from RGB image input is fundamental towards semantic scene understanding, grounding image-based perception in our spatially 3-dimensional real-world environments.

Retrieval Scene Understanding

FindIt: Generalized Localization with Natural Language Queries

no code implementations 31 Mar 2022 Weicheng Kuo, Fred Bertsch, Wei Li, AJ Piergiovanni, Mohammad Saffar, Anelia Angelova

We propose FindIt, a simple and versatile framework that unifies a variety of visual grounding and localization tasks including referring expression comprehension, text-based localization, and object detection.

Natural Language Queries Object +5

Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering

no code implementations 2 May 2022 AJ Piergiovanni, Wei Li, Weicheng Kuo, Mohammad Saffar, Fred Bertsch, Anelia Angelova

We present Answer-Me, a task-aware multi-task framework which unifies a variety of question answering tasks, such as visual question answering, visual entailment, and visual reasoning.

Image Captioning Question Answering +4

Video Question Answering with Iterative Video-Text Co-Tokenization

no code implementations 1 Aug 2022 AJ Piergiovanni, Kairo Morton, Weicheng Kuo, Michael S. Ryoo, Anelia Angelova

Video question answering is a challenging task that requires understanding jointly the language input, the visual information in individual video frames, as well as the temporal information about the events occurring in the video.

Question Answering Video Question Answering +1

Pre-training image-language transformers for open-vocabulary tasks

no code implementations 9 Sep 2022 AJ Piergiovanni, Weicheng Kuo, Anelia Angelova

We present a pre-training approach for vision and language transformer models, which is based on a mixture of diverse tasks.

Question Answering Visual Entailment +1

Joint Adaptive Representations for Image-Language Learning

no code implementations 31 May 2023 AJ Piergiovanni, Anelia Angelova

Here we propose a much easier recipe for image-language learning, which produces effective models, outperforming bigger and more expensive ones, often trained on orders of magnitude larger datasets.

Diversifying Joint Vision-Language Tokenization Learning

no code implementations 6 Jun 2023 Vardaan Pahuja, AJ Piergiovanni, Anelia Angelova

Building joint representations across images and text is an essential step for tasks such as Visual Question Answering and Video Question Answering.

Question Answering Representation Learning +2

Contrastive Feature Masking Open-Vocabulary Vision Transformer

no code implementations ICCV 2023 Dahun Kim, Anelia Angelova, Weicheng Kuo

We present Contrastive Feature Masking Vision Transformer (CFM-ViT) - an image-text pretraining methodology that achieves simultaneous learning of image- and region-level representation for open-vocabulary object detection (OVD).

Contrastive Learning object-detection +3

Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

no code implementations 9 Nov 2023 AJ Piergiovanni, Isaac Noble, Dahun Kim, Michael S. Ryoo, Victor Gomes, Anelia Angelova

We propose a multimodal model, called Mirasol3B, consisting of an autoregressive component for the time-synchronized modalities (audio and video), and an autoregressive component for the context modalities which are not necessarily aligned in time but are still sequential.

Action Classification Audio Classification +1
