Search Results for author: Anelia Angelova

Found 60 papers, 25 papers with code

Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints

2 code implementations • CVPR 2018 • Reza Mahjourian, Martin Wicke, Anelia Angelova

We present a novel approach for unsupervised learning of depth and ego-motion from monocular video.

Depth And Camera Motion Monocular Depth Estimation +1

76,579

AssembleNet: Searching for Multi-Stream Neural Connectivity in Video Architectures

2 code implementations • ICLR 2020 • Michael S. Ryoo, AJ Piergiovanni, Mingxing Tan, Anelia Angelova

Learning to represent videos is a very challenging task both algorithmically and computationally.

Ranked #1 on Multimodal Activity Recognition on Moments in Time Dataset

Action Classification Action Recognition +4

72,249

Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos

12 code implementations • 15 Nov 2018 • Vincent Casser, Soeren Pirk, Reza Mahjourian, Anelia Angelova

Models and examples built with TensorFlow

Ranked #10 on Unsupervised Monocular Depth Estimation on Cityscapes

Depth And Camera Motion Monocular Depth Estimation +3

65,338

F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models

1 code implementation • 30 Sep 2022 • Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, Anelia Angelova

We present F-VLM, a simple open-vocabulary object detection method built upon Frozen Vision and Language Models.

Knowledge Distillation object-detection +1

32,749

Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers

2 code implementations • CVPR 2023 • Dahun Kim, Anelia Angelova, Weicheng Kuo

We present Region-aware Open-vocabulary Vision Transformers (RO-ViT) - a contrastive image-text pretraining recipe to bridge the gap between image-level pretraining and open-vocabulary object detection.

Ranked #5 on Zero-Shot Cross-Modal Retrieval on Flickr30k

Contrastive Learning object-detection +4

32,748

Detection-Oriented Image-Text Pretraining for Open-Vocabulary Detection

1 code implementation • 29 Sep 2023 • Dahun Kim, Anelia Angelova, Weicheng Kuo

We present a new open-vocabulary detection approach based on detection-oriented image-text pretraining to bridge the gap between image-level pretraining and open-vocabulary object detection.

Ranked #1 on Open Vocabulary Object Detection on LVIS v1.0

Contrastive Learning Object +2

32,747

Depth from Videos in the Wild: Unsupervised Monocular Depth Learning from Unknown Cameras

4 code implementations • ICCV 2019 • Ariel Gordon, Hanhan Li, Rico Jonschkowski, Anelia Angelova

We present a novel method for simultaneous learning of depth, egomotion, object motion, and camera intrinsics from monocular videos, using only consistency across neighboring video frames as supervision signal.

Ranked #9 on Unsupervised Monocular Depth Estimation on Cityscapes

Depth Prediction Monocular Depth Estimation +1

32,745

Unsupervised Monocular Depth Learning in Dynamic Scenes

5 code implementations • 30 Oct 2020 • Hanhan Li, Ariel Gordon, Hang Zhao, Vincent Casser, Anelia Angelova

We present a method for jointly training the estimation of depth, ego-motion, and a dense 3D translation field of objects relative to the scene, with monocular photometric consistency being the sole source of supervision.

Ranked #8 on Unsupervised Monocular Depth Estimation on Cityscapes

Depth Prediction Monocular Depth Estimation +2

32,745

KeyPose: Multi-View 3D Labeling and Keypoint Estimation for Transparent Objects

1 code implementation • CVPR 2020 • Xingyu Liu, Rico Jonschkowski, Anelia Angelova, Kurt Konolige

We address two problems: first, we establish an easy method for capturing and labeling 3D keypoints on desktop objects with an RGB camera; and second, we develop a deep neural network, called $KeyPose$, that learns to accurately predict object poses using 3D keypoints, from stereo input, and works even for transparent objects.

3D Pose Estimation Keypoint Estimation +1

32,743

What Matters in Unsupervised Optical Flow

5 code implementations • ECCV 2020 • Rico Jonschkowski, Austin Stone, Jonathan T. Barron, Ariel Gordon, Kurt Konolige, Anelia Angelova

We systematically compare and analyze a set of key components in unsupervised optical flow to identify which photometric loss, occlusion handling, and smoothness regularization is most effective.

Ranked #5 on Optical Flow Estimation on Sintel Clean unsupervised

Occlusion Handling Optical Flow Estimation

32,743

AssembleNet++: Assembling Modality Representations via Attention Connections

1 code implementation • 18 Aug 2020 • Michael S. Ryoo, AJ Piergiovanni, Juhana Kangaspunta, Anelia Angelova

We create a family of powerful video models which are able to: (i) learn interactions between semantic object information and raw appearance and motion features, and (ii) deploy attention in order to better learn the importance of features at each convolutional block of the network.

Ranked #4 on Action Classification on Toyota Smarthome dataset

Action Classification Activity Recognition

32,743

Tiny Video Networks

2 code implementations • 15 Oct 2019 • AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo

Video understanding is a challenging problem with great impact on the abilities of autonomous agents working in the real-world.

Video Understanding

32,738

SMURF: Self-Teaching Multi-Frame Unsupervised RAFT with Full-Image Warping

2 code implementations • CVPR 2021 • Austin Stone, Daniel Maurer, Alper Ayvaci, Anelia Angelova, Rico Jonschkowski

We present SMURF, a method for unsupervised learning of optical flow that improves state of the art on all benchmarks by $36\%$ to $40\%$ (over the prior best method UFlow) and even outperforms several supervised approaches such as PWC-Net and FlowNet2.

Optical Flow Estimation

32,736

ShapeMask: Learning to Segment Novel Objects by Refining Shape Priors

1 code implementation • ICCV 2019 • Wei-cheng Kuo, Anelia Angelova, Jitendra Malik, Tsung-Yi Lin

However, it is difficult and costly to segment objects in novel categories because a large number of mask annotations is required.

Instance Segmentation Object +1

5,176

TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?

4 code implementations • 21 Jun 2021 • Michael S. Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, Anelia Angelova

In this paper, we introduce a novel visual representation learning which relies on a handful of adaptively learned tokens, and which is applicable to both image and video understanding tasks.

Ranked #1 on Action Classification on Charades

Action Classification Image Classification +3

2,983

PaLI: A Jointly-Scaled Multilingual Language-Image Model

1 code implementation • 14 Sep 2022 • Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut

PaLI generates text based on visual and textual inputs, and with this interface performs many vision, language, and multimodal tasks, in many languages.

Ranked #1 on Zero-Shot Transfer Image Classification on ImageNet-S

Few-Shot Image Classification Image Captioning +5

1,537

Learning Open-World Object Proposals without Learning to Classify

3 code implementations • 15 Aug 2021 • Dahun Kim, Tsung-Yi Lin, Anelia Angelova, In So Kweon, Weicheng Kuo

In this paper, we identify that the problem is that the binary classifiers in existing proposal methods tend to overfit to the training categories.

Ranked #2 on Open World Object Detection on COCO VOC to non-VOC

Object object-detection +4

188

Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning

1 code implementation • CVPR 2023 • AJ Piergiovanni, Weicheng Kuo, Anelia Angelova

We present a simple approach which can turn a ViT encoder into an efficient video model, which can seamlessly work with both image and video inputs.

Ranked #2 on Action Classification on Kinetics-600 (using extra training data)

Action Classification Action Recognition In Videos

PaLI-X: On Scaling up a Multilingual Vision and Language Model

2 code implementations • 29 May 2023 • Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, Siamak Shakeri, Mostafa Dehghani, Daniel Salz, Mario Lucic, Michael Tschannen, Arsha Nagrani, Hexiang Hu, Mandar Joshi, Bo Pang, Ceslee Montgomery, Paulina Pietrzyk, Marvin Ritter, AJ Piergiovanni, Matthias Minderer, Filip Pavetic, Austin Waters, Gang Li, Ibrahim Alabdulmohsin, Lucas Beyer, Julien Amelot, Kenton Lee, Andreas Peter Steiner, Yang Li, Daniel Keysers, Anurag Arnab, Yuanzhong Xu, Keran Rong, Alexander Kolesnikov, Mojtaba Seyedhosseini, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut

We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model, both in terms of size of the components and the breadth of its training task mixture.

Ranked #1 on Fine-Grained Image Recognition on OVEN

Chart Question Answering document understanding +9

TokenLearner: Adaptive Space-Time Tokenization for Videos

1 code implementation • NeurIPS 2021 • Michael Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, Anelia Angelova

In this paper, we introduce a novel visual representation learning which relies on a handful of adaptively learned tokens, and which is applicable to both image and video understanding tasks.

Representation Learning Video Recognition +1

Deep Value Networks Learn to Evaluate and Iteratively Refine Structured Outputs

1 code implementation • ICML 2017 • Michael Gygli, Mohammad Norouzi, Anelia Angelova

We approach structured output prediction by optimizing a deep value network (DVN) to precisely estimate the task loss on different output configurations for a given input.

General Classification Image Segmentation +3

4D-Net for Learned Multi-Modal Alignment

1 code implementation • ICCV 2021 • AJ Piergiovanni, Vincent Casser, Michael S. Ryoo, Anelia Angelova

We present 4D-Net, a 3D object detection approach, which utilizes 3D Point Cloud and RGB sensing information, both in time.

3D Object Detection object-detection

Real-Time Grasp Detection Using Convolutional Neural Networks

3 code implementations • 9 Dec 2014 • Joseph Redmon, Anelia Angelova

We present an accurate, real-time approach to robotic grasp detection based on convolutional neural networks.

Ranked #5 on Robotic Grasping on Cornell Grasp Dataset

General Classification Object +3

Probabilistic Object Detection: Definition and Evaluation

1 code implementation • 27 Nov 2018 • David Hall, Feras Dayoub, John Skinner, Haoyang Zhang, Dimity Miller, Peter Corke, Gustavo Carneiro, Anelia Angelova, Niko Sünderhauf

We introduce Probabilistic Object Detection, the task of detecting objects in images and accurately quantifying the spatial and semantic uncertainties of the detections.

Object object-detection +1

Object category learning and retrieval with weak supervision

1 code implementation • 26 Jan 2018 • Steven Hickson, Anelia Angelova, Irfan Essa, Rahul Sukthankar

We consider the problem of retrieving objects from image data and learning to classify them into meaningful semantic categories with minimal supervision.

Clustering Deep Clustering +2

Geometry-Based Next Frame Prediction from Monocular Video

no code implementations • 20 Sep 2016 • Reza Mahjourian, Martin Wicke, Anelia Angelova

We consider the problem of next frame prediction from video input.

Autonomous Driving Benchmarking +2

Improved generator objectives for GANs

no code implementations • 8 Dec 2016 • Ben Poole, Alexander A. Alemi, Jascha Sohl-Dickstein, Anelia Angelova

We present a framework to understand GAN training as alternating density ratio estimation and approximate divergence minimization.

Density Ratio Estimation

Object Recognition from Short Videos for Robotic Perception

no code implementations • 4 Sep 2015 • Ivan Bogun, Anelia Angelova, Navdeep Jaitly

Videos, unlike still images, are temporally coherent which makes the application of deep networks non-trivial.

Object Object Recognition

Evolving Space-Time Neural Architectures for Videos

no code implementations • ICCV 2019 • AJ Piergiovanni, Anelia Angelova, Alexander Toshev, Michael S. Ryoo

We present a new method for finding video CNN architectures that capture rich spatio-temporal information in videos.

Ranked #22 on Action Classification on MiT

Action Classification Action Recognition In Videos

Future Segmentation Using 3D Structure

no code implementations • 28 Nov 2018 • Suhani Vora, Reza Mahjourian, Soeren Pirk, Anelia Angelova

Predicting the future to anticipate the outcome of events and actions is a critical attribute of autonomous agents, particularly for agents which must rely heavily on real-time visual data for decision making.

Attribute Decision Making +2

Efficient Object Detection and Segmentation for Fine-Grained Recognition

no code implementations • CVPR 2013 • Anelia Angelova, Shenghuo Zhu

The algorithm first detects low-level regions that could potentially belong to the object and then performs a full-object segmentation through propagation.

Object object-detection +2

Differentiable Grammars for Videos

no code implementations • 1 Feb 2019 • AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo

This paper proposes a novel algorithm which learns a formal regular grammar from real-world continuous data, such as videos.

Evolving Losses for Unlabeled Video Representation Learning

no code implementations • 7 Jun 2019 • AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo

We present a new method to learn video representations from unlabeled data.

Few-Shot Learning Multi-Task Learning +2

Unsupervised Monocular Depth and Ego-motion Learning with Structure and Semantics

no code implementations • 12 Jun 2019 • Vincent Casser, Soeren Pirk, Reza Mahjourian, Anelia Angelova

We present an approach which takes advantage of both structure and semantics for unsupervised monocular learning of depth and ego-motion.

Ranked #39 on Monocular Depth Estimation on KITTI Eigen split unsupervised

Depth And Camera Motion Monocular Depth Estimation +1

SPIN: A High Speed, High Resolution Vision Dataset for Tracking and Action Recognition in Ping Pong

no code implementations • 13 Dec 2019 • Steven Schwarcz, Peng Xu, David D'Ambrosio, Juhana Kangaspunta, Anelia Angelova, Huong Phan, Navdeep Jaitly

The corpus consists of ping pong play with three main annotation streams that can be used to learn tracking and action recognition models -- tracking of the ping pong ball and poses of humans in the videos and the spin of the ball being hit by humans.

Action Recognition Pose Estimation +1

Evolving Losses for Unsupervised Video Representation Learning

no code implementations • CVPR 2020 • AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo

We present a new method to learn video representations from large-scale unlabeled video data.

Ranked #4 on Self-Supervised Action Recognition on UCF101 (finetuned)

Few-Shot Learning Multi-Task Learning +2

Improving Semantic Segmentation through Spatio-Temporal Consistency Learned from Videos

no code implementations • 11 Apr 2020 • Ankita Pasad, Ariel Gordon, Tsung-Yi Lin, Anelia Angelova

We leverage unsupervised learning of depth, egomotion, and camera intrinsics to improve the performance of single-image semantic segmentation, by enforcing 3D-geometric and temporal consistency of segmentation masks across video frames.

Segmentation Semantic Segmentation

X-Ray: Mechanical Search for an Occluded Object by Minimizing Support of Learned Occupancy Distributions

no code implementations • 20 Apr 2020 • Michael Danielczuk, Anelia Angelova, Vincent Vanhoucke, Ken Goldberg

For applications in e-commerce, warehouses, healthcare, and home service, robots are often required to search through heaps of objects to grasp a specific target object.

Object

Taskology: Utilizing Task Relations at Scale

no code implementations • CVPR 2021 • Yao Lu, Sören Pirk, Jan Dlabal, Anthony Brohan, Ankita Pasad, Zhao Chen, Vincent Casser, Anelia Angelova, Ariel Gordon

Many computer vision tasks address the problem of scene understanding and are naturally interrelated, e.g., object classification, detection, scene segmentation, and depth estimation.

Depth Estimation Motion Estimation +4

Differentiable Mapping Networks: Learning Structured Map Representations for Sparse Visual Localization

no code implementations • 19 May 2020 • Peter Karkus, Anelia Angelova, Vincent Vanhoucke, Rico Jonschkowski

We address these tasks by combining spatial structure (differentiable mapping) and end-to-end learning in a novel neural network architecture: the Differentiable Mapping Network (DMN).

Visual Localization

AttentionNAS: Spatiotemporal Attention Cell Search for Video Classification

no code implementations • ECCV 2020 • Xiaofang Wang, Xuehan Xiong, Maxim Neumann, AJ Piergiovanni, Michael S. Ryoo, Anelia Angelova, Kris M. Kitani, Wei Hua

The discovered attention cells can be seamlessly inserted into existing backbone networks, e.g., I3D or S3D, and improve video classification accuracy by more than 2% on both Kinetics-600 and MiT datasets.

Classification General Classification +1

Mask2CAD: 3D Shape Prediction by Learning to Segment and Retrieve

no code implementations • ECCV 2020 • Wei-cheng Kuo, Anelia Angelova, Tsung-Yi Lin, Angela Dai

We propose to leverage existing large-scale datasets of 3D models to understand the underlying 3D structure of objects seen in an image by constructing a CAD-based representation of the objects and their poses.

Image to 3D Object +3

Adversarial Generative Grammars for Human Activity Prediction

no code implementations • ECCV 2020 • AJ Piergiovanni, Anelia Angelova, Alexander Toshev, Michael S. Ryoo

In this paper we propose an adversarial generative grammar model for future prediction.

Activity Prediction Future prediction +1

AssembleNet++: Assembling Modality Representations via Attention Connections - Supplementary Material -

no code implementations • ECCV 2020 • Michael S. Ryoo, AJ Piergiovanni, Juhana Kangaspunta, Anelia Angelova

Activity Recognition

Visionary: Vision architecture discovery for robot learning

no code implementations • 26 Mar 2021 • Iretiayo Akinola, Anelia Angelova, Yao Lu, Yevgen Chebotar, Dmitry Kalashnikov, Jacob Varley, Julian Ibarz, Michael S. Ryoo

We propose a vision-based architecture search algorithm for robot manipulation learning, which discovers interactions between low dimension action inputs and high dimensional visual inputs.

Neural Architecture Search Robot Manipulation

Adaptive Intermediate Representations for Video Understanding

no code implementations • 14 Apr 2021 • Juhana Kangaspunta, AJ Piergiovanni, Rico Jonschkowski, Michael Ryoo, Anelia Angelova

A common strategy for video understanding is to incorporate spatial and motion information by fusing features derived from RGB frames and optical flow.

Ranked #5 on Action Classification on Toyota Smarthome dataset

Action Classification Optical Flow Estimation +3

Unsupervised Action Segmentation for Instructional Videos

no code implementations • 7 Jun 2021 • AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo, Irfan Essa

In this paper we address the problem of automatically discovering atomic actions in an unsupervised manner from instructional videos, which are rarely annotated with atomic actions.

Action Segmentation Segmentation

Unsupervised Discovery of Actions in Instructional Videos

no code implementations • 28 Jun 2021 • AJ Piergiovanni, Anelia Angelova, Michael S. Ryoo, Irfan Essa

In this paper we address the problem of automatically discovering atomic actions in an unsupervised manner from instructional videos.

Patch2CAD: Patchwise Embedding Learning for In-the-Wild Shape Retrieval from a Single Image

no code implementations • ICCV 2021 • Weicheng Kuo, Anelia Angelova, Tsung-Yi Lin, Angela Dai

3D perception of object shapes from RGB image input is fundamental towards semantic scene understanding, grounding image-based perception in our spatially 3-dimensional real-world environments.

Retrieval Scene Understanding

Mechanical Search on Shelves using a Novel "Bluction" Tool

no code implementations • 22 Jan 2022 • Huang Huang, Michael Danielczuk, Chung Min Kim, Letian Fu, Zachary Tam, Jeffrey Ichnowski, Anelia Angelova, Brian Ichter, Ken Goldberg

Shelves are common in homes, warehouses, and commercial settings due to their storage efficiency.

FindIt: Generalized Localization with Natural Language Queries

no code implementations • 31 Mar 2022 • Weicheng Kuo, Fred Bertsch, Wei Li, AJ Piergiovanni, Mohammad Saffar, Anelia Angelova

We propose FindIt, a simple and versatile framework that unifies a variety of visual grounding and localization tasks including referring expression comprehension, text-based localization, and object detection.

Natural Language Queries Object +5

Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering

no code implementations • 2 May 2022 • AJ Piergiovanni, Wei Li, Weicheng Kuo, Mohammad Saffar, Fred Bertsch, Anelia Angelova

We present Answer-Me, a task-aware multi-task framework which unifies a variety of question answering tasks, such as visual question answering, visual entailment, and visual reasoning.

Image Captioning Question Answering +4

Video Question Answering with Iterative Video-Text Co-Tokenization

no code implementations • 1 Aug 2022 • AJ Piergiovanni, Kairo Morton, Weicheng Kuo, Michael S. Ryoo, Anelia Angelova

Video question answering is a challenging task that requires understanding jointly the language input, the visual information in individual video frames, as well as the temporal information about the events occurring in the video.

Ranked #4 on Video Question Answering on iVQA

Question Answering Video Question Answering +1

Pre-training image-language transformers for open-vocabulary tasks

no code implementations • 9 Sep 2022 • AJ Piergiovanni, Weicheng Kuo, Anelia Angelova

We present a pre-training approach for vision and language transformer models, which is based on a mixture of diverse tasks.

Question Answering Visual Entailment +1

MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks

no code implementations • 29 Mar 2023 • Weicheng Kuo, AJ Piergiovanni, Dahun Kim, Xiyang Luo, Ben Caine, Wei Li, Abhijit Ogale, Luowei Zhou, Andrew Dai, Zhifeng Chen, Claire Cui, Anelia Angelova

We propose a novel paradigm of training with a decoder-only model for multimodal tasks, which is surprisingly effective in jointly learning these disparate vision-language tasks.

Ranked #1 on Visual Question Answering on COCO Visual Question Answering (VQA) real images 2.0 open ended

Cross-Modal Retrieval Image Retrieval +7

Joint Adaptive Representations for Image-Language Learning

no code implementations • 31 May 2023 • AJ Piergiovanni, Anelia Angelova

We here propose a much easier recipe for image-language learning, which produces effective models, outperforming bigger and more expensive ones, often trained on orders of magnitude larger datasets.

Diversifying Joint Vision-Language Tokenization Learning

no code implementations • 6 Jun 2023 • Vardaan Pahuja, AJ Piergiovanni, Anelia Angelova

Building joint representations across images and text is an essential step for tasks such as Visual Question Answering and Video Question Answering.

Question Answering Representation Learning +2

Contrastive Feature Masking Open-Vocabulary Vision Transformer

no code implementations • ICCV 2023 • Dahun Kim, Anelia Angelova, Weicheng Kuo

We present Contrastive Feature Masking Vision Transformer (CFM-ViT) - an image-text pretraining methodology that achieves simultaneous learning of image- and region-level representation for open-vocabulary object detection (OVD).

Ranked #5 on Open Vocabulary Object Detection on LVIS v1.0

Contrastive Learning object-detection +3

Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

no code implementations • 9 Nov 2023 • AJ Piergiovanni, Isaac Noble, Dahun Kim, Michael S. Ryoo, Victor Gomes, Anelia Angelova

We propose a multimodal model, called Mirasol3B, consisting of an autoregressive component for the time-synchronized modalities (audio and video), and an autoregressive component for the context modalities which are not necessarily aligned in time but are still sequential.

Ranked #1 on Audio Classification on VGGSound

Action Classification Audio Classification +1

3D Open-Vocabulary Panoptic Segmentation with 2D-3D Vision-Language Distillation

no code implementations • 4 Jan 2024 • Zihao Xiao, Longlong Jing, Shangxuan Wu, Alex Zihao Zhu, Jingwei Ji, Chiyu Max Jiang, Wei-Chih Hung, Thomas Funkhouser, Weicheng Kuo, Anelia Angelova, Yin Zhou, Shiwei Sheng

3D panoptic segmentation is a challenging perception task, especially in autonomous driving.

Autonomous Driving Classification +3
