Vision-language models trained with contrastive learning on large-scale noisy data are becoming increasingly popular for zero-shot recognition problems.
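As a rough illustration, here is a minimal sketch of the symmetric image-text contrastive (InfoNCE) objective such models are typically trained with; the embedding inputs and the temperature value are illustrative assumptions, not any specific paper's setup.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) tensors from separate encoders
    (the encoder architectures and the temperature are assumptions).
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matched pairs sit on the diagonal; contrast each image against all
    # texts in the batch, and each text against all images.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```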
This motivates the need for large datasets which go beyond traditional object masks and provide richer annotations such as part masks and attributes.
Compared to competing supervised approaches, ours is task-agnostic and ideally suited to the event domain, where task-specific labeled data is scarce.
We show that existing approaches either do not scale to this dataset or underperform relative to the simple baseline of training a single model on the union of data from all training domains.
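For reference, the union baseline amounts to pooling every training domain into one shuffled dataset; a minimal PyTorch sketch (the batch size and the assumption of a shared label space are illustrative):

```python
from torch.utils.data import ConcatDataset, DataLoader

def union_loader(domain_datasets, batch_size=64):
    """Pool all training domains into one dataset and shuffle across them.

    The per-domain datasets are assumed to share a common label space.
    """
    return DataLoader(ConcatDataset(domain_datasets),
                      batch_size=batch_size, shuffle=True)
```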
However, existing approaches that rely only on image-level class labels suffer predominantly from (a) partial segmentation of objects and (b) missing object predictions.
State-of-the-art object detection approaches typically rely on pre-trained classification models to achieve better performance and faster convergence.
This requires a detection framework that can be jointly trained with a limited number of bounding-box-annotated images and a large number of weakly labelled images.
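One hedged sketch of such joint training, assuming a model that exposes separate box-level and image-level loss heads; the method names `detection_loss` and `image_classification_loss` and the loss weighting are hypothetical, not the paper's formulation:

```python
import torch

def joint_training_step(model, box_batch, weak_batch, optimizer, weak_weight=1.0):
    """One optimization step mixing strong and weak supervision.

    `box_batch` holds a few fully box-annotated images; `weak_batch` holds
    many images with only image-level labels. Both loss heads and the
    weighting scheme are illustrative assumptions.
    """
    optimizer.zero_grad()
    strong_loss = model.detection_loss(*box_batch)             # box supervision
    weak_loss = model.image_classification_loss(*weak_batch)   # image-level supervision
    loss = strong_loss + weak_weight * weak_loss
    loss.backward()
    optimizer.step()
    return loss.item()
```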
Weakly supervised object detection aims at reducing the amount of supervision required to train detection models.
The ability to capture temporal information has been critical to the development of video understanding models.
ImageNet classification is the de facto pretraining task for these models.
Our method uses Q-learning to learn a data labeling policy on a small labeled training dataset, and then applies this policy to automatically label noisy web data for new visual concepts.
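A minimal tabular Q-learning sketch of such a labeling policy; the accept/reject action set, the environment interface, and all hyperparameters are illustrative assumptions rather than the paper's exact setup:

```python
import random
from collections import defaultdict

def q_learning_policy(env_reset, env_step, n_episodes=100,
                      actions=("accept", "reject"),
                      alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning for a binary accept/reject labeling policy.

    `env_reset()` returns an initial state; `env_step(state, action)` returns
    (reward, next_state, done). States are assumed to be hashable, discretized
    features of a candidate example (e.g. a classifier-confidence bucket).
    """
    q = defaultdict(float)  # q[(state, action)] -> estimated return
    for _ in range(n_episodes):
        state, done = env_reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: q[(state, a)])
            reward, next_state, done = env_step(state, action)
            # one-step temporal-difference update
            best_next = max(q[(next_state, a)] for a in actions)
            q[(state, action)] += alpha * (reward + gamma * best_next
                                           - q[(state, action)])
            state = next_state
    return q
```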
Unlike a conventional LSTM, our model shares information across multiple LSTMs through a new pooling layer.
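A simplified sketch of pooling across per-agent LSTMs, using a plain mean over all agents' previous hidden states in place of the paper's pooling layer; dimensions and the pooling choice are assumptions:

```python
import torch
import torch.nn as nn

class PooledLSTMs(nn.Module):
    """Per-agent LSTM cells that exchange information via pooled hidden states.

    At each step, every agent's input is concatenated with a mean-pool of all
    agents' previous hidden states (a simplification of the paper's layer).
    """
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.cell = nn.LSTMCell(input_dim + hidden_dim, hidden_dim)
        self.hidden_dim = hidden_dim

    def forward(self, x):
        # x: (time, n_agents, input_dim)
        t_steps, n_agents, _ = x.shape
        h = x.new_zeros(n_agents, self.hidden_dim)
        c = x.new_zeros(n_agents, self.hidden_dim)
        outputs = []
        for t in range(t_steps):
            # share information: pool hidden states over all agents
            pooled = h.mean(dim=0, keepdim=True).expand(n_agents, -1)
            h, c = self.cell(torch.cat([x[t], pooled], dim=-1), (h, c))
            outputs.append(h)
        return torch.stack(outputs)  # (time, n_agents, hidden_dim)
```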
In this paper, we propose a model which learns to detect events in such videos while automatically "attending" to the people responsible for the event.
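A minimal sketch of soft attention over per-person features; the scoring network and classifier are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PersonAttentionPooling(nn.Module):
    """Soft attention over per-person features for event classification.

    Learns a relevance score per detected person, pools features by those
    weights, and classifies the pooled representation (all layers here are
    hypothetical stand-ins).
    """
    def __init__(self, feat_dim, n_events):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)         # relevance of each person
        self.classifier = nn.Linear(feat_dim, n_events)

    def forward(self, person_feats):
        # person_feats: (n_people, feat_dim) extracted per detected person
        weights = F.softmax(self.score(person_feats), dim=0)  # (n_people, 1)
        video_feat = (weights * person_feats).sum(dim=0)      # weighted pool
        return self.classifier(video_feat), weights
```

The returned attention weights indicate which people the model holds responsible for the predicted event.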
Human actions capture a wide variety of interactions between people and objects.
In this paper, we tackle the problem of adapting object detectors learned from images to work well on videos.