We evaluate our SSL approach on two downstream tasks, object detection and semantic segmentation, using the COCO, PASCAL VOC, and CityScapes datasets.
As the base dataset and unlabeled dataset are from different domains, projecting the target images in the class-domain of the base dataset with a fixed pretrained model might be sub-optimal.
We show that the resulting similarity models both perform better and can be visually explained better than the corresponding baseline models trained without these constraints.
We propose a classification module to generate action labels for each segment in the video, and a deep metric learning module to learn the similarity between different action instances.
Ranked #1 on Temporal Action Localization on ActivityNet-1.2
In this paper, we present a multi-user interaction interface for a large immersive space that supports simultaneous screen interactions by combining (1) user input via personal smartphones and Bluetooth microphones, (2) spatial tracking via an overhead array of Kinect sensors, and (3) WebSocket interfaces to a webpage running on the large screen.
While there has been substantial progress in learning suitable distance metrics, these techniques in general lack transparency and decision reasoning, i.e., explaining why the input set of images is similar or dissimilar.
We present methods to generate visual attention from the learned latent space, and also demonstrate that such attention explanations serve more than just explaining VAE predictions.
Transcripts of natural, multi-person meetings differ significantly from documents like news articles, which can make the summaries produced by Natural Language Generation models unfocused.
Designing real-world person re-identification (re-id) systems requires attention to operational aspects not typically considered in academic research.
Designing useful person re-identification systems for real-world applications requires attention to operational aspects not typically considered in academic research.
To ensure a fair comparison, all of the approaches were implemented using a unified code library that includes 11 feature extraction algorithms and 22 metric learning and ranking techniques.
This paper introduces a new approach to address the person re-identification problem in cameras with non-overlapping fields of view.