Models trained with self-supervised learning (SSL) tend to generalize better than their supervised counterparts in transfer learning; yet, they still lag behind supervised models on ImageNet-1K (IN1K).
We propose a new problem formulation and a corresponding evaluation framework to advance research on unsupervised domain adaptation for semantic image segmentation.
While the former provides rich but implicit context for the search, the latter explicitly calls for new traits, or specifies how some elements of the example image should be changed to retrieve the desired target image.
Second, they are typically trained with a global loss that only acts on top of an aggregation of local features; by contrast, testing is based on local feature matching, which creates a discrepancy between training and testing.
Given a gallery of uncaptioned video sequences, this paper considers the task of retrieving videos based on their relevance to an unseen text query.
Given a raw video sequence taken from a freely-moving camera, we study the problem of decomposing the observed 3D scene into a static background and a dynamic foreground containing the objects that move in the video sequence.
Dimensionality reduction methods are unsupervised approaches which learn low-dimensional spaces where some properties of the initial space, typically the notion of "neighborhood", are preserved.
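As a concrete illustration of the "neighborhood" property above, here is a minimal sketch (not taken from any of the papers listed; PCA and the k-NN overlap metric are our own illustrative choices) that scores how well a reduced space preserves each point's k nearest neighbors:

```python
# Minimal sketch: measure how much of each point's k-NN set survives
# an unsupervised dimensionality reduction (here, PCA for simplicity).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

def knn_overlap(X_high, X_low, k=10):
    """Mean fraction of each point's k nearest neighbors preserved."""
    def knn_ids(X):
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
        # Drop column 0: each point is its own nearest neighbor.
        return nn.kneighbors(X, return_distance=False)[:, 1:]
    high, low = knn_ids(X_high), knn_ids(X_low)
    return float(np.mean([len(set(h) & set(l)) / k
                          for h, l in zip(high, low)]))

X = np.random.randn(500, 128)             # toy high-dimensional data
Z = PCA(n_components=8).fit_transform(X)  # unsupervised reduction
print(f"k-NN preservation: {knn_overlap(X, Z):.2f}")
```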
Instead, we propose to use Probabilistic Cross-Modal Embedding (PCME), where samples from the different modalities are represented as probabilistic distributions in the common embedding space.
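To make the idea concrete, below is a rough PyTorch sketch of a probabilistic embedding head: each sample is mapped to a diagonal Gaussian in the shared space, and a matching score is estimated by sampling from both distributions. Module names, shapes, and the sampled matching score are illustrative assumptions; the actual PCME implementation and loss differ in detail.

```python
# Rough sketch (names and hyperparameters are hypothetical, not PCME's
# exact recipe): modalities map to Gaussians in a shared embedding space,
# and similarity is a Monte Carlo estimate over sampled embeddings.
import torch
import torch.nn as nn

class ProbabilisticHead(nn.Module):
    """Maps backbone features to a diagonal Gaussian (mu, log-variance)."""
    def __init__(self, in_dim, embed_dim):
        super().__init__()
        self.mu = nn.Linear(in_dim, embed_dim)
        self.logvar = nn.Linear(in_dim, embed_dim)

    def forward(self, feats):
        return self.mu(feats), self.logvar(feats)

def sample(mu, logvar, n_samples=7):
    # Reparameterization trick: (S, B, D) samples from N(mu, sigma^2).
    eps = torch.randn(n_samples, *mu.shape, device=mu.device)
    return mu + eps * (0.5 * logvar).exp()

def match_prob(mu_img, lv_img, mu_txt, lv_txt, a=1.0, b=0.0):
    """Soft match probability for aligned image/text pairs in a batch."""
    zi = sample(mu_img, lv_img)                            # (S, B, D)
    zt = sample(mu_txt, lv_txt)                            # (S, B, D)
    d = (zi.unsqueeze(1) - zt.unsqueeze(0)).norm(dim=-1)   # (S, S, B)
    return torch.sigmoid(-a * d + b).mean(dim=(0, 1))      # (B,)
```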
In this paper, we argue that the semantic relationships between seen and unseen concepts affect generalization performance and propose ImageNet-CoG, a novel benchmark on the ImageNet-21K (IN-21K) dataset that enables measuring concept generalization in a principled way.
In this context, we show that one way to learn models that are inherently more robust against forgetting is domain randomization: for vision tasks, randomizing the current domain's distribution with heavy image manipulations.
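For intuition, the following is an illustrative "heavy manipulation" pipeline composed with torchvision; the specific transforms and magnitudes are assumptions, not the exact recipe used in the paper.

```python
# Illustrative sketch: heavy image manipulations that randomize the
# current domain's appearance statistics during training.
from torchvision import transforms

heavy_randomization = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.3, 1.0)),
    transforms.ColorJitter(brightness=0.8, contrast=0.8,
                           saturation=0.8, hue=0.4),
    transforms.RandomGrayscale(p=0.3),
    transforms.RandomApply([transforms.GaussianBlur(kernel_size=23)], p=0.5),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
# Applied per training image, this pushes the model toward cues that
# survive appearance changes rather than the current domain's statistics.
```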
Then, armed with this dataset, we describe several approaches which leverage scene text, including a better scene-text-aware cross-modal retrieval method that uses specialized representations for text from the captions and text from the visual scene, and reconciles them in a common embedding space.
Based on these observations, and motivated by the success of data mixing, we propose hard negative mixing strategies at the feature level, that can be computed on-the-fly with a minimal computational overhead.
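A minimal sketch of feature-level hard negative mixing follows; the way the hardest negatives are selected and mixed here is illustrative (function names and hyperparameters are our own), not the paper's exact strategy.

```python
# Minimal sketch: synthesize extra negatives on-the-fly as convex
# combinations of the hardest negatives, directly in feature space.
import torch
import torch.nn.functional as F

def mix_hard_negatives(query, negatives, n_hard=64, n_synth=32):
    """query: (D,) l2-normalized; negatives: (N, D) l2-normalized."""
    sims = negatives @ query                      # similarity to the query
    hard = negatives[sims.topk(n_hard).indices]   # hardest negatives
    i = torch.randint(n_hard, (n_synth,))
    j = torch.randint(n_hard, (n_synth,))
    lam = torch.rand(n_synth, 1)
    synth = lam * hard[i] + (1 - lam) * hard[j]   # mix in feature space
    synth = F.normalize(synth, dim=1)             # back to the unit sphere
    return torch.cat([negatives, synth], dim=0)   # enlarged negative set
```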
Starting from the observation that captioned images are easily crawlable, we argue that this overlooked source of information can be exploited to supervise the training of visual representations.
We report the first retrieval results on fine-grained actions for the large-scale EPIC dataset, in a generalised zero-shot setting.
Self-supervision can dramatically cut back the amount of manually-labelled data required to train deep neural networks.
In this paper we adopt a different approach and carefully design each component of a simple deep architecture and, critically, the strategy for training it effectively for person re-identification.
Following this observation, we learn a visual embedding of the images in which similarity in the visual space correlates with a surrogate of their semantic similarity.
Despite significant progress in deep learning in recent years, state-of-the-art semantic matching methods still rely on legacy features such as SIFT or HOG.
Second, we build on the recent R-MAC descriptor, show that it can be interpreted as a deep and differentiable architecture, and present improvements to enhance it.
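The sketch below shows a simplified R-MAC-style aggregation to illustrate why the descriptor is differentiable end-to-end; the rigid region grid here stands in for the paper's overlapping multi-scale regions, and PCA-whitening is omitted.

```python
# Simplified sketch of R-MAC-style aggregation: max-pool conv activations
# over regions, l2-normalize each regional vector, then sum and renormalize.
# Every step is differentiable, so the whole pipeline can be trained.
import torch
import torch.nn.functional as F

def rmac(fmap, levels=(1, 2, 3)):
    """fmap: (C, H, W) conv feature map -> (C,) global descriptor."""
    C, H, W = fmap.shape
    regions = []
    for l in levels:                      # coarse-to-fine region grids
        for i in range(l):
            for j in range(l):
                h0, h1 = i * H // l, (i + 1) * H // l
                w0, w1 = j * W // l, (j + 1) * W // l
                r = fmap[:, h0:h1, w0:w1].amax(dim=(1, 2))  # regional max-pool
                regions.append(F.normalize(r, dim=0))
    return F.normalize(torch.stack(regions).sum(dim=0), dim=0)
```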
While recent research in image understanding has often focused on recognizing more types of objects, understanding more about the objects is just as important.
We propose a novel approach for instance-level image retrieval.
Fisher Vectors (FV) and Convolutional Neural Networks (CNNs) are two image classification pipelines with different strengths.
Fisher Vectors and related orderless visual statistics have demonstrated excellent performance in object detection, sometimes superior to established approaches such as the Deformable Part Models.
Based on a state-of-the-art segmentation framework and a novel manually segmented image database (both indoor and outdoor scenes) containing 4-channel images (RGB+NIR), we study how to best incorporate the specific characteristics of the NIR response.