The core problem in zero-shot open-vocabulary detection is how to align visual and text features so that the detector performs well on unseen classes.
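As an illustrative sketch only (not the specific method of any paper listed here), one common way to realise this alignment is to score each detected region's visual feature against text embeddings of class names by cosine similarity; the function name, dimensions, and toy data below are all hypothetical:

```python
import numpy as np

def cosine_align(region_feats, text_embeds):
    """Score each detected region against every class-name embedding.

    region_feats: (R, D) visual features; text_embeds: (C, D) text embeddings.
    Returns an (R, C) matrix of cosine similarities; the argmax over C gives
    each region's predicted class, which may be one never seen in training.
    """
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    return r @ t.T

# Toy example: 2 regions, 3 class names, feature dimension 4.
rng = np.random.default_rng(0)
regions = rng.normal(size=(2, 4))
classes = rng.normal(size=(3, 4))
scores = cosine_align(regions, classes)
pred = scores.argmax(axis=1)  # predicted class index per region
```

Because both sides are L2-normalised, ranking by dot product is equivalent to ranking by cosine similarity, which is the usual choice for this kind of joint embedding space.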
The promise of self-supervised learning (SSL) is to leverage large amounts of unlabeled data to solve complex tasks.
Much of the recent progress in 3D vision has been driven by the development of specialized architectures that incorporate geometrical inductive biases.
In this work we address a clear limitation of the vanilla coarse-to-fine approach -- that it relies on a heuristic and is not trained end-to-end for the task at hand.
In particular, we explore how best to combine the modalities, such that fine-grained representations of the visual and audio modalities can be maintained, whilst also integrating text into a common embedding.
The objective of this work is to learn a compact embedding of a set of descriptors that is suitable for efficient retrieval and ranking, whilst maintaining discriminability of the individual descriptors.
The objective of this paper is to separate a video into its natural layers, and to control which of the separated layers to attend to.
We tackle the problem of object discovery, where objects in a given input image are segmented, and the system is trained without using any direct supervision whatsoever.
Second, we demonstrate that the model can be trained effectively from weak supervision in the form of matching and non-matching image pairs, without the need for costly manual annotation of point-to-point correspondences.
Ranked #2 on Semantic Correspondence on PF-PASCAL (PCK (weak) metric)
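To illustrate how pair-level weak supervision can replace point-level annotation, here is a minimal sketch of a standard contrastive (hinge) loss on image-pair embeddings; the function name, margin value, and toy data are illustrative assumptions, not the paper's exact objective:

```python
import numpy as np

def pairwise_contrastive_loss(f_a, f_b, match, margin=1.0):
    """Contrastive loss from pair-level labels only.

    f_a, f_b: (N, D) embeddings of the two images in each pair.
    match: (N,) array, 1.0 for matching pairs, 0.0 for non-matching pairs.
    Matching pairs are pulled together; non-matching pairs are pushed
    apart up to `margin`. No point-to-point correspondences are needed.
    """
    d = np.linalg.norm(f_a - f_b, axis=1)          # pairwise distances
    pos = match * d ** 2                            # attract matching pairs
    neg = (1 - match) * np.maximum(0.0, margin - d) ** 2  # repel the rest
    return (pos + neg).mean()

# Toy example: one matching pair (identical), one non-matching pair.
f_a = np.array([[1.0, 0.0], [0.0, 1.0]])
f_b = np.array([[1.0, 0.0], [0.5, 1.0]])
match = np.array([1.0, 0.0])
loss = pairwise_contrastive_loss(f_a, f_b, match)
```

In the toy example the matching pair contributes zero loss and the non-matching pair, at distance 0.5 inside the margin, contributes (1 - 0.5)^2 = 0.25, giving a mean loss of 0.125.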
The objective of this paper is to learn a compact representation of image sets for template-based face recognition.
Ranked #3 on Face Verification on IJB-A
We tackle the task of semantic alignment where the goal is to compute dense semantic correspondence aligning two images depicting objects of the same category.
We make the following contributions: (i) show that audio and visual embeddings can be learnt that enable both within-mode (e.g. audio-to-audio) and between-mode retrieval; (ii) explore various architectures for the AVC task, including those for the visual stream that ingest a single image, or multiple images, or a single image and multi-frame optical flow; (iii) show that the semantic object that sounds within an image can be localized (using only the sound, no motion or flow information); and (iv) give a cautionary tale on how to avoid undesirable shortcuts in the data preparation.
We consider the question: what can be learnt by looking at and listening to a large number of unlabelled videos?
Ranked #21 on Audio Classification on ESC-50
We address the problem of determining correspondences between two images in agreement with a geometric model such as an affine or thin-plate spline transformation, and estimating its parameters.
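As a hedged sketch of the geometric-model part only (the parameter-estimation step, not the correspondence network itself), an affine transformation can be fitted to a set of putative point correspondences by linear least squares; the function name and toy data below are assumptions for illustration:

```python
import numpy as np

def fit_affine(src, dst):
    """Estimate the 2x3 affine matrix A minimizing ||[src; 1] A^T - dst||^2.

    src, dst: (N, 2) corresponding points, N >= 3 and non-collinear.
    Returns the (2, 3) affine matrix mapping src to dst.
    """
    n = src.shape[0]
    X = np.hstack([src, np.ones((n, 1))])        # (N, 3) homogeneous coords
    A, *_ = np.linalg.lstsq(X, dst, rcond=None)  # (3, 2) least-squares solution
    return A.T                                    # (2, 3) affine matrix

# Sanity check: recover a known affine transform from exact correspondences.
rng = np.random.default_rng(1)
src = rng.uniform(size=(10, 2))
A_true = np.array([[1.2, -0.3, 0.5],
                   [0.1, 0.9, -0.2]])
dst = src @ A_true[:, :2].T + A_true[:, 2]
A_est = fit_affine(src, dst)
```

A thin-plate spline model would replace the 6-parameter affine matrix with a larger parameter vector, but the same least-squares fitting pattern applies.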
The proposed approach proceeds by finding a linear transformation of the data that effectively reduces the minimization of the pairwise distortions to the minimization of individual reconstruction errors.
We tackle the problem of large-scale visual place recognition, where the task is to quickly and accurately recognize the location of a given query photograph.
Ranked #3 on Visual Place Recognition on Mid-Atlantic Ridge
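The retrieval step common to place recognition pipelines of this kind can be sketched as nearest-neighbour ranking over L2-normalised global descriptors; this is a generic illustration with hypothetical names, not the specific architecture of the paper above:

```python
import numpy as np

def rank_database(query_desc, db_descs):
    """Rank database images by similarity to the query descriptor.

    query_desc: (D,) global descriptor of the query image.
    db_descs: (N, D) global descriptors of the database images.
    With L2-normalised descriptors, dot product equals cosine similarity,
    so the first index in the returned order is the nearest neighbour.
    """
    q = query_desc / np.linalg.norm(query_desc)
    db = db_descs / np.linalg.norm(db_descs, axis=1, keepdims=True)
    scores = db @ q
    return np.argsort(-scores)  # indices, best match first

# Toy example: database of 3 descriptors, query closest to row 1.
db = np.array([[0.1, 0.2],
               [1.0, 0.0],
               [0.0, 1.0]])
order = rank_database(np.array([2.0, 0.0]), db)
```

At scale, the exhaustive dot product here would typically be replaced by an approximate nearest-neighbour index, but the normalise-then-rank structure is the same.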
The objective of this work is object retrieval in large-scale image datasets, where the object is specified by an image query and retrieval should be immediate at run time in the manner of Video Google.
Ranked #6 on Image Matching on IMC PhotoTourism (using extra training data)