This paper proposes a simple alternative: encoding maximum separation as an inductive bias in the network by adding one fixed matrix multiplication before computing the softmax activations.
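As a hedged illustration of that fixed multiplication, the sketch below builds maximally separated class vectors from a regular simplex embedded in R^C and applies them as a single frozen matrix before the softmax; all shapes are illustrative, and the paper's own construction may use an equivalent lower-dimensional variant.

```python
import torch
import torch.nn.functional as F

def max_separation_matrix(num_classes: int) -> torch.Tensor:
    """Regular-simplex class vectors: every pair has cosine -1/(C-1),
    the maximum possible separation for C unit vectors."""
    eye = torch.eye(num_classes)
    centered = eye - eye.mean(dim=0, keepdim=True)  # subtract the centroid
    return F.normalize(centered, dim=1)             # one unit vector per class

C = 10
P = max_separation_matrix(C)       # fixed matrix, excluded from training
features = torch.randn(32, C)      # backbone output (illustrative dimensions)
logits = features @ P.t()          # the one fixed matrix multiplication
probs = logits.softmax(dim=1)      # softmax over maximally separated logits
```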
The goal of this paper is to bypass the need for labelled examples in few-shot video understanding at run time.
For image segmentation, the current standard is to perform pixel-level optimization and inference in Euclidean output embedding spaces through linear hyperplanes.
For universal action models, we seek a hyperspherical optimal transport mapping from unseen action prototypes to the set of all projected test videos.
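One way to realize such a mapping is entropy-regularized optimal transport solved with Sinkhorn iterations over a cosine-distance cost on the sphere. The sketch below is a minimal, hypothetical version, not the paper's exact formulation; the tensors `A`, `V` and the regularization `eps` are illustrative.

```python
import torch
import torch.nn.functional as F

def sinkhorn(cost, eps=0.05, iters=200):
    """Entropy-regularized optimal transport with uniform marginals."""
    n, m = cost.shape
    a, b = torch.full((n,), 1.0 / n), torch.full((m,), 1.0 / m)
    K = torch.exp(-cost / eps)             # Gibbs kernel
    v = torch.ones(m)
    for _ in range(iters):                 # alternate scaling updates
        u = a / (K @ v)
        v = b / (K.t() @ u)
    return u[:, None] * K * v[None, :]     # transport plan

# hypothetical tensors: rows of A are unseen action prototypes, rows of V
# are projected test videos; both L2-normalized onto the hypersphere
A = F.normalize(torch.randn(20, 128), dim=1)
V = F.normalize(torch.randn(500, 128), dim=1)
plan = sinkhorn(1.0 - A @ V.t())           # cost = cosine distance on the sphere
video_to_action = plan.argmax(dim=0)       # assign each test video an action
```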
This paper investigates the problem of zero-shot action recognition, in the setting where no training videos with seen actions are available.
Video relation detection forms a new and challenging problem in computer vision, where subjects and objects need to be localized spatio-temporally and a predicate label needs to be assigned if and only if there is an interaction between the two.
For universal object models, we outline a weighted transport variant from unseen action embeddings to object embeddings directly.
In this paper, we find that all existing approaches share a common limitation: reconstruction breaks down in and around the high-frequency parts of CT images.
The deep image prior showed that a randomly initialized network with a suitable architecture can be trained to solve inverse imaging problems by simply optimizing its parameters to reconstruct a single degraded image.
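A minimal sketch of that recipe, assuming a denoising setup with a toy convolutional network rather than the paper's architecture: fit a randomly initialized network to the single degraded image, relying on early stopping so the network captures structure before it reproduces the noise.

```python
import torch
import torch.nn as nn

net = nn.Sequential(                       # toy stand-in, not the paper's network
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 3, 3, padding=1),
)
z = torch.randn(1, 32, 256, 256)           # fixed random input code
degraded = torch.rand(1, 3, 256, 256)      # placeholder single degraded image
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):                   # early stopping acts as the regularizer
    opt.zero_grad()
    loss = ((net(z) - degraded) ** 2).mean()
    loss.backward()
    opt.step()
restored = net(z).detach()                 # read out before overfitting to noise
```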
This work strives for the classification and localization of human actions in videos, without the need for any labeled video training examples.
In this paper, we define data augmentation between point clouds as a shortest path linear interpolation.
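Shortest-path interpolation between two clouds can be sketched as follows: solve an optimal one-to-one assignment between the points (the earth mover's distance matching for equally weighted clouds of the same size), then linearly interpolate each matched pair. The function name and mixing ratio below are illustrative, not the paper's API.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def interpolate_clouds(p, q, lam):
    """Mix two (N, 3) point clouds along the shortest path: optimal
    one-to-one assignment, then linear interpolation of matched pairs."""
    cost = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # pairwise distances
    rows, cols = linear_sum_assignment(cost)                       # optimal matching
    return (1 - lam) * p[rows] + lam * q[cols]

p = np.random.randn(1024, 3)   # illustrative clouds with equal point counts
q = np.random.randn(1024, 3)
mixed = interpolate_clouds(p, q, lam=0.4)
```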
The start and end of an action in a long untrimmed video are determined based on just a handful of trimmed video examples containing the same action, without knowing their common class label.
This paper introduces hyperspherical prototype networks, which unify classification and regression with prototypes on hyperspherical output spaces.
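A minimal sketch of the classification side, assuming PyTorch: fix prototypes on the unit hypersphere (the paper positions them with a separation objective; random ones suffice to illustrate the mechanics) and pull embeddings toward their class prototype with a squared cosine-distance loss.

```python
import torch
import torch.nn.functional as F

# fixed prototypes on the unit hypersphere, one per class; random placement
# here is illustrative only
protos = F.normalize(torch.randn(10, 64), dim=1)

def prototype_loss(embeddings, targets):
    """Squared cosine distance between outputs and their class prototypes."""
    z = F.normalize(embeddings, dim=1)              # project outputs to the sphere
    return (1.0 - (z * protos[targets]).sum(dim=1)).pow(2).mean()

z = torch.randn(32, 64, requires_grad=True)         # backbone embeddings
loss = prototype_loss(z, torch.randint(0, 10, (32,)))
loss.backward()
```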
Rather than disconnecting the spatio-temporal learning from the training, we propose Spatio-Temporal Instance Learning, which enables action localization directly from box proposals in video frames.
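As a hedged sketch of the instance-learning idea under video-level supervision only: score every box proposal, pool the proposal scores into a video-level prediction, and supervise with the video label. The max-pooling and shapes below are illustrative simplifications, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def instance_learning_loss(proposal_scores, video_label):
    """proposal_scores: (num_proposals, num_classes) logits for one video;
    the best-scoring proposal speaks for the whole video."""
    video_logits, _ = proposal_scores.max(dim=0)    # pool proposals to video level
    return F.cross_entropy(video_logits[None], video_label[None])

scores = torch.randn(300, 21, requires_grad=True)   # illustrative proposal logits
loss = instance_learning_loss(scores, torch.tensor(5))
loss.backward()
```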
Experimental evaluation on three action localization datasets shows our pointly-supervised approach (i) is as effective as traditional box-supervision at a fraction of the annotation cost, (ii) is robust to sparse and noisy point annotations, (iii) benefits from pseudo-points during inference, and (iv) outperforms recent weakly-supervised alternatives.
Action localization and classification experiments on four contemporary action video datasets support our proposal.
Rather than annotating boxes, we propose to annotate actions in video with points on a sparse subset of frames only.
To deal with the problems of over-specific classes and classes with few images, we introduce a bottom-up and top-down approach for reorganization of the ImageNet hierarchy based on all its 21,814 classes and more than 14 million images.
Experimental evaluation on the Video Water Database and the DynTex database indicates the effectiveness of the proposed algorithm, outperforming multiple algorithms for dynamic texture recognition and material recognition by ca.