Video summarization aims at selecting the parts of a video that narrate a story as closely as possible to the original one.
The views are organized in pairs, such that they are either positive, encoding different views of the same object, or negative, corresponding to views of different objects.
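A minimal sketch of how such positive/negative view pairs might be assembled. The `make_pairs` helper and its dictionary input format are illustrative assumptions, not the actual pipeline:

```python
import random

def make_pairs(views_by_object):
    """Build (view_a, view_b, label) tuples: label 1 for a positive pair
    (two views of the same object), 0 for a negative pair (views of
    different objects). Input maps object id -> list of its views."""
    pairs = []
    ids = list(views_by_object)
    for obj_id in ids:
        views = views_by_object[obj_id]
        # positive pair: two distinct views of the same object
        if len(views) >= 2:
            a, b = random.sample(views, 2)
            pairs.append((a, b, 1))
        # negative pair: one view of this object, one view of another
        other = random.choice([i for i in ids if i != obj_id])
        pairs.append((random.choice(views),
                      random.choice(views_by_object[other]), 0))
    return pairs
```

In practice such pairs would feed a contrastive or metric-learning loss; the sampling strategy here is the simplest possible choice.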
iii) We evaluate Align-Former on HICO-DET and V-COCO, and show that Align-Former outperforms existing image-level supervised HOI detectors by a large margin (a 4.71% mAP improvement, from 16.14% to 20.85%, on HICO-DET).
To demonstrate that wiggling the weights consistently improves classification, we take a standard network and modify it into a transform-augmented network.
We aim for accurate scale-equivariant convolutional neural networks (SE-CNNs) applicable for problems where high granularity of scale and small kernel sizes are required.
We focus on building robustness in the convolutions of neural visual classifiers, especially against natural perturbations like elastic deformations, occlusions and Gaussian noise.
On CIFAR-10 and STL-10, natural-perturbation training even improves the accuracy on clean data and reaches state-of-the-art performance.
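A minimal sketch of what applying such natural perturbations during training could look like. The `perturb` function, its parameters, and the choice of Gaussian noise plus a square occlusion are illustrative assumptions (elastic deformations are omitted for brevity):

```python
import numpy as np

def perturb(img, rng, noise_std=0.05, occlude_frac=0.2):
    """Apply simple natural perturbations to an HxWxC float image in
    [0, 1]: additive Gaussian noise plus a random square occlusion."""
    out = img + rng.normal(0.0, noise_std, img.shape)
    h, w = img.shape[:2]
    oh, ow = int(h * occlude_frac), int(w * occlude_frac)
    y, x = rng.integers(0, h - oh), rng.integers(0, w - ow)
    out[y:y + oh, x:x + ow] = 0.0  # zero out the occluded patch
    return np.clip(out, 0.0, 1.0)
```

Such a function would be called on each training image before the forward pass, analogous to standard data augmentation.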
Our experiments show that SSC leads to a significant increase in interaction recognition performance, while using far fewer parameters.
In this paper we aim to explore the general robustness of neural network classifiers by utilizing adversarial as well as natural perturbations.
We develop the theory for scale-equivariant Siamese trackers, and provide a simple recipe for how to make a wide range of existing trackers scale-equivariant.
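A minimal 1-D sketch of the core idea behind scale-equivariant convolution: convolving with rescaled copies of one kernel adds a scale axis to the output, so rescaling the input shifts responses along that axis rather than destroying them. The helper names, the linear-interpolation rescaling, and the scale set are illustrative assumptions, not the recipe from any specific tracker:

```python
import numpy as np

def rescale_kernel(k, scale):
    """Resample a 1-D kernel to a new size by linear interpolation,
    renormalizing so its sum is preserved."""
    n = max(3, int(round(len(k) * scale)))
    xs = np.linspace(0, len(k) - 1, n)
    out = np.interp(xs, np.arange(len(k)), k)
    s = out.sum()
    return out * (k.sum() / s) if s != 0 else out

def scale_conv(signal, kernel, scales=(0.5, 1.0, 2.0)):
    """Convolve a signal with rescaled copies of one kernel, producing
    one response per scale (a discretized scale axis)."""
    return [np.convolve(signal, rescale_kernel(kernel, s), mode="same")
            for s in scales]
```

Real SE-CNNs use carefully constructed steerable or interpolated filter bases in 2-D; this sketch only conveys the multi-scale filter-bank structure.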
The effectiveness of Convolutional Neural Networks (CNNs) has been substantially attributed to their built-in property of translation equivariance.
The vulnerability of deep computer vision systems to imperceptible, carefully crafted noise has raised questions regarding the robustness of their decisions.
We introduce the OxUvA dataset and benchmark for evaluating single-object tracking algorithms.
An analysis of i-RevNets' learned representations suggests an alternative explanation for the success of deep networks: a progressive contraction and linear separation with depth.
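The building block that makes such fully invertible networks possible is the additive coupling layer, which discards no information between layers. A minimal sketch, where the residual function `f` and the function names are illustrative assumptions:

```python
import numpy as np

def f(x):
    # any residual function works here, even a non-invertible one
    return np.tanh(x)

def coupling_forward(x1, x2):
    """Additive coupling: (x1, x2) -> (x2, x1 + f(x2)).
    Invertible by construction, regardless of f."""
    return x2, x1 + f(x2)

def coupling_inverse(y1, y2):
    """Exact inverse: recover (x1, x2) from (y1, y2)."""
    return y2 - f(y1), y1
```

Stacking many such blocks yields a bijective network whose intermediate representations can always be mapped back to the input.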
From these neural measurements and the contrast statistics of the natural image stimuli, we derive an across-subject Weibull response model.
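A Weibull response model of this kind is typically a saturating function of contrast. A minimal sketch of one standard parameterization; the exact functional form and parameter names used in the study are assumptions here:

```python
import numpy as np

def weibull_response(c, beta, gamma):
    """Weibull-shaped contrast response: r(c) = 1 - exp(-(c / beta)^gamma),
    with beta the contrast scale and gamma the shape parameter."""
    return 1.0 - np.exp(-(c / beta) ** gamma)
```

Fitting `beta` and `gamma` to measured responses would then characterize each subject's (or the pooled, across-subject) contrast sensitivity.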
This fundamental insight opens new directions in the assessment of feature similarity, with projected improvements in object and scene recognition algorithms.
We propose a method for reconstruction of human brain states directly from functional neuroimaging data.