Most self-supervised video representation learning approaches focus on action recognition.
We propose SSRT, a novel one-stage Semantic and Spatial Refined Transformer, to solve the Human-Object Interaction detection task, which requires localizing humans and objects and predicting their interactions.
This work studies feature representations for dense label propagation in video, with a focus on recently proposed methods that learn video correspondence using self-supervised signals such as colorization or temporal cycle consistency.
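For context, these correspondence methods are typically evaluated by propagating labels from a reference frame to later frames through a softmax over feature affinities. Below is a minimal sketch of that protocol; the top-k restriction, temperature, and tensor shapes are illustrative choices rather than any one paper's settings.

```python
import torch
import torch.nn.functional as F

def propagate_labels(feat_ref, feat_tgt, labels_ref, temperature=0.07, topk=10):
    """Copy per-pixel labels from a reference frame to a target frame using
    feature-space affinity (illustrative label-propagation protocol).

    feat_ref, feat_tgt: (C, H, W) feature maps from the learned encoder
    labels_ref:         (K, H, W) one-hot (or soft) label maps
    """
    C, H, W = feat_ref.shape
    f_ref = F.normalize(feat_ref.reshape(C, -1), dim=0)  # (C, HW)
    f_tgt = F.normalize(feat_tgt.reshape(C, -1), dim=0)  # (C, HW)

    affinity = f_tgt.t() @ f_ref / temperature           # target-to-reference
    vals, idx = affinity.topk(topk, dim=1)               # k best matches per pixel
    weights = F.softmax(vals, dim=1)                     # (HW, k)

    lab = labels_ref.reshape(labels_ref.shape[0], -1)    # (K, HW)
    out = (lab[:, idx] * weights.unsqueeze(0)).sum(-1)   # weighted label copy
    return out.reshape(-1, H, W)
```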
Next, to tackle harder tracking cases, we mine hard examples across an unlabeled pool of real videos with a tracker trained on our hallucinated video data.
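Schematically, the mining loop looks like the sketch below; `score_fn` is an assumed self-supervised proxy for tracking quality (e.g. forward-backward consistency) standing in for the paper's actual criterion.

```python
def mine_hard_examples(videos, tracker, score_fn, thresh=0.3):
    """Collect unlabeled real videos the current tracker handles poorly.

    Illustrative sketch: `tracker` is the model trained on hallucinated
    video data, and `score_fn` is an assumed unsupervised proxy score.
    """
    hard = []
    for video in videos:
        tracks = tracker(video)
        if score_fn(video, tracks) < thresh:  # low proxy score => hard example
            hard.append(video)
    return hard
```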
We present MoDist as a novel method to explicitly distill motion information into self-supervised video representations.
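One way to picture motion distillation is a contrastive objective that pulls an RGB clip's embedding toward the embedding of its optical-flow counterpart. The InfoNCE sketch below illustrates that alignment only; MoDist's actual losses, architectures, and sampling are more involved.

```python
import torch
import torch.nn.functional as F

def motion_distill_loss(rgb_emb, flow_emb, temperature=0.1):
    """Generic InfoNCE alignment between paired RGB and flow embeddings
    (a sketch in the spirit of motion distillation, not MoDist's exact loss).

    rgb_emb, flow_emb: (N, D) embeddings for N paired clips.
    """
    rgb = F.normalize(rgb_emb, dim=1)
    flow = F.normalize(flow_emb, dim=1)
    logits = rgb @ flow.t() / temperature                       # (N, N) similarities
    targets = torch.arange(rgb.size(0), device=logits.device)   # diagonal positives
    return F.cross_entropy(logits, targets)
```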
We propose TubeR: a simple solution for spatio-temporal video action detection.
In this work, we focus on improving the inference efficiency of current action recognition backbones on trimmed videos, and show that an action model can still cover the informative regions of a video while dropping non-informative features.
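As a simplified picture of what dropping non-informative features can look like, here is a generic saliency-based selection; the norm criterion and keep ratio are assumptions for illustration, not the paper's rule.

```python
import torch

def drop_uninformative(tokens, keep_ratio=0.5):
    """Keep only the most salient spatio-temporal features of a clip before
    the costly later layers (illustrative selection rule).

    tokens: (N, D) features for N spatio-temporal locations.
    """
    k = max(1, int(keep_ratio * tokens.size(0)))
    saliency = tokens.norm(dim=1)   # cheap informativeness proxy (assumed)
    idx = saliency.topk(k).indices
    return tokens[idx]
```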
Multi-object tracking systems often combine a detector, a short-term linker, a re-identification feature extractor, and a solver that takes the output of these separate components and makes a final prediction.
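That conventional multi-stage design reads roughly like the following skeleton, where every component is a stand-in callable rather than a concrete implementation.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    box: tuple      # (x1, y1, x2, y2)
    score: float
    feature: list   # re-identification embedding

def track_video(frames, detector, linker, reid, solver):
    """Hypothetical multi-stage tracking pipeline: detect per frame, link
    short-term, then let a global solver stitch tracklets into tracks."""
    tracklets = []
    for frame in frames:
        dets = [Detection(b, s, reid(frame, b)) for b, s in detector(frame)]
        tracklets = linker(tracklets, dets)  # short-term association
    return solver(tracklets)                 # final global prediction
```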
Our approach consists of three components: (i) a Clip Tracking Network that performs body joint detection and tracking simultaneously on small video clips; (ii) a Video Tracking Pipeline that merges the fixed-length tracklets produced by the Clip Tracking Network into arbitrary-length tracks; and (iii) a Spatial-Temporal Merging procedure that refines the joint locations based on spatial and temporal smoothing terms.
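To give a flavor of the third step, here is a minimal temporal-smoothing pass over a track's joint locations; the paper's spatial and temporal smoothing terms are more involved than this exponential moving average.

```python
import numpy as np

def smooth_track(joints, weight=0.7):
    """Temporally smooth 2D joint locations along one track (toy stand-in
    for the Spatial-Temporal Merging refinement).

    joints: (T, J, 2) array of per-frame joint coordinates.
    """
    out = joints.astype(float).copy()
    for t in range(1, len(out)):
        # Blend the current estimate with the smoothed previous frame.
        out[t] = weight * out[t] + (1.0 - weight) * out[t - 1]
    return out
```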
Our results show that (i) mistakes on background are substantial, accounting for 18-49% of the total error, (ii) models do not generalize well to different kinds of backgrounds and perform poorly on images containing only background, and (iii) models make many more mistakes than the standard Mean Absolute Error (MAE) metric captures, as over-counting on background considerably compensates for misses on foreground.
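Point (iii) is easiest to see with a toy decomposition: predicted density mass on background can offset missed mass on foreground, leaving MAE small while both error sources are large. The density-map and mask conventions below are illustrative.

```python
import numpy as np

def error_decomposition(pred_density, gt_density, fg_mask):
    """Split a counting model's error into foreground and background parts
    (illustrative decomposition; the paper's exact protocol may differ).

    pred_density, gt_density: (H, W) density maps; sums give the counts.
    fg_mask: (H, W) boolean mask of annotated foreground regions.
    """
    fg_err = (pred_density - gt_density)[fg_mask].sum()  # negative => misses
    bg_err = pred_density[~fg_mask].sum()     # any mass on background is error
    mae = abs(pred_density.sum() - gt_density.sum())  # the two can cancel here
    return {"foreground": fg_err, "background": bg_err, "mae": mae}
```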
In this work we aim to improve the representation capacity of the network, but rather than altering the backbone, we improve its last layers, where changes have little impact on computational cost.
We present a semantic part detection approach that effectively leverages object information, using the object's appearance and its class as indicators of what parts to expect.
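A toy illustration of that idea: the object class acts as a prior on which parts are plausible. The hard-coded class-to-parts mapping below is made up for illustration; the actual approach derives such expectations from the object's appearance and class.

```python
# Hypothetical class-to-parts prior, for illustration only.
EXPECTED_PARTS = {
    "car":  {"wheel", "door", "headlight"},
    "bird": {"beak", "wing", "tail"},
}

def filter_part_detections(object_class, part_detections):
    """Keep only part detections that are plausible for this object class."""
    allowed = EXPECTED_PARTS.get(object_class, set())
    return [d for d in part_detections if d["label"] in allowed]
```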
We evaluate our models on the challenging PASCAL-Part dataset and show how their performance increases at every step of the learning process, with the final models more than doubling the performance of directly training on images retrieved by querying for part names (from 12.9 to 27.2 AP).
We also investigate the other direction: we determine which semantic parts are the most discriminative and whether they correspond to the parts that emerge in the network.
We present Context Forest (ConF), a technique for predicting properties of the objects in an image based on the image's global appearance.
We formulate joint calibration as a constrained optimization problem and devise an efficient optimization algorithm to find its global optimum.
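To make the formulation style concrete, here is a generic constrained-optimization sketch with SciPy: an objective is minimized subject to explicit constraints. The calibration variables, objective, and solver in the paper differ; this only shows the pattern.

```python
import numpy as np
from scipy.optimize import minimize

def objective(theta, scores, labels):
    # theta = (a, b): affine recalibration of raw scores (assumed toy setup).
    p = 1.0 / (1.0 + np.exp(-(theta[0] * scores + theta[1])))
    eps = 1e-9
    return -np.mean(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps))

scores = np.random.randn(100)
labels = (scores + 0.3 * np.random.randn(100) > 0).astype(float)
result = minimize(objective, x0=np.array([1.0, 0.0]), args=(scores, labels),
                  constraints=[{"type": "ineq", "fun": lambda t: t[0]}])  # a >= 0
print(result.x)  # calibrated slope and offset
```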