Video object detection is the task of detecting objects in videos rather than in still images.
In this paper, we address a novel task, namely weakly-supervised spatio-temporal grounding of natural sentences in video.
We introduce a systematic framework for quantifying the robustness of classifiers to naturally occurring perturbations of images found in videos.
The latency reduction by this hard attention mechanism comes at the cost of degraded accuracy.
Two new models, RetinaNet-Double and RetinaNet-Flow, are proposed, based respectively on concatenating a target frame with a preceding frame, and on concatenating the optical flow with the target frame.
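The frame-pairing idea above can be illustrated with a minimal sketch: two aligned frames are stacked channel-wise to form a single multi-channel input. The helper name and NumPy-based preprocessing are assumptions for illustration, not the paper's actual pipeline.

```python
import numpy as np

def double_frame_input(target_frame, preceding_frame):
    """Channel-wise concatenation of a target frame with its predecessor,
    in the spirit of a RetinaNet-Double-style input.
    Hypothetical helper; the exact preprocessing is not specified here."""
    assert target_frame.shape == preceding_frame.shape
    return np.concatenate([target_frame, preceding_frame], axis=-1)

# Two RGB frames of shape (H, W, 3) become one (H, W, 6) input tensor.
target = np.zeros((8, 8, 3), dtype=np.float32)
preceding = np.ones((8, 8, 3), dtype=np.float32)
stacked = double_frame_input(target, preceding)
```

A RetinaNet-Flow-style variant would substitute a two-channel optical-flow field for the preceding frame, yielding a five-channel input under the same concatenation scheme.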
Instead of relying on optical flow, this paper proposes a novel module called Progressive Sparse Local Attention (PSLA), which establishes spatial correspondence between features across frames within a local region using progressively sparser strides, and uses the correspondence to propagate features.
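The "progressively sparser strides" pattern can be sketched by enumerating sampling offsets around a center position: dense near the center, with the sampling stride growing at larger radii. The function below is a hypothetical illustration of that sampling layout only, not the PSLA module itself.

```python
def psla_like_offsets(max_radius=4):
    """Sampling offsets that are dense near the center and progressively
    sparser farther out: on the ring at distance r, points are taken with
    stride r. Hypothetical sketch of the PSLA sampling pattern."""
    offsets = {(0, 0)}
    for r in range(1, max_radius + 1):
        stride = r  # stride grows with distance -> progressively sparser
        for dy in range(-r, r + 1, stride):
            for dx in range(-r, r + 1, stride):
                if max(abs(dy), abs(dx)) == r:  # keep only the ring at r
                    offsets.add((dy, dx))
    return sorted(offsets)
```

With this layout, attention between corresponding features in adjacent frames is restricted to these offsets rather than a full dense window, which is what keeps the correspondence computation sparse.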
By introducing a parameterized canonical model for correlated data, and by defining the corresponding operations required for CNN training and inference, we show that SCNN can process multiple frames of correlated images effectively, achieving a significant speedup over existing CNN models.
In vision-enabled autonomous systems such as robots and autonomous cars, video object detection plays a crucial role, and both its speed and accuracy are important for reliable operation.
Adversarial examples have been demonstrated to threaten many computer vision tasks including object detection.
Accurate detection and tracking of objects is vital for effective video understanding.