Objects do not disappear: Video object detection by single-frame object location anticipation

Objects in videos are typically characterized by continuous smooth motion. We exploit continuous smooth motion in three ways. 1) Improved accuracy by using object motion as an additional source of supervision, which we obtain by anticipating object locations from a static keyframe. 2) Improved efficiency by doing the expensive feature computations on only a small subset of all frames. Because neighboring video frames are often redundant, we compute features only for a single static keyframe and predict object locations in subsequent frames. 3) Reduced annotation cost, where we annotate only the keyframe and use smooth pseudo-motion between keyframes. We demonstrate computational efficiency, annotation efficiency, and improved mean average precision compared to the state of the art on four datasets: ImageNet VID, EPIC-KITCHENS-55, YouTube-BoundingBoxes, and Waymo Open Dataset. Our source code is available at https://github.com/L-KID/Videoobject-detection-by-location-anticipation.
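
The third point, smooth pseudo-motion between annotated keyframes, can be illustrated with a minimal sketch. It assumes that pseudo-motion is approximated by linear interpolation of box coordinates between two annotated keyframes; the abstract does not specify the interpolation scheme, so this is an assumption of the sketch and not necessarily the authors' method. The names `Box` and `interpolate_boxes` are hypothetical and do not come from the paper or its repository.

```python
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def interpolate_boxes(start: Box, end: Box, num_between: int) -> List[Box]:
    """Return boxes for the frames strictly between two annotated keyframes.

    Assumption: motion between keyframes is smooth enough to be
    approximated by linear interpolation of each box coordinate.
    """
    if num_between < 0:
        raise ValueError("num_between must be non-negative")
    steps = num_between + 1  # number of intervals between the two keyframes
    boxes: List[Box] = []
    for i in range(1, steps):
        t = i / steps  # fraction of the way from start to end
        boxes.append(tuple((1 - t) * s + t * e for s, e in zip(start, end)))
    return boxes


# Two annotated keyframes with three unannotated frames in between.
pseudo_labels = interpolate_boxes((0, 0, 10, 10), (4, 8, 14, 18), num_between=3)
```

Under this assumption, only the two keyframes need manual annotation; the intermediate frames receive interpolated pseudo-labels that could serve as supervision targets.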

ICCV 2023
Task                    Dataset             Model                        Metric Name  Metric Value  Global Rank
Video Object Detection  EPIC-KITCHENS-55    Ours (Faster RCNN)           mAP@.5       41.7          #1
Video Object Detection  ImageNet VID        Ours (Faster RCNN + R101)    MAP          87.2          #8
Video Object Detection  ImageNet VID        Ours (Def. DETR + R101)      MAP          87.9          #6
Video Object Detection  ImageNet VID        Ours (Def. DETR + SwinB)     MAP          91.3          #2
Video Object Detection  Waymo Open Dataset  (not listed)                 AP           59.28         #1
Video Object Detection  YT-BB               (not listed)                 mAP          59.8          #1

