The accuracy of detection suffers from degenerated object appearances in videos, e. g., motion blur, video defocus, rare poses, etc.
#2 best model for Video Object Detection on ImageNet VID
Similarly, the output feature maps of a convolution layer can also be seen as a mixture of information at different frequencies.
#6 best model for Image Classification on ImageNet
The micro-batch training setting is hard because small batch sizes are not enough for training networks with Batch Normalization (BN), while other normalization methods that do not rely on batch knowledge still have difficulty matching the performances of BN in large-batch training.
The explosive growth in video streaming gives rise to challenges on efficiently extracting the spatial-temporal information to perform video understanding at low computation cost.
Recent years have seen tremendous progress in still-image segmentation; however the na\"ive application of these state-of-the-art algorithms to every video frame requires considerable computation and ignores the temporal continuity inherent in video.