What If We Do Not Have Multiple Videos of the Same Action? -- Video Action Localization Using Web Images

CVPR 2016 · Waqas Sultani, Mubarak Shah

This paper tackles the problem of spatio-temporal action localization in a video without assuming the availability of multiple videos or any prior annotations. The action is localized using images downloaded from the internet by querying the action name. Given web images, we first mitigate image noise using a random walk framework and avoid distracting backgrounds within images using image action proposals. Then, given a video, we generate multiple spatio-temporal action proposals. We suppress camera- and background-generated proposals by exploiting optical flow gradients within each proposal. To obtain the most action-representative proposal, we propose to reconstruct action proposals in the video by leveraging the action proposals in the images. Moreover, we preserve the temporal smoothness of the video by introducing consensus regularization, which enforces consistency among the coefficient vectors of multiple frames within a proposal. Finally, the video proposal that has the lowest reconstruction cost and is motion salient is taken as the final action localization. Our extensive experiments on trimmed as well as untrimmed datasets validate the effectiveness of the proposed approach.
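The reconstruction and selection steps described in the abstract can be illustrated with a small NumPy sketch. The abstract does not state the objective, so everything here is an assumption made for illustration rather than the authors' formulation: each frame of a video proposal is reconstructed as a linear combination of image-proposal features, a ridge penalty `lam` keeps the coefficients small, and the consensus term `gamma` penalizes each frame's coefficient vector for deviating from the mean coefficient vector of the proposal. The function names, the parameters `lam`, `gamma` and `min_motion`, and the precomputed `motion_scores` are all hypothetical.

```python
import numpy as np


def consensus_reconstruction_cost(X, D, lam=0.1, gamma=1.0):
    """Per-frame reconstruction cost of one video proposal (illustrative).

    X: (d, T) features of the T frames of a video proposal.
    D: (d, n) features of n image action proposals (the dictionary).
    Minimizes, over coefficients C of shape (n, T), the ASSUMED objective
        ||X - D C||^2 + lam * ||C||^2 + gamma * ||C - mean_t(C)||^2.
    """
    n = D.shape[1]
    G = D.T @ D
    identity = np.eye(n)

    # Split frame features into their mean and per-frame deviations.
    x_mean = X.mean(axis=1, keepdims=True)
    X_dev = X - x_mean

    # The objective is quadratic and decouples: the shared (consensus)
    # coefficients see only the ridge penalty, while per-frame deviations
    # are additionally penalized by gamma.
    c_mean = np.linalg.solve(G + lam * identity, D.T @ x_mean)
    C_dev = np.linalg.solve(G + (lam + gamma) * identity, D.T @ X_dev)
    C = c_mean + C_dev

    residual = X - D @ C
    cost = (
        (residual ** 2).sum()
        + lam * (C ** 2).sum()
        + gamma * (C_dev ** 2).sum()
    )
    return cost / X.shape[1], C


def select_proposal(proposals, D, motion_scores, min_motion=0.5, **kwargs):
    """Index of the motion-salient proposal with the lowest cost.

    motion_scores is a hypothetical per-proposal saliency value, e.g. one
    derived from optical flow gradients; it is not computed here.
    """
    costs = [consensus_reconstruction_cost(X, D, **kwargs)[0] for X in proposals]
    keep = [i for i, m in enumerate(motion_scores) if m >= min_motion]
    if not keep:
        # Fall back to all proposals if none passes the motion threshold.
        keep = list(range(len(proposals)))
    return min(keep, key=lambda i: costs[i])
```

Under these assumptions, a larger `gamma` pulls the per-frame coefficient vectors toward their common mean, which is one simple way to express consistency across frames. A proposal whose frames lie close to the span of the image-proposal features receives a low cost, and `select_proposal` returns the cheapest proposal among those that pass the motion threshold.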
