This paper proposes a two-stream flow-guided convolutional attention network
for action recognition in videos. The central idea is that optical flows, when
properly compensated for camera motion, can be used to guide attention to
the human foreground. We thus develop cross-link layers from the temporal
network (trained on flows) to the spatial network (trained on RGB frames).
These cross-link layers guide the spatial stream to pay more attention to
human foreground areas and be less affected by background clutter. We obtain
promising performance with our approach on the UCF101, HMDB51 and Hollywood2