no code implementations • 9 Nov 2017 • Yuxin Peng, Yunzhen Zhao, Junchao Zhang
Recently, researchers have generally adopted deep networks to capture static and motion information separately, which has two main limitations: (1) it ignores the coexistence relationship between spatial and temporal attention, whereas they should be jointly modelled as the spatial and temporal evolutions of the video so that discriminative video features can be extracted.
no code implementations • 23 Mar 2017 • Yunzhen Zhao, Yuxin Peng
Two 3D CNN streams are then trained individually on salient areas, one for raw frames and one for optical flow, and a separate 2D CNN is trained for raw frames on non-salient areas.
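The three-branch design above can be sketched minimally in PyTorch. This is an illustrative structure only, assuming concatenation-based fusion; the layer sizes, channel counts, and fusion step are hypothetical and not the paper's actual configuration.

```python
# Structural sketch: two 3D-CNN streams (raw frames and optical flow on
# salient areas) plus one 2D-CNN stream (raw frames on non-salient areas),
# fused by feature concatenation. All sizes are illustrative.
import torch
import torch.nn as nn

class ThreeBranchNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # 3D stream for raw frames on salient areas, input (B, 3, T, H, W)
        self.rgb3d = nn.Sequential(
            nn.Conv3d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten())
        # 3D stream for optical flow (2 channels: x/y displacement)
        self.flow3d = nn.Sequential(
            nn.Conv3d(2, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten())
        # 2D stream for raw frames on non-salient areas, input (B, 3, H, W)
        self.rgb2d = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Classify over the concatenated 8+8+8 branch features
        self.classifier = nn.Linear(8 * 3, num_classes)

    def forward(self, rgb_clip, flow_clip, rgb_frame):
        feats = torch.cat([self.rgb3d(rgb_clip),
                           self.flow3d(flow_clip),
                           self.rgb2d(rgb_frame)], dim=1)
        return self.classifier(feats)

net = ThreeBranchNet()
out = net(torch.randn(2, 3, 8, 32, 32),   # salient-area frame clip
          torch.randn(2, 2, 8, 32, 32),   # salient-area optical flow
          torch.randn(2, 3, 32, 32))      # non-salient-area frame
print(out.shape)  # torch.Size([2, 10])
```

Concatenation is the simplest late-fusion choice; the branches could equally be trained separately (as the abstract describes) and fused only at test time.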