To obtain single-frame supervision, annotators are asked to identify only a single frame within the temporal window of each action.
However, it has been observed that in current video datasets, action classes can often be recognized from a single frame of video, without any temporal information.
Our proposed late fusion of CNN- and motion-based features can further increase the mean average precision (mAP) on MED'14 from 34.95% to 38.74%.
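Late fusion of this kind combines the per-class scores of two independently trained streams rather than their intermediate features. A minimal sketch, assuming a simple weighted average of aligned score vectors (the helper `late_fuse` and the weight `alpha` are illustrative assumptions, not details from the paper):

```python
# Hypothetical sketch of late fusion: combine per-class scores from a
# CNN-based stream and a motion-based stream by weighted averaging.
# The fusion weight alpha is an assumed hyperparameter.

def late_fuse(cnn_scores, motion_scores, alpha=0.5):
    """Return the element-wise weighted average of two aligned
    per-class score lists."""
    assert len(cnn_scores) == len(motion_scores)
    return [alpha * c + (1.0 - alpha) * m
            for c, m in zip(cnn_scores, motion_scores)]

# Example: per-class scores for three event classes from each stream.
fused = late_fuse([0.8, 0.1, 0.1], [0.6, 0.3, 0.1], alpha=0.5)
print(fused)
```

In practice the fused scores would then be ranked per class to compute mAP; more elaborate schemes learn per-class fusion weights instead of a single global `alpha`.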