VideoGraph: Recognizing Minutes-Long Human Activities in Videos

13 May 2019 · Noureldien Hussein, Efstratios Gavves, Arnold W. M. Smeulders

Many human activities take minutes to unfold. To represent them, related works opt for statistical pooling, which neglects the temporal structure, or for convolutional methods such as CNNs and Non-Local blocks. While successful in learning temporal concepts, the latter fall short of modeling minutes-long temporal dependencies. We propose VideoGraph, a method that achieves the best of both worlds: it represents minutes-long human activities and learns their underlying temporal structure. VideoGraph learns a graph-based representation for human activities. The graph, its nodes, and its edges are learned entirely from video datasets, making VideoGraph applicable to problems without node-level annotation. The result is improvements over related works on the EPIC-Kitchens and Breakfast benchmarks. In addition, we demonstrate that VideoGraph is able to learn the temporal structure of human activities in minutes-long videos.
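To make the idea concrete, below is a minimal PyTorch sketch of a graph-based video model in the spirit of the abstract: segment features attend over a set of learned latent graph nodes, node activations are mixed across nodes and over time, then pooled for classification. The class name, layer sizes, and the choice of a linear layer as a stand-in for edge learning are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoGraphSketch(nn.Module):
    """Hedged sketch of a graph-based model for minutes-long videos.

    Segment features (e.g. from a pretrained I3D backbone) attend over
    learned latent "node" embeddings; node activations are mixed across
    nodes and over timesteps, then pooled for classification. All names
    and sizes here are illustrative, not the paper's exact architecture.
    """

    def __init__(self, feat_dim=1024, num_nodes=128, num_classes=10):
        super().__init__()
        # Latent graph nodes, learned end-to-end from the dataset
        # (no node-level annotation required).
        self.nodes = nn.Parameter(torch.randn(num_nodes, feat_dim) * 0.01)
        # Mix information across nodes: a cheap stand-in for edge learning.
        self.node_mix = nn.Linear(num_nodes, num_nodes)
        # 1-D convolution over segment timesteps models temporal structure.
        self.temporal = nn.Conv1d(num_nodes, num_nodes, kernel_size=3, padding=1)
        self.classifier = nn.Linear(num_nodes, num_classes)

    def forward(self, x):
        # x: (B, T, C) features for T video segments.
        # Dot-product similarity between each segment and each latent node.
        sim = torch.einsum('btc,nc->btn', x, self.nodes)   # (B, T, N)
        attn = F.softmax(sim, dim=-1)                      # attention over nodes
        z = self.node_mix(attn)                            # mix node activations
        z = F.relu(self.temporal(z.transpose(1, 2)))       # (B, N, T) temporal conv
        z = z.mean(dim=-1)                                 # pool over time
        return self.classifier(z)

# Usage: two videos, 64 one-second segments each, 1024-D segment features.
feats = torch.randn(2, 64, 1024)
logits = VideoGraphSketch(num_classes=10)(feats)
print(logits.shape)  # torch.Size([2, 10])
```

In contrast to statistical pooling, the temporal convolution over node activations retains the order of segments, which is what lets a model of this shape capture structure in minutes-long activities.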

Task | Dataset | Model | Metric | Value | Global Rank
Video Classification | Breakfast | VideoGraph | Accuracy (%) | 69.5 | #7
Long-video Activity Recognition | Breakfast | VideoGraph (I3D-K400-Pretrain-feature) | mAP | 63.14 | #6
