We present an unsupervised representation learning approach using videos
without semantic labels. We leverage the temporal coherence as a supervisory
signal by formulating representation learning as a sequence sorting task. We
take temporally shuffled frames (i.e., in non-chronological order) as inputs
and train a convolutional neural network to sort the shuffled sequences.
Similar to comparison-based sorting algorithms, we propose to extract features
from all frame pairs and aggregate them to predict the correct order. As
sorting shuffled image sequence requires an understanding of the statistical
temporal structure of images, training with such a proxy task allows us to
learn rich and generalizable visual representation. We validate the
effectiveness of the learned representation using our method as pre-training on
high-level recognition problems. The experimental results show that our method
compares favorably against state-of-the-art methods on action recognition,
image classification and object detection tasks.