Group Activity Recognition detects the activity collectively performed by a group of actors, which requires compositional reasoning of actors and objects.
We evaluate on the CATER dataset and find that Hopper achieves 73.2% Top-1 accuracy at just 1 FPS by hopping through a few critical frames.
We propose a sequential variational autoencoder to learn disentangled representations of sequential data (e.g., video and audio) under self-supervision.
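The key idea in such models is to factorize the latent space into a single static code for the whole sequence (content) and a per-frame dynamic code (motion). A minimal sketch of that latent layout, using fixed random projections as stand-in encoders (the projection matrices and dimensions here are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps (the standard VAE reparameterization trick)."""
    return mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)

def encode_sequence(frames, d_static=4, d_dynamic=2):
    """Toy disentangled encoder: one static latent f shared by the whole
    sequence plus one dynamic latent z_t per frame. A real model learns
    these maps; fixed random projections are used here only to show the
    factorized latent structure.
    """
    T, D = frames.shape
    W_f = rng.normal(size=(D, d_static))   # hypothetical content projection
    W_z = rng.normal(size=(D, d_dynamic))  # hypothetical motion projection
    f = reparameterize(frames.mean(axis=0) @ W_f, np.zeros(d_static))
    z = np.stack([reparameterize(x @ W_z, np.zeros(d_dynamic)) for x in frames])
    return f, z

f, z = encode_sequence(rng.normal(size=(8, 16)))  # 8 frames, 16-dim features
```

Swapping the static code between two sequences while keeping the dynamic codes is the usual test that content and motion have actually been disentangled.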
Keypoints are tracked using our Pose Entailment method, in which a pair of pose estimates is first sampled from different frames of a video and tokenized.
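One natural way to tokenize a pose pair is to flatten the two keypoint sets into a single sequence separated by special tokens, analogous to a sentence pair in NLP, so a transformer can decide whether the two poses belong to the same person across frames. The token values and layout below are illustrative assumptions, not the paper's exact scheme:

```python
import numpy as np

CLS, SEP = -1.0, -2.0  # hypothetical placeholder values for special tokens

def tokenize_pose_pair(pose_a, pose_b):
    """Flatten two pose estimates (K keypoints x 2 coordinates) into one
    sequence: [CLS] pose_a [SEP] pose_b [SEP], mirroring sentence-pair
    inputs in NLP so a transformer can classify the pair.
    """
    a = np.asarray(pose_a, dtype=float).ravel()
    b = np.asarray(pose_b, dtype=float).ravel()
    return np.concatenate([[CLS], a, [SEP], b, [SEP]])

# Two 15-keypoint poses from different frames.
tokens = tokenize_pose_pair(np.zeros((15, 2)), np.ones((15, 2)))
```

For 15 keypoints per pose, this yields a sequence of 1 + 30 + 1 + 30 + 1 = 63 tokens.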
Localizing moments in untrimmed videos via language queries requires the ability to accurately ground language in video.
We address the problem of video captioning by grounding language generation on object interactions in the video.
Human actions often involve complex interactions across several inter-related objects in the scene.
However, magnitude-based weight pruning removes a large number of parameters from the fully connected layers, yet may not adequately reduce computation costs in the convolutional layers because the resulting sparsity is irregular.