To understand the visual world, a machine must not only recognize individual object instances but also understand how they interact. Our hypothesis is that the appearance of a person -- their pose, clothing, and action -- is a powerful cue for localizing the objects they interact with.
Temporal relational reasoning, the ability to link meaningful transformations of objects or entities over time, is a fundamental property of intelligent species. In this paper, we introduce an effective and interpretable network module, the Temporal Relation Network (TRN), designed to learn and reason about temporal dependencies between video frames at multiple time scales.
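The multi-scale idea can be sketched concretely: for each scale n, an MLP scores ordered n-tuples of frame features, the per-tuple scores are summed, and the per-scale results are accumulated. The sketch below uses random weights as stand-ins for learned parameters and hypothetical dimensions; it illustrates the pooling structure, not the paper's exact implementation.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w, b):
    # one hidden layer with ReLU; stands in for the relation MLP g_n
    return np.maximum(x @ w + b, 0.0)

def temporal_relation(frames, max_scale=4, hidden=32, n_classes=10):
    """Sketch of multi-scale temporal relation pooling: for each scale n,
    sum MLP responses over ordered n-frame tuples, then accumulate the
    per-scale class scores."""
    T, D = frames.shape
    scores = np.zeros(n_classes)
    for n in range(2, max_scale + 1):
        # hypothetical random weights stand in for learned parameters
        w_g = rng.standard_normal((n * D, hidden)) * 0.1
        b_g = np.zeros(hidden)
        w_h = rng.standard_normal((hidden, n_classes)) * 0.1
        # combinations of increasing indices preserve temporal order
        for idx in itertools.combinations(range(T), n):
            tup = np.concatenate([frames[i] for i in idx])
            scores += mlp(tup, w_g, b_g) @ w_h
    return scores

frames = rng.standard_normal((6, 16))  # 6 frames, 16-dim features each
scores = temporal_relation(frames)     # one score per action class
```

Summing over ordered tuples rather than single frames is what lets the module respond to transformations between frames instead of static appearance.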
This paper addresses the task of detecting and recognizing human-object interactions (HOI) in images and videos. For a given scene, a Graph Parsing Neural Network (GPNN) infers a parse graph that includes i) the HOI graph structure, represented by an adjacency matrix, and ii) the node labels.
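A minimal sketch of that parse-graph inference, under assumed shapes and with random weights as placeholders for learned parameters: a link function scores each node pair into a soft adjacency matrix, messages are propagated over that matrix, and a readout predicts per-node labels. The function names here are illustrative, not the paper's API.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def infer_parse_graph(node_feats, n_labels=5, steps=2):
    """Sketch of parse-graph inference: soft adjacency from pairwise
    node features, message passing over it, then per-node label readout."""
    N, D = node_feats.shape
    w_link = rng.standard_normal(2 * D) * 0.1       # hypothetical link weights
    w_read = rng.standard_normal((D, n_labels)) * 0.1  # hypothetical readout
    adj = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            pair = np.concatenate([node_feats[i], node_feats[j]])
            adj[i, j] = sigmoid(pair @ w_link)      # soft edge strength
    np.fill_diagonal(adj, 0.0)
    h = node_feats
    for _ in range(steps):
        # each node aggregates neighbour states, weighted by the adjacency
        h = adj @ h / (adj.sum(axis=1, keepdims=True) + 1e-8)
    labels = softmax(h @ w_read)                    # per-node label distribution
    return adj, labels

feats = rng.standard_normal((4, 8))  # e.g. 1 human node + 3 object nodes
adj, labels = infer_parse_graph(feats)
```

The key structural point is that the adjacency matrix is itself an output of the network, so the graph structure and the node labels are inferred jointly rather than fixed in advance.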
Models need to distinguish different human instances in an image and learn rich features that capture the details of each instance. Parsing R-CNN is flexible and efficient, and is applicable to many tasks in instance-level human analysis.
SOTA for Pose Estimation on DensePose-COCO
Recent years have witnessed rapid progress in detecting and recognizing individual object instances. Our core idea is that the appearance of a person or an object instance contains informative cues about which parts of an image to attend to when predicting the interaction.
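This instance-centric attention can be sketched as a dot-product attention in which the appearance feature of a detected person acts as a query over image-region features; the attention-weighted sum becomes a context feature for interaction prediction. Dimensions and names below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def instance_attention(inst_feat, region_feats):
    """Sketch of instance-centric attention: score each image region
    against the instance's appearance feature, normalize the scores,
    and return the attention-weighted context feature."""
    # scaled dot-product compatibility between the instance and each region
    scores = region_feats @ inst_feat / np.sqrt(inst_feat.shape[0])
    weights = softmax(scores)           # one weight per image region
    context = weights @ region_feats    # attended context feature
    return weights, context

person = rng.standard_normal(16)         # appearance feature of a person
regions = rng.standard_normal((10, 16))  # features of 10 image regions
w, ctx = instance_attention(person, regions)
```

In an HOI pipeline, such a context feature would be concatenated with the instance feature before the interaction classifier, so that the regions a person's appearance points to directly inform the prediction.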