Today's image prediction methods struggle to change the locations of objects in a scene, producing blurry images that average over the many positions they might occupy.
We learn representations for video frames and frame-to-frame transition probabilities by fitting a video-specific model with contrastive learning.
By randomly traversing edges with high transition probabilities, we generate diverse temporally smooth videos with novel sequences and transitions.
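A minimal sketch of the generation step: given per-frame embeddings (assumed unit-normalized), build a transition matrix from a softmax over pairwise similarities and sample a new frame ordering by a random walk over high-probability edges. The function names, temperature, and the similarity-based construction are illustrative assumptions, not the exact objective used in the work.

```python
import numpy as np

def transition_matrix(embeddings, temperature=0.1):
    """Turn per-frame embeddings into frame-to-frame transition probabilities
    via a softmax over pairwise similarities (illustrative construction)."""
    sim = embeddings @ embeddings.T                       # similarity between every pair of frames
    np.fill_diagonal(sim, -np.inf)                        # forbid self-transitions
    logits = sim / temperature
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    return probs / probs.sum(axis=1, keepdims=True)

def random_walk(probs, start, length, rng=None):
    """Sample a new frame ordering by repeatedly following high-probability edges."""
    rng = rng or np.random.default_rng()
    path = [start]
    for _ in range(length - 1):
        path.append(rng.choice(len(probs), p=probs[path[-1]]))
    return path

# Example: 200 frames with 64-d embeddings resampled into a 50-frame sequence.
emb = np.random.randn(200, 64)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
P = transition_matrix(emb)
new_order = random_walk(P, start=0, length=50)
```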
Our objective is to transform a video into a set of discrete audio-visual objects using self-supervised learning.
In this work we ask whether it is possible to create a "universal" detector for telling apart real images from those generated by a CNN, regardless of the architecture or dataset used.
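A hedged sketch of such a detector, assuming a PyTorch/torchvision setup: fine-tune a standard ImageNet backbone as a binary real-vs-generated classifier on images from one generator, then test generalization on unseen generators. The folder layout, backbone choice, and hyperparameters below are placeholders, not the paper's exact training recipe.

```python
import torch
import torch.nn as nn
from torchvision import models, datasets, transforms

tfm = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# Hypothetical layout: "train/real" and "train/fake" hold images from one generator;
# generalization would then be measured on images from generators never seen in training.
train_set = datasets.ImageFolder("train", transform=tfm)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet50(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 1)             # single real/fake logit
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()

model.train()
for images, labels in loader:
    logits = model(images).squeeze(1)
    loss = loss_fn(logits, labels.float())
    opt.zero_grad()
    loss.backward()
    opt.step()
```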
Most malicious photo manipulations are created using standard image editing tools, such as Adobe Photoshop.
Specifically, we perform cross-modal translation from "in-the-wild" monologue speech of a single speaker to their hand and arm motion.
We present a system that allows users to visualize complex human motion via 3D motion sculptures -- a representation that conveys the 3D structure swept by a human body as it moves through space.
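As a toy stand-in for the sweep itself (not the system's actual pipeline), the sculpture idea can be illustrated by accumulating a body's 3D points across frames into one time-stamped point cloud; the input format and subsampling rate below are assumptions for illustration.

```python
import numpy as np

def motion_sculpture(body_points_per_frame, keep_every=2):
    """Accumulate a moving body's 3D points across frames into one swept point cloud.
    `body_points_per_frame`: list of (N_i, 3) arrays, one per video frame (hypothetical input)."""
    swept = []
    for t, pts in enumerate(body_points_per_frame):
        if t % keep_every:
            continue
        # Tag each point with its normalized time so the sweep can be colored by time.
        time_col = np.full((len(pts), 1), t / max(len(body_points_per_frame) - 1, 1))
        swept.append(np.hstack([pts, time_col]))
    return np.vstack(swept)          # (M, 4): x, y, z, time
```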
This model -- a deep, multimodal convolutional network -- predicts the outcome of a candidate grasp adjustment, and then executes a grasp by iteratively selecting the most promising actions.
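A rough sketch of the selection step, assuming a PyTorch outcome-prediction model that scores (image, tactile, action) triples: sample candidate adjustments, score them, and pick the most promising one. The candidate sampling, action parameterization, and model signature are placeholders rather than the paper's exact procedure.

```python
import torch

def select_grasp_adjustment(model, image, tactile, num_candidates=64, action_dim=3):
    """Score sampled candidate grasp adjustments with the outcome-prediction network
    and return the most promising one (illustrative, not the exact policy)."""
    candidates = torch.randn(num_candidates, action_dim)              # e.g. (dx, dy, d_theta)
    img_batch = image.unsqueeze(0).expand(num_candidates, *image.shape)
    tac_batch = tactile.unsqueeze(0).expand(num_candidates, *tactile.shape)
    with torch.no_grad():
        success_prob = model(img_batch, tac_batch, candidates)        # predicted outcome per action
    return candidates[success_prob.squeeze(-1).argmax()]

# A grasp would then be executed by repeating this predict/select step,
# re-observing, and applying the chosen adjustment until the policy commits.
```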
In this paper, we propose a learning algorithm for detecting image manipulations that is trained using only a large dataset of real photographs.
The thud of a bouncing ball, the onset of speech as lips open -- when visual and audio events occur together, it suggests that there might be a common, underlying event that produced both signals.
The sound of crashing waves, the roar of fast-moving cars -- sound conveys important information about the objects in our surroundings.
In this work, we investigate whether touch sensing aids in predicting grasp outcomes within a multimodal sensing framework that combines vision and touch.
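One way such a framework can be set up, sketched under assumptions: a late-fusion network with separate encoders for the camera image and the tactile reading, whose concatenated features predict grasp success. The class name, layer sizes, and the low-dimensional tactile input are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn as nn

class VisuoTactileOutcomePredictor(nn.Module):
    """Hypothetical late-fusion model: vision and touch encoders, fused to a success probability."""
    def __init__(self, tactile_dim=6, feat_dim=128):
        super().__init__()
        self.vision = nn.Sequential(                       # tiny CNN stand-in for a vision backbone
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim), nn.ReLU(),
        )
        self.touch = nn.Sequential(                        # encoder for a low-dimensional tactile signal
            nn.Linear(tactile_dim, feat_dim), nn.ReLU(),
        )
        self.head = nn.Linear(2 * feat_dim, 1)             # fused features -> success logit

    def forward(self, image, tactile):
        fused = torch.cat([self.vision(image), self.touch(tactile)], dim=1)
        return torch.sigmoid(self.head(fused))

# Zeroing out the `tactile` input gives a vision-only baseline against which
# the contribution of touch sensing can be measured.
```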
We show that, through this process, the network learns a representation that conveys information about objects and scenes.