We propose a method to detect individualized highlights in target videos for a given user, based on the highlight clips that user has marked in previously watched videos.
We train our network to map the activity- and interaction-based latent structural representations of the different modalities to per-frame highlight scores based on the representativeness of the frames.
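A minimal PyTorch sketch of this per-frame scoring step is below. The concatenation-based fusion, the layer widths, and the `HighlightScorer` name are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class HighlightScorer(nn.Module):
    """Fuse per-frame latent features from two representation streams
    and regress a per-frame highlight score in [0, 1]."""
    def __init__(self, activity_dim=256, interaction_dim=256, hidden=128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(activity_dim + interaction_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
            nn.Sigmoid(),  # per-frame score in [0, 1]
        )

    def forward(self, activity_feats, interaction_feats):
        # both inputs: (batch, frames, dim)
        fused = torch.cat([activity_feats, interaction_feats], dim=-1)
        return self.head(fused).squeeze(-1)  # (batch, frames)
```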
Our network consists of two components: a generator to synthesize gestures from a joint embedding space of features encoded from the input speech and the seed poses, and a discriminator to distinguish between the synthesized pose sequences and real 3D pose sequences.
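A skeletal PyTorch version of this generator/discriminator pair follows. The GRU encoders, the broadcast of the final seed-pose state across timesteps, and all dimensions are assumptions made for brevity; the paper's actual layers may differ.

```python
import torch
import torch.nn as nn

class GestureGenerator(nn.Module):
    """Encode speech features and seed poses into a joint embedding,
    then decode a 3D pose sequence."""
    def __init__(self, speech_dim=128, pose_dim=51, embed_dim=256):
        super().__init__()
        self.speech_enc = nn.GRU(speech_dim, embed_dim, batch_first=True)
        self.pose_enc = nn.GRU(pose_dim, embed_dim, batch_first=True)
        self.decoder = nn.GRU(2 * embed_dim, embed_dim, batch_first=True)
        self.out = nn.Linear(embed_dim, pose_dim)

    def forward(self, speech, seed_poses):
        s, _ = self.speech_enc(speech)    # (B, T, E)
        p, _ = self.pose_enc(seed_poses)  # (B, T_seed, E)
        # broadcast the final seed-pose state across all speech timesteps
        joint = torch.cat([s, p[:, -1:].expand(-1, s.size(1), -1)], dim=-1)
        h, _ = self.decoder(joint)
        return self.out(h)                # synthesized pose sequence

class PoseDiscriminator(nn.Module):
    """Score a pose sequence as real (1) or synthesized (0)."""
    def __init__(self, pose_dim=51, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(pose_dim, hidden, batch_first=True)
        self.cls = nn.Linear(hidden, 1)

    def forward(self, poses):
        h, _ = self.rnn(poses)
        return torch.sigmoid(self.cls(h[:, -1]))
```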
Our task is to map gestures to novel emotion categories not encountered in training.
We report an AP of 65.83 across 4 categories on GroupWalk, which is also an improvement over prior methods.
Ranked #1 on Emotion Recognition in Context on EMOTIC
Additionally, we extract and compare affective cues corresponding to perceived emotion from the two modalities within a video to infer whether the input video is "real" or "fake".
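One simple way to operationalize this cross-modal comparison is to embed the perceived-emotion cues from each modality and flag videos where the two disagree. The cosine-similarity score and the threshold below are hypothetical stand-ins, not the paper's exact cues.

```python
import torch.nn.functional as F

def affect_mismatch_score(visual_affect, audio_affect):
    """Compare perceived-emotion embeddings extracted from the visual
    and audio streams of the same video; low cross-modal similarity
    is treated as evidence that the video is fake."""
    sim = F.cosine_similarity(visual_affect, audio_affect, dim=-1)
    return 1.0 - sim  # higher = more likely "fake"

# usage: flag videos whose mismatch exceeds a validation-tuned threshold
# is_fake = affect_mismatch_score(v, a) > 0.5
```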
We present a data-driven deep neural algorithm for detecting deceptive walking behavior using nonverbal cues like gaits and gestures.
In practice, our approach reduces the average prediction error by more than 54% over prior algorithms and achieves a weighted average accuracy of 91.2% for behavior prediction.
Ranked #1 on Trajectory Prediction on ApolloScape
For the annotated data, we also train a classifier to map the latent embeddings to emotion labels.
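A sketch of such a label head appears below: a small MLP from the latent embedding to emotion logits, trained with cross-entropy on the annotated subset only. The embedding size, hidden width, and four-class output are assumptions for illustration.

```python
import torch.nn as nn

emotion_classifier = nn.Sequential(
    nn.Linear(64, 32),   # latent embedding -> hidden
    nn.ReLU(),
    nn.Linear(32, 4),    # hidden -> emotion logits (e.g., 4 classes)
)
loss_fn = nn.CrossEntropyLoss()  # applied only where labels exist
```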
Our approach combines cues from multiple co-occurring modalities (such as face, text, and speech) and is also more robust than other methods to sensor noise in any of the individual modalities.
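One common way to obtain this kind of robustness is to learn a per-modality gate that can down-weight a corrupted stream before fusion. The gating scheme below is an illustrative stand-in for the paper's actual modality handling, with hypothetical dimensions.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse face, text, and speech features with learned per-modality
    gates so that a noisy modality is down-weighted rather than
    corrupting the joint prediction."""
    def __init__(self, dims=(256, 256, 256), out_dim=128):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, out_dim) for d in dims)
        self.gate = nn.ModuleList(nn.Linear(d, 1) for d in dims)

    def forward(self, feats):  # feats: list of (batch, dim) tensors
        fused = 0.0
        for f, proj, gate in zip(feats, self.proj, self.gate):
            fused = fused + torch.sigmoid(gate(f)) * proj(f)
        return fused
```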
We use hundreds of annotated real-world gait videos and augment them with thousands of annotated synthetic gaits generated using a novel generative network called STEP-Gen, built on an ST-GCN based Conditional Variational Autoencoder (CVAE).
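The skeleton below shows the conditional-VAE structure such a generator relies on, with the emotion label as the condition. Plain linear layers stand in for STEP-Gen's ST-GCN encoder and decoder, and the flattened gait dimension is an assumption.

```python
import torch
import torch.nn as nn

class GaitCVAE(nn.Module):
    """Conditional VAE over gait sequences: encode (gait, label) to a
    latent Gaussian, decode (latent, label) back to a gait."""
    def __init__(self, gait_dim=3 * 16 * 48, label_dim=4, z_dim=32):
        super().__init__()
        self.z_dim = z_dim
        self.enc = nn.Linear(gait_dim + label_dim, 2 * z_dim)  # -> (mu, logvar)
        self.dec = nn.Linear(z_dim + label_dim, gait_dim)

    def forward(self, gait, label):
        mu, logvar = self.enc(torch.cat([gait, label], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.dec(torch.cat([z, label], -1)), mu, logvar

    @torch.no_grad()
    def sample(self, label):
        # draw a synthetic, pre-labeled gait for data augmentation
        z = torch.randn(label.size(0), self.z_dim)
        return self.dec(torch.cat([z, label], -1))
```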
RobustTP is an approach that first computes trajectories by combining a non-linear motion model with a deep learning-based instance segmentation algorithm.
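A toy version of that pipeline is sketched below: segmentation masks give a per-frame agent position (here, the mask centroid), and a motion model smooths the raw positions into a trajectory. The exponential smoother is a simple stand-in for the paper's non-linear motion model.

```python
import numpy as np

def trajectory_from_masks(masks_per_frame, alpha=0.7):
    """Turn per-frame instance-segmentation masks for one agent into a
    smoothed image-space trajectory."""
    centroids = []
    for mask in masks_per_frame:         # mask: (H, W) boolean array
        ys, xs = np.nonzero(mask)
        centroids.append((xs.mean(), ys.mean()))
    traj = [centroids[0]]
    for cx, cy in centroids[1:]:         # exponential smoothing
        px, py = traj[-1]
        traj.append((alpha * cx + (1 - alpha) * px,
                     alpha * cy + (1 - alpha) * py))
    return np.asarray(traj)              # (T, 2)
```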
We present a realtime tracking algorithm, RoadTrack, to track heterogeneous road-agents in dense traffic videos.
We also present the EWalk (Emotion Walk) dataset, which consists of videos of walking individuals annotated with gaits and emotion labels.
Our approach significantly outperforms the state-of-the-art robust 3D registration method based on a line process in terms of both speed and accuracy.
We evaluate the performance of our prediction algorithm, TraPHic, on the standard datasets and also introduce a new dense, heterogeneous traffic dataset corresponding to urban Asian videos and agent trajectories.
Ranked #1 on Trajectory Prediction on NGSIM