Action anticipation involves predicting future actions having observed the initial portion of a video.
Data imbalance, in which most of the data samples come from a small proportion of the labels, poses a challenge in training deep neural networks.
We study settings where gradient penalties are used alongside risk minimization with the goal of obtaining predictors satisfying different notions of monotonicity.
Learning causal relationships in high-dimensional data (images, videos) is a hard task, as they are often defined on low dimensional manifolds and must be extracted from complex signals dominated by appearance, lighting, textures and also spurious correlations in the data.
Importantly, we argue and empirically demonstrate that MUSE, compared to other feature discrepancy functions, is a more effective proxy for introducing dependency and improving the expressivity of all features in the knowledge distillation framework.
We study the setting where risk minimization is performed over general classes of models and consider two cases where monotonicity is treated as either a requirement to be satisfied everywhere or a useful property.
We evaluate this approach on our dataset, demonstrating that human-object relations can significantly reduce the ambiguity of articulated object reconstructions from challenging real-world videos.
We propose TD-GEN, a graph generation framework based on tree decomposition, and introduce a reduced upper bound on the maximum number of decisions needed for graph generation.
Deep neural networks are susceptible to catastrophic forgetting: when trained on a new task, they tend to remember only that task and lose the ability to accomplish previously learned tasks.
In contrast, we propose a parameter efficient framework, Piggyback GAN, which learns the current task by building a set of convolutional and deconvolutional filters that are factorized into filters of the models trained on previous tasks.
Hence, we develop an approach based on intermediate representations of poses and appearance: our pose-guided appearance rendering network first encodes the targets' poses using an encoder-decoder neural network.
We propose Discriminative Prototype DTW (DP-DTW), a novel method to learn class-specific discriminative prototypes for temporal recognition tasks.
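DP-DTW builds on classic dynamic time warping. As background, here is a minimal pure-Python sketch of the standard DTW distance between two 1-D sequences (the function name and absolute-difference local cost are illustrative choices, not taken from the paper):

```python
def dtw_distance(a, b):
    # Classic dynamic time warping between two 1-D sequences.
    # dp[i][j] = minimal cost of aligning a[:i] with b[:j].
    n, m = len(a), len(b)
    INF = float("inf")
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])  # local alignment cost
            dp[i][j] = cost + min(dp[i - 1][j],      # insertion
                                  dp[i][j - 1],      # deletion
                                  dp[i - 1][j - 1])  # match
    return dp[n][m]
```

In a prototype-based recognizer, each class would hold a learned prototype sequence and a test sequence would be assigned to the class with the smallest DTW distance.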
Learning from heterogeneous data poses challenges such as combining data from various sources and of different types.
Detecting and localizing action instances in untrimmed videos requires reasoning over multiple action instances in a video.
Ranked #3 on Temporal Action Localization on THUMOS’14 (mAP IOU@0.1 metric)
We consider the problem of optimizing a robot morphology to achieve the best performance for a target task, under computational resource limitations.
We present a mutual information-based framework for unsupervised image-to-image translation.
This paper proposes a novel graph-constrained generative adversarial network, whose generator and discriminator are built upon relational architecture.
Normalizing flows transform a simple base distribution into a complex target distribution and have proved to be powerful models for data generation and density estimation.
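The core of a normalizing flow is the change-of-variables formula: the density of a transformed sample is the base density minus the log-determinant of the transform's Jacobian. A minimal sketch with a single 1-D affine transform and a standard-normal base (an illustrative toy, not any particular flow architecture):

```python
import math

def affine_flow_logpdf(y, scale, shift):
    # Flow: y = scale * x + shift, with base x ~ N(0, 1).
    # Change of variables: log p(y) = log p_base(x) - log|dy/dx|.
    x = (y - shift) / scale                       # invert the transform
    log_base = -0.5 * (x * x + math.log(2 * math.pi))
    return log_base - math.log(abs(scale))        # Jacobian correction
```

Since an affine map of a Gaussian is Gaussian, this should match the density of N(shift, scale^2) exactly; deep flows stack many such invertible layers.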
In this work, we propose a novel probabilistic sequence model that excels at capturing high variability in time series data, both across sequences and within an individual sequence.
Then, we develop an efficient weight-transfer method to explain decisions for any image without back-propagation.
Human activity videos involve rich, varied interactions between people and objects.
Event sequences can be modeled by temporal point processes (TPPs) to capture their asynchronous and probabilistic nature.
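The simplest TPP is the homogeneous Poisson process, where inter-event gaps are i.i.d. exponential. A minimal simulation sketch (function name and seeding are illustrative assumptions):

```python
import random

def simulate_poisson_process(rate, horizon, seed=0):
    # Homogeneous Poisson TPP on [0, horizon]:
    # draw i.i.d. exponential inter-event gaps with the given rate.
    rng = random.Random(seed)
    t, events = 0.0, []
    while True:
        t += rng.expovariate(rate)
        if t > horizon:
            return events
        events.append(t)
```

Richer TPPs (e.g. Hawkes processes or neural TPPs) replace the constant rate with a history-dependent conditional intensity, but the asynchronous, probabilistic event-time structure is the same.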
In this paper, we propose an arbitrarily-conditioned data imputation framework built upon variational autoencoders and normalizing flows.
Despite promising progress on unimodal data imputation (e.g., image inpainting), models for multimodal data imputation are far from satisfactory.
Generating graph structures is a challenging problem due to the diverse representations and complex dependencies among nodes.
A general graph-structured neural network architecture operates on graphs through two core components: (1) sufficiently expressive message functions; (2) a fixed information-aggregation process.
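To make the two components concrete, here is a minimal sketch of one round of message passing with identity messages and fixed mean aggregation (a toy illustration of the generic scheme, not the architecture proposed in any of these papers):

```python
def message_passing_step(features, edges):
    # One round of message passing on an undirected graph:
    # message function = identity (a node sends its feature vector),
    # aggregation = mean over incoming messages (fixed, not learned).
    neighbors = {v: [] for v in features}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    updated = {}
    for v, feat in features.items():
        msgs = [features[u] for u in neighbors[v]] or [feat]  # isolated node keeps its feature
        updated[v] = [sum(vals) / len(msgs) for vals in zip(*msgs)]
    return updated
```

Learned architectures replace the identity message with a neural network and may make the aggregation itself adaptive.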
We present a relational graph learning approach for robotic crowd navigation using model-based deep reinforcement learning that plans actions by looking into the future.
Understanding causes and effects in mechanical systems is an essential component of reasoning in the physical world.
Learning from only partially-observed data for imputation has been an active research area.
In this paper, we propose Continuous Graph Flow, a generative continuous flow based method that aims to model complex distributions of graph-structured data.
Knowledge distillation is a widely applicable technique for training a student neural network under the guidance of a trained teacher network.
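The standard distillation objective matches the student's temperature-softened output distribution to the teacher's. A minimal sketch of that soft-target loss (pure Python on raw logits; the function names and default temperature are illustrative):

```python
import math

def softmax(logits, temperature):
    # Temperature-scaled softmax over a list of logits.
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soft-target loss: cross-entropy between the teacher's and the
    # student's softened distributions, scaled by T^2 so gradient
    # magnitudes stay comparable across temperatures.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return -temperature ** 2 * sum(pi * math.log(qi) for pi, qi in zip(p, q))
```

In practice this term is combined with the ordinary cross-entropy on ground-truth labels.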
This makes it possible to perform image-conditioned generation tasks in a lifelong learning setting.
In particular, we express the latent variable space of a variational autoencoder (VAE) in terms of a Bayesian network with a learned, flexible dependency structure.
Multi-label classification is a more difficult task than single-label classification because both the input images and output label spaces are more complex.
Deep neural network compression has the potential to bring modern resource-hungry deep networks to resource-limited devices.
Second, we propose a Relational Autoencoder model for unsupervised learning of features for action and scene retrieval.
Numerous powerful point process models have been developed to understand temporal patterns in sequential data from fields such as health-care, electronic commerce, social networks, and natural disaster forecasting.
Human activity recognition is typically addressed by detecting key concepts like global and local motion, features related to object classes present in the scene, as well as features related to the global context.
Ranked #1 on Semantic Object Interaction Classification on VLOG
This allows us to take advantage of the complementary nature of pruning and quantization and to recover from premature pruning errors, which is not possible with current two-stage approaches.
In this paper, we exploit this rich structure for performing graph-based inference in label space for a number of tasks: multi-label image and video classification and action detection in untrimmed videos.
We explore a key architectural aspect of deep convolutional neural networks: the pattern of internal skip connections used to aggregate outputs of earlier layers for consumption by deeper layers.
An architecture combining a hierarchical temporal model for predicting human poses and encoder-decoder convolutional neural networks for rendering target appearances is proposed.
When approaching a novel visual recognition problem in a specialized image domain, a common strategy is to start with a pre-trained deep neural network and fine-tune it to the specialized domain.
Matrix and tensor factorization methods are often used for finding underlying low-dimensional patterns from noisy data.
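As a minimal illustration of such factorization, here is a rank-1 alternating least squares sketch that approximates a matrix M by an outer product u v^T (pure Python, fixed iteration count; all names are illustrative):

```python
def rank1_factorize(M, iters=50):
    # Alternating least squares for a rank-1 factorization M ≈ u v^T.
    n, m = len(M), len(M[0])
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Fix v, solve for u in closed form (per-row least squares).
        vv = sum(x * x for x in v)
        u = [sum(M[i][j] * v[j] for j in range(m)) / vv for i in range(n)]
        # Fix u, solve for v.
        uu = sum(x * x for x in u)
        v = [sum(M[i][j] * u[i] for i in range(n)) / uu for j in range(m)]
    return u, v
```

Higher-rank variants alternate over factor matrices instead of vectors, and robust formulations add regularization or noise models to cope with noisy data.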
Videos are a rich source of high-dimensional structured data, with a wide range of interacting components at varying levels of granularity.
Our method uses Q-learning to learn a data labeling policy on a small labeled training dataset, and then uses this to automatically label noisy web data for new visual concepts.
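For context, the tabular Q-learning update underlying such a policy is a one-line temporal-difference rule. A minimal sketch (the dict-of-dicts Q-table and default hyperparameters are illustrative assumptions, not the paper's setup):

```python
def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    # One tabular Q-learning step:
    # Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    best_next = max(Q[next_state].values()) if Q.get(next_state) else 0.0
    td_target = reward + gamma * best_next
    Q[state][action] += alpha * (td_target - Q[state][action])
    return Q
```

In the labeling setting, states would encode the candidate web example and actions would be label/discard decisions, with reward tied to downstream model accuracy.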
We propose a general purpose active learning algorithm for structured prediction, gathering labeled data for training a model that outputs a set of related labels for an image or video.
Activity analysis in which multiple people interact across a large space is challenging due to the interplay of individual actions and collective group dynamics.
Our class-independent TPN outperforms other tubelet generation methods, and our unified temporal deep network achieves state-of-the-art localization results on all three datasets.
We advocate that high-recall holistic inference of image concepts provides valuable information for detailed pixel labeling.
In order to model both person-level and group-level dynamics, we present a 2-stage deep temporal model for the group activity recognition problem.
This paper presents HCRF-Boost, a novel and general framework for learning HCRFs in functional space.
In this work we introduce a fully end-to-end approach for action detection in videos that learns to directly predict the temporal bounds of actions.
Ranked #9 on Temporal Action Localization on THUMOS’14 (mAP IOU@0.2 metric)
In group activity recognition, the temporal dynamics of the whole activity can be inferred based on the dynamics of the individual people representing the activity.
Images of scenes have various objects as well as abundant attributes, and diverse levels of visual categorization are possible.
As a concrete example, group activity recognition involves the interactions and relative spatial relations of a set of people in a scene.
Ranked #5 on Group Activity Recognition on Collective Activity
Every moment counts in action recognition.
Ranked #7 on Action Detection on Multi-THUMOS
We present a method for learning an embedding that places images of humans in similar poses nearby.
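A standard way to learn such an embedding is a triplet loss that pulls an anchor toward a positive (similar pose) and pushes it away from a negative. A minimal sketch with squared Euclidean distances (illustrative; not necessarily the exact objective used in the paper):

```python
def triplet_loss(anchor, positive, negative, margin=1.0):
    # Hinge-style triplet loss on embedding vectors:
    # penalize unless the negative is at least `margin` farther
    # from the anchor than the positive is.
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)
```

Training minimizes this loss over mined triplets so that nearest-neighbor lookups in the embedding space retrieve images of humans in similar poses.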
This paper presents a deep neural-network-based hierarchical graphical model for individual and group activity recognition in surveillance scenes.
We present a novel approach for discovering human interactions in videos.
Many visual recognition problems can be approached by counting instances.
We propose a new weakly-supervised structured learning approach for recognition and spatio-temporal localization of actions in video.
We introduce a graphical framework for multiple instance learning (MIL) based on Markov networks.
We propose a novel information theoretic approach for semi-supervised learning of conditional random fields.
In particular, our experimental results demonstrate that combining large-scale global features and local patch features performs significantly better than directly applying hCRF on local patches alone.