Our model takes as input a Hierarchical Volumetric Representation (HVR) of the environment and an egocentric video, infers the 3D action location as a latent variable, and recognizes the action based on the video and contextual cues surrounding its potential locations.
Many approaches model each target in isolation and lack the ability to use all the targets in the scene to jointly update the memory.
Continual learning is known to suffer from catastrophic forgetting, a phenomenon in which concepts learned earlier are forgotten in favor of more recent samples.
To understand human daily social interaction from an egocentric perspective, we introduce the novel task of reconstructing a time series of second-person 3D human body meshes from monocular egocentric videos.
In this paper, we focus on the LED task, providing a strong baseline model with detailed ablations that characterize both dataset biases and the importance of various modeling choices.
This is challenging as it requires a model to learn a representation that can infer both the visible and occluded portions of any object using a limited training set.
Moving beyond the dataset, we propose a novel deep model for joint gaze estimation and action recognition in FPV.
In this work, we present a method for obtaining an implicit objective function for vision-based navigation.
Panel count data describe aggregated counts of recurrent events observed at discrete time points.
Inspired by the Thomson problem in physics, where the distribution of multiple mutually repelling electrons on a unit sphere can be modeled by minimizing a potential energy, hyperspherical energy minimization has demonstrated its potential for regularizing neural networks and improving their generalization power.
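As a minimal sketch of the idea (illustrative only, not the paper's exact regularizer; the function name and the Riesz exponent `s` are my own choices), hyperspherical energy can be computed by projecting weight vectors onto the unit sphere and summing inverse pairwise distances:

```python
import numpy as np

def hyperspherical_energy(W, s=1.0, eps=1e-8):
    """Riesz s-energy of the rows of W projected onto the unit sphere.

    Minimizing this spreads the weight vectors uniformly over the sphere,
    analogous to repelling electrons in the Thomson problem.
    """
    Wn = W / (np.linalg.norm(W, axis=1, keepdims=True) + eps)  # project onto sphere
    diff = Wn[:, None, :] - Wn[None, :, :]                     # pairwise differences
    dist = np.linalg.norm(diff, axis=-1)                       # pairwise distances
    iu = np.triu_indices(len(W), k=1)                          # each unordered pair once
    return np.sum(dist[iu] ** (-s))                            # sum of inverse distances

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))        # e.g., 8 neurons with 4-dim weights
print(hyperspherical_energy(W))
```

Well-separated directions yield low energy; nearly collinear directions yield high energy, which is what a regularizer based on this quantity penalizes.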
We consider the problem of online adaptation of a neural network designed to represent vehicle dynamics.
Localizing moments in untrimmed videos via language queries is a challenging task that requires the ability to accurately ground language in video.
The synthesizer and target networks are trained in an adversarial manner wherein each network is updated with a goal to outdo the other.
Additionally, to learn from 2D poses "in the wild", we train an unsupervised 2D domain adapter network to allow for an expansion of 2D data.
Ranked #18 on 3D Human Pose Estimation on MPI-INF-3DHP
To this end, we make use of attention modules that learn to highlight regions in the video and aggregate features for recognition.
Ranked #33 on Action Recognition on UCF101
We describe a novel cross-modal embedding space for actions, named Action2Vec, which combines linguistic cues from class labels with spatio-temporal features derived from video clips.
Automatic generation of textual video descriptions that are time-aligned with video content is a long-standing goal in computer vision.
We address the task of jointly determining what a person is doing and where they are looking based on the analysis of video captured by a head-worn camera.
We also propose novel data augmentation approaches to efficiently train recurrent models that score object tracks on both appearance and motion.
Our method produces a compact 3D representation of the scene, which can be readily used for applications like autonomous driving.
Ranked #3 on Vehicle Pose Estimation on KITTI Cars Hard (using extra training data)
Estimation of 3D motion in a dynamic scene from a temporal pair of images is a core task in many scene understanding problems.
We propose an active teacher model that can actively query the learner (i.e., make the learner take exams) to estimate the learner's status and provably guide the learner to faster convergence.
Estimating the head pose of a person is a crucial problem with many applications, such as aiding gaze estimation, modeling attention, fitting 3D models to video, and performing face alignment.
Ranked #4 on Head Pose Estimation on BIWI (MAE (trained with BIWI data) metric)
Face detection is an important task and a necessary pre-processing step for many applications such as facial landmark detection, pose estimation, sentiment analysis, and face recognition.
We present a parameter learning method for GLM emissions and survival model fitting, and report promising results on both synthetic data and an mHealth drug use dataset.
We present an information theoretic approach to stochastic optimal control problems that can be used to derive general sampling based optimization schemes.
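One representative sampling-based scheme of this flavor is a path-integral-style update. The sketch below is a toy illustration under my own assumptions (the name `mppi_step`, the point-mass dynamics, and the cost are hypothetical, not the paper's derivation): sample control perturbations, weight each rollout by the exponentiated negative cost, and average the perturbations to update the control sequence.

```python
import numpy as np

def mppi_step(u, rollout_cost, n_samples=256, sigma=0.5, lam=1.0, rng=None):
    """One cost-weighted update of the control sequence u."""
    rng = rng or np.random.default_rng(0)
    H = len(u)
    noise = rng.normal(scale=sigma, size=(n_samples, H))    # control perturbations
    costs = np.array([rollout_cost(u + eps) for eps in noise])
    w = np.exp(-(costs - costs.min()) / lam)                # importance weights
    w /= w.sum()
    return u + w @ noise                                    # cost-weighted average

# Toy problem: drive a 1D point mass from x=0 toward x=1 over H steps.
def rollout_cost(controls, dt=0.1):
    x = 0.0
    for a in controls:
        x += a * dt
    return (x - 1.0) ** 2 + 1e-3 * np.sum(controls ** 2)

rng = np.random.default_rng(0)
u = np.zeros(10)
for _ in range(50):
    u = mppi_step(u, rollout_cost, rng=rng)
print(rollout_cost(u))
```

The exponential weighting concentrates the update on low-cost rollouts, which is the core of information-theoretic, derivative-free control schemes of this kind.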
Different from traditional machine teaching, which views the learner as a batch algorithm, we study a new paradigm in which the learner uses an iterative algorithm and a teacher can feed examples sequentially and intelligently based on the learner's current performance.
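A toy sketch of this iterative teaching idea (my own minimal setup, not the paper's algorithm): the learner runs SGD on least squares, and at each step the teacher greedily feeds the pool example whose update lands the learner's weights closest to the target concept `w_star`.

```python
import numpy as np

rng = np.random.default_rng(0)
d, lr = 5, 0.1
w_star = rng.normal(size=d)            # concept known only to the teacher
pool_X = rng.normal(size=(200, d))     # candidate teaching examples
pool_y = pool_X @ w_star               # noiseless labels

w = np.zeros(d)
for _ in range(100):
    # simulate one SGD step per candidate example
    grads = (pool_X @ w - pool_y)[:, None] * pool_X
    candidates = w - lr * grads
    # teacher picks the example that moves the learner closest to w_star
    best = np.argmin(np.linalg.norm(candidates - w_star, axis=1))
    w = candidates[best]

print(np.linalg.norm(w - w_star))
```

Because the teacher exploits knowledge of both the target and the learner's current state, convergence is much faster than with randomly ordered examples.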
Approximate Bayesian Computation (ABC) is a framework for performing likelihood-free posterior inference for simulation models.
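The basic ABC rejection scheme can be sketched in a few lines (an illustrative toy, not tied to any specific simulator; the Gaussian-mean example and the tolerance `eps` are my own choices): draw parameters from the prior, simulate data, and keep parameters whose simulated summary statistic falls within `eps` of the observed one.

```python
import numpy as np

def abc_rejection(observed, simulate, prior_sample, distance, eps, n_draws, rng):
    """Likelihood-free rejection sampler: keep draws whose simulated
    summary statistic is within eps of the observed summary."""
    accepted = []
    for _ in range(n_draws):
        theta = prior_sample(rng)
        x = simulate(theta, rng)
        if distance(x, observed) < eps:
            accepted.append(theta)
    return np.array(accepted)

rng = np.random.default_rng(0)
true_mu = 2.0
observed = rng.normal(true_mu, 1.0, size=100).mean()   # observed summary: sample mean

posterior = abc_rejection(
    observed=observed,
    simulate=lambda mu, r: r.normal(mu, 1.0, size=100).mean(),
    prior_sample=lambda r: r.uniform(-5.0, 5.0),       # flat prior over the mean
    distance=lambda a, b: abs(a - b),
    eps=0.1,
    n_draws=20000,
    rng=rng,
)
print(posterior.mean(), len(posterior))
```

The accepted draws approximate the posterior without ever evaluating a likelihood; shrinking `eps` tightens the approximation at the cost of a lower acceptance rate.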
We address the problem of minimizing human effort in interactive tracking by learning sequence-specific model parameters.
The Continuous-Time Hidden Markov Model (CT-HMM) is an attractive approach to modeling disease progression due to its ability to describe noisy observations arriving irregularly in time.
Motivated by these applications, this paper focuses on the problem of egocentric video summarization.
The dataset design bias not only creates a troubling disconnect between fixations and salient object segmentation, but also misleads algorithm design.
By precomputing a graph that can be reused for parametric min-cuts over different seeds, we speed up the generation of the segment pool.