In recent years, differential privacy has seen significant advancements in image classification; however, its application to video activity recognition remains under-explored.
Personalized Federated Learning (pFL) has emerged as a promising solution to tackle data heterogeneity across clients in FL.
We propose Image-to-Image Schrödinger Bridge (I$^2$SB), a new class of conditional diffusion models that directly learn the nonlinear diffusion processes between two given distributions.
no code implementations • 9 Feb 2023 • Zhuolin Yang, Wei Ping, Zihan Liu, Vijay Korthikanti, Weili Nie, De-An Huang, Linxi Fan, Zhiding Yu, Shiyi Lan, Bo Li, Ming-Yu Liu, Yuke Zhu, Mohammad Shoeybi, Bryan Catanzaro, Chaowei Xiao, Anima Anandkumar
Augmenting pretrained language models (LMs) with a vision encoder (e.g., Flamingo) has obtained state-of-the-art results in image-to-text generation.
In evaluating cross-dataset generalization with unseen categories, TPT performs on par with the state-of-the-art approaches that use additional training data.
In this work, we propose a generalized notion of multi-task learning by incorporating both auxiliary tasks that the model should perform well on and adversarial tasks that the model should not perform well on.
By only training a query-based image instance segmentation model, MinVIS outperforms the previous best result on the challenging Occluded VIS dataset by over 10% AP.
Autonomous agents have made great strides in specialist domains like Atari games and Go.
1 code implementation • 3 Feb 2022 • Shuang Li, Xavier Puig, Chris Paxton, Yilun Du, Clinton Wang, Linxi Fan, Tao Chen, De-An Huang, Ekin Akyürek, Anima Anandkumar, Jacob Andreas, Igor Mordatch, Antonio Torralba, Yuke Zhu
Together, these results suggest that language modeling induces representations that are useful for modeling not just language, but also goals and plans; these representations can aid learning and generalization even outside of language processing.
In real-world applications, however, there are multiple protected attributes yielding a large number of intersectional protected groups.
We enable such unit tests through variations in a semantically-interpretable latent space of a generative model.
A student network then learns to mimic the expert policy by supervised learning with strong augmentations, making its representation more robust against visual variations compared to the expert.
We upend this view and show that URRL improves both the natural accuracy of unsupervised representation learning and its robustness to corruptions and adversarial noise.
In this paper, we propose a novel spatio-temporal graph model for video captioning that exploits object interactions in space and time.
We further show that by using the automatically inferred goal from the video demonstration, our robot is able to reproduce the same task in a real kitchen environment.
Modeling and prediction of human motion dynamics has long been a challenging problem in computer vision, and most existing methods rely on the end-to-end supervised training of various architectures of recurrent neural networks.
Ranked #2 on Human Pose Forecasting on Human3.6M (MAR, walking, 1,000ms metric)
The key technical challenge is that the symbol grounding is prone to error with limited training data and leads to subsequent symbolic planning failures.
In this paper, we study the problem of procedure planning in instructional videos, which can be seen as a step towards enabling autonomous agents to plan for complex tasks in everyday settings such as cooking.
The key technical challenge for discriminative modeling with weak supervision is that the loss function of the ordering supervision is usually formulated using dynamic programming and is thus not differentiable.
In this paper, we propose a new action-agnostic method for short- and long-term human pose forecasting.
Ranked #5 on Human Pose Forecasting on Human3.6M (MAR, walking, 1,000ms metric)
A major challenge in computer vision is scaling activity understanding to the long tail of complex activities without requiring the collection of large quantities of data for new actions.
We propose Neural Graph Matching (NGM) Networks, a novel framework that can learn to recognize a previously unseen 3D action class with only a few examples.
Ranked #1 on Skeleton Based Action Recognition on CAD-120
We hypothesize that to successfully generalize to unseen complex tasks from a single video demonstration, it is necessary to explicitly incorporate the compositional structure of the tasks into the model.
Our goal is to predict future video frames given a sequence of input frames.
The ability to capture temporal information has been critical to the development of video understanding models.
In this work, we propose to tackle this new task with a weakly-supervised framework for reference-aware visual grounding in instructional videos, where only the temporal alignment between the transcription and the video segment is available for supervision.
This allows us to apply IRL at scale and directly imitate the dynamics in high-dimensional continuous visual sequences from the raw pixel values.
We propose an unsupervised method for reference resolution in instructional videos, where the goal is to temporally link an entity (e.g., "dressing") to the action (e.g., "mix yogurt") that produced it.
We present an unsupervised representation learning approach that compactly encodes the motion dependencies in videos.
We propose a weakly-supervised framework for action labeling in video, where only the order of occurring actions is required during training time.
We develop predictive models of pedestrian dynamics by encoding the coupled nature of multi-pedestrian interaction using game theory, and deep learning-based visual analysis to estimate person-specific behavior parameters.
Collecting datasets in this way, however, requires robust and efficient ways to detect and exclude outliers, which are prevalent.
Furthermore, we develop a hierarchical extension to the DPP clustering algorithm and show that it can be used to discover appearance-based grasp taxonomies.