While the current trend in the generative field is scaling up towards larger models and more training data for generalized domain representations, in this work we go in the opposite direction, synthesizing unseen-domain images without additional training.
To perform a task, recent works often require a model to learn from examples of the task via either supervised learning or few/many-shot prompting.
To overcome this, our proposed vision-language dataset distillation method jointly distills the image-text pairs in a contrastive formulation.
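The contrastive formulation mentioned above is not specified in detail here; a common instantiation for image-text pairs is a symmetric InfoNCE (CLIP-style) objective, where matched pairs along the diagonal of a similarity matrix are pulled together and mismatched pairs pushed apart. The function below is an illustrative sketch of that general objective, not the paper's actual distillation loss; the name `clip_style_loss` and the temperature value are assumptions.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text pairs.

    Row i of img_emb is assumed to match row i of txt_emb, so the
    positives lie on the diagonal of the similarity matrix.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (N, N) cosine similarities
    n = logits.shape[0]
    idx = np.arange(n)
    # Image-to-text and text-to-image cross-entropy on the diagonal
    i2t = -log_softmax(logits, axis=1)[idx, idx]
    t2i = -log_softmax(logits, axis=0)[idx, idx]
    return (i2t.mean() + t2i.mean()) / 2
```

Under this objective, jointly distilling images and text means optimizing the synthetic pairs so that a model trained on them with this loss matches one trained on the full dataset.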
We propose an algorithm that compresses the critical information of a large dataset into compact addressable memories.
Hence, we develop an approach based on intermediate representations of poses and appearance: our pose-guided appearance rendering network first encodes the targets' poses using an encoder-decoder neural network.
The ability to perform effective planning is crucial for building an instruction-following agent.
To this end, we propose BabyWalk, a new VLN agent that learns to navigate by decomposing long instructions into shorter ones (BabySteps) and completing them sequentially.
In the Vision-and-Language Navigation (VLN) task, an agent with egocentric vision navigates to a destination given natural language instructions.
A general graph-structured neural network architecture operates on graphs through two core components: (1) sufficiently expressive message functions; (2) a fixed process for aggregating incoming messages at each node.
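The two components above can be made concrete with a minimal message-passing step: a user-supplied message function is applied per edge, and the incoming messages are combined at each node by a fixed aggregation (summation here). This is a generic sketch of the standard scheme, not code from the work itself; the function and argument names are assumptions.

```python
import numpy as np

def message_passing_step(node_feats, edges, message_fn):
    """One round of message passing on a directed graph.

    node_feats: (N, D) array of node features.
    edges: iterable of (src, dst) index pairs.
    message_fn: maps (source features, destination features) to a message.
    Aggregation is fixed: messages arriving at a node are summed.
    """
    new_feats = np.zeros_like(node_feats)
    for src, dst in edges:
        new_feats[dst] += message_fn(node_feats[src], node_feats[dst])
    return new_feats

# Tiny usage example: identity messages on a 3-node graph.
feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
edges = [(0, 1), (2, 1)]
out = message_passing_step(feats, edges, lambda s, d: s)
```

In this example node 1 receives the features of nodes 0 and 2 summed, while nodes with no incoming edges stay at zero; richer architectures swap in learned message functions but keep the aggregation fixed.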
In this paper, we propose Continuous Graph Flow, a generative continuous flow based method that aims to model complex distributions of graph-structured data.
In this paper, we exploit this rich structure for performing graph-based inference in label space for a number of tasks: multi-label image and video classification and action detection in untrimmed videos.
We explore a key architectural aspect of deep convolutional neural networks: the pattern of internal skip connections used to aggregate outputs of earlier layers for consumption by deeper layers.
Matrix and tensor factorization methods are often used for finding underlying low-dimensional patterns from noisy data.
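As a minimal illustration of recovering low-dimensional patterns from noisy data, truncated SVD gives the best rank-r approximation of a matrix (by the Eckart-Young theorem), keeping the dominant structure and discarding most of the noise. This sketch shows the generic technique only, not the specific factorization method of the work; the function name is an assumption.

```python
import numpy as np

def low_rank_denoise(X, rank):
    """Best rank-`rank` approximation of X via truncated SVD.

    Keeps the top singular directions (the dominant low-dimensional
    patterns) and drops the remainder, which is mostly noise when the
    underlying signal is genuinely low-rank.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]
```

A quick check: adding small noise to a rank-1 matrix and truncating back to rank 1 yields a reconstruction closer to the clean matrix than the noisy observation is.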
We propose a general purpose active learning algorithm for structured prediction, gathering labeled data for training a model that outputs a set of related labels for an image or video.
Our class-independent TPN outperforms other tubelet generation methods, and our unified temporal deep network achieves state-of-the-art localization results on all three datasets.
We advocate that high-recall holistic inference of image concepts provides valuable information for detailed pixel labeling.
In order to model both person-level and group-level dynamics, we present a two-stage deep temporal model for the group activity recognition problem.
In group activity recognition, the temporal dynamics of the whole activity can be inferred from the dynamics of the individual people performing the activity.
Images of scenes contain a variety of objects and rich attributes, and admit visual categorization at multiple levels.
As a concrete example, group activity recognition involves the interactions and relative spatial relations of a set of people in a scene.
This paper presents a deep neural-network-based hierarchical graphical model for individual and group activity recognition in surveillance scenes.