HANT tackles the problem in two phases: in the first phase, a large number of alternative operations for every layer of the teacher model are trained using layer-wise feature-map distillation.
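The first phase can be illustrated with a minimal numpy sketch: a toy linear "teacher layer", one candidate replacement operation, and a plain SGD loop minimizing the feature-map MSE. All names, shapes, and the linear operations are hypothetical stand-ins, not HANT's actual search space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "teacher layer": a fixed linear map whose feature maps we imitate.
W_teacher = rng.normal(size=(16, 16))

def teacher_layer(x):
    return x @ W_teacher

# One candidate replacement operation (in HANT, many such alternatives
# would be trained in parallel per layer): a trainable linear map.
W_cand = rng.normal(size=(16, 16)) * 0.1

def candidate_layer(x, W):
    return x @ W

# Layer-wise distillation: minimize the MSE between teacher and candidate
# feature maps on the same inputs, by plain gradient descent.
lr = 0.01
for step in range(500):
    x = rng.normal(size=(32, 16))            # batch of layer inputs
    target = teacher_layer(x)                # teacher feature map
    pred = candidate_layer(x, W_cand)        # candidate feature map
    grad = 2.0 * x.T @ (pred - target) / x.shape[0]   # d(MSE)/dW
    W_cand -= lr * grad

final_mse = np.mean((candidate_layer(x, W_cand) - teacher_layer(x)) ** 2)
```

Because the target is a deterministic function of the input, this is ordinary linear regression and the candidate converges to the teacher map.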
Moving from data to latent space allows us to train more expressive generative models, apply SGMs to non-continuous data, and learn smoother SGMs in a smaller space, resulting in fewer network evaluations and faster sampling.
Detecting out-of-distribution (OOD) samples plays a key role in open-world and safety-critical applications such as autonomous systems and healthcare.
In this work, we introduce GradInversion, which can recover input images even from larger batches (8-48 images), for large networks such as ResNet-50, on complex datasets such as ImageNet (1000 classes, 224x224 px).
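The underlying leakage is easiest to see in the simplest case: for a single example through a linear layer with bias, the input is recoverable from the shared gradients in closed form (a classic observation; GradInversion itself recovers whole batches by optimization). A toy numpy sketch, with all shapes and values hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# For a linear layer y = W x + b under any loss L:
#   dL/dW = (dL/dy) x^T   and   dL/db = dL/dy,
# so the input x is a ratio of the two shared gradients.
W = rng.normal(size=(4, 8))
b = rng.normal(size=4)
x_true = rng.normal(size=8)
y_label = rng.normal(size=4)

r = (W @ x_true + b) - y_label      # dL/dy for a squared loss
grad_W = np.outer(r, x_true)        # dL/dW, as shared in federated learning
grad_b = r                          # dL/db

# Recover the input from the gradients (any row with grad_b != 0 works).
i = int(np.argmax(np.abs(grad_b)))
x_rec = grad_W[i] / grad_b[i]

recon_err = np.linalg.norm(x_rec - x_true)   # ≈ 0, exact recovery
```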
To tackle this issue, we propose an energy-based prior defined by the product of a base prior distribution and a reweighting factor, designed to bring the base closer to the aggregate posterior.
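A one-dimensional numpy sketch of such a product prior, with a toy base prior and a hand-picked energy (in practice the reweighting factor would be learned to approach the aggregate posterior):

```python
import numpy as np

rng = np.random.default_rng(0)

# Base prior: standard normal. Reweighting factor r(z) = exp(-E(z)),
# with a toy energy that is low near z = 1 (standing in for a region
# favored by the aggregate posterior).
def log_base(z):
    return -0.5 * z**2 - 0.5 * np.log(2 * np.pi)

def energy(z):
    return 0.5 * (z - 1.0) ** 2

# Unnormalized log-density of the product prior p(z) ∝ p_base(z) * exp(-E(z)).
def log_prior_unnorm(z):
    return log_base(z) - energy(z)

# Sample the reweighted prior by self-normalized importance sampling:
# draw from the base prior, weight each draw by exp(-E(z)).
z = rng.normal(size=100_000)
w = np.exp(-energy(z))
w /= w.sum()
# The product of the two Gaussian kernels is ∝ N(0.5, 0.5), so the
# estimated mean shifts from 0 toward 0.5.
mean_est = np.sum(w * z)
```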
VAEBM captures the overall mode structure of the data distribution with a state-of-the-art VAE, and it relies on its EBM component to explicitly exclude non-data-like regions from the model and to refine the image samples.
For example, on CIFAR-10, NVAE pushes the state-of-the-art from 2.98 to 2.91 bits per dimension, and it produces high-quality images on CelebA HQ.
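Bits per dimension is just the model's negative log-likelihood rescaled: bpd = NLL_nats / (D · ln 2), where D is the data dimensionality (3 · 32 · 32 for CIFAR-10). A quick sketch, with a hypothetical per-image NLL chosen for illustration:

```python
import numpy as np

# Convert a per-image negative log-likelihood in nats to bits per
# dimension, the metric in which NVAE's CIFAR-10 result is reported.
D = 3 * 32 * 32              # CIFAR-10 dimensionality
nll_nats = 6195.0            # hypothetical per-image NLL in nats
bpd = nll_nats / (D * np.log(2))
```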
Given pairs of images and captions, we maximize the compatibility between attention-weighted image regions and the words of the corresponding caption, relative to non-corresponding pairs of images and captions.
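The attention-weighted compatibility can be sketched in numpy: each word attends over region features, is compared to its attention-weighted region summary by cosine similarity, and a matching pair should score above a non-matching one. The feature construction below is a toy stand-in, not the paper's actual encoders.

```python
import numpy as np

rng = np.random.default_rng(0)

def attend_and_score(regions, words):
    """Score an image (region features) against a caption (word features):
    each word attends over regions; the word is then compared to its
    attention-weighted region context by cosine similarity."""
    sim = words @ regions.T                       # word-region similarities
    sim = sim - sim.max(axis=1, keepdims=True)    # stabilize the softmax
    attn = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)
    context = attn @ regions                      # attention-weighted regions
    cos = np.sum(words * context, axis=1) / (
        np.linalg.norm(words, axis=1) * np.linalg.norm(context, axis=1))
    return cos.mean()                             # caption-level compatibility

# Toy features: the matching caption's words resemble image regions,
# a non-matching caption is unrelated.
regions = rng.normal(size=(6, 8))
words_match = regions[:3] + 0.1 * rng.normal(size=(3, 8))
words_other = rng.normal(size=(3, 8))

s_pos = attend_and_score(regions, words_match)
s_neg = attend_and_score(regions, words_other)
# A hinge- or softmax-style objective would push s_pos above s_neg.
```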
Our framework brings the best of both worlds, and it enables us to search for architectures with both differentiable and non-differentiable criteria in one unified framework while maintaining a low search cost.
To adapt to the domain shift, the model is trained on the target domain using a set of noisy object bounding boxes obtained from a detection model trained only on the source domain.
Building a large image dataset with high-quality object masks for semantic segmentation is costly and time consuming.
Training of discrete latent variable models remains challenging because passing gradient information through discrete units is difficult.
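A common workaround is the straight-through estimator: sample a hard discrete unit in the forward pass, but backpropagate through it as if it were its smooth relaxation (here the sigmoid-slope variant). A minimal numpy sketch on a hypothetical one-unit task:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy task: drive a stochastic binary unit toward firing (target = 1)
# by adjusting its single logit.
logit = 0.0
target = 1.0
lr = 0.5
for _ in range(200):
    u = rng.uniform()
    b = float(sigmoid(logit) > u)          # hard 0/1 sample (forward pass)
    loss_grad_b = 2.0 * (b - target)       # d/db of (b - target)^2
    # Straight-through backward: pretend db/dlogit is the derivative of
    # the soft sigmoid relaxation instead of the (zero a.e.) hard threshold.
    grad_logit = loss_grad_b * sigmoid(logit) * (1 - sigmoid(logit))
    logit -= lr * grad_logit

p_final = sigmoid(logit)   # firing probability climbs toward 1
```

The estimator is biased, but it gives a usable training signal where the true gradient of the hard threshold is zero almost everywhere.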
Collecting large training datasets, annotated with high-quality labels, is costly and time-consuming.
To model both person-level and group-level dynamics, we present a two-stage deep temporal model for the group activity recognition problem.
In group activity recognition, the temporal dynamics of the whole activity can be inferred based on the dynamics of the individual people representing the activity.
As a concrete example, group activity recognition involves the interactions and relative spatial relations of a set of people in a scene.
We present a novel approach for discovering human interactions in videos.
Many visual recognition problems can be approached by counting instances.