We then finetune different algorithms on our MAW dataset to significantly improve the quality of the reconstructed albedo both quantitatively and qualitatively.
Despite tremendous progress in generating high-quality images using diffusion models, synthesizing a sequence of animated frames that are both photorealistic and temporally coherent is still in its infancy.
We propose LASER, a neuro-symbolic approach that learns semantic video representations by leveraging logic specifications that can capture rich spatial and temporal properties in video data.
In this work, we propose a new contrastive learning approach to train models for skeleton-based action recognition without labels.
First, we show that the latent space of LDMs (z-space) is a better input representation compared to other feature representations like RGB images or CLIP encodings for text-based image segmentation.
To exploit such a structure, we propose a contrastive learning framework where a Euclidean loss is used to learn object representations and a hyperbolic loss is used to encourage representations of scenes to lie close to representations of their constituent objects in a hyperbolic space.
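As a rough illustration of why a hyperbolic loss suits scene-object hierarchies, the sketch below compares Euclidean distance with distance in the Poincaré ball. The choice of the Poincaré ball model is an assumption (the sentence above says only "a hyperbolic space"), and the function names are hypothetical, not taken from the paper.

```python
import math


def euclidean_distance(u, v):
    """Ordinary straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit ball.

    Assumes the Poincare ball model of hyperbolic space:
    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    """
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v))
    return math.acosh(arg)


if __name__ == "__main__":
    origin = [0.0, 0.0]
    for r in (0.5, 0.9, 0.99):
        point = [r, 0.0]
        # Hyperbolic distance grows without bound as the point nears the boundary.
        print(r, euclidean_distance(origin, point), poincare_distance(origin, point))
```

In this model, distances grow without bound toward the boundary of the ball, so a general representation (such as a scene) placed near the origin can stay close to many specific representations (such as objects) placed further out. How the paper turns such a distance into a loss is not stated here.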
Our framework is a minimal and conceptually clean synthesis of (C) contrastive learning, (A) masked autoencoders, and (N) the noise prediction approach used in diffusion models.
Videos are created to express emotion, exchange information, and share experiences.
Ranked #9 on Video Generation on UCF-101
We study the properties of various over-parametrized convolutional neural architectures through their respective Gaussian process and neural tangent kernels.
This assumption is mostly satisfied in datasets such as ImageNet where there is a large, centered object, which is highly likely to be present in random crops of the full image.
We also show that model bias favors texture and shape features differently under different test settings.
The Maneuver Identification Challenge hosted at maneuver-id.mit.edu provides thousands of trajectories collected from pilots practicing in flight simulators, descriptions of maneuvers, and examples of these maneuvers performed by experienced pilots.
Presentation attack detection (PAD) is a critical component in secure face authentication.
Using this, we prove that shift invariance in neural networks produces adversarial examples for the simple case of two classes, each consisting of a single image with a black or white dot on a gray background.
In particular, we show that using activation functions with low (exact or approximate) curvature values has a regularization effect that significantly reduces both the standard and robust generalization gaps in adversarial training.
Recent literature has shown that features obtained from supervised training of CNNs may over-emphasize texture rather than encoding high-level information.
Ranked #18 on Object Detection on PASCAL VOC 2007
Ideally, this results in images from two domains that present shared information to the primary network.
Ranked #2 on Monocular Depth Estimation on Make3D
Therefore, we propose a novel evaluation benchmark to assess the performance of existing AQG systems for long-text answers.
Recent works have partly attributed the generalization ability of over-parameterized neural networks to frequency bias: networks trained with gradient descent on data drawn from a uniform distribution find a low-frequency fit before high-frequency ones.
By training classifiers on top of these feature extractors, we produce new models that inherit the robustness of their parent networks.
Probability density estimation is a classical and well-studied problem, but standard density estimation methods have historically lacked the power to model complex and high-dimensional image distributions.
We propose a general framework to approximately solve large-scale semidefinite problems (SDPs) at low complexity.
We suggest a novel shape matching algorithm for three-dimensional surface meshes of disk or sphere topology.
We show on a modified MNIST dataset that when faced with scale variation, building in scale-invariance allows ConvNets to learn more discriminative features with reduced chances of over-fitting.
We discuss methodological issues related to the evaluation of unsupervised binary code construction methods for nearest neighbor search.