Recent studies show that padding in convolutional neural networks encodes absolute position information, which can negatively affect model performance on certain tasks.
The audio-visual video parsing task aims to temporally parse a video into audio or visual event categories.
Self-supervised learning has recently shown great potential in vision tasks through contrastive learning, which aims to discriminate each image, or instance, in the dataset.
Second, we develop point cloud aggregation modules to gather the style information of the 3D scene, and then modulate the features in the point cloud with a linear transformation matrix.
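A minimal sketch of this kind of feature modulation, assuming a diagonal (scale-and-shift) special case of the linear transformation for simplicity; the pooling choice, dimensions, and 0.1 scale factor are illustrative, not the paper's actual design:

```python
import numpy as np

def modulate(point_features, style_features):
    # Aggregate the style of the 3D scene by mean-pooling per-point style
    # features (a hypothetical stand-in for the aggregation modules).
    style = style_features.mean(axis=0)          # aggregated style vector
    d = point_features.shape[-1]
    # Assumed form: a diagonal linear transformation (scale) plus a shift,
    # both derived from the pooled style summary.
    scale = 1.0 + 0.1 * style[:d]
    shift = 0.1 * style[d:2 * d]
    return point_features * scale + shift

pts = np.random.default_rng(0).standard_normal((100, 32))   # point features
sty = np.random.default_rng(1).standard_normal((100, 64))   # style features
out = modulate(pts, sty)                                    # (100, 32)
```

A full linear transformation matrix would replace the elementwise `scale` with a dense `d × d` matrix predicted from the style summary; the diagonal case keeps the sketch short.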
Our framework consists of two components: an implicit representation of the 3D scene with the neural radiance fields model, and a hypernetwork to transfer the style information into the scene representation.
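The hypernetwork component can be sketched as follows, assuming a single linear hypernetwork layer that predicts the weights of one layer in the scene-representation MLP; all dimensions, names, and the 0.01 initialization scale are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def hypernetwork(style_embedding, in_dim, out_dim, seed=0):
    # Hypothetical one-layer hypernetwork: map a style embedding to the
    # weight matrix of a single layer of the scene MLP.
    rng = np.random.default_rng(seed)
    H = rng.standard_normal((in_dim * out_dim, style_embedding.shape[0])) * 0.01
    w = H @ style_embedding                  # predicted flat parameters
    return w.reshape(out_dim, in_dim)        # reshaped into layer weights

def stylized_layer(features, style_embedding):
    # Run scene features through the layer whose weights the
    # hypernetwork predicted from the style.
    W = hypernetwork(style_embedding, features.shape[-1], features.shape[-1])
    return np.maximum(features @ W.T, 0.0)   # ReLU activation

style = np.ones(16)                          # stand-in style embedding
feats = np.random.default_rng(1).standard_normal((5, 8))
out = stylized_layer(feats, style)           # (5, 8) stylized features
```

Changing `style` changes the predicted weights, and therefore the rendered scene, without retraining the scene representation itself; that is the appeal of the hypernetwork formulation.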
Recent years have witnessed the rapid progress of generative adversarial networks (GANs).
Generating a smooth sequence of intermediate results bridges the gap between two different domains, facilitating the morphing effect across domains.
In recent years, text-guided image manipulation has gained increasing attention in the multimedia and computer vision community.
Image generation from scene description is a cornerstone technique for controlled generation, which is beneficial to applications such as content creation and image editing.
People often create art by following an artistic workflow involving multiple stages that inform the overall design.
With the growing attention on learning-to-learn new tasks using only a few examples, meta-learning has been widely used in numerous problems such as few-shot classification, reinforcement learning, and domain generalization.
Few-shot classification aims to recognize novel categories with only a few labeled images in each class.
This intermediate domain is constructed by translating the source images to mimic the ones in the target domain.
Through extensive experimentation on the ObjectNet3D and Pascal3D+ benchmark datasets, we demonstrate that our framework, which we call MetaView, significantly outperforms fine-tuning the state-of-the-art models with few examples, and that the specific architectural innovations of our method are crucial to achieving good performance.
In this work, we present an approach based on disentangled representation for generating diverse outputs without paired training images.
In this work, we propose a simple yet effective regularization term to address the mode collapse issue for cGANs.
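One common form such a regularizer takes (a hedged sketch in the spirit of mode-seeking regularization; the L1 distance and the exact ratio are assumptions, not necessarily this paper's term) penalizes the generator when two distinct latent codes produce nearly identical outputs:

```python
import numpy as np

def mode_seeking_loss(img1, img2, z1, z2, eps=1e-5):
    # Assumed regularizer: the ratio of latent-code distance to output
    # distance. It grows large when different codes z1, z2 collapse to
    # near-identical images, so minimizing it encourages diverse outputs.
    d_img = np.mean(np.abs(img1 - img2))     # L1 distance between outputs
    d_z = np.mean(np.abs(z1 - z2))           # L1 distance between codes
    return d_z / (d_img + eps)               # large under mode collapse

z1, z2 = np.zeros(4), np.ones(4)
collapsed = mode_seeking_loss(np.zeros((2, 2)), np.full((2, 2), 0.01), z1, z2)
diverse = mode_seeking_loss(np.zeros((2, 2)), np.ones((2, 2)), z1, z2)
# collapsed >> diverse: near-identical outputs incur a much larger penalty
```

Added to the usual cGAN objective with a small weight, this term leaves the adversarial game intact while discouraging the generator from ignoring its latent input.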
Our model takes the encoded content features extracted from a given input and the attribute vectors sampled from the attribute space to produce diverse outputs at test time.
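The content/attribute split at test time can be sketched as below, assuming a toy linear decoder; the decoder form, dimensions, and weight scales are illustrative, not the model's actual networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def decode(content, attribute, W_c, W_a):
    # Hypothetical decoder: combine the fixed content code of one input
    # with a freshly sampled attribute vector.
    return np.tanh(content @ W_c + attribute @ W_a)

content = rng.standard_normal(8)             # content features of one input
W_c = 0.1 * rng.standard_normal((8, 4))      # assumed decoder weights
W_a = 0.5 * rng.standard_normal((3, 4))

# Sampling different attribute vectors from the attribute space yields
# diverse outputs for the same content code.
outputs = [decode(content, rng.standard_normal(3), W_c, W_a) for _ in range(5)]
```

Because only the attribute vector is resampled, the outputs share the input's content while varying in appearance, which is exactly the diversity the disentangled representation is meant to provide.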