Content generated by recent advanced Text-to-Image (T2I) diffusion models is sometimes too imaginative for existing off-the-shelf semantic property predictors to estimate reliably, owing to an irreducible domain gap.
We present an approach to generate a 360-degree view of a person with a consistent, high-resolution appearance from a single input image.
In response to this gap, we introduce MCDiff, a conditional diffusion model that generates a video from a starting image frame and a set of strokes, which allow users to specify the intended content and dynamics for synthesis.
In this work, we propose a pose-guided diffusion model to generate a consistent long-term video of novel views from a single image.
Dynamic radiance field reconstruction methods aim to model the time-varying structure and appearance of a dynamic scene.
Generating images from hand-drawn sketches is a fundamental task in content creation.
For example, it adds style variability to image generation and extension, and brings further flexibility to image-to-image translation.
Recent studies show that paddings in convolutional neural networks encode absolute position information which can negatively affect the model performance for certain tasks.
The audio-visual video parsing task aims to temporally parse a video into audio or visual event categories.
Self-supervised learning has recently shown great potential in vision tasks through contrastive learning, which aims to discriminate each image, or instance, in the dataset.
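As a minimal illustration of this instance-discrimination objective, the sketch below computes an InfoNCE-style contrastive loss over a batch of paired augmented views; the function name, loss form, and temperature value are illustrative assumptions rather than the formulation used in any specific work above.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z_a, z_b, temperature=0.1):
    """Instance discrimination: each image's two augmented views
    (z_a[i], z_b[i]) form a positive pair; all other images in the
    batch act as negatives. The temperature is an illustrative choice."""
    z_a = F.normalize(z_a, dim=1)          # (N, D) embeddings of view A
    z_b = F.normalize(z_b, dim=1)          # (N, D) embeddings of view B
    logits = z_a @ z_b.t() / temperature   # (N, N) cosine-similarity logits
    targets = torch.arange(z_a.size(0), device=z_a.device)  # positives on the diagonal
    return F.cross_entropy(logits, targets)
```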
Second, we develop point cloud aggregation modules to gather the style information of the 3D scene, and then modulate the features in the point cloud with a linear transformation matrix.
Our framework consists of two components: an implicit representation of the 3D scene with the neural radiance fields model, and a hypernetwork to transfer the style information into the scene representation.
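The sketch below shows one way per-point features could be modulated by a style-dependent linear transformation predicted from a style code, in the spirit of the components described above; the module name, layer sizes, and the small hypernetwork design are assumptions for illustration, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class StyleModulation(nn.Module):
    """Hedged sketch: predict a linear transformation matrix from a style
    code and apply it to per-point features of the 3D scene. All
    dimensions and layer sizes are illustrative assumptions."""

    def __init__(self, feat_dim=256, style_dim=128):
        super().__init__()
        # Small hypernetwork mapping the style code to a feat_dim x feat_dim matrix.
        self.to_matrix = nn.Sequential(
            nn.Linear(style_dim, 512), nn.ReLU(),
            nn.Linear(512, feat_dim * feat_dim),
        )

    def forward(self, point_feats, style_code):
        # point_feats: (N, feat_dim) features aggregated over the point cloud
        # style_code:  (style_dim,) global style embedding of the reference image
        N, d = point_feats.shape
        T = self.to_matrix(style_code).view(d, d)  # style-dependent linear transform
        return point_feats @ T.t()                 # modulated per-point features
```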
Recent years have witnessed the rapid progress of generative adversarial networks (GANs).
Generating a smooth sequence of intermediate results bridges the gap between two different domains, facilitating the morphing effect across domains.
In recent years, text-guided image manipulation has gained increasing attention in the multimedia and computer vision community.
Image generation from scene descriptions is a cornerstone technique for controlled generation, benefiting applications such as content creation and image editing.
People often create art by following an artistic workflow involving multiple stages that inform the overall design.
With growing interest in learning to learn new tasks from only a few examples, meta-learning has been widely applied to problems such as few-shot classification, reinforcement learning, and domain generalization.
Few-shot classification aims to recognize novel categories with only a few labeled images in each class.
This intermediate domain is constructed by translating the source images to mimic the ones in the target domain.
Through extensive experimentation on the ObjectNet3D and Pascal3D+ benchmark datasets, we demonstrate that our framework, which we call MetaView, significantly outperforms fine-tuning the state-of-the-art models with few examples, and that the specific architectural innovations of our method are crucial to achieving good performance.
In this work, we present an approach based on disentangled representation for generating diverse outputs without paired training images.
In this work, we propose a simple yet effective regularization term to address the mode collapse issue for cGANs.
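One common form of such a mode-seeking regularizer encourages the generator to map distant latent codes to distant images; the sketch below shows that ratio-based form with L1 distances. It is an assumed instantiation for illustration, not necessarily the exact term proposed.

```python
import torch

def mode_seeking_loss(img1, img2, z1, z2, eps=1e-5):
    """Hedged sketch of a mode-seeking regularizer: maximize the ratio of
    image distance to latent-code distance (returned as a reciprocal so it
    can be minimized alongside the usual cGAN losses). The distance metric
    and weighting are assumptions."""
    d_img = torch.mean(torch.abs(img1 - img2))  # L1 distance between two generated images
    d_z = torch.mean(torch.abs(z1 - z2))        # L1 distance between their latent codes
    return 1.0 / (d_img / (d_z + eps) + eps)
```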
Our model takes the encoded content features extracted from a given input and the attribute vectors sampled from the attribute space to produce diverse outputs at test time.
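A minimal sketch of this content/attribute decoding step is given below, assuming a concatenation-based fusion and an arbitrary convolutional decoder; all names and layer sizes are illustrative rather than the authors' architecture.

```python
import torch
import torch.nn as nn

class DiverseDecoder(nn.Module):
    """Hedged sketch: combine encoded content features with an attribute
    vector sampled from a prior to produce one of many plausible outputs."""

    def __init__(self, content_channels=256, attr_dim=8, out_channels=3):
        super().__init__()
        self.fuse = nn.Conv2d(content_channels + attr_dim, content_channels, kernel_size=1)
        self.decode = nn.Sequential(
            nn.ConvTranspose2d(content_channels, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, out_channels, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, content, attr):
        # content: (B, C, H, W) encoded content features
        # attr:    (B, attr_dim) attribute vector, e.g. sampled from N(0, I) at test time
        B, _, H, W = content.shape
        attr_map = attr.view(B, -1, 1, 1).expand(B, attr.size(1), H, W)
        x = self.fuse(torch.cat([content, attr_map], dim=1))
        return self.decode(x)
```

Sampling different attribute vectors for the same content features then yields diverse outputs at test time.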