OneLLM: One Framework to Align All Modalities with Language

csuhan/onellm 6 Dec 2023

In detail, we first train an image projection module to connect a vision encoder with an LLM.

Question Answering

HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis

sh-lee-prml/hierspeechpp 21 Nov 2023

Furthermore, we significantly improve the naturalness and speaker similarity of synthetic speech even in zero-shot speech synthesis scenarios.

Speech Synthesis, Super-Resolution +2

DeepCache: Accelerating Diffusion Models for Free

horseee/deepcache 1 Dec 2023

Diffusion models have recently gained unprecedented attention in the field of image synthesis due to their remarkable generative capabilities.

Denoising, Image Generation

Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

sunzey/alphaclip 6 Dec 2023

Alpha-CLIP not only preserves the visual recognition ability of CLIP but also enables precise control over the emphasis of image contents.

Aligning and Prompting Everything All at Once for Universal Visual Perception

shenyunhang/ape 4 Dec 2023

However, predominant paradigms, driven by casting instance-level tasks as an object-word alignment, bring heavy cross-modality interaction, which is not effective in prompting object detection and visual grounding.

Object Detection +4

Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models

shi-labs/smooth-diffusion 7 Dec 2023

Specifically, we introduce Step-wise Variation Regularization to enforce that the ratio between the variation of an arbitrary input latent and that of the output image is constant at every diffusion training step.
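
The idea of holding the output-to-input variation ratio constant can be illustrated with a minimal sketch. This is not the authors' loss: the function name `variation_ratio_penalty`, the use of L2 norms, the squared-error form, and the `target_ratio` parameter are all assumptions made for illustration.

```python
import numpy as np

def variation_ratio_penalty(latent_delta, image_delta, target_ratio, eps=1e-8):
    """Hypothetical penalty: mean squared deviation of the per-sample ratio
    ||image_delta|| / ||latent_delta|| from a chosen constant.

    latent_delta: perturbation applied to the input latent, shape (batch, ...)
    image_delta:  resulting change in the output image, shape (batch, ...)
    target_ratio: the constant the ratio is pushed towards (assumed given)
    """
    batch = latent_delta.shape[0]
    # Flatten everything except the batch axis, then take one L2 norm per sample.
    latent_norm = np.linalg.norm(latent_delta.reshape(batch, -1), axis=1)
    image_norm = np.linalg.norm(image_delta.reshape(batch, -1), axis=1)
    ratio = image_norm / (latent_norm + eps)  # eps guards against division by zero
    return float(np.mean((ratio - target_ratio) ** 2))
```

Under these assumptions, a mapping whose output changes by exactly twice the latent perturbation gives a penalty near zero for `target_ratio=2.0` and a penalty near one for `target_ratio=1.0`.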

AnimateZero: Video Diffusion Models are Zero-Shot Image Animators

vvictoryuki/animatezero 6 Dec 2023

For appearance control, we borrow intermediate latents and their features from the text-to-image (T2I) generation to ensure that the generated first frame matches the given generated image.

Image Animation, Video Generation

PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation

zhyever/PatchFusion 4 Dec 2023

Single image depth estimation is a foundational task in computer vision and generative modeling.

Depth Estimation

Generative agent-based modeling with actions grounded in physical, social, or digital space using Concordia

google-deepmind/concordia 6 Dec 2023

Agent-based modeling has been around for decades and has been applied widely across the social and natural sciences.

Common Sense Reasoning

DiffiT: Diffusion Vision Transformers for Image Generation

nvlabs/diffit 4 Dec 2023

We also introduce latent DiffiT, which consists of a transformer model with the proposed self-attention layers, for high-resolution image generation.

Denoising, Image Generation

