OneLLM: One Framework to Align All Modalities with Language

csuhan/onellm 6 Dec 2023

In detail, we first train an image projection module to connect a vision encoder with LLM.

Question Answering

137
2.18 stars / hour

HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis

sh-lee-prml/hierspeechpp 21 Nov 2023

Furthermore, we significantly improve the naturalness and speaker similarity of synthetic speech even in zero-shot speech synthesis scenarios.

Speech Synthesis Super-Resolution +2

713
1.78 stars / hour

DeepCache: Accelerating Diffusion Models for Free

horseee/deepcache 1 Dec 2023

Diffusion models have recently gained unprecedented attention in the field of image synthesis due to their remarkable generative capabilities.

Denoising Image Generation

211
1.74 stars / hour

Alpha-CLIP: A CLIP Model Focusing on Wherever You Want

sunzey/alphaclip 6 Dec 2023

Alpha-CLIP not only preserves the visual recognition ability of CLIP but also enables precise control over the emphasis of image contents.

67
1.66 stars / hour

Aligning and Prompting Everything All at Once for Universal Visual Perception

shenyunhang/ape 4 Dec 2023

However, predominant paradigms, driven by casting instance-level tasks as an object-word alignment, bring heavy cross-modality interaction, which is not effective in prompting object detection and visual grounding.

Object Detection +4

192
1.65 stars / hour

Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models

shi-labs/smooth-diffusion 7 Dec 2023

Specifically, we introduce Step-wise Variation Regularization to enforce that the ratio between the variation of an arbitrary input latent and that of the output image is constant at any diffusion training step.

57
1.58 stars / hour

AnimateZero: Video Diffusion Models are Zero-Shot Image Animators

vvictoryuki/animatezero 6 Dec 2023

For appearance control, we borrow intermediate latents and their features from the text-to-image (T2I) generation to ensure that the generated first frame matches the given generated image.

Image Animation Video Generation

72
1.51 stars / hour

PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation

zhyever/PatchFusion 4 Dec 2023

Single image depth estimation is a foundational task in computer vision and generative modeling.

Depth Estimation

141
1.33 stars / hour

Generative agent-based modeling with actions grounded in physical, social, or digital space using Concordia

google-deepmind/concordia 6 Dec 2023

Agent-based modeling has been around for decades and has been applied widely across the social and natural sciences.

Common Sense Reasoning

59
1.20 stars / hour

DiffiT: Diffusion Vision Transformers for Image Generation

nvlabs/diffit 4 Dec 2023

We also introduce latent DiffiT, which consists of a transformer model with the proposed self-attention layers, for high-resolution image generation.

Denoising Image Generation

134
1.18 stars / hour