Inspired by the Kolmogorov-Arnold representation theorem, we propose Kolmogorov-Arnold Networks (KANs) as promising alternatives to Multi-Layer Perceptrons (MLPs).
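For context, the Kolmogorov-Arnold representation theorem says that any continuous multivariate function on a bounded domain can be written as a finite composition of continuous univariate functions and addition. A standard statement of the theorem (background, not taken from this abstract) is:

```latex
f(x_1, \dots, x_n) = \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right)
```

where the $\phi_{q,p}$ and $\Phi_q$ are continuous univariate functions; KANs replace the fixed activations of MLPs with learnable univariate functions in this spirit.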
This module converts the generated image sequence into videos with smooth transitions and consistent subjects, and is significantly more stable than modules based solely on latent spaces, especially for long video generation.
Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from various LMs.
A handful of visual foundation models (VFMs) have recently emerged as the backbones for numerous downstream tasks.
Finally, we present a customization method using a pair of person-garment images, which significantly improves fidelity and authenticity.
Visual language models (VLMs) have progressed rapidly with the recent success of large language models.
We propose Pure and Lightning ID customization (PuLID), a novel tuning-free ID customization method for text-to-image generation.
Large Language Models (LLMs) have catalyzed significant advancements in Natural Language Processing (NLP), yet they encounter challenges such as hallucination and the need for domain-specific knowledge.
However, this comes with high memory consumption, e.g., a well-trained Gaussian field may utilize three million Gaussian primitives and over 700 MB of memory.
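The 700 MB figure is consistent with a back-of-envelope estimate. The per-primitive layout below is an assumption based on a common 3D Gaussian Splatting parameterization (position, scale, rotation quaternion, opacity, degree-3 spherical-harmonic color), not a detail stated in this abstract:

```python
# Hypothetical float32 layout per 3D Gaussian primitive:
#   position (3) + scale (3) + rotation quaternion (4)
#   + opacity (1) + degree-3 SH color coefficients (16 * 3 = 48)
floats_per_gaussian = 3 + 3 + 4 + 1 + 48   # 59 floats
bytes_per_gaussian = floats_per_gaussian * 4  # float32 = 4 bytes

num_gaussians = 3_000_000
total_mb = num_gaussians * bytes_per_gaussian / 1e6  # decimal megabytes
print(f"{total_mb:.0f} MB")  # → 708 MB, in line with "over 700 MB"
```

Different implementations store extra buffers (gradients, optimizer state), so the on-disk or in-memory footprint can be higher still.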
Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks.