Contrastive learning has emerged as a transformative method for learning effective visual representations through the alignment of image and text embeddings.
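To make the idea of aligning image and text embeddings concrete, here is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss over a batch of matched pairs. The snippet above does not specify the loss, so the function name `contrastive_loss`, the `temperature` value, and the use of cosine similarity are all assumptions for illustration, not the paper's method.

```python
import numpy as np

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss; row i of each input is a matched pair.

    Hypothetical helper: the name, temperature, and cosine similarity
    are illustrative assumptions, not taken from the source.
    """
    # L2-normalise so dot products become cosine similarities.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # shape (batch, batch)

    def diagonal_cross_entropy(scores):
        # Numerically stable log-softmax per row; the correct class for
        # row i is column i (the matched pair).
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
aligned = contrastive_loss(img, img)          # every pair matches
shuffled = contrastive_loss(img, img[::-1])   # pairs deliberately mismatched
```

In this toy setup the aligned batch yields a lower loss than the shuffled one, which is the signal that pulls matched image and text embeddings together during training.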
Starting with a set of pre-trained LoRA adapters, our gating strategy uses the hidden states to dynamically mix adapted layers, allowing the resulting X-LoRA model to draw upon different capabilities and create never-before-used deep layer-wise combinations to solve tasks.
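The gating idea can be sketched for a single layer as follows: a small gate reads the hidden state and produces per-token mixing weights over the adapters. This is a simplified illustration and not the X-LoRA implementation; the function `gated_lora_layer`, the linear gate `W_gate`, and the softmax normalisation are hypothetical choices.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_lora_layer(h, W, adapters, W_gate):
    """Mix several LoRA adapters using weights predicted from h.

    h:        (tokens, d_in) hidden states
    W:        (d_in, d_out) frozen base weight
    adapters: list of (A, B) pairs, A: (d_in, r), B: (r, d_out)
    W_gate:   (d_in, n_adapters) hypothetical linear gate (an assumption)
    """
    # One mixing weight per token and per adapter, derived from h.
    scales = softmax(h @ W_gate)              # (tokens, n_adapters)
    out = h @ W                               # frozen base path
    for k, (A, B) in enumerate(adapters):
        # Add each low-rank update, scaled token by token.
        out = out + scales[:, k:k + 1] * (h @ A @ B)
    return out, scales

rng = np.random.default_rng(0)
h = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 5))
adapters = [(rng.normal(size=(4, 2)), rng.normal(size=(2, 5))) for _ in range(2)]
# A zero gate gives uniform mixing, which makes the result easy to check.
out, scales = gated_lora_layer(h, W, adapters, np.zeros((4, 2)))
```

Applying such a gate at every adapted layer is what would allow different layers to favour different adapters for the same input.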
Finally, we present a customization method using a pair of person-garment images, which significantly improves fidelity and authenticity.
Ranked #1 on Virtual Try-on on VITON-HD
With the rapid advancement of Large Language Models (LLMs), significant progress has been made in multi-agent applications.
The voice styles are neither directly copied from nor constrained by the style of the reference speaker.
This paper introduces FlowMap, an end-to-end differentiable method that solves for precise camera poses, camera intrinsics, and per-frame dense depth of a video sequence.
Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks.
Ranked #6 on Visual Question Answering on MM-Vet
PLLaVA achieves new state-of-the-art performance on modern benchmark datasets for both video question-answer and captioning tasks.
Dense Captioning, Video-based Generative Performance Benchmarking
ConsistentID comprises two key components: a multimodal facial prompt generator that combines facial features, corresponding facial descriptions and the overall facial context to enhance precision in facial details, and an ID-preservation network optimized through the facial attention localization strategy, aimed at preserving ID consistency in facial regions.
While many contemporary large language models (LLMs) can process lengthy input, they still struggle to fully utilize information within the long context, a problem known as the lost-in-the-middle challenge.