Specifically, PrefixQuant identifies high-frequency outlier tokens and prefixes them in the KV cache, preventing the generation of outlier tokens during inference and simplifying quantization.
We propose Pure and Lightning ID customization (PuLID), a novel tuning-free ID customization method for text-to-image generation.
First, we explore control encoding for AR models and propose a lightweight control encoder to transform spatial inputs (e.g., canny edges or depth maps) into control tokens.
The rise of foundation models (FMs), coupled with regulatory efforts addressing their risks and impacts, has sparked significant interest in open-source models.
Video Large Language Models (Video-LLMs) have demonstrated remarkable capabilities in coarse-grained video understanding; however, they struggle with fine-grained temporal grounding.
We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing.
The advent of wearable computers enables a new source of context for AI that is embedded in egocentric sensor data.
Pre-trained vision-language models (VLMs) have shown impressive results in various visual classification tasks.
This work tackles the information loss bottleneck of vector-quantization (VQ) autoregressive image generation by introducing a novel model architecture called the 2-Dimensional Autoregression (DnD) Transformer.
Inference from large autoregressive models like Transformers is slow: decoding K tokens takes K serial runs of the model.
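The serial-decoding bottleneck can be sketched in a few lines: each new token depends on all previously generated ones, so K tokens require K sequential model calls. This is a minimal illustration, not any particular paper's method; `toy_model` is a hypothetical stand-in for a Transformer forward pass.

```python
def toy_model(tokens):
    # Hypothetical "model": predicts the next token as the
    # sum of the context modulo 7. A real Transformer would
    # be a full forward pass over the same context.
    return sum(tokens) % 7

def decode(prompt, k):
    # K serial steps: step i cannot start until step i-1's
    # output token has been appended to the context.
    tokens = list(prompt)
    for _ in range(k):
        tokens.append(toy_model(tokens))
    return tokens[len(prompt):]

print(decode([1, 2, 3], 4))  # → [6, 5, 3, 6]
```

Because each iteration consumes the previous iteration's output, the loop cannot be parallelized across steps, which is what motivates speculative-decoding-style approaches.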