PrefixQuant: Static Quantization Beats Dynamic through Prefixed Outliers in LLMs

chenmnz/prefixquant 7 Oct 2024

Specifically, PrefixQuant identifies high-frequency outlier tokens and prefixes them in the KV cache, preventing the generation of outlier tokens during inference and simplifying quantization.

Common Sense Reasoning Quantization

37
0.31 stars / hour

PuLID: Pure and Lightning ID Customization via Contrastive Alignment

tothebeginning/pulid 24 Apr 2024

We propose Pure and Lightning ID customization (PuLID), a novel tuning-free ID customization method for text-to-image generation.

Text-to-Image Generation

2,249
0.31 stars / hour

ControlAR: Controllable Image Generation with Autoregressive Models

hustvl/controlar 3 Oct 2024

Firstly, we explore control encoding for AR models and propose a lightweight control encoder to transform spatial inputs (e.g., Canny edges or depth maps) into control tokens.

Image Generation

49
0.30 stars / hour

MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages

hlt-mt/mosel 1 Oct 2024

The rise of foundation models (FMs), coupled with regulatory efforts addressing their risks and impacts, has sparked significant interest in open-source models.

Automatic Speech Recognition speech-recognition +1

126
0.29 stars / hour

Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models

whb139426/grounded-video-llm 4 Oct 2024

Video Large Language Models (Video-LLMs) have demonstrated remarkable capabilities in coarse-grained video understanding; however, they struggle with fine-grained temporal grounding.

Dense Video Captioning Sentence +1

34
0.28 stars / hour

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

qwenlm/qwen2-vl 18 Sep 2024

We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing.

Temporal Relation Extraction Visual Question Answering

2,563
0.28 stars / hour

EFM3D: A Benchmark for Measuring Progress Towards 3D Egocentric Foundation Models

facebookresearch/efm3d 14 Jun 2024

The advent of wearable computers enables a new source of context for AI that is embedded in egocentric sensor data.

3D Object Detection 3D Reconstruction +2

75
0.28 stars / hour

AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation

MCG-NJU/AWT 5 Jul 2024

Pre-trained vision-language models (VLMs) have shown impressive results in various visual classification tasks.

Action Recognition Few-Shot Image Classification +3

67
0.28 stars / hour

A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation

chenllliang/dnd-transformer 2 Oct 2024

This work tackles the information loss bottleneck of vector-quantization (VQ) autoregressive image generation by introducing a novel model architecture called the 2-Dimensional Autoregression (DnD) Transformer.

Image Generation Quantization

38
0.28 stars / hour

Fast Inference from Transformers via Speculative Decoding

ericlbuehler/mistral.rs 30 Nov 2022

Inference from large autoregressive models like Transformers is slow - decoding K tokens takes K serial runs of the model.

Language Modelling

3,704
0.28 stars / hour
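
The serial cost noted in the speculative decoding entry above (decoding K tokens takes K serial runs of the model) can be illustrated with a minimal sketch. This shows only the plain autoregressive baseline, not the speculative decoding method itself. The names `greedy_decode` and `CountingToyModel` are hypothetical, and the toy model is an assumed stand-in for a real language model.

```python
from typing import Callable, List


def greedy_decode(next_token: Callable[[List[int]], int],
                  prompt: List[int], k: int) -> List[int]:
    """Generate k tokens one at a time; each new token needs its own model call."""
    tokens = list(prompt)
    for _ in range(k):
        # Step i depends on every token produced before it, so the calls
        # cannot run in parallel in this baseline loop.
        tokens.append(next_token(tokens))
    return tokens


class CountingToyModel:
    """Hypothetical stand-in for a language model: emits (last token + 1) mod 10."""

    def __init__(self) -> None:
        self.calls = 0  # number of forward passes so far

    def __call__(self, tokens: List[int]) -> int:
        self.calls += 1
        return (tokens[-1] + 1) % 10


model = CountingToyModel()
output = greedy_decode(model, prompt=[3], k=5)
print(output, model.calls)
```

Here `model.calls` ends up equal to `k`, which is the serial bottleneck that the entry describes.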