$100K or 100 Days: Trade-offs when Pre-Training with Academic Resources

apoorvkh/academic-pretraining 30 Oct 2024

We introduce a benchmark to measure the time to pre-train models on given GPUs and also identify ideal settings for maximizing training speed.

64 stars · 1.44 stars / hour
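
The core measurement behind such a benchmark is wall-clock training throughput for a fixed configuration. Below is a minimal sketch of that idea in PyTorch, timing optimizer steps for a small Transformer on synthetic data; the model size, batch shape, and step count are illustrative assumptions, not the paper's protocol.

```python
# Minimal training-throughput sketch (assumed setup, not the paper's benchmark):
# time a fixed number of optimizer steps and report tokens per second.
import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
vocab, seq_len, batch_size, steps = 32_000, 512, 8, 20

model = nn.Sequential(
    nn.Embedding(vocab, 768),
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
        num_layers=4,
    ),
    nn.Linear(768, vocab),
).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab, (batch_size, seq_len), device=device)  # synthetic batch

# One warm-up step so lazy CUDA initialization is not included in the timing.
opt.zero_grad(); loss_fn(model(tokens).transpose(1, 2), tokens).backward(); opt.step()

if device == "cuda":
    torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(steps):
    opt.zero_grad()
    loss = loss_fn(model(tokens).transpose(1, 2), tokens)
    loss.backward()
    opt.step()
if device == "cuda":
    torch.cuda.synchronize()
elapsed = time.perf_counter() - start

print(f"{steps * batch_size * seq_len / elapsed:,.0f} tokens/s")
```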

MIA-Tuner: Adapting Large Language Models as Pre-training Text Detector

wjfu99/mia-tuner 16 Aug 2024

Existing studies have partially addressed this need through an exploration of the pre-training data detection problem, which is an instance of a membership inference attack (MIA).

Inference Attack · Membership Inference Attack

46 stars · 1.26 stars / hour
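
A common baseline for pre-training data detection is to threshold the model's loss on the candidate text, since training-set members tend to receive lower loss. The sketch below shows that loss-threshold baseline with Hugging Face transformers; the stand-in model and threshold are illustrative assumptions, and MIA-Tuner's own contribution is to adapt the LLM itself as the detector rather than rely on a fixed threshold.

```python
# Loss-threshold membership inference baseline (illustrative; not MIA-Tuner's method).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumed stand-in model for the sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def nll(text: str) -> float:
    """Average per-token negative log-likelihood of `text` under the model."""
    ids = tok(text, return_tensors="pt").input_ids
    return model(ids, labels=ids).loss.item()

candidate = "The quick brown fox jumps over the lazy dog."
threshold = 3.5  # assumed calibration value; in practice tuned on held-out data
score = nll(candidate)
print(f"NLL={score:.2f} -> {'likely member' if score < threshold else 'likely non-member'}")
```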

Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant

homebrewltd/ichigo 20 Oct 2024

Large Language Models (LLMs) have revolutionized natural language processing, but their application to speech-based tasks remains challenging due to the complexities of integrating audio and text modalities.

Question Answering · Speech Recognition +1

1,374 stars · 1.13 stars / hour

Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities

gpt-omni/mini-omni2 15 Oct 2024

It can understand visual, auditory, and textual modalities, directly output audio, and support flexible duplex interaction.

Language Modelling

1,328 stars · 1.03 stars / hour

Moonshine: Speech Recognition for Live Transcription and Voice Commands

usefulsensors/moonshine 21 Oct 2024

This paper introduces Moonshine, a family of speech recognition models optimized for live transcription and voice command processing.

Decoder · Position +2

1,838 stars · 0.97 stars / hour

In-Context LoRA for Diffusion Transformers

ali-vilab/In-Context-LoRA 31 Oct 2024

While task-specific in terms of tuning data, our framework remains task-agnostic in architecture and pipeline, offering a powerful tool for the community and providing valuable insights for further research on product-level task-agnostic generation systems.

Image Generation

24 stars · 0.96 stars / hour
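
Because the deliverables are ordinary LoRA adapters, they plug into a standard diffusion pipeline without architectural changes. A hedged sketch with diffusers follows: the base model is FLUX.1-dev as in the project, but the LoRA repository id and weight filename below are placeholders, not verified paths.

```python
# Loading a task-specific LoRA onto an unchanged text-to-image pipeline (sketch).
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# Placeholder LoRA location -- substitute the actual In-Context-LoRA weights.
pipe.load_lora_weights("ali-vilab/In-Context-LoRA", weight_name="example-task.safetensors")

# A single prompt describing a multi-panel image; the LoRA supplies the task behavior.
image = pipe(
    "Two side-by-side panels: a product photo and the same product in a lifestyle scene",
    num_inference_steps=28,
    guidance_scale=3.5,
).images[0]
image.save("in_context_lora_sample.png")
```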

LightRAG: Simple and Fast Retrieval-Augmented Generation

hkuds/lightrag 8 Oct 2024

Retrieval-Augmented Generation (RAG) systems enhance large language models (LLMs) by integrating external knowledge sources, enabling more accurate and contextually relevant responses tailored to user needs.

Information Retrieval · RAG +1

6,780 stars · 0.86 stars / hour
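
The underlying RAG pattern is compact: embed a corpus, retrieve the chunks nearest to the query, and prepend them to the prompt. The sketch below shows that generic loop only; it is not LightRAG's API, which additionally builds a graph index over entities and relations, and the embedding model name is an assumption.

```python
# Generic retrieve-then-generate loop (illustrative; not the LightRAG API).
import numpy as np
from sentence_transformers import SentenceTransformer

docs = [
    "LightRAG builds a graph index over entities and relations in the corpus.",
    "Retrieval-augmented generation grounds LLM answers in external documents.",
    "Vector search returns the chunks most similar to the query embedding.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k chunks with highest cosine similarity to the query."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q  # cosine similarity, since vectors are normalized
    return [docs[i] for i in np.argsort(-scores)[:k]]

query = "How does LightRAG organize its knowledge?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)  # pass the assembled prompt to any LLM
```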

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Vision-CAIR/LongVU 22 Oct 2024

Given a light-weight LLM, our LongVU also scales effectively into a smaller size with state-of-the-art video understanding performance.

Token Reduction · Video Understanding +1

215 stars · 0.84 stars / hour
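
The intuition behind spatiotemporal compression is that adjacent video frames are often near-duplicates, so their visual tokens can be dropped before they reach the LLM. Below is a generic sketch of similarity-based frame pruning; it illustrates the idea only, whereas LongVU's actual pipeline uses learned visual features and query-aware token selection.

```python
# Drop near-duplicate frames by cosine similarity of per-frame features (sketch).
import torch

def prune_frames(frame_feats: torch.Tensor, threshold: float = 0.95) -> list[int]:
    """frame_feats: (num_frames, dim) pooled features. Keep a frame only if it
    differs enough from the most recently kept frame."""
    feats = torch.nn.functional.normalize(frame_feats, dim=-1)
    kept = [0]  # always keep the first frame
    for i in range(1, feats.shape[0]):
        if feats[i] @ feats[kept[-1]] < threshold:  # cosine similarity
            kept.append(i)
    return kept

# Synthetic example: 16 frames, 256-dim features, frames 8-15 duplicate frame 7.
feats = torch.randn(16, 256)
feats[8:] = feats[7]
print(prune_frames(feats))  # the redundant tail frames are dropped
```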

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference

bytedance/ShadowKV 28 Oct 2024

By evaluating ShadowKV on a broad range of benchmarks, including RULER, LongBench, and Needle In A Haystack, and models like Llama-3.1-8B, Llama-3-8B-1M, GLM-4-9B-1M, Yi-9B-200K, Phi-3-Mini-128K, and Qwen2-7B-128K, we demonstrate that it can support up to 6$\times$ larger batch sizes and boost throughput by up to 3.04$\times$ on an A100 GPU without sacrificing accuracy, even surpassing the performance achievable with infinite batch size under the assumption of infinite GPU memory.

61 stars · 0.82 stars / hour
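
The memory pressure ShadowKV relieves is easy to quantify: at long context, the full-precision KV cache dwarfs the model weights. A back-of-the-envelope calculation using Llama-3-8B's public dimensions (32 layers, 8 KV heads, head dim 128, fp16) is shown below; the batch size and context length are illustrative, not the paper's exact settings.

```python
# KV-cache size: layers * kv_heads * head_dim * 2 (K and V) * bytes/element, per token.
layers, kv_heads, head_dim, bytes_fp16 = 32, 8, 128, 2
per_token = layers * kv_heads * head_dim * 2 * bytes_fp16  # 131,072 bytes = 128 KiB

context_len, batch_size = 128_000, 6
total = per_token * context_len * batch_size

print(f"{per_token / 1024:.0f} KiB of KV cache per token")
print(f"{per_token * context_len / 2**30:.1f} GiB per sequence at {context_len} tokens")
print(f"{total / 2**30:.1f} GiB for batch size {batch_size}")  # well beyond one 80 GB A100
```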

Grounding Image Matching in 3D with MASt3R

naver/mast3r 14 Jun 2024

Image Matching is a core component of all best-performing algorithms and pipelines in 3D vision.

3D Reconstruction

1,262 stars · 0.78 stars / hour