AudioLCM: Text-to-Audio Generation with Latent Consistency Models

Text-to-Audio/AudioLCM 1 Jun 2024

To overcome the convergence issue inherent in LDMs with reduced sample iterations, we propose the Guided Latent Consistency Distillation with a multi-step Ordinary Differential Equation (ODE) solver.

Audio Generation Audio Synthesis

Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion

buoyancy99/diffusion-forcing 1 Jul 2024

This paper presents Diffusion Forcing, a new training paradigm where a diffusion model is trained to denoise a set of tokens with independent per-token noise levels.

Decision Making
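
A minimal sketch of the "independent per-token noise levels" idea from the Diffusion Forcing summary above. It assumes a standard DDPM-style forward process; the schedule `alpha_bar`, the helper name `add_per_token_noise`, and the array shapes are hypothetical choices for illustration and are not taken from the paper or its repository.

```python
import numpy as np

def add_per_token_noise(x0, alpha_bar, rng):
    """Noise each token of x0 at its own independently sampled level.

    x0:        (num_tokens, dim) clean token sequence
    alpha_bar: (num_levels,) cumulative signal fractions in (0, 1]
    Returns the noised tokens, the per-token level indices, and the noise.
    """
    num_tokens = x0.shape[0]
    # One independent noise level per token, rather than one per sequence.
    levels = rng.integers(0, len(alpha_bar), size=num_tokens)
    a = alpha_bar[levels][:, None]  # shape (num_tokens, 1), broadcast over dim
    eps = rng.standard_normal(x0.shape)
    # Assumed DDPM-style forward step, applied token by token.
    x_noisy = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_noisy, levels, eps

rng = np.random.default_rng(0)
alpha_bar = np.linspace(1.0, 0.01, 10)  # hypothetical schedule; level 0 is clean
x0 = rng.standard_normal((6, 4))
x_noisy, levels, eps = add_per_token_noise(x0, alpha_bar, rng)
```

A denoiser trained under this scheme would receive `x_noisy` together with `levels`, so it can be conditioned on a different noise level for every token.
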

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

lxtgh/omg-seg 27 Jun 2024

Current universal segmentation methods demonstrate strong capabilities in pixel-level image and video understanding.

Decoder Segmentation +2

MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis

fusiming3/mars 10 Jul 2024

Auto-regressive models have made significant progress in the realm of language generation, yet they do not perform on par with diffusion models in the domain of image synthesis.

Image Generation Text Generation

Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head

om-ai-lab/OmDet 11 Mar 2024

End-to-end transformer-based detectors (DETRs) have shown exceptional performance in both closed-set and open-vocabulary object detection (OVD) tasks through the integration of language modalities.

Open Vocabulary Object Detection Real-Time Object Detection

Agentless: Demystifying LLM-based Software Engineering Agents

OpenAutoCoder/Agentless 1 Jul 2024

However, the complexity of these agent-based approaches, together with the limited abilities of current LLMs, raises the following question: Do we really have to employ complex autonomous software agents?

Program Repair

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

LLaVA-VL/LLaVA-NeXT 10 Jul 2024

To this end, we introduce LLaVA-NeXT-Interleave, which simultaneously tackles Multi-image, Multi-frame (video), Multi-view (3D), and Multi-patch (single-image) scenarios in LMMs.

Inference Performance Optimization for Large Language Models on CPUs

intel/xfastertransformer 10 Jul 2024

Large language models (LLMs) have shown exceptional performance and vast potential across diverse tasks.

Inference Optimization

TextGrad: Automatic "Differentiation" via Text

zou-group/textgrad 11 Jun 2024

Without modifying the framework, TextGrad improves the zero-shot accuracy of GPT-4o in Google-Proof Question Answering from 51% to 55%, yields 20% relative performance gain in optimizing LeetCode-Hard coding problem solutions, improves prompts for reasoning, designs new druglike small molecules with desirable in silico binding, and designs radiation oncology treatment plans with high specificity.

Question Answering Specificity

A Comprehensive Survey on Human Video Generation: Challenges, Methods, and Insights

wentaol86/awesome-human-body-video-generation 11 Jul 2024

The goal of this survey is to offer the research community a clear and holistic view of the advancements in human video generation, highlighting the milestones achieved and the challenges that lie ahead.

Video Generation

