While large language models have shown reasoning capabilities, their application to the audio modality, particularly in large audio-language models (ALMs), remains significantly underdeveloped.
To address these challenges, we establish a benchmark dataset and propose a unified model, SM3Det (Single Model for Multi-Modal datasets and Multi-Task object Detection).
Agentic tasks, which require multi-step problem solving with autonomy, tool use, and adaptive reasoning, are becoming increasingly central to the advancement of NLP and AI.
Most Reinforcement Learning (RL) environments are created by adapting existing physics simulators or video games.
The era of intelligent agents is upon us, driven by revolutionary advancements in large language models.
Based on DeltaSpace, we propose a novel framework called DeltaEdit, which maps the CLIP visual feature differences to the latent space directions of a generative model during the training phase, and predicts the latent space directions from the CLIP textual feature differences during the inference phase.
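A minimal sketch of the training/inference split described for DeltaEdit, assuming a StyleGAN-like latent space and a CLIP encoder; the mapper architecture, layer sizes, and placeholder tensors below are illustrative assumptions, not the paper's exact design.

```python
# Illustrative sketch: train a mapper on CLIP *image* feature differences ->
# generator latent directions, then reuse it at inference with CLIP *text*
# feature differences. Shapes and layer sizes are assumptions.
import torch
import torch.nn as nn

class DeltaMapper(nn.Module):
    """Maps a CLIP feature difference to a latent-space direction (hypothetical sizes)."""
    def __init__(self, clip_dim: int = 512, latent_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(clip_dim, 1024), nn.ReLU(),
            nn.Linear(1024, latent_dim),
        )

    def forward(self, delta_clip: torch.Tensor) -> torch.Tensor:
        return self.net(delta_clip)

mapper = DeltaMapper()

# Training phase: supervise with image pairs whose latent codes are known.
# delta_img_feat = clip_image(img_b) - clip_image(img_a)   (precomputed)
# delta_latent   = latent_b - latent_a                     (precomputed)
delta_img_feat = torch.randn(8, 512)   # placeholder CLIP image-feature differences
delta_latent = torch.randn(8, 512)     # placeholder generator latent differences
loss = nn.functional.mse_loss(mapper(delta_img_feat), delta_latent)
loss.backward()

# Inference phase: text-driven editing, no paired image supervision needed.
# delta_txt_feat = clip_text("a smiling face") - clip_text("a face")
delta_txt_feat = torch.randn(1, 512)   # placeholder CLIP text-feature difference
edit_direction = mapper(delta_txt_feat)  # apply to a latent code, then decode
```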
Deep learning has revolutionized medical image segmentation, yet its full potential remains constrained by the paucity of annotated datasets.
In this paper, we provide a comprehensive review of deep learning-based image anomaly detection techniques, from the perspectives of neural network architectures, levels of supervision, loss functions, metrics and datasets.
This paper explores a simple extension of diffusion-based rectified flow Transformers to text-to-music generation, termed FluxMusic.
Ranked #2 on Text-to-Music Generation on MusicCaps.
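A minimal sketch of the rectified-flow training objective that models like FluxMusic build on; the tiny velocity network, latent shapes, and text-conditioning tensors are placeholder assumptions standing in for the actual flow Transformer.

```python
# Rectified-flow objective (assumed formulation): linearly interpolate between
# noise x0 and data x1, and regress the constant velocity x1 - x0.
import torch
import torch.nn as nn

class TinyVelocityNet(nn.Module):
    """Stand-in for the flow Transformer: predicts velocity from (x_t, t, text)."""
    def __init__(self, dim: int = 64, text_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1 + text_dim, 256), nn.SiLU(),
            nn.Linear(256, dim),
        )

    def forward(self, x_t, t, text_emb):
        return self.net(torch.cat([x_t, t, text_emb], dim=-1))

model = TinyVelocityNet()
x1 = torch.randn(16, 64)        # placeholder music latents (data samples)
x0 = torch.randn(16, 64)        # Gaussian noise
text_emb = torch.randn(16, 32)  # placeholder text-condition embeddings
t = torch.rand(16, 1)           # time sampled uniformly in [0, 1]

x_t = (1.0 - t) * x0 + t * x1   # straight-line interpolation
v_target = x1 - x0              # constant velocity along the straight path
loss = nn.functional.mse_loss(model(x_t, t, text_emb), v_target)
loss.backward()
```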
While Transformer self-attention offers strong parallelism, the Key-Value (KV) cache grows linearly with sequence length and becomes a bottleneck for inference efficiency.
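A back-of-the-envelope illustration of that linear growth, using assumed model dimensions (32 layers, 32 heads, head size 128, fp16); the numbers are for intuition only, not tied to any specific model.

```python
# Linear KV-cache growth during autoregressive decoding (illustrative sizes).
import torch

n_layers, n_heads, head_dim = 32, 32, 128
bytes_per_elem = 2  # fp16

def kv_cache_bytes(seq_len: int, batch: int = 1) -> int:
    # Two tensors (K and V) per layer, each [batch, n_heads, seq_len, head_dim].
    return 2 * n_layers * batch * n_heads * seq_len * head_dim * bytes_per_elem

for seq_len in (1_024, 8_192, 32_768):
    print(f"{seq_len:>6} tokens -> {kv_cache_bytes(seq_len) / 2**30:.1f} GiB per sequence")
# 1024 tokens -> 0.5 GiB; 8192 -> 4.0 GiB; 32768 -> 16.0 GiB under these assumptions.

# At decode time the cache is extended by one key/value pair per generated token:
k_cache = torch.empty(1, n_heads, 0, head_dim)
v_cache = torch.empty(1, n_heads, 0, head_dim)
new_k = torch.randn(1, n_heads, 1, head_dim)
new_v = torch.randn(1, n_heads, 1, head_dim)
k_cache = torch.cat([k_cache, new_k], dim=2)   # memory grows with every token
v_cache = torch.cat([v_cache, new_v], dim=2)
```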