In this work, we make the first attempt to fine-tune all-modality models (i.e., models that take input and produce output in any modality, also known as any-to-any models) using human preference data across all modalities (including text, image, audio, and video), ensuring that their behavior aligns with human intentions.
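Although the abstract does not name a specific objective, preference fine-tuning of this kind is often implemented with a pairwise loss such as DPO; the following is a minimal illustrative sketch, not the paper's method, and the function and argument names are hypothetical:

import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Log-ratios of the trained policy against a frozen reference model
    # for the human-preferred (chosen) and dispreferred (rejected) outputs.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # DPO pushes the margin between the two log-ratios upward.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()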
We present FireRedASR, a family of large-scale automatic speech recognition (ASR) models for Mandarin, designed to meet the diverse demands of various applications for both superior performance and optimal efficiency.
Lastly, we propose a theory of why the RLSP search strategy is better suited to LLMs, inspired by a remarkable result showing that CoT provably increases the computational power of LLMs, and that this power grows with the number of CoT steps \cite{li2024chain, merrill2023expresssive}.
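Stated informally (a paraphrase under the cited papers' formal assumptions, such as log-precision transformers, rather than their exact theorem statements):

\[
\text{no CoT:}\ \mathsf{TC}^0
\qquad\longrightarrow\qquad
\mathrm{poly}(n)\ \text{CoT steps:}\ \mathsf{P}
\]

That is, a fixed-depth transformer without intermediate steps decides only languages in a small circuit class, while polynomially many CoT steps let it simulate any polynomial-time computation, so expressive power grows with the step budget.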
We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL, through two major upgrades.
Ranked #1 on Referring Expression Comprehension (RefCOCOg-test).
While conventional wisdom suggests that sophisticated reasoning tasks demand extensive training data (>100,000 examples), we demonstrate that complex mathematical reasoning abilities can be effectively elicited with surprisingly few examples.
Chest X-rays (CXRs) play an integral role in driving critical decisions in disease management and patient care.
Advances in foundation modeling have reshaped computational pathology.
To overcome these challenges, we introduce a specialized cognitive module, temporal working memory (TWM), which aims to enhance the temporal modeling capabilities of multimodal foundation models (MFMs).
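The abstract does not specify TWM's mechanism; one plausible reading is a query-guided buffer that retains only the video segments most relevant to the current query under a fixed capacity. The sketch below is purely illustrative, and all names (select_segments, frame_embs, query_emb) are hypothetical:

import torch
import torch.nn.functional as F

def select_segments(frame_embs: torch.Tensor, query_emb: torch.Tensor, capacity: int):
    # Cosine similarity between each frame embedding and the query embedding.
    sims = F.cosine_similarity(frame_embs, query_emb.unsqueeze(0), dim=-1)
    # Keep the top-`capacity` frames, then restore temporal order.
    idx = sims.topk(min(capacity, frame_embs.size(0))).indices.sort().values
    return frame_embs[idx], idx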
The quantification of audio aesthetics remains a complex challenge in audio processing, primarily because aesthetic judgments are subjective, shaped by human perception and cultural context.
We further introduce adapter modules that enable fine-tuning toward any given property constraint using a labeled dataset.
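The abstract names adapters without describing their form; a common choice is a bottleneck adapter (down-projection, nonlinearity, up-projection, residual) inserted into a frozen backbone and trained on the labeled property data. A minimal sketch under that assumption, with the class name and sizes illustrative rather than taken from the paper:

import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, d_model: int, d_bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, d_bottleneck)  # compress features
        self.act = nn.GELU()
        self.up = nn.Linear(d_bottleneck, d_model)    # restore width
        nn.init.zeros_(self.up.weight)                # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Residual connection preserves the frozen backbone's behavior at init.
        return h + self.up(self.act(self.down(h)))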