We present MM-Eureka, a multimodal reasoning model that successfully extends large-scale rule-based reinforcement learning (RL) to the multimodal setting.
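To make "rule-based" concrete, the sketch below shows one common form such a reward takes: a fixed verification rule rather than a learned reward model. The function name and the boxed-answer extraction heuristic are illustrative assumptions, not MM-Eureka's actual implementation.

```python
import re

def rule_based_reward(response: str, ground_truth: str) -> float:
    """Hypothetical rule-based reward: 1.0 if the model's final boxed
    answer matches the reference, else 0.0. The signal comes from a
    fixed verification rule, not a learned reward model."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0  # no parseable final answer
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0
```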
We propose a new family of policy gradient methods for reinforcement learning, which alternate between sampling data through interaction with the environment and optimizing a "surrogate" objective function using stochastic gradient ascent.
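The best-known member of this family is PPO's clipped surrogate objective, L^CLIP(θ) = E_t[min(r_t(θ)A_t, clip(r_t(θ), 1−ε, 1+ε)A_t)], where r_t(θ) is the probability ratio between the current and data-collecting policies. A minimal PyTorch sketch follows; tensor names and the clipping constant are illustrative.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective from PPO (Schulman et al., 2017).
    logp_new / logp_old: log-probabilities of the taken actions under
    the current and data-collecting policies; advantages: estimated A_t."""
    ratio = torch.exp(logp_new - logp_old)  # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Elementwise minimum of the two surrogates; negate because
    # optimizers minimize while the objective is maximized.
    return -torch.min(unclipped, clipped).mean()
```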
To effectively train HealthGPT, we construct a comprehensive medical domain-specific comprehension and generation dataset called VL-Health.
Automatically generating presentations from documents is a challenging task that requires balancing content quality, visual design, and structural coherence.
To tackle this issue, we propose the Frequency & Attention-driven Masking and Throwing Strategy (FAMT), which extracts semantically informative patches and reduces the number of training patches, boosting model performance and training efficiency simultaneously.
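As a rough illustration of the attention-driven "throwing" idea, the sketch below keeps only the patches most attended by the [CLS] token and discards the rest, shrinking the training sequence. The keep ratio and function name are our assumptions; this is not the exact FAMT procedure, which also draws on frequency-domain cues.

```python
import torch

def select_patches_by_attention(patch_tokens, cls_attention, keep_ratio=0.5):
    """Illustrative attention-driven patch throwing (sketch only).

    patch_tokens:  (B, N, D) patch embeddings
    cls_attention: (B, N) attention weights from the [CLS] token
    Returns the (B, k, D) embeddings of the k most-attended patches.
    """
    num_keep = max(1, int(patch_tokens.size(1) * keep_ratio))
    # Indices of the most-attended patches per sample.
    top_idx = cls_attention.topk(num_keep, dim=1).indices  # (B, k)
    # Gather the corresponding patch embeddings.
    idx = top_idx.unsqueeze(-1).expand(-1, -1, patch_tokens.size(-1))
    return torch.gather(patch_tokens, dim=1, index=idx)  # (B, k, D)
```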
Deep supervised learning algorithms typically require a large volume of labeled data to achieve satisfactory performance.
Fine-tuning AF2 on LongAudio leads to exceptional performance on our proposed LongAudioBench, an expert-annotated benchmark for evaluating ALMs on long audio understanding capabilities.
We first analyze the limitations of current Multimodal Large Language Models (MLLMs) in this area: they struggle to accurately comprehend basic geometric elements and their relationships.
Recent advances in text-to-image generation have primarily relied on extensive datasets and parameter-heavy architectures.
Our modified model architecture and training recipe achieve both better training stability and improved per-token efficiency.