Chest X-rays (CXRs) play an integral role in driving critical decisions in disease management and patient care.
In this work, we make the first attempt to fine-tune all-modality models (i.e., models whose input and output can be any modality, also known as any-to-any models) using human preference data across all modalities (including text, image, audio, and video), ensuring their behavior aligns with human intentions.
DiT diffusion models have achieved great success in text-to-video generation, leveraging their scalability in model capacity and data scale.
DiT-based video generation has achieved remarkable results, but research on enhancing existing models remains relatively limited.
IntellAgent represents a paradigm shift in evaluating conversational AI.
To overcome these challenges, we introduce a specialized cognitive module, temporal working memory (TWM), which aims to enhance the temporal modeling capabilities of MFMs.
While conventional wisdom suggests that sophisticated reasoning tasks demand extensive training data (>100,000 examples), we demonstrate that complex mathematical reasoning abilities can be effectively elicited with surprisingly few examples.
We present a generative image inpainting system to complete images with free-form mask and guidance.
Ranked #3 on Image Inpainting on Places2 val.
Quantifying audio aesthetics remains a complex challenge in audio processing, primarily because of its subjective nature, which is shaped by human perception and cultural context.