Multilingual automatic speech recognition (ASR) in the medical domain serves as a foundational task for various downstream applications such as speech translation, spoken language understanding, and voice-activated assistants.
Automatic Speech Recognition
Automatic Speech Recognition (ASR)
+4
In this study, we propose a highly-consistent data synthesis pipeline to tackle this challenge.
Conditional Image Generation
Personalized Image Generation
+1
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration.
Ranked #2 on
Multimodal Machine Translation
on Multi30K
(BLUE (DE-EN) metric)
In this work, we revisit GRPO from a reinforce-like algorithm perspective and analyze its core components.
The refined monodepth is in turn guides stereo effectively at ill-posed regions.
The capacity for complex mathematical reasoning is a key benchmark for artificial intelligence.
We present VGGT, a feed-forward neural network that directly infers all key 3D attributes of a scene, including camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of its views.
This paper presents SkyReels-A2, a controllable video generation framework capable of assembling arbitrary visual elements (e. g., characters, objects, backgrounds) into synthesized videos based on textual prompts while maintaining strict consistency with reference images for each element.
We propose NdLinear as a drop-in replacement for standard linear layers -- marking an important step toward next-generation neural architectures.
In this work, we introduce an adaptive multi-agent framework designed to enhance collaborative reasoning through both model-level training and system-level coordination.