MultiMed: Multilingual Medical Speech Recognition via Attention Encoder Decoder

leduckhai/multimed 21 Sep 2024

Multilingual automatic speech recognition (ASR) in the medical domain serves as a foundational task for various downstream applications such as speech translation, spoken language understanding, and voice-activated assistants.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

169
0.52 stars / hour

Less-to-More Generalization: Unlocking More Controllability by In-Context Generation

bytedance/uno 2 Apr 2025

In this study, we propose a highly consistent data synthesis pipeline to tackle this challenge.

Conditional Image Generation Personalized Image Generation +1

845
0.50 stars / hour

Attention Is All You Need

exa-labs/exa-mcp-server NeurIPS 2017

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration.

Ranked #2 on Multimodal Machine Translation on Multi30K (BLEU (DE-EN) metric)

Abstractive Text Summarization All +12

703
0.50 stars / hour

A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce

rlhflow/minimal-rl 15 Apr 2025

In this work, we revisit GRPO from a reinforce-like algorithm perspective and analyze its core components.

Reinforcement Learning (RL)

99
0.49 stars / hour

MonSter: Marry Monodepth to Stereo Unleashes Power

junda24/monster 15 Jan 2025

The refined monodepth in turn guides stereo effectively in ill-posed regions.

Monocular Depth Estimation Stereo Matching +1

344
0.48 stars / hour

VGGT: Visual Geometry Grounded Transformer

facebookresearch/vggt 14 Mar 2025

We present VGGT, a feed-forward neural network that directly infers all key 3D attributes of a scene, including camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of its views.

Depth Estimation Novel View Synthesis +3

5,318
0.42 stars / hour

SkyReels-A2: Compose Anything in Video Diffusion Transformers

skyworkai/skyreels-a2 3 Apr 2025

This paper presents SkyReels-A2, a controllable video generation framework capable of assembling arbitrary visual elements (e.g., characters, objects, backgrounds) into synthesized videos based on textual prompts while maintaining strict consistency with reference images for each element.

Video Generation

385
0.41 stars / hour

NdLinear Is All You Need for Representation Learning

ensemble-core/ndlinear 21 Mar 2025

We propose NdLinear as a drop-in replacement for standard linear layers -- marking an important step toward next-generation neural architectures.

All Representation Learning

236
0.40 stars / hour

Two Heads are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning

jincan333/MAS-TTS 14 Apr 2025

In this work, we introduce an adaptive multi-agent framework designed to enhance collaborative reasoning through both model-level training and system-level coordination.

Mathematical Reasoning mbpp

48
0.38 stars / hour