A Diffusion Model and Knowledge Distillation Framework for Robust Coral Detection in Complex Underwater Environments

RDXiaoLu/MambaCoral-DiffDet SSRN 2025

From an engineering perspective, MCDD significantly advances automated coral detection in challenging underwater conditions, providing a reliable solution for monitoring marine ecosystems.

Ranked #1 on 2D Object Detection on SCoralDet Dataset (using extra training data)

2D Object Detection Knowledge Distillation +2

110
0.53 stars / hour

LBM: Latent Bridge Matching for Fast Image-to-Image Translation

gojasper/lbm 10 Mar 2025

In this paper, we introduce Latent Bridge Matching (LBM), a new, versatile and scalable method that relies on Bridge Matching in a latent space to achieve fast image-to-image translation.

Depth Estimation Image Relighting +2

275
0.52 stars / hour

LatentSync: Audio Conditioned Latent Diffusion Models for Lip Sync

bytedance/LatentSync 12 Dec 2024

Since we did not change the overall training framework of SyncNet, our experience can also be applied to other lip sync and audio-driven portrait animation methods that utilize SyncNet.

Portrait Animation

3,210
0.46 stars / hour

Transformers without Normalization

jiachenzhu/DyT 13 Mar 2025

Normalization layers are ubiquitous in modern neural networks and have long been considered essential.

Self-Supervised Learning

695
0.46 stars / hour

HourVideo: 1-Hour Video-Language Understanding

keshik6/HourVideo 7 Nov 2024

We present HourVideo, a benchmark dataset for hour-long video-language understanding.

Benchmarking counterfactual +3

121
0.46 stars / hour

Next-Scale Autoregressive Models are Zero-Shot Single-Image Object View Synthesizers

shiran-yuan/archonview 17 Mar 2025

We present Next-Scale Autoregression Conditioned by View (ArchonView), a method that significantly exceeds state-of-the-art methods despite being trained from scratch with 3D rendering data only and no 2D pretraining.

Novel View Synthesis

41
0.44 stars / hour

SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks

facebookresearch/sweet_rl 19 Mar 2025

Large language model (LLM) agents need to perform multi-turn interactions in real-world tasks.

Language Modeling Language Modelling +1

65
0.44 stars / hour

YOLOE: Real-Time Seeing Anything

THU-MIG/yoloe 10 Mar 2025

Object detection and segmentation are widely employed in computer vision applications, yet conventional models like YOLO series, while efficient and accurate, are limited by predefined categories, hindering adaptability in open scenarios.

10-shot image generation

855
0.43 stars / hour

MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot

snowteam2023/medrag 6 Feb 2025

However, the diagnostic accuracy and specificity of existing heuristic-based RAG models used in the medical domain are inadequate, particularly for diseases with similar manifestations.

Diagnostic RAG +2

111
0.43 stars / hour

ZO2: Scalable Zeroth-Order Fine-Tuning for Extremely Large Language Models with Limited GPU Memory

liangyuwang/zo2 16 Mar 2025

Alternatively, zeroth-order (ZO) techniques can compute gradients using just forward operations, eliminating the need to store activations.

54
0.40 stars / hour