VGGT: Visual Geometry Grounded Transformer

facebookresearch/vggt 14 Mar 2025

We present VGGT, a feed-forward neural network that directly infers all key 3D attributes of a scene, including camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of its views.

Depth Estimation, Novel View Synthesis (+2 more)

2,243 stars (11.37 stars / hour)
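The VGGT summary above lists camera parameters, depth maps, and point maps as separate outputs. Standard pinhole geometry links them: once you have a depth map plus a camera's intrinsics and pose, the point map follows. The sketch below shows only that relationship. It is not VGGT's code. The conventions are assumptions: depth is measured along the camera z-axis, the intrinsics are `fx`, `fy`, `cx`, `cy`, and the pose is a camera-to-world rotation `R` and translation `t`. The function name `depth_to_point_map` is hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]

def depth_to_point_map(
    depth: Sequence[Sequence[float]],
    fx: float, fy: float, cx: float, cy: float,
    R: Sequence[Sequence[float]],  # 3x3 camera-to-world rotation (assumed convention)
    t: Sequence[float],            # camera-to-world translation (assumed convention)
) -> List[List[Point]]:
    """Unproject an H x W depth map into an H x W grid of world-space 3D points.

    Hypothetical helper assuming a pinhole camera and z-depth; these are
    illustrative conventions, not VGGT's documented output format.
    """
    point_map = []
    for v, row in enumerate(depth):      # v indexes image rows (y)
        out_row = []
        for u, d in enumerate(row):      # u indexes image columns (x)
            # Back-project pixel (u, v) at depth d into camera coordinates.
            xc = (u - cx) * d / fx
            yc = (v - cy) * d / fy
            zc = d
            # Rigid transform to world coordinates: X_world = R @ X_cam + t.
            out_row.append(tuple(
                R[i][0] * xc + R[i][1] * yc + R[i][2] * zc + t[i] for i in range(3)
            ))
        point_map.append(out_row)
    return point_map
```

For example, with an identity rotation and zero translation, a constant depth of 2.0 gives points whose z-coordinate is 2.0 everywhere. A nonzero `t` shifts every point by the same offset.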

Neural Fields with Thermal Activations for Arbitrary-Scale Super-Resolution

prs-eth/thera 29 Nov 2023

We present a novel way to design neural fields such that points can be queried with an adaptive Gaussian point spread function (PSF), so as to guarantee correct anti-aliasing at any desired output resolution.

Image Super-Resolution

525 stars (2.50 stars / hour)
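The Thera summary above relies on how a Gaussian PSF acts on oscillating signals. Convolving a sinusoid of angular frequency omega with a Gaussian of standard deviation sigma only damps its amplitude, by a factor of exp(-sigma^2 * omega^2 / 2). This is also the solution of the heat equation at time sigma^2 / 2. The sketch below assumes a 1D field built from sinusoidal components, and it is not the authors' implementation. The function `field`, its `(weight, omega, phase)` tuples, and any rule for choosing sigma from the output resolution are hypothetical. The numeric blur function is only a brute-force cross-check of the closed form.

```python
import math
from typing import Callable, Sequence, Tuple

def thermal_sine(x: float, omega: float, phase: float, sigma: float) -> float:
    """sin(omega * x + phase) convolved with a Gaussian PSF of std. dev. sigma.

    Gaussian blurring only rescales a sinusoid, by exp(-sigma^2 * omega^2 / 2);
    equivalently, this solves the heat equation u_t = u_xx at t = sigma^2 / 2.
    """
    return math.exp(-0.5 * (sigma * omega) ** 2) * math.sin(omega * x + phase)

def field(x: float, components: Sequence[Tuple[float, float, float]], sigma: float) -> float:
    """Hypothetical 1D field: a weighted sum of Gaussian-damped sines.

    Each component is (weight, omega, phase); sigma is the PSF width of the query.
    """
    return sum(w * thermal_sine(x, om, ph, sigma) for w, om, ph in components)

def gaussian_blur_numeric(f: Callable[[float], float], x: float, sigma: float,
                          n: int = 4001, half_width: float = 8.0) -> float:
    """Brute-force reference: integrate f(x - s) * N(s; 0, sigma^2) over s."""
    a = half_width * sigma
    h = 2 * a / (n - 1)
    total = 0.0
    for i in range(n):
        s = -a + i * h
        weight = math.exp(-0.5 * (s / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
        # Trapezoid rule: the two endpoints count half.
        total += (0.5 if i in (0, n - 1) else 1.0) * f(x - s) * weight
    return total * h
```

A wider sigma, as a coarser output grid would call for, drives high-frequency components toward zero before they can alias. Setting sigma = 0 recovers the unfiltered sinusoid.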

TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of Tools

mims-harvard/TxAgent 14 Mar 2025

TxAgent selects tools based on task objectives and executes structured function calls to solve therapeutic tasks that require clinical reasoning and cross-source validation.

AI Agent, Decision Making

269 stars (2.49 stars / hour)

ReasonGraph: Visualisation of Reasoning Paths

ZongqianLi/ReasonGraph 6 Mar 2025

The reasoning processes of Large Language Models (LLMs) are challenging to analyze due to their complexity and the lack of organized visualization tools.

344 stars (2.14 stars / hour)

Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering

xiaomi-research/r1-aqa 14 Mar 2025

Recently, reinforcement learning (RL) has been shown to greatly enhance the reasoning capabilities of large language models (LLMs), and RL-based approaches have been progressively applied to visual multimodal tasks.

Audio Question Answering, Question Answering (+1 more)

168 stars (1.81 stars / hour)

Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model

stepfun-ai/step-video-ti2v 14 Mar 2025

We present Step-Video-TI2V, a state-of-the-art text-driven image-to-video generation model with 30B parameters, capable of generating videos of up to 102 frames from both text and image inputs.

Image to Video Generation

86 stars (1.62 stars / hour)

KBLaM: Knowledge Base augmented Language Model

microsoft/KBLaM 14 Oct 2024

In this paper, we propose Knowledge Base augmented Language Model (KBLaM), a new method for augmenting Large Language Models (LLMs) with external knowledge.

8k, In-Context Learning (+6 more)

175 stars (1.61 stars / hour)

Data Formulator 2: Iterative Creation of Data Visualizations, with AI Transforming Data Along the Way

microsoft/data-formulator 28 Aug 2024

Data analysts often need to iterate between data transformations and chart designs to create rich visualizations for exploratory data analysis.

Code Generation, Navigate

9,873 stars (1.53 stars / hour)

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

sparkaudio/spark-tts 3 Mar 2025

Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis.

Attribute, Text to Speech (+1 more)

5,259 stars (1.53 stars / hour)

LHM: Large Animatable Human Reconstruction Model from a Single Image in Seconds

aigc3d/LHM 13 Mar 2025

Animatable 3D human reconstruction from a single image is a challenging problem due to the ambiguity in decoupling geometry, appearance, and deformation.

3D Human Reconstruction

314 stars (1.39 stars / hour)