ImageNet Classification with Deep Convolutional Neural Networks

computerhistory/AlexNet-Source-Code NeurIPS 2012

We trained a large, deep convolutional neural network to classify the 1.3 million high-resolution images in the LSVRC-2010 ImageNet training set into the 1000 different classes.

General Classification Graph Classification +2

842
15.11 stars / hour

VGGT: Visual Geometry Grounded Transformer

facebookresearch/vggt 14 Mar 2025

We present VGGT, a feed-forward neural network that directly infers all key 3D attributes of a scene, including camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of its views.

Depth Estimation Novel View Synthesis +2

2,529
4.68 stars / hour

KBLaM: Knowledge Base augmented Language Model

microsoft/KBLaM 14 Oct 2024

In this paper, we propose Knowledge Base augmented Language Model (KBLaM), a new method for augmenting Large Language Models (LLMs) with external knowledge.

8k In-Context Learning +6

356
2.70 stars / hour

TxAgent: An AI Agent for Therapeutic Reasoning Across a Universe of Tools

mims-harvard/TxAgent 14 Mar 2025

TxAgent selects tools based on task objectives and executes structured function calls to solve therapeutic tasks that require clinical reasoning and cross-source validation.

AI Agent Decision Making

293
2.34 stars / hour

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

sparkaudio/spark-tts 3 Mar 2025

Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis.

Attribute Text to Speech +1

5,504
1.60 stars / hour

Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering

xiaomi-research/r1-aqa 14 Mar 2025

Recently, reinforcement learning (RL) has been shown to greatly enhance the reasoning capabilities of large language models (LLMs), and RL-based approaches have been progressively applied to visual multimodal tasks.

Audio Question Answering Question Answering +1

181
1.57 stars / hour

Step-Video-TI2V Technical Report: A State-of-the-Art Text-Driven Image-to-Video Generation Model

stepfun-ai/step-video-ti2v 14 Mar 2025

We present Step-Video-TI2V, a state-of-the-art text-driven image-to-video generation model with 30B parameters, capable of generating videos up to 102 frames based on both text and image inputs.

Image to Video Generation

176
1.55 stars / hour

ReasonGraph: Visualisation of Reasoning Paths

ZongqianLi/ReasonGraph 6 Mar 2025

The reasoning processes of Large Language Models (LLMs) are challenging to analyze due to their complexity and the lack of organized visualization tools.

372
1.54 stars / hour

LHM: Large Animatable Human Reconstruction Model from a Single Image in Seconds

aigc3d/LHM 13 Mar 2025

Animatable 3D human reconstruction from a single image is a challenging problem due to the ambiguity in decoupling geometry, appearance, and deformation.

3D Human Reconstruction

376
1.25 stars / hour

Data Formulator 2: Iterative Creation of Data Visualizations, with AI Transforming Data Along the Way

microsoft/data-formulator 28 Aug 2024

Data analysts often need to iterate between data transformations and chart designs to create rich visualizations for exploratory data analysis.

Code Generation Navigate

10,498
1.24 stars / hour