ImageNet Classification with Deep Convolutional Neural Networks

computerhistory/AlexNet-Source-Code NeurIPS 2012

We trained a large, deep convolutional neural network to classify the 1.3 million high-resolution images in the LSVRC-2010 ImageNet training set into the 1000 different classes.

General Classification Graph Classification +2

2,126
4.08 stars / hour

InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity

bytedance/infiniteyou 20 Mar 2025

Achieving flexible and high-fidelity identity-preserved image generation remains formidable, particularly with advanced Diffusion Transformers (DiTs) like FLUX.

Image Generation

1,212
3.55 stars / hour

Long-Context Autoregressive Video Modeling with Next-Frame Prediction

showlab/FAR 26 Mar 2025

Existing RoPE lacks effective temporal decay for remote context and fails to extrapolate well to long video sequences.

Text Generation Unconditional Video Generation

112
3.29 stars / hour

LHM: Large Animatable Human Reconstruction Model from a Single Image in Seconds

aigc3d/LHM 13 Mar 2025

Animatable 3D human reconstruction from a single image is a challenging problem due to the ambiguity in decoupling geometry, appearance, and deformation.

3D Human Reconstruction

1,081
2.96 stars / hour

VGGT: Visual Geometry Grounded Transformer

facebookresearch/vggt 14 Mar 2025

We present VGGT, a feed-forward neural network that directly infers all key 3D attributes of a scene, including camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of its views.

Depth Estimation Novel View Synthesis +2

3,235
2.13 stars / hour

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

bytedance/ui-tars 21 Jan 2025

This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations).

3,499
1.63 stars / hour

KBLaM: Knowledge Base augmented Language Model

microsoft/KBLaM 14 Oct 2024

In this paper, we propose Knowledge Base augmented Language Model (KBLaM), a new method for augmenting Large Language Models (LLMs) with external knowledge.

8k In-Context Learning +6

637
1.46 stars / hour
346
1.31 stars / hour

Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens

sparkaudio/spark-tts 3 Mar 2025

Recent advancements in large language models (LLMs) have driven significant progress in zero-shot text-to-speech (TTS) synthesis.

Attribute Text to Speech +1

6,553
1.24 stars / hour