AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation

scutzzj/aniportrait 26 Mar 2024

In this study, we propose AniPortrait, a novel framework for generating high-quality animation driven by audio and a reference portrait image.

Face Reenactment

1,170
10.53 stars / hour

AIOS: LLM Agent Operating System

agiresearch/aios 25 Mar 2024

Inspired by these challenges, this paper presents AIOS, an LLM agent operating system, which embeds large language models into operating systems (OS) as the brain of the OS, enabling an operating system "with soul" -- an important step towards AGI.

Language Modelling, Large Language Model +1

388
3.43 stars / hour

Mora: Enabling Generalist Video Generation via A Multi-Agent Framework

lichao-sun/mora 20 Mar 2024

Sora is the first large-scale generalist video generation model that garnered significant attention across society.

Image to Video Generation, Text-to-Video Generation +1

929
3.31 stars / hour

VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

jasonppy/voicecraft 25 Mar 2024

We introduce VoiceCraft, a token infilling neural codec language model, that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on audiobooks, internet videos, and podcasts.

Language Modelling

1,385
3.21 stars / hour

SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions

IDKiro/sdxs 25 Mar 2024

Recent advancements in diffusion models have positioned them at the forefront of image generation.

Image-to-Image Translation, Knowledge Distillation

252
2.92 stars / hour

Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

dvlab-research/minigemini 27 Mar 2024

We try to narrow the gap by mining the potential of VLMs for better performance and any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation.

Image Comprehension, Visual Dialog +1

208
2.75 stars / hour

T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy

idea-research/t-rex 21 Mar 2024

Recognizing the complementary strengths and weaknesses of both text and visual prompts, we introduce T-Rex2 that synergizes both prompts within a single model through contrastive learning.

Contrastive Learning, Descriptive +3

1,362
2.74 stars / hour

StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text

picsart-ai-research/streamingt2v 21 Mar 2024

To overcome these limitations, we introduce StreamingT2V, an autoregressive approach for long video generation of 80, 240, 600, 1200 or more frames with smooth transitions.

Text-to-Video Generation, Video Generation

331
2.21 stars / hour

General Object Foundation Model for Images and Videos at Scale

FoundationVision/GLEE 14 Dec 2023

We present GLEE in this work, an object-level foundation model for locating and identifying objects in images and videos.

Long-tail Video Object Segmentation, Multi-Object Tracking +8

665
1.94 stars / hour

Evolutionary Optimization of Model Merging Recipes

sakanaai/evolutionary-model-merge 19 Mar 2024

Surprisingly, our Japanese Math LLM achieved state-of-the-art performance on a variety of established Japanese LLM benchmarks, even surpassing models with significantly more parameters, despite not being explicitly trained for such tasks.

Evolutionary Algorithms, Math

788
1.93 stars / hour