Making Images Real Again: A Comprehensive Survey on Deep Image Composition

bcmi/awesome-object-insertion 28 Jun 2021

The image composition task can be decomposed into multiple sub-tasks, each of which targets one or more issues.

Image Harmonization Object +1

457
0.47 stars / hour

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

bytedance/ui-tars 21 Jan 2025

This paper introduces UI-TARS, a native GUI agent model that solely perceives the screenshots as input and performs human-like interactions (e.g., keyboard and mouse operations).

2,176
0.46 stars / hour

LLMs can see and hear without any training

facebookresearch/mils 30 Jan 2025

We present MILS: Multimodal Iterative LLM Solver, a surprisingly simple, training-free approach to imbue multimodal capabilities into your favorite LLM.

Audio captioning Style Transfer +1

124
0.45 stars / hour

Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Diffusion Transformer Networks

fudan-generative-vision/hallo3 1 Dec 2024

Existing methodologies for animating portrait images face significant challenges, particularly in handling non-frontal perspectives, rendering dynamic objects around the portrait, and generating immersive, realistic backgrounds.

Image Animation Portrait Animation

1,011
0.43 stars / hour

Free-T2M: Frequency Enhanced Text-to-Motion Diffusion Model With Consistency Loss

Hxxxz0/Free-T2m 30 Jan 2025

Rapid progress in text-to-motion generation has been largely driven by diffusion models.

Denoising Motion Generation +1

50
0.42 stars / hour

Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models

Vchitect/Vchitect-2.0 14 Jan 2025

We present Vchitect-2.0, a parallel transformer architecture designed to scale up video diffusion models for large-scale text-to-video generation.

Benchmarking Text-to-Video Generation +1

881
0.41 stars / hour

A Survey of Graph Retrieval-Augmented Generation for Customized Large Language Models

deep-polyu/awesome-graphrag 21 Jan 2025

Large language models (LLMs) have demonstrated remarkable capabilities in a wide range of tasks, yet their application to specialized domains remains challenging due to the need for deep expertise.

RAG Text Retrieval

492
0.41 stars / hour

UnCommon Objects in 3D

facebookresearch/uco3d 13 Jan 2025

We introduce Uncommon Objects in 3D (uCO3D), a new object-centric dataset for 3D deep learning and 3D generative AI.

Object

603
0.40 stars / hour

PaSa: An LLM Agent for Comprehensive Academic Paper Search

bytedance/pasa 17 Jan 2025

Notably, PaSa-7B surpasses the best Google-based baseline, Google with GPT-4o, by 37.78% in recall@20 and 39.90% in recall@50.

722
0.39 stars / hour

Auto-AVSR: Audio-Visual Speech Recognition with Automatic Labels

mpc001/auto_avsr 25 Mar 2023

Recently, the performance of automatic, visual, and audio-visual speech recognition (ASR, VSR, and AV-ASR, respectively) has been substantially improved, mainly due to the use of larger models and training sets.

Ranked #1 on Lipreading on LRS2 (using extra training data)

Audio-Visual Speech Recognition Automatic Speech Recognition +4

302
0.37 stars / hour