CleanDiffuser: An Easy-to-use Modularized Library for Diffusion Models in Decision Making

cleandiffuserteam/cleandiffuser 13 Jun 2024

By revisiting the roles of DMs in the decision-making domain, we identify a set of essential sub-modules that constitute the core of CleanDiffuser, allowing for the implementation of various DM algorithms with simple and flexible building blocks.

Decision Making

Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B

trotsky1997/mathblackbox 11 Jun 2024

This paper introduces the MCT Self-Refine (MCTSr) algorithm, an innovative integration of Large Language Models (LLMs) with Monte Carlo Tree Search (MCTS), designed to enhance performance in complex mathematical reasoning tasks.

Decision Making GSM8K

A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models

xinchengshuai/awesome-image-editing 20 Jun 2024

Image editing aims to modify a given synthetic or real image to meet users' specific requirements.

Video Editing

Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers

alpha-vllm/lumina-t2x 9 May 2024

Sora unveils the potential of scaling Diffusion Transformer for generating photorealistic images and videos at arbitrary resolutions, aspect ratios, and durations, yet it still lacks sufficient implementation details.

MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens

mlfoundations/mint-1t 17 Jun 2024

Multimodal interleaved datasets featuring free-form interleaved sequences of images and text are crucial for training frontier large multimodal models (LMMs).

Proactive Detection of Voice Cloning with Localized Watermarking

facebookresearch/audioseal 30 Jan 2024

In the rapidly evolving field of speech generative models, there is a pressing need to ensure audio authenticity against the risks of voice cloning.

Voice Cloning

TroL: Traversal of Layers for Large Language and Vision Models

byungkwanlee/trol 18 Jun 2024

Large language and vision models (LLVMs) have been driven by the generalization power of large language models (LLMs) and the advent of visual instruction tuning.

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

damo-nlp-sg/videollama2 11 Jun 2024

In this paper, we present VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video- and audio-oriented tasks.

Multiple-choice Question Answering

Matching Anything by Segmenting Anything

siyuanliii/masa CVPR 2024

The robust association of the same objects across video frames in complex scenes is crucial for many applications, especially Multiple Object Tracking (MOT).

Domain Generalization Multiple Object Tracking

Bench2Drive: Towards Multi-Ability Benchmarking of Closed-Loop End-To-End Autonomous Driving

Thinklab-SJTU/Bench2Drive 6 Jun 2024

In an era marked by the rapid scaling of foundation models, autonomous driving technologies are approaching a transformative threshold where end-to-end autonomous driving (E2E-AD) emerges due to its potential to scale up in a data-driven manner.

Autonomous Driving Benchmarking

