Video Diffusion Alignment via Reward Gradients

mihirp1998/vader 11 Jul 2024

We show that backpropagating gradients from these reward models to a video diffusion model can allow for compute and sample efficient alignment of the video diffusion model.

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

microsoft/MInference 2 Jul 2024

With the pattern and sparse indices, we perform efficient sparse attention calculations via our optimized GPU kernels to significantly reduce the latency in the pre-filling stage of long-context LLMs.

Language Modelling, Large Language Model
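
The MInference excerpt above describes attention computed only at selected sparse indices. As a rough illustration of that general idea, and not of the paper's optimized GPU kernels, the following pure-Python sketch computes softmax attention for a single query over a chosen subset of key positions. The function name `sparse_attention` and the assumption that the index set is already given are hypothetical; how MInference picks its patterns and indices is not shown in this listing.

```python
import math


def sparse_attention(q, k, v, indices):
    """Softmax attention for one query vector, restricted to `indices`.

    q: query vector (list of floats)
    k, v: lists of key / value vectors, one per position
    indices: key positions to attend to (assumed precomputed and non-empty)
    """
    scale = 1.0 / math.sqrt(len(q))
    # Scaled dot-product scores, computed only for the selected positions.
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k[j])) for j in indices]
    # Subtract the maximum score for numerical stability before exponentiating.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(v[0])
    # Weighted average of the selected value vectors.
    return [
        sum(w * v[j][d] for w, j in zip(weights, indices)) / total
        for d in range(dim)
    ]
```

Passing every position in `indices` reproduces ordinary dense attention for that query; passing fewer positions skips the corresponding score computations, which is where any saving would come from.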

Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence

openbmb/ioa 9 Jul 2024

The rapid advancement of large language models (LLMs) has paved the way for the development of highly capable autonomous agents.

Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models

stanford-oval/storm 22 Feb 2024

We study how to apply large language models to write grounded and organized long-form articles from scratch, with comparable breadth and depth to Wikipedia pages.


DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception

baaivision/densefusion 11 Jul 2024

To facilitate the cutting-edge research of MLLMs on comprehensive vision perception, we thereby propose Perceptual Fusion, using a low-budget but highly effective caption engine for complete and accurate image descriptions.

Cradle: Empowering Foundation Agents Towards General Computer Control

baai-agents/cradle 5 Mar 2024

To handle this issue, we propose the General Computer Control (GCC) setting to restrict foundation agents to interact with software through the most unified and standardized interface, i.e., using screenshots as input and keyboard and mouse actions as output.

Efficient Exploration

FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds

open-mmlab/foleycrafter 1 Jul 2024

Meanwhile, the temporal controller incorporates an onset detector and a timestamp-based adapter to achieve precise audio-video alignment.

Audio Generation, Video Alignment

Scaling Synthetic Data Creation with 1,000,000,000 Personas

tencent-ailab/persona-hub 28 Jun 2024

We propose a novel persona-driven data synthesis methodology that leverages various perspectives within a large language model (LLM) to create diverse synthetic data.

Language Modelling, Large Language Model

Approaching Outside: Scaling Unsupervised 3D Object Detection from 2D Scene

ruiyang-061x/lise 11 Jul 2024

In this paper, we are among the early attempts to integrate LiDAR data with 2D images for unsupervised 3D detection and introduce a new method, dubbed LiDAR-2D Self-paced Learning (LiSe).

3D Object Detection, object-detection

CorrNet3D: Unsupervised End-to-end Learning of Dense Correspondence for 3D Point Clouds


The symmetric deformer, with an additional regularized loss, transforms the two permuted point clouds to each other to drive the unsupervised learning of the correspondence.

Ranked #6 on 3D Dense Shape Correspondence on SHREC'19 (using extra training data)

3D Dense Shape Correspondence

