Trending Research

WorldGPT: Empowering LLM as Multimodal World Model

dcdmllm/worldgpt • 28 Apr 2024

As for evaluation, we build WorldNet, a multimodal state transition prediction benchmark encompassing varied real-life scenarios.

Language Modelling Large Language Model

0.38 stars / hour

QLoRA: Efficient Finetuning of Quantized LLMs

internlm/xtuner • NeurIPS 2023

Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU.

Chatbot Instruction Following +2

2,480

0.37 stars / hour

MicroDreamer: Zero-shot 3D Generation in $\sim$20 Seconds by Score-based Iterative Reconstruction

ml-gsai/microdreamer • 30 Apr 2024

In this paper, we introduce score-based iterative reconstruction (SIR), an efficient and general algorithm for 3D generation with a multi-view score-based diffusion model.

3D Generation 3D Reconstruction

0.33 stars / hour

OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning

nvlabs/omnidrive • 2 May 2024

We further propose OmniDrive-nuScenes, a new visual question-answering dataset challenging the true 3D situational awareness of a model with comprehensive visual question-answering (VQA) tasks, including scene description, traffic regulation, 3D grounding, counterfactual reasoning, decision making and planning.

Autonomous Driving counterfactual +4

0.33 stars / hour

Make Your LLM Fully Utilize the Context

hsiehjackson/ruler • 25 Apr 2024

While many contemporary large language models (LLMs) can process lengthy input, they still struggle to fully utilize information within the long context, a problem known as the lost-in-the-middle challenge.

4k Information Retrieval +1

108

0.30 stars / hour

WebLINX: Real-World Website Navigation with Multi-Turn Dialogue

McGill-NLP/webllama • 8 Feb 2024

We propose the problem of conversational web navigation, where a digital agent controls a web browser and follows user instructions to solve real-world tasks in a multi-turn dialogue fashion.

Ranked #1 on Conversational Web Navigation on WebLINX

Conversational Web Navigation Text Generation +1

1,081

0.30 stars / hour

ManiSkill: Generalizable Manipulation Skill Benchmark with Large-Scale Demonstrations

haosulab/ManiSkill • 30 Jul 2021

Here we propose SAPIEN Manipulation Skill Benchmark (ManiSkill) to benchmark manipulation skills over diverse objects in a full-physics simulator.

475

0.28 stars / hour

PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning

magic-research/PLLaVA • arXiv 2024

PLLaVA achieves new state-of-the-art performance on modern benchmark datasets for both video question-answer and captioning tasks.

Ranked #1 on Video-based Generative Performance Benchmarking on VideoInstruct

Dense Captioning Video-based Generative Performance Benchmarking +1

285

0.28 stars / hour

StreamMultiDiffusion: Real-Time Interactive Generation with Region-Based Semantic Control

ironjr/streammultidiffusion • 14 Mar 2024

The enormous success of diffusion models in text-to-image synthesis has made them promising candidates for the next generation of end-user applications for image generation and editing.

Text-to-Image Generation

445

0.28 stars / hour

SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

ailab-cvc/seed-x • 22 Apr 2024

We hope that our work will inspire future research into what can be achieved by versatile multimodal foundation models in real-world applications.

Image Generation

212

0.27 stars / hour
