Trending Research

Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models

stanford-oval/storm • 22 Feb 2024

We study how to apply large language models to write grounded and organized long-form articles from scratch, with comparable breadth and depth to Wikipedia pages.

Retrieval

2,022

6.95 stars / hour

Paper
Code

MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators

pku-yuangroup/magictime • • 7 Apr 2024

Recent advances in Text-to-Video generation (T2V) have achieved remarkable success in synthesizing high-quality general videos from textual descriptions.

Text-to-Video Generation Video Generation

818

2.70 stars / hour

Paper
Code

LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images

openbmb/omnilmm • • 18 Mar 2024

To address the challenges, we present LLaVA-UHD, a large multimodal model that can efficiently perceive images in any aspect ratio and high resolution.

752

2.12 stars / hour

Paper
Code

AutoCodeRover: Autonomous Program Improvement

nus-apr/auto-code-rover • 8 Apr 2024

Recent progress in Large Language Models (LLMs) has significantly impacted the development process, where developers can use LLM-based programming assistants to achieve automated coding.

Bug fixing Code Search +1

1,745

1.89 stars / hour

Paper
Code

InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models

tencentarc/instantmesh • • 10 Apr 2024

We present InstantMesh, a feed-forward framework for instant 3D mesh generation from a single image, featuring state-of-the-art generation quality and significant training scalability.

Image to 3D

454

1.70 stars / hour

Paper
Code

Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

FoundationVision/VAR • • 3 Apr 2024

We present Visual AutoRegressive modeling (VAR), a new generation paradigm that redefines the autoregressive learning on images as coarse-to-fine "next-scale prediction" or "next-resolution prediction", diverging from the standard raster-scan "next-token prediction".

Ranked #6 on Image Generation on ImageNet 256x256

Image Generation Language Modelling +2

2,133

1.59 stars / hour

Paper
Code

Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

dvlab-research/minigemini • • 27 Mar 2024

We try to narrow the gap by mining the potential of VLMs for better performance and any-to-any workflow from three aspects, i. e., high-resolution visual tokens, high-quality data, and VLM-guided generation.

Ranked #8 on Visual Question Answering on MM-Vet

Image Comprehension Visual Dialog +1

966

1.42 stars / hour

Paper
Code

Rho-1: Not All Tokens Are What You Need

microsoft/rho • 11 Apr 2024

After fine-tuning, Rho-1-1B and 7B achieved state-of-the-art results of 40. 6% and 51. 8% on MATH dataset, respectively - matching DeepSeekMath with only 3% of the pretraining tokens.

Continual Pretraining Language Modelling +1

181

1.36 stars / hour

Paper
Code

LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding

alibabaresearch/advancedliteratemachinery • • 8 Apr 2024

The core of LayoutLLM is a layout instruction tuning strategy, which is specially designed to enhance the comprehension and utilization of document layouts.

document understanding

865

1.03 stars / hour

Paper
Code

Probing the 3D Awareness of Visual Foundation Models

mbanani/probe3d • 12 Apr 2024

Given that such models can classify, delineate, and localize objects in 2D, we ask whether they also represent their 3D structure?

115

0.96 stars / hour

Paper
Code