Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models

stanford-oval/storm 22 Feb 2024

We study how to apply large language models to write grounded and organized long-form articles from scratch, with comparable breadth and depth to Wikipedia pages.


MagicTime: Time-lapse Video Generation Models as Metamorphic Simulators

pku-yuangroup/magictime 7 Apr 2024

Recent advances in Text-to-Video generation (T2V) have achieved remarkable success in synthesizing high-quality general videos from textual descriptions.

Text-to-Video Generation, Video Generation

LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images

openbmb/omnilmm 18 Mar 2024

To address the challenges, we present LLaVA-UHD, a large multimodal model that can efficiently perceive images in any aspect ratio and high resolution.

AutoCodeRover: Autonomous Program Improvement

nus-apr/auto-code-rover 8 Apr 2024

Recent progress in Large Language Models (LLMs) has significantly impacted the development process, where developers can use LLM-based programming assistants to achieve automated coding.

Bug fixing, Code Search, +1

InstantMesh: Efficient 3D Mesh Generation from a Single Image with Sparse-view Large Reconstruction Models

tencentarc/instantmesh 10 Apr 2024

We present InstantMesh, a feed-forward framework for instant 3D mesh generation from a single image, featuring state-of-the-art generation quality and significant training scalability.

Image to 3D

Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

FoundationVision/VAR 3 Apr 2024

We present Visual AutoRegressive modeling (VAR), a new generation paradigm that redefines the autoregressive learning on images as coarse-to-fine "next-scale prediction" or "next-resolution prediction", diverging from the standard raster-scan "next-token prediction".

Image Generation, Language Modelling, +2

Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

dvlab-research/minigemini 27 Mar 2024

We try to narrow the gap by mining the potential of VLMs for better performance and any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation.

Image Comprehension, Visual Dialog, +1

Rho-1: Not All Tokens Are What You Need

microsoft/rho 11 Apr 2024

After fine-tuning, Rho-1-1B and 7B achieved state-of-the-art results of 40.6% and 51.8% on MATH dataset, respectively - matching DeepSeekMath with only 3% of the pretraining tokens.

Continual Pretraining, Language Modelling, +1

LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding

alibabaresearch/advancedliteratemachinery 8 Apr 2024

The core of LayoutLLM is a layout instruction tuning strategy, which is specially designed to enhance the comprehension and utilization of document layouts.

document understanding

Probing the 3D Awareness of Visual Foundation Models

mbanani/probe3d 12 Apr 2024

Given that such models can classify, delineate, and localize objects in 2D, we ask whether they also represent their 3D structure?

