In the pre-training stage, we introduce the Retrieval-Augmented Mask Prediction (RAMP) task, in which the model learns to invoke search tools to fill masked spans across large-scale pre-training data, thereby equipping LLMs with general retrieval and reasoning capabilities.
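A minimal sketch of what such an objective could look like, assuming a toy lexical retriever and a simple span-masking scheme (the paper's actual search tool and masking strategy may differ):

```python
import random

def mask_random_span(tokens, span_len=3):
    """Replace a random contiguous span with a [MASK] placeholder.
    Returns the corrupted token list and the original span (the target)."""
    start = random.randrange(0, max(1, len(tokens) - span_len))
    target = tokens[start:start + span_len]
    corrupted = tokens[:start] + ["[MASK]"] + tokens[start + span_len:]
    return corrupted, target

def retrieve(query_tokens, corpus, k=2):
    """Toy lexical retriever (an assumption; the paper may call an
    external search tool instead). Ranks passages by token overlap."""
    query = set(query_tokens)
    scored = sorted(corpus, key=lambda p: -len(query & set(p.split())))
    return scored[:k]

def build_ramp_example(text, corpus):
    """Assemble one RAMP-style training instance:
    retrieved passages + corrupted text -> masked span as the label."""
    tokens = text.split()
    corrupted, target = mask_random_span(tokens)
    passages = retrieve(corrupted, corpus)
    model_input = " ".join(passages) + " [SEP] " + " ".join(corrupted)
    return model_input, " ".join(target)

corpus = ["the Eiffel Tower is in Paris", "water boils at 100 degrees Celsius"]
x, y = build_ramp_example("the Eiffel Tower is located in Paris France", corpus)
print(x, "->", y)
```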
Existing acceleration techniques for video diffusion models often rely on uniform heuristics or time-embedding variants to skip timesteps and reuse cached features.
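As a point of reference, such a uniform skip-and-reuse heuristic can be sketched in a few lines; `denoise_block` and `cheap_head` below are hypothetical stand-ins for the expensive backbone and the light remainder of the network:

```python
def sample_with_cache(x, timesteps, denoise_block, cheap_head, refresh_every=4):
    """Uniform-heuristic caching: run the expensive block only every
    `refresh_every` timesteps and reuse its features in between."""
    feats = None
    for i, t in enumerate(timesteps):
        if feats is None or i % refresh_every == 0:
            feats = denoise_block(x, t)   # full (expensive) forward pass
        x = cheap_head(x, t, feats)       # light update reusing cached features
    return x

# toy stand-ins so the sketch runs end to end
denoise_block = lambda x, t: x * 0.9
cheap_head = lambda x, t, f: f - 0.01 * t
print(sample_with_cache(1.0, range(10), denoise_block, cheap_head))
```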
Graph Convolutional Networks (GCNs) achieve impressive performance owing to their remarkable ability to learn representations of graph-structured data.
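That representation ability stems from the standard GCN propagation rule H' = sigma(D^{-1/2}(A+I)D^{-1/2} H W) of Kipf & Welling; a NumPy sketch of one layer:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])        # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(d ** -0.5)       # symmetric degree normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0)

A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # 3-node path graph
H = np.random.randn(3, 4)                                     # node features
W = np.random.randn(4, 2)                                     # learnable weights
print(gcn_layer(A, H, W).shape)  # (3, 2)
```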
Survey papers play a crucial role in scientific research, especially given the rapid growth of research publications.
Large Language Models (LLMs) have demonstrated effectiveness in code generation tasks.
We present RWKV-7 "Goose", a new sequence modeling architecture with constant memory usage and constant inference time per token.
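The constant-memory, constant-time-per-token claim follows from maintaining a fixed-size recurrent state rather than a growing attention cache. A generic sketch of that property (the update here is a plain RNN step, not RWKV-7's actual formulation):

```python
import numpy as np

def recurrent_generate(tokens, W_state, W_in, W_out):
    """Decoding with a fixed-size state: each token costs O(d^2) time and
    O(d) memory regardless of how long the sequence grows."""
    s = np.zeros(W_state.shape[0])        # the entire history lives in s
    outs = []
    for x in tokens:
        s = np.tanh(W_state @ s + W_in @ x)   # generic recurrent update
        outs.append(W_out @ s)
    return outs

rng = np.random.default_rng(0)
W_state = rng.normal(size=(4, 4))
W_in, W_out = rng.normal(size=(4, 3)), rng.normal(size=(2, 4))
tokens = [rng.normal(size=3) for _ in range(5)]
print(len(recurrent_generate(tokens, W_state, W_in, W_out)))  # 5 outputs
```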
Document image parsing is challenging due to intricately intertwined elements such as text paragraphs, figures, formulas, and tables.
Linear Recurrent Neural Networks (linear RNNs) have emerged as competitive alternatives to Transformers for sequence modeling, offering efficient training and linear-time inference.
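The training efficiency comes from the fact that a linear recurrence s_t = a_t * s_{t-1} + b_t is associative, so all prefix states can be computed with a parallel scan, while inference stays a sequential O(1)-memory loop. A minimal sketch, assuming element-wise (diagonal) gates a_t:

```python
import numpy as np

def combine(c1, c2):
    """Associative operator for s_t = a_t * s_{t-1} + b_t: composing
    (a1, b1) then (a2, b2) gives (a2*a1, a2*b1 + b2). Associativity is
    what permits an O(log T)-depth parallel prefix scan at training time."""
    a1, b1 = c1
    a2, b2 = c2
    return a2 * a1, a2 * b1 + b2

def linear_rnn_states(a, b):
    """Sequential reference producing every state s_t; a parallel scan
    computes the same prefix compositions with logarithmic depth."""
    acc = (1.0, 0.0)                      # identity element of `combine`
    states = []
    for t in range(len(a)):
        acc = combine(acc, (a[t], b[t]))
        states.append(acc[1])             # s_t, since s_0 = 0
    return states

a = np.array([0.9, 0.8, 0.95])            # per-step decay gates (assumed diagonal)
b = np.array([1.0, 0.5, -0.2])            # per-step inputs
print(linear_rnn_states(a, b))            # [1.0, 1.3, 1.035]
```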
To address this challenge, we introduce the first benchmark and metric suite for poster generation, which pairs recent conference papers with author-designed posters and evaluates outputs on (i) Visual Quality: semantic alignment with human posters; (ii) Textual Coherence: language fluency; (iii) Holistic Assessment: six fine-grained aesthetic and informational criteria scored by a VLM-as-judge; and, notably, (iv) PaperQuiz: the poster's ability to convey core paper content, as measured by VLMs answering generated quizzes.
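A hedged sketch of how a PaperQuiz-style score could be computed; `vlm_answer` and the quiz format are illustrative assumptions, not the benchmark's actual interface:

```python
def paper_quiz_score(poster_image, quiz, vlm_answer):
    """PaperQuiz-style scoring sketch: a VLM answers paper-derived
    multiple-choice questions while seeing only the poster; its accuracy
    proxies how much core content the poster conveys. `vlm_answer` is a
    hypothetical callable (poster, question, options) -> choice index."""
    correct = sum(
        vlm_answer(poster_image, q["question"], q["options"]) == q["answer"]
        for q in quiz
    )
    return correct / len(quiz)

# toy stand-in: a "VLM" that always picks option 0
quiz = [
    {"question": "What task does the paper address?",
     "options": ["poster generation", "speech recognition"], "answer": 0},
]
print(paper_quiz_score("poster.png", quiz, lambda img, q, opts: 0))  # 1.0
```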
Online free-viewpoint video (FVV) streaming is a challenging and relatively under-explored problem.