To address this, we propose OneChart: a reliable agent specifically devised for the structural extraction of chart information.
We introduce UFO, an innovative UI-Focused agent that fulfills user requests tailored to applications on Windows OS, harnessing the capabilities of GPT-Vision.
Our paper addresses the complex task of transferring a hairstyle from a reference image to an input photo for virtual hair try-on.
We investigate the internal behavior of Transformer-based Large Language Models (LLMs) when they generate factually incorrect text.
However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs.
Image diffusion models have been utilized in various tasks, such as text-to-image generation and controllable image synthesis.
In this paper, we are the first to propose a hybrid model, dubbed Show-1, which marries pixel-based and latent-based VDMs for text-to-video generation.
We introduce LLoCO, a technique that combines context compression, retrieval, and parameter-efficient finetuning using LoRA.
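LLoCO's exact architecture is not shown here, but one of its named ingredients, parameter-efficient finetuning with LoRA, can be sketched minimally: a frozen weight matrix is augmented with a trainable low-rank update, so finetuning touches only the two small factor matrices. All names and sizes below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Minimal LoRA sketch (assumed names/sizes, not LLoCO's actual code):
# output = x @ W.T + x @ (B @ A).T, where only A and B are trained.
rng = np.random.default_rng(0)
d, r = 16, 2                        # hidden size and LoRA rank (r << d)
W = rng.normal(size=(d, d))         # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection, zero-init

def lora_forward(x):
    # Frozen path plus the low-rank adaptation.
    return x @ W.T + x @ (B @ A).T

x = rng.normal(size=(1, d))
# With B zero-initialized, the adapted model reproduces the frozen model
# exactly at the start of finetuning.
assert np.allclose(lora_forward(x), x @ W.T)
```

The zero initialization of `B` is the standard LoRA choice: it guarantees the adapted model starts out identical to the base model, with only `2*d*r` trainable parameters instead of `d*d`.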
In this work, we highlight the following pitfall of prefilling: for batches containing high-varying prompt lengths, significant computation is wasted by the standard practice of padding sequences to the maximum length.
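The padding pitfall above is easy to quantify: when every prompt in a batch is padded to the batch's maximum length, the fraction of prefill positions spent on padding grows with the spread of prompt lengths. A minimal sketch (the function name and example lengths are illustrative, not from the paper):

```python
def padding_waste(prompt_lengths):
    """Fraction of prefill token positions that are padding when a batch
    is padded to its maximum prompt length."""
    max_len = max(prompt_lengths)
    total = max_len * len(prompt_lengths)  # positions actually computed
    useful = sum(prompt_lengths)           # positions carrying real tokens
    return (total - useful) / total

# A batch with one long prompt and several short ones: most of the
# prefill computation is spent on padding.
print(padding_waste([1024, 64, 128, 32]))  # ~0.695
# A batch of equal-length prompts wastes nothing.
print(padding_waste([256, 256, 256, 256]))  # 0.0
```

This is why high-varying prompt lengths are the problematic case: with uniform lengths the waste is zero, while a single long outlier forces every short prompt to be padded up to it.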
We outperform encoder-only models by a large margin on word-level tasks and reach a new unsupervised state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB).