LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control

KwaiVGI/LivePortrait 3 Jul 2024

Instead of following mainstream diffusion-based methods, we explore and extend the potential of the implicit-keypoint-based framework, which effectively balances computational efficiency and controllability.

Computational Efficiency Face Reenactment

SEED-Story: Multimodal Long Story Generation with Large Language Model

tencentarc/seed-story 11 Jul 2024

We further propose a multimodal attention sink mechanism to enable the generation of stories with up to 25 sequences (with only 10 used in training) in a highly efficient autoregressive manner.

Image Generation Language Modelling
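The SEED-Story excerpt relies on an attention sink to generate stories longer than those seen in training. The sketch below shows only the generic attention-sink eviction rule (as popularized by StreamingLLM): keep the first few cache entries as "sinks" plus a sliding window of the most recent ones, so the cache stays bounded during autoregressive generation. The function name and the `n_sink`/`window` parameters are hypothetical, and the paper's multimodal variant makes its own choices about which text and image tokens to retain; those details are not reproduced here.

```python
from typing import List, TypeVar

T = TypeVar("T")


def evict_with_attention_sink(cache: List[T], n_sink: int, window: int) -> List[T]:
    """Keep the first `n_sink` entries (the attention sink) plus the last `window` entries.

    `cache` stands in for a per-position KV cache; entries between the sink and the
    recent window are dropped once the cache exceeds n_sink + window.
    """
    if n_sink < 0 or window < 0:
        raise ValueError("n_sink and window must be non-negative")
    if len(cache) <= n_sink + window:
        return list(cache)  # still within budget, nothing to evict
    # Slice from len - window so that window == 0 yields an empty tail.
    tail = cache[len(cache) - window:]
    return cache[:n_sink] + tail


# Simulate 20 decoding steps, using token positions as cache entries.
cache: List[int] = []
for pos in range(20):
    cache.append(pos)
    cache = evict_with_attention_sink(cache, n_sink=4, window=6)
print(cache)  # [0, 1, 2, 3, 14, 15, 16, 17, 18, 19]
```

Keeping the earliest positions matters because attention tends to concentrate on them; evicting them along with the rest of the old context is what usually destabilizes generation beyond the training length.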

Grounding Image Matching in 3D with MASt3R

naver/mast3r 14 Jun 2024

Image Matching is a core component of all best-performing algorithms and pipelines in 3D vision.

3D Reconstruction

RouteLLM: Learning to Route LLMs with Preference Data

lm-sys/routellm 26 Jun 2024

Large language models (LLMs) exhibit impressive capabilities across a wide range of tasks, yet the choice of which model to use often involves a trade-off between performance and cost.

Data Augmentation Transfer Learning
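The RouteLLM excerpt frames model choice as a performance/cost trade-off. A common way to expose that trade-off is a router that scores each prompt (an estimate that the stronger model is needed) and sends it to the strong model only above a threshold, with the threshold calibrated to a target share of strong-model calls. This is a minimal sketch under those assumptions, not RouteLLM's API: `toy_score` is a hypothetical stand-in for a router learned from preference data, and the model names are placeholders.

```python
import math
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RoutingDecision:
    model: str    # name of the model the prompt is sent to
    score: float  # router's estimate that the strong model is needed


def route(prompt: str, score_fn: Callable[[str], float], threshold: float,
          strong: str = "strong-model", weak: str = "weak-model") -> RoutingDecision:
    """Send the prompt to the strong model only when its score clears the threshold."""
    score = score_fn(prompt)
    return RoutingDecision(strong if score >= threshold else weak, score)


def calibrate_threshold(scores: List[float], strong_fraction: float) -> float:
    """Pick a threshold so that about `strong_fraction` of `scores` route to the strong model."""
    if not 0.0 <= strong_fraction <= 1.0:
        raise ValueError("strong_fraction must be in [0, 1]")
    k = round(strong_fraction * len(scores))
    if k == 0:
        return math.inf  # budget allows no strong-model calls
    return sorted(scores, reverse=True)[k - 1]


def toy_score(prompt: str) -> float:
    """Hypothetical router: treats longer prompts as harder (a learned model goes here)."""
    return min(1.0, len(prompt.split()) / 50)


threshold = calibrate_threshold([0.1, 0.9, 0.4, 0.7, 0.2], strong_fraction=0.4)
print(threshold)  # 0.7
print(route("What is 2 + 2?", toy_score, threshold).model)  # weak-model
```

Lowering `strong_fraction` saves cost at the expense of quality; the learned scoring function determines how much quality is lost for a given budget.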

MambaVision: A Hybrid Mamba-Transformer Vision Backbone

nvlabs/mambavision 10 Jul 2024

We propose a novel hybrid Mamba-Transformer backbone, denoted as MambaVision, which is specifically tailored for vision applications.

Image Classification Instance Segmentation

OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training

PrimeIntellect-ai/OpenDiLoCo 10 Jul 2024

OpenDiLoCo is an open-source implementation and replication of the Distributed Low-Communication (DiLoCo) training method for large language models.
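DiLoCo reduces communication by letting each worker take many local optimizer steps from the shared parameters and synchronizing only once per round: the averaged parameter change (a "pseudo-gradient") is applied by an outer optimizer, which in the original DiLoCo paper is Nesterov momentum over an AdamW inner optimizer. The sketch below illustrates that two-level loop on hypothetical per-worker quadratic losses, using plain SGD as the inner optimizer for brevity. It is not OpenDiLoCo's code, and the default hyperparameters are illustrative.

```python
from typing import List

Vector = List[float]


def local_grad(theta: Vector, target: Vector) -> Vector:
    """Gradient of the hypothetical worker loss 0.5 * ||theta - target||^2."""
    return [t - c for t, c in zip(theta, target)]


def diloco(theta0: Vector, worker_targets: List[Vector], outer_steps: int,
           inner_steps: int, inner_lr: float = 0.1, outer_lr: float = 0.7,
           momentum: float = 0.9) -> Vector:
    theta = list(theta0)
    velocity = [0.0] * len(theta)
    for _ in range(outer_steps):
        deltas = []
        for target in worker_targets:
            local = list(theta)  # each worker starts from the shared parameters
            for _ in range(inner_steps):  # no communication inside this loop
                g = local_grad(local, target)
                local = [p - inner_lr * gi for p, gi in zip(local, g)]
            deltas.append([p0 - p for p0, p in zip(theta, local)])
        # The only synchronization per round: average the pseudo-gradients.
        n = len(deltas)
        pseudo = [sum(d[i] for d in deltas) / n for i in range(len(theta))]
        # Outer Nesterov momentum step (PyTorch-style formulation).
        velocity = [momentum * v + g for v, g in zip(velocity, pseudo)]
        theta = [p - outer_lr * (g + momentum * v)
                 for p, g, v in zip(theta, pseudo, velocity)]
    return theta


# Two workers whose local optima are 1.0 and 3.0; the shared optimum is 2.0.
print(round(diloco([0.0], [[1.0], [3.0]], outer_steps=50, inner_steps=10)[0], 4))  # 2.0
```

With `inner_steps` local steps per round, communication drops by roughly that factor compared with synchronizing every step.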

MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases

facebookresearch/mobilellm 22 Feb 2024

The resulting models, denoted MobileLLM-LS, improve accuracy by a further 0.7%/0.8% over MobileLLM 125M/350M.

ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation

gair-nlp/anole 8 Jul 2024

Previous open-source large multimodal models (LMMs) have faced several limitations: (1) they often lack native integration, requiring adapters to align visual representations with pre-trained large language models (LLMs); (2) many are restricted to single-modal generation; (3) while some support multimodal generation, they rely on separate diffusion models for visual modeling and generation.

multimodal generation Text Generation

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

test-time-training/ttt-lm-jax 5 Jul 2024

We evaluate our instantiations at the scale of 125M to 1.3B parameters, comparing with a strong Transformer and Mamba, a modern RNN.

16k 8k

MAVIS: Mathematical Visual Instruction Tuning

zrrskywalker/mavis 11 Jul 2024

We identify three key areas within MLLMs that need to be improved: visual encoding of math diagrams, diagram-language alignment, and mathematical reasoning skills.

Contrastive Learning Language Modelling

