DiffiT: Diffusion Vision Transformers for Image Generation

nvlabs/diffit 4 Dec 2023

We also introduce latent DiffiT, which consists of a transformer model with the proposed self-attention layers, for high-resolution image generation.

Denoising Image Generation

OneLLM: One Framework to Align All Modalities with Language

csuhan/onellm 6 Dec 2023

In detail, we first train an image projection module to connect a vision encoder with LLM.

Question Answering

Gaussian Grouping: Segment and Edit Anything in 3D Scenes

lkeab/gaussian-grouping 1 Dec 2023

To address this issue, we propose Gaussian Grouping, which extends Gaussian Splatting to jointly reconstruct and segment anything in open-world 3D scenes.

Colorization Novel View Synthesis +1

SeamlessM4T: Massively Multilingual & Multimodal Machine Translation

facebookresearch/seamless_communication 22 Aug 2023

What does it take to create the Babel Fish, a tool that can help individuals translate speech between any two languages?

Automatic Speech Recognition Speech-to-Speech Translation +3

RETVec: Resilient and Efficient Text Vectorizer

google-research/retvec NeurIPS 2023

The RETVec embedding model is pre-trained using pair-wise metric learning to be robust against typos and character-level adversarial attacks.

Adversarial Text Metric Learning +1

GauHuman: Articulated Gaussian Splatting from Monocular Human Videos

skhu101/gauhuman 5 Dec 2023

We present GauHuman, a 3D human model with Gaussian Splatting for both fast training (1 to 2 minutes) and real-time rendering (up to 189 FPS), compared with existing NeRF-based implicit representation modelling frameworks that demand hours of training and seconds of rendering per frame.

Improving Sample Quality of Diffusion Models Using Self-Attention Guidance

lllyasviel/fooocus ICCV 2023

Denoising diffusion models (DDMs) have attracted attention for their exceptional generation quality and diversity.

Denoising Image Generation

PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation

zhyever/PatchFusion 4 Dec 2023

Single image depth estimation is a foundational task in computer vision and generative modeling.

Depth Estimation

Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models

yuliang-liu/monkey 11 Nov 2023

Additionally, experiments on 18 datasets further demonstrate that Monkey surpasses existing LMMs in many tasks like Image Captioning and various Visual Question Answering formats.

Image Captioning Question Answering +2

MEDITRON-70B: Scaling Medical Pretraining for Large Language Models

epfllm/meditron 27 Nov 2023

Large language models (LLMs) can potentially democratize access to medical knowledge.

Ranked #1 on Multiple Choice Question Answering (MCQA) on MedMCQA (Dev Set (Acc-%) metric)

Conditional Text Generation Multiple Choice Question Answering (MCQA)

