OCR-free Document Understanding Transformer

clovaai/donut 30 Nov 2021

Current Visual Document Understanding (VDU) methods outsource the task of reading text to off-the-shelf Optical Character Recognition (OCR) engines and focus on the understanding task with the OCR outputs.

Optical Character Recognition (OCR)

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

microsoft/MM-REACT 20 Mar 2023

We propose MM-REACT, a system paradigm that integrates ChatGPT with a pool of vision experts to achieve multimodal reasoning and action.

High-throughput Generative Inference of Large Language Models with a Single GPU

fminference/flexgen 13 Mar 2023

When running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput than state-of-the-art offloading systems, reaching a generation throughput of 1 token/s for the first time with an effective batch size of 144.

Language Modelling

Zero-1-to-3: Zero-shot One Image to 3D Object

cvlab-columbia/zero123 20 Mar 2023

We introduce Zero-1-to-3, a framework for changing the camera viewpoint of an object given just a single RGB image.

3D Reconstruction Novel View Synthesis +1

Conditional Image-to-Video Generation with Latent Flow Diffusion Models

nihaomiao/cvpr23_lfdm 24 Mar 2023

In this paper, we propose an approach for conditional image-to-video (cI2V) generation using novel latent flow diffusion models (LFDM), which synthesize an optical flow sequence in the latent space based on the given condition and use it to warp the given image.

Image to Video Generation Optical Flow Estimation

marl-jax: Multi-agent Reinforcement Learning framework for Social Generalization

kinalmehta/marl-jax 24 Mar 2023

We present marl-jax, a multi-agent reinforcement learning software package for training and evaluating social generalization of the agents.

Multi-agent Reinforcement Learning reinforcement-learning +1

UniDistill: A Universal Cross-Modality Knowledge Distillation Framework for 3D Object Detection in Bird's-Eye View

megvii-research/cvpr2023-unidistill 27 Mar 2023

In the field of 3D object detection for autonomous driving, the sensor portfolio including multi-modality and single-modality is diverse and complex.

3D Object Detection Autonomous Driving +2

EVA-CLIP: Improved Training Techniques for CLIP at Scale

baaivision/eva 27 Mar 2023

Our approach incorporates new techniques for representation learning, optimization, and augmentation, enabling EVA-CLIP to achieve superior performance compared to previous CLIP models with the same number of parameters but at significantly lower training cost.

Ranked #4 on Image Classification on ObjectNet (using extra training data)

Image Classification Representation Learning

Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models

lukashoel/text2room 21 Mar 2023

We present Text2Room, a method for generating room-scale textured 3D meshes from a given text prompt as input.

Monocular Depth Estimation

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

timdettmers/bitsandbytes 15 Aug 2022

We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cuts the memory needed for inference by half while retaining full precision performance.

Language Modelling Linguistic Acceptability +4
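
To make the idea of Int8 matrix multiplication concrete, here is a minimal pure-Python sketch of symmetric absmax quantization with one scale per row of the left matrix and one per column of the right matrix. It is an illustration only: the function names (`quantize_row`, `int8_matmul`, `matmul`) are hypothetical, it is not the bitsandbytes API, and it does not claim to reproduce the full procedure from the paper. Storing each value in 8 bits instead of 16 is where a halving of memory would come from, assuming a 16-bit baseline.

```python
import math
import random

def quantize_row(row):
    """Symmetric absmax quantization of one vector to the int8 range [-127, 127]."""
    scale = max(abs(v) for v in row) / 127.0 or 1.0  # fall back to 1.0 for an all-zero vector
    return [max(-127, min(127, round(v / scale))) for v in row], scale

def int8_matmul(a, b):
    """Approximate a @ b using integer operands and per-row / per-column scales."""
    qa = [quantize_row(row) for row in a]        # one scale per row of a
    qb = [quantize_row(col) for col in zip(*b)]  # one scale per column of b
    out = []
    for qrow, sa in qa:
        # Integer dot product first, then dequantize with the two scales.
        out.append([sum(x * y for x, y in zip(qrow, qcol)) * sa * sb
                    for qcol, sb in qb])
    return out

def matmul(a, b):
    """Reference floating-point matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

random.seed(0)
a = [[random.gauss(0, 1) for _ in range(64)] for _ in range(4)]
b = [[random.gauss(0, 1) for _ in range(8)] for _ in range(64)]

approx, exact = int8_matmul(a, b), matmul(a, b)

# Relative Frobenius-norm error between the quantized and exact products.
diff = math.sqrt(sum((p - q) ** 2 for pr, qr in zip(approx, exact) for p, q in zip(pr, qr)))
norm = math.sqrt(sum(q ** 2 for qr in exact for q in qr))
rel_err = diff / norm
print(f"relative error: {rel_err:.4f}")
```

Using separate scales per row and per column, rather than one scale for the whole matrix, keeps a single large entry from degrading the precision of every other value.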

