Current Visual Document Understanding (VDU) methods outsource the task of reading text to off-the-shelf Optical Character Recognition (OCR) engines and focus on the understanding task with the OCR outputs.
We propose MM-REACT, a system paradigm that integrates ChatGPT with a pool of vision experts to achieve multimodal reasoning and action.
When running OPT-175B on a single 16 GB GPU, FlexGen achieves significantly higher throughput than state-of-the-art offloading systems, reaching a generation throughput of 1 token/s for the first time with an effective batch size of 144.
We introduce Zero-1-to-3, a framework for changing the camera viewpoint of an object given just a single RGB image.
In this paper, we propose an approach for conditional image-to-video (cI2V) generation using novel latent flow diffusion models (LFDM), which synthesize an optical flow sequence in the latent space, based on the given condition, to warp the given image.
We present marl-jax, a multi-agent reinforcement learning software package for training and evaluating social generalization of the agents.
In 3D object detection for autonomous driving, the sensor portfolio is diverse and complex, spanning both multi-modality and single-modality configurations.
Our approach incorporates new techniques for representation learning, optimization, and augmentation, enabling EVA-CLIP to outperform previous CLIP models with the same number of parameters at a significantly lower training cost.
Ranked #4 on Image Classification on ObjectNet (using extra training data)
We present Text2Room, a method for generating room-scale textured 3D meshes from a given text prompt as input.
We develop a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cuts the memory needed for inference by half while retaining full-precision performance.
Ranked #2 on Language Modelling on C4
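To make the idea of Int8 matrix multiplication concrete, here is a minimal NumPy sketch of a generic absmax-quantized matmul. It is not the procedure from the paper above: the function names `quantize_rows` and `int8_matmul` are hypothetical, and the choice of one scale per row of the input and one per column of the weight is an assumption made for illustration only.

```python
import numpy as np

def quantize_rows(x):
    """Absmax-quantize each row of x to int8; return (int8 values, per-row scales)."""
    scale = np.max(np.abs(x), axis=1, keepdims=True) / 127.0
    scale = np.where(scale == 0.0, 1.0, scale)  # guard against all-zero rows
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(x, w):
    """Approximate x @ w with int8 operands and int32 accumulation (illustrative only)."""
    xq, sx = quantize_rows(x)      # one scale per row of x
    wq, sw = quantize_rows(w.T)    # one scale per column of w
    # Accumulate in int32 so that sums of int8 products cannot overflow.
    acc = xq.astype(np.int32) @ wq.T.astype(np.int32)
    # Undo both scalings: entry (i, j) is multiplied by sx[i] * sw[j].
    return acc.astype(np.float32) * sx * sw.T

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)   # toy activations
w = rng.standard_normal((64, 8)).astype(np.float32)   # toy weights
approx = int8_matmul(x, w)
exact = x @ w
rel_err = np.abs(approx - exact).max() / np.abs(exact).max()
```

Storing `xq` and `wq` as int8 takes half the bytes of a 16-bit representation, which is where a memory saving of this kind would come from. `rel_err` measures how far the quantized product drifts from the float product on this toy input.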