A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise

bradyfu/awesome-multimodal-large-language-models 19 Dec 2023

They endow Large Language Models (LLMs) with powerful capabilities in visual understanding, enabling them to tackle diverse multi-modal tasks.

Visual Reasoning

8,673
0.33 stars / hour

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

boheumd/MA-LMM 8 Apr 2024

However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames for short video understanding.

Question Answering, Video Captioning, +4

85
0.30 stars / hour

Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding

ivan-tang-3d/any2point 11 Apr 2024

The adapter incorporates prior spatial knowledge from the source modality to guide the local feature aggregation of 3D tokens, compelling the semantic adaption of any-modality transformers.

29
0.30 stars / hour

BAdam: A Memory Efficient Full Parameter Training Method for Large Language Models

ledzy/badam 3 Apr 2024

This work presents BAdam, an optimizer that leverages the block coordinate optimization framework with Adam as the inner solver.

48
0.29 stars / hour

Hash3D: Training-free Acceleration for 3D Generation

Adamdad/hash3D 9 Apr 2024

The evolution of 3D generative modeling has been notably propelled by the adoption of 2D diffusion models.

3D Generation, Image to 3D, +1

96
0.29 stars / hour

Policy-Guided Diffusion

emptyjackson/policy-guided-diffusion 9 Apr 2024

Our approach provides an effective alternative to autoregressive offline world models, opening the door to the controllable generation of synthetic training data.

60
0.29 stars / hour

AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation

scutzzj/aniportrait 26 Mar 2024

In this study, we propose AniPortrait, a novel framework for generating high-quality animation driven by audio and a reference portrait image.

Face Reenactment

3,464
0.28 stars / hour

mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding

x-plug/mplug-docowl 19 Mar 2024

In this work, we emphasize the importance of structure information in Visual Document Understanding and propose the Unified Structure Learning to boost the performance of MLLMs.

Document Understanding, Optical Character Recognition (OCR)

807
0.28 stars / hour

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

stanfordnlp/dsp 5 Oct 2023

The ML community is rapidly exploring techniques for prompting language models (LMs) and for stacking them into pipelines that solve complex tasks.

Language Modelling, Math

9,984
0.28 stars / hour

NeuroNCAP: Photorealistic Closed-loop Safety Testing for Autonomous Driving

wljungbergh/neuroncap 11 Apr 2024

We present a versatile NeRF-based simulator for testing autonomous driving (AD) software systems, designed with a focus on sensor-realistic closed-loop evaluation and the creation of safety-critical scenarios.

Autonomous Driving

28
0.27 stars / hour