Search Results

X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning

2 code implementations · 30 Nov 2023

To enable this framework, we devise a scalable pipeline that automatically generates high-quality, instruction-tuning datasets from readily available captioning data across different modalities, and contribute 24K QA data for audio and 250K QA data for 3D.

Visual Reasoning

LAVIS: A Library for Language-Vision Intelligence

1 code implementation · 15 Sep 2022

We introduce LAVIS, an open-source deep learning library for LAnguage-VISion research and applications.

Benchmarking · Image Captioning · +8

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

9 code implementations · 28 Jan 2022

Furthermore, performance improvement has been largely achieved by scaling up the dataset with noisy image-text pairs collected from the web, which is a suboptimal source of supervision.

Ranked #3 on Open Vocabulary Attribute Detection on OVAD-Box benchmark (using extra training data)

Image Captioning · Image-text matching · +5

BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing

1 code implementation · NeurIPS 2023

Then we design a subject representation learning task that enables a diffusion model to leverage such visual representation and generate new subject renditions.

Personalized Image Generation · Representation Learning · +1

Moonshot: Towards Controllable Video Generation and Editing with Multimodal Conditions

2 code implementations · 3 Jan 2024

This work presents Moonshot, a new video generation model that conditions simultaneously on multimodal inputs of image and text.

Image Animation · Video Editing · +1

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

4 code implementations · NeurIPS 2023

Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence.

1 Image, 2*2 Stitching · Diversity · +5

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

6 code implementations · NeurIPS 2021

Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens.

Ranked #5 on Open Vocabulary Attribute Detection on OVAD-Box benchmark (using extra training data)

Grounded language learning · Image-text matching · +8

Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training

3 code implementations · 17 Oct 2022

Visual question answering (VQA) is a hallmark of vision and language reasoning and a challenging task under the zero-shot setting.

Image Captioning · Network Interpretation · +2

Salesforce CausalAI Library: A Fast and Scalable Framework for Causal Analysis of Time Series and Tabular Data

1 code implementation · 25 Jan 2023

Finally, we provide a user interface (UI) that allows users to perform causal analysis on data without coding.

Causal Discovery · Causal Inference · +2

CodeGen2: Lessons for Training LLMs on Programming and Natural Languages

2 code implementations · 3 May 2023

In this study, we attempt to render the training of LLMs for program synthesis more efficient by unifying four key components: (1) model architectures, (2) learning methods, (3) infill sampling, and (4) data distributions.

Causal Language Modeling · Decoder · +4