To enable this framework, we devise a scalable pipeline that automatically generates high-quality instruction-tuning datasets from readily available captioning data across different modalities, and contribute 24K QA pairs for audio and 250K QA pairs for 3D.
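As a rough illustration of how such a caption-to-QA pipeline could work, the sketch below prompts an off-the-shelf language model to turn each caption into a question about the described scene. The model name, prompt template, and `caption_to_qa` helper are hypothetical and only indicate the general idea, not the paper's actual pipeline.

```python
# Hypothetical sketch: generating instruction-tuning QA data from captions with an
# off-the-shelf LLM. Prompt wording and model choice are placeholders, not the
# paper's pipeline.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # placeholder model

PROMPT = (
    "Caption: {caption}\n"
    "Write one question about the described scene and its answer.\n"
    "Question:"
)

def caption_to_qa(caption: str) -> str:
    """Generate a question (and ideally an answer) grounded in the caption."""
    prompt = PROMPT.format(caption=caption)
    out = generator(prompt, max_new_tokens=50, do_sample=False)
    # The pipeline returns the prompt plus the continuation; keep only the continuation.
    return out[0]["generated_text"][len(prompt):].strip()

if __name__ == "__main__":
    captions = [
        "A dog catches a frisbee in a sunny park.",               # image-style caption
        "Rain drums on a tin roof while thunder rolls far away.",  # audio-style caption
    ]
    for c in captions:
        print(c, "->", caption_to_qa(c))
```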
We introduce LAVIS, an open-source deep learning library for LAnguage-VISion research and applications.
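For context, LAVIS loads pretrained language-vision models and their matching preprocessors through a unified entry point. The snippet below follows the `load_model_and_preprocess` interface shown in the library's README; the exact model and checkpoint names are assumptions and may differ across releases.

```python
# Minimal LAVIS usage sketch following the load_model_and_preprocess entry point
# from the library's README; model/checkpoint names may vary between versions.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load a captioning model together with its matching image preprocessor.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip_caption", model_type="base_coco", is_eval=True, device=device
)

raw_image = Image.open("photo.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Generate a caption for the image.
print(model.generate({"image": image}))
```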
Furthermore, performance improvements have largely been achieved by scaling up datasets with noisy image-text pairs collected from the web, which are a suboptimal source of supervision.
Ranked #3 on Open Vocabulary Attribute Detection on the OVAD-Box benchmark (using extra training data).
Then we design a subject representation learning task which enables a diffusion model to leverage such a visual representation and generate new subject renditions.
Ranked #6 on Personalized Image Generation on DreamBooth.
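The subject-conditioning idea described above can be pictured with a toy denoiser whose cross-attention context concatenates text-prompt embeddings with an embedding of the reference subject. Every class name, dimension, and encoder here is an illustrative assumption, not the paper's architecture.

```python
# Illustrative sketch (not the paper's implementation): a denoiser whose
# cross-attention context is the concatenation of text-prompt embeddings and a
# "subject" embedding extracted from a reference image of the subject.
import torch
import torch.nn as nn

class SubjectConditionedDenoiser(nn.Module):
    def __init__(self, latent_dim=64, ctx_dim=128, heads=4):
        super().__init__()
        self.subject_encoder = nn.Linear(512, ctx_dim)   # stand-in for a visual encoder
        self.latent_proj = nn.Linear(latent_dim, ctx_dim)
        self.cross_attn = nn.MultiheadAttention(ctx_dim, heads, batch_first=True)
        self.out = nn.Linear(ctx_dim, latent_dim)

    def forward(self, noisy_latents, text_ctx, subject_feats):
        # noisy_latents: (B, N, latent_dim); text_ctx: (B, T, ctx_dim)
        # subject_feats: (B, 512) features of the reference subject image.
        subject_ctx = self.subject_encoder(subject_feats).unsqueeze(1)  # (B, 1, ctx_dim)
        context = torch.cat([text_ctx, subject_ctx], dim=1)             # joint conditioning
        q = self.latent_proj(noisy_latents)
        attended, _ = self.cross_attn(q, context, context)
        return self.out(attended)                                       # predicted noise

# Toy forward pass with random tensors.
denoiser = SubjectConditionedDenoiser()
eps = denoiser(torch.randn(2, 16, 64), torch.randn(2, 8, 128), torch.randn(2, 512))
print(eps.shape)  # torch.Size([2, 16, 64])
```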
This work presents Moonshot, a new video generation model that conditions simultaneously on multimodal inputs of image and text.
Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence.
Ranked #5 on Visual Question Answering on BenchLMM.
Most existing methods employ a transformer-based multimodal encoder to jointly model visual tokens (region-based image features) and word tokens.
Ranked #5 on Open Vocabulary Attribute Detection on the OVAD-Box benchmark (using extra training data).
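As background for the encoder design mentioned above, the sketch below jointly encodes projected region-based image features and word embeddings with a standard Transformer. All dimensions, the vocabulary size, and the class name are illustrative assumptions, not any specific paper's implementation.

```python
# Illustrative joint multimodal encoder: project region-based image features,
# embed word tokens, concatenate them, and run a standard Transformer over both.
import torch
import torch.nn as nn

class JointMultimodalEncoder(nn.Module):
    def __init__(self, region_dim=2048, vocab_size=30522, d_model=256, layers=4):
        super().__init__()
        self.visual_proj = nn.Linear(region_dim, d_model)    # region features -> visual tokens
        self.word_embed = nn.Embedding(vocab_size, d_model)  # word ids -> word tokens
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, region_feats, word_ids):
        # region_feats: (B, R, region_dim); word_ids: (B, T) integer token ids
        tokens = torch.cat([self.visual_proj(region_feats), self.word_embed(word_ids)], dim=1)
        return self.encoder(tokens)                           # (B, R + T, d_model)

enc = JointMultimodalEncoder()
out = enc(torch.randn(2, 36, 2048), torch.randint(0, 30522, (2, 12)))
print(out.shape)  # torch.Size([2, 48, 256])
```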
Visual question answering (VQA) is a hallmark of vision and language reasoning and a challenging task under the zero-shot setting.
Ranked #2 on Visual Question Answering (VQA) on VQA v2 val.
Finally, we provide a user interface (UI) that allows users to perform causal analysis on data without coding.
In this study, we attempt to render the training of LLMs for program synthesis more efficient by unifying four key components: (1) model architectures, (2) learning methods, (3) infill sampling, and (4) data distributions.
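To make the infill-sampling component concrete, the sketch below shows a generic fill-in-the-middle style data transformation: a span is cut out of the source and moved behind sentinel tokens so a causal language model can learn to infill it. The sentinel strings and split logic are common illustrations, not the paper's exact scheme.

```python
# Generic fill-in-the-middle (FIM) formatting sketch for causal-LM infilling.
# Sentinel strings and the splitting strategy are illustrative, not the paper's recipe.
import random

PREFIX, SUFFIX, MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def to_infill_example(source: str, rng: random.Random) -> str:
    """Cut a random middle span out of `source` and move it behind sentinel tokens."""
    i, j = sorted(rng.sample(range(len(source) + 1), 2))
    prefix, middle, suffix = source[:i], source[i:j], source[j:]
    # The model is trained left-to-right on this sequence and learns to emit the
    # missing middle after seeing both the prefix and the suffix.
    return f"{PREFIX}{prefix}{SUFFIX}{suffix}{MIDDLE}{middle}"

rng = random.Random(0)
print(to_infill_example("def add(a, b):\n    return a + b\n", rng))
```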