Image to text
89 papers with code • 1 benchmark • 2 datasets
Benchmarks
These leaderboards are used to track progress in Image to text.
Libraries
Use these libraries to find Image to text models and implementations.
Most implemented papers
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models.
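For orientation, here is a minimal captioning sketch using the BLIP-2 port in Hugging Face transformers (requires transformers >= 4.27; the checkpoint name is one published example, and any BLIP-2 checkpoint can be substituted):

```python
# Minimal image-to-text sketch with the BLIP-2 port in Hugging Face
# transformers. The checkpoint is one published example, not the only option.
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

# Any RGB image works; this COCO validation image is just a convenient URL.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```

Because the image encoder and language model stay frozen, only the lightweight Q-Former bridging them is trained, which is what keeps the pre-training cost down.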
Effective Use of Word Order for Text Categorization with Convolutional Neural Networks
A convolutional neural network (CNN) is a neural network that can make use of the internal structure of data, such as the 2D structure of image data.
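A minimal PyTorch sketch of the idea: 1D convolutions over windows of consecutive words, so the filters see word order rather than a bag of words. The paper itself convolves directly over one-hot region vectors; the embedding layer here is a simplification.

```python
# Word-order-aware text CNN sketch: each filter spans a window of
# consecutive word vectors, followed by max-over-time pooling.
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=128, n_filters=100,
                 window=3, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, n_filters, kernel_size=window)
        self.fc = nn.Linear(n_filters, n_classes)

    def forward(self, token_ids):      # (batch, seq_len)
        x = self.embed(token_ids)      # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)          # Conv1d expects (batch, channels, len)
        x = torch.relu(self.conv(x))   # (batch, n_filters, seq_len - window + 1)
        x = x.max(dim=2).values        # max-over-time pooling
        return self.fc(x)

logits = TextCNN()(torch.randint(0, 10000, (4, 50)))  # 4 documents, 50 tokens each
```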
Distilled Dual-Encoder Model for Vision-Language Understanding
We propose a cross-modal attention distillation framework to train a dual-encoder model for vision-language understanding tasks, such as visual reasoning and visual question answering.
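A hedged sketch of the core distillation signal, assuming the fusion-encoder teacher's and dual-encoder student's cross-modal attention maps have already been extracted and shape-aligned (layer/head pairing and the task losses it is combined with are omitted):

```python
# Cross-modal attention distillation sketch: push the student's attention
# distributions toward the teacher's with a KL divergence (soft cross-entropy).
import torch.nn.functional as F

def attn_distill_loss(student_attn, teacher_attn):
    # Both tensors: (batch, heads, query_len, key_len) attention probabilities.
    return F.kl_div(student_attn.clamp_min(1e-8).log(),
                    teacher_attn, reduction="batchmean")
```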
Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding
Visually-situated language is ubiquitous -- sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and forms.
Improving Factual Completeness and Consistency of Image-to-Text Radiology Report Generation
We further show via a human evaluation and a qualitative analysis that our system leads to generations that are more factually complete and consistent compared to the baselines.
Versatile Diffusion: Text, Images and Variations All in One Diffusion Model
In this work, we expand the existing single-flow diffusion pipeline into a multi-task multimodal network, dubbed Versatile Diffusion (VD), that handles multiple flows of text-to-image, image-to-text, and variations in one unified model.
One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale
Inspired by the unified view, UniDiffuser learns all distributions simultaneously with a minimal modification to the original diffusion model: it perturbs data in all modalities instead of a single one, feeds in an individual timestep for each modality, and predicts the noise of all modalities rather than just one.
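A hedged sketch of that training step; `model`, its signature, and the noise schedule are illustrative stand-ins, not the authors' code:

```python
# UniDiffuser-style joint step sketch: perturb BOTH modalities with
# independent timesteps and train one network to predict both noises.
import torch

def joint_diffusion_step(model, x_img, x_txt, alphas_cumprod, T=1000):
    b = x_img.size(0)
    t_img = torch.randint(0, T, (b,))   # separate timestep per modality
    t_txt = torch.randint(0, T, (b,))
    eps_img = torch.randn_like(x_img)
    eps_txt = torch.randn_like(x_txt)

    def noisy(x, t, eps):               # standard forward-process perturbation
        a = alphas_cumprod[t].view(b, *([1] * (x.dim() - 1)))
        return a.sqrt() * x + (1 - a).sqrt() * eps

    # Assumed signature: one network sees both noisy inputs and both timesteps.
    pred_img, pred_txt = model(noisy(x_img, t_img, eps_img),
                               noisy(x_txt, t_txt, eps_txt), t_img, t_txt)
    return ((pred_img - eps_img) ** 2).mean() + ((pred_txt - eps_txt) ** 2).mean()
```

Fixing one modality's timestep to 0 (clean data) then yields conditional generation such as image-to-text from the same network, which is how a single model covers all the distributions.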
Cephalo: Multi-Modal Vision-Language Models for Bio-Inspired Materials Analysis and Design
We present Cephalo, a series of multimodal vision large language models (V-LLMs) designed for materials science applications, integrating visual and linguistic data for enhanced understanding.
Text-to-Image-to-Text Translation using Cycle Consistent Adversarial Networks
Text-to-image translation has been an active area of research in recent years.
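A minimal sketch of the cycle-consistency signal, with `t2i`, `i2t`, and the L1 reconstruction distance as stand-ins for the paper's generators and loss (adversarial terms omitted):

```python
# Cycle-consistency sketch: text -> image -> text, penalizing drift
# between the reconstructed caption embedding and the original.
import torch.nn.functional as F

def cycle_loss(t2i, i2t, text_emb):
    fake_image = t2i(text_emb)              # text -> image
    recon_emb = i2t(fake_image)             # image -> text (back-translation)
    return F.l1_loss(recon_emb, text_emb)   # reconstruction should match input
```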
MirrorGAN: Learning Text-to-image Generation by Redescription
Generating an image from a given text description has two goals: visual realism and semantic consistency.
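A hedged sketch of the redescription signal: caption the generated image and score the re-description against the source caption's tokens (`generator` and `captioner` are illustrative stand-ins; MirrorGAN combines this with adversarial losses for visual realism):

```python
# Redescription sketch: a token-level cross-entropy between the caption of
# the generated image and the original text enforces semantic consistency.
import torch.nn.functional as F

def redescription_loss(generator, captioner, caption_tokens):
    image = generator(caption_tokens)               # text -> image
    logits = captioner(image)                       # (batch, seq_len, vocab)
    return F.cross_entropy(logits.transpose(1, 2),  # -> (batch, vocab, seq_len)
                           caption_tokens)
```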