Image to text

89 papers with code • 1 benchmark • 2 datasets



Most implemented papers

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

salesforce/lavis ICML 2023

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models.
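BLIP-2's core idea is a lightweight bridge between a frozen image encoder and a frozen language model: a small set of learned query tokens cross-attends to the frozen image features and projects the result into the LLM's embedding space, so only the bridge is trained. A minimal numpy sketch of that bottleneck (single-head attention, all weights and dimensions hypothetical, not the paper's Q-Former architecture):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def qformer_bridge(image_feats, queries, w_k, w_v, w_proj):
    """Toy single-head cross-attention: learned query tokens attend to
    frozen image features, then are projected into the LLM embedding
    space to act as soft visual prompts."""
    k = image_feats @ w_k                                  # (num_patches, d)
    v = image_feats @ w_v                                  # (num_patches, d)
    attn = softmax(queries @ k.T / np.sqrt(k.shape[-1]))   # (num_q, num_patches)
    pooled = attn @ v                                      # (num_q, d)
    return pooled @ w_proj                                 # (num_q, d_llm)

rng = np.random.default_rng(0)
d, d_llm, num_patches, num_q = 16, 32, 50, 8
image_feats = rng.normal(size=(num_patches, d))   # stands in for frozen ViT output
queries = rng.normal(size=(num_q, d))             # the only trainable inputs here
w_k, w_v = rng.normal(size=(d, d)), rng.normal(size=(d, d))
w_proj = rng.normal(size=(d, d_llm))
prompts = qformer_bridge(image_feats, queries, w_k, w_v, w_proj)
print(prompts.shape)  # (8, 32): 8 query outputs in the LLM's embedding space
```

Because the image encoder and LLM stay frozen, only the queries and projection weights would receive gradients, which is what keeps the pre-training cost low.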

Effective Use of Word Order for Text Categorization with Convolutional Neural Networks

tensorflow/models HLT 2015

A convolutional neural network (CNN) is a neural network that can exploit the internal structure of data, such as the 2D structure of image data.
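The paper's point is that a 1D convolution over regions of *consecutive* words preserves word order inside each region, unlike bag-of-words features. A toy sketch of that idea (random weights, hypothetical sizes; real models learn the filters and use far larger vocabularies):

```python
import numpy as np

def seq_cnn_features(token_ids, vocab_size, region, weights):
    """Toy 1-D convolution over one-hot word vectors: each window of
    `region` consecutive words is concatenated (keeping their order),
    linearly mapped through `weights`, passed through ReLU, and
    max-pooled over positions."""
    one_hot = np.eye(vocab_size)[token_ids]            # (seq_len, vocab)
    windows = [one_hot[i:i + region].reshape(-1)       # concatenation keeps order
               for i in range(len(token_ids) - region + 1)]
    conv = np.maximum(np.stack(windows) @ weights, 0)  # (n_windows, n_filters)
    return conv.max(axis=0)                            # one value per filter

rng = np.random.default_rng(1)
vocab_size, region, n_filters = 10, 3, 4
weights = rng.normal(size=(vocab_size * region, n_filters))
feats = seq_cnn_features([1, 4, 2, 7, 3], vocab_size, region, weights)
print(feats.shape)  # (4,): one pooled activation per filter
```

Swapping two words inside a region changes which slots of the concatenated one-hot vector are active, so the features change, which a bag-of-words representation could not capture.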

Distilled Dual-Encoder Model for Vision-Language Understanding

kugwzk/distilled-dualencoder 16 Dec 2021

We propose a cross-modal attention distillation framework to train a dual-encoder model for vision-language understanding tasks, such as visual reasoning and visual question answering.
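Attention distillation of this kind typically matches the student's cross-modal attention distributions to the teacher's. A minimal sketch of such a loss (plain KL divergence over attention rows; the paper's exact formulation may differ):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_kl(teacher_attn, student_attn, eps=1e-9):
    """Toy cross-modal attention distillation loss: mean KL divergence
    between teacher and student attention distributions (each row is a
    distribution over the other modality's tokens and sums to 1)."""
    t = teacher_attn + eps
    s = student_attn + eps
    return float(np.mean(np.sum(t * (np.log(t) - np.log(s)), axis=-1)))

rng = np.random.default_rng(2)
# text-to-image attention: 6 text tokens attending over 10 image patches
teacher = softmax(rng.normal(size=(6, 10)))
student = softmax(rng.normal(size=(6, 10)))
print(attention_kl(teacher, teacher))       # 0.0: identical maps incur no loss
print(attention_kl(teacher, student) > 0)   # True: mismatched maps are penalized
```

Minimizing this term pushes the dual-encoder student to attend across modalities the way the fusion-encoder teacher does, without paying the teacher's inference cost.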

Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding

google-research/pix2struct 7 Oct 2022

Visually-situated language is ubiquitous -- sources range from textbooks with diagrams to web pages with images and tables, to mobile apps with buttons and forms.
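Pix2Struct's pretraining objective is to parse a screenshot into a simplified HTML-like target string. A toy illustration of producing such a flat target from a DOM-like tree (the real format and masking scheme differ; this only shows the serialization idea):

```python
def simplify_dom(node):
    """Toy serialization of a DOM-like tree into a flat target string,
    loosely in the spirit of Pix2Struct's simplified-HTML parsing
    targets (illustrative only, not the paper's actual format)."""
    tag = node["tag"]
    if "text" in node:
        return f"<{tag}> {node['text']} </{tag}>"
    inner = " ".join(simplify_dom(c) for c in node.get("children", []))
    return f"<{tag}> {inner} </{tag}>"

page = {"tag": "div", "children": [
    {"tag": "button", "text": "Submit"},
    {"tag": "img", "text": "logo"},
]}
print(simplify_dom(page))
# <div> <button> Submit </button> <img> logo </img> </div>
```

Training a model to emit such strings from pixels alone forces it to recover layout, hierarchy, and text jointly, which is why the pretraining transfers to buttons, forms, and diagrams.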

Improving Factual Completeness and Consistency of Image-to-Text Radiology Report Generation

ysmiura/ifcc NAACL 2021

We further show via a human evaluation and a qualitative analysis that our system produces generations that are more factually complete and consistent than the baselines.
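Report-generation systems of this kind optimize rewards that check whether the clinical facts in the reference survive in the generated text. A toy entity-recall reward in that spirit (the paper's actual rewards use clinical entity matching and natural language inference, not substring search):

```python
def factual_completeness(reference_entities, generated_report):
    """Toy reward: fraction of reference clinical entities that are
    mentioned in the generated report. Illustrative only; real systems
    use entity extraction and NLI rather than substring matching."""
    text = generated_report.lower()
    hits = sum(1 for e in reference_entities if e.lower() in text)
    return hits / len(reference_entities)

ref = ["pleural effusion", "cardiomegaly", "pneumothorax"]
gen = "Mild cardiomegaly is seen. No pleural effusion."
print(factual_completeness(ref, gen))  # ~0.667: 2 of 3 entities covered
```

Using such a reward during fine-tuning (e.g. with policy gradients) directly optimizes factual coverage instead of only token-level likelihood.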

Versatile Diffusion: Text, Images and Variations All in One Diffusion Model

shi-labs/versatile-diffusion ICCV 2023

In this work, we expand the existing single-flow diffusion pipeline into a multi-task multimodal network, dubbed Versatile Diffusion (VD), that handles multiple flows of text-to-image, image-to-text, and variations in one unified model.

One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale

thu-ml/unidiffuser 12 Mar 2023

Inspired by the unified view, UniDiffuser learns all distributions simultaneously with a minimal modification to the original diffusion model: it perturbs data in all modalities instead of a single one, takes an individual timestep for each modality as input, and predicts the noise of all modalities rather than a single modality.
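The per-modality timestep is what makes one network cover every flow: setting a modality's timestep to 0 leaves it clean (a condition), while a large timestep makes it a generation target. A toy sketch of that forward perturbation (linear schedule and fixed seed are illustrative simplifications):

```python
import numpy as np

def perturb(x, t, num_steps=1000):
    """Toy forward diffusion: mix data with Gaussian noise according to
    timestep t. Each modality gets its own t, so t=0 keeps a modality
    clean (conditioning) while a large t noises it (generation target).
    The linear schedule and fixed seed are illustrative only."""
    alpha = 1.0 - t / num_steps
    noise = np.random.default_rng(3).normal(size=x.shape)
    return np.sqrt(alpha) * x + np.sqrt(1 - alpha) * noise

img_latent = np.ones((4,))
txt_latent = np.ones((4,))
# image-to-text flow: keep the image clean (t=0), noise the text (t=800)
img_in = perturb(img_latent, t=0)
txt_in = perturb(txt_latent, t=800)
print(np.allclose(img_in, img_latent))  # True: t=0 leaves the image intact
```

Swapping which modality gets t=0 turns the same network into a text-to-image model, and noising both gives joint generation, which is the "one transformer fits all distributions" claim.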

Cephalo: Multi-Modal Vision-Language Models for Bio-Inspired Materials Analysis and Design

lamm-mit/Cephalo-Phi-3-MoE 29 May 2024

We present Cephalo, a series of multimodal vision large language models (V-LLMs) designed for materials science applications, integrating visual and linguistic data for enhanced understanding.

Text-to-Image-to-Text Translation using Cycle Consistent Adversarial Networks

CSC2548/text2image2textGAN 14 Aug 2018

Text-to-Image translation has been an active area of research in the recent past.

MirrorGAN: Learning Text-to-image Generation by Redescription

komiya-m/MirrorGAN CVPR 2019

Generating an image from a given text description has two goals: visual realism and semantic consistency.
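MirrorGAN targets the second goal, semantic consistency, by re-describing the generated image and aligning that redescription with the input text. A minimal sketch of such a cycle-consistency term on embeddings (cosine distance; hypothetical embeddings stand in for the text encoder and captioner):

```python
import numpy as np

def redescription_loss(text_emb, redescribed_emb):
    """Toy semantic-consistency loss in the spirit of MirrorGAN's
    text -> image -> text cycle: cosine distance between the input
    text embedding and the embedding of the re-described image."""
    a = text_emb / np.linalg.norm(text_emb)
    b = redescribed_emb / np.linalg.norm(redescribed_emb)
    return float(1.0 - a @ b)  # near 0 when the redescription matches

text = np.array([1.0, 0.0, 2.0])
print(redescription_loss(text, text) < 1e-9)     # True: perfect redescription
print(redescription_loss(text, text[::-1]) > 0)  # True: drifted semantics
```

Penalizing this distance during training pushes the generator toward images whose content a captioner can map back to the original description, complementing the adversarial loss that handles visual realism.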