In particular, the same Wikimedia image can be used to illustrate different articles, so the generated caption needs to be adapted to each specific context; this allows us to explore the limits of a model's ability to adjust captions to different contextual information.
This paper presents the final results of the Out-Of-Vocabulary 2022 (OOV) challenge.
In this paper, we present a framework for Multilingual Scene Text Visual Question Answering that deals with new languages in a zero-shot fashion.
In this paper, we propose a Text-Degradation Invariant Auto Encoder (Text-DIAE), a self-supervised model designed to tackle two tasks: text recognition (handwritten or scene text) and document image enhancement.
It is our hope that OCR-IDL can be a starting point for future works on Document Intelligence.
Accounting for this, we propose a single objective pre-training scheme that requires only text and spatial cues.
In this work, we propose two metrics that evaluate the degree of semantic relevance of retrieved items, independently of their annotated binary relevance.
Describing objects that are missing from or non-existent in an image is known as object bias (hallucination) in image captioning.
This work investigates the problem of sketch-guided object localization (SGOL), where human sketches are used as queries to perform object localization in natural images.
Low-resource Handwritten Text Recognition (HTR) is a hard problem due to scarce annotated data and very limited linguistic information (dictionaries and language models).
Scene text instances found in natural images carry explicit semantic information that can provide important cues to solve a wide array of computer vision problems.
This paper presents a new model for the task of scene text visual question answering, in which questions about a given image can only be answered by reading and understanding scene text that is present in it.
Text contained in an image carries high-level semantics that can be exploited to achieve richer image understanding.
ST-VQA introduces an important aspect that is not addressed by any Visual Question Answering system to date, namely the incorporation of scene text to answer questions asked about an image.
This paper explores the possibilities of image style transfer applied to text while maintaining the original transcriptions.
Current visual question answering datasets do not consider the rich semantic information conveyed by text within an image.
We propose a novel captioning method that is able to leverage contextual information provided by the text of news articles associated with an image.