In particular, the same Wikimedia image can be used to illustrate different articles, so the produced caption needs to be adapted to each specific context; this allows us to explore the limits of a model's ability to adjust captions to different contextual information.
In this paper, we present a framework for Multilingual Scene Text Visual Question Answering that deals with new languages in a zero-shot fashion.
Date estimation of historical document images is a challenging problem, and several contributions in the literature lack the ability to generalize from one dataset to others.
In this paper, we propose a Text-Degradation Invariant Auto Encoder (Text-DIAE), a self-supervised model designed to tackle two tasks, text recognition (handwritten or scene-text) and document image enhancement.
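As a rough illustration of the degradation-invariant idea, here is a minimal sketch (not Text-DIAE's actual architecture) of an autoencoder pretrained to reconstruct clean text-line crops from synthetically degraded ones; the `TinyTextAE` module and `degrade` function are hypothetical stand-ins.

```python
# Minimal sketch (PyTorch) of degradation-invariant autoencoder pretraining,
# in the spirit of Text-DIAE. The modules below are toy stand-ins, not the
# paper's encoder/decoder or degradation operators.
import torch
import torch.nn as nn

class TinyTextAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x))

def degrade(x):
    # Toy degradation: additive noise standing in for blur, ink fading, etc.
    return (x + 0.3 * torch.randn_like(x)).clamp(0, 1)

model = TinyTextAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
clean = torch.rand(8, 1, 32, 128)  # batch of clean text-line crops
loss = nn.functional.mse_loss(model(degrade(clean)), clean)
opt.zero_grad()
loss.backward()
opt.step()
```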
It is our hope that OCR-IDL can be a starting point for future works on Document Intelligence.
In this work, we propose two metrics that evaluate the degree of semantic relevance of retrieved items, independently of their annotated binary relevance.
Describing an image in terms of objects that are missing from it or non-existent is known as object bias (hallucination) in image captioning.
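To make the notion concrete, here is a toy check in the spirit of CHAIR-style hallucination metrics: it flags object words mentioned in a caption that are absent from the image's annotated object set. The vocabulary and string matching are purely illustrative.

```python
# Toy object-hallucination check: which object words does a caption
# mention that are not in the image's ground-truth annotations?
def hallucinated_objects(caption, annotated_objects, vocabulary):
    mentioned = {w for w in caption.lower().split() if w in vocabulary}
    return mentioned - set(annotated_objects)

vocab = {"dog", "cat", "frisbee", "car"}
print(hallucinated_objects(
    "a dog catches a frisbee near a car",
    annotated_objects=["dog", "frisbee"],
    vocabulary=vocab))  # -> {'car'}: mentioned but not annotated
```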
This work addresses the problem of Question Answering (QA) on handwritten document collections.
This paper presents a novel method for date estimation of historical photographs from archival sources.
In this paper, we explore and evaluate the use of ranking-based objective functions for simultaneously learning a word-string and a word-image encoder.
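A minimal sketch of one such ranking-based objective, assuming placeholder encoders rather than the paper's architectures: a triplet loss pulls a word image toward the embedding of its transcription and away from a mismatched string.

```python
# Sketch of a triplet ranking objective jointly training a word-image
# encoder and a word-string encoder; both encoders are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

image_enc = nn.Sequential(nn.Flatten(), nn.Linear(32 * 128, 64))
string_enc = nn.EmbeddingBag(num_embeddings=128, embedding_dim=64)  # chars -> 64-d

def encode_strings(words):
    # Encode each word as the mean of its character embeddings.
    ids = [torch.tensor([ord(c) % 128 for c in w]) for w in words]
    offsets = torch.tensor([0] + [len(i) for i in ids[:-1]]).cumsum(0)
    return string_enc(torch.cat(ids), offsets)

word_imgs = torch.rand(4, 1, 32, 128)      # word-image crops
pos = encode_strings(["hello", "world", "text", "image"])
neg = pos.roll(1, dims=0)                  # mismatched strings as negatives
anchor = image_enc(word_imgs)
loss = F.triplet_margin_loss(anchor, pos, neg, margin=0.2)
```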
Low resource Handwritten Text Recognition (HTR) is a hard problem due to the scarce annotated data and the very limited linguistic information (dictionaries and language models).
Scene text instances found in natural images carry explicit semantic information that can provide important cues to solve a wide array of computer vision problems.
State-of-the-art methods for text detection, recognition and tracking are evaluated on the new dataset, and the results highlight the challenges posed by unconstrained driving videos compared to existing datasets.
Text contained in an image carries high-level semantics that can be exploited to achieve richer image understanding.
In this work we target the problem of hate speech detection in multimodal publications formed by a text and an image.
ST-VQA introduces an important aspect that is not addressed by any Visual Question Answering system to date, namely the incorporation of scene text to answer questions asked about an image.
This paper explores the possibilities of image style transfer applied to text while maintaining the original transcriptions.
Current visual question answering datasets do not consider the rich semantic information conveyed by text within an image.
We propose a novel captioning method that is able to leverage contextual information provided by the text of news articles associated with an image.
Cross-modal retrieval methods have improved significantly in recent years with the use of deep neural networks and large-scale annotated datasets such as ImageNet and Places.
In this work we propose to exploit this freely available data to learn a multimodal image and text embedding, aiming to leverage the semantic knowledge learnt in the text domain and transfer it to a visual model for semantic image retrieval.
In this paper we propose to learn a multimodal image and text embedding from Web and Social Media data, aiming to leverage the semantic knowledge learnt in the text domain and transfer it to a visual model for semantic image retrieval.
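A minimal sketch of the joint embedding idea behind the two entries above, assuming an InfoNCE-style contrastive loss (the original works may use a different objective) and placeholder projection heads:

```python
# Sketch of a joint image-text embedding trained contrastively on
# web image-caption pairs; features and loss choice are illustrative.
import torch
import torch.nn.functional as F

img_feats = torch.randn(16, 512)   # e.g. CNN features of images
txt_feats = torch.randn(16, 300)   # e.g. word2vec/LDA text features
img_proj = torch.nn.Linear(512, 128)
txt_proj = torch.nn.Linear(300, 128)

z_img = F.normalize(img_proj(img_feats), dim=1)
z_txt = F.normalize(txt_proj(txt_feats), dim=1)
logits = z_img @ z_txt.t() / 0.07       # pairwise cosine similarities
targets = torch.arange(16)              # matching pairs lie on the diagonal
loss = (F.cross_entropy(logits, targets) +
        F.cross_entropy(logits.t(), targets)) / 2
```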
We perform a language-separate treatment of the data and show that it can be extended to a separate analysis of tourists and locals, and that tourism is reflected in Social Media at the neighborhood level.
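As a toy illustration of a language-separate treatment, one might partition posts by detected language, here using the langdetect package as a stand-in for whatever pipeline the work actually uses:

```python
# Toy language-separate split of Social Media posts; langdetect is a
# stand-in, and the locals/tourists proxy below is only a heuristic.
from collections import defaultdict
from langdetect import detect  # pip install langdetect

posts = ["La Sagrada Familia es preciosa",
         "Amazing views from Park Güell!",
         "Un cafè amb llet a la Barceloneta"]

by_language = defaultdict(list)
for text in posts:
    by_language[detect(text)].append(text)
# e.g. 'es'/'ca' posts may proxy locals, 'en' posts tourists
```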
We show that adequate visual features can be learned efficiently by training a CNN to predict the semantic textual context in which a particular image is most likely to appear as an illustration.
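A minimal sketch of that self-supervised setup, with a toy backbone and random topic targets standing in for, e.g., LDA topic distributions of the article surrounding each image:

```python
# Sketch of self-supervised training where a CNN regresses the topic
# distribution of the text an image illustrates; all parts are toy.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_topics = 40
cnn = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                    nn.Linear(16, num_topics))

images = torch.rand(8, 3, 224, 224)
topic_targets = F.softmax(torch.randn(8, num_topics), dim=1)  # LDA stand-in
log_probs = F.log_softmax(cnn(images), dim=1)
loss = F.kl_div(log_probs, topic_targets, reduction="batchmean")
```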
End-to-end training from scratch of current deep architectures for new computer vision problems would require ImageNet-scale datasets, which are not always available.
Text Proposals have emerged as a class-dependent version of object proposals - efficient approaches to reduce the search space of possible text object locations in an image.
Although widely studied for document images and handwritten documents, it remains almost unexplored territory for scene text images.
Instead of resizing input images to a fixed aspect ratio as in the typical use of holistic CNN classifiers, we propose here a patch-based classification framework in order to preserve discriminative parts of the image that are characteristic of its class.
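A minimal sketch of such a patch-based scheme, with an illustrative patch size, stride, and linear classifier: fixed-size patches are scored independently and their predictions averaged, so the input image never needs resizing to a fixed aspect ratio.

```python
# Sketch of patch-based classification: score fixed-size patches of an
# arbitrarily-shaped image and aggregate their votes; all settings and
# the classifier itself are illustrative placeholders.
import torch
import torch.nn as nn

classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 10))

def classify_by_patches(image, patch=64, stride=64):
    # image: (3, H, W) with H, W >= patch; no aspect-ratio distortion.
    patches = image.unfold(1, patch, stride).unfold(2, patch, stride)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3, patch, patch)
    return classifier(patches).softmax(dim=1).mean(dim=0)  # average votes

scores = classify_by_patches(torch.rand(3, 128, 320))  # 10 class scores
```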