We demonstrate the limitations of current Scene Text VQA and VideoQA methods and propose ways to incorporate scene text information into VideoQA methods.
Recognition of text on word or line images, without the need for sub-word segmentation, has become the mainstream approach in text recognition research and development for Indian languages.
In this report we present the results of the ICDAR 2021 edition of the Document Visual Question Answering Challenges.
This work addresses the problem of Question Answering (QA) on handwritten document collections.
Infographics are documents designed to effectively communicate information using a combination of textual, graphical and visual elements.
The performance is benchmarked on the new IIIT-ILST dataset, comprising hundreds of real scene images containing text in the aforementioned scripts.
Images in the medical domain are fundamentally different from general-domain images.
For Task 1, a new dataset is introduced comprising 50,000 question-answer pairs defined over 12,767 document images.
The dataset consists of 50,000 questions defined on 12,000+ document images.
State-of-the-art methods for text detection, recognition and tracking are evaluated on the new dataset, and the results highlight the challenges posed by unconstrained driving videos compared to existing datasets.
ST-VQA introduces an important aspect that is not addressed by any Visual Question Answering system to date, namely the incorporation of scene text to answer questions asked about an image.
The word error rate (WER) of an OCR system is often higher than its character error rate (CER).
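The gap between the two metrics follows directly from their definitions: a single misrecognized character makes the whole word count as wrong. A minimal sketch of both metrics (the example strings are hypothetical, not taken from any of the papers above):

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: character-level edits / reference length."""
    return levenshtein(ref, hyp) / len(ref)

def wer(ref, hyp):
    """Word error rate: word-level edits / number of reference words."""
    return levenshtein(ref.split(), hyp.split()) / len(ref.split())

reference = "the quick brown fox"
ocr_output = "the qu1ck brswn fox"  # two single-character substitutions

print(cer(reference, ocr_output))  # 2 errors / 19 characters ~ 0.105
print(wer(reference, ocr_output))  # 2 wrong words / 4 words = 0.5
```

Here two character-level mistakes yield a CER of about 10.5% but a WER of 50%, since each mistake falls in a different word.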
For scripts like Arabic, a major challenge in developing robust recognizers is the lack of large quantities of annotated data.