TextVQA
25 papers with code • 0 benchmarks • 0 datasets
Libraries
Use these libraries to find TextVQA models and implementations
Most implemented papers
Towards VQA Models That Can Read
We show that LoRRA outperforms existing state-of-the-art VQA models on our TextVQA dataset.
CogVLM: Visual Expert for Pretrained Language Models
We introduce CogVLM, a powerful open-source visual language foundation model.
CogVLM2: Visual Language Models for Image and Video Understanding
Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications.
Structured Multimodal Attentions for TextVQA
In this paper, we propose an end-to-end structured multimodal attention (SMA) neural network to mainly solve the first two issues above.
Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA
Recent work has explored the TextVQA task that requires reading and understanding text in images to answer a question.
Spatially Aware Multimodal Transformers for TextVQA
Further, each head in our multi-head self-attention layer focuses on a different subset of relations.
RUArt: A Novel Text-Centered Solution for Text-Based Visual Question Answering
Text-based visual question answering (VQA) requires reading and understanding text in an image to correctly answer a given question.
TAP: Text-Aware Pre-training for Text-VQA and Text-Caption
Due to this aligned representation learning, even pre-trained on the same downstream task dataset, TAP already boosts the absolute accuracy on the TextVQA dataset by +5.4%, compared with a non-TAP baseline.
Simple is not Easy: A Simple Strong Baseline for TextVQA and TextCaps
Texts appearing in daily scenes that can be recognized by OCR (Optical Character Recognition) tools contain significant information, such as street name, product brand and prices.
A First Look: Towards Explainable TextVQA Models via Visual and Textual Explanations
Explainable deep learning models are advantageous in many situations.