no code implementations • 11 Sep 2023 • Souhail Bakkali, Sanket Biswas, Zuheng Ming, Mickael Coustaty, Marçal Rusiñol, Oriol Ramos Terrades, Josep Lladós
The field of visual document understanding has witnessed rapid growth in emerging challenges and powerful multi-modal strategies.
Ranked #19 on Document Image Classification on RVL-CDIP
1 code implementation • 5 Sep 2023 • Sergi Garcia-Bordils, Dimosthenis Karatzas, Marçal Rusiñol
We introduce the structured scene-text spotting task, which requires a scene-text OCR system to spot text in the wild according to a query regular expression.
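To make the task concrete, a structured spotter only returns detections whose transcription matches the query expression. The snippet below sketches just that filtering step; the detection tuples and the `structured_spot` helper are hypothetical illustrations, not the paper's interface.

```python
import re

# Hypothetical spotter output: (transcription, bounding box, confidence).
detections = [
    ("AB-1234", (10, 20, 110, 45), 0.93),
    ("EXIT", (200, 15, 260, 40), 0.88),
]

def structured_spot(detections, query):
    """Keep only detections whose transcription fully matches the query regex."""
    pattern = re.compile(query)
    return [d for d in detections if pattern.fullmatch(d[0])]

# Example query: license-plate-like strings (two letters, a dash, four digits).
print(structured_spot(detections, r"[A-Z]{2}-\d{4}"))
```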
no code implementations • IJDAR 2021 • Souhail Bakkali, Zuheng Ming, Mickael Coustaty, Marçal Rusiñol
To the best of our knowledge, this is the first work to leverage a mutual learning approach together with a self-attention-based fusion module for document image classification; a minimal sketch of such a fusion module follows below.
Ranked #1 on Document Image Classification on RVL-CDIP
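For intuition, here is a minimal PyTorch sketch of fusing an image feature vector and a text feature vector with self-attention before classification. Dimensions, depth, and the classifier head are illustrative assumptions; the paper's module, and the mutual-learning objective that accompanies it, are more elaborate.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Fuse image and text features with multi-head self-attention (sketch)."""
    def __init__(self, dim=512, num_classes=16, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, img_feat, txt_feat):
        # Treat the two modality vectors as a length-2 token sequence.
        tokens = torch.stack([img_feat, txt_feat], dim=1)   # (B, 2, dim)
        fused, _ = self.attn(tokens, tokens, tokens)
        return self.classifier(fused.mean(dim=1))           # (B, num_classes)

# RVL-CDIP has 16 document classes.
model = AttentionFusion()
logits = model(torch.randn(4, 512), torch.randn(4, 512))
```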
no code implementations • 24 May 2022 • Souhail Bakkali, Zuheng Ming, Mickael Coustaty, Marçal Rusiñol, Oriol Ramos Terrades
Multimodal learning from document data has achieved great success lately, as it allows pre-training semantically meaningful features that serve as a prior for learnable downstream tasks.
Ranked #18 on Document Image Classification on RVL-CDIP
no code implementations • 12 Apr 2022 • Lei Kang, Pau Riba, Marçal Rusiñol, Alicia Fornés, Mauricio Villegas
Once properly trained, our method can also be adapted to new target data by accessing only unlabeled text-line images, mimicking their handwriting styles and producing images with any textual content.
no code implementations • CVPRW 2020 • Souhail Bakkali, Zuheng Ming, Mickael Coustaty, Marçal Rusiñol
Moreover, a joint feature learning approach that combines image features and text embeddings is introduced as a late fusion methodology (a minimal late-fusion sketch follows below).
Ranked #2 on Document Image Classification on RVL-CDIP
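A minimal sketch of the late-fusion step, assuming each branch has already produced class logits; the fixed weighting used here is an illustrative choice rather than the paper's exact scheme.

```python
import torch
import torch.nn.functional as F

def late_fusion(image_logits, text_logits, alpha=0.5):
    """Combine per-branch predictions after both classifiers have run.

    alpha weights the image branch; 0.5 is an arbitrary illustrative value.
    """
    p_img = F.softmax(image_logits, dim=-1)
    p_txt = F.softmax(text_logits, dim=-1)
    return alpha * p_img + (1.0 - alpha) * p_txt

fused = late_fusion(torch.randn(4, 16), torch.randn(4, 16))
pred = fused.argmax(dim=-1)  # predicted document class per sample
```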
no code implementations • 1 Jun 2020 • Lluís Gómez, Ali Furkan Biten, Rubèn Tito, Andrés Mafla, Marçal Rusiñol, Ernest Valveny, Dimosthenis Karatzas
This paper presents a new model for the task of scene text visual question answering, in which questions about a given image can only be answered by reading and understanding scene text that is present in it.
no code implementations • 26 May 2020 • Lei Kang, Pau Riba, Marçal Rusiñol, Alicia Fornés, Mauricio Villegas
Sequential architectures are a perfect fit for modeling text lines: not only do they capture the inherent temporal aspect of text, they also learn probability distributions over sequences of characters and words (see the sketch below).
Ranked #8 on Handwritten Text Recognition on IAM
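As a concrete (and deliberately simplified) example of learning a distribution over character sequences, the sketch below is a small character-level model that outputs next-character probabilities at every step. Vocabulary size and layer dimensions are placeholders; the recognizer evaluated on IAM is considerably more elaborate.

```python
import torch
import torch.nn as nn

class CharSeqModel(nn.Module):
    """Character-level sequence model: a distribution over the next
    character at each time step (illustrative sizes throughout)."""
    def __init__(self, vocab_size=80, embed_dim=64, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, char_ids):                 # (B, T) integer ids
        hidden, _ = self.lstm(self.embed(char_ids))
        return self.proj(hidden)                 # (B, T, vocab_size) logits

model = CharSeqModel()
logits = model(torch.randint(0, 80, (2, 32)))
# Training would minimize cross-entropy against the next character.
```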
3 code implementations • ECCV 2020 • Lei Kang, Pau Riba, Yaxing Wang, Marçal Rusiñol, Alicia Fornés, Mauricio Villegas
We propose a novel method that is able to produce credible handwritten word images by conditioning the generative process with both calligraphic style features and textual content.
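Schematically, such a generator consumes a style representation and a content (text) representation and emits a word image. The skeleton below only illustrates that conditioning pattern; the layers, shapes, and the adversarial training loop of the actual model are omitted, and the values here are placeholders.

```python
import torch
import torch.nn as nn

class CondWordGenerator(nn.Module):
    """Generator conditioned on calligraphic style and textual content
    (a structural sketch, not the paper's architecture)."""
    def __init__(self, style_dim=128, text_dim=128, img_hw=(64, 256)):
        super().__init__()
        self.img_hw = img_hw
        self.net = nn.Sequential(
            nn.Linear(style_dim + text_dim, 1024), nn.ReLU(),
            nn.Linear(1024, img_hw[0] * img_hw[1]), nn.Tanh(),
        )

    def forward(self, style_vec, text_emb):
        z = torch.cat([style_vec, text_emb], dim=-1)
        return self.net(z).view(-1, 1, *self.img_hw)  # grayscale word image

gen = CondWordGenerator()
fake = gen(torch.randn(2, 128), torch.randn(2, 128))  # (2, 1, 64, 256)
```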
no code implementations • 21 Dec 2019 • Lei Kang, Pau Riba, Mauricio Villegas, Alicia Fornés, Marçal Rusiñol
The main challenge faced when training a language model is that the language model corpus usually differs from the one used to train the handwritten word recognition system.
no code implementations • 18 Sep 2019 • Lei Kang, Marçal Rusiñol, Alicia Fornés, Pau Riba, Mauricio Villegas
Handwritten Text Recognition (HTR) is still a challenging problem because it must deal with two important difficulties: the variability among writing styles, and the scarcity of labelled data.
no code implementations • 30 Jun 2019 • Ali Furkan Biten, Rubèn Tito, Andres Mafla, Lluis Gomez, Marçal Rusiñol, Minesh Mathew, C. V. Jawahar, Ernest Valveny, Dimosthenis Karatzas
ST-VQA introduces an important aspect that is not addressed by any Visual Question Answering system to date, namely the incorporation of scene text to answer questions asked about an image.
1 code implementation • 4 Jun 2019 • Raul Gomez, Ali Furkan Biten, Lluis Gomez, Jaume Gibert, Marçal Rusiñol, Dimosthenis Karatzas
This paper explores the possibilities of image style transfer applied to text while maintaining the original transcriptions.
3 code implementations • ICCV 2019 • Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusiñol, Ernest Valveny, C. V. Jawahar, Dimosthenis Karatzas
Current visual question answering datasets do not consider the rich semantic information conveyed by text within an image.
1 code implementation • CVPR 2019 • Ali Furkan Biten, Lluis Gomez, Marçal Rusiñol, Dimosthenis Karatzas
We propose a novel captioning method that is able to leverage contextual information provided by the text of news articles associated with an image.
no code implementations • 31 Jan 2019 • Yash Patel, Lluis Gomez, Marçal Rusiñol, Dimosthenis Karatzas, C. V. Jawahar
Cross-modal retrieval methods have improved significantly in recent years with the use of deep neural networks and large-scale annotated datasets such as ImageNet and Places.
3 code implementations • ECCV 2018 • Lluís Gómez, Andrés Mafla, Marçal Rusiñol, Dimosthenis Karatzas
In this way, the text-based image retrieval task can be cast as a simple nearest-neighbor search of the query text representation over the CNN outputs for the entire image database.
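A minimal sketch of that retrieval step, assuming the per-image descriptors have been precomputed by the CNN; the descriptor dimensionality and similarity measure below are illustrative choices.

```python
import numpy as np

def retrieve(query_vec, image_vecs, top_k=5):
    """Rank database images by cosine similarity to the query text
    representation (image_vecs: precomputed CNN outputs)."""
    q = query_vec / np.linalg.norm(query_vec)
    db = image_vecs / np.linalg.norm(image_vecs, axis=1, keepdims=True)
    sims = db @ q
    order = np.argsort(-sims)[:top_k]
    return order, sims[order]

database = np.random.randn(1000, 604)  # placeholder descriptors (illustrative size)
indices, scores = retrieve(np.random.randn(604), database)
```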
1 code implementation • 4 Jul 2018 • Yash Patel, Lluis Gomez, Raul Gomez, Marçal Rusiñol, Dimosthenis Karatzas, C. V. Jawahar
We show that adequate visual features can be learned efficiently by training a CNN to predict the semantic textual context in which a particular image is most likely to appear as an illustration.
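One way to set up such an objective (a sketch under assumptions, not the paper's exact pipeline) is to train a CNN to regress the topic distribution of the text in which each image appears, using a soft cross-entropy loss. The backbone, topic count, and targets below are all placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_topics = 40  # illustrative; any topic-model dimensionality works
backbone = nn.Sequential(  # stand-in for a real CNN trunk
    nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_topics),
)

def context_loss(logits, topic_target):
    """Soft cross-entropy against the topic distribution of the text
    surrounding the image (e.g. from a topic model)."""
    return -(topic_target * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

imgs = torch.randn(4, 3, 224, 224)
target = F.softmax(torch.randn(4, n_topics), dim=-1)  # placeholder topics
loss = context_loss(backbone(imgs), target)
```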
no code implementations • 18 Oct 2017 • Dimosthenis Karatzas, Lluis Gómez, Anguelos Nicolaou, Marçal Rusiñol
The ICDAR Robust Reading Competition (RRC), initiated in 2003 and re-established in 2011, has become a de facto evaluation standard for robust reading systems and algorithms.
no code implementations • CVPR 2017 • Lluis Gomez, Yash Patel, Marçal Rusiñol, Dimosthenis Karatzas, C. V. Jawahar
End-to-end training from scratch of current deep architectures for new computer vision problems would require ImageNet-scale datasets, and this is not always possible.