no code implementations • CVPR 2025 • Mor Shpigel Nacson, Aviad Aberdam, Roy Ganz, Elad Ben Avraham, Alona Golts, Yair Kittenplon, Shai Mazor, Ron Litman
Vision-Language Models (VLMs) excel in diverse visual tasks but face challenges in document understanding, which requires fine-grained text processing.
no code implementations • 7 Nov 2024 • Jonathan Fhima, Elad Ben Avraham, Oren Nuriel, Yair Kittenplon, Roy Ganz, Aviad Aberdam, Ron Litman
In this paper, we focus on enhancing the first strategy by introducing a novel method named TAP-VL, which treats OCR information as a distinct modality and seamlessly integrates it into any VL model.
Optical Character Recognition (OCR)
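To make the one-line summary concrete, here is a minimal PyTorch sketch of one plausible reading of "OCR as a distinct modality": a small adapter compresses a variable-length OCR token sequence into a fixed set of learned queries that can be prepended to the VL model's input. All module names, dimensions, and the cross-attention design are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class OCRAdapter(nn.Module):
    """Hypothetical adapter: compresses a variable-length sequence of OCR
    token embeddings into a fixed set of learned query vectors via
    cross-attention, so they can be prepended to a VL model's input."""

    def __init__(self, ocr_dim=768, vl_dim=4096, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, ocr_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(ocr_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(ocr_dim, vl_dim)  # map into the VL model's space

    def forward(self, ocr_embeds):              # (B, L_ocr, ocr_dim)
        b = ocr_embeds.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        fused, _ = self.cross_attn(q, ocr_embeds, ocr_embeds)
        return self.proj(fused)                  # (B, num_queries, vl_dim)

# Usage: prepend the compressed OCR tokens to the usual [visual; text] sequence.
adapter = OCRAdapter()
ocr_tokens = adapter(torch.randn(2, 120, 768))   # 120 OCR tokens -> 32 queries
```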
1 code implementation • 17 Jul 2024 • Ofir Abramovich, Niv Nayman, Sharon Fogel, Inbal Lavi, Ron Litman, Shahar Tsiper, Royee Tichauer, Srikar Appalaraju, Shai Mazor, R. Manmatha
In recent years, notable advancements have been made in the domain of visual document understanding, with the prevailing architecture comprising a cascade of vision and language models.
1 code implementation • 12 Jun 2024 • Benjamin Hsu, Xiaoyu Liu, Huayang Li, Yoshinari Fujinuma, Maria Nadejde, Xing Niu, Yair Kittenplon, Ron Litman, Raghavendra Pappagari
Document translation poses a challenge for Neural Machine Translation (NMT) systems.
no code implementations • CVPR 2024 • Roy Ganz, Yair Kittenplon, Aviad Aberdam, Elad Ben Avraham, Oren Nuriel, Shai Mazor, Ron Litman
This integration yields dynamic visual features that focus on the image aspects relevant to the posed question.
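One common way to obtain question-conditioned visual features is to let the visual tokens cross-attend to the question embedding. The sketch below illustrates that general idea in PyTorch; the module name, dimensions, and residual design are hypothetical, not necessarily the paper's architecture.

```python
import torch
import torch.nn as nn

class QuestionConditionedFusion(nn.Module):
    """Hypothetical sketch: visual patch features attend to question-token
    embeddings, yielding features re-weighted toward question-relevant regions."""

    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patches, question):        # (B, N, D), (B, T, D)
        attended, _ = self.attn(patches, question, question)
        return self.norm(patches + attended)     # residual keeps original content

fusion = QuestionConditionedFusion()
out = fusion(torch.randn(2, 196, 768), torch.randn(2, 12, 768))
```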
no code implementations • CVPR 2024 • Tsachi Blau, Sharon Fogel, Roi Ronen, Alona Golts, Roy Ganz, Elad Ben Avraham, Aviad Aberdam, Shahar Tsiper, Ron Litman
The increasing use of transformer-based large language models brings forward the challenge of processing long sequences.
no code implementations • ICCV 2023 • Roy Ganz, Oren Nuriel, Aviad Aberdam, Yair Kittenplon, Shai Mazor, Ron Litman
Visual Question Answering (VQA) and Image Captioning (CAP), which are among the most popular vision-language tasks, have analogous scene-text versions that require reasoning from the text in the image.
no code implementations • ICCV 2023 • Aviad Aberdam, David Bensaïd, Alona Golts, Roy Ganz, Oren Nuriel, Royee Tichauer, Shai Mazor, Ron Litman
Reading text in real-world scenarios often requires understanding the context surrounding it, especially when dealing with poor-quality text.
no code implementations • 14 Sep 2022 • Sergi Garcia-Bordils, Andrés Mafla, Ali Furkan Biten, Oren Nuriel, Aviad Aberdam, Shai Mazor, Ron Litman, Dimosthenis Karatzas
This paper presents the final results of the Out-Of-Vocabulary 2022 (OOV) challenge.
Optical Character Recognition (OCR)
2 code implementations • 8 May 2022 • Aviad Aberdam, Roy Ganz, Shai Mazor, Ron Litman
In a novel setup, consistency is enforced on each modality separately.
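The snippet's one concrete detail is that consistency is enforced per modality rather than jointly. A hedged sketch of what that can look like for semi-supervised text recognition follows; the weak/strong augmentation pairing, the specific losses, and the weights are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def per_modality_consistency(vis_weak, vis_strong, seq_weak, seq_strong,
                             w_vis=1.0, w_seq=1.0):
    """Hypothetical sketch: enforce consistency separately on visual features
    and on sequence-level (textual) predictions for two augmented views of
    the same unlabeled image."""
    # Visual branch: make strong-view features match weak-view features.
    loss_vis = F.mse_loss(vis_strong, vis_weak.detach())
    # Sequence branch: per-step KL between predicted character distributions.
    loss_seq = F.kl_div(F.log_softmax(seq_strong, dim=-1),
                        F.softmax(seq_weak.detach(), dim=-1),
                        reduction="batchmean")
    return w_vis * loss_vis + w_seq * loss_seq
```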
1 code implementation • CVPR 2022 • Ali Furkan Biten, Ron Litman, Yusheng Xie, Srikar Appalaraju, R. Manmatha
Accounting for this, we propose a single-objective pre-training scheme that requires only text and spatial cues.
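A sketch of what "only text and spatial cues" can look like as model input: each OCR token is embedded from its text id plus its quantized bounding-box coordinates, with no pixel features. The exact layout encoding below (separate embeddings per box coordinate) is an assumption, not necessarily the paper's scheme.

```python
import torch
import torch.nn as nn

class LayoutAwareEmbedding(nn.Module):
    """Hypothetical sketch: embed each OCR token from its text id plus its
    quantized bounding-box coordinates -- no pixels required, so pre-training
    can proceed from text and spatial cues alone."""

    def __init__(self, vocab_size=30522, dim=768, num_bins=1000):
        super().__init__()
        self.num_bins = num_bins
        self.tok = nn.Embedding(vocab_size, dim)
        self.x0 = nn.Embedding(num_bins, dim)   # left
        self.y0 = nn.Embedding(num_bins, dim)   # top
        self.x1 = nn.Embedding(num_bins, dim)   # right
        self.y1 = nn.Embedding(num_bins, dim)   # bottom

    def forward(self, ids, boxes):              # (B, L), (B, L, 4) in [0, 1)
        q = (boxes * self.num_bins).long().clamp(0, self.num_bins - 1)
        return (self.tok(ids) + self.x0(q[..., 0]) + self.y0(q[..., 1])
                + self.x1(q[..., 2]) + self.y1(q[..., 3]))
```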
1 code implementation • 9 May 2021 • Oren Nuriel, Sharon Fogel, Ron Litman
However, in some cases their decisions are based on unintended information, leading to high performance on standard benchmarks but also to a lack of generalization under challenging testing conditions and to unintuitive failures.
no code implementations • 23 Dec 2020 • Ron Slossberg, Oron Anschel, Amir Markovitz, Ron Litman, Aviad Aberdam, Shahar Tsiper, Shai Mazor, Jon Wu, R. Manmatha
Although confidence calibration has been an active research area for the last several decades, calibration for structured and sequence prediction has scarcely been explored.
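For readers unfamiliar with calibration, here is a minimal sketch of the standard temperature-scaling baseline applied to a sequence model, with a word's confidence taken as the product of its per-step confidences. Both the temperature-scaling choice and the product aggregation are common baselines used here for illustration, not necessarily the paper's method.

```python
import torch

def sequence_confidence(logits, temperature=1.0):
    """Hypothetical sketch: per-step logits (T, C) -> calibrated word-level
    confidence. The temperature is fit on held-out data (e.g. by minimizing
    NLL); product-of-steps aggregation is one common choice among several."""
    probs = torch.softmax(logits / temperature, dim=-1)
    step_conf, _ = probs.max(dim=-1)     # confidence of the argmax char per step
    return step_conf.prod().item()       # word confidence = product over steps

logits = torch.randn(7, 97)              # 7 decoding steps, 97-char alphabet
print(sequence_confidence(logits, temperature=1.5))
```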
2 code implementations • CVPR 2021 • Aviad Aberdam, Ron Litman, Shahar Tsiper, Oron Anschel, Ron Slossberg, Shai Mazor, R. Manmatha, Pietro Perona
We propose a framework for sequence-to-sequence contrastive learning (SeqCLR) of visual representations, which we apply to text recognition.
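A hedged sketch of the core idea as the snippet states it: contrast corresponding sub-sequence elements of two augmented views of an image, rather than whole-image representations. The pooling into a fixed number of frames and the InfoNCE form below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def seq_contrastive_loss(feat_a, feat_b, num_frames=5, tau=0.1):
    """Hypothetical sketch of sequence-level contrastive learning: pool each
    view's feature sequence (B, L, D) into a few frames, then treat the same
    frame of the same image in the other view as the positive and all other
    frames in the batch as negatives (InfoNCE)."""
    def frames(x):                                # (B, L, D) -> (B*F, D)
        f = F.adaptive_avg_pool1d(x.transpose(1, 2), num_frames)
        return F.normalize(f.transpose(1, 2).reshape(-1, x.size(-1)), dim=-1)
    za, zb = frames(feat_a), frames(feat_b)
    logits = za @ zb.t() / tau                    # (B*F, B*F) similarities
    targets = torch.arange(za.size(0))            # positives on the diagonal
    return F.cross_entropy(logits, targets)
```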
2 code implementations • CVPR 2020 • Ron Litman, Oron Anschel, Shahar Tsiper, Roee Litman, Shai Mazor, R. Manmatha
The first attention step re-weights visual features from a CNN backbone together with contextual features computed by a BiLSTM layer.
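A sketch of that first attention step as the sentence describes it: CNN column features and BiLSTM contextual features are combined, scored per position, and re-weighted by the resulting attention map. Dimensions and the concatenate-then-score form are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class VisualContextAttention(nn.Module):
    """Hypothetical sketch of the described first attention step: concatenate
    CNN visual features with BiLSTM contextual features, score each position,
    and re-weight the combined features with the resulting attention map."""

    def __init__(self, vis_dim=512, hidden=256):
        super().__init__()
        self.bilstm = nn.LSTM(vis_dim, hidden, batch_first=True, bidirectional=True)
        self.score = nn.Linear(vis_dim + 2 * hidden, 1)

    def forward(self, vis):                       # (B, W, vis_dim) column features
        ctx, _ = self.bilstm(vis)                 # (B, W, 2*hidden)
        combined = torch.cat([vis, ctx], dim=-1)
        attn = torch.softmax(self.score(combined), dim=1)  # weights over width
        return combined * attn                    # re-weighted features
```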