Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments

10 Feb 2025 · Sankalp Nagaonkar, Augustya Sharma, Ashish Choithani, Ashutosh Trivedi

This paper introduces an open-source benchmark for evaluating Vision-Language Models (VLMs) on Optical Character Recognition (OCR) tasks in dynamic video environments. We present a curated dataset containing 1,477 manually annotated frames spanning diverse domains, including code editors, news broadcasts, YouTube videos, and advertisements. Three state-of-the-art VLMs (Claude-3, Gemini-1.5, and GPT-4o) are benchmarked against traditional OCR systems such as EasyOCR and RapidOCR. Evaluation metrics include Word Error Rate (WER), Character Error Rate (CER), and Accuracy. Our results highlight the strengths and limitations of VLMs in video-based OCR tasks, demonstrating their potential to outperform conventional OCR models in many scenarios. However, challenges remain, including hallucinations, content security policies, and sensitivity to occluded or stylized text. The dataset and benchmarking framework are publicly available to foster further research.
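
The reported metrics follow standard definitions: WER and CER are edit distances computed at the word and character level, normalized by the length of the reference transcription, and a common convention is accuracy = (1 − CER) × 100, which matches most of the figures reported below, though the authors' exact definition may differ. The following is a minimal sketch of these computations, not the authors' evaluation code.

```python
# Minimal sketch of the reported metrics (standard WER/CER definitions);
# the authors' exact evaluation code may differ.

def levenshtein(ref, hyp):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))                 # dp[j] = distance for prefixes so far
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i              # prev holds dp[i-1][j-1]
        for j in range(1, n + 1):
            cur = dp[j]                     # dp[i-1][j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    return levenshtein(ref_words, hypothesis.split()) / max(len(ref_words), 1)

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance / reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

print(wer("hello dynamic video world", "hello video world"))  # 0.25
print(cer("hello", "hallo"))                                  # 0.2
```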


Datasets

VideoDB's OCR Benchmark Public Collection

Results

Task: Optical Character Recognition (OCR) on VideoDB's OCR Benchmark Public Collection

| Model           | CER (rank)  | WER (rank)  | Average Accuracy % (rank) |
|-----------------|-------------|-------------|---------------------------|
| GPT-4o          | 0.2378 (#1) | 0.5117 (#4) | 76.22 (#1)                |
| EasyOCR         | 0.5070 (#4) | 0.8262 (#5) | 49.30 (#5)                |
| RapidOCR        | 0.7620 (#5) | 0.4302 (#2) | 56.98 (#4)                |
| Claude-3 Sonnet | 0.3229 (#3) | 0.4663 (#3) | 67.71 (#3)                |
| Gemini-1.5 Pro  | 0.2387 (#2) | 0.2385 (#1) | 76.13 (#2)                |
