ISI-PPT

This is a dataset for Arabic/English text detection and optical character recognition. All images are text slides extracted from PowerPoint files downloaded from the Internet through the Google API. All annotations are generated automatically, mainly through the WinCom32 Python API. Postprocessing is then applied to place more accurate text bounding boxes and to suppress false alarms, e.g. a text box containing only spaces. Finally, all annotation results are briefly reviewed by a human to reject extremely bad samples, e.g. a slide in which a large portion is a table pasted as an image. In summary, the dataset contains 10,692 images and roughly 100K line samples.

Source: https://gitlab.com/rex-yue-wu/ISI-PPT-Dataset
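The false-alarm suppression step mentioned in the description can be illustrated with a minimal Python sketch. It is not the dataset authors' code: the `TextBox` record, its field names, and the `suppress_false_alarms` helper are hypothetical, and the dataset's actual annotation format may differ. The sketch only shows the idea of dropping text boxes whose content is empty or whitespace-only.

```python
from typing import List, NamedTuple


class TextBox(NamedTuple):
    """Hypothetical annotation record: one text line and its bounding box."""
    text: str
    left: float
    top: float
    width: float
    height: float


def suppress_false_alarms(boxes: List[TextBox]) -> List[TextBox]:
    """Keep only boxes whose text has at least one non-whitespace character."""
    return [box for box in boxes if box.text.strip()]


# Made-up example annotations for a single slide.
boxes = [
    TextBox("Hello", 10, 10, 100, 20),
    TextBox("   ", 10, 40, 100, 20),   # only spaces: treated as a false alarm
    TextBox("مرحبا", 10, 70, 80, 20),
]
kept = suppress_false_alarms(boxes)
```

Here `str.strip()` removes all leading and trailing whitespace, so a box containing only spaces, tabs, or newlines yields an empty string and is filtered out, while English and Arabic text lines pass through unchanged.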

License


  • Unknown
