Post-OCR parsing: building simple and robust parser via BIO tagging

Parsing textual information embedded in images is important for various downstream tasks. However, many previously developed parsers are limited to handling information presented in a one-dimensional sequence format. Here, we present the Post OCR Tagging based parser (POT), a simple and robust parser that can parse visually embedded texts by BIO-tagging the output of an optical character recognition (OCR) task. Our shallow parsing approach enables building a robust neural parser with fewer than a thousand labeled examples. POT is validated on receipt and namecard parsing tasks.
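
To illustrate what BIO-tagging OCR output means in practice, the following is a minimal sketch of the decoding step only: turning a tagged token sequence into labeled fields. It is not the authors' implementation. It assumes the OCR tokens have already been serialized in reading order and that a tagger has assigned one BIO tag per token; the function name `decode_bio` and the field names (`menu.name`, `menu.count`, `menu.price`) are hypothetical and chosen for illustration.

```python
from typing import List, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (field, text) spans.

    "B-x" opens a new span of field x, "I-x" continues an open span of
    field x, and "O" (or an I- tag with no matching open span) closes
    whatever span is open and is otherwise ignored.
    """
    spans: List[Tuple[str, str]] = []
    current_label = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Emit the open span, if any, and reset the state.
        nonlocal current_label, current_tokens
        if current_label is not None:
            spans.append((current_label, " ".join(current_tokens)))
        current_label = None
        current_tokens = []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_label = tag[2:]
            current_tokens = [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_tokens.append(token)
        else:
            flush()
    flush()
    return spans


# Hypothetical receipt line: OCR tokens with tags a tagger might predict.
tokens = ["Cafe", "Latte", "x2", "9.00", "Thank", "you"]
tags = ["B-menu.name", "I-menu.name", "B-menu.count", "B-menu.price", "O", "O"]
parsed = decode_bio(tokens, tags)
print(parsed)
```

In this sketch an `I-` tag that does not continue an open span of the same field is dropped rather than promoted to a new span; that is one possible design choice, not something the abstract specifies.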
