A Characterwise Windowed Approach to Hebrew Morphological Segmentation

WS 2018 · Amir Zeldes

This paper presents a novel approach to the segmentation of orthographic word forms in contemporary Hebrew, focusing purely on splitting without carrying out morphological analysis or disambiguation. Casting the analysis task as character-wise binary classification and using adjacent-character and word-based lexicon-lookup features, this approach achieves over 98% accuracy on the benchmark SPMRL shared task data for Hebrew and 97% accuracy on a new out-of-domain Wikipedia dataset, improvements of roughly 4% and 5%, respectively, over the previous state of the art.
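To make the framing concrete, here is a minimal sketch of segmentation as character-wise binary classification: each character receives a label saying whether a segment boundary follows it, and the features for that decision come from a window of neighboring characters plus lexicon lookups. This is an illustration under stated assumptions, not the paper's implementation: the function names, the padding symbol, the window size, the two lexicon features, and the transliterated toy word `wbbyt` (split as `w` + `b` + `byt`) are all hypothetical choices made for this example.

```python
from typing import Dict, List, Set

PAD = "_"  # hypothetical padding symbol for positions outside the word


def segments_to_labels(segments: List[str]) -> List[int]:
    """Convert gold segments into one binary label per character.

    Label 1 means a segment boundary follows this character. The final
    character of the word is always 0, since the word ends there anyway.
    """
    labels: List[int] = []
    for seg in segments:
        labels.extend([0] * (len(seg) - 1) + [1])
    labels[-1] = 0
    return labels


def labels_to_segments(word: str, labels: List[int]) -> List[str]:
    """Rebuild segments from per-character boundary labels."""
    segments: List[str] = []
    start = 0
    for i, label in enumerate(labels):
        if label == 1 and i < len(word) - 1:
            segments.append(word[start : i + 1])
            start = i + 1
    segments.append(word[start:])
    return segments


def char_window_features(
    word: str, i: int, lexicon: Set[str], window: int = 2
) -> Dict[str, object]:
    """Features for deciding whether a boundary follows word[i].

    Assumed feature set: the characters within +/- `window` positions,
    and whether the substrings on either side of the candidate split
    are known lexicon entries.
    """
    feats: Dict[str, object] = {}
    for offset in range(-window, window + 1):
        j = i + offset
        feats[f"char[{offset:+d}]"] = word[j] if 0 <= j < len(word) else PAD
    feats["left_in_lexicon"] = word[: i + 1] in lexicon
    feats["right_in_lexicon"] = word[i + 1 :] in lexicon
    return feats


# Toy example with a transliterated word and a tiny hypothetical lexicon.
toy_lexicon = {"w", "b", "byt"}
toy_word = "wbbyt"
toy_labels = segments_to_labels(["w", "b", "byt"])
toy_features = [
    char_window_features(toy_word, i, toy_lexicon, window=1)
    for i in range(len(toy_word))
]
```

Each feature dictionary, paired with its label, forms one training instance for any binary classifier; dictionaries of this shape can be turned into numeric vectors with, for example, scikit-learn's `DictVectorizer`. At prediction time, the predicted labels for a word are passed to `labels_to_segments` to recover the split.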


Datasets


Introduced in the Paper:

Wiki5K Hebrew segmentation

Used in the Paper:

SPMRL Hebrew segmentation data
Task              | Dataset                        | Model       | Metric Name | Metric Value | Global Rank
Text Segmentation | SPMRL Hebrew segmentation data | RFTokenizer | F-Score     | 97.08        | # 1
Text Segmentation | Wiki5K Hebrew segmentation     | RFTokenizer | F-Score     | 96.35        | # 1
