7 Nov 2023 • Taehee Jeon, BongSeok Yang, ChangHwan Kim, Yoonseob Lim
We introduce a morpheme-aware subword tokenization method that utilizes sub-character decomposition to address the challenges of applying Byte Pair Encoding (BPE) to Korean, a language characterized by its rich morphology and unique writing system.
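To make "sub-character decomposition" concrete, the sketch below splits precomposed Hangul syllables into their conjoining jamo using the standard Unicode arithmetic for the Hangul Syllables block. It is only an illustration of the general idea, not the authors' implementation: the function names `decompose_syllable` and `decompose` are hypothetical, and the abstract does not state which jamo representation or morpheme analyzer the method uses.

```python
# Minimal sketch of Hangul sub-character (jamo) decomposition.
# Assumption: conjoining jamo (U+1100 block) are used as the sub-character
# units; the paper's actual representation is not specified in the abstract.

S_BASE = 0xAC00          # first precomposed syllable, "가"
L_BASE = 0x1100          # first leading consonant (choseong)
V_BASE = 0x1161          # first vowel (jungseong)
T_BASE = 0x11A7          # one before the first trailing consonant (jongseong)
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = 19 * N_COUNT        # 11172 precomposed syllables in total


def decompose_syllable(ch: str) -> str:
    """Return the jamo sequence for one Hangul syllable, or ch unchanged."""
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return ch  # not a precomposed Hangul syllable
    lead = chr(L_BASE + s_index // N_COUNT)
    vowel = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT
    # t_index == 0 means the syllable has no trailing consonant.
    tail = chr(T_BASE + t_index) if t_index else ""
    return lead + vowel + tail


def decompose(text: str) -> str:
    """Decompose every Hangul syllable in text; leave other characters as-is."""
    return "".join(decompose_syllable(c) for c in text)


if __name__ == "__main__":
    for word in ["한국어", "가", "BPE"]:
        print(word, [hex(ord(j)) for j in decompose(word)])
```

A BPE learner run over such jamo sequences can merge units smaller than a syllable, which is the level at which many Korean morpheme boundaries fall. How the merges are constrained by morpheme boundaries is the paper's contribution and is not reproduced here.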