Search Results for author: Sangwhan Moon

Found 8 papers, 2 papers with code

OpenKorPOS: Democratizing Korean Tokenization with Voting-Based Open Corpus Annotation

no code implementations • LREC 2022 • Sangwhan Moon, Won Ik Cho, Hye Joo Han, Naoaki Okazaki, Nam Soo Kim

As this problem originates from the conventional scheme used when creating a POS tagging corpus, we propose an improvement to the existing scheme, which makes it friendlier to generative tasks.

POS POS Tagging +1

PatchBERT: Just-in-Time, Out-of-Vocabulary Patching

no code implementations • EMNLP 2020 • Sangwhan Moon, Naoaki Okazaki

Large scale pre-trained language models have shown groundbreaking performance improvements for transfer learning in the domain of natural language processing.

Transfer Learning

Two Counterexamples to Tokenization and the Noiseless Channel

no code implementations • 22 Feb 2024 • Marco Cognetta, Vilém Zouhar, Sangwhan Moon, Naoaki Okazaki

In Tokenization and the Noiseless Channel (Zouhar et al., 2023a), Rényi efficiency is suggested as an intrinsic mechanism for evaluating a tokenizer: for NLP tasks, the tokenizer which leads to the highest Rényi efficiency of the unigram distribution should be chosen.

Machine Translation

Learning How to Translate North Korean through South Korean

no code implementations • LREC 2022 • Hwichan Kim, Sangwhan Moon, Naoaki Okazaki, Mamoru Komachi

Training a model using North Korean data is the most straightforward approach to solving this problem, but there is insufficient data to train NMT models.

Machine Translation NMT +1

StyleKQC: A Style-Variant Paraphrase Corpus for Korean Questions and Commands

1 code implementation • LREC 2022 • Won Ik Cho, Sangwhan Moon, Jong In Kim, Seok Min Kim, Nam Soo Kim

Paraphrasing is often performed with less concern for controlled style conversion.

Natural Language Queries

Open Korean Corpora: A Practical Report

no code implementations • EMNLP (NLPOSS) 2020 • Won Ik Cho, Sangwhan Moon, YoungSook Song

Korean is often referred to as a low-resource language in the research community.

Jamo Pair Encoding: Subcharacter Representation-based Extreme Korean Vocabulary Compression for Efficient Subword Tokenization

no code implementations • LREC 2020 • Sangwhan Moon, Naoaki Okazaki

In the context of multilingual language model pre-training, vocabulary size for languages with a broad set of potential characters is an unsolved problem.

Language Modelling

Machines Getting with the Program: Understanding Intent Arguments of Non-Canonical Directives

1 code implementation • Findings of the Association for Computational Linguistics 2020 • Won Ik Cho, Young Ki Moon, Sangwhan Moon, Seok Min Kim, Nam Soo Kim

Modern dialog managers face the challenge of having to fulfill human-level conversational skills as part of common user expectations, including but not limited to discourse with no clear objective.
