1 code implementation • Findings (ACL) 2022 • Tatsuya Hiraoka, Sho Takase, Kei Uchiumi, Atsushi Keyaki, Naoaki Okazaki
We present two simple modifications for word-level perturbation: Word Replacement considering Length (WR-L) and Compositional Word Replacement (CWR). In conventional word replacement, a word in an input is replaced with a word sampled from the entire vocabulary, regardless of the length and context of the target word. WR-L takes the length of the target word into account by sampling the length of the replacement word from a Poisson distribution. CWR considers compositional candidates by restricting the sampling source to related words that appear during subword regularization. Experimental results showed that the combination of WR-L and CWR improved the performance of text classification and machine translation.
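A rough illustration of the WR-L idea (not the authors' released implementation): rather than ignoring length, the replacement word's length is drawn from a Poisson distribution centered on the target word's length. The names `wr_l` and `vocab_by_length` are hypothetical, and the sampler uses Knuth's textbook Poisson algorithm.

```python
import math
import random

def sample_poisson(lam):
    # Knuth's algorithm: multiply uniforms until the product drops below e^(-lam)
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while p > limit:
        k += 1
        p *= random.random()
    return k - 1

def wr_l(words, vocab_by_length, replace_prob=0.1):
    # vocab_by_length: hypothetical helper mapping length -> list of vocabulary words
    out = []
    for w in words:
        if random.random() < replace_prob:
            length = max(1, sample_poisson(len(w)))
            candidates = vocab_by_length.get(length)
            if candidates:
                w = random.choice(candidates)
        out.append(w)
    return out
```

With `replace_prob=0` the input passes through unchanged; raising it perturbs more words, with replacement lengths clustered near the originals.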
1 code implementation • spnlp (ACL) 2022 • Youmi Ma, Tatsuya Hiraoka, Naoaki Okazaki
We adopt table representations to model entities and relations, casting entity and relation extraction as a table-labeling problem.
no code implementations • 30 Mar 2024 • Marco Cognetta, Tatsuya Hiraoka, Naoaki Okazaki, Rico Sennrich, Yuval Pinter
We explore threshold vocabulary trimming in Byte-Pair Encoding subword tokenization, a postprocessing step that replaces rare subwords with their component subwords.
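A minimal sketch of what such trimming might look like, assuming a merge table that maps each BPE subword to its two components (the names `trim` and `merges`, and the one-level expansion, are illustrative simplifications, not the paper's code):

```python
def trim(tokens, counts, merges, threshold):
    """Replace subwords rarer than `threshold` with their component subwords.

    counts: subword -> corpus frequency
    merges: merged subword -> (left, right) components, as built by BPE
    """
    out = []
    for t in tokens:
        if counts.get(t, 0) < threshold and t in merges:
            # The components always exist, since BPE builds merges bottom-up.
            out.extend(merges[t])
        else:
            out.append(t)
    return out
```

A fuller version would expand recursively, since a component subword may itself fall below the threshold.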
no code implementations • 15 Feb 2024 • Tatsuya Hiraoka, Naoaki Okazaki
Do pretrained language models have knowledge regarding the surface information of tokens?
no code implementations • 21 Apr 2023 • Tatsuya Hiraoka, Tomoya Iwakura
Is preferred tokenization for humans also preferred for machine-learning (ML) models?
no code implementations • 21 Apr 2023 • Tatsuya Hiraoka, Tomoya Iwakura
This paper proposes a BiLSTM-based tokenizer with vocabulary restriction, which can capture wider contextual information during tokenization than the non-neural tokenization methods used in existing work.
1 code implementation • COLING 2022 • Tatsuya Hiraoka
We present a subword regularization method for WordPiece, which uses a maximum matching algorithm for tokenization.
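The maximum-matching regularization idea can be sketched as follows: during tokenization, the longest vocabulary match is occasionally rejected, forcing a shorter match and hence a different segmentation on each call. All names are illustrative; this is a sketch of the technique, not the paper's implementation.

```python
import random

def maxmatch_tokenize(word, vocab, drop_prob=0.0):
    """Maximum matching with random drop of multi-character matches."""
    tokens, i = [], 0
    while i < len(word):
        # Try candidate pieces, longest first.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                if j - i > 1 and random.random() < drop_prob:
                    continue  # reject this match; fall back to a shorter one
                tokens.append(piece)
                i = j
                break
        else:
            return None  # no match at position i, not even a single character
    return tokens
```

Single-character matches are never dropped, so tokenization always terminates; with `drop_prob=0` this reduces to plain maximum matching.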
no code implementations • Findings (ACL) 2022 • Sho Takase, Tatsuya Hiraoka, Naoaki Okazaki
Subword regularization uses multiple subword segmentations during training to improve the robustness of neural machine translation models.
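To make this concrete, here is a toy enumeration of the multiple segmentations a subword regularizer samples from. Real systems sample via a unigram language model or BPE-dropout rather than exhaustive enumeration; this version is only a sketch.

```python
import random

def segmentations(word, vocab):
    """Enumerate every way to split `word` into in-vocabulary subwords."""
    if not word:
        return [[]]
    segs = []
    for j in range(1, len(word) + 1):
        piece = word[:j]
        if piece in vocab:
            for rest in segmentations(word[j:], vocab):
                segs.append([piece] + rest)
    return segs

# Subword regularization samples a different segmentation each time a
# training example is visited, instead of always using one fixed split.
vocab = {"un", "u", "n", "do", "d", "o", "undo"}
sampled = random.choice(segmentations("undo", vocab))
```

Exposing the model to varied subword boundaries of the same sentence acts as data augmentation at the segmentation level.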
2 code implementations • Findings (ACL) 2021 • Tatsuya Hiraoka, Sho Takase, Kei Uchiumi, Atsushi Keyaki, Naoaki Okazaki
Since traditional tokenizers are isolated from the downstream task and model, they cannot produce a tokenization appropriate to that task and model, even though recent studies imply that appropriate tokenization improves performance.
1 code implementation • Findings of the Association for Computational Linguistics 2020 • Tatsuya Hiraoka, Sho Takase, Kei Uchiumi, Atsushi Keyaki, Naoaki Okazaki
In traditional NLP, we tokenize a given sentence as a preprocessing step, and thus the tokenization is unrelated to the target downstream task.
1 code implementation • Journal of Natural Language Processing 2022 • Youmi Ma, Tatsuya Hiraoka, Naoaki Okazaki
This study presents a novel method for extracting named entities and relations from unstructured text based on a table representation.
Ranked #3 on Relation Extraction on CoNLL04 (NER Micro F1 metric)
no code implementations • ACL 2019 • Tatsuya Hiraoka, Hiroyuki Shindo, Yuji Matsumoto
To make the model robust against infrequent tokens, we sampled segmentations for each sentence stochastically during training, which improved the performance of text classification.