Search Results for author: Tatsuya Hiraoka

Found 12 papers, 6 papers with code

Word-level Perturbation Considering Word Length and Compositional Subwords

1 code implementation • Findings (ACL) 2022 • Tatsuya Hiraoka, Sho Takase, Kei Uchiumi, Atsushi Keyaki, Naoaki Okazaki

We present two simple modifications for word-level perturbation: Word Replacement considering Length (WR-L) and Compositional Word Replacement (CWR). In conventional word replacement, a word in the input is replaced with a word sampled from the entire vocabulary, regardless of the length and context of the target word. WR-L accounts for the length of the target word by sampling replacement lengths from a Poisson distribution. CWR considers compositional candidates by restricting the sampling source to related words that appear during subword regularization. Experimental results showed that combining WR-L and CWR improved performance on text classification and machine translation.

Tasks: Machine Translation, Text Classification, +2
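The length-aware replacement in the abstract above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the Poisson sampler uses Knuth's algorithm, and the length-fallback rule and the `vocab_by_len` index are hypothetical details introduced here.

```python
import math
import random

def sample_poisson(lam, rng):
    # Knuth's algorithm: multiply uniforms until the product drops
    # below exp(-lam); the number of multiplications is Poisson(lam).
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def wr_l_replace(word, vocab_by_len, rng):
    # WR-L sketch: draw a target length from Poisson(len(word)), then
    # pick a replacement of that length from a vocabulary indexed by
    # word length. The fallback to shorter lengths when no word of the
    # sampled length exists is an assumption, not from the paper.
    target = max(1, sample_poisson(len(word), rng))
    while target > 1 and target not in vocab_by_len:
        target -= 1
    return rng.choice(vocab_by_len.get(target, [word]))
```

With this scheme, replacements for a four-letter word cluster around length four instead of being drawn uniformly from the whole vocabulary.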

An Analysis of BPE Vocabulary Trimming in Neural Machine Translation

no code implementations • 30 Mar 2024 • Marco Cognetta, Tatsuya Hiraoka, Naoaki Okazaki, Rico Sennrich, Yuval Pinter

We explore threshold vocabulary trimming in Byte-Pair Encoding subword tokenization, a postprocessing step that replaces rare subwords with their component subwords.

Tasks: Machine Translation, Translation
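The postprocessing step described above, replacing rare subwords with their components, can be sketched like this. It is a hedged illustration of threshold trimming, assuming a `merges` map from each merged subword back to the pair that produced it; the function names and data layout are hypothetical.

```python
def trim_tokens(tokens, freqs, merges, threshold):
    # Recursively replace any subword whose corpus frequency is below
    # the threshold with the two components it was merged from, until
    # every emitted piece is either frequent enough or unmergeable.
    def expand(tok):
        if freqs.get(tok, 0) >= threshold or tok not in merges:
            return [tok]
        left, right = merges[tok]
        return expand(left) + expand(right)
    return [piece for tok in tokens for piece in expand(tok)]
```

Setting the threshold to zero recovers the untrimmed tokenization, which makes the trimming strength easy to sweep in experiments.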

Knowledge of Pretrained Language Models on Surface Information of Tokens

no code implementations • 15 Feb 2024 • Tatsuya Hiraoka, Naoaki Okazaki

Do pretrained language models have knowledge regarding the surface information of tokens?

Downstream Task-Oriented Neural Tokenizer Optimization with Vocabulary Restriction as Post Processing

no code implementations • 21 Apr 2023 • Tatsuya Hiraoka, Tomoya Iwakura

This paper proposes a BiLSTM-based tokenizer with vocabulary restriction, which can capture wider contextual information during tokenization than the non-neural tokenization methods used in existing work.

Tasks: Text Classification

MaxMatch-Dropout: Subword Regularization for WordPiece

1 code implementation • COLING 2022 • Tatsuya Hiraoka

We present a subword regularization method for WordPiece, which uses a maximum matching algorithm for tokenization.

Tasks: Machine Translation, Text Classification, +1
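The idea of combining maximum matching with dropout, as in the abstract above, can be sketched as follows. This is an assumption-laden reading, not the official implementation: at each position the longest vocabulary match is skipped with probability `p`, forcing a shorter subword, with single characters never dropped so tokenization always terminates.

```python
import random

def maxmatch_dropout(word, vocab, p=0.1, rng=None):
    # Greedy maximum matching with dropout: try candidate substrings
    # from longest to shortest, and reject a match with probability p
    # (except single characters, which are always accepted if known).
    rng = rng or random.Random()
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab and (j - i == 1 or rng.random() >= p):
                tokens.append(piece)
                i = j
                break
        else:
            tokens.append(word[i])  # out-of-vocabulary character fallback
            i += 1
    return tokens
```

With `p=0` this reduces to plain maximum matching (deterministic WordPiece-style tokenization), while larger `p` yields more varied segmentations for regularization during training.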

Joint Optimization of Tokenization and Downstream Model

2 code implementations • Findings (ACL) 2021 • Tatsuya Hiraoka, Sho Takase, Kei Uchiumi, Atsushi Keyaki, Naoaki Okazaki

Since traditional tokenizers are isolated from the downstream task and model, they cannot produce a tokenization adapted to that task and model, even though recent studies suggest that an appropriate tokenization improves performance.

Tasks: Machine Translation, Text Classification, +2

Stochastic Tokenization with a Language Model for Neural Text Classification

no code implementations • ACL 2019 • Tatsuya Hiraoka, Hiroyuki Shindo, Yuji Matsumoto

To make the model robust to infrequent tokens, we stochastically sampled a segmentation for each sentence during training, which improved the performance of text classification.

Tasks: General Classification, Language Modelling, +5
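Sampling a segmentation per sentence, as described above, can be sketched as follows. This is a toy illustration under stated assumptions: it enumerates all segmentations of a short string (feasible only for tiny inputs) and samples one in proportion to a score from a caller-supplied `logprob` function, which stands in for the language model and is not the paper's interface.

```python
import math
import random

def segmentations(s):
    # Enumerate every way to split s into contiguous pieces.
    if not s:
        yield []
        return
    for i in range(1, len(s) + 1):
        for rest in segmentations(s[i:]):
            yield [s[:i]] + rest

def sample_segmentation(s, logprob, rng):
    # Sample one segmentation with probability proportional to
    # exp(logprob(tokens)); `logprob` is a hypothetical stand-in for
    # a language model scoring function.
    cands = list(segmentations(s))
    weights = [math.exp(logprob(c)) for c in cands]
    return rng.choices(cands, weights=weights, k=1)[0]
```

In practice such sampling is done with dynamic programming over a segmentation lattice rather than full enumeration; the sketch only shows the stochastic-segmentation idea itself.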
