1 code implementation • Findings (ACL) 2022 • Tatsuya Hiraoka, Sho Takase, Kei Uchiumi, Atsushi Keyaki, Naoaki Okazaki
We present two simple modifications for word-level perturbation: Word Replacement considering Length (WR-L) and Compositional Word Replacement (CWR). In conventional word replacement, a word in an input is replaced with a word sampled from the entire vocabulary, regardless of the length and context of the target word. WR-L considers the length of the target word by sampling the replacement's length from a Poisson distribution. CWR considers compositional candidates by restricting the source of sampling to related words that appear in subword regularization. Experimental results showed that the combination of WR-L and CWR improved the performance of text classification and machine translation.
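A minimal sketch of the WR-L sampling step, under our own assumptions: the function name, the replacement probability `p`, and the length-bucketed vocabulary `vocab_by_len` are illustrative, not the authors' code.

```python
import numpy as np

def wr_l_replace(words, vocab_by_len, p=0.1, rng=np.random.default_rng()):
    """Word Replacement considering Length (WR-L), sketched.

    Each word is replaced with probability p. The replacement's length
    is drawn from a Poisson distribution centered on the target word's
    length, so substitutes tend to have similar lengths.
    """
    out = []
    for w in words:
        if rng.random() < p:
            # Sample a target length near len(w); fall back to len(w)
            # if no vocabulary entry of the sampled length exists.
            length = max(1, rng.poisson(len(w)))
            candidates = vocab_by_len.get(length) or vocab_by_len.get(len(w), [w])
            out.append(rng.choice(candidates))
        else:
            out.append(w)
    return out

# Usage: vocab_by_len maps word length -> words of that length.
vocab_by_len = {3: ["cat", "dog"], 4: ["bird", "fish"], 5: ["horse"]}
print(wr_l_replace("the cat sat".split(), vocab_by_len, p=0.5))
```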
1 code implementation • spnlp (ACL) 2022 • Youmi Ma, Tatsuya Hiraoka, Naoaki Okazaki
We adopt table representations to model the entities and relations, casting the entity and relation extraction as a table-labeling problem.
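A minimal sketch of what "casting entity and relation extraction as table labeling" can look like; the two-label scheme and helper names below are our illustration, not the paper's exact formulation. For an n-token sentence, an n x n table is filled so that diagonal cells carry entity tags and off-diagonal cells carry the relation (if any) between the corresponding tokens.

```python
def build_label_table(tokens, entities, relations, no_rel="NONE"):
    """Fill an n x n label table for joint entity/relation extraction."""
    n = len(tokens)
    table = [[no_rel] * n for _ in range(n)]
    # entities: list of (start, end, type); token spans are [start, end)
    for start, end, etype in entities:
        for i in range(start, end):
            table[i][i] = etype
    # relations: list of (head_span, tail_span, relation_type)
    for (hs, he, _), (ts, te, _), rtype in relations:
        for i in range(hs, he):
            for j in range(ts, te):
                table[i][j] = rtype
    return table

tokens = ["Alice", "works", "at", "Acme"]
entities = [(0, 1, "PER"), (3, 4, "ORG")]
relations = [((0, 1, "PER"), (3, 4, "ORG"), "WORKS_FOR")]
for row in build_label_table(tokens, entities, relations):
    print(row)
```

A model then predicts one label per cell, turning the joint extraction problem into a uniform table-labeling problem.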
1 code implementation • 23 Jan 2025 • Munachiso Nwadike, Zangir Iklassov, Toluwani Aremu, Tatsuya Hiraoka, Velibor Bojkovic, Benjamin Heinzerling, Hilal Alqaubeh, Martin Takáč, Kentaro Inui
We introduce the concept of the self-referencing causal cycle (abbreviated RECALL) - a mechanism that enables large language models (LLMs) to bypass the limitations of unidirectional causality, which underlies a phenomenon known as the reversal curse.
no code implementations • 17 Oct 2024 • Ahmed Oumar El-Shangiti, Tatsuya Hiraoka, Hilal AlQuabeh, Benjamin Heinzerling, Kentaro Inui
This paper investigates whether large language models (LLMs) utilize numerical attributes encoded in a low-dimensional subspace of the embedding space when answering logical comparison questions (e.g., Was Cristiano born before Messi?).
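A sketch of the kind of linear probe used to test whether a numeric attribute such as birth year is linearly decodable from entity embeddings. The data here is random placeholder data with a planted "year" direction; in practice `X` would be hidden states extracted from an LLM for entity mentions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
d = 64
direction = rng.normal(size=d)                # hidden "year" direction
years = rng.integers(1900, 2000, size=200)    # ground-truth attribute
X = years[:, None] * direction + rng.normal(scale=5.0, size=(200, d))

probe = LinearRegression().fit(X, years)
pred = probe.predict(X)

# If the probe recovers the ordering, pairwise "born before?" judgments
# can be answered from the projected values alone.
i, j = 0, 1
print("predicted:", pred[i] < pred[j], "actual:", years[i] < years[j])
```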
no code implementations • 17 Oct 2024 • Tatsuya Hiraoka, Kentaro Inui
This paper introduces repetition neurons, regarded as skill neurons responsible for the repetition problem in text generation tasks.
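One simple way to look for such neurons, sketched below under our own assumptions (the layer choice, the prompts, and the mean-difference criterion are illustrative, not the paper's exact procedure): compare MLP activations on a repetitive versus a non-repetitive input and flag the neurons with the largest difference.

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

acts = {}
layer = 6
# Capture the post-activation MLP values of one transformer block.
model.h[layer].mlp.act.register_forward_hook(
    lambda m, inp, out: acts.__setitem__("a", out.detach())
)

def mean_activation(text):
    with torch.no_grad():
        model(**tok(text, return_tensors="pt"))
    return acts["a"].mean(dim=1).squeeze(0)  # average over positions

rep = mean_activation("the cat the cat the cat the cat")
non = mean_activation("the cat sat quietly on the warm mat")
diff = rep - non
print("top candidate neurons:", torch.topk(diff, 5).indices.tolist())
```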
1 code implementation • 10 Sep 2024 • Kohei Tsuji, Tatsuya Hiraoka, Yuchang Cheng, Tomoya Iwakura
NLP datasets may still contain annotation errors, even when they are manually annotated.
Ranked #1 on Named Entity Recognition (NER) on CoNLL-2020
no code implementations • 4 Jul 2024 • LLM-jp: Akiko Aizawa, Eiji Aramaki, Bowen Chen, Fei Cheng, Hiroyuki Deguchi, Rintaro Enomoto, Kazuki Fujii, Kensuke Fukumoto, Takuya Fukushima, Namgi Han, Yuto Harada, Chikara Hashimoto, Tatsuya Hiraoka, Shohei Hisada, Sosuke Hosokawa, Lu Jie, Keisuke Kamata, Teruhito Kanazawa, Hiroki Kanezashi, Hiroshi Kataoka, Satoru Katsumata, Daisuke Kawahara, Seiya Kawano, Atsushi Keyaki, Keisuke Kiryu, Hirokazu Kiyomaru, Takashi Kodama, Takahiro Kubo, Yohei Kuga, Ryoma Kumon, Shuhei Kurita, Sadao Kurohashi, Conglong Li, Taiki Maekawa, Hiroshi Matsuda, Yusuke Miyao, Kentaro Mizuki, Sakae Mizuki, Yugo Murawaki, Akim Mousterou, Ryo Nakamura, Taishi Nakamura, Kouta Nakayama, Tomoka Nakazato, Takuro Niitsuma, Jiro Nishitoba, Yusuke Oda, Hayato Ogawa, Takumi Okamoto, Naoaki Okazaki, Yohei Oseki, Shintaro Ozaki, Koki Ryu, Rafal Rzepka, Keisuke Sakaguchi, Shota Sasaki, Satoshi Sekine, Kohei Suda, Saku Sugawara, Issa Sugiura, Hiroaki Sugiyama, Hisami Suzuki, Jun Suzuki, Toyotaro Suzumura, Kensuke Tachibana, Yu Takagi, Kyosuke Takami, Koichi Takeda, Masashi Takeshita, Masahiro Tanaka, Kenjiro Taura, Arseny Tolmachev, Nobuhiro Ueda, Zhen Wan, Shuntaro Yada, Sakiko Yahata, Yuya Yamamoto, Yusuke Yamauchi, Hitomi Yanaka, Rio Yokota, Koichiro Yoshino
This paper introduces LLM-jp, a cross-organizational project for the research and development of Japanese large language models (LLMs).
no code implementations • 30 Mar 2024 • Marco Cognetta, Tatsuya Hiraoka, Naoaki Okazaki, Rico Sennrich, Yuval Pinter
We explore threshold vocabulary trimming in Byte-Pair Encoding subword tokenization, a postprocessing step that replaces rare subwords with their component subwords.
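A sketch of what such trimming can look like; the data structures below are our illustration, not the paper's implementation. Every BPE subword was formed by exactly one merge, so a trimmed (below-threshold) subword can always be re-expanded into its two merge parents, recursively, until only kept subwords or base characters remain.

```python
def expand(subword, merges, kept):
    """Recursively replace a trimmed subword with its merge parents."""
    if subword in kept or subword not in merges:
        return [subword]          # kept, or a base character
    left, right = merges[subword]
    return expand(left, merges, kept) + expand(right, merges, kept)

def trim(tokens, freqs, merges, threshold):
    kept = {t for t, f in freqs.items() if f >= threshold}
    out = []
    for t in tokens:
        out.extend(expand(t, merges, kept) if t not in kept else [t])
    return out

# merges maps a merged subword to the pair it was built from.
merges = {"lowest": ("low", "est"), "est": ("es", "t")}
freqs = {"low": 100, "est": 50, "lowest": 2, "es": 80, "t": 500}
print(trim(["lowest", "low"], freqs, merges, threshold=10))
# -> ['low', 'est', 'low']
```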
no code implementations • 29 Mar 2024 • Jesse Atuhurra, Iqra Ali, Tatsuya Hiraoka, Hidetaka Kamigaito, Tomoya Iwakura, Taro Watanabe
Our contribution is four-fold: 1) we introduced nine vision-and-language (VL) tasks (including object recognition, image-text matching, and more) and constructed multilingual visual-text datasets in four languages (English, Japanese, Swahili, and Urdu) by filling templates with questions and prompting GPT-4V to generate the answers and rationales; 2) we introduced a new VL task named unrelatedness; 3) we introduced rationales to enable human understanding of the VLM reasoning process; and 4) we employed human evaluation to measure the suitability of the proposed datasets for VL tasks.
no code implementations • 15 Feb 2024 • Tatsuya Hiraoka, Naoaki Okazaki
Do pretrained language models have knowledge regarding the surface information of tokens?
no code implementations • 21 Apr 2023 • Tatsuya Hiraoka, Tomoya Iwakura
This paper proposes a BiLSTM-based tokenizer with vocabulary restriction, which can capture wider contextual information during tokenization than the non-neural tokenization methods used in existing work.
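A minimal sketch of such a tokenizer as a character-level boundary tagger; the dimensions, the two-label scheme, and the class name are our own illustration. Vocabulary restriction would then be applied on top, by only allowing boundary sequences whose resulting tokens are in the vocabulary.

```python
import torch
import torch.nn as nn

class BiLSTMTokenizer(nn.Module):
    """Character-level BiLSTM that predicts token boundaries."""
    def __init__(self, n_chars, emb=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(n_chars, emb)
        self.lstm = nn.LSTM(emb, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, 2)   # 0: continue, 1: boundary

    def forward(self, char_ids):              # (batch, seq_len)
        h, _ = self.lstm(self.embed(char_ids))
        return self.out(h)                    # (batch, seq_len, 2)

model = BiLSTMTokenizer(n_chars=100)
logits = model(torch.randint(0, 100, (1, 12)))
boundaries = logits.argmax(-1)                # predicted token ends
print(boundaries.shape)  # torch.Size([1, 12])
```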
no code implementations • 21 Apr 2023 • Tatsuya Hiraoka, Tomoya Iwakura
Is preferred tokenization for humans also preferred for machine-learning (ML) models?
1 code implementation • COLING 2022 • Tatsuya Hiraoka
We present a subword regularization method for WordPiece, which uses a maximum matching algorithm for tokenization.
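A sketch of the core idea: greedy longest-match tokenization where the longest match is randomly rejected with some probability, forcing a shorter alternative and thus producing varied segmentations during training. This simplified version drops WordPiece's "##" continuation markers, and the parameter names are illustrative.

```python
import random

def maxmatch_dropout(word, vocab, p=0.1, max_len=8):
    """Maximum matching with stochastic rejection of the longest match."""
    tokens, i = [], 0
    while i < len(word):
        match = None
        for L in range(min(max_len, len(word) - i), 0, -1):
            piece = word[i:i + L]
            if piece in vocab:
                # Drop this (longest) match with probability p,
                # unless it is a single character (always allowed).
                if L > 1 and random.random() < p:
                    continue
                match = piece
                break
        if match is None:        # no vocabulary entry at all
            match = word[i]
        tokens.append(match)
        i += len(match)
    return tokens

vocab = {"un", "unhappy", "happy", "h", "a", "p", "y", "u", "n"}
print(maxmatch_dropout("unhappy", vocab, p=0.5))
# e.g. ['unhappy'] or ['un', 'happy'], depending on the random draws
```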
no code implementations • Findings (ACL) 2022 • Sho Takase, Tatsuya Hiraoka, Naoaki Okazaki
Subword regularization uses multiple subword segmentations during training to improve the robustness of neural machine translation models.
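For reference, the sentencepiece toolkit exposes exactly this kind of on-the-fly segmentation sampling (the model file path below is a placeholder): with sampling enabled, the same sentence tokenizes differently across training epochs, which is the source of the robustness studied here.

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="spm.model")
for _ in range(3):
    # alpha controls the sampling temperature; nbest_size=-1 samples
    # from the full segmentation lattice.
    print(sp.encode("New York is cold.", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))
# e.g. ['▁New', '▁York', '▁is', '▁cold', '.']
#      ['▁New', '▁Y', 'or', 'k', '▁is', '▁cold', '.']
```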
2 code implementations • Findings (ACL) 2021 • Tatsuya Hiraoka, Sho Takase, Kei Uchiumi, Atsushi Keyaki, Naoaki Okazaki
Because traditional tokenizers are isolated from the downstream task and model, they cannot produce a tokenization suited to that task and model, even though recent studies suggest that appropriate tokenization improves performance.
1 code implementation • Findings of the Association for Computational Linguistics 2020 • Tatsuya Hiraoka, Sho Takase, Kei Uchiumi, Atsushi Keyaki, Naoaki Okazaki
In traditional NLP, we tokenize a given sentence as a preprocessing step, and thus the tokenization is unrelated to the target downstream task.
1 code implementation • Journal of Natural Language Processing 2022 • Youmi Ma, Tatsuya Hiraoka, Naoaki Okazaki
In this study, we present a novel method for extracting named entities and relations from unstructured text based on table representations.
Ranked #3 on Relation Extraction on CoNLL04 (NER Micro F1 metric)
no code implementations • ACL 2019 • Tatsuya Hiraoka, Hiroyuki Shindo, Yuji Matsumoto
To make the model robust against infrequent tokens, we stochastically sampled a segmentation for each sentence during training, which improved the performance of text classification.