no code implementations • WNUT (ACL) 2021 • Shohei Higashiyama, Masao Utiyama, Taro Watanabe, Eiichiro Sumita
Lexical normalization, in addition to word segmentation and part-of-speech tagging, is a fundamental task for Japanese user-generated text processing.
no code implementations • Findings (ACL) 2022 • Kehai Chen, Masao Utiyama, Eiichiro Sumita, Rui Wang, Min Zhang
Machine translation typically adopts an encoder-to-decoder framework, in which the decoder generates the target sentence word-by-word in an auto-regressive manner.
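A minimal sketch of that word-by-word, auto-regressive generation loop (the `encode`/`next_token_logits` callables and the BOS/EOS ids below are hypothetical stand-ins, not any specific system's API):

```python
# Minimal auto-regressive decoding loop for an encoder-decoder model.
# `encode` and `next_token_logits` are hypothetical stand-ins for a real model.
from typing import Callable, List

BOS, EOS = 1, 2  # hypothetical special-token ids

def greedy_decode(encode: Callable[[List[int]], object],
                  next_token_logits: Callable[[object, List[int]], List[float]],
                  src_ids: List[int], max_len: int = 50) -> List[int]:
    memory = encode(src_ids)   # encode the source sentence once
    tgt = [BOS]                # the target prefix grows one token at a time
    for _ in range(max_len):
        logits = next_token_logits(memory, tgt)  # condition on the full prefix
        next_id = max(range(len(logits)), key=logits.__getitem__)
        tgt.append(next_id)
        if next_id == EOS:     # stop once end-of-sentence is generated
            break
    return tgt[1:]

# Toy stand-ins so the sketch runs end-to-end: emit token 3 three times, then EOS.
def demo_logits(memory, tgt):
    out = [0.0] * 5
    out[EOS if len(tgt) > 3 else 3] = 1.0
    return out

print(greedy_decode(lambda src: src, demo_logits, src_ids=[7, 8, 9]))  # [3, 3, 3, 2]
```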
no code implementations • WMT (EMNLP) 2020 • Zuchao Li, Hai Zhao, Rui Wang, Kehai Chen, Masao Utiyama, Eiichiro Sumita
In this paper, we introduce our joint team SJTU-NICT's participation in the WMT 2020 machine translation shared task.
no code implementations • EMNLP 2021 • Zuchao Li, Masao Utiyama, Eiichiro Sumita, Hai Zhao
Machine translation usually relies on parallel corpora to provide parallel signals for training.
no code implementations • ACL 2022 • Zuchao Li, Masao Utiyama, Eiichiro Sumita, Hai Zhao
Although this can satisfy the requirements overall, it usually requires a larger beam size and far longer decoding time than unrestricted translation, which limits the concurrent processing ability of the translation model in deployment, and thus its practicality.
no code implementations • Findings (ACL) 2022 • Zuchao Li, Yiran Wang, Masao Utiyama, Eiichiro Sumita, Hai Zhao, Taro Watanabe
Inspired by this discovery, we then propose approaches to improving it, with respect to model structure and model training, to make the deep decoder practical in NMT.
no code implementations • ACL (WAT) 2021 • Zuchao Li, Masao Utiyama, Eiichiro Sumita, Hai Zhao
This paper describes our system (Team ID: nictrb) for participating in the WAT’21 restricted machine translation task.
no code implementations • AMTA 2022 • Xiaolin Wang, Masao Utiyama, Eiichiro Sumita
“Who said what” is essential for users to understand video streams that have more than one speaker, but conventional simultaneous interpretation systems merely present “what was said” in the form of subtitles.
no code implementations • ICON 2021 • Hour Kaing, Chenchen Ding, Katsuhito Sudoh, Masao Utiyama, Eiichiro Sumita, Satoshi Nakamura
Pretrained multilingual language models have become a key part of cross-lingual transfer for many natural language processing tasks, even those without bilingual information.
no code implementations • COLING 2022 • Abhisek Chakrabarty, Raj Dabre, Chenchen Ding, Hideki Tanaka, Masao Utiyama, Eiichiro Sumita
In this paper we present FeatureBART, a linguistically motivated sequence-to-sequence monolingual pre-training strategy in which syntactic features such as lemma, part-of-speech and dependency labels are incorporated into the span prediction based pre-training framework (BART).
no code implementations • WMT (EMNLP) 2021 • Zuchao Li, Masao Utiyama, Eiichiro Sumita, Hai Zhao
In this paper, we describe our MiSS system that participated in the WMT21 news translation task.
no code implementations • EMNLP (ACL) 2021 • Zuchao Li, Kevin Parnow, Masao Utiyama, Eiichiro Sumita, Hai Zhao
With this system, we aim to provide a complete translation experience for machine translation users.
2 code implementations • 14 Mar 2025 • Jonas Belouadi, Eddy Ilg, Margret Keuper, Hideki Tanaka, Masao Utiyama, Raj Dabre, Steffen Eger, Simone Paolo Ponzetto
Meanwhile, large amounts of unaligned graphics programs and captioned raster images are more readily available.
no code implementations • 8 Mar 2025 • Haryo Akbarianto Wibowo, Haiyue Song, Hideki Tanaka, Masao Utiyama, Alham Fikri Aji, Raj Dabre
Large Language Models (LLMs) have grown increasingly expensive to deploy, driving the need for effective model compression techniques.
3 code implementations • 6 Jan 2025 • Zhi Qu, Yiran Wang, Jiannan Mao, Chenchen Ding, Hideki Tanaka, Masao Utiyama, Taro Watanabe
We further scale up and collect 9.3 billion sentence pairs across 24 languages from public datasets to pre-train two models, namely MITRE (multilingual translation with registers).
1 code implementation • 3 Dec 2024 • Zhi Qu, Yiran Wang, Chenchen Ding, Hideki Tanaka, Masao Utiyama, Taro Watanabe
We propose dividing the decoding process into two stages so that target tokens are explicitly excluded in the first stage to implicitly boost the transfer capability across languages.
1 code implementation • 5 Oct 2024 • Yiran Wang, Masao Utiyama
Unsupervised parsing, also known as grammar induction, aims to infer syntactic structure from raw text.
1 code implementation • 12 Jun 2024 • Yiran Wang, Masao Utiyama
In this paper, we investigate the feasibility of further introducing it to the output side, aiming to allow models to output binary labels instead.
Ranked #1 on Constituency Parsing on Penn Treebank
1 code implementation • 17 Feb 2024 • Hiroyuki Deguchi, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe, Hideki Tanaka, Masao Utiyama
Minimum Bayes risk (MBR) decoding achieved state-of-the-art translation performance by using COMET, a neural metric that has a high correlation with human evaluation.
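The core of MBR decoding can be sketched independently of the metric: score every candidate against the rest of the pool as pseudo-references and keep the candidate with the highest expected utility. In the toy version below, a unigram-F1 utility stands in for COMET, which requires a trained neural model:

```python
# Sketch of Minimum Bayes Risk (MBR) decoding over a pool of sampled
# candidates; a toy unigram-F1 utility replaces COMET here.
from typing import List

def utility(hyp: str, ref: str) -> float:
    """Toy stand-in for COMET: unigram F1 between hypothesis and reference."""
    h, r = hyp.split(), ref.split()
    overlap = len(set(h) & set(r))
    if overlap == 0:
        return 0.0
    p, rec = overlap / len(h), overlap / len(r)
    return 2 * p * rec / (p + rec)

def mbr_decode(candidates: List[str]) -> str:
    """Return the candidate with the highest expected utility, using the
    candidate pool itself as pseudo-references."""
    def expected_utility(hyp: str) -> float:
        refs = [c for c in candidates if c is not hyp]
        return sum(utility(hyp, r) for r in refs) / max(len(refs), 1)
    return max(candidates, key=expected_utility)

print(mbr_decode(["the cat sat", "a cat sat down", "the cat sat down"]))
```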
no code implementations • 9 Jan 2023 • Zhuosheng Zhang, Kehai Chen, Rui Wang, Masao Utiyama, Eiichiro Sumita, Zuchao Li, Hai Zhao
Representation learning is the foundation of natural language processing (NLP).
no code implementations • 1 Dec 2022 • Zhuosheng Zhang, Hai Zhao, Masao Utiyama, Eiichiro Sumita
Discriminative pre-trained language models (PLMs) learn to predict original texts from intentionally corrupted ones.
1 code implementation • EMNLP 2021 • Zhuosheng Zhang, Siru Ouyang, Hai Zhao, Masao Utiyama, Eiichiro Sumita
In this work, we propose an effective gating strategy by smoothing the two dialogue states in only one decoder and bridge decision making and question generation to provide a richer dialogue state reference.
no code implementations • 27 Jul 2021 • Zuchao Li, Kevin Parnow, Hai Zhao, Zhuosheng Zhang, Rui Wang, Masao Utiyama, Eiichiro Sumita
Though the pre-trained contextualized language model (PrLM) has made a significant impact on NLP, training PrLMs in languages other than English can be impractical for two reasons: other languages often lack corpora sufficient for training powerful PrLMs, and because of the commonalities among human languages, computationally expensive PrLM training for different languages is somewhat redundant.
1 code implementation • NAACL 2021 • Shohei Higashiyama, Masao Utiyama, Taro Watanabe, Eiichiro Sumita
Morphological analysis (MA) and lexical normalization (LN) are both important tasks for Japanese user-generated text (UGT).
no code implementations • 11 Feb 2021 • Zuchao Li, Zhuosheng Zhang, Hai Zhao, Rui Wang, Kehai Chen, Masao Utiyama, Eiichiro Sumita
In this paper, we propose explicit and implicit text compression approaches to enhance the Transformer encoding and evaluate models using this approach on several typical downstream tasks that rely on the encoding heavily.
no code implementations • 1 Jan 2021 • Kehai Chen, Rui Wang, Masao Utiyama, Eiichiro Sumita
Self-attention networks (SANs) have shown promising empirical results in various natural language processing tasks.
no code implementations • 1 Jan 2021 • Zuchao Li, Kevin Barry Parnow, Hai Zhao, Zhuosheng Zhang, Rui Wang, Masao Utiyama, Eiichiro Sumita
Though the pre-trained contextualized language model (PrLM) has made a significant impact on NLP, training PrLMs in languages other than English can be impractical for two reasons: other languages often lack corpora sufficient for training powerful PrLMs, and because of the commonalities among human languages, computationally expensive PrLM training for different languages is somewhat redundant.
no code implementations • 30 Dec 2020 • Zhuosheng Zhang, Haojie Yu, Hai Zhao, Rui Wang, Masao Utiyama
Word representation is a fundamental component in neural language understanding models.
no code implementations • COLING 2020 • Hiroyuki Deguchi, Masao Utiyama, Akihiro Tamura, Takashi Ninomiya, Eiichiro Sumita
This paper proposes a new subword segmentation method for neural machine translation, "Bilingual Subword Segmentation," which tokenizes sentences to minimize the difference between the number of subword units in a sentence and that of its translation.
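A deliberately simplified illustration of that objective: among candidate segmentations of a source sentence, prefer the one whose subword count is closest to the target side's (the actual method learns the segmentation jointly rather than picking from fixed candidates):

```python
# Illustrative simplification: among candidate segmentations of the source,
# prefer the one whose subword count best matches the target side.
from typing import List

def pick_segmentation(candidates: List[List[str]],
                      tgt_subwords: List[str]) -> List[str]:
    # Minimize |#source subwords - #target subwords|.
    return min(candidates, key=lambda seg: abs(len(seg) - len(tgt_subwords)))

src_candidates = [
    ["un", "believ", "able"],            # 3 units
    ["unbeliev", "able"],                # 2 units
    ["u", "n", "believ", "able"],        # 4 units
]
tgt = ["in", "croyable"]                 # 2 units on the target side
print(pick_segmentation(src_candidates, tgt))  # -> ['unbeliev', 'able']
```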
no code implementations • COLING 2020 • Abhisek Chakrabarty, Raj Dabre, Chenchen Ding, Masao Utiyama, Eiichiro Sumita
In this study, linguistic knowledge at different levels is incorporated into the neural machine translation (NMT) framework to improve translation quality for language pairs with extremely limited data.
no code implementations • 11 Oct 2020 • Zuchao Li, Hai Zhao, Rui Wang, Kehai Chen, Masao Utiyama, Eiichiro Sumita
In this paper, we introduce our joint team SJTU-NICT's participation in the WMT 2020 machine translation shared task.
no code implementations • EMNLP (NLP-COVID19) 2020 • Akiko Aizawa, Frederic Bergeron, Junjie Chen, Fei Cheng, Katsuhiko Hayashi, Kentaro Inui, Hiroyoshi Ito, Daisuke Kawahara, Masaru Kitsuregawa, Hirokazu Kiyomaru, Masaki Kobayashi, Takashi Kodama, Sadao Kurohashi, Qianying Liu, Masaki Matsubara, Yusuke Miyao, Atsuyuki Morishima, Yugo Murawaki, Kazumasa Omura, Haiyue Song, Eiichiro Sumita, Shinji Suzuki, Ribeka Tanaka, Yu Tanaka, Masashi Toyoda, Nobuhiro Ueda, Honai Ueoka, Masao Utiyama, Ying Zhong
The global pandemic of COVID-19 has made the public pay close attention to related news, covering various domains, such as sanitation, treatment, and effects on education.
no code implementations • ACL 2020 • Chenchen Ding, Masao Utiyama, Eiichiro Sumita
We show that the rank-frequency relation in textual data follows $f \propto r^{-\alpha}(r+\gamma)^{-\beta}$, where $f$ is the token frequency and $r$ is the rank by frequency, with ($\alpha$, $\beta$, $\gamma$) as parameters.
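A sketch of fitting these parameters to a corpus in log space (assumes SciPy and a whitespace-tokenized placeholder file `corpus.txt`):

```python
# Fit the rank-frequency law f ∝ r^(-alpha) * (r + gamma)^(-beta) in log
# space: log f = log C - alpha*log r - beta*log(r + gamma).
import numpy as np
from collections import Counter
from scipy.optimize import curve_fit

def log_model(r, logC, alpha, beta, gamma):
    return logC - alpha * np.log(r) - beta * np.log(r + gamma)

tokens = open("corpus.txt", encoding="utf-8").read().split()  # placeholder path
freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
ranks = np.arange(1, len(freqs) + 1, dtype=float)

# Constrain alpha, beta, gamma to be non-negative so log(r + gamma) stays valid.
params, _ = curve_fit(log_model, ranks, np.log(freqs),
                      p0=[np.log(freqs[0]), 1.0, 1.0, 1.0],
                      bounds=([-np.inf, 0.0, 0.0, 0.0], np.inf))
logC, alpha, beta, gamma = params
print(f"alpha={alpha:.3f}  beta={beta:.3f}  gamma={gamma:.3f}")
```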
no code implementations • ACL 2020 • Kehai Chen, Rui Wang, Masao Utiyama, Eiichiro Sumita
Neural machine translation (NMT) encodes the source sentence in a universal way to generate the target sentence word-by-word.
no code implementations • LREC 2020 • Aye Myat Mon, Chenchen Ding, Hour Kaing, Khin Mar Soe, Masao Utiyama, Eiichiro Sumita
For the Myanmar (Burmese) language, robust automatic transliteration for borrowed English words is a challenging task because of the complex Myanmar writing system and the lack of data.
1 code implementation • ICLR 2020 • Zhuosheng Zhang, Kehai Chen, Rui Wang, Masao Utiyama, Eiichiro Sumita, Zuchao Li, Hai Zhao
Though visual information has been introduced for enhancing neural machine translation (NMT), its effectiveness strongly relies on the availability of large amounts of bilingual parallel sentence pairs with manual image annotations.
no code implementations • ACL 2020 • Haipeng Sun, Rui Wang, Kehai Chen, Masao Utiyama, Eiichiro Sumita, Tiejun Zhao
Unsupervised neural machine translation (UNMT) has recently achieved remarkable results for several language pairs.
no code implementations • NAACL 2021 • Haipeng Sun, Rui Wang, Kehai Chen, Masao Utiyama, Eiichiro Sumita, Tiejun Zhao
Unsupervised neural machine translation (UNMT) that relies solely on massive monolingual corpora has achieved remarkable results in several translation tasks.
no code implementations • 8 Apr 2020 • Kehai Chen, Rui Wang, Masao Utiyama, Eiichiro Sumita
Thus, we propose a novel reordering method to explicitly model this reordering information for the Transformer-based NMT.
1 code implementation • Findings of the Association for Computational Linguistics 2020 • Zuchao Li, Hai Zhao, Rui Wang, Masao Utiyama, Eiichiro Sumita
Further enriching the idea of pivot translation by extending the use of parallel corpora beyond the source-target paradigm, we propose a new reference language-based framework for UNMT, RUNMT, in which the reference language only shares a parallel corpus with the source, but this corpus still indicates a signal clear enough to help the reconstruction training of UNMT through a proposed reference agreement mechanism.
no code implementations • COLING 2020 • Haipeng Sun, Rui Wang, Kehai Chen, Xugang Lu, Masao Utiyama, Eiichiro Sumita, Tiejun Zhao
Unsupervised neural machine translation (UNMT) has recently attracted great interest in the machine translation community.
no code implementations • 28 Feb 2020 • Chaoqun Duan, Kehai Chen, Rui Wang, Masao Utiyama, Eiichiro Sumita, Conghui Zhu, Tiejun Zhao
Existing neural machine translation (NMT) systems utilize sequence-to-sequence neural networks to generate the target translation word by word, and then encourage the generated word at each time-step to be as consistent as possible with its counterpart in the references.
1 code implementation • 27 Dec 2019 • Zuchao Li, Rui Wang, Kehai Chen, Masao Utiyama, Eiichiro Sumita, Zhuosheng Zhang, Hai Zhao
In this paper, we propose an explicit sentence compression method to enhance the source sentence representation for NMT.
no code implementations • 7 Nov 2019 • Zhuosheng Zhang, Rui Wang, Kehai Chen, Masao Utiyama, Eiichiro Sumita, Hai Zhao
We present a universal framework for modeling contextualized sentence representations with visual awareness, motivated by the shortcomings of multimodal parallel data with manual annotations.
no code implementations • WS 2019 • Benjamin Marie, Hour Kaing, Aye Myat Mon, Chenchen Ding, Atsushi Fujita, Masao Utiyama, Eiichiro Sumita
This paper presents NICT's supervised and unsupervised machine translation systems for the WAT2019 Myanmar-English and Khmer-English translation tasks.
no code implementations • CONLL 2019 • Zuchao Li, Hai Zhao, Zhuosheng Zhang, Rui Wang, Masao Utiyama, Eiichiro Sumita
This paper describes our SJTU-NICT system for participating in the shared task on Cross-Framework Meaning Representation Parsing (MRP) at the 2019 Conference on Computational Natural Language Learning (CoNLL).
no code implementations • IJCNLP 2019 • Chenchen Ding, Masao Utiyama, Eiichiro Sumita
MY-AKKHARA is a method used to input Burmese texts encoded in the Unicode standard, based on commonly accepted Latin transcription.
no code implementations • IJCNLP 2019 • Kehai Chen, Rui Wang, Masao Utiyama, Eiichiro Sumita
To address this issue, this work proposes a recurrent positional embedding approach based on word vector.
no code implementations • WS 2019 • Rui Wang, Haipeng Sun, Kehai Chen, Chenchen Ding, Masao Utiyama, Eiichiro Sumita
This paper presents NICT's participation (team ID: NICT) in the 6th Workshop on Asian Translation (WAT-2019) shared translation task, specifically the Myanmar (Burmese)-English task in both translation directions.
no code implementations • 31 Oct 2019 • Shu Jiang, Rui Wang, Zuchao Li, Masao Utiyama, Kehai Chen, Eiichiro Sumita, Hai Zhao, Bao-liang Lu
Most existing document-level NMT approaches settle for a superficial sense of global document-level information, while this work focuses on exploiting detailed document-level context through a memory network.
no code implementations • WS 2019 • Junya Ono, Masao Utiyama, Eiichiro Sumita
We apply a model parallel approach to the RNN encoder-decoder part of the Seq2Seq model and a data parallel approach to the attention-softmax part of the model.
no code implementations • 26 Aug 2019 • Haipeng Sun, Rui Wang, Kehai Chen, Masao Utiyama, Eiichiro Sumita, Tiejun Zhao, Chenhui Chu
However, it has not been well-studied for unsupervised neural machine translation (UNMT), although UNMT has recently achieved remarkable results in several domain-specific language pairs.
no code implementations • WS 2019 • Raj Dabre, Kehai Chen, Benjamin Marie, Rui Wang, Atsushi Fujita, Masao Utiyama, Eiichiro Sumita
In this paper, we describe our supervised neural machine translation (NMT) systems that we developed for the news translation task for Kazakh↔English, Gujarati↔English, Chinese↔English, and English→Finnish translation directions.
no code implementations • WS 2019 • Benjamin Marie, Haipeng Sun, Rui Wang, Kehai Chen, Atsushi Fujita, Masao Utiyama, Eiichiro Sumita
This paper presents NICT's participation in the WMT19 unsupervised news translation task.
no code implementations • ACL 2019 • Kehai Chen, Rui Wang, Masao Utiyama, Eiichiro Sumita
The reordering model plays an important role in phrase-based statistical machine translation.
no code implementations • ACL 2019 • Haipeng Sun, Rui Wang, Kehai Chen, Masao Utiyama, Eiichiro Sumita, Tiejun Zhao
In previous methods, UBWE is first trained using non-parallel monolingual corpora and then this pre-trained UBWE is used to initialize the word embedding in the encoder and decoder of UNMT.
no code implementations • ACL 2019 • Mingming Yang, Rui Wang, Kehai Chen, Masao Utiyama, Eiichiro Sumita, Min Zhang, Tiejun Zhao
The training objective of neural machine translation (NMT) is to minimize the loss between the words in the translated sentences and those in the references.
1 code implementation • NAACL 2019 • Shohei Higashiyama, Masao Utiyama, Eiichiro Sumita, Masao Ideuchi, Yoshiaki Oida, Yohei Sakamoto, Isaac Okada
Neural network models have been actively applied to word segmentation, especially for Chinese, because of their ability to minimize the effort of feature engineering.
Ranked #2 on Japanese Word Segmentation on BCCWJ
no code implementations • NAACL 2019 • Chunpeng Ma, Akihiro Tamura, Masao Utiyama, Eiichiro Sumita, Tiejun Zhao
The explicit use of syntactic information has proven useful for neural machine translation (NMT).
no code implementations • WS 2018 • Benjamin Marie, Rui Wang, Atsushi Fujita, Masao Utiyama, Eiichiro Sumita
Our systems are ranked first for the Estonian-English and Finnish-English language pairs (constraint) according to BLEU-cased.
no code implementations • WS 2018 • Rui Wang, Benjamin Marie, Masao Utiyama, Eiichiro Sumita
Using the clean data of the WMT18 shared news translation task, we designed several features and trained a classifier to score each sentence pair in the noisy data.
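A hedged sketch of this kind of filtering pipeline, with generic hand-crafted features and a logistic-regression scorer; the features and data below are illustrative, not the paper's exact set:

```python
# Illustrative corpus-filtering classifier: hand-crafted features for a
# sentence pair, scored with logistic regression trained on clean vs. noisy
# examples. The features here are generic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

def pair_features(src: str, tgt: str) -> list:
    s, t = src.split(), tgt.split()
    len_ratio = len(s) / max(len(t), 1)              # wildly off ratios are suspicious
    copy_rate = len(set(s) & set(t)) / max(len(set(s)), 1)  # untranslated copies
    alpha_src = sum(w.isalpha() for w in s) / max(len(s), 1)
    return [len_ratio, copy_rate, alpha_src]

# Toy training data: label 1 = clean pair, 0 = noisy pair.
pairs = [("the cat sleeps", "le chat dort", 1),
         ("click here now", "le chat dort", 0),
         ("good morning", "bonjour", 1),
         ("good morning", "good morning", 0)]
X = np.array([pair_features(s, t) for s, t, _ in pairs])
y = np.array([lbl for _, _, lbl in pairs])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([pair_features("hello world", "bonjour monde")])[:, 1])
```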
1 code implementation • EMNLP 2018 • Zhisong Zhang, Rui Wang, Masao Utiyama, Eiichiro Sumita, Hai Zhao
In Neural Machine Translation (NMT), the decoder can capture the features of the entire prediction history with neural connections and representations.
no code implementations • ACL 2018 • Chunpeng Ma, Akihiro Tamura, Masao Utiyama, Tiejun Zhao, Eiichiro Sumita
Tree-based neural machine translation (NMT) approaches, although achieved impressive performance, suffer from a major drawback: they only use the 1-best parse tree to direct the translation, which potentially introduces translation mistakes due to parsing errors.
no code implementations • ACL 2018 • Chenchen Ding, Masao Utiyama, Eiichiro Sumita
An abugida is a writing system where the consonant letters represent syllables with a default vowel and other vowels are denoted by diacritics.
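A tiny illustration using Devanagari (a sketch for exposition; the paper's focus is on other abugidas such as the Burmese script):

```python
# A consonant letter alone carries the inherent vowel "a"; a diacritic
# (vowel sign) replaces it with another vowel.
KA = "\u0915"          # क  consonant letter, reads "ka" by default
SIGN_I = "\u093F"      # ि  vowel sign i
SIGN_U = "\u0941"      # ु  vowel sign u

print(KA)              # क  -> "ka"  (default vowel)
print(KA + SIGN_I)     # कि -> "ki"  (diacritic overrides the default)
print(KA + SIGN_U)     # कु -> "ku"
```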
no code implementations • ACL 2018 • Rui Wang, Masao Utiyama, Eiichiro Sumita
Traditional neural machine translation (NMT) involves a fixed training procedure where each sentence is sampled once during each epoch.
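By contrast, a dynamic procedure can re-draw sentences with unequal probabilities each epoch. The sketch below weights sentences by a running loss estimate purely for illustration; the paper's actual sampling criterion differs:

```python
# Illustrative weighted sampling: instead of visiting each sentence exactly
# once per epoch, draw sentences with probability proportional to a weight
# (here, a hypothetical per-sentence loss estimate).
import random

def sample_epoch(sentences, weights, epoch_size):
    return random.choices(sentences, weights=weights, k=epoch_size)

sentences = ["easy sent", "medium sent", "hard sent"]
losses = [0.2, 1.0, 3.5]       # hypothetical per-sentence training losses
batch = sample_epoch(sentences, losses, epoch_size=10)
print(batch)  # harder sentences appear more often on average
```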
no code implementations • NAACL 2018 • Jingyi Zhang, Masao Utiyama, Eiichiro Sumita, Graham Neubig, Satoshi Nakamura
Specifically, for an input sentence, we use a search engine to retrieve sentence pairs whose source sides are similar to the input sentence, and then collect $n$-grams that are both in the retrieved target sentences and aligned with words that match in the source sentences, which we call "translation pieces".
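A sketch of collecting such translation pieces from one retrieved pair, assuming retrieval and word alignment are given (the alignment dictionary and example data are invented for illustration):

```python
# Collect "translation pieces": target n-grams from a retrieved sentence
# pair whose aligned source words also occur in the input sentence.
from typing import Dict, List, Set, Tuple

def translation_pieces(input_words: Set[str],
                       retrieved_src: List[str],
                       retrieved_tgt: List[str],
                       align: Dict[int, int],   # tgt index -> src index
                       max_n: int = 4) -> Set[Tuple[str, ...]]:
    # A target position "matches" if its aligned source word is in the input.
    matched = {j for j, i in align.items() if retrieved_src[i] in input_words}
    pieces = set()
    for n in range(1, max_n + 1):
        for j in range(len(retrieved_tgt) - n + 1):
            if all(k in matched for k in range(j, j + n)):  # fully matched span
                pieces.add(tuple(retrieved_tgt[j:j + n]))
    return pieces

src = "das haus ist groß".split()
tgt = "the house is big".split()
alignment = {0: 0, 1: 1, 2: 2, 3: 3}
print(translation_pieces({"das", "haus", "ist"}, src, tgt, alignment))
```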
1 code implementation • EMNLP 2018 • Xiaolin Wang, Masao Utiyama, Eiichiro Sumita
This paper presents an open-source neural machine translation toolkit named CytonMT (https://github.com/arthurxlw/cytonMt).
no code implementations • 12 Nov 2017 • Kehai Chen, Rui Wang, Masao Utiyama, Eiichiro Sumita, Tiejun Zhao
In this paper, we extend local attention with a syntax-distance constraint to focus on source words syntactically related to the predicted target word, thus learning a more effective context vector for word prediction.
no code implementations • IJCNLP 2017 • Jingyi Zhang, Masao Utiyama, Eiichiro Sumita, Graham Neubig, Satoshi Nakamura
Compared to traditional statistical machine translation (SMT), neural machine translation (NMT) often sacrifices adequacy for the sake of fluency.
no code implementations • WS 2017 • Yusuke Oda, Katsuhito Sudoh, Satoshi Nakamura, Masao Utiyama, Eiichiro Sumita
This paper describes the details about the NAIST-NICT machine translation system for WAT2017 English-Japanese Scientific Paper Translation Task.
no code implementations • IJCNLP 2017 • Hideya Mino, Masao Utiyama, Eiichiro Sumita, Takenobu Tokunaga
In this paper, we propose a neural machine translation (NMT) with a key-value attention mechanism on the source-side encoder.
no code implementations • IJCNLP 2017 • Kehai Chen, Rui Wang, Masao Utiyama, Eiichiro Sumita, Tiejun Zhao
In Neural Machine Translation (NMT), each word is represented as a low-dimensional, real-valued vector encoding its syntactic and semantic information.
1 code implementation • EMNLP 2017 • Rui Wang, Masao Utiyama, Lemao Liu, Kehai Chen, Eiichiro Sumita
Instance weighting has been widely applied to phrase-based machine translation domain adaptation.
no code implementations • EMNLP 2017 • Kehai Chen, Rui Wang, Masao Utiyama, Lemao Liu, Akihiro Tamura, Eiichiro Sumita, Tiejun Zhao
Source dependency information has been successfully introduced into statistical machine translation.
no code implementations • ACL 2017 • Rui Wang, Andrew Finch, Masao Utiyama, Eiichiro Sumita
Although new corpora are becoming increasingly available for machine translation, only those that belong to the same or similar domains are typically able to improve translation performance.
no code implementations • COLING 2016 • Xiaolin Wang, Andrew Finch, Masao Utiyama, Eiichiro Sumita
Simultaneous interpretation allows people to communicate spontaneously across language boundaries, but such services are prohibitively expensive for the general public.
no code implementations • WS 2016 • Xiaolin Wang, Andrew Finch, Masao Utiyama, Eiichiro Sumita
Simultaneous interpretation is a very challenging application of machine translation in which the input is a stream of words from a speech recognition engine.
no code implementations • WS 2016 • Masaru Fuji, Masao Utiyama, Eiichiro Sumita, Yuji Matsumoto
When translating formal documents, capturing the sentence structure specific to the sublanguage is essential for obtaining high-quality translations.
no code implementations • COLING 2016 • Rei Miyata, Anthony Hartley, Kyo Kageura, Cécile Paris, Masao Utiyama, Eiichiro Sumita
The paper introduces a web-based authoring support system, MuTUAL, which aims to help writers create multilingual texts.
no code implementations • WS 2016 • Chenchen Ding, Masao Utiyama, Eiichiro Sumita
This paper illustrates the similarity between Thai and Laotian, and between Malay and Indonesian, based on an investigation of raw parallel data from the Asian Language Treebank.
no code implementations • COLING 2016 • Lemao Liu, Masao Utiyama, Andrew Finch, Eiichiro Sumita
The attention mechanism is appealing for neural machine translation, since it is able to dynamically encode a source sentence by generating an alignment between a target word and source words.
no code implementations • 29 Jul 2016 • Rui Wang, Hai Zhao, Sabine Ploux, Bao-liang Lu, Masao Utiyama, Eiichiro Sumita
Most of the existing methods for bilingual word embedding only consider shallow context or simple co-occurrence information.
no code implementations • COLING 2016 • Rui Wang, Hai Zhao, Bao-liang Lu, Masao Utiyama, Eiichiro Sumita
Although additional corpora are now available for Statistical Machine Translation (SMT), only those that belong to the same or similar domains as the original corpus can directly enhance SMT performance.
no code implementations • LREC 2016 • Ye Kyaw Thu, Win Pa Pa, Masao Utiyama, Andrew Finch, Eiichiro Sumita
The project has so far created a corpus for Myanmar and will extend in scope to include other languages in the near future.
no code implementations • LREC 2016 • Toshiaki Nakazawa, Manabu Yaguchi, Kiyotaka Uchimoto, Masao Utiyama, Eiichiro Sumita, Sadao Kurohashi, Hitoshi Isahara
In this paper, we describe the details of ASPEC (the Asian Scientific Paper Excerpt Corpus), which is the first large-scale parallel corpus in the scientific paper domain.
1 code implementation • 15 Oct 2015 • Vichet Chea, Ye Kyaw Thu, Chenchen Ding, Masao Utiyama, Andrew Finch, Eiichiro Sumita
The trained CRF segmenter was compared empirically to a baseline approach based on maximum matching that used a dictionary extracted from the manually segmented corpus.
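A minimal sketch of that maximum-matching baseline, with a toy Latin-alphabet example standing in for unsegmented Khmer text:

```python
# Maximum-matching baseline: greedily take the longest dictionary word at
# each position; fall back to a single character when nothing matches.
def max_match(text: str, dictionary: set, max_word_len: int = 6) -> list:
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_word_len, len(text) - i), 0, -1):
            if n == 1 or text[i:i + n] in dictionary:
                words.append(text[i:i + n])
                i += n
                break
    return words

# Toy Latin-alphabet stand-in for an unsegmented script.
print(max_match("thecatsat", {"the", "cat", "sat", "at"}))  # ['the', 'cat', 'sat']
```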