Building Large Chinese Corpus for Spoken Dialogue Research in Specific Domains
Corpus is a valuable resource for information retrieval and data-driven natural language processing systems,especially for spoken dialogue research in specific domains. However,there is little non-English corpora, particular for ones in Chinese. Spoken by the nation with the largest population in the world, Chinese become increasingly prevalent and popular among millions of people worldwide. In this paper, we build a large-scale and high-quality Chinese corpus, called CSDC (Chinese Spoken Dialogue Corpus). It contains five domains and more than 140 thousand dialogues in all. Each sentence in this corpus is annotated with slot information additionally compared to other corpora. To our best knowledge, this is the largest Chinese spoken dialogue corpus, as well as the first one with slot information. With this corpus, we proposed a method and did a well-designed experiment. The indicative result is reported at last.
PDF Abstract IJCNLP 2017 PDF IJCNLP 2017 Abstract