ZEN: Pre-training Chinese Text Encoder Enhanced by N-gram Representations

The pre-training of text encoders normally processes text as a sequence of tokens corresponding to small text units, such as word pieces in English and characters in Chinese. This omits information carried by larger text granularities, so the encoders cannot easily adapt to certain combinations of characters. The result is a loss of important semantic information, which is especially problematic for Chinese because the language does not have explicit word boundaries. In this paper, we propose ZEN, a BERT-based Chinese (Z) text encoder Enhanced by N-gram representations, where different combinations of characters are considered during training. Potential word or phrase boundaries are thus explicitly pre-trained and fine-tuned together with the character encoder (BERT), so ZEN incorporates the comprehensive information of both the character sequence and the words or phrases it contains. Experimental results illustrate the effectiveness of ZEN on a series of Chinese NLP tasks: using fewer resources than other published encoders, ZEN achieves state-of-the-art performance on most tasks. Moreover, reasonable performance can be obtained when ZEN is trained on a small corpus, which is important for applying pre-training techniques to scenarios with limited data. The code and pre-trained models of ZEN are available at https://github.com/sinovation/zen.
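To make the n-gram enhancement concrete, below is a minimal PyTorch sketch of the fusion step. This is not the released ZEN implementation: the paper encodes matched n-grams with a multi-layer transformer and fuses them into every layer of the character encoder, whereas this sketch uses a single n-gram embedding table and a one-shot fusion. The class `NgramEnhancer`, the lexicon-matching helper `match_ngrams`, and the mean pooling over covering n-grams are illustrative assumptions.

```python
# Illustrative sketch of n-gram-enhanced character encoding (see the
# assumptions above); the real ZEN fuses n-grams into every BERT layer.
import torch
import torch.nn as nn


def match_ngrams(chars, lexicon, max_n=8):
    """Enumerate substrings of length 2..max_n found in the lexicon,
    returning (ngram, start, end) spans used to build the match matrix."""
    spans = []
    for i in range(len(chars)):
        for n in range(2, max_n + 1):
            gram = "".join(chars[i:i + n])
            if len(gram) == n and gram in lexicon:
                spans.append((gram, i, i + n))
    return spans


class NgramEnhancer(nn.Module):
    """Add the mean-pooled embeddings of matched n-grams to the hidden
    state of every character position they cover."""

    def __init__(self, ngram_vocab_size: int, hidden_size: int):
        super().__init__()
        self.ngram_embeddings = nn.Embedding(ngram_vocab_size, hidden_size)

    def forward(self, char_hidden, ngram_ids, match_matrix):
        # char_hidden:  (batch, seq_len, hidden)     character encoder states
        # ngram_ids:    (batch, num_ngrams)          ids of matched n-grams
        # match_matrix: (batch, seq_len, num_ngrams), float, 1.0 where
        #               n-gram j covers character position i
        ngram_emb = self.ngram_embeddings(ngram_ids)       # (B, N, H)
        summed = torch.bmm(match_matrix, ngram_emb)        # (B, L, H)
        counts = match_matrix.sum(dim=-1, keepdim=True).clamp(min=1.0)
        return char_hidden + summed / counts               # mean over n-grams


# Toy usage: two n-grams covering a four-character sentence.
B, L, N, H = 1, 4, 2, 8
enhancer = NgramEnhancer(ngram_vocab_size=100, hidden_size=H)
char_hidden = torch.randn(B, L, H)
ngram_ids = torch.tensor([[3, 7]])             # lexicon ids of the matches
match = torch.zeros(B, L, N)
match[0, 0:2, 0] = 1.0                         # n-gram 3 covers chars 0-1
match[0, 2:4, 1] = 1.0                         # n-gram 7 covers chars 2-3
out = enhancer(char_hidden, ngram_ids, match)  # shape (1, 4, 8)
```

In the paper, the n-gram lexicon is prepared from the pre-training corpus in an unsupervised way (e.g., by frequency), and the matching positions are computed per sentence, so the characters keep their original BERT pipeline while the matched n-grams supply the word- and phrase-level signal.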

PDF | Abstract | Findings of EMNLP 2020
| Task | Dataset | Model | Metric | Value | Global Rank |
|---|---|---|---|---|---|
| Chinese Sentiment Analysis | ChnSentiCorp | ZEN (Init with Chinese BERT) | F1 | 96.08 | #1 |
| Chinese Sentiment Analysis | ChnSentiCorp | ZEN (Random Init) | F1 | 94.42 | #2 |
| Chinese Sentiment Analysis | ChnSentiCorp Dev | ZEN (Random Init) | F1 | 94.87 | #2 |
| Chinese Sentiment Analysis | ChnSentiCorp Dev | ZEN (Init with Chinese BERT) | F1 | 95.66 | #1 |
| Chinese Part-of-Speech Tagging | CTB5 | ZEN (Init with Chinese BERT) | F1 | 96.64 | #1 |
| Chinese Part-of-Speech Tagging | CTB5 | ZEN (Random Init) | F1 | 95.82 | #3 |
| Chinese Part-of-Speech Tagging | CTB5 Dev | ZEN (Init with Chinese BERT) | F1 | 97.43 | #1 |
| Chinese Part-of-Speech Tagging | CTB5 Dev | ZEN (Random Init) | F1 | 96.12 | #2 |
| Chinese Sentence Pair Classification | LCQMC | ZEN (Random Init) | F1 | 85.27 | #4 |
| Chinese Sentence Pair Classification | LCQMC | ZEN (Init with Chinese BERT) | F1 | 87.95 | #2 |
| Chinese Sentence Pair Classification | LCQMC Dev | ZEN (Random Init) | F1 | 88.1 | #3 |
| Chinese Sentence Pair Classification | LCQMC Dev | ZEN (Init with Chinese BERT) | F1 | 90.2 | #2 |
| Chinese Word Segmentation | MSR | ZEN (Init with Chinese BERT) | F1 | 98.35 | #4 |
| Chinese Word Segmentation | MSR | ZEN (Random Init) | F1 | 97.89 | #6 |
| Chinese Named Entity Recognition | MSRA | ZEN (Init with Chinese BERT) | F1 | 95.25 | #9 |
| Chinese Named Entity Recognition | MSRA | ZEN (Random Init) | F1 | 93.24 | #18 |
| Chinese Document Classification | THUCNews | ZEN (Init with Chinese BERT) | F1 | 97.64 | #2 |
| Chinese Document Classification | THUCNews | ZEN (Random Init) | F1 | 96.87 | #3 |
| Chinese Document Classification | THUCNews Dev | ZEN (Random Init) | F1 | 97.2 | #3 |
| Chinese Document Classification | THUCNews Dev | ZEN (Init with Chinese BERT) | F1 | 97.66 | #2 |
| Chinese Sentence Pair Classification | XNLI | ZEN (Init with Chinese BERT) | F1 | 79.2 | #2 |
| Chinese Sentence Pair Classification | XNLI | ZEN (Random Init) | F1 | 77.03 | #3 |
| Chinese Sentence Pair Classification | XNLI Dev | ZEN (Init with Chinese BERT) | F1 | 80.48 | #2 |
| Chinese Sentence Pair Classification | XNLI Dev | ZEN (Random Init) | F1 | 77.11 | #3 |
