Unsupervised Boundary-Aware Language Model Pretraining for Chinese Sequence Labeling

27 Oct 2022 · Peijie Jiang, Dingkun Long, Yanzhao Zhang, Pengjun Xie, Meishan Zhang, Min Zhang

Boundary information is critical for various Chinese language processing tasks, such as word segmentation, part-of-speech tagging, and named entity recognition. Previous studies usually resorted to the use of a high-quality external lexicon, where lexicon items can offer explicit boundary information. However, to ensure the quality of the lexicon, great human effort is always necessary, which has been generally ignored. In this work, we suggest unsupervised statistical boundary information instead, and propose an architecture to encode the information directly into pre-trained language models, resulting in Boundary-Aware BERT (BABERT). We apply BABERT for feature induction of Chinese sequence labeling tasks. Experimental results on ten benchmarks of Chinese sequence labeling demonstrate that BABERT can provide consistent improvements on all datasets. In addition, our method can complement previous supervised lexicon exploration, where further improvements can be achieved when integrated with external lexicon information.

Code

modelscope/AdaSeq (official) · 355 stars

modelscope/modelscope · 6,005 stars

Tasks

Chinese Named Entity Recognition

Chinese Word Segmentation

Language Modelling

Named Entity Recognition

Named Entity Recognition (NER)

Part-Of-Speech Tagging

Datasets

Universal Dependencies

MSRA CN NER

Results from the Paper

Ranked #1 on Chinese Word Segmentation on MSRA

Task	Dataset	Model	Metric Name	Metric Value	Global Rank
Chinese Word Segmentation	CTB6	BABERT-LE	F1	97.56	#2
Chinese Word Segmentation	CTB6	BABERT	F1	97.45	#3
Chinese Word Segmentation	MSR	BABERT-LE	F1	98.63	#1
Chinese Word Segmentation	MSR	BABERT	F1	98.44	#2
Chinese Word Segmentation	MSRA	BABERT-LE	F1	98.63	#1
Chinese Word Segmentation	MSRA	BABERT	F1	98.44	#2
Chinese Word Segmentation	PKU	BABERT-LE	F1	96.84	#1
Chinese Word Segmentation	PKU	BABERT	F1	96.70	#2
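
The F1 values above are reported as percentages. This page does not state which evaluation script produced them, so the sketch below assumes the usual word-level definition: a predicted word counts as correct only if both of its character offsets match a gold word. The helper names `to_spans` and `segmentation_f1` are hypothetical.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_f1(gold, pred):
    """Word-level F1 between a gold and a predicted segmentation.

    Assumes both segmentations cover the same underlying character string.
    """
    gold_spans = to_spans(gold)
    pred_spans = to_spans(pred)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: only the first word is segmented correctly.
gold = ["北京", "大学", "生活"]
pred = ["北京", "大学生", "活"]
score = segmentation_f1(gold, pred)
```

Multiplying `score` by 100 puts it on the same scale as the table. A single misplaced boundary here costs two words, which is why small boundary improvements can move F1 noticeably.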

Methods

Adam • Attention Dropout • BERT • Dense Connections • Dropout • GELU • Layer Normalization • Linear Layer • Linear Warmup With Linear Decay • Multi-Head Attention • Residual Connection • Scaled Dot-Product Attention • Softmax • Weight Decay • WordPiece
