Penn Treebank

Introduced by Mitchell P. Marcus et al. in Building a Large Annotated Corpus of English: The Penn Treebank

The English Penn Treebank (PTB) corpus, and in particular the portion corresponding to Wall Street Journal (WSJ) articles, is one of the best-known and most widely used corpora for evaluating sequence labelling models. The task consists of annotating each word with its part-of-speech tag. In the most common split of this corpus, sections 0 to 18 are used for training (38,219 sentences, 912,344 tokens), sections 19 to 21 for validation (5,527 sentences, 131,768 tokens), and sections 22 to 24 for testing (5,462 sentences, 129,654 tokens). The corpus is also commonly used for character-level and word-level language modelling.

Source: Seq2Biseq: Bidirectional Output-wise Recurrent Neural Networks for Sequence Modelling
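The split described above can be sketched as a small lookup. This is only an illustrative sketch: the names `SPLIT_SECTIONS`, `SPLIT_SIZES`, `split_for_section`, and `token_share` are hypothetical and not part of any official PTB tooling. The section ranges and the sentence and token counts are copied from the description; loading the corpus itself is not shown.

```python
# Section ranges for the common POS-tagging split, as stated in the description.
# Python ranges exclude the upper bound, so range(0, 19) covers sections 0-18.
SPLIT_SECTIONS = {
    "train": range(0, 19),        # sections 0-18
    "validation": range(19, 22),  # sections 19-21
    "test": range(22, 25),        # sections 22-24
}

# (sentences, tokens) per split, copied from the description.
SPLIT_SIZES = {
    "train": (38_219, 912_344),
    "validation": (5_527, 131_768),
    "test": (5_462, 129_654),
}


def split_for_section(section: int) -> str:
    """Return the split name for a WSJ section number (hypothetical helper)."""
    for name, sections in SPLIT_SECTIONS.items():
        if section in sections:
            return name
    raise ValueError(f"section {section} is outside the 0-24 range used by this split")


def token_share(split: str) -> float:
    """Fraction of all tokens in this split that belong to the given subset."""
    total_tokens = sum(tokens for _, tokens in SPLIT_SIZES.values())
    return SPLIT_SIZES[split][1] / total_tokens


if __name__ == "__main__":
    for section in (0, 18, 19, 22, 24):
        print(section, split_for_section(section))
    for name in SPLIT_SIZES:
        print(name, f"{token_share(name):.1%}")
```

Keeping the ranges in one dictionary makes it easy to swap in a different convention, since other tasks on this corpus (for example parsing) often use different section assignments.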

Benchmarks

Task	Dataset Variant	Best Model
Language Modelling	Penn Treebank (Word Level)	GPT-3
Constituency Parsing	Penn Treebank	SAPar + XLNet
Dependency Parsing	Penn Treebank	Label Attention Layer + HPSG + XLNet
Part-Of-Speech Tagging	Penn Treebank	SALE-BART encoder
Language Modelling	Penn Treebank (Character Level)	Mogrifier LSTM + dynamic eval
Chunking	Penn Treebank	ACE
Unsupervised Dependency Parsing	Penn Treebank	Iterative reranking
Stochastic Optimization	Penn Treebank (Character Level) 3x1000 LSTM - 500 Epochs	AvaGrad
Open Information Extraction	Penn Treebank	Deepstruct zero-shot
Missing Elements	Penn Treebank	Kato and Matsubara