Replicability Analysis for Natural Language Processing: Testing Significance with Multiple Datasets

TACL 2017 Rotem DrorGili BaumerMarina BogomolovRoi Reichart

With the ever-growing amounts of textual data from a large variety of languages, domains, and genres, it has become standard to evaluate NLP algorithms on multiple datasets in order to ensure consistent performance across heterogeneous setups. However, such multiple comparisons pose significant challenges to traditional statistical analysis methods in NLP and can lead to erroneous conclusions... (read more)

PDF Abstract TACL 2017 PDF TACL 2017 Abstract

Results from the Paper

  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.

Methods used in the Paper

🤖 No Methods Found Help the community by adding them if they're not listed; e.g. Deep Residual Learning for Image Recognition uses ResNet