tasksource: A Dataset Harmonization Framework for Streamlined NLP Multi-Task Learning and Evaluation

14 Jan 2023 · Damien Sileo

The HuggingFace Datasets Hub hosts thousands of datasets, offering exciting opportunities for language model training and evaluation. However, datasets for a given task type often follow different schemas, making harmonization challenging: multi-task training or evaluation requires manual work to fit data into task templates. Several initiatives independently tackle this issue by releasing harmonized datasets or by providing harmonization code that preprocesses datasets into a consistent format. We identify patterns across previous preprocessing efforts, such as mapping column names and extracting specific sub-fields from structured data in a column. We then propose a structured annotation framework that keeps our annotations fully exposed rather than hidden in unstructured code. We release the annotation framework together with dataset annotations for more than 500 English tasks (https://github.com/sileod/tasksource). These annotations include metadata, such as the names of the columns to be used as input or labels for each dataset, which can save time in future preprocessing efforts whether or not our framework is used. We fine-tune a multi-task text encoder on all tasksource tasks; it outperforms every publicly available text encoder of comparable size in an external evaluation.
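As a concrete illustration of the annotation scheme, the sketch below shows how a declarative column-mapping annotation could be applied with the HuggingFace datasets library. This is a minimal sketch, not the actual tasksource API: the ClassificationAnnotation class and its load method are hypothetical, though the MultiNLI column names (premise, hypothesis, label) match the dataset on the Hub.

```python
# Minimal sketch of a declarative dataset annotation (illustrative only;
# the real tasksource API may differ). It captures the two recurring
# patterns noted above: column-name mapping and sub-field extraction.
from dataclasses import dataclass
from typing import Optional

from datasets import load_dataset


@dataclass
class ClassificationAnnotation:
    """Declares which source columns map onto a shared classification schema."""
    dataset_name: str
    sentence1: str                   # column used as the first input
    labels: str                      # column used as the label
    sentence2: Optional[str] = None  # optional second input column

    def load(self):
        ds = load_dataset(self.dataset_name)
        # Column-name mapping: rename source columns into the shared schema.
        mapping = {self.sentence1: "sentence1", self.labels: "labels"}
        if self.sentence2 is not None:
            mapping[self.sentence2] = "sentence2"
        # Sub-field extraction (the second pattern) would be an extra .map()
        # step, e.g. ds.map(lambda x: {"sentence1": x["passage"]["text"]}).
        return ds.rename_columns(mapping)


# MultiNLI stores its inputs as premise/hypothesis/label.
mnli = ClassificationAnnotation(
    dataset_name="multi_nli",
    sentence1="premise",
    sentence2="hypothesis",
    labels="label",
)
dataset = mnli.load()  # DatasetDict with sentence1/sentence2/labels columns
```

Because the annotation is plain data rather than arbitrary preprocessing code, the column-name metadata stays inspectable and reusable even outside this sketch.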

Datasets


Introduced in the Paper:

Tasksource

Used in the Paper:

GLUE, MultiNLI, CoLA, BIG-bench

Results from the Paper


Task                      Dataset  Model                       Metric    Value   Global Rank
Linguistic Acceptability  CoLA     deberta-v3-base+tasksource  Accuracy  87.15%  #4

Methods


No methods listed for this paper.