Texts

The EMBO SourceData-NLP dataset (The SourceData-NLP dataset: integrating curation into scientific publishing for training large language models)

Introduced by Abreu-Vicente et al. in The SourceData-NLP dataset: integrating curation into scientific publishing for training large language models

We present the SourceData-NLP dataset produced through the routine curation of papers during the publication process. A unique feature of this dataset is its emphasis on the annotation of bioentities in figure legends. We annotate eight classes of biomedical entities (small molecules, gene products, subcellular components, cell lines, cell types, tissues, organisms, and diseases), their role in the experimental design, and the nature of the experimental method as an additional class. SourceData-NLP contains more than 620,000 annotated biomedical entities, curated from 18,689 figures in 3,223 papers in molecular and cell biology. We illustrate the dataset's usefulness by assessing BioLinkBERT and PubmedBERT, two transformers-based models, fine-tuned on the SourceData-NLP dataset for NER. We also introduce a novel context-dependent semantic task that infers whether an entity is the target of a controlled intervention or the object of measurement.

Homepage

Benchmarks

Add a new result Link an existing benchmark

Trend	Task	Dataset Variant	Best Model	Paper	Code
	NER	The EMBO SourceData-NLP dataset	BioLinkBERT-large

Papers

Paper	Code	Results	Date	Stars

The EMBO SourceData-NLP dataset (The SourceData-NLP dataset: integrating curation into scientific publishing for training large language models)

Benchmarks

Add a new result Link an existing benchmark

Papers

Dataset Loaders

Add Remove

Tasks

Similar Datasets

MedMentions

BioRED

SourceData-NLP

Usage

License

Modalities

Languages

The EMBO SourceData-NLP dataset (The SourceData-NLP dataset: integrating curation into scientific publishing for training large language models)

Benchmarks Edit Add a new result Link an existing benchmark

Papers

Dataset Loaders Edit Add Remove

Tasks Edit

Similar Datasets

MedMentions

BioRED

SourceData-NLP

Usage

License Edit

Modalities Edit

Languages Edit

Benchmarks

Add a new result Link an existing benchmark

Dataset Loaders

Add Remove

Tasks

License

Modalities

Languages