BabyLM is a dataset for small-scale language modeling, human language acquisition, low-resource NLP, and cognitive modeling. In partnership with CoNLL and CMCL, it provides a platform for pretraining approaches that use a limited-size corpus sourced from data inspired by the input to children. The task has three tracks, two of which restrict the training data to pre-released datasets of 10M and 100M words and are dedicated to exploring approaches such as architectural variations, self-supervised objectives, and curriculum learning. The final track restricts only the amount of text used, allowing innovation in the choice of the data, its domain, and even its modality (i.e., data from sources other than text is welcome).

Source: Call for Papers - The BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus
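
No official data loader is listed for the dataset, so the following is a minimal sketch of how one might stream the 10M-word ("strict-small") training split for pretraining. It assumes the corpus has been downloaded and unpacked as plain-text .train files under a local directory; the directory name, the file extension, and the whitespace-based word count are illustrative assumptions, not details confirmed by the call for papers.

    import pathlib

    # A minimal sketch, assuming the 10M-word ("strict-small") training split
    # has been unpacked as plain-text *.train files in a local directory.
    # The path below is hypothetical.
    DATA_DIR = pathlib.Path("babylm_data/babylm_10M")

    def iter_lines(data_dir):
        """Yield non-empty text lines from every .train file in the corpus."""
        for path in sorted(data_dir.glob("*.train")):
            with path.open(encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if line:
                        yield line

    # Whitespace splitting is only a rough proxy for the official word count,
    # but it is enough to sanity-check a track's data budget.
    total_words = sum(len(line.split()) for line in iter_lines(DATA_DIR))
    print(f"{total_words:,} whitespace-delimited words (10M-word track budget)")

Pointing DATA_DIR at the 100M-word split's directory gives the same sanity check for the larger track.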

License

  • Unknown
