TVC (TV show Caption) is a large-scale multimodal captioning dataset containing 261,490 caption descriptions paired with 108,965 short video moments. TVC is unique in that its captions may also describe dialogues/subtitles, whereas captions in other datasets describe only the visual content.
15 PAPERS • 1 BENCHMARK
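Since each TVC annotation pairs a free-text caption with a timestamped video moment, a record can be modeled as a small structure. This is a minimal sketch; the field names (`video_id`, `ts_start`, `ts_end`, `caption`) are hypothetical and not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class CaptionRecord:
    # Hypothetical record shape for a TVC-style annotation.
    video_id: str    # identifier of the source TV-show clip
    ts_start: float  # start of the captioned moment (seconds)
    ts_end: float    # end of the captioned moment (seconds)
    caption: str     # free-text description (visual content or dialogue)

def moment_duration(rec: CaptionRecord) -> float:
    """Length of the captioned video moment in seconds."""
    return rec.ts_end - rec.ts_start

rec = CaptionRecord("show_s01e01_seg02", 15.0, 22.5,
                    "Two characters talk over coffee.")
print(moment_duration(rec))  # 7.5
```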
Source: BARThez: a Skilled Pretrained French Sequence-to-Sequence Model
7 PAPERS • 3 BENCHMARKS
CLUECorpus2020 is a large-scale corpus that can be used directly for self-supervised learning, such as pre-training a language model, or for language generation. It contains 100 GB of raw text with 35 billion Chinese characters retrieved from Common Crawl.
5 PAPERS • NO BENCHMARKS YET
Wild-Time is a benchmark of 5 datasets that reflect temporal distribution shifts arising in a variety of real-world applications, including patient prognosis and news classification. On these datasets, 13 prior approaches are systematically benchmarked, including methods from domain generalization, continual learning, self-supervised learning, and ensemble learning.
4 PAPERS • NO BENCHMARKS YET
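The core idea behind a temporal-distribution-shift benchmark like Wild-Time is to split data chronologically rather than randomly: train on records up to a cutoff time and evaluate on later ones. The sketch below illustrates that splitting scheme generically; it is not Wild-Time's actual API, and the `(year, payload)` record shape is an assumption for illustration.

```python
from typing import List, Tuple

# Hypothetical record shape: (year, payload).
Record = Tuple[int, str]

def temporal_split(data: List[Record], cutoff: int):
    """Split records chronologically: train on years <= cutoff,
    test on strictly later years."""
    train = [r for r in data if r[0] <= cutoff]
    test = [r for r in data if r[0] > cutoff]
    return train, test

data = [(2015, "a"), (2017, "b"), (2019, "c"), (2021, "d")]
train, test = temporal_split(data, cutoff=2018)
print(len(train), len(test))  # 2 2
```

Unlike a random split, this evaluation measures how well a model trained on past data generalizes to the future, which is the distribution shift these datasets are designed to expose.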
Contains a large number of online videos together with their subtitles.
1 PAPER • NO BENCHMARKS YET