ParaPat (Parallel Corpus of Patents Abstracts)

Introduced by Soares et al. in ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts

ParaPat is a parallel corpus built from the open-access Google Patents dataset. It covers 74 language pairs and comprises more than 68 million sentences and 800 million tokens. For the 22 largest language pairs, sentences were aligned automatically with the Hunalign algorithm; the remaining pairs are aligned at the abstract (i.e. paragraph) level.

Source: ParaPat: The Multi-Million Sentences Parallel Corpus of Patents Abstracts
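
As a rough illustration of how such aligned data might be consumed, the sketch below reads source/target pairs from tab-separated lines. The file layout (one aligned unit per line, two tab-separated fields) is an assumption made for this example rather than something stated above, and the function name `read_parallel_pairs` and the sample sentence are hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def read_parallel_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs from tab-separated lines.

    Assumes one aligned unit per line: a sentence pair for the
    sentence-aligned language pairs, or an abstract pair otherwise.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip blank or malformed lines rather than guessing at their meaning.
        if len(fields) != 2:
            continue
        source, target = (field.strip() for field in fields)
        if source and target:
            yield source, target


# Hypothetical sample, not taken from the corpus.
sample = [
    "A method for sealing a container.\tUm método para vedar um recipiente.\n",
    "\n",
    "malformed line without a tab\n",
]
pairs = list(read_parallel_pairs(sample))
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, so a large file can be streamed without loading it into memory. The same reader works for sentence-aligned and abstract-aligned pairs, since only the granularity of each line differs.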

License


  • Unknown
