The Spoken Wikipedia Corpora

The Spoken Wikipedia Corpora (SWC) is a corpus of aligned spoken Wikipedia articles from the English, German, and Dutch Wikipedias. It has several notable characteristics:

  • hundreds of hours of aligned audio
  • from a diverse set of readers
  • about a diverse set of topics
  • in a well-researched textual genre
  • licensed under a free license (CC BY-SA 4.0)
  • annotations that can be mapped back to the original HTML
  • phoneme-level alignments
