OpenSpeaks Voice: Odia

Introduced by Panigrahi in Building a Public Domain Voice Database for Odia

OpenSpeaks Voice: Odia is a large speech dataset in Odia, a language of India. It is stewarded by Subhashish Panigrahi and hosted at the O Foundation. It currently comprises over 70,000 audio files released under the CC0 1.0 Universal Public Domain Dedication. Of these, 66,000 are recordings of words and phrases hosted on Wikimedia Commons, and the remaining 4,400 are recordings of sentences hosted on Mozilla Common Voice. The files on Wikimedia Commons were also released in 2023 on physical media as four DVD-ROMs titled OpenSpeaks Voice: Odia Volume I, OpenSpeaks Voice: Odia Volume II, OpenSpeaks Voice: Balesoria-Odia Volume I, and OpenSpeaks Voice: Balesoria-Odia Volume II. The dataset was built with Free/Libre and Open Source Software, primarily the web-based platforms Lingua Libre and Common Voice. Other tools used in the project include Kathabhidhana, which Panigrahi developed by forking Shrinivasan T's Voice Recorder for Tamil Wiktionary, as well as Spell4wiki and Audacity, among others. Over 64,000 files are in the standard spoken variety of Odia (Central Odia), and the remaining 6,300 are in Balesoria (Baleswari), the northern dialect of Odia. OpenSpeaks Voice: Balesoria-Odia Volume II was created by extracting words and phrases from Nani Ma, a Balesoria-Odia documentary short directed by Panigrahi. The files include transcriptions in Odia, which makes them usable for automatic speech recognition (ASR). All files are publicly available for ASR research and application building.
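
Because each recording comes with an Odia transcription, a first step for ASR work is typically to pair every audio file with its text in a manifest. The Python sketch below illustrates one way this might be done. It assumes, purely for illustration, that word and phrase recordings follow a filename layout of the form `LL-<language QID> (<language code>)-<speaker>-<transcription>.<ext>`. That layout, the helper names `parse_filename` and `build_manifest`, and the sample filenames (including the placeholder QID and language code) are assumptions and are not taken from the dataset description. Check the actual files before relying on it.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Assumed (hypothetical) filename layout, not confirmed by the dataset description:
#   LL-<language QID> (<language code>)-<speaker>-<transcription>.<ext>
FILENAME_PATTERN = re.compile(
    r"^LL-(?P<qid>Q\d+) \((?P<lang>[a-z]{2,3})\)"
    r"-(?P<speaker>[^-]+)-(?P<text>.+)\.(?P<ext>wav|ogg|mp3)$"
)


class Utterance(NamedTuple):
    """One manifest row: an audio file, its speaker, and its transcription."""
    filename: str
    speaker: str
    text: str


def parse_filename(filename: str) -> Optional[Utterance]:
    """Return an Utterance if the name matches the assumed layout, else None."""
    match = FILENAME_PATTERN.match(filename)
    if match is None:
        return None
    return Utterance(filename, match["speaker"], match["text"])


def build_manifest(filenames: Iterable[str]) -> List[Utterance]:
    """Keep only the files whose names could be parsed into a transcription."""
    parsed = (parse_filename(name) for name in filenames)
    return [utt for utt in parsed if utt is not None]


if __name__ == "__main__":
    # Hypothetical sample names; the QID and language code are placeholders.
    sample = [
        "LL-Q12345 (xx)-ExampleSpeaker-ଘର.wav",
        "notes.txt",
    ]
    for utt in build_manifest(sample):
        print(utt.filename, utt.speaker, utt.text, sep="\t")
```

A speaker name that itself contains a hyphen would be split incorrectly by this pattern, so a real pipeline would more likely read transcriptions from the hosting platform's metadata than from filenames.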

Source: OpenSpeaks
