BSTC (Baidu Speech Translation Corpus)

Introduced by Zhang et al. in BSTC: A Large-Scale Chinese-English Speech Translation Dataset

BSTC (Baidu Speech Translation Corpus) is a large-scale dataset for automatic simultaneous interpretation. BSTC version 1.0 contains 50 hours of real speeches in three parts: the audio files, the transcripts, and the translations. The corpus can be used to build automatic simultaneous interpretation systems. It is collected from Mandarin Chinese talks and reports covering science, technology, culture, economy, etc. The utterances in the talks and reports are carefully transcribed into Chinese text and then translated into English text. Sentence boundaries are determined by the English text rather than the Chinese text, analogous to previous related corpora (TED and the Translation Augmented LibriSpeech Corpus).

The corpus is divided into training, development, and test sets. In each set, there are three types of files, including:

1. Acoustic signal files, named baidu_XX.wav, where XX is the identification code. All signal files are encoded in Waveform Audio File Format (WAVE) from a mono recording, with a sample rate of 16 kHz and a bit resolution of 16 bits (2 bytes).

2. Description files, encoded in JSON format for each utterance, containing the description information for the corresponding acoustic signal file, such as the translation, transcript, duration, and offset.
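The audio format stated above (mono, 16 kHz, 16-bit) can be checked with Python's standard library before feeding files into a pipeline. The following is a minimal sketch, not part of the corpus tooling: the function names are hypothetical, and the JSON key names (`transcript`, `translation`, `duration`, `offset`) and the assumption that each description file holds a single top-level JSON object are guesses based on the field list above, so adjust them to the actual files.

```python
import json
import wave

# Expected audio parameters, taken from the corpus description.
EXPECTED_CHANNELS = 1          # mono recording
EXPECTED_SAMPLE_RATE = 16000   # 16 kHz
EXPECTED_SAMPLE_WIDTH = 2      # 16 bits = 2 bytes per sample


def wav_matches_spec(path):
    """Return True if the WAV file is mono, 16 kHz, and 16-bit."""
    with wave.open(str(path), "rb") as wav:
        return (
            wav.getnchannels() == EXPECTED_CHANNELS
            and wav.getframerate() == EXPECTED_SAMPLE_RATE
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH
        )


def wav_duration_seconds(path):
    """Compute the duration from the frame count and the sample rate."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def read_description(path, fields=("transcript", "translation", "duration", "offset")):
    """Load one JSON description file and keep only the named fields.

    Assumption: the file contains a single JSON object whose keys match
    `fields`. Keys that are absent are returned as None.
    """
    with open(path, encoding="utf-8") as handle:
        record = json.load(handle)
    return {name: record.get(name) for name in fields}
```

A file that fails `wav_matches_spec` would typically need resampling or downmixing before use. Comparing `wav_duration_seconds` against the duration recorded in the description file is one way to spot mismatched audio and metadata.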

Source: BSTC

Similar Datasets

CoVoST

License

Unknown

Modalities

Speech

Languages

Chinese
