BSTC (Baidu Speech Translation Corpus) is a large-scale dataset for automatic simultaneous interpretation. BSTC version 1.0 contains 50 hours of real speeches and consists of three parts: the audio files, the transcripts, and the translations. The corpus can be used to build automatic simultaneous interpretation systems. It is collected from Mandarin Chinese talks and reports covering science, technology, culture, economy, and other topics. The utterances in the talks and reports are carefully transcribed into Chinese text and then translated into English text. Sentence boundaries are determined by the English text rather than the Chinese text, which is analogous to previous related corpora (TED and the Translation Augmented LibriSpeech Corpus).
The corpus is divided into training, development, and test sets. In each set, there are three types of files: 1. Acoustic signal files, named baidu_XX.wav, where XX is the identification code. All signal files are encoded in Waveform Audio File Format (WAVE) from a mono recording, with a sample rate of 16 kHz and a bit resolution of 16 bits (2 bytes). 2. Description files, encoded in JSON format for each utterance, which contain the description information for the corresponding acoustic signal file, such as translation, transcript, duration, offset, and so on. Source: BSTC
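The file formats described above can be checked with a short script. The following is a minimal sketch using only the Python standard library; the function names `inspect_wav` and `load_description` are hypothetical helpers introduced for illustration, and no particular JSON schema is assumed beyond what the description states.

```python
import json
import wave

# Format stated in the corpus description: mono, 16 kHz, 16-bit samples.
EXPECTED_CHANNELS = 1
EXPECTED_SAMPLE_RATE = 16000
EXPECTED_SAMPLE_WIDTH = 2  # bytes per sample


def inspect_wav(path):
    """Return header information and the duration of one WAVE file."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        sample_width = wav.getsampwidth()
        frames = wav.getnframes()
    return {
        "channels": channels,
        "sample_rate": sample_rate,
        "sample_width": sample_width,
        "duration_seconds": frames / sample_rate,
        # True only if the file matches the stated recording format.
        "matches_spec": (
            channels == EXPECTED_CHANNELS
            and sample_rate == EXPECTED_SAMPLE_RATE
            and sample_width == EXPECTED_SAMPLE_WIDTH
        ),
    }


def load_description(path):
    """Load one JSON description file without assuming its exact schema."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

At the stated format, one second of audio occupies 16,000 samples × 2 bytes = 32,000 bytes, so 50 hours correspond to roughly 5.76 GB of raw sample data. Inspecting the keys of the object returned by `load_description` is a reasonable first step before relying on any specific field name.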