MuST-Cinema

Introduced by Karakanta et al. in MuST-Cinema: a Speech-to-Subtitles corpus

MuST-Cinema is a multilingual speech-to-subtitles corpus designed for building subtitle-oriented machine translation and speech translation systems. It comprises audio recordings of English TED Talks, automatically aligned at the sentence level with their manual transcriptions and translations.

MuST-Cinema was built by annotating MuST-C with subtitle breaks based on the original subtitle files. Special symbols have been inserted in the aligned sentences to mark subtitle breaks as follows:

<eob>: block break (a break between subtitle blocks)
<eol>: line break (a break between lines inside the same subtitle block)
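
As a rough illustration of how these symbols can be consumed, the following Python sketch splits an annotated sentence into subtitle blocks and lines, and also recovers the plain text. It assumes the symbols appear literally as `<eob>` and `<eol>` inside the sentence string. The helper names (`split_subtitles`, `strip_breaks`) and the sample sentence are hypothetical and are not taken from the corpus or its tooling.

```python
from typing import List

EOB = "<eob>"  # block break symbol, as described above
EOL = "<eol>"  # line break symbol, as described above


def split_subtitles(annotated: str) -> List[List[str]]:
    """Split an annotated sentence into subtitle blocks, each a list of lines.

    Hypothetical helper: assumes break symbols are embedded literally in the text.
    """
    blocks = []
    for block in annotated.split(EOB):
        # Within a block, <eol> separates the individual subtitle lines.
        lines = [line.strip() for line in block.split(EOL)]
        lines = [line for line in lines if line]  # drop empty fragments
        if lines:
            blocks.append(lines)
    return blocks


def strip_breaks(annotated: str) -> str:
    """Remove both break symbols and collapse whitespace to get plain text."""
    text = annotated.replace(EOB, " ").replace(EOL, " ")
    return " ".join(text.split())


# Invented sample sentence, for illustration only.
example = (
    "This is the first line <eol> and the second line. <eob> "
    "A new block starts here. <eob>"
)
blocks = split_subtitles(example)
plain = strip_breaks(example)
```

Representing each block as a list of lines keeps the two break levels distinct, so a block can later be rendered as one on-screen subtitle with its lines joined by newlines.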

Source: MuST-Cinema
