This project aimed to extend the test sets of the MuST-C speech translation (ST) corpus with new reference translations.
In this work, we contribute to such a line of inquiry by exploring the emergence of gender bias in Speech Translation (ST).
Simultaneous speech translation (SimulST) systems aim to generate their output with the lowest possible latency, which is normally computed in terms of Average Lagging (AL).
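Since AL recurs throughout these works as the latency metric, a minimal sketch of its text-level computation (following Ma et al., 2019) may help. The function and the wait-3 example below are illustrative assumptions, not any specific system's evaluation code; speech-level AL, as computed by tools like SimulEval, replaces token counts with time but follows the same logic.

```python
def average_lagging(delays, src_len, tgt_len):
    """Text-level Average Lagging (Ma et al., 2019).

    delays[t-1] = number of source tokens read before emitting
    target token t (1-indexed conceptually).
    """
    gamma = tgt_len / src_len  # target-to-source length ratio
    # tau: first target step whose delay covers the full source
    tau = next((t for t, d in enumerate(delays, start=1) if d >= src_len),
               len(delays))
    al = sum(delays[t - 1] - (t - 1) / gamma for t in range(1, tau + 1))
    return al / tau

# Example: a wait-3 policy on a 10-token source and 10-token target
delays = [min(3 + t, 10) for t in range(10)]  # 3, 4, ..., 10, 10, 10
print(average_lagging(delays, src_len=10, tgt_len=10))  # prints 3.0
```

As expected, a wait-k policy with a length ratio of 1 yields an AL equal to k.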
Recent work has shown that systems for speech translation (ST) -- similarly to automatic speech recognition (ASR) -- poorly handle person names.
The primary goal of FBK's systems submitted to the IWSLT 2022 offline and simultaneous speech translation tasks is to reduce model training costs without sacrificing translation quality.
In simultaneous speech translation (SimulST), finding the best trade-off between high translation quality and low latency is a challenging task.
Gender bias is largely recognized as a problematic phenomenon affecting language technologies, with recent studies underscoring that it might surface differently across languages.
Transformer-based models have gained increasing popularity, achieving state-of-the-art performance in many research fields, including speech translation.
Speech translation (ST) has lately received growing interest for the generation of subtitles without the need for an intermediate source language transcription and timing (i.e., captions).
Both knowledge distillation and the first fine-tuning step are carried out on manually segmented real and synthetic data, the latter being generated with an MT system trained on the available corpora.
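For readers unfamiliar with the distillation step mentioned above, here is a minimal sketch of word-level knowledge distillation (in the spirit of Kim and Rush, 2016), where a student is trained to match an MT teacher's token-level distributions. The function name, tensor shapes, and toy usage are assumptions for illustration, not the actual training code.

```python
import torch
import torch.nn.functional as F

def word_level_kd_loss(student_logits, teacher_logits, pad_mask):
    """KL divergence between teacher and student token distributions.

    student_logits, teacher_logits: (batch, tgt_len, vocab) tensors
    pad_mask: (batch, tgt_len) bool tensor, True on real (non-pad) tokens
    """
    log_p_student = F.log_softmax(student_logits, dim=-1)
    p_teacher = F.softmax(teacher_logits, dim=-1)
    # Per-token KL, summed over the vocabulary dimension.
    kl = F.kl_div(log_p_student, p_teacher, reduction="none").sum(-1)
    # Average over non-padding target positions only.
    return (kl * pad_mask).sum() / pad_mask.sum()

# Toy usage with random tensors standing in for model outputs:
student = torch.randn(2, 5, 100)
teacher = torch.randn(2, 5, 100)
mask = torch.ones(2, 5, dtype=torch.bool)
print(word_level_kd_loss(student, teacher, mask))
```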
Five years after the first published proofs of concept, direct approaches to speech translation (ST) are now competing with traditional cascade solutions.
In light of this finding, we propose a combined approach that preserves the overall translation quality of BPE, while leveraging the higher ability of character-based segmentation to properly translate gender.
The mismatch between the audio segmentation of training data and that seen at run-time is a major problem in direct speech translation.
Machine translation (MT) technology has facilitated our daily tasks by providing accessible shortcuts for gathering, elaborating and communicating information.
In particular, by translating speech audio data without intermediate transcription, direct ST models are able to leverage and preserve essential information present in the input (e.g., the speaker's vocal characteristics) that is otherwise lost in the cascade framework.
Direct speech translation (ST) has been shown to be a complex task requiring knowledge transfer from its sub-tasks: automatic speech recognition (ASR) and machine translation (MT).
Subword-level segmentation then became the state of the art in neural machine translation, as it produces shorter sequences that reduce training time while remaining superior to word-level models.
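To make the sequence-length argument concrete, below is a toy sketch of the BPE merge-learning loop (Sennrich et al., 2016). The corpus, merge count, and helper names are illustrative assumptions; real systems rely on optimized implementations such as subword-nmt or SentencePiece.

```python
from collections import Counter

def merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with its concatenation."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_bpe(corpus, num_merges):
    """Learn merge operations from a list of words."""
    # Start from the character-level segmentation of each word type.
    vocab = {tuple(word): freq for word, freq in Counter(corpus).items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for syms, freq in vocab.items():
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]  # most frequent adjacent pair
        merges.append(best)
        vocab = {tuple(merge(list(syms), best)): freq
                 for syms, freq in vocab.items()}
    return merges

corpus = ["lower", "lowest", "newer", "newest"] * 10
merges = learn_bpe(corpus, num_merges=4)
# Apply the learned merges: the character sequence shrinks at each step.
word = list("lowest")
for m in merges:
    word = merge(word, m)
print(word)  # e.g. ['lowe', 'st'] -- 6 characters reduced to 2 subwords
```

Each learned merge joins the most frequent adjacent symbol pair, so frequent words collapse into a few subwords while rare words fall back to characters, which is exactly why subword sequences are shorter than character-level ones.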
We show that our context-aware solution is more robust to VAD-segmented input, outperforming a strong base model and the fine-tuning on different VAD segmentations of an English-German test set by up to 4.25 BLEU points.
The test talks are provided in two versions: one contains the data already segmented with automatic tools and the other is the raw data without any segmentation.