Lingvo is a TensorFlow framework offering a complete solution for collaborative deep learning research, with a particular focus on sequence-to-sequence models.
Furthermore, we compare a syllable-based model and a context-independent phoneme (CI-phoneme) based model with the Transformer in Mandarin Chinese.
Specifically, in our previous work, we proposed a multistep visual adaptive training approach that improves the accuracy of an audio-based Automatic Speech Recognition (ASR) system.
We also investigate model complementarity: we find that we can improve WERs by up to 9% relative by rescoring N-best lists generated from a strong word-piece based baseline with either the phoneme or the grapheme model.
Ranked #32 on Speech Recognition on LibriSpeech test-clean
We present a recurrent encoder-decoder deep neural network architecture that directly translates speech in one language into text in another.
On How2 English-Portuguese speech translation, we reduce latency to 0.7 seconds (-84% rel.)