We also develop voices using the existing implementations of the aforementioned systems, and (2) we use these voices to generate sample audio for randomly chosen words, manually evaluate the generated audio, and produce audio for all WordNet words using the winning voice model.
Large pretrained models have seen enormous success in extractive summarization tasks.
This aid is based on modern pedagogical principles and is aligned with the learning objectives of school syllabi in India.
Online alignment in machine translation refers to the task of aligning a target word to a source word when the target sequence has only been partially decoded.
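As a rough illustration (our assumption, not necessarily this paper's method), online alignments can be read off a decoder's cross-attention: each newly decoded target token is aligned to the source position it attends to most. The `online_align` helper and the toy weights below are hypothetical.

```python
import numpy as np

def online_align(attention_row):
    """Align the newest target token to a source token by picking the
    most-attended source position from its cross-attention weights."""
    return int(np.argmax(attention_row))

# Toy weights over 4 source tokens for the target position just decoded:
attn = np.array([0.05, 0.10, 0.70, 0.15])
print(online_align(attn))  # -> 2
```

Because this uses only the attention computed so far, it works even when the rest of the target sequence has not been decoded yet.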
We focus on the audio-visual video parsing (AVVP) problem that involves detecting audio and visual event labels with temporal boundaries.
RNN-Transducer (RNN-T) models have become synonymous with streaming end-to-end ASR systems.
Post-editing in Automatic Speech Recognition (ASR) entails automatically correcting common and systematic errors produced by the ASR system.
We propose a subset selection approach using the recently proposed submodular mutual information functions, in which we identify a diverse set of utterances that match the target accent.
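As a minimal sketch of this idea (the objective and features here are toy stand-ins, not the paper's exact submodular mutual information functions), a facility-location-style greedy selection picks utterances that cover the target-accent set while avoiding redundancy:

```python
import numpy as np

def greedy_select(candidates, targets, k):
    """Greedily pick k candidate utterances whose (toy) feature rows
    best cover the target-accent rows: a facility-location-style
    objective summing, over targets, the max similarity to any
    selected candidate."""
    sims = candidates @ targets.T            # candidate-target similarities
    selected = []
    best_cover = np.zeros(targets.shape[0])  # current coverage per target
    for _ in range(k):
        gains = np.maximum(sims, best_cover).sum(axis=1) - best_cover.sum()
        gains[selected] = -np.inf            # never re-pick an utterance
        i = int(np.argmax(gains))
        selected.append(i)
        best_cover = np.maximum(best_cover, sims[i])
    return selected

# Candidates 0 and 1 each match one target direction; 2 is nearly
# redundant with 0, so a diverse pick is [0, 1].
cands = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
targets = np.array([[1.0, 0.0], [0.0, 1.0]])
print(greedy_select(cands, targets, k=2))  # -> [0, 1]
```

The greedy loop exploits diminishing marginal gains: once a target utterance is well covered, near-duplicates of the covering candidate stop contributing, which is what yields diversity.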
The original algorithm relies on computationally expensive data augmentation steps that involve perturbing the raw images and computing features for each perturbed image.
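The costly step can be pictured with a small sketch; `features` below is a hypothetical placeholder extractor, not the algorithm's actual one:

```python
import numpy as np

rng = np.random.default_rng(0)

def features(image):
    # Placeholder feature extractor: per-image mean and std.
    return np.array([image.mean(), image.std()])

def augmented_features(image, n_perturb=8, noise=0.05):
    """The expensive step: perturb the raw image n_perturb times and
    run the feature extractor on the original plus every copy."""
    feats = [features(image)]
    for _ in range(n_perturb):
        feats.append(features(image + rng.normal(0.0, noise, image.shape)))
    return np.stack(feats)

img = rng.random((8, 8))
print(augmented_features(img).shape)  # -> (9, 2)
```

The cost is n_perturb + 1 full feature computations per input, which is exactly what makes this augmentation step expensive at scale.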
In this work, we propose the use of bilingual intermediate pretraining as a reliable technique to derive large and consistent performance gains on three different NLP tasks using code-switched text.
Generating code-switched text is a problem of growing interest, especially given the scarcity of corpora containing large volumes of real code-switched text.
In this work, we present the first large-scale study of automatic speech recognition (ASR) in Sanskrit, with an emphasis on the impact of unit selection in Sanskrit ASR.
In this paper, we present a novel approach to the audio-visual video parsing (AVVP) task that demarcates events from a video separately for audio and visual modalities.
Ranked #1 on Event Detection on Audio Set
Towards this, we propose a method that generates naturalistic video and audio samples through the joint, correlated generation of both modalities.
Spoken language differs from written language in its style and structure.
1 code implementation • 1 Apr 2021 • Anuj Diwan, Rakesh Vaideeswaran, Sanket Shah, Ankita Singh, Srinivasa Raghavan, Shreya Khare, Vinit Unni, Saurabh Vyas, Akash Rajpuria, Chiranjeevi Yarra, Ashish Mittal, Prasanta Kumar Ghosh, Preethi Jyothi, Kalika Bali, Vivek Seshadri, Sunayana Sitaram, Samarth Bharadwaj, Jai Nanavati, Raoul Nanavati, Karthik Sankaranarayanan, Tejaswi Seeram, Basil Abraham
For this purpose, we provide a total of ~600 hours of transcribed speech data, comprising train and test sets, in these languages including two code-switched language pairs, Hindi-English and Bengali-English.
In response, we identify a key structural idiom in OKVQA, viz. S3 (select, substitute and search), and build a new dataset and challenge around it.
Video retrieval using natural language queries requires learning semantically meaningful joint embeddings between the text and the audio-visual input.
Ranked #1 on Video Retrieval on Charades-STA
We consider the task of personalizing ASR models while being constrained by a fixed budget on recording speaker-specific utterances.
A systematic comparison of these two approaches for end-to-end robust ASR has not been attempted before.
We present experiments on five different tasks and six different languages from the XTREME multilingual benchmark dataset.
Building Automatic Speech Recognition (ASR) systems for code-switched speech has recently gained renewed attention due to the widespread use of speech technologies in multilingual communities worldwide.
We use a state-of-the-art end-to-end ASR system, comprising convolutional and recurrent layers, that is trained on a large amount of US-accented English speech and evaluate the model on speech samples from seven different English accents.
We also make use of additional fluent text in the target language to help generate fluent translations.
Accordingly, we propose a novel coupling of an open-source accent-tuned local model with the black-box service where the output from the service guides frame-level inference in the local model.
We propose coupled training for encoder-decoder ASR models that acts on pairs of utterances corresponding to the same text spoken by speakers with different accents.
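A minimal sketch of such a coupled objective, assuming the two utterances' losses and encoder outputs have already been computed (the weight `lam` and the mean-squared tie term are illustrative choices, not the paper's exact formulation):

```python
import numpy as np

def coupled_loss(enc_native, enc_accented, ce_native, ce_accented, lam=0.5):
    """Combine the two utterances' ASR losses with a penalty that
    pulls the encoders' outputs for the same spoken text together."""
    tie = np.mean((enc_native - enc_accented) ** 2)
    return ce_native + ce_accented + lam * tie

# Identical text, maximally different (toy) encoder states:
print(coupled_loss(np.zeros(4), np.ones(4), ce_native=1.0, ce_accented=2.0))  # -> 3.5
```

The tie term rewards accent-invariant encodings of the same text, which is the intuition behind pairing utterances across accents.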
Neural language models (LMs) have been shown to benefit significantly from enhancing word vectors with subword-level information, especially for morphologically rich languages.
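A common way to add subword-level information, sketched here fastText-style with hypothetical helpers, is to build a word's vector from its character n-gram embeddings, so morphologically related words share parameters:

```python
import numpy as np

def char_ngrams(word, n=3):
    """Character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def word_vector(word, ngram_emb, rng, dim=4):
    """Build a word vector by averaging its character n-gram
    embeddings; unseen n-grams get a fresh random embedding."""
    vecs = []
    for g in char_ngrams(word):
        if g not in ngram_emb:
            ngram_emb[g] = rng.normal(size=dim)
        vecs.append(ngram_emb[g])
    return np.mean(vecs, axis=0)

emb, rng = {}, np.random.default_rng(0)
v1 = word_vector("walking", emb, rng)
v2 = word_vector("walked", emb, rng)  # shares "<wa", "wal", "alk" with v1
```

Because inflected forms share n-grams, their vectors overlap even if one form is rare, which is why the gains are largest for morphologically rich languages.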
While the lack of data adversely affects the performance of end-to-end models, we see promising improvements with MTL and balancing the corpus.
For a new language, such training instances are hard to obtain, making the QG problem even more challenging.
We analyze the performance of different sentiment classification models on syntactically complex inputs like A-but-B sentences.
We present CROSSGRAD, a method to use multi-domain training data to learn a classifier that generalizes to new domains.
Ranked #58 on Domain Generalization on PACS
In this paper, we propose the generation of accented speech using generative adversarial networks.
The problem of automatic accent identification is important for several applications like speaker profiling and recognition as well as for improving speech recognition systems.
Since code-switching is a blend of two or more different languages, a standard bilingual language model can be improved upon by using structures of the monolingual language models.
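One simple instance of exploiting monolingual structure, shown here with toy unigram tables rather than any model from the paper, is linear interpolation of two monolingual LM probabilities:

```python
def interpolated_prob(word, lm_en, lm_hi, lam=0.5):
    """Probability of a word under a linear interpolation of two
    monolingual (toy, unigram) language models."""
    return lam * lm_en.get(word, 0.0) + (1 - lam) * lm_hi.get(word, 0.0)

# Toy probability tables; "good" appears in both vocabularies.
lm_en = {"the": 0.6, "good": 0.4}
lm_hi = {"accha": 0.7, "good": 0.3}
print(interpolated_prob("good", lm_en, lm_hi))  # -> approximately 0.35
```

A word unseen by one monolingual model still gets probability mass from the other, which is the basic appeal of combining monolingual models for code-switched text.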
Mismatched transcriptions have been proposed as a means of acquiring probabilistic transcriptions from non-native speakers of a language. Prior work has demonstrated the value of these transcriptions by successfully adapting cross-lingual ASR systems for different target languages.
We evaluate our techniques using mismatched transcriptions for Cantonese speech acquired from native English and Mandarin speakers.