1 code implementation • 1 Apr 2024 • Simran Khanuja, Sathyanarayanan Ramamoorthy, Yueqi Song, Graham Neubig
First, we build three pipelines comprising state-of-the-art generative models to do the task.
1 code implementation • 3 Mar 2024 • Yueqi Song, Simran Khanuja, Graham Neubig
NLP models today strive for supporting multiple languages and modalities, improving accessibility for diverse users.
no code implementations • 10 Nov 2023 • Simran Khanuja, Srinivas Gowriraj, Lucio Dery, Graham Neubig
In this paper, we introduce DEMUX, a framework that prescribes the exact data-points to label from vast amounts of unlabelled multilingual data, having unknown degrees of overlap with the target set.
no code implementations • 25 May 2023 • Anubha Kabra, Emmy Liu, Simran Khanuja, Alham Fikri Aji, Genta Indra Winata, Samuel Cahyawijaya, Anuoluwapo Aremu, Perez Ogayo, Graham Neubig
Figurative language permeates human communication, but at the same time is relatively understudied in NLP.
no code implementations • 24 May 2023 • Yueqi Song, Catherine Cui, Simran Khanuja, PengFei Liu, Fahim Faisal, Alissa Ostapenko, Genta Indra Winata, Alham Fikri Aji, Samuel Cahyawijaya, Yulia Tsvetkov, Antonios Anastasopoulos, Graham Neubig
Despite the major advances in NLP, significant disparities in NLP system performance across languages still exist.
1 code implementation • 25 May 2022 • Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, Ankur Bapna
We introduce FLEURS, the Few-shot Learning Evaluation of Universal Representations of Speech benchmark.
no code implementations • 25 May 2022 • Simran Khanuja, Sebastian Ruder, Partha Talukdar
In order for NLP technology to be widely applicable, fair, and useful, it needs to serve a diverse set of speakers across the world's languages, be equitable, i.e., not unduly biased towards any particular language, and be inclusive of all users, particularly in low-resource settings where compute constraints are common.
no code implementations • 21 Mar 2022 • Alexis Conneau, Ankur Bapna, Yu Zhang, Min Ma, Patrick von Platen, Anton Lozhkov, Colin Cherry, Ye Jia, Clara Rivera, Mihir Kale, Daan van Esch, Vera Axelrod, Simran Khanuja, Jonathan H. Clark, Orhan Firat, Michael Auli, Sebastian Ruder, Jason Riesa, Melvin Johnson
Covering 102 languages from 10+ language families, 3 different domains and 4 task families, XTREME-S aims to simplify multilingual speech representation evaluation, as well as catalyze research in "universal" speech representation learning.
no code implementations • 3 Feb 2022 • Ankur Bapna, Colin Cherry, Yu Zhang, Ye Jia, Melvin Johnson, Yong Cheng, Simran Khanuja, Jason Riesa, Alexis Conneau
We present mSLAM, a multilingual Speech and LAnguage Model that learns cross-lingual cross-modal representations of speech and text by pre-training jointly on large amounts of unlabeled speech and text in multiple languages.
no code implementations • 5 Jun 2021 • Simran Khanuja, Melvin Johnson, Partha Talukdar
Pre-trained multilingual language models (LMs) have achieved state-of-the-art results in cross-lingual transfer, but they often lead to an inequitable representation of languages due to limited capacity, skewed pre-training data, and sub-optimal vocabularies.
1 code implementation • 19 Mar 2021 • Simran Khanuja, Diksha Bansal, Sarvesh Mehtani, Savya Khosla, Atreyee Dey, Balaji Gopalan, Dilip Kumar Margam, Pooja Aggarwal, Rajiv Teja Nagipogu, Shachi Dave, Shruti Gupta, Subhash Chandra Bose Gali, Vish Subramanian, Partha Talukdar
This can be explained by the fact that multilingual language models (LMs) are often trained on 100+ languages together, leading to a small representation of Indian (IN) languages in their vocabulary and training data.
no code implementations • 12 Nov 2020 • Sanket Shah, Satarupa Guha, Simran Khanuja, Sunayana Sitaram
Since no publicly available dataset exists for Spoken Term Detection in these languages, we create a new dataset using a publicly available TTS dataset.
no code implementations • ACL 2020 • Simran Khanuja, Sandipan Dandapat, Anirudh Srinivasan, Sunayana Sitaram, Monojit Choudhury
We present results on all these tasks using cross-lingual word embedding models and multilingual models.
no code implementations • 26 Apr 2020 • Simran Khanuja, Sandipan Dandapat, Anirudh Srinivasan, Sunayana Sitaram, Monojit Choudhury
We present results on all these tasks using cross-lingual word embedding models and multilingual models.
no code implementations • LREC 2020 • Simran Khanuja, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury
Code-mixing is the use of more than one language in the same conversation or utterance, and is prevalent in multilingual communities all over the world.
no code implementations • ICON 2019 • Pratik Joshi, Christain Barnes, Sebastin Santy, Simran Khanuja, Sanket Shah, Anirudh Srinivasan, Satwik Bhattamishra, Sunayana Sitaram, Monojit Choudhury, Kalika Bali
In this paper, we examine and analyze the challenges associated with developing and introducing language technologies to low-resource language communities.