no code implementations • SIGUL (LREC) 2022 • A. Seza Doğruöz, Sunayana Sitaram
There is a growing interest in building language technologies (LTs) for low resource languages (LRLs).
no code implementations • EACL (HumEval) 2021 • Shaily Bhatt, Rahul Jain, Sandipan Dandapat, Sunayana Sitaram
We conduct experiments for evaluating an offensive content detection system and use a data augmentation technique for improving the model using insights from Checklist.
no code implementations • 25 May 2025 • Dhruv Agarwal, Anya Shukla, Sunayana Sitaram, Aditya Vashistha
Large language models (LLMs) are used around the world but exhibit Western cultural tendencies.
1 code implementation • 26 Mar 2025 • Sunayana Sitaram, Adrian de Wynter, Isobel McCrum, Qilong Gu, Si-Qing Chen
Misgendering is the act of referring to someone by a gender that does not match their chosen identity.
1 code implementation • 6 Mar 2025 • Chenglong Wang, Haoyu Tang, Xiyuan Yang, Yueqi Xie, Jina Suh, Sunayana Sitaram, Junming Huang, Yu Xie, Zhaoya Gong, Xing Xie, Fangzhao Wu
In this paper, we explore inequalities in new knowledge learning by LLMs across different languages and four key dimensions: effectiveness, transferability, prioritization, and robustness.
no code implementations • 21 Oct 2024 • Divyanshu Aggarwal, Sankarshan Damle, Navin Goyal, Satya Lokam, Sunayana Sitaram
A common challenge in adapting Large Language Models (LLMs) is enabling them to learn new languages over time without hampering the model's performance on languages in which the model is already proficient (usually English).
no code implementations • 21 Oct 2024 • Divyanshu Aggarwal, Ashutosh Sathe, Sunayana Sitaram
Prior works have shown that encoder-only models such as BERT or XLM-RoBERTa show impressive cross lingual transfer of their capabilities from English to other languages.
no code implementations • 21 Oct 2024 • Sanchit Ahuja, Varun Gumma, Sunayana Sitaram
Benchmark contamination refers to the presence of test datasets in Large Language Model (LLM) pre-training or post-training data.
no code implementations • 17 Oct 2024 • Varun Gumma, Anandhita Raghunath, Mohit Jain, Sunayana Sitaram
Assessing the capabilities and limitations of large language models (LLMs) has garnered significant interest, yet the evaluation of multiple models in real-world scenarios remains rare.
no code implementations • 14 Oct 2024 • Hemant Yadav, Rajiv Ratn Shah, Sunayana Sitaram
Given the orthogonal nature of other and content information, attempting to optimize both within a single embedding results in suboptimal solutions.
no code implementations • 20 Aug 2024 • Hemant Yadav, Sunayana Sitaram, Rajiv Ratn Shah
Current self-supervised learning (SSL) methods such as HuBERT are very good at modeling the content information present in speech.
no code implementations • 13 Jul 2024 • Sanchit Ahuja, Kumar Tanmay, Hardik Hansrajbhai Chauhan, Barun Patra, Kriti Aggarwal, Luciano del Corro, Arindam Mitra, Tejas Indulal Dhamecha, Ahmed Awadallah, Monojit Choudhury, Vishrav Chaudhary, Sunayana Sitaram
In order to address this, we introduce a novel recipe for creating a multilingual synthetic instruction tuning dataset, sPhinX, which is created by selectively translating instruction response pairs from English into 50 languages.
no code implementations • 4 Jul 2024 • Ashutosh Sathe, Divyanshu Aggarwal, Sunayana Sitaram
Prior research has demonstrated noticeable performance gains through the use of probabilistic tokenizations, an approach that involves employing multiple tokenizations of the same input string during the training phase of a language model.
no code implementations • 4 Jul 2024 • Florian Schneider, Sunayana Sitaram
Since the release of ChatGPT, the field of Natural Language Processing has experienced rapid advancements, particularly in Large Language Models (LLMs) and their multimodal counterparts, Large Multimodal Models (LMMs).
1 code implementation • 22 Jun 2024 • Shangbin Feng, Weijia Shi, Yike Wang, Wenxuan Ding, Orevaoghene Ahia, Shuyue Stella Li, Vidhisha Balachandran, Sunayana Sitaram, Yulia Tsvetkov
Multilingual LLMs often have knowledge disparities across languages, with larger gaps in under-resourced languages.
no code implementations • 21 Jun 2024 • Ishaan Watts, Varun Gumma, Aditya Yadavalli, Vivek Seshadri, Manohar Swaminathan, Sunayana Sitaram
We find that humans and LLMs agree fairly well in the pairwise setting but the agreement drops for direct assessment evaluation especially for languages such as Bengali and Odia.
no code implementations • 17 Jun 2024 • Sagnik Mukherjee, Muhammad Farid Adilazuarda, Sunayana Sitaram, Kalika Bali, Alham Fikri Aji, Monojit Choudhury
We observe that all models except GPT-4 show significant variations in their responses on both kinds of datasets for both kinds of prompts, casting doubt on the robustness of the culturally-conditioned prompting as a method for eliciting cultural bias in models or as an alignment strategy.
no code implementations • 9 Jun 2024 • Hemant Yadav, Sunayana Sitaram, Rajiv Ratn Shah
In recent years, self-supervised pre-training methods have gained significant traction in learning high-level information from raw speech.
no code implementations • 1 Jun 2024 • Millicent Ochieng, Varun Gumma, Sunayana Sitaram, Jindong Wang, Vishrav Chaudhary, Keshet Ronen, Kalika Bali, Jacki O'Neill
The deployment of Large Language Models (LLMs) in real-world applications presents both opportunities and challenges, particularly in multilingual and code-mixed communication settings.
no code implementations • 28 May 2024 • Somnath Kumar, Vaibhav Balloli, Mercy Ranjit, Kabir Ahuja, Tanuja Ganu, Sunayana Sitaram, Kalika Bali, Akshay Nambi
Second, we introduce a new hybrid approach that synergizes LLM Retrieval Augmented Generation (RAG) with multilingual embeddings and achieves improved multilingual task performance.
no code implementations • 2 Apr 2024 • Rishav Hada, Varun Gumma, Mohamed Ahmed, Kalika Bali, Sunayana Sitaram
This dataset is created specifically to evaluate LLM-based evaluators, which we refer to as meta-evaluation (METAL).
1 code implementation • 1 Mar 2024 • Tanmay Rajore, Nishanth Chandran, Sunayana Sitaram, Divya Gupta, Rahul Sharma, Kashish Mittal, Manohar Swaminathan
For scenarios where the model weights need to be kept private, we describe solutions from confidential computing and cryptography that can aid in private benchmarking.
no code implementations • 23 Feb 2024 • Agrima Seth, Sanchit Ahuja, Kalika Bali, Sunayana Sitaram
Generative models are increasingly being used in various applications, such as text generation, commonsense reasoning, and question-answering.
no code implementations • 21 Feb 2024 • Ashutosh Sathe, Prachi Jain, Sunayana Sitaram
In this study, we propose a unified framework for systematically evaluating gender, race, and age biases in VLMs with respect to professions.
no code implementations • 12 Feb 2024 • Prachi Jain, Ashutosh Sathe, Varun Gumma, Kabir Ahuja, Sunayana Sitaram
In this work, we aim to modularly debias a pretrained language model across multiple dimensions.
2 code implementations • 9 Feb 2024 • Cheng Li, Mengzhou Chen, Jindong Wang, Sunayana Sitaram, Xing Xie
Since multilingual cultural data are often expensive to collect, existing efforts handle this by prompt engineering or culture-specific pre-training.
no code implementations • 15 Jan 2024 • Divyanshu Aggarwal, Ashutosh Sathe, Ishaan Watts, Sunayana Sitaram
Prior work on multilingual evaluation has shown that there is a large gap between the performance of LLMs on English and other languages.
no code implementations • 13 Nov 2023 • Sanchit Ahuja, Divyanshu Aggarwal, Varun Gumma, Ishaan Watts, Ashutosh Sathe, Millicent Ochieng, Rishav Hada, Prachi Jain, Maxamed Axmed, Kalika Bali, Sunayana Sitaram
We also perform a study on data contamination and find that several models are likely to be contaminated with multilingual evaluation benchmarks, necessitating approaches to detect and handle contamination while assessing the multilingual performance of LLMs.
no code implementations • 31 Oct 2023 • A. Seza Doğruöz, Sunayana Sitaram, Zheng-Xin Yong
Multilingualism is widespread around the world and code-switching (CSW) is a common practice among different language pairs/tuples across locations and regions.
1 code implementation • 8 Oct 2023 • Hemant Yadav, Erica Cooper, Junichi Yamagishi, Sunayana Sitaram, Rajiv Ratn Shah
That is, the partial rank similarity (PRS) is measured rather than the individual MOS values, as with the L1 loss.
no code implementations • 14 Sep 2023 • Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, Sunayana Sitaram
Large Language Models (LLMs) excel in various Natural Language Processing (NLP) tasks, yet their evaluation, particularly in languages beyond the top 20, remains inadequate due to the limitations of existing benchmarks and metrics.
no code implementations • 4 Jul 2023 • Aniket Vashishtha, Kabir Ahuja, Sunayana Sitaram
While understanding and removing gender biases in language models has been a long-standing problem in Natural Language Processing, prior research work has primarily been limited to English.
no code implementations • 28 May 2023 • Somnath Kumar, Vaibhav Balloli, Mercy Ranjit, Kabir Ahuja, Sunayana Sitaram, Kalika Bali, Tanuja Ganu, Akshay Nambi
Large language models (LLMs) have revolutionized various domains but still struggle with non-Latin scripts and low-resource languages.
1 code implementation • 22 Mar 2023 • Kabir Ahuja, Harshita Diddee, Rishav Hada, Millicent Ochieng, Krithika Ramesh, Prachi Jain, Akshay Nambi, Tanuja Ganu, Sameer Segal, Maxamed Axmed, Kalika Bali, Sunayana Sitaram
Most studies on generative LLMs have been restricted to English and it is unclear how capable these models are at understanding and generating text in other languages.
no code implementations • 4 Mar 2023 • Shanu Kumar, Abbaraju Soujanya, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury
Zero-shot cross-lingual transfer is promising; however, it has been shown to be sub-optimal, with inferior transfer performance across low-resource languages.
no code implementations • 24 Feb 2023 • Krithika Ramesh, Sunayana Sitaram, Monojit Choudhury
With language models becoming increasingly ubiquitous, it has become essential to address their inequitable treatment of diverse demographic groups and factors.
no code implementations • ACL 2021 • A. Seza Doğruöz, Sunayana Sitaram, Barbara E. Bullock, Almeida Jacqueline Toribio
To fill this gap, we offer a survey of code-switching (C-S) covering the literature in linguistics with a reflection on the key issues in language technologies.
no code implementations • 22 Nov 2022 • Injy Hamed, Amir Hussein, Oumnia Chellah, Shammur Chowdhury, Hamdy Mubarak, Sunayana Sitaram, Nizar Habash, Ahmed Ali
Code-switching poses a number of challenges and opportunities for multilingual automatic speech recognition.
1 code implementation • 21 Oct 2022 • Kabir Ahuja, Sunayana Sitaram, Sandipan Dandapat, Monojit Choudhury
Massively Multilingual Language Models (MMLMs) have recently gained popularity due to their surprising effectiveness in cross-lingual transfer.
no code implementations • nlppower (ACL) 2022 • Kabir Ahuja, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury
Although recent Massively Multilingual Language Models (MMLMs) like mBERT and XLMR support around 100 languages, most existing multilingual NLP benchmarks provide evaluation data in only a handful of these languages with little linguistic diversity.
no code implementations • 24 Mar 2022 • Karthikeyan K, Shaily Bhatt, Pankaj Singh, Somak Aditya, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury
We compare the TEA CheckLists with CheckLists created with different levels of human intervention.
no code implementations • LREC 2022 • Hemant Yadav, Sunayana Sitaram
Although Automatic Speech Recognition (ASR) systems have achieved human-like performance for a few languages, the majority of the world's languages do not have usable systems due to the lack of large speech datasets to train these models.
no code implementations • 17 Oct 2021 • Anirudh Srinivasan, Sunayana Sitaram, Tanuja Ganu, Sandipan Dandapat, Kalika Bali, Monojit Choudhury
Recent advancements in NLP have given us models like mBERT and XLMR that can serve over 100 languages.
no code implementations • ICON 2021 • Shaily Bhatt, Poonam Goyal, Sandipan Dandapat, Monojit Choudhury, Sunayana Sitaram
Deep Contextual Language Models (LMs) like ELMO, BERT, and their successors dominate the landscape of Natural Language Processing due to their ability to scale across multiple tasks rapidly by pre-training a single model, followed by task-specific fine-tuning.
1 code implementation • EACL 2021 • Mohd Sanad Zaki Rizvi, Anirudh Srinivasan, Tanuja Ganu, Monojit Choudhury, Sunayana Sitaram
Code-mixing is common in multilingual communities around the world, and processing it is challenging due to the lack of labeled and unlabeled data.
1 code implementation • 1 Apr 2021 • Anuj Diwan, Rakesh Vaideeswaran, Sanket Shah, Ankita Singh, Srinivasa Raghavan, Shreya Khare, Vinit Unni, Saurabh Vyas, Akash Rajpuria, Chiranjeevi Yarra, Ashish Mittal, Prasanta Kumar Ghosh, Preethi Jyothi, Kalika Bali, Vivek Seshadri, Sunayana Sitaram, Samarth Bharadwaj, Jai Nanavati, Raoul Nanavati, Karthik Sankaranarayanan, Tejaswi Seeram, Basil Abraham
For this purpose, we provide a total of ~600 hours of transcribed speech data, comprising train and test sets, in these languages including two code-switched language pairs, Hindi-English and Bengali-English.
1 code implementation • 25 Nov 2020 • Hemant Yadav, Atul Anshuman Singh, Rachit Mittal, Sunayana Sitaram, Yi Yu, Rajiv Ratn Shah
Training a robust system, e.g., Speech to Text (STT), requires large datasets.
no code implementations • 12 Nov 2020 • Sanket Shah, Satarupa Guha, Simran Khanuja, Sunayana Sitaram
Since no publicly available dataset exists for Spoken Term Detection in these languages, we create a new dataset using a publicly available TTS dataset.
no code implementations • ACL 2020 • Simran Khanuja, Sandipan Dandapat, Anirudh Srinivasan, Sunayana Sitaram, Monojit Choudhury
We present results on all these tasks using cross-lingual word embedding models and multilingual models.
no code implementations • 9 Jun 2020 • Gurunath Reddy Madhumani, Sanket Shah, Basil Abraham, Vikas Joshi, Sunayana Sitaram
Recently, we showed that monolingual ASR systems fine-tuned on code-switched data deteriorate in performance on monolingual speech recognition, which is not desirable as ASR systems deployed in multilingual scenarios should recognize both monolingual and code-switched speech with high accuracy.
no code implementations • 1 Jun 2020 • Sanket Shah, Basil Abraham, Gurunath Reddy M, Sunayana Sitaram, Vikas Joshi
In this work, we show that fine-tuning ASR models on code-switched speech harms performance on monolingual speech.
no code implementations • LREC 2020 • Basil Abraham, Danish Goel, Divya Siddarth, Kalika Bali, Manu Chopra, Monojit Choudhury, Pratik Joshi, Preethi Jyothi, Sunayana Sitaram, Vivek Seshadri
Unfortunately, collecting labelled speech data in any language is an expensive and resource-intensive task.
no code implementations • 26 Apr 2020 • Simran Khanuja, Sandipan Dandapat, Anirudh Srinivasan, Sunayana Sitaram, Monojit Choudhury
We present results on all these tasks using cross-lingual word embedding models and multilingual models.
no code implementations • LREC 2020 • Simran Khanuja, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury
Code-mixing is the use of more than one language in the same conversation or utterance, and is prevalent in multilingual communities all over the world.
no code implementations • ICON 2019 • Pratik Joshi, Christain Barnes, Sebastin Santy, Simran Khanuja, Sanket Shah, Anirudh Srinivasan, Satwik Bhattamishra, Sunayana Sitaram, Monojit Choudhury, Kalika Bali
In this paper, we examine and analyze the challenges associated with developing and introducing language technologies to low-resource language communities.
no code implementations • WS 2019 • Sanket Shah, Pratik Joshi, Sebastin Santy, Sunayana Sitaram
Code-switching refers to the alternation of two or more languages in a conversation or utterance and is common in multilingual communities across the world.
no code implementations • 22 Jun 2019 • Brij Mohan Lal Srivastava, Basil Abraham, Sunayana Sitaram, Rupesh Mehta, Preethi Jyothi
While the lack of data adversely affects the performance of end-to-end models, we see promising improvements with MTL and balancing the corpus.
no code implementations • 25 Mar 2019 • Sunayana Sitaram, Khyathi Raghavi Chandu, Sai Krishna Rallabandi, Alan W. Black
Code-switching, the alternation of languages within a conversation or utterance, is a common communicative phenomenon that occurs in multilingual communities across the world.
no code implementations • EMNLP 2018 • Adithya Pratapa, Monojit Choudhury, Sunayana Sitaram
We compare three existing bilingual word embedding approaches, and a novel approach of training skip-grams on synthetic code-mixed text generated through linguistic models of code-mixing, on two tasks - sentiment analysis and POS tagging for code-mixed text.
no code implementations • ACL 2018 • Adithya Pratapa, Gayatri Bhat, Monojit Choudhury, Sunayana Sitaram, Sandipan Dandapat, Kalika Bali
Training language models for Code-mixed (CM) language is known to be a difficult problem because of lack of data compounded by the increased confusability due to the presence of more than one language.
no code implementations • WS 2018 • Sai Krishna Rallabandi, Sunayana Sitaram, Alan W. Black
We hypothesize that it may be useful for an ASR system to be able to first detect the switching style of a particular utterance from acoustics, and then use specialized language models or other adaptation techniques for decoding the speech.
no code implementations • WS 2018 • Sunit Sivasankaran, Brij Mohan Lal Srivastava, Sunayana Sitaram, Kalika Bali, Monojit Choudhury
Though the best performance gain of 1.2% WER was observed with manually merged phones, we show experimentally that the manual phone merge is not optimal.
no code implementations • NAACL 2016 • Yulia Tsvetkov, Sunayana Sitaram, Manaal Faruqui, Guillaume Lample, Patrick Littell, David Mortensen, Alan W. Black, Lori Levin, Chris Dyer
We introduce polyglot language models, recurrent neural network models trained to predict symbol sequences in many different languages using shared representations of symbols and conditioning on typological information about the language to be predicted.
no code implementations • LREC 2016 • Sunayana Sitaram, Alan W. Black
Most Text to Speech (TTS) systems today assume that the input text is in a single language and is written in the same language that the text needs to be synthesized in.