no code implementations • ICON 2021 • Saujas Vaduguru, Partho Sarthi, Monojit Choudhury, Dipti Sharma
Learning linguistic generalizations from only a few examples is a challenging task.
no code implementations • EACL (AdaptNLP) 2021 • Sebastin Santy, Anirudh Srinivasan, Monojit Choudhury
Models such as mBERT and XLMR have shown success in solving Code-Mixed NLP tasks even though they were not exposed to such text during pretraining.
no code implementations • Findings (NAACL) 2022 • Shanu Kumar, Sandipan Dandapat, Monojit Choudhury
Few-shot transfer often shows substantial gain over zero-shot transfer (Lauscher et al., 2020), which is a practically useful trade-off between fully supervised and unsupervised learning approaches for multilingual pretrained model-based systems.
no code implementations • Findings (ACL) 2022 • Prashant Kodali, Anmol Goel, Monojit Choudhury, Manish Shrivastava, Ponnurangam Kumaraguru
Code mixing is the linguistic phenomenon where bilingual speakers tend to switch between two or more languages in conversations.
no code implementations • WNUT (ACL) 2021 • Adithya Pratapa, Monojit Choudhury
Code-mixed text generation systems have found applications in many downstream tasks, including speech recognition, translation and dialogue.
no code implementations • LREC 2022 • Ishani Mondal, Kalika Bali, Mohit Jain, Monojit Choudhury, Jacki O’Neill, Millicent Ochieng, Kagonya Awori, Keshet Ronen
In this work, we conduct a quantitative linguistic analysis of the language usage patterns of multilingual peer supporters in two health-focused WhatsApp groups in Kenya comprising youth living with HIV.
no code implementations • EMNLP (LAW, DMR) 2021 • Ishani Mondal, Kalika Bali, Mohit Jain, Monojit Choudhury, Ashish Sharma, Evans Gitau, Jacki O’Neill, Kagonya Awori, Sarah Gitau
In recent years, remote digital healthcare using online chats has gained momentum, especially in the Global South.
1 code implementation • 25 Nov 2024 • Ashmal Vayani, Dinura Dissanayake, Hasindri Watawana, Noor Ahsan, Nevasini Sasikumar, Omkar Thawakar, Henok Biadglign Ademtew, Yahya Hmaiti, Amandeep Kumar, Kartik Kuckreja, Mykola Maslych, Wafa Al Ghallabi, Mihail Mihaylov, Chao Qin, Abdelrahman M Shaker, Mike Zhang, Mahardika Krisna Ihsani, Amiel Esplana, Monil Gokani, Shachar Mirkin, Harsh Singh, Ashay Srivastava, Endre Hamerlik, Fathinah Asma Izzati, Fadillah Adamsyah Maani, Sebastian Cavada, Jenny Chim, Rohit Gupta, Sanjay Manjunath, Kamila Zhumakhanova, Feno Heriniaina Rabevohitra, Azril Amirudin, Muhammad Ridzuan, Daniya Kareem, Ketan More, Kunyang Li, Pramesh Shakya, Muhammad Saad, Amirpouya Ghasemaghaei, Amirbek Djanibekov, Dilshod Azizov, Branislava Jankovic, Naman Bhatia, Alvaro Cabrera, Johan Obando-Ceron, Olympiah Otieno, Fabian Farestam, Muztoba Rabbani, Sanoojan Baliah, Santosh Sanjeev, Abduragim Shtanchaev, Maheen Fatima, Thao Nguyen, Amrin Kareem, Toluwani Aremu, Nathan Xavier, Amit Bhatkal, Hawau Toyin, Aman Chadha, Hisham Cholakkal, Rao Muhammad Anwer, Michael Felsberg, Jorma Laaksonen, Thamar Solorio, Monojit Choudhury, Ivan Laptev, Mubarak Shah, Salman Khan, Fahad Khan
In pursuit of culturally diverse global multimodal models, our proposed All Languages Matter Benchmark (ALM-bench) represents the largest and most comprehensive effort to date for evaluating LMMs across 100 languages.
no code implementations • 28 Oct 2024 • Hellina Hailu Nigatu, Atnafu Lambebo Tonja, Benjamin Rosman, Thamar Solorio, Monojit Choudhury
The disparity in the languages commonly studied in Natural Language Processing (NLP) is typically reflected by referring to languages as low vs high-resourced.
no code implementations • 18 Jun 2024 • Abhinav Rao, Monojit Choudhury, Somak Aditya
We introduce two paradoxes concerning the jailbreaking of foundation models: first, it is impossible to construct a perfect jailbreak classifier; second, a weaker model cannot consistently detect whether a stronger (in a Pareto-dominant sense) model is jailbroken or not.
no code implementations • 17 Jun 2024 • Sagnik Mukherjee, Muhammad Farid Adilazuarda, Sunayana Sitaram, Kalika Bali, Alham Fikri Aji, Monojit Choudhury
We observe that all models except GPT-4 show significant variations in their responses on both kinds of datasets for both kinds of prompts, casting doubt on the robustness of the culturally-conditioned prompting as a method for eliciting cultural bias in models or as an alignment strategy.
no code implementations • 28 May 2024 • Andrew H. Lee, Sina J. Semnani, Galo Castillo-López, Gäel de Chalendar, Monojit Choudhury, Ashna Dua, Kapil Rajesh Kavitha, Sungkyun Kim, Prashant Kodali, Ponnurangam Kumaraguru, Alexis Lombard, Mehrad Moradshahi, Gihyun Park, Nasredine Semmar, Jiwon Seo, Tianhao Shen, Manish Shrivastava, Deyi Xiong, Monica S. Lam
However, after manual evaluation of the validation set, we find that by correcting gold label errors and improving the dataset annotation schema, GPT-4 with our prompts can achieve (1) 89.6%-96.8% accuracy in DST, and (2) more than 99% correct response generation across different languages.
no code implementations • 9 May 2024 • Prashant Kodali, Anmol Goel, Likhith Asapu, Vamshi Krishna Bonagiri, Anirudh Govil, Monojit Choudhury, Manish Shrivastava, Ponnurangam Kumaraguru
To this end, we construct Cline - a dataset containing human acceptability judgements for English-Hindi (en-hi) code-mixed text.
1 code implementation • 8 May 2024 • Preetam Prabhu Srikar Dammu, Hayoung Jung, Anjali Singh, Monojit Choudhury, Tanushree Mitra
Large language models (LLMs) have emerged as an integral part of modern societies, powering user-facing applications such as personal assistants and enterprise applications like recruitment tools.
no code implementations • 29 Apr 2024 • Utkarsh Agarwal, Kumar Tanmay, Aditi Khandelwal, Monojit Choudhury
Ethical reasoning is a crucial skill for Large Language Models (LLMs).
1 code implementation • 5 Mar 2024 • Muhammad Farid Adilazuarda, Sagnik Mukherjee, Pradhyumna Lavania, Siddhant Singh, Alham Fikri Aji, Jacki O'Neill, Ashutosh Modi, Monojit Choudhury
We present a survey of more than 90 recent papers that aim to study cultural representation and inclusion in large language models (LLMs).
no code implementations • 3 Feb 2024 • Aditi Khandelwal, Utkarsh Agarwal, Kumar Tanmay, Monojit Choudhury
This paper explores the moral judgment and moral reasoning abilities exhibited by Large Language Models (LLMs) across languages through the Defining Issues Test.
1 code implementation • 14 Dec 2023 • Navreet Kaur, Monojit Choudhury, Danish Pruthi
As corporations rush to integrate large language models (LLMs) into their search offerings, it is critical that they provide factually accurate information that is robust to any presuppositions that a user may express.
no code implementations • 11 Oct 2023 • Abhinav Rao, Aditi Khandelwal, Kumar Tanmay, Utkarsh Agarwal, Monojit Choudhury
In this position paper, we argue that instead of morally aligning LLMs to a specific set of ethical principles, we should infuse generic ethical reasoning capabilities into them so that they can handle value pluralism at a global scale.
no code implementations • 23 Sep 2023 • Kumar Tanmay, Aditi Khandelwal, Utkarsh Agarwal, Monojit Choudhury
In this study, we measure the moral reasoning ability of LLMs using the Defining Issues Test - a psychometric instrument developed for measuring the moral development stage of a person according to Kohlberg's Cognitive Moral Development Model.
no code implementations • 14 Sep 2023 • Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, Sunayana Sitaram
Large Language Models (LLMs) excel in various Natural Language Processing (NLP) tasks, yet their evaluation, particularly in languages beyond the top 20, remains inadequate due to the limitations of existing benchmarks and metrics.
1 code implementation • 30 Jun 2023 • Mehrad Moradshahi, Tianhao Shen, Kalika Bali, Monojit Choudhury, Gaël de Chalendar, Anmol Goel, Sungkyun Kim, Prashant Kodali, Ponnurangam Kumaraguru, Nasredine Semmar, Sina J. Semnani, Jiwon Seo, Vivek Seshadri, Manish Shrivastava, Michael Sun, Aditya Yadavalli, Chaobin You, Deyi Xiong, Monica S. Lam
We create a new multilingual benchmark, X-RiSAWOZ, by translating the Chinese RiSAWOZ to 4 languages: English, French, Hindi, Korean; and a code-mixed English-Hindi language.
1 code implementation • 24 May 2023 • Abhinav Rao, Sachin Vashistha, Atharva Naik, Somak Aditya, Monojit Choudhury
Recent explorations with commercial Large Language Models (LLMs) have shown that non-expert users can jailbreak LLMs by simply manipulating their prompts, resulting in degenerate output behavior, privacy and security breaches, offensive outputs, and violations of content regulator policies.
1 code implementation • 23 May 2023 • Chenxi Whitehouse, Monojit Choudhury, Alham Fikri Aji
This paper explores the potential of leveraging Large Language Models (LLMs) for data augmentation in multilingual commonsense reasoning datasets where the available training data is extremely limited.
no code implementations • 23 May 2023 • Kriti Aggarwal, Aditi Khandelwal, Kumar Tanmay, Owais Mohammed Khan, Qiang Liu, Monojit Choudhury, Hardik Hansrajbhai Chauhan, Subhojit Som, Vishrav Chaudhary, Saurabh Tiwary
Visual document understanding is a complex task that involves analyzing both the text and the visual elements in document images.
Ranked #1 on Visual Question Answering (VQA) on DeepForm
no code implementations • 4 Mar 2023 • Shanu Kumar, Abbaraju Soujanya, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury
Zero-shot cross-lingual transfer is promising; however, it has been shown to be sub-optimal, with inferior transfer performance across low-resource languages.
no code implementations • 24 Feb 2023 • Krithika Ramesh, Sunayana Sitaram, Monojit Choudhury
With language models becoming increasingly ubiquitous, it has become essential to address their inequitable treatment of diverse demographic groups and factors.
1 code implementation • 27 Oct 2022 • Harshita Diddee, Sandipan Dandapat, Monojit Choudhury, Tanuja Ganu, Kalika Bali
Leveraging shared learning through Massively Multilingual Models, state-of-the-art machine translation models are often able to adapt to the paucity of data for low-resource languages.
1 code implementation • 21 Oct 2022 • Kabir Ahuja, Sunayana Sitaram, Sandipan Dandapat, Monojit Choudhury
Massively Multilingual Language Models (MMLMs) have recently gained popularity due to their surprising effectiveness in cross-lingual transfer.
no code implementations • 31 Aug 2022 • Deepanway Ghosal, Somak Aditya, Monojit Choudhury
The Natural Language Inference (NLI) task often requires reasoning over multiple steps to reach the conclusion.
no code implementations • 30 Jun 2022 • Shanu Kumar, Sandipan Dandapat, Monojit Choudhury
Few-shot transfer often shows substantial gain over zero-shot transfer (Lauscher et al., 2020), which is a practically useful trade-off between fully supervised and unsupervised learning approaches for multilingual pretrained model-based systems.
no code implementations • ACL 2022 • Kabir Ahuja, Shanu Kumar, Sandipan Dandapat, Monojit Choudhury
Massively Multilingual Transformer based Language Models have been observed to be surprisingly effective on zero-shot transfer across languages, though the performance varies from language to language depending on the pivot language(s) used for fine-tuning.
no code implementations • NAACL 2022 • Kabir Ahuja, Monojit Choudhury, Sandipan Dandapat
Borrowing ideas from {\em Production functions} in micro-economics, in this paper we introduce a framework to systematically evaluate the performance and cost trade-offs between machine-translated and manually-created labelled data for task-specific fine-tuning of massively multilingual language models.
no code implementations • nlppower (ACL) 2022 • Kabir Ahuja, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury
Although recent Massively Multilingual Language Models (MMLMs) like mBERT and XLMR support around 100 languages, most existing multilingual NLP benchmarks provide evaluation data in only a handful of these languages with little linguistic diversity.
no code implementations • COLING 2022 • Ishani Mondal, Kabir Ahuja, Mohit Jain, Jacki O’Neill, Kalika Bali, Monojit Choudhury
The COVID-19 pandemic has brought out both the best and worst of language technology (LT).
no code implementations • 24 Mar 2022 • Karthikeyan K, Shaily Bhatt, Pankaj Singh, Somak Aditya, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury
We compare the TEA CheckLists with CheckLists created with different levels of human intervention.
2 code implementations • LREC 2022 • Shamsuddeen Hassan Muhammad, David Ifeoluwa Adelani, Sebastian Ruder, Ibrahim Said Ahmad, Idris Abdulmumin, Bello Shehu Bello, Monojit Choudhury, Chris Chinenye Emezue, Saheed Salahudeen Abdullahi, Anuoluwapo Aremu, Alipio Jeorge, Pavel Brazdil
We introduce the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria (Hausa, Igbo, Nigerian-Pidgin, and Yorùbá) consisting of around 30,000 annotated tweets per language (and 14,000 for Nigerian-Pidgin), including a significant fraction of code-mixed tweets.
no code implementations • 4 Dec 2021 • Ishan Tarunesh, Somak Aditya, Monojit Choudhury
Natural Language Inference (NLI) is considered a representative task to test natural language understanding (NLU).
no code implementations • 17 Oct 2021 • Anirudh Srinivasan, Sunayana Sitaram, Tanuja Ganu, Sandipan Dandapat, Kalika Bali, Monojit Choudhury
Recent advancements in NLP have given us models like mBERT and XLMR that can serve over 100 languages.
no code implementations • 14 Oct 2021 • Namrata Mukhija, Monojit Choudhury, Kalika Bali
Development of speech and language technology for social good (LT4SG), especially those targeted at the welfare of marginalized communities and speakers of low-resource and under-served languages, has been a prominent theme of research within NLP, Speech, and the AI communities.
1 code implementation • EMNLP (MRL) 2021 • Karthikeyan K, Aalok Sathe, Somak Aditya, Monojit Choudhury
Multilingual language models achieve impressive zero-shot accuracies in many languages in complex tasks such as Natural Language Inference (NLI).
no code implementations • ICON 2021 • Shaily Bhatt, Poonam Goyal, Sandipan Dandapat, Monojit Choudhury, Sunayana Sitaram
Deep Contextual Language Models (LMs) like ELMO, BERT, and their successors dominate the landscape of Natural Language Processing due to their ability to scale across multiple tasks rapidly by pre-training a single model, followed by task-specific fine-tuning.
1 code implementation • 15 Jul 2021 • Ishan Tarunesh, Somak Aditya, Monojit Choudhury
The recent state-of-the-art natural language understanding (NLU) systems often behave unpredictably, failing on simpler reasoning examples.
1 code implementation • ACL (SIGMORPHON) 2021 • Saujas Vaduguru, Aalok Sathe, Monojit Choudhury, Dipti Misra Sharma
Neural models excel at extracting statistical patterns from large amounts of data, but struggle to learn patterns or reason about language from only a few examples.
no code implementations • Findings (ACL) 2021 • Sebastin Santy, Anku Rani, Monojit Choudhury
Ethical aspects of research in language technologies have received much attention recently.
1 code implementation • EACL 2021 • Mohd Sanad Zaki Rizvi, Anirudh Srinivasan, Tanuja Ganu, Monojit Choudhury, Sunayana Sitaram
Code-mixing is common in multilingual communities around the world, and processing it is challenging due to the lack of labeled and unlabeled data.
1 code implementation • CONLL 2020 • Pratik Joshi, Somak Aditya, Aalok Sathe, Monojit Choudhury
Pre-trained Transformer-based neural architectures have consistently achieved state-of-the-art performance in the Natural Language Inference (NLI) task.
no code implementations • ACL 2020 • Simran Khanuja, Sandipan Dandapat, Anirudh Srinivasan, Sunayana Sitaram, Monojit Choudhury
We present results on all these tasks using cross-lingual word embedding models and multilingual models.
no code implementations • LREC 2020 • Anirudh Srinivasan, Sandipan Dandapat, Monojit Choudhury
In this paper, we explore the methods of obtaining parse trees of code-mixed sentences and analyse the obtained trees.
no code implementations • LREC 2020 • Basil Abraham, Danish Goel, Divya Siddarth, Kalika Bali, Manu Chopra, Monojit Choudhury, Pratik Joshi, Preethi Jyoti, Sunayana Sitaram, Vivek Seshadri
Unfortunately, collecting labelled speech data in any language is an expensive and resource-intensive task.
no code implementations • LREC 2020 • Abhishek Srivastava, Kalika Bali, Monojit Choudhury
Our analysis shows that both intra-sentential and inter-sentential script-mixing are present on Twitter and show different behavior in different contexts.
no code implementations • 26 Apr 2020 • Simran Khanuja, Sandipan Dandapat, Anirudh Srinivasan, Sunayana Sitaram, Monojit Choudhury
We present results on all these tasks using cross-lingual word embedding models and multilingual models.
1 code implementation • ACL 2020 • Pratik Joshi, Sebastin Santy, Amar Budhiraja, Kalika Bali, Monojit Choudhury
Language technologies contribute to promoting multilingualism and linguistic diversity around the world.
no code implementations • LREC 2020 • Simran Khanuja, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury
Code-mixing is the use of more than one language in the same conversation or utterance, and is prevalent in multilingual communities all over the world.
no code implementations • ICON 2019 • Pratik Joshi, Christain Barnes, Sebastin Santy, Simran Khanuja, Sanket Shah, Anirudh Srinivasan, Satwik Bhattamishra, Sunayana Sitaram, Monojit Choudhury, Kalika Bali
In this paper, we examine and analyze the challenges associated with developing and introducing language technologies to low-resource language communities.
1 code implementation • IJCNLP 2019 • Sebastin Santy, Sandipan Dandapat, Monojit Choudhury, Kalika Bali
In this paper, we demonstrate an Interactive Machine Translation interface, that assists human translators with on-the-fly hints and suggestions.
no code implementations • EMNLP 2018 • Adithya Pratapa, Monojit Choudhury, Sunayana Sitaram
We compare three existing bilingual word embedding approaches, and a novel approach of training skip-grams on synthetic code-mixed text generated through linguistic models of code-mixing, on two tasks - sentiment analysis and POS tagging for code-mixed text.
no code implementations • ACL 2018 • Adithya Pratapa, Gayatri Bhat, Monojit Choudhury, Sunayana Sitaram, Sandipan Dandapat, Kalika Bali
Training language models for Code-mixed (CM) language is known to be a difficult problem because of lack of data compounded by the increased confusability due to the presence of more than one language.
no code implementations • WS 2018 • Anshul Bawa, Monojit Choudhury, Kalika Bali
We find that the saliency or markedness of a language in context directly affects the degree of accommodation observed.
no code implementations • WS 2018 • Sunit Sivasankaran, Brij Mohan Lal Srivastava, Sunayana Sitaram, Kalika Bali, Monojit Choudhury
Though the best performance gain of 1.2% WER was observed with manually merged phones, we show experimentally that the manual phone merge is not optimal.
no code implementations • ICLR 2018 • Rahul Anand Sharma, Navin Goyal, Monojit Choudhury, Praneeth Netrapalli
This paper explores the simplicity of learned neural networks under various settings: learned on real vs random data, varying size/architecture and using large minibatch size vs small minibatch size.
no code implementations • EMNLP 2017 • Jasabanta Patro, Bidisha Samanta, Saurabh Singh, Abhipsa Basu, Prithwish Mukherjee, Monojit Choudhury, Animesh Mukherjee
Based on this likeliness estimate we asked annotators to re-annotate the language tags of foreign words in predominantly native contexts.
no code implementations • 25 Jul 2017 • Jasabanta Patro, Bidisha Samanta, Saurabh Singh, Abhipsa Basu, Prithwish Mukherjee, Monojit Choudhury, Animesh Mukherjee
Based on this likeliness estimate we asked annotators to re-annotate the language tags of foreign words in predominantly native contexts.
no code implementations • ACL 2017 • Shruti Rijhwani, Royal Sequiera, Monojit Choudhury, Kalika Bali, Chandra Shekhar Maddila
Word-level language detection is necessary for analyzing code-switched text, where multiple languages could be mixed within a sentence.
no code implementations • 15 Mar 2017 • Jasabanta Patro, Bidisha Samanta, Saurabh Singh, Prithwish Mukherjee, Monojit Choudhury, Animesh Mukherjee
We first propose a context-based clustering method to sample a set of candidate words from the social media data. Next, we propose three novel and similar metrics based on the usage of these words by the users in different tweets; these metrics were used to score and rank the candidate words indicating their borrowed likeliness.
1 code implementation • 14 Dec 2016 • Gayatri Bhat, Monojit Choudhury, Kalika Bali
We make one of the first attempts to build working models for intra-sentential code-switching based on the Equivalence-Constraint (Poplack 1980) and Matrix-Language (Myers-Scotton 1993) theories.
no code implementations • LREC 2016 • Rafiya Begum, Kalika Bali, Monojit Choudhury, Koustav Rudra, Niloy Ganguly
Code-Switching (CS) between two languages is extremely common in communities with societal multilingualism where speakers switch between two or more languages when interacting with each other.
no code implementations • LREC 2012 • K Saravanan, Monojit Choudhury, Raghavendra Udupa, A. Kumaran
Named Entities (NEs) that occur in natural language text are important especially due to the advent of social media, and they play a critical role in the development of many natural language technologies.
no code implementations • LREC 2012 • Kanika Gupta, Monojit Choudhury, Kalika Bali
This paper describes a method to mine Hindi-English transliteration pairs from online Hindi song lyrics.