Search Results for author: Monojit Choudhury

Found 83 papers, 18 papers with code

BERTologiCoMix: How does Code-Mixing interact with Multilingual BERT?

no code implementations EACL (AdaptNLP) 2021 Sebastin Santy, Anirudh Srinivasan, Monojit Choudhury

Models such as mBERT and XLMR have shown success in solving Code-Mixed NLP tasks even though they were not exposed to such text during pretraining.

"Diversity and Uncertainty in Moderation" are the Key to Data Selection for Multilingual Few-shot Transfer

no code implementations Findings (NAACL) 2022 Shanu Kumar, Sandipan Dandapat, Monojit Choudhury

Few-shot transfer often shows substantial gain over zero-shot transfer (Lauscher et al., 2020), which is a practically useful trade-off between fully supervised and unsupervised learning approaches for multilingual pretrained model-based systems.

Diversity Language Modelling +3

Comparing Grammatical Theories of Code-Mixing

no code implementations WNUT (ACL) 2021 Adithya Pratapa, Monojit Choudhury

Code-mixed text generation systems have found applications in many downstream tasks, including speech recognition, translation and dialogue.

speech-recognition Speech Recognition +2

Language Patterns and Behaviour of the Peer Supporters in Multilingual Healthcare Conversational Forums

no code implementations LREC 2022 Ishani Mondal, Kalika Bali, Mohit Jain, Monojit Choudhury, Jacki O’Neill, Millicent Ochieng, Kagnoya Awori, Keshet Ronen

In this work, we conduct a quantitative linguistic analysis of the language usage patterns of multilingual peer supporters in two health-focused WhatsApp groups in Kenya comprising youth living with HIV.

All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages

1 code implementation 25 Nov 2024 Ashmal Vayani, Dinura Dissanayake, Hasindri Watawana, Noor Ahsan, Nevasini Sasikumar, Omkar Thawakar, Henok Biadglign Ademtew, Yahya Hmaiti, Amandeep Kumar, Kartik Kuckreja, Mykola Maslych, Wafa Al Ghallabi, Mihail Mihaylov, Chao Qin, Abdelrahman M Shaker, Mike Zhang, Mahardika Krisna Ihsani, Amiel Esplana, Monil Gokani, Shachar Mirkin, Harsh Singh, Ashay Srivastava, Endre Hamerlik, Fathinah Asma Izzati, Fadillah Adamsyah Maani, Sebastian Cavada, Jenny Chim, Rohit Gupta, Sanjay Manjunath, Kamila Zhumakhanova, Feno Heriniaina Rabevohitra, Azril Amirudin, Muhammad Ridzuan, Daniya Kareem, Ketan More, Kunyang Li, Pramesh Shakya, Muhammad Saad, Amirpouya Ghasemaghaei, Amirbek Djanibekov, Dilshod Azizov, Branislava Jankovic, Naman Bhatia, Alvaro Cabrera, Johan Obando-Ceron, Olympiah Otieno, Fabian Farestam, Muztoba Rabbani, Sanoojan Baliah, Santosh Sanjeev, Abduragim Shtanchaev, Maheen Fatima, Thao Nguyen, Amrin Kareem, Toluwani Aremu, Nathan Xavier, Amit Bhatkal, Hawau Toyin, Aman Chadha, Hisham Cholakkal, Rao Muhammad Anwer, Michael Felsberg, Jorma Laaksonen, Thamar Solorio, Monojit Choudhury, Ivan Laptev, Mubarak Shah, Salman Khan, Fahad Khan

In pursuit of culturally diverse global multimodal models, our proposed All Languages Matter Benchmark (ALM-bench) represents the largest and most comprehensive effort to date for evaluating LMMs across 100 languages.

Long Question Answer Multiple-choice +2

The Zeno's Paradox of 'Low-Resource' Languages

no code implementations 28 Oct 2024 Hellina Hailu Nigatu, Atnafu Lambebo Tonja, Benjamin Rosman, Thamar Solorio, Monojit Choudhury

The disparity in the languages commonly studied in Natural Language Processing (NLP) is typically reflected by referring to languages as low vs high-resourced.

[WIP] Jailbreak Paradox: The Achilles' Heel of LLMs

no code implementations 18 Jun 2024 Abhinav Rao, Monojit Choudhury, Somak Aditya

We introduce two paradoxes concerning jailbreak of foundation models: First, it is impossible to construct a perfect jailbreak classifier, and second, a weaker model cannot consistently detect whether a stronger (in a Pareto-dominant sense) model is jailbroken or not.

Cultural Conditioning or Placebo? On the Effectiveness of Socio-Demographic Prompting

no code implementations 17 Jun 2024 Sagnik Mukherjee, Muhammad Farid Adilazuarda, Sunayana Sitaram, Kalika Bali, Alham Fikri Aji, Monojit Choudhury

We observe that all models except GPT-4 show significant variations in their responses on both kinds of datasets for both kinds of prompts, casting doubt on the robustness of culturally-conditioned prompting as a method for eliciting cultural bias in models or as an alignment strategy.

Ethics MMLU

Benchmarks Underestimate the Readiness of Multi-lingual Dialogue Agents

no code implementations 28 May 2024 Andrew H. Lee, Sina J. Semnani, Galo Castillo-López, Gaël de Chalendar, Monojit Choudhury, Ashna Dua, Kapil Rajesh Kavitha, Sungkyun Kim, Prashant Kodali, Ponnurangam Kumaraguru, Alexis Lombard, Mehrad Moradshahi, Gihyun Park, Nasredine Semmar, Jiwon Seo, Tianhao Shen, Manish Shrivastava, Deyi Xiong, Monica S. Lam

However, after manual evaluation of the validation set, we find that by correcting gold label errors and improving dataset annotation schema, GPT-4 with our prompts can achieve (1) 89.6%-96.8% accuracy in DST, and (2) more than 99% correct response generation across different languages.

Dialogue State Tracking In-Context Learning +1

"They are uncultured": Unveiling Covert Harms and Social Threats in LLM Generated Conversations

1 code implementation 8 May 2024 Preetam Prabhu Srikar Dammu, Hayoung Jung, Anjali Singh, Monojit Choudhury, Tanushree Mitra

Large language models (LLMs) have emerged as an integral part of modern societies, powering user-facing applications such as personal assistants and enterprise applications like recruitment tools.

Towards Measuring and Modeling "Culture" in LLMs: A Survey

1 code implementation 5 Mar 2024 Muhammad Farid Adilazuarda, Sagnik Mukherjee, Pradhyumna Lavania, Siddhant Singh, Alham Fikri Aji, Jacki O'Neill, Ashutosh Modi, Monojit Choudhury

We present a survey of more than 90 recent papers that aim to study cultural representation and inclusion in large language models (LLMs).

Survey

Do Moral Judgment and Reasoning Capability of LLMs Change with Language? A Study using the Multilingual Defining Issues Test

no code implementations 3 Feb 2024 Aditi Khandelwal, Utkarsh Agarwal, Kumar Tanmay, Monojit Choudhury

This paper explores the moral judgment and moral reasoning abilities exhibited by Large Language Models (LLMs) across languages through the Defining Issues Test.

Evaluating Large Language Models for Health-related Queries with Presuppositions

1 code implementation 14 Dec 2023 Navreet Kaur, Monojit Choudhury, Danish Pruthi

As corporations rush to integrate large language models (LLMs) to their search offerings, it is critical that they provide factually accurate information that is robust to any presuppositions that a user may express.

Ethical Reasoning over Moral Alignment: A Case and Framework for In-Context Ethical Policies in LLMs

no code implementations 11 Oct 2023 Abhinav Rao, Aditi Khandelwal, Kumar Tanmay, Utkarsh Agarwal, Monojit Choudhury

In this position paper, we argue that instead of morally aligning LLMs to a specific set of ethical principles, we should infuse generic ethical reasoning capabilities into them so that they can handle value pluralism at a global scale.

Ethics Position

Probing the Moral Development of Large Language Models through Defining Issues Test

no code implementations 23 Sep 2023 Kumar Tanmay, Aditi Khandelwal, Utkarsh Agarwal, Monojit Choudhury

In this study, we measure the moral reasoning ability of LLMs using the Defining Issues Test - a psychometric instrument developed for measuring the moral development stage of a person according to Kohlberg's Cognitive Moral Development Model.

Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?

no code implementations 14 Sep 2023 Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, Sunayana Sitaram

Large Language Models (LLMs) excel in various Natural Language Processing (NLP) tasks, yet their evaluation, particularly in languages beyond the top 20, remains inadequate due to the limitations of existing benchmarks and metrics.

Language Modelling Large Language Model +2

Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks

1 code implementation 24 May 2023 Abhinav Rao, Sachin Vashistha, Atharva Naik, Somak Aditya, Monojit Choudhury

Recent explorations with commercial Large Language Models (LLMs) have shown that non-expert users can jailbreak LLMs by simply manipulating their prompts, resulting in degenerate output behavior, privacy and security breaches, offensive outputs, and violations of content regulator policies.

LLM-powered Data Augmentation for Enhanced Cross-lingual Performance

1 code implementation 23 May 2023 Chenxi Whitehouse, Monojit Choudhury, Alham Fikri Aji

This paper explores the potential of leveraging Large Language Models (LLMs) for data augmentation in multilingual commonsense reasoning datasets where the available training data is extremely limited.

Data Augmentation

DiTTO: A Feature Representation Imitation Approach for Improving Cross-Lingual Transfer

no code implementations 4 Mar 2023 Shanu Kumar, Abbaraju Soujanya, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury

Zero-shot cross-lingual transfer is promising; however, it has been shown to be sub-optimal, with inferior transfer performance across low-resource languages.

Zero-Shot Cross-Lingual Transfer

Fairness in Language Models Beyond English: Gaps and Challenges

no code implementations 24 Feb 2023 Krithika Ramesh, Sunayana Sitaram, Monojit Choudhury

With language models becoming increasingly ubiquitous, it has become essential to address their inequitable treatment of diverse demographic groups and factors.

Fairness

Too Brittle To Touch: Comparing the Stability of Quantization and Distillation Towards Developing Lightweight Low-Resource MT Models

1 code implementation 27 Oct 2022 Harshita Diddee, Sandipan Dandapat, Monojit Choudhury, Tanuja Ganu, Kalika Bali

Leveraging shared learning through Massively Multilingual Models, state-of-the-art machine translation models are often able to adapt to the paucity of data for low-resource languages.

Knowledge Distillation Machine Translation +1

On the Calibration of Massively Multilingual Language Models

1 code implementation 21 Oct 2022 Kabir Ahuja, Sunayana Sitaram, Sandipan Dandapat, Monojit Choudhury

Massively Multilingual Language Models (MMLMs) have recently gained popularity due to their surprising effectiveness in cross-lingual transfer.

Cross-Lingual Transfer
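
For readers unfamiliar with the notion of calibration studied in this paper, a standard way to quantify it is the Expected Calibration Error (ECE). The sketch below is a generic, minimal ECE computation, not the paper's evaluation code; the confidence scores and correctness labels are assumed inputs.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Generic ECE: bin predictions by confidence and compare the
    average confidence to the empirical accuracy in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece

# Example: model confidences on a cross-lingual test set, with 1/0
# marking whether each prediction was correct.
print(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [1, 1, 0, 1]))
```

A well-calibrated model would have an ECE close to zero, i.e. its confidence on a prediction roughly matches the chance that the prediction is right.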

Generating Intermediate Steps for NLI with Next-Step Supervision

no code implementations 31 Aug 2022 Deepanway Ghosal, Somak Aditya, Monojit Choudhury

The Natural Language Inference (NLI) task often requires reasoning over multiple steps to reach the conclusion.

Data Augmentation Natural Language Inference

"Diversity and Uncertainty in Moderation" are the Key to Data Selection for Multilingual Few-shot Transfer

no code implementations 30 Jun 2022 Shanu Kumar, Sandipan Dandapat, Monojit Choudhury

Few-shot transfer often shows substantial gain over zero-shot transfer (Lauscher et al., 2020), which is a practically useful trade-off between fully supervised and unsupervised learning approaches for multilingual pretrained model-based systems.

Diversity Language Modelling +3

Multi Task Learning For Zero Shot Performance Prediction of Multilingual Models

no code implementations ACL 2022 Kabir Ahuja, Shanu Kumar, Sandipan Dandapat, Monojit Choudhury

Massively Multilingual Transformer based Language Models have been observed to be surprisingly effective on zero-shot transfer across languages, though the performance varies from language to language depending on the pivot language(s) used for fine-tuning.

feature selection Multi-Task Learning
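
Performance prediction of this kind amounts to fitting a regressor from language- and task-level features to the observed zero-shot accuracy. The following is a minimal, generic sketch with scikit-learn; the feature set, numbers, and single-task regressor are illustrative assumptions and do not reproduce the paper's multi-task setup.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical features per (pivot, target) language pair:
# [log pretraining data size of target, syntactic distance, same-script flag]
X = np.array([
    [9.2, 0.31, 1.0],
    [6.5, 0.58, 0.0],
    [7.8, 0.44, 1.0],
    [5.1, 0.72, 0.0],
])
# Observed zero-shot accuracies for those pairs (made-up numbers).
y = np.array([0.81, 0.55, 0.68, 0.42])

model = GradientBoostingRegressor(random_state=0).fit(X, y)
# Predict expected zero-shot performance for an unseen target language.
print(model.predict([[6.9, 0.50, 1.0]]))
```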

On the Economics of Multilingual Few-shot Learning: Modeling the Cost-Performance Trade-offs of Machine Translated and Manual Data

no code implementations NAACL 2022 Kabir Ahuja, Monojit Choudhury, Sandipan Dandapat

Borrowing ideas from production functions in micro-economics, in this paper we introduce a framework to systematically evaluate the performance and cost trade-offs between machine-translated and manually-created labelled data for task-specific fine-tuning of massively multilingual language models.

Few-Shot Learning Machine Translation +1
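
As a rough illustration of the production-function analogy (an assumed Cobb-Douglas-style form, not necessarily the one used in the paper), performance can be modelled as output produced from manually-labelled and machine-translated data under a budget constraint:

```latex
% Illustrative, assumed production function: performance P produced from
% n_m manually-labelled examples and n_t machine-translated examples.
P(n_m, n_t) = A \, n_m^{\alpha} \, n_t^{\beta}, \qquad 0 < \alpha, \beta < 1
% Cost-performance trade-off: maximise P under a labelling budget B,
% with per-example costs c_m (manual) > c_t (machine translation).
\max_{n_m, n_t} \; P(n_m, n_t) \quad \text{s.t.} \quad c_m n_m + c_t n_t \le B
```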

Beyond Static Models and Test Sets: Benchmarking the Potential of Pre-trained Models Across Tasks and Languages

no code implementations nlppower (ACL) 2022 Kabir Ahuja, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury

Although recent Massively Multilingual Language Models (MMLMs) like mBERT and XLMR support around 100 languages, most existing multilingual NLP benchmarks provide evaluation data in only a handful of these languages with little linguistic diversity.

Benchmarking Diversity +2

NaijaSenti: A Nigerian Twitter Sentiment Corpus for Multilingual Sentiment Analysis

2 code implementations LREC 2022 Shamsuddeen Hassan Muhammad, David Ifeoluwa Adelani, Sebastian Ruder, Ibrahim Said Ahmad, Idris Abdulmumin, Bello Shehu Bello, Monojit Choudhury, Chris Chinenye Emezue, Saheed Salahudeen Abdullahi, Anuoluwapo Aremu, Alipio Jeorge, Pavel Brazdil

We introduce the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria (Hausa, Igbo, Nigerian-Pidgin, and Yorùbá), consisting of around 30,000 annotated tweets per language (and 14,000 for Nigerian-Pidgin), including a significant fraction of code-mixed tweets.

Sentiment Analysis

Predicting the Performance of Multilingual NLP Models

no code implementations 17 Oct 2021 Anirudh Srinivasan, Sunayana Sitaram, Tanuja Ganu, Sandipan Dandapat, Kalika Bali, Monojit Choudhury

Recent advancements in NLP have given us models like mBERT and XLMR that can serve over 100 languages.

Multilingual NLP

Designing Language Technologies for Social Good: The Road not Taken

no code implementations 14 Oct 2021 Namrata Mukhija, Monojit Choudhury, Kalika Bali

Development of speech and language technology for social good (LT4SG), especially those targeted at the welfare of marginalized communities and speakers of low-resource and under-served languages, has been a prominent theme of research within NLP, Speech, and the AI communities.

Ethics

Analyzing the Effects of Reasoning Types on Cross-Lingual Transfer Performance

1 code implementation EMNLP (MRL) 2021 Karthikeyan K, Aalok Sathe, Somak Aditya, Monojit Choudhury

Multilingual language models achieve impressive zero-shot accuracies in many languages in complex tasks such as Natural Language Inference (NLI).

Cross-Lingual Transfer Natural Language Inference

On the Universality of Deep Contextual Language Models

no code implementations ICON 2021 Shaily Bhatt, Poonam Goyal, Sandipan Dandapat, Monojit Choudhury, Sunayana Sitaram

Deep Contextual Language Models (LMs) like ELMo, BERT, and their successors dominate the landscape of Natural Language Processing due to their ability to scale across multiple tasks rapidly by pre-training a single model, followed by task-specific fine-tuning.

XLM-R Zero-Shot Cross-Lingual Transfer

Trusting RoBERTa over BERT: Insights from CheckListing the Natural Language Inference Task

1 code implementation 15 Jul 2021 Ishan Tarunesh, Somak Aditya, Monojit Choudhury

The recent state-of-the-art natural language understanding (NLU) systems often behave unpredictably, failing on simpler reasoning examples.

Natural Language Inference Natural Language Understanding

Sample-efficient Linguistic Generalizations through Program Synthesis: Experiments with Phonology Problems

1 code implementation ACL (SIGMORPHON) 2021 Saujas Vaduguru, Aalok Sathe, Monojit Choudhury, Dipti Misra Sharma

Neural models excel at extracting statistical patterns from large amounts of data, but struggle to learn patterns or reason about language from only a few examples.

Program Synthesis

GCM: A Toolkit for Generating Synthetic Code-mixed Text

1 code implementation EACL 2021 Mohd Sanad Zaki Rizvi, Anirudh Srinivasan, Tanuja Ganu, Monojit Choudhury, Sunayana Sitaram

Code-mixing is common in multilingual communities around the world, and processing it is challenging due to the lack of labeled and unlabeled data.

TaxiNLI: Taking a Ride up the NLU Hill

1 code implementation CONLL 2020 Pratik Joshi, Somak Aditya, Aalok Sathe, Monojit Choudhury

Pre-trained Transformer-based neural architectures have consistently achieved state-of-the-art performance in the Natural Language Inference (NLI) task.

Natural Language Inference

Code-mixed parse trees and how to find them

no code implementations LREC 2020 Anirudh Srinivasan, Sandipan Dandapat, Monojit Choudhury

In this paper, we explore the methods of obtaining parse trees of code-mixed sentences and analyse the obtained trees.

Understanding Script-Mixing: A Case Study of Hindi-English Bilingual Twitter Users

no code implementations LREC 2020 Abhishek Srivastava, Kalika Bali, Monojit Choudhury

Our analysis shows that both intra-sentential and inter-sentential script-mixing are present on Twitter and show different behavior in different contexts.

Sentence

A New Dataset for Natural Language Inference from Code-mixed Conversations

no code implementations LREC 2020 Simran Khanuja, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury

Code-mixing is the use of more than one language in the same conversation or utterance, and is prevalent in multilingual communities all over the world.

Natural Language Inference

INMT: Interactive Neural Machine Translation Prediction

1 code implementation IJCNLP 2019 Sebastin Santy, Sandipan Dandapat, Monojit Choudhury, Kalika Bali

In this paper, we demonstrate an Interactive Machine Translation interface, that assists human translators with on-the-fly hints and suggestions.

Machine Translation Translation

Word Embeddings for Code-Mixed Language Processing

no code implementations EMNLP 2018 Adithya Pratapa, Monojit Choudhury, Sunayana Sitaram

We compare three existing bilingual word embedding approaches, and a novel approach of training skip-grams on synthetic code-mixed text generated through linguistic models of code-mixing, on two tasks - sentiment analysis and POS tagging for code-mixed text.

Machine Translation POS +3
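
One of the compared approaches, training skip-grams on synthetic code-mixed text, can be sketched roughly as follows with gensim; the toy sentences and hyperparameters are assumptions, and this is an illustrative stand-in rather than the authors' pipeline.

```python
from gensim.models import Word2Vec

# Toy synthetic Hindi-English code-mixed sentences; in practice these would
# be generated by a linguistic model of code-mixing rather than written by hand.
synthetic_cm_sentences = [
    ["main", "office", "ja", "raha", "hoon"],
    ["meeting", "kal", "morning", "mein", "hai"],
    ["yeh", "movie", "bahut", "boring", "thi"],
]

# Skip-gram (sg=1) word embeddings trained on the synthetic corpus.
model = Word2Vec(
    sentences=synthetic_cm_sentences,
    vector_size=50,
    window=3,
    min_count=1,
    sg=1,
    epochs=50,
)
print(model.wv["office"][:5])  # first few dimensions of a token's embedding
```

The resulting vectors could then be fed into downstream classifiers, e.g. for sentiment analysis or POS tagging of code-mixed text.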

Language Modeling for Code-Mixing: The Role of Linguistic Theory based Synthetic Data

no code implementations ACL 2018 Adithya Pratapa, Gayatri Bhat, Monojit Choudhury, Sunayana Sitaram, Sandipan Dandapat, Kalika Bali

Training language models for Code-mixed (CM) language is known to be a difficult problem because of a lack of data, compounded by the increased confusability due to the presence of more than one language.

Automatic Speech Recognition (ASR) Language Identification +3

Accommodation of Conversational Code-Choice

no code implementations WS 2018 Anshul Bawa, Monojit Choudhury, Kalika Bali

We find that the saliency or markedness of a language in context directly affects the degree of accommodation observed.

Information Retrieval Retrieval

Phone Merging For Code-Switched Speech Recognition

no code implementations WS 2018 Sunit Sivasankaran, Brij Mohan Lal Srivastava, Sunayana Sitaram, Kalika Bali, Monojit Choudhury

Though the best performance gain of 1.2% WER was observed with manually merged phones, we show experimentally that the manual phone merge is not optimal.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Learnability of Learned Neural Networks

no code implementations ICLR 2018 Rahul Anand Sharma, Navin Goyal, Monojit Choudhury, Praneeth Netrapalli

This paper explores the simplicity of learned neural networks under various settings: learned on real vs random data, varying size/architecture and using large minibatch size vs small minibatch size.

Is this word borrowed? An automatic approach to quantify the likeliness of borrowing in social media

no code implementations 15 Mar 2017 Jasabanta Patro, Bidisha Samanta, Saurabh Singh, Prithwish Mukherjee, Monojit Choudhury, Animesh Mukherjee

We first propose a context-based clustering method to sample a set of candidate words from the social media data. Next, we propose three novel and similar metrics based on the usage of these words by the users in different tweets; these metrics are used to score and rank the candidate words by how likely they are to be borrowed.

Clustering
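
The score-and-rank step can be pictured with a simple usage-based metric; the metric and counts below are purely illustrative assumptions and do not reproduce the paper's three metrics.

```python
# Hypothetical per-candidate usage counts: number of distinct users who used
# the (English) word inside predominantly Hindi tweets vs. English tweets.
candidates = {
    "school":   {"hindi_context_users": 420, "english_context_users": 180},
    "computer": {"hindi_context_users": 95,  "english_context_users": 310},
    "degree":   {"hindi_context_users": 260, "english_context_users": 240},
}

def borrowing_score(counts):
    """Illustrative metric: share of users employing the word in
    Hindi-dominant contexts; a higher score suggests borrowing."""
    h, e = counts["hindi_context_users"], counts["english_context_users"]
    return h / (h + e)

ranked = sorted(candidates, key=lambda w: borrowing_score(candidates[w]),
                reverse=True)
for word in ranked:
    print(word, round(borrowing_score(candidates[word]), 3))
```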

Grammatical Constraints on Intra-sentential Code-Switching: From Theories to Working Models

1 code implementation 14 Dec 2016 Gayatri Bhat, Monojit Choudhury, Kalika Bali

We make one of the first attempts to build working models for intra-sentential code-switching based on the Equivalence-Constraint (Poplack 1980) and Matrix-Language (Myers-Scotton 1993) theories.

Functions of Code-Switching in Tweets: An Annotation Framework and Some Initial Experiments

no code implementations LREC 2016 Rafiya Begum, Kalika Bali, Monojit Choudhury, Koustav Rudra, Niloy Ganguly

Code-Switching (CS) between two languages is extremely common in communities with societal multilingualism where speakers switch between two or more languages when interacting with each other.

An Empirical Study of the Occurrence and Co-Occurrence of Named Entities in Natural Language Corpora

no code implementations LREC 2012 K Saravanan, Monojit Choudhury, Raghavendra Udupa, A. Kumaran

Named Entities (NEs) that occur in natural language text are important especially due to the advent of social media, and they play a critical role in the development of many natural language technologies.

Information Retrieval Transliteration

Mining Hindi-English Transliteration Pairs from Online Hindi Lyrics

no code implementations LREC 2012 Kanika Gupta, Monojit Choudhury, Kalika Bali

This paper describes a method to mine Hindi-English transliteration pairs from online Hindi song lyrics.

Transliteration
