Search Results for author: Sunayana Sitaram

Found 47 papers, 6 papers with code

A Case Study of Efficacy and Challenges in Practical Human-in-Loop Evaluation of NLP Systems Using Checklist

no code implementations EACL (HumEval) 2021 Shaily Bhatt, Rahul Jain, Sandipan Dandapat, Sunayana Sitaram

We conduct experiments for evaluating an offensive content detection system and use a data augmentation technique for improving the model using insights from Checklist.

Data Augmentation

METAL: Towards Multilingual Meta-Evaluation

no code implementations 2 Apr 2024 Rishav Hada, Varun Gumma, Mohamed Ahmed, Kalika Bali, Sunayana Sitaram

This dataset is created specifically to evaluate LLM-based evaluators, which we refer to as meta-evaluation (METAL).

Private Benchmarking to Prevent Contamination and Improve Comparative Evaluation of LLMs

no code implementations 1 Mar 2024 Nishanth Chandran, Sunayana Sitaram, Divya Gupta, Rahul Sharma, Kashish Mittal, Manohar Swaminathan

To solve this problem, we propose Private Benchmarking, a solution where test datasets are kept private and models are evaluated without revealing the test data to the model.


DOSA: A Dataset of Social Artifacts from Different Indian Geographical Subcultures

no code implementations 23 Feb 2024 Agrima Seth, Sanchit Ahuja, Kalika Bali, Sunayana Sitaram

Generative models are increasingly being used in various applications, such as text generation, commonsense reasoning, and question-answering.

Question Answering Text Generation

MAPLE: Multilingual Evaluation of Parameter Efficient Finetuning of Large Language Models

no code implementations 15 Jan 2024 Divyanshu Aggarwal, Ashutosh Sathe, Ishaan Watts, Sunayana Sitaram

Prior work on multilingual evaluation has shown that there is a large gap between the performance of LLMs on English and other languages.

MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks

no code implementations 13 Nov 2023 Sanchit Ahuja, Divyanshu Aggarwal, Varun Gumma, Ishaan Watts, Ashutosh Sathe, Millicent Ochieng, Rishav Hada, Prachi Jain, Maxamed Axmed, Kalika Bali, Sunayana Sitaram

We also perform a study on data contamination and find that several models are likely to be contaminated with multilingual evaluation benchmarks, necessitating approaches to detect and handle contamination while assessing the multilingual performance of LLMs.


Representativeness as a Forgotten Lesson for Multilingual and Code-switched Data Collection and Preparation

no code implementations 31 Oct 2023 A. Seza Doğruöz, Sunayana Sitaram, Zheng-Xin Yong

Multilingualism is widespread around the world and code-switching (CSW) is a common practice among different language pairs/tuples across locations and regions.

Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?

no code implementations 14 Sep 2023 Rishav Hada, Varun Gumma, Adrian de Wynter, Harshita Diddee, Mohamed Ahmed, Monojit Choudhury, Kalika Bali, Sunayana Sitaram

Large Language Models (LLMs) excel in various Natural Language Processing (NLP) tasks, yet their evaluation, particularly in languages beyond the top 20, remains inadequate due to existing benchmarks and metrics limitations.

Language Modelling Large Language Model +2

On Evaluating and Mitigating Gender Biases in Multilingual Settings

no code implementations 4 Jul 2023 Aniket Vashishtha, Kabir Ahuja, Sunayana Sitaram

While understanding and removing gender biases in language models has been a long-standing problem in Natural Language Processing, prior research work has primarily been limited to English.

Breaking Language Barriers with a LEAP: Learning Strategies for Polyglot LLMs

no code implementations 28 May 2023 Akshay Nambi, Vaibhav Balloli, Mercy Ranjit, Tanuja Ganu, Kabir Ahuja, Sunayana Sitaram, Kalika Bali

Our results show substantial advancements in multilingual understanding and generation across a diverse range of languages.

Question Answering Retrieval

MEGA: Multilingual Evaluation of Generative AI

1 code implementation 22 Mar 2023 Kabir Ahuja, Harshita Diddee, Rishav Hada, Millicent Ochieng, Krithika Ramesh, Prachi Jain, Akshay Nambi, Tanuja Ganu, Sameer Segal, Maxamed Axmed, Kalika Bali, Sunayana Sitaram

Most studies on generative LLMs have been restricted to English and it is unclear how capable these models are at understanding and generating text in other languages.


DiTTO: A Feature Representation Imitation Approach for Improving Cross-Lingual Transfer

no code implementations 4 Mar 2023 Shanu Kumar, Abbaraju Soujanya, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury

Zero-shot cross-lingual transfer is promising, however has been shown to be sub-optimal, with inferior transfer performance across low-resource languages.

Zero-Shot Cross-Lingual Transfer

Fairness in Language Models Beyond English: Gaps and Challenges

no code implementations 24 Feb 2023 Krithika Ramesh, Sunayana Sitaram, Monojit Choudhury

With language models becoming increasingly ubiquitous, it has become essential to address their inequitable treatment of diverse demographic groups and factors.


A Survey of Code-switching: Linguistic and Social Perspectives for Language Technologies

no code implementations ACL 2021 A. Seza Doğruöz, Sunayana Sitaram, Barbara E. Bullock, Almeida Jacqueline Toribio

To fill this gap, we offer a survey of code-switching (C-S) covering the literature in linguistics with a reflection on the key issues in language technologies.

On the Calibration of Massively Multilingual Language Models

1 code implementation 21 Oct 2022 Kabir Ahuja, Sunayana Sitaram, Sandipan Dandapat, Monojit Choudhury

Massively Multilingual Language Models (MMLMs) have recently gained popularity due to their surprising effectiveness in cross-lingual transfer.

Cross-Lingual Transfer

Beyond Static Models and Test Sets: Benchmarking the Potential of Pre-trained Models Across Tasks and Languages

no code implementations nlppower (ACL) 2022 Kabir Ahuja, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury

Although recent Massively Multilingual Language Models (MMLMs) like mBERT and XLMR support around 100 languages, most existing multilingual NLP benchmarks provide evaluation data in only a handful of these languages with little linguistic diversity.

Benchmarking Multilingual NLP +1

A Survey of Multilingual Models for Automatic Speech Recognition

no code implementations LREC 2022 Hemant Yadav, Sunayana Sitaram

Although Automatic Speech Recognition (ASR) systems have achieved human-like performance for a few languages, the majority of the world's languages do not have usable systems due to the lack of large speech datasets to train these models.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Predicting the Performance of Multilingual NLP Models

no code implementations 17 Oct 2021 Anirudh Srinivasan, Sunayana Sitaram, Tanuja Ganu, Sandipan Dandapat, Kalika Bali, Monojit Choudhury

Recent advancements in NLP have given us models like mBERT and XLMR that can serve over 100 languages.

Multilingual NLP

On the Universality of Deep Contextual Language Models

no code implementations ICON 2021 Shaily Bhatt, Poonam Goyal, Sandipan Dandapat, Monojit Choudhury, Sunayana Sitaram

Deep Contextual Language Models (LMs) like ELMO, BERT, and their successors dominate the landscape of Natural Language Processing due to their ability to scale across multiple tasks rapidly by pre-training a single model, followed by task-specific fine-tuning.

XLM-R Zero-Shot Cross-Lingual Transfer

GCM: A Toolkit for Generating Synthetic Code-mixed Text

1 code implementation EACL 2021 Mohd Sanad Zaki Rizvi, Anirudh Srinivasan, Tanuja Ganu, Monojit Choudhury, Sunayana Sitaram

Code-mixing is common in multilingual communities around the world, and processing it is challenging due to the lack of labeled and unlabeled data.

Cross-lingual and Multilingual Spoken Term Detection for Low-Resource Indian Languages

no code implementations 12 Nov 2020 Sanket Shah, Satarupa Guha, Simran Khanuja, Sunayana Sitaram

Since no publicly available dataset exists for Spoken Term Detection in these languages, we create a new dataset using a publicly available TTS dataset.

Learning not to Discriminate: Task Agnostic Learning for Improving Monolingual and Code-switched Speech Recognition

no code implementations 9 Jun 2020 Gurunath Reddy Madhumani, Sanket Shah, Basil Abraham, Vikas Joshi, Sunayana Sitaram

Recently, we showed that monolingual ASR systems fine-tuned on code-switched data deteriorate in performance on monolingual speech recognition, which is not desirable as ASR systems deployed in multilingual scenarios should recognize both monolingual and code-switched speech with high accuracy.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

A New Dataset for Natural Language Inference from Code-mixed Conversations

no code implementations LREC 2020 Simran Khanuja, Sandipan Dandapat, Sunayana Sitaram, Monojit Choudhury

Code-mixing is the use of more than one language in the same conversation or utterance, and is prevalent in multilingual communities all over the world.

Natural Language Inference

CoSSAT: Code-Switched Speech Annotation Tool

no code implementations WS 2019 Sanket Shah, Pratik Joshi, Sebastin Santy, Sunayana Sitaram

Code-switching refers to the alternation of two or more languages in a conversation or utterance and is common in multilingual communities across the world.

End-to-End ASR for Code-switched Hindi-English Speech

no code implementations 22 Jun 2019 Brij Mohan Lal Srivastava, Basil Abraham, Sunayana Sitaram, Rupesh Mehta, Preethi Jyothi

While the lack of data adversely affects the performance of end-to-end models, we see promising improvements with MTL and balancing the corpus.

Multi-Task Learning

A Survey of Code-switched Speech and Language Processing

no code implementations 25 Mar 2019 Sunayana Sitaram, Khyathi Raghavi Chandu, Sai Krishna Rallabandi, Alan W. Black

Code-switching, the alternation of languages within a conversation or utterance, is a common communicative phenomenon that occurs in multilingual communities across the world.

Word Embeddings for Code-Mixed Language Processing

no code implementations EMNLP 2018 Adithya Pratapa, Monojit Choudhury, Sunayana Sitaram

We compare three existing bilingual word embedding approaches, and a novel approach of training skip-grams on synthetic code-mixed text generated through linguistic models of code-mixing, on two tasks - sentiment analysis and POS tagging for code-mixed text.

Machine Translation POS +3

Language Modeling for Code-Mixing: The Role of Linguistic Theory based Synthetic Data

no code implementations ACL 2018 Adithya Pratapa, Gayatri Bhat, Monojit Choudhury, Sunayana Sitaram, Sandipan Dandapat, Kalika Bali

Training language models for Code-mixed (CM) language is known to be a difficult problem because of lack of data compounded by the increased confusability due to the presence of more than one language.

Automatic Speech Recognition (ASR) Language Identification +3

Phone Merging For Code-Switched Speech Recognition

no code implementations WS 2018 Sunit Sivasankaran, Brij Mohan Lal Srivastava, Sunayana Sitaram, Kalika Bali, Monojit Choudhury

Though the best performance gain of 1.2% WER was observed with manually merged phones, we show experimentally that the manual phone merge is not optimal.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Automatic Detection of Code-switching Style from Acoustics

no code implementations WS 2018 SaiKrishna Rallabandi, Sunayana Sitaram, Alan W. Black

We hypothesize that it may be useful for an ASR system to be able to first detect the switching style of a particular utterance from acoustics, and then use specialized language models or other adaptation techniques for decoding the speech.

Automatic Speech Recognition (ASR) Language Identification +1

Polyglot Neural Language Models: A Case Study in Cross-Lingual Phonetic Representation Learning

no code implementations NAACL 2016 Yulia Tsvetkov, Sunayana Sitaram, Manaal Faruqui, Guillaume Lample, Patrick Littell, David Mortensen, Alan W. Black, Lori Levin, Chris Dyer

We introduce polyglot language models, recurrent neural network models trained to predict symbol sequences in many different languages using shared representations of symbols and conditioning on typological information about the language to be predicted.

Representation Learning

Speech Synthesis of Code-Mixed Text

no code implementations LREC 2016 Sunayana Sitaram, Alan W. Black

Most Text to Speech (TTS) systems today assume that the input text is in a single language and is written in the same language that the text needs to be synthesized in.

Language Identification Speech Synthesis
