Search Results for author: Sumanth Doddapaneni

Found 14 papers, 8 papers with code

Bitions@DravidianLangTech-EACL2021: Ensemble of Multilingual Language Models with Pseudo Labeling for offence Detection in Dravidian Languages

1 code implementation • EACL (DravidianLangTech) 2021 • Debapriya Tula, Prathyush Potluri, Shreyas Ms, Sumanth Doddapaneni, Pranjal Sahu, Rohan Sukumaran, Parth Patwa

Our model is able to handle code-mixed data as well as instances where the script used is mixed (for instance, Tamil and Latin).

IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning Datasets for Indian Languages

1 code implementation • 11 Mar 2024 • Mohammed Safi Ur Rahman Khan, Priyam Mehta, Ananth Sankar, Umashankar Kumaravelan, Sumanth Doddapaneni, Suriyaprasaad G, Varun Balan G, Sparsh Jain, Anoop Kunchukuttan, Pratyush Kumar, Raj Dabre, Mitesh M. Khapra

We hope that the datasets, tools, and resources released as a part of this work will not only propel the research and development of Indic LLMs but also establish an open-source blueprint for extending such efforts to other languages.

User Embedding Model for Personalized Language Prompting

no code implementations • 10 Jan 2024 • Sumanth Doddapaneni, Krishna Sayana, Ambarish Jash, Sukhdeep Sodhi, Dima Kuzmin

Modeling long histories plays a pivotal role in enhancing recommendation systems, allowing them to capture users' evolving preferences and produce more precise and personalized recommendations.

Recommendation Systems

IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages

2 code implementations • 25 May 2023 • Jay Gala, Pranjal A. Chitale, Raghavan AK, Varun Gumma, Sumanth Doddapaneni, Aswanth Kumar, Janki Nawale, Anupama Sujatha, Ratish Puduppully, Vivek Raghavan, Pratyush Kumar, Mitesh M. Khapra, Raj Dabre, Anoop Kunchukuttan

Prior to this work, there was (i) no parallel training data spanning all 22 languages, (ii) no robust benchmarks covering all these languages and containing content relevant to India, and (iii) no existing translation models which support all the 22 scheduled languages of India.

Machine Translation Sentence +1

A Comprehensive Analysis of Adapter Efficiency

2 code implementations • 12 May 2023 • Nandini Mundra, Sumanth Doddapaneni, Raj Dabre, Anoop Kunchukuttan, Ratish Puduppully, Mitesh M. Khapra

However, adapters have not been sufficiently analyzed to understand if PEFT translates to benefits in training/deployment efficiency and maintainability/extensibility.

Natural Language Understanding

Vārta: A Large-Scale Headline-Generation Dataset for Indic Languages

1 code implementation • 10 May 2023 • Rahul Aralikatte, Ziling Cheng, Sumanth Doddapaneni, Jackie Chi Kit Cheung

We present Vārta, a large-scale multilingual dataset for headline generation in Indic languages.

Headline Generation

Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages

1 code implementation • 20 Dec 2022 • Arnav Mhaske, Harshit Kedia, Sumanth Doddapaneni, Mitesh M. Khapra, Pratyush Kumar, Rudra Murthy V, Anoop Kunchukuttan

The dataset contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location, and Organization) for 9 out of the 11 languages.

Named Entity Recognition Sentence

Towards Leaving No Indic Language Behind: Building Monolingual Corpora, Benchmark and Models for Indic Languages

1 code implementation • 11 Dec 2022 • Sumanth Doddapaneni, Rahul Aralikatte, Gowtham Ramesh, Shreya Goyal, Mitesh M. Khapra, Anoop Kunchukuttan, Pratyush Kumar

Across languages and tasks, IndicXTREME contains a total of 105 evaluation sets, of which 52 are new contributions to the literature.

Natural Language Understanding XLM-R

Effectiveness of Mining Audio and Text Pairs from Public Data for Improving ASR Systems for Low-Resource Languages

no code implementations • 26 Aug 2022 • Kaushal Santosh Bhogale, Abhigyan Raman, Tahir Javed, Sumanth Doddapaneni, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra

Significantly, we show that adding Shrutilipi to the training set of Wav2Vec models leads to an average decrease in WER of 5.8% for 7 languages on the IndicSUPERB benchmark.

Optical Character Recognition (OCR) Self-Supervised Learning +3

A Survey of Adversarial Defences and Robustness in NLP

no code implementations • 12 Mar 2022 • Shreya Goyal, Sumanth Doddapaneni, Mitesh M. Khapra, Balaraman Ravindran

In the past few years, it has become increasingly evident that deep neural networks are not resilient enough to withstand adversarial perturbations in input data, leaving them vulnerable to attack.

Adversarial Defense named-entity-recognition +5

Offense Detection in Dravidian Languages using Code-Mixing Index based Focal Loss

no code implementations • 12 Nov 2021 • Debapriya Tula, Shreyas Ms, Viswanatha Reddy, Pranjal Sahu, Sumanth Doddapaneni, Prathyush Potluri, Rohan Sukumaran, Parth Patwa

To summarize, our model can handle offensive language detection in a low-resource, class imbalanced, multilingual and code-mixed setting.

Towards Building ASR Systems for the Next Billion Users

no code implementations • 6 Nov 2021 • Tahir Javed, Sumanth Doddapaneni, Abhigyan Raman, Kaushal Santosh Bhogale, Gowtham Ramesh, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra

Second, using this raw speech data we pretrain several variants of wav2vec style models for 40 Indian languages.

A Primer on Pretrained Multilingual Language Models

no code implementations • 1 Jul 2021 • Sumanth Doddapaneni, Gowtham Ramesh, Mitesh M. Khapra, Anoop Kunchukuttan, Pratyush Kumar

Multilingual Language Models (MLLMs) such as mBERT, XLM, XLM-R, etc.

Joint Multilingual Sentence Representations Multilingual text classification +4

Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages

1 code implementation • 12 Apr 2021 • Gowtham Ramesh, Sumanth Doddapaneni, Aravinth Bheemaraj, Mayank Jobanputra, Raghavan AK, Ajitesh Sharma, Sujit Sahoo, Harshita Diddee, Mahalakshmi J, Divyanshu Kakwani, Navneet Kumar, Aswin Pradeep, Srihari Nagaraj, Kumar Deepak, Vivek Raghavan, Anoop Kunchukuttan, Pratyush Kumar, Mitesh Shantadevi Khapra

We mine the parallel sentences from the web by combining many corpora, tools, and methods: (a) web-crawled monolingual corpora, (b) document OCR for extracting sentences from scanned documents, (c) multilingual representation models for aligning sentences, and (d) approximate nearest neighbor search for searching in a large collection of sentences.

Machine Translation Multilingual NLP +3
