no code implementations • WAT 2022 • Toshiaki Nakazawa, Hideya Mino, Isao Goto, Raj Dabre, Shohei Higashiyama, Shantipriya Parida, Anoop Kunchukuttan, Makoto Morishita, Ondřej Bojar, Chenhui Chu, Akiko Eriguchi, Kaori Abe, Yusuke Oda, Sadao Kurohashi
This paper presents the results of the shared tasks from the 9th Workshop on Asian Translation (WAT2022).
no code implementations • ACL (WAT) 2021 • Toshiaki Nakazawa, Hideki Nakayama, Chenchen Ding, Raj Dabre, Shohei Higashiyama, Hideya Mino, Isao Goto, Win Pa Pa, Anoop Kunchukuttan, Shantipriya Parida, Ondřej Bojar, Chenhui Chu, Akiko Eriguchi, Kaori Abe, Yusuke Oda, Sadao Kurohashi
This paper presents the results of the shared tasks from the 8th Workshop on Asian Translation (WAT2021).
no code implementations • NAACL 2022 • Chaitanya Agarwal, Vivek Gupta, Anoop Kunchukuttan, Manish Shrivastava
Existing research on Tabular Natural Language Inference (TNLI) exclusively examines the task in a monolingual setting where the tabular premise and hypothesis are in the same language.
no code implementations • WMT (EMNLP) 2020 • Vikrant Goyal, Anoop Kunchukuttan, Rahul Kejriwal, Siddharth Jain, Amit Bhagwat
We describe our submission for the English→Tamil and Tamil→English news translation shared task.
no code implementations • AACL (WAT) 2020 • Toshiaki Nakazawa, Hideki Nakayama, Chenchen Ding, Raj Dabre, Shohei Higashiyama, Hideya Mino, Isao Goto, Win Pa Pa, Anoop Kunchukuttan, Shantipriya Parida, Ondřej Bojar, Sadao Kurohashi
This paper presents the results of the shared tasks from the 7th Workshop on Asian Translation (WAT2020).
no code implementations • 28 Nov 2024 • Sanjay Suryanarayanan, Haiyue Song, Mohammed Safi Ur Rahman Khan, Anoop Kunchukuttan, Mitesh M. Khapra, Raj Dabre
To address the challenge of aligning documents using sentence- and chunk-level alignments, we propose a novel scoring method, the Document Alignment Coefficient (DAC).
1 code implementation • 7 Nov 2024 • Sparsh Jain, Ashwin Sankar, Devilal Choudhary, Dhairya Suman, Nikhil Narasimhan, Mohammed Safi Ur Rahman Khan, Anoop Kunchukuttan, Mitesh M Khapra, Raj Dabre
To this end, we introduce BhasaAnuvaad, the largest publicly available dataset for AST involving 13 out of 22 scheduled Indian languages and English, spanning over 44,400 hours and 17M text segments.
1 code implementation • 17 Oct 2024 • Sumanth Doddapaneni, Mohammed Safi Ur Rahman Khan, Dilip Venkatesh, Raj Dabre, Anoop Kunchukuttan, Mitesh M. Khapra
This would enable benchmarking of general-purpose multilingual LLMs and facilitate meta-evaluation of Evaluator LLMs.
1 code implementation • 8 Jul 2024 • Nandini Mundra, Aditya Nanda Kishore, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan, Mitesh M. Khapra
Language Models (LMs) excel in natural language processing tasks for English but show reduced performance in most other languages.
1 code implementation • 6 Jun 2024 • Anushka Singh, Ananya B. Sai, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan, Mitesh M Khapra
While machine translation evaluation has been studied primarily for high-resource languages, there has been a recent interest in evaluation for low-resource languages due to the increasing availability of data and models.
no code implementations • 25 Mar 2024 • Kartik Kartik, Sanjana Soni, Anoop Kunchukuttan, Tanmoy Chakraborty, Md Shad Akhtar
In this paper, we tackle the problem of code-mixed (Hinglish and Bengalish) to English machine translation.
1 code implementation • 11 Mar 2024 • Mohammed Safi Ur Rahman Khan, Priyam Mehta, Ananth Sankar, Umashankar Kumaravelan, Sumanth Doddapaneni, Suriyaprasaad B, Varun Balan G, Sparsh Jain, Anoop Kunchukuttan, Pratyush Kumar, Raj Dabre, Mitesh M. Khapra
We hope that the datasets, tools, and resources released as a part of this work will not only propel the research and development of Indic LLMs but also establish an open-source blueprint for extending such efforts to other languages.
1 code implementation • 26 Jan 2024 • Jay Gala, Thanmay Jayakumar, Jaavid Aktar Husain, Aswanth Kumar M, Mohammed Safi Ur Rahman Khan, Diptesh Kanojia, Ratish Puduppully, Mitesh M. Khapra, Raj Dabre, Rudra Murthy, Anoop Kunchukuttan
We announce the initial release of "Airavata," an instruction-tuned LLM for Hindi.
no code implementations • 25 Jan 2024 • Jaavid Aktar Husain, Raj Dabre, Aswanth Kumar, Jay Gala, Thanmay Jayakumar, Ratish Puduppully, Anoop Kunchukuttan
This study addresses the challenge of extending Large Language Models (LLMs) to non-English languages that use non-Roman scripts.
1 code implementation • 25 May 2023 • Yash Madhani, Mitesh M. Khapra, Anoop Kunchukuttan
We create publicly available language identification (LID) datasets and models for all 22 Indian languages listed in the Indian constitution, covering both native-script and romanized text.
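As a rough illustration of the kind of baseline such LID data enables (this is not the released model from this work), a character n-gram classifier can handle both native-script and romanized input; the training strings, labels, and hyperparameters below are invented for the sketch.

```python
# Toy character n-gram LID baseline (not the models released with this work).
# Training strings and label names are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "भारत एक विशाल देश है",       # Hindi, native script
    "bharat ek vishal desh hai",    # Hindi, romanized
    "இந்தியா ஒரு பெரிய நாடு",        # Tamil, native script
    "indhiya oru periya naadu",     # Tamil, romanized
]
train_labels = ["hin_Deva", "hin_Latn", "tam_Taml", "tam_Latn"]

# Character n-grams cope reasonably well with short, noisy, romanized inputs.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 4)),
    LogisticRegression(max_iter=1000),
)
clf.fit(train_texts, train_labels)
print(clf.predict(["yeh desh bahut vishal hai"]))
```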
1 code implementation • 25 May 2023 • Jay Gala, Pranjal A. Chitale, Raghavan AK, Varun Gumma, Sumanth Doddapaneni, Aswanth Kumar, Janki Nawale, Anupama Sujatha, Ratish Puduppully, Vivek Raghavan, Pratyush Kumar, Mitesh M. Khapra, Raj Dabre, Anoop Kunchukuttan
Prior to this work, there was (i) no parallel training data spanning all 22 languages, (ii) no robust benchmarks covering all these languages and containing content relevant to India, and (iii) no existing translation models that support all 22 scheduled languages of India.
1 code implementation • 23 May 2023 • Aswanth Kumar, Ratish Puduppully, Raj Dabre, Anoop Kunchukuttan
We learn a regression model, CTQ Scorer (Contextual Translation Quality), that selects examples based on multiple features in order to maximize the translation quality.
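A minimal sketch of this idea, regressing observed translation quality on cheap features of a (test sentence, candidate example) pair and then ranking the example pool by the predicted score, is shown below; the features, data, and regressor are invented and do not reproduce the paper's CTQ Scorer.

```python
# Regression-based selection of in-context examples for MT, sketched with
# invented features and data (not the paper's actual CTQ Scorer).
import numpy as np
from sklearn.linear_model import Ridge

def features(query_src: str, example_src: str) -> list[float]:
    """Hypothetical features relating a test source to a candidate example."""
    q, e = set(query_src.split()), set(example_src.split())
    overlap = len(q & e) / max(len(q | e), 1)  # lexical overlap
    len_ratio = min(len(query_src), len(example_src)) / max(len(query_src), len(example_src), 1)
    return [overlap, len_ratio]

# Offline: (query, example) pairs with the translation quality (e.g. chrF)
# observed when that example was used as the in-context demonstration.
train_pairs = [("a b c d", "a b x y"), ("a b c d", "p q r s"), ("m n o", "m n z")]
observed_quality = [52.0, 31.0, 48.0]

X = np.array([features(q, e) for q, e in train_pairs])
scorer = Ridge().fit(X, observed_quality)

# Online: rank a pool of candidate examples for a new test sentence.
query = "a b e f"
pool = ["a b x y", "p q r s", "m n z"]
scores = scorer.predict(np.array([features(query, e) for e in pool]))
print(pool[int(np.argmax(scores))], scores)
```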
1 code implementation • 22 May 2023 • Ratish Puduppully, Anoop Kunchukuttan, Raj Dabre, Ai Ti Aw, Nancy F. Chen
This study investigates machine translation between related languages, i.e., languages within the same family that share linguistic characteristics such as word order and lexical similarity.
2 code implementations • 12 May 2023 • Nandini Mundra, Sumanth Doddapaneni, Raj Dabre, Anoop Kunchukuttan, Ratish Puduppully, Mitesh M. Khapra
However, adapters have not been sufficiently analyzed to understand if PEFT translates to benefits in training/deployment efficiency and maintainability/extensibility.
Tasks: Natural Language Understanding, parameter-efficient fine-tuning
no code implementations • 9 May 2023 • Kaushal Kumar Maurya, Rahul Kejriwal, Maunendra Sankar Desarkar, Anoop Kunchukuttan
We address the task of machine translation (MT) from an extremely low-resource language (ELRL) to English by leveraging cross-lingual transfer from a 'closely-related' high-resource language (HRL).
1 code implementation • 25 Apr 2023 • Divyanshu Aggarwal, Vivek Gupta, Anoop Kunchukuttan
Despite significant progress in Natural Language Generation for Indian languages (IndicNLP), there is a lack of datasets around complex structured tasks such as semantic parsing.
1 code implementation • 20 Dec 2022 • Ananya B. Sai, Vignesh Nagarajan, Tanay Dixit, Raj Dabre, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra
In this paper, we fill this gap by creating an MQM dataset consisting of 7,000 fine-grained annotations spanning 5 Indian languages and 7 MT systems, and using it to establish correlations between annotator scores and scores obtained using existing automatic metrics.
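The correlation step amounts to comparing per-segment human scores against per-segment metric scores; the sketch below uses invented numbers, with scipy's Kendall tau and Pearson r as stand-ins for whatever statistics the paper actually reports.

```python
# Correlating automatic-metric scores with human (MQM-derived) scores.
# All numbers below are invented purely to show the computation.
from scipy.stats import kendalltau, pearsonr

human_scores  = [0.90, 0.75, 0.40, 0.82, 0.55]   # e.g. per-segment MQM-based scores
metric_scores = [0.68, 0.61, 0.35, 0.70, 0.52]   # e.g. per-segment BLEU/chrF/COMET

tau, tau_p = kendalltau(human_scores, metric_scores)
r, r_p = pearsonr(human_scores, metric_scores)
print(f"Kendall tau = {tau:.3f} (p = {tau_p:.3f}), Pearson r = {r:.3f} (p = {r_p:.3f})")
```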
1 code implementation • 20 Dec 2022 • Arnav Mhaske, Harshit Kedia, Sumanth Doddapaneni, Mitesh M. Khapra, Pratyush Kumar, Rudra Murthy V, Anoop Kunchukuttan
The dataset contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location, and Organization) for 9 out of the 11 languages.
1 code implementation • 11 Dec 2022 • Sumanth Doddapaneni, Rahul Aralikatte, Gowtham Ramesh, Shreya Goyal, Mitesh M. Khapra, Anoop Kunchukuttan, Pratyush Kumar
Across languages and tasks, IndicXTREME contains a total of 105 evaluation sets, of which 52 are new contributions to the literature.
no code implementations • 26 Aug 2022 • Kaushal Santosh Bhogale, Abhigyan Raman, Tahir Javed, Sumanth Doddapaneni, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra
Significantly, we show that adding Shrutilipi to the training set of Wav2Vec models leads to an average decrease in WER of 5.8% for 7 languages on the IndicSUPERB benchmark.
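For reference, WER is the word-level edit distance between a hypothesis and its reference, normalized by the reference length; a self-contained implementation is sketched below (the example sentences are made up).

```python
# Word error rate: Levenshtein distance over words, normalized by reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat mat"))  # 2 errors / 6 words = 0.333...
```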
1 code implementation • 24 Aug 2022 • Tahir Javed, Kaushal Santosh Bhogale, Abhigyan Raman, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra
We hope IndicSUPERB contributes to the progress of developing speech language understanding models for Indian languages.
Tasks: Automatic Speech Recognition (ASR) and 6 others
2 code implementations • 6 May 2022 • Yash Madhani, Sushane Parthan, Priyanka Bedekar, Gokul NC, Ruchi Khapra, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra
Transliteration is very important in the Indian language context due to the usage of multiple scripts and the widespread use of romanized inputs.
1 code implementation • 19 Apr 2022 • Divyanshu Aggarwal, Vivek Gupta, Anoop Kunchukuttan
While Indic NLP has made rapid advances recently in terms of the availability of corpora and pre-trained models, benchmark datasets on standard NLU tasks are limited.
no code implementations • 10 Mar 2022 • Aman Kumar, Himani Shrotriya, Prachi Sahu, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan, Amogh Mishra, Mitesh M. Khapra, Pratyush Kumar
Natural Language Generation (NLG) for non-English languages is hampered by the scarcity of datasets in these languages.
no code implementations • 6 Nov 2021 • Tahir Javed, Sumanth Doddapaneni, Abhigyan Raman, Kaushal Santosh Bhogale, Gowtham Ramesh, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra
Second, using this raw speech data, we pretrain several variants of wav2vec-style models for 40 Indian languages.
no code implementations • 14 Oct 2021 • Anoop Kunchukuttan
In this paper, we present an extensive investigation of multi-bridge, many-to-many multilingual NMT models (MB-M2M), i.e., models trained on non-English language pairs in addition to English-centric language pairs.
3 code implementations • Findings (ACL) 2022 • Raj Dabre, Himani Shrotriya, Anoop Kunchukuttan, Ratish Puduppully, Mitesh M. Khapra, Pratyush Kumar
We present IndicBART, a multilingual, sequence-to-sequence pre-trained model focusing on 11 Indic languages and English.
no code implementations • 1 Jul 2021 • Sumanth Doddapaneni, Gowtham Ramesh, Mitesh M. Khapra, Anoop Kunchukuttan, Pratyush Kumar
Multilingual Language Models (MLLMs) such as mBERT, XLM, XLM-R, etc.
Tasks: Joint Multilingual Sentence Representations, Multilingual text classification, and 5 others
no code implementations • ACL (WAT) 2021 • Rahul Aralikatte, Miryam de Lhoneux, Anoop Kunchukuttan, Anders Søgaard
This work introduces Itihasa, a large-scale translation dataset containing 93,000 pairs of Sanskrit shlokas and their English translations.
Ranked #1 on Machine Translation on Itihasa
1 code implementation • 12 Apr 2021 • Gowtham Ramesh, Sumanth Doddapaneni, Aravinth Bheemaraj, Mayank Jobanputra, Raghavan AK, Ajitesh Sharma, Sujit Sahoo, Harshita Diddee, Mahalakshmi J, Divyanshu Kakwani, Navneet Kumar, Aswin Pradeep, Srihari Nagaraj, Kumar Deepak, Vivek Raghavan, Anoop Kunchukuttan, Pratyush Kumar, Mitesh Shantadevi Khapra
We mine the parallel sentences from the web by combining many corpora, tools, and methods: (a) web-crawled monolingual corpora, (b) document OCR for extracting sentences from scanned documents, (c) multilingual representation models for aligning sentences, and (d) approximate nearest neighbor search for searching in a large collection of sentences.
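A highly simplified sketch of steps (c) and (d), with LaBSE standing in for the multilingual representation model and an exact FAISS index in place of large-scale approximate search, is shown below; it is a toy illustration, not the released mining pipeline, and omits margin-based filtering and deduplication.

```python
# Toy embedding-based sentence alignment with nearest-neighbour search.
# Illustrates steps (c) and (d) only; not the actual mining code.
import faiss
from sentence_transformers import SentenceTransformer

en_sents = ["India is a large country.", "The weather is pleasant today."]
hi_sents = ["भारत एक विशाल देश है।", "आज मौसम सुहावना है।"]

model = SentenceTransformer("sentence-transformers/LaBSE")
en_emb = model.encode(en_sents, convert_to_numpy=True).astype("float32")
hi_emb = model.encode(hi_sents, convert_to_numpy=True).astype("float32")
faiss.normalize_L2(en_emb)
faiss.normalize_L2(hi_emb)

index = faiss.IndexFlatIP(en_emb.shape[1])  # exact search here; use IVF/HNSW at scale
index.add(en_emb)
scores, ids = index.search(hi_emb, 1)       # nearest English sentence for each Hindi one

for i in range(len(hi_sents)):
    score, j = scores[i, 0], ids[i, 0]
    if score > 0.8:                         # crude threshold; real pipelines use margin scoring
        print(hi_sents[i], "|||", en_sents[j], f"(cos = {score:.2f})")
```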
1 code implementation • EACL 2021 • Anoop Kunchukuttan, Siddharth Jain, Rahul Kejriwal
We take up the task of large-scale evaluation of neural machine transliteration between English and Indic languages, with a focus on multilingual transliteration to utilize orthographic similarity between Indian languages.
no code implementations • COLING 2020 • Raj Dabre, Chenhui Chu, Anoop Kunchukuttan
The advent of neural machine translation (NMT) has opened up exciting research in building multilingual translation systems, i.e., translation models that can handle more than one language pair.
1 code implementation • Findings of the Association for Computational Linguistics 2020 • Divyanshu Kakwani, Anoop Kunchukuttan, Satish Golla, Gokul N.C., Avik Bhattacharyya, Mitesh M. Khapra, Pratyush Kumar
These resources include: (a) large-scale sentence-level monolingual corpora, (b) pre-trained word embeddings, (c) pre-trained language models, and (d) multiple NLU evaluation datasets (IndicGLUE benchmark).
2 code implementations • 30 Apr 2020 • Anoop Kunchukuttan, Divyanshu Kakwani, Satish Golla, Gokul N. C., Avik Bhattacharyya, Mitesh M. Khapra, Pratyush Kumar
We present the IndicNLP corpus, a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families.
no code implementations • WS 2020 • Pratik Jawanpuria, N T V Satya Dev, Anoop Kunchukuttan, Bamdev Mishra
We propose a geometric framework for learning meta-embeddings of words from different embedding sources.
no code implementations • 19 Mar 2020 • Anoop Kunchukuttan, Pushpak Bhattacharyya
To the best of our knowledge, this is the first large-scale study specifically devoted to utilizing language relatedness to improve translation between related languages.
no code implementations • 4 Jan 2020 • Raj Dabre, Chenhui Chu, Anoop Kunchukuttan
We present a survey on multilingual neural machine translation (MNMT), which has gained a lot of traction in the recent years.
no code implementations • WS 2019 • Toshiaki Nakazawa, Nobushige Doi, Shohei Higashiyama, Chenchen Ding, Raj Dabre, Hideya Mino, Isao Goto, Win Pa Pa, Anoop Kunchukuttan, Yusuke Oda, Shantipriya Parida, Ondřej Bojar, Sadao Kurohashi
This paper presents the results of the shared tasks from the 6th Workshop on Asian Translation (WAT2019) including Ja↔En, Ja↔Zh scientific paper translation subtasks, Ja↔En, Ja↔Ko, Ja↔En patent translation subtasks, Hi↔En, My↔En, Km↔En, Ta↔En mixed domain subtasks and the Ru↔Ja news commentary translation task.
no code implementations • 14 May 2019 • Raj Dabre, Chenhui Chu, Anoop Kunchukuttan
We present a survey on multilingual neural machine translation (MNMT), which has gained a lot of traction in the recent years.
no code implementations • NAACL 2019 • Rudra Murthy V, Anoop Kunchukuttan, Pushpak Bhattacharyya
To bridge this divergence, we propose to pre-order the assisting language sentence to match the word order of the source language and train the parent model.
1 code implementation • 3 Oct 2018 • Mayank Meghwanshi, Pratik Jawanpuria, Anoop Kunchukuttan, Hiroyuki Kasai, Bamdev Mishra
In this paper, we introduce McTorch, a manifold optimization library for deep learning that extends PyTorch.
2 code implementations • TACL 2019 • Pratik Jawanpuria, Arjun Balgovind, Anoop Kunchukuttan, Bamdev Mishra
Our approach decouples learning the transformation from the source language to the target language into (a) learning rotations for language-specific embeddings to align them to a common space, and (b) learning a similarity metric in the common space to model similarities between the embeddings.
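The rotation half of this decomposition can be illustrated with orthogonal Procrustes on synthetic dictionary vectors, as in the sketch below; the paper's method additionally learns a similarity metric in the common space, which this sketch omits.

```python
# Aligning two embedding spaces with an orthogonal rotation (Procrustes) on
# synthetic data. The metric-learning half of the approach is omitted here.
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
d, n = 50, 200
X = rng.normal(size=(n, d))                        # "source-language" vectors of dictionary pairs
true_R, _ = np.linalg.qr(rng.normal(size=(d, d)))  # a hidden rotation to recover
Y = X @ true_R + 0.01 * rng.normal(size=(n, d))    # "target-language" vectors (rotated + noise)

R, _ = orthogonal_procrustes(X, Y)                 # solves min_R ||XR - Y||_F with R orthogonal
aligned = X @ R

# Nearest-neighbour retrieval by cosine similarity in the shared space.
cos = (aligned @ Y.T) / (
    np.linalg.norm(aligned, axis=1, keepdims=True) * np.linalg.norm(Y, axis=1)
)
print("retrieval accuracy:", float((cos.argmax(axis=1) == np.arange(n)).mean()))
```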
1 code implementation • ACL 2018 • Rudra Murthy, Anoop Kunchukuttan, Pushpak Bhattacharyya
Multilingual learning for Neural Named Entity Recognition (NNER) involves jointly training a neural network for multiple languages.
no code implementations • TACL 2018 • Anoop Kunchukuttan, Mitesh Khapra, Gurneet Singh, Pushpak Bhattacharyya
We address the task of joint training of transliteration models for multiple language pairs (multilingual transliteration).
no code implementations • WS 2017 • Sandhya Singh, Ritesh Panjwani, Anoop Kunchukuttan, Pushpak Bhattacharyya
In this paper, we empirically compare the two encoder-decoder neural machine translation architectures, the convolutional sequence-to-sequence model (ConvS2S) and the recurrent sequence-to-sequence model (RNNS2S), for the English-Hindi language pair as part of IIT Bombay's submission to the WAT2017 shared task.
no code implementations • LREC 2018 • Anoop Kunchukuttan, Pratik Mehta, Pushpak Bhattacharyya
We present the IIT Bombay English-Hindi Parallel Corpus.
no code implementations • IJCNLP 2017 • Anoop Kunchukuttan, Maulik Shah, Pradyot Prakash, Pushpak Bhattacharyya
We investigate pivot-based translation between related languages in a low resource, phrase-based SMT setting.
no code implementations • WS 2016 • Sandhya Singh, Anoop Kunchukuttan, Pushpak Bhattacharyya
The Neural Probabilistic Language Model (NPLM) gave relatively high BLEU scores for the Indonesian-to-English translation system, while the Neural Network Joint Model (NNJM) performed better for the English-to-Indonesian direction.
no code implementations • WS 2016 • Anoop Kunchukuttan, Pushpak Bhattacharyya
The increase in length is also impacted by the specific choice of data format for representing the sentences as subwords.
no code implementations • WS 2017 • Anoop Kunchukuttan, Pushpak Bhattacharyya
We explore the use of segments learnt using Byte Pair Encoding (referred to as BPE units) as basic units for statistical machine translation between related languages and compare it with orthographic syllables, which are currently the best performing basic units for this translation task.
no code implementations • EMNLP 2016 • Anoop Kunchukuttan, Pushpak Bhattacharyya
We explore the use of the orthographic syllable, a variable-length consonant-vowel sequence, as a basic unit of translation between related languages which use abugida or alphabetic scripts.
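A rough approximation of such consonant-vowel segmentation for Devanagari text is sketched below; it is not the segmenter used in the paper and ignores several script-specific edge cases (nuktas, rare signs, joiners).

```python
# Approximate segmentation of Devanagari words into orthographic syllables
# (roughly C+V units). For illustration only; not the paper's segmenter.
import re

CONS   = r"[\u0915-\u0939\u0958-\u095F]"  # consonants
VIRAMA = r"\u094D"                        # halant, joins consonant clusters
MATRA  = r"[\u093E-\u094C]"               # dependent vowel signs
MOD    = r"[\u0901-\u0903]"               # candrabindu, anusvara, visarga
VOWEL  = r"[\u0904-\u0914]"               # independent vowels

OS = re.compile(
    rf"(?:{CONS}(?:{VIRAMA}{CONS})*{MATRA}?{MOD}?)"  # consonant cluster + optional vowel sign
    rf"|(?:{VOWEL}{MOD}?)"                           # or an independent vowel
    r"|\S"                                           # fall back to single non-space characters
)

def orthographic_syllables(word: str) -> list[str]:
    return OS.findall(word)

print(orthographic_syllables("भारतीय"))  # -> ['भा', 'र', 'ती', 'य']
```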
no code implementations • LREC 2014 • Anoop Kunchukuttan, Abhijit Mishra, Rajen Chatterjee, Ritesh Shah, Pushpak Bhattacharyya
We present a compendium of 110 Statistical Machine Translation systems built from parallel corpora of 11 Indian languages belonging to both Indo-Aryan and Dravidian families.
no code implementations • LREC 2014 • Mitesh M. Khapra, Ananthakrishnan Ramanathan, Anoop Kunchukuttan, Karthik Visweswariah, Pushpak Bhattacharyya
In contrast, we propose a low-cost quality control (QC) mechanism that is fair to both workers and requesters.
no code implementations • LREC 2012 • Anoop Kunchukuttan, Shourya Roy, Pratik Patel, Kushal Ladha, Somya Gupta, Mitesh M. Khapra, Pushpak Bhattacharyya
The logistics of collecting resources for Machine Translation (MT) has always been a cause of concern for some of the resource deprived languages of the world.