no code implementations • AACL (WAT) 2020 • Toshiaki Nakazawa, Hideki Nakayama, Chenchen Ding, Raj Dabre, Shohei Higashiyama, Hideya Mino, Isao Goto, Win Pa Pa, Anoop Kunchukuttan, Shantipriya Parida, Ondřej Bojar, Sadao Kurohashi
This paper presents the results of the shared tasks from the 7th workshop on Asian translation (WAT2020).
no code implementations • NAACL 2022 • Chaitanya Agarwal, Vivek Gupta, Anoop Kunchukuttan, Manish Shrivastava
Existing research on Tabular Natural Language Inference (TNLI) exclusively examines the task in a monolingual setting where the tabular premise and hypothesis are in the same language.
no code implementations • WMT (EMNLP) 2020 • Vikrant Goyal, Anoop Kunchukuttan, Rahul Kejriwal, Siddharth Jain, Amit Bhagwat
We describe our submission for the English→Tamil and Tamil→English news translation shared task.
no code implementations • ACL (WAT) 2021 • Toshiaki Nakazawa, Hideki Nakayama, Chenchen Ding, Raj Dabre, Shohei Higashiyama, Hideya Mino, Isao Goto, Win Pa Pa, Anoop Kunchukuttan, Shantipriya Parida, Ondřej Bojar, Chenhui Chu, Akiko Eriguchi, Kaori Abe, Yusuke Oda, Sadao Kurohashi
This paper presents the results of the shared tasks from the 8th workshop on Asian translation (WAT2021).
no code implementations • WAT 2022 • Toshiaki Nakazawa, Hideya Mino, Isao Goto, Raj Dabre, Shohei Higashiyama, Shantipriya Parida, Anoop Kunchukuttan, Makoto Morishita, Ondřej Bojar, Chenhui Chu, Akiko Eriguchi, Kaori Abe, Yusuke Oda, Sadao Kurohashi
This paper presents the results of the shared tasks from the 9th workshop on Asian translation (WAT2022).
1 code implementation • 25 May 2023 • Yash Madhani, Mitesh M. Khapra, Anoop Kunchukuttan
We create publicly available language identification (LID) datasets and models in all 22 Indian languages listed in the Indian constitution in both native-script and romanized text.
1 code implementation • 25 May 2023 • AI4Bharat, Jay Gala, Pranjal A. Chitale, Raghavan AK, Sumanth Doddapaneni, Varun Gumma, Aswanth Kumar, Janki Nawale, Anupama Sujatha, Ratish Puduppully, Vivek Raghavan, Pratyush Kumar, Mitesh M. Khapra, Raj Dabre, Anoop Kunchukuttan
Prior to this work, there was (i) no parallel training data spanning all the 22 languages, (ii) no robust benchmarks covering all these languages and containing content relevant to India, and (iii) no existing translation models which support all the 22 scheduled languages of India.
no code implementations • 23 May 2023 • Aswanth Kumar, Anoop Kunchukuttan, Ratish Puduppully, Raj Dabre
On multiple language pairs and language models, we show that our example selection method significantly outperforms random selection as well as strong single-factor baselines reported in the literature.
2 code implementations • 12 May 2023 • Nandini Mundra, Sumanth Doddapaneni, Raj Dabre, Anoop Kunchukuttan, Ratish Puduppully, Mitesh M. Khapra
However, adapters have not been sufficiently analyzed to understand if PEFT translates to benefits in training/deployment efficiency and maintainability/extensibility.
no code implementations • 9 May 2023 • Kaushal Kumar Maurya, Rahul Kejriwal, Maunendra Sankar Desarkar, Anoop Kunchukuttan
On closely related HRL and LRL pairs from multiple language families, we observe that our method significantly outperforms the baseline MT as well as approaches proposed previously to address cross-lingual transfer between closely related languages.
1 code implementation • 25 Apr 2023 • Divyanshu Aggarwal, Vivek Gupta, Anoop Kunchukuttan
Despite significant progress in Natural Language Generation for Indian languages (IndicNLP), there is a lack of datasets around complex structured tasks such as semantic parsing.
1 code implementation • 20 Dec 2022 • Ananya B. Sai, Vignesh Nagarajan, Tanay Dixit, Raj Dabre, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra
In this paper, we fill this gap by creating an MQM dataset consisting of 7000 fine-grained annotations, spanning 5 Indian languages and 7 MT systems, and use it to establish correlations between annotator scores and scores obtained using existing automatic metrics.
1 code implementation • 20 Dec 2022 • Arnav Mhaske, Harshit Kedia, Sumanth Doddapaneni, Mitesh M. Khapra, Pratyush Kumar, Rudra Murthy V, Anoop Kunchukuttan
The dataset contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location, and Organization) for 9 out of the 11 languages.
1 code implementation • 11 Dec 2022 • Sumanth Doddapaneni, Rahul Aralikatte, Gowtham Ramesh, Shreya Goyal, Mitesh M. Khapra, Anoop Kunchukuttan, Pratyush Kumar
Across languages and tasks, IndicXTREME contains a total of 105 evaluation sets, of which 52 are new contributions to the literature.
no code implementations • 26 Aug 2022 • Kaushal Santosh Bhogale, Abhigyan Raman, Tahir Javed, Sumanth Doddapaneni, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra
Significantly, we show that adding Shrutilipi to the training set of Wav2Vec models leads to an average decrease in WER of 5.8% for 7 languages on the IndicSUPERB benchmark.
1 code implementation • 24 Aug 2022 • Tahir Javed, Kaushal Santosh Bhogale, Abhigyan Raman, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra
We hope IndicSUPERB contributes to the progress of developing speech language understanding models for Indian languages.
1 code implementation • 6 May 2022 • Yash Madhani, Sushane Parthan, Priyanka Bedekar, Ruchi Khapra, Vivek Seshadri, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra
We introduce a new, large, diverse test set for Indic language transliteration containing 103k word pairs spanning 19 languages that enables fine-grained analysis of transliteration models.
1 code implementation • 19 Apr 2022 • Divyanshu Aggarwal, Vivek Gupta, Anoop Kunchukuttan
While Indic NLP has made rapid advances recently in terms of the availability of corpora and pre-trained models, benchmark datasets on standard NLU tasks are limited.
no code implementations • 10 Mar 2022 • Aman Kumar, Himani Shrotriya, Prachi Sahu, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan, Amogh Mishra, Mitesh M. Khapra, Pratyush Kumar
Natural Language Generation (NLG) for non-English languages is hampered by the scarcity of datasets in these languages.
no code implementations • 6 Nov 2021 • Tahir Javed, Sumanth Doddapaneni, Abhigyan Raman, Kaushal Santosh Bhogale, Gowtham Ramesh, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra
Second, using this raw speech data we pretrain several variants of wav2vec style models for 40 Indian languages.
no code implementations • 14 Oct 2021 • Anoop Kunchukuttan
In this paper, we present an extensive investigation of multi-bridge, many-to-many multilingual NMT models (MB-M2M), i.e., models trained on non-English language pairs in addition to English-centric language pairs.
1 code implementation • Findings (ACL) 2022 • Raj Dabre, Himani Shrotriya, Anoop Kunchukuttan, Ratish Puduppully, Mitesh M. Khapra, Pratyush Kumar
We present IndicBART, a multilingual, sequence-to-sequence pre-trained model focusing on 11 Indic languages and English.
no code implementations • 1 Jul 2021 • Sumanth Doddapaneni, Gowtham Ramesh, Mitesh M. Khapra, Anoop Kunchukuttan, Pratyush Kumar
Multilingual Language Models (MLLMs) such as mBERT, XLM, XLM-R, etc.
no code implementations • ACL (WAT) 2021 • Rahul Aralikatte, Miryam de Lhoneux, Anoop Kunchukuttan, Anders Søgaard
This work introduces Itihasa, a large-scale translation dataset containing 93,000 pairs of Sanskrit shlokas and their English translations.
1 code implementation • 12 Apr 2021 • Gowtham Ramesh, Sumanth Doddapaneni, Aravinth Bheemaraj, Mayank Jobanputra, Raghavan AK, Ajitesh Sharma, Sujit Sahoo, Harshita Diddee, Mahalakshmi J, Divyanshu Kakwani, Navneet Kumar, Aswin Pradeep, Srihari Nagaraj, Kumar Deepak, Vivek Raghavan, Anoop Kunchukuttan, Pratyush Kumar, Mitesh Shantadevi Khapra
We mine the parallel sentences from the web by combining many corpora, tools, and methods: (a) web-crawled monolingual corpora, (b) document OCR for extracting sentences from scanned documents, (c) multilingual representation models for aligning sentences, and (d) approximate nearest neighbor search for searching in a large collection of sentences.
1 code implementation • EACL 2021 • Anoop Kunchukuttan, Siddharth Jain, Rahul Kejriwal
We take up the task of large-scale evaluation of neural machine transliteration between English and Indic languages, with a focus on multilingual transliteration to utilize orthographic similarity between Indian languages.
no code implementations • COLING 2020 • Raj Dabre, Chenhui Chu, Anoop Kunchukuttan
The advent of neural machine translation (NMT) has opened up exciting research in building multilingual translation systems, i.e., translation models that can handle more than one language pair.
1 code implementation • Findings of the Association for Computational Linguistics 2020 • Divyanshu Kakwani, Anoop Kunchukuttan, Satish Golla, Gokul N.C., Avik Bhattacharyya, Mitesh M. Khapra, Pratyush Kumar
These resources include: (a) large-scale sentence-level monolingual corpora, (b) pre-trained word embeddings, (c) pre-trained language models, and (d) multiple NLU evaluation datasets (IndicGLUE benchmark).
2 code implementations • 30 Apr 2020 • Anoop Kunchukuttan, Divyanshu Kakwani, Satish Golla, Gokul N. C., Avik Bhattacharyya, Mitesh M. Khapra, Pratyush Kumar
We present the IndicNLP corpus, a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families.
no code implementations • WS 2020 • Pratik Jawanpuria, N T V Satya Dev, Anoop Kunchukuttan, Bamdev Mishra
We propose a geometric framework for learning meta-embeddings of words from different embedding sources.
no code implementations • 19 Mar 2020 • Anoop Kunchukuttan, Pushpak Bhattacharyya
To the best of our knowledge, this is the first large-scale study specifically devoted to utilizing language relatedness to improve translation between related languages.
no code implementations • 4 Jan 2020 • Raj Dabre, Chenhui Chu, Anoop Kunchukuttan
We present a survey on multilingual neural machine translation (MNMT), which has gained a lot of traction in recent years.
no code implementations • WS 2019 • Toshiaki Nakazawa, Nobushige Doi, Shohei Higashiyama, Chenchen Ding, Raj Dabre, Hideya Mino, Isao Goto, Win Pa Pa, Anoop Kunchukuttan, Yusuke Oda, Shantipriya Parida, Ondřej Bojar, Sadao Kurohashi
This paper presents the results of the shared tasks from the 6th workshop on Asian translation (WAT2019) including Ja↔En, Ja↔Zh scientific paper translation subtasks, Ja↔En, Ja↔Ko, Ja↔En patent translation subtasks, Hi↔En, My↔En, Km↔En, Ta↔En mixed domain subtasks and Ru↔Ja news commentary translation task.
no code implementations • 14 May 2019 • Raj Dabre, Chenhui Chu, Anoop Kunchukuttan
We present a survey on multilingual neural machine translation (MNMT), which has gained a lot of traction in recent years.
no code implementations • NAACL 2019 • Rudra Murthy V, Anoop Kunchukuttan, Pushpak Bhattacharyya
To bridge this divergence, we propose to pre-order the assisting language sentence to match the word order of the source language and train the parent model.
1 code implementation • 3 Oct 2018 • Mayank Meghwanshi, Pratik Jawanpuria, Anoop Kunchukuttan, Hiroyuki Kasai, Bamdev Mishra
In this paper, we introduce McTorch, a manifold optimization library for deep learning that extends PyTorch.
2 code implementations • TACL 2019 • Pratik Jawanpuria, Arjun Balgovind, Anoop Kunchukuttan, Bamdev Mishra
Our approach decouples learning the transformation from the source language to the target language into (a) learning rotations for language-specific embeddings to align them to a common space, and (b) learning a similarity metric in the common space to model similarities between the embeddings.
1 code implementation • ACL 2018 • Rudra Murthy, Anoop Kunchukuttan, Pushpak Bhattacharyya
Multilingual learning for Neural Named Entity Recognition (NNER) involves jointly training a neural network for multiple languages.
no code implementations • TACL 2018 • Anoop Kunchukuttan, Mitesh Khapra, Gurneet Singh, Pushpak Bhattacharyya
We address the task of joint training of transliteration models for multiple language pairs (multilingual transliteration).
no code implementations • WS 2017 • Sandhya Singh, Ritesh Panjwani, Anoop Kunchukuttan, Pushpak Bhattacharyya
In this paper, we empirically compare two encoder-decoder neural machine translation architectures, the convolutional sequence-to-sequence model (ConvS2S) and the recurrent sequence-to-sequence model (RNNS2S), for the English-Hindi language pair as part of IIT Bombay's submission to the WAT2017 shared task.
no code implementations • LREC 2018 • Anoop Kunchukuttan, Pratik Mehta, Pushpak Bhattacharyya
We present the IIT Bombay English-Hindi Parallel Corpus.
no code implementations • IJCNLP 2017 • Anoop Kunchukuttan, Maulik Shah, Pradyot Prakash, Pushpak Bhattacharyya
We investigate pivot-based translation between related languages in a low resource, phrase-based SMT setting.
no code implementations • WS 2016 • Sandhya Singh, Anoop Kunchukuttan, Pushpak Bhattacharyya
The Neural Probabilistic Language Model (NPLM) gave relatively high BLEU scores for the Indonesian-to-English translation system, while the Neural Network Joint Model (NNJM) performed better in the English-to-Indonesian direction.
no code implementations • WS 2016 • Anoop Kunchukuttan, Pushpak Bhattacharyya
The increase in length is also impacted by the specific choice of data format for representing the sentences as subwords.
no code implementations • WS 2017 • Anoop Kunchukuttan, Pushpak Bhattacharyya
We explore the use of segments learnt using Byte Pair Encoding (referred to as BPE units) as basic units for statistical machine translation between related languages and compare it with orthographic syllables, which are currently the best performing basic units for this translation task.
no code implementations • EMNLP 2016 • Anoop Kunchukuttan, Pushpak Bhattacharyya
We explore the use of the orthographic syllable, a variable-length consonant-vowel sequence, as a basic unit of translation between related languages which use abugida or alphabetic scripts.
no code implementations • LREC 2014 • Anoop Kunchukuttan, Abhijit Mishra, Rajen Chatterjee, Ritesh Shah, Pushpak Bhattacharyya
We present a compendium of 110 Statistical Machine Translation systems built from parallel corpora of 11 Indian languages belonging to both Indo-Aryan and Dravidian families.
no code implementations • LREC 2014 • Mitesh M. Khapra, Ananthakrishnan Ramanathan, Anoop Kunchukuttan, Karthik Visweswariah, Pushpak Bhattacharyya
In contrast, we propose a low-cost QC mechanism which is fair to both workers and requesters.
no code implementations • LREC 2012 • Anoop Kunchukuttan, Shourya Roy, Pratik Patel, Kushal Ladha, Somya Gupta, Mitesh M. Khapra, Pushpak Bhattacharyya
The logistics of collecting resources for Machine Translation (MT) has always been a cause of concern for some of the resource deprived languages of the world.