Search Results for author: Anoop Kunchukuttan

Found 75 papers, 26 papers with code

Bilingual Tabular Inference: A Case Study on Indic Languages

no code implementations NAACL 2022 Chaitanya Agarwal, Vivek Gupta, Anoop Kunchukuttan, Manish Shrivastava

Existing research on Tabular Natural Language Inference (TNLI) exclusively examines the task in a monolingual setting where the tabular premise and hypothesis are in the same language.

Natural Language Inference

Pralekha: An Indic Document Alignment Evaluation Benchmark

no code implementations 28 Nov 2024 Sanjay Suryanarayanan, Haiyue Song, Mohammed Safi Ur Rahman Khan, Anoop Kunchukuttan, Mitesh M. Khapra, Raj Dabre

To address the challenge of aligning documents using sentence and chunk-level alignments, we propose a novel scoring method, Document Alignment Coefficient (DAC).

Sentence Sentence Embedding +1

BhasaAnuvaad: A Speech Translation Dataset for 13 Indian Languages

1 code implementation 7 Nov 2024 Sparsh Jain, Ashwin Sankar, Devilal Choudhary, Dhairya Suman, Nikhil Narasimhan, Mohammed Safi Ur Rahman Khan, Anoop Kunchukuttan, Mitesh M Khapra, Raj Dabre

To this end, we introduce BhasaAnuvaad, the largest publicly available dataset for AST involving 13 out of 22 scheduled Indian languages and English spanning over 44,400 hours and 17M text segments.

automatic-speech-translation Synthetic Data Generation +1

Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs

1 code implementation 17 Oct 2024 Sumanth Doddapaneni, Mohammed Safi Ur Rahman Khan, Dilip Venkatesh, Raj Dabre, Anoop Kunchukuttan, Mitesh M. Khapra

This would enable benchmarking of general-purpose multilingual LLMs and facilitate meta-evaluation of Evaluator LLMs.

Benchmarking

An Empirical Comparison of Vocabulary Expansion and Initialization Approaches for Language Models

1 code implementation 8 Jul 2024 Nandini Mundra, Aditya Nanda Kishore, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan, Mitesh M. Khapra

Language Models (LMs) excel in natural language processing tasks for English but show reduced performance in most other languages.

How Good is Zero-Shot MT Evaluation for Low Resource Indian Languages?

1 code implementation 6 Jun 2024 Anushka Singh, Ananya B. Sai, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan, Mitesh M Khapra

While machine translation evaluation has been studied primarily for high-resource languages, there has been a recent interest in evaluation for low-resource languages due to the increasing availability of data and models.

Machine Translation

IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning Datasets for Indian Languages

1 code implementation 11 Mar 2024 Mohammed Safi Ur Rahman Khan, Priyam Mehta, Ananth Sankar, Umashankar Kumaravelan, Sumanth Doddapaneni, Suriyaprasaad B, Varun Balan G, Sparsh Jain, Anoop Kunchukuttan, Pratyush Kumar, Raj Dabre, Mitesh M. Khapra

We hope that the datasets, tools, and resources released as a part of this work will not only propel the research and development of Indic LLMs but also establish an open-source blueprint for extending such efforts to other languages.

Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages

1 code implementation 25 May 2023 Yash Madhani, Mitesh M. Khapra, Anoop Kunchukuttan

We create publicly available language identification (LID) datasets and models in all 22 Indian languages listed in the Indian constitution in both native-script and romanized text.

Language Identification

IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages

1 code implementation 25 May 2023 Jay Gala, Pranjal A. Chitale, Raghavan AK, Varun Gumma, Sumanth Doddapaneni, Aswanth Kumar, Janki Nawale, Anupama Sujatha, Ratish Puduppully, Vivek Raghavan, Pratyush Kumar, Mitesh M. Khapra, Raj Dabre, Anoop Kunchukuttan

Prior to this work, there was (i) no parallel training data spanning all 22 languages, (ii) no robust benchmarks covering all these languages and containing content relevant to India, and (iii) no existing translation models which support all the 22 scheduled languages of India.

Machine Translation Sentence +1

CTQScorer: Combining Multiple Features for In-context Example Selection for Machine Translation

1 code implementation 23 May 2023 Aswanth Kumar, Ratish Puduppully, Raj Dabre, Anoop Kunchukuttan

We learn a regression model, CTQ Scorer (Contextual Translation Quality), that selects examples based on multiple features in order to maximize the translation quality.

In-Context Learning Machine Translation +2

Decomposed Prompting for Machine Translation Between Related Languages using Large Language Models

1 code implementation 22 May 2023 Ratish Puduppully, Anoop Kunchukuttan, Raj Dabre, Ai Ti Aw, Nancy F. Chen

This study investigates machine translation between related languages, i.e., languages within the same family that share linguistic characteristics such as word order and lexical similarity.

Machine Translation Translation

A Comprehensive Analysis of Adapter Efficiency

2 code implementations 12 May 2023 Nandini Mundra, Sumanth Doddapaneni, Raj Dabre, Anoop Kunchukuttan, Ratish Puduppully, Mitesh M. Khapra

However, adapters have not been sufficiently analyzed to understand if PEFT translates to benefits in training/deployment efficiency and maintainability/extensibility.

Natural Language Understanding parameter-efficient fine-tuning

CharSpan: Utilizing Lexical Similarity to Enable Zero-Shot Machine Translation for Extremely Low-resource Languages

no code implementations 9 May 2023 Kaushal Kumar Maurya, Rahul Kejriwal, Maunendra Sankar Desarkar, Anoop Kunchukuttan

We address the task of machine translation (MT) from an extremely low-resource language (ELRL) to English by leveraging cross-lingual transfer from a 'closely-related' high-resource language (HRL).

Cross-Lingual Transfer Machine Translation +1

Evaluating Inter-Bilingual Semantic Parsing for Indian Languages

1 code implementation 25 Apr 2023 Divyanshu Aggarwal, Vivek Gupta, Anoop Kunchukuttan

Despite significant progress in Natural Language Generation for Indian languages (IndicNLP), there is a lack of datasets around complex structured tasks such as semantic parsing.

Semantic Parsing Text Generation +1

IndicMT Eval: A Dataset to Meta-Evaluate Machine Translation metrics for Indian Languages

1 code implementation 20 Dec 2022 Ananya B. Sai, Vignesh Nagarajan, Tanay Dixit, Raj Dabre, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra

In this paper, we fill this gap by creating an MQM dataset consisting of 7000 fine-grained annotations, spanning 5 Indian languages and 7 MT systems, and use it to establish correlations between annotator scores and scores obtained using existing automatic metrics.

Machine Translation

Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages

1 code implementation 20 Dec 2022 Arnav Mhaske, Harshit Kedia, Sumanth Doddapaneni, Mitesh M. Khapra, Pratyush Kumar, Rudra Murthy V, Anoop Kunchukuttan

The dataset contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location, and Organization) for 9 out of the 11 languages.

Named Entity Recognition Sentence

Aksharantar: Open Indic-language Transliteration datasets and models for the Next Billion Users

2 code implementations 6 May 2022 Yash Madhani, Sushane Parthan, Priyanka Bedekar, Gokul NC, Ruchi Khapra, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra

Transliteration is very important in the Indian language context due to the usage of multiple scripts and the widespread use of romanized inputs.

Transliteration

IndicXNLI: Evaluating Multilingual Inference for Indian Languages

1 code implementation 19 Apr 2022 Divyanshu Aggarwal, Vivek Gupta, Anoop Kunchukuttan

While Indic NLP has made rapid advances recently in terms of the availability of corpora and pre-trained models, benchmark datasets on standard NLU tasks are limited.

Cross-Lingual Transfer Machine Translation +1

Towards Building ASR Systems for the Next Billion Users

no code implementations 6 Nov 2021 Tahir Javed, Sumanth Doddapaneni, Abhigyan Raman, Kaushal Santosh Bhogale, Gowtham Ramesh, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra

Second, using this raw speech data we pretrain several variants of wav2vec style models for 40 Indian languages.

An Empirical Investigation of Multi-bridge Multilingual NMT models

no code implementations 14 Oct 2021 Anoop Kunchukuttan

In this paper, we present an extensive investigation of multi-bridge, many-to-many multilingual NMT models (MB-M2M), i.e., models trained on non-English language pairs in addition to English-centric language pairs.

NMT Translation

Samanantar: The Largest Publicly Available Parallel Corpora Collection for 11 Indic Languages

1 code implementation 12 Apr 2021 Gowtham Ramesh, Sumanth Doddapaneni, Aravinth Bheemaraj, Mayank Jobanputra, Raghavan AK, Ajitesh Sharma, Sujit Sahoo, Harshita Diddee, Mahalakshmi J, Divyanshu Kakwani, Navneet Kumar, Aswin Pradeep, Srihari Nagaraj, Kumar Deepak, Vivek Raghavan, Anoop Kunchukuttan, Pratyush Kumar, Mitesh Shantadevi Khapra

We mine the parallel sentences from the web by combining many corpora, tools, and methods: (a) web-crawled monolingual corpora, (b) document OCR for extracting sentences from scanned documents, (c) multilingual representation models for aligning sentences, and (d) approximate nearest neighbor search for searching in a large collection of sentences.

Machine Translation Multilingual NLP +3

A Large-scale Evaluation of Neural Machine Transliteration for Indic Languages

1 code implementation EACL 2021 Anoop Kunchukuttan, Siddharth Jain, Rahul Kejriwal

We take up the task of large-scale evaluation of neural machine transliteration between English and Indic languages, with a focus on multilingual transliteration to utilize orthographic similarity between Indian languages.

Translation Transliteration

Multilingual Neural Machine Translation

no code implementations COLING 2020 Raj Dabre, Chenhui Chu, Anoop Kunchukuttan

The advent of neural machine translation (NMT) has opened up exciting research in building multilingual translation systems, i.e., translation models that can handle more than one language pair.

Machine Translation NMT +2

AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages

2 code implementations 30 Apr 2020 Anoop Kunchukuttan, Divyanshu Kakwani, Satish Golla, Gokul N. C., Avik Bhattacharyya, Mitesh M. Khapra, Pratyush Kumar

We present the IndicNLP corpus, a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families.

Word Embeddings

Learning Geometric Word Meta-Embeddings

no code implementations WS 2020 Pratik Jawanpuria, N T V Satya Dev, Anoop Kunchukuttan, Bamdev Mishra

We propose a geometric framework for learning meta-embeddings of words from different embedding sources.

Word Similarity

Utilizing Language Relatedness to improve Machine Translation: A Case Study on Languages of the Indian Subcontinent

no code implementations 19 Mar 2020 Anoop Kunchukuttan, Pushpak Bhattacharyya

To the best of our knowledge, this is the first large-scale study specifically devoted to utilizing language relatedness to improve translation between related languages.

Machine Translation Translation

A Comprehensive Survey of Multilingual Neural Machine Translation

no code implementations 4 Jan 2020 Raj Dabre, Chenhui Chu, Anoop Kunchukuttan

We present a survey on multilingual neural machine translation (MNMT), which has gained a lot of traction in recent years.

Machine Translation NMT +3

Overview of the 6th Workshop on Asian Translation

no code implementations WS 2019 Toshiaki Nakazawa, Nobushige Doi, Shohei Higashiyama, Chenchen Ding, Raj Dabre, Hideya Mino, Isao Goto, Win Pa Pa, Anoop Kunchukuttan, Yusuke Oda, Shantipriya Parida, Ondřej Bojar, Sadao Kurohashi

This paper presents the results of the shared tasks from the 6th workshop on Asian translation (WAT2019) including Ja↔En, Ja↔Zh scientific paper translation subtasks, Ja↔En, Ja↔Ko, Ja↔En patent translation subtasks, Hi↔En, My↔En, Km↔En, Ta↔En mixed domain subtasks and Ru↔Ja news commentary translation task.

Translation

A Brief Survey of Multilingual Neural Machine Translation

no code implementations 14 May 2019 Raj Dabre, Chenhui Chu, Anoop Kunchukuttan

We present a survey on multilingual neural machine translation (MNMT), which has gained a lot of traction in recent years.

Machine Translation Survey +2

Addressing word-order Divergence in Multilingual Neural Machine Translation for extremely Low Resource Languages

no code implementations NAACL 2019 Rudra Murthy V, Anoop Kunchukuttan, Pushpak Bhattacharyya

To bridge this divergence, we propose to pre-order the assisting language sentence to match the word order of the source language and train the parent model.

Machine Translation NMT +3

McTorch, a manifold optimization library for deep learning

1 code implementation 3 Oct 2018 Mayank Meghwanshi, Pratik Jawanpuria, Anoop Kunchukuttan, Hiroyuki Kasai, Bamdev Mishra

In this paper, we introduce McTorch, a manifold optimization library for deep learning that extends PyTorch.

Deep Learning

Learning Multilingual Word Embeddings in Latent Metric Space: A Geometric Approach

2 code implementations TACL 2019 Pratik Jawanpuria, Arjun Balgovind, Anoop Kunchukuttan, Bamdev Mishra

Our approach decouples learning the transformation from the source language to the target language into (a) learning rotations for language-specific embeddings to align them to a common space, and (b) learning a similarity metric in the common space to model similarities between the embeddings.

Bilingual Lexicon Induction Multilingual Word Embeddings +4

Comparing Recurrent and Convolutional Architectures for English-Hindi Neural Machine Translation

no code implementations WS 2017 Sandhya Singh, Ritesh Panjwani, Anoop Kunchukuttan, Pushpak Bhattacharyya

In this paper, we empirically compare the two encoder-decoder neural machine translation architectures: convolutional sequence to sequence model (ConvS2S) and recurrent sequence to sequence model (RNNS2S) for English-Hindi language pair as part of IIT Bombay's submission to WAT2017 shared task.

Decoder Image Captioning +5

IIT Bombay's English-Indonesian submission at WAT: Integrating Neural Language Models with SMT

no code implementations WS 2016 Sandhya Singh, Anoop Kunchukuttan, Pushpak Bhattacharyya

The Neural Probabilistic Language Model (NPLM) gave relatively high BLEU points for Indonesian to English translation system while the Neural Network Joint Model (NNJM) performed better for English to Indonesian direction of translation system.

Language Modelling Machine Translation +1

Faster decoding for subword level Phrase-based SMT between related languages

no code implementations WS 2016 Anoop Kunchukuttan, Pushpak Bhattacharyya

The increase in length is also impacted by the specific choice of data format for representing the sentences as subwords.

Decoder Translation

Learning variable length units for SMT between related languages via Byte Pair Encoding

no code implementations WS 2017 Anoop Kunchukuttan, Pushpak Bhattacharyya

We explore the use of segments learnt using Byte Pair Encoding (referred to as BPE units) as basic units for statistical machine translation between related languages and compare it with orthographic syllables, which are currently the best performing basic units for this translation task.

Machine Translation Translation

Orthographic Syllable as basic unit for SMT between Related Languages

no code implementations EMNLP 2016 Anoop Kunchukuttan, Pushpak Bhattacharyya

We explore the use of the orthographic syllable, a variable-length consonant-vowel sequence, as a basic unit of translation between related languages which use abugida or alphabetic scripts.

Translation

Shata-Anuvadak: Tackling Multiway Translation of Indian Languages

no code implementations LREC 2014 Anoop Kunchukuttan, Abhijit Mishra, Rajen Chatterjee, Ritesh Shah, Pushpak Bhattacharyya

We present a compendium of 110 Statistical Machine Translation systems built from parallel corpora of 11 Indian languages belonging to both Indo-Aryan and Dravidian families.

Translation Transliteration

Experiences in Resource Generation for Machine Translation through Crowdsourcing

no code implementations LREC 2012 Anoop Kunchukuttan, Shourya Roy, Pratik Patel, Kushal Ladha, Somya Gupta, Mitesh M. Khapra, Pushpak Bhattacharyya

The logistics of collecting resources for Machine Translation (MT) has always been a cause of concern for some of the resource deprived languages of the world.

Machine Translation Translation
