Samanantar is the largest publicly available parallel corpora collection for Indic languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu. The corpus has 49.6M sentence pairs between English to Indian Languages.
36 PAPERS • NO BENCHMARKS YET
COCO-CN is a bilingual image description dataset enriching MS-COCO with manually written Chinese sentences and tags. The new dataset can be used for multiple tasks including image tagging, captioning and retrieval, all in a cross-lingual setting.
20 PAPERS • 3 BENCHMARKS
Hindi Visual Genome is a multimodal dataset consisting of text and images suitable for English-Hindi multimodal machine translation task and multimodal research.
7 PAPERS • NO BENCHMARKS YET
Perseus is a dataset for Cross-Lingual Summarization (CLS) which collects about 94K Chinese scientific documents paired with English summaries. The average length of documents in Perseus is more than two thousand tokens.
1 PAPER • NO BENCHMARKS YET