4 dataset results for Machine Translation AND Images

Samanantar

Samanantar is the largest publicly available collection of parallel corpora for 11 Indic languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu. The corpus contains 49.6M sentence pairs between English and these Indic languages.

36 PAPERS • NO BENCHMARKS YET

COCO-CN

COCO-CN is a bilingual image description dataset enriching MS-COCO with manually written Chinese sentences and tags. The new dataset can be used for multiple tasks including image tagging, captioning and retrieval, all in a cross-lingual setting.

20 PAPERS • 3 BENCHMARKS

Hindi Visual Genome

Hindi Visual Genome is a multimodal dataset consisting of text and images, suitable for the English-Hindi multimodal machine translation task and for multimodal research.

7 PAPERS • NO BENCHMARKS YET

Perseus

Perseus is a dataset for Cross-Lingual Summarization (CLS) that pairs about 94K Chinese scientific documents with English summaries. The average document length in Perseus is more than two thousand tokens.

1 PAPER • NO BENCHMARKS YET
