XL-Sum yields results competitive with those obtained using similar monolingual datasets: with multilingual training, we achieve ROUGE-2 scores above 11 on all 10 languages we benchmark on, with some exceeding 15.
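For readers unfamiliar with the metric, ROUGE-2 measures bigram overlap between a candidate summary and a reference. A minimal sketch of the F1 variant (a simplified stand-in for the official ROUGE toolkit, without stemming or stopword handling) looks like:

```python
from collections import Counter

def rouge2_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-2 F1: bigram overlap between reference and candidate."""
    def bigrams(text):
        toks = text.lower().split()
        return Counter(zip(toks, toks[1:]))

    ref, cand = bigrams(reference), bigrams(candidate)
    overlap = sum((ref & cand).values())  # clipped bigram matches
    if overlap == 0:
        return 0.0
    recall = overlap / sum(ref.values())
    precision = overlap / sum(cand.values())
    return 2 * precision * recall / (precision + recall)

print(rouge2_f1("the cat sat on the mat", "the cat sat on a mat"))  # → 0.6
```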
In this study, we present CoDesc -- a large parallel dataset composed of 4.2 million Java methods and natural language descriptions.
In this work, we leverage these embedding models, using a simple, lightweight 2-layer neural network, for the task of semantic code search.
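One common setup for this kind of embedding-based code search is to project query and code embeddings into a shared space with a small network and rank snippets by cosine similarity. The sketch below illustrates the idea only; the random weight matrices `W1`/`W2` stand in for a trained 2-layer network, and the random vectors stand in for pretrained embeddings (neither is from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM, HIDDEN = 64, 32
# Untrained stand-ins for the learned 2-layer network's weights.
W1 = rng.normal(scale=0.1, size=(EMB_DIM, HIDDEN))
W2 = rng.normal(scale=0.1, size=(HIDDEN, HIDDEN))

def project(emb: np.ndarray) -> np.ndarray:
    """2-layer network mapping raw embeddings into a shared search space."""
    return np.tanh(np.tanh(emb @ W1) @ W2)

def search(query_emb: np.ndarray, code_embs: np.ndarray) -> np.ndarray:
    """Rank code snippets by cosine similarity to the query in projected space."""
    q = project(query_emb)
    c = project(code_embs)
    q = q / np.linalg.norm(q)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    return np.argsort(-(c @ q))  # indices, best match first

# Toy corpus: 5 random "code embeddings"; the query is a small perturbation
# of snippet 2, so snippet 2 should rank first.
code_embs = rng.normal(size=(5, EMB_DIM))
query = code_embs[2] + 0.01 * rng.normal(size=EMB_DIM)
print(search(query, code_embs)[0])
```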
As a by-product of the standard NLU benchmarks, we introduce a new downstream dataset on natural language inference (NLI) and show that BanglaBERT outperforms previous state-of-the-art results on all tasks by up to 3.5%.
With the segmenter and the two methods combined, we compile a high-quality Bengali-English parallel corpus comprising 2.75 million sentence pairs, more than 2 million of which were not available before.