Although ties between China and Portuguese-speaking countries are significant
and growing, few parallel corpora exist for the Chinese-Portuguese language
pair. Both languages are widely spoken, with 1.2 billion native Chinese
speakers and 279 million native Portuguese speakers; the language pair,
however, can be considered low-resource in terms of available parallel
corpora.
In this paper, we describe our methods to curate
Chinese-Portuguese parallel corpora and evaluate their quality. We extracted
bilingual data from Macao government websites and proposed a hierarchical
strategy to build a large parallel corpus. We conducted experiments on both
existing corpora and our own, using Phrase-Based Machine Translation (PBMT)
and state-of-the-art Neural Machine Translation (NMT) models. The results
of this work can be used as a benchmark for future Chinese-Portuguese MT
systems. The approach used in this paper also offers a good example of how to
boost the performance of MT systems for low-resource language pairs.