Shallow Discourse Parsing for Under-Resourced Languages: Combining Machine Translation and Annotation Projection

Shallow Discourse Parsing (SDP), the identification of coherence relations between text spans, relies on large amounts of training data, which so far exists only for English - any other language is in this respect an under-resourced one. For those languages where machine translation from English is available with reasonable quality, MT in conjunction with annotation projection can be an option for producing an SDP resource. In our study, we translate the English Penn Discourse TreeBank into German and experiment with various methods of annotation projection to arrive at the German counterpart of the PDTB. We describe the key characteristics of the corpus as well as some typical sources of errors encountered during its creation. Then we evaluate the GermanPDTB by training components for selected sub-tasks of discourse parsing on this silver data and compare performance to the same components when trained on the gold, original PDTB corpus.

PDF Abstract

Datasets


  Add Datasets introduced or used in this paper

Results from the Paper


  Submit results from this paper to get state-of-the-art GitHub badges and help the community compare results to other papers.

Methods


No methods listed for this paper. Add relevant methods here