Automatic Annotation and Manual Evaluation of the Diachronic German Corpus TüBa-D/DC
This paper presents the Tübingen Baumbank des Deutschen Diachron (TüBa-D/DC), a linguistically annotated corpus of selected diachronic materials from the German Gutenberg Project. It was automatically annotated by a suite of NLP tools integrated into WebLicht, the linguistic chaining tool used in CLARIN-D. The annotation quality has been evaluated manually for a subcorpus ranging from Middle High German to Modern High German. The integration of the TüBa-D/DC into the CLARIN-D infrastructure includes metadata provision and harvesting as well as sustainable data storage in the Tübingen CLARIN-D center. The paper further provides an overview of the possibilities for accessing the TüBa-D/DC data. Methods for full-text search of the metadata and object data and for annotation-based search of the object data are described in detail. The WebLicht Service-Oriented Architecture is used as an integrated environment for annotation-based search of the TüBa-D/DC. WebLicht thus serves not only as the annotation platform for the TüBa-D/DC, but also as a generic user interface for accessing and visualizing it.