Comparing the Quality of Focused Crawlers and of the Translation Resources Obtained from them

LREC 2014 · Bruno Laranjeira, Viviane Moreira, Aline Villavicencio, Carlos Ramisch, Maria Jos{\'e} Finatto ·

Comparable corpora have been used as an alternative for parallel corpora as resources for computational tasks that involve domain-specific natural language processing. One way to gather documents related to a specific topic of interest is to traverse a portion of the web graph in a targeted way, using focused crawling algorithms. In this paper, we compare several focused crawling algorithms using them to collect comparable corpora on a specific domain. Then, we compare the evaluation of the focused crawling algorithms to the performance of linguistic processes executed after training with the corresponding generated corpora. Also, we propose a novel approach for focused crawling, exploiting the expressive power of multiword expressions.

PDF Abstract