Semi-automatic Parsing for Web Knowledge Extraction through Semantic Annotation

LREC 2016 · Maria Pia di Buono

Parsing Web information, namely parsing content to find relevant documents on the basis of a user's query, represents a crucial step to guarantee fast and accurate Information Retrieval (IR). Generally, an automated approach to such a task is considered faster and cheaper than manual systems. Nevertheless, its results do not seem to reach a high level of accuracy. Indeed, as Hjorland (2007) also states, using stochastic algorithms entails:
• Low precision due to the indexing of common Atomic Linguistic Units (ALUs) or sentences.
• Low recall caused by the presence of synonyms.
• Generic results arising from the use of too broad or too narrow terms.

Usually, IR systems are based on an inverted text index, namely an index data structure storing a mapping from content to its locations in a database file, in a document, or in a set of documents. In this paper we propose a system on which we will develop a search engine able to process online documents, starting from a natural language query, and to return information to users. The proposed approach, based on the Lexicon-Grammar (LG) framework and its language formalization methodologies, aims at integrating a semantic annotation process for both query analysis and document retrieval.
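To make the notion of an inverted text index concrete, the following is a minimal sketch in Python. It is not the system proposed in the paper: the function names (`build_inverted_index`, `lookup`), the whitespace tokenization, and the toy documents are all assumptions introduced purely for illustration.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased token to {doc_id: [token positions]}.

    Tokenization is a plain whitespace split (an illustrative
    simplification, not a linguistic analysis).
    """
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, token in enumerate(text.lower().split()):
            index[token].setdefault(doc_id, []).append(position)
    return dict(index)


def lookup(index, term):
    """Return the sorted ids of the documents that contain the term."""
    return sorted(index.get(term.lower(), {}))


# Hypothetical toy collection, for illustration only.
documents = {
    "d1": "the amphora was found in Pompeii",
    "d2": "a vase similar to the amphora",
}

index = build_inverted_index(documents)
print(lookup(index, "amphora"))  # ['d1', 'd2']
print(lookup(index, "vase"))     # ['d2']
print(lookup(index, "jar"))      # []
```

Because the lookup matches surface tokens exactly, a query term that does not literally occur in a document retrieves nothing, even when a related word is present. This is one simple way to see the low-recall problem with synonyms noted above.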