KnowCoder: Coding Structured Knowledge into LLMs for Universal Information Extraction

In this paper, we propose KnowCoder, a Large Language Model (LLM) that conducts Universal Information Extraction (UIE) via code generation. KnowCoder aims to develop a unified schema representation that LLMs can easily understand and an effective learning framework that encourages LLMs to follow schemas and extract structured knowledge accurately. To achieve these goals, KnowCoder introduces a code-style schema representation method that uniformly transforms different schemas into Python classes, so that complex schema information, such as constraints among tasks in UIE, can be captured in an LLM-friendly manner. We further construct a code-style schema library covering over $\textbf{30,000}$ types of knowledge, which is, to the best of our knowledge, the largest one for UIE. To ease the learning process of LLMs, KnowCoder adopts a two-phase learning framework that enhances its schema understanding ability via code pretraining and its schema following ability via instruction tuning. After code pretraining on around $1.5$B automatically constructed data, KnowCoder already attains remarkable generalization ability and achieves a relative F1 improvement of $\textbf{49.8%}$ over LLaMA2 under the few-shot setting. After instruction tuning, KnowCoder further exhibits strong generalization ability on unseen schemas and achieves improvements of up to $\textbf{12.5%}$ and $\textbf{21.9%}$ over state-of-the-art (SOTA) baselines under the zero-shot setting and the low-resource setting, respectively. Additionally, based on our unified schema representations, various human-annotated datasets can be used simultaneously to refine KnowCoder, which achieves significant improvements of up to $\textbf{7.5%}$ under the supervised setting.
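
To make the idea of a code-style schema concrete, the following is a minimal sketch of how entity and relation types might be written as Python classes, with a type constraint between a relation and its arguments. All class names, field names, docstrings, and the runtime check are illustrative assumptions; they are not taken from the KnowCoder schema library, and the abstract does not specify this exact layout.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    """Hypothetical base class: `name` holds the extracted text span."""
    name: str


@dataclass
class Person(Entity):
    """Hypothetical entity type for human beings."""


@dataclass
class Organization(Entity):
    """Hypothetical entity type for companies and institutions."""


@dataclass
class Relation:
    """Hypothetical base class linking a head entity to a tail entity."""
    head_entity: Entity
    tail_entity: Entity


@dataclass
class WorksFor(Relation):
    """Hypothetical relation; the annotations state the argument types."""
    head_entity: Person
    tail_entity: Organization

    def __post_init__(self):
        # Assumed runtime check: reject arguments that violate the
        # constraint expressed by the field annotations above.
        if not isinstance(self.head_entity, Person):
            raise TypeError("head_entity must be a Person")
        if not isinstance(self.tail_entity, Organization):
            raise TypeError("tail_entity must be an Organization")


# Under this sketch, an extraction result is a list of instantiated
# schema objects for a made-up sentence such as "Alice works for Acme."
results = [
    Person("Alice"),
    Organization("Acme"),
    WorksFor(Person("Alice"), Organization("Acme")),
]
```

In this sketch, class inheritance carries the type hierarchy and the field annotations carry the constraint between the relation extraction task and the entity types it depends on, which is the kind of schema information the abstract says plain label lists cannot easily express.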


Results from the Paper


| Task | Dataset | Model | Metric Name | Metric Value | Global Rank |
|------|---------|-------|-------------|--------------|-------------|
| UIE | ACE 2004 | KnowCoder-7b-IE | F1 score | 86.2 | #1 |
| UIE | ACE 2005-EAE | KnowCoder-7b-IE | F1 score | 70.3 | #1 |
| UIE | ACE 2005-ED | KnowCoder-7b-IE | F1 score | 74.2 | #1 |
| UIE | ACE 2005-NER | KnowCoder-7b-IE | F1 score | 86.1 | #1 |
| UIE | ACE 2005-RE | KnowCoder-7b-IE | F1 score | 64.5 | #1 |
| UIE | ADE Corpus | KnowCoder-7b-IE | F1 score | 84.3 | #1 |
| UIE | AnatEM | KnowCoder-7b-IE | F1 score | 86.4 | #1 |
| UIE | BC2GM | KnowCoder-7b-IE | F1 score | 82.0 | #1 |
| UIE | BC5CDR | KnowCoder-7b-IE | F1 score | 89.3 | #1 |
| UIE | Broad Twitter | KnowCoder-7b-IE | F1 score | 78.3 | #1 |
| UIE | CoNLL 2003 | KnowCoder-7b-IE | F1 score | 95.1 | #1 |
| UIE | CoNLL 2004 | KnowCoder-7b-IE | F1 score | 73.3 | #1 |
| UIE | DIANN | KnowCoder-7b-IE | F1 score | 94.7 | #1 |
| UIE | FabNER | KnowCoder-7b-IE | F1 score | 82.9 | #1 |
| UIE | FindVehicle | KnowCoder-7b-IE | F1 score | 99.4 | #1 |
| UIE | GENIA | KnowCoder-7b-IE | F1 score | 76.7 | #1 |
| UIE | GIDS | KnowCoder-7b-IE | F1 score | 78.0 | #1 |
| UIE | kbp37 | KnowCoder-7b-IE | F1 score | 73.2 | #1 |
| UIE | MIT Movie | KnowCoder-7b-IE | F1 score | 90.6 | #1 |
| UIE | MIT Restaurant | KnowCoder-7b-IE | F1 score | 81.3 | #1 |
| UIE | MultiNERD | KnowCoder-7b-IE | F1 score | 96.1 | #1 |
| UIE | ncbi_disease | KnowCoder-7b-IE | F1 score | 83.8 | #1 |
| UIE | NYT | KnowCoder-7b-IE | F1 score | 93.7 | #1 |
| UIE | OntoNotes 5.0 | KnowCoder-7b-IE | F1 score | 88.2 | #1 |
| UIE | SciERC | KnowCoder-7b-IE | F1 score | 40.0 | #1 |
| UIE | semeval RE | KnowCoder-7b-IE | F1 score | 66.3 | #1 |
| UIE | WikiANN | KnowCoder-7b-IE | F1 score | 87.0 | #1 |
| UIE | WNUT 2017 | KnowCoder-7b-IE | F1 score | 66.4 | #1 |

Methods