Context-NER: Contextual Phrase Generation at Scale

16 Sep 2021 · Himanshu Gupta, Shreyas Verma, Santosh Mashetty, Swaroop Mishra

Named Entity Recognition (NER) has seen significant progress in recent years, with numerous state-of-the-art (SOTA) models achieving high performance. However, very few studies have focused on generating the context of entities. In this paper, we introduce CONTEXT-NER, a task that aims to generate the relevant context for entities in a sentence, where the context is a phrase that describes the entity but is not necessarily present in the sentence. To facilitate research in this task, we also present the EDGAR10-Q dataset, which consists of annual and quarterly reports from the top 1500 publicly traded companies. The dataset is the largest of its kind, containing 1M sentences, 2.8M entities, and an average of 35 tokens per sentence, making it a challenging dataset. We propose a baseline approach that combines a phrase generation algorithm with inference using a 220M-parameter language model, achieving a ROUGE-L score of 27% on the test split. Additionally, we perform one-shot inference with ChatGPT, which obtains a ROUGE-L of 30%, highlighting the difficulty of the dataset. We also evaluate models such as T5 and BART, which achieve a maximum ROUGE-L of 49% after supervised finetuning on EDGAR10-Q. We further find that T5-large, when pre-finetuned on EDGAR10-Q, achieves SOTA results on downstream finance tasks such as Headline, FPB, and FiQA SA, outperforming the vanilla version by 10.81 points. To our surprise, this 66x smaller pre-finetuned model also surpasses the finance-specific LLM BloombergGPT-50B by 15 points. We hope that our dataset and generated artifacts will encourage further research in this direction, leading to the development of more sophisticated language models for financial text analysis.
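
To make the task setup concrete, the sketch below frames CONTEXT-NER as a sequence-to-sequence problem: given a sentence and a tagged entity, generate the context phrase, then score the prediction with ROUGE-L F1. It uses an off-the-shelf T5-base checkpoint (roughly 220M parameters) as a stand-in for the baseline language model; the prompt template, example sentence, and reference phrase are illustrative assumptions, not the exact format used in the paper or in EDGAR10-Q.

```python
# Minimal sketch of the CONTEXT-NER setup, assuming a seq2seq framing:
# (entity + sentence) -> context phrase, scored with ROUGE-L F1.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from rouge_score import rouge_scorer

tokenizer = AutoTokenizer.from_pretrained("t5-base")        # ~220M-parameter model
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

sentence = ("Revenue for the quarter ended March 31, 2020 was $4.2 million, "
            "an increase of 12% year over year.")
entity = "$4.2 million"
reference_context = "revenue for the quarter ended March 31, 2020"  # hypothetical gold phrase

# Hypothetical prompt format asking the model to describe the entity in context.
prompt = f"generate context: entity: {entity} sentence: {sentence}"
inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
output_ids = model.generate(**inputs, max_new_tokens=32, num_beams=4)
prediction = tokenizer.decode(output_ids[0], skip_special_tokens=True)

# ROUGE-L F1 against the reference phrase, the metric reported on EDGAR10-Q.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
print(prediction)
print(scorer.score(reference_context, prediction)["rougeL"].fmeasure)
```

Without finetuning, a vanilla checkpoint is not expected to approach the reported scores; the point of the sketch is only the input/output framing and the evaluation metric.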

Datasets

Introduced in the Paper:

EDGAR10-Q Dataset

Used in the Paper:

WikiCoref

Results from the Paper

Task        Dataset            Model                         Metric       Value   Rank
ContextNER  EDGAR10-Q Dataset  EDGAR T5 Large                ROUGE-L F1   49.23   #1
ContextNER  EDGAR10-Q Dataset  ChatGPT                       ROUGE-L F1   30.31   #2
ContextNER  EDGAR10-Q Dataset  Rule Based Phrase Generation  ROUGE-L F1   27.59   #3
