BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models

17 Apr 2021  ·  Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, Iryna Gurevych ·

Existing neural information retrieval (IR) models have often been studied in homogeneous and narrow settings, which has considerably limited insights into their out-of-distribution (OOD) generalization capabilities. To address this, and to facilitate researchers in broadly evaluating the effectiveness of their models, we introduce Benchmarking-IR (BEIR), a robust and heterogeneous evaluation benchmark for information retrieval. We leverage a careful selection of 18 publicly available datasets from diverse text retrieval tasks and domains, and evaluate 10 state-of-the-art retrieval systems, including lexical, sparse, dense, late-interaction, and re-ranking architectures, on the BEIR benchmark. Our results show that BM25 is a robust baseline and that re-ranking and late-interaction-based models achieve the best zero-shot performance on average, however at high computational cost. In contrast, dense and sparse retrieval models are computationally more efficient but often underperform the other approaches, highlighting the considerable room for improvement in their generalization capabilities. We hope this framework allows us to better evaluate and understand existing retrieval systems, and contributes to accelerating progress towards more robust and generalizable systems in the future. BEIR is publicly available at https://github.com/UKPLab/beir.
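For context on how such a zero-shot evaluation is run in practice, below is a minimal sketch using the beir Python package linked above. The module paths, the dataset download URL, the dataset name (scifact), and the model name (msmarco-distilbert-base-tas-b) follow the library's documented quickstart and are assumptions that may differ across package versions.

```python
# Minimal sketch of a zero-shot BEIR evaluation with the beir package
# (https://github.com/UKPLab/beir). Paths and names follow the documented
# quickstart and may vary between library versions.
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Download and unzip one of the 18 BEIR datasets (here: SciFact).
dataset = "scifact"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "datasets")

# Load the corpus, queries, and relevance judgments of the test split.
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Zero-shot dense retrieval with a SentenceTransformers model (TAS-B here).
model = DRES(models.SentenceBERT("msmarco-distilbert-base-tas-b"), batch_size=128)
retriever = EvaluateRetrieval(model, score_function="dot")

# Retrieve the top-k documents per query and compute the evaluation metrics.
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg)  # e.g. {"NDCG@1": ..., "NDCG@10": ..., ...}
```

Other retriever families (lexical BM25, sparse, late-interaction, or BM25 followed by a cross-encoder re-ranker) plug into the same retrieve-then-evaluate interface, which is what makes the zero-shot comparison across the 18 datasets straightforward.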


Results from the Paper


Task | Dataset | Model | Metric | Value | Global Rank
Argument Retrieval | ArguAna (BEIR) | GenQ | nDCG@10 | 0.517 | #1
Argument Retrieval | ArguAna (BEIR) | BM25+CE | nDCG@10 | 0.311 | #4
Biomedical Information Retrieval | BioASQ (BEIR) | BM25 | nDCG@10 | 0.514 | #5
Biomedical Information Retrieval | BioASQ (BEIR) | BM25+CE | nDCG@10 | 0.523 | #4
Fact Checking | CLIMATE-FEVER (BEIR) | BM25+CE | nDCG@10 | 0.253 | #3
Duplicate-Question Retrieval | CQADupStack (BEIR) | ColBERT | nDCG@10 | 0.350 | #5
Duplicate-Question Retrieval | CQADupStack (BEIR) | BM25+CE | nDCG@10 | 0.370 | #4
Entity Retrieval | DBpedia (BEIR) | ColBERT | nDCG@10 | 0.392 | #3
Fact Checking | FEVER (BEIR) | BM25+CE | nDCG@10 | 0.819 | #2
Question Answering | FiQA-2018 (BEIR) | BM25+CE | nDCG@10 | 0.347 | #4
Question Answering | HotpotQA (BEIR) | BM25+CE | nDCG@10 | 0.707 | #2
Passage Retrieval | MSMARCO (BEIR) | DeepCT | nDCG@10 | 0.296 | #8
Passage Retrieval | MSMARCO (BEIR) | BM25 | nDCG@10 | 0.228 | #11
Passage Retrieval | MSMARCO (BEIR) | ANCE | nDCG@10 | 0.388 | #5
Passage Retrieval | MSMARCO (BEIR) | TAS-B | nDCG@10 | 0.408 | #2
Passage Retrieval | MSMARCO (BEIR) | BM25+CE | nDCG@10 | 0.413 | #1
Passage Retrieval | MSMARCO (BEIR) | DPR | nDCG@10 | 0.177 | #12
Passage Retrieval | MSMARCO (BEIR) | ColBERT | nDCG@10 | 0.401 | #3
Passage Retrieval | MSMARCO (BEIR) | docT5query | nDCG@10 | 0.338 | #7
Passage Retrieval | MSMARCO (BEIR) | SPARTA | nDCG@10 | 0.351 | #6
Biomedical Information Retrieval | NFCorpus (BEIR) | ColBERT | nDCG@10 | 0.305 | #7
Biomedical Information Retrieval | NFCorpus (BEIR) | BM25+CE | nDCG@10 | 0.350 | #4
Question Answering | NQ (BEIR) | BM25+CE | nDCG@10 | 0.533 | #3
Question Answering | NQ (BEIR) | ColBERT | nDCG@10 | 0.524 | #4
Duplicate-Question Retrieval | Quora (BEIR) | BM25+CE | nDCG@10 | 0.825 | #3
Citation Prediction | SciDocs (BEIR) | BM25+CE | nDCG@10 | 0.166 | #5
Citation Prediction | SciDocs (BEIR) | BM25 | nDCG@10 | 0.156 | #6
Fact Checking | SciFact (BEIR) | BM25+CE | nDCG@10 | 0.688 | #3
Fact Checking | SciFact (BEIR) | ColBERT | nDCG@10 | 0.671 | #5
Tweet Retrieval | Signal-1M (RT) (BEIR) | BM25+CE | nDCG@10 | 0.338 | #2
Argument Retrieval | Touché-2020 (BEIR) | BM25+CE | nDCG@10 | 0.271 | #2
Biomedical Information Retrieval | TREC-COVID (BEIR) | BM25+CE | nDCG@10 | 0.757 | #5
Biomedical Information Retrieval | TREC-COVID (BEIR) | ColBERT | nDCG@10 | 0.677 | #6
News Retrieval | TREC-NEWS (BEIR) | BM25+CE | nDCG@10 | 0.431 | #5
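All rows above report nDCG@10: the discounted cumulative gain of the top-10 retrieved documents, normalized by the gain of an ideal ranking of the relevant documents. The beir package relies on pytrec_eval for this computation; the standalone sketch below, using standard linear gains and a toy relevance example, is only meant to illustrate the formula.

```python
import math

def ndcg_at_10(ranked_doc_ids, relevance):
    """nDCG@10 for a single query.

    ranked_doc_ids: document ids in retrieved order (best first).
    relevance: dict mapping document id -> graded relevance (0 = not relevant).
    """
    # DCG@10 of the retrieved ranking: gain discounted by log2 of the position.
    dcg = sum(relevance.get(doc_id, 0) / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranked_doc_ids[:10]))
    # Ideal DCG@10: the same gains arranged in the best possible order.
    ideal_gains = sorted(relevance.values(), reverse=True)[:10]
    idcg = sum(gain / math.log2(rank + 2) for rank, gain in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example: two relevant documents, one retrieved at rank 1, one missed.
print(ndcg_at_10(["d1", "d5", "d9"], {"d1": 1, "d2": 1}))  # ≈ 0.61
```

The benchmark-level scores in the table are the per-query values of this metric averaged over all test queries of the respective dataset.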
