TASK	DATASET	MODEL	METRIC NAME	METRIC VALUE	GLOBAL RANK
Zero-shot Text Search	BEIR	RetroMAE v2 (Xiao et al., 2022)	Avg. Accuracy	53.7	# 3
Zero-shot Text Search	BEIR	RetroMAE v2 (Xiao et al., 2022)	Avg. nDCG@10	47.5	# 2
Information Retrieval	MS MARCO	RetroMAE v2	MRR@10	42.58	# 1


RetroMAE v2: Duplex Masked Auto-Encoder For Pre-Training Retrieval-Oriented Language Models

16 Nov 2022 · Shitao Xiao, Zheng Liu

To better support retrieval applications such as web search and question answering, growing effort has been made to develop retrieval-oriented language models. Most existing work focuses on improving the semantic representation capability of the contextualized embedding of the [CLS] token. However, recent studies show that the ordinary tokens besides [CLS] may provide extra information, which helps to produce a better representation. As such, it is necessary to extend the current methods so that all contextualized embeddings can be jointly pre-trained for retrieval tasks. With this motivation, we propose a new pre-training method: the duplex masked auto-encoder, a.k.a. DupMAE, which aims to improve the semantic representation capacity of the contextualized embeddings of both [CLS] and ordinary tokens. It introduces two decoding tasks: one is to reconstruct the original input sentence based on the [CLS] embedding; the other is to minimize the bag-of-words (BoW) loss on the input sentence based on the embeddings of all ordinary tokens. The two decoding losses are added up to train a unified encoding model. The embeddings from [CLS] and ordinary tokens, after dimension reduction and aggregation, are concatenated as one unified semantic representation for the input. DupMAE is simple but empirically competitive: with a small decoding cost, it substantially improves the model's representation capability and transferability, achieving remarkable improvements on the MS MARCO and BEIR benchmarks.
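
The two-part objective and the concatenated output described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names are hypothetical, max-pooling is assumed as the aggregation over ordinary tokens, the softmax form of the BoW loss is an assumption, and the reconstruction loss is treated as a number computed elsewhere.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def bow_loss(token_vocab_logits, input_token_ids):
    """Bag-of-words loss sketch (hypothetical helper).

    token_vocab_logits: one row of vocabulary logits per ordinary token.
    input_token_ids: token ids of the input sentence (order is ignored).
    ASSUMPTION: rows are aggregated by max-pooling, and the loss is the mean
    negative log-probability of the distinct input tokens.
    """
    vocab_size = len(token_vocab_logits[0])
    pooled = [max(row[v] for row in token_vocab_logits) for v in range(vocab_size)]
    log_probs = log_softmax(pooled)
    targets = set(input_token_ids)
    return -sum(log_probs[t] for t in targets) / len(targets)


def total_loss(reconstruction_loss, bow):
    """The abstract states that the two decoding losses are added up."""
    return reconstruction_loss + bow


def unified_embedding(cls_reduced, tokens_aggregated):
    """Concatenate the reduced [CLS] vector with the aggregated token vector."""
    return list(cls_reduced) + list(tokens_aggregated)
```

With uniform logits over a four-word vocabulary, `bow_loss` reduces to `log(4)` regardless of which tokens appear in the input, which is a convenient sanity check for the pooling and normalization steps.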

Code

staoxiao/retromae (official, 203 stars)

Tasks

Dimensionality Reduction

Information Retrieval

Question Answering

Retrieval

Sentence

Zero-shot Text Search

Datasets

MS MARCO

BEIR

Results from the Paper

Ranked #1 on Information Retrieval on MS MARCO (MRR@10 metric)


