Search Results for author: Shitao Xiao

Found 22 papers, 18 papers with code

Matching-oriented Embedding Quantization For Ad-hoc Retrieval

1 code implementation EMNLP 2021 Shitao Xiao, Zheng Liu, Yingxia Shao, Defu Lian, Xing Xie

In this work, we propose the Matching-oriented Product Quantization (MoPQ), where a novel objective Multinoulli Contrastive Loss (MCL) is formulated.

Quantization Retrieval
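The Multinoulli Contrastive Loss named above can be illustrated with a small sketch: document embeddings are product-quantized against per-subspace codebooks, and the query is matched against the quantized documents under a softmax (Multinoulli) distribution over in-batch candidates. Shapes, codebook sizes, and function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def product_quantize(doc_emb, codebooks):
    """Replace each sub-vector with its nearest codeword (illustrative PQ step)."""
    M, K, d_sub = codebooks.shape            # M subspaces, K codewords each
    subvecs = doc_emb.reshape(M, d_sub)
    codes = [int(np.argmin(((codebooks[m] - subvecs[m]) ** 2).sum(-1)))
             for m in range(M)]
    return np.concatenate([codebooks[m][c] for m, c in enumerate(codes)])

def multinoulli_contrastive_loss(query, quantized_docs, pos_idx):
    """Negative log-likelihood of the positive document under a softmax
    over query-document matching scores (in-batch negatives)."""
    scores = quantized_docs @ query
    logits = scores - scores.max()           # numeric stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -float(np.log(probs[pos_idx]))
```

Minimizing this loss pushes the query toward the quantized form of its positive document, so the quantization codes are learned for matching rather than for reconstruction.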

BGE Landmark Embedding: A Chunking-Free Embedding Method For Retrieval Augmented Long-Context Large Language Models

no code implementations 18 Feb 2024 Kun Luo, Zheng Liu, Shitao Xiao, Kang Liu

In this work, we propose Extensible Embedding, which realizes a high-quality extension of the LLM's context with strong flexibility and cost-effectiveness.

Chunking Language Modelling +1

Extensible Embedding: A Flexible Multipler For LLM's Context Length

no code implementations 18 Feb 2024 Ninglu Shao, Shitao Xiao, Zheng Liu, Peitian Zhang

2) Strong sample efficiency of training, which enables the embedding model to be learned in a cost-effective way.

Language Modelling

BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

1 code implementation 5 Feb 2024 Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, Zheng Liu

It can simultaneously perform the three common retrieval functionalities of an embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval, which provides a unified model foundation for real-world IR applications.

Retrieval Self-Knowledge Distillation
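The three retrieval functionalities named above can be sketched as three scoring functions over one query-document pair, combined into a hybrid score. The relevance weights and helper names are assumptions for illustration, not M3-Embedding's actual API.

```python
import numpy as np

def dense_score(q_vec, d_vec):
    """Single-vector inner product (dense retrieval)."""
    return float(q_vec @ d_vec)

def sparse_score(q_weights, d_weights):
    """Lexical matching: sum of term-weight products over shared terms."""
    return sum(w * d_weights[t] for t, w in q_weights.items() if t in d_weights)

def multi_vector_score(q_toks, d_toks):
    """Late interaction: each query token takes its best document-token match."""
    return float((q_toks @ d_toks.T).max(axis=1).sum())

def hybrid_score(q_vec, d_vec, q_w, d_w, q_toks, d_toks, w=(1.0, 0.3, 1.0)):
    """Weighted combination of the three scores (weights are illustrative)."""
    return (w[0] * dense_score(q_vec, d_vec)
            + w[1] * sparse_score(q_w, d_w)
            + w[2] * multi_vector_score(q_toks, d_toks))
```

In practice the sparse score is cheap enough for first-stage filtering, while the multi-vector score is typically reserved for re-ranking a short candidate list.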

Flexibly Scaling Large Language Models Contexts Through Extensible Tokenization

1 code implementation 15 Jan 2024 Ninglu Shao, Shitao Xiao, Zheng Liu, Peitian Zhang

Extensible Tokenization acts as middleware between the tokenized context and the LLM, transforming the raw token embeddings into extensible embeddings.

Few-Shot Learning Language Modelling

Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon

1 code implementation 7 Jan 2024 Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, Zhicheng Dou

Although the context window can be extended through fine-tuning, this incurs a considerable cost at both training and inference time, and exerts an unfavorable impact on the LLM's original capabilities.

4k Language Modelling

Making Large Language Models A Better Foundation For Dense Retrieval

1 code implementation 24 Dec 2023 Chaofan Li, Zheng Liu, Shitao Xiao, Yingxia Shao

LLaRA consists of two pretext tasks: EBAE (Embedding-Based Auto-Encoding) and EBAR (Embedding-Based Auto-Regression), where the text embeddings from LLM are used to reconstruct the tokens for the input sentence and predict the tokens for the next sentence, respectively.

Retrieval Sentence +1

LM-Cocktail: Resilient Tuning of Language Models via Model Merging

1 code implementation 22 Nov 2023 Shitao Xiao, Zheng Liu, Peitian Zhang, Xingrun Xing

Despite its simplicity, LM-Cocktail is surprisingly effective: the resulting model achieves strong empirical performance across the whole scope of general tasks while preserving a superior capacity in its targeted domain.

Language Modelling
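Model merging of this kind amounts to a convex combination of parameter tensors from the fine-tuned model and its peers. The sketch below assumes the models share an architecture and uses numpy arrays in place of real checkpoints; the function name is hypothetical.

```python
import numpy as np

def merge_state_dicts(state_dicts, weights):
    """Weighted average of same-shaped parameter dicts (merging-style sketch)."""
    assert abs(sum(weights) - 1.0) < 1e-9, "merging weights should sum to 1"
    merged = {}
    for name in state_dicts[0]:
        # Every model contributes its tensor for this parameter, scaled by its weight.
        merged[name] = sum(w * sd[name] for w, sd in zip(weights, state_dicts))
    return merged
```

Tuning the weights trades off how much the merged model leans toward the fine-tuned domain model versus the base model's general abilities.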

Retrieve Anything To Augment Large Language Models

1 code implementation 11 Oct 2023 Peitian Zhang, Shitao Xiao, Zheng Liu, Zhicheng Dou, Jian-Yun Nie

On the other hand, the task-specific retrievers lack the required versatility, hindering their performance across the diverse retrieval augmentation scenarios.

Knowledge Distillation Retrieval

C-Pack: Packaged Resources To Advance General Chinese Embedding

3 code implementations 14 Sep 2023 Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff

Along with our resources on general Chinese embedding, we release our data and models for English text embeddings.

RetroMAE-2: Duplex Masked Auto-Encoder For Pre-Training Retrieval-Oriented Language Models

1 code implementation 4 May 2023 Shitao Xiao, Zheng Liu, Yingxia Shao, Zhao Cao

It is designed to improve the quality of semantic representation where all contextualized embeddings of the pre-trained model can be leveraged.

Information Retrieval Open-Domain Question Answering +2

RetroMAE v2: Duplex Masked Auto-Encoder For Pre-Training Retrieval-Oriented Language Models

1 code implementation 16 Nov 2022 Shitao Xiao, Zheng Liu

DupMAE targets improving the semantic representation capacity of the contextualized embeddings of both [CLS] and ordinary tokens.

 Ranked #1 on Information Retrieval on MS MARCO (MRR@10 metric)

Dimensionality Reduction Information Retrieval +4

Hybrid Inverted Index Is a Robust Accelerator for Dense Retrieval

1 code implementation 11 Oct 2022 Peitian Zhang, Zheng Liu, Shitao Xiao, Zhicheng Dou, Jing Yao

Based on comprehensive experiments on popular retrieval benchmarks, we verify that clusters and terms indeed complement each other, enabling HI$^2$ to achieve lossless retrieval quality with competitive efficiency across various index settings.

Knowledge Distillation Quantization +1
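The complementary use of clusters and terms described above can be sketched as an inverted index whose posting lists are keyed by both coarse cluster ids and salient terms, so a document is reachable along either route. Class and method names here are hypothetical, not the paper's code.

```python
from collections import defaultdict

class HybridInvertedIndex:
    """Toy index: documents are posted under their cluster and their terms."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, cluster_id, terms):
        self.postings[("cluster", cluster_id)].add(doc_id)
        for term in terms:
            self.postings[("term", term)].add(doc_id)

    def candidates(self, cluster_ids, terms):
        """Union of cluster-based and term-based posting lists."""
        found = set()
        for c in cluster_ids:
            found |= self.postings[("cluster", c)]
        for t in terms:
            found |= self.postings[("term", t)]
        return found
```

A document missed by the nearest clusters can still surface through a shared salient term, which is one way the two views complement each other.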

RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked Auto-Encoder

1 code implementation 24 May 2022 Shitao Xiao, Zheng Liu, Yingxia Shao, Zhao Cao

The sentence embedding is generated from the encoder's masked input; then, the original sentence is recovered based on the sentence embedding and the decoder's masked input via masked language modeling.

Information Retrieval Language Modelling +6
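The asymmetric masking workflow described above (the encoder sees a moderately masked input, while the decoder must reconstruct from the sentence embedding plus a much more aggressively masked input) can be sketched as follows. The ratios and mask id are illustrative, not the paper's exact settings.

```python
import numpy as np

def mask_tokens(token_ids, ratio, rng, mask_id=0):
    """Randomly replace a fraction of token ids with a mask token."""
    masked = token_ids.copy()
    n_mask = max(1, int(len(token_ids) * ratio))
    idx = rng.choice(len(token_ids), size=n_mask, replace=False)
    masked[idx] = mask_id
    return masked

rng = np.random.default_rng(7)
sentence = np.arange(1, 21)                        # 20 toy token ids (1..20)
encoder_input = mask_tokens(sentence, 0.30, rng)   # moderate mask for encoding
decoder_input = mask_tokens(sentence, 0.70, rng)   # aggressive mask for reconstruction
```

Because the decoder gets so little of the original text, the reconstruction objective forces most of the sentence's information through the encoder's sentence embedding.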

A Mutually Reinforced Framework for Pretrained Sentence Embeddings

no code implementations 28 Feb 2022 Junhan Yang, Zheng Liu, Shitao Xiao, Jianxun Lian, Lijun Wu, Defu Lian, Guangzhong Sun, Xing Xie

Instead of relying on annotation heuristics defined by humans, it leverages the sentence representation model itself and realizes the following iterative self-supervision process: on one hand, the improvement of sentence representation may contribute to the quality of data annotation; on the other hand, more effective data annotation helps to generate high-quality positive samples, which will further improve the current sentence representation model.

Contrastive Learning Sentence +1

Progressively Optimized Bi-Granular Document Representation for Scalable Embedding Based Retrieval

2 code implementations 14 Jan 2022 Shitao Xiao, Zheng Liu, Weihao Han, Jianjin Zhang, Yingxia Shao, Defu Lian, Chaozhuo Li, Hao Sun, Denvy Deng, Liangjie Zhang, Qi Zhang, Xing Xie

In this work, we tackle this problem with Bi-Granular Document Representation, where the lightweight sparse embeddings are indexed and standby in memory for coarse-grained candidate search, and the heavyweight dense embeddings are hosted in disk for fine-grained post verification.

Quantization Retrieval
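The coarse-then-fine workflow described above can be sketched as a two-stage search: lightweight embeddings held in memory generate candidates, and heavyweight dense embeddings, here simulated by a lookup function standing in for disk reads, re-rank them. All names and shapes are illustrative assumptions.

```python
import numpy as np

def coarse_search(query_light, light_index, k):
    """Stage 1: candidate generation with in-memory lightweight embeddings."""
    scores = light_index @ query_light
    return list(np.argsort(-scores)[:k])

def fine_rerank(query_dense, candidate_ids, load_dense_embedding):
    """Stage 2: post-verification with dense embeddings fetched on demand."""
    dense = np.stack([load_dense_embedding(i) for i in candidate_ids])
    order = np.argsort(-(dense @ query_dense))
    return [candidate_ids[i] for i in order]
```

Only the short candidate list ever triggers the expensive dense lookups, which is what lets the fine-grained embeddings live outside memory.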

LECF: Recommendation via Learnable Edge Collaborative Filtering

1 code implementation Science China Information Sciences 2021 Shitao Xiao, Yingxia Shao, Yawen Li, Hongzhi Yin, Yanyan Shen, Bin Cui

In this paper, we model an interaction between user and item as an edge and propose a novel CF framework, called learnable edge collaborative filtering (LECF).

Collaborative Filtering

GraphFormers: GNN-nested Transformers for Representation Learning on Textual Graph

1 code implementation NeurIPS 2021 Junhan Yang, Zheng Liu, Shitao Xiao, Chaozhuo Li, Defu Lian, Sanjay Agrawal, Amit Singh, Guangzhong Sun, Xing Xie

The representation learning on textual graph is to generate low-dimensional embeddings for the nodes based on the individual textual features and the neighbourhood information.

Language Modelling Recommendation Systems +1

Matching-oriented Product Quantization For Ad-hoc Retrieval

2 code implementations 16 Apr 2021 Shitao Xiao, Zheng Liu, Yingxia Shao, Defu Lian, Xing Xie

In this work, we propose the Matching-oriented Product Quantization (MoPQ), where a novel objective Multinoulli Contrastive Loss (MCL) is formulated.

Quantization Retrieval

Training Large-Scale News Recommenders with Pretrained Language Models in the Loop

1 code implementation 18 Feb 2021 Shitao Xiao, Zheng Liu, Yingxia Shao, Tao Di, Xing Xie

Secondly, it improves the data efficiency of the training workflow, where non-informative data can be eliminated from encoding.

News Recommendation Recommendation Systems
