Search Results for author: Shitao Xiao

Found 22 papers, 18 papers with code

Matching-oriented Embedding Quantization For Ad-hoc Retrieval

1 code implementation EMNLP 2021 Shitao Xiao, Zheng Liu, Yingxia Shao, Defu Lian, Xing Xie

In this work, we propose the Matching-oriented Product Quantization (MoPQ), where a novel objective Multinoulli Contrastive Loss (MCL) is formulated.

Quantization Retrieval
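The Multinoulli Contrastive Loss named above can be illustrated with a small sketch: document embeddings are product-quantized against per-subspace codebooks, and the query is matched against the quantized documents under a softmax (Multinoulli) distribution over in-batch candidates. Shapes, codebook sizes, and function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def product_quantize(doc_emb, codebooks):
    """Replace each sub-vector with its nearest codeword (illustrative PQ step)."""
    M, K, d_sub = codebooks.shape            # M subspaces, K codewords each
    subvecs = doc_emb.reshape(M, d_sub)
    codes = [int(np.argmin(((codebooks[m] - subvecs[m]) ** 2).sum(-1)))
             for m in range(M)]
    return np.concatenate([codebooks[m][c] for m, c in enumerate(codes)])

def multinoulli_contrastive_loss(query, quantized_docs, pos_idx):
    """Negative log-likelihood of the positive document under a softmax
    over query-document matching scores (in-batch negatives)."""
    scores = quantized_docs @ query
    logits = scores - scores.max()           # numeric stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -float(np.log(probs[pos_idx]))
```

Minimizing this loss pushes the query toward the quantized form of its positive document, so the quantization codes are learned for matching rather than for reconstruction.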

BGE Landmark Embedding: A Chunking-Free Embedding Method For Retrieval Augmented Long-Context Large Language Models

no code implementations 18 Feb 2024 Kun Luo, Zheng Liu, Shitao Xiao, Kang Liu

In this work, we propose Extensible Embedding, which realizes a high-quality extension of the LLM's context with strong flexibility and cost-effectiveness.

Chunking Language Modelling +1

Extensible Embedding: A Flexible Multipler For LLM's Context Length

no code implementations 18 Feb 2024 Ninglu Shao, Shitao Xiao, Zheng Liu, Peitian Zhang

2) Strong sample efficiency of training, which enables the embedding model to be learned in a cost-effective way.

Language Modelling

BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

1 code implementation 5 Feb 2024 Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, Zheng Liu

It can simultaneously perform the three common retrieval functionalities of an embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval, which provides a unified model foundation for real-world IR applications.

Retrieval Self-Knowledge Distillation
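The three retrieval functionalities named above can be sketched as three scoring functions over one query-document pair, combined into a hybrid score. The relevance weights and helper names are assumptions for illustration, not M3-Embedding's actual API.

```python
import numpy as np

def dense_score(q_vec, d_vec):
    """Single-vector inner product (dense retrieval)."""
    return float(q_vec @ d_vec)

def sparse_score(q_weights, d_weights):
    """Lexical matching: sum of term-weight products over shared terms."""
    return sum(w * d_weights[t] for t, w in q_weights.items() if t in d_weights)

def multi_vector_score(q_toks, d_toks):
    """Late interaction: each query token takes its best document-token match."""
    return float((q_toks @ d_toks.T).max(axis=1).sum())

def hybrid_score(q_vec, d_vec, q_w, d_w, q_toks, d_toks, w=(1.0, 0.3, 1.0)):
    """Weighted combination of the three scores (weights are illustrative)."""
    return (w[0] * dense_score(q_vec, d_vec)
            + w[1] * sparse_score(q_w, d_w)
            + w[2] * multi_vector_score(q_toks, d_toks))
```

In practice the sparse score is cheap enough for first-stage filtering, while the multi-vector score is typically reserved for re-ranking a short candidate list.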

Flexibly Scaling Large Language Models Contexts Through Extensible Tokenization

1 code implementation 15 Jan 2024 Ninglu Shao, Shitao Xiao, Zheng Liu, Peitian Zhang

Extensible Tokenization acts as middleware between the tokenized context and the LLM, transforming the raw token embeddings into extensible embeddings.

Few-Shot Learning Language Modelling

Soaring from 4K to 400K: Extending LLM's Context with Activation Beacon

1 code implementation 7 Jan 2024 Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, Zhicheng Dou

Although the context window can be extended through fine-tuning, this incurs a considerable cost at both training and inference time, and exerts an unfavorable impact on the LLM's original capabilities.

4k Language Modelling

Making Large Language Models A Better Foundation For Dense Retrieval

1 code implementation 24 Dec 2023 Chaofan Li, Zheng Liu, Shitao Xiao, Yingxia Shao

LLaRA consists of two pretext tasks: EBAE (Embedding-Based Auto-Encoding) and EBAR (Embedding-Based Auto-Regression), where the text embeddings from LLM are used to reconstruct the tokens for the input sentence and predict the tokens for the next sentence, respectively.

Retrieval Sentence +1

LM-Cocktail: Resilient Tuning of Language Models via Model Merging

1 code implementation 22 Nov 2023 Shitao Xiao, Zheng Liu, Peitian Zhang, Xingrun Xing

Despite its simplicity, LM-Cocktail is surprisingly effective: the resulting model achieves strong empirical performance across the whole scope of general tasks while preserving a superior capacity in its targeted domain.

Language Modelling
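Model merging of this kind amounts to a convex combination of parameter tensors from the fine-tuned model and its peers. The sketch below assumes the models share an architecture and uses numpy arrays in place of real checkpoints; the function name is hypothetical.

```python
import numpy as np

def merge_state_dicts(state_dicts, weights):
    """Weighted average of same-shaped parameter dicts (merging-style sketch)."""
    assert abs(sum(weights) - 1.0) < 1e-9, "merging weights should sum to 1"
    merged = {}
    for name in state_dicts[0]:
        # Every model contributes its tensor for this parameter, scaled by its weight.
        merged[name] = sum(w * sd[name] for w, sd in zip(weights, state_dicts))
    return merged
```

Tuning the weights trades off how much the merged model leans toward the fine-tuned domain model versus the base model's general abilities.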

Retrieve Anything To Augment Large Language Models

1 code implementation 11 Oct 2023 Peitian Zhang, Shitao Xiao, Zheng Liu, Zhicheng Dou, Jian-Yun Nie

On the other hand, the task-specific retrievers lack the required versatility, hindering their performance across the diverse retrieval augmentation scenarios.

Knowledge Distillation Retrieval

C-Pack: Packaged Resources To Advance General Chinese Embedding

3 code implementations 14 Sep 2023 Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff

Along with our resources on general Chinese embedding, we release our data and models for English text embeddings.

RetroMAE-2: Duplex Masked Auto-Encoder For Pre-Training Retrieval-Oriented Language Models

1 code implementation 4 May 2023 Shitao Xiao, Zheng Liu, Yingxia Shao, Zhao Cao

It is designed to improve the quality of semantic representation where all contextualized embeddings of the pre-trained model can be leveraged.

Information Retrieval Open-Domain Question Answering +2

RetroMAE v2: Duplex Masked Auto-Encoder For Pre-Training Retrieval-Oriented Language Models

1 code implementation 16 Nov 2022 Shitao Xiao, Zheng Liu

DupMAE targets improving the semantic representation capacity of the contextualized embeddings of both [CLS] and ordinary tokens.

 Ranked #1 on Information Retrieval on MS MARCO (MRR@10 metric)

Dimensionality Reduction Information Retrieval +4

Hybrid Inverted Index Is a Robust Accelerator for Dense Retrieval

1 code implementation 11 Oct 2022 Peitian Zhang, Zheng Liu, Shitao Xiao, Zhicheng Dou, Jing Yao

Based on comprehensive experiments on popular retrieval benchmarks, we verify that clusters and terms indeed complement each other, enabling HI$^2$ to achieve lossless retrieval quality with competitive efficiency across various index settings.

Knowledge Distillation Quantization +1
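The complementary use of clusters and terms described above can be sketched as an inverted index whose posting lists are keyed by both coarse cluster ids and salient terms, so a document is reachable along either route. Class and method names here are hypothetical, not the paper's code.

```python
from collections import defaultdict

class HybridInvertedIndex:
    """Toy index: documents are posted under their cluster and their terms."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, cluster_id, terms):
        self.postings[("cluster", cluster_id)].add(doc_id)
        for term in terms:
            self.postings[("term", term)].add(doc_id)

    def candidates(self, cluster_ids, terms):
        """Union of cluster-based and term-based posting lists."""
        found = set()
        for c in cluster_ids:
            found |= self.postings[("cluster", c)]
        for t in terms:
            found |= self.postings[("term", t)]
        return found
```

A document missed by the nearest clusters can still surface through a shared salient term, which is one way the two views complement each other.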

RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked Auto-Encoder

1 code implementation 24 May 2022 Shitao Xiao, Zheng Liu, Yingxia Shao, Zhao Cao

The sentence embedding is generated from the encoder's masked input; then, the original sentence is recovered based on the sentence embedding and the decoder's masked input via masked language modeling.

Information Retrieval Language Modelling +6
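The asymmetric masking workflow described above (the encoder sees a moderately masked input, while the decoder must reconstruct from the sentence embedding plus a much more aggressively masked input) can be sketched as follows. The ratios and mask id are illustrative, not the paper's exact settings.

```python
import numpy as np

def mask_tokens(token_ids, ratio, rng, mask_id=0):
    """Randomly replace a fraction of token ids with a mask token."""
    masked = token_ids.copy()
    n_mask = max(1, int(len(token_ids) * ratio))
    idx = rng.choice(len(token_ids), size=n_mask, replace=False)
    masked[idx] = mask_id
    return masked

rng = np.random.default_rng(7)
sentence = np.arange(1, 21)                        # 20 toy token ids (1..20)
encoder_input = mask_tokens(sentence, 0.30, rng)   # moderate mask for encoding
decoder_input = mask_tokens(sentence, 0.70, rng)   # aggressive mask for reconstruction
```

Because the decoder gets so little of the original text, the reconstruction objective forces most of the sentence's information through the encoder's sentence embedding.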

A Mutually Reinforced Framework for Pretrained Sentence Embeddings

no code implementations 28 Feb 2022 Junhan Yang, Zheng Liu, Shitao Xiao, Jianxun Lian, Lijun Wu, Defu Lian, Guangzhong Sun, Xing Xie

Instead of relying on annotation heuristics defined by humans, it leverages the sentence representation model itself and realizes the following iterative self-supervision process: on one hand, the improvement of sentence representation may contribute to the quality of data annotation; on the other hand, more effective data annotation helps to generate high-quality positive samples, which will further improve the current sentence representation model.

Contrastive Learning Sentence +1

Progressively Optimized Bi-Granular Document Representation for Scalable Embedding Based Retrieval

2 code implementations 14 Jan 2022 Shitao Xiao, Zheng Liu, Weihao Han, Jianjin Zhang, Yingxia Shao, Defu Lian, Chaozhuo Li, Hao Sun, Denvy Deng, Liangjie Zhang, Qi Zhang, Xing Xie

In this work, we tackle this problem with Bi-Granular Document Representation, where the lightweight sparse embeddings are indexed and standby in memory for coarse-grained candidate search, and the heavyweight dense embeddings are hosted in disk for fine-grained post verification.

Quantization Retrieval
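The coarse-then-fine workflow described above can be sketched as a two-stage search: lightweight embeddings held in memory generate candidates, and heavyweight dense embeddings, here simulated by a lookup function standing in for disk reads, re-rank them. All names and shapes are illustrative assumptions.

```python
import numpy as np

def coarse_search(query_light, light_index, k):
    """Stage 1: candidate generation with in-memory lightweight embeddings."""
    scores = light_index @ query_light
    return list(np.argsort(-scores)[:k])

def fine_rerank(query_dense, candidate_ids, load_dense_embedding):
    """Stage 2: post-verification with dense embeddings fetched on demand."""
    dense = np.stack([load_dense_embedding(i) for i in candidate_ids])
    order = np.argsort(-(dense @ query_dense))
    return [candidate_ids[i] for i in order]
```

Only the short candidate list ever triggers the expensive dense lookups, which is what lets the fine-grained embeddings live outside memory.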

LECF: Recommendation via Learnable Edge Collaborative Filtering

1 code implementation Science China Information Sciences 2021 Shitao Xiao, Yingxia Shao, Yawen Li, Hongzhi Yin, Yanyan Shen, Bin Cui

In this paper, we model an interaction between user and item as an edge and propose a novel CF framework, called learnable edge collaborative filtering (LECF).

Collaborative Filtering

GraphFormers: GNN-nested Transformers for Representation Learning on Textual Graph

1 code implementation NeurIPS 2021 Junhan Yang, Zheng Liu, Shitao Xiao, Chaozhuo Li, Defu Lian, Sanjay Agrawal, Amit Singh, Guangzhong Sun, Xing Xie

The representation learning on textual graph is to generate low-dimensional embeddings for the nodes based on the individual textual features and the neighbourhood information.

Language Modelling Recommendation Systems +1

Matching-oriented Product Quantization For Ad-hoc Retrieval

2 code implementations 16 Apr 2021 Shitao Xiao, Zheng Liu, Yingxia Shao, Defu Lian, Xing Xie

In this work, we propose the Matching-oriented Product Quantization (MoPQ), where a novel objective Multinoulli Contrastive Loss (MCL) is formulated.

Quantization Retrieval

Training Large-Scale News Recommenders with Pretrained Language Models in the Loop

1 code implementation 18 Feb 2021 Shitao Xiao, Zheng Liu, Yingxia Shao, Tao Di, Xing Xie

Secondly, it improves the data efficiency of the training workflow, where non-informative data can be eliminated from encoding.

News Recommendation Recommendation Systems
