Search Results for author: Shitao Xiao

Found 32 papers, 26 papers with code

Matching-oriented Embedding Quantization For Ad-hoc Retrieval

1 code implementation EMNLP 2021 Shitao Xiao, Zheng Liu, Yingxia Shao, Defu Lian, Xing Xie

In this work, we propose the Matching-oriented Product Quantization (MoPQ), where a novel objective Multinoulli Contrastive Loss (MCL) is formulated.

Quantization Retrieval

Making Text Embedders Few-Shot Learners

1 code implementation 24 Sep 2024 Chaofan Li, Minghao Qin, Shitao Xiao, Jianlyu Chen, Kun Luo, Yingxia Shao, Defu Lian, Zheng Liu

To this end, we introduce a novel model bge-en-icl, which employs few-shot examples to produce high-quality text embeddings.

Decoder In-Context Learning
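The snippet above describes embedding queries with few-shot demonstrations prepended. A minimal sketch of that idea follows; the template, field names, and separators are illustrative assumptions, not bge-en-icl's actual prompt format.

```python
# Sketch: building an in-context prompt before passing text to an
# embedding model. The "Instruct:/Query:/Response:" template below is
# a hypothetical stand-in for the model's real format.

def build_icl_query(task: str, examples: list, query: str) -> str:
    """Prepend few-shot (query, passage) demonstrations to the query."""
    parts = [f"Instruct: {task}"]
    for demo_q, demo_p in examples:
        parts.append(f"Query: {demo_q}\nResponse: {demo_p}")
    parts.append(f"Query: {query}")
    return "\n\n".join(parts)

prompt = build_icl_query(
    "Given a web search query, retrieve relevant passages.",
    [("what is product quantization",
      "Product quantization compresses vectors into compact codes.")],
    "how do text embeddings work",
)
print(prompt)
```

The resulting string, rather than the bare query, would then be fed to the embedder, letting the few-shot examples steer the representation.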

Large Language Models as Foundations for Next-Gen Dense Retrieval: A Comprehensive Empirical Assessment

no code implementations 22 Aug 2024 Kun Luo, Minghao Qin, Zheng Liu, Shitao Xiao, Jun Zhao, Kang Liu

In this work, we conduct a comprehensive empirical study on a wide range of retrieval tasks, including in-domain accuracy, data efficiency, zero-shot generalization, lengthy retrieval, instruction-based retrieval, and multi-task learning.

Multi-Task Learning Retrieval +1

SpikeLLM: Scaling up Spiking Neural Network to Large Language Models via Saliency-based Spiking

1 code implementation 5 Jul 2024 Xingrun Xing, Boyan Gao, Zheng Zhang, David A. Clifton, Shitao Xiao, Li Du, Guoqi Li, Jiajun Zhang

In contrast, human brains, which contain approximately 86 billion biological neurons, exhibit significantly greater energy efficiency compared to LLMs with a similar number of parameters.

Language Modelling Large Language Model +1

MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding

3 code implementations 6 Jun 2024 Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, Zheng Liu

To address the above problems, we propose a new benchmark, called MLVU (Multi-task Long Video Understanding Benchmark), for the comprehensive and in-depth evaluation of LVU.

Video Understanding

VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval

1 code implementation 6 Jun 2024 Junjie Zhou, Zheng Liu, Shitao Xiao, Bo Zhao, Yongping Xiong

Thirdly, we introduce a multi-stage training algorithm, which first aligns the visual token embedding with the text encoder using massive weakly labeled data, and then develops multi-modal representation capability using the generated composed image-text data.

Image Retrieval Retrieval

Compressing Lengthy Context With UltraGist

1 code implementation 26 May 2024 Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, Zhicheng Dou

Compressing lengthy context is a critical but technically challenging problem.

Few-Shot Learning

Extending Llama-3's Context Ten-Fold Overnight

1 code implementation 30 Apr 2024 Peitian Zhang, Ninglu Shao, Zheng Liu, Shitao Xiao, Hongjin Qian, Qiwei Ye, Zhicheng Dou

We extend the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA fine-tuning.

8k Retrieval

Extensible Embedding: A Flexible Multipler For LLM's Context Length

no code implementations 18 Feb 2024 Ninglu Shao, Shitao Xiao, Zheng Liu, Peitian Zhang

2) Strong sample efficiency of training, which enables the embedding model to be learned in a cost-effective way.

Language Modelling

BGE Landmark Embedding: A Chunking-Free Embedding Method For Retrieval Augmented Long-Context Large Language Models

no code implementations 18 Feb 2024 Kun Luo, Zheng Liu, Shitao Xiao, Kang Liu

In this work, we propose Extensible Embedding, which realizes high-quality extension of LLM's context with strong flexibility and cost-effectiveness.

Chunking Language Modelling +1

BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

2 code implementations 5 Feb 2024 Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, Zheng Liu

It can simultaneously perform the three common retrieval functionalities of an embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval, providing a unified model foundation for real-world IR applications.

Retrieval Self-Knowledge Distillation
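The three retrieval functionalities named in the snippet can be illustrated with toy scoring functions. This is a minimal sketch assuming precomputed vectors and lexical weights; the dimensions and the way sparse weights are stored are placeholders, not M3-Embedding's actual interface.

```python
import numpy as np

def dense_score(q_cls, p_cls):
    # dense retrieval: dot product of single sentence-level vectors
    return float(q_cls @ p_cls)

def sparse_score(q_weights, p_weights):
    # sparse retrieval: lexical match over shared terms
    return sum(w * p_weights[t] for t, w in q_weights.items() if t in p_weights)

def multi_vector_score(q_tokens, p_tokens):
    # multi-vector retrieval: late interaction, max similarity per query token
    sims = q_tokens @ p_tokens.T
    return float(sims.max(axis=1).sum())

q_cls, p_cls = np.array([1.0, 0.0]), np.array([0.8, 0.6])
print(dense_score(q_cls, p_cls))                              # 0.8
print(sparse_score({"cat": 0.5}, {"cat": 0.4, "dog": 0.2}))   # 0.2
q_tok = np.array([[1.0, 0.0], [0.0, 1.0]])
p_tok = np.array([[0.9, 0.1]])
print(multi_vector_score(q_tok, p_tok))                       # 1.0
```

In the paper's framing, one model produces all three kinds of representations, so scores like these can also be combined into a single hybrid ranking.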

Flexibly Scaling Large Language Models Contexts Through Extensible Tokenization

1 code implementation 15 Jan 2024 Ninglu Shao, Shitao Xiao, Zheng Liu, Peitian Zhang

Extensible Tokenization stands as a middleware between the tokenized context and the LLM, transforming the raw token embeddings into extensible embeddings.

Few-Shot Learning Language Modelling

Long Context Compression with Activation Beacon

1 code implementation 7 Jan 2024 Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, Zhicheng Dou

In this paper, we propose Activation Beacon, a plug-in module for transformer-based LLMs that targets effective, efficient, and flexible compression of long contexts.

4k document understanding +2

Making Large Language Models A Better Foundation For Dense Retrieval

1 code implementation 24 Dec 2023 Chaofan Li, Zheng Liu, Shitao Xiao, Yingxia Shao

LLaRA consists of two pretext tasks: EBAE (Embedding-Based Auto-Encoding) and EBAR (Embedding-Based Auto-Regression), where the text embeddings from LLM are used to reconstruct the tokens for the input sentence and predict the tokens for the next sentence, respectively.

Retrieval Sentence +1
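The two pretext tasks in the snippet can be sketched with a toy objective: a single sentence embedding is pushed to predict both the input sentence's tokens (EBAE) and the next sentence's tokens (EBAR). The dimensions, the linear prediction head, and the random vectors below are illustrative stand-ins, not LLaRA's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 10, 4
W = rng.normal(size=(dim, vocab))   # shared token-prediction head (toy)
emb = rng.normal(size=dim)          # stand-in for the LLM's text embedding

def token_prediction_loss(embedding, target_tokens):
    # cross-entropy of the embedding's token distribution on the targets
    logits = embedding @ W
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target_tokens].mean()

input_tokens = np.array([1, 3, 5])  # tokens of the input sentence (EBAE)
next_tokens = np.array([2, 4])      # tokens of the next sentence (EBAR)
loss = (token_prediction_loss(emb, input_tokens)
        + token_prediction_loss(emb, next_tokens))
print(loss > 0)  # True: both cross-entropy terms are positive
```

Training on both terms forces the one embedding to summarize the current sentence and anticipate the next, which is what makes it retrieval-friendly.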

LM-Cocktail: Resilient Tuning of Language Models via Model Merging

1 code implementation 22 Nov 2023 Shitao Xiao, Zheng Liu, Peitian Zhang, Xingrun Xing

Despite its simplicity, LM-Cocktail is surprisingly effective: the resulting model achieves strong empirical performance across the whole scope of general tasks while preserving a superior capacity in its targeted domain.

Language Modelling MMLU
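Model merging of the kind described above can be sketched as a weighted average of parameters. This is a minimal toy, assuming plain arrays stand in for the torch state dicts a real merge would operate on; the merging weights here are arbitrary.

```python
import numpy as np

def merge_models(state_dicts, weights):
    """Weighted average of per-parameter tensors across models."""
    assert abs(sum(weights) - 1.0) < 1e-9, "merging weights should sum to 1"
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name] for sd, w in zip(state_dicts, weights))
    return merged

fine_tuned = {"layer.w": np.array([2.0, 2.0])}  # toy fine-tuned parameters
base       = {"layer.w": np.array([0.0, 0.0])}  # toy base-model parameters
merged = merge_models([fine_tuned, base], [0.5, 0.5])
print(merged["layer.w"])  # [1. 1.]
```

The design intuition matches the snippet: interpolating toward the base (or peer) models trades a little of the fine-tuned model's specialization for recovered general capability.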

Retrieve Anything To Augment Large Language Models

1 code implementation 11 Oct 2023 Peitian Zhang, Shitao Xiao, Zheng Liu, Zhicheng Dou, Jian-Yun Nie

On the other hand, the task-specific retrievers lack the required versatility, hindering their performance across the diverse retrieval augmentation scenarios.

Knowledge Distillation Retrieval

C-Pack: Packed Resources For General Chinese Embeddings

2 code implementations 14 Sep 2023 Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, Jian-Yun Nie

Along with our resources on general Chinese embedding, we release our data and models for English text embeddings.

RetroMAE-2: Duplex Masked Auto-Encoder For Pre-Training Retrieval-Oriented Language Models

1 code implementation 4 May 2023 Shitao Xiao, Zheng Liu, Yingxia Shao, Zhao Cao

It is designed to improve the quality of semantic representation where all contextualized embeddings of the pre-trained model can be leveraged.

Information Retrieval Open-Domain Question Answering +2

RetroMAE v2: Duplex Masked Auto-Encoder For Pre-Training Retrieval-Oriented Language Models

1 code implementation 16 Nov 2022 Shitao Xiao, Zheng Liu

DupMAE targets improving the semantic representation capacity of the contextualized embeddings of both [CLS] and ordinary tokens.

 Ranked #1 on Information Retrieval on MS MARCO (MRR@10 metric)

Dimensionality Reduction Information Retrieval +4

Hybrid Inverted Index Is a Robust Accelerator for Dense Retrieval

1 code implementation 11 Oct 2022 Peitian Zhang, Zheng Liu, Shitao Xiao, Zhicheng Dou, Jing Yao

Based on comprehensive experiments on popular retrieval benchmarks, we verify that clusters and terms indeed complement each other, enabling HI² to achieve lossless retrieval quality with competitive efficiency across various index settings.

Knowledge Distillation Quantization +1

RetroMAE: Pre-Training Retrieval-oriented Language Models Via Masked Auto-Encoder

1 code implementation 24 May 2022 Shitao Xiao, Zheng Liu, Yingxia Shao, Zhao Cao

The sentence embedding is generated from the encoder's masked input; then, the original sentence is recovered based on the sentence embedding and the decoder's masked input via masked language modeling.

Decoder Information Retrieval +7
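The encode-then-reconstruct workflow in the snippet can be laid out structurally as follows. This is a shape-level sketch only: random vectors stand in for the real transformer encoder and decoder, mean pooling stands in for the [CLS] embedding, and the masking ratios are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim, length = 20, 8, 6
token_emb = rng.normal(size=(vocab, dim))      # toy token embedding table
sentence = rng.integers(1, vocab, size=length) # original token ids

def mask(tokens, ratio):
    out = tokens.copy()
    out[rng.random(length) < ratio] = 0        # id 0 plays the role of [MASK]
    return out

enc_input = mask(sentence, ratio=0.3)               # light masking for encoder
sent_embedding = token_emb[enc_input].mean(axis=0)  # stand-in sentence embedding
dec_input = mask(sentence, ratio=0.7)               # aggressive masking for decoder
dec_states = token_emb[dec_input] + sent_embedding  # decoder conditions on embedding
logits = dec_states @ token_emb.T                   # predict the original tokens
print(logits.shape)  # (6, 20): one distribution per position
```

The asymmetry is the point of the design: the decoder's input is masked so heavily that recovery is only possible if the sentence embedding carries most of the information.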

A Mutually Reinforced Framework for Pretrained Sentence Embeddings

no code implementations 28 Feb 2022 Junhan Yang, Zheng Liu, Shitao Xiao, Jianxun Lian, Lijun Wu, Defu Lian, Guangzhong Sun, Xing Xie

Instead of relying on annotation heuristics defined by humans, it leverages the sentence representation model itself and realizes the following iterative self-supervision process: on one hand, the improvement of sentence representation may contribute to the quality of data annotation; on the other hand, more effective data annotation helps to generate high-quality positive samples, which will further improve the current sentence representation model.

Contrastive Learning Sentence +1
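The iterative self-supervision loop described above can be caricatured with a tiny numerical experiment: the current embeddings mine positive pairs (nearest neighbours), training pulls each point toward its positive, and the improved embeddings yield better pairs on the next round. Everything here, from the update rule to the similarity measure, is an illustrative assumption, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 4))  # toy sentence embeddings

def avg_neighbour_sim(e):
    # average cosine similarity of each point to its nearest neighbour
    e = e / np.linalg.norm(e, axis=1, keepdims=True)
    sims = e @ e.T
    np.fill_diagonal(sims, -np.inf)
    return sims.max(axis=1).mean()

before = avg_neighbour_sim(emb)
for _ in range(5):
    norm = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = norm @ norm.T
    np.fill_diagonal(sims, -np.inf)
    pos = sims.argmax(axis=1)            # "annotate": nearest neighbour as positive
    emb = emb + 0.2 * (emb[pos] - emb)   # "train": pull each point toward its positive
after = avg_neighbour_sim(emb)
print(after > before)  # True: each round tightens the mined pairs
```

The toy only shows the feedback structure; the actual framework alternates full contrastive training with model-based annotation of positive samples.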

Progressively Optimized Bi-Granular Document Representation for Scalable Embedding Based Retrieval

2 code implementations 14 Jan 2022 Shitao Xiao, Zheng Liu, Weihao Han, Jianjin Zhang, Yingxia Shao, Defu Lian, Chaozhuo Li, Hao Sun, Denvy Deng, Liangjie Zhang, Qi Zhang, Xing Xie

In this work, we tackle this problem with Bi-Granular Document Representation, where the lightweight sparse embeddings are indexed and standby in memory for coarse-grained candidate search, and the heavyweight dense embeddings are hosted in disk for fine-grained post verification.

Quantization Retrieval
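The coarse-to-fine search the snippet describes can be sketched as a two-stage lookup: cheap in-memory embeddings shortlist candidates, and heavier embeddings re-rank only that shortlist. Both tiers are plain random arrays here; the dimensions, shortlist size, and scoring are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, sparse_dim, dense_dim = 100, 4, 16
sparse_index = rng.normal(size=(n_docs, sparse_dim))  # lightweight, kept in memory
dense_store = rng.normal(size=(n_docs, dense_dim))    # heavyweight, "hosted on disk"

def search(q_sparse, q_dense, shortlist_size=10, top_k=3):
    # Stage 1: coarse candidate search over the sparse embeddings
    coarse = sparse_index @ q_sparse
    candidates = np.argsort(-coarse)[:shortlist_size]
    # Stage 2: fine-grained post-verification with the dense embeddings,
    # touching only the shortlisted rows of the expensive store
    fine = dense_store[candidates] @ q_dense
    return candidates[np.argsort(-fine)[:top_k]]

hits = search(rng.normal(size=sparse_dim), rng.normal(size=dense_dim))
print(len(hits))  # 3 document ids, re-ranked by the dense tier
```

The payoff mirrors the paper's motivation: the memory footprint scales with the small sparse index, while dense-quality ranking is paid for only on the shortlist.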

LECF: Recommendation via Learnable Edge Collaborative Filtering

1 code implementation Science China Information Sciences 2021 Shitao Xiao, Yingxia Shao, Yawen Li, Hongzhi Yin, Yanyan Shen, Bin Cui

In this paper, we model an interaction between user and item as an edge and propose a novel CF framework, called learnable edge collaborative filtering (LECF).

Collaborative Filtering

GraphFormers: GNN-nested Transformers for Representation Learning on Textual Graph

1 code implementation NeurIPS 2021 Junhan Yang, Zheng Liu, Shitao Xiao, Chaozhuo Li, Defu Lian, Sanjay Agrawal, Amit Singh, Guangzhong Sun, Xing Xie

Representation learning on a textual graph generates low-dimensional embeddings for the nodes based on their individual textual features and neighbourhood information.

Language Modelling Recommendation Systems +1

Matching-oriented Product Quantization For Ad-hoc Retrieval

2 code implementations 16 Apr 2021 Shitao Xiao, Zheng Liu, Yingxia Shao, Defu Lian, Xing Xie

In this work, we propose the Matching-oriented Product Quantization (MoPQ), where a novel objective Multinoulli Contrastive Loss (MCL) is formulated.

Quantization Retrieval
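As background for the snippet above, product quantization (the "PQ" in MoPQ) splits a vector into sub-vectors and replaces each with its nearest codeword, so a few small integers stand in for many floats. The sketch below uses random codebooks purely for illustration; MoPQ's contribution is learning the codebooks under the Multinoulli Contrastive Loss rather than a reconstruction objective.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_sub, n_codes = 8, 4, 16
sub_dim = dim // n_sub
codebooks = rng.normal(size=(n_sub, n_codes, sub_dim))  # toy, unlearned codebooks

def pq_encode(x):
    # pick the nearest codeword in each sub-space
    codes = []
    for m in range(n_sub):
        sub = x[m * sub_dim:(m + 1) * sub_dim]
        dists = ((codebooks[m] - sub) ** 2).sum(axis=1)
        codes.append(int(dists.argmin()))
    return codes

def pq_decode(codes):
    # reconstruct the vector from the chosen codewords
    return np.concatenate([codebooks[m][c] for m, c in enumerate(codes)])

x = rng.normal(size=dim)
codes = pq_encode(x)          # 4 small integers instead of 8 floats
approx = pq_decode(codes)
print(len(codes), approx.shape)  # 4 (8,)
```

Because retrieval only ever compares queries against the reconstructed codewords, MoPQ argues the codebooks should be trained for query-document matching directly, which is what the contrastive loss formalizes.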

Training Large-Scale News Recommenders with Pretrained Language Models in the Loop

1 code implementation 18 Feb 2021 Shitao Xiao, Zheng Liu, Yingxia Shao, Tao Di, Xing Xie

Secondly, it improves the data efficiency of the training workflow, where non-informative data can be eliminated from encoding.

News Recommendation Recommendation Systems
