no code implementations • 24 Sep 2024 • Zheng Liu, Chenyuan Wu, Ninglu Shao, Shitao Xiao, Chaozhuo Li, Defu Lian
In this approach, the retrieved contexts are compressed into compact embeddings before being encoded by the LLMs.
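As a hedged illustration of this idea, the sketch below pools a retrieved passage into a handful of dense vectors that can be prepended to an LLM's input sequence; the module name, dimensions, and attention-based pooling are assumptions for exposition, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ContextCompressor(nn.Module):
    """Hypothetical sketch: pool a retrieved passage into a few compact
    embeddings that are prepended to the LLM's input embeddings."""

    def __init__(self, enc_dim: int, llm_dim: int, num_summary: int = 16):
        super().__init__()
        # Learnable query vectors attend over the passage's token states.
        self.queries = nn.Parameter(torch.randn(num_summary, enc_dim))
        self.attn = nn.MultiheadAttention(enc_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(enc_dim, llm_dim)  # map into the LLM embedding space

    def forward(self, passage_hidden: torch.Tensor) -> torch.Tensor:
        # passage_hidden: (batch, seq_len, enc_dim) from a frozen text encoder
        q = self.queries.unsqueeze(0).expand(passage_hidden.size(0), -1, -1)
        summary, _ = self.attn(q, passage_hidden, passage_hidden)
        return self.proj(summary)  # (batch, num_summary, llm_dim)
```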
1 code implementation • 24 Sep 2024 • Chaofan Li, Minghao Qin, Shitao Xiao, Jianlyu Chen, Kun Luo, Yingxia Shao, Defu Lian, Zheng Liu
To this end, we introduce a novel model bge-en-icl, which employs few-shot examples to produce high-quality text embeddings.
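A minimal sketch of how few-shot examples might be folded into an embedding query is shown below; the prompt template and function name are illustrative assumptions, and the actual bge-en-icl format may differ.

```python
# Hypothetical prompt construction for in-context-learning embeddings.
def build_icl_query(task: str, examples: list[tuple[str, str]], query: str) -> str:
    parts = [f"Instruct: {task}"]
    for ex_query, ex_response in examples:
        parts.append(f"Query: {ex_query}\nResponse: {ex_response}")
    parts.append(f"Query: {query}")
    return "\n\n".join(parts)

prompt = build_icl_query(
    "Given a web search query, retrieve relevant passages.",
    [("what is dense retrieval", "Dense retrieval encodes text into vectors ...")],
    "how do embedding models benefit from few-shot examples",
)
# For decoder-only LLMs, the embedding is then typically read off the final
# hidden state of the last token (e.g. an appended end-of-sequence token).
```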
1 code implementation • 17 Sep 2024 • Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Shuting Wang, Tiejun Huang, Zheng Liu
In this work, we introduce OmniGen, a new diffusion model for unified image generation.
no code implementations • 22 Aug 2024 • Kun Luo, Minghao Qin, Zheng Liu, Shitao Xiao, Jun Zhao, Kang Liu
In this work, we conduct a comprehensive empirical study on a wide range of retrieval tasks, including in-domain accuracy, data efficiency, zero-shot generalization, lengthy retrieval, instruction-based retrieval, and multi-task learning.
1 code implementation • 5 Jul 2024 • Xingrun Xing, Boyan Gao, Zheng Zhang, David A. Clifton, Shitao Xiao, Li Du, Guoqi Li, Jiajun Zhang
In contrast, human brains, which contain approximately 86 billion biological neurons, are significantly more energy-efficient than LLMs with a comparable number of parameters.
3 code implementations • 6 Jun 2024 • Junjie Zhou, Yan Shu, Bo Zhao, Boya Wu, Shitao Xiao, Xi Yang, Yongping Xiong, Bo Zhang, Tiejun Huang, Zheng Liu
To address the above problems, we propose a new benchmark, called MLVU (Multi-task Long Video Understanding Benchmark), for the comprehensive and in-depth evaluation of LVU.
1 code implementation • 6 Jun 2024 • Junjie Zhou, Zheng Liu, Shitao Xiao, Bo Zhao, Yongping Xiong
Thirdly, we introduce a multi-stage training algorithm, which first aligns the visual token embedding with the text encoder using massive weakly labeled data, and then develops multi-modal representation capability using the generated composed image-text data.
Ranked #8 on Image Retrieval on CIRR
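For the first alignment stage, a contrastive objective over weakly labeled image-text pairs is a common choice; the sketch below is an assumption-laden illustration (encoder interfaces and temperature are hypothetical), not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def stage1_alignment_loss(visual_encoder, text_encoder, images, captions, tau=0.07):
    """Align visual token embeddings with a (frozen) text encoder using
    in-batch contrastive learning on weakly labeled pairs."""
    v = F.normalize(visual_encoder(images), dim=-1)   # (batch, dim)
    t = F.normalize(text_encoder(captions), dim=-1)   # (batch, dim)
    logits = v @ t.T / tau                            # positives on the diagonal
    labels = torch.arange(v.size(0), device=v.device)
    return F.cross_entropy(logits, labels)
# A second stage would then train on generated composed image-text data to
# build the multi-modal (image + instruction) representation capability.
```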
1 code implementation • 5 Jun 2024 • Xingrun Xing, Zheng Zhang, Ziyi Ni, Shitao Xiao, Yiming Ju, Siqi Fan, Yequan Wang, Jiajun Zhang, Guoqi Li
We plug this elastic bi-spiking mechanism into language modeling, yielding SpikeLM.
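A bi-directional spike can be realized as a ternary activation trained with a surrogate gradient; the sketch below is one plausible formulation (threshold and surrogate window are assumptions), not necessarily the elastic mechanism used by SpikeLM.

```python
import torch

class BiSpike(torch.autograd.Function):
    """Ternary {-1, 0, +1} spike with a straight-through surrogate gradient."""

    @staticmethod
    def forward(ctx, x, threshold: float = 0.5):
        ctx.save_for_backward(x)
        ctx.threshold = threshold
        out = torch.zeros_like(x)
        out[x > threshold] = 1.0    # positive spike
        out[x < -threshold] = -1.0  # negative spike
        return out

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # Surrogate gradient: pass gradients only in a window around zero.
        gate = (x.abs() < 2 * ctx.threshold).float()
        return grad_out * gate, None

spikes = BiSpike.apply(torch.randn(4, 8))  # usable inside any nn.Module
```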
1 code implementation • 26 May 2024 • Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, Zhicheng Dou
Compressing lengthy context is a critical but technically challenging problem.
1 code implementation • 30 Apr 2024 • Peitian Zhang, Ninglu Shao, Zheng Liu, Shitao Xiao, Hongjin Qian, Qiwei Ye, Zhicheng Dou
We extend the context length of Llama-3-8B-Instruct from 8K to 80K via QLoRA fine-tuning.
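A hedged sketch of such a QLoRA setup is below, using the Hugging Face transformers and peft libraries; the rank, RoPE base, and target modules are illustrative guesses rather than the recipe reported in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                       # 4-bit base weights (the "Q" in QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    quantization_config=bnb,
    rope_theta=2_000_000.0,  # assumed: enlarge the RoPE base for >8K positions
)
lora = LoraConfig(
    r=32, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)  # only the low-rank adapters are trained
```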
no code implementations • 18 Feb 2024 • Ninglu Shao, Shitao Xiao, Zheng Liu, Peitian Zhang
2) Strong sample efficiency of training, which enables the embedding model to be learned in a cost-effective way.
no code implementations • 18 Feb 2024 • Kun Luo, Zheng Liu, Shitao Xiao, Kang Liu
In this work, we propose Extensible Embedding, which realizes high-quality extension of the LLM's context with strong flexibility and cost-effectiveness.
2 code implementations • 5 Feb 2024 • Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, Zheng Liu
It can simultaneously perform the three common retrieval functionalities of embedding models: dense retrieval, multi-vector retrieval, and sparse retrieval, which provides a unified model foundation for real-world IR applications.
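To make the three functionalities concrete, here is a minimal sketch of how their scores can be computed and combined at query time; the field names and fusion weights are assumptions, not the model's actual interface.

```python
import torch

def dense_score(q_cls, d_cls):
    # single-vector similarity between [CLS]-style embeddings
    return q_cls @ d_cls

def sparse_score(q_weights: dict, d_weights: dict):
    # lexical matching over shared tokens with learned term weights
    return sum(w * d_weights[t] for t, w in q_weights.items() if t in d_weights)

def multi_vector_score(q_vecs, d_vecs):
    # late interaction: each query vector takes its best-matching doc vector
    return (q_vecs @ d_vecs.T).max(dim=1).values.sum()

def hybrid_score(q, d, w=(0.4, 0.2, 0.4)):
    return (w[0] * dense_score(q["cls"], d["cls"])
            + w[1] * sparse_score(q["lex"], d["lex"])
            + w[2] * multi_vector_score(q["vecs"], d["vecs"]))
```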
1 code implementation • 15 Jan 2024 • Ninglu Shao, Shitao Xiao, Zheng Liu, Peitian Zhang
Extensible Tokenization serves as middleware between the tokenized context and the LLM, transforming the raw token embeddings into extensible embeddings.
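The sketch below shows one simple way such middleware could fold k consecutive token embeddings into a single extensible embedding, shortening the sequence the LLM must attend over; the stacking-plus-linear design is an assumption for illustration.

```python
import torch
import torch.nn as nn

class ExtensibleTokenizer(nn.Module):
    """Hypothetical sketch: compress k consecutive token embeddings into one,
    so the LLM sees a k-times shorter sequence."""

    def __init__(self, dim: int, k: int = 8):
        super().__init__()
        self.k = k
        self.down = nn.Linear(dim * k, dim)

    def forward(self, token_emb: torch.Tensor) -> torch.Tensor:
        b, n, d = token_emb.shape
        pad = (-n) % self.k
        if pad:  # right-pad so the length divides evenly into groups of k
            token_emb = nn.functional.pad(token_emb, (0, 0, 0, pad))
        grouped = token_emb.reshape(b, -1, self.k * d)  # stack k neighbours
        return self.down(grouped)  # (b, ceil(n / k), d)
```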
1 code implementation • 7 Jan 2024 • Peitian Zhang, Zheng Liu, Shitao Xiao, Ninglu Shao, Qiwei Ye, Zhicheng Dou
In this paper, we propose Activation Beacon, a plug-in module for transformer-based LLMs that targets effective, efficient, and flexible compression of long contexts.
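One way to picture the mechanism: special tokens are interleaved into the context at a fixed ratio, and only their activations are retained as the condensed cache. The helper below is a hypothetical sketch of that interleaving step, with an assumed interface.

```python
import torch

def interleave_beacons(token_ids: torch.Tensor, beacon_id: int, ratio: int = 8):
    """Insert one beacon token after every `ratio` context tokens; keeping only
    the beacons' activations would compress the cached context ~ratio-fold."""
    chunks = token_ids.split(ratio, dim=-1)
    beacon = torch.full_like(chunks[0][..., :1], beacon_id)
    return torch.cat([t for c in chunks for t in (c, beacon)], dim=-1)
```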
1 code implementation • 24 Dec 2023 • Chaofan Li, Zheng Liu, Shitao Xiao, Yingxia Shao
LLaRA consists of two pretext tasks: EBAE (Embedding-Based Auto-Encoding) and EBAR (Embedding-Based Auto-Regression), where the text embeddings from the LLM are used to reconstruct the tokens of the input sentence and predict the tokens of the next sentence, respectively.
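A hedged sketch of the shared machinery behind both pretext tasks is given below: a linear head maps the sentence embedding to token logits, and the loss rewards scoring every target token highly. The shapes, head design, and multi-label form of the loss are simplifying assumptions.

```python
import torch
import torch.nn as nn

class EmbeddingToTokens(nn.Module):
    def __init__(self, dim: int, vocab_size: int):
        super().__init__()
        self.head = nn.Linear(dim, vocab_size)

    def loss(self, sent_emb: torch.Tensor, target_ids: torch.Tensor):
        # sent_emb: (batch, dim); target_ids: (batch, num_targets)
        logp = self.head(sent_emb).log_softmax(dim=-1)
        # the embedding should assign high probability to every target token
        return -logp.gather(1, target_ids).mean()

# EBAE: target_ids are the tokens of the *input* sentence (auto-encoding);
# EBAR: target_ids are the tokens of the *next* sentence (auto-regression),
# pushing the embedding toward retrieval-friendly, forward-looking semantics.
```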
1 code implementation • 22 Nov 2023 • Shitao Xiao, Zheng Liu, Peitian Zhang, Xingrun Xing
Despite its simplicity, LM-Cocktail is surprisingly effective: the resulting model achieves strong empirical performance across the whole scope of general tasks while preserving a superior capacity in its targeted domain.
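At its core, this kind of model merging is a weighted average in parameter space between the fine-tuned model and its base (and optionally peer models). A minimal sketch, with illustrative weights:

```python
import torch

def merge_state_dicts(state_dicts, weights):
    """Weighted average of parameter tensors across models with identical
    architectures; `weights` should sum to 1."""
    assert abs(sum(weights) - 1.0) < 1e-6
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# e.g. keep domain gains while restoring general ability:
# merged = merge_state_dicts([finetuned.state_dict(), base.state_dict()], [0.5, 0.5])
```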
1 code implementation • 11 Oct 2023 • Peitian Zhang, Shitao Xiao, Zheng Liu, Zhicheng Dou, Jian-Yun Nie
On the other hand, the task-specific retrievers lack the required versatility, hindering their performance across the diverse retrieval augmentation scenarios.
2 code implementations • 14 Sep 2023 • Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, Jian-Yun Nie
Along with our resources on general Chinese embedding, we release our data and models for English text embeddings.
1 code implementation • 4 May 2023 • Shitao Xiao, Zheng Liu, Yingxia Shao, Zhao Cao
It is designed to improve the quality of semantic representation by leveraging all contextualized embeddings of the pre-trained model.
1 code implementation • 16 Nov 2022 • Shitao Xiao, Zheng Liu
We propose DupMAE, which aims to improve the semantic representation capacity of the contextualized embeddings of both [CLS] and ordinary tokens.
Ranked #1 on Information Retrieval on MS MARCO (MRR@10 metric)
1 code implementation • 11 Oct 2022 • Peitian Zhang, Zheng Liu, Shitao Xiao, Zhicheng Dou, Jing Yao
Based on comprehensive experiments on popular retrieval benchmarks, we verify that clusters and terms indeed complement each other, enabling HI$^2$ to achieve lossless retrieval quality with competitive efficiency across various index settings.
1 code implementation • 24 May 2022 • Shitao Xiao, Zheng Liu, Yingxia Shao, Zhao Cao
The sentence embedding is generated from the encoder's masked input; then, the original sentence is recovered based on the sentence embedding and the decoder's masked input via masked language modeling.
Ranked #1 on Information Retrieval on MS MARCO
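The asymmetric workflow above can be sketched as follows; the masking ratios, module interfaces, and the full-sequence loss are simplifications assumed for illustration.

```python
import torch
import torch.nn as nn

def mask_tokens(tokens: torch.Tensor, ratio: float, mask_id: int = 103):
    """Randomly replace a fraction of token ids with [MASK] (illustrative)."""
    masked = tokens.clone()
    masked[torch.rand(tokens.shape, device=tokens.device) < ratio] = mask_id
    return masked

def retromae_step(encoder, decoder, mlm_head, tokens):
    # Encoder sees a lightly masked input and emits one sentence embedding.
    sent_emb = encoder(mask_tokens(tokens, 0.30))[:, 0]        # [CLS] vector
    # Decoder sees an aggressively masked copy plus the sentence embedding,
    # so the reconstruction signal must flow through that single vector.
    dec_hidden = decoder(mask_tokens(tokens, 0.70), sent_emb)  # assumed API
    logits = mlm_head(dec_hidden)                              # (b, n, vocab)
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tokens.reshape(-1))
```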
2 code implementations • 1 Apr 2022 • Shitao Xiao, Zheng Liu, Weihao Han, Jianjin Zhang, Defu Lian, Yeyun Gong, Qi Chen, Fan Yang, Hao Sun, Yingxia Shao, Denvy Deng, Qi Zhang, Xing Xie
We perform comprehensive explorations of the optimal conduct of knowledge distillation, which may provide useful insights for the learning of VQ-based ANN indexes.
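One standard way to conduct such distillation is to match the soft ranking distribution of the quantized (student) scores to that of the uncompressed (teacher) embeddings; the sketch below assumes in-batch candidates and an illustrative temperature.

```python
import torch
import torch.nn.functional as F

def ranking_distillation_loss(q, docs_dense, docs_quantized, tau: float = 1.0):
    """KL between teacher (dense) and student (quantized) score distributions.
    q: (batch, dim); docs_*: (num_docs, dim)."""
    teacher = F.log_softmax(q @ docs_dense.T / tau, dim=-1)
    student = F.log_softmax(q @ docs_quantized.T / tau, dim=-1)
    return F.kl_div(student, teacher, log_target=True, reduction="batchmean")
```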
no code implementations • 28 Feb 2022 • Junhan Yang, Zheng Liu, Shitao Xiao, Jianxun Lian, Lijun Wu, Defu Lian, Guangzhong Sun, Xing Xie
Instead of relying on annotation heuristics defined by humans, it leverages the sentence representation model itself and realizes the following iterative self-supervision process: on one hand, improved sentence representations contribute to the quality of data annotation; on the other hand, more effective data annotation helps to generate high-quality positive samples, which further improve the current sentence representation model.
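The loop itself is simple to write down; the sketch below abstracts the two alternating steps behind caller-supplied functions, since the concrete annotation and training procedures are the paper's contribution.

```python
def iterative_self_supervision(model, corpus, annotate, train, rounds: int = 3):
    """annotate(model, corpus) mines positive pairs with the current model;
    train(model, pairs) updates the model on them."""
    for _ in range(rounds):
        pairs = annotate(model, corpus)  # better model -> better annotations
        model = train(model, pairs)      # better annotations -> better model
    return model
```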
no code implementations • 13 Feb 2022 • Jianjin Zhang, Zheng Liu, Weihao Han, Shitao Xiao, Ruicheng Zheng, Yingxia Shao, Hao Sun, Hanqing Zhu, Premkumar Srinivasan, Denvy Deng, Qi Zhang, Xing Xie
On the other hand, the capability of making high-CTR retrieval is optimized by learning to discriminate the user's clicked ads from the entire corpus.
2 code implementations • 14 Jan 2022 • Shitao Xiao, Zheng Liu, Weihao Han, Jianjin Zhang, Yingxia Shao, Defu Lian, Chaozhuo Li, Hao Sun, Denvy Deng, Liangjie Zhang, Qi Zhang, Xing Xie
In this work, we tackle this problem with Bi-Granular Document Representation, where the lightweight sparse embeddings are indexed and standby in memory for coarse-grained candidate search, and the heavyweight dense embeddings are hosted in disk for fine-grained post verification.
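The resulting search pipeline is coarse-to-fine; the sketch below shows the two stages with hypothetical index and storage interfaces (memory_index, disk_store), which stand in for whatever structures actually back them.

```python
import numpy as np

def bi_granular_search(query_sparse, query_dense, memory_index, disk_store,
                       k: int = 10, k_coarse: int = 1000):
    # Stage 1 (RAM): lightweight sparse embeddings narrow the corpus.
    candidates = memory_index.top_k(query_sparse, k_coarse)   # assumed API
    # Stage 2 (disk): heavyweight dense embeddings re-score only those hits.
    dense_vecs = disk_store.fetch(candidates)                 # (k_coarse, dim)
    scores = dense_vecs @ query_dense
    order = np.argsort(-scores)[:k]
    return [candidates[i] for i in order]
```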
1 code implementation • Science China Information Sciences 2021 • Shitao Xiao, Yingxia Shao, Yawen Li, Hongzhi Yin, Yanyan Shen, Bin Cui
In this paper, we model an interaction between user and item as an edge and propose a novel CF framework, called learnable edge collaborative filtering (LECF).
1 code implementation • NeurIPS 2021 • Junhan Yang, Zheng Liu, Shitao Xiao, Chaozhuo Li, Defu Lian, Sanjay Agrawal, Amit Singh, Guangzhong Sun, Xing Xie
Representation learning on textual graphs aims to generate low-dimensional embeddings for the nodes based on their individual textual features and neighbourhood information.
2 code implementations • 16 Apr 2021 • Shitao Xiao, Zheng Liu, Yingxia Shao, Defu Lian, Xing Xie
In this work, we propose the Matching-oriented Product Quantization (MoPQ), where a novel objective Multinoulli Contrastive Loss (MCL) is formulated.
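The matching-oriented flavor of the objective can be illustrated as a contrastive loss computed against quantized document representations, so the codebooks are trained for retrieval rather than reconstruction; the in-batch-negative form below is a simplification, as the paper's MCL additionally refines how contrastive samples are handled.

```python
import torch
import torch.nn.functional as F

def matching_contrastive_loss(q, doc_quantized, tau: float = 0.05):
    """q, doc_quantized: (batch, dim); the i-th document is the positive for
    the i-th query, and all other in-batch documents act as negatives."""
    logits = q @ doc_quantized.T / tau
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```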
1 code implementation • 18 Feb 2021 • Shitao Xiao, Zheng Liu, Yingxia Shao, Tao Di, Xing Xie
Secondly, it improves the data efficiency of the training workflow, where non-informative data can be eliminated from encoding.