One of the main challenges for arbitrary-shaped text detection is to design a good text instance representation that allows networks to learn diverse variations in text geometry.
This technical report presents the training methodology and evaluation results of the open-source multilingual E5 text embedding models, released in mid-2023.
This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks.
Ranked #11 on Only Connect Walls Dataset Task 1 (Grouping) on OCW (using extra training data); also evaluated on the MTEB Benchmark.
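As a rough illustration of how such embedding models are typically queried (a minimal sketch assuming the released intfloat/multilingual-e5-base checkpoint and the sentence-transformers library; everything beyond the model name and the "query: "/"passage: " input prefixes that E5 checkpoints expect is an illustrative assumption):

# Minimal sketch: ranking passages against a query with a multilingual E5
# checkpoint via sentence-transformers. The example texts are made up.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-base")

query = "query: how well do text embeddings transfer across tasks?"
passages = [
    "passage: E5 embeddings transfer well to a wide range of tasks.",
    "passage: Arbitrary-shaped text detection handles curved text instances.",
]

# With normalized embeddings, the dot product equals cosine similarity.
q_emb = model.encode([query], normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
print(util.dot_score(q_emb, p_emb))  # shape (1, num_passages)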
It can simultaneously perform the three common retrieval functionalities of an embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval, providing a unified model foundation for real-world IR applications.
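A minimal sketch of what those three scoring modes look like, on toy NumPy arrays rather than real model outputs (the vectors, lexical weights, and fusion weights below are all illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
dense_q, dense_p = rng.random(4), rng.random(4)            # one vector per text
sparse_q = {"text": 0.9, "embedding": 0.7}                 # token -> lexical weight
sparse_p = {"text": 0.8, "retrieval": 0.5}
multi_q, multi_p = rng.random((3, 4)), rng.random((5, 4))  # one vector per token

# Dense retrieval: inner product between single vectors.
dense_score = float(dense_q @ dense_p)

# Sparse retrieval: sum weight products over overlapping tokens.
sparse_score = sum(w * sparse_p[t] for t, w in sparse_q.items() if t in sparse_p)

# Multi-vector retrieval (ColBERT-style MaxSim): each query token matches its
# best passage token, and the per-token maxima are summed.
multi_score = float((multi_q @ multi_p.T).max(axis=1).sum())

# A unified model can fuse the three signals, e.g. with a weighted sum.
final_score = 0.5 * dense_score + 0.3 * sparse_score + 0.2 * multi_score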
Thirdly, we introduce a multi-stage training algorithm that first aligns the visual token embedding with the text encoder using massive weakly labeled data (a sketch of this alignment stage appears below), and then develops multi-modal representation capability using the generated composed image-text data.
Ranked #9 on Image Retrieval on CIRR.
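A hedged sketch of what that first alignment stage could look like: pooled visual tokens are projected into the text encoder's space and trained with an InfoNCE loss over in-batch negatives (the mean pooling, linear projection, dimensions, and temperature are assumptions for illustration, not the paper's exact recipe):

import torch
import torch.nn.functional as F

def alignment_loss(visual_tokens, text_emb, proj, temperature=0.07):
    # visual_tokens: (batch, num_tokens, d_vis); text_emb: (batch, d_txt)
    # from a frozen text encoder. Diagonal pairs are positives; the rest of
    # the batch serves as negatives (standard InfoNCE).
    v = F.normalize(proj(visual_tokens.mean(dim=1)), dim=-1)  # pool + project
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature          # (batch, batch) similarities
    labels = torch.arange(v.size(0))        # index of each positive pair
    return F.cross_entropy(logits, labels)

proj = torch.nn.Linear(512, 256)            # illustrative dimensions only
loss = alignment_loss(torch.randn(8, 49, 512), torch.randn(8, 256), proj)
loss.backward()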
MTEB spans 8 embedding tasks covering a total of 58 datasets and 112 languages.
Ranked #1 on Text Retrieval on MTEB.
The evaluation of English text embeddings has transitioned from a handful of datasets to broad coverage across many tasks through benchmarks such as MTEB.
MMTEB includes a diverse set of challenging, novel tasks such as instruction following, long-document retrieval, and code retrieval, representing the largest multilingual collection of evaluation tasks for embedding models to date.
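For reference, such benchmark runs are usually driven through the open-source mteb package; the task and model chosen below are placeholders, not a setup prescribed by any of these papers:

from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Evaluate a model exposing an encode() method on one benchmark task.
model = SentenceTransformer("intfloat/multilingual-e5-base")
evaluation = MTEB(tasks=["Banking77Classification"])  # placeholder task choice
results = evaluation.run(model, output_folder="results/multilingual-e5-base")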
Our analysis suggests that INSTRUCTOR is robust to changes in instructions, and that instruction finetuning mitigates the challenge of training a single model on diverse datasets.
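A short sketch of the instruction-conditioned interface from the authors' released InstructorEmbedding package (the checkpoint name follows their release; the instruction strings are made-up examples):

from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR("hkunlp/instructor-large")

# Each input pairs a task instruction with the text to embed, so a single
# model serves different tasks by changing only the instruction.
embeddings = model.encode([
    ["Represent the scientific title for retrieval:",
     "Text Embeddings by Weakly-Supervised Contrastive Pre-training"],
    ["Represent the question for retrieving supporting documents:",
     "What is instruction finetuning?"],
])
print(embeddings.shape)  # (2, embedding_dim)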
TENE learns node representations under the guidance of both a proximity matrix, which captures the network structure, and a text cluster membership matrix derived from clustering the text information.
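One way to read that objective is as a joint factorization: shared node representations U reconstruct both the proximity matrix S and the text cluster membership matrix T. The gradient-descent sketch below on random data is an assumed simplification, not the paper's actual optimization:

import numpy as np

rng = np.random.default_rng(0)
n, k, d = 20, 4, 8                      # nodes, text clusters, embedding size
S = rng.random((n, n))                  # proximity matrix (network structure)
T = rng.random((n, k))                  # text cluster membership matrix
U, V, W = rng.random((n, d)), rng.random((n, d)), rng.random((k, d))

lr, lam = 0.005, 0.5                    # step size; weight on the text term
for _ in range(400):
    E_s, E_t = U @ V.T - S, U @ W.T - T       # reconstruction errors
    U -= lr * (E_s @ V + lam * E_t @ W)       # gradient of the joint loss
    V -= lr * E_s.T @ U
    W -= lr * lam * E_t.T @ U
print(np.linalg.norm(E_s), np.linalg.norm(E_t))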