no code implementations • NAACL (TrustNLP) 2022 • Minghan Li, Xueguang Ma, Jimmy Lin
The bi-encoder design of the dense passage retriever (DPR) is a key factor in its success in open-domain question answering (QA), yet it is unclear how DPR’s question encoder and passage encoder individually contribute to overall performance, which we refer to as the encoder attribution problem.
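To make the bi-encoder design concrete, here is a minimal sketch of a DPR-style retriever: two separate encoders whose pooled vectors are compared by inner product. The checkpoint name and [CLS] pooling are illustrative assumptions, not the paper's exact configuration.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Two *independent* encoders, as in DPR; the checkpoint is a placeholder.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
question_encoder = AutoModel.from_pretrained("bert-base-uncased")
passage_encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode(encoder, texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        return encoder(**batch).last_hidden_state[:, 0]  # [CLS] pooling

q = encode(question_encoder, ["who wrote hamlet?"])
p = encode(passage_encoder, ["Hamlet is a tragedy by William Shakespeare."])
score = (q @ p.T).item()  # the attribution question: how much does each encoder contribute?
```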
no code implementations • EMNLP 2021 • Xueguang Ma, Minghan Li, Kai Sun, Ji Xin, Jimmy Lin
Recent work has shown that dense passage retrieval techniques achieve better ranking accuracy in open-domain question answering compared to sparse retrieval techniques such as BM25, but at the cost of large space and memory requirements.
no code implementations • 1 Apr 2025 • YuBo Wang, Xueguang Ma, Ping Nie, Huaye Zeng, Zhiheng Lyu, Yuxuan Zhang, Benjamin Schneider, Yi Lu, Xiang Yue, Wenhu Chen
Academic writing requires both coherent text generation and precise citation of relevant literature.
2 code implementations • 8 Mar 2025 • Shengyao Zhuang, Xueguang Ma, Bevan Koopman, Jimmy Lin, Guido Zuccon
In particular, we find that Rank-R1 achieves effectiveness on in-domain datasets on par with supervised fine-tuning methods, while using only 18% of the training data those methods require.
1 code implementation • 25 Feb 2025 • Xueguang Ma, Xi Victoria Lin, Barlas Oguz, Jimmy Lin, Wen-tau Yih, Xilun Chen
In particular, we adopt pruned LLMs as the backbone and train on diverse LLM-augmented data in a single-stage contrastive learning setup.
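A minimal sketch of the single-stage contrastive setup the line above describes, using the standard InfoNCE objective with in-batch negatives; the temperature value and embedding size are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_emb, p_emb, temperature=0.05):
    """InfoNCE with in-batch negatives: row i of q_emb pairs with row i of
    p_emb as its positive; every other passage in the batch is a negative."""
    q_emb = F.normalize(q_emb, dim=-1)
    p_emb = F.normalize(p_emb, dim=-1)
    logits = q_emb @ p_emb.T / temperature  # (B, B) similarity matrix
    labels = torch.arange(q_emb.size(0))    # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

loss = in_batch_contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))
```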
no code implementations • 31 Jan 2025 • Zhiheng Lyu, Xueguang Ma, Wenhu Chen
Existing foundation models typically process visual input as pixels and textual input as tokens, a paradigm that contrasts with human perception, where both modalities are processed in a unified manner.
1 code implementation • 28 Jan 2025 • Shengyao Zhuang, Ekaterina Khramtsova, Xueguang Ma, Bevan Koopman, Jimmy Lin, Guido Zuccon
Recent advancements in dense retrieval have introduced vision-language model (VLM)-based retrievers, such as DSE and ColPali, which leverage document screenshots embedded as vectors to enable effective search and offer a simplified pipeline over traditional text-only methods.
no code implementations • 19 Dec 2024 • Xueguang Ma, Shengyao Zhuang, Bevan Koopman, Guido Zuccon, Wenhu Chen, Jimmy Lin
Generation with source attribution is important for enhancing the verifiability of retrieval-augmented generation (RAG) systems.
no code implementations • 21 Jun 2024 • Ziyan Jiang, Xueguang Ma, Wenhu Chen
In order to alleviate the imbalance, we propose a new framework, LongRAG, consisting of a "long retriever" and a "long reader".
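A sketch of the "long retriever" side of this idea: short passages are grouped into long retrieval units so the retriever returns fewer, larger units for the long-context reader. The sequential grouping and the token counter below are simplifying assumptions, not necessarily how LongRAG forms its units.

```python
def build_long_units(passages, max_tokens=4096, count_tokens=len):
    """Group short passages into long retrieval units up to a token budget.
    `count_tokens` is a stand-in for a real tokenizer's length function."""
    units, current, size = [], [], 0
    for p in passages:
        n = count_tokens(p)
        if current and size + n > max_tokens:
            units.append(" ".join(current))
            current, size = [], 0
        current.append(p)
        size += n
    if current:
        units.append(" ".join(current))
    return units
```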
no code implementations • 17 Jun 2024 • Xueguang Ma, Sheng-Chieh Lin, Minghan Li, Wenhu Chen, Jimmy Lin
To this end, we propose Document Screenshot Embedding (DSE), a novel retrieval paradigm that regards document screenshots as a unified input format, which requires no content-extraction preprocessing and preserves all the information in a document (e.g., text, images, and layout).
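To illustrate screenshot-as-input retrieval, here is a sketch using an off-the-shelf CLIP encoder to score a query against a rendered page image. DSE itself fine-tunes a larger vision-language model, so this is an illustration of the paradigm, not the paper's model.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# A rendered document page: text, figures, and layout all preserved as pixels.
screenshot = Image.open("page.png")
inputs = processor(text=["how does dense retrieval work?"], images=screenshot,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)
score = (out.text_embeds @ out.image_embeds.T).item()  # query-vs-page similarity
```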
2 code implementations • 3 Jun 2024 • YuBo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, Wenhu Chen
In the age of large-scale language models, benchmarks like the Massive Multitask Language Understanding (MMLU) have been pivotal in pushing the boundaries of what AI can achieve in language comprehension and reasoning across diverse domains.
1 code implementation • 29 Apr 2024 • Shengyao Zhuang, Xueguang Ma, Bevan Koopman, Jimmy Lin, Guido Zuccon
Utilizing large language models (LLMs) for zero-shot document ranking is done in one of two ways: (1) prompt-based re-ranking methods, which require no further training but are only feasible for re-ranking a handful of candidate documents due to computational costs; and (2) unsupervised contrastively trained dense retrieval methods, which can retrieve relevant documents from the entire corpus but require a large amount of paired text data for contrastive training.
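A minimal sketch of option (1), prompt-based re-ranking, which makes the cost trade-off visible: each candidate document requires its own LLM call, so it only scales to a handful of candidates. The model and prompt are arbitrary placeholders, not the paper's setup.

```python
from transformers import pipeline

# A deliberately small placeholder model; any instruction-following LLM fits here.
generator = pipeline("text-generation", model="gpt2")

def rerank(query, docs):
    def score(doc):
        prompt = (f"Query: {query}\nDocument: {doc}\n"
                  f"Is the document relevant to the query? Answer Yes or No: ")
        answer = generator(prompt, max_new_tokens=3)[0]["generated_text"][len(prompt):]
        return 1.0 if "yes" in answer.lower() else 0.0
    return sorted(docs, key=score, reverse=True)  # one LLM call per document
```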
2 code implementations • 12 Oct 2023 • Xueguang Ma, Liang Wang, Nan Yang, Furu Wei, Jimmy Lin
Our findings demonstrate that the effectiveness of large language models indeed surpasses that of smaller models.
1 code implementation • 11 Oct 2023 • Raphael Tang, Xinyu Zhang, Xueguang Ma, Jimmy Lin, Ferhan Ture
Large language models (LLMs) exhibit positional bias in how they use context, which especially complicates listwise ranking.
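One way to counteract this positional bias, sketched below: rank several shuffled orderings of the same candidates and aggregate each document's positions. Averaging ranks is a simplification here; the paper aggregates permutations more carefully, and `llm_rank` is an assumed callable wrapping the listwise LLM ranker.

```python
import random
from statistics import mean

def permutation_self_consistency(docs, llm_rank, n_permutations=8, seed=0):
    """Rank multiple shuffled orderings and aggregate by mean rank, so no
    document is systematically advantaged by its input position."""
    rng = random.Random(seed)
    positions = {d: [] for d in docs}
    for _ in range(n_permutations):
        shuffled = docs[:]
        rng.shuffle(shuffled)
        for rank, d in enumerate(llm_rank(shuffled)):
            positions[d].append(rank)
    return sorted(docs, key=lambda d: mean(positions[d]))
```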
1 code implementation • 5 Sep 2023 • YuBo Wang, Xueguang Ma, Wenhu Chen
In this study, we present a system called LLMs Augmented with Medical Textbooks (LLM-AMT) designed to enhance the proficiency of LLMs in specialized domains.
2 code implementations • 13 Jun 2023 • Ehsan Kamalloo, Nandan Thakur, Carlos Lassance, Xueguang Ma, Jheng-Hong Yang, Jimmy Lin
BEIR is a benchmark dataset for zero-shot evaluation of information retrieval models across 18 different domain/task combinations.
1 code implementation • 21 May 2023 • Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, Tony Xia
We evaluate a wide spectrum of 16 large language and code models with different prompting strategies like Chain-of-Thoughts and Program-of-Thoughts.
no code implementations • 3 May 2023 • Xueguang Ma, Xinyu Zhang, Ronak Pradeep, Jimmy Lin
Supervised ranking methods based on bi-encoder or cross-encoder architectures have shown success in multi-stage text ranking tasks, but they require large amounts of relevance judgments as training data.
1 code implementation • 2 May 2023 • Tianle Li, Xueguang Ma, Alex Zhuang, Yu Gu, Yu Su, Wenhu Chen
On GrailQA and WebQSP, our model is also on par with other fully-trained models.
no code implementations • 24 Apr 2023 • Xueguang Ma, Tommaso Teofili, Jimmy Lin
Pyserini, which provides a Python interface to Anserini, gives users access to both sparse and dense retrieval models: it implements bindings to the Faiss vector search library alongside Lucene inverted indexes in a uniform, consistent interface.
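A short usage sketch of that uniform interface, with sparse (BM25 over Lucene) and dense (Faiss) search side by side. The prebuilt index and encoder names below are from Pyserini's catalog as I recall it and may differ across versions.

```python
from pyserini.search.lucene import LuceneSearcher
from pyserini.search.faiss import FaissSearcher, TctColBertQueryEncoder

# Sparse retrieval: BM25 over a prebuilt Lucene inverted index.
sparse = LuceneSearcher.from_prebuilt_index("msmarco-v1-passage")
sparse_hits = sparse.search("what is dense retrieval?", k=10)

# Dense retrieval: a Faiss index with a matching query encoder.
encoder = TctColBertQueryEncoder("castorini/tct_colbert-v2-hnp-msmarco")
dense = FaissSearcher.from_prebuilt_index("msmarco-passage-tct_colbert-v2-hnp-bf", encoder)
dense_hits = dense.search("what is dense retrieval?", k=10)

for hit in sparse_hits[:3]:
    print(hit.docid, round(hit.score, 2))
```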
1 code implementation • 13 Feb 2023 • Minghan Li, Sheng-Chieh Lin, Xueguang Ma, Jimmy Lin
Multi-vector retrieval methods have demonstrated their effectiveness on various retrieval datasets, and among them, ColBERT is the most established method based on the late interaction of contextualized token embeddings of pre-trained language models.
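Late interaction reduces to a simple scoring rule, sketched below: for each query token embedding, take its maximum similarity over all document token embeddings, then sum over query tokens (ColBERT's MaxSim). The dimensions are illustrative.

```python
import torch

def late_interaction_score(query_vecs, doc_vecs):
    """ColBERT-style MaxSim. Shapes: query_vecs (Lq, d), doc_vecs (Ld, d),
    both assumed L2-normalized token embeddings."""
    sim = query_vecs @ doc_vecs.T       # (Lq, Ld) token-level similarities
    return sim.max(dim=1).values.sum()  # max over doc tokens, sum over query tokens

score = late_interaction_score(torch.randn(5, 128), torch.randn(40, 128))
```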
2 code implementations • 20 Dec 2022 • Luyu Gao, Xueguang Ma, Jimmy Lin, Jamie Callan
Given a query, HyDE first zero-shot prompts an instruction-following language model (e.g., InstructGPT) to generate a hypothetical document.
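The HyDE pipeline in outline: answer the query with a hypothetical document first, then retrieve real documents near its embedding. In the sketch below, `generate` and `embed` are assumed callables standing in for an instruction-following LLM and an unsupervised encoder such as Contriever.

```python
import numpy as np

def hyde_search(query, corpus_embeddings, corpus_ids, generate, embed, k=10):
    # Step 1: generate a hypothetical (possibly factually imperfect) answer document.
    hypothetical_doc = generate(f"Write a passage that answers the question: {query}")
    # Step 2: embed the fake document and search the real corpus near it.
    vec = embed(hypothetical_doc)
    scores = corpus_embeddings @ vec          # inner-product search
    top = np.argsort(-scores)[:k]
    return [(corpus_ids[i], float(scores[i])) for i in top]
```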
2 code implementations • 22 Nov 2022 • Wenhu Chen, Xueguang Ma, Xinyi Wang, William W. Cohen
By combining PoT with self-consistency decoding, we can achieve SoTA performance on all math problem datasets and near-SoTA performance on financial datasets.
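A sketch of how PoT combines with self-consistency decoding: sample several candidate programs, execute each, and majority-vote over the resulting answers. `sample_program` is an assumed callable wrapping an LLM that emits Python setting a variable `ans`; `exec` on model output is for illustration only and should be sandboxed in practice.

```python
from collections import Counter

def pot_self_consistency(question, sample_program, n_samples=20):
    """Program-of-Thoughts + self-consistency: run sampled programs,
    discard failures, and return the most common answer."""
    answers = []
    for _ in range(n_samples):
        code = sample_program(f"Write Python that computes the answer as `ans`:\n{question}")
        scope = {}
        try:
            exec(code, scope)            # run the generated program
            answers.append(scope["ans"])
        except Exception:
            continue                     # skip programs that fail to run
    return Counter(answers).most_common(1)[0][0] if answers else None
```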
no code implementations • 30 Apr 2022 • Hang Li, Shuai Wang, Shengyao Zhuang, Ahmed Mourad, Xueguang Ma, Jimmy Lin, Guido Zuccon
In this paper we consider the problem of combining the relevance signals from sparse and dense retrievers in the context of Pseudo Relevance Feedback (PRF).
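One common baseline for combining such signals, sketched below, is linear interpolation of normalized sparse and dense scores; this is a standard fusion recipe for illustration, not necessarily the paper's exact method.

```python
def interpolate_runs(sparse_run, dense_run, alpha=0.5):
    """Fuse sparse and dense relevance signals: min-max normalize each run's
    scores, then interpolate. A doc missing from one run scores 0 there
    (the normalized minimum)."""
    def normalize(run):
        lo, hi = min(run.values()), max(run.values())
        return {d: (s - lo) / (hi - lo + 1e-9) for d, s in run.items()}
    s, d = normalize(sparse_run), normalize(dense_run)
    fused = {doc: alpha * s.get(doc, 0.0) + (1 - alpha) * d.get(doc, 0.0)
             for doc in set(s) | set(d)}
    return sorted(fused.items(), key=lambda kv: -kv[1])
```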
no code implementations • 5 Apr 2022 • Xinyu Zhang, Kelechi Ogueji, Xueguang Ma, Jimmy Lin
Dense retrieval models using a transformer-based bi-encoder design have emerged as an active area of research.
1 code implementation • 11 Mar 2022 • Luyu Gao, Xueguang Ma, Jimmy Lin, Jamie Callan
In this paper, we present Tevatron, a dense retrieval toolkit optimized for efficiency, flexibility, and code simplicity.
no code implementations • 17 Dec 2021 • Jheng-Hong Yang, Xueguang Ma, Jimmy Lin
Sparse lexical representation learning has demonstrated much progress in improving passage retrieval effectiveness in recent models such as DeepImpact, uniCOIL, and SPLADE.
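Learned sparse scoring in miniature, as a sketch: models like uniCOIL and SPLADE map queries and documents to bags of learned term weights, and relevance is a dot product over the overlapping vocabulary entries. The weights below are made-up numbers for illustration.

```python
def impact_score(query_weights, doc_weights):
    """Score = sum over shared terms of (query weight * document weight),
    i.e., a dot product between two sparse lexical vectors."""
    return sum(w * doc_weights[t] for t, w in query_weights.items() if t in doc_weights)

score = impact_score({"dense": 1.2, "retrieval": 0.8},
                     {"retrieval": 1.5, "passage": 0.9, "dense": 0.4})
```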
1 code implementation • 13 Dec 2021 • Hang Li, Shengyao Zhuang, Ahmed Mourad, Xueguang Ma, Jimmy Lin, Guido Zuccon
Finally, we contribute a study of the generalisability of the ANCE-PRF method when dense retrievers other than ANCE are used for the first round of retrieval and for encoding the PRF signal.
no code implementations • 11 Nov 2021 • Alexandre Parmentier, Robin Cohen, Xueguang Ma, Gaurav Sahu, Queenie Chen
In this paper, we present an approach for predicting trust links between peers in social media, one that is grounded in the artificial intelligence area of multiagent trust modeling.
1 code implementation • EMNLP (MRL) 2021 • Xinyu Zhang, Xueguang Ma, Peng Shi, Jimmy Lin
We present Mr. TyDi, a multi-lingual benchmark dataset for mono-lingual retrieval in eleven typologically diverse languages, designed to evaluate ranking with learned dense representations.
no code implementations • 28 Jun 2021 • Jimmy Lin, Xueguang Ma
Recent developments in representational learning for information retrieval can be organized in a conceptual framework that establishes two pairs of contrasts: sparse vs. dense representations and unsupervised vs. learned representations.
1 code implementation • 12 Apr 2021 • Xueguang Ma, Kai Sun, Ronak Pradeep, Jimmy Lin
Text retrieval using learned dense representations has recently emerged as a promising alternative to "traditional" text retrieval using sparse bag-of-words representations.
1 code implementation • 19 Feb 2021 • Jimmy Lin, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep, Rodrigo Nogueira
Pyserini is an easy-to-use Python toolkit that supports replicable IR research by providing effective first-stage retrieval in a multi-stage ranking architecture.
no code implementations • EACL (Louhi) 2021 • Ronak Pradeep, Xueguang Ma, Rodrigo Nogueira, Jimmy Lin
This work describes the adaptation of a pretrained sequence-to-sequence model to the task of scientific claim verification in the biomedical domain.