1 code implementation • EMNLP (sustainlp) 2020 • Ji Xin, Rodrigo Nogueira, YaoLiang Yu, Jimmy Lin
Pre-trained language models such as BERT have shown their effectiveness in various tasks.
no code implementations • EMNLP (BlackboxNLP) 2021 • Zhiying Jiang, Raphael Tang, Ji Xin, Jimmy Lin
Fine-tuned pre-trained transformers achieve the state of the art in passage reranking.
no code implementations • EMNLP (MRL) 2021 • Peng Shi, Rui Zhang, He Bai, Jimmy Lin
Dense retrieval has shown great success for passage ranking in English.
1 code implementation • EMNLP (MRL) 2021 • Kelechi Ogueji, Yuxin Zhu, Jimmy Lin
In this work, we challenge this assumption and present the first attempt at training a multilingual language model on only low-resource languages.
1 code implementation • EMNLP (sustainlp) 2020 • Xinyu Zhang, Andrew Yates, Jimmy Lin
Researchers have proposed simple yet effective techniques for the retrieval problem based on using BERT as a relevance classifier to rerank initial candidates from keyword search.
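A rough sketch of that reranking setup, assuming a cross-encoder checkpoint fine-tuned on MS MARCO-style (query, passage) pairs; the model name and hyperparameters here are illustrative, not this paper's exact configuration:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

def rerank(query: str, candidates: list[str], k: int = 10) -> list[str]:
    # Score each (query, passage) pair with the relevance-classifier head,
    # then reorder the keyword-search candidates by that score.
    inputs = tokenizer([query] * len(candidates), candidates,
                       padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        scores = model(**inputs).logits.squeeze(-1)
    order = scores.argsort(descending=True)[:k]
    return [candidates[i] for i in order]
```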
no code implementations • ACL (NLP4Prog) 2021 • Xinyu Zhang, Ji Xin, Andrew Yates, Jimmy Lin
The task of semantic code search is to retrieve code snippets from a source code corpus based on an information need expressed in natural language.
no code implementations • NAACL (TrustNLP) 2022 • Minghan Li, Xueguang Ma, Jimmy Lin
The bi-encoder design of dense passage retriever (DPR) is a key factor to its success in open-domain question answering (QA), yet it is unclear how DPR's question encoder and passage encoder individually contribute to overall performance, which we refer to as the encoder attribution problem.
no code implementations • ACL (RepL4NLP) 2021 • Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin
We present an efficient training approach to text retrieval with dense representations that applies knowledge distillation using the ColBERT late-interaction ranking model.
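For readers unfamiliar with ColBERT, a minimal sketch of the late-interaction (MaxSim) scoring that serves as the teacher signal here; real ColBERT additionally uses query/document markers, punctuation masking, and a linear projection:

```python
import torch

def maxsim_score(Q: torch.Tensor, D: torch.Tensor) -> torch.Tensor:
    """Q: [num_query_tokens, dim], D: [num_doc_tokens, dim] token embeddings."""
    Q = torch.nn.functional.normalize(Q, dim=-1)
    D = torch.nn.functional.normalize(D, dim=-1)
    sim = Q @ D.T                       # token-level similarity matrix
    return sim.max(dim=1).values.sum()  # best doc token per query token

score = maxsim_score(torch.randn(32, 128), torch.randn(180, 128))
```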
no code implementations • EMNLP 2021 • Xueguang Ma, Minghan Li, Kai Sun, Ji Xin, Jimmy Lin
Recent work has shown that dense passage retrieval techniques achieve better ranking accuracy in open-domain question answering compared to sparse retrieval techniques such as BM25, but at the cost of large space and memory requirements.
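To make the space cost concrete, a small sketch of a flat (uncompressed) dense index with Faiss; corpus size and dimensionality are illustrative, and the memory arithmetic in the comments is the point:

```python
import numpy as np
import faiss

n, d = 100_000, 768                     # toy scale; real corpora are millions
xb = np.random.rand(n, d).astype("float32")
index = faiss.IndexFlatIP(d)            # exact inner-product search
index.add(xb)                           # flat storage: n * d * 4 bytes
print(index.ntotal, "vectors,", n * d * 4 / 2**30, "GiB")
scores, ids = index.search(xb[:1], 10)  # top-10 neighbors of one query
```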
no code implementations • EMNLP 2021 • Raphael Tang, Karun Kumar, Kendra Chalkley, Ji Xin, Liming Zhang, Wenyan Li, Gefei Yang, Yajie Mao, Junho Shin, Geoffrey Craig Murray, Jimmy Lin
Query auto completion (QAC) is the task of predicting a search engine user’s final query from their intermediate, incomplete query.
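To make the QAC task concrete, a toy prefix-trie completer; production systems, including the one above, rank completions with learned models, so this is only a baseline sketch:

```python
from collections import defaultdict

class Trie:
    def __init__(self):
        self.children, self.count = defaultdict(Trie), 0

    def add(self, query: str):
        node = self
        for ch in query:
            node = node.children[ch]
        node.count += 1  # count completed queries ending here

    def complete(self, prefix: str, k: int = 5):
        node = self
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        out = []
        def walk(n, suffix):
            if n.count:
                out.append((prefix + suffix, n.count))
            for ch, child in n.children.items():
                walk(child, suffix + ch)
        walk(node, "")
        return [q for q, _ in sorted(out, key=lambda x: -x[1])[:k]]

trie = Trie()
for q in ["weather today", "weather tomorrow", "weather today radar"]:
    trie.add(q)
print(trie.complete("weather t"))  # most frequent completions first
```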
no code implementations • EMNLP (sdp) 2020 • Shane Ding, Edwin Zhang, Jimmy Lin
Cydex is a platform that provides neural search infrastructure for domain-specific scholarly literature.
no code implementations • EMNLP (sustainlp) 2021 • Yue Zhang, ChengCheng Hu, Yuqi Liu, Hui Fang, Jimmy Lin
It is well known that rerankers built on pretrained transformer models such as BERT have dramatically improved retrieval effectiveness in many tasks.
1 code implementation • Findings (EMNLP) 2021 • Minghan Li, Ming Li, Kun Xiong, Jimmy Lin
Our method reaches state-of-the-art performance on 5 benchmark QA datasets, with up to 10% improvement in top-100 accuracy compared to a joint-training multi-task DPR on SQuAD.
1 code implementation • Findings (EMNLP) 2021 • Anup Anand Deshmukh, Qianqiu Zhang, Ming Li, Jimmy Lin, Lili Mou
In this paper, we address unsupervised chunking as a new task of syntactic structure induction, which is helpful for understanding the linguistic structures of human languages as well as processing low-resource languages.
1 code implementation • 30 Jan 2025 • Manveer Singh Tamber, Jimmy Lin
Our study systematically examines the factors that influence an attack's success, such as the placement of injected content and the balance between relevant and non-relevant material.
1 code implementation • 28 Jan 2025 • Shengyao Zhuang, Ekaterina Khramtsova, Xueguang Ma, Bevan Koopman, Jimmy Lin, Guido Zuccon
Recent advancements in dense retrieval have introduced vision-language model (VLM)-based retrievers, such as DSE and ColPali, which leverage document screenshots embedded as vectors to enable effective search and offer a simplified pipeline over traditional text-only methods.
no code implementations • 25 Dec 2024 • Jimmy Lin, Pankaj Gupta, Will Horn, Gilad Mishne
When you have a question, the most effective way to get it answered is to connect directly with experts on the topic and have a conversation with them.
no code implementations • 19 Dec 2024 • Xueguang Ma, Shengyao Zhuang, Bevan Koopman, Guido Zuccon, Wenhu Chen, Jimmy Lin
Generation with source attribution is important for enhancing the verifiability of retrieval-augmented generation (RAG) systems.
no code implementations • 10 Dec 2024 • Zijian Chen, John-Michael Gamble, Micaela Jantzi, John P. Hirdes, Jimmy Lin
Manual assignment of Anatomical Therapeutic Chemical (ATC) codes to prescription records is a significant bottleneck in healthcare research and operations at Ontario Health and InterRAI Canada, requiring extensive expert time and effort.
1 code implementation • 9 Dec 2024 • Nadia Sheikh, Anne-Laure Jousse, Daniel Buades Marcos, Akintunde Oladipo, Olivier Rousseau, Jimmy Lin
Given the dominance of dense retrievers that do not generalize well beyond their training dataset distributions, domain-specific test sets are essential in evaluating retrieval.
no code implementations • 14 Nov 2024 • Ronak Pradeep, Nandan Thakur, Shivani Upadhyay, Daniel Campos, Nick Craswell, Jimmy Lin
Within the TREC setup, we are able to calibrate our fully automatic process against a manual process in which nuggets are created semi-manually by human assessors and then manually assigned to system answers.
no code implementations • 13 Nov 2024 • Shivani Upadhyay, Ronak Pradeep, Nandan Thakur, Daniel Campos, Nick Craswell, Ian Soboroff, Hoa Trang Dang, Jimmy Lin
This paper reports on the results of a large-scale evaluation (the TREC 2024 RAG Track) where four different relevance assessment approaches were deployed in situ: the "standard" fully manual process that NIST has implemented for decades and three different alternatives that take advantage of LLMs to different extents using the open-source UMBRELA tool.
no code implementations • 8 Nov 2024 • Zijian Chen, Ronak Pradeep, Jimmy Lin
To better quantify the computational savings in the original study, we measure and compare latency to find a 21%-42% gain across various models and benchmarks.
no code implementations • 7 Nov 2024 • Xinyu Zhang, Jing Lu, Vinh Q. Tran, Tal Schuster, Donald Metzler, Jimmy Lin
Results show that the general shared semantics alone can get the models a long way in making predictions across mLMs with different tokenizers and model sizes.
no code implementations • 4 Nov 2024 • Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, Wei Ping
Our empirical results show that the fine-tuned MLLM retriever is capable of understanding challenging queries, composed of both text and image, but underperforms a smaller CLIP retriever in cross-modal retrieval tasks due to modality bias from MLLMs.
1 code implementation • 17 Oct 2024 • Nandan Thakur, Suleman Kazi, Ge Luo, Jimmy Lin, Amin Ahmad
In our work, we benchmark 19 diverse multilingual-focused LLMs and achieve a high correlation (Kendall Tau ($\tau$) = 0.909) between our surrogate judge, learned from heuristic features on pairwise evaluations, and GPT-4o as a teacher on the MIRAGE-Bench leaderboard built with the Bradley-Terry framework.
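A minimal sketch of fitting Bradley-Terry strengths from pairwise preferences, the framework behind the leaderboard above; the MM-style update is the standard one (Hunter, 2004), but the win counts here are made up:

```python
import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """wins[i, j] = number of times model i beat model j."""
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(iters):
        for i in range(n):
            num = wins[i].sum()  # total wins of model i
            den = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                      for j in range(n) if j != i)
            p[i] = num / den
        p /= p.sum()
    return p  # higher = stronger model

wins = np.array([[0, 8, 9], [2, 0, 6], [1, 4, 0]])
print(bradley_terry(wins))
```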
no code implementations • 10 Sep 2024 • Jimmy Lin
Practitioners working on dense retrieval today face a bewildering number of choices.
no code implementations • 12 Aug 2024 • Ronak Pradeep, Daniel Lee, Ali Mousavi, Jeff Pound, Yisi Sang, Jimmy Lin, Ihab Ilyas, Saloni Potdar, Mostafa Arefiyan, Yunyao Li
The rapid advancement of Large Language Models (LLMs) and conversational assistants necessitates dynamic, scalable, and configurable conversational datasets for training and evaluation.
no code implementations • 2 Aug 2024 • Jheng-Hong Yang, Jimmy Lin
Vision-Language Models (VLMs) have demonstrated success across diverse applications, yet their potential to assist in relevance judgments remains uncertain.
1 code implementation • 10 Jul 2024 • Nandan Thakur, Luiz Bonifacio, Maik Fröbe, Alexander Bondarenko, Ehsan Kamalloo, Martin Potthast, Matthias Hagen, Jimmy Lin
Our black-box evaluation reveals an inherent bias of neural models towards retrieving short passages from the Touché 2020 data, and we also find that quite a few of the neural models' results are unjudged in the Touché 2020 data.
no code implementations • 26 Jun 2024 • Shi Zong, Jimmy Lin
We first investigate all the possible variations for the categorical syllogisms from a purely logical perspective and then examine the underlying configurations (i.e., mood and figure) tested by the existing datasets.
1 code implementation • 24 Jun 2024 • Ronak Pradeep, Nandan Thakur, Sahel Sharifymoghaddam, Eric Zhang, Ryan Nguyen, Daniel Campos, Nick Craswell, Jimmy Lin
In our work, we lay out the steps we've taken towards making this track a reality -- we describe the details of our reusable framework, Ragnarök, explain the curation of the new MS MARCO V2.1 collection choice, release the development topics for the track, and standardize the I/O definitions which assist the end user.
no code implementations • 17 Jun 2024 • Xueguang Ma, Sheng-Chieh Lin, Minghan Li, Wenhu Chen, Jimmy Lin
To this end, we propose Document Screenshot Embedding (DSE), a novel retrieval paradigm that regards document screenshots as a unified input format, which does not require any content-extraction preprocessing and preserves all the information in a document (e.g., text, image, and layout).
no code implementations • 14 Jun 2024 • Mohammad Dehghan, Mohammad Ali Alomrani, Sunyam Bagga, David Alfonso-Hermelo, Khalil Bibi, Abbas Ghaddar, Yingxue Zhang, Xiaoguang Li, Jianye Hao, Qun Liu, Jimmy Lin, Boxing Chen, Prasanna Parthasarathi, Mahdi Biparva, Mehdi Rezagholizadeh
To mitigate these issues, we propose our enhanced web and efficient knowledge graph (KG) retrieval solution (EWEK-QA) to enrich the content of the extracted knowledge fed to the system.
no code implementations • 13 Jun 2024 • Manveer Singh Tamber, Jasper Xian, Jimmy Lin
Embedding models that generate representation vectors from natural language text are widely used, reflect substantial investments, and carry significant commercial value.
no code implementations • 12 Jun 2024 • Raphael Tang, Xinyu Zhang, Lixinyu Xu, Yao Lu, Wenyan Li, Pontus Stenetorp, Jimmy Lin, Ferhan Ture
As far as we are aware, we are the first to analyze diffusion variability from a visuolinguistic perspective.
1 code implementation • 10 Jun 2024 • Shivani Upadhyay, Ronak Pradeep, Nandan Thakur, Nick Craswell, Jimmy Lin
Copious amounts of relevance judgments are necessary for the effective training and accurate evaluation of retrieval systems.
no code implementations • 29 May 2024 • Minghan Li, Xilun Chen, Ari Holtzman, Beidi Chen, Jimmy Lin, Wen-tau Yih, Xi Victoria Lin
Large language models (LLMs) often hallucinate and lack the ability to provide attribution for their generations.
no code implementations • 16 May 2024 • Sahel Sharifymoghaddam, Shivani Upadhyay, Wenhu Chen, Jimmy Lin
Recently, Multi-Modal (MM) Large Language Models (LLMs) have unlocked many complex use-cases that require MM understanding (e.g., image captioning or visual question answering) and MM generation (e.g., text-guided image generation or editing) capabilities.
no code implementations • 8 May 2024 • Shivani Upadhyay, Ehsan Kamalloo, Jimmy Lin
Based on our simulation experiments conducted on three TREC DL datasets, in the extreme scenario of retaining only 10% of judgments, our method achieves a Kendall tau correlation of 0.87 and 0.92 on average for Vicuña-7B and GPT-3.5 Turbo, respectively.
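Kendall tau is the agreement metric quoted above; a minimal example with SciPy, comparing a system ranking under full judgments against the same systems ranked under a judgment subset (rankings here are made up):

```python
from scipy.stats import kendalltau

full_judgments = [1, 2, 3, 4, 5, 6]    # system ranks with all judgments
subset_judgments = [1, 2, 4, 3, 5, 6]  # system ranks with 10% of judgments
tau, p_value = kendalltau(full_judgments, subset_judgments)
print(f"Kendall tau = {tau:.2f}")      # values near 1 mean strong agreement
```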
no code implementations • 2 May 2024 • Sheng-Chieh Lin, Luyu Gao, Barlas Oguz, Wenhan Xiong, Jimmy Lin, Wen-tau Yih, Xilun Chen
Furthermore, reward functions used in standard RL can also encourage hallucination, because they guide the LLM to provide more helpful responses on a diverse set of instructions, often preferring longer and more detailed responses.
1 code implementation • 29 Apr 2024 • Shengyao Zhuang, Xueguang Ma, Bevan Koopman, Jimmy Lin, Guido Zuccon
Utilizing large language models (LLMs) for zero-shot document ranking is done in one of two ways: (1) prompt-based re-ranking methods, which require no further training but are only feasible for re-ranking a handful of candidate documents due to computational costs; and (2) unsupervised contrastive trained dense retrieval methods, which can retrieve relevant documents from the entire corpus but require a large amount of paired text data for contrastive training.
no code implementations • 26 Dec 2023 • Mofetoluwa Adeyemi, Akintunde Oladipo, Ronak Pradeep, Jimmy Lin
Our implementation covers English and four African languages (Hausa, Somali, Swahili, and Yoruba) and we examine cross-lingual reranking with queries in English and passages in the African languages.
2 code implementations • 26 Dec 2023 • Manveer Singh Tamber, Ronak Pradeep, Jimmy Lin
We present a range of models from 220M parameters to 3B parameters, all with strong reranking results, challenging the necessity of large-scale models for effective zero-shot reranking and opening avenues for more efficient listwise reranking solutions.
1 code implementation • 18 Dec 2023 • Nandan Thakur, Luiz Bonifacio, Xinyu Zhang, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Boxing Chen, Mehdi Rezagholizadeh, Jimmy Lin
NoMIRACL includes both a non-relevant and a relevant subset.
2 code implementations • 5 Dec 2023 • Ronak Pradeep, Sahel Sharifymoghaddam, Jimmy Lin
In information retrieval, proprietary large language models (LLMs) such as GPT-4 and open-source counterparts such as LLaMA and Vicuna have played a vital role in reranking.
no code implementations • 5 Dec 2023 • Xinyu Zhang, Sebastian Hofstätter, Patrick Lewis, Raphael Tang, Jimmy Lin
However, current work in this direction all depends on GPT models, making them a single point of failure for scientific reproducibility.
no code implementations • 4 Dec 2023 • Jimmy Lin, Tommaso Teofili
In this work, we explore the contrarian approach of performing top-$k$ retrieval on dense vector representations using inverted indexes.
1 code implementation • 30 Nov 2023 • Raphael Tang, Xinyu Zhang, Jimmy Lin, Ferhan Ture
We propose a logistic Bradley-Terry probe which predicts word pair preferences of LLMs from the words' hidden vectors.
no code implementations • 30 Nov 2023 • Haonan Chen, Carlos Lassance, Jimmy Lin
The bi-encoder architecture provides a framework for understanding machine-learned retrieval models based on dense and sparse vector representations.
no code implementations • 15 Nov 2023 • Minghan Li, Honglei Zhuang, Kai Hui, Zhen Qin, Jimmy Lin, Rolf Jagerman, Xuanhui Wang, Michael Bendersky
In this paper, we re-examine this conclusion and raise the following question: Can query expansion improve generalization of strong cross-encoder rankers?
2 code implementations • 10 Nov 2023 • Nandan Thakur, Jianmo Ni, Gustavo Hernández Ábrego, John Wieting, Jimmy Lin, Daniel Cer
There has been limited success for dense retrieval models in multilingual retrieval, due to uneven and scarce training data available across multiple languages.
1 code implementation • 12 Oct 2023 • Xueguang Ma, Liang Wang, Nan Yang, Furu Wei, Jimmy Lin
Our findings demonstrate that the effectiveness of large language models indeed surpasses that of smaller models.
1 code implementation • 11 Oct 2023 • Raphael Tang, Xinyu Zhang, Xueguang Ma, Jimmy Lin, Ferhan Ture
Large language models (LLMs) exhibit positional bias in how they use context, which especially complicates listwise ranking.
3 code implementations • 26 Sep 2023 • Ronak Pradeep, Sahel Sharifymoghaddam, Jimmy Lin
Researchers have successfully applied large language models (LLMs) such as ChatGPT to reranking in an information retrieval context, but to date, such work has mostly been built on proprietary models hidden behind opaque API endpoints.
1 code implementation • 14 Sep 2023 • Chris Kamphuis, Aileen Lin, Siwen Yang, Jimmy Lin, Arjen P. de Vries, Faegheh Hasibi
MMEAD, or MS MARCO Entity Annotations and Disambiguations, is a resource for entity links for the MS MARCO datasets.
1 code implementation • 10 Sep 2023 • Zijun Wu, Anup Anand Deshmukh, Yongkang Wu, Jimmy Lin, Lili Mou
Our approach involves a two-stage training process: pretraining with an unsupervised parser and finetuning on downstream NLP tasks.
no code implementations • 29 Aug 2023 • Jimmy Lin, Ronak Pradeep, Tommaso Teofili, Jasper Xian
We provide a reproducible, end-to-end demonstration of vector search with OpenAI embeddings using Lucene on the popular MS MARCO passage ranking test collection.
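A sketch of the embedding half of that pipeline using the OpenAI Python client; the model name matches the era of the paper but should be treated as an assumption, and indexing the resulting vectors in Lucene is handled separately via Anserini:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str], model: str = "text-embedding-ada-002"):
    # One API call embeds a batch of passages or queries.
    resp = client.embeddings.create(model=model, input=texts)
    return [d.embedding for d in resp.data]

vectors = embed(["what is dense retrieval?"])
print(len(vectors[0]))  # 1536 dimensions for ada-002
```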
no code implementations • 14 Aug 2023 • Cynthia Huang, Yuqing Xie, Zhiying Jiang, Jimmy Lin, Ming Li
Leveraging the approximated information distance, our method allows the direct application of GPT models in quantitative text similarity measurements.
1 code implementation • 31 Jul 2023 • Ehsan Kamalloo, Aref Jafari, Xinyu Zhang, Nandan Thakur, Jimmy Lin
In this paper, we introduce a new dataset, HAGRID (Human-in-the-loop Attributable Generative Retrieval for Information-seeking Dataset) for building end-to-end generative information-seeking models that are capable of retrieving candidate quotes and generating attributed explanations.
1 code implementation • 19 Jul 2023 • Nandan Thakur, Kexin Wang, Iryna Gurevych, Jimmy Lin
In this work, we provide SPRINT, a unified Python toolkit based on Pyserini and Lucene, supporting a common interface for evaluating neural sparse retrieval.
2 code implementations • 13 Jun 2023 • Ehsan Kamalloo, Nandan Thakur, Carlos Lassance, Xueguang Ma, Jheng-Hong Yang, Jimmy Lin
BEIR is a benchmark dataset for zero-shot evaluation of information retrieval models across 18 different domain/task combinations.
1 code implementation • 2 Jun 2023 • Aleksandra Piktus, Odunayo Ogundepo, Christopher Akiki, Akintunde Oladipo, Xinyu Zhang, Hailey Schoelkopf, Stella Biderman, Martin Potthast, Jimmy Lin
We discuss how Pyserini - a widely used toolkit for reproducible IR research - can be integrated with the Hugging Face ecosystem of open-source AI libraries and artifacts.
no code implementations • 23 May 2023 • Vanessa Liao, Syed Shariyar Murtaza, Yifan Nie, Jimmy Lin
Our experiments on real-scenario production data show that this method of fine-tuning improves the downstream text classification tasks compared to fine-tuning only on domain-specific text.
no code implementations • 19 May 2023 • Ronak Pradeep, Kai Hui, Jai Gupta, Adam D. Lelkes, Honglei Zhuang, Jimmy Lin, Donald Metzler, Vinh Q. Tran
Popularized by the Differentiable Search Index, the emerging paradigm of generative retrieval re-frames the classic information retrieval problem into a sequence-to-sequence modeling task, forgoing external indices and encoding an entire document corpus within a single Transformer.
no code implementations • 14 May 2023 • Josh Seltzer, Jiahua Pan, Kathy Cheng, Yuxiao Sun, Santosh Kolagati, Jimmy Lin, Shi Zong
Market research surveys are a powerful methodology for understanding consumer perspectives at scale, but are limited by depth of understanding and insights.
no code implementations • 10 May 2023 • Ehsan Kamalloo, Xinyu Zhang, Odunayo Ogundepo, Nandan Thakur, David Alfonso-Hermelo, Mehdi Rezagholizadeh, Jimmy Lin
The ever-increasing size of language models curtails their widespread availability to the community, thereby galvanizing many companies into offering access to large language models through APIs.
no code implementations • 3 May 2023 • Xueguang Ma, Xinyu Zhang, Ronak Pradeep, Jimmy Lin
Supervised ranking methods based on bi-encoder or cross-encoder architectures have shown success in multi-stage text ranking tasks, but they require large amounts of relevance judgments as training data.
no code implementations • 24 Apr 2023 • Xueguang Ma, Tommaso Teofili, Jimmy Lin
With Pyserini, which provides a Python interface to Anserini, users gain access to both sparse and dense retrieval models, as Pyserini implements bindings to the Faiss vector search library alongside Lucene inverted indexes in a uniform, consistent interface.
2 code implementations • 4 Apr 2023 • Jheng-Hong Yang, Carlos Lassance, Rafael Sampaio de Rezende, Krishna Srinivasan, Miriam Redi, Stéphane Clinchant, Jimmy Lin
This paper presents the AToMiC (Authoring Tools for Multimedia Content) dataset, designed to advance research in image/text cross-modal retrieval.
no code implementations • 3 Apr 2023 • Jimmy Lin, David Alfonso-Hermelo, Vitor Jeronymo, Ehsan Kamalloo, Carlos Lassance, Rodrigo Nogueira, Odunayo Ogundepo, Mehdi Rezagholizadeh, Nandan Thakur, Jheng-Hong Yang, Xinyu Zhang
The advent of multilingual language models has generated a resurgence of interest in cross-lingual information retrieval (CLIR), which is the task of searching documents in one language with queries from another.
1 code implementation • 28 Feb 2023 • Christopher Akiki, Odunayo Ogundepo, Aleksandra Piktus, Xinyu Zhang, Akintunde Oladipo, Jimmy Lin, Martin Potthast
We present Spacerini, a tool that integrates the Pyserini toolkit for reproducible information retrieval research with Hugging Face to enable the seamless construction and deployment of interactive search engines.
1 code implementation • 15 Feb 2023 • Sheng-Chieh Lin, Akari Asai, Minghan Li, Barlas Oguz, Jimmy Lin, Yashar Mehdad, Wen-tau Yih, Xilun Chen
We hence propose a new DA approach with diverse queries and sources of supervision to progressively train a generalizable DR. As a result, DRAGON, our dense retriever trained with diverse augmentation, is the first BERT-base-sized DR to achieve state-of-the-art effectiveness in both supervised and zero-shot evaluations and even competes with models using more complex late interaction (ColBERTv2 and SPLADE++).
1 code implementation • 13 Feb 2023 • Minghan Li, Sheng-Chieh Lin, Xueguang Ma, Jimmy Lin
Multi-vector retrieval methods have demonstrated their effectiveness on various retrieval datasets, and among them, ColBERT is the most established method based on the late interaction of contextualized token embeddings of pre-trained language models.
no code implementations • 13 Feb 2023 • Xinyu Zhang, Minghan Li, Jimmy Lin
Recent progress in information retrieval finds that embedding queries and documents into multi-vector representations yields a robust bi-encoder retriever on out-of-distribution datasets.
no code implementations • 17 Jan 2023 • Shi Zong, Josh Seltzer, Jiahua Pan, Kathy Cheng, Jimmy Lin
Industry practitioners always face the problem of choosing an appropriate model for deployment under different considerations, such as maximizing a metric that is crucial for production or reducing the total cost given financial concerns.
1 code implementation • 27 Dec 2022 • Jimmy Lin
Reproducibility is an ideal that no researcher would dispute "in the abstract", but when aspirations meet the cold hard reality of the academic grind, reproducibility often "loses out".
2 code implementations • 20 Dec 2022 • Luyu Gao, Xueguang Ma, Jimmy Lin, Jamie Callan
Given a query, HyDE first zero-shot instructs an instruction-following language model (e.g., InstructGPT) to generate a hypothetical document.
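A compact sketch of the HyDE recipe; `generate` and `encode` below are placeholders standing in for any instruction-following LLM and any unsupervised contrastive encoder (the paper used InstructGPT and Contriever), so both names are assumptions:

```python
import numpy as np

def hyde_search(query: str, doc_vectors: np.ndarray, generate, encode, k=10):
    # 1. Zero-shot generate a hypothetical document that answers the query.
    hypothetical_doc = generate(f"Write a passage that answers: {query}")
    # 2. Embed the hypothetical document, not the query itself.
    q_vec = encode(hypothetical_doc)
    # 3. Standard dense retrieval against the real corpus.
    scores = doc_vectors @ q_vec
    return np.argsort(-scores)[:k]
```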
no code implementations • 19 Dec 2022 • Zhiying Jiang, Matthew Y. R. Yang, Mikhail Tsirlin, Raphael Tang, Jimmy Lin
Our method also performs particularly well in few-shot settings where labeled data are too scarce for DNNs to achieve satisfactory accuracy.
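The method itself fits in a few lines: normalized compression distance (NCD) with gzip plus a k-nearest-neighbor vote, with no trainable parameters. A sketch:

```python
import gzip

def ncd(x: str, y: str) -> float:
    # Normalized compression distance: how much better x+y compresses
    # together than apart, as a proxy for information distance.
    cx = len(gzip.compress(x.encode()))
    cy = len(gzip.compress(y.encode()))
    cxy = len(gzip.compress((x + " " + y).encode()))
    return (cxy - min(cx, cy)) / max(cx, cy)

def classify(test_doc: str, train: list[tuple[str, str]], k: int = 3) -> str:
    # train: (document, label) pairs; smaller NCD = more similar.
    neighbors = sorted(train, key=lambda dl: ncd(test_doc, dl[0]))[:k]
    labels = [label for _, label in neighbors]
    return max(set(labels), key=labels.count)  # majority vote
```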
no code implementations • 10 Dec 2022 • Yizhen Zhong, Jiajie Xiao, Thomas Vetterli, Mahan Matin, Ellen Loo, Jimmy Lin, Richard Bourgon, Ofer Shapira
The application of natural language processing (NLP) to cancer pathology reports has been focused on detecting cancer cases, largely ignoring precancerous cases.
no code implementations • 21 Nov 2022 • Raphael Tang, Karun Kumar, Gefei Yang, Akshat Pandey, Yajie Mao, Vladislav Belyaev, Madhuri Emmadi, Craig Murray, Ferhan Ture, Jimmy Lin
In this paper, we explore training and deploying an ASR system in the label-scarce, compute-limited setting.
1 code implementation • 18 Nov 2022 • Minghan Li, Sheng-Chieh Lin, Barlas Oguz, Asish Ghoshal, Jimmy Lin, Yashar Mehdad, Wen-tau Yih, Xilun Chen
In this paper, we unify different multi-vector retrieval models from a token routing viewpoint and propose conditional token interaction via dynamic lexical routing, namely CITADEL, for efficient and effective multi-vector retrieval.
no code implementations • 1 Nov 2022 • Jimmy Lin
We evaluate this proposal and find that it can reduce the negative impact of noise added by differential privacy mechanisms on test accuracy by up to 24.6%, and reduce the negative impact of gradient sparsification on test accuracy by up to 15.1%.
no code implementations • 25 Oct 2022 • Peng Shi, Rui Zhang, He Bai, Jimmy Lin
We also include global translation exemplars for a target language to facilitate the translation process for large language models.
1 code implementation • 18 Oct 2022 • Xinyu Zhang, Nandan Thakur, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, Jimmy Lin
MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual dataset we have built for the WSDM 2023 Cup challenge that focuses on ad hoc retrieval across 18 different languages, which collectively encompass over three billion native speakers around the world.
no code implementations • 13 Oct 2022 • Linqing Liu, Minghan Li, Jimmy Lin, Sebastian Riedel, Pontus Stenetorp
To balance these two considerations, we propose a combination of an effective filtering strategy and fusion of the retrieved documents based on the generation probability of each context.
no code implementations • 11 Oct 2022 • Odunayo Ogundepo, Xinyu Zhang, Jimmy Lin
However, only a handful of the 7000+ languages on the planet benefit from specialized, custom-built tokenization algorithms, while the other languages are stuck with a "default" whitespace tokenizer, which cannot capture the intricacies of different languages.
2 code implementations • 10 Oct 2022 • Raphael Tang, Linqing Liu, Akshat Pandey, Zhiying Jiang, Gefei Yang, Karun Kumar, Pontus Stenetorp, Jimmy Lin, Ferhan Ture
Large-scale diffusion neural networks represent a substantial milestone in text-to-image generation, but they remain poorly understood, lacking interpretability analyses.
no code implementations • 31 Jul 2022 • Ji Xin, Raphael Tang, Zhiying Jiang, YaoLiang Yu, Jimmy Lin
There exists a wide variety of efficiency methods for natural language processing (NLP) tasks, such as pruning, distillation, dynamic inference, quantization, etc.
1 code implementation • 31 Jul 2022 • Sheng-Chieh Lin, Minghan Li, Jimmy Lin
Pre-trained language models have been successful in many knowledge-intensive NLP tasks.
no code implementations • 23 Jun 2022 • Zhiying Jiang, Yiqin Dai, Ji Xin, Ming Li, Jimmy Lin
Most real-world problems that machine learning algorithms are expected to solve involve 1) unknown data distributions, 2) little domain-specific knowledge, and 3) datasets with limited annotation.
1 code implementation • 20 Jun 2022 • Sheng-Chieh Lin, Jimmy Lin
In contrast, our work integrates lexical representations with dense semantic representations by densifying high-dimensional lexical representations into what we call low-dimensional dense lexical representations (DLRs).
2 code implementations • 23 May 2022 • Nandan Thakur, Nils Reimers, Jimmy Lin
In our work, we evaluate LTH and vector compression techniques for improving the downstream zero-shot retrieval accuracy of the TAS-B dense retriever while maintaining efficiency at inference.
1 code implementation • 19 May 2022 • Minghan Li, Xinyu Zhang, Ji Xin, Hongyang Zhang, Jimmy Lin
For example, on MS MARCO Passage v1, our method yields an average candidate set size of 27 out of 1,000, which increases reranking speed by about 37 times, while MRR@10 remains greater than a pre-specified value of 0.38 with about 90% empirical coverage, a guarantee that the empirical baselines fail to provide.
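A sketch of the idea behind such a coverage guarantee: calibrate a first-stage score cutoff on held-out queries so that, with roughly the target coverage, the pruned candidate set still contains the relevant passage. This is a generic split-conformal-style procedure under assumed inputs, not necessarily the paper's exact algorithm:

```python
import numpy as np

def calibrate_cutoff(relevant_scores: np.ndarray, coverage: float = 0.9):
    # relevant_scores: first-stage score of the known-relevant passage
    # for each calibration query; keep everything scoring above the cutoff.
    return np.quantile(relevant_scores, 1.0 - coverage)

def prune(candidates: list[tuple[str, float]], cutoff: float):
    # candidates: (passage_id, first-stage score) pairs for one query.
    return [(pid, s) for pid, s in candidates if s >= cutoff]
```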
no code implementations • 30 Apr 2022 • Hang Li, Shuai Wang, Shengyao Zhuang, Ahmed Mourad, Xueguang Ma, Jimmy Lin, Guido Zuccon
In this paper we consider the problem of combining the relevance signals from sparse and dense retrievers in the context of Pseudo Relevance Feedback (PRF).
no code implementations • 5 Apr 2022 • Xinyu Zhang, Kelechi Ogueji, Xueguang Ma, Jimmy Lin
Dense retrieval models using a transformer-based bi-encoder design have emerged as an active area of research.
1 code implementation • 21 Mar 2022 • Wei Zhong, Jheng-Hong Yang, Yuqing Xie, Jimmy Lin
With the recent success of dense retrieval methods based on bi-encoders, studies have applied this approach to various interesting downstream retrieval tasks with good efficiency and in-domain effectiveness.
Ranked #1 on Math Information Retrieval on ARQMath (using extra training data)
1 code implementation • 11 Mar 2022 • Luyu Gao, Xueguang Ma, Jimmy Lin, Jamie Callan
In this paper, we present Tevatron, a dense retrieval toolkit optimized for efficiency, flexibility, and code simplicity.
no code implementations • 26 Jan 2022 • Ellen M. Voorhees, Ian Soboroff, Jimmy Lin
Neural retrieval models are generally regarded as fundamentally different from the retrieval techniques used in the late 1990s, when the TREC ad hoc test collections were constructed.
no code implementations • 17 Dec 2021 • Jheng-Hong Yang, Xueguang Ma, Jimmy Lin
Sparse lexical representation learning has demonstrated much progress in improving passage retrieval effectiveness in recent models such as DeepImpact, uniCOIL, and SPLADE.
1 code implementation • 13 Dec 2021 • Hang Li, Shengyao Zhuang, Ahmed Mourad, Xueguang Ma, Jimmy Lin, Guido Zuccon
Finally, we contribute a study of the generalisability of the ANCE-PRF method when dense retrievers other than ANCE are used for the first round of retrieval and for encoding the PRF signal.
1 code implementation • 9 Dec 2021 • Sheng-Chieh Lin, Jimmy Lin
Learned sparse and dense representations capture different successful approaches to text retrieval and the fusion of their results has proven to be more effective and robust.
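One simple fusion this line of work builds on: a weighted sum of min-max-normalized sparse and dense scores. A sketch only; the weighting scheme here is an assumption, not necessarily the paper's exact method:

```python
def fuse(sparse: dict[str, float], dense: dict[str, float], alpha=0.5):
    # sparse/dense: docid -> score from each retriever for one query.
    def norm(run):
        lo, hi = min(run.values()), max(run.values())
        return {d: (s - lo) / (hi - lo or 1.0) for d, s in run.items()}
    s, d = norm(sparse), norm(dense)
    fused = {doc: alpha * s.get(doc, 0.0) + (1 - alpha) * d.get(doc, 0.0)
             for doc in set(s) | set(d)}
    return sorted(fused.items(), key=lambda x: -x[1])
```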
no code implementations • 22 Oct 2021 • Joel Mackenzie, Andrew Trotman, Jimmy Lin
Recent advances in retrieval models based on learned sparse representations generated by transformers have led us to, once again, consider score-at-a-time query evaluation techniques for the top-k retrieval problem.
no code implementations • 4 Oct 2021 • Minghan Li, Jimmy Lin
Previous work on the generalization of DPR mainly focuses on testing both encoders in tandem on out-of-distribution (OOD) question-answering (QA) tasks, which is also known as domain adaptation.
no code implementations • 4 Oct 2021 • Jimmy Lin
This paper outlines a conceptual framework for understanding recent developments in information retrieval and natural language processing that attempts to integrate dense and sparse retrieval methods.
no code implementations • 3 Sep 2021 • Peng Shi, Rui Zhang, He Bai, Jimmy Lin
Dense retrieval has shown great success in passage ranking in English.
1 code implementation • EMNLP (MRL) 2021 • Xinyu Zhang, Xueguang Ma, Peng Shi, Jimmy Lin
We present Mr. TyDi, a multi-lingual benchmark dataset for mono-lingual retrieval in eleven typologically diverse languages, designed to evaluate ranking with learned dense representations.
1 code implementation • ACL 2021 • Ji Xin, Raphael Tang, YaoLiang Yu, Jimmy Lin
To fill this void in the literature, we study in this paper selective prediction for NLP, comparing different models and confidence estimators.
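The simplest confidence estimator such studies compare against is softmax response with threshold-based abstention; a minimal sketch:

```python
import torch

def predict_or_abstain(logits: torch.Tensor, threshold: float = 0.9):
    probs = torch.softmax(logits, dim=-1)
    conf, label = probs.max(dim=-1)
    # Return the label only when confident; otherwise defer (abstain).
    return label.item() if conf.item() >= threshold else None

print(predict_or_abstain(torch.tensor([2.0, 0.1, 0.1])))  # confident case
```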
no code implementations • ACL 2021 • Kelvin Jiang, Ronak Pradeep, Jimmy Lin
This work explores a framework for fact verification that leverages pretrained sequence-to-sequence transformer models for sentence selection and label prediction, two key sub-tasks in fact verification.
no code implementations • 28 Jun 2021 • Jimmy Lin, Xueguang Ma
Recent developments in representational learning for information retrieval can be organized in a conceptual framework that establishes two pairs of contrasts: sparse vs. dense representations and unsupervised vs. learned representations.
no code implementations • 9 May 2021 • Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, Jimmy Lin
Evaluation efforts such as TREC, CLEF, NTCIR, and FIRE, alongside public leaderboards such as MS MARCO, are intended to encourage research and track our progress, addressing big questions in our field.
no code implementations • EMNLP 2021 • Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin
This paper describes a compact and effective model for low-latency passage retrieval in conversational search based on learned dense representations.
4 code implementations • 14 Apr 2021 • Sebastian Hofstätter, Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin, Allan Hanbury
A vital step towards the widespread adoption of neural retrieval models is their resource efficiency throughout the training, indexing and query workflows.
Ranked #15 on Zero-shot Text Search on BEIR
1 code implementation • 12 Apr 2021 • Xueguang Ma, Kai Sun, Ronak Pradeep, Jimmy Lin
Text retrieval using learned dense representations has recently emerged as a promising alternative to "traditional" text retrieval using sparse bag-of-words representations.
1 code implementation • EACL 2021 • Ji Xin, Raphael Tang, YaoLiang Yu, Jimmy Lin
The slow speed of BERT has motivated much research on accelerating its inference, and the early exiting idea has been proposed to make trade-offs between model quality and efficiency.
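A sketch of entropy-based early exiting in the DeeBERT style: each layer gets an "off-ramp" classifier, and inference stops once prediction entropy falls below a threshold. The interfaces (per-layer blocks and ramp heads, batch size 1) are assumptions:

```python
import torch

def early_exit(hidden, layers, ramps, threshold: float = 0.3):
    # layers: transformer blocks; ramps: per-layer classifier heads.
    for i, (layer, ramp) in enumerate(zip(layers, ramps)):
        hidden = layer(hidden)
        probs = torch.softmax(ramp(hidden[:, 0]), dim=-1)  # [CLS] position
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)
        if entropy.item() < threshold:       # confident enough: exit here
            return probs.argmax(-1).item(), i
    return probs.argmax(-1).item(), i        # fell through to final layer
```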
1 code implementation • 25 Feb 2021 • Rodrigo Nogueira, Zhiying Jiang, Jimmy Lin
In this work, we investigate if the surface form of a number has any influence on how sequence-to-sequence language models learn simple arithmetic tasks such as addition and subtraction across a wide range of values.
no code implementations • 25 Feb 2021 • Jimmy Lin, Daniel Campos, Nick Craswell, Bhaskar Mitra, Emine Yilmaz
Leaderboards are a ubiquitous part of modern research in applied machine learning.
1 code implementation • 19 Feb 2021 • Jimmy Lin, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep, Rodrigo Nogueira
Pyserini is an easy-to-use Python toolkit that supports replicable IR research by providing effective first-stage retrieval in a multi-stage ranking architecture.
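Minimal Pyserini usage for first-stage BM25 retrieval; the prebuilt index name follows the toolkit's published MS MARCO indexes, though names and module paths vary across versions, so treat them as assumptions:

```python
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher.from_prebuilt_index("msmarco-v1-passage")
hits = searcher.search("what is dense retrieval?", k=10)
for hit in hits:
    print(f"{hit.docid} {hit.score:.3f}")  # candidates for later reranking
```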
4 code implementations • 14 Jan 2021 • Ronak Pradeep, Rodrigo Nogueira, Jimmy Lin
We propose a design pattern for tackling text ranking problems, dubbed "Expando-Mono-Duo", that has been empirically validated for a number of ad hoc retrieval tasks in different domains.
1 code implementation • Findings of the Association for Computational Linguistics 2020 • Zhiying Jiang, Raphael Tang, Ji Xin, Jimmy Lin
We show the effectiveness of our method in terms of attribution and the ability to provide insight into how information flows through layers.
no code implementations • COLING 2020 • Jheng-Hong Yang, Sheng-Chieh Lin, Rodrigo Nogueira, Ming-Feng Tsai, Chuan-Ju Wang, Jimmy Lin
While internalized "implicit knowledge" in pretrained transformers has led to fruitful progress in many natural language understanding tasks, how to most effectively elicit such knowledge remains an open question.
no code implementations • Findings of the Association for Computational Linguistics 2020 • Peng Shi, He Bai, Jimmy Lin
We tackle the challenge of cross-lingual training of neural document ranking models for mono-lingual retrieval, specifically leveraging relevance judgments in English to improve search in non-English languages.
2 code implementations • 22 Oct 2020 • Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin
We present an approach to ranking with dense representations that applies knowledge distillation to improve the recently proposed late-interaction ColBERT model.
no code implementations • EACL (Louhi) 2021 • Ronak Pradeep, Xueguang Ma, Rodrigo Nogueira, Jimmy Lin
This work describes the adaptation of a pretrained sequence-to-sequence model to the task of scientific claim verification in the biomedical domain.
1 code implementation • 15 Oct 2020 • Martin Gauch, Frederik Kratzert, Daniel Klotz, Grey Nearing, Jimmy Lin, Sepp Hochreiter
Compared to naive prediction with a distinct LSTM per timescale, the multi-timescale architectures are computationally more efficient with no loss in accuracy.
1 code implementation • NAACL 2021 • Jimmy Lin, Rodrigo Nogueira, Andrew Yates
There are two themes that pervade our survey: techniques for handling long documents, beyond typical sentence-by-sentence processing in NLP, and techniques for addressing the tradeoff between effectiveness (i.e., result quality) and efficiency (e.g., query latency, model and index size).
no code implementations • EACL 2021 • Mohan Zhang, Luchen Tan, Zhengkai Tu, Zihang Fu, Kun Xiong, Ming Li, Jimmy Lin
The contribution of this work is a novel data generation technique using distant supervision that allows us to start with a pretrained sequence-to-sequence model and fine-tune a paraphrase generator that exhibits this behavior, allowing user-controllable paraphrase generation.
2 code implementations • EMNLP (NLPOSS) 2020 • Raphael Tang, Jaejun Lee, Afsaneh Razi, Julia Cambre, Ian Bicking, Jofish Kaye, Jimmy Lin
We describe Howl, an open-source wake word detection toolkit with native support for open speech datasets, like Mozilla Common Voice and Google Speech Commands.
Ranked #4 on Keyword Spotting on Google Speech Commands
1 code implementation • EMNLP (sdp) 2020 • Edwin Zhang, Nikhil Gupta, Raphael Tang, Xiao Han, Ronak Pradeep, Kuang Lu, Yue Zhang, Rodrigo Nogueira, Kyunghyun Cho, Hui Fang, Jimmy Lin
We present Covidex, a search engine that exploits the latest neural ranking models to provide information access to the COVID-19 Open Research Dataset curated by the Allen Institute for AI.
no code implementations • ACL 2020 • Edwin Zhang, Nikhil Gupta, Rodrigo Nogueira, Kyunghyun Cho, Jimmy Lin
The Neural Covidex is a search engine that exploits the latest neural ranking architectures to provide information access to the COVID-19 Open Research Dataset (CORD-19) curated by the Allen Institute for AI.
no code implementations • WS 2020 • Ashutosh Adhikari, Achyudh Ram, Raphael Tang, William L. Hamilton, Jimmy Lin
Fine-tuned variants of BERT are able to achieve state-of-the-art accuracy on many natural language processing tasks, although at significant computational costs.
2 code implementations • ICML 2020 • Jimmy Lin, Chudi Zhong, Diane Hu, Cynthia Rudin, Margo Seltzer
Decision tree optimization is notoriously difficult from a computational perspective but essential for the field of interpretable machine learning.
no code implementations • 5 Jun 2020 • Martin Gauch, Jimmy Lin
In recent years, the paradigms of data-driven science have become essential components of physical sciences, particularly in geophysical disciplines such as climatology.
no code implementations • 5 May 2020 • Sheng-Chieh Lin, Jheng-Hong Yang, Rodrigo Nogueira, Ming-Feng Tsai, Chuan-Ju Wang, Jimmy Lin
Conversational search plays a vital role in conversational information seeking.
1 code implementation • 30 Apr 2020 • He Bai, Peng Shi, Jimmy Lin, Yuqing Xie, Luchen Tan, Kun Xiong, Wen Gao, Ming Li
To verify this, we propose a segment-aware Transformer (Segatron), by replacing the original token position encoding with a combined position encoding of paragraph, sentence, and token.
Ranked #20 on Language Modelling on WikiText-103
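The core Segatron change, sketched: sum paragraph-, sentence-, and token-level position embeddings instead of using a single token position embedding. Dimensions and maximum counts here are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SegmentAwarePositions(nn.Module):
    def __init__(self, dim=768, max_tok=512, max_sent=64, max_para=16):
        super().__init__()
        self.tok = nn.Embedding(max_tok, dim)    # token position in sentence
        self.sent = nn.Embedding(max_sent, dim)  # sentence index in paragraph
        self.para = nn.Embedding(max_para, dim)  # paragraph index in document

    def forward(self, tok_pos, sent_pos, para_pos):
        # Each argument: [batch, seq_len] index tensor at that granularity.
        return self.tok(tok_pos) + self.sent(sent_pos) + self.para(para_pos)

enc = SegmentAwarePositions()
zeros = torch.zeros(1, 8, dtype=torch.long)
print(enc(zeros, zeros, zeros).shape)  # torch.Size([1, 8, 768])
```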
1 code implementation • ACL 2020 • Raphael Tang, Jaejun Lee, Ji Xin, Xinyu Liu, Yao-Liang Yu, Jimmy Lin
In natural language processing, a recently popular line of work explores how to best report the experimental results of neural networks.
3 code implementations • ACL 2020 • Ji Xin, Raphael Tang, Jaejun Lee, Yao-Liang Yu, Jimmy Lin
Large-scale pre-trained language models such as BERT have brought significant improvements to NLP applications.
1 code implementation • 23 Apr 2020 • Raphael Tang, Rodrigo Nogueira, Edwin Zhang, Nikhil Gupta, Phuong Cam, Kyunghyun Cho, Jimmy Lin
We present CovidQA, the beginnings of a question answering dataset specifically designed for COVID-19, built by hand from knowledge gathered from Kaggle's COVID-19 Open Research Dataset Challenge.
1 code implementation • 10 Apr 2020 • Edwin Zhang, Nikhil Gupta, Rodrigo Nogueira, Kyunghyun Cho, Jimmy Lin
We present the Neural Covidex, a search engine that exploits the latest neural ranking architectures to provide information access to the COVID-19 Open Research Dataset curated by the Allen Institute for AI.
1 code implementation • ACL 2021 • He Bai, Peng Shi, Jimmy Lin, Luchen Tan, Kun Xiong, Wen Gao, Jie Liu, Ming Li
Experimental results show that the Chinese GPT2 can generate better essay endings with the end-of-paragraph (\eop) token.
no code implementations • 4 Apr 2020 • Sheng-Chieh Lin, Jheng-Hong Yang, Rodrigo Nogueira, Ming-Feng Tsai, Chuan-Ju Wang, Jimmy Lin
This paper presents an empirical study of conversational question reformulation (CQR) with sequence-to-sequence architectures and pretrained language models (PLMs).
2 code implementations • 18 Mar 2020 • Jimmy Lin, Joel Mackenzie, Chris Kamphuis, Craig Macdonald, Antonio Mallia, Michał Siedlaczek, Andrew Trotman, Arjen de Vries
There exists a natural tension between encouraging a diverse ecosystem of open-source search engines and supporting fair, replicable comparisons across those systems.
no code implementations • 18 Mar 2020 • Sheng-Chieh Lin, Jheng-Hong Yang, Rodrigo Nogueira, Ming-Feng Tsai, Chuan-Ju Wang, Jimmy Lin
We applied the T5 sequence-to-sequence model to tackle the AI2 WinoGrande Challenge by decomposing each example into two input text strings, each containing a hypothesis, and using the probabilities assigned to the "entailment" token as a score of the hypothesis.
Ranked #17 on Coreference Resolution on Winograd Schema Challenge
2 code implementations • Findings of the Association for Computational Linguistics 2020 • Rodrigo Nogueira, Zhiying Jiang, Jimmy Lin
We investigate this observation further by varying target words to probe the model's use of latent knowledge.
Ranked #1 on Ad-Hoc Information Retrieval on TREC Robust04
1 code implementation • 5 Feb 2020 • Ruixue Zhang, Wei Yang, Luyun Lin, Zhengkai Tu, Yuqing Xie, Zihang Fu, Yuhao Xie, Luchen Tan, Kun Xiong, Jimmy Lin
Techniques for automatically extracting important content elements from business documents such as contracts, statements, and filings have the potential to make business operations more efficient.
no code implementations • 4 Feb 2020 • Jimmy Lin
This paper describes a working prototype that adapts Lucene, the world's most popular and most widely deployed open-source search library, to operate within a serverless environment in the cloud.
no code implementations • 23 Jan 2020 • Rodrigo Nogueira, Zhiying Jiang, Kyunghyun Cho, Jimmy Lin
Citation recommendation systems for the scientific literature, to help authors find papers that should be cited, have the potential to speed up discoveries and uncover new routes for scientific exploration.
1 code implementation • 15 Jan 2020 • Nick Ruest, Jimmy Lin, Ian Milligan, Samantha Fritz
The Archives Unleashed project aims to improve scholarly access to web archives through a multi-pronged strategy involving tool creation, process modeling, and community building - all proceeding concurrently in mutually-reinforcing efforts.
1 code implementation • 17 Nov 2019 • Martin Gauch, Juliane Mai, Jimmy Lin
Accurate streamflow prediction largely relies on historical meteorological records and streamflow measurements.
no code implementations • 15 Nov 2019 • Achyudh Ram, Ji Xin, Meiyappan Nagappan, Yao-Liang Yu, Rocío Cabrera Lozoya, Antonino Sabetta, Jimmy Lin
Public vulnerability databases such as CVE and NVD account for only 60% of security vulnerabilities present in open-source projects, and are known to suffer from inconsistent quality.