Search Results for author: Jimmy Lin

Found 180 papers, 81 papers with code

Multi-Task Dense Retrieval via Model Uncertainty Fusion for Open-Domain Question Answering

1 code implementation • Findings (EMNLP) 2021 • Minghan Li, Ming Li, Kun Xiong, Jimmy Lin

Our method reaches state-of-the-art performance on 5 benchmark QA datasets, with up to 10% improvement in top-100 accuracy compared to a joint-training multi-task DPR on SQuAD.

Open-Domain Question Answering Retrieval

In-Batch Negatives for Knowledge Distillation with Tightly-Coupled Teachers for Dense Retrieval

no code implementations • ACL (RepL4NLP) 2021 • Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin

We present an efficient training approach to text retrieval with dense representations that applies knowledge distillation using the ColBERT late-interaction ranking model.

Document Ranking Knowledge Distillation +2

Cross-Lingual Training of Dense Retrievers for Document Retrieval

no code implementations • EMNLP (MRL) 2021 • Peng Shi, Rui Zhang, He Bai, Jimmy Lin

Dense retrieval has shown great success for passage ranking in English.

Document Ranking Passage Ranking +2

Small Data? No Problem! Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages

1 code implementation • EMNLP (MRL) 2021 • Kelechi Ogueji, Yuxin Zhu, Jimmy Lin

In this work, we challenge this assumption and present the first attempt at training a multilingual language model on only low-resource languages.

Language Modelling named-entity-recognition +5

How Does BERT Rerank Passages? An Attribution Analysis with Information Bottlenecks

no code implementations • EMNLP (BlackboxNLP) 2021 • Zhiying Jiang, Raphael Tang, Ji Xin, Jimmy Lin

Fine-tuned pre-trained transformers achieve the state of the art in passage reranking.

Bag-of-Words Baselines for Semantic Code Search

no code implementations • ACL (NLP4Prog) 2021 • Xinyu Zhang, Ji Xin, Andrew Yates, Jimmy Lin

The task of semantic code search is to retrieve code snippets from a source code corpus based on an information need expressed in natural language.

Code Search Information Retrieval +2

Early Exiting BERT for Efficient Document Ranking

1 code implementation • EMNLP (sustainlp) 2020 • Ji Xin, Rodrigo Nogueira, YaoLiang Yu, Jimmy Lin

Pre-trained language models such as BERT have shown their effectiveness in various tasks.

Document Ranking

A Little Bit Is Worse Than None: Ranking with Limited Training Data

no code implementations • EMNLP (sustainlp) 2020 • Xinyu Zhang, Andrew Yates, Jimmy Lin

Researchers have proposed simple yet effective techniques for the retrieval problem based on using BERT as a relevance classifier to rerank initial candidates from keyword search.

Passage Retrieval Retrieval

Voice Query Auto Completion

no code implementations • EMNLP 2021 • Raphael Tang, Karun Kumar, Kendra Chalkley, Ji Xin, Liming Zhang, Wenyan Li, Gefei Yang, Yajie Mao, Junho Shin, Geoffrey Craig Murray, Jimmy Lin

Query auto completion (QAC) is the task of predicting a search engine user’s final query from their intermediate, incomplete query.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Cydex: Neural Search Infrastructure for the Scholarly Literature

no code implementations • EMNLP (sdp) 2020 • Shane Ding, Edwin Zhang, Jimmy Lin

Cydex is a platform that provides neural search infrastructure for domain-specific scholarly literature.

Simple and Effective Unsupervised Redundancy Elimination to Compress Dense Vectors for Passage Retrieval

no code implementations • EMNLP 2021 • Xueguang Ma, Minghan Li, Kai Sun, Ji Xin, Jimmy Lin

Recent work has shown that dense passage retrieval techniques achieve better ranking accuracy in open-domain question answering compared to sparse retrieval techniques such as BM25, but at the cost of large space and memory requirements.

Open-Domain Question Answering Passage Retrieval +2

Learning to Rank in the Age of Muppets: Effectiveness–Efficiency Tradeoffs in Multi-Stage Ranking

no code implementations • EMNLP (sustainlp) 2021 • Yue Zhang, ChengCheng Hu, Yuqi Liu, Hui Fang, Jimmy Lin

It is well known that rerankers built on pretrained transformer models such as BERT have dramatically improved retrieval effectiveness in many tasks.

Document Ranking Learning-To-Rank +1

An Encoder Attribution Analysis for Dense Passage Retriever in Open-Domain Question Answering

no code implementations • NAACL (TrustNLP) 2022 • Minghan Li, Xueguang Ma, Jimmy Lin

The bi-encoder design of dense passage retriever (DPR) is a key factor in its success in open-domain question answering (QA), yet it is unclear how DPR’s question encoder and passage encoder individually contribute to overall performance, which we refer to as the encoder attribution problem.

Open-Domain Question Answering Retrieval

Unsupervised Chunking as Syntactic Structure Induction with a Knowledge-Transfer Approach

1 code implementation • Findings (EMNLP) 2021 • Anup Anand Deshmukh, Qianqiu Zhang, Ming Li, Jimmy Lin, Lili Mou

In this paper, we address unsupervised chunking as a new task of syntactic structure induction, which is helpful for understanding the linguistic structures of human languages as well as processing low-resource languages.

Chunking Transfer Learning

Scaling Down, LiTting Up: Efficient Zero-Shot Listwise Reranking with Seq2seq Encoder-Decoder Models

1 code implementation • 26 Dec 2023 • Manveer Singh Tamber, Ronak Pradeep, Jimmy Lin

We present a range of models from 220M parameters to 3B parameters, all with strong reranking results, challenging the necessity of large-scale models for effective zero-shot reranking and opening avenues for more efficient listwise reranking solutions.

Zero-Shot Cross-Lingual Reranking with Large Language Models for Low-Resource Languages

no code implementations • 26 Dec 2023 • Mofetoluwa Adeyemi, Akintunde Oladipo, Ronak Pradeep, Jimmy Lin

Our implementation covers English and four African languages (Hausa, Somali, Swahili, and Yoruba) and we examine cross-lingual reranking with queries in English and passages in the African languages.

Cross-Lingual Information Retrieval Retrieval

NoMIRACL: Knowing When You Don't Know for Robust Multilingual Retrieval-Augmented Generation

1 code implementation • 18 Dec 2023 • Nandan Thakur, Luiz Bonifacio, Xinyu Zhang, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Boxing Chen, Mehdi Rezagholizadeh, Jimmy Lin

We measure LLM robustness using two metrics: (i) hallucination rate, measuring model tendency to hallucinate an answer, when the answer is not present in passages in the non-relevant subset, and (ii) error rate, measuring model inaccuracy to recognize relevant passages in the relevant subset.

Hallucination Language Modelling +2

Rank-without-GPT: Building GPT-Independent Listwise Rerankers on Open-Source Large Language Models

no code implementations • 5 Dec 2023 • Xinyu Zhang, Sebastian Hofstätter, Patrick Lewis, Raphael Tang, Jimmy Lin

However, current works in this direction all depend on the GPT models, making it a single point of failure in scientific reproducibility.

Passage Retrieval Retrieval

RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze!

1 code implementation • 5 Dec 2023 • Ronak Pradeep, Sahel Sharifymoghaddam, Jimmy Lin

In information retrieval, proprietary large language models (LLMs) such as GPT-4 and open-source counterparts such as LLaMA and Vicuna have played a vital role in reranking.

Information Retrieval Retrieval

Searching Dense Representations with Inverted Indexes

no code implementations • 4 Dec 2023 • Jimmy Lin, Tommaso Teofili

In this work, we explore the contrarian approach of performing top-$k$ retrieval on dense vector representations using inverted indexes.

Passage Ranking Retrieval

What Do Llamas Really Think? Revealing Preference Biases in Language Model Representations

1 code implementation • 30 Nov 2023 • Raphael Tang, Xinyu Zhang, Jimmy Lin, Ferhan Ture

We propose a logistic Bradley-Terry probe which predicts word pair preferences of LLMs from the words' hidden vectors.

Language Modelling

End-to-End Retrieval with Learned Dense and Sparse Representations Using Lucene

no code implementations • 30 Nov 2023 • Haonan Chen, Carlos Lassance, Jimmy Lin

The bi-encoder architecture provides a framework for understanding machine-learned retrieval models based on dense and sparse vector representations.

Information Retrieval Retrieval

Generate, Filter, and Fuse: Query Expansion via Multi-Step Keyword Generation for Zero-Shot Neural Rankers

no code implementations • 15 Nov 2023 • Minghan Li, Honglei Zhuang, Kai Hui, Zhen Qin, Jimmy Lin, Rolf Jagerman, Xuanhui Wang, Michael Bendersky

We first show that directly applying the expansion techniques in the current literature to state-of-the-art neural rankers can result in deteriorated zero-shot performance.

Instruction Following Language Modelling +1

Leveraging LLMs for Synthesizing Training Data Across Many Languages in Multilingual Dense Retrieval

1 code implementation • 10 Nov 2023 • Nandan Thakur, Jianmo Ni, Gustavo Hernández Ábrego, John Wieting, Jimmy Lin, Daniel Cer

There has been limited success for dense retrieval models in multilingual retrieval, due to uneven and scarce training data available across multiple languages.

Language Modelling Large Language Model +1

Fine-Tuning LLaMA for Multi-Stage Text Retrieval

1 code implementation • 12 Oct 2023 • Xueguang Ma, Liang Wang, Nan Yang, Furu Wei, Jimmy Lin

Our findings demonstrate that the effectiveness of large language models indeed surpasses that of smaller models.

Passage Retrieval Retrieval +1

Found in the Middle: Permutation Self-Consistency Improves Listwise Ranking in Large Language Models

1 code implementation • 11 Oct 2023 • Raphael Tang, Xinyu Zhang, Xueguang Ma, Jimmy Lin, Ferhan Ture

Large language models (LLMs) exhibit positional bias in how they use context, which especially complicates listwise ranking.

RankVicuna: Zero-Shot Listwise Document Reranking with Open-Source Large Language Models

1 code implementation • 26 Sep 2023 • Ronak Pradeep, Sahel Sharifymoghaddam, Jimmy Lin

Researchers have successfully applied large language models (LLMs) such as ChatGPT to reranking in an information retrieval context, but to date, such work has mostly been built on proprietary models hidden behind opaque API endpoints.

Information Retrieval Retrieval

MMEAD: MS MARCO Entity Annotations and Disambiguations

1 code implementation • 14 Sep 2023 • Chris Kamphuis, Aileen Lin, Siwen Yang, Jimmy Lin, Arjen P. de Vries, Faegheh Hasibi

MMEAD, or MS MARCO Entity Annotations and Disambiguations, is a resource for entity links for the MS MARCO datasets.

Entity Embeddings

Unsupervised Chunking with Hierarchical RNN

1 code implementation • 10 Sep 2023 • Zijun Wu, Anup Anand Deshmukh, Yongkang Wu, Jimmy Lin, Lili Mou

Our approach involves a two-stage training process: pretraining with an unsupervised parser and finetuning on downstream NLP tasks.

Chunking Sentence

Vector Search with OpenAI Embeddings: Lucene Is All You Need

no code implementations • 29 Aug 2023 • Jimmy Lin, Ronak Pradeep, Tommaso Teofili, Jasper Xian

We provide a reproducible, end-to-end demonstration of vector search with OpenAI embeddings using Lucene on the popular MS MARCO passage ranking test collection.

Passage Ranking

Approximating Human-Like Few-shot Learning with GPT-based Compression

no code implementations • 14 Aug 2023 • Cynthia Huang, Yuqing Xie, Zhiying Jiang, Jimmy Lin, Ming Li

Leveraging the approximated information distance, our method allows the direct application of GPT models in quantitative text similarity measurements.

Data Compression Few-Shot Learning +6

HAGRID: A Human-LLM Collaborative Dataset for Generative Information-Seeking with Attribution

1 code implementation • 31 Jul 2023 • Ehsan Kamalloo, Aref Jafari, Xinyu Zhang, Nandan Thakur, Jimmy Lin

In this paper, we introduce a new dataset, HAGRID (Human-in-the-loop Attributable Generative Retrieval for Information-seeking Dataset) for building end-to-end generative information-seeking models that are capable of retrieving candidate quotes and generating attributed explanations.

Information Retrieval Informativeness +1

SPRINT: A Unified Toolkit for Evaluating and Demystifying Zero-shot Neural Sparse Retrieval

1 code implementation • 19 Jul 2023 • Nandan Thakur, Kexin Wang, Iryna Gurevych, Jimmy Lin

In this work, we provide SPRINT, a unified Python toolkit based on Pyserini and Lucene, supporting a common interface for evaluating neural sparse retrieval.

Information Retrieval Retrieval

Resources for Brewing BEIR: Reproducible Reference Models and an Official Leaderboard

2 code implementations • 13 Jun 2023 • Ehsan Kamalloo, Nandan Thakur, Carlos Lassance, Xueguang Ma, Jheng-Hong Yang, Jimmy Lin

BEIR is a benchmark dataset for zero-shot evaluation of information retrieval models across 18 different domain/task combinations.

Information Retrieval Representation Learning +1

GAIA Search: Hugging Face and Pyserini Interoperability for NLP Training Data Exploration

1 code implementation • 2 Jun 2023 • Aleksandra Piktus, Odunayo Ogundepo, Christopher Akiki, Akintunde Oladipo, Xinyu Zhang, Hailey Schoelkopf, Stella Biderman, Martin Potthast, Jimmy Lin

We discuss how Pyserini, a widely used toolkit for reproducible IR research, can be integrated with the Hugging Face ecosystem of open-source AI libraries and artifacts.

Information Retrieval Retrieval

Regex-augmented Domain Transfer Topic Classification based on a Pre-trained Language Model: An application in Financial Domain

no code implementations • 23 May 2023 • Vanessa Liao, Syed Shariyar Murtaza, Yifan Nie, Jimmy Lin

Our experiments on real scenario production data show that this method of fine tuning improves the downstream text classification tasks as compared to fine tuning only on domain specific text.

Language Modelling Large Language Model +3

How Does Generative Retrieval Scale to Millions of Passages?

no code implementations • 19 May 2023 • Ronak Pradeep, Kai Hui, Jai Gupta, Adam D. Lelkes, Honglei Zhuang, Jimmy Lin, Donald Metzler, Vinh Q. Tran

Popularized by the Differentiable Search Index, the emerging paradigm of generative retrieval re-frames the classic information retrieval problem into a sequence-to-sequence modeling task, forgoing external indices and encoding an entire document corpus within a single Transformer.

Information Retrieval Passage Ranking +1

$SmartProbe$: A Virtual Moderator for Market Research Surveys

no code implementations • 14 May 2023 • Josh Seltzer, Jiahua Pan, Kathy Cheng, Yuxiao Sun, Santosh Kolagati, Jimmy Lin, Shi Zong

Market research surveys are a powerful methodology for understanding consumer perspectives at scale, but are limited by depth of understanding and insights.

Evaluating Embedding APIs for Information Retrieval

no code implementations • 10 May 2023 • Ehsan Kamalloo, Xinyu Zhang, Odunayo Ogundepo, Nandan Thakur, David Alfonso-Hermelo, Mehdi Rezagholizadeh, Jimmy Lin

The ever-increasing size of language models curtails their widespread availability to the community, thereby galvanizing many companies into offering access to large language models through APIs.

Domain Generalization Information Retrieval +2

Zero-Shot Listwise Document Reranking with a Large Language Model

no code implementations • 3 May 2023 • Xueguang Ma, Xinyu Zhang, Ronak Pradeep, Jimmy Lin

Supervised ranking methods based on bi-encoder or cross-encoder architectures have shown success in multi-stage text ranking tasks, but they require large amounts of relevance judgments as training data.

Language Modelling Large Language Model +1

Anserini Gets Dense Retrieval: Integration of Lucene's HNSW Indexes

no code implementations • 24 Apr 2023 • Xueguang Ma, Tommaso Teofili, Jimmy Lin

With Pyserini, which provides a Python interface to Anserini, users gain access to both sparse and dense retrieval models, as Pyserini implements bindings to the Faiss vector search library alongside Lucene inverted indexes in a uniform, consistent interface.

Information Retrieval Retrieval

AToMiC: An Image/Text Retrieval Test Collection to Support Multimedia Content Creation

2 code implementations • 4 Apr 2023 • Jheng-Hong Yang, Carlos Lassance, Rafael Sampaio de Rezende, Krishna Srinivasan, Miriam Redi, Stéphane Clinchant, Jimmy Lin

This paper presents the AToMiC (Authoring Tools for Multimedia Content) dataset, designed to advance research in image/text cross-modal retrieval.

Cross-Modal Retrieval Retrieval +1

Simple Yet Effective Neural Ranking and Reranking Baselines for Cross-Lingual Information Retrieval

no code implementations • 3 Apr 2023 • Jimmy Lin, David Alfonso-Hermelo, Vitor Jeronymo, Ehsan Kamalloo, Carlos Lassance, Rodrigo Nogueira, Odunayo Ogundepo, Mehdi Rezagholizadeh, Nandan Thakur, Jheng-Hong Yang, Xinyu Zhang

The advent of multilingual language models has generated a resurgence of interest in cross-lingual information retrieval (CLIR), which is the task of searching documents in one language with queries from another.

Cross-Lingual Information Retrieval Retrieval

Spacerini: Plug-and-play Search Engines with Pyserini and Hugging Face

1 code implementation • 28 Feb 2023 • Christopher Akiki, Odunayo Ogundepo, Aleksandra Piktus, Xinyu Zhang, Akintunde Oladipo, Jimmy Lin, Martin Potthast

We present Spacerini, a tool that integrates the Pyserini toolkit for reproducible information retrieval research with Hugging Face to enable the seamless construction and deployment of interactive search engines.

Information Retrieval Retrieval

How to Train Your DRAGON: Diverse Augmentation Towards Generalizable Dense Retrieval

1 code implementation • 15 Feb 2023 • Sheng-Chieh Lin, Akari Asai, Minghan Li, Barlas Oguz, Jimmy Lin, Yashar Mehdad, Wen-tau Yih, Xilun Chen

We hence propose a new DA approach with diverse queries and sources of supervision to progressively train a generalizable DR. As a result, DRAGON, our dense retriever trained with diverse augmentation, is the first BERT-base-sized DR to achieve state-of-the-art effectiveness in both supervised and zero-shot evaluations and even competes with models using more complex late interaction (ColBERTv2 and SPLADE++).

Contrastive Learning Data Augmentation +1

SLIM: Sparsified Late Interaction for Multi-Vector Retrieval with Inverted Indexes

1 code implementation • 13 Feb 2023 • Minghan Li, Sheng-Chieh Lin, Xueguang Ma, Jimmy Lin

Multi-vector retrieval methods have demonstrated their effectiveness on various retrieval datasets, and among them, ColBERT is the most established method based on the late interaction of contextualized token embeddings of pre-trained language models.

Information Retrieval Retrieval

Improving Out-of-Distribution Generalization of Neural Rerankers with Contextualized Late Interaction

no code implementations • 13 Feb 2023 • Xinyu Zhang, Minghan Li, Jimmy Lin

Recent progress in information retrieval finds that embedding query and document representation into multi-vector yields a robust bi-encoder retriever on out-of-distribution datasets.

Information Retrieval Out-of-Distribution Generalization +1

Which Model Shall I Choose? Cost/Quality Trade-offs for Text Classification Tasks

no code implementations • 17 Jan 2023 • Shi Zong, Josh Seltzer, Jiahua Pan, Kathy Cheng, Jimmy Lin

Industry practitioners always face the problem of choosing the appropriate model for deployment under different considerations, such as to maximize a metric that is crucial for production, or to reduce the total cost given financial concerns.

text-classification Text Classification

Building a Culture of Reproducibility in Academic Research

1 code implementation • 27 Dec 2022 • Jimmy Lin

Reproducibility is an ideal that no researcher would dispute "in the abstract", but when aspirations meet the cold hard reality of the academic grind, reproducibility often "loses out".

Cultural Vocal Bursts Intensity Prediction

Precise Zero-Shot Dense Retrieval without Relevance Labels

2 code implementations • 20 Dec 2022 • Luyu Gao, Xueguang Ma, Jimmy Lin, Jamie Callan

Given a query, HyDE first zero-shot instructs an instruction-following language model (e.g., InstructGPT) to generate a hypothetical document.

Fact Verification Instruction Following +3

Less is More: Parameter-Free Text Classification with Gzip

no code implementations • 19 Dec 2022 • Zhiying Jiang, Matthew Y. R. Yang, Mikhail Tsirlin, Raphael Tang, Jimmy Lin

Our method also performs particularly well in few-shot settings where labeled data are too scarce for DNNs to achieve a satisfying accuracy.

text-classification Text Classification

Improving Precancerous Case Characterization via Transformer-based Ensemble Learning

no code implementations • 10 Dec 2022 • Yizhen Zhong, Jiajie Xiao, Thomas Vetterli, Mahan Matin, Ellen Loo, Jimmy Lin, Richard Bourgon, Ofer Shapira

The application of natural language processing (NLP) to cancer pathology reports has been focused on detecting cancer cases, largely ignoring precancerous cases.

Ensemble Learning named-entity-recognition +2

SpeechNet: Weakly Supervised, End-to-End Speech Recognition at Industrial Scale

no code implementations • 21 Nov 2022 • Raphael Tang, Karun Kumar, Gefei Yang, Akshat Pandey, Yajie Mao, Vladislav Belyaev, Madhuri Emmadi, Craig Murray, Ferhan Ture, Jimmy Lin

In this paper, we explore training and deploying an ASR system in the label-scarce, compute-limited setting.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

CITADEL: Conditional Token Interaction via Dynamic Lexical Routing for Efficient and Effective Multi-Vector Retrieval

1 code implementation • 18 Nov 2022 • Minghan Li, Sheng-Chieh Lin, Barlas Oguz, Asish Ghoshal, Jimmy Lin, Yashar Mehdad, Wen-tau Yih, Xilun Chen

In this paper, we unify different multi-vector retrieval models from a token routing viewpoint and propose conditional token interaction via dynamic lexical routing, namely CITADEL, for efficient and effective multi-vector retrieval.

Retrieval

On the Interaction Between Differential Privacy and Gradient Compression in Deep Learning

no code implementations • 1 Nov 2022 • Jimmy Lin

We evaluate this proposal and find that it can reduce the negative impact of noise added by differential privacy mechanisms on test accuracy by up to 24.6%, and reduce the negative impact of gradient sparsification on test accuracy by up to 15.1%.

XRICL: Cross-lingual Retrieval-Augmented In-Context Learning for Cross-lingual Text-to-SQL Semantic Parsing

no code implementations • 25 Oct 2022 • Peng Shi, Rui Zhang, He Bai, Jimmy Lin

We also include global translation exemplars for a target language to facilitate the translation process for large language models.

In-Context Learning Retrieval +4

Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages

1 code implementation • 18 Oct 2022 • Xinyu Zhang, Nandan Thakur, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, Jimmy Lin

MIRACL (Multilingual Information Retrieval Across a Continuum of Languages) is a multilingual dataset we have built for the WSDM 2023 Cup challenge that focuses on ad hoc retrieval across 18 different languages, which collectively encompass over three billion native speakers around the world.

Information Retrieval Retrieval

Query Expansion Using Contextual Clue Sampling with Language Models

no code implementations • 13 Oct 2022 • Linqing Liu, Minghan Li, Jimmy Lin, Sebastian Riedel, Pontus Stenetorp

To balance these two considerations, we propose a combination of an effective filtering strategy and fusion of the retrieved documents based on the generation probability of each context.

Information Retrieval Language Modelling +1

Better Than Whitespace: Information Retrieval for Languages without Custom Tokenizers

no code implementations • 11 Oct 2022 • Odunayo Ogundepo, Xinyu Zhang, Jimmy Lin

However, only a handful of the 7000+ languages on the planet benefit from specialized, custom-built tokenization algorithms, while the other languages are stuck with a "default" whitespace tokenizer, which cannot capture the intricacies of different languages.

Information Retrieval Retrieval

What the DAAM: Interpreting Stable Diffusion Using Cross Attention

1 code implementation • 10 Oct 2022 • Raphael Tang, Linqing Liu, Akshat Pandey, Zhiying Jiang, Gefei Yang, Karun Kumar, Pontus Stenetorp, Jimmy Lin, Ferhan Ture

Large-scale diffusion neural networks represent a substantial milestone in text-to-image generation, but they remain poorly understood, lacking interpretability analyses.

Denoising Descriptive +3

Building an Efficiency Pipeline: Commutativity and Cumulativeness of Efficiency Operators for Transformers

no code implementations • 31 Jul 2022 • Ji Xin, Raphael Tang, Zhiying Jiang, YaoLiang Yu, Jimmy Lin

There exists a wide variety of efficiency methods for natural language processing (NLP) tasks, such as pruning, distillation, dynamic inference, quantization, etc.

Quantization

Aggretriever: A Simple Approach to Aggregate Textual Representations for Robust Dense Passage Retrieval

1 code implementation • 31 Jul 2022 • Sheng-Chieh Lin, Minghan Li, Jimmy Lin

Pre-trained language models have been successful in many knowledge-intensive NLP tasks.

Knowledge Distillation Language Modelling +2

Few-Shot Non-Parametric Learning with Deep Latent Variable Model

no code implementations • 23 Jun 2022 • Zhiying Jiang, Yiqin Dai, Ji Xin, Ming Li, Jimmy Lin

Most real-world problems that machine learning algorithms are expected to solve face the situation with 1) unknown data distribution; 2) little domain-specific knowledge; and 3) datasets with limited annotation.

Classification Image Classification

A Dense Representation Framework for Lexical and Semantic Matching

1 code implementation • 20 Jun 2022 • Sheng-Chieh Lin, Jimmy Lin

In contrast, our work integrates lexical representations with dense semantic representations by densifying high-dimensional lexical representations into what we call low-dimensional dense lexical representations (DLRs).

Retrieval Semantic Text Matching +2

Injecting Domain Adaptation with Learning-to-hash for Effective and Efficient Zero-shot Dense Retrieval

2 code implementations • 23 May 2022 • Nandan Thakur, Nils Reimers, Jimmy Lin

In our work, we evaluate LTH and vector compression techniques for improving the downstream zero-shot retrieval accuracy of the TAS-B dense retriever while maintaining efficiency at inference.

Ad-Hoc Information Retrieval Information Retrieval +3

Certified Error Control of Candidate Set Pruning for Two-Stage Relevance Ranking

1 code implementation • 19 May 2022 • Minghan Li, Xinyu Zhang, Ji Xin, Hongyang Zhang, Jimmy Lin

For example, on MS MARCO Passage v1, our method yields an average candidate set size of 27 out of 1,000, which increases the reranking speed by about 37 times, while the MRR@10 is greater than a pre-specified value of 0.38 with about 90% empirical coverage and the empirical baselines fail to provide such a guarantee.

Computational Efficiency Information Retrieval +1

To Interpolate or not to Interpolate: PRF, Dense and Sparse Retrievers

no code implementations • 30 Apr 2022 • Hang Li, Shuai Wang, Shengyao Zhuang, Ahmed Mourad, Xueguang Ma, Jimmy Lin, Guido Zuccon

In this paper we consider the problem of combining the relevance signals from sparse and dense retrievers in the context of Pseudo Relevance Feedback (PRF).

Information Retrieval Language Modelling +1

Towards Best Practices for Training Multilingual Dense Retrieval Models

no code implementations • 5 Apr 2022 • Xinyu Zhang, Kelechi Ogueji, Xueguang Ma, Jimmy Lin

Dense retrieval models using a transformer-based bi-encoder design have emerged as an active area of research.

Cross-Lingual Transfer Retrieval

Evaluating Token-Level and Passage-Level Dense Retrieval Models for Math Information Retrieval

1 code implementation • 21 Mar 2022 • Wei Zhong, Jheng-Hong Yang, Yuqing Xie, Jimmy Lin

With the recent success of dense retrieval methods based on bi-encoders, studies have applied this approach to various interesting downstream retrieval tasks with good efficiency and in-domain effectiveness.

Ranked #1 on Math Information Retrieval on ARQMath (using extra training data)

Information Retrieval Math +2

Tevatron: An Efficient and Flexible Toolkit for Dense Retrieval

1 code implementation • 11 Mar 2022 • Luyu Gao, Xueguang Ma, Jimmy Lin, Jamie Callan

In this paper, we present Tevatron, a dense retrieval toolkit optimized for efficiency, flexibility, and code simplicity.

Retrieval

Can Old TREC Collections Reliably Evaluate Modern Neural Retrieval Models?

no code implementations • 26 Jan 2022 • Ellen M. Voorhees, Ian Soboroff, Jimmy Lin

Neural retrieval models are generally regarded as fundamentally different from the retrieval techniques used in the late 1990s when the TREC ad hoc test collections were constructed.

Retrieval

Sparsifying Sparse Representations for Passage Retrieval by Top-$k$ Masking

no code implementations • 17 Dec 2021 • Jheng-Hong Yang, Xueguang Ma, Jimmy Lin

Sparse lexical representation learning has demonstrated much progress in improving passage retrieval effectiveness in recent models such as DeepImpact, uniCOIL, and SPLADE.

Passage Retrieval Representation Learning +2

Improving Query Representations for Dense Retrieval with Pseudo Relevance Feedback: A Reproducibility Study

1 code implementation • 13 Dec 2021 • Hang Li, Shengyao Zhuang, Ahmed Mourad, Xueguang Ma, Jimmy Lin, Guido Zuccon

Finally, we contribute a study of the generalisability of the ANCE-PRF method when dense retrievers other than ANCE are used for the first round of retrieval and for encoding the PRF signal.

Retrieval

Densifying Sparse Representations for Passage Retrieval by Representational Slicing

1 code implementation • 9 Dec 2021 • Sheng-Chieh Lin, Jimmy Lin

Learned sparse and dense representations capture different successful approaches to text retrieval and the fusion of their results has proven to be more effective and robust.

Passage Retrieval Retrieval +1

Wacky Weights in Learned Sparse Representations and the Revenge of Score-at-a-Time Query Evaluation

no code implementations • 22 Oct 2021 • Joel Mackenzie, Andrew Trotman, Jimmy Lin

Recent advances in retrieval models based on learned sparse representations generated by transformers have led us to, once again, consider score-at-a-time query evaluation techniques for the top-k retrieval problem.

Retrieval

A Proposed Conceptual Framework for a Representational Approach to Information Retrieval

no code implementations • 4 Oct 2021 • Jimmy Lin

This paper outlines a conceptual framework for understanding recent developments in information retrieval and natural language processing that attempts to integrate dense and sparse retrieval methods.

Information Retrieval Retrieval +3

Paper
Add Code

Encoder Adaptation of Dense Passage Retrieval for Open-Domain Question Answering

no code implementations • 4 Oct 2021 • Minghan Li, Jimmy Lin

Previous work on generalization of DPR mainly focuses on testing both encoders in tandem on out-of-distribution (OOD) question-answering (QA) tasks, which is also known as domain adaptation.

Domain Adaptation Open-Domain Question Answering +2

Paper
Add Code

Cross-Lingual Training with Dense Retrieval for Document Retrieval

no code implementations • 3 Sep 2021 • Peng Shi, Rui Zhang, He Bai, Jimmy Lin

Dense retrieval has shown great success in passage ranking in English.

Document Ranking Passage Ranking +1

Paper
Add Code

Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval

1 code implementation • EMNLP (MRL) 2021 • Xinyu Zhang, Xueguang Ma, Peng Shi, Jimmy Lin

We present Mr. TyDi, a multi-lingual benchmark dataset for mono-lingual retrieval in eleven typologically diverse languages, designed to evaluate ranking with learned dense representations.

Representation Learning Retrieval

Paper
Code

Exploring Listwise Evidence Reasoning with T5 for Fact Verification

no code implementations • ACL 2021 • Kelvin Jiang, Ronak Pradeep, Jimmy Lin

This work explores a framework for fact verification that leverages pretrained sequence-to-sequence transformer models for sentence selection and label prediction, two key sub-tasks in fact verification.

Data Augmentation Fact Verification +1

Paper
Add Code

The Art of Abstention: Selective Prediction and Error Regularization for Natural Language Processing

1 code implementation • ACL 2021 • Ji Xin, Raphael Tang, YaoLiang Yu, Jimmy Lin

To fill this void in the literature, we study in this paper selective prediction for NLP, comparing different models and confidence estimators.

Paper
Code

A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques

no code implementations • 28 Jun 2021 • Jimmy Lin, Xueguang Ma

Recent developments in representational learning for information retrieval can be organized in a conceptual framework that establishes two pairs of contrasts: sparse vs. dense representations and unsupervised vs. learned representations.

Information Retrieval Passage Ranking +1

Paper
Add Code

MS MARCO: Benchmarking Ranking Models in the Large-Data Regime

no code implementations • 9 May 2021 • Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, Jimmy Lin

Evaluation efforts such as TREC, CLEF, NTCIR and FIRE, alongside public leaderboards such as MS MARCO, are intended to encourage research and track our progress, addressing big questions in our field.

Benchmarking

Paper
Add Code

Contextualized Query Embeddings for Conversational Search

no code implementations • EMNLP 2021 • Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin

This paper describes a compact and effective model for low-latency passage retrieval in conversational search based on learned dense representations.

Conversational Search Open-Domain Question Answering +2

Paper
Add Code

Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling

4 code implementations • 14 Apr 2021 • Sebastian Hofstätter, Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin, Allan Hanbury

A vital step towards the widespread adoption of neural retrieval models is their resource efficiency throughout the training, indexing and query workflows.

Ranked #15 on Zero-shot Text Search on BEIR

Re-Ranking Retrieval +2

Paper
Code

A Replication Study of Dense Passage Retriever

1 code implementation • 12 Apr 2021 • Xueguang Ma, Kai Sun, Ronak Pradeep, Jimmy Lin

Text retrieval using learned dense representations has recently emerged as a promising alternative to "traditional" text retrieval using sparse bag-of-words representations.

Open-Domain Question Answering Retrieval +1

Paper
Code

BERxiT: Early Exiting for BERT with Better Fine-Tuning and Extension to Regression

1 code implementation • EACL 2021 • Ji Xin, Raphael Tang, YaoLiang Yu, Jimmy Lin

The slow speed of BERT has motivated much research on accelerating its inference, and the early exiting idea has been proposed to make trade-offs between model quality and efficiency.

regression

Paper
Code

Significant Improvements over the State of the Art? A Case Study of the MS MARCO Document Ranking Leaderboard

no code implementations • 25 Feb 2021 • Jimmy Lin, Daniel Campos, Nick Craswell, Bhaskar Mitra, Emine Yilmaz

Leaderboards are a ubiquitous part of modern research in applied machine learning.

Document Ranking Information Retrieval +1

Paper
Add Code

Investigating the Limitations of Transformers with Simple Arithmetic Tasks

1 code implementation • 25 Feb 2021 • Rodrigo Nogueira, Zhiying Jiang, Jimmy Lin

In this work, we investigate if the surface form of a number has any influence on how sequence-to-sequence language models learn simple arithmetic tasks such as addition and subtraction across a wide range of values.

Paper
Code

Pyserini: An Easy-to-Use Python Toolkit to Support Replicable IR Research with Sparse and Dense Representations

1 code implementation • 19 Feb 2021 • Jimmy Lin, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep, Rodrigo Nogueira

Pyserini is an easy-to-use Python toolkit that supports replicable IR research by providing effective first-stage retrieval in a multi-stage ranking architecture.

Cultural Vocal Bursts Intensity Prediction Information Retrieval +1

Paper
Code

The Expando-Mono-Duo Design Pattern for Text Ranking with Pretrained Sequence-to-Sequence Models

2 code implementations • 14 Jan 2021 • Ronak Pradeep, Rodrigo Nogueira, Jimmy Lin

We propose a design pattern for tackling text ranking problems, dubbed "Expando-Mono-Duo", that has been empirically validated for a number of ad hoc retrieval tasks in different domains.

Document Ranking Retrieval

Paper
Code

Inserting Information Bottlenecks for Attribution in Transformers

1 code implementation • Findings of the Association for Computational Linguistics 2020 • Zhiying Jiang, Raphael Tang, Ji Xin, Jimmy Lin

We show the effectiveness of our method in terms of attribution and the ability to provide insight into how information flows through layers.

Paper
Code

Designing Templates for Eliciting Commonsense Knowledge from Pretrained Sequence-to-Sequence Models

no code implementations • COLING 2020 • Jheng-Hong Yang, Sheng-Chieh Lin, Rodrigo Nogueira, Ming-Feng Tsai, Chuan-Ju Wang, Jimmy Lin

While internalized "implicit knowledge" in pretrained transformers has led to fruitful progress in many natural language understanding tasks, how to most effectively elicit such knowledge remains an open question.

Multiple-choice Natural Language Understanding +1

Paper
Add Code

Cross-Lingual Training of Neural Models for Document Ranking

no code implementations • Findings of the Association for Computational Linguistics 2020 • Peng Shi, He Bai, Jimmy Lin

We tackle the challenge of cross-lingual training of neural document ranking models for mono-lingual retrieval, specifically leveraging relevance judgments in English to improve search in non-English languages.

Document Ranking Retrieval

Paper
Add Code

Distilling Dense Representations for Ranking using Tightly-Coupled Teachers

2 code implementations • 22 Oct 2020 • Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin

We present an approach to ranking with dense representations that applies knowledge distillation to improve the recently proposed late-interaction ColBERT model.

Knowledge Distillation

Paper
Code

Scientific Claim Verification with VERT5ERINI

no code implementations • EACL (Louhi) 2021 • Ronak Pradeep, Xueguang Ma, Rodrigo Nogueira, Jimmy Lin

This work describes the adaptation of a pretrained sequence-to-sequence model to the task of scientific claim verification in the biomedical domain.

Claim Verification Retrieval +1

Paper
Add Code

Rainfall-Runoff Prediction at Multiple Timescales with a Single Long Short-Term Memory Network

1 code implementation • 15 Oct 2020 • Martin Gauch, Frederik Kratzert, Daniel Klotz, Grey Nearing, Jimmy Lin, Sepp Hochreiter

Compared to naive prediction with a distinct LSTM per timescale, the multi-timescale architectures are computationally more efficient with no loss in accuracy.

Paper
Code

Pretrained Transformers for Text Ranking: BERT and Beyond

1 code implementation • NAACL 2021 • Jimmy Lin, Rodrigo Nogueira, Andrew Yates

There are two themes that pervade our survey: techniques for handling long documents, beyond typical sentence-by-sentence processing in NLP, and techniques for addressing the tradeoff between effectiveness (i.e., result quality) and efficiency (e.g., query latency, model and index size).

Information Retrieval Retrieval +1

Paper
Code

Don't Change Me! User-Controllable Selective Paraphrase Generation

no code implementations • EACL 2021 • Mohan Zhang, Luchen Tan, Zhengkai Tu, Zihang Fu, Kun Xiong, Ming Li, Jimmy Lin

The contribution of this work is a novel data generation technique using distant supervision that allows us to start with a pretrained sequence-to-sequence model and fine-tune a paraphrase generator that exhibits this behavior, allowing user-controllable paraphrase generation.

Paraphrase Generation

Paper
Add Code

Howl: A Deployed, Open-Source Wake Word Detection System

2 code implementations • EMNLP (NLPOSS) 2020 • Raphael Tang, Jaejun Lee, Afsaneh Razi, Julia Cambre, Ian Bicking, Jofish Kaye, Jimmy Lin

We describe Howl, an open-source wake word detection toolkit with native support for open speech datasets, like Mozilla Common Voice and Google Speech Commands.

Ranked #4 on Keyword Spotting on Google Speech Commands

Keyword Spotting

Paper
Code

Covidex: Neural Ranking Models and Keyword Search Infrastructure for the COVID-19 Open Research Dataset

1 code implementation • EMNLP (sdp) 2020 • Edwin Zhang, Nikhil Gupta, Raphael Tang, Xiao Han, Ronak Pradeep, Kuang Lu, Yue Zhang, Rodrigo Nogueira, Kyunghyun Cho, Hui Fang, Jimmy Lin

We present Covidex, a search engine that exploits the latest neural ranking models to provide information access to the COVID-19 Open Research Dataset curated by the Allen Institute for AI.

Paper
Code

Exploring the Limits of Simple Learners in Knowledge Distillation for Document Classification with DocBERT

no code implementations • WS 2020 • Ashutosh Adhikari, Achyudh Ram, Raphael Tang, William L. Hamilton, Jimmy Lin

Fine-tuned variants of BERT are able to achieve state-of-the-art accuracy on many natural language processing tasks, although at significant computational costs.

Document Classification General Classification +2

Paper
Add Code

Rapidly Deploying a Neural Search Engine for the COVID-19 Open Research Dataset

no code implementations • ACL 2020 • Edwin Zhang, Nikhil Gupta, Rodrigo Nogueira, Kyunghyun Cho, Jimmy Lin

The Neural Covidex is a search engine that exploits the latest neural ranking architectures to provide information access to the COVID-19 Open Research Dataset (CORD-19) curated by the Allen Institute for AI.

Decision Making

Paper
Add Code

Generalized and Scalable Optimal Sparse Decision Trees

2 code implementations • ICML 2020 • Jimmy Lin, Chudi Zhong, Diane Hu, Cynthia Rudin, Margo Seltzer

Decision tree optimization is notoriously difficult from a computational perspective but essential for the field of interpretable machine learning.

Interpretable Machine Learning

Paper
Code

A Data Scientist's Guide to Streamflow Prediction

no code implementations • 5 Jun 2020 • Martin Gauch, Jimmy Lin

In recent years, the paradigms of data-driven science have become essential components of physical sciences, particularly in geophysical disciplines such as climatology.

Paper
Add Code

Multi-Stage Conversational Passage Retrieval: An Approach to Fusing Term Importance Estimation and Neural Query Rewriting

no code implementations • 5 May 2020 • Sheng-Chieh Lin, Jheng-Hong Yang, Rodrigo Nogueira, Ming-Feng Tsai, Chuan-Ju Wang, Jimmy Lin

Conversational search plays a vital role in conversational information seeking.

Ad-Hoc Information Retrieval Conversational Search +2

Paper
Add Code

Segatron: Segment-Aware Transformer for Language Modeling and Understanding

1 code implementation • 30 Apr 2020 • He Bai, Peng Shi, Jimmy Lin, Yuqing Xie, Luchen Tan, Kun Xiong, Wen Gao, Ming Li

To verify this, we propose a segment-aware Transformer (Segatron), by replacing the original token position encoding with a combined position encoding of paragraph, sentence, and token.

Ranked #20 on Language Modelling on WikiText-103

Language Modelling Masked Language Modeling +3

Paper
Code

Showing Your Work Doesn't Always Work

1 code implementation • ACL 2020 • Raphael Tang, Jaejun Lee, Ji Xin, Xinyu Liu, Yao-Liang Yu, Jimmy Lin

In natural language processing, a recently popular line of work explores how to best report the experimental results of neural networks.

Paper
Code

DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference

3 code implementations • ACL 2020 • Ji Xin, Raphael Tang, Jaejun Lee, Yao-Liang Yu, Jimmy Lin

Large-scale pre-trained language models such as BERT have brought significant improvements to NLP applications.

Paper
Code

Rapidly Bootstrapping a Question Answering Dataset for COVID-19

1 code implementation • 23 Apr 2020 • Raphael Tang, Rodrigo Nogueira, Edwin Zhang, Nikhil Gupta, Phuong Cam, Kyunghyun Cho, Jimmy Lin

We present CovidQA, the beginnings of a question answering dataset specifically designed for COVID-19, built by hand from knowledge gathered from Kaggle's COVID-19 Open Research Dataset Challenge.

Question Answering

Paper
Code

Rapidly Deploying a Neural Search Engine for the COVID-19 Open Research Dataset: Preliminary Thoughts and Lessons Learned

1 code implementation • 10 Apr 2020 • Edwin Zhang, Nikhil Gupta, Rodrigo Nogueira, Kyunghyun Cho, Jimmy Lin

We present the Neural Covidex, a search engine that exploits the latest neural ranking architectures to provide information access to the COVID-19 Open Research Dataset curated by the Allen Institute for AI.

Decision Making

Paper
Code

Semantics of the Unwritten: The Effect of End of Paragraph and Sequence Tokens on Text Generation with GPT2

1 code implementation • ACL 2021 • He Bai, Peng Shi, Jimmy Lin, Luchen Tan, Kun Xiong, Wen Gao, Jie Liu, Ming Li

Experimental results show that the Chinese GPT2 can generate better essay endings with the end-of-paragraph (EOP) token.

Language Modelling Story Generation

Paper
Code

Conversational Question Reformulation via Sequence-to-Sequence Architectures and Pretrained Language Models

no code implementations • 4 Apr 2020 • Sheng-Chieh Lin, Jheng-Hong Yang, Rodrigo Nogueira, Ming-Feng Tsai, Chuan-Ju Wang, Jimmy Lin

This paper presents an empirical study of conversational question reformulation (CQR) with sequence-to-sequence architectures and pretrained language models (PLMs).

Task-Oriented Dialogue Systems

Paper
Add Code

Supporting Interoperability Between Open-Source Search Engines with the Common Index File Format

2 code implementations • 18 Mar 2020 • Jimmy Lin, Joel Mackenzie, Chris Kamphuis, Craig Macdonald, Antonio Mallia, Michał Siedlaczek, Andrew Trotman, Arjen de Vries

There exists a natural tension between encouraging a diverse ecosystem of open-source search engines and supporting fair, replicable comparisons across those systems.

Paper
Code

TTTTTackling WinoGrande Schemas

no code implementations • 18 Mar 2020 • Sheng-Chieh Lin, Jheng-Hong Yang, Rodrigo Nogueira, Ming-Feng Tsai, Chuan-Ju Wang, Jimmy Lin

We applied the T5 sequence-to-sequence model to tackle the AI2 WinoGrande Challenge by decomposing each example into two input text strings, each containing a hypothesis, and using the probabilities assigned to the "entailment" token as a score of the hypothesis.

Ranked #17 on Coreference Resolution on Winograd Schema Challenge

Coreference Resolution

Paper
Add Code

Document Ranking with a Pretrained Sequence-to-Sequence Model

2 code implementations • Findings of the Association for Computational Linguistics 2020 • Rodrigo Nogueira, Zhiying Jiang, Jimmy Lin

We investigate this observation further by varying target words to probe the model's use of latent knowledge.

Ranked #1 on Ad-Hoc Information Retrieval on TREC Robust04

Document Ranking General Classification +1

Paper
Code

Rapid Adaptation of BERT for Information Extraction on Domain-Specific Business Documents

1 code implementation • 5 Feb 2020 • Ruixue Zhang, Wei Yang, Luyun Lin, Zhengkai Tu, Yuqing Xie, Zihang Fu, Yuhao Xie, Luchen Tan, Kun Xiong, Jimmy Lin

Techniques for automatically extracting important content elements from business documents such as contracts, statements, and filings have the potential to make business operations more efficient.

Paper
Code

A Prototype of Serverless Lucene

no code implementations • 4 Feb 2020 • Jimmy Lin

This paper describes a working prototype that adapts Lucene, the world's most popular and most widely deployed open-source search library, to operate within a serverless environment in the cloud.

Paper
Add Code

Navigation-Based Candidate Expansion and Pretrained Language Models for Citation Recommendation

no code implementations • 23 Jan 2020 • Rodrigo Nogueira, Zhiying Jiang, Kyunghyun Cho, Jimmy Lin

Citation recommendation systems for the scientific literature, to help authors find papers that should be cited, have the potential to speed up discoveries and uncover new routes for scientific exploration.

Citation Recommendation Domain Adaptation +3

Paper
Add Code

The Archives Unleashed Project: Technology, Process, and Community to Improve Scholarly Access to Web Archives

1 code implementation • 15 Jan 2020 • Nick Ruest, Jimmy Lin, Ian Milligan, Samantha Fritz

The Archives Unleashed project aims to improve scholarly access to web archives through a multi-pronged strategy involving tool creation, process modeling, and community building - all proceeding concurrently in mutually-reinforcing efforts.

Paper
Code

The Proper Care and Feeding of CAMELS: How Limited Training Data Affects Streamflow Prediction

1 code implementation • 17 Nov 2019 • Martin Gauch, Juliane Mai, Jimmy Lin

Accurate streamflow prediction largely relies on historical meteorological records and streamflow measurements.

Paper
Code

Exploiting Token and Path-based Representations of Code for Identifying Security-Relevant Commits

no code implementations • 15 Nov 2019 • Achyudh Ram, Ji Xin, Meiyappan Nagappan, Yao-Liang Yu, Rocío Cabrera Lozoya, Antonino Sabetta, Jimmy Lin

Public vulnerability databases such as CVE and NVD account for only 60% of security vulnerabilities present in open-source projects, and are known to suffer from inconsistent quality.

Paper
Add Code

MKD: a Multi-Task Knowledge Distillation Approach for Pretrained Language Models

no code implementations • 9 Nov 2019 • Linqing Liu, Huan Wang, Jimmy Lin, Richard Socher, Caiming Xiong

Our approach is model agnostic and can be easily applied to different future teacher model architectures.

Knowledge Distillation Multi-Task Learning

Paper
Add Code

What Would Elsa Do? Freezing Layers During Transformer Fine-Tuning

no code implementations • 8 Nov 2019 • Jaejun Lee, Raphael Tang, Jimmy Lin

We show that only a fourth of the final layers need to be fine-tuned to achieve 90% of the original quality.

Linguistic Acceptability Natural Language Inference +3

Paper
Add Code

Cross-Lingual Relevance Transfer for Document Retrieval

no code implementations • 8 Nov 2019 • Peng Shi, Jimmy Lin

Recent work has shown the surprising ability of multi-lingual BERT to serve as a zero-shot cross-lingual transfer model for a number of language processing tasks.

Retrieval Sentence +1

Paper
Add Code

Explicit Pairwise Word Interaction Modeling Improves Pretrained Transformers for English Semantic Similarity Tasks

no code implementations • 7 Nov 2019 • Yinan Zhang, Raphael Tang, Jimmy Lin

In this paper, we hypothesize that introducing an explicit, constrained pairwise word interaction mechanism to pretrained language models improves their effectiveness on semantic similarity tasks.

Paper
Add Code

Applying BERT to Document Retrieval with Birch

no code implementations • IJCNLP 2019 • Zeynep Akkalyoncu Yilmaz, Shengjin Wang, Wei Yang, Haotian Zhang, Jimmy Lin

We present Birch, a system that applies BERT to document retrieval via integration with the open-source Anserini information retrieval toolkit to demonstrate end-to-end search over large document collections.

Information Retrieval Retrieval

Paper
Add Code

Honkling: In-Browser Personalization for Ubiquitous Keyword Spotting

1 code implementation • IJCNLP 2019 • Jaejun Lee, Raphael Tang, Jimmy Lin

Used for simple command recognition on devices from smart speakers to mobile phones, keyword spotting systems are everywhere.

Keyword Spotting

Paper
Code

Natural Language Generation for Effective Knowledge Distillation

1 code implementation • WS 2019 • Raphael Tang, Yao Lu, Jimmy Lin

Knowledge distillation can effectively transfer knowledge from BERT, a deep language representation model, to traditional, shallow word embedding-based neural networks, helping them approach or exceed the quality of other heavyweight language representation models.

Knowledge Distillation Linguistic Acceptability +6

Paper
Code

Incorporating Contextual and Syntactic Structures Improves Semantic Similarity Modeling

no code implementations • IJCNLP 2019 • Linqing Liu, Wei Yang, Jinfeng Rao, Raphael Tang, Jimmy Lin

Semantic similarity modeling is central to many NLP problems such as natural language inference and question answering.

Natural Language Inference Question Answering +2

Paper
Add Code

Scalable Knowledge Graph Construction from Text Collections

no code implementations • WS 2019 • Ryan Clancy, Ihab F. Ilyas, Jimmy Lin

We present a scalable, open-source platform that "distills" a potentially large text collection into a knowledge graph.

Fact Verification graph construction

Paper
Add Code

Bridging the Gap between Relevance Matching and Semantic Matching for Short Text Similarity Modeling

no code implementations • IJCNLP 2019 • Jinfeng Rao, Linqing Liu, Yi Tay, Wei Yang, Peng Shi, Jimmy Lin

A core problem of information retrieval (IR) is relevance matching, which is to rank documents by relevance to a user's query.

Information Retrieval Paraphrase Identification +3

Paper
Add Code

What Part of the Neural Network Does This? Understanding LSTMs by Measuring and Dissecting Neurons

no code implementations • IJCNLP 2019 • Ji Xin, Jimmy Lin, Yao-Liang Yu

Memory neurons of long short-term memory (LSTM) networks encode and process information in powerful yet mysterious ways.

Paper
Add Code

Cross-Domain Modeling of Sentence-Level Evidence for Document Retrieval

no code implementations • IJCNLP 2019 • Zeynep Akkalyoncu Yilmaz, Wei Yang, Haotian Zhang, Jimmy Lin

This paper applies BERT to ad hoc document retrieval on news articles, which requires addressing two challenges: relevance judgments in existing test collections are typically provided only at the document level, and documents often exceed the length that BERT was designed to handle.

Retrieval Sentence

Paper
Add Code

Multi-Stage Document Ranking with BERT

2 code implementations • 31 Oct 2019 • Rodrigo Nogueira, Wei Yang, Kyunghyun Cho, Jimmy Lin

The advent of deep neural networks pre-trained via language modeling tasks has spurred a number of successful applications in natural language processing.

Document Ranking Language Modelling

Paper
Code

The Performance Envelope of Inverted Indexing on Modern Hardware

no code implementations • 24 Oct 2019 • Jimmy Lin, Lori Paniak, Gordon Boerke

Experiments show that the largest determinants of performance are the physical characteristics of the source and target media, and that physically isolating the two yields the highest indexing throughput.

Paper
Add Code

Lucene for Approximate Nearest-Neighbors Search on Arbitrary Dense Vectors

no code implementations • 22 Oct 2019 • Tommaso Teofili, Jimmy Lin

We demonstrate three approaches for adapting the open-source Lucene search library to perform approximate nearest-neighbor search on arbitrary dense vectors, using similarity search on word embeddings as a case study.

Dimensionality Reduction Word Embeddings

Paper
Add Code

Aligning Cross-Lingual Entities with Multi-Aspect Information

1 code implementation • IJCNLP 2019 • Hsiu-Wei Yang, Yanyan Zou, Peng Shi, Wei Lu, Jimmy Lin, Xu Sun

Multilingual knowledge graphs (KGs), such as YAGO and DBpedia, represent entities in different languages.

Entity Alignment Entity Embeddings +1

Paper
Code

Two Birds, One Stone: A Simple, Unified Model for Text Generation from Structured and Unstructured Data

1 code implementation • ACL 2020 • Hamidreza Shahidi, Ming Li, Jimmy Lin

We consider neural table-to-text generation and neural question generation (NQG) tasks for text generation from structured and unstructured data, respectively.

Question Generation Question-Generation +1

Paper
Code

Rethinking Complex Neural Network Architectures for Document Classification

1 code implementation • NAACL 2019 • Ashutosh Adhikari, Achyudh Ram, Raphael Tang, Jimmy Lin

Neural network models for many NLP tasks have grown increasingly complex in recent years, making training and deployment more difficult.

Ranked #2 on Document Classification on IMDb-M

Classification Document Classification +2

Paper
Code

Detecting Customer Complaint Escalation with Recurrent Neural Networks and Manually-Engineered Features

no code implementations • NAACL 2019 • Wei Yang, Luchen Tan, Chunwei Lu, Anqi Cui, Han Li, Xi Chen, Kun Xiong, Muzi Wang, Ming Li, Jian Pei, Jimmy Lin

Consumers dissatisfied with the normal dispute resolution process provided by an e-commerce company's customer service agents have the option of escalating their complaints by filing grievances with a government authority.

Paper
Add Code

Critically Examining the "Neural Hype": Weak Baselines and the Additivity of Effectiveness Gains from Neural Ranking Models

1 code implementation • 19 Apr 2019 • Wei Yang, Kuang Lu, Peilin Yang, Jimmy Lin

Is neural IR mostly hype?

Retrieval

Paper
Code

The Simplest Thing That Can Possibly Work: Pseudo-Relevance Feedback Using Text Classification

no code implementations • 18 Apr 2019 • Jimmy Lin

Motivated by recent commentary that has questioned today's pursuit of ever-more complex models and mathematical formalisms in applied machine learning and whether meaningful empirical progress is actually being made, this paper tries to tackle the decades-old problem of pseudo-relevance feedback with "the simplest thing that can possibly work".

General Classification text-classification +1

Paper
Add Code

DocBERT: BERT for Document Classification

3 code implementations • 17 Apr 2019 • Ashutosh Adhikari, Achyudh Ram, Raphael Tang, Jimmy Lin

We present, to our knowledge, the first application of BERT to document classification.

Ranked #1 on Document Classification on Yelp-14

Document Classification General Classification +1

Paper
Code

Document Expansion by Query Prediction

5 code implementations • 17 Apr 2019 • Rodrigo Nogueira, Wei Yang, Jimmy Lin, Kyunghyun Cho

One technique to improve the retrieval effectiveness of a search engine is to expand documents with terms that are related to or representative of the documents' content. From the perspective of a question answering system, this might comprise questions the document can potentially answer.

Ranked #1 on Passage Re-Ranking on TREC-PM

Passage Re-Ranking Question Answering +2

Paper
Code

Data Augmentation for BERT Fine-Tuning in Open-Domain Question Answering

no code implementations • 14 Apr 2019 • Wei Yang, Yuqing Xie, Luchen Tan, Kun Xiong, Ming Li, Jimmy Lin

Recently, a simple combination of passage retrieval using off-the-shelf IR techniques and a BERT reader was found to be very effective for question answering directly on Wikipedia, yielding a large improvement over the previous state of the art on a standard benchmark dataset.

Ranked #3 on Open-Domain Question Answering on SQuAD1.1 dev

Data Augmentation Open-Domain Question Answering +2

Paper
Add Code

Simple BERT Models for Relation Extraction and Semantic Role Labeling

3 code implementations • 10 Apr 2019 • Peng Shi, Jimmy Lin

We present simple BERT-based models for relation extraction and semantic role labeling.

Ranked #28 on Relation Extraction on TACRED

Relation Relation Extraction +1

Paper
Code

Distilling Task-Specific Knowledge from BERT into Simple Neural Networks

4 code implementations • 28 Mar 2019 • Raphael Tang, Yao Lu, Linqing Liu, Lili Mou, Olga Vechtomova, Jimmy Lin

In the natural language processing literature, neural networks are becoming increasingly deep and complex.

Ranked #60 on Sentiment Analysis on SST-2 Binary classification

Natural Language Inference Sentence +2

Paper
Code

Simple Applications of BERT for Ad Hoc Document Retrieval

2 code implementations • 26 Mar 2019 • Wei Yang, Haotian Zhang, Jimmy Lin

Following recent successes in applying BERT to question answering, we explore simple applications to ad hoc document retrieval.

Ranked #2 on Ad-Hoc Information Retrieval on TREC Robust04 (MAP metric)

Ad-Hoc Information Retrieval Question Answering +2

142

Paper
Code

Matching Entities Across Different Knowledge Graphs with Graph Embeddings

1 code implementation • 15 Mar 2019 • Michael Azmy, Peng Shi, Jimmy Lin, Ihab F. Ilyas

This paper explores the problem of matching entities across different knowledge graphs.

General Classification Knowledge Graphs

Paper
Code

End-to-End Open-Domain Question Answering with BERTserini

1 code implementation • NAACL 2019 • Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Tan, Kun Xiong, Ming Li, Jimmy Lin

We demonstrate an end-to-end question answering system that integrates BERT with the open-source Anserini information retrieval toolkit.

Ranked #4 on Open-Domain Question Answering on SQuAD1.1 dev

Information Retrieval Open-Domain Question Answering +2

Paper
Code

Streaming Voice Query Recognition using Causal Convolutional Recurrent Neural Networks

no code implementations • 19 Dec 2018 • Raphael Tang, Gefei Yang, Hong Wei, Yajie Mao, Ferhan Ture, Jimmy Lin

Voice-enabled commercial products are ubiquitous, typically enabled by lightweight on-device keyword spotting (KWS) and full automatic speech recognition (ASR) in the cloud.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

Paper
Add Code

The Neural Hype and Comparisons Against Weak Baselines

1 code implementation • ACM SIGIR Forum, Volume 52 Issue 2 2018 • Jimmy Lin

Sculley et al. remind us that "the goal of science is not wins, but knowledge".

Ranked #3 on Ad-Hoc Information Retrieval on TREC Robust04 (MAP metric)

Ad-Hoc Information Retrieval Cultural Vocal Bursts Intensity Prediction +1

Paper
Code

FLOPs as a Direct Optimization Objective for Learning Sparse Neural Networks

no code implementations • NIPS Workshop CDNNRIA 2018 • Raphael Tang, Ashutosh Adhikari, Jimmy Lin

There exists a plethora of techniques for inducing structured sparsity in parametric models during the optimization process, with the final goal of resource-efficient inference.

Image Classification Model Compression

Paper
Add Code

Simple Attention-Based Representation Learning for Ranking Short Social Media Posts

no code implementations • NAACL 2019 • Peng Shi, Jinfeng Rao, Jimmy Lin

This paper explores the problem of ranking short social media posts with respect to user queries using neural networks.

Representation Learning

Paper
Add Code

Progress and Tradeoffs in Neural Language Models

no code implementations • 2 Nov 2018 • Raphael Tang, Jimmy Lin

In recent years, we have witnessed a dramatic shift towards techniques driven by neural networks for a variety of NLP tasks.

Language Modelling

Paper
Add Code

JavaScript Convolutional Neural Networks for Keyword Spotting in the Browser: An Experimental Analysis

1 code implementation • 30 Oct 2018 • Jaejun Lee, Raphael Tang, Jimmy Lin

Overall, our robust, cross-device implementation for keyword spotting realizes a new paradigm for serving neural network applications, and one of our slim models reduces latency by 66% with a minimal decrease in accuracy of 4% from 94% to 90%.

Keyword Spotting Model Compression

Paper
Code

Adaptive Pruning of Neural Language Models for Mobile Devices

no code implementations • ICLR 2019 • Raphael Tang, Jimmy Lin

Neural language models (NLMs) exist in an accuracy-efficiency tradeoff space where better perplexity typically comes at the cost of greater computation complexity.

Paper
Add Code

Farewell Freebase: Migrating the SimpleQuestions Dataset to DBpedia

1 code implementation • COLING 2018 • Michael Azmy, Peng Shi, Jimmy Lin, Ihab Ilyas

To address this problem, we present SimpleDBpediaQA, a new benchmark dataset for simple question answering over knowledge graphs that was created by mapping SimpleQuestions entities and predicates from Freebase to DBpedia.

Knowledge Graphs Question Answering +1

Paper
Code

Repeatability Corner Cases in Document Ranking: The Impact of Score Ties

no code implementations • 16 Jul 2018 • Jimmy Lin, Peilin Yang

Due to multi-threaded indexing, which makes experimentation with large modern document collections practical, internal document ids are not assigned consistently between different index instances of the same collection, and thus score ties are broken unpredictably.

Document Ranking Retrieval

Paper
Add Code

Pay-Per-Request Deployment of Neural Network Models Using Serverless Architectures

no code implementations • NAACL 2018 • Zhucheng Tu, Mengping Li, Jimmy Lin

We demonstrate the serverless deployment of neural networks for model inferencing in NLP applications using Amazon's Lambda service for feedforward evaluation and DynamoDB for storing word embeddings.

Answer Selection Management +2

Paper
Add Code

CNNs for NLP in the Browser: Client-Side Deployment and Visualization Opportunities

no code implementations • NAACL 2018 • Yiyun Liang, Zhucheng Tu, Laetitia Huang, Jimmy Lin

We demonstrate a JavaScript implementation of a convolutional neural network that performs feedforward inference completely in the browser.

Interpretable Machine Learning Sentence Classification +1

Paper
Add Code

Multi-Perspective Relevance Matching with Hierarchical ConvNets for Social Media Search

3 code implementations • 21 May 2018 • Jinfeng Rao, Wei Yang, Yuhao Zhang, Ferhan Ture, Jimmy Lin

To our best knowledge, this paper presents the first substantial work tackling search over social media posts using neural ranking models.

Information Retrieval Retrieval

Paper
Code

Strong Baselines for Simple Question Answering over Knowledge Graphs with and without Neural Networks

no code implementations • NAACL 2018 • Salman Mohammed, Peng Shi, Jimmy Lin

We examine the problem of question answering over knowledge graphs, focusing on simple questions that can be answered by the lookup of a single fact.

Entity Linking Knowledge Graphs +1

Paper
Add Code

Deep Residual Learning for Small-Footprint Keyword Spotting

4 code implementations • 28 Oct 2017 • Raphael Tang, Jimmy Lin

We explore the application of deep residual learning and dilated convolutions to the keyword spotting task, using the recently-released Google Speech Commands Dataset as our benchmark.

Small-Footprint Keyword Spotting

Paper
Code

Honk: A PyTorch Reimplementation of Convolutional Neural Networks for Keyword Spotting

4 code implementations • 18 Oct 2017 • Raphael Tang, Jimmy Lin

We describe Honk, an open-source PyTorch reimplementation of convolutional neural networks for keyword spotting that are included as examples in TensorFlow.

Keyword Spotting speech-recognition +1

Paper
Code

An Insight Extraction System on BioMedical Literature with Deep Neural Networks

no code implementations • EMNLP 2017 • Hua He, Kris Ganjam, Navendu Jain, Jessica Lundin, Ryen White, Jimmy Lin

Mining biomedical text offers an opportunity to automatically discover important facts and infer associations among them.

Relation Extraction

Paper
Add Code

Integrating Lexical and Temporal Signals in Neural Ranking Models for Searching Social Media Streams

no code implementations • 25 Jul 2017 • Jinfeng Rao, Hua He, Haotian Zhang, Ferhan Ture, Royal Sequiera, Salman Mohammed, Jimmy Lin

To our knowledge, we are the first to integrate lexical and temporal signals in an end-to-end neural network architecture, in which existing neural ranking models are used to generate query-document similarity vectors that feed into a bidirectional LSTM layer for temporal modeling.

Density Estimation Document Ranking

Paper
Add Code

Exploring the Effectiveness of Convolutional Neural Networks for Answer Selection in End-to-End Question Answering

no code implementations • 25 Jul 2017 • Royal Sequiera, Gaurav Baruah, Zhucheng Tu, Salman Mohammed, Jinfeng Rao, Haotian Zhang, Jimmy Lin

Most work on natural language question answering today focuses on answer selection: given a candidate list of sentences, determine which contains the answer.

Answer Selection Retrieval

Paper
Add Code

Pairwise Word Interaction Modeling with Deep Neural Networks for Semantic Similarity Measurement

no code implementations • NAACL 2016 • Hua He, Jimmy Lin

Ranked #11 on Question Answering on TrecQA

Answer Selection Paraphrase Generation +2

Paper
Add Code

UMD-TTIC-UW at SemEval-2016 Task 1: Attention-Based Multi-Perspective Convolutional Neural Networks for Textual Similarity Measurement

no code implementations • SEMEVAL 2016 • Hua He, John Wieting, Kevin Gimpel, Jinfeng Rao, Jimmy Lin

Feature Engineering Question Answering +2

Paper
Add Code

Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks

no code implementations • EMNLP 2015 • Hua He, Kevin Gimpel, Jimmy Lin

Feature Engineering Machine Translation +5

Paper
Add Code

Gappy Pattern Matching on GPUs for On-Demand Extraction of Hierarchical Translation Grammars

no code implementations • TACL 2015 • Hua He, Jimmy Lin, Adam Lopez

We believe that GPU-based extraction of hierarchical grammars is an attractive proposition, particularly for MT applications that demand high throughput.

Machine Translation Translation

Paper
Add Code

Identifying Duplicate and Contradictory Information in Wikipedia

no code implementations • 4 Jun 2014 • Sarah Weissman, Samet Ayhan, Joshua Bradley, Jimmy Lin

Our study identifies sentences in Wikipedia articles that are either identical or highly similar by applying techniques for near-duplicate detection of web pages.