no code implementations • 15 Oct 2024 • Reno Kriz, Kate Sanders, David Etter, Kenton Murray, Cameron Carpenter, Kelly Van Ochten, Hannah Recknor, Jimena Guallar-Blasco, Alexander Martin, Ronald Colaianni, Nolan King, Eugene Yang, Benjamin Van Durme
Efficiently retrieving and synthesizing information from large-scale multimodal collections has become a critical challenge.
1 code implementation • 24 Jun 2024 • Abe Bohan Hou, Orion Weller, Guanghui Qin, Eugene Yang, Dawn Lawrie, Nils Holzenberger, Andrew Blair-Stanek, Benjamin Van Durme
This dataset, CLERC (Case Law Evaluation Retrieval Corpus), is constructed for training and evaluating models on their ability to (1) find corresponding citations for a given piece of legal analysis and (2) compile the text of these citations (along with previous context) into a cogent analysis that supports a reasoning goal.
1 code implementation • 7 May 2024 • Eugene Yang
High Recall Retrieval (HRR), such as eDiscovery and medical systematic review, is a search problem that optimizes the cost of retrieving most of the relevant documents in a given collection.
1 code implementation • 2 May 2024 • Dawn Lawrie, Efsun Kayi, Eugene Yang, James Mayfield, Douglas W. Oard
PLAID, an efficient implementation of the ColBERT late interaction bi-encoder using pretrained language models for ranking, consistently achieves state-of-the-art performance in monolingual, cross-language, and multilingual retrieval.
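A minimal sketch of the ColBERT-style late interaction scoring that PLAID accelerates, assuming query and document token embeddings have already been produced by a pretrained encoder; the MaxSim operator and the random toy embeddings below are illustrative, not the PLAID engine itself.

```python
# ColBERT-style late interaction: score = sum over query tokens of the maximum
# similarity to any document token (MaxSim). Embeddings here are stand-ins for
# the output of a pretrained encoder.
import numpy as np

def late_interaction_score(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
    """query_embs: (num_query_tokens, dim), doc_embs: (num_doc_tokens, dim), both L2-normalized."""
    sim = query_embs @ doc_embs.T          # cosine similarities between token pairs
    return float(sim.max(axis=1).sum())    # MaxSim per query token, then sum

# Toy usage with random, normalized embeddings.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8));  q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(12, 8)); d /= np.linalg.norm(d, axis=1, keepdims=True)
print(late_interaction_score(q, d))
```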
1 code implementation • 2 May 2024 • Eugene Yang, Dawn Lawrie, James Mayfield
Recent work in cross-language information retrieval (CLIR), where queries and documents are in different languages, has shown the benefit of the Translate-Distill framework that trains a cross-language neural dual-encoder model using translation and distillation.
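A minimal sketch of how a dual-encoder student can be distilled from a teacher's relevance scores, in the spirit of the Translate-Distill framework mentioned above; the KL-based loss, temperature, and tensor shapes are assumptions for illustration rather than the paper's exact recipe.

```python
# Distillation for a dual-encoder retriever: the student's dot-product scores
# over candidate passages are pushed toward the teacher's score distribution.
import torch
import torch.nn.functional as F

def distill_step(student_q_emb, student_p_embs, teacher_scores, temperature=1.0):
    """student_q_emb: (dim,); student_p_embs: (num_passages, dim); teacher_scores: (num_passages,)."""
    student_scores = student_p_embs @ student_q_emb          # dot-product relevance
    return F.kl_div(
        F.log_softmax(student_scores / temperature, dim=0),  # student distribution (log)
        F.softmax(teacher_scores / temperature, dim=0),      # teacher distribution
        reduction="batchmean",
    )

# Toy usage with random tensors standing in for encoder outputs and teacher scores.
q = torch.randn(16, requires_grad=True)
p = torch.randn(8, 16, requires_grad=True)
t = torch.randn(8)
distill_step(q, p, t).backward()
```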
1 code implementation • 2 May 2024 • Eugene Yang, Thomas Jänich, James Mayfield, Dawn Lawrie
We also evaluate real MLIR systems on two publicly available benchmarks and show that the PEER scores align with prior analytical findings on MLIR fairness.
no code implementations • 2 May 2024 • James Mayfield, Eugene Yang, Dawn Lawrie, Sean MacAvaney, Paul McNamee, Douglas W. Oard, Luca Soldaini, Ian Soboroff, Orion Weller, Efsun Kayi, Kate Sanders, Marc Mason, Noah Hibbler
Reports with these qualities are necessary to satisfy the complex, nuanced, or multi-faceted information needs of users.
1 code implementation • 29 Apr 2024 • Eugene Yang, Suraj Nair, Dawn Lawrie, James Mayfield, Douglas W. Oard, Kevin Duh
Probabilistic Structured Queries (PSQ) is a cross-language information retrieval (CLIR) method that uses translation probabilities statistically derived from aligned corpora.
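A minimal sketch of the PSQ idea: each query-language term is expanded into document-language terms weighted by translation probabilities, and a document is scored by the probability-weighted frequencies of those translations. The toy translation table and the simple scoring rule are illustrative assumptions; real PSQ implementations plug these expected counts into a standard ranking function.

```python
# PSQ-style scoring: accumulate expected term-frequency evidence for each query
# term under its translation probability distribution.
from collections import Counter

# P(document-language term | English query term), e.g. derived from aligned corpora (toy values).
translation_probs = {
    "election": {"wahl": 0.7, "abstimmung": 0.2, "wahlen": 0.1},
    "result":   {"ergebnis": 0.8, "resultat": 0.2},
}

def psq_score(query_terms, doc_tokens):
    tf = Counter(doc_tokens)
    score = 0.0
    for q in query_terms:
        # Expected frequency of the query term's translations in this document.
        score += sum(p * tf[t] for t, p in translation_probs.get(q, {}).items())
    return score

doc = "das ergebnis der wahl war knapp".split()
print(psq_score(["election", "result"], doc))
```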
no code implementations • 11 Apr 2024 • Eugene Yang, Dawn Lawrie, James Mayfield
Translate-Train (TT) trains a ColBERT model with English queries and passages from the MS MARCO v1 collection automatically translated into the document language.
no code implementations • 11 Apr 2024 • Eugene Yang, Dawn J. Lawrie, Paul McNamee, James Mayfield
This paper describes the submission runs from the HLTCOE team at the CIRAL CLIR tasks for African languages at FIRE 2023.
no code implementations • 11 Apr 2024 • Dawn Lawrie, Sean MacAvaney, James Mayfield, Paul McNamee, Douglas W. Oard, Luca Soldaini, Eugene Yang
The principal tasks are ranked retrieval of news in one of the three languages, using English topics.
1 code implementation • 9 Jan 2024 • Eugene Yang, Dawn Lawrie, James Mayfield, Douglas W. Oard, Scott Miller
Applying a similar knowledge distillation approach to training an efficient dual-encoder model for Cross-Language Information Retrieval (CLIR), where queries and documents are in different languages, is challenging due to the lack of a sufficiently large training collection when the query and document languages differ.
no code implementations • 29 Apr 2023 • James Mayfield, Eugene Yang, Dawn Lawrie, Samuel Barham, Orion Weller, Marc Mason, Suraj Nair, Scott Miller
By repeating this process, collections of arbitrary size can be created in the style of MS MARCO but using naturally-occurring documents in any desired genre and domain of discourse.
no code implementations • 24 Apr 2023 • Dawn Lawrie, Sean MacAvaney, James Mayfield, Paul McNamee, Douglas W. Oard, Luca Soldaini, Eugene Yang
This is the first year of the TREC Neural CLIR (NeuCLIR) track, which aims to study the impact of neural approaches to cross-language information retrieval.
no code implementations • 20 Dec 2022 • Eugene Yang, Suraj Nair, Dawn Lawrie, James Mayfield, Douglas W. Oard
By stacking adapters pretrained on language modeling tasks for a specific language with task-specific adapters, prior work has shown that adapter-enhanced models perform better than fine-tuning the entire model when transferring across languages in various NLP tasks.
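A minimal sketch of the adapter setup referred to above: small bottleneck modules with residual connections are applied to the hidden states of a frozen pretrained encoder, with a language adapter followed by a task (retrieval) adapter. The module layout, bottleneck size, and stacking order are assumptions for illustration, not the paper's configuration.

```python
# Bottleneck adapters: the only trainable parameters added to a frozen encoder.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, hidden_dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen model's representation intact.
        return hidden + self.up(torch.relu(self.down(hidden)))

hidden_dim = 768
language_adapter = BottleneckAdapter(hidden_dim)   # pretrained on language modeling
task_adapter = BottleneckAdapter(hidden_dim)       # trained on the retrieval task

hidden_states = torch.randn(2, 10, hidden_dim)     # stand-in for frozen encoder output
adapted = task_adapter(language_adapter(hidden_states))
print(adapted.shape)
```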
1 code implementation • 3 Sep 2022 • Dawn Lawrie, Eugene Yang, Douglas W. Oard, James Mayfield
Providing access to information across languages has been a goal of Information Retrieval (IR) for decades.
no code implementations • 25 Apr 2022 • Eugene Yang, Suraj Nair, Ramraj Chandradevan, Rebecca Iglesias-Flores, Douglas W. Oard
Pretrained language models have improved effectiveness on numerous tasks, including ad-hoc retrieval.
1 code implementation • 23 Feb 2022 • Eugene Yang, David D. Lewis
Technology-assisted review (TAR) is an important industrial application of information retrieval (IR) and machine learning (ML).
1 code implementation • 24 Jan 2022 • Cash Costello, Eugene Yang, Dawn Lawrie, James Mayfield
While there are high-quality software frameworks for information retrieval experimentation, they do not explicitly support cross-language information retrieval (CLIR).
1 code implementation • 24 Jan 2022 • Dawn Lawrie, James Mayfield, Douglas Oard, Eugene Yang
HC4 is a new suite of test collections for ad hoc Cross-Language Information Retrieval (CLIR), with Common Crawl News documents in Chinese, Persian, and Russian, topics in English and in the document languages, and graded relevance judgments.
1 code implementation • 20 Jan 2022 • Suraj Nair, Eugene Yang, Dawn Lawrie, Kevin Duh, Paul McNamee, Kenton Murray, James Mayfield, Douglas W. Oard
These models have improved the effectiveness of retrieval systems well beyond that of lexical term matching models such as BM25.
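For reference, a compact sketch of the Okapi BM25 lexical scoring that such neural models are compared against; the tiny in-memory corpus and the parameter values (k1=0.9, b=0.4) are only for illustration.

```python
# Okapi BM25: IDF-weighted, length-normalized term-frequency scoring.
import math
from collections import Counter

def bm25_score(query_terms, doc_tokens, corpus, k1=0.9, b=0.4):
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)          # document frequency
        if df == 0 or tf[term] == 0:
            continue
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        norm_tf = tf[term] * (k1 + 1) / (
            tf[term] + k1 * (1 - b + b * len(doc_tokens) / avg_len)
        )
        score += idf * norm_tf
    return score

corpus = [doc.split() for doc in [
    "neural retrieval with dense representations",
    "bm25 is a strong lexical baseline",
    "cross language information retrieval",
]]
print(bm25_score(["lexical", "baseline"], corpus[1], corpus))
```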
no code implementations • 29 Aug 2021 • David D. Lewis, Eugene Yang, Ophir Frieder
Technology-assisted review (TAR) workflows based on iterative active learning are widely used in document review applications.
1 code implementation • 29 Aug 2021 • Eugene Yang, David D. Lewis, Ophir Frieder
Content moderation (removing or limiting the distribution of posts based on their contents) is one tool social networks use to fight problems such as harassment and disinformation.
no code implementations • 18 Jun 2021 • Eugene Yang, David D. Lewis, Ophir Frieder
Technology-assisted review (TAR) refers to human-in-the-loop active learning workflows for finding relevant documents in large collections.
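A minimal sketch of such a human-in-the-loop TAR loop: train a classifier on the documents reviewed so far, score the remainder, and route the top-scoring batch to a (here simulated) reviewer. The synthetic data, batch size, and logistic-regression model are assumptions for illustration; real TAR workflows add stopping rules and cost accounting.

```python
# Iterative active learning for high-recall review on a synthetic collection.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                            # document features
true_labels = (X[:, 0] + 0.5 * X[:, 1] > 1).astype(int)   # hidden relevance

# Seed set with some relevant and some non-relevant documents already judged.
reviewed = list(np.where(true_labels == 1)[0][:5]) + list(np.where(true_labels == 0)[0][:5])
labels = {i: true_labels[i] for i in reviewed}            # "human" judgments so far

for rnd in range(5):
    clf = LogisticRegression().fit(X[reviewed], [labels[i] for i in reviewed])
    unreviewed = [i for i in range(len(X)) if i not in labels]
    scores = clf.predict_proba(X[unreviewed])[:, 1]
    batch = [unreviewed[j] for j in np.argsort(-scores)[:20]]  # top-scoring batch
    for i in batch:                                            # simulated human review
        labels[i] = true_labels[i]
    reviewed.extend(batch)
    print(f"round {rnd}: reviewed {len(reviewed)}, relevant found {sum(labels.values())}")
```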
1 code implementation • 18 Jun 2021 • Eugene Yang, David D. Lewis, Ophir Frieder
Technology-assisted review (TAR) refers to human-in-the-loop machine learning workflows for document review in legal discovery and other high recall review tasks.
no code implementations • 3 May 2021 • Eugene Yang, Sean MacAvaney, David D. Lewis, Ophir Frieder
We indeed find that the pre-trained BERT model reduces review cost by 10% to 15% in TAR workflows simulated on the RCV1-v2 newswire collection.
no code implementations • EACL (WASSA) 2021 • Tong Xiang, Sean MacAvaney, Eugene Yang, Nazli Goharian
Despite the recent successes of transformer-based models in terms of effectiveness on a variety of tasks, their decisions often remain opaque to humans.
no code implementations • SEMEVAL 2020 • Sajad Sotudeh, Tong Xiang, Hao-Ren Yao, Sean MacAvaney, Eugene Yang, Nazli Goharian, Ophir Frieder
Offensive language detection is an important and challenging task in natural language processing.