no code implementations • 16 Dec 2024 • Michael Bendersky, Donald Metzler, Marc Najork, Xuanhui Wang
This article describes the history of information retrieval on personal document collections.
no code implementations • 29 Nov 2023 • Spurthi Amba Hombaiah, Tao Chen, Mingyang Zhang, Michael Bendersky, Marc Najork, Matt Colen, Sergey Levi, Vladimir Ofitserov, Tanvir Amin
In other words, grounding the interpretation of the tweet in the context of its creator plays an important role in deciphering the true intent and the importance of the tweet.
1 code implementation • Proceedings of the 29th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '23) 2023 • Karan Samel, Cheng Li, Weize Kong, Tao Chen, Mingyang Zhang, Shaleen Gupta, Swaraj Khadanga, Wensong Xu, Xingyu Wang, Kashyap Kolipaka, Mike Bendersky, Marc Najork
These inferred weights and terms can be used directly by a retrieval system to perform a query search.
Ranked #1 on Passage Retrieval on MS MARCO
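As a rough illustration of how inferred term weights and expansion terms could drive retrieval, here is a minimal sketch; the weights, terms, and scoring rule below are hypothetical stand-ins, not the paper's model:

```python
# Minimal sketch: scoring documents with model-inferred query term weights.
# The (term, weight) pairs below are hypothetical stand-ins for model output.
from collections import Counter

def weighted_score(term_weights, doc_tokens):
    """Sum inferred term weights over matching document term frequencies."""
    tf = Counter(doc_tokens)
    return sum(w * tf[t] for t, w in term_weights.items())

# Inferred weights for the query "cheap flights to tokyo";
# "airfare" is an inferred expansion term not present in the raw query.
term_weights = {"cheap": 0.4, "flights": 1.0, "tokyo": 1.3, "airfare": 0.7}

docs = {
    "d1": "tokyo flights airfare deals cheap".split(),
    "d2": "hotel booking tokyo".split(),
}
ranking = sorted(docs, key=lambda d: weighted_score(term_weights, docs[d]), reverse=True)
print(ranking)  # ['d1', 'd2']
```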
no code implementations • 19 May 2023 • Aditi Chaudhary, Karthik Raman, Krishna Srinivasan, Kazuma Hashimoto, Mike Bendersky, Marc Najork
While our experiments demonstrate that these modifications help improve the performance of QGen techniques, we also find that QGen approaches struggle to capture the full nuance of the relevance label space, and as a result the generated queries are not faithful to the desired relevance label.
no code implementations • 8 May 2023 • Rongzhi Zhang, Jiaming Shen, Tianqi Liu, Jialu Liu, Michael Bendersky, Marc Najork, Chao Zhang
In this work, we argue that such a learning objective is sub-optimal because there exists a discrepancy between the teacher's output distribution and the ground truth label distribution.
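For reference, here is a minimal sketch of the standard distillation objective being critiqued: the student is pulled both toward the teacher's softened distribution and toward the hard labels, and the two targets can disagree. Hyperparameters are illustrative:

```python
# Standard knowledge-distillation objective (the one the excerpt argues is
# sub-optimal): blend the teacher's soft distribution with the hard label.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    # When the teacher's distribution and the label distribution disagree,
    # these two terms pull the student in different directions; that gap is
    # the discrepancy the paper targets.
    return alpha * soft + (1 - alpha) * hard
```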
no code implementations • 12 Feb 2023 • Jiaming Shen, Jialu Liu, Dan Finnie, Negar Rahmati, Michael Bendersky, Marc Najork
With the growing need for news headline generation, we argue that the hallucination issue, namely generated headlines that are not supported by the original news stories, is a critical challenge for deploying this feature in web-scale systems. Meanwhile, because hallucination cases are infrequent and raters must read carefully to reach the correct consensus, it is difficult to acquire a large dataset for training a hallucination detection model through human curation.
no code implementations • 28 Dec 2022 • Yunan Zhang, Le Yan, Zhen Qin, Honglei Zhuang, Jiaming Shen, Xuanhui Wang, Michael Bendersky, Marc Najork
We give both theoretical analysis and empirical results to show the negative effects of such a correlation on the relevance tower.
no code implementations • 19 Dec 2022 • Sanket Vaibhav Mehta, Jai Gupta, Yi Tay, Mostafa Dehghani, Vinh Q. Tran, Jinfeng Rao, Marc Najork, Emma Strubell, Donald Metzler
In this work, we introduce DSI++, a continual learning challenge for DSI to incrementally index new documents while being able to answer queries related to both previously and newly indexed documents.
no code implementations • 2 Nov 2022 • Aijun Bai, Rolf Jagerman, Zhen Qin, Le Yan, Pratyush Kar, Bing-Rong Lin, Xuanhui Wang, Michael Bendersky, Marc Najork
As Learning-to-Rank (LTR) approaches primarily seek to improve ranking quality, their output scores are not scale-calibrated by design.
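To illustrate what "scale-calibrated" means here: ranking losses are typically invariant to order-preserving transforms of the scores, so raw LTR outputs cannot be read as probabilities without a post-hoc step. Below is a minimal sketch using Platt scaling, one standard remedy; this is not necessarily the calibration method the paper proposes, and the data is synthetic:

```python
# Sketch: post-hoc calibration of uncalibrated ranker scores via Platt scaling.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
scores = rng.normal(size=1000)                                  # raw ranker outputs
clicks = rng.random(1000) < 1 / (1 + np.exp(-2 * scores + 1))   # synthetic click labels

platt = LogisticRegression().fit(scores.reshape(-1, 1), clicks)
calibrated = platt.predict_proba(scores.reshape(-1, 1))[:, 1]   # readable as CTR estimates
```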
no code implementations • 25 Jan 2022 • Tao Chen, Mingyang Zhang, Jing Lu, Michael Bendersky, Marc Najork
In this work, we carefully select five datasets, including two in-domain datasets and three out-of-domain datasets with different levels of domain shift, and study the generalization of a deep model in a zero-shot setting.
no code implementations • 7 Jan 2022 • Beliz Gunel, Navneet Potti, Sandeep Tata, James B. Wendt, Marc Najork, Jing Xie
Automating information extraction from form-like documents at scale is a pressing need due to its potential impact on automating business workflows across many industries like financial services, insurance, and healthcare.
no code implementations • 17 Dec 2021 • Nan Wang, Zhen Qin, Le Yan, Honglei Zhuang, Xuanhui Wang, Michael Bendersky, Marc Najork
Multiclass classification (MCC) is a fundamental machine learning problem of classifying each instance into one of a predefined set of classes.
no code implementations • 30 Sep 2021 • Zhen Qin, Le Yan, Yi Tay, Honglei Zhuang, Xuanhui Wang, Michael Bendersky, Marc Najork
We explore a novel perspective of knowledge distillation (KD) for learning to rank (LTR), and introduce Self-Distilled neural Rankers (SDR), where student rankers are parameterized identically to their teachers.
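Here is a minimal sketch of the self-distillation idea, with a frozen snapshot of the same architecture serving as the teacher; the loss mix, data, and training details are illustrative assumptions, not the exact SDR recipe:

```python
# Self-distillation sketch: teacher and student share the same parameterization;
# the teacher is simply a frozen snapshot of the student architecture.
import copy
import torch
import torch.nn.functional as F

student = torch.nn.Linear(16, 1)          # stand-in for a neural ranker
teacher = copy.deepcopy(student).eval()   # identical architecture, frozen
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

features = torch.randn(8, 10, 16)         # 8 queries x 10 candidate docs x 16 features
labels = torch.zeros(8, 10); labels[:, 0] = 1.0   # toy: first doc in each list relevant

for step in range(100):
    s = student(features).squeeze(-1)
    with torch.no_grad():
        t = teacher(features).squeeze(-1)
    # Distill the teacher's listwise distribution, plus a supervised listwise loss.
    distill = F.kl_div(F.log_softmax(s, dim=-1), F.softmax(t, dim=-1), reduction="batchmean")
    listwise = -(labels * F.log_softmax(s, dim=-1)).sum(-1).mean()
    loss = distill + listwise
    opt.zero_grad(); loss.backward(); opt.step()
```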
no code implementations • 29 Sep 2021 • Nan Wang, Zhen Qin, Le Yan, Honglei Zhuang, Xuanhui Wang, Michael Bendersky, Marc Najork
We further demonstrate that the most popular MCC architecture in deep learning can be equivalently formulated as an LTR pipeline, with a specific set of choices in terms of ranking model architecture and loss function.
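A quick numeric check of the stated equivalence in its simplest form: softmax cross-entropy for MCC coincides with the listwise softmax ranking loss when classes are treated as the "documents" and the true class carries one-hot relevance (illustrative, not the paper's full construction):

```python
# Softmax cross-entropy over classes == listwise softmax loss over the
# "list" of classes with one-hot relevance.
import numpy as np

logits = np.array([2.0, -1.0, 0.5])   # class scores = "document" scores
label = 0                             # true class = the one relevant "document"

log_softmax = logits - np.log(np.exp(logits).sum())
mcc_cross_entropy = -log_softmax[label]

relevance = np.eye(3)[label]          # one-hot relevance over the list
listwise_softmax_loss = -(relevance * log_softmax).sum()

assert np.isclose(mcc_cross_entropy, listwise_softmax_loss)
```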
no code implementations • 11 Jun 2021 • Spurthi Amba Hombaiah, Tao Chen, Mingyang Zhang, Michael Bendersky, Marc Najork
To this end, we explore two different vocabulary composition methods and propose three sampling methods that enable efficient incremental training of BERT-like models.
no code implementations • 5 May 2021 • Donald Metzler, Yi Tay, Dara Bahri, Marc Najork
When experiencing an information need, users want to engage with a domain expert, but often turn to an information retrieval system, such as a search engine, instead.
no code implementations • 15 Apr 2021 • Chen Qu, Weize Kong, Liu Yang, Mingyang Zhang, Michael Bendersky, Marc Najork
We investigate the privacy and utility implications of applying dx-privacy, a variant of Local Differential Privacy, to BERT fine-tuning in NLU applications.
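For context, here is a rough sketch of the kind of dx-privacy text mechanism referenced, in the style of noise-on-embedding approaches: perturb a token's embedding with noise whose density decays as exp(-eps * ||z||) and release the nearest vocabulary token. The embedding table, dimension, and eps value below are placeholder assumptions, not the paper's setup:

```python
# Sketch of a dx-privacy mechanism on token embeddings: additive noise with
# density proportional to exp(-eps * ||z||), then nearest-neighbor remapping.
import numpy as np

rng = np.random.default_rng(0)
d, eps = 300, 10.0                       # embedding dim and privacy parameter (assumed)
vocab_emb = rng.normal(size=(5000, d))   # stand-in token embedding table

def dx_perturb(vec, eps):
    """Sample a uniform direction and a Gamma(d, 1/eps) magnitude."""
    direction = rng.normal(size=d)
    direction /= np.linalg.norm(direction)
    magnitude = rng.gamma(shape=d, scale=1.0 / eps)
    return vec + magnitude * direction

def privatize(token_id):
    noisy = dx_perturb(vocab_emb[token_id], eps)
    # Release the vocabulary token whose embedding is nearest to the noisy vector.
    return int(np.argmin(np.linalg.norm(vocab_emb - noisy, axis=1)))
```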
3 code implementations • 2 Mar 2021 • Krishna Srinivasan, Karthik Raman, Jiecao Chen, Michael Bendersky, Marc Najork
First, WIT is the largest multimodal dataset by number of image-text examples, exceeding the next largest by 3x (at the time of writing).
Ranked #1 on Image Retrieval on WIT
no code implementations • ICLR 2021 • Zhen Qin, Le Yan, Honglei Zhuang, Yi Tay, Rama Kumar Pasumarthi, Xuanhui Wang, Michael Bendersky, Marc Najork
We first validate this concern by showing that most recent neural LTR models are, by a large margin, inferior to the best publicly available Gradient Boosted Decision Trees (GBDT) in terms of their reported ranking accuracy on benchmark datasets.
2 code implementations • 22 Oct 2020 • Nicholas Monath, Avinava Dubey, Guru Guruganesh, Manzil Zaheer, Amr Ahmed, Andrew McCallum, Gokhan Mergen, Marc Najork, Mert Terzihan, Bryon Tjanaka, YuAn Wang, Yuchen Wu
The applicability of agglomerative clustering, for inferring both hierarchical and flat clustering, is limited by its scalability.
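To make that scalability limit concrete, here is a toy sketch (not the paper's method) of the quadratic pairwise-distance structure that naive agglomerative clustering materializes:

```python
# Naive agglomerative clustering builds and repeatedly scans an O(n^2)
# pairwise-distance structure, which is what caps its scalability.
import numpy as np
from scipy.spatial.distance import pdist

n, d = 2000, 8
points = np.random.default_rng(0).normal(size=(n, d))
distances = pdist(points)            # n*(n-1)/2 entries: ~2M floats for n=2000
print(distances.nbytes / 1e6, "MB")  # ~16 MB here; grows quadratically with n
```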
no code implementations • Findings of the Association for Computational Linguistics 2020 • Jiecao Chen, Liu Yang, Karthik Raman, Michael Bendersky, Jung-Jung Yeh, Yun Zhou, Marc Najork, Danyang Cai, Ehsan Emadzadeh
Pre-trained models like BERT (Devlin et al., 2018) have dominated NLP / IR applications such as single sentence classification, text pair classification, and question answering.
no code implementations • 2 Oct 2020 • Saar Kuzi, Mingyang Zhang, Cheng Li, Michael Bendersky, Marc Najork
We propose a hybrid approach that leverages both semantic (deep neural network-based) and lexical (keyword matching-based) retrieval models.
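A minimal sketch of one common instantiation of such a hybrid is linear interpolation of normalized lexical and semantic scores; the interpolation weight and min-max normalization are assumptions, not necessarily the paper's fusion rule:

```python
# Hybrid retrieval sketch: fuse lexical (BM25-style) and semantic (dual-encoder)
# scores for the same candidate set via linear interpolation.
import numpy as np

def normalize(scores):
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo + 1e-9)

bm25 = np.array([12.1, 7.4, 0.0, 3.3])       # lexical keyword-matching scores
dense = np.array([0.82, 0.91, 0.40, 0.55])   # semantic embedding similarities

lam = 0.5                                    # assumed fusion weight
hybrid = lam * normalize(bm25) + (1 - lam) * normalize(dense)
print(np.argsort(-hybrid))                   # final ranking over the candidates
```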
1 code implementation • ACL 2020 • Bodhisattwa Majumder, Navneet Potti, Sandeep Tata, James B. Wendt, Qi Zhao, Marc Najork
We propose a novel approach using representation learning for tackling the problem of extracting structured information from form-like document images.
no code implementations • 23 May 2020 • Abbas Kazerouni, Qi Zhao, Jing Xie, Sandeep Tata, Marc Najork
Furthermore, there is usually only a small amount of initial training data available when building machine-learned models to solve such problems.
1 code implementation • 26 Apr 2020 • Liu Yang, Mingyang Zhang, Cheng Li, Michael Bendersky, Marc Najork
In order to better capture sentence level semantic relations within a document, we pre-train the model with a novel masked sentence block language modeling task in addition to the masked word language modeling task used by BERT.
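A toy sketch of what masking at the sentence-block level might look like, as opposed to BERT's wordpiece masking; the 15% block selection rate and the [SENT_MASK] token are illustrative assumptions, not the paper's exact recipe:

```python
# Masked sentence block sketch: select whole sentence blocks as prediction
# targets instead of individual wordpieces.
import random

random.seed(0)

def mask_sentence_blocks(sentences, rate=0.15):
    targets, masked = [], []
    for sent in sentences:
        if random.random() < rate:
            targets.append(sent)          # the model must recover this block
            masked.append("[SENT_MASK]")
        else:
            masked.append(sent)
    return masked, targets

doc = ["The firm beat estimates.", "Shares rose 4%.", "Analysts were surprised."]
print(mask_sentence_blocks(doc))
```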
no code implementations • 17 Apr 2020 • Shuguang Han, Xuanhui Wang, Mike Bendersky, Marc Najork
This paper describes a machine learning algorithm for document (re)ranking, in which queries and documents are first encoded using BERT [1], and a learning-to-rank (LTR) model constructed with TF-Ranking (TFR) [2] is then applied on top to further optimize ranking performance.
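A schematic of that two-stage design is sketched below, with a generic encoder stand-in in place of the actual BERT and TF-Ranking code:

```python
# Two-stage sketch: an encoder produces a pooled vector per query-document
# pair, and an LTR head scores the list under a listwise (softmax) loss.
# The linear "encoder" is a random stand-in, not actual BERT or TF-Ranking.
import torch
import torch.nn.functional as F

hidden = 128
encoder = torch.nn.Linear(32, hidden)     # stand-in for BERT's pooled [CLS] output
score_head = torch.nn.Linear(hidden, 1)   # LTR scoring function

pair_feats = torch.randn(4, 25, 32)       # 4 queries x 25 candidate documents
relevance = torch.randint(0, 2, (4, 25)).float()

scores = score_head(torch.relu(encoder(pair_feats))).squeeze(-1)
loss = -(relevance * F.log_softmax(scores, dim=-1)).sum(-1).mean()  # softmax LTR loss
loss.backward()
```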
no code implementations • 21 Oct 2019 • Rama Kumar Pasumarthi, Xuanhui Wang, Michael Bendersky, Marc Najork
It thus motivates us to study how to leverage cross-document interactions for learning-to-rank in the deep learning framework.
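A minimal sketch of one way to realize cross-document interactions: let the candidate documents of a query attend to one another before scoring. The attention configuration below is an assumption, not the paper's exact network:

```python
# Cross-document interaction sketch: self-attention across the candidate list,
# so each document is scored with knowledge of its competitors.
import torch

dim = 64
attn = torch.nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
score = torch.nn.Linear(dim, 1)

docs = torch.randn(2, 10, dim)     # 2 query lists x 10 candidate documents
mixed, _ = attn(docs, docs, docs)  # each doc representation now sees the full list
scores = score(mixed).squeeze(-1)  # per-document scores conditioned on the list
```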
2 code implementations • 30 Nov 2018 • Rama Kumar Pasumarthi, Sebastian Bruch, Xuanhui Wang, Cheng Li, Michael Bendersky, Marc Najork, Jan Pfeifer, Nadav Golbandi, Rohan Anil, Stephan Wolf
We propose TensorFlow Ranking, the first open source library for solving large-scale ranking problems in a deep learning framework.
2 code implementations • 11 Nov 2018 • Qingyao Ai, Xuanhui Wang, Sebastian Bruch, Nadav Golbandi, Michael Bendersky, Marc Najork
To overcome this limitation, we propose a new framework for multivariate scoring functions, in which the relevance score of a document is determined jointly by multiple documents in the list.
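A minimal sketch of a multivariate (groupwise) scoring function follows, under assumed design choices (group size 2, additive accumulation of group scores); the paper's exact formulation may differ:

```python
# Multivariate scoring sketch: score documents jointly in small groups and
# accumulate each document's contribution across all groups it appears in.
import itertools
import torch

group_size, dim = 2, 16
g = torch.nn.Sequential(
    torch.nn.Linear(group_size * dim, 32), torch.nn.ReLU(),
    torch.nn.Linear(32, group_size),       # one joint score per group member
)

docs = torch.randn(5, dim)                 # one query's candidate list
totals = torch.zeros(5)
with torch.no_grad():                      # inference-time scoring sketch
    for idx in itertools.combinations(range(5), group_size):
        joint = g(docs[list(idx)].reshape(1, -1)).squeeze(0)
        totals[list(idx)] += joint         # accumulate each doc's share
print(totals)                              # final multivariate scores
```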