Search Results for author: AnHai Doan

Found 5 papers, 4 papers with code

Sparkly: A Simple yet Surprisingly Strong TF/IDF Blocker for Entity Matching

1 code implementation • Proceedings of the VLDB Endowment 2023 • Derek Paulsen, Yash Govind, AnHai Doan

We develop Sparkly, which uses Lucene to perform top-k tf/idf blocking in a distributed share-nothing fashion on a Spark cluster.

Blocking
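To make the idea of top-k tf/idf blocking in the Sparkly entry above concrete, here is a minimal single-machine sketch in plain Python. It is not Sparkly's implementation, which relies on Lucene and Spark. The function names (`tokenize`, `build_index`, `top_k_block`), the whitespace tokenizer, the `log(1 + n/df)` idf weighting, and the sample records are all assumptions made for illustration.

```python
import heapq
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def build_index(records):
    """Build an inverted index over `records` (dict: record id -> text).

    Returns (postings, idf), where postings maps each token to a list of
    (record id, tf * idf weight) pairs.
    """
    n = len(records)
    tf = {rid: Counter(tokenize(text)) for rid, text in records.items()}
    # Document frequency: number of records containing each token.
    df = Counter(tok for counts in tf.values() for tok in counts)
    # Assumed idf variant, chosen only for this sketch.
    idf = {tok: math.log(1 + n / d) for tok, d in df.items()}
    postings = defaultdict(list)
    for rid, counts in tf.items():
        for tok, count in counts.items():
            postings[tok].append((rid, count * idf[tok]))
    return postings, idf


def top_k_block(query, postings, idf, k):
    """Return up to k (record id, score) pairs with the highest tf/idf
    dot-product score against `query`, best first."""
    scores = defaultdict(float)
    for tok, count in Counter(tokenize(query)).items():
        # Tokens absent from the index contribute nothing.
        for rid, weight in postings.get(tok, ()):
            scores[rid] += count * idf[tok] * weight
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])


# Made-up records standing in for one of the two tables being matched.
table_a = {
    "a1": "apple iphone 13 pro",
    "a2": "samsung galaxy s21",
    "a3": "apple macbook air",
}
postings, idf = build_index(table_a)
candidates = top_k_block("apple macbook", postings, idf, k=2)
print(candidates)
```

Each record from the other table would be issued as a query in the same way, and only the returned candidate pairs would be passed on to the matcher. In a distributed setting the queries could be partitioned across workers, each holding a copy of the index.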

Deep Entity Matching with Pre-Trained Language Models

1 code implementation • 1 Apr 2020 • Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, Wang-Chiew Tan

Our experiments show that a straightforward application of language models such as BERT, DistilBERT, or RoBERTa pre-trained on large text corpora already significantly improves matching quality and outperforms the previous state of the art (SOTA) by up to 29% in F1 score on benchmark datasets.

Data Augmentation, Entity Resolution

Toward a System Building Agenda for Data Integration

no code implementations • 29 Sep 2017 • AnHai Doan, Adel Ardalan, Jeffrey R. Ballard, Sanjib Das, Yash Govind, Pradap Konda, Han Li, Erik Paulson, Paul Suganthan G. C., Haojun Zhang

They provide tools to address the "pain points" of the steps, and tools are built on top of the Python data science and Big Data ecosystem (PyData).

Databases
