6 code implementations • 29 Mar 2021 • Zejiang Shen, Ruochen Zhang, Melissa Dell, Benjamin Charles Germain Lee, Jacob Carlson, Weining Li
Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks.
3 code implementations • 18 Apr 2020 • Zejiang Shen, Kaixuan Zhang, Melissa Dell
Deep learning-based approaches for automatic document layout analysis and content extraction have the potential to unlock rich information trapped in historical documents on a large scale.
1 code implementation • 2 Sep 2023 • Abhishek Arora, Melissa Dell
By combining transformer language models with intuitive APIs that will be familiar to many users of popular string matching packages, LinkTransformer aims to democratize the benefits of LLMs among those who may be less familiar with deep learning frameworks.
1 code implementation • 5 Oct 2020 • Zejiang Shen, Jian Zhao, Melissa Dell, YaoLiang Yu, Weining Li
Document images often have intricate layout structures, with numerous content regions (e.g. texts, figures, tables) densely arranged on each page.
1 code implementation • 5 Apr 2023 • Jacob Carlson, Tom Bryan, Melissa Dell
Thousands of users consult digital archives daily, but the information they can access is unrepresentative of the diversity of documentary history.
1 code implementation • 24 May 2023 • Xinmei Yang, Abhishek Arora, Shao-Yu Jheng, Melissa Dell
Not all character substitutions are equally probable, and for some settings there are widely used handcrafted lists that denote which string substitutions are more likely and thereby improve the accuracy of string matching.
no code implementations • NeurIPS Workshop Document_Intelligen 2019 • Kaixuan Zhang, Zejiang Shen, Jie Zhou, Melissa Dell
Recent innovations in layout analysis of document images have significantly improved our ability to identify text and non-text regions.
no code implementations • 9 Oct 2022 • Emily Silcock, Luca D'Amico-Wong, Jinglin Yang, Melissa Dell
Identifying near duplicates within large, noisy text corpora has a myriad of applications that range from de-duplicating training datasets, reducing privacy risk, and evaluating test set leakage, to identifying reproduced news articles and literature within large corpora.
no code implementations • 7 Apr 2023 • Abhishek Arora, Xinmei Yang, Shao-Yu Jheng, Melissa Dell
CLIPPINGS outperforms widely used string matching methods by a wide margin and also outperforms unimodal methods.
no code implementations • 16 Oct 2023 • Tom Bryan, Jacob Carlson, Abhishek Arora, Melissa Dell
Given the diversity and sheer quantity of public domain texts, liberating them at scale requires optical character recognition (OCR) that is accurate, extremely cheap to deploy, and sample-efficient to customize to novel collections, languages, and character sets.
no code implementations • NeurIPS 2023 • Melissa Dell, Jacob Carlson, Tom Bryan, Emily Silcock, Abhishek Arora, Zejiang Shen, Luca D'Amico-Wong, Quan Le, Pablo Querubin, Leander Heldring
The resulting American Stories dataset provides high quality data that could be used for pre-training a large language model to achieve better understanding of historical English and historical world knowledge.