Search Results for author: Melissa Dell

Found 11 papers, 6 papers with code

LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis

6 code implementations · 29 Mar 2021 · Zejiang Shen, Ruochen Zhang, Melissa Dell, Benjamin Charles Germain Lee, Jacob Carlson, Weining Li

Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks.

A Large Dataset of Historical Japanese Documents with Complex Layouts

3 code implementations · 18 Apr 2020 · Zejiang Shen, Kaixuan Zhang, Melissa Dell

Deep learning-based approaches for automatic document layout analysis and content extraction have the potential to unlock rich information trapped in historical documents on a large scale.

Document Layout Analysis

LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models

1 code implementation · 2 Sep 2023 · Abhishek Arora, Melissa Dell

By combining transformer language models with intuitive APIs that will be familiar to many users of popular string matching packages, LinkTransformer aims to democratize the benefits of LLMs among those who may be less familiar with deep learning frameworks.

Blocking, Language Modelling +3

OLALA: Object-Level Active Learning for Efficient Document Layout Annotation

1 code implementation · 5 Oct 2020 · Zejiang Shen, Jian Zhao, Melissa Dell, YaoLiang Yu, Weining Li

Document images often have intricate layout structures, with numerous content regions (e.g., texts, figures, tables) densely arranged on each page.

Active Learning, Object +1

Efficient OCR for Building a Diverse Digital History

1 code implementation · 5 Apr 2023 · Jacob Carlson, Tom Bryan, Melissa Dell

Thousands of users consult digital archives daily, but the information they can access is unrepresentative of the diversity of documentary history.

Image Retrieval, Language Modelling +3

Quantifying Character Similarity with Vision Transformers

1 code implementation · 24 May 2023 · Xinmei Yang, Abhishek Arora, Shao-Yu Jheng, Melissa Dell

Not all character substitutions are equally probable, and for some settings there are widely used handcrafted lists denoting which string substitutions are more likely, that improve the accuracy of string matching.

Optical Character Recognition (OCR)

Information Extraction from Text Regions with Complex Tabular Structure

no code implementations · NeurIPS Workshop Document_Intelligen 2019 · Kaixuan Zhang, Zejiang Shen, Jie Zhou, Melissa Dell

Recent innovations have improved layout analysis of document images, significantly improving our ability to identify text and non-text regions.

Noise-Robust De-Duplication at Scale

no code implementations · 9 Oct 2022 · Emily Silcock, Luca D'Amico-Wong, Jinglin Yang, Melissa Dell

Identifying near duplicates within large, noisy text corpora has a myriad of applications that range from de-duplicating training datasets, reducing privacy risk, and evaluating test set leakage, to identifying reproduced news articles and literature within large corpora.

Linking Representations with Multimodal Contrastive Learning

no code implementations · 7 Apr 2023 · Abhishek Arora, Xinmei Yang, Shao-Yu Jheng, Melissa Dell

CLIPPINGS outperforms widely used string matching methods by a wide margin and also outperforms unimodal methods.

Contrastive Learning, Optical Character Recognition (OCR)

EfficientOCR: An Extensible, Open-Source Package for Efficiently Digitizing World Knowledge

no code implementations · 16 Oct 2023 · Tom Bryan, Jacob Carlson, Abhishek Arora, Melissa Dell

Given the diversity and sheer quantity of public domain texts, liberating them at scale requires optical character recognition (OCR) that is accurate, extremely cheap to deploy, and sample-efficient to customize to novel collections, languages, and character sets.

Image Retrieval, Language Modelling +3

American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers

no code implementations · NeurIPS 2023 · Melissa Dell, Jacob Carlson, Tom Bryan, Emily Silcock, Abhishek Arora, Zejiang Shen, Luca D'Amico-Wong, Quan Le, Pablo Querubin, Leander Heldring

The resulting American Stories dataset provides high quality data that could be used for pre-training a large language model to achieve better understanding of historical English and historical world knowledge.

Language Modelling, Large Language Model +3
