Search Results for author: Zejiang Shen

Found 15 papers, 11 papers with code

American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers

no code implementations NeurIPS 2023 Melissa Dell, Jacob Carlson, Tom Bryan, Emily Silcock, Abhishek Arora, Zejiang Shen, Luca D'Amico-Wong, Quan Le, Pablo Querubin, Leander Heldring

The resulting American Stories dataset provides high quality data that could be used for pre-training a large language model to achieve better understanding of historical English and historical world knowledge.

Language Modelling Large Language Model +3

Are Layout-Infused Language Models Robust to Layout Distribution Shifts? A Case Study with Scientific Documents

1 code implementation 1 Jun 2023 Catherine Chen, Zejiang Shen, Dan Klein, Gabriel Stanovsky, Doug Downey, Kyle Lo

Recent work has shown that infusing layout features into language models (LMs) improves processing of visually-rich documents such as scientific papers.

Diversity

Beyond Summarization: Designing AI Support for Real-World Expository Writing Tasks

no code implementations 5 Apr 2023 Zejiang Shen, Tal August, Pao Siangliulue, Kyle Lo, Jonathan Bragg, Jeff Hammerbacher, Doug Downey, Joseph Chee Chang, David Sontag

In this position paper, we argue that developing AI supports for expository writing has unique and exciting research challenges and can lead to high real-world impacts.

Multi-LexSum: Real-World Summaries of Civil Rights Lawsuits at Multiple Granularities

1 code implementation 22 Jun 2022 Zejiang Shen, Kyle Lo, Lauren Yu, Nathan Dahlberg, Margo Schlanger, Doug Downey

With the advent of large language models, methods for abstractive summarization have made great strides, creating potential for use in applications to aid knowledge workers processing unwieldy document collections.

Abstractive Text Summarization Document Summarization +2

Don't Say What You Don't Know: Improving the Consistency of Abstractive Summarization by Constraining Beam Search

1 code implementation 16 Mar 2022 Daniel King, Zejiang Shen, Nishant Subramani, Daniel S. Weld, Iz Beltagy, Doug Downey

Based on our findings, we present PINOCCHIO, a new decoding method that improves the consistency of a transformer-based abstractive summarizer by constraining beam search to avoid hallucinations.

Abstractive Text Summarization

VILA: Improving Structured Content Extraction from Scientific PDFs Using Visual Layout Groups

1 code implementation 1 Jun 2021 Zejiang Shen, Kyle Lo, Lucy Lu Wang, Bailey Kuehl, Daniel S. Weld, Doug Downey

Experiments are conducted on a newly curated evaluation suite, S2-VLUE, that unifies existing automatically-labeled datasets and includes a new dataset of manual annotations covering diverse papers from 19 scientific disciplines.

Language Modelling Text Classification +2

LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis

6 code implementations 29 Mar 2021 Zejiang Shen, Ruochen Zhang, Melissa Dell, Benjamin Charles Germain Lee, Jacob Carlson, Weining Li

Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks.

PAWLS: PDF Annotation With Labels and Structure

1 code implementation ACL 2021 Mark Neumann, Zejiang Shen, Sam Skjonsberg

Adobe's Portable Document Format (PDF) is a popular way of distributing view-only documents with a rich visual markup.

OLALA: Object-Level Active Learning for Efficient Document Layout Annotation

1 code implementation 5 Oct 2020 Zejiang Shen, Jian Zhao, Melissa Dell, YaoLiang Yu, Weining Li

Document images often have intricate layout structures, with numerous content regions (e.g., texts, figures, tables) densely arranged on each page.

Active Learning Object +1

A Large Dataset of Historical Japanese Documents with Complex Layouts

3 code implementations 18 Apr 2020 Zejiang Shen, Kaixuan Zhang, Melissa Dell

Deep learning-based approaches for automatic document layout analysis and content extraction have the potential to unlock rich information trapped in historical documents on a large scale.

Document Layout Analysis

Generating Object Stamps

1 code implementation 1 Jan 2020 Youssef Alami Mejjati, Zejiang Shen, Michael Snower, Aaron Gokaslan, Oliver Wang, James Tompkin, Kwang In Kim

We present an algorithm to generate diverse foreground objects and composite them into background images using a GAN architecture.

Diversity Object

Information Extraction from Text Regions with Complex Tabular Structure

no code implementations NeurIPS Workshop Document_Intelligen 2019 Kaixuan Zhang, Zejiang Shen, Jie Zhou, Melissa Dell

Recent innovations have improved layout analysis of document images, significantly improving our ability to identify text and non-text regions.
