Search Results for author: Stefan Larson

Found 16 papers, 7 papers with code

Label Errors in the Tobacco3482 Dataset

1 code implementation · 17 Dec 2024 · Gordon Lim, Stefan Larson, Kevin Leach

Tobacco3482 is a widely used document classification benchmark dataset.

Document Classification, valid

Document Type Classification using File Names

no code implementations · 2 Oct 2024 · Zhijian Li, Stefan Larson, Kevin Leach

Rapid document classification is critical in several time-sensitive applications like digital forensics and large-scale media classification.

Classification, Document Classification

Generating Hard-Negative Out-of-Scope Data with ChatGPT for Intent Classification

1 code implementation · 8 Mar 2024 · Zhijian Li, Stefan Larson, Kevin Leach

Intent classifiers must be able to distinguish when a user's utterance does not belong to any supported intent to avoid producing incorrect and unrelated system responses.

intent-classification, Intent Classification

On Evaluation of Document Classification using RVL-CDIP

no code implementations · 21 Jun 2023 · Stefan Larson, Gordon Lim, Kevin Leach

The RVL-CDIP benchmark is widely used for measuring performance on the task of document classification.

Benchmarking, Classification, +2 more

ShabbyPages: A Reproducible Document Denoising and Binarization Dataset

no code implementations · 16 Mar 2023 · Alexander Groleau, Kok Wei Chee, Stefan Larson, Samay Maini, Jonathan Boarman

Document denoising and binarization are fundamental problems in the document processing space, but current datasets are often too small and lack sufficient complexity to effectively train and benchmark modern data-driven machine learning models.

Benchmarking, Binarization, +1 more

Evaluating Out-of-Distribution Performance on Document Image Classifiers

1 code implementation · 14 Oct 2022 · Stefan Larson, Gordon Lim, Yutong Ai, David Kuang, Kevin Leach

Our new out-of-distribution benchmark consists of two types of documents: those that are not part of any of the 16 in-domain RVL-CDIP categories (RVL-CDIP-O), and those that are one of the 16 in-domain categories yet are drawn from a distribution different from that of the original RVL-CDIP dataset (RVL-CDIP-N).

Document Classification

Augraphy: A Data Augmentation Library for Document Images

2 code implementations · 30 Aug 2022 · Alexander Groleau, Kok Wei Chee, Stefan Larson, Samay Maini, Jonathan Boarman

This paper introduces Augraphy, a Python library for constructing data augmentation pipelines which produce distortions commonly seen in real-world document image datasets.

Data Augmentation, Denoising

A Survey of Intent Classification and Slot-Filling Datasets for Task-Oriented Dialog

no code implementations · 26 Jul 2022 · Stefan Larson, Kevin Leach

By extension, so too has interest in developing and improving intent classification and slot-filling models, which are two components that are commonly used in task-oriented dialog systems.

Classification, intent-classification, +5 more

Redwood: Using Collision Detection to Grow a Large-Scale Intent Classification Dataset

1 code implementation · SIGDIAL (ACL) 2022 · Stefan Larson, Kevin Leach

Similarly, developers of such ML-driven systems need to be able to add new training data to an already-existing dataset to support these new skills.

Classification, intent-classification, +1 more

Exploring Out-of-Distribution Generalization in Text Classifiers Trained on Tobacco-3482 and RVL-CDIP

no code implementations · 5 Aug 2021 · Stefan Larson, Navtej Singh, Saarthak Maheshwari, Shanti Stewart, Uma Krishnaswamy

To be robust enough for widespread adoption, document analysis systems involving machine learning models must be able to respond correctly to inputs that fall outside of the data distribution that was used to generate the data on which the models were trained.

Document Classification, Out-of-Distribution Generalization, +1 more

Inconsistencies in Crowdsourced Slot-Filling Annotations: A Typology and Identification Methods

no code implementations · COLING 2020 · Stefan Larson, Adrian Cheung, Anish Mahendran, Kevin Leach, Jonathan K. Kummerfeld

Using three new noisy crowd-annotated datasets, we show that a wide range of inconsistencies occur and can impact system performance if not addressed.

slot-filling, Slot Filling

Data Query Language and Corpus Tools for Slot-Filling and Intent Classification Data

no code implementations · LREC 2020 · Stefan Larson, Eric Guldan, Kevin Leach

Typical machine learning approaches to developing task-oriented dialog systems require the collection and management of large amounts of training data, especially for the tasks of intent classification and slot-filling.

Classification, General Classification, +6 more

Outlier Detection for Improved Data Quality and Diversity in Dialog Systems

no code implementations · NAACL 2019 · Stefan Larson, Anish Mahendran, Andrew Lee, Jonathan K. Kummerfeld, Parker Hill, Michael A. Laurenzano, Johann Hauswald, Lingjia Tang, Jason Mars

We also present a novel data collection pipeline built atop our detection technique to automatically and iteratively mine unique data samples while discarding erroneous samples.

Diversity, intent-classification, +7 more
