Search Results for author: Emily Silcock

Found 6 papers, 2 papers with code

News Deja Vu: Connecting Past and Present with Semantic Search

no code implementations • 21 Jun 2024 • Brevin Franklin, Emily Silcock, Abhishek Arora, Tom Bryan, Melissa Dell

Social scientists and the general public often analyze contemporary events by drawing parallels with the past, a process complicated by the vast, noisy, and unstructured nature of historical texts.

Optical Character Recognition (OCR)

Newswire: A Large-Scale Structured Database of a Century of Historical News

1 code implementation • 13 Jun 2024 • Emily Silcock, Abhishek Arora, Luca D'Amico-Wong, Melissa Dell

A text classifier is used to ensure that we only include newswire articles, which historically are in the public domain.

Entity Disambiguation, Language Modelling, +1

American Stories: A Large-Scale Structured Text Dataset of Historical U.S. Newspapers

no code implementations • NeurIPS 2023 • Melissa Dell, Jacob Carlson, Tom Bryan, Emily Silcock, Abhishek Arora, Zejiang Shen, Luca D'Amico-Wong, Quan Le, Pablo Querubin, Leander Heldring

The resulting American Stories dataset provides high quality data that could be used for pre-training a large language model to achieve better understanding of historical English and historical world knowledge.

Language Modelling, Large Language Model, +3

Noise-Robust De-Duplication at Scale

1 code implementation • 9 Oct 2022 • Emily Silcock, Luca D'Amico-Wong, Jinglin Yang, Melissa Dell

Identifying near duplicates within large, noisy text corpora has a myriad of applications that range from de-duplicating training datasets, reducing privacy risk, and evaluating test set leakage, to identifying reproduced news articles and literature within large corpora.
