ACL Anthology Reference Corpus (ACL ARC) is a collection of 10,920 academic papers from the ACL Anthology. ACL ARC is cleaned to remove:
- files that do not look like full papers, paper fragments, foreign-language papers (e.g., French), and pure junk.
- headers (title and author information; NOT abstract).
- footers ("References" line and the actual references).
- some spurious (bad) characters.
- some page numbers (i.e., a single number appearing on a line, with nothing else attached to it).
- significant foreign-language (e.g., French) content in an otherwise English paper.
The cleaned corpus has 10,628 documents.
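The per-document steps above can be sketched as follows. This is a minimal illustration and not the script actually used to clean the corpus: the function name `clean_paper`, the regular expressions, and the choice of which characters count as "bad" are all assumptions. Header removal and language filtering are left out because the description does not say how they were done.

```python
import re

# Assumption: a page number is a line holding only digits (plus whitespace).
PAGE_NUMBER = re.compile(r"^\s*\d+\s*$")
# Assumption: the footer starts at a line that reads just "References".
REFERENCES = re.compile(r"^\s*references\s*$", re.IGNORECASE)
# Assumption: "bad characters" are control characters (except tab, newline,
# carriage return) and the Unicode replacement character.
BAD_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f\ufffd]")


def clean_paper(text: str) -> str:
    """Drop the references footer, bad characters, and bare page numbers."""
    kept = []
    for line in text.split("\n"):
        if REFERENCES.match(line):
            break  # discard the "References" line and everything after it
        line = BAD_CHARS.sub("", line)
        if PAGE_NUMBER.match(line):
            continue  # a single number alone on a line
        kept.append(line)
    return "\n".join(kept)


sample = (
    "Abstract\n"
    "We study parsing.\x00\n"
    "12\n"
    "Results in 2010 were good.\n"
    "References\n"
    "[1] Foo."
)
cleaned = clean_paper(sample)
```

Numbers that appear inside a sentence (such as "2010" in the sample) are kept, since only a number standing alone on a line is treated as a page number.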
Source: ACL ARC