Jesse Dodge, Maarten Sap, Ana Marasovic, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, Matt Gardner: Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus. EMNLP (1) 2021: 1286-1305