mghako/epstein-justice-files-text

 
 


Jeffrey Epstein Justice Files - Plain Text Corpus

This repository contains plain-text versions of the released court documents from the Jeffrey Epstein/Ghislaine Maxwell cases, originally sourced from the United States Department of Justice.

Purpose: The original documents are provided as PDFs, which are difficult to process programmatically. This repository provides flattened, UTF-8 encoded text files to facilitate:

  • Natural Language Processing (NLP)
  • Retrieval-Augmented Generation (RAG) for LLMs
  • Full-text search and analysis
  • Data mining and research
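As an illustration of the full-text search use case, here is a minimal sketch that scans a local copy of the text files for a term. The function name `search_corpus` and the directory you pass in are assumptions for the example, not part of this repository.

```python
from pathlib import Path


def search_corpus(root, term):
    """Yield (filename, line_number, line) for each case-insensitive match.

    `root` is assumed to be a local folder holding the repository's .txt files.
    """
    needle = term.lower()
    for path in sorted(Path(root).rglob("*.txt")):
        # errors="replace" keeps the scan going if OCR left stray bytes in a file.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line_number, line in enumerate(handle, start=1):
                if needle in line.lower():
                    yield path.name, line_number, line.strip()
```

Calling `list(search_corpus("path/to/clone", "deposition"))` would collect every matching line; for a large corpus, a proper index (SQLite FTS, a search engine, or a vector store) will likely be faster than a linear scan.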

Data Source

All documents were originally downloaded from the official release portals.

  • Original Format: PDF (Scanned/OCR). The original 2024-2025 PDFs (50GB) are not uploaded, but a LINKS page is included if you wish to load the URLs into a downloader.
  • Converted Format: Plain Text (.txt)

Folder Structure & Naming Convention

To make the dataset easier to ingest, the original nested folder structure has been "flattened." The folder path is preserved in the filename using double underscores (__).
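A minimal sketch of how the naming convention can be reversed to recover the original folder path from a flattened filename. The helper names and the example filename `VOL00001__IMAGES__0001.txt` are hypothetical; check the actual filenames in the repository before relying on this.

```python
from pathlib import PurePosixPath

SEPARATOR = "__"  # double underscore, as described in the naming convention


def flatten_path(relative_path):
    """Join the parts of a nested path into a single flat filename."""
    return SEPARATOR.join(PurePosixPath(relative_path).parts)


def unflatten_name(flat_name):
    """Split a flat filename back into its nested path.

    Assumes no original folder or file name contained "__" itself;
    if one did, the split is ambiguous.
    """
    return PurePosixPath(*flat_name.split(SEPARATOR))


# Hypothetical filename, for illustration only.
original = unflatten_name("VOL00001__IMAGES__0001.txt")
print(original.parent, original.name)
```

Keeping the recovered folder path as metadata alongside each text chunk can help when tracing a search or RAG result back to its source document.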

The files have been ingested into a vector database, and a chat agent built on them is available here: https://promex.ai/epstein

Single-page OCR results are added as a zip file in Packages as an asset.

About

The Justice Department 5GB PDF set, converted to clean text, RAG compatible.
