This repository contains plain-text versions of the released court documents regarding the Jeffrey Epstein/Ghislaine Maxwell cases, originally sourced from the United States Department of Justice.
Purpose: The original documents are provided as PDFs, which are difficult to process programmatically. This repository provides flattened, UTF-8 encoded text files to facilitate:
- Natural Language Processing (NLP)
- Retrieval-Augmented Generation (RAG) for LLMs
- Full-text search and analysis
- Data mining and research
All documents were originally downloaded from the official release portals.
- Original Format: PDF (Scanned/OCR). The original PDFs of the 2024-2025 files are not uploaded here because of their size (about 50 GB), but a LINKS page is included if you wish to load the URLs into a downloader.
- Converted Format: Plain Text (.txt)
To make the dataset easier to ingest, the original nested folder structure has been "flattened." The folder path is preserved in the filename using double underscores (__).
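The naming convention above can be reversed to recover a document's original folder path. The following is a minimal sketch, not the script used to build this repository: the function names and the example filename are hypothetical, and it assumes that no original folder or file name itself contains a double underscore.

```python
from pathlib import PurePosixPath

# Separator used in the flattened filenames, as described above.
SEPARATOR = "__"


def original_path(flat_name: str) -> PurePosixPath:
    """Rebuild the nested path from a flattened filename.

    Assumption: "__" never occurs inside an original folder or file name.
    """
    parts = flat_name.split(SEPARATOR)
    return PurePosixPath(*parts)


def flatten_path(path: str) -> str:
    """Join the components of a nested path with the separator."""
    return SEPARATOR.join(PurePosixPath(path).parts)


if __name__ == "__main__":
    # Hypothetical filename, for illustration only.
    print(original_path("FolderA__FolderB__document.txt"))
```

Keeping the folder path inside the filename lets an ingestion pipeline group or filter documents by their original location without needing the nested directories.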
The files have also been ingested into a vector database, and a chat agent built on them is available here: https://promex.ai/epstein
Single-page OCR results are provided as a zip file, attached as an asset under Packages.