---
language:
license: mit
size_categories:
task_categories:
pretty_name: Tiny Molière
dataset_info:
  features:
  splits:
  - name: train
    num_bytes: 2387600
    num_examples: 1
  download_size: 2387600
  dataset_size: 2387600
configs:
- config_name: default
  data_files:
  - split: train
    path: data/tinymoliere.txt
tags:
- literature
- french-literature
- moliere
- classical-text
- character-level
---
A dataset repository that generates tinymoliere.txt, a single text file containing Molière's complete works.
Inspired by tinyshakespeare by Andrej Karpathy, this project provides a consolidated small text corpus ideal for training and learning with small transformer models.
The project downloads Molière's complete works from public PDFs, removes headers, footers, and tables of contents, then writes a single clean text file suitable for machine learning tasks.
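As a rough illustration of the header/footer cleanup described above, here is a minimal sketch. It assumes the PDF text has already been extracted into one string per page; `clean_pages`, `min_fraction`, and the heuristic itself are hypothetical and may differ from what main.py actually does.

```python
import re
from collections import Counter

# A line made only of digits is treated as a page number (assumption).
PAGE_NUMBER = re.compile(r"^\s*\d+\s*$")


def clean_pages(pages: list[str], min_fraction: float = 0.5) -> str:
    """Drop running headers/footers and bare page numbers from per-page text."""
    # Strip each line and discard empty ones.
    split_pages = [
        [line.strip() for line in page.splitlines() if line.strip()]
        for page in pages
    ]

    # Count how often each line appears as the first or last line of a page.
    edge_counts: Counter[str] = Counter()
    for lines in split_pages:
        if lines:
            edge_counts.update({lines[0], lines[-1]})

    # A line seen at the edge of enough pages is assumed to be a header/footer.
    threshold = max(2, int(len(pages) * min_fraction))
    repeated = {line for line, n in edge_counts.items() if n >= threshold}

    kept: list[str] = []
    for lines in split_pages:
        # Note: a repeated line is removed wherever it occurs on the page.
        kept.extend(
            line
            for line in lines
            if line not in repeated and not PAGE_NUMBER.match(line)
        )
    return "\n".join(kept) + "\n"
```

This heuristic is deliberately simple: it does not detect tables of contents, and it would also drop a legitimate line that happens to open or close many pages.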

    uv sync
    uv run python main.py

This will:
- Download the source PDFs to the data/ directory
- Process and clean the text
- Generate data/tinymoliere.txt
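
Once data/tinymoliere.txt exists, it can be used for character-level modelling in the same spirit as tinyshakespeare. The sketch below is not part of this repository: the helper names are hypothetical, and it assumes the file is UTF-8 encoded.

```python
from pathlib import Path


def load_corpus(path: str = "data/tinymoliere.txt") -> str:
    """Read the generated corpus (assumes UTF-8 encoding)."""
    return Path(path).read_text(encoding="utf-8")


def build_vocab(text: str) -> tuple[dict[str, int], dict[int, str]]:
    """Map each distinct character to an integer id and back."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos


def encode(text: str, stoi: dict[str, int]) -> list[int]:
    """Turn a string into a list of character ids."""
    return [stoi[ch] for ch in text]


def decode(ids: list[int], itos: dict[int, str]) -> str:
    """Turn a list of character ids back into a string."""
    return "".join(itos[i] for i in ids)


def train_val_split(ids: list[int], train_fraction: float = 0.9) -> tuple[list[int], list[int]]:
    """Split the encoded corpus into contiguous train and validation parts."""
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]
```

A typical use would be to call `load_corpus()`, build the vocabulary from the full text, encode it, and feed the resulting ids to a small transformer. The 90/10 split is only a common default, not something this dataset prescribes.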