tiny-moliere

language: fr
license: mit
size_categories: 1M<n<10M
task_categories: text-generation
pretty_name: Tiny Molière

dataset_info
features: text (dtype: string)
splits: train (num_bytes: 2387600, num_examples: 1)
download_size / dataset_size: 2387600

configs
config_name: default
data_files: split train, path data/tinymoliere.txt

tiny-moliere

A dataset repository that generates tinymoliere.txt, a single text file containing Molière's complete works.

Inspired by tinyshakespeare by Andrej Karpathy, this project provides a consolidated small text corpus ideal for training and learning with small transformer models.

What it does

Downloads Molière's complete works from public PDFs, strips page headers, footers, and the table of contents, then writes a single clean text file suitable for machine learning tasks.
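The header/footer cleanup described above could be sketched roughly as follows. This is an illustration only, not the code in main.py: the function name `strip_repeated_margins`, the `min_share` parameter, and the assumption that each PDF page has already been extracted to a string are all hypothetical.

```python
from collections import Counter


def strip_repeated_margins(pages, min_share=0.5):
    """Drop first/last lines that recur on at least `min_share` of the pages.

    `pages` is a list of strings, one per PDF page, already extracted to text
    (the extraction step itself is outside this sketch).
    """
    split_pages = [p.splitlines() for p in pages]

    # Count how often each first/last line appears across pages.
    margins = Counter()
    for lines in split_pages:
        if lines:
            margins.update({lines[0].strip(), lines[-1].strip()})

    # A margin line counts as a header/footer if it repeats often enough.
    threshold = max(2, int(len(pages) * min_share))
    repeated = {line for line, n in margins.items() if line and n >= threshold}

    cleaned = []
    for lines in split_pages:
        kept = [
            line
            for i, line in enumerate(lines)
            # Only the first and last line of a page are removal candidates:
            # either a repeated header/footer or a bare page number.
            if not (
                i in (0, len(lines) - 1)
                and (line.strip() in repeated or line.strip().isdigit())
            )
        ]
        cleaned.append("\n".join(kept))
    return "\n".join(cleaned)
```

The heuristic only inspects the first and last line of each page, so dialogue in the body of a page is never touched. Removing the table of contents would need a separate rule and is not covered here.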

Usage

uv sync
uv run python main.py

This will:

Download the source PDFs to the data/ directory
Process and clean the text
Generate data/tinymoliere.txt
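Once data/tinymoliere.txt exists, it can be consumed the same way tinyshakespeare usually is, with a character-level vocabulary. The sketch below is not part of this repository; the helper names (`load_corpus`, `build_vocab`, `encode`, `decode`, `train_val_split`) are hypothetical, and UTF-8 encoding of the output file is an assumption.

```python
from pathlib import Path


def load_corpus(path="data/tinymoliere.txt"):
    """Read the generated corpus as one string (UTF-8 is assumed)."""
    return Path(path).read_text(encoding="utf-8")


def build_vocab(text):
    """Map each distinct character to an integer id, sorted for reproducibility."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos


def encode(text, stoi):
    """Turn a string into a list of character ids."""
    return [stoi[ch] for ch in text]


def decode(ids, itos):
    """Turn a list of character ids back into a string."""
    return "".join(itos[i] for i in ids)


def train_val_split(ids, val_fraction=0.1):
    """Hold out the tail of the sequence for validation."""
    cut = int(len(ids) * (1 - val_fraction))
    return ids[:cut], ids[cut:]
```

With the file generated, `stoi, itos = build_vocab(load_corpus())` gives the vocabulary, and `train_val_split(encode(text, stoi))` yields two id sequences ready for a small language model.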
