grll/tiny-moliere

language:
- fr
license: mit
size_categories:
- 1M<n<10M
task_categories:
- text-generation
pretty_name: Tiny Molière
dataset_info:
  features:
  - name: text
    dtype: string
  splits:
  - name: train
    num_bytes: 2387600
    num_examples: 1
  download_size: 2387600
  dataset_size: 2387600
configs:
- config_name: default
  data_files:
  - split: train
    path: data/tinymoliere.txt
tags:
- literature
- french-literature
- moliere
- classical-text
- character-level

tiny-moliere

A dataset repository that generates tinymoliere.txt, a single text file containing Molière's complete works.

Inspired by tinyshakespeare by Andrej Karpathy, this project provides a consolidated small text corpus ideal for training and learning with small transformer models.
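As a rough illustration of how a corpus like this is typically consumed for character-level training, here is a minimal sketch. The helper names (`build_vocab`, `encode`, `decode`, `train_val_split`, `load_corpus`) are hypothetical and not part of this repository; the only detail taken from the project is the output path data/tinymoliere.txt.

```python
from pathlib import Path


def load_corpus(path="data/tinymoliere.txt"):
    """Read the generated corpus as one UTF-8 string (path as listed in this repo)."""
    return Path(path).read_text(encoding="utf-8")


def build_vocab(text):
    """Map every distinct character to an integer id, and back."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos


def encode(text, stoi):
    """Turn a string into a list of character ids."""
    return [stoi[ch] for ch in text]


def decode(ids, itos):
    """Turn a list of character ids back into a string."""
    return "".join(itos[i] for i in ids)


def train_val_split(ids, val_fraction=0.1):
    """Hold out the last `val_fraction` of the sequence for validation."""
    n_val = int(len(ids) * val_fraction)
    cut = len(ids) - n_val
    return ids[:cut], ids[cut:]
```

Once the file has been generated, something like `stoi, itos = build_vocab(load_corpus())` would give the vocabulary for a small character-level model. The 10% validation split is an arbitrary default chosen for the sketch.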

What it does

Downloads Molière's complete works from public PDFs, strips page headers, footers, and tables of contents, then writes a single clean text file suitable for machine learning tasks.
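One common way to strip headers and footers from extracted PDF text is to drop lines that recur on many pages, along with lines that contain only a page number. The sketch below shows that heuristic on a list of per-page strings. It is an assumption about how such cleaning could be done, not a description of what main.py actually does; `find_repeated_lines`, `clean_pages`, and the `min_fraction` parameter are hypothetical.

```python
import re
from collections import Counter

# A line consisting only of digits is treated as a page number.
PAGE_NUMBER = re.compile(r"^\s*\d+\s*$")


def find_repeated_lines(pages, min_fraction=0.5):
    """Return lines that occur on at least `min_fraction` of the pages."""
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        counts.update({ln.strip() for ln in page.splitlines() if ln.strip()})
    threshold = max(2, int(len(pages) * min_fraction))
    return {line for line, n in counts.items() if n >= threshold}


def clean_pages(pages, min_fraction=0.5):
    """Drop repeated header/footer lines and bare page numbers, then join the pages."""
    repeated = find_repeated_lines(pages, min_fraction)
    kept = []
    for page in pages:
        for line in page.splitlines():
            if line.strip() in repeated or PAGE_NUMBER.match(line):
                continue
            kept.append(line.rstrip())
    return "\n".join(kept)
```

A frequency heuristic like this can also catch short lines that legitimately recur, such as speaker names in a play, so the threshold would need tuning against the real PDFs. Removing a table of contents is not covered here.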

Usage

uv sync
uv run python main.py

This will:

  1. Download the source PDFs to the data/ directory
  2. Process and clean the text
  3. Generate data/tinymoliere.txt
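After the steps above have run, a quick sanity check on the generated file can be useful. The sketch below reports the size and character inventory of the corpus; the byte count can then be compared with the num_bytes value in the dataset metadata (2387600). The functions `corpus_stats` and `check_corpus` are hypothetical helpers, not part of this repository.

```python
from collections import Counter
from pathlib import Path


def corpus_stats(text):
    """Summarize a corpus: UTF-8 size, character count, and character vocabulary."""
    counts = Counter(text)
    return {
        "num_bytes": len(text.encode("utf-8")),
        "num_chars": len(text),
        "vocab_size": len(counts),
        "most_common": counts.most_common(5),
    }


def check_corpus(path="data/tinymoliere.txt"):
    """Return stats for the generated file, or None if it has not been generated yet."""
    p = Path(path)
    if not p.exists():
        return None
    return corpus_stats(p.read_text(encoding="utf-8"))
```

Because the text contains accented French characters, the byte count will be larger than the character count.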
