# Markov Chain text generation

The files `markov.py` and `tokenizer.py` provide a simple framework for training a text-generating Markov chain on any text.

You can also plug in a custom Tokenizer, or any other custom transformation, and train a Markov chain on its output. The Markov chain is implemented in numpy and should be reasonably fast.
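
To illustrate what such a custom Tokenizer needs to provide, here is a minimal character-level sketch. The `train`/`tokenize`/`detokenize` methods and the `unique_tokens` attribute mirror the usage examples below; treat this interface as an assumption, not the repo's actual code.

```python
# A minimal character-level tokenizer, assuming the Markov chain only
# needs integer token ids. The train/tokenize/detokenize interface
# mirrors the usage examples below; adapt it if your version differs.
class CharTokenizer:
    def train(self, text, merges=0):
        # merges is ignored here; a character vocabulary needs no merging.
        self.vocab = sorted(set(text))
        self.char_to_id = {c: i for i, c in enumerate(self.vocab)}
        self.unique_tokens = len(self.vocab)
        return self.tokenize(text)

    def tokenize(self, text):
        return [self.char_to_id[c] for c in text]

    def detokenize(self, tokens):
        return "".join(self.vocab[t] for t in tokens)
```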

For the Tokenizer, I re-implemented Karpathy's minBPE.
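
For context, the heart of BPE training is a loop that repeatedly merges the most frequent adjacent token pair into a new token id. The sketch below illustrates that idea and is not the repo's actual implementation (see `tokenizer.py` for that):

```python
# Illustrative BPE training loop, in the spirit of minBPE.
from collections import Counter

def bpe_train(text, merges):
    tokens = list(text.encode("utf-8"))  # start from raw bytes (ids 0..255)
    merge_rules = {}
    for i in range(merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        pair = pairs.most_common(1)[0][0]   # most frequent adjacent pair
        new_id = 256 + i                    # each merge adds one token id
        merge_rules[pair] = new_id
        # Replace every occurrence of the pair with the new token id.
        merged, j = [], 0
        while j < len(tokens):
            if j < len(tokens) - 1 and (tokens[j], tokens[j + 1]) == pair:
                merged.append(new_id)
                j += 2
            else:
                merged.append(tokens[j])
                j += 1
        tokens = merged
    return tokens, merge_rules
```

Each merge grows the vocabulary by one token, which is why training with `merges=5` in the usage below yields 256 + 5 tokens.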

## Usage

First, create a tokenizer and train it on your text.

text = "This is a sample text."

tokenizer = Tokenizer()
train_tokens = tokenizer.train(text, merges=5) # Perform 5 merge operations, resulting in 256 + 5 tokens.

Then tokenize some text to test that everything is working:

```python
>>> tokenizer.tokenize("sample text")
[115, 97, 109, 112, 108, 101, 32, 116, 101, 120, 116]
```

Then create a Markov chain and train it on your tokens:

```python
from markov import Markov

# dims controls the size of the context window.
model = Markov(vocab_length=tokenizer.unique_tokens, dims=3)  # Predict each token from the two previous tokens.
model.train(train_tokens)
```

Finally, generate some new text.

```python
>>> new_tokens = model.predict(train_tokens[0:2], n=10)
>>> tokenizer.detokenize(new_tokens)
"This is a sample te"
```
