The files markov.py and tokenizer.py provide a simple framework for training a text-generation Markov chain on any text.
You can also create a custom Tokenizer, or any other custom transformation, and train a Markov chain on it (a sketch of one possible custom tokenizer follows below). The Markov chain is implemented in NumPy and should be reasonably fast.
For the Tokenizer, I re-implemented Karpathy's minBPE.
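Since the Markov chain only appears to need a sequence of integer token IDs plus a vocabulary size, a custom transformation can be as simple as a word-level tokenizer. The sketch below is purely illustrative; the class name and interface assumptions are mine, not part of the module:

```python
class WordTokenizer:
    # Illustrative word-level stand-in for Tokenizer. It exposes the same
    # train / tokenize / detokenize / unique_tokens surface used in the
    # examples below, so it can be swapped into the same Markov training code.

    def __init__(self):
        self.word_to_id = {}
        self.id_to_word = {}
        self.unique_tokens = 0

    def train(self, text):
        # Build the vocabulary and return the training token sequence.
        tokens = []
        for word in text.split():
            if word not in self.word_to_id:
                idx = len(self.word_to_id)
                self.word_to_id[word] = idx
                self.id_to_word[idx] = word
            tokens.append(self.word_to_id[word])
        self.unique_tokens = len(self.word_to_id)
        return tokens

    def tokenize(self, text):
        # Assumes every word was seen during training.
        return [self.word_to_id[word] for word in text.split()]

    def detokenize(self, tokens):
        return " ".join(self.id_to_word[t] for t in tokens)
```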
First, create a tokenizer and train it on your text.
text = "This is a sample text."
tokenizer = Tokenizer()
train_tokens = tokenizer.train(text, merges=5) # Perform 5 merge operations, resulting in 256 + 5 tokens.then tokenize some text to text everything is working.
```python
>>> tokenizer.tokenize("sample text")
[115, 97, 109, 112, 108, 101, 32, 116, 101, 120, 116]
```
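As an optional sanity check, you can verify that detokenizing gives the original string back (this assumes detokenize is the exact inverse of tokenize, as the detokenize call at the end of this section suggests):

```python
# Optional round-trip check; assumes detokenize inverts tokenize.
assert tokenizer.detokenize(tokenizer.tokenize("sample text")) == "sample text"
```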
Then create a Markov chain and train it on your tokens:

```python
# dims controls how much context is used: dims=3 predicts each token from the two previous tokens.
model = Markov(vocab_length=tokenizer.unique_tokens, dims=3)
model.train(train_tokens)
```
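The Markov class itself is not shown in this README, but the NumPy-based idea can be sketched roughly as follows: with dims=3, transition counts fit in a 3-dimensional array indexed by (token at t-2, token at t-1, token at t), and prediction samples the next token from the normalized counts for the current two-token context. Everything below is an illustrative sketch under that assumption, not the module's actual code (and a dense array like this gets memory-hungry for large vocabularies):

```python
import numpy as np

class SketchMarkov:
    # Illustrative order-2 chain (dims=3): counts[a, b, c] records how often
    # token c followed the context (a, b) in the training data.

    def __init__(self, vocab_length, dims=3):
        self.dims = dims
        self.counts = np.zeros((vocab_length,) * dims, dtype=np.int32)

    def train(self, tokens):
        # Slide a window of `dims` tokens over the sequence and count each transition.
        for i in range(len(tokens) - self.dims + 1):
            self.counts[tuple(tokens[i:i + self.dims])] += 1

    def predict(self, seed, n=10):
        tokens = list(seed)  # seed must hold at least dims - 1 tokens
        for _ in range(n):
            context = tuple(tokens[-(self.dims - 1):])
            row = self.counts[context].astype(np.float64)
            if row.sum() == 0:
                break  # unseen context: nothing to sample from
            tokens.append(int(np.random.choice(len(row), p=row / row.sum())))
        return tokens
```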
Finally, generate some new text.

```python
>>> new_tokens = model.predict(train_tokens[0:2], n=10)
>>> tokenizer.detokenize(new_tokens)
"This is a sample te"
```