The files markov.py and tokenizer.py provide a simple framework for training a text-generation Markov chain on any text.
You can also create a custom Tokenizer, or any other custom transformation, and train a Markov chain on it (a sketch of one possible custom tokenizer follows below). The Markov chain is implemented in NumPy and should be reasonably fast.
For the Tokenizer, I re-implemented Karpathy's minBPE.
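Since the Markov chain only appears to need a sequence of integer token IDs plus a vocabulary size, a custom transformation can be as simple as a word-level tokenizer. The sketch below is purely illustrative; the class name and interface assumptions are mine, not part of the module:

```python
class WordTokenizer:
    # Illustrative word-level stand-in for Tokenizer. It exposes the same
    # train / tokenize / detokenize / unique_tokens surface used in the
    # examples below, so it can be swapped into the same Markov training code.

    def __init__(self):
        self.word_to_id = {}
        self.id_to_word = {}
        self.unique_tokens = 0

    def train(self, text):
        # Build the vocabulary and return the training token sequence.
        tokens = []
        for word in text.split():
            if word not in self.word_to_id:
                idx = len(self.word_to_id)
                self.word_to_id[word] = idx
                self.id_to_word[idx] = word
            tokens.append(self.word_to_id[word])
        self.unique_tokens = len(self.word_to_id)
        return tokens

    def tokenize(self, text):
        # Assumes every word was seen during training.
        return [self.word_to_id[word] for word in text.split()]

    def detokenize(self, tokens):
        return " ".join(self.id_to_word[t] for t in tokens)
```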
First, create a tokenizer and train it on your text.
text = "This is a sample text."
tokenizer = Tokenizer()
train_tokens = tokenizer.train(text, merges=5) # Perform 5 merge operations, resulting in 256 + 5 tokens.then tokenize some text to text everything is working.
```python
>>> tokenizer.tokenize("sample text")
[115, 97, 109, 112, 108, 101, 32, 116, 101, 120, 116]
```
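As an optional sanity check, you can verify that detokenizing gives the original string back (this assumes detokenize is the exact inverse of tokenize, as the detokenize call at the end of this section suggests):

```python
# Optional round-trip check; assumes detokenize inverts tokenize.
assert tokenizer.detokenize(tokenizer.tokenize("sample text")) == "sample text"
```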
Then create a Markov chain and train it on your tokens:

```python
# dims controls how much context is used: dims=3 predicts each token from the two previous tokens.
model = Markov(vocab_length=tokenizer.unique_tokens, dims=3)
model.train(train_tokens)
```
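The Markov class itself is not shown in this README, but the NumPy-based idea can be sketched roughly as follows: with dims=3, transition counts fit in a 3-dimensional array indexed by (token at t-2, token at t-1, token at t), and prediction samples the next token from the normalized counts for the current two-token context. Everything below is an illustrative sketch under that assumption, not the module's actual code (and a dense array like this gets memory-hungry for large vocabularies):

```python
import numpy as np

class SketchMarkov:
    # Illustrative order-2 chain (dims=3): counts[a, b, c] records how often
    # token c followed the context (a, b) in the training data.

    def __init__(self, vocab_length, dims=3):
        self.dims = dims
        self.counts = np.zeros((vocab_length,) * dims, dtype=np.int32)

    def train(self, tokens):
        # Slide a window of `dims` tokens over the sequence and count each transition.
        for i in range(len(tokens) - self.dims + 1):
            self.counts[tuple(tokens[i:i + self.dims])] += 1

    def predict(self, seed, n=10):
        tokens = list(seed)  # seed must hold at least dims - 1 tokens
        for _ in range(n):
            context = tuple(tokens[-(self.dims - 1):])
            row = self.counts[context].astype(np.float64)
            if row.sum() == 0:
                break  # unseen context: nothing to sample from
            tokens.append(int(np.random.choice(len(row), p=row / row.sum())))
        return tokens
```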
Finally, generate some new text.

```python
>>> new_tokens = model.predict(train_tokens[0:2], n=10)
>>> tokenizer.detokenize(new_tokens)
"This is a sample te"
```