Development for this project has moved to sourcehut here. This repository is now in a read-only state.
Fast tokenizers for Emacs Lisp, backed by Hugging Face's Rust tokenizers library.
To start with, this will support inference from pretrained models, since that's what's needed in onnx.el.
The package includes a dynamic module, which requires cargo to build. The module is compiled automatically when you install the package using something like this:
(use-package tokenizers
  :vc (:fetcher github :repo lepisma/tokenizers.el)
  :demand t)

Note that this method only handles creating .so files, which means it won't work
on non-Linux systems yet. For other systems, manually inspect the Makefile and
compile the module yourself.
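If you do build the module by hand, one option is to load the resulting shared object yourself before requiring the package. A minimal sketch, assuming a hypothetical path and file name (check the Makefile for what it actually produces):

;; The path and file name here are hypothetical; point this at whatever
;; shared object your manual build produces
(module-load (expand-file-name "~/path/to/tokenizers-module.so"))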
(require 'tokenizers)
(setq tk (tokenizers-from-pretrained "sentence-transformers/all-MiniLM-L6-v2"))
;; Returns a list of three vectors, token-ids, type-ids, and attention-mask
;; Last argument tells whether to add special tokens
(tokenizers-encode tk "Test sentence with some words" t)
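;; A sketch of unpacking the result; seq-let comes from the built-in seq
;; library, and the variable names here are just illustrative
(seq-let (token-ids type-ids attention-mask)
    (tokenizers-encode tk "Test sentence with some words" t)
  ;; For example, count the tokens produced for the sentence
  (length token-ids))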
;; Returns a list of three 2d vectors (batch size is the first dimension),
;; token-ids, type-ids, and attention-mask
(tokenizers-encode-batch tk ["This is an example sentence" "Each sentence is converted"] t)
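Since the batch result puts the batch dimension first, each of the three vectors holds one row per input sentence. Here is a minimal sketch of pulling out the first sentence's row; the variable names are illustrative:

(seq-let (token-ids type-ids attention-mask)
    (tokenizers-encode-batch tk ["This is an example sentence" "Each sentence is converted"] t)
  ;; Token ids for the first sentence in the batch
  (aref token-ids 0))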