tokenizers.el

Development for this project has moved to sourcehut here. This repository is now in a read-only state.

Fast tokenizers for Emacs Lisp, backed by Hugging Face's Rust tokenizers library.

To start with, this supports only inference from pretrained models, since that's what's needed in onnx.el.

Installation

The package relies on a dynamic module, which requires cargo to be installed. The module is compiled automatically when you install the package with something like this:

(use-package tokenizers
  :vc (:fetcher github :repo lepisma/tokenizers.el)
  :demand t)

Note that this method only handles building .so files, which means it won't work on non-Linux systems yet. On other systems, inspect the Makefile and compile the module manually.
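
If your use-package doesn't support the :vc keyword, a minimal alternative sketch using the built-in package-vc-install from Emacs 29+ (this only fetches the package; compiling the dynamic module still requires cargo and the Makefile):

;; Fetch the package straight from the repository (Emacs 29+).
;; The dynamic module must still be compiled separately with cargo.
(package-vc-install "https://github.com/lepisma/tokenizers.el")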

Usage

(require 'tokenizers)
(setq tk (tokenizers-from-pretrained "sentence-transformers/all-MiniLM-L6-v2"))

;; Returns a list of three vectors: token-ids, type-ids, and attention-mask.
;; The last argument tells whether to add special tokens.
(tokenizers-encode tk "Test sentence with some words" t)

;; Returns a list of three 2D vectors (batch size is the first dimension):
;; token-ids, type-ids, and attention-mask.
(tokenizers-encode-batch tk ["This is an example sentence" "Each sentence is converted"] t)
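
To work with the result, here is a minimal sketch that destructures the three returned vectors with seq-let from the built-in seq library (the variable names are illustrative, chosen to match the comments above):

(require 'seq)

;; Pull apart the list returned by tokenizers-encode.
;; type-ids is unused here, so it is prefixed with an underscore.
(seq-let (token-ids _type-ids attention-mask)
    (tokenizers-encode tk "Test sentence with some words" t)
  ;; Each element is a vector of integers of the same length.
  (message "token ids: %S" token-ids)
  (message "attention mask: %S" attention-mask))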
