Development for this project has moved to sourcehut here. This repository is now in a read-only state.
Fast tokenizers for Emacs Lisp, backed by Hugging Face's Rust tokenizers library.
To start with, this will support inference from pretrained models, since that's what's needed in onnx.el.
The package includes a dynamic module, which requires cargo to build. The module is compiled automatically when you install the package using something like this:
(use-package tokenizers
  :vc (:fetcher github :repo lepisma/tokenizers.el)
  :demand t)

Note that this method only handles creating .so files, which means it won't work
on non-Linux systems yet. For other systems, manually inspect the Makefile and
compile the module yourself.
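If you do build the module by hand, one option is to load the resulting shared object yourself before requiring the package. A minimal sketch, assuming a hypothetical path and file name (check the Makefile for what it actually produces):

;; The path and file name here are hypothetical; point this at whatever
;; shared object your manual build produces
(module-load (expand-file-name "~/path/to/tokenizers-module.so"))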
(require 'tokenizers)
(setq tk (tokenizers-from-pretrained "sentence-transformers/all-MiniLM-L6-v2"))
;; Returns a list of three vectors, token-ids, type-ids, and attention-mask
;; Last argument tells whether to add special tokens
(tokenizers-encode tk "Test sentence with some words" t)
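;; A sketch of unpacking the result; seq-let comes from the built-in seq
;; library, and the variable names here are just illustrative
(seq-let (token-ids type-ids attention-mask)
    (tokenizers-encode tk "Test sentence with some words" t)
  ;; For example, count the tokens produced for the sentence
  (length token-ids))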
;; Returns a list of three 2d vectors (batch size is the first dimension),
;; token-ids, type-ids, and attention-mask
(tokenizers-encode-batch tk ["This is an example sentence" "Each sentence is converted"] t)
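Since the batch result puts the batch dimension first, each of the three vectors holds one row per input sentence. Here is a minimal sketch of pulling out the first sentence's row; the variable names are illustrative:

(seq-let (token-ids type-ids attention-mask)
    (tokenizers-encode-batch tk ["This is an example sentence" "Each sentence is converted"] t)
  ;; Token ids for the first sentence in the batch
  (aref token-ids 0))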