zlug

Fast, safe, multi-byte aware slug generation for Zig.

Features

  • Fast: single-pass UTF-8 decode, transliterate, normalize, and dash-collapse — no intermediate allocations
  • Safe: forgiving UTF-8 decoder, no pointer arithmetic, tables are immutable constants
  • Multi-byte aware: transliterates any BMP codepoint to ASCII via an embedded ~430KB unidecode table
  • 21 languages: language-specific substitutions for bg, cs, de, en, es, fi, fr, gr, hu, id, it, kk, nb, nl, nn, pl, pt, ro, sl, sv, tr
  • Optional Japanese: opt-in dictionary-based slugification with Hepburn romanization (世界 → sekai, プログラミング → puroguramingu)
  • Zero runtime init: tables are embedded via @embedFile into .rodata
  • Flexible API: both a stack-buffer variant (slugify) and an allocating variant (slugifyAlloc)

Inspired by gosimple/slug and gosimple/unidecode.

Two flavors

zlug ships two modules from the same codebase:

| Module | Embedded data | Japanese |
|---|---|---|
| zlug | ~430 KB | ❌ falls back to unidecode (世界 → shi-jie) |
| zlug_ja | ~7.7 MB | ✅ Sudachi-based dictionary + Hepburn romanization (世界 → sekai) |

Same public API — pick the one that matches your size budget.

Installation

Requires Zig 0.15.2 or later.

Add zlug to your project:

zig fetch --save git+https://github.com/linyows/zlug#v0.1.0

This updates your build.zig.zon:

.dependencies = .{
    .zlug = .{
        .url = "git+https://github.com/linyows/zlug#v0.1.0",
        .hash = "...",
    },
},

Then in build.zig:

const zlug_dep = b.dependency("zlug", .{
    .target = target,
    .optimize = optimize,
});

// Lean variant (no Japanese dictionary):
exe.root_module.addImport("zlug", zlug_dep.module("zlug"));

// Or full variant with Japanese dictionary (~7.7 MB):
// exe.root_module.addImport("zlug", zlug_dep.module("zlug_ja"));

Usage

const std = @import("std");
const zlug = @import("zlug");

pub fn main() !void {
    var gpa: std.heap.GeneralPurposeAllocator(.{}) = .{};
    defer _ = gpa.deinit();
    const alloc = gpa.allocator();

    // Allocating API
    const slug = try zlug.slugifyAlloc(alloc, "Hello, 世界!", .{});
    defer alloc.free(slug);
    std.debug.print("{s}\n", .{slug}); // => "hello-shi-jie"

    // Stack-buffer API (no allocator)
    var buf: [256]u8 = undefined;
    const s = try zlug.slugify(&buf, "Héllo Wörld", .{ .lang = .de });
    std.debug.print("{s}\n", .{s}); // => "hello-woerld"
}

Options

pub const Options = struct {
    lang: Lang = .en,
    lowercase: bool = true,
    max_length: usize = 0,       // 0 disables truncation
    smart_truncate: bool = true, // cut at last '-' within max_length
    keep_multiple_dashes: bool = false,
    keep_edge_dashes: bool = false,
};
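
The smart_truncate comment ("cut at last '-' within max_length") can be illustrated with a small standalone sketch. The helper name smartTruncate and its exact edge-case behavior are assumptions made for illustration; zlug's internal implementation may differ, for example in how it treats a cut that lands exactly on a dash.

```zig
const std = @import("std");

/// Hypothetical helper, not part of zlug's API: shortens an existing slug
/// to at most `max_length` bytes, preferring to cut at a '-' boundary.
fn smartTruncate(slug: []const u8, max_length: usize) []const u8 {
    // 0 disables truncation, mirroring the Options comment.
    if (max_length == 0 or slug.len <= max_length) return slug;
    // The cut already falls on a word boundary: nothing to back off from.
    if (slug[max_length] == '-') return slug[0..max_length];
    const head = slug[0..max_length];
    // Otherwise back off to the last dash inside the allowed window.
    if (std.mem.lastIndexOfScalar(u8, head, '-')) |dash| return head[0..dash];
    // A single word longer than max_length: fall back to a hard cut.
    return head;
}

pub fn main() void {
    std.debug.print("{s}\n", .{smartTruncate("hello-wonderful-world", 12)}); // prints "hello"
}
```

Backing off to the previous dash avoids slugs that end in a half word such as "hello-wonder".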

Examples

| Input | Lang | Output |
|---|---|---|
| "Hello, world!" | en | hello-world |
| "café au lait" | en | cafe-au-lait |
| "rock & roll" | en | rock-and-roll |
| "über große Größen" | de | ueber-grosse-groessen |
| "Здравей Свят" | bg | zdravey-svyat |
| "世界" | en | shi-jie |
| "it’s mine" | en | its-mine |
| "a—b" (em dash) | en | a-b |

API

  • slugify(buf: []u8, input: []const u8, opts: Options) ![]u8 — writes into caller's buffer
  • slugifyAlloc(alloc: Allocator, input: []const u8, opts: Options) ![]u8 — caller owns returned slice
  • isSlug(text: []const u8) bool — validate an existing slug
  • parseLang(tag: []const u8) Lang — parse a BCP-47-ish language tag

How it works

zlug performs slug generation in a single pass over the input:

  1. Decode UTF-8 to a codepoint (forgiving — invalid sequences become U+FFFD)
  2. Apply per-language substitution (e.g. ä → ae in German)
  3. Apply shared default substitutions (smart quotes, en/em dashes)
  4. Look up unidecode transliteration for non-ASCII BMP codepoints
  5. Per ASCII byte: lowercase, authorized-char check, consecutive-dash collapse
  6. Write directly into the output buffer
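
Steps 5 and 6 can be sketched on their own for input that is already ASCII. This is a minimal illustration, not zlug's source: the helper name asciiSlug, the choice of letters and digits as the authorized set, and the NoSpaceLeft error name are all assumptions. It skips decoding, substitutions, and transliteration (steps 1 to 4), so an ASCII apostrophe here becomes a dash rather than being dropped.

```zig
const std = @import("std");

/// Hypothetical helper, not zlug's actual code: lowercases each byte, keeps
/// letters and digits, turns every run of other bytes into a single '-',
/// and writes straight into the caller's buffer.
fn asciiSlug(buf: []u8, input: []const u8) error{NoSpaceLeft}![]u8 {
    var len: usize = 0;
    for (input) |byte| {
        const c = std.ascii.toLower(byte);
        if (std.ascii.isAlphanumeric(c)) {
            if (len == buf.len) return error.NoSpaceLeft;
            buf[len] = c;
            len += 1;
        } else if (len > 0 and buf[len - 1] != '-') {
            // First separator after a kept character; leading separators
            // and repeated ones are skipped, which collapses dash runs.
            if (len == buf.len) return error.NoSpaceLeft;
            buf[len] = '-';
            len += 1;
        }
    }
    // Drop the trailing dash left behind by trailing punctuation or spaces.
    if (len > 0 and buf[len - 1] == '-') len -= 1;
    return buf[0..len];
}

pub fn main() !void {
    var buf: [64]u8 = undefined;
    const slug = try asciiSlug(&buf, "Hello,  World!");
    std.debug.print("{s}\n", .{slug}); // prints "hello-world"
}
```

Checking only the last written byte is enough to collapse dashes, so the pass needs no lookahead and no intermediate buffer.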

The unidecode table is stored as two embedded binary blobs:

  • src/bmp_index.bin (262KB) — [0x10001]u32 cumulative byte offsets
  • src/bmp_data.bin (169KB) — concatenated ASCII transliterations

Lookup is data[index[cp]..index[cp+1]], two std.mem.readInt calls and a slice. The tables are embedded via @embedFile and live in .rodata — there is zero runtime initialization.
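
The lookup can be sketched with toy tables in place of the embedded blobs. The helper name lookup, the three-entry toy tables, and the little-endian byte order are assumptions for illustration; the source does not state the byte order of bmp_index.bin.

```zig
const std = @import("std");

// Toy stand-ins for the embedded blobs, covering codepoints 0..2 only:
// cp 0 -> "", cp 1 -> "a", cp 2 -> "ss". Offsets are assumed little-endian u32.
const toy_index = [_]u8{
    0, 0, 0, 0, // start offset of cp 0
    0, 0, 0, 0, // start offset of cp 1
    1, 0, 0, 0, // start offset of cp 2
    3, 0, 0, 0, // end sentinel (total data length)
};
const toy_data = "ass";

/// Hypothetical helper: returns the transliteration bytes for `cp` by reading
/// two adjacent cumulative offsets and slicing the data blob between them.
fn lookup(index: []const u8, data: []const u8, cp: u21) []const u8 {
    const i: usize = @as(usize, cp) * 4;
    const start = std.mem.readInt(u32, index[i..][0..4], .little);
    const end = std.mem.readInt(u32, index[i + 4 ..][0..4], .little);
    return data[start..end];
}

pub fn main() void {
    std.debug.print("{s}\n", .{lookup(&toy_index, toy_data, 2)}); // prints "ss"
}
```

Storing one extra offset as a sentinel is what lets every entry, including the last, use the same index[cp]..index[cp+1] rule; it also matches the 0x10001 entries (0x10000 BMP codepoints plus one) listed for bmp_index.bin.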

Regenerate the tables from gosimple/unidecode's table.txt with:

zig run tools/gen_table.zig -- /path/to/table.txt src/

Development

# Run tests
zig build test

# Build the static library
zig build --release=fast

# Check formatting
zig fmt --check src/ tools/

Releasing

Versions are managed by git tags. The build.zig.zon .version field stays at 0.0.0-dev in the tree and is rewritten by the release workflow to match the tag.

git tag v0.1.0
git push origin v0.1.0

The workflow at .github/workflows/release.yml will run tests, build in release mode, and create a GitHub Release with a source tarball and SHA-256 checksum.

License

MIT — see LICENSE.

Unidecode table data is derived from gosimple/unidecode, licensed under the Apache License 2.0.
