Tags: ausimian/emily
- The README performance section now compares Emily against both benchmark baselines, EXLA (host CPU) and EMLX (the older MLX-backed Nx backend on the Metal GPU), instead of EXLA alone, and its rule-of-thumb figures (ViT-base, DistilBERT) are reconciled with the current benchmark report.
- The benchmark report's environment block now records the Emily version the numbers were produced on (0.7.0) and drops a misleading run timestamp.
- The `MAINTAINING.md` release runbook is corrected: `mix publisho` is no longer described as pushing (it only commits and tags), and the obsolete manual draft-promotion step is dropped, since `release-nif.yml` now publishes the release automatically once the NIFs are built.
- **Native Expr compiler — on by default under
`compiler: Emily.Compiler`.** Lowers a traced `Nx.Defn.Expr` to a
flat IR once and replays the whole forward graph in a **single NIF
call per invocation**, collapsing the per-op BEAM↔worker round-trips
a step-evaluated decode loop would otherwise pay. Weights cross the
NIF boundary once (captured by the compiled program) and are never
re-serialised per call. It is the default, so a bare
`compiler: Emily.Compiler` compiles native:

      Nx.Defn.jit(&forward/1, compiler: Emily.Compiler).(input)

Coverage is the full Nx primitive set (with `Emily.Backend`'s
dtype-coercion and op-composition semantics ported into the
lowering), the fused `Emily.Fast.*` kernels (RMSNorm, LayerNorm,
RoPE, scaled dot-product attention and its mask / sink / mask+sink
variants), `Nx.Block.*` including the full `LinAlg` family
(`cholesky` / `solve` / `qr` / `eigh` / `lu` / `svd` /
`determinant`), `Nx.Random`, and the control flow `cond` /
`defn while` (with the host loop driven entirely from the worker
thread). Anything the IR can't lower yet routes through
`Nx.Defn.Evaluator` under the default `native_fallback: :eval` (with
a one-shot `[:emily, :compiler, :fallback]` telemetry event), so the
native lane is safe as the default on any model. The default is read
from `config :emily, :native` (defaulting to `true`), so
`config :emily, native: false` opts every defn out of the native lane
application-wide — e.g. on a memory-constrained host where the
one-shot compile peak is too large; a per-call `native:` option
always wins over the app-env default.
`native_fallback: :raise` fails instead — the conformance suites use
this to prove a model lowers fully native.
End-to-end: DistilBERT (question answering with `Nx.Serving`), ViT,
Whisper (`speech_to_text` end-to-end including the featurizer STFT,
encoder/decoder, and autoregressive decode loop), and Bumblebee
`Text.generation` (greedy *and* multinomial sampling) all compile
fully native under `native_fallback: :raise`. Bumblebee generation
on Qwen3-0.6B measures **~5× the evaluator's decode throughput**
(~61 vs ~12 tok/s on an M-series Mac), with byte-identical
completions. Native training drives Axon end-to-end — a LeNet CNN
and a dense MLP train on real MNIST entirely through the single-NIF
path (forward, categorical-cross-entropy, backward, Adam) to the
same >97% / >96% accuracy as the evaluator.
- **`Emily.Compiler` — `:fuse` opt-in.** Adds `mx::compile` fusion on
top of the replay, fusing elementwise runs (RMSNorm, softmax, SiLU
gating, residual adds) the plain replay leaves as separate kernels.
For a `defn while`, the loop body is fused under `mx::compile` and
cached per stream so it cache-hits across iterations rather than
recompiling per step. Enable on top of the native generation path:

      Nx.Defn.jit(&forward/1,
        compiler: Emily.Compiler, native: true, fuse: true)

On Qwen3-0.6B this lifts greedy decode to **~5.4× the evaluator
(~1.1× over the plain native lane)**: ~68 tok/s fused vs ~62 tok/s
plain native. In isolation on a decode-shaped transformer block, fusion measures
~1.5–1.6× over the plain replay. Trade-off: `mx::compile`
reassociates f32 to within a few ULP, so output is **not**
bit-identical to the evaluator. Greedy argmax is robust to that
empirically (Qwen3-0.6B token ids matched the evaluator exactly in
our run), but the match is empirical, not guaranteed — a near-tie
top-2 logit can flip a token. **Sampling strategies will diverge
from the evaluator under fusion** even with a fixed seed.
- **`Emily.Generation` — a model-agnostic decode-loop driver.**
JIT-compiles a caller-supplied shape-stable per-token forward
(`fn token, offset, cache, params -> {logits, cache} end`) with the
native single-NIF compiler and drives the autoregressive loop from
Elixir — offset bookkeeping, KV-cache threading, stop conditions,
next-token selection (greedy by default), and per-token streaming
via `:on_token`. The forward runs fully native; the loop stays in
Elixir, so token streaming and host-side control are preserved.
Emily supplies only the mechanism — the model (forward + cache) is
the caller's.
- `Emily.async_eval/1` (and `Emily.Native.async_eval/2`) schedule
evaluation of one or more lazy graphs **without blocking on the
GPU**, wrapping `mlx::core::async_eval`. The work is handed to the
device's command queue and the call returns as soon as it is
enqueued — not when it finishes. Lets a caller keep dispatching the
next step's ops while the device computes the current one (e.g. an
autoregressive decode loop), blocking only when a value is actually
read back on the host via `to_binary/1` / `eval/1`. Pass every
output of a step (logits plus all KV-cache buffers) in one call.
- `Emily.Native.fast_rope_int/8` — RoPE with an **integer**
absolute-position `offset` (routing to MLX's int-offset `rope`
overload), for incremental decode where the caller tracks position
host-side. Complements the existing tensor-offset `fast_rope/8`.
Note: feed the kernel the 4-D `{batch, heads, seq, head_dim}`
layout — in 3-D, MLX 0.31 mis-rotates single-token (`seq == 1`)
inputs.
- **Dilated window reductions (`window_dilations > 1`) returned wrong
values.** `window_sum`/`window_max`/`window_min`/`window_product`
with a dilated kernel silently produced garbage for windows past the
first stride positions, on both the eager backend and the native
compiler (they share the window-reduce core). A dilated kernel axis
gets an `as_strided` stride > 1, so the sliding-window view aliases
fewer physical elements than its logical size; MLX's strided-reduce
fast path then read past the aliased buffer. The view is now
materialised contiguously before the reduce when any dilation > 1
(the common non-dilated pooling path is unchanged and stays
copy-free).
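As a reference for what a dilated window reduction should return, here is a minimal pure-Elixir sketch of a 1-D `window_sum` with valid padding. It is not Emily's or MLX's implementation; the module and function names are hypothetical, and it only illustrates that a window of `size` taps with dilation `d` spans `(size - 1) * d + 1` input elements.

```elixir
defmodule DilatedWindow do
  # Reference 1-D window sum (valid padding): output i sums the taps
  # xs[start + j * dilation] for j in 0..size-1, where start = i * stride.
  def window_sum(xs, size, stride, dilation)
      when size > 0 and stride > 0 and dilation > 0 do
    # A dilated window covers more input than it has taps.
    extent = (size - 1) * dilation + 1
    n = length(xs)

    if n < extent do
      []
    else
      t = List.to_tuple(xs)

      for start <- 0..(n - extent)//stride do
        Enum.reduce(0..(size - 1), 0, fn j, acc ->
          acc + elem(t, start + j * dilation)
        end)
      end
    end
  end
end

xs = [1, 2, 3, 4, 5, 6]
IO.inspect(DilatedWindow.window_sum(xs, 2, 1, 1)) # [3, 5, 7, 9, 11]
IO.inspect(DilatedWindow.window_sum(xs, 2, 1, 2)) # [4, 6, 8, 10]
```

A check like this against `Nx.BinaryBackend` output is one way to confirm that windows beyond the first few positions carry the expected values.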
This release is a security-hardening pass over the native (NIF) boundary
and the build/release pipeline: direct `Emily.Native` calls now validate
their arguments instead of trusting Elixir-side normalization,
precompiled-NIF downloads verify against a checksum pinned in the hex
package (a trust root independent of the GitHub release), and the
per-stream worker is bounded and tears down without blocking a BEAM
scheduler. It is backward compatible, but two behaviour changes matter
for high-concurrency callers: the per-worker async queue is now bounded
(`worker_queue_limit`, default 8192) and rejects when full, and a stopped
or dropped worker replies `{:error, :stopped}` to queued callers instead
of running their work.
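The two settings this release introduces are ordinary application-env keys. A minimal config fragment might look like the following; the file placement and the 30-second timeout are illustrative assumptions, and only the key names and defaults come from these notes.

```elixir
# config/config.exs (placement is illustrative)
import Config

config :emily,
  # Per-worker async queue bound; 8192 is the documented default.
  worker_queue_limit: 8192,
  # Optional timeout in ms for awaiting native results.
  # The documented default is :infinity; 30_000 is an arbitrary example.
  await_timeout: 30_000
```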
- `Emily.Stream.close/1` stops a stream's worker thread deterministically
instead of waiting for garbage collection: queued operations are
cancelled (their callers get a `RuntimeError`), the in-flight op
finishes, and the OS thread is joined off the BEAM schedulers.
- `config :emily, worker_queue_limit: N` (default `8192`) bounds the
per-worker async queue, and `config :emily, await_timeout: ms` (default
`:infinity`) sets an optional timeout for awaiting native results.
- Worker-thread teardown no longer blocks a BEAM scheduler. The resource
destructor previously drained the worker's entire queue and joined the
OS thread inline, so collecting a busy stream during GC could stall a
scheduler. Workers are now joined off-scheduler by a dedicated reaper
(itself joined at NIF unload), and on stop the worker cancels its
queued tasks — replying `{:error, :stopped}` — instead of running them.
- The async NIF worker queue is now bounded (`worker_queue_limit`, reject
when full) so a flood of operations can't grow it without limit and pin
host/GPU memory, and a stopped or dropped worker now replies
`{:error, :stopped}` to every queued caller instead of leaving it
blocked forever. `Emily.Native.worker_queue_depth/1` exposes the depth
for observability.
- The dev/CI source-build path now refuses to trust an MLX install
directory it doesn't own and keeps the build cache `0700`, so a shared
or attacker-controlled `EMILY_CACHE` can't plant a `libmlx.a` that is
then statically linked into the NIF. Fixed system tools (`getconf`,
`id`, `sw_vers`, plus `xcrun`/`sysctl`/`ps` in `build-mlx.sh`) resolve
from absolute/system paths rather than `$PATH`, and the MLX-build lock
records the holder's process start time so a recycled PID can't be
mistaken for the original holder. Build-time only; no runtime change.
- Precompiled NIF downloads are now verified against checksums pinned
inside the hex package (`native_checksums.txt`) rather than a `.sha256`
sidecar fetched from the same GitHub release as the tarball. Because
the package contents are covered by Hex's package hash in the
consumer's `mix.lock`, the trust root no longer lives in the mutable
release. The tarball is also extracted with `:erl_tar` against a strict
entry allowlist (`libemily.{so,dylib}` + `mlx.metallib`), rejecting
symlinks, hardlinks, `..` traversal, absolute paths, and unexpected
entries — closing a path-traversal/arbitrary-write vector in the old
`tar -xzf` extraction. New `mix emily.checksums` task regenerates the
pinned file per release.
- Integer arguments crossing the NIF boundary are now range-checked
before being narrowed from Elixir's `int64` to C++ `int`. Previously an
out-of-range axis, count, or shape entry wrapped silently (e.g. an axis
of `2^32 + 3` became `3`), dispatching the wrong MLX operation; and
unbounded sample counts in `random_split`/`random_categorical` could
drive huge allocations. Out-of-range values, and negative counts, now
raise `ArgumentError`. Centralized as `checked_int` / `require_count`
helpers applied across the reduce, shape, sort, random, index, linalg,
conv, and fast NIFs.
- Native indexing and window NIFs now validate their vector arguments
against the tensor rank before indexing, and reject non-positive
strides, dilations, and window dimensions. Previously a direct
`Emily.Native` call with a malformed `slice_update` start, a short
pad/window vector, or a zero window stride could read a C++ vector out
of bounds or trigger an integer divide-by-zero (SIGFPE) — both of which
crash the whole BEAM VM rather than raising in the caller. They now
raise `ArgumentError`.
- `Emily.Native.from_binary/3` now validates tensor shapes at the NIF
boundary. Dimensions above `INT32_MAX` are rejected (previously they
silently truncated through MLX's `int32` `ShapeElem`), and the element
and byte counts are computed with overflow checking. Without this an
attacker-chosen shape whose element product wrapped (e.g.
`[2^21, 2^21, 2^22]` → `0`) could pass the binary-size check against an
undersized — even empty — binary and build an array whose shape outran
its allocation, an out-of-bounds read on the next `eval`/`to_binary`.
- `Emily.Native.conv_general/8` now rejects a non-positive `groups`
argument with `ArgumentError` instead of crashing the BEAM VM. MLX's
convolution checks compute `in_channels % groups`, so `groups <= 0`
(or a large value that narrows to zero through the `int64 → int`
conversion) was an integer modulo-by-zero — a SIGFPE that bypassed the
NIF's exception path and terminated the entire node. The guard
validates the un-narrowed value at the NIF boundary.
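The two integer hazards described above (silent narrowing and a wrapped element count) can be reproduced in plain Elixir. This is an illustrative sketch only: the module and function names are hypothetical and are not the C++ `checked_int` / `require_count` helpers, and it assumes a 32-bit `int` and a 64-bit unsigned element count.

```elixir
defmodule NarrowingSketch do
  import Bitwise

  @int32_max 0x7FFF_FFFF
  @uint64_mask 0xFFFF_FFFF_FFFF_FFFF

  # Unchecked int64 -> 32-bit narrowing: keep the low 32 bits and
  # reinterpret them as two's complement.
  def narrow_unchecked(n) when is_integer(n) do
    low = n &&& 0xFFFF_FFFF
    if low > @int32_max, do: low - 0x1_0000_0000, else: low
  end

  # Range-check before narrowing instead of wrapping.
  def narrow_checked(n)
      when is_integer(n) and n >= -0x8000_0000 and n <= @int32_max,
      do: {:ok, n}

  def narrow_checked(n) when is_integer(n), do: {:error, :out_of_range}

  # Element count with 64-bit wraparound, as an unchecked product would do.
  def elem_count_wrapping(shape) do
    Enum.reduce(shape, 1, fn dim, acc -> (acc * dim) &&& @uint64_mask end)
  end

  # Element count with per-dimension and overflow checks.
  def elem_count_checked(shape) do
    Enum.reduce_while(shape, {:ok, 1}, fn dim, {:ok, acc} ->
      cond do
        dim < 0 or dim > @int32_max -> {:halt, {:error, {:bad_dim, dim}}}
        acc * dim > @uint64_mask -> {:halt, {:error, :overflow}}
        true -> {:cont, {:ok, acc * dim}}
      end
    end)
  end
end

import Bitwise
axis = (1 <<< 32) + 3
IO.inspect(NarrowingSketch.narrow_unchecked(axis)) # 3
IO.inspect(NarrowingSketch.narrow_checked(axis))   # {:error, :out_of_range}

shape = [1 <<< 21, 1 <<< 21, 1 <<< 22]
IO.inspect(NarrowingSketch.elem_count_wrapping(shape)) # 0
IO.inspect(NarrowingSketch.elem_count_checked(shape))  # {:error, :overflow}
```

The wrapped count of `0` is what would let an empty binary pass a naive size check, which is why the checked variant compares the un-wrapped product.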
- `CHANGELOG.md` — corrected the 0.5.0 entry. The published release carried two `### Changed` headings and listed three new-functionality items (`mix emily.doctor`, `config :emily, fallback:`, and the `Emily.Memory` public allocator API) under Changed rather than Added. Merged the duplicate Changed sections, moved the new-functionality items to Added, and put items into reverse chronological order. No code change.
- `Emily.Quantization.dequantize_defn/1` now supports the `nvfp4`
microscaled mode in addition to `affine`, `mxfp4`, and `mxfp8` —
the full MLX `QuantizationMode` enum now runs through the
defn-native dequant path. `nvfp4` reuses the FP4-E2M1 lane LUT
from `mxfp4` and the FP8-E4M3 LUT from `mxfp8` (consumed against
the per-group scale bytes rather than lane codes — the NVIDIA
microscaled convention uses finer-grained group_size=16 with
FP8-E4M3 scales instead of mxfp4/mxfp8's group_size=32 with
FP8-E8M0 scales). Output dtype is bf16 to match
`QuantizedWeight.to_dense/1`, and the round-trip is bit-identical (max
abs diff = 0.0). `Emily.Quantization.Transform` accepts
`mode: "nvfp4"`.
- `Emily.Quantization.dequantize_defn/1` now supports the `mxfp8`
microscaled mode in addition to `affine` and `mxfp4`. Each 8-bit
lane code decodes through a 256-entry FP8-E4M3 lookup table
precomputed via MLX's `FromFP8` bit-trick (strip sign, shift the
low 7 bits left by 7 to align the E4M3 exponent into f16's
exponent field, multiply by 256 for the bias difference, restore
sign). Per-group scales reuse the FP8-E8M0 decode from the mxfp4
path. Output dtype is bf16 to match `QuantizedWeight.to_dense/1`,
and the round-trip is bit-identical (max abs diff = 0.0) on
realistic data. `Emily.Quantization.Transform` accepts
`mode: "mxfp8"`; only `nvfp4` (which uses an FP8-E4M3 per-group
scale instead of FP8-E8M0) remains defn-unsupported.
- `Emily.Quantization.dequantize_defn/1` now supports the `mxfp4`
microscaled mode in addition to `affine`. Each 4-bit lane code
decodes through MLX's FP4-E2M1 lookup table (`+0.0, +0.5, +1.0,
+1.5, +2.0, +3.0, +4.0, +6.0` and their negatives); each u8 scale
byte decodes through `2^(s - 127)` (FP8-E8M0). Output dtype is
bf16 to match `QuantizedWeight.to_dense/1`, and the round-trip is
bit-identical (max abs diff = 0.0) on realistic scale bytes
because every FP4 LUT entry and every E8M0 power-of-two is exact
in bf16. `Emily.Quantization.Transform` gains a `:mode` option
(default `"affine"`, accepts `"mxfp4"`); `mxfp8` and `nvfp4` are
still defn-unsupported and route through the Native NIF.
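The mxfp4 decode described above is small enough to sketch per element in plain Elixir. This is not the defn implementation: the module name is hypothetical, and it assumes the standard E2M1 layout in which bit 3 of the 4-bit code is the sign and the low three bits index the magnitude table. The table values and the `2^(s - 127)` scale rule are the ones quoted in the entry.

```elixir
defmodule Mxfp4Sketch do
  import Bitwise

  # FP4-E2M1 magnitudes, indexed by the low three bits of the lane code.
  @fp4_magnitudes {0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0}

  # Decode one 4-bit lane code (assumption: bit 3 is the sign bit).
  def decode_fp4(code) when code in 0..15 do
    magnitude = elem(@fp4_magnitudes, code &&& 0b0111)
    if (code &&& 0b1000) != 0, do: -magnitude, else: magnitude
  end

  # Decode one FP8-E8M0 scale byte as 2^(s - 127).
  # 0xFF is excluded here because E8M0 reserves it for NaN.
  def decode_e8m0(s) when s in 0..254, do: :math.pow(2.0, s - 127)

  # One dequantized element: lane value times its group's scale.
  def dequantize(code, scale_byte) do
    decode_fp4(code) * decode_e8m0(scale_byte)
  end
end

IO.inspect(Mxfp4Sketch.dequantize(0b0101, 128)) # 3.0 * 2.0 = 6.0
IO.inspect(Mxfp4Sketch.dequantize(0b1011, 126)) # -1.5 * 0.5 = -0.75
```

Every product here is a small multiple of a power of two, which is consistent with the entry's note that the round-trip is exact in bf16 for realistic scale bytes.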
- `Emily.Quantization.dequantize_defn/1` now supports int3 and int6
weights in addition to int2/int4/int8. The new path reads each
lane's two adjacent u32 words as a u64, shifts by the in-word bit
offset, and masks — handling the cross-u32 packing MLX uses for
bit widths that don't divide 32 cleanly. `defn_supported_bits/0`
now returns `[2, 3, 4, 6, 8]`; quantized Axon graphs rewritten
via `Emily.Quantization.Transform` (and `Emily.Quantization.Layers.quantized_dense/4`)
pick the expanded set up automatically. Previously the defn path
rejected `bits ∈ {3, 6}` and callers had to fall back to
`QuantizedWeight.to_dense/1` (the Native NIF).
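The "two adjacent u32 words as a u64, shift, mask" step can be sketched in plain Elixir. This is an illustration rather than the defn code: the module name is hypothetical, and it assumes lanes are packed least-significant-bit first across little-endian u32 words.

```elixir
defmodule PackedLanes do
  import Bitwise

  # Pack `values` (each `bits` wide) LSB-first into a list of u32 words.
  # Only used here to build test input.
  def pack(values, bits) do
    big =
      values
      |> Enum.with_index()
      |> Enum.reduce(0, fn {v, i}, acc -> acc ||| (v <<< (i * bits)) end)

    n_words = div(length(values) * bits + 31, 32)
    for w <- 0..(n_words - 1)//1, do: (big >>> (w * 32)) &&& 0xFFFF_FFFF
  end

  # Extract one lane, even when it straddles a u32 boundary.
  def unpack_lane(words, lane, bits) do
    bit_pos = lane * bits
    word = div(bit_pos, 32)
    offset = rem(bit_pos, 32)

    lo = Enum.at(words, word)
    # The last lane may have no following word; treat it as zero.
    hi = Enum.at(words, word + 1, 0)

    u64 = lo ||| (hi <<< 32)
    (u64 >>> offset) &&& ((1 <<< bits) - 1)
  end
end

values = [5, 2, 7, 0, 1, 6, 3, 4, 7, 7, 5, 2]
words = PackedLanes.pack(values, 3)
# Lane 10 starts at bit 30, so it spans words 0 and 1.
IO.inspect(PackedLanes.unpack_lane(words, 10, 3)) # 5
```

For bit widths that divide 32 (2, 4, 8) the high word never contributes, so the same routine degenerates to a single-word shift and mask.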
- `ARCHITECTURE.md` — current shape of the library extracted from
`PLAN.md`. Covers the four-layer dispatch model, the worker-thread
+ per-process-stream concurrency model, the public `Emily.Memory`
allocator API, the telemetry event catalogue, the
`:debug_bounds_check` / `:debug_detect_nan_inf` compile-time flags,
build/packaging notes, the per-layer testing oracle table, and the
active risk register. Linked from the README under a new
Documentation section and grouped under "Project" in the HexDocs
sidebar.
- `ROADMAP.md` — active and future work, separated from the
historical milestone log. Lists deferred-to-post-1.0 items
(typed exceptions, GPU interop pointers, source-build doctor
probes) and the open in-roadmap MLX capability gaps (sparse / MoE
matmuls, FP8 dtype, `ThreadLocalStream`).
- `PLAN.md` slimmed to its milestone-history role. The current-shape
sections (architecture diagram, core design decisions, testing
philosophy, risks-and-mitigations) moved to `ARCHITECTURE.md`;
goals, non-goals, and deferred-milestone summaries moved to
`ROADMAP.md`. The M0–M27 milestone narratives, the ratified
project decisions, and the 2026-04-22 MLX capability audit stay in
`PLAN.md` as the historical record. The stale "narrow
`with_stream/2` + `new/1` + `synchronize/1` surface" reference (no
`synchronize/1` ever shipped) and the planned `set_default_stream/1`
primary deliverable (removed during the post-M14 fixes) drop out
with the prologue rewrite.
- `mix emily.doctor` — diagnostic Mix task that verifies the local
Emily runtime installation. Checks the host platform (OS, arch,
macOS version against the active variant's minimum), the active
MLX variant, `priv/libemily.so` and `priv/mlx.metallib`, NIF
loadability, and a tiny `Emily.Backend` smoke test that asserts
the result didn't silently fall back to `Nx.BinaryBackend`. Checks
short-circuit: when a prerequisite fails, dependent checks report
`[skip]` rather than producing cascading noise. Supports
`--variant aot|jit` for "would this host satisfy :jit?" probes and
`--help` for usage.
- `Emily.Native` now annotates NIF errors with operation, input
shape/dtype, options, and worker context. `ArgumentError` and
`RuntimeError` raised from async ops get an `Emily.Native context:
op=… inputs=[…] options=[…] stream=…` suffix, so common failures
(shape mismatches in `matmul`, divisibility errors in `quantize`,
mask shape bugs in `fast_scaled_dot_product_attention`, etc.) are
diagnosable from the message alone. The error-formatting path is
total — bad context maps degrade to `?` markers rather than masking
the underlying NIF error.
- `Emily.Memory` — public allocator API for long-running serving and
training workloads that need to observe and manage MLX memory
without reaching into `Emily.Native`. Exposes `stats/0` (active,
peak, and cached bytes, also emitting `[:emily, :memory, :stats]`),
`reset_peak/0`, and `clear_cache/0`. Documented under the README's
Observability section and grouped with `Emily.Telemetry` in the
ExDoc sidebar.
- `config :emily, fallback: :silent | :warn | :raise` — strict
fallback modes for development and CI. `:silent` (the default)
preserves today's behaviour; `:warn` emits the one-shot
`Logger.warning` per `{op, input_shapes}` pair previously gated by
`:warn_on_fallback`; `:raise` raises `RuntimeError` with op,
shapes, and dtypes on entry, letting CI fail the build when a hot
path unexpectedly routes through `Nx.BinaryBackend`. An invalid
`:fallback` value raises `ArgumentError` on the first fallback so
typos surface immediately.
- `Emily.Telemetry.memory_stats/0` now delegates to
`Emily.Memory.stats/0`. Behaviour is unchanged — same event,
measurements, and return shape — but new code should prefer the
`Emily.Memory` entry point.
- The legacy `config :emily, :warn_on_fallback, true` boolean is
soft-deprecated in favour of `:fallback`. It is still honoured
when `:fallback` is unset (`true` → `:warn`); when both are set,
`:fallback` wins.
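For example, a CI setup that should fail on any unexpected `Nx.BinaryBackend` fallback could set the strict mode in its test config. The file placement is an illustrative assumption; the key and its three values are the ones listed above.

```elixir
# config/test.exs (placement is illustrative)
import Config

# :silent (default) | :warn | :raise
config :emily, fallback: :raise
```

Using `:warn` in development and `:raise` in CI is one reasonable split, since `:warn` logs once per `{op, input_shapes}` pair without interrupting work.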
- Upgraded to Nx 0.12 / Bumblebee 0.7 / Axon 0.8. Nx 0.12 replaces
the optional-callback list (`lu`, `svd`, `qr`, `cholesky`, `eigh`,
`solve`, `take`, `take_along_axis`, `fft2`, `ifft2`,
`cumulative_*`, `logical_not`, `all_close`) with a single
generic `Nx.Backend.block/4` dispatch keyed on `Nx.Block.*`
structs. `Emily.Backend` now routes every previously-native op
through `block/4`, preserving the MLX fast paths without losing
the BinaryBackend fallback when an unknown block arrives. Existing
`Emily.Backend` consumers see no behavioural change.
- Migrated `Emily.Fast.*` from the now-removed
`Nx.Defn.Expr.optional/3` extension point to `Nx.block/4`. Each
fused kernel (`rms_norm`, `layer_norm`, `rope`, `rope_with_freqs`,
`scaled_dot_product_attention` with and without mask/sinks) now
emits an `Emily.Fast.Block.*` struct that `Emily.Backend.block/4`
pattern-matches to the matching `mx::fast::*` NIF. The
composed-defn fallbacks under non-Emily backends are unchanged.
- Bumblebee 0.7 ships Qwen3 first-class, so
`notebooks/qwen3_quantized.livemd` no longer needs the `main`-ref
Bumblebee pin from the 0.6.3 era.
- `Nx.rfft/2` and `Nx.irfft/2` support. The underlying
`Native.rfftn` / `Native.irfftn` NIFs were already in place from
earlier MLX work; Nx 0.12 surfaces these as backend-block ops so
Emily wires them up at no MLX-side cost.
- Smoke tests for three new Bumblebee 0.7 model families on
`Emily.Backend`: NomicBERT (`:nomic_embeddings`), SmolLM3
(`:smollm3`), and ModernBERT (`:modernbert`). All three drive a
tiny synthetic spec end-to-end through `Axon.predict` so they
remain offline-friendly; tagged `:conformance`.
- Runnable Livebooks for each of the three new Bumblebee 0.7
families: `notebooks/nomic_embeddings.livemd` (NomicBERT
embeddings with cosine similarity), `notebooks/smollm3_chat.livemd`
(SmolLM3-3B chat completion with a `<think>` toggle for hybrid
reasoning), and `notebooks/modernbert_classification.livemd`
(ModernBERT NLI fine-tune). All three are published under the
HexDocs Notebooks group.
- A `[:emily, :block, :fallback]` telemetry event fires whenever
`Emily.Backend.block/4` falls through to the supplied default
`fun`. Surfaces ops we used to handle natively but now land on
the composed-defn path — useful in soak runs to spot silent
regressions after a Bumblebee bump.
- `mix docs` no longer emits autolinker warnings for the
`Emily.Backend.block/4` and `Nx.Defn.Expr.optional/3` references
in the `Emily.Fast` and `Emily.Fast.Block` moduledocs. The
references resolved to `@doc false` callees (the backend callback
is hidden by `Nx.Backend`, and `optional/3` was removed in Nx 0.12);
the prose stays, the `Mod.fun/arity` shape is broken up so the
autolinker no longer follows it. Same pattern as the earlier
fix in `ee32c7c`.
- `{:f8_e4m3fn, 8}` (introduced in Nx 0.11) is rejected at the
backend boundary with the same "no MLX primitive" `ArgumentError`
pattern as `{:f, 64}`. MLX has no float-8 dtype; cast to `:f16` or
`:bf16`.
- `Nx.LinAlg.svd(tensor, full_matrices?: false)` on rank-2 inputs no longer routes through MLX's full-matrices SVD and post-slices. MLX's SVD has no thin switch, so the old path materialised the full m × m U on device and instantly OOM'd Metal for tall matrices like the Qwen3-0.6B embedder kernel (151936 × 1024 → ~92 GB U). The thin case now computes `G = MᵀM → eigh → S, V; U = MV / S` (or the symmetric `MMᵀ` route for wide matrices), keeping the decomposition at min(m, n)². See the `Emily.Backend` moduledoc Divergences section for the numerical caveat (the Gram step squares M's condition number). Refs #84.
- `mix docs` runs cleanly. The MNIST notebook referenced `Axon.Loop.trainer/2` (no such arity); three other inline references resolved to `@doc false` callees in upstream libraries (`Nx.Defn.Expr.optional/3`, `Bumblebee.Layers.rms_norm/2`) and triggered autolinker warnings on every doc build. The notebook now uses the correct `trainer/3` arity, and the prose references have been reshaped so the autolinker no longer follows them, keeping the build warning-free for future `--warnings-as-errors` enforcement. Refs #83.
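The Gram-matrix route used for the thin `Nx.LinAlg.svd` case can be written out for the tall case (m ≥ n) as below. This is a sketch of the standard identity, not a transcription of Emily's code, and the byte counts assume 4-byte f32 elements, which is the assumption under which the quoted ~92 GB figure works out.

```latex
% Tall case: M is m x n with m >= n, thin SVD M = U S V^T.
\begin{align*}
G &= M^{\top} M = V S^{2} V^{\top}
  && \text{($n \times n$; \texttt{eigh}$(G)$ gives $V$ and $\lambda_i = \sigma_i^{2}$)} \\
S &= \operatorname{diag}\!\big(\sqrt{\lambda_1}, \dots, \sqrt{\lambda_n}\big) \\
U &= M V S^{-1}
  && \text{($m \times n$; requires every $\sigma_i > 0$)} \\
\kappa_2(G) &= \kappa_2(M)^{2}
  && \text{(the Gram step squares the condition number)}
\end{align*}
% Sizes for m = 151936, n = 1024 (assuming f32, 4 bytes per element):
%   full U : 151936^2 * 4 B       = 92,338,192,384 B  (~92 GB)
%   thin U : 151936 * 1024 * 4 B  =    622,329,856 B  (~0.62 GB)
%   G      : 1024^2 * 4 B         =      4,194,304 B  (4 MiB)
```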