Skip to content

zk4x/zyx

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

1,816 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

zyx

crates.io PyPI docs.rs build status license maintenance

Table of Contents

Overview

zyx is a complete ML library and compiler that goes from assembly all the way to neural networks. It has a stable API with performance under active optimization.

Features

  • Unified Graph β€” autograd and laziness share the same graph, enabling seamless kernel fusion across all operations.
  • Lazy Evaluation β€” operations accumulate until realize() triggers execution, reducing temporary allocations.
  • Kernel Fusion β€” tensor operations compile into single optimized kernels (CUDA, OpenCL, WebGPU, etc.).
  • Cross‑Platform Backends β€” native support for OpenCL (CPU via POCL, GPU via native OpenCL drivers), WebGPU, CUDA, and more.
  • Full Linear‑Algebra Coverage β€” mirrors the PyTorch ops API (matmul, convolutions, pooling, reductions, indexing, etc.) by stacking ops.
  • Immutable Tensors β€” tensors cannot be modified in place, preventing back‑prop errors common in PyTorch (RuntimeError: a tensor was modified in place).
  • Explicit Gradient Tape β€” you control what is recorded via GradientTape; no need for torch.no_grad() semantics.
  • Higher-Order Gradients β€” experimental (graph-based, forward-mode autograd planned)
  • Lazy Device Loading β€” tensors load from their current memory pool (disk, another device) into the compute device only when needed, via the runtime scheduler.
  • Parallel Pipelining β€” kernels allocate across heterogeneous devices (GPU, CPU, WebGPU) in a pipelined fashion via the runtime scheduler.
  • Small Footprint β€” compiled library is only a few MB with minimal dependencies (libloading, nanoserde, half).

🐍 Python Bindings

zyx offers Python bindings with full PyTorch API compatibility and multiple backend support:

Basic Usage

import zyx

x = zyx.Tensor.randn(2, 3)
y = zyx.Tensor.uniform_(2, 3, from_=-1.0, to_=1.0)
z = x.relu() + y.tanh()
print(z.shape())

# Autograd example
tape = zyx.GradientTape()
result = x.relu() * y
grads = tape.gradient(result, [x, y])

Crates

Crate Description
zyx Core tensor library with lazy graph and autodiff
zyx-nn Neural network layers (Linear, Conv2d, Attention, etc.) and #[derive(Module)]
zyx-optim Optimizers (SGD, Adam, AdamW, RMSprop)

Installation

Python Installation

# Install from PyPI
pip install zyx-py

# Or install from source for development
pip install git+https://github.com/zk4x/zyx.git#subdirectory=zyx-py

Rust Installation

# Install from crates.io
cargo add zyx zyx-nn zyx-optim

Hello World

Create tensors, apply operations, and trigger computation with realize():

use zyx::{Tensor, DType};

fn main() -> Result<(), zyx::ZyxError> {
    // Create tensors
    let x = Tensor::randn([2, 3], DType::F32)?;
    let y = Tensor::uniform([2, 3], -1f32..1f32)?;
    
    // Perform operations (lazy evaluation)
    let z = x.relu()? + y.tanh()?;
    
    // Realize computation
    let result = z.realize()?;
    
    println!("Result shape: {:?}", result.shape());
    Ok(())
}

Basic Neural Network

A training loop with a two-layer network, using GradientTape for autograd and SGD for optimization:

use zyx::{Tensor, DType, GradientTape};
use zyx_nn::{Linear, Module};
use zyx_optim::SGD;

#[derive(Module)]
struct SimpleNet {
    linear1: Linear,
    linear2: Linear,
}

impl SimpleNet {
    fn new(dtype: DType) -> Result<Self, zyx::ZyxError> {
        Ok(Self {
            linear1: Linear::new(784, 128, true, dtype)?,
            linear2: Linear::new(128, 10, true, dtype)?,
        })
    }
    
    fn forward(&self, x: &Tensor) -> Tensor {
        let x = self.linear1.forward(x).unwrap().relu();
        self.linear2.forward(&x).unwrap()
    }
}

fn main() -> Result<(), zyx::ZyxError> {
    let mut model = SimpleNet::new(DType::F32)?;
    let mut optim = SGD::default();
    let x = Tensor::randn([64, 784], DType::F32)?;
    let target = Tensor::randn([64, 10], DType::F32)?;
    
    for epoch in 0..10 {
        let tape = GradientTape::new();
        let output = model.forward(&x);
        let loss = output.mse_loss(&target)?;
        
        let grads = tape.gradient(&loss, &model);
        optim.update(&mut model, grads);
        
        // Realize to trigger computation
        Tensor::realize_all()?;
        
        println!("Epoch {}: Loss = {:.4}", epoch, loss.item::<f32>()?);
    }
    
    Ok(())
}

Advanced Examples

A Transformer block with multi-head attention, layer normalization, and AdamW optimization:

use zyx::{DType, GradientTape, Module, Tensor};
use zyx_nn::{Linear, LayerNorm, MultiheadAttention};
use zyx_optim::AdamW;

#[derive(Module)]
struct TransformerBlock {
    attn: MultiheadAttention,
    mlp: Linear,
    mlp2: Linear,
    norm1: LayerNorm,
    norm2: LayerNorm,
}

impl TransformerBlock {
    fn new(dim: u64, num_heads: u64, dtype: DType) -> Result<Self, zyx::ZyxError> {
        Ok(Self {
            attn: MultiheadAttention::new(dim, num_heads, 0.0, true, false, false, None, None, true, dtype)?,
            mlp: Linear::new(dim, dim * 4, true, dtype)?,
            mlp2: Linear::new(dim * 4, dim, true, dtype)?,
            norm1: LayerNorm::new([dim], 1e-5, true, true, dtype)?,
            norm2: LayerNorm::new([dim], 1e-5, true, true, dtype)?,
        })
    }

    fn forward(&self, x: &Tensor) -> Result<Tensor, zyx::ZyxError> {
        let attn_out = self.attn.forward(x, x, x, None::<Tensor>, false, None::<Tensor>, true, false)?.0;
        let x = self.norm1.forward(&(x + attn_out))?;
        let mlp_out = self.mlp.forward(&x)?.gelu();
        let mlp_out = self.mlp2.forward(&mlp_out)?;
        Ok(self.norm2.forward(&(x + mlp_out))?)
    }
}

fn main() -> Result<(), zyx::ZyxError> {
    let mut model = TransformerBlock::new(64, 4, DType::F32)?;
    let mut optim = AdamW::default();
    let x = Tensor::randn([2, 8, 64], DType::F32)?;

    let tape = GradientTape::new();
    let out = model.forward(&x)?;
    let grads = tape.gradient(&out, &model);

    // Update parameters with gradients
    optim.update(model.iter_mut(), grads);

    // Realize model to trigger computation (zyx uses lazy evaluation)
    model.realize()?;
    Ok(())
}

Architecture

flowchart LR
    A["Tensor Graph"] --> B["Fusion and Device Schedule Search"]
    B --> C["Device Specific Kernel IR"]
    C --> D["Backend Code / Assembly"]
Loading

Tensor operations build a lazy computation graph. During realization, the graph is analyzed for fusion opportunities and the optimal execution schedule is searched. The fused operations are lowered to a device-specific intermediate representation, then compiled to native code (PTX, OpenCL C, WGSL, etc.) for the target backend.

Why zyx is Different

Feature zyx PyTorch TensorFlow JAX
Execution Model Lazy with explicit realization Eager by default Eager by default Functional + XLA
Gradient Recording Explicit GradientTape Implicit, requires no_grad() Implicit, tf.function Explicit + jit
Tensor Mutability Immutable (no in-place errors) Mutable (risk of back-prop failures) Mutable Immutable
Kernel Fusion Automatic, cross-backend Manual (torch.jit) Manual (XLA) Manual (XLA)
Disk I/O Lazy loading parallel to compute Typically blocking Blocking Blocking
Device Pipelining Built-in heterogeneous pipelining Manual to(device) calls Manual device placement Manual device placement
Compilation Runtime kernel compilation Pre-compiled + jit Pre-compiled Just-in-time
Import Time ~1ms ~2s ~3s ~0.5s
Wheel Size ~4MB (includes CUDA) hundreds of MB

Backends

  • CUDA - NVIDIA GPU acceleration
  • OpenCL - Cross-platform support (CPU via POCL, GPU via native OpenCL drivers)
  • WebGPU (WGPU) - Modern web and native GPU support
  • ROCm - AMD GPU support (planned)

Please see DEVICE_CONFIG.md for detailed information on hardware configuration.

Documentation

Status & License

  • Status: Stable API with active performance optimization
  • License: LGPL-3.0-only (all crates)
  • Rust Version: Requires latest stable Rust
  • Platforms: Linux (primary), macOS, Windows (experimental)

Contributing

Contributions are welcome! Please read the CONTRIBUTING.md, STYLE.md, and CODE_OF_CONDUCT.md for guidelines.

How to Help

  • πŸ› Find bugs: correctness is our top priority
  • πŸ“ Write tests: integration tests are always appreciated
  • πŸ“š Improve documentation: typo fixes and better docs
  • ⚑ Add optimizations: significant performance improvements (>10%)
  • πŸ”Œ Add backends: CUDA, ROCm, Metal, Vulkan support
  • 🎯 Implement features: new tensor operations, layers

Quick Links

  • Examples - MNIST, RNN implementations
  • Issues - Bug reports and feature requests

Debug Options

Value Flag Description
1 dev Print hardware devices and configuration
2 perf Print graph execution characteristics
4 sched Print kernels created by scheduler
8 ir Print kernels in intermediate representation
16 asm Print native assembly/code (OpenCL, WGSL, etc.)

Example: ZYX_DEBUG=16 cargo test --features wgpu relu_1

Quick Reference

Common Tensor Operations

// Creation
let x = Tensor::randn([2, 3], DType::F32)?;
let y = Tensor::zeros([4, 4], DType::F32)?;
let z = Tensor::from([[1.0, 2.0], [3.0, 4.0]]);

// Operations
let sum = x + y;
let product = x * y;
let relu = x.relu()?;
let matmul = x.dot(&y)?;

// Shape manipulation
let reshaped = x.reshape([6, 1])?;
let sliced = x.slice([0..2, 0..2])?;
let transposed = x.t()?;

// Autograd
let tape = GradientTape::new();
let result = x.relu()? * y;
let grads = tape.gradient(&result, &[x, y]);

Python Quick Reference

import zyx

# Creation
x = zyx.Tensor.randn(2, 3)
y = zyx.Tensor.randn(2, 3)  # Same shape for element-wise operations
z = zyx.Tensor([[1, 2], [3, 4]])

# Operations
sum = x + y  # Same shape required
product = x * y  # Same shape required
relu = x.relu()
matmul = x @ zyx.Tensor.randn(3, 2)  # Matrix multiplication, requires compatible shapes

# Shape manipulation
reshaped = x.reshape(6, 1)
sliced = x[0:2, 0:2]
transposed = x.t()

# Autograd
tape = zyx.GradientTape()
result = x.relu() * y
grads = tape.gradient(result, [x, y])

Error Handling

zyx provides clear error messages for common issues:

Shape Mismatch Errors

import zyx

# This will fail - incompatible shapes for matrix multiplication
x = zyx.Tensor.randn(2, 5)
y = zyx.Tensor.randn(17, 8)  # Error: 2x5 @ 17x8 is invalid

try:
    result = x @ y
except Exception as e:
    print(f"Shape error: {e}")

# Correct approach - ensure compatible shapes
x = zyx.Tensor.randn(2, 5)
y = zyx.Tensor.randn(5, 8)  # Valid: 2x5 @ 5x8 = 2x8
result = x @ y
use zyx::Tensor;

let x = Tensor::randn([2, 5], DType::F32)?;
let y = Tensor::randn([17, 8], DType::F32)?;

// This returns an error - incompatible shapes for matrix multiplication
match x.dot(&y) {
    Ok(_) => unreachable!(),
    Err(e) => println!("Shape error: {e}"),
}

// Correct approach
let x = Tensor::randn([2, 5], DType::F32)?;
let y = Tensor::randn([5, 8], DType::F32)?;
let result = x.dot(&y)?;

Device Errors

import zyx

# Operations may succeed initially but fail during realization
x = zyx.Tensor.randn(1000, 1000)  # Large tensor
y = x @ x  # Operation builds in graph

try:
    result = y.realize()  # May fail if device runs out of memory
except Exception as e:
    print(f"Device error during realization: {e}")
    # Handle device errors (e.g., reduce batch size, use smaller tensors)

About

Tensor library for machine learning

Topics

Resources

License

Code of conduct

Contributing

Stars

Watchers

Forks

Packages

 
 
 

Contributors