Starred repositories
- Efficient Universal Perception Encoder: a single on-device vision encoder with versatile representations that match or exceed specialized experts across multiple task domains.
- A disk cache for using DuckDB to access data lakes (DuckLake, Iceberg, Delta).
- ⚡ Super-fast clustering for high-dimensional vectors on CPUs (x86, ARM) and GPUs, for Python and C++. 100x faster clustering of vector embeddings than FAISS.
- MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Compute-optimal Scaling of Diffusion Language Models.
- Fast and memory-efficient exact k-means.
- Build ultra-fast, tiny, cross-platform desktop apps with TypeScript.
- Pure MLX implementations of UMAP, t-SNE, PaCMAP, TriMap, DREAMS, CNE, MMAE, and NNDescent for Apple Silicon, using the Metal GPU for computation and video rendering.
- UMAP in pure MLX for Apple Silicon. 30x faster than umap-learn.
- 100M tokens. Infinite compute. Lowest val loss wins.
- Various ML tidbits in Python/PyTorch and C++.
- Official implementation of ViT-5: Vision Transformers for the Mid-2020s.
- Code and models for the paper "Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts".
- Helpful kernel tutorials and examples for tile-based GPU programming.
- FLA, but cuTile.
- LLaDA2.0 is a diffusion language model series developed by the InclusionAI team at Ant Group.
- An interface library for RL post-training with environments.
- Qwen3-0.6B megakernel: 527 tok/s decode on an RTX 3090 (3.8x faster than PyTorch).