YNNPACK is a work-in-progress spiritual successor to XNNPACK. It uses many ideas and principles from XNNPACK, but is otherwise a near total rewrite. This document will focus on the differences from XNNPACK.
XNNPACK is structured into roughly 3 top level layers:
1. Subgraph API (xnn_define_...)
2. Operator API (xnn_create_..., xnn_reshape_..., xnn_setup_..., xnn_run_operator)
3. Microkernels
The subgraph API uses the operator API, which in turn calls microkernels.
YNNPACK has (1) and (3) from above, but does not have an explicit operator API. YNNPACK uses Slinky to execute the loops that exist outside the microkernels, which was the role of the operator API in XNNPACK. More importantly, most of the work that the operator API performed in XNNPACK has been moved into subgraph nodes. For example, packing weights is handled by a subgraph node (that may or may not get constant folded).
XNNPACK would run these loops for each operator one at a time, unless they were manually fused into a microkernel or the appropriate xnn_compute_... function.
Slinky optimizes these loops by finding and fusing loops across many subgraph nodes, which improves memory locality, eliminates synchronization points when parallelizing, and both adds and removes overhead (hopefully removes more than it adds).
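To make the locality argument concrete, here is a minimal sketch of loop fusion across two elementwise operations. It does not use Slinky or any YNNPACK API; the function names and the tile size are hypothetical and chosen only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused: each operation runs its own loop over the whole buffer, so a
// full-size intermediate is written to memory and then read back.
std::vector<float> scale_then_relu_unfused(const std::vector<float>& x,
                                           float scale) {
  std::vector<float> tmp(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) tmp[i] = x[i] * scale;
  std::vector<float> y(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) y[i] = std::max(tmp[i], 0.0f);
  return y;
}

// Fused: one outer loop over tiles drives both operations, so the
// intermediate shrinks to a single tile that can stay in cache.
std::vector<float> scale_then_relu_fused(const std::vector<float>& x,
                                         float scale, std::size_t tile = 4) {
  assert(tile > 0);
  std::vector<float> y(x.size());
  std::vector<float> tmp(tile);  // tile-sized scratch buffer
  for (std::size_t begin = 0; begin < x.size(); begin += tile) {
    const std::size_t n = std::min(tile, x.size() - begin);
    for (std::size_t i = 0; i < n; ++i) tmp[i] = x[begin + i] * scale;
    for (std::size_t i = 0; i < n; ++i) y[begin + i] = std::max(tmp[i], 0.0f);
  }
  return y;
}
```

Both functions compute the same result. The fused version also has only one loop that would need to be parallelized, which is where the reduction in synchronization points comes from.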
The ynnpack/xnnpack subfolder contains a compatibility layer that, when linked
into an application, provides an implementation of the XNNPACK public API using
YNNPACK. So far the main purpose of this compatibility layer has been to enable
testing and development using XNNPACK tests, benchmarks, and use cases. It does
not support everything that XNNPACK does, but some applications will work
without modification, and some of those will perform well.
XNNPACK can be thought of as a more "CISC" and YNNPACK a more "RISC" design. The best way to see this is to look at the XNNPACK compatibility layer implementation. For example, instead of providing an explicit xnn_define_max_pooling_2d operator, YNNPACK provides the building blocks needed to implement this operator (padding, stencil copies, reductions).
This approach may seem like it would offer lower performance, but many of the building blocks used to implement such operations are usually low or zero cost due to Slinky optimizations, and this approach offers much more flexibility to implement a wider variety of operations with less engineering effort in XNNPACK.
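As a rough illustration of the building-block idea, the sketch below composes a 1D max pooling from the three pieces named above: padding, a stencil copy, and a reduction. It is plain C++ rather than the YNNPACK API, and every function name here is hypothetical. Unlike an optimized implementation, it materializes each intermediate step.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Step 1: pad both ends of the input with a fill value.
std::vector<float> pad(const std::vector<float>& x, std::size_t before,
                       std::size_t after, float value) {
  std::vector<float> out(before, value);
  out.insert(out.end(), x.begin(), x.end());
  out.insert(out.end(), after, value);
  return out;
}

// Step 2: stencil copy. Adds a "window" dimension so that
// windows[o][k] == x[o * stride + k].
std::vector<std::vector<float>> stencil_copy(const std::vector<float>& x,
                                             std::size_t window,
                                             std::size_t stride) {
  assert(window > 0 && stride > 0);
  std::vector<std::vector<float>> windows;
  for (std::size_t start = 0; start + window <= x.size(); start += stride) {
    const auto first = x.begin() + static_cast<std::ptrdiff_t>(start);
    windows.emplace_back(first, first + static_cast<std::ptrdiff_t>(window));
  }
  return windows;
}

// Step 3: max reduction over the window dimension.
std::vector<float> reduce_max(const std::vector<std::vector<float>>& windows) {
  std::vector<float> out;
  out.reserve(windows.size());
  for (const auto& w : windows) {
    out.push_back(*std::max_element(w.begin(), w.end()));
  }
  return out;
}

// Max pooling expressed purely as a composition of the blocks above.
std::vector<float> max_pooling_1d(const std::vector<float>& x,
                                  std::size_t window, std::size_t stride,
                                  std::size_t padding) {
  const float neg_inf = -std::numeric_limits<float>::infinity();
  return reduce_max(
      stencil_copy(pad(x, padding, padding, neg_inf), window, stride));
}
```

The same three blocks could be recombined for other operators, for example an average pooling that swaps the max reduction for a sum followed by a scale.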
To enable good performance, our strategy is to express as much work as possible as standard subgraph logic (which does not require changes in YNNPACK itself), and then add necessary optimizations to the subgraph to make it perform well. Tactics to improve performance are, in order of preference:
- Improve Slinky and/or YNNPACK to execute subgraph operations with good locality and minimal overhead. This is preferred because these improvements generalize very well and should benefit a wide variety of use cases.
- Optimize the subgraph itself, e.g. fusing common or critical combinations of nodes into one optimized implementation. This may or may not involve adding a new kernel.
- Add a new kind of subgraph node to the public API. This should be rare.
The goal is for subgraph performance optimizations to apply broadly to many use cases. Optimizations that apply narrowly to a specific use case are a "design smell".
YNNPACK's kernel design aims to make it easier to add and change kernels than it was in XNNPACK.
YNNPACK does not require any explicit step to regenerate kernels; they are all described as part of the build process.
Tests, benchmarks, and other uses of kernels are generated automatically from headers describing the kernels that exist. Adding a kernel to one of these headers automatically generates a test and a benchmark, and enables the kernel to be used in production.
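One common way to drive tests and dispatch from a single kernel list is the X-macro pattern, sketched below. Whether YNNPACK's headers use exactly this mechanism is an assumption; the kernel names, the macro names, and the test harness are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two hypothetical kernels that share one signature.
void add_scalar(const float* a, const float* b, float* out, std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
}

void add_unrolled2(const float* a, const float* b, float* out, std::size_t n) {
  std::size_t i = 0;
  for (; i + 2 <= n; i += 2) {
    out[i] = a[i] + b[i];
    out[i + 1] = a[i + 1] + b[i + 1];
  }
  for (; i < n; ++i) out[i] = a[i] + b[i];  // remainder element, if any
}

// The single source of truth. In a real project this list would live in its
// own header; adding one line here registers the kernel everywhere.
#define EXAMPLE_ADD_KERNELS(X) \
  X(add_scalar)                \
  X(add_unrolled2)

using add_kernel_fn = void (*)(const float*, const float*, float*, std::size_t);

struct kernel_entry {
  const char* name;
  add_kernel_fn fn;
};

// Expand the list into a table of {name, function pointer} entries.
#define EXAMPLE_MAKE_ENTRY(name) {#name, name},
inline const std::vector<kernel_entry>& all_add_kernels() {
  static const std::vector<kernel_entry> kernels = {
      EXAMPLE_ADD_KERNELS(EXAMPLE_MAKE_ENTRY)};
  return kernels;
}
#undef EXAMPLE_MAKE_ENTRY

// One generic test body that covers every listed kernel, including sizes
// that exercise the remainder path.
bool test_all_add_kernels() {
  for (const kernel_entry& k : all_add_kernels()) {
    for (std::size_t n = 0; n <= 5; ++n) {
      std::vector<float> a(n, 1.5f), b(n, 2.0f), out(n, 0.0f);
      k.fn(a.data(), b.data(), out.data(), n);
      for (float v : out) {
        if (v != 3.5f) return false;
      }
    }
  }
  return true;
}
```

A benchmark or a runtime dispatch table could expand the same list with a different macro, so a new kernel needs only one added line.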
Kernels are currently generated in a few different ways:
- Where reasonably possible, kernels are generated by C++ template code. Examples of this are the Intel AMX dot kernels, ARM SME(2) dot kernels, reductions, and transpose kernels.
- Some dot kernels are generated by a Python "dot generator" that generates C++ code. Perhaps this could be done with a C++ template approach.
- Most elementwise kernels are generated by an "elementwise compiler" that translates Python code to C++ with SIMD intrinsics.
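For the first approach in the list above, the following is a minimal sketch of what generating kernels from C++ template code can look like: one reduction template is instantiated into several concrete kernels that differ in element type, operation, and unroll factor. This is not YNNPACK's actual kernel code; all names are hypothetical and the sketch uses scalar code where a real kernel would use SIMD.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Generic reduction with Unroll independent accumulators. The compiler
// generates a separate specialized kernel for each instantiation.
template <typename T, typename Op, std::size_t Unroll>
T reduce(const T* x, std::size_t n, T identity, Op op) {
  static_assert(Unroll > 0, "Unroll must be positive");
  T acc[Unroll];
  for (std::size_t u = 0; u < Unroll; ++u) acc[u] = identity;
  std::size_t i = 0;
  for (; i + Unroll <= n; i += Unroll) {
    for (std::size_t u = 0; u < Unroll; ++u) acc[u] = op(acc[u], x[i + u]);
  }
  for (; i < n; ++i) acc[0] = op(acc[0], x[i]);  // remainder elements
  T result = acc[0];
  for (std::size_t u = 1; u < Unroll; ++u) result = op(result, acc[u]);
  return result;
}

struct sum_op {
  template <typename T>
  T operator()(T a, T b) const { return a + b; }
};

struct max_op {
  template <typename T>
  T operator()(T a, T b) const { return a > b ? a : b; }
};

// Concrete kernels "generated" by instantiating the template. Note that for
// floating point sums, multiple accumulators change the rounding order.
inline std::int32_t sum_int32_x4(const std::int32_t* x, std::size_t n) {
  return reduce<std::int32_t, sum_op, 4>(x, n, 0, sum_op{});
}

inline float max_f32_x8(const float* x, std::size_t n) {
  return reduce<float, max_op, 8>(
      x, n, -std::numeric_limits<float>::infinity(), max_op{});
}
```

Adding a variant for a new type or unroll factor is then a one-line instantiation rather than a new hand-written kernel.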
- base/: Basic utilities used throughout the rest of the library.
- include/ynnpack.h: The public API.
- kernels/: Low level kernels implementing the basic building blocks of computations expressible by YNNPACK.
- subgraph/: Implementation of the public API, using kernels/.
- xnnpack/: Compatibility shim that implements the XNNPACK subgraph API.
YNNPACK is currently useful for some use cases, but it is far from a complete XNNPACK alternative at this time. It offers significant performance advantages in some cases, but is a significant regression or not functional at all in others. An incomplete list of the gaps:
- Major features are missing. The most up-to-date status can be found by looking at the XNNPACK compatibility layer implementation.
- Some features may be implemented, but have poor performance due to not-yet-done optimization work.
- Code size, in particular the simplifier in Slinky.
- Initialization time can be large due to subgraph processing cost. There is a lot of low hanging fruit in this area:
  - A lot of redundant work in simplification in Slinky.
  - Use better data structures/algorithms (e.g. replace std::vector with small bounded size arrays for rank data).
  - Basic compiler optimization techniques apply to Slinky, e.g. arena allocating expressions.
  - Find a way to cut down the massive redundancy in models (e.g. subgraph "outlining"?).
- Many ARM kernels exist, but so far most attention has been on x86 kernels.
As this is very much a work in progress, any comments, suggestions, or contributions are welcome.
We are continuously testing XNNPACK subgraph tests and benchmarks that are not
tagged with the no_ynnpack tag:
bazel test --define xnnpack_use_ynnpack=true --test_tag_filters=-no_ynnpack //test/subgraph/... //bench/subgraph/...
Our near term goal is to enable specific users to benefit from YNNPACK improvements by directly using the YNNPACK public API.
Long term, we plan to either bring YNNPACK improvements into XNNPACK, or migrate XNNPACK use in TFLite to YNNPACK. Our goal is to do this without disruption to current XNNPACK users, e.g. via the XNNPACK compatibility layer.