asm_aes
This low-level reference details the AES AssemblyScript source and WASM exports, intended for those auditing, contributing to, or building against the raw module. Most consumers should instead use the TypeScript wrapper or the higher-level AEAD classes.
This module implements AES-128/192/256 as a standalone WebAssembly binary compiled from AssemblyScript. The cipher is bitsliced over WASM v128 SIMD lanes and processes 8 blocks in parallel through a single shared kernel. Five modes ride on top of that kernel: ECB, CBC, CTR, GCM, and GCM-SIV.
Key properties:
- Bitsliced 8-block kernel. One v128 register holds bit k from every byte across all 8 parallel blocks. SubBytes, ShiftRows, MixColumns, and AddRoundKey all run as register-only Boolean circuits with no data-dependent memory access. Käsper-Schwabe 2009 (CHES) §4.1, §4.3, §4.4 + Appendix A.
- Tower-field S-box. Forward and inverse S-boxes are computed with Canright's GF(2^8) tower-field decomposition (Canright 2005). No S-box lookup tables anywhere; the gate-only circuit is constant-time by construction. Forward affine constant 0x63; inverse pre-affine constant 0x7E.
- Boyar-Peralta 113-gate scalar S-box. The byte-level key schedule uses the Boyar-Peralta straight-line program (32 AND, 81 XOR/XNOR gates, depth 27) for sboxByte / sboxWord. Faster than running the bitsliced circuit for the four-byte SubWord step.
- Equivalent Inverse Cipher decrypt path. FIPS 197 §5.3.5. The decrypt round loop mirrors encrypt, and inverse round keys 1..Nr-1 have InvMixColumns pre-applied at key-schedule time so that AddRoundKey reuses the existing structure.
- Static memory only. All buffers are fixed offsets in linear memory. No memory.grow(), no dynamic allocation. Total footprint is 202688 bytes, fitting comfortably in 4 × 64KB pages with 59456 bytes spare.
- Three different counter encodings. Standalone CTR uses 128-bit big-endian (SP 800-38A §F.5). GCM uses a 96-bit fixed J0 prefix with a 32-bit big-endian counter at bytes 12..15 (SP 800-38D §6.5). GCM-SIV uses a 32-bit little-endian counter at bytes 0..3 with the 12-byte nonce at 4..15 (RFC 8452 §4). The three modes share the AES kernel but each owns its own counter loop.
Spec citations: NIST FIPS 197 (final update 2023) §5.1, §5.2, §5.3.5, Appendix B; NIST SP 800-38A §6.2 (CBC), §6.5 (CTR), Appendix B.1 (counter increment); NIST SP 800-38D §6.3 (GF(2^128)), §6.4 (GHASH), §7 (GCM); RFC 8452 §3, §4, Appendix A (POLYVAL, AES-GCM-SIV).
See the TypeScript wrapper for usage-level guidance. This section covers correctness and side-channel posture at the WASM layer.
The S-box is a Boolean circuit on v128 registers. No memory is indexed by secret data inside the kernel; the code path and the memory access pattern are identical regardless of input value. This is algorithm-level constant-time; see architecture.md §Where defense ends for the hardware-level disclaimer.
gf128MulH() uses a 16-entry table indexed by nibbles of the running
state. The state is secret-derived, so the table read is the classic
4-bit-windowed GHASH side-channel surface. Mitigations:
- PCLMULQDQ-style carry-less multiply is not exposed to WebAssembly SIMD, so the table-free schoolbook alternative is too slow for production.
- Callers concerned about side-channel leakage should prefer AESGCMSIVCipher (which uses POLYVAL: same bridge through GHASH, but the per-message authentication key is derived from the master key rather than fixed).
This is the same posture as BoringSSL, OpenSSL, and RustCrypto on pre-PCLMULQDQ paths. See architecture.md §Where defense ends for the canonical disclaimer.
incrementCounter() propagates carry from byte 15 toward byte 0. The
inner loop has an early-exit when no carry occurs (if (b < 256) break).
This is a public-data branch: the counter value is derived from the public
nonce and block position, not from key material. An observer who learns
the counter value gains no information about the key or plaintext.
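The carry loop described above can be sketched in TypeScript as follows. This is a minimal illustration, not the module's AssemblyScript source; the function name `incrementCounterBE` is hypothetical.

```typescript
// Hypothetical stand-in for the module's incrementCounter(), for illustration only.
function incrementCounterBE(counter: Uint8Array): void {
  // Byte 15 is least significant; the carry moves toward byte 0.
  for (let i = 15; i >= 0; i--) {
    const b = counter[i] + 1;
    counter[i] = b & 0xff;
    if (b < 256) break; // no carry: the branch depends only on the public counter
  }
}

const demoCounter = new Uint8Array(16);
demoCounter[14] = 0x01;
demoCounter[15] = 0xff;
incrementCounterBE(demoCounter); // bytes 14..15 are now 0x02, 0x00
```

The early exit leaks only how many low-order counter bytes were 0xFF, which an observer can already compute from the nonce and block position.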
CBC and CTR provide confidentiality only. Without a MAC they are
vulnerable to padding-oracle attacks (CBC), bit-flipping, and chosen-
ciphertext manipulation. Always pair with HMAC (Encrypt-then-MAC) or use
AESGCMSIVCipher instead. The TS wrapper for AESCbc
applies a constant-time PKCS7 padding check that mitigates timing-based
padding oracles, but the fundamental lack of authentication remains.
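For orientation, a branch-free PKCS7 check might look like the sketch below. The wrapper's actual check is not shown in this document, so the function name and structure here are assumptions. JavaScript engines also give no hard constant-time guarantee; the sketch only avoids data-dependent branches and early exits at the source level.

```typescript
// Returns the pad length (1..16) for a valid final block, or 0 if the padding is invalid.
// Assumes lastBlock holds exactly 16 bytes.
function pkcs7PadLength(lastBlock: Uint8Array): number {
  const n = lastBlock[15];
  // bad becomes nonzero if n is 0 or greater than 16.
  let bad = (((n - 1) >>> 8) | ((16 - n) >>> 8)) & 1;
  for (let i = 0; i < 16; i++) {
    const inPad = ((i - n) >>> 8) & 1; // 1 when i < n, else 0
    bad |= -inPad & (lastBlock[15 - i] ^ n); // every pad byte must equal n
  }
  const ok = ((bad - 1) >>> 31) & 1; // 1 only when bad === 0
  return n & -ok;
}
```

All 16 bytes are inspected on every call, so the running time does not reveal where the first mismatching pad byte sits.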
wipeBuffers() zeroes every buffer declared in buffers.ts: master key,
round-key schedule, inverse round keys, bitsliced state, S-box scratch,
key-schedule scratch, atomic and 8-block I/O buffers, chunk buffers,
nonce, counter, CBC IV, GCM hash subkey H, J0, GHASH accumulator, tag,
J0E pad, length encoding, scratch, GCTR counter, GF128 table, AAD, and
all SIV state. Key material does not persist in WASM linear memory after
an operation completes.
All exported functions are re-exported through src/asm/aes/index.ts.
These return the byte offset of each buffer region in linear memory. The TypeScript wrapper uses them to write inputs and read outputs.
function getModuleId(): i32 // 1
function getKeyOffset(): i32 // 0
function getBlockPtOffset(): i32 // 32
function getBlockCtOffset(): i32 // 48
function getBlockPt8xOffset(): i32 // 64
function getBlockCt8xOffset(): i32 // 192
function getRoundKeysOffset(): i32 // 320
function getBitslicedStateOffset(): i32 // 2240
function getCanrightScratchOffset(): i32 // 2368
function getKeyScheduleScratchOffset(): i32 // 3392
function getInvRoundKeysOffset(): i32 // 3648
function getChunkPtOffset(): i32 // 5568
function getChunkCtOffset(): i32 // 71104
function getNrOffset(): i32 // 136640
function getNonceOffset(): i32 // 136656
function getCounterOffset(): i32 // 136672
function getCbcIvOffset(): i32 // 136688
function getHOffset(): i32 // 136704
function getJ0Offset(): i32 // 136720
function getGhashAccOffset(): i32 // 136736
function getTagOffset(): i32 // 136752
function getGf128TableOffset(): i32 // 136832
function getAadOffset(): i32 // 137088
function getAadBufferSize(): i32 // 65536
function getPolyvalAuthKeyOffset(): i32 // 202624
function getPolyvalEncKeyOffset(): i32 // 202640
function getSivIcOffset(): i32 // 202672
function getChunkSize(): i32 // 65536
function getMemoryPages(): i32
getModuleId() returns 1 (the AES slot in the loader registry; serpent
is 0). getMemoryPages() returns the current WASM linear memory size
in 64 KB pages, expected 4 for AES.
function loadKey(keyLen: i32): i32
Reads keyLen bytes from KEY_BUFFER (offset 0) and runs the FIPS 197
§5.2 Algorithm 2 key schedule, parameterized on Nk ∈ {4, 6, 8}. The
AES-256 extra-SubWord branch fires when Nk > 6 && i mod Nk == 4.
Two parallel buffers are populated:
- ROUND_KEYS_BUFFER holds the forward round keys, pre-transposed to bitsliced form (Käsper-Schwabe §4.5; each round key is 8 × v128 = 128 bytes so that AddRoundKey is 8 plain v128 XORs).
- INV_ROUND_KEYS_BUFFER holds the EqInvCipher inverse round keys: round 0 and round Nr are copies of the forward keys; rounds 1..Nr-1 have InvMixColumns pre-applied (FIPS 197 §5.3.5).
The round count Nr (10, 12, or 14) is stored at NR_OFFSET and read by
the encrypt/decrypt round loops on every call.
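The relationship between key length, Nk, Nr, and the bitsliced round-key footprint can be written out as a small sketch. The helper name `aesParams` is hypothetical; the arithmetic follows FIPS 197 (Nr = Nk + 6) and the 128-byte bitsliced round key described above.

```typescript
// Hypothetical helper: derives Nk, Nr, and the bitsliced round-key size from the key length.
function aesParams(keyLen: number): { nk: number; nr: number; roundKeyBytes: number } | null {
  if (keyLen !== 16 && keyLen !== 24 && keyLen !== 32) return null;
  const nk = keyLen / 4; // key length in 32-bit words: 4, 6, or 8
  const nr = nk + 6; // 10, 12, or 14 rounds
  const roundKeyBytes = (nr + 1) * 8 * 16; // Nr + 1 round keys, each 8 × v128
  return { nk, nr, roundKeyBytes };
}

const aes256Params = aesParams(32); // nr = 14, roundKeyBytes = 1920
```

AES-256 fills the whole 1920-byte ROUND_KEYS_BUFFER, while AES-128 uses only the first 11 × 128 = 1408 bytes.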
The GCM hash subkey H = AES_ENC(K, 0^128) is computed and cached at
H_OFFSET, and the GF128_TABLE (16 × 16 bytes for the 4-bit windowed
multiplier) is built from H here too. Per-loadKey work is amortized
across every subsequent GCM call until the next loadKey.
- keyLen: 16, 24, or 32 (AES-128 / 192 / 256)
- Returns: 0 on success, -1 if keyLen is invalid
Must be called before any encrypt/decrypt operation.
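A caller building against the raw module might load a key roughly as sketched below. The `AesExports` interface, the `memory` export name, and the `loadAesKey` helper are assumptions made for illustration; only `getKeyOffset` and `loadKey` come from this reference. A stand-in object replaces the real module so the sketch runs on its own.

```typescript
// Hypothetical subset of the instantiated module's exports.
interface AesExports {
  memory: { buffer: ArrayBuffer }; // assumed export name for linear memory
  getKeyOffset(): number;
  loadKey(keyLen: number): number;
}

function loadAesKey(wasm: AesExports, key: Uint8Array): void {
  // Write at most 32 bytes into KEY_BUFFER, then run the key schedule.
  new Uint8Array(wasm.memory.buffer, wasm.getKeyOffset(), 32).set(key.subarray(0, 32));
  if (wasm.loadKey(key.length) !== 0) {
    throw new RangeError(`unsupported AES key length: ${key.length}`);
  }
}

// Stand-in for the real module, NOT the actual WASM binary.
const fakeAes: AesExports = {
  memory: { buffer: new ArrayBuffer(4 * 65536) },
  getKeyOffset: () => 0,
  loadKey: (n) => (n === 16 || n === 24 || n === 32 ? 0 : -1),
};
loadAesKey(fakeAes, new Uint8Array(32).fill(0xaa));
```

Turning the -1 return into an exception keeps the raw error code from leaking into higher layers.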
function encryptBlock(): void
function decryptBlock(): void
Atomic single-block encrypt/decrypt. Reads from BLOCK_PT_BUFFER, writes
to BLOCK_CT_BUFFER (encrypt) or vice versa (decrypt). FIPS 197 §5.1
(Algorithm 1) and §5.3.5 (Equivalent Inverse Cipher). Internally
broadcasts the single block across all 8 lanes of the bitsliced kernel
and discards the redundant outputs. The 8-wide kernel is the only
implementation; the atomic exports are convenience wrappers.
function encryptBlock_8x(): void
function decryptBlock_8x(): void
8-parallel-block encrypt/decrypt. Reads 8 blocks from
BLOCK_PT_8X_BUFFER (128 bytes), writes 8 blocks to BLOCK_CT_8X_BUFFER
(or vice versa). This is the primary kernel; CTR/CBC/GCM SIMD paths all
call it directly. Inputs and outputs are in plain (non-bitsliced) byte
order; the kernel handles the transpose internally.
function transposeRoundTrip(): void
function sboxRoundTrip(): void
function sboxWordExport(...): u32
function singleRound(roundIdx: i32): void
Debug-only exports used by gate tests. transposeRoundTrip exercises the
8×8 bit transpose forward then inverse and asserts the result is
bit-for-bit identical. sboxRoundTrip does the same for the bitsliced
S-box. sboxWordExport exposes the Boyar-Peralta scalar SubWord
implementation. singleRound runs one forward round at a given index for
spec-vector cross-checking.
function cbcEncryptChunk(len: i32): i32
CBC-encrypts len bytes from CHUNK_PT_BUFFER to CHUNK_CT_BUFFER.
Scalar loop because each ciphertext block depends on the previous one.
- len: a positive multiple of 16, at most 65536
- Returns: len on success, -1 if len is invalid
Chaining: C[i] = E_K(P[i] XOR C[i-1]), where C[-1] is the IV at
CBC_IV_BUFFER. The IV buffer is updated to the last ciphertext block
on return for streaming across multiple chunk calls.
function cbcDecryptChunk(len: i32): i32
function cbcDecryptChunk_simd(len: i32): i32
CBC-decrypts len bytes from CHUNK_CT_BUFFER to CHUNK_PT_BUFFER. The
SIMD variant batches 8 blocks per iteration through decryptBlock_8x,
falling back to the scalar cbcDecryptChunk path for the trailing 1..7
blocks. The TS wrapper always calls cbcDecryptChunk_simd.
Same chaining and IV-update semantics as the encrypt path. PKCS7 padding is the caller's responsibility.
Note
CBC encryption has no SIMD variant. Each ciphertext block depends on the previous one, so blocks cannot be parallelized. Decryption is fully parallelizable because all ciphertext blocks are available up front.
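The asymmetry between the two directions is easiest to see with the block cipher abstracted away. The sketch below is an illustration only: `toyEnc` and `toyDec` are a toy invertible byte permutation, not AES, and every name is hypothetical.

```typescript
type BlockFn = (block: Uint8Array) => Uint8Array;

function xor16(a: Uint8Array, b: Uint8Array): Uint8Array {
  const out = new Uint8Array(16);
  for (let i = 0; i < 16; i++) out[i] = a[i] ^ b[i];
  return out;
}

// Sequential: C[i] = E(P[i] XOR C[i-1]) needs the previous ciphertext block.
function cbcEncrypt(pt: Uint8Array[], iv: Uint8Array, enc: BlockFn): Uint8Array[] {
  const ct: Uint8Array[] = [];
  let prev = iv;
  for (const p of pt) {
    prev = enc(xor16(p, prev));
    ct.push(prev);
  }
  return ct;
}

// Parallelizable: every D(C[i]) is independent, so the calls can be batched.
function cbcDecrypt(ct: Uint8Array[], iv: Uint8Array, dec: BlockFn): Uint8Array[] {
  const raw = ct.map((c) => dec(c));
  return raw.map((d, i) => xor16(d, i === 0 ? iv : ct[i - 1]));
}

// Toy invertible permutation standing in for the block cipher. NOT a cipher.
const toyEnc: BlockFn = (b) => b.map((x, i) => (x + i + 1) & 0xff);
const toyDec: BlockFn = (b) => b.map((x, i) => (x - i - 1) & 0xff);
```

In `cbcDecrypt`, the first `map` corresponds to the work that the SIMD variant pushes through decryptBlock_8x eight blocks at a time; the XOR with the previous ciphertext block happens afterwards.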
function resetCounter(): void
function setCounter(hi: i64, lo: i64): void
resetCounter() copies NONCE_BUFFER to COUNTER_BUFFER; the nonce is
the initial 128-bit counter block. setCounter(hi, lo) writes the
counter as two 64-bit big-endian halves; used by worker pools to position
each worker at a non-overlapping range without going through NONCE_BUFFER.
function encryptChunk(chunkLen: i32): i32
function decryptChunk(chunkLen: i32): i32
function encryptChunk_simd(chunkLen: i32): i32
function decryptChunk_simd(chunkLen: i32): i32
CTR-encrypt/decrypt chunkLen bytes from CHUNK_PT_BUFFER to
CHUNK_CT_BUFFER. CTR is symmetric: decryptChunk delegates to
encryptChunk and decryptChunk_simd to encryptChunk_simd. The TS
wrapper always calls the SIMD variant.
- chunkLen: 1 to 65536
- Returns: chunkLen on success, -1 if chunkLen is out of range
The counter is 128-bit big-endian (SP 800-38A §F.5). Byte 15 is the least-significant byte; carry propagates toward byte 0. Counter state persists across calls for streaming.
The SIMD path generates 8 keystream blocks per iteration through
encryptBlock_8x, falling back to scalar for the trailing 1..7 blocks.
function gcmStart(ivLen: i32, aadLen: i32): i32
Initialize a GCM seal/open call. Derives J0 from the IV (12-byte fast
path: J0 = IV || 0x00000001; other lengths trigger a GHASH-based
derivation pass). Computes J0E = E_K(J0) for tag XOR. Resets the GHASH
accumulator. Absorbs AAD_BUFFER[0..aadLen] into GHASH. Initializes the
GCTR working counter at GCM_CB_BUFFER to inc_32(J0). Resets the
running CT-byte length.
- ivLen: 1 to 65536
- aadLen: 0 to 65536
- Returns: 0 on success, -1 on invalid lengths
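The 12-byte fast path and the inc_32 step can be sketched as below. This is an illustration under the definitions in SP 800-38D, not the module's source; the function names are hypothetical and the GHASH-based path for other IV lengths is omitted.

```typescript
// J0 = IV || 0x00000001 for a 96-bit IV.
function j0From96BitIv(iv: Uint8Array): Uint8Array {
  if (iv.length !== 12) throw new RangeError("fast path needs a 12-byte IV");
  const j0 = new Uint8Array(16);
  j0.set(iv, 0);
  j0[15] = 0x01;
  return j0;
}

// inc_32: increment the rightmost 32 bits as a big-endian integer, modulo 2^32.
// The leftmost 96 bits are left untouched.
function inc32(block: Uint8Array): Uint8Array {
  const out = block.slice(0, 16);
  const view = new DataView(out.buffer);
  view.setUint32(12, (view.getUint32(12, false) + 1) >>> 0, false);
  return out;
}

const demoJ0 = j0From96BitIv(new Uint8Array(12).fill(0xab));
const firstGctrBlock = inc32(demoJ0); // bytes 12..15 are 00 00 00 02
```

J0 itself is reserved for the tag pad E_K(J0), so the first keystream block always comes from inc_32(J0).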
function gcmEncryptChunk(srcOff: i32, dstOff: i32, len: i32): i32
GCTR-encrypt len bytes from srcOff to dstOff, then absorb the
ciphertext into GHASH and advance the CT-byte counter. The GCTR counter
format is distinct from standalone CTR: the leftmost 96 bits are fixed
from J0, the rightmost 32 bits are big-endian and increment per block
(inc_32).
- Returns: 0 on success, -1 on length error or 32-bit counter overflow
function gcmAbsorbCtChunk(srcOff: i32, len: i32): i32
Absorb len bytes of ciphertext at srcOff into GHASH without
decrypting. Used by the open direction's verify-before-decrypt pass.
function gcmDecryptChunk(srcOff: i32, dstOff: i32, len: i32): i32
GCTR-decrypt len bytes from srcOff to dstOff. Does not absorb into
GHASH; that work was done by gcmAbsorbCtChunk during the verify pass.
The counter must be re-initialized to inc_32(J0) first via
gcmResetCtrToJ0Plus1().
function gcmResetCtrToJ0Plus1(): void
Reset the GCTR working counter to inc_32(J0). Used between the
absorb-CT pass and the decrypt pass for verify-before-decrypt.
function gcmFinalize(): void
Absorb the final length-encoding block (AAD bit-length || CT bit-length,
both u64 big-endian) into GHASH, XOR the result with J0E, and store
the 128-bit tag at TAG_OFFSET. The TS layer reads the computed tag and
routes the constant-time compare against the received tag through
constantTimeEqual in src/ts/utils.ts (the dedicated cte WASM
module). As a matter of library policy, no AEAD compares tags inside its own module.
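The final length-encoding block can be illustrated as follows. The helper name `gcmLengthBlock` is hypothetical; it takes byte counts and writes the two bit-lengths as big-endian u64 values, using a pair of 32-bit writes each.

```typescript
// Builds AAD bit-length || CT bit-length, each as a big-endian u64.
function gcmLengthBlock(aadBytes: number, ctBytes: number): Uint8Array {
  const out = new Uint8Array(16);
  const view = new DataView(out.buffer);
  const put = (off: number, bytes: number): void => {
    const bits = bytes * 8; // exact in a double for any length GCM permits
    view.setUint32(off, Math.floor(bits / 0x100000000), false); // high 32 bits
    view.setUint32(off + 4, bits >>> 0, false); // low 32 bits
  };
  put(0, aadBytes);
  put(8, ctBytes);
  return out;
}

const demoLens = gcmLengthBlock(20, 65536); // 160 bits of AAD, 524288 bits of ciphertext
```

Note the unit change: the module tracks a running CT-byte length, but the encoded block carries bit lengths.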
Note
Plaintext is bounded by SP 800-38D §5.2.1.1 at 2^36 - 32 bytes per
(key, IV) pair. The 32-bit GCTR counter spans at most 2^32 - 2
increments, each block is 16 bytes, so the maximum is
16 · (2^32 - 2) = 2^36 - 32 bytes. gcmEncryptChunk rejects when
the cumulative block count would push the counter past the wrap point.
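The bound is small enough to check with ordinary double-precision arithmetic. The `wouldOverflow` helper below is a hypothetical mirror of the rejection rule, not the module's actual check.

```typescript
const BLOCK_BYTES = 16;
const MAX_GCTR_BLOCKS = 2 ** 32 - 2; // usable increments of the 32-bit counter
const MAX_PLAINTEXT_BYTES = BLOCK_BYTES * MAX_GCTR_BLOCKS; // 2^36 - 32

// Hypothetical guard: true when another chunk would push the counter past the wrap point.
function wouldOverflow(blocksSoFar: number, chunkBytes: number): boolean {
  const chunkBlocks = Math.ceil(chunkBytes / BLOCK_BYTES);
  return blocksSoFar + chunkBlocks > MAX_GCTR_BLOCKS;
}
```

At 65536 bytes per chunk, a caller would need roughly a million full chunks under one (key, IV) pair before this guard fires.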
function ghashStart(): void
function ghashAbsorbBlock(srcOff: i32): void
function ghashAbsorbWithLen(srcOff: i32, len: i32): void
function ghashFinalize(aadBits: i64, ctBits: i64): void
Standalone GHASH primitive (NIST SP 800-38D §6.4). Exported for Gate 12
testing. ghashStart zeroes the accumulator. ghashAbsorbBlock absorbs
exactly 16 bytes. ghashAbsorbWithLen absorbs len bytes (full blocks
plus a zero-padded tail if needed). ghashFinalize absorbs the final
length-encoding block constructed from aadBits and ctBits (each as
u64 big-endian).
The accumulator at GHASH_ACC_BUFFER is shared with POLYVAL; the two
modes are mutually exclusive at runtime.
function gf128InitTable(): void
function gf128MulH(): void
function byteReverse16(srcOff: i32, dstOff: i32): void
function mulXGhash(srcOff: i32, dstOff: i32): void
gf128InitTable builds the 16-entry 4-bit windowed multiply table at
GF128_TABLE_BUFFER from H at H_OFFSET. Convention: bit 3 of the
nibble index → u^0 coefficient, descending to bit 0 → u^3.
gf128MulH multiplies the GHASH accumulator at GHASH_ACC_BUFFER by H
in place using the table.
byteReverse16 and mulXGhash are helpers for the GHASH↔POLYVAL bridge
described in RFC 8452 Appendix A: byteReverse16 reverses byte order
in a 16-byte string, mulXGhash multiplies a 16-byte block by u in
the GHASH field.
The reduction polynomial is u^128 + u^7 + u^2 + u + 1. Storage
convention (SP 800-38D §6.3): bit 7 (MSB) of byte 0 is the u^0
coefficient; bit 0 (LSB) of byte 15 is the u^127 coefficient.
function polyvalStart(authKeyOff: i32): void
function polyvalAbsorbBlock(srcOff: i32): void
function polyvalAbsorbWithLen(srcOff: i32, len: i32): void
function polyvalFinalize(aadBits: i64, ctBits: i64): void
POLYVAL universal hash (RFC 8452 §3, Appendix A). Implemented as a
reflection wrapper around GHASH. Per-call setup byte-reverses the
provided auth key, applies mulXGhash, and feeds the result to
gf128InitTable. Per-block absorption byte-reverses the input into
GHASH bit convention before XOR-and-multiply. polyvalFinalize
byte-reverses the accumulator back to POLYVAL byte order.
The accumulator and table buffers alias the GHASH equivalents; only one mode can be active at a time.
POLYVAL and GHASH are sibling universal hashes over GF(2¹²⁸). They
differ in reduction polynomial and bit-within-byte convention. The
library implements POLYVAL as a reflection wrapper over the existing
gf128MulH multiplier rather than shipping a parallel POLYVAL-native
multiplier.
| Property | GHASH (SP 800-38D) | POLYVAL (RFC 8452 §3) |
|---|---|---|
| Reduction polynomial | u¹²⁸ + u⁷ + u² + u + 1 | u¹²⁸ + u¹²⁷ + u¹²⁶ + u¹²¹ + 1 |
| Bit-within-byte ordering | bit 7 of byte 0 = u⁰ | bit 0 of byte 0 = u⁰ |
| AEAD home | AES-GCM (RFC 5288) | AES-GCM-SIV (RFC 8452) |
RFC 8452 Appendix A gives the bridge formula:
POLYVAL(H, X_1..n) = ByteReverse(GHASH(mulX_GHASH(ByteReverse(H)),
ByteReverse(X_1..n)))
ByteReverse reverses byte order in a 16-byte string and is applied to
each block X_i separately. RFC 8452 Appendix A:
"the differing interpretations of bit order takes care of reversing
the bits within each byte, and then reversing the bytes does the
rest." The within-byte bit flip is free; the implementation needs only
the byte-reverse and the mulX_GHASH adjustment.
The implementation takes path (a), the reflection wrapper:
- Per-SIV-operation setup. Byte-reverse the POLYVAL authentication key, apply mulX_GHASH (defined in gf128.ts as mulXGhash), and feed the result to gf128InitTable. The GF128 table then multiplies by mulX_GHASH(ByteReverse(H)) in GHASH bit convention.
- Per-block absorption. Byte-reverse the block into GHASH bit convention, XOR into the running accumulator, multiply by H.
- polyvalFinalize. Byte-reverse the accumulator back to POLYVAL bit convention.
The two helpers required are byteReverse16 and mulXGhash, both in
gf128.ts. The existing GF(2¹²⁸) primitive is unchanged.
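The two helpers and the key-setup step can be sketched on plain byte arrays as below. This is an illustration, not the code in gf128.ts: the real exports work on linear-memory offsets, and `polyvalKeyForGhash`, `toHex`, and `fromHex` are hypothetical names.

```typescript
function byteReverse16(src: Uint8Array): Uint8Array {
  return src.slice(0, 16).reverse();
}

// Multiply by u in the GHASH field. In GHASH storage the u^127 coefficient is the
// LSB of byte 15, so "times u" is a one-bit right shift across the whole block.
function mulXGhash(src: Uint8Array): Uint8Array {
  const out = new Uint8Array(16);
  let carry = 0;
  for (let i = 0; i < 16; i++) {
    out[i] = (src[i] >> 1) | (carry << 7);
    carry = src[i] & 1;
  }
  // u^128 reduces to u^7 + u^2 + u + 1, which is 0xE1 in byte 0 under this convention.
  out[0] ^= -carry & 0xe1;
  return out;
}

// Setup step: the value the GF128 table is built from.
function polyvalKeyForGhash(h: Uint8Array): Uint8Array {
  return mulXGhash(byteReverse16(h));
}

const toHex = (b: Uint8Array): string =>
  Array.from(b, (x) => x.toString(16).padStart(2, "0")).join("");
const fromHex = (s: string): Uint8Array =>
  Uint8Array.from(s.match(/../g) ?? [], (p) => parseInt(p, 16));

// Sample 16-byte key, chosen for illustration.
const ghashKey = polyvalKeyForGhash(fromHex("25629347589242761d31f826ba4b757b"));
// toHex(ghashKey) === "dcbaa5dd137c188ebb21492c23c9b112"
```

The reduction is applied with a mask rather than a branch, in keeping with the module's constant-time posture.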
Path (b), a POLYVAL-native multiplier with reduction byte 0x87 in
LSB-first storage, was rejected. Path (b) would have added ~250 lines
of parallel multiplier and a second 256-byte table for no algorithmic
benefit on a runtime that lacks PCLMULQDQ (the carry-less-multiply
instruction that makes a native POLYVAL multiplier competitive on x86).
The POLYVAL accumulator aliases GHASH_ACC_OFFSET. POLYVAL and
GHASH are runtime-exclusive (an AEAD operation picks one and runs it
to completion), so the alias is safe and saves 16 bytes of layout.
GCM_SCRATCH_OFFSET doubles as the 16-byte scratch for the per-block
byte-reverse.
Note
Path (a) matches what BoringSSL, OpenSSL, and RustCrypto ship for their pre-PCLMULQDQ paths. PCLMULQDQ is not available in WebAssembly SIMD, and table-free schoolbook GF(2¹²⁸) is too slow for production use; the sandbox mitigates direct cross-process cache observation, but full mitigation would require CPU carry-less-multiply support or hardware-tied AES-GCM-SIV.
function sivDeriveKeys(nonceOff: i32): void
RFC 8452 §4 derive_keys. Encrypts 4 (AES-128) or 6 (AES-256) counter
blocks under the already-loaded master key. The counter is a 32-bit
little-endian uint at bytes 0..3 of the input block; bytes 4..15 are the
12-byte nonce read from nonceOff. The first 8 bytes of each encrypted
output are concatenated to form POLYVAL_AUTH_KEY (16 bytes) and
POLYVAL_ENC_KEY (16 or 32 bytes).
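The input blocks for derive_keys can be sketched as follows. The function name `deriveKeyInputBlocks` is hypothetical, and the AES encryption of each block is left out; only the block layout from RFC 8452 §4 is shown.

```typescript
// Builds the counter blocks that derive_keys encrypts under the master key.
function deriveKeyInputBlocks(nonce: Uint8Array, keyLen: 16 | 32): Uint8Array[] {
  if (nonce.length !== 12) throw new RangeError("nonce must be 12 bytes");
  const count = keyLen === 16 ? 4 : 6;
  const blocks: Uint8Array[] = [];
  for (let i = 0; i < count; i++) {
    const block = new Uint8Array(16);
    new DataView(block.buffer).setUint32(0, i, true); // 32-bit little-endian counter
    block.set(nonce, 4); // nonce occupies bytes 4..15
    blocks.push(block);
  }
  return blocks;
}

const demoBlocks = deriveKeyInputBlocks(new Uint8Array(12).fill(0x11), 32); // 6 blocks
```

Blocks 0 and 1 yield the 16-byte authentication key; the remaining two (AES-128) or four (AES-256) blocks yield the encryption key, 8 bytes per encrypted block.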
function sivSeal(aadLen: i32, ptLen: i32): void
Loads POLYVAL_ENC_KEY as the AES round-key schedule. Runs POLYVAL over
padded(AAD) || padded(PT) || length-block. Builds the tag by XORing
the POLYVAL output with the nonce, masking, and AES-encrypting under the
encryption key. SIV-CTR-encrypts CHUNK_PT_BUFFER in place. After
return: tag at TAG_OFFSET, ciphertext at CHUNK_PT_OFFSET.
function sivOpen(aadLen: i32, ctLen: i32): void
Loads POLYVAL_ENC_KEY as the round-key schedule. Builds the initial
CTR block from the provided tag (the TS layer writes it to
SIV_IC_OFFSET first). SIV-CTR-decrypts CHUNK_CT_BUFFER →
CHUNK_PT_BUFFER. Runs POLYVAL over the decrypted plaintext with the
AAD and length block. Builds the expected tag at TAG_OFFSET. Does not
compare: the TS layer reads the expected tag and routes the
constant-time compare through constantTimeEqual.
function sivWipeOnFail(): void
Belt-and-suspenders cleanup for the failed-open path. Zeroes everything
that could carry plaintext or auth-key material: full
CHUNK_PT_BUFFER (64 KiB), POLYVAL accumulator, derived per-message
keys, the GF128 table built from the auth key, the SIV counter, and the
tag scratch.
Note
The SIV-CTR counter format is the third distinct counter encoding in the module. RFC 8452 §4 puts a 32-bit little-endian counter at bytes 0..3 of the 16-byte block, with the 12-byte nonce at bytes 4..15. This is materially different from GCM (96-bit fixed prefix + 32-bit big-endian counter at bytes 12..15) and from standalone CTR (full 128-bit big-endian counter). The three modes share the AES kernel but each owns its counter loop.
function wipeBuffers(): void
Zeroes every buffer declared in buffers.ts. The TypeScript wrapper
calls this in dispose().
All buffers are static, starting at offset 0. Total footprint: 202688 bytes (< 262144 = 4 × 64KB pages, with 59456 bytes spare).
| Offset | Size (bytes) | Name | Purpose |
|---|---|---|---|
| 0 | 32 | KEY_BUFFER | Master key (sized for AES-256) |
| 32 | 16 | BLOCK_PT_BUFFER | Atomic 1-block input |
| 48 | 16 | BLOCK_CT_BUFFER | Atomic 1-block output |
| 64 | 128 | BLOCK_PT_8X_BUFFER | 8 parallel plaintext blocks |
| 192 | 128 | BLOCK_CT_8X_BUFFER | 8 parallel ciphertext blocks |
| 320 | 1920 | ROUND_KEYS_BUFFER | 15 × 8 × 16 bitsliced forward round keys |
| 2240 | 128 | BITSLICED_STATE_BUFFER | 8 × v128 AES state (Käsper-Schwabe layout) |
| 2368 | 1024 | CANRIGHT_SCRATCH_BUFFER | 64 v128 scratch slots for the tower-field S-box |
| 3392 | 256 | KEY_SCHEDULE_SCRATCH_BUFFER | Byte-level scratch during keyExpansion |
| 3648 | 1920 | INV_ROUND_KEYS_BUFFER | EqInvCipher decrypt round keys |
| 5568 | 65536 | CHUNK_PT_BUFFER | Bulk plaintext / SIV in-place |
| 71104 | 65536 | CHUNK_CT_BUFFER | Bulk ciphertext |
| 136640 | 1 | NR_BUFFER | Round count: 10 / 12 / 14 (u8) |
| 136656 | 16 | NONCE_BUFFER | CTR initial counter / SIV nonce |
| 136672 | 16 | COUNTER_BUFFER | CTR working counter (128-bit big-endian) |
| 136688 | 16 | CBC_IV_BUFFER | CBC chaining block |
| 136704 | 16 | H_BUFFER | GCM hash subkey H = E_K(0^128) |
| 136720 | 16 | J0_BUFFER | GCM pre-counter block |
| 136736 | 16 | GHASH_ACC_BUFFER | GHASH / POLYVAL running accumulator |
| 136752 | 16 | TAG_BUFFER | GCM / GCM-SIV authentication tag scratch |
| 136768 | 16 | J0E_BUFFER | E_K(J0) pad |
| 136784 | 16 | GCM_LENS_BUFFER | AAD/PT bit-length state (two u64 BE) |
| 136800 | 16 | GCM_SCRATCH_BUFFER | Partial-block tail scratch |
| 136816 | 16 | GCM_CB_BUFFER | GCTR working counter (96-bit fixed + 32-bit BE) |
| 136832 | 256 | GF128_TABLE_BUFFER | 4-bit windowed multiply table (16 × 16) |
| 137088 | 65536 | AAD_BUFFER | GCM additional authenticated data |
| 202624 | 16 | POLYVAL_AUTH_KEY_BUFFER | SIV per-message auth key (RFC 8452 §4) |
| 202640 | 32 | POLYVAL_ENC_KEY_BUFFER | SIV per-message encryption key (sized for AES-256) |
| 202672 | 16 | SIV_IC_BUFFER | SIV initial counter / scratch for provided tag |
| 202688 | | END | Total < 262144 (4 pages) |
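The layout arithmetic can be re-checked mechanically. The sketch below copies the offsets and sizes from the table into a list and confirms that no region overlaps its predecessor and that 59456 bytes remain in 4 pages. The `spareBytes` helper is hypothetical and not part of buffers.ts.

```typescript
// [name, offset, size], copied from the memory layout table and sorted by offset.
const regions: Array<[string, number, number]> = [
  ["KEY_BUFFER", 0, 32],
  ["BLOCK_PT_BUFFER", 32, 16],
  ["BLOCK_CT_BUFFER", 48, 16],
  ["BLOCK_PT_8X_BUFFER", 64, 128],
  ["BLOCK_CT_8X_BUFFER", 192, 128],
  ["ROUND_KEYS_BUFFER", 320, 1920],
  ["BITSLICED_STATE_BUFFER", 2240, 128],
  ["CANRIGHT_SCRATCH_BUFFER", 2368, 1024],
  ["KEY_SCHEDULE_SCRATCH_BUFFER", 3392, 256],
  ["INV_ROUND_KEYS_BUFFER", 3648, 1920],
  ["CHUNK_PT_BUFFER", 5568, 65536],
  ["CHUNK_CT_BUFFER", 71104, 65536],
  ["NR_BUFFER", 136640, 1],
  ["NONCE_BUFFER", 136656, 16],
  ["COUNTER_BUFFER", 136672, 16],
  ["CBC_IV_BUFFER", 136688, 16],
  ["H_BUFFER", 136704, 16],
  ["J0_BUFFER", 136720, 16],
  ["GHASH_ACC_BUFFER", 136736, 16],
  ["TAG_BUFFER", 136752, 16],
  ["J0E_BUFFER", 136768, 16],
  ["GCM_LENS_BUFFER", 136784, 16],
  ["GCM_SCRATCH_BUFFER", 136800, 16],
  ["GCM_CB_BUFFER", 136816, 16],
  ["GF128_TABLE_BUFFER", 136832, 256],
  ["AAD_BUFFER", 137088, 65536],
  ["POLYVAL_AUTH_KEY_BUFFER", 202624, 16],
  ["POLYVAL_ENC_KEY_BUFFER", 202640, 32],
  ["SIV_IC_BUFFER", 202672, 16],
];

// Returns the unused bytes below `pages` WASM pages; throws if two regions overlap.
function spareBytes(layout: Array<[string, number, number]>, pages: number): number {
  let end = 0;
  for (const [name, offset, size] of layout) {
    if (offset < end) throw new Error(`${name} overlaps the previous region`);
    end = offset + size;
  }
  return pages * 65536 - end;
}
```

The only gap inside the layout follows NR_BUFFER, which holds a single byte but is followed by NONCE_BUFFER 16 bytes later.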
Two design notes:
- Bitsliced round keys are 128 bytes per round, not 16. Käsper-Schwabe §4.5: each AES round key is pre-transposed to bitsliced form so that AddRoundKey is 8 plain v128 XORs. The 16 round-key bytes duplicate across the 8 parallel blocks (since all 8 blocks share one key schedule), then transpose, yielding 8 × v128 = 128 bytes per bitsliced round key.
- GHASH_ACC_BUFFER doubles as the POLYVAL accumulator. GHASH and POLYVAL are mutually exclusive at runtime under the atomic AEAD pattern. The alias is safe and saves 16 bytes of layout. The GF128_TABLE_BUFFER is similarly shared.
Key schedule scratch is dedicated, not piggy-backed. An earlier
layout placed the byte-level scratch at ROUND_KEYS_OFFSET + 1408,
the gap above the AES-128 round keys. At AES-256 the schedule
expands to 15 × 128 = 1920 bytes of bitsliced round keys; that
gap vanishes and the scratch collides with rounds 11-13. A dedicated
256-byte region (KEY_SCHEDULE_SCRATCH_BUFFER, 240 bytes used by
AES-256, padded to 256) keeps the round-key buffer purely about
round keys.
Inverse round keys live in a parallel buffer. FIPS 197 §5.3.5
Equivalent Inverse Cipher requires round keys 1..Nr - 1 to be
InvMixColumns-transformed for decrypt; encrypt needs the
untransformed forward round keys. Storing both sets in parallel
buffers lets a single AES instance support both directions without
per-call key-schedule work. Cost is one InvMixColumns per round
key, paid once at loadKey() time.
Defines the static memory layout as i32 constants and getter functions.
No logic. Pure layout declaration. All other modules import offsets from
here.
Bitsliced AES S-box (forward + inverse) using the Canright tower-field
decomposition. Operates in place on 8 v128 registers in
BITSLICED_STATE_OFFSET; sub-results are spilled to
CANRIGHT_SCRATCH_OFFSET (64 v128 scratch slots).
The forward circuit is s = (M·X) · gf256_inv(X⁻¹·a) ⊕ b, where:
- X is the standard-basis representation of the tower basis (Y, Z, W tensor products derived from Canright §2.1's basis polynomials)
- M is the AES affine matrix
- b = 0x63 is the AES affine constant (FIPS 197 §5.1.1)
The inverse circuit is a = X · gf256_inv((X⁻¹·M⁻¹)·s ⊕ X⁻¹·M⁻¹·b),
with X⁻¹·M⁻¹·b = 0x7E precomputed. The GF(2^8) inversion kernel is
its own inverse and is shared between forward and inverse S-box; only
the front and back basis-change matrices differ.
GF(2^4) operations are Karatsuba-style compositions of GF(2^2) multiplications. No tables, no data-dependent memory access.
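When cross-checking a gate-level circuit, it helps to have the textbook definition of the S-box at hand: field inversion in GF(2^8) followed by the affine map with constant 0x63. The sketch below is that plain reference definition, not the Canright tower-field circuit, and all names in it are hypothetical.

```typescript
// Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1, written without data-dependent branches.
function gfMul(a: number, b: number): number {
  let p = 0;
  for (let i = 0; i < 8; i++) {
    p ^= -(b & 1) & a; // add a when the low bit of b is set
    const hi = (a >> 7) & 1;
    a = (a << 1) & 0xff;
    a ^= -hi & 0x1b; // reduce when x^8 appears
    b >>= 1;
  }
  return p;
}

// a^254 equals a^-1 for nonzero a, and maps 0 to 0 as FIPS 197 requires.
function gfInv(a: number): number {
  let result = 1;
  let base = a;
  for (let e = 254; e > 0; e >>= 1) {
    if (e & 1) result = gfMul(result, base); // the exponent is a public constant
    base = gfMul(base, base);
  }
  return result;
}

function sboxRef(a: number): number {
  const x = gfInv(a);
  const rotl = (v: number, n: number): number => ((v << n) | (v >> (8 - n))) & 0xff;
  return x ^ rotl(x, 1) ^ rotl(x, 2) ^ rotl(x, 3) ^ rotl(x, 4) ^ 0x63;
}
```

A gate test can compare all 256 outputs of the bitsliced circuit against `sboxRef`; for example, `sboxRef(0x00)` is 0x63 and `sboxRef(0x53)` is 0xED, the value used in the FIPS 197 SubBytes illustration.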
The core cipher kernel.
Bit transposition (Käsper-Schwabe §4.1). A two-stage layered
XOR/shuffle. The byte-shuffle pattern
[0,4,8,12, 1,5,9,13, 2,6,10,14, 3,7,11,15] is self-inverse and
represents the 4×4 transpose of an AES state square. The 8×8
bit-matrix transpose uses three delta-swap stages with strides
{4, 2, 1} and masks {0x0F, 0x33, 0x55} (Hacker's Delight §7-2). 92
v128 operations total.
The transpose is its own inverse: transposeIn and transposeOut share
one implementation modulo source/destination offsets.
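The three delta-swap stages can be illustrated on a scalar 8×8 bit matrix held as 8 bytes. The module applies the same pattern across v128 registers; the byte-array version below is a simplified sketch, and `transpose8x8` is a hypothetical name.

```typescript
// Transposes an 8×8 bit matrix: bit c of rows[r] moves to bit r of the result's row c.
function transpose8x8(rows: Uint8Array): Uint8Array {
  const m = rows.slice(0, 8);
  const stages: Array<[number, number]> = [[4, 0x0f], [2, 0x33], [1, 0x55]];
  for (const [stride, mask] of stages) {
    for (let i = 0; i < 8; i++) {
      if (i & stride) continue; // i is the upper row of each (i, i + stride) pair
      const a = m[i];
      const b = m[i + stride];
      // Bits where a's upper sub-block differs from b's lower sub-block.
      const t = ((a >> stride) ^ b) & mask;
      m[i] = a ^ (t << stride);
      m[i + stride] = b ^ t;
    }
  }
  return m;
}
```

Each stage swaps the off-diagonal sub-blocks at one scale (4×4, then 2×2, then 1×1). Applying the function twice returns the original matrix, which is why one implementation can serve both directions.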
ShiftRows as v128 shuffle (§4.3). A single v128.shuffle<i8> per
register; InvShiftRows uses the inverse permutation.
MixColumns (§4.4 and Appendix A). Forward MixColumns expressed via
rl32 / rl64 byte rotations on the bitsliced state. InvMixColumns is
expressed as bit equations directly, denser than forward, but applied
once per round during decrypt.
Key schedule. Unified across Nk ∈ {4, 6, 8} (FIPS 197 §5.2
Algorithm 2). The AES-256 extra-SubWord branch fires when
Nk > 6 && i mod Nk == 4. Both forward keys and EqInvCipher inverse
keys (with InvMixColumns pre-applied to rounds 1..Nr-1) are built at
loadKey() time.
Boyar-Peralta scalar S-box (sboxByte / sboxWord). 113-gate
straight-line program, 32 AND, 81 XOR/XNOR, depth 27. Used by the byte-
level key schedule SubWord step where only 4 bytes need processing (the
bitsliced kernel pays an 8-block transpose tax that is wasted for a
4-byte input).
Round structure. Encrypt and decrypt round loops are parameterized
on Nr (read from NR_BUFFER per call). AddRoundKey is 8 v128 XORs.
SubBytes runs the bitsliced S-box from sbox.ts. ShiftRows is a single
v128.shuffle<i8> per slice. MixColumns runs the bit equations.
encryptBlock_8x and decryptBlock_8x are the primary kernels. The
atomic encryptBlock / decryptBlock wrappers broadcast the single
input across all 8 lanes and discard the duplicates.
The 8-block AES kernel runs over a bitsliced state. After
transposeIn, register state[k] (k ∈ {0..7}) holds bit k of
every byte across all 8 input blocks.
Within state[k], byte position j ∈ {0..15} corresponds to AES
state row r = j / 4, column c = j % 4 (row-major). Within byte
j of state[k], the 8 bits are bit k of that state position
from blocks 0..7.
FIPS 197 §3.4 lays out the AES state as state[r, c] = in[r + 4c],
so bitsliced byte j corresponds to input byte at offset
(j % 4) * 4 + j / 4 within each block.
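The index mapping is small enough to spell out. In the sketch below, `inputOffsetFor` is a hypothetical helper; it reproduces the byte-shuffle pattern quoted elsewhere in this document.

```typescript
// Bitsliced byte position j holds state[r, c] with r = floor(j / 4), c = j % 4,
// and FIPS 197 places state[r, c] at input offset r + 4c.
function inputOffsetFor(j: number): number {
  const row = Math.floor(j / 4);
  const col = j % 4;
  return row + 4 * col;
}

const shufflePattern = Array.from({ length: 16 }, (_, j) => inputOffsetFor(j));
// [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]
```

Because the mapping is a 4×4 transpose, applying it twice gives the identity, so the same shuffle constant works on the way in and on the way out.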
The Käsper-Schwabe §4.1 layout fuses two factors:
- A per-block 4×4 byte transpose, the row-by-row reorder of an AES state square.
- An 8×8 bit-matrix transpose at every byte position.
The two factors operate on orthogonal axes (byte position vs.
bit-position-within-byte), so they commute. Both are involutions, so
a single kernel serves transposeIn and transposeOut.
Byte-shuffle stage. The pattern
[0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15] is
self-inverse; it represents the 4×4 transpose of an AES state square.
Bit-matrix stage. Three delta-swap stages with strides
{4, 2, 1} and masks {0x0F, 0x33, 0x55} (Hacker's Delight §7-2).
Total cost is 92 v128 operations per direction. The kernel replaces the roughly 2050 scalar bit-gathers used in the pre-bitsliced implementation.
encryptBlock and decryptBlock sit on the per-block hot path of
AES-GCM-SIV; sivCtrXform calls them once per counter block. The
8×8 transpose amortises across 8 parallel
blocks. The atomic case skips that amortisation.
transposeIn1 / transposeOut1 work directly on the 16-byte
BLOCK_PT and BLOCK_CT buffers and never touch
BLOCK_PT_8X / BLOCK_CT_8X. The bitsliced register state[k]
still holds bit k of byte j across 8 lanes; the single-block
path populates lane 0 (bit 0 of each state[k] byte) and leaves
lanes 1..7 zero.
The 8x kernel then runs unchanged. It computes AES(0) seven times
in parallel on the dummy lanes and we ignore that output. The
savings are the input-side byte-fill and most of the transpose op
count.
Scalar CBC encrypt and CBC decrypt. Calls encryptBlock /
decryptBlock. The IV chains across calls in CBC_IV_BUFFER.
SIMD CBC decrypt. Batches 8 blocks per iteration through
decryptBlock_8x, falling back to scalar cbcDecryptChunk for the
trailing 1..7 blocks. CBC encryption has no SIMD variant (sequential
chaining).
Scalar CTR mode plus counter management. resetCounter() copies
NONCE_BUFFER to COUNTER_BUFFER. setCounter(hi, lo) writes the
counter as two 64-bit big-endian halves. incrementCounter() is an
inline 128-bit big-endian increment with byte-by-byte carry propagation.
processBlock(ptOff, ctOff, len) encrypts the counter to produce a
keystream block, XORs with len plaintext bytes, increments. Handles
partial final blocks (1..15 bytes).
SIMD CTR mode. Generates 8 keystream blocks per iteration through
encryptBlock_8x, falling back to scalar for the trailing 1..7 blocks.
Standalone GHASH primitive. ghashStart zeroes the accumulator.
ghashAbsorbBlock absorbs 16 bytes via XOR and gf128MulH.
ghashAbsorbWithLen handles full blocks plus a zero-padded tail.
ghashFinalize absorbs the length-encoding block.
GF(2^128) primitive: gf128MulH (multiply running accumulator by H
using the table), gf128InitTable (build the table from H),
mulXGhash (multiply by u, used in the POLYVAL setup), and
byteReverse16 (used in the POLYVAL bridge). Internally also defines
gf128MulU and gf128MulU4 as constant-time helpers.
Composes aes.ts, ghash.ts, and gf128.ts into the GCM construction.
gcmStart derives J0, computes J0E, resets GHASH, absorbs AAD, sets
up the GCTR counter. gcmEncryptChunk runs GCTR src→dst then absorbs
ciphertext into GHASH. gcmAbsorbCtChunk absorbs without decrypting
(for verify-before-decrypt). gcmDecryptChunk runs GCTR src→dst without
absorbing. gcmFinalize absorbs the length block, XORs with J0E, and
stores the tag at TAG_OFFSET.
The 12-byte IV fast path sets J0 = IV || 0x00000001 directly. Any
other length runs a GHASH-based J0 derivation pass over the IV with
zero padding and a length encoding.
POLYVAL (RFC 8452 §3, Appendix A) as a reflection wrapper around GHASH.
polyvalStart(authKeyOff) byte-reverses the auth key, applies
mulXGhash, and feeds the result to gf128InitTable.
polyvalAbsorbBlock and polyvalAbsorbWithLen byte-reverse inputs into
GHASH bit convention before XOR-and-multiply. polyvalFinalize
byte-reverses the accumulator back to POLYVAL byte order.
RFC 8452 single-shot AEAD. Glues the AES kernel, POLYVAL, the SIV-CTR
counter loop (32-bit little-endian counter at bytes 0..3, 12-byte nonce
at 4..15), and the derive_keys construction. AES-128 and AES-256 only;
RFC 8452 §6 excludes AES-192. Plaintext bounded by CHUNK_PT_BUFFER
(64 KiB) per call. sivWipeOnFail is a belt-and-suspenders
zeroing path for the failed-open case.
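The counter layout can be illustrated with a short sketch: a 32-bit little-endian counter occupies bytes 0..3 and wraps modulo 2^32, and bytes 4..15 are never modified. The name incrementCounterLE32 is hypothetical and the sketch works on a Uint8Array, not on linear memory.

```typescript
// Hypothetical helper: increment the little-endian 32-bit counter held in
// bytes 0..3 of a 16-byte block, leaving bytes 4..15 unchanged.
function incrementCounterLE32(block: Uint8Array): void {
  const view = new DataView(block.buffer, block.byteOffset, 16);
  const next = (view.getUint32(0, true) + 1) >>> 0; // wraps modulo 2^32
  view.setUint32(0, next, true);
}
```

Contrast this with the GCM layout, where the 32-bit counter is big-endian and sits at bytes 12..15.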
wipeBuffers() runs 21 memory.fill calls covering every buffer in
buffers.ts. Called from dispose() in the TS wrapper.
buffers.ts
^
|
sbox.ts <── aes.ts ─────────────────────┐
| ^ ^ ^ │
| | | └── cbc.ts │
| | └────── ctr.ts ──────│── ctr_simd.ts
| └────────── cbc_simd.ts │
| │
gf128.ts ─── ghash.ts ─── gcm.ts ───────┘
│ aes-gcm-siv.ts
└── polyval.ts ─────────────────────┘
index.ts (re-exports public API)
| Function | Error | Return |
|---|---|---|
| loadKey(keyLen) | keyLen is not 16, 24, or 32 | -1 |
| encryptChunk(chunkLen) / encryptChunk_simd / decryptChunk / decryptChunk_simd | chunkLen <= 0 or chunkLen > 65536 | -1 |
| cbcEncryptChunk(len) / cbcDecryptChunk / cbcDecryptChunk_simd | len <= 0, len > 65536, or len % 16 !== 0 | -1 |
| gcmStart(ivLen, aadLen) | ivLen < 1, ivLen > 65536, or aadLen > 65536 | -1 |
| gcmEncryptChunk / gcmDecryptChunk / gcmAbsorbCtChunk | len < 0 or 32-bit GCTR counter overflow | -1 |
Note
encryptBlock / decryptBlock, the _8x variants, and the GHASH /
POLYVAL / SIV functions have no error returns. They assume loadKey
was called successfully and the input buffers contain valid data. The
TypeScript wrapper enforces these preconditions.
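A caller working directly against the raw exports might mirror the documented error conditions with pre-checks along these lines. This is a sketch only: the predicate names and MAX_CHUNK are hypothetical, and the bounds are copied from the error conditions listed above, not from the module source.

```typescript
// Assumed ceiling, taken from the documented 65536-byte limits.
const MAX_CHUNK = 65536;

// loadKey: only AES-128/192/256 key lengths are accepted.
function isValidKeyLen(keyLen: number): boolean {
  return keyLen === 16 || keyLen === 24 || keyLen === 32;
}

// encryptChunk / decryptChunk and the _simd variants.
function isValidChunkLen(chunkLen: number): boolean {
  return chunkLen > 0 && chunkLen <= MAX_CHUNK;
}

// CBC chunks must also be a whole number of 16-byte blocks.
function isValidCbcLen(len: number): boolean {
  return isValidChunkLen(len) && len % 16 === 0;
}

// gcmStart: IV of 1..65536 bytes, AAD of at most 65536 bytes.
function isValidGcmStart(ivLen: number, aadLen: number): boolean {
  return ivLen >= 1 && ivLen <= MAX_CHUNK && aadLen >= 0 && aadLen <= MAX_CHUNK;
}
```

Checking before the call lets the caller raise a descriptive error without having to interpret a bare -1.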
| Document | Description |
|---|---|
| index | Project Documentation index |
| asm_imports.md | Per-module AssemblyScript import dependency graphs |
| aes | TypeScript wrapper classes (AES, AESCbc, AESCtr, AESGCM, AESGCMSIV, AESGenerator, AESGCMSIVCipher) |
| aead | Seal, SealStream, OpenStream: use AESGCMSIVCipher as the suite argument |
| ciphersuite | AESGCMSIVCipher reference: format enum, key derivation, commitment binding |
| signing | Sign, SignStream, VerifyStream: scheme-agnostic signing layer |
| signaturesuite | SignatureSuite interface and the shipped suite catalog (ML-DSA, SLH-DSA, Ed25519, ECDSA-P256, hybrids) |
| asm_sha2 | SHA-2 WASM module (used together with AES via Fortuna and HKDF) |
| architecture | Repository structure, build and CI, WASM modules, public API, test suite, and security posture |